  • We don’t currently know exactly why gradient descent works to find powerful, generalizing minima

    But, like, it does

    The minima we can reliably find, in practice, don’t just interpolate the training data. I mean, they do that, but they find compressions which seem to actually represent knowledge, in the sense that they can identify true relationships between concepts which reliably hold outside the training distribution.

    I want to stress: “predict the next token” is what the models are trained to do; it is not what they learn to do. They learn deep representations and learn to deploy those representations in arbitrary contexts. They learn to predict tokens the same way a high-school student learns to fill in scantrons: the scantron is designed so that filling it out requires other, more useful skills.
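
    For concreteness, “trained to predict the next token” just means the training objective is next-token cross-entropy. Here’s a toy version of that objective fit by gradient descent, using a bigram logit table instead of a Transformer; the vocabulary size, data, and learning rate are made up for illustration:

    ```python
    import numpy as np

    # Toy "predict the next token" training: a bigram model (one row of logits
    # per current token) fit with gradient descent on next-token cross-entropy.
    rng = np.random.default_rng(0)
    vocab_size = 8
    tokens = rng.integers(0, vocab_size, size=1000)    # made-up training stream
    logits_table = np.zeros((vocab_size, vocab_size))  # the "model parameters"

    lr = 0.5
    for step in range(200):
        x, y = tokens[:-1], tokens[1:]                 # current token -> next token
        logits = logits_table[x]                       # (N, vocab) predicted logits
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)      # softmax over the vocabulary
        loss = -np.log(probs[np.arange(len(y)), y]).mean()   # next-token cross-entropy

        grad = probs.copy()
        grad[np.arange(len(y)), y] -= 1                # d(loss)/d(logits) = softmax - onehot
        grad /= len(y)
        np.add.at(logits_table, x, -lr * grad)         # gradient descent step

    print(f"final next-token loss: {loss:.3f}")
    ```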

    It’s unclear if gradient descent will continue to work so unreasonably well as we try to push it farther and farther, but so long as the current paradigm holds, I don’t see a huge difference between human inference ability and Transformer inference ability. Number of neurons* and amount of training data seem to be the things holding LLMs back. Humans beat LLMs on both counts, but in some ways LLMs seem to outperform biology in terms of what they can learn with a given quantity of neurons/data. As for the “billions of years” issue, that’s why we are using human-generated data: so they can catch up instead of starting from scratch.

    * By “number of neurons” I really mean something like “expressive power in some universally quantified sense.” Obviously you can’t directly compare perceptrons to biological neurons.


  • These answers seem weird to me. Am I misunderstanding? Here’s the answer that seems obvious to me:

    You need two different matrices because you need an attention coefficient for every single pair of vectors.

    If there are n tokens, then the nth token needs n-1 different attention coefficients (one for each earlier token it attends to). The (n-1)th token needs n-2 coefficients, and so on, down to the 2nd token, which needs only one, and the 1st, which needs zero (it has nothing to attend to).

    That’s ~n^2 coefficients in total. If you compute key and query vectors instead, you only need 2n vectors (one key and one query for each of the n tokens). If the key/query vectors are d-dimensional, that’s 2dn numbers, which is smaller than n^2 whenever the context length n is more than twice the key/query dimension d.

    So using separate vectors is more efficient and more scalable.

    The other answers on this thread seem different, which is surprising to me since this answer feels very straightforward. If I’m missing something, I’d love an explanation.
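
    Here’s the counting argument as a rough sketch in code, assuming plain single-head causal dot-product attention; the sizes and random weights are made up for illustration:

    ```python
    import numpy as np

    n, d_model, d_k = 1024, 512, 64      # context length, model width, key/query width (made up)

    X = np.random.randn(n, d_model)      # one d_model-dim vector per token
    W_Q = np.random.randn(d_model, d_k)  # query projection matrix
    W_K = np.random.randn(d_model, d_k)  # key projection matrix

    Q = X @ W_Q                          # n x d_k query vectors -> n*d_k numbers
    K = X @ W_K                          # n x d_k key vectors   -> n*d_k numbers

    # The full coefficient matrix (~n^2 scores) is only materialized here,
    # derived from just 2*d_k*n numbers (Q and K).
    scores = Q @ K.T / np.sqrt(d_k)      # n x n attention logits

    # Causal mask: token i only attends to tokens j <= i
    # (the count above excluded self-attention, which doesn't change the ~n^2 scaling).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    coeffs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    coeffs /= coeffs.sum(axis=-1, keepdims=True)   # softmax over each row

    print(coeffs.shape)      # (1024, 1024): ~n^2 coefficients
    print(Q.size + K.size)   # 2*d_k*n = 131072, far fewer numbers than n^2 = 1048576
    ```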



  • You can just give an LLM an internal monologue. It’s called a scratchpad.

    I’m not sure how this applies to the broader discussion; honestly, I can’t tell if we’re off-topic. But once you have LLMs you can implement basically everything humans can do. The only limitations I’m aware of that aren’t trivial from an engineering perspective are:

    1. current LLMs mostly aren’t as smart as humans; they literally have fewer neurons and can’t model systems with as much complexity
    2. humans have more complex memory, with a mix of short-term and long-term memory and a fluid process for moving between them
    3. humans can learn on the go; this is equivalent to “online training” and is probably related to long-term memory
    4. humans are multimodal; it’s unclear to what extent this is a “limitation” vs. just a pedantic nitpick, so I’ll let you decide how to account for it
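
    To make the scratchpad idea at the top concrete, here’s a minimal sketch in Python. The generate(prompt) stub and the prompt wording are placeholders, not any particular library’s API; swap in whatever model call you actually use:

    ```python
    def generate(prompt: str) -> str:
        # Stand-in for a real completion call; replace with whatever API you use.
        # This stub just terminates the loop immediately.
        return "FINAL: (stub answer)"

    def answer_with_scratchpad(question: str, max_steps: int = 5) -> str:
        scratchpad = ""
        for _ in range(max_steps):
            prompt = (
                "Question: " + question + "\n"
                "Scratchpad (private reasoning so far):\n" + scratchpad + "\n"
                "Write the next reasoning step, or 'FINAL:' followed by the answer.\n"
            )
            step = generate(prompt)
            if step.strip().startswith("FINAL:"):
                return step.strip()[len("FINAL:"):].strip()
            scratchpad += step + "\n"   # the "internal monologue" accumulates here
        # Out of steps: force an answer using whatever reasoning has accumulated.
        return generate(
            "Question: " + question + "\nScratchpad:\n" + scratchpad + "\nGive the final answer."
        )

    print(answer_with_scratchpad("What is 17 * 24?"))
    ```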