The sample efficiency you mention is an empirical observation, that doesn’t make it not surprising. Why should a single small, noisy, step of gradient descent allow you to immediately memorize the data. I think that’s fundamentally surprising.
No, I still think it’s not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If a NN can memorize entire datasets after a few epoches using ‘a single small noisy step of gradient descent over 1-4 million tokens’ on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it’s good enough to memorize given a few steps, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if a LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)
Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% get memorized at the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like, ‘once every great once in a while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before’.
I appreciate your position, but I don’t think your intuition holds here, for instance biological neural nets very likely use a qualitatively different learning algorithm than back propagation.
Humans memorize things all the time after a single look.
I think what’s going on in humans there is a lot more complex than something like a single SGD step updating some weights. Generally if you do memorize something you replay it in your head consciously several times.
The sample efficiency you mention is an empirical observation, that doesn’t make it not surprising. Why should a single small, noisy, step of gradient descent allow you to immediately memorize the data. I think that’s fundamentally surprising.
No, I still think it’s not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If a NN can memorize entire datasets after a few epoches using ‘a single small noisy step of gradient descent over 1-4 million tokens’ on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it’s good enough to memorize given a few steps, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if a LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)
Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% get memorized at the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like, ‘once every great once in a while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before’.
I appreciate your position, but I don’t think your intuition holds here, for instance biological neural nets very likely use a qualitatively different learning algorithm than back propagation.
I appreciate that it’s possible to find a not-illogical explanation (logical would entail a real proof), but it remains surprising to me.
I think what’s going on in humans there is a lot more complex than something like a single SGD step updating some weights. Generally if you do memorize something you replay it in your head consciously several times.