It’s not surprising at all. The more sample-efficient a model is, the more likely it is to learn a datapoint in a single shot. And tons of previous work has established that large models often are that sample-efficient.
The value of this work is that it shows that what looked like memorized data from a secret training corpus really is memorized data, by checking it against an Internet-wide corpus. Otherwise, it’s very hard to tell if it’s simply a confabulation.
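To make the 'checking against a corpus' idea concrete, here is a minimal sketch of what such a test might look like. It is an illustration only and not the method used in the work under discussion: the function names, the whitespace tokenization, the tiny in-memory corpus, and the window length `n` are all assumptions chosen for the example. A real check against an Internet-scale corpus would need a far more efficient index.

```python
def word_ngrams(text, n):
    """Return the set of all n-word windows in `text` (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus_docs, n):
    """Collect every n-word window from every reference document (hypothetical helper)."""
    index = set()
    for doc in corpus_docs:
        index |= word_ngrams(doc, n)
    return index


def find_verbatim_overlap(sample, index, n):
    """Return the first n-word window of `sample` found verbatim in the index, else None."""
    tokens = sample.split()
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        if window in index:
            return " ".join(window)
    return None


# Toy stand-in for a reference corpus; the window length is an arbitrary choice.
N = 5
corpus = ["call John Smith at the front desk on Monday morning for details"]
index = build_index(corpus, N)

suspected_leak = "please call John Smith at the front desk now"
probable_confabulation = "call Jane Doe at the back office on Friday"

print(find_verbatim_overlap(suspected_leak, index, N))
print(find_verbatim_overlap(probable_confabulation, index, N))
```

A hit only shows that the text exists somewhere in the reference corpus; a miss does not prove confabulation, since the true source may simply not be in the corpus you checked.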
People have been posting screenshots of this stuff on Twitter for ages, but it’s usually been impossible to tell if it was real data or just made up. There are similar issues with extracting prompts: you can ‘extract a prompt’ all you like, but is it the actual prompt? Without some detail like the ‘current date’ timestamp always being correct, it’s hard to tell if what you are getting has anything to do with the actual hidden prompt. (In some cases, it obviously didn’t, because it was telling the model to do impossible things or describing commands/functionality it didn’t have.)
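The timestamp sanity check can be sketched in a few lines. This is a hypothetical illustration: it assumes you have logged each extracted 'prompt' alongside the real date of the session, and that the hidden prompt would state the date in ISO format (YYYY-MM-DD), neither of which is known for any particular system.

```python
from datetime import date


def date_match_rate(extractions):
    """Fraction of extracted prompts that contain the true session date.

    `extractions` is a list of (extracted_text, session_date) pairs;
    the ISO date format is an assumption made for this example.
    """
    if not extractions:
        return 0.0
    hits = sum(1 for text, day in extractions if day.isoformat() in text)
    return hits / len(extractions)


# Made-up example logs, purely for illustration.
logs = [
    ("You are a helpful assistant. Current date: 2023-12-01.", date(2023, 12, 1)),
    ("You are a helpful assistant. Current date: 2021-09-01.", date(2023, 12, 2)),
]
print(date_match_rate(logs))
```

A rate near 1.0 across many sessions on different days is some evidence the extraction tracks the real prompt; a low rate suggests the model is making the text up.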
No, I still think it’s not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If a NN can memorize entire datasets after a few epochs using ‘a single small noisy step of gradient descent over 1-4 million tokens’ on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it’s good enough to memorize given a few steps, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if an LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)
Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% getting memorized in the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like, ‘once in a great while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before’.