@gwern

gwern@alien.top · 11 months ago

No, I still think it’s not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If an NN can memorize entire datasets after a few epochs using ‘a single small noisy step of gradient descent over 1-4 million tokens’ on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it’s good enough to memorize given a few steps, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if an LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)

Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% getting memorized in the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like, ‘every once in a great while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before’.

gwern@alien.top · 1 year ago

It’s not surprising at all. The more sample-efficient a model is, the more it can learn a datapoint in a single shot. And that they are often that sample-efficient has been established by tons of previous work.

The value of this work is that it shows that what looked like memorized data from a secret training corpus really is memorized data, by checking it against an Internet-wide corpus. Otherwise, it’s very hard to tell if it’s simply a confabulation.
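A minimal sketch of that kind of check, assuming a toy in-memory corpus and whitespace tokenization. The helper names and the window length `n` are hypothetical choices for illustration, and a plain set of n-grams stands in for whatever index a real web-scale check would need:

```python
def ngrams(tokens, n):
    """Return the set of all length-n token windows in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus_docs, n):
    """Toy stand-in for a real corpus index: every n-gram seen in the corpus."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc.split(), n)
    return index


def looks_memorized(model_output, index, n):
    """True if any n-token window of the output appears verbatim in the corpus."""
    return any(gram in index for gram in ngrams(model_output.split(), n))


# Made-up data: one "web" document and two candidate model outputs.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
index = build_index(corpus, n=5)
verbatim = "he said the quick brown fox jumps over it"
confabulated = "the fast brown fox leaps over a sleepy dog"
```

Here `looks_memorized(verbatim, index, 5)` finds a shared 5-token window while the confabulated string shares none. A longer window makes accidental matches less likely, at the cost of missing shorter memorized fragments.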

People have been posting screenshots of this stuff on Twitter for ages, but it’s usually been impossible to tell if it was real data or just made-up. Similar issues with extracting prompts: you can ‘extract a prompt’ all you like, but is it the actual prompt? Without some detail like the ‘current date’ timestamp always being correct, it’s hard to tell if what you are getting has anything to do with the actual hidden prompts. (In some cases, it obviously didn’t because it was telling the model to do impossible things or describing commands/functionality it didn’t have.)

gwern@alien.top · 1 year ago

For example, in prompt tuning, we only need to save the tiny trained soft prompt (~very few megabytes), rather than the entire changed model weights (~many, many GBs) on our hard disk/SSD. But from a practical point-of-view, I feel that most people suffer more from a lack of compute (e.g. GPU memory) than from a lack of hard disk space. In other words, it seems that training time and GPU memory consumption are more relevant concerns than saving on checkpoint storage space.

I think you severely overestimate how many people are ever training a LoRA, and underestimate how many are using them (i.e. downloading them). For every person who actually gets their hands dirty training their own LoRA and burning GPU, there’s probably >100 downloading it (often repeatedly) just as part of a set of LoRAs to generate their own images. Look at Civitai or Hugging Face bandwidth usage. It makes a huge difference to the vastly overwhelming majority of people if the checkpoint is 100 gigabytes or 0.001 gigabytes! And if you have to spend a terabyte of disk space to store the handful of mods you want to try out, too…
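To put rough numbers on the size gap, here is a back-of-the-envelope sketch. It assumes the standard LoRA parameterization, where each adapted `d_out x d_in` weight matrix gets two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer width, matrix count, rank, and 2-byte precision below are made-up illustrative values, not figures for any particular model:

```python
def full_params(d_in, d_out, n_matrices):
    """Parameters stored when shipping the full fine-tuned weight matrices."""
    return d_in * d_out * n_matrices


def lora_params(d_in, d_out, n_matrices, rank):
    """Parameters stored by LoRA: two low-rank factors per adapted matrix."""
    return rank * (d_in + d_out) * n_matrices


def size_mb(n_params, bytes_per_param=2):
    """Approximate on-disk size in megabytes (2 bytes per parameter assumed)."""
    return n_params * bytes_per_param / 1e6


# Hypothetical configuration: 64 square matrices of width 4096, rank-8 adapters.
full = full_params(4096, 4096, 64)
lora = lora_params(4096, 4096, 64, rank=8)
print(f"full: {size_mb(full):.0f} MB, LoRA: {size_mb(lora):.1f} MB, ratio: {full // lora}x")
```

For square matrices the ratio works out to `d / (2 * r)`, so the download shrinks by a couple of orders of magnitude at small ranks, and that saving is paid out again on every download.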

gwern@alien.top · 1 year ago

Can I use the trained discriminator to detect anomalous images? I guess the discriminator should mark them as “fake” due to not being prevalent in the dataset?

Generally, no. What a Discriminator learns seems to be weirder than that. It seems to be closer to ‘is this datapoint in the dataset’ (the original dataset, not the distribution). You can look at the ranking of a Discriminator over a dataset and this can be useful for finding datapoints to look at more closely, but it’s weird: https://gwern.net/face#discriminator-ranking
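A minimal sketch of that ranking step, assuming you already have some way to get a scalar score per datapoint from the Discriminator. Here `score_fn` is a hypothetical callable, and the filenames and scores are made up:

```python
def rank_by_discriminator(items, score_fn):
    """Return (score, item) pairs sorted from lowest to highest score."""
    return sorted(((score_fn(x), x) for x in items), key=lambda pair: pair[0])


# Made-up scores standing in for real Discriminator outputs.
toy_scores = {
    "img_001.png": 0.91,
    "img_002.png": 0.12,
    "img_003.png": 0.55,
    "img_004.png": 0.03,
}
ranking = rank_by_discriminator(toy_scores, toy_scores.get)
lowest = [name for _, name in ranking[:2]]    # candidates to inspect by hand
highest = [name for _, name in ranking[-2:]]
```

The two ends of the ranking are just shortlists for manual inspection. A low score is not a verdict that an image is anomalous.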

gwern@alien.top · 1 year ago

GANs learn to generate samples in similar ratios as the original data: if there’s 10% dogs, there will be 10% dogs in the samples. But they don’t work backwards from a dog image to 10%, you might say - they are ‘likelihood-free’. They just generate plausible images. They don’t know how plausible an existing image is.

In theory, a VAE can tell you this and look at a dog image and say ‘10% likelihood’ and look at a weird pseudoimage and say ‘wtf this is like, 0.00000001% likely’, and you could use it to eliminate all your pseudoimages. In practice, they don’t always work that well for outlier detection and seem to be fragile. So, the advantage of VAEs there may be less compelling than it sounds on a slide.
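The ‘in theory’ version looks something like the sketch below. To keep it self-contained, a 1-D Gaussian fit stands in for the VAE’s likelihood estimate (a real VAE would give an approximate bound on the log-likelihood, not an exact density), and the data and the threshold rule are made up for illustration:

```python
import math
import statistics


def fit_gaussian(xs):
    """Fit a 1-D Gaussian and return its log-density function (VAE stand-in)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)

    def log_lik(x):
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

    return log_lik


def flag_outliers(xs, log_lik, threshold):
    """Return the datapoints whose log-likelihood falls below the threshold."""
    return [x for x in xs if log_lik(x) < threshold]


train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
log_lik = fit_gaussian(train)
# Assumed rule: anything less likely than the least likely training point is suspect.
threshold = min(log_lik(x) for x in train)
candidates = [10.05, 9.95, 42.0]
outliers = flag_outliers(candidates, log_lik, threshold)
```

On this toy data only `42.0` gets flagged. The fragility shows up when the density model is a learned one over images, where the likelihood estimates can be much less trustworthy than this clean example suggests.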

gwern@alien.top · 1 year ago

Probably should mention this is from 1992 and doesn’t use CLOS. Is Eoops used in Emacs at all these days? Looks defunct: https://www.emacswiki.org/emacs/EmacsObjectOrientedProgrammingSystem