• zalperst@alien.topB · 1 year ago

    The sample efficiency you mention is an empirical observation; that doesn’t make it unsurprising. Why should a single small, noisy step of gradient descent let you immediately memorize the data? I think that’s fundamentally surprising.
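
    As a concrete toy version of the question (not from the thread; the model and sizes are arbitrary stand-ins), here is a minimal PyTorch sketch of ‘one small, noisy step of gradient descent’ on a single sample, measuring how much that one step alone changes the loss on that exact sample:

    ```python
    # Toy illustration: take ONE SGD step on ONE sample and measure how much
    # the loss on that exact sample drops. Model and sizes are placeholders.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    vocab, dim, ctx = 100, 64, 8
    model = nn.Sequential(
        nn.Embedding(vocab, dim),        # (1, ctx) -> (1, ctx, dim)
        nn.Flatten(),                    # -> (1, ctx * dim)
        nn.Linear(ctx * dim, vocab),     # -> (1, vocab) next-token logits
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randint(0, vocab, (1, ctx))  # one 8-token "context"
    y = torch.randint(0, vocab, (1,))      # its "next token"

    before = loss_fn(model(x), y).item()
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()                             # the single small, noisy step
    after = loss_fn(model(x), y).item()
    print(f"loss on this one sample: {before:.3f} -> {after:.3f}")
    ```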

    • gwern@alien.topB · 1 year ago

      No, I still think it’s not that surprising, even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If an NN can memorize entire datasets after a few epochs using ‘a single small noisy step of gradient descent over 1–4 million tokens’ on each datapoint once per epoch, why is it so surprising that some of this memorization happens in the first epoch? (If a few steps suffice for memorization, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if an LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)

      Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% memorized in the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like, ‘every once in a great while, particularly if a datapoint was very recently seen, or simple, or stereotypical, the model can mostly recall having seen it before’.
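
      For readers curious how estimates like these are produced, here is a hedged sketch of a verbatim-extraction check in the style of the memorization papers, assuming a Hugging Face causal LM (gpt2 is only a stand-in; the prefix length is arbitrary):

      ```python
      # Sketch of a verbatim-extraction test: prompt the model with a prefix
      # taken from a training document and check whether greedy decoding
      # reproduces the original suffix exactly.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")    # stand-in model
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      def looks_memorized(text: str, prefix_len: int = 32) -> bool:
          ids = tok(text, return_tensors="pt").input_ids[0]
          prefix, suffix = ids[:prefix_len], ids[prefix_len:]
          out = model.generate(prefix.unsqueeze(0),
                               max_new_tokens=len(suffix),
                               do_sample=False)      # greedy decoding
          return out[0][prefix_len:].tolist() == suffix.tolist()
      ```

      Reported extraction rates come from running a check like this over many sampled training documents, with each verbatim hit counted as memorization.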

      • zalperst@alien.topB · 1 year ago

        I appreciate your position, but I don’t think your intuition holds here; for instance, biological neural nets very likely use a qualitatively different learning algorithm than backpropagation.

      • zalperst@alien.topB · 1 year ago

        I appreciate that it’s possible to find a not-illogical explanation (a truly logical one would entail an actual proof), but it remains surprising to me.

      • ThirdMover@alien.topB · 1 year ago

        “Humans memorize things all the time after a single look.”

        I think what’s going on in humans there is a lot more complex than a single SGD step updating some weights. Generally, if you do memorize something, you consciously replay it in your head several times.