• UnknownEssence@alien.topB

    If it is truly memorizing the ENTIRE set of training data, then isn’t it lossless data compression that is far more efficient than any known compression algorithm?

    It has to be lossy compression; i.e., it doesn’t remember its ENTIRE training set word for word.
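
    A rough back-of-the-envelope check of that intuition (a sketch only; the parameter and token counts are the unconfirmed GPT-4 figures cited further down the thread, and bytes-per-token is a loose average):

    ```python
    # Can the weights plausibly store the raw corpus losslessly?
    # All numbers are rough, unconfirmed assumptions for illustration.
    params = 1.76e12          # assumed parameter count
    bytes_per_param = 2       # fp16/bf16 storage
    model_bytes = params * bytes_per_param

    tokens = 6.5e12           # assumed training tokens
    bytes_per_token = 4       # ~4 bytes of raw text per token, roughly
    corpus_bytes = tokens * bytes_per_token

    print(f"weights: {model_bytes / 1e12:.1f} TB, corpus: {corpus_bytes / 1e12:.1f} TB")
    print(f"corpus is ~{corpus_bytes / model_bytes:.1f}x larger than the weights")
    ```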

  • Zondartul@alien.topB

    The point of the paper is that LLMs memorize an insane amount of training data and, with some massaging, can be made to output it verbatim. If that training data has PII (personally identifiable information), you’re in trouble.

    Another big takeaway is that training for more epochs leads to more memorization.
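
    The underlying memorization test is conceptually simple: prompt the model with a prefix taken from a candidate document and check whether greedy decoding reproduces the original suffix verbatim. A minimal sketch (the paper’s actual attack is more involved; the model name and split lengths here are placeholders):

    ```python
    # Sketch of a verbatim-memorization check: does greedy decoding from a
    # training-text prefix reproduce the original suffix exactly?
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def is_memorized(text: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
        ids = tok(text, return_tensors="pt").input_ids[0]
        if len(ids) < prefix_len + suffix_len:
            return False
        prefix = ids[:prefix_len].unsqueeze(0)
        target = ids[prefix_len:prefix_len + suffix_len]
        out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
        gen = out[0][prefix_len:]
        return len(gen) >= suffix_len and bool((gen[:suffix_len] == target).all())
    ```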

    • Mandelmus100@alien.topB

      Another big takeaway is that training for more epochs leads to more memorization.

      Should be expected. It’s overfitting.

      • FaceDeer@alien.topB

        Indeed. Just as with training humans to be smart, rote memorization sometimes happens but is generally not the goal. Research like this helps us avoid it better in the future.

      • n_girard@alien.topB

        Hopefully I’m not being off-topic here, but a recent paper suggested that repeating a requirement several times within the same instructions leads the model to be more compliant with it.

        Do you know whether that’s true or well grounded?

        Thanks in advance.

        • DigThatData@alien.topB

          It’s possible to “overfit” to a subset of the data. Generalization error going up is a symptom of “overfitting” to the entire dataset. Memorization is functionally equivalent to locally overfitting, i.e. generalization error going up in a specific neighborhood of the data. You can have a global reduction in generalization error while also having neighborhoods where generalization gets worse.
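
          To make “generalization error going up in a specific neighborhood” concrete, one can cluster the held-out data and track validation loss per cluster rather than only the global average. A toy, fully synthetic sketch:

          ```python
          # Toy illustration: global validation loss improves while one
          # neighborhood (cluster) of the data gets worse. Synthetic numbers.
          import numpy as np
          from sklearn.cluster import KMeans

          rng = np.random.default_rng(0)
          val_x = rng.normal(size=(2000, 8))                 # held-out inputs
          cluster = KMeans(n_clusters=10, n_init=10).fit_predict(val_x)

          loss_early = rng.uniform(2.0, 3.0, size=2000)      # per-example val loss
          loss_late = loss_early - 0.3                       # global improvement...
          loss_late[cluster == 3] += 0.8                     # ...but cluster 3 degrades

          print("global:", loss_early.mean(), "->", loss_late.mean())
          worse = [c for c in range(10)
                   if loss_late[cluster == c].mean() > loss_early[cluster == c].mean()]
          print("neighborhoods that got worse (locally overfit):", worse)
          ```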

          • seraphius@alien.topB

            On most tasks, memorization would be overfitting, but I think one would see that “overfitting” is task/generalization dependent. As long as accurate predictions are being made for new data, it doesn’t matter that it can cough up the old.

          • Hostilis_@alien.topB

            Memorization is functionally equivalent to locally overfitting.

            Uh, no it is not. Memorization and overfitting are not the same thing. You are certainly capable of memorizing things without degrading your generalization performance (I hope).

    • oldjar7@alien.topB

      How is that a problem? The entire point of training is to memorize and generalize the training data.

      • narex456@alien.topB

        Learning English is not simply memorizing a billion sample sentences.

        The problem is that we want it to learn to string words together for itself, not regurgitate words which already appear in the training set in that order.

        This paper attempts to tackle the difficult problem of detecting how much of an LLM’s success is due to rote memorization.

        Maybe more importantly: how much parameter space/ training resources are wasted on this?

    • HateRedditCantQuitit@alien.topB

      The point isn’t just that they memorize a ton. It’s also that current alignment efforts that purport to prevent regurgitation fail.

    • Seankala@alien.topB

      Nothing about this is novel though; the fact that language models are able to uncover sensitive training information has been a thing for a while now.

  • blimpyway@alien.topB

    It is not about being able to search for relevant data when prompted with a question.

    The amazing thing is that they seem to understand the question well enough that the answer is both concise and meaningful.

    That’s what folks downplaying it as “a glorified autocomplete” are missing.

    PS: and those philosophizing that it can’t actually understand the question are also missing the point: nobody cares, as long as its answers are sufficiently correct and meaningful, as if it really did understand the question.

    It mimics understanding well enough.

    • squareOfTwo@alien.topB

      These things don’t “understand”. Ask one something that is too much OOD and you get wrong answers, even when a human would give the correct answer according to the training set.

      • blimpyway@alien.topB

        I said they mimic understanding well enough; that wasn’t a claim that LLMs actually understand.

        Sure, training dataset limits apply.

        And sure, they very likely fail when the question is OOD. But figuring out that a question is OOD isn’t that hard, so an honest “Sorry, your question is way too OOD” answer (instead of hallucinating) shouldn’t be too difficult to implement.
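
        Whether flagging OOD prompts is really “not that hard” is debatable, but the baseline version of the idea is just a confidence gate in front of the model, e.g. score the prompt by its perplexity and abstain above a threshold. A minimal sketch (model name and threshold are placeholders; real OOD detection is considerably harder than a single cutoff):

        ```python
        # Sketch of a perplexity-based OOD gate: abstain instead of answering
        # when the prompt looks too unlike anything the model was trained on.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
        model = AutoModelForCausalLM.from_pretrained("gpt2")
        PPL_THRESHOLD = 200.0   # placeholder; calibrate on in-distribution prompts

        def prompt_perplexity(prompt: str) -> float:
            ids = tok(prompt, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss    # mean token cross-entropy
            return float(torch.exp(loss))

        def answer_or_abstain(prompt: str) -> str:
            if prompt_perplexity(prompt) > PPL_THRESHOLD:
                return "Sorry, your question is way too OOD."
            ids = tok(prompt, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=50, do_sample=False)
            return tok.decode(out[0], skip_special_tokens=True)
        ```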

  • zalperst@alien.topB

    It’s extremely surprising, given that many instances of data are seen only once or very few times by the model during training.

    • cegras@alien.topB

      What is the size of ChatGPT or the biggest LLMs compared to the dataset? (Not being rhetorical, genuinely curious)

      • StartledWatermelon@alien.topB

        GPT-4: 1.76 trillion parameters, about 6.5* trillion tokens in the dataset.

        * Could be twice that; the leaks weren’t crystal clear. The above number is more likely, though.
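
        Taking those unconfirmed figures at face value, the dataset-to-model ratio is easy to work out:

        ```python
        # Tokens per parameter for the (unconfirmed) GPT-4 figures above.
        params, tokens = 1.76e12, 6.5e12
        print(f"~{tokens / params:.1f} training tokens per parameter")  # ~3.7
        ```
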
    • gwern@alien.topB

      It’s not surprising at all. The more sample-efficient a model is, the better it can learn a datapoint in a single shot. And that they are often that sample-efficient has been established by tons of previous work.

      The value of this work is that it shows that what looked like memorized data from a secret training corpus really is memorized data, by checking it against an Internet-wide corpus. Otherwise, it’s very hard to tell whether it’s simply a confabulation.

      People have been posting screenshots of this stuff on Twitter for ages, but it’s usually been impossible to tell if it was real data or just made-up. Similar issues with extracting prompts: you can ‘extract a prompt’ all you like, but is it the actual prompt? Without some detail like the ‘current date’ timestamp always being correct, it’s hard to tell if what you are getting has anything to do with the actual hidden prompts. (In some cases, it obviously didn’t because it was telling the model to do impossible things or describing commands/functionality it didn’t have.)
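
      That verification step can be sketched as an exact-substring lookup against a large reference corpus: only count an output as memorization if a long span of it appears verbatim in the corpus index. A toy version using hashed token shingles (the real check runs at Internet scale; every size and design choice here is illustrative):

      ```python
      # Toy check of model outputs against a reference corpus: index the corpus
      # by hashed 50-token shingles, then flag outputs sharing any shingle.
      from hashlib import blake2b

      SHINGLE = 50  # tokens per shingle; real checks use long verbatim spans

      def shingles(tokens, n=SHINGLE):
          for i in range(len(tokens) - n + 1):
              yield blake2b(" ".join(tokens[i:i + n]).encode(), digest_size=8).digest()

      def build_index(corpus_docs):
          index = set()
          for doc in corpus_docs:
              index.update(shingles(doc.split()))
          return index

      def looks_memorized(model_output, index):
          return any(s in index for s in shingles(model_output.split()))
      ```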

      • zalperst@alien.topB

        The sample efficiency you mention is an empirical observation; that doesn’t make it unsurprising. Why should a single small, noisy step of gradient descent allow the model to immediately memorize the data? I think that’s fundamentally surprising.

        • gwern@alien.topB

          No, I still think it’s not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If an NN can memorize entire datasets after a few epochs using ‘a single small noisy step of gradient descent over 1-4 million tokens’ on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it’s good enough to memorize given a few steps, then you’re just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if an LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn’t take up much ‘space’ inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a ‘small’ step covers a huge amount of data.)

          Plus, you are overegging the description: it’s not like it’s memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% gets memorized in the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it’s more like: every once in a great while, particularly if a datapoint was very recently seen, or is simple or stereotypical, the model can mostly recall having seen it before.

          • ThirdMover@alien.topB

            Humans memorize things all the time after a single look.

            I think what’s going on in humans there is a lot more complex than something like a single SGD step updating some weights. Generally if you do memorize something you replay it in your head consciously several times.

          • zalperst@alien.topB

            I appreciate your position, but I don’t think your intuition holds here; for instance, biological neural nets very likely use a qualitatively different learning algorithm than backpropagation.

          • zalperst@alien.topB

            I appreciate that it’s possible to find a not-illogical explanation (logical would entail a real proof), but it remains surprising to me.

  • exomni@alien.topB

    The operative word here is “just”. The models are so large and the training is such that of course one of the things they are likely doing is memorizing the corpus; but they aren’t “just” memorizing the corpus: there is some amount of regularization in place to allow the system to exhibit more generative outputs and behaviors as well.