I loved Andrej’s “Busy person’s intro to Large Language Models” talk, so I decided to create a reading list to dive deeper into a lot of the topics. I feel like he did a great job of describing the state of the art for anyone from an ML researcher to any engineer who is interested in learning more.

The full talk can be found here: https://youtu.be/zjkBMFhNj_g?si=fPvPyOVmV-FCTFEx

Here’s the reading list: https://blog.oxen.ai/reading-list-for-andrej-karpathys-intro-to-large-language-models-video/

Let me know if you have any other papers you would add!

  • um-xpto@alien.topB
    1 year ago

    Nice! Thank you for your work.

    Regarding the video.

    Q1) Minute 14:14, Finetuning into an Assistant: when you have multiple tasks/datasets with diverse outputs, how is training performed? Are all the datasets combined in a single training run? Is fine-tuning done on top of a previous fine-tune? Or is the question parsed and sent to a specific model?

    Q2) Minute 27:43, Tool Use (Browser, Calculator, etc.): does anyone have links to similar implementations for Llama, how it is done, or what kind of tech/frameworks are used?

    • Disastrous_Elk_6375@alien.topB
      1 year ago

      Q2) Minute 27:43, Tool Use (Browser, Calculator, etc.): does anyone have links to similar implementations for Llama, how it is done, or what kind of tech/frameworks are used?

      The naive way is to use LangChain, but that’s hit and miss for several reasons, and whatever you build will be held together by duct tape and prayers. Alternative frameworks include Haystack and Griptape.

      I’ve found that for local models the best tool usage comes from an advanced control library. This gives you a lot of flexibility in organising the prompts and in “helping” the local models along. Guidance and LMQL are two such libraries.
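
      To make the tool-use loop concrete, here is a minimal sketch in plain Python. It does not use the Guidance or LMQL APIs; the `CALL tool: argument` text format, the `TOOLS` registry, and the `fake_model` stand-in are all hypothetical names chosen for illustration. The point is only the shape of the loop: the model emits a structured tool request, the host code runs the tool, and the result is fed back into the prompt.

```python
import ast
import operator
import re

# Hypothetical tool-call format that the prompt would ask the model to follow:
#   CALL calculator: 12 * (3 + 4)
TOOL_CALL = re.compile(r"^CALL (\w+): (.+)$")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(walk(ast.parse(expression, mode="eval").body))


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"calculator": calculator}


def fake_model(prompt: str) -> str:
    """Stand-in for a local LLM; a real model would generate this text."""
    if "RESULT: " in prompt:
        return "The answer is " + prompt.rsplit("RESULT: ", 1)[1]
    return "CALL calculator: 12 * (3 + 4)"


def run_with_tools(question: str, model=fake_model, max_steps: int = 3) -> str:
    prompt = question
    reply = ""
    for _ in range(max_steps):
        reply = model(prompt).strip()
        match = TOOL_CALL.match(reply)
        if match is None:
            return reply  # plain answer, no tool requested
        name, arg = match.groups()
        if name in TOOLS:
            result = TOOLS[name](arg)
        else:
            result = f"unknown tool {name!r}"
        # Feed the tool output back so the model can finish its answer.
        prompt = f"{prompt}\n{reply}\nRESULT: {result}"
    return reply


print(run_with_tools("What is 12 * (3 + 4)?"))
```

      A control library would additionally constrain generation so the model can only produce text matching the tool-call format; in this sketch that guarantee is replaced by a regex check after the fact.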

    • FallMindless3563@alien.topOPB
      1 year ago

      You certainly can combine all the tasks and datasets into a single instruction fine-tuning dataset. You would then have a separate dataset for the reinforcement learning half, where the model learns human preferences.
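
      As a rough sketch of what "combining" can look like in practice: each task's records are mapped into one shared instruction/input/output schema and then shuffled together. The three toy datasets, the field names, and the prompt template below are all made up for illustration, not taken from the talk or from any particular library.

```python
import random
from typing import Dict, List

# Hypothetical per-task datasets, each with its own native schema.
summarization = [{"article": "The cat sat on the mat all day.", "summary": "A cat sat."}]
translation = [{"en": "Good morning", "fr": "Bonjour"}]
qa = [{"question": "What is 2 + 2?", "answer": "4"}]

# Preference data for the RL stage stays in a separate dataset (hypothetical schema).
preferences = [{"prompt": "Say hi.", "chosen": "Hello!", "rejected": "No."}]


def to_instruction(instruction: str, inp: str, output: str) -> Dict[str, str]:
    """Normalize one record into the shared instruction-tuning schema."""
    return {"instruction": instruction, "input": inp, "output": output}


def build_sft_mixture(seed: int = 0) -> List[Dict[str, str]]:
    """Map every task into the shared schema, then shuffle them together."""
    examples = []
    examples += [to_instruction("Summarize the text.", r["article"], r["summary"])
                 for r in summarization]
    examples += [to_instruction("Translate from English to French.", r["en"], r["fr"])
                 for r in translation]
    examples += [to_instruction("Answer the question.", r["question"], r["answer"])
                 for r in qa]
    # Shuffling interleaves the tasks so each batch sees a mix of them.
    random.Random(seed).shuffle(examples)
    return examples


def format_prompt(example: Dict[str, str]) -> str:
    """Render one example with a hypothetical prompt template."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )


mixture = build_sft_mixture()
print(format_prompt(mixture[0]))
```

      Real mixtures usually also weight or cap the tasks so that one large dataset does not drown out the others; that is left out here to keep the sketch short.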

    • FallMindless3563@alien.topOPB
      1 year ago

      The only book he explicitly mentions is “Thinking, Fast and Slow” by Daniel Kahneman, but I think there are a ton of books that would be great resources alongside the papers. I just happened to pull a lot of the papers from the footnotes and concepts he mentioned.

  • Maykey@alien.topB
    1 year ago

    I haven’t watched the talk, but I think the reading list should have some love for SSMs (S4, S5, H3): on the one hand their variants are very prominent on the Long Range Arena, on the other they are relatively “unknown”.

    They are not unknown to researchers, seeing how many variants there are, but there are hundreds more videos and blogs explaining transformers. If you find a course about LLMs, it will likely include Transformers but not SSMs, so I think their success on LRA and absence from learning materials qualifies them for a “dive in deeper” list.

  • coumineol@alien.topB
    1 year ago

    Thanks, but here’s the problem with this list: most of the papers mentioned are at a very high technical level, and the people who would be able to understand them have probably already read them. Note that Andrej was careful to keep the material at a certain level because he addresses those who want to go one step further than talking to ChatGPT, without necessarily understanding all the underlying theory.

    • teryret@alien.topB
      1 year ago

      Right, that’s why OP prefaced with “to dive deeper into a lot of the topics”. If folks aren’t at a point where diving deeper makes sense, it’s not a list for them. There are plenty of resources for any given level of understanding; obviously no list is going to be appropriate for every member of a diverse community.

      • coumineol@alien.topB
        1 year ago

        Not to start an argument here, but I can’t imagine anybody with any level of understanding who should start diving deeper by reading the “Attention is All You Need” paper. Yes, this is a diverse community, but when you try to address everybody’s needs, you usually end up addressing nobody’s needs.

        • eek04@alien.topB
          1 year ago

          Since “Attention is All You Need” is fairly high on my reading list for understanding the details of transformer architecture, what do you recommend instead?

        • whymauri@alien.topB
          1 year ago

          Just me, but I think of busy coworkers with great background in math/stats and ‘classic’ ML who would ramp up quickly from a list like this. When I onboarded chemists (PhDs) to my ML team at a drug startup, I would send them a similarly dense reading list. With their strong background in physics, it would take them two weeks flat to understand the necessary theory and jargon to be productive (in our niche field).

          • coumineol@alien.topB
            1 year ago

            Didn’t mean to say those papers are completely useless, but even for those with a strong Math/ML background I would advise starting with recent survey papers. Reading “Attention is All You Need” is kind of like reading the General Relativity papers of Einstein - cool as a historical curiosity, but not ideal for optimizing expertise acquisition.