So Mistral-7B is a pretty impressive 7B-param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions, or at merges to scale up the # of params?

  • FPham@alien.top
    1 year ago

    It’s simply the time bonus - coming after all the big models.

    - better filtering - kill outright junk

    - you can use the existing big models (OpenAI and Llama) for data tuning and filtering

    - use available synthetic data
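
    To make the "kill outright junk" point concrete, here is a minimal sketch of a heuristic pre-filter. Nothing here is known to be what Mistral actually did; `looks_like_junk`, `filter_corpus`, and all thresholds are hypothetical names and values picked purely for illustration.

```python
# Hypothetical heuristic junk filter. The function names and thresholds are
# illustrative assumptions, not a description of any real training pipeline.

def looks_like_junk(
    text: str,
    min_words: int = 20,
    max_symbol_ratio: float = 0.3,
    max_dup_line_ratio: float = 0.5,
) -> bool:
    """Return True if a document trips one of three cheap junk heuristics."""
    words = text.split()
    # 1) Too short to carry much signal.
    if len(words) < min_words:
        return True

    # 2) Mostly punctuation or symbols (menus, ASCII art, markup debris).
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    if symbols / len(non_space) > max_symbol_ratio:
        return True

    # 3) Mostly repeated lines (boilerplate, spam).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    dup_ratio = 1 - len(set(lines)) / len(lines)
    if dup_ratio > max_dup_line_ratio:
        return True

    return False


def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only the documents that pass the heuristics above."""
    return [d for d in docs if not looks_like_junk(d)]


good = "The quick brown fox jumps over the lazy dog. " * 5
spam = "\n".join(["Subscribe to our newsletter for more great content today"] * 10)
noise = "$$$ ### !!! " * 20
kept = filter_corpus([good, spam, noise, "click here"])
```

    A real pipeline would presumably layer model-based scoring (using the bigger models mentioned above) on top of cheap rules like these; that part is not sketched here.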

  • Feztopia@alien.top
    1 year ago

    As far as I know (I might be wrong) it’s partly the team that made llama1 (and maybe made the first steps for llama2?). So they already knew what they were doing, how llama could be improved*, and so on.

    *The dataset

  • Nkingsy@alien.top
    1 year ago

    Trained on a larger # of tokens. All the Llama models appear to be undertrained, especially the 70B.
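
    A rough back-of-the-envelope sketch of the tokens-per-parameter argument. The numbers are assumptions brought in from outside this thread: Llama 2 models were reportedly pretrained on about 2 trillion tokens each, and the Chinchilla rule of thumb is roughly 20 training tokens per parameter. Mistral's own token count is not public, so it is left out.

```python
# Tokens-per-parameter comparison for the Llama 2 family.
# Assumptions (not from the thread): ~2T pretraining tokens for every
# Llama 2 size, and ~20 tokens per parameter as the Chinchilla rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20
LLAMA2_TOKENS = 2e12

LLAMA2_SIZES = {
    "llama2-7b": 7e9,
    "llama2-13b": 13e9,
    "llama2-70b": 70e9,
}


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


ratios = {
    name: tokens_per_param(LLAMA2_TOKENS, params)
    for name, params in LLAMA2_SIZES.items()
}

for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param ({multiple:.1f}x the rule of thumb)")
```

    Under these assumptions the 70B sees by far the fewest tokens per parameter (about 29, versus about 286 for the 7B), which is one way to read "especially the 70B".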

  • meetrais@alien.top
    1 year ago

    I second this. Mistral-7B gave me good results. After fine-tuning, its results are even better.

    • PwanaZana@alien.top
      1 year ago

      Are there notable finetunes to your knowledge? I’ve started using LLMs today, starting with openorca mistral 7B and it seems pretty good.

      • meetrais@alien.top
        1 year ago

        On HuggingFace you can find many fine-tuned/quantized models. Look for models from TheBloke on HuggingFace.

    • kaszebe@alien.top
      1 year ago

      “Mistral-7B gave me good results”

      Can you expand upon that? Do you mean in terms of its ability to write at a college level without major grammatical errors?

  • involviert@alien.top
    1 year ago

    I assume the progress is based on well-structured, high-quality training data, combined with an incremental “learning schedule”. At least that’s where some reports of massive progress seem to be coming from, and it’s also very intuitive that this would help a lot.

  • Charuru@alien.top
    1 year ago

    The results are okay, but I’m hard-pressed to call it “very capable”. My perspective on it is that other bigger models are making mistakes they shouldn’t be making because they were “trained wrong”.

    • Monkey_1505@alien.top
      1 year ago

      Knowledge is a strange goal for any model when we have the internet. IMO. Just connect your model to a web search.

  • Dorialexandre@alien.top
    1 year ago

    My current hunch is that they use a lot of online resources that are not easily accessible (including a specific archive owned by someone named Anna).

  • Technical_Spirit_622@alien.top
    1 year ago

    Is there any version of Mistral or Llama 2 with RLHF applied that can do text summarisation without the censorship? Sometimes the output is totally different from what one would expect given the input sentences, even if I state in the prompt to avoid applying censorship and to focus on the input.

  • qubedView@alien.top
    1 year ago

    Do people find that it holds up in use? Or are we mostly going on benchmarks? I’m skeptical of benchmarks, and a highly performant 7B model would be of great use.

  • obeymypropaganda@alien.top
    1 year ago

    They matched parameters and tokens when training.

    The “No Priors” podcast on Spotify has an episode with the CEO of Mistral, who discusses this.