If I have multiple 7B models, where each model is trained on one specific topic (e.g. roleplay, math, coding, history, politics…), and an interface which decides, depending on the context, which model to use, could this outperform bigger models while being faster?
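For illustration, a minimal sketch of such a router: a small dispatcher that picks a specialist model based on the prompt and forwards the request. All of the endpoints, keyword lists, and the keyword-matching classifier below are hypothetical placeholders; a real setup would more likely use llama.cpp/vLLM servers behind OpenAI-compatible endpoints and a small learned classifier for routing.

```python
# Sketch of a topic router in front of several specialist 7B models.
# Endpoints, keywords, and response format are assumptions, not a real API.
import requests

MODEL_ENDPOINTS = {
    "coding":   "http://localhost:8001/v1/completions",
    "math":     "http://localhost:8002/v1/completions",
    "roleplay": "http://localhost:8003/v1/completions",
    "general":  "http://localhost:8000/v1/completions",  # fallback model
}

KEYWORDS = {
    "coding":   ["python", "code", "function", "bug", "compile"],
    "math":     ["integral", "equation", "prove", "derivative"],
    "roleplay": ["character", "story", "pretend", "role"],
}

def classify_topic(prompt: str) -> str:
    """Naive router: pick the topic whose keywords appear most often."""
    text = prompt.lower()
    scores = {t: sum(text.count(k) for k in kws) for t, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def generate(prompt: str) -> str:
    topic = classify_topic(prompt)
    resp = requests.post(MODEL_ENDPOINTS[topic],
                         json={"prompt": prompt, "max_tokens": 256})
    # Assumes an OpenAI-compatible completions response shape.
    return resp.json()["choices"][0]["text"]
```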

  • remghoost7@alien.topB · 1 year ago

    I believe this is what GPT-4 actually is.

    I remember reading somewhere that it’s actually a mix of 8 different models and it directs your question depending on the context of it.

    Would be neat to implement on a local level though. Haven’t seen many people on the local side talk about doing this.

    • FullOf_Bad_Ideas@alien.topB · 1 year ago

      Jondurbin made something like this with QLoRA.

      The explanation that GPT-4 is an MoE model doesn’t make sense to me. The GPT-4 API is 30x more expensive than GPT-3.5-Turbo. GPT-3.5-Turbo is 175B parameters, right? Since an MoE only runs one or a couple of experts per token, if they had 8 experts of 220B each it wouldn’t need to cost 30x more; it would be maybe 20-50% more for API use. There was also some speculation that 3.5-Turbo is 22B. In that case it also doesn’t make sense to me that it would be 30x as expensive.
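      For what it’s worth, a rough back-of-envelope version of that arithmetic (every number here is a rumor or an assumption, and it assumes per-token cost tracks the parameters that are actually active, with top-1 routing):

      ```python
      # All figures are rumored/assumed; in an MoE only the routed expert(s)
      # run per token, so cost scales with *active* parameters, not the total.
      dense_params   = 175e9   # assumed GPT-3.5 size cited above
      expert_params  = 220e9   # rumored per-expert size for GPT-4
      active_experts = 1       # assume top-1 routing

      active_params = active_experts * expert_params
      print(active_params / dense_params)   # ~1.26 -> roughly 25% more, not 30x
      ```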

      • AutomataManifold@alien.topB · 1 year ago

        Just to note: don’t read too much into OpenAI’s prices. They’re deliberately losing money as a market-capturing strategy, so it’s not guaranteed that there’s a linear relationship between what they charge for a given service and what their actual costs are.

      • Cradawx@alien.topB · 1 year ago

        No, several sources, including Microsoft, have said GPT-3.5 Turbo is 20B. GPT-3 was 175B, and GPT-3.5 Turbo was about 10x cheaper on the API than GPT-3 when it came out, so it makes sense.

    • feynmanatom@alien.topB · 1 year ago

      Lots of rumors, but tbh I think it’s highly unlikely they’re using an MoE. MoEs pay off at batch size = 1, where you can exploit the sparsity and skip the inactive experts, but not at larger batch sizes, where different requests get routed to different experts so most of them end up active anyway. You would need so much RAM and would miss out on the point of using an MoE.
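      A toy illustration of that point, assuming a uniform random top-2 router over 8 experts (real routers are learned, but the trend is the same): at batch size 1 only two experts run, while even a modest batch touches nearly all of them, so every expert’s weights have to stay resident.

      ```python
      # Toy MoE routing: count how many distinct experts a batch touches.
      # Uniform random top-2 routing over 8 experts is assumed for illustration.
      import random

      NUM_EXPERTS, TOP_K = 8, 2

      def experts_touched(num_tokens: int) -> int:
          hit = set()
          for _ in range(num_tokens):
              hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
          return len(hit)

      for batch in (1, 4, 16, 64):
          print(batch, experts_touched(batch))  # 1 -> 2 experts; 64 -> almost always all 8
      ```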

      • remghoost7@alien.topB · 1 year ago

        Lots of rumors…

        Very true.

        We honestly have no clue what’s going on behind ClosedAI’s doors.

        I don’t know enough about MoEs to say one way or the other, so I’ll take your word on it. I’ll have to do more research on them.