If I have multiple 7B models, each trained on one specific topic (e.g. roleplay, math, coding, history, politics…), and an interface that decides from the context which model to use, could this outperform bigger models while being faster?

  • feynmanatom@alien.topB
    1 year ago

    This might be pedantic, but this is a field with so much random vocabulary and it’s better for folks to not be confused.

    MoE is slightly different. An MoE is a single LLM with gating layers that “select” which expert sub-layers each token’s embedding is routed to. It’s pretty difficult to scale and serve in practice.
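
    To make the gating idea concrete, here is a toy sketch of top-k routing inside one MoE layer. Everything in it (`top_k_gate`, `moe_layer`, the three lambda “experts”, the gate weights) is made up for illustration; real MoE layers use learned feed-forward experts and batched tensor ops.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a small list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, experts, gate_weights, k=2):
    """Route one token vector through its top-k experts and mix the outputs."""
    # One gate logit per expert: dot product of that expert's gate row and the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    out = [0.0] * len(token)
    for idx, weight in top_k_gate(logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Hypothetical toy experts and gate weights (assumptions, not a real model).
experts = [
    lambda t: [2 * x for x in t],
    lambda t: [x + 1 for x in t],
    lambda t: [-x for x in t],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

mixed = moe_layer([3.0, 1.0], experts, gate_weights, k=2)
```

    The point is that the selection happens per token, inside a single model, which is different from picking one whole model per prompt.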

    I think what you’re referring to is more like a model router. You can use a general LLM to “classify” a prompt and then route the entire prompt to a downstream LLM. It’s unclear if this would be faster than a 70B LLM, since you would run the encoding phase twice (once in the router, once in the downstream model) and the router also has to generate some tokens, but it could certainly give better answers.
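
    A rough sketch of that router pattern, with heavy assumptions: `classify_topic` is a keyword matcher standing in for the general “classifier” LLM, and the specialists are stub functions rather than real 7B models. All names here are hypothetical.

```python
from typing import Callable, Dict

Model = Callable[[str], str]

def classify_topic(prompt: str) -> str:
    """Stand-in for the router LLM: a crude keyword match (assumption)."""
    keywords = {
        "math": ("integral", "equation", "prove"),
        "coding": ("python", "function", "bug"),
    }
    lowered = prompt.lower()
    for topic, words in keywords.items():
        if any(word in lowered for word in words):
            return topic
    return "general"

def make_stub(name: str) -> Model:
    # Placeholder for a real specialist model; it just tags the prompt.
    return lambda prompt: f"[{name}] {prompt}"

SPECIALISTS: Dict[str, Model] = {
    "math": make_stub("math-7b"),
    "coding": make_stub("coding-7b"),
    "general": make_stub("general-7b"),
}

def route(prompt: str,
          specialists: Dict[str, Model] = SPECIALISTS,
          classify: Callable[[str], str] = classify_topic) -> str:
    """Classify the prompt, then hand the ENTIRE prompt to one specialist."""
    topic = classify(prompt)
    model = specialists.get(topic, specialists["general"])
    return model(prompt)

answer = route("Fix the bug in this Python function")
```

    Note that the downstream model receives the raw prompt and has to encode it from scratch, which is where the extra latency comes from.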

    • wishtrepreneur@alien.topB
      1 year ago

      You can use a general LLM to “classify” a prompt and then route the entire prompt to a downstream LLM.

      Why can’t you just train the “router” LLM on which downstream LLM to use and pass its activations to the downstream LLMs? Can’t you have “headless” (without an encoding layer) downstream LLMs? Then inference could use a (6.5B+6.5B)-parameter model with the generalizability of a 70B model.

      • feynmanatom@alien.topB
        1 year ago

        Hmm, not sure if I track what an encoding layer is? The encoding phase involves filling the KV cache across the depth of the model. I don’t think there’s an activation you could just pass across without model surgery + additional fine tuning.
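
        For a sense of what “filling the KV cache across the depth of the model” means in numbers, here is a back-of-the-envelope sketch. The config values (32 layers, 32 KV heads, head dimension 128, fp16) are an assumption picked to resemble a typical 7B model, and `kv_cache_bytes` is a made-up helper name.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 7B-like config in fp16 (2 bytes per value); not taken from a specific model.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
prompt_1k = kv_cache_bytes(32, 32, 128, seq_len=1024)

print(per_token // 1024, "KiB per token")
print(prompt_1k // 1024**2, "MiB for a 1024-token prompt")
```

        Every layer of each model builds its own keys and values from its own weights, so a cache (or any intermediate activation) from the router doesn’t slot into a different downstream model as-is.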