If I have multiple 7B models, each trained on one specific topic (e.g. roleplay, math, coding, history, politics…), and an interface that decides, depending on the context, which model to use, could this outperform bigger models while being faster?
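For concreteness, here is a rough sketch of the kind of routing interface I mean. The topics, keyword lists, and model names are just placeholders; a real router would probably use a small classifier or embedding similarity instead of keyword matching:

```python
# Rough sketch of a topic router sitting in front of several specialist 7B models.
# Topics, keyword lists, and model names are placeholders, not real checkpoints.
SPECIALISTS = {
    "math":     {"keywords": {"integral", "equation", "proof", "solve"},   "model": "math-7b"},
    "coding":   {"keywords": {"python", "function", "bug", "compile"},     "model": "code-7b"},
    "history":  {"keywords": {"empire", "war", "century", "revolution"},   "model": "history-7b"},
    "roleplay": {"keywords": {"character", "scene", "persona", "story"},   "model": "roleplay-7b"},
}
FALLBACK_MODEL = "general-7b"  # used when no topic matches


def route(prompt: str) -> str:
    """Pick the specialist model whose keywords best match the prompt."""
    words = set(prompt.lower().split())
    scores = {topic: len(words & spec["keywords"]) for topic, spec in SPECIALISTS.items()}
    best = max(scores, key=scores.get)
    return SPECIALISTS[best]["model"] if scores[best] > 0 else FALLBACK_MODEL


print(route("Can you solve this integral step by step?"))  # -> math-7b
```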

  • yahma@alien.topB · 1 year ago

    Yes. This is known as Mixture of Experts (MoE).

    We already have several promising ways of doing this:

    1. QMoE: A Scalable Algorithm for Sub-1-Bit Compression of Trillion-Parameter Mixture-of-Experts Architectures. Paper - GitHub
    2. S-LoRA: serves thousands of concurrent LoRA adapters.
    3. LoRAX: serves hundreds of concurrent adapters.
    4. LMoE: a simple method of dynamically loading LoRAs (see the sketch after this list).
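
    The adapter-based options (2-4) rely on the per-topic "experts" sharing one base model, so switching experts only means swapping small LoRA weights rather than reloading a whole 7B checkpoint. Here is a minimal sketch of that idea with Hugging Face peft, assuming the adapter repo names (which are hypothetical here) point at topic-tuned LoRAs:

```python
# Sketch of the LMoE idea: one shared 7B base plus small per-topic LoRA adapters
# that get swapped per request instead of loading separate full models.
# Requires transformers + peft; the adapter repo names below are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")

# The first adapter creates the PeftModel; further adapters are registered by name.
model = PeftModel.from_pretrained(base, "your-org/llama2-7b-math-lora", adapter_name="math")
model.load_adapter("your-org/llama2-7b-code-lora", adapter_name="coding")


def generate(prompt: str, topic: str) -> str:
    model.set_adapter(topic)  # switch "experts" without reloading the 7B base weights
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)


print(generate("Write a Python function that reverses a string.", topic="coding"))
```

    S-LoRA and LoRAX take the same idea to serving scale, batching requests that target different adapters against the same shared base weights.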
    • sampdoria_supporter@alien.topB · 1 year ago

      I can’t believe I hadn’t run into this. Would you indulge me on the implications for agentic systems like AutoGen? I’ve been working on having experts cooperate that way rather than combining them into a single model.