If I have multiple 7B models, where each model is trained on one specific topic (e.g. roleplay, math, coding, history, politics…), and I have an interface which decides, depending on the context, which model to use, could this outperform bigger models while being faster?
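To make that setup concrete, here is a minimal sketch of such a dispatcher. The model names and keyword lists below are placeholders I made up, and a real router would more likely use a small classifier or embedding similarity rather than keyword counting:

```python
# Toy context-based router: picks a specialist model per prompt.
# Model names and keyword lists are illustrative placeholders only.

TOPIC_MODELS = {
    "coding":   "specialist-coder-7b",
    "math":     "specialist-math-7b",
    "roleplay": "specialist-roleplay-7b",
    "history":  "specialist-history-7b",
}

TOPIC_KEYWORDS = {
    "coding":   ["python", "function", "bug", "compile"],
    "math":     ["integral", "prove", "equation", "solve"],
    "roleplay": ["pretend", "character", "act as"],
    "history":  ["empire", "war", "century", "revolution"],
}

def route(prompt: str) -> str:
    """Pick the specialist whose keywords best match the prompt."""
    prompt_lower = prompt.lower()
    scores = {
        topic: sum(word in prompt_lower for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    best_topic = max(scores, key=scores.get)
    return TOPIC_MODELS[best_topic]

print(route("Can you help me fix this Python function that won't compile?"))
# -> specialist-coder-7b
```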
Yes, this is the idea behind Mixture of Experts (MoE),
and we already have examples of this:
coding - deepseek-coder-7B is better at coding than many 70B models
answering from context - llama-2-7B is better than llama-2-13B on the OpenBookQA test
https://preview.redd.it/1gexvwd83i2c1.png?width=1000&format=png&auto=webp&s=cda1ee16000c2e89410091c172bf4756bc8a427b
etc.
Does this use of mixture-of-experts mean that multiple 70B models would perform better than multiple 7B models?
The question was whether multiple small models can beat a single big model while also having a speed advantage, and the answer is yes. An example of that is MoE, which is a collection of small models all inside a single big model.
https://huggingface.co/google/switch-c-2048 is one such example.
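For intuition on how the experts are wired together inside one model, here is a toy numpy sketch of top-k gating in the style of Switch-Transformer-type routing. It only illustrates the mechanism (a router scores each token and only a few experts actually run), it is not the model's actual code:

```python
# Toy MoE layer: route each token to its top-k experts and mix their outputs.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 1   # Switch-style routing uses top_k = 1

# Each "expert" is just a small feed-forward weight matrix here.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x has shape (n_tokens, d_model); only top_k experts run per token."""
    logits = x @ router_w                                   # (n_tokens, n_experts)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        top = np.argsort(probs[i])[-top_k:]                 # chosen expert indices
        for e in top:
            out[i] += probs[i, e] * (token @ experts[e])    # weighted expert output
    return out

tokens = rng.standard_normal((3, d_model))
print(moe_layer(tokens).shape)   # (3, 8) -- compute per token stays small
```

This is why total parameter count can be huge while inference cost per token stays close to that of a single small expert.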
Thank you for sharing, I understand now
Big is an understatement. Please do correct me if I got it wildly wrong, but it appears to be a 3.6 TB colossus.
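Rough sanity check on that number, assuming Switch-C 2048 has about 1.6 trillion parameters stored at roughly 2 bytes each (both figures are my assumptions, not from the repo):

```python
# Back-of-the-envelope only; parameter count and bytes/param are assumptions.
params = 1.6e12          # ~1.6T parameters reported for Switch-C 2048
bytes_per_param = 2      # assuming 16-bit storage
print(f"{params * bytes_per_param / 1e12:.1f} TB")   # -> 3.2 TB, same ballpark as ~3.6 TB
```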