At Microsoft, we’re expanding AI capabilities by training small language models to achieve the kind of enhanced reasoning and comprehension typically found only in much larger models.
It’d be interesting to see how an MoE framework of multiple Orca 2s each trained on different subsets of data basically routing your prompt to different orca 2 experts would fair. I feel like that can come extraordinarily close to a GPT 4 in performance metrics but would take decent computing power to test the hypothesis. If each orca 2 expert is 10 billion parameters and you wanted to run a 100 billion sparse orca 2 MoE that’s gonna require at least 500 gig+ of VRAM at minimum.
It’d be interesting to see how an MoE framework of multiple Orca 2s each trained on different subsets of data basically routing your prompt to different orca 2 experts would fair. I feel like that can come extraordinarily close to a GPT 4 in performance metrics but would take decent computing power to test the hypothesis. If each orca 2 expert is 10 billion parameters and you wanted to run a 100 billion sparse orca 2 MoE that’s gonna require at least 500 gig+ of VRAM at minimum.