Perhaps we should have like a hundred different 7B models for different categories (history, arts, science, etc.), and then above that a new layer: a generic LLM that parses the question into the correct category, and then the matching 7B model loads into your VRAM? :D With the fastest NVMe (not sure if DirectStorage would help, probably not?) the waiting might not be too terrible, unless every one of your questions lands in a different category.
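Something like this, as a very rough Python sketch (all the names are made up: route_question stands in for the generic router LLM, load_specialist stands in for pulling the right per-category model file off NVMe into VRAM, and the lru_cache(maxsize=1) is the "only one 7B fits in VRAM at a time" part):

```python
# Hand-wavy sketch of the "router + specialist 7B models" idea.
# Everything here is a placeholder; in practice the router would itself be
# a small LLM and load_specialist would be llama.cpp (or similar) loading
# a per-category GGUF from NVMe.

from functools import lru_cache

CATEGORY_MODELS = {
    "history": "models/history-7b.gguf",
    "arts":    "models/arts-7b.gguf",
    "science": "models/science-7b.gguf",
    # ...and ~97 more
}

def route_question(question: str) -> str:
    """Stand-in for the generic router LLM: pick the best-matching category."""
    q = question.lower()
    for category in CATEGORY_MODELS:
        if category in q:
            return category
    return "science"  # fallback category

@lru_cache(maxsize=1)  # only one specialist fits in VRAM at a time
def load_specialist(category: str):
    """Stand-in for loading the 7B model from NVMe into VRAM (the slow part)."""
    path = CATEGORY_MODELS[category]
    print(f"(re)loading {path} into VRAM...")
    return lambda prompt: f"[{category} model answers: {prompt!r}]"

def answer(question: str) -> str:
    category = route_question(question)
    model = load_specialist(category)  # same category as last time = no reload
    return model(question)

print(answer("arts: who painted the Sistine Chapel ceiling?"))
print(answer("arts: what pigments did Renaissance painters use?"))  # cache hit
print(answer("history: why did the Roman Empire fall?"))            # reload
```

The lru_cache(maxsize=1) captures the trade-off in the comment: consecutive questions in the same category reuse the loaded model, while hopping between categories forces a reload from disk every time.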