Has anyone tested this?
Wow, could this be used to replicate MoE?
Or just do it WAY more efficiently, yeah.
This is an important development!
A LoRA is ~25MB. A full Mistral fine-tune is 14GB.
So this cuts the per-model footprint by a factor of roughly 500 (14 GB / 25 MB ≈ 560x). The base model still has to sit in VRAM, but each additional customized model only costs an adapter's worth of weights.
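To sanity-check the ~25 MB figure, here is a back-of-the-envelope sketch. The assumptions (rank-16 adapters on just the q/v projections, Mistral-7B dims, fp16) are mine, not from the thread; higher ranks or adapters on more modules land closer to the 25 MB quoted above.

    # Back-of-the-envelope: LoRA adapter size vs. a full fp16 Mistral-7B checkpoint.
    # Assumptions (not from the thread): rank r=16, adapters on q/v projections only,
    # Mistral-7B dims: hidden=4096, kv dim=1024 (GQA), 32 layers, fp16 (2 bytes/param).

    BYTES_PER_PARAM = 2            # fp16
    HIDDEN, KV, LAYERS, RANK = 4096, 1024, 32, 16

    def lora_params(d_in, d_out, r):
        # LoRA adds a low-rank update B @ A to a frozen (d_out x d_in) weight,
        # where A is (r x d_in) and B is (d_out x r).
        return r * (d_in + d_out)

    per_layer = lora_params(HIDDEN, HIDDEN, RANK)   # q_proj
    per_layer += lora_params(HIDDEN, KV, RANK)      # v_proj
    adapter_bytes = per_layer * LAYERS * BYTES_PER_PARAM

    full_bytes = 7.2e9 * BYTES_PER_PARAM            # ~7.2B params in fp16

    print(f"adapter ≈ {adapter_bytes / 1e6:.1f} MB")      # ≈ 13.6 MB
    print(f"full fine-tune ≈ {full_bytes / 1e9:.1f} GB")  # ≈ 14.4 GB
    print(f"ratio ≈ {full_bytes / adapter_bytes:,.0f}x")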
This means a provider can step in and offer on-demand hosting of many custom models at very reasonable cost. This is a massively better experience for experimenters and small businesses training and offering LLMs than the existing options: renting GPUs, buying GPUs, or on-demand "serverless" GPU offerings. Of these, the last has been the most practical for small players, but the cold-start latency sucks when a model hasn't been used recently, since multiple gigabytes of weights have to be loaded into VRAM. LoRAs are not only cheaper to host, they're quicker to swap.
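A minimal sketch of what "quicker to swap" looks like in practice, using Hugging Face transformers + peft. The model name is real; the adapter paths and customer names are placeholders, and a real server would add batching, eviction, and auth on top:

    # One resident base model, per-customer LoRA adapters hot-swapped on demand.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE = "mistralai/Mistral-7B-v0.1"

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    base = AutoModelForCausalLM.from_pretrained(
        BASE, torch_dtype=torch.float16, device_map="auto"
    )

    # Load the first customer's adapter; the base weights stay frozen and shared.
    model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
    # Additional adapters are only tens of MB each, so loading more is cheap.
    model.load_adapter("adapters/customer-b", adapter_name="customer-b")

    def generate(adapter_name: str, prompt: str) -> str:
        model.set_adapter(adapter_name)  # switch adapters, no 14 GB reload
        inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
        out = model.generate(**inputs, max_new_tokens=64)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    print(generate("customer-a", "Summarize our refund policy:"))
    print(generate("customer-b", "Draft a support reply:"))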
If one were building on Mistral-7B, they could serve from gaming cards (4090s), which are dirt cheap per FLOP relative to NVIDIA's datacenter offerings. This might be a way to pick up the business OpenAI can't serve, basically small players at the long tail.
Given the number of fine-tuning and hosting startups competing for the segment of the market OpenAI hasn’t gobbled, this should be a commercial offering by December. Perhaps someone reading this will build it!
I’m wondering though, from an engineering perspective: when traffic is high, wouldn’t this cause a lot of weight switching? You’d basically be limited by host-to-device bandwidth.
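For a sense of scale, here is a rough bound on that bandwidth cost. The 25 GB/s effective PCIe figure and 25 MB adapter size are my assumptions, not from the thread:

    # Rough bound on adapter-swap latency from host-to-device transfer alone.
    # Assumptions: 25 MB adapter, ~25 GB/s effective PCIe 4.0 x16, 14 GB full model.
    ADAPTER_MB = 25
    FULL_MODEL_GB = 14
    PCIE_GBPS = 25.0        # effective GB/s, PCIe 4.0 x16 ballpark

    adapter_ms = ADAPTER_MB / 1e3 / PCIE_GBPS * 1e3   # ≈ 1 ms
    full_model_s = FULL_MODEL_GB / PCIE_GBPS          # ≈ 0.56 s

    print(f"LoRA swap (transfer only): ~{adapter_ms:.1f} ms")
    print(f"Full model load (transfer only): ~{full_model_s:.2f} s")

So even swapping on every single request costs on the order of a millisecond of PCIe time, versus half a second or more to page in a full model. The harder engineering problem is probably keeping GPU batches efficient when requests for many different adapters interleave, rather than the raw transfer itself.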