I’ve been working on a project with my roommate to make it incredibly simple to run batch inference on LLMs while leveraging a massive amount of cloud resources. We finally got the tool working and created a tutorial on how to use it on Mistral 7B.
Also, if you’re a frequent HuggingFace user you can easily adapt the code to run inference on other LLMs. Please test it out and give feedback; I feel really good about how easy it is to use, but I want to find out if anything is unintuitive. I hope the community is able to get some value out of it! Here is the link to the tutorial: https://docs.burla.dev/Example:%20Massively%20Parallel%20Inference%20with%20Mistral-7B
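Swapping models is mostly a matter of changing the model ID in the function each worker runs. A rough sketch (simplified, not the exact tutorial code; the `run_inference` helper name is just illustrative):

```python
from transformers import pipeline

# Per-worker inference function: adapting this to a different HuggingFace
# model is basically just changing the model ID below.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",  # swap for any HF causal LM
    device_map="auto",
)

def run_inference(prompt: str) -> str:
    result = generator(prompt, max_new_tokens=128, do_sample=False)
    return result[0]["generated_text"]
```

Point the parallel map call from the tutorial at `run_inference` and the rest stays the same.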
This is actually really cool; it was simple and I didn’t run into any issues! A couple of big questions though, since you’re managing the infrastructure… how long are you going to let people use your compute for? Do I get a certain amount for free? And how much will it cost once I need to start paying?
Unique concept, I like it
Very impressive! Lots of good use cases for this.
You should look into continuous batching: most of your parallel requests run at batch size 1, which heavily underutilises the VRAM and leaves a lot of throughput on the table that would otherwise be easily achievable.
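For example, vLLM does continuous batching under the hood: you hand it the whole list of prompts and its scheduler keeps adding and retiring sequences so the GPU stays saturated. A minimal sketch (model ID and sampling settings are just placeholders, not your setup):

```python
from vllm import LLM, SamplingParams

# vLLM's engine batches at the iteration level: new sequences join the running
# batch as soon as others finish, instead of each request running at batch size 1.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Summarize this article: ...", "Translate to French: ..."]  # your real inputs
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text)
```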
I’m in the middle of building my app on Modal. Guess I’ll adapt it to run on your service and see. Thanks for sharing!
This is really cool! We’re more focused on lengthy workloads, think running 500k inputs through an LLM in one batch, rather than on-demand inference (though we’re starting to support that too). Right now the startup time is pretty long (2-5 minutes), but we’re working on cutting it down.