I’ve been working on a project with my roommate to make it incredibly simple to run batch inference on LLMs while leveraging a massive amount of cloud resources. We finally got the tool working and created a tutorial on how to use it with Mistral 7B.
Also, if you’re a frequent HuggingFace user, you can easily adapt the code to run inference on other LLMs. Please test it out and provide feedback; I feel really good about how easy it is to use, but I want to find out if anything is unintuitive. I hope the community is able to get some value out of it! Here’s the link to the tutorial: https://docs.burla.dev/Example:%20Massively%20Parallel%20Inference%20with%20Mistral-7B
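To give a sense of what adapting it looks like, here’s a rough sketch of the general pattern (simplified, not the exact code from the tutorial, so defer to the docs for the real `remote_parallel_map` signature): wrap your usual HuggingFace inference code in a plain Python function, then hand that function and your inputs over. The model ID is just an example you can swap for any other causal LM on the Hub.

```python
# Rough sketch of the pattern: any HuggingFace causal LM can be swapped in
# below, and the chunks of prompts get fanned out across cloud VMs.
# The remote_parallel_map call is simplified; see the tutorial for the
# exact signature.
from transformers import AutoModelForCausalLM, AutoTokenizer

def run_inference(prompts):
    # Swap this model ID for any other causal LM on the Hub.
    model_id = "mistralai/Mistral-7B-Instruct-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    outputs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generated = model.generate(**inputs, max_new_tokens=128)
        outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))
    return outputs

if __name__ == "__main__":
    from burla import remote_parallel_map

    # Each chunk of prompts becomes one call to run_inference on its own machine.
    prompt_chunks = [
        ["Write a haiku about GPUs."],
        ["Summarize the plot of Dune in two sentences."],
    ]
    results = remote_parallel_map(run_inference, prompt_chunks)
    print(results)
```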
I’m in the middle of building my app on Modal. Guess I’ll adapt it to run on your service and see. Thanks for sharing!
This is really cool! We’re more focused on lengthy workloads, e.g. running 500k inputs through an LLM in one big batch rather than on-demand inference (though we’re starting to support that too). Right now the startup time is pretty long (2-5 minutes), but we’re working on cutting it down.
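For the curious, here’s a toy sketch of what that “one big batch” pattern looks like (chunk size and the placeholder “model” are made up for illustration; again, see the docs for the real API):

```python
# Toy example: submit one large batch of prompts, split into chunks so each
# worker handles a slice. Chunk size and the echo "model" are placeholders.
from burla import remote_parallel_map  # see the docs for the exact API

def run_inference(prompts):
    # A real version would load a model and generate text (see the sketch above);
    # this just echoes the prompts to keep the example short.
    return [f"output for: {p}" for p in prompts]

prompts = [f"prompt #{i}" for i in range(500_000)]  # the whole dataset up front
chunk_size = 1_000
chunks = [prompts[i:i + chunk_size] for i in range(0, len(prompts), chunk_size)]

# One call submits all 500 chunks; workers churn through them in parallel
# instead of waiting for requests to arrive one at a time.
results = remote_parallel_map(run_inference, chunks)
```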