I’m currently exploring the deployment of Llama models in a production environment and I’m keen to hear from anyone who has ventured into this territory. My primary concern is managing multiple concurrent users while optimizing resources effectively.
While there are numerous methods to tweak Llama for testing with a single user, scaling up poses its own set of challenges. I’m particularly interested in learning how others have approached this problem. I’m also curious about projects like vLLM and Hugging Face TGI for faster inference. Has anyone had experience with these, and how have they contributed to your scaling efforts?
My goal is to implement an API utilizing Llama models for a small organization’s private use. I’m eager to learn from your experiences and any advice or insights you can share on this topic.
llama.cpp has supported batched inference for about 4 weeks now: https://github.com/ggerganov/llama.cpp/issues/2813
-cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: disabled)
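For anyone wondering why continuous batching matters with multiple concurrent users, here is a rough toy simulation, not how llama.cpp actually implements it. It only counts decode steps for a fixed number of slots; the function names and request lengths are made up for illustration, and every request is assumed to need at least one token.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished ones free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: tokens to generate per request, in arrival order.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print("static:", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With a mix of long and short requests the continuous version needs fewer steps, because short requests no longer wait for the longest one in their batch.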
FYI, this was discussed here 11 days ago: https://www.reddit.com/r/LocalLLaMA/comments/17m2lql/best_framework_for_llm_based_applications_in/
Three thoughts:
TGI is no longer free software (in the sense that their new license is not OSI approved, nor would it be remotely eligible).
LightLLM is another option that is permissively licensed, and reportedly fast. I haven’t tried it yet.
Speculative inference can yield a significant performance bump, but the devil’s in the details. Some implementations seem to work a lot better than others.
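To make the speculative inference point concrete (it is often called speculative decoding), here is a minimal greedy sketch under heavy assumptions. The two "models" are toy functions I made up that map a token list to the next token, and this is not any particular library's implementation. A real implementation verifies all drafted positions in a single forward pass of the big model and usually also keeps a bonus token when every guess is accepted; this sketch skips that and just counts one pass per round.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding: draft proposes, target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses up to k tokens ahead.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the guesses (counted as one batched pass).
        target_passes += 1
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)  # always keep the target's token
            if guess != expected:
                break  # first mismatch: discard the rest of the draft
    return tokens, target_passes


# Toy stand-ins (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after a 4.
def target_model(tokens):
    return (tokens[-1] + 1) % 10


def draft_model(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


out, passes = speculative_decode(target_model, draft_model, [0], n_new=12)
print(out == greedy_decode(target_model, [0], 12), passes)
```

The output is identical to plain greedy decoding with the target model; the win is fewer target passes, and it shrinks quickly when the draft model guesses wrong often, which is one reason results vary so much between implementations.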
vLLM is performing well so far, better than expected. I’m running it across distributed GPUs and working on scaling GPU capacity up and down with load. I still need to figure out the right metric to trigger scaling up/down.
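One possible starting point for the scaling trigger, purely as a sketch: scale on queued plus in-flight requests per replica rather than on GPU utilization, since utilization tends to sit near its maximum whenever anything is running. The function name, the threshold of 8 requests per replica, and the replica limits below are all hypothetical; plug in whatever request counts your serving stack reports.

```python
import math


def desired_replicas(waiting_requests, running_requests, current_replicas,
                     target_per_replica=8, min_replicas=1, max_replicas=4):
    """Pick a replica count from total load (all thresholds are assumptions)."""
    load = waiting_requests + running_requests
    wanted = math.ceil(load / target_per_replica)
    # Scale up immediately, but scale down one replica at a time
    # so a brief lull does not tear down capacity all at once.
    if wanted < current_replicas:
        wanted = current_replicas - 1
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(waiting_requests=20, running_requests=16, current_replicas=2))
print(desired_replicas(waiting_requests=0, running_requests=3, current_replicas=3))
```

In practice you would probably also average the counts over a short window before acting on them, because GPU nodes are slow to start.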