Hi everyone,
We’ve recently experimented with deploying the CodeLlama 34B model and wanted to share our key findings for those interested:
- Best Performance: 4-bit GPTQ-quantized CodeLlama-Python-34B model served with vLLM (see the loading sketch after this list).
- Results: Lowest average latency of 3.51 sec, average generation rate of 58.40 tokens/sec, and a cold start time of 21.8 sec on our platform, using an Nvidia A100 GPU.
- Other Libraries Tested: HuggingFace Transformers pipeline, AutoGPTQ, Text Generation Inference.
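
For anyone wanting to try the vLLM setup, here's a minimal sketch of how a GPTQ-quantized model can be loaded and queried. The HF repo name and sampling parameters below are placeholders for illustration, not necessarily the exact configuration we ran:

```python
from vllm import LLM, SamplingParams

# Load 4-bit GPTQ CodeLlama-Python-34B weights.
# The repo name is an assumed example; point this at whichever GPTQ export you deploy.
llm = LLM(
    model="TheBloke/CodeLlama-34B-Python-GPTQ",
    quantization="gptq",
    dtype="float16",
)

# Illustrative sampling settings; tune for your workload.
params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=256)

outputs = llm.generate(["def quicksort(arr):"], params)
print(outputs[0].outputs[0].text)
```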
Keen to hear your experiences and learnings in similar deployments!
Thanks for sharing this. Have you used exllama 2 as well?
Not yet, made a note. Will add when I update the tutorial.