Hi everyone,
We’ve recently experimented with deploying the CodeLlama 34B model and wanted to share our key findings for those interested:
- Best Performance: the 4-bit GPTQ-quantized CodeLlama-Python-34B model served with vLLM (a rough serving sketch is included below the list).
- Results: average latency of 3.51 sec (the lowest of the configurations we tested), average generation speed of 58.40 tokens/sec, and a cold start time of 21.8 sec on our platform, using an NVIDIA A100 GPU.
- Other libraries tested: Hugging Face Transformers pipeline, AutoGPTQ, Text Generation Inference (TGI).
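For anyone who wants to try the vLLM setup, here is a minimal sketch of what it looks like using vLLM's offline `LLM` API with a GPTQ 4-bit checkpoint. The model id, prompt, and sampling settings below are illustrative examples, not necessarily the exact values we used, and the timing is a rough client-side measurement rather than our benchmarking harness:

```python
import time
from vllm import LLM, SamplingParams

# Example 4-bit GPTQ checkpoint; swap in whichever quantized
# CodeLlama-Python-34B repo you actually deploy.
llm = LLM(
    model="TheBloke/CodeLlama-34B-Python-GPTQ",
    quantization="gptq",
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
prompt = "# Write a Python function that returns the n-th Fibonacci number\n"

# Rough latency / throughput measurement for a single request.
start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.perf_counter() - start

completion = outputs[0].outputs[0]
tokens_generated = len(completion.token_ids)
print(completion.text)
print(f"latency: {elapsed:.2f}s, throughput: {tokens_generated / elapsed:.2f} tokens/sec")
```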
Keen to hear your experiences and learnings in similar deployments!
Thanks for sharing this. Have you tried ExLlamaV2 as well?
Not yet, but I've made a note. I'll add it when I update the tutorial.