Need help setting up a cost-efficient llama v2 inference API for my micro saas app

m1ss1l3@alien.top · 3 years ago

Need help setting up a cost-efficient llama v2 inference API for my micro saas app

ggerganov@alien.top · 3 years ago

I just wrote a post today about serving 7B models with `llama.cpp` from cheap AWS instances - might be useful:

https://github.com/ggerganov/llama.cpp/discussions/4225

m1ss1l3@alien.top · 3 years ago

Thanks for all your work!!
The instance you used looks like it costs about $0.526 per hour, which would fit our budget!

Also, I want to make sure I'm reading the benchmark results right. Is it correct that it took about 26s to serve all 4 requests in parallel with the quantized model, under the 2048+512 tokens assumption?
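
In case it helps anyone checking the same arithmetic, here is a rough back-of-envelope sketch in Python. It assumes my reading above is right: 0.526 is the price in USD per hour, the 26s covers all 4 parallel requests, and each request is 2048+512 tokens. None of that is confirmed yet, and the function names are just ones I made up for illustration.

```python
# Back-of-envelope cost and throughput estimate.
# All inputs are ASSUMPTIONS taken from my reading of the benchmark,
# not confirmed figures.

def cost_per_request(hourly_price, batch_seconds, requests_per_batch):
    """Instance cost attributed to one request, assuming the instance
    runs full batches back to back with no idle time."""
    batch_cost = hourly_price * batch_seconds / 3600.0
    return batch_cost / requests_per_batch


def tokens_per_second(tokens_per_request, requests_per_batch, batch_seconds):
    """Aggregate tokens processed per second across the whole batch."""
    return tokens_per_request * requests_per_batch / batch_seconds


HOURLY_PRICE = 0.526              # assumed USD per hour
BATCH_SECONDS = 26.0              # assumed time to serve one batch
PARALLEL_REQUESTS = 4             # requests served in parallel
TOKENS_PER_REQUEST = 2048 + 512   # assumed tokens per request

if __name__ == "__main__":
    cost = cost_per_request(HOURLY_PRICE, BATCH_SECONDS, PARALLEL_REQUESTS)
    rate = tokens_per_second(TOKENS_PER_REQUEST, PARALLEL_REQUESTS, BATCH_SECONDS)
    print(f"cost per request: ${cost:.5f}")
    print(f"aggregate throughput: {rate:.1f} tokens/s")
```

If those inputs hold, that works out to roughly a tenth of a cent per request at full utilization. Any idle time on the instance would push the real per-request cost higher.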