m1ss1l3@alien.top (OP) to LocalLLaMA@poweruser.forum • Need help setting up a cost-efficient Llama v2 inference API for my micro SaaS app
This is pretty cool, thanks for sharing! Will try it out and check performance.
Thanks for all your work!!
The instance you used looks like it was $0.526 per hour, which would fit our budget!!
Also, I want to make sure I'm reading the benchmark results right: is it correct that it took about 26 s to serve all 4 requests in parallel with the quantized model, under the 2048+512 token assumption?
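If I'm reading it that way, here's my rough back-of-envelope (assuming each of the 4 requests actually generates the full 512 output tokens, which is my assumption, not something stated in the benchmark):

```python
# Rough throughput estimate from the numbers above. Assumes 4 parallel
# requests, each generating the full 512 output tokens, finishing in ~26 s
# of wall time (my reading of the benchmark, not confirmed by the author).
requests = 4
output_tokens = 512
wall_time_s = 26.0

aggregate_tps = requests * output_tokens / wall_time_s  # ~78.8 tokens/s total
per_stream_tps = output_tokens / wall_time_s            # ~19.7 tokens/s per request
print(f"aggregate ≈ {aggregate_tps:.1f} tok/s, per stream ≈ {per_stream_tps:.1f} tok/s")
```

If those figures roughly match what you measured, then I'm reading the table correctly.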
I tried this but got a bunch of errors with the binary. Can you share the versions of CUDA and the other dependencies needed for this?
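In case it helps to compare environments, this is roughly how I dumped my own versions (a minimal sketch, assuming a PyTorch-based stack; the binary in the post may use a different runtime entirely):

```python
# Collect CUDA/dependency versions for a bug report. Assumes PyTorch is
# installed; this is my own debugging helper, not part of the original setup.
import platform

import torch

print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)  # CUDA version torch was built against
print("cudnn :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("gpu   :", torch.cuda.get_device_name(0))
else:
    print("gpu   : none detected")
```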