m1ss1l3@alien.top (OP) to LocalLLaMA@poweruser.forum • Need help setting up a cost-efficient Llama v2 inference API for my micro SaaS app
This is pretty cool, thanks for sharing! Will try it out and check performance.
Thanks for all your work!!
The instance you used looks like it was $0.526 per hour, which would fit our budget!!
Also, I want to make sure I'm reading the benchmark results right: is it correct that it took about 26 s to serve all 4 requests in parallel with the quantized model, under the 2048+512 token assumption?
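If I'm reading it that way, here's my rough back-of-envelope (assuming each of the 4 requests actually generates the full 512 output tokens, which is my assumption, not something stated in the benchmark):

```python
# Rough throughput estimate from the numbers above. Assumes 4 parallel
# requests, each generating the full 512 output tokens, finishing in ~26 s
# of wall time (my reading of the benchmark, not confirmed by the author).
requests = 4
output_tokens = 512
wall_time_s = 26.0

aggregate_tps = requests * output_tokens / wall_time_s  # ~78.8 tokens/s total
per_stream_tps = output_tokens / wall_time_s            # ~19.7 tokens/s per request
print(f"aggregate ≈ {aggregate_tps:.1f} tok/s, per stream ≈ {per_stream_tps:.1f} tok/s")
```

If those figures roughly match what you measured, then I'm reading the table correctly.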
I tried this but got a bunch of errors with the binary. Can you share the versions of CUDA and the other dependencies needed for this?
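In case it helps to compare environments, this is roughly how I dumped my own versions (a minimal sketch, assuming a PyTorch-based stack; the binary in the post may use a different runtime entirely):

```python
# Collect CUDA/dependency versions for a bug report. Assumes PyTorch is
# installed; this is my own debugging helper, not part of the original setup.
import platform

import torch

print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)  # CUDA version torch was built against
print("cudnn :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("gpu   :", torch.cuda.get_device_name(0))
else:
    print("gpu   : none detected")
```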