Need help setting up a cost-efficient llama v2 inference API for my micro saas app

m1ss1l3@alien.top · 3 years ago

Need help setting up a cost-efficient llama v2 inference API for my micro saas app

noobgolang@alien.top · 3 years ago

you can try https://nitro.jan.ai/ its built for this purpose

m1ss1l3@alien.top · 3 years ago

I tried this but got a bunch of errors with the binary, can you share the versions of cuda and other dependencies needed for this?

noobgolang@alien.top · 3 years ago

for cuda version you can use this link for linux version https://github.com/janhq/nitro/releases/download/v0.1.17/nitro-0.1.17-linux-amd64-cuda.tar.gz , you need to make sure the system has cudatoolkit. i remcommend following the exact step in quickstart docs here https://nitro.jan.ai/quickstart to make sure it will work

MannowLawn@alien.top · 3 years ago

chrome is marking the download as suspicious from the github repo

noobgolang@alien.top · 3 years ago

also the build is 100% built in public with the source code on the page, you can check the Actions button to see it, there is nothing hidden here

MannowLawn@alien.top · 3 years ago

thanks, ill have a look. It seems very promising with my use case as well. Btw is nitro different than the download you have on the main page? Nitro seems only for m1 models of apple and on main page it mentions m2 models as well?

noobgolang@alien.top · 3 years ago

m1 models of apple and on main page it mentions m2 models as well?

yeah arm64 mac should be able to run on all mac m1 and m2 including, we also have cuda version in the release

MannowLawn@alien.top · 3 years ago

cheers! ill keep a close watch on this, nice work!

kivathewolf@alien.top · 3 years ago

Checkout fastChat api. Easy to deploy and you can scale it. It can also support an open AI format api.

m1ss1l3@alien.top · 3 years ago

this is pretty cool, thanks for sharing will try out and check performance

ggerganov@alien.top · 3 years ago

I just wrote a post today about serving 7B models with `llama.cpp` from cheap AWS instances - might be useful:

https://github.com/ggerganov/llama.cpp/discussions/4225

m1ss1l3@alien.top · 3 years ago

Thanks for all your work!!
The instance you used looks like it was 0.526 per hour which would fit our budget!!

Also, I want to make sure I’m reading the benchmark results right, is it correct that it took about 26s to serve all the 4 requests in parallel with the quantized model and the 2048+512 tokens assumption?