What kind of specs to run local llm and serve to say up to 20-50 users

Appropriate-Tax-9585@alien.top · 2 years ago

What kind of specs to run local llm and serve to say up to 20-50 users

dododragon@alien.top · 2 years ago

Have a look at https://www.runpod.io/ for AI cloud hosting. You could do some testing based on the number of users you want to cater for, and see what capacity you’ll get for your $.

Start with a basic plan, run some tests to see what it can handle and compare it as you scale up the number of users with simultaneous queries.
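
A rough sketch of that kind of test, in case it helps: fire N simultaneous queries and record the slowest response as N grows. Everything here is an assumption for illustration. `fake_query` is a hypothetical stand-in that just sleeps, and you would replace it with a real call to whatever endpoint you are renting.

```python
import asyncio
import time


async def fake_query(prompt: str) -> str:
    # Hypothetical stand-in for a real request to your inference endpoint.
    await asyncio.sleep(0.01)
    return "ok"


async def timed(query, prompt: str) -> float:
    # Time a single request from send to full response.
    start = time.perf_counter()
    await query(prompt)
    return time.perf_counter() - start


async def run_load(query, users: int) -> list:
    # Launch one request per simulated user, all at the same moment.
    return await asyncio.gather(
        *(timed(query, f"prompt {i}") for i in range(users))
    )


def sweep(query, user_counts) -> dict:
    # Map each concurrency level to the worst latency seen at that level.
    results = {}
    for n in user_counts:
        latencies = asyncio.run(run_load(query, n))
        results[n] = max(latencies)
    return results
```

Calling `sweep(fake_query, [1, 5, 20, 50])` returns worst-case latency per user count. With a real endpoint, the point where that number stops being acceptable is roughly the capacity of the plan you are testing.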

Tiny_Arugula_5648@alien.top · 2 years ago

Unless you’re doing this as a business, it’s going to be massively cost-prohibitive: hundreds of thousands of dollars of hardware. If it is a business, you’d better start talking to cloud vendors, because GPUs are an incredibly scarce resource right now.

seanpuppy@alien.top · 2 years ago

It depends a lot on the details tbh. Do they share one model? Do they each use a different LoRA? If it’s the latter, there’s some cool recent research on efficiently hosting many LoRAs on one machine.

Appropriate-Tax-9585@alien.top · 2 years ago

At the moment I’m just trying to grasp the basics, for example what kind of GPUs I will need and how many. This is more for comparison to SaaS options; in reality I need to set up a server for testing with just a few users. I’m going to research it further, but I like this community and want to hear others’ views on the case, as I imagine many have tried to manage their own servers :)

Prudent-Artichoke-19@alien.top · 2 years ago

One or two A6000s can serve a 70B model with decent tokens per second for 20 people. You can run a swarm using Petals and just add a GPU as needed. LLM sharding can be pretty useful.

Aggressive-Drama-899@alien.top · 2 years ago

We run Llama 2 70B for around 20-30 active users using TGI and 4x A100 80GB on Kubernetes. If 2 users send a request at the exact same time, there is about a 3-4 second delay for the second user. We’ve never really had any complaints about speed from people as of yet. We do have the ability to spin up multiple new containers if it becomes a problem, though. This is all on-prem.

Appropriate-Tax-9585@alien.top · 2 years ago

Thank you, this is really good to hear!

SupplyChainNext@alien.top · 2 years ago

Figure out the size and speed you need. Buy 20-50 of the Nvidia pro GPUs (A series), plus the server cluster hardware and network infrastructure needed to make them run efficiently.

Think in the several hundred thousand dollar range. I’ve looked into it.

pablines@alien.top · 2 years ago

Hugging Face’s text generation inference server (TGI) can handle concurrency; you just need to power it with GPUs.

a_beautiful_rhind@alien.top · 2 years ago

You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the users to see what they get per request.
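
As a back-of-the-envelope sketch of that division (the numbers in the usage note are made up, not measurements): take the aggregate tokens per second you benchmarked with batching on, split it evenly across simultaneous requests, and turn that into a wait time for a typical reply. The even split is itself an assumption; real schedulers won't divide throughput perfectly.

```python
def per_user_tps(batched_tokens_per_sec: float, concurrent_users: int) -> float:
    # Aggregate batched throughput split evenly across simultaneous requests.
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return batched_tokens_per_sec / concurrent_users


def seconds_per_reply(batched_tokens_per_sec: float,
                      concurrent_users: int,
                      reply_tokens: int) -> float:
    # How long one user waits for a full reply of reply_tokens tokens.
    return reply_tokens / per_user_tps(batched_tokens_per_sec, concurrent_users)
```

For example, a hypothetical benchmark of 400 tokens/s shared by 20 simultaneous requests gives `per_user_tps(400, 20)` = 20 tokens/s each, so a 300-token reply takes `seconds_per_reply(400, 20, 300)` = 15 seconds. In practice not all 20-50 users hit the server at the same instant, so this is closer to a worst case.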

There are some other backends like MLC, TGI, and vLLM that are better suited to this, but they have way worse quant support.

The “minimum” is one GPU that completely fits the size and quant of the model you are serving.
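
A rough way to check "fits", as a sketch only: weight memory is roughly parameter count times bits per weight divided by 8. The 20% headroom for KV cache and activations is my own placeholder assumption; the real figure depends on context length, batch size, and backend.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    # Memory for the weights alone: params * bits / 8 bytes, converted to GiB.
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def fits_on_gpu(params_billion: float,
                bits_per_weight: float,
                gpu_gib: float,
                overhead_fraction: float = 0.2) -> bool:
    # overhead_fraction is an assumed allowance for KV cache and activations,
    # not a measured value; tune it for your context length and batch size.
    needed = weight_gib(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= gpu_gib
```

By this estimate a 70B model at 4 bits per weight passes the check on a single 48 GB card (`fits_on_gpu(70, 4, 48)`), while the same model at 16 bits does not fit on one 80 GB card (`fits_on_gpu(70, 16, 80)`) and would need to be split across several.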

People serve lots of users through kobold horde using only single and dual GPU configurations so this isn’t something you’ll need 10s of 1000s for.