Hi all,

Just curious if anybody knows how much hardware it takes to run a Llama server that can serve multiple users at once.

Any discussion is welcome :)

  • dododragon@alien.topB
    10 months ago

    Have a look at https://www.runpod.io/ for AI cloud hosting. You could do some testing based on the number of users you want to cater for, and see what capacity you’ll get for your $.

    Start with a basic plan, run some tests to see what it can handle and compare it as you scale up the number of users with simultaneous queries.
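A minimal sketch of the kind of test described above, assuming you wrap one call to your endpoint in a function. The names `run_load_test` and `fake_request` are hypothetical; `fake_request` just sleeps and stands in for a real HTTP request to whatever server you rent.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_load_test(send_request, num_users, requests_per_user=1):
    """Run send_request from num_users concurrent workers and summarize latency."""

    def one_user(_):
        # Each simulated user sends its requests back to back.
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            send_request()
            latencies.append(time.perf_counter() - start)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        per_user = list(pool.map(one_user, range(num_users)))
    wall = time.perf_counter() - wall_start

    all_latencies = [lat for user in per_user for lat in user]
    return {
        "users": num_users,
        "requests": len(all_latencies),
        "mean_latency_s": mean(all_latencies),
        "max_latency_s": max(all_latencies),
        "requests_per_s": len(all_latencies) / wall,
    }


def fake_request():
    # Assumption: placeholder for a real request to the inference endpoint.
    time.sleep(0.01)


if __name__ == "__main__":
    # Scale up the number of simultaneous users and compare the summaries.
    for users in (1, 2, 4, 8):
        print(run_load_test(fake_request, users, requests_per_user=3))
```

Swap `fake_request` for a function that sends a real prompt, then watch how mean and max latency change as `num_users` grows on each plan.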

  • Tiny_Arugula_5648@alien.topB
    10 months ago

    Unless you’re doing this as a business, it’s going to be massively cost prohibitive: hundreds of thousands of dollars of hardware. If it is a business, you’d better start talking to cloud vendors, because GPUs are an incredibly scarce resource right now.

  • seanpuppy@alien.topB
    10 months ago

    It depends a lot on the details tbh. Do they share one model? Do they each use a different LoRA? If it’s the latter, there’s some cool recent research on efficiently hosting many LoRAs on one machine.

    • Appropriate-Tax-9585@alien.topOPB
      10 months ago

      At the moment I’m just trying to grasp the basics, for example what kind of GPUs I will need and how many. This is more for comparison with SaaS options; in reality I need to set up a server for testing with just a few users. I’m going to research it further, but I like this community and want to hear others’ views on the case, as I imagine many have tried to manage their own servers :)

  • Prudent-Artichoke-19@alien.topB
    10 months ago

    One or two A6000s can serve a 70B model with decent tokens per second for 20 people. You can run a swarm using Petals and just add a GPU as needed. LLM sharding can be pretty useful.

  • Aggressive-Drama-899@alien.topB
    10 months ago

    We run Llama 2 70B for around 20-30 active users using TGI and 4x A100 80 GB on Kubernetes. If two users send a request at the exact same time, there is about a 3-4 second delay for the second user. We haven’t really had any complaints about speed so far. We do have the ability to spin up additional containers if it becomes a problem, though. This is all on-prem.

  • SupplyChainNext@alien.topB
    10 months ago

    Figure out the size and speed you need. Buy 20-50 of the Nvidia pro GPUs (A series), plus the server cluster hardware and network infrastructure needed to make them run efficiently.

    Think in the several-hundred-thousand-dollar range. I’ve looked into it.

  • pablines@alien.topB
    10 months ago

    Hugging Face Text Generation Inference (TGI) can handle concurrency; you just need to power it with GPUs.

  • a_beautiful_rhind@alien.topB
    10 months ago

    You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the users to see what they get per request.
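As a rough sketch of that division: the function names are hypothetical, and the 400 tokens/s figure is a made-up placeholder rather than a measurement. Plug in whatever aggregate throughput your own batching benchmark reports.

```python
def per_user_tokens_per_second(aggregate_tps, concurrent_users):
    """Split measured batched throughput evenly across concurrent users."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return aggregate_tps / concurrent_users


def seconds_for_reply(reply_tokens, aggregate_tps, concurrent_users):
    """Time for one user to receive a reply of reply_tokens tokens."""
    return reply_tokens / per_user_tokens_per_second(aggregate_tps, concurrent_users)


if __name__ == "__main__":
    # Placeholder numbers: 400 tokens/s aggregate, 20 users, 300-token replies.
    print(per_user_tokens_per_second(400, 20))
    print(seconds_for_reply(300, 400, 20))
```

This assumes every user is generating at the same moment, which is the worst case; in practice only a fraction of users are usually active at once.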

    There are some other backends like MLC, TGI, and vLLM that are better suited to serving many users, but they have much worse quantization support.

    The “minimum” is one GPU that completely fits the size and quant of the model you are serving.
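A back-of-the-envelope way to check whether a model "fits": weights take roughly parameters times bits per weight divided by 8 bytes. The function names and the 20% headroom for KV cache and activations are assumptions for illustration, and real quantization formats use slightly more than their nominal bit width.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_on_gpus(params_billions, bits_per_weight, gpu_vram_gb, num_gpus,
                 overhead_fraction=0.2):
    """Check whether weights plus an assumed overhead fit in total VRAM."""
    # overhead_fraction is a guess covering KV cache and activations;
    # it grows with context length and the number of concurrent users.
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed <= gpu_vram_gb * num_gpus


if __name__ == "__main__":
    print(weight_memory_gb(70, 4))      # 70B at 4-bit
    print(weight_memory_gb(70, 16))     # 70B at 16-bit
    print(fits_on_gpus(70, 4, 48, 1))   # one 48 GB A6000
    print(fits_on_gpus(70, 16, 80, 4))  # four 80 GB A100s
```

More simultaneous users means more KV cache, so treat the overhead as something to measure rather than a constant.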

    People serve lots of users through Kobold Horde using only single and dual GPU configurations, so this isn’t something you’ll need tens of thousands of dollars for.