I saw deepinfra. Their price is $0.70-0.95/million tokens for Llama 2 70B.

How is that possible? Even the quantized 70B model is about 35 GB.

How do you minimize costs on the GPU, bandwidth, and software side?

From their article:

“I think other than the WhatsApp team, they are maybe first or second in the world to having the capability to build efficient infrastructure to serve hundreds of millions of people.”

https://venturebeat.com/data-infrastructure/deepinfra-emerges-from-stealth-with-8m-to-make-running-ai-inferences-more-affordable

But technology is not magic, so can someone shed some light on running cost-effective AI clusters? I was looking at vast.ai and similar services, but renting GPUs directly that way would be much more expensive.

  • Evening_Ad6637@alien.topB

    Hmm, would it really be more expensive? vast.ai can be extremely cheap.

    But I would also be interested in the topic as a whole. I think you would first have to work this out fairly precisely, e.g. how large is an average user request? Not many users will fill the entire context window with every request. If we had, or could estimate, an average value, we could derive how many tokens per second an economically run GPU delivers and extrapolate that to the price of 1 million tokens (rough sketch below).
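
    A minimal sketch of that extrapolation. Every number here is a made-up assumption chosen only to show the shape of the calculation, not anyone's actual figures: take an average request size, how many such requests one well-utilized GPU can sustain per hour, and what that GPU-hour really costs the operator, then convert to a price per million tokens.

    ```python
    # Back-of-the-envelope serving economics. All numbers are illustrative assumptions.
    avg_tokens_per_request = 600      # prompt + completion; far below the full context window
    requests_per_hour = 2_000         # what one well-utilized GPU (or GPU slice) might sustain
    gpu_cost_per_hour = 0.80          # USD: electricity + amortized hardware for an owned card

    tokens_per_hour = avg_tokens_per_request * requests_per_hour       # 1.2M tokens/hour
    cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"~${cost_per_million:.2f} per million tokens")              # ~$0.67 with these inputs
    ```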

    But there are other important factors as well:

    • Which country is the hardware located in? Electricity prices can also be extremely different from country to country.

    • How much did the operator have to pay for all his hardware? As a bulk buyer, you almost always get better prices, regardless of the sector.

    • Does he perhaps also operate his own photovoltaic systems and, if so, to what extent?

    • It is also important to remember that not every product has to generate a direct financial profit. If you have enough capital and can afford it for a certain period of time, you may deliberately run the business at a loss in order to squeeze out the competition. That way you can hope to gain reach and popularity with customers, who will then buy other products in the future. (See OpenAI and ChatGPT.)

    • keklsh@alien.topOPB

      Nah, a simple calculation for a 3090 ($0.22/hr, and not even enough VRAM to run 70B 4-bit!) generating at 20 t/s puts it at $13.8/million tokens.

      That’s extremely expensive compared to the API price.

      • FullOf_Bad_Ideas@alien.topB

        That’s assuming batch size 1. A 4090, for example, can serve many concurrent requests against a 7B model at once, at around 850 t/s total: https://github.com/casper-hansen/AutoAWQ. Now take a bigger GPU with more VRAM that can host Llama 70B with multiple requests per batch, or split the layers across multiple GPUs. You can get a 10-20x t/s uplift from batched generation.
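
        A toy sketch of that uplift (the per-stream speeds are assumptions, not benchmarks): the GPU-hour costs the same whether it decodes one stream or thirty, so total tokens/s, and with it the cost per token, improves by roughly the batch size.

        ```python
        # Batched decoding: each stream slows down a little, but total throughput
        # (and therefore $/token) scales with how many requests share the GPU.
        # Per-stream speeds are assumptions, not measurements.
        for batch_size, per_stream_tps in [(1, 20.0), (16, 15.0), (32, 12.0)]:
            total_tps = batch_size * per_stream_tps
            print(f"batch={batch_size:>2}  per-stream={per_stream_tps:.0f} t/s  total={total_tps:.0f} t/s")
        # batch 1: 20 t/s total; batch 16: 240 t/s (~12x); batch 32: 384 t/s (~19x)
        ```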

  • toidicodedao@alien.topB

    In production, most APIs use something like TGI or vLLM, which support batching: multiple requests are batched together and run through the model at the same time. This doesn't increase per-user inference speed, but it increases throughput. For example, if running Llama 70B normally gives 20 tokens/s for a single user, with batching each user gets 15-18 tokens/s, but you can serve 20-50 users at the same time. The total throughput is then 300-1000 tokens/s, which is what makes the low price possible.
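
    As a concrete illustration, here is roughly what serving a batch of requests through vLLM looks like. The checkpoint name and parallelism settings are placeholders, not what any particular provider actually runs:

    ```python
    # Continuous batching with vLLM: many prompts share the same forward passes,
    # so total throughput is far higher than serving them one at a time.
    from vllm import LLM, SamplingParams

    # Placeholder checkpoint; any HF-format quantized Llama 2 70B would do.
    llm = LLM(
        model="TheBloke/Llama-2-70B-AWQ",
        quantization="awq",
        tensor_parallel_size=2,   # split the weights across 2 GPUs
    )

    sampling = SamplingParams(temperature=0.7, max_tokens=256)

    # 50 "users" submitted at once; vLLM schedules them as batches internally.
    prompts = [f"Answer request number {i}:" for i in range(50)]
    outputs = llm.generate(prompts, sampling)

    for out in outputs:
        print(out.outputs[0].text[:60])
    ```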