I saw DeepInfra. Their price is $0.7–0.95 per million tokens for Llama 2 70B.

How is that possible? Even the quantized 70B model is 35 GB.

How do you minimize the cost of GPUs, bandwidth, and the software side?

From the article about them:

“I think other than the WhatsApp team, they are maybe first or second in the world to having the capability to build efficient infrastructure to serve hundreds of millions of people.”

https://venturebeat.com/data-infrastructure/deepinfra-emerges-from-stealth-with-8m-to-make-running-ai-inferences-more-affordable

But technology is not magic, so can someone shed some light on running cost-effective AI clusters? I was looking at vast.ai and similar services, but renting GPUs directly that way would be much more expensive.

  • keklsh@alien.topOPB · 1 year ago

    Nah, a simple calculation: a 3090 ($0.22/hr, and not even enough VRAM to run 70B 4-bit on one card!) generating at 20 t/s puts it at $13.8/million tokens.

    That’s extremely expensive compared to the API price.
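    The arithmetic behind this kind of estimate can be made explicit. A minimal sketch (the hourly price, throughput, and GPU count are illustrative assumptions, not any provider's actual numbers):

    ```python
    def cost_per_million_tokens(price_per_hour, tokens_per_sec, num_gpus=1):
        """Dollars per million generated tokens for rented GPUs.

        price_per_hour: rental price of one GPU in $/hr (assumed input)
        tokens_per_sec: sustained generation throughput of the whole setup
        num_gpus: GPUs rented (70B 4-bit is ~35 GB of weights, so it
                  needs more than one 24 GB card)
        """
        tokens_per_hour = tokens_per_sec * 3600
        return num_gpus * price_per_hour / tokens_per_hour * 1_000_000

    # Round-number example: a $3.60/hr GPU sustaining 1000 t/s
    # works out to $1.00 per million tokens.
    print(cost_per_million_tokens(3.60, 1000))  # → 1.0
    ```

    The key lever is in the denominator: the hourly price is fixed, so anything that raises sustained tokens/second (batching, faster kernels) divides the per-token cost directly.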

    • FullOf_Bad_Ideas@alien.topB · 1 year ago

      That’s assuming batch size 1. A 4090, for example, can serve many concurrent requests to a 7B model in a single batch, reaching around 850 t/s aggregate: https://github.com/casper-hansen/AutoAWQ. Now take a bigger GPU with more VRAM that can batch multiple Llama 70B requests, or split the layers across several GPUs. Batched generation can give you a 10–20x uplift in t/s.
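      To illustrate how batching changes the economics, here is a toy calculation (the hourly price, batch-1 throughput, and scaling factor are all assumptions; real batch scaling becomes sublinear once the GPU is compute-bound):

      ```python
      def cost_per_million_tokens(price_per_hour, aggregate_tokens_per_sec):
          # Hourly rental divided by total tokens produced per hour,
          # scaled to a million tokens.
          return price_per_hour / (aggregate_tokens_per_sec * 3600) * 1_000_000

      price_per_hour = 2.0   # assumed $/hr for a large-VRAM GPU
      batch1_tps = 20        # assumed t/s serving a single request
      batch16_tps = 20 * 16  # 16 concurrent requests, assumed near-linear scaling

      # Same GPU, same hourly bill; aggregate throughput is 16x higher,
      # so cost per token drops by the same factor of 16.
      print(cost_per_million_tokens(price_per_hour, batch1_tps))
      print(cost_per_million_tokens(price_per_hour, batch16_tps))
      ```

      This is why API providers can undercut batch-1 rentals: the GPU costs the same per hour whether it serves one stream or sixteen, so filling the batch divides the per-token price.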