

  • In production, most APIs run on a serving stack such as TGI or vLLM that supports batching: multiple requests are grouped and run through the model at the same time. This doesn't make any single request faster, but it increases throughput. For example, if a 70B Llama normally generates 20 tokens/s for a single user, with batching each user gets 15-18 tokens/s, but you can serve 20-50 users at the same time. Total throughput is then roughly 300-1000 tokens/s, which is what makes the low price possible.
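
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. It only multiplies the per-user speed by the number of concurrent users, using the figures quoted above. The function name `aggregate_throughput` is made up for illustration and is not part of TGI or vLLM, and real servers won't scale this cleanly.

```python
def aggregate_throughput(per_user_tokens_per_s: float, concurrent_users: int) -> float:
    """Total tokens/s across all users served in the same batch (idealized)."""
    return per_user_tokens_per_s * concurrent_users


# Figures taken from the comment above; they are rough estimates, not benchmarks.
single_user = 20.0                      # tokens/s for one user, no batching
low = aggregate_throughput(15.0, 20)    # slowest per-user speed, fewest users
high = aggregate_throughput(18.0, 50)   # fastest per-user speed, most users

print(f"aggregate: {low:.0f}-{high:.0f} tokens/s")
print(f"gain over one unbatched user: {low / single_user:.0f}x-{high / single_user:.0f}x")
```

The two products come out to 300 and 900 tokens/s, in line with the rough 300-1000 range: each user gives up a little speed, but the same hardware produces 15x to 45x more tokens per second overall.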