How to minimize model inference costs?

keklsh@alien.top · 1 year ago

FullOf_Bad_Ideas@alien.top · 1 year ago

That's assuming batch size 1. A 4090, for example, can serve multiple batches of a 7B model at once, at around 850 t/s (see https://github.com/casper-hansen/AutoAWQ). Now get a bigger GPU that has more VRAM and can host multiple Llama 70B batches, or split the layers across multiple GPUs. You can get a 10-20x t/s uplift by doing batched generation.
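
To put rough numbers on that, here is a minimal sketch of how throughput translates into cost per token. The 850 t/s figure and the 10x uplift come from the post above; the hourly GPU price is a made-up placeholder, and the batch-1 speed is simply back-calculated from the uplift rather than measured, so treat the output as illustrative only.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical rental price in dollars per hour; substitute your own.
GPU_PRICE_PER_HOUR = 0.50

# Aggregate batched throughput quoted in the post for a 7B model on a 4090.
BATCHED_TPS = 850.0

# Low end of the 10-20x uplift quoted in the post.
UPLIFT = 10.0

# Assumption: batch-1 throughput implied by dividing out the uplift.
single_tps = BATCHED_TPS / UPLIFT

single_cost = cost_per_million_tokens(GPU_PRICE_PER_HOUR, single_tps)
batched_cost = cost_per_million_tokens(GPU_PRICE_PER_HOUR, BATCHED_TPS)

print(f"batch 1: ${single_cost:.3f} per 1M tokens")
print(f"batched: ${batched_cost:.3f} per 1M tokens")
print(f"saving:  {single_cost / batched_cost:.1f}x")
```

The GPU costs the same per hour whether it is busy or not, so cost per token drops by the same factor that throughput goes up. That only holds if you actually have enough concurrent requests to keep the batch full.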