Perfect. Next please a chip that can do half the inference speed of an A100 with 15 Watts power.
I don’t think that will come from Nvidia. It’s going to take in-memory compute to get anywhere near that level of efficiency, and the first samples of those SoCs are nowhere near the memory capacity needed even for small models. Accelerators like that will likely come from Intel/Arm/RISC-V/AMD before Nvidia ships one.