NVidia H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM

rihard7854@alien.top · 10 months ago


ZenEngineer@alien.top · 10 months ago

There was a paper where you’d run a faster model to come up with a sentence and then basically run a batch on the big model, with each prompt being the same sentence cut to a different length, ending in a different word predicted by the small model, to see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.
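What this describes sounds like what is usually called speculative decoding. Here is a rough sketch of one step, assuming greedy decoding and toy stand-in "models" that just read off fixed sentences. The names `speculative_step`, `make_model`, `draft_next` and `target_next` are made up for illustration and are not from any paper or library. A real setup would run step 3 as a single batched forward pass on the big model and would typically accept or reject drafted tokens probabilistically rather than by exact match.

```python
from typing import Callable, List

Token = str
# Hypothetical interface: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[Token]], Token]


def make_model(sentence: List[Token]) -> NextToken:
    """Toy stand-in for a language model: it just continues a fixed sentence."""
    def next_token(tokens: List[Token]) -> Token:
        i = len(tokens)
        return sentence[i] if i < len(sentence) else "<eos>"
    return next_token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[Token]:
    """Return the tokens gained from one round of draft-then-verify."""
    # 1. The small model drafts k tokens, one at a time (cheap).
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. Build the batch: the same sentence cut at k + 1 different lengths.
    batch = [prefix + draft[:i] for i in range(k + 1)]

    # 3. The big model predicts the next token for every prompt in the batch.
    #    (Looped here; a real system would do this in one batched pass.)
    target_preds = [target_next(prompt) for prompt in batch]

    # 4. Keep drafted tokens until the first disagreement, then take the
    #    big model's token at that position and stop.
    accepted: List[Token] = []
    for i in range(k):
        if draft[i] == target_preds[i]:
            accepted.append(draft[i])
        else:
            accepted.append(target_preds[i])
            return accepted

    # Every drafted token matched, so the last prediction is a free extra token.
    accepted.append(target_preds[k])
    return accepted


# The two toy models disagree on exactly one word ("cat" vs "fox").
draft_model = make_model("the quick brown cat jumps over the lazy dog".split())
target_model = make_model("the quick brown fox jumps over the lazy dog".split())

first = speculative_step([], draft_model, target_model, k=4)
second = speculative_step(first, draft_model, target_model, k=4)
print(first)
print(second)
```

In this toy run the first round catches the small model's wrong word and still yields four tokens for one pass of the big model, and the second round yields five because every drafted token matches. The less the two models agree, the fewer tokens each round keeps, which is why the speedup depends on them being aligned.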

Other than that I could imagine other things, like having batches with one sentence being generated for each actor, one for descriptions, one for actions, etc. Or simply multiple options for you to choose.