Hey All,

I have a few doubts about how to calculate the tokens per second of an LLM.

  1. The way I calculate tokens per second for my fine-tuned models is to put a timer in my Python code and divide the number of generated tokens by the elapsed time. So if my output is 20 tokens and the model took 5 seconds, that is 4 tokens per second (a minimal sketch of this is below the questions). Am I using the correct method, or is there a better one?

  2. If my model runs at 4 tokens per second on 8 GB of VRAM, will it run at 8 tokens per second on 16 GB of VRAM?
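
For reference, here is a minimal sketch of the timer approach from question 1. It assumes a Hugging Face Transformers causal LM; the model name and prompt are placeholders, not anything from the thread.

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute your fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain tokens per second in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=20)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, not the prompt tokens.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.2f} tokens/s")

Using time.perf_counter() instead of time.time() gives a more precise interval measurement, and counting only the new tokens avoids inflating the rate with the prompt length.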

  • andrewlapp@alien.topB · 10 months ago
    1. It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, Transformers. You also need to measure the input (prompt) and output (generation) token rates separately. Additionally, longer contexts take more time than shorter ones.

    2. No, generation speed is mostly bound by memory bandwidth, not VRAM capacity (see the rough estimate after this list).

    3. Your fine-tuned model, assuming it is in the same format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.
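
To illustrate point 2, here is a rough back-of-envelope estimate: during single-stream generation each new token requires streaming roughly the whole model through the GPU's memory bus, so bandwidth divided by model size gives an upper bound on tokens per second. The numbers below are illustrative assumptions, not measurements.

# Back-of-envelope ceiling for decode speed on a memory-bandwidth-bound GPU.
# All numbers are illustrative assumptions, not measurements.
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    # Each generated token reads roughly the whole model from VRAM once,
    # so bandwidth / model size approximates the best-case token rate.
    return bandwidth_gb_per_s / model_size_gb

# Example: a ~4 GB quantized model on a card with ~300 GB/s of bandwidth.
print(max_tokens_per_second(model_size_gb=4.0, bandwidth_gb_per_s=300.0))  # ~75 tokens/s ceiling

A 16 GB card with the same memory bandwidth has the same ceiling, which is why doubling VRAM alone does not double tokens per second.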

  • MINIMAN10001@alien.topB · 10 months ago

    My understanding is that tokens per second typically splits into two parts: the prompt processing (prefill) time and the actual token generation time.

    At least, that's what I remember from oobabooga.
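
One way to approximate that split with plain Transformers is to time a 1-token generation (dominated by prompt processing) and subtract it from a full run. This is a rough sketch under that assumption, with a placeholder model and prompt; it is not how oobabooga itself computes its numbers.

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A reasonably long prompt. " * 50, return_tensors="pt")
prompt_tokens = inputs["input_ids"].shape[-1]

# 1-token run: almost all of this time is prompt processing (prefill).
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
prefill_time = time.perf_counter() - t0

# Full run: prompt processing plus generation of new tokens.
t0 = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64)
total_time = time.perf_counter() - t0

new_tokens = outputs.shape[-1] - prompt_tokens
decode_time = max(total_time - prefill_time, 1e-9)
print(f"prompt processing: {prompt_tokens / prefill_time:.1f} tokens/s")
print(f"generation: {new_tokens / decode_time:.1f} tokens/s")

Reporting the two rates separately makes comparisons fairer, since the prefill rate grows with prompt length while the generation rate is roughly constant per token.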