Hey All,
I have a few questions about how to calculate the tokens per second of an LLM.
-
The way I calculate tokens per second for my fine-tuned models is to put a timer in my Python code and time the generation. So if my output is 20 tokens long and the model took 5 seconds, that's 4 tokens per second. Am I using the correct method, or is there a better one?
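For reference, here is a minimal sketch of that timer-based approach using Transformers. The model name, prompt, and generation settings are placeholders, not from the original post; the only refinement over the description above is counting just the newly generated tokens rather than the whole output sequence.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain tokens per second in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, not the prompt tokens.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.2f} tokens/s")
```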
-
If my model gets 4 tokens per second on 8 GB of VRAM, will it get 8 tokens per second on 16 GB of VRAM?
It depends on your inference engine. Throughput will probably be much higher in TGI or vLLM than in what you're presumably using, Transformers. You also need to measure the input (prompt) and output (generation) token rates separately; see the sketch below for one way to split the two. Additionally, longer contexts take more time than shorter ones.
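One rough way to separate the prompt (prefill) rate from the generation (decode) rate, assuming the model and tokenizer from the earlier sketch are already loaded: time a run that generates a single token to approximate prefill, then subtract that from a full run. This is an approximation, not a profiler.

```python
import time

def measure(prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    n_input = inputs["input_ids"].shape[-1]

    # Prefill: time to produce the first token (dominated by prompt processing).
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)
    prefill = time.perf_counter() - start

    # Full run: prefill plus decoding of all new tokens.
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    total = time.perf_counter() - start

    n_new = out.shape[-1] - n_input
    decode = total - prefill
    print(f"prompt tokens: {n_input}, prefill: {prefill:.2f}s "
          f"({n_input / prefill:.1f} tok/s input)")
    print(f"new tokens: {n_new}, decode: {decode:.2f}s "
          f"({n_new / decode:.1f} tok/s output)")

measure("Explain tokens per second in one sentence.")
```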
No. Extra VRAM capacity doesn't make generation faster; decoding speed is mostly bound by memory bandwidth.
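As a rough back-of-envelope (the numbers below are illustrative assumptions, not measurements): each decoded token requires roughly one full pass over the model weights, so the upper bound is about memory bandwidth divided by model size.

```python
# Back-of-envelope decode speed estimate (illustrative numbers only).
model_size_gb = 14        # e.g. a 7B model in fp16 (~2 bytes per parameter)
bandwidth_gb_s = 300      # assumed GPU memory bandwidth
max_tokens_per_s = bandwidth_gb_s / model_size_gb
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound")  # ~21 tokens/s
```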
Your fine-tuned model, assuming it is in the same format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.