I’m working on a project to generate text from a 1.2B-parameter, full-precision (FP32) LLM (about 5 GB of weights).
Unfortunately, I’m limited in the infrastructure I can use to deploy this model. Batch inference is not supported. What I can do is deploy copies of the model on a single A100, one copy per process, with up to 9 processes supported (these are called “replicas”). I understand that this makes little sense given that my model is memory-bound and each process will compete for memory bandwidth to read its own copy of the same weights, but I can’t change that for now.
My average input and output lengths are roughly 1000 tokens each. I estimate the KV cache at roughly 400 kB per token in full precision.
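For anyone who wants to sanity-check the 400 kB figure, here is a minimal sketch of the usual per-token KV cache formula. The architecture numbers (24 layers, 16 KV heads, head dimension 128) are hypothetical and not taken from my model config; they are just one plausible shape for a ~1.2B model that happens to land close to my estimate.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache stored for each token in the context."""
    # Factor 2: every layer stores one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


# Hypothetical architecture (an assumption, not the real model config).
per_token = kv_cache_bytes_per_token(
    n_layers=24, n_kv_heads=16, head_dim=128, bytes_per_element=4  # 4 bytes = FP32
)
print(f"{per_token / 1e3:.0f} kB per token")  # prints "393 kB per token"
```

Models that use grouped-query or multi-query attention have fewer KV heads than attention heads, so the real number depends on the actual config.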
I have latency benchmarks for the model at various replica counts, as described above, and I wanted to compare them to the theoretical performance of the A100. For my use case, time to first token is negligible (<200 ms), and generation is memory-bound.
I find that with 5 or more replicas, the math works out and my model is roughly as fast as I expect. For example, with 1000 output tokens and 6 replicas, it’s as if I’m generating with a batch of 6 requests from a 30 GB model plus 5 GB of KV cache. At a memory bandwidth of around 1-1.3 TB/s, that is roughly 30 ms per token, or ~30 s per request, which is not far from what I see. The same goes for the other replica counts: 5, 7, 8 and 9.
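To make the arithmetic explicit, here is a rough sketch of the estimate I’m using. It assumes generation is purely bandwidth-bound, that every replica re-reads its own copy of the weights for each token, and that the KV cache is read at its full 2000-token size on every step (a slight overestimate, since the cache grows during generation). The function name and defaults are just for illustration.

```python
def estimate_generation_seconds(
    n_replicas,
    bandwidth_bytes_per_s,
    weight_bytes=5e9,           # ~5 GB of FP32 weights per replica
    kv_bytes_per_token=400e3,   # ~400 kB of KV cache per token
    input_tokens=1000,
    output_tokens=1000,
):
    """Bandwidth-only estimate of the time to generate one request per replica."""
    context_tokens = input_tokens + output_tokens
    # Bytes that must be read from GPU memory for one decoding step across all replicas.
    bytes_per_step = n_replicas * (weight_bytes + kv_bytes_per_token * context_tokens)
    return output_tokens * bytes_per_step / bandwidth_bytes_per_s


for n in (1, 6):
    fast = estimate_generation_seconds(n, 1.3e12)  # 1.3 TB/s
    slow = estimate_generation_seconds(n, 1.0e12)  # 1.0 TB/s
    print(f"{n} replica(s): {fast:.1f}-{slow:.1f} s")
```

With these inputs, 6 replicas come out at about 27-35 s and 1 replica at about 4.5-5.8 s.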
However, when I run with a single replica, I expect generation to hover around the 5-6 s mark on average. Instead, I see more than 20 s. I need to add 4 more replicas before the numbers start to make sense. It almost seems like the model takes up too little memory to be given the entire memory bandwidth.
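Another way to look at the single-replica result is to turn the observed latency into an implied bandwidth. This is only a diagnostic sketch using the numbers above (5 GB of weights, 400 kB of KV cache per token, 2000 tokens of context, 20 s observed); the function name is made up for illustration.

```python
def implied_bandwidth_gb_per_s(observed_seconds, output_tokens, bytes_per_step):
    """Effective memory bandwidth implied by a measured generation time."""
    seconds_per_token = observed_seconds / output_tokens
    return bytes_per_step / seconds_per_token / 1e9


# One replica: weights plus the KV cache at full context length.
single_replica_bytes = 5e9 + 400e3 * 2000
print(implied_bandwidth_gb_per_s(20.0, 1000, single_replica_bytes))  # roughly 290 GB/s
```

So a single replica behaves as if each token costs about 20 ms regardless of how few bytes it reads, which is well below the 1-1.3 TB/s I get with 5 or more replicas.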
Does anyone know where this extra latency could be coming from? Does a model have to occupy a certain amount of memory before it can reach the A100’s full available memory bandwidth?
It is normal and expected that a GPU won’t run at 100% unless you give it enough parallel computation to perform (either extremely large layers or a big batch of samples). There is so much overhead in so many places that it’s practically impossible to be efficient at a small scale, and there’s no single thing we can point to and say “that’s why it’s slow”. My best advice is to look into the optimization options provided by the model, framework, and version you’re using. I use PyTorch, and rely on things like torch.jit.script, torch.jit.trace, torch.compile, torch.autocast, etc.