I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.
But when I run inference on CodeLlama 13B with oobabooga (web UI),
it only gets about 5 tokens/s.
That is really slow.
Is there any config or anything else I need to set for an A100???
Sounds like you might be using the standard Transformers loader. Try ExLlama or ExLlamaV2.
Sounds like you're running it on the CPU. If you're using oobabooga you have to explicitly set how many layers get offloaded to the GPU (the n-gpu-layers setting); by default everything runs on the CPU (at least for GGUF models).
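For example, with the plain llama-cpp-python bindings (which is roughly what the llama.cpp loader drives under the hood) the knob looks like this; the filename and settings are just placeholders, adjust to your setup:

```python
# Minimal sketch with llama-cpp-python -- n_gpu_layers is the setting that matters.
# 0 (the default) keeps everything on the CPU; -1 offloads every layer to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-13b.Q4_K_M.gguf",  # example filename, point this at your model
    n_gpu_layers=-1,                          # offload all layers to the A100
    n_ctx=4096,
)

out = llm("def fibonacci(n):", max_tokens=64)
print(out["choices"][0]["text"])
```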
Uhmmm, where did you buy that A100? Was it a good deal? lol. Just kidding, you probably set something up wrong or the drivers are acting up. Is the card working fine in other benchmarks?
Something is wrong with your environment. Even P40s give more than that.
The other option is that the generation isn't long enough to give a meaningful t/s reading; a short reply after a long prompt-processing phase will look slow even when the GPU is fine (e.g. 30 generated tokens over 6 seconds total reads as 5 t/s). What was the total inference time?
Have you tried:

```python
import torch
print(torch.cuda.is_available())
```
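If that prints True, it's worth a quick follow-up check that the torch build actually sees the A100 and matches your CUDA install (just a sanity check, not oobabooga-specific):

```python
import torch

print(torch.cuda.get_device_name(0))  # should report the A100 80GB PCIe
print(torch.version.cuda)             # CUDA version this torch build was compiled for
```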
Try using GGUF; that format likes a single GPU, especially since you have 80GB of VRAM. I think you could even run a ~70GB GGUF with all layers on the GPU.
That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you're running inference?
Tried a 13B model with KoboldCpp on one of the RunPod A100s; both Q4 and FP16 clocked in at around 20 T/s with 4K context, topping out at 60 T/s for shorter generations.