I’m using an A100 PCIe 80 GB with the CUDA 11.8 toolkit and driver 525.x.
But when I run inference on CodeLlama 13B with oobabooga (web UI),
it only gets about 5 tokens/s.
That’s very slow.
Is there any config or something else I should set for the A100?
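
For reference, one quick way to tell whether the bottleneck is the web UI or the setup itself is to time the same model with plain `transformers` in fp16; a 13B model on an A100 should land well above 5 tokens/s. A minimal sketch (the checkpoint name, prompt, and generation length are just placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-13b-hf"  # assumed HF checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16  # fp16: ~26 GB for 13B, fits in 80 GB
).to("cuda")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.time()
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```

If this runs much faster than 5 T/s, the slowdown is likely in the web UI’s loader settings (e.g. the model being loaded in 8-bit or on CPU) rather than the hardware.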
Tried a 13B model with Koboldcpp on one of the RunPod A100s; both Q4 and FP16 clocked in around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.