I’m using an A100 PCIe 80 GB, CUDA 11.8 toolkit, driver 525.x.
But when I run inference on CodeLlama 13B with oobabooga (web UI),
it only gets about 5 tokens/s.
It is so slow.
Is there any config or anything else I need for the A100?
Something is wrong with your environment; even P40s give more than that.
The other possibility is that you aren’t generating enough tokens to get a reliable t/s measurement. What was the total inference time?
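To see why a short run can be misleading: throughput is just generated tokens divided by wall-clock generation time, and on a short generation the fixed startup cost (prompt processing, model warm-up) dominates. A minimal sketch, with made-up example numbers:

```python
# Hypothetical numbers for illustration -- substitute your own run's figures.
generated_tokens = 512   # new tokens produced (not counting the prompt)
total_time_s = 10.4      # wall-clock time for the generation, in seconds

# Tokens per second = generated tokens / elapsed time.
tps = generated_tokens / total_time_s
print(f"{tps:.1f} tokens/s")
```

If the web UI reports 5 t/s on a generation of only a dozen tokens, much of that "time" may be prompt ingestion rather than token generation, so report the token count and total time together.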