I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.
But when I run inference on CodeLlama 13B with oobabooga (web UI),
it only gets about 5 tokens/s.
That is really slow.
Is there any config or anything else I need to set for an A100???
Sounds like you might be using the standard Transformers loader. Try ExLlama or ExLlamaV2.
Sounds like you're running it on the CPU. If you're using oobabooga you have to explicitly set how many layers get offloaded to the GPU (the n-gpu-layers setting); by default everything runs on the CPU (at least for GGUF models).
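For example, with the plain llama-cpp-python bindings (which is roughly what the llama.cpp loader drives under the hood) the knob looks like this; the filename and settings are just placeholders, adjust to your setup:

```python
# Minimal sketch with llama-cpp-python -- n_gpu_layers is the setting that matters.
# 0 (the default) keeps everything on the CPU; -1 offloads every layer to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-13b.Q4_K_M.gguf",  # example filename, point this at your model
    n_gpu_layers=-1,                          # offload all layers to the A100
    n_ctx=4096,
)

out = llm("def fibonacci(n):", max_tokens=64)
print(out["choices"][0]["text"])
```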
Uhmmm, where did you buy that A100? Was it a good deal? lol. Just kidding, you probably set something up wrong or the drivers are acting up. Is the card working fine in other benchmarks?
Something is wrong with your environment. Even P40s give more than that.
The other option is that the generation isn't long enough to give a meaningful t/s reading; a short reply after a long prompt-processing phase will look slow even when the GPU is fine (e.g. 30 generated tokens over 6 seconds total reads as 5 t/s). What was the total inference time?
Have you tried:

```python
import torch
print(torch.cuda.is_available())
```
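If that prints True, it's worth a quick follow-up check that the torch build actually sees the A100 and matches your CUDA install (just a sanity check, not oobabooga-specific):

```python
import torch

print(torch.cuda.get_device_name(0))  # should report the A100 80GB PCIe
print(torch.version.cuda)             # CUDA version this torch build was compiled for
```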
Try using GGUF; that format likes a single GPU, especially since you have 80GB of VRAM. I think you could even run a ~70GB GGUF with all layers on the GPU.
That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you're running inference?
Tried a 13B model with KoboldCpp on one of the RunPod A100s; both Q4 and FP16 clocked in at around 20 T/s with 4K context, topping out at 60 T/s for shorter generations.