I’m running TheBloke/Llama-2-13B-chat-GGUF on my M3 Max (14-core CPU / 30-core GPU, 36GB RAM) via Text generation web UI.
I get max 20 tokens/second. I’ve tried various parameter presets and they all seem to get me around the same 20 toks/sec.
I’ve tried increasing threads to 14 and n-gpu-layers to 128. I don’t really understand most of the parameters in the Model and Parameters tabs; I’ve just been cranking different options to see if I can increase toks/sec, but so far nothing gets above 20.
What can I change to crank the performance here? I have yet to hear the fan spin up for this 13B model. I’m trying to push the machine to its limit to see the max toks/sec I can achieve. Any settings I should try?
I’ll be interested to see what responses you get, but I’m gonna come out and say that the Mac’s power is NOT its speed. Pound for pound, a CUDA video card is going to absolutely leave our machines in the dust.
So, with that said- I actually think your 20 tokens a second is kind of great. I mean- my M2 Ultra is two M2 Max dies fused together, and I get the following for Mythomax-l2-13b:
- Llama.cpp directly:
- Prompt eval: 17.79ms per token, 56.22 tokens per second
- Eval: 28.27ms per token, 35.38 tokens per second
- 565 tokens in 15.86 seconds: 35.6 tokens per second
- Llama cpp python in Oobabooga:
- Prompt eval: 44.27ms per token, 22.59 tokens per second
- Eval: 27.92 ms per token, 35.82 tokens per second
- 150 tokens in 5.18 seconds: 28.95 tokens per second
So you’re actually doing better than I’d expect an M2 Max to do.
You don’t say what quant you are using, if any. But with Q4_K_M I get this on my M1 Max using pure llama.cpp:
llama_print_timings: prompt eval time = 246.97 ms / 10 tokens ( 24.70 ms per token, 40.49 tokens per second)
llama_print_timings: eval time = 28466.45 ms / 683 runs ( 41.68 ms per token, 23.99 tokens per second)
Your M3 Max has lower memory bandwidth than my M1 Max: the 30-GPU-core version is 300GB/s versus 400GB/s.
What quantization are you using? Smaller tends to be faster.
I get 30 tokens/s with a q4_0 quantization of 13B models on a M1 Max on Ollama (which uses llama.cpp). You should be in the same ballpark with the same software. You aren’t going to do much/any better than that. The M3’s GPU made some significant leaps for graphics, and little to nothing for LLMs.
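Since generation is memory-bandwidth bound (every token streams the full weight set through memory once), you can sketch a rough ceiling for tokens/sec as bandwidth divided by model size. A back-of-the-envelope version, where the GGUF file sizes are my assumptions (check the size of your actual download):

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper bound for a bandwidth-bound model.

    Real throughput lands below this because of compute overhead,
    KV-cache reads, and imperfect bandwidth utilization.
    """
    return bandwidth_gb_s / model_size_gb

# Approximate 13B GGUF file sizes (assumed values):
q4_0_gb = 7.4    # q4_0
q4_km_gb = 7.9   # Q4_K_M

print(max_tokens_per_second(300, q4_0_gb))   # M3 Max (300 GB/s), ≈ 40.5
print(max_tokens_per_second(400, q4_0_gb))   # M1/M2 Max (400 GB/s), ≈ 54.1
```

By this estimate, 20 toks/sec on a 300GB/s M3 Max is already around half the theoretical ceiling, which is roughly where real-world llama.cpp numbers tend to land.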
Allowing more threads isn’t going to help generation speed, though it might improve prompt processing. Best to keep the thread count at the number of performance cores.
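For reference, a minimal llama.cpp command line along those lines (the model filename is a placeholder; a 14-core M3 Max has 10 performance cores, and `-t` / `-ngl` are llama.cpp’s thread-count and GPU-offload flags):

```shell
# On macOS, query the performance-core count:
sysctl -n hw.perflevel0.physicalcpu    # 10 on a 14-core M3 Max

# Threads pinned to the P-core count, all layers offloaded to Metal:
./main -m llama-2-13b-chat.Q4_K_M.gguf -t 10 -ngl 99 -n 256 -p "Hello"
```

Setting `-ngl` higher than the model’s actual layer count is harmless; llama.cpp just offloads everything it can.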