Question about GGUF, gpu offload and performance

Jokaiser2000@alien.top · 2 years ago

Question about GGUF, gpu offload and performance

longtimegoneMTGO@alien.top · 2 years ago

I have a 3080 12GB, and can run a 20B-Q4_K_M with about 50 layers offloaded and 8k context.

It starts off at just under 4 t/s, and once the context is filled it gets as slow as just over 2 t/s.

It might be worth setting up a Linux partition to boot into for this; I was getting much slower speeds under Windows.

Jokaiser2000@alien.top · 2 years ago

That might be worth a try actually, I’ll look into it, thanks.

Desm0nt@alien.top · 2 years ago

By loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests) I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Sounds suspicious. I use Yi-Chat-34b-Q4_K_M on an old 1080 Ti (11 GB VRAM) with 20 layers offloaded and get around 2.5 t/s. But that is on a Threadripper 2920 with 4-channel RAM (also 3200). However, I don’t think it would make that much difference. Of course, with 4 channels I have twice your RAM bandwidth, but I run a 34b and load only 20 layers on the GPU…
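
For a rough sense of the bandwidth point: theoretical peak DRAM bandwidth is channels x 8 bytes x transfer rate, and the CPU-side part of token generation is limited by how fast the non-offloaded weights can be streamed from RAM. A minimal sketch, assuming DDR4-3200 on both machines and dual-channel on the other one (implied by the "x2" remark); the 10 GB of CPU-resident weights is a hypothetical figure, not taken from the thread:

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return channels * bus_bytes * mt_per_s * 1e6 / 1e9


def max_tokens_per_s(bandwidth_gbs: float, cpu_weights_gb: float) -> float:
    """Upper bound on t/s if each token reads all CPU-resident weights once."""
    return bandwidth_gbs / cpu_weights_gb


dual = peak_bandwidth_gbs(2, 3200)   # dual-channel DDR4-3200
quad = peak_bandwidth_gbs(4, 3200)   # quad-channel DDR4-3200

# Hypothetical: 10 GB of weights left in system RAM after offloading.
cpu_weights_gb = 10.0
print(f"dual: {dual:.1f} GB/s, bound {max_tokens_per_s(dual, cpu_weights_gb):.2f} t/s")
print(f"quad: {quad:.1f} GB/s, bound {max_tokens_per_s(quad, cpu_weights_gb):.2f} t/s")
```

These are ceilings only; measured speeds are lower, and the GPU-resident layers add their own time per token.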

marblemunkey@alien.top · 2 years ago

It’s been a couple of months since I used less-than-complete GPU offloading. When I was using my Alienware laptop (8th-gen i7, 2060 6GB) to run 13B models with 13/25 layers offloaded, I was getting 1-2 t/s, so yours sounds low.

vikarti_anatra@alien.top · 2 years ago

Some of my results:

System:

2x Xeon E5-2680 v4, 28 cores total, 56 threads with HT, 128 GB RAM

RTX 2060 6 GB via PCIe 3.0 x16

RTX 4060 Ti 16 GB via PCIe 4.0 x8

Windows 11 Pro

OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):

Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx 3 t/s

Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx 25 t/s

Q5_K_M, CPU-only, 8 threads, 32k context - approx 2.5-3.5 t/s

Q5_K_M, CPU-only, 16 threads, 32k context - approx 3-3.5 t/s

Q5_K_M, CPU-only, 32 threads, 32k context - approx 3-3.6 t/s

euryale-1.3-l2-70b (llama.cpp in text generation UI):

Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s

goliath-120 (llama.cpp in text generation UI):

Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s

Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s

Noromaid-20b-v0.1.1 (llama.cpp in text generation UI):

Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx 5 t/s

Noromaid-20b-v0.1.1 (exllamav2 in text generation UI):

3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx 15 t/s (looks like it fits entirely on the 4060 Ti)

6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, gpu split 12, 6 - approx 10 t/s

Observations:

- The number of cores in CPU-only mode matters very little.

- “numa” does matter (I have 2 CPU sockets).

I would say: try to get an additional card?
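
The thread-count observation can be checked against the CPU-only figures reported above. A minimal sketch that takes the midpoint of each reported range and compares it with the 8-thread run; using the midpoint is an assumption, since the post gives only ranges:

```python
# (threads, low t/s, high t/s) copied from the CPU-only results above.
mistral_7b_q5 = [(8, 2.5, 3.5), (16, 3.0, 3.5), (32, 3.0, 3.6)]
goliath_120_q2 = [(8, 0.25, 0.3), (32, 0.4, 0.5)]


def scaling(rows):
    """Return {threads: midpoint speedup relative to the first row}."""
    base = (rows[0][1] + rows[0][2]) / 2
    return {t: round(((lo + hi) / 2) / base, 2) for t, lo, hi in rows}


print("7B Q5_K_M:", scaling(mistral_7b_q5))
print("120B Q2_K:", scaling(goliath_120_q2))
```

On these numbers the 7B model gains roughly 10% going from 8 to 32 threads, while the 120B Q2_K run gains about 1.6x, so the effect of thread count appears to depend on the model.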

multiverse_fan@alien.top · 2 years ago

I have an older 6GB 1660 and get like 0.3 t/s on a q2 quant of Goliath 120B. I guess I’m just thinking that comparatively your setup with a 20B model should be faster than that but I’m sure I’m missing something. I guess with offloading, the CPU plays a role as well. How many cores ya got?

-Ellary-@alien.top · 2 years ago

R5 5500 (stock 3600 MHz) | 3060 12GB | 32GB 3600, Win10 v2004.
I’m using LM Studio for heavy GGUF models: 34b (q4_k_m) and 70b (q3_k_m).
On 70b I’m getting around 1-1.4 t/s depending on context size (4k max).
I’m offloading 25 layers to the GPU (trying not to exceed the 11GB mark of VRAM).
On 34b I’m getting around 2-2.5 t/s depending on context size (4k max).
I’m offloading 30 layers to the GPU (trying not to exceed the 11GB mark of VRAM).
On 20b I was getting around 4-5 t/s; I’m not a big user of 20b right now.
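
Picking a layer count that stays under a VRAM mark, as described above, can be approximated by spreading the model file evenly across its layers. A rough sketch; the 20 GB file size, 60 layers and 1.5 GB reserve are hypothetical placeholders rather than figures from this post, and real layers are not exactly equal in size:

```python
import math


def layers_that_fit(vram_budget_gb: float, model_file_gb: float,
                    n_layers: int, reserved_gb: float = 1.5) -> int:
    """Estimate how many layers can be offloaded, assuming equal-sized layers.

    reserved_gb is a guess covering KV cache, scratch buffers and the desktop.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_budget_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: 11 GB budget, 20 GB model file, 60 layers.
print(layers_that_fit(11, 20, 60))
```

The result is only a starting point; watching actual VRAM use and adjusting the layer count up or down is still needed.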

So I can recommend LM Studio for models heavier than 13b; it works better for me.
Here is the generation speed for 34b Yi Chat:

https://preview.redd.it/h4d0lbm5u63c1.png?width=903&format=png&auto=webp&s=fdc161b136879d1c1de6ef065cb80f35f188e46f