🐺🐦‍⬛ LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)

WolframRavenwolf@alien.top · 1 year ago

panchovix@alien.top · 1 year ago

The major reason I use exl2 is speed: on 2x4090 I get 15-20 t/s at 70b depending on the size, but GGUF I get like tops 4-5 t/s.

When using 3 GPUs (2x4090+1x3090), I get 11-12 t/s at 6.55bpw, vs. GGUF Q6_K, which runs at 2-3 t/s.

Though I agree with you, for model comparisons and such you need to have deterministic results and also the best quality.

If you get a chance sometime, try 70b at 6bpw or more. IMO it is pretty consistent and doesn’t have issues like 5bpw/bits does.

The performance hit is too much on multi-GPU systems when using GGUF. I guess if the speed gets to the same level in the future, I would use it most of the time.

a_beautiful_rhind@alien.top · 1 year ago

I’m surprised you get speeds so bad with GGUF. I get almost 9t/s on P40s and 18t/s on 3090.

GGUF is actually the fastest format until you load it up with context.

A couple of things have to be changed in CMakeLists.txt under vendor/llama.cpp if you’re using Python:

set(LLAMA_CUDA_MMV_Y        "2" CACHE STRING "llama: y block size for mmv CUDA kernels")
option(LLAMA_CUDA_FORCE_MMQ                  "llama: use mmq kernels instead of cuBLAS"         ON)

I have NVLink, so this helps me. Since you don’t, it may still help to use direct communication over PCIe:

set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "8192" CACHE STRING

and since you’re using all new cards:

option(LLAMA_CUDA_F16                        "llama: use 16 bit floats for some calculations"   OFF)

Try out the FP16 support.

easyllaama@alien.top · 1 year ago

‘The performance hit is too much on multigpu systems when using GGUF’

I agree. GGUF has a multi-GPU penalty, but it’s the friendliest format for Apple silicon. I have the same setup as you. One 4090 can run Xwin 13b at 40 t/s, but with 2 cards present it gets only 1/4 of that speed, 10 t/s. So to get it fast, I have to restrict CUDA to a single card even though 2 cards are present.

Since GGUF likes a single GPU, those who have a 3090/4090 will find 34B the sweet spot with the format.
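The comment above doesn’t say how the single-card restriction was done. One common way is the `CUDA_VISIBLE_DEVICES` environment variable, which the CUDA runtime reads when it initializes. A minimal sketch, assuming you launch the backend from Python; the helper name `pin_to_single_gpu` is made up for illustration:

```python
import os


def pin_to_single_gpu(index: int = 0) -> str:
    """Expose only one CUDA device to this process.

    Call this before importing any CUDA-using library, because the
    CUDA runtime reads the variable once at initialization.
    """
    value = str(index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    pin_to_single_gpu(0)  # only the first card stays visible
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting the variable in the shell before starting the program has the same effect, and avoids any import-order issues.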

candre23@alien.top · 1 year ago

GGUF I get like tops 4-5 t/s.

You’re doing something very wrong. I get better speeds than that on P40s with low context. Are you not using cuBLAS?

bullerwins@alien.top · 1 year ago

What motherboard do you have that can run 3x GPUs?

Model	Format	Quant	Offloaded Layers	VRAM Used	Primary Score	Secondary Score	Speed +mmq	Speed -mmq
lizpreciatior/lzlv_70B.gguf	GGUF	Q4_K_M	83/83	39362.61 MB	18/18	4+3+4+6 = 17/18
lizpreciatior/lzlv_70B.gguf	GGUF	Q5_K_M	70/83 !	40230.62 MB	18/18	4+3+4+6 = 17/18
TheBloke/lzlv_70B-GGUF	GGUF	Q2_K	83/83	27840.11 MB	18/18	4+3+4+6 = 17/18	4.20T/s	4.01T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q3_K_M	83/83	31541.11 MB	18/18	4+3+4+6 = 17/18	4.41T/s	3.96T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q4_0	83/83	36930.11 MB	18/18	4+3+4+6 = 17/18	4.61T/s	3.94T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q4_K_M	83/83	39362.61 MB	18/18	4+3+4+6 = 17/18	4.73T/s !!	4.11T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	70/83 !	40230.62 MB	18/18	4+3+4+6 = 17/18	1.51T/s	1.46T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	80/83	46117.50 MB	OutOfMemory
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	83/83	46322.61 MB	OutOfMemory
LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2	EXL2	2.4bpw		11,11 -> 22 GB	BROKEN
LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2	EXL2	2.6bpw		12,11 -> 23 GB	FAIL
LoneStriker/lzlv_70b_fp16_hf-3.0bpw-h6-exl2	EXL2	3.0bpw		14,13 -> 27 GB	18/18	4+2+2+6 = 14/18
LoneStriker/lzlv_70b_fp16_hf-4.0bpw-h6-exl2	EXL2	4.0bpw		18,17 -> 35 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-4.65bpw-h6-exl2	EXL2	4.65bpw		20,20 -> 40 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-5.0bpw-h6-exl2	EXL2	5.0bpw		22,21 -> 43 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-6.0bpw-h6-exl2	EXL2	6.0bpw		> 48 GB	TOO BIG
TheBloke/lzlv_70B-AWQ	AWQ	4-bit			OutOfMemory
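The EXL2 VRAM figures in the table follow the bits-per-weight value fairly closely. As a rough sanity check, here is a back-of-the-envelope sketch. It assumes about 69 billion parameters for a Llama-2-70B-based model (my assumption, not stated in the post) and counts the weights only, so context cache and runtime overhead come on top:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 69.0   # assumed parameter count in billions, not from the post
VRAM_GB = 48.0    # total VRAM limit implied by the "> 48 GB" row in the table

if __name__ == "__main__":
    for bpw in (2.4, 2.6, 3.0, 4.0, 4.65, 5.0, 6.0):
        size = weight_size_gb(PARAMS_B, bpw)
        note = "weights alone exceed VRAM" if size > VRAM_GB else "weights under VRAM limit"
        print(f"{bpw:>5} bpw -> {size:5.1f} GB ({note})")
```

Under these assumptions the 6.0bpw weights alone come out above 48 GB, which lines up with the "TOO BIG" row, while the smaller quants land a little below the VRAM figures reported in the table.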

My AI Workstation:

Observations:

Conclusion: