🐺🐦‍⬛ LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)

WolframRavenwolf@alien.top · 1 year ago

lone_striker@alien.top · 1 year ago

For the 2.4bpw and 2.6bpw exl2 models, you have to change a setting in ooba to get them to generate coherent text. Disable this setting:

Add the bos_token to the beginning of prompts

https://preview.redd.it/4v8m7ciu0y1c1.png?width=356&format=png&auto=webp&s=785837b8466a3bcda3e49477424b7c377a8d542f

The very low bpw models need the above setting disabled and are also stricter about the prompt format. The higher bpw models are more flexible and can deal with prompt formats they were not specifically tuned for.

I would also set the VRAM for 2.4bpw to use only a single GPU. Spreading the model out over two GPUs is not needed and will slow it down. The main reason I generate 2.4bpw (and 2.6bpw) versions is to allow people with only a single 3090 or 4090 to run 70B models at full speed, though obviously quality will be lower than with the higher-bit models. For 2.6bpw to fit on a single 24 GB VRAM GPU, you will need to enable the cache_8bit option.
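As a rough back-of-the-envelope check of why 2.4bpw fits on one 24 GB card while 2.6bpw needs the 8-bit cache, the sketch below estimates weight and KV-cache size. The numbers are assumptions, not measurements: a round 70e9 parameters, Llama-2-70B-style dimensions (80 layers, 8 KV heads, head dimension 128), and a 4096-token context. It ignores activations and framework overhead, so real usage will be somewhat higher. The function names are made up for illustration.

```python
GIB = 2**30

def weight_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (bits per weight / 8 bytes each)."""
    return n_params * bpw / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int) -> float:
    """Approximate KV cache size in GiB: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB

# Assumed values (not taken from the thread): round parameter count,
# Llama-2-70B-style attention layout, 4096-token context.
N_PARAMS = 70e9
N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT = 80, 8, 128, 4096

for bpw in (2.4, 2.6):
    weights = weight_gib(N_PARAMS, bpw)
    for cache_name, elem_bytes in (("fp16 cache", 2), ("8-bit cache", 1)):
        cache = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, elem_bytes)
        print(f"{bpw}bpw + {cache_name}: "
              f"weights {weights:.2f} GiB + cache {cache:.2f} GiB "
              f"= {weights + cache:.2f} GiB")
```

Under these assumptions the 8-bit cache halves the cache footprint, which is a small saving in absolute terms but can be the difference when the weights alone already take most of a 24 GB card.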

WolframRavenwolf@alien.top · 1 year ago

Does the 8-bit cache reduce quality or speed, or what's the disadvantage of it? (If it had none, it would be the default, I assume.)

Model	Format	Quant	Offloaded Layers	VRAM Used	Primary Score	Secondary Score	Speed +mmq	Speed -mmq
lizpreciatior/lzlv_70B.gguf	GGUF	Q4_K_M	83/83	39362.61 MB	18/18	4+3+4+6 = 17/18
lizpreciatior/lzlv_70B.gguf	GGUF	Q5_K_M	70/83 !	40230.62 MB	18/18	4+3+4+6 = 17/18
TheBloke/lzlv_70B-GGUF	GGUF	Q2_K	83/83	27840.11 MB	18/18	4+3+4+6 = 17/18	4.20T/s	4.01T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q3_K_M	83/83	31541.11 MB	18/18	4+3+4+6 = 17/18	4.41T/s	3.96T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q4_0	83/83	36930.11 MB	18/18	4+3+4+6 = 17/18	4.61T/s	3.94T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q4_K_M	83/83	39362.61 MB	18/18	4+3+4+6 = 17/18	4.73T/s !!	4.11T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	70/83 !	40230.62 MB	18/18	4+3+4+6 = 17/18	1.51T/s	1.46T/s
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	80/83	46117.50 MB	OutOfMemory
TheBloke/lzlv_70B-GGUF	GGUF	Q5_K_M	83/83	46322.61 MB	OutOfMemory
LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2	EXL2	2.4bpw		11,11 -> 22 GB	BROKEN
LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2	EXL2	2.6bpw		12,11 -> 23 GB	FAIL
LoneStriker/lzlv_70b_fp16_hf-3.0bpw-h6-exl2	EXL2	3.0bpw		14,13 -> 27 GB	18/18	4+2+2+6 = 14/18
LoneStriker/lzlv_70b_fp16_hf-4.0bpw-h6-exl2	EXL2	4.0bpw		18,17 -> 35 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-4.65bpw-h6-exl2	EXL2	4.65bpw		20,20 -> 40 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-5.0bpw-h6-exl2	EXL2	5.0bpw		22,21 -> 43 GB	18/18	4+3+2+6 = 15/18
LoneStriker/lzlv_70b_fp16_hf-6.0bpw-h6-exl2	EXL2	6.0bpw		> 48 GB	TOO BIG
TheBloke/lzlv_70B-AWQ	AWQ	4-bit			OutOfMemory

My AI Workstation:

Observations:

Conclusion: