I posted my latest LLM Comparison/Test just yesterday, but here’s another (shorter) comparison/benchmark I did while working on that - testing different formats and quantization levels.
My goal was to find out which format and quant to focus on. So I took the best 70B according to my previous tests and re-tested it with various formats and quants. I wanted to find out if they worked the same, better, or worse. And here’s what I discovered:
Model | Format | Quant | Offloaded Layers / GPU Split | VRAM Used | Primary Score | Secondary Score | Speed +mmq | Speed -mmq |
---|---|---|---|---|---|---|---|---|
lizpreciatior/lzlv_70B.gguf | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | | |
lizpreciatior/lzlv_70B.gguf | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | | |
TheBloke/lzlv_70B-GGUF | GGUF | Q2_K | 83/83 | 27840.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.20T/s | 4.01T/s |
TheBloke/lzlv_70B-GGUF | GGUF | Q3_K_M | 83/83 | 31541.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.41T/s | 3.96T/s |
TheBloke/lzlv_70B-GGUF | GGUF | Q4_0 | 83/83 | 36930.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.61T/s | 3.94T/s |
TheBloke/lzlv_70B-GGUF | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | 4.73T/s !! | 4.11T/s |
TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | 1.51T/s | 1.46T/s |
TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 80/83 | 46117.50 MB | OutOfMemory | | | |
TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 83/83 | 46322.61 MB | OutOfMemory | | | |
LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2 | EXL2 | 2.4bpw | 11,11 | 22 GB | BROKEN | | | |
LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2 | EXL2 | 2.6bpw | 12,11 | 23 GB | FAIL | | | |
LoneStriker/lzlv_70b_fp16_hf-3.0bpw-h6-exl2 | EXL2 | 3.0bpw | 14,13 | 27 GB | 18/18 | 4+2+2+6 = 14/18 | | |
LoneStriker/lzlv_70b_fp16_hf-4.0bpw-h6-exl2 | EXL2 | 4.0bpw | 18,17 | 35 GB | 18/18 | 4+3+2+6 = 15/18 | | |
LoneStriker/lzlv_70b_fp16_hf-4.65bpw-h6-exl2 | EXL2 | 4.65bpw | 20,20 | 40 GB | 18/18 | 4+3+2+6 = 15/18 | | |
LoneStriker/lzlv_70b_fp16_hf-5.0bpw-h6-exl2 | EXL2 | 5.0bpw | 22,21 | 43 GB | 18/18 | 4+3+2+6 = 15/18 | | |
LoneStriker/lzlv_70b_fp16_hf-6.0bpw-h6-exl2 | EXL2 | 6.0bpw | | > 48 GB | TOO BIG | | | |
TheBloke/lzlv_70B-AWQ | AWQ | 4-bit | | | OutOfMemory | | | |
My AI Workstation:
- 2 GPUs (48 GB VRAM): Asus ROG STRIX RTX 3090 O24 Gaming White Edition (24 GB VRAM) + EVGA GeForce RTX 3090 FTW3 ULTRA GAMING (24 GB VRAM)
- 13th Gen Intel Core i9-13900K (24 Cores, 8 Performance-Cores + 16 Efficient-Cores, 32 Threads, 3.0-5.8 GHz)
- 128 GB DDR5 RAM (4x 32GB Kingston Fury Beast DDR5-6000 MHz) @ 4800 MHz ☹️
- ASUS ProArt Z790 Creator WiFi
- 1650W Thermaltake ToughPower GF3 Gen5
- Windows 11 Pro 64-bit
Observations:
- Scores = Number of correct answers to multiple choice questions of 1st test series (4 German data protection trainings) as usual
- Primary Score = Number of correct answers after giving information
- Secondary Score = Number of correct answers without giving information (blind)
- Model’s official prompt format (Vicuna 1.1), Deterministic settings. Different quants still produce different outputs because of internal differences.
- Speed is from koboldcpp-1.49’s stats, after a fresh start (no cache) with 3K of 4K context filled up already, with (+) or without (-) the `mmq` option to `--usecublas`.
- LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2: 2.4-bit = BROKEN! Didn’t work at all, outputting only one word and repeating that ad infinitum.
- LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2: 2.6-bit = FAIL! Acknowledged questions as if they were information, with just “OK”, didn’t answer unless prompted, and made mistakes despite being given the information.
- Even EXL2 5.0bpw did surprisingly much worse than GGUF Q2_K (secondary score 15/18 vs. 17/18).
- AWQ just doesn’t work for me with oobabooga’s text-generation-webui: despite 2x 24 GB VRAM, it goes OOM. Allocation seems to be broken. Giving up on that format for now.
- All versions consistently acknowledged all data input with “OK” and followed instructions to answer with just a single letter or more than just a single letter.
- EXL2 isn’t entirely deterministic. Its author said speed is more important than determinism, and I agree, but the quality loss and non-determinism make it less suitable for model tests and comparisons.
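For anyone who wants to double-check the score breakdowns in the table, here is a minimal sketch that parses cells like "4+3+4+6 = 17/18" and verifies that the per-test scores add up to the stated total. The helper name `parse_breakdown` is made up for illustration and is not part of any test tooling used here.

```python
import re


def parse_breakdown(cell: str) -> tuple[list[int], int, int]:
    """Parse a score cell such as '4+3+4+6 = 17/18'.

    Returns (per-test scores, stated total, maximum) and raises
    ValueError if the per-test scores do not add up to the stated total.
    """
    match = re.fullmatch(r"\s*([\d+\s]+?)\s*=\s*(\d+)\s*/\s*(\d+)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized score cell: {cell!r}")
    parts = [int(p) for p in match.group(1).split("+")]
    total, maximum = int(match.group(2)), int(match.group(3))
    if sum(parts) != total:
        raise ValueError(f"{cell!r}: parts sum to {sum(parts)}, not {total}")
    return parts, total, maximum


# Secondary-score cells copied from the table above.
for label, cell in [
    ("GGUF Q2_K", "4+3+4+6 = 17/18"),
    ("EXL2 3.0bpw", "4+2+2+6 = 14/18"),
    ("EXL2 5.0bpw", "4+3+2+6 = 15/18"),
]:
    parts, total, maximum = parse_breakdown(cell)
    print(f"{label}: {total}/{maximum} from per-test scores {parts}")
```

Each of the four numbers is the count of correct blind answers in one of the four trainings, so the breakdown also shows where a quant lost points: the EXL2 quants drop in the second and third tests.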
Conclusion:
- With AWQ not working and EXL2 delivering bad quality (secondary score dropped a lot!), I’ll stick to the GGUF format for further testing, for now at least.
- Strange that bigger quants got more tokens per second than smaller ones, maybe that’s because of different responses, but Q4_K_M with mmq was fastest - so I’ll use that for future comparisons and tests.
- For real-time uses like Voxta+VaM, EXL2 4-bit is better - it’s fast and accurate, yet not too big (need some of the VRAM for rendering the AI’s avatar in AR/VR). Feels almost as fast as unquantized Transformers Mistral 7B, but much more accurate for function calling/action inference and summarization (it’s a 70B after all).
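To put numbers on the speed observation, here is a small sketch that computes the relative gain from mmq for each GGUF quant and picks the fastest one. The speeds are copied from the table; the function names are illustrative only.

```python
# Speeds (tokens/second) copied from the TheBloke/lzlv_70B-GGUF rows of the table:
# quant -> (with mmq, without mmq)
SPEEDS = {
    "Q2_K": (4.20, 4.01),
    "Q3_K_M": (4.41, 3.96),
    "Q4_0": (4.61, 3.94),
    "Q4_K_M": (4.73, 4.11),
    "Q5_K_M (70/83 layers)": (1.51, 1.46),
}


def mmq_gain_percent(with_mmq: float, without_mmq: float) -> float:
    """Relative speed-up from enabling mmq, in percent."""
    return (with_mmq / without_mmq - 1.0) * 100.0


def fastest_quant(speeds: dict[str, tuple[float, float]]) -> str:
    """Quant with the highest tokens/second when mmq is enabled."""
    return max(speeds, key=lambda quant: speeds[quant][0])


for quant, (with_mmq, without_mmq) in SPEEDS.items():
    gain = mmq_gain_percent(with_mmq, without_mmq)
    print(f"{quant}: {with_mmq:.2f} vs {without_mmq:.2f} T/s ({gain:+.1f}% with mmq)")
print("Fastest with mmq:", fastest_quant(SPEEDS))
```

Keep in mind these are single measurements per configuration, so small differences between quants may just be noise or the result of different response lengths.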
So these are my - quite unexpected - findings with this setup. Sharing them with you all and looking for feedback if anyone has done perplexity tests or other benchmarks between formats. Is EXL2 really such a tradeoff between speed and quality in general, or could that be a model-specific effect here?
Here’s a list of my previous model tests and comparisons or other related posts:
- LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4
- LLM Comparison/Test: Mistral 7B Updates (OpenHermes 2.5, OpenChat 3.5, Nous Capybara 1.9)
- Huge LLM Comparison/Test: Part II (7B-20B) Roleplay Tests Winners: OpenHermes-2-Mistral-7B, LLaMA2-13B-Tiefighter-GGUF
- Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4)
- My current favorite new LLMs: SynthIA v1.5 and Tiefighter!
- Mistral LLM Comparison/Test: Instruct, OpenOrca, Dolphin, Zephyr and more…
- LLM Pro/Serious Use Comparison/Test: From 7B to 70B vs. ChatGPT! Winner: Synthia-70B-v1.2b
- LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B Winner: Mistral-7B-OpenOrca
- LLM Chat/RP Comparison/Test: Mistral 7B Base + Instruct
- LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin) Winner: Xwin-LM-70B-V0.1
- New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
- New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B) Winner: Mythalion-13B
- New Model RP Comparison/Test (7 models tested) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
- Big Model Comparison/Test (13 models tested) Winner: Nous-Hermes-Llama2
- SillyTavern’s Roleplay preset vs. model-specific prompt format
Disclaimer: Some kind soul recently asked me if they could tip me for my LLM reviews and advice, so I set up a Ko-fi page. While this may affect the priority/order of my tests, it will not change the results, I am incorruptible. Also consider tipping your favorite model creators, quantizers, or frontend/backend devs if you can afford to do so. They deserve it!
The major reason I use exl2 is speed, like on 2x4090 I get 15-20 t/s at 70b depending on the size, but with GGUF I get like 4-5 t/s tops.
When using 3 gpus (2x4090+1x3090), it is 11-12 t/s at 6.55bpw vs GGUF Q6_K that runs at 2-3 t/s.
Though I agree with you, for model comparisons and such you need to have deterministic results and also the best quality.
If you can sometime, try 70b at 6bpw or more, IMO it is pretty consistent and doesn’t have issues like 5bpw/bits.
The performance hit is too much on multigpu systems when using GGUF. I guess if in the future the speed gets to the same level, I would use it most of the time.
I’m surprised you get speeds so bad with GGUF. I get almost 9t/s on P40s and 18t/s on 3090.
GGUF is actually the fastest format until you load it up with context.
A couple of things have to be changed in CMakeLists.txt under vendor/llama.cpp if you’re using Python.
I have nvlink so this helps me. Since you don’t it still may help using direct communication via PCIE:
and since you’re using all new cards:
Try out the FP16 support.
‘The performance hit is too much on multigpu systems when using GGUF’
I agree. GGUF has a multi-GPU penalty. But it’s the most friendly to Apple silicon. I have the same setup as you. One 4090 can run Xwin 13b at 40t/s, but when 2 cards are present, it gets only 1/4 of the speed at 10t/s. So to get it fast, I have to restrict the CUDA device to a single card while 2 cards are present.
Since GGUF likes a single GPU, those who have a 3090/4090 will find 34B the best spot with the format.
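Pinning a CUDA program to one card like this is typically done with the `CUDA_VISIBLE_DEVICES` environment variable, which the CUDA runtime reads at startup. A minimal sketch, assuming a hypothetical launcher script `run_model.py` (a placeholder, not a real tool) and helper names made up for illustration:

```python
import os
import subprocess


def single_gpu_env(device_index: int = 0) -> dict[str, str]:
    """Copy the current environment, exposing only one GPU to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def launch_on_single_gpu(
    command: list[str], device_index: int = 0
) -> subprocess.CompletedProcess:
    """Run `command` so that CUDA only sees the chosen GPU."""
    return subprocess.run(command, env=single_gpu_env(device_index), check=True)


# Hypothetical usage; replace the command with your actual backend launcher:
# launch_on_single_gpu(["python", "run_model.py"], device_index=0)
```

The variable has to be set before the process initializes CUDA, which is why the sketch passes it to a child process instead of changing it midway through a running one.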
You’re doing something very wrong. I get better speeds than that on P40s with low context. Are you not using cublas?
What motherboard do you have that can run 3x GPU’s?