You mean my recent LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)? The quants below 3bpw probably didn't work because those smaller quants need to be run without the BOS token (which was on by default), something I didn't know at the time.
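If you want to verify what your loader is actually sending, it helps to tokenize the prompt both ways and compare. Here's a minimal sketch using the Hugging Face `transformers` tokenizer (the model name is just a placeholder for whichever model you're testing):

```python
from transformers import AutoTokenizer

# Placeholder repo; substitute the model you're actually benchmarking.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

prompt = "Hello, world!"

# Default: special tokens (including BOS) are prepended.
with_bos = tokenizer(prompt).input_ids
# Explicitly skip special tokens, i.e. run "without BOS".
without_bos = tokenizer(prompt, add_special_tokens=False).input_ids

print(with_bos)     # for Llama models, the leading token 1 is BOS
print(without_bos)  # same sequence without the BOS prepended
```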
Q2_K didn't degrade compared to Q5_K_M. Given that K-quants actually allocate a higher bitrate to the most important tensors, that may not be so surprising.
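You can actually see that mixed allocation by listing the per-tensor quantization types inside a GGUF file. A rough sketch with the `gguf` Python package (the file path is a placeholder):

```python
from collections import Counter
from gguf import GGUFReader

# Placeholder path to a Q2_K model file.
reader = GGUFReader("llama-2-70b.Q2_K.gguf")

# Each tensor carries its own quantization type; a "Q2_K" file is really
# a mix, with more bits spent on the most sensitive tensors.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype}: {n} tensors")
```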
Still surprising that Q2_K also beat the 5bpw EXL2 quant, though. Not sure if that's just down to the bitrate or also a consequence of how EXL2 quants are calibrated.
All that said, I'd be careful trying to compare quant effects across models. The models themselves have a huge impact beyond the quant level, and it's hard to say how much each factor contributes.
Yeah, GGUF is rather slow for me, which is why I've started using ExLlamav2_HF. It lets me run even 120B models at 3-bit with good quality at around 20 T/s.
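For reference, loading an EXL2 quant straight from the exllamav2 Python API looks roughly like this. This is a sketch based on the library's example scripts, not the ExLlamav2_HF loader itself (which wraps exllamav2 inside text-generation-webui); the model directory is a placeholder, and class names may shift between versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder: directory containing a 3.0bpw EXL2 quant of a 120B model.
config = ExLlamaV2Config()
config.model_dir = "/models/my-120b-exl2-3.0bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate cache while loading
model.load_autosplit(cache)               # auto-split across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Once upon a time,", settings, 128))
```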