I recently found out about Chronos-Hermes 13B and have been trying to play around with it.
I’ve tried three formats of the model: GPTQ, GGML, and GGUF. It’s my understanding that GGML is the older, more CPU-oriented format, so I don’t use it much. Whenever I use the GGUF (Q5) version with KoboldCpp as a backend, I get incredible responses, but the speed is extremely slow. I even offload 32 layers to my GPU, and I’ve confirmed it isn’t overrunning VRAM, yet it’s still slow. The GPTQ model, on the other hand, is way faster, but the quality of the responses is worse.
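For reference, this is roughly how I’m launching it, shown here as a Python sketch; the filename, layer count, and thread count are stand-ins for my actual setup, and I’m assuming a CUDA build of KoboldCpp:

```python
import subprocess

# A minimal launch sketch, assuming koboldcpp.py sits in the working
# directory and was installed with CUDA support. All values are placeholders.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "chronos-hermes-13b.Q5_K_M.gguf",  # hypothetical quant filename
    "--usecublas",           # GPU (cuBLAS) backend; without this flag, prompt
                             # processing can fall back to CPU and crawl
    "--gpulayers", "32",     # same offload count as above
    "--contextsize", "4096",
    "--threads", "8",        # roughly the physical core count
])
```

If someone is seeing this kind of slowdown despite VRAM headroom, my guess is it could be a launch without `--usecublas` (or a CPU-only build), in which case the offloaded layers don’t buy much.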
My question is, are there any tricks to loading GPTQ models I might not be aware of?
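For context, this is roughly how I’m loading it, via AutoGPTQ; the repo name and prompt format here are placeholders for whatever you actually use:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "TheBloke/Chronos-Hermes-13B-GPTQ"  # placeholder repo/path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# Load the pre-quantized weights onto the GPU; use_safetensors matches how
# most GPTQ repos ship their weights.
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "### Instruction:\nWrite a short greeting.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```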
I think exl2 is being let down by the number of quants that use wikitext as the calibration dataset, even when that corpus is obviously mismatched to the model’s fine-tune. Activation-order quantization needs good measurement data to make the right quantization decisions.
If, on the other hand, the calibration data does fit the fine-tune, you’d expect the opposite effect.
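To make that concrete, here’s a rough sketch of the same idea using AutoGPTQ rather than exl2 (exllamav2’s convert.py takes a calibration dataset via its `-c` flag too, if I remember right). Everything below, from the base checkpoint to the calibration strings, is a stand-in:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_model = "NousResearch/chronos-hermes-13b"  # placeholder fp16 checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)

# desc_act=True enables activation-order quantization, which is exactly the
# part that depends on representative measurement data.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)

# Calibration samples should look like what the fine-tune actually sees
# (instruct/roleplay text here), not generic wikitext.
calibration_texts = [
    "### Instruction:\nDescribe the tavern as the party enters.\n\n### Response:\n...",
    "### Instruction:\nContinue the story in the narrator's voice.\n\n### Response:\n...",
]
examples = [tokenizer(t) for t in calibration_texts]

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized("chronos-hermes-13b-4bit-gptq")
```

The point is just that the samples feeding the activation-order measurement should resemble the model’s actual workload, not generic encyclopedia text.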