QuIP# is a novel quantization method. Its 2-bit performance is better than anything previously available.
Repository: https://github.com/Cornell-RelaxML/quip-sharp
Blog post: https://cornell-relaxm...
To create quants of new models, you first have to generate Hessians for them, which uses several GB of RedPajama as calibration data. Generating Hessians for Mistral is taking 17 minutes per LAYER on my 3090. I’ll see if it can even finish later. Much later. That’s over 16 hours just to quantize a 7B model, yikes.
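For context on why this pass is so heavy: in GPTQ/QuIP-style methods the per-layer “Hessian” is essentially the second moment E[xxᵀ] of that layer’s inputs, accumulated over everything the calibration data (RedPajama here) pushes through the model. Here is a minimal conceptual sketch of that accumulation, assuming the standard formulation; the function is mine for illustration and is not the repo’s actual code:

```python
# Conceptual sketch (not quip-sharp's code): the per-layer proxy Hessian is the
# averaged outer product of the inputs seen by that layer during calibration.
import torch

def accumulate_hessian(layer: torch.nn.Linear, calib_inputs):
    """Accumulate H ~ E[x x^T] for one linear layer.

    calib_inputs: iterable of activation tensors of shape (..., in_features),
    e.g. captured with a forward hook while calibration text runs through the model.
    """
    d = layer.in_features
    H = torch.zeros(d, d, dtype=torch.float64)
    n = 0
    for x in calib_inputs:
        x = x.reshape(-1, d).to(torch.float64)  # flatten batch/sequence dims
        H += x.T @ x                            # sum of outer products
        n += x.shape[0]
    return H / max(n, 1)                        # average over all tokens seen
```

Every layer pays for a d×d accumulation over gigabytes of activations, which is presumably where those minutes per layer go.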
The paper for this is one of the hardest reads I’ve had in years, full-on “I know some of these words.” I didn’t think 8-dimensional sphere packing was going to be in my attempted light reading for the night.
P.S.: Roll back to transformers 4.34.0, or edit the code in hessian_offline_llama.py and change all instances of
to
and add an import to the top of the same file.
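The exact snippets didn’t survive in the post above, so the following is only a guess at the edit being described: transformers 4.35 removed the private LlamaModel._prepare_decoder_attention_mask method in favor of a module-level helper, which fits the advice to roll back to 4.34.0. A compatibility shim along these lines (helper name and call signature are my assumptions, not the repo’s actual patch) should work on either version:

```python
# A guess at the fix the P.S. describes; not necessarily the repo's actual patch.
# transformers 4.35 moved causal-mask construction out of LlamaModel into
# transformers.modeling_attn_mask_utils, breaking code that calls the old
# private method.

def build_causal_mask(model, attention_mask, input_shape, inputs_embeds, past_kv_len):
    """Build the 4D causal attention mask on either transformers version."""
    try:
        # transformers >= 4.35: module-level helper (likely the import the
        # post says to add at the top of hessian_offline_llama.py)
        from transformers.modeling_attn_mask_utils import (
            _prepare_4d_causal_attention_mask,
        )
        return _prepare_4d_causal_attention_mask(
            attention_mask, input_shape, inputs_embeds, past_kv_len
        )
    except ImportError:
        # transformers <= 4.34: private method on the LlamaModel instance
        return model._prepare_decoder_attention_mask(
            attention_mask, input_shape, inputs_embeds, past_kv_len
        )
```

Otherwise, pinning with pip install transformers==4.34.0 is the simpler route.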
With Llama-2-70b-chat-E8P-2Bit from their model zoo, QuIP# seems fairly promising. I’d have to try l2-70b-chat in exl2 at 2.4 bpw to compare, but this model does not really feel like a 2-bit model so far; I’m impressed.
From the issue about this in the exllamav2 repo, QuIP# was using more memory and running slower than exl2. How much context can you fit?