The title, pretty much.
I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. Would be great to get some insights from the community.
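For rough intuition on the memory side of the trade-off, here is a quick back-of-the-envelope sketch (weights only; it ignores KV cache, activations, and quantization metadata, so real footprints are somewhat higher):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB, ignoring all overhead."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

print(weight_memory_gb(70, 4))   # 70B at 4-bit  -> 35.0 GB
print(weight_memory_gb(13, 16))  # 13B at fp16   -> 26.0 GB
print(weight_memory_gb(34, 16))  # 34B at fp16   -> 68.0 GB
```

So a 4-bit 70B actually takes less memory than a fp16 34B, which is part of why the question is interesting: you get far more parameters per GB.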
It depends on the task. For anything multilingual, like translation, quantization will destroy the model. I suspect this is because the calibration data used during quantization is all English.