The title, pretty much.
I’m wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
Usually the number of parameters matters more than bits per weight, but I've had some problems with really low-bpw models, like 70B at 2.55bpw in exllamav2.
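For a rough sense of the memory side of the tradeoff, here's a back-of-the-envelope sketch (weights only; it ignores KV cache and runtime overhead, and the helper name and model list are just illustrative):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB: params * bpw / 8 bytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Compare fp16 small/mid models against low-bpw 70B quants.
for name, params, bpw in [
    ("7B fp16", 7, 16),
    ("13B fp16", 13, 16),
    ("34B fp16", 34, 16),
    ("70B 4-bit", 70, 4),
    ("70B 2.55bpw", 70, 2.55),
]:
    print(f"{name:>12}: ~{weight_gib(params, bpw):.1f} GiB")
```

That puts 70B at 4-bit (~33 GiB of weights) well under 34B fp16 (~63 GiB), which is why people reach for aggressive quants in the first place.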
34B Yi could be a good compromise; I'm impressed with it, and it has a long context length as well.
Early research suggested there was an inflection point below 4 bits where things got markedly worse. In my personal use, accuracy definitely suffers below that, though modern quants may handle it a bit better.
34B Yi does seem like a sweet spot, though I’m starting to suspect that we need some fine-tunes that use longer stories as part of the training data, because it doesn’t seem to be able to maintain the quality for the entire length of the context. Still, being able to include callbacks to events from thousands of tokens earlier is impressively practical. I’ve been alternating between a fast 13B (for specific scenes), 34B Yi (for general writing), and 70B (for when you need it to be smart and varied). And, of course, just switching models can help with the repetition sometimes.