FieldProgrammable@alien.top to LocalLLaMA@poweruser.forum • Lower quality responses with GPTQ model vs GGUF?
I think exl2 is being let down by the number of quants that use wikitext as the quantization dataset, even when this is obviously a complete mismatch for the model's fine-tuning. Activation-order-based quantization needs good measurement data to make the right quantization decisions.
If, however, the quantization dataset does match the fine-tune, the effect is the complete opposite.
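To make the point concrete, here is a minimal, hypothetical Python sketch of the general act-order idea (not exllamav2's or GPTQ's actual code; the helper name `act_order_permutation` and the squared-activation importance measure are my assumptions): the column ordering, and therefore the rounding decisions, follow directly from whatever calibration data you feed in.

```python
import torch

def act_order_permutation(weight: torch.Tensor, calib_acts: torch.Tensor):
    """Toy illustration of activation-order ("act-order") quantization.

    weight:     (out_features, in_features) linear-layer weight
    calib_acts: (n_samples, in_features) inputs to this layer, collected
                while running the calibration dataset through the model
    """
    # Per-input-channel importance measured on the calibration data
    importance = calib_acts.pow(2).mean(dim=0)           # (in_features,)
    # Quantize the most "important" channels first, while the least
    # accumulated rounding error remains to be compensated
    perm = torch.argsort(importance, descending=True)
    return weight[:, perm], perm

# The same weight matrix gets a different permutation (and therefore
# different rounding decisions) depending on the calibration set:
w = torch.randn(4096, 4096)
wikitext_acts = torch.randn(512, 4096)   # stand-in for wikitext activations
finetune_acts = torch.randn(512, 4096)   # stand-in for fine-tune-style activations
_, perm_wiki = act_order_permutation(w, wikitext_acts)
_, perm_ft = act_order_permutation(w, finetune_acts)
```

If the calibration activations look nothing like what the fine-tune actually produces, the quantizer spends its accuracy budget on the wrong channels.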
Yes. See this post and the graphs in it for an illustration of what happens to model performance at different context lengths (the post covers 2k-native-context Llama 1 models, so scale the X axis accordingly for Llama 2).
As you increase the RoPE scaling factor, the positional embeddings of the prompt deviate further and further from what the model was trained on. The different compression methods simply trade usable quality at longer contexts against reduced performance at shorter contexts. If the model is fine-tuned on the compressed scaling, some of these losses are alleviated; this is what models like SuperHOT and Llongma do, fine-tuning on linearly RoPE-scaled data.
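For anyone who wants to see why the positions drift, here is a short Python sketch of linear RoPE scaling using the standard rotary-embedding angle formula (the helper `rope_angles` and the 4x/2k example numbers are just for illustration, not any particular loader's implementation):

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int = 128,
                base: float = 10000.0, linear_scale: float = 1.0) -> torch.Tensor:
    """Rotary position embedding angles with optional linear scaling.

    linear_scale = 1.0 gives the positions the model saw in training.
    linear_scale = 4.0 squeezes positions 0..8191 into the trained
    0..2047 range of a 2k-native model.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled_pos = positions.float() / linear_scale    # compress the position index
    return torch.outer(scaled_pos, inv_freq)         # (seq_len, dim // 2)

# A 2k-native model stretched to 8k context with 4x linear scaling:
native = rope_angles(torch.arange(2048))
scaled = rope_angles(torch.arange(8192), linear_scale=4.0)
# Every scaled angle now falls between angles the model was trained on,
# which is the mismatch that SuperHOT-style fine-tuning compensates for.
```

The scaled positions land between the integer positions seen in pretraining, so quality degrades unless the model is fine-tuned on those fractional positions.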