Is there a way to prevent coherency degradation when using high levels of RoPE scaling?

tenmileswide@alien.top · 2 years ago

Is there a way to prevent coherency degradation when using high levels of RoPE scaling?

FieldProgrammable@alien.top · 2 years ago

Yes. See this post and the graphs in it for an illustration of what happens to model performance at different context lengths (the post covers Llama 1 models with a native 2k context, so scale the X axis accordingly for Llama 2).

As you increase the RoPE scaling, the positional embeddings of the prompt deviate further and further from what the model was trained on. The different compression methods simply trade reduced performance at shorter contexts for usable quality at longer ones. Fine-tuning the model on the compressed scaling alleviates some of the losses; this is what models like SuperHOT and LLongMA do, fine-tuning on linear RoPE-scaled data.
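To make the "compression" concrete, here is a minimal sketch of linear RoPE scaling in plain Python. The function name `rope_angles` and the tiny `head_dim` are illustrative assumptions, not any particular library's API; the point is only that dividing positions by the scale factor squeezes a longer prompt into the angle range the model saw in training, at the cost of finer spacing between neighbouring tokens.

```python
def rope_angles(position, head_dim=8, base=10000.0, linear_scale=1.0):
    """Rotation angles for one token position under linear RoPE scaling.

    Hypothetical helper for illustration. linear_scale=1.0 is the
    unscaled (native) behaviour; linear_scale=4.0 compresses positions 4x.
    """
    pos = position / linear_scale
    return [pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


# Position 4096 with 4x scaling lands on the same angles as native position 1024.
native = rope_angles(1024)
scaled = rope_angles(4096, linear_scale=4.0)

# Neighbouring tokens are now only a quarter of a "native" step apart,
# which is the spacing the model was not trained on.
step_native = rope_angles(1)[0]
step_scaled = rope_angles(1, linear_scale=4.0)[0]
```

Fine-tuning on data encoded with the scaled positions is what lets the model get used to that finer spacing.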

qrios@alien.top · 2 years ago

I don’t think it’s so simple as “the nature of the beast.”

From my own experiments, you can maintain coherence by having stuff scale more the further back it is, but at some cost to accuracy. So stuff further back is more confused, but still accessible, and stuff more recent is still grounding the generation.
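One possible reading of "scale more the further back it is", sketched in Python. This is not necessarily the scheme used in those experiments: the function name, the log-shaped curve, and every parameter value are assumptions for illustration. Recent tokens keep their true distance, while older tokens are squeezed progressively harder so the whole history still fits inside the native range.

```python
import math


def warped_distance(distance, window=2048, native_ctx=4096, max_ctx=16384):
    """Map a true token distance to an effective distance within native_ctx.

    Hypothetical sketch: distances up to `window` are left untouched;
    distances in (window, max_ctx] are squeezed into (window, native_ctx]
    on a log curve, so compression grows the further back a token is.
    """
    if distance <= window:
        return float(distance)
    frac = math.log(distance / window) / math.log(max_ctx / window)
    return window + frac * (native_ctx - window)


recent = warped_distance(100)      # unchanged
middle = warped_distance(4096)     # mildly compressed
oldest = warped_distance(16384)    # mapped to the edge of the native range
```

The trade-off matches the description above: old tokens share ever-closer effective positions (less accurate), while the recent window is encoded exactly as in training.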

I haven’t tested super thoroughly though.

mrjackspade@alien.top · 2 years ago

Switching to YaRN is the best option I’m aware of at the moment.

YaRN is basically dynamic alpha scaling with extra steps. It functions better without fine-tuning, and it also benefits from fine-tuning.

https://private-user-images.githubusercontent.com/567732/276779985-6b37697c-896e-4199-a541-a489b6fad213.png
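For reference, a rough sketch of the dynamic alpha (NTK-aware) scaling mentioned above, in Python. The function name and default values are assumptions for illustration. It shows only the dynamic base adjustment: instead of compressing positions, the RoPE base is raised once the sequence outgrows the native context. YaRN's extra steps (treating frequency bands differently and adjusting attention temperature) are not shown.

```python
def dynamic_ntk_base(seq_len, native_ctx=4096, head_dim=128, base=10000.0):
    """RoPE base under a simple dynamic alpha rule.

    Hypothetical helper: alpha is taken as the ratio of the current
    sequence length to the native context, and the base grows as
    alpha ** (head_dim / (head_dim - 2)).
    """
    if seq_len <= native_ctx:
        return base  # inside the native context: leave RoPE untouched
    alpha = seq_len / native_ctx
    return base * alpha ** (head_dim / (head_dim - 2))


short_base = dynamic_ntk_base(2048)    # native behaviour
long_base = dynamic_ntk_base(8192)     # base roughly doubles at 2x context
longer_base = dynamic_ntk_base(16384)  # and keeps growing with length
```

Because the adjustment only kicks in past the native context, short prompts behave exactly as they would with no scaling at all.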

SomeOddCodeGuy@alien.top · 2 years ago

I’ve seen a couple of YaRN models, but I honestly have no idea how to use them lol. Same with the Mistral models: they always want to load up at 32k tokens, but then the coherency of the model just dies after 5k. I can’t find really clear instructions on what’s expected to get maximum context value from either, so I tend to just avoid using either at high context.

mcmoose1900@alien.top · 2 years ago

Have you considered running a Yi 200K model instead?