I don't understand Mistral and context size, honestly.

anti-lucas-throwaway@alien.top · 2 years ago

I don't understand Mistral and context size, honestly.

mll59@alien.top · 2 years ago

I’ve been playing around with this. The standard model uses a rope freq base of 10,000. According to the Mistral AI info it was trained on 8K tokens, and at that freq base it can handle slightly more than that before producing garbage (at roughly 9K tokens).

However, when I use a rope freq base of 45,000 I can have reasonable conversations with at least some of the Mistral models at more than 25K tokens. Not all of them stay coherent at 25K tokens, but the dolphin 2.1 model and some others still work quite well. For the details see: here
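To make the freq base a bit more concrete, here is a rough sketch of what the number changes. It uses the standard RoPE definition, where dimension pair i rotates at base^(-2i/d) radians per token, and assumes a head dimension of 128 (Mistral 7B: 4096 hidden size over 32 heads). The function name is made up for illustration. It only shows which direction the change goes; it does not predict at what length a given model turns to garbage.

```python
import math

def rope_wavelengths(freq_base, head_dim=128):
    """Tokens per full rotation for each RoPE dimension pair.

    Pair i rotates at freq_base ** (-2*i / head_dim) radians per token,
    so its wavelength is 2*pi * freq_base ** (2*i / head_dim).
    head_dim=128 is an assumption matching Mistral 7B.
    """
    return [
        2 * math.pi * freq_base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]

for base in (10_000, 45_000):
    longest = rope_wavelengths(base)[-1]
    print(f"freq base {base}: slowest pair repeats every {longest:,.0f} tokens")
```

Raising the base stretches every wavelength except the fastest one, so positions far beyond the trained range produce rotation angles closer to what the model saw during training.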

Sabin_Stargem@alien.top · 2 years ago

This is very helpful. The GGUF format is supposed to set the correct RoPE values, but this apparently isn’t the case for Mistral. This is something to bring up on the llama.cpp GitHub, so that whoever works on RoPE can adjust the Mistral behavior.

mll59@alien.top · 2 years ago

Thanks for your reaction. In this case I think it’s not a bug in llama.cpp but in the parameters of the Mistral models. The original Mistral models were trained on an 8K context size, see Product | Mistral AI | Open source models.

But when I load a Mistral model, or a finetune of a Mistral model, koboldcpp always reports a trained context size of 32768, like this:

llm_load_print_meta: n_ctx_train = 32768

So llama.cpp (or koboldcpp) just assumes that no NTK scaling is needed up to a context size of 32768 and leaves the rope freq base at 10000, which I think is the correct behavior given that metadata. I don’t know why the model has this n_ctx_train parameter at 32768 instead of 8192, maybe a mistake?
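As a rough illustration of why the reported n_ctx_train matters, here is a sketch using the commonly cited NTK-aware rule, new_base = base * scale^(d/(d-2)). This is an assumption for illustration only: it is not taken from the llama.cpp or koboldcpp source, which may use a different heuristic, and the function name and the head dimension of 128 are likewise my own choices.

```python
def ntk_freq_base(n_ctx_train, n_ctx_target, freq_base=10_000.0, head_dim=128):
    """Suggest a rope freq base using the NTK-aware rule of thumb.

    Hypothetical helper, not the loader's actual logic.
    """
    scale = n_ctx_target / n_ctx_train
    if scale <= 1.0:
        # Requested context fits inside the trained context: leave the base alone.
        return freq_base
    return freq_base * scale ** (head_dim / (head_dim - 2))

# With the metadata the model actually ships (32768), 25K needs no adjustment.
print(ntk_freq_base(32768, 25_000))

# If the metadata said 8192 instead, the same request would raise the base.
print(ntk_freq_base(8192, 25_000))
```

Under this rule the first call returns the unchanged base, which matches what the loader does with n_ctx_train = 32768, while the second returns a noticeably higher one.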