Hi, I have searched for a long time on this subreddit, in Ooba’s documentation, Mistral’s documentation and everything, but I just can’t find what I am looking for.
I see everyone claiming Mistral can handle up to 32k of context. While it technically won't refuse to generate once the prompt goes past roughly 8k, the output just isn't good. I have it loaded in Oobabooga's text-generation-webui and am using the API through SillyTavern. I loaded the plain Mistral 7B just to check, and with my current 12k story all it generates is gibberish if I give it the full context; I checked other fine-tunes of Mistral as well.
What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is the 32k figure just saying it won't crash before then, or am I doing something wrong that keeps me from getting coherent output above 8k? I did mess with the alpha values, and while that does eliminate the gibberish, I get the impression the quality suffers for it.
I can't speak to running on AMD cards, but Mistral uses what's called "Sliding Window Attention." That means Mistral only attends directly to the last 4k tokens of context, but each of those tokens already attended to the 4k before it, so information from earlier in the text can still be carried forward without it having to keep an attention KV cache over the full 32k of context.
E.g., imagine the sliding window were only 6 words/punctuation marks. If you wrote "Let's eat grandma! She sounds like she", the model can still remember that "grandma" is the food item here, because it already folded that information into the word "she". Meanwhile, with "Let's eat, grandma! She sounds like she", the added comma makes clear the speaker is addressing "grandma", so "She" is probably a different person, and the model may assume the sound has to do with the cooking being finished in preparation for eating.
A model that didn't have sliding window attention and had a hard limit of 6 words/punctuation marks would only see "grandma! She sounds like she" and would have to make up the context; it wouldn't remember anything about eating.
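To make that more concrete, here's a toy sketch (nothing from Mistral's actual code, and the window is shrunk from 4096 tokens to 4 so the printout stays readable) of what a sliding-window causal attention mask looks like:

```python
# Toy sliding-window causal mask, NOT Mistral's real implementation.
# Each token can attend to itself and at most (window - 1) tokens before it;
# anything older only survives if an earlier layer already folded it into a
# token that is still inside the window.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    causal = j <= i                         # can't look at future tokens
    in_window = (i - j) < window            # can't look further back than the window
    return causal & in_window

# 8 tokens with a window of 4: token 7 only sees tokens 4-7,
# while token 3 still sees token 0.
print(sliding_window_causal_mask(8, 4).int())
```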
I don’t believe messing with alpha values is a good idea, but I’ve never done it on any model. My Mistral 7B instance in chat mode had no trouble with a conversation extending past 9k tokens, though for obvious reasons it couldn’t remember the beginning of the conversation and it was expectedly dumb with esoteric information, being only a 7B model.
> I don't believe messing with alpha values is a good idea, but I've never done it on any model. My Mistral 7B instance in chat mode had no trouble with a conversation extending past 9k tokens
This is the part that threw me off, and why I'm interested in the answers to this post.
Normally, on a Llama 2 model for instance, I’d use alpha to increase the context past the regular cap. For example, on XWin 70b with a max seq length of 4096, I run it at 1.75 alpha and 17000 rope base to kick the context to 6144.
CodeLlama is a little different. I don't need to touch the alpha for it to use 100,000 tokens, but the rope base has to be at 1,000,000. So it's 1 alpha, rope base 1,000,000, 1 compress == 100,000 tokens.
But then there's Mistral. Mistral loads up and is like "I can do 32,000 tokens!" with 1 alpha, 0 rope base, 1 compress. And the readme files on the models keep showing "4096" tokens. So I've been staring at it, scratching my head, unsure whether it can do 32k or 4k, whether it needs rope, etc.
I just keep loading it in 4096 until I have a chance to look it up lol
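For anyone trying to reconcile the alpha and rope base numbers above: as far as I understand it, exllama-style loaders turn alpha into a rope frequency base with an NTK-aware rule roughly like the sketch below (head_dim = 128 for Llama/Mistral-class models; treat this as my reading of it, not something copied from any repo).

```python
# Rough sketch of the NTK-aware alpha -> rope frequency base relation as I
# understand exllama-style loaders to apply it (not copied from any repo).
def rope_base_from_alpha(alpha: float, head_dim: int = 128, base: float = 10000.0) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(rope_base_from_alpha(1.75))  # ~17656, close to the 17000 used for XWin above
print(rope_base_from_alpha(1.0))   # 10000, i.e. alpha 1 leaves the base untouched
```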
I've been playing around with this. The standard model uses a rope freq base of 10,000. At that freq base it can handle slightly more than the 8K tokens it was trained on (according to the Mistral AI info) before it starts producing garbage, at roughly 9K tokens.
However, when I use a rope freq base of 45000, I can have reasonable conversations up to more than 25K tokens, at least with some of the Mistral models. Not all of them are still very coherent at 25K tokens, but the Dolphin 2.1 model and some others still work quite well. For the details, see: here
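If it helps, here's roughly what that setup looks like through llama-cpp-python; the filenames are placeholders, and 45000 is simply the value that worked for me, not an official Mistral number.

```python
# Loading a Mistral finetune with a raised rope freq base via llama-cpp-python.
# Filenames are placeholders; 45000 is the value that worked for me, not an
# official recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="dolphin-2.1-mistral-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=25000,             # well past the ~9K where the default base falls apart
    rope_freq_base=45000.0,  # default is 10000
    rope_freq_scale=1.0,     # leave linear ("compress") scaling alone
)

long_story = open("story.txt").read()  # placeholder for a long prompt
out = llm(long_story + "\n\nContinue the story:", max_tokens=256)
print(out["choices"][0]["text"])
```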
This is very helpful. The GGUF format is supposed to set the correct RoPE parameters, but that apparently isn't the case for Mistral. This is something to bring up on the llama.cpp GitHub, so that whoever works on RoPE can adjust the Mistral behavior.
Thanks for your reply. In this case I think it's not a bug in llama.cpp but in the parameters of the Mistral models. The original Mistral models were trained on an 8K context size; see Product | Mistral AI | Open source models.
But when I load a Mistral model, or a finetune of a Mistral model, koboldcpp always reports a trained context size of 32768, like this:
llm_load_print_meta: n_ctx_train = 32768
So llama.cpp (or koboldcpp) just assumes that no NTK scaling is needed up to a context size of 32768 and leaves the rope freq base at 10000, which I think is correct behaviour. I don't know why the model has its n_ctx_train parameter at 32768 instead of 8192; maybe a mistake?
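To illustrate what I mean, here is a toy sketch (not llama.cpp's actual code) of the decision the loaders seem to make based on that metadata:

```python
# Toy sketch, not llama.cpp's actual code: the loader compares the requested
# context against n_ctx_train from the GGUF metadata and only scales the rope
# freq base when you go beyond it. With n_ctx_train stored as 32768, a 16K
# request keeps the base at 10000 even though the model only saw 8K in training.
def effective_rope_base(requested_ctx: int,
                        n_ctx_train: int = 32768,
                        default_base: float = 10000.0) -> float:
    if requested_ctx <= n_ctx_train:
        return default_base                             # "no scaling needed"
    return default_base * requested_ctx / n_ctx_train   # crude stretch, illustration only

print(effective_rope_base(16384))                     # 10000
print(effective_rope_base(16384, n_ctx_train=8192))   # 20000, if the metadata said 8192
```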
Very interesting. In hindsight I should've noticed: performance did decrease after 8k tokens, but it became completely unusable after 10k. I'm actually pretty disappointed to still know so little. No one documents what does and doesn't work, or when, or how. I can barely find anything about SWA (I know what it is in essence), but no one documents how it works, whether and where you can set the window size, or whether it's even in Ooba's app.
And then there's the problem that, like you said, I don't know if it's supported on AMD cards. Try looking it up: searching for sliding window attention on Google by itself just gives endless pages of "tutorials" and "guides" that don't tell you anything, and combining it with ROCm just gives random results that don't lead anywhere useful.
- I did my best to explain Sliding Window Attention briefly there, so do let me know where my explanation is deficient.
- No, you cannot set the window size and no, it’s not in Oobabooga/text-generation-webui. It’s trained in.
- Well, good luck. AMD doesn't even support their own cards properly for AI (ROCm support skipped my last card's generation, and the generation before it only ever had beta support), which is why I finally gave up and switched to team green last year.
I noticed this problem in llama.cpp too. I suspect it may be because something that Mistral models require is not implemented, e.g. sliding window attention. To confirm that, one could compare outputs from PyTorch with those from the other software. I tried to do it, but the PyTorch model runs out of system RAM with a ~15k-token prompt.
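In case someone with more system RAM (or a shorter test prompt) wants to try that comparison, this is roughly what I had in mind; the GGUF filename and prompt file are placeholders.

```python
# Rough comparison sketch: next-token prediction from a PyTorch/transformers
# reference versus llama.cpp (via llama-cpp-python) on the same long prompt.
# Filenames are placeholders; this needs a lot of RAM for ~15k-token prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = open("long_prompt.txt").read()  # placeholder long prompt

# Reference: transformers (its eager attention applies the sliding-window mask)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ref = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
ids = tok(prompt, return_tensors="pt").to(ref.device)
with torch.no_grad():
    last_logits = ref(**ids).logits[0, -1]
print("transformers next token:", tok.decode([int(last_logits.argmax())]))

# Candidate: llama.cpp via llama-cpp-python
llm = Llama(model_path="mistral-7b-v0.1.Q8_0.gguf", n_ctx=16384)
out = llm(prompt, max_tokens=1, temperature=0.0)
print("llama.cpp next token:", out["choices"][0]["text"])
```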
So I did some research, and after a while in the rabbit hole I think that sliding window attention is not implemented in ExLlama (or v2) yet, and it is not in the AMD ROCm fork of Flash Attention yet either.
I think that means it's just unsupported right now. Very unfortunate, but I guess I'll have to wait. Waiting for support is the price I pay for saving 900 euros by buying a 7900 XTX instead of a 4090. I'm fine with that.
I think it really depends on the finetune. For example, Mistral-Instruct is able to summarize or extract information from a 32K context; for writing, you will have to find a model finetuned for that task.
That 32k is theoretical. Base Mistral hasn't actually been trained to work with contexts that large. You might want to look at Amazon's long-context finetune of it: https://huggingface.co/amazon/MistralLite
I suspect it’ll amount to trading off quality for size though.
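For anyone who wants to give MistralLite a spin, a minimal loading sketch with transformers is below; the prompt template is my reading of the model card (double-check it there before relying on it) and the file path is a placeholder.

```python
# Minimal sketch for trying amazon/MistralLite with transformers.
# The <|prompter|>/<|assistant|> template is my reading of the model card;
# double-check it there before relying on this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("amazon/MistralLite")
model = AutoModelForCausalLM.from_pretrained(
    "amazon/MistralLite", torch_dtype=torch.float16, device_map="auto"
)

long_doc = open("story.txt").read()  # placeholder long context
prompt = f"<|prompter|>{long_doc}\n\nSummarize the story so far.</s><|assistant|>"
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=300, do_sample=False)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```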