I’m using Oobabooga’s WebUI in the cloud.
I didn’t notice it immediately, but apparently once I exceed the context limit, or shortly after, inference slows down significantly. For example, at the beginning of the conversation a single message generates at about 13-16 tps. After reaching the threshold, the speed keeps dropping until it is around 0.1 tps.
Not only that, but the text also starts repeating. For example, certain features of a character or their actions start coming up in almost every subsequent message with almost identical wording, like a broken record. It’s not impossible to steer the plot forward, but it gets tiring, especially with the huge delay on top of that.
Is there any solution or a workaround to these problems?
What do you want to happen when the total chat reaches 8k? At that point the server has to make a choice. It can keep adding more context, which slows things down. It can simply cut off the first messages, but then it will, for example, forget its own name. Or it could ask the model to summarize the first 4k of the context, so it retains some context and keeps its speed. That last one is the method I use, but it costs inference time, since you are asking a second question behind the scenes.
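To make the summarize option concrete, here is a minimal sketch, not how any particular webui does it. The `summarize` callback is hypothetical (in practice it would be a second request to the model), and the word-count tokenizer is a crude stand-in for a real one.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact_history(
    messages: List[str],
    limit: int,
    summarize: Callable[[str], str],  # hypothetical: would call the model a second time
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """If the chat exceeds `limit` tokens, replace roughly its older half with a summary."""
    total = sum(count(m) for m in messages)
    if total <= limit:
        return list(messages)

    # Collect the oldest messages until about half of the tokens are covered.
    head: List[str] = []
    rest = list(messages)
    covered = 0
    while len(rest) > 1 and covered < total / 2:
        msg = rest.pop(0)
        covered += count(msg)
        head.append(msg)

    summary = summarize("\n".join(head))
    return ["[Summary of earlier chat] " + summary] + rest
```

The extra `summarize` call is where the added inference time comes from, but it only happens when the limit is crossed, not on every message.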
While speed decreases with context length, I wonder if with a small context you’re fitting entirely in GPU, but as the context length increases, you exceed the available VRAM and it has to offload part of it to CPU?
To check if this is the issue, take a chat that’s performing poorly and load a smaller model on it.
If that is your issue, you can try tuning your context length to fit inside your available VRAM, or use a smaller model if you need the longer context.
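For a rough feel of how much VRAM the context itself takes, here is a back-of-the-envelope sketch of the KV cache size for a standard transformer. The model shape in the example (32 layers, 32 KV heads, head dimension 128, fp16 cache) is only an illustrative assumption, roughly a 7B Llama-style model; check your own model's config, and note that quantized caches or grouped-query attention will give smaller numbers.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # 2 bytes per value assumes an fp16 cache
) -> int:
    """Estimate KV cache size: keys and values for every layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Illustrative shape only (assumed, not read from any real config file).
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=8192)
print(f"{size / 1024**3:.1f} GiB")  # prints "4.0 GiB"
```

That amount comes on top of the model weights, so a model that barely fits at the start of a chat can spill out of VRAM as the context fills up.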
Interesting timing. I don’t know whether this exists yet, but I was just thinking about a feature that would use a range for the context size.
The idea would be that you specify a min and a max context, say 6k and 8k. When you breach the 8k max, instead of just trimming back to the max, it would drop older messages until only 6k remain, then build on that context until it reaches 8k again, and keep repeating the process. That way, instead of reprocessing the entire context on every message, it would only need to do so when the max is exceeded. I’m a programmer by trade, so I’m kind of tempted to look into building this, but I haven’t looked into what that requires or whether the feature already exists out there somewhere.
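A minimal sketch of that min/max idea, assuming the history is a plain list of message strings and using a crude word-count stand-in for a real tokenizer (all names here are made up for illustration):

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(
    messages: List[str],
    min_tokens: int,
    max_tokens: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Leave the history alone up to max_tokens; past that, drop the oldest
    messages until the total is at or below min_tokens."""
    total = sum(count(m) for m in messages)
    if total <= max_tokens:
        return list(messages)  # prefix unchanged, so cached context stays valid

    kept = list(messages)
    while len(kept) > 1 and total > min_tokens:
        total -= count(kept.pop(0))
    return kept
```

Between trims the prompt only grows at the end, so a backend that caches the processed prefix would only have to reprocess everything on the turns where a trim actually happens.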
That would be amazing. I think something like that could even be included into ooba’s official extension repo.
I think koboldCPP already does this, unless I’m misunderstanding. Have a look at this:
Context Shifting is a better version of Smart Context that only works for GGUF models. This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don’t use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext. Context Shifting is enabled by default, and will override smartcontext if both are enabled. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag --noshift.