Using Oobabooga’s Webui on cloud.

I didn’t notice it immediately, but apparently once I exceed the context limit, or shortly after, inference time increases significantly. For example, at the beginning of a conversation a single message generates at about 13–16 tps. After reaching the threshold, the speed keeps decreasing until it bottoms out around 0.1 tps.

Not only that, but the text also starts repeating. For example, certain features of a character, or their actions, start coming up in almost every subsequent message with nearly identical wording, like a broken record. It’s not impossible to steer the plot forward, but it gets tiring, especially with the huge delay on top of it.

Is there any solution or a workaround to these problems?

  • LocoLanguageModel@alien.top · 1 year ago

    I think koboldCPP already does this unless I’m misunderstanding, have a look at this:

    Context Shifting is a better version of Smart Context that only works for GGUF models. This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don’t use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext. Context Shifting is enabled by default, and will override smartcontext if both are enabled. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag --noshift.
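    The core idea in the quoted passage can be sketched in a few lines. This is a hypothetical illustration of KV cache shifting, not koboldcpp’s actual code: once the token sequence overflows the context window, the oldest tokens after a fixed prefix are dropped, so everything kept is a contiguous suffix whose cached state can be reused instead of reprocessed. The function name `shift_context` and its parameters are invented for this example.

    ```python
    def shift_context(tokens, max_ctx, keep_prefix=0):
        """Trim the oldest tokens (after an optional fixed prefix) so the
        sequence fits within max_ctx. The prefix must stay constant between
        generations, which mirrors the 'no memory / fixed memory' requirement
        in the docs: a changing prefix would invalidate the cached state."""
        if len(tokens) <= max_ctx:
            return tokens  # still fits, nothing to shift
        overflow = len(tokens) - max_ctx
        # Keep the fixed prefix, drop the oldest `overflow` tokens after it.
        return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]

    # Stand-in token IDs: a 10-token history with an 8-token context window
    # and a 2-token fixed prefix. Tokens 2 and 3 (the oldest non-prefix
    # tokens) get shifted out.
    history = list(range(10))
    print(shift_context(history, max_ctx=8, keep_prefix=2))
    ```

    Because only old tokens are removed and nothing in the middle is rewritten, consecutive generations only need to evaluate the newly appended tokens, which is why the feature avoids the full-context reprocessing that causes the slowdown described in the question.
    
    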