Using Oobabooga’s Webui on cloud.

I didn’t notice it immediately, but apparently once I exceed the context limit, or shortly after, inference time increases significantly. For example, at the beginning of a conversation a single message generates at about 13–16 tps. After reaching the threshold, the speed keeps decreasing until it bottoms out around 0.1 tps.

Not only that, but the text also starts repeating. For example, certain features of a character, or their actions, start coming up in almost every subsequent message with nearly identical wording, like a broken record. It’s not impossible to steer the plot forward, but it gets tiring, especially with the huge delay on top of it.

Is there any solution or a workaround to these problems?

  • LocoLanguageModel@alien.top · 1 year ago

    I think koboldCPP already does this unless I’m misunderstanding, have a look at this:

    Context Shifting is a better version of Smart Context that only works for GGUF models. This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don’t use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext. Context Shifting is enabled by default, and will override smartcontext if both are enabled. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag --noshift.
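    The core idea in the quoted passage can be sketched in a few lines. This is a hypothetical illustration of KV cache shifting, not koboldcpp’s actual code: once the token sequence overflows the context window, the oldest tokens after a fixed prefix are dropped, so everything kept is a contiguous suffix whose cached state can be reused instead of reprocessed. The function name `shift_context` and its parameters are invented for this example.

    ```python
    def shift_context(tokens, max_ctx, keep_prefix=0):
        """Trim the oldest tokens (after an optional fixed prefix) so the
        sequence fits within max_ctx. The prefix must stay constant between
        generations, which mirrors the 'no memory / fixed memory' requirement
        in the docs: a changing prefix would invalidate the cached state."""
        if len(tokens) <= max_ctx:
            return tokens  # still fits, nothing to shift
        overflow = len(tokens) - max_ctx
        # Keep the fixed prefix, drop the oldest `overflow` tokens after it.
        return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]

    # Stand-in token IDs: a 10-token history with an 8-token context window
    # and a 2-token fixed prefix. Tokens 2 and 3 (the oldest non-prefix
    # tokens) get shifted out.
    history = list(range(10))
    print(shift_context(history, max_ctx=8, keep_prefix=2))
    ```

    Because only old tokens are removed and nothing in the middle is rewritten, consecutive generations only need to evaluate the newly appended tokens, which is why the feature avoids the full-context reprocessing that causes the slowdown described in the question.
    
    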