Any way to decrease inference time during long chats?(+decrease repetition without breaking things)

The_One_Who_Slays@alien.top · 1 year ago

Any way to decrease inference time during long chats?(+decrease repetition without breaking things)

the_quark@alien.top · 1 year ago

While speed decreases with context length, I wonder if with a small context you’re fitting entirely in GPU, but as the context length increases, you exceed the available VRAM and it has to offload part of it to CPU?

To check if this is the issue, take a chat that’s performing poorly and load a smaller model on it.

If that is your issue, you can try to fine-tune your context length to fit inside your available VRAM, or use a smaller model if you need the longer context.