People around here talk about extended context like it's pretty simple (these days at least). But once I hit about 4,200-4,400 tokens (with my context limit pushed to 8k), all I get is gibberish. This is with the LLaMA2-13B-Tiefighter-AWQ model, which seems highly regarded for roleplay/storytelling (my use case).
I also tried OpenHermes-2.5-Mistral-7B and, oddly enough, it was nonsensical from the very start.
I'm using Silly Tavern with Oobabooga, with the sequence length set to 8k in both, and a 3090. I'm pretty new to all of this, and it's been difficult finding up-to-date information (because things develop so quickly!). The term fine-tuning comes up a lot, and with it comes a whooooole lot of complicated coding talk I know nothing about.
Is there a way for a layman like me to get 8k (or more) of context out of a roleplay/storytelling model?
Does anyone have any hints on how to use exllamav2 with an extended context length on GPTQ weights?
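In case it helps anyone point out where I'm going wrong, below is the rough Python sketch I've pieced together from exllamav2's example scripts. I haven't gotten it working, and the model path and the alpha value of 2.5 are just guesses based on things I've read, not values I've verified. My understanding is that max_seq_len sets the context window and scale_alpha_value applies NTK RoPE scaling so a natively 4k model can stretch toward 8k (and that exllamav2 wants GPTQ or EXL2 weights rather than the AWQ ones I currently have).

```python
# Rough sketch pieced together from exllamav2's examples -- untested.
# The model path below is just a guess at where the GPTQ weights would live.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./models/LLaMA2-13B-Tiefighter-GPTQ"  # assumed path
config.prepare()

config.max_seq_len = 8192       # context window I'm hoping for (model is natively 4k)
config.scale_alpha_value = 2.5  # NTK RoPE alpha scaling; 2.5 is a value I've seen suggested for ~2x context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache gets sized to max_seq_len
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

prompt = "Once upon a time,"
print(generator.generate_simple(prompt, settings, 200))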