• 0 Posts
  • 4 Comments
Joined 11 months ago
Cake day: October 30th, 2023


    • I did my best to explain Sliding Window Attention briefly there, so do let me know where my explanation is deficient.
    • No, you cannot set the window size and no, it’s not in Oobabooga/text-generation-webui. It’s trained in.
    • Well, good luck. AMD doesn’t even support their own cards properly for AI (ROCm support skipped my last card’s generation, and the generation before it only ever had beta support), which is why I finally gave up and switched to team green last year.

  • I can’t speak to running on AMD cards, but Mistral uses what’s called “Sliding Window Attention.” That means Mistral only attends to the last 4k tokens of context, but each of those tokens already attended to the 4k tokens before it. That is, it doesn’t have to recompute the entire attention KV cache until 32k tokens have passed.

    E.g., imagine the sliding window were only 6 words/punctuation marks. If you wrote “Let’s eat grandma! She sounds like she”, the model can remember that “grandma” is a food item and will have folded that information into the representation of “she”. Meanwhile, with “Let’s eat, grandma! She sounds like she”, the added comma makes clear the speaker is addressing “grandma”, so “She” is probably a different person, and the model may assume the sound is the cooking finishing in preparation for eating.

    A model that didn’t have sliding window attention and had a limit of 6 words/punctuations would only see “grandma! She sounds like she” and would have to make up the context – it wouldn’t remember anything about eating.
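    The windowed-but-overlapping pattern described above can be sketched as an attention mask. This is a minimal toy sketch, not Mistral’s actual implementation: the 6-position window mirrors the grandma example (Mistral’s real window is 4096), and position i may attend to any position j in the last `window` positions up to and including itself.

```python
# Sketch of a sliding-window causal attention mask (toy window of 6;
# Mistral's trained-in window is 4096 and cannot be changed at load time).
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    causal = j <= i                  # can never attend to future tokens
    in_window = j > i - window       # only the last `window` positions
    return causal & in_window

mask = sliding_window_mask(seq_len=10, window=6)
# Token 9 attends to tokens 4..9 directly, but token 4 attended to
# token 0 in an earlier layer, so information still propagates beyond
# the window -- that's why the "eating" context survives past 6 tokens.
print(mask.astype(int))
```

    A full-attention model with a hard 6-token limit would instead drop everything before the window entirely, which is the “would have to make up the context” case above.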

    I don’t believe messing with alpha values is a good idea, but I’ve never done it on any model. My Mistral 7B instance in chat mode had no trouble with a conversation extending past 9k tokens, though for obvious reasons it couldn’t remember the beginning of the conversation and it was expectedly dumb with esoteric information, being only a 7B model.