I’m trying to run Mistral 7B on my laptop. The generation speed is fine (~10 T/s), but prompt processing takes very long once the context gets bigger (also around 10 T/s). I’ve tried quantizing the model, but that only speeds up generation, not prompt processing. I’ve also tried OpenBLAS, but that didn’t provide much of a speedup. I’m using koboldcpp’s prompt cache, but that doesn’t help with the initial processing pass, which is so slow the connection times out.
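To illustrate the timeout part, this is roughly the shape of the call that drops for me (just a sketch; localhost:5001 and the KoboldAI-style /api/v1/generate endpoint are koboldcpp’s defaults as far as I know). Raising the client-side timeout keeps the connection alive while the prompt grinds through, but obviously doesn’t make processing any faster:

```python
import requests

# Sketch of a client-side stopgap: give the request far more time than the
# default so slow prompt processing doesn't drop the connection.
# Assumes koboldcpp is running locally on its default port (5001) with the
# KoboldAI-compatible API.
payload = {
    "prompt": "<long context goes here>",  # placeholder for the big prompt
    "max_context_length": 4096,
    "max_length": 200,
}

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json=payload,
    timeout=600,  # seconds; generous so prompt processing can finish
)
print(resp.json())
```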
From my other testing, smaller models are faster at prompt processing, but they tend to completely ignore my prompts and just go off in random directions.
So my question is: 1) is there a way to speed up prompt processing for Mistral (preferably with koboldcpp), or 2) if not, are there any coherent models around 3B parameters that support contexts around 4k tokens?
Edit: I misremembered the generation speed; it’s around 10 T/s for generation only. I’ve corrected it in the original post.
On quality: if you go with a smaller model (or even a different model) you will lose quality, since Mistral (and its finetunes) is the best of the <70B models. Another rule of thumb is that a bigger model quantized (even to 2 bits) beats a smaller one unquantized.
On speed: the fastest inference comes from Q4_K_S quants: https://github.com/ggerganov/llama.cpp/pull/1684
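If you don’t already have a Q4_K_S file, making one is just a pass through llama.cpp’s quantize tool. Rough sketch below; the file names are placeholders, and depending on your build the binary may be called quantize or llama-quantize:

```python
import subprocess

# Sketch: convert an f16 model file to Q4_K_S using llama.cpp's quantize tool.
# File names are placeholders; adjust paths and the binary name to your setup.
subprocess.run(
    ["./quantize", "mistral-7b-f16.gguf", "mistral-7b-Q4_K_S.gguf", "Q4_K_S"],
    check=True,
)
```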