So. My rig (Ryzen 7 3700X, 64 GB RAM, RTX 3070, Intel Arc A380) can run up to 70B-parameter models… but they run at a snail's pace. Furthermore, I honestly don't see that big an improvement on regular chat tasks from a 70B model vs a 13B model. Don't get me wrong… there is sometimes an improvement in adherence, it's just not the GIANT leap forward I expected. Especially with the ~30B models — basically no difference between 30B and 70B. I run everything at Q5.
Here is my question… would running a 70B at Q2 be better than a 7B or 13B at Q5? Would speed improve?
Also, I notice that Mistral models run faster on my machine than LLaMA models, even at the same parameter count… anyone know why?
I know I could theoretically run all these tests myself, but there is just so much to test and so little time. I figured I'd ask around and see if someone else has done it first.
The quant level does a lot — check the model cards, since many models come with recommendations for which quant is best. Lower Q is faster and smaller but less accurate. Also note that the best picks are usually the ones marked with K and S/M/L. I downloaded and tried every quant of the same model to compare, and I'd recommend you do the same — and also look up what K, S, M, and L actually stand for.
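To get a rough feel for the Q2-70B vs Q5-13B question, here's a back-of-envelope sketch of file/memory sizes. The bits-per-weight numbers are my own rough assumptions for common llama.cpp quant types (real values vary a bit per model and quant version), so treat the output as approximate:

```python
# Rough bits-per-weight for some common llama.cpp/GGUF quant types.
# These figures are approximations I'm assuming for illustration,
# not exact values — check the model card for real file sizes.
APPROX_BPW = {
    "Q2_K":   2.6,   # smallest K-quant, largest quality loss
    "Q4_K_S": 4.3,   # K-quant, Small variant
    "Q4_K_M": 4.8,   # K-quant, Medium variant
    "Q5_K_M": 5.5,   # K-quant, Medium variant at 5-bit
    "Q8_0":   8.5,   # near-lossless
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Approximate model size in GB: params * bits-per-weight / 8 bits."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

for label, params, quant in [("70B @ Q2_K", 70e9, "Q2_K"),
                             ("13B @ Q5_K_M", 13e9, "Q5_K_M")]:
    print(f"{label}: ~{approx_size_gb(params, quant):.1f} GB")
```

Even at Q2, a 70B model is around three times the size of a 13B at Q5, so it still won't come close to fitting in an RTX 3070's 8 GB of VRAM — most layers stay on CPU/RAM, which is why speed doesn't improve much.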