I want to download the Goliath model, but I can only afford Q3_K_M. The description says it has high quality loss. How much quality loss is there, really?
I heard that the larger the model, the less it suffers intellectually when it is quantized. I usually use 70B Q5_K_M. Can I expect 120B Q3_K_M to be significantly better than 70B Q5_K_M, so that the time spent downloading will be worth it?
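For a rough sense of the download/VRAM side of that tradeoff, here's a back-of-the-envelope size estimate. The bits-per-weight figures are approximate averages for those llama.cpp quant types, and the 70B/120B parameter counts are taken loosely from the model names, so treat the numbers as ballpark only:

```python
# Rough GGUF file-size estimate: params * bits-per-weight / 8.
# BPW values are approximate llama.cpp averages for each quant type (assumption).
BPW = {
    "Q3_K_M": 3.91,  # approximate
    "Q5_K_M": 5.69,  # approximate
}

def est_size_gb(n_params: float, quant: str) -> float:
    """Rough model file size in GB for n_params weights at the given quant."""
    return n_params * BPW[quant] / 8 / 1e9

size_70b = est_size_gb(70e9, "Q5_K_M")    # ~50 GB
size_120b = est_size_gb(120e9, "Q3_K_M")  # ~59 GB
print(f"70B Q5_K_M  ~ {size_70b:.0f} GB")
print(f"120B Q3_K_M ~ {size_120b:.0f} GB")
```

So the 120B at Q3_K_M is only somewhat larger on disk than the 70B at Q5_K_M, which is why the "bigger model, heavier quant" swap is even on the table.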
What’s the tok/s for each of those models on that system?
Edit: also, if you don’t mind my asking, how much context are you able to use before inference degrades?
For comparison's sake, the EXL2 4.85bpw version runs at around 6-8 t/s on 4x3090s at 8k context; that's on the lower end.