ae_dataviz@alien.topB to

LocalLLaMA@poweruser.forumEnglish · 2 years ago

Quantizing 70b models to 4-bit, how much does performance degrade?

The title, pretty much.

I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. Any insights from the community would be great.

Sea_Particular_4014@alien.topB
2 years ago
Well… none at all if you’re happy with ~1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

You’d need 2 x 3090 or an A6000 or something to do it quickly.
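For a rough sense of why a single 24 GB card ends up on partial offload, here is a back-of-the-envelope sketch of weight memory only. It assumes exactly 4 or 16 bits per weight and ignores the KV cache, activations, and runtime overhead, and real 4-bit GGUF quants usually store somewhat more than 4 bits per weight, so treat the numbers as lower bounds. The function name `weight_gb` and the 24 GB / 48 GB budgets (one 3090 vs. two 3090s or a 48 GB A6000) are illustrative choices, not anything from a specific tool.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Model sizes from the question; bit widths are idealized assumptions.
configs = [
    ("7B fp16", 7, 16),
    ("13B fp16", 13, 16),
    ("34B fp16", 34, 16),
    ("70B 4-bit", 70, 4),
]

for name, params_b, bits in configs:
    size = weight_gb(params_b, bits)
    print(
        f"{name}: ~{size:.0f} GB of weights, "
        f"fits in 24 GB: {size <= 24}, fits in 48 GB: {size <= 48}"
    )
```

Under these assumptions the 70B model at 4-bit needs roughly 35 GB for weights, which is more than one 3090 holds but within 48 GB, and it is still smaller than a 34B model at fp16.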