The title, pretty much.

I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. It would be great to get some insights from the community.

  • Sea_Particular_4014@alien.topB
    10 months ago

    Well… none at all if you’re happy with ~1 token per second or less using GGUF CPU inference.

    I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

    You’d need 2 x 3090 or an A6000 or something to do it quickly.
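
For a rough sense of why one 24GB card needs partial offload while two fit the model, here is a back-of-the-envelope sketch of weight sizes. It counts weights only and assumes exactly 4 or 16 bits per weight; real 4-bit quantized files usually average a bit more than 4 bits per weight, and the KV cache and runtime overhead come on top. The function names `weight_gb` and `fits_in_vram` are made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    Assumption: every parameter takes exactly `bits_per_weight` bits.
    KV cache, activations and framework overhead are NOT included.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (optimistic check)."""
    return weight_gb(params_billion, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    configs = [
        ("70B @ 4-bit", 70, 4),
        ("34B @ fp16", 34, 16),
        ("13B @ fp16", 13, 16),
        ("7B @ fp16", 7, 16),
    ]
    for name, params, bits in configs:
        size = weight_gb(params, bits)
        one_card = fits_in_vram(params, bits, 24)   # 1 x 3090
        two_cards = fits_in_vram(params, bits, 48)  # 2 x 3090 or an A6000
        print(f"{name}: ~{size:.0f} GB weights, "
              f"fits 24GB: {one_card}, fits 48GB: {two_cards}")
```

By this estimate the 70B model at 4-bit is around 35 GB of weights, which is over a single 3090's 24GB but under the 48GB of two 3090s or an A6000. Treat the numbers as a lower bound, since actual usage will be higher.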