The title, pretty much.

I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. Would be great to get some insights from the community.

    • Sea_Particular_4014@alien.topB
      1 year ago

      Well… none at all if you’re happy with ~1 token per second or less using GGUF CPU inference.

      I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

      You’d need 2 x 3090 or an A6000 or something to do it quickly.

    • Dusty_da_Cat@alien.topB
      1 year ago

      The gold standard is 2 x 3090/4090 cards, which is 48 GB of VRAM total. You can get by with 2 x P40s (they need a cooling solution) and run onboard video if you want to save some money. The speeds will be slower, but still better than running from system RAM on typical setups.

      • Dry-Vermicelli-682@alien.topB
        1 year ago

        44GB of GPU VRAM? What GPU has 44GB other than the stupidly expensive ones? Are average folks running $25K GPUs at home? Or are the people running these working for companies with lots of money and building small GPU servers?

      • harrro@alien.topB
        1 year ago

        Using Q3, you can fit it in 36GB (I have a weird combo of an RTX 3060 with 12GB and a P40 with 24GB, and I can run a 70B at 3-bit fully on GPU).
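
For a rough sanity check on the VRAM numbers in this thread, here is a back-of-envelope sketch. It counts only the weights (parameters times bits per weight) and ignores the KV cache and runtime overhead, so treat the results as lower bounds. The function name `weight_gb` is made up for illustration, and real GGUF quants usually store somewhat more bits per weight than their nominal label suggests.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal GB.

    Assumption: every parameter is stored at exactly `bits_per_weight`.
    Real quantized files are a bit larger, and inference needs extra
    memory for the KV cache and buffers on top of this.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare the configurations discussed in the thread.
    for label, params, bits in [
        ("70B @ 4-bit", 70, 4),
        ("70B @ 3-bit", 70, 3),
        ("34B @ fp16", 34, 16),
        ("13B @ fp16", 13, 16),
        ("7B @ fp16", 7, 16),
    ]:
        print(f"{label}: ~{weight_gb(params, bits):.1f} GB for weights")
```

By this estimate a 70B model at 4-bit needs roughly 35 GB for weights, which is why 48 GB across two cards is comfortable and 36 GB only works at around 3-bit. Note that a 34B model at fp16 needs more memory than a 70B at 4-bit.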

          • harrro@alien.topB
            1 year ago

            Yes, llama.cpp will automatically split the model to work across GPUs. You can also specify how much of the full model should be on each GPU.

            Not sure about AMD support, but for NVIDIA it’s pretty easy to do.
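
As a sketch of what specifying the per-GPU share can look like, the helper below builds a llama.cpp command line that offloads layers to the GPUs and splits the model in proportion to each card's VRAM. The `-m`, `-ngl` (`--n-gpu-layers`) and `--tensor-split` options are llama.cpp flags; the binary name `llama-cli` (older builds call it `main`), the model filename, and the helper `build_llama_cmd` are assumptions for illustration, so check `--help` on your own build.

```python
import shlex


def build_llama_cmd(model_path, vram_gb_per_gpu, n_gpu_layers=99, binary="llama-cli"):
    """Build an argument list for a llama.cpp run split across several GPUs.

    Assumption: splitting in proportion to each card's VRAM is a reasonable
    starting point. --tensor-split takes comma-separated proportions, so the
    raw VRAM sizes can be passed as-is. A large n_gpu_layers value asks
    llama.cpp to offload as many layers as the model has.
    """
    split = ",".join(str(gb) for gb in vram_gb_per_gpu)
    return [
        binary,
        "-m", model_path,
        "-ngl", str(n_gpu_layers),
        "--tensor-split", split,
    ]


if __name__ == "__main__":
    # Hypothetical file name; 12 GB RTX 3060 + 24 GB P40, as in the comment above.
    cmd = build_llama_cmd("llama-70b.Q3_K_M.gguf", [12, 24])
    print(shlex.join(cmd))
```

Returning a list rather than a single string means it can be handed straight to `subprocess.run` without shell quoting problems.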