I have a server with 512 GB of RAM and 2x Intel Xeon Gold 6154 CPUs. It will have a spare PCIe 3.0 x16 slot once I get rid of my current GPU.

I’d like to add a better GPU so I can generate paper summaries (the responses can take a few minutes to come back) that are significantly better than what I get now with 4-bit Llama 2 13B. Does anyone know the minimum GPU I should be looking at with this setup to be able to upgrade to the 70B model? Will hybrid CPU+GPU inference with an RTX 4090 24GB be enough?

  • Sea_Particular_4014@alien.top
    1 year ago

    Your 512GB of RAM is overkill. Those Xeons are probably pretty mediocre for this sort of thing due to the slow memory, unfortunately.

    With a 4090 or 3090, you should get about 2 tokens per second with GGUF q4_k_m inference. That’s what I do and find it tolerable but it depends on your use case.

    You’d need a 48GB GPU or fast DDR5 RAM to get faster generation than that.
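
For a rough sense of where numbers like these come from, here is a back-of-the-envelope sketch. It assumes token generation is limited purely by memory bandwidth, with every weight read once per token. All the figures in it are my own assumptions, not measurements: a roughly 41 GB file for 70B at q4_k_m, about 22 GB of VRAM left for weights on a 24GB card, about 1000 GB/s on the GPU, and about 60 GB/s of effective system RAM bandwidth. The function name is made up for illustration. Treat the result as an optimistic ceiling, since it ignores compute, the KV cache, and PCIe or NUMA overhead.

```python
def estimate_tokens_per_second(model_gb, vram_gb, gpu_bw_gbs, cpu_bw_gbs):
    """Bandwidth-only ceiling on generation speed for split GPU/CPU inference.

    Assumes each generated token reads every weight exactly once, and that
    the GPU-resident and CPU-resident layers are processed one after the other.
    """
    gpu_part = min(model_gb, vram_gb)   # weights that fit in VRAM
    cpu_part = model_gb - gpu_part      # remainder served from system RAM
    seconds_per_token = gpu_part / gpu_bw_gbs + cpu_part / cpu_bw_gbs
    return 1.0 / seconds_per_token


# Assumed figures (not measured): adjust them for your own hardware.
MODEL_GB = 41.0      # approximate size of a 70B q4_k_m GGUF file
GPU_BW_GBS = 1000.0  # roughly a 3090/4090-class card
CPU_BW_GBS = 60.0    # guessed effective bandwidth for older DDR4 Xeons

hybrid = estimate_tokens_per_second(MODEL_GB, 22.0, GPU_BW_GBS, CPU_BW_GBS)
all_gpu = estimate_tokens_per_second(MODEL_GB, 48.0, GPU_BW_GBS, CPU_BW_GBS)

print(f"24GB card + system RAM: ~{hybrid:.1f} tokens/s ceiling")
print(f"48GB of VRAM:           ~{all_gpu:.1f} tokens/s ceiling")
```

Under these assumptions, the part of the model that spills into system RAM accounts for nearly all of the time per token, which is why more VRAM or faster RAM moves the number far more than a faster GPU does.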

    • Dankmre@alien.top
      1 year ago

      OP seems to want 5-10 T/s on a budget with 70B… Not going to happen, I think.