Just wondering if anyone with more knowledge of server hardware could point me in the right direction for getting an 8-channel DDR4 server up and running. Estimated memory bandwidth is around 200 GB/s, so I would think it would be plenty for LLM inference.
I would prefer to go with used server hardware because of price. Compared with a bunch of P40s providing the same amount of memory, the power consumption is drastically lower. I'm just not sure how fast a slightly older server CPU can handle inference.

If I was looking to run 80-120 GB models, would 200 GB/s and dual 24-core CPUs get me 3-5 tokens a second?

  • Aphid_red@alien.topB
    1 year ago

    To get 3-5 tokens a second on 120 GB models requires a minimum of 360-600 GB/s of memory throughput (each generated token reads the whole model once, so just multiply model size by tokens per second). In practice you likely need about 30% more, since you rarely reach the theoretical maximum RAM throughput and evaluating the LLM involves more steps than just the matmuls. So 468-780 GB/s.
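
    The arithmetic above can be sketched as follows. This is a rough estimate only, assuming a dense model whose weights are all read once per generated token; the function names and the 30% overhead default are illustrative, not measurements.

```python
def required_bandwidth_gbs(model_size_gb, tokens_per_s, overhead=0.30):
    """Return (ideal, with_overhead) memory bandwidth in GB/s.

    Assumption: every generated token streams the whole model from RAM once.
    """
    ideal = model_size_gb * tokens_per_s
    return ideal, ideal * (1.0 + overhead)


def max_tokens_per_s(bandwidth_gbs, model_size_gb, overhead=0.30):
    """Invert the estimate: tokens/s that a given bandwidth could sustain."""
    return bandwidth_gbs / (model_size_gb * (1.0 + overhead))


# Bandwidth needed for 3 and 5 tokens/s on a 120 GB model.
for tps in (3, 5):
    ideal, realistic = required_bandwidth_gbs(120, tps)
    print(f"{tps} tok/s: {ideal:.0f} GB/s ideal, {realistic:.0f} GB/s with overhead")

# What the original poster's ~200 GB/s might sustain on 80 and 120 GB models.
for size in (80, 120):
    print(f"{size} GB model at 200 GB/s: ~{max_tokens_per_s(200, size):.1f} tok/s")
```

    Under the same assumptions, 200 GB/s lands well below 3 tokens a second for models in the 80-120 GB range.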

    This might be what you’re looking for, as a platform base:

    https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM1-rev-10

    24 channels of DDR5 gets you up to 920 GB/s of total memory throughput, so that meets the criterion. About as much as a high-end GPU, actually. The numbers on Genoa look surprisingly good. Well, maybe not the power consumption: ~1100 W for CPUs and RAM is a lot more than the ~300 W an A100 would use, and you could probably power-limit the A100 to 150 W and it would still be faster.
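
    For reference, the theoretical peak follows from channel count times transfer rate times bus width. A minimal sketch, assuming DDR5-4800 for the Genoa board and DDR4-3200 for the 8-channel server in the question (neither speed grade is stated in the thread), with 8 bytes of data per transfer per channel:

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: each channel moves 64 bits (8 bytes) of data per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Dual-socket Genoa board: 2 sockets x 12 channels, assumed DDR5-4800.
genoa = peak_bandwidth_gbs(24, 4800)

# 8-channel DDR4 server from the question, assumed DDR4-3200.
ddr4 = peak_bandwidth_gbs(8, 3200)

print(f"24ch DDR5-4800: {genoa:.1f} GB/s")
print(f" 8ch DDR4-3200: {ddr4:.1f} GB/s")
```

    These are ceilings; sustained throughput is lower, and on a dual-socket board the total is only reachable if each CPU mostly reads from its own local memory.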

    Of course, during prompt processing you'll be bottlenecked by CPU speed. I'd estimate a 32-core Genoa CPU does ~2 TFLOPS or so of fp64 (based on the 9654's 5.4 TFLOPS; it'll be a bit more than a third of that because of the higher clock speed), so perhaps 4 TFLOPS of fp32 (as far as I know fp16 isn't a native instruction on Genoa yet, and fp32 should be 2x fp64 using AVX). Compare that with 36 TFLOPS for the 3090: prompt processing, which is compute-limited, is going to run at about 1/5th the speed with two CPUs, or 1/10th if the software isn't optimized for NUMA. Honestly, that's not too bad. But if you want the best of both worlds, add a 3090, 4090 or 7900 XTX and offload the prompt processing to it with BLAS. You get decent inference speed for a huge model (roughly equal to or better than anything except an A100/H100) and also good prompt processing, as the KV cache should fit in GPU memory.
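
    The compute comparison above works out roughly like this. All inputs are the estimates from this comment, not benchmarks; modelling the NUMA penalty as a flat 50% efficiency factor is my own simplification.

```python
def cpu_fp32_tflops(fp64_tflops_per_socket, sockets=2, numa_efficiency=1.0):
    """Estimated aggregate fp32 throughput in TFLOPS.

    Assumptions: fp32 runs at 2x the fp64 rate with AVX, and
    numa_efficiency scales how well both sockets are actually used.
    """
    return fp64_tflops_per_socket * 2.0 * sockets * numa_efficiency


GPU_3090_TFLOPS = 36.0  # fp32 figure quoted in the comment above

# ~2 TFLOPS fp64 per 32-core Genoa CPU, two sockets.
numa_aware = cpu_fp32_tflops(2.0)
numa_unaware = cpu_fp32_tflops(2.0, numa_efficiency=0.5)

print(f"NUMA-aware:   {numa_aware:.0f} TFLOPS, 3090 is {GPU_3090_TFLOPS / numa_aware:.1f}x faster")
print(f"NUMA-unaware: {numa_unaware:.0f} TFLOPS, 3090 is {GPU_3090_TFLOPS / numa_unaware:.1f}x faster")
```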

    As far as CPU prices go, the 9334 seems to range from about $700 (used, qualification samples) to $2700 (new), and would have the core count. A step up is the 9354, which has the full cache size. That might be relevant for inference.

    • jasonmbrown@alien.topOPB
      1 year ago

      I appreciate the info; this is probably the closest to what I am asking for. It seems that no matter what I look at, unless I have 10,000 to fork over I am going to be restricted in some way or another.