I am going to build an LLM server very soon, targeting 34B models (specifically a 4-bit quant of phind-codellama-34b-v2 in GGUF, GPTQ, or AWQ format).

I am stuck between these two setups:

  1. 12400 + DDR5-6000 CL30 + 4060 Ti 16GB (GGUF; split the workload between CPU and GPU)
  2. 3090 (GPTQ/AWQ model fully loaded in GPU)

Not sure if the speed bump of the 3090 is worth the hefty price increase. Does anyone have benchmarks/data comparing these two setups?

BTW: Alder Lake CPUs run DDR5 in Gear 2, while AM5 CPUs can run DDR5-6000 with the memory controller at 1:1, the equivalent of Gear 1 (AM4 is DDR4-only). AFAIK Gear 1 offers lower latency. Would this give AM5 a big advantage when it comes to LLMs?

  • Woof9000@alien.topB
    1 year ago

    I have a 4060 Ti 16GB and I’m quite happy with it for the price I paid. But if you can actually afford a 3090, enough to be considering it as an option, then you should probably go with the 3090 instead. It will perform significantly better.

  • FutureIsMine@alien.topB
    1 year ago

    You have to look at the memory clocks and data transfer rates of the GPUs. What you’re going to see is that only the xx80 and xx90 cards have enough memory bandwidth to make use of all that VRAM, so the 4060 Ti, despite its 16GB, can’t actually move that much data around quickly.

  • fallingdowndizzyvr@alien.topB
    1 year ago

    There’s really no comparison. The 4060s, even the Ti, have crap for memory bandwidth. 288GB/s in the case of the Ti. DDR5 is also not fast enough to make much difference. So that combo is not going to be speedy. It in no way compares to a 3090.
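
For a rough sense of scale: token generation is usually limited by how fast the weights can be read, so a back-of-envelope ceiling is memory bandwidth divided by the bytes read per token. The sketch below applies that to both setups. The 288 GB/s figure comes from the comment above and 936 GB/s is the 3090's published spec; the 20 GB model size, the 70% GPU share on the 16GB card, the 96 GB/s dual-channel DDR5-6000 peak, and the `decode_tokens_per_second` helper are assumptions for illustration. Real throughput will land below these ceilings.

```python
def decode_tokens_per_second(model_gb, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Bandwidth-bound ceiling on tokens/s, assuming every weight is read once per token."""
    gpu_time_s = model_gb * gpu_fraction / gpu_bw_gbs
    cpu_time_s = model_gb * (1.0 - gpu_fraction) / cpu_bw_gbs
    return 1.0 / (gpu_time_s + cpu_time_s)


MODEL_GB = 20.0            # assumption: approximate size of a 34B model at ~4.5 bits/weight
RTX_4060_TI_BW = 288.0     # GB/s, as quoted in the thread
RTX_3090_BW = 936.0        # GB/s, published spec
DDR5_6000_DUAL_BW = 96.0   # GB/s, theoretical peak: 2 channels x 8 bytes x 6000 MT/s

# Setup 1: assume about 70% of the weights fit in the 16GB card, the rest stays in system RAM.
split = decode_tokens_per_second(MODEL_GB, 0.7, RTX_4060_TI_BW, DDR5_6000_DUAL_BW)

# Setup 2: the whole model sits in the 3090's 24GB.
full = decode_tokens_per_second(MODEL_GB, 1.0, RTX_3090_BW, DDR5_6000_DUAL_BW)

print(f"4060 Ti + DDR5 split ceiling: {split:.1f} tokens/s")
print(f"3090 fully loaded ceiling:    {full:.1f} tokens/s")
print(f"ratio: {full / split:.1f}x")
```

Under these assumptions, the 30% of the model left in system RAM costs more time per token than the 70% on the GPU, which is why the split setup falls so far behind.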

    • Caffeine_Monster@alien.topB
      1 year ago

      "DDR5 is also not fast enough to make much difference"

      The real issue is that consumer CPUs and motherboards have very few memory channels. DDR5 itself is plenty fast, but on a dual-channel platform you are probably maxing out memory bandwidth with two sticks.

      It would not surprise me at all if server CPU inference is somewhere between 3x and 5x faster.
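
The 3x to 5x guess lines up with theoretical peak bandwidth, which scales with channel count. A minimal sketch, assuming a 64-bit (8-byte) data path per DDR5 channel, and taking 8-channel and 12-channel DDR5-4800 as example server configurations; `peak_bandwidth_gbs` is a hypothetical helper, and sustained bandwidth in practice is lower than these peaks.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s: channels x bytes per transfer x MT/s."""
    return channels * bus_bytes * mega_transfers_per_s / 1000.0


desktop = peak_bandwidth_gbs(2, 6000)       # dual-channel DDR5-6000 (consumer platform)
server_8ch = peak_bandwidth_gbs(8, 4800)    # assumed 8-channel DDR5-4800 server
server_12ch = peak_bandwidth_gbs(12, 4800)  # assumed 12-channel DDR5-4800 server

for name, bw in [("desktop 2ch", desktop), ("server 8ch", server_8ch), ("server 12ch", server_12ch)]:
    print(f"{name}: {bw:.1f} GB/s ({bw / desktop:.1f}x desktop)")
```

Even the 12-channel peak is roughly half of a 3090's memory bandwidth, so more channels narrow the gap with a GPU without closing it.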

  • mcmoose1900@alien.topB
    1 year ago

    Here’s a 7B llama.cpp bench on a 3090 and 7800X3D, with CL28 DDR5 6000 RAM.

    All layers offloaded to GPU:

    Generation:5.94s (11.6ms/T), Total:5.95s (86.05T/s)

    And here is just 2/35 layers offloaded to CPU:

    Generation:7.59s (14.8ms/T), Total:7.75s (66.10T/s)

    As you can see, the moment you offload even a little bit to CPU, you are going to hit performance hard. More than a few layers and the hit is very severe.
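
The two llama.cpp timings above can be turned into a rough per-layer cost. This is only a sketch: it assumes the per-token time divides evenly across the 35 layers and ignores sampling and transfer overhead, and `predicted_ms_per_token` is a hypothetical helper that extrapolates linearly from just these two data points.

```python
TOTAL_LAYERS = 35
ALL_GPU_MS_PER_TOKEN = 11.6     # from the benchmark: all layers on the GPU
TWO_ON_CPU_MS_PER_TOKEN = 14.8  # from the benchmark: 2 of 35 layers on the CPU
CPU_LAYERS_IN_BENCH = 2

# Assumption: every layer costs the same, so GPU time per layer is the total divided by 35.
gpu_ms_per_layer = ALL_GPU_MS_PER_TOKEN / TOTAL_LAYERS

# Whatever time the 33 GPU layers do not account for is attributed to the 2 CPU layers.
gpu_part_ms = gpu_ms_per_layer * (TOTAL_LAYERS - CPU_LAYERS_IN_BENCH)
cpu_ms_per_layer = (TWO_ON_CPU_MS_PER_TOKEN - gpu_part_ms) / CPU_LAYERS_IN_BENCH

slowdown = cpu_ms_per_layer / gpu_ms_per_layer


def predicted_ms_per_token(cpu_layers):
    """Linear extrapolation of per-token latency for a given number of CPU layers."""
    return gpu_ms_per_layer * (TOTAL_LAYERS - cpu_layers) + cpu_ms_per_layer * cpu_layers


print(f"GPU layer: {gpu_ms_per_layer:.2f} ms, CPU layer: {cpu_ms_per_layer:.2f} ms")
print(f"a CPU layer costs about {slowdown:.1f}x a GPU layer")
for n in (0, 2, 10, 20):
    ms = predicted_ms_per_token(n)
    print(f"{n:2d} CPU layers: {ms:.1f} ms/token, {1000.0 / ms:.1f} tokens/s")
```

On this reading each layer moved to the CPU costs several times what it did on the GPU, so the loss grows steadily with every additional layer.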

    Here is exllamav2 for reference, though the time also includes prompt processing, so generation is actually faster than indicated:

    3.91 seconds, 512 tokens, 130.83 tokens/second (includes prompt eval.)

  • tntdeez@alien.topB
    1 year ago

    I’d do the 4060 Ti and add a 16GB P100 to the mix to avoid doing any CPU inference. Use exl2. Otherwise I’d go 3090. CPU is slowww.

  • candre23@alien.topB
    1 year ago

    The 3090 will outperform the 4060 several times over. It’s not even a competition - it’s a slaughter.

    As soon as you have to offload even a single layer to system memory (regardless of its speed), you take a serious performance hit, up to an order of magnitude as more layers spill over. I don’t care if you have screaming-fast DDR5 in 8 channels and a pair of the beefiest Xeons money can buy, your performance will fall off a cliff the minute you start offloading. If a 3090 is within your budget, that is the unambiguous answer.