I plan to run inference on 33B models at full precision; 70B is second priority but would be a nice touch. Would I be better off getting an AMD EPYC server CPU like this or an RTX 4090? With the EPYC, I can get 384GB of DDR4 RAM for ~400 USD on eBay, while the 4090 only has 24GB. Moreover, both the 4090 and the EPYC setup with RAM cost about the same. Which would be the better buy?

  • easyllaama@alien.topB
    1 year ago

    My AMD 7950X3D (16 cores, 32 threads) with 64GB DDR5 and a single RTX 4090 can run 13B Xwin GGUF q8 at 45 t/s. With exllamav2, 2x 4090s can run 70B q4 at 15 t/s. The motherboard is an Asus ProArt AM5. For local LLaMA inference, I think you can get similar speeds with RTX 3090s, but in Stable Diffusion the 4090 is about 70% faster.

  • mcmoose1900@alien.topB
    1 year ago

    If you must run at high precision… the best system in that budget is probably a compromise?

    Grab a 3090 or 3060 and pair it with as much RAM bandwidth as you can get, on a more modest CPU. The GPU will take over prompt processing and enough of the model's layers to speed up generation.

  • multiverse_fan@alien.topB
    1 year ago

    If I had the money, I’d go with the CPU.

    Also, I’m not sure a 4090 could run 33B models at full precision. Wouldn’t that require something like 70GB of VRAM?

  • tvetus@alien.topB
    1 year ago

    at full precision

    Full precision is not as useful as you think. Even at 4bit, the losses are not that large.

    70B

    What is your motivation for such large models? You’re sacrificing a lot of speed for the larger model.

  • extopico@alien.topB
    1 year ago

    Unquantized, 34B models require at least 65GB of memory, plus extra depending on context length. I cannot see how your comparison of alternatives works.

    You would need 3x 4090s just to hold the model.
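    The arithmetic behind this is straightforward. A rough sketch (weights only; real usage adds KV cache and framework overhead on top):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the model weights."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9  # result in GB

# 34B at full precision (FP16, 16 bits/weight): ~68 GB of weights alone,
# i.e. three 24GB cards before any context overhead
print(model_memory_gb(34, 16))  # 68.0

# The same model quantized to 4-bit: ~17 GB, which fits on a single 24GB card
print(model_memory_gb(34, 4))   # 17.0
```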

  • Mastershima@alien.topB
    1 year ago

    The motherboard and RAM alone cost another $2300… The CPU, RAM, and motherboard together look like $4500 rounded down… How does this make sense versus three 3090s?

  • XTJ7@alien.topB
    1 year ago

    I tried a 70B model on my 48-core EPYC with 512GB RAM and it was unusable. I think 1.5 t/s or so? Even if you double that, it’s not great. My M1 Ultra runs it comfortably at 6-7 t/s and sips power.

    Probably a dual-3090 setup is the most cost-effective solution at the moment, while the M1/M2 Ultra is the most power-efficient one.

    • Silent-Edge1414@alien.topB
      1 year ago

      What is the frequency of your RAM: DDR4 2666 MHz, DDR4 3200 MHz, or DDR5 4800 MHz? And how is it installed on the motherboard: 4x128GB (quad channel) or 8x64GB (octa channel)?

      RAM bandwidth is the most important factor for LLM token generation, as it is usually the bottleneck. With a 32-or-more-core EPYC 7003 CPU in octa-channel DDR4-3200, you can expect 3 to 4 tokens/s on a 70B model, equivalent to roughly 200 GB/s of VRAM-class bandwidth.

      For OP, a 48-or-more-core CPU (Genoa) with 12-channel DDR5-4800 can expect 6 to 8 tokens/s on a 70B model, equivalent to roughly 400 GB/s.

      An RTX 4090 is around 1000 GB/s, but with only 24GB of VRAM. GPUs are generally much faster than desktop CPUs at prompt processing (over 100 times faster). I don’t know about modern server CPUs (Genoa); normally they are faster at prompt processing than desktop CPUs, as they natively support FP16/BF16 operations.

      But take this with a pinch of salt, as I don’t have these configurations at hand, so you’ll have to ask someone who does.
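      The estimates above follow from a simple model: token generation is memory-bound, so each generated token has to stream the full set of weights from RAM or VRAM once, giving tokens/s ≈ bandwidth / model size. A sketch of that back-of-the-envelope calculation (an upper bound; it ignores KV cache traffic and compute limits, so real throughput lands below it, roughly matching the 3-4 and 6-8 t/s figures quoted):

```python
def tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                      bits_per_weight: int) -> float:
    """Rough ceiling on generation speed for a memory-bound LLM:
    every token reads all the weights once from memory."""
    model_gb = params_billion * bits_per_weight / 8  # weight size in GB
    return bandwidth_gb_s / model_gb

# 70B at 4-bit (~35 GB) on ~200 GB/s octa-channel DDR4-3200
print(round(tokens_per_second(200, 70, 4), 1))  # 5.7 t/s ceiling

# Same model on ~400 GB/s 12-channel DDR5-4800
print(round(tokens_per_second(400, 70, 4), 1))  # 11.4 t/s ceiling
```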

      • XTJ7@alien.topB
        1 year ago

        In my case it’s an EPYC 7642 with 8x64GB DDR4-2666, so that may be why my generation is significantly slower.

        I find anything below 5 tokens per second not really usable, so that’s why I stick with my M1 Ultra. It has plenty of very fast RAM, which most likely explains why it performs so well, if LLMs are that dependent on fast memory.

        I also have a 3090 in another machine, but that’s also just 24GB, and I don’t want to shell out more money right now just for playing with LLMs if the M1 Ultra is doing well enough :)

        • runforpeace2021@alien.topB
          1 year ago

          If you like the M1 Ultra, wait until the M3 Ultra. The M3 Max already smokes the M1 Max with 3x the inference speed.

          So expect the M3 Ultra to be in the 20 t/s range.

          • XTJ7@alien.topB
            1 year ago

            For sure! But the M1 Ultra still holds up really well. I doubt I’ll replace it for at least another 3 years. CPUs are currently progressing at an impressive rate across the board. Would I like an M3 Ultra? Sure, but do I really need it? Sadly no :) The upgrade to an M5 Ultra will be insane though.

  • fallingdowndizzyvr@alien.topB
    1 year ago

    I plan to infer 33B models at full precision

    At full precision, as in FP16, you are not going to be able to fit it in a 4090. So if that’s your goal, between the choices you’ve given there is only one. But it won’t be speedy.