If CPU processing is painfully slow and GPUs cost $$$$ to get enough memory for larger models, I’m wondering whether an APU could deliver some of that GPU speed while using cheaper RAM to fit the larger models in memory. With 128 GB of RAM, that’s roughly the combined VRAM of five or six 24 GB 3090s/4090s, before even allowing for overhead!
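
For a rough sense of what that RAM actually buys, here’s a back-of-the-envelope sketch (Python; parameter counts and bytes-per-weight are rounded assumptions, and real checkpoints add overhead for KV cache and runtime buffers):

```python
# Weight-only memory footprints vs. what fits where. Numbers are rough;
# real files add overhead (KV cache, activations, runtime buffers).
GIB = 1024**3
models = {"7B": 7e9, "13B": 13e9, "34B": 34e9, "70B": 70e9}
bytes_per_weight = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

ram_gib, vram_gib = 128, 24   # system RAM from the post vs. one 3090/4090

for name, params in models.items():
    for quant, bpw in bytes_per_weight.items():
        gib = params * bpw / GIB
        where = "24 GB GPU" if gib <= vram_gib else ("128 GB RAM" if gib <= ram_gib else "neither")
        print(f"{name:>3} {quant:>4}: {gib:6.1f} GiB -> fits in {where}")
```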

Does anyone have any current APU benchmarks vs. CPU/GPU? And can the GPU side of an APU actually be used to get a speedup over traditional CPU inference?

I’ve been seeing a lot of claims that the Ryzen 8000 series is going to compete with low-end GPUs; some people say all the way up to a 3060.

If it’s possible, it might become the new cheapest way to get large models running?

  • CKtalon@alien.topB · 10 months ago

    Although memory bandwidth is the most important factor for inference, FLOPS still matter. APUs are just too slow, so the bottleneck would shift to computing all those matrix operations (and that assumes the APU even gets Apple-style high-bandwidth memory, which I doubt).
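
    A crude way to see where that bottleneck lands: per generated token, compare the time to stream the weights against the time to do the matching math; whichever is slower sets the rate. The bandwidth/TFLOPS figures below are assumed ballpark numbers, not benchmarks:

    ```python
    # Per token, a dense transformer streams ~all weights once and does
    # roughly 2 FLOPs per weight; the slower path sets the token rate.
    def token_limit(params, bytes_per_weight, mem_bw_gbs, tflops):
        t_mem = params * bytes_per_weight / (mem_bw_gbs * 1e9)  # read weights
        t_cmp = 2 * params / (tflops * 1e12)                    # do the math
        return 1 / max(t_mem, t_cmp), "memory" if t_mem >= t_cmp else "compute"

    setups = {                            # (GB/s, TFLOPS) -- rough guesses
        "APU, dual-channel DDR5":  (90, 8),
        "weak compute + 400 GB/s": (400, 1),   # hypothetical Apple-like BW
        "RTX 3090":                (936, 35),
    }
    for name, (bw, tf) in setups.items():
        tps, limit = token_limit(70e9, 0.5, bw, tf)   # 70B at ~4-bit
        print(f"{name:<26} <= {tps:5.1f} tok/s ({limit}-bound)")
    ```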

  • vikarti_anatra@alien.topB · 10 months ago

    Unlikely. As far as I understand it, the first limit isn’t even the matrix-multiplication cores, it’s memory bandwidth, and the fix for that is faster RAM and more memory channels.

  • AnomalyNexus@alien.topB · 10 months ago

    It’s vaguely how Macs work.

    The current APUs are still quite slow, but maybe that will change. Also, in most cases you have to carve out a dedicated chunk of memory for the GPU, so it’s not quite shared.

    • RayIsLazy@alien.topB · 10 months ago

      The new Oryon CPU from Qualcomm looks pretty good, arguably better than the M2, but for Windows.

      • FlishFlashman@alien.topB · 10 months ago

        A chip that won’t be available for ~6 months will be better than a chip that came out a year ago? Amazing ;)

  • ccbadd@alien.topB · 10 months ago

    If AMD would put out an APU with 3D V-Cache and quad-channel memory that lets you run all four slots at full speed (6000 MT/s or better), and didn’t cripple it in the BIOS, they could be kicking Apple’s tail.

    • he29@alien.topB · 10 months ago

      I’m not sure if 3D cache would help in this case, since there isn’t a particular small part of the model that could be reused over and over: you have to read _all_ the weights when inferring the next word, right?

      But I’m definitely looking forward to the 8000 series, since AM5 boards should get even cheaper by the time it comes out, and support for faster DDR5 should improve as well. And I really need to move on from my 10-year-old Xeon, haha…

      • ccbadd@alien.topB · 10 months ago

        I didn’t think 3D V-Cache would help either, until the article from a few days ago about getting 10x the performance out of a RAM drive. If it works for RAM drives, surely we can figure out a way to use that performance for inference.

        • FlishFlashman@alien.topB · 10 months ago

          It’s not going to help because the model data is much larger than the cache and the access pattern is basically long sequential reads.
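
          A quick scale check (assuming a 70B model at ~4-bit and an X3D-class ~96 MiB L3; both figures are approximations):

          ```python
          # One token streams the whole weight file, so there is no small
          # hot set for a big L3 / 3D V-Cache to capture.
          MIB = 1024**2
          l3_mib = 96                       # e.g. Ryzen X3D-class L3, approx.
          weights_mib = 70e9 * 0.5 / MIB    # 70B model at ~4-bit
          print(f"L3 cache       : ~{l3_mib} MiB")
          print(f"read per token : ~{weights_mib:,.0f} MiB")
          print(f"mismatch       : ~{weights_mib / l3_mib:,.0f}x")
          ```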

          • rarted_tarp@alien.topB · 10 months ago

            It might help for LLMs, since a lot of values are cached and reused on each decoding loop, but it’s still highly unlikely to make a difference.
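
            Even that reused part (the KV cache) outgrows a big L3 quickly; a rough sizing sketch, assuming Llama-2-70B-style dimensions:

            ```python
            # KV-cache bytes per token of context (80 layers, 8 KV heads via
            # GQA, head_dim 128, fp16 cache) -- assumptions, not measured.
            layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
            kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
            l3_bytes = 96 * 1024**2                                        # ~96 MiB X3D L3
            print(f"KV cache per token: {kv_per_token // 1024} KiB")
            print(f"context at which KV alone overflows L3: ~{l3_bytes // kv_per_token} tokens")
            ```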

  • Zemanyak@alien.topB · 10 months ago

    Does anybody have benchmarks or numbers comparing tokens/sec for GPU, DDR4, DDR5, and CPU inference? I don’t care which hardware or LLMs, I just want a rough idea.

  • MINIMAN10001@alien.topB · 10 months ago

    It’s not that CPUs are slow; it’s that the RAM the CPU is connected to is typically slow.

    That’s why unified memory is fast: it’s simply faster memory connected directly to the CPU.
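
    For concrete numbers, peak DDR bandwidth is just channels × transfer rate × 8 bytes per 64-bit channel; the Apple unified-memory figures listed for comparison are published specs:

    ```python
    # Peak theoretical DRAM bandwidth; real-world throughput lands lower.
    def ddr_bw_gbs(channels, mts):
        return channels * mts * 8 / 1000   # 8 bytes per transfer per channel

    print("dual-channel DDR4-3200:", ddr_bw_gbs(2, 3200), "GB/s")   # 51.2
    print("dual-channel DDR5-6000:", ddr_bw_gbs(2, 6000), "GB/s")   # 96.0
    # For comparison, Apple's published unified-memory bandwidth:
    # M2 Pro ~200 GB/s, M2 Max ~400 GB/s, M2 Ultra ~800 GB/s.
    ```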

    • rarted_tarp@alien.topB · 10 months ago

      UMA has a lot more to do with the speed than physical distance does, and a GPU has a very different architecture and memory-access patterns than a CPU.

  • Astronomer3007@alien.topB · 10 months ago

    Quad-channel DDR5 above 6400 MT/s should hit about 200 GB/s of bandwidth. I wonder how many tokens/s that setup would get on 34B and 70B models.
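
    The bandwidth-bound ceiling is easy to ballpark, assuming ~4-bit weights and that every token streams the whole model once (real throughput lands below this):

    ```python
    # tokens/s <= bandwidth / weight bytes streamed per token.
    # Ignores KV-cache and activation traffic.
    bandwidth_gbs = 200                 # ~quad-channel DDR5-6400, theoretical
    for params in (34e9, 70e9):
        weight_bytes = params * 0.5     # ~4-bit quantization
        print(f"{params / 1e9:.0f}B: <= {bandwidth_gbs * 1e9 / weight_bytes:.1f} tok/s")
    ```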