Hi all, I need some help from you. I am going to buy H100s for training LLMs, currently for fine-tuning 70B models, but later we may consider pre-training larger models too. H100s look more promising than A100s given their FP8 support, so I asked multiple vendors for quotes. And then I realized there are too many options!

  1. DGX - 8x H100, much more expensive than other options but they say its performance is worth it.

  2. Buy PCI-E H100 cards and a Supermicro machine - from 2x up to 8x, looks cost-effective.

2.a. Some vendors offered a combination with NVLinks. Some say 1 link is needed for 2 cards and some say 3 links are needed for 2 cards.

  3. H100 NVL - no idea how it differs from the PCI-E cards with NVLinks, but it looks like a newly introduced model.

  4. Some other options, like a custom build made by the vendors.

Are there any BEST PRACTICES I can look at to make a decision? Any advice from experts here who have already been through a similar situation? Thanks in advance 🙏

      • aadoop6@alien.topB
        1 year ago

        Out of curiosity, what kind of projects are you working on that require purchasing such GPUs rather than renting on the cloud?

    • fadenb@alien.topB
      1 year ago

      Nooooo! With DGX you pay for the name and “service” from Nvidia. PCIe cards lack the fast NVSwitch interconnect. There is a layer in between: HGX. It’s basically DGX without the branding.

      You can get such systems from Supermicro and ASUS.