Hi all, I need your help. I am going to buy H100s for training LLMs, currently for fine-tuning 70B models, but later we may consider pre-training larger models too. The H100 looks more promising than the A100 because of its FP8 support (a rough sketch of the FP8 path we plan to use is at the end of this post), so I asked multiple vendors for quotes. And then I realized there are too many options!
1. DGX - 8x H100. Much more expensive than the other options, but the vendors say its performance is worth it.
2. PCIe H100 cards in a Supermicro machine - from 2x up to 8x, looks cost-effective.
   2.a. Some vendors offered configurations with NVLink bridges. Some say 1 bridge is needed per pair of cards and some say 3 bridges are needed per pair (a quick way to check what is actually installed is sketched right after this list).
3. H100 NVL - no idea how it differs from the PCIe cards with NVLink bridges, but they look like newly introduced parts.
4. Some other options, like custom builds put together by the vendors.
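On the NVLink bridge question in 2.a: once a machine is in front of you (or a vendor gives you remote access), the quickest sanity check is to ask the driver what it actually sees. This is a generic Python sketch that just shells out to nvidia-smi; the commands exist on any box with the NVIDIA driver installed, though the exact labels in the matrix depend on your topology.

```python
# Print the GPU-to-GPU connection matrix and per-GPU NVLink status.
# In "nvidia-smi topo -m", an entry like "NV3" means that pair of GPUs is joined
# by 3 bonded NVLink bridges; "PHB" / "NODE" / "SYS" mean traffic falls back to
# PCIe / the host, i.e. no NVLink between that pair.
import subprocess

for args in (["topo", "-m"], ["nvlink", "--status"]):
    out = subprocess.run(
        ["nvidia-smi", *args], capture_output=True, text=True, check=True
    )
    print(f"$ nvidia-smi {' '.join(args)}\n{out.stdout}")
```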
Are there any BEST PRACTICES I can look at to make this decision? Any advice from experts here who have already been through a similar situation? Thanks in advance 🙏
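P.S. For context on the FP8 point above, here is roughly the code path we expect to use. This is only a sketch assuming NVIDIA's Transformer Engine; the te.Linear / fp8_autocast / DelayedScaling names are from its PyTorch API (double-check against the installed version), and the layer sizes are made up.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Toy layer standing in for a transformer block's projection; sizes are arbitrary.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# Hybrid FP8 recipe: E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # FP8 GEMMs run only on Hopper (H100) or newer, hence the interest
```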
The H100 in the DGX is not the H100 PCIe but the SXM variant, which is about 30% faster. When in doubt, just go DGX.
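If you can rent both variants for a few hours before committing, a rough way to see the gap at your own precision is a simple matmul throughput check. This is a generic PyTorch sketch, not an official benchmark; the matrix size and iteration counts are arbitrary, and real training is also bound by memory and interconnect, not just GEMM rate.

```python
import time
import torch

def matmul_tflops(n=8192, iters=50, dtype=torch.bfloat16):
    """Sustained dense-matmul throughput in TFLOP/s for an n x n GEMM."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(10):          # warm-up so clocks and cuBLAS heuristics settle
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12

print(f"{torch.cuda.get_device_name(0)}: {matmul_tflops():.0f} TFLOP/s (bf16)")
```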
I will talk to my boss for more money 😆
Out of curiosity, what kind of projects are you working on that require purchasing such GPUs rather than renting on the cloud?
Nooooo! With DGX you pay for the name and "service" from Nvidia. PCIe lacks the fast NVSwitch interconnect. There is a layer in between: HGX, which is basically DGX without the branding.
You can get such systems from Supermicro and ASUS.