What are the benefits of using an H100 over an A100 (both at 80 GB and both using FP16) for LLM inference?
Looking at the datasheets for both GPUs, the H100 has roughly twice the max FLOPS, but they have almost the same memory bandwidth (about 2000 GB/s). Since memory bandwidth dominates LLM inference, I wonder what benefits the H100 actually offers. One benefit could, of course, be the ability to use FP8 (which is extremely useful), but in this question I'm interested in the difference in raw hardware specs.
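For context, here's the napkin math behind my assumption that decoding is bandwidth-bound (a minimal Python sketch; the bandwidth figures are approximate datasheet values, and the 70B/FP16 model is just an illustrative assumption):

```python
# Back-of-the-envelope decode throughput for a bandwidth-bound LLM.
# Bandwidth figures are approximate datasheet values; the model size
# and bytes/param are illustrative assumptions.

def max_decode_tokens_per_sec(n_params: float, bytes_per_param: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed: generating one token
    requires streaming all weights from HBM at least once."""
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token

# 70B-parameter model in FP16 (2 bytes per parameter):
for name, bw in [("A100 80GB", 2039e9), ("H100 PCIe", 2000e9)]:
    print(f"{name}: ~{max_decode_tokens_per_sec(70e9, 2, bw):.1f} tokens/s ceiling")
```

With near-identical bandwidth, both cards land at roughly the same single-stream decode ceiling, which is exactly why I'm puzzled about where the H100's extra FLOPS help.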
The H100 was additionally specialized for higher performance on transformer models. I think it is about 8x faster than an A100 for transformers, but don't quote me on that.
At first I thought that number sounded almost unbelievably high. It appears it can be up to 8x faster when using FlashAttention and a multi-GPU setup; without multi-GPU and FlashAttention, it is a bit more than 2x faster.
Source: https://lambdalabs.com/blog/flashattention-2-lambda-cloud-h100-vs-a100
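A quick roofline comparison helps explain why the gap shows up mostly with FlashAttention and batching (a Python sketch with approximate datasheet numbers for the SXM parts; don't hold me to the exact figures for your SKU):

```python
# Roofline "ridge point" = peak FLOPS / memory bandwidth: the arithmetic
# intensity (FLOP/byte) above which a kernel becomes compute-bound.
# Dense FP16 Tensor Core FLOPS and HBM bandwidth are approximate
# datasheet values.
gpus = {
    "A100 80GB SXM": (312e12, 2039e9),
    "H100 SXM":      (990e12, 3350e9),
}
for name, (peak_flops, bandwidth) in gpus.items():
    print(f"{name}: ridge point ~{peak_flops / bandwidth:.0f} FLOP/byte")

# Batch-1 FP16 decode does ~2 FLOPs per parameter while reading 2 bytes
# per parameter (~1 FLOP/byte), far below either ridge point, so decode
# stays bandwidth-bound on both cards. Prefill and large-batch serving
# have much higher arithmetic intensity, which is where the extra FLOPS
# (and FlashAttention keeping attention on-chip) actually pay off.
```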
Thanks for clarifying :)
Sure, but isn't it the case that the H100 is what can sustain such a high-throughput multi-GPU system, whereas A100s generally run independently?