MosaicML, just acquired by DataBricks for $1.3B, published some interesting benchmarks for training LLMs on the AMD MI250 GPU, and said it is ~80% as fast as an NVIDIA A100. Did the world just change?
To be brutally honest, everyone wants to see a fight, between AMD and NVIDIA for running the Large Language Models that have excited the AI world and may create a trillion dollar market. So, when MosaicML, a software company led by Ex-Intel AI Chief Naveen Rao, published this blog explaining how they used an MI250 out of the box, with ZERO code changes, to perform a training run versus the NVIDIA A100, the fight had officially begun. Let’s look at what MosaicML concluded and handicap the next round: the AMD MI300x vs the NVIDIA H100.
What Just Happened?
MosaicML is in the business of making AI easier to run efficiently, and is a strong supporter of PyTorch, which many claim will make it much easier. Well, they are right. They used the ROCm libraries to replace CUDA, and PyTorch 2.0, and were able to run a segment of a training run for a smaller LLM, with zero code changes.
The results? The AMD MI250 is ~80% as fast as A100-40GB and ~73% as fast as A100-80GB.
The next round: MI300 vs. the H100, AT SCALE.
Let’s assume, for arguments sake, that the MI300 and H100 are roughly equivalent in flops. (The MMI300 has not yet been announced and will ship in volume in 2024.) The MI300 has considerably larger HMB memory, yielding significantly higher memory capacity (192GB vs. 80GB) and memory bandwidth (5.2TB/s vs. 3.2TB/s). We note that the benchmarks they ran used 16-bit floating point. I get that. It is the industry standard and produces known good results.
What we don’t know is how the Transformer Engine in the H100 will increase performance significantly (4X) might offset AMD’s memory capacity and bandwidth advantage. The transformer engine enables the use of faster 8-bit integer math vs. the more precise 16-bit floating point math, while preserving the quality of the output. And we do not have any idea of what the NVIDIA “H100-next” will be next year when the AMD MI300 starts shipping to customers. But we suspect “Next” will support more HBM, potentially eliminating that AMD advantage.
Finally, these benchmarks were not run “at scale”, so we do not know how the NVIDIA NVLink will compare to the MI300, presumably using Infinity Fabric within a node and InfiniBand between nodes. And ALL LLMs are trained at significant scale, using up to thousands of GPUs.
Conclusions
We congratulate MosaicML on this work (the blog is a great read!) and AMD for their ROCm work to improve the performance of the MI250. We look forward to the next round, whether run by MosaicML or MLCommons. But regardless of what we don’t yet know, we DO know that the days of NVIDIA having sole reign over AI is coming to an end; AMD and possibly Intel have chips that can run AI models without change and have the next-gen chips in the hopper. (Sorry for the pun.)
That being said, it all comes down to software, and NVIDIA’s software stack remains a massive moat beyond just CUDA. But for LLMs, Pytorch 2.0 and OpenAI Triton (which was not used in the MosaicML benchmarks) will fill in the CUDA moat.