
With so much focus on inference processing, it is easy to overlook the AI training market, which continues to drive gigawatts of AI computing capacity. The latest benchmarks show that the training of AI models, an immense investment in power and compute, continues to improve across hardware, software, and scale. Nvidia and AMD want you to know more. (And I still want to see how Google Ironwood performs, but I’ll have to wait!)
AI Training: Nvidia, The 4-Bit Champion
In the good ol’ days, it was easy. What’s the benchmark? “AI”. Who won? “Nvidia, of course!” But now AMD, Cerebras, and Google are all competing for the crown. Only Nvidia and AMD, however, seem sufficiently confident to put their hardware where their mouths are by submitting to MLCommons’ open MLPerf benchmarks.
Nvidia, for now, is pinning its reputation on 4-bit floating point, first in inference processing and now in training, a workload traditionally run at 16-bit and only recently at 8-bit precision. This is important. Halving precision can double the throughput of the silicon if, and only if, the results remain accurate, and if, and only if, you don’t need more data and more training runs to compensate for the inherently noisier low-precision math. Nvidia applied this approach and took first place in every MLPerf benchmark. Herein lies a coming squabble between AMD and Nvidia.
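To make that trade-off concrete, here is a minimal sketch of block-scaled FP4 quantization in the spirit of NVFP4. It is not Nvidia’s implementation: the block size and the per-block max-scaling scheme are assumptions for illustration; only the 16 representable FP4 (E2M1) values are standard.

```python
# A minimal sketch (not Nvidia's implementation) of block-scaled FP4
# quantization: each small block of values shares one scale, and each
# value is rounded to the nearest representable FP4 (E2M1) number.
# The block size of 16 and the max-based scale are assumptions here.
import numpy as np

# The 8 non-negative E2M1 magnitudes; mirroring gives the full signed grid.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_fp4_block(x, block=16):
    """Quantize a 1-D array to block-scaled FP4 with round-to-nearest."""
    x = x.reshape(-1, block)
    # One scale per block maps the block's max magnitude onto FP4's max (6.0).
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    # Snap each scaled value to the nearest entry in the FP4 grid.
    idx = np.abs((x / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_q = quantize_fp4_block(w).ravel()
print("mean abs error:", np.abs(w - w_q).mean())  # the accuracy cost of FP4
```

The printed error is the catch: every FP4 value is one of only 16 codes per block scale, so the hardware win only counts if training still converges to the same quality.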

Nvidia touts three breakthroughs in Training. NVIDIA
In its press release, Nvidia highlights three breakthroughs. First, Nvidia is applying FP4, specifically NVFP4, to improve training performance for the first time. Second, it demonstrated the performance of Blackwell Ultra, which has only recently begun shipping in large-scale systems. Third, Nvidia measured the performance of a cluster of more than 5,000 GPUs collaborating on a single training job.

Blackwell Ultra shows a significant improvement over the Hopper and Blackwell generations; much of this comes from FP4. NVIDIA
The performance of Nvidia’s GB300 is truly impressive, delivering a 4-5x improvement over the Hopper generation. Much of this is due to the reduction in precision to 4 bits, as described above. However, Nvidia is also leveraging software improvements and networking advancements to deliver more value. Nvidia is more than a chip company; it’s a full-system play.
AMD’s 1st AI Training Benchmark: Nearly as Fast as Nvidia with Higher Precision
AMD’s results are so close to Nvidia’s that the difference is not even worth mentioning. AMD is now that good. But the debate over precision is only just beginning. AMD says it can compete at higher precision, and it is working on lower-precision models. It is concerned, however, that FP4 could deliver only a 10% advantage instead of the expected 50%, since the amount of processing could increase to overcome the impact of lower-precision math. Note that AMD’s MI350 generation does support FP4; AMD is just not yet convinced it is a good idea.
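The arithmetic behind AMD’s worry is simple, as this illustrative sketch shows. The 50% and 10% figures come from AMD’s argument above; the 36% extra-steps number is my own assumption, chosen only to reproduce the ~10% net result.

```python
# A back-of-the-envelope sketch of AMD's concern (numbers illustrative,
# not AMD's): FP4 speeds up each training step, but convergence may
# require more steps, eroding the end-to-end gain.
def effective_speedup(per_step_speedup, extra_steps_fraction):
    """End-to-end speedup when each step is faster but more steps are needed."""
    return per_step_speedup / (1.0 + extra_steps_fraction)

print(effective_speedup(1.5, 0.00))  # ideal case: ~1.5x, i.e. 50% faster
print(effective_speedup(1.5, 0.36))  # ~36% more steps -> only ~1.1x net
```

If the cheaper math forces the optimizer to take enough extra steps, the per-step win nearly vanishes, which is exactly why AMD is hesitant even though its MI350 silicon can run FP4.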

AMD’s training performance on the MI355X increased two- to three-fold over its previous generation. AMD
It is important to note that nobody can train a modern LLM in 10 minutes. It takes weeks or months of heavy computing to build a model. These benchmarks only measure a small portion of the work, not the complete training of a model. But they indicate the relative performance quite well.

AMD ran the two LLM benchmarks using 8-bit floating point, while Nvidia ran them using 4-bit. AMD
Conclusions
AMD is definitely catching up in GPU performance. But Nvidia maintains its leadership at scale, where AMD does not yet have anything that approaches NVLink for scale-up connectivity, nor Nvidia’s rail-optimized networking for scale-out.
But I am still anxious to see how Google’s Ironwood measures up! We can expect to see more in three months, as MLCommons operates on a quarterly cadence, alternating between inference and training benchmarks.