
NVIDIA headquarters. Getty Images
Everyone is not just talking about AI inference; they are doing it. Analyst firm Gartner released a new report this week forecasting that global generative AI spending will hit $644 billion in 2025, up 76.4% year over year. Meanwhile, MarketsandMarkets projects that the AI inference market will grow from $106.15 billion in 2025 to $254.98 billion by 2030. (Excuse the overly precise estimates.) However, buyers still need to know which AI processor to buy, especially as inference has gone from a simple one-shot pass through a model to agentic and reasoning models that can increase computational requirements by some 100-fold.

Performance continues to skyrocket, driving down price/token. MLCommons
For seven years, the not-for-profit group MLCommons has been helping AI buyers and vendors by publishing peer-reviewed quarterly AI benchmarks. It has just released its MLPerf Inference v5.0 suite of results, with new chips, servers, and models. Let’s take a look.
The New Benchmarks
New benchmarks were added for the larger Llama 3.1 405B model, for Llama 2 70B with latency constraints for interactive work, and for graph models (the new “RGAT” benchmark). Only Nvidia submitted results for all the models. A new edge-inference benchmark was also added: the Automotive PointPainting test for 3D object detection.

There are now 11 AI benchmarks managed by MLCommons. NVIDIA
The New Chips
AI is built on silicon, and MLCommons received submissions for six new chips this round: the AMD Instinct MI325X (launched last fall), the Intel Xeon 6980P “Granite Rapids” CPU, the Google TPU Trillium (TPU v6e), which is now generally available, the Nvidia B200 (Blackwell), the Nvidia Jetson AGX Thor 128 for AI at the edge, and, perhaps most importantly, the Nvidia GB200, the beast that powers the NVL72 rack that has data centers scrambling to power and cool it. (Nvidia and many other semiconductor firms are clients of Cambrian-AI Research.)
The New Results: Nvidia
As usual, Nvidia won all the benchmarks; this time, it won by a lot. First, the B200 tripled the performance of the H200 platform, delivering over 59,000 tokens per second on the latency-bounded Llama 2 70B Interactive benchmark.

Blackwell outperforms what was previously the fastest AI GPU by three-fold. NVIDIA
And running the larger Llama 3.1 405B model, eight Blackwells outperformed eight Hoppers by over three-fold in Server mode. (The results were normalized to eight GPUs but were run on an NVL72.)

The new Llama 3.1 405B benchmark runs 3.4 times faster on Blackwell. NVIDIA
Now for the real test: is the NVL72 as fast as Nvidia promised at launch? Yes, it is thirty times faster than the 8-GPU H200 system running the new Llama 3.1 405B benchmark, but it also has nine times as many GPUs.
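To see how the rack-level and chip-level claims line up, here is a quick back-of-the-envelope check. The 30x and GPU-count figures come from the results above; the per-GPU math is my own illustration, not an MLCommons calculation.

```python
# Back-of-the-envelope check: does the NVL72's 30x rack-level win imply a
# per-GPU gain consistent with the ~3x B200-vs-H200 results reported above?
# Figures are from the MLPerf discussion in this article; the math is illustrative.

nvl72_speedup_vs_h200_node = 30   # GB200 NVL72 vs. an 8-GPU H200 system, Llama 3.1 405B
nvl72_gpus = 72
h200_node_gpus = 8

gpu_count_ratio = nvl72_gpus / h200_node_gpus            # 9x more GPUs in the rack
per_gpu_speedup = nvl72_speedup_vs_h200_node / gpu_count_ratio

print(f"GPU count ratio: {gpu_count_ratio:.0f}x")
print(f"Implied per-GPU speedup: {per_gpu_speedup:.1f}x")  # ~3.3x, in line with B200 vs. H200
```

In other words, the rack-level 30x claim is roughly what you would expect from nine times as many GPUs, each about three times faster, plus the scaling benefits of the NVL72's unified NVLink domain.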
The new Llama 3.1 405B benchmark supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three distinct tasks: general question-answering, math, and code generation.

GB200 NVL72: 30X for inference, as promised, even without Dynamo. NVIDIA
But when you add Nvidia’s new open-source Dynamo, the “AI factory OS” that optimizes AI at the data-center level, AI factory throughput can double again running Llama and improve by as much as thirty-fold running DeepSeek. I think that means AI just got 30 times cheaper.
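The logic behind “30 times cheaper” is simple: if the same hardware, at the same hourly cost, serves roughly 30 times as many tokens per second, the cost per token falls by roughly the same factor. The sketch below illustrates that relationship; the hourly cost and baseline throughput are made-up placeholders, not vendor pricing, and only the 30x throughput factor comes from the claims above.

```python
# Why higher throughput translates into cheaper tokens: with a fixed hourly
# cost for the hardware, cost per token scales inversely with tokens/sec.
# The hourly cost and baseline throughput are hypothetical placeholders.

hourly_cost = 100.0              # hypothetical $/hour for a fixed slice of an AI factory
baseline_tokens_per_sec = 1_000.0

def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

before = cost_per_million_tokens(baseline_tokens_per_sec, hourly_cost)
after = cost_per_million_tokens(baseline_tokens_per_sec * 30, hourly_cost)  # Dynamo's claimed 30x

print(f"Cost per million tokens before: ${before:.2f}")
print(f"Cost per million tokens after:  ${after:.2f}")   # roughly 30x cheaper
```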

Dynamo: the next big step in delivering inference. NVIDIA
And, Surprise, AMD Has Rejoined the MLPerf Party!
Welcome back, AMD! The new AMD MI325X did quite well on the select benchmarks AMD ran, competing admirably with the previous-generation Hopper GPU. I recently talked with two second-tier cloud service providers, IBM and Cirrascale, and both said their customers are using AMD GPUs. They say that while AMD takes more software work than Nvidia, customers are delighted with the performance and price/performance they are realizing with AMD GPUs.

Updated slide from AMD on 3/3. AMD
So, for AI practitioners who know what they are doing and don’t need the value of Nvidia’s software, the AMD MI325X can save a lot of money.
The same can be said of Intel Gaudi 3, which also matches the Nvidia H100 and is now available on IBM Cloud at about half the price of using Nvidia, according to an IBM spokesman I reached here at the Intel Vision event in Las Vegas.

AMD says that Meta is using the MI300X exclusively for inferencing on Llama 405B. AMD
AMD also did quite well on the Llama 3.1 405B serving benchmark (distinct from the interactive benchmark mentioned previously). AMD proudly noted that Meta is now using the (older) MI300X as the exclusive inference server for the 405B model.

While this is the older MI300X, MangoBoost submitted the highest Llama 2 70B result ever, with just four nodes. AMD’s memory capacity is a key differentiator versus Nvidia. AMD
While Nvidia gets all the attention in AI, AMD continues to make progress in optimizing its ROCm software alternative to CUDA and in attracting industry support. This is evidenced by the AMD partner submissions to MLPerf, including MangoBoost, which made the first-ever multi-node AMD submission.
So Where Does That Leave Us?
Nvidia retains the crown of AI King across all AI applications. Although competition is on the horizon, AMD delivers competitive performance only against the previous generation of Nvidia GPUs. AMD expects the MI350, due later this year, to close that gap; however, thanks to the GB300, Nvidia will retain the lead at the GPU performance level by then.
But the real issue here is that while everyone else is trying to compete at the GPU level, Nvidia keeps raising the bar at the data center level with massive investments in software, solutions, and products to ease AI deployment and lower TCO.