Did AMD Just Beat Nvidia In AI Performance?

by | Apr 6, 2026 | In the News

AMD performed well on the Llama2-70B single node submission, +/- 10% of the Nvidia B300 GPU. AMD

AMD, Nvidia and Intel participated in the most recent round of MLPerf inference benchmarks. While I was surprised that Google didn’t participate with their new Ironwood TPU, there is still plenty to talk about.

This particular round of MLPerf benchmarks makes real A/B comparisons difficult, if not impossible, as we will explore later. But to answer the headline question directly: no. AMD did get close in a few MLPerf 6.0 Inferencing benchmarks, and even surpassed the Blackwell GPU by 4% on a single-node Llama2-70B benchmark (where the Blackwell result was run by an Nvidia partner, not Nvidia). However, this result is a bit of an outlier, as Nvidia maintained its leadership in per-GPU and per-rack AI performance with the Blackwell Ultra B300 GPU. This is especially true for recent state-of-the-art AI models, on which AMD did not compete.

Let’s dive in and discuss how these results could change the landscape and when. (Nvidia, like many AI silicon vendors, is a client of Cambrian-AI Research, LLC.)

AMD is Now On The Playing Field

AMD Instinct MI355X single-node performance on MLPerf is close, at 80-90% of an Nvidia B300 GPU. AMD

Without a doubt, AMD has made significant progress. With the latest MLPerf 6.0 benchmarks, AMD gets close to, and even passes, some Nvidia partner submissions on B200 and B300 on select models. AMD has now demonstrated it has a competitive GPU for certain workloads, and their ROCm software has improved as well. The MI355X adopts the smaller, more efficient FP4 precision, as does Nvidia, to boost performance, and the latest ROCm software speeds the execution of token generation.

AMD realized over three times better performance than its previous MI325X. AMD

AMD did run a more modern model for text-to-video generation, unfortunately after the submission deadline had passed. If verified by MLCommons, this would reinforce my conclusion that AMD performance is now in the ballpark, with 88% of the performance of the Nvidia B300 GPU. The caveat I would mention is that AI is no longer just about a fast GPU. AI performance is determined in the real world by the complete infrastructure stack, something AMD realizes with their upcoming Helios rack-scale system and which Nvidia has been espousing for years.

AMD has shared an MLPerf Wan-2.2-t2v result for the Offline scenario that has not yet been peer-reviewed; performance was only 12% lower than the Nvidia B300 GPU. AMD

Below is AMD’s “Open Division” result for the same benchmark (the light blue bar). I would caution readers that this slide compares an Open Division result for AMD MI355X to a Closed Division result for Nvidia Blackwell. In MLPerf “Open” benchmarks (intended to showcase new innovations beyond silicon), a submission can use different models, retraining, and arbitrary pre‑ and post‑processing, as long as the same dataset and quality metric are used for that benchmark. The “Closed” division provides “all else being equal” apples-to-apples accelerator comparisons. Nonetheless, while your mileage may vary, this is a solid result.

AMD shared this slide, which includes a submission for a more modern model for text-to-video work. The middle blue bar was in the Open category and the dark blue bar was not completed before the submission deadline. AMD

While AMD does not yet have an NVLink-class scale-up capability, AMD was quite proud of its ability to scale nearly linearly to 11 nodes (88 and 96 GPUs) over Ethernet to achieve one million tokens per second.

Here is AMD’s latest roadmap. I would expect the addition of the now-Nvidia Groq-based LPX to the Vera Rubin disaggregated inferencing story will enable Nvidia to show far better results than projected here by AMD.

AMD says its rack-scale MI450 will compete head-to-head with Nvidia Vera Rubin. But of course the VR story changes significantly with the addition of the Groq LPU. AMD

Nvidia Results

NVIDIA ran every newly added benchmark and won every one of them. AMD attempted only two of these, and Nvidia outperformed it on both, by ~30% and ~50%. So, no, AMD did not beat Nvidia.

Nvidia ran (and won) every new benchmark. AMD
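
A lead quoted as "Nvidia outperformed by ~30%" and a share quoted as "AMD reached 88% of Nvidia" sit on different scales, which makes the figures in this article easy to misread. The short Python sketch below converts one into the other. The function names are my own, and the inputs are the approximate figures quoted in this article, not official MLPerf values.

```python
def share_of_leader(lead_pct: float) -> float:
    """If the leader is `lead_pct` percent faster, return the trailer's
    throughput as a percentage of the leader's throughput."""
    return 100.0 / (1.0 + lead_pct / 100.0)


def lead_over_trailer(share_pct: float) -> float:
    """If the trailer reaches `share_pct` percent of the leader's
    throughput, return how many percent faster the leader is."""
    return (100.0 / share_pct - 1.0) * 100.0


if __name__ == "__main__":
    # Approximate leads quoted for the two new benchmarks AMD attempted.
    for lead in (30.0, 50.0):
        print(f"leader +{lead:.0f}% -> trailer at "
              f"{share_of_leader(lead):.0f}% of leader")
    # The 88% share quoted for the late text-to-video result.
    print(f"trailer at 88% -> leader +{lead_over_trailer(88.0):.0f}%")
```

Read this way, a ~30% and a ~50% Nvidia lead correspond to AMD delivering roughly 77% and 67% of Nvidia's throughput, while an 88% share corresponds to roughly a 14% Nvidia lead.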

This round, NVIDIA GB300 NVL72—launched last year—delivered nearly three times higher token throughput compared to its first submissions six months ago by applying new software and inference management with Nvidia Dynamo. This speedup was achieved by NVIDIA partner Nebius and demonstrates the impact of software optimization in AI performance, even with partners.

Using the same hardware as last round, Nvidia increased performance nearly three-fold and delivered 2.5M tokens per second. NVIDIA

Nvidia and its partners blew through the one million tokens per second (tps) number and scored 2.5M tps on DeepSeek R1 offline, with 288 B300s. DeepSeek R1 is far more computationally challenging than Llama2-70B.

Nvidia shared NVL72 results that exceed the 1M mark AMD achieved, while using more GPUs. NVIDIA
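
Aggregate tokens per second hide how many GPUs produced them, so it can help to normalize by GPU count. The Python sketch below does that with the headline figures quoted above. The `Run` class and its labels are my own illustration, and I am assuming each AMD GPU count (88 and 96) is paired with the same one-million-tokens-per-second figure, which the article does not spell out. The Nvidia number is for DeepSeek R1 offline, a heavier model, so the per-GPU values are not an apples-to-apples ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    tokens_per_second: float
    gpus: int

    @property
    def per_gpu(self) -> float:
        """Average tokens per second contributed by each GPU."""
        return self.tokens_per_second / self.gpus


# Figures as quoted in the article; the AMD pairings are an assumption.
runs = [
    Run("Nvidia B300, DeepSeek R1 offline", 2_500_000, 288),
    Run("AMD MI355X over Ethernet (88 GPUs)", 1_000_000, 88),
    Run("AMD MI355X over Ethernet (96 GPUs)", 1_000_000, 96),
]

if __name__ == "__main__":
    for run in runs:
        print(f"{run.label}: {run.per_gpu:,.0f} tokens/s per GPU")
```

The point of the exercise is the caveat rather than the ranking: a per-GPU figure is only meaningful when the model, scenario, and division match.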

In this round of MLPerf, NVIDIA increased its performance using the same hardware on returning scenarios for DeepSeek-R1 as well as Llama 3.1-405B, with gains ranging from 21% to nearly 3X.

Nvidia saw performance improve by 1.2x to 2.77x on the same hardware. NVIDIA

While pure GPU hardware results are certainly converging, Nvidia remains in the lead by applying new techniques to get even more out of their systems, such as:

  • Disaggregated serving: This capability in Nvidia Dynamo separates the two inference phases (prefill and decode) and optimizes the configuration of each individually, enabling optimal overall throughput. This will become even more important as Nvidia further disaggregates inference to include the Groq LPX.
  • Wide Expert Parallel (WideEP): For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. By splitting, or sharding, the experts across multiple GPUs across NVL72 nodes, this bottleneck is reduced, improving end-to-end performance.
  • Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller, and performance is dominated by how quickly weights can load into memory, leaving compute performance underutilized. By applying compute that would otherwise go underutilized to predict and verify additional tokens in parallel (up to three in this implementation), throughput at high interactivity is increased.
  • KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
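
To make the last item more concrete, here is a toy Python sketch of cost-based routing. It is not Dynamo's algorithm or API. The `Worker` class, the cost formula, and the token lists are all hypothetical, chosen only to show the idea that a worker already holding a matching prompt prefix in its KV cache has less prefill work to do.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0  # work already waiting on this worker
    cached: list[list[int]] = field(default_factory=list)  # cached prompts


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_cost(worker: Worker, prompt: list[int]) -> int:
    """Hypothetical cost: tokens that still need prefill, plus queued work."""
    reused = max((common_prefix_len(prompt, c) for c in worker.cached),
                 default=0)
    return (len(prompt) - reused) + worker.queued_tokens


def route(workers: list[Worker], prompt: list[int]) -> Worker:
    """Pick the worker with the lowest estimated cost for this prompt."""
    return min(workers, key=lambda w: estimated_cost(w, prompt))


workers = [
    Worker("w0", queued_tokens=2, cached=[[1, 2, 3, 4, 5, 6]]),
    Worker("w1", queued_tokens=0),
]

if __name__ == "__main__":
    print(route(workers, [1, 2, 3, 4, 5, 6, 7, 8]).name)  # shares a prefix with w0
    print(route(workers, [9, 9, 9]).name)                 # no cache hit anywhere
```

In this toy setup the first prompt goes to the busier worker because most of its prefix is already cached there, while the second goes to the idle worker. A production router would weigh many more signals than these two.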

My Perspectives

I’m reminded of my experience at IBM, where we worked hard to win the Unix market share lead from Sun Microsystems with superior performance. Once we became the indisputable king with IBM POWER, we looked around and found that the other RISC vendors were failing against the Intel x86 Juggernaut. We were fighting the wrong battle; the war had moved on to Linux and server efficiency!

So, now it looks like AMD is roughly within 10-30% of Nvidia as a GPU. But keep in mind that AI processing is no longer a chip thing; it’s intelligence produced by a full AI Factory, which requires far more than a fast chip. It demands a massive optimized system of networking, accelerators, CPUs, software and a host of infrastructure to produce tokens at scale.

And as the market has once again moved from raw performance to efficiency, Nvidia was even willing to deploy a non-GPU beast into its roadmap. We look forward to the time when Nvidia can add the Groq LPX to its benchmarks.

And of course, we anxiously await AMD MI4xx and Nvidia Vera Rubin. At that point we could return to the old game of leapfrogging. But I doubt it.

Disclosures: This article expresses the opinions of the author and is not to be taken as advice to purchase from or invest in the companies mentioned. My firm, Cambrian-AI Research, is fortunate to have many semiconductor firms as our clients, including Baya Systems, BrainChip, Cadence, Cerebras Systems, d-Matrix, Flex, Groq, IBM, Infleqtion, Intel, Micron, NVIDIA, Qualcomm, SiMa.ai, Synopsys, Tenstorrent, Ventana Microsystems, and scores of investors.