The Instinct MI200 is nearly five times faster than the NVIDIA A100 for HPC, but is theoretically only 20% faster for AI.
One year ago I complained that the newly announced AMD MI100 GPU was great for HPC, but inadequate for most AI workloads. Now AMD has announced the upgraded Instinct MI200, designed in large part to fulfill the needs of the US DOE Exascale program. The chip has astonishingly fast HPC performance, and decent theoretical AI performance. AMD also teased an upgrade to the EPYC CPU, adding a massive 3D stacked V-Cache for faster HPC performance, along with a cache-coherent fabric that can connect four MI200s to the EPYC CPU. Let’s take a look.
What did AMD announce?
Well, technically the company did not announce these new platforms; it previewed them to grab the limelight before SuperComputing ’21, to be held in St. Louis on November 14-19. And since Oak Ridge National Laboratory is already installing these chips in the HPE Cray Frontier Exascale system, well, AMD had to say something. But while some details and pricing are not yet available, there is still plenty of news to excite the HPC and AI community.
EPYC with V-Cache
Let’s start with the updated EPYC Milan CPU. Many HPC and AI apps are constrained by memory bandwidth. By tripling the CPU’s cache with 3D stacked memory to 804MB per socket, AMD is seeing a 50% average improvement in “targeted” HPC applications such as EDA, mechanical engineering, and computational fluid dynamics. In fact, the company demonstrated greater than 60% better performance for the Synopsys verification application. Since many (most) HPC applications are priced by the number of cores in a server, increasing performance per core dramatically lowers the total cost of ownership for engineering systems. However, AMD did not disclose pricing. The new “MilanX” will be launched in Q1 2022.
But if you need still more performance, …
Then check out this new two-die GPU, the Instinct MI200. The design team knew who their first customers would be, as the Frontier and El Capitan Exascale systems for the US DOE had already been awarded to HPE and AMD when the chip was still on the drawing board. While AMD let it slip that the chip would burn some 550 watts (!), it will deliver nearly five times the 64-bit floating point performance for HPC workloads, compared to the NVIDIA A100, so performance per watt will still be astounding.
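To put the performance-per-watt claim in rough numbers, here is a back-of-envelope sketch. Apart from the 550-watt figure quoted above, every input is my assumption taken from vendor peak specifications (47.9 TFLOPS FP64 for the top MI200 part, 9.7 TFLOPS FP64 and 400 watts for the A100 SXM board); none of it is measured application performance.

```python
# Back-of-envelope comparison of peak FP64 throughput per watt.
# All inputs are assumptions based on vendor peak figures, not benchmarks.

def gflops_per_watt(peak_tflops: float, watts: float) -> float:
    """Convert a peak TFLOPS rating and board power into GFLOPS per watt."""
    return peak_tflops * 1000.0 / watts

MI200_FP64_TFLOPS = 47.9  # assumed: AMD's published peak FP64 vector rate
MI200_WATTS = 550.0       # the "some 550 watts" figure mentioned in the article
A100_FP64_TFLOPS = 9.7    # assumed: NVIDIA's published non-tensor-core FP64 peak
A100_WATTS = 400.0        # assumed: A100 SXM board power

# Ratio of peak FP64 throughput, ignoring power.
raw_speedup = MI200_FP64_TFLOPS / A100_FP64_TFLOPS

# Ratio of peak FP64 throughput per watt.
efficiency_gain = (
    gflops_per_watt(MI200_FP64_TFLOPS, MI200_WATTS)
    / gflops_per_watt(A100_FP64_TFLOPS, A100_WATTS)
)

print(f"Peak FP64 speedup:       {raw_speedup:.2f}x")
print(f"Peak FP64 per-watt gain: {efficiency_gain:.2f}x")
```

Under these assumptions the raw peak advantage works out to roughly 4.9x, and the per-watt advantage to roughly 3.6x. The ratios would shift with different power figures or with real application performance instead of peak rates.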
As for AI performance, AMD has closed the considerable gap with NVIDIA, at least in terms of theoretical performance, with 16-bit floating point FLOPS that are 20% higher than the NVIDIA A100. We say “theoretical” because AMD is not yet ready to disclose AI benchmark results such as the MLPerf suite. And we suspect it will take some time for the Instinct software team to optimize AI models, kernels, and the ROCm development stack, so don’t hold your breath. But we do believe that the performance will attract developers to begin the journey to create an AI ecosystem around AMD.
Two other features bear noting. One is the performance and memory coherency of the new Infinity Fabric that interconnects the GPUs to MilanX, and presumably Milan. Cache coherency greatly simplifies memory management for software developers, improves application performance, and enables multi-billion- or trillion-parameter AI models. And of course the performance of these direct links will blow away PCIe-based GPU architectures.
The other new technology is called the Elevated Fan-out Bridge (EFB), which replaces the traditional silicon interposer. EFB promises higher scalability and lower costs, leveraging standard “flip chip” processes to simplify assembly.
AMD disclosed on the analyst conference call that it has not submitted any benchmarks to MLCommons. We hope it does in the next round of benchmarks, but would not be surprised if the company chooses not to. Closing the gap with NVIDIA for AI is as much a software challenge as a silicon one, and that will take time. However, AMD did share raw FLOPS numbers to substantiate its marketing claims that the new platform is potentially nearly five times faster than NVIDIA. Potentially.
Looking beyond raw FLOPS for some real performance comparisons, the MI200 looks to be roughly 1.5 to 3 times NVIDIA GPU performance for HPC. Unfortunately, no AI application performance data were disclosed.
Clearly AMD is fulfilling CEO Lisa Su’s promise to build its data center business from HPC on down. MilanX and MI200 combine to deliver excellent HPC performance and scalability which will be substantiated in the Exascale systems being built and planned.
On the other hand, AI seems to have been relegated to a lower priority once again, both on the EPYC CPU when compared to Intel Sapphire Rapids and the GPU when compared to NVIDIA. While 20% better 16-bit floating point is nice to see, FLOPS is a poor performance predictor in the software-driven world of optimized AI acceleration.
All said, however, I cannot argue with AMD’s prioritization: as NVIDIA focuses more on AI, it has created a vulnerability in HPC that AMD intends to exploit.