AMD Goes After NVIDIA With New GPU For The Datacenter

Dec 2, 2020

Today AMD announced its first data center GPU to compete in high-performance computing (HPC) and Artificial Intelligence (AI). Last year, the company announced that it would bifurcate its GPU technology into two architectures: CDNA for computation and RDNA for graphics. This approach is now coming to fruition and should allow the company to optimize chips for different workloads while still leveraging common elements. The new Instinct MI100 GPU is the first instantiation of the CDNA design. As such, it is an essential milestone in the company’s journey to an exascale supercomputer platform for Frontier and El Capitan, projected for 2023. Let’s dive in and see whether this chip will allow AMD to finally compete with NVIDIA, today’s leader in HPC and AI GPUs.

What is the MI100, and should NVIDIA be worried?

The new GPU is essentially a ground-up design focused on high performance and high memory bandwidth for HPC and, to a lesser extent, for AI. Like the NVIDIA A100, the AMD MI100 utilizes fast HBM2 memory (1.23 TB/s). However, it only supports four HBM2 stacks (32 GB), while the A100 supports five (40 GB). The new architecture includes Matrix Core operations, which sound similar to the Tensor Cores NVIDIA introduced in the Volta generation in 2017. Tensor Cores now support a wide range of precisions, while the Matrix Cores appear to support only 16- and 32-bit tensor operations. I suspect AMD will enhance these cores' capabilities in the next-generation GPU, adding integer and perhaps even 64-bit floating-point operations.

As Figure 1 shows, the MI100 performs admirably for traditional floating-point intensive applications. HPC centricity makes sense for now, given AMD’s focus on that market for both CPUs and now GPUs.

However, the NVIDIA A100 outperforms the MI100 for AI-intensive workloads that typically take advantage of “quantization,” the use of lower-precision formats and operators in tensor operations. I had many discussions with AMD about this data, and they correctly pointed out that the TF32 format, which only offers the precision of FP16 (a 10-bit mantissa), is not the same as the IEEE-standard FP32 format (a 23-bit mantissa). TF32 does have 8 exponent bits, just like FP32, so it covers the same numeric range at reduced precision. However, NVIDIA has not published FP32 FLOPS on Tensor Cores, only TF32, so that is the only comparison I can make for now. AMD also said that most customers it talks with have built models that use IEEE FP32 for training, not BFloat16 or TF32. That may be true. However, in my experience, many HPC and AI implementation teams will refactor their code to take advantage of the massive performance increases that Tensor Cores and TF32 offer.
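To make that format difference concrete, here is a minimal C++ sketch (my own illustration, not code from AMD or NVIDIA) that emulates TF32's 10-bit mantissa by truncating an IEEE FP32 value; real Tensor Cores round rather than truncate, so this only approximates the effect.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative only: keep FP32's sign and 8-bit exponent, but clear the low
// 13 of its 23 mantissa bits, leaving the 10 mantissa bits that TF32 retains.
float to_tf32_like(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;   // zero out the 13 least-significant mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float v = 1.0f / 3.0f;
    std::printf("FP32     : %.9f\n", v);                // ~0.333333343
    std::printf("TF32-like: %.9f\n", to_tf32_like(v));  // noticeably coarser
    return 0;
}
```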

Figure 1: The AMD MI100 has solid HPC performance, besting NVIDIA by some 18% at a lower price. But … Image: MOOR INSIGHTS & STRATEGY

I must note that NVIDIA also announced a new data center GPU product on the same day, which we covered here. This new version of the industry-leading A100 GPU supports 80 GB of HBM2e, which is twice the memory of the still-new A100 announced in May. This announcement widens the performance gap versus AMD, although at a higher price point.

There was no mention of any support for sparsity optimizations on the MI100, which could theoretically double performance on AI models. There was also no mention of GPU partitioning along the lines of the NVIDIA Multi-Instance GPU capability of the A100, a feature that may become especially important for cloud service providers.

Notably, the new chip supports the Infinity Fabric that AMD will use in exascale GPUs to improve CPU (EPYC) to GPU bandwidth and cache-coherent memory sharing. But there were no announcements of such a tightly integrated configuration for the MI100, at least not yet. Perhaps this feature could be activated when the next-generation EPYC (Milan) comes out next year, possibly by HPE's Cray, with whom AMD has won two of the three US DOE exascale supercomputer contracts.

Finally, AMD shared a slide claiming a 50% lower cost per FLOP than the A100 for HPC applications. The HPC performance described above is about 18% higher than NVIDIA's, which, combined with that claim, would imply an aggressive list price of around $7,200 by my math. That's a price that could help AMD get significant attention and traction.
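For transparency, here is the back-of-envelope arithmetic behind that estimate; the A100 list price used below is my own assumption (NVIDIA does not publish one), so treat the output as a rough sketch rather than a quoted price.

```cpp
#include <cstdio>

int main() {
    // Inputs: the relative FLOPS and cost-per-FLOP figures come from AMD's
    // slides; the A100 price is an assumption for illustration only.
    const double assumed_a100_price  = 12200.0;  // USD, hypothetical
    const double relative_hpc_flops  = 1.18;     // MI100 vs. A100 peak HPC FLOPS
    const double cost_per_flop_ratio = 0.50;     // AMD's claimed 50% advantage

    const double implied_mi100_price =
        assumed_a100_price * relative_hpc_flops * cost_per_flop_ratio;

    // With these inputs, the implied price lands near the ~$7,200 noted above.
    std::printf("Implied MI100 list price: ~$%.0f\n", implied_mi100_price);
    return 0;
}
```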

ROCm Software: Ready for prime time

So, the new chip is an attractive HPC platform and a respectable chip for AI training workloads. But the challenge any NVIDIA competitor must address lies in the software needed to engender an ecosystem of AI models and applications. AMD has been developing its ROCm open compute platform for over three years. The new V4.0 release adds previously missing pieces that ease porting from NVIDIA to AMD. AMD claims that ROCm 4.0 combined with the MI100 delivers an 8X increase in throughput over the last two years.

Figure 2: The ROCm software stack has matured to the point needed for full support for production AI deployments.  Image: AMD

Since AMD cannot directly support NVIDIA CUDA without incurring significant legal and technical risks, the ROCm approach relies on converting CUDA source into “HIP,” a portable parallel programming layer, which is then compiled down to the target GPU. Ideally, this approach would entice developers to adopt HIP as a single-source language that can be compiled for NVIDIA, AMD or other accelerators. While this is an elegant vision, it will take considerable time to be realized and would require AMD to sustain a significant competitive advantage. After all, I would anticipate that any major GPU installation would need to support both NVIDIA and AMD concurrently. Meanwhile, ROCm remains a viable porting and run-time platform for evaluating and deploying AMD GPUs, but customers may have to support two source code trees: one for ROCm and one for CUDA.
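As a rough illustration (my own sketch, not an AMD example), here is what a trivial vector-add looks like once expressed in HIP: the kernel syntax is identical to CUDA, only the runtime calls and launch macro change, and the same source can be built with hipcc for AMD GPUs or routed through HIP's CUDA back end for NVIDIA GPUs.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Minimal HIP vector-add sketch. A hipified CUDA port looks essentially the
// same: cudaMalloc becomes hipMalloc, cudaMemcpy becomes hipMemcpy, and so on.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    hipMalloc(reinterpret_cast<void**>(&dA), bytes);
    hipMalloc(reinterpret_cast<void**>(&dB), bytes);
    hipMalloc(reinterpret_cast<void**>(&dC), bytes);
    hipMemcpy(dA, a.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB, b.data(), bytes, hipMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;
    hipLaunchKernelGGL(vecAdd, dim3(grid), dim3(block), 0, 0, dA, dB, dC, n);

    hipMemcpy(c.data(), dC, bytes, hipMemcpyDeviceToHost);
    std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0

    hipFree(dA);
    hipFree(dB);
    hipFree(dC);
    return 0;
}
```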

AMD also shared a few success stories for ROCm porting that are quite impressive. If customers can port a complex application in half a day, I believe AMD will begin to see traction for its data center-class GPUs. However, QUDA took three weeks to port, probably due to its deep exploitation of CUDA libraries.

Figure 3: AMD has made significant progress with the ROCm software stack and can share a few impressive success stories. Image: AMD

Conclusions

So, given this analysis, should NVIDIA be worried about AMD's new entry into the data center? In AI, no. In HPC, yes. AMD's new GPU is an excellent stepping stone to its exascale platform. I expect AMD will be selected by price-sensitive supercomputer installations that may not have a tremendous need for bleeding-edge AI on the same platform in the short term. However, many or even most HPC users now see AI as an integral part of their workflows.

NVIDIA will remain unchallenged in AI-centric environments for the time being, based on these early raw performance numbers and the newly announced 80GB A100. Note that the performance data shared by AMD is limited to the peak performance of mathematical operations. I will eagerly await application-level performance data, such as MLPerf results for AI, to determine AMD's competitive strength.

In the longer term, improvements in AI performance and a tighter Infinity Fabric interconnect with EPYC Rome could make AMD a significant competitive force. I have previously opined that this approach is one reason why NVIDIA needs to develop a data center-class CPU using Arm.

Meanwhile, the Frontier and El Capitan applications' ecosystem now has an affordable on-ramp to AMD experimentation and support, especially if Cray creates an MI100/EPYC Rome platform in 2021.

Welcome back to the party, AMD! It’s great to have you back.