NVIDIA Keeps The Performance Crown For AI Inference For The 6th Time In A Row

Sep 29, 2022

In the data center and on the edge, the bottom line is that the H100 (Hopper-based) GPU is up to four times faster than the NVIDIA A100 on the newly released MLPerf v2.1 benchmark suite. The A100 retains leadership in many benchmarks versus other available products, while the NVIDIA Jetson AGX Orin leads in edge computing.

Every three months we share our perspectives with you as the MLCommons consortium coordinates the publication of new AI benchmarks for inference and training. While NVIDIA ceded the absolute performance crown in the last training smack-down as it awaited the H100, it ran away with the inference performance lead this time around thanks to the H100, and we expect it to repeat that dominance in the training round in three months. Let's take a look.

The Results

As is typical with successive generations of NVIDIA GPUs, the Hopper GPU delivers 1.5x to 2.5x the performance of the Ampere A100 on most AI benchmarks. But in the fast-growing world of Natural Language Processing (NLP), the H100's Transformer Engine and software delivered better than four times the performance of the A100 and handily beat the Chinese-built Biren BR104 GPU. NLP is extremely computationally intense: it can take hundreds or even thousands of GPUs to train the large models, and inference on those models is also performance-hungry.

NVIDIA H100 stole the show for inference performance, especially for the transformer network BERT-Large at 99.9% accuracy.


Once again, NVIDIA’s mature software stack enabled it to submit results for all 7 inference benchmarks, while its competitors cherry-pick one or two benchmarks they have been tuning for customer projects. This is a critical point for companies considering new technologies, especially cloud service providers, since a) multi-model inference processing is quickly becoming critical, and b) the fungibility of a processor across a large number of workloads greatly simplifies IT management, reduces capital expenses, and dramatically improves utilization rates, increasing profits for CSPs.

Here is an example of how a multi-model inference workflow looks today for a natural-language query interface. It takes nine AI models working together to answer the simple question, “What kind of flower is this?” You certainly don’t want to have to spread that query across multiple chips, right? Especially since you need the answer in milliseconds. We strongly encourage all the vendors to submit across all models in the MLPerf suite. (A rough illustration of the latency math follows the figure below.)

A simple query can decompose into nine AI models. Together these models need to get the answer to the user in 10 milliseconds.

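To make the latency math concrete, here is a minimal, purely illustrative Python sketch. The stage names, timings, and helper functions are invented for illustration and are not NVIDIA's actual pipeline; it simply chains nine stand-in models and checks the total against a 10 millisecond budget:

    # A minimal, purely illustrative sketch (not NVIDIA's pipeline): each model
    # feeds the next, so the whole chain shares one end-to-end latency budget,
    # and every extra device hop eats into that budget.
    import time

    def run_stage(name, work_ms):
        """Stand-in for one model in the chain (ASR, NLU, vision, TTS, ...)."""
        time.sleep(work_ms / 1000.0)  # pretend compute; real kernels are measured, not slept
        return name

    def answer_query(stages, budget_ms=10.0):
        start = time.perf_counter()
        for name, work_ms in stages:
            run_stage(name, work_ms)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return elapsed_ms, elapsed_ms <= budget_ms

    # Nine hypothetical ~1 ms stages; a cross-chip copy between any two of them
    # would add latency the 10 ms budget cannot absorb.
    pipeline = [(f"model_{i}", 1.0) for i in range(9)]
    elapsed_ms, within_budget = answer_query(pipeline)
    print(f"end-to-end: {elapsed_ms:.1f} ms, within budget: {within_budget}")

The point of the toy example is simply that the budget is shared: nine models at roughly a millisecond each leave essentially no headroom for shuttling intermediate results between chips.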

Before we wrap up, we want to share the significant performance improvements NVIDIA has realized for their edge platform, the Jetson Orin.

At the same power draw, a 50% performance improvement also translates to a 50% improvement in power efficiency.



Speaking of encouraging vendors, I have a message for those still MIA when it comes to these benchmarking efforts (I’m talking to you, AMD, AWS, Groq, SambaNova, Tenstorrent, …). (Note that Intel, Google, and Graphcore have frequently submitted training results to MLPerf.) You can’t hide. You know you run these models all the time to support sales efforts and to identify performance enhancements (or bottlenecks) internally. The ML community values openness and transparency. Get on board, or I will assume you don’t have a chip good enough to compete. If you think you do, prove it. (OK, I’m sure I just lost a lot of business!)

Speaking of new submissions, this round saw the first from the SAPEON X220-Enterprise from SK Telecom and the Biren BR104 GPU out of China, the latter of which showed impressive performance for image processing with just 32GB of HBM2E memory. We look forward to seeing how these chips perform on training benchmarks in a few months. We congratulate SK Telecom and Biren on their openness and transparency, and on excellent results.

(As an aside, the BR104 makes the US Government’s prohibition on A100 and H100 GPU shipments to China look a little odd to us; that horse has already left the barn.)

Power Efficiency Is Another Story

While the NVIDIA H100 is undoubtedly the fastest chip for AI, Qualcomm steals the show when it comes to power efficiency, a critical requirement for many edge and edge-cloud applications. Once again, the Qualcomm Cloud AI 100 bested all submissions for power efficiency, albeit only for image processing and NLP (BERT). The recent adoption of the Cloud AI 100 by a string of server vendors, including Dell, HPE, and Lenovo, is an early indicator of significant customer interest in using the Qualcomm platform for edge deployments. We have covered more details regarding the Qualcomm submissions in another article here on Forbes and on our website.

Conclusions

The story is pretty simple for now. In inference processing, NVIDIA wins the performance race (albeit with a chip that is not yet generally available) and Qualcomm wins the power efficiency battle (albeit only for image and natural language processing to date). But stay tuned; new vendors will eventually see the value and, we hope, submit their results. Unfortunately, I suspect some of these organizations will adhere to the old adage, “Don’t release any benchmark you didn’t win!” In other words, the wall of shame listed above probably includes a lot of chips that just aren’t as good as their marketing hype, or as good as NVIDIA and Qualcomm.

I hope you aren’t growing tired of these MLPerf stories. I am not, as I firmly believe that the MLCommons organization, which has grown and evolved significantly over the last 2 years, is providing a valuable service to chip makers, server vendors, and their customers.