NVIDIA wins all 16 benchmarks
NVIDIA has dominated the AI accelerator business since it created the market, growing it to the point where NVIDIA AI revenues will likely overtake the gaming segment over the next year. It is rare in technology that a single company owns a multi-billion-dollar market without attracting competition, and there are now scores of startups, and established companies like Intel and Qualcomm, that are developing competitive products. But after three years of challenging promises in PowerPoint, and a few notable failures, NVIDIA still stands virtually alone in a market that some project will exceed $28B annually in 10 years.
To refresh your memory, MLPerf provides three categories of benchmark systems (available, preview, and research) across multiple AI benchmarks. These benchmarks measure training times for a maximum scaled system (like the NVIDIA Selene, which is the world’s 7th fastest supercomputer) and for a single server (such as the 8-GPU DGX A100). The new V0.7 results are only for training; new inference benchmarks will probably be published in the next few months. The results? NVIDIA won all 16 benchmarks by a healthy margin in the commercially available category. The unannounced Google TPU V4 looks promising but is currently only marginally better than the NVIDIA A100 on three out of eight benchmarks.
Interestingly, the only competitors to submit results on commercially available servers were Google and Huawei, and even these firms only submitted results for image classification (ResNet-50) and natural language processing (BERT). While the mammoth “at scale” results were incredibly strong for NVIDIA, demonstrating its prowess in system design, Mellanox networking and AI software, I tend to look more at the per-system results and per-chip normalized analyses. After all, very few companies can afford the supercomputer and engineering effort required to run a benchmark at massive scale.
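One simple way to think about per-chip normalization is to compare the accelerator-minutes each system spends reaching the target accuracy. The sketch below is only an illustration under stated assumptions: the `Submission` type, the function names, and the figures are hypothetical placeholders, not actual MLPerf results, and the approach assumes performance scales roughly linearly with chip count, which becomes less true as systems grow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One training result: a system, its accelerator count, and its time to train."""
    system: str
    chips: int
    minutes_to_train: float


def chip_minutes(sub: Submission) -> float:
    """Accelerator-minutes consumed to reach the target accuracy (lower is better)."""
    return sub.chips * sub.minutes_to_train


def per_chip_speedup(candidate: Submission, baseline: Submission) -> float:
    """Ratio of baseline to candidate chip-minutes; above 1.0 favors the candidate."""
    return chip_minutes(baseline) / chip_minutes(candidate)


# Placeholder figures for illustration only; these are NOT MLPerf results.
system_a = Submission("hypothetical-8-chip-server", chips=8, minutes_to_train=40.0)
system_b = Submission("hypothetical-16-chip-server", chips=16, minutes_to_train=30.0)

print(f"{per_chip_speedup(system_a, system_b):.2f}x")
```

In this made-up case the 16-chip system finishes sooner in wall-clock time, yet the 8-chip system does more work per chip, which is why a per-chip view can tell a different story than raw time to train.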
For commercially available systems, NVIDIA handily beat Huawei by over 2 times, and bested the Google TPU V3 by 25-75%. But why didn’t Huawei and Google run the other 6 or 7 benchmarks? Well, Google probably did, but realized that the new A100 would crush its 3-year-old chip. Instead, it wisely focused its efforts on the upcoming V4 chip, which it submitted in the research category.
Note that the A100 bested the V100 by anywhere from 50% to 250%. In fact, NVIDIA increased its MLPerf performance up to 4 times over the last 18 months, thanks to both new silicon and software innovations. It is not surprising that so many challengers (including Intel’s Nervana) struggle to surpass this fast-moving green target.
What could the future hold?
The widely anticipated V4 version of Google’s TPU looks strong, but keep in mind that this chip is not yet available on the Google Cloud. Even if Google launches it later this year, as I expect, the TPU V4 will only just barely catch up with the A100, which is in full production now. While the TPU V4 gets crushed in fast-growing recommendation and reinforcement learning applications, this could possibly be remedied with ongoing improvements in the software stacks. Also, keep in mind that MLPerf provides neither price/performance comparisons nor power-consumption analyses, where Google may hold an advantage over NVIDIA.
So, where the heck is everyone else? There is nothing significant from Intel, which changed horses from Nervana to Habana Labs last December. Habana’s training chip (called Gaudi) is still under wraps. Also, neither Cerebras nor Graphcore, two well-heeled and publicized startups, published any results. Frankly, I’m not sure they will do so any time soon, if ever. Both companies are focusing on applications which they hope will be a relatively poor fit for NVIDIA GPUs, using massive wafer-scale chips and in-processor memory, respectively. But one wonders how large those niche markets could be. And publishing results on a benchmark that NVIDIA has spent years optimizing would not be a smart move. When I was managing marketing for IBM POWER servers, we knew that one should never publish a benchmark that is not a #1 result; that would just add ammunition to our competition’s arsenal.
Looking at markets that are new for GPUs, some have been skeptical that NVIDIA could excel in recommendation engines, which require a lot of memory and are typically deployed on CPUs today. Yet NVIDIA performed surprisingly well on the new DLRM benchmark.
Let’s talk about software
OK, so NVIDIA’s new A100 chip looks to be as good as was promised just a couple of months ago. While that silicon is impressive (and huge), NVIDIA’s results in MLPerf benchmarks are also enabled by significant investments in and improvements to the company’s AI software stacks. NVIDIA provided us with an apples-to-apples comparison of performance on the Volta V100 with and without the new software, a good demonstration of the speedup attributable to software enhancements.
NVIDIA’s ecosystem investments and scale extend beyond supporting standard software stacks such as PyTorch and TensorFlow. The company built application and partner ecosystems for seven application segments, from Health Care to Autonomous Vehicles, all supported by the Selene supercomputer and other DGX systems at NVIDIA. These solutions were recently extended by the Jarvis and Merlin frameworks for conversational AI and recommendation systems respectively, two fast-growing markets enabled by AI.
A couple of years ago, investors took NVIDIA down with a vengeance, spurred by a combination of the collapse of the currency mining bubble and concerns that an army of specialized AI ASICs would devastate NVIDIA’s surging AI business. Those ASIC concerns were based on the simple thesis that a chip that does one thing (AI) extremely well would be far better than a more general-purpose GPU; chants of “10 times better” were commonly heard in marketing and conference presentations. Fast-forward to 2020, and we have seen that NVIDIA’s prowess in silicon and software, now supplemented by its Mellanox networking and DGX system designs, presents a far deeper and wider competitive moat. The latest MLPerf results were somewhat disappointing, not because NVIDIA failed to impress, but because none of these vaunted competitors was able to mount a challenge to NVIDIA’s leadership in the data center.
But as usual, there are caveats here: NVIDIA cannot possibly remain the only game in town. Google’s TPU V4 results are promising and could further improve as the company finishes its chip and software optimizations. Meanwhile, Qualcomm should be close to finishing up its Snapdragon-inspired Cloud AI 100 design for edge cloud applications, Intel will probably roll out its Habana Labs product later this year, and startups such as Cerebras and Graphcore continue to make headway with their silicon and software products and to gain traction.
Grab the popcorn. This movie isn’t over yet!