Does NVIDIA Selene Form A Wider Moat Than CUDA?

by | Jul 6, 2020 | AI and Machine Learning, In the News

The annual International Supercomputing Conference (ISC), held virtually this year, kicked off today. Not surprisingly, NVIDIA has already made a few announcements of note. Of particular interest to me was the announcement of Selene, NVIDIA’s in-house 1+ exaflop AI supercomputer, which ranks as the fastest industrial system in the USA and #7 overall in the TOP500. NVIDIA also announced a new PCIe version of the A100 accelerator, six A100-based supercomputer wins and a new Mellanox UFM Cyber AI platform to predict and detect security threats and predict network failures. Still, Selene was the star of the show.

Selene: A deep competitive moat

Most people think of CUDA when someone mentions NVIDIA’s competitive defenses. Certainly, the high-performance software is a significant advantage for NVIDIA, even 13 years after its introduction. CUDA enables HPC and AI applications to run efficiently on NVIDIA GPUs, and is embraced by programmers around the world. It supports thousands of applications on millions of GPUs. However, Selene may form an even more formidable defensive moat than the venerable CUDA libraries and tools.

Let’s look at Selene. It comprises 280 NVIDIA DGX A100 servers, each with 8 Ampere GPUs, interconnected by over 490 Mellanox 200Gb/s switches. Supercomputers typically require up to a year to be installed, but NVIDIA engineers assembled and tested the platform in under one month, a testament to the DGX platform’s plug-and-play ease of installation.
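To put those server counts in perspective, here is a rough back-of-the-envelope sketch of how the "1+ exaflop" AI figure could be reached. The server and GPU counts come from the paragraph above; the per-GPU throughput of 624 teraflops is my assumption, based on the commonly quoted peak FP16 Tensor Core rate for the A100 with structured sparsity, and the function names are purely illustrative. This is a theoretical peak, not a measured benchmark result.

```python
# Figures taken from the article.
DGX_SERVERS = 280
GPUS_PER_SERVER = 8

# Assumption (not from the article): peak FP16 Tensor Core throughput
# per A100 GPU with structured sparsity, in teraflops.
ASSUMED_PEAK_TFLOPS_PER_GPU = 624

TFLOPS_PER_EXAFLOP = 1_000_000


def total_gpus(servers: int, gpus_per_server: int) -> int:
    """Total number of GPUs in the system."""
    return servers * gpus_per_server


def peak_ai_exaflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Theoretical aggregate peak in exaflops, ignoring all real-world overheads."""
    return gpu_count * tflops_per_gpu / TFLOPS_PER_EXAFLOP


if __name__ == "__main__":
    gpus = total_gpus(DGX_SERVERS, GPUS_PER_SERVER)
    peak = peak_ai_exaflops(gpus, ASSUMED_PEAK_TFLOPS_PER_GPU)
    print(f"GPUs: {gpus}")
    print(f"Theoretical peak AI throughput: {peak:.2f} exaflops")
```

Under that assumption, the total comes out comfortably above one exaflop, which is consistent with the headline claim. Sustained performance on real workloads, and the double-precision result used for the Top 500 ranking, will be far lower.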

Figure 1: The Selene supercomputer was built in under a month and will provide a level of computational capacity that can form a serious competitive weapon for NVIDIA. Image: NVIDIA

Back in 2017, NVIDIA announced the V100, along with the company’s Saturn V in-house supercomputer. A top-30 supercomputer built to enable research and development of NVIDIA software and hardware, the platform has been used to increase the performance of many AI and HPC workloads at scale. Additionally, it was used extensively in the development of the new Ampere-based products. Having such a supercomputer available to NVIDIA engineers and partners can form a strategic competitive advantage in several areas.

First, it provides a state-of-the-art platform for software optimization and model development. Figure 2 shows that NVIDIA doubled the performance of the V100 across a wide range of HPC applications in the two years following that chip’s introduction. Furthermore, the release of the MLPerf benchmarks showed that NVIDIA quadrupled performance for AI, all without a single change to the hardware.

Figure 2: NVIDIA has been able to apply its compute resources and talent to doubling or even quadrupling the performance of its silicon after launch. Image: NVIDIA

Second, a platform like Saturn V or Selene creates a powerful opportunity for research and collaboration. An example here is the development of Megatron, a billion-plus parameter natural language model extension to BERT (Bidirectional Encoder Representations from Transformers) that NVIDIA and Microsoft pioneered to advance conversational AI. Not many researchers and developers in the industry have a world-class supercomputer at their disposal to tackle such leading-edge research projects, but NVIDIA and its partners enjoy this capability. I have toured the Saturn V facility in Santa Clara, and it is truly impressive. I believe Selene will take this to the next level.

Finally, and perhaps most importantly, an in-house supercomputer uniquely provisions NVIDIA engineers with a massive AI platform to speed and improve product development. As I have covered previously, the use of AI is emerging as a powerful approach to accelerate chip development and improve the final product. Synopsys clients, for example, have used AI to explore billions of possible physical layouts in order to produce chips that consume less power, deliver more performance, require less die area and get to market faster with fewer engineers. NVIDIA engineers working on Ampere had access to Saturn V for nearly three years, using a system that would cost tens of millions of dollars for a rival to build. The Ampere chip is the impressive result.

Conclusions

NVIDIA CEO Jensen Huang famously says, “The more you buy, the more you save,” and applies this philosophy to his company’s investments in HPC and AI for his engineers. What’s good for the goose is good for the gander, right? Consequently, NVIDIA engineers are able to produce better products, and collaborate with researchers and partners more readily than any of NVIDIA’s would-be competitors (at least for now). A startup would struggle mightily to marshal the resources to match this level of dedicated compute capacity, and I suspect larger companies like Intel are realizing that having a system like Selene will become table stakes for those who wish to enter the game.