Selene: A deep competitive moat
Most people think of CUDA when someone mentions NVIDIA’s competitive defenses. Certainly, that high-performance software platform remains a significant advantage for NVIDIA, even 13 years after its introduction. CUDA enables HPC and AI applications to run efficiently on NVIDIA GPUs, and it is embraced by programmers around the world, supporting thousands of applications on millions of GPUs. However, Selene may form an even more formidable moat than the venerable CUDA libraries and tools.
Let’s look at Selene. It comprises 280 NVIDIA DGX A100 servers, each with 8 Ampere GPUs, interconnected by over 490 Mellanox 200Gb switches. Supercomputers typically require up to a year to install, but NVIDIA engineers assembled and tested the platform in under one month—a testament to the DGX platform’s plug-and-play ease of installation.
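To put those figures in perspective, a quick back-of-the-envelope tally (using only the node, GPU, and switch counts quoted above; the per-switch ratio is a rough illustration, not a statement about Selene's actual network topology) looks like this:

```python
# Back-of-the-envelope tally of Selene's scale, using only
# the figures quoted in the text above.
nodes = 280          # NVIDIA DGX A100 servers
gpus_per_node = 8    # Ampere GPUs per DGX A100
switches = 490       # "over 490" Mellanox 200Gb switches (lower bound)

total_gpus = nodes * gpus_per_node
print(f"Total Ampere GPUs: {total_gpus}")                     # 2240
print(f"GPUs per switch (rough): {total_gpus / switches:.2f}")
```

Well over two thousand GPUs stitched together in under a month is the scale of system a rival would need years, and tens of millions of dollars, to replicate.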
Back in 2017, NVIDIA announced the V100, along with the company’s Saturn V in-house supercomputer. A top-30 supercomputer built to enable research and development of NVIDIA software and hardware, the platform has been used to increase the performance of many AI and HPC workloads at scale. Additionally, it was used extensively in the development of the new Ampere-based products. Having such a supercomputer available to NVIDIA engineers and partners can form a strategic competitive advantage in several areas.
First, it provides a state-of-the-art platform for software optimization and model development. Figure 2 shows that NVIDIA doubled the performance of the V100 across a wide range of HPC applications in the two years following that chip’s introduction. Furthermore, the release of the MLPerf benchmarks showed that NVIDIA quadrupled performance for AI, all without a single change to the hardware.
Second, a platform like Saturn V or Selene creates a powerful opportunity for research and collaboration. An example here is the development of Megatron, a billion-plus-parameter natural language model extension to BERT (Bidirectional Encoder Representations from Transformers) that NVIDIA and Microsoft pioneered to advance conversational AI. Not many researchers and developers in the industry have a world-class supercomputer at their disposal to tackle such leading-edge research projects, but NVIDIA and its partners enjoy this capability. I have toured the Saturn V facility in Santa Clara, and it is truly impressive. I believe Selene will take this to the next level.
Finally, and perhaps most importantly, an in-house supercomputer gives NVIDIA engineers a massive AI platform to speed and improve product development. As I have covered previously, the use of AI is emerging as a powerful approach to speed chip development and improve the final product. Synopsys clients, for example, have used AI to explore billions of possible physical layouts to produce chips that consume less power, deliver more performance, require less die area, and get to market faster with fewer engineers. NVIDIA engineers working on Ampere had access to Saturn V for nearly three years, using a system that would cost tens of millions of dollars for a rival to build. The Ampere chip is the impressive result.
Conclusions
NVIDIA CEO Jensen Huang famously says, “The more you buy, the more you save,” and he applies this philosophy to his company’s investments in HPC and AI for his own engineers. What’s good for the goose is good for the gander, right? Consequently, NVIDIA engineers are able to produce better products, and to collaborate with researchers and partners more readily, than any of NVIDIA’s would-be competitors (at least for now). A startup would struggle mightily to marshal the resources to match this level of dedicated compute capacity, and I suspect larger companies like Intel are realizing that a system like Selene will become table stakes for those who wish to enter the game.