After steering a 25% rise in Nvidia shares last week, CEO Jensen Huang flew to Computex in Taipei to announce a slew of new products demonstrating how his company intends to continue to lead as it approaches the $1 trillion market cap milestone. While these products alone won’t take the company to $2 trillion, they indicate just how important generative AI will be to get there.
Major Nvidia announcements are typically reserved for the annual GTC event in Silicon Valley. But this year is different; this is the year AI graduates from cool technology to a must-have solution, one possessing nearly human intelligence and seemingly infinite knowledge, that every Global 500 company will need to master.
While Nvidia announced products spanning everything from AI supercomputers to gaming characters that can hold conversations, we will focus on the three technologies that bear most directly on this AI moment: the Grace Hopper-based, 256-GPU DGX GH200; the MGX platform for system builders; and the new Spectrum-X Ethernet networking that ties it all together.
The DGX GH200
Nvidia CEO Jensen Huang has been telling us for years that Nvidia is in the business of selling optimized data centers, not chips and components. You probably missed the boat if you mistakenly took that as hyperbole. Nvidia has been selling DGX SuperPods since the A100 generation, but while those systems could scale quite large, the NVLink-enabled optimization (shared memory) was limited to eight GPUs: a programmer could treat eight GPUs as one large GPU sharing a single memory pool.
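To make the “one large GPU” idea concrete, here is a minimal PyTorch sketch for a multi-GPU NVLink node; the device indices, tensor size and helper name are illustrative, not anything DGX-specific.

```python
# Minimal sketch: NVLink-connected GPUs can read each other's memory
# directly (peer-to-peer), which is what lets software treat several
# GPUs as one large device. Illustrative only; runs on any multi-GPU box.
import torch

def report_peer_access() -> None:
    """Print which GPU pairs support direct peer-to-peer memory access."""
    n = torch.cuda.device_count()
    for src in range(n):
        peers = [dst for dst in range(n)
                 if dst != src and torch.cuda.can_device_access_peer(src, dst)]
        print(f"cuda:{src} -> direct peer access to {peers}")

if __name__ == "__main__":
    report_peer_access()
    if torch.cuda.device_count() >= 2:
        # With peer access enabled, this copy moves GPU-to-GPU over
        # NVLink, never touching host memory.
        x = torch.randn(1024, 1024, device="cuda:0")
        y = x.to("cuda:1")
        print(f"copied {y.numel()} elements from cuda:0 to cuda:1")
```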
But the definition of “large” changed dramatically when GPT-4 hit the silicon streets; it is rumored to contain a trillion parameters. You need a massive amount of shared, fast (HBM) memory to train these huge AI models, as well as to run inference queries against them. Customers like Google, Meta and Microsoft/OpenAI need a far larger footprint.
Now Jensen has announced the DGX GH200 massive-memory supercomputer for generative AI, powered by Grace Hopper Superchips and NVLink to train large (there’s that word again) AI models and drive AI innovation forward. What the original DGX did for smaller models, the DGX GH200 will do for builders of massive AI. The system is interconnected with NVLink to provide 1 exaflop of AI (low-precision) performance and 144 terabytes of shared memory, nearly 500 times more than the previous-generation Nvidia DGX A100 introduced in 2020. Within each Grace Hopper Superchip, the NVLink-C2C interconnect increases the bandwidth between GPU and CPU sevenfold compared with the latest PCIe technology, slashes interconnect power consumption by more than five times and provides a roughly 600GB Hopper-architecture GPU building block for the DGX GH200.
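Those headline numbers hang together. Here is a quick back-of-the-envelope check, assuming the published Grace Hopper figures of 96GB of HBM3 plus 480GB of LPDDR5X per Superchip, and NVLink-C2C at 900GB/s against roughly 128GB/s for a PCIe Gen5 x16 link (figures from public spec sheets, not from the Computex materials):

```python
# Back-of-the-envelope check of the DGX GH200 headline numbers.
# Assumed per-Superchip memory: 96 GB HBM3 + 480 GB LPDDR5X (public specs).
superchips = 256
pool_gb = superchips * (96 + 480)
print(f"shared pool: {pool_gb} GB = {pool_gb / 1024:.0f} TB")  # 144 TB at 1 TB = 1024 GB

# Assumed CPU<->GPU links: NVLink-C2C at 900 GB/s vs PCIe Gen5 x16 at ~128 GB/s.
print(f"NVLink-C2C vs PCIe Gen5: {900 / 128:.1f}x")  # ~7x, matching the claim
```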
Speaking of hyperscalers, Nvidia mentioned three: Google, Meta and Microsoft, all of which voiced support in Nvidia’s press materials. Consequently, we would be shocked if these AI leaders don’t deploy multiple GH200 systems. Amazon Web Services was notably MIA. We are convinced that AWS intends to go it alone, favoring its own AI chips, Inferentia and Trainium. Good luck with that; those chips simply are not competitive with Nvidia’s. Nor is AWS networking competitive with NVLink, not to mention the new Nvidia Ethernet technology also announced at Computex.
As is usually the case, Nvidia is its own best customer for the GH200. It announced that it is building the Helios supercomputer to advance its AI research and development. This private supercomputer, akin to Selene before it, will feature four DGX GH200 systems interconnected with Nvidia Quantum-2 InfiniBand networking to accelerate data throughput for training large AI models. The 1,024-Superchip system (four times 256 Grace Hoppers) is expected to come online by the end of the year.
MGX
Nvidia must also enable its partners, who play a pivotal role in extending its market reach into first- and second-tier cloud service providers (CSPs). For system vendors like HPE, Dell, Lenovo, Penguin and others, Nvidia created the HGX reference board to enable 8-way GPU connectivity. But for more flexibility to mix and match technologies, Jensen announced the MGX reference architecture. MGX enables over 100 unique configurations of Nvidia CPU, GPU and DPU components so that system builders can meet the individual needs of their customers. Nvidia announced six partners, normally associated with building hyperscale infrastructure, as the first to sign up for MGX.
Spectrum-X
Nvidia acquired its InfiniBand technology when it bought Mellanox three years ago. InfiniBand is great for supercomputers in part because it is “lossless,” unlike Ethernet, which recovers from lost packets by trying again. And again. And again, until the missing network packet finally arrives at its destination. That’s fine for cloud services, but not for HPC. And, it appears, not for large-scale AI.
Massive AI needs the performance and lossless packet delivery of InfiniBand, but it prefers the lower-cost, ubiquitous Ethernet networking that already runs its data centers. As the figure below shows in the upper right, Ethernet’s effective bandwidth fluctuates considerably because the TCP/IP protocol tolerates frequent packet drops by retransmitting. And that’s just not okay with big AI.
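To see why, consider a toy simulation of a synchronized workload on a drop-and-retransmit network; the flow count, loss rates and ten-RTT timeout below are illustrative assumptions, not measurements of any real NIC or switch.

```python
# Toy model: one training step finishes only when ALL of its flows land,
# and every dropped packet costs a retransmission timeout that is far
# longer than a clean send. All numbers here are illustrative.
import random

def step_time(flows: int, loss_rate: float, timeout_rtts: float = 10.0) -> float:
    """Step time (in clean-RTT units) for a step gated on its slowest flow."""
    def flow_time() -> float:
        t = 1.0                                # one clean round trip
        while random.random() < loss_rate:     # drop -> wait out timeout, resend
            t += timeout_rtts
        return t
    return max(flow_time() for _ in range(flows))

random.seed(0)
for loss in (0.0, 0.001, 0.01):
    times = [step_time(flows=512, loss_rate=loss) for _ in range(1_000)]
    print(f"loss {loss:>5.1%}: mean {sum(times) / len(times):.1f} RTTs, "
          f"worst {max(times):.1f} RTTs")
```

Even a 0.1% drop rate makes step times swing by an order of magnitude in this sketch, because a synchronized job pays for its slowest flow on every step.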
Nvidia’s solution is Spectrum-X, which pairs the new Spectrum-4 Ethernet switch with the high-performance BlueField-3 DPU. Together, the Spectrum-4 switch, the BlueField-3 NIC and the Nvidia networking software stack achieve 1.7 times better overall AI performance and power efficiency, along with the consistency afforded by a lossless network. That’s right: Nvidia is promising an Ethernet network that does not drop packets. Sounds like magic to me, but this is just what enterprise and hyperscale data centers have been requesting for decades.
Conclusions
Nvidia has taken a holistic approach to training and running large AI models, laying out a reference architecture with a 256-GPU building block that users of large AI will embrace. Combine that with the recent Dell announcement, supercomputing traction for Grace, Microsoft Azure support for Nvidia AI Enterprise, the new MGX reference architecture, NVLink and the Spectrum-X lossless Ethernet solution, and anyone running serious AI jobs has only one logical choice: Nvidia.
Could this change? Yes. Competition will always nibble at Nvidia’s heels. AMD has a serious entry, the MI300, set to enter the market later this year. RISC-V-based contenders like Tenstorrent and Esperanto are getting attention and some traction, but not in the market where Jensen is focused: massive foundation models. Intel could pull a rabbit out of the hat with Gaudi3 and/or the forever-late Ponte Vecchio GPU.
But as I have said in the past, considering Nvidia’s superior hardware and the depth and breadth of its software for optimizing AI and high-performance computing (HPC) applications, all competitors combined might capture perhaps 10% of the market. In a $75 billion market, that could still be plenty to float some more boats.
But Nvidia is the only trillion-dollar player, and we don’t see that changing.