Meta Builds World’s Largest AI Supercomputer With NVIDIA For AI Research And Production

by | Feb 8, 2022 | In the News

There are big implications, for both companies, beyond just bragging rights.

Facebook, I mean Meta, has always been one of the industry leaders when it comes to AI research and deployment. The company processes hundreds of trillions (yes, trillions with a “T”) of inferences every day, and trains some 30,000 models daily on its current NVIDIA V100-based AI fleet. That takes a ginormous amount of processing, and the load is growing rapidly, perhaps doubling every year based on previous disclosures of data center power consumption. Consequently, rumors have consistently asserted that the company would develop its own AI accelerator, much as Google has done with the TPU. But now we know that the company still appears to love, and depend on, NVIDIA GPUs to get the job done.

RSC: The world’s largest AI Supercomputer

So, before we get to the implications, let’s review the specs. The new system, dubbed the “RSC” (ok, Meta could use some help with naming, right?), already has 760 DGX servers with 6,080 A100 GPUs and 1,520 AMD EPYC CPUs, connected by NVIDIA’s Quantum InfiniBand networking, which supports up to 200Gb/s of bandwidth. The plan is to build that up to 16,000 GPUs by July, which at 5 Exaflops would make it the largest known AI supercomputer in the world, beating out the US DOE’s 4 Exaflops NVIDIA-based Perlmutter system. Pure Storage is supplying a flash subsystem that will grow to hold up to an exabyte of training data, and Penguin Computing is acting as the system integrator, helping out with the setup and installation. When finished, the 16,000-GPU RSC will be able to train trillion-parameter AI models in weeks, and become a critical tool for Meta to fulfill its metaverse ambitions. (Remember, GPUs do graphics, too!)
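For readers who like to check the math, here is a quick back-of-the-envelope sketch in Python using only the figures quoted above. The variable names are ours, and the projection to 16,000 GPUs assumes the build-out keeps the same eight-GPU, two-CPU server design, which the announcement figures imply but do not spell out.

```python
# Figures quoted in the article above.
dgx_servers = 760
gpus_now = 6_080
cpus_now = 1_520
gpus_planned = 16_000
exaflops_planned = 5  # headline peak figure for the full system

# Per-server breakdown implied by the current configuration.
gpus_per_server = gpus_now // dgx_servers
cpus_per_server = cpus_now // dgx_servers

# Assumption (ours): the full build-out uses the same server design.
servers_planned = gpus_planned // gpus_per_server

# 1 exaflop = 1e18 FLOP/s and 1 teraflop = 1e12 FLOP/s.
tflops_per_gpu = exaflops_planned * 1e18 / gpus_planned / 1e12

print(f"GPUs per DGX server: {gpus_per_server}")
print(f"CPUs per DGX server: {cpus_per_server}")
print(f"Servers needed for {gpus_planned:,} GPUs: {servers_planned:,}")
print(f"Implied peak per GPU: {tflops_per_gpu:.1f} teraflops")
```

The numbers hang together: eight GPUs and two CPUs per DGX server, roughly 2,000 servers at full build-out, and about 312.5 teraflops per GPU. That last figure lines up with the A100’s published peak for dense FP16/BF16 Tensor Core math, which suggests the 5 Exaflops headline is a mixed-precision peak rather than sustained throughput.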

The Implications

Ok, here are our takeaways:

  1. Facebook recognizes that NVIDIA GPUs are the best platform available for AI research and development. There’s no need, at present, to invest in a home-grown “better” chip, if that’s even possible. Other AI behemoths like Microsoft Azure (and OpenAI) have come to the same conclusion, at least for now. To say the least, rumors of the fragility of NVIDIA’s leadership are greatly exaggerated.
  2. The NVIDIA DGX server allows customers like Facebook to stand up a large fleet quickly, avoiding the months or years of planning normally needed to design and install a custom supercomputer. DGX is plug-and-play, from a single server to a massive supercomputer. And over 600 software stacks are available on the NVIDIA GPU Cloud. Meta gets it: time to market matters.
  3. In the past, large hyper-scalers designed and built all their own custom servers, with help from Taiwanese ODMs, in order to minimize cost. That approach may be fine for commodity servers, but is inadequate for state-of-the-art AI. Consequently, the systems business at NVIDIA is transforming from a literally gold-plated poster child into a world-class platform worthy of the expense: DGX has the best AI performance money can buy. This sets the stage nicely for the upcoming NVIDIA Grace roll-out of a fully integrated intelligence platform, a complete re-imagining of accelerated computing systems. We would note that NVIDIA has managed to walk a fine line here with its partners, providing comparable, though less robust, HGX designs so that OEMs can leverage NVIDIA design expertise in their own private-labeled servers.

Conclusions

The NVIDIA DGX server with up to eight A100 GPUs. (Image: NVIDIA)

Yes, the Meta RSC supercomputer is impressive, without a doubt. But even more impressive to our eye is how NVIDIA keeps raising the bar. As we have said, NVIDIA is no longer (just) a chip company. They design and deliver some of the finest and most complete accelerated data centers available anywhere, at any cost. And the largest data centers in the world can’t seem to get enough of them.