Ok, here we go again. MLCommons has released new AI benchmarks, this time for training. And the story is familiar: Nvidia runs all AI models better than anyone, AMD once again decides not to play ball, and Intel does the best it can with older hardware (Gaudi 3 wasn’t quite ready).
This time around, the MLCommons community has added two new benchmarks: one for Graph Neural Networks and one for LLM fine-tuning using Llama 2 and LoRA (Low-Rank Adaptation). LoRA can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by roughly 3 times. Let’s take a look. We will also discuss the “yearly cadence” announcements from Nvidia and AMD.
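For context, LoRA freezes the base model’s weights and learns a small low-rank update for selected layers, which is where those parameter savings come from. Below is a minimal sketch using the Hugging Face PEFT library; the model name, rank, and target modules are illustrative assumptions, not details from the MLPerf fine-tuning submission.

```python
# Illustrative LoRA fine-tuning setup with Hugging Face PEFT.
# Model name, rank, and target modules are assumptions for the example,
# not the actual MLPerf LLM fine-tuning configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
# Typically reports well under 1% of the base model's parameters as trainable,
# since only the small rank-r adapter matrices are updated.
```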
Nvidia Sweeps Every MLPerf Benchmark
Nvidia isn’t just waiting for Blackwell, due out in full force later this year. It is improving the performance of its Hopper-based GPU systems by tuning models and software. The company’s engineers have set a new LLM record with 11,616 Hopper GPUs, tripling training performance over last year’s results with near-perfect scaling.
First, as usual, Nvidia ran all the benchmarks and touted improvements since the last run with the H100 a year ago. Four of the benchmarks are relevant to generative AI, and Nvidia brought out the big guns, scaling to over 11,000 GPUs to complete the GPT-3 benchmark run in 3.4 minutes (the benchmark is not indicative of how long a full training run would take).
As the world waits for Blackwell, Nvidia needs to sell a ton of Hoppers. Historically, Nvidia has increased performance with full-stack optimizations, and it has done so again with decent results, reducing GPT-3 training time by some 27% on a 512-GPU cluster. A lot of this came from the Transformer Engine, which determines the precision that best meets the needs of each layer. Note that no Nvidia competitor has anything similar.
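As a rough illustration of what that per-layer precision selection looks like in practice, here is a minimal sketch using Nvidia’s Transformer Engine bindings for PyTorch (FP8 requires Hopper-class or newer hardware). The layer size and recipe settings are assumptions for the example, not Nvidia’s actual MLPerf configuration.

```python
# Conceptual sketch of FP8 execution with NVIDIA Transformer Engine (PyTorch).
# Layer size and recipe settings are illustrative assumptions only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe: Transformer Engine tracks per-tensor amax history
# and picks FP8 scaling factors, keeping higher precision where FP8 won't do.
fp8_recipe = recipe.DelayedScaling(amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the matmul runs in FP8 under the recipe's scaling policy
```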
And in the text-to-image space, Nvidia was able to increase performance by some 80% in just seven months. See the chart below for details.
While the new benchmarks are for training, Nvidia just couldn’t contain its excitement about H100-based inference, claiming a 50% improvement at batch size one, with some undisclosed future software expected to increase throughput even more. Stay tuned.
At Computex in Taipei, CEO Jensen Huang shared the following slide detailing what he means by a yearly cadence. It doesn’t mean a new GPU architecture every year. Rather, he means a new GPU architecture every two years, with an intervening kicker provided by adding more layers to the HBM stack. This is a far more consumable roadmap than many had feared, and is similar to what AMD announced at the show as well.
Intel continues to be the only other company to share MLPerf results.
Intel also ran the benchmarks, including the important new LoRA fine-tuning workload, but on Gaudi 2; Gaudi 3 just wasn’t ready yet. Intel raised the bar on scale, using Ethernet, the native networking of the Gaudi architecture. The Intel engineers trained on a large system of 1,024 Gaudi 2 accelerators in the Intel Tiber Developer Cloud.
And Intel is banging the drum for better AI affordability: a system with eight Intel Gaudi 2 accelerators and a universal baseboard is priced at $65,000, which the company estimates to be one-third the cost of comparable competitive platforms (a.k.a. Nvidia). The comparable Intel Gaudi 3 configuration lists at $125,000, estimated to be two-thirds the cost of competitive platforms.
Since AMD still isn’t sharing results, Intel can claim to be the best benchmarked alternative to the more expensive (and faster) Nvidia GPUs.
Conclusions
Once again, we hear the sound of one hand clapping. Ok, two if you count Intel, God bless them. And Google did post some results as well. Keep in mind, running these benchmarks tells a vendor where they are good, and where they can improve. So, trust me, AMD ran the benchmarks.
Nvidia keeps winning.