Ok, here we go again. MLCommons has released new AI benchmarks, this time for training. And the story is familiar: Nvidia runs all AI models better than anyone, AMD once again decides not to play ball, and Intel does the best it can with older hardware (Gaudi 3 wasn’t quite ready).
This time around, the MLCommons community has added two new benchmarks: one for Graph Neural Networks and one for LLM fine-tuning using Llama 2 and LoRA (Low-Rank Adaptation). LoRA can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by roughly 3 times. Let’s take a look. We will also discuss the “yearly cadence” announcements from Nvidia and AMD.
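For context, LoRA freezes the base model’s weights and learns a small low-rank update for selected layers, which is where those parameter savings come from. Below is a minimal sketch using the Hugging Face PEFT library; the model name, rank, and target modules are illustrative assumptions, not details from the MLPerf fine-tuning submission.

```python
# Illustrative LoRA fine-tuning setup with Hugging Face PEFT.
# Model name, rank, and target modules are assumptions for the example,
# not the actual MLPerf LLM fine-tuning configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
# Typically reports well under 1% of the base model's parameters as trainable,
# since only the small rank-r adapter matrices are updated.
```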
Nvidia Sweeps Every MLPerf Benchmark
Nvidia isn’t just waiting for Blackwell, due out in full force later this year. It is improving the performance of its Hopper-based GPU systems by tuning models and software. The company’s engineers have set a new LLM record with 11,616 Hopper GPUs, tripling training performance over last year’s results with near-perfect scaling.
First, as usual, Nvidia ran all the benchmarks and touted improvements since the last run with the H100 a year ago. Four of the benchmarks are relevant to generative AI, and Nvidia brought out the big guns, scaling to over 11,000 GPUs to complete the GPT-3 benchmark run in 3.4 minutes (the benchmark is not indicative of how long a full training run would take).
As the world waits for Blackwell, Nvidia needs to sell a ton of Hoppers. Historically, Nvidia has increased performance with full-stack optimizations, and it has done so again with decent results, reducing GPT-3 training time by some 27% on a 512-GPU cluster. A lot of this came from the Transformer Engine, which determines the precision that best meets the needs of each layer. Note that no Nvidia competitor has anything similar.
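As a rough illustration of what that per-layer precision selection looks like in practice, here is a minimal sketch using Nvidia’s Transformer Engine bindings for PyTorch (FP8 requires Hopper-class or newer hardware). The layer size and recipe settings are assumptions for the example, not Nvidia’s actual MLPerf configuration.

```python
# Conceptual sketch of FP8 execution with NVIDIA Transformer Engine (PyTorch).
# Layer size and recipe settings are illustrative assumptions only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe: Transformer Engine tracks per-tensor amax history
# and picks FP8 scaling factors, keeping higher precision where FP8 won't do.
fp8_recipe = recipe.DelayedScaling(amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the matmul runs in FP8 under the recipe's scaling policy
```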
And in the text-to-image space, Nvidia was able to increase performance by some 80% in just seven months. See the chart below for details.
While the new benchmarks are for training, Nvidia just couldn’t contain its excitement about H100-based inference, claiming a 50% improvement at batch size one, with some undisclosed future software expected to increase throughput even more. Stay tuned.
At Computex in Taipei, CEO Jensen Huang shared the following slide detailing what he means by a yearly cadence. It doesn’t mean a new GPU architecture every year. Rather, he means a new GPU architecture every two years, with an intervening kicker provided by adding more layers to the HBM stack. This is a far more consumable roadmap than many had feared, and is similar to what AMD announced at the show as well.
Intel continues to be the only other company to share MLPerf results.
Intel also ran the benchmarks, including the important new LoRA fine-tuning workload, but on Gaudi 2; Gaudi 3 just wasn’t ready yet. Intel raised the bar on scale, using Ethernet, the native networking of the Gaudi architecture. The Intel engineers trained on a large system of 1,024 Gaudi 2 accelerators in the Intel Tiber Developer Cloud.
And Intel is banging the drum for better AI affordability: a system with eight Intel Gaudi 2 accelerators and a universal baseboard is priced at $65,000, which the company estimates to be one-third the cost of comparable competitive platforms (a.k.a. Nvidia). The comparable Intel Gaudi 3 configuration lists at $125,000, estimated to be two-thirds the cost of competitive platforms.
Since AMD still isn’t sharing results, Intel can claim to be the best benchmarked alternative to the more expensive (and faster) Nvidia GPUs.
Conclusions
Once again, we hear the sound of one hand clapping. Ok, two if you count Intel, God bless them. And Google did post some results as well. Keep in mind, running these benchmarks tells a vendor where they are good, and where they can improve. So, trust me, AMD ran the benchmarks.
Nvidia keeps winning.