NVIDIA Blackwell Architecture Sweeps MLPerf Training v5.1 Benchmarks

The NVIDIA Blackwell architecture powered the fastest time to train across every MLPerf Training v5.1 benchmark, marking a clean sweep in the most recent round of results. As developers experiment with new architectures and models continue to grow in size, more training compute is essential. Meeting this need for delivered compute requires innovation across every layer of the AI stack—from chips and systems to software—advancing performance at an unprecedented pace.

MLPerf Training v5.1 is the latest in the long-running series of industry benchmarks designed to measure AI training performance. This version measures the time to train seven models, representing a wide range of use cases, each to a specified target accuracy. The Blackwell architecture, which powers both NVIDIA Blackwell and NVIDIA Blackwell Ultra GPUs, delivered the highest performance on every benchmark, both at maximum scale and at each submitted scale.

Benchmark | Time to train | Maximum submission scale
Llama 3.1 405B pretraining | 10 minutes | 5,120 Blackwell GPUs
Llama 3.1 8B pretraining | 5.2 minutes | 512 Blackwell Ultra GPUs
Llama 2 70B LoRA fine-tuning | 0.40 minutes | 512 Blackwell Ultra GPUs
FLUX.1 | 12.5 minutes | 1,152 Blackwell GPUs
DLRM-DCNv2 | 0.71 minutes | 64 Blackwell GPUs
R-GAT | 0.84 minutes | 256 Blackwell GPUs
RetinaNet | 1.4 minutes | 512 Blackwell GPUs
Table 1. The NVIDIA platform delivers the fastest time to train on every model currently tested in MLPerf Training

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0082, 5.1-0002, 5.1-0004, 5.1-0060, 5.1-0070, 5.1-0072. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

The NVIDIA platform was also the only one to submit results on all benchmarks. In this post, we take a closer look at these results and the technology innovations that powered them.

NVIDIA makes the industry’s first FP4 training submissions with NVFP4

Innovation in low-precision AI data formats is a key enabler of the performance gains delivered by the Blackwell architecture, which powers Blackwell and Blackwell Ultra GPUs. The Blackwell architecture incorporates hardware acceleration for FP4 data formats, including the NVIDIA-designed NVFP4 format. Blackwell GPUs offer peak FP4 throughput per clock that is twice that of FP8. Blackwell Ultra GPUs build upon that innovation, increasing peak FP4 throughput per clock to 3x that of FP8.

As shown in the paper, Pretraining Large Language Models with NVFP4, NVFP4 provides higher accuracy for the same number of tokens used during training, or achieves the same accuracy using significantly fewer tokens, compared to the industry MXFP4 data format. This means faster time to train to a specified accuracy and faster time to deployment with lower training costs.

This round, NVIDIA adopted NVFP4 in every large language model (LLM) benchmark in MLPerf Training by incorporating many of the techniques recommended in the paper. NVIDIA submissions also carefully applied "healing"—a process by which higher precisions are used during certain parts of the training process—to improve accuracy. Specifically, NVIDIA submissions kept the final few training iterations in FP8 precision.
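A minimal sketch of such a healing schedule is shown below, purely as an illustration of the idea. The `nvfp4_autocast` and `fp8_autocast` context managers, and the `heal_steps` value, are hypothetical stand-ins for whatever low-precision machinery a training stack provides; this is not the actual NVIDIA submission code.

```python
# Illustrative sketch: run most iterations in NVFP4, then "heal" the final
# iterations in FP8. The precision context managers below are placeholders.
import contextlib
import torch

@contextlib.contextmanager
def nvfp4_autocast():
    # Stand-in: a real implementation would quantize GEMM inputs to NVFP4.
    yield

@contextlib.contextmanager
def fp8_autocast():
    # Stand-in: a real implementation would quantize GEMM inputs to FP8.
    yield

def train(model, optimizer, data_loader, total_steps, heal_steps=200):
    """Train in NVFP4 for most steps; switch to FP8 for the last heal_steps."""
    for step, (inputs, targets) in zip(range(total_steps), data_loader):
        in_healing_phase = step >= total_steps - heal_steps
        precision_ctx = fp8_autocast() if in_healing_phase else nvfp4_autocast()

        with precision_ctx:
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```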

These submissions required innovation at every layer of the technology stack, including hardware acceleration of NVFP4 directly in Blackwell and Blackwell Ultra silicon; acceleration libraries including NVIDIA cuBLAS, NVIDIA Transformer Engine, and NVIDIA Megatron-Core; and new numerical techniques.

Blackwell Ultra delivers a big leap for LLM training

NVIDIA submitted the first MLPerf Training results on Blackwell Ultra using an NVIDIA AI cluster codenamed "Theia," after the Greek goddess of sight and vision. It features a total of 512 Blackwell Ultra GPUs, built from multiple NVIDIA GB300 NVL72 rack-scale systems connected using NVIDIA Quantum-X800 InfiniBand.

Blackwell Ultra GPUs incorporate several important enhancements compared to Blackwell GPUs, including:

  • 1.5x peak NVFP4 throughput. Blackwell Ultra GPUs feature updated Tensor Cores that increase peak FP4 throughput per clock by 1.5x compared to Blackwell GPUs. This helps speed up math-bound GEMM operations.
  • 2x softmax for attention. Blackwell Ultra GPUs feature an upgraded special function unit (SFU), providing 2x accelerated throughput for key softmax operations, which can be critical in the attention layer. In MLPerf benchmarks, this results in up to a 1.3x speedup in the attention block.
  • 1.5x larger HBM3e capacity. Blackwell Ultra GPUs incorporate higher-capacity HBM3e stacks, which are now 12-Hi compared to 8-Hi in Blackwell GPUs. On the Llama 2 70B LoRA benchmark, this enabled the entire model to fit in a single GPU, with no CPU offloading required, eliminating model-parallel communication overheads and improving GEMM efficiency.

Blackwell Ultra GPU innovations, adoption of the NVFP4 format, and software optimizations delivered large increases in pretraining and LLM fine-tuning performance with the same number of GPUs compared to the most recent NVIDIA submissions using the Hopper architecture.

[Figure: Two sets of bar charts, showing performance starting with Hopper submissions in prior rounds, followed by Blackwell GB200 NVL72 submissions in v5.0, and finally Blackwell Ultra GB300 NVL72 submissions in v5.1. The speedups shown are 1x, ~2x, and 4x+ for Llama 3.1 405B, and 1x, ~3x, and ~5x for Llama 2 70B LoRA, respectively.]
Figure 1. Relative Llama 3.1 405B pretraining and Llama 2 70B LoRA fine-tuning performance at 512-GPU and 8-GPU scales, respectively

MLPerf Training v4.1, v5.0, and v5.1, closed division. Results from entries: 4.1-0050, 5.0-0076, 5.0-0067, 5.1-0058, 5.1-0060. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Moreover, the latest NVIDIA Quantum-X800 networking platform—composed of NVIDIA ConnectX-8 SuperNICs, NVIDIA Quantum-X800 InfiniBand switches, and NVIDIA LinkX cables—was used to connect the multiple GB300 NVL72 racks that form the Theia cluster. This marks the industry's first and only 800 Gb/s networking submitted to MLPerf Training.

NVIDIA Blackwell sets new Llama 3.1 405B training record

On Llama 3.1 405B, the largest and most challenging benchmark in MLPerf Training v5.1, NVIDIA set a new time-to-train record of 10 minutes, powered by 5,120 Blackwell GPUs. This is 2.7x faster than the fastest submission using Blackwell GPUs in the previous round.*

Two major factors contributed to this large speedup. First, the use of NVFP4 training recipes and general software enhancements enabled the submission using 2,560 Blackwell GPUs to achieve a score of 18.79 minutes. That is 3x faster than the previous NVIDIA submission with the same number of NVIDIA Hopper architecture GPUs.* Second, effective performance per Blackwell GPU increased by 42% when comparing the 2,496 Blackwell GPU submission last round to the 2,560 Blackwell GPU submission this round.*

* MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0067, 5.0-0002, 5.0-0003, 5.0-0004, 5.1-0003, 5.1-0004, 5.1-0071. Performance per GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

[Figure: A dark green line indicates the MLPerf Training v5.0 baseline, scaling from 512 Blackwell GPUs to 2,496 Blackwell GPUs. A lighter green line indicates Blackwell submissions in MLPerf Training v5.1, with points at 512, 2,560, and 5,120 GPUs. At 2,560 GPUs, performance per GPU in v5.1 is indicated as 1.4x that of v5.0 at 2,496 GPUs. At 5,120 GPUs, a 2.7x increase in performance at max scale is indicated.]
Figure 2. Performance scaling with the number of Blackwell GPUs submitted in both MLPerf Training v5.0 and MLPerf Training v5.1

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0001, 5.0-0002, 5.0-0003, 5.0-0004, 5.0-0005, 5.0-0013, 5.0-0014, 5.1-0003, 5.1-0004, 5.1-0071. Performance per GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

This submission also used a total of 5,120 Blackwell GPUs—more than doubling the largest submitted scale of 2,496 Blackwell GPUs in the prior round—connected using NVLink for scale-up within a rack and NVIDIA Quantum-2 InfiniBand for scale-out across multiple racks. Performance increased by 2.7x, meaning that the gains came from both a larger scale and increased effective performance per GPU.

Scaling efficiency, which measures how much performance increases as GPUs are added, was 85% when scaling by 10x from 512 Blackwell GPUs to 5,120 Blackwell GPUs.
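As the footnotes explain, these per-GPU and efficiency figures are derived from the ratios of delivered performance and submitted scales. The sketch below shows that arithmetic using the two v5.1 data points quoted in this post (18.79 minutes at 2,560 GPUs and 10 minutes at 5,120 GPUs); note that the official 85% figure uses a 512-GPU baseline whose time is not listed here, so this is an illustration of the formula rather than a reproduction of that number.

```python
def scaling_efficiency(time_small, gpus_small, time_large, gpus_large):
    """Efficiency = achieved speedup divided by the ideal (linear) speedup."""
    speedup = time_small / time_large          # how much faster the larger run finished
    ideal_speedup = gpus_large / gpus_small    # what perfectly linear scaling would give
    return speedup / ideal_speedup

# Using the two Llama 3.1 405B v5.1 points quoted above:
print(scaling_efficiency(18.79, 2560, 10.0, 5120))  # ~0.94, i.e. ~94% efficiency from 2,560 to 5,120 GPUs
```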

This is critical, as it enables model builders to scale up training runs, accelerating time to train and time to revenue, while ensuring that each of those incremental GPUs achieves high utilization.

Blackwell Ultra sets the bar for Llama 3.1 8B training performance

To ensure that MLPerf Training results represent modern AI use cases, the benchmark is regularly updated. This round, BERT-large was replaced by Llama 3.1 8B, which provides a substantial increase in capability and training complexity while remaining a simple, accessible LLM for a broader range of platforms.

The NVIDIA platform delivered the highest performance on the Llama 3.1 8B training benchmark, both in terms of performance at a given number of GPUs and performance at scale.

Llama 3.1 8B submissions also benefited from several full-stack optimizations. 

One was the use of NVFP4 training recipes, which enabled performance increases while maintaining accuracy, even with a much smaller model.

Next, with increased context lengths, attention becomes a critical component of end-to-end LLM pretraining performance. Previous NVIDIA LLM pretraining submissions used BF16 precision for the inputs of the batched-matrix-multiply (BMM) computations in the attention block. This round, NVIDIA submissions used FP8 precision for the attention BMM inputs on the Llama 3.1 8B pretraining benchmark, applied to both the forward and backward pass computations.

Our FP8 recipe achieved up to 1.3x higher performance in the attention kernel of MLPerf benchmarks compared to its BF16 counterpart, while still meeting the accuracy requirements of the benchmark.

The FP8 attention recipe used for the pretraining benchmarks this round uses per-tensor current scaling FP8 for the query (Q), key (K), and value (V) tensors, as well as for the gradient of the output (dO) used in backward propagation. FP8 attention resulted in a 5% end-to-end speedup on the Llama 3.1 8B model. The FP8 attention implementation, for both delayed scaling and current scaling recipes, is available in the NVIDIA cuDNN library, which is used in NVIDIA MLPerf submissions through the NVIDIA Transformer Engine library.
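To make "per-tensor current scaling" concrete, here is a simplified, self-contained sketch of the idea in PyTorch: the scale is computed from the live amax of each tensor rather than from a history of past amax values. The actual kernels live in cuDNN and are driven through Transformer Engine, as described above; the function name and shapes below are purely illustrative.

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def fp8_current_scale(t: torch.Tensor):
    """Per-tensor 'current' scaling: compute amax from the live tensor,
    scale it into the FP8 range, and return the quantized tensor together
    with the descale factor a consumer kernel would need."""
    amax = t.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax                       # scale into FP8 range
    t_fp8 = (t * scale).to(torch.float8_e4m3fn)   # quantize (PyTorch >= 2.1)
    return t_fp8, 1.0 / scale                     # descale for the consumer

# In the FP8 attention recipe, Q, K, V (forward) and dO (backward) are each
# quantized this way before the batched matrix multiplies.
q = torch.randn(1024, 8, 128)
q_fp8, q_descale = fp8_current_scale(q)
```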

Other software optimizations implemented for the pretraining models include the following, which focused on eliminating device-to-device memory copies and tensor concatenations:

  • Implementing a fused RoPE kernel in Transformer Engine that takes a combined Q/K/V input and outputs separate Q, K, and V tensors. This avoids splitting the Q, K, V tensors in the forward pass and concatenating the dQ, dK, dV tensors in the backward pass.
  • Avoiding conversion of the attention input to the BSHD layout by using the SBHD attention layout instead, as shown in the sketch after this list. This change was implemented in Megatron-LM. In this notation, B stands for batch size, S for sequence length, H for the number of attention heads, and D for head dimension, consistent with Transformer Engine notation.
  • Fusing the amax computation into the producer operations.
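The following minimal sketch illustrates why keeping the SBHD layout matters: converting to BSHD forces a transpose plus a contiguous copy of the whole activation tensor. The tensor sizes are illustrative, not taken from the benchmark configuration.

```python
import torch

S, B, H, D = 4096, 2, 32, 128  # sequence, batch, heads, head dim (illustrative sizes)

# Megatron-style activations naturally arrive as SBHD: [S, B, H, D].
q_sbhd = torch.randn(S, B, H, D, device="cuda" if torch.cuda.is_available() else "cpu")

# Converting to BSHD for the attention kernel forces a transpose + contiguous(),
# i.e. an extra device-to-device copy of the entire tensor:
q_bshd = q_sbhd.transpose(0, 1).contiguous()   # materializes a new [B, S, H, D] buffer

# Passing the SBHD tensor directly to an attention implementation that accepts
# this layout (Transformer Engine exposes a qkv_format option) skips that copy.
```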

Highest performance on the new FLUX.1 benchmark

Another benchmark update was the addition of the FLUX.1 image generation model, replacing Stable Diffusion v2. On this test, NVIDIA once again set the bar, delivering the fastest time to train at scale of 12.5 minutes using 1,152 Blackwell GPUs. NVIDIA was also the only platform to submit results on this benchmark, highlighting both the performance and versatility of the NVIDIA training stack.

Llama 2 70B LoRA software optimizations

This round, several fusion optimizations were implemented that significantly benefited the Llama 2 70B LoRA fine-tuning benchmark. The core idea is the use of LoRALinearLayer, which combines the LoRA adapters and the frozen GEMM within the same module. Building this abstraction enables the cast operations, scaling operations, and the addition to the frozen GEMM output to be fused.
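A minimal sketch of such a module is shown below. The class name comes from the post, but the implementation (rank, scaling, initialization) is illustrative; in the actual submission, the point of the abstraction is that the casts, scaling, and addition can be fused into the GEMM at the kernel level rather than executed as the separate PyTorch ops written here.

```python
import torch
import torch.nn as nn

class LoRALinearLayer(nn.Module):
    """Frozen base linear plus low-rank LoRA adapters in one module, so that
    casts, scaling, and the final addition can be handled (and fused) together."""
    def __init__(self, in_features, out_features, rank=16, alpha=32.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen GEMM and LoRA path computed side by side; keeping both in one
        # module is what allows a backend to fuse the cast/scale/add epilogue.
        frozen_out = self.base(x)
        lora_out = (x @ self.lora_a) @ self.lora_b
        return frozen_out + self.scaling * lora_out
```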

Key takeaways

NVIDIA is innovating on a one-year rhythm, with innovation across GPU, CPU, scale-up networking, scale-out networking, system architecture, and software, to drive up performance, drive down the cost of intelligence, and pave the way for new AI breakthroughs.

See more NVIDIA performance data on the Data Center Deep Learning Product Performance Hub and Performance Explorer pages.


