AI innovation continues to be driven by three scaling laws: pre-training, post-training, and test-time scaling. Pre-training is foundational to building smarter models, and post-training, which can include fine-tuning, reinforcement learning, and other techniques, further increases accuracy on specific tasks and gives models new capabilities, such as the ability to reason.
As foundation models grow larger and more complex, they require more compute performance to train. This can mean longer times and higher costs to train for a fixed amount of performance. And, as AI researchers continue to experiment in search of their next set of model architecture breakthroughs, still more training compute is needed before the final pre-training run. Innovation is required to dramatically increase delivered compute to train more complex models while reducing the cost per unit of compute.
That’s where NVIDIA extreme codesign comes in. NVIDIA innovates across GPUs, CPUs, NVIDIA NVLink Switches, network interface cards (NICs), data processing units (DPUs), the NVIDIA Quantum InfiniBand platform, the NVIDIA Spectrum-X Ethernet platform, system architecture, and a mountain of software to deliver large increases in training performance, far beyond what Moore’s Law can deliver. These performance increases not only mean shorter training times, allowing model builders to deploy their models and start generating revenue sooner, but also lower model training costs, increasing return on investment.
That’s why the world’s leading AI models today are trained on NVIDIA.
In this post, we take a closer look at how new chips, as well as continued software stack innovations on the same architecture, can dramatically speed up time-to-train while significantly reducing cost-to-train.
NVIDIA GB200 NVL72 delivers a big leap over Hopper
In the latest round of MLPerf Training, NVIDIA made the industry’s first and only submissions using FP4 precision, and did so across every large language model (LLM) in the benchmarking suite. These breakthroughs, the result of NVFP4 precision accelerated in hardware by the NVIDIA Blackwell architecture, new training recipes, and overall software stack enhancements, mean that GB200 NVL72 delivered up to 3.2x faster training performance on the Llama 3.1 405B benchmark at the same GPU count compared to optimized submissions using NVIDIA Hopper running FP8.


This faster time-to-train means that model developers can bring their models to market sooner, accelerating their ability to generate revenue from their latest AI innovations.
The increased performance of the NVIDIA Blackwell platform not only accelerates model training; those performance gains also significantly outpace the increase in hourly instance pricing, translating into significant performance-per-dollar gains.


MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0014, 5.1-0072. Training performance per dollar is not a primary metric of MLPerf Training, and performance per dollar is not verified by MLCommons. Performance per dollar is derived from MLPerf Training performance and published on-demand instance pricing. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Using publicly available GPU rental prices and the most recent MLPerf Training submissions on Llama 3.1 405B for NVIDIA H100 and GB200 NVL72, respectively, GB200 NVL72 delivers almost 2x the performance per dollar of H100.
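The performance-per-dollar derivation is straightforward arithmetic: throughput (the inverse of time-to-train) divided by the total hourly cost of the cluster. Here is a minimal sketch of that calculation; the training times and rental prices in it are illustrative placeholders, not the actual MLPerf results or published rates.

```python
# Sketch of the performance-per-dollar derivation described above.
# Times and prices below are illustrative placeholders, NOT the actual
# MLPerf results or published on-demand rates; substitute real values.

def perf_per_dollar(time_to_train_min: float, gpus: int, price_per_gpu_hr: float) -> float:
    """Performance per dollar: throughput (1 / time-to-train) divided by total hourly cost."""
    perf = 1.0 / time_to_train_min          # higher is better
    cost_per_hr = gpus * price_per_gpu_hr   # total cluster cost per hour
    return perf / cost_per_hr

# Hypothetical inputs for a same-scale comparison (placeholders only):
h100 = perf_per_dollar(time_to_train_min=240.0, gpus=512, price_per_gpu_hr=2.0)
gb200 = perf_per_dollar(time_to_train_min=75.0, gpus=512, price_per_gpu_hr=4.0)

print(f"GB200 NVL72 vs. H100 perf/$: {gb200 / h100:.2f}x")  # ~1.6x with these placeholders
```

Note how the structure of the calculation makes the claim in the text concrete: even at a higher hourly price per GPU, a large enough reduction in time-to-train yields a net performance-per-dollar gain.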
How NVFP4 training unlocked even more performance and performance per dollar on the same Blackwell GPUs
In addition to the large performance leaps that new architectures and platforms deliver as part of the NVIDIA annual roadmap rhythm, NVIDIA engineers are constantly looking for ways to extract more performance from existing architectures through ongoing algorithmic and software innovations.
The Blackwell architecture adds support for FP4 acceleration directly in hardware, including both industry FP4 formats and the NVIDIA-designed NVFP4 format, which helps improve performance compared to other FP4 formats. The use of NVFP4 training recipes in the most recent MLPerf Training v5.1 round enabled significant training performance improvements on the same GB200 NVL72 rack-scale architecture compared to FP8 submissions in the prior round: up to 1.4x higher performance at a similar scale.
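To make the format concrete, below is a simplified sketch of NVFP4-style block quantization based on NVIDIA’s public description of the format: 4-bit E2M1 values with one scale factor per 16-element block (stored as FP8 E4M3 in hardware; kept in FP32 here for simplicity), plus a second-level per-tensor scale that this sketch omits. It illustrates the idea only; actual hardware and library behavior differ.

```python
import numpy as np

# Simplified sketch of NVFP4-style block quantization. Assumptions: E2M1
# 4-bit values, one scale per 16-element block (FP32 here instead of the
# hardware's E4M3), and no second-level per-tensor scale.

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable E2M1 magnitudes

def quantize_nvfp4_block(block: np.ndarray) -> np.ndarray:
    """Quantize one 16-element block to E2M1 with a per-block scale, then dequantize."""
    scale = np.max(np.abs(block)) / E2M1_GRID[-1]  # map the block's max magnitude to 6.0
    if scale == 0.0:
        return np.zeros_like(block)
    scaled = block / scale
    # Round each scaled magnitude to the nearest representable E2M1 value.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

x = np.random.randn(16).astype(np.float32)
xq = quantize_nvfp4_block(x)
print("max abs quantization error:", np.max(np.abs(x - xq)))
```

The small block size is the design point worth noticing: a scale per 16 elements tracks local dynamic range far more tightly than a single per-tensor scale, which is what lets 4-bit values preserve enough fidelity for training.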


MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0067, 5.1-0072. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This performance improvement yields not only significantly faster training but, because it comes on the same GPUs, also higher performance per dollar.
Blackwell Ultra delivers another big performance boost
The NVIDIA GB300 NVL72, which features the upgraded NVIDIA Blackwell Ultra GPU, demonstrated further training speedups in MLPerf Training, fueled by significantly higher FP4 compute as well as larger high-bandwidth memory (HBM). Comparing submissions at the 512-GPU scale, GB300 NVL72 completed the Llama 3.1 405B benchmark 1.9x faster than GB200 NVL72 did at the same scale last round, bringing the cumulative performance gain over Hopper to 4.2x.
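As a quick sanity check on how these round-over-round factors compose (assuming speedups multiply for a fixed benchmark at a fixed scale), the prior GB200-versus-Hopper factor can be backed out of the stated numbers; note the implied figure is our arithmetic, not a stated result.

```python
# Round-over-round speedups compose multiplicatively for a fixed benchmark/scale.
gb300_vs_gb200_v50 = 1.9    # GB300 NVL72 (v5.1) vs. GB200 NVL72 (v5.0), 512 GPUs, as stated above
cumulative_vs_hopper = 4.2  # GB300 NVL72 vs. Hopper, as stated above

implied_gb200_v50_vs_hopper = cumulative_vs_hopper / gb300_vs_gb200_v50
print(f"Implied GB200 (v5.0) vs. Hopper: {implied_gb200_v50_vs_hopper:.1f}x")  # ~2.2x (inferred)
```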


MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0014, 5.1-0060, 5.0-0067, 5.1-0072. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
GB300 NVL72 has now demonstrated large performance gains across both MLPerf Training and MLPerf Inference compared to the prior-generation GB200 NVL72, accelerated by the broadening application of the NVFP4 data format. This means that with GB300 NVL72, model makers can train their next-generation models faster and bring them to market sooner, as well as serve them with higher throughput, increasing the revenue potential of serving models.
NVIDIA extreme codesign at the speed of light
By innovating relentlessly across GPU, CPU, scale-up fabric, scale-out and scale-across networking, system architecture, and software, NVIDIA extreme codesign delivers massive performance leaps every year. These gains are set to enable training of larger and smarter next-generation AI models, as well as fast and cost-efficient serving of those models, bringing even more value to the broader AI ecosystem.
To learn more about our latest MLPerf Training and Inference results, visit our MLPerf benchmark webpage, and check out our MLPerf Training v5.1 and MLPerf Inference v5.1 technical blogs.
