Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and the lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.
MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a broad range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x that of all other submitters combined.
This round, the NVIDIA partner ecosystem participated broadly, with 14 partners, the largest number submitting on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda all delivered excellent performance on the NVIDIA platform.


This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.
New benchmarks, new performance records
The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects the models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and it delivered the highest performance across all of them.
This round of MLPerf Inference added several new tests, including:
- DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x faster minimum token rate and a 1.3x shorter time to first token compared with the Server scenario, representing higher-interactivity deployments.
- Qwen3-VL-235B-A22B: Vision-language model with a total of 235B parameters. This is the first multi-modal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.
- GPT-OSS-120B: 120B-parameter MoE reasoning LLM developed by OpenAI. This benchmark includes three scenarios: Offline, Server, and Interactive.
- WAN-2.2-T2V-A14B: Text-to-video generative AI model with 14B active parameters. Two scenarios are tested: Single Stream, which measures the latency to process a single video generation request, and Offline, which measures the number of samples processed per second in a batch-processing setting.
- DLRMv3: A generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared with the prior benchmark. It tests the Offline and Server scenarios.
| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
| --- | --- | --- | --- | --- | --- |
| Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs** (Single Stream) | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |
* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 uses a Single Stream scenario, which measures end-to-end request latency, instead of a Server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.


NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs
NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput on existing platforms. This reduces token production cost and enables AI factory operators to serve more users, and generate more revenue, with a given infrastructure footprint.
The extra performance also provides headroom to run future AI models and to serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.
This round, NVIDIA GB300 NVL72, launched last year, delivered up to 2.7x higher token throughput compared with its debut submissions just six months ago on the Server scenario of the DeepSeek-R1 benchmark1. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to generate each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.
1. MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
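The cost arithmetic behind that claim is simple: with a fixed infrastructure and power footprint, cost per token scales inversely with throughput, so a 2.7x speedup cuts per-token cost by roughly 63%. A minimal sketch:

```python
# With the infrastructure and power footprint held fixed, cost per
# token scales as 1/throughput, so a throughput speedup translates
# directly into a per-token cost reduction.
def cost_reduction(speedup: float) -> float:
    """Fractional reduction in cost per token for a given speedup."""
    return 1 - 1 / speedup

print(f"{cost_reduction(2.7):.0%}")  # → 63%
```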
Powering the DeepSeek-R1 performance improvements in the Server and Offline scenarios were several software enhancements, including:
- Faster kernels: a combination of higher-performance kernels and fewer kernel launches thanks to kernel fusions.
- Optimized Attention Data Parallel: better balancing of context requests between different ranks, enabling significant speedups in end-to-end performance.
The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added, more demanding DeepSeek-R1 Interactive scenario. These include:
- Disaggregated serving: This capability in Dynamo separates the inference phases (prefill and decode) and optimizes the configuration of each individually, enabling optimal overall throughput.
- Wide Expert Parallel (WideEP): For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. Splitting, or sharding, the experts across multiple GPUs within an NVL72 system reduces this bottleneck, improving end-to-end performance.
- Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller and performance is dominated by how quickly weights can be loaded into memory, leaving compute underutilized. Applying that otherwise-idle compute to predict and verify additional tokens in parallel (up to three in this implementation) increases throughput at high interactivity.
- KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
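The draft-and-verify idea behind Multi-Token Prediction can be sketched in plain Python. The "models" below are deterministic toy stand-ins invented for illustration; the real implementation lives inside TensorRT-LLM and operates on logits and GPU kernels, not integers:

```python
# Toy draft-and-verify loop: a cheap draft proposes k tokens, the main
# model checks them in one batched pass, and the longest agreeing
# prefix is accepted, so several tokens can land per model step.

def toy_target(context):
    # Stand-in for the full model: next token = last token + 1.
    return context[-1] + 1

def toy_draft(context, k=3):
    # Stand-in draft head: usually right, wrong every 4th position.
    out, ctx = [], list(context)
    for _ in range(k):
        nxt = ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2
        out.append(nxt)
        ctx.append(nxt)
    return out

def generate(prompt, steps=4, k=3):
    ctx = list(prompt)
    for _ in range(steps):
        draft = toy_draft(ctx, k)
        accepted, check_ctx = [], list(ctx)
        for tok in draft:  # batched "verification" pass over k positions
            if toy_target(check_ctx) == tok:
                accepted.append(tok)
                check_ctx.append(tok)
            else:
                break
        if len(accepted) < k:
            # Always make progress: take the target's own token on a miss.
            accepted.append(toy_target(check_ctx))
        ctx.extend(accepted)
    return ctx
```

When the draft agrees with the target, each step emits up to k tokens instead of one; output correctness is unchanged because every accepted token is exactly what the target would have produced.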
NVIDIA was the first and only platform to submit DeepSeek-R1 results to MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on the returning DeepSeek-R1 scenarios but was once again the only platform to submit on the newly added Interactive scenario.
And even on Llama 3.1 405B, a very large, dense LLM launched almost two years ago, GB300 NVL72 performance increased by 1.5x in the Server scenario.
| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
| --- | --- | --- | --- |
| DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x |
MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per-chip performance is derived by dividing total throughput by the number of reported chips and is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
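The speedup column in the table can be reproduced directly from its per-GPU throughput figures:

```python
# Per-GPU throughput (tokens/sec/GPU) from the table above,
# as (v5.1, v6.0) pairs; speedup = v6.0 / v5.1.
results = {
    "DeepSeek-R1 (Server)":    (2_907, 8_064),
    "DeepSeek-R1 (Offline)":   (5_842, 9_821),
    "Llama 3.1 405B (Server)": (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}
for name, (v51, v60) in results.items():
    print(f"{name}: {v60 / v51:.2f}x")
```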
Moreover, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.
For DLRMv3, the submission was built on two open source projects: the NVIDIA recsys-example for high-performance, transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.
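The core idea of an embedding cache can be illustrated without any framework: keep hot embedding rows in fast (GPU) memory and fall back to the larger backing table on a miss. This is a generic LRU sketch for illustration only, not the NV Embedding Cache API:

```python
# Minimal LRU embedding cache: hot rows stay in a small fast store,
# cold rows are fetched from the full backing table on demand.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, backing_table, capacity):
        self.backing = backing_table   # "host" table: id -> vector
        self.cache = OrderedDict()     # "device" cache, LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, ids):
        vectors = []
        for i in ids:
            if i in self.cache:
                self.hits += 1
                self.cache.move_to_end(i)  # mark as recently used
            else:
                self.misses += 1
                self.cache[i] = self.backing[i]
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)  # evict LRU row
            vectors.append(self.cache[i])
        return vectors
```

In a recommendation workload the id distribution is heavily skewed, so even a small cache absorbs most lookups; the real system does this with GPU memory tiers rather than Python dictionaries.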
Through extensive and ongoing engineering, NVIDIA is continually increasing performance for existing models on existing hardware, as these results show. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.
Scale-out inference with the NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the Offline and Server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.
| DeepSeek-R1 on 4x GB300 NVL72 | Tokens/sec |
| --- | --- |
| Offline | 2,494,310 |
| Server | 1,555,110 |
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entry: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
With 288 Blackwell Ultra GPUs, the largest scale ever submitted to any benchmark in MLPerf Inference, these submissions set new system-level throughput records, processing millions of tokens per second.
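As a sanity check on scale, the record totals divide back out to approximate per-GPU figures (a derived view; per-chip throughput is not a primary MLPerf metric):

```python
# Four GB300 NVL72 systems expose 4 x 72 = 288 Blackwell Ultra GPUs;
# dividing the record totals by GPU count gives the per-GPU view.
gpus = 4 * 72
offline_total = 2_494_310  # tokens/sec, Offline scenario
server_total = 1_555_110   # tokens/sec, Server scenario
print(round(offline_total / gpus))  # → 8661
print(round(server_total / gpus))   # → 5400
```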
Looking ahead to MLPerf Endpoints
Delivering inference throughput takes extreme co-design across chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that NVIDIA delivers unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision-language models to generative recommender systems and more, on industry-standard benchmarks.
AI inference workloads also continue to evolve rapidly as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.
NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community a rigorous, auditable picture of how deployed services perform under real API traffic, capturing key performance metrics that chip-level benchmarks alone cannot reveal, while providing the rigor and result integrity that defines MLPerf benchmarks.
To explore the latest performance of the NVIDIA platform across training, inference, and high-performance computing, see our deep learning product performance page.
Acknowledgements
NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We'd like to acknowledge the contributions of the following individuals (sorted by last name):
Tomar Bar-on, Nitin Sai Bommi, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Wookje Han, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Olivia Stoner, Jacob Subag, Tong Tong, Harshil Vagadia, Shobhit Verma, June Yang, Tailing Yuan, Ben Zhang, and many others across NVIDIA whose efforts made these results possible.
