Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and the lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.
MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a broad range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x that of all other submitters combined.
This round, the NVIDIA partner ecosystem participated broadly, with 14 partners, the largest number submitting on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda all delivered excellent performance on the NVIDIA platform.


This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.
New benchmarks, new performance records
The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects the models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and it delivered the highest performance across all of them.
This round of MLPerf Inference added several new tests, including:
- DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x faster minimum token rate and a 1.3x shorter time to first token compared with the Server scenario, representing higher-interactivity deployments.
- Qwen3-VL-235B-A22B: Vision-language model with a total of 235B parameters. This is the first multi-modal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.
- GPT-OSS-120B: 120B-parameter MoE reasoning LLM developed by OpenAI. This benchmark includes three scenarios: Offline, Server, and Interactive.
- WAN-2.2-T2V-A14B: Text-to-video generative AI model with 14B active parameters. Two scenarios are tested: Single Stream, which measures the latency to process a single video generation request, and Offline, which measures the number of samples processed per second in a batch-processing setting.
- DLRMv3: A generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared with the prior benchmark. It tests the Offline and Server scenarios.
| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
| --- | --- | --- | --- | --- | --- |
| Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs** (Single Stream) | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |
* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 uses a Single Stream scenario, which measures end-to-end request latency, instead of a Server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.


NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs
NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput on existing platforms. This reduces token production cost and enables AI factory operators to serve more users, and generate more revenue, with a given infrastructure footprint.
The extra performance also provides headroom to run future AI models and to serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.
This round, NVIDIA GB300 NVL72, launched last year, delivered up to 2.7x higher token throughput compared with its debut submissions just six months ago on the Server scenario of the DeepSeek-R1 benchmark1. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to generate each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.
1. MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
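The cost arithmetic behind that claim is simple: with a fixed infrastructure and power footprint, cost per token scales inversely with throughput, so a 2.7x speedup cuts per-token cost by roughly 63%. A minimal sketch:

```python
# With the infrastructure and power footprint held fixed, cost per
# token scales as 1/throughput, so a throughput speedup translates
# directly into a per-token cost reduction.
def cost_reduction(speedup: float) -> float:
    """Fractional reduction in cost per token for a given speedup."""
    return 1 - 1 / speedup

print(f"{cost_reduction(2.7):.0%}")  # → 63%
```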
Powering the DeepSeek-R1 performance improvements in the Server and Offline scenarios were several software enhancements, including:
- Faster kernels: a combination of higher-performance kernels and fewer kernel launches thanks to kernel fusions.
- Optimized Attention Data Parallel: better balancing of context requests between different ranks, enabling significant speedups in end-to-end performance.
The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added, more demanding DeepSeek-R1 Interactive scenario. These include:
- Disaggregated serving: This capability in Dynamo separates the inference phases (prefill and decode) and optimizes the configuration of each individually, enabling optimal overall throughput.
- Wide Expert Parallel (WideEP): For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. Splitting, or sharding, the experts across multiple GPUs within an NVL72 system reduces this bottleneck, improving end-to-end performance.
- Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller and performance is dominated by how quickly weights can be loaded into memory, leaving compute underutilized. Applying that otherwise-idle compute to predict and verify additional tokens in parallel (up to three in this implementation) increases throughput at high interactivity.
- KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
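The draft-and-verify idea behind Multi-Token Prediction can be sketched in plain Python. The "models" below are deterministic toy stand-ins invented for illustration; the real implementation lives inside TensorRT-LLM and operates on logits and GPU kernels, not integers:

```python
# Toy draft-and-verify loop: a cheap draft proposes k tokens, the main
# model checks them in one batched pass, and the longest agreeing
# prefix is accepted, so several tokens can land per model step.

def toy_target(context):
    # Stand-in for the full model: next token = last token + 1.
    return context[-1] + 1

def toy_draft(context, k=3):
    # Stand-in draft head: usually right, wrong every 4th position.
    out, ctx = [], list(context)
    for _ in range(k):
        nxt = ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2
        out.append(nxt)
        ctx.append(nxt)
    return out

def generate(prompt, steps=4, k=3):
    ctx = list(prompt)
    for _ in range(steps):
        draft = toy_draft(ctx, k)
        accepted, check_ctx = [], list(ctx)
        for tok in draft:  # batched "verification" pass over k positions
            if toy_target(check_ctx) == tok:
                accepted.append(tok)
                check_ctx.append(tok)
            else:
                break
        if len(accepted) < k:
            # Always make progress: take the target's own token on a miss.
            accepted.append(toy_target(check_ctx))
        ctx.extend(accepted)
    return ctx
```

When the draft agrees with the target, each step emits up to k tokens instead of one; output correctness is unchanged because every accepted token is exactly what the target would have produced.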
NVIDIA was the first and only platform to submit DeepSeek-R1 results to MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on the returning DeepSeek-R1 scenarios but was once again the only platform to submit on the newly added Interactive scenario.
And even on Llama 3.1 405B, a very large, dense LLM launched almost two years ago, GB300 NVL72 performance increased by 1.5x in the Server scenario.
| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
| --- | --- | --- | --- |
| DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x |
MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per-chip performance is derived by dividing total throughput by the number of reported chips and is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
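The speedup column in the table can be reproduced directly from its per-GPU throughput figures:

```python
# Per-GPU throughput (tokens/sec/GPU) from the table above,
# as (v5.1, v6.0) pairs; speedup = v6.0 / v5.1.
results = {
    "DeepSeek-R1 (Server)":    (2_907, 8_064),
    "DeepSeek-R1 (Offline)":   (5_842, 9_821),
    "Llama 3.1 405B (Server)": (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}
for name, (v51, v60) in results.items():
    print(f"{name}: {v60 / v51:.2f}x")
```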
Moreover, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.
For DLRMv3, the submission was built on two open source projects: the NVIDIA recsys-example for high-performance, transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.
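The core idea of an embedding cache can be illustrated without any framework: keep hot embedding rows in fast (GPU) memory and fall back to the larger backing table on a miss. This is a generic LRU sketch for illustration only, not the NV Embedding Cache API:

```python
# Minimal LRU embedding cache: hot rows stay in a small fast store,
# cold rows are fetched from the full backing table on demand.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, backing_table, capacity):
        self.backing = backing_table   # "host" table: id -> vector
        self.cache = OrderedDict()     # "device" cache, LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, ids):
        vectors = []
        for i in ids:
            if i in self.cache:
                self.hits += 1
                self.cache.move_to_end(i)  # mark as recently used
            else:
                self.misses += 1
                self.cache[i] = self.backing[i]
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)  # evict LRU row
            vectors.append(self.cache[i])
        return vectors
```

In a recommendation workload the id distribution is heavily skewed, so even a small cache absorbs most lookups; the real system does this with GPU memory tiers rather than Python dictionaries.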
Through extensive and ongoing engineering, NVIDIA is continually increasing performance for existing models on existing hardware, as these results show. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.
Scale-out inference with the NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the Offline and Server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.
| DeepSeek-R1 on 4x GB300 NVL72 | Tokens/sec |
| --- | --- |
| Offline | 2,494,310 |
| Server | 1,555,110 |
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entry: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
With 288 Blackwell Ultra GPUs, the largest scale ever submitted to any benchmark in MLPerf Inference, these submissions set new system-level throughput records, processing millions of tokens per second.
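As a sanity check on scale, the record totals divide back out to approximate per-GPU figures (a derived view; per-chip throughput is not a primary MLPerf metric):

```python
# Four GB300 NVL72 systems expose 4 x 72 = 288 Blackwell Ultra GPUs;
# dividing the record totals by GPU count gives the per-GPU view.
gpus = 4 * 72
offline_total = 2_494_310  # tokens/sec, Offline scenario
server_total = 1_555_110   # tokens/sec, Server scenario
print(round(offline_total / gpus))  # → 8661
print(round(server_total / gpus))   # → 5400
```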
Looking ahead to MLPerf Endpoints
Delivering inference throughput takes extreme co-design across chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that NVIDIA delivers unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision-language models to generative recommender systems and more, on industry-standard benchmarks.
AI inference workloads also continue to evolve rapidly as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.
NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community a rigorous, auditable picture of how deployed services perform under real API traffic, capturing key performance metrics that chip-level benchmarks alone cannot reveal, while providing the rigor and result integrity that defines MLPerf benchmarks.
To explore the latest performance of the NVIDIA platform across training, inference, and high-performance computing, see our deep learning product performance page.
Acknowledgements
NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We'd like to acknowledge the contributions of the following individuals (sorted by last name):
Tomar Bar-on, Nitin Sai Bommi, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Wookje Han, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Olivia Stoner, Jacob Subag, Tong Tong, Harshil Vagadia, Shobhit Verma, June Yang, Tailing Yuan, Ben Zhang, and many others across NVIDIA whose efforts made these results possible.
