Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell



As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users, from consumers to enterprises, to interact with AI more frequently, meaning that more tokens must be generated. To serve these tokens at the lowest possible cost, AI platforms must deliver the best possible token throughput per watt. 

Through extreme co-design across GPUs, CPUs, networking, software, power delivery, and cooling, NVIDIA continues to drive up token throughput per watt, which reduces cost per million tokens.

Moreover, NVIDIA continues to strengthen its software stacks to extract even greater performance from existing platforms. This increases the value of the large installed base of NVIDIA GPUs across cloud service providers (CSPs), GPU clouds, model builders, enterprises, and others, enabling that infrastructure to stay productive for longer. 

In this post, we show how recent updates to the NVIDIA inference software stack running on the NVIDIA Blackwell architecture, along with use of the full capabilities available within the stack, are enabling large performance gains across several scenarios on DeepSeek-R1, a state-of-the-art sparse mixture-of-experts (MoE) reasoning model.

Latest NVIDIA TensorRT-LLM software boosts reasoning inference performance

The NVIDIA GB200 NVL72 rack-scale platform connects 72 NVIDIA Blackwell GPUs using fifth-generation NVIDIA NVLink interconnect and NVLink Switch chips, providing 1,800 GB/s of bidirectional bandwidth to each GPU in the rack. This massive scale-up domain is optimized for models based on sparse MoE architectures, which require frequent exchanges of information between experts to generate tokens. 

The Blackwell architecture also incorporates hardware acceleration for the NVFP4 data format, an NVIDIA-designed four-bit floating point format that better preserves accuracy compared with alternative FP4 formats. In addition, optimizations like disaggregated serving, which performs prefill operations on one set of GPUs and decode operations on a different set, also take advantage of the NVL72 architecture and NVLink Switch technology.

These architectural innovations enable NVIDIA GB200 NVL72 to deliver industry-leading performance on the latest open models, including DeepSeek-R1, a 671 billion-parameter sparse MoE model that activates 37 billion parameters for each token.
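To put that sparsity in perspective, here is a back-of-the-envelope calculation based only on the parameter counts quoted above; it is a rough illustration of the active fraction per token, not a statement about actual FLOP counts.

```python
# Back-of-the-envelope illustration of DeepSeek-R1's MoE sparsity,
# using only the parameter counts quoted above (not a model implementation).
total_params = 671e9   # total parameters in the model
active_params = 37e9   # parameters activated per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # ~5.5%
print(f"Roughly {total_params / active_params:.0f}x fewer weights are touched "
      "per token than in a dense model of the same total size")
```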

[Chart: interactivity (x-axis) versus throughput per GPU (y-axis), 8K input / 1K output sequence length, comparing GB200 NVL72 with October 2025 software (gray) and January 2026 software (green), the latter higher across the curve; both use NVFP4 precision.]
Figure 1. GB200 NVL72 DeepSeek-R1 token throughput using 8K/1K sequence length has increased substantially with the newest NVIDIA TensorRT-LLM software.

GB200 NVL72 had previously demonstrated leading per-GPU throughput on DeepSeek-R1 across the throughput/interactivity curves for both 1K/1K and 8K/1K input/output sequence lengths.

[Chart: interactivity (x-axis) versus throughput per GPU (y-axis), 1K input / 1K output sequence length, comparing GB200 NVL72 with October 2025 software (gray) and January 2026 software (green), the latter higher across the curve; both use NVFP4 precision.]
Figure 2. GB200 NVL72 DeepSeek-R1 token throughput using 1K/1K sequence length has increased substantially with the newest NVIDIA TensorRT-LLM software.

The latest enhancements to the NVIDIA TensorRT-LLM open source library for optimizing LLM inference dramatically accelerate performance on the same platform, with the throughput of each Blackwell GPU increasing by as much as 2.8x in the past three months. 

The optimizations behind these results include:

  • Expanded use of NVIDIA Programmatic Dependent Launch (PDL) to reduce kernel launch latencies, helping to increase throughput across the range of interactivity levels
  • Many low-level kernel optimizations to more efficiently utilize NVIDIA Blackwell Tensor Cores
  • A newly optimized implementation of all-to-all communication primitives that eliminates an extra intermediate buffer on the receiver side

TensorRT-LLM provides a high-level Python LLM API. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. These optimizations are available today in the latest version of TensorRT-LLM. 
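As a rough sketch of how that Python LLM API is typically used (the model identifier and sampling settings below are illustrative placeholders, and exact option names can vary between TensorRT-LLM releases):

```python
# Minimal sketch of the TensorRT-LLM high-level Python LLM API.
# The checkpoint ID and sampling settings are illustrative only.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="deepseek-ai/DeepSeek-R1")  # placeholder model ID or local path
    params = SamplingParams(max_tokens=256, temperature=0.7)

    outputs = llm.generate(
        ["Explain why MoE models benefit from a large NVLink scale-up domain."],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```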

Accelerating NVIDIA HGX B200 performance with multi-token prediction and NVFP4

The NVIDIA HGX B200 platform, which consists of eight Blackwell GPUs connected using the fifth-generation NVLink interconnect and NVLink Switch, also achieves outstanding DeepSeek-R1 inference performance for air-cooled deployments. 

Two key technologies enable very large DeepSeek-R1 inference performance increases on HGX B200. The first is the use of multi-token prediction (MTP), which provides a large increase in throughput across the range of interactivity levels. This is observed across all three tested input/output sequence combinations.

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 3. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/1K sequence length and aggregated serving.
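Conceptually, MTP behaves like speculative decoding: additional prediction heads draft several future tokens, and the main model verifies them so that each decode step can emit more than one token. The sketch below models that effect with made-up acceptance rates; it is illustrative only and is not how TensorRT-LLM computes anything.

```python
# Conceptual sketch of why multi-token prediction (MTP) raises decode
# throughput: each decode step emits 1 verified token plus however many
# draft tokens are accepted. Acceptance rates here are illustrative only.
def expected_tokens_per_step(num_draft_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decode step, assuming each draft token
    is accepted with the given probability and acceptance stops at the
    first rejection."""
    expected = 1.0  # the token produced by the verified forward pass
    p = 1.0
    for _ in range(num_draft_tokens):
        p *= acceptance_rate
        expected += p
    return expected

# Example: 3 draft tokens with a hypothetical 80% per-token acceptance rate
# yields roughly 2.95 tokens per decode step instead of 1.
print(expected_tokens_per_step(3, 0.8))
```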

The second is the use of NVFP4, which takes full advantage of the substantial compute capabilities available within the Blackwell GPU to boost performance while preserving accuracy.

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 4. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence length and aggregated serving.

NVFP4 is enabled by the full NVIDIA software stack, including TensorRT-LLM and NVIDIA TensorRT Model Optimizer, to ensure both high performance and preservation of accuracy. This provides yet another large throughput boost at a given interactivity level, and again makes even higher interactivity levels possible on the same HGX B200 platform.
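As a hedged sketch of what NVFP4 post-training quantization with TensorRT Model Optimizer might look like (the NVFP4_DEFAULT_CFG constant, the model identifier, and the calibration loop are assumptions based on the library's usual PTQ pattern; verify against the Model Optimizer documentation for your version):

```python
# Hedged sketch: post-training NVFP4 quantization with NVIDIA TensorRT
# Model Optimizer. The config constant and calibration flow are assumed
# from the library's standard PTQ workflow; check your installed version.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"  # illustrative placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["A few representative prompts for calibration"]  # placeholder data

def forward_loop(m):
    # Run a small calibration set through the model so quantizer scales
    # can be collected before weights are converted to NVFP4.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt")
        m(**inputs)

# Quantize to NVFP4, then export the checkpoint for TensorRT-LLM serving.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```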

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 5. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/8K sequence length and aggregated serving.

By leveraging the full capabilities of the NVIDIA Blackwell platform, LLMs can serve more users and deliver significantly better experiences to each of those users.

Delivering continuous performance gains

Through relentless optimization, NVIDIA continues to deliver better performance across the entire technology stack. This drives up token throughput across the full range of AI models, both through an annual product cadence and through continued workload optimization that delivers more performance and value from existing products. 

The NVIDIA Blackwell architecture delivers industry-leading inference performance, and with the latest software innovations in TensorRT-LLM, NVIDIA is delivering yet another big inference boost for customers, partners, and the AI ecosystem at large. 

Visit the NVIDIA Data Center Deep Learning Product Performance page to learn more about the industry-leading performance delivered by the NVIDIA full-stack platform. 


