Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell



As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users, from consumers to enterprises, to interact with AI more frequently, meaning that more tokens must be generated. To serve these tokens at the lowest possible cost, AI platforms must deliver the best possible token throughput per watt. 

Through extreme co-design across GPUs, CPUs, networking, software, power delivery, and cooling, NVIDIA continues to drive up token throughput per watt, which reduces cost per million tokens.

Moreover, NVIDIA continues to strengthen its software stacks to extract even greater performance from existing platforms. This increases the value of the large installed base of NVIDIA GPUs across cloud service providers (CSPs), GPU clouds, model builders, enterprises, and others, enabling that infrastructure to stay productive for longer. 

In this post, we show how recent updates to the NVIDIA inference software stack running on the NVIDIA Blackwell architecture, along with use of the full capabilities available within the stack, are enabling large performance gains across several scenarios on DeepSeek-R1, a state-of-the-art sparse mixture-of-experts (MoE) reasoning model.

Latest NVIDIA TensorRT-LLM software boosts reasoning inference performance

The NVIDIA GB200 NVL72 rack-scale platform connects 72 NVIDIA Blackwell GPUs using fifth-generation NVIDIA NVLink interconnect and NVLink Switch chips, providing 1,800 GB/s of bidirectional bandwidth to each GPU in the rack. This massive scale-up domain is optimized for models based on sparse MoE architectures, which require frequent exchanges of information between experts to generate tokens. 

The Blackwell architecture also incorporates hardware acceleration for the NVFP4 data format, an NVIDIA-designed four-bit floating point format that better preserves accuracy compared with alternative FP4 formats. In addition, optimizations like disaggregated serving, which performs prefill operations on one set of GPUs and decode operations on a different set, also take advantage of the NVL72 architecture and NVLink Switch technology.

These architectural innovations enable NVIDIA GB200 NVL72 to deliver industry-leading performance on the latest open models, including DeepSeek-R1, a 671 billion-parameter sparse MoE model that activates 37 billion parameters for each token.
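To put that sparsity in perspective, here is a back-of-the-envelope calculation based only on the parameter counts quoted above; it is a rough illustration of the active fraction per token, not a statement about actual FLOP counts.

```python
# Back-of-the-envelope illustration of DeepSeek-R1's MoE sparsity,
# using only the parameter counts quoted above (not a model implementation).
total_params = 671e9   # total parameters in the model
active_params = 37e9   # parameters activated per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # ~5.5%
print(f"Roughly {total_params / active_params:.0f}x fewer weights are touched "
      "per token than in a dense model of the same total size")
```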

[Chart: interactivity (x-axis) versus throughput per GPU (y-axis), 8K input / 1K output sequence length, comparing GB200 NVL72 with October 2025 software (gray) and January 2026 software (green), the latter higher across the curve; both use NVFP4 precision.]
Figure 1. GB200 NVL72 DeepSeek-R1 token throughput using 8K/1K sequence length has increased substantially with the newest NVIDIA TensorRT-LLM software.

GB200 NVL72 had previously demonstrated leading per-GPU throughput on DeepSeek-R1 across the throughput/interactivity curves for both 1K/1K and 8K/1K input/output sequence lengths.

[Chart: interactivity (x-axis) versus throughput per GPU (y-axis), 1K input / 1K output sequence length, comparing GB200 NVL72 with October 2025 software (gray) and January 2026 software (green), the latter higher across the curve; both use NVFP4 precision.]
Figure 2. GB200 NVL72 DeepSeek-R1 token throughput using 1K/1K sequence length has increased substantially with the newest NVIDIA TensorRT-LLM software.

The latest enhancements to the NVIDIA TensorRT-LLM open source library for optimizing LLM inference dramatically accelerate performance on the same platform, with the throughput of each Blackwell GPU increasing by as much as 2.8x in the past three months. 

The optimizations behind these results include:

  • Expanded use of NVIDIA Programmatic Dependent Launch (PDL) to reduce kernel launch latencies, helping to increase throughput across the range of interactivity levels
  • Many low-level kernel optimizations to more efficiently utilize NVIDIA Blackwell Tensor Cores
  • A newly optimized implementation of all-to-all communication primitives that eliminates an extra intermediate buffer on the receiver side

TensorRT-LLM provides a high-level Python LLM API. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. These optimizations are available today in the latest version of TensorRT-LLM. 
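As a rough sketch of how that Python LLM API is typically used (the model identifier and sampling settings below are illustrative placeholders, and exact option names can vary between TensorRT-LLM releases):

```python
# Minimal sketch of the TensorRT-LLM high-level Python LLM API.
# The checkpoint ID and sampling settings are illustrative only.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="deepseek-ai/DeepSeek-R1")  # placeholder model ID or local path
    params = SamplingParams(max_tokens=256, temperature=0.7)

    outputs = llm.generate(
        ["Explain why MoE models benefit from a large NVLink scale-up domain."],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```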

Accelerating NVIDIA HGX B200 performance with multi-token prediction and NVFP4

The NVIDIA HGX B200 platform, which consists of eight Blackwell GPUs connected using the fifth-generation NVLink interconnect and NVLink Switch, also achieves outstanding DeepSeek-R1 inference performance for air-cooled deployments. 

Two key technologies enable very large DeepSeek-R1 inference performance increases on HGX B200. The first is the use of multi-token prediction (MTP), which provides a large increase in throughput across the range of interactivity levels. This is observed across all three tested input/output sequence combinations.

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 3. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/1K sequence length and aggregated serving.
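Conceptually, MTP behaves like speculative decoding: additional prediction heads draft several future tokens, and the main model verifies them so that each decode step can emit more than one token. The sketch below models that effect with made-up acceptance rates; it is illustrative only and is not how TensorRT-LLM computes anything.

```python
# Conceptual sketch of why multi-token prediction (MTP) raises decode
# throughput: each decode step emits 1 verified token plus however many
# draft tokens are accepted. Acceptance rates here are illustrative only.
def expected_tokens_per_step(num_draft_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decode step, assuming each draft token
    is accepted with the given probability and acceptance stops at the
    first rejection."""
    expected = 1.0  # the token produced by the verified forward pass
    p = 1.0
    for _ in range(num_draft_tokens):
        p *= acceptance_rate
        expected += p
    return expected

# Example: 3 draft tokens with a hypothetical 80% per-token acceptance rate
# yields roughly 2.95 tokens per decode step instead of 1.
print(expected_tokens_per_step(3, 0.8))
```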

The second is the use of NVFP4, which takes full advantage of the substantial compute capabilities available within the Blackwell GPU to boost performance while preserving accuracy.

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 4. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence length and aggregated serving.

NVFP4 is enabled by the full NVIDIA software stack, including TensorRT-LLM and NVIDIA TensorRT Model Optimizer, to ensure both high performance and preservation of accuracy. This provides yet another large throughput boost at a given interactivity level, and again makes even higher interactivity levels possible on the same HGX B200 platform.
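As a hedged sketch of what NVFP4 post-training quantization with TensorRT Model Optimizer might look like (the NVFP4_DEFAULT_CFG constant, the model identifier, and the calibration loop are assumptions based on the library's usual PTQ pattern; verify against the Model Optimizer documentation for your version):

```python
# Hedged sketch: post-training NVFP4 quantization with NVIDIA TensorRT
# Model Optimizer. The config constant and calibration flow are assumed
# from the library's standard PTQ workflow; check your installed version.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"  # illustrative placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["A few representative prompts for calibration"]  # placeholder data

def forward_loop(m):
    # Run a small calibration set through the model so quantizer scales
    # can be collected before weights are converted to NVFP4.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt")
        m(**inputs)

# Quantize to NVFP4, then export the checkpoint for TensorRT-LLM serving.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```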

[Chart: per-user interactivity (x-axis) versus token throughput per GPU (y-axis); moving from FP8 with MTP off (light gray) to FP8 with MTP on (darker gray) to NVFP4 with MTP on (green), the curves continue to shift to the right, indicating more throughput at a given interactivity level and enabling higher peak interactivity.]
Figure 5. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/8K sequence length and aggregated serving.

By leveraging the full capabilities of the NVIDIA Blackwell platform, LLMs can serve more users and deliver significantly better experiences to each of those users.

Delivering continuous performance gains

Through relentless optimization, NVIDIA continues to deliver better performance across the entire technology stack. This drives up token throughput across the full range of AI models, both through an annual product cadence and through continued workload optimization that delivers more performance and value from existing products. 

The NVIDIA Blackwell architecture delivers industry-leading inference performance, and with the latest software innovations in TensorRT-LLM, NVIDIA is delivering yet another big inference boost for customers, partners, and the AI ecosystem at large. 

Visit the NVIDIA Data Center Deep Learning Product Performance page to learn more about the industry-leading performance delivered by the NVIDIA full-stack platform. 


