Making Softmax More Efficient with NVIDIA Blackwell Ultra



LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI “speed of thought” is increasingly governed not by the raw throughput of matrix multiplications, but by the transcendental math of the softmax function.

Transcendental functions are those that cannot be expressed as the root of a polynomial equation with rational coefficients. As a result, they “transcend” basic algebraic operations like addition and multiplication, which are precisely the operations Tensor Cores excel at. In the context of softmax, the most computationally expensive of these transcendentals is the natural exponential function, which is executed on Special Function Units (SFUs). In NVIDIA assembly (SASS), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, where powerful matrix engines are forced to idle while waiting for the SFU datapaths to normalize attention scores.
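Hardware typically implements the natural exponential through the base-2 exponential that MUFU.EX2 provides, using the identity e^x = 2^(x·log2 e). A quick NumPy sketch of that lowering (illustrative only; the function name is ours):

```python
import numpy as np

LOG2E = np.log2(np.e)  # 1.4426950408889634

def exp_via_ex2(x):
    """Natural exponential built from a base-2 exponential,
    mirroring how e^x is lowered to MUFU.EX2 (2^x) on the GPU."""
    return np.exp2(x * LOG2E)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(exp_via_ex2(x), np.exp(x)))  # → True
```

The scale factor is folded into earlier arithmetic in real kernels, so the transcendental unit only ever has to compute 2^x.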

NVIDIA Blackwell Ultra alleviates this bottleneck by doubling SFU throughput over the base NVIDIA Blackwell architecture.

This blog dives into the mechanics of softmax within the attention loop, explores how Blackwell Ultra’s hardware optimizations eliminate pipeline stalls, and provides a benchmark so you can measure the raw MUFU.EX2 speedup for yourself.

How attention works

A foundational component of modern large language models is the attention mechanism, which allows a model to transform static token vectors into dynamic, context-aware representations. At its core, it is a means of re-weighting information by allowing tokens to adjust their importance to one another. To facilitate this interaction, every token in a sequence is projected into three functional roles:

  • Query: Represents what the current token is looking for to understand its own context. 
  • Key: Represents a token’s profile that others use for matching. Tokens earlier in the sequence have keys that signal their specific relevance to the query. 
  • Value: Holds the actual informational content. Once a match is confirmed between a query and a key, the Value is the data that is transferred to the original token.
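The three roles above can be sketched in a few lines of NumPy (toy dimensions and random projections, purely illustrative; none of these names come from an NVIDIA library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 6, 8                       # toy sizes
Q = rng.standard_normal((seq_len, d))   # queries: what each token looks for
K = rng.standard_normal((seq_len, d))   # keys: each token's matching profile
V = rng.standard_normal((seq_len, d))   # values: each token's content

scores = Q @ K.T / np.sqrt(d)   # query-key dot products (matrix math)
weights = softmax(scores)       # normalization (the transcendental step)
out = weights @ V               # pull in neighbors' values
print(out.shape)                # → (6, 8)
```

Each output row is a context-aware blend of the value vectors, weighted by how well that token’s query matched every key.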

Figure 1 below shows attention in action. We have two sentences that use the word “dog” with two different meanings. Initially, we can see that the embeddings (the numerical vectors that capture meaning and nuance in a multidimensional space) of both “dog” mentions are similar.

A GIF diagram showing how attention builds context by using previous tokens in the sequence to modify the embeddings on the current token.
Figure 1. Context building through attention

Attention operates by having the model calculate a dot product between the dog query and the keys of every other token in the sequence. 

If the query for “dog” aligns well with the key for “lazy,” it indicates a high degree of relevance. This interaction is what allows the word “dog” to pull in the specific value of its neighbor. By the end of this cycle, the original vector for “dog” has been physically updated with the content of its neighbors, evolving from a generic dictionary definition into a contextualized embedding that “understands” whether it refers to a lethargic animal or the sweltering peak of a season.

How softmax relates to attention

Softmax serves as the critical decision-making phase that converts raw compatibility scores into actionable weights. Once the initial dot products are calculated between queries and keys, the resulting scores are passed through the softmax function to be normalized into probabilities that sum to exactly one. This step determines the “attention span” of the model, effectively deciding which tokens to prioritize and which to ignore. Without softmax, the model would have no way to objectively weigh the information it gathers, resulting in an unmanageable and noisy mix of data.
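For concreteness, here is a minimal sketch of the normalization step on a few hypothetical query-key scores, including the max-subtraction trick that real kernels use for numerical stability:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability;
    # exp() is the transcendental step executed on the SFUs
    e = np.exp(scores - scores.max())
    return e / e.sum()

raw = np.array([4.0, 2.0, 0.5, -1.0])  # hypothetical raw compatibility scores
w = softmax(raw)
print(w.round(3))   # → [0.853 0.115 0.026 0.006]
print(w.sum())      # sums to 1 (up to float rounding)
```

The highest-scoring token dominates the weight distribution, which is exactly the “prioritize some tokens, ignore others” behavior described above.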

However, the softmax operation is the primary source of the “performance cliff” seen in long-context AI. Because every token in a sequence must be compared against every other token, a sequence of 8,192 tokens creates a massive [8,192 x 8,192] attention matrix. Normalizing this matrix requires billions of transcendental calculations, and the work grows quadratically with the sequence length. This creates a bottleneck, where the sheer volume of transcendental math can stall the entire inference pipeline. 
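The quadratic growth is easy to see with quick arithmetic: normalizing the score matrix takes one exponential per entry, per head, per layer:

```python
# One exp per entry of the [seq_len x seq_len] attention score matrix.
for seq_len in (2_048, 4_096, 8_192):
    print(f"{seq_len:>6} tokens -> {seq_len * seq_len:>12,} exponentials")
# Doubling the sequence length quadruples the exponential count;
# at 8,192 tokens that is 67,108,864 exps per head per layer.
```

Multiply by the number of heads and layers in a modern model and the total quickly reaches into the billions per forward pass.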

Blackwell Ultra focuses on accelerating these exponential calculations specifically to alleviate this mathematical bottleneck and ensure that the system can handle the massive normalization required for large context windows without sacrificing throughput.

Alleviating the softmax bottleneck in Blackwell Ultra

By doubling the throughput of the SFU for exponentials in the Blackwell Ultra architecture, NVIDIA alleviates this bottleneck and allows for a more balanced and efficient processing pipeline. This results in faster overall performance, especially for tasks that are heavy on attention mechanisms.

Figure 2 below illustrates the sequential dependency inherent in the standard attention mechanism, often called the attention loop, as run on the previous-generation NVIDIA Blackwell (GB200). Note that the Streaming Multiprocessor (SM) loads two thread blocks running attention loops concurrently. These separate attention loops are denoted in the two different shades of green.

This pipeline consists of three distinct phases that must execute in order:

  • BMM1 (score calculation): The Tensor Cores perform a matrix multiplication to calculate the raw attention scores, or logits.
  • Softmax (normalization): The pipeline shifts to the SFUs to normalize these scores into probabilities using exponential functions.
  • BMM2 (context aggregation): The pipeline returns to the Tensor Cores to multiply the probabilities by the value vectors.
A GIF diagram showing how the extended duration of the softmax phase creates a timing mismatch in the pipeline. That forces the high-speed Tensor Cores responsible for BMM1 and BMM2 to sit idle while waiting for the normalization step to complete.
Figure 2. The Blackwell attention loop

The timeline illustrates the latency constraints inherent in the Blackwell GPU during the execution of the attention kernel. Because the second matrix multiplication (BMM2) acts on the output of the softmax, it cannot begin until the normalization is complete. 

The lower throughput of the Blackwell GPU’s SFUs forces the Tensor Cores to idle between the score calculation (BMM1) and the context aggregation (BMM2). This dependency prevents the pipeline from fully saturating the compute resources and extends the duration of the softmax operation.

The next timeline, shown in Figure 3, demonstrates the direct impact of the doubled SFU throughput of the Blackwell Ultra GPUs in NVIDIA GB300 NVL72 and NVIDIA HGX B300 systems on the same instruction sequence.

Doubling the SFU throughput significantly shrinks the softmax execution time, closing the idle gaps between matrix operations and allowing the Tensor Cores to maintain near-peak utilization.
Figure 3. The Blackwell Ultra attention loop

Visually, the width of the softmax blocks is reduced by almost 50%, reflecting the hardware’s ability to process MUFU instructions at twice the speed.

This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 is drastically minimized, allowing the Tensor Cores to switch between the query-key multiplication and the probability-value multiplication with minimal stalling. The result is a denser main loop where the high-performance matrix engines spend a larger percentage of the total execution time active, directly translating to higher overall inference throughput.

Benchmarking MUFU.EX2 performance

To empirically confirm the theoretical throughput of the MUFU pipeline, we can construct a synthetic micro-benchmark. The following kernel code isolates the exponential instructions to measure the raw cycle count without interference from global memory latency or other arithmetic operations.

This test harness launches a grid of threads where each thread performs a dense loop of MUFU.EX2 instructions. By timing the execution and comparing it against the clock frequency, you can directly calculate the effective instruction throughput and validate the bandwidth saturation point mentioned earlier.
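Running the CUDA kernel itself requires Blackwell-class hardware, but the harness’s accounting, counting operations and dividing by elapsed time, can be sketched as a CPU analogue in NumPy (our own illustrative function, absolute numbers are in no way comparable to the GPU results):

```python
import time
import numpy as np

def exp2_throughput(n_elems=1_000_000, iters=50):
    """Time repeated exp2 over a buffer and report effective Gop/s,
    mirroring the ops/elapsed-time accounting of the CUDA harness."""
    x = np.random.standard_normal(n_elems).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        y = np.exp2(x)          # the operation under test
    elapsed = time.perf_counter() - start
    return (n_elems * iters) / elapsed / 1e9  # billions of exp2 ops per second

print(f"{exp2_throughput():.2f} Gop/s (CPU analogue)")
```

The CUDA benchmark applies the same ops-over-time logic, but times the kernel on-device so that only MUFU.EX2 issue rate, not memory traffic, determines the result.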

Step 1: Clone the following repository to pull the exp2-gb300.cu benchmark.

git clone https://github.com/jamieliNVIDIA/mufu_ex2_bench.git
cd mufu_ex2_bench

Step 2: Compile (using sm_103a for GB300 or sm_100a for GB200).

nvcc -O3 -gencode=arch=compute_103a,code=sm_103a --extended-lambda -o /tmp/exp2-gb300.out exp2-gb300.cu

Sample results

We see that GB300 delivers about 2x the FLOPs performance of GB200 for all tested data types, consistent with the doubled SFU throughput.

Blackwell (GB200)

exp2 BF16x2 2454 Gop/s (4908 GFLOPS)
exp2 BF16 4938 Gop/s
exp2 FP32 4943 Gop/s

Blackwell Ultra (GB300)

exp2 BF16x2 4996 Gop/s (9992 GFLOPS)
exp2 BF16 9738 Gop/s
exp2 FP32 10024 Gop/s

Attention forward propagation performance in Blackwell vs Blackwell Ultra

The transition from Blackwell to Blackwell Ultra delivers a targeted increase in compute throughput driven by a 2x increase in SFU performance. This hardware upgrade directly accelerates the forward propagation (FPROP) pipeline for models like DeepSeek-V3.

FPROP is the process where input data travels “forward” through the neural network (from the input layer, through the hidden layers, to the output layer) to generate a prediction. Each time the model produces a single new word, it must run one complete FPROP pass.

Figure 4 below shows that by doubling the throughput of the SFUs, the GB300 drastically reduces the execution time of the softmax layers within the attention blocks. This faster normalization means the GPU spends less time processing attention scores and more time utilizing the high-speed matrix engines for the next layer’s computation, directly increasing the overall speed of the forward pass.

A bar chart showing how GB300 demonstrates 1.35x end-to-end FPROP performance over GB200 in FP8.
Figure 4. GB300 vs GB200 FLOPS in forward propagation in a grouped query attention (GQA) model.

The benchmark results highlight a ~35% increase in FPROP throughput for FP8 operations. This gain is especially pronounced in FP8 because the matrix math is already extremely fast; in this low-precision regime, the time spent on softmax becomes a larger percentage of the total step.
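A back-of-envelope Amdahl’s law estimate connects the two numbers. The ~52% softmax fraction below is inferred from the measured speedup, not a published figure:

```python
def end_to_end_speedup(softmax_fraction, sfu_speedup=2.0):
    """Amdahl's-law estimate: only the softmax fraction of the step
    benefits from the doubled SFU throughput."""
    return 1.0 / ((1.0 - softmax_fraction) + softmax_fraction / sfu_speedup)

# A ~1.35x end-to-end gain from a 2x softmax speedup implies softmax
# consumed roughly half of the FP8 step time on GB200:
print(round(end_to_end_speedup(0.52), 2))  # → 1.35
```

The same arithmetic explains why higher-precision runs see a smaller end-to-end gain: slower matrix math shrinks the softmax fraction, leaving less for the SFU speedup to reclaim.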

Getting began

The performance dynamics of DeepSeek-V3 on Blackwell Ultra highlight a critical but often overlooked bottleneck in inference: the computational cost of non-linear operations.

By optimizing and compressing the attention mechanism, state-of-the-art models effectively increase the density of softmax operations relative to standard linear computations, exposing the SFUs as a governor of total throughput.

Blackwell Ultra directly addresses this bottleneck. By doubling the throughput of these specialized units, Blackwell Ultra unblocks the transcendental traffic jam that previously forced the powerful Tensor Cores to idle. The benchmark results confirm the impact, demonstrating a 35% gain in FP8 forward propagation. 

For modern, highly optimized architectures, the path to faster inference isn’t just about faster Tensor Cores; it’s also about ensuring the non-linear math units are fast enough to keep up.

Visit NVIDIA’s trtllm-gen repository for more benchmarks and data on using this SFU speedup in workloads. Doubling the throughput of the SFUs for MUFU.EX2 is just one of many features that enable Blackwell Ultra’s fast attention speed. NVIDIA’s extreme hardware-software codesign accelerates the full attention loop through technologies such as: 

  • Offloading critical “find-max” reductions to the Tensor Memory controller via LDTM.STAT.
  • Optimizing performance using cuDNN.
  • Optimizing KVCache data movements using NVFP4.

Stay tuned to the NVIDIA technical blog for future posts.

Acknowledgements

Special thanks to the cuDNN engineering team for creating the benchmarks and building the software optimizations that make this breakthrough performance possible.


