Making Softmax More Efficient with NVIDIA Blackwell Ultra



LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI “speed of thought” is increasingly governed not by the raw throughput of matrix multiplications, but by the transcendental math of the softmax function.

Transcendental functions are those that cannot be expressed as the root of a polynomial equation with rational coefficients. As a result, they “transcend” basic algebraic operations like addition and multiplication, which are precisely the operations Tensor Cores excel at. In the context of softmax, the most computationally expensive of these transcendentals is the natural exponential function, which is executed on Special Function Units (SFUs). In NVIDIA assembly (SASS), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, where powerful matrix engines are forced to idle while waiting for the SFU datapaths to normalize attention scores.
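Hardware typically implements the natural exponential through the base-2 exponential that MUFU.EX2 provides, using the identity e^x = 2^(x·log2 e). A quick NumPy sketch of that lowering (illustrative only; the function name is ours):

```python
import numpy as np

LOG2E = np.log2(np.e)  # 1.4426950408889634

def exp_via_ex2(x):
    """Natural exponential built from a base-2 exponential,
    mirroring how e^x is lowered to MUFU.EX2 (2^x) on the GPU."""
    return np.exp2(x * LOG2E)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(exp_via_ex2(x), np.exp(x)))  # → True
```

The scale factor is folded into earlier arithmetic in real kernels, so the transcendental unit only ever has to compute 2^x.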

NVIDIA Blackwell Ultra alleviates this bottleneck by doubling SFU throughput over the base NVIDIA Blackwell architecture.

This blog dives into the mechanics of softmax within the attention loop, explores how Blackwell Ultra’s hardware optimizations eliminate pipeline stalls, and provides a benchmark so you can measure the raw MUFU.EX2 speedup for yourself.

How attention works

A foundational component of modern large language models is the attention mechanism, which allows a model to transform static token vectors into dynamic, context-aware representations. At its core, it is a means of re-weighting information by allowing tokens to adjust their importance to one another. To facilitate this interaction, every token in a sequence is projected into three functional roles:

  • Query: Represents what the current token is looking for to understand its own context. 
  • Key: Represents a token’s profile that others use for matching. Tokens earlier in the sequence have keys that signal their specific relevance to the query. 
  • Value: Holds the actual informational content. Once a match is confirmed between a query and a key, the Value is the data that is transferred to the original token.
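The three roles above can be sketched in a few lines of NumPy (toy dimensions and random projections, purely illustrative; none of these names come from an NVIDIA library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 6, 8                       # toy sizes
Q = rng.standard_normal((seq_len, d))   # queries: what each token looks for
K = rng.standard_normal((seq_len, d))   # keys: each token's matching profile
V = rng.standard_normal((seq_len, d))   # values: each token's content

scores = Q @ K.T / np.sqrt(d)   # query-key dot products (matrix math)
weights = softmax(scores)       # normalization (the transcendental step)
out = weights @ V               # pull in neighbors' values
print(out.shape)                # → (6, 8)
```

Each output row is a context-aware blend of the value vectors, weighted by how well that token’s query matched every key.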

Figure 1 below shows attention in action. We have two sentences that use the word “dog” with two different meanings. Initially, we can see that the embeddings (the numerical vectors that capture meaning and nuance in a multidimensional space) of both “dog” mentions are similar.

A GIF diagram showing how attention builds context by using previous tokens in the sequence to modify the embeddings on the current token.
Figure 1. Context building through attention

Attention operates by having the model calculate a dot product between the dog query and the keys of every other token in the sequence. 

If the query for “dog” aligns well with the key for “lazy,” it indicates a high degree of relevance. This interaction is what allows the word “dog” to pull in the specific value of its neighbor. By the end of this cycle, the original vector for “dog” has been physically updated with the content of its neighbors, evolving from a generic dictionary definition into a contextualized embedding that “understands” whether it refers to a lethargic animal or the sweltering peak of a season.

How softmax relates to attention

Softmax serves as the critical decision-making phase that converts raw compatibility scores into actionable weights. Once the initial dot products are calculated between queries and keys, the resulting scores are passed through the softmax function to be normalized into probabilities that sum to exactly one. This step determines the “attention span” of the model, effectively deciding which tokens to prioritize and which to ignore. Without softmax, the model would have no way to objectively weigh the information it gathers, resulting in an unmanageable and noisy mix of data.
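For concreteness, here is a minimal sketch of the normalization step on a few hypothetical query-key scores, including the max-subtraction trick that real kernels use for numerical stability:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability;
    # exp() is the transcendental step executed on the SFUs
    e = np.exp(scores - scores.max())
    return e / e.sum()

raw = np.array([4.0, 2.0, 0.5, -1.0])  # hypothetical raw compatibility scores
w = softmax(raw)
print(w.round(3))   # → [0.853 0.115 0.026 0.006]
print(w.sum())      # sums to 1 (up to float rounding)
```

The highest-scoring token dominates the weight distribution, which is exactly the “prioritize some tokens, ignore others” behavior described above.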

However, the softmax operation is the primary source of the “performance cliff” seen in long-context AI. Because every token in a sequence must be compared against every other token, a sequence of 8,192 tokens creates a massive [8,192 x 8,192] attention matrix. Normalizing this matrix requires billions of transcendental calculations, and the work grows quadratically with the sequence length. This creates a bottleneck, where the sheer volume of transcendental math can stall the entire inference pipeline. 
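The quadratic growth is easy to see with quick arithmetic: normalizing the score matrix takes one exponential per entry, per head, per layer:

```python
# One exp per entry of the [seq_len x seq_len] attention score matrix.
for seq_len in (2_048, 4_096, 8_192):
    print(f"{seq_len:>6} tokens -> {seq_len * seq_len:>12,} exponentials")
# Doubling the sequence length quadruples the exponential count;
# at 8,192 tokens that is 67,108,864 exps per head per layer.
```

Multiply by the number of heads and layers in a modern model and the total quickly reaches into the billions per forward pass.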

Blackwell Ultra focuses on accelerating these exponential calculations specifically to alleviate this mathematical bottleneck and ensure that the system can handle the massive normalization required for large context windows without sacrificing throughput.

Alleviating the softmax bottleneck in Blackwell Ultra

By doubling the throughput of the SFU for exponentials in the Blackwell Ultra architecture, NVIDIA alleviates this bottleneck and allows for a more balanced and efficient processing pipeline. This results in faster overall performance, especially for tasks that are heavy on attention mechanisms.

Figure 2 below illustrates the sequential dependency inherent in the standard attention mechanism, often called the attention loop, as run on the previous-generation NVIDIA Blackwell (GB200). Note that the Streaming Multiprocessor (SM) loads two thread blocks running attention loops concurrently. These separate attention loops are denoted in the two different shades of green.

This pipeline consists of three distinct phases that must execute in order:

  • BMM1 (score calculation): The Tensor Cores perform a matrix multiplication to calculate the raw attention scores, or logits.
  • Softmax (normalization): The pipeline shifts to the SFUs to normalize these scores into probabilities using exponential functions.
  • BMM2 (context aggregation): The pipeline returns to the Tensor Cores to multiply the probabilities by the value vectors.
A GIF diagram showing how the extended duration of the softmax phase creates a timing mismatch in the pipeline. That forces the high-speed Tensor Cores responsible for BMM1 and BMM2 to sit idle while waiting for the normalization step to complete.
Figure 2. The Blackwell attention loop

The timeline illustrates the latency constraints inherent in the Blackwell GPU during the execution of the attention kernel. Because the second matrix multiplication (BMM2) acts on the output of the softmax, it cannot begin until the normalization is complete. 

The lower throughput of the Blackwell GPU’s SFUs forces the Tensor Cores to idle between the score calculation (BMM1) and the context aggregation (BMM2). This dependency prevents the pipeline from fully saturating the compute resources and extends the duration of the softmax operation.

The next timeline, shown in Figure 3, demonstrates the direct impact of the doubled SFU throughput of the Blackwell Ultra GPUs in NVIDIA GB300 NVL72 and NVIDIA HGX B300 systems on the same instruction sequence.

Doubling the SFU throughput significantly shrinks the softmax execution time, closing the idle gaps between matrix operations and allowing the Tensor Cores to maintain near-peak utilization.
Figure 3. The Blackwell Ultra attention loop

Visually, the width of the softmax blocks is reduced by almost 50%, reflecting the hardware’s ability to process MUFU instructions at twice the speed.

This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 is drastically minimized, allowing the Tensor Cores to switch between the query-key multiplication and the probability-value multiplication with minimal stalling. The result is a denser main loop where the high-performance matrix engines spend a larger percentage of the total execution time active, directly translating to higher overall inference throughput.

Benchmarking MUFU.EX2 performance

To empirically confirm the theoretical throughput of the MUFU pipeline, we can construct a synthetic micro-benchmark. The following kernel code isolates the exponential instructions to measure the raw cycle count without interference from global memory latency or other arithmetic operations.

This test harness launches a grid of threads where each thread performs a dense loop of MUFU.EX2 instructions. By timing the execution and comparing it against the clock frequency, you can directly calculate the effective instruction throughput and validate the bandwidth saturation point mentioned earlier.
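Running the CUDA kernel itself requires Blackwell-class hardware, but the harness’s accounting, counting operations and dividing by elapsed time, can be sketched as a CPU analogue in NumPy (our own illustrative function, absolute numbers are in no way comparable to the GPU results):

```python
import time
import numpy as np

def exp2_throughput(n_elems=1_000_000, iters=50):
    """Time repeated exp2 over a buffer and report effective Gop/s,
    mirroring the ops/elapsed-time accounting of the CUDA harness."""
    x = np.random.standard_normal(n_elems).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        y = np.exp2(x)          # the operation under test
    elapsed = time.perf_counter() - start
    return (n_elems * iters) / elapsed / 1e9  # billions of exp2 ops per second

print(f"{exp2_throughput():.2f} Gop/s (CPU analogue)")
```

The CUDA benchmark applies the same ops-over-time logic, but times the kernel on-device so that only MUFU.EX2 issue rate, not memory traffic, determines the result.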

Step 1: Clone the following repository to pull the exp2-gb300.cu benchmark.

git clone https://github.com/jamieliNVIDIA/mufu_ex2_bench.git
cd mufu_ex2_bench

Step 2: Compile (using sm_103a for GB300 or sm_100a for GB200).

nvcc -O3 -gencode=arch=compute_103a,code=sm_103a --extended-lambda -o /tmp/exp2-gb300.out exp2-gb300.cu

Sample results

We see that GB300 delivers about 2x the FLOPs performance of GB200 for all tested data types, consistent with the doubled SFU throughput.

Blackwell (GB200)

exp2 BF16x2 2454 Gop/s (4908 GFLOPS)
exp2 BF16 4938 Gop/s
exp2 FP32 4943 Gop/s

Blackwell Ultra (GB300)

exp2 BF16x2 4996 Gop/s (9992 GFLOPS)
exp2 BF16 9738 Gop/s
exp2 FP32 10024 Gop/s

Attention forward propagation performance in Blackwell vs Blackwell Ultra

The transition from Blackwell to Blackwell Ultra delivers a targeted increase in compute throughput driven by a 2x increase in SFU performance. This hardware upgrade directly accelerates the forward propagation (FPROP) pipeline for models like DeepSeek-V3.

FPROP is the process where input data travels “forward” through the neural network (from the input layer, through the hidden layers, to the output layer) to generate a prediction. Each time the model produces a single new word, it must run one complete FPROP pass.

Figure 4 below shows that by doubling the throughput of the SFUs, the GB300 drastically reduces the execution time of the softmax layers within the attention blocks. This faster normalization means the GPU spends less time processing attention scores and more time utilizing the high-speed matrix engines for the next layer’s computation, directly increasing the overall speed of the forward pass.

A bar chart showing how GB300 demonstrates 1.35x end-to-end FPROP performance over GB200 in FP8.
Figure 4. GB300 vs GB200 FLOPS in forward propagation in a grouped query attention (GQA) model.

The benchmark results highlight a ~35% increase in FPROP throughput for FP8 operations. This gain is especially pronounced in FP8 because the matrix math is already extremely fast; in this low-precision regime, the time spent on softmax becomes a larger percentage of the total step.
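A back-of-envelope Amdahl’s law estimate connects the two numbers. The ~52% softmax fraction below is inferred from the measured speedup, not a published figure:

```python
def end_to_end_speedup(softmax_fraction, sfu_speedup=2.0):
    """Amdahl's-law estimate: only the softmax fraction of the step
    benefits from the doubled SFU throughput."""
    return 1.0 / ((1.0 - softmax_fraction) + softmax_fraction / sfu_speedup)

# A ~1.35x end-to-end gain from a 2x softmax speedup implies softmax
# consumed roughly half of the FP8 step time on GB200:
print(round(end_to_end_speedup(0.52), 2))  # → 1.35
```

The same arithmetic explains why higher-precision runs see a smaller end-to-end gain: slower matrix math shrinks the softmax fraction, leaving less for the SFU speedup to reclaim.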

Getting began

The performance dynamics of DeepSeek-V3 on Blackwell Ultra highlight a critical but often overlooked bottleneck in inference: the computational cost of non-linear operations.

By optimizing and compressing the attention mechanism, state-of-the-art models effectively increase the density of softmax operations relative to standard linear computations, exposing the SFUs as a governor of total throughput.

Blackwell Ultra directly addresses this bottleneck. By doubling the throughput of these specialized units, Blackwell Ultra unblocks the transcendental traffic jam that previously forced the powerful Tensor Cores to idle. The benchmark results confirm the impact, demonstrating a 35% gain in FP8 forward propagation. 

For modern, highly optimized architectures, the path to faster inference isn’t just about faster Tensor Cores; it’s also about ensuring the non-linear math units are fast enough to keep up.

Visit NVIDIA’s trtllm-gen repository for more benchmarks and data on using this SFU speedup in workloads. Doubling the throughput of the SFUs for MUFU.EX2 is just one of many features that enable Blackwell Ultra’s fast attention speed. NVIDIA’s extreme hardware-software codesign accelerates the full attention loop through technologies such as: 

  • Offloading critical “find-max” reductions to the Tensor Memory controller via LDTM.STAT.
  • Optimizing performance using cuDNN.
  • Optimizing KVCache data movements using NVFP4.

Stay tuned to the NVIDIA technical blog for future posts.

Acknowledgements

Special thanks to the cuDNN engineering team for creating the benchmarks and building the software optimizations that make this breakthrough performance possible.


