The transformer architecture has become a foundational breakthrough driving the revolution in generative AI, powering large language models (LLMs) like GPT, DeepSeek, and Llama. The key to the transformer architecture is the self-attention mechanism, which enables models to process an entire input sequence at once rather than word by word. This parallelism enables the capture of long-range dependencies.
While the self-attention mechanism is powerful, its computational and memory complexity grows quadratically with sequence length. This creates a memory bottleneck when dealing with the long context windows of modern LLMs.
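To make the bottleneck concrete, here is a minimal NumPy sketch of standard single-head attention (the function name and shapes are illustrative, not taken from any particular library). The intermediate score matrix has shape N×N, which is where the quadratic cost comes from.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard single-head attention; materializes an N x N score matrix."""
    N, d = Q.shape
    S = (Q @ K.T) / np.sqrt(d)                   # N x N scores: the quadratic term
    P = np.exp(S - S.max(axis=1, keepdims=True)) # numerically stable softmax
    P /= P.sum(axis=1, keepdims=True)            # row-wise normalization
    return P @ V                                 # N x d output
```

At a 128K-token context, that score matrix alone holds roughly 17 billion entries per attention head, on the order of 32 GB in FP16, which is why simply materializing it does not scale.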
In this post, we'll discuss FlashAttention, an algorithmic breakthrough that mitigates this bottleneck by reducing computational and memory overhead.
What is FlashAttention?
FlashAttention is an input/output-aware (IO-aware) algorithm that computes the same mathematical result as standard attention, but more efficiently. FlashAttention achieves this with:
- Reduced memory access that minimizes the slow transfer of data between a GPU’s main high-bandwidth memory (HBM) and the faster but much smaller on-chip static random access memory (SRAM). It achieves this by combining computational steps (like matrix multiplication and softmax) into a single optimized GPU kernel, a technique called kernel fusion.
- Near-linear memory use through techniques such as tiling (breaking the computation into smaller blocks) and online softmax (normalizing the distribution incrementally). Together these reduce the memory complexity from O(N²) to O(N) with respect to sequence length N (see the sketch after this list).
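To show how these two ideas fit together, below is a minimal NumPy sketch of blockwise attention with an online softmax. It is a readability-oriented reference, not the fused CUDA kernel; the function name and block sizes are illustrative choices.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_kv=64):
    """Blockwise attention with online softmax: no N x N matrix is formed."""
    N, d = Q.shape
    out = np.zeros((N, d))
    for qs in range(0, N, block_q):
        Qb = Q[qs:qs + block_q] / np.sqrt(d)     # one query tile (pre-scaled)
        m = np.full(Qb.shape[0], -np.inf)        # running row max
        l = np.zeros(Qb.shape[0])                # running softmax denominator
        acc = np.zeros((Qb.shape[0], d))         # running weighted sum of V
        for ks in range(0, N, block_kv):
            Kb, Vb = K[ks:ks + block_kv], V[ks:ks + block_kv]
            S = Qb @ Kb.T                        # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))
            corr = np.exp(m - m_new)             # rescale earlier partial results
            P = np.exp(S - m_new[:, None])       # unnormalized tile probabilities
            l = l * corr + P.sum(axis=1)
            acc = acc * corr[:, None] + P @ Vb
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]
    return out
```

The output matches the standard computation up to floating-point rounding, but the working set at each step is only one (block_q × block_kv) tile of scores plus a few per-row statistics, which is what brings memory from O(N²) down to O(N). In the real kernels, the tiles are sized to fit in SRAM and the loop body is fused into a single kernel, so intermediates never round-trip through HBM.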
These optimizations result in faster training and inference. They also enable models to handle longer token sequences, benefiting applications that require maintaining long-running conversations or processing high-resolution images.
FlashAttention-4
FlashAttention-4 (FA4) is the latest iteration of optimized CUDA kernels, delivering a leap in efficiency. It is hardware-software co-designed and tailored to maximize performance on the NVIDIA Blackwell architecture, such as the NVIDIA HGX B200.
FA4 achieves a peak performance of 1,605 TFLOPS, harnessing 71% of the hardware’s theoretical maximum. By redesigning the attention mechanism to handle Blackwell’s asymmetric scaling (where compute power scales much faster than memory bandwidth), FA4 outperforms standard baselines, delivering up to a 1.3x speedup over NVIDIA cuDNN and 2.4x over NVIDIA Triton Inference Server implementations.
These gains extend to the backward pass, where FA4 uses tensor memory (TMEM), dedicated memory located next to the Tensor Cores that provides additional register-like capacity, to bypass register accumulation and relieve register pressure. This enables larger tiles (up to 128×128) and deeper pipelines, while reducing shared memory (SMEM) traffic and maximizing operation overlap. It ensures that training speed keeps pace with the doubled throughput of the new Tensor Cores rather than being bottlenecked by memory logistics.
FA4 co-designs the algorithm and kernel implementation around the following new Blackwell features and the corresponding mitigation strategies:
| Blackwell hardware feature | Bottleneck | FA4 technique |
| --- | --- | --- |
| TMEM – 256 KB on-chip memory per SM; fifth-generation Tensor Cores asynchronously write outputs directly to TMEM | Standard backward passes overuse shared memory (SMEM) for intermediates, creating a bandwidth bottleneck relative to the Tensor Cores | TMEM-based backward pass: FA4 stores backward intermediates (S, P, dP, dS, dQ) directly in TMEM, drastically reducing SMEM traffic |
| SMEM | SMEM bandwidth becomes limiting as tensor core performance scales faster than memory movement | Reduced SMEM pressure by relocating intermediates to TMEM |
| Asymmetric scaling | Tensor Core throughput roughly doubles (~2.25 PFLOPs), while MUFU throughput stays unchanged from the prior generation (16 ops/clock) | Compute rebalancing to scale back reliance on MUFU-heavy paths |
| Exponential units (MUFU) | Softmax exponentials dominate runtime, exceeding matmul time by ~25–60% | Software-emulated exponentials using FMA-based polynomial approximations alongside MUFU (sketched after this table) |
| Expanded MMA tile size (128×128) | Larger tiles increase register pressure and impose stricter scheduling constraints | New CTA scheduling and register allocation, including LPT scheduling for causal masking |
| Fully asynchronous tensor cores | Sequential MMA–softmax dependencies can leave compute units idle if not overlapped | Redesigned asynchronous pipelines to maximize overlap across MMA, softmax, and memory operations |
| Finite non-matmul resources | Non-matmul ALUs scale more slowly than tensor cores | Algorithmic minimization of non-matmul work |
| Online softmax | Redundant rescaling wastes non-matmul cycles | Conditional softmax rescaling, updating only when the running max crosses a threshold |
| CUDA 13 and CUDA-X tooling | Kernel complexity slows tuning and optimization | Kernel-level graphs and performance tools used to optimize FA4 kernels |
| Developer productivity | Complex C++ templates slow compile times and hinder iteration | CuTe DSL in Python, achieving 20–30× faster compile times than FA3 while preserving kernel expressivity |
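As a concrete illustration of the software-emulated exponentials technique in the table above, the snippet below approximates 2^x with a range reduction plus a cubic polynomial evaluated in Horner form, the kind of computation that maps onto a short chain of fused multiply-adds (FMAs). The function name is illustrative, and the coefficients are a plain Taylor expansion chosen for readability; they are an assumption for illustration, not the tuned polynomial used inside the FA4 kernels.

```python
import numpy as np

def exp2_poly(x):
    """Approximate 2**x via range reduction and a cubic polynomial.

    2**x = 2**i * 2**f with i = floor(x) and f in [0, 1). The 2**f factor is
    rewritten as e**(f*ln2) and evaluated with a Horner-form cubic, which maps
    to a short FMA chain on hardware; the final scaling by 2**i is a cheap
    exponent adjustment (ldexp). Coefficients are a Taylor expansion chosen
    for readability (illustrative only), giving well under 1% relative error.
    """
    x = np.asarray(x, dtype=np.float32)
    i = np.floor(x)
    y = (x - i) * np.float32(np.log(2.0))                # 2**f = e**(f*ln2)
    p = ((y / 3.0 + 1.0) * (y / 2.0) + 1.0) * y + 1.0    # 1 + y + y^2/2 + y^3/6
    return np.ldexp(p, i.astype(np.int32))

# Quick sanity check against NumPy's built-in exp2
vals = np.linspace(-8.0, 8.0, 7, dtype=np.float32)
print(np.max(np.abs(exp2_poly(vals) / np.exp2(vals) - 1.0)))
```

In the same spirit of trimming non-matmul work, the conditional softmax rescaling row corresponds to skipping the exp(m − m_new) correction in the tiled sketch shown earlier whenever the running maximum has not changed (the correction factor is then exactly 1), or, with a threshold, when it has barely changed.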
The forward and backward pass performance gains on a Blackwell GPU for various sequence lengths are shown in Figures 1 and 2, respectively.
Figure 1. FlashAttention-4 forward pass performance on a Blackwell GPU across sequence lengths
Figure 2. FlashAttention-4 backward pass performance on a Blackwell GPU across sequence lengths
Learn more
The FlashAttention-4 algorithm was developed using a hardware-software co-design and kernel pipeline that mitigates bottlenecks induced by modern accelerators. FA4 uses the NVIDIA Blackwell Tensor Core and Tensor Memory architecture to increase performance and power efficiency, especially in multi-GPU, multi-node (MGMN) distributed configurations. The forward and backward pass kernel design incorporates various optimizations that achieve speedups over previous versions of the FlashAttention algorithm.
Inference frameworks such as SGLang and vLLM are compatible with FlashAttention-4 prefill, and NVIDIA has incorporated FA4 techniques into NVIDIA cuDNN 9.14.
Learn more about cuDNN and how to unlock deep learning performance on Blackwell with cuDNN.
