Tianzhu Ye, Li Dong, Yutao Sun, Furu Wei
Code
We compare DIFF V2 with DIFF V1 below:
(For simplicity, we omit the batch dimension and assume that both the input and output of the following flash_attn_func are three-dimensional tensors (tokens, heads, head dimension). Heads belonging to the same GQA group are arranged contiguously in the output.)
def DiffAttnV1(
    layer_index, q1, q2, k1, k2, v,
    lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v: (N, h_kv/2, 2d)
    lam_*: (d,)
    """
    # Two attention calls over the same value tensor.
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)
    # Globally shared, re-parameterized lambda.
    lam_init = 0.8 - 0.6 * exp(-0.3 * layer_index)
    lam1 = exp(sum(lam_q1 * lam_k1))
    lam2 = exp(sum(lam_q2 * lam_k2))
    lam = lam1 - lam2 + lam_init
    # Differential operation, then per-head RMSNorm on the 2d-dim context.
    attn = attn1 - lam * attn2
    attn = rmsnorm(attn)
    attn = attn * (1 - lam_init)
    return attn
def DiffAttnV2(
    q, k, v, lam
):
    """
    q: (N, 2h, d)
    k: (N, h_kv, d)
    v: (N, h_kv, d)
    lam: (N, h, 1)
    """
    # Single attention call; head dimension is aligned across q, k, v.
    attn = flash_attn_func(q, k, v)
    # Adjacent heads belong to the same GQA group and form a differential pair.
    attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
    lam_val = sigmoid(lam)
    attn = attn1 - lam_val * attn2
    return attn
Full code at: https://github.com/microsoft/unilm/tree/master/Diff-Transformer/Diff-Transformer-V2
In the code, h denotes the number of query heads, h_kv the number of key-value heads, and d the head dimension. The $\lambda$ in DIFF V2 is projected from $X$ for each token and each head.
DIFF V2 doubles the number of query heads while keeping the number of key-value heads unchanged, and the extra dimension is reduced back to h*d after the differential operation, so the $W_O$ projection stays the same as in the baseline Transformer.
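To make the shape bookkeeping concrete, below is a minimal, hypothetical sketch of the projections around DiffAttnV2 (module and parameter names such as DiffAttnV2Sketch and w_lam are ours, not from the reference implementation; the batch dimension, RoPE, and masking are omitted as above):

```python
import torch
import torch.nn as nn

class DiffAttnV2Sketch(nn.Module):
    """Hypothetical wrapper around DiffAttnV2 to illustrate the shapes."""
    def __init__(self, dim, h, h_kv, d):
        super().__init__()
        self.h, self.h_kv, self.d = h, h_kv, d
        self.w_q = nn.Linear(dim, 2 * h * d, bias=False)  # doubled query heads
        self.w_k = nn.Linear(dim, h_kv * d, bias=False)   # KV heads unchanged
        self.w_v = nn.Linear(dim, h_kv * d, bias=False)
        self.w_lam = nn.Linear(dim, h, bias=False)        # per-token, per-head lambda
        self.w_o = nn.Linear(h * d, dim, bias=False)      # same W_O shape as baseline

    def forward(self, x):  # x: (N, dim); batch dimension omitted for simplicity
        n = x.size(0)
        q = self.w_q(x).view(n, 2 * self.h, self.d)
        k = self.w_k(x).view(n, self.h_kv, self.d)
        v = self.w_v(x).view(n, self.h_kv, self.d)
        lam = self.w_lam(x).view(n, self.h, 1)
        attn = DiffAttnV2(q, k, v, lam)                   # (N, h, d) after subtraction
        return self.w_o(attn.reshape(n, self.h * self.d))
```

The output projection still maps h*d back to the model dimension, matching the baseline Transformer.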
Motivation
Faster Decoding & No Custom Kernels
DIFF V2 introduces additional query heads compared with the baseline Transformer, but does not increase the number of key-value (KV) heads. Since LLM decoding is usually memory-bound, this design allows DIFF V2 to achieve decoding speeds on par with the standard Transformer. Moreover, since the head dimension is aligned between query, key, and value, there is no need for custom attention kernels in DIFF V2. In contrast, DIFF V1 can be slower during decoding because the value cache must be loaded twice, and a custom attention kernel is required. DIFF V2 can also increase the arithmetic intensity of the attention module during decoding.
During pretraining, when using cutting-edge FlashAttention kernels on H-series and B-series GPUs, the throughput reduction introduced by DIFF V2 is negligible. For long-sequence prefilling, we recommend combining DIFF V2 with techniques such as YOCO (also used in Gemma 3n), which already reduces prefilling complexity to linear time with respect to sequence length.
Another perspective is to compare DIFF V2 with a Transformer that has the same query dimension 2h*d. Under this comparison, both models exhibit the same attention kernel speed, while DIFF V2 has fewer parameters and FLOPs in the output projection.
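As a rough back-of-the-envelope illustration of the decoding argument (toy numbers assuming bf16 caches; this ignores query/output traffic and kernel details, so the exact ratio will differ in practice):

```python
# Illustrative decode-time cache traffic per generated token (bf16 = 2 bytes).
def cache_read_bytes(seq_len, h_kv, d, value_reads=1, dtype_bytes=2):
    k_bytes = seq_len * h_kv * d * dtype_bytes                # key cache, read once
    v_bytes = seq_len * h_kv * d * dtype_bytes * value_reads  # value cache
    return k_bytes + v_bytes

# DIFF V2 has the same KV layout and access pattern as the baseline Transformer.
baseline_or_v2 = cache_read_bytes(seq_len=8192, h_kv=8, d=128)
# DIFF V1 runs two attention calls over the same value cache, so it is read twice.
diff_v1 = cache_read_bytes(seq_len=8192, h_kv=8, d=128, value_reads=2)
print(diff_v1 / baseline_or_v2)  # 1.5x cache traffic for DIFF V1 in this toy setting
```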
Softmax Magnitude Constraint
In the standard Scaled Dot-Product Attention (SDPA), let $Q, K, V \in \mathbb{R}^{n \times d}$ be the queries, keys, and values. The context vector $C$ is defined as:

$$C = AV, \quad A = \text{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)$$

where $A \in \mathbb{R}^{n \times n}$ is the attention weight matrix. Let's focus on a single row of $C$, denoted as $\mathbf{c}_i$, which is a weighted sum of value vectors $\mathbf{v}_j$:

$$\mathbf{c}_i = \sum_{j=1}^{n} a_{ij} \mathbf{v}_j$$

We define the Context RMS (Root Mean Square) to represent the magnitude of this output:

$$\text{RMS}(\mathbf{c}_i) = \sqrt{\frac{1}{d} \sum_{k=1}^{d} c_{ik}^2}$$

The weights $a_{ij}$ are non-negative and sum to 1 ($\sum_{j=1}^{n} a_{ij} = 1$). Assuming the value vectors $\mathbf{v}_j$ are uncorrelated and have an RMS of 1, we get $\text{RMS}(\mathbf{c}_i) \approx \sqrt{\sum_{j=1}^{n} a_{ij}^2}$, which is strictly bounded in the range $[\frac{1}{\sqrt{n}}, 1)$ no matter how the attention distribution changes:
- If the attention is focused almost entirely on one token, the Context RMS approaches $1$.
- If the attention is spread equally across all tokens ($a_{ij} = \frac{1}{n}$), the Context RMS drops to $\frac{1}{\sqrt{n}}$.
- In other situations, the Context RMS lies between $\frac{1}{\sqrt{n}}$ and $1$.
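A quick numerical sanity check of these bounds with random, roughly unit-RMS value vectors (toy sizes of our own choosing; a minimal sketch, not taken from the reference code):

```python
import torch

torch.manual_seed(0)
n, d = 8192, 128
v = torch.randn(n, d)  # roughly uncorrelated value vectors with RMS ~ 1

def context_rms(a, v):  # a: (n,) attention weights that sum to 1
    c = a @ v
    return c.pow(2).mean().sqrt().item()

uniform = torch.full((n,), 1.0 / n)
one_hot = torch.zeros(n)
one_hot[0] = 1.0
print(context_rms(uniform, v))  # ~ 1/sqrt(n) ≈ 0.011
print(context_rms(one_hot, v))  # ~ 1
```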
In DIFF V1 we add a per-head RMSNorm on the context vectors (mirroring the DiffAttnV1 code above):

$$\mathbf{c}_i^{\text{DIFF V1}} = (1 - \lambda_{\text{init}}) \cdot \text{RMSNorm}\!\left(\mathbf{c}_i^{(1)} - \lambda\, \mathbf{c}_i^{(2)}\right)$$
If the model learns a uniform attention distribution in a head, the Context RMS is approximately $\frac{1}{\sqrt{n}}$. To normalize this back to $1$, RMSNorm must multiply the vector by a scale of $\sqrt{n}$. For $n = 8192$, $\sqrt{n} \approx 90.5$. This means the RMSNorm layer applies a roughly 100x magnification to the output. In large-scale pretraining, we find this leads to massive gradients and numerical instability.
A typical phenomenon is that when DIFF V1 is pre-trained at a large learning rate, the gradient norm experiences a larger increase compared to Transformer in the later stages, along with higher variance. In DIFF V2, after removing the per-head RMSNorm, the gradient norm scale becomes comparable to that of Transformer, and the gradient norm spike is reduced (will be discussed further below).
We adopted the per-head RMSNorm design in DIFF V1 primarily because of the doubled value head dimension and the globally shared $\lambda$ across all tokens. Given the modifications made to these two aspects in DIFF V2, we found that removing RMSNorm is now safe.
Beyond Softmax Constraint & Elimination of Attention Sinks
We demonstrate DIFF V2 can overcome the constraint of Softmax mentioned above. It can also help eliminate attention sinks.
- In the original Softmax attention:

$$\mathbf{c}_i = \sum_{j=1}^{n} a_{ij} \mathbf{v}_j = \sum_{j=1}^{n} \text{Softmax}(z_{ij})\, \mathbf{v}_j$$

$$\text{RMS}(\mathbf{c}_i) \in \left[\frac{1}{\sqrt{n}}, 1\right)$$

- In DIFF V2 we introduce a projected $\lambda_i \in (0, 1)$ (obtained via sigmoid) for each token and each head:

$$\mathbf{c}_i = \sum_{j=1}^{n} \left(\text{Softmax}(z^{(1)}_{ij}) - \lambda_i\, \text{Softmax}(z^{(2)}_{ij})\right) \mathbf{v}_j$$

$$\text{RMS}(\mathbf{c}_i) \in \left(0, \sqrt{2}\right)$$
The projected $lambda_i$ helps to control the context RMS. We observe that lowering the lower bound of the context RMS to zero is particularly important. It can help eliminate attention sinks and improve training stability. The upper bound only needs to remain bounded.
Note that our analysis here considers the RMS before the output projection $W_O$. Although the RMS can be recovered and adjusted after the output projection, the lack of freedom at the Softmax still affects learning performance.
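A similar toy check for the DIFF V2 range, probing the two extremes with one-hot attention maps (purely illustrative; the helper diff_context_rms is ours):

```python
import torch

torch.manual_seed(0)
n, d = 4096, 128
v = torch.randn(n, d)  # roughly uncorrelated value vectors with RMS ~ 1

def diff_context_rms(a1, a2, lam, v):
    c = (a1 - lam * a2) @ v
    return c.pow(2).mean().sqrt().item()

peak_0 = torch.zeros(n)
peak_0[0] = 1.0
peak_1 = torch.zeros(n)
peak_1[1] = 1.0

# Both maps peaked on the same token with lambda near 1: the terms cancel, RMS -> 0.
print(diff_context_rms(peak_0, peak_0, lam=0.99, v=v))  # ~ 0.01
# Maps peaked on different tokens with lambda near 1: RMS approaches sqrt(2) ≈ 1.41.
print(diff_context_rms(peak_0, peak_1, lam=0.99, v=v))  # ~ 1.4
```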
Other recent works alleviate this constraint as well:
- With the off-by-one Softmax, where a constant $1$ is added to the denominator:

$$\mathbf{c}_i = \sum_{j=1}^{n} a_{ij}^{\text{off}} \mathbf{v}_j = \frac{\sum_{k=1}^{n} \exp(z_{ik})}{1 + \sum_{k=1}^{n} \exp(z_{ik})} \sum_{j=1}^{n} \text{Softmax}(z_{ij})\, \mathbf{v}_j$$

$$\text{RMS}(\mathbf{c}_i) \in \left(0, 1\right)$$

- In gpt-oss, a learnable scalar $s$ is introduced for each head:

$$\mathbf{c}_i = \sum_{j=1}^{n} a_{ij}^{\text{oss}} \mathbf{v}_j = \frac{\sum_{k=1}^{n} \exp(z_{ik})}{\exp(s) + \sum_{k=1}^{n} \exp(z_{ik})} \sum_{j=1}^{n} \text{Softmax}(z_{ij})\, \mathbf{v}_j$$

$$\text{RMS}(\mathbf{c}_i) \in \left(0, 1\right)$$
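For reference, a minimal sketch of these two denominator modifications (function names are ours, and this is not the reference implementation of either work):

```python
import torch

def softmax_off_by_one(z):
    # a_ij = exp(z_ij) / (1 + sum_k exp(z_ik)); weights over real tokens can sum to < 1.
    m = torch.clamp(z.max(dim=-1, keepdim=True).values, min=0.0)
    e = torch.exp(z - m)
    return e / (torch.exp(-m) + e.sum(dim=-1, keepdim=True))

def softmax_with_sink(z, s):
    # gpt-oss-style: a learnable scalar s (here a scalar tensor, or broadcastable to z)
    # acts as an extra sink logit: a_ij = exp(z_ij) / (exp(s) + sum_k exp(z_ik)).
    m = torch.maximum(z.max(dim=-1, keepdim=True).values, s)
    e = torch.exp(z - m)
    return e / (torch.exp(s - m) + e.sum(dim=-1, keepdim=True))
```

In both cases the weights over real tokens can sum to less than 1, which is what lowers the lower bound of the context RMS toward 0.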
Experimental Observations
We conduct pretraining experiments on production-scale LLMs, including dense models and a 30A3 MoE, on trillions of tokens, using large learning rates of 6e-4 to 1e-3.
The experiments are still running. What we have observed so far:
- Notably lower language modeling loss compared to Transformer (a gap of 0.02 to 0.03).
- Reduced loss and gradient spikes during training, particularly under large learning rate settings where the Transformer baseline becomes unstable.
- Reduced magnitude of activation outliers.
We expect to explore in later stages of training:
- Learning efficiency in mid- and post-training.
- Performance on downstream long-context benchmarks (alleviating context rot).
Discussions
Construction of Differential Operation
In theory, a standard Transformer with $2h$ attention heads can learn the differential operation by learning $W_O^{2i} = -W_O^{2i+1},\ i = 0, 1, \ldots, h-1$, where $W_O^{i}$ denotes the output projection of head $i$, and heads $2i$ and $2i+1$ belong to the same GQA group.
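A toy numerical check of this equivalence (shapes and names are illustrative; the $\lambda$ scaling and GQA sharing details are ignored here):

```python
import torch

torch.manual_seed(0)
n, h, d, dim = 16, 4, 8, 32              # toy sizes; the wide model has 2h heads
ctx = torch.randn(n, 2 * h, d)           # per-head context vectors of the 2h heads

# Wide Transformer whose output projections satisfy W_O^{2i+1} = -W_O^{2i}.
w_o_pairs = torch.randn(h, d, dim)
w_o_wide = torch.stack([w_o_pairs, -w_o_pairs], dim=1).reshape(2 * h, d, dim)
out_wide = torch.einsum('nhd,hdk->nk', ctx, w_o_wide)

# Explicit differential construction: subtract paired heads, then a half-size W_O.
diff = ctx[:, 0::2] - ctx[:, 1::2]
out_diff = torch.einsum('nhd,hdk->nk', diff, w_o_pairs)

print(torch.allclose(out_wide, out_diff, atol=1e-4))  # True
```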
Assumption 1. In practice, such a solution is difficult to learn through optimization, as it requires two sets of parameters to converge to exact negatives of each other.
Assumption 2. The differential operation can be learned by the model, and the model chooses to learn it during training. Then explicitly constructing it before the output projection, as in DIFF V2, saves half of the $W_O$ parameters. The number of saved parameters is non-trivial: under the current GQA setting, the parameters in the attention module are dominated by $W_Q$ and $W_O$, so approximately 25% of the attention-module parameters can be saved. The saved parameter budget can then be reallocated to other parts of the model.
Even if DIFF V2, after reallocating parameters, does not achieve a lower loss than the baseline but merely matches it, the method is still worthwhile if it provides additional benefits such as improved training stability, better control of outliers, or higher training efficiency. This is analogous to GQA, which matches the loss of MHA while reducing KV-cache as an additional benefit. So the key question becomes empirical performance.
Design Ablations
- Subtracting two heads that are not in the same GQA group, which means they do not share the same key and value.
(For simplicity, we omit the batch dimension and assume that both the input and output of the following flash_attn_func are three-dimensional tensors (tokens, heads, head dimension). Heads belonging to the same GQA group are arranged contiguously in the output)
# Ablation 1: subtract heads from different GQA groups (key/value not shared).
...
attn = flash_attn_func(q, k, v)
nh = attn.size(1)
attn1, attn2 = attn[:, :nh//2], attn[:, nh//2:]
...

# DIFF V2: subtract adjacent heads within the same GQA group (key/value shared).
...
attn = flash_attn_func(q, k, v)
attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
...
In our large learning rate setting, ablation 1 exhibits obvious training instability (many more loss and gradient spikes) and higher loss compared to DIFF V2. The value must be shared between the two subtracted heads to construct the differential operation, as discussed in the DIFF V1 paper.
- Subtracting the two attention maps without the $\lambda$ scaling factor, i.e., attn1 - attn2 instead of attn1 - lam_val * attn2. This leads to an excessively small context RMS at initialization.
- Directly using the projected $\lambda$ without applying the sigmoid operation. The context RMS is then unbounded from above.
Both ablation 2 and ablation 3 result in higher language modeling loss than DIFF V2. Ablation 2 maintains training stability similar to DIFF V2, whereas ablation 3 is less stable (though still more stable than ablation 1).
Miscellaneous
- In DIFF, the outliers in qk logits are smaller than those in the baseline. This was already analyzed in DIFF V1: DIFF can achieve attention sparsity comparable to the baseline while using smaller qk logits. We further propose that DIFF's differential mechanism, which cancels out small attention values, may help mitigate the attention rounding error issue discussed in this blog and paper.
- DIFF V2 is compatible with sparse attention. In many existing sparse attention frameworks, query heads within the same GQA group are required to attend to the same key-value blocks in order to maximize speedup. A typical strategy is to select key-value blocks based on the average attention logits across heads.
For DIFF V2, the problem shifts to designing an effective block-selection strategy for a larger GQA group that contains pairs of differential heads. This may require handling the two kinds of differential heads separately during selection, or a simple average of attention logits might already be sufficient in practice. Conceptually, this does not introduce any fundamental differences compared with block sparse attention for standard Transformers. A sketch of the simplest strategy follows.
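A minimal sketch of that simplest strategy under these assumptions (hypothetical helper, not from any existing sparse attention framework):

```python
import torch

def select_kv_blocks(block_scores, top_k):
    # block_scores: (heads_in_group, num_blocks) pooled attention scores per KV block,
    # covering all query heads of the enlarged GQA group, i.e. both kinds of
    # differential heads. Average across the group and share one top-k block set.
    group_scores = block_scores.mean(dim=0)
    return group_scores.topk(top_k).indices

scores = torch.rand(8, 64)                # e.g. 8 query heads in the group, 64 KV blocks
print(select_kv_blocks(scores, top_k=16))  # block ids shared by the whole group
```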
