Glitches in the Attention Matrix


Transformers laid the groundwork for foundation models, which let us take pretrained models off the shelf and apply them to a wide variety of tasks. However, there is a common artifact in transformer models that can be detrimental in specific tasks and scenarios. Not understanding these pitfalls can cause your project to substantially underperform or fail. For instance, the DINOv2 GitHub page offers models pretrained with and without registers. A table of metrics suggests that registers, which were introduced to fix this artifact, don’t help the model in any meaningful way. So why add complexity if there is no gain in accuracy?

However, the metrics shown on the DINOv2 page are only for ImageNet classification, which is known not to be impacted by these artifacts. If you use the DINOv2 ViT model without registers for object detection (for example, with LOST), your performance would likely be substantially worse.

Using pretrained ViT models without understanding when high-norm artifacts could impact your task can result in your project failing.

Since these artifacts were identified, the research community has developed several methods to address them. The most recent solutions require little to no retraining and introduce zero additional test-time latency. These phenomena are not unique to ViTs; they also occur in LLMs. In fact, one of the NeurIPS 2025 papers reviewed here proposes a general solution to these “attention sink” artifacts by modifying the self-attention transformer architecture. This modified architecture has proven useful in a multitude of ways and is already being incorporated into the latest Qwen model, Qwen3-Next.

This article provides a comprehensive guide to:

  1. Transformer registers.
  2. The high-norm artifacts (or attention sinks) they address.
  3. The most recent research-driven solutions for mitigating these artifacts.

1. Discovery of the Artifacts in ViTs with DINOv2

While ViTs have been pivotal in ushering in the era of foundation models for computer vision, they suffer from a persistent anomaly: the emergence of high-norm spikes [1]. These artifacts appear across both supervised and self-supervised training regimes, with the original DINO being a notable exception. Figure 1 demonstrates this on ViT-Base models trained with different algorithms, spanning self-supervised (DINO/DINOv2, MAE), weakly supervised (CLIP), and supervised (DeiT-III) approaches.

Figure 1. Visualization of the last layer of multiple ViT-B models. The original DINO does not show artifacts; adding registers to DINOv2 prevents artifacts from appearing in the patch tokens. Figure by author; input images generated with NanoBanana.

These artifacts exhibit four key characteristics (a short detection sketch follows the list):

  • High Norm: The L2 norm of artifact tokens can be 2–10 times larger than the average token norm, depending on the training method.
  • Sparsity: They constitute a small fraction of total tokens (approx. 2%) and form a distinct mode in the norm distribution (e.g., Figures 3 and 4 in Darcet et al. 2024 [1]).
  • Patch Localization: They predominantly appear in low-information background areas or image corners.
  • Layer Localization: They appear primarily in the middle-to-late layers of ViTs.
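
To make these properties concrete, here is a minimal sketch, not taken from any of the papers, of how you might flag candidate artifact tokens in your own model by thresholding patch-token norms. The relative cutoff `ratio` is a hypothetical value that you would tune per model and layer.

import torch

def find_high_norm_tokens(patch_tokens: torch.Tensor, ratio: float = 3.0):
    # patch_tokens: [batch, num_patches, dim] output of a mid-to-late ViT block
    # ratio: hypothetical cutoff; tokens whose norm exceeds `ratio` times the
    #        per-image median norm are treated as artifact candidates
    norms = patch_tokens.norm(dim=-1)                   # [batch, num_patches]
    median = norms.median(dim=-1, keepdim=True).values  # per-image median norm
    is_artifact = norms > ratio * median                # boolean mask of candidates
    fraction = is_artifact.float().mean().item()        # roughly 2% for affected models
    return norms, is_artifact, fraction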

The Impact of High-Norm Artifacts

The impact on accuracy varies by task. We measure this impact by observing how much performance improves after applying the fixes discussed in later sections. A summary of results from Jiang et al. (2025) [2] is provided below:

| Impact | Task | Mitigation Result |
| --- | --- | --- |
| 😐 | ImageNet Classification | No significant impact |
| 😃 | Unsupervised Object Discovery (LOST) | Substantial improvement (20%) on DINOv2 ViT-L/14 |
| 😊 | Zero-shot Segmentation | +5 mIoU for OpenCLIP ViT-B/14, but not DINOv2 |
| 😊 | Depth Estimation | Marginal improvement with test-time registers (lower RMSE) |

The Cause: Two Hypotheses

Why do these models generate high-norm artifacts? Two primary, non-contradictory hypotheses exist:

  1. Global Processing: Large models learn to identify redundant tokens and repurpose them as “storage slots” to process and retrieve global information.
  2. The Mechanistic Hypothesis: The artifacts are a byproduct of the Softmax function, which forces attention weights to sum to 1.

In SoftMax-based attention, the weights for a given query must sum to 1:

$$\sum_{j} \text{Attention}(Q_i, K_j) = 1$$

Even when a query token ( i ) has no meaningful relationship with any key token ( j ), the SoftMax operation forces it to distribute its “attention mass”. This mass often gets dumped into specific low-information background tokens, which then develop into high-norm sinks.
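
A tiny numerical example of this effect (the values are chosen arbitrarily for illustration): even when every score is low and nearly identical, SoftMax still spreads a full unit of attention mass across the keys.

import torch

scores = torch.tensor([-5.0, -5.2, -4.9, -5.1])  # a query weakly related to every key
weights = scores.softmax(dim=-1)
print(weights)        # roughly tensor([0.26, 0.21, 0.29, 0.24])
print(weights.sum())  # tensor(1.): the mass has to go somewhere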

These attention weights are calculated separately for each attention head. To really understand the attention-sink issue, let us step through the attention code. The self-attention diagrams are reproduced in Figure 2 for reference.

Figure 2. Refresher on transformer attention. The left side zooms into Scaled Dot-Product Attention (SDPA), while the right side shows how SDPA fits into the network in a multi-headed configuration. The orange box on the left highlights the SoftMax layer, which is normalized so that the sum along the last dimension equals 1. The right side illustrates how heads remain separate until after attention is applied. Figure by author, based on Figure 2 from Vaswani et al. (2017) [3].

You can see an example of the code in Facebook Research’s DeiT GitHub repo:

class Attention(nn.Module):
    # ...
    def forward(self, x):
        # B: batch size
        # N: sequence length (number of tokens)
        # C: embedding dimension (num_heads * head_dim)
        B, N, C = x.shape
        # self.qkv is a Linear layer (with bias) that triples the width of the
        # tensor, computing Q=XW_Q, K=XW_K, V=XW_V in a single matrix multiply
        qkv = self.qkv(x).reshape(
            B, N,
            3,  # holds Q, K, and V; this dimension is permuted to index 0
            self.num_heads,
            C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * self.scale  # scale by 1/sqrt(head_dim) to keep the logits well-behaved

        attn = (q @ k.transpose(-2, -1))  # attn: [B, num_heads, N, N]
        attn = attn.softmax(dim=-1)  # creation of the artifact
        attn = self.attn_drop(attn)  # optional dropout during training

        # matrix multiply with V, then transpose + reshape concatenates the heads
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)  # another linear layer (output projection)
        x = self.proj_drop(x)  # optional dropout during training
        return x

In ViTs, which lack explicit “global” tokens (aside from the [CLS] token), the model repurposes background patches as “attention sinks” or “trash cans”. These tokens aggregate global information, their norm magnitude swells, and their original local semantic meaning is lost.

2. The Register Solution: Vision Transformers Need Registers (2024)

Figure 3. Diagram of a ViT with registers. Register output tokens are not used for training or predictions but provide a dedicated space for global information. Figure by author; image of puppies created with NanoBanana.

The team behind DINOv2 discovered these high-norm artifacts and proposed adding “register” tokens (Darcet et al. 2024 [1]). Registers are learned tokens, similar to the [CLS] token but without positional embeddings, and their corresponding output tokens are never used. That is really all they are: additional tokens that are not directly used for training or prediction. The main downside of this method is that it requires retraining the model. This limitation spurred the search for post-hoc solutions that could fix existing models.
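
A minimal sketch of the idea follows; this is not the DINOv2 implementation, and the class and parameter names are made up for illustration. Registers are simply extra learned embeddings appended to the token sequence before the transformer blocks and discarded afterwards.

import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    # Hypothetical wrapper: `blocks` is any stack of transformer blocks, and
    # `num_registers` extra learned tokens are appended to the sequence.
    def __init__(self, blocks: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = blocks
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, 1 + num_patches, dim], i.e. [CLS] plus patch tokens with
        # positional embeddings already added; registers get no positional embedding
        B, T, _ = tokens.shape
        regs = self.registers.expand(B, -1, -1)
        x = torch.cat([tokens, regs], dim=1)  # append the register tokens
        x = self.blocks(x)
        return x[:, :T]                       # register outputs are simply dropped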

3. The Denoising Solution: Denoising Vision Transformers (2024)

Yang et al. (2024) [4] proposed Denoising Vision Transformers (DVT) to clean output tokens post-hoc. While DVT is synergistic with registers, it introduces a significant bottleneck, adding roughly 100 seconds of latency per 518×518 image, making it impractical for real-time applications.

Contributions:

  1. DVT improves performance on a wide variety of tasks, and the authors showed that it is synergistic with adding registers.
  2. The paper adds to our understanding by showing that positional embeddings are an underlying cause of the high-norm artifacts.

However:

  1. Adds a large latency per image (around 100 seconds for a 518×518 image)

4. The Distillation Solution: Self-Distilled Registers (2025)

The approach by Chen et al. 2025 [5] uses a teacher-student paradigm to train a small subset of weights together with the register tokens. The high-norm artifacts are removed from the teacher signal by applying data augmentations (random offsets and flips) to the images, allowing the artifacts to be averaged out. The teacher model is the original ViT, kept frozen. The student model is also initialized from the same ViT; however, additional learnable register tokens are added and a small subset of the weights is fine-tuned.
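
A rough sketch of the idea under my own simplifying assumptions (flips only, offsets omitted, and a plain MSE loss; the actual augmentation and objective details are in the paper): run the frozen teacher on several augmented copies of the image, undo the augmentation on the dense features, and average them so the position-dependent artifacts wash out, then train the student with registers to match this cleaned target.

import torch
import torch.nn.functional as F

def self_distillation_loss(teacher, student, images, num_views: int = 4):
    # teacher: frozen ViT returning dense patch features of shape [B, H, W, C]
    # student: same ViT plus learnable registers; only a small subset of weights trains
    with torch.no_grad():
        views = []
        for _ in range(num_views):
            flip = bool(torch.rand(()) < 0.5)
            aug = torch.flip(images, dims=[-1]) if flip else images
            feats = teacher(aug)                      # [B, H, W, C] patch features
            if flip:
                feats = torch.flip(feats, dims=[-2])  # undo the flip on the feature grid
            views.append(feats)
        target = torch.stack(views).mean(dim=0)       # artifacts average out across views

    pred = student(images)                            # [B, H, W, C]; register outputs discarded
    return F.mse_loss(pred, target)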

Contributions:

  1. Orders of magnitude less compute than training with registers from scratch.
  2. No additional test-time latency.

5. The Mechanistic Solution: Test-Time Registers (2025)

Jiang et al. (2025) [2] introduce a technique to perform “surgery” on trained models to add registers without retraining. They found that artifacts are generated by a sparse set of specific “register neurons” within the MLP layers (roughly 0.02% of all neurons). By rerouting the values from these internal MLP neurons to new register tokens, they matched the performance of fully trained register models at zero retraining cost.

They find the following properties of the artifact-causing neurons (or “register neurons”):

  • Sparsity: Roughly 0.02% of neurons are responsible for the overwhelming majority of the artifact energy.
  • Causality: The position of the outliers can be moved by modifying the activation pattern of the register neurons.

They show that these register neurons aggregate global information by using linear probes: that is, they test whether the register neurons can be used for classification on ImageNet and CIFAR-10/100. The final outputs of the registers are ignored, but within the network the register tokens hold global information that the model can use. The authors also show that setting the register neurons to zero substantially reduces the network’s performance, from 70.2% to 55.6%, suggesting that the networks are using the artifacts to store information and that the artifacts are not just a byproduct of SoftMax.
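
The mechanics can be sketched roughly as follows; this is a simplification with assumed shapes and names, not the authors’ code. Blank register tokens are appended to the input, and at the layers containing register neurons the offending activations are copied onto a register token and zeroed out on the patch tokens.

import torch

def reroute_register_neurons(acts: torch.Tensor, neuron_idx: torch.Tensor,
                             num_registers: int) -> torch.Tensor:
    # acts: [B, N + num_registers, hidden] activations inside an MLP layer, where
    #       the last `num_registers` positions are the appended register tokens
    # neuron_idx: indices of the sparse "register neurons" (~0.02% of hidden units),
    #             found beforehand by ranking neurons by their peak activations
    patch = acts[:, :-num_registers, neuron_idx]  # [B, N, k] register-neuron activations
    peak, _ = patch.max(dim=1)                    # [B, k] strongest activation per neuron
    acts[:, :-num_registers, neuron_idx] = 0.0    # clean the patch tokens (in place)
    acts[:, -1, neuron_idx] = peak                # dump the activation into a register token
    return acts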

6. Relationship between ViT High-Norm Artifacts and LLM Attention Sinks

A phenomenon similar to the ViT high-norm artifacts, called attention sinks, was found in LLMs in the StreamingLLM paper (Xiao et al., ICLR 2024 [6]). While extending LLMs to streaming, infinite-length sequences, the authors noticed that accuracy dropped significantly once the starting token no longer fit into the sliding window. These initial tokens, they discovered, tend to accumulate over half of the attention score. The drop in accuracy was recovered if they kept the K and V values of the initial 1–4 tokens around while sliding the window over the remaining tokens. They propose that the initial tokens are used as attention sinks because of the sequential nature of autoregressive language modeling: they are visible to all tokens, while later tokens are visible only to the tokens that come after them. This is in contrast to ViTs, where every patch token is visible to every other patch token. In LLMs, attention sinks tended not to be seen as a problem, unlike in ViTs.
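
A minimal sketch of the resulting cache-eviction rule, with an assumed cache layout rather than the StreamingLLM code: the first few “sink” tokens are pinned in the KV cache while the rest of the window slides.

import torch

def evict_kv(k_cache: torch.Tensor, v_cache: torch.Tensor,
             num_sink: int = 4, window: int = 1024):
    # k_cache, v_cache: [B, num_heads, seq_len, head_dim]
    # Keep the first `num_sink` tokens (the attention sinks) plus the most
    # recent `window` tokens; everything in between is evicted.
    seq_len = k_cache.shape[2]
    if seq_len <= num_sink + window:
        return k_cache, v_cache
    keep = lambda cache: torch.cat([cache[:, :, :num_sink], cache[:, :, -window:]], dim=2)
    return keep(k_cache), keep(v_cache)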

The attention sinks in LLMs were thought to function as anchors without aggregating global information, unlike in ViTs. However, more recent research from Queipo-de-Llano and colleagues (Queipo-de-Llano et al. 2025 [7]), “Attention Sinks and Compression Valleys”, finds that these attention sinks do indeed contain global information. This suggests that the general solution discussed in the next section may also apply to ViTs, though it had not been tested on them at the time of this writing.

7. Removing the Artifacts with Sigmoidal Gating: Gated Attention (2025)

Figure 4. Gu et al. [8] showed that replacing SoftMax with a sigmoid avoids creating the high-norm artifacts. This did not involve any gating outside of the attention calculation.

One way to address the symptoms of SoftMax would be to replace it with a sigmoid. Gu et al. [8] showed in 2025 that replacing SoftMax with an (unnormalized) sigmoid can indeed eliminate the attention sink at the first token, as shown in Figure 4. While the preliminary results show some potential improvement in validation loss, it remains unclear what downstream impact this has on LLM performance, and it lacks the robust experiments of our next paper.
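
As a rough illustration (my own simplification: a single head and no causal masking), the only change relative to standard attention is an elementwise sigmoid in place of the row-wise SoftMax, so the weights for a query no longer have to sum to 1:

import math
import torch

def sigmoid_attention(q, k, v):
    # q, k, v: [B, N, d]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # [B, N, N]
    weights = torch.sigmoid(scores)  # each weight lies in (0, 1); rows need not sum to 1
    return weights @ v               # no token is forced to absorb leftover attention mass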

Figure 5. Qiu et al. [9] left the Scaled Dot-Product Attention (SDPA) untouched and added a sigmoid gate after concatenating the heads. This means the SoftMax would likely still create the high-norm spikes within the SDPA, but they are then removed by the gating step.

Qiu et al. did something different in their Gated Attention NeurIPS 2025 paper [9]: they left the SoftMax attention untouched, but added gating after the tokens from all of the heads were concatenated, as shown in Figure 5 (a code sketch follows the list below). They find that adding gating does remove the high-norm artifacts, even though the SoftMax attention still creates them inside the standard scaled dot-product attention (SDPA) prior to the gating. The benefits of Gated Attention go beyond fixing the attention-sink artifact, offering:

  1. Improved training stability
  2. Elimination of coaching loss spikes
  3. Support for larger learning rates and batch sizes
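
A minimal sketch of this head-output gating under my own assumptions (the parameter names are made up, and the paper explores several gate placements and granularities): a learned sigmoid gate, computed from the attention layer’s input, multiplies the concatenated head outputs before the output projection.

import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    # Sketch: sigmoid-gate the concatenated head outputs before the output projection.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # gate computed from the attention layer's input
        self.proj = nn.Linear(dim, dim)  # standard output projection W_O

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out: [B, N, dim] concatenated head outputs (SoftMax SDPA left untouched)
        # x:        [B, N, dim] input to the attention layer, used to compute the gate
        g = torch.sigmoid(self.gate(x))  # elementwise gate in (0, 1)
        return self.proj(g * attn_out)   # sink-carrying outputs can be gated toward zero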

They use this Gated Attention in their latest Qwen3-Next model, although they also replace some of the self-attention layers with Gated DeltaNet. This could be a sign that we are moving away from single elegant solutions, like stacks of identical self-attention modules, and toward a collection of hacks or heuristics that gets the best performance. In some ways, this could be similar to the brain, with its wide range of types of neurons, neurotransmitters, and neuroreceptors. Larger architecture changes could puncture this equilibrium of progress and require the collection of heuristics to be tuned all over again.

8. Conclusion

Since the distant past of 2024, when the high-norm artifacts of ViTs and the attention sinks of LLMs were discovered, the research community has found many solutions and made a great deal of progress in understanding these artifacts. The artifacts are more similar than initially thought. In both cases, the SoftMax causes the attention on some tokens to increase substantially, and these tokens are used (implicitly or explicitly) as registers that store global information. Removing these registers once they have been learned can hurt performance. Test-time registers move the high-norm artifacts (or implicit registers) into explicit registers, allowing the patch tokens to be cleansed of the artifacts. You can also prevent the implicit registers from forming in the first place, either by replacing SoftMax with a sigmoid or by using a sigmoid as a gating function after the SoftMax (the latter still allows high-norm artifacts inside the SDPA, but they are removed before they reach the output tokens).

In many cases, these artifacts do not cause any issues, for example in global tasks like classification for ViTs and most LLM tasks. They do negatively impact dense ViT tasks, especially when a single token or a few tokens can have an outsized effect, as in object detection. The fixes at least do not make performance worse, although the fixes for LLMs, such as sigmoid attention and gated attention, have not been as widely adopted, and sigmoid attention in particular can be harder to train. Embracing the artifact, by keeping the KV values of the initial tokens, appears to be the most mature solution for streaming LLMs [6].

Comparison of Mitigation Strategies

The best mitigation strategy depends on whether you already have a trained model or plan to train from scratch.

| Method | Training Cost | Mechanism | Latency | Applied To |
| --- | --- | --- | --- | --- |
| Trained Registers [1] | High (full training) | Add learned tokens | None | ViTs |
| Denoising ViTs [4] | Medium | Signal decomposition | Very high | ViTs |
| Self-Distilled Registers [5] | Low (fine-tune) | Distillation | None | ViTs |
| Test-Time Registers [2] | Zero | Neuron shifting | None | ViTs |
| StreamingLLM [6] | Zero | KV cache preservation | None | LLMs |
| Sigmoid or ELU+1 Attention [8] | High (full training) | Replace SoftMax | None | LLMs |
| Gated Attention [9] | High (full training) | Add sigmoid gating | Minimal | LLMs |

Bibliography

  1. Darcet, T., et al. “Vision Transformers Need Registers.” (2024).
  2. Jiang, N., et al. “Vision Transformers Don’t Need Trained Registers.” (2025).
  3. Vaswani, A., et al. “Attention Is All You Need.” (2017).
  4. Yang, et al. “Denoising Vision Transformers.” (2024).
  5. Chen, Y., et al. “Vision Transformers with Self-Distilled Registers.” NeurIPS (2025).
  6. Xiao, et al. “Efficient Streaming Language Models with Attention Sinks.” ICLR (2024).
  7. Queipo-de-Llano, et al. “Attention Sinks and Compression Valleys.” (2025).
  8. Gu, et al. “When Attention Sink Emerges in Language Models: An Empirical View.” ICLR (2025).
  9. Qiu, Z., et al. “Gated Attention for Large Language Models.” NeurIPS (2025).