NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating

-

One little trick can bring enhanced training stability, the use of larger learning rates, and improved scaling properties

The Enduring Popularity of AI’s Most Prestigious Conference

By all accounts, this year’s NeurIPS, the world’s premier AI conference, was one of the largest and most active in its history. The conference was held at the San Diego Convention Center in San Diego, California from Sunday, November 30, 2025 through Sunday, December 7, 2025. To give a sense of the scale, NeurIPS 2025 received 21,575 valid paper submissions. From 2023 (~12.3k) to 2025 (~21.6k), this reflects a ~75–80% jump over two years, roughly ~30% per year on average. In-person attendance has been equally impressive, often reaching tens of thousands of people and frequently capped by venue size, with past locations operating near the upper limit of what the physical venue can handle. Reinforcement learning dominated the conversation this year, with the field shifting from scaling models to tuning them for specific use cases. Industry momentum appeared to centre strongly around Google, with Google DeepMind in particular surging and pushing new and refreshing research directions rather than, for example, simply “bigger LLMs”. The scale and intensity of the conference is perhaps a reflection of both the pace of AI progress and the cultural peak of the modern AI gold rush.

Figure 1: Game of spot the speaker. A typical NeurIPS main presentation hall, which, as you can see, is almost verging on a stadium-level setting. In this brave new world, AI researchers have become rockstars. 📖 Source: photo by author.

This year, the exhibitor hall was packed, with major industry players from technology, finance, and AI infrastructure all setting out their stalls to show off their latest breakthroughs, highlight open roles to talented delegates, and hand out the ever-coveted branded “stash” — pens, T-shirts, water bottles, and more. The especially fortunate conference-goer might even receive an invitation to company-hosted “after-parties”, which have become a staple of the NeurIPS experience and an ideal opportunity to decompress, shed the information overload, and network, from Konwinski’s Laude Lounge to the invite-only Model Ship cruise filled with top researchers. Diamond sponsors this year included Ant Group, Google, Apple, ByteDance, Tesla, and Microsoft. The buy-side presence this year was particularly strong, with leading firms such as Citadel, Citadel Securities, Hudson River Trading, Jane Street, Jump Trading, and The D. E. Shaw Group represented. On the infrastructure and tooling side, Lambda showcased its GPU cloud platform, while companies like Ollama and Poolside highlighted advances in local LLM runtimes and frontier model development.

Figure 2: Nobel laureate Geoffrey Hinton presenting at the Google booth (picture taken at NeurIPS 2018). Well-known AI researchers and industry titans are a common sight throughout NeurIPS. 📖 Source: photo by author.

The NeurIPS Expo showcased many equally fascinating applied-AI demos. Highlights included a demo showing how autonomous agents can behave reliably across different LLM backends; a multimodal forensic search system capable of scanning large video corpora with AI; an AI-accelerated LiDAR processing demo that showed how heterogeneous compute can dramatically speed up 3D perception; and LLM-driven data-engineering workflows that automate ingestion, transformation, and quality checks. It’s clear from the Expo that AI is heading full steam ahead toward agents, multimodal intelligence, accelerated perception, and end-to-end automated data systems.

Figure 3: Smile, you’re on camera! The NeurIPS Expo always has some fascinating exhibitions, including robotics, hardware, neuromorphic systems, and more. 📖 Source: photo by author.

The awards ceremony arguably represents a pinnacle of the conference and a celebration of its most impactful work. The best paper awards are given to exceptionally innovative and impactful research that is likely to have an immediate and long-lasting effect on the field of AI. It goes without saying that a best paper award is a major professional accomplishment in a highly competitive and fast-paced research field. It is even more impressive when we take into account the sheer volume of papers submitted to NeurIPS. Standing out in that crowd is exceptionally difficult.

The Anatomy of a NeurIPS Best Paper: Exploring the Benefits of Gated Attention in LLMs

Gating Explained: How a Tiny Valve Controls Big Neural Models

In the rest of this article, we take a deep dive into one of this year’s best papers from NeurIPS: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, by the Qwen team. Arguably, this dense paper title packs a lot of information into a very small footprint, so, in what follows, I’ll unpack the paper piece by piece with the goal of giving practicing Data Scientists a clear mental model of attention gating and concrete takeaways from the paper that they can immediately apply to their own work.

First, we start with an understanding of the gate, the core module under study in the paper. What exactly is a gate in the context of neural networks? A gate is nothing more than a modulation mechanism, a computational unit that takes the output of an existing transformation in the network and regulates it by scaling all, or parts of, the input signal.

Instead of allowing every activation to flow unchanged through the network, a gate introduces a learned control pathway that determines how much of the transformed information should pass forward.

Operationally speaking, a gate computes a vector of coefficients, typically using a sigmoid, softmax, or occasionally a ReLU-based squashing function, and these coefficients are applied multiplicatively to another vector of activations originating from an upstream computation. This has the effect of regulating how much of that input makes its way downstream, a bit like twisting a faucet handle back and forth to control the amount of water passing through. That’s all there is to it: now you know what gating is and how it is applied.

Figure 4: One of the simplest possible gating mechanisms. The input is modulated by a vector of coefficients computed in the Gate module, which applies a linear projection of the input data followed by a sigmoid non-linearity. The sigmoid squeezes the projected coefficients so that they lie between 0 and 1, which is ideal for a gate, as its purpose is to modulate how much information from the input makes its way through to the next layer. 📖 Source: image by author.

Because the gating weights are typically learnable parameters, the network can discover during training how to modulate internal signals in ways that minimise the overall network loss. In this way, the gate becomes a dynamic filter, adjusting the internal information flow based on the input context, the model’s continually evolving parameters, and the gradients received during optimisation.
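To make this concrete, here is a minimal numpy sketch of the sigmoid gate shown in Figure 4. The projection W_g is randomly initialised purely for illustration; in a real network it would be a learned parameter trained alongside the rest of the model.

import numpy as np

np.random.seed(0)

d = 8                                   # feature dimension
x = np.random.randn(d)                  # input to the block
h = np.random.randn(d)                  # upstream activations to be modulated

W_g = np.random.randn(d, d)             # gate projection (learned in a real model)

gate = 1 / (1 + np.exp(-(x @ W_g)))     # sigmoid squashes coefficients into (0, 1)
h_gated = gate * h                      # element-wise modulation of the upstream signal

print("gate:", gate.round(2))           # per-dimension "faucet" settings
print("gated:", h_gated.round(2))       # modulated activations passed downstream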

A Brief Tour Down Memory Lane: The Long History of Gating

It’s worth taking in a little bit of the history of gating before we move to the main contributions of the paper. Gating is really nothing new, and the Qwen paper didn’t invent this standard component; their contribution lies elsewhere and will be covered shortly. In fact, gating has been a core mechanism in deep architectures for decades now. For example, Long Short-Term Memory (LSTM) networks, introduced in 1997, pioneered the systematic use of multiplicative gates — the input, forget, and output gates — to control the flow of information through time. These gates act as learned filters that determine which signals should be written to memory, which should be retained, and which should be exposed to downstream layers. By controlling information flow in this fine-grained way, LSTMs effectively mitigated the multiplicative explosion or vanishing of gradients that hampered early recurrent networks, enabling stable long-term credit assignment during backpropagation through time (BPTT).

Applying Gating to the LLM Attention Block

The Qwen team’s contribution focuses on applying gating directly to the transformer’s softmax attention block, a specific form of configuration called gated attention. In this article, I won’t spend too much time on the mechanics of attention, as there are many resources out there for studying it, including this recent course by the DeepLearning.ai team and this prior article I’ve written on the topic. In a very brief summary, attention is the core mechanism in the transformer architecture that lets each input sequence token gather contextual information from any other token in the sequence, enabling tokens to interact during training and inference, sharing information regardless of how far apart they appear in the input. The computational graph for the popular scaled dot product attention (SDPA) mechanism is shown below:

Figure 5: The Transformer’s attention mechanism, applied to a toy sequence “Big Cat”. The input tokens are projected into queries, keys, and values. The attention module compares queries with keys to form an attention map, which is then used to weight the values. The result is an enriched representation of each token. 📖 Source: image by author.
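For readers who want to see the arithmetic behind Figure 5, here is a toy, single-head numpy version of scaled dot product attention. The embeddings and projection weights are random stand-ins, not learned parameters.

import numpy as np

np.random.seed(1)
T, d = 2, 4                               # two tokens ("Big", "Cat"), tiny embedding dim

X = np.random.randn(T, d)                 # token embeddings
W_q, W_k, W_v = np.random.randn(3, d, d)  # random stand-ins for learned projections

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries, keys, values
scores = Q @ K.T / np.sqrt(d)             # compare every query with every key
scores = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax -> [T, T] map
context = attn @ V                        # attention-weighted mix of the values

print(attn.round(2))                      # how much each token attends to each token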

Although attention gating has been used for many years, the Qwen team highlight a surprising gap in our body of knowledge: as AI practitioners, we have broadly applied attention gating without truly understanding why it works or how it shapes learning dynamics. The Qwen team’s work shows that we have been benefiting from this module for a long time without a rigorous, systematic account of its effectiveness or the conditions under which it performs best. The Qwen paper does just that and plugs the gap, with the NeurIPS best paper selection committee citation stating:

“This paper represents a substantial amount of work that is possible only with access to industrial-scale computing resources, and the authors’ sharing of the results of their work, which will advance the community’s understanding of attention in large language models, is highly commendable, especially in an environment where there has been a move away from open sharing of scientific results around LLMs.”

NeurIPS 2025, best paper selection committee statement.

Given the sheer amount of dollars flowing in and the massive industrial interest in AI these days, it’s very nice to see that the Qwen team decided to deliver this rich batch of lessons learnt to the broader community, rather than keep these informational nuggets behind closed doors. In doing so, the Qwen team have delivered a beautiful paper full of practical lessons and clear explanations of the mechanics behind attention gating, all distilled in a way that Data Scientists can immediately take and apply in real-world models.

The Qwen team’s systematic study makes several concrete contributions to knowledge that can be easily and immediately applied to improve many standard LLM architectures:

  1. Positioning of Gating: Placing a gating module inside the attention block — most effectively right after the scaled dot product attention (SDPA) output — provides enhanced LLM performance, through the introduction of a non-linearity and the inducement of input-dependent sparsity (a short code sketch of this configuration follows below). They also study key parameterisations of the gating module, such as the type of activation function (SiLU or sigmoid) and the combination function (multiplication, addition).
  2. Attention Sink and Massive Activations: Gating can radically curtail the power of the attention sink phenomenon, where most if not all of the attention in a layer concentrates on a single token — I cover this phenomenon in detail later. By suppressing these extreme activations, the model becomes much more numerically stable during optimisation, eliminating the loss spikes that typically appear in deep or long training runs. This increased stability allows the model to tolerate substantially higher learning rates, unlocking better scaling without the divergence seen in ungated transformers.
  3. Context Length Extension: Gating also facilitates context length extension without requiring full model retraining. In practice, this means a model can be trained with a relatively short context window and later scaled to much longer sequences by retrospectively adjusting components such as the RoPE base. This adjustment effectively reparameterises the positional embedding geometry, allowing the model to operate at extended context lengths (e.g., up to 32k tokens) while preserving stability and without degrading previously learned representations.
Figure 6: First-token attention scores across layers in the baseline model. The early layers exhibit low attention to the first token, followed by a sharp increase around layer 6, with mid-to-late layers maintaining elevated attention. This is the attention sink phenomenon. The dashed line marks the mean score (0.467). 📖 Source: adapted by author from the original paper: https://openreview.net/pdf?id=1b7whO4SfY
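As a preview of what the recommended configuration looks like in code, the sketch below applies a head-specific, elementwise sigmoid gate to the SDPA output, with the gate computed from the same hidden states that feed the attention block. This is only a schematic reading of the paper’s G1 setup: W_theta is a random stand-in for a learned gate projection, and the SDPA output itself is faked with random numbers.

import numpy as np

np.random.seed(2)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

T, H, Dh = 4, 2, 4                        # tokens, heads, head dimension
D = H * Dh

X = np.random.randn(T, D)                 # hidden states entering the attention block
sdpa_out = np.random.randn(T, H, Dh)      # stand-in for the per-head SDPA output

W_theta = np.random.randn(D, H * Dh)      # gate projection (learned in a real model)

# Head-specific, elementwise gate conditioned on the current token's hidden state
gate = sigmoid(X @ W_theta).reshape(T, H, Dh)
gated = gate * sdpa_out                   # modulate the SDPA output before the output projection

print(gated.shape)                        # (4, 2, 4)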

Leveraging Gating to Improve Performance, Learning Stability and Attention Mechanics

The Qwen team focus their investigation on how gating interacts with the LLM’s softmax attention module, aiming to understand its influence on the module’s learning dynamics and to identify the optimal placement of the gate — for example, after the Q, K, or V projections, after the attention computation, or after the dense layers. The setup of this study is illustrated in the diagram below:

Figure 7: The Qwen paper studies the position of the gating module with respect to the scaled dot product attention (SDPA) layer. Gating at the SDPA output (G1) or on the value pathway (G2) yields the strongest gains. These positions give the model the most direct control over what information flows through the attention block, allowing it to suppress noisy interactions or amplify useful ones. 📖 Source: adapted by author from the original paper: https://openreview.net/pdf?id=1b7whO4SfY

The authors evaluate both mixture-of-experts (MoE; 15B total parameters, 2.54B active) and dense feed-forward network (FFN; 1.7B) models. The MoE variant uses 128 experts, top-8 softmax gating, and fine-grained experts. Models are trained on subsets of a 4T-token high-quality corpus covering multilingual, math, and general knowledge data, with a 4096 sequence length. Training uses AdamW defaults, with specific learning-rate and batch-size details provided per experiment. They find that gating adds minimal overhead — <2% latency. Evaluation covers standard few-shot benchmarks: HellaSwag, MMLU, GSM8K, HumanEval, C-Eval, and CMMLU, plus perplexity tests across domains including English, Chinese, code, math, law, and literature.

The experimental evaluation is organised to test the following questions in a systematic way. I also add the key takeaways beneath each research question, which apply equally to the MoE and dense FFN models tested by the authors:

  • The authors find that inserting gating at the output of the scaled dot product attention (SDPA) module (G1) or after the value map (G2) are the most effective placements.
  • Moreover, SDPA-output placement is more effective than G2. To explain this, the authors show that gating at the SDPA output induces very low, sparse gating scores, which is correlated with superior task performance.
  • Value gating (G2) produces larger, less sparse scores and performs worse than SDPA-output gating (G1). This suggests that sparsity is most useful when the gating depends on the current query, allowing the model to filter irrelevant context. The gate decides what to suppress or amplify based on what the current token needs.
Figure 8: Most gating scores sit near zero, revealing a sparse activation pattern (SDPA-output gating, elementwise application). The dashed line shows the average gate value (0.116). 📖 Source: adapted by author from the original paper: https://openreview.net/pdf?id=1b7whO4SfY
  • Their experiments with input-independent gating confirm this: it offers minor gains through added non-linearity but lacks the selective sparsity provided by query-dependent gating.
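To get a feel for what “sparse gating scores” means, a small diagnostic like the following computes the statistics summarised in Figure 8: the mean gate value and the fraction of gate entries close to zero. The gate logits here are random numbers drawn with a negative mean purely to mimic the sparse pattern; a real analysis would use the gate activations of a trained model.

import numpy as np

np.random.seed(3)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Stand-in gate logits; a trained SDPA-output gate tends to push these strongly negative
gate_logits = np.random.normal(loc=-2.5, scale=1.5, size=(1024, 8, 64))
gate_scores = sigmoid(gate_logits)        # [tokens, heads, head_dim]

print(f"mean gate value: {gate_scores.mean():.3f}")
print(f"fraction of gate entries < 0.1: {(gate_scores < 0.1).mean():.2%}")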

This finding is best explained through an example. Even though the K and V maps are technically input-dependent, they are not conditioned on the query. For instance, if the query token is looking for a specific piece of context, each value token only knows its own representation and not what the query is asking for. G2 gating bases its decision on the source tokens themselves, which may be irrelevant to the query’s needs. In contrast, G1 gating is computed from the query representation, so it is able to selectively suppress or amplify context based on what the query is actually trying to retrieve. This leads to sparser, cleaner gating and better performance for G1, whereas the Qwen team finds that G2 tends to produce larger, noisier scores and weaker results.

The results in the paper show that multiplicative SDPA gating is better than additive. When using a gating function in softmax attention, we are better off multiplying by its output rather than adding it.

The authors are unequivocal that gates should be head-specific rather than shared across heads. They find that when gates are shared, the model tends to produce larger, less selective gating values, which dilutes head-level specialisation and harms performance. In contrast, head-specific gating preserves each head’s unique role and consistently yields better results. Interestingly, the authors state that it is head specificity that has the largest effect on performance, with the granularity of the gating and the choice of activation function having a more minor impact.
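A sketch contrasting the two parameterisations: a head-specific gate produces a distinct set of coefficients per head, while a shared gate produces one set that is broadcast across all heads. As before, the projections are random stand-ins for learned parameters.

import numpy as np

np.random.seed(4)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

T, H, Dh = 4, 2, 4
D = H * Dh
X = np.random.randn(T, D)

# Head-specific: a distinct gate per head -> shape [T, H, Dh]
W_specific = np.random.randn(D, H * Dh)
gate_specific = sigmoid(X @ W_specific).reshape(T, H, Dh)

# Shared: a single gate broadcast across all heads -> shape [T, 1, Dh]
W_shared = np.random.randn(D, Dh)
gate_shared = sigmoid(X @ W_shared).reshape(T, 1, Dh)

print(gate_specific.shape, gate_shared.shape)   # (4, 2, 4) (4, 1, 4)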


Sigmoid outperforms SiLU when used in the best-performing configuration, namely elementwise gating applied to the SDPA output (G1). Replacing sigmoid with SiLU in this setup consistently leads to worse results, indicating that sigmoid is the more effective activation for gating.

Mitigating the Scourge of Attention Sinks

A key issue in LLMs is the attention sink phenomenon, where the first token absorbs most of the attention weight and overwhelms the rest of the sequence, leading to disproportionately large activations that can destabilise training and warp the model’s representations. Importantly, the Qwen team find that gating mitigates both effects, with the SDPA-output gating reducing the massive activations and the attention sink.

Figure 9: When the attention distribution collapses onto the first token, its value vector dominates the weighted sum, leading to an outsized activation while the rest of the sequence is effectively ignored. 📖 Source: image by author.
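The sink is easy to quantify: for each layer, average the attention mass assigned to the first token, which is exactly the statistic plotted in Figure 6. The sketch below does this on random attention logits, with an artificial bias added to the first column in deeper layers purely to mimic the sink; a real measurement would use the attention maps of a trained model.

import numpy as np

np.random.seed(5)

def softmax(logits, axis=-1):
    logits = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(logits)
    return exp / np.sum(exp, axis=axis, keepdims=True)

n_layers, n_heads, T = 4, 8, 128

for layer in range(n_layers):
    logits = np.random.randn(n_heads, T, T)
    logits[..., 0] += 3.0 * layer            # artificial bias toward the first token (illustrative)
    attn = softmax(logits, axis=-1)          # [heads, queries, keys]
    first_token_share = attn[..., 0].mean()  # average attention mass on the first token
    print(f"layer {layer}: first-token attention share = {first_token_share:.3f}")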

Extending Context Length by Changing the Base

To build long-context models, the Qwen team follow a three-stage strategy, detailed below. This training strategy gives an additional fascinating insight into how frontier labs train large-scale models and which tools they find effective:

  1. Expanding the RoPE base: First, they expand the RoPE base from 10k to 1M, which flattens the positional frequency curve and allows stable attention at much longer positions (see the short sketch after this list).
  2. Mid-Training: The Qwen team then continue training the model for an additional 80B tokens using 32k-length sequences. This continuation phase (sometimes called “mid-training”) lets the model adapt naturally to the new RoPE geometry without relearning everything.
  3. YaRN Extension: They then apply Yet another RoPE extensioN (YaRN) to expand the context length up to 128k, without further training.
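To see why raising the base flattens the frequency curve, the sketch below computes the standard RoPE per-slice frequencies for both bases and reports how many positions the slowest slice needs to complete one full rotation. The head dimension of 128 is just an illustrative choice.

import numpy as np

d = 128                                  # head dimension (illustrative)
i = np.arange(0, d, 2)                   # one rotation frequency per 2-D slice

for base in (10_000.0, 1_000_000.0):
    inv_freq = base ** (-i / d)          # standard RoPE inverse frequencies
    wavelength = 2 * np.pi / inv_freq    # positions needed for one full rotation
    print(f"base={base:>9,.0f}: slowest slice repeats every ~{wavelength.max():,.0f} positions")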

Let’s step back and briefly clarify RoPE and YaRN. Without injecting positional information, a Transformer’s attention mechanism has no sense of where tokens appear in a sequence. Like many techniques in AI, there is a simple, underlying geometric intuition to how they work that makes everything really clear. This is certainly the case for positional embeddings and RoPE. In a simple 2D analogy, you can imagine token embeddings as a cloud of points scattered in space, with no indication of their order or relative spacing in the original sequence.

RoPE encodes position by rotating each 2D slice of the query/key embedding by an angle proportional to the token’s position. The embedding is partitioned into many 2D sub-vectors, each assigned its own rotation frequency (θ₁, θ₂, …), so different slices rotate at different speeds. Low-frequency slices rotate slowly and capture broad, long-range positional structure, while high-frequency slices rotate rapidly and capture fine-grained, short-range relationships. Together, these multi-scale rotations allow attention to infer relative distances between tokens across both local and global contexts. It is a beautiful idea and implementation, and it is methods like these that make me grateful to be working in the field of AI.

Figure 10: Illustration of RoPE (Rotary Position Embedding). Each query/key vector is split into 2-D slices, and each slice is rotated by an angle proportional to the token’s position. The coloured patches on each slice show where the rotated 2-D subvector now lies after applying RoPE. The shading at the foot of each vector slice indicates that its location within the slice shifts, giving a new orientation determined by the slice’s rotation frequency. Because different slices rotate at different speeds, their coloured patches appear in different places, allowing RoPE to encode positional information across multiple frequency bands. 📖 Source: image by author based on Figure 1 in the original RoPE paper: https://arxiv.org/pdf/2104.09864

The key insight here is that the relative angle between two rotated embeddings naturally encodes their relative distance in the sequence, allowing the attention mechanism to infer ordering and spacing through geometry alone. This makes positional information a property of how queries and keys interact. For example, if two tokens are close in the sequence, their rotations will be similar, which equates to a large dot product, giving a higher attention weight. Conversely, when tokens are farther apart, their rotations differ more, so the dot product between their queries and keys changes in a position-dependent way, typically reducing attention to distant tokens unless the model has learned that long-range interactions are important.
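Here is a minimal sketch of that insight for a single 2-D slice with one rotation frequency: after rotating the query and key by their own positions’ angles, their dot product depends only on the relative offset between the positions. The specific vectors and the frequency value are arbitrary.

import numpy as np

def rotate(vec, angle):
    """Rotate a 2-D vector by the given angle (one RoPE slice)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ vec

theta = 0.3                                # rotation frequency of this slice (arbitrary)
q = np.array([1.0, 0.5])                   # a query slice
k = np.array([0.8, -0.2])                  # a key slice

# Every pair of positions separated by the same offset gives the same score
for m, n in [(3, 1), (7, 5), (12, 10)]:
    score = rotate(q, m * theta) @ rotate(k, n * theta)
    print(f"positions ({m:2d}, {n:2d}) -> score {score:.4f}")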

YaRN is a modern and versatile strategy for extending an LLM’s context window without retraining, and without causing the instabilities seen in naïvely extrapolated RoPE. RoPE begins to fail at long ranges because its rotation frequencies are calibrated to the training context window. Once positions exceed the training horizon, the higher-frequency dimensions wrap around and produce repeated phases, meaning tokens that are far apart can appear deceptively similar in positional space. This aliasing (or phase matching) destabilises attention and can cause it to collapse. YaRN fixes this by smoothly stretching the RoPE frequency spectrum, preserving the model’s short-range positional behaviour while gradually interpolating to lower frequencies for long-range positions. The result is a positional embedding scheme that behaves naturally up to 32k, 64k, and even 128k tokens, with far less distortion than older NTK or linear-scaling methods. Once their model was found to be stable at 32k, the Qwen team applied YaRN to further interpolate the RoPE frequencies, extending the effective context window to 128k.
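The sketch below illustrates the frequency-interpolation idea that YaRN builds on (the “NTK-by-parts” scheme): slices that completed many rotations within the original context window are left untouched, slices that barely rotated are fully interpolated, and a smooth ramp blends the two regimes. The thresholds, scale factor, and dimension here are illustrative rather than the exact values from the YaRN paper, and YaRN’s additional attention-temperature adjustment is omitted.

import numpy as np

def interpolated_inv_freq(d=128, base=10_000.0, orig_ctx=4_096, scale=8.0,
                          low=1.0, high=32.0):
    """Simplified, illustrative NTK-by-parts style interpolation of RoPE frequencies."""
    i = np.arange(0, d, 2)
    inv_freq = base ** (-i / d)                    # original per-slice frequencies
    rotations = orig_ctx * inv_freq / (2 * np.pi)  # full turns seen within the training window
    # ramp is 0 for slow slices (fully interpolate) and 1 for fast slices (keep as-is)
    ramp = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    return inv_freq / scale * (1.0 - ramp) + inv_freq * ramp

print(interpolated_inv_freq()[:4].round(4))        # the fastest slices are left (almost) unchanged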

In their evaluation, the Qwen team find that, within the trained 32k window, SDPA-gated models slightly outperform the baseline, indicating that gating improves attention dynamics without harming long-context stability, even under substantial positional scaling.

Moreover, with the YaRN extension and in the large-context regime, they find that the SDPA-output gated network significantly outperforms the baseline between 64k and 128k context lengths. The authors tie this performance increase to the mitigation of the attention sink phenomenon, which they surmise the baseline model relies upon to distribute attention scores across tokens. They hypothesise that the SDPA-output gated model is far less sensitive to the RoPE- and YaRN-induced changes to the position-encoding scheme and context length adjustments. Applying YaRN, which does not require further training, may disrupt these learned sink patterns, leading to the observed degradation in the base model’s performance. The SDPA-gated model, in contrast, does not depend on the attention sink to stabilise attention.

Coding Up our Own Gating Implementation

Before we conclude, it can be instructive to try to code up an implementation of an AI technique directly from the paper; it’s an excellent way to solidify the key concepts. To this end, we’ll walk through a simple Python implementation of gated scaled dot product attention.

We’ll first define our key hyperparameters, such as the sequence length (seq_len), the hidden dimension of the model (d_model), the number of heads (n_heads), and the head dimension (head_dim).

import numpy as np

np.random.seed(0)

# ---- Toy config ----
seq_len   = 4      # tokens
d_model   = 8      # model dim
n_heads   = 2
head_dim  = d_model // n_heads

Next, we define some (fake) token embeddings (simply generated randomly here), alongside our randomly initialised projection weights (not learned for the purposes of this simple example).

# Fake token embeddings
x = np.random.randn(seq_len, d_model)        # [T, D]

# ---- Projection weights ----
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)
W_o = np.random.randn(d_model, d_model)      # output projection

We then define the usual suspects, softmax and sigmoid, and also helpers to split the model dimension D across n_heads and to concatenate the heads back together:

def softmax(logits, axis=-1):
    logits = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(logits)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ---- Helper: split/concat heads ----
def split_heads(t):   # [T, D] -> [H, T, Dh]
    return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

def concat_heads(t):  # [H, T, Dh] -> [T, D]
    return t.transpose(1, 0, 2).reshape(seq_len, d_model)

Now we can dive into the core gating implementation and see exactly how it works in practice. In all of the examples below, we use random tensors as stand-ins for the learned gate parameters that a real model would train end-to-end.

# ============================================================
# Forward pass
# ============================================================
def attention_with_gates(x):
    # 1) Linear projections
    Q = x @ W_q   # [T, D]
    K = x @ W_k
    V = x @ W_v

    # ----- G4: gate on Queries (after W_q) -----
    G4 = sigmoid(np.random.randn(*Q.shape))
    Q = G4 * Q

    # ----- G3: gate on Keys (after W_k) -----
    G3 = sigmoid(np.random.randn(*K.shape))
    K = G3 * K

    # ----- G2: gate on Values (after W_v) -----
    G2 = sigmoid(np.random.randn(*V.shape))
    V = G2 * V

    # 2) Split into heads
    Qh = split_heads(Q)      # [H, T, Dh]
    Kh = split_heads(K)
    Vh = split_heads(V)

    # 3) Scaled Dot Product Attention per head
    scale = np.sqrt(head_dim)
    scores = Qh @ Kh.transpose(0, 2, 1) / scale   # [H, T, T]
    attn   = softmax(scores, axis=-1)
    head_out = attn @ Vh                          # [H, T, Dh]

    # 4) Concat heads
    multi_head_out = concat_heads(head_out)       # [T, D]

    # ----- G1: gate on concatenated heads (before W_o) -----
    G1 = sigmoid(np.random.randn(*multi_head_out.shape))
    multi_head_out = G1 * multi_head_out

    # 5) Output projection
    y = multi_head_out @ W_o                      # [T, D]

    # ----- G5: gate on final dense output -----
    G5 = sigmoid(np.random.randn(*y.shape))
    y = G5 * y

    return {
        "Q": Q, "K": K, "V": V,
        "G2": G2, "G3": G3, "G4": G4,
        "multi_head_out": multi_head_out,
        "G1": G1, "final_out": y, "G5": G5,
    }

out = attention_with_gates(x)
print("Final output shape:", out["final_out"].shape)

The code above inserts gating modules at five locations, mirroring the positions studied in the Qwen paper: the query map (G4), key map (G3), value map (G2), the output of the SDPA module (G1), and the final dense output (G5). Although the Qwen team recommend using only the G1 configuration in practice — placing a single gate at the SDPA output — we include all five here for illustration. The goal is to show that gating is simply a lightweight modulation mechanism applied to different pathways within the attention block. Hopefully this makes the overall concept feel more concrete and intuitive.

Conclusions & Final Thoughts

In this article we took a whistle-stop tour through the concept of gating for softmax attention in LLMs and covered the key lessons learnt from the NeurIPS 2025 paper, Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free.

The Qwen paper is an AI tour-de-force and a treasure trove of practical findings that are immediately applicable to improving most modern LLM architectures. The Qwen team have produced an exhaustive study into the configuration of gating for LLM softmax attention, throwing light on this important component. There is little doubt in my mind that most, if not all, frontier AI labs will be furiously scrambling to update their architectures in line with the guidance coming out of the Qwen paper, one of this year’s NeurIPS best papers, a highly coveted achievement in the field. As we speak, there are probably thousands of GPUs crunching away at training LLMs with gating module configurations inspired by the clear lessons in the Qwen paper.

Kudos to the Qwen team for making this knowledge public for the benefit of the whole community. The original code can be found here if you are interested in incorporating the Qwen team’s implementation into your own models or driving their research further (every great research contribution leads to more questions; there are turtles all the way down!) to address unanswered questions such as what internal dynamics change when a gate is added, and why this leads to the observed robustness across positional regimes.

📚 Further Learning
