One little trick can bring enhanced training stability, the use of larger learning rates, and improved scaling properties.
The Enduring Popularity of AI’s Most Prestigious Conference
By all accounts, this year’s NeurIPS, the world’s premier AI conference, was one of the largest and most lively in its history. This year’s conference was held at the San Diego Convention Center in San Diego, California, from Sunday, November 30, 2025 through Sunday, December 7, 2025. As a sense of the scale, NeurIPS 2025 received 21,575 valid paper submissions. From 2023 (~12.3k) to 2025 (~21.6k), that is a roughly 75–80% jump over two years, or about 30% per year on average. In-person attendance has been equally impressive, often reaching tens of thousands of people and frequently capped by venue size, with past locations operating near the upper limit of what the physical venue can handle. Reinforcement learning dominated the conversation this year, with the field shifting from scaling models to tuning them for specific use cases. Industry momentum appeared to centre strongly around Google, with Google DeepMind in particular surging and pushing fresh research directions rather than simply “larger LLMs”. The scale and intensity of the conference is a reflection of both the pace of AI progress and the cultural peak of the modern AI gold rush.
This year, the exhibitor hall was packed, with major industry players from technology, finance, and AI infrastructure all setting out their stalls to show off their latest breakthroughs, highlight open roles to talented delegates, and hand out the ever-coveted branded “swag” (pens, T-shirts, water bottles, and more). The especially fortunate conference goer might even receive an invitation to company-hosted “after-parties”, which have become a staple of the NeurIPS experience and a perfect opportunity to decompress, shed the information overload, and network, from Konwinski’s Laude Lounge to the invite-only Model Ship cruise filled with top researchers. Diamond sponsors this year included Ant Group, Google, Apple, ByteDance, Tesla, and Microsoft. The buy-side presence was particularly strong, with leading firms such as Citadel, Citadel Securities, Hudson River Trading, Jane Street, Jump Trading, and The D. E. Shaw Group represented. On the infrastructure and tooling side, Lambda showcased its GPU cloud platform, while firms like Ollama and Poolside highlighted advances in local LLM runtimes and frontier model development.

The NeurIPS Expo showcased many equally fascinating applied-AI demos. Highlights included a demonstration of how autonomous agents can behave reliably across different LLM backends; a multimodal forensic search system capable of scanning large video corpora with AI; an AI-accelerated LiDAR processing demo that showed how heterogeneous compute can dramatically speed up 3D perception; and LLM-driven data-engineering workflows that automate ingestion, transformation, and quality checks. It’s clear from the Expo that AI is heading full steam ahead toward agents, multimodal intelligence, accelerated perception, and end-to-end automated data systems.

The best paper awards ceremony arguably represents a pinnacle of the conference and a celebration of its most impactful work. The best paper awards are given to exceptionally innovative and impactful research that is likely to have an immediate and long-lasting effect on the field of AI. It goes without saying that a best paper award is a significant professional accomplishment in a highly competitive and fast-paced research field. It is all the more impressive when we bear in mind the huge volume of papers submitted to NeurIPS. Standing out in that crowd is exceptionally difficult.
The Anatomy of a NeurIPS Best Paper: Exploring the Benefits of Gated Attention in LLMs
Gating Explained: How a Tiny Valve Controls Big Neural Models
In the rest of this article, we take a deep dive into one of this year’s best papers from NeurIPS: Gated Attention for Large Language Models, by the Qwen team. This dense paper title packs a lot of information into a very small footprint, so, in what follows, I’ll unpack the paper piece by piece with the goal of giving practicing Data Scientists a clear mental model of attention gating and concrete takeaways they can immediately apply to their own work.
First, we start with an understanding of the gate, the core module under study in the paper. What exactly is a gate in the context of neural networks? A gate is nothing more than a mechanism, a computational unit that takes the output of an existing transformation in the network and regulates it by amplifying or suppressing parts of the input signal.
Instead of allowing every activation to flow unchanged through the network, a gate introduces a learned control pathway that determines how much of the transformed information should pass forward.
Operationally speaking, a gate produces a vector of coefficients, typically using a sigmoid, softmax, or occasionally a ReLU-based squashing function, and these coefficients are applied multiplicatively to another vector of activations originating from an upstream computation. This has the effect of regulating how much of that input makes its way downstream, a bit like twisting a faucet handle back and forth to control the amount of water passing through. That’s all there is to it: now you understand gating, what it is, and how it’s applied.

Because the gating weights are typically learnable parameters, the network can discover during training how to modulate internal signals in ways that minimise the overall network loss. In this way, the gate becomes a dynamic filter, adjusting the internal information flow based on the input context, the model’s continually evolving parameters, and the gradients received during optimisation.
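To make this concrete, here is a minimal NumPy sketch of a gate (my own illustration, not the paper’s implementation): a projection matrix produces per-element coefficients via a sigmoid, and those coefficients multiplicatively scale an upstream activation. The names toy_gate and W_g are invented for this example, and the weights are random rather than learned.
import numpy as np

def toy_gate(h, W_g):
    # h:   [T, D] upstream activations
    # W_g: [D, D] gate projection (learned in a real model; random here)
    g = 1.0 / (1.0 + np.exp(-(h @ W_g)))  # gating coefficients in (0, 1)
    return g * h                          # elementwise multiplicative modulation

h = np.random.randn(4, 8)       # 4 tokens, hidden size 8
W_g = np.random.randn(8, 8)
out = toy_gate(h, W_g)          # same shape as h, selectively damped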
A Brief Tour Down Memory Lane: The Long History of Gating
It’s worth taking in a little bit of the history of gating before we move to the main contributions of the paper. Gating is really nothing new, and the Qwen paper didn’t invent this standard component; their contribution lies elsewhere and will be covered shortly. In fact, gating has been a core mechanism in deep architectures for decades. For example, Long Short-Term Memory (LSTM) networks, introduced in 1997, pioneered the systematic use of multiplicative gates (the input, forget, and output gates) to control the flow of information through time. These gates act as learned filters that determine which signals should be written to memory, which should be retained, and which should be exposed to downstream layers. By controlling information flow in this fine-grained way, LSTMs effectively mitigated the multiplicative explosion or vanishing of gradients that hampered early recurrent networks, enabling stable long-term credit assignment during backpropagation through time (BPTT).
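For readers who like to see the mechanics, here is a deliberately simplified NumPy sketch of a single LSTM step showing the three multiplicative gates at work. The parameter shapes, stacking convention, and initialisation are illustrative assumptions, not a particular library’s API.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b stack the parameters for the input (i), forget (f) and
    # output (o) gates plus the candidate cell update (g).
    z = x_t @ W + h_prev @ U + b                 # [4 * H]
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o) # gate values in (0, 1)
    c_t = f * c_prev + i * np.tanh(g)            # forget old memory, admit new
    h_t = o * np.tanh(c_t)                       # expose gated memory downstream
    return h_t, c_t

H, X = 16, 8                                     # hidden size, input size
h, c = np.zeros(H), np.zeros(H)
W = np.random.randn(X, 4 * H) * 0.1
U = np.random.randn(H, 4 * H) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(np.random.randn(X), h, c, W, U, b)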
Applying Gating to the LLM Attention Block
The Qwen team’s contribution focuses on applying gating directly to the transformer’s attention module, a configuration commonly referred to as gated attention. In this article, I won’t spend too much time on the inner workings of attention, as there are many resources out there to study it, including this recent course by the DeepLearning.ai team and this prior article I’ve written on the topic. In a super brief summary, attention is the core mechanism in the transformer architecture that lets each input sequence token gather contextual information from any other token in the sequence, enabling tokens to interact during training and inference, sharing information no matter how far apart they appear in the input. The computational graph for the popular scaled dot product attention (SDPA) is shown below:
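To complement the diagram, here is a minimal NumPy sketch of single-head SDPA (no masking, no dropout), just to pin down the shapes and operations; it is an illustration under simplified assumptions rather than production code.
import numpy as np

def sdpa(Q, K, V):
    # Minimal single-head scaled dot-product attention (no masking).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # [T, T] query-key similarities
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)     # row-wise softmax
    return weights @ V                                     # each token mixes in context

T, d = 4, 8                                                # 4 tokens, head dim 8
Q, K, V = np.random.randn(3, T, d)
out = sdpa(Q, K, V)                                        # [T, d]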

Although attention gating has been used for many years, the Qwen team highlight a surprising gap in our body of knowledge: as AI practitioners, we’ve broadly applied attention gating without truly understanding why it works or how it shapes learning dynamics. The Qwen team’s work shows that we’ve been benefiting from this module for a long time without a rigorous, systematic account of its effectiveness or the conditions under which it performs best. The Qwen paper does just that and plugs the gap, with the NeurIPS best paper selection committee citation stating:
“This paper represents a considerable amount of work that is possible only with access to industrial scale computing resources, and the authors’ sharing of the results of their work, which will advance the community’s understanding of attention in large language models, is highly commendable, especially in an environment where there has been a move away from open sharing of scientific results around LLMs.”
NeurIPS 2025, Best Paper Selection Committee statement.
Given the sheer amount of dollars flowing in and the large industrial interest in AI these days, it’s very nice to see that the Qwen team decided to deliver this rich batch of lessons learnt to the broader community, rather than keep these informational nuggets behind closed doors. In doing so, the Qwen team have delivered a beautiful paper filled with practical lessons and clear explanations of the mechanics behind attention gating, all distilled in a way that Data Scientists can immediately take and apply in real-world models.
The Qwen team’s systematic study makes several concrete contributions to knowledge that can be easily and immediately applied to improve many standard LLM architectures:
- Positioning of Gating: Placing a gating module right after the attention (SDPA) output, or after the value projection, enhances LLM performance through the introduction of a non-linearity and the inducement of input-dependent sparsity. They also study key parameterisations of the gating module, such as the type of activation function (SiLU or sigmoid) and the combination function (multiplication or addition).
- Attention Sink and Massive Activations: Gating can radically curtail the attention-sink phenomenon, where most if not all of the attention in a layer concentrates on a single token (I cover this phenomenon in detail later). By suppressing these extreme activations, the model becomes far more numerically stable during optimisation, eliminating the loss spikes that typically appear in deep or long training runs. This increased stability allows the model to tolerate substantially higher learning rates, unlocking better scaling without the divergence seen in ungated transformers.
- Context Length Extension: Gating also facilitates context-length extension without requiring full model retraining. In practice, this means a model can be trained with a relatively short context window and later scaled to much longer sequences by retrospectively adjusting components such as the RoPE base. This adjustment effectively reparameterises the positional-embedding geometry, allowing the model to operate at extended context lengths (e.g., up to 32k tokens) while preserving stability and without degrading previously learned representations.

Leveraging Gating to Improve Performance, Learning Stability and Attention Mechanics
The Qwen team focus their investigation on how gating interacts with the LLM’s softmax attention module, aiming to understand its influence on the module’s learning dynamics and to identify the optimal placement of the gate, for example after the Q, K, or V projections, after the attention computation, or after the dense layers. The setup of this study is illustrated in the diagram below:

The authors evaluate both mixture-of-experts (MoE; 15B total parameters, 2.54B active) and dense (1.7B) feed-forward network (FFN) variants. The MoE variant uses 128 experts, top-8 softmax gating, and fine-grained experts. Models are trained on subsets of a 4T-token high-quality corpus covering multilingual, math, and general-knowledge data, with a 4096 sequence length. Training uses AdamW defaults, with specific learning-rate and batch-size details provided per experiment. They find that gating adds minimal overhead, under 2% latency. Evaluation covers standard few-shot benchmarks (HellaSwag, MMLU, GSM8K, HumanEval, C-Eval, and CMMLU), plus perplexity tests across domains including English, Chinese, code, math, law, and literature.
The experimental evaluation is organised to test the following questions in a systematic way. I also add the key takeaways beneath each research question; these apply equally to the MoE and dense FFN models tested by the authors:
- The authors find that inserting gating at the output of the scaled dot product attention (SDPA) module (G1) or after the value map (G2) are the most effective placements.
- Moreover, gating at the SDPA output is more effective than at G2. To explain this, the authors show that gating at the SDPA output induces very low, sparse gating scores, which correlate with superior task performance.
- Value gating (G2) produces higher, less sparse scores and performs worse than SDPA-output gating (G1). This suggests that sparsity is most useful when the gating depends on the current query, allowing the model to filter irrelevant context. The gate decides what to suppress or amplify based on what the current token needs.

- Their experiments with input-independent gating confirm this: it offers minor gains through added non-linearity but lacks the selective sparsity provided by query-dependent gating.
This finding is easier to see with the roles of the maps spelled out. Even though the K and V maps are technically input-dependent, they are not conditioned on the current query: each value token only knows its own representation, not what the query is asking for. G2 gating bases its decision on the source tokens themselves, which may be irrelevant to the query’s needs. In contrast, G1 gating is computed from the query-side representation, so it is able to selectively suppress or amplify context based on what the query is actually trying to retrieve. This leads to sparser, cleaner gating and better performance for G1, whereas the Qwen team find that G2 tends to produce higher, noisier scores and weaker results.
The results in the paper show that multiplicative SDPA gating is better than additive: when using a gating function in softmax attention, we are better off multiplying its output rather than adding it.
The authors are unequivocal that gates should be head-specific rather than shared across heads. They find that when gates are shared, the model tends to produce larger, less selective gating values, which dilutes head-level specialisation and harms performance. In contrast, head-specific gating preserves each head’s unique role and consistently yields better results. Interestingly, the authors state that head specificity has the largest effect on performance, with the granularity of the gating and the activation-function choice having a more minor impact.
Sigmoid outperforms SiLU when used in the best-performing configuration, namely elementwise gating applied to the SDPA output (G1). Replacing sigmoid with SiLU in this setup consistently leads to worse results, indicating that sigmoid is the more effective activation for gating.
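Pulling these findings together, here is a minimal sketch of the recommended configuration: an elementwise, head-specific, sigmoid gate that is multiplied onto the SDPA output. Computing the gate from the attention block’s input hidden state (the same signal that drives the queries) is an assumption of this sketch, and the random weights stand in for learned parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, H, Dh = 4, 2, 4                  # tokens, heads, head dim
D = H * Dh
X = np.random.randn(T, D)           # attention-block input (query-side signal)
W_g = np.random.randn(D, D)         # gate projection (learned in a real model)

sdpa_out = np.random.randn(H, T, Dh)              # stand-in for per-head SDPA output

# Elementwise, head-specific, query-conditioned sigmoid gate on the SDPA output (G1)
gate = sigmoid(X @ W_g)                           # [T, D], depends on the current token
gate = gate.reshape(T, H, Dh).transpose(1, 0, 2)  # [H, T, Dh]: a gate per head and channel
gated_out = gate * sdpa_out                       # multiplicative, not additive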
Mitigating the Scourge of Attention Sinks
A key issue in LLMs is the attention sink, where the first token absorbs most of the attention weight and overwhelms the rest of the sequence, leading to disproportionately large activations that can destabilise training and warp the model’s representations. Importantly, the Qwen team show that gating directly addresses this, with SDPA-output gating reducing both the massive activations and the attention sink.
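One way to build intuition here: softmax forces every query’s attention weights to sum to one, so when no token is truly relevant the mass has to land somewhere, often on the first token; an output gate gives the model another escape hatch, because it can simply scale the head’s contribution towards zero. A toy illustration of that point (my own, not taken from the paper):
import numpy as np

scores = np.array([0.0, 0.1, -0.2, 0.05])        # a query that finds nothing very useful
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.sum())                              # 1.0: softmax must put the mass somewhere

head_out = weights @ np.random.randn(4, 8)        # the head's contextual output, shape [8]
gate = 0.03                                       # a near-zero sigmoid gate value
print(np.linalg.norm(head_out), np.linalg.norm(gate * head_out))  # contribution suppressed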

Extending Context Length by Changing the RoPE Base
To build long-context models, the Qwen team follow a three-stage strategy, detailed below. This training recipe gives an additional fascinating insight into how frontier labs train large-scale models and what tools they find effective:
- Expanding the RoPE base: First, they expand the RoPE base from 10k to 1M, which flattens the positional frequency curve and allows stable attention at much longer positions (a small sketch of what this changes follows this list).
- Mid-training: The Qwen team then continue training the model for an additional 80B tokens using 32k-length sequences. This continuation phase (sometimes called “mid-training”) lets the model adapt naturally to the new RoPE geometry without relearning everything.
- YaRN extension: They then apply Yet another RoPE extensioN (YaRN) to expand the context length up to 128k, without further training.
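To see what “expanding the RoPE base” actually changes, here is a small sketch of the standard RoPE inverse-frequency schedule under the two bases mentioned above; the function name and the exact values used in Qwen’s training setup are assumptions for illustration.
import numpy as np

def rope_inv_freq(head_dim, base):
    # Standard RoPE frequencies: one rotation speed per 2D slice of the head.
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

head_dim = 128
f_10k = rope_inv_freq(head_dim, 10_000)      # original base
f_1m = rope_inv_freq(head_dim, 1_000_000)    # expanded base: slower rotations overall

# A lower frequency means each position advances the rotation angle less,
# so positional phases stay distinguishable over much longer sequences.
print(f_10k.min(), f_1m.min())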
Let’s step back and briefly clarify positional embeddings and RoPE. Without injecting positional information, a Transformer’s attention mechanism has no sense of where tokens appear in a sequence. Like many techniques in AI, there is a simple underlying geometric intuition to how they work that makes everything really clear, and that is certainly the case for positional embeddings and RoPE. In a simple 2D analogy, you can imagine token embeddings as a cloud of points scattered in space, with no indication of their order or relative spacing in the original sequence.
RoPE encodes position by rotating each 2D slice of the query/key embedding by an angle proportional to the token’s position. The embedding is partitioned into many 2D sub-vectors, each assigned its own rotation frequency (θ₁, θ₂, …), so different slices rotate at different speeds. Low-frequency slices rotate slowly and capture broad, long-range positional structure, while high-frequency slices rotate rapidly and capture fine-grained, short-range relationships. Together, these multi-scale rotations allow attention to infer relative distances between tokens across both local and global contexts. This is a beautiful idea and implementation, and it’s methods like these that make me grateful to be working in the field of AI.

The key insight here is that the angle between two rotated embeddings naturally encodes their relative distance in the sequence, allowing the attention mechanism to infer ordering and spacing through geometry alone. This makes positional information a property of how queries and keys interact. For example, if two tokens are close in the sequence, their rotations will be similar, which equates to a large dot product and a higher attention weight. Conversely, when tokens are farther apart, their rotations differ more, so the dot product between their queries and keys changes in a position-dependent way, typically reducing attention to distant tokens unless the model has learned that long-range interactions are essential.
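A tiny numerical illustration of that geometric point: rotate a single 2D query/key slice by an angle proportional to position, and the dot product depends only on the relative offset between positions. The helper below is my own illustrative construction.
import numpy as np

def rotate(vec, pos, theta):
    # Rotate a 2D slice by an angle proportional to the token position.
    angle = pos * theta
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ vec

q = np.array([1.0, 0.3])
k = np.array([0.8, -0.5])
theta = 0.1

# The dot product of rotated q and k depends only on the relative offset.
print(rotate(q, 10, theta) @ rotate(k, 8, theta))     # offset 2
print(rotate(q, 110, theta) @ rotate(k, 108, theta))  # same offset 2 -> same value
print(rotate(q, 10, theta) @ rotate(k, 2, theta))     # larger offset -> different value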
YaRN is a modern and flexible strategy to extend an LLM’s context window without retraining, and without causing the instabilities seen in naïvely extrapolated RoPE. RoPE begins to fail at long ranges because its fast-rotating dimensions wrap around many times beyond the positions seen in training. Once positions exceed the training horizon, those dimensions produce repeated phases, meaning tokens that are far apart can appear deceptively similar in positional space. This aliasing (or phase matching) destabilises attention and can cause it to collapse. YaRN fixes this by smoothly stretching the RoPE frequency spectrum, preserving the model’s short-range positional behaviour while gradually interpolating to lower frequencies for long-range positions. The result is a positional-embedding scheme that behaves naturally up to 32k, 64k, or even 128k tokens, with far less distortion than older NTK or linear-scaling methods. Once their model was found to be stable at 32k, the Qwen team applied YaRN to further interpolate the RoPE frequencies, extending the effective context window to 128k.
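As a loose conceptual sketch only (not the exact YaRN formulation, which also includes an attention-temperature correction), the core idea is to decide per dimension how much to interpolate: dimensions that already complete many rotations within the trained window keep their frequency, slow dimensions are stretched, and those in between are blended. The function and thresholds below are illustrative assumptions.
import numpy as np

def yarn_like_freqs(inv_freq, orig_ctx, scale, low_rot=1.0, high_rot=32.0):
    # Simplified "NTK-by-parts"-style blend between original and interpolated frequencies.
    rotations = orig_ctx * inv_freq / (2 * np.pi)              # full turns seen in training
    keep = np.clip((rotations - low_rot) / (high_rot - low_rot), 0.0, 1.0)
    return keep * inv_freq + (1.0 - keep) * (inv_freq / scale) # blend per dimension

inv_freq = 1.0 / (1_000_000 ** (np.arange(0, 128, 2) / 128))
extended = yarn_like_freqs(inv_freq, orig_ctx=32_768, scale=4)  # 32k -> 128k target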
In their evaluation, the Qwen team find that within the trained 32k window, SDPA-gated models slightly outperform the baseline, indicating that gating improves attention dynamics without harming long-context stability, even under substantial positional scaling.
Moreover, with the YaRN extension and in the long-context regime, they find that the SDPA-output gated network significantly outperforms the baseline between 64k and 128k context lengths. The authors tie this performance increase to the mitigation of the attention-sink phenomenon, which they surmise the baseline model relies upon to distribute attention scores across tokens. They hypothesise that the SDPA-output gated model is far less sensitive to the RoPE- and YaRN-induced changes to the position-encoding scheme and context-length adjustments. Applying YaRN, which doesn’t require further training, may disrupt these learned sink patterns, leading to the observed degradation in the base model’s performance. The SDPA-gated model, in contrast, doesn’t depend on the attention sink to stabilise attention.
Coding Up our Own Gating Implementation
Before we conclude, it can be instructive to code up an AI technique directly from a paper; it’s a great way to solidify the key concepts. To this end, we’ll walk through a simple Python implementation of scaled dot product (softmax) attention with gating.
We’ll first define our key hyperparameters, such as the sequence length (seq_len), the hidden dimension of the model (d_model), the number of heads (n_heads), and the head dimension (head_dim).
import numpy as np
np.random.seed(0)
# ---- Toy config ----
seq_len = 4 # tokens
d_model = 8 # model dim
n_heads = 2
head_dim = d_model // n_heads
We next define some fake token embeddings (simply generated randomly here), alongside our randomly initialised projection weights (not learned for the purposes of this simple example).
# Fake token embeddings
x = np.random.randn(seq_len, d_model) # [T, D]
# ---- Projection weights ----
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)
W_o = np.random.randn(d_model, d_model) # output projection
We then define the usual suspects, softmax and sigmoid, along with helpers to split and re-concatenate the hidden dimension D across n_heads:
def softmax(logits, axis=-1):
    logits = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(logits)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ---- Helpers: split/concat heads ----
def split_heads(t):  # [T, D] -> [H, T, Dh]
    return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

def concat_heads(t):  # [H, T, Dh] -> [T, D]
    return t.transpose(1, 0, 2).reshape(seq_len, d_model)
Now we can dive into the core gating implementation and see exactly how it works in practice. In all of the gates below, we use random tensors as stand-ins for the learned gate values that a real model would produce from parameters trained end-to-end.
# ============================================================
# Forward pass
# ============================================================
def attention_with_gates(x):
    # 1) Linear projections
    Q = x @ W_q  # [T, D]
    K = x @ W_k
    V = x @ W_v

    # ----- G4: gate on Queries (after W_q) -----
    G4 = sigmoid(np.random.randn(*Q.shape))
    Q = G4 * Q

    # ----- G3: gate on Keys (after W_k) -----
    G3 = sigmoid(np.random.randn(*K.shape))
    K = G3 * K

    # ----- G2: gate on Values (after W_v) -----
    G2 = sigmoid(np.random.randn(*V.shape))
    V = G2 * V

    # 2) Split into heads
    Qh = split_heads(Q)  # [H, T, Dh]
    Kh = split_heads(K)
    Vh = split_heads(V)

    # 3) Scaled dot product attention per head
    scale = np.sqrt(head_dim)
    scores = Qh @ Kh.transpose(0, 2, 1) / scale  # [H, T, T]
    attn = softmax(scores, axis=-1)
    head_out = attn @ Vh  # [H, T, Dh]

    # 4) Concat heads
    multi_head_out = concat_heads(head_out)  # [T, D]

    # ----- G1: gate on the SDPA output (before W_o) -----
    G1 = sigmoid(np.random.randn(*multi_head_out.shape))
    multi_head_out = G1 * multi_head_out

    # 5) Output projection
    y = multi_head_out @ W_o  # [T, D]

    # ----- G5: gate on the final dense output -----
    G5 = sigmoid(np.random.randn(*y.shape))
    y = G5 * y

    return {
        "Q": Q, "K": K, "V": V,
        "G2": G2, "G3": G3, "G4": G4,
        "multi_head_out": multi_head_out,
        "G1": G1, "final_out": y, "G5": G5,
    }

out = attention_with_gates(x)
print("Final output shape:", out["final_out"].shape)
The code above inserts gating modules at five locations, replicating the positions studied in the Qwen paper: the query map (G4), key map (G3), value map (G2), the output of the SDPA module (G1), and the final dense output (G5). Although the Qwen team recommend using only the G1 configuration in practice, placing a single gate on the SDPA output, we include all five here for illustration. The goal is to show that gating is simply a lightweight modulation mechanism applied to different pathways within the attention block. Hopefully this makes the overall concept feel more concrete and intuitive.
Conclusions & Final Thoughts
In this article we took a whistle-stop tour of the concept of gating for softmax attention in LLMs and covered the key lessons learnt from the NeurIPS 2025 paper, Gated Attention for Large Language Models.
The Qwen paper is an AI tour de force and a treasure trove of practical findings that are immediately applicable to improving most modern LLM architectures. The Qwen team have produced an exhaustive study into the configuration of gating for LLM softmax attention, throwing light on this important component. There’s little doubt in my mind that most, if not all, frontier AI labs will be furiously scrambling to update their architectures in line with the guidance coming out of the Qwen paper, one of this year’s NeurIPS best papers, a highly coveted achievement in the field. As we speak, there are probably thousands of GPUs crunching away at training LLMs with gating-module configurations inspired by the clear lessons in the Qwen paper.
Kudos to the Qwen team for making this work public for the benefit of the entire community. The original code can be found here if you are interested in incorporating the Qwen team’s implementation into your own models, or in driving their research further (every great research contribution leads to more questions; it’s turtles all the way down!) to address open questions such as what internal dynamics change when a gate is added, and why this leads to the observed robustness across positional regimes.
📚 Further Learning
- Alex Heath (2025) — Google’s Rise, RL Mania, and a Party Boat — A first-hand roundup of NeurIPS 2025 takeaways, highlighting the surge of reinforcement learning, Google/DeepMind’s momentum, and the increasingly extravagant conference party culture. Published in Sources, a newsletter analysing AI industry trends.
- Jianlin Su et al. (2024) — RoFormer: Enhanced Transformer with Rotary Position Embedding — The unique RoPE paper that introduced rotary position embeddings, now used universally in LLMs. It explains how rotational encoding preserves relative position information and clarifies why changing the RoPE base affects long-range attention behavior.
- Bowen Peng et al. (2023) — YaRN: Efficient Context Window Extension of Large Language Models — The core reference behind YaRN interpolation. This work shows how adjusting RoPE frequencies through smooth extrapolation can extend models to 128k+ contexts.
- Zihan Qiu et al. (2025) — Gated Attention for Large Language Models: Non-Linearity, Sparsity, and Attention-Sink-Free — The definitive study on gating in softmax attention, reviewed in this article. It introduces SDPA-output gating (G1), explains why sigmoid gating introduces non-linearity and sparsity, shows how gating eliminates attention sinks, and demonstrates superior context-length generalization under RoPE/YaRN modifications.
- Guangxuan Xiao et al. (2023) — StreamingLLM: Efficient Streaming Language Models with Attention Sinks — The paper that formally identifies the “attention sink” phenomenon: early tokens attracting disproportionately large attention weights. It explains why baseline transformers often collapse attention onto the first token.
- Mingjie Sun et al. (2024) — Massive Activations in Large Language Models — Shows that extremely large hidden activations in specific layers propagate through the residual stream and cause pathological attention distributions. The Qwen paper empirically validates this link and demonstrates how gating suppresses massive activations.
- Noam Shazeer (2020) — GLU Variants Improve Transformer — The foundational reference for gating inside feedforward blocks (SwiGLU, GEGLU). Modern LLMs heavily depend on this family of gated FFN activations; the Qwen paper connects this lineage to gating inside attention itself.
- Hochreiter & Schmidhuber (1997) — LSTM: Long Short-Term Memory –The earliest and most influential gating architecture. LSTMs introduce input, output, and forget gates for selective information passage — the conceptual precursor to all modern gating strategies, including SDPA-output gating in transformers.
- Xiangming Gu et al. (2024) — When Attention Sink Emerges in Language Models — Provides a contemporary empirical treatment of attention sinks, key biases, and non-informative early-token dominance.
- Dong et al. (2025) — LongRed: Mitigating Short-Text Degradation of Long-Context LLMs — Offers a mathematical derivation (referenced in Qwen) showing how modifying RoPE changes attention distributions and hidden-state geometry.
