1. Introduction
Over the past two years, we have witnessed a race for sequence length in AI language models. Context windows grew steadily from 4k tokens to 32k, then 128k, up to the 1-million-token window first promised by models like Gemini 1.5 Pro. The promise is alluring: dump entire codebases or novels into the model and let it reason across the whole thing.
But there is a hidden cost to this virtually "infinite" context length, one that is rarely mentioned: memory.
In a standard Transformer architecture, memorising and reasoning across the entire prompt isn't free. As the input sequence grows, the model must store the Key and Value (KV) states for every single token in order to calculate attention scores. For a 1-million-token sequence, this KV cache can quickly snowball to hundreds of gigabytes, which in turn requires large clusters of GPUs, all just to hold the conversation in memory.
2. The Motivation
In the standard attention mechanism (Vaswani et al., 2017), every new token that the model generates must "look back" at every previous token in the prompt to fully understand the context. To avoid recomputing this work at every generation step, the model caches the Key (K) and Value (V) vectors of previous tokens in GPU VRAM. This is known as the KV cache.
The Linear Growth Trap
While caching the Key and Value vectors is time-efficient (we don't have to recompute the past for each new token), it has an enormous memory footprint, one that grows linearly with the input sequence length.
To put this into perspective: storing the KV cache for a standard 500B-parameter model with a context of just 20,000 tokens requires about 126 GB of memory. Scale that to contemporary LLMs with 1T+ parameters, serving millions of users at any given time, and the total memory footprint becomes astronomically large.
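To see roughly where a number like 126 GB comes from, here is a back-of-the-envelope sketch in Python. The layer count, hidden width, and fp16 precision below are illustrative assumptions, not the exact configuration of any particular 500B model:

```python
# Back-of-the-envelope KV-cache size for a dense decoder-only Transformer.
# All model dimensions below are illustrative assumptions.
num_layers = 96            # decoder layers
hidden_dim = 16_384        # per-layer K (and V) width, e.g. 128 heads x 128 dims
bytes_per_value = 2        # fp16 / bf16
seq_len = 20_000           # tokens currently held in the cache

# Factor of 2: we store both a Key and a Value vector per token, per layer.
bytes_per_token = 2 * num_layers * hidden_dim * bytes_per_value
total_gb = bytes_per_token * seq_len / 1e9
print(f"~{total_gb:.0f} GB")   # ~126 GB, and it grows linearly with seq_len
```

The key point is the last comment: the footprint is linear in `seq_len`, so a 1-million-token context would need roughly 50 times more memory under the same assumptions.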
Historically, we’ve had two ways to handle sequential data, neither of which is ideal:
- RNNs: Recurrent Neural Networks process the input prompt token by token, updating a single, fixed-size hidden state. While this greatly reduces memory requirements, they struggle to retain information and details over long prompts, so the model has often forgotten the start of the input sequence by the time it reaches the end.
- Transformers: Transformers, unlike RNNs, don't suffer from this problem: they remember everything by keeping the entire history of the conversation in the KV cache. They have perfect recall, but because of the massive KV cache, they are memory-intensive.
This is the gap that Infini-attention aims to bridge.
3. The Solution: Infini-attention
To resolve this memory problem, researchers at Google proposed Infini-attention (Munkhdalai et al., 2024). The core principle of the approach is that instead of storing the entire conversation, we can store a compressed summary of it.
Infini-attention splits the attention output into two distinct mechanisms, which work in parallel:
- Local Attention: the same as in a standard Transformer. It sees the immediate context and computes an attention matrix over the current segment to capture details in high resolution.
- Global Linear Attention: a compressive memory that stores a summary of the entire past history in a fixed-size matrix that the model can consult.
Let's walk through the pipeline to see how it processes a long input.
Visualisation of how infini-attention works (Retrieval)
Step 1: Segmentation
First, the entire input sequence is split into smaller segments (say, N = 2,048 tokens each). Within each segment, the model uses standard dot-product attention to understand the context. This ensures that resolution stays perfect for the immediate context.
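As a rough sketch of this step (in Python, with an assumed vocabulary size and segment length), segmentation itself is nothing more than chunking the token stream into fixed-size pieces; ordinary causal self-attention is then applied within each chunk:

```python
import torch

def split_into_segments(token_ids: torch.Tensor, segment_len: int = 2048):
    """Split a 1-D sequence of token ids into fixed-size segments.

    The last segment is simply left shorter if the sequence length is not
    a multiple of segment_len (padding strategy is an implementation detail).
    """
    return [token_ids[i:i + segment_len]
            for i in range(0, token_ids.numel(), segment_len)]

# Example: a 1M-token prompt becomes ~489 segments of 2,048 tokens each.
prompt = torch.randint(0, 32_000, (1_000_000,))
segments = split_into_segments(prompt)
print(len(segments), segments[0].shape)  # 489 torch.Size([2048])
```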
Step 2: The Compression (Memory Update)
Before moving on to the next segment, the model compresses the Key (K) and Value (V) states of the current segment into a fixed-size Memory Matrix (M). This allows the model to query the Memory Matrix (instead of an ever-growing KV cache) to fetch information about previous segments.
However, blindly adding new data to the Memory Matrix can quickly corrupt the information it already holds. To prevent this, the authors use the Delta Rule (Schlag et al., 2021). The intuition: before adding any new information, check whether the memory already stores it, and skip redundant updates. The full update process is explained below:
A. The "Peek" (Calculating $V_{\text{retrieved}}$)
First, the model retrieves values from the current memory using the current Keys (K) as if they were queries. It does this to gauge what information (values) the memory already associates with these keys.

$$V_{\text{retrieved}} = \frac{\sigma(K)\, M_{\text{old}}}{\sigma(K)\, z_{\text{old}}}$$

where:
- $K$: Keys generated for the current segment
- $M_{\text{old}}$: the global memory's current state
- $\sigma$: non-linear activation function (ELU + 1)
- $z_{\text{old}}$: normalising factor
- $V_{\text{retrieved}}$: values retrieved from the global memory
B. The Update Step
The model then compares the actual new values (V) with the retrieved values ($V_{\text{retrieved}}$). It computes the difference (the residual) and adds only that to the memory, which avoids re-writing information the memory already knows.

$$M_{\text{new}} = M_{\text{old}} + \sigma(K)^{T}\left(V - V_{\text{retrieved}}\right)$$

where:
- $M_{\text{new}}$: the updated global memory
- $\sigma(K)^{T}$: transposed (activated) Key matrix of the current segment
- $V$: Value matrix of the current segment
- $V_{\text{retrieved}}$: values retrieved from the global memory in the "peek" step
This means that if the memory already contains the information of the current segment perfectly, the update is zero. This keeps the memory stable and "clean" over many updates.
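Here is a minimal, single-head sketch of the peek-and-update logic in PyTorch, assuming key and value tensors of shape (segment_len, d_key) and (segment_len, d_value); it follows the two equations above but leaves out batching, multiple heads, and other implementation details of the paper:

```python
import torch
import torch.nn.functional as F

def sigma(x: torch.Tensor) -> torch.Tensor:
    # Non-linear activation applied to keys/queries: ELU + 1 keeps values positive.
    return F.elu(x) + 1.0

def update_memory(M_old, z_old, K, V):
    """Delta-rule update of the compressive memory.

    M_old: (d_key, d_value) memory matrix, z_old: (d_key,) normaliser,
    K: (seg_len, d_key), V: (seg_len, d_value) for the current segment.
    """
    sK = sigma(K)                                        # (seg_len, d_key)
    # "Peek": what does the memory already associate with these keys?
    denom = (sK @ z_old).clamp_min(1e-6).unsqueeze(-1)   # (seg_len, 1)
    V_retrieved = (sK @ M_old) / denom                   # (seg_len, d_value)
    # Write only the residual, i.e. what the memory does not know yet.
    M_new = M_old + sK.T @ (V - V_retrieved)             # (d_key, d_value)
    z_new = z_old + sK.sum(dim=0)                        # (d_key,)
    return M_new, z_new
```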
Step 3: Global Retrieval (Linear Attention)
To generate the next token, the model needs contextual information from the entire prompt, i.e., across all segments. To fetch it, the model queries the Memory Matrix with a simple matrix multiplication.

$$A_{\text{mem}} = \frac{\sigma(Q)\, M}{\sigma(Q)\, z}$$

where:
- $A_{\text{mem}}$: attention output retrieved from the global memory
- $Q$: Query matrix of the current segment
- $M$: global memory matrix
- $z$: normalising factor
The resulting matrix contains the relevant information from all previous segments, which is used to generate the next token.
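Continuing the sketch above with the same assumed shapes, the retrieval step is a single read against the fixed-size memory rather than an attention pass over a growing KV cache:

```python
import torch
import torch.nn.functional as F

def retrieve_from_memory(M: torch.Tensor, z: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Linear-attention read from the compressive memory.

    M: (d_key, d_value) memory, z: (d_key,) normaliser, Q: (seg_len, d_key).
    Returns A_mem with shape (seg_len, d_value), the global-context output.
    """
    sQ = F.elu(Q) + 1.0                              # same ELU + 1 activation as in the update
    denom = (sQ @ z).clamp_min(1e-6).unsqueeze(-1)   # (seg_len, 1) normalising term
    return (sQ @ M) / denom                          # (seg_len, d_value)
```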
Step 4: The Aggregation (The “Mixer”)
Finally, the model has two outputs:
- $A_{\text{dot}}$: the detailed, local context from the current segment.
- $A_{\text{mem}}$: the compressed, global history of all previous segments, retrieved from the memory matrix.
To combine the two, it uses a learned gating scalar, β (beta):

$$A = \text{sigmoid}(\beta)\, A_{\text{mem}} + \left(1 - \text{sigmoid}(\beta)\right) A_{\text{dot}}$$

where:
- $\text{sigmoid}$: non-linear activation that bounds the gate between 0 and 1
- $A_{\text{mem}}$, $A_{\text{dot}}$: attention outputs from the global memory and local dot-product attention, respectively
- $\beta$: learned gating parameter that controls the influence of $A_{\text{mem}}$ and $A_{\text{dot}}$ on the final output
The β parameter acts as a mixing coefficient that determines the trade-off between the long-term ($A_{\text{mem}}$) and short-term ($A_{\text{dot}}$) information flows:
- When β is low: sigmoid(β) approaches 0, so the complementary weight $(1 - \text{sigmoid}(\beta))$ becomes dominant and the model prioritises the local dot-product attention ($A_{\text{dot}}$) over the global compressive memory.
- When β is high: sigmoid(β) approaches 1, so the model prioritises the retrieved memory content ($A_{\text{mem}}$), allowing global context to override local information from the current segment.
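A minimal sketch of this mixing step in PyTorch, assuming β is a single learned scalar per head and that $A_{\text{mem}}$ and $A_{\text{dot}}$ share the same shape:

```python
import torch

def aggregate(A_mem: torch.Tensor, A_dot: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Blend global (memory) and local (dot-product) attention outputs.

    beta is a learned, unconstrained scalar; the sigmoid squashes it into
    [0, 1] so the two pathways form a convex combination.
    """
    gate = torch.sigmoid(beta)            # -> 0: favour local, -> 1: favour global
    return gate * A_mem + (1.0 - gate) * A_dot

# Example: a head with beta = -4 relies almost entirely on local attention.
A_mem, A_dot = torch.randn(2048, 128), torch.randn(2048, 128)
out = aggregate(A_mem, A_dot, beta=torch.tensor(-4.0))
```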
4. The Results: Why Infini-attention Matters
The authors put Infini-attention to the test against existing long-context models such as Transformer-XL (Dai et al., 2019) and Memorizing Transformers (Wu et al., 2022). The results are as follows:
1. The “114x” Memory Compression
The most impactful achievement of this paper is the huge reduction in memory use. Because Infini-attention stores the entire historical context in a fixed-size Memory Matrix instead of a linearly growing KV cache, it gets away with holding 114x fewer memory parameters in GPU VRAM than Memorizing Transformers. As the table below shows, for a context length of 65k tokens, Infini-attention achieves state-of-the-art perplexity on the PG19 and Arxiv-math benchmarks while needing to store only 1.6M memory parameters (the size of the Memory Matrix), versus competing architectures.

Infini-attention notably reduces memory footprint while achieving SOTA perplexity on PG19 and Arxiv-math benchmarks
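To build intuition for why the footprint stays flat, here is an illustrative comparison with assumed per-head dimensions (not the paper's exact configuration): the compressive memory is constant in context length, while a per-head KV cache grows linearly with it.

```python
# Illustrative comparison (assumed per-head dims): the compressive memory is
# constant in context length, while a KV cache grows linearly with it.
d_key = d_value = 128                      # assumed per-head dimensions
mem_entries = d_key * d_value + d_key      # memory matrix + normaliser, per head
for seq_len in (2_048, 65_536, 1_000_000):
    kv_entries = 2 * seq_len * d_key       # K and V per token, per head
    print(f"{seq_len:>9,} tokens  KV cache: {kv_entries:>11,}  memory: {mem_entries:,}")
```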
2. The 1 Million Token “Passkey” Test
The needle-in-a-haystack challenge is the standard test for a long-context architecture. The authors ran it by hiding a random passkey in an enormous corpus of text and asking the model to retrieve it. As shown in the table below, in a zero-shot setting the model struggles to find the key, achieving mostly <20% accuracy.
The authors then fine-tuned the model for 400 steps on sequences only 5,000 tokens long. Remarkably, the model generalised from this fine-tuning to sequences up to 1 million tokens long, with drastically improved retrieval accuracy across the board.

The three scores per entry denote retrieval accuracy by the position of the passkey hidden in the corpus (start/middle/end).
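For intuition, a passkey prompt of this kind can be generated in a few lines of Python; the filler sentences and phrasing below are illustrative, not the exact template used in the paper:

```python
import random

def build_passkey_prompt(n_filler_sentences: int, passkey: int) -> str:
    """Bury a passkey at a random position inside repetitive filler text."""
    filler = ["The grass is green. The sky is blue. The sun is yellow."] * n_filler_sentences
    insert_at = random.randint(0, n_filler_sentences)
    filler.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    return " ".join(filler) + " What is the pass key?"

prompt = build_passkey_prompt(n_filler_sentences=50_000,
                              passkey=random.randint(10_000, 99_999))
```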
3. State-of-the-Art Book Summarization (500k Context)
Beyond synthetic tests, the authors also evaluated the model on the BookSum benchmark (Kryściński et al., 2021), where the model must generate a summary of an entire novel. The 8B-parameter Infini-attention model set a new state of the art on the benchmark, generating successful summaries of books up to 500,000 tokens long.
The results also show a clear trend: the model's summarisation quality improves as longer contexts are fed into it. The graph below supports this, showing that instead of forgetting earlier information (a common failure mode known as "lost in the middle"), the model can effectively use the Memory Matrix to generate accurate summaries.

ROUGE vs. input length. ROUGE measures how close a generated summary is to a human-written reference based on lexical overlap.
4. Visualising the Gating Scalar
As a further analysis, the authors visualised the learned gating scalar (β) to see how the model was using its new memory. The heatmap below shows the result. The attention heads split into two distinct roles:
- Specialised heads: heads with a gate score near 0 or 1, indicating that they focus either on local context (the current segment) or on global history (previous segments).
- Mixer heads: heads with scores near 0.5, whose main role is to merge information from both pathways.
This suggests that the model learns both to switch between short-term and long-term recall and to blend information across the entire sequence.

Visualisation of β reveals that attention heads tend to specialise in either global or local attention under the Infini-attention architecture.
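A visualisation like this is straightforward to reproduce once the gate values are extracted from a trained checkpoint; in the sketch below the β tensor is random stand-in data, since the real values live inside the trained model:

```python
import torch
import matplotlib.pyplot as plt

# Stand-in data: in practice, beta would be read from a trained checkpoint,
# one learned scalar per (layer, head).
num_layers, num_heads = 12, 8
beta = torch.randn(num_layers, num_heads)
gates = torch.sigmoid(beta)            # 0 = purely local head, 1 = purely global head

plt.imshow(gates.numpy(), cmap="viridis", vmin=0.0, vmax=1.0, aspect="auto")
plt.colorbar(label="sigmoid(beta)")
plt.xlabel("attention head")
plt.ylabel("layer")
plt.title("Gating scores: local vs. global specialisation")
plt.show()
```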
5. Conclusion
While Infini-attention may not fully replace external vector databases and RAG systems for reasoning over static knowledge, it does change how models can process long user queries. Integrating such architectures could be the next step toward freeing research creativity from the hardware bottlenecks that previously constrained it, ultimately accelerating progress in language modelling.
6. References
- Infini-attention (main paper): Munkhdalai, T., Faruqui, M., & Gopal, S. (2024). Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint.
- Transformer-XL: Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. ACL 2019.
- Memorizing Transformers: Wu, Y., Rabe, M. N., Hutchins, D., & Szegedy, C. (2022). Memorizing Transformers. ICLR 2022.
- Linear Attention (the mathematical foundation): Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
- BookSum benchmark: Kryściński, W., Rajani, N., Agarwal, D., Xiong, C., & Radev, D. (2021). BookSum: A Collection of Datasets for Long-form Narrative Summarization.
- Standard attention: Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.
- Delta Rule: Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021, PMLR.
