Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time



We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt, and carry the lessons forward. Why can’t LLMs?

In this blog post, we observe a critical difference between LLM memory and human memory. Then we introduce test-time training with an end-to-end formulation (TTT-E2E), our latest research, in which the LLM compresses the context it is reading into its weights through next-token prediction.

[Figure: Plots of loss and latency versus context length, comparing full-attention Transformers, RNN-based models, and TTT-E2E, with TTT-E2E showing balanced scaling across both metrics.]
Figure 1. Scaling with context length, in terms of loss (left) and latency (right)

Our key results are highlighted in Figure 1, which measures scaling with context length in terms of loss (left) and latency (right). The Transformer with full attention scales well in terms of loss but not latency. Recurrent Neural Networks (RNNs), such as Mamba 2 and Gated DeltaNet, scale well in latency but not loss. TTT-E2E is the only method that scales well in both.

Left panel: TTT-E2E turns the worst line (gray) into the best (light green) at 128K context length. Loss ∆ (↓), the y-value, is computed as (loss of the reported method) − (loss of the Transformer with full attention), so the loss ∆ of full attention itself (dark green) is the flat line at y=0. While other methods produce worse loss ∆ at longer context, TTT-E2E maintains the same advantage over full attention.

Right panel: Similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7x faster than full attention at 128K context on an NVIDIA H100, and 35x faster at 2M context. All models have 3B parameters and are trained with 164B tokens.

Scaling with context length, in terms of both loss and latency, is the most fundamental problem in long-context LLM research. TTT-E2E is the first method that shows a sign of life on this problem, while all other methods exhibit qualitatively different trends. Furthermore, we observed no wall in the scaling trends of TTT-E2E across rigorous and extensive experiments. These results indicate that the research community might finally arrive at a fundamental solution to long context in 2026.

Our paper and code are publicly available.

How does LLM memory differ from human memory?

Humans are remarkably good at improving with more “context” in the form of life experience, despite their imperfect recall of the exact details. For instance, consider your first lecture in machine learning. You might not recall the instructor’s first word during the lecture, but the intuition you learned is probably helping you understand this blog post, even if that lecture happened years ago.

On the other hand, Transformers with self-attention are inefficient with long context, partly because they are designed for nearly lossless recall. The basic form of self-attention is called full attention, which maintains full memory of every token by caching and comparing their keys and values. As a consequence, full attention readily attends to every detail, but its cost per token grows linearly with context length. Processing the 10-millionth token takes a million times longer than processing the tenth.
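To make the cost argument concrete, here is a minimal sketch of one decoding step of full attention with a KV cache, written in plain PyTorch with a single head and no real model (the head dimension and shapes are illustrative assumptions). Every new token attends over all cached keys and values, so the work per token grows with the number of tokens already processed.

```python
import torch
import torch.nn.functional as F

d = 64                      # head dimension (illustrative)
k_cache, v_cache = [], []   # grows by one entry per processed token

def full_attention_step(q, k, v):
    """One decoding step of full attention with a KV cache."""
    k_cache.append(k)
    v_cache.append(v)
    K = torch.stack(k_cache)            # (t, d) -- t grows with context length
    V = torch.stack(v_cache)            # (t, d)
    scores = (K @ q) / d ** 0.5         # t dot products: cost is linear in t
    weights = F.softmax(scores, dim=0)  # attend to every cached token
    return weights @ V                  # (d,)

# After 10 tokens each step does 10 dot products; after 10M tokens, 10M.
for _ in range(5):
    q, k, v = (torch.randn(d) for _ in range(3))
    out = full_attention_step(q, k, v)
```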

To process long context without burning the planet, modern architectures often mix full attention with approximations such as sliding-window attention, Mamba, and Gated DeltaNet layers. These approximations have a constant cost per token, but they also become significantly less effective at longer context compared to full attention. Specifically, these approximations lose important information that could have helped them predict the future, as shown in Figure 1.
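For contrast, here is a minimal sketch of sliding-window attention, one of the constant-cost approximations mentioned above (window size and dimensions are illustrative assumptions). The cache is truncated to the last W tokens, so the per-token cost stays constant, but anything older than the window simply cannot be attended to anymore.

```python
import torch
import torch.nn.functional as F

d, W = 64, 4                # head dimension and window size (illustrative)
k_cache, v_cache = [], []

def sliding_window_step(q, k, v):
    """One decoding step of sliding-window attention."""
    k_cache.append(k)
    v_cache.append(v)
    if len(k_cache) > W:        # keep only the last W tokens:
        k_cache.pop(0)          # constant memory and constant cost per token,
        v_cache.pop(0)          # but older tokens are no longer visible
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    weights = F.softmax((K @ q) / d ** 0.5, dim=0)
    return weights @ V

for _ in range(10):
    q, k, v = (torch.randn(d) for _ in range(3))
    out = sliding_window_step(q, k, v)   # cost never exceeds W dot products
```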

Our method: compressing context into weights

How can we design a method with a constant cost per token that can still remember the important, predictive, and intuitive information in long context?

The key mechanism is compression. For instance, humans compress a large amount of experience into their brains, preserving the important information while leaving out many details. For language models, we know that training with next-token prediction also compresses a large amount of information into their weights. So what if we just continue training the language model at test time through next-token prediction on the given context?
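At the level of pseudocode, this looks like ordinary fine-tuning, just run at inference time on the user's own context. The sketch below is a simplified illustration rather than our actual implementation: `model`, the chunk size, the learning rate, the optimizer, and which parameters get updated are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk_len=2048, lr=1e-4):
    """Compress `context_ids` into the model's weights via next-token prediction.

    Assumes `model(input_ids)` returns per-token logits of shape (batch, T, vocab);
    in practice only a designated subset of the weights would be updated.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.numel() - 1, chunk_len):
        chunk = context_ids[start : start + chunk_len + 1]
        inputs, targets = chunk[:-1], chunk[1:]
        logits = model(inputs.unsqueeze(0))                    # (1, T, vocab)
        loss = F.cross_entropy(logits.squeeze(0), targets)     # next-token prediction loss
        opt.zero_grad()
        loss.backward()
        opt.step()          # the chunk's information now lives (partly) in the weights
    return model
```

After this loop, the model answers queries with its updated weights; there is no KV cache over the long context whose size would have to grow with its length.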

We found this simple form of Test-Time Training (TTT) highly effective once we added another missing piece. At training time, we prepare the model’s initialization for TTT through meta-learning instead of standard pre-training. This addition makes our method end-to-end (E2E) in two ways. Our inner loop directly optimizes the next-token prediction loss at the end of the network, in contrast to prior work on long-context TTT (e.g., Titans). And our outer loop directly optimizes the final loss after TTT.
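To show the bi-level structure, here is a minimal sketch using a toy linear model and raw tensors so the second-order gradients are explicit. The inner loop takes a TTT gradient step on a stand-in loss over the "context"; the outer loop backpropagates through that step (hence gradients of gradients) to improve the initialization. The toy model, loss, and hyperparameters are illustrative assumptions, not our training recipe.

```python
import torch

torch.manual_seed(0)
W0 = torch.randn(16, 16, requires_grad=True)    # meta-learned initialization
meta_opt = torch.optim.Adam([W0], lr=1e-3)
inner_lr = 0.1

def loss_fn(W, x, y):
    return ((x @ W - y) ** 2).mean()            # stand-in for next-token loss

for outer_step in range(100):
    x_ctx, y_ctx = torch.randn(32, 16), torch.randn(32, 16)   # "context" tokens
    x_fut, y_fut = torch.randn(32, 16), torch.randn(32, 16)   # "future" tokens

    # Inner loop: one TTT step on the context, keeping the graph alive so the
    # outer loop can differentiate through it (gradients of gradients).
    inner_loss = loss_fn(W0, x_ctx, y_ctx)
    (g,) = torch.autograd.grad(inner_loss, W0, create_graph=True)
    W_adapted = W0 - inner_lr * g

    # Outer loop: evaluate the adapted weights on future tokens and update the
    # initialization end-to-end.
    outer_loss = loss_fn(W_adapted, x_fut, y_fut)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
```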

What will be the role of RAG?

TTT is like updating the human brain, while retrieval-based methods, such as RAG, are like writing things down and looking them up in a notepad or calendar. The notepad will continue to be a useful complement to the brain, especially when the details matter, like shopping for a long list of groceries. But human productivity is mostly determined by the brain, not by the notepads people use. Similarly, the productivity of an AI agent is mostly determined by how well it compresses a large amount of context into predictive and intuitive information.

Limitations

At training time, the meta-learning phase of TTT-E2E requires gradients of gradients. Our current implementation of meta-learning is 3.4x slower than standard pre-training for short context (8K), because the standard API of FlashAttention does not support gradients of gradients. We can overcome this limitation either by developing a custom attention kernel that supports gradients of gradients or by initializing TTT-E2E from a standard Transformer pre-trained without TTT. We invite the community to join us in these efforts!
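The blocker is the second-order gradient through the attention operator itself. The sketch below shows the operation a custom kernel would need to support, using a naive softmax attention written in plain PyTorch ops (which does support double backward, just slowly): a first backward pass with create_graph=True, followed by a backward pass through that gradient. Shapes and the outer loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d = 64
q = torch.randn(128, d, requires_grad=True)
k, v = torch.randn(128, d), torch.randn(128, d)

# Naive attention in plain PyTorch ops: double-backward works, but it is slow.
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
loss = attn.pow(2).mean()

# First-order gradient, keeping the graph alive...
(grad_q,) = torch.autograd.grad(loss, q, create_graph=True)

# ...so that a second backward pass (the gradient of the gradient) is possible.
# A fused attention kernel must expose exactly this to be usable here.
grad_q.norm().backward()
```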

Conclusion

For a deeper dive into the method, results, and implementation details, please check out the full paper, End-to-End Test-Time Training for Long Context. All experiments can be reproduced using the code and datasets in our public repo.


