Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time



We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt, and carry the lessons forward. Why can’t LLMs?

In this blog post, we observe a critical difference between LLM memory and human memory. Then we introduce test-time training with an end-to-end formulation (TTT-E2E), our latest research, in which the LLM compresses the context it is reading into its weights through next-token prediction.

[Figure: Plots of loss and latency versus context length, comparing full-attention Transformers, RNN-based models, and TTT-E2E, with TTT-E2E showing balanced scaling across both metrics.]
Figure 1. Scaling with context length, in terms of loss (left) and latency (right)

Our key results are highlighted in Figure 1, which measures scaling with context length in terms of loss (left) and latency (right). The Transformer with full attention scales well in terms of loss but not latency. Recurrent Neural Networks (RNNs), such as Mamba 2 and Gated DeltaNet, scale well in latency but not loss. TTT-E2E is the only method that scales well in both.

Left panel: TTT-E2E turns the worst line (gray) into the best (light green) at 128K context length. Loss ∆ (↓), the y-value, is computed as (loss of the reported method) − (loss of the Transformer with full attention), so the loss ∆ of full attention itself (dark green) is the flat line at y=0. While other methods produce worse loss ∆ at longer context, TTT-E2E maintains the same advantage over full attention.

Right panel: Similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7x faster than full attention at 128K context on an NVIDIA H100, and 35x faster at 2M context. All models have 3B parameters and are trained with 164B tokens.

Scaling with context length, in terms of both loss and latency, is the most fundamental problem in long-context LLM research. TTT-E2E is the first method that shows a sign of life on this problem, while all other methods exhibit qualitatively different trends. Furthermore, we observed no wall in the scaling trends of TTT-E2E across rigorous and extensive experiments. These results indicate that the research community might finally arrive at a fundamental solution to long context in 2026.

Our paper and code are publicly available.

How does LLM memory differ from human memory?

Humans are remarkably good at improving with more “context” in the form of life experience, despite their imperfect recall of the exact details. For instance, consider your first lecture in machine learning. You might not recall the instructor’s first word during the lecture, but the intuition you learned is probably helping you understand this blog post, even if that lecture happened years ago.

On the other hand, Transformers with self-attention are inefficient with long context, partly because they are designed for nearly lossless recall. The basic form of self-attention is called full attention, which maintains full memory of every token by caching and comparing their keys and values. As a consequence, full attention readily attends to every detail, but its cost per token grows linearly with context length. Processing the 10-millionth token takes a million times longer than processing the tenth.
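To make the cost argument concrete, here is a minimal sketch of one decoding step of full attention with a KV cache, written in plain PyTorch with a single head and no real model (the head dimension and shapes are illustrative assumptions). Every new token attends over all cached keys and values, so the work per token grows with the number of tokens already processed.

```python
import torch
import torch.nn.functional as F

d = 64                      # head dimension (illustrative)
k_cache, v_cache = [], []   # grows by one entry per processed token

def full_attention_step(q, k, v):
    """One decoding step of full attention with a KV cache."""
    k_cache.append(k)
    v_cache.append(v)
    K = torch.stack(k_cache)            # (t, d) -- t grows with context length
    V = torch.stack(v_cache)            # (t, d)
    scores = (K @ q) / d ** 0.5         # t dot products: cost is linear in t
    weights = F.softmax(scores, dim=0)  # attend to every cached token
    return weights @ V                  # (d,)

# After 10 tokens each step does 10 dot products; after 10M tokens, 10M.
for _ in range(5):
    q, k, v = (torch.randn(d) for _ in range(3))
    out = full_attention_step(q, k, v)
```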

To process long context without burning the planet, modern architectures often mix full attention with approximations such as sliding-window attention, Mamba, and Gated DeltaNet layers. These approximations have a constant cost per token, but they also become significantly less effective at longer context compared to full attention. Specifically, these approximations lose important information that could have helped them predict the future, as shown in Figure 1.
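For contrast, here is a minimal sketch of sliding-window attention, one of the constant-cost approximations mentioned above (window size and dimensions are illustrative assumptions). The cache is truncated to the last W tokens, so the per-token cost stays constant, but anything older than the window simply cannot be attended to anymore.

```python
import torch
import torch.nn.functional as F

d, W = 64, 4                # head dimension and window size (illustrative)
k_cache, v_cache = [], []

def sliding_window_step(q, k, v):
    """One decoding step of sliding-window attention."""
    k_cache.append(k)
    v_cache.append(v)
    if len(k_cache) > W:        # keep only the last W tokens:
        k_cache.pop(0)          # constant memory and constant cost per token,
        v_cache.pop(0)          # but older tokens are no longer visible
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    weights = F.softmax((K @ q) / d ** 0.5, dim=0)
    return weights @ V

for _ in range(10):
    q, k, v = (torch.randn(d) for _ in range(3))
    out = sliding_window_step(q, k, v)   # cost never exceeds W dot products
```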

Our method: compressing context into weights

How can we design a method with a constant cost per token that can still remember the important, predictive, and intuitive information in long context?

The key mechanism is compression. For instance, humans compress a large amount of experience into their brains, preserving the important information while leaving out many details. For language models, we know that training with next-token prediction also compresses a large amount of information into their weights. So what if we just continue training the language model at test time through next-token prediction on the given context?
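At the level of pseudocode, this looks like ordinary fine-tuning, just run at inference time on the user's own context. The sketch below is a simplified illustration rather than our actual implementation: `model`, the chunk size, the learning rate, the optimizer, and which parameters get updated are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk_len=2048, lr=1e-4):
    """Compress `context_ids` into the model's weights via next-token prediction.

    Assumes `model(input_ids)` returns per-token logits of shape (batch, T, vocab);
    in practice only a designated subset of the weights would be updated.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.numel() - 1, chunk_len):
        chunk = context_ids[start : start + chunk_len + 1]
        inputs, targets = chunk[:-1], chunk[1:]
        logits = model(inputs.unsqueeze(0))                    # (1, T, vocab)
        loss = F.cross_entropy(logits.squeeze(0), targets)     # next-token prediction loss
        opt.zero_grad()
        loss.backward()
        opt.step()          # the chunk's information now lives (partly) in the weights
    return model
```

After this loop, the model answers queries with its updated weights; there is no KV cache over the long context whose size would have to grow with its length.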

We found this simple form of Test-Time Training (TTT) highly effective once we added another missing piece. At training time, we prepare the model’s initialization for TTT through meta-learning instead of standard pre-training. This addition makes our method end-to-end (E2E) in two ways. Our inner loop directly optimizes the next-token prediction loss at the end of the network, in contrast to prior work on long-context TTT (e.g., Titans). And our outer loop directly optimizes the final loss after TTT.
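To show the bi-level structure, here is a minimal sketch using a toy linear model and raw tensors so the second-order gradients are explicit. The inner loop takes a TTT gradient step on a stand-in loss over the "context"; the outer loop backpropagates through that step (hence gradients of gradients) to improve the initialization. The toy model, loss, and hyperparameters are illustrative assumptions, not our training recipe.

```python
import torch

torch.manual_seed(0)
W0 = torch.randn(16, 16, requires_grad=True)    # meta-learned initialization
meta_opt = torch.optim.Adam([W0], lr=1e-3)
inner_lr = 0.1

def loss_fn(W, x, y):
    return ((x @ W - y) ** 2).mean()            # stand-in for next-token loss

for outer_step in range(100):
    x_ctx, y_ctx = torch.randn(32, 16), torch.randn(32, 16)   # "context" tokens
    x_fut, y_fut = torch.randn(32, 16), torch.randn(32, 16)   # "future" tokens

    # Inner loop: one TTT step on the context, keeping the graph alive so the
    # outer loop can differentiate through it (gradients of gradients).
    inner_loss = loss_fn(W0, x_ctx, y_ctx)
    (g,) = torch.autograd.grad(inner_loss, W0, create_graph=True)
    W_adapted = W0 - inner_lr * g

    # Outer loop: evaluate the adapted weights on future tokens and update the
    # initialization end-to-end.
    outer_loss = loss_fn(W_adapted, x_fut, y_fut)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
```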

What will be the role of RAG?

TTT is like updating the human brain, while retrieval-based methods, such as RAG, are like writing things down and looking them up in a notepad or calendar. The notepad will continue to be a useful complement to the brain, especially when the details matter, like shopping for a long list of groceries. But human productivity is mostly determined by the brain, not by the notepads people use. Similarly, the productivity of an AI agent is mostly determined by how well it compresses a large amount of context into predictive and intuitive information.

Limitations

At training time, the meta-learning phase of TTT-E2E requires gradients of gradients. Our current implementation of meta-learning is 3.4x slower than standard pre-training for short context (8K), because the standard API of FlashAttention does not support gradients of gradients. We can overcome this limitation either by developing a custom attention kernel that supports gradients of gradients or by initializing TTT-E2E from a standard Transformer pre-trained without TTT. We invite the community to join us in these efforts!
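The blocker is the second-order gradient through the attention operator itself. The sketch below shows the operation a custom kernel would need to support, using a naive softmax attention written in plain PyTorch ops (which does support double backward, just slowly): a first backward pass with create_graph=True, followed by a backward pass through that gradient. Shapes and the outer loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d = 64
q = torch.randn(128, d, requires_grad=True)
k, v = torch.randn(128, d), torch.randn(128, d)

# Naive attention in plain PyTorch ops: double-backward works, but it is slow.
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
loss = attn.pow(2).mean()

# First-order gradient, keeping the graph alive...
(grad_q,) = torch.autograd.grad(loss, q, create_graph=True)

# ...so that a second backward pass (the gradient of the gradient) is possible.
# A fused attention kernel must expose exactly this to be usable here.
grad_q.norm().backward()
```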

Conclusion

For a deeper dive into the method, results, and implementation details, please check out the full paper, End-to-End Test-Time Training for Long Context. All experiments can be reproduced using the code and datasets in our public repo.


