
For all their superhuman power, today’s AI models suffer from a surprisingly human flaw: They forget. Give an AI assistant a sprawling conversation, a multi-step reasoning task or a project spanning days, and it will eventually lose the thread. Engineers refer to this phenomenon as “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can operate reliably in the real world.
A research team from China and Hong Kong believes it has an answer to context rot. Their recent paper introduces general agentic memory (GAM), a system built to preserve long-horizon information without overwhelming the model. The core premise is simple: Split memory into two specialized roles, one that captures everything and another that retrieves exactly the right things at the right moment.
Early results are encouraging, and the timing could hardly be better. As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM is emerging at precisely the right inflection point.
When bigger context windows still aren’t enough
At the heart of every large language model (LLM) lies a rigid limitation: A fixed “working memory,” more commonly known as the context window. Once conversations grow long, older information gets truncated, summarized or silently dropped. This limitation has long been recognized by AI researchers, and since early 2023, developers have been working to expand context windows, rapidly increasing the amount of data a model can handle in a single pass.
Mistral’s Mixtral 8x7B debuted with a 32K-token window, roughly 24,000 words or about 128,000 characters of English text. This was followed by MosaicML’s MPT-7B-StoryWriter-65k+, which more than doubled that capacity; then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, offering massive 128K and 200K windows, both of which are extendable to an unprecedented one million tokens. Even Microsoft joined the push, vaulting from the 2K-token limit of the earlier Phi models to the 128K context window of Phi-3.
Increasing context windows might sound like the obvious fix, but it isn’t. Even models with sprawling 100K-token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. Scaling context comes with its own set of problems. As prompts grow longer, models become less reliable at locating and interpreting information because attention over distant tokens weakens and accuracy steadily erodes.
Longer inputs also dilute the signal-to-noise ratio: Including every possible detail can actually make responses worse than using a focused prompt. Long prompts slow models down, too; more input tokens lead to noticeably higher output-token latency, creating a practical limit on how much context can be used before performance suffers.
Memories are priceless
For many organizations, supersized context windows come with a clear downside: They’re costly. Sending massive prompts through an API isn’t cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context. And that’s the tension at the heart of the issue: Memory is essential to making AI more powerful, but supplying it through ever-longer prompts gets expensive fast.
As context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Scaling context is both a technical challenge and an economic one, and relying on ever-larger windows quickly becomes an unsustainable strategy for long-term memory.
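To see why the economics bite, consider a rough back-of-the-envelope comparison. The per-token price below is a hypothetical placeholder rather than any specific provider’s rate; only the scaling behavior matters.

```python
# Hypothetical pricing, used purely to illustrate how cost scales with input tokens.
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # assumed USD per 1,000 input tokens

def daily_cost(input_tokens_per_call: int, calls_per_day: int) -> float:
    """Daily spend when every call carries the given number of input tokens."""
    return input_tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS * calls_per_day

# Replaying a 200K-token history on every call vs. sending a 4K-token focused context.
full_history = daily_cost(200_000, calls_per_day=10_000)   # -> $5,000 per day
focused = daily_cost(4_000, calls_per_day=10_000)          # -> $100 per day
print(f"Full history: ${full_history:,.0f}/day  |  Focused context: ${focused:,.0f}/day")
```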
Fixes like summarization and retrieval-augmented generation (RAG) aren’t silver bullets either. Summaries inevitably strip away subtle but essential details, and traditional RAG, while strong on static documents, tends to break down when information stretches across multiple sessions or evolves over time. Even newer variants, such as agentic RAG and RAG 2.0 (which do a better job of steering the retrieval process), still inherit the same foundational flaw: They treat retrieval as the answer rather than treating memory itself as the core problem.
Compilers solved this problem decades ago
If memory is the real bottleneck, and retrieval can’t fix it, then the gap needs a different kind of solution. That’s the bet behind GAM. Instead of pretending retrieval is memory, GAM keeps a full, lossless record and layers smart, on-demand recall on top of it, resurfacing the precise details an agent needs even as conversations twist and evolve. A useful way to understand GAM is through a familiar idea from software engineering: Just-in-time (JIT) compilation. Rather than precomputing a rigid, heavily compressed memory, GAM keeps things light by storing a minimal set of cues alongside a full, untouched archive of raw history. Then, when a request arrives, it “compiles” a tailored context on the fly.
This JIT approach is built into GAM’s dual architecture, allowing AI to carry context across long conversations without overcompressing or guessing too early about what matters. The result is the right information, delivered at exactly the right moment.
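As a minimal sketch of that compile-on-demand flow (the class and method names here are illustrative assumptions, not the paper’s API): at write time only a cheap cue is added per exchange, and the working context is assembled fresh for each request.

```python
# Illustrative sketch of just-in-time context assembly: keep a lossless log
# plus lightweight cues, and build the prompt only when a request arrives.
# Names (JITMemory, compile_context) are assumptions, not taken from the paper.
from typing import Callable

class JITMemory:
    def __init__(self) -> None:
        self.raw_log: list[str] = []   # full, untouched history
        self.cues: list[str] = []      # one-line memos, one per exchange

    def record(self, exchange: str, memo: str) -> None:
        self.raw_log.append(exchange)  # nothing is compressed away
        self.cues.append(memo)

    def compile_context(self, request: str,
                        search: Callable[[str, list[str], list[str]], list[str]]) -> str:
        # "Compile" a tailored context at request time instead of replaying
        # or pre-summarizing the entire history.
        relevant = search(request, self.raw_log, self.cues)
        return "\n\n".join(relevant)
```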
Inside GAM: A two-agent system built for memory that endures
GAM revolves around the simple idea of separating the act of remembering from the act of recalling, which maps onto two components: The ‘memorizer’ and the ‘researcher.’
The memorizer: Total recall without overload
The memorizer captures every exchange in full, quietly turning each interaction into a concise memo while preserving the complete, unaltered session in a searchable page store. It doesn’t compress aggressively or guess what is important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved; nothing is thrown away.
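A minimal sketch of what such a page store might look like follows; the field names are assumptions for illustration rather than the paper’s exact schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative page-store schema: every exchange is kept verbatim, with a
# short memo and metadata layered on top purely to aid later retrieval.
@dataclass
class Page:
    page_id: str
    raw_content: str          # full, unaltered exchange; nothing discarded
    memo: str                 # optional lightweight summary for quick scanning
    keywords: list[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class PageStore:
    def __init__(self) -> None:
        self.pages: dict[str, Page] = {}

    def add(self, page: Page) -> None:
        self.pages[page.page_id] = page

    def get(self, page_id: str) -> Page | None:
        # Direct lookup by page ID; keyword and vector search operate alongside this.
        return self.pages.get(page_id)
```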
The researcher: A deep retrieval engine
When the agent needs to act, the researcher takes the helm, planning a search strategy that combines embeddings with keyword methods like BM25, navigating through page IDs and stitching the pieces together. It conducts layered searches across the page store, mixing vector retrieval, keyword matching and direct lookups. It evaluates findings, identifies gaps and keeps searching until it has enough evidence to give a confident answer, much like a human analyst reviewing old notes and primary documents. It iterates, searches, integrates and reflects until it builds a clean, task-specific briefing, as sketched below.
GAM’s power comes from this JIT memory pipeline, which assembles rich, task-specific context on demand instead of leaning on brittle, precomputed summaries. Its core innovation is simple yet powerful: It preserves all information intact and makes every detail recoverable.
Ablation studies support this approach: Traditional memory fails on its own, and naive retrieval isn’t enough. It’s the pairing of a complete archive with an active, iterative research engine that allows GAM to surface details other systems leave behind.
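A rough sketch of that loop, building on the PageStore sketch above: It blends keyword and embedding scores, checks whether the gathered evidence suffices, and reformulates the query when it does not. The scoring and stopping functions are assumed placeholders; the paper’s actual retrieval policy is more involved.

```python
# Sketch of an iterative hybrid-retrieval loop in the spirit of the researcher:
# score pages with a mix of BM25-style keyword matching and embedding similarity,
# collect the top hits, and keep refining until the evidence looks sufficient.
# bm25_score, embed_similarity and enough_evidence are assumed helper functions.
from typing import Callable

def research(query: str, store: "PageStore",
             bm25_score: Callable[[str, str], float],
             embed_similarity: Callable[[str, str], float],
             enough_evidence: Callable[[str, list], bool],
             max_rounds: int = 5) -> list:
    evidence: list = []
    for _ in range(max_rounds):
        scored = []
        for page in store.pages.values():
            hybrid = 0.5 * bm25_score(query, page.raw_content) \
                   + 0.5 * embed_similarity(query, page.raw_content)
            scored.append((hybrid, page.page_id, page))  # page_id breaks ties
        top = [p for _, _, p in sorted(scored, reverse=True)[:5]]
        evidence.extend(p for p in top if p not in evidence)
        if enough_evidence(query, evidence):
            break  # sufficient support gathered for a confident answer
        # Reflect on gaps and broaden the query with cues from the best pages (simplified).
        query += " " + " ".join(k for p in top for k in p.keywords[:2])
    return evidence
```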
Outperforming RAG and long-context models
To test GAM, the researchers pitted it against standard RAG pipelines and models with enlarged context windows such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM using four major long-context and memory-intensive benchmarks, each chosen to test a different aspect of the system’s capabilities:
- LoCoMo measures an agent’s ability to maintain and recall information across long, multi-session conversations, encompassing single-hop, multi-hop, temporal reasoning and open-domain tasks.
- HotpotQA, a widely used multi-hop QA benchmark built from Wikipedia, was adapted using MemAgent’s memory-stress-test version, which mixes relevant documents with distractors to create contexts of 56K, 224K and 448K tokens, ideal for testing how well GAM handles noisy, sprawling input.
- RULER evaluates retrieval accuracy, multi-hop state tracking, aggregation over long sequences and QA performance under a 128K-token context to further probe long-horizon reasoning.
- NarrativeQA is a benchmark where each question must be answered using the full text of a book or movie script; the researchers sampled 300 examples with an average context size of 87K tokens.
Together, these datasets and benchmarks allowed the team to evaluate both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.
GAM came out ahead across all benchmarks. Its biggest win came on RULER, which benchmarks long-range state tracking. Notably:
- GAM exceeded 90% accuracy.
- RAG collapsed because key details were lost in summaries.
- Long-context models faltered as older information effectively “faded” even when technically present.
Clearly, bigger context windows aren’t the answer. GAM works because it retrieves with precision rather than piling up tokens.
GAM, context engineering and competing approaches
Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the right information can always be retrieved, even far downstream. The technique’s emergence coincides with the current, broader shift in AI toward context engineering, or the practice of shaping everything an AI model sees: Its instructions, history, retrieved documents, tools, preferences and output formats.
Context engineering has rapidly eclipsed prompt engineering in importance, and other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed “semantic operating systems” built around lifelong adaptive memory.
Still, GAM’s philosophy is distinct: Avoid loss and retrieve with intelligence. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime. For agents handling multi-day projects, ongoing workflows or long-term relationships, that reliability may prove essential.
Why GAM matters for the long haul
Just as adding more compute doesn’t automatically produce better algorithms, expanding context windows alone won’t solve AI’s long-term memory problems. Meaningful progress requires rethinking the underlying system, and GAM takes that approach. Instead of depending on ever-larger models, massive context windows or endlessly refined prompts, it treats memory as an engineering challenge, one that benefits from structure rather than brute force.
As AI agents transition from clever demos to mission-critical tools, their ability to remember long histories becomes crucial for building dependable, intelligent systems. Enterprises need AI agents that can track evolving tasks, maintain continuity and recall past interactions with precision. GAM offers a practical path toward that future, signaling what may be the next major frontier in AI: Not bigger models, but smarter memory systems and the context architectures that make them possible.
