
Researchers at Google have developed a new AI paradigm aimed at solving one of the biggest limitations in today's large language models: their inability to learn or update their knowledge after training. The paradigm, called Nested Learning, reframes a model and its training not as a single process, but as a system of nested, multi-level optimization problems. The researchers argue that this approach can unlock more expressive learning algorithms, leading to better in-context learning and memory.
To prove their concept, the researchers used Nested Learning to develop a new model, called Hope. Initial experiments show that it delivers superior performance on language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient AI systems that can adapt to real-world environments.
The memory problem of large language models
Deep learning algorithms removed much of the need for the careful engineering and domain expertise required by traditional machine learning. By feeding models vast amounts of data, they could learn the necessary representations on their own. However, this approach presented its own set of challenges that couldn't be solved simply by stacking more layers or creating larger networks, such as generalizing to new data, continually learning new tasks, and avoiding suboptimal solutions during training.
Efforts to overcome these challenges led to the innovations behind Transformers, the foundation of today's large language models (LLMs). These models have ushered in "a paradigm shift from task-specific models to more general-purpose systems with various emergent capabilities as a result of scaling the 'right' architectures," the researchers write. Still, a fundamental limitation remains: LLMs are largely static after training and can't update their core knowledge or acquire new skills from new interactions.
The only adaptable component of an LLM is its in-context learning ability, which allows it to perform tasks based on information provided in its immediate prompt. This makes current LLMs analogous to a person who can't form new long-term memories. Their knowledge is limited to what they learned during pre-training (the distant past) and what's in their current context window (the immediate present). Once a conversation exceeds the context window, that information is lost forever.
The problem is that today's transformer-based LLMs have no mechanism for "online" consolidation. Information in the context window never updates the model's long-term parameters, the weights stored in its feed-forward layers. As a result, the model can't permanently acquire new knowledge or skills from interactions; anything it learns disappears as soon as the context window rolls over.
A nested approach to learning
Nested Learning (NL) is designed to allow computational models to learn from data at different levels of abstraction and time-scales, much like the brain. It treats a single machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized concurrently at different speeds. This is a departure from the classic view, which treats a model's architecture and its optimization algorithm as two separate components.
Under this paradigm, the training process is viewed as developing an "associative memory," the ability to connect and recall related pieces of information. The model learns to map a data point to its local error, which measures how "surprising" that data point was. Even key architectural components like the attention mechanism in transformers can be seen as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be ordered into different "levels," forming the core of the NL paradigm.
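To make the "update frequency" idea concrete, here is a toy sketch (not the paper's actual algorithm) in which each parameter group, or level, is refreshed on its own clock and nudged toward reducing its local error, or "surprise," for the current data point. The level names, periods, and the simple update rule are illustrative assumptions.

```python
# Toy sketch of frequency-ordered updates (illustrative only, not Google's
# actual Nested Learning algorithm). Each "level" holds its own parameters
# and updates every `period` steps, so fast levels adapt to immediate data
# while slow levels consolidate information over longer horizons.

levels = [
    {"name": "fast",   "period": 1,   "params": 0.0},   # e.g. attention-like, per-token memory
    {"name": "medium", "period": 10,  "params": 0.0},   # hypothetical intermediate memory
    {"name": "slow",   "period": 100, "params": 0.0},   # e.g. slowly-updated long-term weights
]

def local_error(params, x):
    """The level's 'surprise': how far its current estimate is from the data point."""
    return x - params

def train(stream, lr=0.1):
    for step, x in enumerate(stream, start=1):
        for level in levels:
            if step % level["period"] == 0:        # each level updates on its own clock
                err = local_error(level["params"], x)
                level["params"] += lr * err        # simple associative-memory style update

train([1.0] * 500)
print({lvl["name"]: round(lvl["params"], 3) for lvl in levels})
```

Running the toy loop shows the fast level converging quickly on the incoming data while the slow level drifts toward it gradually, which is the ordering-by-frequency intuition behind NL's "levels."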
Hope for continual learning
The researchers put these principles into practice with Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture Google introduced in January to address the transformer model's memory limitations. While Titans had a powerful memory system, its parameters were updated at only two different speeds: a long-term memory module and a short-term memory mechanism.
Hope is a self-modifying architecture augmented with a "Continuum Memory System" (CMS) that enables unbounded levels of in-context learning and scales to larger context windows. The CMS acts like a series of memory banks, each updating at a different frequency. Faster-updating banks handle immediate information, while slower ones consolidate more abstract knowledge over longer periods. This allows the model to optimize its own memory in a self-referential loop, creating an architecture with theoretically infinite learning levels.
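A minimal sketch of the memory-bank idea follows, assuming hypothetical bank sizes, update periods, and a simple running-average update; the actual CMS in Hope is a learned neural component, not a moving average.

```python
import numpy as np

# Minimal sketch of a chain of memory banks updating at different frequencies
# (hypothetical structure; Hope's CMS is a learned neural module, not this).
class MemoryBank:
    def __init__(self, dim, period, decay):
        self.state = np.zeros(dim)   # the bank's current memory content
        self.period = period         # how often this bank updates, in steps
        self.decay = decay           # how much old content is retained on update

    def maybe_update(self, step, summary):
        if step % self.period == 0:
            self.state = self.decay * self.state + (1 - self.decay) * summary

banks = [
    MemoryBank(dim=8, period=1,  decay=0.5),   # fast: tracks immediate tokens
    MemoryBank(dim=8, period=8,  decay=0.9),   # medium: recent-context gist
    MemoryBank(dim=8, period=64, decay=0.99),  # slow: long-horizon knowledge
]

rng = np.random.default_rng(0)
for step in range(1, 257):
    token_embedding = rng.normal(size=8)       # stand-in for the model's per-token signal
    for bank in banks:
        bank.maybe_update(step, token_embedding)
```

The design point the sketch captures is the spectrum: the fast bank is overwritten almost every step, while the slow bank changes rarely and retains most of its past state, mirroring the consolidation behavior the researchers describe.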
On a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well a model predicts the next word in a sequence and maintains coherence in the text it generates) and higher accuracy compared with both standard transformers and other modern recurrent models. Hope also performed better on long-context "Needle-in-a-Haystack" tasks, where a model must find and use a specific piece of information hidden inside a large volume of text. This suggests its CMS offers a more efficient way to handle long information sequences.
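For reference, perplexity is conventionally computed as the exponential of the average negative log-likelihood the model assigns to the actual next tokens; lower values mean the model was less "surprised" by the text. A small sketch with made-up probabilities:

```python
import math

# Perplexity = exp(average negative log-likelihood of the correct next tokens).
def perplexity(token_probs):
    """token_probs: probability the model assigned to each actual next token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.25, 0.5, 0.1, 0.4]))  # toy values -> roughly 3.8
```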
This is one of several efforts to create AI systems that process information at different levels. The Hierarchical Reasoning Model (HRM) from Sapient Intelligence used a hierarchical architecture to make models more efficient at learning reasoning tasks. The Tiny Reasoning Model (TRM), developed by Samsung, builds on HRM with architectural changes that improve its performance while making it more efficient.
While promising, Nested Learning faces some of the same challenges as these other paradigms in realizing its full potential. Current AI hardware and software stacks are heavily optimized for traditional deep learning architectures, and Transformer models in particular, so adopting Nested Learning at scale may require fundamental changes. However, if it gains traction, it could lead to far more efficient LLMs that can continually learn, a capability crucial for real-world enterprise applications where environments, data, and user needs are in constant flux.
