
Understanding What We Lose

How We Tackle Catastrophic Forgetting in LLMs

Figure 1: The shared experience of forgetting. Image generated by DALL·E, developed by OpenAI.

Forgetting is an intrinsic part of the human experience. We all misplace our keys, stumble over a familiar name, or draw a blank on what we had for dinner a few nights ago. But this apparent lapse in memory isn't necessarily a failing. Rather, it highlights a sophisticated cognitive mechanism that allows our brains to prioritize, sift through, and manage a deluge of information. Forgetting, paradoxically, is a testament to our ability to learn and remember.

Just as people forget, so do machine learning models, particularly Large Language Models. These models learn by adjusting internal parameters in response to data exposure. However, if new data conflicts with what the model has previously learned, it may overwrite or dampen the old information. Even corroborating data can turn the wrong knobs on otherwise well-tuned weights. This phenomenon, known as "catastrophic forgetting," is a significant challenge in training stable and versatile artificial intelligence systems.

The Mechanics of Forgetting in LLMs

At its core, an LLM's memory lies in its weights. In a neural network, each weight constitutes a dimension of the network's high-dimensional weight space. As learning unfolds, the network navigates this space, guided by gradient descent, in a quest to minimize the loss function.

This loss function, usually a form of cross-entropy loss for classification tasks in LLMs, compares the model's output distribution to the target distribution. Mathematically, for a target distribution y and model output ŷ, the cross-entropy loss can be expressed as:

L(y, ŷ) = −Σᵢ yᵢ log(ŷᵢ)

During training, the network tweaks its weights to reduce this loss. The central factor governing how much a weight should change is the learning rate. In the stochastic gradient descent update rule:

θ ← θ − η ∇L(θ)

η is the learning rate. However, choosing this learning rate can be tricky and has direct implications for catastrophic forgetting. If η is high, the model is highly plastic and can rapidly learn new tasks, but it risks losing prior knowledge. A small η preserves old knowledge but may compromise the learning of new tasks.
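To make this trade-off concrete, here is a minimal PyTorch sketch (a toy weight vector and a single made-up data point, unrelated to any real LLM) that applies the update rule above with a high and a low learning rate:

```python
import torch

# Minimal sketch of the SGD update: theta <- theta - eta * grad(loss).
# A large eta produces large weight changes (plastic, but quick to overwrite
# old knowledge); a small eta produces small, conservative updates.
torch.manual_seed(0)
theta = torch.randn(10)                      # placeholder weight vector
x, y = torch.randn(10), torch.tensor(1.0)    # one toy data point

for eta in (1e-1, 1e-4):                     # high vs. low learning rate
    w = theta.clone().requires_grad_(True)
    loss = (w @ x - y) ** 2                  # toy squared-error loss
    loss.backward()
    with torch.no_grad():
        step = eta * w.grad                  # the weight change actually applied
    print(f"eta={eta:.0e}  largest weight change={step.abs().max().item():.5f}")
```

The gradient direction is identical in both runs; only its scale changes, and with it how much of the existing solution gets overwritten.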

Furthermore, the complexity rises when we realize that weight updates are not independent. Adjusting a weight related to one feature may inadvertently affect the performance of other features, resulting in a complex, tangled web of dependencies.

We must also consider the curricular order of tasks or data during training. Introducing tasks sequentially can lead to the dominance of later tasks, biasing the model toward the most recently learned task, a direct manifestation of catastrophic forgetting.
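The toy sketch below (a two-feature linear regressor standing in for an LLM purely for illustration) makes this bias visible: once the shared weights are retrained on task B, performance on task A collapses.

```python
import torch
import torch.nn as nn

# Two synthetic "tasks" that pull the same weights toward conflicting solutions.
torch.manual_seed(0)
model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()

xa = torch.randn(64, 2); ya = xa @ torch.tensor([[1.0], [0.0]])  # task A: use feature 1
xb = torch.randn(64, 2); yb = xb @ torch.tensor([[0.0], [1.0]])  # task B: use feature 2

def train(x, y, steps=300, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(xa, ya)
print("task A loss after training on A:", loss_fn(model(xa), ya).item())  # near zero
train(xb, yb)  # sequential training on the later task
print("task A loss after training on B:", loss_fn(model(xa), ya).item())  # much larger
```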

Strategies to Counter Catastrophic Forgetting

We want our LLMs to remember far more than we can ourselves. Thus, we are striving to build systems that are efficient with their memory yet not necessarily confined to our biological limits. In the quest to combat catastrophic forgetting in LLMs, researchers have developed several innovative strategies. Three of the most prominent are Elastic Weight Consolidation, Progressive Neural Networks, and Optimized Fixed Expansion Layers. Each technique incorporates a distinct mathematical approach to mitigate the forgetting problem.

Elastic Weight Consolidation (EWC): Remembering the Importance of Each Weight

EWC is inspired by neuroscience and Bayesian inference, and it aims to quantify the importance of each weight to the tasks the model has previously learned. The core idea is that weights critical to prior tasks should be altered less when new data is encountered.

Figure 2: EWC Schematic in Parameter Space, https://www.pnas.org/doi/full/10.1073/pnas.1611835114

In Figure 2, we can clearly see the pivotal role that Elastic Weight Consolidation plays in preventing catastrophic forgetting when we train on task B without losing the knowledge we have gained from task A. The diagram shows parameter space, with the grey region signifying optimal performance for task A and the cream-colored region indicating good performance for task B. After we have learned task A, our parameter values are labeled θ*A.

If we focus only on task B and take steps in the direction of its gradient (as shown by the blue arrow), we will minimize the loss for task B but potentially wipe out our knowledge of task A; this is the problem of catastrophic forgetting. Alternatively, if we constrain all weights with the same coefficient (as illustrated by the green arrow), we impose a harsh restriction that lets us retain our memory of task A but makes learning task B difficult.

This is where EWC steps in: it finds the sweet spot by identifying a solution for task B (indicated by the red arrow) that does not drastically impact our knowledge of task A. It accomplishes this by explicitly estimating how important each weight is to task A.

EWC introduces a quadratic penalty to the loss function, constraining the modification of essential weights. This penalty term is proportional to the square of the difference between the current and previously learned weight values, scaled by an importance factor. This importance factor, calculated from the Fisher Information Matrix, serves as a heuristic for a weight's significance to the previously learned tasks.
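Written out (following the standard formulation of the EWC paper rather than anything reproduced from this article), the loss minimized while training on task B is:

```latex
% Task-B loss plus a quadratic penalty anchored at the task-A solution
% \theta^{*}_{A}, weighted per parameter by the diagonal Fisher information F_i
% and scaled by a global strength \lambda.
\mathcal{L}(\theta) = \mathcal{L}_B(\theta)
  + \sum_i \frac{\lambda}{2} \, F_i \, \bigl(\theta_i - \theta^{*}_{A,i}\bigr)^2
```

Here λ sets how strongly old knowledge is protected relative to learning the new task.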

In Elastic Weight Consolidation, a neural network is first trained on task A, after which the Fisher Information Matrix (FIM) is computed and saved together with the learned weights. When training the network on task B, EWC modifies the loss function to include a penalty term, computed from the saved FIM and weights, which discourages drastic changes to the weights critical for task A, balancing learning the new task against preserving knowledge from the previous one. The quadratic nature of the penalty ensures that larger deviations from the saved weights incur a higher penalty. By assigning greater penalties to weights that contribute more to prior tasks, EWC aims to retain their learned knowledge while accommodating new information.
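A compact code sketch of that recipe is below. The tiny linear classifier, random data, and penalty strength are all stand-ins of my own (nothing here comes from the article), and the Fisher matrix is approximated by its diagonal, as is common in practice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of EWC on a toy classifier (not an LLM):
# 1) after task A, estimate a diagonal Fisher matrix from squared gradients,
# 2) save it together with the task-A weights,
# 3) add a quadratic penalty to the loss while training on task B.
torch.manual_seed(0)
model = nn.Linear(20, 4)                 # stand-in for a model already trained on task A
xa = torch.randn(256, 20)                # toy task-A inputs
ya = torch.randint(0, 4, (256,))         # toy task-A labels

def diagonal_fisher(model, x, y, batches=8):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for xb, yb in zip(x.chunk(batches), y.chunk(batches)):
        model.zero_grad()
        F.nll_loss(F.log_softmax(model(xb), dim=-1), yb).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / batches
    return fisher

# Snapshot the task-A solution and its Fisher estimate before seeing task B.
theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = diagonal_fisher(model, xa, ya)

def ewc_penalty(model, lam=100.0):
    # Weights with large Fisher values are pinned near their task-A values.
    return lam / 2 * sum((fisher[n] * (p - theta_star[n]) ** 2).sum()
                         for n, p in model.named_parameters())

# During task-B training the total loss would be: loss_B + ewc_penalty(model)
print(float(ewc_penalty(model)))         # exactly zero at the task-A solution itself
```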

Progressive Neural Networks (ProgNet): Constructing Neural Network Towers

ProgNets introduce a new architecture that allows the network to expand when it encounters new tasks. Instead of altering the weights of a single network, it adds a new network (or column) for each task, stacking these columns like the floors of a tower. Each new column is connected to all previously added columns, but not the other way around, preserving the knowledge in the older columns.

In ProgNet, each task is learned by a separate column, and the output is a function of the inputs from all previous and current columns. The weights of previous columns remain frozen, preventing catastrophic forgetting, while the weights of the new column are trained normally.

Figure 3: A Block-based ProgNet Model, https://arxiv.org/abs/1606.04671

Imagine Progressive Neural Networks as a constellation of separate processing units, each able to discern and harness the most pertinent inputs for the tasks it is assigned. Consider an example from Figure 3, where output₃ not only interacts with its directly connected hidden layer, h₂, but also interfaces with the h₂ layers of prior columns, modifying their outputs through its own lateral parameters. This output₃ unit scans and evaluates the available data, strategically omitting inputs that are unnecessary. For instance, if h₂¹ encapsulates all the needed information, output₃ may choose to neglect the rest. Alternatively, if both h₂² and h₂³ carry beneficial information, output₃ could preferentially focus on these while ignoring h₂¹. These lateral connections empower the network to manage the flow of information across tasks effectively while also enabling it to exclude irrelevant data.
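The sketch below captures the column idea in miniature (two hidden layers per column and made-up sizes; it illustrates the mechanism rather than the exact architecture of the paper): earlier columns are frozen, and each new column receives lateral connections from their hidden activations.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One progressive column; earlier columns feed it through lateral adapters."""
    def __init__(self, in_dim, hidden, out_dim, prev_columns=()):
        super().__init__()
        self.h1 = nn.Linear(in_dim, hidden)
        self.h2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out_dim)
        # One lateral adapter per previous column, feeding its first hidden
        # layer into this column's second layer.
        self.laterals = nn.ModuleList(nn.Linear(hidden, hidden) for _ in prev_columns)
        self.prev = prev_columns
        for col in self.prev:                     # freeze all earlier columns
            for p in col.parameters():
                p.requires_grad_(False)

    def forward(self, x):
        a1 = torch.relu(self.h1(x))
        lateral = sum(lat(torch.relu(col.h1(x)))
                      for lat, col in zip(self.laterals, self.prev))
        a2 = torch.relu(self.h2(a1) + lateral)
        return self.out(a2)

col_a = Column(16, 32, 2)                         # column trained on task A (training omitted)
col_b = Column(16, 32, 2, prev_columns=(col_a,))  # task-B column with lateral access to A
print(col_b(torch.randn(4, 16)).shape)            # torch.Size([4, 2])
```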

Optimized Fixed Expansion Layers (OFELs): A New Room for Each Task

The concept behind OFELs is like building a new room in a house for each new family member. In the context of neural networks, OFELs add a new layer for each task the LLM encounters. This layer expansion allows the network to accommodate new information without disrupting what it has already learned.

Figure 4: OFEL diagram, https://www.mdpi.com/2073-4425/10/7/553

OFELs involve modifying the architecture of the network itself. For each new task, a new layer is added to the neural network instead of retraining the entire network. This change in architecture helps to encapsulate the knowledge required for the new task within that specific layer, minimizing the impact on the pre-existing weights of the old layers.

h_new = g(W_old · x_old + W_new · x_new + b)

where g is the activation function. The architecture of OFELs is designed to allow the inclusion of a new layer dedicated to the new task, which means the network can process new inputs (x_new) independently of the old inputs (x_old). In essence, while the equation presents a comprehensive view of the underlying process within the architecture, during inference or prediction for a new task we would typically use only x_new and not require x_old.
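As a rough illustration of this scheme (my own minimal reading of layer expansion, not code from the cited paper; the frozen trunk, layer sizes, and optimizer are all assumptions), only the new task-specific layer and head are trainable:

```python
import torch
import torch.nn as nn

# The trunk stands in for layers learned on earlier tasks; it stays frozen.
trunk = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
for p in trunk.parameters():
    p.requires_grad_(False)

class TaskExpansion(nn.Module):
    """A new layer g(W·h + b) plus a head, added for one new task."""
    def __init__(self, hidden=32, out_dim=2):
        super().__init__()
        self.layer = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x_new):
        h = trunk(x_new)                       # frozen features from old layers
        return self.head(torch.relu(self.layer(h)))

new_task = TaskExpansion()
optimizer = torch.optim.Adam(new_task.parameters(), lr=1e-3)  # only new weights train
print(new_task(torch.randn(4, 16)).shape)      # torch.Size([4, 2])
```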

By selectively optimizing only the new layers, OFELs strike a delicate balance between acquiring knowledge related to the new task and preserving previously learned information. This meticulous optimization allows the model to adapt to novel challenges while retaining its ability to leverage prior knowledge, ultimately facilitating more robust and versatile learning.

Forward Learnings

Forgetting, whether in humans or LLMs, is an interesting paradox. On one hand, it can be an obstacle to continuous learning and adaptability. On the other, it is an inherent part of how our brains and AI models manage and prioritize information. Strategies to counter catastrophic forgetting (Elastic Weight Consolidation, Progressive Neural Networks, and Optimized Fixed Expansion Layers) provide insightful yet diverse methodologies for preserving the retention capabilities of Large Language Models. Each offers distinct solutions, reflecting the resourcefulness and adaptability that the field of artificial intelligence must consistently embody. However, it is crucial to understand that the problem of catastrophic forgetting is not fully solved; there are still untapped avenues in this area demanding rigorous exploration, innovation, and creativity.

Addressing the challenge of catastrophic forgetting propels us not only toward more efficient AI systems but also toward a deeper understanding of learning and forgetting, a cognitive function shared by humans and machines alike. It therefore becomes an actionable imperative for researchers, scientists, practitioners, and anyone fascinated by the workings of intelligence to contribute to this ongoing dialogue. The quest to tame catastrophic forgetting is not merely an academic pursuit, but a journey that promises to redefine our understanding of memory and learning.
