An In-Depth Exploration of Model Merging: Combining Fine-Tuned Foundation Models Such as LLMs
Authors: Elahe Aghapour, Salar Rahili
The fields of computer vision and natural language processing are evolving rapidly, creating a growing demand for specialized models fine-tuned for specific downstream tasks. However, maintaining a separate fine-tuned model for each task has several drawbacks:
1. For each task, a separate model must be stored and deployed (this issue can be mitigated by applying parameter-efficient fine-tuning methods such as LoRA).
2. Independently fine-tuned models cannot benefit from information in related tasks, which limits their generalization on both in-domain and out-of-domain tasks. Multi-task learning can address this, but it requires access to a dataset for each task, and integrating these datasets can be complicated. What if we do not have access to datasets for all downstream tasks, but the fine-tuned models are available? Imagine you need a large language model (LLM) fine-tuned on a set of specific tasks. Instead of collecting extensive datasets for the downstream tasks and undergoing the resource-heavy process of fine-tuning, you can find LLMs fine-tuned on each task and merge these models to create the desired one. Note that finding such models is not difficult within the large Hugging Face repository, which hosts roughly 0.5 million fine-tuned models. Merging multiple models has recently gained significant attention, primarily because it requires only lightweight computation and no training data.
With the growing interest in merging, public libraries such as WebUI and MergeKit have been developed to facilitate the process. WebUI enables merging fine-tuned models, such as Stable Diffusion variants, using different merging techniques. MergeKit is an open-source, centralized library that offers a range of merging methods; its efficient implementation makes model merging applicable on any hardware.
Here, we categorize merging methods into three main categories:
1. merging models with identical architectures and initializations,
2. merging models with identical architectures but different initializations,
3. merging models with different architectures.
Each category involves different techniques for combining models effectively, which are explained below.
1. Merging Models With Identical Architectures and Initializations
1.a Merging With No Data Requirement:
The model merging methods in this section are all based on Linear Mode Connectivity (LMC) [12]. LMC suggests that, for models with identical architecture and initialization, their checkpoints are connected by a low-loss linear path in the loss landscape. This means these models can be combined using linear interpolation.
To fine-tune a model, various configurations, such as different learning rates, random seeds, and data augmentation techniques, can be applied, each resulting in different model parameters. Model soup [1] proposes averaging these parameters, since the models have learned similar representations and are close in parameter space. Weighted model averaging leads to a flat local optimum with better generalization to out-of-distribution tasks [see 13, 14].
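A minimal sketch of a uniform model soup, assuming all checkpoints share the same architecture and state-dict keys (the function name is ours, not from the paper):

```python
import torch

def uniform_soup(state_dicts):
    """Average the parameters of models fine-tuned from the same initialization."""
    merged = {}
    for key in state_dicts[0]:
        # Stack the copies of this parameter across models and take the mean
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged
```

A weighted soup would replace the mean with a convex combination whose coefficients are tuned on a validation set.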
SLERP (Spherical Linear Interpolation, first introduced in [11]) is a method commonly used in computer graphics and animation for smoothly interpolating between rotations represented by quaternions. SLERP is also applicable to model merging. It merges two sets of model parameters by interpolating along a spherical path instead of a straight line. Fig. 2 shows that, for two given model parameter vectors p1 and p2, SLERP merges them along the sphere's surface, providing a smooth transition. This method is commonly used to merge LLMs.
Assume two MLP models are given, each fine-tuned on a different downstream task. SLERP can merge these two models using the following steps:
Step 1: For each model, flatten its parameters and concatenate them into a vector (v1 and v2, respectively).
Step 2: Normalize the vectors v1 and v2 so they lie on the unit hypersphere (resulting in v1′ and v2′).
Step 3: Calculate the angle θ (in radians) between these two vectors.
Step 4: Calculate Vslerp using the SLERP formula:
Vslerp = [sin((1 − t)θ) / sin(θ)] v1′ + [sin(tθ) / sin(θ)] v2′
where t is the interpolation parameter: t = 0 means only Model 1 is used, while t = 1 means only Model 2 is used.
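The four steps translate almost directly into code. Below is a sketch using our own naming; in practice the merged vector would still have to be reshaped back into per-layer tensors, which we omit:

```python
import torch

def slerp_merge(params1, params2, t=0.5, eps=1e-7):
    # Step 1: flatten and concatenate each model's parameters
    v1 = torch.cat([p.flatten() for p in params1])
    v2 = torch.cat([p.flatten() for p in params2])
    # Step 2: normalize onto the unit hypersphere
    v1_n, v2_n = v1 / v1.norm(), v2 / v2.norm()
    # Step 3: angle theta (in radians) between the normalized vectors
    theta = torch.acos(torch.clamp(torch.dot(v1_n, v2_n), -1.0, 1.0))
    # Step 4: the SLERP formula; fall back to linear interpolation if theta ~ 0
    if theta < eps:
        return (1 - t) * v1_n + t * v2_n
    return (torch.sin((1 - t) * theta) * v1_n
            + torch.sin(t * theta) * v2_n) / torch.sin(theta)
```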
Linear weight-averaging techniques, such as model soup and SLERP, have been common in the field of computer vision, from image processing and classification models to image generation models such as latent diffusion models.
Task arithmetic [2] introduces a technique based on task vectors. A task vector is calculated by subtracting the weights of a pretrained model (θpre) from the weights of the same model fine-tuned for a specific task (θft), as
τ = θft − θpre. This vector represents a direction in the weight space of the pretrained model; moving in that direction enhances performance on that task. Task vectors can be combined through arithmetic operations such as negation and addition. Negating a task vector (θpre − τ) reduces the model's performance on the target task (forgetting) with minimal impact on control tasks. To enhance the performance of the pretrained model across multiple tasks, we can first learn a task vector for each task and then sum them (θpre + ∑τi), improving the model's ability to handle multiple tasks concurrently.
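A small sketch of both operations on state dicts (the helper names are ours):

```python
def task_vector(theta_ft, theta_pre):
    """tau = theta_ft - theta_pre, computed entry-wise over the state dict."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def add_task_vectors(theta_pre, taus, lam=1.0):
    """theta_pre + lam * sum_i tau_i; pass lam < 0 with a single tau for forgetting."""
    merged = {k: v.clone() for k, v in theta_pre.items()}
    for tau in taus:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged
```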
TIES [3] addresses the performance drops caused by parameter interference when combining task vectors (∑τi). It resolves this issue through three steps (see Fig. 3; a sketch follows the list):
(1) trim each task vector, keeping only the top-k% (typically k = 20) largest-magnitude values,
(2) for each non-zero parameter, elect the sign with the highest total magnitude across all task vectors to avoid conflicting changes, and
(3) merge values only from the task vectors whose sign agrees with the elected one.
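A sketch of the three steps on flattened task vectors (naming is ours; the paper includes further details we omit):

```python
import torch

def ties_merge(task_vectors, k=0.20):
    stacked = torch.stack(task_vectors)  # shape: (num_tasks, num_params)
    # (1) Trim: zero out all but the top-k% largest-magnitude entries per vector
    n_keep = max(1, int(k * stacked.shape[1]))
    for row in stacked:
        threshold = row.abs().topk(n_keep).values.min()
        row[row.abs() < threshold] = 0.0
    # (2) Elect: per parameter, the sign with the highest total magnitude wins
    elected_sign = torch.sign(stacked.sum(dim=0))
    # (3) Disjoint merge: average only the entries agreeing with the elected sign
    agrees = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    counts = agrees.sum(dim=0).clamp(min=1)
    return (stacked * agrees).sum(dim=0) / counts
```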
DARE [4] focuses mainly on merging LLMs and identifies the extreme redundancy in the task vector (τ = θft − θpre). It proposes a three-step approach (sketched in code below):
1. Randomly drop a fraction p (typically p = 0.9, i.e., 90%) of the task vector values,
2. Rescale the remaining ones by a factor of 1/(1 − p), and
3. Merge as θpre + ∑ λi τi,
where λi is a scaling term representing the importance of each task vector being merged.
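A minimal sketch of the drop-and-rescale step on a flattened task vector (our naming):

```python
import torch

def dare(tau, p=0.9):
    """Randomly drop a fraction p of tau's entries and rescale the survivors."""
    keep = (torch.rand_like(tau) > p).float()  # keep each entry with prob. 1 - p
    return tau * keep / (1.0 - p)              # rescale by a factor of 1 / (1 - p)

# Merging then follows the formula above: theta_pre + sum_i lambda_i * dare(tau_i)
```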
1.b Merging With Data Requirement:
The merging methods discussed above require no data. However, some approaches need data to determine the optimal weights for merging the parameters. These methods use data to compute activations and then adjust the weights accordingly.
One such approach is Fisher Merging [5]. Given K fine-tuned models, each trained on a different downstream task starting from the same pretrained checkpoint, Fisher Merging performs a weighted summation of each model's parameters. The weights are calculated using the Fisher information matrix, whose construction requires some data from each task.
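A sketch of the merge itself, assuming diagonal Fisher estimates have already been computed from a sample of each task's data (names are ours):

```python
def fisher_merge(state_dicts, fishers, eps=1e-8):
    """Per-parameter weighted average, weighted by each model's diagonal Fisher."""
    merged = {}
    for key in state_dicts[0]:
        weighted = sum(f[key] * sd[key] for sd, f in zip(state_dicts, fishers))
        normalizer = sum(f[key] for f in fishers) + eps
        merged[key] = weighted / normalizer
    return merged
```

The diagonal Fisher for each model is typically estimated as the expected squared gradient of the log-likelihood over that task's data.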
In a related development, RegMean [6] significantly outperforms Fisher-weighted merging by recasting model merging as a linear regression problem. This method derives closed-form solutions for the weights of linear layers and interpolates the other weights (such as layer normalization and bias terms) evenly. Given K fine-tuned models and some data Xi, i = 1, …, K, for each task, the linear layers of the merged model can be determined as follows:
WM = (∑i XiᵀXi)⁻¹ ∑i (XiᵀXi Wi)
where Wi is the linear layer from the i-th fine-tuned model and Xi is the input activation to that layer on the i-th task's data.
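A sketch of this closed form for a single linear layer (the full method also down-weights the off-diagonal entries of each Gram matrix, which we omit here):

```python
import torch

def regmean_linear(weights, activations):
    """weights: list of (d_in, d_out) matrices W_i from the fine-tuned models;
    activations: list of (n_i, d_in) layer inputs X_i from each task's data."""
    grams = [X.T @ X for X in activations]            # G_i = X_i^T X_i
    lhs = sum(grams)                                  # sum_i G_i
    rhs = sum(G @ W for G, W in zip(grams, weights))  # sum_i G_i W_i
    return torch.linalg.solve(lhs, rhs)               # (sum G_i)^-1 (sum G_i W_i)
```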
2. Merging Models With Identical Architectures and Different Initializations
Given models with the same architecture and training dataset but different initializations, simple merging methods like linear model combination often fail to perform well. The main reason is that the weights of the models are not aligned. Hence, researchers have developed techniques that leverage the permutation symmetry of neural networks: by reordering the neurons of the models, their weights can be aligned better, which makes the merging process more effective.
Git-Rebasin [7] suggests permuting the weights of one model to match the configuration of another. Assume two models, A and B, are given with the same architecture and training dataset, but their initializations and training data orders were different. The weights of each network can be permuted without changing its functionality, which means that swapping neurons in hidden layers can result in functionally equivalent models.
They formulated this as an optimization problem to identify the optimal permutations of units across layers that align the two models' parameters in the weight space. This alignment ensures that the models lie in a similar "basin" of the loss landscape, which leads to a smooth and effective merging. To this end, Git-Rebasin proposes the following three steps (sketched in code after the list):
1. For each layer, the problem of finding the best permutation is formulated as a Linear Assignment Problem (LAP). This step involves computing a matrix of activations and finding the permutation matrix that aligns the activations.
2. Given the optimal permutations for all layers, the weights of model B are permuted.
3. A linear combination of the permuted weights of model B with the weights of model A lies within a low-loss basin of the loss landscape, which ensures that the merged model performs well.
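A sketch of the activation-matching variant for a single layer; a complete implementation must also apply the inverse permutation to the following layer's input dimension, which we omit:

```python
import torch
from scipy.optimize import linear_sum_assignment

def align_units(acts_a, acts_b):
    """Solve the LAP of step 1: permute model B's hidden units so their
    activations best match model A's (acts_*: samples x units)."""
    similarity = (acts_a.T @ acts_b).numpy()  # units_A x units_B correlation
    _, perm = linear_sum_assignment(similarity, maximize=True)
    return torch.as_tensor(perm)              # perm[i] = B's unit matched to A's unit i

def merge_aligned(w_a, w_b, perm, t=0.5):
    """Steps 2-3: permute B's output units, then interpolate linearly."""
    return (1 - t) * w_a + t * w_b[perm]      # w_*: (out_features, in_features)
```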
REPAIR [8] addresses a critical issue in the Re-basin merging method known as variance collapse, in which the hidden units of the interpolated network have significantly smaller activation variance than the corresponding units of the original networks. As a result, the activations of neurons become nearly constant in deeper layers, and the network loses the ability to distinguish between inputs. REPAIR resolves this issue by rescaling the activations of the interpolated network to match the statistical properties of the original networks. By adjusting the means and variances of the activations, the interpolated network maintains functional variability throughout its layers. Applying REPAIR significantly reduces the interpolation barrier and improves the performance of interpolated models.
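A sketch of the per-unit statistic reset for one layer; in the paper this correction is folded back into the network's weights or normalization layers rather than applied to activations at run time:

```python
import torch

def repair_layer(acts_interp, acts_a, acts_b, t=0.5, eps=1e-8):
    """Rescale each hidden unit of the interpolated network so its mean/std
    match the interpolation of the endpoint networks' statistics."""
    target_mean = (1 - t) * acts_a.mean(dim=0) + t * acts_b.mean(dim=0)
    target_std = (1 - t) * acts_a.std(dim=0) + t * acts_b.std(dim=0)
    z = (acts_interp - acts_interp.mean(dim=0)) / (acts_interp.std(dim=0) + eps)
    return z * target_std + target_mean
```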
3. Merging Models With Different Architectures
In contrast to the methods discussed so far, Frankenmerging [9] does not fuse models into a single one; instead, it stacks layers of different models sequentially. It is therefore able to merge models with different architectures.
For instance, to construct an LLM with 40 layers, one might stack the first 24 layers from one LLM onto layers 25–40 from another LLM. This method has gained significant attention for style transfer in computer vision. Despite requiring a lot of trial and error and experimentation, it has led to impressive LLMs such as Goliath and Solar-10.7B.
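A purely illustrative sketch (the `.layers` attribute and the function are hypothetical; real tools such as MergeKit also handle embeddings, final norms, and tokenizers):

```python
import torch.nn as nn

def frankenmerge(model_a, model_b, cut=24, total=40):
    """Stack model A's first `cut` blocks on top of model B's blocks
    cut..total (assumes compatible hidden sizes between the models)."""
    blocks = list(model_a.layers[:cut]) + list(model_b.layers[cut:total])
    return nn.Sequential(*blocks)
```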
EvolutionaryOptimization [10] proposes a framework to automatically merge a given set of foundation models, such that the merged model outperforms any individual model in the set. This approach involves two main phases (see Fig. 4):
In the first phase, this method uses TIES-Merging with DARE for layer-wise merging of N foundation models. The process is optimized by an evolutionary algorithm guided by task-specific metrics (e.g., accuracy for MGSM, ROUGE score for VQA). To find unknown variables such as the drop percentages in DARE and the weights of each model's parameters in merging, the evolutionary optimization starts with a group of possible solutions that evolve over time. Through mutation (small random changes) and crossover (combining parts of two solutions), the best solutions are selected to create a new group of candidates. This iterative process leads to progressively better solutions.
In the second phase, given a set of N models, the goal is to find an optimal model with T layers using Frankenmerging. To reduce the search space and make the optimization tractable, all layers are laid out in sequential order (i.e., all layers of the i-th model followed by those of the (i+1)-th model) and repeated r times. The goal in this phase is to find an optimal indicator that determines the inclusion or exclusion of layers: if Indicator(i) > 0, the i-th layer is included in the merged model; otherwise, it is excluded (a sketch of this selection follows).
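A sketch of the phase-2 search space (the `.layers` attribute is hypothetical; the evolutionary search would tune the `indicator` array):

```python
def select_layers(models, indicator, r=1):
    """Lay out all models' layers in sequence, repeated r times, and keep
    layer i iff indicator[i] > 0, following the inclusion rule above."""
    pool = [layer for _ in range(r) for m in models for layer in m.layers]
    return [layer for layer, ind in zip(pool, indicator) if ind > 0]
```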
The EvolutionaryOptimization process begins by applying the first phase to a collection of models. The merged model from this step is then added to the collection, and the second phase is applied to this enlarged collection to find an optimal indicator that selects T layers for the final merged model. This approach was applied to merge a Japanese LLM with an English math LLM to build a Japanese math LLM. The merged model achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even outperforming models with significantly more parameters, despite not being trained for such tasks.
The opinions expressed in this blog post are solely our own and do not reflect those of our employer.
Also Read Our Previous Post: From Unimodals to Multimodality: DIY Techniques for Building Foundational Models
References:
[1] Model soup: Wortsman, Mitchell, et al. “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.” (2022).
[2] Task arithmetic: Ilharco, Gabriel, et al. “Editing models with task arithmetic.” (2022).
[3] TIES: Yadav, Prateek, et al. “Ties-merging: Resolving interference when merging models.” (2024).
[4] DARE: Yu, Le, et al. “Language models are super mario: Absorbing abilities from homologous models as a free lunch.” (2024).
[5] Fisher Merging: Matena, Michael S., et al. “Merging models with fisher-weighted averaging.” (2022).
[6] RegMean: Jin, Xisen, et al. “Dataless knowledge fusion by merging weights of language models.” (2022).
[7] Git-Rebasin: Ainsworth, Samuel K., et al. “Git re-basin: Merging models modulo permutation symmetries.” (2022).
[8] REPAIR: Jordan, Keller, et al. “Repair: Renormalizing permuted activations for interpolation repair.” (2022).
[9] Frankenmerging: Charles O. Goddard. 2024. mergekit.
[10] EvolutionaryOptimization: Akiba, Takuya, et al. “Evolutionary optimization of model merging recipes.” (2024).
[11] Shoemake, Ken. “Animating rotation with quaternion curves.” (1985).
[12] LMC: Nagarajan, Vaishnavh, et al. “Uniform convergence may be unable to explain generalization in deep learning.” (2019).
[13] Kaddour, Jean, et al. “When do flat minima optimizers work?.” (2022)
[14] Petzka, Henning, et al. “Relative flatness and generalization.” (2021)