Model merging has quickly become the de facto standard for pushing the performance limits of large language models. On the Open LLM Leaderboard, we continue to see merged models topping the charts. Our very own Omar Sanseviero did a little sprint on model merging and discovered interesting findings.
The typical way of model merging, so far, has been to take a set of models and merge them. This post gives a nice primer on the topic. Generally, for merging multiple models, we first download their checkpoints and then perform the merge. Depending on the merge algorithm and the sizes of the underlying models, this process can be quite memory-intensive. The mergekit library provides optimized ways of handling this, making the process manageable on limited memory.
But what if we wanted to merge different “adapters” obtained from the same model? You might have four different LoRA checkpoints obtained from the same base model, and you want to experiment with different merging techniques. Eventually, you want to settle on the best merge, giving you the best results for your task. A couple of things become evident when approaching such a developer experience:
- When dealing with adapters such as LoRA, it’s common for users to swap different adapters in and out, or even combine them. Adapters can be activated, de-activated, or completely swapped out of memory. Therefore, we need to do the “merging” part on the fly (as opposed to the method described above) to provide a seamless experience to users.
- Different adapters might have different requirements for merging. The merging algorithm for LoRA might not translate equally well to IA3, for example.
With these aspects in mind, we shipped new merging methods targeting the popular LoRA adapters in 🤗 PEFT. In this post, we want to take you through the available methods, code examples to help you get cracking, impressive results, and our future plans. Let’s get started 🚀
Table of contents
Methods for combining/merging LoRA adapters
Concatenation (cat)
In this method, the LoRA matrices are concatenated. For example, if we have 2 LoRA adapters $(A_1, B_1)$ and $(A_2, B_2)$ along with weights $w_1$ and $w_2$ for weighted merging of these two adapters, then the merging happens as follows:

$$A_{merged} = \mathrm{concat}(w_1 A_1,\; w_2 A_2,\; \mathrm{dim}=0), \qquad B_{merged} = \mathrm{concat}(B_1,\; B_2,\; \mathrm{dim}=1)$$

where $\mathrm{rank}(A_{merged}) = \mathrm{rank}(A_1) + \mathrm{rank}(A_2)$.
Now, the output of this new merged LoRA layer is as if the original 2 LoRAs were active with weights $w_1$ and $w_2$ applied to the first and second adapters, respectively:

$$h = W_0 x + B_{merged} A_{merged}\, x$$

Here, we can observe that:

$$B_{merged} A_{merged} = w_1 B_1 A_1 + w_2 B_2 A_2$$

🧠 This is the exact weighted merging of the LoRA adapters. It is also what happens when you call set_adapters(), wherein instead of creating a new merged adapter, the active adapters are combined sequentially, as shown on the right-hand side of the above equation. When using this method, the participating LoRA adapters are allowed to have different ranks.
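To make this equivalence concrete, here is a minimal PyTorch sketch (not the PEFT implementation) that merges two hypothetical LoRA factor pairs via concatenation and checks that the merged product matches the weighted sum of the individual delta weights:

```python
import torch

d_in, d_out, r1, r2 = 16, 32, 4, 8
w1, w2 = 0.7, 0.3

# Two hypothetical LoRA adapters with different ranks
A1, B1 = torch.randn(r1, d_in), torch.randn(d_out, r1)
A2, B2 = torch.randn(r2, d_in), torch.randn(d_out, r2)

# cat-style merge: stack the weighted A factors along the rank dimension,
# and the B factors along the column dimension
A_merged = torch.cat([w1 * A1, w2 * A2], dim=0)   # (r1 + r2, d_in)
B_merged = torch.cat([B1, B2], dim=1)             # (d_out, r1 + r2)

# The merged delta weight equals the weighted sum of the individual deltas
delta_merged = B_merged @ A_merged
delta_expected = w1 * (B1 @ A1) + w2 * (B2 @ A2)
assert torch.allclose(delta_merged, delta_expected, atol=1e-5)
```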
Linear/Task Arithmetic (linear)
In this method, the LoRA matrices are combined via a weighted sum. This is what the Task Arithmetic paper implements on task weights. In task arithmetic, one first computes the task weights, which are the differences between the fine-tuned weights and the base model weights, and then does a weighted sum of these task weights. Here, the delta weights considered are the individual matrices $A$ and $B$ instead of their product $BA$. This method can be applied only when all the participating LoRA adapters have the same rank.
Let’s go through an example. Consider 2 LoRA adapters $(A_1, B_1)$ and $(A_2, B_2)$ along with weights $w_1$ and $w_2$ for weighted merging of these two adapters; then the merging happens as follows:

$$A_{merged} = \sqrt{w_1}\, A_1 + \sqrt{w_2}\, A_2, \qquad B_{merged} = \sqrt{w_1}\, B_1 + \sqrt{w_2}\, B_2$$
For more details, please refer to the paper: Editing Models with Task Arithmetic.
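As a rough sketch of this weighted sum on two hypothetical equal-rank adapters (not the PEFT implementation; note that, unlike cat, the product of the summed factors also contains cross terms between the adapters):

```python
import torch

d_in, d_out, r = 16, 32, 8   # linear requires all adapters to share the same rank
w1, w2 = 0.7, 0.3

A1, B1 = torch.randn(r, d_in), torch.randn(d_out, r)
A2, B2 = torch.randn(r, d_in), torch.randn(d_out, r)

# Weighted sum of the LoRA factors themselves; sqrt(w_i) so that each
# adapter's own product B_i A_i ends up scaled by w_i
A_merged = w1 ** 0.5 * A1 + w2 ** 0.5 * A2
B_merged = w1 ** 0.5 * B1 + w2 ** 0.5 * B2

# Unlike cat, B_merged @ A_merged is not exactly w1*B1@A1 + w2*B2@A2:
# cross terms between the adapters remain, which is the price of keeping
# the merged adapter at rank r.
delta_merged = B_merged @ A_merged
```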
SVD (svd)
Instead of considering the individual matrices $A$ and $B$ as task weights, their product $\Delta W = BA$, which is the actual delta weight, is considered the task weight.
Let’s continue with the example from the previous sub-sections. Here, first, the delta weight of the merged combination is computed as follows:

$$\Delta W_{merged} = w_1\, B_1 A_1 + w_2\, B_2 A_2$$

After getting the above merged delta weight, SVD (singular value decomposition) is applied to get its low-rank approximation in terms of $B_{merged}$ and $A_{merged}$:

$$\Delta W_{merged} \approx B_{merged}\, A_{merged}$$
🧠 Similar to the cat method, this method also allows for LoRA adapters with different ranks. In addition, one can choose the rank for the resulting merged LoRA adapter, which defaults to the maximum rank among the participating LoRA adapters. A limitation of this approach is that it requires a lot of GPU memory for performing the SVD operation.
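A minimal sketch of the idea on hypothetical adapters, using torch.linalg.svd (the actual PEFT implementation differs in details such as weight clamping):

```python
import torch

d_in, d_out, r1, r2 = 16, 32, 4, 8
w1, w2 = 0.7, 0.3
new_rank = max(r1, r2)   # default: maximum rank among the participating adapters

A1, B1 = torch.randn(r1, d_in), torch.randn(d_out, r1)
A2, B2 = torch.randn(r2, d_in), torch.randn(d_out, r2)

# 1. Build the merged delta weight from the full products B_i @ A_i
delta_merged = w1 * (B1 @ A1) + w2 * (B2 @ A2)

# 2. Truncated SVD to recover low-rank factors of the requested rank
U, S, Vh = torch.linalg.svd(delta_merged, full_matrices=False)
B_merged = U[:, :new_rank] * S[:new_rank]   # (d_out, new_rank)
A_merged = Vh[:new_rank, :]                 # (new_rank, d_in)

# B_merged @ A_merged is the best rank-`new_rank` approximation of delta_merged
```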
TIES (ties, ties_svd)
This builds upon the linear and svd methods by changing the way the merged adapter is computed from the task weights, resulting in the ties and ties_svd methods, respectively. In TIES (TRIM, ELECT SIGN & MERGE), one first computes the task weights, which in our case are the LoRA adapters $A$, $B$ for the non-svd variant and their product $BA$ for the svd variant. After this, you prune the smallest values of the task weights and retain the top-k values based on the specified fraction density. Then, you calculate the majority sign mask from the participating pruned task weights, multiply the task tensors by the user-provided weightages, and finally perform a disjoint merge based on the majority sign mask. For the majority sign mask computation, you have two options:
- `total` considers the magnitude as well as the sign to get the majority sign, i.e., sum up all the corresponding weights;
- `frequency` only considers the weight sign to obtain the majority sign, i.e., sum up the sign of all the corresponding weights.
For more details, refer to the paper: TIES-Merging: Resolving Interference When Merging Models.
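The following is an illustrative sketch of the TIES steps on plain tensors, using a hypothetical `ties_merge` helper; it is not the PEFT implementation:

```python
import torch

def ties_merge(task_tensors, weights, density=0.5, majority_sign_method="total"):
    """Illustrative TIES merge of same-shaped task tensors (not the PEFT implementation)."""
    pruned = []
    for t in task_tensors:
        # 1. Trim: keep only the top-k values by magnitude, zero the rest
        k = int(density * t.numel())
        threshold = t.abs().flatten().kthvalue(t.numel() - k + 1).values
        pruned.append(torch.where(t.abs() >= threshold, t, torch.zeros_like(t)))

    weighted = [w * t for w, t in zip(weights, pruned)]
    stacked = torch.stack(weighted)

    # 2. Elect sign: majority sign per parameter
    if majority_sign_method == "total":
        majority_sign = stacked.sum(dim=0).sign()         # magnitude and sign
    else:  # "frequency"
        majority_sign = stacked.sign().sum(dim=0).sign()  # sign only

    # 3. Disjoint merge: average only the entries agreeing with the majority sign
    mask = (stacked.sign() == majority_sign) & (stacked != 0)
    counts = mask.sum(dim=0).clamp(min=1)
    return (stacked * mask).sum(dim=0) / counts

# Usage with two hypothetical task tensors
t1, t2 = torch.randn(32, 16), torch.randn(32, 16)
merged = ties_merge([t1, t2], weights=[1.0, 1.0], density=0.5, majority_sign_method="frequency")
```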
DARE (dare_linear, dare_ties, dare_linear_svd, dare_ties_svd)
This also builds upon the linear and svd methods wherein the task weights are the LoRA adapters $A$, $B$ for the non-svd variant and their product $BA$ for the svd variant. The DARE method, proposed in Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, first randomly prunes the values of the task weights based on the specified fraction 1-density, and then rescales the pruned task weights by 1/density. DARE is a general plug-in and can be applied to any existing model merging method. We have implemented DARE with Linear/Task Arithmetic (`*_linear`) and TIES (`*_ties`).
For the `*_linear` variants of DARE, we first use DARE to randomly prune the task weights and then perform a weighted sum of the task tensors based on the user-specified weightages for the participating LoRA adapters.
For the `*_ties` variants of DARE, we first use DARE to get the pruned task weights, then adopt the last 2 steps of ties, i.e., calculating the majority sign mask and using the mask to perform a disjoint merge of the task weights.
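A minimal sketch of the DARE pruning step and the dare_linear combination on hypothetical task tensors (not the PEFT implementation):

```python
import torch

def dare_prune(task_tensor, density=0.7):
    """Illustrative DARE step (not the PEFT implementation): randomly drop a
    fraction (1 - density) of the entries, then rescale the survivors by 1/density."""
    mask = torch.bernoulli(torch.full_like(task_tensor, density))
    return task_tensor * mask / density

# dare_linear on two hypothetical task tensors: DARE-prune, then weighted sum
t1, t2 = torch.randn(32, 16), torch.randn(32, 16)
w1, w2 = 0.5, 0.5
merged = w1 * dare_prune(t1) + w2 * dare_prune(t2)

# dare_ties would instead feed the DARE-pruned tensors into the sign-election
# and disjoint-merge steps sketched in the TIES example above.
```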
Magnitude Prune (magnitude_prune, magnitude_prune_svd)
This also builds upon the linear and svd methods wherein the task weights are the LoRA adapters $A$, $B$ for the non-svd variant and their product $BA$ for the svd variant. In this method, you first prune the smallest values of the task weights and retain the top-k values based on the specified fraction density. Then, you perform a weighted sum of the task tensors based on the user-specified weightages for the participating LoRA adapters.
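Sketched on hypothetical task tensors (not the PEFT implementation), this amounts to:

```python
import torch

def magnitude_prune(task_tensor, density=0.7):
    """Illustrative magnitude pruning (not the PEFT implementation): keep the
    top-k entries by absolute value and zero out the rest."""
    k = int(density * task_tensor.numel())
    threshold = task_tensor.abs().flatten().kthvalue(task_tensor.numel() - k + 1).values
    return torch.where(task_tensor.abs() >= threshold, task_tensor, torch.zeros_like(task_tensor))

# Prune each hypothetical task tensor, then take the weighted sum
t1, t2 = torch.randn(32, 16), torch.randn(32, 16)
merged = 0.6 * magnitude_prune(t1) + 0.4 * magnitude_prune(t2)
```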
How do I merge my LoRA adapters?
In PEFT, when using LoRA, you can use the add_weighted_adapter() method to try the different combining methods. For example, below you can see how we can combine three LoRA adapters using the ties method, and the resulting generations from the new merged adapter. We can observe that the merged adapter is able to retain the capabilities of the individual adapters:
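A sketch of what such a merge looks like with the PEFT API; the base model and adapter repositories below are placeholders rather than the exact checkpoints behind the generations described here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Load three LoRA adapters trained on top of the same base model (placeholder repos)
model = PeftModel.from_pretrained(model, "user/lora-adapter-1", adapter_name="adapter_1")
model.load_adapter("user/lora-adapter-2", adapter_name="adapter_2")
model.load_adapter("user/lora-adapter-3", adapter_name="adapter_3")

# Combine them with the `ties` method into a new adapter called "merge"
model.add_weighted_adapter(
    adapters=["adapter_1", "adapter_2", "adapter_3"],
    weights=[2.0, 1.0, 1.0],
    combination_type="ties",
    density=0.2,
    adapter_name="merge",
)
model.set_adapter("merge")
```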
You can find the above example in the PEFT repo’s examples.
Let’s take another example, as shown below, using the magnitude_prune method and the resulting generations.
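Assuming a PeftModel with a Hinglish adapter and a mental-health adapter already loaded (the adapter names below are illustrative), the merge itself would look like this:

```python
# `model` is a PeftModel with two adapters loaded, e.g. via
# model.load_adapter(..., adapter_name="hinglish") and
# model.load_adapter(..., adapter_name="mental_health")
model.add_weighted_adapter(
    adapters=["hinglish", "mental_health"],
    weights=[1.0, 1.0],
    combination_type="magnitude_prune",
    density=0.7,
    adapter_name="merge",
)
model.set_adapter("merge")
```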
Now, what if we want to use the merged adapter’s capability to answer a mental health related question in Hinglish? This would require capabilities from both adapters to be used. Below we can see the results on the query “Sad feelings ko kaise dur kare?” (Translation: How to get rid of sad feelings?). When all adapters are disabled and the base model is used, the response starts with it being an AI, followed by general suggestions. When the hinglish adapter is enabled, the response is in Hinglish and short, following the fine-tuning data, but it doesn’t do a good job of giving concrete suggestions to help overcome sadness. When the mental_health adapter is enabled, the response is akin to what a human would say, but sadly it isn’t in Hinglish. When the merge adapter is enabled, we can see that the response is in Hinglish and short while giving concrete suggestions that can also be found in the response of the mental_health adapter, such as exercising, spending time with friends, reading, meditation, and focusing on positive thinking. Therefore, we can observe that merging adapters can combine their individual capabilities to support new use cases.
Finally, let’s take the example of dare_linear and check the resulting generations.
We have a dedicated developer guide for these merging methods in PEFT, which you can find here.
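With the same assumptions as above, a dare_linear merge would look like this:

```python
model.add_weighted_adapter(
    adapters=["hinglish", "mental_health"],
    weights=[1.0, 1.0],
    combination_type="dare_linear",
    density=0.7,
    adapter_name="dare_merge",
)
model.set_adapter("dare_merge")
```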
Extending to text-to-image generation
In this section, we show you how to take advantage of these merging methods for text-to-image generation using 🤗 Diffusers. Note that Diffusers already relies on PEFT for everything LoRA-related, including training and inference. However, currently, it’s not possible to benefit from the new merging methods when calling set_adapters() on a Diffusers pipeline. This is why we are openly discussing with the community how best to support it natively from within Diffusers.
But thanks to PEFT, there is always a way to work around this. We will use the add_weighted_adapter() functionality for this. Precisely, these are the steps we will take to combine the “toy-face” LoRA and the “Pixel-Art” LoRA, and experiment with different merging techniques:
- Obtain `PeftModel`s from these LoRA checkpoints.
- Merge the `PeftModel`s using the `add_weighted_adapter()` method with a merging method of our choice.
- Assign the merged model to the respective component of the underlying `DiffusionPipeline`.
Let’s see this in action. All the code shown in the sections below comes from this Colab Notebook.
Since both LoRA checkpoints use the SDXL UNet as their base model, we will first load the UNet:
from diffusers import UNet2DConditionModel
import torch
unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
use_safetensors=True,
variant="fp16",
subfolder="unet",
).to("cuda")
We then load the actual SDXL pipeline and the LoRA checkpoints. We start with the “CiroN2022/toy-face” LoRA:
from diffusers import DiffusionPipeline
import copy
sdxl_unet = copy.deepcopy(unet)
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
variant="fp16",
torch_dtype=torch.float16,
unet=unet
).to("cuda")
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
Now, obtain the PeftModel from the loaded LoRA checkpoint:
from peft import get_peft_model, LoraConfig
toy_peft_model = get_peft_model(
sdxl_unet,
pipe.unet.peft_config["toy"],
adapter_name="toy"
)
original_state_dict = {f"base_model.model.{k}": v for k, v in pipe.unet.state_dict().items()}
toy_peft_model.load_state_dict(original_state_dict, strict=True)
💡 You can optionally push the toy_peft_model to the Hub using: toy_peft_model.push_to_hub("toy_peft_model", token=TOKEN).
Next, we do the same for the “nerijs/pixel-art-xl” LoRA:
pipe.delete_adapters("toy")
sdxl_unet.delete_adapters("toy")
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
pipe.set_adapters(adapter_names="pixel")
pixel_peft_model = get_peft_model(
sdxl_unet,
pipe.unet.peft_config["pixel"],
adapter_name="pixel"
)
original_state_dict = {f"base_model.model.{k}": v for k, v in pipe.unet.state_dict().items()}
pixel_peft_model.load_state_dict(original_state_dict, strict=True)
Now, we are all set for weighted adapter inference! We start by loading all the necessary things:
from peft import PeftModel
from diffusers import UNet2DConditionModel, DiffusionPipeline
import torch
base_unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
use_safetensors=True,
variant="fp16",
subfolder="unet",
).to("cuda")
toy_id = "sayakpaul/toy_peft_model"
model = PeftModel.from_pretrained(base_unet, toy_id, use_safetensors=True, subfolder="toy", adapter_name="toy")
model.load_adapter("sayakpaul/pixel_peft_model", use_safetensors=True, subfolder="pixel", adapter_name="pixel")
Now, combine the LoRA adapters, the moment we have all been waiting for!
model.add_weighted_adapter(
adapters=["toy", "pixel"],
weights=[0.7, 0.3],
combination_type="linear",
adapter_name="toy-pixel"
)
model.set_adapters("toy-pixel")
Here, we are just starting with the “linear” merging strategy, but we will also experiment with other exotic merging algorithms, such as TIES. We finally assign the merged model to our DiffusionPipeline and perform inference:
model = model.to(dtype=torch.float16, device="cuda")
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16,
).to("cuda")
prompt = "toy_face of a hacker with a hoodie, pixel art"
image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
image
Let’s try the ties_svd method. You can find the example notebook here.
pipe.unet.add_weighted_adapter(
    ["teapot", "watercolour"],   # adapters to combine
    [1.0, 1.0],                  # their respective weights
    "merge",                     # name of the new merged adapter
    combination_type="ties_svd",
    density=0.5,
)
Now, let’s try combining two style LoRAs using dare_linear:
model.add_weighted_adapter(
adapters=["toy", "pixel"],
weights=[1.0, 1.0],
combination_type="dare_linear",
adapter_name="merge",
density=0.7
)
Now, let’s try the ties method with majority_sign_method="frequency":
model.add_weighted_adapter(
adapters=["toy", "sticker"],
weights=[1.0, 1.0],
combination_type="ties",
adapter_name="merge",
density=0.5,
majority_sign_method="frequency"
)
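After any of these merges, the workflow is the same as before: activate the merged adapter, hand the UNet to the pipeline, and generate. A short sketch reusing the names from the snippets above:

```python
model.set_adapter("merge")
model = model.to(dtype=torch.float16, device="cuda")

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16,
).to("cuda")

# Illustrative prompt; adjust the trigger words to the adapters you merged
image = pipe("toy_face of a hacker with a hoodie, pixel art", num_inference_steps=30,
             generator=torch.manual_seed(0)).images[0]
```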
Observations
- In most scenarios, the `cat` method will give great results. So, start with that. However, note that if you combine many adapters, the resulting merged adapter can have a large size due to concatenation, leading to OOM. So, when exploring only a few adapters, `cat` would be a good starting point.
- If you want to explore further, or `cat` isn’t working, try `linear`, `magnitude_prune`, and `dare_linear` in that order. For `magnitude_prune` and `dare_linear`, we found that higher `density` values, around 0.7-0.8, work better.
- When using `ties`, we found that in many cases `majority_sign_method="frequency"` performs better than `majority_sign_method="total"` (`total` is currently the default). For `ties`, the default value for `density` is 0.5. You can then try tuning this lower or higher based on your observations after merging the adapters.
- `dare_ties` wasn’t giving good results.
- When working with Stable Diffusion LoRA adapters that have different ranks, you can try the `*_svd` family of methods. Note that these require more GPU memory and take around ~1.5 minutes to create the merged adapter due to the expensive SVD operations. `ties_svd` gave good results when combining `subject` + `style` LoRAs, as seen in an example above. When combining two `style` adapters, `dare_linear` with a high `density`, or `ties` with `majority_sign_method="frequency"`, seems to work better, as seen in the examples above.
Acknowledgements
We are grateful to Le Yu and Prateek Yadav, the authors of DARE and TIES, for their generous feedback and guidance on the PR. To honor their efforts, we have added them as co-authors of the PR. Thanks to Prateek and Le for reviewing the blog post draft as well.
Useful links
- Editing Models with Task Arithmetic
- TIES-Merging: Resolving Interference When Merging Models
- Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
- mergekit: Tools for merging pretrained large language models.
- PEFT integration in Diffusers
- Model merging guide for PEFT users
Citations
@inproceedings{
ilharco2023editing,
title={Editing models with task arithmetic},
author={Gabriel Ilharco and Marco Tulio Ribeiro and Mitchell Wortsman and Ludwig Schmidt and Hannaneh Hajishirzi and Ali Farhadi},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=6t0Kwf8-jrj}
}
@inproceedings{
yadav2023tiesmerging,
title={{TIES}-Merging: Resolving Interference When Merging Models},
author={Prateek Yadav and Derek Tam and Leshem Choshen and Colin Raffel and Mohit Bansal},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=xtaX3WyCj1}
}
@misc{yu2023language,
title={Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch},
author={Le Yu and Bowen Yu and Haiyang Yu and Fei Huang and Yongbin Li},
year={2023},
eprint={2311.03099},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{
mergekit,
author = {Charles O. Goddard and contributors},
title = {mergekit},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/arcee-ai/mergekit}}
}







