LoRA: Low-Rank Adaptation of Large Language Models is a novel technique introduced by Microsoft researchers to address the problem of fine-tuning large language models. Powerful models with billions of parameters, such as GPT-3, are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains. LoRA proposes to freeze pre-trained model weights and inject trainable layers (rank-decomposition matrices) in each transformer block. This greatly reduces the number of trainable parameters and GPU memory requirements, since gradients don't need to be computed for most model weights. The researchers found that by focusing on the Transformer attention blocks of large language models, fine-tuning quality with LoRA was on par with full model fine-tuning while being much faster and requiring less compute.
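The core idea can be sketched in a few lines. In this minimal, framework-agnostic illustration, a frozen weight matrix `W` is augmented with two small trainable matrices `A` and `B` whose product has rank `r`; the dimensions and rank below are arbitrary choices for illustration, not the actual model's:

```python
import numpy as np

# Illustrative dimensions: a 768x768 frozen projection, LoRA rank 4.
d, r = 768, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, initialized to zero

def forward(x, scale=1.0):
    # Output = frozen path + low-rank update; only A and B receive gradients.
    return x @ W.T + scale * (x @ A.T @ B.T)

x = rng.standard_normal((1, d))
# Because B starts at zero, the adapted model initially matches the base model exactly.
assert np.allclose(forward(x), x @ W.T)

# Trainable parameters: 2*d*r instead of d*d.
print(2 * d * r, "trainable vs", d * d, "frozen")  # → 6144 trainable vs 589824 frozen
```

This is why the memory savings are so large: the optimizer only tracks gradients and state for `A` and `B`, a tiny fraction of the frozen weights.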
LoRA for Diffusers 🧨
Even though LoRA was initially proposed for large language models and demonstrated on transformer blocks, the technique can also be applied elsewhere. In the case of Stable Diffusion fine-tuning, LoRA can be applied to the cross-attention layers that relate the image representations with the prompts that describe them. The details of the following figure (taken from the Stable Diffusion paper) are not important; just note that the yellow blocks are the ones in charge of building the relationship between image and text representations.
To the best of our knowledge, Simo Ryu (@cloneofsimo) was the first one to come up with a LoRA implementation adapted to Stable Diffusion. Please do take a look at their GitHub project to see examples and lots of interesting discussions and insights.
In order to inject LoRA trainable matrices as deep in the model as the cross-attention layers, people used to need to hack the source code of diffusers in imaginative (but fragile) ways. If Stable Diffusion has shown us one thing, it's that the community always comes up with ways to bend and adapt the models for creative purposes, and we love that! Providing the flexibility to manipulate the cross-attention layers could be beneficial for many other reasons, such as making it easier to adopt optimization techniques like xFormers. Other creative projects such as Prompt-to-Prompt could do with some easy way to access those layers, so we decided to provide a general way for users to do it. We've been testing that pull request since late December, and it officially launched with our diffusers release yesterday.
We've been working with @cloneofsimo to provide LoRA training support in diffusers, for both Dreambooth and full fine-tuning methods! These techniques provide the following benefits:
- Training is much faster, as already discussed.
- Compute requirements are lower. We could create a full fine-tuned model in a 2080 Ti with 11 GB of VRAM!
- Trained weights are much, much smaller. Because the original model is frozen and we inject new layers to be trained, we can save the weights for the new layers as a single file that weighs in at ~3 MB in size. This is about one thousand times smaller than the original size of the UNet model!
We're particularly excited about the last point. In order for users to share their awesome fine-tuned or dreamboothed models, they had to share a full copy of the final model. Other users that wanted to try them out had to download the fine-tuned weights in their favorite UI, adding up to combined massive storage and download costs. As of today, there are about 1,000 Dreambooth models registered in the Dreambooth Concepts Library, and probably many more not registered in the library.
With LoRA, it is now possible to publish a single 3.29 MB file to allow others to use your fine-tuned model.
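A back-of-the-envelope calculation shows why the checkpoint is so small. The list of projection shapes below is purely illustrative (it does not reproduce the exact Stable Diffusion UNet architecture), but it shows that rank-4 adapters over the attention projections land in the low single-digit megabytes:

```python
# Rough size of a LoRA checkpoint. The layer shapes are illustrative only,
# not the exact Stable Diffusion UNet cross-attention dimensions.

def lora_params(d_in, d_out, rank):
    # A: (rank, d_in), B: (d_out, rank)
    return rank * d_in + d_out * rank

rank = 4
# Hypothetical list of (d_in, d_out) attention projection shapes.
layers = [(320, 320)] * 32 + [(640, 640)] * 32 + [(1280, 1280)] * 64

total = sum(lora_params(i, o, rank) for i, o in layers)
size_mb = total * 4 / 1e6  # float32, 4 bytes per parameter
print(f"{total} params ≈ {size_mb:.1f} MB")  # → 901120 params ≈ 3.6 MB
```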
(h/t to @mishig25, the first person I heard use dreamboothing as a verb in a normal conversation).
LoRA fine-tuning
Full model fine-tuning of Stable Diffusion used to be slow and difficult, and that is part of the reason why lighter-weight methods such as Dreambooth or Textual Inversion have become so popular. With LoRA, it is much easier to fine-tune a model on a custom dataset.
Diffusers now provides a LoRA fine-tuning script that can run in as little as 11 GB of GPU RAM without resorting to tricks such as 8-bit optimizers. This is how you'd use it to fine-tune a model using the Lambda Labs Pokémon dataset:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="/sddata/finetune/lora/pokemon"
export HUB_MODEL_ID="pokemon-lora"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --push_to_hub \
  --hub_model_id=${HUB_MODEL_ID} \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="Totoro" \
  --seed=1337
One thing of note is that the learning rate is 1e-4, much larger than the usual learning rates for regular fine-tuning (in the order of ~1e-6, typically). This is a W&B dashboard of the previous run, which took about 5 hours on a 2080 Ti GPU (11 GB of RAM). I did not attempt to optimize the hyperparameters, so feel free to try it out yourself! Sayak did another run on a T4 (16 GB of RAM). Here's his final model, and here's a demo Space that uses it.
For additional details on LoRA support in diffusers, please refer to our documentation – it will always be kept up to date with the implementation.
Inference
As we've discussed, one of the major advantages of LoRA is that you get excellent results by training orders of magnitude fewer weights than the original model size. We designed an inference process that allows loading the additional weights on top of the unmodified Stable Diffusion model weights. Let's see how it works.
First, we'll use the Hub API to automatically determine the base model that was used to fine-tune a LoRA model. Starting from Sayak's model, we can use this code:
from huggingface_hub import model_info
model_path = "sayakpaul/sd-model-finetuned-lora-t4"
info = model_info(model_path)
model_base = info.cardData["base_model"]
print(model_base)
This snippet will print the model he used for fine-tuning, which is CompVis/stable-diffusion-v1-4. In my case, I trained my model starting from version 1.5 of Stable Diffusion, so if you run the same code with my LoRA model you'll see that the output is runwayml/stable-diffusion-v1-5.
The information about the base model is automatically populated by the fine-tuning script we saw in the previous section, if you use the --push_to_hub option. This is recorded as a metadata tag in the README file of the model's repo, as you can see here.
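The tag lives in the model card's YAML front matter. As a sketch (the exact set of surrounding tags may vary from model to model), it looks something like this:

```yaml
---
license: creativeml-openrail-m
base_model: CompVis/stable-diffusion-v1-4
tags:
- stable-diffusion
- lora
---
```

The `base_model` key is what `info.cardData["base_model"]` reads in the snippet above.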
Once we determine the base model we used to fine-tune with LoRA, we load a normal Stable Diffusion pipeline. We'll customize it with the DPMSolverMultistepScheduler for very fast inference:
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
And here's where the magic comes in. We load the LoRA weights from the Hub on top of the regular model weights, move the pipeline to the CUDA device and run inference:
pipe.unet.load_attn_procs(model_path)
pipe.to("cuda")
image = pipe("Green pokemon with menacing face", num_inference_steps=25).images[0]
image.save("green_pokemon.png")
Dreamboothing with LoRA
Dreambooth allows you to "teach" new concepts to a Stable Diffusion model. LoRA is compatible with Dreambooth and the process is similar to fine-tuning, with a couple of advantages:
- Training is faster.
- We only need a few images of the subject we want to train (5 or 10 are usually enough).
- We can tweak the text encoder, if we want, for additional fidelity to the subject.
To train Dreambooth with LoRA you need to use this diffusers script. Please take a look at the README, the documentation and our hyperparameter exploration blog post for details.
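As a sketch, an invocation of the Dreambooth LoRA script follows the same pattern as the fine-tuning command above; the directories, instance prompt, and hyperparameter values below are placeholders to adapt to your own subject, not recommended settings:

```shell
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="path_to_training_images"
export OUTPUT_DIR="path_to_saved_model"

accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed=1337
```

Note the short training run: because Dreambooth only needs a handful of subject images, a few hundred steps are typically enough.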
For a quick, cheap and easy way to train your Dreambooth models with LoRA, please check this Space by hysts. You need to duplicate it and assign a GPU so it runs fast. This process will save you from having to set up your own training environment, and you'll be able to train your models in minutes!
Other Methods
The quest for easy fine-tuning is not new. In addition to Dreambooth, textual inversion is another popular method that attempts to teach new concepts to a trained Stable Diffusion model. One of the main reasons for using Textual Inversion is that trained weights are also small and easy to share. However, they only work for a single subject (or a small handful of them), whereas LoRA can be used for general-purpose fine-tuning, meaning that it can be adapted to new domains or datasets.
Pivotal Tuning is a method that tries to combine Textual Inversion with LoRA. First, you teach the model a new concept using Textual Inversion techniques, obtaining a new token embedding to represent it. Then, you train that token embedding using LoRA to get the best of both worlds.
We have not explored Pivotal Tuning with LoRA yet. Who’s up for the challenge? 🤗


