LoRA training scripts of the world, unite!



Linoy Tsaban


A community-derived guide to some of the SOTA practices for SD-XL Dreambooth LoRA fine-tuning

TL;DR

We combined the Pivotal Tuning technique used in Replicate’s SDXL Cog trainer with the Prodigy optimizer used in the
Kohya trainer (plus a bunch of other optimizations) to achieve very good results on training Dreambooth LoRAs for SDXL.
Check out the training script on diffusers🧨. Try it out on Colab.

If you want to skip the technical talk, you can use all the techniques in this blog
and train on Hugging Face Spaces with a simple UI and curated parameters (that you can meddle with).



Overview

Stable Diffusion XL (SDXL) models fine-tuned with LoRA dreambooth achieve incredible results at capturing new concepts using only a
handful of images, while simultaneously maintaining the aesthetic and image quality of SDXL and requiring relatively
little compute and resources. Check out some of the awesome SDXL
LoRAs here.
In this blog, we’ll review some of the popular practices and techniques to make your LoRA fine-tunes go brrr, and show how you
can run or train yours now with diffusers!

Recap: LoRA (Low-Rank Adaptation) is a fine-tuning technique for Stable Diffusion models that makes slight
adjustments to the crucial cross-attention layers where images and prompts intersect. It achieves quality on par with
fully fine-tuned models while being much faster and requiring less compute. To learn more about how LoRAs work, please see
our previous post – Using LoRA for Efficient Stable Diffusion Fine-Tuning.

Contents:

  1. Techniques/tricks
    1. Pivotal tuning
    2. Adaptive optimizers
    3. Recommended practices – text encoder learning rate, custom captioning, dataset repeats, Min-SNR gamma, training set creation
  2. Experiments Settings and Results
  3. Inference
    1. Diffusers inference
    2. Automatic1111/ComfyUI inference

Acknowledgements ❤️:
The techniques showcased in this guide – algorithms, training scripts, experiments and explorations – were inspired and built upon the
contributions by Nataniel Ruiz: Dreambooth, Rinon Gal: Textual Inversion, Ron Mokady: Pivotal Tuning, Simo Ryu: cog-sdxl,
Kohya: sd-scripts, The Last Ben: fast-stable-diffusion. Our most sincere gratitude to them and the rest of the community! 🙌



Pivotal Tuning

Pivotal Tuning is a method that combines Textual Inversion with regular diffusion fine-tuning. For Dreambooth, it is
customary that you provide a rare token to be your trigger word, say “an sks dog”. However, those tokens usually have
other semantic meanings associated with them, which can affect your results. The sks example, popular in the community, is
actually associated with a weapons brand.

To tackle this issue, we insert new tokens into the text encoders of the model, instead of reusing existing ones.
We then optimize the newly-inserted token embeddings to represent the new concept: that is Textual Inversion –
we learn to represent the concept through new “words” in the embedding space. Once we obtain the new token and its
embeddings to represent it, we can train our Dreambooth LoRA with those token embeddings to get the best of both worlds.

Training

In our new training script, you can do textual inversion training by providing the following arguments

--train_text_encoder_ti
--train_text_encoder_ti_frac=0.5
--token_abstraction="TOK"
--num_new_tokens_per_abstraction=2
--adam_weight_decay_text_encoder
  • train_text_encoder_ti enables training the embeddings of new concepts (a minimal sketch of how such new tokens might be inserted appears after this list).
  • train_text_encoder_ti_frac specifies when to stop the textual inversion (i.e. stop the optimization of the textual embeddings and continue optimizing the UNet only).
    Pivoting halfway (i.e. performing textual inversion for the first half of the training epochs)
    is the default value in the cog sdxl example and our experiments validate this as well. We encourage experimentation here.
  • token_abstraction refers to the concept identifier,
    the word used in the image captions to describe the concept we wish to train on.
    Your choice of token abstraction should be used in your instance prompt,
    validation prompt or custom captions. Here we chose TOK, so,
    for example, “a photo of a TOK” can be the instance prompt.
    As --token_abstraction is a placeholder, before training we insert the new
    tokens in place of TOK and optimize them (meaning “a photo of TOK” becomes “a photo of <s0><s1>” during training, where <s0><s1> are the new tokens).
    Hence, it is also crucial that token_abstraction corresponds to the identifier used in the instance prompt, validation prompt and custom prompts (if used).
  • num_new_tokens_per_abstraction is the number of new tokens to initialize for each token_abstraction – i.e. how many new tokens to insert and train for each text encoder
    of the model. The default is set to 2; we encourage you to experiment with this and share your results!
  • adam_weight_decay_text_encoder is used to set a different weight decay value for the text encoder parameters (
    different from the value used for the UNet parameters).
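
To make the token-insertion step more concrete, here is a minimal sketch (assuming pipe is an SDXL DiffusionPipeline and that the two new tokens are named <s0> and <s1>, in line with the defaults above) of how new tokens can be added to both tokenizers/text encoders so that only their embedding rows end up being optimized. This is an illustration, not the training script itself:

# Minimal sketch: insert new tokens and expose only the input embeddings for training.
# `pipe` is assumed to be an SDXL DiffusionPipeline; the token names are illustrative.
new_tokens = ["<s0>", "<s1>"]

for tokenizer, text_encoder in [
    (pipe.tokenizer, pipe.text_encoder),
    (pipe.tokenizer_2, pipe.text_encoder_2),
]:
    tokenizer.add_tokens(new_tokens)
    text_encoder.resize_token_embeddings(len(tokenizer))

    # Freeze the text encoder, then unfreeze the input embedding matrix;
    # in practice the trainer additionally masks gradients so that only the
    # rows belonging to the newly added tokens are updated.
    text_encoder.requires_grad_(False)
    text_encoder.get_input_embeddings().requires_grad_(True)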



Adaptive Optimizers


When training/fine-tuning a diffusion model (or any machine learning model for that matter), we use optimizers to guide
us towards the optimal path that leads to convergence of our training objective – a minimum point of our chosen loss
function that represents a state where the model learned what we are trying to teach it. The standard (and
state-of-the-art) choices for deep learning tasks are the Adam and AdamW optimizers.

However, they require the user to meddle a lot with the hyperparameters that pave the path to convergence (such as
learning rate, weight decay, etc.). This can result in time-consuming experiments that lead to suboptimal outcomes, and
even if you land on an ideal learning rate, it may still lead to convergence issues if the learning rate is constant
during training. Some parameters may benefit from more frequent updates to expedite convergence, while others may
require smaller adjustments to avoid overshooting the optimal value. To tackle this challenge, algorithms with adaptable
learning rates such as Adafactor and Prodigy have been introduced. These
methods optimize the algorithm’s traversal of the optimization landscape by dynamically adjusting the learning rate for
each parameter based on their past gradients.

We chose to focus a bit more on Prodigy as we believe it can be especially beneficial for Dreambooth LoRA training!

Training

--optimizer="prodigy"

When using prodigy it’s generally good practice to set:

--learning_rate=1.0

Additional settings that are considered beneficial for diffusion models and specifically LoRA training are:

--prodigy_safeguard_warmup=True
--prodigy_use_bias_correction=True
--adam_beta1=0.9
# Note these are set to values different than the default:
--adam_beta2=0.99 
--adam_weight_decay=0.01

There are additional hyper-parameters you can adjust when training with prodigy
(like --prodigy_beta3, prodigy_decouple, prodigy_safeguard_warmup); we will not delve into those in this post,
but you can learn more about them here.
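
As a concrete illustration, this is roughly how an optimizer with the settings above could be constructed in Python. This is a hedged sketch: it assumes the prodigyopt package is installed and that params_to_optimize holds the trainable LoRA/embedding parameters (both assumptions, not the training script itself):

# Sketch of constructing Prodigy with the settings listed above.
from prodigyopt import Prodigy

optimizer = Prodigy(
    params_to_optimize,        # assumed: list of trainable parameters
    lr=1.0,                    # Prodigy adapts the step size, so we start at 1.0
    betas=(0.9, 0.99),         # --adam_beta1 / --adam_beta2
    weight_decay=0.01,         # --adam_weight_decay
    use_bias_correction=True,  # --prodigy_use_bias_correction
    safeguard_warmup=True,     # --prodigy_safeguard_warmup
)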



Additional Good Practices

Besides pivotal tuning and adaptive optimizers, here are some additional techniques that can impact the quality of your
trained LoRA; all of them have been incorporated into the new diffusers training script.



Independent learning rates for text encoder and UNet

When optimizing the text encoder, it has been perceived by the community that setting different learning rates for it (
versus the learning rate of the UNet) can lead to better quality results – specifically, a lower learning rate for
the text encoder as it tends to overfit faster.
* The importance of different unet and text encoder learning rates is evident when performing pivotal tuning as
well – in this case, setting a higher learning rate for the text encoder is perceived to be better.
* Notice, however, that when using Prodigy (or adaptive optimizers in general) we start with an identical initial
learning rate for all trained parameters, and let the optimizer work its magic ✨

Training

--train_text_encoder
--learning_rate=1e-4 #unet
--text_encoder_lr=5e-5 

--train_text_encoder enables full text encoder training (i.e. the weights of the text encoders are fully optimized, as opposed to only optimizing the inserted embeddings we saw in textual inversion (--train_text_encoder_ti)).
If you wish the text encoder lr to always match --learning_rate, set --text_encoder_lr=None.
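
Under the hood, separate learning rates are typically handled with optimizer parameter groups. Here is a minimal sketch; unet_lora_params and text_encoder_params are assumed placeholders for the collected trainable parameters, not names from the script:

# Sketch: separate learning rates via AdamW parameter groups.
import torch

optimizer = torch.optim.AdamW(
    [
        {"params": unet_lora_params, "lr": 1e-4},      # --learning_rate (UNet)
        {"params": text_encoder_params, "lr": 5e-5},   # --text_encoder_lr
    ],
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)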



Custom Captioning

While it is possible to achieve good results by training on a set of images all captioned with the same instance
prompt, e.g. “photo of a TOK person” or “in the style of TOK” etc., using the same caption may lead to
suboptimal results, depending on the complexity of the learned concept, how “familiar” the model is with the concept,
and how well the training set captures it.


Training
To use custom captioning, first make sure that you have the datasets library installed, otherwise you can install it by –

!pip install datasets

To load the custom captions we need our training set directory to follow the structure of a datasets ImageFolder,
containing both the images and the corresponding caption for each image.

  • Option 1:
    You choose a dataset from the hub that already contains images and prompts – for example LinoyTsaban/3d_icon. Now all you have to do
    is specify the name of the dataset and the name of the caption column (in this case it’s “prompt”) in your training arguments:

--dataset_name=LinoyTsaban/3d_icon
--caption_column=prompt
  • Option 2:
    You wish to use your own images and add captions to them. In that case, you can use this colab notebook to
    automatically caption the images with BLIP, or you can manually create the captions in a metadata file (see the sketch after this list for what such a file can look like). You then
    follow up the same way, by specifying --dataset_name with your folder path, and --caption_column with the column
    name for the captions.
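
As a hedged illustration of the expected layout, the sketch below shows a hypothetical metadata.jsonl and how datasets would load the folder (the directory name, file names and caption text are made up for the example):

# The training folder is a datasets ImageFolder, e.g.:
#   my_folder/metadata.jsonl
#   my_folder/0001.jpg, my_folder/0002.jpg, ...
# Each line of metadata.jsonl pairs an image with its caption:
#   {"file_name": "0001.jpg", "prompt": "a TOK emoji waving hello"}
#   {"file_name": "0002.jpg", "prompt": "a TOK emoji riding a skateboard"}
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="my_folder", split="train")
print(dataset[0]["prompt"])  # the caption column you pass to --caption_column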



Min-SNR Gamma weighting

Training diffusion models often suffers from slow convergence, partly due to conflicting optimization directions
between timesteps. Hang et al. found a way to mitigate this issue by introducing
the simple Min-SNR-gamma approach. This method adapts the loss weights of timesteps based on clamped signal-to-noise
ratios, which effectively balances the conflicts among timesteps.
* For small datasets, the effects of the Min-SNR weighting strategy might not appear to be pronounced, but for larger
datasets, the effects will likely be more pronounced.
* You can find this project on Weights and Biases that compares
the loss surfaces of the following setups: snr_gamma set to 5.0, 1.0 and None.


Training

To use Min-SNR gamma, set a value for:

--snr_gamma=5.0

By default --snr_gamma=None, i.e. not used. When enabling --snr_gamma, the recommended value is 5.0.
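
For intuition, here is a minimal sketch of how Min-SNR-gamma weighting can be applied to the per-sample diffusion loss. It assumes noise_scheduler, timesteps, model_pred and target come from a standard diffusion training step and illustrates the epsilon-prediction case; it is not the exact code of the script:

# Sketch of Min-SNR-gamma loss weighting (epsilon prediction).
import torch
import torch.nn.functional as F

def compute_snr(noise_scheduler, timesteps):
    alphas_cumprod = noise_scheduler.alphas_cumprod.to(timesteps.device)
    alpha_bar = alphas_cumprod[timesteps]
    return alpha_bar / (1.0 - alpha_bar)  # SNR(t)

snr_gamma = 5.0
snr = compute_snr(noise_scheduler, timesteps)
mse_loss_weights = torch.clamp(snr, max=snr_gamma) / snr  # min(SNR, gamma) / SNR

loss = F.mse_loss(model_pred.float(), target.float(), reduction="none")
loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
loss = loss.mean()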



Repeats

This argument refers to the number of times an image from your dataset is repeated in the training set. This differs
from epochs in that first the images are repeated, and only then shuffled.

Training

To enable repeats simply set an integer value > 1 as your repeats count –

--repeats

By default, --repeats=1, i.e. the training set is not repeated.
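
A tiny illustration of the repeat-then-shuffle behaviour described above (plain Python; the list and variable names are made up for the example):

# Images are duplicated first, and only then shuffled, so a single pass
# already visits each image `repeats` times in mixed order.
import random

images = ["img_0", "img_1", "img_2"]
repeats = 2

training_set = images * repeats   # repeat first ...
random.shuffle(training_set)      # ... then shuffle
print(training_set)               # e.g. ['img_2', 'img_0', 'img_2', 'img_1', 'img_0', 'img_1']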



Training Set Creation

  • As the popular saying goes – “Garbage in, garbage out.” Training a good Dreambooth LoRA can be done easily using
    just a handful of images, but the quality of those images is very impactful on the fine-tuned model.

  • Generally, when fine-tuning on an object/subject, we want to make sure the training set contains images that
    portray the object/subject in as many distinct ways as we would want to prompt for it.

  • For example, if my concept is this red backpack: (available
    in the google/dreambooth dataset)

  • I would likely want to prompt it worn by people as well, so having examples like this:

    in the training set – that match that scenario – will likely make it easier for the model to generalize to that
    setting/composition during inference.

Specifically when training on faces, you should consider the following things regarding your dataset:

  1. If possible, always choose high resolution, high quality images. Blurry or low resolution images can harm the
    tuning process.

  2. When training on faces, it is recommended that no other faces appear in the training set, as we don’t want to
    create an ambiguous notion of what is the face we’re training on.

  3. Close-up photos are important to achieve realism, however good full-body shots should also be included to
    improve the ability to generalize to different poses/compositions.

  4. We recommend avoiding photos where the subject is far away, as most pixels in such images are not related to
    the concept we wish to optimize on; there’s not much for the model to learn from these.

  5. Avoid repeating backgrounds/clothing/poses – aim for variety in terms of lighting, poses, backgrounds, and
    facial expressions. The greater the variety, the more flexible and generalizable the LoRA will be.

  6. Prior preservation loss
    Prior preservation loss is a method that uses a
    model’s own generated samples to help
    it learn how to generate more diverse images.
    Because these sample images belong to the same class as
    the images you provided, they help the model retain what it has learned about
    the class and how it can use what it already knows about the class to make new
    compositions.
    Real images for regularization vs. model generated ones
    When choosing class images, you can decide between synthetic ones (i.e. generated by the diffusion model) and
    real ones. In favor of using real images, we can argue that they improve the fine-tuned model’s realism. On the other
    hand, some will argue that using model generated images better serves the purpose of preserving the model’s
    knowledge of the class and general aesthetics.

  7. Celebrity lookalike – this is more of a comment on the captioning/instance prompt used to train. Some fine-tuners
    experienced improvements in their results when prompting with a token identifier + a public person that
    the base model knows about who resembles the person they trained on.

Training with prior preservation loss

--with_prior_preservation
--class_data_dir
--num_class_images
--class_prompt

--with_prior_preservation – enables training with prior preservation
--class_data_dir – path to folder containing class images
--num_class_images – minimal number of class images for prior preservation loss. If there are not enough images already present
in --class_data_dir, additional images will be sampled with --class_prompt.
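
For intuition, here is a hedged sketch of how the prior preservation term is commonly combined with the instance loss during a training step. It assumes the batch holds the instance and class samples concatenated along the batch dimension, and uses an illustrative prior_loss_weight of 1.0; it is not the exact code of the script:

# Sketch: combining the instance loss with the prior (class) loss.
import torch
import torch.nn.functional as F

# Split the batch back into its instance and class ("prior") halves.
model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
target, target_prior = torch.chunk(target, 2, dim=0)

instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

prior_loss_weight = 1.0  # illustrative value
loss = instance_loss + prior_loss_weight * prior_loss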



Experiments Settings and Results

To explore the described methods, we experimented with different combinations of these techniques on
different objectives (style tuning, faces and objects).

In order to narrow down the infinite amount of hyperparameter values, we used some of the more popular and common
configurations as starting points and tweaked our way from there.

Huggy Dreambooth LoRA
First, we were interested in fine-tuning a Huggy LoRA, which means
teaching both an artistic style and a specific character at the same time.
For this example, we curated a high quality Huggy mascot dataset (using Chunte-Lee’s amazing artwork) containing 31
images paired with custom captions.


Configurations:

--train_batch_size = 1, 2, 3, 4
--repeats = 1, 2
--learning_rate = 1.0 (Prodigy), 1e-4 (AdamW)
--text_encoder_lr = 1.0 (Prodigy), 3e-4, 5e-5 (AdamW)
--snr_gamma = None, 5.0
--max_train_steps = 1000, 1500, 1800
--text_encoder_training = regular finetuning, pivotal tuning (textual inversion)
  • Full Text Encoder Tuning vs Pivotal Tuning – we noticed pivotal tuning achieves results competitive with or better
    than full text encoder training, yet without optimizing the weights of the text_encoder.
  • Min-SNR Gamma
    • We compare a version 1
      trained without snr_gamma, and a version 2 trained with snr_gamma = 5.0.
      Specifically, we used the following arguments in both versions (and added snr_gamma to version 2):
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" 
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" 
--dataset_name="./huggy_clean" 
--instance_prompt="a TOK emoji"
--validation_prompt="a TOK emoji dressed as Yoda"
--caption_column="prompt" 
--mixed_precision="bf16" 
--resolution=1024 
--train_batch_size=4 
--repeats=1
--report_to="wandb"
--gradient_accumulation_steps=1 
--gradient_checkpointing 
--learning_rate=1e-4 
--text_encoder_lr=3e-4 
--optimizer="adamw"
--train_text_encoder_ti
--lr_scheduler="constant" 
--lr_warmup_steps=0 
--rank=32 
--max_train_steps=1000 
--checkpointing_steps=2000 
--seed="0" 

  • AdamW vs Prodigy Optimizer
    • We compare version 1
      trained with optimizer=prodigy, and [version 2](https://wandb.ai/linoy/dreambooth-lora-sd-xl/runs/cws7nfzg?workspace=user-linoy) trained with optimizer=adamW. Both versions were trained with pivotal tuning.
    • When training with optimizer=prodigy we set the initial learning rate to be 1. For adamW we used the default
      learning rates used for pivotal tuning in cog-sdxl (1e-4, 3e-4 for learning_rate and text_encoder_lr respectively),
      as we were able to reproduce good
      results with these settings.


    • All other training parameters and settings were the same. Specifically:
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" 
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" 
  --dataset_name="./huggy_clean" 
  --instance_prompt="a TOK emoji"
  --validation_prompt="a TOK emoji dressed as Yoda"
  --output_dir="huggy_v11" 
  --caption_column="prompt" 
  --mixed_precision="bf16" 
  --resolution=1024 
  --train_batch_size=4 
  --repeats=1
  --report_to="wandb"
  --gradient_accumulation_steps=1 
  --gradient_checkpointing 
  --train_text_encoder_ti
  --lr_scheduler="constant" 
  --snr_gamma=5.0 
  --lr_warmup_steps=0 
  --rank=32 
  --max_train_steps=1000 
  --checkpointing_steps=2000 
  --seed="0" 

Y2K Webpage LoRA
Let’s explore another example, this time training on a dataset composed of 27 screenshots of webpages from the 1990s
and early 2000s that we (nostalgically 🥲) scraped from the internet:


Configurations:

--rank = 4, 16, 32
--optimizer = prodigy, adamW
--repeats = 1, 2, 3
--learning_rate = 1.0 (Prodigy), 1e-4 (AdamW)
--text_encoder_lr = 1.0 (Prodigy), 3e-4, 5e-5 (AdamW)
--snr_gamma = None, 5.0
--train_batch_size = 1, 2, 3, 4
--max_train_steps = 500, 1000, 1500
--text_encoder_training = regular finetuning, pivotal tuning

This example showcases a slightly different behaviour than the previous one.
While in both cases we used roughly the same amount of images (i.e. ~30),
we noticed that for this style LoRA, the same settings that induced good results for the Huggy LoRA are overfitting for the webpage style.

For v1, we chose as a starting point the settings that worked best for us when training the Huggy LoRA – it was evidently overfit, so we tried to resolve that in the next versions by tweaking --max_train_steps, --repeats, --train_batch_size and --snr_gamma.
More specifically, these are the settings we changed between each version (all the rest we kept the same):

| param | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
|---|---|---|---|---|---|---|---|---|
| max_train_steps | 1500 | 1500 | 1500 | 1000 | 1000 | 1000 | 1000 | 1000 |
| repeats | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 |
| train_batch_size | 4 | 4 | 4 | 4 | 2 | 1 | 1 | 1 |
| instance_data_dir | web_y2k | 14 images randomly sampled from web_y2k | web_y2k | web_y2k | web_y2k | web_y2k | web_y2k | web_y2k |
| snr_gamma | | | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |

We found v4, v5 and v6 to strike the best balance:


Face LoRA
When training on face images, we aim for the LoRA to generate images that are as realistic and similar to the original person as possible,
while also being able to generalize well to backgrounds and compositions that were not seen in the training set.
For this use-case, we used different datasets of Linoy’s face composed of 6-10 images, including a set of close-up photos taken all at the same time and a dataset of shots taken at different occasions (changing backgrounds, lighting and outfits) as well as full body shots.
We learned that fewer images with better curation work better than more images of mid-to-low quality when it comes to lighting/resolution/focus on the subject – less is more: pick your best pictures and use those to train the model!
Configurations:

rank = 4, 16, 32, 64
optimizer = prodigy, adamW
repeats = 1, 2, 3, 4
learning_rate = 1.0, 1e-4
text_encoder_lr = 1.0, 3e-4
snr_gamma = None, 5.0
num_class_images = 100, 150
max_train_steps = 75 * num_images, 100 * num_images, 120 * num_images
text_encoder_training = regular finetuning, pivotal tuning
  • Prior preservation loss

    • Contrary to common practices, we found the use of generated class images to reduce both resemblance to the subject and realism.
    • We created a dataset of real portrait images, using freely-licensed images downloaded from unsplash.
      You can now use it automatically in the new training space as well!
    • When using the real image dataset, we did notice less language drift (i.e. the model doesn’t associate the term woman/man with trained faces only and can generate different people as well) while at the same time maintaining realism and overall quality when prompted for the trained faces.
  • Rank

    • We compare LoRAs in ranks 4, 16, 32 and 64. We observed that in the settings tested in our explorations, images produced using the rank 64 LoRA tend to have a more airbrushed appearance and less realistic looking skin texture.
    • Hence, for the experiments detailed below as well as the LoRA ease space, we use a default rank of 32.
  • Training Steps

    • Although a few high quality images (in our example, 6) work well, we still need to determine an ideal number of steps to train the model.

    • We experimented with a few different multipliers on the number of images: 6 x 75 = 450 steps / 6 x 100 = 600 steps / 6 x 120 = 720 steps.

    • As you can see below, our preliminary results show that good results are achieved with a 120x multiplier (if the dataset is diverse enough to not overfit, it’s preferable not to use images from the same shooting session).




Inference

Inference with models trained with the techniques above should work the same as with any trainer, except that, when we do pivotal tuning, besides the *.safetensors weights of your LoRA, there are also the *.safetensors text embeddings trained with the model
for the new tokens. In order to do inference with those we add 2 steps to how we would normally load a LoRA:

1. Download our trained embeddings from the hub
  (your embeddings filename is set by default to be {model_name}_emb.safetensors)
import torch
from huggingface_hub import hf_hub_download
from diffusers import DiffusionPipeline
from safetensors.torch import load_file

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

embedding_path = hf_hub_download(repo_id="LinoyTsaban/web_y2k_lora", filename="web_y2k_emb.safetensors", repo_type="model")
    
2. Load the embeddings into the text encoders

state_dict = load_file(embedding_path)

pipe.load_textual_inversion(state_dict["clip_l"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state_dict["clip_g"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
    
3. Load your LoRA and prompt it!

pipe.load_lora_weights("LinoyTsaban/web_y2k_lora", weight_name="pytorch_lora_weights.safetensors")
prompt = "a <s0><s1> webpage about an astronaut riding a horse"
images = pipe(
    prompt,
    cross_attention_kwargs={"scale": 0.8},
).images

images[0]
    



Comfy UI / AUTOMATIC1111 Inference

The new script fully supports textual inversion loading with Comfy UI and AUTOMATIC1111 formats!

AUTOMATIC1111 / SD.Next
In AUTOMATIC1111/SD.Next we will load a LoRA and a textual embedding at the same time.

• LoRA: Besides the diffusers format, the script will also train a WebUI compatible LoRA. It is generated as {your_lora_name}.safetensors. You can then include it in your models/Lora directory.
• Embedding: the embedding is the same for diffusers and WebUI. You can download your {lora_name}_emb.safetensors file from a trained model, and include it in your embeddings directory.

You can then run inference by prompting a y2k_emb webpage about the movie Mean Girls. You can use the y2k_emb token normally, including increasing its weight by doing (y2k_emb:1.2).

ComfyUI
In ComfyUI we will load a LoRA and a textual embedding at the same time.

• LoRA: Besides the diffusers format, the script will also train a ComfyUI compatible LoRA. It is generated as {your_lora_name}.safetensors. You can then include it in your models/Lora directory. Then you will load the LoRALoader node and hook that up with your model and CLIP. Official guide for loading LoRAs
• Embedding: the embedding is the same for diffusers and WebUI. You can download your {lora_name}_emb.safetensors file from a trained model, include it in your models/embeddings directory and use it in your prompt like embedding:y2k_emb. Official guide for loading embeddings.



What’s next?

🚀 More features coming soon!
We are working on adding even more control and flexibility to our advanced training script. Let us know what features
you find most useful!

🤹 Multi concept LoRAs
A recent work by Shah et al. introduced ZipLoRAs – a method to
merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in
any user-provided style. mkshing implemented an open source replication available
here and it uses the new and improved script.




