Train your ControlNet with diffusers

Pedro Cuenca

ControlNet is a neural network structure that allows fine-grained control of diffusion models by adding extra conditions. The technique debuted with the paper Adding Conditional Control to Text-to-Image Diffusion Models, and quickly took over the open-source diffusion community thanks to the author's release of 8 different conditions to control Stable Diffusion v1-5, including pose estimation, depth maps, canny edges, sketches, and more.

ControlNet pose examples

In this blog post we will go over each step in detail of how we trained the Uncanny Faces model – a model on face poses based on 3D synthetic faces (the uncanny faces were actually an unintended consequence; stay tuned to see how they came about).



Getting started with training your ControlNet for Stable Diffusion

Training your own ControlNet requires 3 steps:

  1. Planning your condition: ControlNet is flexible enough to tame Stable Diffusion towards many tasks. The pre-trained models showcase a wide range of conditions, and the community has built others, such as conditioning on pixelated color palettes.

  2. Building your dataset: Once a condition is decided on, it is time to build your dataset. For that, you can either construct a dataset from scratch or use a subset of an existing dataset. You need three columns in your dataset to train the model: a ground truth image, a conditioning_image and a prompt (see the sketch after this list).

  3. Training the model: Once your dataset is ready, it is time to train the model. This is the easiest part thanks to the diffusers training script. You will need a GPU with at least 8GB of VRAM.
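
Below is a minimal sketch (ours, not part of the training script) of what that three-column layout can look like as a Hugging Face Dataset. The file paths and the caption are placeholders, and the column names can be remapped later via the training script's flags.

from datasets import Dataset, Features, Image, Value

# The three columns the diffusers ControlNet training script expects:
# a ground truth image, a conditioning image, and a caption/prompt.
features = Features({
    "image": Image(),                # ground truth image
    "conditioning_image": Image(),   # the condition, e.g. an illustrated landmark mask
    "prompt": Value("string"),       # caption describing the ground truth image
})

dataset = Dataset.from_dict(
    {
        "image": ["./data/face_0001.png"],                        # placeholder paths
        "conditioning_image": ["./data/face_0001_landmarks.png"],
        "prompt": ["a close-up photo of a person's face"],
    },
    features=features,
)
print(dataset)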



1. Planning your condition

To plan your condition, it is helpful to consider two questions:

  1. What kind of conditioning do I want to use?
  2. Is there an existing model that can convert 'regular' images into my condition?

For our example, we thought about using facial landmarks conditioning. Our reasoning was: 1. the general landmarks-conditioned ControlNet works well; 2. facial landmarks are a widespread enough technique, and there are multiple models that calculate facial landmarks on regular pictures; 3. it could be fun to tame Stable Diffusion to follow a certain facial landmark or imitate your own facial expression.

Example of face landmarks



2. Building your dataset

Okay! So we decided to do a facial landmarks Stable Diffusion conditioning. To prepare the dataset we need:

  • The ground truth image: in this case, images of faces
  • The conditioning_image: in this case, images where the facial landmarks are visualised
  • The caption: a caption that describes the images being used

For this project, we decided to go with the FaceSynthetics dataset by Microsoft: it is a dataset that contains 100K synthetic faces. There are other face research datasets with real faces, such as Celeb-A HQ and FFHQ, but we decided to go with synthetic faces for this project.

Face synthetics example dataset

The FaceSynthetics dataset seemed like a great start: it contains ground truth images of faces, facial landmarks annotated in the iBUG 68-facial-landmarks format, and a segmented image of the face.

Face synthetics descriptions

Perfect, right? Unfortunately, not really. Remember the second question in the “planning your condition” step – that we should have models that convert regular images into the conditioning? It turns out there is no known model that can turn faces into the annotated landmark format of this dataset.

No known segmentation model

So we decided to follow another path:

  • Use the ground truth images of faces from the FaceSynthetics dataset
  • Use a known model that can convert any image of a face into the iBUG 68-facial-landmarks format (in our case, we used the SOTA model SPIGA)
  • Use custom code that converts the facial landmarks into a nice illustrated mask to be used as the conditioning_image
  • Save that as a Hugging Face Dataset

Here you can find the code used to convert the ground truth images from the FaceSynthetics dataset into the illustrated mask and save it as a Hugging Face Dataset.
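
For reference, here is a minimal sketch of that conversion path, not the exact code we used. detect_landmarks is a hypothetical wrapper around a landmark model such as SPIGA that returns 68 (x, y) points for a face image, and the local data path and Hub repo id are placeholders.

from PIL import Image, ImageDraw
from datasets import load_dataset

def draw_landmark_mask(landmarks, size, radius=3):
    # Render the landmark points as white dots on a black canvas.
    mask = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(mask)
    for x, y in landmarks:
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=(255, 255, 255))
    return mask

def add_conditioning_image(example):
    image = example["image"]
    landmarks = detect_landmarks(image)  # hypothetical call to a SPIGA wrapper
    example["conditioning_image"] = draw_landmark_mask(landmarks, image.size)
    return example

# Load a local copy of the FaceSynthetics images (placeholder path), add the
# conditioning images, and push the result to the Hub.
dataset = load_dataset("imagefolder", data_dir="./facesynthetics", split="train")
dataset = dataset.map(add_conditioning_image)
dataset.push_to_hub("your-username/facesynthetics-spiga")  # placeholder repo id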

Now, with the ground truth image and the conditioning_image in the dataset, we are missing one step: a caption for each image. This step is highly recommended, but you can experiment with empty prompts and report back on your results. As we did not have captions for the FaceSynthetics dataset, we ran it through BLIP captioning. You can check the code used for captioning all images here.
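
A minimal sketch of BLIP captioning with transformers could look like the following; the model id and the max_new_tokens value are our assumptions, not necessarily what the captioning script used.

import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

def caption_image(example):
    # Generate a short caption for the ground truth face image.
    inputs = processor(images=example["image"], return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=30)
    example["image_caption"] = processor.decode(out[0], skip_special_tokens=True)
    return example

# `dataset` is the Hugging Face Dataset built in the previous step.
dataset = dataset.map(caption_image)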

With that, we arrived at our final dataset! The Face Synthetics SPIGA with captions dataset contains a ground truth image, a segmentation and a caption for the 100K images of the FaceSynthetics dataset. We are ready to train the model!

New dataset



3. Training the model

With our dataset ready, it is time to train the model! Even though this was supposed to be the hardest part of the process, with the diffusers training script it turned out to be the easiest. We used a single A100 rented for US$1.10/h on LambdaLabs.



Our training experience

We trained the model for 3 epochs (meaning that the 100K images were shown to the model 3 times) with a batch size of 4 (each step shows 4 images to the model). This turned out to be excessive and the model overfit: it forgot concepts that diverge a bit from a real face, so for example “shrek” or “a cat” in the prompt would not produce a shrek or a cat but rather a person, and it also started to ignore styles.

With just 1 epoch (so after the model “saw” 100K images), it already converged to following the poses without overfitting. So it worked, but… as we used a synthetic face dataset, the model ended up learning uncanny 3D-looking faces instead of realistic faces. This makes sense given that we used synthetic faces rather than real ones, and the result can be used for fun/memetic purposes. Here is the uncannyfaces_25K model.
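
If you want to try a trained checkpoint yourself, a minimal inference sketch with diffusers looks like this; the ControlNet repo id and the conditioning image path are placeholders for wherever your checkpoint and landmark illustrations live.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("your-username/uncannyfaces_25K", torch_dtype=torch.float16)  # placeholder repo id
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

conditioning = load_image("./face_landmarks1.jpeg")  # a landmark illustration, like the validation images
image = pipe(
    "Portrait of a clown face, oil on canvas, bittersweet expression",
    image=conditioning,
    num_inference_steps=30,
).images[0]
image.save("controlnet_output.png")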

In this interactive table you can play with the dial below to see how many training steps the model went through and how that affects the training process. At around 15K steps it already started learning the poses, and it matured around 25K steps.



How did we do the training

All we needed to do was install the dependencies:

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate xformers==0.0.16 wandb
huggingface-cli login
wandb login 

And then run the train_controlnet.py code:

!accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
 --output_dir="model_out" \
 --dataset_name=multimodalart/facesyntheticsspigacaptioned \
 --conditioning_image_column=spiga_seg \
 --image_column=image \
 --caption_column=image_caption \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./face_landmarks1.jpeg" "./face_landmarks2.jpeg" "./face_landmarks3.jpeg" \
 --validation_prompt "High-quality close-up dslr photo of man wearing a hat with trees in the background" "Girl smiling, professional dslr photograph, dark background, studio lights, high quality" "Portrait of a clown face, oil on canvas, bittersweet expression" \
 --train_batch_size=4 \
 --num_train_epochs=3 \
 --tracker_project_name="controlnet" \
 --enable_xformers_memory_efficient_attention \
 --checkpointing_steps=5000 \
 --validation_steps=5000 \
 --report_to wandb \
 --push_to_hub

Let's break down some of the settings, and also go over some optimisation tips for going as low as 8GB of VRAM for training.

  • pretrained_model_name_or_path: The Stable Diffusion base model you would like to use (we chose v2-1 here as it can render faces better)
  • output_dir: The directory where you would like your model to be saved
  • dataset_name: The dataset that will be used for training. In our case, Face Synthetics SPIGA with captions
  • conditioning_image_column: The name of the column in your dataset that contains the conditioning image (in our case spiga_seg)
  • image_column: The name of the column in your dataset that contains the ground truth image (in our case image)
  • caption_column: The name of the column in your dataset that contains the caption of that image (in our case image_caption)
  • resolution: The resolution of both the conditioning and ground truth images (in our case 512x512)
  • learning_rate: The learning rate. We found that 1e-5 worked well for these examples, but you may experiment with different values, ranging between 1e-4 and 2e-6, for example.
  • validation_image: This is for you to take a sneak peek during training! The validation images will be run every validation_steps so you can see how your training is going. Insert here local paths to an arbitrary number of conditioning images
  • validation_prompt: A prompt to be run together with your validation image. Can be anything that tests whether your model is training well
  • train_batch_size: This is the size of the training batch to fit the GPU. We can afford 4 due to having an A100, but if you have a GPU with lower VRAM we recommend bringing this value down to 1.
  • num_train_epochs: Each epoch corresponds to how many times the images in the training set will be “seen” by the model. We experimented with 3 epochs, but it turns out the best results required just a bit more than 1 epoch; with 3 epochs our model overfit.
  • checkpointing_steps: Save an intermediate checkpoint every x steps (in our case, every 5000 steps)
  • validation_steps: Every x steps the validation_prompt and the validation_image are run
  • report_to: where to report your training to. Here we used Weights and Biases, which gave us this nice report.
  • push_to_hub: a parameter to push the final trained model to the Hugging Face Hub.

But reducing the train_batch_size from 4 to 1 may not be enough for the training to fit on a small GPU. Here are some additional parameters to add for each GPU VRAM size:



Fitting on a 16GB VRAM GPU

pip install bitsandbytes

--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam

The combination of a batch size of 1 with 4 gradient accumulation steps is equivalent to using the original batch size of 4 we used in our example. In addition, we enabled gradient checkpointing and 8-bit Adam for additional memory savings.
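
As a quick sanity check, the effective batch size under gradient accumulation (on a single GPU) is simply the product of the two values:

# Effective batch size under gradient accumulation (single GPU assumed)
train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 4, matching the original A100 run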



Fitting on a 12GB VRAM GPU

--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam \
--set_grads_to_none



Fitting on an 8GB VRAM GPU

Please follow our guide here



4. Conclusion!

This experience of training a ControlNet was a lot of fun. We successfully trained a model that can follow real face poses – however, it learned to make uncanny 3D faces instead of real faces, because that was the dataset it was trained on, which has its own charm and flair.

Check out our Hugging Face Space:

As for next steps for us: in order to create realistic-looking faces, while still not using a real face dataset, one idea is to run the whole FaceSynthetics dataset through Stable Diffusion Image2Image, converting the 3D-looking faces into realistic-looking ones, and then training another ControlNet. A rough sketch of that idea is shown below.
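
Here is a hedged sketch of that Image2Image pass using diffusers' StableDiffusionImg2ImgPipeline; the prompt, strength and file paths are illustrative choices, not a tested recipe.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

synthetic_face = load_image("./facesynthetics/000001.png")  # placeholder path
realistic_face = pipe(
    prompt="a realistic photograph of a person's face, dslr, natural skin",
    image=synthetic_face,
    strength=0.5,  # keep the pose and structure, change the rendering style
).images[0]
realistic_face.save("realistic_face.png")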

And stay tuned, as we will have a ControlNet Training event soon! Follow Hugging Face on Twitter or join our Discord to stay up to date on that.




