ControlNet is a neural network structure that allows fine-grained control of diffusion models by adding extra conditions. The technique debuted with the paper Adding Conditional Control to Text-to-Image Diffusion Models, and quickly took over the open-source diffusion community thanks to the creator's release of 8 different conditions to control Stable Diffusion v1-5, including pose estimation, depth maps, canny edges, sketches, and more.
In this blog post we will go over each step in detail of how we trained the Uncanny Faces model, a model on face poses based on 3D synthetic faces (the uncanny faces were actually an unintended consequence; stay tuned to see how that came about).
Getting started with training your ControlNet for Stable Diffusion
Training your own ControlNet requires 3 steps:
1. Planning your condition: ControlNet is flexible enough to tame Stable Diffusion towards many tasks. The pre-trained models showcase a wide range of conditions, and the community has built others, such as conditioning on pixelated color palettes.
2. Building your dataset: Once a condition is decided, it is time to build your dataset. For that, you can either construct a dataset from scratch, or use a sub-set of an existing dataset. You need three columns in your dataset to train the model: a ground truth `image`, a `conditioning_image` and a `prompt` (a minimal sketch of this layout follows the list).
3. Training the model: Once your dataset is ready, it is time to train the model. This is the easiest part thanks to the diffusers training script. You will need a GPU with at least 8GB of VRAM.
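To make the three-column layout concrete, here is a minimal sketch using the `datasets` library. The file names and prompts are made up for illustration; in practice they would point at your own ground truth images and conditioning images:

```python
from datasets import Dataset, Image

# Hypothetical file paths and prompts -- replace with your own data.
examples = {
    "image": ["faces/0001.png", "faces/0002.png"],                    # ground truth images
    "conditioning_image": ["landmarks/0001.png", "landmarks/0002.png"],  # the condition (landmarks, depth, edges, ...)
    "prompt": [
        "a close-up photo of a smiling woman",
        "a portrait of a man wearing glasses",
    ],
}

dataset = Dataset.from_dict(examples)
# Cast the path columns so they are decoded as images when accessed.
dataset = dataset.cast_column("image", Image())
dataset = dataset.cast_column("conditioning_image", Image())

# dataset.push_to_hub("your-username/your-controlnet-dataset")  # optional: share it on the Hub
```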
1. Planning your condition
To plan your condition, it is helpful to consider two questions:
- What kind of conditioning do I want to use?
- Is there an already existing model that can convert 'regular' images into my condition?
For our example, we thought of using a facial landmarks conditioning. Our reasoning was: 1. the general landmarks-conditioned ControlNet works well. 2. Facial landmarks are a widespread enough technique, and there are multiple models that calculate facial landmarks on regular pictures. 3. It could be fun to tame Stable Diffusion to follow a certain facial landmark or imitate your own facial expression.
2. Building your dataset
Okay! So we decided to do a facial landmarks Stable Diffusion conditioning. So, to prepare the dataset we need:
- The ground truth `image`: in this case, images of faces
- The `conditioning_image`: in this case, images where the facial landmarks are visualised
- The `caption`: a caption that describes the images being used
For this project, we decided to go with the FaceSynthetics dataset by Microsoft: it is a dataset that contains 100K synthetic faces. Other face research datasets with real faces exist, such as Celeb-A HQ and FFHQ, but we decided to go with synthetic faces for this project.
The FaceSynthetics dataset seemed like a great start: it contains ground truth images of faces, facial landmarks annotated in the iBUG 68-facial landmarks format, and a segmented image of the face.
Perfect. Right? Unfortunately, not really. Remember the second question in the "planning your condition" step, that we should have models that convert regular images to the conditioning? Turns out there is no known model that can turn faces into the annotated landmark format of this dataset.
So we decided to follow another path:
- Use the ground truth `image` of faces from the FaceSynthetics dataset
- Use a known model that can convert any image of a face into the 68-facial landmarks format of iBUG (in our case we used the SOTA model SPIGA)
- Use custom code that converts the facial landmarks into a nice illustrated mask to be used as the `conditioning_image`
- Save that as a Hugging Face Dataset
Here you can find the code used to convert the ground truth images from the FaceSynthetics dataset into the illustrated mask and save it as a Hugging Face Dataset.
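The linked repository has the full conversion code; what follows is only a minimal sketch of the idea, assuming the 68 iBUG landmarks have already been extracted (e.g. with SPIGA) as a list of (x, y) coordinates. The helper name, colours, and the crude "connect consecutive points" drawing are our own illustration, not the exact code we used:

```python
from PIL import Image, ImageDraw

def landmarks_to_mask(landmarks, size=(512, 512), radius=3):
    """Draw 68 iBUG-style (x, y) landmarks as an illustrated mask on a black canvas."""
    mask = Image.new("RGB", size, color=(0, 0, 0))
    draw = ImageDraw.Draw(mask)
    # Crude approximation: connect consecutive points so the contours become visible.
    # The real code draws each facial contour (jaw, brows, eyes, lips) separately.
    for (x0, y0), (x1, y1) in zip(landmarks[:-1], landmarks[1:]):
        draw.line([(x0, y0), (x1, y1)], fill=(255, 255, 255), width=2)
    # Mark each landmark with a small dot.
    for x, y in landmarks:
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=(0, 255, 0))
    return mask

# conditioning_image = landmarks_to_mask(spiga_landmarks)  # spiga_landmarks: list of 68 (x, y) tuples
```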
Now, with the ground truth `image` and the `conditioning_image` in the dataset, we are missing one step: a caption for each image. This step is highly recommended, but you can experiment with empty prompts and report back on your results. As we did not have captions for the FaceSynthetics dataset, we ran it through BLIP captioning. You can check the code used for captioning all images here.
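If you want to reproduce the captioning step, a minimal sketch with BLIP through `transformers` could look like the following. The checkpoint choice and the batching are our assumption here, not necessarily the exact script we used:

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device)

def caption_image(pil_image):
    """Generate a short caption for a single PIL image."""
    inputs = processor(images=pil_image, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Applied over the whole dataset, e.g.:
# dataset = dataset.map(lambda ex: {"image_caption": caption_image(ex["image"])})
```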
With that, we arrived at our final dataset! The Face Synthetics SPIGA with captions contains a ground truth image, segmentation, and a caption for the 100K images of the FaceSynthetics dataset. We are ready to train the model!
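The final dataset lives on the Hub, so you can inspect it directly with the `datasets` library; the column names below are the same ones we pass to the training script later (the split name is assumed):

```python
from datasets import load_dataset

dataset = load_dataset("multimodalart/facesyntheticsspigacaptioned", split="train")
print(dataset)  # ~100K rows

example = dataset[0]
example["image"]          # ground truth face (PIL image)
example["spiga_seg"]      # illustrated landmark mask used as the conditioning image
example["image_caption"]  # BLIP-generated caption
```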
3. Training the model
With our dataset ready, it is time to train the model! Even though this was supposed to be the hardest part of the process, with the diffusers training script it turned out to be the easiest. We used a single A100 rented for US$1.10/h on LambdaLabs.
Our training experience
We trained the model for 3 epochs (meaning the batch of 100K images was shown to the model 3 times) with a batch size of 4 (each step shows 4 images to the model). This turned out to be excessive and the model overfit: it forgot concepts that diverge a bit from a real face, so for example "shrek" or "a cat" in the prompt would not produce a shrek or a cat but rather a person, and it also started to ignore styles.
With just 1 epoch (so after the model "saw" 100K images), it already converged to following the poses without overfitting. So it worked, but… as we used the face synthetics dataset, the model ended up learning uncanny 3D-looking faces instead of realistic faces. This makes sense given that we used a synthetic face dataset rather than real ones, and it can be used for fun/memetic purposes. Here is the uncannyfaces_25K model.
In this interactive table you can play with the dial below to go over how many training steps the model went through and how that affects the training process. At around 15K steps, it already started learning the poses. And it matured around 25K steps.
How did we do the training
All we had to do was install the dependencies:
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate xformers==0.0.16 wandb
huggingface-cli login
wandb login
And then run the train_controlnet.py code
!accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
 --output_dir="model_out" \
 --dataset_name=multimodalart/facesyntheticsspigacaptioned \
 --conditioning_image_column=spiga_seg \
 --image_column=image \
 --caption_column=image_caption \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./face_landmarks1.jpeg" "./face_landmarks2.jpeg" "./face_landmarks3.jpeg" \
 --validation_prompt "High-quality close-up dslr photo of man wearing a hat with trees in the background" "Girl smiling, professional dslr photograph, dark background, studio lights, high quality" "Portrait of a clown face, oil on canvas, bittersweet expression" \
 --train_batch_size=4 \
 --num_train_epochs=3 \
 --tracker_project_name="controlnet" \
 --enable_xformers_memory_efficient_attention \
 --checkpointing_steps=5000 \
 --validation_steps=5000 \
 --report_to wandb \
 --push_to_hub
Let's break down some of the settings, and also go over some optimisation tips for going as low as 8GB of VRAM for training.
- `pretrained_model_name_or_path`: The Stable Diffusion base model you would like to use (we chose v2-1 here as it can render faces better)
- `output_dir`: The directory where you would like your model to be saved
- `dataset_name`: The dataset that will be used for training. In our case Face Synthetics SPIGA with captions
- `conditioning_image_column`: The name of the column in your dataset that contains the conditioning image (in our case `spiga_seg`)
- `image_column`: The name of the column in your dataset that contains the ground truth image (in our case `image`)
- `caption_column`: The name of the column in your dataset that contains the caption of that image (in our case `image_caption`)
- `resolution`: The resolution of both the conditioning and ground truth images (in our case `512x512`)
- `learning_rate`: The learning rate. We found out that `1e-5` worked well for these examples, but you may experiment with different values ranging between `1e-4` and `2e-6`, for example.
- `validation_image`: This is for you to take a sneak peek during training! The validation images will be run every `validation_steps`, so you can see how your training is going. Insert here a local path to an arbitrary number of conditioning images
- `validation_prompt`: A prompt to be run together with your validation image. Can be anything that can test whether your model is training well
- `train_batch_size`: This is the size of the training batch to fit the GPU. We can afford `4` due to having an A100, but if you have a GPU with lower VRAM we recommend bringing this value down to `1`.
- `num_train_epochs`: Each epoch corresponds to how many times the images in the training set will be "seen" by the model. We experimented with 3 epochs, but it turns out the best results required just a bit more than 1 epoch; with 3 epochs our model overfit.
- `checkpointing_steps`: Save an intermediary checkpoint every `x` steps (in our case `5000`). Every 5000 steps, an intermediary checkpoint was saved.
- `validation_steps`: Every `x` steps the `validation_prompt` and the `validation_image` are run.
- `report_to`: where to report your training to. Here we used Weights and Biases, which gave us this nice report.
- `push_to_hub`: a parameter to push the final trained model to the Hugging Face Hub.

But reducing the `train_batch_size` from `4` to `1` may not be enough for the training to fit a small GPU; here are some additional parameters to add for each GPU VRAM size:
Fitting on a 16GB VRAM GPU
pip install bitsandbytes
--train_batch_size=1
--gradient_accumulation_steps=4
--gradient_checkpointing
--use_8bit_adam
The combination of a batch size of 1 with 4 gradient accumulation steps is equivalent to using the original batch size of 4 we used in our example. In addition, we enabled gradient checkpointing and 8-bit Adam for additional memory savings.
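For intuition, gradient accumulation simply delays the optimizer update until the gradients of several micro-batches have been summed. This is a generic PyTorch-style sketch of the idea, not the actual diffusers training loop:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)                                   # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(8)]  # micro-batches of size 1

accumulation_steps = 4                                    # 1 * 4 ≈ effective batch size of 4
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so gradients average like a batch of 4
    loss.backward()                                       # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                  # one update for every 4 micro-batches
        optimizer.zero_grad()
```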
Fitting on a 12GB VRAM GPU
--gradient_accumulation_steps=4
--gradient_checkpointing
--use_8bit_adam
--set_grads_to_none
Fitting on an 8GB VRAM GPU
Please follow our guide here
4. Conclusion!
This experience of training a ControlNet was a lot of fun. We successfully trained a model that can follow real face poses; however, it learned to make uncanny 3D faces instead of real 3D faces because that was the dataset it was trained on, which has its own charm and flair.
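Once training is done, the resulting ControlNet can be loaded for inference with the standard diffusers pipeline. Below is a minimal sketch; the ControlNet repo id is a placeholder for wherever you pushed your own checkpoint, and the conditioning image is one of the validation masks from the training command above:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Placeholder repo id -- point this at your own trained ControlNet on the Hub.
controlnet = ControlNetModel.from_pretrained("your-username/uncannyfaces_25K", torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",   # the base model we trained against
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

conditioning = load_image("./face_landmarks1.jpeg")  # an illustrated landmark mask
image = pipe(
    "Portrait of a clown face, oil on canvas, bittersweet expression",
    image=conditioning,
    num_inference_steps=30,
).images[0]
image.save("clown.png")
```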
Check out our Hugging Face Space:
As for next steps for us: in order to create realistic-looking faces, while still not using a real face dataset, one idea is running the whole FaceSynthetics dataset through Stable Diffusion Image2Image, converting the 3D-looking faces into realistic-looking ones, and then training another ControlNet. A rough sketch of that pass follows below.
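This is only an untested sketch of that idea using diffusers' img2img pipeline; the prompt and `strength` value are guesses, not settings we have validated:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

def make_realistic(face_image):
    """Nudge a synthetic 3D-looking face towards a photorealistic one."""
    return pipe(
        prompt="a realistic photo of a person's face, dslr, natural skin",
        image=face_image,
        strength=0.4,          # low strength: keep the pose/identity, change the rendering style
        guidance_scale=7.5,
    ).images[0]
```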
And stay tuned, as we will have a ControlNet Training event soon! Follow Hugging Face on Twitter or join our Discord to stay up to date on that.






