Stable Diffusion 1.5/2.0/2.1/XL 1.0, DALL-E, Imagen… In the past few years, diffusion models have showcased stunning quality in image generation. However, while producing great quality on generic concepts, they struggle to generate high quality for more specialised queries, for instance generating images in a particular style that was rarely seen in the training dataset.
We could retrain the whole model on a vast number of images explaining the concepts needed to handle the problem from scratch. However, this does not sound practical: first, we would need a large set of images for the idea, and second, it is simply too expensive and time-consuming.
There are solutions, however, that, given a handful of images and an hour of fine-tuning at worst, enable diffusion models to produce reasonable quality on the new concepts.
Below, I cover approaches like DreamBooth, LoRA, Hypernetworks, Textual Inversion, IP-Adapters and ControlNets, which are widely used to customise and condition diffusion models. The idea behind all these methods is to memorise a new concept we are trying to learn; however, each technique approaches it differently.
Diffusion architecture
Before diving into various methods that help to condition diffusion models, let’s first recap what diffusion models are.
The original idea of diffusion models is to train a model to reconstruct a coherent image from noise. In the training stage, we gradually add small amounts of Gaussian noise (the forward process), and then reconstruct the image iteratively by optimising the model to predict the noise, subtracting which we get closer to the target image (the reverse process).
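To make the forward process concrete, here is a minimal PyTorch sketch of the closed-form noising step; the schedule values and names are illustrative rather than taken from any particular implementation:

```python
import torch

# Linear beta schedule: how much noise is added at each of the T steps (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps

# Training then optimises the model to predict eps from (x_t, t) with a simple MSE loss.
```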
This original idea of image corruption has evolved into a more practical and lightweight architecture in which images are first compressed to a latent space, and all the manipulation with added noise is performed in that low-dimensional space.
To add textual information to the diffusion model, we first pass it through a text encoder (typically CLIP) to produce a latent embedding, which is then injected into the model via cross-attention layers.
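As a rough illustration of that injection, a cross-attention layer takes its queries from the image latents and its keys and values from the text embedding. A minimal sketch (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from image latents, keys/values from the text embedding (e.g. from CLIP)."""
    def __init__(self, latent_dim: int = 320, text_dim: int = 768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(text_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(text_dim, latent_dim, bias=False)
        self.to_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(latents), self.to_k(text_emb), self.to_v(text_emb)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)
```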
DreamBooth
The idea is to take a rare word (typically, an {SKS} token is used) and then teach the model to map the word {SKS} to a feature we would like to learn. That can, for example, be a style the model has never seen, like van Gogh's: we would show a dozen of his paintings and fine-tune on the phrase "A painting of shoes in the {SKS} style". We could similarly personalise the generation, for example learning how to generate images of a specific person, e.g. "{SKS} in the mountains", on a set of one's selfies.
To maintain the information learned in the pre-training stage, DreamBooth encourages the model not to deviate too much from the original, pre-trained version by adding text-image pairs generated by the original model to the fine-tuning set.
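As a sketch of what a training step could look like, assuming a diffusers-style U-Net that predicts noise and a batch that concatenates {SKS} instance samples with prior samples generated by the original model (the function and variable names here are hypothetical):

```python
import torch.nn.functional as F

def dreambooth_loss(unet, noisy_latents, timesteps, text_emb, target_noise, prior_weight=1.0):
    """Instance loss on {SKS} images plus a prior-preservation loss on images from the frozen model."""
    # The batch is assumed to be [instance samples | prior samples], split in half below.
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    pred_inst, pred_prior = pred.chunk(2)
    tgt_inst, tgt_prior = target_noise.chunk(2)
    instance_loss = F.mse_loss(pred_inst, tgt_inst)
    prior_loss = F.mse_loss(pred_prior, tgt_prior)  # discourages drift from the pre-trained model
    return instance_loss + prior_weight * prior_loss
```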
When to use and when not
DreamBooth produces the best quality across all methods; however, the technique can impact already-learnt concepts since the whole model is updated. The training schedule also limits the number of concepts the model can understand. Training is time-consuming, taking 1–2 hours. If we decide to introduce several new concepts at a time, we would need to store a model checkpoint for each of them, which wastes a lot of space.
Textual Inversion, paper, code

The idea behind textual inversion is that the knowledge stored in the latent space of diffusion models is vast. Hence, the style or the condition we want to reproduce with the diffusion model is already known to it; we just don't have the token to access it. Thus, instead of fine-tuning the model to reproduce the desired output when fed with the rare words "in the {SKS} style", we optimise for a textual embedding that results in the desired output.
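A minimal sketch of that setup, assuming a Hugging Face-style tokenizer and text encoder; only the newly added embedding row is trained while every model weight stays frozen (the placeholder and init word are illustrative):

```python
import torch

def add_trainable_token(tokenizer, text_encoder, placeholder="<sks>", init_word="painting"):
    """Register a new token and optimise only its embedding vector."""
    tokenizer.add_tokens([placeholder])
    text_encoder.resize_token_embeddings(len(tokenizer))
    emb = text_encoder.get_input_embeddings().weight
    new_id = tokenizer.convert_tokens_to_ids(placeholder)
    init_id = tokenizer.convert_tokens_to_ids(init_word)
    with torch.no_grad():
        emb[new_id] = emb[init_id].clone()  # start from a semantically close word
    # Freeze everything, then allow gradients only through the embedding table;
    # an optimiser over this single row drives the usual noise-prediction loss.
    text_encoder.requires_grad_(False)
    text_encoder.get_input_embeddings().weight.requires_grad_(True)
    return new_id
```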
When to use and when not
It takes very little space, as only the token embedding is stored. It is also relatively quick to train, with an average training time of 20–30 minutes. However, it comes with its shortcomings: as we are fine-tuning a single vector that guides the model towards a specific style, it won't generalise beyond that style.
LoRA
Low-Rank Adaptation (LoRA) was proposed for Large Language Models and was first adapted to diffusion models by Simo Ryu. The original idea of LoRA is that instead of fine-tuning the whole model, which can be rather costly, we can blend a fraction of new weights, fine-tuned for the task with a similar rare-token approach, into the original model.
In diffusion models, rank decomposition is applied to the cross-attention layers, which are responsible for merging prompt and image information. LoRA is applied to the weight matrices WQ, WK, WV, and WO in these layers.
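A minimal sketch of a LoRA wrapper around one of those frozen projections, using the standard low-rank update W·x + (α/r)·B·A·x (the names are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen projection (e.g. WQ/WK/WV/WO) and adds a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base.requires_grad_(False)                      # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project down to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)                              # update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```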
When to use and when not
LoRAs take very little time to train (5–15 minutes), since we are updating a handful of parameters compared to the whole model, and unlike DreamBooth checkpoints they take up much less space. However, models fine-tuned with these small adapters show worse quality compared to DreamBooth.
Hyper-networks, paper, code

Hypernetworks are, in some sense, extensions to LoRAs. Instead of learning the relatively small embeddings that alter the model's output directly, we train a separate network capable of predicting the weights for these newly injected embeddings.
By having the model predict the embeddings for a specific concept, we can teach the hypernetwork several concepts, reusing the same model for multiple tasks.
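A toy sketch of the idea: a small MLP maps a concept embedding to a weight delta for one target projection. Real hypernetwork implementations differ in the details; this is only meant to show the shape of the approach:

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Predicts a weight delta for a target projection from a concept embedding."""
    def __init__(self, concept_dim: int, out_features: int, in_features: int, hidden: int = 256):
        super().__init__()
        self.out_shape = (out_features, in_features)
        self.net = nn.Sequential(
            nn.Linear(concept_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features * in_features),
        )

    def forward(self, concept_emb: torch.Tensor) -> torch.Tensor:
        # One delta per concept; applied inside the attention layer as W + delta.
        return self.net(concept_emb).view(*self.out_shape)
```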
When to use and when not
Hypernetworks, which do not specialise in a single style but are instead able to produce a plethora of them, generally do not result in quality as good as the other methods, and they can take significant time to train. On the pros side, they can store many more concepts than the other single-concept fine-tuning methods.
IP-Adapter
Instead of controlling image generation with text prompts, IP adapters propose a method to control the generation with an image, without any changes to the underlying model.
The core idea behind the IP adapter is a decoupled cross-attention mechanism that allows the combination of source images with text and generated image features. This is achieved by adding a separate cross-attention layer, allowing the model to learn image-specific features.
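A minimal sketch of decoupled cross-attention, assuming a shared query projection and a separate, newly trained key/value pair for the image features (illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Text and image conditions get separate K/V projections; their attention outputs are summed."""
    def __init__(self, dim: int = 320, text_dim: int = 768, image_dim: int = 768, scale: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(text_dim, dim, bias=False)
        self.to_v_text = nn.Linear(text_dim, dim, bias=False)
        self.to_k_img = nn.Linear(image_dim, dim, bias=False)   # new, trainable image branch
        self.to_v_img = nn.Linear(image_dim, dim, bias=False)
        self.scale = scale

    @staticmethod
    def attend(q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, latents, text_emb, image_emb):
        q = self.to_q(latents)
        out_text = self.attend(q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        out_img = self.attend(q, self.to_k_img(image_emb), self.to_v_img(image_emb))
        return out_text + self.scale * out_img
```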
When to use and when not
IP adapters are lightweight, adaptable and fast. However, their performance is highly dependent on the quality and diversity of the training data. IP adapters generally tend to work better at supplying stylistic attributes (e.g. with an image of Marc Chagall's paintings) that we would like to see in the generated image, and may struggle to provide control over exact details, such as pose.
ControlNet
The ControlNet paper proposes a method to extend the input of a text-to-image model to any modality, allowing for fine-grained control of the generated image.
In the original formulation, ControlNet is a trainable copy of the encoder of the pre-trained diffusion model that takes, as input, the prompt, noise and control data (e.g. depth map, landmarks, etc.). To guide the generation, the intermediate outputs of the ControlNet are then added to the activations of the frozen diffusion model.
The injection is achieved through zero-convolutions, where the weights and biases of 1×1 convolutions are initialised as zeros and gradually learn meaningful transformations during training. This is similar to how LoRAs are trained: initialised with zeros, they start learning from the identity function.
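A minimal sketch of a zero-convolution and of how its output is added to the frozen model's activations (illustrative names, not the reference implementation):

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialised to zero, so the control branch starts as a no-op."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Inside the frozen U-Net's forward pass (pseudocode level):
#   h = frozen_block(h)
#   h = h + zero_conv(control_block(control_features))   # ControlNet signal injected here
```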
When to use and when not
ControlNets are preferable when we want to control the output structure, for example through landmarks, depth maps, or edge maps. Due to the need to train a full copy of the model's encoder, training can be time-consuming; however, these methods also allow for the best fine-grained control through rigid control signals.
Summary
- DreamBooth: Full fine-tuning of the model for custom subjects or styles, high level of control; however, it takes a long time to train and the result is fit for one purpose only.
- Textual Inversion: Embedding-based learning of new concepts, low level of control; however, fast to train.
- LoRA: Lightweight fine-tuning of the model for new styles/characters, medium level of control, while quick to train.
- Hypernetworks: A separate model that predicts LoRA weights for a given control request. Lower level of control in exchange for more styles. Takes time to train.
- IP-Adapter: Soft style/content guidance via reference images, medium level of stylistic control, lightweight and efficient.
- ControlNet: Control via pose, depth, and edges is very precise; however, it takes a longer time to train.
Best practice: a combination of an IP-adapter, with its softer stylistic guidance, and a ControlNet for pose and object arrangement produces the best results.
If you want to go into more detail on diffusion, check out this article, which I have found thoroughly written and accessible to any level of machine learning and math. If you want an intuitive explanation of the math with cool commentary, check out this video or this video.
For looking up information on ControlNets, I found this explanation very helpful; this article and this article could be a good intro as well.
Liked the author? Stay connected!
Have I missed anything? Don't hesitate to leave a note, a comment, or message me directly on LinkedIn or Twitter!
The opinions in this blog are my own and not attributable to or on behalf of Snap.