A Visual Guide to How Diffusion Models Work



This article is aimed at those who want to know exactly how diffusion models work, with no prior knowledge expected. I’ve tried to use illustrations wherever possible to offer visual intuitions on each part of these models. I’ve kept mathematical notation and equations to a minimum, and where they’re necessary I’ve tried to define and explain them as they occur.

Intro

I’ve framed this article around three main questions:

  • What exactly is it that diffusion models learn?
  • How and why do diffusion models work?
  • When you’ve trained a model, how do you get useful stuff out of it?

The examples will be based on the glyffuser, a minimal text-to-image diffusion model that I previously implemented and wrote about. The architecture of this model is a standard text-to-image denoising diffusion model without any bells or whistles. It was trained to generate pictures of new “Chinese” glyphs from English definitions. Have a look at the image below — even if you’re not familiar with Chinese writing, I hope you’ll agree that the generated glyphs look pretty similar to the real ones!

What exactly is it that diffusion models learn?

Generative AI models are often said to take a big pile of data and “learn” it. For text-to-image diffusion models, the data takes the form of pairs of images and descriptive text. But what exactly is it that we want the model to learn? First, let’s forget about the text for a moment and consider what we are trying to generate: the images.

Probability distributions

Broadly, we can say that we want a generative AI model to learn the underlying probability distribution of the data. What does this mean? Consider the one-dimensional normal (Gaussian) distribution below, commonly written 𝒩(μ, σ²), here with mean μ = 0 and variance σ² = 1. The black curve below shows the probability density function. We can sample from it: drawing values such that over a large number of samples, the set of values reflects the underlying distribution. Nowadays, we can simply write something like x = random.gauss(0, 1) in Python to sample from the standard normal distribution, although the computational sampling process itself is non-trivial!

Values sampled from an underlying distribution (here, the standard normal 𝒩(0, 1)) can then be used to estimate the parameters of that distribution.

We could think of a set of numbers sampled from the above normal distribution as a simple dataset, like that shown as the orange histogram above. In this particular case, we can calculate the parameters of the underlying distribution using maximum likelihood estimation, i.e. by working out the mean and variance. The normal distribution estimated from the samples is shown by the dotted line above. To take some liberties with terminology, you might consider this a simple example of “learning” an underlying probability distribution. We can also say that here we learnt the distribution explicitly, in contrast with the implicit methods that diffusion models use.
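As a minimal sketch in Python (using nothing beyond the standard library), this explicit “learning” looks like:

```python
import random
import statistics

# Draw a simple dataset from the standard normal distribution N(0, 1)
samples = [random.gauss(0, 1) for _ in range(10_000)]

# "Learn" the distribution explicitly by estimating its parameters
mean = statistics.fmean(samples)
variance = statistics.pvariance(samples, mu=mean)
print(f"estimated mean = {mean:.3f}, variance = {variance:.3f}")
# Both should be close to the true values of 0 and 1
```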

Conceptually, this is all that generative AI is doing — learning a distribution, then sampling from that distribution!

Data representations

What, then, does the underlying probability distribution of a more complex dataset look like, such as that of the image dataset we want to use to train our diffusion model?

First, we need to know what the representation of the data is. Generally, a machine learning (ML) model requires data inputs with a consistent representation, i.e. format. For the example above, it was simply numbers (scalars). For images, this representation is commonly a fixed-length vector.

The image dataset used for the glyffuser model is ~21,000 images of Chinese glyphs. The images are all the same size, 128 × 128 = 16384 pixels, and greyscale (single-channel color). Thus an obvious choice for the representation is a vector x of length 16384, where each element corresponds to the colour of one pixel: x = (x₁, x₂, …, x₁₆₃₈₄). We can call the domain of all possible images for our dataset “pixel space”.

An example glyph with pixel values labelled (downsampled to 32 × 32 pixels for readability).
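In NumPy terms, flattening an image into this vector representation is a one-liner (a sketch; the actual glyffuser data pipeline may differ):

```python
import numpy as np

# A greyscale 128 x 128 image with pixel values in [0, 1]
image = np.random.rand(128, 128)

# Flatten to the vector representation x = (x1, x2, ..., x16384)
x = image.reshape(-1)
print(x.shape)  # (16384,)
```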

Dataset visualization

We make the assumption that our individual data samples, x, are actually sampled from an underlying probability distribution, q(x), in pixel space, much as the samples from our first example were sampled from an underlying normal distribution in 1-dimensional space. Note: the notation x ∼ q(x) is commonly used to mean: “the random variable x sampled from the probability distribution q(x).”

This distribution is clearly much more complex than a Gaussian and cannot be easily parameterized — we need to learn it with an ML model, which we’ll discuss later. First, let’s try to visualize the distribution to gain a better intuition.

As humans find it difficult to see in more than 3 dimensions, we need to reduce the dimensionality of our data. A small digression on why this works: the manifold hypothesis posits that natural datasets lie on lower dimensional manifolds embedded in a higher dimensional space — think of a line embedded in a 2-D plane, or a plane embedded in 3-D space. We can use a dimensionality reduction technique such as UMAP to project our dataset from 16384 down to 2 dimensions. The 2-D projection retains a lot of structure, consistent with the idea that our data lie on a lower dimensional manifold embedded in pixel space. In our UMAP, we see two large clusters corresponding to characters in which the components are arranged either horizontally (e.g. 明) or vertically (e.g. 草). An interactive version of the plot below with popups on each datapoint is linked here.

 Click here for an interactive version of this plot.
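A minimal sketch of such a projection with the umap-learn package (the file name is a hypothetical placeholder for the flattened glyph images):

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical file holding ~21,000 flattened glyphs, shape (21000, 16384)
dataset = np.load("glyphs.npy")

# Project from 16384 dimensions down to 2 for visualization
projection = umap.UMAP(n_components=2).fit_transform(dataset)
print(projection.shape)  # (21000, 2)
```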

Let’s now use this low-dimensional UMAP dataset as a visual shorthand for our high-dimensional dataset. Remember, we assume that these individual points have been sampled from a continuous underlying probability distribution q(x). To get a sense of what this distribution might look like, we can apply a KDE (kernel density estimation) over the UMAP dataset. (Note: this is just an approximation for visualization purposes.)

This gives a sense of what q(x) should look like: clusters of glyphs correspond to high-probability regions of the distribution. The true q(x) lies in 16384 dimensions — this is the distribution we want to learn with our diffusion model.

We showed that for a simple distribution such as the 1-D Gaussian, we could calculate the parameters (mean and variance) from our data. However, for complex distributions such as images, we need to call on ML methods. Moreover, as we will find, diffusion models in practice don’t parameterize the distribution directly; rather, they learn it implicitly through the process of learning how to transform noise into data over many steps.

Takeaway

The goal of generative AI such as diffusion models is to learn the complex probability distributions underlying their training data and then sample from these distributions.

How and why do diffusion models work?

Diffusion models have recently come into the spotlight as a particularly effective method for learning these probability distributions. They generate convincing images by starting from pure noise and gradually refining it. To whet your interest, have a look at the animation below, which shows the denoising process generating 16 samples.

In this section we’ll only talk about the mechanics of how these models work, but if you’re interested in how they arose from the broader context of generative models, have a look at the further reading section below.

What’s “noise”?

Let’s first precisely define noise, since the term is thrown around a lot in the context of diffusion. Specifically, we’re talking about Gaussian noise: consider the samples we talked about in the section on probability distributions. You can think of each sample as an image of a single pixel of noise. An image that is “pure Gaussian noise”, then, is one in which each pixel value is sampled from an independent standard Gaussian distribution, 𝒩(0, 1). For a pure noise image in the domain of our glyph dataset, this would be noise drawn from 16384 separate Gaussian distributions. You can see this in the previous animation. One thing to keep in mind is that we can choose the means of these noise distributions, i.e. center them, on specific values — the pixel values of an image, for instance.

For convenience, you’ll often find the noise distributions for image datasets written as a single multivariate distribution 𝒩(0, I), where I is the identity matrix, a covariance matrix with all diagonal entries equal to 1 and zeroes elsewhere. This is simply a compact notation for a set of multiple independent Gaussians — i.e. there are no correlations between the noise on different pixels. In the basic implementations of diffusion models, only uncorrelated (a.k.a. “isotropic”) noise is used. This article contains an excellent interactive introduction to multivariate Gaussians.
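In code, sampling a pure-noise “image” from 𝒩(0, I) is just one independent draw per pixel (a NumPy sketch):

```python
import numpy as np

# 16384 independent draws from N(0, 1), arranged as a 128 x 128 image;
# equivalently, a single draw from the multivariate Gaussian N(0, I)
pure_noise_image = np.random.randn(128, 128)
```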

Diffusion process overview

Below is an adaptation of the somewhat-famous diagram from Ho et al.’s seminal paper “Denoising Diffusion Probabilistic Models”, which gives an overview of the whole diffusion process:

Diagram of the diffusion process adapted from Ho et al. 2020. The glyph 锂, meaning “lithium”, is used as a representative sample from the dataset.

I found that there was a lot to unpack in this diagram, and simply understanding what each component meant was very helpful, so let’s go through it and define everything step by step.

We previously used x ∼ q(x) to refer to our data. Here, we’ve added a subscript, xₜ, to denote the timestep t, indicating how many steps of “noising” have taken place. We refer to the samples noised to a given timestep as xₜ ∼ q(xₜ). x₀ is clean data and x_T (t = T) ∼ 𝒩(0, I) is pure noise.

We define a forward process whereby we corrupt samples with noise. This process is described by the distribution q(xₜ ∣ xₜ₋₁). If we could access the hypothetical reverse process q(xₜ₋₁ ∣ xₜ), we could generate samples from noise. As we cannot access it directly (we would need to know x₀), we use ML to learn the parameters, θ, of a model of this process, p_θ(xₜ₋₁ ∣ xₜ).

In the following sections we go into detail on how the forward and reverse diffusion processes work.

Forward diffusion, or “noising”

Used as a verb, “noising” an image refers to applying a transformation that moves it towards pure noise by scaling its pixel values down toward 0 while adding proportional Gaussian noise. Mathematically, this transformation is a multivariate Gaussian distribution centered on the pixel values of the preceding image.

In the forward diffusion process, this noising distribution is written as q(xₜ ∣ xₜ₋₁), where the vertical bar symbol “|” is read as “given” or “conditional on”, to indicate that the pixel means are passed forward from q(xₜ₋₁). At t = T, where T is a large number (commonly 1000), we aim to end up with images of pure noise (which, somewhat confusingly, is also a Gaussian distribution, as discussed previously).

The distributions q(xₜ) represent the distributions that have accumulated the effects of all the previous noising steps (the marginalization ∫ q(xₜ ∣ xₜ₋₁) q(xₜ₋₁) dxₜ₋₁ refers to integration over all possible conditions, which recovers the unconditioned distribution).

Since the conditional distributions are Gaussian, what about their variances? They are determined by a variance schedule that maps timesteps to variance values. Initially, an empirically determined schedule of linearly increasing values from 0.0001 to 0.02 over 1000 steps was presented in Ho et al. Later research by Nichol & Dhariwal suggested an improved cosine schedule. They state that a schedule is most effective when the rate of information destruction through noising is relatively even per step throughout the whole noising process.
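Both schedules are easy to write down; the sketch below follows the linear values quoted above and Nichol & Dhariwal’s cosine formulation (with their offset s = 0.008):

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al.): variances rise from 1e-4 to 0.02
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule (Nichol & Dhariwal): define the cumulative signal
# level alpha_bar via a squared cosine, then back out the variances
s = 0.008
steps = np.arange(T + 1)
alpha_bar = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = alpha_bar / alpha_bar[0]
betas_cosine = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, 0.999)
```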

Forward diffusion intuition

As we encounter Gaussian distributions both as pure noise (xₜ, t = T) and as the noising distribution q(xₜ ∣ xₜ₋₁), I’ll try to draw the distinction by giving a visual intuition of the distribution for a single noising step, q(x₁ ∣ x₀), for some arbitrary, structured 2-dimensional data:

Each noising step q(xₜ ∣ xₜ₋₁) is a Gaussian distribution conditioned on the previous step.

The distribution q(x₁ ∣ x₀) is Gaussian, centered around each point in x₀, shown in blue. Several example points x₀⁽ⁱ⁾ are picked to illustrate this, with q(x₁ ∣ x₀ = x₀⁽ⁱ⁾) shown in orange.

In practice, the main use of these distributions is to generate specific instances of noised samples for training (discussed further below). We can calculate the parameters of the noising distributions at any timestep directly from the variance schedule, since the chain of Gaussians is itself also Gaussian. This is very convenient, as we don’t need to perform noising sequentially — for any given starting data x₀⁽ⁱ⁾, we can calculate the noised sample xₜ⁽ⁱ⁾ by sampling from q(xₜ ∣ x₀ = x₀⁽ⁱ⁾) directly.
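A sketch of this shortcut: precompute the cumulative signal level ᾱₜ from the variance schedule, and any noised sample is then just a weighted mix of the clean data and fresh Gaussian noise.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal level at each t

def noise_to_timestep(x0, t):
    """Sample x_t ~ q(x_t | x_0 = x0) directly, skipping sequential noising."""
    eps = np.random.standard_normal(x0.shape)  # fresh Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return xt, eps
```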

Forward diffusion visualization

Let’s now return to our glyph dataset (once more using the UMAP visualization as a visual shorthand). The top row of the figure below shows our dataset sampled from distributions noised to various timesteps: xₜ ∼ q(xₜ). As we increase the number of noising steps, you can see that the dataset begins to resemble pure Gaussian noise. The bottom row visualizes the underlying probability distribution q(xₜ).

The dataset xₜ (above) sampled from its probability distribution q(xₜ) (below) at different noising timesteps.

Reverse diffusion overview

It follows that if we knew the reverse distributions q(xₜ₋₁ ∣ xₜ), we could repeatedly subtract a small amount of noise, starting from a pure noise sample x_T at t = T, to arrive at a data sample x₀ ∼ q(x₀). In practice, however, we cannot access these distributions without knowing x₀ beforehand. Intuitively, it’s easy to make a known image much noisier, but given a very noisy image, it’s much harder to guess what the original image was.

So what are we to do? Since we have a large amount of data, we can train an ML model to accurately guess the original image that any given noisy image came from. Specifically, we learn the parameters θ of an ML model that approximates the reverse noising distributions, p_θ(xₜ₋₁ ∣ xₜ), for t = 0, …, T. In practice, this is embodied in a single noise prediction model trained over many different samples and timesteps. This allows it to denoise any given input, as shown in the figure below.

The ML model predicts added noise at any given timestep t.

Next, let’s go over how this noise prediction model is implemented and trained in practice.

How the model is implemented

First, we define the ML model — generally a deep neural network of some sort — that will act as our noise prediction model. This is what does the heavy lifting! In practice, any ML model that inputs and outputs data of the correct size can be used; the U-net, an architecture particularly suited to learning images, is what we use here and is frequently chosen in practice. More recent models also use transformers.

We use the U-net architecture (Ronneberger et al. 2015) for our ML noise prediction model. We train the model by minimizing the difference between predicted and actual noise.

Then we run the training loop depicted in the figure above (a minimal code sketch follows the list):

  • We take a random image from our dataset and noise it to a random timestep t. (In practice, we speed things up by doing many examples in parallel!)
  • We feed the noised image into the ML model and train it to predict the (known to us) noise in the image. We also perform timestep conditioning by feeding the model a timestep embedding, a high-dimensional unique representation of the timestep, so that the model can distinguish between timesteps. This can be a vector the same size as our image directly added to the input (see here for a discussion of how this is implemented).
  • The model “learns” by minimizing the value of a loss function, some measure of the difference between the predicted and actual noise. The mean squared error (the mean of the squares of the pixel-wise differences between the predicted and actual noise) is used in our case.
  • Repeat until the model is well trained.
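Here is a minimal runnable sketch of that loop in PyTorch. The tiny stand-in network and random “dataset” are placeholders for a real U-net and real images; the structure of the loop is the point.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # variance schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal level

class TinyNoisePredictor(torch.nn.Module):
    """Stand-in for a real U-net: maps (noisy image, timestep) -> noise."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 16, 3, padding=1)
        self.out = torch.nn.Conv2d(16, 1, 3, padding=1)
        self.t_embed = torch.nn.Embedding(T, 128 * 128)  # timestep embedding

    def forward(self, x, t):
        # Add the timestep embedding directly to the input, as described above
        x = x + self.t_embed(t).view(-1, 1, 128, 128)
        return self.out(F.relu(self.conv(x)))

model = TinyNoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):                          # stand-in training loop
    clean_images = torch.randn(16, 1, 128, 128)  # stand-in for a data batch

    # 1. Pick a random timestep for each image in the batch
    t = torch.randint(0, T, (16,))

    # 2. Noise the images to timestep t in closed form
    noise = torch.randn_like(clean_images)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    noisy_images = ab.sqrt() * clean_images + (1 - ab).sqrt() * noise

    # 3. Predict the noise and minimize the mean squared error
    loss = F.mse_loss(model(noisy_images, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```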

Note: A neural network is essentially a function with a huge number of parameters (many millions for the glyffuser). Neural network ML models are trained by iteratively updating their parameters using backpropagation to minimize a given loss function over many training data examples. This is an excellent introduction. These parameters effectively store the network’s “knowledge”.

A noise prediction model trained in this way eventually sees many different combinations of timesteps and data examples. The glyffuser, for instance, was trained over 100 epochs (runs through the whole dataset), so it saw around 2 million data samples. Through this process, the model implicitly learns the reverse diffusion distributions over the entire dataset at all different timesteps. This allows the model to sample the underlying distribution q(x₀) by stepwise denoising starting from pure noise. Put another way, given an image noised to any given level, the model can predict how to reduce the noise based on its guess of what the original image was. By doing this repeatedly, updating its guess of the original image each time, the model can transform any noise into a sample that lies in a high-probability region of the underlying data distribution.

Reverse diffusion in practice

We can now revisit this video of the glyffuser denoising process. Recall that a large number of steps from sample to noise, e.g. T = 1000, is used during training to make the noise-to-sample trajectory easy for the model to learn, as changes between steps will be small. Does that mean we need to run 1000 denoising steps every time we want to generate a sample?

Luckily, this is not the case. Essentially, we can run the single-step noise prediction but then rescale it to any given step, although it may not be very good if the gap is too large! This allows us to approximate the full sampling trajectory with fewer steps. The video above uses 120 steps, for instance (most implementations will allow the user to set the number of sampling steps).

Recall that predicting the noise at a given step is equivalent to predicting the original image x₀, and that we can calculate any noised image deterministically using only the variance schedule and x₀. Thus, we can calculate xₜ₋ₖ from any denoising step. The closer the steps are, the better the approximation will be.
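As a sketch of the plain version with every step included (fewer-step samplers rescale these same updates), where `model` is the trained noise predictor from before:

```python
import torch

@torch.no_grad()
def sample(model, shape=(16, 1, 128, 128), T=1000):
    """Minimal DDPM-style sampling loop: pure noise -> data."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # start from pure noise x_T
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))  # predicted noise
        # Mean of the reverse step p(x_{t-1} | x_t), from the DDPM paper
        mean = (x - betas[t] * eps / (1 - alpha_bar[t]).sqrt()) / alphas[t].sqrt()
        # Add scaled noise on every step except the last
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x
```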

Too few steps, however, and the results become worse as the steps become too large for the model to effectively approximate the denoising trajectory. If we only use 5 sampling steps, for example, the sampled characters don’t look very convincing at all:

There is then a whole literature on more advanced sampling methods beyond what we’ve discussed so far, allowing effective sampling with far fewer steps. These often reframe the sampling as a differential equation to be solved deterministically, giving an eerie quality to the sampling videos — I’ve included one at the end if you’re interested. In production-level models, these are usually preferred over the simple method discussed here, but the basic principle of deducing the noise-to-sample trajectory is the same. A full discussion is beyond the scope of this article, but see e.g. this paper and its corresponding implementation in the Hugging Face diffusers library for more information.
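For instance, with the Hugging Face diffusers library, swapping in one of these faster solvers is typically just a scheduler change (“model-id” is a placeholder for any compatible checkpoint):

```python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipeline = DiffusionPipeline.from_pretrained("model-id")

# Swap the default sampler for a faster ODE-solver-based scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline.scheduler.config
)

# Far fewer sampling steps are now needed for comparable quality
images = pipeline(num_inference_steps=20).images
```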

Alternative intuition from the score function

To me, it was still not 100% clear why training the model on noise prediction generalizes so well. I found that an alternative interpretation of diffusion models known as “score-based modeling” filled some of the gaps in intuition (for more information, refer to Yang Song’s definitive article on the topic).

The dataset xₜ sampled from its probability distribution q(xₜ) at different noising timesteps; below, we add the score function ∇ₓ log q(xₜ).

I try to give a visual intuition in the bottom row of the figure above: essentially, learning the noise in our diffusion model is equivalent (up to a constant factor) to learning the score function, which is the gradient of the log of the probability distribution: ∇ₓ log q(x). As a gradient, the score function represents a vector field with vectors pointing towards the regions of highest probability density. Subtracting the noise at each step is then equivalent to following the directions in this vector field towards regions of high probability density.

As long as there is some signal, the score function effectively guides sampling, but in regions of low probability it tends towards zero as there is little to no gradient to follow. Using many steps to cover different noise levels allows us to avoid this: we smear out the gradient field at high noise levels, allowing sampling to converge even if we start from low probability density regions of the distribution. The figure shows that as the noise level is increased, more of the domain is covered by the score function vector field.

Summary

  • The goal of diffusion models is to learn the underlying probability distribution of a dataset and then be able to sample from it. This requires forward and reverse diffusion (noising) processes.
  • The forward noising process takes samples from our dataset and gradually adds Gaussian noise (pushes them off the data manifold). This forward process is computationally efficient because any level of noise can be added in closed form in a single step.
  • The reverse noising process is difficult because we need to predict how to remove the noise at each step without knowing the original data point in advance. We train an ML model to do this by giving it many examples of data noised to different timesteps.
  • Using very small steps in the forward noising process makes it easier for the model to learn to reverse these steps, as the changes are small.
  • By applying the reverse noising process iteratively, the model refines noisy samples step by step, eventually producing a realistic data point (one that lies on the data manifold).

Takeaway

Diffusion models are a powerful framework for learning complex data distributions. The distributions are learnt implicitly by modelling a sequential denoising process. This process can then be used to generate samples similar to those in the training distribution.

When you’ve trained a model, how do you get useful stuff out of it?

Earlier uses of generative AI such as “This Person Does Not Exist” (Karras et al. 2019) made waves simply because it was the first time most people had seen AI-generated photorealistic human faces. A generative adversarial network or “GAN” was used in that case, but the principle remains the same: the model implicitly learnt an underlying data distribution — in that case, human faces — then sampled from it. So far, our glyffuser model does a similar thing: it samples randomly from the distribution of Chinese glyphs.

The question then arises: can we do something more useful than just sample randomly? You’ve likely already encountered text-to-image models such as Dall-E. They can incorporate extra meaning from text prompts into the diffusion process — this is referred to as conditioning. Likewise, diffusion models for scientific applications like protein structure generation (e.g. Chroma, RFdiffusion, AlphaFold3) or inorganic crystal structure generation (e.g. MatterGen) become much more useful if they can be conditioned to generate samples with desirable properties such as a specific symmetry, bulk modulus, or band gap.

Conditional distributions

We can think of conditioning as a way to guide the diffusion sampling process towards particular regions of our probability distribution. We mentioned conditional distributions in the context of forward diffusion. Below we show how conditioning can be thought of as reshaping a base distribution.

A simple example of a joint probability distribution p(x, y), shown as a contour map, together with its two marginal 1-D probability distributions, p(x) and p(y). The highest points of p(x, y) are at (x₁, y₁) and (x₂, y₂). The conditional distributions p(x ∣ y = y₁) and p(x ∣ y = y₂) are shown overlaid on the main plot.

Consider the figure above. Think of p(x) as a distribution we want to sample from (i.e., the images) and p(y) as conditioning information (i.e., the text dataset). These are the marginal distributions of a joint distribution p(x, y). Integrating p(x, y) over y recovers p(x), and vice versa.

Sampling from p(x), we are equally likely to get x₁ or x₂. However, we can condition on y = y₁ to obtain p(x ∣ y = y₁). You can think of this as taking a slice through p(x, y) at a given value of y. In this conditioned distribution, we are much more likely to sample x near x₁ than x₂.

In practice, in order to condition on a text dataset, we need to convert the text into a numerical form. We can do this using text embeddings that can be injected into the noise prediction model during training.

Embedding text with an LLM

In the glyffuser, our conditioning information is in the form of English text definitions. We have two requirements: 1) ML models prefer fixed-length vectors as input. 2) The numerical representation of our text must understand context — if we have the words “lithium” and “element” nearby, the meaning of “element” should be understood as “chemical element” rather than “heating element”. Both of these requirements can be met by using a pre-trained LLM.

The diagram below shows how an LLM converts text into fixed-length vectors. The text is first tokenized (LLMs break text into tokens, small chunks of characters, as their basic unit of interaction). Each token is converted into a base embedding, which is a fixed-length vector of the size of the LLM input. These vectors are then passed through the pre-trained LLM (here we use the encoder portion of Google’s T5 model), where they are imbued with additional contextual meaning. We end up with an array of n vectors of the same length d, i.e. an (n, d) sized tensor.

We can convert text to a numerical embedding imbued with contextual meaning using a pre-trained LLM.
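A sketch of this embedding step with the Hugging Face transformers library, using the encoder of the small T5 variant:

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Tokenize a definition, then embed it with the pre-trained encoder
tokens = tokenizer("lithium; a soft, silvery metallic element",
                   return_tensors="pt")
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)  # (1, n, d): n tokens, each a d-dimensional vector
```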

Note: in some models, notably Dall-E, additional image-text alignment is performed using contrastive pretraining (CLIP). Imagen seems to show that we can get away without doing this.

Training the diffusion model with text conditioning

The exact method by which this embedding vector is injected into the model can vary. In Google’s Imagen model, for example, the embedding tensor is pooled (combined into a single vector in the embedding dimension) and added into the data as it passes through the noise prediction model; it is also included in another way using cross-attention (a method of learning contextual information between sequences of tokens, most famously used in the transformer models that form the basis of LLMs like ChatGPT).

Conditioning information can be added via multiple different methods, but the training loss remains the same.

In the glyffuser, we only use cross-attention to introduce this conditioning information. While a significant architectural change is required to introduce this additional information into the model, the loss function for our noise prediction model remains exactly the same.
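In pseudocode terms (a sketch with placeholder names), the only change from the unconditional training step sketched earlier is the extra argument:

```python
# Unconditional training step:
#   loss = F.mse_loss(model(noisy_images, t), noise)

# Conditional training step: the text embedding enters the U-net through
# cross-attention layers, but the loss is still plain MSE on the noise
loss = F.mse_loss(model(noisy_images, t, text_embeddings), noise)
```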

Testing the conditioned diffusion model

Let’s do a simple test of the fully trained conditioned diffusion model. In the figure below, we try to denoise in a single step with the text prompt “Gold”. As touched upon in our interactive UMAP, Chinese characters often contain components known as radicals which can convey sound (phonetic radicals) or meaning (semantic radicals). A common semantic radical is derived from the character meaning “gold”, “金”, and is used in characters that are in some broad sense related to gold or metals.

Even with a single sampling step, conditioning guides denoising towards the relevant regions of the probability distribution.

The figure shows that although a single step is insufficient to approximate the denoising trajectory very well, we have moved into a region of our probability distribution with the “金” radical. This suggests that the text prompt is effectively guiding our sampling towards a region of the glyph probability distribution related to the meaning of the prompt. The animation below shows a 120-step denoising sequence for the same prompt, “Gold”. You can see that every generated glyph has either the 釒 or 钅 radical (the same radical in traditional and simplified Chinese, respectively).

Takeaway

Conditioning enables us to sample meaningful outputs from diffusion models.

Further remarks

I found that with the help of tutorials and existing libraries, it was possible to implement a working diffusion model despite not having a full understanding of what was going on under the hood. I think this is a good way to start learning, and I highly recommend Hugging Face’s tutorial on training a simple diffusion model using their diffusers Python library (which now includes my small bugfix!).

I’ve omitted some topics that are crucial to how production-grade diffusion models function, but which are unnecessary for core understanding. One is the question of how to generate high-resolution images. In our example, we did everything in pixel space, but this becomes very computationally expensive for large images. The general approach is to perform diffusion in a smaller space, then upscale in a separate step. Methods include latent diffusion (used in Stable Diffusion) and cascaded super-resolution models (used in Imagen). Another topic is classifier-free guidance, a very elegant method for enhancing the conditioning effect to give much better prompt adherence. I show the implementation in my previous post on the glyffuser and highly recommend this article if you want to learn more.

Further reading

A non-exhaustive list of materials I found very helpful:

Fun extras

Diffusion sampling using the DPMSolverSDEScheduler developed by Katherine Crowson and implemented in Hugging Face diffusers — note the smooth transition from noise to data.
