But you don’t want just any image; you want the image you specified, typically with a text prompt. And so the diffusion model is paired with a second model, such as a large language model (LLM) trained to match images with text descriptions, that guides each step of the cleanup process, pushing the diffusion model toward images the large language model considers a good match to the prompt.
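Here’s a rough sketch, in Python, of what that guided cleanup loop looks like. Everything in it is a toy stand-in: the “image” is just eight numbers and the noise predictor is simple arithmetic rather than a real neural network, but the shape of the loop (blend the plain prediction with the text-steered one, then remove a little of the estimated noise) is the same idea real systems use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks: here an "image" is just 8 numbers,
# so the guided cleanup loop is easy to follow end to end.
PROMPT_TARGET = np.linspace(-1.0, 1.0, 8)  # what "a good match to the prompt" looks like
AVERAGE_IMAGE = np.zeros(8)                # what images look like on average

def predict_noise(noisy_image, prompt_embedding=None):
    # Hypothetical noise predictor: estimates how far the image still is from
    # a clean one. Conditioned on a prompt it aims at the prompt's target;
    # unconditioned it aims at the average image.
    anchor = AVERAGE_IMAGE if prompt_embedding is None else prompt_embedding
    return noisy_image - anchor

def guided_denoising(prompt_embedding, steps=60, guidance_scale=1.0, step_size=0.1):
    x = rng.normal(size=8)  # start from pure noise
    for _ in range(steps):
        noise_plain = predict_noise(x)                     # diffusion model on its own
        noise_guided = predict_noise(x, prompt_embedding)  # steered by the text model
        # Guidance: take the plain prediction plus an extra push in the direction
        # the prompt suggests (real systems use guidance scales well above 1).
        noise = noise_plain + guidance_scale * (noise_guided - noise_plain)
        x = x - step_size * noise  # one cleanup step: remove a bit of the estimated noise
    return x

print(np.round(guided_denoising(PROMPT_TARGET), 2))  # lands near PROMPT_TARGET, not the average
```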
An aside: This LLM isn’t pulling the links between text and pictures out of thin air. Most text-to-image and text-to-video models today are trained on large data sets that contain billions of pairings of text and pictures or text and video scraped from the web (a practice many creators are very unhappy about). Which means that what you get from such models is a distillation of the world as it’s represented online, distorted by prejudice (and pornography).
It’s easiest to picture diffusion models working with images. But the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images (the consecutive frames of a video) rather than just a single image.
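If you’re wondering what actually changes when you go from one image to a clip, it’s mostly an extra dimension: the model cleans up a whole stack of frames at once, so each frame stays consistent with its neighbors. A toy sketch (made-up arithmetic in place of a real network) just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(1)

# A single image vs. a short clip: the only structural difference is a time axis.
image = rng.normal(size=(256, 256, 3))      # height x width x RGB
clip = rng.normal(size=(16, 256, 256, 3))   # 16 consecutive frames of the same size

def denoise_step(noisy, estimate_noise):
    # One cleanup step: subtract a fraction of the estimated noise.
    return noisy - 0.1 * estimate_noise(noisy)

def toy_video_noise_estimate(frames):
    # Stand-in for a real video model: it looks at all frames together, blending
    # each frame with its neighbors in time so the cleanup stays consistent
    # from one frame to the next (real models learn this kind of cross-frame link).
    neighbors = (np.roll(frames, 1, axis=0) + np.roll(frames, -1, axis=0)) / 2
    return 0.5 * frames + 0.5 * neighbors

denoised_clip = denoise_step(clip, toy_video_noise_estimate)
print(image.shape, denoised_clip.shape)  # (256, 256, 3) vs. (16, 256, 256, 3)
```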
What’s a latent diffusion model?
All this takes an enormous amount of compute (read: energy). That’s why most diffusion models used for video generation use a technique called latent diffusion. Instead of processing raw data (the millions of pixels in each video frame), the model works in what’s known as a latent space, in which the video frames (and text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.
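A minimal sketch of that idea, with a crude block-averaging “compressor” standing in for the learned encoder and decoder a real system would use. The point is just the sizes: the expensive cleanup loop runs on a code roughly 50 times smaller than the raw frames, and only the final result gets blown back up to full-size video.

```python
import numpy as np

rng = np.random.default_rng(2)

FRAME = (256, 256, 3)   # raw frame: about 200,000 numbers
LATENT = (32, 32, 4)    # compressed code: about 4,000 numbers (roughly 50x smaller)

def encode(frames):
    # Crude compressor standing in for a learned encoder: average 8x8 pixel
    # blocks, then tack on one extra channel so the code has 4 channels.
    t = frames.shape[0]
    blocks = frames.reshape(t, 32, 8, 32, 8, 3).mean(axis=(2, 4))  # (t, 32, 32, 3)
    return np.concatenate([blocks, blocks.mean(axis=-1, keepdims=True)], axis=-1)

def decode(latents):
    # Crude decompressor standing in for a learned decoder: blow the code back
    # up to full-size frames by repeating values.
    return latents[..., :3].repeat(8, axis=1).repeat(8, axis=2)

def denoise_in_latent_space(latents, steps=50):
    # The heavy cleanup loop runs here, on the small code, not on raw pixels.
    for _ in range(steps):
        estimated_noise = 0.1 * latents  # stand-in for the real noise predictor
        latents = latents - 0.1 * estimated_noise
    return latents

# How much smaller the compressed clip is:
fake_clip = rng.normal(size=(16,) + FRAME)
print(fake_clip.size, "->", encode(fake_clip).size)  # 3,145,728 numbers -> 65,536

# Generation starts from noise in the latent space and only decodes at the end:
video = decode(denoise_in_latent_space(rng.normal(size=(16,) + LATENT)))
print(video.shape)  # (16, 256, 256, 3)
```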
Something similar happens every time you stream a video over the internet: a video is sent from a server to your screen in a compressed format to make it get to you faster, and when it arrives, your computer or TV converts it back into a watchable video.