The Evolution of Text to Video Models


Simplifying the neural nets behind Generative Video Diffusion

Let’s go! [Image Generated by Author]

We’ve witnessed remarkable strides in AI image generation. But what happens when we add the dimension of time? Videos are, after all, just moving images.

Text-to-video generation is a complex task that requires AI to understand not only what things look like, but how they move and interact over time. It’s an order of magnitude harder than text-to-image.

To produce a coherent video, a neural network must:
1. Comprehend the input prompt
2. Understand how the world works
3. Understand how objects move and how physics applies
4. Generate a sequence of frames that make sense spatially, temporally, and logically

Despite these challenges, today’s diffusion neural networks are making impressive progress in this field. In this article, we will cover the main ideas behind video diffusion models — the key challenges, the approaches, and the seminal papers in the field.

This article is based on a longer YouTube video I made. If you enjoy this read, you’ll enjoy watching the video too.

To understand text-to-video generation, we need to start with its predecessor: text-to-image diffusion models. These models have a single goal — to transform random noise and a text prompt into a coherent image. In fact, all generative image models do this — Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and yes, Diffusion too.

The basic goal of all image generation models is to convert random noise into an image, often conditioned on additional prompts (like text). [Image by Author]

Diffusion, specifically, relies on a gradual denoising process to generate images.

1. Start with a randomly generated noisy image
2. Use a neural network to progressively remove noise
3. Condition the denoising process on text input
4. Repeat until a clear image emerges

How Diffusion Models generate images — A neural network progressively removes noise from a pure noise image conditioned on a text prompt, eventually revealing a clear image. [Illustration by Author] (Image generated by a neural network)
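To make the four steps above concrete, here is a minimal PyTorch-style sketch of the sampling loop with a simple DDPM-style update rule. The noise-prediction network `model` and the text embedding `text_emb` are placeholders I am assuming, not any particular library’s API:

```python
import torch

@torch.no_grad()
def sample(model, text_emb, steps=1000, shape=(1, 3, 64, 64)):
    # Minimal DDPM-style sampling sketch; `model` and `text_emb` are assumed placeholders.
    betas = torch.linspace(1e-4, 0.02, steps)         # noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                            # 1. start from pure noise
    for t in reversed(range(steps)):                  # 2. progressively remove noise
        eps = model(x, torch.tensor([t]), text_emb)   # 3. text-conditioned noise prediction
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])  # DDPM mean update
        if t > 0:                                     # keep some randomness except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                          # 4. a clear image emerges
```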

But how are these denoising neural networks trained?

During training, we start with real images and progressively add noise to them in small steps — this is called forward diffusion. This generates lots of samples of clear images and their slightly noisier versions. The neural network is then trained to reverse this process: it takes the noisy image as input and predicts how much noise to remove to recover the clearer version. In text-conditional models, we train attention layers to attend to the input prompt for guided denoising.

During training, we add noise to clear images (left) — this is called Forward Diffusion. The neural network is trained to reverse this noise addition process — a process known as Reverse Diffusion. Images generated using a neural network. [Image by Author]
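A training step can be sketched just as compactly. The trick is that the noisy version of an image at any step t can be produced in a single shot from the clean image, and the network is trained to predict the noise that was added (again, `model` and `text_emb` are assumed placeholders):

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_images, text_emb, steps=1000):
    # One denoising training step (sketch); `model` and `text_emb` are assumed placeholders.
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    b = clean_images.shape[0]
    t = torch.randint(0, steps, (b,))                         # random noise level per image
    noise = torch.randn_like(clean_images)
    a = alpha_bar[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * clean_images + (1 - a).sqrt() * noise  # forward diffusion in one shot
    pred_noise = model(noisy, t, text_emb)                    # predict the added noise
    return F.mse_loss(pred_noise, noise)                      # objective for reverse diffusion
```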

This iterative approach allows for the generation of highly detailed and diverse images. You can watch the following YouTube video where I explain text-to-image in much more detail — concepts like Forward and Reverse Diffusion, U-Net, CLIP models, and how I implemented them in Python and PyTorch from scratch.

If you are comfortable with the core concepts of Text-to-Image Conditional Diffusion, let’s move on to videos next.

In theory, we could follow the same conditioned noise-removal idea to do text-to-video diffusion. However, adding time into the equation introduces several new challenges:

1. Temporal Consistency: Ensuring objects, backgrounds, and motions remain coherent across frames.
2. Computational Demands: Generating multiple frames per second instead of a single image.
3. Data Scarcity: While large image-text datasets are available, high-quality video-text datasets are scarce.

Some commonly used video-text datasets [Image by Author]

Due to the lack of high-quality datasets, text-to-video cannot rely on supervised training alone. That is why practitioners usually mix in two more data sources to train video diffusion models: (1) paired image-text data, which is far more available, and (2) unlabelled video data, which is super-abundant and contains lots of information about how the world works. Several groundbreaking models have emerged to tackle these challenges. Let’s discuss some of the important milestone papers one by one.

We’re about to get into the technical nitty-gritty! If you find the material ahead difficult, feel free to watch this companion video as a visual side-by-side guide while reading the next section.

Video Diffusion Models (VDM) uses a 3D U-Net architecture with factorized spatio-temporal convolution layers. Each term is explained in the image below.

What each of the terms means [Image by Author]

VDM is jointly trained on both image and video data. It replaces the 2D U-Nets from image diffusion models with 3D U-Net models. The video is input into the model as a time sequence of 2D frames. The term factorized basically means that the spatial and temporal layers are decoupled and processed separately from each other. This makes the computations far more efficient.

What is a 3D U-Net?

The 3D U-Net is a special computer vision neural network that first downsamples the video through a series of these factorized spatio-temporal convolutional layers, essentially extracting video features at different resolutions. Then, an upsampling path expands the low-dimensional features back to the shape of the original video. While upsampling, skip connections reuse the features generated during the downsampling path.

The 3D Factorized UNet Architecture [Image by Author]

Remember that in any convolutional neural network, the earlier layers always capture detailed information about local sections of the image, while later layers pick up global-level patterns by accessing larger sections — so by using skip connections, the U-Net combines local details with global features, making it an excellent network for feature learning and denoising.
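As a rough illustration of what “factorized” means in code, here is a toy PyTorch block (my own construction, not the exact VDM module) where a (1, 3, 3) kernel mixes information only within each frame and a (3, 1, 1) kernel mixes information only across frames:

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    # Toy factorized block: a spatial-only 3D conv followed by a temporal-only 3D conv.
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):            # x: (batch, channels, time, height, width)
        x = self.spatial(x)          # mixes information within each frame
        return self.temporal(x)      # mixes information across frames

video_features = torch.randn(1, 64, 16, 32, 32)   # 16 frames of 32x32 feature maps
out = FactorizedSpatioTemporalConv(64)(video_features)
```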

As mentioned, VDM is jointly trained on paired image-text and video-text data. While it is a great proof of concept, VDM generates quite low-resolution videos by today’s standards.

You can read more about VDM here.

Make-A-Video by Meta AI takes the bold approach of claiming that we don’t necessarily need labeled video data to train video diffusion models. WHHAAA?! Yes, you read that right.

Adding temporal layers to Image Diffusion

Make-A-Video first trains a regular text-to-image diffusion model, just like DALL-E or Stable Diffusion, with paired image-text data. Next, unsupervised learning is done on unlabelled video data to teach the model temporal relationships. The extra layers of the network are trained using a technique called masked spatio-temporal decoding, where the network learns to generate missing frames by processing the visible frames. Note that no labelled video data is required in this pipeline (although further video-text fine-tuning is possible as an additional third step), since the model learns spatio-temporal relationships from paired text-image data and raw unlabelled video data.

Make-A-Video in a nutshell [Image by Author]

The video output by the above model is 64×64 with 16 frames. This video is then upsampled along the time and pixel axes using separate neural networks called Temporal Super Resolution or TSR (insert new frames between existing frames to increase the frames-per-second (fps)) and Spatial Super Resolution or SSR (upscale the individual frames of the video to a higher resolution). After these steps, Make-A-Video outputs 256×256 videos with 76 frames.
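To visualize the shapes involved, here is a toy stand-in for the TSR and SSR stages using naive interpolation. The real Make-A-Video modules are learned networks, so this only shows how a 16-frame 64×64 clip grows into 76 frames at 256×256:

```python
import torch
import torch.nn.functional as F

def temporal_super_resolution(video, target_frames=76):
    # Toy TSR stand-in: insert new frames between existing ones by interpolating over time.
    b, c, t, h, w = video.shape
    return F.interpolate(video, size=(target_frames, h, w), mode="trilinear", align_corners=False)

def spatial_super_resolution(video, target_hw=(256, 256)):
    # Toy SSR stand-in: upscale every frame to a higher resolution.
    b, c, t, h, w = video.shape
    return F.interpolate(video, size=(t, *target_hw), mode="trilinear", align_corners=False)

base_clip = torch.randn(1, 3, 16, 64, 64)                               # base model output
final = spatial_super_resolution(temporal_super_resolution(base_clip))  # -> (1, 3, 76, 256, 256)
```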

You can learn more about Make-A-Video here.

Imagen Video employs a cascade of seven models for video generation and enhancement. The process starts with a base video generation model that creates low-resolution video clips. This is followed by a series of super-resolution models — three SSR (Spatial Super Resolution) models for spatial upscaling and three TSR (Temporal Super Resolution) models for temporal upscaling. This cascaded approach allows Imagen Video to generate high-quality, high-resolution videos with impressive temporal consistency.

The Imagen workflow [Source: Imagen paper: https://imagen.research.google/video/paper.pdf]
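Conceptually, the cascade is just function composition: a base generator followed by alternating super-resolution stages. A hand-wavy sketch (the names, ordering, and call signatures here are my assumptions, not Imagen Video’s actual API):

```python
def run_cascade(base_model, refinement_stages, text_emb):
    # Illustrative cascade: each stage upsamples the previous output in space (SSR)
    # or time (TSR); the real system chains a base model with 3 SSR + 3 TSR models.
    video = base_model(text_emb)          # low-resolution, low-fps clip
    for stage in refinement_stages:
        video = stage(video, text_emb)    # progressively higher resolution / frame rate
    return video
```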

Models like Nvidia’s VideoLDM try to address the temporal consistency issue by using latent diffusion modelling. First, they train a latent diffusion image generator. The basic idea is to train a Variational Autoencoder, or VAE. The VAE consists of an encoder network that compresses input frames into a low-dimensional latent space and a decoder network that reconstructs them back into the original images. The diffusion process is done entirely in this low-dimensional space instead of the full pixel space, making it far more computationally efficient and semantically powerful.

A typical autoencoder. The input frames are individually downsampled into a low-dimensional compressed latent space. A decoder network then learns to reconstruct the image back from this low-resolution space. [Image by Author]

What are Latent Diffusion Models?

The diffusion model is trained entirely in the low-dimensional latent space, i.e. it learns to denoise the low-dimensional latent representations instead of the full-resolution frames. That is why we call these Latent Diffusion Models. The resulting latent outputs are then passed through the VAE decoder to convert them back to pixel space.
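Here is a toy autoencoder (a plain one, without the variational part) to make the compression concrete: each 256×256 frame is squeezed into a 4×32×32 latent, roughly a 48× reduction, and the latent diffusion model would then denoise these latents rather than raw pixels. The layer sizes are illustrative, not VideoLDM’s:

```python
import torch
import torch.nn as nn

# Toy (non-variational) autoencoder: three stride-2 convs compress 256x256 frames
# into a 4-channel 32x32 latent; the decoder mirrors this to reconstruct pixels.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(128, 4, 4, stride=2, padding=1),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 128, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

frames = torch.randn(16, 3, 256, 256)   # 16 video frames, compressed independently
latents = encoder(frames)               # -> (16, 4, 32, 32): where diffusion runs
reconstruction = decoder(latents)       # -> (16, 3, 256, 256): back to pixel space
```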

The decoder of the VAE is enhanced by adding new temporal layers between its spatial layers. These temporal layers are fine-tuned on video data, making the VAE produce temporally consistent and flicker-free videos from the latents generated by the image diffusion model. This is done by freezing the spatial layers of the decoder and adding new trainable temporal layers that are conditioned on previously generated frames.

The VAE decoder is fine-tuned with temporal information so that it can produce consistent videos from the latents generated by the Latent Diffusion Model (LDM) [Source: Video LDM paper https://arxiv.org/abs/2304.08818]
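The recipe of “freeze the spatial layers, add trainable temporal layers” can be sketched as follows. The block structure and the Conv3d temporal mixer are stand-ins for illustration, not the paper’s exact modules:

```python
import torch.nn as nn

def add_temporal_layers(spatial_blocks, channels):
    # Freeze pretrained spatial decoder blocks and interleave new trainable temporal layers.
    upgraded = nn.ModuleList()
    for block in spatial_blocks:
        for p in block.parameters():
            p.requires_grad = False                     # spatial weights stay frozen
        temporal = nn.Conv3d(channels, channels,        # new layer mixes across frames
                             kernel_size=(3, 1, 1), padding=(1, 0, 0))
        upgraded.append(block)
        upgraded.append(temporal)                       # only these receive gradient updates
    return upgraded
```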

You can learn more about Video LDMs here.

While Video LDM compresses individual frames of the video to train an LDM, SORA compresses video both spatially and temporally. Recent papers like CogVideoX have demonstrated that 3D Causal VAEs are great at compressing videos, making diffusion training computationally efficient and able to generate flicker-free, consistent videos.

3D VAEs compress videos spatio-temporally to generate compressed 4D representations of video data [Image by Author]

Transformers for Diffusion

A transformer model is used as the diffusion network instead of the more traditional U-Net model. Of course, transformers need the input data to be presented as a sequence of tokens. That’s why the compressed video encodings are flattened into a sequence of patches. Note that each patch and its location in the sequence represents a spatio-temporal feature of the original video.

OpenAI SORA Video Preprocessing [Source: OpenAI (https://openai.com/index/sora/)] (License: Free)
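A rough sketch of this patchify step: the 3D-VAE latent video is cut into small spacetime patches, and each patch is flattened into one token for the diffusion transformer. The patch sizes and latent shapes below are made up for illustration; SORA’s actual values are not public:

```python
import torch

def patchify(latent_video, patch=(2, 4, 4)):
    # Cut the compressed video into (time, height, width) patches and flatten each into a token.
    b, c, t, h, w = latent_video.shape
    pt, ph, pw = patch
    x = latent_video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)       # group all values belonging to one patch together
    tokens = x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)
    return tokens                                # (batch, sequence_length, token_dim)

latents = torch.randn(1, 16, 8, 32, 32)   # hypothetical 3D-VAE output: 8 latent frames of 32x32
tokens = patchify(latents)                # -> (1, 256, 512): a token sequence for the transformer
```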

It is speculated that OpenAI has collected a rather large annotated dataset of video-text data which they are using to train conditional video generation models.

Combining all the strengths listed below, plus more tricks that the ironically named OpenAI may never disclose, SORA promises to be a massive leap in video generation AI models.

  1. Massive video-text annotated dataset + pretraining techniques with image-text data and unlabelled data
  2. General architectures of Transformers
  3. Huge compute investment (thanks Microsoft)
  4. The representation power of Latent Diffusion Modeling.

The future of AI is easy to predict. In 2024, Data + Compute = Intelligence. Large corporations will invest computing resources to train large diffusion transformers. They will hire annotators to label high-quality video-text data. Large-scale text-video datasets probably already exist in the closed-source domain (looking at you, OpenAI), and they may become open-source within the next 2–3 years, especially with recent advancements in AI video understanding. It remains to be seen whether the upcoming huge computing and financial investments can solve video generation on their own, or whether further architectural and algorithmic advancements will be needed from the research community.

Links

Author’s YouTube Channel: https://www.youtube.com/@avb_fj

Video on this topic: https://youtu.be/KRTEOkYftUY

15-step Zero-to-Hero on Conditional Image Diffusion: https://youtu.be/w8YQc

Papers and Articles

Video Diffusion Models: https://arxiv.org/abs/2204.03458

Imagen: https://imagen.research.google/video/

Make A Video: https://makeavideo.studio/

Video LDM: https://research.nvidia.com/labs/toronto-ai/VideoLDM/index.html

CogVideoX: https://arxiv.org/abs/2408.06072

OpenAI SORA article: https://openai.com/index/sora/

Diffusion Transformers: https://arxiv.org/abs/2212.09748

Useful article: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
