We’ve witnessed remarkable strides in AI image generation. But what happens when we add the dimension of time? Videos are, after all, just moving images.
Text-to-video generation is a complex task that requires AI to understand not only what things look like, but also how they move and interact over time. It is an order of magnitude more complex than text-to-image.
To produce a coherent video, a neural network must:
1. Comprehend the input prompt
2. Understand how the world works
3. Understand how objects move and how physics applies
4. Generate a sequence of frames that make sense spatially, temporally, and logically
Despite these challenges, today’s diffusion neural networks are making impressive progress in this field. In this article, we will cover the main ideas behind video diffusion models: the key challenges, the main approaches, and the seminal papers in the field.
To understand text-to-video generation, we need to start with its predecessor: text-to-image diffusion models. These models have a single goal: to transform random noise and a text prompt into a coherent image. In fact, all generative image models aim to do this, whether Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or, yes, Diffusion models.
Diffusion, specifically, relies on a gradual denoising process to generate images.
1. Start with a randomly generated noisy image
2. Use a neural network to progressively remove noise
3. Condition the denoising process on text input
4. Repeat until a clear image emerges
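To make this concrete, here is a minimal sketch of that reverse-diffusion sampling loop in PyTorch. The `denoiser` network, the text embedding, and the noise-schedule tensors (`alphas`, `alpha_bars`, `betas`) are placeholders for whatever a real model would provide; the point is only to show the shape of the loop, not a production sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, text_emb, alphas, alpha_bars, betas, shape=(1, 3, 64, 64)):
    """DDPM-style reverse diffusion: start from noise, denoise step by step."""
    x = torch.randn(shape)                        # 1. start from a random noisy image
    for t in reversed(range(len(alphas))):        # 2. progressively remove noise
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = denoiser(x, t_batch, text_emb)      # 3. noise prediction conditioned on the text
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:                                 # keep a little noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                      # 4. a clear image emerges
```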
But how are these denoising neural networks trained?
During training, we start with real images and progressively add noise to them in small steps; this is called forward diffusion. This produces many pairs of a clean image and its slightly noisier version. The neural network is then trained to reverse this process: given the noisy image, it predicts how much noise to remove to recover the cleaner version. In text-conditional models, we train attention layers to attend to the input prompt so the denoising is guided by the text.
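Here is a hedged sketch of what a single training step might look like, assuming a simple DDPM-style noise schedule and a `denoiser` network that takes the noisy image, the timestep, and a text embedding. The names are illustrative, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, images, text_emb, alpha_bars):
    """Forward-diffuse a clean batch, then train the network to predict the added noise."""
    B = images.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))                  # random noise level per sample
    noise = torch.randn_like(images)
    a_bar = alpha_bars[t].view(B, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise   # forward diffusion in one shot
    pred_noise = denoiser(noisy, t, text_emb)                    # text-conditioned noise prediction
    return F.mse_loss(pred_noise, noise)                         # learn to undo the added noise
```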
This iterative approach allows for the generation of highly detailed and diverse images. You can watch the following YouTube video where I explain text-to-image generation in much more detail: concepts like forward and reverse diffusion, U-Net, CLIP models, and how I implemented them in Python and PyTorch from scratch.
If you are comfortable with the core concepts of text-to-image conditional diffusion, let’s move on to videos.
In theory, we could follow the same conditioned noise-removal idea to do text-to-video diffusion. However, adding time into the equation introduces several new challenges:
1. Temporal Consistency: Ensuring objects, backgrounds, and motions remain coherent across frames.
2. Computational Demands: Generating multiple frames per second instead of a single image.
3. Data Scarcity: While large image-text datasets are available, high-quality video-text datasets are scarce.
Due to the lack of high-quality datasets, text-to-video cannot rely on supervised training alone. That is why video diffusion models are usually also trained with two additional data sources: first, paired image-text data, which is far more abundant, and second, unlabelled video data, which is super-abundant and contains a great deal of information about how the world works. Several groundbreaking models have emerged to tackle these challenges. Let’s discuss some of the important milestone papers one by one.
We’re about to get into the technical nitty-gritty! If you find the material ahead difficult, feel free to watch this companion video as a visual side-by-side guide while reading the next section.
VDM uses a 3D U-Net architecture with factorized spatio-temporal convolution layers. Each term is explained in the image below.
VDM is jointly trained on both image and video data. It replaces the 2D U-Nets from image diffusion models with 3D U-Nets. The video is fed into the model as a time sequence of 2D frames. The term factorized basically means that the spatial and temporal layers are decoupled and processed separately from each other, which makes the computation much more efficient.
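As an illustration of what “factorized” can mean in code, here is a minimal PyTorch block that applies a purely spatial 3D convolution followed by a purely temporal one. This is my reading of the idea, not the exact VDM implementation.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    """Spatial conv with kernel (1, k, k) followed by temporal conv with kernel (k, 1, 1)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))

    def forward(self, x):                          # x: (batch, channels, time, height, width)
        return self.temporal(self.spatial(x))

video = torch.randn(2, 3, 16, 64, 64)              # 2 clips, 16 RGB frames of 64x64
print(FactorizedSpatioTemporalConv(3, 32)(video).shape)   # torch.Size([2, 32, 16, 64, 64])
```

Because each layer mixes information along only one axis at a time, a block costs roughly k·k + k kernel elements instead of k·k·k, which is where the efficiency gain comes from.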
What is a 3D U-Net?
A 3D U-Net is a computer vision network that first downsamples the video through a series of these factorized spatio-temporal convolutional layers, essentially extracting video features at different resolutions. Then, an upsampling path expands the low-dimensional features back to the shape of the original video. While upsampling, skip connections are used to reuse the features generated during the downsampling path.
Remember that in any convolutional neural network, the earlier layers capture detailed information about local sections of the image, while later layers pick up global-level patterns by accessing larger sections. By using skip connections, the U-Net combines local details with global features, making it an excellent network for feature learning and denoising.
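The toy encoder-decoder below illustrates just the skip-connection idea: high-resolution features from the downsampling path are concatenated back in during upsampling. A real 3D U-Net stacks many such levels built from factorized blocks like the one above; this is only a 2D, single-level sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNetLevel(nn.Module):
    """One down/up level with a skip connection, operating on (B, C, H, W) features."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)          # halve the resolution
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)   # restore the resolution
        self.fuse = nn.Conv2d(ch * 2, ch, 3, padding=1)                    # merge skip + upsampled

    def forward(self, x):
        skip = x                                        # local, high-resolution details
        h = F.relu(self.down(x))                        # global, low-resolution features
        h = F.relu(self.up(h))
        return self.fuse(torch.cat([h, skip], dim=1))   # combine local and global information

print(TinyUNetLevel(8)(torch.randn(1, 8, 32, 32)).shape)   # torch.Size([1, 8, 32, 32])
```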
VDM is jointly trained on paired image-text and video-text datasets. While it is a great proof of concept, VDM generates videos that are quite low-resolution by today’s standards.
You can read more about VDM here.
Make-A-Video by Meta AI takes the daring approach of claiming that we don’t necessarily need labelled video data to train video diffusion models. WHHAAA?! Yes, you read that right.
Adding temporal layers to Image Diffusion
Make-A-Video first trains a regular text-to-image diffusion model, just like DALL-E or Stable Diffusion, on paired image-text data. Next, unsupervised learning is done on unlabelled video data to teach the model temporal relationships. The extra layers of the network are trained with a technique called masked spatio-temporal decoding, where the network learns to generate missing frames by processing the visible frames. Note that no labelled video data is required in this pipeline (although further video-text fine-tuning is possible as an additional third step), since the model learns spatio-temporal relationships from paired text-image data and raw unlabelled videos.
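The paper’s masked spatio-temporal decoding objective is more involved than this, but the core idea (hide some frames and ask the network to reconstruct them from the visible ones) can be sketched roughly as follows. Everything here, including the `temporal_decoder` interface, is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def masked_frame_loss(temporal_decoder, video, mask_ratio=0.5):
    """video: (B, T, C, H, W). Hide random frames, then reconstruct them from the visible ones."""
    B, T = video.shape[:2]
    mask = torch.rand(B, T) < mask_ratio                    # True = frame is hidden
    visible = video * (~mask).float().view(B, T, 1, 1, 1)   # zero out the hidden frames
    recon = temporal_decoder(visible, mask)                 # hypothetical module fills them in
    return F.mse_loss(recon[mask], video[mask])             # score only the hidden frames
```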
The video output by the above model is 64×64 with 16 frames. This video is then upsampled along the time and pixel axes using two separate neural networks: Temporal Super Resolution, or TSR (insert new frames between existing frames to increase the frames-per-second), and Spatial Super Resolution, or SSR (upscale the individual frames of the video to a higher resolution). After these steps, Make-A-Video outputs 256×256 videos with 76 frames.
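To see how the tensor shapes change through this pipeline, here is a rough stand-in that uses plain interpolation in place of the learned TSR and SSR networks. The real stages are neural networks, but the shape bookkeeping is the same.

```python
import torch
import torch.nn.functional as F

video = torch.randn(1, 3, 16, 64, 64)            # (batch, channels, frames, height, width)

# TSR stand-in: add in-between frames along the time axis (16 -> 76 frames).
more_frames = F.interpolate(video, size=(76, 64, 64), mode="trilinear", align_corners=False)

# SSR stand-in: upscale each frame spatially (64x64 -> 256x256).
upscaled = F.interpolate(more_frames, size=(76, 256, 256), mode="trilinear", align_corners=False)

print(upscaled.shape)                             # torch.Size([1, 3, 76, 256, 256])
```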
You can learn more about Make-A-Video here.
Imagen Video employs a cascade of seven models for video generation and enhancement. The process starts with a base video generation model that creates low-resolution video clips. This is followed by a series of super-resolution models: three SSR (Spatial Super Resolution) models for spatial upscaling and three TSR (Temporal Super Resolution) models for temporal upscaling. This cascaded approach allows Imagen Video to generate high-quality, high-resolution videos with impressive temporal consistency.
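Conceptually, the cascade is just a chain in which each stage refines the previous stage’s output while being conditioned on the same text embedding. The sketch below captures that control flow with placeholder models; the stage counts match the description above, but the exact interleaving of SSR and TSR stages follows the paper, not this simplified loop.

```python
def generate_with_cascade(base_model, tsr_models, ssr_models, text_emb):
    """One base generator followed by a chain of super-resolution refiners (placeholder models)."""
    video = base_model(text_emb)              # low-resolution, low-frame-rate clip
    for tsr in tsr_models:                    # three temporal super-resolution stages: more frames
        video = tsr(video, text_emb)
    for ssr in ssr_models:                    # three spatial super-resolution stages: sharper frames
        video = ssr(video, text_emb)
    return video
```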
Models like Nvidia’s VideoLDM try to address the temporal consistency issue by using latent diffusion modelling. First, they train a latent diffusion image generator. The basic idea is to train a Variational Autoencoder, or VAE. The VAE consists of an encoder network that compresses input frames into a low-dimensional latent space and a decoder network that reconstructs them back into the original images. The diffusion process runs entirely in this low-dimensional space instead of the full pixel space, making it much more computationally efficient and semantically powerful.
What are Latent Diffusion Models?
The diffusion model is trained entirely in the low-dimensional latent space, i.e. it learns to denoise the low-dimensional latent representations instead of the full-resolution frames. This is why we call these Latent Diffusion Models. The resulting latent-space outputs are then passed through the VAE decoder to convert them back to pixel space.
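Putting those pieces together, the latent-diffusion recipe looks roughly like the snippet below. The single-layer encoder and decoder are stand-ins for a real VAE, and the denoising loop is omitted; the point is simply that diffusion happens on a small latent tensor and the decoder maps the result back to pixels.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)             # stand-in VAE encoder: 256x256x3 -> 32x32x4
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)     # stand-in VAE decoder: back to 256x256x3

frame = torch.randn(1, 3, 256, 256)
latent = encoder(frame)                        # diffusion operates on this small tensor
print(latent.shape)                            # torch.Size([1, 4, 32, 32])

# ... run the (omitted) denoising loop entirely in latent space to get `denoised_latent` ...
denoised_latent = torch.randn_like(latent)     # placeholder for the diffusion output
pixels = decoder(denoised_latent)              # convert latents back to pixel space
print(pixels.shape)                            # torch.Size([1, 3, 256, 256])
```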
The decoder of the VAE is enhanced by adding new temporal layers in between its spatial layers. These temporal layers are fine-tuned on video data, making the VAE produce temporally consistent, flicker-free videos from the latents generated by the image diffusion model. This is done by freezing the spatial layers of the decoder and adding new trainable temporal layers that are conditioned on previously generated frames.
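Here is a hedged sketch of that fine-tuning setup: the pretrained spatial decoder is frozen, and only newly inserted temporal layers are trained. The module names and the simple residual 1D convolution are made up for illustration.

```python
import torch.nn as nn

class TemporalBlock(nn.Module):
    """New 1D convolution over time, inserted between pretrained spatial layers."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        h = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, C, T)           # mix information across frames only
        h = self.conv(h).reshape(B, H, W, C, T).permute(0, 3, 4, 1, 2)
        return x + h                            # residual: start close to the image-only behaviour

def trainable_parameters(decoder, temporal_blocks):
    for p in decoder.parameters():              # freeze the pretrained spatial layers
        p.requires_grad = False
    return [p for block in temporal_blocks for p in block.parameters()]   # train only the new layers
```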
You can learn more about Video LDMs here.
While Video LDM compresses individual frames of the video to train an LDM, SORA compresses video both spatially and temporally. Recent papers like CogVideoX have demonstrated that 3D Causal VAEs are great at compressing videos, making diffusion training computationally efficient and enabling flicker-free, consistent video generation.
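“Causal” here means the temporal convolutions only look at current and past frames. A minimal way to express that in PyTorch is to pad only on the past side of the time axis, as in this illustrative layer (not the CogVideoX code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution whose temporal receptive field covers only current and past frames."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(k, k, k), padding=(0, k // 2, k // 2))

    def forward(self, x):                          # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.k - 1, 0))  # pad k-1 frames on the past side of the time axis
        return self.conv(x)

video = torch.randn(1, 3, 8, 32, 32)
print(CausalConv3d(3, 16)(video).shape)            # torch.Size([1, 16, 8, 32, 32])
```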
Transformers for Diffusion
A transformer model is used as the diffusion network instead of the more traditional U-Net. Of course, transformers need the input data to be presented as a sequence of tokens, so the compressed video encodings are flattened into a sequence of patches. Note that each patch and its location in the sequence represents a spatio-temporal feature of the original video.
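One hedged way to picture that patchification step: cut the compressed latent video into small spatio-temporal cubes and flatten them into a token sequence for the transformer to denoise. The shapes and patch sizes below are arbitrary examples, not SORA’s actual configuration.

```python
import torch

latents = torch.randn(1, 16, 8, 32, 32)     # (batch, channels, frames, height, width) after the VAE

pt, ph, pw = 2, 4, 4                        # spatio-temporal patch size
B, C, T, H, W = latents.shape
tokens = (
    latents
    .reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    .permute(0, 2, 4, 6, 1, 3, 5, 7)        # group values by their (time, height, width) patch position
    .reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)
)
print(tokens.shape)                          # torch.Size([1, 256, 512]) -> 256 tokens of dimension 512
```

Each token therefore corresponds to a specific little cube of space and time in the original clip, which is exactly the spatio-temporal patch idea described above.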
It is speculated that OpenAI has collected a fairly large annotated video-text dataset, which they are using to train conditional video generation models.
Combining all of the strengths listed below, plus more tricks that the ironically named OpenAI may never disclose, SORA promises to be a massive leap in video generation AI models.
- Massive video-text annotated dataset + pretraining techniques with image-text data and unlabelled data
- The general-purpose architecture of Transformers
- Huge compute investment (thanks Microsoft)
- The representation power of Latent Diffusion Modeling.
The future of AI is easy to predict. In 2024, Data + Compute = Intelligence. Large corporations will invest computing resources to train large diffusion transformers. They will hire annotators to label high-quality video-text data. Large-scale text-video datasets probably already exist in the closed-source domain (looking at you, OpenAI), and they may become open-source within the next 2–3 years, especially with recent advancements in AI video understanding. It remains to be seen whether the upcoming huge computing and financial investments can solve video generation on their own, or whether further architectural and algorithmic advancements will be needed from the research community.
Links
Author’s YouTube Channel: https://www.youtube.com/@avb_fj
Video on this topic: https://youtu.be/KRTEOkYftUY
15-step Zero-to-Hero on Conditional Image Diffusion: https://youtu.be/w8YQc
Papers and Articles
Video Diffusion Models: https://arxiv.org/abs/2204.03458
Imagen: https://imagen.research.google/video/
Make A Video: https://makeavideo.studio/
Video LDM: https://research.nvidia.com/labs/toronto-ai/VideoLDM/index.html
CogVideoX: https://arxiv.org/abs/2408.06072
OpenAI SORA article: https://openai.com/index/sora/
Diffusion Transformers: https://arxiv.org/abs/2212.09748
Useful article: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/