
Stable Video Diffusion: Latent Video Diffusion Models to Large Datasets


Generative AI has been a driving force within the AI community for a while now, and the advancements made in generative image modeling, especially with diffusion models, have helped generative video models progress significantly, not only in research but also in real-world applications. Conventionally, generative video models are either trained from scratch, or partially or completely finetuned from pretrained image models with extra temporal layers, on a combination of image and video datasets.

Building on these advancements, in this article we will talk about the Stable Video Diffusion Model, a latent video diffusion model capable of generating high-resolution, state-of-the-art image-to-video and text-to-video content. We will discuss how latent diffusion models trained for synthesizing 2D images have improved the capabilities and efficiency of generative video models by adding temporal layers and fine-tuning the models on small datasets of high-quality videos. We will take a deeper dive into the architecture and workings of the Stable Video Diffusion Model, evaluate its performance on various metrics, and compare it with current state-of-the-art frameworks for video generation. So let's start.

Thanks to its almost unlimited potential, Generative AI has been a primary subject of research for AI and ML practitioners for some time now, and the past few years have seen rapid advancements in both the efficiency and the performance of generative image models. The learnings from generative image models have allowed researchers and developers to make progress on generative video models, leading to enhanced practicality and real-world applications. However, most research attempting to enhance the capabilities of generative video models focuses primarily on the exact arrangement of temporal and spatial layers, with little attention paid to the influence of choosing the right data on the final result of these generative models.

Thanks to the progress made by generative image models, researchers have observed that the impact of the training data distribution on the performance of generative models is significant and undisputed. Moreover, researchers have also observed that pretraining a generative image model on a large and diverse dataset, followed by fine-tuning it on a smaller dataset of higher quality, often improves performance significantly. Traditionally, generative video models reuse the learnings obtained from successful generative image models, while the effect of data selection and training strategies has yet to be studied. The Stable Video Diffusion Model is an attempt to enhance the capabilities of generative video models by venturing into these previously uncharted territories, with a special focus on data selection.

Recent generative video models rely on diffusion models, along with text-conditioning or image-conditioning approaches, to synthesize multiple consistent video or image frames. Diffusion models are known for their ability to learn how to progressively denoise a sample drawn from a normal distribution through an iterative refinement process, and they have delivered desirable results on high-resolution video and text-to-image synthesis. Using the same principle at its core, the Stable Video Diffusion Model trains a latent video diffusion model on its video dataset, a space in which Generative Adversarial Networks (GANs) and, to some extent, autoregressive models have also been used.
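To make the iterative refinement idea concrete, below is a minimal sketch of an Euler-style denoising loop: sampling starts from pure Gaussian noise and is progressively refined toward a clean sample. The `denoiser` network and the noise-level schedule `sigmas` are hypothetical placeholders; this illustrates the general diffusion principle, not the actual Stable Video Diffusion sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, sigmas):
    """Minimal sketch of iterative denoising: start from a normal distribution
    and progressively refine the sample. `denoiser` is a hypothetical network
    that predicts the clean sample given a noisy input and a noise level."""
    x = torch.randn(shape) * sigmas[0]           # start from pure Gaussian noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_pred = denoiser(x, sigma)             # predict the denoised sample
        d = (x - x0_pred) / sigma                # local derivative (Euler step)
        x = x + d * (sigma_next - sigma)         # move toward the lower noise level
    return x
```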

The Stable Video Diffusion Model follows a strategy not previously implemented by any generative video model: it relies on latent video diffusion baselines with a fixed architecture and a fixed training strategy, followed by an assessment of the effect of curating the data. The Stable Video Diffusion Model aims to make the following contributions to the field of generative video modeling.

  1. To present a systematic and effective data curation workflow that turns a large collection of uncurated video samples into a high-quality dataset, which is then used by the generative video models. 
  2. To train state-of-the-art image-to-video and text-to-video models that outperform the existing frameworks. 
  3. To conduct domain-specific experiments that probe the model's 3D understanding and strong motion prior. 

At its foundation, the Stable Video Diffusion Model builds on the learnings from latent video diffusion models and data curation techniques.

Latent Video Diffusion Models

Latent Video Diffusion Models, or Video-LDMs, follow the approach of training the main generative model in a latent space with reduced computational complexity, and most Video-LDMs implement a pretrained text-to-image model coupled with temporal mixing layers added to the pretrained architecture. As a result, most latent video diffusion models either train only the temporal layers or skip the training process altogether, unlike the Stable Video Diffusion Model, which fine-tunes the entire framework. Moreover, for synthesizing text-to-video data, the Stable Video Diffusion Model conditions itself directly on a text prompt, and the results indicate that the resulting framework can easily be finetuned into a multi-view synthesis or an image-to-video model.
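To illustrate how temporal mixing layers can be interleaved with pretrained spatial layers, here is a minimal PyTorch sketch. The block structure, dimensions, and attention configuration are assumptions chosen for illustration, not the official Video-LDM or Stable Video Diffusion implementation.

```python
import torch
import torch.nn as nn

class TemporalMixingBlock(nn.Module):
    """Illustrative sketch: a spatial attention layer (as found in a pretrained
    2D image model) is interleaved with a newly added temporal attention layer
    that mixes information across the frame axis."""
    def __init__(self, dim, num_frames, num_heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_frames = num_frames

    def forward(self, x):
        # x: (batch * frames, tokens, dim) -- spatial attention within each frame
        h, _ = self.spatial(x, x, x)
        x = x + h
        # rearrange so attention now runs along the temporal (frame) axis
        bf, t, d = x.shape
        b = bf // self.num_frames
        x = x.view(b, self.num_frames, t, d).transpose(1, 2).reshape(b * t, self.num_frames, d)
        h, _ = self.temporal(x, x, x)
        x = x + h
        # restore the original (batch * frames, tokens, dim) layout
        x = x.view(b, t, self.num_frames, d).transpose(1, 2).reshape(bf, t, d)
        return x
```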

Data Curation

Data curation is an essential part not only of the Stable Video Diffusion Model but of generative models as a whole, since pretraining large models on large-scale datasets is essential for boosting performance across different tasks, including language modeling, discriminative text-to-image generation, and much more. Data curation has been implemented successfully in generative image models by leveraging the capabilities of efficient language-image representations, although such discussions have never been a focus in the development of generative video models. There are several hurdles developers face when curating data for generative video models, and to address these challenges, the Stable Video Diffusion Model implements a three-stage training strategy, resulting in enhanced results and a significant boost in performance.

Data Curation for High Quality Video Synthesis

As discussed in the previous section, the Stable Video Diffusion Model implements a three-stage training strategy, resulting in enhanced results and a significant boost in performance. Stage I is an image pretraining stage that makes use of a 2D text-to-image diffusion model. Stage II is for video pretraining, in which the framework trains on a large amount of video data. Finally, Stage III is for video finetuning, in which the model is refined on a small subset of high-quality, high-resolution videos.
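A minimal sketch of how such a staged schedule might be wired together is shown below. The stage names, dataset labels, and the `train_stage` callable are placeholders for illustration, not the paper's actual configuration.

```python
# Illustrative three-stage schedule: each stage resumes from the weights
# produced by the previous one. Names and dataset labels are placeholders.
STAGES = [
    {"name": "stage_1_image_pretraining", "data": "large_scale_image_dataset"},
    {"name": "stage_2_video_pretraining", "data": "curated_low_res_video_dataset"},
    {"name": "stage_3_video_finetuning",  "data": "small_high_quality_high_res_video_set"},
]

def run_pipeline(train_stage, initial_model):
    model = initial_model
    for stage in STAGES:
        # hand the current weights to the next stage's training run
        model = train_stage(model, stage["name"], stage["data"])
    return model
```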

However, before the Stable Video Diffusion Model implements these three stages, it is important to process and annotate the data, since it serves as the base for Stage II, the video pretraining stage, and plays a critical role in ensuring optimal output. To ensure maximum efficiency, the framework first implements a cascaded cut detection pipeline at three different FPS (frames per second) levels; the need for this pipeline is demonstrated in the following image.
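Below is an illustrative sketch of what a cascaded cut detector operating at several effective frame rates could look like, using simple frame differencing with OpenCV. The thresholds, frame-skip values, and detection criterion are assumptions for the sketch, not the pipeline actually used by the framework.

```python
import cv2
import numpy as np

def detect_cuts(video_path, frame_skips=(1, 4, 10), threshold=30.0):
    """Illustrative cascaded cut detection: scan the video at several effective
    frame rates (fine to coarse) so that both hard cuts and slower transitions
    are caught. Threshold and skip values are placeholders."""
    cuts = set()
    for skip in frame_skips:
        cap = cv2.VideoCapture(video_path)
        prev_gray, index = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % skip == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                if prev_gray is not None:
                    # mean absolute difference between consecutive sampled frames
                    if np.mean(cv2.absdiff(gray, prev_gray)) > threshold:
                        cuts.add(index)
                prev_gray = gray
            index += 1
        cap.release()
    return sorted(cuts)
```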

Next, the Stable Video Diffusion Model annotates each video clip using three different synthetic captioning methods. The following table compares the datasets used in the Stable Video Diffusion framework before and after the filtration process.
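The original paper pairs an image-level caption of each clip's middle frame with a clip-level video caption and an LLM-based summary of the two; the sketch below assumes that split, and the three captioner callables are hypothetical stand-ins rather than real library calls.

```python
def annotate_clip(clip, image_captioner, video_captioner, summarizer):
    """Sketch of combining three synthetic captioning signals per clip. The
    three callables are hypothetical: an image captioner applied to the middle
    frame, a video captioner applied to the full clip, and an LLM summarizer
    that fuses the two captions into one description."""
    mid_frame = clip[len(clip) // 2]
    caption_image = image_captioner(mid_frame)   # caption of the middle frame
    caption_video = video_captioner(clip)        # caption of the whole clip
    caption_summary = summarizer(caption_image, caption_video)
    return {"image": caption_image, "video": caption_video, "summary": caption_summary}
```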

Stage I: Image Pre-Training

The first stage in the three-stage pipeline implemented in the Stable Video Diffusion Model is image pretraining. To achieve this, the initial Stable Video Diffusion framework is grounded on a pretrained image diffusion model, namely Stable Diffusion 2.1, which equips it with stronger visual representations.
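A minimal sketch of this kind of initialization is shown below: spatial weights whose names and shapes match are copied from a pretrained 2D checkpoint, while newly added (temporal) layers keep their fresh initialization. The checkpoint layout and key-matching rule are assumptions for illustration, not the official loading code.

```python
import torch

def init_from_image_model(video_unet, image_checkpoint_path):
    """Sketch of Stage I style initialization: reuse pretrained spatial weights
    from a 2D image diffusion model (e.g. Stable Diffusion 2.1) wherever the
    parameter name and shape match; leave everything else untouched."""
    image_state = torch.load(image_checkpoint_path, map_location="cpu")
    video_state = video_unet.state_dict()
    copied = 0
    for name, weight in image_state.items():
        if name in video_state and video_state[name].shape == weight.shape:
            video_state[name] = weight   # reuse the pretrained spatial weight
            copied += 1
    video_unet.load_state_dict(video_state)
    return copied                        # number of parameters initialized from the image model
```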

Stage II: Video Pre-Training

The second stage is video pretraining, and it builds on the finding that the use of data curation in multimodal generative image models often leads to better results and enhanced efficiency, along with powerful discriminative image generation. However, owing to the lack of similarly powerful off-the-shelf representations for filtering out unwanted samples in generative video models, the Stable Video Diffusion Model relies on human preferences as input signals for the creation of an appropriate dataset used for pretraining the framework.

To be more specific, the framework uses different methods to curate subsets of the video data and considers the ranking of latent video diffusion (LVD) models trained on these subsets. The Stable Video Diffusion framework finds that using curated datasets for training boosts the performance of the framework, and of diffusion models in general, and that the data curation strategy also scales to larger, more relevant, and highly practical datasets. The following figure demonstrates the positive effect of pretraining the framework on a curated dataset, which helps boost the overall performance of video pretraining on smaller datasets.
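As an illustration of threshold-based curation, the sketch below keeps only clips whose precomputed annotation scores pass simple cutoffs. The score names (motion magnitude, detected text coverage, aesthetic score) and the threshold values are assumptions chosen for clarity; the actual curation ranks candidate filtering choices via human preference studies.

```python
def curate_subset(clips, flow_min=2.0, text_max=0.2, aesthetic_min=4.5):
    """Illustrative filtering sketch: keep clips whose annotation scores pass
    simple thresholds. Score names and cutoff values are placeholders."""
    keep = []
    for clip in clips:
        if (clip["flow_score"] >= flow_min              # drop near-static clips
                and clip["text_coverage"] <= text_max   # drop text-heavy clips
                and clip["aesthetic_score"] >= aesthetic_min):
            keep.append(clip)
    return keep
```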

Stage III: High-Quality Fine-Tuning

Up to Stage II, the Stable Video Diffusion framework focuses on improving performance prior to video pretraining; in the third stage, the emphasis shifts to optimizing, or further boosting, performance through high-quality video fine-tuning, and to how the transition from Stage II to Stage III is achieved. In Stage III, the framework draws on training techniques borrowed from latent image diffusion models and increases the resolution of the training examples. To analyze the effectiveness of this approach, the framework compares it against three identical models that differ only in their initialization. The first model has its weights initialized from the image model and skips the video pretraining process, whereas the remaining two models are initialized with weights borrowed from other latent video models.
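The comparison can be summarized as three runs that share one high-quality fine-tuning recipe and differ only in initialization, roughly as sketched below. The labels, the curated-versus-uncurated split of the two video initializations, the resolution, and the helper callables are illustrative placeholders rather than the exact experimental setup.

```python
# Sketch of the Stage III comparison: same fine-tuning recipe, three different
# initializations. All names and values below are placeholders.
INITIALIZATIONS = {
    "image_only":      "start from the image model and skip video pretraining",
    "video_uncurated": "weights of a latent video model pretrained on uncurated clips",
    "video_curated":   "weights of a latent video model pretrained on curated clips",
}

def finetune_high_quality(init_name, load_weights, train):
    model = load_weights(init_name)  # one of the three initializations above
    # fine-tune on a small, high-quality video set at increased resolution
    return train(model, dataset="high_quality_clips", resolution=(576, 1024))
```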

Results and Findings

It is time to take a look at how the Stable Video Diffusion framework performs on real-world tasks and how it compares against current state-of-the-art frameworks. The Stable Video Diffusion framework first uses the optimal data approach to train a base model, and then performs fine-tuning to generate several state-of-the-art models, each of which performs a specific task.

The above picture shows high-resolution image-to-video samples generated by the framework, whereas the following figure demonstrates the framework's ability to generate high-quality text-to-video samples.

Pre-Trained Base Model

As discussed earlier, the Stable Video Diffusion model is built on the Stable Diffusion 2.1 framework, and on the basis of recent findings, it was crucial for the developers to adapt the noise schedule, increasing the noise when training image diffusion models in order to obtain images at higher resolution. Thanks to this approach, the Stable Video Diffusion base model learns powerful motion representations and, in the process, outperforms baseline models for text-to-video generation in a zero-shot setting; the results are displayed in the following table.
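One common way to bias training toward higher noise levels, in the spirit of the schedule adaptation described above, is to draw noise levels log-normally and raise the mean of that distribution. The sketch below shows this idea; the parameter values are illustrative, not the values used by the framework.

```python
import torch

def sample_training_sigmas(batch_size, p_mean=1.0, p_std=1.6):
    """Sketch of a shifted log-normal noise-level distribution: raising `p_mean`
    biases training toward noisier samples, which tends to help when training
    at higher resolutions. Values here are placeholders."""
    return torch.exp(torch.randn(batch_size) * p_std + p_mean)
```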

Frame Interpolation and Multi-View Generation

The Stable Video Diffusion framework finetunes the image-to-video model on multi-view datasets to obtain multiple novel views of an object; this model is known as SVD-MV, or the Stable Video Diffusion Multi-View model. The original SVD model is finetuned with the help of two datasets in such a way that the framework takes a single image as input and returns a sequence of multi-view images as its output.

As can be seen in the following images, the Stable Video Diffusion Multi-View framework delivers performance comparable to the state-of-the-art Scratch Multi-View framework, and the results are a clear demonstration of SVD-MV's ability to take advantage of the learnings obtained from the original SVD framework for multi-view image generation. Moreover, the results also indicate that running the model for a relatively small number of iterations helps deliver optimal results, as is the case with most models fine-tuned from the SVD framework.

In the above figure, the metrics are indicated on the left-hand side, and as can be seen, the Stable Video Diffusion Multi-View framework outperforms the Scratch-MV and SD2.1 Multi-View frameworks by a fair margin. The second image demonstrates the effect of the number of training iterations on the overall performance of the framework in terms of CLIP Score, and the SVD-MV framework delivers sustained results.

Final Thoughts

In this article, we have talked about Stable Video Diffusion, a latent video diffusion model capable of generating high-resolution, state-of-the-art image-to-video and text-to-video content. The Stable Video Diffusion Model follows a strategy not previously implemented by any generative video model: it relies on latent video diffusion baselines with a fixed architecture and a fixed training strategy, followed by an assessment of the effect of curating the data.

We have talked about how latent diffusion models trained for synthesizing 2D images have improved the capabilities and efficiency of generative video models through the addition of temporal layers and fine-tuning on small datasets of high-quality videos. To collect the pretraining data, the framework conducts a scaling study, follows systematic data collection practices, and ultimately proposes a method to curate a large amount of video data, converting noisy videos into input data suitable for generative video models.

Moreover, the Stable Video Diffusion framework employs three distinct video-model training stages that are analyzed independently to evaluate their impact on the framework's performance. The framework ultimately outputs a video representation powerful enough that the models can be finetuned for optimal video synthesis, with results comparable to state-of-the-art video generation models already in use.
