State of open video generation models in Diffusers

OpenAI’s Sora demo marked a striking advance in AI-generated video last year and gave us a glimpse of the potential capabilities of video generation models. The impact was immediate, and since that demo the video generation space has become increasingly competitive, with major players and startups producing their own highly capable models such as Google’s Veo 2, MiniMax’s Hailuo, Runway’s Gen3 Alpha, Kling, Pika, and Luma Labs’ Dream Machine.

Open-source has also had its own surge of video generation models with CogVideoX, Mochi-1, Hunyuan, Allegro, and LTX Video. Is the video community having its “Stable Diffusion moment”?

This post will provide a brief overview of the state of video generation models, where we are with respect to open video generation models, and how the Diffusers team is planning to support their adoption at scale.

Specifically, we will discuss:

  • Capabilities and limitations of video generation models
  • Why video generation is difficult
  • Open video generation models
  • Video generation with Diffusers
    • Inference and optimizations
    • Fine-tuning
  • Looking ahead



Today’s Video Generation Models and their Limitations

These are today’s most popular video models for AI-generated content creation.

Limitations:

  • High Resource Requirements: Producing high-quality videos requires large pretrained models, which are computationally expensive to develop and deploy. These costs arise from dataset collection, hardware requirements, extensive training iterations, and experimentation, and they make it hard to justify producing open-source and freely available models. Even though we don’t have a detailed technical report that sheds light on the training resources used, this post provides some reasonable estimates.
  • Generalization: Several open models suffer from limited generalization capabilities and underperform user expectations. Models may require prompting in a certain way, or LLM-like prompts, or fail to generalize to out-of-distribution data, which are hurdles for widespread user adoption. For instance, models like LTX-Video often need to be prompted in a very detailed and specific way to obtain good-quality generations.
  • Latency: The high computational and memory demands of video generation lead to significant generation latency. For local usage, this can be a roadblock. Most recent open video models are inaccessible on consumer hardware without extensive memory optimizations and quantization approaches that affect both inference latency and the quality of the generated videos.



Why is Video Generation Hard?

There are several aspects we’d like to see and control in videos:

  • Adherence to Input Conditions (such as a text prompt, a starting image, etc.)
  • Realism
  • Aesthetics
  • Motion Dynamics
  • Spatio-Temporal Consistency and Coherence
  • FPS
  • Duration

With image generation models, we usually only care about the first three aspects. However, for video generation we also have to consider motion quality, coherence, and consistency over time, potentially with multiple subjects. Finding the right balance between good data, the right inductive priors, and training methodologies to satisfy these additional requirements has proven to be more challenging than for other modalities.



Open Video Generation Models

(Diagram: core components of an open text-to-video generation model)

Text-to-video generation models have components similar to their text-to-image counterparts:

  • Text encoders for providing rich representations of the input text prompt
  • A denoising network
  • An encoder and decoder to convert between pixel and latent space
  • A non-parametric scheduler responsible for managing all the timestep-related calculations and the denoising step

The latest generation of video models shares a core feature: the denoising network processes 3D video tokens that capture both spatial and temporal information. The video encoder-decoder system, responsible for producing and decoding these tokens, employs both spatial and temporal compression. While decoding the latents typically demands the most memory, these models offer frame-by-frame decoding options to reduce memory usage.
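
As a rough illustration of the sizes involved, the sketch below computes the latent shape for a 121-frame, 512 x 768 video, assuming an 8x spatial and 4x temporal compression ratio with 16 latent channels (illustrative values; the exact ratios and channel counts vary from model to model).

# Rough latent-shape arithmetic for a video VAE, assuming 8x spatial and 4x temporal
# compression with 16 latent channels (illustrative values only).
num_frames, height, width = 121, 512, 768
spatial_ratio, temporal_ratio, latent_channels = 8, 4, 16

latent_frames = (num_frames - 1) // temporal_ratio + 1  # causal VAEs keep the first frame separate
latent_height = height // spatial_ratio
latent_width = width // spatial_ratio

print((latent_channels, latent_frames, latent_height, latent_width))
# (16, 31, 64, 96) -> the 3D "video tokens" the denoising network operates on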

Text conditioning is incorporated through either joint attention (introduced in Stable Diffusion 3) or cross-attention. T5 has emerged as the preferred text encoder across most models, with HunyuanVideo being an exception in its use of both CLIP-L and LLaMa 3.

The denoising network itself builds on the DiT architecture developed by William Peebles and Saining Xie, while incorporating various design elements from PixArt.



Video Generation with Diffusers


There are three broad categories of generation possible when working with video models:

  1. Text to Video
  2. Image or Image Control condition + Text to Video
  3. Video or Video Control condition + Text to Video

Going from a text prompt (and other conditions) to a video takes just a few lines of code. Below, we show how to do text-to-video generation with the LTX-Video model from Lightricks.

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

prompt = "A girl with long brown hair and lightweight skin smiles at one other woman with long blonde hair. The lady with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the lady with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
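
For the second category, many of these models ship a companion image-to-video pipeline that accepts a starting image alongside the text prompt. Below is a minimal sketch using LTX-Video’s image-to-video variant; the input image path is a placeholder for an image of your choice.

import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

# Placeholder: use any starting image of your choice (local path or URL)
image = load_image("path/or/url/to/your/image.png")
prompt = "A woman looks into the camera and smiles warmly as the wind gently moves her hair"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)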



Memory requirements

The memory requirements for any model can be computed by adding the following:

  • Memory required for weights
  • Maximum memory required for storing intermediate activation states

Memory required by the weights can be lowered via quantization, downcasting to lower dtypes, or offloading to the CPU. Memory required for activation states can also be lowered, but that is a more involved process and out of the scope of this blog.
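
As a back-of-the-envelope example, the memory needed for the weights alone can be estimated from the parameter count and the storage dtype. The sketch below assumes an illustrative 13B-parameter denoiser; the text encoders and VAE loaded alongside it add to the total.

num_params = 13e9  # illustrative parameter count for a large video transformer
bytes_per_param = {"fp32": 4, "bf16": 2, "fp8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {num_params * nbytes / 1024**3:.1f} GB")
# fp32: 48.4 GB, bf16: 24.2 GB, fp8: 12.1 GB, int4: 6.1 GB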

It is possible to run any video model with extremely low memory, but it comes at the cost of inference time. If an optimization technique takes more time than a user considers reasonable, it is not feasible to run inference with it. Diffusers provides many such optimizations that are opt-in and can be chained together.

In the table below, we provide the memory requirements for three popular video generation models with reasonable defaults:

Model Name Memory (GB)
HunyuanVideo 60.09
CogVideoX (1.5 5B) 36.51
LTX-Video 17.75

These numbers were obtained with the following settings on an 80GB A100 machine (full script here); a minimal sketch of how peak memory can be measured for such a run is shown after the list:

  • torch.bfloat16 dtype
  • num_frames: 121, height: 512, width: 768
  • max_sequence_length: 128
  • num_inference_steps: 50
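
For reference, here is a sketch of how peak memory can be measured for such a run with LTX-Video; the full benchmarking script linked above contains the exact setup used for the table.

import torch
from diffusers import LTXPipeline

torch.cuda.reset_peak_memory_stats()

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="A cat walks on the grass, realistic",
    height=512,
    width=768,
    num_frames=121,
    max_sequence_length=128,
    num_inference_steps=50,
).frames[0]

# Peak memory allocated by tensors during the run, in GB
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")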

These requirements are quite staggering and make these models difficult to run on consumer hardware. With Diffusers, users can opt in to different optimizations to reduce memory usage.
The following table provides the memory requirements for HunyuanVideo with various optimizations enabled that make minimal compromises on quality and inference time.

We used HunyuanVideo for this study, as it is sufficiently large to show the benefits of the optimizations in a progressive manner.

Setting Memory Time
BF16 Base 60.10 GB 863s
BF16 + CPU offloading 28.87 GB 917s
BF16 + VAE tiling 43.58 GB 870s
8-bit BnB 49.90 GB 983s
8-bit BnB + CPU offloading* 35.66 GB 1041s
8-bit BnB + VAE tiling 36.92 GB 997s
8-bit BnB + CPU offloading + VAE tiling 26.18 GB 1260s
4-bit BnB 42.96 GB 867s
4-bit BnB + CPU offloading 21.99 GB 953s
4-bit BnB + VAE tiling 26.42 GB 889s
4-bit BnB + CPU offloading + VAE tiling 14.15 GB 995s
FP8 Upcasting 51.70 GB 856s
FP8 Upcasting + CPU offloading 21.99 GB 983s
FP8 Upcasting + VAE tiling 35.17 GB 867s
FP8 Upcasting + CPU offloading + VAE tiling 20.44 GB 1013s
BF16 + Group offload (blocks=8) + VAE tiling 15.67 GB 925s
BF16 + Group offload (blocks=1) + VAE tiling 7.72 GB 881s
BF16 + Group offload (leaf) + VAE tiling 6.66 GB 887s
FP8 Upcasting + Group offload (leaf) + VAE tiling 6.56 GB^ 885s

*8-bit models in bitsandbytes cannot be moved to the CPU from the GPU, unlike the 4-bit ones.

^The memory usage does not reduce further because the peak utilization comes from computing attention and feed-forward layers. Using Flash Attention and an optimized feed-forward can help lower this requirement to ~5 GB.

We used the same settings as above to obtain these numbers. Also note that, due to numerical precision loss, quantization can impact the quality of the outputs, and these effects are more prominent in videos than in images.

We provide more details about these optimizations in the sections below, along with some code snippets. But if you’re already feeling excited,
we encourage you to check out our guide.



Suite of optimizations

Video generation can be quite challenging on resource-constrained devices and time-consuming even on beefier GPUs. Diffusers provides a suite of utilities that help optimize both the runtime and the memory consumption of these models. These optimizations fall under the following categories (a short sketch of chaining a few of them together follows the list):

  • Quantization: The model weights are quantized to lower-precision data types, which lowers the VRAM requirements. Diffusers supports three different quantization backends as of today: bitsandbytes, torchao, and GGUF.
  • Offloading: Different layers of a model can be loaded onto the GPU on-the-fly when required for computation and then offloaded back to the CPU. This saves a significant amount of memory during inference. Offloading is supported through enable_model_cpu_offload() and enable_sequential_cpu_offload(). Refer here for more details.
  • Chunked Inference: By splitting inference across non-embedding dimensions of the input latent tensors, the memory overhead from intermediate activation states can be reduced. This technique is commonly used in encoder/decoder slicing and tiling. Chunked inference in Diffusers is supported through feed-forward chunking, decoder tiling and slicing, and split attention inference.
  • Re-use of Attention & MLP states: The computation of certain denoising steps can be skipped and past states can be re-used, if certain conditions are satisfied for particular algorithms, to speed up the generation process with minimal quality loss.
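
As a minimal sketch of chaining a few of these opt-in optimizations together (model CPU offloading plus VAE slicing and tiling, shown here with CogVideoX; the same calls apply to the other video pipelines in Diffusers):

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Opt-in optimizations can be chained together
pipe.enable_model_cpu_offload()  # move sub-models to the GPU only when they are needed
pipe.vae.enable_slicing()        # decode latents one batch element at a time
pipe.vae.enable_tiling()         # decode latents tile by tile to cap activation memory

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest, photorealistic",
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)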

Below, we provide a list of some advanced optimization techniques that are currently work-in-progress and will be merged soon:

  • Layerwise Casting: Lets users store the parameters in a lower precision, such as torch.float8_e4m3fn, and run computation in a higher precision, such as torch.bfloat16.
  • Group offloading: Lets users group internal block-level or leaf-level modules to perform offloading. This is useful because only the parts of the model required for computation are loaded onto the GPU. Additionally, we provide support for overlapping data transfer with computation using CUDA streams, which reduces most of the additional overhead that comes from repeated onloading/offloading of layers (see the sketch after this list).
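
Since group offloading was still work-in-progress at the time of writing, the exact API may differ once merged; the sketch below assumes an apply_group_offloading helper exposed via diffusers.hooks.

import torch
from diffusers import HunyuanVideoTransformer3DModel
from diffusers.hooks import apply_group_offloading

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Offload groups of transformer blocks to the CPU and bring them onto the GPU only
# when needed; use_stream overlaps the transfers with computation via CUDA streams.
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
)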

Below is an example of applying 4-bit quantization, VAE tiling, CPU offloading, and layerwise casting to HunyuanVideo to reduce the required VRAM to just ~6.5 GB for 121 x 512 x 768 resolution videos. To the best of our knowledge, this is the lowest memory requirement to run HunyuanVideo among all available implementations without compromising speed.

Install Diffusers from source to try out these features! Some implementations are agnostic to the model being used and can easily be applied to other models – be sure to check them out!

pip install git+https://github.com/huggingface/diffusers.git

import torch
from diffusers import (
    BitsAndBytesConfig,
    HunyuanVideoTransformer3DModel,
    HunyuanVideoPipeline,
)
from diffusers.utils import export_to_video
from diffusers.hooks import apply_layerwise_casting
from transformers import LlamaModel

model_id = "hunyuanvideo-community/HunyuanVideo"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
)

text_encoder = LlamaModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.float16)
apply_layerwise_casting(text_encoder, storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.float16)


transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch.float16
)


pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(output, "output.mp4", fps=15)

We can also apply optimizations during training. The two most well-known techniques applied to video models include:

  • Timestep distillation: This involves teaching the model to denoise the noisy latents faster, in fewer inference steps, in a recursive fashion. For example, if a model takes 32 steps to generate good videos, it can be augmented to try to predict the final outputs in just 16 steps, 8 steps, or even 2 steps! This may be accompanied by a loss in quality depending on how few steps are used. Some examples of timestep-distilled models include Flux.1-Schnell and FastHunyuan.
  • Guidance distillation: Classifier-Free Guidance is a technique widely used in diffusion models that enhances generation quality. This, however, doubles the generation time because it involves two full forward passes through the models per inference step, followed by an interpolation step. By teaching models to predict the output of both forward passes and the interpolation at the cost of a single forward pass, this technique can enable much faster generation (a simplified sketch of such a training step follows this list). Some examples of guidance-distilled models include HunyuanVideo and Flux.1-Dev.
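
To make the idea concrete, here is a simplified, schematic training step for guidance distillation; the model and tensor arguments are placeholders, and real recipes additionally condition the student on the guidance scale and embed this step in the full diffusion training loop.

import torch
import torch.nn.functional as F

def guidance_distillation_step(student, teacher, noisy_latents, timestep, text_emb, null_emb, guidance_scale):
    # Teacher runs classifier-free guidance: two forward passes plus an interpolation
    with torch.no_grad():
        cond = teacher(noisy_latents, timestep, text_emb)
        uncond = teacher(noisy_latents, timestep, null_emb)
        target = uncond + guidance_scale * (cond - uncond)

    # The student learns to reproduce the guided prediction in a single forward pass
    pred = student(noisy_latents, timestep, text_emb)
    return F.mse_loss(pred, target)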

We refer readers to this guide for a detailed take on video generation and the current possibilities in Diffusers.



Fine-tuning

We’ve created finetrainers – a repository that allows you to easily fine-tune the latest generation of open video models. For example, here is how you would fine-tune CogVideoX with LoRA:


huggingface-cli download \
  --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
  --local-dir video-dataset-disney


accelerate launch train.py \
  --model_name="cogvideox" --pretrained_model_name_or_path="THUDM/CogVideoX1.5-5B" \
  --data_root="video-dataset-disney" \
  --video_column="videos.txt" \
  --caption_column="prompt.txt" \
  --training_type="lora" \
  --seed=42 \
  --mixed_precision="bf16" \
  --batch_size=1 \
  --train_steps=1200 \
  --rank=128 \
  --lora_alpha=128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 500 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling \
  --optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0



We used finetrainers to emulate the “dissolve” effect and obtained promising results. Check out the model for additional details.

Prompt: PIKA_DISSOLVE A slender glass vase, brimming with tiny white pebbles, stands centered on a polished ebony dais. Unexpectedly, the glass begins to dissolve from the edges inward. Wisps of translucent dust swirl upward in an elegant spiral, illuminating each pebble as they drop onto the dais. The gently drifting dust eventually settles, leaving only the scattered stones and faint traces of shimmering powder on the stage.
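
Once training finishes, the resulting LoRA can be loaded back into the corresponding pipeline for inference. Below is a minimal sketch; the LoRA path is a placeholder for wherever your finetrainers output was saved.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16).to("cuda")

# Placeholder path: point this at the LoRA weights produced by the finetrainers run above
pipe.load_lora_weights("path/to/your/lora", adapter_name="dissolve")

video = pipe(
    prompt="PIKA_DISSOLVE A slender glass vase, brimming with tiny white pebbles, dissolves into shimmering dust",
    num_inference_steps=50,
).frames[0]
export_to_video(video, "dissolve.mp4", fps=16)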



Looking ahead

We anticipate significant advancements in video generation models throughout 2025, with major improvements in both output quality and model capabilities.
Our goal is to make using these models easy and accessible. We will continue to grow the finetrainers library, and we’re planning to add many more features: Control LoRAs, distillation algorithms, ControlNets, adapters, and more. As always, community contributions are welcome 🤗

We remain committed to partnering with model publishers, researchers, and community members to ensure the latest innovations in video generation are within reach of everyone.



Resources

We cited a number of links throughout the post. To make sure you don’t miss out on the most important ones, we provide a list below:

Acknowledgements: Thanks to Chunte for creating the beautiful thumbnail for this post. Thanks to Vaibhav and Pedro for their helpful feedback.


