Accelerating Diffusion Models with an Open, Plug-and-Play Offering

Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across various conditional generation tasks.

Despite these successes, sampling inefficiency remains a fundamental bottleneck. Standard diffusion models require tens to hundreds of iterative denoising steps, resulting in high inference latency and substantial computational cost. This limits practical deployment in interactive applications, edge devices, and large-scale production systems.
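To make that cost concrete, the sketch below shows a generic Euler-style reverse-diffusion sampler (an illustrative sketch assuming an EDM-style denoiser parameterization, not FastGen code). Every denoising step is one full network forward pass, so a 50-step sampler pays roughly 50x the latency of a one-step generator.

```python
import torch

@torch.no_grad()
def euler_sample(denoiser, sigmas, shape):
    """Generic Euler sampler for a diffusion model (illustrative sketch).
    `sigmas` is a decreasing noise schedule ending at 0; each loop iteration
    costs one full forward pass of the network."""
    x = torch.randn(shape) * sigmas[0]
    for i in range(len(sigmas) - 1):
        denoised = denoiser(x, sigmas[i])            # one network evaluation
        d = (x - denoised) / sigmas[i]               # probability-flow ODE direction
        x = x + d * (sigmas[i + 1] - sigmas[i])      # step to the next noise level
    return x

# Toy usage with a dummy denoiser and 50 noise levels, just to show the loop cost.
sigmas = torch.linspace(80.0, 0.0, 51)
sample = euler_sample(lambda x, s: 0.0 * x, sigmas, shape=(1, 3, 32, 32))
```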

Video generation faces an especially critical challenge. Open source models such as NVIDIA Cosmos, along with commercial text-to-video (T2V) systems, have shown remarkable text-to-video capabilities. However, video diffusion models are orders of magnitude more computationally demanding due to the temporal dimension. Generating a single video can take minutes to hours, making real-time video generation, interactive editing, and world modeling for agent training difficult.

Accelerating diffusion sampling without sacrificing quality and diversity has emerged as a key open challenge, with video generation being one of the most demanding and impactful applications to solve.

This blog introduces NVIDIA FastGen, an open source library that unifies state-of-the-art diffusion distillation techniques for accelerating many-step diffusion models into one-step or few-step generators. We review trajectory-based and distribution-based distillation approaches, present reproducible benchmarks showing 10x to 100x sampling speedups with preserved quality, and showcase FastGen's scalability to large video models of up to 14B parameters. We also highlight applications to interactive world modeling, where causal distillation enables real-time video generation.

What are the key approaches to acceleration?

A growing body of research has explored diffusion distillation, which aims to compress long denoising trajectories into a small number of inference steps. Existing approaches broadly fall into two categories:

  1. Trajectory-based distillation, including progressive distillation and consistency models such as OpenAI's iCT and sCM, as well as MIT and CMU's MeanFlow, directly regresses the teacher's denoising trajectories.
  2. Distribution-based distillation, such as Stability AI's LADD and MIT and Adobe's DMD, aligns the student and teacher distributions using adversarial or variational objectives.

These methods have successfully reduced diffusion sampling to one or two steps in the image domain. However, each family comes with notable tradeoffs. Trajectory-based methods often suffer from training instability, slow convergence, and scalability challenges, while distribution-based methods tend to be memory-intensive, sensitive to initialization, and prone to mode collapse. Moreover, none of these approaches alone consistently achieves one-step generation with high fidelity for complex data such as real-world videos.
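To make the distinction concrete, the highly simplified schematic below (not FastGen code; the function signatures and the re-noising step are illustrative assumptions) contrasts the two objectives: trajectory-based methods regress the student onto an adjacent point of the teacher's denoising trajectory, while distribution-based methods such as DMD update the generator with the difference between a "real" (teacher) score and a "fake" score fitted to the generator's own samples.

```python
import torch
import torch.nn.functional as F

def trajectory_loss(student, student_ema, x_t, x_t_prev, t, t_prev):
    """Consistency-style objective: the prediction from the noisier point x_t
    should match the (frozen, EMA) prediction from the adjacent, less noisy
    point x_t_prev on the same teacher ODE trajectory."""
    with torch.no_grad():
        target = student_ema(x_t_prev, t_prev)
    return F.mse_loss(student(x_t, t), target)

def distribution_loss(generator, real_score, fake_score, z, t, sigma_t):
    """DMD-style objective: re-noise the generator's one-step sample and nudge
    it along the difference between the fake and teacher ('real') scores."""
    x0 = generator(z)
    x_t = x0 + sigma_t * torch.randn_like(x0)        # simple EDM-style re-noising
    with torch.no_grad():
        grad = fake_score(x_t, t) - real_score(x_t, t)
    # Surrogate loss whose gradient with respect to x_t equals `grad`.
    return (x_t * grad).sum()
```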

This motivates the need for a unified and extensible framework that can integrate, compare, and evolve diffusion distillation methods toward stable training, high-quality generation, and scalability to large models and complex data.

What FastGen offers

FastGen is a new, open source, versatile library that brings together state-of-the-art diffusion distillation methods under a generic, plug-and-play interface.

Unified and versatile interface

FastGen provides a unified abstraction for accelerating diffusion models across diverse tasks. Users provide their diffusion model (and, optionally, training data) and choose an appropriate distillation method. FastGen then handles the training and inference pipeline, converting the original model into a one-step or few-step generator with minimal engineering overhead.
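In spirit, the workflow looks like the sketch below. Note that this is pseudocode: the entry point, argument names, and model loader are hypothetical stand-ins rather than the actual FastGen API; consult the repository for the real interface.

```python
# Hypothetical, illustrative pseudocode -- not the actual FastGen API.
teacher = load_my_diffusion_model("my-t2v-checkpoint.pt")   # your own model loader

student = distill(                 # hypothetical entry point
    teacher=teacher,
    method="dmd2",                 # or "ect", "tcm", "meanflow", ...
    data="path/to/training_clips", # optional training data
    num_student_steps=1,           # target one-step or few-step generation
)

video = student.generate(prompt="a dog running on a beach")  # few-step inference
```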

Figure 1. The FastGen distillation pipeline: users plug a multi-step diffusion model into FastGen, which converts it into a one-step generator while largely preserving generation quality.

Reproducible benchmarks and fair comparisons

FastGen reproduces all supported distillation methods on standard image generation benchmarks. Historically, diffusion distillation methods have been proposed and evaluated in isolated codebases with different training recipes, making fair comparisons difficult. By unifying implementations and hyperparameter choices, FastGen enables transparent benchmarking and serves as a common evaluation platform for the few-step diffusion community.

Table 1 below presents a comprehensive comparison of distillation method performance on CIFAR-10 and ImageNet-64 benchmarks, demonstrating FastGen's reproducibility. The table shows one-step image generation quality achieved by FastGen's unified implementations alongside the original results reported in their respective papers (shown in parentheses). Each method is categorized by its distillation approach: trajectory-based methods that optimize along the diffusion trajectory (ECT, TCM, sCT, sCD, MeanFlow) and distribution-based methods that directly match generated distributions (LADD, DMD2, f-distill).

Acceleration method | CIFAR-10 FID | ImageNet-64 FID
Trajectory-based distillation | |
ECT (Geng et al., 2024) | 2.92 (3.60) | 4.05 (4.05)
TCM (Lee et al., 2025) | 2.70 (2.46) | 2.23 (2.20)
sCT (Lu et al., 2025) | 3.23 (2.85) | -
sCD (Lu et al., 2025) | 3.23 (3.66) | -
MeanFlow (Geng et al., 2025) | 2.82 (2.92) | -
Distribution-based distillation | |
LADD (Sauer et al., 2024) | - | -
DMD2 (Yin et al., 2024) | 1.99 (2.13)* | 1.12 (1.28)
f-distill (Xu et al., 2025) | 1.85 (1.92)* | 1.11 (1.16)
Table 1. One-step image generation quality measured by Fréchet inception distance (FID, lower is better), reproduced with the FastGen library on standard image generation benchmarks. The FID numbers in parentheses are from the original papers (* indicates conditional CIFAR-10).
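For reference, FID compares Inception-feature statistics of generated and real images. The snippet below shows one common way to compute it with torchmetrics (an illustrative choice, not necessarily the evaluation stack used for Table 1), with random tensors standing in for the real and generated batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Images are uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)  # e.g., CIFAR-10 images
fake = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)  # one-step student samples

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())  # lower is better; reliable numbers need ~50K images per side
```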

Beyond vision tasks

While we demonstrate FastGen on vision tasks in this blog, the library is generic enough to accelerate any diffusion model across different domains. One area of particular interest is AI-for-science applications, where sample quality is often as important as sample diversity.

By decoupling distillation methods from network definitions, FastGen makes it straightforward to add new models in a plug-and-play fashion. For example, we have successfully distilled the NVIDIA weather downscaling model, Corrector Diffusion (CorrDiff), in NVIDIA PhysicsNeMo using ECT for one-step km-scale atmospheric downscaling.

As visualized in Figure 2 below, the distilled model matches the predictions of CorrDiff (in terms of skill and spread) while enabling 23x faster inference.

Figure 2. Observations of the eastward wind at 10 meters during Typhoon Chanthu (top left), four predictions of the distilled one-step ECT model (top right), and predictions of the 18-step CorrDiff teacher model (bottom right); the one-step predictions are visually almost indistinguishable from the 18-step predictions.

Scalable and efficient infrastructure

FastGen also provides a highly optimized training infrastructure for scaling diffusion distillation to large models. Supported techniques include:

  • Fully Sharded Data Parallel v2 (FSDP2)
  • Automatic mixed precision (AMP)
  • Context parallelism (CP)
  • Flex attention
  • Efficient KV cache management
  • Adaptive finite-difference Jacobian-vector product (JVP) estimation (see the sketch after this list)
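On the last item: several trajectory-based objectives (for example, sCM-style continuous-time consistency training) need a Jacobian-vector product of the denoiser along the trajectory. When exact JVPs are expensive or incompatible with fused attention kernels, they can be approximated with a finite difference. The sketch below shows the basic idea with a simple norm-based step size; it is an illustrative approximation, not FastGen's exact adaptive heuristic.

```python
import torch

def finite_difference_jvp(f, x, v, rel_step=1e-3):
    """Approximate the Jacobian-vector product J_f(x) @ v with a forward
    difference: (f(x + eps * v) - f(x)) / eps. The step size is scaled to the
    relative magnitudes of x and v (a simple stand-in for an adaptive rule)."""
    eps = rel_step * x.norm() / (v.norm() + 1e-12)
    with torch.no_grad():
        return (f(x + eps * v) - f(x)) / eps
```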

With these optimizations, FastGen can distill large-scale models efficiently. For example, we successfully distilled a 14B Wan2.1 T2V model into a few-step generator using DMD2, achieving convergence in 16 hours on 64 NVIDIA H100 GPUs.

Figure 3 shows a visual comparison of the 50-step teacher and the two-step distilled student, using the improved DMD2 method to distill Wan2.1-T2V-14B. Although the student is 50x faster than the teacher in sampling, the student's generation quality closely matches the teacher's.

Figure 3. Visual comparison of the 50-step teacher with CFG=6 (NFE=100, left) and the two-step distilled student (NFE=2, right), using the improved DMD2 method to distill Wan2.1-T2V-14B; despite reducing NFE by 50x, the student's video is of similar visual quality. NFE denotes the number of function evaluations during generation.

FastGen for interactive world modeling

Interactive world models aim to simulate environment dynamics and respond coherently to user actions or agent interventions in real time. They require:

  • High sampling efficiency
  • Long-horizon temporal consistency
  • Action-conditioned controllability

Video diffusion models provide a powerful foundation for world modeling due to their ability to capture rich visual dynamics, but their multi-step sampling process and passive formulation prevent real-time interaction.

To address this, recent work has explored causal distillation, which transforms a bidirectional video diffusion model into a few-step, block-wise autoregressive model. This autoregressive structure enables real-time interaction and has become a promising foundation for interactive world models.

FastGen implements both training and inference recipes for multiple causal distillation methods, including CausVid and Self-Forcing, whose default formulations are primarily distribution-based.
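Schematically, inference with such a causally distilled student proceeds block by block: each block of frames is denoised in a few steps while attending to previously generated frames through a KV cache. The sketch below is illustrative only; the student's interface, latent shapes, and cache handling are assumptions, and the actual recipes differ across methods.

```python
import torch

@torch.no_grad()
def causal_rollout(student, prompt_emb, num_blocks, frames_per_block, num_steps=2):
    """Illustrative block-wise autoregressive rollout with a KV cache
    (hypothetical interface, not FastGen's actual inference API)."""
    blocks, kv_cache = [], None
    for _ in range(num_blocks):
        # Latent noise for the current block of frames (shape is illustrative).
        x = torch.randn(frames_per_block, 16, 60, 104)
        for step in range(num_steps):
            # Few-step denoising of the current block, conditioned on the prompt
            # and on cached keys/values from earlier blocks.
            x = student.denoise(x, step, prompt_emb, kv_cache)
        # Cache the finished block so later blocks can attend to it.
        kv_cache = student.update_cache(x, kv_cache)
        blocks.append(x)
    return torch.cat(blocks, dim=0)   # full video in temporal order
```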

Trajectory-based distillation has not yet been widely applied in causal distillation due to performance degradation and trajectory misalignment between bidirectional teacher models and autoregressive students. FastGen addresses these challenges in two ways:

  1. Warm-starting causal distillation: Trajectory-based methods can be used to initialize student models before applying distribution-based objectives.
  2. Causal SFT via diffusion forcing: FastGen provides a causal supervised fine-tuning (SFT) recipe that first trains a many-step block-wise autoregressive model, which then serves as a new teacher for trajectory-based distillation.

These components enable hybrid distillation pipelines that combine the stability of trajectory-based methods with the flexibility of distribution-based objectives.
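A minimal sketch of such a hybrid recipe is shown below, with toy linear modules standing in for the causal student and the score networks. Everything here is a schematic assumption rather than FastGen's actual training loop: a trajectory-style warm start for the first phase, then a distribution-style update.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the skeleton runs; in practice these are the causal video
# student, its EMA copy, and the teacher ("real") and "fake" score networks.
student, student_ema = nn.Linear(8, 8), nn.Linear(8, 8)
real_score, fake_score = nn.Linear(8, 8), nn.Linear(8, 8)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
warmup_steps, total_steps = 100, 300

for step in range(total_steps):
    x_t, x_t_prev = torch.randn(4, 8), torch.randn(4, 8)   # placeholder batch
    if step < warmup_steps:
        # Phase 1: trajectory-based warm start (consistency-style regression).
        loss = F.mse_loss(student(x_t), student_ema(x_t_prev).detach())
    else:
        # Phase 2: distribution-based refinement (DMD-style score difference).
        # In practice the fake score is itself trained on student samples; omitted here.
        x0 = student(x_t)
        grad = (fake_score(x0) - real_score(x0)).detach()
        loss = (x0 * grad).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    # Keep the EMA warm-start target tracking the student.
    with torch.no_grad():
        for p_ema, p in zip(student_ema.parameters(), student.parameters()):
            p_ema.lerp_(p, 0.01)
```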

On the application side, FastGen supports a wide range of open source video diffusion models, including Wan2.1, Wan2.2, and NVIDIA Cosmos-Predict2.5, and provides end-to-end acceleration for multiple video synthesis scenarios:

  • Text-to-video (T2V)
  • Image-to-video (I2V)
  • Video-to-video (V2V)

Users can flexibly customize causal distillation pipelines, for example, scaling from 2B to 14B models, adding first-frame conditioning for I2V, or incorporating structural priors such as depth-guided driving videos for V2V tasks.

As a result, FastGen provides the necessary infrastructure for advancing interactive world models, enabling the fast, controllable, and temporally consistent generation needed to transform diffusion models from passive synthesizers into real-time interactive systems.

Get started

FastGen is designed to be more than a collection of distillation techniques; it is a unified research and engineering platform for accelerating diffusion models. By bringing together trajectory-based and distribution-based methods under a scalable and reproducible framework, FastGen lowers the barrier to experimenting with few-step diffusion models and enables fair benchmarking across approaches.

Check out FastGen today: plug in your own diffusion model, select a distillation approach, and watch a multi-step generator transform into a one-step performer. Whether you aim to accelerate visual synthesis, advance scientific discovery, or power interactive world models, FastGen offers the flexibility and reproducibility to move from idea to implementation in record time.


