Video foundation models such as Hunyuan and Wan 2.1, while powerful, do not offer users the kind of granular control that film and TV production (particularly VFX production) demands.
In professional visual effects studios, open-source models like these, together with earlier image-based (rather than video) models such as Stable Diffusion, Kandinsky and Flux, are typically used alongside a range of supporting tools that adapt their raw output to meet specific creative needs. When a director asks for a particular adjustment, you can't respond by saying that the model isn't precise enough to handle such requests.
Instead, an AI VFX team will use a range of traditional CGI and compositing techniques, allied with custom procedures and workflows developed over time, in order to push the boundaries of video synthesis a little further.
By analogy, a foundation video model is much like a default installation of a web browser such as Chrome: it does a lot out of the box, but if you want it to adapt to your needs, rather than vice versa, you are going to need some plugins.
Control Freaks
In the world of diffusion-based image synthesis, the most important such third-party system is ControlNet.
ControlNet is a technique for adding structured control to diffusion-based generative models, allowing users to guide image or video generation with additional inputs such as edge maps, depth maps, or pose information.
Instead of relying solely on text prompts, ControlNet introduces separate neural network branches, or adapters, that process these conditioning signals while preserving the base model's generative capabilities.
This allows fine-tuned outputs that adhere more closely to user specifications, making it particularly useful in applications where precise composition, structure, or motion control is required:

Source: https://arxiv.org/pdf/2302.05543
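As a rough illustration of how this adapter approach looks in practice with an image model (not FullDiT), the diffusers library attaches a ControlNet as a separate conditioning network alongside a frozen base pipeline. The checkpoints and the depth-map file below are illustrative choices, not anything mandated by the paper:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load a depth-conditioned ControlNet adapter alongside a Stable Diffusion 1.5 base model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The depth map acts as a structural constraint; the text prompt steers appearance.
depth_map = load_image("scene_depth.png")  # hypothetical pre-computed depth image
result = pipe(
    prompt="a misty forest road at dawn, cinematic lighting",
    image=depth_map,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("controlled_output.png")
```

The key point for what follows is that the ControlNet branch here is an external module bolted onto a base model that was never trained to expect it.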
However, adapter-based frameworks of this kind operate externally on a set of neural processes that are intensely internally-focused. These approaches have several drawbacks.
First, adapters are trained independently, resulting in branch conflicts when multiple adapters are combined, which can degrade generation quality.
Secondly, they introduce parameter redundancy, requiring extra computation and memory for each adapter, making scaling inefficient.
Thirdly, despite their flexibility, adapters often produce sub-optimal results in comparison with models that are fully fine-tuned for multi-condition generation. These issues make adapter-based methods less effective for tasks requiring seamless integration of multiple control signals.
Ideally, the capabilities of ControlNet would be trained natively into the model, in a modular way that could accommodate later, much-anticipated innovations such as simultaneous video/audio generation, or native lip-sync capabilities (for external audio).
As it stands, every extra piece of functionality represents either a post-production task or a non-native procedure that has to navigate the tightly-bound and sensitive weights of whichever foundation model it is operating on.
FullDiT
Into this standoff comes a new offering from China, which posits a system in which ControlNet-style measures are baked directly into a generative video model at training time, instead of being relegated to an afterthought.

Source: https://arxiv.org/pdf/2503.19907
Titled FullDiT, the new approach fuses multi-task conditions such as identity transfer, depth-mapping and camera movement into an integrated part of a trained generative video model, for which the authors have produced a prototype trained model, and accompanying video clips at a project site.
In the example below, we see generations that incorporate camera movement, identity information and text information (i.e., guiding user text prompts):
Source: https://fulldit.github.io/
It should be noted that the authors do not propose their experimental trained model as a functional foundation model, but rather as a proof-of-concept for native text-to-video (T2V) and image-to-video (I2V) models that offer users more control than just an image prompt or a text prompt.
Since there are no similar models of this kind yet, the researchers created a new benchmark titled FullBench, for the evaluation of multi-task videos, and claim state-of-the-art performance in the like-for-like tests they devised against prior approaches. However, since FullBench was designed by the authors themselves, its objectivity is untested, and its dataset of 1,400 cases may be too limited for broader conclusions.
Perhaps the most interesting aspect of the architecture the paper puts forward is its potential to incorporate new forms of control. The authors state:
Though the researchers present FullDiT as a step forward in multi-task video generation, it should be considered that this new work builds on existing architectures rather than introducing a fundamentally new paradigm.
Nonetheless, FullDiT currently stands alone (to the best of my knowledge) as a video foundation model with 'hard-coded' ControlNet-style facilities, and it is good to see that the proposed architecture can accommodate later innovations too.
The new paper is titled FullDiT: Multi-Task Video Generative Foundation Model with Full Attention, and comes from nine researchers across Kuaishou Technology and The Chinese University of Hong Kong. The project page is here and the new benchmark data is at Hugging Face.
Method
The authors contend that FullDiT's unified attention mechanism enables stronger cross-modal representation learning by capturing both spatial and temporal relationships across conditions:

Unlike adapter-based setups that process each input stream separately, this shared attention structure avoids branch conflicts and reduces parameter overhead. The authors also claim that the architecture can scale to new input types without major redesign, and that the model schema shows signs of generalizing to condition combinations not seen during training, such as linking camera motion with character identity.
In FullDiT's architecture, all conditioning inputs (such as text, camera motion, identity, and depth) are first converted into a unified token format. These tokens are then concatenated into a single long sequence, which is processed through a stack of transformer layers using full self-attention. This approach follows prior works such as Open-Sora Plan and Movie Gen.
This design allows the model to learn temporal and spatial relationships jointly across all conditions. Each transformer block operates over the entire sequence, enabling dynamic interactions between modalities without relying on separate modules for each input. As we have noted, the architecture is designed to be extensible, making it much easier to incorporate additional control signals in the future, without major structural changes.
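A minimal PyTorch sketch of this idea follows. The layer choices, token counts and dimensions are my own assumptions for illustration, not the authors' code; the point is that condition tokens and video latent tokens share one sequence and one self-attention space, with no per-condition branches:

```python
import torch
import torch.nn as nn

class FullAttentionBlock(nn.Module):
    """One transformer block whose self-attention spans the entire joint sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # every token attends to every other token
        return x + self.mlp(self.norm2(x))

dim = 512
blocks = nn.Sequential(*[FullAttentionBlock(dim) for _ in range(4)])

# Hypothetical token streams, already projected to a shared width of `dim`:
text_tok   = torch.randn(1, 77, dim)    # text condition tokens
camera_tok = torch.randn(1, 20, dim)    # per-frame camera extrinsics
ident_tok  = torch.randn(1, 64, dim)    # identity-map patches
depth_tok  = torch.randn(1, 84, dim)    # 3D depth patches
video_tok  = torch.randn(1, 1024, dim)  # noisy video latent patches

# One long sequence: no adapter branches, one shared attention space.
joint = torch.cat([text_tok, camera_tok, ident_tok, depth_tok, video_tok], dim=1)
out = blocks(joint)
denoised_video_tokens = out[:, -video_tok.shape[1]:]  # the video portion is what gets decoded
print(denoised_video_tokens.shape)  # torch.Size([1, 1024, 512])
```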
The Power of Three
FullDiT converts each control signal into a standardized token format so that all conditions can be processed together in a unified attention framework. For camera motion, the model encodes a sequence of extrinsic parameters, such as position and orientation, for each frame. These parameters are timestamped and projected into embedding vectors that reflect the temporal nature of the signal.
Identity information is treated differently, since it is inherently spatial rather than temporal. The model uses identity maps that indicate which characters are present in which parts of each frame. These maps are divided into patches, with each patch projected into an embedding that captures spatial identity cues, allowing the model to associate specific regions of the frame with specific entities.
Depth is a spatiotemporal signal, and the model handles it by dividing depth videos into 3D patches that span both space and time. These patches are then embedded in a way that preserves their structure across frames.
Once embedded, all of these condition tokens (camera, identity, and depth) are concatenated into a single long sequence, allowing FullDiT to process them together using full self-attention. This shared representation makes it possible for the model to learn interactions across modalities and across time without relying on isolated processing streams.
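To make the tokenization step concrete, here is a sketch of how per-frame extrinsics, identity maps and depth video might be projected to a common token width. The shapes, patch sizes and layer types are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConditionTokenizers(nn.Module):
    """Illustrative projections from raw control signals to a shared token width `dim`."""
    def __init__(self, dim: int = 512, patch: int = 16, t_patch: int = 4):
        super().__init__()
        # Camera: one 12-value extrinsic matrix (3x4), flattened, per frame -> one token per frame.
        self.camera_proj = nn.Linear(12, dim)
        # Identity: ID maps patchified spatially -> one token per 2D patch.
        self.ident_proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Depth: depth video patchified across time AND space -> one token per 3D patch.
        self.depth_proj = nn.Conv3d(1, dim, kernel_size=(t_patch, patch, patch),
                                    stride=(t_patch, patch, patch))

    def forward(self, camera, ident, depth):
        cam_tok = self.camera_proj(camera)                          # (B, frames, dim)
        id_tok = self.ident_proj(ident).flatten(2).transpose(1, 2)  # (B, spatial patches, dim)
        dp_tok = self.depth_proj(depth).flatten(2).transpose(1, 2)  # (B, spatiotemporal patches, dim)
        return torch.cat([cam_tok, id_tok, dp_tok], dim=1)          # single condition sequence

tok = ConditionTokenizers()
camera = torch.randn(1, 20, 12)           # 20 frames of flattened 3x4 extrinsics
ident = torch.randn(1, 1, 128, 224)       # one identity map (a single frame, for brevity)
depth = torch.randn(1, 1, 20, 128, 224)   # depth video: 20 frames
print(tok(camera, ident, depth).shape)    # torch.Size([1, 692, 512])
```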
Data and Tests
FullDiT's training approach relied on selectively annotated datasets tailored to each conditioning type, rather than requiring all conditions to be present simultaneously.
For textual conditions, the initiative follows the structured captioning approach outlined in the MiraData project.

Source: https://arxiv.org/pdf/2407.06358
For camera motion, the RealEstate10K dataset was the main data source, owing to its high-quality ground-truth annotations of camera parameters.
However, the authors observed that training exclusively on static-scene camera datasets such as RealEstate10K tended to reduce dynamic object and human movements in generated videos. To counteract this, they conducted additional fine-tuning using internal datasets that included more dynamic camera motions.
Identity annotations were generated using the pipeline developed for the ConceptMaster project, which allowed efficient filtering and extraction of fine-grained identity information.

Source: https://arxiv.org/pdf/2501.04698
Depth annotations were obtained from the Panda-70M dataset using Depth Anything.
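A depth-annotation step of this kind can be approximated with off-the-shelf tooling. The sketch below uses the Hugging Face transformers depth-estimation pipeline with a Depth Anything checkpoint; the model ID and frame paths are illustrative, and this is not the authors' actual annotation pipeline:

```python
from pathlib import Path
from transformers import pipeline

# Small Depth Anything checkpoint from the Hugging Face hub (illustrative choice).
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

# Annotate every extracted video frame with a per-pixel depth map.
frames_dir = Path("panda70m_clip_frames")        # hypothetical directory of extracted frames
out_dir = Path("depth_maps")
out_dir.mkdir(exist_ok=True)

for frame_path in sorted(frames_dir.glob("*.jpg")):
    result = depth_estimator(str(frame_path))    # returns a dict with a PIL depth image
    result["depth"].save(out_dir / f"{frame_path.stem}_depth.png")
```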
Optimization Through Data-Ordering
The authors also implemented a progressive training schedule, introducing more difficult conditions earlier to ensure that the model acquired robust representations before simpler tasks were added. The training order proceeded from text to camera conditions, then identities, and finally depth, with easier tasks generally introduced later and with fewer examples.
The authors emphasize the value of ordering the workload in this way:

After initial pre-training, a final fine-tuning stage further refined the model to improve visual quality and motion dynamics. Thereafter the training followed that of a standard diffusion framework: noise added to video latents, and the model learning to predict and remove it, using the embedded condition tokens as guidance.
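In pseudocode, a single step of such a denoising-diffusion objective might look like the following. This is a generic sketch with invented helper names (model, vae, scheduler are stand-ins), not FullDiT's actual training code:

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, video, condition_tokens, scheduler, optimizer):
    """One generic denoising step: corrupt the video latents, predict the noise back,
    with the concatenated condition tokens supplied as guidance."""
    with torch.no_grad():
        latents = vae.encode(video)                      # compress video to a latent sequence

    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # The model sees noisy video tokens AND condition tokens in one joint sequence.
    noise_pred = model(noisy_latents, timestep=t, conditions=condition_tokens)

    loss = F.mse_loss(noise_pred, noise)                 # standard epsilon-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```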
To effectively evaluate FullDiT and provide a fair comparison against existing methods, and in the absence of any other apposite benchmark, the authors introduced FullBench, a curated benchmark suite consisting of 1,400 distinct test cases.

Source: https://huggingface.co/datasets/KwaiVGI/FullBench
Each data point provided ground truth annotations for various conditioning signals, including camera motion, identity, and depth.
Metrics
The authors evaluated FullDiT using ten metrics covering five main aspects of performance: text alignment, camera control, identity similarity, depth accuracy, and general video quality.
Text alignment was measured using CLIP similarity, while camera control was assessed through rotation error (RotErr), translation error (TransErr), and camera motion consistency (CamMC), following the approach of CamI2V.
Identity similarity was evaluated using DINO-I and CLIP-I, and depth control accuracy was quantified using Mean Absolute Error (MAE).
Video quality was judged with three metrics from MiraData: frame-level CLIP similarity for smoothness; optical flow-based motion distance for dynamics; and LAION-Aesthetic scores for visual appeal.
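Two of the simpler measures can be reproduced roughly as follows: text alignment via CLIP similarity and depth accuracy via MAE. The frame lists and depth arrays are placeholders, and the exact protocol used in FullBench may differ:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_alignment(frames: list[Image.Image], prompt: str) -> float:
    """Mean cosine similarity between the prompt embedding and each frame embedding."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

def depth_mae(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Mean Absolute Error between predicted and ground-truth depth maps."""
    return float(np.abs(pred_depth - gt_depth).mean())
```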
Training
The authors trained FullDiT using an internal (undisclosed) text-to-video diffusion model containing roughly one billion parameters. They intentionally chose a modest parameter size to maintain fairness in comparisons with prior methods and to ensure reproducibility.
Since training videos differed in length and resolution, the authors standardized each batch by resizing and padding videos to a common resolution, sampling 77 frames per sequence, and applying attention and loss masks to optimize training effectiveness.
The Adam optimizer was used at a learning rate of 1×10⁻⁵ across a cluster of 64 NVIDIA H800 GPUs, for a combined total of 5,120GB of VRAM (consider that in the enthusiast synthesis communities, the 24GB of VRAM on an RTX 3090 is still considered a luxurious standard).
The model was trained for around 32,000 steps, incorporating up to three identities per video, together with 20 frames of camera conditions and 21 frames of depth conditions, each evenly sampled from the total of 77 frames.
For inference, the model generated videos at a resolution of 384×672 pixels (roughly five seconds at 15 frames per second) with 50 diffusion inference steps and a classifier-free guidance scale of 5.
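Gathered into one place, the reported hyperparameters amount to a configuration along these lines. The field names are my own; the values are the ones stated above:

```python
from dataclasses import dataclass

@dataclass
class FullDiTReportedConfig:
    # Training (values as reported; field names are illustrative)
    base_model_params: float = 1e9        # ~1B-parameter internal T2V backbone
    optimizer: str = "Adam"
    learning_rate: float = 1e-5
    gpus: int = 64                        # NVIDIA H800 cluster
    train_steps: int = 32_000
    frames_per_sequence: int = 77
    max_identities: int = 3
    camera_condition_frames: int = 20
    depth_condition_frames: int = 21

    # Inference
    resolution: tuple = (384, 672)        # height x width in pixels
    fps: int = 15
    diffusion_steps: int = 50
    cfg_scale: float = 5.0
```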
Prior Methods
For camera-to-video evaluation, the authors compared FullDiT against MotionCtrl, CameraCtrl, and CamI2V, with all models trained using the RealEstate10K dataset to ensure consistency and fairness.
In identity-conditioned generation, since no comparable open-source multi-identity models were available, the model was benchmarked against the 1B-parameter ConceptMaster model, using the same training data and architecture.
For depth-to-video tasks, comparisons were made with Ctrl-Adapter and ControlVideo.

The results indicate that FullDiT, despite handling multiple conditioning signals simultaneously, achieved state-of-the-art performance in metrics related to text, camera motion, identity, and depth controls.
In overall quality metrics, the system generally outperformed other methods, although its smoothness was slightly lower than ConceptMaster's. Here the authors comment:
Regarding the qualitative comparison, it may be preferable to refer to the sample videos at the FullDiT project site, since the PDF examples are inevitably static (and also too large to reproduce in full here).

The authors comment:

Conclusion
Though FullDiT is an exciting foray into a more full-featured type of video foundation model, one has to wonder whether demand for ControlNet-style instrumentalities will ever justify implementing such features at scale, at least for FOSS projects, which would struggle to obtain the enormous amount of GPU processing power necessary without commercial backing.
The primary challenge is that using systems such as Depth and Pose generally requires non-trivial familiarity with relatively complex user interfaces such as ComfyUI. Therefore it seems that a functional FOSS model of this kind is most likely to be developed by a cadre of smaller VFX companies that lack the money (or the will, given that such systems are quickly made obsolete by model upgrades) to curate and train such a model behind closed doors.
On the other hand, API-driven 'rent-an-AI' systems may be well-motivated to develop simpler and more user-friendly interpretive methods for models into which ancillary control systems have been directly trained.