The video/image synthesis research sector regularly produces video-editing* architectures, and over the past nine months, releases of this kind have become much more frequent. That said, most of them represent only incremental advances on the state of the art, because the core challenges are substantial.
Nevertheless, a new collaboration between China and Japan this week has produced some examples that merit a closer examination of the approach, even if it is not necessarily a landmark work.
In the video clip below (from the paper's associated project site, which – be warned – may tax your browser) we see that while the deepfaking capabilities of the system are non-existent in its current configuration, it does a fine job of plausibly and significantly altering the identity of the young woman in the picture, based on a video mask (bottom-left):
Source: https://yxbian23.github.io/project/video-painter/
Mask-based editing of this kind is well-established in static latent diffusion models, using tools like ControlNet. However, maintaining background consistency in video is far more difficult, even when masked areas give the model creative flexibility, as shown below:
The authors of the new work consider their method in relation both to Tencent's own BrushNet architecture (which we covered last year) and to ControlNet, both of which use a dual-branch architecture capable of isolating foreground and background generation.
However, applying this method directly to the highly productive Diffusion Transformer (DiT) approach popularized by OpenAI's Sora brings particular challenges, as the authors note:
The researchers have therefore developed a plug-and-play approach in the form of a dual-branch framework titled VideoPainter.
VideoPainter offers a dual-branch video inpainting framework that enhances pre-trained DiTs with a lightweight context encoder. This encoder accounts for just 6% of the backbone's parameters, which the authors claim makes the approach more efficient than conventional methods.
The model proposes three key innovations: a streamlined two-layer context encoder for efficient background guidance; a mask-selective feature integration system that separates masked and unmasked tokens; and an inpainting region ID resampling technique that maintains identity consistency across long video sequences.
By freezing both the pre-trained DiT and the context encoder while introducing an ID-Adapter, VideoPainter ensures that inpainting region tokens from previous clips persist throughout a video, reducing flickering and inconsistencies.
The framework is also designed for plug-and-play compatibility, allowing users to integrate it seamlessly into existing video generation and editing workflows.
To support the work, which uses CogVideoX-5B-I2V as its generative engine, the authors curated what they state is the largest video inpainting dataset to date. Titled VPData, the collection consists of more than 390,000 clips, with a total video duration of more than 886 hours. They also developed a related benchmarking framework titled VPBench.
The new work is titled VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control, and comes from seven authors at the Tencent ARC Lab, The Chinese University of Hong Kong, The University of Tokyo, and the University of Macau.
Besides the aforementioned project site, the authors have also released a more accessible YouTube overview, as well as a Hugging Face page.
Method
The data collection pipeline for VPData consists of collection, annotation, splitting, selection, and captioning:
Source: https://arxiv.org/pdf/2503.05639
The source collections used for this compilation came from Videvo and Pexels, with an initial haul of around 450,000 videos obtained.
Multiple contributing libraries and methods comprised the pre-processing stage: the Recognize Anything framework was used to provide open-set video tagging, tasked with identifying primary objects; Grounding DINO was used to detect bounding boxes around the identified objects; and the Segment Anything Model 2 (SAM 2) framework was used to refine these coarse selections into high-quality mask segmentations.
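As a rough illustration of how such a tagging, detection and segmentation chain fits together, the sketch below uses hypothetical wrapper functions standing in for the Recognize Anything, Grounding DINO and SAM 2 entry points (whose exact APIs vary by release); only the overall flow reflects the pipeline described above.

```python
import numpy as np

# Hypothetical stand-ins for the real libraries: each function would wrap the
# corresponding model (Recognize Anything, Grounding DINO, SAM 2).
def tag_objects(frame: np.ndarray) -> list[str]:
    """Open-set tagging of the primary objects in a frame (Recognize Anything)."""
    raise NotImplementedError

def detect_boxes(frame: np.ndarray, tags: list[str]) -> list[tuple[int, int, int, int]]:
    """Text-prompted bounding boxes for the tagged objects (Grounding DINO)."""
    raise NotImplementedError

def segment_object(frame: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Refine a coarse box into a binary segmentation mask (SAM 2)."""
    raise NotImplementedError

def annotate_keyframe(frame: np.ndarray) -> list[np.ndarray]:
    """Tag -> detect -> segment: returns one binary mask per detected object."""
    tags = tag_objects(frame)
    boxes = detect_boxes(frame, tags)
    return [segment_object(frame, box) for box in boxes]
```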
To manage scene transitions and ensure consistency in video inpainting, VideoPainter uses PySceneDetect to identify and segment clips at natural breakpoints, avoiding the disruptive shifts often caused by tracking the same object from multiple angles. The clips were divided into 10-second intervals, with anything shorter than six seconds discarded.
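This splitting step can be approximated with PySceneDetect's content detector, as in the minimal sketch below; the 10-second chunking and six-second minimum follow the figures given above, while the detector's default threshold is an assumption.

```python
from scenedetect import detect, ContentDetector

def split_into_clips(video_path: str,
                     chunk_secs: float = 10.0,
                     min_secs: float = 6.0) -> list[tuple[float, float]]:
    """Cut a video at scene transitions, then slice each scene into fixed-length
    chunks, discarding anything shorter than min_secs."""
    scenes = detect(video_path, ContentDetector())   # list of (start, end) timecodes
    clips = []
    for start, end in scenes:
        t, t_end = start.get_seconds(), end.get_seconds()
        while t < t_end:
            chunk_end = min(t + chunk_secs, t_end)
            if chunk_end - t >= min_secs:
                clips.append((t, chunk_end))
            t = chunk_end
    return clips

clips = split_into_clips("source_video.mp4")
```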
For data selection, three filtering criteria were applied: aesthetic quality, assessed with the Laion Aesthetic Score Predictor; motion strength, measured via optical flow using RAFT; and content safety, verified through Stable Diffusion's Safety Checker.
One major limitation in existing video segmentation datasets is the lack of detailed textual annotations, which are crucial for guiding generative models:
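Of these filters, motion strength is the most mechanical to reproduce: the sketch below estimates it as the mean optical-flow magnitude between two frames, using torchvision's RAFT implementation. The resize target and the acceptance threshold are assumptions, and the aesthetic and safety checks (Laion predictor, Stable Diffusion Safety Checker) are omitted here.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def motion_strength(frame_a: torch.Tensor, frame_b: torch.Tensor) -> float:
    """Mean optical-flow magnitude between two uint8 frames shaped (3, H, W)."""
    # RAFT expects dimensions divisible by 8; 520x960 matches the torchvision example.
    a = TF.resize(frame_a.unsqueeze(0), size=[520, 960], antialias=False)
    b = TF.resize(frame_b.unsqueeze(0), size=[520, 960], antialias=False)
    a, b = preprocess(a, b)
    flow = raft(a, b)[-1]                     # final flow refinement, shape (1, 2, H, W)
    return torch.linalg.norm(flow, dim=1).mean().item()

# Hypothetical usage: keep clips whose endpoint frames show sufficient motion.
# keep = motion_strength(first_frame, last_frame) > 1.0    # threshold is an assumption
```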

The VideoPainter data curation process therefore incorporates diverse leading vision-language models, including CogVLM2 and GPT-4o, to generate keyframe-based captions and detailed descriptions of masked regions.
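As a purely illustrative example of the captioning step (the paper does not publish its exact prompts, so the prompt and model choice below are assumptions), a detailed description of a masked object could be requested from a vision-language model along these lines:

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_masked_region(keyframe_path: str, object_tag: str) -> str:
    """Request a detailed description of a masked object in a keyframe."""
    with open(keyframe_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",   # assumed model id; the paper also uses CogVLM2
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe the {object_tag} in this keyframe in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```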
VideoPainter enhances pre-trained DiTs by introducing a custom lightweight context encoder that separates background context extraction from foreground generation, seen to the upper right of the illustrative schema below:

Instead of burdening the backbone with redundant processing, this encoder operates on a streamlined input: a combination of noisy latent, masked video latent (extracted via a variational autoencoder, or VAE), and downsampled masks.
The noisy latent provides generation context, and the masked video latent aligns with the DiT’s existing distribution, aiming to boost compatibility.
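In practice this amounts to a channel-wise concatenation of the three signals, with the mask downsampled to the latent resolution first; the sketch below uses assumed tensor shapes and channel counts, not the model's actual dimensions:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: latents are (batch, channels, frames, height/8, width/8);
# the pixel-space mask is (batch, 1, frames, height, width).
noisy_latent  = torch.randn(1, 16, 13, 60, 90)    # generation context
masked_latent = torch.randn(1, 16, 13, 60, 90)    # VAE encoding of the masked video
pixel_mask    = torch.randint(0, 2, (1, 1, 13, 480, 720)).float()

# Downsample the mask to the latent resolution before concatenation.
latent_mask = F.interpolate(pixel_mask, size=(13, 60, 90), mode="nearest")

encoder_input = torch.cat([noisy_latent, masked_latent, latent_mask], dim=1)
print(encoder_input.shape)    # torch.Size([1, 33, 13, 60, 90])
```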
Rather than duplicating large sections of the model, which the authors state has occurred in prior works, VideoPainter integrates only the first two layers of the DiT. These extracted features are reintroduced into the frozen DiT in a structured, group-wise manner: early-layer features inform the first half of the model, while later features refine the second half.
Moreover, a token-selective mechanism ensures that only background-relevant features are reintegrated, preventing confusion between masked and unmasked regions. This approach, the authors contend, allows VideoPainter to maintain high fidelity in background preservation while improving foreground inpainting efficiency.
The authors note that the method they propose supports diverse stylization methods, including the most popular, Low-Rank Adaptation (LoRA).
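The sketch below pulls these pieces together under stated assumptions: `DiTBlock`, the block count and the hidden size are placeholders, and the mask-gated residual addition is a simplification of the paper's integration mechanism, but the overall structure (a frozen backbone, a trainable two-block context encoder cloned from it, group-wise injection, and reintegration of background tokens only) follows the description above.

```python
import copy
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Placeholder transformer block standing in for a real DiT block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class SketchVideoPainter(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 8, cond_channels: int = 33):
        super().__init__()
        self.backbone = nn.ModuleList([DiTBlock(dim) for _ in range(depth)])
        # The context encoder is a trainable copy of only the first two backbone
        # blocks, plus a projection for the concatenated conditioning input.
        self.context_encoder = copy.deepcopy(self.backbone[:2])
        self.proj_in = nn.Linear(cond_channels, dim)
        # The pre-trained backbone itself stays frozen.
        for p in self.backbone.parameters():
            p.requires_grad_(False)

    def forward(self, tokens, cond_tokens, background_mask):
        """tokens: (B, N, dim) noisy latent tokens for the frozen DiT.
        cond_tokens: (B, N, cond_channels) concatenated background conditioning.
        background_mask: (B, N, 1), 1.0 where a token lies in the unmasked background."""
        ctx = self.proj_in(cond_tokens)
        ctx_feats = []
        for block in self.context_encoder:
            ctx = block(ctx)
            ctx_feats.append(ctx)

        half = len(self.backbone) // 2
        for i, block in enumerate(self.backbone):
            # Group-wise injection: the first encoder layer conditions the first
            # half of the backbone, the second encoder layer the second half.
            feat = ctx_feats[0] if i < half else ctx_feats[1]
            # Token-selective integration: only background tokens receive context.
            tokens = tokens + feat * background_mask
            tokens = block(tokens)
        return tokens
```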
Data and Tests
VideoPainter was trained using the CogVideoX-5B-I2V model, together with its text-to-video equivalent. The curated VPData corpus was used at 480x720px, at a learning rate of 1×10⁻⁵.
The ID Resample Adapter was trained for 2,000 steps, and the context encoder for 80,000 steps, both using the AdamW optimizer. The training took place in two stages on a formidable 64 NVIDIA V100 GPUs (though the paper does not specify whether these had 16GB or 32GB of VRAM).
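In optimizer terms the setup amounts to something like the following, reusing the hypothetical `SketchVideoPainter` module from the earlier sketch; the learning rate and optimizer follow the paper, while the weight-decay value is an assumption:

```python
import torch

model = SketchVideoPainter()    # hypothetical module from the earlier sketch

# Only the parameters left trainable (the context encoder and, in the second
# training stage, the ID-resampling adapter) receive gradient updates.
trainable = [p for p in model.parameters() if p.requires_grad]

optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-2)
```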
For benchmarking, DAVIS was used for random masks, and the authors' own VPBench for segmentation-based masks.
The VPBench dataset features objects, animals, humans, landscapes, and diverse tasks, and covers four editing actions. The collection features 45 six-second videos, and nine videos lasting, on average, 30 seconds.
Eight metrics were used in the evaluation. For Masked Region Preservation, the authors used Peak Signal-to-Noise Ratio (PSNR); Learned Perceptual Image Patch Similarity (LPIPS); Structural Similarity Index (SSIM); and Mean Absolute Error (MAE).
For text alignment, the researchers used CLIP Similarity both to evaluate the semantic distance between the clip's caption and its actual perceived content, and to evaluate the accuracy of masked regions.
To evaluate the general quality of the output videos, Fréchet Video Distance (FVD) was used.
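Reference implementations of the pixel-space metrics are readily available; the sketch below computes PSNR, SSIM and MAE with scikit-image and NumPy, and LPIPS with the `lpips` package, over whole frames (the paper restricts these metrics to the preserved region, and its exact normalization protocol is not reproduced here).

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")

def frame_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred/target: uint8 frames of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)
    mae = np.abs(pred.astype(np.float32) - target.astype(np.float32)).mean()

    # LPIPS expects float tensors in [-1, 1], shaped (N, 3, H, W).
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_net(to_tensor(pred), to_tensor(target)).item()

    return {"PSNR": psnr, "SSIM": ssim, "MAE": mae, "LPIPS": lp}
```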
For a quantitative comparison round on video inpainting, the authors set their system against the prior approaches ProPainter, COCOCO, and Cog-Inp (CogVideoX). The test consisted of inpainting the first frame of a clip using image inpainting models, and then using an image-to-video (I2V) backbone to propagate the results into a latent blend operation, in line with a method proposed by a 2023 paper from Israel.

Of these quantitative results, the authors comment:
The authors then present static examples of qualitative tests, of which we feature a selection below. In all cases we refer the reader to the project site and YouTube video for better resolution.

Regarding this qualitative round for video inpainting, the authors comment:
The researchers additionally tested VideoPainter's ability to enhance captions and thereby obtain improved results, pitting the system against UniEdit, DiTCtrl, and ReVideo.

The authors comment:
Though the paper features static qualitative examples for this metric, they are unilluminating, and we refer the reader instead to the many examples spread across the various videos published for this project.
Finally, a human study was conducted, in which thirty users were asked to evaluate 50 randomly-selected generations from the VPBench inpainting and editing subsets. The evaluation covered background preservation, alignment to prompt, and general video quality.

The authors state:
They concede, however, that the quality of VideoPainter's generations depends on the base model, which can struggle with complex motion and physics; and they observe that it also performs poorly with low-quality masks or misaligned captions.
Conclusion
VideoPainter seems a worthwhile addition to the literature, though, typical of recent solutions, it has considerable compute demands. Moreover, many of the examples chosen for presentation at the project site fall far short of the best examples; it would therefore be interesting to see this framework pitted against future entries, and a wider range of prior approaches.
*