A new paper out this week on Arxiv addresses a problem that anyone who has adopted the Hunyuan Video or Wan 2.1 AI video generators may have come across by now: the tendency of the generative process to abruptly speed up, conflate, omit, or otherwise mangle crucial moments in a generated video:
Source: https://haroldchen19.github.io/FluxFlow/
The video above features excerpts from example test videos at the (be warned: rather chaotic) project site for the paper. We can see several increasingly familiar issues being remediated by the authors' method (pictured on the right in the video), which is effectively a dataset preprocessing technique applicable to any generative video architecture.
In the first example, featuring 'two children fidgeting with a ball', generated by CogVideoX, we see (on the left in the compilation video above, and in the specific example below) that the native generation rapidly jumps through several essential micro-movements, speeding the children's activity up to a 'cartoon' pitch. In contrast, the same dataset and method yield better results with the new preprocessing technique, dubbed FluxFlow (to the right of the image in the video below):
In the second example (using NOVA-0.6B) we see that a central motion involving a cat has in some way been corrupted or significantly under-sampled at the training stage, to the point that the generative system becomes 'paralyzed' and is unable to make the subject move:
This syndrome, where the motion or the subject gets 'stuck', is one of the most frequently-reported bugbears of HV and Wan in the various image and video synthesis communities.
Some of these problems are related to video captioning issues in the source dataset, which we took a look at this week; but the authors of the new work focus their efforts on the temporal qualities of the training data instead, and make a convincing argument that addressing the challenges from that perspective can yield useful results.
As mentioned in the earlier article about video captioning, certain kinds of movement are particularly difficult to distil into key moments, meaning that critical events (such as a slam-dunk) don't get the attention they need at training time:
In the above example, the generative system does not know how to get to the next stage of movement, and transits illogically from one pose to the next, changing the perspective and geometry of the player in the process.
These are large movements that got lost in training – but equally vulnerable are far smaller but pivotal movements, such as the flapping of a butterfly's wings:
Unlike the slam-dunk, the flapping of the wings is not a 'rare' event but rather a persistent and monotonous one. Nevertheless, its consistency is lost in the sampling process, because the movement is so rapid that it is very difficult to establish temporally.
These are not particularly new issues, but they are receiving greater attention now that powerful generative video models are available to enthusiasts for local installation and free generation.
The communities at Reddit and Discord have initially treated these issues as 'user-related'. This is an understandable presumption, since the systems in question are very new and minimally documented. Consequently, various pundits have suggested diverse (and not always effective) remedies for some of the glitches documented here, such as altering the settings in various components of diverse kinds of ComfyUI workflows for Hunyuan Video (HV) and Wan 2.1.
In some cases, rather than producing rapid motion, both HV and Wan will produce unnaturally slow motion. Suggestions from Reddit and ChatGPT (which mostly leverages Reddit) include changing the number of frames in the requested generation, or radically lowering the frame rate*.
This is all desperate stuff; the emerging truth is that we do not yet know the exact cause of, or the exact remedy for, these issues; clearly, tormenting the generation settings to work around them (particularly when this degrades output quality, for instance with a too-low fps rate) is only a stop-gap, and it is good to see that the research scene is addressing emerging issues this quickly.
So, besides this week's look at how captioning affects training, let's take a look at the new paper on temporal regularization, and what improvements it might offer the current generative video scene.
The central idea is rather simple and slight, and none the worse for that; nevertheless the paper is somewhat padded in order to reach the prescribed eight pages, and we will skip over this padding as necessary.
Source: https://arxiv.org/pdf/2503.15417
The new work is titled Temporal Regularization Makes Your Video Generator Stronger, and comes from eight researchers across Everlyn AI, Hong Kong University of Science and Technology (HKUST), the University of Central Florida (UCF), and The University of Hong Kong (HKU).
FluxFlow
The central idea behind FluxFlow, the authors' new pre-training schema, is to overcome the widespread temporal problems described above by shuffling blocks and groups of blocks in the temporal frame order as the source data is exposed to the training process:

The paper explains:
Most video generation models, the authors explain, still borrow too heavily from image synthesis, focusing on spatial fidelity while largely ignoring the temporal axis. Though techniques such as cropping, flipping, and color jittering have helped improve static image quality, they are not adequate solutions when applied to videos, where the illusion of motion depends on consistent transitions across frames.
The resulting problems include flickering textures, jarring cuts between frames, and repetitive or overly simplistic motion patterns.
The paper argues that though some models – including Stable Video Diffusion and LlamaGen – compensate with increasingly complex architectures or engineered constraints, these come at a cost in terms of compute and flexibility.
Since temporal data augmentation has already proven useful in video tasks (in frameworks such as FineCliper, SeFAR and SVFormer), it is surprising, the authors assert, that this tactic is so rarely applied in a generative context.
Disruptive Behavior
The researchers contend that simple, structured disruptions in temporal order during training help models generalize better to realistic, diverse motion:
Frame-level perturbations, the authors state, introduce fine-grained disruptions within a sequence. This kind of disruption is not dissimilar to masking augmentation, where sections of data are randomly blocked out in order to prevent the system from overfitting on data points, and to encourage better generalization.
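To make the idea concrete, below is a minimal sketch (in PyTorch-style Python) of what frame-level and block-level temporal perturbations of this general kind might look like when applied to a single training clip. This is an illustration, not the authors' released code: the tensor layout and the ratio and block_size parameters are assumptions made for the example.

```python
import torch

def frame_level_perturb(video: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Frame-level temporal perturbation: pick a small fraction of frame
    positions at random and shuffle those frames among themselves,
    leaving the rest of the clip in its original order.
    `video` is assumed to have shape (T, C, H, W)."""
    t = video.shape[0]
    n = max(2, int(t * ratio))
    idx = torch.randperm(t)[:n]           # positions to disturb
    shuffled = idx[torch.randperm(n)]     # a permutation of those positions
    out = video.clone()
    out[idx] = video[shuffled]
    return out

def block_level_perturb(video: torch.Tensor, block_size: int = 4) -> torch.Tensor:
    """Block-level temporal perturbation: split the clip into contiguous
    blocks of frames and shuffle the order of the blocks themselves."""
    blocks = list(torch.split(video, block_size, dim=0))
    order = torch.randperm(len(blocks))
    return torch.cat([blocks[i] for i in order], dim=0)
```

In practice a perturbation of this sort would be applied as a dataset preprocessing step, before clips ever reach the model; how aggressively it is applied is exactly the kind of hyperparameter the authors probe in the test section below.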
Tests
Though the central idea here does not run to a full-length paper, due to its simplicity, there is nevertheless a test section that we can take a look at.
The authors tested for four questions, relating to: improved temporal quality while maintaining spatial fidelity; the ability to learn motion/optical flow dynamics; the maintenance of temporal quality in long-term generation; and sensitivity to key hyperparameters.
The researchers applied FluxFlow to three generative architectures: U-Net-based, in the form of VideoCrafter2; DiT-based, in the form of CogVideoX-2B; and AR-based, in the form of NOVA-0.6B.
For fair comparison, they fine-tuned the architectures’ base models with FluxFlow as an extra training phase, for one epoch, on the OpenVidHD-0.4M dataset.
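By way of illustration only, a fine-tuning phase of this kind could be wired up along the lines of the hypothetical sketch below; the model, data loader and training_loss call are placeholders standing in for whichever of the three architectures is being tuned, not an interface taken from the paper or its code release:

```python
import torch

def fluxflow_finetune_epoch(model, dataloader, optimizer, perturb_fn, device="cuda"):
    """One additional fine-tuning epoch in which every training clip is
    passed through a temporal perturbation before the usual loss is computed.
    `perturb_fn` is one of the perturbation functions sketched earlier."""
    model.train()
    for clips, captions in dataloader:                    # clips assumed as (B, T, C, H, W)
        clips = clips.to(device)
        clips = torch.stack([perturb_fn(c) for c in clips])   # perturb each clip in the batch
        optimizer.zero_grad()
        loss = model.training_loss(clips, captions)       # placeholder loss interface
        loss.backward()
        optimizer.step()
```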
The models were evaluated against two popular benchmarks: UCF-101 and VBench.
For UCF-101, the Fréchet Video Distance (FVD) and Inception Score (IS) metrics were used. For VBench, the researchers focused on temporal quality, frame-wise quality, and overall quality.
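For readers unfamiliar with the metric, FVD is the Fréchet distance between Gaussians fitted to feature embeddings of real and of generated videos (the embeddings conventionally come from a pretrained I3D network). A minimal sketch of the distance computation itself, assuming the two embedding sets are already available as NumPy arrays, might look like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of video
    embeddings, each of shape (num_videos, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```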

Commenting on these results, the authors state:
Below we see selections from the qualitative results the authors refer to (please see the original paper for full results and better resolution):

The paper suggests that while both frame-level and block-level perturbations enhance temporal quality, frame-level methods tend to perform better. This is attributed to their finer granularity, which enables more precise temporal adjustments. Block-level perturbations, by contrast, may introduce noise due to tightly coupled spatial and temporal patterns within blocks, reducing their effectiveness.
Conclusion
This paper, together with the Bytedance-Tsinghua captioning collaboration released this week, has made it clear to me that the apparent shortcomings in the new generation of generative video models may not result from user error, institutional missteps, or funding limitations, but rather from a research focus that has understandably prioritized more urgent challenges, such as temporal coherence and consistency, over these lesser concerns.
Until recently, the results from freely-available and downloadable generative video systems were so compromised that no great locus of effort emerged from the enthusiast community to redress the issues (not least because the issues were fundamental and not trivially solvable).
Now that we’re a lot closer to the long-predicted age of purely AI-generated photorealistic video output, it’s clear that each the research and casual communities are taking a deeper and more productive interest in resolving remaining issues; with a bit of luck, these are usually not intractable obstacles.
*