If 2022 was the year that generative AI captured the wider public's imagination, 2025 is the year in which the new breed of generative frameworks coming from China seems set to do the same.
Tencent’s Hunyuan Video has made a significant impact on the hobbyist AI community with its open-source release of a full-world video diffusion model that users can tailor to their needs.
Close on its heels is Alibaba's newer Wan 2.1, one of the most powerful image-to-video FOSS solutions of this period – now supporting customization through Wan LoRAs.
Besides the availability of the recent human-centric foundation model SkyReels, at the time of writing we also await the release of Alibaba's comprehensive VACE video creation and editing suite:
Source: https://ali-vilab.github.io/VACE-Page/
Sudden Impact
The generative video AI research scene itself is no less explosive; it is still the first half of March, and Tuesday's submissions to Arxiv's Computer Vision section (a hub for generative AI papers) came to nearly 350 entries – a figure more associated with the height of conference season.
The two years since the launch of Stable Diffusion in the summer of 2022 (and the subsequent development of the Dreambooth and LoRA customization methods) were characterized by a lack of further major developments – until the last few weeks, in which new releases and innovations have proceeded at such a breakneck pace that it is almost impossible to keep apprised of it all, much less cover it all.
Video diffusion models such as Hunyuan and Wan 2.1 have solved, at last, and after years of failed efforts from hundreds of research initiatives, the problem of temporal consistency as it pertains to the generation of humans, and largely also to environments and objects.
There can be little doubt that VFX studios are currently applying staff and resources to adapting the new Chinese video models to solve immediate challenges such as face-swapping, despite the current lack of ControlNet-style ancillary mechanisms for these systems.
It must be a relief that such a significant obstacle has potentially been overcome, albeit not through the anticipated avenues.
Of the problems that remain, this one, however, is not insignificant:
Source: https://videophy2.github.io/
Up The Hill Backwards
All text-to-video and image-to-video systems currently available, including commercial closed-source models, have a tendency to produce physics bloopers such as the one above, in which the generated video shows a rock rolling uphill, in contradiction of the physics described in the source prompt.
One theory as to why this happens, recently proposed in an academic collaboration between Alibaba and the UAE, is that models are, in a sense, always training on single images, even when they are training on videos (which are written out to single-frame sequences for training purposes); and that they may not necessarily learn the correct temporal order of 'before' and 'after' images.
However, the most likely explanation is that the models in question have used data augmentation routines that involve exposing a source training clip to the model both forwards and backwards, effectively doubling the training data.
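As a purely illustrative sketch (neither Hunyuan Video nor Wan 2.1 has published its augmentation code), the hypothetical snippet below shows how naive temporal-reversal augmentation of this kind doubles a set of training clips:

```python
# Minimal, hypothetical sketch of temporal-reversal data augmentation.
# This is NOT drawn from any published training code; it only illustrates
# how reversing every clip doubles the effective training set.

from typing import List

def reverse_clip(frames: List[str]) -> List[str]:
    """Return the clip with its frame order reversed."""
    return list(reversed(frames))

def augment_with_reversal(clips: List[List[str]]) -> List[List[str]]:
    """Naively double the dataset by adding a reversed copy of every clip."""
    return clips + [reverse_clip(clip) for clip in clips]

if __name__ == "__main__":
    # Short strings stand in for decoded video frames.
    clips = [["f0", "f1", "f2"], ["g0", "g1", "g2"]]
    augmented = augment_with_reversal(clips)
    print(len(clips), "->", len(augmented))   # 2 -> 4
    print(augmented[-1])                      # ['g2', 'g1', 'g0']
```

The problem the article goes on to describe arises when this doubling is applied to clips whose motion only makes sense in one direction.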
It has long been known that this should not be done arbitrarily, because some movements work in reverse, but many do not. A 2019 study from the UK's University of Bristol sought to develop a method that could distinguish which source data video clips co-existing in a single dataset can safely be reversed and which cannot (see image below), with the notion that unsuitable source clips could be filtered out of data augmentation routines.
Source: https://arxiv.org/abs/1909.09422
The authors of that work frame the issue clearly:
Temporary Reversals
We have no evidence that systems such as Hunyuan Video and Wan 2.1 allowed arbitrarily 'reversed' clips to be exposed to the model during training (neither group of researchers has been specific regarding its data augmentation routines).
Yet the only reasonable alternative possibility, in the face of so many reports (and my own practical experience), would appear to be that the hyperscale datasets powering these models may contain clips that actually feature movements occurring in reverse.
The rock in the example video embedded above was generated using Wan 2.1, and features in a new study that examines how well video diffusion models handle physics.
In tests for this project, Wan 2.1 achieved a score of only 22% in terms of its ability to consistently adhere to physical laws.
Nevertheless, that is the highest score of any system tested for the work, indicating that we may have found our next stumbling block for video AI:

Source: https://arxiv.org/pdf/2503.06800
The authors of the new work have developed a benchmarking system, now in its second iteration, called VideoPhy-2, with the code available at GitHub.
Though the scope of the work is beyond what we can comprehensively cover here, let's take a general look at its methodology, and at its potential for establishing a metric that could help steer the course of future model-training sessions away from these bizarre instances of reversal.
The study was conducted by six researchers from UCLA and Google Research. A crowded accompanying project site is also available, together with code and datasets at GitHub, and a dataset viewer at Hugging Face.
Method
The authors describe the latest version of their work, VideoPhy-2, as a 'challenging commonsense evaluation dataset for real-world actions.' The collection features 197 actions across a diverse range of physical and sporting activities, as well as object interactions.
A large language model (LLM) is used to generate 3,940 prompts from these seed actions, and the prompts are then used to synthesize videos via the various frameworks being trialed.
Throughout the process, the authors developed a list of 'candidate' physical rules and laws that AI-generated videos should satisfy, using vision-language models for evaluation.
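In schematic terms, the pipeline runs from seed action, to LLM-generated prompts, to synthesized videos, to rule checks made by a vision-language model. The sketch below is a hypothetical rendering of that flow; generate_prompts_with_llm and vlm_judges_rule are placeholder stand-ins for the LLM and VLM calls, not the authors' code:

```python
# Hypothetical outline of the flow described above:
# seed action -> LLM prompts -> synthesized video -> rule checks by a VLM.
# Both helper functions are placeholders, not real APIs.

from typing import Dict, List

def generate_prompts_with_llm(action: str, n: int = 20) -> List[str]:
    """Stand-in for the LLM call that expands a seed action into prompts."""
    return [f"{action}, variation {i}" for i in range(n)]

def vlm_judges_rule(video_path: str, rule: str) -> str:
    """Stand-in for a vision-language model deciding whether a generated
    video follows a candidate physical rule: 'followed', 'violated' or 'unclear'."""
    return "unclear"

def evaluate_action(action: str, candidate_rules: List[str]) -> Dict[str, Dict[str, str]]:
    results = {}
    for prompt in generate_prompts_with_llm(action, n=2):
        video = f"videos/{abs(hash(prompt)) & 0xffff}.mp4"   # placeholder output path
        results[prompt] = {rule: vlm_judges_rule(video, rule)
                           for rule in candidate_rules}
    return results

if __name__ == "__main__":
    rules = ["objects fall under gravity", "momentum is conserved on impact"]
    print(evaluate_action("a person throws a discus", rules))
```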
The authors state:

Initially, the researchers curated a set of actions with which to evaluate physical commonsense in AI-generated videos. They began with over 600 actions sourced from the Kinetics, UCF-101, and SSv2 datasets, focusing on activities involving sports, object interactions, and real-world physics.
Two independent groups of STEM-trained student annotators (with a minimum undergraduate qualification obtained) reviewed and filtered the list, selecting actions that clearly test physical principles, while removing low-motion tasks that offer little for the benchmark to measure.
After further refinement with Gemini-2.0-Flash-Exp to eliminate duplicates, the final dataset comprised 197 actions, with 54 involving object interactions and 143 centered on physical and sports activities:
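The deduplication step itself was carried out with Gemini-2.0-Flash-Exp; as a rough, purely illustrative stand-in, the sketch below collapses duplicate action labels by normalized string matching rather than by an LLM:

```python
# Rough stand-in for the action-list deduplication step. The paper used
# Gemini-2.0-Flash-Exp for this; simple string normalization is used here
# purely to illustrate the shape of the operation.

from typing import List

def normalise(action: str) -> str:
    """Lower-case, strip hyphens and collapse whitespace for comparison."""
    return " ".join(action.lower().replace("-", " ").split())

def dedupe_actions(actions: List[str]) -> List[str]:
    seen, unique = set(), []
    for action in actions:
        key = normalise(action)
        if key not in seen:
            seen.add(key)
            unique.append(action)
    return unique

if __name__ == "__main__":
    raw = ["Pole vaulting", "pole-vaulting", "Tightrope walking", "Pizza tossing"]
    print(dedupe_actions(raw))   # duplicates collapsed to a single entry each
```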

In the second stage, the researchers used Gemini-2.0-Flash-Exp to generate 20 prompts for each action in the dataset, resulting in a total of 3,940 prompts. The generation process focused on visible physical interactions that could be clearly represented in a generated video. This excluded non-visual elements, but incorporated diverse characters and objects.
For instance, instead of a simple, minimal prompt, the model was guided to produce a more detailed version of the same action.
Since modern video models can interpret longer descriptions, the researchers further refined the captions using the Mistral-NeMo-12B-Instruct prompt upsampler, to add visual details without altering the original meaning.
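The upsampling step amounts to one model call per caption; the hypothetical sketch below stands in for Mistral-NeMo-12B-Instruct and is intended only to show the contract – a short caption goes in, a longer caption that preserves the original action comes out:

```python
# Hypothetical stand-in for the caption-upsampling step. The real work used
# Mistral-NeMo-12B-Instruct; this placeholder only illustrates the contract:
# the original action should survive, verbatim, inside the longer caption.

def upsample_caption(caption: str) -> str:
    """Placeholder upsampler: wraps the caption in extra visual detail."""
    return (f"In bright daylight, filmed in a single steady shot, "
            f"{caption}, with the surrounding scene clearly visible.")

def preserves_action(original: str, upsampled: str) -> bool:
    """Guard: the upsampled caption must still contain the original action."""
    return original.lower() in upsampled.lower()

if __name__ == "__main__":
    original = "a person tosses a pizza base into the air"
    longer = upsample_caption(original)
    assert preserves_action(original, longer)
    print(len(original.split()), "->", len(longer.split()), "words")
```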

For the third stage, physical rules were derived not from the text prompts but from the generated videos themselves, since generative models can struggle to adhere to their conditioning text prompts.
Videos were first created using VideoPhy-2 prompts, then ‘up-captioned’ with Gemini-2.0-Flash-Exp to extract key details. The model proposed three expected physical rules per video, which human annotators reviewed and expanded by identifying additional potential violations.
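A plausible way to record the outcome of this stage is a per-video structure holding both the model-proposed rules and the annotators' additions; the sketch below is an assumption about the data layout, not the authors' schema, and propose_rules() is a placeholder for the Gemini call:

```python
# Hypothetical sketch of per-video rule records: a captioning model proposes
# three expected physical rules per video, and human annotators may add
# further candidate violations. propose_rules() is a placeholder.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoRules:
    video_path: str
    model_proposed: List[str] = field(default_factory=list)
    annotator_added: List[str] = field(default_factory=list)

    def all_rules(self) -> List[str]:
        return self.model_proposed + self.annotator_added

def propose_rules(video_path: str) -> List[str]:
    """Placeholder for the VLM that proposes three expected rules per video."""
    return ["the object falls under gravity",
            "contact forces deform soft objects",
            "momentum carries the object forward after release"]

if __name__ == "__main__":
    record = VideoRules("videos/discus_0001.mp4",
                        model_proposed=propose_rules("videos/discus_0001.mp4"))
    record.annotator_added.append("the discus should not reverse direction mid-flight")
    print(len(record.all_rules()), "rules attached to", record.video_path)
```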

Next, to identify the most difficult actions, the researchers generated videos using CogVideoX-5B, with prompts from the VideoPhy-2 dataset. They then selected 60 actions out of the 197 where the model consistently failed to follow both the prompts and basic physical commonsense.
These actions involved physics-rich interactions such as momentum transfer in discus throwing, state changes such as bending an object until it breaks, balancing tasks such as tightrope walking, and complex motions including back-flips, pole vaulting, and pizza tossing, among others. In total, 1,200 prompts were chosen to increase the difficulty of the sub-dataset.
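A minimal sketch of this selection logic follows, under the assumption that each action carries mean semantic-adherence and physical-commonsense ratings derived from the probe model's outputs (the scores and the threshold below are invented for illustration):

```python
# Hypothetical sketch of the hard-subset selection: actions on which the probe
# model (CogVideoX-5B in the paper) scores poorly on BOTH semantic adherence
# and physical commonsense are retained. Scores and threshold are made up.

from typing import Dict, List, Tuple

def select_hard_actions(scores: Dict[str, Tuple[float, float]],
                        threshold: float = 2.5,
                        limit: int = 60) -> List[str]:
    """Keep actions whose mean semantic-adherence AND physical-commonsense
    ratings (on a 1-5 scale) both fall below the threshold."""
    hard = [a for a, (sa, pc) in scores.items() if sa < threshold and pc < threshold]
    hard.sort(key=lambda a: sum(scores[a]))   # hardest (lowest combined score) first
    return hard[:limit]

if __name__ == "__main__":
    per_action = {
        "pole vaulting": (1.8, 1.5),
        "pizza tossing": (2.1, 2.3),
        "kicking a ball": (4.2, 3.9),
    }
    print(select_hard_actions(per_action))   # ['pole vaulting', 'pizza tossing']
```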
The resulting dataset comprised 3,940 captions – 5.72 times more than the earlier version of VideoPhy. The average length of the original captions is 16 tokens, while the upsampled captions reach 138 tokens – 1.88 times and 16.2 times longer than the earlier version's captions, respectively.
The dataset also features 102,000 human annotations covering semantic adherence, physical commonsense, and rule violations across multiple video generation models.
Evaluation
The researchers then defined clear criteria for evaluating the videos. The main goal was to assess how well each video matched its input prompt and followed basic physical principles.
Instead of simply ranking videos by preference, they used rating-based feedback to capture specific successes and failures. Human annotators scored videos on a five-point scale, allowing for more detailed judgments, while the evaluation also checked whether videos followed various physical rules and laws.
For human evaluation, a group of 12 annotators was selected from trials on Amazon Mechanical Turk (AMT), and provided ratings after receiving detailed remote instructions. For fairness, semantic adherence and physical commonsense were evaluated separately (in the original VideoPhy study, they were assessed jointly).
The annotators first rated how well videos matched their input prompts, then separately evaluated physical plausibility, scoring rule violations and overall realism on a five-point scale. Only the original prompts were shown, to maintain a fair comparison across models.
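To make the scoring concrete, here is a hypothetical aggregation of such ratings; the "at least 4 on both axes" cut-off is an assumption for illustration, not necessarily the paper's aggregation rule:

```python
# Hypothetical aggregation of the human ratings described above. Each video
# receives a 1-5 semantic-adherence (SA) score and a 1-5 physical-commonsense
# (PC) score, recorded separately. The >=4 "good" threshold is an assumption.

from statistics import mean
from typing import Dict, List, Tuple

def summarise(ratings: List[Tuple[int, int]]) -> Dict[str, float]:
    """ratings: list of (sa, pc) pairs for one model's generated videos."""
    sa = [r[0] for r in ratings]
    pc = [r[1] for r in ratings]
    return {
        "mean_sa": round(mean(sa), 2),
        "mean_pc": round(mean(pc), 2),
        # share of videos judged good on BOTH axes
        "joint_good": round(sum(1 for s, p in ratings if s >= 4 and p >= 4) / len(ratings), 2),
    }

if __name__ == "__main__":
    example_ratings = [(5, 2), (4, 3), (4, 4), (3, 1), (5, 4)]
    print(summarise(example_ratings))
```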

Though human judgment remains the gold standard, it is expensive and comes with a number of caveats. Therefore automated evaluation is essential for faster and more scalable model assessments.
The paper's authors tested several video-language models, including Gemini-2.0-Flash-Exp and VideoScore, on their ability to score videos for semantic accuracy and for 'physical commonsense'.
The models again rated each video on a five-point scale, while a separate classification task determined whether physical rules were followed, violated, or unclear.
Experiments showed that existing video-language models struggled to match human judgments, mainly because of weak physical reasoning and the complexity of the prompts. To improve automated evaluation, the researchers developed their own automatic evaluator – a 7B-parameter model designed to provide more accurate predictions across three categories: semantic adherence; physical commonsense; and rule compliance – fine-tuned from the VideoCon-Physics model using 50,000 human annotations.
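The paper does not publish the evaluator's internals in a form we can reproduce here; the mock class below merely illustrates the three prediction targets that an interface of this kind would expose:

```python
# Mock interface for an automatic evaluator of the kind described above.
# Outputs are hard-coded; this is an illustration of the three prediction
# heads (semantic adherence, physical commonsense, rule classification),
# not the authors' fine-tuned model.

from typing import Dict, List

class MockAutoEvaluator:
    """Stand-in evaluator: scores semantic adherence and physical commonsense
    on a 1-5 scale, and classifies each candidate rule per video."""

    def score(self, video_path: str, caption: str) -> Dict[str, float]:
        return {"semantic_adherence": 3.0, "physical_commonsense": 2.0}

    def classify_rules(self, video_path: str, rules: List[str]) -> Dict[str, str]:
        return {rule: "violated" for rule in rules}

if __name__ == "__main__":
    ev = MockAutoEvaluator()
    print(ev.score("videos/rock.mp4", "a rock rolls down a steep hill"))
    print(ev.classify_rules("videos/rock.mp4", ["objects roll downhill under gravity"]))
```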
Data and Tests
With these tools in place, the authors tested a number of generative video systems, both through local installations and, where needed, via commercial APIs: CogVideoX-5B; VideoCrafter2; HunyuanVideo-13B; Cosmos-Diffusion; Wan2.1-14B; OpenAI Sora; and Luma Ray.
The models were prompted with upsampled captions where possible, except that Hunyuan Video and VideoCrafter2 operate under 77-token CLIP limitations, and can’t accept prompts above a certain length.
Generated videos were kept to less than six seconds, since shorter output is easier to evaluate.
The driving data came from the VideoPhy-2 dataset, which was split into a benchmark set and a training set. 590 videos were generated per model, except for Sora and Ray2, for which commensurately lower numbers of videos were generated, due to the cost factor.
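A hypothetical harness for this round of generation might look like the sketch below; generate_video() is a placeholder, and the word-count check is a crude proxy for a real CLIP tokenizer:

```python
# Hypothetical test-harness sketch for the generation round described above:
# upsampled captions are used where a model can accept them, original captions
# otherwise (e.g. for models bound by 77-token CLIP text encoders), and clips
# are kept under six seconds. generate_video() is a placeholder.

from typing import List, Tuple

CLIP_TOKEN_LIMIT = 77     # rough word-count proxy, not a real tokenizer
MAX_SECONDS = 6
VIDEOS_PER_MODEL = 590    # fewer in practice for costly API-only models

def pick_caption(original: str, upsampled: str, clip_limited: bool) -> str:
    """Fall back to the short original caption for CLIP-limited models."""
    if clip_limited and len(upsampled.split()) > CLIP_TOKEN_LIMIT:
        return original
    return upsampled

def generate_video(model: str, caption: str, seconds: int) -> str:
    """Placeholder for a local pipeline or API call; returns an output path."""
    return f"out/{model}/{abs(hash(caption)) & 0xffff}_{seconds}s.mp4"

if __name__ == "__main__":
    pairs: List[Tuple[str, str]] = [
        ("a person tosses a pizza base into the air",
         "In bright daylight, a chef in a small kitchen tosses a spinning "
         "pizza base high into the air and catches it."),
    ]
    for original, upsampled in pairs[:VIDEOS_PER_MODEL]:
        caption = pick_caption(original, upsampled, clip_limited=True)
        print(generate_video("HunyuanVideo-13B", caption, MAX_SECONDS))
```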
The initial evaluation dealt with physical activities/sports (PA) and object interactions (OI), and tested both the whole dataset and the aforementioned 'harder' subset:

Here the authors comment:
The results showed that video models struggled more with physical activities such as sports than with simpler object interactions. This suggests that improving AI-generated videos in this area will require better datasets – particularly high-quality footage of sports such as tennis, discus, baseball, and cricket.
The study also examined whether a model's physical plausibility correlated with other video quality metrics, such as aesthetics and motion smoothness. The findings revealed no strong correlation, meaning a model cannot improve its performance on VideoPhy-2 simply by generating visually appealing or fluid motion – it needs a deeper understanding of physical commonsense.
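That correlation check is easy to reproduce in miniature; the sketch below computes a Pearson coefficient over invented per-video scores, purely to illustrate the kind of comparison the authors ran:

```python
# Hypothetical sketch of the correlation check described above: per-video
# physical-commonsense scores against an aesthetics-style quality metric.
# The numbers are invented; only the Pearson computation itself is real.

from math import sqrt
from typing import List

def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

if __name__ == "__main__":
    physical_commonsense = [1.0, 2.0, 2.0, 3.0, 1.0, 4.0]
    aesthetic_quality    = [4.5, 4.0, 4.8, 4.2, 4.9, 4.1]
    # A value near zero would mirror the paper's finding that pretty, smooth
    # video is not necessarily physically plausible video.
    print(round(pearson(physical_commonsense, aesthetic_quality), 3))
```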
Though the paper provides abundant qualitative examples, few of the static examples in the PDF appear to relate to the extensive video-based examples that the authors furnish at the project site. Therefore we'll look at a small selection of the static examples, and then at some of the actual project videos.

Regarding the above qualitative test, the authors comment:
Further examples from the project site:
As I mentioned at the outset, the volume of material related to this project far exceeds what can be covered here. Therefore please refer to the source paper, project site, and related sites mentioned earlier for a truly exhaustive outline of the authors' procedures, as well as considerably more testing examples and procedural details.