Cooking Up Narrative Consistency for Long Video Generation


The recent public release of the Hunyuan Video generative AI model has intensified ongoing discussions about the potential of large multimodal vision-language models to someday create entire movies.

However, as we have observed, this is a very distant prospect at the moment, for a number of reasons. One is the very short attention window of most AI video generators, which struggle to maintain consistency even in a short single shot, let alone a series of shots.

Another is that consistent references to video content (such as explorable environments, which should not change randomly if you retrace your steps through them) can currently only be achieved in diffusion models through customization techniques such as low-rank adaptation (LoRA), which limits the out-of-the-box capabilities of foundation models.

Therefore the evolution of generative video seems set to stall unless new approaches to narrative continuity are developed.

Recipe for Continuity

With this in mind, a new collaboration between the US and China has proposed the generation of instructional cooking videos as a possible template for future narrative continuity systems.

Source: https://videoauteur.github.io/

Titled VideoAuteur, the work proposes a two-stage pipeline to generate instructional cooking videos using coherent states that combine keyframes and captions, achieving state-of-the-art results in – admittedly – an under-subscribed space.

VideoAuteur’s project page also includes a number of rather more interesting videos that use the same technique, such as a proposed trailer for a (non-existent) Marvel/DC crossover:

The page also features similarly-styled promo videos for an equally non-existent Netflix animal series and a Tesla automobile ad.

In developing VideoAuteur, the authors experimented with diverse loss functions, among other novel approaches. To develop a recipe how-to generation workflow, they also curated CookGen, the largest dataset focused on the cooking domain, featuring 200,000 video clips with an average duration of 9.5 seconds.

At an average of 768.3 words per video, CookGen is comfortably the most extensively-annotated dataset of its kind. Diverse vision/language models were used, among other approaches, to ensure that descriptions were as detailed, relevant and accurate as possible.

Cooking videos were chosen because cooking instruction walk-throughs have a structured and unambiguous narrative, making annotation and evaluation an easier task. Aside from pornographic videos (likely to enter this particular space sooner rather than later), it is difficult to think of any other genre quite as visually and narratively ‘formulaic’.

The authors state:

The new work is titled VideoAuteur: Towards Long Narrative Video Generation, and comes from eight authors across Johns Hopkins University, ByteDance, and ByteDance Seed.

Dataset Curation

To develop CookGen, which powers a two-stage generative system for producing AI cooking videos, the authors used material from the YouCook and HowTo100M collections. The authors compare the scale of CookGen to previous datasets focused on narrative development in generative video, such as the Flintstones dataset, the Pororo cartoon dataset, StoryGen, Tencent’s StoryStream, and VIST.

Source: https://arxiv.org/pdf/2501.06173

CookGen focuses on real-world narratives, particularly procedural activities such as cooking, offering clearer and easier-to-annotate stories compared to image-based comic datasets. It exceeds the largest existing dataset, StoryStream, with 150x more frames and 5x denser textual descriptions.

The researchers fine-tuned a captioning model using the methodology of LLaVA-NeXT as a base. The automatic speech recognition (ASR) pseudo-labels obtained for HowTo100M were used as ‘actions’ for each video, and then refined further by large language models (LLMs).

For instance, ChatGPT-4o was used to produce a caption dataset, and was asked to focus on subject-object interactions (such as hands handling utensils and food), object attributes, and temporal dynamics.

Since ASR scripts are likely to contain inaccuracies and to be generally ‘noisy’, Intersection-over-Union (IoU) was used as a metric to measure how closely the captions conformed to the section of the video they were addressing. The authors note that this was crucial for the creation of narrative consistency.
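
To illustrate the idea (this is not the authors' code), temporal IoU for a caption is simply the overlap between the time span the caption claims to describe and the span of the clip it is attached to, divided by their union:

```python
def temporal_iou(span_a, span_b):
    """Intersection-over-Union of two time intervals, given as (start, end) in seconds.

    A value near 1.0 means the caption's claimed time span and the clip's
    actual span line up closely; near 0.0 means they barely overlap.
    """
    start_a, end_a = span_a
    start_b, end_b = span_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# Example: an ASR-derived caption span vs. the clip it is meant to describe
print(temporal_iou((12.0, 21.5), (11.0, 20.0)))  # ~0.76
```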

The curated clips were evaluated using Fréchet Video Distance (FVD), which measures the disparity between ground truth (real-world) examples and generated examples, both with and without ground truth keyframes, arriving at a performant result:

Using FVD to evaluate the distance between videos generated with the new captions, both with and without the use of keyframes captured from the sample videos.
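
For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (conventionally extracted with an I3D network). A minimal sketch of that calculation, assuming the features have already been extracted into two arrays, might look like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both arguments are (num_videos, feature_dim) arrays of video features;
    FVD applies this formula to features from a pretrained video network.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```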

Additionally, the clips were rated both by GPT-4o and by six human annotators, following LLaVA-Hound‘s definition of ‘hallucination’ (i.e., the capacity of a model to invent spurious content).

The researchers compared the quality of their captions to those of the Qwen2-VL-72B collection, obtaining a slightly improved rating.

Comparison of FVD and human evaluation scores between Qwen2-VL-72B and the authors' collection.

Method

VideoAuteur’s generative phase is split between the Long Narrative Director (LND) and the Visual-Conditioned Video Generation Model (VCVGM).

The LND generates a sequence of visual embeddings or keyframes that characterize the narrative flow, similar to ‘key highlights’. The VCVGM then generates video clips based on these choices.

Schema for the VideoAuteur processing pipeline. The Long Narrative Video Director makes apposite selections to feed to the Seed-X-powered generative module.

The authors extensively discuss the differing merits of an interleaved image-text director and a language-centric keyframe director, and conclude that the former is the more effective approach.

The interleaved image-text director generates a sequence by interleaving text tokens and visual embeddings, using an auto-regressive model to predict the next token based on the combined context of both text and images. This ensures a close alignment between visuals and text.

In contrast, the language-centric keyframe director synthesizes keyframes using a text-conditioned diffusion model based solely on captions, without incorporating visual embeddings into the generation process.

The researchers found that while the language-centric method generates visually appealing keyframes, it lacks consistency across frames, arguing that the interleaved method achieves better scores in realism and visual consistency. They also found that this method was better able to learn a realistic visual style through training, though sometimes with some repetitive or noisy elements.
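
The structural difference between the two directors can be sketched as follows; the model interfaces here are hypothetical stand-ins, not the paper's actual API:

```python
def interleaved_director(model, global_prompt, num_steps):
    """Interleaved image-text director (sketch): each step's caption and keyframe
    embedding are predicted auto-regressively from the combined text + visual
    context accumulated so far, so later keyframes 'see' earlier ones."""
    context = [("text", global_prompt)]
    keyframes = []
    for _ in range(num_steps):
        caption = model.next_caption(context)            # conditioned on text and prior images
        context.append(("text", caption))
        embedding = model.next_image_embedding(context)  # conditioned on everything so far
        context.append(("image", embedding))
        keyframes.append((caption, embedding))
    return keyframes


def language_centric_director(llm, text_to_image, global_prompt, num_steps):
    """Language-centric director (sketch): captions are written first, then each
    keyframe is synthesized from its own caption alone, with no visual context
    carried across steps -- hence the weaker cross-frame consistency."""
    captions = llm.write_step_captions(global_prompt, num_steps)
    return [(caption, text_to_image.generate(caption)) for caption in captions]
```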

Unusually, in a research strand dominated by the co-opting of Stable Diffusion and Flux into workflows, the authors used Tencent’s SEED-X 7B-parameter multimodal LLM foundation model for their generative pipeline (though this model does leverage Stability.ai’s SDXL release of Stable Diffusion for a limited part of its architecture).

The authors state:

Though typical visual-conditioned generative pipelines of this sort often use initial keyframes as a starting point for model guidance, VideoAuteur expands on this paradigm by generating multi-part visual states in a semantically coherent latent space, avoiding the potential bias of basing further generation solely on ‘starting frames’.

Schema for the use of visual state embeddings as a superior conditioning method.
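
In other words, each clip is conditioned on its own caption and its own regressed visual state, rather than every clip being chained off a single starting frame. A minimal sketch of this idea, with an assumed interface rather than the authors' code:

```python
def generate_long_video(video_model, narrative_states):
    """Sketch of visual-state conditioning (assumed interface, not the authors' code).

    `narrative_states` is the director's output: an ordered list of
    (caption, visual_embedding) pairs. Each clip receives both its caption and
    its visual state as conditions, so continuity does not depend solely on
    propagating one initial keyframe forward.
    """
    clips = []
    for caption, visual_state in narrative_states:
        clip = video_model.generate_clip(text=caption, visual_condition=visual_state)
        clips.append(clip)
    return clips  # to be concatenated in order into the final long video
```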

Tests

In line with the methods of SeedStory, the researchers use SEED-X to apply LoRA fine-tuning to their narrative dataset, enigmatically describing the result as a ‘Sora-like model’, pre-trained on large-scale video/text couplings, and capable of accepting both visual and text prompts and conditions.

32,000 narrative videos were used for model development, with 1,000 held aside as validation samples. The videos were cropped to 448 pixels on the short side and then center-cropped to 448x448px.
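
That preprocessing corresponds to a standard resize-then-center-crop transform, applied per frame; for instance, in torchvision:

```python
from torchvision import transforms

# Scale the short side to 448 px (preserving aspect ratio), then take a
# 448x448 center crop, as described for the training videos.
preprocess = transforms.Compose([
    transforms.Resize(448),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
])
```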

For training, narrative generation was evaluated mainly on the YouCook2 validation set. The HowTo100M set was used for data quality evaluation and also for image-to-video generation.

For visual conditioning loss, the authors used diffusion loss from DiT and a 2024 work based around Stable Diffusion.
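
The standard form of such a diffusion objective (a generic sketch, not the paper's exact implementation, with an assumed `noise_schedule` object exposing `num_steps` and per-step `alpha_bar` values) corrupts clean latents at a random timestep and trains the model to predict the injected noise:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, noise_schedule):
    """Generic noise-prediction diffusion loss over clean latents x0, conditioned on `cond`."""
    batch = x0.shape[0]
    t = torch.randint(0, noise_schedule.num_steps, (batch,), device=x0.device)
    noise = torch.randn_like(x0)
    alpha_bar = noise_schedule.alpha_bar[t].view(batch, *([1] * (x0.dim() - 1)))
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise  # forward noising step
    pred = model(x_t, t, cond)                                      # predict the added noise
    return F.mse_loss(pred, noise)
```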

To prove their contention that interleaving is a superior approach, the authors pitted VideoAuteur against several methods that rely solely on text-based input: EMU-2, SEED-X, SDXL and FLUX.1-schnell (FLUX.1-s).

Given a global prompt, 'Step-by-step guide to cooking mapo tofu', the interleaved director generates actions, captions, and image embeddings sequentially to narrate the process. The first two rows show keyframes decoded from EMU-2 and SEED-X latent spaces. These images are realistic and consistent but less polished than those from advanced models like SDXL and FLUX.


The authors state:

Human evaluation further confirms the authors’ contention regarding the improved performance of the interleaved approach, with interleaved methods achieving the best scores in a survey.

Comparisons of approaches from a human study conducted for the paper.

However, we note that language-centric approaches achieve the best scores in some categories. The authors contend, nonetheless, that this is not the central issue in the generation of long narrative videos.

Conclusion

The most popular strand of research in regard to this challenge, i.e., narrative consistency in long-form video generation, is concerned with single images. Projects of this sort include DreamStory, StoryDiffusion, TheaterGen and NVIDIA’s ConsiStory.

In a way, VideoAuteur also falls into this ‘static’ category, since it makes use of seed images from which clip-sections are generated. However, the interleaving of video and semantic content brings the method a step closer to a practical pipeline.

 
