Why Can’t Generative Video Systems Make Complete Movies?


The arrival and rapid progress of generative AI video has prompted many casual observers to predict that machine learning will prove the death of the movie industry as we know it – that, instead, individual creators will be able to produce Hollywood-style blockbusters at home, on local or cloud-based GPU systems.

Is this possible? And even if it is possible, is it as imminent as so many imagine?

That individuals will eventually be able to create movies, in the form that we know them, with consistent characters, narrative continuity and total photorealism, is quite possible – and perhaps even inevitable.

However, there are several truly fundamental reasons why this is not likely to happen with video systems based on Latent Diffusion Models (LDMs).

This last point is significant because, at the moment, that category includes every popular text-to-video (T2V) and image-to-video (I2V) system available, including Minimax, Kling, Sora, Imagen, Luma, Amazon Video Generator, Runway ML and Kaiber (and, as far as we can discern, Adobe Firefly’s pending video functionality), among many others.

Here, we are considering the prospect of true full-length gen-AI productions, created by individuals, with consistent characters, cinematography and visual effects at least on a par with the current state of the art in Hollywood.

Let’s take a look at some of the biggest practical roadblocks involved.

1: You Can’t Get an Accurate Follow-on Shot

Narrative inconsistency is the biggest of these roadblocks. The fact is that no currently available video generation system can produce a truly accurate ‘follow-on’ shot.

This is because the denoising diffusion model at the heart of these systems relies on random noise, and this core principle is not amenable to reinterpreting exactly the same content twice (i.e., from different angles, or by developing the previous shot into a follow-on shot that maintains consistency with it).

Where text prompts are used, alone or together with uploaded ‘seed’ images (multimodal input), the tokens derived from the prompt will elicit semantically appropriate content from the model’s trained latent space.
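To make this concrete, below is a minimal sketch using the open-source diffusers library (the commercial systems listed above are closed, but rest on the same denoising principle); the model ID and prompt are purely illustrative. The same prompt, sampled from two different noise seeds, yields two visibly different ‘takes’, which is precisely why a shot cannot simply be regenerated on demand.

```python
# Minimal sketch: assumes the open-source 'diffusers' and 'torch' packages,
# with an illustrative model ID; commercial T2V systems are closed, but
# share the same denoising-from-random-noise principle.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a detective in a grey coat walking down a rainy street at night"

# Two different seeds -> two different initial noise tensors, so the
# denoising trajectory (and the resulting identity, layout and lighting)
# differs, even though the prompt is identical.
take_1 = pipe(prompt, generator=torch.Generator("cuda").manual_seed(1)).images[0]
take_2 = pipe(prompt, generator=torch.Generator("cuda").manual_seed(2)).images[0]

take_1.save("shot_take_1.png")
take_2.save("shot_take_2.png")  # same prompt, visibly different character and scene
```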

However, further hindered by the ‘random noise’ factor, it will rarely do so in exactly the same way twice.

This means that the identities of people in the video will tend to shift, and that objects and environments will not match the initial shot.

This is why viral clips depicting extraordinary visuals and Hollywood-level output tend to be either single shots or a ‘showcase montage’ of the system’s capabilities, in which each shot features different characters and environments.

The implication in these collections of video generations (which may be disingenuous in the case of commercial systems) is that the underlying system can create contiguous and consistent narratives.

The analogy being exploited here is the movie trailer, which features only a minute or two of footage from the film, but gives the audience reason to believe that the entire film exists.

The only systems that currently offer narrative consistency in a diffusion model are those that produce still images. These include NVIDIA’s ConsiStory, and various projects in the scientific literature, such as TheaterGen, DreamStory, and StoryDiffusion.

In theory, one could use a better version of such systems (none of the above are truly consistent) to create a series of image-to-video shots, which could be strung together into a sequence.

At the current state of the art, this approach does not produce plausible follow-on shots; and, in any case, we have already departed from the dream by adding a layer of complexity.

We can, additionally, use Low Rank Adaptation (LoRA) models, specifically trained on characters, objects or environments, to maintain better consistency across shots.

However, if a character needs to appear in a new costume, an entirely new LoRA will usually have to be trained, one that embodies the character wearing that clothing (and although sub-concepts such as ‘red dress’ can be trained into individual LoRAs, together with apposite images, they are not always easy to work with).
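As a rough illustration of the LoRA workflow, the sketch below assumes the open-source diffusers library, together with a hypothetical character LoRA file and trigger token (‘detective_character.safetensors’, ‘sks_detective’); commercial systems, where they support this at all, expose it in their own ways.

```python
# Sketch only: assumes 'diffusers' with LoRA support, plus a hypothetical
# character LoRA file and trigger token trained separately by the user.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Load a character LoRA to bias every generation toward a single identity.
pipe.load_lora_weights(".", weight_name="detective_character.safetensors")

# Note: a costume change usually needs its own LoRA (or a stacked concept
# LoRA), because the character LoRA encodes the clothing it was trained on.
image = pipe(
    "photo of sks_detective walking down a rainy street at night",
    cross_attention_kwargs={"scale": 0.8},  # LoRA influence strength
    generator=torch.Generator("cuda").manual_seed(1),
).images[0]
image.save("consistent_character_shot.png")
```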

A scene of this kind, containing roughly 4-8 shots, can be filmed in a single morning using conventional film-making procedures; at the current state of the art in generative AI, it potentially represents weeks of work, multiple trained LoRAs (or other adjunct systems), and a considerable amount of post-processing.

Alternatively, video-to-video can be used, where mundane or CGI footage is transformed through text prompts into alternative interpretations. Runway offers such a system, for instance.

There are two problems here. Firstly, you are already having to create the core footage, so you are already making the movie twice, even if you are using a synthetic system such as Unreal’s MetaHuman.

Further, if you create CGI models and use them in a video-to-video transformation, their consistency across shots cannot be relied upon.

This is because video diffusion models do not see the ‘big picture’: rather, they create a new frame based on the previous frame(s) and, in some cases, consider a nearby future frame; but, to compare the process to a chess game, they cannot think ‘ten moves ahead’, and cannot remember ten moves behind.
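The sketch below is a conceptual outline rather than any particular product’s architecture: a windowed generation loop in which each new frame is conditioned only on the last few frames, so that anything established earlier than that window simply cannot be ‘remembered’.

```python
# Conceptual outline only -- not any specific product's architecture.
import random

CONTEXT_WINDOW = 4  # number of recent frames the denoiser can 'see'

def denoise_next_frame(prompt, context_frames, noise):
    # Stand-in for one reverse-diffusion pass: the real model would denoise
    # 'noise' conditioned on the prompt and ONLY on 'context_frames'.
    return {"prompt": prompt, "conditioned_on": list(context_frames), "noise": noise}

def generate_shot(prompt, first_frame, num_frames):
    frames = [first_frame]
    for _ in range(num_frames - 1):
        context = frames[-CONTEXT_WINDOW:]  # anything older is invisible to the model
        frames.append(denoise_next_frame(prompt, context, random.random()))
    return frames

# A costume detail fixed in frame 1 has no direct influence on frame 40 --
# like a chess engine that can only see the last four moves.
shot = generate_shot("the detective exits the car and crosses the street", "frame_0", 40)
```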

Secondly, a diffusion model will still struggle to maintain a consistent appearance across the shots, even if you include multiple LoRAs for character, environment and lighting style, for the reasons mentioned at the beginning of this section.

2: You Cannot Edit a Shot Easily

If you depict a character walking down a street using old-school CGI methods, and you decide that you want to change some aspect of the shot, you can adjust the model and render it again.

If it is a real-life shoot, you simply reset and shoot it again, with the apposite changes.

However, if you produce a gen-AI video shot that you love, but want to change one aspect of it, you can only do so through painstaking post-production methods developed over the last 30-40 years: CGI, rotoscoping, modeling and matting, all of them labor-intensive, expensive and time-consuming procedures.

Because of the way that diffusion models work, simply changing one aspect of a text prompt (even in a multimodal prompt, where you provide a complete source seed image) will change multiple aspects of the generated output, leading to a game of prompting ‘whack-a-mole’.
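A minimal sketch of the problem, again assuming the open-source diffusers library with an illustrative model ID: even with an identical seed, changing a single word of the prompt perturbs the entire denoising trajectory, not just the element you intended to edit.

```python
# Sketch: same seed, one word changed -- the whole output shifts, not just
# the edited detail. Assumes 'diffusers'/'torch'; model ID is illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

seed = 42  # identical starting noise for both renders

original = pipe(
    "a woman in a red coat crossing a busy street, overcast light",
    generator=torch.Generator("cuda").manual_seed(seed),
).images[0]

edited = pipe(
    "a woman in a blue coat crossing a busy street, overcast light",
    generator=torch.Generator("cuda").manual_seed(seed),
).images[0]

# In practice 'edited' does not simply recolor the coat: the face, traffic,
# framing and background typically shift as well -- the 'whack-a-mole' effect.
original.save("take_red.png")
edited.save("take_blue.png")
```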

3: You Can’t Depend on the Laws of Physics

Traditional CGI methods offer a variety of algorithmic, physics-based models that can simulate phenomena such as fluid dynamics, gaseous movement, inverse kinematics (the accurate modeling of human movement), cloth dynamics, explosions, and various other real-world effects.
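To see the contrast, consider the toy simulation below (a deliberately simple projectile under gravity, not production CGI code): because the physics is algorithmic and deterministic, re-running it with the same inputs reproduces exactly the same motion, a guarantee that sampled motion priors cannot offer.

```python
# Toy illustration of why algorithmic physics is dependable: a deterministic
# simulation reproduces exactly the same trajectory on every run.

GRAVITY = -9.81  # m/s^2

def simulate_projectile(v0_x, v0_y, dt=1 / 24, duration=2.0):
    """Step a particle under constant gravity; one step per film frame (24 fps)."""
    x, y, vy = 0.0, 0.0, v0_y
    trajectory = []
    for _ in range(int(duration / dt)):
        x += v0_x * dt
        vy += GRAVITY * dt
        y += vy * dt
        trajectory.append((round(x, 3), round(max(y, 0.0), 3)))
    return trajectory

# Identical inputs always yield an identical, physically consistent arc --
# unlike a diffusion model, which samples motion from learned priors.
assert simulate_projectile(3.0, 5.0) == simulate_projectile(3.0, 5.0)
```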

However, diffusion-based methods, as we have seen, have short memories, and also a limited range of motion priors (examples of such actions, included in the training dataset) to draw on.

In an earlier version of OpenAI’s landing page for the acclaimed Sora generative system, the company conceded that Sora has limitations in this regard (though the text in question has since been removed).

Practical use of various API-based generative video systems reveals similar limitations in depicting accurate physics. However, certain common physical phenomena, such as explosions, appear to be better represented in their training datasets.

Some motion prior embeddings, whether trained into the generative model or fed in from a source video, take a while to play out (for example, a person performing a complex and non-repetitive dance sequence in an elaborate costume), and, once again, the diffusion model’s myopic window of attention is likely to transform the content (facial identity, costume details, etc.) by the time the motion has completed. However, LoRAs can mitigate this, to an extent.

Fixing It in Post

There are other shortcomings in pure ‘single user’ AI video generation, such as the difficulty these systems have in depicting rapid movements, and the general, and much more pressing, problem of obtaining temporal consistency in output video.

Additionally, creating specific facial performances in generative video is pretty much a matter of luck, as is lip-sync for dialogue.

In both cases, the use of ancillary systems such as LivePortrait and AnimateDiff is becoming very popular in the VFX community, since they allow the transposition of at least broad facial expression and lip-sync onto existing generated output.

Further, a myriad of complex solutions, incorporating tools such as the Stable Diffusion GUI ComfyUI and the professional compositing and manipulation application Nuke, as well as latent space manipulation, allow AI VFX practitioners to gain greater control over facial expression and disposition.

Though he describes the process of facial animation in ComfyUI as ‘torture’, VFX professional Francisco Contreras has developed such a procedure, which allows the imposition of lip phonemes and other aspects of facial and head depiction.

Conclusion

None of this is promising for the prospect of a single user generating coherent and photorealistic blockbuster-style full-length movies, with realistic dialogue, lip-sync, performances, environments and continuity.

Furthermore, the obstacles described here, at least in relation to diffusion-based generative video models, are not necessarily solvable ‘any minute now’, despite forum comments and media attention that suggest otherwise. The constraints described appear to be intrinsic to the architecture.

In AI synthesis research, as in all scientific research, good ideas periodically dazzle us with their potential, only for further investigation to unearth their fundamental limitations.

In the generative/synthesis space, this has already happened with Generative Adversarial Networks (GANs) and Neural Radiance Fields (NeRF), both of which ultimately proved very difficult to instrumentalize into performant commercial systems, despite years of academic research towards that goal. These technologies now show up most frequently as adjunct components in alternative architectures.

Much as movie studios may hope that training on legitimately licensed movie catalogs could eliminate VFX artists, AI is, for the moment, actually adding roles to the workforce.

Whether diffusion-based video systems can really be transformed into narratively consistent and photorealistic movie generators, or whether the whole endeavor is just another alchemic pursuit, should become apparent over the next 12 months.

It may be that we need an entirely new approach; or it may be that Gaussian Splatting (GSplat), a technique developed in the early 1990s that has recently taken off in the image synthesis space, offers a viable alternative to diffusion-based video generation.

Since GSplat took 34 years to come to the fore, it is possible, too, that older contenders such as NeRF and GANs, and even latent diffusion models themselves, are yet to have their day.

 
