Generating Better AI Video From Just Two Images


Video frame interpolation (VFI) is an open problem in generative video research. The challenge is to generate intermediate frames between two existing frames in a video sequence.

Sources: https://film-net.github.io/ and https://arxiv.org/pdf/2202.04901

Broadly speaking, this method dates back over a century, and has been used in traditional animation ever since. In that context, master ‘keyframes’ would be generated by a principal animation artist, while the work of ‘tweening’ the intermediate frames could be carried out by other staff, as a more menial task.

Prior to the rise of generative AI, frame interpolation was used in projects such as Real-Time Intermediate Flow Estimation (RIFE), Depth-Aware Video Frame Interpolation (DAIN), and Google’s Frame Interpolation for Large Motion (FILM – see above) to increase the frame rate of an existing video, or to enable artificially-generated slow-motion effects. This is accomplished by splitting out the existing frames of a clip and generating estimated intermediate frames.
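To illustrate the classical, pre-generative end of this technique, below is a minimal sketch of flow-based ‘tweening’ between two frames, using OpenCV’s Farneback optical flow. It is a crude baseline rather than a reconstruction of what RIFE, DAIN or FILM actually do internally, and the file names are placeholders; real systems also warp the second frame and handle occlusions.

```python
import cv2
import numpy as np

def interpolate_midframe(frame_a, frame_b, t=0.5):
    """Synthesize an approximate frame at time t between frame_a and frame_b
    by warping frame_a along dense optical flow toward frame_b."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    # Classical (non-learned) dense flow from frame_a to frame_b
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward-warp: sample frame_a at positions displaced by a fraction t of the flow
    map_x = (grid_x - t * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - t * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame_a, map_x, map_y, cv2.INTER_LINEAR)

a = cv2.imread("frame_000.png")
b = cv2.imread("frame_001.png")
cv2.imwrite("frame_000_5.png", interpolate_midframe(a, b))
```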

VFI is also used in the development of better video codecs, and, more generally, in optical flow-based systems (including generative systems) that use advance knowledge of coming keyframes to optimize and shape the interstitial content that precedes them.

End Frames in Generative Video Systems

Modern generative systems such as Luma and Kling allow users to specify a start and an end frame, and may perform this task by analyzing keypoints in the two images and estimating a trajectory between them.
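Exactly how commercial systems like Kling implement this is not public; the sketch below merely illustrates the general idea, using classical ORB keypoint matching and straight-line interpolation of the matched points (the file names, feature counts and interpolation strategy are all illustrative assumptions).

```python
import cv2
import numpy as np

start = cv2.imread("keyframe_start.png", cv2.IMREAD_GRAYSCALE)
end = cv2.imread("keyframe_end.png", cv2.IMREAD_GRAYSCALE)

# Detect and match keypoints between the two user-provided keyframes
orb = cv2.ORB_create(nfeatures=500)
kp_s, des_s = orb.detectAndCompute(start, None)
kp_e, des_e = orb.detectAndCompute(end, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_s, des_e)
matches = sorted(matches, key=lambda m: m.distance)[:50]   # keep the strongest matches

pts_start = np.float32([kp_s[m.queryIdx].pt for m in matches])
pts_end = np.float32([kp_e[m.trainIdx].pt for m in matches])

# A naive straight-line trajectory for every matched keypoint across the clip;
# the interpolated positions could then condition each intermediate frame
num_frames = 16
trajectory = np.stack([(1 - i / num_frames) * pts_start + (i / num_frames) * pts_end
                       for i in range(num_frames + 1)])
print(trajectory.shape)   # (num_frames + 1, num_matches, 2)
```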

As we will see in the examples below, providing a ‘closing’ keyframe better allows the generative video system (in this case, Kling) to maintain aspects such as identity, even if the results are not perfect (particularly with large motions).

Source: https://www.youtube.com/watch?v=8oylqODAaH8

In the above example, the person’s identity is consistent between the two user-provided keyframes, leading to a relatively consistent video generation.

Where only the starting frame is provided, the generative system’s window of attention is not usually large enough to ‘remember’ what the person looked like at the start of the video. Rather, the identity is likely to shift a little with each frame, until all resemblance is lost. In the example below, a starting image was uploaded, and the person’s movement was guided by a text prompt:

We can see that the actor’s resemblance is not resilient to the instructions, since the generative system does not know what he would look like if he were smiling, and he is not smiling in the seed image (the only available reference).

The majority of viral generative clips are carefully curated to de-emphasize these shortcomings. However, the progress of temporally consistent generative video systems may depend on new developments from the research sector in regard to frame interpolation, since the only possible alternative is a dependence on traditional CGI as a driving, ‘guide’ video (and even in this case, consistency of texture and lighting are currently difficult to achieve).

Moreover, the slowly-iterative nature of deriving a new frame from a small group of recent frames makes it very difficult to achieve large and bold motions. This is because an object that is moving rapidly across a frame may transit from one side to the other in the space of a single frame, contrary to the more gradual movements on which the system is likely to have been trained.

Likewise, a significant and bold change of pose may lead not only to identity shift, but to vivid non-congruities:

Framer

This brings us to an interesting recent paper from China, which claims to have achieved a new state-of-the-art in authentic-looking frame interpolation – and which is the first of its kind to offer drag-based user interaction.

Source: https://www.youtube.com/watch?v=4MPGKgn7jRc

Drag-centric applications have become frequent in the literature lately, as the research sector struggles to provide instrumentalities for generative systems that are not based on the fairly crude results obtained by text prompts.

The new system, titled Framer, can not only follow the user-guided drag, but also has a more conventional ‘autopilot’ mode. Besides conventional tweening, the system is capable of producing time-lapse simulations, as well as morphing and novel views of the input image.

Source: https://arxiv.org/pdf/2410.18978

In regard to the production of novel views, Framer crosses over a bit into the territory of Neural Radiance Fields (NeRF) – though requiring only two images, whereas NeRF generally requires six or more image input views.

In tests, Framer, which is founded on Stability.ai’s Stable Video Diffusion latent diffusion generative video model, was able to outperform approximated rival approaches in a user study.

At the time of writing, the code is set to be released on GitHub. Video samples (from which the above images are derived) can be found on the project site, and the researchers have also released a YouTube video.

The new paper is titled Framer: Interactive Frame Interpolation, and comes from nine researchers across Zhejiang University and the Alibaba-backed Ant Group.

Method

Framer uses keypoint-based interpolation in either of its two modalities, wherein the input image is evaluated for basic topology, and ‘movable’ points are assigned where necessary. In effect, these points are analogous to facial landmarks in ID-based systems, but generalize to any surface.

The researchers fine-tuned Stable Video Diffusion (SVD) on the OpenVid-1M dataset, adding an additional last-frame synthesis capability. This facilitates a trajectory-control mechanism (top right in the schema image below) that can evaluate a path toward the end frame (or back from it).

Schema for Framer.
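The paper’s precise conditioning scheme is not reproduced here; the snippet below is only a minimal sketch of the general idea of injecting both the first-frame and last-frame latents as conditioning for the denoiser. The tensor shapes, the per-frame blending and the channel concatenation are assumptions for illustration, not Framer’s actual code.

```python
import torch

# Toy dimensions: batch, frames, latent channels, height, width
B, T, C, H, W = 1, 14, 4, 40, 64

noisy_latents = torch.randn(B, T, C, H, W)   # video latents at the current noise level
first_latent = torch.randn(B, 1, C, H, W)    # VAE latent of the start keyframe
last_latent = torch.randn(B, 1, C, H, W)     # VAE latent of the end keyframe

# Standard SVD conditions only on the first frame, repeated across time.
# A last-frame-aware variant can blend the two anchors per time step, so that
# early frames lean on the start image and late frames lean on the end image.
weights = torch.linspace(0.0, 1.0, T).view(1, T, 1, 1, 1)
cond_latents = (1.0 - weights) * first_latent + weights * last_latent

# The conditioning is concatenated with the noisy latents along the channel
# axis before being passed to the denoising U-Net (mirroring SVD's image conditioning)
unet_input = torch.cat([noisy_latents, cond_latents], dim=2)   # (B, T, 2C, H, W)
print(unet_input.shape)
```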

Regarding the addition of last-frame conditioning, the authors state:

For drag-based functionality, the trajectory module leverages the Meta AI-led CoTracker framework, which evaluates a profusion of possible paths ahead. These are slimmed down to between 1 and 10 possible trajectories.
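How the candidate paths are whittled down is not detailed here; one plausible way to reduce a dense set of tracked points to a handful of usable trajectories is to cluster the whole paths and keep one representative per cluster. The sketch below (with dummy data, and k-means as the clustering choice) is an assumption, not CoTracker’s or Framer’s actual selection logic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Dense candidate tracks: (num_tracks, num_frames, 2) pixel coordinates,
# e.g. as produced by a point tracker such as CoTracker (dummy data here)
tracks = np.random.rand(400, 16, 2) * np.array([1280, 720])

def slim_tracks(tracks: np.ndarray, max_tracks: int = 10) -> np.ndarray:
    """Reduce a dense set of point tracks to at most `max_tracks`
    representative trajectories by clustering whole paths."""
    n = min(max_tracks, len(tracks))
    flat = tracks.reshape(len(tracks), -1)          # each track as one feature vector
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(flat)
    reps = []
    for k in range(n):
        members = np.where(labels == k)[0]
        centre = flat[members].mean(axis=0)
        # keep the track closest to each cluster centre
        reps.append(tracks[members[np.argmin(np.linalg.norm(flat[members] - centre, axis=1))]])
    return np.stack(reps)

selected = slim_tracks(tracks, max_tracks=10)
print(selected.shape)    # (10, 16, 2)
```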

The obtained point coordinates are then transformed through a methodology inspired by the DragNUWA and DragAnything architectures. This yields a representation which individuates the target areas for movement.
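Drag-based architectures in this family commonly rasterize sparse control points into dense per-frame Gaussian maps that mark the regions to be moved. The sketch below illustrates that step in isolation; the resolution, sigma and layout are illustrative assumptions rather than the paper’s actual representation.

```python
import numpy as np

def points_to_gaussian_map(points, height=320, width=512, sigma=10.0):
    """Rasterize sparse (x, y) control points into a dense map in which
    each point becomes a Gaussian blob marking a region to be moved."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return np.clip(heat, 0.0, 1.0)

# One map per frame, built from the interpolated trajectory positions
trajectory = [(100 + 20 * t, 160) for t in range(8)]   # a point drifting rightward
maps = np.stack([points_to_gaussian_map([p]) for p in trajectory])
print(maps.shape)    # (8, 320, 512)
```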

Subsequently, the data is fed to the conditioning mechanisms of ControlNet, an ancillary conformity system originally designed for Stable Diffusion, and since adapted to other architectures.

For autopilot mode, feature matching is initially accomplished via SIFT, which interprets a trajectory that can then be passed to an auto-updating mechanism inspired by DragGAN and DragDiffusion.

Schema for point trajectory estimation in Framer.
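That initial SIFT-matching step can be sketched as follows with OpenCV; the downstream ‘auto-updating’ stage inspired by DragGAN and DragDiffusion is not shown, and the file names are placeholders.

```python
import cv2
import numpy as np

img_a = cv2.imread("input_start.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("input_end.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

# Ratio-test matching (Lowe's test) to keep only confident correspondences
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des_a, des_b, k=2) if m.distance < 0.75 * n.distance]

src = np.float32([kp_a[m.queryIdx].pt for m in good])
dst = np.float32([kp_b[m.trainIdx].pt for m in good])
# Each (src, dst) pair defines the start and end of one candidate trajectory
print(f"{len(good)} matched keypoints available as trajectory seeds")
```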

Data and Tests

For the fine-tuning of Framer, the spatial attention and residual blocks were frozen, and only the temporal attention layers and temporal residual blocks were trained.

The model was trained for 10,000 iterations under AdamW, at a learning rate of 1e-4, and a batch size of 16. Training took place across 16 NVIDIA A100 GPUs.
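In PyTorch terms, that division of labour looks roughly like the sketch below: freeze everything except the temporal layers, then hand only those parameters to AdamW at the reported learning rate. The module here is a toy stand-in; SVD’s real attribute names and training loop differ.

```python
import torch
from torch import nn

class ToyBlock(nn.Module):
    """Stand-in for one U-Net block with spatial and temporal sub-layers."""
    def __init__(self, dim=64):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.spatial_res = nn.Linear(dim, dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal_res = nn.Linear(dim, dim)

unet = nn.ModuleList([ToyBlock() for _ in range(4)])

# Freeze the spatial layers; train only the temporal ones (mirroring the split
# described above, though the real SVD attribute names differ)
trainable = []
for name, param in unet.named_parameters():
    if "temporal" in name:
        param.requires_grad = True
        trainable.append(param)
    else:
        param.requires_grad = False

# Hyperparameters as reported: AdamW, learning rate 1e-4 (batch size 16, 10,000 iterations)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"{sum(p.numel() for p in trainable):,} trainable parameters")
```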

Since prior approaches to the problem do not offer drag-based editing, the researchers opted to compare Framer’s autopilot mode to the standard functionality of older offerings.

The frameworks tested for the category of current diffusion-based video generation systems were LDMVFI, DynamiCrafter, and SVDKFI. For ‘traditional’ video systems, the rival frameworks were AMT, RIFE, FLAVR, and the aforementioned FILM.

Along with the user study, tests were conducted over the DAVIS and UCF101 datasets.

Qualitative tests can only be evaluated by the judgment of the research team and by user studies. However, the paper notes, traditional quantitative metrics are largely unsuited to the task at hand:

Despite this, the researchers conducted quantitative tests with several popular metrics:

Quantitative results for Framer vs. rival systems.

The authors note that, despite having the odds stacked against it, Framer still achieves the best FVD score among the methods tested.

Below are the paper’s sample results for a qualitative comparison:

Qualitative comparison against former approaches.

The authors comment:

For the user study, the researchers gathered 20 participants, who assessed 100 randomly-ordered video results from the various methods tested. Thus, 1,000 rankings were obtained, evaluating the most ‘realistic’ offerings:

Results from the user study.

As can be seen from the graph above, users overwhelmingly favored results from Framer.

The project’s accompanying YouTube video outlines some of the other potential uses for Framer, including morphing and cartoon in-betweening – where the whole concept began.

Conclusion

It is difficult to over-emphasize how important this challenge currently is for the task of AI-based video generation. To date, older solutions such as FILM and the (non-AI) EbSynth have been used, by both amateur and professional communities, for tweening between frames; but these solutions come with notable limitations.

Because of the disingenuous curation of official example videos for new T2V frameworks, there is a widespread public misconception that machine learning systems can accurately infer geometry in motion without recourse to guidance mechanisms such as 3D morphable models (3DMMs), or other ancillary approaches, such as LoRAs.

To be fair, tweening itself, even when perfectly executed, only constitutes a ‘hack’ or cheat for this problem. However, since it is often easier to produce two well-aligned frame images than to effect guidance via text prompts or the current range of alternatives, it is good to see iterative progress on an AI-based version of this older method.
