MagicDance: Realistic Human Dance Video Generation

Computer vision is one of the most discussed fields within the AI industry because of its potential applications across a wide range of real-time tasks. In recent years, computer vision frameworks have advanced rapidly, with modern models now capable of analyzing facial expressions, objects, and much more in real-time scenarios. Despite these capabilities, human motion transfer remains a formidable challenge for computer vision models. This task involves retargeting facial and body motions from a source image or video to a target image or video. Human motion transfer is widely used in computer vision for styling images or videos, editing multimedia content, digital human synthesis, and even generating data for perception-based frameworks.

In this article, we focus on MagicDance, a diffusion-based model designed to revolutionize human motion transfer. The MagicDance framework specifically aims to transfer 2D human facial expressions and motions onto challenging human dance videos. Its goal is to generate novel pose-sequence-driven dance videos for specific target identities while keeping the identity unchanged. The MagicDance framework employs a two-stage training strategy that disentangles human motion from appearance aspects such as skin tone, facial expressions, and clothing. We'll delve into the MagicDance framework, exploring its architecture, functionality, and performance compared with other state-of-the-art human motion transfer frameworks. Let's dive in.

As mentioned earlier, human motion transfer is one of the most complex computer vision tasks because of the sheer difficulty of transferring human motions and expressions from the source image or video to the target image or video. Traditionally, computer vision frameworks have achieved human motion transfer by training a task-specific generative model, such as a GAN (Generative Adversarial Network), on target datasets of facial expressions and body poses. Although generative models deliver satisfactory results in some cases, they typically suffer from two major limitations.

  1. They rely heavily on an image-warping component, and as a result they often struggle to interpolate body parts that are invisible in the source image, whether because of a change in perspective or self-occlusion. 
  2. They cannot generalize to externally sourced images, which limits their applications, especially in real-time, in-the-wild scenarios. 

Modern diffusion models have demonstrated exceptional image generation capabilities under a wide range of conditions, and by learning from web-scale image datasets they now produce compelling visuals on an array of downstream tasks such as video generation and image inpainting. Owing to these capabilities, diffusion models are a natural fit for human motion transfer. Although diffusion models can be applied to human motion transfer, they still have limitations, whether in the quality of the generated content, in identity preservation, or in temporal consistency, as a result of constraints in model design and training strategy. Moreover, diffusion-based models have shown no significant advantage over GAN frameworks in terms of generalizability. 

To overcome the hurdles faced by diffusion- and GAN-based frameworks on human motion transfer tasks, developers have introduced MagicDance, a novel framework that aims to exploit the potential of diffusion models for human motion transfer, demonstrating an unprecedented level of identity preservation, superior visual quality, and domain generalizability. At its core, the fundamental idea of the MagicDance framework is to split the problem into two stages: appearance control and motion control, the two capabilities an image diffusion framework needs in order to deliver accurate motion transfer outputs. 

The above figure gives a brief overview of the MagicDance framework. As can be seen, the framework builds on the Stable Diffusion model and adds two components: an Appearance Control Model and a Pose ControlNet. The former provides appearance guidance to the SD model from a reference image via attention, whereas the latter provides expression/pose guidance to the diffusion model from a conditioning image or video. The framework also employs a multi-stage training strategy to learn these sub-modules effectively and to disentangle pose control from appearance. 
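For readers who like to see the flow in code, here is a minimal, self-contained sketch of how the two conditioning branches could feed the denoising UNet. The toy module names and shapes are our own illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# A minimal sketch of the conditioning flow described above. Tiny convolutions
# stand in for the real networks (Stable Diffusion UNet, Appearance Control
# Model, Pose ControlNet); names and shapes are illustrative only.

class ToyBranch(nn.Module):
    """Stand-in for either conditioning branch."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.net(x)

class ToyUNet(nn.Module):
    """Stand-in for the Stable Diffusion noise predictor."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, noisy_latents, appearance_feats, pose_feats):
        # In the real model, appearance guidance enters via attention and pose
        # guidance via ControlNet residuals; a sum keeps the sketch runnable.
        return self.net(noisy_latents + appearance_feats + pose_feats)

unet = ToyUNet()
appearance_control = ToyBranch()   # fed with the reference image latents
pose_controlnet = ToyBranch()      # fed with the pose / landmark conditioning

noisy_latents = torch.randn(1, 4, 64, 64)
reference_latents = torch.randn(1, 4, 64, 64)
pose_latents = torch.randn(1, 4, 64, 64)

noise_pred = unet(
    noisy_latents,
    appearance_control(reference_latents),  # appearance guidance
    pose_controlnet(pose_latents),          # pose/expression guidance
)
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```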

In summary, the MagicDance framework:

  1. Is a novel and effective framework consisting of appearance-disentangled pose control and appearance control pretraining.  
  2. Is capable of generating realistic human facial expressions and human motion under the control of pose condition inputs and reference images or videos. 
  3. Generates appearance-consistent human content by introducing a Multi-Source Attention Module that provides accurate guidance to the Stable Diffusion UNet. 
  4. Can be used as a convenient extension or plug-in for the Stable Diffusion framework, and remains compatible with existing model weights since it requires no additional fine-tuning of those parameters. 

Moreover, the MagicDance framework shows exceptional generalization capabilities for both appearance and motion. 

  1. Appearance generalization: the MagicDance framework demonstrates superior capability in generating diverse appearances. 
  2. Motion generalization: the MagicDance framework can also generate a wide range of motions. 

MagicDance: Objectives and Architecture

Given a reference image, either of a real human or a stylized figure, the primary objective of the MagicDance framework is to generate an output image or video conditioned on that reference and on the pose inputs {P, F}, where P represents the human pose skeleton and F the facial landmarks. The generated output should preserve the appearance and identity of the humans involved, along with the background content present in the reference image, while adopting the pose and expressions defined by the pose inputs. 
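To make the notation concrete, here is a hypothetical sketch of the task's inputs and output; the type names and the transfer_motion stub below are illustrative assumptions, not MagicDance's actual API.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]  # (x, y) position in image coordinates

@dataclass
class PoseInput:
    pose_skeleton: Sequence[Keypoint]   # P: body joint positions for one frame
    face_landmarks: Sequence[Keypoint]  # F: facial landmark positions for one frame

def transfer_motion(reference_image, pose_sequence: List[PoseInput]):
    """Generate one output frame per pose input, preserving the reference
    image's identity, appearance, and background while adopting each pose."""
    raise NotImplementedError  # placeholder for the generation pipeline
```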

Architecture

During training, MagicDance is optimized on a frame reconstruction task: it reconstructs the ground truth frame, with the reference image and pose input sourced from the same reference video. During testing, to achieve motion transfer, the pose input and the reference image come from different sources. 

The overall architecture of the MagicDance framework can be described in four parts: the Preliminary Stage, Appearance Control Pretraining, Appearance-disentangled Pose Control, and the Motion Module. 

Preliminary Stage

Latent Diffusion Models (LDMs) are diffusion models designed to operate in a latent space provided by an autoencoder, and the Stable Diffusion framework is a notable instance of an LDM, pairing a Vector Quantized Variational AutoEncoder with a U-Net denoising architecture. The Stable Diffusion model uses a CLIP-based transformer as a text encoder, converting text inputs into embeddings. During training, the model is given a text condition and an input image: the image is encoded into a latent representation and subjected to a predefined sequence of diffusion steps that gradually add Gaussian noise. The resulting noisy latent approaches a standard normal distribution, and the primary learning objective of Stable Diffusion is to iteratively denoise these noisy latents back into clean latent representations. 
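The learning objective described above can be illustrated with a short, self-contained sketch of a single latent diffusion training step; the noise schedule, shapes, and the toy stand-in for the UNet are illustrative assumptions rather than Stable Diffusion's actual configuration.

```python
import torch
import torch.nn as nn

T = 1000                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # toy noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative signal retention

unet = nn.Conv2d(4, 4, 3, padding=1)            # stand-in noise predictor

latents = torch.randn(2, 4, 64, 64)             # encoded images (toy values)
t = torch.randint(0, T, (latents.shape[0],))    # random timestep per sample
noise = torch.randn_like(latents)

# Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
noisy_latents = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise

# Learning objective: predict the added noise and minimise the MSE against it.
loss = nn.functional.mse_loss(unet(noisy_latents), noise)
loss.backward()
```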

Appearance Control Pretraining

A significant issue with the original ControlNet framework is its inability to control appearance consistently across spatially varying motions: it tends to generate images whose poses closely resemble those in the input image, but the overall appearance is influenced predominantly by the textual inputs. Although this approach works, it is not suited to motion transfer tasks, where it is not the textual input but the reference image that serves as the primary source of appearance information. 

The Appearance Control pretraining module in the MagicDance framework is designed as an auxiliary branch that provides layer-by-layer guidance for appearance control. Rather than relying on text inputs, the module leverages the appearance attributes of the reference image, strengthening the framework's ability to reproduce appearance characteristics accurately, particularly in scenarios involving complex motion dynamics. Moreover, only the Appearance Control Model is trainable during appearance control pretraining. 
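Conceptually, attention-based appearance guidance of this kind can be sketched by letting the UNet's queries attend over keys and values extended with reference-image features, in the spirit of the Multi-Source Attention Module; the dimensions and module below are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

dim, heads = 320, 8
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

unet_tokens = torch.randn(1, 64 * 64, dim)  # hidden states of one UNet layer
ref_tokens = torch.randn(1, 64 * 64, dim)   # matching layer of the appearance branch

# The attention context is extended with the reference features, so the query
# can pull appearance information directly from the reference image.
context = torch.cat([unet_tokens, ref_tokens], dim=1)
out, _ = attn(query=unet_tokens, key=context, value=context)
print(out.shape)  # torch.Size([1, 4096, 320])
```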

Appearance-disentangled Pose Control

A naive way to control the pose in the output image is to combine the pre-trained ControlNet with the pre-trained Appearance Control Model directly, without fine-tuning. However, this combination can leave the framework struggling with appearance-independent pose control, producing a discrepancy between the input poses and the generated poses. To address this, the MagicDance framework fine-tunes the Pose ControlNet jointly with the pre-trained Appearance Control Model. 
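A minimal sketch of this second training stage, with toy modules standing in for the real networks, might look as follows; the learning rate and module shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

sd_unet = nn.Conv2d(4, 4, 3, padding=1)              # stand-in for the frozen backbone
appearance_control = nn.Conv2d(4, 4, 3, padding=1)   # pre-trained in the first stage
pose_controlnet = nn.Conv2d(4, 4, 3, padding=1)      # pose/expression branch

sd_unet.requires_grad_(False)  # the diffusion backbone is not updated

# Joint fine-tuning: optimise both conditioning branches together so that pose
# control is learned without disturbing the appearance guidance.
optimizer = torch.optim.AdamW(
    list(appearance_control.parameters()) + list(pose_controlnet.parameters()),
    lr=1e-5,  # illustrative value
)
```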

Motion Module

Working together, the appearance-disentangled Pose ControlNet and the Appearance Control Model achieve accurate and effective image-to-motion transfer, although the result can still suffer from temporal inconsistency. To ensure temporal consistency, the framework integrates an additional motion module into the main Stable Diffusion UNet architecture. 
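A motion module of this kind is typically a temporal attention block that lets each spatial location attend across frames; the sketch below illustrates the idea with illustrative sizes and is not the authors' exact module.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 1, 16, 320, 32, 32                  # batch, frames, channels, height, width
frames = torch.randn(B, T, C, H, W)

temporal_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)

# Fold the spatial positions into the batch so each pixel location attends
# over the T frames: (B, T, C, H, W) -> (B*H*W, T, C)
tokens = frames.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
smoothed, _ = temporal_attn(tokens, tokens, tokens)

# Restore the original layout: (B*H*W, T, C) -> (B, T, C, H, W)
smoothed = smoothed.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
print(smoothed.shape)  # torch.Size([1, 16, 320, 32, 32])
```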

MagicDance: Pre-Training and Datasets

For pre-training, the MagicDance framework uses the TikTok dataset, which consists of over 350 dance videos of varying lengths between 10 and 15 seconds, each capturing a single dancing person, with the majority of videos showing the face and upper body. The framework extracts frames from each video at 30 FPS and runs OpenPose on every frame individually to infer the pose skeleton, hand poses, and facial landmarks. 
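As a rough illustration of this preprocessing step, the sketch below decodes a clip frame by frame with OpenCV and hands each frame to a pose-estimation placeholder; the file name and the estimate_pose stub are assumptions standing in for the actual OpenPose call.

```python
import cv2  # OpenCV, used here only to read frames from a video file

def estimate_pose(frame):
    """Placeholder for an OpenPose call on a single frame: in practice this
    would return the pose skeleton, hand poses, and facial landmarks."""
    return None

cap = cv2.VideoCapture("dance_clip.mp4")  # hypothetical input file
poses = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    poses.append(estimate_pose(frame))    # per-frame pose/landmark inference
cap.release()
print(f"extracted {len(poses)} frames")
```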

The Appearance Control Model is pre-trained with a batch size of 64 on 8 NVIDIA A100 GPUs for 10 thousand steps at an image size of 512 x 512, followed by joint fine-tuning of the pose control and appearance control models with a batch size of 16 for 20 thousand steps. During training, the MagicDance framework randomly samples two frames from a video to serve as the target and the reference respectively, with both images cropped at the same position and to the same height. During evaluation, images are cropped centrally rather than randomly. 
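The frame-pair sampling with a shared crop can be sketched as follows; the use of Pillow, the path handling, and the crop size default are illustrative assumptions rather than the authors' data pipeline.

```python
import random
from PIL import Image  # Pillow, used only for the illustrative crop

def sample_training_pair(frame_paths, crop_size=512):
    """Draw a target and a reference frame from the same clip and crop both
    at the same position, keeping their appearance spatially aligned."""
    target_path, reference_path = random.sample(frame_paths, 2)
    target = Image.open(target_path)
    reference = Image.open(reference_path)

    # One random crop offset, applied identically to both frames.
    max_x = max(target.width - crop_size, 0)
    max_y = max(target.height - crop_size, 0)
    x, y = random.randint(0, max_x), random.randint(0, max_y)
    box = (x, y, x + crop_size, y + crop_size)
    return target.crop(box), reference.crop(box)
```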

MagicDance: Results

The experimental results for the MagicDance framework are shown in the following image. As can be seen, MagicDance outperforms existing frameworks such as Disco and DreamPose for human motion transfer across all metrics. Frameworks marked with a "*" in front of their name use the target image directly as input and therefore have access to more information than the other frameworks. 

It is interesting to note that the MagicDance framework attains a Face-Cos score of 0.426, an improvement of 156.62% over the Disco framework and an increase of nearly 400% over the DreamPose framework. These results indicate the robust capability of MagicDance to preserve identity information, and the clear boost in performance demonstrates its superiority over existing state-of-the-art methods. 

The following figures compare the quality of human video generation between the MagicDance, Disco, and TPS frameworks. As can be observed, compared with the ground truth (GT), the results generated by the Disco and TPS frameworks suffer from inconsistent human pose identity and facial expressions. 

Moreover, the following image visualizes facial expression and human pose transfer on the TikTok dataset: the MagicDance framework generates realistic and vivid expressions and motions under diverse facial landmark and pose skeleton inputs while accurately preserving identity information from the reference image. 

It is worth noting that the MagicDance framework shows remarkable generalization to out-of-domain reference images with unseen poses and styles, with impressive appearance controllability even without any additional fine-tuning on the target domain, as demonstrated in the following image. 

The following images visualize the MagicDance framework's facial expression transfer and zero-shot human motion transfer. As can be seen, the framework generalizes well to in-the-wild human motions. 

MagicDance: Limitations

OpenPose is an integral part of the MagicDance framework: it plays a crucial role in pose control and significantly affects the quality and temporal consistency of the generated images. However, the framework can still struggle to detect facial landmarks and pose skeletons accurately, especially when subjects in the images are only partially visible or move rapidly. These issues can result in artifacts in the generated images. 

Conclusion

In this article, we have discussed MagicDance, a diffusion-based model that aims to revolutionize human motion transfer. The MagicDance framework transfers 2D human facial expressions and motions onto challenging human dance videos, with the specific aim of generating novel pose-sequence-driven dance videos for specific target identities while keeping the identity constant. It relies on a two-stage training strategy that disentangles human motion from appearance aspects such as skin tone, facial expressions, and clothing.

MagicDance is a novel approach to realistic human video generation that combines facial expression and motion transfer and enables consistent in-the-wild animation generation without any further fine-tuning, marking a significant advance over existing methods. Moreover, the framework demonstrates exceptional generalization across complex motion sequences and diverse human identities, establishing MagicDance as a front runner in the field of AI-assisted motion transfer and video generation. 
