A new initiative from the Alibaba Group offers one of the best methods I have seen for generating full-body human avatars from a Stable Diffusion-based foundation model.
Titled MIMO (MIMicking with Object Interactions), the system uses a range of popular technologies and modules, including CGI-based human models and AnimateDiff, to enable temporally consistent character replacement in videos – or else to drive a character with a user-defined skeletal pose.
Here we see characters interpolated from a single image source, and driven by a predefined motion:
Source: https://menyifang.github.io/projects/MIMO/index.html
Generated characters, which can also be sourced from frames in videos and in diverse other ways, can be integrated into real-world footage.
MIMO offers a novel system which generates three discrete encodings – one each for character, scene, and occlusion (i.e., matting, where some object or person passes in front of the character being depicted). These encodings are integrated at inference time.
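To make the idea of the three-way decomposition concrete, here is a minimal conceptual sketch in PyTorch; the class and function names are purely illustrative assumptions, not MIMO's released code, and real encodings would come from the encoders described below.

```python
# Conceptual sketch only: illustrative names, not MIMO's actual code.
from dataclasses import dataclass
import torch

@dataclass
class SpatialCodes:
    character: torch.Tensor   # identity/motion latent for the human figure
    scene: torch.Tensor       # latent for the background
    occlusion: torch.Tensor   # latent for objects passing in front of the character

def compose_for_inference(codes: SpatialCodes) -> torch.Tensor:
    # MIMO keeps the three codes separate until they condition the diffusion
    # U-Net; here we simply stack them to show that they remain distinct
    # components that are only fused at inference time.
    return torch.stack([codes.character, codes.scene, codes.occlusion], dim=0)

codes = SpatialCodes(torch.randn(4, 64, 64), torch.randn(4, 64, 64), torch.randn(4, 64, 64))
print(compose_for_inference(codes).shape)   # torch.Size([3, 4, 64, 64])
```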
The system is trained over the Stable Diffusion V1.5 model, using a custom dataset curated by the researchers, composed of both real-world and simulated videos.
The great bugbear of diffusion-based video is temporal stability, where the content of the video either flickers or ‘evolves’ in ways that are not desired for consistent character representation.
MIMO, instead, effectively uses a single image as a map for consistent guidance, which can be orchestrated and constrained by the interstitial SMPL CGI model.
Since the source reference is consistent, and the base model over which the system is trained has been enhanced with adequate representative motion examples, the system’s capabilities for temporally consistent output are well above the general standard for diffusion-based avatars.
It’s becoming more common for single images to be used as a source for effective neural representations, either by themselves, or in a multimodal way, combined with text prompts. For instance, the popular LivePortrait facial-transfer system can also generate highly plausible deepfaked faces from single face images.
The researchers believe that the principles used in the MIMO system can be extended into other and novel types of generative systems and frameworks.
The new paper is titled ‘MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling’, and comes from four researchers at Alibaba Group’s Institute for Intelligent Computing. The work has a video-laden project page and an accompanying YouTube video, also embedded at the bottom of this article.
Method
MIMO achieves automatic and unsupervised separation of the aforementioned three spatial components, in an end-to-end architecture (i.e., all of the sub-processes are integrated into the system, and the user need only provide the input material).
Source: https://arxiv.org/pdf/2409.16160
Objects in source videos are translated from 2D to 3D, initially using the monocular depth estimator Depth Anything. The human element in any frame is extracted with methods adapted from the Tune-A-Video project.
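As an illustration of the depth-estimation stage, the sketch below runs Depth Anything through the Hugging Face transformers pipeline on a single frame; the checkpoint name and file paths are assumptions, since the paper does not state how the estimator was invoked.

```python
# A quick sketch of monocular depth estimation with Depth Anything via the
# Hugging Face pipeline. The checkpoint id and file paths are assumptions.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-small-hf",   # assumed checkpoint id
)

frame = Image.open("frame_0001.png")             # one frame of the source video
result = depth_estimator(frame)
depth_map = result["depth"]                      # PIL image of per-pixel relative depth
depth_map.save("frame_0001_depth.png")
```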
These features are then translated into video-based volumetric aspects via Facebook Research’s Segment Anything 2 architecture.
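Below is a hedged sketch of how a figure might be segmented and tracked across frames with Segment Anything 2, following Facebook Research's public sam2 repository; the config and checkpoint paths and the seed click are placeholders, and this is not MIMO's own integration code.

```python
# Hedged sketch of video object tracking with SAM 2; paths are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2/sam2_hiera_l.yaml",      # assumed config path
    "checkpoints/sam2_hiera_large.pt",     # assumed checkpoint path
)

with torch.inference_mode():
    state = predictor.init_state(video_path="frames/")   # directory of video frames
    # Seed the tracker with a single positive click on the person in frame 0.
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=np.array([[480, 320]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    # Propagate the mask through the whole clip.
    masks_per_frame = {
        frame_idx: (logits[0] > 0.0).cpu().numpy()
        for frame_idx, obj_ids, logits in predictor.propagate_in_video(state)
    }
```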
The scene layer itself is obtained by removing objects detected in the other two layers, effectively providing a rotoscope-style mask automatically.
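The masking logic can be illustrated in a few lines of NumPy; this is a simplified stand-in for MIMO's actual layer separation, with hypothetical function and argument names.

```python
# Minimal illustration (not the authors' code) of deriving a scene layer by
# masking out everything claimed by the human and occlusion layers.
import numpy as np

def scene_layer(frame: np.ndarray, human_mask: np.ndarray, occlusion_mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image; masks: HxW boolean arrays from the segmentation stage."""
    keep = ~(human_mask | occlusion_mask)    # pixels belonging to neither foreground layer
    return frame * keep[..., None]           # blanked regions are later repaired by inpainting
```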
For the motion, a set of extracted latent codes for the human element are anchored to a default human CGI-based SMPL model, whose movements provide the context for the rendered human content.
A 2D feature map for the human content is obtained by a differentiable rasterizer derived from a 2020 initiative from NVIDIA. By combining the 3D data obtained from SMPL with the 2D data obtained by the NVIDIA method, the latent codes representing the ‘neural person’ gain a solid correspondence to their eventual context.
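For readers unfamiliar with SMPL, the following sketch poses a parametric body with the third-party smplx package and exposes the mesh that a differentiable rasterizer could then project into a 2D feature map; the model path is a placeholder and the snippet is independent of MIMO's own pipeline code.

```python
# Sketch of posing a parametric SMPL body with the smplx package.
import torch
import smplx

body_model = smplx.create("models/", model_type="smpl", gender="neutral")  # assumed model path

betas = torch.zeros(1, 10)          # body-shape coefficients
body_pose = torch.zeros(1, 69)      # axis-angle rotations for the 23 body joints
global_orient = torch.zeros(1, 3)   # root orientation

output = body_model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = output.vertices          # (1, 6890, 3) mesh that a differentiable
joints = output.joints              # rasterizer can project into a 2D feature map
```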
At this point, it’s necessary to establish a reference commonly needed in architectures that use SMPL – a canonical pose. This is broadly similar to Da Vinci’s ‘Vitruvian Man’, in that it represents a zero-pose template which can accept content and then be deformed, bringing the (effectively) texture-mapped content with it.
These deformations, or ‘deviations from the norm’, represent human movement, while the SMPL model preserves the latent codes that constitute the extracted human identity, and thus represents the resulting avatar accurately in terms of pose and texture.

Source: https://www.researchgate.net/figure/Layout-of-23-joints-in-the-SMPL-models_fig2_351179264
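The deformation idea described above amounts to linear blend skinning: a zero-pose template is warped by per-joint transforms while its per-vertex attributes travel with it. The toy NumPy function below illustrates that mechanism in general terms, not MIMO's exact formulation.

```python
# Toy linear blend skinning: deform a canonical template with per-joint transforms.
import numpy as np

def skin_vertices(canonical_verts, skin_weights, joint_transforms):
    """
    canonical_verts:  (V, 3)    vertices of the zero-pose template
    skin_weights:     (V, J)    per-vertex influence of each joint (rows sum to 1)
    joint_transforms: (J, 4, 4) world transforms of each posed joint
    """
    verts_h = np.concatenate([canonical_verts, np.ones((len(canonical_verts), 1))], axis=1)  # (V, 4)
    # Blend each joint's transform per vertex, then apply it to the homogeneous vertex.
    blended = np.einsum("vj,jab->vab", skin_weights, joint_transforms)   # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, verts_h)                    # (V, 4)
    return posed[:, :3]
```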
Regarding the issue of entanglement (the extent to which trained data can become inflexible when you stretch it beyond its trained confines and associations), the authors state:
For the scene and occlusion aspects, a shared and fixed Variational Autoencoder (VAE – in this case derived from a 2013 publication) is used to embed the scene and occlusion elements into the latent space. Incongruities are handled by an inpainting method from the 2023 ProPainter project.
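As a rough illustration of the VAE-embedding step, the sketch below encodes a single scene frame into the Stable Diffusion V1.5 latent space with a frozen AutoencoderKL from diffusers; the checkpoint id, image path, and preprocessing choices are assumptions for illustration.

```python
# Hedged sketch: embedding a frame with a frozen SD V1.5 VAE (checkpoint id assumed).
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.requires_grad_(False).eval()                 # shared, fixed VAE: no gradients

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),          # map pixels to [-1, 1]
])

image = load_image("scene_frame.png")            # placeholder path
pixels = to_tensor(image).unsqueeze(0)           # (1, 3, 512, 512)

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)                             # (1, 4, 64, 64)
```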
Once assembled and retouched in this manner, both the background and any occluding objects in the video provide a matte for the moving human avatar.
These decomposed attributes are then fed into a U-Net backbone based on the Stable Diffusion V1.5 architecture. The full scene code is concatenated with the host system’s native latent noise, while the human component is integrated via self-attention and cross-attention layers.
Then, the denoised result is output via the VAE decoder.
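The two conditioning routes described above – channel-wise concatenation for the scene code and attention for the human component – can be mocked up in a heavily simplified denoiser. Every name, width, and shape below is illustrative; this is not MIMO's U-Net, only a sketch of the wiring.

```python
# Toy wiring only: not MIMO's U-Net. Shapes, widths, and names are illustrative.
import torch
import torch.nn as nn

class ToyConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, scene_ch=4, human_dim=768, width=320):
        super().__init__()
        # Route 1: the scene code is concatenated with the noisy latents channel-wise.
        self.conv_in = nn.Conv2d(latent_ch + scene_ch, width, kernel_size=3, padding=1)
        # Route 2: spatial features attend to the human latent codes.
        self.cross_attn = nn.MultiheadAttention(width, num_heads=8, kdim=human_dim,
                                                vdim=human_dim, batch_first=True)
        self.conv_out = nn.Conv2d(width, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latents, scene_code, human_code):
        x = self.conv_in(torch.cat([noisy_latents, scene_code], dim=1))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C) query tokens
        tokens, _ = self.cross_attn(tokens, human_code, human_code)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.conv_out(x)                           # predicted noise residual

denoiser = ToyConditionedDenoiser()
out = denoiser(torch.randn(1, 4, 64, 64),     # noisy latents
               torch.randn(1, 4, 64, 64),     # scene/occlusion code
               torch.randn(1, 16, 768))       # human latent tokens
print(out.shape)                              # torch.Size([1, 4, 64, 64])
```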
Data and Tests
For training, the researchers created a human video dataset titled HUD-7K, which consisted of 5,000 real character videos and 2,000 synthetic animations created by the En3D system. The real videos required no annotation, due to the non-semantic nature of the figure extraction procedures in MIMO’s architecture. The synthetic data was fully annotated.
The model was trained on eight NVIDIA A100 GPUs (though the paper doesn’t specify whether these were the 40GB or 80GB VRAM models), for 50 iterations, using 24 video frames and a batch size of 4, until convergence.
The motion module for the system was trained on the weights of AnimateDiff. During the training process, the weights of the VAE encoder/decoder and the CLIP image encoder were frozen (in contrast to full fine-tuning, which would have a wider effect on a foundation model).
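Freezing pretrained components while training only the new modules is standard practice in this kind of fine-tuning; here is a hedged PyTorch sketch of what that looks like, using stand-in modules rather than MIMO's actual components.

```python
# Hedged sketch of freezing pretrained components; modules below are stand-ins.
import torch

def freeze(module: torch.nn.Module) -> None:
    """Stop gradient flow into a pretrained component."""
    module.requires_grad_(False)
    module.eval()

# Illustrative placeholders for the components named in the paper.
vae = torch.nn.Identity()
clip_image_encoder = torch.nn.Identity()
unet = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)   # placeholder for the denoising U-Net

for frozen in (vae, clip_image_encoder):
    freeze(frozen)

# Only the U-Net (with its AnimateDiff-initialised motion layers) receives gradients.
trainable = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```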
Though MIMO was not trialed against analogous systems, the researchers tested it on difficult out-of-distribution motion sequences sourced from AMASS and Mixamo. These movements included climbing, playing, and dancing.
They also tested the system on in-the-wild human videos. In both cases, the paper reports ‘high robustness’ for these unseen 3D motions, from different viewpoints.
Though the paper offers multiple static image results demonstrating the effectiveness of the system, the true performance of MIMO is best assessed with the extensive video results provided on the project page, and in the YouTube video embedded below (from which the videos at the start of this article have been derived).
The authors conclude:
Conclusion
It’s refreshing to see an avatar system based on Stable Diffusion that appears capable of such temporal stability – not least because Gaussian Avatars seem to be gaining the high ground in this particular research sector.
The stylized avatars represented in the results are effective, and while the level of photorealism that MIMO can produce is not currently equal to what Gaussian Splatting is capable of, the various advantages of creating temporally consistent humans in a semantically-based Latent Diffusion Model (LDM) are considerable.