One of the primary objectives in current video synthesis research is generating a complete AI-driven video performance from a single image. This week a new paper from ByteDance Intelligent Creation outlined what may be the most comprehensive system of this kind to date, capable of producing full- and semi-body animations that combine expressive facial detail with accurate large-scale motion, while also achieving improved identity consistency – an area where even leading commercial systems often fall short.
In the example below, we see a performance driven by an actor (top left) and derived from a single image (top right), which gives a remarkably flexible and dexterous rendering, with none of the usual issues around creating large movements or ‘guessing’ about occluded areas (i.e., parts of clothing and facial angles that must be inferred or invented because they are not visible in the sole source photo):
Though we can see some residual challenges regarding persistence of identity as each clip proceeds, this is the first system I have seen that excels in generally (though not always) maintaining ID over a sustained period without the use of LoRAs:
The new system, titled DreamActor-M1, uses a three-part hybrid control system that gives dedicated attention to facial expression, head rotation, and core skeleton design, thus accommodating AI-driven performances where neither the facial nor the body aspect suffers at the expense of the other – a rare, arguably unknown capability among similar systems.
Below we see one of these facets, the head sphere, in action. The colored ball in the corner of each thumbnail towards the right indicates a kind of virtual gimbal that defines head orientation independently of facial movement and expression, which is here driven by an actor (lower left).
One of the project’s most interesting functionalities, which is not even properly included in the paper’s tests, is its capacity to derive lip-sync movement directly from audio – a capability which works unusually well even without a driving actor video.
The researchers have taken on the best incumbents in this pursuit, including the much-lauded Runway Act-One and LivePortrait, and report that DreamActor was able to achieve better quantitative results.
Since researchers can set their own criteria, quantitative results are not necessarily an empirical standard; but the accompanying qualitative tests appear to support the authors’ conclusions.
Unfortunately this system is not intended for public release, and the only value the community can derive from the work lies in potentially reproducing the methodologies outlined in the paper (as was done to notable effect for the equally closed-source Google DreamBooth in 2022).
The paper states*:
Naturally, ethical considerations of this kind are convenient from a commercial standpoint, since they provide a rationale for API-only access to the model, which can then be monetized. ByteDance has already done this once in 2025, by making the much-lauded OmniHuman available for paid credits at the Dreamina website. Therefore, since DreamActor is possibly an even stronger product, this seems the likely outcome. What remains to be seen is the extent to which its principles, insofar as they are explained in the paper, can aid the open source community.
The new paper is titled DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance, and comes from six ByteDance researchers.
Method
The DreamActor system proposed in the paper aims to generate human animation from a reference image and a driving video, using a Diffusion Transformer (DiT) framework adapted for latent space (apparently some flavor of Stable Diffusion, though the paper cites only the landmark 2022 release publication).
Rather than relying on external modules to handle reference conditioning, the authors merge appearance and motion features directly inside the DiT backbone, allowing interaction across space and time through attention:
Source: https://arxiv.org/pdf/2504.01724
To do this, the model uses a pretrained 3D variational autoencoder to encode both the input video and the reference image. These latents are patchified, concatenated, and fed into the DiT, which processes them jointly.
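As a rough illustration of this joint tokenization step, the sketch below (not the authors’ code; the latent shapes, channel count and patch size are assumptions) shows how a video latent and a single-frame reference latent could be patchified and concatenated into one token sequence for the DiT:

```python
import torch

def patchify(latents: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Flatten a latent volume (B, C, T, H, W) into a sequence of patch tokens."""
    b, c, t, h, w = latents.shape
    x = latents.unfold(3, patch, patch).unfold(4, patch, patch)  # (B, C, T, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(b, -1, c * patch * patch)
    return x  # (B, num_tokens, C * p * p)

# Hypothetical latent shapes from a 3D VAE: a video clip, and a reference image
# treated as a one-frame clip.
video_latents = torch.randn(1, 16, 24, 80, 120)  # (B, C, T, H, W)
ref_latents = torch.randn(1, 16, 1, 80, 120)

# Patchify both and concatenate along the token axis, so that the DiT can attend
# jointly to appearance (reference) and motion (video) tokens.
tokens = torch.cat([patchify(video_latents), patchify(ref_latents)], dim=1)
print(tokens.shape)  # torch.Size([1, 60000, 64])
```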
This architecture departs from the common practice of attaching a secondary network for reference injection, which was the approach for the influential Animate Anyone and Animate Anyone 2 projects.
Instead, DreamActor builds the fusion into the main model itself, simplifying the design while enhancing the flow of information between appearance and motion cues. The model is then trained using flow matching rather than the standard diffusion objective (flow matching trains diffusion models by directly predicting velocity fields between data and noise, skipping score estimation).
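Flow matching itself reduces to a simple regression objective; the following is a minimal, generic rectified-flow style sketch (not the authors’ exact formulation), where the model and conditioning tensor are stand-ins:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Generic flow-matching loss: the network regresses the velocity field that
    carries samples along a straight path between data (x0) and noise (x1)."""
    x1 = torch.randn_like(x0)                                    # pure noise endpoint
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1 - t) * x0 + t * x1                                   # point on the straight path
    target_velocity = x1 - x0                                    # constant velocity of that path
    pred_velocity = model(xt, t.flatten(), cond)                 # model predicts velocity, not noise
    return F.mse_loss(pred_velocity, target_velocity)
```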
Hybrid Motion Guidance
The Hybrid Motion Guidance method that informs the neural renderings combines pose tokens derived from 3D body skeletons and head spheres; implicit facial representations extracted by a pretrained face encoder; and reference appearance tokens sampled from the source image.
These elements are integrated within the Diffusion Transformer using distinct attention mechanisms, allowing the system to coordinate global motion, facial expression, and visual identity throughout the generation process.
For the first of these, rather than relying on facial landmarks, DreamActor uses implicit facial representations to guide expression generation, apparently enabling finer control over facial dynamics while disentangling identity and head pose from expression.
To create these representations, the pipeline first detects and crops the face region in each frame of the driving video, resizing it to 224×224. The cropped faces are processed by a face motion encoder pretrained on PD-FGC, the output of which is then conditioned by an MLP layer.

Source: https://arxiv.org/pdf/2211.14506
The result is a sequence of face motion tokens, which are injected into the Diffusion Transformer through a cross-attention layer.
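A simplified sketch of how such a face branch might be wired is shown below; the encoder, projection widths and attention dimensions are assumptions, since the paper does not publish implementation code:

```python
import torch
import torch.nn as nn

class FaceMotionBranch(nn.Module):
    """Sketch of the face pathway: a pretrained face motion encoder produces
    per-frame motion latents, an MLP projects them to the DiT width, and the
    resulting face motion tokens are injected via cross-attention."""

    def __init__(self, face_encoder: nn.Module, motion_dim: int = 512, dit_dim: int = 1024):
        super().__init__()
        self.face_encoder = face_encoder  # assumed PD-FGC-style motion encoder, kept frozen
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, dit_dim), nn.SiLU(), nn.Linear(dit_dim, dit_dim))
        self.cross_attn = nn.MultiheadAttention(dit_dim, num_heads=8, batch_first=True)

    def forward(self, face_crops: torch.Tensor, dit_tokens: torch.Tensor) -> torch.Tensor:
        # face_crops: (B, T, 3, 224, 224) cropped and resized driving faces
        # dit_tokens: (B, N, dit_dim) tokens inside the Diffusion Transformer
        b, t = face_crops.shape[:2]
        motion = self.face_encoder(face_crops.flatten(0, 1))     # (B*T, motion_dim)
        face_tokens = self.proj(motion).view(b, t, -1)           # (B, T, dit_dim)
        attended, _ = self.cross_attn(dit_tokens, face_tokens, face_tokens)
        return dit_tokens + attended                             # residual injection
```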
The same framework also supports an audio-driven variant, wherein a separate encoder is trained to map speech input directly to face motion tokens. This makes it possible to generate synchronized facial animation – including lip movements – without a driving video.
Secondly, to control head pose independently of facial expression, the system introduces a 3D head sphere representation (see the video embedded earlier in this article), which decouples facial dynamics from global head movement, improving precision and flexibility during animation.
Head spheres are generated by extracting 3D facial parameters – such as rotation and camera pose – from the driving video using the FaceVerse tracking method.

Source: https://www.liuyebin.com/faceverse/faceverse.html
These parameters are used to render a color sphere projected onto the 2D image plane, spatially aligned with the driving head. The sphere’s size matches the reference head, and its color reflects the head’s orientation. This abstraction reduces the complexity of learning 3D head motion, helping to preserve stylized or exaggerated head shapes in characters drawn from animation.
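A toy version of this idea – mapping head rotation angles to a color and drawing the sphere as a filled disc at the projected head position – might look like the following sketch, in which the angle-to-color mapping and all dimensions are purely illustrative:

```python
import numpy as np
import cv2

def render_head_sphere(frame: np.ndarray, center: tuple, radius: int,
                       yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Draw a colored disc whose color channels encode head orientation.
    Angles are in degrees; the mapping to color is an arbitrary illustrative choice."""
    to_channel = lambda angle: int(np.clip((angle + 90.0) / 180.0 * 255.0, 0, 255))
    color = (to_channel(roll), to_channel(pitch), to_channel(yaw))  # BGR for OpenCV
    out = frame.copy()
    cv2.circle(out, center, radius, color, thickness=-1)
    return out

# A blank 960x640 control frame with a sphere sized and placed to match the reference head.
canvas = np.zeros((640, 960, 3), dtype=np.uint8)
sphere_map = render_head_sphere(canvas, center=(480, 160), radius=48,
                                yaw=25.0, pitch=-10.0, roll=5.0)
```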

Finally, to guide full-body motion, the system uses 3D body skeletons with adaptive bone length normalization. Body and hand parameters are estimated using 4DHumans and the hand-focused HaMeR, both of which operate on the SMPL-X body model.

Source: https://arxiv.org/pdf/1904.05866
From these outputs, key joints are selected, projected into 2D, and connected into line-based skeleton maps. Unlike methods such as Champ, which render full-body meshes, this approach avoids imposing predefined shape priors; by relying solely on skeletal structure, the model is encouraged to infer body shape and appearance directly from the reference images, reducing bias toward fixed body types and improving generalization across a range of poses and builds.
During training, the 3D body skeletons are concatenated with the head spheres and passed through a pose encoder, which outputs features that are then combined with noised video latents to produce the noise tokens used by the Diffusion Transformer.
At inference time, the system accounts for skeletal differences between subjects by normalizing bone lengths. The pretrained SeedEdit image editing model transforms both reference and driving images into a standard canonical configuration. RTMPose is then used to extract skeletal proportions, which are used to adjust the driving skeleton to match the anatomy of the reference subject.
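Conceptually, the bone-length normalization amounts to rescaling each driving bone to the reference subject’s proportions; the sketch below illustrates the idea on a hypothetical 2D skeleton topology (not the actual SMPL-X joint set):

```python
import numpy as np

# Hypothetical skeleton topology as (parent_joint, child_joint) index pairs,
# ordered so that each parent is resolved before its children.
BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7)]

def bone_lengths(joints: np.ndarray) -> np.ndarray:
    """joints: (J, 2) array of 2D keypoints; returns one length per bone."""
    return np.array([np.linalg.norm(joints[c] - joints[p]) for p, c in BONES])

def normalize_skeleton(driving: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rescale each driving bone so its length matches the reference subject,
    walking the kinematic chain outward from the root joint."""
    scale = bone_lengths(reference) / (bone_lengths(driving) + 1e-8)
    adjusted = driving.copy()
    for (p, c), s in zip(BONES, scale):
        direction = driving[c] - driving[p]          # original bone direction is preserved
        adjusted[c] = adjusted[p] + direction * s    # only the bone length changes
    return adjusted
```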

Appearance Guidance
To enhance appearance fidelity, particularly in occluded or rarely visible areas, the system supplements the primary reference image with pseudo-references sampled from the input video.
These additional frames are selected for pose diversity using RTMPose, and filtered using CLIP-based similarity to ensure they remain consistent with the subject’s identity.
All reference frames (primary and pseudo) are encoded by the same visual encoder and fused through a self-attention mechanism, allowing the model to access complementary appearance cues. This setup improves coverage of details such as profile views or limb textures. Pseudo-references are always used during training and optionally during inference.
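In outline, the pseudo-reference selection could be implemented along the following lines; the greedy diversity heuristic, the similarity threshold and the number of frames kept are invented here for illustration:

```python
import numpy as np

def select_pseudo_references(frame_poses: np.ndarray, clip_feats: np.ndarray,
                             ref_clip_feat: np.ndarray, k: int = 4,
                             min_similarity: float = 0.75) -> list:
    """frame_poses: (N, J, 2) keypoints per frame (e.g. from a pose estimator such as RTMPose);
    clip_feats: (N, D) L2-normalized CLIP embeddings per frame;
    ref_clip_feat: (D,) embedding of the primary reference image."""
    # 1. Identity filter: keep only frames whose CLIP similarity to the reference is high enough.
    sims = clip_feats @ ref_clip_feat
    candidates = [i for i in range(len(frame_poses)) if sims[i] >= min_similarity]

    # 2. Greedy pose-diversity selection: repeatedly pick the candidate whose pose is
    #    farthest from everything already chosen, to cover complementary viewpoints.
    chosen = []
    while candidates and len(chosen) < k:
        if not chosen:
            chosen.append(candidates.pop(0))
            continue
        dists = [min(np.linalg.norm(frame_poses[i] - frame_poses[j]) for j in chosen)
                 for i in candidates]
        chosen.append(candidates.pop(int(np.argmax(dists))))
    return chosen
```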
Training
DreamActor was trained in three stages to gradually introduce complexity and improve stability.
In the first stage, only 3D body skeletons and 3D head spheres were used as control signals, excluding facial representations. This allowed the base video generation model, initialized from MMDiT, to adapt to human animation without being overwhelmed by fine-grained controls.
In the second stage, implicit facial representations were added, with all other parameters frozen. Only the face motion encoder and face attention layers were trained at this point, enabling the model to learn expressive detail in isolation.
In the final stage, all parameters were unfrozen for joint optimization across appearance, pose, and facial dynamics.
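In training-code terms, this staged schedule amounts to toggling which parameter groups receive gradients; a simplified PyTorch-style sketch, with hypothetical module names, is given below:

```python
def configure_stage(model, stage: int) -> None:
    """Freeze or unfreeze parameter groups per training stage (module names are hypothetical)."""
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1: adapt the base video model to skeleton and head-sphere control only;
        # everything face-related stays frozen (and unused).
        for name, p in model.named_parameters():
            if "face" not in name:
                p.requires_grad = True
    elif stage == 2:
        # Stage 2: train only the face motion encoder and the face attention layers.
        for name, p in model.named_parameters():
            if "face_encoder" in name or "face_attn" in name:
                p.requires_grad = True
    else:
        # Stage 3: joint optimization of appearance, pose and facial dynamics.
        for p in model.parameters():
            p.requires_grad = True
```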
Data and Tests
For the testing phase, the model was initialized from a pretrained image-to-video DiT checkpoint† and trained in three stages: 20,000 steps for each of the first two stages and 30,000 steps for the third.
To improve generalization across different durations and resolutions, video clips were randomly sampled with lengths between 25 and 121 frames, then resized to 960×640px while preserving aspect ratio.
Training was performed on eight (China-focused) NVIDIA H20 GPUs, each with 96GB of VRAM, using the AdamW optimizer with a (relatively low) learning rate of 5e−6.
At inference, each video segment contained 73 frames. To maintain consistency across segments, the final latent from one segment was reused as the initial latent for the next, which contextualizes the task as sequential image-to-video generation.
Classifier-free guidance was applied with a weight of 2.5 for both reference images and motion control signals.
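The segment-chaining scheme can be sketched as follows; the sampler here is a hypothetical callable standing in for the actual denoising loop, while the 73-frame window and the 2.5 guidance weight are the values reported by the authors:

```python
import torch

def generate_long_video(sampler, ref_tokens: torch.Tensor, motion_signals: torch.Tensor,
                        segment_len: int = 73, cfg_weight: float = 2.5) -> torch.Tensor:
    """Generate a long clip as a chain of fixed-length segments, reusing the final
    latent of each segment as the initial latent of the next."""
    segments, init_latent = [], None
    for start in range(0, motion_signals.shape[1], segment_len):
        window = motion_signals[:, start:start + segment_len]   # control signals for this segment
        latents = sampler(ref=ref_tokens, motion=window,
                          init_latent=init_latent, guidance_scale=cfg_weight)
        init_latent = latents[:, -1:]                           # carry the last latent forward
        segments.append(latents)
    return torch.cat(segments, dim=1)
```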
The authors curated a training dataset (no sources are stated in the paper) comprising 500 hours of video drawn from diverse domains, featuring instances of (among others) dance, sports, film, and public speaking. The dataset was designed to capture a broad spectrum of human motion and expression, with an even distribution between full-body and half-body shots.
To enhance facial synthesis quality, Nersemble was incorporated into the data preparation process.

Source: https://www.youtube.com/watch?v=a-OAWqBzldU
For evaluation, the researchers also used their dataset as a benchmark to assess generalization across various scenarios.
The model’s performance was measured using standard metrics from prior work: Fréchet Inception Distance (FID); Structural Similarity Index (SSIM); Learned Perceptual Image Patch Similarity (LPIPS); and Peak Signal-to-Noise Ratio (PSNR) for frame-level quality. Fréchet Video Distance (FVD) was used for assessing temporal coherence and overall video fidelity.
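For readers wishing to reproduce the frame-level metrics on their own outputs, off-the-shelf torchmetrics implementations suffice (a sketch, assuming torchmetrics and its image dependencies are installed; FVD has no comparably standard implementation and is omitted here):

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure, PeakSignalNoiseRatio
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

def frame_metrics(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (N, 3, H, W) frames with values in [0, 1]; N should be >= 2 for FID."""
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    psnr = PeakSignalNoiseRatio(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    fid = FrechetInceptionDistance(normalize=True)
    fid.update(target, real=True)
    fid.update(pred, real=False)
    return {
        "SSIM": ssim(pred, target).item(),
        "PSNR": psnr(pred, target).item(),
        "LPIPS": lpips(pred, target).item(),
        "FID": fid.compute().item(),
    }
```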
The authors conducted experiments on both body animation and portrait animation tasks, all employing a single (target) reference image.
For body animation, DreamActor-M1 was compared against Animate Anyone; Champ; MimicMotion; and DisPose.

Though the PDF provides a static image as a visual comparison, one of the videos from the project site may highlight the differences more clearly:
For portrait animation tests, the model was evaluated against LivePortrait; X-Portrait; SkyReels-A1; and Act-One.

The authors note that their method wins out in the quantitative tests, and contend that it is also superior qualitatively.
Arguably the third and final of the clips shown in the video above exhibits less convincing lip-sync compared to a couple of the rival frameworks, though the general quality is remarkably high.
Conclusion
In anticipating the need for textures that are implied but not actually present in the sole target image fueling these recreations, ByteDance has addressed one of the biggest challenges facing diffusion-based video generation – consistent, persistent textures. The next logical step after perfecting such an approach would be to somehow create a reference atlas from the initial generated clip that could be applied to subsequent, different generations, to maintain appearance without LoRAs.
Though such an approach would effectively still be an external reference, this is no different from texture-mapping in traditional CGI techniques, and the quality of realism and plausibility is far higher than those older methods can obtain.
That said, the most impressive aspect of DreamActor is the combined three-part guidance system, which bridges the traditional divide between face-focused and body-focused human synthesis in an ingenious way.
It only remains to be seen whether some of these core principles can be leveraged in more accessible offerings; as it stands, DreamActor seems destined to become yet another synthesis-as-a-service offering, severely bound by restrictions on usage, and by the impracticality of experimenting extensively with a commercial architecture.
*
†