Create a speaking and singing video with a single photo… “Produce mouth shapes, facial expressions, and movements.”
Alibaba has introduced an artificial intelligence (AI) system that creates realistic speaking and singing videos from a single photo. It is the follow-up to the character animation AI released last December, which received rave reviews.

VentureBeat reported on the 28th (local time) that Alibaba's Intelligent Computing Lab has unveiled a new AI framework called 'EMO (Emote Portrait Alive)'.

According to a paper published on arXiv, the model can not only generate accurate mouth shapes matching a given voice, but also generate facial expressions and head movements. It can turn any image of a person, including selfies, photos of celebrities, cartoons, or drawings, into a video of that person speaking or singing in the language or song of your choice.

Relatedly, the day before, Pika Labs released a tool called 'Lip Sync' that adds voices to its video-generation AI. It is a feature Pika Labs introduced in response to OpenAI's 'Sora'.

Meanwhile, a video comparing EMO and Lip Sync to see which is more sophisticated is spreading in related communities, and the enthusiastic response largely favors EMO.

It is also noteworthy that among the videos released by the researchers is a sample identical to the 'woman walking on the streets of Tokyo' that appeared in a video created by Sora. The sunglasses, large earrings, red shirt, and coat all match.

Previously, in December of last year, Alibaba received favorable reviews for introducing a model called 'Animate Anyone', which creates a full-motion video from a single photo. That model was seen as improving existing photo-to-video conversion technology by extracting human gestures and movements from existing images and using a diffusion model to turn photos into video.

EMO Overview (Photo = Alibaba)

EMO also uses a diffusion model. It was trained on more than 250 hours of conversational video, including movies, TV programs, and performances.

In particular, unlike previous methods that only roughly represent facial movements through 3D face models, EMO captures subtle movements and individual features by converting audio waveforms directly into video frames.
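Alibaba has not shared implementation details beyond the paper, but the general idea of denoising video frames conditioned directly on audio can be illustrated with a minimal sketch. Everything below, including the module names, sizes, and the simplified update step, is an illustrative assumption and not EMO's actual code.

```python
# Minimal sketch (not Alibaba's code): one audio-conditioned diffusion denoising step
# that predicts noise for a video frame directly from audio features, with no
# intermediate 3D face model. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, frame_channels=3, audio_dim=128, hidden_dim=64):
        super().__init__()
        # Project the per-frame audio feature vector into the conditioning space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # A tiny convolutional denoiser; the real model would be a large UNet
        # with cross-attention to audio features and a reference-image encoder.
        self.net = nn.Sequential(
            nn.Conv2d(frame_channels + hidden_dim, hidden_dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_dim, frame_channels, 3, padding=1),
        )

    def forward(self, noisy_frame, audio_feat):
        # noisy_frame: (B, C, H, W), audio_feat: (B, audio_dim)
        b, _, h, w = noisy_frame.shape
        cond = self.audio_proj(audio_feat)                 # (B, hidden_dim)
        cond = cond[:, :, None, None].expand(b, -1, h, w)  # broadcast over pixels
        return self.net(torch.cat([noisy_frame, cond], dim=1))  # predicted noise

# Toy usage: one reverse-diffusion step for a batch of two frames.
model = AudioConditionedDenoiser()
noisy = torch.randn(2, 3, 64, 64)     # frames with noise added
audio = torch.randn(2, 128)           # per-frame audio features (e.g. from a speech encoder)
pred_noise = model(noisy, audio)
denoised = noisy - 0.1 * pred_noise   # simplified update; real samplers follow a noise schedule
print(denoised.shape)                 # torch.Size([2, 3, 64, 64])
```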

The researchers reported that in experiments, EMO significantly outperformed existing state-of-the-art methods. User evaluations also found that it produced more natural mouth shapes than videos generated by other models.

The researchers noted that its strengths cover not only speech but also singing, since singing makes not just mouth shapes but also facial expressions and head movements most evident. They also announced that it supports video generation matching the length of the input audio, as sketched below.
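As a rough illustration of that last point, the number of generated frames simply scales with the audio duration. The function name, sample rate, and frame rate below are assumed values for illustration, not figures from the paper.

```python
# Sketch: how many video frames are needed to cover an audio clip of a given length.
import math

def num_frames_for_audio(num_samples: int, sample_rate: int = 16000, fps: int = 30) -> int:
    """Frames required so the generated video spans the full input audio."""
    duration_sec = num_samples / sample_rate
    return math.ceil(duration_sec * fps)

print(num_frames_for_audio(160000))  # 10 s of 16 kHz audio -> 300 frames at 30 fps
```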

The researchers emphasized that thanks to these features, EMO “can produce videos in a wide variety of styles, far exceeding existing state-of-the-art methodologies in terms of expressiveness and realism.”

The model was unveiled through GitHub.

Reporter Lim Da-jun ydj@aitimes.com
