Person-Specific Deepfakes with 3D Morphable Models
The 3DMM
Tracking
Neural Rendering
Audio-to-Expression
Conclusion

Given a 3DMM, the first thing we'd like to do is fit it to real video. Along with the 3DMM parameters, we usually also consider additional parameters for the pose (position and orientation of the head), the lighting, and the camera. Taking all of these parameters together, the general idea is to find the set of parameters that best matches a given frame. This is usually done using a technique called differentiable rendering.
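To make this concrete, below is a minimal sketch (in PyTorch) of the kind of parameter set a tracker optimizes. The names and dimensions are illustrative assumptions rather than those of any particular 3DMM, although FLAME-style models use similar sizes.

```python
import torch

# Illustrative parameter set for tracking; dimensions are assumptions,
# not those of a specific morphable model.
params = {
    "shape":      torch.zeros(300, requires_grad=True),  # identity geometry coefficients
    "texture":    torch.zeros(100, requires_grad=True),  # albedo coefficients
    "expression": torch.zeros(100, requires_grad=True),  # facial expression coefficients
    "pose":       torch.zeros(6,   requires_grad=True),  # head rotation + translation
    "lighting":   torch.zeros(27,  requires_grad=True),  # e.g. spherical-harmonics lighting
    "camera":     torch.zeros(3,   requires_grad=True),  # e.g. weak-perspective scale + offset
}
```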

Differentiable rendering allows gradients to be backpropagated through the rendering process, meaning that losses can be applied at the image level. A very simple tracker works by computing the difference between the real image and the rendered 3DMM at the image level, then optimizing the above set of parameters to get the best match. In practice, this is not enough on its own, and extra losses are required. In particular, landmark losses are used to get the mesh "close enough" to the real face.
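Below is a minimal sketch of such a fitting loop for a single frame, assuming the parameter dictionary from the previous sketch. `render_3dmm` and `project_landmarks` are hypothetical stand-ins for your differentiable renderer and landmark projection (they could be built on something like PyTorch3D), `frame` is the target image, and `gt_landmarks` would come from an off-the-shelf landmark detector.

```python
# Hypothetical per-frame fitting loop: photometric loss + landmark loss + a
# small regularizer, with gradients flowing through the renderer.
optimizer = torch.optim.Adam(list(params.values()), lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    rendered, mask = render_3dmm(params)        # (3, H, W) image and foreground mask
    pred_landmarks = project_landmarks(params)  # (68, 2) image-space landmark positions

    photometric_loss = ((rendered - frame) * mask).abs().mean()
    landmark_loss = (pred_landmarks - gt_landmarks).square().mean()
    reg_loss = params["expression"].square().mean()  # keep coefficients plausible

    loss = photometric_loss + 0.5 * landmark_loss + 1e-3 * reg_loss
    loss.backward()
    optimizer.step()
```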

An example of the tracking process for a brief video

An important part of the tracking process is that the shape, texture, and camera parameters (and sometimes lighting) are kept fixed across a given video, meaning only the expression and pose need to be estimated per frame, and hence predicted from audio later on. This further reduces the complexity of the problem.
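In code, that split typically looks something like the sketch below: identity-level parameters are fit once for the whole clip, while expression and pose get one entry per frame. The per-frame expression codes are exactly what the audio model will predict later. (Dimensions are again illustrative.)

```python
num_frames = 500

# Fit once per video and then held fixed.
shared = {
    "shape":    torch.zeros(300, requires_grad=True),
    "texture":  torch.zeros(100, requires_grad=True),
    "lighting": torch.zeros(27,  requires_grad=True),
    "camera":   torch.zeros(3,   requires_grad=True),
}

# Estimated separately for every frame of the video.
per_frame = {
    "expression": torch.zeros(num_frames, 100, requires_grad=True),
    "pose":       torch.zeros(num_frames, 6,   requires_grad=True),
}
```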

If you want to have a go at this kind of tracking, there are some great codebases available. In my opinion, the best of these at the moment is the MICA codebase. I'd suggest using something like this as a basis if you wanted to attempt a full pipeline.

The next question is: how do we invert the fitting process to get realistic-looking video from these low-quality renderings? The answer, of course, is to throw deep learning at it! Fortunately, we have an advantage that makes this easier: paired data. Every frame that we have tracked gives us a pair of a rendered mesh frame and a real frame. We just have to train a model to convert one into the other.
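A minimal sketch of how this paired data can be organized, reusing the hypothetical `render_3dmm` from earlier: each item is a (rendered mesh image, real frame) pair that the image-to-image model is trained on.

```python
from torch.utils.data import Dataset

class RenderToRealDataset(Dataset):
    """Pairs of (rendered 3DMM frame, real video frame) from a tracked video."""

    def __init__(self, per_frame_params, real_frames):
        self.per_frame_params = per_frame_params  # list of per-frame parameter dicts
        self.real_frames = real_frames            # list of (3, H, W) image tensors

    def __len__(self):
        return len(self.real_frames)

    def __getitem__(self, idx):
        rendered, _ = render_3dmm(self.per_frame_params[idx])  # conditioning input
        return rendered, self.real_frames[idx]                 # (input, target) pair
```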

A basic way of doing this would be to use a UNET-style model and train it with regression losses. In practice, however, this is not enough to produce super-realistic frames. We typically use some form of generative model to fill in this gap. In the current literature, this is usually a GAN. A more sophisticated approach exists in the form of neural textures, where a more detailed and abstract texture is learned alongside a UNET that can interpret it. I'd suggest reading this paper.
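As a rough illustration, a pix2pix-style training step for such a conditional GAN might look like the sketch below. The `generator` (a UNET) and `discriminator` (e.g. a PatchGAN) are assumed to be defined elsewhere, and the L1 weighting is a common default rather than a value from any specific paper.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, rendered, real, lambda_l1=100.0):
    # Discriminator: tell real frames from generated ones, conditioned on the render.
    with torch.no_grad():
        fake = generator(rendered)
    d_real = discriminator(rendered, real)
    d_fake = discriminator(rendered, fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: fool the discriminator while staying close to the real frame.
    fake = generator(rendered)
    d_out = discriminator(rendered, fake)
    g_loss = (F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out)) +
              lambda_l1 * F.l1_loss(fake, real))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```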

I believe there is still some room for research in this direction, for instance using diffusion models or a VQ-GAN style approach for the neural rendering, but this remains an open question.

The final step in the pipeline is to control the expressions of the 3DMM using audio. As mentioned before, this is a simple task compared to trying to predict pixels. Early models use an approach like VOCA, which simply attempts to predict parameters from audio features using regression. More advanced transformer-based models, such as Imitator (below), have pushed the quality up significantly.
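As a rough illustration of this regression setup, here is a minimal transformer-based audio-to-expression model. The audio features (e.g. wav2vec 2.0 embeddings) and the layer sizes are assumptions made for the sketch, not the actual architecture of VOCA or Imitator.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Maps a sequence of per-frame audio features to 3DMM expression coefficients."""

    def __init__(self, audio_dim=768, expr_dim=100, d_model=256, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, expr_dim)

    def forward(self, audio_feats):              # (B, T, audio_dim)
        x = self.encoder(self.proj(audio_feats))
        return self.head(x)                      # (B, T, expr_dim) expression coefficients

# Training is plain regression against the tracked expressions, e.g.:
# loss = torch.nn.functional.mse_loss(model(audio_feats), tracked_expressions)
```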

The demo video for the Imitator paper. This paper drives a 3DMM from an audio signal only. Credit to the authors.

With these models, the inference pipeline is basically the following (sketched in code after the list):

  • Take new audio that you want the character to say.
  • Convert it into expressions using the audio-to-expression model.
  • Take the shape, texture, pose, and lighting from a section of the tracked video.
  • Render it into a low-quality mesh video.
  • Use neural rendering to convert it into photo-realistic video.
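Putting those steps into a single (heavily simplified) function, assuming the hypothetical components sketched earlier plus an `extract_audio_features` helper for the audio encoder:

```python
import torch

def synthesize(audio, shared_params, tracked_pose, audio_to_expression, generator):
    audio_feats = extract_audio_features(audio)          # new audio for the character
    expressions = audio_to_expression(audio_feats)[0]    # (T, expr_dim) expression codes

    frames = []
    for t, expr in enumerate(expressions):
        params = {**shared_params,                       # shape/texture/lighting/camera from tracking
                  "pose": tracked_pose[t % len(tracked_pose)],
                  "expression": expr}
        rendered, _ = render_3dmm(params)                # low-quality mesh render
        frames.append(generator(rendered.unsqueeze(0)))  # neural rendering -> photoreal frame
    return torch.cat(frames)                             # (T, 3, H, W) output video
```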

Throughout this article, I've tried to stay as general about the pipeline as possible. I haven't covered any specific papers in detail, and for that reason the exact process behind each of these steps is missing. I'll cover individual papers and specific methods for each step in the future, but this article is meant to serve as a reference, showing how these popular person-specific models usually work.

This approach is very popular, and the results are insanely good. I hope this article goes some way toward explaining the fundamentals of how these models work. Please let me know if there's any specific part of this pipeline you'd like covered in the future, or if you have any suggestions!
