AI learns how vision and sound are connected, without human intervention


Humans naturally learn by making connections between sight and sound. For instance, we can watch someone playing the cello and recognize that the cellist’s movements are generating the music we hear.

A new approach developed by researchers from MIT and elsewhere improves an AI model’s ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help with curating multimodal content through automatic video and audio retrieval.

In the longer term, this work could be used to improve a robot’s ability to understand real-world environments, where auditory and visual information are often closely connected.

Improving upon prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so it learns a finer-grained correspondence between a specific video frame and the audio that happens in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds upon a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers feed this model, called CAV-MAE, unlabeled video clips and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.

They found that using two learning objectives balances the model’s learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to recover video clips that match user queries.
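At a high level, the alignment part of this training can be pictured with a short sketch. This is a minimal illustration of the general contrastive idea, not the authors’ code; the embedding size, temperature, and placeholder encoder outputs below are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE-style loss: pull each clip's audio and visual embeddings
    together and push mismatched pairs in the batch apart."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = audio_emb @ visual_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching audio/visual pairs sit on the diagonal of the similarity matrix.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Random tensors stand in for pooled encoder outputs, one row per clip.
audio_emb = torch.randn(8, 256)
visual_emb = torch.randn(8, 256)
loss = contrastive_loss(audio_emb, visual_emb)
```

The second, reconstruction-based objective described later in the article is trained alongside a term like this one.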

But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even when that audio event happens in only one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that happens during just that frame.

“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” Araujo says.
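The splitting itself is simple to sketch. The shapes, the number of sampled frames, and the equal-width windowing below are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch

def split_audio_into_windows(audio_spec, num_windows):
    """Chop a (time, freq) spectrogram into equal temporal windows,
    one window per sampled video frame."""
    t = audio_spec.size(0) - audio_spec.size(0) % num_windows  # drop the remainder
    return audio_spec[:t].reshape(num_windows, -1, audio_spec.size(1))

audio_spec = torch.randn(1000, 128)        # ~10 seconds of log-mel spectrogram (assumed)
frames = torch.randn(10, 3, 224, 224)      # 10 video frames sampled from the clip
windows = split_audio_into_windows(audio_spec, num_windows=frames.size(0))

# Each (frames[i], windows[i]) pair now shares a timestamp, so contrastive
# matching can be applied per frame/window pair instead of once per clip.
```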

They also incorporated architectural improvements that help the model balance its two learning objectives.

Adding “wiggle room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.

In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model’s learning ability.

They include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefited overall performance,” Araujo adds.
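One way to picture these extra tokens: learnable global and register tokens are prepended to the sequence of patch tokens before a transformer encoder, giving each objective its own slots to use. The sketch below is a loose illustration under assumed dimensions and token counts, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    """Prepends learnable global and register tokens to the patch tokens."""
    def __init__(self, dim=768, num_global=1, num_register=4, depth=2):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):              # (batch, num_patches, dim)
        b = patch_tokens.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        r = self.register_tokens.expand(b, -1, -1)
        x = self.encoder(torch.cat([g, r, patch_tokens], dim=1))
        num_extra = g.size(1) + r.size(1)
        # Global token outputs feed the contrastive objective; the patch
        # token outputs (with registers acting as extra scratch space)
        # feed the reconstruction objective.
        return x[:, :g.size(1)], x[:, num_extra:]
```

In this toy version, the two returned tensors would be passed to the contrastive and reconstruction losses, respectively, letting each objective lean on its own tokens.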

While the researchers had some intuition these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko says.

Ultimately, their enhancements improved the model’s ability to retrieve videos based on an audio query and predict the category of an audio-visual scene, like a dog barking or an instrument playing.
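With such embeddings in hand, retrieval itself reduces to a nearest-neighbor search over the shared representation space. Here is a toy version using random placeholders in place of the model’s real learned embeddings.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for learned representations.
video_gallery = F.normalize(torch.randn(100, 256), dim=-1)  # 100 candidate clips
audio_query = F.normalize(torch.randn(1, 256), dim=-1)      # e.g., a door slamming

scores = (audio_query @ video_gallery.t()).squeeze(0)  # cosine similarity to each clip
top5 = scores.topk(5).indices                           # indices of best-matching clips
print(top5.tolist())
```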

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.

“Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward generating an audiovisual large language model.

This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
