Teaching AI to communicate sounds like humans do


Whether you’re describing the sound of your faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can be a helpful way to relay a concept when words don’t do the trick.

Vocal imitation is the sonic equivalent of doodling a quick picture to communicate something you saw, except that instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. This might seem difficult, but it’s something we all do intuitively: To experience it for yourself, try using your voice to mirror the sound of an ambulance siren, a crow, or a bell being struck.

Inspired by the cognitive science of how we communicate, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have developed an AI system that can produce human-like vocal imitations with no training, and without ever having “heard” a human vocal impression before.

To achieve this, the researchers engineered their system to produce and interpret sounds much like we do. They began by building a model of the human vocal tract that simulates how vibrations from the voice box are shaped by the throat, tongue, and lips. Then, they used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into account the context-specific ways in which humans choose to communicate sound.
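The paragraph above describes a classic source-filter view of voice production: a periodic source from the voice box is shaped by resonances of the throat, tongue, and lips. As a rough illustration of that general idea only, and not the CSAIL team’s actual vocal tract model, a minimal Python sketch might look like this; the sample rate, formant frequencies, and filter design are illustrative assumptions.

```python
# Minimal source-filter sketch (NOT the CSAIL model): a periodic "voice box"
# source is shaped by resonant filters standing in for the throat, tongue, and lips.
import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 16_000  # Hz, an illustrative choice


def glottal_source(f0: float, duration: float) -> np.ndarray:
    """Crude glottal excitation: an impulse train at pitch f0."""
    n = int(SAMPLE_RATE * duration)
    source = np.zeros(n)
    period = int(SAMPLE_RATE / f0)
    source[::period] = 1.0
    return source


def formant_filter(signal: np.ndarray, freq: float, bandwidth: float) -> np.ndarray:
    """Second-order all-pole resonator approximating one vocal tract formant."""
    r = np.exp(-np.pi * bandwidth / SAMPLE_RATE)
    theta = 2 * np.pi * freq / SAMPLE_RATE
    # y[n] = x[n] + 2*r*cos(theta)*y[n-1] - r^2*y[n-2]
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, signal)


def synthesize_vowel(f0: float, formants: list[tuple[float, float]], duration: float = 0.5) -> np.ndarray:
    """Shape the glottal source with a cascade of formant resonators."""
    out = glottal_source(f0, duration)
    for freq, bw in formants:
        out = formant_filter(out, freq, bw)
    return out / np.max(np.abs(out))


# Roughly "ah"-like formant values, chosen purely for illustration.
audio = synthesize_vowel(f0=120.0, formants=[(700, 110), (1220, 120), (2600, 160)])
```

In this framing, the “control” the researchers describe amounts to choosing parameters like pitch and resonance shapes over time; the sketch above only shows the forward synthesis step.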

The model can effectively take many sounds from the world and generate a human-like imitation of them, including noises like leaves rustling, a snake’s hiss, and an approaching ambulance siren. The model can also be run in reverse to guess real-world sounds from human vocal imitations, much like how some computer vision systems can retrieve high-quality images based on sketches. For example, the model can correctly distinguish the sound of a human imitating a cat’s “meow” from its “hiss.”

In the future, this model could potentially lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.

The co-lead authors (MIT CSAIL PhD students Kartik Chandra SM ’23 and Karima Ma, and undergraduate researcher Matthew Caren) note that computer graphics researchers have long recognized that realism isn’t the ultimate goal of visual expression. For example, an abstract painting or a child’s crayon doodle can be just as expressive as a photograph.

“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “In the same way that a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phonorealistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”


“The goal of this project has been to understand and computationally model vocal imitation, which we take to be the sort of auditory equivalent of sketching in the visual domain,” says Caren.

The art of imitation, in three parts

The team developed three increasingly nuanced versions of the model to compare to human vocal imitations. First, they created a baseline model that simply aimed to generate imitations that were as similar to real-world sounds as possible, but this model didn’t match human behavior very well.

The researchers then designed a second “communicative” model. According to Caren, this model considers what’s distinctive about a sound to a listener. For instance, you’d likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that’s its most distinctive auditory feature, even if it’s not the loudest aspect of the sound (compared to, say, the water splashing). This second model created imitations that were better than the baseline, but the team wanted to improve it even more.

To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different based on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” says Chandra. The researchers’ full model accounts for this by trying to avoid utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in conversation. The result: more human-like imitations that closely match many of the decisions that humans make when imitating the same sounds.
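To make the three-layer progression concrete, here is a hypothetical scoring sketch: a baseline acoustic-similarity term, a communicative distinctiveness term, and an effort penalty. The feature representation, function names, and weights are invented for illustration and do not reproduce the paper’s actual objective.

```python
# Hypothetical three-term scoring sketch, loosely mirroring the progression
# described above; it is not the CSAIL team's actual objective or features.
import numpy as np


def similarity(imitation: np.ndarray, target: np.ndarray) -> float:
    """Baseline term: how acoustically close the imitation is to the target
    (here a toy cosine similarity over fixed-length feature vectors)."""
    denom = np.linalg.norm(imitation) * np.linalg.norm(target) + 1e-9
    return float(np.dot(imitation, target) / denom)


def distinctiveness(imitation: np.ndarray, target: np.ndarray,
                    distractors: list[np.ndarray]) -> float:
    """Communicative term: reward imitations that point a listener to the
    target more strongly than to other sounds it could be confused with."""
    if not distractors:
        return 0.0
    return similarity(imitation, target) - max(
        similarity(imitation, d) for d in distractors)


def effort(params: dict) -> float:
    """Effort term: penalize very loud, very fast, or extreme-pitch utterances."""
    return params["loudness"] + params["speed"] + abs(params["pitch_deviation"])


def score(imitation: np.ndarray, params: dict, target: np.ndarray,
          distractors: list[np.ndarray],
          w_comm: float = 1.0, w_effort: float = 0.1) -> float:
    """Full-model trade-off: fidelity plus communication minus cost of effort."""
    return (similarity(imitation, target)
            + w_comm * distinctiveness(imitation, target, distractors)
            - w_effort * effort(params))
```

Setting `w_comm` and `w_effort` to zero recovers the baseline behavior, while the two extra terms stand in for the “communicative” and “effort” layers the researchers describe.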

After building this model, the team conducted a behavioral experiment to see whether AI- or human-generated vocal imitations were perceived as better by human judges. Notably, participants in the experiment favored the AI model 25 percent of the time on average, as much as 75 percent of the time for an imitation of a motorboat, and 50 percent for an imitation of a gunshot.

Toward more expressive sound technology

Passionate about technology for music and art, Caren envisions that this model could help artists better communicate sounds to computational systems and assist filmmakers and other content creators with generating AI sounds that are more nuanced to a specific context. It could also enable a musician to rapidly search a sound database by imitating a noise that is difficult to describe in, say, a text prompt.

In the meantime, Caren, Chandra, and Ma are looking at the implications of their model in other domains, including the development of language, how infants learn to talk, and even imitation behaviors in birds like parrots and songbirds.

The team still has work to do with the current iteration of their model: It struggles with some consonants, like “z,” which led to inaccurate impressions of some sounds, like bees buzzing. They also can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across different languages, like a heartbeat.

Stanford University linguistics professor Robert Hawkins says that language is full of onomatopoeia and words that mimic but don’t fully replicate the things they describe, like the “meow” sound that very inexactly approximates the sound cats make. “The processes that get us from the sound of a real cat to a word like ‘meow’ reveal a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who wasn’t involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, demonstrating that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the distribution of vocal imitations.”

Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, MIT Department of Electrical Engineering and Computer Science associate professor, and Joshua Tenenbaum, MIT Brain and Cognitive Sciences professor and Center for Brains, Minds, and Machines member. Their work was supported, in part, by the Hertz Foundation and the National Science Foundation. It was presented at SIGGRAPH Asia in early December.
