A new AI translation system for headphones clones multiple voices simultaneously


Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the headphone wearer into small regions and uses a neural network to search for potential speakers and pinpoint their direction.
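The article doesn't include the system's code, but the localization step it describes can be pictured with a short sketch: sweep a grid of candidate directions around the wearer, steer the headset's two microphones toward each one with a simple beamformer, and let a small neural network judge whether the steered audio contains speech. Everything below, including the 10-degree region size, the delay-and-sum beamformer, and the tiny scoring network, is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of the localization stage: divide the space around the
# wearer into angular regions, steer toward each one, and score it for speech.
import numpy as np
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000
MIC_SPACING_M = 0.15     # assumed distance between the two headphone mics
SPEED_OF_SOUND = 343.0   # m/s

# Stand-in speech detector: the real system uses a trained network; this is
# an untrained placeholder with a plausible interface.
speech_scorer = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def steer(left: np.ndarray, right: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Delay-and-sum beamformer: delay one channel so sound arriving from
    `azimuth_deg` lines up across both mics, then average the channels."""
    delay_s = MIC_SPACING_M * np.sin(np.deg2rad(azimuth_deg)) / SPEED_OF_SOUND
    delay_samples = int(round(delay_s * SAMPLE_RATE))
    return 0.5 * (left + np.roll(right, delay_samples))

def locate_speakers(left: np.ndarray, right: np.ndarray, step_deg: int = 10):
    """Scan `step_deg`-wide regions from -90 to +90 degrees and return the
    directions whose steered signal the network scores as speech."""
    detections = []
    for azimuth in range(-90, 91, step_deg):
        beam = steer(left, right, azimuth)
        # Crude 256-bin magnitude spectrum as the network's input feature.
        feature = np.abs(np.fft.rfft(beam, n=510))
        score = speech_scorer(torch.tensor(feature, dtype=torch.float32))
        if score.item() > 0.5:
            detections.append((azimuth, score.item()))
    return detections
```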

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction, and the voice sounds a lot like the speaker’s own rather than a robotic-sounding computer.
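As a rough sketch of this second stage, the pitch and amplitude extraction the article mentions can be done with standard audio tools (librosa here); the translation and voice-cloning models themselves are represented by hypothetical stubs (`translate_to_english_text`, `translate_with_cloned_voice`), since the article doesn't detail them.

```python
# Illustrative pipeline skeleton, not the authors' implementation: extract a
# speaker's voice profile, translate the words, and (in a real system) hand
# both to a synthesizer so the output keeps the speaker's voice.
import numpy as np
import librosa

def extract_voice_profile(speech: np.ndarray, sample_rate: int) -> dict:
    """Estimate the properties the article says get transferred to the
    translated speech: the speaker's pitch and amplitude."""
    f0, voiced_flag, _ = librosa.pyin(
        speech,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sample_rate,
    )
    median_pitch = float(np.nanmedian(f0[voiced_flag])) if voiced_flag.any() else 0.0
    return {
        "median_pitch_hz": median_pitch,
        "rms_amplitude": float(np.sqrt(np.mean(speech ** 2))),
    }

def translate_to_english_text(speech: np.ndarray, sample_rate: int) -> str:
    """Hypothetical stand-in for the speech-to-text translation model the
    article says was trained on publicly available data sets."""
    raise NotImplementedError("placeholder for the translation model")

def translate_with_cloned_voice(speech: np.ndarray, sample_rate: int):
    """Translate the words, then pair them with the voice profile; a real
    system would condition a text-to-speech model on the profile so the
    output keeps the speaker's pitch and loudness (synthesis elided here)."""
    english_text = translate_to_english_text(speech, sample_rate)
    profile = extract_voice_profile(speech, sample_rate)
    return english_text, profile
```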

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoctoral researcher at Carnegie Mellon University’s Language Technologies Institute, who didn’t work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data, possibly with noise and real-world recordings from the headset, rather than relying purely on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We really want to get that latency down significantly, to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German, reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end rather than at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who didn’t work on the project.

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”
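The trade-off Fantinuoli describes is easy to see in a toy model of a streaming translator that buffers a few seconds of source audio before producing output: the buffer is the context the model translates from, but every second of it is added directly to the delay the listener hears. All timings below are invented for illustration.

```python
# Toy model of the latency/context trade-off in streaming translation:
# the translator waits for `wait_s` seconds of source audio (its context
# window) before emitting speech. Numbers are illustrative assumptions.
def first_word_latency(wait_s: float, model_time_s: float = 0.3) -> float:
    """Delay before the listener hears translated speech: time spent
    buffering context plus a fixed model inference cost."""
    return wait_s + model_time_s

for wait_s in (0.5, 1.0, 2.0, 4.0):
    print(f"{wait_s:.1f}s of buffered context -> "
          f"~{first_word_latency(wait_s):.1f}s before translation starts")
```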
