Pushing the frontiers of audio generation



Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world share information and ideas, express emotions and build mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we're unlocking richer, more engaging digital experiences.

Over the past few years, we've been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, pace controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube's auto dubbing, and helps people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

In partnership with teams across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging, lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we've been investing in audio generation research and exploring new ways of generating more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a sequence of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
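To make the token-mapping step concrete, here is a minimal sketch of residual vector quantization (RVQ), the general mechanism that neural codecs of this kind use to turn each latent frame from the encoder into a short stack of discrete tokens. The sizes and helper names (`quantize_rvq`, `dequantize_rvq`) are illustrative assumptions, not SoundStream's actual implementation.

```python
import numpy as np

def quantize_rvq(frame, codebooks):
    """Residual vector quantization: each codebook quantizes whatever
    residual the previous stage left behind, emitting one token per stage."""
    tokens, residual = [], frame.astype(float).copy()
    for codebook in codebooks:               # codebook shape: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))          # nearest code = one acoustic token
        tokens.append(idx)
        residual -= codebook[idx]            # hand the remainder to the next stage
    return tokens

def dequantize_rvq(tokens, codebooks):
    """Reconstruct the latent frame by summing the chosen code from each stage."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

rng = np.random.default_rng(0)
dim, num_codes, num_stages = 8, 16, 4        # toy sizes, not SoundStream's
codebooks = [rng.normal(size=(num_codes, dim)) for _ in range(num_stages)]

frame = rng.normal(size=dim)                 # stand-in for one encoder output frame
tokens = quantize_rvq(frame, codebooks)
approx = dequantize_rvq(tokens, codebooks)
print(tokens, np.linalg.norm(frame - approx))  # error shrinks as stages are added
```

Because every stage encodes only the residual of the one before it, adding stages steadily shrinks the reconstruction error, which is how a handful of small codebooks can describe audio with high fidelity.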

AudioLM treats audio generation as a language modeling task that produces the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and it can flexibly handle a variety of sounds without requiring architectural adjustments, making it a great candidate for modeling multi-speaker dialogues.
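To illustrate the framing, the sketch below runs the same autoregressive decoding loop used for text language models, but over acoustic tokens. The toy bigram table stands in for a trained Transformer; `VOCAB`, `TABLE` and the sampling details are assumptions for illustration, not AudioLM's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16                               # codec codebook size doubles as the vocabulary
TABLE = rng.normal(size=(VOCAB, VOCAB))  # toy bigram "weights"; a real system trains a Transformer

def next_token_logits(prefix):
    """Score every candidate next acoustic token given the prefix.
    A trained model would attend over the whole prefix; this toy
    version only looks at the last token."""
    return TABLE[prefix[-1]]

def generate(prompt, steps):
    """Ordinary autoregressive decoding, exactly as for text LMs,
    except each sampled token indexes a neural codec's codebook
    rather than a word vocabulary."""
    seq = list(prompt)
    for _ in range(steps):
        logits = next_token_logits(seq)
        probs = np.exp(logits - logits.max())  # softmax over candidate tokens
        probs /= probs.sum()
        seq.append(int(rng.choice(VOCAB, p=probs)))
    return seq

tokens = generate(prompt=[3], steps=20)
print(tokens)  # a codec decoder would turn these tokens back into a waveform
```

The decoding loop never needs to know what the tokens mean acoustically; that separation is what lets the same framework handle speech, sound effects or multi-speaker dialogue without architectural changes.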


