Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech


Sesame published a demo of their latest Speech-to-Speech model. It is a conversational AI agent that is genuinely good at speaking: it gives relevant answers, it speaks with expression, and honestly, it is just very fun and interactive to play with.

 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It takes in both text and audio, and generates speech as audio. While they haven't revealed their training data sources in their articles, we can still attempt to make a solid guess. The blog post heavily cites another CSM, 2024's Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses a large corpus of unsupervised speech data, natural and scripted conversations (for multi-stream training), and telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values: a waveform. For instance, if you sample audio at 24 kHz, you capture 24,000 float values every second.

There are 24,000 values here to represent 1 second of speech! (Image generated by the author)

In fact, it is quite resource-intensive to process 24,000 float values for just one second of information, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.
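As a quick sanity check of these numbers, here is a tiny NumPy snippet; the sine tone is just a stand-in for real speech.

```python
import numpy as np

SAMPLE_RATE = 24_000  # 24 kHz, the rate Mimi operates at

# One second of a 440 Hz sine tone standing in for real speech.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE          # 24,000 time steps
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # 24,000 float amplitudes

print(waveform.shape)   # (24000,) -- one second is already a long sequence
print(waveform.dtype)   # float64
```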

We'll take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of audio/speech modeling in deep learning today. We'll end the article by learning how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that first converts audio waveforms into discrete "latent" tokens and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let's find out how.

Mimi takes in the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, then 5x, then 6x, and so on. In the end, it downsamples by a factor of 1920, reducing the signal to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented as around 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 to just 12.5 per second and converts the signal into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame being a 512-dimensional vector. (Image from the author's video)
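Here is a minimal PyTorch sketch of such a strided-convolution front end. The strides multiply to the 1920x factor described above, while the kernel sizes and channel widths are assumptions for illustration, not Mimi's actual hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for Mimi's convolutional front end. The strides
# (4, 5, 6, 8, 2) multiply to 1920, the paper's overall downsampling factor.
strides = [4, 5, 6, 8, 2]
channels = [1, 64, 128, 256, 512, 512]   # assumed widths, ending at the 512-d embedding

layers = []
for s, c_in, c_out in zip(strides, channels[:-1], channels[1:]):
    layers += [nn.Conv1d(c_in, c_out, kernel_size=s, stride=s), nn.ELU()]
encoder_frontend = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 24_000)     # (batch, channels, samples): 1 second at 24 kHz
frames = encoder_frontend(waveform)
print(frames.shape)                      # torch.Size([1, 512, 12]) -> ~12.5 frames per second
```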

What’s Audio Quantization?

Given the continuous embeddings obtained after the convolution layers, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language-modeling transformers to train generative models.

Mimi uses a Residual Vector Quantizer, or RVQ tokenizer, to achieve this. We'll talk about the residual part soon, but first, let's look at what a simple vanilla Vector Quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1000 random vector codes, all of size 512 (the same as your embedding dimension).

A vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from the author's video)

Then, given an input vector, we map it to the closest vector in our codebook, basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because no matter what the input frame embedding may be, we represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper into it.

More about Vector Quantization! (Video by the author)
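A minimal sketch of such a vanilla vector quantizer in PyTorch; the codebook size of 1000 and the random (untrained) codes are purely illustrative.

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(1000, 512)        # 1000 code vectors of size 512 (untrained here)

def quantize(frame_embeddings: torch.Tensor):
    """Snap each 512-d frame embedding to its nearest codebook entry."""
    distances = torch.cdist(frame_embeddings, codebook)  # (num_frames, num_codes)
    indices = distances.argmin(dim=-1)                   # discrete token id per frame
    return indices, codebook[indices]                    # token ids + quantized vectors

frames = torch.randn(12, 512)             # roughly 1 second of Mimi frames
token_ids, quantized = quantize(frames)
print(token_ids.shape, quantized.shape)   # torch.Size([12]) torch.Size([12, 512])
```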

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high, because we are mapping each vector to its cluster's centroid. This snapping is never perfect, so there is always an error between the original embedding and the nearest codebook vector.

The big idea of Residual Vector Quantization is that it doesn't stop at just one codebook. Instead, it uses multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you're left with is the residual: the error that wasn't captured in the first quantization.
  3. Now take this residual and quantize it again, using a second codebook filled with brand-new code vectors, again snapping it to the nearest centroid.
  4. Subtract that centroid too, and you get an even smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.

Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer at each stage to represent the previous stage's error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let's say, N codebooks, you get N discrete tokens, one from each stage of quantization, to represent one audio frame.
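Here is a minimal sketch of that encode/decode loop, assuming 8 codebooks of 1000 entries each (illustrative numbers, not Mimi's actual configuration).

```python
import torch

torch.manual_seed(0)
N_CODEBOOKS, CODES, DIM = 8, 1000, 512
codebooks = torch.randn(N_CODEBOOKS, CODES, DIM)   # one codebook per quantization stage

def rvq_encode(frames: torch.Tensor) -> torch.Tensor:
    """Return N discrete tokens per frame by repeatedly quantizing the residual."""
    residual = frames
    all_ids = []
    for cb in codebooks:                               # stage 1 ... stage N
        ids = torch.cdist(residual, cb).argmin(dim=-1)
        residual = residual - cb[ids]                  # what this stage failed to capture
        all_ids.append(ids)
    return torch.stack(all_ids, dim=-1)                # (num_frames, N_CODEBOOKS)

def rvq_decode(ids: torch.Tensor) -> torch.Tensor:
    """Reconstruct an embedding by summing the chosen entry from every codebook."""
    return sum(cb[ids[:, i]] for i, cb in enumerate(codebooks))

frames = torch.randn(12, DIM)
tokens = rvq_encode(frames)               # 12 frames x 8 codes each
recon = rvq_decode(tokens)
print(tokens.shape, recon.shape)          # torch.Size([12, 8]) torch.Size([12, 512])
```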

The nice thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. The subsequent quantizers learn increasingly fine-grained features.

If you're familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds finer details.

Residual Vector Quantizers (RVQ) use multiple codebooks to encode the input vector, one entry from each codebook. (Screenshot from the author's video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal into the discretized latent space, and the decoder reconstructs it back from that latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space.

Mimi also separately trains a single codebook (vanilla VQ) that focuses only on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer: it divides the quantization process into two independent parallel paths, one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic codebook's output and the WavLM-generated embedding.
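As a rough illustration of that distillation objective, here is what a cosine-distance loss between the semantic quantizer's output and the teacher embeddings could look like. The function name, the shapes, and the assumption that both tensors are already projected to a common dimension are mine, not taken from the papers.

```python
import torch
import torch.nn.functional as F

def semantic_distillation_loss(semantic_quantized: torch.Tensor,
                               wavlm_embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss pulling the semantic codebook's output toward the
    WavLM teacher embeddings. Both are assumed to be (num_frames, dim) and
    already aligned in time and dimension (an assumption of this sketch)."""
    cosine_sim = F.cosine_similarity(semantic_quantized, wavlm_embeddings, dim=-1)
    return (1.0 - cosine_sim).mean()

# Toy usage with random tensors standing in for real activations.
loss = semantic_distillation_loss(torch.randn(12, 512), torch.randn(12, 512))
print(loss.item())
```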


Audio Decoder

Given a conversation containing text and audio, we first convert it into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then fed into a transformer model as a time series. In the blog post, this model is called the Autoregressive Backbone Transformer. Its task is to process this time series and output the "zeroth" codebook token.

A lighter-weight transformer, called the audio decoder, then reconstructs the remaining codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated using N-1 distinct linear layers that output the probability of selecting each code from their corresponding codebooks.

You can think of this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each one.

The Sesame Architecture (Illustration by the author)
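To make the two-transformer generation step concrete, here is a heavily simplified PyTorch sketch. The model sizes, the greedy argmax sampling, and the single-pass decoder are all assumptions for illustration; in the real system the audio decoder fills in the remaining codes autoregressively, one codebook at a time.

```python
import torch
import torch.nn as nn

N_CODEBOOKS, CODES, DIM = 8, 1000, 512   # illustrative sizes, not Sesame's actual config

# Stand-ins for the two transformers; the real models are autoregressive decoders.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=2)
audio_decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=1)

zeroth_head = nn.Linear(DIM, CODES)                                      # predicts codebook 0
other_heads = nn.ModuleList([nn.Linear(DIM, CODES) for _ in range(N_CODEBOOKS - 1)])
code_embed = nn.Embedding(CODES, DIM)

history = torch.randn(1, 50, DIM)        # interleaved text + audio token embeddings

# 1) The backbone looks at the whole conversation history and predicts the zeroth code.
h = backbone(history)[:, -1]             # last hidden state, shape (1, DIM)
code0 = zeroth_head(h).argmax(dim=-1)    # greedy pick, just for the sketch

# 2) The lightweight decoder, conditioned on the zeroth code, fills in codes 1..N-1,
#    each with its own linear head over its own codebook.
d = audio_decoder(code_embed(code0).unsqueeze(1))[:, -1]
other_codes = [head(d).argmax(dim=-1) for head in other_heads]

frame_codes = torch.stack([code0, *other_codes], dim=-1)
print(frame_codes.shape)                 # torch.Size([1, 8]): one full RVQ audio frame
```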

Finally, after all the codewords are generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this embedding back to a waveform. For this, we apply transposed convolution layers to upsample the embedding from 12.5 Hz back to 24 kHz waveform audio, basically reversing the transforms we applied during audio preprocessing.
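A minimal sketch of that upsampling, mirroring the encoder sketch from earlier: transposed convolutions with the encoder's strides in reverse order undo the 1920x downsampling. Kernel sizes and channel widths are again illustrative assumptions, not Mimi's actual decoder configuration.

```python
import torch
import torch.nn as nn

strides = [2, 8, 6, 5, 4]                # encoder strides reversed: product is 1920
channels = [512, 512, 256, 128, 64, 1]   # assumed widths, ending at a single waveform channel

layers = []
for i, (s, c_in, c_out) in enumerate(zip(strides, channels[:-1], channels[1:])):
    layers.append(nn.ConvTranspose1d(c_in, c_out, kernel_size=s, stride=s))
    if i < len(strides) - 1:             # no nonlinearity on the final waveform output
        layers.append(nn.ELU())
decoder_frontend = nn.Sequential(*layers)

frames = torch.randn(1, 512, 12)         # roughly 1 second of 12.5 Hz audio embeddings
waveform = decoder_frontend(frames)
print(waveform.shape)                    # torch.Size([1, 1, 23040]): back to ~24 kHz samples
```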

In Summary

Check out the accompanying video for this article! (Video by the author)

So, here is an overall summary of the Sesame model in a few bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model, or CSM.
  2. Text and audio are tokenized together to form a sequence of tokens and fed into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes the sequence of tokens and predicts the zeroth codeword of the next audio frame.
  5. Another lightweight transformer, called the Audio Decoder, predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords and is upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447 
SpeechTokenizer: https://arxiv.org/abs/2308.16692

