We’re happy to announce that SpeechT5 is now available in 🤗 Transformers, an open-source library that offers easy-to-use implementations of state-of-the-art machine learning models.
SpeechT5 was originally described in the paper SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing by Microsoft Research Asia. The official checkpoints published by the paper’s authors can be found on the Hugging Face Hub.
If you want to jump right in, here are some demos on Spaces:
Introduction
SpeechT5 is not one, not two, but three kinds of speech models in a single architecture.
It can do:
- speech-to-text for automatic speech recognition or speaker identification,
- text-to-speech to synthesize audio, and
- speech-to-speech for converting between different voices or performing speech enhancement.
The main idea behind SpeechT5 is to pre-train a single model on a mixture of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data. This way, the model learns from text and speech at the same time. The result of this pre-training approach is a model that has a unified space of hidden representations shared by both text and speech.
At the heart of SpeechT5 is a regular Transformer encoder-decoder model. Just like any other Transformer, the encoder-decoder network models a sequence-to-sequence transformation using hidden representations. This Transformer backbone is the same for all SpeechT5 tasks.
To make it possible for the same Transformer to deal with both text and speech data, so-called pre-nets and post-nets were added. It is the job of the pre-net to convert the input text or speech into the hidden representations used by the Transformer. The post-net takes the outputs from the Transformer and turns them into text or speech again.
A figure illustrating SpeechT5’s architecture is depicted below (taken from the original paper).
During pre-training, all of the pre-nets and post-nets are used simultaneously. After pre-training, the entire encoder-decoder backbone is fine-tuned on a single task. Such a fine-tuned model only uses the pre-nets and post-nets specific to the given task. For example, to use SpeechT5 for text-to-speech, you’d swap in the text encoder pre-net for the text inputs and the speech decoder pre- and post-nets for the speech outputs.
Note: Even though the fine-tuned models start out using the same set of weights from the shared pre-trained model, the final versions are all quite different in the end. You can’t take a fine-tuned ASR model and swap out the pre-nets and post-net to get a working TTS model, for instance. SpeechT5 is flexible, but not that flexible.
Text-to-speech
SpeechT5 is the first text-to-speech model we’ve added to 🤗 Transformers, and we plan to add more TTS models in the near future.
For the TTS task, the model uses the following pre-nets and post-nets:
- Text encoder pre-net. A text embedding layer that maps text tokens to the hidden representations that the encoder expects. Similar to what happens in an NLP model such as BERT.
- Speech decoder pre-net. This takes a log mel spectrogram as input and uses a sequence of linear layers to compress the spectrogram into hidden representations. This design is taken from the Tacotron 2 TTS model.
- Speech decoder post-net. This predicts a residual to add to the output spectrogram and is used to refine the results, also from Tacotron 2.
The architecture of the fine-tuned model looks like the following.
Here is a complete example of how to use the SpeechT5 text-to-speech model to synthesize speech. You can also follow along in this interactive Colab notebook.
SpeechT5 is not available in the latest release of Transformers yet, so you’ll need to install it from GitHub. Also install the extra dependency sentencepiece and then restart your runtime.
pip install git+https://github.com/huggingface/transformers.git
pip install sentencepiece
First, we load the fine-tuned model from the Hub, along with the processor object used for tokenization and feature extraction. The class we’ll use is SpeechT5ForTextToSpeech.
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
Next, tokenize the input text.
inputs = processor(text="Don't count the times, make the times count.", return_tensors="pt")
The SpeechT5 TTS model is not limited to creating speech for a single speaker. Instead, it uses so-called speaker embeddings that capture a particular speaker’s voice characteristics. We’ll load such a speaker embedding from a dataset on the Hub.
from datasets import load_dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
import torch
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
The speaker embedding is a tensor of shape (1, 512). This particular speaker embedding describes a female voice. The embeddings were obtained from the CMU ARCTIC dataset using this script, but any X-Vector embedding should work.
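A quick sanity check on what we just loaded:
print(speaker_embeddings.shape)  # torch.Size([1, 512])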
Now we can tell the model to generate the speech, given the input tokens and the speaker embedding.
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
This outputs a tensor of shape (140, 80) containing a log mel spectrogram. The first dimension is the sequence length, and it may vary between runs because the speech decoder pre-net always applies dropout to the input sequence. This adds a bit of random variability to the generated speech.
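To see this for yourself, you can run the generation twice; the spectrogram lengths will usually differ slightly (the exact shapes you get are a matter of chance).
spectrogram_a = model.generate_speech(inputs["input_ids"], speaker_embeddings)
spectrogram_b = model.generate_speech(inputs["input_ids"], speaker_embeddings)
# The first dimension (the sequence length) typically differs between the two
# runs because of the dropout in the speech decoder pre-net.
print(spectrogram_a.shape, spectrogram_b.shape)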
To convert the predicted log mel spectrogram into an actual speech waveform, we need a vocoder. In theory, you can use any vocoder that works on 80-bin mel spectrograms, but for convenience, we’ve provided one in Transformers based on HiFi-GAN. The weights for this vocoder, as well as the weights for the fine-tuned TTS model, were kindly provided by the original authors of SpeechT5.
Loading the vocoder is as easy as any other 🤗 Transformers model.
from transformers import SpeechT5HifiGan
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
To make audio from the spectrogram, do the following:
with torch.no_grad():
    speech = vocoder(spectrogram)
We’ve also provided a shortcut so you don’t need the intermediate step of making the spectrogram. When you pass the vocoder object into generate_speech, it directly outputs the speech waveform.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
And finally, save the speech waveform to a file. The sample rate used by SpeechT5 is always 16 kHz.
import soundfile as sf
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)
The output sounds like this (download audio):
That’s it for the TTS model! The key to making this sound good is to use the right speaker embeddings.
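For example, picking a different row from the same x-vector dataset gives you a different voice. The index below is arbitrary; each row is simply the x-vector of one CMU ARCTIC utterance.
# An arbitrary different entry from the x-vector dataset; other indices work too.
other_embedding = torch.tensor(embeddings_dataset[2271]["xvector"]).unsqueeze(0)
speech_other = model.generate_speech(inputs["input_ids"], other_embedding, vocoder=vocoder)
sf.write("tts_other_voice.wav", speech_other.numpy(), samplerate=16000)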
You can play with an interactive demo on Spaces.
💡 Interested in learning how to fine-tune SpeechT5 TTS on your own dataset or language? Check out this Colab notebook with a detailed walk-through of the process.
Speech-to-speech for voice conversion
Conceptually, doing speech-to-speech modeling with SpeechT5 is the same as text-to-speech. Simply swap out the text encoder pre-net for the speech encoder pre-net. The rest of the model stays the same.
The speech encoder pre-net is the same as the feature encoding module from wav2vec 2.0. It consists of convolution layers that downsample the input waveform into a sequence of audio frame representations.
As an example of a speech-to-speech task, the authors of SpeechT5 provide a fine-tuned checkpoint for doing voice conversion. To use this, first load the model from the Hub. Note that the model class is now SpeechT5ForSpeechToSpeech.
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
We will need some speech audio to use as input. For the purpose of this example, we’ll load the audio from a small speech dataset on the Hub. You can also load your own speech waveforms, as long as they are mono and use a sampling rate of 16 kHz. The samples from the dataset we’re using here are already in this format.
from datasets import load_dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
example = dataset[40]
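If you’d rather use your own recording, one way is to load it with librosa, which can convert to mono and resample to 16 kHz in a single step ("my_speech.wav" below is just a placeholder path):
# Load a local file as a mono 16 kHz waveform ("my_speech.wav" is a placeholder).
import librosa
waveform, _ = librosa.load("my_speech.wav", sr=16000, mono=True)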
Next, preprocess the audio to put it into the format that the model expects.
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
As with the TTS model, we’ll need speaker embeddings. These describe what the target voice sounds like.
import torch
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
We also need to load the vocoder to turn the generated spectrograms into an audio waveform. Let’s use the same vocoder as with the TTS model.
from transformers import SpeechT5HifiGan
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
Now we can perform the speech conversion by calling the model’s generate_speech method.
speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)
import soundfile as sf
sf.write("speech_converted.wav", speech.numpy(), samplerate=16000)
Changing to a different voice is as easy as loading a new speaker embedding. You can even make an embedding from your own voice!
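One way to do that is with the SpeechBrain x-vector model, which produces the same kind of 512-dimensional embedding. This is only a sketch (SpeechBrain is not part of 🤗 Transformers, and the import path may differ between SpeechBrain versions); here we simply reuse the example waveform loaded above, but any mono 16 kHz recording of your voice works the same way.
# Requires: pip install speechbrain
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
waveform = torch.tensor(example["audio"]["array"]).float().unsqueeze(0)  # mono, 16 kHz
with torch.no_grad():
    my_embedding = classifier.encode_batch(waveform)                                 # shape (1, 1, 512)
    my_embedding = torch.nn.functional.normalize(my_embedding, dim=2).squeeze(1)     # shape (1, 512)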
The unique input (download):
The converted voice (download):
Note that the converted audio in this example cuts off before the end of the sentence. This might be due to the pause between the two sentences, causing SpeechT5 to (wrongly) predict that the end of the sequence has been reached. Try it with another example, and you’ll find that often the conversion is correct but sometimes it stops prematurely.
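If that happens, it can help to adjust the stopping criteria of the generation loop. The sketch below assumes your version of Transformers exposes the threshold and minlenratio arguments on generate_speech (check the documentation for your version); the values are just illustrative.
# Parameter names and defaults may vary between Transformers versions.
speech = model.generate_speech(
    inputs["input_values"],
    speaker_embeddings,
    vocoder=vocoder,
    threshold=0.7,     # require a more confident stop prediction before ending generation
    minlenratio=0.1,   # enforce a minimum output length relative to the input length
)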
You can play with an interactive demo here. 🔥
Speech-to-text for automatic speech recognition
The ASR model uses the following pre-nets and post-net:
- Speech encoder pre-net. This is the same pre-net used by the speech-to-speech model and consists of the CNN feature encoder layers from wav2vec 2.0.
- Text decoder pre-net. Similar to the encoder pre-net used by the TTS model, this maps text tokens into the hidden representations using an embedding layer. (During pre-training, these embeddings are shared between the text encoder and decoder pre-nets.)
- Text decoder post-net. This is the simplest of them all and consists of a single linear layer that projects the hidden representations to probabilities over the vocabulary.
The architecture of the fine-tuned model looks like the following.
If you’ve tried any of the other 🤗 Transformers speech recognition models before, you’ll find SpeechT5 just as easy to use. The quickest way to get started is by using a pipeline.
from transformers import pipeline
generator = pipeline(task="automatic-speech-recognition", model="microsoft/speecht5_asr")
As speech audio, we’ll use the same input as in the previous section, but any audio file will work, since the pipeline automatically converts the audio into the correct format.
from datasets import load_dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
example = dataset[40]
Now we can ask the pipeline to process the speech and generate a text transcription.
transcription = generator(example["audio"]["array"])
Printing the transcription gives:
a man said to the universe sir i exist
That sounds exactly right! The tokenizer used by SpeechT5 is very basic and works at the character level. The ASR model will therefore not output any punctuation or capitalization.
Of course, you can also use the model class directly. First, load the fine-tuned model and the processor object. The class is now SpeechT5ForSpeechToText.
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")
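Since the tokenizer works at the character level, the vocabulary is tiny. You can check this yourself (the exact number may differ slightly between checkpoint versions):
print(len(processor.tokenizer))  # on the order of 80 tokens: characters plus a few special tokens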
Preprocess the speech input:
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
Finally, tell the model to generate text tokens from the speech input, and then use the processor’s decoding function to turn these tokens into actual text.
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
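batch_decode returns a list with one string per input, so the transcription is the first element:
print(transcription[0])  # the same transcription the pipeline produced above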
Play with an interactive demo for the speech-to-text task.
Conclusion
SpeechT5 is an interesting model because, unlike most other models, it allows you to perform multiple tasks with the same architecture. Only the pre-nets and post-nets change. By pre-training the model on these combined tasks, it becomes more capable at each of the individual tasks when fine-tuned.
We have only included checkpoints for the speech recognition (ASR), speech synthesis (TTS), and voice conversion tasks, but the paper also mentions the model was successfully used for speech translation, speech enhancement, and speaker identification. It’s very versatile!
