Testing the Massively Multilingual Speech (MMS) Model that Supports 1162 Languages
Introduction
The Approach to construct the Massively Multilingual Speech Model
Overview of the Fairseq Repository: A Powerful Toolkit for Sequence-to-Sequence Learning
Massively Multilingual Speech Models
Implementing Automatic Speech Recognition with Massively Multilingual Speech
Automatic Speech Recognition Results with Fairseq
Conclusion
References

Explore the cutting-edge multilingual features of Meta’s latest automatic speech recognition (ASR) model

Introduction

Massively Multilingual Speech (MMS)¹ is the most recent release from Meta AI (published just a few days ago). It pushes the boundaries of speech technology by expanding its reach from around 100 languages to over 1,000. This was achieved by building a single multilingual speech recognition model. The model can also identify more than 4,000 languages, representing a 40-fold increase over previous capabilities.

The MMS project aims to make it easier for people to access information and use devices in their preferred language. It expands text-to-speech and speech-to-text technology to underserved languages, continuing to reduce language barriers in our global world. Existing applications, such as virtual assistants or voice-activated devices, can now support a greater variety of languages. At the same time, new use cases emerge in cross-cultural communication, for example in messaging services or in virtual and augmented reality.

In this article, we walk through the use of MMS for ASR in English and Portuguese and provide a step-by-step guide on setting up the environment to run the model.

Figure 1: Massively Multilingual Speech (MMS) can identify over 4,000 languages and supports ASR in 1,162 of them (source)

This article belongs to “Large Language Models Chronicles: Navigating the NLP Frontier”, a new weekly series of articles that will explore how to leverage the power of large models for various NLP tasks. By diving into these cutting-edge technologies, we aim to empower developers, researchers, and enthusiasts to harness the potential of NLP and unlock new possibilities.

Articles published so far:

  1. Summarizing the latest Spotify releases with ChatGPT
  2. Master Semantic Search at Scale: Index Millions of Documents with Lightning-Fast Inference Times using FAISS and Sentence Transformers
  3. Unlock the Power of Audio Data: Advanced Transcription and Diarization with Whisper, WhisperX, and PyAnnotate
  4. Whisper JAX vs PyTorch: Uncovering the Truth about ASR Performance on GPUs
  5. Vosk for Efficient Enterprise-Grade Speech Recognition: An Evaluation and Implementation Guide

As always, the code is available on my GitHub.

The Approach to construct the Massively Multilingual Speech Model

Meta used religious texts, such as the Bible, to build a model covering this wide range of languages. These texts have several interesting properties: first, they are translated into many languages, and second, there are publicly available audio recordings of people reading them in different languages. Thus, the main dataset this model was trained on was the New Testament, which the research team was able to collect for over 1,100 languages, providing more than 32 hours of data per language. They went further to make it recognize 4,000 languages by using unlabeled recordings of various other Christian religious readings. The experimental results show that, even though the data comes from a specific domain, the model generalizes well.

These are not the only contributions of the work. The team also created a new preprocessing and alignment model that can handle long recordings. It was used to process the audio, and misaligned data was removed in a final cross-validation filtering step. Recall from one of our previous articles that one of Whisper's challenges was its inability to align transcriptions properly. Another important step of the approach was the use of wav2vec 2.0, a self-supervised learning model, to train the system on a massive amount of speech data (about 500,000 hours) in over 1,400 languages. The labeled dataset discussed above is not enough to train a model of MMS's size, so wav2vec 2.0 was used to reduce the need for labeled data. Finally, the resulting models were fine-tuned for specific speech tasks, such as multilingual speech recognition or language identification.

The MMS models were open-sourced by Meta a few days ago and made available in the Fairseq repository. In the next section, we cover what Fairseq is and how we can test these new models from Meta.

Overview of the Fairseq Repository: A Powerful Toolkit for Sequence-to-Sequence Learning

Fairseq is an open-source sequence-to-sequence toolkit developed by Facebook AI Research, also known as FAIR. It provides reference implementations of various sequence modeling algorithms, including convolutional and recurrent neural networks, transformers, and other architectures.

The Fairseq repository is built on PyTorch, another open-source project originally developed by Meta and now under the umbrella of the Linux Foundation. PyTorch is a very powerful machine learning framework that offers high flexibility and speed, particularly when it comes to deep learning.

The Fairseq implementations are designed for researchers and developers to train custom models, with support for tasks such as translation, summarization, language modeling, and other text generation tasks. One of the key features of Fairseq is its support for distributed training, meaning it can efficiently use multiple GPUs either on a single machine or across several machines. This makes it well suited for large-scale machine learning tasks.

Massively Multilingual Speech Models

Fairseq provides two pre-trained models for download: MMS-300M and MMS-1B. You also have access to fine-tuned models for different languages and datasets. For our purpose, we test the MMS-1B model fine-tuned on 102 languages from the FLEURS dataset, and also MMS-1B-all, which handles 1,162 languages (!) and was fine-tuned on several different datasets.

Implementing Automatic Speech Recognition with Massively Multilingual Speech

Keep in mind that these models are still in the research phase, which makes testing a bit harder. There are additional steps that you would not find with production-ready software.

First, you need to set up a .env file in your project root to configure your environment variables. It should look something like this:

CURRENT_DIR=/path/to/current/dir
AUDIO_SAMPLES_DIR=/path/to/audio_samples
FAIRSEQ_DIR=/path/to/fairseq
VIDEO_FILE=/path/to/video/file
AUDIO_FILE=/path/to/audio/file
RESAMPLED_AUDIO_FILE=/path/to/resampled/audio/file
TMPDIR=/path/to/tmp
PYTHONPATH=.
PREFIX=INFER
HYDRA_FULL_ERROR=1
USER=micro
MODEL=/path/to/fairseq/models_new/mms1b_all.pt
MODELS_NEW=/path/to/fairseq/models_new
LANG=eng

Next, you need to configure the YAML file located at fairseq/examples/mms/asr/config/infer_common.yaml. This file contains important settings and parameters used by the inference script.

In the YAML file, use a full path for the checkpoint field, like this (unless you are running the script in a container):

checkpoint: /path/to/checkpoint/${env:USER}/${env:PREFIX}/${common_eval.results_path}

Using the full path avoids potential permission issues when the application is not running inside a container.

If you plan to use a CPU for computation instead of a GPU, you need to add the following directive to the top level of the YAML file:

common:
  cpu: true

This setting directs the script to use the CPU for computation.
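If you are unsure whether PyTorch can actually see a GPU on your machine, a quick check such as the one below (a minimal sketch, assuming PyTorch is already installed as a Fairseq dependency) tells you whether the directive is needed:

import torch

# Print whether a CUDA device is visible; if not, the `cpu: true` directive is required
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - add `cpu: true` under `common:` in infer_common.yaml")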

We use the python-dotenv library to load these environment variables in our Python script. Since we are overwriting some system variables, we need a trick to make sure the right values are loaded. We use the dotenv_values method and store its output in a variable. This ensures that we get the variables stored in our .env file and not random system variables, even if they share the same name.

from dotenv import dotenv_values

# Read the .env file into a dictionary without touching os.environ
config = dotenv_values(".env")

current_dir = config['CURRENT_DIR']
tmp_dir = config['TMPDIR']
fairseq_dir = config['FAIRSEQ_DIR']
video_file = config['VIDEO_FILE']
audio_file = config['AUDIO_FILE']
audio_file_resampled = config['RESAMPLED_AUDIO_FILE']
model_path = config['MODEL']
model_new_dir = config['MODELS_NEW']
lang = config['LANG']

Then, we can clone the fairseq GitHub repository and install it on our machine.

import os
import subprocess

from git import Repo  # provided by the GitPython package


def git_clone(url, path):
    """
    Clones a git repository

    Parameters:
    url (str): The URL of the git repository
    path (str): The local path where the git repository will be cloned
    """
    if not os.path.exists(path):
        Repo.clone_from(url, path)


def install_requirements(requirements):
    """
    Installs pip packages

    Parameters:
    requirements (list): List of packages to install
    """
    subprocess.check_call(["pip", "install"] + requirements)


git_clone('https://github.com/facebookresearch/fairseq', 'fairseq')
# Install the cloned checkout in editable mode
install_requirements(['--editable', './fairseq'])

We already discussed the models we are using in this article, so let's download them to our local environment.

def download_file(url, path):
    """
    Downloads a file

    Parameters:
    url (str): URL of the file to be downloaded
    path (str): The path where the file will be saved
    """
    subprocess.check_call(["wget", "-P", path, url])


download_file('https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt', model_new_dir)
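Since the MODEL variable in our .env file points to mms1b_all.pt, we also need the MMS-1B-all checkpoint locally. The sketch below assumes it is published on the same host and with the same naming pattern as the MMS-1B-FL102 checkpoint above:

# Assumed URL, following the same naming pattern as the MMS-1B-FL102 checkpoint
download_file('https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt', model_new_dir)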

There is one additional restriction related to the input of the MMS model: the sampling rate of the audio data must be 16,000 Hz. In our case, we define two ways to generate these files: one that converts video to audio and another that resamples audio files to the correct sampling rate.

from pydub import AudioSegment


def convert_video_to_audio(video_path, audio_path):
    """
    Converts a video file to an audio file

    Parameters:
    video_path (str): Path to the video file
    audio_path (str): Path to the output audio file
    """
    subprocess.check_call(["ffmpeg", "-i", video_path, "-ar", "16000", audio_path])


def resample_audio(audio_path, new_audio_path, new_sample_rate):
    """
    Resamples an audio file

    Parameters:
    audio_path (str): Path to the current audio file
    new_audio_path (str): Path to the output audio file
    new_sample_rate (int): New sample rate in Hz
    """
    audio = AudioSegment.from_file(audio_path)
    audio = audio.set_frame_rate(new_sample_rate)
    audio.export(new_audio_path, format='wav')
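To tie these helpers to the paths loaded from the .env file, the calls could look like the following minimal sketch (use whichever conversion matches your input):

# Either extract 16 kHz audio directly from the video file...
convert_video_to_audio(video_file, audio_file_resampled)

# ...or resample an existing audio file to 16 kHz
# resample_audio(audio_file, audio_file_resampled, 16000)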

We are now ready to run the inference process using our MMS-1B-all model, which supports 1,162 languages.

def run_inference(model, lang, audio):
    """
    Runs the MMS ASR inference

    Parameters:
    model (str): Path to the model file
    lang (str): Language of the audio file
    audio (str): Path to the audio file
    """
    subprocess.check_call(
        [
            "python",
            "examples/mms/asr/infer/mms_infer.py",
            "--model",
            model,
            "--lang",
            lang,
            "--audio",
            audio,
        ]
    )


run_inference(model_path, lang, audio_file_resampled)
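One caveat: dotenv_values only reads the .env file into a Python dictionary; it does not export anything to the process environment. Since infer_common.yaml references variables such as ${env:USER} and ${env:PREFIX}, and the script path above is relative to the fairseq checkout, you may need to pass the working directory and a merged environment explicitly. The variant below is a minimal sketch of that idea (an assumption on my side, not part of the original setup):

# Merge the .env entries into the environment seen by the inference subprocess
env = {**os.environ, **config}  # values from .env take precedence over system variables

subprocess.check_call(
    [
        "python",
        "examples/mms/asr/infer/mms_infer.py",
        "--model", model_path,
        "--lang", lang,
        "--audio", audio_file_resampled,
    ],
    cwd=fairseq_dir,  # resolve the relative script path inside the fairseq checkout
    env=env,
)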

Automatic Speech Recognition Results with Fairseq

In this section, we describe our experimental setup and discuss the results. We performed ASR using two different models from Fairseq, MMS-1B-all and MMS-1B-FL102, in both English and Portuguese. You can find the audio files in my GitHub repo; these are files that I generated myself just for testing purposes.

Let's start with the MMS-1B-all model. Here is the output for the English and Portuguese audio samples:

: just requiring a small clip to understand if the new facebook research model really performs on

: ora bem só agravar aqui um exemplo pa tentar perceber se de facto om novo modelo da facebook research realmente funciona ou não vamos estar

With the MMS-1B-FL102 model, the generated transcription was significantly worse. Let's look at the same example for English:

: just recarding a small ho clip to understand if the new facebuok research model really performs on speed recognition tasks lets see

While the transcriptions are not super impressive by the standard of the models we have today, we need to look at these results from the perspective that these models open up ASR to a much wider range of the global population.

Conclusion

The Massively Multilingual Speech model, developed by Meta, represents another step toward fostering global communication and broadening the reach of language technology using AI. Its ability to identify over 4,000 languages and perform effectively across 1,162 of them increases accessibility for many languages that have traditionally been underserved.

Our testing of the MMS models showcased the possibilities and limitations of the technology at its current stage. Although the transcriptions generated by the MMS-1B-FL102 model were not as impressive as expected, the MMS-1B-all model provided promising results, demonstrating its capability to transcribe speech in both English and Portuguese. Portuguese has been one of those underserved languages, especially when we consider Portuguese from Portugal.

Feel free to try it out in your chosen language and to share the transcription and feedback in the comments section.

Keep in touch: LinkedIn

References

[1] — Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.
