Unlock the Power of Audio Data: Advanced Transcription and Diarization with Whisper, WhisperX, and PyAnnotate

Contents:

  1. Introduction
  2. Whisper: A General-Purpose Speech Recognition Model
  3. PyAnnotate: Speaker Diarization Library
  4. WhisperX: Long-Form Audio Transcription with Voice Activity Detection and Forced Phoneme Alignment
  5. Integrating WhisperX, Whisper, and PyAnnotate
  6. Assessing the Performance of the Integrated ASR System
  7. Conclusion


In our fast-paced world, we generate enormous amounts of audio data. Think of your favorite podcast or the conference calls at work. The information is already rich in its raw form; we, as humans, can understand it. Even so, we could go further and, for instance, convert it into a written format so that we can search it later.

To better understand the task at hand, we introduce two concepts. The first is transcription, which simply converts spoken language into text. The second, which we explore in this article, is diarization. Diarization helps us give additional structure to unstructured content. In this case, we are interested in attributing specific speech segments to different speakers.

With the above context, we address both tasks using different tools. We use Whisper, a general-purpose speech recognition model developed by OpenAI. It was trained on a diverse dataset of audio samples, and the researchers developed it to perform multiple tasks. Secondly, we use PyAnnotate, a library for speaker diarization. Finally, we use WhisperX, a research project that helps combine the two while solving some limitations of Whisper.

Figure 1: Speak to the machines (source).

This article belongs to “Large Language Models Chronicles: Navigating the NLP Frontier”, a new weekly series of articles that will explore how to leverage the power of large models for various NLP tasks. By diving into these cutting-edge technologies, we aim to empower developers, researchers, and enthusiasts to harness the potential of NLP and unlock new possibilities.

Articles published to date:

  1. Summarizing the most recent Spotify releases with ChatGPT
  2. Master Semantic Search at Scale: Index Hundreds of thousands of Documents with Lightning-Fast Inference Times using FAISS and Sentence Transformers

As always, the code is available on my GitHub.

Whisper is a general-purpose speech recognition model that performs very well across various speech-processing tasks. It is reasonably robust at multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

At the core of Whisper lies a Transformer sequence-to-sequence model. The model jointly represents various speech-processing tasks as a sequence of tokens to be predicted by the decoder. The model can replace multiple stages of a standard speech-processing pipeline by employing special tokens as task specifiers or classification targets. We can think of it as a meta-model for speech-processing tasks.

Whisper comes in five model sizes, targeting anything from edge devices to large computing machines. This lets users pick the appropriate model for their use case and the capacity of their systems. Note that the English-only versions of some models perform better for English use cases.
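To make the size trade-off concrete, here is a minimal sketch of how one might pick a model for a given memory budget. The parameter counts and memory figures are the approximate values listed in the Whisper README at the time of writing; `ModelSize` and `pick_model` are hypothetical helpers written for this illustration and are not part of the Whisper API.

```python
from typing import List, NamedTuple, Optional


class ModelSize(NamedTuple):
    name: str
    parameters_millions: int
    approx_vram_gb: int
    has_english_only: bool


# Approximate figures from the Whisper README (assumption: they may change
# between releases). The list is ordered from smallest to largest.
MODEL_SIZES: List[ModelSize] = [
    ModelSize("tiny", 39, 1, True),
    ModelSize("base", 74, 1, True),
    ModelSize("small", 244, 2, True),
    ModelSize("medium", 769, 5, True),
    ModelSize("large", 1550, 10, False),
]


def pick_model(available_vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the largest model that fits the memory budget, or None if none fits."""
    candidates = [m for m in MODEL_SIZES if m.approx_vram_gb <= available_vram_gb]
    if not candidates:
        return None
    best = candidates[-1]
    # English-only checkpoints use the ".en" suffix; "large" has no such variant.
    if english_only and best.has_english_only:
        return f"{best.name}.en"
    return best.name


print(pick_model(6))                     # medium
print(pick_model(6, english_only=True))  # medium.en
```

The returned name can then be passed as the model name when loading Whisper.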

Speaker diarization is the process of identifying and segmenting speech by different speakers. The task can be useful, for instance, when we are analyzing data from a call center and want to separate the customer’s and the agent’s voices. Companies can then use it to improve customer service and ensure compliance with company policy.

PyAnnotate (distributed as the pyannote.audio package) is a Python library specifically designed to support this task. The process is relatively simple. It preprocesses the data, allowing us to extract features from the raw audio file. Next, it produces clusters of similar speech segments based on the extracted features. Finally, it attributes the generated clusters to individual speakers.

As we saw in the previous sections, Whisper is a large-scale, weakly supervised model trained to perform several tasks in the speech-processing field. While it performs well across different domains and even different languages, it falls short when it comes to long audio transcription. The limitation comes from the fact that the model operates on 30-second windows, so longer recordings are transcribed with a sliding-window approach, which can result in drifting or even hallucination. In addition, it has severe limitations when it comes to aligning the transcription with the audio timestamps. This is especially important to us when performing speaker diarization.

To tackle these limitations, an Oxford research group is actively developing WhisperX. The arXiv pre-print was published last month. It uses Voice Activity Detection (VAD), which detects the presence or absence of human speech, to pre-segment the input audio file. It then cuts and merges these segments into windows of roughly 30 seconds, placing the boundaries in regions where there is a low probability of speech (as output by the VAD model). This step has a further advantage: it allows batched transcription with Whisper. It increases performance while reducing the probability of the drifting or hallucination we discussed above. The final step is called forced alignment. WhisperX uses a phoneme model to align the transcription with the audio. Phoneme-based Automatic Speech Recognition (ASR) recognizes the smallest unit of speech, e.g., the element “g” in “big.” This post-processing operation aligns the generated transcription with the audio timestamps at the word level.
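The merging half of that "cut and merge" step can be sketched in a few lines of plain Python. This is an illustrative simplification, not WhisperX's actual implementation: it assumes the VAD model has already produced a sorted list of speech intervals, it only merges (a single interval longer than the window is left uncut), and `merge_speech_segments` is a hypothetical name.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_speech_segments(
    segments: List[Segment], max_window: float = 30.0
) -> List[Segment]:
    """Greedily merge consecutive speech segments into windows of at most max_window seconds."""
    windows: List[Segment] = []
    if not segments:
        return windows

    window_start, window_end = segments[0]
    for start, end in segments[1:]:
        if end - window_start <= max_window:
            # The segment still fits: extend the current window.
            window_end = end
        else:
            # Close the window at the end of the last speech segment, so the
            # boundary falls in a silent region, and start a new window.
            windows.append((window_start, window_end))
            window_start, window_end = start, end
    windows.append((window_start, window_end))
    return windows


speech = [(0.0, 8.0), (9.0, 20.0), (22.0, 31.0), (33.0, 40.0), (65.0, 70.0)]
print(merge_speech_segments(speech))
# [(0.0, 20.0), (22.0, 40.0), (65.0, 70.0)]
```

Because every window boundary sits in a gap between speech segments, no word is split across two windows, and the resulting windows can be transcribed independently as a batch.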

In this section, we integrate WhisperX, Whisper, and PyAnnotate to create our own ASR system. We designed our approach to handle long-form audio transcription while being able to segment the speech and attribute a specific speaker to each segment. In addition, it reduces the probability of hallucination, increases inference efficiency, and ensures proper alignment between the transcription and the audio. Let’s build a pipeline to perform the different tasks.

We start with transcription, converting the speech recognized from the audio file into written text. The transcribe function loads a Whisper model specified by model_name and transcribes the audio file. It then returns a dictionary containing the transcript segments and the language code. Being a multilingual model, Whisper was also designed by OpenAI to perform language detection.

from typing import Any, Dict

import whisper


def transcribe(audio_file: str, model_name: str, device: str = "cpu") -> Dict[str, Any]:
    """
    Transcribe an audio file using a speech-to-text model.

    Args:
        audio_file: Path to the audio file to transcribe.
        model_name: Name of the model to use for transcription.
        device: The device to use for inference (e.g., "cpu" or "cuda").

    Returns:
        A dictionary representing the transcript, including the segments and the language code.
    """
    model = whisper.load_model(model_name, device)
    result = model.transcribe(audio_file)

    # Whisper detects the spoken language as part of transcription.
    language_code = result["language"]
    return {
        "segments": result["segments"],
        "language_code": language_code,
    }
Next, we align the transcript segments using the align_segments function. As we discussed previously, this step is essential for accurate speaker diarization, because it ensures that each segment corresponds to the correct speaker:

from typing import Any, Dict, List

from whisperx import align, load_align_model


def align_segments(
    segments: List[Dict[str, Any]],
    language_code: str,
    audio_file: str,
    device: str = "cpu",
) -> Dict[str, Any]:
    """
    Align the transcript segments using a pretrained alignment model.

    Args:
        segments: List of transcript segments to align.
        language_code: Language code of the audio file.
        audio_file: Path to the audio file containing the audio data.
        device: The device to use for inference (e.g., "cpu" or "cuda").

    Returns:
        A dictionary representing the aligned transcript segments.
    """
    # The alignment (phoneme) model is selected by language.
    model_a, metadata = load_align_model(language_code=language_code, device=device)
    result_aligned = align(segments, model_a, metadata, audio_file, device)
    return result_aligned

With the transcript segments aligned, we can now perform speaker diarization. We use the diarize function, which leverages the PyAnnotate library:

from typing import Any, Dict

from whisperx.diarize import DiarizationPipeline  # path may vary by WhisperX version


def diarize(audio_file: str, hf_token: str) -> Dict[str, Any]:
    """
    Perform speaker diarization on an audio file.

    Args:
        audio_file: Path to the audio file to diarize.
        hf_token: Authentication token for accessing the Hugging Face API.

    Returns:
        A dictionary representing the diarized audio file, including the speaker embeddings and the number of speakers.
    """
    diarization_pipeline = DiarizationPipeline(use_auth_token=hf_token)
    diarization_result = diarization_pipeline(audio_file)
    return diarization_result

After diarization, we assign speakers to each transcript segment using the assign_speakers function. It is the final step in our pipeline and completes our process of transforming the raw audio file into a transcript with speaker information:

from typing import Any, Dict, List

from whisperx.diarize import assign_word_speakers


def assign_speakers(
    diarization_result: Dict[str, Any], aligned_segments: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """
    Assign speakers to each transcript segment based on the speaker diarization result.

    Args:
        diarization_result: Dictionary representing the diarized audio file, including the speaker embeddings and the number of speakers.
        aligned_segments: Dictionary representing the aligned transcript segments.

    Returns:
        A list of dictionaries representing each segment of the transcript, including the start and end times, the
        spoken text, and the speaker ID.
    """
    result_segments, word_seg = assign_word_speakers(
        diarization_result, aligned_segments["segments"]
    )
    # Keep only the fields we need for the final transcript.
    results_segments_w_speakers: List[Dict[str, Any]] = []
    for result_segment in result_segments:
        results_segments_w_speakers.append(
            {
                "start": result_segment["start"],
                "end": result_segment["end"],
                "text": result_segment["text"],
                "speaker": result_segment["speaker"],
            }
        )
    return results_segments_w_speakers
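To build some intuition for what speaker assignment involves, here is a minimal sketch of one plausible rule: give each transcript segment the speaker whose diarization turns overlap it the most in time. This is an assumption made for illustration and may differ from what WhisperX's assign_word_speakers actually does; `overlap` and `speaker_for_segment` are hypothetical helpers, and the speaker turns below are made up.

```python
from typing import Dict, List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def speaker_for_segment(
    seg_start: float, seg_end: float, turns: List[Turn]
) -> Optional[str]:
    """Return the speaker with the largest total overlap, or None if nobody overlaps."""
    totals: Dict[str, float] = {}
    for turn_start, turn_end, speaker in turns:
        shared = overlap(seg_start, seg_end, turn_start, turn_end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    if not totals:
        return None
    return max(totals, key=lambda speaker: totals[speaker])


# Made-up diarization turns for illustration only.
turns = [(0.0, 3.0, "SPEAKER_01"), (3.2, 6.0, "SPEAKER_00")]
print(speaker_for_segment(0.95, 2.44, turns))  # SPEAKER_01
print(speaker_for_segment(2.50, 5.40, turns))  # SPEAKER_00
```

This also shows why accurate timestamps matter: if a segment's start and end times drift, its overlap with the diarization turns shifts, and it can be attributed to the wrong speaker.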

Finally, we combine all the steps into a single transcribe_and_diarize function. This function returns a list of dictionaries representing each transcript segment, including the start and end times, spoken text, and speaker identifier. Note that you need a Hugging Face API token to run the pipeline.

def transcribe_and_diarize(
    audio_file: str,
    hf_token: str,
    model_name: str,
    device: str = "cpu",
) -> List[Dict[str, Any]]:
    """
    Transcribe an audio file and perform speaker diarization to determine which words were spoken by each speaker.

    Args:
        audio_file: Path to the audio file to transcribe and diarize.
        hf_token: Authentication token for accessing the Hugging Face API.
        model_name: Name of the model to use for transcription.
        device: The device to use for inference (e.g., "cpu" or "cuda").

    Returns:
        A list of dictionaries representing each segment of the transcript, including the start and end times, the
        spoken text, and the speaker ID.
    """
    # Run the four pipeline stages defined above, in order.
    transcript = transcribe(audio_file, model_name, device)
    aligned_segments = align_segments(
        transcript["segments"], transcript["language_code"], audio_file, device
    )
    diarization_result = diarize(audio_file, hf_token)
    results_segments_w_speakers = assign_speakers(diarization_result, aligned_segments)

    # Print the results in a user-friendly way
    for i, segment in enumerate(results_segments_w_speakers):
        print(f"Segment {i + 1}:")
        print(f"Start time: {segment['start']:.2f}")
        print(f"End time: {segment['end']:.2f}")
        print(f"Speaker: {segment['speaker']}")
        print(f"Transcript: {segment['text']}")

    return results_segments_w_speakers

Let’s start by testing our pipeline with a short audio clip that I recorded myself. There are two speakers in the video that we need to identify. Also, notice the several hesitations in the speech of one of the speakers, which make it hard to transcribe. We will use the base model from Whisper to assess its capabilities. For better accuracy, you can use the medium or large ones. The transcription is presented below:

Segment 1:
Start time: 0.95
End time: 2.44
Speaker: SPEAKER_01
Transcript: What TV show are you watching?

Segment 2:
Start time: 3.56
End time: 5.40
Speaker: SPEAKER_00
Transcript: Currently I’m watching This Is Us.

Segment 3:
Start time: 6.18
End time: 6.93
Speaker: SPEAKER_01
Transcript: What’s it about?

Segment 4:
Start time: 8.30
End time: 15.44
Speaker: SPEAKER_00
Transcript: It’s about the life of a family throughout several generations.

Segment 5:
Start time: 15.88
End time: 21.42
Speaker: SPEAKER_00
Transcript: And you can somehow live their lives through the series.

Segment 6:
Start time: 22.34
End time: 23.55
Speaker: SPEAKER_01
Transcript: What will be the next one?

Segment 7:
Start time: 25.48
End time: 28.81
Speaker: SPEAKER_00
Transcript: Possibly beef I’ve been hearing excellent things about it.

Execution time for base: 8.57 seconds
Memory usage for base: 3.67GB
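For readers who want to collect a similar execution-time figure for their own runs, a minimal timing sketch is shown below. It is not necessarily how the numbers above were measured; `timed` is a hypothetical helper, the workload is a stand-in, and memory usage is not covered here.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in practice this would be the pipeline function,
# e.g. timed(transcribe_and_diarize, audio_file, hf_token, "base").
result, seconds = timed(sum, range(1_000_000))
print(f"Execution time: {seconds:.2f} seconds")
```

time.perf_counter is a monotonic, high-resolution clock, which makes it a better fit for measuring elapsed time than time.time.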

Our approach achieves its main goals with the transcription above. First, notice that the transcription is accurate and that we could successfully ignore the speech hesitations. We produced text with the correct syntax, which helps readability. The segments were well separated and correctly aligned with the audio timestamps. Finally, the speaker diarization was also executed adequately, with the two speakers attributed correctly to each speech segment.

Another important aspect is the computational efficiency of the various models on long-form audio when running inference on CPU and GPU. We selected an audio file of around 30 minutes. Below, you can find the results:

Figure 2: Execution time for the various models using CPU and GPU (Image by Author).

The main takeaway is that these models are very heavy and need to become more efficient to run at scale. For 30 minutes of video, we take around 70–75 minutes to run the transcription on the CPU and roughly 15 minutes on the GPU. Also, keep in mind that we need about 10GB of VRAM to run the large model. We should expect these results since the models are still in the research phase.
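A convenient way to read these numbers is the real-time factor: processing time divided by audio duration, where values above 1 mean the pipeline is slower than real time. The short sketch below only redoes the arithmetic on the figures quoted above; `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Processing time divided by audio duration (above 1 means slower than real time)."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return processing_minutes / audio_minutes


AUDIO_MINUTES = 30.0  # length of the test file

cpu_low = real_time_factor(70.0, AUDIO_MINUTES)
cpu_high = real_time_factor(75.0, AUDIO_MINUTES)
gpu = real_time_factor(15.0, AUDIO_MINUTES)

print(f"CPU: {cpu_low:.2f}x to {cpu_high:.2f}x real time")  # CPU: 2.33x to 2.50x real time
print(f"GPU: {gpu:.2f}x real time")                         # GPU: 0.50x real time
```

In other words, the CPU needs more than two minutes of compute per minute of audio, while the GPU processes the file in about half its duration, roughly 4.7 to 5 times faster than the CPU.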

This article provides a comprehensive step-by-step guide to analyzing audio data using state-of-the-art speech recognition and speaker diarization technologies. We introduced Whisper, PyAnnotate, and WhisperX, which together form a powerful integrated ASR system. Our approach produces promising results when working with long-form audio transcriptions. It also addresses the main limitations of Whisper: it reduces hallucination on long-form audio, ensures alignment between the transcription and the audio, accurately segments the speech, and attributes speakers to each segment.

Nonetheless, the computational efficiency of these models remains a challenge, particularly for long-form audio and when running inference on limited hardware. Even so, the integration of Whisper, WhisperX, and PyAnnotate demonstrates the potential of these tools to transform the way we process and analyze audio data, unlocking new possibilities for applications across various industries and use cases.

Keep in touch: LinkedIn

