Tl;dr: This post explains how to use the specificities of the Connectionist
Temporal Classification (CTC) architecture in order to achieve very good
quality automatic speech recognition (ASR) even on arbitrarily long files or
during live inference.
Wav2Vec2 is a popular pre-trained model for speech recognition.
Released in September 2020 by Meta AI Research, the novel architecture
catalyzed progress in self-supervised pretraining for speech recognition,
e.g. G. Ng et al., 2021, Chen et al., 2021, Hsu et al., 2021 and Babu et al.,
2021. On the Hugging Face Hub, Wav2Vec2's most popular pre-trained checkpoint
currently amounts to over 250,000 monthly downloads.
Wav2Vec2 is at its core a transformer model, and one caveat of transformers
is that they usually have a finite sequence length they can handle, either
because they use position encodings (not the case here) or simply because the
cost of attention in transformers is effectively O(n²) in sequence_length,
meaning that a very large sequence_length explodes in complexity/memory. So
with finite hardware (even a very large GPU like an A100), you cannot simply
run Wav2Vec2 on an hour-long file. Your program will crash. Let's try it!
pip install transformers

from transformers import pipeline

pipe = pipeline(model="facebook/wav2vec2-base-960h")
pipe("very_long_file.mp3")
# Crashes: out of memory!

pipe("very_long_file.mp3", chunk_length_s=10)
# This works and prints a very long string!
Simple chunking
The simplest way to achieve inference on very long files would be to simply chunk
the initial audio into shorter samples, say 10 seconds each, run inference on those,
and stitch together a final reconstruction. This is computationally efficient but
usually leads to subpar results: in order to do good inference, the model needs some
context, so around the chunking border, inference tends to be of poor quality.
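As a rough sketch of that naive approach (the variable `audio` is made up for the
example and is assumed to be a 1D float32 numpy array already loaded at 16kHz,
e.g. with librosa or soundfile):

from transformers import pipeline

pipe = pipeline(model="facebook/wav2vec2-base-960h")

sampling_rate = 16_000
chunk_len = 10 * sampling_rate  # 10 second chunks

# Run the model independently on each chunk and glue the texts back together.
texts = []
for start in range(0, len(audio), chunk_len):
    chunk = audio[start : start + chunk_len]
    texts.append(pipe({"raw": chunk, "sampling_rate": sampling_rate})["text"])

full_text = " ".join(texts)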
Take a look at the following diagram:
There are ways to try to work around the problem in a general fashion, but
they are never entirely robust. You can try to chunk only when you encounter
silence, but you may have non-silent audio for a very long time (a song, or noisy
café audio). You can also try to cut only when there is no voice, but that requires
another model and is not an entirely solved problem. You could also have
a continuous voice for a very long time.
As it turns out, the CTC structure, which is used by Wav2Vec2, can be exploited
in order to achieve very robust speech recognition even on very long files
without falling into those pitfalls.
Chunking with stride
Wav2Vec2 uses the CTC algorithm, which means that every frame of audio is mapped
to a single letter prediction (logit).
That is the main feature we are going to use in order to add a stride.
This link explains it in the image context, but it is the same concept for audio.
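As a toy illustration of that property (pure Python, no model involved; Wav2Vec2's
tokenizer uses "<pad>" as the CTC blank token):

# Frame-level letter predictions, as a CTC model emits them (one per audio frame).
frame_predictions = ["H", "H", "E", "<pad>", "L", "L", "<pad>", "L", "O", "O"]

# CTC decoding: collapse repeated letters, then drop the blank ("<pad>") tokens.
collapsed = []
for p in frame_predictions:
    if not collapsed or p != collapsed[-1]:
        collapsed.append(p)
text = "".join(p for p in collapsed if p != "<pad>")
print(text)  # HELLO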
Because of that property, we can:
- Start doing inference on overlapping chunks, so that the model actually has proper context in the center.
- Drop the inferenced logits on the side.
- Chain the logits without their dropped sides to recover something extremely close to what the model would have predicted on the full length audio.
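Here is a minimal sketch of that idea at the logit level. It is not the pipeline's
actual implementation (which handles more corner cases), and it again assumes
`audio` is a 1D float32 numpy array sampled at 16kHz:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

sampling_rate = 16_000
chunk_len = 10 * sampling_rate    # 10s of audio per forward pass
stride_left = 4 * sampling_rate   # left context, dropped from the output
stride_right = 2 * sampling_rate  # right context, dropped from the output
step = chunk_len - stride_left - stride_right

kept_logits = []
for chunk_start in range(0, len(audio), step):
    chunk = audio[chunk_start : chunk_start + chunk_len]
    inputs = processor(chunk, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0]  # (frames, vocab_size)

    # Convert sample counts into logit frame counts (Wav2Vec2 downsamples the waveform).
    ratio = logits.shape[0] / len(chunk)
    is_first = chunk_start == 0
    is_last = chunk_start + chunk_len >= len(audio)
    left = 0 if is_first else int(round(stride_left * ratio))
    right = 0 if is_last else int(round(stride_right * ratio))
    kept_logits.append(logits[left : logits.shape[0] - right])
    if is_last:
        break

# Chain the kept logits and decode as if the model had seen the whole file.
full_logits = torch.cat(kept_logits, dim=0)
predicted_ids = full_logits.argmax(dim=-1)
print(processor.decode(predicted_ids))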
This shouldn’t be technically 100% the identical thing as running the model on the entire
file so it shouldn’t be enabled by default, but as you saw in the sooner example you
need only so as to add chunk_length_s to your pipeline for it to work.
In practice, we observed that almost all of the bad inference is kept inside
the strides, which get dropped before inference, resulting in a correct
inference of the total text.
Let’s note which you can select every argument of this system:
from transformers import pipeline
pipe = pipeline(model="facebook/wav2vec2-base-960h")
output = pipe("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))
Chunking with stride on LM augmented models
In transformers, we also added support for adding an LM to Wav2Vec2 in order to
boost the WER performance of the models without even finetuning. See this
excellent blogpost explaining how it works.
It turns out that the LM works directly on the logits themselves, so we
can actually apply exactly the same technique as before without any modification!
So chunking large files on these LM-boosted models still works out of the box.
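For example, assuming an LM-boosted checkpoint such as
patrickvonplaten/wav2vec2-base-100h-with-lm is available on the Hub (decoding with
the LM additionally requires the pyctcdecode and kenlm packages), the exact same
call applies:

from transformers import pipeline

# Example LM-boosted checkpoint; any Wav2Vec2 + LM model works the same way.
pipe = pipeline(model="patrickvonplaten/wav2vec2-base-100h-with-lm")
output = pipe("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))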
Live inference
A really nice perk of using a CTC model like Wav2vec2, is that it’s a single
pass model, so it’s very fast. Especially on GPU. We will exploit that so as
to do live inference.
The principle is strictly the identical as regular striding, but this time we are able to
feed the pipeline data because it is coming in and easily use striding on
full chunks of length 10s for example with 1s striding to get proper context.
That requires running far more inference steps than easy file chunking, but it might make the
live experience significantly better since the model can print things as you’re
speaking, without having to attend for X seconds before seeing something displayed.
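Here is a rough sketch of that streaming setup. It assumes a working microphone and
ffmpeg installed, and uses the ffmpeg_microphone_live helper from
transformers.pipelines.audio_utils (check that your installed version ships it):

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

pipe = pipeline(model="facebook/wav2vec2-base-960h")
sampling_rate = pipe.feature_extractor.sampling_rate

# Stream 1s of microphone audio at a time, accumulated into 10s chunks for context.
mic = ffmpeg_microphone_live(
    sampling_rate=sampling_rate,
    chunk_length_s=10.0,
    stream_chunk_s=1.0,
)
for item in pipe(mic):
    # Partial transcriptions are refined as more audio of the current chunk arrives.
    print(item["text"])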



