Boosting Wav2Vec2 with n-grams in 🤗 Transformers

Patrick von Platen

Wav2Vec2 is a popular pre-trained model for speech recognition.
Released in September 2020 by Meta AI Research, the novel architecture
catalyzed progress in self-supervised pretraining for speech
recognition, e.g. G. Ng et al., 2021, Chen et al., 2021, Hsu et al.,
2021 and Babu et al., 2021. On the Hugging Face Hub, Wav2Vec2's most
popular pre-trained checkpoint currently amounts to over 250,000
monthly downloads.

Using Connectionist Temporal Classification (CTC), pre-trained
Wav2Vec2-like checkpoints are extremely easy to fine-tune on downstream
speech recognition tasks. In a nutshell, fine-tuning pre-trained
Wav2Vec2 checkpoints works as follows:

A single randomly initialized linear layer is stacked on top of the
pre-trained checkpoint and trained to classify raw audio input into a
sequence of letters. It does so by:

  1. extracting audio representations from the raw audio (using CNN
    layers),
  2. processing the sequence of audio representations with a stack of
    transformer layers, and,
  3. classifying the processed audio representations into a sequence of
    output letters (see the minimal sketch below).
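
To make this more tangible, here is a minimal sketch (assuming the facebook/wav2vec2-base-100h checkpoint that is also used below) that loads such a fine-tuned model and inspects the linear CTC head sitting on top of the transformer encoder.

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")

# The CTC head is a single linear layer that maps every transformer output
# vector to one logit per character of the vocabulary.
print(model.lm_head)
# For this checkpoint: Linear(in_features=768, out_features=32, bias=True)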

Previously, audio classification models required an additional language
model (LM) and a dictionary to transform the sequence of classified audio
frames into a coherent transcription. Wav2Vec2's architecture is based on
transformer layers, thus giving each processed audio representation
context from all other audio representations. In addition, Wav2Vec2
leverages the CTC algorithm for fine-tuning, which solves the problem of
alignment between a varying "input audio length"-to-"output text length"
ratio.
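
To illustrate how CTC resolves that varying length ratio, here is a toy sketch of greedy CTC collapsing (purely illustrative, not the 🤗 Transformers implementation): repeated frame-level predictions are merged and the blank token is dropped, so a long frame sequence maps to a much shorter letter sequence.

# Toy greedy CTC collapsing: merge repeated predictions, drop the blank token.
def ctc_greedy_collapse(frame_predictions, blank="<pad>"):
  output = []
  previous = None
  for token in frame_predictions:
    if token != previous and token != blank:
      output.append(token)
    previous = token
  return "".join(output)

frames = ["<pad>", "C", "C", "<pad>", "A", "A", "A", "T", "<pad>", "<pad>"]
print(ctc_greedy_collapse(frames))  # CAT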

Having contextualized audio classifications and no alignment problems,
Wav2Vec2 doesn’t require an external language model or dictionary to
yield acceptable audio transcriptions.

As can be seen in Appendix C of the official paper, Wav2Vec2 achieves
impressive downstream performance on LibriSpeech without using a language
model at all. However, from the appendix, it also becomes clear that using
Wav2Vec2 together with a language model can yield a significant
improvement, especially when the model was trained on only 10 minutes of
transcribed audio.

Until recently, the 🤗 Transformers library did not offer a simple user
interface to decode audio files with a fine-tuned Wav2Vec2 and a
language model. This has thankfully changed. 🤗 Transformers now offers
an easy-to-use integration with Kensho Technologies' pyctcdecode
library. This blog post is a step-by-step technical guide explaining how
one can create an n-gram language model and combine it with an existing
fine-tuned Wav2Vec2 checkpoint using 🤗 Datasets and 🤗 Transformers.

We start by answering the following questions:

  1. How does decoding audio with an LM differ from decoding audio
    without an LM?
  2. How to get suitable data for a language model?
  3. How to build an n-gram with KenLM?
  4. How to combine the n-gram with a fine-tuned Wav2Vec2 checkpoint?

For a deep dive into how Wav2Vec2 functions – which is not necessary for
this blog post – the reader is advised to consult the following
material:



1. Decoding audio data with Wav2Vec2 and a language model

As shown in 🤗 Transformers' example docs for Wav2Vec2, audio can be
transcribed as follows.

First, we install datasets and transformers.

pip install datasets transformers

Let's load a small excerpt of the LibriSpeech dataset to demonstrate
Wav2Vec2's speech transcription capabilities.

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset

Output:

    Reusing dataset librispeech_asr (/root/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr/clean/2.1.0/f2c70a4d03ab4410954901bde48c54b85ca1b7f9bf7d616e7e2a72b5ee6ddbfc)

    Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 73
    })

We can pick one of the 73 audio samples and listen to it.

audio_sample = dataset[2]
audio_sample["text"].lower()

Output:

    he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
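
To actually listen to the sample in a notebook, one can, for instance, use IPython's audio widget (a small convenience sketch, not part of the original example).

from IPython.display import Audio

# Play the raw waveform at its original sampling rate (16 kHz for LibriSpeech).
Audio(audio_sample["audio"]["array"], rate=audio_sample["audio"]["sampling_rate"])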

Having chosen a data sample, we now load the fine-tuned model and
processor.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")

Next, we process the data

inputs = processor(audio_sample["audio"]["array"], sampling_rate=audio_sample["audio"]["sampling_rate"], return_tensors="pt")

forward it to the model

import torch

with torch.no_grad():
  logits = model(**inputs).logits

and decode it

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

transcription[0].lower()

Output:

'he tells us that at this festive season of the year with christmaus and rose beef looming before us simalyis drawn from eating and its results occur most readily to the mind'

Comparing the transcription to the target transcription above, we can
see that some words sound correct but are not spelled correctly,
e.g.:

  • christmaus vs. christmas
  • rose vs. roast
  • simalyis vs. similes

Let's see whether combining Wav2Vec2 with an n-gram language model
can help here.

First, we need to install pyctcdecode and kenlm.

pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode

For demonstration purposes, we have prepared a new model repository
patrickvonplaten/wav2vec2-base-100h-with-lm
which contains the same Wav2Vec2 checkpoint but with an additional
4-gram language model for English.

Instead of using Wav2Vec2Processor, this time we use
Wav2Vec2ProcessorWithLM to load the 4-gram model in addition to
the feature extractor and tokenizer.

from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

In contrast to decoding the audio without a language model, the processor
now directly receives the model's output logits instead of the
argmax(logits) (called predicted_ids) above. The reason is that when
decoding with a language model, at each time step, the processor takes
the probabilities of all possible output characters into account. Let's
take a look at the dimensions of the logits output.

logits.shape

Output:

    torch.Size([1, 624, 32])

We can see that the logits correspond to a sequence of 624 vectors,
each having 32 entries. Each of the 32 entries thereby stands for the
logit probability of one of the 32 possible output characters of the
model:

" ".join(sorted(processor.tokenizer.get_vocab()))

Output:

"'     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |"

Intuitively, one can understand the decoding mechanism of
Wav2Vec2ProcessorWithLM as applying beam search through a matrix of
size 624 $\times$ 32 probabilities while leveraging the probabilities of
the next letters as given by the n-gram language model.
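
As a small illustration of that 624 $\times$ 32 matrix (illustrative only; the actual beam search happens inside pyctcdecode), one can normalize the logits and look at the character probabilities of a single frame.

import torch

# Per-frame character probabilities, shape [1, 624, 32].
probs = torch.nn.functional.softmax(logits, dim=-1)

# Most likely characters of the first frame, i.e. the candidates the beam
# search can pick from before the n-gram re-weights them.
top_probs, top_ids = probs[0, 0].topk(3)
print(processor.tokenizer.convert_ids_to_tokens(top_ids.tolist()), top_probs.tolist())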

OK, let's run the decoding step again. pyctcdecode's language model
decoder does not automatically convert torch tensors to numpy, so
we'll have to convert them ourselves first.

transcription = processor.batch_decode(logits.numpy()).text
transcription[0].lower()

Output:

'he tells us that at this festive season of the year with christmas and rose beef looming before us similes drawn from eating and its results occur most readily to the mind'

Cool! Recalling the words facebook/wav2vec2-base-100h without a
language model transcribed incorrectly previously, e.g.,

  • christmaus vs. christmas
  • rose vs. roast
  • simalyis vs. similes

we can take another look at the transcription of
facebook/wav2vec2-base-100h with a 4-gram language model. 2 out of
3 errors are corrected; christmas and similes have been correctly
transcribed.

Interestingly, the incorrect transcription of rose persists. However,
this should not surprise us very much. Decoding audio without a language
model is much more prone to yield spelling mistakes, such as
christmaus or simalyis (those words don't exist in the English
language as far as I know). This is because the speech recognition
system almost solely bases its prediction on the acoustic input it was
given and not really on the language modeling context of previous and
successive predicted letters ${}^1$.

The language model by itself most likely does favor the correct word
roast since the word sequence roast beef is much more common in
English than rose beef. Because the final transcription is derived
from a weighted combination of facebook/wav2vec2-base-100h output
probabilities and those of the n-gram language model, it is quite
common to see incorrectly transcribed words such as rose.

For more information on how you can tweak different parameters when
decoding with Wav2Vec2ProcessorWithLM, please take a look at the
official documentation here.
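
As a hedged example, the decoding behavior can be tuned directly when calling batch_decode – e.g. the beam width and the language model weights (parameter names as exposed by Wav2Vec2ProcessorWithLM / pyctcdecode; the values below are arbitrary):

transcription = processor.batch_decode(
    logits.numpy(),
    beam_width=100,  # number of beams kept during the search
    alpha=0.5,       # weight of the language model
    beta=1.5,        # weight of the word insertion bonus
).text
transcription[0].lower()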


${}^1$

Great, now that you have seen the advantages adding an n-gram language
model can bring, let's dive into how to create an n-gram and
Wav2Vec2ProcessorWithLM from scratch.



2. Getting data for your language model

A language model that is useful for a speech recognition system should
support the acoustic model, e.g. Wav2Vec2, in predicting the next word
(or token, letter) and therefore model the following distribution:
$\mathbf{P}(w_n | \mathbf{w}_0^{t-1})$, with $w_n$ being the next word
and $\mathbf{w}_0^{t-1}$ being the sequence of previous words.

As always, a language model is only as good as the data it is trained on.
In the case of speech recognition, we should therefore ask ourselves what
kind of data the speech recognition system will be used on:
conversations, audiobooks, movies, speeches, etc.?

The language model should be good at modeling language that corresponds
to the target transcriptions of the speech recognition system. For
demonstration purposes, we assume here that we have fine-tuned a
pre-trained facebook/wav2vec2-xls-r-300m on Common Voice 7 in Swedish.
The fine-tuned checkpoint can be found here. Common Voice 7 is a
crowd-sourced read-out audio dataset and we will evaluate the model on
its test data.

Let's now look for suitable text data on the Hugging Face Hub. We
search all datasets for those that contain Swedish data. Browsing a bit
through the datasets, we are looking for a dataset that is similar to
Common Voice's read-out audio data. The obvious choices of oscar and
mc4 might not be the most suitable here because they:

  • are generated from crawling the web, which might not be very
    clean and might not correspond well to spoken language
  • require a lot of pre-processing
  • are very large, which is not ideal for demonstration purposes
    here 😉

A dataset that seems sensible here, and which is relatively clean and
easy to pre-process, is europarl_bilingual, as it is a dataset based on
discussions and talks of the European parliament. It should therefore be
relatively clean and correspond well to read-out audio data. The dataset
is originally designed for machine translation and can therefore only be
accessed in translation pairs. We will only extract the text of the
target language, Swedish (sv), from the English-to-Swedish translations.

target_lang="sv"  

Let's download the data.

from datasets import load_dataset

dataset = load_dataset("europarl_bilingual", lang1="en", lang2=target_lang, split="train")

We see that the data is quite large – it has over one million
translations. Since it is only text data, it should be relatively easy
to process though.
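
Since the dataset can only be accessed in translation pairs, a single sample looks roughly as follows (a quick sanity check of the structure before extracting the Swedish side):

# Each sample stores an English-Swedish pair under the "translation" key,
# e.g. {"translation": {"en": "...", "sv": "..."}}.
print(dataset[0]["translation"].keys())
print(dataset[0]["translation"][target_lang][:100])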

Next, let's take a look at how the data was preprocessed when training
the fine-tuned XLS-R checkpoint in Swedish. Looking at the run.sh file,
we can see that the following characters were removed from the official
transcriptions:

chars_to_ignore_regex = '[,?.!\-;:"“%‘”�—’…–]'  

Let's do the same here so that the alphabet of our language model
matches that of the fine-tuned acoustic checkpoint.

We can write a single map function to extract the Swedish text and
process it right away.

import re

def extract_text(batch):
  text = batch["translation"][target_lang]
  batch["text"] = re.sub(chars_to_ignore_regex, "", text.lower())
  return batch

Let's apply the .map() function. This should take roughly 5 minutes.

dataset = dataset.map(extract_text, remove_columns=dataset.column_names)

Great. Let's upload it to the Hub so that we can inspect and reuse it
better.

You can log in by executing the following cell.

from huggingface_hub import notebook_login

notebook_login()

Output:

    Login successful
    Your token has been saved to /root/.huggingface/token
    Authenticated through git-credential store but this isn't the helper defined on your machine.
    You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

    git config --global credential.helper store

Next, we call 🤗 Hugging Face’s
push_to_hub
method to upload the dataset to the repo
"sv_corpora_parliament_processed".

dataset.push_to_hub(f"{target_lang}_corpora_parliament_processed", split="train")

That was easy! The dataset viewer is automatically enabled when
uploading a new dataset, which is very convenient. You can now directly
inspect the dataset online.

Feel free to look through our preprocessed dataset directly on
hf-test/sv_corpora_parliament_processed.
Even if we are not native Swedish speakers, we can see that the data
is well processed and seems clean.

Next, let's use the data to build a language model.



3. Build an n-gram with KenLM

While large language models based on the Transformer architecture have become the standard in NLP, it is still very common to use an n-gram LM to boost speech recognition systems – as shown in Section 1.

Looking again at Table 9 of Appendix C of the official Wav2Vec2 paper, it can be noticed that using a Transformer-based LM for decoding clearly yields better results than using an n-gram model, but the difference between the n-gram and the Transformer-based LM is much less significant than the difference between using an n-gram and no LM at all.

E.g., for the large Wav2Vec2 checkpoint that was fine-tuned on 10min only, an n-gram reduces the word error rate (WER) compared to no LM by ca. 80%, while a Transformer-based LM only reduces the WER by another 23% compared to the n-gram. This relative WER reduction becomes smaller the more data the acoustic model has been trained on. E.g., for the large checkpoint a Transformer-based LM reduces the WER by merely 8% compared to an n-gram LM, whereas the n-gram still yields a 21% WER reduction compared to no language model.
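
To make the notion of a relative WER reduction concrete, here is the arithmetic on made-up numbers (purely illustrative, not the paper's exact values):

# Hypothetical example: starting from 40.0% WER without any LM,
# an 80% relative reduction from an n-gram LM would give
wer_no_lm = 40.0
wer_ngram = wer_no_lm * (1 - 0.80)        # -> 8.0% WER
# and a further 23% relative reduction from a Transformer LM would give
wer_transformer = wer_ngram * (1 - 0.23)  # -> ~6.2% WER
print(wer_ngram, round(wer_transformer, 1))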

The reason why an n-gram is preferred over a Transformer-based LM is that n-grams come at a significantly smaller computational cost. For an n-gram, retrieving the probability of a word given previous words is almost only as computationally expensive as querying a look-up table or tree-like data storage – i.e. it is very fast compared to modern Transformer-based language models that require a full forward pass to retrieve the next word probabilities.
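
The "look-up table" intuition can be shown with a toy bigram table (purely illustrative; KenLM's actual data structures are highly optimized probing hash tables or tries):

# Toy bigram "language model": retrieving a log-probability is a single
# dictionary lookup – no forward pass through a neural network needed.
bigram_logprob = {
  ("roast", "beef"): -1.2,
  ("rose", "beef"): -6.8,
}

def score(prev_word, word, oov_logprob=-10.0):
  return bigram_logprob.get((prev_word, word), oov_logprob)

print(score("roast", "beef"), score("rose", "beef"))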

For more information on how n-grams function and why they are (still) so useful for speech recognition, the reader is advised to take a look at this excellent summary from Stanford.

Great, let's see step-by-step how to build an n-gram. We will use the
popular KenLM library to do so. Let's
start by installing the Ubuntu library prerequisites:

sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

before downloading and unpacking the KenLM repo.

wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

KenLM is written in C++, so we'll make use of cmake to build the
binaries.

mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
ls kenlm/build/bin

Great, as we can see, the executable functions have successfully
been built under kenlm/build/bin/.

KenLM by default computes an n-gram with Kneser-Ney smoothing.
All text data used to create the n-gram is expected to be stored in a
text file. We download our dataset and save it as a .txt file.

from datasets import load_dataset

username = "hf-test"  

dataset = load_dataset(f"{username}/{target_lang}_corpora_parliament_processed", split="train")

with open("text.txt", "w") as file:
  file.write(" ".join(dataset["text"]))

Now, we just need to run KenLM's lmplz command to build our n-gram,
called "5gram.arpa". As it is relatively common in speech recognition,
we build a 5-gram by passing the -o 5 parameter.
For more information on the different n-gram LMs that can be built
with KenLM, one can take a look at the official website of KenLM.

Executing the command below might take a minute or so.

kenlm/build/bin/lmplz -o 5 <"text.txt" > "5gram.arpa"

Output:

    === 1/5 Counting and sorting n-grams ===
    Reading /content/swedish_text.txt
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    tcmalloc: large alloc 1918697472 bytes == 0x55d40d0f0000 @  0x7fdccb1a91e7 0x55d40b2f17a2 0x55d40b28c51e 0x55d40b26b2eb 0x55d40b257066 0x7fdcc9342bf7 0x55d40b258baa
    tcmalloc: large alloc 8953896960 bytes == 0x55d47f6c0000 @  0x7fdccb1a91e7 0x55d40b2f17a2 0x55d40b2e07ca 0x55d40b2e1208 0x55d40b26b308 0x55d40b257066 0x7fdcc9342bf7 0x55d40b258baa
    ****************************************************************************************************
    Unigram tokens 42153890 types 360209
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:4322508 2:1062772928 3:1992699264 4:3188318720 5:4649631744
    tcmalloc: large alloc 4649631744 bytes == 0x55d40d0f0000 @  0x7fdccb1a91e7 0x55d40b2f17a2 0x55d40b2e07ca 0x55d40b2e1208 0x55d40b26b8d7 0x55d40b257066 0x7fdcc9342bf7 0x55d40b258baa
    tcmalloc: large alloc 1992704000 bytes == 0x55d561ce0000 @  0x7fdccb1a91e7 0x55d40b2f17a2 0x55d40b2e07ca 0x55d40b2e1208 0x55d40b26bcdd 0x55d40b257066 0x7fdcc9342bf7 0x55d40b258baa
    tcmalloc: large alloc 3188326400 bytes == 0x55d695a86000 @  0x7fdccb1a91e7 0x55d40b2f17a2 0x55d40b2e07ca 0x55d40b2e1208 0x55d40b26bcdd 0x55d40b257066 0x7fdcc9342bf7 0x55d40b258baa
    Statistics:
    1 360208 D1=0.686222 D2=1.01595 D3+=1.33685
    2 5476741 D1=0.761523 D2=1.06735 D3+=1.32559
    3 18177681 D1=0.839918 D2=1.12061 D3+=1.33794
    4 30374983 D1=0.909146 D2=1.20496 D3+=1.37235
    5 37231651 D1=0.944104 D2=1.25164 D3+=1.344
    Memory estimate for binary LM:
    type      MB
    probing 1884 assuming -p 1.5
    probing 2195 assuming -r models -p 1.5
    trie     922 without quantization
    trie     518 assuming -q 8 -b 8 quantization 
    trie     806 assuming -a 22 array pointer compression
    trie     401 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:4322496 2:87627856 3:363553620 4:728999592 5:1042486228
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    
    === 4/5 Calculating and writing order-interpolated probabilities ===
    Chain sizes: 1:4322496 2:87627856 3:363553620 4:728999592 5:1042486228
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    
    === 5/5 Writing ARPA model ===
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Name:lmplz	VmPeak:14181536 kB	VmRSS:2199260 kB	RSSMax:4160328 kB	user:120.598	sys:26.6659	CPU:147.264	real:136.344

Great, we have built a 5-gram LM! Let's inspect the first couple of
lines.

head -20 5gram.arpa

Output:

    \data\
    ngram 1=360208
    ngram 2=5476741
    ngram 3=18177681
    ngram 4=30374983
    ngram 5=37231651

    \1-grams:
    -6.770219	<unk>	0
    0	<s>	-0.11831701
    -4.6095004	återupptagande	-1.2174699
    -2.2361007	av	-0.79668784
    -4.8163533	sessionen	-0.37327805
    -2.2251768	jag	-1.4205662
    -4.181505	förklarar	-0.56261665
    -3.5790775	europaparlamentets	-0.63611007
    -4.771945	session	-0.3647111
    -5.8043895	återupptagen	-0.3058712
    -2.8580177	efter	-0.7557702
    -5.199537	avbrottet	-0.43322718

There is a small problem that 🤗 Transformers will not be happy about
later on. The 5-gram correctly includes an "unknown", or <unk>, token, as
well as a begin-of-sentence, <s>, token, but no end-of-sentence, </s>,
token. This sadly has to be corrected after the build.

We can simply add the end-of-sentence token by adding the line
0 </s> -0.11831701 below the begin-of-sentence token and increasing
the ngram 1 count by 1. Since the file has roughly 100 million
lines, this command will take ca. 2 minutes.

with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      count=line.strip().split("=")[-1]
      write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
    elif not has_added_eos and "<s>" in line:
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      write_file.write(line)

Let’s now inspect the corrected 5-gram.

head -20 5gram_correct.arpa

Output:

    \data\
    ngram 1=360209
    ngram 2=5476741
    ngram 3=18177681
    ngram 4=30374983
    ngram 5=37231651

    \1-grams:
    -6.770219	<unk>	0
    0	<s>	-0.11831701
    0	</s>	-0.11831701
    -4.6095004	återupptagande	-1.2174699
    -2.2361007	av	-0.79668784
    -4.8163533	sessionen	-0.37327805
    -2.2251768	jag	-1.4205662
    -4.181505	förklarar	-0.56261665
    -3.5790775	europaparlamentets	-0.63611007
    -4.771945	session	-0.3647111
    -5.8043895	återupptagen	-0.3058712
    -2.8580177	efter	-0.7557702

Great, this looks better! We're done at this point and all that's left
to do is to correctly integrate the "ngram" with
pyctcdecode and
🤗 Transformers.



4. Combine an n-gram with Wav2Vec2

In a final step, we want to wrap the 5-gram into a
Wav2Vec2ProcessorWithLM object to make the 5-gram boosted decoding
as seamless as shown in Section 1. We start by downloading the currently
"LM-less" processor of
xls-r-300m-sv.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("hf-test/xls-r-300m-sv")

Next, we extract the vocabulary of its tokenizer as it represents the
"labels" of pyctcdecode's BeamSearchDecoder class.

vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

The "labels" and the previously built 5gram_correct.arpa file is all
that is needed to construct the decoder.

from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)

Output:

    Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
    Unigrams and labels don't seem to agree.

We can safely ignore the warning and all that's left to do now is to
wrap the just created decoder, together with the processor's
tokenizer and feature_extractor, into a Wav2Vec2ProcessorWithLM
class.

from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)
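
Before uploading, one can quickly check that the LM-boosted processor works end-to-end, analogous to Section 1 (a sketch; the placeholder waveform below stands in for a real Swedish test sample, e.g. from Common Voice 7).

import numpy as np
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("hf-test/xls-r-300m-sv")

# Placeholder waveform: 1 second of silence at 16 kHz. In practice, this
# would be a real Swedish audio sample.
audio_array = np.zeros(16_000, dtype=np.float32)

inputs = processor_with_lm(audio_array, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
  logits = model(**inputs).logits

# As in Section 1, the LM-boosted processor decodes the raw logits directly.
print(processor_with_lm.batch_decode(logits.numpy()).text[0])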

We want to directly upload the LM-boosted processor into the model
folder of
xls-r-300m-sv to have
all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them
afterward. First, we need to install git-lfs.

sudo apt-get install git-lfs tree

Cloning and uploading of modeling files can be done conveniently with
the huggingface_hub's Repository class.

For more information on how to use the huggingface_hub to upload any
files, please take a look at the official
docs.

from huggingface_hub import Repository

repo = Repository(local_dir="xls-r-300m-sv", clone_from="hf-test/xls-r-300m-sv")

Output:

    Cloning https://huggingface.co/hf-test/xls-r-300m-sv into local empty directory.

Having cloned xls-r-300m-sv, let's save the new processor with LM
into it.

processor_with_lm.save_pretrained("xls-r-300m-sv")

Let's inspect the local repository. The tree command conveniently can
also show the size of the different files.

tree -h xls-r-300m-sv/

Output:

    xls-r-300m-sv/
    ├── [  23]  added_tokens.json
    ├── [ 401]  all_results.json
    ├── [ 253]  alphabet.json
    ├── [2.0K]  config.json
    ├── [ 304]  emissions.csv
    ├── [ 226]  eval_results.json
    ├── [4.0K]  language_model
    │   ├── [4.1G]  5gram_correct.arpa
    │   ├── [  78]  attrs.json
    │   └── [4.9M]  unigrams.txt
    ├── [ 240]  preprocessor_config.json
    ├── [1.2G]  pytorch_model.bin
    ├── [3.5K]  README.md
    ├── [4.0K]  runs
    │   └── [4.0K]  Jan09_22-00-50_brutasse
    │       ├── [4.0K]  1641765760.8871996
    │       │   └── [4.6K]  events.out.tfevents.1641765760.brutasse.31164.1
    │       ├── [ 42K]  events.out.tfevents.1641765760.brutasse.31164.0
    │       └── [ 364]  events.out.tfevents.1641794162.brutasse.31164.2
    ├── [1.2K]  run.sh
    ├── [ 30K]  run_speech_recognition_ctc.py
    ├── [ 502]  special_tokens_map.json
    ├── [ 279]  tokenizer_config.json
    ├── [ 29K]  trainer_state.json
    ├── [2.9K]  training_args.bin
    ├── [ 196]  train_results.json
    ├── [ 319]  vocab.json
    └── [4.0K]  wandb
        ├── [  52]  debug-internal.log -> run-20220109_220240-1g372i3v/logs/debug-internal.log
        ├── [  43]  debug.log -> run-20220109_220240-1g372i3v/logs/debug.log
        ├── [  28]  latest-run -> run-20220109_220240-1g372i3v
        └── [4.0K]  run-20220109_220240-1g372i3v
            ├── [4.0K]  files
            │   ├── [8.8K]  conda-environment.yaml
            │   ├── [140K]  config.yaml
            │   ├── [4.7M]  output.log
            │   ├── [5.4K]  requirements.txt
            │   ├── [2.1K]  wandb-metadata.json
            │   └── [653K]  wandb-summary.json
            ├── [4.0K]  logs
            │   ├── [3.4M]  debug-internal.log
            │   └── [8.2K]  debug.log
            └── [113M]  run-1g372i3v.wandb

    9 directories, 34 files

As can be seen, the 5-gram LM is quite large – it amounts to more than
4 GB. To reduce the size of the n-gram and make loading faster,
kenLM allows converting .arpa files to binary ones using the
build_binary executable.

Let's make use of it here.

kenlm/build/bin/build_binary xls-r-300m-sv/language_model/5gram_correct.arpa xls-r-300m-sv/language_model/5gram.bin

Output:

    Reading xls-r-300m-sv/language_model/5gram_correct.arpa
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    SUCCESS

Great, it worked! Let's remove the .arpa file and check the size of
the binary 5-gram LM.

rm xls-r-300m-sv/language_model/5gram_correct.arpa && tree -h xls-r-300m-sv/

Output:

    xls-r-300m-sv/
    ├── [  23]  added_tokens.json
    ├── [ 401]  all_results.json
    ├── [ 253]  alphabet.json
    ├── [2.0K]  config.json
    ├── [ 304]  emissions.csv
    ├── [ 226]  eval_results.json
    ├── [4.0K]  language_model
    │   ├── [1.8G]  5gram.bin
    │   ├── [  78]  attrs.json
    │   └── [4.9M]  unigrams.txt
    ├── [ 240]  preprocessor_config.json
    ├── [1.2G]  pytorch_model.bin
    ├── [3.5K]  README.md
    ├── [4.0K]  runs
    │   └── [4.0K]  Jan09_22-00-50_brutasse
    │       ├── [4.0K]  1641765760.8871996
    │       │   └── [4.6K]  events.out.tfevents.1641765760.brutasse.31164.1
    │       ├── [ 42K]  events.out.tfevents.1641765760.brutasse.31164.0
    │       └── [ 364]  events.out.tfevents.1641794162.brutasse.31164.2
    ├── [1.2K]  run.sh
    ├── [ 30K]  run_speech_recognition_ctc.py
    ├── [ 502]  special_tokens_map.json
    ├── [ 279]  tokenizer_config.json
    ├── [ 29K]  trainer_state.json
    ├── [2.9K]  training_args.bin
    ├── [ 196]  train_results.json
    ├── [ 319]  vocab.json
    └── [4.0K]  wandb
        ├── [  52]  debug-internal.log -> run-20220109_220240-1g372i3v/logs/debug-internal.log
        ├── [  43]  debug.log -> run-20220109_220240-1g372i3v/logs/debug.log
        ├── [  28]  latest-run -> run-20220109_220240-1g372i3v
        └── [4.0K]  run-20220109_220240-1g372i3v
            ├── [4.0K]  files
            │   ├── [8.8K]  conda-environment.yaml
            │   ├── [140K]  config.yaml
            │   ├── [4.7M]  output.log
            │   ├── [5.4K]  requirements.txt
            │   ├── [2.1K]  wandb-metadata.json
            │   └── [653K]  wandb-summary.json
            ├── [4.0K]  logs
            │   ├── [3.4M]  debug-internal.log
            │   └── [8.2K]  debug.log
            └── [113M]  run-1g372i3v.wandb

    9 directories, 34 files

Nice, we reduced the n-gram by more than half to less than 2 GB now. In
the final step, let's upload all files.

repo.push_to_hub(commit_message="Upload lm-boosted decoder")

Output:

    Git LFS: (1 of 1 files) 1.85 GB / 1.85 GB
    Counting objects: 9, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (9/9), done.
    Writing objects: 100% (9/9), 1.23 MiB | 1.92 MiB/s, done.
    Total 9 (delta 3), reused 0 (delta 0)
    To https://huggingface.co/hf-test/xls-r-300m-sv
       27d0c57..5a191e2  main -> main

That's it. Now you should be able to use the 5gram for LM-boosted
decoding as shown in Section 1.
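
Concretely, after the upload anyone can load both the LM-boosted processor and the model straight from the Hub and decode as in Section 1 (a short recap sketch):

from transformers import AutoProcessor, AutoModelForCTC

# Since the repository now contains a language_model folder and alphabet.json,
# AutoProcessor loads a Wav2Vec2ProcessorWithLM with the 5-gram decoder.
processor = AutoProcessor.from_pretrained("hf-test/xls-r-300m-sv")
model = AutoModelForCTC.from_pretrained("hf-test/xls-r-300m-sv")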

As can be seen on xls-r-300m-sv's model card, our 5gram LM-boosted
decoder yields a WER of 18.85% on Common Voice 7's test set, which is a
relative improvement of ca. 30% 🔥.


