Understanding Byte Pair Encoding Tokenizer


Harnessing the Power of Byte Pair Encoding for Language Modeling

In the pre-Byte Pair Encoding (BPE) era, tokenization techniques mostly relied on simplistic approaches such as splitting text into individual words or using fixed vocabularies. However, these approaches often fail when dealing with morphologically rich languages, out-of-vocabulary words, etc. BPE not only addresses these limitations but also presents a groundbreaking approach to text tokenization, handling them with far greater efficiency. The algorithm was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015).

Before diving into BPE, let’s first understand what a tokenizer is 🤔.

A tokenizer is like a language detective that breaks down text into smaller, more manageable pieces. Just as we divide sentences into words and words into letters, a tokenizer takes a piece of text and splits it into meaningful units called tokens. These tokens can be individual words, phrases, or even subword units, depending on the tokenizer’s approach.

Let’s try to understand this in the simplest way 😊. BPE starts with pre-tokenization, which splits the large corpus text into words. In simple terms, pre-tokenization is nothing but space tokenization: the text is split into words wherever a space separates them.
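To make this concrete, here is a tiny Python sketch of space tokenization (the corpus string is just a made-up toy example):

```python
# A toy illustration of pre-tokenization: splitting text on whitespace.
corpus = "the cat sat with the cats while the dog watched the dogs"
words = corpus.split()  # space tokenization
print(words)
# ['the', 'cat', 'sat', 'with', 'the', 'cats', 'while', 'the', 'dog', ...]
```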

After the pre-tokenization process we obtain the words of the text corpus. BPE then counts the frequency of each unique word. This tells us which words are common and which are rare; think of it like a survey of word popularity.
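Continuing the toy example, this counting step could be sketched like this (using Python’s Counter purely for illustration):

```python
# Count how often each unique word occurs after pre-tokenization.
from collections import Counter

corpus = "the cat sat with the cats while the dog watched the dogs"
word_counts = Counter(corpus.split())
print(word_counts.most_common())
# [('the', 4), ('cat', 1), ('sat', 1), ('with', 1), ('cats', 1), ...]
```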

Next, BPE builds a base vocabulary made up of all the distinct symbols present in the words of the text corpus. A symbol can be an individual letter like “k” or, after merging, a combination of letters like “do”. Now comes the interesting part! BPE starts merging these symbols to form new symbols and adds them to the vocabulary. It keeps doing this until it reaches the desired vocabulary size. The vocab size is a hyperparameter which must be defined by the user before the tokenization process. The magic of BPE lies in its adaptability: it can handle complex words, even those it has never seen before.
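The whole merge loop can be sketched in a few lines of Python. This is only a simplified illustration of the idea, not how optimized tokenizer libraries implement it; the function names are my own, and the word frequencies at the bottom are the same toy ones used in the walkthrough that follows:

```python
# A simplified sketch of BPE training: repeatedly merge the most frequent
# adjacent symbol pair until the vocabulary reaches the desired size.
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_freqs = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)          # merge the two symbols into one
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_freqs[tuple(out)] = freq
    return new_freqs

def train_bpe(word_counts, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size` symbols."""
    word_freqs = {tuple(word): count for word, count in word_counts.items()}
    vocab = sorted({ch for word in word_counts for ch in word})  # base symbols
    merges = []
    while len(vocab) < vocab_size:
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent pair (ties broken arbitrarily)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges

word_counts = {"cat": 10, "cats": 5, "dog": 8, "dogs": 6}
vocab, merges = train_bpe(word_counts, vocab_size=12)
print(merges)  # e.g. [('c', 'a'), ('ca', 't'), ('d', 'o'), ('do', 'g'), ('dog', 's')]
print(vocab)
```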

Let’s consider that we have the following set of unique words (with their frequencies) from the text corpus:

("cat", 10), ("cats", 5), ("dog", 8), ("dogs", 6)

For the above words, BPE first forms the base vocabulary, or in simple terms the set of symbols:

 ["c", "a", "t", "s", "d", "o", "g"].

Now, BPE starts counting the frequency of every possible symbol pair and identifies the most frequent one. Here the pair “c” followed by “a” occurs a total of 15 times (10 times in “cat” and 5 times in “cats”), tied with “a” followed by “t”. BPE decides to merge “c” and “a” into a new symbol, “ca”.
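To double check these counts, here is a quick snippet that tallies the adjacent symbol pairs for the word frequencies above (again, just an illustration):

```python
# Tally adjacent symbol pairs, weighted by each word's frequency.
from collections import Counter

word_counts = {"cat": 10, "cats": 5, "dog": 8, "dogs": 6}
pair_counts = Counter()
for word, freq in word_counts.items():
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += freq
print(pair_counts)
# Counter({('c', 'a'): 15, ('a', 't'): 15, ('d', 'o'): 14, ('o', 'g'): 14,
#          ('g', 's'): 6, ('t', 's'): 5})
```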

After merging “c” and “a”, the set of words becomes:

("cat", 10), ("ca" "st", 5), ("dog", 8), ("do" "gs", 6)

Now, BPE continues by identifying the next most frequent symbol pair. This time it is “ca” followed by “t”, again occurring 15 times, so they are merged into “cat” and the set of words becomes:

("forged", 5), ("dog", 8), ("do" "gs", 6)

BPE repeats this process, finding the next most frequent pairs: “d” followed by “o” (14 times), then “do” followed by “g” (also 14 times), and finally “dog” followed by “s” (6 times). After these merges the new vocabulary is:

["c", "a", "t", "s", "d", "o", "g", "st", "ca", "cast", "dogs"]

If the BPE training stops at this point, the learned tokenizer applies these merge rules to tokenize words. For example, the word “cats”, which never became a single token, would be tokenized as [“cat”, “s”], and a completely unseen word like “catdog” would become [“cat”, “dog”].
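As a rough sketch of how the learned merges are applied at tokenization time (the merge list below is the one learned in the walkthrough above, and the helper name is my own):

```python
# Apply learned BPE merges, in the order they were learned, to a new word.
def bpe_tokenize(word, merges):
    symbols = list(word)                      # start from individual characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)             # merge the matching pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("c", "a"), ("ca", "t"), ("d", "o"), ("do", "g"), ("dog", "s")]
print(bpe_tokenize("cats", merges))    # ['cat', 's']
print(bpe_tokenize("catdog", merges))  # ['cat', 'dog']
```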

We can see that, without any explicit linguistic rules, BPE is able to handle verbs, tenses, plurals, etc., and this helps a large language model generalize.

We have seen how BPE is a powerful algorithm for tokenization, but it does have some limitations.

BPE can lead to an increase in vocabulary size, and as the vocabulary grows, so does the computational cost.

As we know, BPE relies on pre-tokenization, so its accuracy also depends on the pre-tokenization step, which can be misleading (for example, for languages that do not separate words with spaces). It also depends on the quality of the training data corpus.

Thanks for Reading it 😊.

References

Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)

HuggingFace Blog
