Build a Tokenizer for the Thai Language from Scratch


A step-by-step guide to building a Thai multilingual sub-word tokenizer with the BPE algorithm, trained on Thai and English datasets, using only Python

[Image by author]: The Thai Tokenizer encodes and decodes Thai text to token IDs and vice versa

The first task of a tokenizer is to translate the raw input text (Thai in our case, but it could be any other language) into numbers and pass them to the model's transformer blocks. The transformer then generates its output as numbers, and the tokenizer translates these numbers back into text that is comprehensible to end users. The high-level diagram below describes this flow.

[Image by author]: Diagram showing the tokenizer's role in the LLM's input and output flow.

Many of us are only interested in learning how the model's transformer architecture works under the hood. We often overlook learning about essential components such as tokenizers in detail. Understanding how a tokenizer works under the hood and having good control of its functionality gives us good leverage to improve our model's accuracy and performance.

Just like the tokenizer, some of the most important components of an LLM implementation pipeline are data preprocessing, evaluation, guardrails/security, and testing/monitoring. I would highly recommend studying these topics in more detail. I realized the importance of these components only when I was working on the actual implementation of my foundational multilingual model ThaiLLM in production.

Why do you need a Thai tokenizer, or a tokenizer for any other non-English language?

  • Suppose you're using a generic English-based tokenizer to pre-train a multilingual large language model for languages such as Thai, Hindi, Indonesian, Arabic, Chinese, etc. In that case, your model is unlikely to give suitable output that makes good sense for your specific domain or use cases. Hence, building your own tokenizer for your language of choice definitely helps make your model's output much more coherent and comprehensible (a quick check after this list illustrates the effect).
  • Building your own tokenizer also gives you full control over how comprehensive and inclusive a vocabulary you want to build. During the attention mechanism, thanks to a comprehensive vocabulary, a token can attend to and learn from more tokens within the limited context length of the sequence. This makes learning more coherent, which eventually helps with better model inference.
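
As a quick, optional check of the first point, the sketch below counts how many tokens a generic English-centric tokenizer spends on roughly equivalent English and Thai sentences. It uses OpenAI's off-the-shelf tiktoken library purely for illustration; it is not part of the tokenizer we build in this article, and the exact counts depend on which encoding you pick.

# Illustrative only: see how a generic English-centric tokenizer fragments Thai compared to English.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI chat models
english = "How are you"
thai = "คุณเป็นอย่างไร"                       # roughly the same greeting in Thai

print(f"English: {len(english)} characters -> {len(enc.encode(english))} tokens")
print(f"Thai:    {len(thai)} characters -> {len(enc.encode(thai))} tokens")
# A much higher token-per-character ratio for Thai means fewer characters fit in a fixed context window.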

The good news is that once you finish building the Thai tokenizer, you can easily build a tokenizer in any other language. All of the building steps are the same, except that you'll need to train on a dataset in the language of your choice.

Now that we have all the good reasons to build our own tokenizer, here are the steps to build our tokenizer for the Thai language.

  1. Build our own BPE algorithm
  2. Train the tokenizer
  3. Tokenizer encode and decode functions
  4. Load and test the tokenizer

Step 1: Build our own BPE (Byte Pair Encoding) algorithm:

The BPE algorithm is used in many popular LLMs such as Llama, GPT, and others to build their tokenizers. We could pick one of these LLM tokenizers if our model were based on the English language. Since we're building a Thai tokenizer, the best option is to create our own BPE algorithm from scratch and use it to build our tokenizer. Let's first understand how the BPE algorithm works with the help of the simple flow diagram below, and then we'll start building it accordingly.

[Image by author]: BPE flow diagram. Example referenced from the Wikipedia page (https://en.wikipedia.org/wiki/Byte_pair_encoding)

The examples in the flow diagram are shown in English to make them easier to understand.
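
Since the flow diagram is an image, here is a minimal standalone sketch of the same idea in plain Python, run on the classic example string "aaabdaaabac" from the Wikipedia page referenced above: each round counts adjacent pairs, picks the most frequent one, and replaces it with a newly minted symbol. This is only an illustration of the concept; the actual functions we use are written in the next code block.

# A minimal sketch of the BPE idea: repeatedly merge the most frequent adjacent pair into a new symbol.
from collections import Counter

def toy_bpe(symbols, num_merges):
    next_symbol = 256                      # new symbols start above the byte range, mirroring what we do later
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(next_symbol)             # replace the pair with the new symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
        print(f"Merged {best} -> {next_symbol}, sequence length is now {len(symbols)}")
        next_symbol += 1
    return symbols

toy_bpe(list("aaabdaaabac".encode("utf-8")), num_merges=3)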

Let’s write code to implement the BPE algorithm for our Thai Tokenizer.

# A simple practice example to get familiar with utf-8 encoding, which converts strings to bytes.
text = "How are you คุณเป็นอย่างไร"   # Text string in both English and Thai
text_bytes = text.encode("utf-8")
print(f"Text in bytes: {text_bytes}")

text_list = list(text_bytes)   # Converts the text bytes to a list of integers
print(f"Text list in integers: {text_list}")

# As I don't want to reinvent the wheel, I will be referencing most of the code blocks from Andrej Karpathy's GitHub (https://github.com/karpathy/minbpe?tab=readme-ov-file).
# However, I will be modifying the code blocks specific to building our Thai language tokenizer and also explaining the code so that you can understand how each block works and can easily adapt it to your own use case later.

# This module provides access to the Unicode Character Database (UCD), which defines character properties for all Unicode characters.
import unicodedata

# This function returns a dictionary with consecutive pairs of integers and their counts from the given list of integers.
def get_stats(ids, stats=None):

    stats = {} if stats is None else stats
    # The zip function lets us iterate over consecutive items by pairing the list with itself shifted by one.
    for pair in zip(ids, ids[1:]):
        # If the pair already exists in the stats dictionary, add 1 to its count; otherwise initialize its count to 1.
        stats[pair] = stats.get(pair, 0) + 1
    return stats

# Once we find the most frequent pair of consecutive integers, we replace that pair with a new integer token.
def merge(ids, pair, idx):
    newids = []
    i = 0
    # Since we merge a pair of ids, the list must contain at least 2 ids.
    while i < len(ids):
        # If the current id and the next id match the given pair, and the current id is not the last one, replace the two consecutive ids with the given index value.
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2 # The pair matched, so the next iteration starts 2 positions later in the list.
        else:
            newids.append(ids[i])
            i += 1 # The current pair didn't match, so move to the next position in the list.
    # Return the merged ids list
    return newids

# This function uses 'unicodedata.category', which returns "C" as the first letter if a character is a control character; such characters need to be replaced with a readable escape sequence.
def replace_control_characters(s: str) -> str:
    chars = []
    for ch in s:
        # If the character is not a control character (its category does not start with "C"), append it to the chars list.
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch)
        # If the character is a control character (its category starts with "C"), replace it with its readable escape code and append that to the chars list.
        else:
            chars.append(f"\\u{ord(ch):04x}")
    return "".join(chars)

# Some tokens, such as control characters like escape characters, cannot be decoded into valid strings.
# Hence those need to be replaced with a readable character such as �
def render_token(t: bytes) -> str:
    s = t.decode('utf-8', errors='replace')
    s = replace_control_characters(s)
    return s
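
Before wiring these helpers into a tokenizer class, a quick illustrative check (not part of the original walkthrough) shows what one BPE step looks like on a short Thai string, using the get_stats and merge functions defined above.

# Quick check of get_stats and merge on a short Thai string.
toy_text = "สวัสดี สวัสดี"                      # "hello hello" in Thai, repeated so some byte pairs recur
toy_ids = list(toy_text.encode("utf-8"))
stats = get_stats(toy_ids)                      # count every consecutive byte pair
top_pair = max(stats, key=stats.get)            # the most frequent pair becomes the first merge candidate
print(f"Most frequent pair: {top_pair}, count: {stats[top_pair]}")

merged_ids = merge(toy_ids, top_pair, 256)      # mint token id 256 for that pair
print(f"Length before merge: {len(toy_ids)}, after merge: {len(merged_ids)}")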

The two functions get_stats and merge defined in the code block above are the implementation of the BPE algorithm for our Thai tokenizer. Now that the algorithm is ready, let's write the code to train our tokenizer.

Step 2: Train the tokenizer:

Training the tokenizer involves generating a vocabulary, which is a database of unique tokens (words and sub-words) along with a unique index number assigned to each token. We'll be using the Thai Wiki dataset from Hugging Face to train our Thai tokenizer. Just as training an LLM requires huge amounts of data, you'll also need a good amount of data to train a tokenizer. You could also use the same dataset to train both the LLM and the tokenizer, though it is not mandatory. For a multilingual LLM, it is advisable to use both English and Thai datasets in a ratio of 2:1, which is a standard approach many practitioners follow.
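
If you do go the multilingual route, a minimal sketch of that 2:1 mix could look like the snippet below. The file names english_corpus.txt and thai_corpus.txt are placeholders for whatever corpora you download; the only point being illustrated is the rough character-count ratio.

# A minimal sketch of mixing English and Thai training text in roughly a 2:1 ratio (placeholder file names).
with open("thai_corpus.txt", "r", encoding="utf-8") as f:
    thai_text = f.read()
with open("english_corpus.txt", "r", encoding="utf-8") as f:
    english_text = f.read()

# Trim the English corpus so it is about twice the size of the Thai corpus by character count.
english_text = english_text[: 2 * len(thai_text)]

training_text = english_text + "\n" + thai_text
print(f"English chars: {len(english_text)}, Thai chars: {len(thai_text)}")
# training_text can then be passed to the tokenizer's train() method defined below.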

Let’s begin writing the training code.

# Import the regex module. We use 'regex' (not the built-in 're') because the split pattern below relies on unicode property classes such as \p{L}.
import regex as re

# Create a Thai Tokenizer class.
class ThaiTokenizer():

    def __init__(self):

        # Byte pairing should happen within related words or sentences that give proper context. Pairing across unrelated words or sentences can give undesirable output.
        # To prevent this behavior, we'll use the Llama 3 regular expression pattern to split the text into meaningful chunks before applying the byte-pair algorithm.
        self.pattern = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
        self.compiled_pattern = re.compile(self.pattern)

        # Special tokens are used to provide coherence in the sequence while training.
        # Special tokens are assigned a unique index number and stored in the vocabulary.
        # (The tokens below follow the Llama 3 naming convention; the indices are placed well above the planned vocab size so they never collide with merge tokens.)
        self.special_tokens = {
            '<|begin_of_text|>': 1101,
            '<|end_of_text|>': 1102,
            '<|start_header_id|>': 1103,
            '<|end_header_id|>': 1104,
            '<|eot_id|>': 1105
        }

        # Initialize merges with an empty dictionary
        self.merges = {}

        # Initialize the vocab dictionary by calling the _build_vocab function, which is defined later in this class.
        self.vocab = self._build_vocab()

    # Tokenizer training function
    def train(self, text, vocab_size):

        # Make sure the vocab size is at least 256, since the first 256 token ids are reserved for the raw byte values (0-255).
        assert vocab_size >= 256
        # Total number of merges added to the vocabulary.
        num_merges = vocab_size - 256

        # The first step is to split the text into text chunks using the pattern defined above.
        text_chunks = re.findall(self.compiled_pattern, text)

        # Each text chunk is utf-8 encoded to bytes and then converted into a list of integers.
        ids = [list(ch.encode("utf-8")) for ch in text_chunks]

        # Iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # idx -> bytes

        # Until the total num_merges is reached, find the most common pair of consecutive ids in the ids list and merge it to create a new token
        for i in range(num_merges):
            # Count the number of times every consecutive pair appears
            stats = {}
            for chunk_ids in ids:
                # Passing in stats will update it in place, adding up counts
                get_stats(chunk_ids, stats)
            # Find the pair with the highest count
            pair = max(stats, key=stats.get)
            # Mint a new token: assign it the next available id
            idx = 256 + i
            # Replace all occurrences of the pair in ids with idx
            ids = [merge(chunk_ids, pair, idx) for chunk_ids in ids]
            # Save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]

        # Save class variables to be used later during tokenizer encode and decode
        self.merges = merges
        self.vocab = vocab

    # Function to return a vocab dictionary built by combining the base bytes, merges, and special tokens
    def _build_vocab(self):
        # The first 256 token ids map directly onto the raw byte values 0-255.
        vocab = {idx: bytes([idx]) for idx in range(256)}

        # Iterate through the merges dictionary and add its entries to the vocab dictionary
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]

        # Iterate through the special token dictionary and add its entries to the vocab dictionary
        for special, idx in self.special_tokens.items():
            vocab[idx] = special.encode("utf-8")

        return vocab

    # After training is complete, use the save function to save the model file and the vocab file.
    # The model file will be used to load the tokenizer model for further use in an LLM.
    # The vocab file is only for the purpose of human verification.
    def save(self, file_prefix):
        # Writing to the model file
        model_file = file_prefix + ".model" # model file name

        # Model write begins
        with open(model_file, 'w') as f:
            f.write("thai tokenizer v1.0\n") # write the tokenizer version
            f.write(f"{self.pattern}\n") # write the pattern used by the tokenizer
            f.write(f"{len(self.special_tokens)}\n") # write the number of special tokens

            # Write each special token in the exact format below
            for tokens, idx in self.special_tokens.items():
                f.write(f"{tokens} {idx}\n")

            # Write only the keys of the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")

        # Writing to the vocab file
        vocab_file = file_prefix + ".vocab" # vocab file name

        # Swap the keys and values of the merges dict and store the result in inverted_merges
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        # Vocab write begins
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # The render_token function processes tokens and prevents distorted output by replacing unprintable bytes with readable characters
                s = render_token(token)
                # If the index is present in the inverted merges dict, find its child indices, look up their corresponding bytes in the vocab dict, and write the characters
                if idx in inverted_merges:
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                # If the index is not present in the merges dict, just write its index and the corresponding string
                else:
                    f.write(f"[{s}] {idx}\n")

    # Function to load the tokenizer model.
    # This function is invoked only after training is complete and the tokenizer model file has been saved.
    def load(self, model_file):

        merges = {} # Initialize merges with an empty dict
        special_tokens = {} # Initialize special_tokens with an empty dict
        idx = 256 # The range (0, 255) is already reserved in the vocab, so the next index starts from 256 onwards.

        # Read the model file
        with open(model_file, 'r', encoding="utf-8") as f:

            version = f.readline().strip() # Read the tokenizer version as written during model file saving
            self.pattern = f.readline().strip() # Read the pattern used by the tokenizer
            num_special = int(f.readline().strip()) # Read the number of special tokens

            # Read all the special tokens and store them in the special_tokens dict defined above
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)

            # Read all the merge pairs from the file. Make each one a key pair and store it in the merges dictionary defined above.
            # The value assigned to each key pair starts at idx (256), as defined above, and keeps increasing by 1.
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1

        self.merges = merges
        self.special_tokens = special_tokens

        # Create the final vocabulary dictionary by combining the base bytes (0-255), merges, and special tokens. The _build_vocab function does exactly that.
        self.vocab = self._build_vocab()

Step 3: Tokenizer encode and decode functions:

  • Tokenizer encode: The encoding function looks up the vocabulary and translates the given input text or prompt into a list of integer IDs. These IDs are then fed into the transformer blocks.
  • Tokenizer decode: The decoding function looks up the vocabulary and translates the list of IDs generated by the transformer's classifier block back into output text.

Let's take a look at the diagram below for further clarity.

[Image by author]: Thai tokenizer encode and decode functions

Let’s write code to implement the tokenizer’s encode and decode function.

    # The tokenizer encode function takes text as a string and returns a list of integer ids
    def encode(self, text):

        # Define a pattern to identify any special tokens present in the text
        special_pattern = "(" + "|".join(re.escape(k) for k in self.special_tokens) + ")"
        # Split the special tokens (if present) from the rest of the text
        special_chunks = re.split(special_pattern, text)
        # Initialize an empty ids list
        ids = []

        # Loop through each part in the special chunks list.
        for part in special_chunks:
            # If the part of the text is a special token, get its idx from the special token dictionary and append it to the ids list.
            if part in self.special_tokens:
                ids.append(self.special_tokens[part])
            # If the part of the text is not a special token
            else:
                # Split this part into multiple chunks using the pattern we defined earlier.
                text_chunks = re.findall(self.compiled_pattern, part)

                # All text chunks are encoded separately, then the results are joined
                for chunk in text_chunks:
                    chunk_bytes = chunk.encode("utf-8") # Encode the text to bytes
                    chunk_ids = list(chunk_bytes) # Convert the bytes to a list of integers

                    while len(chunk_ids) >= 2: # the chunk ids list must contain at least 2 ids to form a byte pair
                        # Count the number of times every consecutive pair appears
                        stats = get_stats(chunk_ids)
                        # Some pairs are themselves built from other merged ids in the merges dictionary. Hence we pick the pair with the lowest merge index, so merges are applied in the order they were learned.
                        pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))

                        # Break the loop if the pair is not present in the merges dictionary
                        if pair not in self.merges:
                            break
                        # Find the idx of the pair in the merges dictionary
                        idx = self.merges[pair]
                        # Replace the occurrences of the pair in the chunk ids list with this idx and continue
                        chunk_ids = merge(chunk_ids, pair, idx)

                    ids.extend(chunk_ids)
        return ids

    # The tokenizer decode function takes a list of integer ids and returns a string
    def decode(self, ids):

        # Initialize an empty byte list
        part_bytes = []
        # Swap the keys and values of the special_tokens dict and store the result in inverse_special_tokens
        inverse_special_tokens = {v: k for k, v in self.special_tokens.items()}

        # Loop through each idx in the ids list
        for idx in ids:
            # If the idx is present in the vocab dict, get the bytes for that idx and append them to the part_bytes list
            if idx in self.vocab:
                part_bytes.append(self.vocab[idx])
            # If the idx is present in the inverse_special_tokens dict, get the token string for that idx, convert it to bytes using utf-8 encoding, and then append it to the part_bytes list
            elif idx in inverse_special_tokens:
                part_bytes.append(inverse_special_tokens[idx].encode("utf-8"))
            # If the idx is present in neither the vocab nor the special token dict, raise an invalid token error
            else:
                raise ValueError(f"invalid token id: {idx}")

        # Join all the individual bytes from the part_bytes list
        text_bytes = b"".join(part_bytes)

        # Convert the bytes to a text string using the utf-8 decode function. Make sure to use errors="replace" to replace malformed characters with a readable character such as �.
        text = text_bytes.decode("utf-8", errors="replace")
        return text

Step 4: Load and test the tokenizer:

Finally, here comes the best part of this article. In this section, we'll perform two interesting tasks.

  • First, train our tokenizer with the Thai Wiki dataset from Hugging Face. We have chosen a small dataset (2.2 MB) to make training faster. However, for a real-world implementation, you should select a much larger dataset for better results. After the training is complete, we'll save the model.
  • Second, we'll load the saved tokenizer model and test the tokenizer's encode and decode functions.

Let’s dive in.

# Train the tokenizer

import time # To calculate the duration of the training run
# Load raw training text data (thai_wiki dataset) from Hugging Face. thai_wiki_small.txt: https://github.com/tamangmilan/thai_tokenizer
texts = open("/content/thai_wiki_small.txt", "r", encoding="utf-8").read()
texts = texts.strip()
# Define the vocab size
vocab_size = 512
# Initialize the tokenizer model class
tokenizer = ThaiTokenizer()
# Start training the tokenizer
start_time = time.time()
tokenizer.train(texts, vocab_size)
end_time = time.time()
# Save the tokenizer: you can change the path and filename.
tokenizer.save("./models/thaitokenizer")
print(f"Total time to complete tokenizer training: {end_time-start_time:.2f} seconds")

# Output: Total time to complete tokenizer training: 186.11 seconds (3m 6s) [Note: Training takes longer for a larger vocab_size and less time for a smaller vocab_size]

# Test the tokenizer

# Initialize the tokenizer model class
tokenizer = ThaiTokenizer()
# Load the tokenizer model. This model was saved during training.
tokenizer.load("./models/thaitokenizer.model")
# Invoke and verify the tokenizer encode and decode functions for English text
eng_texts = "When society evolved in several lands"
print(f"English Text: {eng_texts}")
encoded_ids = tokenizer.encode(eng_texts)
print(f"Encoded Ids: {encoded_ids}")
decoded_texts = tokenizer.decode(encoded_ids)
print(f"Decoded Texts: {decoded_texts}\n")

# Invoke and verify the tokenizer encode and decode functions for Thai text
thai_texts = "เมื่อสังคมมีวิวัฒนาการขึ้นในดินแดนต่าง"
print(f"Thai Text: {thai_texts}")
thai_encoded_ids = tokenizer.encode(thai_texts)
print(f"Encoded Ids: {thai_encoded_ids}")
thai_decoded_texts = tokenizer.decode(thai_encoded_ids)
print(f"Decoded Texts: {thai_decoded_texts}")

[Thai Tokenizer]: Encoding and decoding output for texts in the Thai and English languages.

Perfect. Our Thai tokenizer can now successfully and accurately encode and decode text in both the Thai and English languages.

Have you noticed that the encoded IDs for the English text are longer than the Thai encoded IDs? This is because we have only trained our tokenizer on a Thai dataset, so it is only able to build a comprehensive vocabulary for the Thai language. Since we didn't train on an English dataset, the tokenizer has to encode English almost from the byte level, which results in longer lists of encoded IDs. As I mentioned before, for a multilingual LLM you should train on both English and Thai datasets in a 2:1 ratio. That will give you balanced, high-quality results.
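
To put a rough number on that observation, the short check below (reusing the variables from the test step above) compares tokens per character for the two test strings; the exact figures will vary with your training data and vocab_size.

# Rough compression check, reusing eng_texts/encoded_ids and thai_texts/thai_encoded_ids from the test above.
print(f"English: {len(encoded_ids)} tokens for {len(eng_texts)} characters "
      f"({len(encoded_ids) / len(eng_texts):.2f} tokens per character)")
print(f"Thai:    {len(thai_encoded_ids)} tokens for {len(thai_texts)} characters "
      f"({len(thai_encoded_ids) / len(thai_texts):.2f} tokens per character)")
# A lower tokens-per-character ratio for Thai reflects the Thai-only training data; adding English data would bring the English ratio down too.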

And that's it! We have now successfully created our own Thai tokenizer from scratch using only Python. I think that was pretty cool. With this, you can easily build a tokenizer for any other language, which gives you a lot of leverage when implementing your multilingual LLM.

Thanks a lot for reading!

Link to Google Colab notebook

References

[1] Andrej Karpathy, GitHub: karpathy/minbpe (https://github.com/karpathy/minbpe)
