Tokenizer

Construct a Tokenizer for the Thai Language from Scratch

A step-by-step guide to constructing a Thai multilingual sub-word tokenizer based on the BPE algorithm, trained on Thai and English datasets using only Python. Perfect. Our Thai tokenizer can now successfully and accurately encode and...
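At the heart of the BPE procedure the guide walks through is a loop that repeatedly merges the most frequent pair of adjacent symbols in the corpus. Below is a minimal sketch of that loop, assuming a pre-tokenized list of words; the function names and the toy Thai/English examples are illustrative, not the article's own code.

```python
# Minimal BPE training sketch: count adjacent symbol pairs, merge the most
# frequent pair, repeat. A real tokenizer would add byte-level handling,
# special tokens, and a proper Thai/English corpus.
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges=100):
    """Learn `num_merges` merge rules from a list of pre-tokenized words."""
    word_freqs = Counter(tuple(word) for word in corpus)  # split each word into characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# Toy English/Thai mix; a real run would use the full training corpus.
print(train_bpe(["lower", "lowest", "ภาษาไทย"], num_merges=5))
```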

Full Guide on LLM Synthetic Data Generation

Large Language Models (LLMs) are powerful tools not only for generating human-like text but also for creating high-quality synthetic data. This capability is changing how we approach AI development, particularly in scenarios where...
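A common pattern the guide covers is prompting an LLM for structured examples and parsing them into a dataset. The sketch below assumes the OpenAI Python client and a placeholder model name; the prompt, output schema, and helper function are illustrative only, not the article's code.

```python
# Sketch of prompt-based synthetic data generation: ask an LLM for labeled
# examples in a machine-readable format, then parse them. Model name and
# prompt wording are placeholder assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_examples(topic: str, n: int = 5) -> list[dict]:
    """Ask an LLM to emit labeled sentiment examples, one JSON object per line."""
    prompt = (
        f"Generate {n} short customer reviews about {topic}. "
        'Return one JSON object per line with keys "text" and "label" '
        '(label is "positive" or "negative").'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [json.loads(line) for line in lines if line.strip()]

if __name__ == "__main__":
    for example in generate_examples("wireless headphones"):
        print(example)
```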

The Ultimate Guide to Training BERT from Scratch: The Tokenizer

From Text to Tokens: Your Step-by-Step Guide to BERT Tokenization. By the time you finish reading this article, you'll not only understand the ins and outs of the BERT tokenizer, but you'll also be equipped...
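BERT relies on a WordPiece sub-word vocabulary, which can be trained with the Hugging Face `tokenizers` library. The snippet below is a rough sketch under assumed settings (vocabulary size, special tokens, and the `corpus.txt` path are placeholders); the article's own configuration may differ.

```python
# Train a WordPiece tokenizer (the scheme BERT uses) on a plain-text corpus.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30_000,  # assumed value; BERT-base uses ~30k
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# `corpus.txt` is a placeholder path to the training corpus.
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("Tokenization turns raw text into sub-word units.")
print(encoding.tokens)
```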
