of the AI boom, the pace of technological iteration has reached an unprecedented level. Obstacles that once seemed insurmountable now appear to have viable solutions. This text serves as an “NMT 101” guide. While introducing our project, it also walks readers step by step through the process of fine-tuning an existing translation model to support a low-resource language that is not included in mainstream multilingual models.
Background: Dongxiang as a Low-Resource Language
Dongxiang is a minority language spoken in China’s Gansu Province and is assessed as vulnerable by the UNESCO Atlas of the World’s Languages in Danger. Despite being widely spoken in local communities, Dongxiang lacks the institutional and digital support enjoyed by high-resource languages. Before diving into the training pipeline, it helps to briefly understand the language itself. Dongxiang, as its name suggests, is the mother tongue of the Dongxiang people. Descended from Central Asian groups who migrated to Gansu during the Yuan dynasty, the Dongxiang community has linguistic roots closely tied to Middle Mongol. From a writing-system perspective, Dongxiang has undergone a comparatively recent standardization. Since the 1990s, with governmental promotion, the language has gradually adopted an official Latin-based orthography, using the 26 letters of the English alphabet and delimiting words with whitespace.
Although it is classified under the Mongolic language family, prolonged coexistence with Mandarin-speaking communities throughout history has left the language with a trove of lexical borrowings from Chinese (Mandarin). Dongxiang exhibits no overt tense inflection or grammatical gender, which is also an advantage that simplifies model training.

Further background on the Dongxiang language and its speakers can be found on our website, which hosts an official English-language introduction released by the Chinese government.
Our Model: How to Use the Translation System
We build our translation system on top of NLLB-200-distilled-600M, a multilingual neural machine translation model released by Meta as part of the No Language Left Behind (NLLB) project. We were inspired by the work of David Dale. However, ongoing updates to the Transformers library have made the original approach difficult to use. In our own trials, rolling back to earlier versions (e.g., transformers ≤ 4.33) often triggered conflicts with other dependencies. In light of these constraints, we provide a full list of libraries in our project’s GitHub for your reference.

Our model was fine-tuned on 42,868 Dongxiang–Chinese bilingual sentence pairs. The training corpus combines publicly available materials with internally curated resources provided by local government partners, all of which were processed and cleaned upfront. Training was conducted using Adafactor, a memory-efficient optimizer well suited to large transformer models. With the distilled architecture, the complete fine-tuning process can be completed in under 12 hours on a single NVIDIA A100 GPU. All training configurations, hyperparameters, and experimental settings are documented across two training Jupyter notebooks. Rather than relying on a single bidirectional model, we trained two direction-specific models to support Dongxiang–Chinese and Chinese–Dongxiang translation. Since NLLB is already pretrained on Chinese, joint training under data-imbalanced conditions tends to favor the easier or more dominant direction. As a result, performance gains on the low-resource side (Dongxiang) are often limited. That said, NLLB does support bidirectional translation in a single model, and a simple approach is to alternate translation directions at the batch level, as sketched below.
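For illustration only, batch-level alternation could look roughly like the following sketch. This is not the setup we released; pairs is assumed to be a list of (Dongxiang, Chinese) tuples.
import random

# Minimal sketch of batch-level direction alternation for a single
# bidirectional model (not the two-model setup we ultimately used).
def make_training_batches(pairs, batch_size=64):
    random.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        # Alternate the translation direction from one batch to the next
        # so that neither direction dominates the optimization.
        if (start // batch_size) % 2 == 0:
            src = [d for d, _ in chunk]
            tgt = [z for _, z in chunk]
            src_lang, tgt_lang = "sce_Latn", "zho_Hans"
        else:
            src = [z for _, z in chunk]
            tgt = [d for d, _ in chunk]
            src_lang, tgt_lang = "zho_Hans", "sce_Latn"
        yield src, tgt, src_lang, tgt_lang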
Here are the links to our repository and website.
GitHub Repository
GitHub-hosted website
The models are also publicly available on Hugging Face.
Chinese → Dongxiang
Dongxiang → Chinese
Model Training: Step-by-Step Reproducible Pipeline
Before following this pipeline to build the model, we assume that the reader has a basic understanding of Python and fundamental concepts in natural language processing. For readers who may be less familiar with these topics, Andrew Ng’s courses are a highly recommended gateway. Personally, I also began my own journey into this field through his courses.
Step 1: Bilingual Dataset Processing
The first stage of model training focuses on constructing a bilingual dataset. While parallel corpora for major languages can often be obtained by leveraging existing web-scraped resources, Dongxiang–Chinese data remains difficult to acquire. To support transparency and reproducibility, and with consent from the relevant data custodians, we have released both the raw corpus and a normalized version in our GitHub repository. The normalized dataset is produced through a simple preprocessing pipeline that removes excessive whitespace, standardizes punctuation, and ensures a clean separation between scripts: Dongxiang text is restricted to Latin characters, while Chinese text contains only Chinese characters.
Below is the code used for preprocessing:
import re
import pandas as pd

def split_lines(s: str):
    # Handle raw text where newlines are stored as the literal "\n" sequence.
    if "\\n" in s and "\n" not in s:
        lines = s.split("\\n")
    else:
        lines = s.splitlines()
    lines = [ln.strip().strip("'").strip() for ln in lines if ln.strip()]
    return lines

def clean_dxg(s: str) -> str:
    s = re.sub(r"[^A-Za-z\s,.?]", " ", s)   # keep Latin letters, whitespace, and basic punctuation
    s = re.sub(r"\s+", " ", s).strip()      # collapse repeated whitespace
    s = re.sub(r"[,.?]+$", "", s)           # drop trailing punctuation
    return s

def clean_zh(s: str) -> str:
    s = re.sub(r"[^\u4e00-\u9fff,。?]", "", s)  # keep CJK characters and Chinese punctuation
    s = re.sub(r"[,。?]+$", "", s)              # drop trailing punctuation
    return s

def make_pairs(raw: str) -> pd.DataFrame:
    lines = split_lines(raw)
    pairs = []
    # Lines alternate between Dongxiang and Chinese; pair them two at a time.
    for i in range(0, len(lines) - 1, 2):
        dxg = clean_dxg(lines[i])
        zh = clean_zh(lines[i + 1])
        if dxg or zh:
            pairs.append({"Dongxiang": dxg, "Chinese": zh})
    return pd.DataFrame(pairs, columns=["Dongxiang", "Chinese"])
In practice, bilingual sentence-level pairs are preferred over word-level entries, and excessively long sentences are split into shorter segments, as sketched below. This facilitates more reliable cross-lingual alignment and results in more stable and efficient model training. Isolated dictionary entries should not be inserted into the training inputs: without surrounding context, the model cannot infer syntactic roles or learn how words interact with surrounding tokens.
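As an illustration, the rough heuristic below (ours, and intentionally conservative) splits an overly long pair only when both sides break into the same number of sentence-final segments; otherwise the original pair is kept.
import re

def split_long_pair(dxg: str, zh: str, max_words: int = 30):
    # Leave short pairs untouched.
    if len(dxg.split()) <= max_words:
        return [(dxg, zh)]
    # Split on sentence-final punctuation on each side.
    dxg_parts = [p.strip() for p in re.split(r"(?<=[.?])\s+", dxg) if p.strip()]
    zh_parts = [p.strip() for p in re.split(r"(?<=[。?])", zh) if p.strip()]
    # Keep the split only if the segments line up one-to-one.
    if len(dxg_parts) == len(zh_parts) and len(dxg_parts) > 1:
        return list(zip(dxg_parts, zh_parts))
    return [(dxg, zh)]  # fall back to the original pair if alignment is unclear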

When parallel data is limited, a common alternative is to generate synthetic source sentences from monolingual target-language data and pair them with the originals to form pseudo-parallel corpora. This idea was popularized by Rico Sennrich, whose work on back-translation laid the groundwork for many NMT pipelines. LLM-generated synthetic data is another viable approach. Prior work has shown that LLM-generated synthetic data is effective in constructing translation systems for Purépecha, an Indigenous language spoken in Mexico.
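A minimal sketch of the back-translation idea is shown below, assuming a monolingual Dongxiang corpus mono_dxg and an existing Dongxiang → Chinese model wrapped in a hypothetical helper translate_dxg_to_zh. The synthetic Chinese side is paired with the authentic Dongxiang sentences to train the Chinese → Dongxiang direction.
def build_pseudo_parallel(mono_dxg, translate_dxg_to_zh, batch_size=32):
    pairs = []
    for start in range(0, len(mono_dxg), batch_size):
        batch = mono_dxg[start:start + batch_size]
        synthetic_zh = translate_dxg_to_zh(batch)   # model-generated source side
        pairs.extend(zip(synthetic_zh, batch))      # (synthetic Chinese, real Dongxiang)
    return pairs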
Step 2: Tokenizer Preparation
Before text can be digested by a neural machine translation model, it must be converted into tokens. Tokens are discrete units, typically at the subword level, that serve as the basic input symbols for neural networks. Using entire words as atomic units is impractical, because it results in excessively large vocabularies and rapid growth in model dimensionality. Furthermore, word-level representations struggle to generalize to unseen or rare words, whereas subword tokenization enables models to compose representations for novel word forms.
The official NLLB documentation already provides standard examples demonstrating how tokenization is handled. Owing to NLLB’s strong multilingual capability, most widely used writing systems can be tokenized in a reasonable and stable manner. In our case, adopting the default NLLB multilingual tokenizer (Unigram-based) was sufficient to process Dongxiang text.
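As a quick sanity check, the default tokenizer can be loaded and applied to Latin-script text directly; the sample sentence below is only a placeholder, not vetted Dongxiang.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Inspect how the default Unigram tokenizer segments a Latin-script sentence.
sample = "ene kun sain wo"  # placeholder text for illustration
print(tokenizer.tokenize(sample))
# Latin script is covered by the shared vocabulary, so the output is a list of
# subword pieces rather than unknown tokens.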

Whether the tokenizer should be retrained is best determined by two criteria. The first is coverage: frequent occurrences of unknown tokens (<unk>) indicate that the vocabulary cannot represent the language’s text. The second is segmentation granularity: if ordinary words are routinely shattered into long sequences of subword pieces, sequences grow and training becomes less efficient.
Overall, NLLB demonstrates robust behavior even on previously unseen languages. As a result, tokenizer retraining is generally unnecessary unless the target language employs a highly unconventional writing system or even lacks Unicode support. Retraining a SentencePiece tokenizer also has implications for the embedding layer: new tokens start without pretrained embeddings and must be initialized with random values or simple averaging.
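As a rough illustration of both checks (this helper is ours, not part of NLLB), the unknown-token rate and the average number of subword pieces per whitespace-delimited word can be computed as follows.
def tokenizer_diagnostics(tokenizer, sentences):
    # Coverage: share of <unk> tokens; granularity: subword pieces per word.
    unk_id = tokenizer.unk_token_id
    total_tokens, unk_tokens, total_words = 0, 0, 0
    for sent in sentences:
        ids = tokenizer(sent, add_special_tokens=False).input_ids
        total_tokens += len(ids)
        unk_tokens += sum(1 for i in ids if i == unk_id)
        total_words += len(sent.split())
    return {
        "unk_rate": unk_tokens / max(total_tokens, 1),
        "pieces_per_word": total_tokens / max(total_words, 1),
    }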
Step 3: Language ID Registration
In practical machine translation systems such as Google Translate, the source and target languages must be explicitly specified. NLLB adopts the same assumption. Translation is governed by explicit language tags, specified as source and target language codes (src_lang and tgt_lang), which determine how text is encoded and generated within the model. When a language falls outside NLLB’s predefined scope, it must first be explicitly registered, together with a corresponding expansion of the model’s embedding layer. The embedding layer maps discrete tokens into continuous vector representations, allowing the neural network to process and learn linguistic patterns in numerical form.
In our implementation, a custom language tag is added to the tokenizer as an additional special token, which assigns it a unique token ID. The model’s token embedding matrix is then resized to accommodate the expanded vocabulary. The embedding vector associated with the new language tag is initialized from a zero-centered normal distribution with a small standard deviation (0.02). If the newly introduced language is closely related to an existing supported language, its embedding can often be trained on top of the existing representation space. However, linguistic similarity alone does not guarantee effective transfer learning; differences in writing systems can affect tokenization. A well-known example is Moldovan, which is linguistically equivalent to Romanian and written in the Latin script in Moldova, while it is written in Cyrillic in the breakaway region of Transnistria. Despite the close linguistic relationship, the difference in script introduces distinct tokenization patterns.
The code used to register a new language is presented here.
import torch
from transformers import AutoModelForSeq2SeqLM

def fix_tokenizer(tokenizer, new_lang: str):
    # Register the new language code as an additional special token (if absent)
    # and return its token ID.
    old = list(tokenizer.additional_special_tokens)
    if new_lang not in old:
        tokenizer.add_special_tokens(
            {"additional_special_tokens": old + [new_lang]})
    return tokenizer.convert_tokens_to_ids(new_lang)

fix_tokenizer(tokenizer, "sce_Latn")
# We register Dongxiang as sce_Latn; it is appended at the end of the vocabulary.
# output: 256204
print(tokenizer.convert_ids_to_tokens([256100, 256204]))
print(tokenizer.convert_tokens_to_ids(['lao_Laoo', 'sce_Latn']))
# output
# ['lao_Laoo', 'sce_Latn']
# [256100, 256204]

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model.resize_token_embeddings(len(tokenizer))
new_id = fix_tokenizer(tokenizer, "sce_Latn")
embed_dim = model.model.shared.weight.size(1)
# Initialize the new tag's embedding from a zero-centered normal distribution (std 0.02).
model.model.shared.weight.data[new_id] = torch.randn(embed_dim) * 0.02
Step 4: Model Training
We fine-tuned the translation model using the Adafactor optimizer, a memory-efficient optimization algorithm designed for large-scale sequence-to-sequence models. The training schedule begins with 500 warmup steps, during which the learning rate is gradually increased up to 1e-4 to stabilize early optimization and avoid sudden gradient spikes. The model is then trained for a total of 8,000 optimization steps, with 64 sentence pairs per optimization step (batch). The maximum sequence length is set to 128 tokens, and gradient clipping is applied with a threshold of 1.0.
We initially planned to adopt early stopping. However, due to the limited size of the bilingual corpus, nearly all available bilingual data was used for training, leaving only a dozen-plus sentence pairs reserved for testing. Under these conditions, a validation set of sufficient size was not available. Therefore, although our GitHub codebase includes placeholders for early stopping, this mechanism was not actively used in practice.
Below is a snapshot of the key hyperparameters used in training.
from transformers.optimization import Adafactor

optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)

batch_size = 64
max_length = 128
training_steps = 8000
warmup_steps = 500
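For context, a condensed sketch of the optimization loop these settings feed into might look as follows; the full loop lives in the training notebooks, and get_batch is a hypothetical helper that returns one tokenized batch of source/target pairs.
from transformers import get_constant_schedule_with_warmup

# Condensed sketch only. Update clipping is handled internally by Adafactor's
# clip_threshold, and `get_batch` is a hypothetical data helper.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)

model.train()
for step in range(training_steps):
    x, y = get_batch(batch_size, max_length)    # tokenized source and target batch
    loss = model(**x, labels=y.input_ids).loss  # token-level cross-entropy
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()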
It is also worth noting that, in the design of the loss function, we adopt a computationally efficient training strategy. The model receives tokenized source sentences as input and generates the target sequence incrementally. At each step, the predicted token is compared against the corresponding reference token in the target sentence, and the training objective is computed using token-level cross-entropy loss.
loss = model(**x, labels=y.input_ids).loss
# Pseudocode below illustrates the underlying mechanism of the loss function
for each batch:
    x = tokenize(source_sentences)        # input: source language tokens
    y = tokenize(target_sentences)        # target: reference translation tokens
    predictions = model.forward(x)        # predict next-token distributions
    loss = cross_entropy(predictions, y)  # compare with reference tokens
    backpropagate(loss)
    update_model_parameters()
This formulation carries an implicit assumption: that the reference translation represents the only correct answer and that the model’s output must align with it token by token. Under this assumption, any deviation from the reference is treated as an error, even when a prediction conveys the same idea using different wording, synonyms, or an altered sentence structure.
The mismatch between token-level supervision and meaning-level correctness is especially problematic in low-resource and morphologically flexible languages. At the training stage, this issue can be alleviated by relaxing strict token-level alignment and treating multiple paraphrased target sentences as equally valid references. At the inference stage, instead of selecting only the highest-probability output, a set of candidate translations can be generated and re-ranked using similarity-based criteria such as chrF, as sketched below.
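One simple realization of this re-ranking idea (not part of our released pipeline) is minimum-Bayes-risk-style selection: generate several candidates, for example with model.generate(..., num_beams=8, num_return_sequences=8), and keep the candidate that agrees most with the others under chrF.
import sacrebleu

def rerank_with_chrf(candidates):
    # Pick the candidate with the highest average chrF similarity to the other
    # candidates (a consensus-style proxy when no reference is available).
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        score = sum(sacrebleu.sentence_chrf(hyp, [o]).score for o in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best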
Step 5: Model Evaluation
Once the model is built, the next step is to examine how well it translates. Translation quality is shaped not only by the model itself, but also by how the translation process is configured at inference time. Under the NLLB framework, the target language must be explicitly specified during generation. This is done through the tgt_lang argument, which is passed to the model as forced_bos_token_id and anchors the output to the intended language. Output length is controlled through two parameters: a minimum output allowance (a), which guarantees a baseline number of tokens the model is allowed to generate, and a scaling factor (b), which determines how the maximum output length grows with the input length. The maximum number of generated tokens is set as a linear function of the input length, computed as a + b × input_length. In addition, max_input_length limits how many input tokens the model reads.
This function powers the Chinese → Dongxiang translation.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_DIR3 = "/content/drive/MyDrive/my_nllb_CD_model"
tokenizer3 = AutoTokenizer.from_pretrained(MODEL_DIR3)
model3 = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR3).to(device)
model3.eval()

def translate3(text, src_lang="zho_Hans", tgt_lang="sce_Latn",
               a=16, b=1.5, max_input_length=1024, **kwargs):
    tokenizer3.src_lang = src_lang
    inputs = tokenizer3(text, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_input_length).to(model3.device)
    result = model3.generate(
        **inputs,
        forced_bos_token_id=tokenizer3.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs
    )
    outputs = tokenizer3.batch_decode(result, skip_special_tokens=True)
    return outputs
Model quality is then assessed using a combination of automatic evaluation metrics and human judgment. On the quantitative side, we report standard machine translation metrics such as BLEU and chrF++. BLEU scores were computed using standard BLEU-4, which measures word-level n-gram overlap from unigrams to four-grams and combines them using a geometric mean with a brevity penalty. ChrF++ was calculated over character-level n-grams, supplemented with word n-grams, and reported as an F-score. It should be noted that the current evaluation is preliminary: due to limited data availability at this early stage, BLEU and chrF++ scores were computed on only a few dozen held-out sentence pairs. Our model achieved the following results:
Dongxiang → Chinese (DX→ZH)
BLEU-4: 44.00
ChrF++: 34.3
Chinese → Dongxiang (ZH→DX)
BLEU-4: 46.23
ChrF++: 59.80
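For reference, a sketch of how such scores can be computed with the sacrebleu library, assuming hyps is a list of model outputs and refs a matching list of single reference translations.
import sacrebleu

def evaluate(hyps, refs, chinese_output=False):
    # BLEU-4 with brevity penalty; sacrebleu's "zh" tokenizer handles
    # unsegmented Chinese output, otherwise the default "13a" tokenizer is used.
    bleu = sacrebleu.corpus_bleu(hyps, [refs],
                                 tokenize="zh" if chinese_output else "13a")
    # chrF++ corresponds to chrF computed with word n-grams of order 2.
    chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)
    return bleu.score, chrf.score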
BLEU-4 scores above 40 are generally regarded as strong in low-resource settings, indicating that the model captures sentence structure and key lexical choices with reasonable accuracy. The lower chrF++ score in the Dongxiang → Chinese direction is expected and does not necessarily indicate poor translation quality, as Chinese permits substantial surface-level variation in word choice and sentence structure, which reduces character-level overlap with a single reference translation.
In parallel, bilingual evaluators fluent in both languages reported that the model performs reliably on simple sentences, such as those following basic subject–verb–object structures. Performance degrades on longer and more complex sentences. While these results are encouraging, they also indicate that further improvement is still required.
Step 6: Deployment
At the current stage, we deploy the project through a lightweight setup, hosting the documentation and demo interface on GitHub Pages while releasing the trained models on Hugging Face. This approach enables public access and community engagement without incurring additional infrastructure costs. Details regarding GitHub-based deployment and Hugging Face model hosting follow the official documentation provided by GitHub Pages and the Hugging Face Hub, respectively.
This script uploads a locally trained Hugging Face–compatible model.
import os
from huggingface_hub import HfApi, HfFolder

# Load the Hugging Face access token
token = os.environ.get("HF_TOKEN")
HfFolder.save_token(token)

# Path to the local directory containing the trained model artifacts
local_dir = "/path/to/your/local_model_directory"
# Target Hugging Face Hub repository ID in the format: username/repo_name
repo_id = "your_username/your_model_name"

# Upload the complete model directory to the Hugging Face Model Hub
api = HfApi()
api.upload_folder(
    folder_path=local_dir,
    repo_id=repo_id,
    repo_type="model",
)
Following model release, a Gradio-based interface is deployed as a Hugging Face Space and embedded into the project’s GitHub Pages site. Compared with Docker-based self-deployment, using Hugging Face Spaces with Gradio avoids the cost of maintaining dedicated cloud infrastructure.
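A minimal app.py sketch for such a Space is shown below; the repository ID is a placeholder, not our published model name. The resulting Space can then be embedded into the GitHub Pages site, for example via an iframe.
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "your_username/your_model_name"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def translate(text):
    # Chinese -> Dongxiang, anchored by the registered sce_Latn tag.
    tokenizer.src_lang = "zho_Hans"
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("sce_Latn"),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

demo = gr.Interface(fn=translate, inputs=gr.Textbox(label="Chinese"),
                    outputs=gr.Textbox(label="Dongxiang"),
                    title="Chinese → Dongxiang")
demo.launch()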

Reflection
Throughout the project, data preparation, not model training, dominated the overall workload. The time spent cleaning, validating, and aligning Dongxiang–Chinese data far exceeded the time required to fine-tune the model itself. Without local government involvement and the support of native and bilingual speakers, completing this work would not have been possible. From a technical perspective, this imbalance highlights a broader issue of representation in multilingual NLP: low-resource languages such as Dongxiang are underrepresented not because of inherent linguistic complexity, but because the data required to support them is expensive to acquire and relies heavily on human expertise.
At its core, this project digitizes a printed bilingual dictionary and constructs a basic translation system. For a community of fewer than a million people, these incremental steps play an outsized role in ensuring that the language is not excluded from modern language technologies. Finally, let’s take a moment to appreciate the breathtaking scenery of Dongxiang Autonomous County!

Contact
This text was jointly written by Kaixuan Chen and Bo Ma, who were classmates in the Department of Statistics at the University of North Carolina at Chapel Hill. Kaixuan Chen is currently pursuing a master’s degree at Northwestern University, while Bo Ma is pursuing a master’s degree at the University of California, San Diego. Both authors are open to professional opportunities.
If you are interested in our work or would like to connect, feel free to reach out:
Project GitHub: https://github.com/dongxiangtranslationproject
Kaixuan Chen: [email protected]
Bo Ma: [email protected]
