This blog post introduces mmBERT, a state-of-the-art massively multilingual encoder model trained on 3T+ tokens of text in over 1800 languages. It shows significant performance and speed improvements over previous multilingual models and is the first to improve upon XLM-R, while also introducing new strategies for effectively learning low-resource languages. mmBERT builds upon ModernBERT for a blazingly fast architecture and adds novel components to enable efficient multilingual learning.
If you are interested in trying out the models yourself, some example boilerplate is available later in this blog post!
Training Data

mmBERT was trained on a carefully curated multilingual dataset totaling over 3T tokens across three distinct training phases. The foundation of our training data consists of three primary open-source, high-quality web crawls that enable both multilingual coverage and data quality:
DCLM and Filtered DCLM provide the highest-quality English content available, serving as the backbone for strong English performance (with the filtered data coming from Dolmino). This dataset represents state-of-the-art web filtering techniques and forms a vital component. Due to the high quality of this data, we use a significantly higher proportion of English than previous-generation multilingual encoder models (up to 18%).
FineWeb2 delivers broad multilingual web content covering over 1,800 languages. This dataset enables our extensive multilingual coverage while maintaining reasonable quality standards across diverse language families and scripts.
FineWeb2-HQ consists of a filtered subset of FineWeb2 focused on 20 high-resource languages. This filtered version provides higher-quality multilingual content that bridges the gap between English-only filtered data and broad multilingual coverage.
The training data also incorporates specialized corpora from Dolma, MegaWika v2, ProLong, and more: code repositories (StarCoder, ProLong), academic content (ArXiv, PeS2o), reference materials (Wikipedia, textbooks), and community discussions (StackExchange), along with instruction and mathematical datasets.
The key innovation in our data approach is the progressive language inclusion strategy shown in Figure 1. At each phase we sample from a progressively flatter distribution (i.e. closer to uniform), while also adding new languages. This means that high-resource languages like Russian start off with a high percentage of the data (i.e. 9%) and end the last phase of training at around half of that. We start with 60 high-resource languages during pre-training, expand to 110 languages during mid-training, and finally include all 1,833 languages from FineWeb2 during the decay phase. This allows us to maximize the impact of limited low-resource language data without excessive repetition, while maintaining high overall data quality.
Training Recipe and Novel Components
mmBERT builds upon the ModernBERT architecture but introduces several key innovations for multilingual learning:
Architecture
We use the same core architecture as ModernBERT-base with 22 layers and 1152 intermediate dimensions, but switch to the Gemma 2 tokenizer to better handle multilingual text. The base model has 110M non-embedding parameters (307M total due to the larger vocabulary), while the small variant has 42M non-embedding parameters (140M total).
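As a rough sanity check of these numbers, you can count the parameters yourself after loading the model. This is a minimal sketch; the "embeddings" name filter is an assumption about the module naming and only gives an approximate embedding/non-embedding split:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

total = sum(p.numel() for p in model.parameters())
# Rough count of embedding parameters: everything under modules whose
# name contains "embeddings" (tied weights are only counted once here).
embedding = sum(
    p.numel() for name, p in model.named_parameters() if "embeddings" in name
)
print(f"total params:         {total / 1e6:.0f}M")
print(f"non-embedding params: {(total - embedding) / 1e6:.0f}M")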
Three-Phase Training Approach
Our training follows a rigorously designed three-phase schedule:
- Pre-training (2.3T tokens): Warmup and stable learning rate phase using 60 languages with 30% mask rate
- Mid-training (600B tokens): Context extension to 8192 tokens, higher-quality data, expanded to 110 languages with 15% mask rate
- Decay phase (100B tokens): Inverse square root learning rate decay, all 1,833 languages included with 5% mask rate
Novel Training Techniques
Inverse Mask Ratio Schedule: Instead of using a fixed masking rate, we progressively reduce the mask ratio from 30% → 15% → 5% across training phases. This lets the model learn basic representations with higher masking early on, then focus on more nuanced understanding with lower masking rates.
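In Hugging Face terms, changing the mask ratio between phases simply means building a new masked-language-modeling collator with a different mlm_probability. This is a minimal sketch, not our actual training code; the phase names and the helper function are illustrative:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

# Mask ratios for the three phases described above (30% -> 15% -> 5%).
MASK_SCHEDULE = {"pretrain": 0.30, "midtrain": 0.15, "decay": 0.05}

def collator_for_phase(phase: str) -> DataCollatorForLanguageModeling:
    """Build an MLM collator with the mask ratio used in the given phase."""
    return DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=MASK_SCHEDULE[phase],
    )

# e.g. swap the collator when moving from pre-training to mid-training:
collator = collator_for_phase("midtrain")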
Annealed Language Learning: We dynamically adjust the temperature for multilingual data sampling from τ=0.7 → 0.5 → 0.3. This creates a progression from high-resource language bias toward more uniform sampling, enabling the model to build a strong multilingual foundation before learning low-resource languages.
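Concretely, the temperature τ exponentiates each language's token share before renormalizing, so a lower τ flattens the distribution (which is why a language at 9% of the data early on ends up at roughly half of that by the decay phase). A minimal sketch; the token counts below are made-up numbers, only the τ values match the schedule above:

import numpy as np

def language_sampling_probs(token_counts: dict, tau: float) -> dict:
    """p_i is proportional to share_i ** tau; lower tau flattens the distribution."""
    counts = np.array(list(token_counts.values()), dtype=np.float64)
    shares = counts / counts.sum()
    weights = shares ** tau
    probs = weights / weights.sum()
    return dict(zip(token_counts.keys(), probs))

# Hypothetical token counts (in billions), just to show the flattening effect.
counts = {"en": 900, "ru": 180, "sw": 4, "fo": 0.05}

for tau in (0.7, 0.5, 0.3):  # pre-training -> mid-training -> decay
    probs = language_sampling_probs(counts, tau)
    print(tau, {lang: round(p, 3) for lang, p in probs.items()})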
Progressive Language Addition: Rather than training on all languages simultaneously, we strategically add languages at each phase (60 → 110 → 1,833). This maximizes learning efficiency by avoiding excessive epochs on limited low-resource data while still achieving strong performance.
Model Merging: We train three different variants during the decay phase (English-focused, 110-language, and all-language) and use TIES merging to combine their strengths into the final model.
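For intuition, TIES merging operates on "task vectors" (each variant's weights minus a shared base): it trims each vector to its largest-magnitude entries, elects a per-parameter sign by majority mass, and averages only the entries that agree with that sign. The following is an illustrative per-tensor sketch in PyTorch, not the exact merging code or hyperparameters we used:

import torch

def ties_merge(base: torch.Tensor, variants: list,
               density: float = 0.2, lam: float = 1.0) -> torch.Tensor:
    """Merge several fine-tuned copies of one weight tensor with TIES."""
    # 1) Task vectors: how each variant moved away from the shared base.
    deltas = torch.stack([v - base for v in variants])

    # 2) Trim: keep only the top-`density` fraction of entries by magnitude.
    k = max(1, int(density * deltas[0].numel()))
    trimmed = torch.zeros_like(deltas)
    for i, d in enumerate(deltas):
        idx = d.abs().flatten().topk(k).indices
        trimmed[i].view(-1)[idx] = d.view(-1)[idx]

    # 3) Elect a sign per parameter by total signed mass across variants.
    elected_sign = torch.sign(trimmed.sum(dim=0))

    # 4) Disjoint mean: average only the entries whose sign agrees.
    agree = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
    merged = (trimmed * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)

    return base + lam * merged

# Toy usage with random tensors standing in for one layer's weights.
base = torch.randn(4, 4)
merged = ties_merge(base, [base + 0.1 * torch.randn(4, 4) for _ in range(3)])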
Results
Natural Language Understanding (NLU)

English Performance: On the English GLUE benchmark (Table 1), mmBERT base achieves strong performance, substantially outperforming other multilingual models like XLM-R (multilingual RoBERTa) base and mGTE base, while remaining competitive with English-only models despite less than 25% of the mmBERT training data being English.

Multilingual Performance: mmBERT shows significant improvements on the XTREME benchmark compared to XLM-R, as demonstrated in Table 2. Notable gains include strong performance on XNLI classification, substantial improvements on question answering tasks like TyDiQA, and competitive results across PAWS-X and XCOPA for cross-lingual understanding.
The model performs well across most categories, except for some structured prediction tasks like NER and POS tagging, likely due to tokenizer differences that affect word boundary detection. On these categories, it performs about the same as the previous generation, but can be applied to more languages.
Retrieval Performance

English Retrieval: Although mmBERT is designed for massively multilingual settings, on the MTEB v2 English benchmarks (Table 3) it shows significant gains over previous multilingual models and even matches English-only models like ModernBERT!

Multilingual Retrieval: mmBERT shows consistent improvements on the MTEB v2 multilingual benchmarks compared to other models (Table 4).

Code Retrieval: Due to the modern tokenizer (based on Gemma 2), mmBERT also shows strong coding performance (Table 5), making it suitable for any kind of textual data. The only model that outperforms it is EuroBERT, which was able to use the non-publicly available Stack v2 dataset.
Learning Languages in the Decay Phase
One of mmBERT's most important novel features is demonstrating that low-resource languages can be effectively learned during the short decay phase of training. We validated this approach by testing on languages only introduced during the final 100B-token decay phase.

Dramatic Performance Gains: Testing on TiQuaD (Tigrinya) and FoQA (Faroese), we observed substantial improvements when these languages were included in the decay phase, as shown in Figure 2. The results demonstrate the effectiveness of our progressive language learning approach.
Competitive with Large Models: Despite only seeing these languages in the final training phase, mmBERT achieves performance levels that exceed much larger models. On Faroese question answering, where LLMs have been benchmarked, mmBERT outperforms Google Gemini 2.5 Pro and OpenAI o3.
Rapid Learning Mechanism: The success of decay-phase language learning stems from the model's ability to leverage the strong multilingual foundation built during earlier phases. When exposed to new languages, the model can quickly adapt existing cross-lingual representations rather than learning from scratch.
Model Merging Advantages: The final mmBERT models retain most of the decay-phase improvements while benefiting from the English-focused and high-resource variants through TIES merging.
Efficiency Improvements
mmBERT delivers substantial efficiency gains over previous multilingual encoder models through architectural improvements inherited from ModernBERT:

Throughput Performance: mmBERT processes text significantly faster than existing multilingual models across various sequence lengths, as demonstrated in Figure 3. Both the small and base models show substantial speed improvements over previous multilingual encoders.
Modern Architecture Advantages: The efficiency gains come from two main technical improvements (see the loading sketch after this list):
- Flash Attention 2: Optimized attention computation for better memory usage and speed
- Unpadding techniques: Elimination of unnecessary padding tokens during processing
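If the flash-attn package is installed and you have a supported GPU, you can request this attention path explicitly when loading the model; otherwise, drop the argument and the default implementation is used. A minimal sketch (the long document below is a dummy input just to exercise the 8,192-token context):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

# Explicitly request Flash Attention 2 (requires flash-attn and a supported GPU).
model = AutoModel.from_pretrained(
    "jhu-clsp/mmBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Long inputs are handled natively up to 8,192 tokens.
long_doc = "multilingual " * 6000
inputs = tokenizer(long_doc, return_tensors="pt", truncation=True, max_length=8192).to("cuda")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, <=8192, hidden_size)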
Sequence Length Scaling: Unlike older models limited to 512 tokens, mmBERT handles up to 8,192 tokens efficiently while maintaining high throughput. This makes it suitable for longer document-processing tasks, which are increasingly common in multilingual applications.
Energy Efficiency: The combination of higher throughput and a modern architecture results in lower computational costs for inference, making mmBERT more practical for production deployments where multilingual support is required at scale.
These efficiency improvements make mmBERT not only more accurate than previous multilingual encoders, but also significantly more practical for real-world usage.
Usage Examples
You can use these models with just a few lines of code!
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Locate the mask positions and take the top-5 predicted tokens
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    top_tokens, top_indices = torch.topk(predictions, 5, dim=-1)

    return [tokenizer.decode(token) for token in top_indices[0]]

# Insert the tokenizer's mask token into each prompt
texts = [
    f"The capital of France is {tokenizer.mask_token}.",
    f"La capital de España es {tokenizer.mask_token}.",
    f"Die Hauptstadt von Deutschland ist {tokenizer.mask_token}.",
]

for text in texts:
    predictions = predict_masked_token(text)
    print(f"Text: {text}")
    print(f"Predictions: {predictions}\n")
Fine-tuning Examples
Encoders
Click to see how to finetune this into a dense embedding model using Sentence Transformers
import argparse

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=8e-5)
    parser.add_argument("--model_name", type=str, default="jhu-clsp/mmBERT-small")
    args = parser.parse_args()

    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    model = SentenceTransformer(model_name)

    dataset = load_dataset(
        "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
        "triplet-hard",
        split="train",
    )
    dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"].select(range(1_250_000))
    eval_dataset = dataset_dict["test"]

    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
    run_name = f"{model_shortname}-DPR-{lr}"

    args = SentenceTransformerTrainingArguments(
        output_dir=f"output/{model_shortname}/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=512,
        per_device_eval_batch_size=512,
        warmup_ratio=0.05,
        fp16=False,
        bf16=True,
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        learning_rate=lr,
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,
        logging_steps=500,
        run_name=run_name,
    )

    dev_evaluator = TripletEvaluator(
        anchors=eval_dataset["query"],
        positives=eval_dataset["positive"],
        negatives=eval_dataset["negative"],
        name="msmarco-co-condenser-dev",
    )
    dev_evaluator(model)

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=dev_evaluator,
    )
    trainer.train()

    dev_evaluator(model)

    model.save_pretrained(f"output/{model_shortname}/{run_name}/final")
    model.push_to_hub(run_name, private=False)


if __name__ == "__main__":
    main()
Click to see how to finetune this into a multi-vector embedding model with PyLate
from datasets import load_dataset
from pylate import losses, models, utils
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)


def main():
    # Load the knowledge-distillation training data along with its queries and documents
    train = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="train",
    )
    queries = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="queries",
    )
    documents = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="documents",
    )
    train.set_transform(
        utils.KDProcessing(queries=queries, documents=documents).transform,
    )

    num_train_epochs = 1
    lr = 8e-5
    batch_size = 16
    accum_steps = 1
    model_name = "jhu-clsp/mmBERT-small"
    model_shortname = model_name.split("/")[-1]
    run_name = f"{model_shortname}-colbert-KD-{lr}"
    output_dir = f"output/{model_shortname}/{run_name}"

    model = models.ColBERT(model_name_or_path=model_name)

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        fp16=False,
        bf16=True,
        run_name=run_name,
        logging_steps=10,
        learning_rate=lr,
        gradient_accumulation_steps=accum_steps,
        warmup_ratio=0.05,
    )

    train_loss = losses.Distillation(model=model)

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train,
        loss=train_loss,
        data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
    )
    trainer.train()

    model.save_pretrained(f"{output_dir}/final")


if __name__ == "__main__":
    main()
Click to see how to finetune this into a sparse retrieval model using Sentence Transformers
import logging

from datasets import load_dataset
from sentence_transformers import (
    SparseEncoder,
    SparseEncoderModelCardData,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.training_args import BatchSamplers

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

model = SparseEncoder(
    "jhu-clsp/mmBERT-small",
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
    )
)

full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]

loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)

run_name = "splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
    output_dir=f"models/{run_name}",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    bf16=False,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=2,
    logging_steps=200,
    run_name=run_name,
)

dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)

trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()

dev_evaluator(model)

model.save_pretrained(f"models/{run_name}/final")
model.push_to_hub(run_name)
Click to see how to finetune this into a reranker model using Sentence Transformers
import logging
import traceback

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import (
    CrossEncoderNanoBEIREvaluator,
    CrossEncoderRerankingEvaluator,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.util import mine_hard_negatives

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)


def main():
    model_name = "jhu-clsp/mmBERT-small"

    train_batch_size = 64
    num_epochs = 1
    num_hard_negatives = 5

    model = CrossEncoder(
        model_name,
        model_card_data=CrossEncoderModelCardData(
            language="en",
            license="apache-2.0",
        ),
    )
    print("Model max length:", model.max_length)
    print("Model num labels:", model.num_labels)

    logging.info("Read the gooaq training dataset")
    full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(100_000))
    dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"]
    eval_dataset = dataset_dict["test"]
    logging.info(train_dataset)
    logging.info(eval_dataset)

    embedding_model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
    hard_train_dataset = mine_hard_negatives(
        train_dataset,
        embedding_model,
        num_negatives=num_hard_negatives,
        margin=0,
        range_min=0,
        range_max=100,
        sampling_strategy="top",
        batch_size=4096,
        output_format="labeled-pair",
        use_faiss=True,
    )
    logging.info(hard_train_dataset)

    loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))

    nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq"],
        batch_size=train_batch_size,
    )

    hard_eval_dataset = mine_hard_negatives(
        eval_dataset,
        embedding_model,
        corpus=full_dataset["answer"],
        num_negatives=30,
        batch_size=4096,
        include_positives=True,
        output_format="n-tuple",
        use_faiss=True,
    )
    logging.info(hard_eval_dataset)

    reranking_evaluator = CrossEncoderRerankingEvaluator(
        samples=[
            {
                "query": sample["question"],
                "positive": [sample["answer"]],
                "documents": [sample[column_name] for column_name in hard_eval_dataset.column_names[2:]],
            }
            for sample in hard_eval_dataset
        ],
        batch_size=train_batch_size,
        name="gooaq-dev",
        always_rerank_positives=False,
    )

    evaluator = SequentialEvaluator([reranking_evaluator, nano_beir_evaluator])
    evaluator(model)

    short_model_name = model_name if "/" not in model_name else model_name.split("/")[-1]
    run_name = f"reranker-{short_model_name}-gooaq-bce"
    args = CrossEncoderTrainingArguments(
        output_dir=f"models/{run_name}",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,
        bf16=True,
        dataloader_num_workers=4,
        load_best_model_at_end=True,
        metric_for_best_model="eval_gooaq-dev_ndcg@10",
        eval_strategy="steps",
        eval_steps=1000,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=2,
        logging_steps=200,
        logging_first_step=True,
        run_name=run_name,
        seed=12,
    )

    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=hard_train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()

    evaluator(model)

    final_output_dir = f"models/{run_name}/final"
    model.save_pretrained(final_output_dir)

    try:
        model.push_to_hub(run_name)
    except Exception:
        logging.error(
            f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
            f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
            f"and saving it using `model.push_to_hub('{run_name}')`."
        )


if __name__ == "__main__":
    main()
Model Family and Links
Standard Models: jhu-clsp/mmBERT-small and jhu-clsp/mmBERT-base on the Hugging Face Hub
Research Resources:
