What happens when you take the ModernBERT recipe and apply it to a decoder-only model? It turns out: a state-of-the-art decoder language model that beats Llama 3.2 1B and SmolLM2!
We introduce a new open-data training recipe to reproduce the encoder-only ModernBERT model (and actually beat it!). We then apply the exact same recipe to decoder-only models. For the first time, we have two state-of-the-art models trained in the same setup but with two different training objectives: masked language modeling (MLM) and causal language modeling (CLM).
This blog post introduces Ettin, the first suite of SoTA paired encoder-only and decoder-only models (17M–1B params) trained with identical data (2T tokens), architectures, and training recipes. Ettin enables true apples-to-apples comparisons between architectures and delivers state-of-the-art performance for open-data models in both categories. We then further explore whether it is possible to get a competitive encoder starting from a decoder and vice versa.
If you are interested in trying out the models, some boilerplate code can be found at the end of this blog post!
Encoders vs Decoders: The Architecture Divide
The LLM community has largely converged on decoder-only models like GPT, Llama, and Qwen. Their generative capabilities are impressive, but this focus has drawn attention away from other categories, such as encoder-only models like BERT.
However, BERT-like encoder models remain the workhorses of production systems for classification, retrieval, and embedding tasks. They are faster, more memory-efficient, and often more accurate for discriminative tasks. The key difference between the two families lies in their attention patterns:
- Encoder models use bidirectional attention, allowing each token to “see” all other tokens in the sequence (fully visible)
- Decoder models use causal attention, where tokens can only “see” previous tokens, enabling autoregressive generation (see the mask sketch below)
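To make the difference concrete, here is a minimal PyTorch sketch (illustrative only, not taken from the training code) of the two mask shapes for a five-token sequence:

import torch

seq_len = 5

# Encoder-style (fully visible): every position can attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (causal): position i can only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())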
While decoder models have seen rapid innovation, encoder model development had stagnated – until recently, with efforts like ModernBERT modernizing them. But which architecture is better? Previous comparisons between encoders and decoders used different datasets, architectures, and training recipes, so it was hard to tell.
Named after the two-headed Norse giant, Ettin provides a controlled comparison by training both architectures on identical data, identical model shapes, and identical training recipes. The models differ only in attention patterns and training objectives!
Training Recipe: Modern Techniques for Both Architectures
We build on the ModernBERT recipe, which borrowed modern techniques from decoder-only models and brought them to encoder training. This provides a strong base for training both architectures.
Sizes
We train six different sizes, ranging from 17M to 1B parameters. This allows us to study the effects of scale and provides a wide range of models for you to use!
Whether you need a blazing-fast on-device model or a powerful but slower one, we've got you covered!
Three-Phase Training Process
We use a comprehensive three-phase training approach to maximize performance:
Phase 1 – Pre-training (1.7T tokens): We start with a diverse mixture of high-quality data sources, training on shorter contexts (1024 tokens) to establish strong foundational knowledge.
Phase 2 – Context Extension (250B tokens): We increase the context length to 8K tokens using higher-quality filtered data, allowing models to understand longer documents and more complex relationships.
Phase 3 – Decay (100B tokens): We finish with premium data sources including scientific papers, textbooks, and curated content while gradually reducing the learning rate.
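As a quick sanity check on the budget, the three phases sum to roughly the 2T tokens mentioned earlier. The snippet below is purely illustrative and only restates the numbers above:

# Illustrative only: names, token counts, and context lengths restate the three phases above.
phases = [
    ("pre-training", 1_700e9, 1024),
    ("context extension", 250e9, 8192),
    ("decay", 100e9, 8192),
]

total = sum(tokens for _, tokens, _ in phases)
for name, tokens, context in phases:
    print(f"{name:>17}: {tokens / 1e9:6.0f}B tokens at {context}-token context")
print(f"{'total':>17}: {total / 1e12:.2f}T tokens")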
Modern Architecture Components
Our encoder models inherit all of ModernBERT's speed advantages, making them significantly faster than previous generations of encoders.
Data Sources and Quality
Unlike ModernBERT, all our training data is public and reproducible:
You can continue training these models on new data or propose a new recipe to further improve results!
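If you do want to continue pre-training, a rough sketch with plain Hugging Face tooling could look like the following. The corpus, masking rate, and hyperparameters are placeholders, and the original training used its own pipeline rather than this one:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "jhu-clsp/ettin-encoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpus; swap in your own text data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# The masking rate here is a placeholder, not the value used to train Ettin.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ettin-continued-mlm", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()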
Encoder Results: Beating ModernBERT
Our encoder models outperform ModernBERT across all tasks and model sizes, while using completely open training data. Since we provide a wide variety of sizes, you can now use ModernBERT-style models in smaller sizes (great for on-device use or fast inference), or power up with a 1B-parameter encoder that crushes the competition.
Decoder Results: Beating Llama 3.2 and SmolLM2
Applying the same recipe to decoder models yields equally impressive results, with our models outperforming or matching established baselines such as Llama 3.2 and SmolLM2:
The gains are particularly strong on knowledge-intensive tasks like SciQ, reflecting the benefits of our high-quality training data mixture. These results demonstrate that our training recipe creates genuinely strong models in both architectural paradigms.
Fair Fight: Encoders vs Decoders on Even Ground
For the first time, we can fairly compare encoder and decoder architectures trained with identical data and recipes. The results reveal fundamental architectural advantages that persist even when all other factors are controlled:
Architecture-Specific Advantages Persist
The results show clear patterns:
Encoders dominate classification and retrieval: On MNLI classification, even a 150M encoder (89.2) outperforms a 400M decoder (88.2). For retrieval tasks, the gap is smaller but still noticeable – especially when decoders are not trained with MNTP.
Decoders excel at generation: On generative tasks, decoders maintain consistent advantages, with the performance gap actually widening at larger model sizes.
Size doesn't always matter: A 400M encoder beats a 1B decoder on classification tasks, while a 400M decoder beats a 1B encoder on generation tasks.
Cross-Objective Training Falls Short
Due to the lack of recent encoder models, works like LLM2Vec have proposed continuing to pre-train decoders with MLM. We can now test the effectiveness of this strategy!
We switched the objective and continued training our models with the opposite objective for 50B additional tokens. This is what we found:
- Encoder-from-decoder: still generally trails native encoders on classification/retrieval
- Decoder-from-encoder: significantly worse than native decoders, especially at larger scales. This may be because the encoders were trained with MLM instead of MNTP (masked next token prediction), as proposed by LLM2Vec (and used in our encoder-from-decoder recipe).
This suggests that the architecture choice matters fundamentally, not just the training objective.
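For readers unfamiliar with MNTP, here is a conceptual sketch (not the actual training code) of how the two objectives build their labels. The only difference is that MNTP shifts each masked token's label one position to the left, so that a causal model predicts it from the hidden state of the preceding position:

import torch

IGNORE = -100  # positions with this label are ignored by the loss

def mlm_labels(input_ids, mask_positions, mask_token_id):
    # MLM: replace chosen tokens with [MASK]; each masked token is predicted
    # from the hidden state at the *same* position.
    masked = input_ids.clone()
    labels = torch.full_like(input_ids, IGNORE)
    labels[mask_positions] = input_ids[mask_positions]
    masked[mask_positions] = mask_token_id
    return masked, labels

def mntp_labels(input_ids, mask_positions, mask_token_id):
    # MNTP (LLM2Vec): same masking, but each masked token is predicted from the
    # hidden state of the *previous* position, so labels shift left by one
    # (assuming the loss applies no additional shift of its own).
    masked, labels = mlm_labels(input_ids, mask_positions, mask_token_id)
    shifted = torch.full_like(labels, IGNORE)
    shifted[:-1] = labels[1:]
    return masked, shifted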
Beyond Performance: Understanding Model Behavior
With identical training data, we can study how different objectives affect learning. For instance, analyzing gender bias using the WinoGender benchmark reveals:
- Encoder models prefer gender-neutral pronouns more often (60%+ neutral vs 30%+ for decoders)
- Both architectures show male bias, but decoders slightly more so
- Cross-objective training affects bias patterns in measurable ways
This opens doors for systematic studies of how training objectives influence model behavior beyond just accuracy metrics.
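As an illustration of the kind of probe this setup enables (this is not the evaluation code behind the numbers above, and the template is a made-up WinoGender-style sentence), one can compare the encoder's mask-fill scores for different pronouns:

from transformers import pipeline

fill = pipeline("fill-mask", model="jhu-clsp/ettin-encoder-150m")
template = "The engineer told the client that [MASK] would finish the design by Friday."

# Compare the scores assigned to a few pronouns at the masked position.
for result in fill(template, targets=[" he", " she", " they"]):
    print(f"{result['token_str']!r}: {result['score']:.4f}")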
Usage Examples
You can use these models with just a few lines of code!
Encoders
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the [MASK] position and return the top-5 predicted tokens
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    top_tokens = torch.topk(predictions, 5, dim=-1)

    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")
For classification and retrieval tasks, use encoder models. You may want to use a fine-tuned version for these tasks as well.
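As a minimal sketch of the classification path (note that the classification head below is freshly initialized, so you would fine-tune it on labeled data before trusting its outputs):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForSequenceClassification.from_pretrained("jhu-clsp/ettin-encoder-150m", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)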
Decoders
For text generation tasks, use decoder models:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=50, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Fine-tuning Examples
Encoders
Click to see how to fine-tune this into a dense embedding model using Sentence Transformers
import argparse
from datasets import load_dataset
from sentence_transformers import (
SentenceTransformer,
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=8e-5)
parser.add_argument("--model_name", type=str, default="jhu-clsp/ettin-encoder-150m")
args = parser.parse_args()
lr = args.lr
model_name = args.model_name
model_shortname = model_name.split("/")[-1]
model = SentenceTransformer(model_name)
dataset = load_dataset(
"sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
"triplet-hard",
split="train",
)
dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"].select(range(1_250_000))
eval_dataset = dataset_dict["test"]
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
run_name = f"{model_shortname}-DPR-{lr}"
args = SentenceTransformerTrainingArguments(
output_dir=f"output/{model_shortname}/{run_name}",
num_train_epochs=1,
per_device_train_batch_size=512,
per_device_eval_batch_size=512,
warmup_ratio=0.05,
fp16=False,
bf16=True,
batch_sampler=BatchSamplers.NO_DUPLICATES,
learning_rate=lr,
save_strategy="steps",
save_steps=500,
save_total_limit=2,
logging_steps=500,
run_name=run_name,
)
dev_evaluator = TripletEvaluator(
anchors=eval_dataset["query"],
positives=eval_dataset["positive"],
negatives=eval_dataset["negative"],
name="msmarco-co-condenser-dev",
)
dev_evaluator(model)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=dev_evaluator,
)
trainer.train()
dev_evaluator(model)
model.save_pretrained(f"output/{model_shortname}/{run_name}/final")
model.push_to_hub(run_name, private=False)
if __name__ == "__main__":
    main()
Click to see how to fine-tune this into a multi-vector embedding model with PyLate
from datasets import load_dataset
from pylate import losses, models, utils
from sentence_transformers import (
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
)
def main():
train = load_dataset(
path="lightonai/ms-marco-en-bge",
name="train",
)
queries = load_dataset(
path="lightonai/ms-marco-en-bge",
name="queries",
)
documents = load_dataset(
path="lightonai/ms-marco-en-bge",
name="documents",
)
train.set_transform(
utils.KDProcessing(queries=queries, documents=documents).transform,
)
num_train_epochs = 1
lr = 8e-5
batch_size = 16
accum_steps = 1
model_name = "jhu-clsp/ettin-encoder-150m"
model_shortname = model_name.split("/")[-1]
run_name = f"{model_shortname}-colbert-KD-{lr}"
output_dir = f"output/{model_shortname}/{run_name}"
model = models.ColBERT(model_name_or_path=model_name)
args = SentenceTransformerTrainingArguments(
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=batch_size,
fp16=False,
bf16=True,
run_name=run_name,
logging_steps=10,
learning_rate=lr,
gradient_accumulation_steps=accum_steps,
warmup_ratio=0.05,
)
train_loss = losses.Distillation(model=model)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train,
loss=train_loss,
data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
)
trainer.train()
model.save_pretrained(f"{output_dir}/final")
if __name__ == "__main__":
    main()
Click to see how to fine-tune this into a sparse retrieval model using Sentence Transformers
import logging
from datasets import load_dataset
from sentence_transformers import (
SparseEncoder,
SparseEncoderModelCardData,
SparseEncoderTrainer,
SparseEncoderTrainingArguments,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
model = SparseEncoder(
"jhu-clsp/ettin-encoder-150m",
model_card_data=SparseEncoderModelCardData(
language="en",
license="apache-2.0",
)
)
full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=5e-5,
document_regularizer_weight=3e-5,
)
run_name = "splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
output_dir=f"models/{run_name}",
num_train_epochs=1,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True,
bf16=False,
batch_sampler=BatchSamplers.NO_DUPLICATES,
eval_strategy="steps",
eval_steps=1000,
save_strategy="steps",
save_steps=1000,
save_total_limit=2,
logging_steps=200,
run_name=run_name,
)
dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)
trainer = SparseEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=dev_evaluator,
)
trainer.train()
dev_evaluator(model)
model.save_pretrained(f"models/{run_name}/final")
model.push_to_hub(run_name)
Click to see how to fine-tune this into a reranker model using Sentence Transformers
import logging
import traceback
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import (
CrossEncoder,
CrossEncoderModelCardData,
CrossEncoderTrainer,
CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import (
CrossEncoderNanoBEIREvaluator,
CrossEncoderRerankingEvaluator,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.util import mine_hard_negatives
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
def main():
model_name = "jhu-clsp/ettin-encoder-150m"
train_batch_size = 64
num_epochs = 1
num_hard_negatives = 5
model = CrossEncoder(
model_name,
model_card_data=CrossEncoderModelCardData(
language="en",
license="apache-2.0",
),
)
print("Model max length:", model.max_length)
print("Model num labels:", model.num_labels)
logging.info("Read the gooaq training dataset")
full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
logging.info(train_dataset)
logging.info(eval_dataset)
embedding_model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
hard_train_dataset = mine_hard_negatives(
train_dataset,
embedding_model,
num_negatives=num_hard_negatives,
margin=0,
range_min=0,
range_max=100,
sampling_strategy="top",
batch_size=4096,
output_format="labeled-pair",
use_faiss=True,
)
logging.info(hard_train_dataset)
loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))
nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
dataset_names=["msmarco", "nfcorpus", "nq"],
batch_size=train_batch_size,
)
hard_eval_dataset = mine_hard_negatives(
eval_dataset,
embedding_model,
corpus=full_dataset["answer"],
num_negatives=30,
batch_size=4096,
include_positives=True,
output_format="n-tuple",
use_faiss=True,
)
logging.info(hard_eval_dataset)
reranking_evaluator = CrossEncoderRerankingEvaluator(
samples=[
{
"query": sample["question"],
"positive": [sample["answer"]],
"documents": [sample[column_name] for column_name in hard_eval_dataset.column_names[2:]],
}
for sample in hard_eval_dataset
],
batch_size=train_batch_size,
name="gooaq-dev",
always_rerank_positives=False,
)
evaluator = SequentialEvaluator([reranking_evaluator, nano_beir_evaluator])
evaluator(model)
short_model_name = model_name if "/" not in model_name else model_name.split("/")[-1]
run_name = f"reranker-{short_model_name}-gooaq-bce"
args = CrossEncoderTrainingArguments(
output_dir=f"models/{run_name}",
num_train_epochs=num_epochs,
per_device_train_batch_size=train_batch_size,
per_device_eval_batch_size=train_batch_size,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=False,
bf16=True,
dataloader_num_workers=4,
load_best_model_at_end=True,
metric_for_best_model="eval_gooaq-dev_ndcg@10",
eval_strategy="steps",
eval_steps=1000,
save_strategy="steps",
save_steps=1000,
save_total_limit=2,
logging_steps=200,
logging_first_step=True,
run_name=run_name,
seed=12,
)
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=hard_train_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
evaluator(model)
final_output_dir = f"models/{run_name}/final"
model.save_pretrained(final_output_dir)
try:
model.push_to_hub(run_name)
except Exception:
logging.error(
f"Error uploading model to the Hugging Face Hub:n{traceback.format_exc()}To upload it manually, you possibly can run "
f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
f"and saving it using `model.push_to_hub('{run_name}')`."
)
if __name__ == "__main__":
    main()
Decoders
Click to expand decoder training code
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --output_dir ettin-decoder-17m \
    --push_to_hub
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-4 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --output_dir ettin-decoder-17m \
    --push_to_hub
where sft.py is:
import argparse
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
from trl import (
ModelConfig,
ScriptArguments,
SFTConfig,
SFTTrainer,
TrlParser,
clone_chat_template,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
)
def main(script_args, training_args, model_args):
quantization_config = get_quantization_config(model_args)
model_kwargs = dict(
revision=model_args.model_revision,
trust_remote_code=model_args.trust_remote_code,
attn_implementation=model_args.attn_implementation,
torch_dtype=model_args.torch_dtype,
use_cache=False if training_args.gradient_checkpointing else True,
device_map=get_kbit_device_map() if quantization_config is not None else None,
quantization_config=quantization_config,
)
config = AutoConfig.from_pretrained(model_args.model_name_or_path)
valid_image_text_architectures = MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values()
if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
from transformers import AutoModelForImageTextToText
model_kwargs.pop("use_cache", None)
model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
else:
model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, use_fast=True
)
if tokenizer.chat_template is None:
model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_args),
)
trainer.train()
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
def make_parser(subparsers: argparse._SubParsersAction = None):
dataclass_types = (ScriptArguments, SFTConfig, ModelConfig)
if subparsers is not None:
parser = subparsers.add_parser("sft", help="Run the SFT training script", dataclass_types=dataclass_types)
else:
parser = TrlParser(dataclass_types)
return parser
if __name__ == "__main__":
parser = make_parser()
script_args, training_args, model_args, _ = parser.parse_args_and_config(return_remaining_strings=True)
    main(script_args, training_args, model_args)
Model Family and Links
The complete Ettin suite includes models at six different scales (for both encoders and decoders):
Standard Models:
Research Resources: