SoTA Paired Encoders and Decoders



What happens when you take the ModernBERT recipe and apply it to a decoder-only model? It turns out: a state-of-the-art decoder language model that beats Llama 3.2 1B and SmolLM2!

We introduce a new open-data training recipe to reproduce the encoder-only ModernBERT model (and actually beat it!). We then apply the exact same recipe to decoder-only models. For the first time, we have two state-of-the-art models trained in the same setup but with two different training objectives: masked language modeling (MLM) and causal language modeling (CLM).

This blog post introduces Ettin, the first suite of SoTA paired encoder-only and decoder-only models (17M-1B params) trained with identical data (2T tokens), architecture, and training recipes. Ettin enables true apples-to-apples comparisons between architectures and delivers state-of-the-art performance for open-data models in both categories. We also explore whether it is possible to get a competitive encoder starting from the decoder and vice-versa.

If you are interested in trying out the models, some boilerplate code can be found at the end of this blog post!

Attention patterns comparison between encoder and decoder models



Encoders vs Decoders: The Architecture Divide

The LLM community has largely converged on decoder-only models like GPT, Llama, and Qwen. Their generative capabilities are impressive, but this focus has drawn attention away from other categories, such as encoder-only models like BERT.

However, encoder BERT-like models remain the workhorses of production systems for classification, retrieval, and embedding tasks. They are faster, more memory-efficient, and often more accurate for discriminative tasks. The key difference lies in their attention patterns (illustrated in the sketch after this list):

  • Encoder models use bidirectional attention, allowing each token to “see” all other tokens in the sequence (fully visible)
  • Decoder models use causal attention, where tokens can only “see” previous tokens to enable autoregressive generation
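
As a rough illustration (a minimal sketch of the idea, not the actual training code), the two attention patterns can be written as boolean masks in PyTorch:

import torch

seq_len = 5

# Encoder (bidirectional): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder (causal): position i may only attend to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())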

While decoder models have seen rapid innovation, encoder model development had stagnated – until recently, with efforts like ModernBERT modernizing them. But which architecture is better? Previous comparisons between encoders and decoders used different datasets, architectures, and training recipes, so it was hard to tell.

Named after the two-headed Norse giant, Ettin provides a controlled comparison by training both architectures on identical data, identical model shapes, and identical training recipes. They differ only in attention patterns and training objectives!



Training Recipe: Modern Techniques for Both Architectures

We build on the ModernBERT recipe, which borrowed modern techniques from decoder-only models and brought them to encoder training. This provides a strong base for training both architectures.



Sizes

We train six different sizes, ranging from 17M to 1B parameters. This allows us to study the effects of scale and provides a range of models for you to use!
Whether you need a blazing-fast on-device model or a powerful but slower one, we've got you covered!

Sizes of Ettin models



Three-Phase Training Process

We use a comprehensive three-phase training approach to maximize performance:

Phase 1 – Pre-training (1.7T tokens): We start with a diverse mixture of high-quality data sources, training on shorter contexts (1024 tokens) to establish strong foundational knowledge.

Phase 2 – Context Extension (250B tokens): We increase the context length to 8K tokens using higher-quality filtered data, allowing models to understand longer documents and more complex relationships.

Phase 3 – Decay (100B tokens): We finish with premium data sources including scientific papers, textbooks, and curated content while gradually reducing the learning rate.
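
For quick reference, the three phases can be summarized in code (an illustrative summary only, not the actual training configuration):

ettin_training_phases = [
    {"phase": "pre-training",      "tokens": "1.7T", "context": "1024 tokens", "data": "diverse high-quality mixture"},
    {"phase": "context extension", "tokens": "250B", "context": "8K tokens",   "data": "higher-quality filtered data"},
    {"phase": "decay",             "tokens": "100B", "context": "8K tokens",   "data": "premium sources (papers, textbooks, curated content)"},
]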



Modern Architecture Components

Our encoder models inherit all of ModernBERT's speed benefits, making them significantly faster than previous generations of encoders.



Data Sources and Quality

Unlike ModernBERT, all our training data is public and reproducible:

Data used to train Ettin models

You can continue to train these models on new data or propose a new recipe to further improve results!



Encoder Results: Beating ModernBERT

Our encoder models outperform ModernBERT across all tasks and model sizes, while using completely open training data. Since we offer a wide variety of sizes, you can now use ModernBERT-style models in smaller sizes (great for on-device use or fast inference), or power up with a 1B-sized encoder that crushes the competition.

Encoder performance comparison showing Ettin models beating ModernBERT



Decoder Results: Beating Llama 3.2 and SmolLM2

Applying the same recipe to decoder models yields equally impressive results, with our models outperforming or matching established baselines such as Llama 3.2 and SmolLM2:

Decoder performance comparison showing Ettin models beating Llama 3.2 and SmolLM2

The gains are particularly strong on knowledge-intensive tasks like SciQ, reflecting the benefits of our high-quality training data mixture. These results demonstrate that our training recipe creates genuinely strong models in both architectural paradigms.



Fair Fight: Encoders vs Decoders on Even Ground

For the first time, we can fairly compare encoder and decoder architectures trained with identical data and recipes. The results reveal fundamental architectural advantages that persist even when all other factors are controlled:

Encoder vs decoder comparison across model sizes and tasks



Architecture-Specific Advantages Persist

The results show clear patterns:

Encoders dominate classification and retrieval: On MNLI classification, even a 150M encoder (89.2) outperforms a 400M decoder (88.2). For retrieval tasks, the gap is smaller but still noticeable – especially when decoders are not trained with MNTP.

Decoders excel at generation: On generative tasks, decoders maintain consistent advantages, with the performance gap actually widening at larger model sizes.

Size doesn't always matter: A 400M encoder beats a 1B decoder on classification tasks, while a 400M decoder beats a 1B encoder on generation tasks.



Cross-Objective Training Falls Short

Due to the lack of recent encoder models, works like LLM2Vec have proposed to continue pre-training decoders with MLM. We can now test the effectiveness of this strategy!

We switched the objective and continued to train our models with the opposite objective for 50B additional tokens. This is what we found:

  • Encoder-from-decoder: Still generally trails native encoders on classification/retrieval
  • Decoder-from-encoder: Significantly worse than native decoders, especially at larger scales. This may be because the encoders were trained with MLM instead of MNTP (masked next token prediction) as proposed by LLM2Vec (and used in our encoder-from-decoder recipe); the difference between the two objectives is sketched below.
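
To make the MLM/MNTP distinction concrete, here is a minimal toy sketch (our own illustration, not the LLM2Vec or Ettin training code): MLM predicts the masked token at the masked position itself, while MNTP predicts the masked token from the position immediately before it, matching a decoder's next-token layout.

import torch

# Toy token ids for a six-token sequence, with position 3 masked.
input_ids = torch.tensor([11, 22, 33, 44, 55, 66])
mask_pos = 3
mask_token_id = 0  # hypothetical [MASK] id
masked_input = input_ids.clone()
masked_input[mask_pos] = mask_token_id

IGNORE = -100  # positions ignored by the cross-entropy loss

# MLM: the label sits at the masked position itself.
mlm_labels = torch.full_like(input_ids, IGNORE)
mlm_labels[mask_pos] = input_ids[mask_pos]

# MNTP: the label is shifted one position left, so the token *before* the
# mask predicts the masked token, as in causal next-token prediction.
mntp_labels = torch.full_like(input_ids, IGNORE)
mntp_labels[mask_pos - 1] = input_ids[mask_pos]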

This suggests that the choice of architecture matters fundamentally, not just the training objective.



Beyond Performance: Understanding Model Behavior

With identical training data, we can study how different objectives affect learning. For example, analyzing gender bias using the WinoGender benchmark reveals:

  • Encoder models prefer gender-neutral pronouns more often (60%+ neutral vs 30%+ for decoders)
  • Both architectures show a male bias, but decoders slightly more so
  • Cross-objective training affects bias patterns in measurable ways

This opens doors for systematic studies of how training objectives influence model behavior beyond just accuracy metrics.



Usage Examples

You can use these models with just a few lines of code!



Encoders

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the [MASK] position and take its logits
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]

    # Return the top-5 candidate tokens for the first mask
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Example: masked token prediction
masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")

For classification and retrieval tasks, use the encoder models. You may want to use a fine-tuned version for these tasks as well.
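
As an illustration, here is a minimal sketch (our own example, not an official recipe) that mean-pools the encoder's hidden states into sentence embeddings for retrieval-style similarity:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m")

def embed(texts):
    # Mean-pool the last hidden states, ignoring padding tokens.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["What is the capital of France?"])
docs = embed(["Paris is the capital of France.", "The Alps are a mountain range."])
print(torch.nn.functional.cosine_similarity(query, docs))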



Decoders

For text generation tasks, use decoder models:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Generate a continuation of the prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
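
Alternatively, you can use the high-level text-generation pipeline for quick experiments (assuming a recent transformers version that supports these models):

from transformers import pipeline

# Quick generation with the pipeline API
generator = pipeline("text-generation", model="jhu-clsp/ettin-decoder-150m")
print(generator("The future of artificial intelligence is", max_new_tokens=30)[0]["generated_text"])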



Fine-tuning Examples



Encoders

Click to see how to finetune this into a dense embedding model using Sentence Transformers
import argparse

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

def main():
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=8e-5)
    parser.add_argument("--model_name", type=str, default="jhu-clsp/ettin-encoder-150m")
    args = parser.parse_args()
    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    
    model = SentenceTransformer(model_name)

    
    dataset = load_dataset(
        "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
        "triplet-hard",
        split="train",
    )
    dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"].select(range(1_250_000))
    eval_dataset = dataset_dict["test"]

    
    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)  

    run_name = f"{model_shortname}-DPR-{lr}"
    
    args = SentenceTransformerTrainingArguments(
        
        output_dir=f"output/{model_shortname}/{run_name}",
        
        num_train_epochs=1,
        per_device_train_batch_size=512,
        per_device_eval_batch_size=512,
        warmup_ratio=0.05,
        fp16=False,  
        bf16=True,  
        batch_sampler=BatchSamplers.NO_DUPLICATES,  
        learning_rate=lr,
        
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,
        logging_steps=500,
        run_name=run_name,  
    )

    
    dev_evaluator = TripletEvaluator(
        anchors=eval_dataset["query"],
        positives=eval_dataset["positive"],
        negatives=eval_dataset["negative"],
        name="msmarco-co-condenser-dev",
    )
    dev_evaluator(model)

    
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=dev_evaluator,
    )
    trainer.train()

    
    dev_evaluator(model)

    
    model.save_pretrained(f"output/{model_shortname}/{run_name}/final")

    
    model.push_to_hub(run_name, private=False)

if __name__ == "__main__":
    main()
Click to see how to finetune this into a multi-vector embedding model with PyLate
from datasets import load_dataset
from pylate import losses, models, utils
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

def main():
    
    train = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="train",
    )

    queries = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="queries",
    )

    documents = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="documents",
    )

    
    train.set_transform(
        utils.KDProcessing(queries=queries, documents=documents).transform,
    )

    
    num_train_epochs = 1
    lr = 8e-5
    batch_size = 16
    accum_steps = 1
    model_name = "jhu-clsp/ettin-encoder-150m"
    model_shortname = model_name.split("/")[-1]

    
    run_name = f"{model_shortname}-colbert-KD-{lr}"
    output_dir = f"output/{model_shortname}/{run_name}"

    
    model = models.ColBERT(model_name_or_path=model_name)

    
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        fp16=False,  
        bf16=True,  
        run_name=run_name,
        logging_steps=10,
        learning_rate=lr,
        gradient_accumulation_steps=accum_steps,
        warmup_ratio=0.05,
    )

    
    train_loss = losses.Distillation(model=model)

    
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train,
        loss=train_loss,
        data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
    )

    
    trainer.train()

    model.save_pretrained(f"{output_dir}/final")

if __name__ == "__main__":
    main()
Click to see how to finetune this into a sparse retrieval model using Sentence Transformers
import logging

from datasets import load_dataset

from sentence_transformers import (
    SparseEncoder,
    SparseEncoderModelCardData,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.training_args import BatchSamplers

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)


model = SparseEncoder(
    "jhu-clsp/ettin-encoder-150m",
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
    )
)


full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]


loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)


run_name = "splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
    
    output_dir=f"models/{run_name}",
    
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  
    bf16=False,  
    batch_sampler=BatchSamplers.NO_DUPLICATES,  
    
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=2,
    logging_steps=200,
    run_name=run_name,  
)


dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)


trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()


dev_evaluator(model)


model.save_pretrained(f"models/{run_name}/final")


model.push_to_hub(run_name)
Click to see how to finetune this into a reranker model using Sentence Transformers
import logging
import traceback

import torch
from datasets import load_dataset

from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import (
    CrossEncoderNanoBEIREvaluator,
    CrossEncoderRerankingEvaluator,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.util import mine_hard_negatives


logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)


def main():
    model_name = "jhu-clsp/ettin-encoder-150m"

    train_batch_size = 64
    num_epochs = 1
    num_hard_negatives = 5  

    
    model = CrossEncoder(
        model_name,
        model_card_data=CrossEncoderModelCardData(
            language="en",
            license="apache-2.0",
        ),
    )
    print("Model max length:", model.max_length)
    print("Model num labels:", model.num_labels)

    
    logging.info("Read the gooaq training dataset")
    full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(100_000))
    dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"]
    eval_dataset = dataset_dict["test"]
    logging.info(train_dataset)
    logging.info(eval_dataset)

    
    embedding_model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
    hard_train_dataset = mine_hard_negatives(
        train_dataset,
        embedding_model,
        num_negatives=num_hard_negatives,  
        margin=0,  
        range_min=0,  
        range_max=100,  
        sampling_strategy="top",  
        batch_size=4096,  
        output_format="labeled-pair",  
        use_faiss=True,
    )
    logging.info(hard_train_dataset)

    
    
    
    

    
    
    loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))

    
    nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq"],
        batch_size=train_batch_size,
    )

    
    
    
    hard_eval_dataset = mine_hard_negatives(
        eval_dataset,
        embedding_model,
        corpus=full_dataset["answer"],  
        num_negatives=30,  
        batch_size=4096,
        include_positives=True,
        output_format="n-tuple",
        use_faiss=True,
    )
    logging.info(hard_eval_dataset)
    reranking_evaluator = CrossEncoderRerankingEvaluator(
        samples=[
            {
                "query": sample["question"],
                "positive": [sample["answer"]],
                "documents": [sample[column_name] for column_name in hard_eval_dataset.column_names[2:]],
            }
            for sample in hard_eval_dataset
        ],
        batch_size=train_batch_size,
        name="gooaq-dev",
        
        
        always_rerank_positives=False,
    )

    
    evaluator = SequentialEvaluator([reranking_evaluator, nano_beir_evaluator])
    evaluator(model)

    
    short_model_name = model_name if "/" not in model_name else model_name.split("/")[-1]
    run_name = f"reranker-{short_model_name}-gooaq-bce"
    args = CrossEncoderTrainingArguments(
        
        output_dir=f"models/{run_name}",
        
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,  
        bf16=True,  
        dataloader_num_workers=4,
        load_best_model_at_end=True,
        metric_for_best_model="eval_gooaq-dev_ndcg@10",
        
        eval_strategy="steps",
        eval_steps=1000,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=2,
        logging_steps=200,
        logging_first_step=True,
        run_name=run_name,  
        seed=12,
    )

    
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=hard_train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()

    
    evaluator(model)

    
    final_output_dir = f"models/{run_name}/final"
    model.save_pretrained(final_output_dir)

    
    
    try:
        model.push_to_hub(run_name)
    except Exception:
        logging.error(
            f"Error uploading model to the Hugging Face Hub:n{traceback.format_exc()}To upload it manually, you possibly can run "
            f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
            f"and saving it using `model.push_to_hub('{run_name}')`."
        )


if __name__ == "__main__":
    main()



Decoders

Click to expand decoder training code
# Full fine-tuning
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --output_dir ettin-decoder-17m \
    --push_to_hub

# Parameter-efficient fine-tuning with LoRA
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-4 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --output_dir ettin-decoder-17m \
    --push_to_hub

with sft.py:

import argparse

from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES

from trl import (
    ModelConfig,
    ScriptArguments,
    SFTConfig,
    SFTTrainer,
    TrlParser,
    clone_chat_template,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)


def main(script_args, training_args, model_args):
    
    
    
    quantization_config = get_quantization_config(model_args)
    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        torch_dtype=model_args.torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
    )

    
    config = AutoConfig.from_pretrained(model_args.model_name_or_path)
    valid_image_text_architectures = MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values()

    if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
        from transformers import AutoModelForImageTextToText

        model_kwargs.pop("use_cache", None)  
        model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)

    
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, use_fast=True
    )

    
    if tokenizer.chat_template is None:
        
        model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")

    
    
    
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    
    
    
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        processing_class=tokenizer,
        peft_config=get_peft_config(model_args),
    )

    trainer.train()

    
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)


def make_parser(subparsers: argparse._SubParsersAction = None):
    dataclass_types = (ScriptArguments, SFTConfig, ModelConfig)
    if subparsers is not None:
        parser = subparsers.add_parser("sft", help="Run the SFT training script", dataclass_types=dataclass_types)
    else:
        parser = TrlParser(dataclass_types)
    return parser


if __name__ == "__main__":
    parser = make_parser()
    
    
    
    script_args, training_args, model_args, _ = parser.parse_args_and_config(return_remaining_strings=True)
    main(script_args, training_args, model_args)



Model Family and Links

The complete Ettin suite includes models at six different scales (for both encoders and decoders):

Standard Models:

Research Resources:


