Sentence Transformers is a Python library for using and training embedding and reranker models for a wide range of applications, such as retrieval augmented generation, semantic search, semantic textual similarity, paraphrase mining, and more. The last few major versions have introduced significant improvements to training:
- v3.0: (improved) Sentence Transformer (Dense Embedding) model training
- v4.0: (improved) Cross Encoder (Reranker) model training
- v5.0: (new) Sparse Embedding model training
In this blogpost, I'll show you how to use it to finetune a sparse encoder/embedding model and explain why you might want to do so. This results in sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq, a cheap model that works especially well in hybrid search or retrieve-and-rerank scenarios.
Finetuning sparse embedding models involves several components: the model, datasets, loss functions, training arguments, evaluators, and the trainer class. I'll take a look at each of these components, accompanied by practical examples of how they can be used to finetune strong sparse embedding models.
In addition to training your own models, you can choose from a wide range of pretrained sparse encoders available on the Hugging Face Hub. To help navigate this growing space, we've curated a SPLADE Models collection highlighting some of the most relevant models.
We list the most prominent ones along with their benchmark results in the Pretrained Models section of the documentation.
Table of Contents
What are Sparse Embedding models?
The broader term "embedding models" refers to models that convert some input, usually text, into a vector representation (embedding) that captures the semantic meaning of the input. Unlike the raw inputs, you can perform mathematical operations on these embeddings, resulting in similarity scores that can be used for various tasks, such as search, clustering, or classification.
With dense embedding models, i.e. the common kind, the embeddings are typically low-dimensional vectors (e.g., 384, 768, or 1024 dimensions) where most values are non-zero. Sparse embedding models, on the other hand, produce high-dimensional vectors (e.g., 30,000+ dimensions) where most values are zero. Usually, each active dimension (i.e. a dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability.
Let's take a look at naver/splade-v3, a state-of-the-art sparse embedding model, as an example:
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-v3")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Compute sparse embeddings; the result is a (3, 30522) tensor,
# one dimension per token in the model's vocabulary
embeddings = model.encode(sentences)
print(embeddings.shape)

# Compute pairwise similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)

# Decode the top 10 most impactful tokens per embedding for interpretability
decoded = model.decode(embeddings, top_k=10)
for decoded_tokens, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded_tokens}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]
Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]
Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.
We can even determine the intersection or overlap between embeddings, which is very useful for understanding why two texts are deemed similar or dissimilar:
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]
Query and Document Expansion
A key component of neural sparse embedding models is query/document expansion. Unlike traditional lexical methods like BM25, which only match exact tokens, neural sparse models automatically expand the original text with semantically related terms:
- Traditional, Lexical (e.g. BM25): Only matches on exact tokens in the text
- Neural Sparse Models: Automatically expand with related terms
For example, in the code output above, the sentence "The weather is lovely today." is expanded to include terms like "beautiful", "cool", "pretty", and "nice", which weren't in the original text. Similarly, "It's so sunny outside!" is expanded to include "weather", "summer", and "sun".
This expansion allows neural sparse models to match semantically related content or synonyms even without exact token matches, handle misspellings, and overcome vocabulary mismatch problems. This is why neural sparse models like SPLADE often outperform traditional lexical search methods while maintaining the efficiency advantages of sparse representations.
However, expansion has its risks. For example, query expansion for "What's the weather on Tuesday?" will likely also expand to "monday", "wednesday", etc., which may not be desired.
Why Use Sparse Embedding Models?
In short, neural sparse embedding models fall in a valuable niche between traditional lexical methods like BM25 and dense embedding models like Sentence Transformers. They have the following advantages:
- Hybrid potential: Combine very effectively with dense models, which may struggle with searches where lexical matches are important
- Interpretability: You can see exactly which tokens contribute to a match
- Performance: Competitive with or better than dense models in many retrieval tasks
Throughout this blogpost, I’ll use “sparse embedding model” and “sparse encoder model” interchangeably.
Why Finetune?
The vast majority of (neural) sparse embedding models employ the aforementioned query/document expansion, so that you can match texts with nearly identical meaning even if they don't share any words. In short, the model has to recognize synonyms so those tokens can be placed in the final embedding.
Most out-of-the-box sparse embedding models will easily recognize that "supermarket", "food", and "market" are useful expansions of a text containing "grocery", but, for example:
- “The patient complained of severe cephalalgia.”
expands to:
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'criticism', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
whereas we would like it to expand to "headache", the common word for "cephalalgia". This issue extends to many domains, e.g. models not recognizing that "Java" is a programming language, that "Audi" makes cars, or that "NVIDIA" is a company that makes graphics cards.
Through finetuning, the model can learn to focus exclusively on the domain and/or language that matters to you.
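If you want to check whether an off-the-shelf model handles your domain well enough before committing to finetuning, you can inspect its expansions directly. Here's a minimal sketch, assuming the naver/splade-cocondenser-ensembledistil checkpoint used elsewhere in this post (your expansions will differ per model):
from sentence_transformers import SparseEncoder

# Encode a domain-specific sentence and inspect which tokens the model activates
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
embeddings = model.encode(["The patient complained of severe cephalalgia."])
# decode returns (token, weight) pairs for the strongest dimensions of each embedding
print(model.decode(embeddings, top_k=20)[0])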
Training Components
Training Sentence Transformer models involves the following components:
- Model: The model to train or finetune, which can be a pre-trained Sparse Encoder model or a base model.
- Dataset: The data used for training and evaluation.
- Loss Function: A function that quantifies the model’s performance and guides the optimization process.
- Training Arguments (optional): Parameters that influence training performance and tracking/debugging.
- Evaluator (optional): A tool for evaluating the model before, during, or after training.
- Trainer: Brings together the model, dataset, loss function, and other components for training.
Now, let’s dive into each of those components in additional detail.
Model
Sparse Encoder models consist of a sequence of Modules, Sparse Encoder specific Modules, or Custom Modules, allowing for a lot of flexibility. If you want to further finetune a Sparse Encoder model (e.g. it has a modules.json file), then you don't have to worry about which modules are used:
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
But if instead you want to train from another checkpoint, or from scratch, then these are the most common architectures you can use:
Splade
Splade models use an MLMTransformer followed by a SpladePooling module. The former loads a pretrained Masked Language Modeling transformer model (e.g. BERT, RoBERTa, DistilBERT, ModernBERT, etc.) and the latter pools the output of the MLMHead to produce a single sparse embedding the size of the vocabulary.
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling
mlm_transformer = MLMTransformer("google-bert/bert-base-uncased")
splade_pooling = SpladePooling(pooling_strategy="max")
model = SparseEncoder(modules=[mlm_transformer, splade_pooling])
This architecture is the default when you provide a fill-mask model architecture to SparseEncoder, so it's easiest to use this shortcut:
from sentence_transformers import SparseEncoder
model = SparseEncoder("google-bert/bert-base-uncased")
Inference-free Splade
Inference-free Splade uses a Router module with different modules for queries and documents. Usually for this type of architecture, the documents part is a traditional Splade architecture (an MLMTransformer followed by a SpladePooling module) and the query part is a SparseStaticEmbedding module, which just returns a pre-computed score for every token in the query.
from sentence_transformers import SparseEncoder
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
doc_encoder = MLMTransformer("google-bert/bert-base-uncased")
router = Router.for_query_document(
query_modules=[SparseStaticEmbedding(tokenizer=doc_encoder.tokenizer, frozen=False)],
document_modules=[doc_encoder, SpladePooling("max")],
)
model = SparseEncoder(modules=[router], similarity_fn_name="dot")
This architecture allows for fast query-time processing using the lightweight SparseStaticEmbedding approach, which can be trained and viewed as a set of linear weights, while documents are processed with the full MLM transformer and SpladePooling.
Inference-free Splade is particularly useful for search applications where query latency is critical, as it shifts the computational complexity to the document indexing phase, which can be done offline.
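As a quick usage sketch building on the model initialized above, queries and documents are routed to their respective branches via the dedicated encoding methods (note that the untrained model's scores are not meaningful yet):
# Queries go through the lightweight SparseStaticEmbedding branch,
# documents through the full MLMTransformer + SpladePooling branch
query_embeddings = model.encode_query(["what is the capital of france?"])
document_embeddings = model.encode_document([
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower is located in Paris.",
])
print(model.similarity(query_embeddings, document_embeddings))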
When training models with the Router module, you must use the router_mapping argument in the SparseEncoderTrainingArguments to map the training dataset columns to the correct route ("query" or "document"). For example, if your dataset(s) have ["question", "answer"] columns, then you can use the following mapping:

args = SparseEncoderTrainingArguments(
    ...,
    router_mapping={
        "question": "query",
        "answer": "document",
    }
)

Additionally, it is recommended to use a much higher learning rate for the SparseStaticEmbedding module than for the rest of the model. For this, you should use the learning_rate_mapping argument in the SparseEncoderTrainingArguments to map parameter patterns to their learning rates. For example, if you want to use a learning rate of 1e-3 for the SparseStaticEmbedding module and 2e-5 for the rest of the model, you can do this:

args = SparseEncoderTrainingArguments(
    ...,
    learning_rate=2e-5,
    learning_rate_mapping={
        r"SparseStaticEmbedding.*": 1e-3,
    }
)
Contrastive Sparse Representation (CSR)
Contrastive Sparse Representation (CSR) models, introduced in Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation, apply a SparseAutoEncoder module on top of a dense Sentence Transformer model, which usually consists of a Transformer followed by a Pooling module. You can initialize one from scratch like so:
from sentence_transformers import models, SparseEncoder
from sentence_transformers.sparse_encoder.models import SparseAutoEncoder
transformer = models.Transformer("google-bert/bert-base-uncased")
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
sparse_auto_encoder = SparseAutoEncoder(
input_dim=transformer.get_word_embedding_dimension(),
hidden_dim=4 * transformer.get_word_embedding_dimension(),
k=256,
k_aux=512,
)
model = SparseEncoder(modules=[transformer, pooling, sparse_auto_encoder])
Or, if your base model is 1) a dense Sentence Transformer model or 2) a non-MLM Transformer model (fill-mask models are loaded as Splade models by default), then this shortcut will automatically initialize the CSR model for you:
from sentence_transformers import SparseEncoder
model = SparseEncoder("mixedbread-ai/mxbai-embed-large-v1")
Unlike (Inference-free) Splade models, sparse embeddings produced by CSR models don't have the same size as the vocabulary of the base model. This means you can't directly interpret which words are activated in your embedding like you can with Splade models, where each dimension corresponds to a specific token in the vocabulary.
Beyond that, CSR models are most effective on dense encoder models that use high-dimensional representations (e.g. 1024-4096 dimensions).
Architecture Picker Guide
If you happen to’re unsure which architecture to make use of, here’s a fast guide:
- Do you need to sparsify an existing Dense Embedding model? If yes, use CSR.
- Do you wish your query inference to be instantaneous at the associated fee of slight performance? If yes, use Inference-free SPLADE.
- Otherwise, use SPLADE.
Dataset
The SparseEncoderTrainer uses datasets.Dataset or datasets.DatasetDict instances for training and evaluation. You can load data from the Hugging Face Datasets Hub or use local data in various formats such as CSV, JSON, Parquet, Arrow, or SQL.
Note: Many public datasets that work out of the box with Sentence Transformers have been tagged with sentence-transformers on the Hugging Face Hub, so you can easily find them on https://huggingface.co/datasets?other=sentence-transformers. Consider browsing through these to find ready-to-go datasets that may be useful for your tasks, domains, or languages.
Data on the Hugging Face Hub
You can use the load_dataset function to load data from datasets on the Hugging Face Hub:
from datasets import load_dataset
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Some datasets, like nthakur/swim-ir-monolingual, have multiple subsets with different data formats. You need to specify the subset name along with the dataset name, e.g. dataset = load_dataset("nthakur/swim-ir-monolingual", "de", split="train").
Local Data (CSV, JSON, Parquet, Arrow, SQL)
You can also use load_dataset to load local data in certain file formats:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
dataset = load_dataset("json", data_files="my_file.json")
Local Data that requires pre-processing
You can use datasets.Dataset.from_dict if your local data requires pre-processing. This allows you to initialize your dataset with a dictionary of lists:
from datasets import Dataset
# Load or generate your data here, e.g. by reading and cleaning local files,
# and append the query and document texts to these lists
queries = []
documents = []
dataset = Dataset.from_dict({
"query": queries,
"document": documents,
})
Each key in the dictionary becomes a column in the resulting dataset.
Dataset Format
It’s crucial to make sure that your dataset format matches your chosen loss function. This involves checking two things:
- In case your loss function requires a Label (as indicated within the Loss Overview table), your dataset should have a column named “label” or “rating”.
- All columns aside from “label” or “rating” are considered Inputs (as indicated within the Loss Overview table). The variety of these columns must match the variety of valid inputs on your chosen loss function. The names of the columns don’t matter, only their order matters.
For instance, in case your loss function accepts (anchor, positive, negative) triplets, then your first, second, and third dataset columns correspond with anchor, positive, and negative, respectively. Which means your first and second column must contain texts that ought to embed closely, and that your first and third column must contain texts that ought to embed far apart. That’s the reason depending in your loss function, your dataset column order matters.
Consider a dataset with columns ["text1", "text2", "label"], where the "label" column incorporates floating point similarity scores. This dataset might be used with SparseCoSENTLoss, SparseAnglELoss, and SparseCosineSimilarityLoss because:
- The dataset has a “label” column, which is required by these loss functions.
- The dataset has 2 non-label columns, matching the number of inputs required by these loss functions.
If the columns in your dataset are not ordered correctly, use Dataset.select_columns to reorder them. Additionally, remove any extraneous columns (e.g., sample_id, metadata, source, type) using Dataset.remove_columns, as they will be treated as inputs otherwise.
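For example, here's a minimal sketch with a hypothetical dataset that has an extraneous column and a column order that doesn't match an (anchor, positive) loss:
from datasets import Dataset

dataset = Dataset.from_dict({
    "sample_id": [0, 1],
    "positive": ["A feline rests on a rug.", "The vehicle is red."],
    "anchor": ["A cat sits on a mat.", "The car is red."],
})
# Keep only the input columns, in the order expected by the loss function;
# this also drops the extraneous "sample_id" column
dataset = dataset.select_columns(["anchor", "positive"])
print(dataset.column_names)  # ['anchor', 'positive']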
Loss Function
Loss functions measure how well a model performs on a given batch of data and guide the optimization process. The choice of loss function depends on your available data and target task. Refer to the Loss Overview for a comprehensive list of options.
To train a SparseEncoder, you need either a SpladeLoss or a CSRLoss, depending on the architecture. These are wrapper losses that add sparsity regularization on top of a main loss function, which must be provided as a parameter. The only loss that can be used independently is SparseMSELoss, as it performs embedding-level distillation, ensuring sparsity by directly copying the teacher's sparse embedding.
Most loss functions can be initialized with just the SparseEncoder that you're training, alongside some optional parameters, e.g.:
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
model = SparseEncoder("distilbert/distilbert-base-uncased")
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=5e-5,
document_regularizer_weight=3e-5,
)
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(train_dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
Documentation
Training Arguments
The SparseEncoderTrainingArguments class allows you to specify parameters that influence training performance and tracking/debugging. While optional, experimenting with these arguments can help improve training efficiency and provide insight into the training process.
In the Sentence Transformers documentation, I've outlined some of the most useful training arguments. I'd recommend reading about them in Training Overview > Training Arguments.
Here's an example of how to initialize SparseEncoderTrainingArguments:
from sentence_transformers import SparseEncoderTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SparseEncoderTrainingArguments(
    # Required parameter:
    output_dir="models/splade-distilbert-base-uncased-nq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if your GPU can't handle FP16
    bf16=False,  # Set to True if your GPU supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Losses using "in-batch negatives" benefit from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="splade-distilbert-base-uncased-nq",  # Used in W&B if wandb is installed
)
Note that eval_strategy was introduced in transformers version 4.41.0. Prior versions should use evaluation_strategy instead.
Evaluator
You can provide the SparseEncoderTrainer with an eval_dataset to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an eval_dataset and an evaluator, one or the other, or neither. They evaluate based on the eval_strategy and eval_steps Training Arguments.
Here are the implemented Evaluators that come with Sentence Transformers for Sparse Encoder models:
Additionally, SequentialEvaluator should be used to combine multiple evaluators into one Evaluator that can be passed to the SparseEncoderTrainer, as shown in the sketch below.
Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face.
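Here's a minimal sketch combining two evaluators into one (the choice of evaluators is purely illustrative):
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator

# Run both evaluators in sequence whenever the trainer evaluates
msmarco_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco"])
nfcorpus_evaluator = SparseNanoBEIREvaluator(dataset_names=["nfcorpus"])
dev_evaluator = SequentialEvaluator([msmarco_evaluator, nfcorpus_evaluator])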
SparseNanoBEIREvaluator
Documentation
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
dev_evaluator = SparseNanoBEIREvaluator()
SparseEmbeddingSimilarityEvaluator with STSb
Documentation
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseEmbeddingSimilarityEvaluator
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
dev_evaluator = SparseEmbeddingSimilarityEvaluator(
sentences1=eval_dataset["sentence1"],
sentences2=eval_dataset["sentence2"],
scores=eval_dataset["score"],
main_similarity=SimilarityFunction.COSINE,
name="sts-dev",
)
SparseTripletEvaluator with AllNLI
Documentation
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseTripletEvaluator
max_samples = 1000
eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]")
dev_evaluator = SparseTripletEvaluator(
anchors=eval_dataset["anchor"],
positives=eval_dataset["positive"],
negatives=eval_dataset["negative"],
main_distance_function=SimilarityFunction.DOT,
name="all-nli-dev",
)
When evaluating frequently during training with a small eval_steps, consider using a tiny eval_dataset to minimize evaluation overhead. If you're concerned about the evaluation set size, a 90-1-9 train-eval-test split can provide a balance, reserving a reasonably sized test set for final evaluations. After training, you can assess your model's performance using trainer.evaluate(test_dataset) for the test loss or initialize a testing evaluator with test_evaluator(model) for detailed test metrics.
If you evaluate after training but before saving the model, your automatically generated model card will still include the test results.
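A minimal sketch of such a 90-1-9 split, reusing the Natural Questions dataset from the loss example above:
from datasets import load_dataset

dataset = load_dataset("sentence-transformers/natural-questions", split="train")
# First split off 90% for training, then split the remaining 10% into 1% eval and 9% test
dataset_dict = dataset.train_test_split(test_size=0.10, seed=12)
train_dataset = dataset_dict["train"]
eval_test_dict = dataset_dict["test"].train_test_split(test_size=0.90, seed=12)
eval_dataset = eval_test_dict["train"]
test_dataset = eval_test_dict["test"]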
When using Distributed Training, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices.
Trainer
The SparseEncoderTrainer is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, evaluator (optional), and we can start training. Let's take a look at a script where all of these components come together:
import logging
from datasets import load_dataset
from sentence_transformers import (
SparseEncoder,
SparseEncoderModelCardData,
SparseEncoderTrainer,
SparseEncoderTrainingArguments,
)
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
from sentence_transformers.sparse_encoder.models import SparseStaticEmbedding, MLMTransformer, SpladePooling
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
mlm_transformer = MLMTransformer("distilbert/distilbert-base-uncased", tokenizer_args={"model_max_length": 512})
splade_pooling = SpladePooling(
pooling_strategy="max", word_embedding_dimension=mlm_transformer.get_sentence_embedding_dimension()
)
router = Router.for_query_document(
query_modules=[SparseStaticEmbedding(tokenizer=mlm_transformer.tokenizer, frozen=False)],
document_modules=[mlm_transformer, splade_pooling],
)
model = SparseEncoder(
modules=[router],
model_card_data=SparseEncoderModelCardData(
language="en",
license="apache-2.0",
model_name="Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions tuples",
),
)
full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
print(train_dataset)
print(train_dataset[0])
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=0,
document_regularizer_weight=3e-3,
)
run_name = "inference-free-splade-distilbert-base-uncased-nq"
args = SparseEncoderTrainingArguments(
output_dir=f"models/{run_name}",
num_train_epochs=1,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
learning_rate=2e-5,
learning_rate_mapping={r"SparseStaticEmbedding.weight": 1e-3},
warmup_ratio=0.1,
fp16=True,
bf16=False,
batch_sampler=BatchSamplers.NO_DUPLICATES,
router_mapping={"query": "query", "answer": "document"},
eval_strategy="steps",
eval_steps=1000,
save_strategy="steps",
save_steps=1000,
save_total_limit=2,
logging_steps=200,
run_name=run_name,
)
dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)
trainer = SparseEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=dev_evaluator,
)
trainer.train()
dev_evaluator(model)
model.save_pretrained(f"models/{run_name}/final")
model.push_to_hub(run_name)
In this instance I’m finetuning from distilbert/distilbert-base-uncased, a base model that will not be yet a Sparse Encoder model. This requires more training data than finetuning an existing Sparse Encoder model, like naver/splade-cocondenser-ensembledistil.
After running this script, the sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq model was uploaded for me. The model scores 0.5241 NDCG@10 on NanoMSMARCO, 0.3299 NDCG@10 on NanoNFCorpus and 0.5357 NDCG@10 NanoNQ, which is a great result for an inference-free distilbert-based model trained on just 100k pairs from the Natural Questions dataset.
The model uses a median of 184 lively dimensions within the sparse embeddings for the documents, in comparison with 7.7 lively dimensions for the queries (i.e. the typical variety of tokens within the query). This corresponds to a sparsity of 99.39% and 99.97%, respectively.
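If you want to verify such statistics on your own data, here's a minimal sketch that computes the average number of active dimensions by hand (the checkpoint is the model trained above; swap in your own):
from sentence_transformers import SparseEncoder

model = SparseEncoder("sparse-encoder/example-inference-free-splade-distilbert-base-uncased-nq")
embeddings = model.encode_document(
    ["Paris is the capital and largest city of France.", "The Eiffel Tower is located in Paris."],
    convert_to_sparse_tensor=True,
)
# Count non-zero dimensions per document and average them
dense = embeddings.to_dense()
active_dims = (dense != 0).sum(dim=1).float().mean().item()
print(f"Average active dimensions: {active_dims:.1f} of {dense.shape[1]}")
print(f"Sparsity: {1 - active_dims / dense.shape[1]:.2%}")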
All of this information is stored in the automatically generated model card, including the base model, language, license, evaluation results, training & evaluation dataset info, hyperparameters, training logs, and more. Without any effort, your uploaded models should contain all the information that your potential users would need to determine whether your model is suitable for them.
Callbacks
The Sentence Transformers trainer supports various transformers.TrainerCallback subclasses, including:
- WandbCallback for logging training metrics to W&B if wandb is installed
- TensorBoardCallback for logging training metrics to TensorBoard if tensorboard is accessible
- CodeCarbonCallback for tracking carbon emissions during training if codecarbon is installed
These are automatically used without you having to specify anything, as long as the required dependency is installed.
Refer to the Transformers Callbacks documentation for more information on these callbacks and how to create your own, as sketched below.
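As a minimal sketch, a custom callback can be passed to the trainer via its callbacks argument (the printing logic is purely illustrative):
from transformers import TrainerCallback

class LossPrinterCallback(TrainerCallback):
    # Print the training loss every time the trainer logs metrics
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            print(f"Step {state.global_step}: loss = {logs['loss']:.4f}")

# Hypothetical usage alongside the other trainer arguments:
# trainer = SparseEncoderTrainer(..., callbacks=[LossPrinterCallback()])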
Multi-Dataset Training
Top-performing models are often trained using multiple datasets at once. The SparseEncoderTrainer simplifies this process by allowing you to train with multiple datasets without converting them to the same format. You can even apply different loss functions to each dataset. Here are the steps for multi-dataset training, followed by a short sketch:
- Use a dictionary of datasets.Dataset instances (or a datasets.DatasetDict) as the train_dataset and eval_dataset.
- (Optional) Use a dictionary of loss functions mapping dataset names to losses if you want to use different losses for different datasets.
Each training/evaluation batch will contain samples from only one of the datasets. The order in which batches are sampled from the multiple datasets is determined by the MultiDatasetBatchSamplers enum, which can be passed to the SparseEncoderTrainingArguments via multi_dataset_batch_sampler. The valid options are:
- MultiDatasetBatchSamplers.ROUND_ROBIN: Samples from each dataset in a round-robin fashion until one is exhausted. This strategy may not use all samples from each dataset, but it ensures equal sampling from each dataset.
- MultiDatasetBatchSamplers.PROPORTIONAL (default): Samples from each dataset proportionally to its size. This strategy ensures that all samples from each dataset are used, and larger datasets are sampled from more frequently.
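A minimal sketch of multi-dataset training; the second dataset name and the split sizes here are merely illustrative:
from datasets import load_dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss

model = SparseEncoder("distilbert/distilbert-base-uncased")
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)
# A dictionary of datasets; the keys are arbitrary dataset names
train_dataset = {
    "natural-questions": load_dataset("sentence-transformers/natural-questions", split="train"),
    "gooaq": load_dataset("sentence-transformers/gooaq", split="train[:100000]"),
}
# One loss for all datasets; a dictionary mapping dataset names to losses also works
trainer = SparseEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()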
Evaluation
Let’s evaluate our newly trained inference-free SPLADE model using the NanoMSMARCO dataset, and see the way it compares to dense retrieval approaches. We’ll also explore hybrid retrieval methods that mix sparse and dense vectors, in addition to reranking to further improve search quality.
After running a rather modified version of our hybrid_search.py script, we get the next results for the NanoMSMARCO dataset, using these models:
| Sparse | Dense | Reranker | NDCG@10 | MRR@10 | MAP |
|---|---|---|---|---|---|
| x | | | 52.41 | 43.06 | 44.20 |
| | x | | 55.40 | 47.96 | 49.08 |
| x | x | | 62.22 | 53.02 | 53.44 |
| x | | x | 66.31 | 59.45 | 60.36 |
| | x | x | 66.28 | 59.43 | 60.34 |
| x | x | x | 66.28 | 59.43 | 60.34 |
The Sparse and Dense rankings can be combined using Reciprocal Rank Fusion (RRF), a simple way to combine the results of multiple rankings. If a Reranker is applied, it reranks the results of the prior retrieval step.
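For reference, here's a minimal sketch of RRF over two ranked lists of document ids, using the commonly used constant k=60:
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each ranking is a list of document ids ordered from best to worst
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = [3, 1, 7, 2]
dense_ranking = [1, 4, 3, 9]
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))  # [1, 3, 4, 7, 2, 9]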
The results indicate that, for this dataset, combining Dense and Sparse rankings is very performant, resulting in 12.3% and 18.7% increases over the Dense and Sparse baselines, respectively. In short, combining Sparse and Dense retrieval methods is a very effective way to improve search performance.
Additionally, applying a reranker on any of the rankings improved the performance to roughly 66.3 NDCG@10, showing that Sparse, Dense, and Hybrid (Dense + Sparse) retrieval all found the relevant documents in their top 100, which the reranker then placed in the top 10. So, replacing a Dense -> Reranker pipeline with a Sparse -> Reranker pipeline might improve both latency and costs:
- Sparse embeddings can be cheaper to store, e.g. our model only uses ~180 active dimensions for MS MARCO documents instead of the common 1024 dimensions for dense models.
- Some Sparse Encoders allow for inference-free query processing, enabling near-instant first-stage retrieval, akin to lexical solutions like BM25.
Training Suggestions
Sparse Encoder models have a few quirks that you should be aware of when training them:
- Sparse Encoder models shouldn't be evaluated solely using evaluation scores, but also with the sparsity of the embeddings. After all, low sparsity means that the model's embeddings are expensive to store and slow to retrieve.
- The strongest Sparse Encoder models are trained almost exclusively with distillation from a stronger teacher model (e.g. a CrossEncoder model), instead of training directly from text pairs or triplets. See, for example, the SPLADE-v3 paper, which uses SparseDistillKLDivLoss and SparseMarginMSELoss for distillation. We don't cover this in detail in this blog as it requires more data preparation, but a distillation setup should be seriously considered.
Vector Database Integration
After training sparse embedding models, the next crucial step is deploying them effectively in production environments. Vector databases provide the necessary infrastructure for storing, indexing, and retrieving sparse embeddings at scale. Popular options include Qdrant, OpenSearch, Elasticsearch, and Seismic, among others.
For comprehensive examples covering the vector databases mentioned above, refer to the semantic search with vector database documentation, or see the Qdrant example below.
Qdrant Integration Example
Qdrant offers excellent support for sparse vectors with efficient storage and fast retrieval capabilities. Below is a comprehensive implementation example:
Prerequisites: a Qdrant instance running locally (or otherwise accessible) and the Python Qdrant client installed via pip install qdrant-client.
This example demonstrates how to set up Qdrant for sparse vector search: it shows how to efficiently encode and index documents with sparse encoders, how to formulate search queries with sparse vectors, and it provides an interactive query interface. See below:
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
queries = dataset["query"][:2]
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_index = None
# Interactively search the corpus: encode the queries, search with Qdrant, and print the results
while True:
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")

    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_index=corpus_index,
        corpus_embeddings=corpus_embeddings if corpus_index is None else None,
        top_k=5,
        output_index=True,
    )
    print(f"Search time: {search_time:.6f} seconds")

    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")

    queries = [input("Please enter a question: ")]
Additional Resources
Training Examples
The following pages contain training examples with explanations as well as links to code. We recommend that you browse through these to familiarize yourself with the training loop:
- Model Distillation – Examples to make models smaller, faster and lighter.
- MS MARCO – Example training scripts for training on the MS MARCO information retrieval dataset.
- Retrievers – Example training scripts for training on generic information retrieval datasets.
- Natural Language Inference – Natural Language Inference (NLI) data can be quite helpful to pre-train and finetune models to create meaningful sparse embeddings.
- Quora Duplicate Questions – Quora Duplicate Questions is a large corpus with duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
- STS – The most basic way to train models is using Semantic Textual Similarity (STS) data. Here, we use sentence pairs and a score indicating their semantic similarity.
Documentation
Additionally, the following pages may be useful to learn more about Sentence Transformers:
And lastly, here are some advanced pages that might interest you:
