TL;DR – We’re excited to introduce the beta version of the Retrieval Embedding Benchmark (RTEB), a new benchmark designed to reliably evaluate the retrieval accuracy of embedding models for real-world applications. Existing benchmarks struggle to measure true generalization, and RTEB addresses this with a hybrid strategy of open and private datasets. Its goal is straightforward: to create a trustworthy, transparent, and application-focused standard for measuring how models perform on data they haven’t seen before.
The performance of many AI applications, from RAG and agents to recommendation systems, is fundamentally limited by the quality of search and retrieval. As such, accurately measuring the retrieval quality of embedding models is a common pain point for developers. How do you really know how well a model will perform in the wild?
This is where things get tricky. The current standard for evaluation often relies on a model’s “zero-shot” performance on public benchmarks. However, this is, at best, an approximation of a model’s true generalization capabilities. When models are repeatedly evaluated against the same public datasets, a gap emerges between their reported scores and their actual performance on new, unseen data.

To address these challenges, we developed RTEB, a benchmark built to provide a reliable standard for evaluating retrieval models.
Why Existing Benchmarks Fall Short
While the underlying evaluation methodology and metrics (such as NDCG@10) are well-established and robust, the integrity of existing benchmarks is often undermined by the following issues:
The Generalization Gap. The current benchmark ecosystem inadvertently encourages “teaching to the test.” When training data sources overlap with evaluation datasets, a model’s score can become inflated, undermining a benchmark’s integrity. This practice, whether intentional or not, is evident in the training datasets of several models. It creates a feedback loop where models are rewarded for memorizing test data rather than developing robust, generalizable capabilities.
Because of this, models with a lower zero-shot score[1] may perform very well on the benchmark without generalizing to new problems. For this reason, models with slightly lower benchmark performance but a higher zero-shot score are often recommended instead.

Misalignment with Today’s AI Applications. Many benchmarks are poorly aligned with the enterprise use cases that developers are building today. They often rely on academic datasets or on retrieval tasks derived from QA datasets, which, while useful in their own right, were not designed to evaluate retrieval and can fail to capture the distributional biases and complexities encountered in real-world retrieval scenarios. Benchmarks that avoid these issues are often too narrow, focusing on a single domain like code retrieval, making them unsuitable for evaluating general-purpose models.
Introducing RTEB
Today, we’re excited to introduce the Retrieval Embedding Benchmark (RTEB). Its goal is to create a new, reliable, high-quality benchmark that measures the true retrieval accuracy of embedding models.
A Hybrid Strategy for True Generalization
To combat benchmark overfitting, RTEB implements a hybrid strategy using both open and private datasets:
- Open Datasets: The corpus, queries, and relevance labels are fully public. This ensures transparency and allows any user to reproduce the results.
- Private Datasets: These datasets are kept private, and evaluation is handled by the MTEB maintainers to ensure impartiality. This setup provides a transparent, unbiased measure of a model’s ability to generalize to unseen data. For transparency, we provide descriptive statistics, a dataset description, and sample (query, document, relevance) triplets for each private dataset.
This hybrid approach encourages the development of models with broad, robust generalization. A model with a large performance drop between the open and the private datasets would suggest overfitting, providing a clear signal to the community. This is already apparent with some models, which show a notable drop in performance on RTEB’s private datasets.
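To make this concrete, here is a minimal sketch (not the official RTEB tooling) of how such an open-vs-private gap could be computed from per-dataset NDCG@10 scores; the dataset names and score values below are purely illustrative.

```python
from statistics import mean

# Hypothetical per-dataset NDCG@10 scores for a single model (illustrative values only).
open_scores = {"AILACasedocs": 0.62, "FinQA": 0.71, "HumanEval": 0.88}
private_scores = {"_GermanLegal1": 0.48, "_EnglishHealthcare1": 0.66}

open_avg = mean(open_scores.values())
private_avg = mean(private_scores.values())

# A large positive gap between the open and private averages is one signal of
# overfitting to public benchmark data.
gap = open_avg - private_avg
print(f"open avg = {open_avg:.3f}, private avg = {private_avg:.3f}, gap = {gap:+.3f}")
```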
Built for Real-World Domains
RTEB is designed with a particular emphasis on enterprise use cases. Instead of a complex hierarchy, it uses simple groups for clarity. A single dataset can belong to multiple groups (e.g., a German law dataset exists in both the “law” and “German” groups).
- Multilingual in Nature: The benchmark datasets cover 20 languages, from common ones like English or Japanese to rarer languages such as Bengali or Finnish.
- Domain-Specific Focus: The benchmark includes datasets from critical enterprise domains like law, healthcare, code, and finance.
- Efficient Dataset Sizes: Datasets are large enough to be meaningful (at least 1k documents and 50 queries) without being so large that they make evaluation time-consuming and expensive.
- Retrieval-First Metric: The default leaderboard metric is NDCG@10, a gold-standard measure for the quality of ranked search results.
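For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 using the common linear-gain convention; the function name and example relevance labels are ours, not part of RTEB’s tooling.

```python
import math

def ndcg_at_10(ranked_relevances, all_relevances):
    """NDCG@10: DCG of the top-10 retrieved documents divided by the ideal DCG.

    ranked_relevances: graded relevance of the returned documents, in rank order.
    all_relevances:    graded relevance labels of all relevant documents for the query.
    """
    def dcg(rels):
        # Gains are discounted logarithmically by rank (rank 1 -> log2(2), rank 2 -> log2(3), ...).
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))

    ideal = dcg(sorted(all_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Example: a highly relevant document (rel=2) at rank 1, a partially relevant one (rel=1)
# at rank 4, and one relevant document missed entirely.
print(ndcg_at_10([2, 0, 0, 1, 0], [2, 1, 1]))  # ~0.78
```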
A complete list of the datasets can be found below. We plan to regularly update both the open and closed portions with different categories of datasets, and we actively encourage participation from the community; please open an issue on the MTEB repository on GitHub if you would like to suggest other datasets.
RTEB Datasets
Open
| Dataset | Dataset Groups | Open/Closed | Dataset URL | Repurposed from QA | Description and Reason for Inclusion |
|---|---|---|---|---|---|
| AILACasedocs | english, legal | Open | https://huggingface.co/datasets/mteb/AILA_casedocs | No | This dataset comprises approximately 3,000 Supreme Court of India case documents and is designed to evaluate the retrieval of relevant prior cases for given legal situations. It includes 50 queries, each outlining a specific scenario. We include this dataset in the benchmark because the documents are reasonably difficult, the queries are non-synthetic, and the labels are of high quality. |
| AILAStatutes | english, legal | Open | https://huggingface.co/datasets/mteb/AILA_statutes | No | The dataset comprises descriptions of 197 Supreme Court of India statutes, designed to facilitate the retrieval of relevant prior statutes for given legal situations. It includes 50 queries, each outlining a specific scenario. We include this dataset in the benchmark because the documents are reasonably difficult, the queries are non-synthetic, and the labels are of high quality. |
| LegalSummarization | english, legal | Open | https://huggingface.co/datasets/mteb/legal_summarization | No | The dataset comprises 446 pairs of legal text excerpts and their corresponding plain-English summaries, sourced from reputable websites dedicated to clarifying legal documents. The summaries have been manually reviewed for quality, ensuring that the data is clean and suitable for evaluating legal retrieval. |
| LegalQuAD | german, legal | Open | https://huggingface.co/datasets/mteb/LegalQuAD | No | The corpus consists of 200 real-world legal documents and the query set consists of 200 questions pertaining to legal documents. |
| FinanceBench | english, finance | Open | https://huggingface.co/datasets/virattt/financebench | Yes | The FinanceBench dataset is derived from the PatronusAI/financebench-test dataset, containing only the PASS examples processed into a clean format for question-answering tasks in the financial domain. FinanceBench-rtl has been repurposed for retrieval. |
| HC3Finance | english, finance | Open | https://huggingface.co/datasets/Hello-SimpleAI/HC3 | No | The HC3 dataset comprises tens of thousands of comparison responses from both human experts and ChatGPT across various domains, including open-domain, financial, medical, legal, and psychological areas. The data collection process involved sourcing publicly available question-answering datasets and wiki texts, ensuring that the human answers were either expert-provided or high-quality user responses, thereby minimizing mislabeling and enhancing the dataset’s reliability. |
| FinQA | english, finance | Open | https://huggingface.co/datasets/ibm/finqa | Yes | FinQA is a large-scale dataset with 2.8k financial reports for 8k Q&A pairs to test numerical reasoning with structured and unstructured evidence. |
| HumanEval | code | Open | https://huggingface.co/datasets/openai/openai_humaneval | Yes | The HumanEval dataset released by OpenAI includes 164 programming problems with a handwritten function signature, docstring, body, and several unit tests for each problem. The dataset was handcrafted by engineers and researchers at OpenAI. |
| MBPP | code | Open | https://huggingface.co/datasets/google-research-datasets/mbpp | Yes | The MBPP dataset consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution, and three automated test cases. As described in the paper, a subset of the data has been hand-verified by the dataset authors to ensure quality. |
| MIRACLHardNegatives | | Open | https://huggingface.co/datasets/mteb/miracl-hard-negatives | No | MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages. The hard-negative version has been created by pooling the 250 top documents per query from BM25, e5-multilingual-large, and e5-mistral-instruct. |
| APPS | code, english | Open | https://huggingface.co/datasets/codeparrot/apps | Yes | APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. To create the APPS dataset, the authors manually curated problems from open-access sites where programmers share problems with each other, including Codewars, AtCoder, Kattis, and Codeforces. |
| DS1000 | code, english | Open | https://huggingface.co/datasets/xlangai/DS-1000 | Yes | DS-1000 is a code generation benchmark with 1,000 data science problems spanning seven Python libraries, such as NumPy and Pandas. It employs multi-criteria evaluation metrics, including functional correctness and surface-form constraints, resulting in a high-quality dataset with only 1.8% incorrect solutions among accepted Codex-002 predictions. |
| WikiSQL | code, english | Open | https://huggingface.co/datasets/Salesforce/wikisql | Yes | WikiSQL is a dataset comprising 80,654 hand-annotated examples of natural language questions and corresponding SQL queries across 24,241 tables from Wikipedia. |
| ChatDoctor_HealthCareMagic | english, healthcare | Open | https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k | No | The ChatDoctor-HealthCareMagic-100k dataset comprises 112,000 real-world medical question-and-answer pairs, providing a substantial and diverse collection of authentic medical dialogues. There is a slight risk with this dataset since there are grammatical inconsistencies in many of the questions and answers, but this can potentially help separate strong healthcare retrieval models from weak ones. |
| HC3 Medicine | english, healthcare | Open | https://huggingface.co/datasets/Hello-SimpleAI/HC3 | No | The HC3 dataset comprises tens of thousands of comparison responses from both human experts and ChatGPT across various domains, including open-domain, financial, medical, legal, and psychological areas. The data collection process involved sourcing publicly available question-answering datasets and wiki texts, ensuring that the human answers were either expert-provided or high-quality user responses, thereby minimizing mislabeling and enhancing the dataset’s reliability. |
| HC3 French OOD | french, healthcare | Open | https://huggingface.co/datasets/almanach/hc3_french_ood | No | The HC3 dataset comprises tens of thousands of comparison responses from both human experts and ChatGPT across various domains, including open-domain, financial, medical, legal, and psychological areas. The data collection process involved sourcing publicly available question-answering datasets and wiki texts, ensuring that the human answers were either expert-provided or high-quality user responses, thereby minimizing mislabeling and enhancing the dataset’s reliability. |
| JaQuAD | japanese | Open | https://huggingface.co/datasets/SkelterLabsInc/JaQuAD | Yes | The JaQuAD dataset comprises 39,696 human-annotated question-answer pairs based on Japanese Wikipedia articles, with 88.7% of the contexts sourced from curated high-quality articles. |
| Cure | english, healthcare | Open | https://huggingface.co/datasets/clinia/CUREv1 | No | |
| TripClick | english, healthcare | Open | https://huggingface.co/datasets/irds/tripclick | No | |
| FreshStack | english | Open | https://huggingface.co/papers/2504.13128 | No | |
Closed
| Dataset | Dataset Groups | Open/Closed | Dataset URL | Comments | Repurposed from QA | Description and Reason for Inclusion |
|---|---|---|---|---|---|---|
| _GermanLegal1 | german, legal | Closed | | | Yes | This dataset is derived from real-world judicial decisions and employs a combination of legal citation matching and BM25 similarity. The BM25 baseline poses a slight risk because it biases the data outside of citation matching. A subset of the dataset was manually verified to ensure correctness and quality. |
| _JapaneseLegal1 | japanese, legal | Closed | | | No | This dataset comprises 8.75K deduplicated law records retrieved from the official Japanese government website e-Gov, ensuring authoritative and accurate content. Record titles are used as queries, while record bodies are used as documents. |
| _FrenchLegal1 | french, legal | Closed | | | No | This dataset comprises case laws from the French court “Conseil d’Etat,” systematically extracted from the OPENDATA/JADE repository, focusing on tax-related cases. Queries are the title of each document, ensuring that the labels are clean. |
| _EnglishFinance1 | english, finance | Closed | | | Yes | This dataset has been repurposed for retrieval from TAT-QA, a large-scale QA dataset combining tabular and textual content. |
| _EnglishFinance4 | english, finance | Closed | | | No | This dataset is a combination of Stanford’s Alpaca and FiQA, with another 1.3k pairs custom-generated using GPT-3.5, and then further cleaned to ensure that the data quality is high. |
| _EnglishFinance2 | english, finance | Closed | | | Yes | This is a finance-domain dataset that consists of questions for each conversation turn based on a simulated conversation flow. The curation is done by expert annotators, ensuring reasonably high data quality. The questions are repurposed as queries, while the conversation block is repurposed as documents for retrieval. |
| _EnglishFinance3 | english, finance | Closed | | | Yes | This dataset is a collection of question-answer pairs curated to address various aspects of personal finance. |
| _Code1 | code | Closed | | | No | We extracted functions from GitHub repos. With syntactic parsing, docstrings and function signatures are obtained from the functions. Only functions with docstrings are kept. Docstrings are used as queries, with the function signature (which includes the function name and argument names) removed to make the task harder. Each language is a subset with a separate corpus. |
| _JapaneseCode1 | code, japanese | Closed | | | No | This is a subset of the CoNaLa challenge with Japanese questions. |
| _EnglishHealthcare1 | english, healthcare | Closed | | | Yes | This dataset comprises 2,019 question-answer pairs annotated by 15 experts, each holding at least a Master’s degree in biomedical sciences. A medical doctor led the annotation team, verifying each question-answer pair to ensure data quality. |
| _GermanHealthcare1 | german, healthcare | Closed | | | No | This dataset comprises 465 German-language medical dialogues between patients and healthcare assistants, each entry containing detailed patient descriptions and corresponding professional responses. We have manually verified a subset of the dataset for accuracy and data quality. |
| _German1 | german | Closed | | | No | This dataset is a dialogue summarization dataset derived from multiple public corpora, which have been cleaned and preprocessed into a unified format. Each dialogue has been manually summarized and labeled with topics by annotators, ensuring high-quality and clean data. Dialogue summaries are used as queries, while full dialogues are used as documents. |
| _French1 | french | Closed | | | Yes | This dataset comprises over 4,118 French trivia question-answer pairs, each accompanied by relevant Wikipedia context. We have manually verified a subset of the dataset for accuracy and data quality. |
Launching RTEB: A Community Effort
RTEB is launching today in beta. We believe building a robust benchmark is a community effort, and we plan to evolve RTEB based on feedback from developers and researchers alike. We encourage you to share your thoughts, suggest new datasets, find issues in existing datasets, and help us build a more reliable standard for everyone. Please feel free to join the discussion or open an issue in the MTEB repository on GitHub.
Limitations and Future Work
To highlight areas for improvement, we want to be transparent about RTEB’s current limitations and our plans for the future.
- Benchmark Scope: RTEB is focused on realistic, retrieval-first use cases. Highly difficult synthetic datasets are not a current goal but may be added in the future.
- Modality: The benchmark currently evaluates text-only retrieval. We plan to include text-image and other multimodal retrieval tasks in future releases.
- Language Coverage: We’re actively working to expand our language coverage, particularly for major languages like Chinese and Arabic, as well as more low-resource languages. If you know of high-quality datasets that fit these criteria, please let us know.
- Repurposing of QA datasets: About 50% of the current retrieval datasets are repurposed from QA datasets, which can lead to issues such as a strong lexical overlap between the query and the context, favoring models that rely on keyword matching over true semantic understanding.
- Private datasets: To test for generalization, we use private datasets that are only accessible to MTEB maintainers. To maintain fairness, all maintainers commit to not publishing models trained on these datasets and to only testing on these private datasets through public channels, ensuring no company or individual receives an unfair advantage.
Our goal is for RTEB to become a community-trusted standard for retrieval evaluation.
The RTEB leaderboard is available today on Hugging Face as part of the new Retrieval section of the MTEB leaderboard. We invite you to check it out, evaluate your models, and join us in building a better, more reliable benchmark for the entire AI community.
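As a starting point, here is a minimal sketch of evaluating a model on one of RTEB’s open datasets with the mteb Python package; the model and task names below are just examples, and the exact RTEB benchmark identifier should be taken from the leaderboard or the MTEB documentation.

```python
import mteb

# Any embedding model known to mteb (or a custom SentenceTransformer-style encoder) works here.
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on one of the open datasets listed above (task name as registered in mteb).
tasks = mteb.get_tasks(tasks=["AILACasedocs"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # NDCG@10 is reported per task
```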
