A series of articles on building an accurate Large Language Model for neural search from scratch. We'll start with BERT and sentence-transformers, go through semantic search benchmarks like BEIR, look at modern models like SGPT and E5, and finish by building our own toy embedding model.
The basic design of a semantic search system, as pitched by most vector search vendors, has two simple (that's irony) steps:
- Compute embeddings for your documents and queries. Somewhere. Somehow. Figure it out yourself.
- Upload them to a vector search engine and enjoy better semantic search.
Choosing the model, however, is often considered out of scope by most early adopters. So everyone just takes sentence-transformers/all-MiniLM-L6-v2 and hopes for the best.
But this approach has more open questions than answers:
- Is there a difference between embedding models? Are paid models from OpenAI and Cohere better?
- How do they handle multiple languages? Is there a benefit in large 1B+ models?
- Dense retrieval with embeddings is only one of many semantic search methods. Is it better than new-age sparse approaches like SPLADEv2 and ELSER?
In this series of five articles, we'll focus on selecting and building the best semantic search embedding model. The current plan:
- (the one you're reading now): baseline embeddings like BERT, MiniLM, DeBERTaV3, and the GPT-* family. How can you evaluate embedding quality with the BEIR benchmark? Current winners of BEIR and their pros and cons.
- Why MiniLM and not BERT? The training process, the 1B sentence-pairs dataset, and how such a tiny and ancient model can be so good on the BEIR benchmark.
- Putting all the recent LLM advancements into a single semantic search model: dataset denoising, asymmetric embeddings, and additional IR fine-tuning. Can it be improved further?
- Combining all the strong points of MiniLM and E5 into a single model with some extra secret sauce. Denoising and training data preparation. Can we fit and train it on a single RTX 4090?
- E5 is fine-tuned on MS MARCO+SNLI. Can we then fine-tune it on the complete BEIR set? Can we make it multilingual?
The original Transformer architecture can be seen as a black box that transforms input text into output text. But neural networks don't understand text per se; they only speak numbers, and all the internal transformations are numeric.
The Transformer consists of two major blocks:
- Encoder: takes the text input in numeric form and produces an embedding, a representation of the semantic meaning of the input.
- Decoder: inverts the process, taking the embedding and predicting the next text token.
So in the middle, between the encoder and decoder, sits an embedding representation of the input. Both the input and the embedding are numerical vectors, but there is still a major difference between them:
- Input vectors are just a sequence of token identifiers from a pre-defined dictionary (for BERT, the vocabulary size is roughly 30K), padded to a fixed length.
- The embedding vector is an internal representation of the input. It is how the network sees your input, as the sketch below shows.
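As a small illustration of that difference, here is a minimal sketch, assuming the Hugging Face transformers package and bert-base-uncased; the mean pooling at the end is just one common way to reduce per-token vectors to a single text embedding:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Input vector: a sequence of integer token ids from BERT's WordPiece vocabulary.
batch = tokenizer("what is a transformer?", return_tensors="pt")
print(batch["input_ids"])  # e.g. [[101, ..., 102]] where 101 = [CLS], 102 = [SEP]

# Embedding: the model's internal numeric representation of the input.
with torch.no_grad():
    output = model(**batch)
token_embeddings = output.last_hidden_state    # shape [1, seq_len, 768]
text_embedding = token_embeddings.mean(dim=1)  # mean pooling -> shape [1, 768]
print(text_embedding.shape)
```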
In a couple of years, a vibrant transformer-based family of different text processing models emerged, with two major separate branches:
- BERT-like models, using only the encoder part of the transformer. Good at classification, summarization, and entity recognition.
- GPT family, decoder only. Good at generative tasks like translation and QA.
In the image above, you can see the split between the BERT and GPT sub-families of models. Traditionally, BERT descendants are the ones most often used in the world of semantic search.
The BERT model looks like a good fit for our problem of semantic search, as the problem can be reduced to a binary classification task: deciding whether a document is relevant or irrelevant to a particular query.
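As a toy sketch of that framing (assuming the sentence-transformers package, using the fine-tuned all-MiniLM-L6-v2 from above instead of raw BERT, and an arbitrary 0.5 cut-off purely for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "how do I reset a forgotten password?"
docs = [
    "Click 'Forgot password' on the login page to receive a reset link.",
    "Our office is closed on public holidays.",
]

# Embed the query and the documents, then score them by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for doc, score in zip(docs, scores.tolist()):
    label = "relevant" if score > 0.5 else "irrelevant"  # arbitrary threshold
    print(f"{score:.2f} {label}: {doc}")
```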
But BERT embeddings were not originally meant for semantic similarity: the model was trained to predict a masked word on a large text corpus. The fact that similar texts get similar embeddings is a nice self-emergent side effect.
But "not originally meant for semantic similarity" is just an opinion. Is there a way to objectively measure how good or bad it is on a reference dataset?
The academic paper "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models" proposed a reference set of benchmarks and datasets for IR methods. It made model quality fights much less enjoyable: there is now a single leaderboard to compare your embedding model with its competitors.
The BEIR benchmark proposes a set of 19 diverse IR datasets and all of the machinery for search quality evaluation.
The original paper also benchmarks a couple of baseline methods on the whole collection of datasets. The main conclusion made in 2021 was that BM25 is still a strong and robust baseline.
Later, BEIR was merged into an even more extensive benchmark suite: MTEB, the Massive Text Embedding Benchmark. Running it is quite easy (if you have 128GB of RAM, a modern GPU, and eight hours of free time):
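A minimal sketch of a run, assuming the mteb and sentence-transformers packages; the two task names are just examples of smaller BEIR retrieval datasets, picked so the run finishes quickly:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; sentence-transformers models do.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on a couple of smaller BEIR retrieval tasks instead of the full suite.
evaluation = MTEB(tasks=["NFCorpus", "SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```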
Let's return to the thesis "raw BERT embeddings aren't for semantic search." If we run raw BERT side-by-side with the top sentence-transformers models over the BEIR/MTEB benchmark, we will see the numbers in the table below.
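Raw BERT does not ship with the encode() interface that MTEB expects from a model, so to include it in such a side-by-side run you have to wrap it yourself. A minimal, hypothetical wrapper with mean pooling (my own illustration, not the code used for the table):

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

class RawBertEncoder:
    """Wraps bert-base-uncased with mean pooling so MTEB can call .encode()."""

    def __init__(self, name: str = "bert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModel.from_pretrained(name).eval()

    def encode(self, sentences, batch_size: int = 32, **kwargs) -> np.ndarray:
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = self.tokenizer(
                list(sentences[start:start + batch_size]),
                padding=True, truncation=True, max_length=512,
                return_tensors="pt",
            )
            with torch.no_grad():
                hidden = self.model(**batch).last_hidden_state  # [B, seq_len, 768]
            # Average only over non-padding tokens.
            mask = batch["attention_mask"].unsqueeze(-1)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            embeddings.append(pooled.numpy())
        return np.concatenate(embeddings)

# Can be passed to MTEB just like a sentence-transformers model:
# MTEB(tasks=["NFCorpus"]).run(RawBertEncoder(), output_folder="results/raw-bert")
```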
We can make two obvious conclusions from this table:
- Original raw BERT embeddings are a poor choice for query and document similarity. Now you can see why.
- BM25 is still a strong baseline: even a large MPNET model tuned for semantic similarity cannot consistently outperform it.
But why are such similar embedding models so different on semantic search tasks?
The current (as of June 2023) leaderboard of the MTEB/BEIR benchmark is full of not-so-well-known names: