A series of articles on building an accurate Large Language Model for neural search from scratch. We'll start with BERT and sentence-transformers, go through semantic search benchmarks like BEIR, look at modern models like SGPT and E5, and finish by building our own toy embedding model.
The basic design of a semantic search system, as pitched by most vector search vendors, has two simple (that's irony) steps:
- Compute embeddings for your documents and queries. Somewhere. Somehow. Figure it out yourself.
- Upload them to a vector search engine and enjoy better semantic search.
Choosing the model, however, is often considered out of scope by most early adopters. So everyone just takes sentence-transformers/all-MiniLM-L6-v2 and hopes for the best.
But this approach has more open questions than answers:
- Is there a difference between embedding models? Are paid models from OpenAI and Cohere better?
- How do they handle multiple languages? Is there a benefit in large 1B+ models?
- Dense retrieval with embeddings is only one of many semantic search methods. Is it better than new-age sparse approaches like SPLADEv2 and ELSER?
In this series of five articles, we'll focus on selecting and building the best semantic search embedding model. The current plan:
- (the one you're reading now): baseline embeddings like BERT, MiniLM, DeBERTaV3, and the GPT-* family. How can you evaluate embedding quality with the BEIR benchmark? Current winners of BEIR and their pros and cons.
- Why MiniLM and not BERT? The training process, the 1B sentence-pairs dataset, and how such a tiny and ancient model can be so good on the BEIR benchmark.
- Putting all the recent LLM advancements into a single semantic search model: dataset denoising, asymmetric embeddings, and additional IR fine-tuning. Can it be improved further?
- Combining all the strong points of MiniLM and E5 into a single model with some extra secret sauce. Denoising and training data preparation. Can we fit and train it on a single RTX 4090?
- E5 is fine-tuned on MS MARCO+SNLI. Can we then fine-tune it on the complete BEIR set? Can we make it multilingual?
The original Transformer architecture can be seen as a black box that transforms input text into output text. But neural networks don't understand text per se; they only speak numbers, and all the internal transformations are numeric.
The Transformer consists of two major blocks:
- Encoder: takes the text input in numeric form and produces an embedding, a representation of the semantic meaning of the input.
- Decoder: inverts the process, taking the embedding and predicting the next text token.
So in the middle, between the encoder and decoder, sits an embedding representation of the input. Both the input and the embedding are numerical vectors, but there is still a major difference between them:
- Input vectors are just a sequence of token identifiers from a pre-defined dictionary (for BERT, the vocabulary size is roughly 30K), padded to a fixed length.
- The embedding vector is an internal representation of the input. It is how the network sees your input, as the sketch below shows.
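As a small illustration of that difference, here is a minimal sketch, assuming the Hugging Face transformers package and bert-base-uncased; the mean pooling at the end is just one common way to reduce per-token vectors to a single text embedding:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Input vector: a sequence of integer token ids from BERT's WordPiece vocabulary.
batch = tokenizer("what is a transformer?", return_tensors="pt")
print(batch["input_ids"])  # e.g. [[101, ..., 102]] where 101 = [CLS], 102 = [SEP]

# Embedding: the model's internal numeric representation of the input.
with torch.no_grad():
    output = model(**batch)
token_embeddings = output.last_hidden_state    # shape [1, seq_len, 768]
text_embedding = token_embeddings.mean(dim=1)  # mean pooling -> shape [1, 768]
print(text_embedding.shape)
```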
In a couple of years, a vibrant transformer-based family of different text processing models emerged, with two major separate branches:
- BERT-like models, using only the encoder part of the transformer. Good at classification, summarization, and entity recognition.
- GPT family, decoder only. Good at generative tasks like translation and QA.
In the image above, you can see the split between the BERT and GPT sub-families of models. Traditionally, BERT descendants are the ones most often used in the world of semantic search.
The BERT model looks like a good fit for our problem of semantic search, as the problem can be reduced to a binary classification task: deciding whether a document is relevant or irrelevant to a particular query.
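As a toy sketch of that framing (assuming the sentence-transformers package, using the fine-tuned all-MiniLM-L6-v2 from above instead of raw BERT, and an arbitrary 0.5 cut-off purely for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "how do I reset a forgotten password?"
docs = [
    "Click 'Forgot password' on the login page to receive a reset link.",
    "Our office is closed on public holidays.",
]

# Embed the query and the documents, then score them by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for doc, score in zip(docs, scores.tolist()):
    label = "relevant" if score > 0.5 else "irrelevant"  # arbitrary threshold
    print(f"{score:.2f} {label}: {doc}")
```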
But BERT embeddings were not originally meant for semantic similarity: the model was trained to predict a masked word on a large text corpus. The fact that similar texts get similar embeddings is a nice self-emergent side effect.
But "not originally meant for semantic similarity" is just an opinion. Is there a way to objectively measure how good or bad it is on a reference dataset?
The academic paper "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models" proposed a reference set of benchmarks and datasets for IR methods. It made model quality fights much less enjoyable: there is now a single leaderboard to compare your embedding model with its competitors.
The BEIR benchmark proposes a set of 19 diverse IR datasets and all of the machinery for search quality evaluation.
The original paper also benchmarks a couple of baseline methods on the whole collection of datasets. The main conclusion made in 2021 was that BM25 is still a strong and robust baseline.
Later, BEIR was merged into an even more extensive benchmark suite: MTEB, the Massive Text Embedding Benchmark. Running it is quite easy (if you have 128GB of RAM, a modern GPU, and eight hours of free time):
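A minimal sketch of a run, assuming the mteb and sentence-transformers packages; the two task names are just examples of smaller BEIR retrieval datasets, picked so the run finishes quickly:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; sentence-transformers models do.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on a couple of smaller BEIR retrieval tasks instead of the full suite.
evaluation = MTEB(tasks=["NFCorpus", "SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```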
Let's return to the thesis "raw BERT embeddings aren't for semantic search." If we run raw BERT side-by-side with the top sentence-transformers models over the BEIR/MTEB benchmark, we will see the numbers in the table below.
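Raw BERT does not ship with the encode() interface that MTEB expects from a model, so to include it in such a side-by-side run you have to wrap it yourself. A minimal, hypothetical wrapper with mean pooling (my own illustration, not the code used for the table):

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

class RawBertEncoder:
    """Wraps bert-base-uncased with mean pooling so MTEB can call .encode()."""

    def __init__(self, name: str = "bert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModel.from_pretrained(name).eval()

    def encode(self, sentences, batch_size: int = 32, **kwargs) -> np.ndarray:
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = self.tokenizer(
                list(sentences[start:start + batch_size]),
                padding=True, truncation=True, max_length=512,
                return_tensors="pt",
            )
            with torch.no_grad():
                hidden = self.model(**batch).last_hidden_state  # [B, seq_len, 768]
            # Average only over non-padding tokens.
            mask = batch["attention_mask"].unsqueeze(-1)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            embeddings.append(pooled.numpy())
        return np.concatenate(embeddings)

# Can be passed to MTEB just like a sentence-transformers model:
# MTEB(tasks=["NFCorpus"]).run(RawBertEncoder(), output_folder="results/raw-bert")
```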
We can make two obvious conclusions from this table:
- Original raw BERT embeddings are a poor choice for query and document similarity. Now you can see why.
- BM25 is still a strong baseline: even a large MPNET model tuned for semantic similarity cannot consistently outperform it.
But why are such similar embedding models so different on semantic search tasks?
The current (as of June 2023) leaderboard of the MTEB/BEIR benchmark is full of not-so-well-known names: