Representing unstructured data as embedding vectors and embedding-based retrieval (EBR) using vector search is more popular than ever. What are embeddings anyway? Roy Keyes explains it well in The shortest definition of embeddings?

Embeddings are learned transformations to make data more useful

In academia, this process is often known as representation learning and has been a field of research for a long time. By transforming the information into vectors, a language native to computers, we will make the information more useful. Take BERT for text for example. ().

How , is determined by how we learn this transformation and the way the learned method to represent data generalizes to recent data. That is how we do Machine Learning. Take some data, learn something from it, then apply that learning to recent data. Easy.

So what’s recent? Why the surge in interest? The reply is best model architectures (e.g., Transformer architecture) and self-supervised representation learning. Add a touch of confusion around Large Language Models (LLMs) corresponding to chatGPT to the combo, and here we’re.

About self-supervised learning. Using a clever objective, we will train a model using piles of information without human supervision (labeling). Then, once that is completed, we will fine-tune the model for tasks where the fine-tuning requires less labeled data than if we began from scratch.

Any such learning pipelining known as transfer learning. Learning to snowboard also transfers to skateboarding, windsurfing, browsing, and other fun activities.

To shorten this blog post, allow us to give attention to text models and BERT models specifically. How can we transform data into useful embedding representation using Transformer-based models?

BERT is a deep neural network model with weights, layers, and whatnot, a complexity we hide contained in the box. If we pull down the model from Huggingface, the model weights are assigned by pre-training using a masked language model objective.

We will take some text and tokenize that text into a set vocabulary to acquire a set of numeric ids. A mapping between free text and hard-coded identifiers. The vocabulary size is determined by the language, but for the vanilla BERT model for English, that is around 30K words. Unknown words (out of vocabulary) are assigned UNK and given a specially reserved identifier. All unknown words are assigned to this identifier, and the model cannot differentiate between “foo” and “bar” if each will not be within the vocabulary.

The BERT model can take a maximum of 512 words (input context length limitation), and the network output is 512 vectors with dimensionality N, depending on the style of bert-base model. A vanilla BERT model uses 768 dimensions. For an input of 512 words, we obtain a matrix of 512 x 768 floats, one 768-dimensional vector per input word. Unlike previous NLP model architectures, like Word2vec, each word vector representation on the output is contextualized by the eye mechanism within the Transformer architecture. The vector representation of a single word is determined by all the opposite words within the input.

Now, we have now multiple vectors representing a single text; what can we do if we wish to represent a piece of text, a text passage, or a paragraph of text in a single vector representation? One approach is to decide on a single output vector because the representation and ignore the remaining. One other approach is pooling. For instance, average pooling will average the 512 output vectors right into a single vector representation.

Now we have now an embedding representation of a text chunk, which results in mistake number one.

## Mistake #1: Using pre-trained models without task-specific fine-tuning

Using the direct vector representations from the model which have only been pre-trained is not going to produce a useful embedding representation for any task. Search rating is an example of such a task; see details in How not to make use of BERT for search rating.

Encoding free text queries and documents and expecting that the cosine similarity between the 2 representations can rank the documents by relevance is naive, and the outcomes of that approach provide you with next to random rating results. Your learned snowboard skills don’t transfer to playing golf or swimming.

## Mistake #2 Using fine-tuned single vector embedding models out-of-domain

To acquire a useful embedding representation (higher than random) for search rating, we want to tune the model weights. We will try this by utilizing a special objective when training the model. We will train (update the weights) using labeled examples like relevant and irrelevant documents for a big sample of queries. MS MARCO is a big web search relevance collection with labeled queries and document pairs, which will be used to coach a rating model.

This fine-tuning creates based on BERT and outcompetes traditional keyword search methods with no learnable parameters, corresponding to BM25, by a really large margin on the MS MARCO dataset.

The issue is that once we take a single vector representation model, fine-tuned on MS MARCO labels, it doesn’t beat BM25 in a special domain with barely various kinds of documents and questions.

The BEIR Benchmark is a superb framework for evaluating the standard of models trained on MS Marco and the way well they transfer to different domains and tasks.

We studied the effectiveness of ten different retrieval models and exhibit that in-domain performance cannot predict how well an approach will generalize in a zero-shot setup. Many approaches 9 that outperform BM25 in an in-domain evaluation on MS MARCO, perform poorly on the BEIR datasets.

I’ve written about zero-shot rating and a few solutions here, here, and here. Multi-vector representation model for search, like ColBERT, generalizes significantly better than single-vector representations.

## Mistake 3: Lack of information of vector search tradeoffs

So that you made it here and have useful embedding representations of information. Now, you wish a method to search the vector data using the closest neighbor search, also often known as KNN, and you may deploy your exciting use case to production.

The very first thing it is best to ask yourself is, will we want to introduce an approximate nearest neighbor search (ANNS) as an alternative of an actual nearest neighbor search? As in lots of elements of life, this can be a query of tradeoffs.

On the query serving side. Even not considering the document side processing complexity, just like the need for CRUD, real-time versus batch, etc.

- Latency Service Level Agreement (SLA) — how briskly do we want it to be? 0,001ms, 1ms, 10ms, 100ms, a second? Perhaps 3 seconds is positive?
- Throughput — What can we expect of query throughput, and what’s the max we will anticipate? 1 QPS, possibly 1M QPS, and even billions?
- Accuracy impact we will tolerate for our use case. After we introduce an search, there’s an accuracy loss in comparison with an exhaustive search. Depending on our alternative of algorithm, this may be substantial. We will evaluate this impact by running queries using the precise search, comparing the approximate output, and calculating the overlap between the 2. For instance, using k=10, we will measure the overlap@10, or k=1, measure the overlap@1.

Given the above, it comes right down to production deployment cost; what number of servers do we want, or do we want servers in any respect?

Allow us to expand on the accuracy error tolerance and why that’s use-case dependent. In case you are constructing a picture search service with over a billion photo vectors, you don’t necessarily need perfect recall. There are numerous equally great cat photos, and bringing back the precise best cats as deemed most relevant by the model may not be that essential.

Then again, for those who are constructing a retina image scan app using vector search to find out if the user can access the constructing, you higher have great overlap@1. In academic research on ANN algorithms, there’s a definite differentiation between these extremes, high-recall and low-recall settings.

The precise seek for neighbors will brute-force calculate the gap between the query and all eligible documents, and the returned k documents are the true nearest neighbors. The search will be parallelized, multi-threaded, and in lots of cases, can use optimized HW instructions; vectors are the machine’s language. The search also can efficiently be limited to a subset if we store the vectors in an engine with query engine filtering capabilities.

For instance, brute force searching 1M vectors with 128 dimensions takes about 100ms single-threaded. We will parallelize the search; for instance, by utilizing 4 threads, we will get it right down to 25 ms until memory bandwidth hits. If we page the vector data randomly from the disk, it should be slower but still parallelizable. If we have now 10B vectors, and we don’t have a method to efficiently select a subset of documents that we perform the closest neighbor search over, we have now a value problem. We will still get decent latency by distributing the search over multiple nodes in parallel, as Vespa can do. But renting servers to maintain the latency in check can develop into costly with billions of embeddings. Add high query throughput to the combo, and we have now an actual cost problem.

Taking place the approximate vector search route, we want an algorithm that may index the vector data in order that searches are less expensive than exhaustive searches at the associated fee of resource usage and indexing processing. Here there are also many tradeoffs, like disk usage and memory usage. How well the algorithm will be used with real-time CRUD operations. One source of ANN algorithm understanding is https://github.com/erikbern/ann-benchmarks, where different algorithms and implementations are compared on various vector datasets.

The above graph is for the SIFT dataset, with 1M 128-dimensional vectors. The graph displays recall@10 (same as overlap@10) versus the queries per second. The benchmark is single-threaded, which implies that if the algorithm is at 10² QPS, we have now a latency of 10ms. 10³ QPS means a latency of 1ms, and so forth. These algorithms are pretty rattling fast.

If we deploy these algorithms on a server with multiple CPU cores, we will enjoy much more QPS. 2 cores are expected to present 2x QPS, so long as there aren’t any contention or locking scaling problems. But not all ANN algorithms give us equally good recall. Algorithms which can be up and to the precise give the perfect tradeoff between performance and accuracy, and the lower left quadrant is worse. As seen above, some algorithms struggle with getting past 50% recall.

What isn’t reflected within the graph above is the associated fee of indexing and whether the algorithm can support updates and CRUD operations. Some are batch-oriented, in order that they first need a big sample of the document vectors before they’ll construct the index, while others can construct the index incrementally. Note that ann-benchmark can only use open-source algorithms to breed on the identical runtime. Some industrial and proprietary vector search vendors have unknown recall versus performance tradeoffs.

In case you hated this post, you can shout out to me over at Twitter https://twitter.com/jobergum.

jazz study

Impressed by The nuanced clarity. It’s like you’re explaining quantum physics to a toddler, and they get it.