In my previous posts, I’ve talked quite a bit about Retrieval-Augmented Generation (RAG). Specifically, I’ve covered the fundamentals of the RAG methodology, in addition to a bunch of relevant concepts, like chunking, embeddings, reranking, and retrieval evaluation.
The standard RAG methodology is so useful because it allows us to search for relevant pieces of text in a large knowledge base based on the meaning of the text rather than exact words. In this way, it lets us apply the power of AI to our custom documents. Ironically, as useful as this similarity search is, it sometimes fails to retrieve pieces of text that are relevant to the user’s prompt. More specifically, when searching in a large knowledge base, specific keywords (such as technical terms or names) may get lost, and relevant chunks may not be retrieved even when the user’s query contains the exact words.
Fortunately, this issue can be easily tackled by utilizing an older keyword-based search technique, like BM25 (Best Matching 25). Then, by combining the results of the similarity search and the BM25 search, we can essentially get the best of both worlds and significantly improve the results of our RAG pipeline.
. . .
In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the document’s relevance to the user’s query not based on the semantic meaning of the document, but rather on the actual words it contains. More specifically, BM25 is a bag-of-words (BoW) model, meaning that it doesn’t take into account the order of the words in a document (from which the semantic meaning emerges), but rather the frequency with which each word appears in the document.
The BM25 score for a given query q containing terms t and a document d can be (not so) easily calculated as follows:
score(d, q) = Σ_{t ∈ q} IDF(t) · ( f(t, d) · (k1 + 1) ) / ( f(t, d) + k1 · (1 − b + b · |d| / avgdl) )

😿
Since this expression can be a bit overwhelming, let’s take a step back and look at it piece by piece.
. . .
Starting simple with TF-IDF
The fundamental concept underlying BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a fundamental information retrieval concept aiming to measure how important a word is to a particular document in a knowledge base. In other words, it measures how many documents of the knowledge base a term appears in, which in turn expresses how specific and informative that term is for a particular document. The rarer a term is in the knowledge base, the more informative it is considered to be for a particular document.
Specifically, for a document d in a knowledge base and a term t, the Term Frequency TF(t,d) can be defined as follows:

TF(t, d) = f(t, d) / |d|

where f(t, d) is the number of times the term t appears in the document d, and |d| is the length of the document.
and the Inverse Document Frequency IDF(t) can be defined as follows:

IDF(t) = log( N / n_t )

where N is the total number of documents in the knowledge base and n_t is the number of documents containing the term t.
Then, the TF-IDF score can be calculated as the product of TF and IDF as follows:

TF-IDF(t, d) = TF(t, d) · IDF(t)
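A minimal sketch of these three definitions in Python (assuming pre-tokenized documents represented as lists of terms, and natural log):

```python
import math

def tf(term, doc_terms):
    # Term Frequency: occurrences of `term` divided by the document length |d|
    return doc_terms.count(term) / len(doc_terms)

def idf(term, all_docs):
    # Inverse Document Frequency: log of (total documents / documents containing the term)
    n_t = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / n_t)

def tf_idf(term, doc_terms, all_docs):
    # TF-IDF is simply the product of the two
    return tf(term, doc_terms) * idf(term, all_docs)

docs = [["time", "travel"], ["time", "drama"]]
print(tf_idf("travel", docs[0], docs))  # (1/2) * log(2) ≈ 0.347
```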
. . .
Let’s do a quick example to get a better grip on TF-IDF. Let’s assume a tiny knowledge base containing three movies with the following descriptions:
- “A sci-fi thriller about time travel and a dangerous adventure across alternate realities.”
- “A romantic drama about two strangers who fall in love during unexpected time travel.”
- “A sci-fi adventure featuring an alien explorer forced to travel across galaxies.”
After removing the stopwords, we can consider the following terms in each document:
- document 1: sci-fi, thriller, time, travel, dangerous, adventure, alternate, realities
- size of document 1, |d1| = 8
- document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel
- size of document 2, |d2| = 9
- document 3: sci-fi, adventure, featuring, alien, explorer, forced, travel, galaxies
- size of document 3, |d3| = 8
- total documents in knowledge base N = 3
We can then calculate f(t,d) and the Term Frequency TF(t,d) for each term in each document. Since every term appears exactly once in its document, f(t,d) = 1 everywhere:

- document 1: TF(t, d1) = 1/8 = 0.125 for every term
- document 2: TF(t, d2) = 1/9 ≈ 0.111 for every term
- document 3: TF(t, d3) = 1/8 = 0.125 for every term
Next, for each term, we also calculate the Document Frequency n_t (the number of documents containing it) and the Inverse Document Frequency (using log base 10):

- travel: n_t = 3, IDF = log(3/3) = 0
- time, sci-fi, adventure: n_t = 2, IDF = log(3/2) ≈ 0.176
- all remaining terms: n_t = 1, IDF = log(3/1) ≈ 0.477
And then finally we calculate the TF-IDF score of each term:

- document 1: travel = 0; time, sci-fi, adventure ≈ 0.022; thriller, dangerous, alternate, realities ≈ 0.060
- document 2: travel = 0; time ≈ 0.020; romantic, drama, two, strangers, fall, love, unexpected ≈ 0.053
- document 3: travel = 0; sci-fi, adventure ≈ 0.022; featuring, alien, explorer, forced, galaxies ≈ 0.060
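These numbers can be reproduced with a short script (a sketch, assuming log base 10 and the stopword-free token lists above):

```python
import math

# The three movie descriptions after stopword removal
docs = {
    "d1": ["sci-fi", "thriller", "time", "travel", "dangerous",
           "adventure", "alternate", "realities"],
    "d2": ["romantic", "drama", "two", "strangers", "fall",
           "love", "unexpected", "time", "travel"],
    "d3": ["sci-fi", "adventure", "featuring", "alien",
           "explorer", "forced", "travel", "galaxies"],
}
N = len(docs)

tf_idf = {}
for name, terms in docs.items():
    for t in set(terms):
        tf = terms.count(t) / len(terms)               # TF(t, d)
        n_t = sum(1 for d in docs.values() if t in d)  # document frequency
        tf_idf[(name, t)] = tf * math.log10(N / n_t)   # TF · IDF

print(round(tf_idf[("d1", "travel")], 3))    # 0.0  – appears in every document
print(round(tf_idf[("d1", "thriller")], 3))  # 0.06 – unique to document 1
```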
So, what can we get from this? Let’s look, for instance, at the TF-IDF scores of document 1. The word ‘travel’ is not informative at all, as it is included in all documents of the knowledge base. On the flip side, words like ‘thriller’ and ‘dangerous’ are very informative, specifically for document 1, since they are only included in it.
In this way, the TF-IDF score provides a simple and straightforward way to identify and quantify the importance of the terms in each document of a knowledge base. To put it differently, the higher the total score of the terms in a document, the rarer the information in this document is compared to the information contained in all the other documents in the knowledge base.
. . .
Understanding the BM25 score
In BM25, we utilize the TF-IDF concept in order to quantify how informative (how rare or important) each document in a knowledge base is. To do this, for the BM25 calculation, we only take into account the terms of each document that are contained in the user’s query, and perform a calculation somewhat similar to TF-IDF.
BM25 uses the TF-IDF concept, but with a few mathematical tweaks in order to address two fundamental weaknesses of TF-IDF.
. . .
The first pain point of TF-IDF is that TF grows linearly with the number of times a term appears in a document d, f(t,d), as a function of the form:

TF(t, d) = f(t, d) / |d|

This means that the more times a term t appears in a document d, the more TF grows, which, as you may imagine, can be problematic for large documents, where a term may appear over and over again without necessarily being correspondingly more important.
A simple way to resolve this is to use a saturation curve instead of a linear function. This means that the output increases with the input but approaches a maximum limit asymptotically, unlike the linear function, where the output increases with the input forever:

[Plot: a straight line growing without bound versus a saturation curve that flattens out as it approaches its asymptote]
Thus, we can rewrite TF as follows, introducing a parameter k1, which allows for control of the frequency scaling. In this way, the parameter k1 introduces diminishing returns. That is, the first occurrence of the term t in a document has a large effect on the TF score, whereas the twentieth appearance only adds a small extra gain.

TF(t, d) = f(t, d) / ( f(t, d) + k1 )
Nonetheless, this would lead to values in the range 0 to 1. We can tweak this a bit more and add a (k1 + 1) factor in the numerator, so that the resulting values of TF are comparable with the initial definition of TF used in TF-IDF.

TF(t, d) = ( f(t, d) · (k1 + 1) ) / ( f(t, d) + k1 )
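To see the diminishing returns numerically, here is a quick sketch of this saturated form (the value k1 = 1.2 is an illustrative assumption):

```python
def saturated_tf(f, k1=1.2):
    # Saturated term frequency: grows with f(t, d) but approaches k1 + 1 asymptotically
    return f * (k1 + 1) / (f + k1)

for f in [1, 2, 5, 20, 100]:
    print(f, round(saturated_tf(f), 3))
# 1 → 1.0, 2 → 1.375, 5 → 1.774, 20 → 2.075, 100 → 2.174 (limit: k1 + 1 = 2.2)
```

The jump from 1 to 2 occurrences adds 0.375, while the jump from 20 to 100 adds less than 0.1.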
. . .
So far, so good, but one critical piece of information still missing from this expression is the size of the document |d|, which was included in the initial calculation of TF. However, before adding the |d| term, we also need to adjust it a little, since this is the second pain point of the initial TF-IDF expression. More specifically, the issue is that a knowledge base is going to contain documents of variable lengths |d|, resulting in the scores of different terms not being comparable. BM25 resolves this by normalizing |d|. That is, instead of |d|, the following expression is used:

1 − b + b · ( |d| / avgdl )
where avgdl is the average document length across the knowledge base. Moreover, b is a parameter in [0, 1] that controls the length normalization, with b = 0 corresponding to no normalization and b = 1 corresponding to full normalization.
So, adding the normalized expression of |d|, we get the fancier version of TF used in BM25:

TF(t, d) = ( f(t, d) · (k1 + 1) ) / ( f(t, d) + k1 · (1 − b + b · |d| / avgdl) )
Typically, the parameter values used are k1 ≈ 1.2 to 2.0 and b ≈ 0.75.
. . .
BM25 also uses a slightly altered expression for the IDF calculation:

IDF(t) = log( (N − n_t + 0.5) / (n_t + 0.5) + 1 )
This expression is derived by asking a better question. In the initial IDF calculation, we ask:
“How rare is the term?”
Instead, when calculating the IDF for BM25, we ask:
“How much more likely is this term to appear in relevant documents than in non-relevant documents?”
The probability of a document containing the term t, in a knowledge base of N documents, can be expressed as:

P(t) = n_t / N
We can then express the odds of a document containing a term t versus not containing it as:

n_t / (N − n_t)
And then, taking the inverse, we end up with:

(N − n_t) / n_t
Similarly to the typical IDF, we take the log of this expression to compress the extreme values. A transformation called Robertson–Spärck Jones smoothing is also applied, adding 0.5 to the numerator and the denominator (plus a +1 inside the log that keeps the value non-negative), and in this way we finally get the IDF expression used in BM25.
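As a quick sanity check, the BM25 IDF tracks the classic IDF closely but never goes negative or blows up; a sketch (the knowledge-base size N = 1000 is an illustrative assumption):

```python
import math

def classic_idf(N, n_t):
    # The plain TF-IDF version: log(N / n_t)
    return math.log(N / n_t)

def bm25_idf(N, n_t):
    # Robertson–Spärck Jones smoothing (+0.5 in numerator and denominator),
    # plus the +1 inside the log that keeps the value non-negative
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

N = 1000
for n_t in [1, 10, 500, 1000]:
    print(n_t, round(classic_idf(N, n_t), 2), round(bm25_idf(N, n_t), 2))
```

Even for a term contained in every document (n_t = N), the BM25 IDF stays slightly above zero instead of discarding the term entirely.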
. . .
Ultimately, we can calculate the BM25 score of a particular document d for a given query q that contains multiple terms t:

score(d, q) = Σ_{t ∈ q} IDF(t) · ( f(t, d) · (k1 + 1) ) / ( f(t, d) + k1 · (1 − b + b · |d| / avgdl) )
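Putting the TF and IDF pieces together, a compact BM25 scorer might look like this (a sketch, assuming k1 = 1.5, b = 0.75, and the pre-tokenized movie documents from the earlier example):

```python
import math

def bm25_score(query_terms, doc_terms, all_docs, k1=1.5, b=0.75):
    # BM25 score of one document for a query, summed over the query terms
    N = len(all_docs)
    avgdl = sum(len(d) for d in all_docs) / N
    score = 0.0
    for t in query_terms:
        f = doc_terms.count(t)                       # f(t, d)
        n_t = sum(1 for d in all_docs if t in d)     # document frequency
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * tf
    return score

# The movie knowledge base from the TF-IDF example
docs = [
    ["sci-fi", "thriller", "time", "travel", "dangerous",
     "adventure", "alternate", "realities"],
    ["romantic", "drama", "two", "strangers", "fall",
     "love", "unexpected", "time", "travel"],
    ["sci-fi", "adventure", "featuring", "alien",
     "explorer", "forced", "travel", "galaxies"],
]
query = ["dangerous", "time", "travel"]
scores = [bm25_score(query, d, docs) for d in docs]
print([round(s, 2) for s in scores])  # document 1 matches all three terms and ranks highest
```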
In this way, we can score the documents available in a knowledge base based on their relevance to a particular query, and then retrieve the most relevant ones.
All this is just to say that the BM25 score is essentially the much more easily understood TF-IDF score, but a bit more sophisticated. As a result, BM25 is very popular for performing keyword searches, and it is what we use for the keyword search in our RAG system.
RAG with Hybrid Search
So, now that we have an idea of how BM25 works and scores the various documents in a knowledge base based on the frequency of keywords, we can take a look at how BM25 scores are incorporated into a conventional RAG pipeline.
As discussed in several of my previous posts, a very simple RAG pipeline would look something like this:
[Figure: a simple RAG pipeline — documents are chunked, embedded, and stored in a vector store; the user query is embedded, similar chunks are retrieved, and passed to the LLM as context]
Such a pipeline uses a similarity score (like cosine similarity) of embeddings in order to search for, find, and retrieve chunks that are semantically similar to the user’s query. While similarity search is very useful, it can sometimes miss exact matches. Thus, by incorporating a keyword search on top of the similarity search in the RAG pipeline, we can identify relevant chunks more effectively and comprehensively. This alters our landscape as follows:
[Figure: a RAG pipeline with hybrid search — the user query is matched both against the vector store (semantic search) and a BM25 index (keyword search), and the retrieved chunks are combined via rank fusion before being passed to the LLM]
For each text chunk, apart from the embedding, we now also calculate a BM25 index, allowing for quick calculation of the respective BM25 scores for various user queries. In this way, for each user query, we can identify the chunks with the highest BM25 scores – that is, the chunks that contain the rarest, most informative terms with respect to the user’s query, compared to all the other chunks in the knowledge base.
Notice how we now match the user’s query both against the embeddings in the vector store (semantic search) and against the BM25 index (keyword search). Different chunks are retrieved by the semantic search and the keyword search – then the retrieved chunks are combined, deduplicated, and ranked using rank fusion.
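The combination step at the end is commonly implemented with Reciprocal Rank Fusion (RRF); a minimal sketch (the chunk IDs and the k = 60 constant are illustrative assumptions):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each input list contains chunk IDs ordered from most to least relevant.
    # A chunk's fused score is the sum of 1 / (k + rank) over every list it
    # appears in, so chunks ranked highly by *both* searches rise to the top.
    fused = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

semantic = ["chunk_a", "chunk_b", "chunk_c"]   # from embedding similarity
keyword = ["chunk_b", "chunk_d", "chunk_a"]    # from BM25
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because RRF only uses ranks, it conveniently sidesteps the fact that cosine similarities and BM25 scores live on completely different scales.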
. . .
On my mind
Integrating BM25 keyword search into a RAG pipeline allows us to get the best of both worlds: the semantic understanding of embeddings and the precision of exact keyword matching. By combining these approaches, we can more reliably retrieve the most relevant chunks, even from a larger knowledge base, ensuring that critical terms, technical phrases, or names are not overlooked. In this way, we can significantly improve the effectiveness of our retrieval process and make sure that no important, relevant information is left behind.
