In this article, I discuss a specific step of the RAG pipeline: the document retrieval step. This step is critical for any RAG system’s performance, since without fetching the most relevant documents, it’s difficult for an LLM to accurately answer the user’s questions. I’ll discuss the standard approach to fetching the most relevant documents, some techniques to improve it, and the benefits you’ll see from better document retrieval in your RAG pipeline.
As in my last article on Enriching LLM Context with Metadata, I’ll state my main goal for this article up front:
My goal for this article is to highlight how you can fetch and filter the most relevant documents for your AI search.
Table of contents
Why is good document retrieval important?
It’s important to understand why the document fetching step is so critical to any RAG pipeline. To grasp this, you first need a general outline of the flow in a RAG pipeline:
- The user enters their query
- The query is embedded, and you calculate the embedding similarity between the query and each individual document (or document chunk)
- You fetch the most relevant documents based on embedding similarity
- The most relevant documents (or chunks) are fed into an LLM, which is prompted to answer the user query given the provided chunks (sketched in the code below)
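To make the flow concrete, here is a minimal sketch of such a pipeline in Python. The embed() helper and llm_client are placeholders I’m assuming for whatever embedding model and LLM client you use; this is an illustration of the flow, not a production implementation.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_query(user_query: str, chunks: list[str], top_k: int = 10) -> str:
    # 1) Embed the query and every chunk (embed() stands in for your embedding model)
    query_emb = embed(user_query)
    chunk_embs = [embed(chunk) for chunk in chunks]

    # 2) Score every chunk against the query
    scores = [cosine_similarity(query_emb, emb) for emb in chunk_embs]

    # 3) Keep the top K most similar chunks
    top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

    # 4) Ask the LLM to answer using only the retrieved chunks
    prompt = (
        f"Answer the user query using the provided chunks.\n\n"
        f"Query: {user_query}\n\nChunks:\n" + "\n---\n".join(top_chunks)
    )
    return llm_client.generate(prompt)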

Now, there are several aspects of this pipeline that matter, such as:
- Which embedding model you use
- Which LLM you use
- How many documents (or chunks) you fetch
However, I’d argue that no aspect is more important than the selection of documents. This is because without the correct documents, it doesn’t matter how good your LLM is or how many chunks you fetch; the answer will most likely still be incorrect.
The pipeline will probably work with a slightly worse embedding model or a slightly older LLM. However, if you don’t fetch the correct documents, your RAG pipeline will fail.
Traditional approaches
I’ll first go through some traditional approaches used today, mainly embedding similarity and keyword search.
Embedding similarity
Using embedding similarity to fetch the most relevant documents is the go-to approach today. It is a solid approach that works decently in most use cases. RAG with embedding-similarity retrieval works exactly as described in the pipeline outline above.
Keyword search
Keyword search is also commonly used to fetch relevant documents. Traditional approaches, such as TF-IDF or BM25, are still used successfully today. However, keyword search has its weaknesses. For example, it only matches documents on exact terms, which causes issues when an exact match isn’t possible.
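For illustration, here is a minimal keyword-search sketch using the rank_bm25 package (one of several BM25 implementations); the toy corpus and the naive whitespace tokenization are my own simplifications.

from rank_bm25 import BM25Okapi

# Toy corpus; in practice these would be your document chunks
corpus = [
    "The lease agreement starts on the first of March.",
    "The apartment is located at 12 Example Street.",
    "Payment is due on the first day of every month.",
]

# Naive tokenization; real systems usually also strip punctuation, stem, etc.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "when does the lease start"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_docs)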
Thus, I want to discuss some other techniques you can use to improve your document retrieval step.
Techniques to fetch more relevant documents
In this section, I’ll cover some more advanced techniques for fetching the most relevant documents. I’ll divide the section in two. The first subsection covers optimizing document retrieval for recall, meaning fetching as many of the relevant documents as possible from the corpus of available documents. The second subsection discusses how to optimize for precision, meaning ensuring that the documents you fetch are actually relevant to the user query.
Recall: Fetch more of the relevant documents
I’ll discuss the following techniques:
- Contextual retrieval
- Fetching more chunks
- Reranking
Contextual retrieval

Contextual retrieval is a technique introduced by Anthropic in September 2024. Their article covers two topics: adding context to document chunks, and combining keyword search (BM25) with semantic search to fetch relevant documents.
To add context, they take each document chunk and prompt an LLM, given the chunk and the entire document, to rewrite the chunk so it includes both the information from the chunk itself and the relevant context from the entire document.
For example, imagine a document divided into two chunks, where chunk one includes important metadata such as an address, date, location, and time, and the other chunk contains details about a lease agreement. The LLM might rewrite the second chunk to include both the lease agreement details and the most relevant parts of the first chunk, which in this case are the address, location, and date.
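A minimal sketch of this rewriting step could look like the following; the prompt wording is my own simplification of Anthropic’s approach, and llm_client is again an assumed placeholder.

def contextualize_chunk(chunk_text: str, full_document: str) -> str:
    # Ask an LLM to rewrite the chunk so it carries the document-level
    # context needed to understand it in isolation.
    prompt = f"""
    Below is a full document and one chunk taken from it.
    Rewrite the chunk so it keeps its original information but also
    includes any context from the full document (e.g. names, dates,
    addresses) needed to understand the chunk on its own.

    Full document:
    {full_document}

    Chunk:
    {chunk_text}
    """
    return llm_client.generate(prompt)

# Each rewritten chunk is then embedded and indexed instead of the raw chunk.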
Anthropic also discusses combining semantic search and keyword search in their article: essentially, documents are fetched with both techniques, and the two result lists are merged with a prioritized, rank-based scheme.
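A common way to implement such a merge is reciprocal rank fusion; the sketch below is a generic version under that assumption, not necessarily Anthropic’s exact weighting.

def reciprocal_rank_fusion(
    semantic_results: list[str], keyword_results: list[str], k: int = 60
) -> list[str]:
    # Score each document ID by the reciprocal of its rank in every list it appears in.
    scores: dict[str, float] = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Higher fused score means the document ranked well in one or both lists
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion(ids_from_embeddings, ids_from_bm25)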
Fetching more chunks
A simpler way to fetch more of the relevant documents is simply to fetch more chunks. The more chunks you fetch, the higher your chance of including the relevant ones. However, this has two main downsides:
- You’ll likely pull in more irrelevant chunks as well (hurting precision)
- You’ll increase the number of tokens you feed to your LLM, which can degrade the LLM’s output quality
Reranking for recall
Reranking is also a powerful technique, which can be used to increase both precision and recall when fetching documents relevant to a user query. When fetching documents based on semantic similarity, you assign a similarity score to all chunks and typically keep only the top K most similar chunks (K is usually between 10 and 20, but it varies between applications). This means a reranker should try to place the relevant documents within the top K most relevant documents, while keeping irrelevant documents out of that same list. I think Qwen Reranker is a very good model; however, there are also many other rerankers out there.
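As a sketch, a cross-encoder reranker from the sentence-transformers library can be used like this; the specific model name is just an example, and you could swap in Qwen Reranker or another reranking model of your choice.

from sentence_transformers import CrossEncoder

def rerank(user_query: str, chunks: list[str], top_k: int = 10) -> list[str]:
    # Cross-encoders score each (query, chunk) pair jointly,
    # which is usually more accurate than plain embedding similarity.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(user_query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]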
Precision: Filter away irrelevant documents
- Reranking
- LLM verification
Reranking for precision
As discussed in the previous section on recall, rerankers can also be used to improve precision. Rerankers increase recall by pulling relevant documents into the top-K list of most relevant documents. Conversely, they improve precision by keeping irrelevant documents out of that same top-K list.
LLM verification
Using an LLM to judge chunk (or document) relevance is also a powerful technique for filtering away irrelevant chunks. You can simply create a function like the one below:
import json

def is_relevant_chunk(chunk_text: str, user_query: str) -> bool:
    """
    Check whether the chunk text is relevant to the user query.
    """
    prompt = f"""
    Given the provided user query and chunk text, determine whether the
    chunk text is relevant for answering the user query.
    Return a JSON response of the form {{"relevant": bool}}.

    User query: {user_query}

    Chunk text: {chunk_text}
    """
    # Parse the LLM's JSON answer into a boolean
    response = llm_client.generate(prompt)
    return json.loads(response)["relevant"]
You then feed each chunk (or document) through this function and keep only the chunks or documents the LLM judges as relevant.
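Usage is then just a filter over the retrieved chunks (retrieved_chunks and user_query stand in for whatever your retrieval step produced):

filtered_chunks = [
    chunk for chunk in retrieved_chunks
    if is_relevant_chunk(chunk, user_query)
]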
This approach has two main downsides:
- LLM cost
- LLM response time
You’ll be sending a lot of LLM API calls, which will inevitably incur a significant cost. Additionally, sending so many queries takes time, which adds latency to your RAG pipeline. You should balance this against the need to respond rapidly to your users.
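One common mitigation for the latency (assuming your LLM client is thread-safe) is to run the relevance checks concurrently, for example with a thread pool:

from concurrent.futures import ThreadPoolExecutor

def filter_relevant_chunks(
    chunks: list[str], user_query: str, max_workers: int = 8
) -> list[str]:
    # Run the per-chunk relevance checks in parallel to reduce end-to-end latency.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(lambda c: is_relevant_chunk(c, user_query), chunks))
    return [chunk for chunk, keep in zip(chunks, flags) if keep]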
Advantages of improving document retrieval
There are many advantages to improving the document retrieval step in your RAG pipeline. Some examples are:
- Better LLM question-answering performance
- Fewer hallucinations
- More often able to accurately answer users’ queries
- Essentially, it makes the LLM’s job easier
Overall, the ability of your question-answering system will increase in terms of the number of successfully answered user queries. That is the metric I recommend scoring your RAG system on, and you can read more about LLM system evaluations in my article on Evaluating 5 Million Documents with Automatic Evals.
Fewer hallucinations are also an incredibly important benefit. Hallucinations are one of the most significant issues we face with LLMs. They are so detrimental because they lower users’ trust in the question-answering system, which makes them less likely to continue using your application. Ensuring the LLM both receives the relevant documents (recall) and sees as few irrelevant documents as possible (precision) helps minimize the number of hallucinations the RAG system produces.
Fewer irrelevant documents (higher precision) also avoids the problems of context bloat (too much noise in the context) and even context poisoning (misinformation provided in the documents).
Summary
In this article, I’ve discussed how you can improve the document retrieval step of your RAG pipeline. I started by explaining why I believe document retrieval is the most significant part of the RAG pipeline and why you should spend time optimizing this step. I then described how traditional RAG pipelines fetch relevant documents through semantic search and keyword search. Finally, I covered techniques you can use to improve both the recall and precision of retrieved documents, such as contextual retrieval, reranking, and LLM chunk verification.
👉 Find me on socials:
🧑💻 Get in contact
✍️ Medium