In this article, I discuss a specific step of the RAG pipeline: the document retrieval step. This step is critical for any RAG system’s performance, since without fetching the most relevant documents, it’s difficult for an LLM to accurately answer the user’s questions. I’ll discuss the standard approach to fetching the most relevant documents, some techniques to improve it, and the benefits you’ll see from better document retrieval in your RAG pipeline.
As in my last article on Enriching LLM Context with Metadata, I’ll state my main goal for this article up front:
My goal for this article is to highlight how you can fetch and filter the most relevant documents for your AI search.
Table of contents
Why is good document retrieval important?
It’s important to understand why the document fetching step is so critical to any RAG pipeline. To grasp this, you first need a general outline of the flow in a RAG pipeline:
- The user enters their query
- The query is embedded, and you calculate the embedding similarity between the query and each individual document (or document chunk)
- You fetch the most relevant documents based on embedding similarity
- The most relevant documents (or chunks) are fed into an LLM, which is prompted to answer the user query given the provided chunks (sketched in the code below)
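To make the flow concrete, here is a minimal sketch of such a pipeline in Python. The embed() helper and llm_client are placeholders I’m assuming for whatever embedding model and LLM client you use; this is an illustration of the flow, not a production implementation.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_query(user_query: str, chunks: list[str], top_k: int = 10) -> str:
    # 1) Embed the query and every chunk (embed() stands in for your embedding model)
    query_emb = embed(user_query)
    chunk_embs = [embed(chunk) for chunk in chunks]

    # 2) Score every chunk against the query
    scores = [cosine_similarity(query_emb, emb) for emb in chunk_embs]

    # 3) Keep the top K most similar chunks
    top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

    # 4) Ask the LLM to answer using only the retrieved chunks
    prompt = (
        f"Answer the user query using the provided chunks.\n\n"
        f"Query: {user_query}\n\nChunks:\n" + "\n---\n".join(top_chunks)
    )
    return llm_client.generate(prompt)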

Now, there are several aspects of this pipeline that matter, such as:
- Which embedding model you use
- Which LLM you use
- How many documents (or chunks) you fetch
However, I’d argue that no aspect is more important than the selection of documents. This is because without the correct documents, it doesn’t matter how good your LLM is or how many chunks you fetch; the answer will most likely still be incorrect.
The pipeline will probably work with a slightly worse embedding model or a slightly older LLM. However, if you don’t fetch the correct documents, your RAG pipeline will fail.
Traditional approaches
I’ll first go through some traditional approaches used today, mainly embedding similarity and keyword search.
Embedding similarity
Using embedding similarity to fetch the most relevant documents is the go-to approach today. It is a solid approach that works decently in most use cases. RAG with embedding-similarity retrieval works exactly as described in the pipeline outline above.
Keyword search
Keyword search is also commonly used to fetch relevant documents. Traditional approaches, such as TF-IDF or BM25, are still used successfully today. However, keyword search has its weaknesses. For example, it only matches documents on exact terms, which causes issues when an exact match isn’t possible.
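For illustration, here is a minimal keyword-search sketch using the rank_bm25 package (one of several BM25 implementations); the toy corpus and the naive whitespace tokenization are my own simplifications.

from rank_bm25 import BM25Okapi

# Toy corpus; in practice these would be your document chunks
corpus = [
    "The lease agreement starts on the first of March.",
    "The apartment is located at 12 Example Street.",
    "Payment is due on the first day of every month.",
]

# Naive tokenization; real systems usually also strip punctuation, stem, etc.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "when does the lease start"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_docs)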
Thus, I want to discuss some other techniques you can use to improve your document retrieval step.
Techniques to fetch more relevant documents
In this section, I’ll cover some more advanced techniques for fetching the most relevant documents. I’ll divide the section in two. The first subsection covers optimizing document retrieval for recall, meaning fetching as many of the relevant documents as possible from the corpus of available documents. The second subsection discusses how to optimize for precision, meaning ensuring that the documents you fetch are actually relevant to the user query.
Recall: Fetch more of the relevant documents
I’ll discuss the following techniques:
- Contextual retrieval
- Fetching more chunks
- Reranking
Contextual retrieval

Contextual retrieval is a technique introduced by Anthropic in September 2024. Their article covers two topics: adding context to document chunks, and combining keyword search (BM25) with semantic search to fetch relevant documents.
To add context, they take each document chunk and prompt an LLM, given the chunk and the entire document, to rewrite the chunk so it includes both the information from the chunk itself and the relevant context from the entire document.
For example, imagine a document divided into two chunks, where chunk one includes important metadata such as an address, date, location, and time, and the other chunk contains details about a lease agreement. The LLM might rewrite the second chunk to include both the lease agreement details and the most relevant parts of the first chunk, which in this case are the address, location, and date.
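A minimal sketch of this rewriting step could look like the following; the prompt wording is my own simplification of Anthropic’s approach, and llm_client is again an assumed placeholder.

def contextualize_chunk(chunk_text: str, full_document: str) -> str:
    # Ask an LLM to rewrite the chunk so it carries the document-level
    # context needed to understand it in isolation.
    prompt = f"""
    Below is a full document and one chunk taken from it.
    Rewrite the chunk so it keeps its original information but also
    includes any context from the full document (e.g. names, dates,
    addresses) needed to understand the chunk on its own.

    Full document:
    {full_document}

    Chunk:
    {chunk_text}
    """
    return llm_client.generate(prompt)

# Each rewritten chunk is then embedded and indexed instead of the raw chunk.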
Anthropic also discusses combining semantic search and keyword search in their article: essentially, documents are fetched with both techniques, and the two result lists are merged with a prioritized, rank-based scheme.
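A common way to implement such a merge is reciprocal rank fusion; the sketch below is a generic version under that assumption, not necessarily Anthropic’s exact weighting.

def reciprocal_rank_fusion(
    semantic_results: list[str], keyword_results: list[str], k: int = 60
) -> list[str]:
    # Score each document ID by the reciprocal of its rank in every list it appears in.
    scores: dict[str, float] = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Higher fused score means the document ranked well in one or both lists
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion(ids_from_embeddings, ids_from_bm25)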
Fetching more chunks
A simpler way to fetch more of the relevant documents is simply to fetch more chunks. The more chunks you fetch, the higher your chance of including the relevant ones. However, this has two main downsides:
- You’ll likely pull in more irrelevant chunks as well (hurting precision)
- You’ll increase the number of tokens you feed to your LLM, which can degrade the LLM’s output quality
Reranking for recall
Reranking is also a powerful technique, which can be used to increase both precision and recall when fetching documents relevant to a user query. When fetching documents based on semantic similarity, you assign a similarity score to all chunks and typically keep only the top K most similar chunks (K is usually between 10 and 20, but it varies between applications). This means a reranker should try to place the relevant documents within the top K most relevant documents, while keeping irrelevant documents out of that same list. I think Qwen Reranker is a very good model; however, there are also many other rerankers out there.
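As a sketch, a cross-encoder reranker from the sentence-transformers library can be used like this; the specific model name is just an example, and you could swap in Qwen Reranker or another reranking model of your choice.

from sentence_transformers import CrossEncoder

def rerank(user_query: str, chunks: list[str], top_k: int = 10) -> list[str]:
    # Cross-encoders score each (query, chunk) pair jointly,
    # which is usually more accurate than plain embedding similarity.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(user_query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]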
Precision: Filter away irrelevant documents
- Reranking
- LLM verification
Reranking for precision
As discussed in the previous section on recall, rerankers can also be used to improve precision. Rerankers increase recall by pulling relevant documents into the top-K list of most relevant documents. Conversely, they improve precision by keeping irrelevant documents out of that same top-K list.
LLM verification
Using an LLM to judge chunk (or document) relevance is also a powerful technique for filtering away irrelevant chunks. You can simply create a function like the one below:
import json

def is_relevant_chunk(chunk_text: str, user_query: str) -> bool:
    """
    Check whether the chunk text is relevant to the user query.
    """
    prompt = f"""
    Given the provided user query and chunk text, determine whether the
    chunk text is relevant for answering the user query.
    Return a JSON response of the form {{"relevant": bool}}.

    User query: {user_query}

    Chunk text: {chunk_text}
    """
    # Parse the LLM's JSON answer into a boolean
    response = llm_client.generate(prompt)
    return json.loads(response)["relevant"]
You then feed each chunk (or document) through this function and keep only the chunks or documents the LLM judges as relevant.
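Usage is then just a filter over the retrieved chunks (retrieved_chunks and user_query stand in for whatever your retrieval step produced):

filtered_chunks = [
    chunk for chunk in retrieved_chunks
    if is_relevant_chunk(chunk, user_query)
]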
This approach has two main downsides:
- LLM cost
- LLM response time
You’ll be sending a lot of LLM API calls, which will inevitably incur a significant cost. Additionally, sending so many queries takes time, which adds latency to your RAG pipeline. You should balance this against the need to respond rapidly to your users.
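One common mitigation for the latency (assuming your LLM client is thread-safe) is to run the relevance checks concurrently, for example with a thread pool:

from concurrent.futures import ThreadPoolExecutor

def filter_relevant_chunks(
    chunks: list[str], user_query: str, max_workers: int = 8
) -> list[str]:
    # Run the per-chunk relevance checks in parallel to reduce end-to-end latency.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(lambda c: is_relevant_chunk(c, user_query), chunks))
    return [chunk for chunk, keep in zip(chunks, flags) if keep]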
Advantages of improving document retrieval
There are many advantages to improving the document retrieval step in your RAG pipeline. Some examples are:
- Better LLM question-answering performance
- Fewer hallucinations
- More often able to accurately answer users’ queries
- Essentially, it makes the LLM’s job easier
Overall, the ability of your question-answering system will increase in terms of the number of successfully answered user queries. That is the metric I recommend scoring your RAG system on, and you can read more about LLM system evaluations in my article on Evaluating 5 Million Documents with Automatic Evals.
Fewer hallucinations are also an incredibly important benefit. Hallucinations are one of the most significant issues we face with LLMs. They are so detrimental because they lower users’ trust in the question-answering system, which makes them less likely to continue using your application. Ensuring the LLM both receives the relevant documents (recall) and sees as few irrelevant documents as possible (precision) helps minimize the number of hallucinations the RAG system produces.
Fewer irrelevant documents (higher precision) also avoids the problems of context bloat (too much noise in the context) and even context poisoning (misinformation provided in the documents).
Summary
In this article, I’ve discussed how you can improve the document retrieval step of your RAG pipeline. I started by explaining why I believe document retrieval is the most significant part of the RAG pipeline and why you should spend time optimizing this step. I then described how traditional RAG pipelines fetch relevant documents through semantic search and keyword search. Finally, I covered techniques you can use to improve both the recall and precision of retrieved documents, such as contextual retrieval, reranking, and LLM chunk verification.
👉 Find me on socials:
🧑💻 Get in contact
✍️ Medium