In my latest post, I explained how hybrid search can significantly improve the effectiveness of a RAG pipeline. RAG in its basic form, using just semantic search over embeddings, can be very effective, allowing us to apply the power of AI to our own documents. Nonetheless, semantic search, as powerful as it is, can sometimes miss exact matches of the user's query in large knowledge bases, even when they exist in the documents. This weakness of traditional RAG can be handled by adding a keyword search component, like BM25, to the pipeline. In this way, hybrid search, combining semantic and keyword search, leads to much more comprehensive results and significantly improves the performance of a RAG system.
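To make the combination concrete, here is a minimal sketch of one common way to fuse the two result lists, Reciprocal Rank Fusion (RRF). The document IDs and rankings are illustrative; in a real pipeline they would come from a vector store and a BM25 index.

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc IDs into one hybrid ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked highly by either semantic or keyword search float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc7"]   # from embedding similarity
keyword = ["doc7", "doc3", "doc2"]    # from BM25 exact-match search
hybrid = rrf_fuse([semantic, keyword])
```

Documents that appear in both lists ("doc3", "doc7") outrank documents found by only one search method, which is exactly the behavior we want from hybrid search.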
Even so, when using RAG with hybrid search, we can still sometimes miss important information that is scattered across different parts of the document. This happens because when a document is broken down into text chunks, the context (that is, the surrounding text of the chunk that forms part of its meaning) is sometimes lost. This is especially likely for complex text, with meaning that is interconnected and scattered across several pages, and so inevitably cannot be wholly contained within a single chunk. Think, for instance, of referencing a table or an image across several different text sections without explicitly stating which table we are referring to. Consequently, when the text chunks are retrieved, they are stripped of their context, sometimes leading to the retrieval of irrelevant chunks and the generation of irrelevant responses.
This lack of context has been a significant issue for RAG systems for a while, and several not-so-successful solutions have been explored for addressing it. An obvious attempt is increasing the chunk size, but this often also dilutes the semantic meaning of each chunk and ends up making retrieval less precise. Another approach is increasing the chunk overlap. While this helps preserve more context, it also increases storage and computation costs. Most importantly, it does not fully solve the problem: important connections to a chunk can still fall outside its boundaries. More advanced approaches to this challenge include Hypothetical Document Embeddings (HyDE) and the Document Summary Index. Nonetheless, these still fail to provide substantial improvements.
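To see why overlap raises storage costs without solving the problem, here is a sketch of fixed-size chunking with overlap (sizes are illustrative, and overlap is assumed smaller than the chunk size): every character near a boundary is stored twice, yet anything outside the window is still lost.

```python
def chunk_text(text, size, overlap):
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with the previous one (requires overlap < size)."""
    chunks, step = [], size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghij", size=4, overlap=2)
# Half of every chunk is duplicated in its neighbor, so storage grows,
# but a reference pointing 5 chunks back is still out of reach.
```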
Ultimately, an approach that effectively resolves this and significantly enhances the results of a RAG system is contextual retrieval, originally introduced by Anthropic in 2024. Contextual retrieval aims to resolve the lack of context by preserving the context of the chunks and, as a result, improving the accuracy of the retrieval step of the RAG pipeline.
. . .
What about context?
Before saying anything about contextual retrieval, let's take a step back and talk a little bit about what context is. Sure, we've all heard about the context of LLMs or context windows, but what are those about, really?
To be precise, context refers to everything the model takes into account when generating its output. Remember, LLMs work by generating text one token at a time. Thus, context can be the user prompt, the system prompt, instructions, skills, or any other guidelines influencing how the model produces a response. Importantly, the part of the final response the model has produced so far is also part of the context, since each new token is generated based on everything that came before it.
Clearly, different contexts lead to very different model outputs. For instance:
- ‘What is the capital of France?’ could output ‘Paris is the capital of France.’
- ‘You are a one-word quiz bot. What is the capital of France?’ could output ‘Paris.’
A fundamental limitation of LLMs is their context window. The context window of an LLM is the maximum number of tokens that can be passed as input to the model and taken into account to produce a single response. There are LLMs with larger or smaller context windows. Modern frontier models can handle hundreds of thousands of tokens in a single request, whereas earlier models often had context windows as small as 8k tokens.
In an ideal world, we would simply pass all the information the LLM needs to know in the context, and we would almost certainly get very good answers. And this is true to some extent: a frontier model like Opus 4.6 with a 200k-token context window can take in about 500-600 pages of text. If all the information we need to provide fits within this limit, we can indeed include everything as-is in the input to the LLM and get a great answer.
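A back-of-the-envelope check makes the limit tangible. The 4-characters-per-token rule below is only a common approximation (real tokenizers differ), and the window size is the 200k figure from above:

```python
def fits_context(text: str, window_tokens: int = 200_000) -> bool:
    """Rough check of whether text fits in a context window,
    using the common ~4 characters per token approximation."""
    approx_tokens = len(text) // 4
    return approx_tokens <= window_tokens

page = "x" * 2000    # ~500 tokens: roughly one page of text
book = page * 550    # ~275k tokens: a 550-page knowledge base
small_ok = fits_context(page)   # a single page fits easily
large_ok = fits_context(book)   # the whole library does not
```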
The problem is that most real-world AI use cases need some kind of knowledge base with a size far beyond this threshold; think, for instance, of legal libraries or manuals for technical equipment. Since models have these context window limitations, we unfortunately cannot just pass everything to the LLM and let it magically respond. We have to somehow pick the most important information to include in our limited context window. And this is really what the RAG methodology is all about: picking the right information from a large knowledge base in order to effectively answer a user's query. Ultimately, this emerges as an optimization/engineering problem, often called context engineering: identifying the right information to include in a limited context window, so as to produce the best possible responses.
This is probably the most crucial part of a RAG system: making sure the right information is retrieved and passed as input to the LLM. This can be done with semantic search and keyword search, as already explained. Nevertheless, even when retrieving all semantically relevant chunks and all exact matches, there is still a good chance that important information is left behind.
But what kind of information would this be? Since we have covered meaning with semantic search and exact matches with keyword search, what other kind of information is there to consider?
Different documents with inherently different meanings may include parts that are similar or even identical. Imagine a recipe book and a chemical processing manual each instructing the reader to "heat the mixture slowly and stir occasionally to stop it from sticking." The semantic meaning of such a text chunk and the actual words are very similar, even identical, in both documents. In this example, what forms the meaning of the text and allows us to distinguish between cooking and chemical engineering is what we are referring to as context.

Thus, this is the kind of extra information we aim to preserve. And this is exactly what contextual retrieval does: it preserves the context, the surrounding meaning, of each text chunk.
. . .
What about contextual retrieval?
So, contextual retrieval is a strategy applied in RAG that aims to preserve the context of each chunk. In this way, when a chunk is retrieved and passed to the LLM as input, we are able to preserve as much of its initial meaning as possible: the semantics, the keywords, the context, all of it.
To achieve this, contextual retrieval suggests that we first generate a helper text for each chunk, namely the contextual text, that allows us to situate the text chunk in the original document it comes from. In practice, we ask an LLM to generate this contextual text for each chunk. To do this, we provide the document, together with the specific chunk, in a single request to an LLM and prompt it to situate the chunk within the overall document. A prompt for generating the contextual text for our chunk would look something like this:
```
<document>
{the whole document, e.g., the Italian Cookbook the chunk comes from}
</document>

Here is the chunk we want to situate within the context of the whole document.

<chunk>
{the specific chunk}
</chunk>

Provide a brief context that situates this chunk within the overall
document to improve search retrieval. Respond only with the concise
context and nothing else.
```
The LLM returns the contextual text, which we combine with our initial text chunk. In this way, for each chunk of our initial text, we generate a contextual text describing how this specific chunk fits in its parent document. For our example, this would be something like:
Context: Recipe step for simmering homemade tomato pasta sauce.
Chunk: Heat the mixture slowly and stir occasionally to stop it from sticking.
This is indeed a lot more informative and specific! Now there is no doubt about what this mysterious mixture is, because all the information needed for identifying whether we are talking about tomato sauce or laboratory starch solutions is conveniently included within the same chunk.
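In code, the contextualization step can be sketched as follows. The LLM call is abstracted behind a `generate` callable so that any real client (e.g., the Anthropic Messages API) can slot in; the template text and the stubbed response are illustrative:

```python
CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document.
<chunk>
{chunk}
</chunk>
Provide a brief context that situates this chunk within the overall
document to improve search retrieval. Respond only with the concise
context and nothing else."""

def contextualize(chunk: str, document: str, generate) -> str:
    """Return the chunk prefixed with its LLM-generated contextual text."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    return f"Context: {context}\nChunk: {chunk}"

# Stubbed LLM call, just to show the shape of the result:
fake_llm = lambda _prompt: "Recipe step for simmering homemade tomato pasta sauce."
pair = contextualize(
    "Heat the mixture slowly and stir occasionally to stop it from sticking.",
    "{full cookbook text}",
    fake_llm,
)
```

The returned string is the context/chunk pair shown above, ready to be embedded and indexed as a single unit.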
From this point on, we treat the initial chunk text and the contextual text as an unbreakable pair. Then, the rest of the steps of RAG with hybrid search are performed essentially in the same way. That is, for each text chunk prepended with its contextual text, we create embeddings that are stored in a vector store, along with entries in the BM25 index.
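A sketch of this ingestion step: both the embedding and the keyword-index entry are built from the combined string, never from the bare chunk alone. The `embed` function is a placeholder for a real embedding model, and the tokenization shown is the simplest possible stand-in for what a BM25 index would consume.

```python
def embed(text: str) -> list[float]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [float(ord(c)) for c in text[:8]]

def ingest(chunks_with_context: list[tuple[str, str]]):
    """Index each (context, chunk) pair as one combined unit."""
    vector_store, keyword_corpus = [], []
    for context, chunk in chunks_with_context:
        combined = f"{context}\n{chunk}"  # the unbreakable pair
        vector_store.append((combined, embed(combined)))
        keyword_corpus.append(combined.lower().split())  # tokens for BM25
    return vector_store, keyword_corpus

store, corpus = ingest([
    ("Recipe step for simmering homemade tomato pasta sauce.",
     "Heat the mixture slowly and stir occasionally."),
])
```

Because "tomato" and "sauce" now live in the same indexed unit as "heat the mixture", both keyword and semantic search can tell the recipe apart from the chemistry manual.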

This approach, as simple as it is, leads to astonishing improvements in the retrieval performance of RAG pipelines. According to Anthropic, contextual retrieval reduces retrieval failures by an impressive 35%.
. . .
Reducing cost with prompt caching
I hear you asking: doesn't all this make the pipeline much more expensive? Surprisingly, no.
Intuitively, we expect this setup to significantly increase the ingestion cost of a RAG pipeline, essentially doubling it, if not more. After all, we have added a bunch of extra calls to the LLM, haven't we? This is true to some extent: for every chunk, we now make an additional call to the LLM in order to situate it within its source document and get the contextual text.
Nevertheless, this is a cost we only pay once, at the document ingestion stage. Unlike alternative techniques that attempt to preserve context at runtime, such as Hypothetical Document Embeddings (HyDE), contextual retrieval performs the heavy work during document ingestion. Runtime approaches require additional LLM calls for every user query, which can quickly scale latency and operational costs. In contrast, contextual retrieval shifts the computation to the ingestion phase, meaning the improved retrieval quality comes with no additional overhead at query time. On top of this, further techniques can reduce the cost of contextual retrieval itself. In particular, prompt caching lets the LLM process the full document just once and reuse it from the cache while situating each of its chunks, instead of re-processing the whole document with every chunk-level request.
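With the Anthropic Messages API, this works by marking the document portion of the request as cacheable. The sketch below only builds the request body; the `cache_control` field and model name reflect the documented API as I understand it, so verify them against the current docs before relying on this.

```python
def build_cached_request(document: str, chunk_prompt: str) -> dict:
    """Build a Messages API request whose document part is cacheable."""
    return {
        "model": "claude-3-5-haiku-latest",
        "max_tokens": 150,
        "messages": [{
            "role": "user",
            "content": [
                # The large, shared part: processed once, then read from
                # cache for every subsequent chunk of the same document.
                {
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                # The small, chunk-specific part: changes on every call.
                {"type": "text", "text": chunk_prompt},
            ],
        }],
    }

req = build_cached_request(
    "{full cookbook text}",
    "Situate this chunk: Heat the mixture slowly and stir occasionally.",
)
```

Only the first chunk of a document pays full price for the document tokens; every following chunk reads them from the cache at a heavily discounted rate.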
. . .
On my mind
Contextual retrieval represents a simple yet powerful improvement to traditional RAG systems. By enriching each chunk with a contextual text pinpointing its semantic position within its source document, we dramatically reduce the ambiguity of each chunk, and thus improve the quality of the information passed to the LLM. Combined with hybrid search, this technique allows us to preserve semantics, keywords, and context simultaneously.
