How to Context Engineer to Optimize Question Answering Pipelines


Context engineering is one of the most relevant topics in machine learning today, which is why I’m writing my third article on the subject. My goal is both to broaden my understanding of engineering contexts for LLMs and to share that knowledge through my articles.

In today’s article, I’ll discuss improving the context you feed into your LLMs for question answering. Often, this context is based on retrieval-augmented generation (RAG); however, in today’s ever-shifting environment, this approach needs to be updated.

The co-founder of Chroma (a vector database provider) tweeted that RAG is dead. I don’t fully agree that we won’t use RAG anymore, but his tweet highlights that there are different options for filling the context of your LLM.

You can also read my previous context engineering articles:

  1. Basic context engineering techniques
  2. Advanced context engineering techniques


Why you should care about context engineering

First, let me highlight three key points for why you should care about context engineering:

  • Better output quality by avoiding context rot. Fewer unnecessary tokens increase output quality. You can read more details about this in this article
  • Cheaper (don’t send unnecessary tokens, as they cost money)
  • Speed (fewer tokens = faster response times)

These are three core metrics for many question answering systems. Output quality is of course the utmost priority, since users won’t want to use a low-performing system.

Moreover, price should always be a consideration, and if you can lower it (without too much engineering cost), it’s a straightforward decision to do so. Lastly, a faster question answering system provides a better user experience. You don’t want users waiting several seconds for a response when ChatGPT would respond much faster.

The traditional question answering approach

Traditional, in this sense, means the most common question answering approach in systems built after the release of ChatGPT. This approach is traditional RAG, which works as follows:

  1. Fetch the most relevant documents for the user’s question using vector similarity retrieval
  2. Feed the relevant documents, together with the question, into an LLM and receive a response (a minimal sketch of this flow follows below)
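
To make this concrete, here is a minimal sketch of the two-step flow in Python. The `embed` and `generate` callables are placeholders for your embedding model and LLM provider (both hypothetical here); the retrieval step itself is just cosine similarity over pre-computed document embeddings.

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, documents, k=5):
    """Return the k documents whose embeddings are most similar to the query."""
    # Cosine similarity between the query vector and every document vector
    scores = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    top_idx = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_idx]

def answer_question(question, documents, doc_embs, embed, generate, k=5):
    """Traditional RAG: retrieve the most relevant documents, then ask the LLM."""
    query_emb = embed(question)  # step 1: embed the question and fetch similar docs
    context = retrieve_top_k(query_emb, doc_embs, documents, k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {question}"
    )
    return generate(prompt)  # step 2: the LLM answers from the retrieved context
```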

Considering its simplicity, this approach works incredibly well. Interestingly enough, we also see this with another traditional approach: BM25 has been around since 1994 and was, for instance, recently used by Anthropic when they introduced Contextual Retrieval, proving how effective even simple information retrieval techniques are.

Nevertheless, you can still vastly improve your question answering system by updating your RAG setup with some of the techniques I’ll describe in the next section.

Improving RAG context fetching

Even though RAG works relatively well, you can likely achieve better performance by introducing the techniques I’ll discuss in this section. The techniques I describe here all focus on improving the context you feed to the LLM. You can improve this context with two main approaches:

  1. Use fewer tokens on irrelevant context (for instance, by removing documents or using less material from less relevant ones)
  2. Add more documents that are relevant

Thus, you should focus on achieving one of the points above. If you think in terms of precision and recall, the two approaches correspond to:

  1. Increasing precision (at the cost of recall)
  2. Increasing recall (at the cost of precision)

This is a tradeoff you have to make while context engineering your question answering system.

Reducing the number of irrelevant tokens

In this section, I highlight three main approaches to reduce the number of irrelevant tokens you feed into the LLM’s context:

  • Reranking
  • Summarization
  • Prompting GPT

When you fetch documents with vector similarity search, they are returned in order from most relevant to least relevant according to the vector similarity score. However, this similarity score might not accurately represent which documents are most relevant.

Reranking

You can thus use a reranking model, for instance the Qwen reranker, to reorder the document chunks. You can then choose to keep only the top X most relevant chunks (according to the reranker), which should remove some irrelevant documents from your context.
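
As an illustration, here is a minimal sketch using the CrossEncoder interface from the sentence-transformers library. The model name is just a commonly used example, not a recommendation; you could swap in a Qwen reranker (which has its own usage pattern) or any other reranking model.

```python
from sentence_transformers import CrossEncoder

def rerank_and_filter(question, chunks, top_x=5):
    """Reorder chunks with a cross-encoder reranker and keep only the top X."""
    # Example reranker model; replace with the reranker of your choice
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_x]]
```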

Summarization

You can also choose to summarize documents, reducing the number of tokens used per document. You can, for instance, keep the full text of the top 10 most similar documents fetched, summarize the documents ranked 11-20, and discard the rest.

This approach increases the likelihood that you keep the full context from relevant documents, while at least maintaining some context (the summary) from documents that are less likely to be relevant.
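
A minimal sketch of this tiered approach is shown below, where `summarize` is a placeholder for whatever summarization call you use (for instance a cheap LLM call), and `ranked_docs` is assumed to already be ordered from most to least similar:

```python
def build_tiered_context(ranked_docs, summarize, full_n=10, summary_n=10):
    """Keep the top documents in full, summarize the next tier, discard the rest."""
    context_parts = []
    for rank, doc in enumerate(ranked_docs):
        if rank < full_n:                  # ranks 1-10: keep the full document
            context_parts.append(doc)
        elif rank < full_n + summary_n:    # ranks 11-20: keep only a summary
            context_parts.append(summarize(doc))
        else:                              # rank 21+: discard
            break
    return "\n\n".join(context_parts)
```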

Prompting GPT

Lastly, you can also prompt GPT (or another LLM) to judge whether the fetched documents are relevant to the user question. For instance, if you fetch 15 documents, you can make 15 individual LLM calls to assess whether each document is relevant, and then discard the documents that are deemed irrelevant. Keep in mind that these LLM calls should be parallelized to keep the response time within an acceptable limit.
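
A minimal sketch of this filtering step is shown below; `judge_relevance` is a placeholder that wraps your LLM call and returns True or False for a single document, and the calls are parallelized with a thread pool to keep latency down.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_relevant(question, documents, judge_relevance, max_workers=15):
    """Ask the LLM, in parallel, whether each document is relevant; drop the rest."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One relevance call per document, executed concurrently
        verdicts = list(pool.map(lambda doc: judge_relevance(question, doc), documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]
```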

Adding relevant documents

Before or after removing irrelevant documents, you should also make sure you include the relevant ones. I cover two main approaches in this subsection:

  • Better embedding models
  • Searching through more documents (at the cost of lower precision)

Better embedding models

To find the best embedding models, you can go to the HuggingFace embedding model leaderboard, where Gemini and Qwen are in the top 3 as of the writing of this article. Updating your embedding model is usually a cheap way to fetch more relevant documents. This is because running and storing embeddings is typically inexpensive, for instance when embedding through the Gemini API and storing the vectors in Pinecone.
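
Swapping the embedding model is often just a one-line change plus re-embedding your corpus. Below is a minimal sketch using sentence-transformers; the model name is a small, widely available placeholder, and you would replace it with a higher-ranked model from the leaderboard (or call a hosted API such as Gemini’s instead).

```python
from sentence_transformers import SentenceTransformer

# Placeholder model; swap in a stronger model from the embedding leaderboard
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_corpus(documents):
    """Re-embed the whole corpus with the new model, then store the vectors in your vector DB."""
    return model.encode(documents, normalize_embeddings=True)

def embed_query(question):
    """Embed a single question with the same model used for the corpus."""
    return model.encode(question, normalize_embeddings=True)
```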

Search more documents

Another (relatively simple) way to fetch more relevant documents is to fetch more documents in general. Fetching more documents naturally increases the probability that you add relevant ones. However, you have to balance this against avoiding context rot and keeping the number of irrelevant documents to a minimum. Every unnecessary token in an LLM call is, as mentioned earlier, likely to:

  • Reduce output quality
  • Increase cost
  • Lower speed

These are all crucial aspects of a question answering system.

Agentic search approach

I’ve discussed agentic search approaches in previous articles, for instance when I discussed Scaling your AI Search. However, in this section, I’ll dive deeper into setting up an agentic search, which replaces some or all of the vector retrieval step in your RAG pipeline.

The first step is that the user provides their question over a given set of data points, for instance a set of documents. You then set up an agentic system consisting of an orchestrator agent and a list of subagents.

This figure highlights an orchestrator system of LLM agents. The main agent receives the user question and assigns tasks to subagents. Image by ChatGPT.

This is an example of the pipeline the agents would follow (though there are many ways to set it up); a minimal code sketch follows the list below.

  1. The orchestrator agent tells two subagents to split the document filenames between them, iterate over them, and return the relevant documents
  2. The relevant documents are fed back to the orchestrator agent, which then dispatches a subagent to each relevant document to fetch the subparts (chunks) of the document that are relevant to the user’s question. These chunks are then fed back to the orchestrator agent
  3. The orchestrator agent answers the user’s question, given the provided chunks
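
The sketch below outlines this flow under a few assumptions: `llm` is a placeholder for your LLM call (taking a prompt and returning text), the subagent work is parallelized with a thread pool, and the documents live in a simple in-memory dict keyed by filename.

```python
from concurrent.futures import ThreadPoolExecutor

def filename_subagent(llm, question, filenames):
    """Subagent: scan a batch of filenames and return the ones that look relevant."""
    prompt = (
        f"Question: {question}\n\nFilenames:\n" + "\n".join(filenames)
        + "\n\nReturn the filenames relevant to the question, one per line."
    )
    return [name for name in llm(prompt).splitlines() if name in filenames]

def chunk_subagent(llm, question, document):
    """Subagent: extract the parts of one document that are relevant to the question."""
    prompt = (
        f"Question: {question}\n\nDocument:\n{document}\n\n"
        "Return only the passages relevant to the question."
    )
    return llm(prompt)

def orchestrator(llm, question, corpus):
    """Orchestrator agent: delegate to subagents, then answer from the gathered chunks."""
    filenames = list(corpus)
    half = len(filenames) // 2
    batches = [filenames[:half], filenames[half:]]
    with ThreadPoolExecutor() as pool:
        # Step 1: two subagents split the filenames and flag the relevant documents
        results = pool.map(lambda batch: filename_subagent(llm, question, batch), batches)
        relevant_names = [name for batch_result in results for name in batch_result]
        # Step 2: one subagent per relevant document extracts the relevant chunks
        chunks = list(pool.map(
            lambda name: chunk_subagent(llm, question, corpus[name]), relevant_names
        ))
    # Step 3: the orchestrator answers the question, given the provided chunks
    answer_prompt = (
        "Answer the question using the context below.\n\nContext:\n"
        + "\n\n".join(chunks) + f"\n\nQuestion: {question}"
    )
    return llm(answer_prompt)
```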

Another flow you could implement would be to store document embeddings and replace step one with vector similarity between the user question and each document.

This agentic approach has upsides and drawbacks.

Upsides:

  • Higher probability of fetching relevant chunks than with traditional RAG
  • More control over the RAG system. You can update system prompts, etc., while RAG is relatively static with its embedding similarities

Downsides:

  • Many more LLM calls than traditional RAG, which increases cost

In my view, building such an agent-based retrieval system is a super powerful approach that can lead to amazing results. The consideration you have to make when building such a system is whether the increased quality you will (likely) see is worth the increase in cost.

Other context engineering aspects

In this article, I’ve mainly covered context engineering for the documents we fetch in a question answering system. However, there are also other aspects you should be aware of, mainly:

  • The system/user prompt you are using
  • Other information fed into the prompt

The prompt you write for your question answering system should be precise, structured, and free of irrelevant information. You can read many other articles on the topic of structuring prompts, and you can typically ask an LLM to improve these aspects of your prompt.

Sometimes, you also feed other information into your prompt. A common example is feeding in metadata, for instance information about the user, such as (see the sketch after this list):

  • Name
  • Job role
  • What they typically search for
  • etc
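
As a small illustration, here is a sketch of how such optional user metadata could be appended to the prompt. The field names are hypothetical, and each one should only be included if you can justify that it helps answer the question.

```python
def build_prompt(question, context, user_metadata=None):
    """Assemble the prompt, optionally adding user metadata that earns its place."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    if user_metadata:
        # Only include fields you can justify, e.g. name, job role, typical searches
        lines = [f"- {key}: {value}" for key, value in user_metadata.items()]
        prompt += "\n\nAbout the user:\n" + "\n".join(lines)
    return prompt
```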

Whenever you add such information, you should always ask yourself:

Does adding this information help my question answering system answer the question?

Sometimes the answer is yes, other times it’s no. The important part is that you make a rational decision on whether the information is needed in the prompt. If you can’t justify having this information in the prompt, it should normally be removed.

Conclusion

In this article, I have discussed context engineering for your question answering system, and why it’s important. Question answering systems normally consist of an initial step to fetch relevant information. The focus for this information should be to reduce the number of irrelevant tokens to a minimum, while also including as many relevant pieces of information as possible.

👉 Find me on socials:

🧑‍💻 Get in touch

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium

You can also read my in-depth article on Anthropic’s contextual retrieval below:
