When Does Adding Fancy RAG Features Work?


This is an article about overengineering a RAG system: adding fancy features like query optimization, detailed chunking with neighbors and keys, along with expanding the context.

The argument against this kind of work is that for some of these add-ons, you end up paying 40–50% more in latency and cost.

So I decided to test two pipelines, one with query optimization and neighbor expansion, and one without.

The first test I ran used easy corpus questions generated directly from the docs, and the results were lackluster. But then I continued testing on messier questions and on random real-world questions, and those showed something different.

That is what we’ll discuss here: where features like neighbor expansion do well, and where the cost isn’t worth it.

We’ll go through the setup, the experiment design, three evaluation runs on different datasets, how to interpret the results, and the cost/benefit tradeoff.

If you feel confused at any point, there are two earlier articles, here and here, that came before this one, though this one should stand on its own.

The intro

People keep adding complexity to their RAG pipelines, and there’s a reason for it. The basic design is flawed, so we keep patching on fixes to make something more robust.

Most people have introduced hybrid search, BM25 plus semantic, along with re-rankers in their RAG setups. This has become standard practice. But there are more complex features you can add.

The pipeline we’re testing here introduces two additional features, query optimization and neighbor expansion, and tests their effectiveness.

We’re using LLM judges and different datasets to evaluate automated metrics like faithfulness, along with A/B tests on quality, to see how the metrics move and change for each pipeline.

This introduction walks through the setup and the experiment design.

The setup

Let’s first run through the setup, briefly covering detailed chunking and neighbor expansion, and what I define as complex versus naive for the purpose of this article.

The pipeline I’ve run here uses very detailed chunking methods, which you’ll recognize if you’ve read my previous article.

This means parsing the PDFs accurately, respecting document structure, using smart merging logic, intelligent boundary detection, numeric fragment handling, and document-level context (i.e. applying headings to every chunk).

I decided not to budge on this part, even though it is clearly the hardest part of building a retrieval pipeline.

During processing, it also splits the sections and then records each chunk’s neighbors in the metadata. This lets us expand the content at query time so the LLM can see the surrounding context a chunk comes from.
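To make that concrete, here’s a minimal sketch of what the chunk metadata and the expansion step could look like. The field names (chunk_id, neighbor_ids, section_heading) are illustrative, not the exact schema from my pipeline.

```python
# Minimal sketch of neighbor expansion via chunk metadata.
# Field names (chunk_id, neighbor_ids, section_heading) are illustrative, not the exact schema.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    section_heading: str  # document-level context applied to each chunk
    neighbor_ids: list[str] = field(default_factory=list)  # adjacent chunks in the same section

def expand_with_neighbors(seed_chunks: list[Chunk], index: dict[str, Chunk]) -> list[Chunk]:
    """Return the seed chunks plus their stored neighbors, de-duplicated, in retrieval order."""
    seen: set[str] = set()
    expanded: list[Chunk] = []
    for chunk in seed_chunks:
        for cid in [chunk.chunk_id, *chunk.neighbor_ids]:
            if cid not in seen and cid in index:
                seen.add(cid)
                expanded.append(index[cid])
    return expanded
```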

For this test, we use the same chunks, but we remove query optimization and context expansion for the naive pipeline to see if the fancy add-ons are actually doing any good.

I should also mention that the use case here was scientific papers about RAG. It’s a semi-difficult case, so for easier use cases this may not apply (but we’ll get to that later too).

To conclude: both setups use the same chunking, the same reranker, and the same LLM. The only difference is optimizing the queries and expanding the chunks to their neighbors.

The experiment design

We have three datasets that were run through several automated metrics, then through a head-to-head judge, along with examining the outputs by hand to validate and understand the results.

I started by creating a dataset of 256 questions generated from the corpus. That means questions such as “What is the main purpose of the Step-Audio 2 model?”

This can be a good way to validate that your pipeline works, but when it’s too clean it can give you a false sense of security.

Note that I didn’t specify how the questions should be generated. That is, I didn’t ask it to generate questions that an entire section could answer, or that only a single chunk could answer.

The second dataset was also generated from the corpus, but I intentionally asked the LLM to generate messy questions like “what are the plz three kinds of reward functions used in eviomni?”

The third dataset, and the most important one, was the random dataset.

I asked an AI agent to research RAG questions people had asked online, such as “best rag eval benchmarks and why?” and “when does using titles/abstracts beat full text retrieval.”

Remember, the pipeline had only ingested around 150 scientific papers from September/October that mentioned RAG. So we don’t know if the corpus even has the answers.

To run the first evals, I used automated metrics such as faithfulness (does the answer stay grounded in the context) and answer relevancy (does it answer the question) from RAGAS. I also added a few metrics from DeepEval to look at context relevance, structure, and hallucinations.
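For reference, scoring a batch of question/answer/context rows with RAGAS can look roughly like this. It assumes the ragas 0.1-style evaluate() API, so the exact imports may differ depending on your version.

```python
# Minimal sketch: scoring answers with RAGAS faithfulness and answer relevancy.
# Assumes the ragas 0.1-style evaluate() API; imports may differ on newer versions.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

rows = {
    "question": ["What is the main purpose of the Step-Audio 2 model?"],
    "answer": ["Step-Audio 2 is designed to ..."],                 # pipeline output
    "contexts": [["<retrieved chunk 1>", "<retrieved chunk 2>"]],  # retrieved chunks per question
}

results = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(results)  # e.g. {'faithfulness': 0.83, 'answer_relevancy': 0.91}
```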

We ran both pipelines through each dataset, and then through all of these metrics.

Then I added another head-to-head judge to A/B test the quality of each pipeline on each dataset. This judge didn’t see the context, only the question, the answers, and the automated metrics.

Why not include the context in the evaluation? Because you can’t overload these judges with too many variables. This is also why evals can feel difficult: you need to understand the one key metric you want to measure with each judge.
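A minimal sketch of that head-to-head judge is below. The model name and prompt wording are placeholders, not the exact setup I ran.

```python
# Minimal sketch of the head-to-head (A/B) quality judge.
# The model name and prompt wording are placeholders, not the exact setup used here.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Automated scores (A vs B): faithfulness {faith_a} vs {faith_b}, relevancy {rel_a} vs {rel_b}.

Pick the better answer overall. Reply with JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}"""

def judge_pair(question: str, answer_a: str, answer_b: str, scores: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b, **scores)}],
        temperature=0,
    )
    return response.choices[0].message.content  # parse the JSON downstream
```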

I should note that this can be an unreliable way to test systems. If the hallucination score is mostly vibes, but we remove a data point because of it before sending it into the next judge that tests quality, we can end up with highly unreliable data once we start aggregating.

For the final part, we looked at the semantic similarity between the two pipelines’ answers and examined the pairs with the biggest differences, along with cases where one pipeline clearly won over the other.
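The similarity pass itself is simple. Here’s a sketch assuming a sentence-transformers embedding model (the model name is just an example, not necessarily what I used):

```python
# Minimal sketch: rank answer pairs by semantic similarity to surface the biggest divergences.
# The embedding model is an example choice, not necessarily the one used in this experiment.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def divergence_ranking(naive_answers: list[str], complex_answers: list[str]) -> list[tuple[int, float]]:
    emb_naive = model.encode(naive_answers, convert_to_tensor=True)
    emb_complex = model.encode(complex_answers, convert_to_tensor=True)
    sims = [float(cos_sim(a, b)) for a, b in zip(emb_naive, emb_complex)]
    # Lowest similarity first: these are the pairs worth reading by hand.
    return sorted(enumerate(sims), key=lambda pair: pair[1])
```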

Let’s now turn to running the experiment.

Running the experiment

Since we have several datasets, we need to go through the results of each. The first two datasets proved pretty lackluster, but they did show us something, so they’re worth covering.

The random dataset showed the most interesting results by far. It will be the main focus, and I’ll dig into the results a bit to show where each pipeline failed and where it succeeded.

Clean questions from the corpus

The clean corpus dataset showed pretty similar results on all metrics. The judge seemed to prefer one over the other based on shallow preferences, but it showed us the problem with relying on synthetic datasets.

The first run was on the corpus dataset, the clean questions that had been generated from the docs the pipeline had ingested.

The results of the automated metrics were eerily similar.

Even context relevance was just about the same, ~0.95 for both. I had to double check it several times to be sure. Since I’ve had great success with context expansion before, the results made me a bit uneasy.

It’s quite obvious in hindsight though: the questions are already well formatted for retrieval, and a single passage can answer each one.

I did wonder why context relevance didn’t decrease for the expanded pipeline if one passage was enough. It’s because the additional contexts come from the same section as the seed chunks, making them semantically related and not considered “irrelevant” by RAGAS.

The A/B test for quality we ran it through had similar results. Each pipeline won for the same reasons: completeness, accuracy, clarity.

For the cases where naive won, the judge liked the answer’s conciseness, clarity, and focus. It penalized the complex pipeline for peripheral details (edge cases, extra citations) that weren’t directly asked for.

When complex won, the judge liked the completeness/comprehensiveness of the answer over the naive one. This meant having specific numbers/metrics, step-by-step mechanisms, and “why” explanations, not just “what.”

Nevertheless, these results didn’t point to any failures. This was more a preference thing than a pure quality difference; both did exceptionally well.

So what did we learn from this? In an ideal world, you don’t need any fancy RAG add-ons, and using a test set from the corpus is extremely unreliable.

Messy questions from the corpus

Next up we tested the second dataset, which showed results similar to the first one since it had also been synthetically generated, but it started moving in another direction, which was interesting.

Remember the messier questions generated from the corpus that I introduced earlier. This dataset was generated the same way as the first one, but with messy phrasing (“can u explain like how plz…”).

The automated metrics were still very similar, though context relevance started to drop for the complex pipeline while faithfulness started to rise slightly.

Among the questions that failed the metrics, there were a few RAGAS false positives.

But there were also some failures on questions that had been formatted without specificity in the synthetic dataset, such as “how many posts tbh were used for dataset?” or “how many datasets did they test on?”

There were some questions where the query optimizer helped by removing noisy input. But I noticed too late that the generated questions were too directed at specific passages.

This meant that pushing them in as they were already did well on the retrieval side. I.e., questions with specific names in them (like “how does CLAUSE compare…”) matched documents fine, and the query optimizer just made things worse.

There were times when query optimization failed completely because of how the questions had been phrased.

Take the question “how does the btw pre-check phase in ac-rag work & why is it important?”, where direct search found the AC-RAG paper immediately, since the question had been generated from it.

Running it through the A/B judge, the results favored the advanced pipeline a lot more than they had for the first corpus dataset.

The judge favored naive’s conciseness and brevity, while it favored the complex pipeline for completeness and comprehensiveness. 

The reason we see the rise in wins for the complex pipeline is that the judge increasingly chose “complete but verbose” over “brief but potentially missing points” this time around.

That is when it hit me how useless answer quality can be as a metric. These LLM judges run on vibes sometimes.

In this run, I didn’t think the answers were different enough to warrant the difference in results. So remember, using a synthetic dataset like this can give you some intel, but it can be quite unreliable.

Random questions dataset

Lastly, we’ll go through the results from the random dataset, which were a lot more interesting. Metrics started to move by a wider margin here, which gave us something to dig into.

Up to this point I had nothing to show for the added complexity, but this last dataset finally gave me something interesting to dig into.

See the results from the random dataset below.

On random questions, we actually saw a drop in faithfulness and answer relevancy for the naive baseline. Context relevance and structure were still higher for naive, but we had already established that tradeoff for the complex pipeline in the previous article.

Noise is inevitable for the complex pipeline, as we’re talking about 10x more chunks. Citation structure may also be harder for the model when the context grows (or the judge has trouble judging the full context).

The A/B judge, though, gave the complex pipeline a very high score compared with the other datasets.

I ran it twice to check, and each time it favored the complex pipeline over the naive one by a huge margin.

Why the change? This time there were a number of questions that one passage couldn’t answer on its own.

Specifically, the complex pipeline did well on tradeoff and comparison questions. The judge’s reasoning was “more complete/comprehensive” compared with the naive pipeline.

An example was the question “what are pros and cons of hybrid vs knowledge-graph RAG for vague queries?”, where naive made many unsupported claims (missing GraphRAG, HybridRAG, and the EM/F1 metrics).

At this point, I needed to understand why complex won and why naive lost. This would give me intel on where the fancy features were actually helping.

Looking into the results

Now, without digging into the results, you can’t fully know why something is winning. Since the random dataset showed the most interesting results, that is where I decided to put my focus.

First, the judge has real issues evaluating the fuller context. This is why I could never create a judge to evaluate one context against the other; it would prefer naive simply because it’s cognitively easier to judge. That is what made this so hard.

Nevertheless, we can pinpoint some of the real failures.

Even though the hallucination metric showed decent results, digging into it we could see that the naive pipeline fabricated information more often.

We could locate this by looking at the low faithfulness scores.

To give you an example, for the question “how do I test prompt injection risks if the bad text is inside retrieved PDFs?”, the naive pipeline filled in gaps in the context to produce its answer.

Question: How do I test prompt injection risks if the bad text is inside retrieved PDFs?
Naive Response: Lists standard prompt-injection testing steps (PoisonedRAG, adaptive instructions, multihop poisoning) but synthesizes a generic evaluation recipe that is not fully supported by the actual retrieved sections and implicitly fills gaps with prior knowledge.
Complex Response: Derives testing steps directly from the retrieved experiment sections and threat models, including multihop-triggered attacks, single-text generation bias measurement, adaptive prompt attacks, and success-rate reporting, staying within what the cited papers actually describe.
Faithfulness: Naive: 0.0 | Complex: 0.83
What Happened: Unlike the naive answer, the complex one is not inventing attacks, metrics, or techniques out of thin air. PoisonedRAG, trigger-based attacks, Hotflip-style perturbations, multihop attacks, ASR, DACC/FPR/FNR, and PC1–PC3 all appear in the provided documents. Nonetheless, the complex pipeline is subtly overstepping and has a case of scope inflation.

The expanded content added the missing evaluation metrics, which bumped the faithfulness score up from 0.0 to 0.83.

Nevertheless, the complex pipeline was subtly overstepping and had a case of scope inflation. This could be an issue with the LLM generator, where we need to tune it to make sure that every claim is explicitly tied to a paper and that cross-paper synthesis is marked as such.

For the question “how do I benchmark prompts that force the model to list contradictions explicitly?”, naive again retrieves very little and thus invents metrics, reverses findings, and collapses task boundaries.

Question: How do I benchmark prompts that force the model to list contradictions explicitly?
Naive Response: Mentions MAGIC by name and vaguely gestures at “conflicts” and “benchmarking,” but lacks concrete mechanics. No clear description of conflict generation, no separation of detection vs localization, no actual evaluation protocol. It fills gaps by inventing generic-sounding steps that are not grounded in the provided contexts.
Complex Response: Explicitly aligns with the MAGIC paper’s methodology. Describes KG-based conflict generation, single-hop vs multi-hop and 1 vs N conflicts, subgraph-level few-shot prompting, stepwise prompting (detect then localize), and the actual ID/LOC metrics used across multiple runs. Also accurately incorporates PC1–PC3 as auxiliary prompt components and explains their role, consistent with the cited sections.
Faithfulness: Naive: 0.35 | Complex: 0.73
What Happened: The complex pipeline has far more surface area, but most of it is anchored to actual sections of the MAGIC paper and related prompt-component work. In short: the naive answer hallucinates by necessity due to missing context, while the complex answer is verbose but materially supported. It over-synthesizes and over-prescribes, but mostly stays within the factual envelope. The higher faithfulness score is doing its job, even if it offends human patience.

The complex pipeline, though, over-synthesizes and over-prescribes while still staying within the facts.

This pattern shows up in multiple examples. The naive pipeline lacks enough information for some of these questions, so it falls back to prior knowledge and pattern completion, whereas the complex pipeline over-synthesizes under false coherence.

Essentially, naive fails by making things up, and complex fails by saying true things too broadly.

This test was more about figuring out whether these fancy features help, but it did point to us needing to work on claim scoping: forcing the model to say “Paper A shows X; Paper B shows Y,” and so on.
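A rough sketch of what such a claim-scoping instruction could look like in the generation prompt (the wording is illustrative, not what this pipeline currently uses):

```python
# Illustrative claim-scoping rules for the generator prompt.
# The wording is a sketch, not the exact prompt used in this pipeline.

CLAIM_SCOPING_RULES = """
When answering from the retrieved papers:
1. Tie every factual claim to a specific paper, e.g. "Paper A shows X."
2. If you combine findings from multiple papers, label it explicitly:
   "Synthesizing across Paper A and Paper B, ..."
3. Do not generalize a result beyond the setting the paper tested.
4. If the retrieved context does not cover part of the question, say so instead of filling the gap.
"""

def build_generation_prompt(question: str, context: str) -> str:
    return f"{CLAIM_SCOPING_RULES}\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```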

Before we move on to the cost/latency analysis, we can try to isolate the query optimizer as well.

How much did the query optimizer help?

Since I didn’t test each part of the pipeline separately for every run, we had to look at various signals to estimate whether the query optimizer was helping or hurting.

First, we looked at the seed chunk overlap between the complex and naive pipelines, which showed 8.3% semantic overlap on the random dataset, versus more than 50% overlap on the corpus dataset.

We already know that the full pipeline won on the random dataset, and now we could also see that it surfaced different documents because of the query optimizer.

Most documents were different, so I couldn’t isolate whether quality degraded when there was little overlap.
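For reference, the overlap itself can be computed very simply. The sketch below uses Jaccard overlap on chunk IDs; a semantic variant would compare chunk embeddings instead.

```python
# Minimal sketch: overlap of the seed chunks retrieved by each pipeline per question.
# Uses Jaccard overlap on chunk IDs; a semantic variant would compare chunk embeddings instead.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def average_seed_overlap(naive_runs: list[set[str]], complex_runs: list[set[str]]) -> float:
    """naive_runs[i] and complex_runs[i] hold the retrieved chunk IDs for question i."""
    scores = [jaccard(n, c) for n, c in zip(naive_runs, complex_runs)]
    return sum(scores) / len(scores)
```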

We also asked a judge to estimate the quality of the optimized queries compared with the original ones, in terms of preserving intent and being diverse enough, and the optimizer won by an 8% margin.

A question it excelled on was “why is everyone saying RAG doesn’t scale? how are people fixing that?”

Original: why is everyone saying RAG doesn't scale? how are people fixing that?
Optimized (1): RAG scalability challenges (hybrid)
Optimized (2): Solutions for RAG scalability (hybrid)
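For context, the optimizer is essentially one LLM call that rewrites a messy question into a few retrieval-friendly sub-queries. A minimal sketch, with the prompt wording as a placeholder:

```python
# Minimal sketch of the query optimizer: rewrite a messy question into a few
# retrieval-friendly sub-queries. The prompt wording is a placeholder.

import json
from openai import OpenAI

client = OpenAI()

def optimize_query(question: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the user question into 1-3 short search queries for a "
                "hybrid (BM25 + semantic) retriever. Keep domain-specific terms as-is. "
                'Return JSON: {"queries": ["..."]}\n\n'
                f"Question: {question}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)["queries"]  # sketch: no error handling
```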

Whereas a question that naive did well on by itself was “what retrieval settings help reduce needle-in-a-haystack,” along with other questions that were well formatted from the start.

We could reasonably deduce, though, that multi-part questions and messier questions did better with the optimizer, as long as they weren’t domain specific. The optimizer was overkill for well-formatted questions.

It also did badly when the query was already well matched to the underlying documents, in cases where someone asks something domain specific that the query optimizer doesn’t understand.

You can leaf through a few examples in the Excel document.

This teaches us how important it is to make sure the optimizer is tuned to the questions your users will actually ask. If your users keep asking with domain-specific jargon that the optimizer ignores or filters out, it won’t perform well.

We can see here that it’s rescuing some questions and failing others at the same time, so it would need work for this use case.

Let’s discuss it

I’ve overloaded you with a lot of data, so now it’s time to go through the cost/latency tradeoff, discuss what we can and can’t conclude, and cover the limitations of this experiment.

The cost/latency tradeoff

When looking at the cost and latency tradeoffs, the goal is to put concrete numbers on what these features cost and where that cost actually comes from.

The cost of running this pipeline is very slim. We’re talking $0.00396 per run, and this doesn’t include caching. The query optimizer and neighbor expansion account for roughly a 41% increase over the naive pipeline.

It’s not more than that because input tokens, the thing that increases with added context, are quite cheap.

What actually costs money in this pipeline is the re-ranker from Cohere, which both the naive and the full pipeline use.

For the naive pipeline, the re-ranker accounts for 70% of the entire cost. So it’s worth looking at every part of the pipeline to figure out where you can use smaller models to cut costs.

Nevertheless, at around 100k questions, you would be paying about $400 for the full pipeline and $280 for the naive one.
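The arithmetic behind those numbers, roughly (the naive per-query price is backed out from the ~41% difference rather than measured separately):

```python
# Back-of-the-envelope scaling from the per-query figures above.
# The naive per-query price is backed out from the ~41% difference, not measured separately.

complex_per_query = 0.00396                  # USD, measured
naive_per_query = complex_per_query / 1.41   # ~0.0028 USD

queries = 100_000
print(f"complex: ${complex_per_query * queries:,.0f}")  # ~$396
print(f"naive:   ${naive_per_query * queries:,.0f}")    # ~$281
```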

There is also the question of latency.

We measured a +49% increase in latency with the complex pipeline, which amounts to about 6 seconds, mostly driven by the query optimizer using GPT-5-mini. It’s possible to use a faster and smaller model here.

For neighbor expansion, we measured the average increase to be 2–3 seconds. Do note that this doesn’t scale linearly.

4.4x more input tokens only added 24% more time.

What this shows is that the cost difference is real but not extreme, while the latency difference is much more noticeable. Most of the money is still spent on re-ranking, not on adding context.

What we can conclude

Let’s focus on what worked, what failed, and why. We see that neighbor expansion can pull its weight when questions are diffuse, but each pipeline has its own failure modes.

The clearest finding from this experiment is that neighbor expansion earns its keep when retrieval gets hard and one chunk can’t answer the query.

We did a test in the previous article that looked at how much of the answer was generated from the expanded chunks, and on clean corpus questions, only 22% of the answer content came from expanded neighbors. We also saw that the A/B results here in this article showed a tie.

On messy questions, this rose to 30%, with a 10-point margin in the A/B test. On random questions, it hit 41% (of content used from the expanded context), with a 44-point margin in the A/B test. The pattern is hard to ignore.

What’s happening underneath is a difference in failure modes. When naive fails, it fails by omission. The LLM doesn’t have enough context, so it either gives an incomplete answer or fabricates information to fill the gaps.

We saw this clearly in the prompt injection example, where naive scored 0.0 on faithfulness because it overreached on the facts.

When complex fails, it fails by inflation. It has so much context that the LLM over-synthesizes and makes claims broader than any single source supports. But at least those claims are grounded in something.

The faithfulness scores reflect this asymmetry. Naive bottoms out at 0.0 or 0.35, while complex’s worst cases still land around 0.73.

The query optimizer is harder to call. It helped on 38% of questions, hurt on 27%, and made no difference on 35%. The wins were dramatic when they happened, rescuing questions like “why is everyone saying RAG doesn’t scale?” where direct search returned nothing.

But the losses weren’t great either, such as when the user’s phrasing already matched the corpus vocabulary and the optimizer introduced drift.

This suggests you’d need to tune the optimizer carefully to your users, or find a way to detect when reformulation is likely to help versus hurt.
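One simple heuristic for that detection step: skip reformulation when the query already contains corpus-specific terms. The vocabulary source and the threshold below are assumptions for illustration, not something I tested in this experiment.

```python
# Hypothetical heuristic: only reformulate queries that don't already match corpus vocabulary.
# The vocabulary source and the threshold are assumptions, not something tested here.

def should_optimize(query: str, corpus_vocab: set[str], min_hits: int = 1) -> bool:
    """Skip the optimizer when the query already contains corpus-specific terms."""
    tokens = {t.strip("?,.!").lower() for t in query.split()}
    domain_hits = len(tokens & corpus_vocab)
    return domain_hits < min_hits  # few corpus terms -> reformulation is more likely to help

# corpus_vocab could be built from paper titles, section headings, and extracted entity names.
vocab = {"ac-rag", "graphrag", "hybridrag", "poisonedrag", "magic", "bm25"}
print(should_optimize("how does the pre-check phase in ac-rag work?", vocab))  # False: keep the raw query
```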

On cost and latency, the numbers weren’t where I expected. Adding 10x more chunks only increased generation time by 24%, because processing input tokens is a lot cheaper and faster than generating output.

The real cost driver is the reranker, at 70% of the naive pipeline’s total.

The query optimizer contributes the most latency, at nearly 3 seconds per query. If you’re optimizing for speed, that’s where to look first, along with the re-ranker.

So more context doesn’t necessarily mean chaos, but it does mean you need to control the LLM to a larger degree. When the question doesn’t need the complexity, the naive pipeline will rule, but once questions become diffuse, the more complex pipeline starts to pull its weight.

Let’s talk limitations

I have to cover the main limitations of the experiment and what we should be careful about when interpreting the results.

The obvious one is that LLM judges run on vibes.

The metrics moved in the right direction across datasets, but I wouldn’t trust the absolute numbers enough to set production thresholds on them.

The messy corpus showed a 10-point margin for complex, but truthfully the answers weren’t different enough to warrant that gap. It could be noise.

I also didn’t isolate what happens when the docs genuinely can’t answer the query.

The random dataset included questions where we didn’t know if the papers had relevant content, but I treated all 66 the same. I did hunt through the examples, but it’s still possible some of the complex pipeline’s wins came from being better at admitting ignorance rather than better at finding information.

Finally, I tested two features together, query optimization and neighbor expansion, without fully isolating each one’s contribution. The seed overlap analysis gave us some signal on the optimizer, but a cleaner experiment would test them independently.

For now, we know the combination helps on hard questions and that the cost is 41% more per query. Whether that tradeoff makes sense depends entirely on what your users are actually asking.

Notes

I think we can conclude from this article that doing evals is difficult, and it’s even harder to put an experiment like this on paper.

I wish I could give you a clean answer, but it’s complicated.

I would personally say, though, that fabrication is worse than being overly verbose. But still, if your corpus is extremely clean and every answer usually points to a specific chunk, neighbor expansion is overkill.

This just tells you that these fancy features are a form of insurance.

Nevertheless, I hope it was informative. Let me know what you thought by connecting with me on LinkedIn, Medium, or via my website.
