Building an Over-Engineered Retrieval System

Something you’ll quickly encounter when doing AI engineering work is that there’s no real blueprint to follow.

Yes, for the most basic parts of retrieval (the “R” in RAG), you can chunk documents, run semantic search on a query, re-rank the results, and so on. This part is well-known.

But once you start digging into this area, you begin to ask questions like: how can we call a system intelligent if it can only read a few chunks here and there in a document? How do we make sure it has enough information to actually answer intelligently?

Soon, you’ll find yourself going down a rabbit hole, trying to figure out what others are doing in their own orgs, because none of this is properly documented and people are still building their own setups.

This will lead you to implement various optimization strategies: building custom chunkers, rewriting user queries, using different search methods, filtering with metadata, and expanding context to include neighboring chunks.

That’s why I’ve built a deliberately bloated retrieval system to show you how it works. Let’s walk through it so we can see the results of each step, but also discuss the trade-offs.

To demo this system in public, I decided to embed 150 recent ArXiv papers (2,250 pages) that mention RAG. This means the system we’re testing here is designed for scientific papers, and all of the test queries will be RAG-related.

I have collected the raw outputs of each step for a few queries in this repository, if you want to look at the whole thing in detail.

For the tech stack, I’m using Qdrant and Redis to store data, and Cohere and OpenAI for the LLMs. I don’t rely on any framework to build the pipelines (frameworks make them harder to debug).

Recap: retrieval & RAG

If you work with AI knowledge systems like Copilot (where you feed it your custom docs to answer from), you’re working with a RAG system.

RAG stands for Retrieval-Augmented Generation and is split into two parts: the retrieval part and the generation part.

Retrieval refers to the process of fetching information from your files, using keyword and semantic matching, based on a user query. The generation part is where the LLM comes in and answers based on the provided context and the user query.

For anyone new to RAG, it might seem like a clunky way to build systems. Shouldn’t an LLM do most of the work on its own?

Unfortunately, LLMs are static, and we need to engineer systems so that every time we call them, we give them everything they need upfront to answer the question.

I have written about building RAG bots for Slack before. That one uses standard chunking methods, if you’re keen to get a sense of how people build something simple.

This article goes a step further and rebuilds the entire retrieval pipeline without any frameworks, doing some fancy stuff like building a multi-query optimizer, fusing results, and expanding the chunks to build better context for the LLM.

As we’ll see, though, all of these fancy additions have to be paid for in latency and extra work.

Processing different documents

As with any data engineering problem, your first hurdle will be to architect how to store data. With retrieval, we focus on something called chunking, and how you do it and what you store with each chunk is crucial to building a well-engineered system.

When we do retrieval, we search text, and to do that we need to split the text into different chunks. These pieces of text are what we’ll later search to find a match for a query.

The simplest systems use general chunkers, simply splitting the full text by length, paragraph, or sentence.

But every document is different, so by doing this you risk losing context.

To understand this, look at different documents and see how they all follow different structures. You might have an HR document with clear section headers, and API docs with unnumbered sections that use code blocks and tables.

If you applied the same chunking logic to all of these, you’d risk splitting each text the wrong way. This means that once the LLM gets the chunks, the information will be incomplete, which can cause it to fail at producing an accurate answer.

Moreover, for each chunk, you also have to think about the data you want it to carry.

Should it contain certain metadata so the system can apply filters? Should it link to similar information so it can connect data? Should it hold context so the LLM understands where the information comes from?

This means that the architecture of how you store data becomes the most important part. If you start storing information and later realize it’s not enough, you’ll have to redo it. If you realize you’ve over-complicated the system, you’ll have to start from scratch.

This system will ingest Excel files and PDFs, focusing on adding context, keys, and neighbors. This will let you see what that looks like when we do retrieval later.

Ingesting tabular files

First, we’ll go through how you can chunk tabular data, add context, and keep information connected with keys.

When dealing with already structured tabular data, like Excel files, the obvious approach might seem to be letting the system query it directly. But semantic matching is actually quite effective for messy user queries.

SQL or direct queries only work if you already know the schema and exact fields. For instance, if a user asks for “Mazda 2023 specs,” semantically matching rows will give us something to go on.

I’ve talked to companies that wanted their system to match documents across different Excel files. To do this, we can store keys together with the chunks (without going full knowledge graph).

So, for instance, if we’re working with Excel files containing purchase data, we could ingest each row like so:

{
    "chunk_id": "Sales_Q1_123::row::1",
    "doc_id": "Sales_Q1_123:1234",
    "location": {"sheet_name": "Sales Q1", "row_n": 1},
    "type": "chunk",
    "text": "OrderID: 1001234f67 \n Customer: Alice Hemsworth \n Products: Blue sweater 4, Red pants 6",
    "context": "Quarterly sales snapshot",
    "keys": {"OrderID": "1001234f67"}
}

If we determine later within the retrieval pipeline to attach information, we will do standard search using the keys to seek out connecting chunks. This enables us to make quick hops between documents without adding one other router step to the pipeline.

Very simplified: connecting keys between tabular documents | Image by author
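If you want a concrete picture of what such a key hop can look like, here is a minimal sketch using the Qdrant Python client, assuming chunk payloads are stored as shown above (the collection name and field paths are illustrative, not the exact schema used here):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def chunks_by_key(key_name: str, key_value: str, limit: int = 10):
    """Fetch chunks whose payload contains a matching key, e.g. an OrderID."""
    points, _ = client.scroll(
        collection_name="chunks",  # hypothetical collection name
        scroll_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key=f"keys.{key_name}",
                    match=models.MatchValue(value=key_value),
                )
            ]
        ),
        limit=limit,
        with_payload=True,
    )
    return [p.payload for p in points]

# Hop from a sales row to any other chunk that shares the same OrderID
related = chunks_by_key("OrderID", "1001234f67")
```

Because this is a plain payload filter rather than a vector search, the hop is cheap and doesn’t need another LLM or router step.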

We can also store a summary for each document. This acts as a gatekeeper to its chunks.

{
    "chunk_id": "Sales_Q1::summary",
    "doc_id": "Sales_Q1_123:1234",
    "location": {"sheet_name": "Sales Q1"},
    "type": "summary",
    "text": "Sheet tracks Q1 orders for 2025, type of product, and customer names for reconciliation.",
    "context": ""
}

The gatekeeper summary idea might be a bit hard to grasp at first, but it also helps to have the summary stored at the document level if you need it when building the context later.

When the LLM writes this summary (and a brief context string), it can also suggest the key columns (i.e. order IDs and so on).
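As a rough illustration of that step, here is a hedged sketch of asking an LLM for the summary, the short context string, and candidate key columns in a single call (the prompt wording, model choice, and function name are my own, not the exact implementation used here):

```python
import json
from openai import OpenAI

client = OpenAI()

def summarize_sheet(sheet_name: str, sample_rows: list[str]) -> dict:
    """Ask the LLM for a gatekeeper summary, a short context string,
    and which columns look like join keys."""
    prompt = (
        f"You are indexing the sheet '{sheet_name}' for retrieval.\n"
        "Sample rows:\n" + "\n".join(sample_rows) + "\n\n"
        'Return JSON: {"summary": "...", "context": "...", "key_columns": ["..."]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any JSON-capable model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

The returned key_columns are what end up in the "keys" field of each row-level chunk.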

For this system with the ArXiv papers, I’ve ingested two Excel files that contain information at the title and author level.

The chunks will look something like this:

{
    "chunk_id": "titles::row::8817::250930134607",
    "doc_id": "titles::250930134607",
    "location": {
      "sheet_name": "titles",
      "row_n": 8817
    },
    "type": "chunk",
    "text": "id: 2507 2114ntitle: Gender Similarities Dominate Mathematical Cognition on the Neural Level: A Japanese fMRI Study Using Advanced Wavelet Evaluation and Generative AInkeywords: FMRI; Functional Magnetic Resonance Imaging; Gender Differences; Machine Learning; Mathematical Performance; Time Frequency Evaluation; Waveletnabstract_url: https://arxiv.org/abs/2507.21140ncreated: 2025-07-23 00:00:00 UTCnauthor_1: Tatsuru Kikuchi",
    "context": "Analyzing trends in AI and computational research articles.",
    "keys": {
      "id": "2507 2114",
      "author_1": "Tatsuru Kikuchi"
    }
 }

These Excel files weren’t strictly necessary (the PDF files would have been enough), but they’re a way to demo how the system can look up keys to find connecting information.

I created summaries for these files too.

{
    "chunk_id": "titles::summary::250930134607",
    "doc_id": "titles::250930134607",
    "location": {
      "sheet_name": "titles"
    },
    "type": "summary",
    "text": "The dataset consists of articles with various attributes including ID, title, keywords, authors, and publication date. It comprises a complete of 2508 rows with a wealthy number of topics predominantly around AI, machine learning, and advanced computational methods. Authors often contribute in teams, indicated by multiple writer columns. The dataset serves academic and research purposes, enabling catego",
 }

We also store information in Redis at the document level, which tells us what the document is about, where to find it, who’s allowed to see it, and when it was last updated. This allows us to update stale information later.
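For reference, a minimal sketch of such a document-level record in Redis could look like this (the key name and field set are illustrative, not the exact schema used here):

```python
import redis

r = redis.Redis(decode_responses=True)

# Hypothetical document-level record; values mirror the Excel example above.
r.hset(
    "doc::titles::250930134607",
    mapping={
        "title": "titles.csv",
        "context": "Analyzing trends in AI and computational research articles.",
        "source": "docs_ingestor/docs/titles.csv",   # illustrative path
        "allowed_groups": "research,admin",           # who may see it
        "last_updated": "2025-09-30T13:46:07+00:00",  # used to refresh stale docs
    },
)
```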

Now let’s turn to PDF files, which are the worst monster you’ll deal with.

Ingesting PDF docs

To process PDF files, we do similar things as with tabular data, but chunking them is much harder, and we store neighbors instead of keys.

To start processing PDFs, we have several frameworks to work with, such as LlamaParse and Docling, but none of them are perfect, so we have to build out the system further.

PDF documents are very hard to process, as most don’t follow the same structure. They also often contain figures and tables that most systems can’t handle properly.

However, a tool like Docling can help us at least parse normal tables properly and map each element to the correct page and element number.
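A minimal sketch of that parsing step, assuming Docling’s current Python API, could look like this:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("docs_ingestor/docs/arxiv/2507.18910.pdf")

# Walk the parsed elements and keep track of which page each one came from.
for item, _level in result.document.iterate_items():
    prov = getattr(item, "prov", [])
    page_no = prov[0].page_no if prov else None
    print(type(item).__name__, page_no, getattr(item, "text", "")[:60])
```

From this element stream we can attach page numbers and element positions to every chunk we later build.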

From here, we can create our own programmatic logic by mapping sections and subsections for each element, and smart-merging snippets so chunks read naturally (i.e. don’t split mid-sentence).

We also make sure to group chunks by section, keeping them together by linking their IDs in a field called section_neighbours.

This allows us to keep the chunks small but still expand them after retrieval.

The end result will look something like this:

{
    "chunk_id": "S3::C02::251009105423",
    "doc_id": "2507.18910v1",
    "location": {
      "page_start": 2,
      "page_end": 2
    },
    "type": "chunk",
    "text": "1 Introductionnn1.1 Background and MotivationnnLarge-scale pre-trained language models have demonstrated a capability to store vast amounts of factual knowledge of their parameters, but they struggle with accessing up-to-date information and providing verifiable sources. This limitation has motivated techniques that augment generative models with information retrieval. Retrieval-Augmented Generation (RAG) emerged as an answer to this problem, combining a neural retriever with a sequence-to-sequence generator to ground outputs in external documents [52]. The seminal work of [52] introduced RAG for knowledge-intensive tasks, showing that a generative model (built on a BART encoder-decoder) could retrieve relevant Wikipedia passages and incorporate them into its responses, thereby achieving state-of-the-art performance on open-domain query answering. RAG is built upon prior efforts wherein retrieval was used to boost query answering and language modeling [48, 26, 45]. Unlike earlier extractive approaches, RAG produces free-form answers while still leveraging non-parametric memory, offering the perfect of each worlds: improved factual accuracy and the flexibility to cite sources. This capability is particularly essential to mitigate hallucinations (i.e., believable but incorrect outputs) and to permit knowledge updates without retraining the model [52, 33].",
    "context": "Systematic review of RAG's development and applications in NLP, addressing challenges and advancements.",
    "section_neighbours": {
      "before": [
        "S3::C01::251009105423"
      ],
      "after": [
        "S3::C03::251009105423",
        "S3::C04::251009105423",
        "S3::C05::251009105423",
        "S3::C06::251009105423",
        "S3::C07::251009105423"
      ]
    },
    "keys": {}
 }

When we arrange data like this, we can think of these chunks as seeds. We’re searching for where relevant information may be, based on the user query, and expanding from there.

The difference from simpler RAG systems is that we try to take advantage of the LLM’s growing context window to send in more information (though there are obvious trade-offs to this).

You’ll see a messy version of what this looks like when we build the context in the retrieval pipeline later.

Building the retrieval pipeline

Since I’ve built this pipeline piece by piece, we can test each part and go through why we make certain decisions in how we retrieve and transform information before handing it over to the LLM.

We’ll go through semantic, hybrid, and BM25 search, building a multi-query optimizer, re-ranking results, expanding content to build the context, and then handing the results to an LLM to answer.

We’ll end the section with some discussion of latency, unnecessary complexity, and what to cut to make the system faster.

If you want to look at the output of several runs of this pipeline, go to this repository.

Semantic, BM25 and hybrid search

The first part of this pipeline is to make sure we’re getting back relevant documents for a user query. To do that, we work with semantic, BM25, and hybrid search.

For simple retrieval systems, people will often just use semantic search. To perform semantic search, we embed dense vectors for each chunk of text using an embedding model.

If this is new to you, note that embeddings represent each piece of text as a point in a high-dimensional space. The position of each point reflects how the model understands its meaning, based on patterns it learned during training.

Texts with similar meanings will then end up close together.

This means that if the model has seen many examples of similar language, it becomes better at placing related texts near one another, and therefore better at matching a query with the most relevant content.

To create dense vectors, I used OpenAI’s Large embedding model, since I’m working with scientific papers.

This model is costlier than their small one and maybe not ideal for this use case.

I would look into specialized models for specific domains, or consider fine-tuning your own. Remember: if the embedding model hasn’t seen many examples like the texts you’re embedding, it will be harder to match them to relevant documents.
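For completeness, here is a small sketch of the embedding call, assuming OpenAI’s text-embedding-3-large is the “large” model referred to above:

```python
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Create dense vectors for a batch of chunk texts or queries."""
    resp = client.embeddings.create(
        model="text-embedding-3-large",  # assumption: the "large" model mentioned above
        input=texts,
    )
    return [d.embedding for d in resp.data]

chunk_vectors = embed(["1 Introduction\n\n1.1 Background and Motivation ..."])
```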

To support hybrid and BM25 search, we also build a lexical index (sparse vectors). BM25 works on exact tokens (for example, “ID 826384”) instead of returning “similar-meaning” text the way semantic search does.
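The real system builds its sparse index inside the vector store, but the exact-token behaviour of BM25 is easy to see with a small standalone sketch (using the rank_bm25 package purely for illustration):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "OrderID: 1001234f67 Customer: Alice Hemsworth",
    "author_name: Anirban Saha Anik n_papers: 2",
    "Retrieval-Augmented Generation grounds LLM outputs in retrieved documents",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

# An exact token like an ID only scores the document that actually contains it;
# the other documents get a score of zero.
print(bm25.get_scores("1001234f67".lower().split()))
```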

To test semantic search, we’ll run a question that I believe the ingested papers can answer; the top results look like this:

[1] rating=0.5071 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
  text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts function hard negatives. Conventional RAG, i.e. , simply appending * Corresponding writer 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the variety of questions appropriately answered without retrieval in a closed-book setting. Blue and yellow bars show performance when supplied with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 1 8k 25.1% 49.1% 39.6% 47.5% 6k 4k 1 2k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and proper parametric knowledge (Ren et al., 2025). This misalignment results in overriding correct internal representations, leading to substantial performance degradation on questions that the model initially answered appropriately. As shown in Figure 1, we observed significant performance drops of 25.149.1% across state-of-the-
[2] rating=0.5022 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
  text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is straightforward and effective, its underlying influence on LLM stays unclear. Moreover, long contexts containing noise documents create computational overhead. Due to this fact, it will be significant to design more principled strategies that may achieve similar advantages without incurring excessive cost.
[3] rating=0.4982 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
  text: 4 Experiments 4.3 Evaluation Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating 4 decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, each GD(0) and DoLA generate incorrect answers (e.g., '18 minutes'), suggesting limited capability to integrate contextual evidence. Similarly, while CS produces a partially relevant response ('Texas Revolution'), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.
[4] rating=0.4857 doc=docs_ingestor/docs/arxiv/2507.23588.pdf chunk=S6::C03::251009122456
  text: 4 Results Figure 4: Change in attention pattern distribution in numerous models. For DiffLoRA variants we plot attention mass for most important component (green) and denoiser component (yellow). Note that focus mass is normalized by the variety of tokens in each a part of the sequence. The negative attention is shown after it's scaled by λ . DiffLoRA corresponds to the variant with learnable λ and LoRa parameters in each terms. BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY 0 0.2 0.4 0.6 BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY Llama-3.2-1B LoRA DLoRA-32 DLoRA, Tulu-3 perform similarly because the initial model, nonetheless they're outperformed by LoRA. When increasing the context length with more sample demonstrations, DiffLoRA seems to struggle much more in TREC-fine and Banking77. This is likely to be on account of the character of instruction tuned data, and the max_sequence_length = 4096 applied during finetuning. LoRA is less impacted, likely since it diverges less
[5] rating=0.4838 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C03::251009131027
  text: 1 Introduction To mitigate context-memory conflict, existing studies comparable to adaptive retrieval (Ren et al., 2025; Baek et al., 2025) and the decoding strategies (Zhao et al., 2024; Han et al., 2025) adjust the influence of external context either before or during answer generation. Nonetheless, on account of the LLM's limited capability in detecting conflicts, it's at risk of misleading contextual inputs that contradict the LLM's parametric knowledge. Recently, robust training has equipped LLMs, enabling them to discover conflicts (Asai et al., 2024; Wang et al., 2024). As shown in Figure 2(a), it enables the LLM to dis-
[6] rating=0.4827 doc=docs_ingestor/docs/arxiv/2508.05266.pdf chunk=S27::C03::251009123532
  text: B. Subclassification Criteria for Misinterpretation of Design Specifications Initially, regarding long-context scenarios, we observed that directly prompting LLMs to generate RTL code based on lengthy contexts often resulted in certain code segments failing to accurately reflect high-level requirements. Nonetheless, by manually decomposing the long context-retaining only the important thing descriptive text relevant to the erroneous segments while omitting unnecessary details-the LLM regenerated RTL code that appropriately matched the specifications. As shown in Fig 23, after manual decomposition of the long context, the LLM successfully generated the right code. This demonstrates that redundancy in long contexts is a limiting consider LLMs' ability to generate accurate RTL code.
[7] rating=0.4798 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C02::251009132038
  text: 1 Introductions Figure 1: Illustration for layer-wise behavior in LLMs for RAG. Given a question and retrieved documents with the right answer ('Real Madrid'), shallow layers capture local context, middle layers concentrate on answer-relevant content, while deep layers may over-rely on internal knowledge and hallucinate (e.g., 'Barcelona'). Our proposal, LFD fuses middle-layer signals into the ultimate output to preserve external knowledge and improve accuracy. Shallow Layers Middle Layers Deep Layers Who has more la liga titles real madrid or barcelona? …Nine teams have been crowned champions, with Real Madrid winning the title a record 33 times and Barcelona 25 times … Query Retrieved Document …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Short-context Modeling Deal with Right Answer Answer is barcelona Mistaken Answer LLMs …with Real Madrid winning the title a record 33 times and Barcelona 25 times … …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Internal Knowledge Confou

From the results above, we can see that it’s able to match some interesting passages that discuss topics which may answer the question.

If we try BM25 (which matches exact tokens) with the same query, we get back these results:

[1] rating=22.0764 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
  text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets inside the same project are helpful for code completion, even in the event that they usually are not entirely replicable. On this step, we also retrieve similar code snippets. Following RepoCoder, we now not use the unfinished code because the query but as an alternative use the code draft, since the code draft is closer to the bottom truth in comparison with the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a listing sorted by scores. On account of the doubtless large differences in length between code snippets, we now not use the top-k method. As an alternative, we get code snippets from the best to the bottom scores until the preset context length is filled.
[2] rating=17.4931 doc=docs_ingestor/docs/arxiv/2508.09105.pdf chunk=S20::C08::251009124222
  text: C. Ablation Studies Ablation result across White-Box attribution: Table V shows the comparison end in methods of WhiteBox Attribution with Noise, White-Box Attrition with Alternative Model and our current method Black-Box zero-gradient Attribution with Noise under two LLM categories. We will know that: First, The White-Box Attribution with Noise is under the specified condition, thus the typical Accuracy Rating of two LLMs get the 0.8612 and 0.8073. Second, the the choice models (the 2 models are exchanged for attribution) reach the 0.7058 and 0.6464. Finally, our current method Black-Box Attribution with Noise get the Accuracy of 0.7008 and 0.6657 by two LLMs.
[3] rating=17.1458 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S4::C03::251009123245
  text: Preliminaries Based on this, inspired by existing analyses (Zhang et al. 2024c), we measure the quantity of knowledge a position receives using discrete entropy, as shown in the next equation: which quantifies how much information t i receives from the eye perspective. This insight suggests that LLMs struggle with longer sequences when not trained on them, likely on account of the discrepancy in information received by tokens in longer contexts. Based on the previous evaluation, the optimization of attention entropy should concentrate on two facets: The data entropy at positions which are relatively essential and sure contain key information should increase.

Here, the results are lackluster for this query, but sometimes queries include specific keywords we need to match, and that’s where BM25 is the better choice.

We can test this by changing the query to one that targets a specific author name and running it through BM25.

[1] rating=62.3398 doc=authors.csv chunk=authors::row::1::251009110024
  text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] rating=56.4007 doc=titles.csv chunk=titles::row::24::251009110138
  text: id: 2509.01058 title: Speaking on the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL keywords: Controlled-Literacy; Health Misinformation; Public Health; RAG; RL; Reinforcement Learning; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2509.01058 created: 2025-09-10 00:00:00 UTC author_1: Xiaoying Song author_2: Anirban Saha Anik author_3: Dibakar Barua author_4: Pengcheng Luo author_5: Junhua Ding author_6: Lingzi Hong
[3] rating=56.2614 doc=titles.csv chunk=titles::row::106::251009110138
  text: id: 2507.07307 title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation keywords: Evidence Enhancement; Health Misinformation; LLMs; Large Language Models; RAG; Response Refinement; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2507.07307 created: 2025-07-27 00:00:00 UTC author_1: Anirban Saha Anik author_2: Xiaoying Song author_3: Elliott Wang author_4: Bryan Wang author_5: Bengisu Yarimbas author_6: Lingzi Hong

All of the results above mention “Anirban Saha Anik,” which is exactly what we’re looking for.

If we ran this with semantic search, it would return not only the name “Anirban Saha Anik” but similar names as well.

[1] rating=0.5810 doc=authors.csv chunk=authors::row::1::251009110024
  text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] rating=0.4499 doc=authors.csv chunk=authors::row::55::251009110024
  text: author_name: Anand A. Rajasekar n_papers: 1 article_1: 2508.0199
[3] rating=0.4320 doc=authors.csv chunk=authors::row::59::251009110024
  text: author_name: Anoop Mayampurath n_papers: 1 article_1: 2508.14817
[4] rating=0.4306 doc=authors.csv chunk=authors::row::69::251009110024
  text: author_name: Avishek Anand n_papers: 1 article_1: 2508.15437
[5] rating=0.4215 doc=authors.csv chunk=authors::row::182::251009110024
  text: author_name: Ganesh Ananthanarayanan n_papers: 1 article_1: 2509.14608

This is a good example of how semantic search isn’t always the right method: similar names don’t necessarily mean they’re relevant to the query.

So, there are cases where semantic search is ideal, and others where BM25 (token matching) is the better choice.

We can also use hybrid search, which combines semantic and BM25.

You’ll see the results below from running hybrid search on the original query:

[1] rating=0.5000 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
  text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts function hard negatives. Conventional RAG, i.e. , simply appending * Corresponding writer 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the variety of questions appropriately answered without retrieval in a closed-book setting. Blue and yellow bars show performance when supplied with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 1 8k 25.1% 49.1% 39.6% 47.5% 6k 4k 1 2k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and proper parametric knowledge (Ren et al., 2025). This misalignment results in overriding correct internal representations, leading to substantial performance degradation on questions that the model initially answered appropriately. As shown in Figure 1, we observed significant performance drops of 25.149.1% across state-of-the-
[2] rating=0.5000 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
  text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets inside the same project are helpful for code completion, even in the event that they usually are not entirely replicable. On this step, we also retrieve similar code snippets. Following RepoCoder, we now not use the unfinished code because the query but as an alternative use the code draft, since the code draft is closer to the bottom truth in comparison with the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a listing sorted by scores. On account of the doubtless large differences in length between code snippets, we now not use the top-k method. As an alternative, we get code snippets from the best to the bottom scores until the preset context length is filled.
[3] rating=0.4133 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
  text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is straightforward and effective, its underlying influence on LLM stays unclear. Moreover, long contexts containing noise documents create computational overhead. Due to this fact, it will be significant to design more principled strategies that may achieve similar advantages without incurring excessive cost.
[4] rating=0.1813 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
  text: 4 Experiments 4.3 Evaluation Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating 4 decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, each GD(0) and DoLA generate incorrect answers (e.g., '18 minutes'), suggesting limited capability to integrate contextual evidence. Similarly, while CS produces a partially relevant response ('Texas Revolution'), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.

I found semantic search worked best for this query, which is why it can be useful to run multiple queries with different search methods to fetch the first chunks (though this also adds complexity).

So, let’s turn to building something that can transform the original query into several optimized versions and fuse the results.

Multi-query optimizer

For this part, we look at how we can optimize messy user queries by generating multiple targeted variations and choosing the right search method for each. It can improve recall, but it introduces trade-offs.

All of the agent abstraction systems you see will often transform the user query when performing search. For example, when you use the QueryTool in LlamaIndex, it uses an LLM to optimize the incoming query.

We can rebuild this part ourselves, but instead give it the ability to create multiple queries, while also setting the search method.

As for creating a number of queries, I would try to keep it simple, as issues here will cause low-quality retrieval results. The more unrelated queries the system generates, the more noise it introduces into the pipeline.

The function I’ve created here will generate 1–3 academic-style queries, together with the search method to use, based on a messy user query.
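A hedged sketch of what such an optimizer can look like (the prompt wording, model, and function name are illustrative, not the exact implementation used here):

```python
import json
from openai import OpenAI

client = OpenAI()

OPTIMIZER_PROMPT = """Rewrite the user's question into 1-3 short academic search queries.
For each query choose a search method: "semantic", "hybrid", or "bm25"
(use "bm25" for exact identifiers such as names or IDs).
Return JSON: {"queries": [{"text": "...", "method": "..."}]}"""

def optimize_query(user_query: str) -> list[dict]:
    """Turn a messy user query into targeted (query, method) pairs."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: the article doesn't name the model used
        messages=[
            {"role": "system", "content": OPTIMIZER_PROMPT},
            {"role": "user", "content": user_query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["queries"]

print(optimize_query("why is everyone saying RAG doesn't scale? how are people fixing that?"))
```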

Original query:
why is everyone saying RAG doesn't scale? how are people fixing that?

Generated queries:
- hybrid: RAG scalability issues
- hybrid: solutions to RAG scaling challenges

We’ll get back results like these:

Query 1 (hybrid) top 20 for query: RAG scalability issues

[1] rating=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
  text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to take care of large knowledge corpora and efficient retrieval indices. Systems must handle thousands and thousands or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and price management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (comparable to cascaded retrieval) turn into essential at scale, especially in large deployments like web serps.
[2] rating=0.5000 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
  text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to boost the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues related to naive RAG implementations by incorporating techniques comparable to knowledge graphs, a hybrid retrieval approach, and document summarization to cut back training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for firms looking for robust question-answering systems.

[...]

Query 2 (hybrid) top 20 for query: solutions to RAG scaling challenges

[1] rating=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
  text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to take care of large knowledge corpora and efficient retrieval indices. Systems must handle thousands and thousands or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and price management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (comparable to cascaded retrieval) turn into essential at scale, especially in large deployments like web serps.
[2] rating=0.5000 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
  text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a sturdy and scalable solution for RAG systems coping with long-context scenarios. Our most important contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across various context lengths, and allocates attention to essential segments. It addresses the critical challenge of context expansion in RAG.

[...]

We can also test the system with specific keywords like names and IDs to make sure it chooses BM25 rather than semantic search.

Original query:
any papers from Chenxin Diao?

Generated queries:
- BM25: Chenxin Diao

This will pull up results where Chenxin Diao is clearly mentioned.

If you want to do this even better, you can build a retrieval system that generates a few example queries based on the input, so when the original query comes in, you fetch examples to help guide the optimizer.

This helps because smaller models aren’t great at transforming messy human queries into ones with more precise academic phrasing.

To give you an example, when a user asks why the LLM is lying, the optimizer may transform the query into something like “causes of inaccuracies in large language models” rather than directly searching for “hallucinations.”

Once we fetch the results in parallel, we fuse them using Reciprocal Rank Fusion (RRF). The fused result will look something like this:

RRF Fusion top 38 for query: why is everyone saying RAG doesn't scale? how are people fixing that?

[1] rating=0.0328 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
  text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to take care of large knowledge corpora and efficient retrieval indices. Systems must handle thousands and thousands or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and price management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (comparable to cascaded retrieval) turn into essential at scale, especially in large deployments like web serps.
[2] rating=0.0313 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C42::251104142800
  text: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency benefits [10].
[3] rating=0.0161 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
  text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to boost the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues related to naive RAG implementations by incorporating techniques comparable to knowledge graphs, a hybrid retrieval approach, and document summarization to cut back training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for firms looking for robust question-answering systems.
[4] rating=0.0161 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
  text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a sturdy and scalable solution for RAG systems coping with long-context scenarios. Our most important contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across various context lengths, and allocates attention to essential segments. It addresses the critical challenge of context expansion in RAG.

[...]

We see that there are some good matches, but also a few irrelevant ones that we’ll need to filter out further.
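The fusion step itself is simple. Here is a minimal Reciprocal Rank Fusion sketch, assuming each generated query returns a ranked list of chunk IDs:

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk_id."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Each inner list is the ranked chunk_ids returned by one generated query.
fused = rrf_fuse([
    ["S22::C05::251104142800", "SDOC::SUM::251104135247"],
    ["S22::C05::251104142800", "S3::C06::251104155301"],
])
```

With the conventional k = 60, a chunk ranked first by two queries scores 2/61 ≈ 0.0328, which lines up with the top result in the fused output above.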

As a note before we move on, this is probably the step to cut or optimize if you’re trying to reduce latency.

I find LLMs aren’t great at creating search queries that actually pull up useful information, so if this step isn’t done right, it just adds more noise.

Adding a re-ranker

We do get results back from the retrieval system, and some of them are good while others are irrelevant, so most retrieval systems use a re-ranker of some kind.

A re-ranker takes in several chunks and gives each one a relevance score based on the original user query. You have several choices here, including using something smaller, but I’ll use Cohere’s re-ranker.
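Calling it is straightforward. Here is a sketch with the Cohere Python SDK, using the model and threshold you’ll see in the run summaries below (the helper name and chunk format are illustrative):

```python
import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank(query: str, chunks: list[dict], threshold: float = 0.35, top_n: int = 20):
    """Score fused candidate chunks against the original user query and
    keep only those above the threshold."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in chunks],
        top_n=top_n,
    )
    return [
        (chunks[r.index], r.relevance_score)
        for r in response.results
        if r.relevance_score >= threshold
    ]
```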

We can test this re-ranker on the first query we used in the previous section:

[... optimizer... retrieval... fuse...]

Rerank summary:
- strategy=cohere
- model=rerank-english-v3.0
- candidates=32
- eligible_above_threshold=4
- kept=4 (reranker_threshold=0.35)

Reranked Relevant (4/32 kept ≥ 0.35) top 4 for query: why is everyone saying RAG doesn't scale? how are people fixing that?

[1] rating=0.7920 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=S4::C08::251104135247
  text: 1 Introduction Scalability: Naive implementations of Retrieval-Augmented Generation (RAG) often depend on 16-bit floating-point large language models (LLMs) for the generation component. Nonetheless, this approach introduces significant scalability challenges on account of the increased memory demands required to host the LLM in addition to longer inference times on account of using a better precision number type. To enable more efficient scaling, it's crucial to integrate methods or techniques that reduce the memory footprint and inference times of generator models. Quantized models offer more scalable solutions on account of less computational requirements, hence when developing RAG systems we should always aim to make use of quantized LLMs for less expensive deployment as in comparison with a full fine-tuned LLM whose performance is likely to be good but is costlier to deploy on account of higher memory requirements. A quantized LLM's role within the RAG pipeline itself must be minimal and for technique of rewriting retrieved information right into a presentable fashion for the top users
[2] rating=0.4749 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C42::251104142800
  text: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency benefits [10].
[3] rating=0.4304 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
  text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to take care of large knowledge corpora and efficient retrieval indices. Systems must handle thousands and thousands or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and price management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (comparable to cascaded retrieval) turn into essential at scale, especially in large deployments like web serps.
[4] rating=0.3556 doc=docs_ingestor/docs/arxiv/2509.13772.pdf chunk=S11::C02::251104182521
  text: 7. Discussion and Limitations Scalability of RAGOrigin: We extend our evaluation by scaling the NQ dataset's knowledge database to 16.7 million texts, combining entries from the knowledge database of NQ, HotpotQA, and MS-MARCO. Using the identical user questions from NQ, we assess RAGOrigin's performance under larger data volumes. As shown in Table 16, RAGOrigin maintains consistent effectiveness and performance even on this significantly expanded database. These results display that RAGOrigin stays robust at scale, making it suitable for enterprise-level applications requiring large

Remember, at this point we’ve already transformed the user query, done semantic or hybrid search, and fused the results before passing the chunks to the re-ranker.

If you look at the results, you can clearly see that it’s able to identify a few relevant chunks we can use as seeds.

You can also see that it returns multiple chunks from the same document. We’ll handle this later in the context construction, but if you want unique documents fetched, you can add custom logic here that caps the number of unique docs rather than chunks.
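Such a cap can be as simple as the sketch below, assuming the reranked list is already sorted by score and each chunk carries its doc_id (the helper name is illustrative):

```python
def limit_unique_docs(reranked: list[tuple[dict, float]], max_docs: int = 4):
    """Keep only the best-scoring chunk per document."""
    seen: set[str] = set()
    kept: list[tuple[dict, float]] = []
    for chunk, score in reranked:  # reranked is sorted by score, highest first
        if chunk["doc_id"] in seen:
            continue
        seen.add(chunk["doc_id"])
        kept.append((chunk, score))
        if len(kept) == max_docs:
            break
    return kept
```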

We can try this with another query:

[... optimizer... retrieval... fuse...]

Rerank summary:
- strategy=cohere
- model=rerank-english-v3.0
- candidates=35
- eligible_above_threshold=12
- kept=5 (threshold=0.2)

Reranked Relevant (5/35 kept ≥ 0.2) top 5 for query: hallucinations in rag vs normal llms and how to reduce them

[1] rating=0.9965 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S7::C03::251104164901
  text: 5 Related Work Hallucinations in LLMs Hallucinations in LLMs confer with instances where the model generates false or unsupported information not grounded in its reference data [42]. Existing mitigation strategies include multi-agent debating, where multiple LLM instances collaborate to detect inconsistencies through iterative debates [8, 14]; self-consistency verification, which aggregates and reconciles multiple reasoning paths to cut back individual errors [53]; and model editing, which directly modifies neural network weights to correct systematic factual errors [62, 19]. While RAG systems aim to ground responses in retrieved external knowledge, recent studies show that they still exhibit hallucinations, especially those who contradict the retrieved content [50]. To handle this limitation, our work conducts an empirical study analyzing how LLMs internally process external knowledge
[2] rating=0.9342 doc=docs_ingestor/docs/arxiv/2508.05509.pdf chunk=S3::C01::251104160034
  text: Introduction Large language models (LLMs), like Claude (Anthropic 2024), ChatGPT (OpenAI 2023) and the Deepseek series (Liu et al. 2024), have demonstrated remarkable capabilities in lots of real-world tasks (Chen et al. 2024b; Zhou et al. 2025), comparable to query answering (Allam and Haggag 2012), text comprehension (Wright and Cervetti 2017) and content generation (Kumar 2024). Despite the success, these models are sometimes criticized for his or her tendency to provide hallucinations, generating incorrect statements on tasks beyond their knowledge and perception (Ji et al. 2023; Zhang et al. 2024). Recently, retrieval-augmented generation (RAG) (Gao et al. 2023; Lewis et al. 2020) has emerged as a promising solution to alleviate such hallucinations. By dynamically leveraging external knowledge from textual corpora, RAG enables LLMs to generate more accurate and reliable responses without costly retraining (Lewis et al. 2020; Figure 1: Comparison of three paradigms. LAG exhibits greater lightweight properties in comparison with GraphRAG while
[3] rating=0.9030 doc=docs_ingestor/docs/arxiv/2509.13702.pdf chunk=S3::C01::251104182000
  text: ABSTRACT Hallucination stays a critical barrier to the reliable deployment of Large Language Models (LLMs) in high-stakes applications. Existing mitigation strategies, comparable to Retrieval-Augmented Generation (RAG) and post-hoc verification, are sometimes reactive, inefficient, or fail to deal with the basis cause inside the generative process. Inspired by dual-process cognitive theory, we propose D ynamic S elfreinforcing C alibration for H allucination S uppression (DSCC-HS), a novel, proactive framework that intervenes directly during autoregressive decoding. DSCC-HS operates via a two-phase mechanism: (1) During training, a compact proxy model is iteratively aligned into two adversarial roles-a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP)-through contrastive logit-space optimization using augmented data and parameter-efficient LoRA adaptation. (2) During inference, these frozen proxies dynamically steer a big goal model by injecting a real-time, vocabulary-aligned steering vector (computed because the 
[4] rating=0.9007 doc=docs_ingestor/docs/arxiv/2509.09360.pdf chunk=S2::C05::251104174859
  text: 1 Introduction Figure 1. Standard Retrieval-Augmented Generation (RAG) workflow. A user query is encoded right into a vector representation using an embedding model and queried against a vector database constructed from a document corpus. Essentially the most relevant document chunks are retrieved and appended to the unique query, which is then provided as input to a big language model (LLM) to generate the ultimate response. Corpus Retrieved_Chunks Vectpr DB Embedding model Query Response LLM Retrieval-Augmented Generation (RAG) [17] goals to mitigate hallucinations by grounding model outputs in retrieved, up-to-date documents, as illustrated in Figure 1. By injecting retrieved text from re- a
[5] rating=0.8986 doc=docs_ingestor/docs/arxiv/2508.04057.pdf chunk=S20::C02::251104155008
  text: Parametric knowledge can generate accurate answers. Effects of LLM hallucinations. To evaluate the impact of hallucinations when large language models (LLMs) generate answers without retrieval, we conduct a controlled experiment based on a straightforward heuristic: if a generated answer comprises numeric values, it's more more likely to be affected by hallucination. It is because LLMs are generally less reliable when producing precise facts comparable to numbers, dates, or counts from parametric memory alone (Ji et al. 2023; Singh et al. 2025). We filter out all directly answered queries (DQs) whose generated answers contain numbers, and we then rerun our DPR-AIS for these queries (referred to Exclude num ). The outcomes are reported in Tab. 5. Overall, excluding numeric DQs leads to barely improved performance. The common exact match (EM) increases from 35.03 to 35.12, and the typical F1 rating improves from 35.68 to 35.80. While these gains are modest, they arrive with a rise within the retriever activation (RA) ratio-from 75.5% to 78.1%.

This query also performs well enough (if you look at the full chunks returned).

We can also test messier user queries, like:

[... optimizer...]

Original query:
why is the llm lying and rag help with this?

Generated queries:
- semantic: explore reasons for LLM inaccuracies
- hybrid: RAG techniques for LLM truthfulness

[...retrieval... fuse...]

Rerank summary:
- strategy=cohere
- model=rerank-english-v3.0
- candidates=39
- eligible_above_threshold=39
- kept=6 (threshold=0)

Reranked Relevant (6/39 kept ≥ 0) top 6 for query: why is the llm lying and rag help with this?

[1] rating=0.0293 doc=docs_ingestor/docs/arxiv/2507.05714.pdf chunk=S3::C01::251104134926
  text: 1 Introduction Retrieval Augmentation Generation (hereafter known as RAG) helps large language models (LLMs) (OpenAI et al., 2024) reduce hallucinations (Zhang et al., 2023) and access real-time data 1 *Equal contribution.
[2] rating=0.0284 doc=docs_ingestor/docs/arxiv/2508.15437.pdf chunk=S3::C01::251104164223
  text: 1 Introduction Large language models (LLMs) augmented with retrieval have turn into a dominant paradigm for knowledge-intensive NLP tasks. In a typical retrieval-augmented generation (RAG) setup, an LLM retrieves documents from an external corpus and conditions generation on the retrieved evidence (Lewis et al., 2020b; Izacard and Grave, 2021). This setup mitigates a key weakness of LLMs-hallucination-by grounding generation in externally sourced knowledge. RAG systems now power open-domain QA (Karpukhin et al., 2020), fact verification (V et al., 2024; Schlichtkrull et al., 2023), knowledge-grounded dialogue, and explanatory QA.
[3] rating=0.0277 doc=docs_ingestor/docs/arxiv/2509.09651.pdf chunk=S3::C01::251104180034
  text: 1 Introduction Large Language Models (LLMs) have transformed natural language processing, achieving state-ofthe-art performance in summarization, translation, and query answering. Nonetheless, despite their versatility, LLMs are vulnerable to generating false or misleading content, a phenomenon commonly known as hallucination [9, 21]. While sometimes harmless in casual applications, such inaccuracies pose significant risks in domains that demand strict factual correctness, including medicine, law, and telecommunications. In these settings, misinformation can have severe consequences, starting from financial losses to safety hazards and legal disputes.
[4] rating=0.0087 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=S4::C08::251104135247
  text: 1 Introduction Scalability: Naive implementations of Retrieval-Augmented Generation (RAG) often depend on 16-bit floating-point large language models (LLMs) for the generation component. Nonetheless, this approach introduces significant scalability challenges on account of the increased memory demands required to host the LLM in addition to longer inference times on account of using a better precision number type. To enable more efficient scaling, it's crucial to integrate methods or techniques that reduce the memory footprint and inference times of generator models. Quantized models offer more scalable solutions on account of less computational requirements, hence when developing RAG systems we should always aim to make use of quantized LLMs for less expensive deployment as in comparison with a full fine-tuned LLM whose performance is likely to be good but is costlier to deploy on account of higher memory requirements. A quantized LLM's role within the RAG pipeline itself must be minimal and for technique of rewriting retrieved information right into a presentable fashion for the top users

Before we move on, I want to note that there are moments where this re-ranker doesn’t do that well, as you can see from the scores above.

At times it estimates that a chunk doesn’t answer the user’s query when it actually does, at least when we treat these chunks as seeds.

Normally for a re-ranker, each chunk should hint at the entire answer, but we’re using these chunks as seeds, so in some cases it will score results very low even though they’re enough for us to go on.

This is why I’ve kept the score threshold very low.

There may be better options here that you might want to explore, perhaps building a custom re-ranker that understands what you’re looking for.

Nevertheless, now that we have a few relevant chunks, we’ll use the metadata we set during ingestion to expand and fan out the chunks so the LLM gets enough context to understand how to answer the question.

Building the context

Now that we have a few chunks as seeds, we’ll pull up more information from Redis, expand, and build the context.

This step is clearly a lot more complicated, as you have to build logic for which chunks to fetch and how (keys if they exist, or neighbors if there are any), fetch information in parallel, and then clean out the chunks further.

Once you have all of the chunks (plus information about the documents themselves), you have to put them together: de-duping chunks, perhaps setting a limit on how far the system can expand, and highlighting which chunks were matched directly and which were expanded.
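To make that concrete, here is a hedged sketch of the neighbor-expansion part, assuming each chunk is stored in Redis as JSON under its chunk_id (the key naming and ordering logic are illustrative, not the exact implementation):

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def expand_seed(seed: dict, max_neighbours: int = 2) -> list[dict]:
    """Pull a seed chunk's closest section neighbours from Redis and de-dupe."""
    neighbours = seed.get("section_neighbours", {})
    wanted = (
        neighbours.get("before", [])[-max_neighbours:]
        + neighbours.get("after", [])[:max_neighbours]
    )
    expanded, seen = [seed], {seed["chunk_id"]}
    for chunk_id in wanted:
        raw = r.get(f"chunk::{chunk_id}")  # illustrative key naming
        if raw is None:
            continue
        chunk = json.loads(raw)
        if chunk["chunk_id"] not in seen:
            seen.add(chunk["chunk_id"])
            expanded.append(chunk)
    # Chunk IDs like S3::C01, S3::C02 sort in section order, so the text reads naturally.
    return sorted(expanded, key=lambda c: c["chunk_id"])
```

De-duping across seeds and marking which chunk was the original match (versus expanded) then happens before the context is rendered as Markdown.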

The end result will look something like this:

Expanded context windows (Markdown ready):

## Document #1 - Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Query Answering with LLMs
- `doc_id`: `doc::6371023da29b4bbe8242ffc5caf4a8cd`
- **Last Updated:** 2025-11-04T17:44:07.300967+00:00
- **Context:** Comparative study on methodologies for integrating knowledge graphs in QA systems using LLMs.
- **Content fetched inside document:**
```text
[start on page 4]
    LLMs in QA
    The advent of LLMs has ushered in a transformative era in NLP, particularly within the domain of QA. These models, pre-trained on massive corpora of diverse text, exhibit sophisticated capabilities in both natural language understanding and generation. Their proficiency in producing coherent, contextually relevant, and human-like responses to a broad spectrum of prompts makes them exceptionally well-suited for QA tasks, where delivering precise and informative answers is paramount. Recent advancements by models such as BERT [57] and ChatGPT [58] have significantly propelled the field forward. LLMs have demonstrated strong performance in open-domain QA scenarios-such as commonsense reasoning[20]-owing to their extensive embedded knowledge of the world. Furthermore, their ability to grasp and articulate responses to abstract or contextually nuanced queries and reasoning tasks [22] underscores their utility in addressing complex QA challenges that require deep semantic understanding. Despite their strengths, LLMs also pose challenges: they can exhibit contextual ambiguity or overconfidence in their outputs ('hallucinations')[21], and their substantial computational and memory requirements complicate deployment in resource-constrained environments.
    RAG, fine tuning in QA
    ---------------------- this was the passage that we matched to the query -------------
    LLMs also face problems with domain-specific QA or tasks where they are required to recall factual information accurately instead of just probabilistically generating whatever comes next. Research has also explored different prompting techniques, like chain-of-thought prompting[24], and sampling-based methods[23] to reduce hallucinations. Contemporary research increasingly explores strategies such as fine-tuning and retrieval augmentation to enhance LLM-based QA systems. Fine-tuning on domain-specific corpora (e.g., BioBERT for biomedical text [17], SciBERT for scientific text [18]) has been shown to sharpen model focus, reducing irrelevant or generic responses in specialized settings such as medical or legal QA. Retrieval-augmented architectures such as RAG [19] combine LLMs with external knowledge bases to try to further mitigate problems with factual inaccuracy and enable real-time incorporation of new information. Building on RAG's ability to bridge parametric and non-parametric knowledge, many modern QA pipelines introduce a lightweight re-ranking step [25] to sift through the retrieved contexts and promote passages that are most relevant to the question. However, RAG still faces several challenges. One key issue lies in the retrieval step itself-if the retriever fails to fetch relevant documents, the generator is left to hallucinate or provide incomplete answers. Furthermore, integrating noisy or loosely relevant contexts can degrade response quality rather than enhance it, especially in high-stakes domains where precision is critical. RAG pipelines are also sensitive to the quality and domain alignment of the underlying knowledge base, and they often require extensive tuning to balance recall and precision effectively.
    --------------------------------------------------------------------------------------
[end on page 5]
```

## Document #2 - Each to Their Own: Exploring the Optimal Embedding in RAG
- `doc_id`: `doc::3b9c43d010984d4cb11233b5de905555`
- **Last Updated:** 2025-11-04T14:00:38.215399+00:00
- **Context:** Enhancing Large Language Models using Retrieval-Augmented Generation techniques.
- **Content fetched inside document:**
```text
[start on page 1]
    1 Introduction
    Large language models (LLMs) have recently accelerated the pace of transformation across multiple fields, including transportation (Lyu et al., 2025), arts (Zhao et al., 2025), and education (Gao et al., 2024), through various paradigms such as direct answer generation, training from scratch on different types of data, and fine-tuning on target domains. However, the hallucination problem (Henkel et al., 2024) associated with LLMs has confused people for a long time, stemming from multiple factors such as a lack of knowledge of the given prompt (Huang et al., 2025b) and a biased training process (Zhao, 2025).
    Serving as a highly efficient solution, Retrieval-Augmented Generation (RAG) has been widely employed in building foundation models (Chen et al., 2024) and practical agents (Arslan et al., 2024). Compared to training methods like fine-tuning and prompt-tuning, its plug-and-play feature makes RAG an efficient, simple, and cost-effective approach. The main paradigm of RAG involves first calculating the similarities between a question and chunks in an external knowledge corpus, followed by incorporating the top K relevant chunks into the prompt to guide the LLMs (Lewis et al., 2020).
    Despite the advantages of RAG, selecting the appropriate embedding model remains a crucial concern, as the quality of retrieved references directly influences the generation results of the LLM (Tu et al., 2025). Variations in training data and model architecture result in different embedding models providing advantages across various domains. The differing similarity calculations across embedding models often leave researchers uncertain about how to select the optimal one. Consequently, improving the accuracy of RAG from the perspective of embedding models remains an ongoing area of research.
    ---------------------- this was the passage that we matched to the query -------------
    To address this research gap, we propose two methods for improving RAG by combining the benefits of multiple embedding models. The first method is called Mixture-Embedding RAG, which sorts the retrieved materials from multiple embedding models based on normalized similarity and selects the top K materials as final references. The second method is called Confident RAG, where we first utilize vanilla RAG to generate answers multiple times, each time employing a different embedding model and recording the associated confidence metrics, and then select the answer with the highest confidence level as the final response. By validating our approach using multiple LLMs and embedding models, we illustrate the superior performance and generalization of Confident RAG, even though Mixture-Embedding RAG may lose to vanilla RAG. The main contributions of this paper can be summarized as follows:
    We first point out that in RAG, different embedding models operate within their own prior domains. To leverage the strengths of various embedding models, we propose and test two novel RAG methods: Mixture-Embedding RAG and Confident RAG. These methods effectively utilize the retrieved results from different embedding models to their fullest extent.
    --------------------------------------------------------------------------------------
    While Mixture-Embedding RAG performs similarly to vanilla RAG, the Confident RAG method exhibits superior performance compared to both the vanilla LLM and vanilla RAG, with average improvements of 9.9% and 4.9%, respectively, when using the best confidence metric. Moreover, we discuss the optimal number of embedding models for the Confident RAG method based on the results.
[...]

The total context will contain a few documents and lands at around 2–3k tokens. There’s some waste here, but instead of deciding for the LLM, we send in more information so it can scan entire documents rather than isolated chunks.

For the system you build, you can cache this context as well so the LLM can answer follow-up questions.
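
A minimal sketch of that caching, assuming Redis and a per-conversation key with a TTL (the key scheme and TTL are illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

def cache_context(conversation_id: str, context_md: str, ttl_s: int = 900) -> None:
    """Keep the latest built context around so follow-ups in the same thread can reuse it."""
    r.setex(f"ctx:{conversation_id}", ttl_s, context_md)

def get_cached_context(conversation_id: str) -> str | None:
    """Return the cached context for this conversation, or None if it has expired."""
    return r.get(f"ctx:{conversation_id}")
```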

Add in the LLM to answer (the G in RAG)

Let’s wrap up with the last step, simply to see how the LLM answers with the context it has been handed.

I didn’t plan on adding this part, as this article is primarily about the retrieval pipeline, but it’s good to test with an LLM to see if it can interpret the information into something coherent.
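
The generation call itself can stay small. Here’s a sketch assuming the OpenAI Python SDK, with the model name and low reasoning effort matching what I use below (the prompt wording is illustrative):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Answer the user's question using only the provided context. "
    "Cite documents inline as [n], matching the numbering in the context."
)

def answer(query: str, context_md: str, model: str = "gpt-5") -> str:
    """Generation step: hand the expanded context plus the user query to the LLM."""
    response = client.chat.completions.create(
        model=model,
        reasoning_effort="low",  # matches the low-reasoning setting used in this article
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context_md}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```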

If you use a larger LLM here, you can expect the latency to go up quite a bit. At the same time, if you use a smaller model, it may not answer as intelligently.

I decided to test with GPT-5 (low reasoning), as it’s very cheap to use but also quite slow. You’ll see the result below for the messy user query:

[optimize query... retrieve... fuse... re-rank... expand... build context...]

LLM Answers:

LLMs sometimes "lie" because they generate the next likely words based on patterns, not on verified facts. Without grounding in sources, they can produce confident but false statements-known as hallucinations-and their built‑in knowledge can be outdated or incomplete [6][3]. This is dangerous in domains that demand strict accuracy (e.g., law or telecom regulations) where small errors have big consequences [3].
RAG helps by having the model first retrieve relevant documents from an external corpus and then generate answers grounded in that evidence. This reduces hallucinations and brings in up‑to‑date information [2][6][1]. In practice, domain‑tailored RAG has raised accuracy substantially; for example, a radio‑regulations RAG pipeline boosted GPT‑4o accuracy by +11.9% over vanilla prompting and achieved up to 97% retrieval accuracy, supporting more reliable answers [3].
However, RAG is only as good as its retrieval and how well the model uses the retrieved text. Static, off‑the‑shelf retrievers can be brittle for complex tasks (e.g., multihop questions or procedural queries) that need iterative lookups or query reformulation [2]. Naive RAG can also drown the model in long, noisy contexts or pull semantically similar but irrelevant chunks-the "needle‑in‑a‑haystack" problem-leading to wrong answers [4].
Better RAG practices address this: instruction‑tuning for RAG that teaches filtering, combining multiple sources, and RAG‑specific reasoning with a "think before answering" approach (HIRAG) improves use of evidence [1]; adaptive, feedback‑driven retrieval decides when and how to retrieve and re‑rank evidence [2]; and pipeline designs that optimize chunking and retrieval raise answer accuracy [4].
If hallucination still persists, techniques that steer decoding directly (beyond RAG) can further suppress it [5].

cited documents:
  [1] doc::b0610cc6134b401db0ea68a77096e883 - HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
  [2] doc::53b521e646b84289b46e648c66dde56a - Test-time Corpus Feedback: From Retrieval to RAG
  [3] doc::9694bd0124d0453c81ecb32dd75ab489 - Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
  [4] doc::6d7a7d88cfc04636b20931fdf22f1e61 - KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
  [5] doc::3c9a1937ecbc454b8faff4f66bdf427f - DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models
  [6] doc::688cfbc0abdc4520a73e219ac26aff41 - A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

You’ll see that it cites sources correctly and uses the information it has been handed, but as we’re using GPT-5, the latency is quite high with this large context.

It takes about 9 seconds to first token with GPT-5 (though this will depend on your environment).
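
If you want to measure this in your own environment, a small sketch for timing time-to-first-token with the streaming API (the model name follows what’s used above):

```python
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(prompt: str, model: str = "gpt-5") -> float:
    """Stream a completion and return the seconds until the first content token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start
```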

If the entire retrieval pipeline takes about 4–5 seconds (and this is not optimized), that means the generation step alone takes roughly twice as long as all of retrieval.

Some people will argue that you should send less information into the context window to reduce latency for this part, but that also defeats the purpose of what we’re trying to do.

Others will argue for chain prompting: having one smaller LLM extract the useful information and then letting another, larger LLM answer with an optimized context window, but I’m not sure how much time you actually save or whether it’s worth it.

Others will go as small as possible, sacrificing “intelligence” for speed and cost. But there’s also a risk in using smaller models with more than a 2k-token window, as they can start to hallucinate.

Nevertheless, it’s up to you how you optimize the system. That’s the hard part.

If you want to examine the entire pipeline for a few queries, see this folder.

Let’s talk latency & cost

People who talk about sending entire docs into an LLM are probably not ruthlessly optimizing for latency in their systems. This is the part you’ll spend the most time on; users don’t want to wait.

Yes, you can apply some UX tricks, but devs might think you’re lazy if your retrieval pipeline is slower than a few seconds.

This is also why it’s interesting that we’re seeing this shift to agentic search in the wild; it gets a lot slower once you add large context windows, LLM-based query transforms, auto “router” chains, sub-question decomposition, and multi-step “agentic” query engines.

For this system (mostly built with Codex and my instructions), we land at around 4–5 seconds for retrieval in a serverless environment.

That is kind of slow (but pretty cheap).

You can optimize each step here to bring that number down, keeping most things warm. However, when using external APIs you can’t always control how fast they return a response.

Some people will argue for hosting your own smaller models for the optimizer and routers, but then you have to factor in hosting costs, which can easily add a few hundred dollars per month.

With this pipeline, each run (without caching) cost us 1.2 cents ($0.0121), so if your org asked 200 questions daily you’d pay around $2.42 per day with GPT-5.

If you switch to GPT-5-mini for the main LLM, one pipeline run drops to 0.41 cents, which amounts to about $0.82 per day for 200 runs.
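
As a back-of-the-envelope check on those numbers (the monthly extrapolation is my own):

```python
# Rough cost projection based on the per-run numbers above.
COST_PER_RUN_USD = {"gpt-5": 0.0121, "gpt-5-mini": 0.0041}
RUNS_PER_DAY = 200

for model, cost in COST_PER_RUN_USD.items():
    daily = cost * RUNS_PER_DAY
    print(f"{model}: ${daily:.2f}/day, ~${daily * 30:.0f}/month")
# gpt-5: $2.42/day, ~$73/month
# gpt-5-mini: $0.82/day, ~$25/month
```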

As for embedding the documents, I paid around $0.50 for 200 PDF files using OpenAI’s large embedding model. This cost will increase as you scale, which is something to think about; at that point it can make sense to go with a smaller or specialized fine-tuned model.

How to improve it

As we’re only working with recent RAG papers here, once you scale it up you can add a few things to make it more robust.

I should first note, though, that you may not see most of the real issues until your docs start growing. Whatever feels solid with a few hundred docs will start to feel messy once you ingest tens of thousands.

You can have the optimizer set filters, perhaps using semantic matching for topics. You can also have it set dates to keep the information fresh, while introducing an authority signal in re-ranking that boosts certain sources.

Some teams take this a bit further and design their own scoring functions to decide what should surface and how to prioritize documents, but this depends entirely on what your corpus looks like.
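
To give a flavor of what those filters and boosts could look like, here’s a sketch using Qdrant payload filters plus a simple authority weight; the collection name and payload fields (`topics`, `published_at` as a unix timestamp, `source_authority`) are all illustrative:

```python
from datetime import datetime, timedelta, timezone

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def filtered_search(query_vector: list[float], topics: list[str], max_age_days: int = 365) -> list:
    """Vector search restricted by topic and freshness, with a small authority boost afterwards."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)).timestamp()
    hits = client.search(
        collection_name="papers",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(key="topics", match=models.MatchAny(any=topics)),
                models.FieldCondition(key="published_at", range=models.Range(gte=cutoff)),
            ]
        ),
        limit=20,
        with_payload=True,
    )
    # Blend the similarity score with a per-source authority weight before re-ranking.
    return sorted(hits, key=lambda h: h.score * h.payload.get("source_authority", 1.0), reverse=True)
```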

If you have to ingest several thousand docs, it might make sense to skip the LLM during ingestion and instead use it in the retrieval pipeline, where it analyzes documents only when a query asks for them. You can then cache that result for next time.

Lastly, always remember to add proper evals to track retrieval quality and groundedness, especially if you’re switching models to optimize for cost. I’ll try to write about this in the future.


If you’re still with me this far, a question you can ask yourself is whether it’s worth it to build a system like this or if it’s just too much work.

In the future, I’d like to do something that clearly compares the output quality of naive RAG vs better-chunked RAG with expansion/metadata.

I’d also like to compare the same use case using knowledge graphs.

To check out more of my work and follow my future writing, connect with me on LinkedIn, Medium, Substack, or check out my website.

PS. I’m looking for some work in January. If you need someone who’s building in this space (and enjoys building weird, fun things while explaining difficult technical concepts), get in touch.
