Chunk Size as an Experimental Variable in RAG Systems

How chunking decisions shape the answers we expect today from Retrieval-Augmented Generation (RAG) systems.

Over the past few years, RAG has become one of the central architectural building blocks for knowledge-based language models: Instead of relying exclusively on the knowledge stored within the model, RAG systems combine language models with external document sources.

The term was introduced by Lewis et al. and describes an approach that is widely used to reduce hallucinations, improve the traceability of answers, and enable language models to work with proprietary data.

I wanted to know why a system selects one specific answer instead of a very similar alternative. This decision is often made already at the retrieval stage, long before an LLM comes into play.

For that reason, I conducted three experiments in this article to investigate how different chunk sizes (80, 220, and 500 characters) influence retrieval behavior.

1 – Why Chunk Size Is More Than Just a Parameter

In a typical RAG pipeline, documents are first split into smaller text segments, embedded into vectors, and stored in an index. When a query is issued, semantically similar text segments are retrieved and then processed into an answer. This final step is typically performed together with a language model.

Typical components of a RAG system include:

  • Document preprocessing
  • Chunking
  • Embedding
  • Vector index
  • Retrieval logic
  • Optional: Generation of the output

In this article, I focus on the retrieval step. This step depends on several parameters:

  • Selection of the embedding model:
    The embedding model determines how text is converted into numerical vectors. Different models capture meaning at different levels of granularity and are trained on different objectives. For example, lightweight sentence-transformer models are often sufficient for semantic search, while larger models may capture more nuance but come with higher computational cost.
  • Distance or similarity metric:
    The distance or similarity metric defines how the closeness between two vectors is measured. Common choices include cosine similarity, dot product, or Euclidean distance. For normalized embeddings, cosine similarity is typically used.
  • Number of retrieved results (Top-k):
    The number of retrieved results specifies how many text segments are returned by the retrieval step. A small Top-k can miss relevant context, while a large Top-k increases recall but may introduce noise.
  • Overlap between text segments:
    Overlap defines how much text is shared between consecutive chunks. It is typically used to avoid losing important information at chunk boundaries. A small overlap reduces redundancy but risks cutting explanations in half, while a larger overlap increases robustness at the cost of storing and processing more similar chunks.
  • Chunk size:
    Describes the size of the text units that are extracted from a document and stored as individual vectors. Depending on the implementation, chunk size can be defined based on characters, words, or tokens. The size determines how much context a single vector represents (see the chunking sketch after this list).
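
To make chunk size and overlap concrete, here is a minimal sketch of a character-based chunker in the spirit of the one used later in the experiments. The function name and exact slicing logic are illustrative and not the code from my repository:

# Minimal character-based chunking with overlap (illustrative sketch)
def chunk_text(text: str, chunk_size: int = 220, overlap: int = 40) -> list[str]:
    # Slide a window of chunk_size characters over the text and step forward
    # by chunk_size - overlap, so consecutive chunks share `overlap` characters.
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

With chunk_size=80 the same document produces many more, much shorter segments than with chunk_size=500, which is exactly the variable compared in the experiments below.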

Small chunks contain very little context and are highly specific. Large chunks include more surrounding information, but at a much coarser level. Consequently, chunk size determines which parts of the meaning are actually compared when a query is matched against a chunk.

Chunk size implicitly reflects assumptions about how much context is required to capture meaning, how strongly information may be fragmented, and how clearly semantic similarity can be measured.

With this article, I wanted to explore exactly this through a small RAG system experiment and asked myself:

How do different chunk sizes affect retrieval behavior?

The focus is not on a system intended for production use. Instead, I wanted to learn how different chunk sizes affect the retrieval results.

2 – How Does Chunk Size Influence the Retrieval Results in Small RAG Systems?

I therefore asked myself the following questions:

  • How does chunk size change retrieval results in a small, controlled RAG system?
  • Which text segments make it to the top of the ranking when the queries are identical but the chunk sizes differ?

To investigate this, I deliberately defined a simple setup in which all conditions (except chunk size) remain the same:

  • Three Markdown documents as the knowledge base
  • Three identical, fixed questions
  • The same embedding model for vectorizing the texts

The text used in the three Markdown files is based on the documentation of a real tool called OneLatex. To keep the experiment focused on retrieval behavior, the content was slightly simplified and reduced to the core explanations relevant for the questions.

The three questions I used were:

"Q1: What's the primary advantage of separating content creation from formatting in OneLatex?"
"Q2: How does OneLatex interpret text highlighted in green in OneNote?"
"Q3: How does OneLatex interpret text highlighted in yellow in OneNote?"

In addition, I deliberately omitted an LLM for output generation.

The reason for this is simple: I did not want an LLM to turn incomplete or poorly matched text segments into a coherent answer. This makes it much clearer what actually happens in the retrieval step, how the retrieval parameters interact, and what role the sentence transformer plays.

3 – Minimal RAG System Without Output Generation

For the experiments, I therefore used a small RAG system with the following components: Markdown documents as the knowledge base, a simple chunking logic with overlap, a sentence transformer model to generate embeddings, and a ranking of text segments using cosine similarity.

As the embedding model, I used all-MiniLM-L6-v2 from the Sentence-Transformers library. This model is lightweight and therefore well-suited for running locally on a personal laptop (I ran it locally on my Lenovo laptop with 64 GB of RAM). The similarity between a query and a text segment is calculated using cosine similarity. Since the vectors are normalized, the dot product can be used directly.
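
As a rough sketch of how this looks with the Sentence-Transformers library (the file name is a placeholder and the variable names are illustrative; the actual implementation is in the repository):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative inputs: reuse the chunker from above on one of the Markdown files
chunks = chunk_text(open("onelatex_doc.md", encoding="utf-8").read(), chunk_size=220, overlap=40)
query = "How does OneLatex interpret text highlighted in green in OneNote?"
top_k = 3

# Normalized embeddings, so cosine similarity reduces to a plain dot product
chunk_vectors = model.encode(chunks, normalize_embeddings=True)
query_vector = model.encode(query, normalize_embeddings=True)

scores = chunk_vectors @ query_vector      # one similarity score per chunk
ranking = np.argsort(-scores)[:top_k]      # indices of the top-k chunks
best_chunk = chunks[int(ranking[0])]       # returned as the "answer"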

I deliberately kept the system small and therefore did not include any chat history, memory or agent logic, or LLM-based answer generation.

As an “answer,” the system simply returns the highest-ranked text segment. This makes it much clearer which content is actually identified as relevant by the retrieval step.

The complete code for the mini RAG system can be found in my GitHub repository:

→ 🤓 Find the full code in the GitHub Repo 🤓 ←

4 – Three Experiments: Chunk Size as a Variable

For the evaluation, I ran the three commands below via the command line:

#Experiment 1 - Baseline
python main.py --chunk-size 220 --overlap 40 --top-k 3

#Experiment 2 - Small chunk size
python main.py --chunk-size 80 --overlap 10 --top-k 3

#Experiment 3 - Large chunk size
python main.py --chunk-size 500 --overlap 50 --top-k 3

The setup from Section 3 stays exactly the same: the same three documents, the same three questions, and the same embedding model.

Chunk size defines the number of characters per text segment. In addition, I used an overlap in each experiment to reduce information loss at chunk boundaries. For each experiment, I computed the semantic similarity scores between the query and all chunks and ranked the highest-scoring segments.
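
The flags above map directly to these parameters. A minimal sketch of how such a command-line interface can be wired up with argparse (the actual argument handling in main.py may differ):

import argparse

parser = argparse.ArgumentParser(description="Minimal RAG retrieval experiment")
parser.add_argument("--chunk-size", type=int, default=220, help="characters per chunk")
parser.add_argument("--overlap", type=int, default=40, help="characters shared between consecutive chunks")
parser.add_argument("--top-k", type=int, default=3, help="number of retrieved chunks")
args = parser.parse_args()

# args.chunk_size, args.overlap, and args.top_k then feed the chunking and retrieval steps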

Small Chunks (80 Characters) – Lack of Context

With very small chunks (chunk-size 80), a strong fragmentation of the content becomes apparent: Individual text segments often contain only sentence fragments or isolated statements without sufficient context. Explanations are split across multiple chunks, so that individual segments contain only parts of the original content.

Formally, the retrieval still works correctly: Semantically similar fragments are found and ranked highly.

However, when we look at the actual content, we see that the results are hardly usable:

Result of the retrieval experiment with chunk size 80. Screenshot taken by the author.

The returned chunks are thematically related, but they do not provide a self-contained answer. The system roughly recognizes what the topic is about, but it breaks the content down so strongly that the individual results do not say much on their own.

Medium Chunks (220 characters) – Apparent Stability

With the medium chunks (chunk-size 220), the results already improved clearly. Most of the returned text segments contained complete explanations and were plausible in terms of content. At first glance, the retrieval appeared stable and reliable: It often returned exactly the information one would expect.

However, a concrete problem became apparent when distinguishing between green and yellow highlighted text. Regardless of whether I asked about the meaning of the green or the yellow highlighting, the system returned the chunk about the yellow highlighting as the top result in both cases. The correct chunk was present, but it was not chosen as Top-1.

Result of the retrieval experiment with chunk size 220. Screenshot taken by the author.

The reason lies in the very similar similarity scores of the two top results:

  • Score for Top-1: 0.873
  • Score for Top-2: 0.774

The system can hardly distinguish between the two candidates semantically and ultimately selects the chunk with the slightly higher score.

The problem? It does not match the query content-wise and is simply wrong.

For us as humans, this is very easy to recognize. For a sentence transformer like all-MiniLM-L6-v2, it appears to be a challenge.

What matters here is this: If we only look at the Top-1 result, this error stays invisible. Only by comparing the scores do we see that the system is uncertain in this case. Because it is forced to make a clear decision in our setup, it returns the Top-1 chunk as the answer.

Large Chunks (500 characters) – Robust Contexts

With the larger chunks (chunk-size 500), the text segments contain much more coherent context. There is also hardly any fragmentation anymore: Explanations are no longer split across multiple chunks.

And indeed, the error in distinguishing between green and yellow no longer occurs. The questions about green and yellow highlighting are now correctly distinguished, and the respective matching chunk is clearly ranked as the top result. We can also see that the similarity scores of the relevant chunks are now more clearly separated.

Result of the retrieval experiment with chunk size 500. Screenshot taken by the author.

This makes the ranking more stable and easier to understand. The downside of this setting, however, is the coarser granularity: Individual chunks contain more information and are less finely tailored to specific aspects.

In our setup with three Markdown files, where the content is already thematically well separated, this downside hardly plays a role. With differently structured documentation, such as long continuous texts with multiple topics per section, an excessively large chunk size may lead to irrelevant information being retrieved along with relevant content.



5 – Final Thoughts

The results of the three very simple experiments can be traced back to how retrieval works. Each chunk is represented as a vector, and its proximity to the query is calculated using cosine similarity. The resulting score indicates how similar the query and the text segment are in the semantic space.

What is important here is that the score is not a measure of correctness. It is a measure of relative comparison within the available chunks for a given query in a single run.

When multiple segments are semantically very similar, even minimal differences in the scores can determine which chunk is returned as Top-1. One example of this was the incorrect distinction between green and yellow at the medium chunk size.

One possible extension would be to allow the system to explicitly signal uncertainty. If the scores of the Top-1 and Top-2 chunks are very close, the system could return an “I don't know” or “I'm not sure” response instead of forcing a decision.
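
A minimal sketch of such a check, assuming the scores are already sorted in descending order and using an illustrative margin value that would need to be tuned:

def answer_or_abstain(ranked_chunks, scores, margin=0.05):
    # If the best and second-best scores are nearly identical, retrieval
    # cannot meaningfully separate the candidates, so signal uncertainty.
    if len(scores) > 1 and (scores[0] - scores[1]) < margin:
        return "I'm not sure - the top candidates are scored almost equally."
    return ranked_chunks[0]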

Based on this small RAG system experiment, it is not really possible to derive a “best chunk size” conclusion.

But what we can observe instead is the following:

  • Small chunks lead to high variance: Retrieval reacts very precisely to individual terms but quickly loses the overall context.
  • Medium-sized chunks: Appear stable at first glance, but can create dangerous ambiguities when multiple candidates are scored almost equally.
  • Large chunks: Provide more robust context and clearer rankings, but they are coarser and less precisely tailored.

Chunk size therefore determines how sharply retrieval can distinguish between similar pieces of content.

In this small setup, this did not play a major role. However, when we think about larger RAG systems in production environments, this kind of retrieval instability could become a real problem: As the number of documents grows, the number of semantically similar chunks increases as well. This means that many situations with very small score differences are likely to occur. I can also imagine that such effects are often masked by downstream language models, when an LLM turns incomplete or only partially matching text segments into plausible answers.
