RAG 101: Chunking Strategies


UNLOCK THE FULL POTENTIAL OF YOUR RAG WORKFLOW

Why, when, and how to chunk for enhanced RAG

How do we split the balls? (Generated using Canva)

The maximum number of tokens that a Large Language Model can process in a single request is often called the context length (or context window). The table below shows the context length for all versions of GPT-4 (as of Sep 2024). While context lengths have been increasing with every iteration and with every newer model, there remains a limit to the amount of information we can provide the model. Furthermore, there can be an inverse correlation between the size of the input and the context relevancy of the responses generated by the LLM: short and focused inputs produce better results than long contexts containing vast amounts of information. This emphasizes the importance of breaking down our data into smaller, relevant chunks to ensure more appropriate responses from the LLM, at least until LLMs can handle enormous amounts of information without re-training.

Context window limits for GPT-4 models (referenced from OpenAI)

The context window shown in the image includes both input and output tokens.

Longer contexts give the model a more holistic picture and help it understand relationships and make better inferences. Shorter contexts, on the other hand, reduce the amount of information the model needs to process, which decreases latency and makes the model more responsive; they also help minimize hallucinations, since only the relevant data is given to the model. So it is a balance between performance, efficiency, and how complex our data is, and we need to run experiments to find out how much data is the right amount that yields the best results with reasonable resources.

The GPT-4 model's 128k tokens may seem like a lot, so let's convert them to actual words and put them in perspective. From the OpenAI Tokenizer:

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words)

Let's consider The Hound of the Baskervilles by Arthur Conan Doyle (Project Gutenberg License) as our example throughout this article. This book is 7,734 lines long with 62,303 words, which comes to roughly 83,000 tokens (62,303 words ÷ 0.75 words per token ≈ 83,070).

If you want to calculate tokens exactly rather than just approximate them, you can use OpenAI's tiktoken:

import requests
from tiktoken import encoding_for_model

url = "https://www.gutenberg.org/cache/epub/3070/pg3070.txt"

# Download the full text of the book
response = requests.get(url)
if response.status_code == 200:
    book_full_text = response.text

# Count tokens with the tokenizer used by gpt-4o
encoder = encoding_for_model("gpt-4o")
tokens = encoder.encode(book_full_text)

print(f"Number of tokens: {len(tokens)}")

This gives the token count as Number of tokens: 82069.

Chunking Cheese!! (Generated using Canva)

I like the Wikipedia definition of chunking, because it applies to RAG as much as it holds true in cognitive psychology.

Chunking is a process by which small individual pieces of a set of information are bound together. The chunks are meant to improve short-term retention of the material, thus bypassing the limited capacity of working memory and allowing the working memory to be more efficient.

The process of splitting large datasets into smaller, meaningful pieces of data, so that the LLM's non-parametric memory can be used more effectively, is known as chunking. There are many different ways to split the data to improve the retrieval of chunks for RAG, and we need to choose one depending on the type of data being consumed.

Chunking is an important pre-retrieval step in the RAG pipeline that directly influences the retrieval process and significantly affects the final output. In this article, we will look at the most common chunking strategies and evaluate them on retrieval metrics in the context of our data.

Instead of going over the existing chunking strategies/splitters available in different libraries right away, let's start building a simple splitter and explore the important aspects that need to be considered, to build the intuition for writing a new splitter. We will start with a basic splitter and progressively improve it by solving its drawbacks and limitations.

1. Naive Chunking

When we talk about splitting data, the first thing that comes to mind is to split it at the newline character. Let's go ahead with the implementation. But as you can see, it leaves a lot of carriage-return characters behind. Also, we just assumed \n and \r since we are only dealing with the English language, but what if we want to parse other languages? Let's add the flexibility to pass in the characters to split on as well.

from typing import List

def naive_splitter_v2(text: str, separators: List[str] = ["\n", "\r"]) -> List[str]:
    """Splits text at every separator"""
    splits = [text]
    for sep in separators:
        # Re-split every existing piece at the current separator, dropping empty strings
        splits = [segment for part in splits for segment in part.split(sep) if segment]

    return splits

output of naive_splitter_v2

You might have already guessed from the output why we call this method naive. The approach has a number of drawbacks:

  1. No chunk limits. As long as a line contains one of the delimiters, it will break there, but if a chunk has no delimiters at all, it can grow to any length.
  2. Similarly, as you can clearly see in the output, there are chunks that are too small! A single-word chunk makes no sense without its surrounding context.
  3. Breaks in between lines: a chunk is retrieved based on the question that is asked, but a sentence or line is incomplete, or even takes on a different meaning, if we truncate it mid-sentence.

Let's try to fix these problems one by one.

2. Fixed Window Chunking

Let's first tackle the problem of chunks that are too long or too short. This time we take in a size limit and try to split the text exactly when we reach that size.

def fixed_window_splitter(text: str, chunk_size: int = 1000) -> List[str]:
    """Splits text at given chunk_size"""
    splits = []
    for i in range(0, len(text), chunk_size):
        splits.append(text[i:i + chunk_size])
    return splits

output of fixed_window_splitter

We did solve the minimum and maximum bounds of the chunk, since every chunk is now exactly chunk_size. But the breaks in between words remain. From the output we can see that we are losing the meaning of a chunk because it is split mid-sentence.

3. Fixed Window with Overlap Chunking

The simplest way to make sure we don't split in the middle of a word is to keep going until the end of the word and then stop. Though this keeps the chunk reasonably close to the expected chunk_size, a better approach is to start the next chunk some x characters/words/tokens behind the actual start position, so that the context is always preserved and remains continuous.

def fixed_window_with_overlap_splitter(text: str, chunk_size: int = 1000, chunk_overlap: int = 10) -> List[str]:
    """Splits text at given chunk_size, and starts next chunk from start - chunk_overlap position"""
    chunks = []
    start = 0

    while start <= len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by chunk_overlap so consecutive chunks share some context
        start = end - chunk_overlap

    return chunks

output of fixed_window_with_overlap_splitter

4. Recursive Character Chunking

With chunk size and chunk overlap fixed, we can now solve the problem of mid-word or mid-sentence splitting. This can be done with a bit of modification to our initial naive splitter: we take a list of separators and pick a finer separator as we approach the chunk size, while still using chunk overlap the same way. This is one of the most popular splitters, available in the LangChain package as RecursiveCharacterTextSplitter (a minimal usage sketch follows after the list below). It works the same way we approached it:

  1. Start with the highest-priority separator, \n\n, and move to the next one in the separators list as needed.
  2. If a split exceeds the chunk_size, apply the next separator until the current split falls under the target size.
  3. The next split starts chunk_overlap characters behind the current split's ending, thus maintaining the continuity of the context.
output of recursive_character_splitter
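For reference, here is a minimal sketch of how this splitter could be used via LangChain, with the same chunk_size and chunk_overlap as before. The import path (langchain_text_splitters here) and the default separator list vary between LangChain versions, so treat this as an illustration rather than the exact code behind the output above.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Highest-priority separator first: paragraphs, then lines, then words, then characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""],
)

chunks = splitter.split_text(book_full_text)
print(f"Number of chunks: {len(chunks)}")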

5. Semantic Chunking

So far, we have only considered where to split our data, whether at the end of a paragraph, a new line, a tab, or another separator. But we have not considered when to split, that is, how to better capture a meaningful chunk rather than just a chunk of some length. This approach is commonly known as semantic chunking. Let's use Flair to detect sentence boundaries and create meaningful chunks. The text is split into sentences using SegtokSentenceSplitter, which ensures it is split at meaningful boundaries. We keep the sizing logic the same, grouping sentences until we reach chunk_size (the chunk_overlap parameter is kept for consistency with the earlier splitters).

def semantic_splitter(text: str, chunk_size: int = 1000, chunk_overlap: int = 10) -> List[str]:
    # chunk_overlap is kept for API consistency; this simple version does not apply it
    from flair.splitter import SegtokSentenceSplitter

    splitter = SegtokSentenceSplitter()

    # Split text into sentences
    sentences = splitter.split(text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Add the sentence to the current chunk if it still fits
        if len(current_chunk) + len(sentence.to_plain_string()) <= chunk_size:
            current_chunk += " " + sentence.to_plain_string()
        else:
            # If adding the next sentence exceeds the max size, start a new chunk
            chunks.append(current_chunk.strip())
            current_chunk = sentence.to_plain_string()

    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

output of semantic_splitter

LangChain has two such splitters, using the NLTK and spaCy libraries, so do check them out.
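For completeness, here is a rough sketch of how those two splitters could be invoked. The import path is assumed to be langchain_text_splitters, and both require their underlying resources (NLTK's punkt data and a spaCy pipeline such as en_core_web_sm) to be installed.

from langchain_text_splitters import NLTKTextSplitter, SpacyTextSplitter

nltk_splitter = NLTKTextSplitter(chunk_size=1000)    # sentence boundaries via NLTK
spacy_splitter = SpacyTextSplitter(chunk_size=1000)  # sentence boundaries via spaCy

nltk_chunks = nltk_splitter.split_text(book_full_text)
spacy_chunks = spacy_splitter.split_text(book_full_text)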

So, in general, in static chunking methods, chunk size and chunk overlap are the two major factors to consider when determining a chunking strategy. Chunk size is the number of characters/words/tokens in each chunk, and chunk overlap is the amount of the previous chunk included in the current chunk so that the context stays continuous. Chunk overlap can also be expressed as a number of characters/words/tokens or as a percentage of the chunk size.

You can use the cool ChunkViz tool to visualize how different chunking strategies behave with different chunk size and overlap parameters:

Hound Of Baskervilles on ChunkViz

6. Embedding Chunking

Though semantic chunking gets the job done, NLTK, spaCy, and Flair use their own models/embeddings to understand the given data and decide where it can best be split semantically. When we move on to our actual RAG implementation, our embedding model might be different from the one the chunks were merged with, and the data could therefore be understood in a different way altogether. So, in this approach, we start by splitting into sentences and then form the chunks using the same embedding model we are later going to use for our RAG retrieval. Here we again split into sentences with Flair's SegtokSentenceSplitter and use (Azure) OpenAIEmbeddings to merge them into chunks.

def embedding_splitter(text_data, chunk_size=400):
    import os
    import numpy as np
    from langchain_openai.embeddings import AzureOpenAIEmbeddings
    from sklearn.metrics.pairwise import cosine_similarity
    from dotenv import load_dotenv, find_dotenv
    from tqdm import tqdm
    from flair.splitter import SegtokSentenceSplitter

    load_dotenv(find_dotenv())

    # Set Azure OpenAI API environment variables (ensure these are set in your environment)
    # You can also set these in your environment directly
    # os.environ["OPENAI_API_KEY"] = "your-azure-openai-api-key"
    # os.environ["OPENAI_API_BASE"] = "your-azure-openai-api-endpoint"
    os.environ["OPENAI_API_VERSION"] = "2023-05-15"

    # Initialize OpenAIEmbeddings using LangChain's Azure support
    embedding_model = AzureOpenAIEmbeddings(deployment="text-embedding-ada-002-01")  # Use your Azure deployment name

    # Step 1: Split the text into sentences
    def split_into_sentences(text):
        splitter = SegtokSentenceSplitter()

        # Split text into sentences
        sentences = splitter.split(text)
        sentence_str = []
        for sentence in sentences:
            sentence_str.append(sentence.to_plain_string())
        return sentence_str[:100]  # Limit to the first 100 sentences to keep the embedding calls in this example small

    # Step 2: Get embeddings for each sentence using the same Azure embedding model
    def get_embeddings(sentences):
        embeddings = []
        for sentence in tqdm(sentences, desc="Generating embeddings"):
            embedding = embedding_model.embed_documents([sentence])  # Embeds a single sentence
            embeddings.append(embedding[0])  # embed_documents returns a list, so take the first element
        return embeddings

    # Step 3: Form chunks based on sentence embeddings, a similarity threshold, and a max chunk character size
    def form_chunks(sentences, embeddings, similarity_threshold=0.7, chunk_size=500):
        chunks = []
        current_chunk = []
        current_chunk_emb = []
        current_chunk_length = 0  # Track the character length of the current chunk

        for sentence, emb in zip(sentences, embeddings):
            emb = np.array(emb)  # Ensure the embedding is a numpy array
            sentence_length = len(sentence)  # Calculate the length of the sentence

            if current_chunk:
                # Calculate similarity with the current chunk's embedding (mean of embeddings in the chunk)
                chunk_emb = np.mean(np.array(current_chunk_emb), axis=0).reshape(1, -1)  # Average embedding of the chunk
                similarity = cosine_similarity(emb.reshape(1, -1), chunk_emb)[0][0]

                if similarity < similarity_threshold or current_chunk_length + sentence_length > chunk_size:
                    # If similarity is below the threshold or adding this sentence exceeds the max chunk size, start a new chunk
                    chunks.append(current_chunk)
                    current_chunk = [sentence]
                    current_chunk_emb = [emb]
                    current_chunk_length = sentence_length  # Reset chunk length
                else:
                    # Otherwise, add the sentence to the current chunk
                    current_chunk.append(sentence)
                    current_chunk_emb.append(emb)
                    current_chunk_length += sentence_length  # Update chunk length
            else:
                current_chunk.append(sentence)
                current_chunk_emb = [emb]
                current_chunk_length = sentence_length  # Set initial chunk length

        # Add the last chunk
        if current_chunk:
            chunks.append(current_chunk)

        return chunks

    # Apply the sentence splitting
    sentences = split_into_sentences(text_data)

    # Get sentence embeddings
    embeddings = get_embeddings(sentences)

    # Form chunks based on embeddings
    chunks = form_chunks(sentences, embeddings, chunk_size=chunk_size)

    return chunks

output of embedding_splitter

7. Agentic Chunking

Our embedding chunking should come closer to splitting the data well, using the cosine similarity of the embeddings it creates. Though this works well, it has one major drawback: it doesn't understand the semantics of the text. "I like you" versus "I like you" said with sarcasm on "like": both sentences will have the same embeddings and hence the same cosine distance when calculated. This is where agentic (or LLM-based) chunking comes in handy. It analyzes the content to identify points to break logically, based on standalone-ness and semantic coherence.

def agentic_chunking(text_data):
    from langchain_openai import AzureChatOpenAI
    from langchain.prompts import PromptTemplate

    llm = AzureChatOpenAI(model="gpt-4o",
                          api_version="2023-03-15-preview",
                          verbose=True,
                          temperature=1)

    prompt = """I am providing a document below.
Please split the document into chunks that maintain semantic coherence and ensure that each chunk represents a complete and meaningful unit of information.
Each chunk should stand alone, preserving the context and meaning without splitting key ideas across chunks.
Use your understanding of the content's structure, topics, and flow to identify natural breakpoints in the text.
Ensure that no chunk exceeds 1000 characters in length, and prioritize keeping related concepts or sections together.

Do not modify the document, just split it into chunks and return them as an array of strings, where each string is one chunk of the document.
Return the whole document; do not stop in between sentences.

Document:
{document}
"""

    prompt_template = PromptTemplate.from_template(prompt)

    chain = prompt_template | llm

    result = chain.invoke({"document": text_data})
    return result
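Note that chain.invoke returns a chat message rather than a Python list, so in practice you would parse result.content back into chunks; also, a full book may well exceed the model's context window, so you might run this per chapter. Below is a minimal, hedged sketch of the parsing step, assuming the model returns a valid JSON array of strings (which it may not always do).

import json

result = agentic_chunking(book_full_text)

try:
    chunks = json.loads(result.content)  # works only if the reply is a valid JSON array of strings
except json.JSONDecodeError:
    chunks = [result.content]  # fall back to treating the whole reply as a single chunk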

We will cover RAG evaluation techniques in an upcoming post; in this post we will look at two metrics defined by RAGAS, context_precision and context_relevancy, that determine how well our chunking strategies performed.

Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked highly. Ideally, all of the relevant chunks should appear at the top ranks. This metric is computed using the question, ground_truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

Context Relevancy gauges the relevancy of the retrieved context, calculated based on both the question and the contexts. The values fall within the range (0, 1), with higher values indicating better relevancy.
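As a rough sketch of how these two metrics could be computed with RAGAS on placeholder data (assuming an older RAGAS release where context_relevancy is still exposed; newer versions rename or replace some metrics, and both metrics need an LLM configured as judge):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_relevancy

# Placeholder evaluation data: one question, its retrieved chunks, and a reference answer
eval_data = Dataset.from_dict({
    "question": ["Who owns the hound?"],
    "contexts": [["<retrieved chunk 1>", "<retrieved chunk 2>"]],
    "ground_truth": ["<reference answer>"],
})

scores = evaluate(eval_data, metrics=[context_precision, context_relevancy])
print(scores)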

In the next article we will go over proposal retrieval, one of the agentic splitting methods, and calculate RAGAS metrics for all our strategies.

In this article we have covered why we need chunking and built an intuition for several of the strategies and their implementations, as well as their corresponding code in some of the well-known libraries. These are only basic chunking strategies; newer strategies are being invented every day to make retrieval even better.
