Hitchhiker’s Guide to RAG: From Tiny Files to Tolstoy with OpenAI’s API and LangChain


In my previous post, I walked you through setting up a very simple RAG pipeline in Python, using OpenAI’s API, LangChain, and your local files. In that post, I covered the very basics of creating embeddings from your local files with LangChain, storing them in a vector database with FAISS, making calls to OpenAI’s API, and ultimately generating responses relevant to your files. 🌟

Image by author

However, in that simple example, I only demonstrated how to use a tiny .txt file. In this post, I elaborate further on how you can use larger files with your RAG pipeline by adding an extra step to the process — chunking.

What about chunking?

Chunking refers to the process of parsing a text into smaller pieces of text — chunks — which are then transformed into embeddings. This is very important because it allows us to effectively process and create embeddings for larger files. All embedding models come with various limitations on the size of the text that is passed to them — I’ll get into more detail about those limitations in a moment. These limitations allow for better performance and low-latency responses. If the text we provide doesn’t meet those size limitations, it will get truncated or rejected.

If we wanted to create a RAG pipeline reading, say, from Leo Tolstoy’s War and Peace (a rather large book), we wouldn’t be able to directly load it and transform it into a single embedding. Instead, we need to first do the chunking — create smaller chunks of text and create an embedding for each one. Keeping each chunk below the size limits of whatever embedding model we use allows us to effectively transform any file into embeddings. So, a more realistic picture of a RAG pipeline would look as follows:

Image by author

There are several parameters we can use to further customize the chunking process and fit it to our specific needs. A key parameter is the chunk size, which allows us to specify how large each chunk will be (in characters or in tokens). The trick here is that the chunks we create must be small enough to be processed within the size limitations of the embedding model, but at the same time large enough to contain meaningful information.

For instance, let’s assume we want to process the following sentence from War and Peace, where Prince Andrew contemplates the battle:

Image by author

Let’s also assume we created the following (rather small) chunks:

Image by author

Then, if we were to ask a question about this passage, we might not get a very good answer, because the retrieved chunk is vague and doesn’t contain enough context on its own, while the meaning is scattered across multiple chunks. Thus, even though a chunk may be similar to the query we ask and get retrieved, it doesn’t carry enough meaning to produce a relevant response. Therefore, choosing a chunk size appropriate for the type of documents we use in the RAG can largely influence the quality of the responses we’ll be getting. In general, the content of a chunk should make sense to a human reading it without any other information, in order to also make sense to the model. Ultimately, there is a trade-off for the chunk size — chunks must be small enough to meet the embedding model’s size limitations, but large enough to preserve meaning.

• • •

Another significant parameter is the chunk overlap — how much overlap we want consecutive chunks to have with each other. For instance, in the example above, we’d get something like the following chunks if we selected a chunk overlap of 5 characters (see also the small sketch right after the image).

Image by author
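To make the overlap idea concrete, here is a deliberately naive sliding-window chunker — not LangChain’s actual splitting algorithm, and the sentence below is just a stand-in for the passage above — showing what a 5-character overlap between consecutive chunks looks like:

# Illustrative only: a naive fixed-width chunker with character overlap.
# This is NOT how LangChain splits text; it just visualizes the concept.
def naive_chunks(text, chunk_size=30, overlap=5):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Prince Andrew thought about the battle that lay ahead of him."  # stand-in sentence
for chunk in naive_chunks(sample):
    print(repr(chunk))  # consecutive chunks share 5 characters

Each chunk starts with the last 5 characters of the previous one, so information that falls right on a boundary isn’t completely cut off.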

This is an important decision we have to make because:

  • Larger overlap means more calls and more tokens spent on embedding creation, which means more expensive + slower
  • Smaller overlap means a higher probability of losing relevant information at the chunk boundaries

Selecting the right chunk overlap largely depends on the type of text we want to process. For example, a recipe book, where the language is simple and straightforward, most likely won’t require an exotic chunking methodology. On the flip side, a classic literature book like War and Peace, where the language is very complex and meaning is interconnected across different paragraphs and sections, will most likely require a more thoughtful approach to chunking in order for the RAG to produce meaningful results.

• • •

But what if all we need is a simpler RAG that looks up a couple of documents, each of which fits the size limitations of whatever embedding model we use in just one chunk? Do we still need the chunking step, or can we just directly make one single embedding for the entire text? The short answer is that it’s always better to perform the chunking step, even for a knowledge base that does fit within the size limits. That’s because, as it turns out, when dealing with large documents we face the problem of getting lost in the middle — missing relevant information that is buried in large documents and their correspondingly large embeddings.

What are those mysterious ‘size limitations’?

In general, a request to an embedding model can include one or more chunks of text. There are several different kinds of limitations we have to consider regarding the size of the text we want to create embeddings for and how it is processed. Each of these types of limits takes different values depending on the embedding model we use. More specifically, these are:

  • Chunk size, also referred to as maximum tokens per input, or context window. This is the maximum size, in tokens, of each chunk. For instance, for OpenAI’s text-embedding-3-small embedding model, the chunk size limit is 8,191 tokens. If we provide a chunk that is larger than the chunk size limit, in most cases it will be silently truncated‼️ (an embedding will still be created, but only for the first part that fits within the chunk size limit), without producing any error — see the token-counting sketch after this list.
  • Number of chunks per request, also referred to as number of inputs. There is also a limit on the number of chunks that can be included in each request. For instance, all of OpenAI’s embedding models have a limit of 2,048 inputs — that is, a maximum of 2,048 chunks per request.
  • Total tokens per request: There is also a limit on the total number of tokens across all chunks in a request. For all of OpenAI’s embedding models, the maximum total number of tokens across all chunks in a single request is 300,000 tokens.
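As a quick sanity check against the first of these limits, we can count tokens ourselves with the tiktoken library (an extra dependency, not otherwise used in this post). A minimal sketch, assuming the cl100k_base encoding used by OpenAI’s text-embedding-3 models:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for text-embedding-3-* models

MAX_TOKENS_PER_CHUNK = 8191  # per-input limit for text-embedding-3-small

def check_chunk(chunk):
    n_tokens = len(enc.encode(chunk))
    if n_tokens > MAX_TOKENS_PER_CHUNK:
        print(f"Chunk has {n_tokens} tokens - it would be silently truncated!")
    else:
        print(f"Chunk has {n_tokens} tokens - within the limit.")

check_chunk("Some chunk of text we plan to embed…")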

So, what happens if our documents add up to more than 300,000 tokens? As you may have imagined, the answer is that we make multiple consecutive/parallel requests of 300,000 tokens or fewer. Many Python libraries do this automatically behind the scenes. For example, LangChain’s OpenAIEmbeddings, which I use in my previous post, automatically groups the documents we provide into batches under 300,000 tokens — provided that the documents are already supplied in chunks.
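To get an intuition for what that batching involves, here is a rough sketch — an illustration of the idea, not LangChain’s actual implementation — of grouping pre-made chunks into requests that respect the 2,048-inputs and 300,000-total-tokens limits:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

MAX_INPUTS_PER_REQUEST = 2048
MAX_TOKENS_PER_REQUEST = 300_000

def batch_chunks(chunks):
    """Group chunks into batches that each fit within a single embeddings request."""
    batches, current, current_tokens = [], [], 0
    for chunk in chunks:
        n_tokens = len(enc.encode(chunk))
        # start a new batch if adding this chunk would break either limit
        if current and (len(current) >= MAX_INPUTS_PER_REQUEST
                        or current_tokens + n_tokens > MAX_TOKENS_PER_REQUEST):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches  # each batch can then be sent as one embeddings request

In practice, LangChain’s OpenAIEmbeddings takes care of this kind of grouping for us, as long as the input is already chunked.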

Reading larger files into the RAG pipeline

Let’s take a look at how all of this plays out in a simple Python example, using the War and Peace text as the document to retrieve from in the RAG. The data I’m using — Leo Tolstoy’s War and Peace — is in the Public Domain and can be found on Project Gutenberg.

So, first of all, let’s try to read from the text without any setup for chunking. For this tutorial, you’ll need to have the langchain, openai, and faiss Python libraries installed. We can easily install the required packages as follows:

pip install openai langchain langchain-community langchain-openai faiss-cpu

After making sure the required libraries are installed, our code for a very simple RAG looks like this, and works just fine for a small and simple .txt file in the text_folder.

import os

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# OpenAI API key
api_key = "your key"

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# load documents to be used for RAG
text_folder = "RAG files"

documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# generate embeddings
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# create vector database with FAISS
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()


def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # retrieve relevant documents
        relevant_docs = retriever.invoke(user_input)
        retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{retrieved_context}"
        )

        # messages for the LLM
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"\nAssistant: {assistant_message}\n")


if __name__ == "__main__":
    main()

But if I add the War and Peace .txt file to the same folder and try to directly create an embedding for it, I get the following error:

Image by author

ughh 🙃

So what happened here? LangChain’s OpenAIEmbeddings cannot split the text into separate batches of fewer than 300,000 tokens, because we didn’t provide it in chunks. It doesn’t split the single chunk itself, which is 777,181 tokens, resulting in a request that exceeds the 300,000-token maximum per request.

• • •

Now, let’s set up the chunking process to create multiple embeddings from this large file. To do that, I’ll be using the text_splitter module provided by LangChain, and more specifically, the RecursiveCharacterTextSplitter. In RecursiveCharacterTextSplitter, the chunk size and chunk overlap parameters are specified as a number of characters, but other splitters like TokenTextSplitter or OpenAITokenSplitter allow us to set these parameters as a number of tokens.
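For reference, if we’d rather express these parameters in tokens, a token-based splitter can be set up along these lines (a minimal sketch — the parameter values are arbitrary, and TokenTextSplitter relies on tiktoken for counting):

from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are expressed in tokens here, not characters
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)

Either splitter produces plain text chunks, so the rest of the pipeline stays the same.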

For this example, let’s stick with the character-based RecursiveCharacterTextSplitter and set up an instance as below:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

… and then use it to split our initial documents into chunks…

from langchain_core.documents import Document

split_docs = []
for doc in documents:
    # split each document's text into chunks and wrap each chunk in a Document
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))

…and then use those chunks to create the embeddings…

documents = split_docs

# create embeddings + FAISS index
embeddings = OpenAIEmbeddings(openai_api_key=api_key)
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()

# … the rest of the code stays the same as before

… and voila 🌟

Now our code can effectively parse the provided document, even if it is on the larger side, and provide relevant responses.

Image by author
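One small, optional tweak worth mentioning: since the meaning is now spread across many smaller chunks, it can help to retrieve a few more of them per question. With the FAISS vector store in LangChain, this can be configured via search_kwargs (the value 8 below is an arbitrary example):

# retrieve more chunks per query, since each chunk is now smaller
retriever = vector_store.as_retriever(search_kwargs={"k": 8})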

On my mind

Choosing a chunking approach that fits the size and complexity of the documents we want to feed into our RAG pipeline is crucial for the quality of the responses we’ll be getting. Of course, there are several other parameters and different chunking methodologies to keep in mind. Nonetheless, understanding and fine-tuning chunk size and overlap is the foundation for building RAG pipelines that produce meaningful results.
