
A Guide on 12 Tuning Strategies for Production-Ready RAG Applications


How to improve the performance of your Retrieval-Augmented Generation (RAG) pipeline with these "hyperparameters" and tuning strategies

Tuning Strategies for Retrieval-Augmented Generation Applications

Data Science is an experimental science. It starts with the "No Free Lunch Theorem," which states that there is no one-size-fits-all algorithm that works best for every problem, and it ends with data scientists using experiment tracking systems to help them tune the hyperparameters of their Machine Learning (ML) projects to achieve the best performance.

This article looks at a Retrieval-Augmented Generation (RAG) pipeline through the eyes of a data scientist. It discusses potential "hyperparameters" you can experiment with to improve your RAG pipeline's performance. Similar to experimentation in Deep Learning, where, e.g., data augmentation techniques are not a hyperparameter but a knob you can tune and experiment with, this article also covers different strategies you can apply that are not hyperparameters per se.

This article covers the following "hyperparameters", sorted by their relevant stage. In the ingestion stage of a RAG pipeline, you can achieve performance improvements through data cleaning, chunking, embedding models, metadata, multi-indexing, and indexing algorithms.

And in the inference stage (retrieval and generation), you can tune query transformations, retrieval parameters, advanced retrieval strategies, re-ranking models, the LLM, and prompt engineering.

Note that this article covers text use cases of RAG. For multimodal RAG applications, different considerations may apply.

The ingestion stage is a preparation step for building a RAG pipeline, similar to the data cleaning and preprocessing steps in an ML pipeline. Usually, the ingestion stage consists of the following steps:

  1. Collect data
  2. Chunk data
  3. Generate vector embeddings of chunks
  4. Store vector embeddings and chunks in a vector database
Documents are first chunked, then the chunks are embedded, and the embeddings are stored in the vector database
Ingestion stage of a RAG pipeline

This section discusses impactful techniques and hyperparameters that you can apply and tune to improve the relevance of the retrieved contexts in the inference stage.

Data cleaning

Like any Data Science pipeline, the quality of your data heavily impacts the outcome of your RAG pipeline [8, 9]. Before moving on to any of the following steps, ensure that your data meets the following criteria:

  • Clean: Apply at least some basic data cleaning techniques commonly used in Natural Language Processing, such as making sure all special characters are encoded correctly (see the sketch after this list).
  • Correct: Make sure your information is consistent and factually accurate to avoid conflicting information confusing your LLM.
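As a minimal, purely illustrative sketch, the following snippet uses only Python's standard library to normalize Unicode, drop control characters, and collapse extraction whitespace; the exact cleaning steps you need will depend on your data source.

import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters share one representation
    text = unicodedata.normalize("NFKC", text)
    # Drop non-printable control characters (keep newlines and tabs)
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Collapse runs of spaces/tabs often introduced by PDF or HTML extraction
    return re.sub(r"[ \t]+", " ", text).strip()

raw_documents = ["  Caf\u00e9s   serve\tespresso  "]  # stand-in for your loaded texts
documents = [clean_text(doc) for doc in raw_documents]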

Chunking

Chunking your documents is an essential preparation step for your external knowledge source in a RAG pipeline that can impact its performance [1, 8, 9]. It is a technique to generate logically coherent snippets of information, usually by breaking up long documents into smaller sections (but it can also combine smaller snippets into coherent paragraphs).

One consideration you need to make is the choice of chunking technique. For example, in LangChain, different text splitters split up documents by different logic, such as by characters, tokens, etc. Which one to use depends on the type of data you have: for example, you will need different chunking techniques if your input data is code than if it is a Markdown file.

The ideal length of your chunks (chunk_size) depends on your use case: If your use case is question answering, you may need shorter, specific chunks, but if your use case is summarization, you may need longer chunks. Additionally, if a chunk is too short, it might not contain enough context. On the other hand, if a chunk is too long, it might contain too much irrelevant information.

Additionally, you will need to think about a "rolling window" between chunks (overlap) to introduce some additional context.
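As a minimal sketch, this is how chunk_size and overlap could be set with LangChain's RecursiveCharacterTextSplitter (the import path and the parameter values are assumptions that vary by LangChain version and use case):

from langchain.text_splitter import RecursiveCharacterTextSplitter

long_document = " ".join(["RAG pipelines retrieve relevant context before generation."] * 50)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # target chunk length in characters; tune per use case
    chunk_overlap=50,   # "rolling window" of shared text between neighboring chunks
)
chunks = text_splitter.split_text(long_document)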

Embedding models

Embedding models are at the core of your retrieval. The quality of your embeddings heavily impacts your retrieval results [1, 4]. Usually, the higher the dimensionality of the generated embeddings, the higher the precision of your embeddings.

For an idea of what alternative embedding models are available, you can look at the Massive Text Embedding Benchmark (MTEB) Leaderboard, which covers 164 text embedding models (at the time of this writing).

While you can use general-purpose embedding models out of the box, it can make sense in some cases to fine-tune your embedding model to your specific use case to avoid out-of-domain issues later on [9]. According to experiments conducted by LlamaIndex, fine-tuning your embedding model can lead to a 5–10% performance increase in retrieval evaluation metrics [2].

Note that you cannot fine-tune all embedding models (e.g., OpenAI's text-embedding-ada-002 cannot be fine-tuned at the moment).
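For illustration, here is a minimal sketch of embedding chunks with the sentence-transformers library; the model name all-MiniLM-L6-v2 is only an example choice, not a recommendation.

from sentence_transformers import SentenceTransformer

# Example general-purpose embedding model; swap in the model that fits your domain
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "RAG pipelines retrieve relevant context before generation.",
    "Chunk size and overlap are tunable ingestion parameters.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (number of chunks, embedding dimensionality), e.g., (2, 384)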

Metadata

When you store vector embeddings in a vector database, some vector databases let you store them together with metadata (i.e., data that is not vectorized). Annotating vector embeddings with metadata can be helpful for additional post-processing of the search results, such as metadata filtering [1, 3, 8, 9]. For example, you could add metadata such as the date, chapter, or subchapter reference.
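A minimal sketch of storing chunks with metadata and filtering on it at query time, assuming Chroma's client API (the collection name, metadata fields, and filter values are made up for illustration; other vector databases expose similar metadata filters):

import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.create_collection("docs")

collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=["Q3 revenue grew by 12%.", "Q4 revenue declined slightly."],
    metadatas=[{"chapter": "finance", "year": 2023}, {"chapter": "finance", "year": 2024}],
)

# Restrict the semantic search to chunks from a given year via metadata filtering
results = collection.query(query_texts=["revenue growth"], n_results=1, where={"year": 2023})
print(results["documents"])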

Multi-indexing

If the metadata is not sufficient to provide additional information to separate different types of context logically, you may want to experiment with multiple indexes [1, 9]. For example, you can use different indexes for different types of documents. Note that you will then have to incorporate some index routing at retrieval time [1, 9], as in the sketch below. If you are interested in a deeper dive into metadata and separate collections, you may want to learn more about the concept of native multi-tenancy.
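A minimal sketch of index routing, under the assumption that you maintain two separate indexes and classify the query with a placeholder classify_query helper (in practice, this could be an LLM call or a lightweight classifier, and the keyword lookup would be a vector search):

# Hypothetical example: two separate "indexes" (here just keyword-searchable lists)
indexes = {
    "code": ["def connect(db_url): ...", "class Retriever: ..."],
    "docs": ["The retriever returns the top-k most similar chunks.", "Chunk overlap adds context."],
}

def classify_query(query: str) -> str:
    # Placeholder router; in practice this could be an LLM call or a trained classifier
    return "code" if any(tok in query.lower() for tok in ("def ", "class ", "error", "function")) else "docs"

def route_and_retrieve(query: str) -> list[str]:
    index_name = classify_query(query)
    # Replace this naive scan with a real vector search against the chosen index
    return [doc for doc in indexes[index_name] if any(w in doc.lower() for w in query.lower().split())]

print(route_and_retrieve("How does the retriever pick chunks?"))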

Indexing algorithms

To enable lightning-fast similarity search at scale, vector databases and vector indexing libraries use Approximate Nearest Neighbor (ANN) search instead of k-nearest neighbor (kNN) search. As the name suggests, ANN algorithms approximate the nearest neighbors and can therefore be less precise than a kNN algorithm.

There are different ANN algorithms you could experiment with, such as Facebook Faiss (clustering), Spotify Annoy (trees), Google ScaNN (vector compression), and HNSWLIB (proximity graphs). Also, many of these ANN algorithms have parameters you could tune, such as ef, efConstruction, and maxConnections for HNSW [1].
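If you do want to experiment with HNSW parameters, a minimal sketch with hnswlib could look like the following (the parameter values are illustrative, not recommendations):

import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)  # stand-in for your embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# ef_construction and M (maxConnections) trade index build time and memory for recall
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(64)  # ef at query time: higher values mean better recall but slower search
labels, distances = index.knn_query(data[:1], k=5)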

Additionally, you can enable vector compression for these indexing algorithms. Analogous to ANN algorithms, you will lose some precision with vector compression. However, depending on the choice of the vector compression algorithm and its tuning, you can optimize this as well.

However, in practice, these parameters are usually already tuned by the research teams of vector databases and vector indexing libraries during benchmarking experiments, not by developers of RAG systems. If you want to experiment with these parameters to squeeze out the last bits of performance, I recommend this article as a starting point:

The main components of the RAG pipeline are the retrieval and the generative components. This section mainly discusses strategies to improve the retrieval (query transformations, retrieval parameters, advanced retrieval strategies, and re-ranking models), as this is the more impactful component of the two. But it also briefly touches on some strategies to improve the generation (LLM and prompt engineering).

Standard RAG schema
Inference stage of a RAG pipeline

Query transformations

Since the search query used to retrieve additional context in a RAG pipeline is also embedded into the vector space, its phrasing can also impact the search results. Thus, if your search query doesn't result in satisfactory search results, you can experiment with various query transformation techniques [5, 8, 9], such as:

  • Rephrasing: Use an LLM to rephrase the query and try again.
  • Hypothetical Document Embeddings (HyDE): Use an LLM to generate a hypothetical response to the search query and use both for retrieval (see the sketch after this list).
  • Sub-queries: Break down longer queries into multiple shorter queries.
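A minimal sketch of HyDE, assuming the OpenAI Python client for generating the hypothetical answer and sentence-transformers for embedding it (the model names and the prompt wording are illustrative assumptions):

from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")

query = "What chunk overlap should I use for question answering?"

# 1. Generate a hypothetical answer to the query
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": f"Write a short passage that answers: {query}"}],
)
hypothetical_answer = completion.choices[0].message.content

# 2. Embed the hypothetical answer (optionally together with the original query) and search with it
query_vector = embedder.encode(hypothetical_answer)
# results = vector_index.search(query_vector, k=5)  # query your vector database with this vector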

Retrieval parameters

The retrieval is an essential component of the RAG pipeline. The first consideration is whether semantic search will be sufficient for your use case or whether you want to experiment with hybrid search.

In the latter case, you need to experiment with weighting the aggregation of sparse and dense retrieval methods in hybrid search [1, 4, 9]. Thus, tuning the parameter alpha, which controls the weighting between semantic (alpha = 1) and keyword-based search (alpha = 0), becomes essential.
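To make the role of alpha concrete, here is a minimal sketch of fusing already-normalized dense and sparse scores. Real hybrid-search implementations handle this inside the vector database, so the function below is purely illustrative:

def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.5) -> float:
    # alpha = 1.0 -> pure semantic (dense) search, alpha = 0.0 -> pure keyword (sparse) search
    return alpha * dense_score + (1 - alpha) * sparse_score

# Example: the same document scored by both retrieval methods (scores assumed normalized to [0, 1])
print(hybrid_score(dense_score=0.82, sparse_score=0.40, alpha=0.75))  # weighting leans toward semantic search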

Also, the number of search results to retrieve will play an essential role. The number of retrieved contexts will impact the length of the used context window (see Prompt engineering). Also, if you are using a re-ranking model, you need to consider how many contexts to input to the model (see Re-ranking models).

Note that while the similarity measure used for semantic search is a parameter you can change, you should not experiment with it but instead set it according to the used embedding model (e.g., text-embedding-ada-002 supports cosine similarity, while multi-qa-MiniLM-l6-cos-v1 supports cosine similarity, dot product, and Euclidean distance).

Advanced retrieval strategies

This section could technically be its own article. For this overview, we will keep it as concise as possible. For an in-depth explanation of the following techniques, I recommend this DeepLearning.AI course:

The underlying idea of this section is that the chunks used for retrieval shouldn't necessarily be the same chunks used for the generation. Ideally, you would embed smaller chunks for retrieval (see Chunking) but retrieve larger contexts. [7]

  • Sentence-window retrieval: Do not retrieve only the relevant sentence, but also a window of sentences before and after the retrieved one (see the sketch after this list).
  • Auto-merging retrieval: The documents are organized in a tree-like structure. At query time, separate but related smaller chunks can be consolidated into a larger context.
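A minimal sketch of the sentence-window idea: index individual sentences, but hand the LLM each hit together with its neighbors. The retrieval step is omitted here (only the window expansion is shown); in practice, the hit would come from a vector search:

sentences = [
    "Chunking splits documents into smaller pieces.",
    "Short chunks are precise but can lack context.",
    "A sentence window adds the surrounding sentences back at generation time.",
    "Re-ranking can then filter out irrelevant results.",
]

def sentence_window(hit_index: int, window: int = 1) -> str:
    # Return the matched sentence plus `window` sentences before and after it
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

# Suppose the retriever matched sentence 2; the LLM receives the larger window as context
print(sentence_window(hit_index=2, window=1))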

Re-ranking models

While semantic search retrieves context based on its semantic similarity to the search query, "most similar" doesn't necessarily mean "most relevant". Re-ranking models, such as Cohere's Rerank model, can help eliminate irrelevant search results by computing a score for the relevance of the query to each retrieved context [1, 9].

“most similar” doesn’t necessarily mean “most relevant”

If you are using a re-ranker model, you may need to re-tune the number of search results to feed into the re-ranker and how many of the re-ranked results you want to feed into the LLM.
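A minimal sketch of re-ranking retrieved contexts with Cohere's Python client (the API key, model name, and top_n value are placeholders; check Cohere's documentation for current model identifiers):

import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

query = "How does chunk overlap affect retrieval?"
retrieved_contexts = [
    "Chunk overlap introduces a rolling window of shared text between chunks.",
    "Vector compression trades precision for memory.",
    "Overlap helps when answers span chunk boundaries.",
]

# Re-rank the retrieved contexts by relevance to the query and keep only the top 2 for the LLM
response = co.rerank(model="rerank-english-v3.0", query=query, documents=retrieved_contexts, top_n=2)
top_contexts = [retrieved_contexts[r.index] for r in response.results]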

As with the embedding models, you may want to experiment with fine-tuning the re-ranker to your specific use case.

LLMs

The LLM is the core component for generating the response. Similar to the embedding models, there is a wide range of LLMs you can choose from depending on your requirements, such as open vs. proprietary models, inference costs, context length, etc. [1]

As with the embedding models or re-ranking models, you may want to experiment with fine-tuning the LLM to your specific use case to incorporate specific wording or tone of voice.

Prompt engineering

How you phrase or engineer your prompt will significantly impact the LLM's completion [1, 8, 9]. For example, you can instruct the LLM to ground its answer in the retrieved search results:

Please base your answer only on the search results and nothing else!
Very important! Your answer MUST be grounded in the search results provided.
Please explain why your answer is grounded in the search results!

Additionally, using few-shot examples in your prompt can improve the quality of the completions.
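A minimal sketch of assembling the final prompt from retrieved contexts and a few-shot example (the template wording and the placeholder data are assumptions for illustration):

few_shot_example = (
    "Example question: What is chunk overlap?\n"
    "Example answer (grounded in the search results): Chunk overlap is the shared text between "
    "adjacent chunks, used to preserve context across chunk boundaries.\n"
)

contexts = [
    "Chunk overlap introduces a rolling window of shared text between chunks.",
    "Overlap helps when answers span chunk boundaries.",
]
question = "Why would I increase the chunk overlap?"

prompt = (
    "Please base your answer only on the search results and nothing else!\n\n"
    + few_shot_example
    + "\nSearch results:\n"
    + "\n".join(f"- {c}" for c in contexts)
    + f"\n\nQuestion: {question}\nAnswer:"
)
# prompt is then sent to the LLM of your choice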

As mentioned in Retrieval parameters, the number of contexts fed into the prompt is a parameter you should experiment with [1]. While the performance of your RAG pipeline can improve with more relevant context, you can also run into a "Lost in the Middle" [6] effect, where relevant context is not recognized as such by the LLM if it is placed in the middle of many contexts.

As more and more developers gain experience with prototyping RAG pipelines, it becomes more important to discuss strategies to bring RAG pipelines to production-ready performance. This article discussed different "hyperparameters" and other knobs you can tune in a RAG pipeline, organized by the relevant stages:

This article covered the following strategies in the ingestion stage:

  • Data cleaning: Ensure data is clean and correct.
  • Chunking: Choice of chunking technique, chunk size (chunk_size), and chunk overlap (overlap).
  • Embedding models: Choice of the embedding model, incl. dimensionality, and whether to fine-tune it.
  • Metadata: Whether to use metadata and choice of metadata.
  • Multi-indexing: Decide whether to use multiple indexes for different data collections.
  • Indexing algorithms: ANN and vector compression algorithms can be chosen and tuned, but they are usually not tuned by practitioners.

And the following strategies in the inference stage (retrieval and generation):

  • Query transformations: Experiment with rephrasing, HyDE, or sub-queries.
  • Retrieval parameters: Choice of search technique (alpha if you have hybrid search enabled) and the number of retrieved search results.
  • Advanced retrieval strategies: Whether to use advanced retrieval strategies, such as sentence-window or auto-merging retrieval.
  • Re-ranking models: Whether to use a re-ranking model, choice of re-ranking model, number of search results to input into the re-ranking model, and whether to fine-tune the re-ranking model.
  • LLMs: Choice of LLM and whether to fine-tune it.
  • Prompt engineering: Experiment with different phrasing and few-shot examples.
