Enhancing RAG: Beyond Vanilla Approaches



Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by incorporating external information retrieval mechanisms. While standard RAG implementations improve response relevance, they often fall short in complex retrieval scenarios. This article explores the limitations of a vanilla RAG setup and introduces advanced techniques to improve its accuracy and efficiency.

The Challenge with Vanilla RAG

To illustrate RAG's limitations, consider a simple experiment where we try to retrieve relevant information from a set of documents. Our dataset includes:

  • A primary document discussing best practices for staying healthy, productive, and in good shape.
  • Two additional documents on unrelated topics that happen to share some of the same words, used in different contexts.
main_document_text = """
Morning Routine (5:30 AM - 9:00 AM)
✅ Wake Up Early - Aim for 6-8 hours of sleep to feel well-rested.
✅ Hydrate First - Drink a glass of water to rehydrate your body.
✅ Morning Stretch or Light Exercise - Do 5-10 minutes of stretching or a brief workout to activate your body.
✅ Mindfulness or Meditation - Spend 5-10 minutes practicing mindfulness or deep breathing.
✅ Healthy Breakfast - Eat a balanced meal with protein, healthy fats, and fiber.
✅ Plan Your Day - Set goals, review your schedule, and prioritize tasks.
...
"""

Using a standard RAG setup, we query the system with questions such as "What should I do to stay healthy and productive?"

Helper Functions

To boost retrieval accuracy and streamline query processing, we implement a set of essential helper functions. These handle everything from querying the ChatGPT API to computing document embeddings and similarity scores. Together, they form a more efficient RAG pipeline that retrieves the most relevant information for user queries.

To support our RAG improvements, we define the following helper functions:

# Imports
import os
import json
import openai
import numpy as np
from scipy.spatial.distance import cosine
from google.colab import userdata

# Set up the OpenAI API key and client
os.environ["OPENAI_API_KEY"] = userdata.get('AiTeam')
client = openai.OpenAI()
def query_chatgpt(prompt, model="gpt-4o", response_format=openai.NOT_GIVEN):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # Deterministic output; raise for more creative answers
            response_format=response_format
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"
def get_embedding(text, model="text-embedding-3-large"): #"text-embedding-ada-002"
    """Fetches the embedding for a given text using OpenAI's API."""
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding
def compute_similarity_metrics(embed1, embed2):
    """Computes the cosine similarity between two embeddings."""
    cosine_sim = 1 - cosine(embed1, embed2)  # scipy's cosine() is a distance, so invert it
    return cosine_sim
def fetch_similar_docs(query, docs, threshold=0.55, top=1):
    """Returns up to `top` documents whose similarity to the query exceeds `threshold`."""
    query_em = get_embedding(query)
    data = []
    for d in docs:
        # Compute the similarity between the query and each document
        similarity = compute_similarity_metrics(d["embedding"], query_em)
        if similarity >= threshold:
            data.append({"id": d["id"], "ref_doc": d.get("ref_doc", ""), "score": similarity})

    # Sort by score, highest first, and keep the top results
    sorted_data = sorted(data, key=lambda x: x["score"], reverse=True)
    return sorted_data[:top]
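
The tests that follow assume a docs collection holding an embedding for each document. The article does not show how this collection is built, so here is a minimal sketch; unrelated_doc_1_text and unrelated_doc_2_text are hypothetical placeholders for the two unrelated documents described earlier.

# Build the document collection used by fetch_similar_docs.
# NOTE: a reconstruction for illustration; the unrelated texts are hypothetical placeholders.
unrelated_doc_1_text = "..."  # an unrelated document that shares some vocabulary
unrelated_doc_2_text = "..."  # another unrelated document

docs = []
for doc_id, text in [
    ("main_document_text", main_document_text),
    ("unrelated_doc_1", unrelated_doc_1_text),
    ("unrelated_doc_2", unrelated_doc_2_text),
]:
    docs.append({"id": doc_id, "ref_doc": doc_id, "embedding": get_embedding(text)})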

Evaluating the Vanilla RAG

To evaluate the effectiveness of a vanilla RAG setup, we conduct a simple test using predefined queries. Our goal is to determine whether the system retrieves the most relevant document based on semantic similarity. We then analyze its limitations and explore possible improvements.

"""# **Testing Vanilla RAG**"""

query = "what should I do to remain healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

query = "what are one of the best practices to remain healthy and productive ?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

Advanced Techniques for Improved RAG

To further refine the retrieval process, we introduce advanced functions that enhance the capabilities of our RAG system. These functions generate structured information that aids in document retrieval and query processing, making our system more robust and context-aware.

To handle these challenges, we implement three key enhancements:

1. Generating FAQs

By automatically generating a list of frequently asked questions for each document, we expand the range of potential queries the model can match. These FAQs are generated once and stored alongside the document, providing a richer search space without incurring ongoing costs.

def generate_faq(text):
    prompt = f'''
    given the following text: """{text}"""
    Ask relevant simple atomic questions ONLY (don't answer them) to cover all subjects covered by the text. Return the result as a json list example [q1, q2, q3...]
    '''
    return query_chatgpt(prompt, response_format={"type": "json_object"})

2. Creating an Overview

A high-level summary of the document helps capture its core ideas, making retrieval more effective. By embedding the overview alongside the document, we provide additional entry points for relevant queries, improving match rates.

def generate_overview(text):
    prompt = f'''
    given the following text: """{text}"""
    Generate an abstract for it that tells in maximum 3 lines what it is about, using high-level terms that capture the main points.
    Use terms and words that are most likely to be used by the average person.
    '''
    return query_chatgpt(prompt)

3. Query Decomposition

Instead of searching with broad user queries, we break them down into smaller, more precise sub-queries. Each sub-query is then compared against our enhanced document collection, which now includes:

  • The original document
  • The generated FAQs
  • The generated overview

By merging the retrieval results from these multiple sources, we significantly improve the likelihood of finding relevant information; one way to implement this merge is sketched after the function below.

def decompose_query(query):
    prompt = f'''
    Given the user query: """{query}"""
    break it down into smaller, relevant subqueries
    that will retrieve the best information for answering the original query.
    Return them as a ranked json list example [q1, q2, q3...].
    '''
    return query_chatgpt(prompt, response_format={"type": "json_object"})
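
The evaluation below prints each sub-query's results separately. One simple way to merge them into the single result set described above (an illustrative sketch, not code from the article) is to deduplicate hits by id and keep the highest score per entry:

def merge_subquery_results(subqueries, docs, threshold=0.55, top=2):
    """Runs fetch_similar_docs for every sub-query and merges the hits,
    keeping the highest score seen for each retrieved entry.
    A sketch of the merging step described in the text, not the article's own code."""
    best = {}
    for subq in subqueries:
        for hit in fetch_similar_docs(subq, docs, threshold=threshold, top=top):
            if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
                best[hit["id"]] = hit
    # Return one ranked list across all sub-queries, highest score first
    return sorted(best.values(), key=lambda x: x["score"], reverse=True)

Entries that point back to the same source document via ref_doc could be collapsed further, depending on how the merged results feed the generation step.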

Evaluating the Improved RAG

After implementing these techniques, we re-run our initial queries. This time, query decomposition generates several sub-queries, each focusing on a different aspect of the original question. As a result, our system successfully retrieves relevant information from both the FAQs and the original document, demonstrating a substantial improvement over the vanilla RAG approach.

"""# **Testing Advanced Functions**"""

## Generate an overview of the document
overview_text = generate_overview(main_document_text)
print(overview_text)
# Embed the overview and store it alongside the main document
docs.append({"id": "overview_text", "ref_doc": "main_document_text", "embedding": get_embedding(overview_text)})


## Generate FAQs for the document
main_doc_faq_arr = generate_faq(main_document_text)
print(main_doc_faq_arr)
faq = json.loads(main_doc_faq_arr)["questions"]  # the model wraps the list in a "questions" key

# Embed each FAQ entry and store it alongside the main document
for i, f in enumerate(faq):
    docs.append({"id": f"main_doc_faq_{i}", "ref_doc": "main_document_text", "embedding": get_embedding(f)})


## Decompose the first query
query = "what should I do to stay healthy and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)['subqueries']  # the model wraps the list in a "subqueries" key


## Compute the similarities between the sub-queries and documents, including FAQs
for subq in subqueries_list:
    print("query = ", subq)
    r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
    print(r)
    print('=================================\n')


## Decompose the 2nd query
query = "what are the best practices to stay healthy and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)['subqueries']


## Compute the similarities between the sub-queries and documents, including FAQs
for subq in subqueries_list:
    print("query = ", subq)
    r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
    print(r)
    print('=================================\n')

Here are some of the FAQs that were generated:

{
  "questions": [
    "How many hours of sleep are recommended to feel well-rested?",
    "How long should you spend on morning stretching or light exercise?",
    "What is the recommended duration for mindfulness or meditation in the morning?",
    "What should a healthy breakfast include?",
    "What should you do to plan your day effectively?",
    "How can you minimize distractions during work?",
    "How often should you take breaks during work/study productivity time?",
    "What should a healthy lunch consist of?",
    "What activities are recommended for afternoon productivity?",
    "Why is it important to move around every hour in the afternoon?",
    "What types of physical activities are suggested for the evening routine?",
    "What should a nutritious dinner include?",
    "What activities can help you reflect and unwind in the evening?",
    "What should you do to prepare for sleep?",
    …
  ]
}

Cost-Benefit Analysis

While these enhancements introduce an upfront processing cost (generating FAQs, overviews, and embeddings), it is a one-time cost per document. In contrast, a poorly optimized RAG system suffers from two major inefficiencies:

  1. Frustrated users due to low-quality retrieval.
  2. Increased query costs from retrieving excessive, loosely related documents.

For systems handling high query volumes, these inefficiencies compound quickly, making preprocessing a worthwhile investment.
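
As a rough illustration of how the preprocessing cost amortizes, consider this back-of-envelope sketch; every number in it is a hypothetical assumption, not a figure from the article.

# Hypothetical back-of-envelope: one-time enrichment vs. recurring query cost.
# All numbers below are illustrative assumptions.
enrichment_calls_per_doc = 15 + 1 + 2   # ~15 FAQ embeddings + 1 overview embedding + 2 LLM calls
queries_per_day = 1_000
embedding_calls_per_query = 3           # one per sub-query after decomposition

print("one-time calls per document:", enrichment_calls_per_doc)
print("recurring embedding calls per day:", queries_per_day * embedding_calls_per_query)

Under these assumptions, a single day of query traffic already dwarfs the per-document enrichment cost, which is paid only once.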

Conclusion

By integrating document preprocessing (FAQs and overviews) with query decomposition, we create a more intelligent RAG system that balances accuracy and cost-effectiveness. This approach improves retrieval quality, reduces irrelevant results, and delivers a better user experience.

As RAG continues to evolve, these techniques will likely be instrumental in refining AI-driven retrieval systems. Future research may explore further optimizations, including dynamic thresholding and reinforcement learning for query refinement.
