Quick POCs
Most quick proof of concepts (POCs) which allow a user to explore data with the assistance of conversational AI simply blow you away. It feels like pure magic when you can suddenly talk to your documents, or your data, or your code base.
These POCs work wonders on small datasets with a limited number of documents. However, as with almost anything, once you bring it to production you quickly run into problems at scale. When you do a deep dive and inspect the answers the AI gives you, you notice:
- Your agent doesn't reply with complete information. It missed some important pieces of information
- Your agent doesn't reliably give the same answer
- Your agent isn't able to tell you how and where it got which information, making the answer significantly less useful
It turns out that the real magic in RAG does not happen in the generative AI step, but in the process of retrieval and composition. Once you dive in, it's pretty obvious why…
* RAG = Retrieval Augmented Generation — Wikipedia Definition of RAG
A quick recap of how a simple RAG process works (a short code sketch follows the list):
- It all starts with a query. The user asked a question, or some system is trying to answer a question.
- A search is done with the query. Mostly you'd embed the query and do a similarity search, but you can also do a classic elastic search, a combination of both, or a straight lookup of data
- The search result is a set of documents (or document snippets, but let's simply call them documents for now)
- The documents and the essence of the query are combined into some easily readable context so that the AI can work with it
- The AI interprets the question and the documents and generates an answer
- Ideally this answer is fact checked, to see if the AI based the answer on the documents, and/or if it is appropriate for the audience
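In code, that flow looks roughly like the sketch below; embed, vector_search, and llm_answer are hypothetical stand-ins for your embedding model, search index, and chat model:
# Minimal sketch of the simple RAG flow above (all helper functions are hypothetical)
def answer_question(question):
    query_embedding = embed(question)                        # step 2: embed the query
    documents = vector_search(query_embedding, k=10)         # steps 2-3: similarity search returns documents
    context = "\n\n".join(d['content'] for d in documents)   # step 4: compose readable context
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm_answer(prompt)                                # step 5: the AI interprets and answers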
The dirty little secret is that the essence of the RAG process is that you have to provide the answer to the AI (before it even does anything), so that it is able to give you the answer that you're looking for.
In other words:
- the work that the AI does (step 5) is apply judgement, and properly articulate the answer
- the work that the engineer does (step 3 and 4) is find the answer and compose it such that the AI can digest it
Which is more important? The answer is, of course, it depends, because if judgement is the critical element, then the AI model does all the magic. But for an enormous number of business use cases, finding and properly composing the pieces that make up the answer is the more important part.
The first set of problems to solve when running a RAG process are the data ingestion, splitting, chunking, and document interpretation issues. I've written about a few of those in prior articles, but am ignoring them here. For now let's assume you have properly solved your data ingestion and you have a beautiful vector store or search index.
Typical challenges:
- Duplication — Even the best production systems often have duplicate documents. More so when your system is large, you have extensive users or tenants, you connect to multiple data sources, or you deal with versioning, etc.
- Near duplication — Documents which largely contain the same data, but with minor changes. There are two types of near duplication:
— Meaningful — E.g. a small correction, or a minor addition, e.g. a date field with an update
— Meaningless — E.g.: minor punctuation, syntax, or spacing differences, or simply differences introduced by timing or intake processing
- Volume — Some queries have a very large relevant response data set
- Data freshness vs quality — Which snippets of the response data set have the highest quality content for the AI to use vs which snippets are most relevant from a time (freshness) perspective?
- Data variety — How do we ensure a variety of search results such that the AI is properly informed?
- Query phrasing and ambiguity — The prompt that triggered the RAG flow might not be phrased in a way that yields the optimal result, or might even be ambiguous
- Response Personalization — The query might require a different response based on who asks it
This list goes on, but you get the gist.
Won't ever-larger context windows make careful retrieval unnecessary? Short answer: no.
The cost and performance impact of using extremely large context windows should not be underestimated (you easily 10x or 100x your per-query cost), not including any follow-up interaction that the user/system has.
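To put a rough number on it: at a hypothetical price of $3 per million input tokens, a 2,000-token retrieved context costs about $0.006 in input tokens per query, while dumping a 200,000-token corpus into the context costs about $0.60 per query, a 100x difference before any follow-up turns are counted.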
However, putting that aside, imagine the following situation.
We put Anne in a room with a piece of paper. The paper says: *patient Joe: complex foot fracture.* Now we ask Anne, does the patient have a foot fracture? Her answer is "yes, he does".
Now we give Anne 100 pages of medical history on Joe. Her answer becomes "well, depending on what time you're referring to, he had …"
Now we give Anne hundreds of pages on all the patients in the clinic…
What you quickly notice is that how we define the question (or the prompt in our case) starts to get very important. The larger the context window, the more nuance the question needs.
Moreover, the larger the context window, the more the universe of possible answers grows. This can be a positive thing, but in practice it is an approach that invites lazy engineering behavior, and it is likely to reduce the capabilities of your application if not handled intelligently.
As you scale a RAG system from POC to production, here's how to address the typical data challenges with specific solutions. Each approach has been adjusted to fit production requirements and includes examples where useful.
Duplication
Duplication is inevitable in multi-source systems. By using fingerprinting (hashing content), document IDs, or semantic hashing, you can identify exact duplicates at ingestion and prevent redundant content. However, consolidating metadata across duplicates is also helpful; this lets users know that certain content appears in multiple sources, which can add credibility or highlight repetition in the dataset.
# Fingerprinting for deduplication
import hashlib

def fingerprint(doc_content):
    return hashlib.md5(doc_content.encode()).hexdigest()

# Store fingerprints and filter duplicates, while consolidating metadata
fingerprints = {}
unique_docs = []
for doc in docs:
    fp = fingerprint(doc['content'])
    if fp not in fingerprints:
        fingerprints[fp] = [doc]
        unique_docs.append(doc)
    else:
        fingerprints[fp].append(doc)  # Consolidate duplicate sources under one fingerprint
Near Duplication
Near-duplicate documents (similar but not identical) often contain important updates or small additions. Given that a minor change, like a status update, can carry critical information, freshness becomes crucial when filtering near duplicates. A practical approach is to use cosine similarity for initial detection, then retain the freshest version within each group of near duplicates while flagging any meaningful updates.
from sklearn.cluster import DBSCAN

# Cluster embeddings with DBSCAN (cosine distance) to find near duplicates
clustering = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit(doc_embeddings)

# Organize documents by cluster label; label -1 means no near duplicate was found
clustered_docs = {}
for idx, label in enumerate(clustering.labels_):
    if label == -1:
        continue
    if label not in clustered_docs:
        clustered_docs[label] = []
    clustered_docs[label].append(docs[idx])

# Keep documents without near duplicates as-is
filtered_docs = [docs[idx] for idx, label in enumerate(clustering.labels_) if label == -1]

# For each cluster of near duplicates, retain only the freshest document
for cluster_docs in clustered_docs.values():
    # Pick the document with the most recent timestamp
    freshest_doc = max(cluster_docs, key=lambda d: d['timestamp'])
    filtered_docs.append(freshest_doc)
Volume
When a query returns a high volume of relevant documents, effective handling is essential. One approach is a **layered strategy**:
- Theme Extraction: Preprocess documents to extract specific themes or summaries.
- Top-k Filtering: After synthesis, filter the summarized content based on relevance scores.
- Relevance Scoring: Use similarity metrics (e.g., BM25 or cosine similarity) to prioritize the top documents before retrieval.
This approach reduces the workload by retrieving synthesized information that’s more manageable for the AI. Other strategies could involve batching documents by theme or pre-grouping summaries to further streamline retrieval.
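As a minimal sketch of the scoring and top-k part of that strategy, assuming the query and documents have already been embedded (query_embedding, doc_embeddings, and docs are hypothetical names):
import numpy as np

# Hypothetical sketch: rank documents by cosine similarity to the query
# embedding and keep only the top k before composing the context.
def top_k_by_similarity(query_embedding, doc_embeddings, docs, k=20):
    q = np.asarray(query_embedding, dtype=float)
    d = np.asarray(doc_embeddings, dtype=float)
    # Cosine similarity between the query and every document embedding
    scores = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-10)
    top_idx = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top_idx]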
Data Freshness vs. Quality
Balancing quality with freshness is important, especially in fast-evolving datasets. Many scoring approaches are possible, but here’s a general tactic:
- Composite Scoring: Calculate a quality score using factors like source reliability, content depth, and user engagement.
- Recency Weighting: Adjust the score with a timestamp weight to emphasize freshness.
- Filter by Threshold: Only documents meeting a combined quality and recency threshold proceed to retrieval.
Other strategies could involve scoring only high-quality sources or applying decay factors to older documents.
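A minimal sketch of such a composite score, assuming hypothetical quality and timestamp fields on each document and a freely chosen exponential decay for recency:
import time

# Hypothetical composite score: blend a precomputed quality score (0..1) with
# an exponential recency decay, then keep documents above a threshold.
def composite_score(doc, half_life_days=30, quality_weight=0.7):
    age_days = (time.time() - doc['timestamp']) / 86400
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when brand new, halves every 30 days
    return quality_weight * doc['quality'] + (1 - quality_weight) * recency

eligible_docs = [d for d in docs if composite_score(d) >= 0.6]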
Data Variety
Ensuring diverse data sources in retrieval helps create a balanced response. Grouping documents by source (e.g., different databases, authors, or content types) and selecting the top snippets from each source is one effective method. Other approaches include scoring by unique perspectives or applying diversity constraints to avoid over-reliance on any single document or perspective.
# Ensure variety by grouping and selecting top snippets per source
from itertools import groupby

k = 3  # Number of top snippets per source
docs = sorted(docs, key=lambda d: d['source'])
grouped_docs = {key: list(group)[:k] for key, group in groupby(docs, key=lambda d: d['source'])}
diverse_docs = [doc for group in grouped_docs.values() for doc in group]
Query Phrasing and Ambiguity
Ambiguous queries can lead to suboptimal retrieval results. Using the exact user prompt is often not the best way to retrieve the results they require. E.g. there might have been an information exchange earlier in the chat which is relevant. Or the user pasted a large amount of text with a question about it.
To make sure that you use a refined query, one approach is to have the RAG tool provided to the model ask it to rephrase the question into a more detailed search query, similar to how one might carefully craft a search query for Google. This improves alignment between the user's intent and the RAG retrieval process. The phrasing below is suboptimal, but it conveys the gist of it:
tools = [{
    "name": "search_our_database",
    "description": "Search our internal company database for relevant documents",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A search query, as you would write it for a Google search, in sentence form. Take care to include any important nuance of the question."
            }
        },
        "required": ["query"]
    }
}]
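With a tool description like this, the model rewrites the user's (possibly ambiguous) prompt into a self-contained search query, and the retrieval step runs against that rewritten query instead of the raw prompt.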
Response Personalization
For tailored responses, integrate user-specific context directly into the RAG context composition. By adding a user-specific layer to the final context, you allow the AI to take individual preferences, permissions, or history into account without altering the core retrieval process.
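As a minimal sketch of that composition step, assuming a hypothetical user_profile dict and a list of retrieved documents:
# Hypothetical sketch: prepend a user-specific layer to the composed context
def compose_context(user_profile, retrieved_docs):
    # User-specific layer: preferences and permissions the answer should respect
    user_layer = (
        f"Answer for a {user_profile['role']} (preferred language: {user_profile['language']}). "
        f"Only rely on documents this user is permitted to see."
    )
    # Document layer: the retrieved snippets, labeled with their source
    doc_layer = "\n\n".join(f"[{d['source']}] {d['content']}" for d in retrieved_docs)
    return f"{user_layer}\n\nRelevant documents:\n{doc_layer}"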