
LLM+RAG-Based Question Answering


How to do poorly on Kaggle, and learn RAG+LLM from it

23 min read

Dec 25, 2023

Image generated with ChatGPT+/DALL-E3, asking for an illustrative image for an article about RAG.

Retrieval Augmented Generation (RAG) appears to be quite popular these days. Along with the wave of Large Language Models (LLMs), it is one of the popular techniques to get LLMs to perform better on specific tasks such as question answering on in-house documents. Some time ago, I took part in a Kaggle competition that allowed me to try it out and learn a bit better than from random experiments alone. Here are a few learnings from that and the following experiments while writing this article.

All images, unless otherwise noted, are by the author. Generated with the help of ChatGPT+/DALL-E3 (where noted), or taken from my personal Jupyter notebooks.

RAG has two main parts, retrieval and generation. In the first part, retrieval is used to fetch (chunks of) documents related to the query of interest. Generation uses those fetched chunks as added input, called context, to the answer generation model in the second part. This added context is intended to give the generator more up-to-date, hopefully better, information to base its generated answer on than just its base training data.

LLMs have a maximum context or sequence window length they can handle, and the generated input context for RAG needs to be short enough to fit into this sequence window. We want to fit as much relevant information into this context as possible, so getting the best "chunks" of text from the potential input documents is important. These chunks should optimally be the most relevant ones for generating the correct answer to the question posed to the RAG system.

As a first step, the input text is typically chunked into smaller pieces. A basic pre-processing step in RAG is converting these chunks into embeddings using a selected embedding model. A typical sequence window for an embedding model is 512 tokens, which also makes a practical target for chunk size. Once the documents are chunked and encoded into embeddings, a similarity search using the embeddings can be performed to build the context for generating the answer.

I have found Langchain to provide useful tools for input loading and chunking. For example, chunking a document with Langchain (in this case, using the tokenizer of the Flan-T5-Large model) is as simple as:

from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# This is the Flan-T5-Large model I used for the Kaggle competition
llm = "/mystuff/llm/flan-t5-large/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(llm, local_files_only=True)
text_splitter = RecursiveCharacterTextSplitter\
           .from_huggingface_tokenizer(tokenizer, chunk_size=12,
                       chunk_overlap=2,
                       separators=["\n\n", "\n", ". "])
section_text = ("Hello. This is some text to split. With a few "
                "uncharacteristic words to chunk, expecting 2 chunks.")
texts = text_splitter.split_text(section_text)
print(texts)

This produces the following two chunks:

['Hello. This is some text to split',
'. With a few uncharacteristic words to chunk, expecting 2 chunks.']

In the above code, chunk_size 12 tells LangChain to aim for a maximum of 12 tokens per chunk. Depending on the text structure, this may not always be 100% exact. However, in my experience it generally works well. Something to keep in mind is the difference between tokens and words. Here is an example of tokenizing the above section_text:

section_text = ("Hello. This is some text to split. With a few "
                "uncharacteristic words to chunk, expecting 2 chunks.")
encoded_text = tokenizer(section_text)
tokens = tokenizer.convert_ids_to_tokens(encoded_text['input_ids'])
print(tokens)

Resulting output tokens:

['▁Hello', '.', '▁This', '▁is', '▁some', '▁text', '▁to', '▁split', '.', 
'▁With', '▁', 'a', '▁few', '▁un', 'character', 'istic', '▁words',
'▁to', '▁chunk', ',', '▁expecting', '▁2', '▁chunk', 's', '.', '</s>']

Most words in the section_text form a token on their own, as they are common words in texts. However, for special types of words, or domain words, this can be a bit more complicated. For example, here the word "uncharacteristic" becomes three tokens ["▁un", "character", "istic"]. This is because the model tokenizer knows those 3 partial sub-words but not the full word ("uncharacteristic"). Each model comes with its own tokenizer to match these rules in input and model training.

In chunking, the RecursiveCharacterTextSplitter from Langchain used in the above code counts these tokens, and looks for the given separators to split the text into chunks as requested. Trials with different chunk sizes can be useful. In my Kaggle experiment I started with the maximum size for the embedding model, which was 512 tokens, and then proceeded to try chunk sizes of 256, 128, and 64 tokens.
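
To compare chunk sizes, a small loop like the following can be used. This is just a minimal sketch: document_text is a placeholder for one article's text, the tokenizer is the one loaded earlier, and the overlap of one tenth of the chunk size is an arbitrary choice of mine.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# compare how the same document splits at different chunk sizes;
# "document_text" is a placeholder for the full text of one article
for chunk_size in [512, 256, 128, 64]:
    splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer, chunk_size=chunk_size, chunk_overlap=chunk_size // 10,
        separators=["\n\n", "\n", ". "])
    chunks = splitter.split_text(document_text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")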

The Kaggle competition I mentioned was about multiple-choice question answering based on Wikipedia data. The task was to pick the correct answer option from the multiple options for each question. The obvious approach was to use RAG to find the required information from a Wikipedia dump, and use it to generate the correct answer. Here is the first question from the competition data, and its answer options, as an illustration:

Example question and answer options A-E.

The multiple-choice questions were an interesting topic to try out RAG on. But the most common RAG use case is, I think, answering questions based on source documents. Sort of like a chatbot, but typically question answering over domain specific or (company) internal documents. I use this basic question answering use case to demonstrate RAG in this article.

For an example RAG question for this article, I needed something the LLM would not know the answer to directly based on its training data alone. I used Wikipedia data, and since it is likely used as part of the training data for LLMs, I needed a question related to something after the model was trained. The model I used for this article was Zephyr 7B beta, trained in early 2023. Finally, I settled on asking about the Google Bard AI chatbot. It has had many developments over the past year, after the Zephyr training date. I also have reasonable knowledge of Bard to evaluate the LLM's answers. Thus I used "what is google bard?" as an example question for this article.

The first phase of retrieval in RAG is based on the embedding vectors, which are really just points in a multidimensional space. They look something like this (only the first 10 values here):

q_embeddings[:10]
array([-0.45518905, -0.6450379, 0.3097812, -0.4861114 , -0.08480848,
-0.1664767 , 0.1875889, 0.3513346, -0.04495572, 0.12551129],
dtype=float32)

These embedding vectors can be used to compare the words/sentences, and their relations, against each other. The vectors can be built using embedding models. A nice set of these models, with various stats per model, can be found on the MTEB leaderboard. Using one of those models is as simple as this:

from sentence_transformers import SentenceTransformer, util

embedding_model_path = "/mystuff/llm/bge-small-en"
embedding_model = SentenceTransformer(embedding_model_path, device='cuda')

The model page on HuggingFace typically shows the example code. The above loads the model "bge-small-en" from local disk. Creating the embeddings using this model is simply:

query = "what is google bard?"
q_embeddings = embedding_model.encode(query)

In this case, the embedding model is used to encode the given question into an embedding vector. The vector is the same as in the example above:

q_embeddings.shape
(384,)

q_embeddings[:10]
array([-0.45518905, -0.6450379, 0.3097812, -0.4861114 , -0.08480848,
-0.1664767 , 0.1875889, 0.3513346, -0.04495572, 0.12551129],
dtype=float32)

The shape (384,) tells me q_embeddings is a single vector (as opposed to embedding a list of multiple texts at once) of length 384 floats. The slice above shows the first 10 values out of those 384. Some models use longer vectors for more accurate relations, others, like this one, shorter (here 384). Again, the MTEB leaderboard has good examples. The small ones require less space and computation, larger ones give some improvements in representing the relations between chunks, and sometimes a longer sequence length.

For my RAG similarity search, I first needed embeddings for the question. This is the q_embeddings above. This needed to be compared against the embedding vectors of all the searched articles (or their chunks), in this case all the chunked Wikipedia articles. To build embeddings for all of those:

article_embeddings = embedding_model.encode(article_chunks)

Here article_chunks is a list of all chunks for all articles from the English Wikipedia dump. This way they can be batch-encoded.

Implementing similarity search over a large set of documents / document chunks is not too complicated at a basic level. A common way is to calculate cosine similarity between the query and document vectors, and sort accordingly. However, at large scale, this sometimes gets a bit complicated to manage. Vector databases are tools that make this management and search easier / more efficient at scale.
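
Before going to vector databases, here is a minimal sketch of what this looks like at the basic level, using the q_embeddings and article_embeddings from above and the cosine similarity utility from sentence_transformers:

import numpy as np
from sentence_transformers import util

# cosine similarity between the question embedding and every chunk embedding
sims = util.cos_sim(q_embeddings, article_embeddings)[0].cpu().numpy()

# indices of the 10 most similar chunks, highest similarity first
top_idx = np.argsort(-sims)[:10]
for idx in top_idx:
    print(f"chunk {idx}: sim_score={sims[idx]:.4f}")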

For example, Weaviate is a vector database that was used in StackOverflow's AI-based search. In its latest versions, it can also be used in an embedded mode, which should have made it usable even in a Kaggle notebook. It is also used in some DeepLearning.AI LLM short courses, so at least it seems somewhat popular. Of course, there are many others and it is good to make comparisons; this field also evolves fast.

In my trials, I used FAISS from Facebook/Meta research as the vector database. FAISS is more of a library than a client-server database, and was thus easy to use in a Kaggle notebook. And it worked quite nicely.
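
My exact indexing code is not shown here, but as a rough sketch of what using FAISS for this looks like (assuming the embeddings are numpy arrays; a flat inner-product index over L2-normalized vectors corresponds to cosine similarity):

import faiss
import numpy as np

emb = np.asarray(article_embeddings, dtype="float32")
faiss.normalize_L2(emb)                  # normalize so inner product equals cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])  # exact (flat) inner-product index
index.add(emb)

q = np.asarray(q_embeddings, dtype="float32").reshape(1, -1)
faiss.normalize_L2(q)
sim_scores, chunk_ids = index.search(q, 10)   # top 10 chunks and their scores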

Once the chunking and embedding of all the articles was done, I built a Pandas DataFrame with all the relevant information. Here is an example with the first 5 chunks of the Wikipedia dump I used, for a document titled Anarchism:

First 5 chunks from the first article in the Wikipedia dump I used.

Each row in this table (a Pandas DataFrame) contains data for a single chunk after the chunking process. It has 5 columns (a minimal construction sketch follows the list):

  • chunk_id: allows me to map chunk embeddings to the chunk text later.
  • doc_id: allows mapping the chunks back to their document.
  • doc_title: for trialing approaches such as adding the doc title to each chunk.
  • chunk_title: article subsection title for the chunk, same purpose as doc_title.
  • chunk: the actual chunk text.
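
The exact loop depends on how the dump is parsed, but a minimal sketch of collecting the chunking results into such a DataFrame could look like this (the structure of parsed_articles is an assumption for illustration):

import pandas as pd

rows = []
# parsed_articles: assumed list of (doc_title, sections), where sections is
# a list of (section_title, section_text) tuples from the Wikipedia dump
for doc_id, (doc_title, sections) in enumerate(parsed_articles):
    for chunk_title, section_text in sections:
        for chunk in text_splitter.split_text(section_text):
            rows.append({"chunk_id": len(rows), "doc_id": doc_id,
                         "doc_title": doc_title, "chunk_title": chunk_title,
                         "chunk": chunk})
df = pd.DataFrame(rows)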

Here are the embeddings for the first five Anarchism chunks, in the same order as the DataFrame above:

[[ 0.042624 -0.131264 -0.266858 ... -0.329627 0.178211 0.248001]
[-0.120318 -0.110153 -0.059611 ... -0.297150 -0.043165 0.558150]
[ 0.116761 -0.066759 -0.498548 ... -0.330301 0.019448 0.326484]
[-0.517585 0.183634 0.186501 ... 0.134235 -0.033262 0.498731]
[-0.245819 -0.189427 0.159848 ... -0.077107 -0.111901 0.483461]]

Each row is only partially shown here, but illustrates the idea.

Earlier I encoded the query vector for the question "what is google bard?", followed by encoding all the article chunks. With these two sets of embeddings, the first part of RAG search is simple: finding the documents "semantically" closest to the question. In practice this means just calculating a measure such as cosine similarity between the query embedding vector and all the chunk vectors, and sorting by the similarity score.

Here are the top 10 "semantically" closest chunks to the q_embeddings:

Top 10 chunks sorted by their cosine similarity with the query.

Each row in this table (DataFrame) represents a chunk. The sim_score here is the calculated cosine similarity score, and the rows are sorted from highest cosine similarity to lowest. The table shows the 10 highest sim_score rows.

A pure embeddings-based similarity search is very fast and cheap in terms of computation. However, it is not quite as accurate as some other approaches. Re-ranking is a term used to describe the process of using another, more computationally expensive model to more accurately sort this initial list of top documents. This model is usually too expensive to run against all documents and chunks, but running it on the set of top chunks after the initial similarity search is much more feasible. Re-ranking helps to get a better list of final chunks to build the input context for the generation part of RAG.

The same MTEB leaderboard that hosts metrics for the embedding models also has re-ranking scores for many models. In this case I used the bge-reranker-base model for re-ranking:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rerank_model_path = "/mystuff/llm/bge-reranker-base"
rerank_tokenizer = AutoTokenizer.from_pretrained(rerank_model_path)
rerank_model = AutoModelForSequenceClassification\
    .from_pretrained(rerank_model_path)
rerank_model.eval()

def calculate_rerank_scores(pairs):
    with torch.no_grad():
        inputs = rerank_tokenizer(pairs, padding=True,
                                  truncation=True, return_tensors='pt',
                                  max_length=512)
        scores = rerank_model(**inputs, return_dict=True)\
            .logits.view(-1, ).float()
    return scores

question = questions[idx]
pairs = [(question, chunk) for chunk in doc_chunks_all[idx]]
rerank_scores = calculate_rerank_scores(pairs)
df["rerank_score"] = rerank_scores

After adding rerank_score to the chunk DataFrame, and sorting with it:

Top 10 chunks sorted by their re-rank score with the query.

Comparing the two tables above (first sorted by sim_score vs now by rerank_score), there are some clear differences. Sorting by the plain similarity score (sim_score) from embeddings, the Tenor page is the fifth most similar chunk. Since Tenor appears to be a GIF search engine hosted by Google, I guess it makes some sense to see its embeddings close to the question "what is google bard?". But it has nothing really to do with Bard itself, except that Tenor is a Google product in a similar domain.

However, after sorting by the rerank_score, the results make much more sense. Tenor is gone from the top 10, and only the last two chunks from the top 10 list appear to be unrelated. These are about the names "Bard" and "Bård". Possibly because the best source of information on Google Bard appears to be the page on Google Bard, which in the above tables is the document with id 6026776. After that I guess RAG runs out of good article matches and goes a bit off-road (Bård). This can also be seen in the negative re-rank scores for those two last rows/chunks of the table.

Typically there would likely be many relevant documents, and chunks across those documents, not just the 1 document and 8 chunks as above. But in this case this limitation helps illustrate the difference between basic embeddings-based similarity search and re-ranking, and how re-ranking can positively affect the end result.

What do we do once we have collected the top chunks for RAG input? We need to build the context for the generator model from these chunks. At its simplest, this is just a concatenation of the selected top chunks into a long text sequence. The maximum length of this sequence is constrained by the model used. As I used the Zephyr 7B model, I used 4096 tokens as the maximum length. The Zephyr page gives this as a flexible sequence limit (with a sliding attention window). A longer context seems better, but it appears this is not always the case. Better to test it.
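
The context building itself can be as simple as the following sketch. It packs the top-ranked chunk texts into the token budget, counting tokens with the generator model's tokenizer (loaded in the next code block); the helper name is just for illustration:

def build_context(chunks, tokenizer, max_tokens=4096):
    # concatenate top-ranked chunks until the token budget is used up
    context_parts, used_tokens = [], 0
    for chunk in chunks:
        n_tokens = len(tokenizer.encode(chunk))
        if used_tokens + n_tokens > max_tokens:
            break
        context_parts.append(chunk)
        used_tokens += n_tokens
    return "\n\n".join(context_parts)

# top_chunks: chunk texts sorted by rerank_score, highest first
context = build_context(top_chunks, tokenizer)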

Here is the base code I used to generate the answer with this context:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llm_answer_path = "/mystuff/llm/zephyr-7b-beta"
torch_device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(llm_answer_path,
                                          local_files_only=True)
llm_answer = AutoModelForCausalLM.from_pretrained(llm_answer_path,
                        device_map=torch_device, local_files_only=True,
                        torch_dtype=torch.float16)
# assuming here that "context" contains the pre-built context
# and "question" contains the question text
query = "answer the following question, "\
        "based on your knowledge and the provided context. "\
        "Keep the answer concise.\n\nquestion:" + question +\
        "\n\ncontext:" + context

input_ids = tokenizer.encode(query+"\n\nANSWER:", return_tensors='pt',
                             return_attention_mask=False).to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024,
                                    do_sample=True)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print(answer[len(query):])

As noted, in this case the context was just a concatenation of the top ranked chunks.

For comparison, let's first try what the model answers without any added context, i.e. based on its training data alone:

query = "what is google bard?"
input_ids = tokenizer.encode(query+"\n\nANSWER:", return_tensors='pt',
                             return_attention_mask=False).to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024,
                                    do_sample=True)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print(answer[len(query):])

This gives (one of many runs; slight variations but generally similar):

ANSWER:
Google Bard is an experimental, AI-based language model developed by
Google's sister company, DeepMind. Its primary use is to generate
human-like text responses to prompts, which can help in tasks such as
content creation, idea generation, and text summarization. Bard is
trained on a vast amount of textual data and can provide highly
relevant and contextually accurate responses, making it a useful tool
in various applications where text generation is required. However, as
an experimental feature, Bard's accuracy and effectiveness may still be
improving, and it is not yet publicly available for use.

Generally accurate, but missing much of the latest developments. In comparison, let's try providing the generated context with the question:

query = "answer the following question, "\
        "based on your knowledge and the provided context. "\
        "Keep the answer concise.\n\n"\
        "question:" + question + "\n\ncontext:" + context
input_ids = tokenizer.encode(query+"\n\nANSWER:", return_tensors='pt',
                             return_attention_mask=False).to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024,
                                    do_sample=True)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print(answer[len(query):])

The following is an example answer with the top sim_score sorted chunks as context (this includes the Tenor and Bård page chunks):

ANSWER:
Bard is a conversational generative artificial intelligence (AI)
chatbot developed by Google based initially on the LaMDA family of
large language models (LLMs) and later PaLM. It was launched in a
limited capacity in March 2023 to mixed reviews, and expanded to other
countries in May, following the launch of Microsoft's Bing Chat. Google
researcher Jacob Devlin resigned from the company after alleging Bard
had surreptitiously leveraged data from ChatGPT. Tenor is an online GIF
search engine and database owned by Google, available on Android, iOS,
and macOS. Its main product is the GIF Keyboard, which is integrated
into numerous apps and messaging services. Bård is a Norwegian
masculine given name of Old Norse origin. It is sometimes used as a
surname. It may refer to several people. Tenor and Bård are not related.

This is not a very good answer, as it starts talking about completely unrelated topics here, Tenor and Bård. Partly because in this case the Tenor chunk is included in the context, and the chunk order is also generally less optimal as it is not re-ranked.

In comparison, with rerank_score sorted context chunks (better chunk ordering and Tenor gone):

ANSWER:
Google Bard is a conversational generative artificial intelligence (AI)
chatbot, based initially on the LaMDA family of large language models
(LLMs) and later on PaLM, developed by Google to compete with OpenAI's
ChatGPT. It was rolled out in a limited capacity in March 2023 and
expanded to more countries in May, prompting a mixed reception from
critics, who raised safety and ethical concerns about its accuracy and
usefulness. Google has promised that Bard will be tightly integrated
with other Google AI products and services, leading to claims that a
new AI-powered version of the Google Assistant, dubbed "Assistant with
Bard", is being prepared for launch. Google has also stressed that Bard
is still in its early stages and being continuously refined, with plans
to upgrade it with new personalization and productivity features, while
stressing that it remains distinct from Google Search.

Now the unrelated topics are gone and the answer in general is better and more to the point.

This highlights that it is not only important to find proper context to give to the model, but also to trim out the unrelated context. At least in this case, the Zephyr model was not able to directly identify which part of the context was relevant, but rather seems to have summarized all of it. Cannot really fault the model, as I gave it that context and asked it to use it.

Given the re-rank scores for the chunks, a general filtering approach based on metrics such as negative re-rank scores would also have solved this issue in the above case, since the "bad" chunks in this case have a negative re-rank score.
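
A minimal version of such a filter on the chunk DataFrame could look like this (df and rerank_score as in the tables above):

# drop chunks the re-ranker considers unrelated (negative score),
# then keep the 10 best of what remains, best first
df_filtered = df[df["rerank_score"] > 0]
top_chunks = (df_filtered.sort_values("rerank_score", ascending=False)
              .head(10)["chunk"].tolist())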

Something to note is that Google released a new and much improved Gemini family of models for Bard around the time I was writing this article. It is not mentioned in the generated answers here since the Wikipedia dumps are generated with a slight delay. So, as one might guess, it is important to try to have up-to-date information in the context, and to keep it relevant and focused.

Embeddings are a great tool, but sometimes it is a bit difficult to really grasp how they are working, and what is happening with the similarity search. A basic approach is to plot the embeddings against each other to get some insight into their relations.

Building such a visualization is quite simple with PCA and visualization libraries. It involves mapping the embedding vectors to 2 or 3 dimensions, and plotting the results. Here I map from those 384 dimensions to 2, and plot the result:

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

fp_embeddings = embedding_model.encode(first_chunks)
q_embeddings_reshaped = q_embeddings.reshape(1, -1)
combined_embeddings = np.concatenate((fp_embeddings, q_embeddings_reshaped))

X = combined_embeddings
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

df_embedded_pca = pd.DataFrame(X_pca, columns=["x", "y"])
# text is a short version of the chunk text (plot title)
df_embedded_pca["text"] = titles
# row_type = article or query per each embedding
df_embedded_pca["row_type"] = row_types

plt.figure(figsize=(16, 10))
sns.scatterplot(x="x", y="y", hue="row_type",
                palette={"article": "blue", "query": "red"},
                data=df_embedded_pca, #legend="full",
                alpha=0.8, s=100)
for i in range(df_embedded_pca.shape[0]):
    plt.annotate(df_embedded_pca["text"].iloc[i],
                 (df_embedded_pca["x"].iloc[i], df_embedded_pca["y"].iloc[i]),
                 fontsize=20)
plt.legend(fontsize='20')
# Change the font size for x and y axis ticks
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
# Change the font size for x and y axis labels
plt.xlabel('X', fontsize=16)
plt.ylabel('Y', fontsize=16)

For the top 10 articles for the "what is google bard?" question, this gives the following visualization:

PCA-based 2D plot of query embeddings vs article 1st chunk embeddings.

In this plot, the red dot is the embedding for the question "what is google bard?". The blue dots are the closest Wikipedia article matches, according to sim_score.

The Bard article is clearly the closest one to the question, while the rest are a bit further off. The Tenor article appears to be about the second closest, while the Bård one is a bit further away, possibly due to the loss of information in mapping from 384 dimensions to 2. Because of this, the visualization is not perfectly accurate, but it is helpful for a quick human overview.

The next figure illustrates an actual error finding from my Kaggle code using a similar PCA plot. Looking for a bit of insight, I tried a simple question about the first article in the Wikipedia dump ("Anarchism"), with the question "what is the definition of anarchism?". The following is what the PCA visualization looked like for the closest articles; the marked outliers are perhaps the most interesting part:

My fail shown in a PCA-based 2D plot of Kaggle embeddings for selected top documents.

The red dot in the bottom left corner is again the question. The cluster of blue dots next to it are all related articles about anarchism. And then there are the two outlier dots in the top right. I removed the titles from the plot to keep it readable. The two outlier articles appeared to have nothing to do with the question when inspected.

Why is this? As I indexed the articles with the various chunk sizes of 512, 256, 128, and 64, I had some issues in processing all the articles for the 256 chunk size, and restarted the chunking in the middle. This resulted in some differences in the indices of some of those embeddings vs the chunk texts I had stored. After noticing these strange looking results, I re-calculated the embeddings with the 256 token chunk size, compared the results vs size 512, and noted this difference. Too bad the competition was over by then 🙂

In the above I discussed chunking the documents and using similarity search + re-ranking as a way to find relevant chunks and build a context for the question answering. I found it is sometimes also useful to consider how the initial documents to chunk are selected, vs just the chunks themselves.

As example methods, the advanced RAG course on DeepLearning.AI presents two approaches: sentence windowing, and hierarchical chunk merging. In summary, this looks at nearby chunks and, if multiple are ranked high by their scores, takes them as a single large chunk. The "hierarchy" comes from considering larger and larger chunk combinations for joint relevance. The aim is a more cohesive context vs randomly ordered small chunks, giving the generator LLM better input to work with.

As a simple example of this, here is the re-ranked set of top chunks for my Bard example above:

Top 10 chunks for my Bard example, sorted by rerank_score.

The leftmost column here is the index of the chunk. In my generation, I just took the top chunks in this sorted order as in the table. If we wanted to make the context a bit more coherent, we could sort the final chosen chunks by their order within a document. If there is a small piece missing between highly ranked chunks, adding the missing one (e.g., here chunk id 7) could help fill in gaps, similar to the hierarchical merging. This could be something to try as a last step for final gains.
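
Here is a small sketch of that kind of post-processing on the chunk DataFrame, where top_df holds the selected top chunks (a subset of df). The gap-filling is my own simplification of the idea, not the exact algorithm from the course:

# chunk ids of the selected top chunks for one document, in document order
selected_ids = sorted(top_df["chunk_id"].tolist())

# fill single-chunk gaps (e.g. ids 6 and 8 selected -> also add 7)
filled_ids = set(selected_ids)
for a, b in zip(selected_ids, selected_ids[1:]):
    if b - a == 2:
        filled_ids.add(a + 1)

# re-assemble the context in document order instead of score order
ordered = df[df["chunk_id"].isin(filled_ids)].sort_values("chunk_id")["chunk"]
context = "\n\n".join(ordered)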

In my Kaggle experiments, I performed initial document selection based on the first chunk only. This was partly due to Kaggle's resource limits, but it appeared to have some other benefits as well. Typically, an article's beginning acts as a summary (introduction or abstract). Initial chunk selection from such ranked articles may help select chunks with more relevant overall context.

This is visible in my Bard example above, where both the rerank_score and sim_score are highest for the first chunk of the best article. To try to improve on this, I also tried using a larger chunk size for this initial document selection, to include more of the introduction for better relevance. Then I chunked the top selected documents with smaller chunk sizes, to experiment with how good the context is with each size.

While I could not run the initial search on all chunks of all documents on Kaggle due to resource limitations, I tried it outside of Kaggle. In these trials, I noticed that sometimes single chunks of unrelated articles get ranked high, while in reality being misleading for the answer generation. For example, an actor biography in a related movie. Initial document relevance selection may help avoid this. Unfortunately, I did not have time to study this further with different configurations, and good re-ranking may already help.

Finally, repeating the same information in multiple chunks in the context is not very useful. A top ranking of the chunks does not guarantee that they best complement each other, or give the best chunk diversity. For example, LangChain has a special chunk selector for Maximum Marginal Relevance. It does this by penalizing new chunks by how close they are to the already added chunks.
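
Setting LangChain's implementation aside, the idea itself is simple enough to sketch directly on the embeddings (a minimal version, with lambda_param controlling the relevance vs diversity trade-off):

import numpy as np
from sentence_transformers import util

def mmr_select(q_emb, chunk_embs, k=10, lambda_param=0.7):
    # similarity of each chunk to the question, and between chunks
    sim_to_q = util.cos_sim(q_emb, chunk_embs)[0].cpu().numpy()
    sim_between = util.cos_sim(chunk_embs, chunk_embs).cpu().numpy()
    selected = [int(np.argmax(sim_to_q))]
    while len(selected) < min(k, len(sim_to_q)):
        candidates = [i for i in range(len(sim_to_q)) if i not in selected]
        # reward similarity to the question, penalize similarity to chunks already picked
        scores = [lambda_param * sim_to_q[i]
                  - (1 - lambda_param) * max(sim_between[i][j] for j in selected)
                  for i in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected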

I used a very simple query / question for my RAG example here ("what is google bard?"), and simple is good for illustrating the basic RAG concept. This is a pretty short query input considering that the embedding model I used had a 512 token maximum sequence length. If I encode this question into tokens using the tokenizer for the embedding model (bge-small-en), I get the following tokens:

['[CLS]', 'what', 'is', 'google', 'bard', '?', '[SEP]']

Which amounts to a total of 7 tokens. With a maximum sequence length of 512, this leaves plenty of room if I want to use a longer query sentence. Sometimes this can be useful, especially if the information we want to retrieve is not such a simple query, or if the domain is more complex. For a very small query, the semantic search may not work best, as noted also in the Stack Overflow AI Journey posting.

For example, the Kaggle competition had a set of questions, each with 5 answer options to choose from. I initially tried RAG with just the question as the input for the embedding model. The search results were not too great, so I tried again with the question + all the answer options as the query. This produced much better results.

As an illustration, here is the first question in the training dataset of the competition:

Which of the next statements accurately describes the impact of 
Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass"
discrepancy in galaxy clusters?

This is 32 tokens for the bge-small-en model. So about 480 are still left to fit into the maximum 512 token sequence length.

Here is the first question together with the 5 answer options given for it:

Example question and answer options A-E. Concatenating all these texts formed the query.

Concatenating the question and the given options into one RAG query gives it a length of 235 tokens, with still more than 50% of the embedding model sequence length left. In my case, this approach produced much better results, both from manual inspection and for the competition score. Thus, experimenting with different ways to make the RAG query itself more expressive is worth a try.
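
The query expansion itself is trivial. A minimal sketch (the column names here are assumptions about the competition data format):

# build a more expressive RAG query from the question and its answer options
row = train_df.iloc[0]   # first question in the competition training data
rag_query = " ".join([row["prompt"], row["A"], row["B"],
                      row["C"], row["D"], row["E"]])
q_embeddings = embedding_model.encode(rag_query)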

Finally, there is the topic of hallucinations, where the model produces text that is inaccurate or fabricated. The Tenor example from my sim_score sorting is one kind of example, even if the generator did base it on the actual given context. So better keep the context good, I guess :).

To address hallucinations, the chatbots from the big AI companies (Google Bard, ChatGPT, Bing Chat) all provide means to link parts of their generated answers to verifiable sources. Bard has a specific "G" button that performs a Google search and highlights parts of the generated answer that match the search results. Too bad we don't always have a world-class search engine for our data to help.

Bing Chat has a similar approach, highlighting parts of the answer and adding a reference to the source websites. ChatGPT has a slightly different approach; I had to explicitly ask it to verify its answer and update it with the latest developments, telling it to use its browser tool. After this, it did a web search and linked to specific websites as sources. The source quality seemed to vary quite a bit, as in any web search. Of course, for internal documents this kind of web search is not possible. However, linking to the source should always be possible even internally.

I also asked Bard, ChatGPT+, and Bing for ideas on detecting hallucinations. The results included an LLM hallucination ranking index, including RAG hallucination. When tuning LLMs, it can also help to set the temperature parameter to zero for the LLM to generate deterministic, most probable output tokens.
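
With the Hugging Face generate API used earlier, this roughly corresponds to turning sampling off, so the temperature no longer adds randomness (a minimal sketch):

# greedy, deterministic decoding: no sampling, temperature has no effect
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024,
                                    do_sample=False)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)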

Finally, as this is a very common problem, there appear to be various approaches being built to address this challenge a bit better. For example, specific LLMs to help detect hallucinations appear to be a promising area. I did not have time to try them, but they are definitely relevant in larger projects.

Besides implementing a working RAG solution, it is also nice to be able to tell something about how well it works. In the Kaggle competition this was quite simple. I just ran the solution to try to answer the given questions in the training dataset, comparing to the correct answers given in the training data. Or submitted the model for scoring on the Kaggle competition test set. The better the answer score, the better one could call the RAG solution, even if there was more to the score.

In many cases, a suitable evaluation dataset for domain specific RAG may not be available. For this scenario, one might want to start with some generic NLP evaluation datasets, such as this list. Tools such as LangChain also include support for auto-generating questions and answers, and evaluating them. In this case, an LLM is used to create example questions and answers for a given set of documents, and another LLM is used to evaluate whether the RAG can provide the correct answer to those questions. This is probably better explained in this tutorial on RAG evaluation with LangChain.

While the generic solutions are likely good to start with, in a real project I would try to collect a real dataset of questions and answers from the domain experts and the intended users of the RAG solution. As the LLM is typically expected to generate a natural language response, this can vary a lot while still being correct. For this reason, evaluating whether the answer was correct or not is not as straightforward as with a regular expression or similar pattern matching. Here, I find the idea of using another LLM to evaluate whether the given response matches a reference response a very useful tool. These models can deal with the text variation much better.
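
As a rough sketch of the idea with a hand-rolled prompt (not LangChain's built-in evaluators; generate_fn stands for whatever LLM call is available):

def judge_answer(question, reference_answer, candidate_answer, generate_fn):
    # ask a separate LLM whether the candidate answer matches the reference
    prompt = ("You are grading a question answering system.\n"
              f"question: {question}\n"
              f"reference answer: {reference_answer}\n"
              f"candidate answer: {candidate_answer}\n"
              "Does the candidate answer convey the same facts as the reference? "
              "Reply with only CORRECT or INCORRECT.")
    return generate_fn(prompt).strip()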

RAG is a very nice tool, and is quite a popular topic these days with the high interest in LLMs in general. While RAG and embeddings have been around for a good while, the latest powerful LLMs and their fast evolution have perhaps made them more interesting for many advanced use cases. I expect the field to keep evolving at a good pace, and it is sometimes a bit difficult to keep up to date on everything. For this, summaries such as reviews on RAG developments can give points to at least keep the main developments in sight.

The RAG approach in general is quite simple: find a set of chunks of text similar to the given query, concatenate them into a context, and ask the LLM for an answer. However, as I tried to show here, there can be various issues to consider in how to make this work well and efficiently for different needs. From good context retrieval, to ranking and selecting the best results, and finally being able to link the results back to actual source documents. And evaluating the resulting query contexts and answers. And as the Stack Overflow people noted, sometimes the more traditional lexical or hybrid search is very useful as well, even if semantic search is cool.

That’s all for today. RAG on…

ChatGPT+/DALL-E3 vision of what it means to RAG on..
