How to Improve LLMs with RAG

Imports

We start by installing and importing the necessary Python libraries.

!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
# if not running on Colab ensure transformers is installed too
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

Setting Up the Knowledge Base

We can configure our knowledge base by defining our embedding model, chunk size, and chunk overlap. Here, we use the ~33M parameter bge-small-en-v1.5 embedding model from BAAI, which is available on the Hugging Face Hub. Other embedding model options can be found on the MTEB text embedding leaderboard.

# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Settings.llm = None # we won't use LlamaIndex to set up the LLM (only embeddings and retrieval)
Settings.chunk_size = 256
Settings.chunk_overlap = 25

Next, we load our source documents. Here, I have a folder called “articles,” which contains PDF versions of three Medium articles I wrote on fat tails. If running this in Colab, you will need to download the articles folder from the GitHub repo and manually upload it to your Colab environment.

For each file in this folder, the function below will read the text from the PDF, split it into chunks (based on the settings defined earlier), and store each chunk in a list called documents.

documents = SimpleDirectoryReader("articles").load_data()

Since the blogs were downloaded directly as PDFs from Medium, they resemble a webpage more than a well-formatted article. Therefore, some chunks may include text unrelated to the article, e.g., webpage headers and Medium article recommendations.
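If you want to see what this junk looks like before filtering, a quick optional check (not part of the original workflow) is to print the beginning of a few chunks and skim them for headers or recommendation blurbs:

# optional: peek at the first few chunks to spot webpage boilerplate
for doc in documents[:3]:
    print(doc.text[:200])
    print("-" * 40)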

In the code block below, I refine the chunks in documents, removing most of the chunks that appear before or after the meat of each article.

print(len(documents)) # prints: 71

# remove chunks that are mostly webpage boilerplate rather than article text
for doc in documents:
    if "Member-only story" in doc.text:
        documents.remove(doc)
        continue

    if "The Data Entrepreneurs" in doc.text:
        documents.remove(doc)

    if " min read" in doc.text:
        documents.remove(doc)

print(len(documents)) # prints: 61

Finally, we can store the refined chunks in a vector database.

index = VectorStoreIndex.from_documents(documents)
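As a side note, LlamaIndex can persist this index to disk so it doesn't need to be rebuilt every session. This is optional and not part of the original workflow; the directory name below is arbitrary.

# optional: save the index to disk and reload it later
index.storage_context.persist(persist_dir="index_store")

from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="index_store")
index = load_index_from_storage(storage_context)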

Setting Up the Retriever

With our knowledge base in place, we can create a retriever using LlamaIndex’s VectorIndexRetriever(), which returns the top 3 most similar chunks to a user query.

# set number of docs to retrieve
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)
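Before wiring the retriever into a query engine, it can be useful to sanity-check it on its own. The snippet below (just an inspection sketch, not part of the original code) prints the similarity score and the first 100 characters of each retrieved chunk:

# optional: inspect what the retriever returns for a test query
nodes = retriever.retrieve("What is fat-tailedness?")
for node in nodes:
    print(round(node.score, 3), node.text[:100])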

Next, we define a query engine that uses the retriever and the query to return a set of relevant chunks.

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)

Use Query Engine

Now, with our knowledge base and retrieval system set up, let’s use them to return chunks relevant to a query. Here, we’ll pass the same technical question we asked ShawGPT (the YouTube comment responder) in the previous article.

query = "What's fat-tailedness?"
response = query_engine.query(query)

The query engine returns a response object containing the text, metadata, and indices of the relevant chunks. The code block below returns a more readable version of this information.

# reformat response
context = "Context:\n"
for i in range(top_k):
    context = context + response.source_nodes[i].text + "\n\n"

print(context)

Context:
Some of the controversy may be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly onto the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape [2].

print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
print("")
Mean κ (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results indicate Medium followers are the most fat-tailed,
followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref [3] to better understand each
κ value. Namely, these values are comparable to a Pareto distribution with α
between 2 and 3.
Although each heuristic told a slightly different story, all signs point toward
Medium followers gained being the most fat-tailed of the three datasets.
Conclusion
While binary labeling of data as fat-tailed (or not) may be tempting, fat-
tailedness lives on a spectrum. Here, we broke down 4 heuristics for
quantifying how fat-tailed data are.

Pareto, Power Laws, and Fat Tails
What they don’t teach you in statistics
towardsdatascience.com
Although Pareto (and more generally power law) distributions give us a
salient example of fat tails, this is a more general notion that lives on a
spectrum ranging from thin-tailed (i.e. a Gaussian) to very fat-tailed (i.e.
Pareto 80 – 20).
The spectrum of Fat-tailedness. Image by author.
This view of fat-tailedness provides us with a more flexible and precise way of
categorizing data than simply labeling it as a Power Law (or not). However,
this begs the question: how do we define fat-tailedness?
4 Ways to Quantify Fat Tails

Adding RAG to LLM

We start by downloading the fine-tuned model from the Hugging Face hub.

# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

As a baseline, we can see how the model responds to the technical question without any context from the articles. To do this, we create a prompt template using a lambda function, which takes in a viewer comment and returns a prompt for the LLM. For more details on where this prompt comes from, see the previous article in this series.

# prompt (no context)
instructions_string = f"""ShawGPT, functioning as a virtual data science
consultant on YouTube, communicates in clear, accessible language, escalating
to technical depth upon request. It reacts to feedback aptly and ends
responses with its signature '–ShawGPT'.

ShawGPT will tailor the length of its responses to match the viewer's comment,
providing concise acknowledgments to brief expressions of gratitude or
feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''
comment = "What's fat-tailedness?"

prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

What is fat-tailedness?
[/INST]

Next, we can pass this prompt to the model using the code below.

model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])
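Note that generate() returns the prompt tokens followed by the completion, so the decoded string above echoes the full prompt. If you only want the model's reply, one small variation (not shown in the original code) is to slice off the prompt tokens before decoding:

# decode only the newly generated tokens (everything after the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))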

Here’s the model’s response (no context).

Great question!

Fat-tailedness is a statistical property of a distribution. In simple terms,
it refers to the presence of extreme outliers or heavy tails in the
distribution.

For example, consider the distribution of heights in a population. A normal
distribution would have most people clustered around an average height, with a
few people deviating slightly from the mean. However, in a fat-tailed
distribution, you would observe a larger number of people being significantly
taller or shorter than the average.

In finance, fat-tailedness is often used to describe the presence of extreme
market moves or outliers in financial returns. It's an important concept in
risk management and financial modeling.

I hope this explanation helps! Let me know if you have any questions.
–ShawGPT

Although the response’s style and formatting are great, the model’s explanation of fat-tailedness differs from how I defined it in my video and blog series.

Let’s see what happens to the model’s response when we include the appropriate context. To do this, we create another prompt template, which can also take in context from the RAG system.

# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request.
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'.
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback,
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""

Next, we pass the prompt with context from the query engine and the viewer comment to the model.

prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

Here’s the new response (with context).

Great question!

Fat-tailedness is a property of a distribution that describes how much the
presence of extreme values (rare events) affects the overall statistics of the
distribution.

In the context of the video, the author discusses how Mediocristan and
Extremistan distributions differ in terms of fat-tailedness. Mediocristan
distributions have a relatively small impact from extreme values, while
Extremistan distributions are heavily influenced by extreme values.

The author also mentions that fat-tailedness is a more general concept than
specific distributions like Pareto or Power Laws. Instead, it lives on a
spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20).

I hope that helps clarify things a bit! Let me know if you have any questions.
–ShawGPT

This does a much better job of capturing my explanation of fat tails than the no-context response and even calls out the niche concepts of Mediocristan and Extremistan.
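To tie everything together, here is a minimal sketch of a helper that runs the full RAG pipeline for a single viewer comment. It only reuses objects already defined above (query_engine, prompt_template_w_context, tokenizer, and model); the function name and the max_new_tokens default are my own choices, not part of the original code.

def rag_respond(comment, max_new_tokens=280):
    # retrieve the most relevant chunks and format them as context
    response = query_engine.query(comment)
    context = "Context:\n"
    for node in response.source_nodes:
        context = context + node.text + "\n\n"

    # build the prompt and generate a reply
    prompt = prompt_template_w_context(context, comment)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                             max_new_tokens=max_new_tokens)

    # return only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(rag_respond("What is fat-tailedness?"))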

Here, I gave a beginner-friendly introduction to RAG and shared a concrete example of how to implement it using LlamaIndex. RAG allows us to augment an LLM system with updatable and domain-specific knowledge.

While much of the recent AI hype has centered around building AI assistants, a powerful (yet less popular) innovation has come from text embeddings (i.e., the things we used to do retrieval). In the next article of this series, I’ll explore text embeddings in more detail, including how they can be used for semantic search and classification tasks.

More on LLMs 👇

Large Language Models (LLMs)
