A beginner’s guide to building a Retrieval Augmented Generation (RAG) application from scratch


Learn critical knowledge for building AI apps, in plain English

Retrieval Augmented Generation, or RAG, is all the rage these days because it adds a serious capability to large language models like OpenAI’s GPT-4: the ability to use your own data.

This post will teach you the fundamental intuition behind RAG while providing a simple tutorial to help you get started.

There’s a lot of noise in the AI space, and specifically about RAG. Vendors are trying to overcomplicate it. They’re trying to inject their tools, their ecosystems, their vision.

It’s making RAG far more complicated than it needs to be. This tutorial is designed to help beginners learn how to build RAG applications from scratch. No fluff, no (okay, minimal) jargon, no libraries, just a simple step-by-step RAG application.

Jerry from LlamaIndex advocates for building things from scratch to really understand the pieces. Once you do, using a library like LlamaIndex makes more sense.

Build from scratch to learn, then build with libraries to scale.

Let’s start!

You may or may not have heard of Retrieval Augmented Generation, or RAG.

Here’s the definition from the blog post in which Facebook introduced the concept:

Building a model that researches and contextualizes is more challenging, but it’s essential for future advancements. We recently made substantial progress in this realm with our Retrieval Augmented Generation (RAG) architecture, an end-to-end differentiable model that combines an information retrieval component (Facebook AI’s dense-passage retrieval system) with a seq2seq generator (our Bidirectional and Auto-Regressive Transformers [BART] model). RAG can be fine-tuned on knowledge-intensive downstream tasks to achieve state-of-the-art results compared with even the largest pretrained seq2seq language models. And unlike these pretrained models, RAG’s internal knowledge can be easily altered or even supplemented on the fly, enabling researchers and engineers to control what RAG knows and doesn’t know without wasting time or compute power retraining the entire model.

Wow, that’s a mouthful.

Simplifying the technique for beginners, we can say that the essence of RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model. As a result, you get an output. That gives you several benefits:

  1. You can include facts in the prompt to help the LLM avoid hallucinations
  2. You can (manually) refer to sources of truth when responding to a user query, helping to double check any potential issues.
  3. You can leverage data that the LLM might not have been trained on.

At its core, a RAG system needs three components:

  1. a collection of documents (formally called a corpus)
  2. an input from the user
  3. a similarity measure between the collection of documents and the user input

Yes, it’s that simple.

To start learning and understanding RAG-based systems, you don’t need a vector store; you don’t even need an LLM (at least to learn and understand conceptually).

While it is often portrayed as complicated, it doesn’t have to be.

We’ll perform the following steps in sequence.

  1. Receive a user input
  2. Perform our similarity measure
  3. Post-process the user input and the fetched document(s).

The post-processing is done with an LLM.
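To make that flow concrete, here’s a minimal sketch of the pipeline in Python. The names retrieve and generate are placeholders for the pieces we build in the rest of this post, not functions from any library.

def retrieve(user_input, corpus):
    # steps 1 and 2: receive the user input and find the most similar document
    ...  # implemented below with a similarity measure

def generate(user_input, document):
    # step 3: post-process the user input and the fetched document with an LLM
    ...  # implemented below with a prompt and a local LLM

def rag(user_input, corpus):
    document = retrieve(user_input, corpus)
    return generate(user_input, document)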

The original RAG paper is obviously the canonical resource. The problem is that it assumes a LOT of context. It’s more complicated than we need it to be.

For instance, here’s the overview of the RAG system as proposed in the paper.

An overview of RAG from the RAG paper by Lewis, et al.

That’s dense.

It’s great for researchers, but for the rest of us it’s going to be a lot easier to learn step by step by building the system ourselves.

Let’s get back to building RAG from scratch, step by step. Here are the simplified steps we’ll be working through. While this isn’t technically “RAG”, it’s a simplified model to learn with that allows us to progress to more complicated variations.

Below you can see that we’ve got a simple corpus of ‘documents’ (please be generous 😉).

corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

Now we need a way of measuring the similarity between the user input we’re going to receive and the collection of documents that we organized. Arguably the simplest similarity measure is Jaccard similarity. I’ve written about that previously (see this post), but the short answer is that the Jaccard similarity is the intersection divided by the union of the “sets” of words.

This allows us to compare our user input with the source documents.
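In set notation, where A is the set of words in the user input and B is the set of words in a document, that’s:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|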

Side note: preprocessing

A challenge is that if we have a plain string like "Take a leisurely walk in the park and enjoy the fresh air.", we’ll have to pre-process it into a set so that we can perform these comparisons. We’ll do that in the simplest way possible: lower case and split by " ".

def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)
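As a quick sanity check, here’s what this returns for a sample input against one of our documents. The only shared word is “hike”, so the intersection has 1 element and the union has 12, giving a small but non-zero score:

jaccard_similarity("I like to hike", "Go for a hike and admire the natural scenery.")
0.08333333333333333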

Now we need to define a function that takes in the exact query and our corpus and selects the ‘best’ document to return to the user.

def return_response(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    # return the document with the highest similarity score
    return corpus[similarities.index(max(similarities))]

Now we can run it. We’ll start with a simple prompt.

user_prompt = "What's a leisure activity that you just like?"

And a simple user input…

user_input = "I prefer to hike"

Now we can return our response.

return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'

Congratulations, you’ve built a basic RAG application.

I got 99 problems and bad similarity is one

Now, we’ve opted for a simple similarity measure for learning. But this is going to be problematic because it’s so simple. It has no notion of semantics. It just looks at which words appear in both documents. That means that if we provide a negative example, we’re going to get the same “result” because it’s still the closest document.

user_input = "I don't love to hike"
return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'

This is a topic that’s going to come up a lot with “RAG”, but for now, rest assured that we’ll address this problem later.

At this point, we have not done any post-processing of the “document” to which we’re responding. So far, we’ve implemented only the “retrieval” part of “Retrieval-Augmented Generation”. The next step is to augment generation by incorporating a large language model (LLM).

To do this, we’re going to use ollama to get up and running with an open source LLM on our local machine. We could just as easily use OpenAI’s gpt-4 or Anthropic’s Claude, but for now we’ll start with the open source llama2 from Meta AI.

This post is going to assume some basic knowledge of large language models, so let’s get right to querying this model.

import requests
import json

First we’re going to define the inputs. To work with this model, we’re going to:

  1. take the user input,
  2. fetch the most similar document (as measured by our similarity measure),
  3. pass that into a prompt for the language model,
  4. then return the result to the user

That introduces a new term, the prompt. In short, it’s the set of instructions that you provide to the LLM.

When you run this code, you’ll see the streaming result. Streaming is important for the user experience.

user_input = "I prefer to hike"
relevant_document = return_response(user_input, corpus_of_documents)
full_response = []
prompt = """
You might be a bot that makes recommendations for activities. You answer in very short sentences and don't include extra information.
That is the beneficial activity: {relevant_document}
The user input is: {user_input}
Compile a suggestion to the user based on the beneficial activity and the user input.
"""

Having defined that, let’s now make the API call to ollama (and llama2). An important step is to make sure that ollama is already running on your local machine by running ollama serve.

Note: this might be slow on your machine; it’s certainly slow on mine. Be patient, young grasshopper.
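If you want to double check that the server is up before making the call, one (hedged) option is to hit ollama’s base URL, which returns a short status message when the server is running; the exact response text may vary between ollama versions.

# Optional sanity check that `ollama serve` is listening on the default port.
try:
    health = requests.get("http://localhost:11434", timeout=2)
    print(health.status_code, health.text)
except requests.exceptions.ConnectionError:
    print("ollama doesn't appear to be running -- start it with `ollama serve`")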

url = 'http://localhost:11434/api/generate'
data = {
    "model": "llama2",
    "prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
    count = 0
    for line in response.iter_lines():
        # filter out keep-alive new lines
        # count += 1
        # if count % 5 == 0:
        #     print(decoded_line['response'])  # print every fifth token
        if line:
            decoded_line = json.loads(line.decode('utf-8'))
            full_response.append(decoded_line['response'])
finally:
    response.close()
print(''.join(full_response))

Great! Based on your interest in hiking, I recommend checking out the nearby trails for a challenging and rewarding experience with breathtaking views Great! Based on your interest in hiking, I recommend checking out the nearby trails for a fun and challenging adventure.
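As an aside, if you’d rather not deal with streaming while experimenting, ollama’s generate endpoint also accepts "stream": false and returns the whole completion in a single JSON object. A minimal sketch, reusing the url, data, and headers defined above:

# Non-streaming variant: one request, one JSON body containing the full completion.
data_no_stream = {**data, "stream": False}
response = requests.post(url, data=json.dumps(data_no_stream), headers=headers)
print(response.json()['response'])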

This gives us a complete RAG application, from scratch, with no providers and no services: all the components of a Retrieval-Augmented Generation application. Visually, here’s what we’ve built.

The LLM (if you’re lucky) will handle a user input that goes against the recommended document. We can see that below.

user_input = "I don't love to hike"
relevant_document = return_response(user_input, corpus_of_documents)
# https://github.com/jmorganca/ollama/blob/important/docs/api.md
full_response = []
prompt = """
You might be a bot that makes recommendations for activities. You answer in very short sentences and don't include extra information.
That is the beneficial activity: {relevant_document}
The user input is: {user_input}
Compile a suggestion to the user based on the beneficial activity and the user input.
"""
url = 'http://localhost:11434/api/generate'
data = {
"model": "llama2",
"prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
for line in response.iter_lines():
# filter out keep-alive latest lines
if line:
decoded_line = json.loads(line.decode('utf-8'))
# print(decoded_line['response']) # uncomment to results, token by token
full_response.append(decoded_line['response'])
finally:
response.close()
print(''.join(full_response))
Sure, here is my response:

Try kayaking instead! It's a great way to enjoy nature without having to hike.
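To recap, here’s the whole flow wrapped into a single function. This is just a convenience sketch of what we built above; the function name rag_respond and the use of the non-streaming option are my own choices for brevity, not part of the original code.

def rag_respond(user_input, corpus):
    # retrieve the most similar document, then ask llama2 (via ollama) for a recommendation
    relevant_document = return_response(user_input, corpus)
    prompt = """
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.
"""
    data = {
        "model": "llama2",
        "prompt": prompt.format(user_input=user_input, relevant_document=relevant_document),
        "stream": False
    }
    response = requests.post(
        'http://localhost:11434/api/generate',
        data=json.dumps(data),
        headers={'Content-Type': 'application/json'}
    )
    return response.json()['response']

print(rag_respond("I like to hike", corpus_of_documents))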

If we return to our diagram of the RAG application and think about what we’ve just built, we’ll see various opportunities for improvement. These opportunities are where tools like vector stores, embeddings, and prompt ‘engineering’ get involved.

Here are some potential areas where we could improve the current setup:

  1. The number of documents 👉 more documents might mean more recommendations.
  2. The depth/size of documents 👉 higher quality content and longer documents with more information might be better.
  3. The number of documents we give to the LLM 👉 Right now, we’re only giving the LLM one document. We could feed in several as ‘context’ and allow the model to provide a more personalized recommendation based on the user input.
  4. The parts of documents that we give to the LLM 👉 If we have larger or more thorough documents, we might just want to add in parts of those documents, parts of various documents, or some variation thereof. In the lexicon, this is called chunking.
  5. Our document storage tool 👉 We might store our documents in a different way or in a different database. In particular, if we have a lot of documents, we might explore storing them in a data lake or a vector store.
  6. The similarity measure 👉 How we measure similarity is of consequence; we might have to trade off performance and thoroughness (e.g., looking at every individual document).
  7. The pre-processing of the documents & user input 👉 We might perform some extra preprocessing or augmentation of the user input before we pass it into the similarity measure. For instance, we might use an embedding to convert that input to a vector (see the sketch after this list).
  8. The similarity measure 👉 We can change the similarity measure to fetch better or more relevant documents.
  9. The model 👉 We can change the final model that we use. We’re using llama2 above, but we could just as easily use OpenAI’s gpt-4 or Anthropic’s Claude.
  10. The prompt 👉 We could use a different prompt for the LLM/model and tune it according to the output we want.
  11. If you’re worried about harmful or toxic output 👉 We could implement a “circuit breaker” of sorts that screens the user input to see if it contains toxic, harmful, or dangerous discussions. For instance, in a healthcare context you could check whether the information contained unsafe language and respond accordingly, outside of the typical flow.
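As a small preview of item 7, here’s a hedged sketch of what swapping our word-overlap measure for an embedding-based one could look like. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are illustrative choices of mine rather than anything prescribed in this tutorial.

# Sketch: embedding-based retrieval (assumes `pip install sentence-transformers`).
# Cosine similarity over embeddings captures semantics that word overlap misses.
from sentence_transformers import SentenceTransformer
import numpy as np

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def return_response_embeddings(query, corpus):
    query_vec = embedding_model.encode(query)
    doc_vecs = embedding_model.encode(corpus)
    similarities = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vecs]
    return corpus[similarities.index(max(similarities))]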

The scope for improvements isn’t limited to these points; the possibilities are vast, and we’ll delve into them in future tutorials. Until then, don’t hesitate to reach out on Twitter if you have any questions. Happy RAGING :).

This post was originally published on learnbybuilding.ai. I’m running a course on How to Build Generative AI Products for Product Managers in the coming months, sign up here.
