In my previous posts, I walked through constructing a simple RAG pipeline using OpenAI's API, LangChain, and local files, as well as effectively chunking large text files. These posts cover the fundamentals of setting up a RAG pipeline that can generate responses based on the content of local files.
So far, we've talked about reading the documents from wherever they are stored, splitting them into text chunks, and then creating an embedding for each chunk. After that, we somehow magically pick the embeddings that are appropriate for the user query and generate a relevant response. But it's important to understand in more depth how the retrieval step of RAG actually works.
Thus, in this post, we'll take things a step further by taking a closer look at how the retrieval mechanism works and analyzing it in more detail. As in my previous post, I will be using the text of War and Peace as an example, licensed as Public Domain and easily accessible through Project Gutenberg.
What about the embeddings?
In order to understand how the retrieval step of the RAG framework works, it's essential to first understand how text is transformed into and represented by embeddings. For an LLM to handle any text, the text must be in the form of a vector, and to perform this transformation, we need to use an embedding model.
An embedding is a vector representation of data (in our case, text) that captures its semantic meaning. Each word or sentence of the original text is mapped to a high-dimensional vector. Embedding models used to perform this transformation are designed in such a way that similar meanings result in vectors that are close to one another in the vector space. For example, the vectors for two words with similar meanings will be close to one another in the vector space, whereas the vector for an unrelated word will be far from them.
To create high-quality embeddings that work effectively in a RAG pipeline, we typically use pretrained embedding models, like BERT and GPT. There are various types of embeddings one can create and corresponding models available. For instance:
- Word Embeddings: In word embeddings, each word has a fixed vector regardless of context. Popular models for creating this type of embedding are Word2Vec and GloVe.
- Contextual Embeddings: Contextual embeddings take into account that the meaning of a word can change based on context, so the same word gets a different vector depending on the sentence it appears in. Some models that can be used for producing contextual embeddings are BERT, RoBERTa, and GPT.
- Sentence Embeddings: These are embeddings capturing the meaning of full sentences. Respective models that can be used are Sentence-BERT or USE.
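Regardless of the type, creating an embedding in code looks roughly the same. Here is a minimal sketch using LangChain's OpenAIEmbeddings wrapper (the same one used later in this post); the input text is only illustrative, and the exact dimensionality and values depend on the underlying model.

from langchain.embeddings import OpenAIEmbeddings

# initialize the embedding model (assumes an OpenAI API key is available)
embeddings = OpenAIEmbeddings(openai_api_key="your_api_key")

# embed a single piece of text into one high-dimensional vector (a list of floats)
vector = embeddings.embed_query("Who is Anna Pávlovna?")

print(len(vector))   # dimensionality of the embedding
print(vector[:5])    # first few components of the vector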
In any case, text must be transformed into vectors to be usable in computations. These vectors are simply representations of the text. In other words, the vectors and numbers have no inherent meaning on their own. Instead, they are useful because they capture similarities and relationships between words or phrases in a mathematical form.
For instance, we could imagine a tiny vocabulary consisting of the words king, queen, man, and woman, and assign each of them an arbitrary vector.
king = [0.25, 0.75]
queen = [0.23, 0.77]
man = [0.15, 0.80]
woman = [0.13, 0.82]
Then, we could attempt to do some vector operations like:
king - man + woman
= [0.25, 0.75] - [0.15, 0.80] + [0.13, 0.82]
= [0.23, 0.77]
≈ queen 👑
Notice how the semantics of the words and the relationships between them are preserved after mapping them into vectors, allowing us to perform operations.
So, an embedding is just that: a mapping of words to vectors that aims to preserve meaning and relationships between words, allowing us to perform computations with them. We can even visualize these dummy vectors in a vector space to see how related words cluster together.
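Here is a minimal NumPy sketch of the same arithmetic with the dummy vectors above, plus a quick scatter plot to build the "related words cluster together" intuition (matplotlib is only used for the optional visualization):

import numpy as np
import matplotlib.pyplot as plt

words = {
    "king":  np.array([0.25, 0.75]),
    "queen": np.array([0.23, 0.77]),
    "man":   np.array([0.15, 0.80]),
    "woman": np.array([0.13, 0.82]),
}

# king - man + woman lands (approximately) on queen
result = words["king"] - words["man"] + words["woman"]
print(result)                               # [0.23 0.77]
print(np.allclose(result, words["queen"]))  # True

# plot the dummy vectors as points in 2D space
for word, vec in words.items():
    plt.scatter(vec[0], vec[1])
    plt.annotate(word, (vec[0], vec[1]))
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.show()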

The difference between these simple vector examples and the real vectors produced by embedding models is that actual embedding models generate vectors with many more dimensions. Two-dimensional vectors are useful for building intuition about how meaning can be mapped into a vector space, but they are far too low-dimensional to capture the complexity of real language and vocabulary. That's why real embedding models work with much higher dimensions, often in the hundreds or even thousands. For example, Word2Vec produces 300-dimensional vectors, while BERT Base produces 768-dimensional vectors. This higher dimensionality allows embeddings to capture the many facets of real language, like meaning, usage, syntax, and the context of words and phrases.
Assessing the similarity of embeddings
After the text is transformed into embeddings, inference becomes vector math. This is precisely what allows us to identify and retrieve relevant documents in the retrieval step of the RAG framework. Once we turn both the user's query and the knowledge base documents into vectors using an embedding model, we can then compute how similar they are using cosine similarity.
Cosine similarity is a measure of how similar two vectors (embeddings) are. Given two vectors A and B, cosine similarity is calculated as follows:
cosine similarity(A, B) = cos(θ) = (A · B) / (‖A‖ × ‖B‖)
Simply put, cosine similarity is the cosine of the angle between the two vectors, and it ranges from -1 to 1. More specifically:
- 1 indicates that the vectors point in exactly the same direction, i.e., the texts are semantically identical.
- 0 indicates that the vectors are orthogonal, i.e., the texts have no semantic relationship.
- -1 indicates that the vectors point in opposite directions, i.e., the texts are semantically opposite.
In practice, however, values near -1 are extremely rare in embedding models. This is because even semantically opposite words often occur in very similar contexts. For cosine similarity to reach -1, the words themselves and their contexts would both have to be perfectly opposite, something that doesn't really occur in natural language. Consequently, even opposite words typically have embeddings that are still somewhat close in meaning.
Other similarity metrics besides cosine similarity do exist, such as the dot product or Euclidean distance, but these are not normalized and are magnitude-dependent, making them less suitable for comparing text embeddings. As a result, cosine similarity is the dominant metric used for quantifying the similarity between embeddings.
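To make the comparison concrete, here is a small NumPy sketch of cosine similarity next to the dot product and Euclidean distance; the vectors are toy values, not real embeddings.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between a and b: (A · B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.25, 0.75])
b = np.array([0.23, 0.77])

print(cosine_similarity(a, b))        # close to 1 -> very similar direction
print(cosine_similarity(a, 10 * b))   # unchanged: cosine ignores magnitude
print(np.dot(a, b), np.dot(a, 10 * b))                     # dot product: magnitude-dependent
print(np.linalg.norm(a - b), np.linalg.norm(a - 10 * b))   # Euclidean distance: also magnitude-dependent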
Back to our RAG pipeline: by calculating the cosine similarity between the user's query embedding and the knowledge base embeddings, we can identify the chunks of text that are most similar, and therefore most contextually relevant, to the user's query, retrieve them, and then use them to generate the answer.
Finding the top k similar chunks
So, after obtaining the embeddings of the knowledge base and the embedding(s) for the user's query text, this is where the magic happens. What we essentially do is calculate the cosine similarity between the user query embedding and each of the knowledge base embeddings. Thus, for every text chunk of the knowledge base, we get a score between -1 and 1 indicating the chunk's similarity to the user's query.
Once we have the similarity scores, we sort them in descending order and select the top k chunks. These top k chunks are then passed into the generation step of the RAG pipeline, allowing it to effectively retrieve relevant information for the user's query.
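Under the hood, this step boils down to a few lines of vector math. Here is a minimal sketch of an exact (brute-force) top-k selection, assuming query_embedding and chunk_embeddings are NumPy arrays that have already been produced by an embedding model (these names are illustrative, not part of any library):

import numpy as np

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=4):
    # normalize everything so that a dot product equals cosine similarity
    q = query_embedding / np.linalg.norm(query_embedding)
    m = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = m @ q                          # one cosine similarity score per chunk
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(chunks[i], float(scores[i])) for i in top_idx]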
To speed up this process, Approximate Nearest Neighbor (ANN) search is often used. ANN finds vectors that are nearly the most similar, delivering results close to the true top-k but at a much faster rate than exact search methods. Of course, exact search is more accurate; however, it is also more computationally expensive and may not scale well in real-world applications, especially when dealing with massive datasets.
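To illustrate the difference between exact and approximate search, here is a rough sketch using the faiss library directly; the dimensionality and random vectors are placeholders standing in for real chunk and query embeddings.

import numpy as np
import faiss

d = 768                                                          # embedding dimensionality (assumed)
chunk_vectors = np.random.random((10000, d)).astype("float32")   # stand-in for chunk embeddings
query_vector = np.random.random((1, d)).astype("float32")        # stand-in for the query embedding

# exact search: compares the query against every single vector
exact_index = faiss.IndexFlatL2(d)
exact_index.add(chunk_vectors)
distances, indices = exact_index.search(query_vector, 4)

# approximate search: clusters the vectors first, then searches only a few clusters
quantizer = faiss.IndexFlatL2(d)
ann_index = faiss.IndexIVFFlat(quantizer, d, 100)   # 100 clusters
ann_index.train(chunk_vectors)
ann_index.add(chunk_vectors)
ann_index.nprobe = 10                               # number of clusters to inspect per query
distances, indices = ann_index.search(query_vector, 4)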
On top of this, a threshold may also be applied to the similarity scores to filter out chunks that don't meet a minimum relevance score. For example, in some cases, a chunk might only be considered if its similarity score exceeds a certain threshold (e.g., cosine similarity > 0.3).
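Such a threshold can be applied with a simple filter on the scores, here reusing the hypothetical top_k_chunks helper from the sketch above (0.3 is just the illustrative cut-off mentioned in the text):

MIN_SIMILARITY = 0.3  # illustrative threshold

results = top_k_chunks(query_embedding, chunk_embeddings, chunks, k=10)
relevant = [(chunk, score) for chunk, score in results if score > MIN_SIMILARITY]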
So, who’s Anna Pávlovna?
In the War and Peace example, as demonstrated in my previous post, we split the entire text into chunks and then create the respective embeddings for each chunk. Then, when the user submits a query, like 'Who is Anna Pávlovna?', we also create the respective embedding(s) for the user's query text.
import os
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

api_key = 'your_api_key'

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# initialize embeddings model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# load the documents to be used for RAG
text_folder = "RAG files"
documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))
documents = split_docs

# create vector database with FAISS
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # get relevant documents
        relevant_docs = retriever.invoke(user_input)
        retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{retrieved_context}"
        )

        # messages for LLM
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"\nAssistant: {assistant_message}\n")

if __name__ == "__main__":
    main()
In this script, I used LangChain's retriever object retriever = vector_store.as_retriever(), which by default uses cosine similarity to assess the relevance of the document embeddings to the user's query. It also retrieves k=4 documents by default. Thus, in essence, what we are doing there is retrieving the top k chunks most relevant to the user's query, based on cosine similarity.
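If you want the retriever to return a different number of chunks, LangChain allows passing search parameters when creating it; for instance, to retrieve only the top 2 chunks:

# retrieve the 2 most similar chunks instead of the default 4
retriever = vector_store.as_retriever(search_kwargs={"k": 2})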
In any case, LangChain's .as_retriever() method doesn't let us display the cosine similarity values; we just get the top k relevant chunks. So, in order to take a look at the cosine similarities, I'm going to adjust our script a little bit and use .similarity_search_with_score() instead of .as_retriever(). We can easily do this by adding the following part to our main() function:
# REMOVE THIS LINE
retriever = vector_store.as_retriever()

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # ADD THIS SECTION
        # similarity search with score
        results = vector_store.similarity_search_with_score(user_input, k=2)

        # print the retrieved documents and their cosine similarity scores
        print(f"\nCosine Similarities for the Top {len(results)} Chunks:\n")
        for idx, (doc, sim_score) in enumerate(results):
            print(f"Chunk {idx + 1}:")
            print(f"Cosine Similarity: {sim_score:.4f}")
            print(f"Content:\n{doc.page_content}\n")

        # CONTINUE WITH THE REST OF THE CODE...
        # system prompt for LLM generation
        retrieved_context = "\n\n".join([doc.page_content for doc, _ in results])
Notice how we can explicitly define the number of retrieved chunks k, now set to k=2.
Finally, we can ask our question again and receive an answer:

… but now we are also able to see the text chunks this answer is based on, along with their respective cosine similarity scores…

Apparently, different parameters can result in different answers. For instance, we get slightly different answers when retrieving the top k=2, k=4, and k=10 results. Taking into account the additional parameters used in the chunking step, like chunk size and chunk overlap, it becomes obvious that parameters play a crucial role in getting good results from a RAG pipeline.
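A quick way to experiment with this is to loop over a few values of k and compare the retrieved context and generated answers, as in this small sketch reusing the vector_store and llm objects from the script above:

question = "Who is Anna Pávlovna?"

for k in (2, 4, 10):
    # retrieve the top k chunks and build the context from them
    results = vector_store.similarity_search_with_score(question, k=k)
    context = "\n\n".join(doc.page_content for doc, _ in results)

    # generate an answer grounded only in that context
    messages = [
        {"role": "system", "content": f"Use ONLY the following context to answer.\n\nContext:\n{context}"},
        {"role": "user", "content": question},
    ]
    response = llm.invoke(messages)
    print(f"k={k}:\n{response.content.strip()}\n")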
• • •
• • •
What about pialgorithms?
Trying to bring the power of RAG into your organization?
pialgorithms can do it for you
