Off the back of Retrieval Augmented Generation (RAG), vector databases are getting a lot of attention in the AI world.
Many people say you need tools like Pinecone, Weaviate, Milvus, or Qdrant to build a RAG system and manage your embeddings. If you are working on enterprise applications with hundreds of millions of vectors, then tools like these are essential. They let you perform CRUD operations, filter by metadata, and use disk-based indexing that goes beyond your computer's memory.
But for many internal tools, documentation bots, or MVP agents, adding a dedicated vector database can be overkill. It increases complexity, introduces network delays, adds serialisation costs, and makes things harder to manage.
The reality is that "Vector Search" (i.e. the Retrieval part of RAG) is just matrix multiplication. And Python already has some of the world's best tools for that.
In this article, we'll show how to build a production-ready retrieval component of a RAG pipeline for small-to-medium data volumes using only NumPy and SciKit-Learn. You'll see that it's possible to search millions of text strings in milliseconds, all in memory and without any external dependencies.
Understanding Retrieval as Matrix Math
Typically, RAG involves four main steps:
- Embed: Turn the text of your source data into vectors (lists of floating-point numbers)
- Store: Squirrel those vectors away in a database
- Retrieve: Find vectors that are mathematically "close" to the query vector.
- Generate: Feed the corresponding text to an LLM and get your final answer.
Steps 1 and 4 depend on large language models. Steps 2 and 3 are the domain of the Vector DB. We'll focus on steps 2 and 3 and on how to avoid using a vector DB entirely.
But when we're searching our vector database, what actually is "closeness"? Usually, it's Cosine Similarity. If your two vectors are normalised to have a magnitude of 1, then cosine similarity is simply the dot product of the two.
If you have a one-dimensional query vector of size N, Q (1×N), and a database of document vectors of size M by N, D (M×N), finding the best matches isn't a database query; it's a matrix multiplication operation: the dot product of D with the transpose of Q.
Scores = D · Qᵀ
NumPy is designed to perform this type of operation efficiently, using routines that leverage modern CPU features such as vectorisation.
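As a toy illustration (made-up numbers, not part of the final code), here is that scoring step in NumPy: normalise the rows of D and the query Q, and a single matrix product yields every similarity score at once.

import numpy as np

# Toy example: 4 "documents" and 1 query, each a 3-dimensional vector
D = np.array([[0.2, 0.1, 0.7],
              [0.9, 0.0, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]], dtype=np.float32)
Q = np.array([[0.25, 0.05, 0.70]], dtype=np.float32)   # shape (1, N)

# Normalise rows to unit length so the dot product equals cosine similarity
D = D / np.linalg.norm(D, axis=1, keepdims=True)
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)

scores = D @ Q.T            # shape (M, 1): one similarity score per document
best = np.argmax(scores)    # index of the closest document
print(scores.ravel(), best)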
The Implementation
We'll create a class called SimpleVectorStore to handle ingestion, indexing, and retrieval. Our input data will consist of one or more files containing the text we want to search over. Using Sentence Transformers for local embeddings will make everything work offline.
Prerequisites
Set up a new development environment, install the required libraries, and start a Jupyter notebook.
Type the following commands into a command shell. I'm using UV as my package manager; adjust to suit whatever tool you're using.
$ uv init ragdb
$ cd ragdb
$ uv venv ragdb
$ source ragdb/bin/activate
$ uv pip install numpy scikit-learn sentence-transformers jupyter
$ jupyter notebook
The In-Memory Vector Store
We don't need a sophisticated server. All we need is a function to load our text data from the input files and chunk it into bite-sized pieces, plus a class holding two things: a list for the raw text chunks and a NumPy array for the embedding matrix. Here's the code.
import numpy as np
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Any
from pathlib import Path


class SimpleVectorStore:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.documents = []     # Stores the raw text and metadata
        self.embeddings = None  # Will become a numpy array

    def add_documents(self, docs: List[Dict[str, Any]]):
        """
        Ingests documents.
        docs format: [{'text': '...', 'metadata': {...}}, ...]
        """
        texts = [d['text'] for d in docs]

        # 1. Generate Embeddings
        print(f"Embedding {len(texts)} documents...")
        new_embeddings = self.encoder.encode(texts)

        # 2. Normalize Embeddings
        # (Critical optimisation: makes the dot product equal to cosine similarity)
        norm = np.linalg.norm(new_embeddings, axis=1, keepdims=True)
        new_embeddings = new_embeddings / norm

        # 3. Update Storage
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(docs)
        print(f"Store now contains {len(self.documents)} documents.")

    def search(self, query: str, k: int = 5):
        """
        Retrieves the top-k most similar documents.
        """
        if self.embeddings is None or len(self.documents) == 0:
            print("Warning: Vector store is empty. No documents to search.")
            return []

        # 1. Embed and Normalize Query
        query_vec = self.encoder.encode([query])
        norm = np.linalg.norm(query_vec, axis=1, keepdims=True)
        query_vec = query_vec / norm

        # 2. Vectorized Search (Matrix Multiplication)
        # Result shape: (1, N_docs)
        scores = np.dot(self.embeddings, query_vec.T).flatten()

        # 3. Get Top-K Indices
        # argsort sorts ascending, so we take the last k and reverse them
        # Ensure k doesn't exceed the number of documents
        k = min(k, len(self.documents))
        top_k_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_k_indices:
            results.append({
                "score": float(scores[idx]),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results
def load_from_directory(directory_path: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Reads .txt files and splits them into overlapping chunks.
    """
    docs = []
    # Use pathlib for robust path handling and resolution
    path = Path(directory_path).resolve()
    if not path.exists():
        print(f"Error: Directory '{path}' not found.")
        print(f"Current working directory: {os.getcwd()}")
        return docs

    print(f"Loading documents from: {path}")
    for file_path in path.glob("*.txt"):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()

            # Simple sliding window chunking
            # We iterate through the text with a step size smaller than the chunk size
            # to create overlap (preserving context between chunks).
            step = chunk_size - overlap
            for i in range(0, len(text), step):
                chunk = text[i : i + chunk_size]
                # Skip chunks that are too small (e.g., leftover whitespace)
                if len(chunk) < 50:
                    continue
                docs.append({
                    "text": chunk,
                    "metadata": {
                        "source": file_path.name,
                        "chunk_index": i
                    }
                })
        except Exception as e:
            print(f"Warning: Couldn't read file {file_path.name}: {e}")

    print(f"Successfully loaded {len(docs)} chunks from {len(list(path.glob('*.txt')))} files.")
    return docs
The embedding model used
The all-MiniLM-L6-v2 model used in the code is from the Sentence Transformers library. It was chosen because:
- It's fast and lightweight.
- It produces 384-dimensional vectors that use less memory than larger models.
- It performs well on a wide range of English-language tasks without needing specialised fine-tuning.
This model is only a suggestion. You can use any embedding model you like if you have a particular favourite.
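For instance, assuming you preferred the larger, 768-dimensional all-mpnet-base-v2 model from the same library, swapping it in is just a constructor argument:

# Hypothetical swap: any Sentence Transformers model name can be passed in
store = SimpleVectorStore(model_name='all-mpnet-base-v2')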
Why Normalise?
You might notice the normalisation steps in the code. We mentioned it before, but to be clear, given two vectors X and Y, cosine similarity is defined as

cosine_similarity(X, Y) = (X · Y) / (||X|| × ||Y||)
Where:
- X · Y is the dot product of vectors X and Y
- ||X|| is the magnitude (length) of vector X
- ||Y|| is the magnitude of vector Y
Division takes extra computation, but if all our vectors have unit magnitude, the denominator is 1, so the formula reduces to the dot product of X and Y, which makes searching much faster.
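To convince yourself of the shortcut, here is a quick sanity check (a minimal sketch using random vectors and the cosine_similarity import from earlier) comparing the full formula against the plain dot product of normalised vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
X = rng.normal(size=(1, 384))
Y = rng.normal(size=(1, 384))

# Full formula: (X . Y) / (||X|| * ||Y||)
full = cosine_similarity(X, Y)[0, 0]

# Shortcut: normalise first, then just take the dot product
Xn = X / np.linalg.norm(X)
Yn = Y / np.linalg.norm(Y)
shortcut = (Xn @ Yn.T).item()

print(full, shortcut)  # the two values agree (up to floating-point error)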
Testing the Performance
The first thing we need to do is get some input data to work with. You can use any input text file for this. For previous RAG experiments, I used a book I downloaded from Project Gutenberg. The consistently riveting:
“Diseases of cattle, sheep, goats, and swine by Jno. A. W. Dollar & G. Moussu”
https://www.gutenberg.org/policy/permission.html
I downloaded the text of the book from the Project Gutenberg website to my local PC using this link,
https://www.gutenberg.org/ebooks/73019.txt.utf-8
This book contained roughly 36,000 lines of text. Querying the book takes only six lines of code. For my sample query, line 2315 of the book discusses a disease called CONDYLOMATA. Here is the excerpt:
INFLAMMATION OF THE INTERDIGITAL SPACE.
(CONDYLOMATA.)
Condylomata result from chronic inflammation of the skin covering the
interdigital ligament. Any injury to this region causing even
superficial damage may result in chronic inflammation of the skin and
hypertrophy of the papillæ, the first stage in the production of
condylomata.

Injuries produced by cords slipped into the interdigital space for the
purpose of lifting the feet when shoeing working oxen are also fruitful
causes.
So that's what we'll ask: "What is Condylomata?" Note that we won't get a proper answer, as we're not feeding our search result into an LLM, but we should see that our search returns a text snippet that would give the LLM all the information required to formulate an answer had we done so.
%%time
# 1. Initialize
store = SimpleVectorStore()
# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")
# 3. Add to Store
if real_docs:
    store.add_documents(real_docs)
# 4. Search
results = store.search("What is Condylomata?", k=1)
results
And here is the output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 2205 chunks from 1 files.
Embedding 2205 documents...
Store now contains 2205 documents.
CPU times: user 3.27 s, sys: 377 ms, total: 3.65 s
Wall time: 3.82 s
[{'score': 0.44883957505226135,
'text': 'two last\nphalanges, the latter operation being easier than
the former, and\nproviding flaps of more regular shape and better adapted
for the\nproduction of a satisfactory stump.\n\n\n
INFLAMMATION OF THE INTERDIGITAL SPACE.\n\n(CONDYLOMATA.)\n\n
Condylomata result from chronic inflammation of the skin covering
the\ninterdigital ligament. Any injury to this region causing
even\nsuperficial damage may result in chronic inflammation of the
skin and\nhypertrophy of the papillæ, the first stage in the production
of\ncondylomata.\n\nInjuries produced by cords slipped into the
interdigital space for the\npurpose of lifting the feet when shoeing
working oxen are also fruitful\ncauses.\n\nInflammation of the
interdigital space is also a common complication of\naphthous eruptions
around the claws and in the space between them.\nContinual contact with
litter, dung and urine favour infection of\nsuperficial or deep wounds,
and by causing exuberant granulation lead to\nhypertrophy of the papillary
layer of ',
'metadata': {'source': 'cattle_disease.txt', 'chunk_index': 122400}}]
Under 4 seconds to read, chunk, store, and correctly query a 36,000-line text document is pretty good going.
SciKit-Learn: The Upgrade Path
NumPy works well for brute-force searches. But what if you have dozens or hundreds of documents, and brute-force is just too slow? Before switching to a vector database, you can try SciKit-Learn's NearestNeighbors. It can use tree-based structures like KD-Tree and Ball-Tree to speed up searches to O(log N) instead of O(N).
To test this out, I downloaded a bunch of other books from Gutenberg, including:
- A Christmas Carol by Charles Dickens
- The Life and Adventures of Santa Claus by L. Frank Baum
- War and Peace by Tolstoy
- A Farewell to Arms by Hemingway
In total, these books contain around 120,000 lines of text. I copied and pasted all five input book files ten times, resulting in fifty files and 1.2 million lines of text. That's around 12 million words, assuming an average of 10 words per line. To give some context, this article contains roughly 2,800 words, so the data volume we're testing with is equivalent to over 4,000 times the length of this article.
$ dir
achristmascarol - Copy (2).txt cattle_disease - Copy (9).txt santa - Copy (6).txt
achristmascarol - Copy (3).txt cattle_disease - Copy.txt santa - Copy (7).txt
achristmascarol - Copy (4).txt cattle_disease.txt santa - Copy (8).txt
achristmascarol - Copy (5).txt farewelltoarms - Copy (2).txt santa - Copy (9).txt
achristmascarol - Copy (6).txt farewelltoarms - Copy (3).txt santa - Copy.txt
achristmascarol - Copy (7).txt farewelltoarms - Copy (4).txt santa.txt
achristmascarol - Copy (8).txt farewelltoarms - Copy (5).txt warandpeace - Copy (2).txt
achristmascarol - Copy (9).txt farewelltoarms - Copy (6).txt warandpeace - Copy (3).txt
achristmascarol - Copy.txt farewelltoarms - Copy (7).txt warandpeace - Copy (4).txt
achristmascarol.txt farewelltoarms - Copy (8).txt warandpeace - Copy (5).txt
cattle_disease - Copy (2).txt farewelltoarms - Copy (9).txt warandpeace - Copy (6).txt
cattle_disease - Copy (3).txt farewelltoarms - Copy.txt warandpeace - Copy (7).txt
cattle_disease - Copy (4).txt farewelltoarms.txt warandpeace - Copy (8).txt
cattle_disease - Copy (5).txt santa - Copy (2).txt warandpeace - Copy (9).txt
cattle_disease - Copy (6).txt santa - Copy (3).txt warandpeace - Copy.txt
cattle_disease - Copy (7).txt santa - Copy (4).txt warandpeace.txt
cattle_disease - Copy (8).txt santa - Copy (5).txt
Let's say we were ultimately searching for an answer to the following query:
Who, after the Christmas holidays, did Nicholas tell his mother of his love for?
In case you didn’t know, this comes from the novel War and Peace.
Let's see how our new search does against this huge body of text.
Here is the code using SciKit-Learn.
First off, we have a new class that uses SciKit-Learn's NearestNeighbors algorithm.
from sklearn.neighbors import NearestNeighbors


class ScikitVectorStore(SimpleVectorStore):
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        super().__init__(model_name)
        # Brute force is often faster than trees for high-dimensional data
        # unless N is very large, but 'ball_tree' can help in specific cases.
        self.knn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
        self.is_fit = False

    def build_index(self):
        print("Building Scikit-Learn Index...")
        self.knn.fit(self.embeddings)
        self.is_fit = True

    def search(self, query: str, k: int = 5):
        if not self.is_fit:
            self.build_index()

        query_vec = self.encoder.encode([query])
        # Note: Scikit-learn handles normalization internally for the cosine metric,
        # but being explicit is better.
        distances, indices = self.knn.kneighbors(query_vec, n_neighbors=k)

        results = []
        for i in range(k):
            idx = indices[0][i]
            # Convert distance back to similarity score (1 - dist)
            score = 1 - distances[0][i]
            results.append({
                "score": score,
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results
And our search code is just as simple as for the NumPy version.
%%time
# 1. Initialize
store = ScikitVectorStore()
# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")
# 3. Add to Store
if real_docs:
    store.add_documents(real_docs)
# 4. Search
results = store.search("Who, after the Christmas holidays, did Nicholas tell his mother of his love for", k=1)
results
And our output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 73060 chunks from 50 files.
Embedding 73060 documents...
Store now contains 73060 documents.
Building Scikit-Learn Index...
CPU times: user 1min 46s, sys: 18.3 s, total: 2min 4s
Wall time: 1min 13s
[{'score': 0.6972659826278687,
'text': '\nCHAPTER XIII\n\nSoon after the Christmas holidays Nicholas told
his mother of his love\nfor Sónya and of his firm resolve to marry her. The
countess, who\nhad long noticed what was going on between them and was
expecting this\ndeclaration, listened to him in silence and then told her son
that he\nmight marry whom he pleased, but that neither she nor his father
would\ngive their blessing to such a marriage. Nicholas, for the first time,
\nfelt that his mother was displeased with him and that, despite her love\n
for him, she would not give way. Coldly, without looking at her son,\nshe
sent for her husband and, when he came, tried briefly and coldly to\ninform
him of the facts, in her son's presence, but unable to restrain\nherself she
burst into tears of vexation and left the room. The old\ncount began
irresolutely to admonish Nicholas and beg him to abandon his\npurpose.
Nicholas replied that he could not go back on his word, and his\nfather,
sighing and evidently disconcerted, very soon became silent ',
'metadata': {'source': 'warandpeace - Copy (6).txt',
'chunk_index': 1396000}}]
Almost all of the 1m 13s it took to do the above processing was spent loading, chunking, and embedding the input data. The actual search part, when I ran it separately, took less than one-tenth of a second!
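For reference, the search step can be timed on its own by re-running just the query against the already-built store, something like this:

%%time
# Only the retrieval step; the store and index are already built
store.search("Who, after the Christmas holidays, did Nicholas tell his mother of his love for", k=1)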
Not too shabby at all.
Summary
I'm not arguing that Vector Databases aren't needed. They solve specific problems that NumPy and SciKit-Learn don't handle. You should migrate from something like our SimpleVectorStore or ScikitVectorStore to Weaviate/Pinecone/pgvector, etc., when any of the following conditions apply.
- Persistence: You need data to survive a server restart without rebuilding the index from source files each time, though np.save or pickling works for simple persistence (a minimal sketch follows this list).
- RAM is the bottleneck: Your embedding matrix exceeds your server's memory. Note: 1 million vectors of 384 dimensions (float32) is only ~1.5 GB of RAM, so you can fit quite a lot in memory.
- CRUD frequency: You need to constantly update or delete individual vectors while reading. NumPy arrays are fixed-size, and appending requires copying the entire array, which is slow.
- Metadata Filtering: You need complex queries like "Find vectors near X where user_id=10 AND date > 2023". Doing this in NumPy requires boolean masks that can get messy.
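As an aside, that simple persistence could look something like the sketch below, with hypothetical save_store/load_store helpers wrapping np.save and JSON (these are not part of the article's code):

import json
import numpy as np

def save_store(store, prefix="my_store"):
    # Write the embedding matrix and the raw documents to disk
    np.save(f"{prefix}_embeddings.npy", store.embeddings)
    with open(f"{prefix}_documents.json", "w", encoding="utf-8") as f:
        json.dump(store.documents, f)

def load_store(store, prefix="my_store"):
    # Restore both without re-embedding anything
    store.embeddings = np.load(f"{prefix}_embeddings.npy")
    with open(f"{prefix}_documents.json", "r", encoding="utf-8") as f:
        store.documents = json.load(f)
    return store

Call save_store(store) after ingestion, then load_store(SimpleVectorStore()) on the next start-up to skip the embedding step entirely.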
Engineering always involves trade-offs. Using a vector database adds complexity to your setup in exchange for scalability you may not need right now. If you start with a more straightforward RAG setup using NumPy and/or SciKit-Learn for the retrieval process, you get:
- Lower Latency. No network hops.
- Lower Costs. No SaaS subscriptions or extra instances.
- Simplicity. It's just a Python script.
Just as you don't need a sports car to go to the grocery store, in many cases NumPy or SciKit-Learn may be all the RAG search you need.
