If you are using a modern vector database (Neo4j, Milvus, Weaviate, Qdrant, Pinecone), there is a very high likelihood that Hierarchical Navigable Small World (HNSW) is already powering your retrieval layer. Chances are you did not select it while building the database, never tuned it, and may not even know it is there. And yet, HNSW is quietly deciding what your LLM sees as truth. It determines which document chunks are fed into your RAG pipeline, which memories your agent recalls, and ultimately, whether the model answers correctly or hallucinates confidently.
As your vector database grows, retrieval quality degrades progressively:
- No exceptions are raised
- No errors are logged
- Latency often looks perfectly fine
Yet your RAG system becomes less reliable over time, even though the embedding model and the distance metric remain unchanged.
In this article, I show, using controlled experiments and real data, how HNSW impacts retrieval quality as database size grows, why this degradation is worse than for flat search, and what you can realistically do about it in production RAG systems.
Specifically, I’ll:
- Build a practical, reproducible use case to measure the effect of HNSW on RAG retrieval quality using Recall@k.
- Show that, for fixed HNSW settings, recall degrades faster than flat search as the corpus grows.
- Discuss practical tuning strategies for balancing recall and latency beyond simply increasing HNSW's ef_search parameter.
What is HNSW?
HNSW is a graph-based algorithm for Approximate Nearest Neighbor (ANN) search. It organizes data into multiple layers of connected neighbors and uses this graph structure to speed up search.
Each vector is connected to a limited number of neighbors in each layer. During a search, the algorithm performs a greedy search through these layers, and the number of neighbors checked at each layer is bounded (controlled by M and ef_search), which makes the search process roughly logarithmic in the number of vectors. Compared to flat search, where time complexity is O(N), HNSW search has a time complexity of O(log N), which means the time required for a search grows very slowly (logarithmically) compared to linear search. We will see this in the results of our use case.
Parameters of HNSW index
1. Build-time parameters: M and ef_construction. These can only be set before the index is built.
M defines the maximum number of connections (neighbors) that each vector (node) can have in each layer of the graph. A higher M means more connections, making the graph denser and potentially increasing recall, but at the cost of more memory and slower indexing.
ef_construction controls the size of the candidate set used during construction of the graph. Essentially, it governs how thoroughly the graph is built during indexing. A higher value for ef_construction means the graph is built more thoroughly, with more candidates being considered before making each connection, which results in a higher-quality graph and higher recall at the cost of increased memory and slower indexing.
For a general-purpose RAG application, typical values of M are in the range of 12 to 48, and ef_construction between 64 and 200.
2. Query-time parameter: ef_search
ef_search defines the number of candidate nodes (vectors) to explore during the query process, i.e., during the search for nearest neighbors. It controls how thorough the search is by determining how many candidates are evaluated before the search result is returned. A higher value for ef_search means the search will explore more candidates, leading to higher recall but potentially slower queries.
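To make these parameters concrete, here is a minimal, self-contained sketch using FAISS (the same library used in the experiments below); the dimension and the random vectors are placeholders, not the article's actual data:
import numpy as np
import faiss

dim = 512                                               # embedding dimension (placeholder)
vectors = np.random.rand(10000, dim).astype('float32')  # stand-in for real embeddings

# Build-time parameters: M is passed to the constructor,
# efConstruction must be set before add() is called.
index = faiss.IndexHNSWFlat(dim, 32)        # M = 32 connections per node
index.hnsw.efConstruction = 128             # candidate list size while building the graph
index.add(vectors)

# Query-time parameter: efSearch can be changed at any time between queries.
index.hnsw.efSearch = 64                    # candidate list size while searching
query = np.random.rand(1, dim).astype('float32')
distances, ids = index.search(query, 5)     # top-5 approximate nearest neighbors
Note that M and efConstruction are fixed once the index is built, while efSearch can be adjusted at any time to trade recall against latency.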
What is Recall@k?
Recall@k is a key metric for measuring the accuracy of vector search and RAG systems. It measures the ability of the retriever to find the relevant chunks for a user query within the top k results. It is critical because if the retriever misses the chunks containing the information required to answer the query (low recall), the LLM cannot possibly generate an accurate answer in the response synthesis step, no matter how powerful it is.
\[ \text{Recall@}k = \frac{\text{relevant items retrieved in top } k}{\text{total number of relevant items in the corpus}} \]
In practice, this is a difficult metric to measure because the denominator (the set of ground-truth documents) is not easily known for a real-life production system. What we will do instead is design a use case where the ground truth (a single vector index per query) is unique and known, and Recall@k will measure how often it is retrieved in the top-k results, averaged over a large number of sample queries.
For instance, Recall@5 will measure the fraction of 500 queries for which the ground-truth index appeared in the top-5 retrievals.
For a RAG system, an acceptable range for Recall@5 is 70-90% and for Recall@10 is 80-95%, and we will see that our use case adheres to these ranges for the Flat index.
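As a small illustration, here is a minimal sketch of how Recall@k can be computed when each query has exactly one known ground-truth index, which is the setup used below; the retrieved lists are made-up values:
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of queries whose ground-truth index appears in the top-k retrievals."""
    hits = sum(1 for retrieved, truth in zip(retrieved_ids, ground_truth_ids)
               if truth in retrieved[:k])
    return hits / len(ground_truth_ids)

# Toy example: 3 queries, each with one known ground-truth vector index
retrieved = [[7, 2, 9, 1, 4], [3, 8, 5, 0, 6], [10, 11, 12, 13, 14]]
truth = [2, 5, 99]                       # the third query's ground truth was missed
print(recall_at_k(retrieved, truth, 5))  # -> 0.67 (2 of 3 queries hit within top-5)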
Use Case
To test HNSW, we need a vector database with a sufficiently large number of vectors (>100,000). There does not appear to be such a large public dataset consisting of document chunks and associated queries for which a particular chunk can be regarded as ground truth. And even if there were, natural language can be ambiguous, so it is difficult to say with confidence which chunks in the corpus should be considered relevant for a query (the denominator in the Recall@k formula). Developing such a curated dataset would require finding a large number of documents, chunking and embedding them, then writing queries for the chunks. That is a resource-intensive process.
Instead, let us re-imagine our RAG problem as a text-to-image retrieval task, where images play the role of document chunks and their captions play the role of queries.
For this approach, I used the publicly available LAION-Aesthetics dataset. To access it, you need to be logged in to Hugging Face and agree to the stated terms. Details about the dataset are available on the LAION site here. It contains a huge number of rows, each with a URL to an image together with a text caption. They look like the following:

I downloaded a subset of rows and generated 200,000 CLIP embeddings of the images to build the vector database. The text captions of the images can be conveniently used as queries for the RAG. Each caption has exactly one image vector as its ground truth, so the denominator of Recall@k is exactly known for all queries. Also, the CLIP embeddings of an image and its caption are never an exact match, so there is enough "fuzziness" in the retrievals, similar to a pure document RAG where a text query retrieves relevant document chunks using a distance metric. This will be evident when we see the charts of Recall@k in the following sections.
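For reference, here is a minimal sketch of how the image embeddings could be generated with open_clip; it assumes the images referenced in the dataset have already been downloaded to local paths (the file names are placeholders, and download and error handling are omitted):
import numpy as np
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.to(device)
model.eval()

image_paths = ["img_000001.jpg", "img_000002.jpg"]   # hypothetical local paths
embeddings = []
for path in image_paths:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(image)
        feats /= feats.norm(dim=-1, keepdim=True)    # L2-normalize, matching the text encoder
    embeddings.append(feats.cpu().numpy().astype('float32'))

np.save("embeddings.npy", np.vstack(embeddings))     # stored for index construction later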
Measuring Recall@k for Flat vs HNSW
We adopt the following approach:
- Embeddings of 200k images are stored as a .npy file.
- From the LAION dataset, 500 captions (queries) are randomly chosen and embedded using CLIP. The chosen query indices also form the ground truth, as they correspond to the original image for each query.
- The database is built in increments of 50,000 vectors, so 4 iterations of sizes 50k, 100k, 150k and 200k vectors. Both Flat and HNSW indexes are built. HNSW is built using M=16 and ef_construction=100.
- Recall@k is calculated for k = 1, 5, 10, 15 and 20, based on whether the ground-truth index is included in the top-k results.
- First, the Recall@k values are calculated for each of the query vectors and averaged over the number of samples (500).
- Then, average Recall@k values are calculated for HNSW ef_search values of 10, 20, 40, 80 and 160.
- Finally, 5 charts are drawn, one for each of the Recall@k values. Each chart depicts the evolution of Recall@k as the database size grows, for the Flat index and for different ef_search values of HNSW.
The code can be viewed here:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt


def evaluate_subset(size, embeddings_all, df_all, query_vectors_all, eval_indices_all, ef_search_values):
    # Subset embeddings
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]

    # Build indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)

    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)

    num_samples = len(eval_indices_all)
    results = []
    ks = [1, 5, 10, 15, 20]

    # Evaluate Flat
    flat_recalls = {k: 0 for k in ks}
    for i, qv in enumerate(query_vectors_all):
        _, I = index_flat.search(qv, max(ks))
        target = eval_indices_all[i]
        for k in ks:
            if target in I[0][:k]:
                flat_recalls[k] += 1
    flat_res = {"Setting": "Flat"}
    for k in ks:
        flat_res[f"R@{k}"] = flat_recalls[k] / num_samples
    results.append(flat_res)

    # Evaluate HNSW with different efSearch values
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef
        hnsw_recalls = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            _, I = index_hnsw.search(qv, max(ks))
            target = eval_indices_all[i]
            for k in ks:
                if target in I[0][:k]:
                    hnsw_recalls[k] += 1
        hnsw_res = {"Setting": f"HNSW (ef={ef})", "ef": ef}
        for k in ks:
            hnsw_res[f"R@{k}"] = hnsw_recalls[k] / num_samples
        results.append(hnsw_res)

    return results


def format_table(size, results):
    ks = [1, 5, 10, 15, 20]
    lines = []
    lines.append(f"\nDatabase Size: {size}")
    lines.append("=" * 80)
    header = f"{'Index/efSearch':<20}"
    for k in ks:
        header += f" | {'R@' + str(k):<8}"
    lines.append(header)
    lines.append("-" * 80)
    for row in results:
        line = f"{row['Setting']:<20}"
        for k in ks:
            line += f" | {row[f'R@{k}']:<8.2f}"
        lines.append(line)
    lines.append("=" * 80)
    return "\n".join(lines)


def main(n):
    dataset_path = r"C:\database\laion_final.parquet"
    embeddings_path = r"C:\database\embeddings.npy"
    results_dir = r"C:\results"
    db_sizes = [50000, 100000, 150000, 200000]
    ef_search_values = [10, 20, 40, 80, 160]
    num_samples = n
    output_txt = os.path.join(results_dir, f"eval_results_{num_samples}.txt")
    output_png = os.path.join(results_dir, f"recall_vs_dbsize_{num_samples}.png")

    if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
        print("Error: Dataset or embeddings not found.")
        return
    os.makedirs(results_dir, exist_ok=True)

    # Load all data once
    print("Loading base data...")
    df_all = pd.read_parquet(dataset_path)
    embeddings_all = np.load(embeddings_path).astype('float32')

    # Load CLIP model once
    print("Loading CLIP model (ViT-B-32)...")
    model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
    tokenizer = open_clip.get_tokenizer('ViT-B-32')
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Use samples valid for all subsets
    eval_indices = random.sample(range(min(db_sizes)), num_samples)
    print(f"Sampling {num_samples} queries for consistent evaluation...")

    # Generate query vectors
    query_vectors = []
    for idx in eval_indices:
        text = df_all.iloc[idx]['TEXT']
        text_tokens = tokenizer([text]).to(device)
        with torch.no_grad():
            text_features = model.encode_text(text_tokens)
            text_features /= text_features.norm(dim=-1, keepdim=True)
        query_vectors.append(text_features.cpu().numpy().astype('float32'))

    all_output_text = []

    # Collect all results for plotting
    # structure: { 'R@1': { 'Flat': [val1, val2...], 'HNSW ef=10': [val1, val2...] }, ... }
    ks = [1, 5, 10, 15, 20]
    plot_data = {f"R@{k}": {"Flat": []} for k in ks}
    for ef in ef_search_values:
        for k in ks:
            plot_data[f"R@{k}"][f"HNSW ef={ef}"] = []

    for size in db_sizes:
        print(f"Evaluating with database size: {size}...")
        results = evaluate_subset(size, embeddings_all, df_all, query_vectors, eval_indices, ef_search_values)
        table_str = format_table(size, results)

        # Print to screen
        print(table_str)
        all_output_text.append(table_str)

        # Collect for plot
        for row in results:
            label = row["Setting"]
            if label == "Flat":
                for k in ks:
                    plot_data[f"R@{k}"]["Flat"].append(row[f"R@{k}"])
            else:
                ef = row["ef"]
                for k in ks:
                    plot_data[f"R@{k}"][f"HNSW ef={ef}"].append(row[f"R@{k}"])

    # Save text results
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write("\n".join(all_output_text))
    print(f"\nFinal results saved to {output_txt}")

    # Create individual plots for each k
    for k in ks:
        plt.figure(figsize=(10, 6))
        k_key = f"R@{k}"
        for label, values in plot_data[k_key].items():
            linestyle = '--' if label == "Flat" else '-'
            marker = 'o' if label == "Flat" else 's'
            plt.plot(db_sizes, values, label=label, linestyle=linestyle, marker=marker)
        plt.title(f"Recall@{k} vs Database Size")
        plt.xlabel("Database Size")
        plt.ylabel("Recall")
        plt.grid(True)
        plt.legend()
        output_png = os.path.join(results_dir, f"recall_vs_dbsize_{k}.png")
        plt.tight_layout()
        plt.savefig(output_png)
        plt.close()
        print(f"Plot saved to {output_png}")


if __name__ == "__main__":
    main(500)
And the results are the following:


Observations
- For the Flat index (dashed line), Recall@5 and Recall@10 are in the range of 0.70 – 0.85, as would be expected of real-life RAG applications.
- The Flat index provides the best Recall@k across all database sizes and forms a benchmark upper bound for HNSW.
- At any given database size, Recall@k increases for a higher k. So for a database size of 100k vectors, Recall@20 > Recall@15 > Recall@10 > Recall@5 > Recall@1. This is understandable: with a higher k, there is a greater probability that the ground-truth index is present in the retrieved set.
- Both Flat and HNSW deteriorate consistently as the database size grows. This is because high-dimensional vector spaces become increasingly crowded as the number of vectors grows.
- HNSW performance improves for higher ef_search values.
- As the database size approaches 200k, HNSW appears to degrade faster than Flat search.
Does HNSW degrade faster than Flat Search?
To view the relative performance of the Flat vs HNSW indexes as database size grows, a slightly different approach is adopted:
- The index construction and query selection process stays the same as before.
- Instead of comparing against the ground truth, we calculate the overlap between the Flat results and each of the HNSW results for a given retrieval count (k).
- Five charts are drawn, one for each k value, showing the evolution of the overlap as database size grows. For a perfect match with the Flat index, the HNSW line would show a score of 1. More importantly, if the HNSW results degrade faster than the Flat index, the line will have a negative slope; otherwise it will be horizontal or have a positive slope.
The code can be viewed here:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt
import time


def evaluate_subset_compare(size, embeddings_all, df_all, query_vectors_all, ef_search_values):
    # Subset embeddings
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]

    # Build indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)

    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)

    num_samples = len(query_vectors_all)
    results = []
    ks = [1, 5, 10, 15, 20]
    max_k = max(ks)

    # 1. Evaluate Flat once for this subset
    flat_times = []
    flat_results_all = []
    for qv in query_vectors_all:
        start_t = time.perf_counter()
        _, I_flat_all = index_flat.search(qv, max_k)
        flat_times.append(time.perf_counter() - start_t)
        flat_results_all.append(I_flat_all[0])
    avg_flat_time_ms = (sum(flat_times) / num_samples) * 1000

    # 2. Evaluate HNSW relative to Flat
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef
        hnsw_times = []
        # Track intersection counts for each k
        overlap_counts = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            # HNSW top-max_k
            start_t = time.perf_counter()
            _, I_hnsw_all = index_hnsw.search(qv, max_k)
            hnsw_times.append(time.perf_counter() - start_t)

            # Flat result was already pre-calculated
            I_flat_all = flat_results_all[i]
            for k in ks:
                set_flat = set(I_flat_all[:k])
                set_hnsw = set(I_hnsw_all[0][:k])
                intersection = set_flat.intersection(set_hnsw)
                overlap_counts[k] += len(intersection) / k
        avg_hnsw_time_ms = (sum(hnsw_times) / num_samples) * 1000

        hnsw_res = {
            "Setting": f"HNSW (ef={ef})",
            "ef": ef,
            "FlatTime_ms": avg_flat_time_ms,
            "HNSWTime_ms": avg_hnsw_time_ms
        }
        for k in ks:
            # Average over all queries
            hnsw_res[f"R@{k}"] = overlap_counts[k] / num_samples
        results.append(hnsw_res)

    return results


def format_all_tables(db_sizes, ef_search_values, all_results):
    ks = [1, 5, 10, 15, 20]
    lines = []

    # 1. Create one table for each Recall@k
    for k in ks:
        k_label = f"R@{k}"
        lines.append(f"\nTable: {k_label} (HNSW Overlap with Flat)")
        lines.append("=" * (20 + 12 * len(db_sizes)))

        # Header
        header = f"{'ef_search':<18}"
        for size in db_sizes:
            header += f" | {size:<9}"
        lines.append(header)
        lines.append("-" * (20 + 12 * len(db_sizes)))

        # Rows (ef values)
        for ef in ef_search_values:
            row_str = f"{ef:<18}"
            for size in db_sizes:
                # Find the result for this size and ef
                val = 0
                for r in all_results[size]:
                    if r.get('ef') == ef:
                        val = r.get(k_label, 0)
                        break
                row_str += f" | {val:<9.2f}"
            lines.append(row_str)
        lines.append("=" * (20 + 12 * len(db_sizes)))

    # 2. Create search time table
    lines.append("\nTable: Average Search Time (ms)")
    lines.append("=" * (20 + 12 * len(db_sizes)))
    header = f"{'Index Setting':<18}"
    for size in db_sizes:
        header += f" | {size:<9}"
    lines.append(header)
    lines.append("-" * (20 + 12 * len(db_sizes)))

    # Flat row
    row_flat = f"{'Flat Index':<18}"
    for size in db_sizes:
        # Flat time is the same for all ef at a given size, so just take any
        t = all_results[size][0]['FlatTime_ms']
        row_flat += f" | {t:<9.4f}"
    lines.append(row_flat)

    # HNSW rows
    for ef in ef_search_values:
        row_str = f"HNSW (ef={ef:<3})"
        for size in db_sizes:
            t = 0
            for r in all_results[size]:
                if r.get('ef') == ef:
                    t = r.get('HNSWTime_ms', 0)
                    break
            row_str += f" | {t:<9.4f}"
        lines.append(row_str)
    lines.append("=" * (20 + 12 * len(db_sizes)))

    return "\n".join(lines)


def main(n):
    dataset_path = r"C:\database\laion_final.parquet"
    embeddings_path = r"C:\database\embeddings.npy"
    results_dir = r"C:\results"
    db_sizes = [50000, 100000, 150000, 200000]
    ef_search_values = [10, 20, 40, 80, 160]
    num_samples = n
    output_txt = os.path.join(results_dir, f"compare_results_{num_samples}.txt")
    output_png_prefix = "compare_vs_dbsize"

    if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
        print("Error: Dataset or embeddings not found.")
        return
    os.makedirs(results_dir, exist_ok=True)

    # Load all data once
    print("Loading base data...")
    df_all = pd.read_parquet(dataset_path)
    embeddings_all = np.load(embeddings_path).astype('float32')

    # Load CLIP model once
    print("Loading CLIP model (ViT-B-32)...")
    model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
    tokenizer = open_clip.get_tokenizer('ViT-B-32')
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Use queries from the first 50k rows
    eval_indices = random.sample(range(min(db_sizes)), num_samples)
    print(f"Sampling {num_samples} queries...")

    # Generate query vectors
    query_vectors = []
    for idx in eval_indices:
        text = df_all.iloc[idx]['TEXT']
        text_tokens = tokenizer([text]).to(device)
        with torch.no_grad():
            text_features = model.encode_text(text_tokens)
            text_features /= text_features.norm(dim=-1, keepdim=True)
        query_vectors.append(text_features.cpu().numpy().astype('float32'))

    all_results_data = {}
    ks = [1, 5, 10, 15, 20]
    plot_data = {f"R@{k}": {} for k in ks}
    for ef in ef_search_values:
        for k in ks:
            plot_data[f"R@{k}"][f"ef={ef}"] = []

    for size in db_sizes:
        print(f"Evaluating with database size: {size}...")
        results = evaluate_subset_compare(size, embeddings_all, df_all, query_vectors, ef_search_values)
        all_results_data[size] = results

        # Collect for plot
        for row in results:
            ef = row["ef"]
            for k in ks:
                plot_data[f"R@{k}"][f"ef={ef}"].append(row[f"R@{k}"])

    # Format pivoted tables
    final_output_text = format_all_tables(db_sizes, ef_search_values, all_results_data)
    print(final_output_text)

    # Save text results
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write(final_output_text)
    print(f"\nFinal results saved to {output_txt}")

    # Create individual plots for each k
    for k in ks:
        plt.figure(figsize=(10, 6))
        k_key = f"R@{k}"
        for label, values in plot_data[k_key].items():
            plt.plot(db_sizes, values, label=label, marker='s')
        plt.title(f"HNSW vs Flat Overlap Recall@{k} vs Database Size")
        plt.xlabel("Database Size")
        plt.ylabel("Overlap Ratio")
        plt.grid(True)
        plt.legend()
        output_png = os.path.join(results_dir, f"{output_png_prefix}_{k}.png")
        plt.tight_layout()
        plt.savefig(output_png)
        plt.close()
        print(f"Plot saved to {output_png}")


if __name__ == "__main__":
    main(500)
And the results are the following:


Observations
- In all cases, the lines have a negative slope, indicating that HNSW degrades faster than the Flat index as the database grows.
- Higher ef_search values degrade more slowly than lower values, which fall quite sharply.
- Higher ef_search values maintain significant overlap (>90%) with the benchmark Flat search, compared to the lower values.
Recall-latency trade-off
We know that HNSW is faster than Flat search. To see it in action, I have also measured the average latency in the code of the previous section. Here are the average search times (in ms):
| Index / Database size | 50,000 | 100,000 | 150,000 | 200,000 |
|---|---|---|---|---|
| Flat Index | 5.1440 | 9.3850 | 14.8843 | 18.4100 |
| HNSW (ef=10) | 0.0851 | 0.0742 | 0.0763 | 0.0768 |
| HNSW (ef=20) | 0.1159 | 0.0876 | 0.0959 | 0.0983 |
| HNSW (ef=40) | 0.1585 | 0.1366 | 0.1415 | 0.1493 |
| HNSW (ef=80) | 0.2508 | 0.2262 | 0.2398 | 0.2417 |
| HNSW (ef=160) | 0.4613 | 0.3992 | 0.4140 | 0.4064 |
Observations
- HNSW is orders of magnitude faster than Flat search, which is the primary reason it is the search algorithm of choice for virtually all vector databases.
- The time taken by Flat search increases almost linearly with database size (O(N) complexity).
- For a given ef_search value (a row), HNSW search time is almost constant. At this scale (200k vectors), HNSW latency stays nearly flat as the database grows.
- As ef_search increases down a column, HNSW time increases significantly. For instance, the time taken for ef=160 is about 3x that of ef=40.
Tuning the RAG pipeline
The above analysis shows that while HNSW is clearly the option to adopt in a production scenario for latency reasons, there is a need to periodically tune ef_search to maintain the latency-recall balance as the database grows. Some best practices are as follows:
- Given the difficulty of measuring Recall@k in a production database, keep a test-case repository of ground-truth document chunks and queries, which can be run at regular intervals to check retrieval quality. We could start with the most frequent queries asked by users, and the chunks that are needed for a good answer.
- Another indirect way to check recall quality is to use a strong LLM to judge the quality of the retrieved context. Instead of asking "Did we get the right documents for the user query?", which is hard to answer precisely for a large database, we can ask a slightly weaker question, "Does the retrieved context actually contain the answer to the user's query?", and let the judge LLM respond to that.
- Collect user feedback in production. User ratings of responses, together with any manual corrections, can be used as a trigger for performance tuning.
- While tuning, start with a conservatively high ef_search value, measure Recall@k, then reduce it until latency is acceptable (see the sketch after this list).
- Measure recall at the top_k that the RAG uses, usually between 3 and 10. Consider relaxing top_k to 15 or 20 and letting the LLM decide which chunks in the given context to use for the response during the synthesis step. Assuming the context does not become too large to fit in the LLM's context window, such an approach enables high recall with a moderate ef_search value, thereby keeping latency low.
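As a rough illustration of the tuning procedure above, here is a minimal sketch that starts from a conservatively high ef_search and steps it down while Recall@k on a small golden query set stays above a chosen floor; the golden set (1-D float32 query vectors plus their ground-truth ids), the recall floor and the candidate ef values are assumptions, not values taken from the experiments:
import time
import numpy as np

def tune_ef_search(index, golden_queries, golden_truths, k=5, min_recall=0.85,
                   candidate_efs=(256, 160, 80, 40, 20, 10)):
    """Step ef_search down from a conservatively high value and return the smallest
    setting whose Recall@k on the golden set still meets the required floor."""
    best = candidate_efs[0]
    for ef in candidate_efs:                       # descending ef_search values
        index.hnsw.efSearch = ef
        hits, times = 0, []
        for q, truth in zip(golden_queries, golden_truths):
            start = time.perf_counter()
            _, ids = index.search(q.reshape(1, -1), k)
            times.append((time.perf_counter() - start) * 1000)
            hits += int(truth in ids[0])
        recall = hits / len(golden_truths)
        print(f"ef={ef}: Recall@{k}={recall:.2f}, avg latency={np.mean(times):.3f} ms")
        if recall >= min_recall:
            best = ef                              # recall still acceptable: keep the cheaper setting
        else:
            break                                  # recall fell below the floor: stop reducing
    return best
Running this at regular intervals against the golden set gives an early warning when the current ef_search no longer meets the recall floor as the database grows.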
Hybrid RAG pipeline
Tuning HNSW via ef_search cannot fix the problem of falling recall with increasing database size beyond a point. That is because vector search, even with a flat index, becomes noisy when too many vectors are packed close together in the N-dimensional space (N being the number of dimensions output by the embedding model). As the charts in the section above show, recall drops by 10%+ as the database grows from 50k to 200k vectors. The reliable way to maintain recall is to use metadata filtering (e.g., using a knowledge graph) to identify candidate document ids and run retrieval only over those.
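To make the hybrid idea concrete, here is a minimal sketch of one possible pattern: a metadata filter first narrows the corpus to candidate ids, and exact vector search is then run only over those rows. The category column and the filter expression are hypothetical, not part of the dataset used above:
import numpy as np
import faiss

def filtered_search(query_vec, embeddings_all, metadata_df, metadata_filter, k=5):
    """Exact (flat) search restricted to rows that pass a metadata filter.
    Assumes the dataframe's default index aligns with the rows of embeddings_all."""
    # 1. Metadata filtering narrows the search space (could also come from SQL or a knowledge graph)
    candidate_ids = metadata_df.query(metadata_filter).index.to_numpy()
    candidate_vecs = embeddings_all[candidate_ids]

    # 2. Exact search over the (much smaller) candidate set only
    sub_index = faiss.IndexFlatL2(candidate_vecs.shape[1])
    sub_index.add(candidate_vecs)
    distances, local_ids = sub_index.search(query_vec.reshape(1, -1), k)

    # 3. Map positions in the candidate set back to global ids
    return candidate_ids[local_ids[0]], distances[0]

# Hypothetical usage: restrict retrieval to one document category before vector search
# top_ids, dists = filtered_search(qv, embeddings_all, df_all, "category == 'finance'")
Because the candidate set is small, exact search stays fast, and recall within the filtered subset is no longer limited by the approximate graph traversal.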
Key Takeaways
- HNSW is the default retrieval algorithm in most vector databases, but it is rarely tuned or monitored in production RAG systems.
- Retrieval quality degrades silently as the vector database grows, even when latency stays stable.
- For the same corpus size, Flat search consistently achieves higher Recall@k than HNSW, serving as a useful upper bound for evaluation.
- HNSW recall degrades faster than Flat search for fixed ef_search values as database size increases.
- Increasing ef_search improves recall, but latency grows rapidly, creating a sharp recall-latency trade-off.
- Simply tuning HNSW parameters is insufficient at scale—vector search itself becomes noisy in dense embedding spaces.
- Hybrid RAG pipelines using metadata filters (SQL, graphs, inverted indexes) are the most reliable way to maintain recall at scale.
Conclusion
HNSW has earned its place as the backbone of modern vector databases, not because it is perfectly accurate, but because it is fast enough to make large-scale semantic search practical.
However, in RAG systems, speed without recall is a false optimization.
This article shows that as vector databases grow, retrieval quality deteriorates quietly, especially under approximate search, while latency metrics remain deceptively stable. The result is a system that looks healthy from an infrastructure perspective but progressively feeds weaker context to the LLM, increasing hallucinations and reducing answer quality.
The answer is not to abandon HNSW, nor to arbitrarily increase ef_search.
Instead, production-grade RAG systems must:
- Measure retrieval quality explicitly and repeatedly.
- Treat Flat search as a recall baseline.
- Continuously rebalance recall and latency.
- And ultimately, move toward hybrid retrieval architectures that narrow the search space before vector similarity is applied.
If your RAG system's answers are getting worse as your data grows, the problem may not be your LLM, your prompts, or your embeddings, but the retrieval algorithm you never realized you were relying on.
Images used in this article are synthetically generated. The LAION-Aesthetics dataset is used under the CC-BY 4.0 license. Figures and code created by the author.
