When (Not) to Use a Vector DB

Vector databases are genuinely useful. They solve a real problem, and in many cases, they’re the right choice for RAG systems. But here’s the thing: just because you’re using embeddings doesn’t mean you need a vector database.

We’ve seen a growing trend where every RAG implementation starts by plugging in a vector DB. That may make sense for large-scale, persistent knowledge bases, but it’s not always the most efficient path, especially when your use case is more dynamic or time-sensitive.

At Planck, we use embeddings to enhance LLM-based systems. However, in one of our real-world applications, we opted to avoid a vector database and instead used a simple key-value store, which turned out to be a much better fit.

Before I dive into that, let’s explore a simple, generalized version of our scenario to explain why.

Foo Example

Let’s imagine a simple RAG-style system. A user uploads a few text files, perhaps some reports or meeting notes. We split those files into chunks, generate embeddings for each chunk, and use those embeddings to answer questions. The user asks a handful of questions over the next few minutes, then leaves. At that point, both the files and their embeddings are useless and can be safely discarded.

In other words, the data is ephemeral, the user will ask only a few questions, and we want to answer them as fast as possible.
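To make that concrete, here’s a minimal sketch of the per-session flow. The fixed-size chunking and the random-vector embed_chunks stub are illustrative placeholders, not a real embedding pipeline:

import numpy as np

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems usually split on sentences or tokens.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed_chunks(chunks: list[str], num_dims: int = 1536) -> np.ndarray:
    # Stand-in for a real embedding model call; returns normalized random vectors.
    vectors = np.random.rand(len(chunks), num_dims).astype('float32')
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Per-session flow: chunk, embed, answer a handful of questions, then discard everything.
chunks = chunk_text("...the user's uploaded reports and meeting notes...")
embeddings = embed_chunks(chunks)
# Answer the few questions against `embeddings` in memory, then let it all be garbage-collected.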

Now pause for a second and ask yourself: do we really need a vector database here?


Most people’s instinct is: “I have embeddings, so I need a vector database.” But stop and think about what’s actually happening behind that abstraction. When you send embeddings to a vector DB, it doesn’t just “store” them. It builds an index that accelerates similarity searches. That indexing work is where much of the magic comes from, and also where much of the cost lives.

In a long-lived, large-scale knowledge base, this trade-off makes perfect sense: you pay an indexing cost once (or incrementally as data changes), and then spread that cost over millions of queries. In our Foo example, that’s not what’s happening. We’re doing the opposite: continually adding small, one-off batches of embeddings, answering a tiny number of queries per batch, and then throwing everything away.

So the real question isn’t “should I use a vector database?” but “is the indexing work worth it?” To answer that, we can look at a simple benchmark.

Benchmarking: No-Index Retrieval vs. Indexed Retrieval



We want to compare two systems:

  1. No indexing at all: just keep the embeddings in memory and scan them directly.
  2. A vector database, where we pay an indexing cost upfront to make each query faster.

First, consider the “no vector DB” approach. When a query comes in, we compute similarities between the query embedding and all stored embeddings, then pick the top-k. That’s just k-nearest neighbors without any index.

import numpy as np

def run_knn(embeddings: np.ndarray, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Dot product against every stored vector (equals cosine similarity for normalized vectors),
    # then return the indices of the top-k highest scores, best match first.
    sims = embeddings @ query_embedding
    return sims.argsort()[-top_k:][::-1]

The code uses the dot product as a proxy for cosine similarity (assuming normalized vectors) and sorts the scores to find the best matches. It literally just scans all vectors and picks the closest ones.
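As a quick sanity check (a toy example, not part of the benchmark), the dot product of two unit-length vectors is exactly their cosine similarity:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Normalize both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(a_unit @ b_unit, cosine))  # True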

Now, let’s look at what a vector DB typically does. Under the hood, most vector databases rely on an approximate nearest neighbor (ANN) index. ANN methods trade a bit of accuracy for a large boost in search speed, and one of the most widely used algorithms for this is HNSW. We’ll use the hnswlib library to simulate the index behavior.

import numpy as np
import hnswlib

def create_hnsw_index(embeddings: np.ndarray, num_dims: int) -> hnswlib.Index:
    # Build an HNSW index over all embeddings; this is the expensive, one-time step.
    index = hnswlib.Index(space='cosine', dim=num_dims)
    index.init_index(max_elements=embeddings.shape[0])
    index.add_items(embeddings)
    return index

def query_hnsw(index: hnswlib.Index, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Approximate nearest-neighbor lookup; fast once the index exists.
    labels, distances = index.knn_query(query_embedding, k=top_k)
    return labels[0]

To see where the trade-off lands, we can generate some random embeddings, normalize them, and measure how long each step takes:

import time
import numpy as np
import hnswlib
from tqdm import tqdm

def run_benchmark(num_embeddings: int, num_dims: int, top_k: int, num_iterations: int) -> None:
    print(f"Benchmarking with {num_embeddings} embeddings of dimension {num_dims}, retrieving top-{top_k} nearest neighbors.")

    knn_times: list[float] = []
    index_times: list[float] = []
    hnsw_query_times: list[float] = []

    for _ in tqdm(range(num_iterations), desc="Running benchmark"):
        # Fresh batch of normalized random embeddings plus a normalized query vector.
        embeddings = np.random.rand(num_embeddings, num_dims).astype('float32')
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        query_embedding = np.random.rand(num_dims).astype('float32')
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Time the brute-force scan.
        start_time = time.time()
        run_knn(embeddings, query_embedding, top_k)
        knn_times.append((time.time() - start_time) * 1e3)

        # Time the one-off HNSW index build.
        start_time = time.time()
        vector_db_index = create_hnsw_index(embeddings, num_dims)
        index_times.append((time.time() - start_time) * 1e3)

        # Time a single query against the built index.
        start_time = time.time()
        query_hnsw(vector_db_index, query_embedding, top_k)
        hnsw_query_times.append((time.time() - start_time) * 1e3)

    print(f"BENCHMARK RESULTS (averaged over {num_iterations} iterations)")
    print(f"[Naive KNN] Average search time without indexing: {np.mean(knn_times):.2f} ms")
    print(f"[HNSW Index] Average index construction time: {np.mean(index_times):.2f} ms")
    print(f"[HNSW Index] Average query time with indexing: {np.mean(hnsw_query_times):.2f} ms")

run_benchmark(num_embeddings=50000, num_dims=1536, top_k=5, num_iterations=20)

Results

In this example, we use 50,000 embeddings with 1,536 dimensions (matching OpenAI’s text-embedding-3-small) and retrieve the top-5 neighbors. The exact results will vary with different configurations, but the pattern we care about stays the same.

I encourage you to run the benchmark with your own numbers; it’s the best way to see how the trade-offs play out for your specific use case.
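For instance, a quick sweep over a few corpus sizes (the values below are arbitrary placeholders) shows how the scan time, build time, and break-even point shift:

# Sweep a few corpus sizes; the parameter values are arbitrary placeholders.
for n in (5_000, 20_000, 50_000):
    run_benchmark(num_embeddings=n, num_dims=1536, top_k=5, num_iterations=5)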

On average, the naive KNN search takes 24.54 milliseconds per query. Building the HNSW index for the same embeddings takes around 277 seconds. Once the index is built, each query takes about 0.47 milliseconds.

From this, we can estimate the break-even point. The difference between naive KNN and indexed queries is about 24.07 ms per query, so you need roughly 11,510 queries before the time saved on each query compensates for the time spent building the index.
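Here is that arithmetic spelled out with the averaged timings from above:

# Break-even estimate using the averaged timings above (all values in milliseconds).
knn_ms = 24.54            # naive scan, per query
hnsw_query_ms = 0.47      # indexed query, per query
index_build_ms = 277_000  # one-time HNSW index build (~277 seconds)

saved_per_query_ms = knn_ms - hnsw_query_ms              # ~24.07 ms saved per query
break_even_queries = index_build_ms / saved_per_query_ms
print(f"Break-even after ~{break_even_queries:,.0f} queries")  # ~11,500 with these rounded inputs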

Generated using the benchmark code: A graph comparing naive KNN and indexed search efficiency

Moreover, even with different values for the number of embeddings and top-k, the break-even point stays in the thousands of queries and within a fairly narrow range. You don’t get a scenario where indexing starts to pay off after just a few dozen queries.

Generated using the benchmark code: A graph showing break-even points for various embedding counts and top-k settings (image by author)

Now compare that to the Foo example. A user uploads a small set of files and asks a few questions, not thousands. The system never reaches the point where the index pays off. Instead, the indexing step simply delays the moment when the system can answer the first query and adds operational complexity.

For this kind of short-lived, per-user context, the simple in-memory KNN approach is not only easier to implement and operate, it’s also faster end-to-end.

If in-memory storage isn’t an option, either because the system is distributed or because we need to preserve the user’s state for a few minutes, we can use a key-value store like Redis. We can store a unique identifier for the user’s request as the key and all of the embeddings as the value.

This gives us a lightweight, low-complexity solution that’s well suited to our use case of short-lived, low-query contexts.
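To illustrate, here’s a minimal sketch of that pattern using the redis-py client; the key naming, TTL, and byte-level serialization are assumptions for the example, not a prescribed setup:

import numpy as np
import redis

NUM_DIMS = 1536  # must match the embedding model's output dimension

r = redis.Redis(host='localhost', port=6379)

def store_embeddings(request_id: str, embeddings: np.ndarray, ttl_seconds: int = 600) -> None:
    # Serialize the float32 matrix to raw bytes; Redis expires the key automatically.
    r.setex(f"embeddings:{request_id}", ttl_seconds, embeddings.astype('float32').tobytes())

def load_embeddings(request_id: str) -> np.ndarray:
    raw = r.get(f"embeddings:{request_id}")
    return np.frombuffer(raw, dtype='float32').reshape(-1, NUM_DIMS)

def answer_query(request_id: str, query_embedding: np.ndarray, top_k: int = 5) -> np.ndarray:
    # Same brute-force scan as run_knn above; no index is ever built.
    embeddings = load_embeddings(request_id)
    sims = embeddings @ query_embedding
    return sims.argsort()[-top_k:][::-1]

With a short TTL, the embeddings simply expire once the session is over, so there’s nothing to clean up.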

Real-World Example: Why We Chose a Key-Value Store


At Planck, we answer insurance-related questions about businesses. A typical request begins with a business name and address, and then we retrieve real-time data about that specific business, including its online presence, registrations, and other public records. This data becomes our context, and we use LLMs and algorithms to answer questions based on it.

The important bit is that every time we get a request, we generate a fresh context. We’re not reusing existing data; it’s fetched on demand and stays relevant for a few minutes at most.

If you think back to the earlier benchmark, this pattern should already be triggering your “this isn’t a vector DB use case” sensor.

Every time we receive a request, we generate fresh embeddings for short-lived data that we’ll likely query only a few hundred times. Indexing those embeddings in a vector DB adds unnecessary latency. In contrast, with Redis, we can immediately store the embeddings and run a fast similarity search in the application code with almost no indexing delay.

That’s why we chose Redis instead of a vector database. While vector DBs are excellent at handling large volumes of embeddings and supporting fast nearest-neighbor queries, they introduce indexing overhead, and in our case, that overhead isn’t worth it.

In Conclusion

If you need to store millions of embeddings and support high-query workloads across a shared corpus, a vector DB will be a better fit. And yes, there are plenty of use cases out there that genuinely need and benefit from a vector DB.

But just because you’re using embeddings or building a RAG system doesn’t mean you should default to a vector DB.

Each database technology has its strengths and trade-offs. The best choice starts with a deep understanding of your data and use case, rather than mindlessly following the trend.

So, the next time you need to choose a database, pause for a moment and ask: am I picking the right one based on objective trade-offs, or am I just going with the trendiest, shiniest option?
