In a previous post, we talked in detail about what Prompt Caching is in LLMs and how it can save a lot of time and money when running AI-powered apps with high traffic. But apart from Prompt Caching, the concept of a cache can also be applied in several other parts of AI applications, such as RAG retrieval caching or caching of entire query-response pairs, providing further cost and time savings. In this post, we are going to look in more detail at which other components of an AI app can benefit from caching mechanisms. So, let's take a look at caching in AI beyond Prompt Caching.
Why does it make sense to cache other things?
So, Prompt Caching makes sense because we expect system prompts and instructions to be passed as input to the LLM in the exact same format every time. But beyond this, we may also expect user queries to be repeated or to look alike to some extent. Especially when deploying RAG or other AI apps inside an organization, we expect a large portion of the queries to be semantically similar, or even identical. Naturally, groups of users inside an organization are going to be interested in similar things most of the time. Nevertheless, statistically, it is very unlikely that multiple users will ask the exact same query (the exact same words, allowing for an exact match), unless we provide them with proposed, standardized queries within the UI of the app. Nonetheless, there is a very high likelihood that users will ask queries with different words that are semantically very similar. Thus, it makes sense to also consider a semantic cache in addition to the standard cache.
With this in mind, we can further distinguish between the two types of cache:
- Exact-Match Caching, that is, when we cache the original text or some normalized version of it. Then we hit the cache only with exact, word-for-word matches of the text. Exact-match caching can be implemented using a key-value store like Redis.
- Semantic Caching, that is, creating an embedding of the text. Then we hit the cache with any text that is semantically similar to it and exceeds a predefined similarity score threshold (like cosine similarity above ~0.95). Since we are interested in the semantics of the texts and we perform a similarity search, a vector database, such as ChromaDB, would need to be used as the cache store.
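To make the distinction concrete, here is a minimal sketch of the two lookup styles, with plain Python dicts and lists standing in for Redis and a vector database (the stores, keys, and threshold are illustrative assumptions, not the post's actual setup):

```python
import math

# Toy stores: in practice these would be Redis (exact) and a vector DB (semantic).
exact_cache = {"what is the vacation policy?": "cached answer"}
semantic_cache = [([0.9, 0.1, 0.4], "cached answer")]  # (embedding, value) pairs

def exact_lookup(query):
    # Hit only on a word-for-word match of the normalized text.
    return exact_cache.get(query.strip().lower())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_embedding, threshold=0.95):
    # Hit on the most similar cached entry above the threshold, if any.
    best_value, best_score = None, threshold
    for embedding, value in semantic_cache:
        score = cosine(query_embedding, embedding)
        if score >= best_score:
            best_value, best_score = value, score
    return best_value
```

Note that the exact-match lookup is a constant-time dictionary access, while the semantic lookup has to compare against the cached embeddings, which is why a vector database with an efficient similarity index is the natural fit for the latter.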
Unlike Prompt Caching, where we get to use a cache integrated into the API service of the LLM, to implement caching in other stages of a RAG pipeline we have to use an external cache store, like Redis or ChromaDB mentioned above. While this is a bit of a hassle, since we need to set up those cache stores ourselves, it also provides us with more control over the parametrization of the cache. For example, we get to decide on our cache expiration policies, meaning how long a cached item stays valid and can be reused. This parameter of the cache is defined as Time-To-Live (TTL).
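As an illustration, a minimal in-memory sketch of TTL-based expiration could look like the following (Redis offers the same behavior natively, e.g. via the `EX` option of `SET`; this toy class just makes the mechanics visible):

```python
import time

# Minimal in-memory cache with a per-entry TTL; purely illustrative.
class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expiration timestamp)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # never cached
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and treat as a miss
            return None
        return value
```

Choosing the TTL per cache is exactly the kind of control an external cache store gives us, and it will matter again later when we compare the different caching layers.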
As illustrated in my previous posts, a very simple RAG pipeline looks something like this:
Even in the simplest form of a RAG pipeline, we already use a caching-like mechanism without even realizing it. That is, storing the embeddings in a vector database and retrieving them from there, instead of making requests to an embedding model every time and recalculating the embeddings. This is very straightforward and essentially a non-negotiable part (it would be silly of us not to do it) even of a very simple RAG pipeline. The embeddings of the documents generally remain the same (we need to recalculate an embedding only when a document of the knowledge base is altered), so it makes sense to calculate each one once and store it somewhere.
But apart from storing the knowledge base embeddings in a vector database, other parts of the RAG pipeline can also be reused, and we can benefit from applying caching to them. Let's see what those are in more detail!
. . .
1. Query Embedding Cache
The first thing that happens in a RAG system when a query is submitted is that the query is transformed into an embedding vector, so that we can perform semantic search and retrieval against the knowledge base. Admittedly, this step is very lightweight compared to calculating the embeddings of the entire knowledge base. Nonetheless, in high-traffic applications it can still add unnecessary latency and cost, and in any case, recalculating the same embeddings for the same queries over and over again is wasteful.
So, instead of computing the query embedding from scratch every time, we can first check if we have already computed the embedding for the same query before. If yes, we simply reuse the cached vector. If not, we generate the embedding once, store it in the cache, and make it available for future reuse.
In this case, our RAG pipeline would look something like this:

The most straightforward way to implement query embedding caching is by looking for an exact match of the raw user query. For example:
What area codes correspond to Athens, Greece?
Alternatively, we can use a normalized version of the raw user query by performing some simple operations, like lowercasing it or stripping punctuation. In this way, the following queries…
What area codes correspond to athens greece?
What area codes correspond to Athens, Greece
what area codes correspond to Athens // Greece?
… would all map to …
what area codes correspond to athens greece?
We then search for this normalized query in the key-value store, and if we get a cache hit, we can directly use the embedding stored in the cache, without having to make a request to the embedding model again. That is going to be an embedding looking something like this, for example:
[0.12, -0.33, 0.88, ...]
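A possible normalization function for this could be sketched as follows (the exact set of operations is an assumption; note that this version also strips the trailing question mark, which is fine as long as it is applied consistently):

```python
import string

def normalize(query: str) -> str:
    # Lowercase, remove punctuation, and collapse repeated whitespace.
    lowered = query.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

queries = [
    "What area codes correspond to athens greece?",
    "What area codes correspond to Athens, Greece",
    "what area codes correspond to Athens // Greece?",
]
# All three variants collapse to the same cache key.
```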
In general, for the query embedding cache, the key-value pairs have the following format:
query → embedding
As it’s possible you’ll already imagine, the hit for this may significantly improve if we propose the users with standardized queries inside the app’s UI, beyond letting them type their very own queries in free text.
. . .
2. Retrieval Cache
Caching can also be applied at the retrieval step of a RAG pipeline. This means that we can cache the retrieved results for a particular query and minimize the need to perform a full retrieval for similar queries. In this case, the key of the cache can be the raw or normalized user query, or the query embedding. The value we get back from the cache is the retrieved document chunks. So, our RAG pipeline with retrieval caching, either exact-match or semantic, would look something like this:

So for our normalized query…
what area codes correspond to athens greece?
or from the query embedding…
[0.12, -0.33, 0.88, ...]
we’d directly get back from the cache the retrieved chunks.
[
chunk_12,
chunk_98,
chunk_42
]
In this way, when an identical or even somewhat similar query is submitted, we already have the relevant chunks and documents in the cache, and there is no need to perform the retrieval step. In other words, even for queries that are only moderately similar (for example, cosine similarity above ~0.85), the exact response may not exist in the cache, but the relevant chunks and documents needed to answer the query often do.
In general, for the retrieval cache, the key-value pairs have the following format:
query → retrieved_chunks
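A minimal sketch of this mapping, here keying the cache by the query embedding (converted to a tuple so it is hashable; the values and chunk ids are the illustrative ones from above), might look like:

```python
retrieval_cache = {}  # query embedding (as a tuple) -> retrieved chunk ids

def cache_retrieval(query_embedding, chunk_ids):
    retrieval_cache[tuple(query_embedding)] = chunk_ids

def lookup_retrieval(query_embedding):
    # Exact match on the embedding; a semantic variant would instead run a
    # similarity search over the cached embeddings with a ~0.85 threshold.
    return retrieval_cache.get(tuple(query_embedding))

cache_retrieval([0.12, -0.33, 0.88], ["chunk_12", "chunk_98", "chunk_42"])
```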
One may wonder how this is different from the query embedding cache. After all, if the query is the same, why not directly hit the retrieval cache, and why also include a query embedding cache? The answer is that in practice, the query embedding cache and the retrieval cache may have different TTL policies. That is because the documents in the knowledge base may change, so even when we have the same query or the same query embedding, the corresponding chunks may be different. This explains why the query embedding cache is useful as a separate layer.
. . .
3. Reranking Cache
Another way to utilize caching in the context of RAG is by caching the results of the reranker model (if we use one). More specifically, this means that instead of passing the retrieved ranked results to a reranker model and getting back the reranked results, we directly get the reranked order from the cache for a particular query and set of retrieved chunks. In this case, our RAG pipeline would look something like this:

In our Athens area codes example, for our normalized query:
what area codes correspond to athens greece?
and hypothetical retrieved and ranked chunks
[
chunk_12,
chunk_98,
chunk_42
]
we could directly get the reranked chunks as output of the cache:
[
chunk_98,
chunk_12,
chunk_42
]
In general, for the reranking cache, the key-value pairs have the following format:
(query + retrieved_chunks) → reranked_chunks
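One way to build this compound key is to hash the normalized query together with the sorted chunk ids (a hypothetical choice; sorting makes the key insensitive to retrieval order, so the same chunk set always maps to the same entry):

```python
import hashlib
import json

def rerank_cache_key(query, chunk_ids):
    # Sorting the ids makes the key order-insensitive, so the same set of
    # chunks produces the same key regardless of how retrieval ranked them.
    payload = json.dumps([query.strip().lower(), sorted(chunk_ids)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

rerank_cache = {}
key = rerank_cache_key("what area codes correspond to athens greece?",
                       ["chunk_12", "chunk_98", "chunk_42"])
rerank_cache[key] = ["chunk_98", "chunk_12", "chunk_42"]  # reranked order
```

Whether to sort the ids is a design choice: if the reranker's output genuinely depends on the input order, the raw order should be part of the key instead.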
Again, one may wonder: if we hit the reranking cache, shouldn't we also always hit the retrieval cache? At first glance, this might seem true, but in practice it isn't necessarily the case.
One reason is that, as explained already, different caches may have different TTL policies. Even when the reranking result is still cached, the retrieval cache entry may have already expired, requiring the retrieval step to be performed from scratch.
But beyond this, in a complex RAG system we are likely to use more than one retrieval mechanism (e.g., semantic search, BM25, etc.). Because of this, we may hit the retrieval cache for one of the retrieval mechanisms but not for all of them, and thus not hit the reranking cache. Vice versa, we may hit the reranking cache but miss the individual caches of the various retrieval mechanisms, since we may end up with the same overall set of documents while retrieving different documents from each individual mechanism. For these reasons, the retrieval and reranking caches are conceptually and practically different.
. . .
4. Prompt Assembly Cache
Another useful place to apply caching in a RAG pipeline is during the prompt assembly stage. That is, once retrieval and reranking are completed, the relevant chunks are combined with the system prompt and the user query to form the final prompt that is sent as input to the LLM. So, if the query, system prompt, and reranked chunks all match, then we hit the cache. This means that we don't have to reconstruct the final prompt again; we can get parts of it (the context) or even the entire final prompt directly from the cache.
Caching the prompt assembly step in a RAG pipeline would look something like this:

Continuing with our Athens example, suppose the user submits the query…
what area codes correspond to athens greece?
and after retrieval and reranking, we get the next chunks (either from the reranker or the reranking cache):
[
chunk_98,
chunk_12,
chunk_42
]
During the prompt assembly step, these chunks are combined with the system prompt and the user query to build the final prompt that will be sent to the LLM. For example, the assembled prompt may look something like:
System: You are a helpful assistant that answers questions using the provided context.
Context:
[chunk_98]
[chunk_12]
[chunk_42]
User: what area codes correspond to athens greece?
In general, for the prompt assembly cache, the key-value pairs have the following format:
(query + system_prompt + retrieved_chunks) → assembled_prompt
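As a sketch, using the same get-or-compute pattern as before (the prompt template simply mirrors the Athens example above and is an assumption, not a fixed format):

```python
prompt_cache = {}  # (system prompt, query, chunk tuple) -> assembled prompt

def assemble_prompt(system_prompt, query, chunks):
    key = (system_prompt, query, tuple(chunks))
    if key not in prompt_cache:  # miss: build the prompt once and store it
        context = "\n".join(f"[{c}]" for c in chunks)
        prompt_cache[key] = (f"System: {system_prompt}\n"
                             f"Context:\n{context}\n"
                             f"User: {query}")
    return prompt_cache[key]
```

In a real system with guardrails or other assembly logic, that logic would live inside the miss branch, which is precisely where the savings of this cache come from.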
Admittedly, the computational savings here are smaller compared to the other caching layers mentioned above. Nonetheless, prompt assembly caching can still reduce latency and simplify prompt construction in high-traffic systems. In particular, it makes sense to implement it in systems where prompt assembly is complex and includes more operations than a simple concatenation, like inserting guardrails.
. . .
5. Query – Response Caching
Last but not least, we can cache pairs of entire queries and responses. Intuitively, when we talk about caching, the first thing that comes to mind is caching query and response pairs. And this can be the ultimate jackpot for our RAG pipeline: in this case, we don't have to run any of it, and we can provide a response to the user's query solely by using the cache.
More specifically, in this case we store entire query-final response pairs in the cache, and completely avoid any retrieval (in the case of RAG) and re-generation of a response. In this way, instead of retrieving relevant chunks and generating a response from scratch, we directly get a precomputed response, which was generated at some earlier time for the same or a similar query.
To safely implement query-response caching, we either need to rely on exact matches in the form of a key-value cache, or use semantic caching with a very strict threshold (like 0.99 cosine similarity between the user query and the cached query).
So, our RAG pipeline with query-response caching would look something like this:

Continuing with our Athens example, suppose a user asks the query:
what area codes correspond to athens greece?
Assume that earlier, the system already processed this query through the full RAG pipeline: retrieving relevant chunks, reranking them, assembling the prompt, and generating the final answer with the LLM. The generated response might look something like:
The main telephone area code for Athens, Greece is 21.
Numbers in the Athens metropolitan area typically start with the prefix 210,
followed by the local subscriber number.
The next time the same or an extremely similar query appears, the system doesn't have to run the retrieval, reranking, or generation steps again. Instead, it can immediately return the cached response.
In general, for the query-response cache, the key-value pairs have the following format:
query → final_response
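A minimal sketch of the strict semantic variant could look like the following (toy embeddings and the 0.99 threshold discussed above; a real system would embed queries with a model and store the entries in a vector database):

```python
import math

STRICT_THRESHOLD = 0.99
response_cache = []  # list of (query embedding, final response) pairs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup_response(query_embedding):
    # Return a cached response only for near-identical queries.
    for cached_embedding, response in response_cache:
        if cosine(query_embedding, cached_embedding) >= STRICT_THRESHOLD:
            return response
    return None

response_cache.append(([1.0, 0.0],
                       "The main area code for Athens, Greece is 21."))
```

The strict threshold is the safety valve here: it keeps the cache from returning a stale or subtly wrong answer to a query that only looks similar.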
. . .
On my mind
Apart from the Prompt Caching directly provided in the API services of the various LLMs, several other caching mechanisms can be used in a RAG application to achieve cost and latency savings. More specifically, we can utilize caching in the form of a query embedding cache, retrieval cache, reranking cache, prompt assembly cache, and query-response cache. In practice, in a real-world RAG system, many or all of these cache stores can be used together to provide improved performance in terms of cost and time as the users of the app scale.
