Why Care About Prompt Caching in LLMs?


We’ve talked a lot about what an incredible tool RAG is for leveraging the power of AI on custom data. But whether we’re talking about plain LLM API requests, RAG applications, or more complex AI agents, one common question remains the same: how do all these things scale? In particular, what happens to cost and latency as the number of requests in such apps grows? These questions become especially important for more advanced AI agents, which can involve multiple calls to an LLM to process a single user query.

Fortunately, in practice, when making calls to an LLM, the same input tokens are frequently repeated across multiple requests. Users ask some specific questions far more often than others, system prompts and instructions integrated into AI-powered applications are repeated in every user query, and even within a single prompt, models perform recursive calculations to generate a complete response (remember how LLMs produce text by predicting words one at a time?). Just like in other applications, the caching concept can significantly help optimize LLM request costs and latency. For instance, according to OpenAI’s documentation, Prompt Caching can reduce latency by up to an impressive 80% and input token costs by up to 90%.


What about caching?

Generally, caching in computing is no new idea. At its core, a cache is a component that stores data temporarily so that future requests for the same data can be served faster. Accordingly, we can distinguish between two basic cache states – a cache hit and a cache miss. In particular:

  • A cache hit occurs when the requested data is present in the cache, allowing for fast and low-cost retrieval.
  • A cache miss occurs when the data is not in the cache, forcing the application to access the original source, which is more expensive and time-consuming.
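The hit/miss distinction above can be sketched in a few lines of Python. All the names here are made up for illustration, and the "slow" source is simulated with a sleep:

```python
import time

cache = {}

def slow_fetch(url):
    """Stand-in for an expensive request to the origin server."""
    time.sleep(0.01)  # simulate network latency
    return f"<html>content of {url}</html>"

def get(url):
    if url in cache:            # cache hit: cheap and fast
        return cache[url], "hit"
    data = slow_fetch(url)      # cache miss: go to the origin source
    cache[url] = data           # store for future requests
    return data, "miss"

print(get("https://example.com")[1])  # first request: miss
print(get("https://example.com")[1])  # repeated request: hit
```

The first lookup pays the full cost of `slow_fetch`; every repeated lookup for the same key is served directly from the local dictionary.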

One of the most typical implementations of a cache is in web browsers. When visiting a website for the first time, the browser checks for the URL in its memory but finds nothing (that would be a cache miss). Since the data we’re looking for isn’t locally available, the browser has to perform a more expensive and time-consuming request to the web server across the internet, in order to find the data on the remote server where it originally lives. Once the page finally loads, the browser typically copies that data into its local cache. If we try to reload the same page 5 minutes later, the browser will look for it in its local storage. This time, it will find it (a cache hit) and load it from there, without reaching back to the server. This makes the browser work faster and consume fewer resources.

As you may imagine, caching is particularly useful in systems where the same data is requested multiple times. In most systems, data access isn’t uniform, but rather tends to follow a distribution where a small fraction of the data accounts for the overwhelming majority of requests. A large portion of real-life applications follows the Pareto principle, meaning that about 80% of the requests target about 20% of the data. If not for this skew, cache memory would have to be as large as the primary memory of the system, making it very, very expensive.


Prompt Caching and a Little Bit about LLM Inference

The caching concept is applied in a similar manner to improve the efficiency of LLM calls, allowing for significantly reduced costs and latency. Caching can be utilized in various components of an AI application, the most important of which is Prompt Caching. Caching can also provide great benefits when applied to other aspects of an AI app, such as RAG retrieval caching or query-response caching. Nonetheless, this post is going to focus solely on Prompt Caching.


To understand how Prompt Caching works, we must first understand a little bit about how LLM inference – using a trained LLM to generate text – works. LLM inference is not a single continuous process, but is rather divided into two distinct stages. Those are:

  • Pre-fill, which refers to processing the entire prompt at once to produce the first token. This stage requires heavy computation and is thus compute-bound. We may picture a very simplified version of this stage as each token attending to all other tokens, or something like comparing every token with every previous token.
  • Decoding, which appends the last generated token back into the sequence and generates the next one auto-regressively. This stage is memory-bound, since the system must load the entire context of previous tokens from memory to generate each new token.
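The cost of decoding without any caching can be sketched with a toy Python example. Here `next_token` is a hypothetical stand-in for the model (it just replays a canned continuation), and the `work` counter is a crude proxy for computation:

```python
def next_token(generated):
    # hypothetical "model": just replays a canned continuation
    continuation = ["Here", "are", "5", "easy", "dinner", "ideas"]
    return continuation[len(generated)] if len(generated) < len(continuation) else None

def generate_naive(prompt_tokens, max_steps=10):
    generated = []
    work = 0  # total token-passes across all decoding steps
    for _ in range(max_steps):
        # with no caching, every step reprocesses the whole prompt
        # plus everything generated so far
        work += len(prompt_tokens) + len(generated)
        tok = next_token(generated)
        if tok is None:
            break
        generated.append(tok)
    return generated, work

tokens, work = generate_naive("What should I cook for dinner ?".split())
print(tokens)  # ['Here', 'are', '5', 'easy', 'dinner', 'ideas']
print(work)    # 70 token-passes for only 6 generated tokens
```

The quadratic blow-up is visible even at this tiny scale: 7 prompt tokens and 6 output tokens already cost 70 token-passes.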

For example, imagine we have the following prompt:

What should I cook for dinner? 

from which we may then get the first token:

Here

and the subsequent decoding iterations:

Here 
Here are 
Here are 5 
Here are 5 easy 
Here are 5 easy dinner 
Here are 5 easy dinner ideas

The issue with this is that in order to generate the whole response, the model would have to process the same previous tokens over and over again to produce each next word during the decoding stage, which, as you may imagine, is highly inefficient. In our example, this means the model would process the tokens ‘Here are 5 easy dinner‘ again to produce the output ‘ideas‘, even though it had already processed the tokens ‘Here are 5 easy‘ some milliseconds ago.

To solve this, KV (Key-Value) Caching is used in LLMs. This means that the intermediate Key and Value tensors for the input prompt and previously generated tokens are calculated once and then stored in the KV cache, instead of being recomputed from scratch at each iteration. This results in the model performing the minimum calculations needed to produce each response. In other words, for each decoding iteration, the model only performs the calculations needed to predict the new token and then appends it to the KV cache.
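The same toy setup can illustrate decoding with a KV cache. A real KV cache stores attention tensors per layer, not strings; here `encode` is a hypothetical per-token key/value computation, and the point is simply that each token is processed exactly once:

```python
def encode(token):
    # hypothetical per-token key/value computation
    return ("K:" + token, "V:" + token)

def generate_with_kv(prompt_tokens, continuation):
    kv_cache = [encode(t) for t in prompt_tokens]  # pre-fill: done once
    work = len(prompt_tokens)  # one pass per prompt token
    generated = []
    for tok in continuation:
        kv_cache.append(encode(tok))  # decode: only the new token per step
        work += 1
        generated.append(tok)
    return generated, work

prompt = "What should I cook for dinner ?".split()
tokens, work = generate_with_kv(prompt, ["Here", "are", "5", "easy", "dinner", "ideas"])
print(work)  # 13 token-passes: each of the 7 + 6 tokens is processed once
```

Instead of growing quadratically with the response length, the total work is now simply the number of tokens in the prompt plus the number of tokens generated.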

However, KV caching only works within a single prompt and the generation of a single response. Prompt Caching extends the principles used in KV caching to apply caching across different prompts, users, and sessions.


In practice, with prompt caching, we save the repeated parts of a prompt after the first time they are requested. These repeated parts of a prompt usually take the form of large prefixes, like system prompts, instructions, or retrieved context. In this way, when a new request contains the same prefix, the model reuses the computations made previously instead of recalculating from scratch. This is incredibly convenient because it can significantly reduce the operating costs of an AI application (we don’t have to pay full price for repeated inputs that contain the same tokens), as well as reduce latency (we don’t have to wait for the model to process tokens that have already been processed). This is especially useful in applications where prompts contain large repeated instructions, such as RAG pipelines.

It is important to understand that this caching operates at the token level. In practice, this means that even if two prompts differ at the end, as long as they share the same token prefix, the cached computations for that shared portion can still be reused, and new calculations are only performed for the tokens that differ. The tricky part here is that the common tokens must be at the beginning of the prompt, so how we form our prompts and instructions becomes particularly important. In our cooking example, we can imagine the following consecutive prompts.

Prompt 1
What should I cook for dinner? 

and then if we input the prompt:

Prompt 2
What should I cook for lunch? 

The shared tokens ‘What should I cook for‘ should be a cache hit, and thus one should expect to consume significantly fewer tokens for Prompt 2.

However, if we had the following prompts…

Prompt 1
Supper time! What should I cook? 

and then

Prompt 2
Lunch time! What should I cook? 

this would be a cache miss, since the first token of each prompt is different. Because the prompt prefixes differ, we cannot hit the cache, even though their semantics are essentially the same.
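Both cooking examples can be checked with a small sketch of token-level prefix matching: only the shared leading tokens can be served from cache. Tokenization here is a naive whitespace split, purely for illustration; real tokenizers split text differently:

```python
def shared_prefix_len(tokens_a, tokens_b):
    """Count how many leading tokens two prompts have in common."""
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

p1 = "What should I cook for dinner ?".split()
p2 = "What should I cook for lunch ?".split()
print(shared_prefix_len(p1, p2))  # 5 shared leading tokens: a cache hit

p3 = "Supper time ! What should I cook ?".split()
p4 = "Lunch time ! What should I cook ?".split()
print(shared_prefix_len(p3, p4))  # 0 shared leading tokens: a cache miss
```

Even though the second pair of prompts shares almost all of its tokens, the very first token differs, so nothing can be reused.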

As a result, a basic rule of thumb for getting prompt caching to work is to always put any static information, like instructions or system prompts, at the beginning of the model input. On the flip side, any typically variable information, like timestamps or user identifiers, should go at the end of the prompt.
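This rule of thumb can be sketched as a small prompt-assembly helper. All the names here are illustrative, not part of any API:

```python
# static part: identical across all requests, so it goes first
SYSTEM_INSTRUCTIONS = "You are a helpful cooking assistant. Suggest simple, practical meals."

def build_prompt(user_question, user_id, timestamp):
    # cache-friendly ordering: static prefix first, variable parts last
    return (
        SYSTEM_INSTRUCTIONS
        + "\n\n" + user_question
        + f"\n\n[user: {user_id}, time: {timestamp}]"
    )

a = build_prompt("What should I cook for dinner?", "u1", "2024-01-01T18:00")
b = build_prompt("What should I cook for dinner?", "u2", "2024-01-01T19:00")
# both requests share a long identical prefix, so the second can hit cache
print(a[:len(SYSTEM_INSTRUCTIONS)] == b[:len(SYSTEM_INSTRUCTIONS)])  # True
```

Had the user ID or timestamp been placed at the front instead, every request would start with different tokens and no prefix could ever be reused.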


Getting our hands dirty with the OpenAI API

Nowadays, most of the frontier foundation models, like GPT or Claude, provide some sort of Prompt Caching functionality directly integrated into their APIs. More specifically, in these APIs, Prompt Caching is shared among all users of an organization accessing the same API key. In other words, once a user makes a request and its prefix is stored in the cache, any other user inputting a prompt with the same prefix gets a cache hit. That is, we get to use precomputed calculations, which significantly reduces token consumption and makes response generation faster. This is particularly useful when deploying AI applications in the enterprise, where we expect many users to use the same application, and thus the same input prefixes.

On most recent models, Prompt Caching is automatically activated by default, but some level of parametrization is available. We can distinguish between:

  • In-memory prompt cache retention, where the cached prefixes are maintained for roughly 5–10 minutes and up to 1 hour, and
  • Extended prompt cache retention (only available for specific models), allowing for a longer retention of the cached prefix, up to a maximum of 24 hours.

But let’s take a closer look!

We can see all this in practice with the following minimal Python example, making requests to the OpenAI API, using Prompt Caching, and the cooking prompts mentioned earlier. I added a rather large shared prefix to my prompts, to make the effects of caching more visible:

from openai import OpenAI

api_key = "your_api_key"
client = OpenAI(api_key=api_key)

prefix = """
You are a helpful cooking assistant.

Your task is to suggest easy, practical dinner ideas for busy people.
Follow these guidelines carefully when generating suggestions:

General cooking rules:
- Meals should take less than 30 minutes to prepare.
- Ingredients should be easy to find in a regular supermarket.
- Recipes should avoid overly complex techniques.
- Prefer balanced meals including vegetables, protein, and carbohydrates.

Formatting rules:
- Always return a numbered list.
- Provide 5 suggestions.
- Each suggestion should include a short explanation.

Ingredient guidelines:
- Prefer seasonal vegetables.
- Avoid exotic ingredients.
- Assume the user has basic pantry staples such as olive oil, salt, pepper, garlic, onions, and pasta.

Cooking philosophy:
- Favor easy home cooking.
- Avoid restaurant-level complexity.
- Focus on meals that people realistically cook on weeknights.

Example meal styles:
- pasta dishes
- rice bowls
- stir fry
- roasted vegetables with protein
- easy soups
- wraps and sandwiches
- sheet pan meals

Diet considerations:
- Default to healthy meals.
- Avoid deep frying.
- Prefer balanced macronutrients.

Additional instructions:
- Keep explanations concise.
- Avoid repeating the same ingredients in every suggestion.
- Provide variety across the meal suggestions.

""" * 80
# huge prefix to make sure we pass the ~1,024-token threshold for activating prompt caching

prompt1 = prefix + "What should I cook for dinner?"

response1 = client.responses.create(
    model="gpt-5.2",
    input=prompt1
)

print("Response 1:")
print(response1.output_text)

and then for prompt 2:

prompt2 = prefix + "What should I cook for lunch?"

response2 = client.responses.create(
    model="gpt-5.2",
    input=prompt2
)

print("\nResponse 2:")
print(response2.output_text)

print("\nUsage stats:")
print(response2.usage)

So, for prompt 2, we will only be billed in full for the remaining, non-identical part of the prompt. That is the input tokens minus the cached tokens: 20,014 – 19,840 = only 174 tokens, or in other words, about 99% fewer fresh input tokens.
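The arithmetic above can be reproduced directly from the reported usage numbers (in the OpenAI Responses API, cached tokens are reported in the usage object's input token details; the figures below are the ones from the run above):

```python
# usage numbers reported for the second request (from the run above)
input_tokens = 20014
cached_tokens = 19840

billed_fresh = input_tokens - cached_tokens   # tokens not served from cache
savings = cached_tokens / input_tokens        # fraction served from cache

print(billed_fresh)      # 174 tokens processed as fresh input
print(f"{savings:.0%}")  # ~99% of the input tokens came from the cache
```

Note that cached input tokens are billed at a discounted rate rather than being entirely free; the exact discount depends on the model and provider.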

In any case, since OpenAI imposes a 1,024-token minimum threshold for activating prompt caching and the cache is preserved for a maximum of 24 hours, it becomes clear that these cost benefits can be obtained in practice only when running AI applications at scale, with many active users performing many requests every day. Nonetheless, as explained, in such cases the Prompt Caching feature can provide substantial cost and time benefits for LLM-powered applications.


On my mind

Prompt Caching is a powerful optimization for LLMs that can significantly improve the efficiency of AI applications in terms of both cost and time. By reusing previous computations for identical prompt prefixes, the model can skip redundant calculations and avoid repeatedly processing the same input tokens. The result is faster responses and lower costs, especially in applications where large parts of prompts – such as system instructions or retrieved context – remain constant across many requests. As AI systems scale and the number of LLM calls increases, these optimizations become increasingly important.

