In my previous post, I covered Prompt Caching: what it is, how it works, and how it can save you a lot of time and money when running AI-powered apps with high traffic. In today's post, I walk you through implementing Prompt Caching specifically with OpenAI's API, and we discuss some common pitfalls.
A brief reminder on Prompt Caching
Before getting our hands dirty, let's briefly revisit what exactly Prompt Caching is. Prompt Caching is a feature provided by frontier model API services, like the OpenAI API or Claude's API, that allows caching and reusing parts of the LLM's input that are repeated frequently. Such repeated parts may be system prompts or instructions that are passed to the model every time an AI app runs, alongside variable content like the user's query or information retrieved from a knowledge base. To get a cache hit with prompt caching, the repeated parts of the prompt must sit at the very beginning of it; in other words, they must form a prefix. In addition, for prompt caching to be activated, this prefix must exceed a certain minimum length (e.g., for OpenAI the prefix should be longer than 1,024 tokens, while Claude has different minimum cache lengths for different models). As long as those two conditions are satisfied (repeated tokens forming a prefix that exceeds the size threshold defined by the API service and model), caching can be activated to achieve economies of scale when running AI apps.
Unlike caching in other components of a RAG or other AI app, prompt caching operates at the token level, inside the internal procedures of the LLM. Specifically, LLM inference takes place in two steps:
- Pre-fill, that is, the LLM processes the entire input prompt to generate the first token, and
- Decoding, that is, the LLM recursively generates the output tokens one after the other.
In short, prompt caching stores the computations that happen in the pre-fill stage, so the model doesn't have to redo them when the same prefix reappears. Any computations taking place during the decoding phase, even if repeated, are not cached.
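To make the distinction concrete, here is a toy simulation of the idea. This is my own sketch, not how an inference server is actually implemented: the expensive pre-fill work is memoized per prefix, while the decode step runs every single time:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def prefill(prefix: str) -> int:
    """Stand-in for the expensive pre-fill pass over the prefix tokens."""
    return sum(ord(c) for c in prefix)  # pretend this is the cached internal state

def generate(prefix: str, suffix: str) -> str:
    state = prefill(prefix)  # cached after the first call for each distinct prefix
    # Decoding always runs from scratch, token by token; it is never cached.
    return f"response(state={state}, suffix={suffix!r})"

generate("long shared instructions", "question A")  # computes the pre-fill
generate("long shared instructions", "question B")  # reuses the memoized pre-fill
```

Only the identical prefix triggers reuse; change a single leading character and `prefill` runs again, which is exactly the behavior we will see with the real API below.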
For the rest of the post, I will focus solely on prompt caching in the OpenAI API.
What about the OpenAI API?
In OpenAI's API, prompt caching was initially introduced on the 1st of October 2024. Originally, it offered a 50% discount on cached tokens, but nowadays this discount goes up to 90%. On top of that, hitting the prompt cache can also reduce latency by up to 80%.
When prompt caching is activated, the API service attempts to hit the cache for a submitted request by routing the prompt to an appropriate machine, where the respective cache is expected to exist. This is called Cache Routing, and to do it, the API service typically uses a hash of the first 256 tokens of the prompt.
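To build intuition, here is a toy sketch of what prefix-based routing could look like. This is purely illustrative (my own simplification, not OpenAI's actual implementation): it hashes a fixed-size chunk of the start of the prompt and maps it to a machine, so requests sharing that prefix land on the same machine. Tokens are approximated with characters for simplicity.

```python
import hashlib

def route_request(prompt: str, num_machines: int) -> int:
    """Map a prompt to a machine index based on a hash of its prefix."""
    # Approximate "the first 256 tokens" with the first 1,024 characters.
    prefix = prompt[:1024]
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines

# Two requests sharing the same long prefix route to the same machine,
# so the second one can reuse the cached pre-fill computation there.
shared_prefix = "You are a helpful assistant. " * 100
m1 = route_request(shared_prefix + "What is overfitting?", num_machines=8)
m2 = route_request(shared_prefix + "What is regularization?", num_machines=8)
```

Because only the leading chunk is hashed, anything that varies after it does not affect routing, while anything that varies inside it sends the request to a different machine and forfeits the cache.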
Beyond this, the API also allows explicitly setting a prompt_cache_key parameter in the request. This is a single key identifying which cache we are referring to, aiming to further increase the chances of our prompt being routed to the right machine and hitting the cache.
In addition, the OpenAI API provides two distinct kinds of caching with respect to duration, controlled through the prompt_cache_retention parameter. Those are:
- In-memory prompt cache retention: This is the default kind of caching, available for all models that support prompt caching. With the in-memory cache, cached data remains active for a period of 5-10 minutes between requests.
- Extended prompt cache retention: This is available only for specific models. The extended cache keeps data cached for longer, up to a maximum of 24 hours.
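As a sketch, requesting extended retention would look like the following. The parameter name matches the discussion above, but the "24h" value and per-model support are assumptions you should verify against the API reference for your model:

```python
# Request arguments for extended cache retention; pass them to
# client.responses.create(**request) with an initialized OpenAI client.
request = dict(
    model="gpt-4.1-mini",
    prompt_cache_retention="24h",  # extended retention; omit for the default in-memory cache
    input="<long repeated prefix><variable part of the prompt>",
)
```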
Now, as far as pricing goes, OpenAI charges the same per non-cached input token whether prompt caching is activated or not. If we manage to hit the cache successfully, we are billed for the cached tokens at a greatly discounted price, with a discount of up to 90%. Moreover, the price per input token remains the same for both in-memory and extended cache retention.
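To make the billing concrete, here is a small back-of-the-envelope helper. The prices are hypothetical placeholders; only the structure (full price for uncached input tokens, discounted price for cached ones) reflects the scheme described above:

```python
def input_cost_usd(total_input_tokens: int, cached_tokens: int,
                   price_per_mtok: float, cache_discount: float = 0.90) -> float:
    """Cost of the input side of a request, in USD."""
    uncached = total_input_tokens - cached_tokens
    cached_price = price_per_mtok * (1 - cache_discount)
    return (uncached * price_per_mtok + cached_tokens * cached_price) / 1_000_000

# 10,000 input tokens at a hypothetical $1.00 per million input tokens:
no_cache = input_cost_usd(10_000, cached_tokens=0, price_per_mtok=1.00)     # $0.01
with_cache = input_cost_usd(10_000, cached_tokens=8_000, price_per_mtok=1.00)  # $0.0028
```

With 80% of the input served from cache at a 90% discount, the input cost drops by roughly 72%, which is where the savings at scale come from.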
Prompt Caching in Practice
So, let's see how prompt caching actually works with a simple Python example using OpenAI's API. More specifically, we'll walk through a practical scenario where a long system prompt (prefix) is reused across multiple requests. If you are following along, I assume you already have your OpenAI API key in place and have installed the required libraries. The first thing to do is to import the OpenAI library, as well as time for capturing latency, and initialize an instance of the OpenAI client:
from openai import OpenAI
import time
client = OpenAI(api_key="your_api_key_here")
Then we can define our prefix (the tokens that are going to be repeated and that we are aiming to cache):
long_prefix = """
You are a highly knowledgeable assistant specialized in machine learning.
Answer questions with detailed, structured explanations, including examples when relevant.
""" * 200
Notice how we artificially increase the length (multiplying by 200) to make sure the 1,024-token caching threshold is met. We also set up a timer to measure our latency savings, and then we are ready to make our call:
start = time.time()
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is overfitting in machine learning?"
)
end = time.time()
print("First response time:", round(end - start, 2), "seconds")
print(response1.output[0].content[0].text)
So, what do we expect to happen here? For models from gpt-4o onwards, prompt caching is activated by default, and since our 4,616 input tokens are well above the 1,024-token prefix threshold, we are good to go. What this request does is first check whether the input is a cache hit (it isn't, since this is the first time we make a request with this prefix); since it isn't, it processes the full input and then caches it. Next time we send an input whose initial tokens match the cached input, we will get a cache hit. Let's check this in practice by making a second request with the same prefix:
start = time.time()
response2 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is regularization?"
)
end = time.time()
print("Second response time:", round(end - start, 2), "seconds")
print(response2.output[0].content[0].text)

Indeed! The second request runs significantly faster (15.37 vs. 23.31 seconds). This is because the model has already performed the computations for the cached prefix and only needs to process the new part, "What is regularization?", from scratch. As a result, by using prompt caching, we get significantly lower latency and reduced cost, since cached tokens are discounted.
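Latency is a noisy signal, so a more reliable way to confirm a cache hit is the usage info returned with the response. For the Responses API, the cached count should appear under usage.input_tokens_details.cached_tokens (field names taken from the current API reference; verify against your SDK version). Here is a small helper, exercised on a stand-in usage object rather than a live response:

```python
from types import SimpleNamespace

def cache_hit_ratio(usage) -> float:
    """Fraction of input tokens served from the prompt cache."""
    cached = usage.input_tokens_details.cached_tokens
    return cached / usage.input_tokens if usage.input_tokens else 0.0

# In a real run you would pass response2.usage; the numbers below are made up.
usage = SimpleNamespace(
    input_tokens=4_616,
    input_tokens_details=SimpleNamespace(cached_tokens=4_480),
)
ratio = cache_hit_ratio(usage)
```

A ratio near 1.0 means almost the entire prompt was served from cache; a ratio of 0.0 on a request you expected to hit means one of the pitfalls discussed below probably applies.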
Another thing mentioned in the OpenAI documentation, which we've already talked about, is the prompt_cache_key parameter. Specifically, according to the documentation, we can explicitly define a prompt cache key when making a request, and in this way specify which requests should use the same cache. However, when I tried to include it in my example by adjusting the request parameters accordingly, I didn't have much luck:
response1 = client.responses.create(
    prompt_cache_key="prompt_cache_test1",
    model="gpt-5.1",
    input=long_prefix + "What is overfitting in machine learning?"
)

🤔
It seems that while prompt_cache_key exists among the API's capabilities, it is not yet exposed in the Python SDK. In other words, we cannot explicitly control cache reuse yet; rather, it remains automatic and best-effort.
So, what can go wrong?
Activating prompt caching and actually hitting the cache seems fairly straightforward from what we've said so far. So, what can go wrong, resulting in us missing the cache? Unfortunately, plenty of things. As straightforward as it is, prompt caching requires several different assumptions to hold, and missing even one of those prerequisites results in a cache miss. Let's take a closer look!
One obvious miss is having a prefix shorter than the threshold for activating prompt caching, namely, less than 1,024 tokens. However, this is very easily solvable: we can always artificially increase the prefix token count by simply multiplying it by an appropriate value, as shown in the example above.
Another pitfall is silently breaking the prefix. Specifically, even when we use persistent instructions and system prompts of appropriate size across all requests, we need to be exceptionally careful not to break the prefix by adding variable content at the beginning of the model's input, before the prefix. That is a guaranteed way to break the cache, no matter how long and repeated the subsequent prefix is. Usual suspects for this pitfall are dynamic data, for instance, appending the user ID or a timestamp at the beginning of the prompt. Thus, a best practice across all AI app development is that dynamic content should always be appended at the end of the prompt, never at the beginning.
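The pitfall is easy to demonstrate with plain string handling. In the hypothetical prompt builders below, the "bad" version prepends a per-user ID, so no two users ever share a prefix, while the "good" version keeps the static instructions first and the cacheable prefix intact:

```python
STATIC_INSTRUCTIONS = "You are a highly knowledgeable assistant..."  # the long, repeated part

def bad_prompt(user_id: str, query: str) -> str:
    # Dynamic data first: every user gets a different prefix, so the
    # cached pre-fill computation is never reused across users.
    return f"[user:{user_id}] {STATIC_INSTRUCTIONS} {query}"

def good_prompt(user_id: str, query: str) -> str:
    # Static instructions first: the shared prefix stays cacheable.
    return f"{STATIC_INSTRUCTIONS}\n[user:{user_id}] {query}"

n = len(STATIC_INSTRUCTIONS)
shares_prefix_good = good_prompt("alice", "q")[:n] == good_prompt("bob", "q")[:n]
shares_prefix_bad = bad_prompt("alice", "q")[:n] == bad_prompt("bob", "q")[:n]
```

Only the "good" builder produces prompts whose first tokens are identical across users, which is exactly what the cache routing needs.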
Finally, it is worth highlighting that prompt caching only concerns the pre-fill phase; decoding is never cached. This means that even if we force the model to generate responses following a specific template that begins with certain fixed tokens, those tokens are not going to be cached, and we will be billed for their processing as usual.
Conversely, for certain use cases it doesn't really make sense to use prompt caching at all. Such cases would be highly dynamic prompts, like chatbots with little repetition, one-off requests, or real-time personalized systems.
. . .
On my mind
Prompt caching can significantly improve the performance of AI applications, both in terms of cost and time. Especially when looking to scale AI apps, prompt caching comes in extremely handy for keeping cost and latency at acceptable levels.
For OpenAI's API, prompt caching is activated by default, and the cost of non-cached input tokens is the same whether prompt caching is activated or not. Thus, one can only win by activating prompt caching and aiming to hit it on every request, even if not every request succeeds.
Claude also provides extensive prompt caching functionality through its API, which we'll explore in detail in a future post.
Thanks for reading! 🙂
. . .
