OpenAI Prompt Cache Monitoring


A worked example using Python and the chat completion API

As part of their recent DevDay presentation, OpenAI announced that Prompt Caching is now available for various models. At the time of writing, those models were:

GPT-4o, GPT-4o mini, o1-preview and o1-mini, in addition to fine-tuned versions of those models.

This news shouldn’t be underestimated: it allows developers to cut costs significantly and reduce application latency.

API calls to supported models automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has previously been computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, OpenAI automatically applies the Prompt Caching discount without requiring you to change your API integration.
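In practice, this means it helps to structure your prompts so that the long, static content comes first and the request-specific content comes last. Here is a minimal sketch of that idea; the constant and helper function names below are purely illustrative and not part of the OpenAI API:

    # Keep long, unchanging instructions at the start of the prompt so that
    # repeated calls share a cacheable prefix (1,024 tokens or more), and put
    # the per-request content at the end where it varies.

    STATIC_SYSTEM_PROMPT = (
        "You are a helpful assistant. "
        "... several hundred more tokens of fixed instructions and examples ..."
    )

    def build_messages(user_question: str) -> list[dict]:
        """Return a message list whose prefix is identical on every call."""
        return [
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": user_question},           # variable suffix
        ]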

As an OpenAI API developer, the only thing you really need to worry about is how to monitor your Prompt Caching usage, i.e. how to check that it’s actually being applied.

In this article, I’ll show you how to do that using Python, a Jupyter Notebook and a chat completion example.
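As a preview of that worked example, here is a minimal sketch of the monitoring check, assuming the openai Python package (v1.x), an OPENAI_API_KEY environment variable and the gpt-4o-mini model; the prompt text is purely illustrative. At the time of writing, the chat completion usage object exposes a prompt_tokens_details.cached_tokens field reporting how many prompt tokens were served from the cache:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # In a real application the system message would be a long (1,024+ token),
    # unchanging block so that repeated calls share a cacheable prefix.
    messages = [
        {"role": "system", "content": "You are a helpful assistant. ..."},
        {"role": "user", "content": "Summarise the key points of prompt caching."},
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    usage = response.usage
    print("prompt tokens:", usage.prompt_tokens)
    # 0 on a cache miss; greater than 0 once a previously computed prefix is reused.
    print("cached tokens:", usage.prompt_tokens_details.cached_tokens)

Note that with the short illustrative prompt above, cached_tokens would stay at 0, because the prefix is well under the 1,024-token minimum; the static prefix has to exceed that threshold before repeated calls report a cache hit.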

Install WSL2 Ubuntu
