of automating a wide variety of tasks. Since the release of ChatGPT in 2022, we have seen an increasing number of AI products on the market utilizing LLMs. However, there are still plenty of improvements to be made in the way we use LLMs. Improving your prompt with an LLM prompt optimizer and utilizing cached tokens are, for example, two simple techniques you can use to vastly improve the performance of your LLM application.
In this article, I'll discuss several specific techniques you can apply to the way you create and structure your prompts, which will reduce latency and cost and also increase the quality of your responses. The goal is to present you with these specific techniques so you can immediately implement them in your own LLM application.
Why you should optimize your prompt
In many cases, you have a prompt that works with a given LLM and yields adequate results. However, you often haven't spent much time optimizing the prompt, which leaves a lot of potential on the table.
I argue that by using the specific techniques I'll present in this article, you can easily both improve the quality of your responses and reduce costs. Just because a prompt and an LLM work together doesn't mean they are performing optimally, and in many cases, you can see great improvements with very little effort.
Specific techniques to optimize your prompts
In this section, I'll cover the specific techniques you can use to optimize your prompts.
Always keep static content early
The first technique I'll cover is to always keep static content early in your prompt. By static content, I mean content that remains the same when you make multiple API calls.
The reason you should keep static content early is that all the big LLM providers, such as Anthropic, Google, and OpenAI, utilize cached tokens. Cached tokens are tokens that have already been processed in a previous API request and can therefore be processed cheaply and quickly. It varies from provider to provider, but cached input tokens are usually priced at around 10% of normal input tokens.
Cached tokens are tokens that have already been processed in a previous API request and can be processed more cheaply and quickly than normal tokens
This means that if you send the same prompt twice in a row, the input tokens of the second prompt will only cost one-tenth of the input tokens of the first prompt. This works because the LLM providers cache the processing of those input tokens, which makes processing your new request cheaper and faster.
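To make the savings concrete, here is a small back-of-the-envelope sketch in Python. The prices and volumes are hypothetical placeholders for illustration, not any provider's actual pricing:
# hypothetical numbers for illustration only -- check your provider's pricing page
PRICE_PER_M_INPUT = 2.50    # $ per million normal input tokens (assumed)
PRICE_PER_M_CACHED = 0.25   # $ per million cached input tokens (~10% of normal)

static_tokens = 50_000      # tokens in the static prefix reused across requests
requests_per_day = 10_000

normal_cost = static_tokens * requests_per_day * PRICE_PER_M_INPUT / 1_000_000
cached_cost = static_tokens * requests_per_day * PRICE_PER_M_CACHED / 1_000_000
print(f"Without caching: ${normal_cost:,.0f}/day, with caching: ${cached_cost:,.0f}/day")
# -> Without caching: $1,250/day, with caching: $125/day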
In practice, you make the most of input token caching by keeping variables at the end of the prompt.
For example, if you have a long static system prompt and a question that varies from request to request, you should do something like this:
prompt = f"""
{long_static_system_prompt}
{user_question}
"""
For example:
prompt = f"""
You are a document expert ...
You should always reply in this format ...
If a user asks about ... you should answer ...
{user_question}
"""
Here we have the static content of the prompt first, and we put the variable content (the user question) last.
In some scenarios, you want to feed in document contents. If you're processing a lot of different documents, you should keep the document content towards the end of the prompt:
# if processing different documents
prompt = f"""
{static_system_prompt}
{variable_prompt_instruction_1}
{document_content}
{variable_prompt_instruction_2}
{user_question}
"""
However, suppose you're processing the same documents multiple times. In that case, you can make sure the tokens of the document are also cached by ensuring no variable content is placed before it in the prompt:
# if processing the same documents multiple times,
# keep the document content before any variable instructions
prompt = f"""
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
Note that caching is usually only activated if the first 1024 tokens are identical across two requests. If, for example, your static system prompt in the above example is shorter than 1024 tokens, you will not utilize any cached tokens.
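If you want a rough feel for whether your static prefix clears that threshold, you can count its tokens locally. This is a minimal sketch using the tiktoken library as an approximation; the provider's own tokenizer determines the real count, and the exact threshold may differ between providers:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # rough stand-in for the provider's tokenizer
static_prefix = long_static_system_prompt + document_content  # the static part of your prompt
num_tokens = len(encoding.encode(static_prefix))

if num_tokens < 1024:
    print(f"Static prefix is only {num_tokens} tokens -- likely too short to be cached.")
else:
    print(f"Static prefix is {num_tokens} tokens -- long enough to benefit from caching.")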
# do NOT do this -- putting variable content first removes all usage of cached tokens
prompt = f"""
{variable_content}
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
Your prompts should always be built up with the most static content first (the content that varies the least from request to request), followed by the most dynamic content (the content that varies the most from request to request):
- If you have a long system and user prompt without any variables, keep that part first and add the variables at the end of the prompt
- If you are fetching text from documents, for example, and processing the same document more than once, keep the document content before any variable instructions so its tokens are cached as well
- Anything long and reused across requests, whether document contents or a long system prompt, is a candidate for caching
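To check whether caching actually kicks in, you can inspect the usage statistics returned by the API. The sketch below assumes the official openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and that the response exposes cached token counts under usage.prompt_tokens_details; the model name is just an example, so treat this as illustrative rather than definitive:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> None:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": long_static_system_prompt},  # static prefix first
            {"role": "user", "content": question},                     # variable content last
        ],
    )
    details = response.usage.prompt_tokens_details
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {response.usage.prompt_tokens}, cached: {cached}")

ask("What is the refund policy?")  # first call: nothing cached yet
ask("Who signed the contract?")    # second call: the static prefix should now show up as cached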
Question at the end
Another technique you should utilize to improve LLM performance is to always put the user question at the end of your prompt. Ideally, you organize it so that your system prompt contains all the general instructions, and the user prompt consists only of the user question, as shown below:
system_prompt = "You are a document expert ..."  # all general instructions go here
user_prompt = f"{user_question}"                 # only the user question
In Anthropic's prompt engineering docs, they state that putting the user question at the end can improve performance by up to 30%, especially if you are using long contexts. Placing the question at the end makes it clearer to the model which task it's trying to achieve and will, in many cases, lead to better results.
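For long-context requests, where the user prompt also carries a large document, this can look like the following minimal sketch (document_content and user_question are placeholders for your own data):
# keep the long document first and the actual question at the very end
user_prompt = f"""
<document>
{document_content}
</document>

Question: {user_question}
"""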
Using a prompt optimizer
A lot of the time, when humans write prompts, they end up messy, inconsistent, full of redundant content, and lacking structure. Thus, you should always feed your prompt through a prompt optimizer.
The simplest prompt optimizer is to ask an LLM to improve your prompt, and it will give you a more structured prompt with less redundant content, and so on.
An even better approach, however, is to use a dedicated prompt optimizer, such as the ones you can find in OpenAI's or Anthropic's consoles. These optimizers are LLMs specifically prompted and built to optimize your prompts, and they will usually yield better results. In addition, you should make sure to include:
- Details about the task you're trying to achieve
- Examples of tasks the prompt succeeded at, with the input and output
- Examples of tasks the prompt failed at, with the input and output
Providing this extra information will usually yield far better results, and you'll end up with a much better prompt. In many cases, you'll only spend around 10-15 minutes and end up with a far more performant prompt. This makes using a prompt optimizer one of the lowest-effort approaches to improving LLM performance.
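As a minimal sketch of the simple "ask an LLM to improve your prompt" approach, you can use a meta-prompt along these lines. The wording and the placeholder variables (task_description, current_prompt, and the example lists) are my own, not an official template:
optimizer_prompt = f"""
You are an expert prompt engineer. Rewrite the prompt below so it is clearer,
better structured, and free of redundant content, without changing its intent.

Task the prompt is used for:
{task_description}

Current prompt:
{current_prompt}

Examples where the prompt succeeded (input and output):
{successful_examples}

Examples where the prompt failed (input and output):
{failed_examples}

Return only the improved prompt.
"""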
Benchmark LLMs
The LLM you use will also significantly impact the performance of your LLM application. Different LLMs are good at different tasks, so you have to try out the different LLMs in your specific application area. I recommend at least setting up access to the major LLM providers like Google Gemini, OpenAI, and Anthropic. Setting this up is quite simple, and switching your LLM provider takes a matter of minutes if you already have the credentials set up. Additionally, you can consider testing open-source LLMs as well, though they typically require more effort.
You then need to set up a specific benchmark for the task you're trying to achieve and see which LLM works best. Additionally, you should regularly check model performance, since the big LLM providers occasionally upgrade their models without necessarily releasing a new version. You should, of course, also be ready to try out any new models the big LLM providers release.
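A benchmark can be as simple as a fixed set of test cases run against each model. The sketch below is illustrative only: call_model stands in for whatever provider-specific function you write yourself, the test cases are made up, and the exact-substring scoring should be replaced with evaluation criteria that fit your task:
from typing import Callable

# a handful of representative test cases for your task (made-up examples)
test_cases = [
    {"question": "What is the invoice total?", "expected": "1,250 USD"},
    {"question": "Who is the counterparty?", "expected": "Acme Corp"},
]

def score(answer: str, expected: str) -> float:
    # naive scoring for illustration: exact substring match
    return 1.0 if expected.lower() in answer.lower() else 0.0

def run_benchmark(models: dict[str, Callable[[str], str]]) -> None:
    # models maps a model name to a function that sends a question to that provider
    for name, call_model in models.items():
        total = sum(score(call_model(case["question"]), case["expected"]) for case in test_cases)
        print(f"{name}: {total:.0f}/{len(test_cases)}")

# usage (hypothetical provider wrappers you would implement yourself):
# run_benchmark({"gemini": call_gemini, "gpt": call_openai, "claude": call_anthropic})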
Conclusion
In this article, I've covered four different techniques you can utilize to improve the performance of your LLM application. I discussed utilizing cached tokens, putting the question at the end of the prompt, using prompt optimizers, and creating specific LLM benchmarks. These are all relatively simple to set up, and they can lead to a significant performance increase. I believe many similar, simple techniques exist, and you should always be on the lookout for them. These topics are often described in various blog posts, and Anthropic's blog is one of those that has helped me improve LLM performance the most.
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium
You can also read some of my other articles:
