in production, actively responding to user queries. However, you now need your model to handle a larger fraction of customer requests successfully. How do you approach this?
In this article, I discuss the scenario where you already have a running LLM and want to analyze and optimize its performance. I'll cover the approaches I use to uncover where the LLM works well and where it needs improvement, as well as the tools I use to enhance its performance, such as Anthropic's prompt improver.
In brief, I follow a three-step process to quickly improve my LLM’s performance:
- Analyze LLM outputs
- Iteratively improve the areas with the best value-to-effort ratio
- Evaluate and iterate
Motivation
My motivation for this article is that I often find myself in the scenario described in the intro. I already have my LLM up and running; however, it's not performing as expected or meeting customer expectations. Through countless experiences analyzing my LLMs, I have created this simple three-step process that I always use to improve them.
Step 1: Analyzing LLM outputs
The first step to improving your LLM should always be to analyze its output. To get high observability in your platform, I strongly recommend using an LLM tracing tool, such as Langfuse or PromptLayer. These tools make it easy to collect all your LLM invocations in a single place, ready for analysis.
I'll now discuss the different approaches I use to analyze my LLM outputs.
Manual inspection of raw output
The simplest way to analyze your LLM output is to manually inspect a large number of your LLM invocations. Gather your last 50 invocations and read through the entire context you fed into the model, along with the output it produced (a minimal sketch for pulling recent invocations out of a trace log follows the list below). I find this approach surprisingly effective at uncovering problems. I have, for instance, discovered:
- Duplicate context (part of my context was duplicated due to a programming error)
- Missing context (I wasn’t feeding all the data I expected into my LLM)
- etc.
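If you are not using a tracing tool yet, you can still do this kind of review with very little tooling. Below is a minimal sketch, assuming you log each invocation as one JSON object with hypothetical `input`, `output`, and `timestamp` fields to a local JSONL file; it pulls the last 50 invocations so you can read through them one by one.

```python
import json
from pathlib import Path

# Hypothetical trace log: one JSON object per line with "input", "output", "timestamp".
TRACE_FILE = Path("llm_traces.jsonl")

def load_recent_invocations(n: int = 50) -> list[dict]:
    """Return the n most recent LLM invocations from the trace log."""
    with TRACE_FILE.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    records.sort(key=lambda r: r["timestamp"])
    return records[-n:]

# Print each invocation's full context and output for manual review.
for i, record in enumerate(load_recent_invocations(), start=1):
    print(f"--- Invocation {i} ---")
    print("CONTEXT:\n", record["input"])
    print("OUTPUT:\n", record["output"])
```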
Manual inspection of data should never be underestimated. Thoroughly searching through the data by hand gives you an understanding of the dataset you're working with that is difficult to obtain any other way. Moreover, I find that I should usually inspect more data points than I initially want to spend time on.
For example, say it takes 5 minutes to manually inspect one input-output example. My intuition often tells me to spend maybe 20-30 minutes on this, and thus inspect 4-6 data points. However, I find that you should usually spend a lot longer on this part of the process. I recommend at least 5x-ing this time, so instead of spending 30 minutes on manual inspection, you spend 2.5 hours. Initially, this will feel like a lot of time to spend on manual inspection, but you'll usually find it saves you plenty of time in the long run. Moreover, compared to an entire 3-week project, 2.5 hours is an insignificant amount of time.
Group queries according to a taxonomy
Sometimes, you won't get all your answers from simple manual analysis of your data. In those cases, I move on to a more quantitative analysis. This is in contrast to the first approach, which I consider qualitative since I'm manually inspecting each data point.
Grouping user queries according to a taxonomy is an effective way to better understand what users expect from your LLM. I'll provide an example to make this easier to grasp:
Imagine you're Amazon, and you have a customer support LLM handling incoming customer questions. In this example, a taxonomy might look something like:
- Refund requests
- Talk-to-a-human requests
- Questions about individual products
- …
I'd then take the last 1,000 user queries and manually annotate them into this taxonomy. This tells you which questions are most prevalent and which ones you should focus on answering correctly. You'll often find that the number of items per category follows a Pareto distribution, with most items belonging to a few specific categories.
Additionally, you annotate whether each customer request was answered successfully or not. With this information, you can identify which types of questions you're struggling with and which ones your LLM handles well. Perhaps the LLM easily transfers customer queries to humans when requested, but struggles when asked about details of a product. In that case, you should focus your effort on the group of questions you're struggling with the most.
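A minimal sketch of this per-category analysis, assuming you have stored your manual annotations as a list of records with hypothetical `category` and `answered_successfully` fields:

```python
from collections import Counter

# Hypothetical annotations: each query labeled with a taxonomy category and a success flag.
annotations = [
    {"category": "refund_request", "answered_successfully": True},
    {"category": "product_question", "answered_successfully": False},
    {"category": "talk_to_human", "answered_successfully": True},
    # ... the rest of your ~1,000 annotated queries
]

counts = Counter(a["category"] for a in annotations)
successes = Counter(a["category"] for a in annotations if a["answered_successfully"])

# Categories sorted by prevalence, with the success rate for each.
print(f"{'category':<20}{'count':>8}{'success rate':>15}")
for category, count in counts.most_common():
    rate = successes[category] / count
    print(f"{category:<20}{count:>8}{rate:>14.0%}")
```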
LLM as a judge on a golden dataset
Another quantitative approach I use to analyze my LLM outputs is to create a golden dataset of input-output examples and use an LLM as a judge. This helps whenever you make changes to your LLM.
Continuing with the customer support example, you can create a list of 50 (real) user queries and the desired response to each of them. Every time you make changes to your LLM (change the model version, add more context, ...), you can automatically test the new LLM on the golden dataset and have an LLM judge determine whether the response from the new model is at least as good as the response from the old model. This saves you vast amounts of time manually inspecting LLM outputs whenever you update your LLM.
If you want to learn more about LLM as a judge, you can read my TDS article on the subject here.
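A minimal sketch of such a judge, assuming the OpenAI Python SDK, a hypothetical `golden_dataset` of query/desired-response pairs, and a `new_llm` function that calls your updated model:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable judge model works

JUDGE_PROMPT = """You are comparing two customer support responses to the same query.
Query: {query}
Desired response: {desired}
New model's response: {candidate}

Is the new model's response at least as good as the desired response?
Answer with exactly one word: PASS or FAIL."""

def judge(query: str, desired: str, candidate: str) -> bool:
    """Ask an LLM judge whether the candidate response matches the golden answer's quality."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, desired=desired, candidate=candidate)}],
    )
    return completion.choices[0].message.content.strip().upper().startswith("PASS")

# Hypothetical usage: golden_dataset is a list of {"query": ..., "desired_response": ...} dicts,
# and new_llm(query) calls your updated production LLM.
# passed = sum(judge(ex["query"], ex["desired_response"], new_llm(ex["query"]))
#              for ex in golden_dataset)
# print(f"{passed}/{len(golden_dataset)} golden examples passed")
```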
Step 2: Iteratively improving your LLM
You're done with step one, and you now want to use those insights to improve your LLM. In this section, I discuss how I approach this step to efficiently improve my LLM's performance.
If I discover significant issues, for example while manually inspecting data, I always fix those first. This could be unnecessary noise being added to the LLM's context, or typos in my prompts. Once I'm done with that, I move on to using some tools.
One type of tool I use is prompt optimizers, such as Anthropic's prompt improver. With these tools, you typically input your prompt and a few input-output examples. You can, for instance, input the prompt you use for your customer support agent, together with examples of customer interactions where the LLM failed. The prompt optimizer will analyze your prompt and examples and return an improved version of your prompt. You'll likely see improvements such as:
- Improved structure in your prompt, for example using Markdown
- Handling of edge cases. For example, handling cases where the user asks the customer support agent about completely unrelated topics, such as “What's the weather in New York today?”. The prompt optimizer might add something like “If the query is not related to Amazon, tell the user that you're only designed to answer questions about Amazon”.
If I have more quantitative data, such as from grouping user queries or from a golden dataset, I also analyze that data and create a value-effort graph. The value-effort graph highlights the different improvements you can make, such as:
- Improved edge-case handling in the system prompt
- Using a better embedding model for improved RAG
You then plot these data points on a 2D grid, such as the one below. You should naturally prioritize items in the upper-left quadrant, because they provide a lot of value and require little effort. Usually, however, items lie along a diagonal, where higher value correlates strongly with higher required effort.
I put all my improvement suggestions into a value-effort graph, and then repeatedly pick items that are as high as possible in value and as low as possible in effort. This is a super effective way to quickly solve the most pressing issues with your LLM, positively impacting the largest number of customers you can for a given amount of effort.
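A minimal sketch of plotting such a value-effort graph with matplotlib, using hypothetical improvement items and made-up value/effort scores on a 1-10 scale:

```python
import matplotlib.pyplot as plt

# Hypothetical improvement candidates, scored 1-10 on estimated (value, effort).
improvements = {
    "Edge-case handling in system prompt": (7, 3),
    "Better embedding model for RAG": (8, 6),
    "Fix duplicated context bug": (5, 1),
    "Fine-tune on support transcripts": (9, 9),
}

fig, ax = plt.subplots()
for name, (value, effort) in improvements.items():
    ax.scatter(effort, value)
    ax.annotate(name, (effort, value), textcoords="offset points", xytext=(5, 5))

# High value / low effort lands in the upper-left quadrant.
ax.set_xlabel("Effort")
ax.set_ylabel("Value")
ax.set_title("Value-effort graph of candidate improvements")
plt.show()
```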
Step 3: Evaluate and iterate
The last step in my three-step process is to evaluate my LLM and iterate. There are a plethora of techniques you can use to evaluate your LLM, many of which I cover in my article on the subject.
Preferably, you create some quantitative metrics for your LLM's performance and ensure those metrics have improved from the changes you applied in step 2. After applying these changes and verifying they improved your LLM, you should consider whether the model is good enough or whether you should continue improving it. I most often operate on the 80% principle, which states that 80% performance is good enough in almost all cases. This is not a literal 80% as in accuracy; rather, it highlights the point that you don't need to create a perfect model, only one that is good enough.
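A minimal sketch of how this decision can be encoded, with made-up pass rates from the golden-dataset judge described in step 1:

```python
# Hypothetical before/after pass rates on the golden dataset (from the LLM judge).
baseline_pass_rate = 0.62   # before the changes from step 2
new_pass_rate = 0.81        # after the changes

# First, confirm the changes did not regress quality.
assert new_pass_rate >= baseline_pass_rate, "The changes regressed golden-dataset quality"

# The "80% principle": stop iterating once the model is good enough, not perfect.
GOOD_ENOUGH = 0.80
if new_pass_rate >= GOOD_ENOUGH:
    print("Good enough: ship it and keep monitoring")
else:
    print("Keep iterating on the highest value-to-effort improvements")
```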
Conclusion
In this article, I have discussed the scenario where you already have an LLM in production and want to analyze and improve it. I approach this scenario by first analyzing the model inputs and outputs, preferably through full manual inspection. After ensuring I truly understand the dataset and how the model behaves, I move on to more quantitative methods, such as grouping queries into a taxonomy and using an LLM as a judge. Following this, I implement improvements based on my findings from the previous step, and finally, I evaluate whether my improvements worked as intended.
👉 Find me on socials:
🧑💻 Get in contact
✍️ Medium
Or read my other articles: