, one could argue that the majority of the work resembles traditional software development more than ML or Data Science, considering we regularly use off-the-shelf foundation models instead of training them ourselves. Even so, I still believe that one of the most critical parts of building an LLM-based application centers on data, specifically the evaluation pipeline. You can’t improve what you can’t measure, and you can’t measure what you don’t understand. To build an evaluation pipeline, you still need to invest a considerable amount of effort in examining, understanding, and analyzing your data.
In this blog post, I want to document some notes on the process of building an evaluation pipeline for an LLM-based application I’m currently developing. It’s also an exercise in applying theoretical concepts I’ve read about online, mainly from Hamel Husain’s blog, to a concrete example.
Table of Contents
- The Application – Explaining our scenario and use case
- The Eval Pipeline – Overview of the evaluation pipeline and its main components. For each step, we will divide the discussion into:
- Overview – A brief, conceptual explanation of the step.
- In Practice – A concrete example of applying the concepts to our use case.
- What Lies Ahead – This is just the beginning. How will our evaluation pipeline evolve?
- Conclusion – Recapping the key steps and final thoughts.
1. The Application
To ground our discussion, let’s use a concrete example: an AI-powered IT Helpdesk Assistant*.
The AI serves as the first line of support. An employee submits a ticket describing a technical issue—their laptop is slow, they can’t connect to the VPN, or an application is crashing. The AI’s task is to analyze the ticket, provide initial troubleshooting steps, and either resolve the issue or escalate it to the appropriate human specialist.
Evaluating the performance of this application is a subjective task. The AI’s output is free-form text, meaning there is no single “correct” answer. A helpful response can be phrased in many ways, so we cannot simply check whether the output is “Option A” or “Option B.” It is also not a regression task, where we can measure numerical error using metrics like Mean Squared Error (MSE).
A “good” response is defined by a combination of factors: Did the AI correctly diagnose the issue? Did it suggest relevant and safe troubleshooting steps? Did it know when to escalate a critical issue to a human expert? A response can be factually correct but unhelpful, or it can fail by not escalating a serious issue.
* For context: I’m using the IT Helpdesk scenario as a stand-in for my actual use case so I can discuss the methodology openly. The analogy isn’t perfect, so some examples might feel a bit stretched to make a specific point.
2. The Eval Pipeline
Now that we understand our use case, let’s proceed with an overview of the proposed evaluation pipeline. In the following sections, we will detail each step and contextualize it with examples relevant to our use case.
The Data
It all starts with data – ideally, real data from your production environment. If you don’t have it yet, you can try using your application yourself or ask friends to use it to get a sense of how it can fail. In some cases, it’s possible to generate synthetic data to get things started, or to augment existing data if your volume is low.
When using synthetic data, make sure it is of high quality and closely matches the characteristics of real-world data.
While LLMs are relatively new, humans have been studying, training, and certifying themselves for quite a while. If possible, try to leverage existing material designed for humans to help you generate data for your application.
In Practice
My initial dataset was small, containing a handful of real user tickets from production and a few demonstration examples created by a domain expert to cover common scenarios.
Since I didn’t have many examples, I used existing certification exams for IT support professionals, which consist of multiple-choice questions with an answer guide and scoring keys. This way, I not only had the correct answer but also a detailed explanation of why each choice was right or wrong.
I used an LLM to transform these exam questions into a more useful format. Each question became a simulated user ticket, and the answer keys and explanations were repurposed to generate examples of both effective and ineffective AI responses, complete with a clear rationale for each.
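To make this concrete, here is a minimal sketch of that transformation step. It assumes a hypothetical call_llm helper that wraps whichever model client you use and returns text; the prompt wording and JSON keys are invented for illustration, not taken from my actual pipeline.

import json

def exam_item_to_eval_example(question: str, options: dict, answer_key: str,
                              explanation: str, call_llm) -> dict:
    # Build a prompt that rewrites one multiple-choice exam item as a simulated
    # helpdesk ticket plus good/bad reference responses with rationales.
    prompt = (
        "Rewrite the following IT certification exam question as a realistic helpdesk "
        "ticket written by a non-technical employee. Then, using the answer key and "
        "explanation, write one GOOD assistant response and one BAD assistant response, "
        "each with a short rationale.\n\n"
        f"Question: {question}\n"
        f"Options: {json.dumps(options)}\n"
        f"Correct answer: {answer_key}\n"
        f"Explanation: {explanation}\n\n"
        'Return JSON with keys: "ticket", "good_response", "bad_response", "rationale".'
    )
    # call_llm is a stand-in for your actual model client; it should return raw text.
    return json.loads(call_llm(prompt))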
When using external sources, it’s important to be mindful of data contamination. If the certification material is publicly available, it may have already been included in the training data of the foundation model. This could cause you to evaluate the model’s memory instead of its ability to reason about new, unseen problems, which may yield overly optimistic or misleading results. If the model’s performance on this data seems surprisingly perfect, or if its outputs closely match the source text, chances are contamination is involved.
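One cheap, admittedly crude way to spot the “outputs closely match the source text” case is a lexical-overlap check; the 0.8 threshold below is an arbitrary illustration, not a recommendation.

import difflib

def overlap_ratio(model_output: str, source_text: str) -> float:
    # Rough lexical similarity between a model output and the original exam text;
    # values close to 1.0 suggest recitation of memorized material rather than reasoning.
    return difflib.SequenceMatcher(None, model_output.lower(), source_text.lower()).ratio()

# Flag generated examples that overlap heavily with their source (0.8 is an arbitrary cutoff):
# suspicious = [ex for ex in examples if overlap_ratio(ex["output"], ex["source"]) > 0.8]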
Data Annotation
Now that you have gathered some data, the next crucial step is analyzing it. This process should be active, so make sure to note your insights as you go. There are many ways to categorize or divide the different tasks involved in data annotation. I typically think of it in two main parts:
- Error Analysis: Reviewing existing (often imperfect) outputs to identify failures. For example, you might add free-text notes explaining the failures or tag inadequate responses with different error categories. You can find a much more detailed explanation of error analysis on Hamel Husain’s blog.
- Success Definition: Creating ideal artifacts to define what success looks like. For example, for each output, you might write ground-truth reference answers or develop a rubric with guidelines that specify what an ideal answer should include.
The main goal is to gain a clearer understanding of your data and application. Error analysis helps identify the primary failure modes your application faces, enabling you to address the underlying issues. Meanwhile, defining success lets you establish the right criteria and metrics for accurately assessing your model’s performance.
Don’t worry if you’re unsure about exactly what to record. It’s better to start with open-ended notes and unstructured annotations rather than stressing over the perfect format. Over time, you’ll notice the key aspects to evaluate and common failure patterns naturally emerge.
In Practice
I decided to approach this by first building a custom tool designed specifically for data annotation, which lets me scan through production data, add notes, and write reference answers, as previously discussed. I found this to be a relatively fast process because such a tool can be built somewhat independently of the main application. Since it’s a tool for personal use and of limited scope, I was able to “vibe-code” it with less concern than I would have in usual settings. Of course, I’d still review the code, but I wasn’t too concerned if things broke now and then.
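For reference, the records such a tool works with can be as simple as the sketch below; this is an illustrative schema with invented field names, not the exact one my tool uses.

from dataclasses import dataclass, field

@dataclass
class AnnotationRecord:
    # One annotated example: the original ticket, what the application answered,
    # and everything added during error analysis and success definition.
    ticket_id: str
    ticket_text: str
    model_output: str
    notes: str = ""                                        # free-text error-analysis observations
    error_tags: list[str] = field(default_factory=list)    # e.g. ["over-referral", "wrong-diagnosis"]
    reference_answer: str = ""                             # ground-truth answer written by the annotator
    rubric: dict = field(default_factory=dict)             # per-example rubric fields (see example below)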
To me, the most important outcome of this process is that I gradually learned what makes a bad response bad and what makes a good response good. With that, you can define evaluation metrics that effectively measure what matters for your use case. For example, I realized my solution exhibited “over-referral” behavior, meaning it escalated simple requests to human specialists. Other issues, to a lesser extent, included inaccurate troubleshooting steps and incorrect root-cause diagnoses.
Writing Rubrics
In the success definition step, I found that writing rubrics was very helpful. My guideline for creating the rubrics was to ask myself: what makes an ideal response a good response? This helps reduce the subjectivity of the evaluation process – no matter how the response is phrased, it should tick all the boxes in the rubric.
Since this is the initial stage of your evaluation process, you won’t know all the general criteria beforehand, so I would define the requirements on a per-example basis rather than trying to establish a single guideline for all examples. I also didn’t worry too much about setting a rigid schema. Each criterion in my rubric must have a key and a value, and that value can be a boolean, a string, or a list of strings. The rubrics can be flexible because they’re intended to be used by either a human or an LLM judge, and both can deal with this subjectivity. Also, as mentioned before, as you proceed with this process, the ideal rubric guidelines will naturally stabilize.
Here’s an example:
{
"fields": {
"clarifying_questions": {
"type": "array",
"value": [
"Asks for the specific error message",
"Asks if the user recently changed their password"
]
},
"root_cause_diagnosis": {
"type": "string",
"value": "Expired user credentials or MFA token sync issue"
},
"escalation_required": {
"type": "boolean",
"value": false
},
"recommended_solution_steps": {
"type": "array",
"value": [
"Guide user to reset their company password",
"Instruct user to re-sync their MFA device"
]
}
}
}
Although each example’s rubric may differ from the others, we can group them into well-defined evaluation criteria for the next step.
Running the Evaluations
With annotated data in hand, you can build a repeatable evaluation process. The first step is to curate a subset of your annotated examples to create a versioned evaluation dataset. This dataset should contain representative examples that cover your application’s common use cases and all the failure modes you have identified. Versioning is critical; when comparing different experiments, you want to ensure they’re benchmarked against the same data.
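Versioning doesn’t need heavy tooling at this stage. A minimal sketch, assuming JSONL files on disk and a content hash as the version fingerprint (file layout and function name are my own, purely illustrative):

import hashlib
import json
import pathlib

def save_eval_set(examples: list[dict], version: str, out_dir: str = "eval_sets") -> str:
    # Freeze one version of the evaluation set as JSONL and store a content hash,
    # so every experiment can state exactly which data it was benchmarked against.
    path = pathlib.Path(out_dir) / f"eval_{version}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = "\n".join(json.dumps(ex, sort_keys=True) for ex in examples)
    path.write_text(payload)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    path.with_suffix(".sha256").write_text(digest)
    return f"{path} ({digest})"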
For subjective tasks like ours, where outputs are free-form text, an “LLM-as-a-judge” can automate the grading process. The evaluation pipeline feeds the LLM judge an input from your dataset, the AI application’s corresponding output, and the annotations you created (such as the reference answer and rubric). The judge’s role is to score the output against the provided criteria, turning a subjective assessment into quantifiable metrics.
These metrics let you systematically measure the impact of any change, whether it’s a new prompt, a different model, or a change in your RAG strategy. To ensure these metrics are meaningful, it is crucial to periodically confirm that the LLM judge’s evaluations align with those of a human domain expert within an acceptable range.
In Practice
After completing the data annotation process, we should have a clearer understanding of what makes a response good or bad and, with that knowledge, can establish a core set of evaluation dimensions. In my case, I identified the following areas:
- Escalation Behavior: Measures whether the AI escalates tickets appropriately. A response is rated as ADEQUATE, OVER-ESCALATION (escalating simple issues), or UNDER-ESCALATION (failing to escalate critical problems).
- Root Cause Accuracy: Assesses whether the AI correctly identifies the user’s problem. This is a binary CORRECT or INCORRECT evaluation.
- Solution Quality: Evaluates the relevance and safety of the proposed troubleshooting steps. It also considers whether the AI asks for necessary clarifying information before offering a solution. It is rated ADEQUATE or INADEQUATE.
With these dimensions defined, I could run evaluations. For each item in my versioned evaluation set, the system generates a response. This response, along with the original ticket and its annotated rubric, is then passed to an LLM judge. The judge receives a prompt that instructs it on how to use the rubric to score the response across the three dimensions.
This is the prompt I used for the LLM judge:
You are an expert IT Support AI evaluator. Your task is to assess the quality of an AI-generated response to an IT helpdesk ticket. To do so, you will be given the ticket details, a reference answer from a senior IT specialist, and a rubric with evaluation criteria.
**TICKET DETAILS:**
#{ticket_details}
**REFERENCE ANSWER (from IT Specialist):**
#{reference_answer}
**NEW AI RESPONSE (to be evaluated):**
#{new_ai_response}
**RUBRIC CRITERIA:**
#{rubric_criteria}
**EVALUATION INSTRUCTIONS:**
[Evaluation instructions here...]
**Evaluation Dimensions**
Evaluate the AI response on the following dimensions:
- Overall Judgment: GOOD/BAD
- Escalation Behavior: If the rubric's `escalation_required` is `false` but the AI escalates, label it as `OVER-ESCALATION`. If `escalation_required` is `true` but the AI does not escalate, label it `UNDER-ESCALATION`. Otherwise, label it `ADEQUATE`.
- Root Cause Accuracy: Compare the AI's diagnosis with the `root_cause_diagnosis` field in the rubric. Label it `CORRECT` or `INCORRECT`.
- Solution Quality: If the AI's response fails to include the necessary `recommended_solution_steps` or `clarifying_questions` from the rubric, or suggests something unsafe, label it as `INADEQUATE`. Otherwise, label it as `ADEQUATE`.
If the rubric does not provide enough information to evaluate a dimension, use the reference answer and your expert judgment.
**Please provide:**
1. An overall judgment (GOOD/BAD)
2. A detailed explanation of your reasoning
3. The escalation behavior (`OVER-ESCALATION`, `ADEQUATE`, `UNDER-ESCALATION`)
4. The root cause accuracy (`CORRECT`, `INCORRECT`)
5. The solution quality (`ADEQUATE`, `INADEQUATE`)
**Response Format**
Provide your response in the following JSON format:
{
"JUDGMENT": "GOOD/BAD",
"REASONING": "Detailed explanation",
"ESCALATION_BEHAVIOR": "OVER-ESCALATION/ADEQUATE/UNDER-ESCALATION",
"ROOT_CAUSE_ACCURACY": "CORRECT/INCORRECT",
"SOLUTION_QUALITY": "ADEQUATE/INADEQUATE"
}
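Putting it together, the harness that runs the judge can be quite small. Below is a rough sketch under stated assumptions: JUDGE_PROMPT_TEMPLATE holds the prompt above with its #{...} placeholders, and call_llm is a hypothetical wrapper around whatever model client you use.

import json
from collections import Counter

# JUDGE_PROMPT_TEMPLATE: the judge prompt shown above, loaded from wherever you store it.

def judge_example(example: dict, app_response: str, call_llm) -> dict:
    # Fill the judge prompt with one evaluation example and parse the JSON verdict.
    prompt = (JUDGE_PROMPT_TEMPLATE
              .replace("#{ticket_details}", example["ticket"])
              .replace("#{reference_answer}", example["reference_answer"])
              .replace("#{new_ai_response}", app_response)
              .replace("#{rubric_criteria}", json.dumps(example["rubric"])))
    return json.loads(call_llm(prompt))

def aggregate(verdicts: list[dict]) -> dict:
    # Turn individual verdicts into the dimension-level metrics discussed above.
    return {
        "good_rate": sum(v["JUDGMENT"] == "GOOD" for v in verdicts) / len(verdicts),
        "escalation_behavior": Counter(v["ESCALATION_BEHAVIOR"] for v in verdicts),
        "root_cause_accuracy": Counter(v["ROOT_CAUSE_ACCURACY"] for v in verdicts),
        "solution_quality": Counter(v["SOLUTION_QUALITY"] for v in verdicts),
    }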
3. What Lies Ahead
Our application is starting out simple, and so is our evaluation pipeline. As the system expands, we’ll need to adjust how we measure its performance. This means we’ll have to consider several aspects down the road. Some key ones include:
How many examples are enough?
I started with about 50 examples, but I haven’t analyzed how close that is to an ideal number. Ideally, we want enough examples to produce reliable results while keeping the cost of running them affordable. In Chip Huyen’s AI Engineering book, there’s a mention of an interesting approach that involves creating bootstraps of your evaluation set. For example, from my original 50-sample set, I could create multiple bootstraps by drawing 50 samples with replacement, then evaluate and compare performance across these bootstraps. If you observe very different results, it probably means you need more examples in your evaluation set.
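A minimal sketch of that bootstrap check, assuming each example’s result has already been reduced to a numeric score (e.g. 1.0 for GOOD, 0.0 for BAD):

import random
import statistics

def bootstrap_means(scores: list[float], n_bootstraps: int = 20) -> list[float]:
    # Resample the per-example scores with replacement and return the mean of each
    # bootstrap sample; a wide spread suggests the evaluation set is too small.
    n = len(scores)
    return [statistics.mean(random.choices(scores, k=n)) for _ in range(n_bootstraps)]

# Example usage:
# means = bootstrap_means(scores)
# spread = max(means) - min(means)   # large spread -> results are not yet reliable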
Regarding error analysis, we can also apply a helpful rule of thumb from Husain’s blog:
Keep iterating on more traces until you reach theoretical saturation, meaning new traces don’t seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.
Aligning LLM Judges with Human Experts
We want our LLM judges to stay as consistent as possible, but this is difficult because the judgment prompts will be revised, the underlying model can change or be updated by the provider, and so on. Moreover, your evaluation criteria will improve over time as you grade outputs, so it’s crucial to always ensure your LLM judges stay aligned with your judgment or that of your domain experts. You can schedule regular meetings with the domain expert to review a sample of LLM judgments, calculate a simple agreement percentage between automated and human evaluations, and, of course, adjust your pipeline when necessary.
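The agreement check itself can start as plain percent agreement on matched labels; a sketch (more robust statistics such as Cohen’s kappa can come later):

def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    # Fraction of samples where the LLM judge and the human expert gave the same label.
    # Run it per dimension (e.g. ESCALATION_BEHAVIOR) on the jointly reviewed sample.
    if len(judge_labels) != len(human_labels):
        raise ValueError("Judge and human label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Example: agreement_rate(["GOOD", "BAD", "GOOD"], ["GOOD", "GOOD", "GOOD"]) -> 0.67 (approximately)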
Overfitting
Overfitting is still a thing in the LLM world. Even if we’re not training a model directly, we’re still tuning our system by tweaking instruction prompts, refining retrieval systems, setting parameters, and doing context engineering. If our changes are based on evaluation results, there’s a risk of over-optimizing for our current set, so we still need to follow standard advice to prevent overfitting, such as using held-out sets.
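In practice, that can start as a single split of the annotated examples, with the held-out portion consulted only occasionally; a minimal sketch (the fraction and seed are arbitrary choices):

import random

def split_eval_set(examples: list[dict], holdout_fraction: float = 0.2, seed: int = 42):
    # Development set: used freely while iterating on prompts, retrieval, parameters.
    # Held-out set: checked only occasionally, to catch over-optimization on the dev set.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]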
Increased Complexity
For now, I’m keeping this application simple, so we have fewer components to evaluate. As our solution becomes more complex, our evaluation pipeline will also grow more complex. If our application involves multi-turn conversations with memory, or different tool usage or context retrieval systems, we should break the system down into multiple tasks and evaluate each component individually. So far, I’ve been using simple input/output pairs for evaluation, so retrieving data directly from my database is sufficient. However, as our system evolves, we’ll likely need to track the full chain of events for a single request. This involves adopting solutions for logging LLM traces, such as platforms like Arize, HoneyHive, or LangFuse.
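Conceptually, a trace is just a list of timestamped events sharing a request ID. The sketch below shows that general shape only, with invented field names; it is not the data model of any of the platforms mentioned above.

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    # One step in handling a single request: a retrieval call, an LLM call, a tool call...
    trace_id: str                 # shared by all events belonging to the same request
    name: str                     # e.g. "retrieve_kb_articles", "generate_reply"
    inputs: dict
    outputs: dict
    started_at: float = field(default_factory=time.time)

def new_trace_id() -> str:
    # One ID per incoming ticket, so the full chain of events can be reconstructed later.
    return uuid.uuid4().hex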
Continuous Iteration and Data Drift
Production environments are always changing. User expectations evolve, usage patterns shift, and new failure modes arise. An evaluation set created today may no longer be representative in six months. This drift requires ongoing data annotation to ensure the evaluation set always reflects the current state of how the application is used and where it falls short.
4. Conclusion
In this post, we covered some key concepts for building a foundation for evaluating an LLM-based application, along with practical details for our use case. We started with a small, mixed-source dataset and gradually developed a repeatable measurement system. The main steps involved actively annotating data, analyzing errors, and defining success using rubrics, which helped us turn a subjective problem into measurable dimensions. After annotating our data and gaining a better understanding of it, we used an LLM as a judge to automate scoring and create a feedback loop for continuous improvement.
Although the pipeline outlined here is a starting point, the next steps involve addressing challenges such as data drift, judge alignment, and increasing system complexity. By putting in the effort to understand and organize your evaluation data, you’ll gain the clarity needed to iterate effectively and develop a more reliable application.