Methods to Use LLMs for Powerful Automatic Evaluations


In this article, I discuss how you can perform automatic evaluations using LLM as a judge. LLMs are widely used today for a variety of applications. However, an often underestimated aspect of LLMs is their use for evaluation. With LLM as a judge, you utilize LLMs to evaluate the quality of an output, whether that is giving it a score between 1 and 10, comparing two outputs, or providing pass/fail feedback. The goal of this article is to offer insights into how you can utilize LLM as a judge for your own application, to make development more effective.

This infographic highlights the contents of my article. Image by ChatGPT.

You can also read my article on Benchmarking LLMs with ARC AGI 3 and check out my website, which contains all my information and articles.


Motivation

My motivation for writing this article is that I work daily on different LLM applications. I have read more and more about using LLM as a judge, and I started digging into the topic. I believe utilizing LLMs for automated evaluations of machine-learning systems is a powerful, and often underestimated, aspect of LLMs.

Using LLM as a judge can save you enormous amounts of time, considering it can automate either part of, or the entire, evaluation process. Evaluations are critical for machine-learning systems to ensure they perform as intended. However, evaluations are also time-consuming, and you thus want to automate them as much as possible.

One powerful example use case for LLM as a judge is in a question-answering system. You can gather a series of input-output examples for two different versions of a prompt. You can then ask the LLM judge to respond with whether the outputs are equal (or whether the latter prompt version's output is better), and thus ensure that changes in your application do not have a negative impact on performance. This can, for example, be used before deploying new prompts.

Definition

I define LLM as a judge as any case where you prompt an LLM to evaluate the output of a system. The system is primarily machine-learning-based, though this is not a requirement. You simply provide the LLM with a set of instructions on how to evaluate the system, including information such as what is important for the evaluation and which evaluation metric should be used. The judge's output can then be processed to either proceed with a deployment or stop it because the quality is deemed too low. This eliminates the time-consuming and inconsistent step of manually reviewing LLM outputs before making changes to your application.
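
To make this concrete, below is a minimal sketch of the core loop. I assume the OpenAI Python SDK here; the instruction text, the model name, and the PASS/FAIL convention are my own placeholder choices, and any chat-completion-style API would work the same way.

```python
# Minimal sketch of an LLM judge call plus a deployment gate.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the instructions and the PASS/FAIL convention are placeholders, not a standard.
from openai import OpenAI

client = OpenAI()

def ask_judge(evaluation_instructions: str, payload: str, model: str = "gpt-4o-mini") -> str:
    """Send the evaluation instructions plus the material to evaluate, return the verdict text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[
            {"role": "system", "content": evaluation_instructions},
            {"role": "user", "content": payload},
        ],
    )
    return response.choices[0].message.content.strip()

# Gate a change: only continue the deployment if the judge answers PASS.
verdict = ask_judge(
    "You evaluate outputs of a question answering system. Answer only PASS or FAIL.",
    "Question: What is the capital of France?\nOutput: Paris is the capital of France.",
)
should_deploy = verdict.upper().startswith("PASS")
```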

LLM as a judge evaluation methods

LLM as a judge can be used for a variety of applications, such as:

  • Question answering systems
  • Classification systems
  • Information extraction systems

Different applications require different evaluation methods, so I will describe three different methods below.

Compare two outputs

Comparing two outputs is a great use of LLM as a judge. With this evaluation method, you compare the outputs of two different models.

The difference between the models can, for instance, be:

  • Different input prompts
  • Different LLMs (e.g., OpenAI GPT-4o vs. Claude Sonnet 4.0)
  • Different embedding models for RAG

You then provide the LLM judge with four items:

  • The input prompt(s)
  • Output from model 1
  • Output from model 2
  • Instructions on how to perform the evaluation

You can then ask the LLM judge to provide one of the following three outputs:

  • Equal (the essence of the outputs is the same)
  • Output 1 (the first model's output is better)
  • Output 2 (the second model's output is better)

You can, for example, use this in the scenario I described earlier, when you want to update the input prompt. You can then ensure that the updated prompt is equal to or better than the previous prompt. If the LLM judge informs you that all test samples are either equal or the new prompt is better, you can likely deploy the update automatically.
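
Below is a sketch of how such a comparison judge could look in practice. I again assume the OpenAI Python SDK; the EQUAL / OUTPUT_1 / OUTPUT_2 labels, the prompt wording, and the tiny test set are placeholder choices of mine.

```python
# Sketch of a comparison judge, assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = """You compare two outputs produced for the same input prompt.
Answer with exactly one word:
EQUAL    - the essence of the outputs is the same
OUTPUT_1 - the first output is better
OUTPUT_2 - the second output is better"""

def compare_outputs(input_prompt: str, output_1: str, output_2: str) -> str:
    payload = (
        f"Input prompt:\n{input_prompt}\n\n"
        f"Output 1:\n{output_1}\n\n"
        f"Output 2:\n{output_2}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": payload},
        ],
    )
    return response.choices[0].message.content.strip()

# Your own evaluation set: inputs with outputs from the old and the new prompt version.
test_samples = [
    {"input": "What is the capital of France?",
     "old_output": "Paris is the capital of France.",
     "new_output": "The capital of France is Paris."},
]

# Deploy the new prompt only if no test sample got worse.
verdicts = [compare_outputs(s["input"], s["old_output"], s["new_output"]) for s in test_samples]
safe_to_deploy = all(v in ("EQUAL", "OUTPUT_2") for v in verdicts)
```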

Scoring outputs

Another evaluation method you can use for LLM as a judge is to give the output a score, for example, between 1 and 10. In this scenario, you have to provide the LLM judge with the following:

  • Instructions for performing the evaluation
  • The input prompt
  • The output

In this evaluation method, it is critical to give clear instructions to the LLM judge, considering that assigning a score is a subjective task. I strongly recommend providing examples of outputs that correspond to a score of 1, a score of 5, and a score of 10. This gives the model different anchors it can use to assign a more accurate score. You can also try using fewer possible scores, for example, only scores of 1, 2, and 3. Fewer options increase the model's accuracy, at the cost of making smaller differences harder to distinguish, due to the lower granularity.
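
To illustrate, here is a sketch of what anchored scoring instructions could look like; the anchor texts are placeholders you would replace with real outputs from your own system.

```python
# Placeholder scoring instructions with anchor examples at scores 1, 5, and 10.
SCORING_INSTRUCTIONS = """Score the output on a scale from 1 to 10 for how well it answers the input prompt.

Anchor examples:
- Score 1: the output ignores the question or is factually wrong.
  Example: "<paste a clearly bad output from your system here>"
- Score 5: the output is partially correct but incomplete or vague.
  Example: "<paste a mediocre output here>"
- Score 10: the output is correct, complete, and concise.
  Example: "<paste an excellent output here>"

Answer with a single integer between 1 and 10 and nothing else."""
```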

The scoring evaluation method is useful for running larger experiments, comparing different prompt versions, models, and so on. You can then use the average score over a larger test set to accurately judge which approach works best.
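
A sketch of such an experiment could look like the following, again assuming the OpenAI Python SDK, the anchored instructions from the sketch above, and a test set of your own with outputs from both prompt versions.

```python
# Sketch of averaging judge scores over a test set to compare two prompt versions.
from statistics import mean

from openai import OpenAI

client = OpenAI()

def score_output(instructions: str, input_prompt: str, output: str,
                 model: str = "gpt-4o-mini") -> int:
    """Ask the judge for a single integer score; assumes the judge follows the output format."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Input prompt:\n{input_prompt}\n\nOutput:\n{output}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

def compare_versions(test_set: list[dict], instructions: str) -> tuple[float, float]:
    """test_set: your own list of dicts with "input", "output_v1", and "output_v2"."""
    avg_v1 = mean(score_output(instructions, s["input"], s["output_v1"]) for s in test_set)
    avg_v2 = mean(score_output(instructions, s["input"], s["output_v2"]) for s in test_set)
    return avg_v1, avg_v2

# e.g. avg_v1, avg_v2 = compare_versions(test_set, SCORING_INSTRUCTIONS)
```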

Pass/fail

Pass or fail is another common evaluation method for LLM as a judge. In this scenario, you ask the LLM judge to either approve or disapprove the output, given a description of what constitutes a pass and what constitutes a fail. Similar to the scoring evaluation, this description is critical to the performance of the LLM judge. Again, I recommend using examples, essentially utilizing few-shot learning to make the LLM judge more accurate. You can read more about few-shot learning in my article on context engineering.

The pass/fail evaluation method is useful for RAG systems, to judge whether a model correctly answered a question. You can, for example, provide the fetched chunks and the output of the model to determine whether the RAG system answers correctly.
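
Below is a sketch of such a pass/fail judge for a RAG system, again assuming the OpenAI Python SDK; the instructions and the few-shot examples are placeholders for your own descriptions and examples.

```python
# Sketch of a pass/fail judge for RAG answers, assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

PASS_FAIL_INSTRUCTIONS = """You judge whether an answer correctly responds to a question,
given the retrieved chunks it was based on.
PASS means the answer addresses the question and is supported by the chunks.
FAIL means the answer is unsupported, incorrect, or does not address the question.

<insert a few of your own PASS and FAIL examples here (few-shot learning)>

Answer with exactly PASS or FAIL."""

def judge_rag_answer(question: str, chunks: list[str], answer: str) -> bool:
    chunk_text = "\n---\n".join(chunks)
    payload = (
        f"Question:\n{question}\n\n"
        f"Retrieved chunks:\n{chunk_text}\n\n"
        f"Answer:\n{answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": PASS_FAIL_INSTRUCTIONS},
            {"role": "user", "content": payload},
        ],
    )
    return response.choices[0].message.content.strip().upper() == "PASS"
```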

Essential notes

Compare with a human evaluator

I have a few important notes regarding LLM as a judge, from working on it myself. The first learning is that while an LLM as a judge system can save you large amounts of time, it can also be unreliable. When implementing the LLM judge, you therefore have to test the system manually, ensuring that it responds similarly to a human evaluator. This should preferably be performed as a blind test. For example, you can set up a series of pass/fail examples and see how often the LLM judge agrees with the human evaluator.
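
A simple way to quantify this is to measure how often the judge agrees with the human labels. Below is a small sketch of that, assuming you have a labelled pass/fail set and a judge function such as the one sketched in the pass/fail section.

```python
# Sketch of measuring agreement between the LLM judge and a human evaluator.
# `labelled_examples` is assumed to be your own list of dicts containing a human
# "label" ("PASS" or "FAIL") plus whatever fields your judge function needs.
from typing import Callable

def agreement_rate(labelled_examples: list[dict],
                   judge_fn: Callable[[str, list[str], str], bool]) -> float:
    """Fraction of examples where the LLM judge matches the human pass/fail label."""
    matches = sum(
        judge_fn(ex["question"], ex["chunks"], ex["answer"]) == (ex["label"] == "PASS")
        for ex in labelled_examples
    )
    return matches / len(labelled_examples)

# e.g. rate = agreement_rate(labelled_examples, judge_rag_answer)
# A low agreement rate means the judge instructions need work before you rely on them.
```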

Cost

Another important note to keep in mind is the cost. The cost of LLM requests is trending downwards, but when developing an LLM as a judge system, you are also performing a lot of requests. I would thus keep this in mind and estimate the cost of the system. For example, if each LLM as a judge run costs 10 USD, and you, on average, perform five such runs a day, you incur a cost of 50 USD per day. You then need to judge whether this is an acceptable price for more effective development, or whether you should reduce the cost of the LLM as a judge system. You can, for example, reduce the cost by using cheaper models (GPT-4o-mini instead of GPT-4o) or by reducing the number of test examples.
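
As a small illustration, the estimate from the paragraph above is simple arithmetic; the numbers below are placeholders for your own measurements.

```python
# Back-of-the-envelope cost estimate with placeholder numbers.
cost_per_run_usd = 10.0  # measured cost of one full judge run over your test set
runs_per_day = 5         # how often you expect to run the evaluation

daily_cost = cost_per_run_usd * runs_per_day  # 50 USD per day
monthly_cost = daily_cost * 30                # roughly 1,500 USD per month
print(f"~{daily_cost:.0f} USD/day, ~{monthly_cost:.0f} USD/month")
```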

Conclusion

In this article, I have discussed how LLM as a judge works and how you can utilize it to make development more effective. LLM as a judge is an often overlooked aspect of LLMs, which can be incredibly powerful, for example, pre-deployment, to ensure your question answering system still works on historical queries.

I discussed different evaluation methods, along with how and when you should utilize them. LLM as a judge is a versatile approach, and you have to adapt it to whichever scenario you are implementing. Lastly, I also discussed some important notes, for example, comparing the LLM judge with a human evaluator.

👉 Find me on socials:

🧑‍💻 Get in contact

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium
