Perform Comprehensive Large-Scale LLM Validation


Validation and evaluation are critical to ensuring robust, high-performing LLM applications. However, these topics are often ignored in the greater scheme of LLMs.

Imagine this scenario: you have an LLM prompt that responds appropriately 999 out of 1,000 times. However, you have to run a backfill on 1.5 million items to populate the database. In this (very realistic) scenario, you'll experience 1,500 errors for this LLM prompt alone. Now scale this up to tens, if not hundreds, of different prompts, and you have a real scalability issue at hand.

The solution is to validate your LLM output and ensure high performance using evaluations, both of which are topics I'll discuss in this article.

This infographic highlights the main contents of this article. I'll be discussing validation and evaluation of LLM outputs, qualitative vs quantitative scoring, and dealing with large-scale LLM applications. Image by ChatGPT.


What’s LLM validation and evaluation?

I feel it’s essential to start out by defining what LLM validation and evaluation are, and why they’re vital in your application.

LLM validation is about validating the quality of your outputs. One common example is running a piece of code that checks whether the LLM response answered the user's question. Validation is important because it ensures you're providing high-quality responses and your LLM is performing as expected. Validation can be seen as something you do in real time, on individual responses. For example, before returning the response to the user, you verify that the response is actually of high quality.

LLM evaluation is similar; however, it usually doesn't happen in real time. Evaluating your LLM output could, for example, involve looking at all the user queries from the last 30 days and quantitatively assessing how well your LLM performed.

Validating and evaluating your LLM's performance is important because you will experience issues with the LLM output. These could, for instance, be:

  • Issues with the input data (for example, missing data)
  • An edge case your prompt isn't equipped to handle
  • Data that is out of distribution
  • Etc.

Thus, you need a robust solution for handling LLM output issues. You want to make sure you avoid them as often as possible and handle them gracefully in the remaining cases.

Murphy’s law adapted to this scenario:

On a big scale, all the things that may go flawed, will go flawed

Qualitative vs quantitative assessments

Before moving on to the individual sections on performing validation and evaluations, I also want to comment on qualitative vs quantitative assessments of LLMs. When working with LLMs, it's often tempting to manually evaluate the LLM's performance on different prompts. However, such manual (qualitative) assessments are highly subject to bias. For example, you might focus most of your attention on the cases in which the LLM succeeded, and thus overestimate its performance. Keeping these potential biases in mind when working with LLMs is important to mitigate the risk of them influencing your ability to improve the model.

Large-scale LLM output validation

After running millions of LLM calls, I've seen a lot of different outputs, such as GPT-4o returning … or Qwen2.5 responding with unexpected Chinese characters in its output.

These errors are incredibly difficult to detect with manual inspection because they typically occur in fewer than 1 out of 1,000 API calls to the LLM. Nevertheless, you need a mechanism to catch these issues when they occur in real time, at a large scale. Thus, I'll discuss some approaches to handling these issues.

Simple if-else statement

The simplest solution for validation is to have some code that uses a plain if statement to check the LLM output. For example, if you want to generate summaries for documents, you should ensure the LLM output is at least above some minimal length:

# LLM summary validation

# First, generate a summary through an LLM client such as OpenAI, Anthropic, Mistral, etc.
summary = llm_client.chat(f"Make a summary of this document: {document}")

# Validate the summary
def validate_summary(summary: str) -> bool:
    # Reject summaries shorter than a minimal length (here, 20 characters)
    if len(summary) < 20:
        return False
    return True

You can then run the validation:

  • If the validation passes, you can proceed as usual
  • If it fails, you can choose to ignore the request or use a retry mechanism (see the sketch below)
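As a minimal sketch (assuming the same hypothetical llm_client and the validate_summary function from above), such a retry flow could look like this:

# Retry sketch: regenerate the summary if validation fails (max_retries is an assumption)
def summarize_with_validation(document: str, max_retries: int = 2) -> str | None:
    for _ in range(max_retries + 1):
        summary = llm_client.chat(f"Make a summary of this document: {document}")
        if validate_summary(summary):
            return summary
    # All attempts failed validation: ignore the request (or log it for later inspection)
    return None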

You can, of course, make the validate_summary function more elaborate (see the sketch after this list), for example:

  • Using regular expressions for complex string matching
  • Using a library such as tiktoken to count the number of tokens in the response
  • Ensuring specific words are present (or absent) in the response
  • Etc.
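Here is a sketch of such a more elaborate validator. The regular expression, the token limit, the banned phrase, and the model name passed to tiktoken are all assumptions for this summary example:

import re
import tiktoken

def validate_summary_strict(summary: str) -> bool:
    # Reject summaries that are too short
    if len(summary) < 20:
        return False
    # Complex string matching with a regular expression (pattern is an assumption)
    if re.search(r"(?i)as an ai language model", summary):
        return False
    # Count tokens with tiktoken and reject overly long summaries (limit is an assumption)
    encoding = tiktoken.encoding_for_model("gpt-4o")
    if len(encoding.encode(summary)) > 512:
        return False
    # Ensure specific words are absent from the response (word is an assumption)
    if "lorem ipsum" in summary.lower():
        return False
    return True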

LLM as a validator

This diagram highlights the flow of an LLM application using an LLM as a validator. You first input the prompt, which here is to create a summary of a document. The LLM creates the summary and sends it to an LLM validator. If the summary is valid, we return the response. However, if the summary is invalid, we can either ignore the request or retry it. Image by the author.

A more advanced and more expensive approach to validation is using an LLM. In this case, you use another LLM to judge whether the output is valid. This works because validating correctness is usually a more straightforward task than generating a correct response. Using an LLM validator is essentially using LLM as a judge, a topic I have covered in another Towards Data Science article.

I often use smaller LLMs to perform this validation task because they have faster response times, cost less, and still work well, considering that validating is a simpler task than generating a correct response. For example, if I use GPT-4.1 to generate a summary, I would consider GPT-4.1-mini or GPT-4.1-nano to judge the validity of the generated summary.

Again, if the validation succeeds, you continue your application flow; if it fails, you can ignore the request or choose to retry it.

In the case of validating the summary, I would prompt the validating LLM to look for summaries that (see the sketch after this list):

  • Are too short
  • Don't adhere to the expected answer format (for example, Markdown)
  • Violate any other rules you have for the generated summaries
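A minimal sketch of such a validator is shown below. The small_llm_client, the prompt wording, and the VALID/INVALID convention are assumptions; in practice, this would be a smaller model such as GPT-4.1-mini:

# LLM-as-a-validator sketch: a smaller (hypothetical) model returns a binary verdict
def validate_summary_with_llm(document: str, summary: str) -> bool:
    prompt = (
        "You are validating a generated document summary.\n"
        "Reply with exactly VALID or INVALID.\n"
        "Mark it INVALID if it is too short, does not follow the expected "
        "Markdown format, or does not reflect the document.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}"
    )
    verdict = small_llm_client.chat(prompt)  # e.g. GPT-4.1-mini or GPT-4.1-nano
    return verdict.strip().upper().startswith("VALID")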

Quantitative LLM evaluations

It is also very important to perform large-scale evaluations of LLM outputs. I recommend running these either continuously or at regular intervals. Quantitative LLM evaluations are also more effective when combined with qualitative assessments of data samples. For example, suppose the evaluation metrics highlight that your generated summaries are longer than what users prefer. In that case, you should manually look into those generated summaries and the documents they're based on. This helps you understand the underlying problem, which in turn makes solving it easier.

LLM as a judge

As with validation, you can use LLM as a judge for evaluation. The difference is that while validation uses LLM as a judge for binary predictions (either the output is valid, or it isn't), evaluation uses it for more detailed feedback. You can, for example, receive feedback from the LLM judge on the quality of a summary on a scale from 1 to 10, making it easier to distinguish medium-quality summaries (around 4-6) from high-quality summaries (7+).

Again, you have to consider costs when using LLM as a judge. Even though you might be using smaller models, you are essentially doubling the number of LLM calls. You can therefore consider the following changes to save on costs (a combined sketch follows the list):

  • Sampling data points, so you only run LLM as a judge on a subset of the data
  • Grouping several data points into one LLM-as-a-judge prompt, to save on input and output tokens
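Here is a sketch combining both ideas, reusing the hypothetical small_llm_client from earlier; the sample size and the batch prompt format are assumptions:

import random

# Judge a random sample of summaries in a single batched prompt to save on cost
def judge_sampled_batch(summaries: list[str], sample_size: int = 20) -> str:
    sampled = random.sample(summaries, min(sample_size, len(summaries)))
    numbered = "\n\n".join(f"Summary {i + 1}:\n{s}" for i, s in enumerate(sampled))
    prompt = (
        "Score each summary below from 1 to 10 for quality. "
        "Return one line per summary, formatted as '<number>: <score>'.\n\n"
        f"{numbered}"
    )
    # Returns the raw judge output; parse the scores downstream
    return small_llm_client.chat(prompt)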

I recommend detailing the judging criteria to the LLM judge. For example, you should state what constitutes a score of 1, a score of 5, and a score of 10. Using examples is often a good way of instructing LLMs, as discussed in my article on using LLM as a judge. I often think about how helpful examples are for me when someone is explaining a topic, and you can imagine how helpful they are for an LLM.
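As an illustration, the judging criteria could be spelled out as in the sketch below and prepended to the judge prompt; the rubric descriptions themselves are assumptions for the summary use case:

# Example rubric for the judge prompt (the criteria are assumptions)
JUDGE_RUBRIC = """
Score the summary from 1 to 10.

Score 1: The summary is unrelated to the document, empty, or in the wrong language.
Score 5: The summary covers some key points, but misses important details or breaks the expected Markdown format.
Score 10: The summary is concise, covers all key points, and follows the expected Markdown format.

Return only the integer score.
"""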

User feedback

User feedback is a great way of obtaining quantitative metrics on your LLM's outputs. User feedback can, for example, be a thumbs-up or thumbs-down button indicating whether the generated summary is satisfactory. If you combine such feedback from hundreds or thousands of users, you have a reliable feedback mechanism you can use to vastly improve the performance of your LLM summary generator!

These users can be your customers, so you should make it easy for them to give feedback and encourage them to give as much of it as possible. However, these users can essentially be anyone who doesn't use or develop your application on a day-to-day basis. It's important to remember that this kind of feedback is incredibly valuable for improving the performance of your LLM, and it doesn't really cost you (as the developer of the application) any time to gather it.
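As a small sketch, such thumbs-up/thumbs-down feedback can be aggregated into a simple satisfaction metric; the feedback record format (a list of booleans) is an assumption:

# Aggregate thumbs-up (True) / thumbs-down (False) feedback into a satisfaction rate
def satisfaction_rate(feedback: list[bool]) -> float:
    if not feedback:
        return 0.0
    return sum(feedback) / len(feedback)

# Example: two thumbs-up and one thumbs-down -> 0.67
print(round(satisfaction_rate([True, True, False]), 2))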

Conclusion

In this article, I have discussed how you can perform large-scale validation and evaluation in your LLM application. Doing this is incredibly important, both to ensure your application performs as expected and to improve it based on user feedback. I recommend incorporating such validation and evaluation flows into your application as soon as possible, given the importance of ensuring that inherently unpredictable LLMs can reliably provide value in your application.

You can also read my articles on Benchmark LLMs with ARC AGI 3 and Effortlessly Extract Receipt Information with OCR and GPT-4o mini.

👉 Find me on socials:

🧑‍💻 Get in contact

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium
