New LLMs are being released almost weekly. Some recent releases include the Qwen3 coding models, GPT-5, and Grok 4, all of which claim top scores on some benchmark. Common benchmarks are Humanity's Last Exam, SWE-bench, the IMO, and so forth.
However, these benchmarks have an inherent flaw: the companies releasing new frontier models are strongly incentivized to optimize their models for performance on them. The reason is that these well-known benchmarks essentially set the standard for what is considered a breakthrough LLM.
Luckily, there is a straightforward solution to this problem: develop your own internal benchmark and test each LLM on it, which is what I'll be discussing in this article.
Table of Contents
You can also learn about How to Benchmark LLMs – ARC AGI 3, or you can read about ensuring reliability in LLM applications.
Motivation
My motivation for this article is that new LLMs are released rapidly. It's difficult to stay up to date on all the advances within the LLM space, so you end up having to trust benchmarks and online opinions to figure out which models are best. However, this is a seriously flawed way of judging which LLMs you should use, either day-to-day or in an application you're developing.
Benchmarks have the flaw that frontier model developers are incentivized to optimize their models for them, which makes benchmark performance an unreliable signal. Online opinions have their own problems, because other people may have very different use cases than you. Thus, you should develop an internal benchmark to properly test newly released LLMs and figure out which ones work best for your specific use case.
How to develop an internal benchmark
There are many approaches to developing your own internal benchmark. The main point is that your benchmark should not be a very common task that LLMs already perform well (generating summaries, for instance, doesn't work). Moreover, your benchmark should preferably use internal data that is not available online.
You should keep a few main points in mind when developing an internal benchmark:
- It should be a task that is either unusual (so the LLMs are not specifically trained on it), or it should use data that is not available online
- It should be as automated as possible. You don't have time to test each new release manually
- It should produce a numeric score so you can rank different models against one another
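To make these points concrete, here is a minimal sketch of what such a benchmark can reduce to: every model is run over the same cases and ends up with one numeric score. The names (`run_benchmark`, `BenchmarkCase`, and so on) are placeholders of my own, not a prescribed design.

```python
from typing import Callable

# A benchmark case can be as simple as a prompt plus the expected answer.
BenchmarkCase = dict  # e.g. {"prompt": "...", "expected": "..."}

def run_benchmark(
    run_model: Callable[[str], str],                    # prompt in -> raw model text out
    cases: list[BenchmarkCase],
    score_case: Callable[[str, BenchmarkCase], float],  # per-case score in [0, 1]
) -> float:
    """Reduce a model to a single number so different models can be ranked."""
    scores = [score_case(run_model(case["prompt"]), case) for case in cases]
    return sum(scores) / len(scores)
```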
Types of tasks
Internal benchmarks can look very different from one another. Given some use cases, here are some example benchmarks you could develop:
Use case: Development in a rarely used programming language.
Benchmark: Have the LLM zero-shot a specific application like Solitaire (this is inspired by how Fireship benchmarks LLMs by developing a Svelte application)
Use case: Internal question-answering chatbot
Benchmark: Gather a series of prompts from your application (preferably actual user prompts), along with their desired responses, and see which LLM gets closest to the desired responses.
Use case: Classification
Benchmark: Create a dataset of input-output examples. For this benchmark, the input can be a text and the output a specific label, as in a sentiment analysis dataset. Evaluation is simple in this case, since you want the LLM output to exactly match the ground-truth label.
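For the classification use case, the scoring function can be as simple as an exact match against the ground-truth label. A minimal sketch, with made-up sentiment examples, that plugs into the `run_benchmark` sketch above:

```python
# Exact-match scoring for a classification benchmark.
# The prompts and labels below are illustrative, not real data.

cases = [
    {"prompt": "Classify the sentiment as positive or negative: 'I love this product!'",
     "expected": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: 'Terrible support, never again.'",
     "expected": "negative"},
]

def exact_match(model_output: str, case: dict) -> float:
    """1.0 if the normalized model output equals the ground-truth label, else 0.0."""
    return float(model_output.strip().lower() == case["expected"].lower())
```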
Ensuring the task is automated
After determining which task you want to create an internal benchmark for, it's time to develop the task. When developing it, you have to make sure the task runs as automatically as possible. If you had to perform a lot of manual work for every new model release, it would be impossible to maintain this internal benchmark.
I thus recommend creating a standard interface for your benchmark, where the only thing you need to change per new model is to add a function that takes in the prompt and outputs the raw model text response. The rest of your application can then remain static when new models are released.
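One way to set this up, sketched below, is a registry with one function per model, all sharing the same signature (prompt in, raw text out). The OpenAI client and the model names are examples of my own choosing; swap in whatever providers and versions you actually test.

```python
# One function per model: each takes a prompt and returns the raw text response.
# Only this registry changes when a new model is released; the rest of the
# benchmark code stays static. Model names here are examples only.

from typing import Callable

from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt_5(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-5",  # example model name; pin the exact version you test
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

MODELS: dict[str, Callable[[str], str]] = {
    "gpt-5": gpt_5,
    # "qwen3-coder": ...,  # add one entry per new release you want to test
}

# The harness then just loops over the registry:
# for name, run_model in MODELS.items():
#     print(name, run_benchmark(run_model, cases, exact_match))
```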
To keep evaluation as hands-off as possible, I recommend running automated evaluations. I recently wrote an article about How to Perform Comprehensive Large Scale LLM Validation, where you can learn more about automated validation and evaluation. The main highlights are that you can either run a function to verify correctness or use an LLM as a judge.
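An LLM-as-a-judge evaluation then slots in as just another scoring function. A rough sketch, reusing the `openai_client` from the previous snippet; the judge model and prompt wording are my own illustrative choices:

```python
# LLM-as-a-judge: ask a fixed judge model whether the candidate answer
# matches the desired answer. Returns 1.0 for PASS, 0.0 for FAIL.

def llm_judge(model_output: str, case: dict) -> float:
    judge_prompt = (
        "You are grading an answer.\n"
        f"Question: {case['prompt']}\n"
        f"Desired answer: {case['expected']}\n"
        f"Candidate answer: {model_output}\n"
        "Reply with exactly one word: PASS or FAIL."
    )
    verdict = openai_client.chat.completions.create(
        model="gpt-5",  # keep the judge model fixed across benchmark runs
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content
    return float("PASS" in verdict.upper())
```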
Testing on your internal benchmark
Now that you've developed your internal benchmark, it's time to test some LLMs on it. I recommend at the very least testing models from all the closed-source frontier model developers, such as
However, I also highly recommend testing open-source releases, for instance with
In general, whenever a new model makes a splash (for instance, when DeepSeek released R1), I recommend running it on your benchmark. And since you made sure to make your benchmark as automated as possible, the cost of trying out new models is low.
I also recommend paying attention to new model version releases. For instance, Qwen initially released their Qwen 3 model. Some time later, they updated it with Qwen-3-2507, which is said to be an improvement over the baseline Qwen 3 model. You should make sure to stay up to date on such (smaller) model releases as well.
My final point on running the benchmark is that you should run it regularly. The reason is that models can change over time. For instance, if you're using OpenAI and not locking the model version, you can experience changes in outputs. It's thus important to run benchmarks regularly, even on models you've already tested. This applies especially if you have such a model running in production, where maintaining high-quality outputs is critical.
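Two small habits make these recurring runs comparable over time: pin exact model snapshots instead of floating aliases, and log each run with a timestamp so drift is visible. A sketch of the idea (the snapshot identifier is illustrative; check your provider's documentation for current names):

```python
import csv
import datetime

# Pin dated snapshots rather than floating aliases, so a score from last month
# and a score from today refer to the same underlying model.
PINNED_MODELS = {
    "gpt-4o": "gpt-4o-2024-08-06",  # illustrative snapshot name
}

def log_result(model_name: str, score: float, path: str = "benchmark_log.csv") -> None:
    """Append a timestamped score so changes over time show up in the log."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), model_name, score])
```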
Avoiding contamination
When using an internal benchmark, it's incredibly important to avoid contamination, for instance by having some of the data available online. The reason is that today's frontier models have essentially scraped the entire web, so the models have seen all of that data. If your data is available online (especially if the answers to your benchmark are available), you have a contamination issue at hand, and the model probably knows the data from its pre-training.
Spend as little time as possible
Think of this task as staying up to date on model releases. Yes, it's a super important part of your job; however, it's a part you can spend little time on and still get a lot of value from. I thus recommend minimizing the time you spend on these benchmarks. Whenever a new frontier model is released, you test the model against your benchmark and verify the results. If the new model achieves vastly improved results, you should consider changing models in your application or day-to-day life. However, if you only see a small incremental improvement, you should probably wait for more model releases. Keep in mind that whether you should change models depends on factors such as:
- How much time it takes to change models
- The cost difference between the old and the new model
- Latency
- …
Conclusion
In this article, I have discussed how you can develop an internal benchmark for testing all the LLM releases happening lately. Staying up to date on the best LLMs is difficult, especially when it comes to testing which LLM works best for your use case. Developing internal benchmarks makes this testing process a lot faster, which is why I highly recommend it as a way to stay up to date on LLMs.
👉 Find me on socials:
🧑💻 Get in contact
✍️ Medium
Or read my other articles: