I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms



Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such an open-source model could genuinely change the course of LLMs forever. That is genuinely exciting! And also far too big a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?

If you follow AI releases, you’ve seen this dance before. Every new model drops with graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?

I wanted to learn more about LLM benchmarking.

Part 1: What’s a Benchmark? (in 3 seconds)

TL;DR — The SATs (multiple of them, actually) for LLMs.

Part 1.1: What’s a Benchmark? (in more than 3 seconds)

Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM benchmark.” Because calling them the “SATs for AI” feels both right and slightly oversimplified.

LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying whether a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.

Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.
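
To make the simplest case concrete, here’s a minimal sketch of an exact-match accuracy loop over a toy question/answer set — the data and the model_fn stub are made up for illustration, not taken from any real benchmark:

# A toy benchmark: questions with known reference answers (made-up data).
benchmark = [
    {"question": "Is water denser than air?", "answer": "yes"},
    {"question": "Does a helium balloon sink in air?", "answer": "no"},
]

def exact_match_accuracy(model_fn, dataset) -> float:
    """Score a model by exact string match against the reference answers."""
    correct = 0
    for item in dataset:
        prediction = model_fn(item["question"]).strip().lower()
        correct += prediction == item["answer"]
    return correct / len(dataset)

# `model_fn` would wrap whatever LLM is being tested; here it's a stub.
print(exact_match_accuracy(lambda q: "yes", benchmark))  # 0.5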

A typical benchmark often comes packaged as:

  • A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
  • An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks.
  • A leaderboard or some form of comparative scoreboard, often with big flashy graphs.

Some really famous benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are often published on public leaderboards, which lets people make transparent comparisons between different models.

What Makes a Good Benchmark?

  1. A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.
  2. Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model has seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.
  3. Quantifiable Metrics: You need a standard for scoring performance — like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.
  4. Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.

Life Ain’t All about The Grades

Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your company’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.

Now that we have a sense of what LLM benchmarks are designed to do (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice — with mixed results!

Case Study #1: Leetcode as an LLM Benchmark

As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised my blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “A Performance Study of LLM-Generated Code on Leetcode” was very interesting — it asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and the pitfalls of this approach.

The Benchmark Design

The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.

When they compared local measurements to Leetcode’s metrics, they found only a 0.28 correlation. Leetcode’s measurements also showed much higher variation (0.089 vs 0.035 locally). Even worse, Leetcode’s rankings proved unstable — identical solutions could drop from the 77th to the 54th percentile based purely on submission timing.

“A Performance Study of LLM-Generated Code on Leetcode,” 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)
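
For intuition, this is roughly the kind of check the authors describe — correlating the same solutions’ runtimes as measured locally and on the platform. The numbers below are made up, and this is not the paper’s code:

import numpy as np

# Hypothetical runtimes (seconds) for the same solutions, measured locally
# and on the platform; the values are purely illustrative.
local_runtimes = np.array([0.41, 0.38, 0.52, 0.47, 0.60, 0.35])
platform_runtimes = np.array([0.55, 0.80, 0.49, 0.90, 0.62, 0.58])

# Pearson correlation between the two measurement sources.
correlation = np.corrcoef(local_runtimes, platform_runtimes)[0, 1]
print(f"correlation: {correlation:.2f}")

# Spread of each source (the paper reports variation; standard deviation here).
print(f"local std: {local_runtimes.std():.3f}, platform std: {platform_runtimes.std():.3f}")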

The Real Problems

Three major issues emerged that challenge Leetcode’s viability as a benchmark:

Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to restrict themselves to problems from 2023 to mitigate this.

Platform Instability: Leetcode’s metrics drift over time — memory measurements showed a -0.24 correlation with test date. This makes reproducible benchmarking nearly impossible.

Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.

What It Means for LLM Benchmarking

This study doesn’t just critique Leetcode — it highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we should be extremely cautious about using competition platforms as benchmarks.

So! We know that not all benchmarks are viable benchmarks — what about a more mainstream one?

Case Study #2: SuperGLUE — Building a Better Language Understanding Benchmark

The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.

The Benchmark Design

SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and — most importantly — significant headroom between machine and human performance.

This resulted in eight tasks (I’ve simplified the table from the paper here; it’s a little less readable, but you should get the sense of what the questions are asking):

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more complex reasoning tasks. Each task measures different facets of language understanding while maintaining clear, quantifiable metrics.


Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms

After looking at benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different — physical reasoning… through escape room puzzles?

It’s a fairly valid idea — escape rooms pose real stakes and consequences for failure: screw up one too many puzzles, and your friends will think you’re pretty silly and relegate you to spectator duty. Luckily for us, however, they (or the poor employees) don’t know that you can sneak a phone into an escape room — and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.

Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers — why would you even want to do that?), or anywhere close to one; it’s just supposed to be a fun way to explore LLM benchmarking and evaluation. Please don’t destroy my prompts; I’m aware they’re bad.

Why Physical Reasoning?

For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason — these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates tend to focus on mathematical problem-solving, not practical physical reasoning.

Looking at existing benchmarks taught me a few key principles:

  1. Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores)
  2. Problems must have unambiguous solutions (from HumanEval’s test cases)
  3. The benchmark should test distinct capabilities (from MMLU’s subject categories)

Designing the Problems

I settled on escape room puzzles for two reasons. First, they naturally mix physical reasoning with clear goals. Second, they have unambiguous success conditions — either you solve it the intended way, or you don’t. Third, and most importantly, they let me include “red herrings” — irrelevant items that test whether the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?).

I’m aware that this is more than two reasons, but if LLMs can’t count how many r’s there are in strawberry, I’m allowed to mess up from time to time too.

Here’s how I structured the five core problems:

Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube)

  • Tests understanding of buoyancy and fluid displacement
  • Inspired by classic physics problems, but in a practical context
  • Includes intentionally irrelevant items (like squishy food models)

Light Properties (UV_001) (UV light on a push number lock)

  • Tests understanding of UV fluorescence and material properties
  • Combines multiple physical principles (light, material science)
  • Requires understanding of environmental conditions

Mechanical Understanding (CIPHER_001) (A cipher ring)

  • Tests spatial reasoning and mechanical alignment
  • No red herrings — tests for correlating a dial to a cipher wheel
  • Requires understanding rotational symmetry

Force Application (VAC_001) (Can stuck in hole)

  • Tests understanding of vacuum forces and surface adhesion
  • Multiple possible solution approaches
  • Requires understanding force multiplication

Collaborative Physics (COLLAB_001) (Can two people shimmy a key?)

  • Tests understanding of physical constraints in multi-agent scenarios
  • Requires combining multiple physical principles
  • Tests understanding of tool creation and friction

Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.

The Technical Part

The benchmark implementation has three main components:

1. Problem Definition Layer

Problems are defined in a structured JSON format that enforces consistent evaluation:

{
    "problem_id": "FLUID_001",
    "setup": {
        "scenario": "A ping pong ball is at the underside of a narrow tube...",
        "available_items": ["bottle of water", "squishy food models"...],
        "constraints": ["tube too narrow for manual retrieval"]
    },
    "physical_principles": ["buoyancy", "fluid displacement"],
    "red_herrings": ["squishy food models", "milk carton"],
    "solution": {
        "steps": ["pour water into tube", "allow ball to float"],
        "key_insights": ["water displaces air", "ping pong ball less dense"]
    }
}

This structure draws from SuperGLUE’s design — each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to disregard irrelevant information.
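
For illustration, here’s a simplified sketch of how a problem file in this format might be loaded, validated, and turned into a prompt — the function names and the prompt wording here are my own placeholders, not necessarily what the code in the repo does:

import json
from pathlib import Path

# Fields every problem definition is expected to carry (mirrors the JSON above).
REQUIRED_KEYS = {"problem_id", "setup", "physical_principles", "red_herrings", "solution"}

def load_problem(path: Path) -> dict:
    """Load a single problem definition and check it has the expected fields."""
    problem = json.loads(path.read_text())
    missing = REQUIRED_KEYS - problem.keys()
    if missing:
        raise ValueError(f"{path.name} is missing fields: {sorted(missing)}")
    return problem

def build_prompt(problem: dict) -> str:
    """Turn a problem definition into the prompt shown to the model."""
    items = ", ".join(problem["setup"]["available_items"])
    constraints = "; ".join(problem["setup"]["constraints"])
    return (
        f"{problem['setup']['scenario']}\n"
        f"Available items: {items}\n"
        f"Constraints: {constraints}\n"
        "Explain the physical principles involved and give a step-by-step solution."
    )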

2. Evaluation Framework

The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little more API stability:

from typing import Dict

import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential  # assuming tenacity provides the retry decorator

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        # ... evaluation logic
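
The payoff of asyncio is that the per-problem evaluations can be fanned out concurrently. A simplified sketch of that orchestration (not the exact code in the repo) looks like this:

import asyncio

async def evaluate_all(evaluator, criteria_list) -> list:
    """Run one evaluation per problem concurrently and collect the results."""
    tasks = [evaluator.evaluate_response(criteria) for criteria in criteria_list]
    # return_exceptions=True keeps one failed call (even after its retries)
    # from cancelling the rest of the batch.
    return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.run(evaluate_all(judge, all_criteria))  # hypothetical wiring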

The scoring system looks at three components (a sketch of how they combine follows the list):

Physical Understanding Score (PUS) ∈ [0,2]

  • Measures understanding of relevant physical principles
  • Calculated as a normalized sum of demonstrated principles

Solution Path Score (SPS) ∈ [0,2]

  • Evaluates completeness and correctness of solution steps
  • Considers practical feasibility of proposed solutions

Red Herring Handling (RHH) ∈ {0,1}

  • A binary score for avoiding irrelevant items
  • Tests the ability to focus on physically relevant aspects
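
As promised, here’s a rough sketch of how the three components could roll up into a single number — the equal weighting and the normalization to [0, 1] are arbitrary choices on my part, not a principled design:

def composite_score(pus: float, sps: float, rhh: int) -> float:
    """Combine PUS in [0,2], SPS in [0,2], and RHH in {0,1} into a [0,1] score."""
    max_total = 2 + 2 + 1
    return (pus + sps + rhh) / max_total

print(composite_score(pus=1.5, sps=2.0, rhh=0))  # 0.7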

And yes, there are so many other scoring methods, better and worse, that could be used! For example, RHH could count how many irrelevant items are used in the solution, or it could measure how viable each use is… the point is that picking these metrics is often pretty arbitrary, but very, very important to making your benchmark credible — which mine very much is not.

Moreover, I didn’t want to rewrite any code afterwards. Sue me.

3. Model Interface Layer

The benchmark supports multiple LLM backends through a common interface:

class ModelInterface:
    """Interface for different LLM APIs."""

    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError


class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # GPT-4-specific implementation


class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # Claude-specific implementation
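
As a sketch of what fleshing one of these out could look like, here’s a hypothetical backend that calls OpenAI’s chat completions HTTP API directly and subclasses the ModelInterface above — the class name and constructor are my own, and the actual GPT4Interface in the repo may well differ:

import os

import aiohttp

class OpenAIChatInterface(ModelInterface):
    """Hypothetical backend hitting OpenAI's chat completions endpoint."""

    def __init__(self, model: str = "gpt-4"):
        self.model = model
        self.api_key = os.environ["OPENAI_API_KEY"]

    async def generate_response(self, prompt: str) -> str:
        headers = {"Authorization": f"Bearer {self.api_key}"}
        payload = {"model": self.model,
                   "messages": [{"role": "user", "content": prompt}]}
        async with aiohttp.ClientSession() as session:
            async with session.post("https://api.openai.com/v1/chat/completions",
                                    headers=headers, json=payload) as resp:
                data = await resp.json()
                return data["choices"][0]["message"]["content"]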


Let’s Test It!

So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on, like, five different errors, the results were in. And they were… interesting! I used GPT-4 and Claude 3.5 Haiku (I’m aware these are not similarly spec’d models) as our victims.

The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can and needs to be retrieved.

The inspiration
"VAC_001": {
        "problem_id": "VAC_001",
        "setup": {
            "available_items": [
                "ruler",
                "notebook",
                "suction cup",
                "UV light"
            ],
            "constraints": [
                "can fits too tightly for finger grip",
                "hole is same diameter as can",
                "can is clean and dry"
            ]
        },
        "physical_principles": [
            "vacuum suction",
            "surface adhesion",
            "force multiplication"
        ],
        "red_herrings": [
            "ruler",
            "notebook",
            "UV light"
        ],
        "solution": {
            "steps": [
                "inspect suction cup and can surface",
                "ensure can surface is clean",
                "firmly press suction cup onto can top",
                "create vacuum seal",
                "pull can out using suction cup"
            ],
            "key_insights": [
                "suction cup creates vacuum seal",
                "vacuum force allows extraction",
                "other items don't provide necessary force"
            ]
        }
}

GPT-4’s Solution

Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps:

  1. Attach the suction cup to the top of the can. The flat surface of the can should provide good contact for the suction cup to adhere to.
  2. Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole.
  3. Once the can is out of the hole, remove the suction cup from the top of the can.
  4. Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary.
  5. Write down the combination in the notebook to make sure it’s not forgotten.

Required Items:

  • Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.
  • UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.
  • Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.”

Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is rather wordy), and manages to describe, step by step, how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end — not punishing the use of unnecessary items enough in the prompt. Also, the uses for these items seem decently reasonable… so why not!

Haiku-3.5’s Solution

Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps:

  1. Measure the diameter of the can using the ruler. This will help determine the size of the hole and ensure that the can fits tightly within it.
  2. Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  3. Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole.
  4. Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole.
  5. Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:

  • Ruler: To measure the diameter of the can and confirm the tight fit in the hole.
  • UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  • Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.
  • Notebook: To record the combination written on the bottom of the can.

Haiku’s solution, on the other hand, was… definitely a solution of all time.

First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and ensure that the can fits tightly within it? Why would we need to do that? And do we need a ruler for that?

Second, it tells us to inspect the bottom of the can to locate the combination, when the whole problem is about not being able to pull the can out of the hole conventionally. This might just be a problem of ordering, but now I really understand my friends’ feelings whenever I’d tell them “just fix it man” about their numerous problems.

But it does eventually get to the solution. So… not the worst.

Here’s a fancy radar graph of the results!

We see that both models are pretty similar in their capabilities, with GPT-4 being slightly better in physical understanding and solution path, and Haiku being slightly better in red herring handling. Overall though, both models kind of suck. Dang.

There are also only… five questions.

If you’d like to see the full breadth of questions, they’re on my GitHub.

LLM-as-a-Judge

By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge.” The technique has proven remarkably effective, achieving over 80% agreement with human evaluators on tasks ranging from code assessment to dialogue quality evaluation!

Here’s where my experiment gets kind of cool (arguably, maybe, subjectively) — I used this technique and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.

Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it must:

  • Understand the physical principles involved
  • Plan a sequence of steps
  • Consider all constraints
  • Generate a coherent explanation

But when judging, it only needs to check whether specific criteria are met in an existing solution. The evaluation prompt is very focused:

def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.
Evaluate based on three criteria:
1. Physical Understanding Score (0-2): Does the answer accurately apply relevant physical principles?
2. Solution Path Score (0-2): Are the steps complete and feasible?
3. Red Herring Handling (0-1): Does it avoid using irrelevant items?
Scenario: {criteria.scenario}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""

To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.
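
The spot check itself is nothing fancy — roughly the comparison below, with made-up scores standing in for the real labels:

# Judge scores vs. my own hand-labeled scores for the same responses (made-up data).
judge_scores = {"FLUID_001": 2, "UV_001": 1, "CIPHER_001": 2, "VAC_001": 1, "COLLAB_001": 0}
my_scores    = {"FLUID_001": 2, "UV_001": 1, "CIPHER_001": 1, "VAC_001": 1, "COLLAB_001": 0}

agreement = sum(judge_scores[k] == my_scores[k] for k in judge_scores) / len(judge_scores)
print(f"exact agreement: {agreement:.0%}")  # 80%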

Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well — just like how I moan and groan about the visual presentation of a dish on Masterchef while setting my kitchen aflame trying to microwave a hot dog.

What I Learned

Building this benchmark taught me several things about benchmark design:

Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.

Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.

Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is difficult.

Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best lessons come from just trying things out and seeing what happens.

Now, if you’ll excuse me, I will be sneaking a phone with an internet connection into my next escape room, for reasons that I’m legally unmotivated to disclose.

[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023)

[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

[5] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025)

[6] Unless otherwise stated, all images are created by the author.
