How to Benchmark LLMs – ARC AGI 3


In the past few weeks, we have seen the release of powerful LLMs such as Qwen 3 MoE, Kimi K2, and Grok 4. We will continue to see such rapid improvements for the foreseeable future, and to compare LLMs against one another, we need benchmarks. In this article, I discuss the newly released ARC AGI 3 benchmark and why frontier LLMs struggle to complete any tasks on it.

Motivation

ARC AGI 3 was recently released.

My motivation for writing this article is to stay on top of the latest developments in LLM technology. In just the last couple of weeks, we have seen the Kimi K2 model (the best open-source model when it was released), Qwen 3 235B-A22B (currently the best open-source model), Grok 4, and so on. There is a lot happening in the LLM space, and one way to keep up is to track the benchmarks.

I find the ARC AGI benchmark especially interesting, mainly because I want to see whether LLMs can match human-level intelligence. ARC AGI puzzles are designed so that humans are able to complete them, but LLMs struggle.

You can also read my article on Utilizing Context Engineering to Significantly Enhance LLM Performance and check out my website, which contains all my information and articles.

Table of Contents

  • Introduction to ARC AGI
  • Playing the ARC AGI benchmark
  • Why frontier models achieve 0%
  • Benchmark performance in the future
  • Benchmark chasing
  • Conclusion

Introduction to ARC AGI

ARC AGI is essentially a puzzle game of pattern matching.

  • ARC AGI 1: You’re given a series of input-output pairs and have to complete the pattern (a toy sketch of this grid format follows the list)
  • ARC AGI 2: Similar to the first benchmark, performing pattern matching on input and output examples
  • ARC AGI 3: Here you’re playing a game where you have to move your block into the goal area, with some required steps in between
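
To make the format concrete, below is a toy illustration of an ARC-AGI-1-style task. The real puzzles are grids of small integers (one integer per color); the specific rule here (mirror each row) is my own invented example, not an official task.

```python
# Toy illustration of an ARC-AGI-1-style task (an invented example,
# not an official puzzle). Grids are small integer matrices, one
# integer per color, and the hidden rule must be inferred from the
# demonstration pairs.

train_pairs = [
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
    ([[0, 3, 0],
      [4, 0, 0]],
     [[0, 3, 0],
      [0, 0, 4]]),
]

test_input = [[5, 0, 0],
              [0, 6, 0]]

def solve(grid):
    # The hidden rule in this toy: mirror each row horizontally.
    return [row[::-1] for row in grid]

# Verify the rule on the demonstrations, then apply it to the test input.
assert all(solve(x) == y for x, y in train_pairs)
print(solve(test_input))  # [[0, 0, 5], [0, 6, 0]]
```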

I think it’s fun to try these puzzle games and complete them myself. Then you can see how LLMs initially struggle with the benchmarks and later improve their performance with better models. OpenAI, for example, scored:

  • 7.8% with o1-mini
  • 75% with o3-low
  • 88% with o3-high

You can also see this in the figure below:

This figure shows the performance of various OpenAI models on the ARC AGI 1 benchmark. You can see how performance increases with more advanced models. Image from ARC AGI, which is under the Apache 2.0 license.

Playing the ARC AGI benchmark

You can also try the ARC AGI benchmarks yourself, or build an AI to perform the tasks. Go to the ARC AGI 3 website and start playing the game.

The whole point of the games is that you have no instructions, and you have to figure out the rules yourself. I enjoy this concept, because it represents figuring out an entirely new problem without any help. This highlights your ability to learn new environments, adapt to them, and solve problems.

You can see a recording of me playing ARC AGI 3 here, encountering the problems for the first time. I was unfortunately unable to embed the link in the article. Nevertheless, it was super interesting to try out the benchmark and imagine the challenge an LLM has to go through to solve it. I first observe the environment and what happens when I perform the different actions. An action in this case is pressing one of the relevant buttons. Some actions do nothing, while others affect the environment. I then proceed to uncover the goal of the puzzle (for example, get the object to the goal area) and try to achieve it.

Why frontier models achieve 0%

This article states that when frontier models were tested on the ARC AGI 3 preview, they achieved 0%. This might sound disappointing to some, considering you were probably able to successfully complete some of the tasks yourself relatively quickly.

As I mentioned earlier, several OpenAI models have had success with the earlier ARC AGI benchmarks, with their best model achieving 88% on the first version. However, models initially achieved 0%, or low single-digit percentages.

I have a few theories for why frontier models were not able to perform tasks on ARC AGI 3:

Context length

When working on ARC AGI 3, you don’t get any information about the game. The model thus has to try a variety of actions and observe their output (for example, nothing happens, or a block moves, etc.). The model then has to evaluate the actions it took, together with their output, and consider its next moves.

I believe the action space in ARC AGI 3 is very large, which makes it difficult for models both to experiment enough to find the right action and to avoid repeating unsuccessful actions. The models essentially have a problem with their context length and with utilizing its full extent.
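
To illustrate the scale of the problem, here is a minimal sketch of the trial-and-error loop involved, using a tiny invented grid game as a stand-in for ARC AGI 3 (the real game interface is not part of this article):

```python
# Tiny invented grid game standing in for ARC AGI 3: a block at `pos`
# must reach `goal`, but the agent is never told the rules or the goal.

class ToyGame:
    def __init__(self):
        self.pos, self.goal = (0, 0), (2, 2)

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}.get(action, (0, 0))
        x, y = self.pos
        self.pos = (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))
        return self.pos

game = ToyGame()
history = []  # every attempt ends up in the agent's context window

for action in ["up", "left", "down", "right", "down", "right", "down"]:
    before = game.pos
    after = game.step(action)
    history.append((action, before, after,
                    "no effect" if before == after else "moved"))

# An LLM agent is re-prompted with this growing transcript at every
# step; on the real games it grows far faster than useful progress does.
print(history)
print("solved:", game.pos == game.goal)
```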

I recently read an interesting article from Manus about how they develop their agents and manage their memory. You can use techniques such as summarizing previous context or using a file system to store important context. I believe this will be key to increasing performance on the ARC AGI 3 benchmark.
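
As a rough sketch of what those two techniques could look like in code (summarize_with_llm is a hypothetical placeholder for an actual LLM call, and the notes file path is invented):

```python
import json
from pathlib import Path

def summarize_with_llm(entries):
    # Hypothetical placeholder: in practice this would be an LLM call
    # that compresses old transcript entries into a short note.
    return f"{len(entries)} earlier actions tried; see notes file for detail."

def compact_context(history, keep_recent=20,
                    notes_file=Path("agent_notes.json")):
    """Keep recent turns verbatim, summarize the rest, archive detail to disk."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # Use the file system as long-term memory: full detail survives
    # outside the prompt and can be re-read on demand.
    notes_file.write_text(json.dumps(old, indent=2))
    return [{"role": "summary", "content": summarize_with_llm(old)}] + recent
```

The trade-off is that detail is lost from the prompt itself, so anything the agent might need verbatim later belongs in the file rather than the summary.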

Training dataset

Another primary reason frontier models are unable to complete ARC AGI 3 tasks successfully is that the tasks are very different from their training dataset. LLMs almost always perform far better on a task if that task (or a similar one) is included in the training data. In this instance, I believe LLMs have little training data on working with games, for example. Moreover, an important point here is also the agentic training data for the LLMs.

By agentic training data, I mean data where the LLM uses tools and performs actions. I believe we are seeing a rapid increase in LLMs used as agents, and thus the proportional amount of training data for agentic behavior is rapidly increasing. However, it may be that current frontier models are still not that good at performing such actions, though this will likely improve rapidly in the coming months.

Some people will highlight how this proves LLMs don’t have real intelligence: the whole point of intelligence (and of the ARC AGI benchmark) is to be able to understand tasks without any clues, only by examining the environment. To some extent, I agree with this point, and I hope to see models perform better on ARC AGI because of increased model intelligence, and not because of benchmark chasing, a concept I explore later in this article.

Benchmark performance in the future

In the future, I believe we will see vast improvements in model performance on ARC AGI 3, mostly because I think you can create AI agents that are fine-tuned for agentic performance and that utilize their memory optimally. I believe relatively cheap improvements can vastly increase performance, though I also expect more expensive efforts (for example, the release of GPT-5) to perform well on this benchmark.

Benchmark chasing

I think it’s important to include a section about benchmark chasing. Benchmark chasing is when LLM providers chase optimal scores on benchmarks, rather than simply creating the best or most intelligent LLMs. This is a problem because the correlation between benchmark performance and LLM intelligence is not 100%.

In the reinforcement learning world, benchmark chasing is also known as reward hacking: a scenario where the agent figures out a way to hack the environment it is in to achieve a reward, without properly performing the task.

The reason LLM providers do this is that whenever a new model is released, people usually look at two things:

  • Benchmark performance
  • Vibe

Benchmark performance is usually measured on known benchmarks, such as SWE-bench and ARC AGI. Vibe testing is also a way the public often measures LLMs (I’m not saying it’s a good way of testing a model, I’m simply saying it happens in practice). The problem, however, is that I believe it’s quite easy to impress people with the vibe of a model, because vibe checking covers only a very small percentage of the LLM’s action space. You might only be asking it certain questions that can be found on the web, or asking it to program an application the model has already seen a thousand instances of in its training data.

Thus, what you should do is have a benchmark of your own, for example an in-house dataset that has not been leaked to the internet. Then you can benchmark which LLM works best for your use case and prioritize using that LLM. A minimal sketch of such a harness follows below.
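
Here is what such an in-house harness could look like; ask_llm is a hypothetical placeholder for your provider’s API, and the two tasks are invented stand-ins for a real private dataset:

```python
# Minimal sketch of a private eval harness. `ask_llm` is a hypothetical
# wrapper: replace its body with a real call to your provider's API.

PRIVATE_EVALS = [  # invented stand-ins for an unpublished in-house dataset
    {"prompt": "Extract the invoice number from: 'INV-2041, due 2024-06-01'. "
               "Answer with the number only.",
     "expected": "INV-2041"},
    {"prompt": "What is 17 * 23? Answer with the number only.",
     "expected": "391"},
]

def ask_llm(model: str, prompt: str) -> str:
    return "INV-2041"  # canned reply so the sketch runs end to end

def score(model: str) -> float:
    """Fraction of in-house tasks where the expected answer appears."""
    hits = sum(case["expected"] in ask_llm(model, case["prompt"])
               for case in PRIVATE_EVALS)
    return hits / len(PRIVATE_EVALS)

for model in ["model-a", "model-b"]:
    print(model, score(model))
```

The grading here is a simple substring check; the point is that the dataset stays private, so a high score cannot come from benchmark chasing.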

Conclusion

In this article, I have discussed LLM benchmarks and why they are important for comparing LLMs. I have introduced you to the newly released ARC AGI 3 benchmark. This benchmark is super interesting considering that humans are easily able to complete some of the tasks, while frontier models score 0%. It thus represents a task where human intelligence still outperforms LLMs.

Going forward, I believe we will see rapid improvements in LLM performance on ARC AGI 3, though I hope this will not be the result of benchmark chasing, but rather of genuine improvements in LLM intelligence.
