Can we fix AI’s evaluation crisis?


As a tech reporter I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?” If I don’t feel like turning it into an hour-long seminar, I’ll usually give the diplomatic answer: “They’re both good in different ways.”

Most people asking aren’t defining “good” in any precise way, and that’s fair. It’s human to want to make sense of something new and seemingly powerful. But that simple question, “Is this model good?”, is really just the everyday version of a much more complicated technical problem.

So far, the way we’ve tried to answer that question is through benchmarks. These give models a fixed set of questions to answer and grade them on how many they get right. But much like exams such as the SAT (an admissions test used by many US colleges), these benchmarks don’t always reflect deeper abilities. These days it feels as if a new AI model drops every week, and every time a company launches one, it comes with fresh scores showing it beating the capabilities of its predecessors. On paper, everything appears to be improving all the time.
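To make the mechanics concrete, here is a minimal sketch of what a static benchmark boils down to: a fixed question set, one model call per question, and a score that is simply the fraction of answers that match the expected ones. The `ask_model` function and the two sample questions are hypothetical placeholders, not any real benchmark’s harness.

```python
# Minimal sketch of a static benchmark: a fixed question set, a model that
# answers each question, and a score counting how many answers match.

def ask_model(question: str) -> str:
    # Hypothetical stand-in for a real model API call; it always answers "4"
    # here just so the sketch runs end to end.
    return "4"

def score_benchmark(questions: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for item in questions:
        answer = ask_model(item["question"])
        if answer.strip().lower() == item["expected"].strip().lower():
            correct += 1
    return correct / len(questions)

# Toy question set; real benchmarks have thousands of items.
benchmark = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

print(score_benchmark(benchmark))  # 0.5 with this toy model
```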

In practice, it’s not so simple. Just as grinding for the SAT might boost your score without improving your critical thinking, models can be trained to optimize for benchmark results without actually getting smarter, as Russell Brandom explained in his piece for us. As OpenAI and Tesla AI veteran Andrej Karpathy recently put it, we’re living through an evaluation crisis: our scoreboard for AI no longer reflects what we really want to measure.

Benchmarks have grown stale for a few key reasons. First, the industry has learned to “teach to the test,” training AI models to score well rather than genuinely improve. Second, widespread data contamination means models may have already seen the benchmark questions, or even the answers, somewhere in their training data. And finally, many benchmarks are simply maxed out. On popular tests like SuperGLUE, models have already reached or surpassed 90% accuracy, making further gains feel more like statistical noise than meaningful improvement. At that point, the scores stop telling us anything useful. That’s especially true in high-skill domains like coding, reasoning, and complex STEM problem-solving.
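One rough way to illustrate the second problem, contamination, is to check whether long word n-grams from a benchmark question appear verbatim in the training corpus. The sketch below assumes you have the training documents in hand, which labs rarely share publicly, and the 8-gram threshold is illustrative rather than any standard screening pipeline.

```python
# Rough sketch of a contamination screen: flag a benchmark question if any
# long word n-gram from it also appears verbatim in a training document.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All consecutive n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_docs: list[str], n: int = 8) -> bool:
    """True if any n-gram from the question shows up verbatim in training data."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)
```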

Nevertheless, a growing number of teams around the world are trying to tackle the AI evaluation crisis.

One result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads, competitions for elite high school and university programmers where participants solve difficult problems without external tools. The top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where human experts routinely excel.
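That “first pass” figure corresponds to a pass@1-style metric: each problem gets a single attempt, and the score for a difficulty tier is simply the share of problems solved on that attempt. The sketch below uses made-up records and skips the real judging machinery (hidden test cases, time limits), so treat it as an illustration of the metric rather than LiveCodeBench Pro’s actual scorer.

```python
# Sketch of a first-pass (pass@1) score per difficulty tier: one attempt per
# problem, and the tier score is the fraction solved on that attempt.

from collections import defaultdict

# Made-up records for illustration only.
results = [
    {"difficulty": "medium", "solved_first_try": True},
    {"difficulty": "medium", "solved_first_try": False},
    {"difficulty": "hard", "solved_first_try": False},
]

def pass_at_1_by_tier(records: list[dict]) -> dict[str, float]:
    solved, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["difficulty"]] += 1
        solved[r["difficulty"]] += int(r["solved_first_try"])
    return {tier: solved[tier] / total[tier] for tier in total}

print(pass_at_1_by_tier(results))  # {'medium': 0.5, 'hard': 0.0}
```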

Zihan Zheng, a junior at NYU and a world finalist in competitive coding, led the project to develop LiveCodeBench Pro with a team of olympiad medalists. They’ve published both the benchmark and a detailed study showing that top-tier models like GPT-4o mini and Google’s Gemini 2.5 perform at a level comparable to the top 10% of human competitors. Across the board, Zheng observed a pattern: AI excels at planning and executing tasks, but it struggles with nuanced algorithmic reasoning. “It shows that AI is still far from matching the best human coders,” he says.

LiveCodeBench Pro might define a new upper bar. But what about the floor? Earlier this month, a group of researchers from multiple universities argued that LLM agents should be evaluated primarily on the basis of their riskiness, not just how well they perform. In real-world, application-driven environments, especially with AI agents, unreliability, hallucinations, and brittleness are ruinous. One wrong move could spell disaster when money or safety is on the line.

There are other new attempts to address the problem. Some benchmarks, like ARC-AGI, now keep part of their data set private to prevent AI models from being optimized excessively for the test, a problem called “overfitting.” Meta’s Yann LeCun has created LiveBench, a dynamic benchmark where questions evolve every six months. The goal is to evaluate models not just on knowledge but on adaptability.

Xbench, a Chinese benchmark project developed by HongShan Capital Group (formerly Sequoia China), is another one of these efforts. I just wrote about it in a story. Xbench was initially built in 2022, right after ChatGPT’s launch, as an internal tool to evaluate models for investment research. Over time, the team expanded the system and brought in external collaborators. It just made parts of its question set publicly available last week.

Xbench is notable for its dual-track design, which tries to bridge the gap between lab-based tests and real-world utility. The first track evaluates technical reasoning skills by testing a model’s STEM knowledge and ability to carry out Chinese-language research. The second track aims to assess practical usefulness: how well a model performs on tasks in fields like recruitment and marketing. For example, one task asks an agent to identify five qualified battery engineer candidates; another has it match brands with relevant influencers from a pool of more than 800 creators.

The team behind Xbench has big ambitions. They plan to expand its testing capabilities into sectors like finance, law, and design, and they plan to update the test set quarterly to avoid stagnation.

This is something I often wonder about, because a model’s hardcore reasoning ability doesn’t necessarily translate into a fun, informative, and creative experience. Most queries from average users are probably not going to be rocket science. There isn’t much research yet on how to effectively evaluate a model’s creativity, but I’d love to know which model would be the best for creative writing or art projects.

Human preference testing has also emerged as an alternative to benchmarks. One increasingly popular platform is LMarena, which lets users submit questions and compare responses from different models side by side, and then pick which one they like best. Still, this method has its flaws. Users sometimes reward the answer that sounds more flattering or agreeable, even when it’s wrong. That can incentivize “sweet-talking” models and skew results in favor of pandering.
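Arena-style platforms typically turn those head-to-head votes into a leaderboard with an Elo-style rating update, where the winner of each comparison gains points and the loser gives them up. The sketch below shows that general idea under that assumption; it is not LMarena’s actual ranking code, and the starting ratings and K-factor are arbitrary.

```python
# Minimal sketch of turning pairwise preference votes into rankings with a
# simple Elo update: the winner's rating rises, the loser's falls, and upsets
# move ratings more than expected wins do.

def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Shift two ratings toward the observed outcome of one head-to-head vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# One user preferred model_a's answer in a side-by-side comparison.
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
print(ratings)  # model_a moves up, model_b moves down by the same amount
```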

AI researchers are starting to realize, and admit, that the status quo of AI testing cannot continue. At the recent CVPR conference, NYU professor Saining Xie drew on historian James Carse’s Finite and Infinite Games to critique the hypercompetitive culture of AI research. An infinite game, he noted, is open-ended; the goal is to keep playing. But in AI, a dominant player often drops a big result, triggering a wave of follow-up papers chasing the same narrow topic. This race-to-publish culture puts enormous pressure on researchers and rewards speed over depth, short-term wins over long-term insight. “If academia chooses to play a finite game,” he warned, “it will lose everything.”

I found his framing powerful, and maybe it applies to benchmarks, too. So, do we have a truly comprehensive scoreboard for how good a model is? Not really. Many dimensions, social, emotional, and interdisciplinary among them, still evade assessment. But the wave of new benchmarks hints at a shift. As the field evolves, a bit of skepticism is probably healthy.
