The Death of the Static AI Benchmark


Benchmarking as a Measure of Success

Benchmarks are often hailed as a hallmark of success. They are a celebrated way of measuring progress, whether it is breaking the four-minute mile or excelling on standardized exams. In the context of Artificial Intelligence (AI), benchmarks are by far the most common approach to evaluating a model's capabilities. Industry leaders such as OpenAI, Anthropic, Meta, and Google compete in a race to one-up one another with superior benchmark scores. However, recent research studies and industry grumblings are casting doubt on whether common benchmarks truly capture the essence of a model's ability.

Source: DALL·E 3

Emerging research points to the likelihood that the training sets of some models have been contaminated with the very data they are being assessed on, raising doubts about whether their benchmark scores reflect true understanding. It is similar to movies in which actors portray doctors or scientists: they deliver the lines convincingly without truly grasping the underlying concepts. When Cillian Murphy played the famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he likely did not understand the complex physics theories he spoke of. Although benchmarks are meant to evaluate a model's capabilities, are they truly doing so if, like an actor, the model has merely memorized them?

Recent findings from the University of Arizona show that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, discrediting their associated benchmarks [1]. Further, researchers from the University of Science and Technology of China found that when they applied their "probing" technique to the popular MMLU benchmark [2], results decreased dramatically.
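
To give a rough sense of the contamination concern, here is a minimal sketch (not the exact protocol from either paper): one simple signal is whether a model can reproduce the remainder of a benchmark instance near-verbatim after seeing only its beginning. The `query_model` function is a hypothetical placeholder for whatever LLM API you use.

```python
import difflib


def query_model(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to your LLM of choice
    and return its text completion."""
    raise NotImplementedError


def contamination_signal(instance: str, split_ratio: float = 0.5) -> float:
    """Show the model the first part of a benchmark instance and measure
    how closely its continuation matches the held-back remainder.
    A similarity near 1.0 suggests the instance may have been memorized."""
    cut = int(len(instance) * split_ratio)
    prefix, reference = instance[:cut], instance[cut:]
    completion = query_model(
        f"Continue this text exactly as it originally appeared:\n{prefix}"
    )
    return difflib.SequenceMatcher(None, completion, reference).ratio()


# A high average ratio across many benchmark instances is a red flag, e.g.:
# scores = [contamination_signal(example) for example in benchmark_examples]
```
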

Their probing techniques included a series of methods meant to challenge the model's understanding of a question when it is posed in different ways with different answer options but the same correct answer. Examples of the probing techniques included: paraphrasing questions, paraphrasing choices, permuting choices, adding extra context to questions, and adding a new choice to the benchmark questions (a sketch of two of these perturbations follows below).
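
As a minimal sketch of the two purely mechanical perturbations, assuming a simple MMLU-style record with a question, a list of options, and the index of the correct answer (the example item and the distractor text are illustrative, not taken from the real benchmark):

```python
import random


def permute_choices(question, choices, answer_idx, seed=0):
    """Shuffle the answer options while tracking where the correct one lands.
    The correct answer text is unchanged; only its position moves."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return question, new_choices, new_answer_idx


def add_new_choice(question, choices, answer_idx, distractor="None of the other options"):
    """Append an extra (incorrect) option; the correct answer keeps its index."""
    return question, choices + [distractor], answer_idx


# Illustrative usage:
q, c, a = permute_choices(
    "Which gas makes up most of Earth's atmosphere?",
    ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    answer_idx=1,
)
```

A model that truly understands the question should be unaffected by these changes, whereas a model that has memorized "the answer is B" will stumble.
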

From the graph below, one can see that although every tested model performed well on the unaltered "vanilla" MMLU benchmark, when probing techniques were applied to different sections of the benchmark (LU, PS, DK, All), they did not perform as strongly.

"Vanilla" represents performance on the unaltered MMLU benchmark. The other keys represent performance on the altered sections of the MMLU benchmark: Language Understanding (LU), Problem Solving (PS), Domain Knowledge (DK), and All.

This evolving situation prompts a re-evaluation of how AI models are assessed. The need for benchmarks that both reliably demonstrate capabilities and anticipate the problems of data contamination and memorization is becoming apparent.

As models continue to evolve and are updated to potentially include benchmark data in their training sets, benchmarks may have an inherently short lifespan. Moreover, model context windows are increasing rapidly, allowing a larger amount of context to be included in the model's response. The larger the context window, the greater the potential impact of contaminated data indirectly skewing the model's learning process and biasing it toward the seen test examples.

To address these challenges, innovative approaches such as dynamic benchmarks are emerging, employing tactics like altering questions, complicating questions, introducing noise into the question, paraphrasing the question, reversing the polarity of the question, and more [3]. A rough sketch of how such rewrites might be automated follows below.
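
As a loose illustration of automating these tactics with a language model (the `rewrite` helper is a hypothetical wrapper around whatever LLM API is available, not the framework from [3]):

```python
# Prompt templates for a few of the dynamic-benchmark tactics listed above.
TACTICS = {
    "paraphrase": "Rewrite the question so the wording changes but the meaning "
                  "and the correct answer stay the same:\n{question}",
    "add_noise": "Insert an irrelevant but plausible-sounding sentence into this "
                 "question without changing its correct answer:\n{question}",
    "reverse_polarity": "Rewrite the question so it asks for the opposite "
                        "(e.g. 'which is NOT ...'), and state the new correct "
                        "answer:\n{question}",
}


def rewrite(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM that performs the rewrite."""
    raise NotImplementedError


def evolve_question(question: str, tactic: str) -> str:
    """Produce a perturbed variant of a benchmark question using one tactic."""
    return rewrite(TACTICS[tactic].format(question=question))


# Illustrative usage: generate one variant per tactic for a given question.
# variants = {t: evolve_question(original_question, t) for t in TACTICS}
```
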

The example below illustrates several methods for altering benchmark questions (either manually or via language-model generation).

Source: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

As we move forward, the imperative to align evaluation methods more closely with real-world applications becomes clear. Establishing benchmarks that accurately reflect practical tasks and challenges will not only provide a truer measure of AI capabilities but also guide the development of Small Language Models (SLMs) and AI Agents. These specialized models and agents require benchmarks that genuinely capture their potential to perform practical and helpful tasks.


What are your thoughts on this topic?
Let us know in the comments below.
