One of the goals of the research was to define a list of criteria that make a good benchmark. “It’s definitely an important problem to discuss the quality of the benchmarks, what we want from them, what we need from them,” says Ivanova. “The issue is that there isn’t one good standard to define benchmarks. This paper is an attempt to provide a set of evaluation criteria. That’s very useful.”
The paper was accompanied by the launch of a website, BetterBench, that ranks the most popular AI benchmarks. Ranking factors include whether experts were consulted on the design, whether the tested capability is well defined, and other basics, such as whether there is a feedback channel for the benchmark and whether it has been peer-reviewed.
The MMLU benchmark had the lowest ratings. “I disagree with these rankings. In fact, I’m an author of some of the papers ranked highly, and would say that the lower-ranked benchmarks are better than them,” says Dan Hendrycks, director of CAIS, the Center for AI Safety, and one of the creators of the MMLU benchmark. That said, Hendrycks still believes that the best way to move the field forward is to build better benchmarks.
Some think the criteria may be missing the bigger picture. “The paper adds something valuable. Implementation criteria and documentation criteria: all of this is important. It makes the benchmarks better,” says Marius Hobbhahn, CEO of Apollo Research, a research organization specializing in AI evaluations. “But for me, the most important question is, do you measure the right thing? You could check all of these boxes, but you could still have a terrible benchmark because it just doesn’t measure the right thing.”
Essentially, even if a benchmark is perfectly designed, one that tests the model’s ability to provide a compelling analysis of Shakespeare sonnets may be useless if someone is really concerned about AI’s hacking capabilities.
“You’ll see a benchmark that’s supposed to measure moral reasoning. But what that means isn’t necessarily defined very well. Are people who are experts in that domain being incorporated in the process? Often that isn’t the case,” says Amelia Hardy, another author of the paper and an AI researcher at Stanford University.
There are organizations actively trying to improve the situation. For example, a new benchmark from Epoch AI, a research organization, was designed with input from 60 mathematicians and verified as difficult by two winners of the Fields Medal, the most prestigious award in mathematics. The participation of these experts fulfills one of the criteria in the BetterBench assessment. The current most advanced models are able to answer less than 2% of the questions on the benchmark, which means there is a significant way to go before it is saturated.
“We really tried to represent the full breadth and depth of modern math research,” says Tamay Besiroglu, associate director at Epoch AI. Despite the difficulty of the test, Besiroglu speculates it will take only around four or five years for AI models to score well against it.