Fixing Open LLM Leaderboard with Math-Confirm




Three weeks ago, we showed how hard it is to properly evaluate LLM performance on math problems, and introduced Math-Confirm, a better way to validate models on math (read more in the announcement)!

Today, we’re thrilled to share that we’ve used Math-Confirm to thoroughly re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard, for even fairer and more robust model comparisons!



Why math evaluation on the Open LLM Leaderboard was broken

The Open LLM Leaderboard is probably the most used leaderboard on the Hugging Face Hub: it compares the performance of open Large Language Models (LLMs) across various tasks. One of these tasks, called MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It uses the 1,324 highest-difficulty problems (Level 5) from the Hendrycks MATH dataset, spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, geometry, counting/probability, and number theory), with a 5-shot approach (the model is given 5 examples in the prompt to show how it should answer).

A typical question looks like this:

For all real numbers $r$ and $s$, define the mathematical operation $\#$ such that the following conditions apply: $r \# 0 = r$, $r \# s = s \# r$, and $(r + 1) \# s = (r \# s) + s + 1$. What is the value of $11 \# 5$?

To which the answer would be:

71

On the leaderboard, models would have to end their answers with a very specific string (following the Minerva-Math paper):

“Final answer is [ANSWER]. I hope it is correct.”

The leaderboard would then attempt to parse [ANSWER] with SymPy to convert it to a symbolic representation (and simplify the values if needed), before finally comparing it to the gold target.
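To make that pipeline concrete, here is a simplified, hypothetical sketch of such a rigid extraction-and-comparison step (the regex and helper below are illustrative only, not the leaderboard's actual code):

```python
import re

import sympy
from sympy.parsing.sympy_parser import parse_expr

# Rigid extraction: the answer must appear in the exact Minerva-style sentence.
MINERVA_PATTERN = re.compile(r"final answer is (.+?)\.\s*I hope", re.IGNORECASE)

def old_style_score(model_output: str, gold: str) -> bool:
    match = MINERVA_PATTERN.search(model_output)
    if match is None:
        return False  # format not followed -> marked incorrect, even if right
    try:
        predicted = parse_expr(match.group(1).replace("$", ""))
        target = parse_expr(gold)
    except Exception:
        return False  # SymPy could not parse the extracted string
    return sympy.simplify(predicted - target) == 0

print(old_style_score("Final answer is $71$. I hope it is correct.", "71"))  # True
print(old_style_score("So the result is 71.", "71"))  # False: rigid format missed it
```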

However, users reported a number of issues with this approach.

To begin with, a recurring issue was the inability of some models to follow the expected answer format from the examples: they introduced their answers with other sentences instead. Since the format was not followed, answers were marked as incorrect even when they were actually correct! (Which is a problem if what you are interested in is specifically "how good the model is at math".) A few examples are shown in the table below, followed by a short sketch of a more forgiving extraction step.

📄 Example | ❗️Issue | ✅ Math-Confirm | 🛑 Old-Leaderboard
Therefore, the perimeter of one of these triangles is $14 + 7\sqrt{2}$ inches, expressed in simplest radical form. | Failed extraction | 7*sqrt(2) + 14 | None
Therefore, the sum of the infinite geometric series is \(\frac{7}{9}\). | Failed extraction | 7/9 | None
\( p(n) \) and \( p(n+1) \) share a common factor greater than 1 is \(\boxed{41}\). | Failed extraction | 41 | None
So it is \(\frac{1}{9}\) | Failed extraction | 1/9 | None
Concluding he has \(\boxed{5}\) cars | Failed extraction | 5 | None
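A more forgiving extraction step has to accept common conclusion patterns such as \boxed{...} instead of a single rigid sentence. The following is a minimal, hypothetical sketch (not Math-Confirm's actual implementation):

```python
import re

# Try increasingly lenient patterns: the Minerva sentence first, then the last
# \boxed{...} group, then the last LaTeX fraction or bare number.
PATTERNS = [
    re.compile(r"final answer is\s*\$?(.+?)\$?\.\s*I hope", re.IGNORECASE),
    re.compile(r"\\boxed\{([^{}]+)\}"),
    re.compile(r"\\frac\{[^{}]+\}\{[^{}]+\}|-?\d+(?:\.\d+)?"),
]

def extract_answer(model_output: str) -> str | None:
    for pattern in PATTERNS:
        matches = pattern.findall(model_output)
        if matches:
            return matches[-1]  # the last occurrence is usually the conclusion
    return None

print(extract_answer(r"Concluding he has \boxed{5} cars"))  # 5
print(extract_answer(r"So it is \frac{1}{9}"))              # \frac{1}{9}
```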

The next step, converting [ANSWER] to a symbolic representation, also presented some issues, this time linked to the SymPy parsing; see the table below and the snippet that follows it:

📄 Example | ❗️Issue | ✅ Math-Confirm | 🛑 Old-Leaderboard
The final answer is $2x + 4y + z - 19 = 0$. I hope it is correct. | Partial parse of parametric equation | Eq(2x + 4y + z - 19, 0) | 0
\(23\) | Failed extraction due to LaTeX delimiters | 23 | None
\((-\infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | Union(Interval.open(-oo, -14), Interval.open(-3, oo)) | None
100% | Failed extraction due to invalid symbol | 1 | None
\begin{pmatrix}\frac{1}{50}&\frac{7}{50}\\\frac{7}{50}&\frac{49}{50}\end{pmatrix} | Failed extraction due to matrix | Matrix([[1/50, 7/50], [7/50, 49/50]]) | None
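For reference, the expressions in the ✅ Math-Confirm column above are standard SymPy objects. The snippet below (my own illustration, not library code) shows what a correct parse of these answers should produce:

```python
from sympy import Interval, Matrix, Rational, Union, oo

# An interval union for "(-oo, -14) ∪ (-3, oo)".
interval_answer = Union(Interval.open(-oo, -14), Interval.open(-3, oo))

# A 2x2 matrix of exact rationals for the pmatrix example.
matrix_answer = Matrix([[Rational(1, 50), Rational(7, 50)],
                        [Rational(7, 50), Rational(49, 50)]])

# "100%" normalized to the number 1.
percentage_answer = Rational(100, 100)

print(interval_answer)         # Union(Interval.open(-oo, -14), Interval.open(-3, oo))
print(matrix_answer)           # Matrix([[1/50, 7/50], [7/50, 49/50]])
print(percentage_answer == 1)  # True
```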

In the final step, when comparing the extracted answer with the target expression, a number of issues also occurred; see the table below and the illustrative checks after it:

📄 Example | ❗️Issue | ✅ Math-Confirm | 🛑 Old-Leaderboard
1/3 == 0.333333 | No rounding support | True | False
sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False
k = 1 == 1 | No variable assignment support | True | False
Matrix.ones == Matrix.ones | No support for matrix equivalence | True | False
{1} ∪ {1,4} == {1,4} | No support for set comparison | True | False
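The checks below (a rough sketch of my own, not the library's code) illustrate the equivalences a robust comparison step has to recognize for the cases above:

```python
from sympy import FiniteSet, Rational, ones, sqrt

# Rounding: 1/3 and 0.333333 should match within a small tolerance.
print(abs(float(Rational(1, 3)) - 0.333333) < 1e-5)                  # True

# Numerical evaluation: sqrt(1/2)*7 and sqrt(0.5)*7 are the same value.
print(abs(float(sqrt(Rational(1, 2)) * 7 - sqrt(0.5) * 7)) < 1e-9)   # True

# Sets: {1} ∪ {1,4} is the same set as {1,4}.
print(FiniteSet(1).union(FiniteSet(1, 4)) == FiniteSet(1, 4))        # True

# Matrices: element-wise equality of two all-ones matrices.
print(ones(2, 2) == ones(2, 2))                                      # True

# The "k = 1" case additionally needs the right-hand side of the assignment
# to be pulled out before comparing it to the target value 1.
```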

All of these issues are now completely fixed with the new Math-Confirm parser!



Which model is the best at math? A complete reshuffling of the cards thanks to fairer evaluations

As all these issues tend to accumulate, some models suffered deeply from this and their performance was strongly underestimated… so we removed the previous evaluator and added Math-Confirm, which was as simple as changing only 3 lines of code! (You can try it in your own math evals too!)
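For reference, swapping the evaluator in an evaluation script can look roughly like the snippet below. This is a hedged sketch: the package name (math_verify) and the parse/verify helpers are assumptions here and may differ from the library's actual interface.

```python
# Hypothetical sketch of the new scoring path; the `math_verify` package name
# and the `parse` / `verify` helpers are assumptions (pip install math-verify).
from math_verify import parse, verify

gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

print(verify(gold, answer))  # True: both sides describe the same set of values
```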

This in turn meant re-evaluating all models submitted since June… and it completely overhauled the top 20 models on the MATH subset of the leaderboard.



Impact of the change

On average, models solved 61 more problems, equating to a 4.66-point increase across the board!

[Figure: score_change]

The two subsets that showed the most significant improvement were both algebra-related (Algebra and Prealgebra), with gains of 8.27 and 6.93 points, respectively. In extreme cases, some models demonstrated improvements of nearly 90 points on these subsets.
We believe these subsets saw the biggest improvement because they often involve answers presented as sets (due to questions with multiple solutions) and matrices. Math-Confirm has enhanced the handling of both answer types, contributing to these notable gains.

[Figure: subset_change]



Model Family Changes

We initially discovered the math evaluation issues when inspecting Qwen models, which had unusually low scores compared with their self-reported performance. After the introduction of Math-Confirm, the scores more than doubled for these models, showing how severely their performance had previously been underestimated.

But Qwen models aren't alone. Another major family affected is DeepSeek. After switching to Math-Confirm, DeepSeek models almost tripled their scores! This is because their answers are typically wrapped in boxed (\boxed{}) notation, which the old evaluator couldn't extract.
[Figure: model_family_change]



Changes in the MATH-Hard Leaderboard

As mentioned at the beginning, the Top 20 rankings have undergone a big shift, with Nvidia's AceMath models now dominating the MATH-Hard leaderboard.
Other major beneficiaries of this change are the Qwen derivatives, which are now almost exclusively the only models ranking right below AceMath.
Below is the full table comparing the old and new Top 20 leaderboard rankings:

[Figure: math_hard_leaderboard_change]



Changes in the Leaderboard

Finally, we examined how the overall Leaderboard results have evolved. While the top 4 positions remain unchanged, the remaining ones have undergone significant shifts. Due to the rise of multiple Qwen derivatives on the MATH subset, the presence of Qwen-derived models among the top 20 has grown even further in the overall results.
[Figure: leaderboard_change]

Many other models also jumped up in the rankings, gaining 200 places or more! You can explore the results in more detail on the Open LLM Leaderboard.



Wrapping Up

The introduction of Math-Confirm has significantly improved the accuracy and fairness of our evaluations on the Open LLM Leaderboard. This has led to a reshuffling of the leaderboard, with many models showing substantial improvements in their scores.

We encourage all developers and researchers to adopt Math-Confirm for their own math evaluations. By doing so, you can make sure that your models are evaluated with more reliable results. Moreover, we invite you to explore the updated rankings and see how your favorite models' performance has changed.


