Because we’re done trusting black-box leaderboards over the community


TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges prove that results can be reproduced.



Evaluation is broken

Let’s be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet some models that ace benchmarks still cannot reliably browse the web, write production code, or handle multi-step tasks without hallucinating, according to usage reports. There is a clear gap between benchmark scores and real-world performance.

Moreover, there is another gap within reported benchmark scores themselves. Multiple sources report different results. From model cards to papers to evaluation platforms, there is no alignment in reported scores. The result is that the community lacks a single source of truth.



What We’re Shipping

Decentralized and transparent evaluation reporting.

We’re taking evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and allowing the entire community to openly report scores for benchmarks. We will start with a shortlist of four benchmarks and expand over time to the most relevant ones.

For Benchmarks: Dataset repos can now register as benchmarks (MMLU-Pro, GPQA, HLE are already live). They automatically aggregate reported results from across the Hub and display leaderboards in the dataset card. The benchmark defines the eval spec via eval.yaml, based on the Inspect AI format, so anyone can reproduce it. Reported results must align with the task definition.
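For illustration, here is a rough sketch of what a benchmark’s eval.yaml could look like. The field names below are assumptions made for the sake of the example, not the official schema; the exact format is defined in the docs.

```yaml
# Hypothetical eval.yaml for a benchmark dataset repo.
# Field names are illustrative only; see the docs for the actual schema.
name: mmlu-pro
task: inspect_evals/mmlu_pro      # Inspect AI task used to reproduce the eval (assumed)
split: test
metrics:
  - name: accuracy
    higher_is_better: true
```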


For Models: Eval scores live in .eval_results/*.yaml in the model repo. They appear on the model card and feed into benchmark datasets. Both the model author’s results and open pull requests for results will be aggregated. Model authors will be able to close results PRs and hide results.
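As a rough idea, a single results file might look like the sketch below. Again, the field names are assumptions for illustration; the real schema lives in the docs.

```yaml
# Hypothetical .eval_results/mmlu-pro.yaml in a model repo.
# Field names are illustrative, not the official schema.
benchmark: TIGER-Lab/MMLU-Pro             # dataset repo that defines the benchmark
metric: accuracy
value: 0.721
source: https://example.com/eval-report   # paper, model card, or Inspect eval logs
date: 2026-01-15
```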

For the Community: Any user can submit evaluation results for any model via a PR. Results are shown as “community” without waiting for model authors to merge or close. The community can link to sources like a paper, model card, third-party evaluation platform, or Inspect eval logs. The community can discuss scores like any PR. Because the Hub is Git-based, there is a history of when evals were added, when changes were made, etc.
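If you prefer to do this programmatically, one way is to open the PR with huggingface_hub, as in the sketch below. The file name and YAML fields here are hypothetical; upload_file with create_pr=True is simply the generic way to propose a file to a repo you don’t own.

```python
# Sketch: propose community eval results for someone else's model by opening
# a PR that adds a results file. YAML fields and file name are hypothetical.
from huggingface_hub import HfApi

results_yaml = """\
benchmark: TIGER-Lab/MMLU-Pro   # illustrative fields, not the official schema
metric: accuracy
value: 0.695
source: https://example.com/my-eval-logs
"""

api = HfApi()  # assumes you are authenticated (e.g. via `huggingface-cli login`)
api.upload_file(
    path_or_fileobj=results_yaml.encode(),
    path_in_repo=".eval_results/mmlu-pro-community.yaml",  # hypothetical path
    repo_id="some-org/some-model",                         # model you are reporting on
    repo_type="model",
    create_pr=True,  # opens a pull request instead of pushing to main
    commit_message="Add community MMLU-Pro result",
)
```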


To learn more about evaluation results, check out the docs.



Why This Matters

Decentralizing evaluation will surface scores that already exist across the community in sources like model cards and papers. By exposing these scores, the community can build on top of them to aggregate, track, and understand scores across the field. All scores will also be exposed via Hub APIs, making it easy to aggregate them and build curated leaderboards, dashboards, and more.
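As a minimal sketch of what that could look like, the snippet below pulls .eval_results files from a couple of hypothetical model repos using the generic file APIs in huggingface_hub and prints a tiny leaderboard; dedicated eval endpoints may offer a more direct route.

```python
# Sketch: build a tiny leaderboard by reading .eval_results/*.yaml files
# from a few model repos, using only generic huggingface_hub file APIs.
import yaml
from huggingface_hub import HfApi, hf_hub_download

MODELS = ["org-a/model-x", "org-b/model-y"]  # hypothetical repo ids
api = HfApi()

rows = []
for repo_id in MODELS:
    for path in api.list_repo_files(repo_id, repo_type="model"):
        if not path.startswith(".eval_results/") or not path.endswith(".yaml"):
            continue
        local = hf_hub_download(repo_id, filename=path, repo_type="model")
        with open(local) as f:
            result = yaml.safe_load(f)  # field names depend on the actual schema
        rows.append((repo_id, result.get("benchmark"), result.get("value")))

# Group by benchmark, highest score first.
for repo_id, benchmark, value in sorted(rows, key=lambda r: (r[1] or "", -(r[2] or 0))):
    print(f"{benchmark}: {repo_id} -> {value}")
```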

Community evals don’t replace benchmarks, so leaderboards and closed evals with published results are still crucial. Nevertheless, we believe it is important to contribute to the field with open eval results based on reproducible eval specs.

This won’t solve benchmark saturation or close the benchmark-reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.

Most of all, we hope to make the Hub an active place to build and share reproducible benchmarks, especially ones focused on new tasks and domains that challenge SOTA models more.



Get Started

Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo.

Check out the scores on the benchmark dataset.

Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist.

The feature is in beta. We’re building in the open. Feedback welcome.


