LLM-as-a-Judge has emerged as a popular way to evaluate natural language outputs from LLM applications, but how do we know which models make the best judges?
We're excited to launch Judge Arena – a platform that lets anyone easily compare models as judges side-by-side. Just run the judges on a test sample and vote for the one you agree with most. The results are compiled into a leaderboard that displays the best judges.
Judge Arena
Crowdsourced, randomized battles have proven effective at benchmarking LLMs. LMSys's Chatbot Arena has collected over 2M votes and is highly regarded as a field test for identifying the best language models. Since LLM evaluations aim to capture human preferences, direct human feedback is also key to determining which AI judges are most useful.
How it works
- Select your sample for evaluation:
- Let the system randomly generate a 👩 User Input / 🤖 AI Response pair
- OR input your own custom sample
- Two LLM judges will:
- Score the response
- Provide their reasoning for the score
- Review each judge's evaluation and vote for the one that best aligns with your judgment
(We recommend reviewing the scores first before comparing critiques)
After each vote, you can:
- Regenerate judges: Get new evaluations of the same sample
- Start a 🎲 New round: Randomly generate a new sample to be evaluated
- OR input a new custom sample to be evaluated
To avoid bias and potential abuse, the model names are only revealed after a vote is submitted.
Chosen Models
Judge Arena focuses on the LLM-as-a-Judge approach, and therefore only includes generative models (excluding classifier models that solely output a score). We formalize our selection criteria for AI judges as follows:
- The model should possess the ability to score AND critique other models' outputs effectively.
- The model should be prompt-able to evaluate in different scoring formats, for different criteria.
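To make the second criterion concrete, here is a minimal illustrative sketch of the kind of prompting pattern this implies: a single generative model is asked both to critique a response and to score it on a configurable scale, and the score is then parsed out of its free-form reply. The function names, prompt wording, and output format below are our own illustration under those assumptions, not Judge Arena's actual implementation.

```python
import re

def build_judge_prompt(user_input: str, ai_response: str,
                       criteria: str = "helpfulness", scale: int = 5) -> str:
    """Assemble a judge prompt asking for a critique followed by an integer score."""
    return (
        f"You are evaluating an AI assistant's response for {criteria}.\n\n"
        f"User input:\n{user_input}\n\n"
        f"AI response:\n{ai_response}\n\n"
        f"First write a short critique, then give a score from 1 to {scale} "
        f"on the final line in the form 'Score: <number>'."
    )

def parse_judge_reply(reply: str) -> tuple[str, int | None]:
    """Split a judge's reply into (critique, score); score is None if not found."""
    match = re.search(r"Score:\s*(\d+)", reply)
    score = int(match.group(1)) if match else None
    critique = reply[: match.start()].strip() if match else reply.strip()
    return critique, score

# The same prompt can be sent to two different judge models, and their
# (critique, score) pairs shown side by side for a vote.
prompt = build_judge_prompt("What is 2 + 2?", "2 + 2 equals 5.", criteria="accuracy")
print(parse_judge_reply("The response is incorrect; 2 + 2 is 4.\nScore: 1"))
```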
We selected 18 state-of-the-art LLMs for our leaderboard. While many are open-source models with public weights, we also included proprietary API models to enable direct comparison between open and closed approaches.
- OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
- Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
- Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
- Google (Gemma 2 9B / 27B)
- Mistral (Instruct v0.3 7B, Instruct v0.1 7B)
The current list represents the models most commonly used in AI evaluation pipelines. We look forward to adding more models if our leaderboard proves to be useful.
The Leaderboard
The votes collected from the Judge Arena will be compiled and displayed on a dedicated public leaderboard. We calculate an Elo score for each model and will update the leaderboard hourly.
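For reference, here is a minimal sketch of how a standard Elo update works for a single pairwise vote between two judges. The K-factor and starting rating below are illustrative defaults; the leaderboard's actual parameters and aggregation details may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of judge A against judge B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one vote (a_won=True means judge A was preferred)."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: both judges start at 1200 and judge A wins the vote.
print(update_elo(1200.0, 1200.0, a_won=True))  # -> (1216.0, 1184.0)
```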
Early Insights
These are only very early results, but here's what we've observed so far:
- Mix of top performers across proprietary and open source: GPT-4 Turbo leads by a narrow margin, but the Llama and Qwen models are extremely competitive, surpassing the majority of proprietary models
- Smaller models show impressive performance: Qwen 2.5 7B and Llama 3.1 8B are performing remarkably well and competing with much larger models. As we gather more data, we hope to better understand the relationship between model scale and judging ability
- Preliminary empirical support for emerging research: LLM-as-a-Judge literature suggests that Llama models are well-suited as base models, demonstrating strong out-of-the-box performance on evaluation benchmarks. Several approaches, including Lynx, Auto-J, and SFR-LLaMA-3.1-Judge, opted to start with Llama models before post-training for evaluation capabilities. Our provisional results align with this trend, with Llama 3.1 70B and 405B ranking 2nd and 3rd, respectively
As the leaderboard takes shape over the coming weeks, we look forward to sharing further analysis of the results on our blog.
How to contribute
We hope the Judge Arena is a helpful resource for the community. By contributing to this leaderboard, you'll help developers determine which models to use in their evaluation pipelines. We're committed to sharing 20% of the anonymized voting data in the coming months, as we hope developers, researchers, and users will leverage our findings to build more aligned evaluators.
We'd love to hear your feedback! For general feature requests or to submit / suggest new models to add to the arena, please open a discussion in the community tab or talk to us on Discord. Don't hesitate to let us know if you have questions or suggestions by messaging us on X/Twitter.
Atla currently funds this out of our own pocket. We're looking for API credits (with no strings attached) to support this community effort – please get in touch at support@atla-ai.com if you are interested in collaborating 🤗
Credits
Thanks to all the folks who helped test this arena, and shout out to the LMSYS team for the inspiration. Special mention to Clémentine Fourrier and the Hugging Face team for making this possible!
