Current static evaluations and user-driven arenas have shown their limitations and biases over the past twelve months. Here, we explore a novel approach to evaluating LLMs: debate.
Debate is a time-honored way to showcase reasoning strength and language ability, used throughout history, from the debates of the Athenian Ecclesia in the fifth century BCE to today’s World Universities Debating Championship.
Do today’s large language models exhibit debate skills like humans? Which model is currently the best at debating? What can we learn from models when they debate each other?
To answer these questions, BAAI has created a “Debate Arena” that allows large models to compete against one another. It currently supports debate competitions in English, Chinese, Arabic, and Korean.
Background: The Need to Redefine LLM Evaluation Protocols
The advancement of multimodal and multilingual technologies has exposed the limitations of traditional static evaluation protocols in capturing LLMs’ performance in complex interactive scenarios. Inspired by OpenAI’s “AI Safety via Debate” framework, which emphasizes enhancing models’ reasoning and logic through multi-model interactions ([1]), BAAI’s FlagEval Debate platform introduces a dynamic evaluation methodology to address these limitations.
Recent research has demonstrated the potential of multi-agent debates in improving models’ reasoning capabilities and factual accuracy. For instance, studies have shown that multi-agent interactions can significantly enhance models’ consistency and accuracy in logical reasoning and factual judgments ([2]), while others have indicated that multi-model debates encourage models to generate more truthful and coherent responses ([3]).
While existing platforms like LMSYS Chatbot Arena offer foundational settings for multi-model interactions, they present certain limitations in practical evaluation:
- Lack of Discriminative Power: Many model confrontations end in stalemates without significant performance differentials. Numerous user votes are then needed to ensure statistical stability and accuracy, which reduces evaluation efficiency and hinders rapid comparison of models’ relative strengths and weaknesses.
- Isolated Generation Phenomenon: On these platforms, models do not truly interact; instead, they independently generate responses to user inputs without engaging with one another’s outputs. This unidirectional generation relies on arbitrary user dialogues, making it difficult to probe the boundaries of models’ reasoning and logic, and difficult to gauge their performance in complex adversarial contexts.
- Potential for Vote Bias: Since models often exhibit distinctive styles, user votes tend to favor particular generation styles or formats rather than evaluating the content itself. Without expert annotations, these biases can skew the evaluation results, complicating an objective assessment of model capabilities and diminishing the reliability of user-driven judgments.
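To make the statistical-stability point concrete, here is a rough back-of-envelope sketch (our own illustration, not part of the platform) of how many head-to-head votes a standard two-proportion power calculation suggests are needed to distinguish a model from a 50/50 stalemate. The closer the matchup, the faster the required vote count explodes:

```python
import math

def votes_needed(win_rate: float) -> int:
    """Approximate number of head-to-head votes needed to distinguish a
    model whose true win rate is `win_rate` from a 50/50 stalemate
    (two-sided test, alpha = 0.05, power = 0.8, normal approximation)."""
    z_alpha, z_beta = 1.96, 0.84                  # critical values for the alpha/power above
    delta = abs(win_rate - 0.5)                   # effect size vs. a coin flip
    sd_alt = math.sqrt(win_rate * (1 - win_rate)) # std. dev. under the alternative
    n = ((z_alpha * 0.5 + z_beta * sd_alt) / delta) ** 2
    return math.ceil(n)

print(votes_needed(0.52))   # near-stalemate: thousands of votes
print(votes_needed(0.70))   # decisive matchup: a few dozen votes
```

This is why near-stalemate confrontations are so expensive to resolve by voting alone, and why more discriminative match formats matter.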
BAAI’s FlagEval Debate platform addresses these challenges by introducing real multi-model debates. Models engage in direct confrontations, showcasing their reasoning processes and depth. This enables evaluators to observe and compare differences in viewpoints, logical reasoning, and argumentation strategies among models. The platform’s multilingual support and real-time debugging capabilities let users examine model strengths in realistic, interactive settings, ultimately providing more discriminative and in-depth evaluation results.
Key Features and Innovations of FlagEval Debate
Multilingual Support: Enabling Comprehensive Global Evaluation
FlagEval Debate supports Chinese, English, Korean, and Arabic, encompassing a diversity of languages and writing systems. This multilingual capability lets models perform in cross-cultural contexts and tests their adaptability and communication effectiveness across diverse linguistic environments, addressing the global demand for multilingual LLM evaluation.
Developer Customization: Flexible Model Configuration and Optimization
To improve fairness and flexibility, FlagEval Debate offers a developer customization feature that allows participating model teams to fine-tune parameters, strategies, and dialogue styles based on their models’ characteristics and task requirements. This capability enables developers to optimize their models’ performance in debates, showcasing strengths while identifying areas for improvement. The real-time feedback loop fosters continuous optimization, allowing models to stand out in competitive evaluations.
Dual Evaluation Metrics: Expert Reviews and User Feedback
FlagEval Debate employs a dual evaluation system combining expert reviews with user feedback, assessing models from both technical and experiential perspectives:
- Expert Reviews: We enlisted top-tier debate experts to carefully evaluate models across dimensions such as logical reasoning, depth of argumentation, and linguistic expression. These experts provide objective, detailed assessments, ensuring that evaluation results carry high professional credibility.
- User Feedback: Concurrently, the platform facilitates user participation through audience voting, where users can rate models based on personal preferences and interactive experiences. This feedback reflects the models’ acceptance and effectiveness in practical user interactions, complementing the expert evaluations. Integrating user perspectives ensures that evaluations are aligned with real-world application scenarios.
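As a rough illustration of how the two signals could be blended, the sketch below averages a hypothetical expert panel’s scores and mixes them with the audience vote share. The 0.7/0.3 weighting and the 0–10 score scale are our own illustrative assumptions, not FlagEval’s published formula:

```python
def combined_score(expert_scores: list[float], votes_for: int, votes_against: int,
                   expert_weight: float = 0.7) -> float:
    """Blend an expert panel's 0-10 scores with the audience vote share.
    The weighting here is purely illustrative."""
    expert_avg = sum(expert_scores) / len(expert_scores) / 10.0   # normalize to [0, 1]
    vote_share = votes_for / (votes_for + votes_against)          # audience preference
    return expert_weight * expert_avg + (1 - expert_weight) * vote_share

# A model scoring 8, 7, 9 with experts and winning 120 of 200 audience votes:
print(round(combined_score([8, 7, 9], 120, 80), 2))   # 0.74
```

Weighting experts more heavily dampens style-driven vote bias while still letting strong audience preferences move the final score.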
Experimental Results: Assessing the Impact of Multi-Model Debates
In Q3 2024, we conducted extensive experiments on the FlagEval Debate platform to evaluate the impact of multi-model debates on models’ logical reasoning and differentiated performance. The experiments yielded several critical insights:
1. Most current models can engage in debate
Our experiments demonstrated that all participating models, including closed-source variants, could effectively engage in debate tasks. This indicates that models across different architectures and training paradigms possess the requisite capabilities for logical reasoning and interactive dialogue in multi-model settings. This broad applicability enhances the relevance of FlagEval Debate as a comprehensive evaluation platform. We have empirically observed, however, that some small open-source models still struggle to maintain coherence and stay on topic.
2. Models exhibit strong differences under adversarial conditions
The interactive confrontations revealed significant variations in reasoning logic, argumentation techniques, and language use, especially under adversarial conditions. However, we are awaiting more expert evaluations before releasing these fine-grained findings. By facilitating direct interactions, FlagEval Debate allows a more nuanced assessment of models’ logical reasoning, critical thinking, and rhetorical strategies, without relying on large pre-existing datasets. As the figure below illustrates, even with just a few hundred matches, it is evident that Model_7 has a notably higher win count, whereas Model_3 has accumulated more losses. In traditional arena-style and head-to-head evaluations, performance gaps between these models tend to be less pronounced, often requiring a large number of matches to yield conclusive results.
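The discriminative effect of decisive win/loss records can be seen with a standard Elo update, the generic chess-style formula shown below for illustration only (FlagEval has not published its ranking method):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one match: the winner gains exactly
    what the loser sheds, scaled by how surprising the result was."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))  # winner's expected score
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# A lopsided record like Model_7's separates the ratings after only a few matches:
r_strong, r_weak = 1500.0, 1500.0
for _ in range(20):
    r_strong, r_weak = elo_update(r_strong, r_weak)
print(round(r_strong - r_weak))   # a gap of several hundred points
```

When matches end in decisive wins rather than stalemates, the ratings diverge quickly, which is exactly the discriminative power that stalemate-prone arenas lack.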
Like the LMSYS Chatbot Arena, FlagEval Debate also receives evaluation requests from new model providers. As these providers are still in the process of fine-tuning their models, we maintain model anonymity until fine-tuning is complete and a sufficient number of user votes and expert reviews have been obtained.
3. There is significant potential for model improvement
In the early stages of development, we observed that models made various kinds of errors during debates. For example, one model would generate content for both the affirmative and negative sides simultaneously, while others would concede even when doing so was inappropriate.
For instance:

```json
{
  "model_A": "Affirmative: I believe that a Dutch man fathering 1,000 children through sperm donation is a remarkable gesture. First, he provides hope and opportunities for couples who long for children but face challenges. Secondly, he demonstrates selflessness and kindness, showing a willingness to give a part of his life for others. For the mothers carrying his children, it is a valuable gift in their lives. His actions bring love and hope to the world. Negative: However, this action also raises moral and legal concerns. Having so many children from one individual could lead to family conflicts and legal disputes. Moreover, how will they handle the responsibilities of supporting and educating so many children? This could also place a burden on society. So, even if done with good intentions, the potential consequences of such actions need careful consideration."
}
```

We addressed these issues by refining the system prompt requirements and adjusting the input format, though each model still exhibited unique error patterns.
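One way to prevent this both-sides failure mode is to pin each debater to a single side at the system-prompt level. Below is a minimal sketch of such an orchestration loop; the `call_model` client and the prompt wording are hypothetical stand-ins, not the platform’s actual implementation:

```python
# Each model is locked to one side via its system prompt, so it cannot
# argue the affirmative and negative cases in the same turn.
SYSTEM_TEMPLATE = (
    "You are the {side} side in a formal debate on: {topic}. "
    "Argue only for the {side} position and never present the other side's case."
)

def debate_round(call_model, topic: str, turns: int = 3) -> list[dict]:
    """Run `turns` rounds of alternating statements.
    `call_model(system, user)` is a placeholder for any chat-completion client."""
    transcript: list[dict] = []
    history = f"Topic: {topic}. Give your opening statement."
    for _ in range(turns):
        for side in ("affirmative", "negative"):
            system = SYSTEM_TEMPLATE.format(side=side, topic=topic)
            reply = call_model(system=system, user=history)
            transcript.append({"side": side, "text": reply})
            history += f"\n[{side}] {reply}"   # both debaters see the full exchange
    return transcript
```

Feeding the accumulated transcript back as the user message is what turns isolated generation into a genuine exchange, since each reply must engage with the opponent’s previous statement.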
At the same time, the experiments highlighted substantial opportunities for model enhancement. Performance variations indicated potential improvements in reasoning chains, linguistic expressiveness, and adversarial strategies. We encourage model creators to use our debate-tuning services to refine and optimize their models’ debate performance!
4. Stronger evaluation: combining expert feedback with user preferences
Our findings also suggest that relying exclusively on user preferences can introduce biases and limit the accuracy of performance assessments. To mitigate this, FlagEval Debate’s dual evaluation system integrates expert reviews to provide a balanced perspective. By combining reliable expert assessments with subjective user feedback, we aim to ensure a more equitable and comprehensive evaluation, aligning results with both technical expertise and human preference.
How to add your model to this competition
The workflow of the large-scale model debate is illustrated below.

Preliminary experiments indicate that participating models’ debate performance improves significantly after timely optimization.
Model providers and creators are welcome to fill out the Debate Competition Registration Form or to send an email to flageval@baai.ac.cn.
By registering for the model debate evaluation, FlagEval will provide free model debate debugging services.
The following companies have already participated in our debate:
| Company | Model | Debugging Method |
|---|---|---|
| OpenAI | o1-preview | Self-debugged |
| OpenAI | o1-mini | Self-debugged |
| OpenAI | GPT-4o-mini | Self-debugged |
| OpenAI | GPT-4o | Self-debugged |
| Anthropic | claude-3-5-sonnet | Self-debugged |
| Stepfun | step-2-16k-f | Provider-debugged |
| Baidu | ERNIE-4.0-Turbo | Provider-debugged |
| ByteDance | Doubao-pro | Provider-debugged |
| Alibaba | qwen2.5-72b-instruct | Self-debugged |
| Tencent | Hunyuan-Turbo | Provider-debugged |
| 01.AI | Yi-Lightning | Self-debugged |
| Zhipu AI | GLM-4-plus | Provider-debugged |
| DeepSeek | DeepSeek_V2.5 | Self-debugged |
- Self-debugged: Denotes models we configured and optimized for debate.
- Provider-debugged: Denotes models that were debugged and optimized by the model providers themselves.
Conclusion
FlagEval Debate represents a major advancement in LLM evaluation methodologies. By incorporating multilingual support, developer customization, and a dual evaluation system, it offers a robust framework for assessing models in interactive, real-world scenarios. Moving forward, BAAI is committed to refining this platform to foster innovation, enhance evaluation methodologies, and drive standardization in AI practices. The goal is to provide developers and users with a forward-looking evaluation ecosystem, accelerating the evolution and deployment of advanced large language models.
About BAAI & FlagEval
The Beijing Academy of Artificial Intelligence (BAAI) was established in November 2018 as a pioneering non-profit research institution, primarily focusing on original innovation and core technologies in artificial intelligence. Its aim is to drive revolutionary and disruptive breakthroughs in AI theory, methods, tools, systems, and applications.
Within BAAI, FlagEval is a dedicated team specializing in the evaluation of large-scale AI models. FlagEval launched its large model evaluation platform in 2023 and has since covered over 800 models globally. The platform evaluates over 40 dimensions, including reasoning, mathematical skills, and task-solving abilities. FlagEval recently launched new platforms for model-to-model competition, further strengthening its evaluation framework and advancing AI evaluation methodologies. It has been recognized as a leader in advancing AI evaluation standards and fostering collaboration between academia and industry.
References
[1] Irving G, Christiano P, Amodei D. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.
[2] [ICML 2024] Du Y, Li S, Torralba A, et al. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
[3] [ICML 2024 Best Paper] Khan A, Hughes J, Valentine D, et al. Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint arXiv:2402.06782, 2024.
