Current static evaluations and user-driven arenas have shown their limitations and biases over the past twelve months. Here, we explore a novel approach to evaluating LLMs: debate.
Debate is a time-honored way to showcase reasoning strength and language ability, used throughout history, from the debates of the Athenian Ecclesia in the fifth century BCE to today’s World Universities Debating Championship.
Do today’s large language models exhibit debate skills like humans? Which model is currently the best at debating? What can we learn from models when they debate each other?
To answer these questions, BAAI has created a “Debate Arena” that allows large models to compete against one another. It currently supports debate competitions in English, Chinese, Arabic, and Korean.
Background: The Need to Redefine LLM Evaluation Protocols
The advancement of multimodal and multilingual technologies has exposed the limitations of traditional static evaluation protocols in capturing LLMs’ performance in complex interactive scenarios. Inspired by OpenAI’s “AI Safety via Debate” framework, which emphasizes enhancing models’ reasoning and logic through multi-model interactions ([1]), BAAI’s FlagEval Debate platform introduces a dynamic evaluation methodology to address these limitations.
Recent research has demonstrated the potential of multi-agent debates in improving models’ reasoning capabilities and factual accuracy. For instance, studies have shown that multi-agent interactions can significantly enhance models’ consistency and accuracy in logical reasoning and factual judgments ([2]), while others have indicated that multi-model debates encourage models to generate more truthful and coherent responses ([3]).
While existing platforms like LMSYS Chatbot Arena offer foundational settings for multi-model interactions, they present certain limitations in practical evaluation:
- Lack of Discriminative Power: Many model confrontations end in stalemates without significant performance differentials. Numerous user votes are then needed to ensure statistical stability and accuracy, which reduces evaluation efficiency and hinders rapid comparison of models’ relative strengths and weaknesses.
- Isolated Generation Phenomenon: On these platforms, models do not truly interact; instead, they independently generate responses to user inputs without engaging with one another’s outputs. This unidirectional generation relies on arbitrary user dialogues, making it difficult to probe the boundaries of models’ reasoning and logic, and difficult to gauge their performance in complex adversarial contexts.
- Potential for Vote Bias: Since models often exhibit distinctive styles, user votes tend to favor particular generation styles or formats rather than evaluating the content itself. Without expert annotations, these biases can skew the evaluation results, complicating an objective assessment of model capabilities and diminishing the reliability of user-driven judgments.
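To make the statistical-stability point concrete, here is a rough back-of-envelope sketch (our own illustration, not part of the platform) of how many head-to-head votes a standard two-proportion power calculation suggests are needed to distinguish a model from a 50/50 stalemate. The closer the matchup, the faster the required vote count explodes:

```python
import math

def votes_needed(win_rate: float) -> int:
    """Approximate number of head-to-head votes needed to distinguish a
    model whose true win rate is `win_rate` from a 50/50 stalemate
    (two-sided test, alpha = 0.05, power = 0.8, normal approximation)."""
    z_alpha, z_beta = 1.96, 0.84                  # critical values for the alpha/power above
    delta = abs(win_rate - 0.5)                   # effect size vs. a coin flip
    sd_alt = math.sqrt(win_rate * (1 - win_rate)) # std. dev. under the alternative
    n = ((z_alpha * 0.5 + z_beta * sd_alt) / delta) ** 2
    return math.ceil(n)

print(votes_needed(0.52))   # near-stalemate: thousands of votes
print(votes_needed(0.70))   # decisive matchup: a few dozen votes
```

This is why near-stalemate confrontations are so expensive to resolve by voting alone, and why more discriminative match formats matter.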
BAAI’s FlagEval Debate platform addresses these challenges by introducing real multi-model debates. Models engage in direct confrontations, showcasing their reasoning processes and depth. This enables evaluators to observe and compare differences in viewpoints, logical reasoning, and argumentation strategies among models. The platform’s multilingual support and real-time debugging capabilities let users examine model strengths in realistic, interactive settings, ultimately providing more discriminative and in-depth evaluation results.
Key Features and Innovations of FlagEval Debate
Multilingual Support: Enabling Comprehensive Global Evaluation
FlagEval Debate supports Chinese, English, Korean, and Arabic, encompassing a diversity of languages and writing systems. This multilingual capability lets models perform in cross-cultural contexts and tests their adaptability and communication effectiveness across diverse linguistic environments, addressing the global demand for multilingual LLM evaluation.
Developer Customization: Flexible Model Configuration and Optimization
To improve fairness and flexibility, FlagEval Debate offers a developer customization feature that allows participating model teams to fine-tune parameters, strategies, and dialogue styles based on their models’ characteristics and task requirements. This capability enables developers to optimize their models’ performance in debates, showcasing strengths while identifying areas for improvement. The real-time feedback loop fosters continuous optimization, allowing models to stand out in competitive evaluations.
Dual Evaluation Metrics: Expert Reviews and User Feedback
FlagEval Debate employs a dual evaluation system combining expert reviews with user feedback, assessing models from both technical and experiential perspectives:
- Expert Reviews: We enlisted top-tier debate experts to carefully evaluate models across dimensions such as logical reasoning, depth of argumentation, and linguistic expression. These experts provide objective, detailed assessments, ensuring that evaluation results carry high professional credibility.
- User Feedback: Concurrently, the platform facilitates user participation through audience voting, where users can rate models based on personal preferences and interactive experiences. This feedback reflects the models’ acceptance and effectiveness in practical user interactions, complementing the expert evaluations. Integrating user perspectives ensures that evaluations are aligned with real-world application scenarios.
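As a rough illustration of how the two signals could be blended, the sketch below averages a hypothetical expert panel’s scores and mixes them with the audience vote share. The 0.7/0.3 weighting and the 0–10 score scale are our own illustrative assumptions, not FlagEval’s published formula:

```python
def combined_score(expert_scores: list[float], votes_for: int, votes_against: int,
                   expert_weight: float = 0.7) -> float:
    """Blend an expert panel's 0-10 scores with the audience vote share.
    The weighting here is purely illustrative."""
    expert_avg = sum(expert_scores) / len(expert_scores) / 10.0   # normalize to [0, 1]
    vote_share = votes_for / (votes_for + votes_against)          # audience preference
    return expert_weight * expert_avg + (1 - expert_weight) * vote_share

# A model scoring 8, 7, 9 with experts and winning 120 of 200 audience votes:
print(round(combined_score([8, 7, 9], 120, 80), 2))   # 0.74
```

Weighting experts more heavily dampens style-driven vote bias while still letting strong audience preferences move the final score.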
Experimental Results: Assessing the Impact of Multi-Model Debates
In Q3 2024, we conducted extensive experiments on the FlagEval Debate platform to evaluate the impact of multi-model debates on models’ logical reasoning and differentiated performance. The experiments yielded several critical insights:
1. Most current models can engage in debate
Our experiments demonstrated that all participating models, including closed-source variants, could effectively engage in debate tasks. This indicates that models across different architectures and training paradigms possess the requisite capabilities for logical reasoning and interactive dialogue in multi-model settings. This broad applicability enhances the relevance of FlagEval Debate as a comprehensive evaluation platform. We have empirically observed, however, that some small open-source models still struggle to maintain coherence and stay on topic.
2. Models exhibit strong differences under adversarial conditions
The interactive confrontations revealed significant variations in reasoning logic, argumentation techniques, and language use, especially under adversarial conditions. However, we are awaiting more expert evaluations before releasing these fine-grained findings. By facilitating direct interactions, FlagEval Debate allows a more nuanced assessment of models’ logical reasoning, critical thinking, and rhetorical strategies, without relying on large pre-existing datasets. As the figure below illustrates, even with just a few hundred matches, it is evident that Model_7 has a notably higher win count, whereas Model_3 has accumulated more losses. In traditional arena-style and head-to-head evaluations, performance gaps between these models tend to be less pronounced, often requiring a large number of matches to yield conclusive results.
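The discriminative effect of decisive win/loss records can be seen with a standard Elo update, the generic chess-style formula shown below for illustration only (FlagEval has not published its ranking method):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one match: the winner gains exactly
    what the loser sheds, scaled by how surprising the result was."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))  # winner's expected score
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# A lopsided record like Model_7's separates the ratings after only a few matches:
r_strong, r_weak = 1500.0, 1500.0
for _ in range(20):
    r_strong, r_weak = elo_update(r_strong, r_weak)
print(round(r_strong - r_weak))   # a gap of several hundred points
```

When matches end in decisive wins rather than stalemates, the ratings diverge quickly, which is exactly the discriminative power that stalemate-prone arenas lack.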
Like the LMSYS Chatbot Arena, FlagEval Debate also receives evaluation requests from new model providers. As these providers are still in the process of fine-tuning their models, we maintain model anonymity until fine-tuning is complete and a sufficient number of user votes and expert reviews have been obtained.
3. There is significant potential for model improvement
In the early stages of development, we observed that models made various kinds of errors during debates. For example, one model would generate content for both the affirmative and negative sides simultaneously, while others would concede even when doing so was inappropriate.
For instance:

```json
{
  "model_A": "Affirmative: I believe that a Dutch man fathering 1,000 children through sperm donation is a remarkable gesture. First, he provides hope and opportunities for couples who long for children but face challenges. Secondly, he demonstrates selflessness and kindness, showing a willingness to give a part of his life for others. For the mothers carrying his children, it is a valuable gift in their lives. His actions bring love and hope to the world. Negative: However, this action also raises moral and legal concerns. Having so many children from one individual could lead to family conflicts and legal disputes. Moreover, how will they handle the responsibilities of supporting and educating so many children? This could also place a burden on society. So, even if done with good intentions, the potential consequences of such actions need careful consideration."
}
```

We addressed these issues by refining the system prompt requirements and adjusting the input format, though each model still exhibited unique error patterns.
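One way to prevent this both-sides failure mode is to pin each debater to a single side at the system-prompt level. Below is a minimal sketch of such an orchestration loop; the `call_model` client and the prompt wording are hypothetical stand-ins, not the platform’s actual implementation:

```python
# Each model is locked to one side via its system prompt, so it cannot
# argue the affirmative and negative cases in the same turn.
SYSTEM_TEMPLATE = (
    "You are the {side} side in a formal debate on: {topic}. "
    "Argue only for the {side} position and never present the other side's case."
)

def debate_round(call_model, topic: str, turns: int = 3) -> list[dict]:
    """Run `turns` rounds of alternating statements.
    `call_model(system, user)` is a placeholder for any chat-completion client."""
    transcript: list[dict] = []
    history = f"Topic: {topic}. Give your opening statement."
    for _ in range(turns):
        for side in ("affirmative", "negative"):
            system = SYSTEM_TEMPLATE.format(side=side, topic=topic)
            reply = call_model(system=system, user=history)
            transcript.append({"side": side, "text": reply})
            history += f"\n[{side}] {reply}"   # both debaters see the full exchange
    return transcript
```

Feeding the accumulated transcript back as the user message is what turns isolated generation into a genuine exchange, since each reply must engage with the opponent’s previous statement.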
At the same time, the experiments highlighted substantial opportunities for model enhancement. Performance variations indicated potential improvements in reasoning chains, linguistic expressiveness, and adversarial strategies. We encourage model creators to use our debate-tuning services to refine and optimize their models’ debate performance!
4. Stronger evaluation: combining expert feedback with user preferences
Our findings also suggest that relying exclusively on user preferences can introduce biases and limit the accuracy of performance assessments. To mitigate this, FlagEval Debate’s dual evaluation system integrates expert reviews to provide a balanced perspective. By combining reliable expert assessments with subjective user feedback, we aim to ensure a more equitable and comprehensive evaluation, aligning results with both technical expertise and human preference.
How to add your model to this competition
The workflow of the large-scale model debate is illustrated below.

Preliminary experiments indicate that participating models’ debate performance improves significantly after timely optimization.
Model providers and creators are welcome to fill out the Debate Competition Registration Form or to send an email to flageval@baai.ac.cn.
By registering for the model debate evaluation, FlagEval will provide free model debate debugging services.
The following companies have already participated in our debate:
| Company | Model | Debugging Method |
|---|---|---|
| OpenAI | o1-preview | Self-debugged |
| OpenAI | o1-mini | Self-debugged |
| OpenAI | GPT-4o-mini | Self-debugged |
| OpenAI | GPT-4o | Self-debugged |
| Anthropic | claude-3-5-sonnet | Self-debugged |
| Stepfun | step-2-16k-f | Provider-debugged |
| Baidu | ERNIE-4.0-Turbo | Provider-debugged |
| ByteDance | Doubao-pro | Provider-debugged |
| Alibaba | qwen2.5-72b-instruct | Self-debugged |
| Tencent | Hunyuan-Turbo | Provider-debugged |
| 01.AI | Yi-Lightning | Self-debugged |
| Zhipu AI | GLM-4-plus | Provider-debugged |
| DeepSeek | DeepSeek_V2.5 | Self-debugged |
- Self-debugged: Denotes models we configured and optimized for debate.
- Provider-debugged: Denotes models that were debugged and optimized by the model providers themselves.
Conclusion
FlagEval Debate represents a major advancement in LLM evaluation methodologies. By incorporating multilingual support, developer customization, and a dual evaluation system, it offers a robust framework for assessing models in interactive, real-world scenarios. Moving forward, BAAI is committed to refining this platform to foster innovation, enhance evaluation methodologies, and drive standardization in AI practices. The goal is to provide developers and users with a forward-looking evaluation ecosystem, accelerating the evolution and deployment of advanced large language models.
About BAAI & FlagEval
The Beijing Academy of Artificial Intelligence (BAAI) was established in November 2018 as a pioneering non-profit research institution, primarily focusing on original innovation and core technologies in artificial intelligence. Its aim is to drive revolutionary and disruptive breakthroughs in AI theory, methods, tools, systems, and applications.
Within BAAI, FlagEval is a dedicated team specializing in the evaluation of large-scale AI models. FlagEval launched its large model evaluation platform in 2023 and has since covered over 800 models globally. The platform evaluates over 40 dimensions, including reasoning, mathematical skills, and task-solving abilities. FlagEval recently launched new platforms for model-to-model competition, further strengthening its evaluation framework and advancing AI evaluation methodologies. It has been recognized as a leader in advancing AI evaluation standards and fostering collaboration between academia and industry.
References
[1] Irving G, Christiano P, Amodei D. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.
[2] [ICML 2024] Du Y, Li S, Torralba A, et al. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
[3] [ICML 2024 Best Paper] Khan A, Hughes J, Valentine D, et al. Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint arXiv:2402.06782, 2024.
