Evaluating the quality of AI-generated code is notoriously difficult. While humans can easily spot whether a piece of code “looks right,” determining whether it actually works correctly, handles edge cases properly, and produces the intended result requires running and testing it. That is why today, we’re thrilled to announce BigCodeArena, the first human-in-the-loop platform for evaluating code generation models through execution.
Inspired by LMArena for LLMs, we have built a platform that lets anyone compare code generation models side by side, but with an important difference: you can actually run the code and see what it produces. Just submit a coding task, watch two different models generate solutions, execute both programs, and vote on which model produced the better result. The votes are aggregated into a leaderboard that reflects the community’s preferences.
Motivation
The field of code generation has long struggled with reliable evaluation methods. Traditional benchmarks like HumanEval test code against predefined test cases, but these represent only a tiny fraction of real-world programming tasks. Human evaluation platforms exist for general chatbots, but they fall short for code: reading raw source code and mentally simulating its execution is cognitively demanding and error-prone, especially for longer programs or complex UI applications.
Consider this scenario:
You ask two AI models to build a responsive photo gallery website. Both generate code that looks syntactically correct. But which one is actually better? Without running the code, it is nearly impossible to tell. One might produce a beautiful, functional grid layout, while the other might have subtle bugs or poor styling that only become apparent when rendered in a browser.
This observation led us to a key insight: execution feedback is essential for humans to judge code quality reliably. That is exactly what BigCodeArena provides.
The BigCodeArena Platform
BigCodeArena extends the Chatbot Arena framework with powerful features specifically designed for code evaluation:
Real-Time Execution
Every code snippet generated by the models is automatically executed in an isolated sandbox environment. Whether it is a Python script, a React web app, a PyGame game, or a C++ algorithm, you can see the actual output, not just the source code.
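To make the idea concrete, here is a minimal sketch of executing an untrusted Python snippet in a separate process with a timeout. It only illustrates the execution-feedback loop; BigCodeArena's actual sandboxes add far stronger isolation, and the helper name `run_python_snippet` is hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_python_snippet(code: str, timeout_s: int = 30) -> dict:
    """Run a generated Python snippet in its own process and capture the output.

    Illustrative stand-in for a real sandbox: a subprocess with a timeout, without
    the filesystem/network restrictions a production sandbox would enforce.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(textwrap.dedent(code))
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
            return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}
        except subprocess.TimeoutExpired:
            return {"stdout": "", "stderr": f"timed out after {timeout_s}s", "exit_code": -1}

print(run_python_snippet("print(sum(range(10)))"))  # {'stdout': '45\n', 'stderr': '', 'exit_code': 0}
```

Showing voters this structured output (stdout, stderr, exit code), or a rendered page for web apps, is what turns "the code looks plausible" into "the code actually works."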
Multi-Language & Framework Support
We currently support 10 languages (Python, JavaScript, TypeScript, HTML, C, C++, Java, Go, Rust, and Markdown) and eight execution environments (a rough routing sketch follows the list):
- Web Frameworks: React, Vue, Core Web (vanilla HTML/CSS/JS)
- Python Frameworks: Streamlit, Gradio, PyGame
- Diagrams: Mermaid
- General Purpose Interpreters: Python and JavaScript code interpreters, plus compiled language runners
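As a rough illustration of how a generated snippet might be routed to one of these environments, the mapping below pairs environment names with the kind of entry point and run command involved. The names and commands are assumptions for illustration, not BigCodeArena's actual configuration.

```python
# Hypothetical routing table: environment name -> how a generated snippet might be run.
# The entry points and commands are illustrative assumptions, not the platform's real config.
EXECUTION_ENVIRONMENTS = {
    "react":       {"language": "javascript", "entrypoint": "App.jsx",     "run": "npm run dev"},
    "vue":         {"language": "javascript", "entrypoint": "App.vue",     "run": "npm run dev"},
    "core_web":    {"language": "html",       "entrypoint": "index.html",  "run": "python -m http.server"},
    "streamlit":   {"language": "python",     "entrypoint": "app.py",      "run": "streamlit run app.py"},
    "gradio":      {"language": "python",     "entrypoint": "app.py",      "run": "python app.py"},
    "pygame":      {"language": "python",     "entrypoint": "main.py",     "run": "python main.py"},
    "mermaid":     {"language": "markdown",   "entrypoint": "diagram.mmd", "run": "mmdc -i diagram.mmd -o out.svg"},
    "interpreter": {"language": "python",     "entrypoint": "main.py",     "run": "python main.py"},
}

def pick_environment(language: str, framework: str | None = None) -> str:
    """Very rough heuristic: prefer an explicit framework, otherwise fall back to an interpreter."""
    if framework in EXECUTION_ENVIRONMENTS:
        return framework
    return "interpreter" if language in ("python", "javascript") else "core_web"
```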
Interactive Testing
Unlike static code comparison, you can actually interact with the generated applications:
- Click buttons and test UI elements in web apps
- Play the games generated by models
- Edit the code and re-run it to check modifications
- View visual outputs like plots, charts, and diagrams
Multi-Turn Conversations
Real programming is not one-and-done. BigCodeArena supports multi-turn interactions, so you can refine requirements, ask for new features, or request bug fixes, just like working with a real coding assistant.
What We've Learned: 5 Months of Community Evaluation
Since launching in February 2025, BigCodeArena has collected over 14,000 conversations from more than 500 unique users, with 4,700+ high-quality preference votes comparing 10 frontier LLMs.
Programming Topics in the Wild
Our users have explored remarkably diverse coding scenarios:
- Web Design (36%): Building responsive websites, interactive dashboards, and web applications
- Problem Solving (23%): Algorithms, data structures, and computational challenges
- Game Development (16%): Creating interactive games with physics, collision detection, and graphics
- Scientific Computing (14%): Data analysis, visualization, and numerical simulations
- Creative Coding (8%): Artistic visualizations, generative art, and experimental interfaces
- Diagram Creation (3%): Flowcharts, system architectures, and data visualizations
Language and Framework Popularity
Python dominates with over 4,000 conversations, followed by JavaScript/TypeScript (3,359), HTML (1,601), and C++ (642). Among frameworks, direct Python interpreters lead usage (6,000 sessions), with React (2,729), Core Web (1,574), Streamlit (1,254), and PyGame (1,087) also seeing heavy use.
User Interaction Patterns
Most interactions are focused and efficient: 76% of conversations consist of just 2 turns (one request, one response), with an average conversation length of 4.12 messages. However, the platform also supports extended multi-turn debugging sessions when needed, with some conversations exceeding 10 turns as users refine complex applications.
Model Rankings from Community Votes
From our 14K conversations, we filtered for high-quality pairwise comparisons: conversations with at least two turns and actual code execution. This yielded 4,731 voting samples, with each evaluated model receiving at least 700 votes. We aggregate these votes into Elo rankings using the Bradley-Terry model, which estimates the probability that one model beats another based on head-to-head comparisons.
To ensure robust rankings, we use 100 bootstrap resamples to construct 95% confidence intervals, so we can identify statistically significant performance differences between models.
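For readers who want to reproduce this kind of ranking, here is a minimal sketch that fits Bradley-Terry strengths to pairwise votes via logistic regression (the standard arena-style formulation) and bootstraps confidence intervals. It mirrors the general recipe rather than our exact pipeline, and the data layout is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles, models, scale=400, base=10, anchor=1000):
    """Fit Bradley-Terry strengths from pairwise battles and map them to an Elo-like scale.

    `battles` is a list of (model_a, model_b, winner) tuples with winner in {"a", "b"};
    ties are assumed to be dropped upstream.
    """
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for row, (a, b, winner) in enumerate(battles):
        X[row, idx[a]] = +np.log(base)   # model_a gets positive weight
        X[row, idx[b]] = -np.log(base)   # model_b gets negative weight
        y[row] = 1.0 if winner == "a" else 0.0
    clf = LogisticRegression(fit_intercept=False, C=1e6)  # near-unregularized maximum likelihood
    clf.fit(X, y)
    return dict(zip(models, scale * clf.coef_[0] + anchor))

def bootstrap_intervals(battles, models, n_rounds=100, seed=0):
    """95% confidence intervals from bootstrap resamples of the battle set."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_rounds):
        picks = rng.integers(0, len(battles), len(battles))
        samples.append(fit_bradley_terry([battles[i] for i in picks], models))
    return {
        m: (np.percentile([s[m] for s in samples], 2.5),
            np.percentile([s[m] for s in samples], 97.5))
        for m in models
    }
```

A common reading of the result: two models are treated as clearly separated when their bootstrap intervals do not overlap.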
We evaluate models under three settings to control for confounding factors (a filtering sketch follows the list):
- All Data: Uses all pairwise comparisons regardless of execution environment or programming language
- Environment Matched: Only compares models when both were executed in the same sandbox (e.g., both in React or both in PyGame)
- Language Matched: Further restricts comparisons to the same programming language
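As a concrete illustration of how these subsets can be carved out of the vote table, here is a small pandas sketch; the column names (`env_a`, `env_b`, `lang_a`, `lang_b`, and so on) are assumed for illustration rather than taken from our released schema.

```python
import pandas as pd

def build_comparison_sets(votes: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Split pairwise votes into the three evaluation settings.

    Assumes one row per vote with hypothetical columns:
    model_a, model_b, winner, env_a, env_b, lang_a, lang_b.
    """
    env_matched = votes[votes["env_a"] == votes["env_b"]]
    lang_matched = env_matched[env_matched["lang_a"] == env_matched["lang_b"]]
    return {
        "all_data": votes,
        "environment_matched": env_matched,
        "language_matched": lang_matched,
    }
```

Each subset is then fed through the same Bradley-Terry fitting step shown above, which is what lets us compare rankings across the three settings.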
Rankings remain remarkably consistent across all three settings, revealing clear performance tiers:
Top Tier: o3-mini and o1-mini consistently lead with the highest Elo ratings and tight confidence intervals. These models maintain top performance regardless of environment or language constraints, showing strong robustness across coding scenarios. Claude-3.5-Sonnet follows closely, particularly excelling when language is controlled.
Mid Tier: GPT-4o, o1, and Gemini-2.0-Pro/Flash form a competitive middle tier. GPT-4o shows some sensitivity to language matching, suggesting room for improvement in multilingual consistency.
Open Source Models: Qwen2.5 variants and Llama-3.3-70B lag behind frontier proprietary models, highlighting the performance gap that remains between leading closed and open models.
Figure: Overall win-rate heatmaps (percentage of all pairwise comparisons won) for each model across languages (left) and execution environments (right). For each category, we only keep models that appear in at least 3 conversation sessions.
Performance Across Languages
Breaking down performance by programming language reveals interesting patterns:
- Top-tier models like o3-mini and o1-mini achieve dominant win rates in mainstream languages like Python, Java, and C++
- Gemini-2.0-Pro shows particular strength in Rust, achieving the highest win rate in that category
- Different models exhibit distinct areas of expertise, with frontier models excelling in different niches
- Open models like Qwen2.5 variants show inconsistent performance, particularly struggling with Rust and Go
Performance Across Execution Environments
Analyzing win rates by execution environment reveals how models handle different runtime contexts:
Robust Performers: o3-mini maintains consistently strong performance across React, Streamlit, Gradio, Core Web, and PyGame, demonstrating excellent environmental adaptability.
Stable but Selective: Claude-3.5-Sonnet and Gemini-2.0-Flash show generally stable performance but with reduced win rates in complex UI-heavy environments like Vue and Mermaid.
Framework-Specific Weaknesses: Qwen2.5 models, while competitive in some web frameworks (Core Web, React), struggle significantly with interactive and visualization-oriented environments like PyGame, Vue, and Mermaid. These environments often require precise handling of control flow, graphics rendering, and package dependencies.
These results highlight an important insight: aggregate Elo scores don't tell the whole story. Some models remain brittle under specific runtime constraints, and the execution environment matters significantly for real-world deployment.
Two New Benchmarks: BigCodeReward and AutoCodeArena
To advance research beyond crowdsourced evaluation, we’re releasing two complementary benchmarks:
BigCodeReward: Evaluating Reward Models for Code
Building on our 4,700+ preference votes, BigCodeReward tests how well LLMs can judge code quality when acting as reward models. The key finding? Execution results dramatically improve judgment accuracy.
When models can see execution outputs (screenshots of web apps, game visuals, program logs), their alignment with human preferences increases substantially:
- Claude-Sonnet-4: 56.7% → 62.3% accuracy
- GPT-4o: 54.6% → 63.8% accuracy
- Qwen2.5-VL-72B: 58.7% → 66.2% accuracy
This reinforces our core thesis: you cannot reliably judge code without running it, and this applies to both humans and AI judges.
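For concreteness, here is a minimal sketch of how judge-human agreement can be scored in both settings (code only versus code plus execution outputs). The data layout and the `judge_fn` callable are assumptions; any LLM judge that returns "a" or "b" fits this interface.

```python
def judge_agreement(examples, judge_fn, use_execution: bool = True) -> float:
    """Fraction of pairwise examples where the judge picks the human-preferred response.

    `examples` is assumed to be a list of dicts with keys: prompt, response_a,
    response_b, execution_a, execution_b (logs / screenshot paths), human_winner ("a"/"b").
    Call with use_execution=False to measure the code-only setting.
    """
    correct = 0
    for ex in examples:
        verdict = judge_fn(
            prompt=ex["prompt"],
            response_a=ex["response_a"],
            response_b=ex["response_b"],
            execution_a=ex["execution_a"] if use_execution else None,
            execution_b=ex["execution_b"] if use_execution else None,
        )
        correct += int(verdict == ex["human_winner"])
    return correct / len(examples)
```

Running the same judge twice, once with `use_execution=False` and once with `use_execution=True`, is the comparison behind the accuracy gains reported above.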
AutoCodeArena: Automated Code Generation Benchmarks
Inspired by Arena-Hard-Auto, AutoCodeArena provides a scalable way to evaluate new models without waiting for thousands of human votes. We carefully selected 600 representative prompts from our crowdsourced data, spanning all programming topics and frameworks.
Using an automated LLM judge (Claude-3.7-Sonnet) to evaluate code execution results against a GPT-4.1 baseline, we can rapidly benchmark new models. This approach enables weekly leaderboard updates as new models are released.
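The loop behind such an automated arena is simple to sketch. The prompt wording and the `call_judge` helper below are illustrative assumptions, not the exact AutoCodeArena implementation.

```python
JUDGE_PROMPT = """You are judging two solutions to the same coding task.
Task: {task}

Solution A (candidate) and its execution result:
{result_a}

Solution B (baseline) and its execution result:
{result_b}

Which solution better satisfies the task? Answer with exactly "A", "B", or "TIE"."""

def win_rate_vs_baseline(tasks, candidate_results, baseline_results, call_judge) -> float:
    """Score a candidate model against the baseline with an LLM judge; ties count as half a win.

    `call_judge` is any callable that sends a prompt to the judge model
    (e.g., Claude-3.7-Sonnet) and returns its text reply.
    """
    wins = 0.0
    for task, res_a, res_b in zip(tasks, candidate_results, baseline_results):
        reply = call_judge(JUDGE_PROMPT.format(task=task, result_a=res_a, result_b=res_b))
        verdict = reply.strip().upper()
        if verdict.startswith("TIE"):
            wins += 0.5
        elif verdict.startswith("A"):
            wins += 1.0
    return wins / len(tasks)
```

A production pipeline would also swap the A/B positions between two judging passes to control for position bias; the sketch omits that for brevity.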
Our automated benchmark evaluated 20+ cutting-edge models, including recently released systems:
Top Performers:
- GPT-5: establishes a new state of the art by a large margin
- Claude-Opus-4 and Claude-Sonnet-4: a strong second tier, excelling in reasoning-heavy tasks
- Qwen3-Coder, Kimi-K2, GLM-4.5: leading open models that narrow the gap with mid-tier proprietary systems
Figure: Win rates of recent LLMs on AutoCodeArena against the GPT-4.1 baseline, judged by Claude-3.7-Sonnet. The 50% mark represents parity with GPT-4.1: models above this line outperform the baseline, while those below underperform. Error bars show 95% confidence intervals. Note: Claude-3.7-Sonnet is excluded from the rankings to avoid self-judgment bias, and GPT-4.1 appears only as the reference baseline.
The results show that while proprietary models maintain an edge, open-source models are rapidly closing the gap, with some approaching GPT-4.1-level performance.
Try It Yourself
BigCodeArena is open to everyone — no account required! Visit https://huggingface.co/spaces/bigcode/arena to:
- Compare code from the latest frontier LLMs (e.g., Qwen3, DeepSeek-V3.X, and proprietary models)
- Test web apps, games, visualizations, and algorithms
- See real execution results, not just source code
- Vote on your preferences to help improve the leaderboard
- Explore multi-turn coding conversations
Whether you are building a React dashboard, creating a PyGame game, solving algorithmic challenges, or generating creative visualizations, BigCodeArena lets you see which models truly deliver.
Open Source Everything
Following the BigCode Project’s commitment to transparency, we’re releasing:
- Codebase: Full evaluation pipelines and Gradio application source (GitHub)
- Crowdsourced Data: 14K raw conversations and 4.7K preference votes (HuggingFace Collection)
- Benchmarks: BigCodeReward and AutoCodeArena datasets
What’s Next?
We envision BigCodeArena as a long-term project that evolves with the community:
- Expanded Language Support: More programming languages and frameworks
- Live Benchmarks: Regularly refreshed evaluation prompts to prevent overfitting
- Agent-Based Evaluation: Using AI agents to interact with web apps for deeper testing
- Better Reward Models: Advancing automated code quality assessment
- Community Contributions: We welcome new execution environments, evaluation criteria, and model additions. PRs are always welcome!
Conclusion
Evaluating code is different from evaluating text: you need to run it, test it, and interact with it. BigCodeArena makes this possible at scale, combining human judgment with real execution feedback to create the most reliable evaluation platform for code generation models.
Join us in building the future of code generation evaluation. Write a prompt, compare the models, and vote for your favorite. Your feedback helps the entire community understand which models truly deliver on the promise of AI-assisted programming.
We would love to hear your feedback! Connect with us on GitHub, join the discussions in the Hugging Face Space community tab, or reach out to the BigCode Project at contact@bigcode-project.org.
Acknowledgements
We thank Leandro von Werra for his valuable suggestions and feedback on the blog.
Citation
@article{zhuo2025bigcodearena,
title={BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution},
author={Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra},
year={2025}
}
Try BigCodeArena now: Hugging Face Space
Read the paper: Hugging Face
Run the code: GitHub
Explore the collection: Hugging Face Collection