The International Olympiad in Informatics (IOI) is widely regarded as one of the most prestigious algorithmic programming competitions and serves as a critical benchmark for evaluating the reasoning and problem-solving abilities of large language models (LLMs). Achieving gold-medal performance at the IOI represents a significant milestone in measuring AI competency. While several proprietary models have recently been reported to reach this level, their methods remain undisclosed, limiting reproducibility and progress across the research community.
We're excited to share that, for the first time, an open-weight model, gpt-oss-120b, has achieved gold-medal performance at IOI 2025, operating under the same time, memory, and submission constraints as human contestants, including the 50-submission limit per problem. This milestone was made possible by our transparent and reproducible test-time compute framework, GenCluster. GenCluster is a scalable, multi-stage pipeline that efficiently surfaces the most promising solutions from thousands of candidates generated in parallel, using behavioral clustering and tournament-style ranking to identify the best ones.
Using gpt-oss-120b as the base model, GenCluster achieved a final score of 446.75 at IOI 2025, surpassing the gold-medal threshold of 438.3. This marks the first demonstration of gold-level performance at IOI using an open-weight model, setting a transparent and reproducible benchmark for future research in competitive programming and AI reasoning.
Our experiments reveal a clear scaling trend: larger candidate pools consistently improve both constrained and unconstrained scores. This demonstrates the benefits of scaling test-time compute with GenCluster and provides a promising path to surpass gold-level performance.
How does GenCluster work?
GenCluster operates in four key stages, methodically sifting through thousands of candidate solutions to uncover the most promising ones when only a limited number of final submissions is allowed:
Parallel Candidate Generation
We start by generating thousands of candidate solutions for each problem in parallel. Instead of relying on a single attempt, GenCluster explores a large and diverse pool of possibilities, increasing the chance that at least one correct solution emerges.
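As a concrete illustration, here is a minimal Python sketch of this stage; `generate_solution` is a hypothetical stand-in for a sampled call to a gpt-oss-120b endpoint, not our actual interface:

```python
import concurrent.futures

def generate_solution(problem_statement: str, seed: int) -> str:
    # Hypothetical stand-in: the real pipeline sends the problem statement
    # to gpt-oss-120b with a nonzero sampling temperature, so every call
    # returns a different candidate program.
    return f"// candidate program for seed {seed}"

def generate_candidates(problem_statement: str, n: int = 5000) -> list[str]:
    """Sample n candidate solutions for one problem in parallel."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(generate_solution, problem_statement, s)
                   for s in range(n)]
        return [f.result() for f in futures]
```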
Using gpt-oss-120b, this stage achieves a Score@5000 of 499.51 on IOI 2025, representing the upper bound on what GenCluster can achieve when selecting its 50 submissions per problem.
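For clarity, here is a minimal sketch of how such an oracle upper bound could be computed, assuming every candidate has already been graded on the full test data and using IOI's per-subtask scoring, where the best result on each subtask across submissions is what counts:

```python
def oracle_score(per_candidate_subtask_points: list[list[float]]) -> float:
    """Best achievable score for one problem given a pool of graded candidates.

    per_candidate_subtask_points[i][j] holds the points candidate i earns
    on subtask j; the oracle takes the per-subtask maximum over the pool.
    """
    num_subtasks = len(per_candidate_subtask_points[0])
    return sum(
        max(candidate[j] for candidate in per_candidate_subtask_points)
        for j in range(num_subtasks)
    )
```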
Behavioral Clustering
Next, we group these solutions based on how they behave. We run every candidate against a set of LLM-generated test cases and cluster together those that produce similar outputs. This transforms the chaos of thousands of individual solutions into a manageable set of distinct problem-solving strategies.
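One simple way to realize this step is to group candidates whose outputs match exactly on every generated test, as in the sketch below; `run_program` is a hypothetical helper standing in for a sandboxed execution under contest limits:

```python
import collections

def run_program(code: str, test_input: str) -> str:
    # Hypothetical helper: the real pipeline compiles and runs `code` in a
    # sandbox under time/memory limits and returns its output (or a crash
    # marker), which this placeholder does not attempt to reproduce.
    return ""

def cluster_by_behavior(candidates: list[str], tests: list[str]) -> dict[tuple, list[str]]:
    """Group candidates whose outputs match on every generated test."""
    clusters: dict[tuple, list[str]] = collections.defaultdict(list)
    for code in candidates:
        # The signature is the tuple of outputs over all tests; candidates
        # with identical signatures behave the same on this test set and
        # therefore land in the same cluster.
        signature = tuple(run_program(code, t) for t in tests)
        clusters[signature].append(code)
    return dict(clusters)
```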
Ranking with a Tournament
To find the winning strategy, we hold a tournament. A representative solution from each cluster competes in head-to-head matchups judged by the LLM. Clusters are then ranked by their number of wins, allowing the most promising strategies to rise to the top.
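The ranking itself can be as simple as a round-robin over cluster representatives; in this sketch, `llm_judge` is a hypothetical helper that asks the model which of two solutions it finds more convincing:

```python
import itertools

def llm_judge(solution_a: str, solution_b: str) -> int:
    # Hypothetical helper: prompt the LLM with both solutions and return
    # 0 if it prefers the first, 1 if it prefers the second.
    return 0

def rank_clusters(representatives: list[str]) -> list[int]:
    """Return cluster indices ordered by tournament wins, best first."""
    wins = [0] * len(representatives)
    # Every pair of cluster representatives plays one head-to-head match.
    for i, j in itertools.combinations(range(len(representatives)), 2):
        winner = i if llm_judge(representatives[i], representatives[j]) == 0 else j
        wins[winner] += 1
    return sorted(range(len(representatives)), key=wins.__getitem__, reverse=True)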
Submission Strategy
Finally, we employ a round-robin submission strategy to make the most of IOI's strict 50-attempt limit per problem. Solutions from top-ranked clusters are submitted one at a time, starting with the hardest subtasks. Within each cluster, solutions are ranked and selected by the length of their reasoning trace. This structured strategy ensures that the strongest candidates are evaluated first, maximizing performance while making efficient use of every submission.
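A minimal sketch of this submission loop, assuming each cluster's solutions are already sorted by reasoning-trace length as described above:

```python
def submission_order(ranked_clusters: list[list[str]], limit: int = 50) -> list[str]:
    """Interleave solutions from ranked clusters up to the submission budget."""
    order: list[str] = []
    depth = 0
    while len(order) < limit:
        took_any = False
        # Round-robin: take the next untried solution from each cluster,
        # visiting clusters in rank order so top strategies go first.
        for cluster in ranked_clusters:
            if depth < len(cluster):
                order.append(cluster[depth])
                took_any = True
                if len(order) == limit:
                    return order
        if not took_any:  # every cluster exhausted before hitting the limit
            break
        depth += 1
    return order
```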
What is the best open-weight model for IOI 2025?
We evaluated several leading open-weight models on competitive programming benchmarks and found that gpt-oss-120b achieved the highest score by a large margin. It is the only model capable of reaching gold-medal performance when scaled to 5,000 generations per problem. Furthermore, the gpt-oss family shows stronger gains as the number of generations increases, suggesting that it scales more effectively with test-time compute. In contrast, while Qwen3-235B-A22B-Thinking outperforms gpt-oss-20b and DeepSeek-R1-0528 at smaller generation budgets, its performance scales less favorably at larger ones.
Impact of the Maximum Number of Tokens
Previous studies have shown that longer reasoning traces often correlate with higher accuracy on complex problems, and our results confirm this trend. When we varied the maximum generation length, the gpt-oss models kept improving all the way up to their token limits, while Qwen3-235B-A22B plateaued around 48K tokens, well below the 80K length recommended by its authors. Interestingly, the gpt-oss models not only produced longer, more detailed reasoning traces but also delivered the strongest overall performance, outperforming DeepSeek-R1-0528 and Qwen3-235B-A22B once larger compute budgets were applied.
Conclusion
Our results show that open-weight models, when combined with a scalable test-time compute framework, can approach the performance of leading closed systems on the IOI benchmark. By introducing a fully reproducible pipeline built entirely on open-weight models, we aim to make advanced reasoning research more transparent, accessible, and verifiable. We hope this work inspires future efforts that leverage test-time compute to further scale open models and push the boundaries of algorithmic problem-solving.




