Over the past few weeks, we've focused our efforts on reproducing the competitive programming (code reasoning) parts of the DeepSeek-R1 recipe.
In this post, we're excited to share:
- The development of CodeForces-CoTs: a dataset of nearly 100k high-quality samples distilled from R1 to produce solutions in C++ and Python.
- The IOI benchmark: a new benchmark of challenging problems from the 2024 International Olympiad in Informatics (IOI).
- OlympicCoder: two fine-tuned 7B and 32B code models that outperform closed-source frontier models like Claude 3.7 Sonnet on IOI problems.
Here's an overview of how the OlympicCoder models stack up against various instruction-tuned and reasoning models. We find that training models on CodeForces-CoTs produces top-tier performance, with OlympicCoder-32B outperforming all open-weight models we tested, including some that are over 100x larger 🤯.
Read on to learn how we built the dataset, benchmark, and models!
Key links
CodeForces
International Olympiad in Informatics (IOI)
OlympicCoder
CodeForces-CoTs Dataset
CodeForces is one of the most popular websites among competitive programmers, hosting regular contests where participants must solve challenging algorithmic optimization problems. The difficulty of these problems makes them an interesting dataset for improving and testing models' code reasoning capabilities.
While previous efforts such as DeepMind's CodeContests dataset have compiled a large number of CodeForces problems, today we're releasing our own open-r1/codeforces dataset, with more than 10k problems covering the very first contests all the way up to 2025, ~3k of which were not included in DeepMind's dataset. Additionally, for around 60% of the problems, we have included the editorial, an explanation of the correct solution written by the contest organizers. You will also find 3 correct solutions per problem extracted from the official website.
Moreover, we're releasing open-r1/codeforces-cots, which contains chain-of-thought generations produced by DeepSeek-R1 on these problems, where we asked the model to produce solutions in C++ (the main language used in competitive programming) and Python, totaling close to 100k samples.
We fine-tuned Qwen2.5 Coder Instruct 7B and 32B on this dataset, resulting in our OlympicCoder-7B and OlympicCoder-32B models. You can find more details on these models later in the post.
Code verifiability crisis
While datasets like DeepMind's CodeContests and others containing competitive programming problems include test cases and claim to be verifiable, these test cases are often only a small subset of the full suite used on contest websites. CodeForces, in particular, caps displayed test cases at ~500 characters, which means that these datasets only contain the shorter, easier test cases that fit within this limit.
For example, we took 7 problems for which the R1-generated solution passed all the public test cases and tried submitting them to the CodeForces platform:
Although they passed the shorter tests, every one of these solutions failed on the full test set. This underscores the need for a new, fully verifiable competitive programming dataset. While we plan to try model-based solutions to generate and validate additional challenging tests that could be added to our CodeForces dataset in the future, for now we went looking for fully available problem data elsewhere.
International Olympiad in Informatics (IOI)
The International Olympiad in Informatics (IOI) is one of the five international science olympiads (if you are familiar with AIME, IOI is the programming equivalent of the IMO, to which the top students who participate in AIME are invited) and tests a very select group of high school students (4 per country) on complex algorithmic problems.
The problems are extremely challenging, and the full test sets are available and released under a permissive (CC-BY) license. This makes IOI the perfect dataset to test a model's code reasoning capabilities.
In IOI, each problem has several subtasks, each with different input constraints. To solve a subtask, a submission must pass all of its test cases within the (strict) time limits. While the final subtask is usually the "full problem", some subtasks effectively describe a much easier (more constrained) problem, and contestants quite often target specific subtasks to earn partial scores instead of just trying to solve the full problem (perfect scores are relatively rare).
Following a recent OpenAI paper in which o1 competed live at IOI'2024 (the most recent edition), we similarly processed all problems from IOI'2024 (as well as previous IOIs going back to 2020) and split them into their subtasks, such that each prompt asks models to solve one specific subtask. We release the processed problem statements, along with all the grading/checker files required to run them and the test cases, in open-r1/ioi and open-r1/ioi-test-cases.
We created custom code to run solutions (many problems have complicated setups, requiring a "manager" process communicating with several processes running the user submission, and special checkers to validate solutions) and to grade according to IOI rules, which is available at https://github.com/huggingface/ioi, and evaluated over 40 leading reasoning models on IOI'2024.
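To make the grading rules concrete, here is a minimal sketch of how a problem score could be aggregated under IOI rules. The data structures are assumptions for illustration; the actual implementation lives in the huggingface/ioi repo.

```python
# Minimal sketch of IOI-style score aggregation (assumed data structures, not the
# actual huggingface/ioi implementation). A submission earns a subtask's points
# only if it passes *all* of that subtask's test cases within the time limits.

def problem_score(subtask_points: dict[str, int], submissions: list[dict[str, bool]]) -> int:
    """subtask_points: points per subtask, e.g. {"01": 10, "02": 25, "03": 65}
    submissions: one dict per submission, mapping subtask -> passed all its tests?"""
    total = 0
    for subtask, points in subtask_points.items():
        # The result for a subtask is the best achieved across all submissions,
        # so solving different subtasks in different submissions adds up.
        if any(sub.get(subtask, False) for sub in submissions):
            total += points
    return total

# Two submissions that each solve only easier subtasks still earn partial credit.
print(problem_score({"01": 10, "02": 25, "03": 65}, [{"01": True}, {"01": True, "02": True}]))  # 35
```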
At IOI, contestants have a 50-submission limit per problem. We generated 50 submissions for each subtask and then applied a selection strategy similar to the one used by OpenAI to obtain scores under contest conditions for each problem. Results can be found below, where the horizontal lines represent the thresholds for bronze, silver, and gold medals from real contest data. While o1 comes very close to bronze, ultimately no model would reach the medal threshold (50th percentile of contestants).
Our OlympicCoder models (in red) do quite well compared to other frontier models, even surpassing some closed-source models (in gold) such as claude-3.7-sonnet-thinking, and OlympicCoder-32B even outperforms o1-mini and DeepSeek-R1, the model we distilled from, in the 50-submission-limit setting.
Submission strategy
An important caveat is that our submission strategy may penalize non-reasoning models, such as Qwen2.5-Coder-32B-Instruct, OlympicCoder-32B's base model. To simulate real contest conditions, in which a submission's score is only known after it is actually submitted, we employed a round-robin submission strategy similar to the one used by OpenAI for o1-ioi: we start by submitting a solution that targets the last subtask of the problem, then one targeting the second-to-last, and so on, only evaluating a solution when it is picked for submission. We skip over submissions targeting subtasks that have already been solved by previously selected submissions, and within each targeted subtask we prefer submissions from longer generations, a criterion that makes sense for reasoning models but less so for others.
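Here is a minimal sketch of that selection loop. The data structures (and the `evaluate` callback) are hypothetical; the actual evaluation code is in the huggingface/ioi repo.

```python
# Sketch of the round-robin submission strategy described above (illustrative only).
def select_submissions(candidates: dict[int, list[dict]], evaluate, limit: int = 50) -> list[dict]:
    """candidates: subtask index -> list of generated solutions, each a dict with
    a "generation_length" field (hypothetical layout).
    evaluate: callback returning the set of subtask indices a submission solves."""
    # Within each targeted subtask, prefer longer generations first.
    for subtask in candidates:
        candidates[subtask].sort(key=lambda s: s["generation_length"], reverse=True)

    solved: set[int] = set()
    selected: list[dict] = []
    subtask_order = sorted(candidates, reverse=True)  # last (hardest) subtask first
    while len(selected) < limit and any(candidates[st] for st in subtask_order):
        for subtask in subtask_order:  # one submission per targeted subtask per round
            if len(selected) >= limit:
                break
            if subtask in solved or not candidates[subtask]:
                continue  # skip already-solved or exhausted subtasks
            submission = candidates[subtask].pop(0)
            selected.append(submission)        # "submit" it...
            solved |= evaluate(submission)     # ...and only then learn what it solves
    return selected
```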
If we remove the 50-submission limit (which would place us outside the contest conditions) and evaluate all the submissions we generated (50 per subtask), we obtain the following results:
Lessons learned from training code models on R1 traces
While creating the OlympicCoder models, we ran many SFT experiments to understand the role of various filters applied to the CodeForces dataset. We found that the following subsets of open-r1/codeforces-cots gave the best overall performance:
- `solutions`: R1-generated solutions given the problem statement.
- `solutions_w_editorials`: R1-generated solutions given the problem statement and an editorial explaining the correct solution.
Note that we only focused on the C++ solutions, but mixing in the Python solutions would likely boost performance even further.
We used LiveCodeBench as a test bed for our models and then ran the best-performing checkpoints through the much harder IOI benchmark. We tested various hyperparameter configurations to train our models and settled on the following (a configuration sketch is shown after the list):
- Models: Qwen2.5 Coder Instruct 7B and 32B
- Epochs: 10
- Effective batch size: 128
- Learning rate: 4e-5
- Scheduler: Cosine with a decay to 10% of the height learning rate
- Context size: 32,768 tokens for 7B and 22,528 tokens for 32B
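For reference, these settings roughly translate into the following TRL configuration. This is an illustrative sketch rather than our exact training script (which lives in the open-r1 repo), and argument names such as the sequence-length field can vary across trl versions:

```python
from trl import SFTConfig

# Illustrative sketch of the hyperparameters listed above; the effective batch
# size of 128 comes from per-device batch size x gradient accumulation x GPUs.
training_args = SFTConfig(
    output_dir="OlympicCoder-7B",
    num_train_epochs=10,
    learning_rate=4e-5,
    lr_scheduler_type="cosine_with_min_lr",    # cosine decay...
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # ...down to 10% of the peak learning rate
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,            # e.g. 1 x 16 x 8 GPUs = 128
    max_seq_length=32768,                      # 22,528 for the 32B model
    packing=False,                             # see Lesson 1 below
    bf16=True,
)
```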
Below we share some lessons we learned from tuning the Qwen2.5 Coder models on R1 reasoning traces.
Lesson 1: Sample packing hurts performance for reasoning
Sample packing is a widely used technique to efficiently process variable-length sequences and speed up training. As shown in the figure below, it works by concatenating training samples (colored) into equal-sized chunks, which removes the need for padding tokens (gray) across batches:
With packing, samples can overlap across the boundaries of each chunk, but in practice this doesn't matter much if most of the samples are far smaller than the chunk size.
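For intuition, here is a simplified sketch of what packing does to tokenized samples (real implementations, e.g. in trl, handle EOS tokens and attention masks more carefully):

```python
# Simplified illustration of sample packing: tokenized samples are concatenated
# into one stream and sliced into equal-sized chunks, so no padding is needed,
# but a sample can end up split across two chunks.
def pack(tokenized_samples: list[list[int]], chunk_size: int) -> list[list[int]]:
    stream = [tok for sample in tokenized_samples for tok in sample]
    return [stream[i:i + chunk_size] for i in range(0, len(stream), chunk_size)]

samples = [[1] * 5, [2] * 9, [3] * 4]
print(pack(samples, chunk_size=6))
# [[1, 1, 1, 1, 1, 2], [2, 2, 2, 2, 2, 2], [2, 2, 3, 3, 3, 3]]
```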
However, for the reasoning traces we distilled from R1, we wondered whether packing could hurt performance, because many traces are long and the fraction of clipped answers is likely to be high. This means it may be difficult to train the model to attend to long-context information, particularly when the question and answer have been packed into different chunks.
As shown in the striking figure below, we found that packing substantially hurts the model's performance: with packing, the model is barely able to solve problems from LiveCodeBench, but without packing, performance improves over several epochs before plateauing:
We suspect this stark difference is accentuated by the fact that our training sets only consist of C++ solutions, while LiveCodeBench only evaluates performance on Python. Nonetheless, across all the datasets we ablated, packing was consistently worse.
Lesson 2: Use large learning rates for best performance
For most SFT experiments we have run with Qwen models, a learning rate of 2e-5 is usually enough to get strong performance. However, when applying SFT with reasoning data to an existing instruct model, we found that significant performance gains can be had by using a much larger learning rate of 4e-5. As shown in the figure below, each doubling of the learning rate amounted to almost a 10-point improvement on LiveCodeBench!
Lesson 3: Including the editorials doesn’t boost performance
When we created the `solutions_w_editorials` subset of open-r1/codeforces-cots, our intuition was that by providing R1 with both the problem and its editorial, we would obtain better reasoning traces. Surprisingly, this intuition was not borne out during evaluation: naive sampling from the problem statements alone gave a modest yet consistent boost during training:
Lesson 4: Prefill with <think> to consistently enable long CoT
While vibe-testing some of our models, we encountered a peculiar phenomenon: if prompted with in-domain queries (like competitive coding), the model would emit the expected long CoT wrapped in <think> and </think> tokens, followed by the final answer. However, for out-of-domain queries like "What is the capital of Switzerland?", the model would revert to the behaviour of the original instruct model!
Here's an example of a simple query that fails to elicit the long CoT:
User: What’s 1+1?
Assistant: 2. It's one of the simplest mathematical addition problems. 1 + 1 equals 2. This problem is often used as a riddle or puzzle, particularly in job interviews or online tests, to determine a candidate's ability to solve problems quickly. While it may seem obvious to most people, it can be a useful question for employers to assess critical thinking and problem-solving skills. Additionally, it's an important concept in mathematics and forms the basis for more complex arithmetic operations. So next time someone asks you what 1+1 is, you can confidently state that it's 2!
```python
def add_numbers(a, b):
    return a + b

result = add_numbers(1, 1)
print("The sum of 1 + 1 is:", result)
```

In this solution, we define a function `add_numbers` that takes two parameters `a` and `b` and returns their sum. We then call this function with 1 and 1, storing the result in the variable `result`, which is then printed. This demonstrates a simple way to compute and display the sum of two integers. The example serves as a straightforward yet effective illustration of basic arithmetic operations in Python.
However, if we prefill the assistant's response with <think>, the desired behaviour emerges:
User: What’s 1+1?
Assistant: <think>
Okay, I need to answer the question "What is 1+1?" Let's see. Oh right, 1 plus 1 equals 2. But wait, maybe there's more to it. Like, what if someone asks differently? Or maybe considering other number bases? Like in base 10 it's 2, but in binary it's 10. Hmm, but the question is straightforward. The user probably expects the simple answer. So the correct answer here is 2. Should I just state that? Yes, 1 + 1 equals 2. Is there anything else you would like to know about this calculation?
This suggests that when distilling reasoning traces into an existing post-trained model, one should enforce prefilling by default in the final model's chat template. Incidentally, this is also what DeepSeek does with their distilled models, and the behaviour above may be the reason why. (It's also amusing to speculate whether hitting the "think" button on popular chat UIs is simply prefilling the assistant response 🙂).
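As a hypothetical illustration of what prefilling looks like at inference time (the released OlympicCoder chat template already takes care of this; the snippet just shows the idea):

```python
from transformers import AutoTokenizer

# Force a long CoT by prefilling the assistant turn with the <think> token.
tokenizer = AutoTokenizer.from_pretrained("open-r1/OlympicCoder-7B")
messages = [{"role": "user", "content": "What is 1+1?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
if not prompt.rstrip().endswith("<think>"):
    prompt += "<think>\n"  # generation now starts inside the reasoning block
```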
Combining all these lessons produced OlympicCoder-7B, which matches the performance of DeepSeek's distilled model and significantly improves on the base Qwen2.5-Coder one:
Lesson 5: Use 8-bit optimisers to scale large models with long context
During the training of OlympicCoder-7B, we found that DeepSpeed ZeRO-3 was sufficient to train the model with 32k context on a single node of 8 x H100s. However, when we tried scaling up our recipe to 32B, we ran into a host of memory issues. In particular, our runs would OOM once the context was scaled beyond 20k tokens, even on 16 nodes 😢. This wasn't ideal, as 20% of the CodeForces-CoTs traces are longer than 20k tokens, which means they would get truncated during training.
The root of the problem is that transformers and trl don't yet support context parallelism; see this issue to track progress.
In the meantime, we explored a variety of memory-saving techniques and found that combining FSDP with the paged_adamw_8bit optimizer allowed us to scale the context to 22,528 tokens: still not ideal, but only 9% of the data was truncated.
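In practice this amounts to selecting the paged 8-bit optimizer in the training arguments, roughly as in the sketch below (requires bitsandbytes; exact argument names may vary across trl versions, and FSDP itself is configured separately through accelerate):

```python
from trl import SFTConfig

# Sketch of switching to the paged 8-bit AdamW optimizer for the 32B run.
training_args = SFTConfig(
    output_dir="OlympicCoder-32B",
    optim="paged_adamw_8bit",     # 8-bit optimizer states, paged to CPU when needed
    max_seq_length=22528,         # the longest context that fit in memory
    gradient_checkpointing=True,
    bf16=True,
)
```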
Updates
GRPO Updates
Recent progress has been made on the GRPO implementation in TRL, bringing enhancements that further boost efficiency, scalability, and resource utilization. Here's a summary of the most important changes since the last update:
Generation Reuse
One of the main bottlenecks of GRPO is the same as for any online method: generation takes time. A key way to make GRPO more sample-efficient is to reuse generated samples multiple times during optimization instead of discarding them after a single use. This technique was actually introduced a long time ago with PPO.
For GRPO, the number of times a sample is reused is denoted μ.
It's now possible to reuse generated samples multiple times, significantly speeding up the process.
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., num_iterations=...)
```
However, be cautious: if μ is too large, it can negatively impact learning. Based on our experience, a good balance is between 2 and 4.
Reward Weighting
When training a model, not all rewards are equally important. For example, we might want the model to prioritize correctness over formatting, rather than treating both aspects equally.
To address this, it's now possible to assign different weights to different rewards, allowing finer control over the optimization process. By adjusting these weights, we can guide the model to focus more on the aspects that matter most for a given task.
```python
from trl import GRPOConfig, GRPOTrainer

def very_important_reward(completions, **kwargs):
    ...

def less_important_reward(completions, **kwargs):
    ...

training_args = GRPOConfig(
    ...,
    reward_weights=[0.9, 0.1],
)
trainer = GRPOTrainer(
    ...,
    reward_funcs=[very_important_reward, less_important_reward],
    args=training_args,
)
```
Other Enhancements
Several smaller but impactful improvements have been made to GRPO:
- PEFT + vLLM Integration – It's now possible to use PEFT (Parameter-Efficient Fine-Tuning) and vLLM together, combining efficient fine-tuning with optimized inference for better scalability.
- Gradient Checkpointing – This feature has been added to reduce memory consumption during training by recomputing certain activations instead of storing them, making it possible to train larger models.
- Optimized Selective Log Softmax Computation – A new method for computing the log softmax has been introduced, reducing memory peaks during training (see the sketch after this list).
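The memory saving comes from never materializing the full log-softmax tensor over the vocabulary for the whole batch at once. A simplified sketch of the idea (not TRL's exact implementation):

```python
import torch

# Only the log-probabilities of the tokens that were actually generated are needed,
# so we can process one sequence at a time instead of building a full
# (batch, seq, vocab) log-softmax tensor.
def selective_log_softmax(logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab); index: (batch, seq) token ids."""
    per_token_logps = []
    for row_logits, row_index in zip(logits, index):       # one sequence at a time
        row_logps = torch.log_softmax(row_logits, dim=-1)  # (seq, vocab) for this row only
        per_token_logps.append(row_logps.gather(-1, row_index.unsqueeze(-1)).squeeze(-1))
    return torch.stack(per_token_logps)
```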
Next Steps and Work in Progress
The current focus is on two key areas:
- Improving Generation Speed – Further optimizations (like static caching) are being explored to make the generation process even faster.
- Scaling GRPO to Multi-Node Settings – Work is underway to enable GRPO to scale across multiple nodes, making it possible to train much larger models.
Open R1 Math-Dataset update
We have further enriched our previously released OpenR1-Math-Raw dataset with new metadata to enable more informed decision-making during filtering and verification. Specifically, we've added the following columns:
- `reparsed_answers`: We observed that many entries in the `answer` column were either improperly formatted in LaTeX or contained only partial answers. Moreover, since some questions are multiple-choice, both the correct answer itself and the corresponding letter should be considered valid responses. To address this, we re-extracted all answers from the `solution` column using the Llama-3.3-70B-Instruct model to make sure `reparsed_answers` includes both the correct answer and, in multiple-choice cases, the corresponding letter as well. We believe this addition will be highly valuable to the community, improving both grading accuracy and verification during GRPO.
- `correctness`: Running answer verification can be resource-intensive when relying on a model-based judge. Therefore, we evaluated all solutions using Llama-3.3-70B-Instruct as a judge, alongside `math_verify` run against both the `answer` and `reparsed_answers` columns (see the sketch after this list).
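As an illustration of the `math_verify` side of this check on a single row (the LLM-judge side with Llama-3.3-70B-Instruct is not shown, and the exact column layout is an assumption):

```python
from math_verify import parse, verify

# Illustrative per-row check: parse the model output and the gold column, then
# compare them symbolically with math_verify.
def math_verify_correct(generation: str, gold_answer: str) -> bool:
    predicted = parse(generation)   # answer extracted from the model's solution text
    gold = parse(gold_answer)       # e.g. the `answer` or `reparsed_answers` value
    return verify(gold, predicted)
```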
Experiment Details
To help the community understand the impact of verification-based filtering on math datasets for SFT distillation, we conducted several ablation experiments. Starting with a randomly chosen pool of 200k samples, we created six distinct SFT datasets based on the following filtering rules:
1. `no_restrictions` (200k) – No filtering applied.
2. `llama_verification` (124k) – Samples graded as correct by Llama-70B verification.
3. `math_verify_answer` (88.7k) – Samples verified as correct by `math_verify` on the `answer` column.
4. `math_verify_reparsed` (101k) – Samples verified as correct by `math_verify` on the `reparsed_answers` column.
5. `llama_verification_or_math_verify_reparsed` (LorMV) (154k) – The union of datasets 2 and 4.
6. `llama_verification_and_math_verify_reparsed` (LandMV) (71.2k) – The intersection of datasets 2 and 4 (a sketch of building these combined splits is shown after this list).
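Assuming each row carries boolean flags for the two verification outcomes, the union and intersection splits are simple filters; the column names below are hypothetical:

```python
from datasets import load_dataset

# Hypothetical sketch of building the LorMV (union) and LandMV (intersection)
# subsets; the actual column names in the released dataset may differ.
ds = load_dataset("open-r1/OpenR1-Math-Raw", split="train")

lor_mv = ds.filter(lambda row: row["llama_correct"] or row["math_verify_reparsed_correct"])
land_mv = ds.filter(lambda row: row["llama_correct"] and row["math_verify_reparsed_correct"])
```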
Training and Evaluation
For filtering in data-constrained settings, both precision and recall are important considerations. Therefore, we didn't run each experiment with the same token budget but rather trained for a single epoch over all the data. For the model, we selected Qwen7B-Instruct and fine-tuned it with RoPE extension to a 32k context length and a cosine schedule. To track performance progression, we evaluated the models every 40 steps on AIME-24, AIME-25, and MATH-500 using lighteval. The results are summarized below:
Key Observations
- Verification significantly impacts early-stage performance. Filtering proved especially important during the first 40 steps. On the MATH-500 dataset, stricter verification methods delivered a substantial performance boost in the early stages (e.g., `no_restrictions` scored 0.61 vs. `LandMV` at 0.72). However, as training progressed, this performance gap diminished, and having more samples proved helpful, even when some contained errors.
- Training loss differed notably between datasets. Datasets filtered using `math_verify` exhibited consistently lower training loss. We assume that `math_verify` effectively identifies specific subsets of tasks (primarily numeric ones), whereas Llama-based verification or unfiltered datasets retain a broader variety of data.
- Surprisingly, unfiltered datasets didn't suffer severe degradation. Despite containing incorrect samples, the `no_restrictions` dataset maintained competitive performance over prolonged training runs.
Recommendations
Based on our findings, the optimal filtering method depends heavily on the training token budget. For shorter runs, stricter verification methods offer significant benefits. As general advice, we recommend combining Llama verification with `math_verify`, as done in the LorMV dataset.
The Reasoning Course
The Hugging Face learn team is working on accessible material about reinforcement learning, GRPO, and training reasoning models with TRL. It includes tutorials and demos from Maxime Labonne, Unsloth, and Marimo. Check out the reasoning course if you're looking for a good place to get started in this fast-moving field!
Community highlights
The past few weeks have seen continued exploration of GRPO across a wide range of tasks, along with several new reasoning datasets that target domains broader than math and code. Here are some of the releases we found especially exciting:
GRPO explorations
- The optimisation wizards at Unsloth have further reduced the amount of VRAM needed to train GRPO models with LoRA, now down to 5GB for a 1.5B model 🧙
- Kalomaze has written an excellent blog post on selecting good hyperparameters for GRPO. They also observe that models smaller than 7B tend to converge much more slowly, which means you should explore new ideas at these scales before concluding they don't work.
- Hrishbh Dalal has shown that you can use GRPO to teach LLMs to solve Sudoku puzzles!
- Yutai Li and co-authors have published an excellent paper which shows that for smol models, it's best to distill a mix of long and short CoT data from more powerful teachers.
Reasoning datasets
- KodCode have released a very large dataset of ~500k programming samples. This looks like an excellent complement to CodeForces-CoTs, and we're excited to train some new models on it!
- The team at GeneralReasoning have begun releasing high-quality and diverse datasets like `GeneralReasoning/GeneralThought-323K` that cover more domains and models than what is currently publicly available. They also have a nice website that lets you explore the data.
What’s next?
With this update, we now have the main pieces in place to complete Steps 1 and 2 of our reproduction plan:
In the coming weeks, we plan to focus heavily on:
- Perfecting the mixture of distilled datasets to train general-purpose reasoners.
- Scaling up GRPO to larger models like `Qwen/Qwen2.5-Coder-32B-Instruct` so we can derive R1-Zero variants.
- Combining reward signals from multiple domains like math and code, as well as folding in reward models to score non-reasoning data.