We are now two weeks into the Open R1 project, which aims to reconstruct the missing pieces of DeepSeek R1, specifically the training pipeline and synthetic data.
In this post, we're happy to share the construction of OpenR1-Math-220k: our first large-scale dataset for mathematical reasoning!
We also take a look at some exciting developments from the community on curating small, high-quality datasets for fine-tuning, along with insights into how to control the length of the chain-of-thought of reasoning models at both train-time and inference-time.
Let’s dive in!
OpenR1-Math-220k dataset
One of the key benefits of DeepSeek R1 is its ability to transfer advanced reasoning capabilities to smaller models through distillation. The DeepSeek team demonstrated this by generating 600k reasoning traces and fine-tuning a series of Qwen and Llama models, showing that direct distillation from R1 can achieve competitive reasoning performance without reinforcement learning. Notably, DeepSeek-R1-Distill-Qwen-7B achieved 55.5% on AIME 2024, surpassing larger models like QwQ-32B-Preview.
However, the reasoning traces used for distillation have not been released publicly, prompting the community to independently recreate similar datasets. So far, multiple open datasets have been released by the community, including OpenThoughts-114k, Bespoke-Stratos-17k, Dolphin-R1, and LIMO.
🐳 Introducing OpenR1-Math-220k, a large-scale math reasoning dataset generated locally on 512 H100s, with multiple answers per problem. To create OpenR1-Math-220k, we collaborated with Numina, who have developed a new version of their popular NuminaMath-CoT dataset.
What's new in the OpenR1 dataset compared to existing datasets:
- 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.
- 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.
- Based on NuminaMath 1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the NuminaMath-CoT dataset.
- Automated filtering: We apply Math-Verify to retain only problems with at least one correct answer. We also leverage Llama-3.3-70B-Instruct as a judge to retrieve more correct examples (e.g. for cases with malformed answers that can't be verified with a rules-based parser).
- We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.
By demonstrating scalable, high-quality reasoning data generation, we hope this pipeline can be extended beyond math to domains like code generation.
Data generation
To build OpenR1-220k, we prompt DeepSeek R1 to generate solutions for 400k problems from NuminaMath 1.5. We follow the model card's recommended parameters and prepend the following instruction to the user prompt:
"Please reason step by step, and put your final answer within \boxed{}."
We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. Initially, we used vLLM for inference, achieving a throughput of 15 generations per hour per H100, and shared our generation scripts in previous updates and in the OpenR1 repo. Recently, we started experimenting with SGLang and were able to generate 25 solutions per hour per H100 (almost a 2x speedup!), enabling us to generate 300k problem solutions per day on 512 H100s. This allowed us to produce 800k reasoning traces in just a few days.
We generate two solutions per problem (and in some cases four) to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1's methodology, and also makes the dataset suitable for preference optimisation methods like DPO.
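To make this concrete, here is a minimal sketch of what such a generation job could look like with vLLM. The model identifier, parallelism setting, and toy problem are illustrative assumptions rather than our production configuration; the sampling parameters follow the description above and the R1 model card, and the actual Slurm scripts are linked below.

```python
from vllm import LLM, SamplingParams

# Instruction prepended to every problem, as described above.
PROMPT_PREFIX = "Please reason step by step, and put your final answer within \\boxed{}.\n\n"

# Assumed setup: DeepSeek R1 sharded across GPUs on a single node.
llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)

sampling_params = SamplingParams(
    n=2,               # two solutions per problem, for rejection sampling / DPO
    temperature=0.6,   # recommended in the DeepSeek R1 model card
    top_p=0.95,
    max_tokens=16384,  # 16k token budget per generation
)

problems = ["What is the sum of the first 100 positive integers?"]  # toy example
outputs = llm.generate([PROMPT_PREFIX + p for p in problems], sampling_params)

for request_output in outputs:
    for completion in request_output.outputs:
        print(completion.text)
```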
The scripts for the data generation are available here: https://github.com/huggingface/open-r1/tree/main/slurm
The unfiltered dataset is available here: https://huggingface.co/datasets/open-r1/OpenR1-Math-Raw
Data Filtering
To retain only high-quality, correct reasoning traces, we leverage Math-Verify, a robust mathematical expression evaluation system designed to assess LLM-generated answers. We extract the final answers from model generations and compare them against the ground truth answers in the dataset.
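As an illustration, the rule-based check boils down to parsing both answers with Math-Verify and testing them for equivalence (the answer strings below are toy examples, not entries from the dataset):

```python
from math_verify import parse, verify

# Toy example: a ground-truth answer and the final answer extracted from a generation.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

# verify() returns True when the two expressions are mathematically equivalent,
# so this generation would be kept in the filtered dataset.
print(verify(gold, answer))  # True
```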
We find that 55% of problems have at least one correct answer. However, some ground truth answers in NuminaMath 1.5 were empty or not in a verifiable format, making automatic validation difficult. While we have improved Math-Verify to handle these more unusual output formats more accurately (see the Math-Verify improvements below), we also explored an alternative method to recover valid solutions from rejected samples: using Llama-3.3-70B-Instruct as a judge on a subset of rejected problems. Before running this verification step, we filter out samples that are incomplete or that contain an empty ground truth answer, ensuring that only well-formed responses with a clearly boxed final answer are considered. This process successfully recovers 28,000 previously rejected problems.
We prompt Llama-3.3-70B-Instruct as follows:
You are a mathematical answer validator. You will be provided with a mathematical problem and you need to compare the answer in the reference solution, and the final answer in a model's solution to determine if they are equivalent, even when formatted differently.
PROBLEM:
{problem}
REFERENCE SOLUTION:
{answer}
MODEL'S SOLUTION:
{generation}
Focus ONLY on comparing the final mathematical answer provided by the model while ignoring differences in:
- Formatting (e.g., \boxed{{}} vs plain text)
- Multiple choice formatting (e.g., "A" vs full solution)
- Order of coordinate pairs or solutions
- Equivalent mathematical expressions or notation variations
- If the model's answer is nonsense, return "Verdict: AMBIGUOUS"
Start with a brief explanation of your comparison (2-3 sentences). Then output your final answer in one of the following formats:
- "Verdict: EQUIVALENT"
- "Verdict: DIFFERENT"
- "Verdict: AMBIGUOUS"
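The judge's free-form output then needs to be reduced to a single verdict. A small sketch of how this can be done is shown below; the regex-based helper is an illustration, not the exact code used in our pipeline:

```python
import re

def parse_verdict(judge_output: str) -> str:
    """Extract the final verdict from the judge's response.

    Falls back to "AMBIGUOUS" when no verdict line is found, so malformed
    judge outputs are never counted as correct.
    """
    match = re.search(r"Verdict:\s*(EQUIVALENT|DIFFERENT|AMBIGUOUS)", judge_output)
    return match.group(1) if match else "AMBIGUOUS"

# Only generations judged EQUIVALENT are recovered into the final dataset.
keep = parse_verdict("Both answers simplify to 42. Verdict: EQUIVALENT") == "EQUIVALENT"
```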
By combining rule-based verification (Math-Verify) with LLM-based evaluation, we improve dataset quality while maintaining scale. The final dataset consists of 220k problems with verified reasoning traces, making it a valuable resource for training reasoning models. Providing multiple solutions per problem gives the community flexibility to filter for better generations and apply more targeted refinements based on NuminaMath data sources and problem types.
The dataset is available in two splits:
- default (94k problems), which achieves the best performance after SFT.
- extended (131k problems), which includes additional NuminaMath 1.5 sources like cn_k12, providing more reasoning traces. However, we observed that performance after SFT on this split was lower than on the default split, likely because cn_k12 contains simpler questions than the other sources.
For rows with multiple correct answers, we also tried applying a Reward Model (RM) as a final filter to select the best response. For each row with several correct generations from R1, we extracted the final answer by removing the thinking tokens (<think> … </think>), then passed the problem plus the extracted answer to Qwen/Qwen2.5-Math-RM-72B, served with vLLM, to obtain a score. Using these scores, we built a ranking for every row with multiple correct responses. The top-ranked correct generation was then included in the training dataset, but unfortunately training ablations showed that this approach does not improve model performance compared to picking a random correct generation. A possible improvement would be to include the reasoning trace, rather than just the final answer, when scoring with the RM.
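For reference, here is a sketch of that ranking step. The score_fn argument is a placeholder for a request to the served reward model, since the exact serving interface is not described here:

```python
import re

def strip_thinking(generation: str) -> str:
    """Remove the <think> ... </think> block so only the final solution is scored."""
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

def rank_by_reward_model(problem: str, correct_generations: list[str], score_fn) -> list[str]:
    """Rank correct generations by reward-model score, highest first.

    `score_fn(problem, solution) -> float` stands in for a call to
    Qwen/Qwen2.5-Math-RM-72B served behind vLLM.
    """
    scored = [(score_fn(problem, strip_thinking(g)), g) for g in correct_generations]
    return [g for _, g in sorted(scored, key=lambda item: item[0], reverse=True)]
```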
Performance Comparison with DeepSeek-Distill-Qwen-7B
We fine-tune Qwen2.5-Math-Instruct for 3 epochs on the default split of the dataset using a learning rate of 5e-5. To extend the context length from 4k to 32k, we increase the RoPE frequency to 300k. The training follows a linear learning rate schedule with a 10% warmup phase. The table below compares the performance of OpenR1-Qwen-7B to DeepSeek-Distill-Qwen-7B and OpenThinker-7B using lighteval.
| Model | MATH-500 | AIME24 | AIME25 |
|---|---|---|---|
| DeepSeek-Distill-Qwen-7B | 91.6 | 43.3 | 40 |
| OpenR1-Qwen-7B | 90.6 | 36.7 | 40 |
| OpenThinker-7B | 89.6 | 30.0 | 33.3 |
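For reference, here is a rough sketch of this fine-tuning setup using TRL's SFT trainer. The argument names follow TRL and transformers, and anything not stated above (the output directory, the exact checkpoint name, the dataset loading) is a placeholder rather than our exact training script:

```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Assumed base checkpoint; see the model name stated above.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
model.config.rope_theta = 300_000  # raise the RoPE frequency to extend the context from 4k to 32k

training_args = SFTConfig(
    output_dir="OpenR1-Qwen-7B",   # placeholder
    learning_rate=5e-5,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_seq_length=32768,
)

# trainer = SFTTrainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```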
This dataset represents an initial version, providing a foundation for further refinement. The community can explore additional filtering strategies to further improve performance, such as rejection sampling, which was used in DeepSeek R1 to enhance quality.
Math-Verify improvements
We identified several failure cases in Math-Verify while inspecting the verification results. To address these issues, we implemented significant improvements and fixes. We strongly recommend updating to the latest version (0.5.2) to benefit from these enhancements:
pip install math-verify==0.5.2
The following is a summary of the most important improvements (a short usage example follows the list):
- Improved parsing and verification of text-only answers (e.g. $\text{E}$ == $E$)
- Improved parsing of lists of answers (e.g. $1$ and $2$ and $3$ == $1,2,3$)
- Fixed parsing of multiple boxed answers in a single LaTeX environment (e.g. $\boxed{1},\boxed{2}$ == {1,2})
- Introduction of ordered tuples. Inferring whether a list is a tuple or a set is very hard, so we use the gold answer as a guide:
- (1,2,3) ≠ {3,2,1}; 1,2,3 == {3,2,1}; {3,2,1} == {1,2,3}
- Support for relations (e.g. less than) in the gold answer and intervals in the prediction (e.g. $1 < x < 2$ == $(1,2)$)
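As a quick illustration of the new relation/interval support (assuming version 0.5.2 as above):

```python
from math_verify import parse, verify

# Relation in the gold answer, interval in the prediction: now treated as equivalent.
print(verify(parse("$1 < x < 2$"), parse("$(1,2)$")))  # True
```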
Community highlights
This week the community explored GRPO from many different angles, while multiple research labs showed that only ~1,000 high-quality training samples may be sufficient to elicit reasoning in existing open models.
GRPO in the wild
- nrehiew showed that applying GRPO directly to the Qwen2.5-0.5B base model yields ~51% accuracy on the GSM8k benchmark, a 10-point improvement over the Qwen2.5-0.5B-Instruct model. Impressive results like these have prompted many discussions about the role of instruct data in pretraining, as people have not (yet) been able to obtain similar gains when applying GRPO to other base models like Llama 3. Notably, researchers at Sea AI Lab (SAIL) showed that base models can easily be prompted to produce self-reflection, and that the "aha" moment from the DeepSeek-R1 paper may be more a symptom of the base model than of the RL optimisation process.
- Unsloth have applied their optimisation magic to enable models of up to 15B parameters to be trained with GRPO on just 15GB of VRAM 🤯. This means you can now use GRPO in Google Colab for free!
- Wing Lian from Axolotl has shown that DoRA converges faster than both LoRA and full fine-tuning.
- Alexander Doria found a way to craft reward functions for poetry. This is exciting because it provides one of the first public examples of GRPO being applied to a domain that isn't conventionally treated as "verifiable". A minimal sketch of what a GRPO setup with a custom reward function looks like in TRL follows this list.
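Here is that sketch. The dataset and the toy reward function (which only checks for a boxed answer) are placeholders rather than a recipe from any of the results above; a real math reward would verify the answer, for example with Math-Verify:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: +1 if the completion contains a boxed answer, 0 otherwise.
def boxed_reward(completions, **kwargs):
    return [1.0 if "\\boxed{" in completion else 0.0 for completion in completions]

# Placeholder dataset with a "prompt" column.
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(output_dir="Qwen2.5-0.5B-GRPO", num_generations=8)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    reward_funcs=boxed_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```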
Evaluation
The first part of AIME 2025 was released this week: a set of 15 difficult math problems used to train high school students for the International Math Olympiad. Over the past year, AIME 2024 has stood as the main benchmark for probing the mathematical capabilities of LLMs, and the community was excited to see how well models perform on a new set of unseen problems.
Do LLMs have to reason in natural language?
An interesting new research paper shows that, by using a recurrent language model, it is possible to scale test-time compute by implicitly reasoning in latent space. This resembles Meta's Coconut work on training language models in latent space, but now adapted to reasoning tasks. The advantage of these methods is that they are far more compute efficient: by exploring the latent space, the model does not need to generate huge amounts of "thinking" tokens to obtain high performance.
A shift toward smaller, high-quality reasoning data?
While DeepSeek R1 leveraged 600k reasoning traces for distillation, recent work suggests that complex reasoning can emerge in language models not through massive-scale training, but through a small number of carefully curated samples.
One example of this approach is the s1K dataset. It consists of 1,000 carefully chosen math questions with reasoning traces distilled from Gemini Flash. The selection approach focuses on difficulty, diversity, and quality. The authors fine-tune Qwen2.5-32B-Instruct on s1K and manage to exceed OpenAI's o1-preview on competition math benchmarks by up to 27%.
Another dataset, LIMO, pushes this idea further, achieving strong performance on AIME and MATH benchmarks using only 817 training samples. The authors hypothesize that when a model has already acquired extensive domain knowledge during pre-training, only a small number of well-structured examples may be needed to unlock advanced reasoning capabilities.
CoT length: budget forcing & reward shaping
One important ingredient that allows the fine-tuned Qwen2.5-32B-Instruct model from s1K to reach such strong performance is budget forcing, a test-time compute technique that either extends or truncates reasoning by appending either "Wait" or an end-of-thinking token delimiter to the model's generation. This tool allowed the authors to vary thinking time and conclude that their model exhibits test-time scaling: as thinking time increases, so does accuracy on different math benchmarks.
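To make the mechanism concrete, here is a minimal sketch of budget forcing. The generate_until and count_tokens helpers are hypothetical, and the "</think>" string stands in for the model's end-of-thinking delimiter; the actual s1 implementation differs in its details:

```python
def budget_forcing(generate_until, count_tokens, prompt,
                   min_thinking_tokens=1024, max_thinking_tokens=8192):
    """Minimal sketch of budget forcing (after s1).

    `generate_until(text, stop, max_new_tokens)` is a hypothetical helper returning
    (new_text, stopped_on_delimiter); `count_tokens(text)` counts tokens.
    """
    thinking = ""
    while True:
        budget_left = max_thinking_tokens - count_tokens(thinking)
        chunk, stopped = generate_until(prompt + thinking, stop="</think>",
                                        max_new_tokens=budget_left)
        thinking += chunk
        # Extend: the model tried to stop thinking too early, so append "Wait" and continue.
        if stopped and count_tokens(thinking) < min_thinking_tokens:
            thinking += "\nWait"
            continue
        break
    # Truncate: close the thinking block so the model must produce its final answer.
    answer, _ = generate_until(prompt + thinking + "</think>\nFinal answer:",
                               stop=None, max_new_tokens=512)
    return answer
```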
Similarly, Demystifying Long Chain-of-Thought Reasoning in LLMs (Yeo et al.) studies the effect of Chain-of-Thought (CoT) length on model performance. They introduce the Cosine Reward, a novel reward function used to incentivize shorter CoTs for correct generations and longer CoTs for incorrect generations, which stabilizes RL training, particularly when the model has a relatively limited maximum context size and the average response length could otherwise explode. A repetition penalty is also employed when the model starts to show signs of reward hacking on hard questions by artificially increasing CoT length through repetition instead of attempting to solve the problem.
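A minimal sketch of a cosine-shaped length reward in this spirit is shown below; the constants are illustrative and the exact functional form in the paper may differ:

```python
import math

def cosine_length_reward(is_correct: bool, gen_len: int, max_len: int,
                         r_correct_short: float = 2.0, r_correct_long: float = 1.0,
                         r_wrong_short: float = -2.0, r_wrong_long: float = -1.0) -> float:
    """Favour short CoTs for correct answers and longer CoTs for incorrect ones,
    interpolating between the short- and long-CoT rewards with a cosine."""
    r_short, r_long = (
        (r_correct_short, r_correct_long) if is_correct else (r_wrong_short, r_wrong_long)
    )
    progress = min(gen_len / max_len, 1.0)
    # cos goes from 1 (length 0) to -1 (max length): reward moves from r_short to r_long.
    return r_long + 0.5 * (r_short - r_long) * (1.0 + math.cos(progress * math.pi))

# A correct short answer scores higher than a correct long one, while an incorrect
# long answer is penalised less than an incorrect short one.
print(cosine_length_reward(True, 500, 4096))    # ~1.96
print(cosine_length_reward(False, 4000, 4096))  # ~-1.00
```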
What’s next?
Now that GRPO is humming along in TRL, we are running an extensive set of experiments to understand which hyperparameters and reward functions have the biggest impact on training. You can follow our progress in the community tab, and we will write up our findings in the next update!
If you want to contribute, check out the open-r1 repository on GitHub or follow the Hugging Face open-r1 org.