If you’ve ever struggled with a tricky math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI’s o1 model showed that when LLMs are trained to do the same, by using more compute during inference, they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
Besides performing as well as or better than o1, the DeepSeek-R1 release was accompanied by a detailed tech report that outlined the key steps of their training recipe. This recipe involved several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any human supervision. As shown in the figure below, creating a strong reasoning model is now fairly straightforward if you have access to a capable base model and a high-quality data mixture:

However, the DeepSeek-R1 release leaves several questions open:
- Data collection: How were the reasoning-specific datasets curated?
- Model training: No training code was released by DeepSeek, so it is unknown which hyperparameters work best and how they differ across model families and scales.
- Scaling laws: What are the compute and data trade-offs in training reasoning models?
These questions prompted us to launch the Open-R1 project, an initiative to systematically reconstruct DeepSeek-R1’s data and training pipeline, validate its claims, and push the boundaries of open reasoning models. By building Open-R1, we aim to provide transparency on how reinforcement learning can enhance reasoning, share reproducible insights with the open-source community, and create a foundation for future models to leverage these techniques.
In this blog post we take a look at the key ingredients behind DeepSeek-R1, which parts we plan to replicate, and how to contribute to the Open-R1 project.
Let’s dive in 🚀!
How did they do it?
DeepSeek-R1 is a reasoning model built on the foundation of DeepSeek-V3. Like any good reasoning model, it starts with a strong base model, and DeepSeek-V3 is exactly that. This 671B Mixture of Experts (MoE) model performs on par with heavyweights like Sonnet 3.5 and GPT-4o. What’s especially impressive is how cost-efficient it was to train (just $5.5M), thanks to architectural changes like Multi-Token Prediction (MTP), Multi-Head Latent Attention (MLA), and a LOT (seriously, a lot) of hardware optimization.
DeepSeek also introduced two models: DeepSeek-R1-Zero and DeepSeek-R1, each with a distinct training approach. DeepSeek-R1-Zero skipped supervised fine-tuning altogether and relied entirely on reinforcement learning (RL), using Group Relative Policy Optimization (GRPO) to make the process more efficient. A simple rule-based reward system guided the model, providing feedback based on the accuracy and structure of its answers. This approach helped the model develop useful reasoning skills, such as breaking problems into steps and verifying its own outputs. However, its responses often lacked clarity and were difficult to read.
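To make the idea concrete, here is a minimal, illustrative sketch of rule-based rewards and GRPO’s group-normalized advantages. DeepSeek has not released training code, so the tag scheme, answer extraction, and function names below are our own assumptions for illustration:

```python
import re
from statistics import mean, stdev

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning and answer in the expected tags.
    (The exact tag scheme here is an assumption for illustration.)"""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Reward completions whose final answer matches the reference. Exact string match
    here; a real verifier would normalize math expressions or run unit tests for code."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core idea: sample a group of completions per prompt and normalize each
    reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-4) for r in rewards]

# Example: four sampled completions for one prompt, scored and normalized.
completions = [
    "<think>2 + 2 = 4</think><answer>4</answer>",
    "<think>guess</think><answer>5</answer>",
    "4",  # wrong format: no tags
    "<think>2 + 2 is 4</think><answer>4</answer>",
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in completions]
print(grpo_advantages(rewards))
```

The appeal of this setup is that the advantage of each sampled completion is just its reward normalized within its own group, which removes the need for a separate value model.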
That’s where DeepSeek-R1 comes in. It started with a “cold start” phase, fine-tuning on a small set of carefully crafted examples to improve clarity and readability. From there, it went through more RL and refinement steps, including rejecting low-quality outputs with both human-preference-based and verifiable rewards, to create a model that not only reasons well but also produces polished and consistent answers.
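The rejection-sampling step can also be sketched in a few lines: sample several candidates per prompt, keep only those a verifier or reward model scores highly, and reuse the survivors as supervised fine-tuning data. The `generate` and `verify` functions below are hypothetical placeholders rather than DeepSeek’s actual pipeline:

```python
def rejection_sample(prompts, generate, verify, num_samples=16, threshold=1.0):
    """Keep only (prompt, completion) pairs whose verifier score clears the threshold.
    `generate(prompt, n)` and `verify(prompt, completion)` are stand-ins for your own
    sampling backend and reward/verifier functions."""
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, num_samples):
            if verify(prompt, completion) >= threshold:
                kept.append({"prompt": prompt, "completion": completion})
    return kept  # survivors become SFT data for the next refinement round
```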

This all sounds great, but what’s actually missing? Let’s take a look at the missing pieces of the puzzle.
Open-R1: the missing pieces
The release of DeepSeek-R1 is a huge boon for the community, but they didn’t release everything: although the model weights are open, the datasets and code used to train the model are not 😢.
The goal of Open-R1 is to build these last missing pieces so that the whole research and industry community can build similar or better models using these recipes and datasets. And by doing this in the open, everybody in the community can contribute!
As shown in the figure below, here’s our plan of attack:
- Step 1: Replicate the R1-Distill models by distilling a high-quality reasoning dataset from DeepSeek-R1 (a minimal generation sketch follows after this list).
- Step 2: Replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
- Step 3: Show we can go from base model → SFT → RL via multi-stage training.
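For Step 1, distillation boils down to generating reasoning traces from DeepSeek-R1 and saving them as a dataset. Below is a minimal sketch, assuming access to an OpenAI-compatible endpoint serving DeepSeek-R1; the endpoint URL, the dataset repositories, and the `problem` column are placeholders:

```python
from datasets import Dataset, load_dataset
from huggingface_hub import InferenceClient

# Hypothetical: any OpenAI/TGI-compatible endpoint serving DeepSeek-R1 would work here.
client = InferenceClient(base_url="http://localhost:8000")

prompts = load_dataset("your-org/math-prompts", split="train")  # placeholder dataset

records = []
for example in prompts.select(range(100)):  # small slice for illustration
    response = client.chat_completion(
        messages=[{"role": "user", "content": example["problem"]}],
        max_tokens=4096,
        temperature=0.6,
    )
    records.append(
        {"prompt": example["problem"], "completion": response.choices[0].message.content}
    )

# The distilled traces become an SFT dataset for smaller models.
Dataset.from_list(records).push_to_hub("your-org/r1-distill-traces")  # placeholder repo
```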

The synthetic datasets will allow everyone to fine-tune existing or new LLMs into reasoning models simply by fine-tuning on them. The training recipes involving RL will serve as a starting point for anyone to build similar models from scratch and will allow researchers to build even more advanced methods on top.
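As an illustration of that first path, here is a minimal sketch of fine-tuning a small model on such a distilled dataset with TRL’s `SFTTrainer`. The model choice, dataset name, and hyperparameters are placeholders rather than a validated recipe, and we assume the dataset is in a prompt/completion format that `SFTTrainer` understands:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset: prompt/completion pairs distilled from DeepSeek-R1 (see Step 1).
dataset = load_dataset("your-org/r1-distill-traces", split="train")

training_args = SFTConfig(
    output_dir="qwen2.5-1.5b-r1-distill",   # placeholder output name
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small base or instruct model could be swapped in
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```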
Note that we don’t want to stop at math datasets. There’s plenty of potential in exploring other areas: obvious ones like code, but also scientific fields such as medicine, where reasoning models could have a significant impact.
This initiative isn’t just about replicating results; it’s about sharing insights with the community. By documenting what works, what doesn’t, and why, we hope to save others from wasting time and compute on unproductive paths.
If this sounds interesting, we’d love your help! Whether it’s contributing code or joining discussions on Hugging Face, there are many ways to get involved. Let’s build this together! 🚀
