Reproduce the DeepSeek R1 "aha moment": an RL tutorial

This post was written by Philipp Schmid and originally posted on philschmid.de. The code can be found here.

The release of DeepSeek R1 shocked the industry. Why? Well, DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach. They not only released the model, but also a research paper on how they did it.

In the paper they described an "aha moment" when using pure RL to train the model. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learns to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do it. They describe this "aha moment" as:

This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

In this blog post we want to recreate the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model using reinforcement learning, trying to teach it self-verification and search abilities all on its own to solve the Countdown Game.
The Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach, or get as close as possible to, a target number.

Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6

((100 + 6) × 3 × 75 - 50) ÷ 25 = 952
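As a quick, purely illustrative sanity check (not part of the training code), the example solution can be verified in a few lines of Python:

```python
import re

numbers = [25, 50, 75, 100, 3, 6]
target = 952
expr = "((100 + 6) * 3 * 75 - 50) / 25"

# Every available number is used exactly once and the expression hits the target.
used = [int(n) for n in re.findall(r"\d+", expr)]
assert sorted(used) == sorted(numbers)
assert eval(expr) == target
```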

This blog post focuses on distributed training using DeepSpeed and vLLM. It was run on a node with 4x NVIDIA H100 GPUs.

  1. Set up the development environment
  2. Distributed Training example for GRPO using DeepSpeed and vLLM
  3. Results and Training Observations

Note: This blog is inspired by Jiayi Pan, who initially explored the idea and proved it with a small model.

But before we start, let's take a look at Group Relative Policy Optimization (GRPO) and understand how it works.

Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm to improve the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates the baseline from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule/binary-based rewards as well as general reward models to improve models on helpfulness. It works as follows:

  1. Sampling: Generate multiple outputs for each prompt using the current policy.
  2. Reward Scoring: Each generation is scored using a reward function, which can be rule-based or outcome-based.
  3. Advantage Calculation: The average reward of the generated outputs is used as a baseline. The advantage of each output within the group is then computed relative to this baseline; in other words, the reward is normalized within the group (see the sketch after this list).
  4. Policy Optimization: The policy tries to maximize the GRPO objective, which includes the calculated advantages and a KL divergence term. This differs from PPO, which implements the KL term within the reward.
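As a minimal illustration of step 3, this is how group-relative advantages could be computed for a single prompt (a sketch with made-up reward values, not the TRL implementation):

```python
import torch

# Rewards for a group of 8 completions sampled for the same prompt (made-up values):
# 1.0 = solved the task, 0.0 = did not.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])

# The group mean serves as the baseline; no separate value model is needed.
baseline = rewards.mean()

# Advantages are the rewards normalized within the group.
advantages = (rewards - baseline) / (rewards.std() + 1e-4)
print(advantages)  # positive for above-average completions, negative otherwise
```

Completions that scored above the group average get a positive advantage, the rest a negative one; the policy update then makes the former more likely.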




1. Set up the development environment

Our first step is to install the Hugging Face libraries and PyTorch, plus vLLM, trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a library on top of transformers and datasets, which makes it easier to fine-tune, RLHF and align open LLMs.


%pip install "torch==2.5.1" tensorboard "setuptools<71.0.0"  --index-url https://download.pytorch.org/whl/cu121


%pip install flash-attn 


%pip install  --upgrade 
  "transformers==4.48.1" 
  "datasets==3.1.0" 
  "speed up==1.3.0" 
  "hf-transfer==0.1.9" 
  "deepspeed==0.15.4" 
  "trl==0.14.0"


%pip install "vllm==0.7.0"

Note: you may need to restart the kernel to use the updated packages.

We will use the Hugging Face Hub as a remote model versioning service. This means we will automatically push our model, logs and information to the Hub during training. You must register on Hugging Face for this. After you have an account, we will use the login util from the huggingface_hub package to log into our account and store our token (access key) on disk.

from huggingface_hub import login

login(token="", add_to_git_credential=True)  # add your Hugging Face token here



2. Distributed Training example for GRPO using DeepSpeed and vLLM

We will use the Jiayi-Pan/Countdown-Tasks-3to4 dataset, which contains samples with 3 to 4 numbers and solutions. As the model we will use Qwen/Qwen2.5-3B-Instruct, a 3B-parameter instruction-tuned model. This makes it easier to showcase the "aha moment", since it can already follow instructions. Jiayi-Pan found that the model needs to have a certain quality to be able to learn the reasoning process, starting with >1.5B parameters.
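For orientation, loading the dataset looks roughly like this (a sketch; we assume the dataset exposes target and nums columns, and the training script still needs to turn each sample into a prompt):

```python
from datasets import load_dataset

# Countdown samples with 3 to 4 numbers and a target value
# (column names assumed: "target" and "nums").
dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")
print(dataset[0])
```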

TRL supports Group Relative Policy Optimization (GRPO) through a dedicated GRPOTrainer for aligning LLMs from preference data, as described in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. The GRPOTrainer is a subclass of the Trainer from the transformers library and supports all the same features, including logging, checkpointing, distributed training, and parameter-efficient fine-tuning (PEFT).

The GRPOTrainer supports generic Outcome Reward Models (ORM) and custom reward functions, which can be used to implement rule-based reward models. In the DeepSeek R1 paper they implemented rule-based reward models to verify the correctness of the generated solutions. In our example we will take a similar approach and create 2 reward functions that:

  1. Format Reward: Checks if the generated output follows the correct format: <think> [thinking] </think><answer> [answer] </answer>
  2. Accuracy Reward: Extracts the equation from the <answer> tag, evaluates it against the target, and checks that every number is used exactly once.

Note: A correct <answer> in our example includes the equation, for example <answer> 55 + 36 - 7 - 19 </answer>.
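Below is a minimal sketch of what these two reward functions could look like. It is simplified compared to the actual run_r1_grpo.py script and assumes the completions arrive as plain strings; TRL passes extra dataset columns (here target and nums) to custom reward functions as keyword arguments, and each function returns one score per completion.

```python
import re

def format_reward(completions, **kwargs):
    """1.0 if the completion follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def accuracy_reward(completions, target, nums, **kwargs):
    """1.0 if the <answer> equation uses every number exactly once and hits the target."""
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match is None:
            rewards.append(0.0)
            continue
        equation = match.group(1).strip()
        # Only allow digits, the four operators, parentheses, dots and whitespace.
        if not re.match(r"^[\d+\-*/().\s]+$", equation):
            rewards.append(0.0)
            continue
        used_numbers = [int(n) for n in re.findall(r"\d+", equation)]
        try:
            correct = sorted(used_numbers) == sorted(numbers) and abs(eval(equation) - gt) < 1e-5
        except Exception:
            correct = False
        rewards.append(1.0 if correct else 0.0)
    return rewards
```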

Hugging Face TRL added support for distributed training with DeepSpeed and for using vLLM for faster generation. I prepared a run_r1_grpo.py script and a receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file to run the training.

This configuration was tested and validated on a node with 4x H100 80GB GPUs, where a single step takes around 45-60s, since we can leverage vLLM for generation and DeepSpeed for distributed training. Therefore we need to make sure we correctly set num_processes to the number of GPUs you have minus 1, as the last one will be used with vLLM for generation. If you are using more GPUs you need to change vllm_device in the config file to the last GPU index, e.g. if you have 8 GPUs you need to set vllm_device=7 and num_processes to 7.

The command to run the training:

accelerate launch --num_processes 3 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/run_r1_grpo.py --config receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml

With the optimized distributed training, a single step with 8 generations per sample on 4x H100 80GB takes around 45-60s. The full training for 450 steps takes around 6 hours.
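Inside run_r1_grpo.py the core wiring is conceptually a GRPOConfig plus a GRPOTrainer. The following is a simplified sketch, not the full script: the hyperparameter values shown here are placeholders (the authoritative ones live in the YAML config file), and it reuses the reward functions and dataset from the sketches above.

```python
from trl import GRPOConfig, GRPOTrainer

# Simplified sketch; see receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml
# for the real hyperparameters.
training_args = GRPOConfig(
    output_dir="qwen-2.5-3b-r1-countdown",
    per_device_train_batch_size=1,
    num_generations=8,           # completions sampled per prompt
    max_completion_length=1024,  # room for the <think> block
    use_vllm=True,               # generate with vLLM on the spare GPU
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],  # from the sketch above
    args=training_args,
    train_dataset=dataset,                          # Countdown dataset from earlier
)
trainer.train()
```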



3. Results and Training Observations

The script saves random completions to the completion_samples folder, which you can use to inspect the model's progress. It includes completion_samples.txt and success_completion_samples.txt. The completion_samples.txt file includes all completions, while success_completion_samples.txt contains only those that correctly solve the equation. Below you will find the interesting training observations on how the performance changes over time, as well as the Tensorboard logs and successful reasoning samples.

The model, with checkpoints for every 25th step, can be found at philschmid/qwen-2.5-3b-r1-countdown.



Hyperparameters

I started the experiment using the hyperparameters from the DeepSeekMath paper, with a learning rate of 1e-6 and a beta (KL coefficient) of 0.04, which led to unstable training runs after around 150 steps. I ran some small ablations and decreased both the learning rate to 5e-7 and the beta to 0.001, based on a test from OpenRLHF. I couldn't test how increasing num_generations from 8 to 64 would affect the training; 64 is the generation value used in the DeepSeekMath paper. All other parameters can be found in the grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file.
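Expressed as GRPOConfig fields, the adjusted values would look roughly like this (a sketch; the authoritative values are in the YAML config file):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen-2.5-3b-r1-countdown",
    learning_rate=5e-7,   # lowered from the 1e-6 used in the DeepSeekMath paper
    beta=0.001,           # KL coefficient, lowered from 0.04
    num_generations=8,    # DeepSeekMath used 64
)
```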



Training Observations:

  • At ~50 steps the model has learned the correct format <think> ... </think>\n<answer> ... </answer>.
  • At 100 steps the success rate for solving the equation is around 25%. The model starts to "reason" with words; see the examples below.
  • At 200 steps the performance seems to converge much more slowly and we are at a ~40% success rate. The model starts to learn a new "format", where it solves the equation similarly to how you would do it programmatically, by trying different combinations and reviewing the results; see "Successful Reasoning Samples between step 200 and 450".
  • At 450 steps we have a 50% success rate for solving the equation. The performance still improves slowly and the model kept its new format from step 200.

I have a few potential assumptions for why the model shifts from "word reasoning" to "programmatic execution":

  1. Qwen 2.5 3B is not strong enough or is too small; DeepSeek mentions that you need a very strong base model.
  2. The reward functions are not well enough defined and the model reward-hacks its way to solving the equation. We could try to force it to use words, e.g. with a number-to-word frequency condition. (We don't know much about the reward functions DeepSeek used.)
  3. Training only on the Countdown Game tasks might naturally push the model to learn the most effective way to solve the equation, since no other formats are required.
  4. We didn't train the model long enough; in the R1 paper they showed visuals of training for over 8000 steps.



Tensorboard Results

[Figure: Tensorboard training metrics]



Successful Reasoning Samples up to step 200

<think> First, I would like to think concerning the numbers we've got: 14, 70, and 80. To get 85, I feel 80 could be  place to begin since it's right in between 85. If I subtract or add 80, it won't result in 85 easily. Next, 70 appears to be  intermediate number, especially to make use of along side 14. 
Now, I can take into consideration different combos using addition, subtraction and multiplication/division. If I take 80 - 70 = 10, which shouldn't be needed. So, I would like to search out a method to use 14. If I subtract 14 from 80, I get 66, which still is not 85. If I subtract 70 from 80, I get 10 again, this is not constructive yet. 
Multiplication could create numbers larger than the goal 85. Let's consider division: If I divide 70 by 14, I get 5. Then if I add this result to 80, I get 85. This mix seems promising: 80 - (70 / 14) = 80 - 5 = 75, this still is not quite right. Let me correct my approach: Let's use division after which addition. If I take 80 / (70 / 14) → 80 / 5 = 16, after which 80 - 16 = 64, still incorrect. Let me simplify: 80 / (70 / 14) → 80 / 5 = 16, after which 80 - 16 gives 64 again, let me correct: 80 - (70 / 14) → 80 - 5 = 75, but I would like 85. Thus I'll use the combos in a rather different order to insert numbers: I get 80 + (70 / 14) = 80 + 5 = 85think>
<answer> 80 + (70 / 14) answer>
<think> First, I would like to think about how I can use the numbers 1, 21, and 90 to succeed in 68. I can try different combos of operations. Let's try to make use of subtraction and addition. If I subtract 90 from 21, I get -69, which is simply too removed from 68. Now, if I add 1 to 90, I get 91, which can be not right. One other way is to see what happens if I subtract 21 from 90, which supplies me 69. That is closer but still not 68. Let me use multiplication as an alternative of subtraction. 90 * 1 = 90. Now, if I try to scale back this barely by subtracting 21, I'm not getting near 68.
Now, let's try one other multiplication. I'll consider 1 * (larger value) + (smaller value). Let's try 1 * 90 + 21. This evaluates to 90 + 21 = 111, which is simply too much. As an alternative, I'll try: 90 / 1 + 21 - 7. But this offers a division by 7, which seems complex. Let's re-evaluate: I'll use 1 to divide 21: 21 / 1 = 21. Now, subtract 21 from 90, and add 1, we've got 90 - 21 = 69, using one...
Hey, wait on re-evaluating I'll subtract 21 from 90, and include 1 to finish the answer: 90 - (21 + 1) which can be 68.think>
<answer> 90 - (21 + 1) answer>
<think> First, I would like to think about the numbers we've got: 59, 80, and 66. I would like to make use of these numbers to get 45. Let's try to make use of subtraction first since it could possibly quickly reduce large numbers.
Let's examine if we will get close by subtracting or adding these numbers. If I subtract 66 from 80, I get 14. Then I would like to make this 14 equal 45 when combined with 59. That offers me 14 + 59 = 73, which is simply too high.
Now let's try adding numbers. If I add 59 and 66, I get 125. Then I would wish to subtract 80 from 125 to get 45, but 80 is simply too large to suit well with this.
Let me try a distinct approach. What if I subtract 66 from 80 first after which add 59? That will be 14 + 59 = 73, which remains to be not 45.
Hmm, I would like to consider one other combination. What if I start with 59 after which add 80 and subtract 66? That will be (59 + 80) - 66, let me calculate it. That equals 139 - 66, which is 73.
That is still not 45, so let me try rearranging it barely: perhaps 80 - 66 + 59 can be closer to 45. Calculating again: 80 - 66 is 14, then adding 59 gives 73.
This manner doesn't work either. Let's flip this: perhaps (59 + 66) - 80. Let me check: that is 125 - 80, which equals 45.think>
<answer>(59 + 66) - 80answer>
<think> First, I would like to search out a mixture of the numbers 61, 4, 94, and 21 that could be manipulated using basic arithmetic operations to equal 50. One method to approach that is to search for numbers that could be used to succeed in a goal number near 50 after which adjust with the remaining numbers.
Let's try combining 94 and 21 first:
94 + 21 = 115
Now, we want to get from 115 to 50. We will try subtracting 65:
115 - 65 = 50
Now we want to search out a method to get 65 using the remaining numbers 61 and 4. We will try:
61 - 4 = 57
Now we want to get from 57 to 65. Adding 8 would give us that:
57 + 8 = 65
So, we want to search out a method to get 8 using 61 and 4. We will try:
61 - 94 = -36
That is negative, so let's try dividing:
61 / 94 = 0.648987...
This shouldn't be helping, so let's try multiplication and division another way:
(61 - 4) * (94 / 21) = 57 * 4.476... = 254.4
This shouldn't be working either. Let's try a distinct approach:
(94 - 61) = 33
Now we've got:
33 + 21 = 54
We'd like to regulate from 54 to 50, so we will:
54 - 4 = 50
So, the equation becomes:
(94 - 61) + (21 - 4) = 50
think>
<answer>(94 - 61) + (21 - 4)answer>
<think> First, I'll consider the potential for directly subtracting two numbers to get near 31. If I subtract 57 from 63, I get 7. Now, if I subtract this 7 from 25, I get 25 - 7 = 18. Removed from 31, so this does not work.
Next, let's try adding and subtracting. Dividing 63 by 25 will give a decimal, in order that's unlikely to work. Now, let's examine if adding and subtracting in one other order could work. For instance, let's try 63 - 25, after which we see this gets us 38. Now, we want a method to get to 31 from here. Subtracting 7 from 38 would work, but we won't determine tips on how to create that 7 from the given numbers.
Hmm, let me try to include all numbers by adding and subtracting. If I try 63 + 25, I get 88. This is sort of removed from 31, but when 88 - 57, I get 31 exactly!
This appears to be the equation: 63 + 25 - 57 equals 31.
think>
<answer> 63 + 25 - 57 answer>



Successful Reasoning Samples between step 200 and 450

<think> We'd like to make use of the numbers 39, 16, and 55 with basic arithmetic operations to get 78. Let's try different combos:
- 55 + 39 - 16 = 84 - 16 = 68 (too low)
- 55 + 16 - 39 = 71 - 39 = 32 (too low)
- 55 - 16 + 39 = 39 + 39 = 78 (this works)
So, the equation is 55 - 16 + 39 = 78. think>
<answer> 55 - 16 + 39 answer>
<think> We'd like to make use of the numbers 21, 5, and 40 with basic arithmetic operations to get 24. Let's try different combos:
- 21 - 5 - 40 = -15
- 21 - 5 + 40 = 56
- 21 + 5 - 40 = -14
- 21 * 5 / 40 = 2.1 (not an integer, let's try one other)
- 40 * 5 / 21 = 9.52 (not 24)
- 40 / 5 - 21 = 9 - 21 = -12
- 5 * 40 / 21 = 100 / 21 ~= 4.76 (not 24)
- 5 + 40 - 21 = 24 ! This works.
think>
<answer> 5 + 40 - 21 answer>
<think> We'd like to make use of the numbers 75, 99, 62, and 34 with basic arithmetic operations to get 72. Let's try different combos:
- 99 - 75 + 62 - 34 = 92 - 34 = 58 (too high)
- 99 - 75 - 62 + 34 = 24 - 62 + 34 = -38 + 34 = -4 (too low)
- 99 + 75 - 62 - 34 = 174 - 96 = 78 (too high)
- 99 + 75 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 99 - 75 + 62 + 34 = 24 + 96 = 120 (too high)
- 75 + 99 - 62 - 34 = 174 - 96 = 78 (too high)
- 75 + 99 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 75 + 62 - 99 + 34 = 137 - 99 + 34 = 38 + 34 = 72
So, 75 + 62 - 99 + 34 equals 72.
think>
<answer> 75 + 62 - 99 + 34 answer>



Conclusion

The release of DeepSeek R1 and its research paper could be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we were able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning, and converges to a very specific "reasoning" format, it shows that the method is working.

Looking into 2025, it's clear that we are on the cusp of even more significant progress. RL will become more accessible and user-friendly, and more researchers and developers will explore its potential, but it will also require more compute than before and compared to supervised fine-tuning.

I'm excited for 2025. If you have any questions or ideas, feel free to reach out to me.

If this sounds interesting, we would love your help! Whether it's contributing code or joining discussions on Hugging Face.


