in fashion. DeepSeek-R1, Gemini-2.5-Pro, OpenAI’s O-series models, Anthropic’s Claude, Magistral, and Qwen3: there’s a brand new one every month. When you ask these models a question, they go into a long “thinking” phase before generating an answer.
I recently asked myself the question, “Hmm… I wonder if I should write a Reinforcement Learning loop from scratch that teaches this ‘thinking’ behaviour to small models…”. It should be easy, right?
Well, it wasn’t.
Small models simply don’t have the world knowledge that large models do. This makes <1B parameter models lack the “common sense” to easily reason through complex logical tasks. Therefore, you can’t just rely on compute to train them to reason.
You need additional tricks up your sleeve.
In this article, I won’t just cover tricks, though. I’ll cover the main ideas behind training reasoning behaviours into language models, share some simple code snippets, and offer some practical tips to fine-tune Small Language Models (SLMs) with RL.
This article is split into 5 sections:
- Intro to RLVR (Reinforcement Learning with Verifiable Rewards) and why it’s uber cool
- A visual overview of the GRPO algorithm and the clipped surrogate PPO loss.
- A code walkthrough!
- Supervised fine-tuning and practical tricks to train reasoning models
- Results!
1. Reinforcement Learning with Verifiable Rewards (RLVR)
Before diving into specific challenges with Small models, let’s first introduce some terms.
Group Relative Policy Optimization, or GRPO, is a (quite new) Reinforcement Learning (RL) technique that researchers are using to fine-tune Large Language Models (LLMs) on logical and analytical tasks. Since its inception, a new term has been circulating in the LLM research space: RLVR, or Reinforcement Learning with Verifiable Rewards.
To understand what makes RLVR unique, it’s helpful to contrast it with the most common application of RL in language models: RLHF (Reinforcement Learning with Human Feedback). In RLHF, an RL module is trained to maximize scores from a separate reward model, which acts as a proxy for human preferences. This reward model is trained on a dataset where humans have ranked or rated different model responses.
In other words, RLHF trains LLMs to output responses that are more aligned with human preferences. It tries to make models follow instructions more closely.
RLVR tries to solve a different problem. RLVR teaches a model to be verifiably correct, often by learning to generate its own chain of thought.
Where RLHF had a reward model, RLVR uses a verifier. The core idea is to give rewards based on whether an answer is demonstrably correct, not on a prediction of what a human might prefer.

This is precisely why the method is called “RL with Verifiable Rewards”. Not every question’s answer can be verified easily, especially open-ended ones. Some use cases, however, do fit neatly into the “verifiable rewards” paradigm: math, logical tasks, and code-writing, to name a few. In the reasoning-gym section below, we’ll look into how exactly these tasks can be simulated and how the rewards can be generated.
We’ll train the LLM to generate arbitrarily long chain-of-thought reasoning text before generating the final answer. We instruct the model to wrap its thinking process in <think>...</think> tags and its final conclusion in <answer>...</answer> tags.
The full language model response will look something like this:
<think>
The user has asked me to count the number of r's in strawberry.
Let's do a cumulative count.
s=0, t=0, r=1, a=0, w=0, b=0, e=0, r=2, r=3, y=3
It seems there are 3 r's in strawberry.
I notice that there's an r in straw and 2 r's in berry.
Since 1+2=3, I'm more confident there are 3 r's.
</think>
<answer>3</answer>
This structure allows us to easily extract just the final answer and check if it’s correct. The verifier is a single source of truth, and it can be a simple piece of code that (literally) counts letters.
def count_alphabets(word, letter):
    return sum([1 for l in word if l == letter])

reward = 1 if lm_answer == count_alphabets("strawberry", "r") else -1
We’ll keep a record of the model’s experiences: its responses and the corresponding rewards received from the verifier. The RL algorithm will then train to promote behaviours that increase the likelihood of correct final answers.
By consistently rewarding correct answers and good formatting, we increase the likelihood of reasoning tokens that lead to correct answers.
Get this: we don’t need to directly evaluate the intermediate reasoning tokens. By simply rewarding the final answer, we indirectly elicit reasoning steps into the LLM’s chain of thought that lead to correct answers!

2. GRPO (Group Relative Policy Optimization)
I’m going to skip the usual intro here; I expect most of you who read this far to know the basics of RL. An agent observes states from the environment and takes an action; the environment rewards the agent depending on how good the action was; the agent stores these experiences and trains to take better actions in the future that lead to higher rewards. Class dismissed.
Let’s talk about our algorithm of choice, Group Relative Policy Optimization, to understand how. GRPO works in two iteratively repeating phases: an experience collection phase, where the Language Model (LM) accumulates experiences in the environment with its current weights, and a training phase, where it uses the collected memories to update its weights. After training, it once again goes into an experience collection step with the updated weights.
Experience Collection
Let’s dissect each step in the experience collection phase.
- Step 1: The environment is a black box that generates questions on logical or math tasks. We’ll discuss this in an upcoming section with the reasoning-gym library.
- Step 2: We tokenize the input questions into a sequence of integer tokens.

- Step 3: The “agent” or the “policy” is the current SLM we’re training. It observes the environment’s tokenized questions and generates responses. The LM response gets converted back into text and returned to the environment. The environment rewards each response.

- Step 4: From the rewards, we calculate the advantage of each response. In GRPO, the advantage is the relative goodness of each response within its group. Importantly, advantages are calculated per group, i.e. we don’t standardize rewards across different questions (see the short sketch after this list).

(Illustrated by the Author)
- Step 5: The original question, the log probabilities for each LM-generated token, and the advantages are all collected inside a memory buffer.
- Steps 1-5 are repeated till the buffer size reaches the desired threshold.
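To make Step 4 concrete, here is a minimal NumPy sketch with made-up reward values: each row is one question’s group of G responses, and the standardization never mixes rows.

import numpy as np

# Hypothetical rewards: B=2 questions, G=3 sampled responses each
rewards = np.array([
    [1.0, -1.0, 1.0],   # group for question 1
    [-1.0, -1.0, 1.0],  # group for question 2
])

# Standardize within each group (row-wise), never across questions
advantages = (rewards - rewards.mean(axis=1, keepdims=True)) / (
    rewards.std(axis=1, keepdims=True) + 1e-8
)
# advantages ≈ [[ 0.71, -1.41,  0.71],
#               [-0.71, -0.71,  1.41]]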

Training Phase
After the end of the experience collection phase, we enter the training phase. Here, we learn from the reward patterns the LLM observed and use RL to improve its weights. Here is how that works:
- Randomly sample a minibatch of memories. Remember, each memory already contains its group-relative advantage (Step 5 from the experience collection phase). Randomly sampling question-answer pairs improves the robustness of training because the gradients are calculated as an average over a diverse set of experiences, preventing over-fitting on any single question.
- For each minibatch, we want to maximize the clipped surrogate objective, following the standard PPO (Proximal Policy Optimization) formulation. The main difference with GRPO is that we don’t need an additional reward model or a value network to calculate advantages. Instead, GRPO samples multiple responses to the same question and computes the relative advantage of each response within that group. The memory footprint is significantly reduced since we don’t have to train those additional models!
- Repeat the above steps.
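Putting the two phases together, the outer loop looks roughly like this. This is a pseudocode-level sketch; every helper name here is illustrative, not from an actual library.

for iteration in range(num_iterations):
    # Phase 1: experience collection with the current (frozen) weights
    buffer = []
    while len(buffer) < buffer_size:
        questions = sample_questions(environment, exploration_batchsize)
        responses = generate_responses(lm, questions, num_return_sequences=G)
        rewards = score_responses(responses, questions)             # verifier + format rewards
        advantages = standardize_per_group(rewards, group_size=G)   # Step 4
        buffer.extend(package_experiences(questions, responses, advantages))

    # Phase 2: training on minibatches sampled from the buffer
    for minibatch in sample_minibatches(buffer, batch_size):
        loss = clipped_surrogate_loss(lm, minibatch)   # the PPO-style objective below
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()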

What the PPO Loss means
Let me explain the PPO loss in an intuitive, step-by-step fashion. The PPO loss looks like this:
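Written out, this is the standard GRPO form of the clipped surrogate objective (shown here without the optional KL term; the symbols are explained below):

$$
\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,A_{i,t},\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_{i,t}\Big),
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q,\,o_{i,<t})}
$$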

- Here, π_old is the old-policy neural network that we used during the data collection phase. π is the current policy neural network we’re training. Since the weights of π change after each gradient update, π and π_old do not stay the same throughout the training phase, hence the distinction.
- G is the number of generated responses for a single question. |o_i| is the length of the i-th response in the group. The summation and normalization operations therefore compute a mean over all tokens of all responses. What does it compute the mean of? Well, it’s π/π_old * A_it. What does that mean?

- A_it is the advantage of the t-th token in the i-th response. Remember when we calculated the advantage of each response in Step 5 during experience collection? The simplest way to assign an advantage to each token is to simply duplicate the response-level advantage across all of its tokens; this means we’re saying that every token is equally responsible for generating the correct answer.
- Lastly, what is π(o_it | q, o_i,<t)? It is the probability of the t-th token in the i-th response, i.e. how likely that token was when it was generated.
- The importance sampling ratio reweights the advantages between the current updating policy and the old exploration policy.
- The clipping term ensures that the updates to the network don’t become too large and the weights don’t move too far away from the old policy. This adds stability to the training process by keeping the model updates near a “trust region” around the data-collection policy.

When we maximize the PPO objective, we are effectively asking the LLM to increase the log-probability of the tokens that led to a high advantage, while decreasing the log-probability of tokens that had a low advantage.
In other words: make tokens that generate high advantages more likely and tokens that generate low advantages less likely.
Understanding the PPO Loss with an example
Let’s forget about the clipping term and the π_old for now, and just see what maximizing π(o_i) * A_i means. To remind you, this part of the equation is simply the product of the probability of the i-th token (o_i) and the advantage of the i-th token (A_i).
Let’s say that for a question, the LLM generated these two sequences: “A B C” and “D E F”, and it got an advantage of +1 for the former and -1 for the latter. Let’s say we have the log probabilities for each of the three tokens as shown below.
Notice what happens when you multiply the advantages A_it by the current log-probs. Now really think about what it means to maximize the mean of that product matrix.

Remember, we can only change the probabilities coming out of the LLM. The advantages come from the environment and are therefore treated as constants. Increasing this expected score would therefore mean increasing the probability of tokens with a positive advantage and decreasing the probability of tokens with a negative advantage.
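Here is that toy example in NumPy, with made-up log-probabilities; the point is only to see which entries a gradient step would push up or down.

import numpy as np

# Made-up per-token log-probs for the two responses "A B C" and "D E F"
log_probs = np.array([
    [-1.2, -0.8, -2.0],   # tokens A, B, C
    [-1.5, -0.6, -1.1],   # tokens D, E, F
])
# Every token inherits its response's advantage: +1 for "A B C", -1 for "D E F"
advantages = np.array([
    [1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0],
])

objective = (log_probs * advantages).mean()
# Increasing this objective means pushing the first row's log-probs towards 0
# (more likely) and the second row's log-probs further down (less likely).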

(Illustrated by the Author)
Below, you can find an example of how log-probs change after a few rounds of training. Notice how the blue line is moving closer to zero when the advantage is high? This means that the log-probabilities increased (i.e. the probabilities increased) after going through RL training. Compare that to the plot on the right, which shows a different response with a low advantage. The blue line is moving away from 0, becoming less probable in later rounds.

In the next section, let’s take a look at the reasoning-gym library and understand how we can sample tasks.
3. Implementation
So, to do RL, we first need tasks. A common way to do this is by using an existing dataset of math problems, like the GSM-8K dataset. In this article, let’s look at a different approach: generating tasks procedurally with a Python library called reasoning-gym.
For my experiments, I used two tasks: syllogism and propositional logic. reasoning-gym contains a bunch of different task generators of varying difficulty.
A syllogism task is a type of logical puzzle designed to test deductive reasoning. Basically, we provide the LLM with two premises and ask if a given conclusion follows (for example: “All men are mortal. Socrates is a man. Does it follow that Socrates is mortal?”). The propositional logic task is a symbolic reasoning task where the LLM is given premises written with logical symbols and asked to generate a valid conclusion. Unlike syllogism, this is not a YES/NO classification response; the model must generate the correct conclusion directly. This makes the task considerably harder.

The jury is still out on what qualifies as a “small” model (some say <14B, some say <7B), but for my YouTube video, I picked even smaller models: SmolLM-135M-Instruct, SmolLM-360M-Instruct, and Qwen3-0.6B. These are ~135M, ~360M, and ~600M parameter models, respectively.
Let’s see how to set up the basic training loop. First, we use Huggingface’s transformers library to load a model we want to train, say the little 135M-parameter SmolLM-135M-Instruct.
To generate some propositional logic tasks, for example, you simply call the reasoning_gym.create_dataset function as shown below.
import re
import numpy as np
import torch
from reasoning_gym import create_dataset, get_score_answer_fn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingfaceTB/SmolLM-135M-Instruct"

# load model from huggingface
lm = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# This sets all parameters as trainable
for param in lm.parameters():
    param.requires_grad = True
# In my experiments, I used a LORA adapter (more on this later)

# specify the name of the env
environment_name = "propositional_logic"

# In practice, you should wrap this with a torch dataloader
# to sample a minibatch of questions
dataset = create_dataset(environment_name, seed=42, size=DATA_SIZE)

for d in dataset:
    question = d["question"]  # Accessing the question
    # We'll use this later to verify if the answer is correct
    validation_object = d["metadata"]["source_dataset"]
    score_fn = get_score_answer_fn(validation_object)
To generate reasoning data, we want the LM to generate its thinking, followed by the response. Below is the system prompt we will be using.
system_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in its mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>.
Do not generate new code. Do not write python code.
You may also be given examples by the user telling you the expected response format.
Follow the format of the examples, but solve the specific problem asked by the user, not the examples.
Very important - Remember again, your output format should be:
<think> reasoning process here </think>
<answer> answer here </answer>
Your response will be scored by extracting the substring between the <answer>...</answer> tags.
It is critical to follow the above format.
Failing to follow the response format will result in a penalty.
"""
To generate answers, we first tokenize the system prompt and the question as shown below.
# Create messages structure
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question},  # Obtained from reasoning-gym
]

# Create tokenized representation
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,  # return a dict with input_ids and attention_mask
    add_generation_prompt=True,
)
Then we pass it through the LM, generate multiple responses using the num_return_sequences parameter, and detokenize them back into string responses. No gradients are calculated during this stage.
eos_token_id = tokenizer.eos_token_id

generated_response = lm.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=max_new_tokens,   # The max number of tokens to generate
    do_sample=True,                  # Probabilistic sampling
    top_p=0.95,                      # Nucleus sampling
    num_return_sequences=G,          # Number of sequences per question
    temperature=1,                   # Increase randomness
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
)
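To get the string responses back (assuming a single question per generate call here), we can slice off the prompt tokens and batch-decode the rest; the variable names mirror the snippets above.

# The generated tensor contains the prompt followed by the newly generated tokens
prompt_len = inputs["input_ids"].shape[1]
completion_ids = generated_response[:, prompt_len:]

# One decoded string per sampled sequence (G of them)
responses = tokenizer.batch_decode(completion_ids, skip_special_tokens=True)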
We also write an extract_answer function, which uses regular expressions to extract answers between the <answer> tags.
def extract_answer(response):
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer is not None:
        return answer.group(1).strip()
    else:
        return ""
Finally, we use the score function we got previously to generate a reward depending on whether the LM’s response was correct. To calculate rewards, we combine a format reward and a correctness reward. The correctness reward comes from the environment, and the format reward is awarded if the model correctly generates the <think> and <answer> tags.
The advantages are calculated by standardizing rewards within each group.
# responses is an array of strings of length [B*G]
# B is the number of questions, G is the number of responses per question
correctness_rewards = score_fn(responses, validation_object)
format_rewards = calculate_format_reward(responses)

# Total reward is a weighted sum of correctness and formatting rewards
rewards = correctness_rewards * 0.85 + format_rewards * 0.15

# Convert rewards from [B*G, 1] -> [B, G]
rewards = rewards.reshape(B, G)

# Calculate advantages by standardizing within each group
advantages = (rewards - np.mean(rewards, axis=1, keepdims=True)) / (
    np.std(rewards, axis=1, keepdims=True) + 1e-8
)
advantages = advantages.reshape(-1, 1)
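The calculate_format_reward helper isn’t spelled out above; a minimal sketch could simply check each response for well-formed tags with a regex, returning +1 or -1 per response:

import re
import numpy as np

def calculate_format_reward(responses):
    # +1 if a response contains well-formed <think>...</think> <answer>...</answer> tags, else -1
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return np.array([1.0 if pattern.search(r) else -1.0 for r in responses])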
Store the (old) log probs, advantages, responses, and response masks in a memory buffer.
# A function that returns the log prob of each chosen token
with torch.no_grad():
    old_logits = lm(input_ids=generated_response).logits
log_probs = calculate_log_probs(old_logits, generated_response)

buffer.extend([{
    "full_response": generated_response[i],
    # A binary mask indicating which tokens in the generated response are
    # model-generated: 0 for system prompt and question tokens, 1 otherwise
    "response_mask": response_mask[i],
    "old_log_probs": log_probs[i],
    "advantages": advantages[i],
} for i in range(len(generated_response))])
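calculate_log_probs is also left undefined above. Here is a minimal sketch of it, assuming it receives the logits and the token ids and returns the log-probability of each chosen token (during collection, the logits are computed under torch.no_grad() as shown above):

import torch.nn.functional as F

def calculate_log_probs(logits, input_ids):
    # logits: [B, T, V], input_ids: [B, T]
    # The logits at position t predict token t+1, hence the shift below
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    # Log-probability of each token that was actually chosen: [B, T-1]
    return log_probs.gather(dim=-1, index=targets).squeeze(-1)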
After multiple experience collection steps, once the buffer is full, we initiate our training loop. Here, we sample minibatches from our experience, recompute the log probs, compute the loss, and backprop.
# full_response, response_mask, old_log_probs, advantages <--- from the buffer

# Recompute the new log_probs. Notice no torch.no_grad(), so gradients WILL BE USED here.
logits = lm(input_ids=full_response).logits

# Extract log probs from the logits
# Does log_softmax over the vocabulary and extracts the log-prob of each chosen token
log_probs = calculate_log_probs(logits, full_response)

# Calculate the clipped surrogate loss
reasoning_loss = calculate_ppo_loss(
    log_probs,       # Trainable
    old_log_probs,   # Obtained from exploration, not trainable
    advantages,      # Obtained from the environment, not trainable
    response_mask,   # Obtained from exploration, not trainable
)

# Optimization steps
accelerator.backward(reasoning_loss)
optimizer.step()
optimizer.zero_grad()
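calculate_ppo_loss is where the clipped surrogate from Section 2 lives. A minimal sketch (token-level ratios, response-level advantages broadcast over tokens, masked so that only model-generated tokens count, and negated because optimizers minimize):

import torch

def calculate_ppo_loss(log_probs, old_log_probs, advantages, response_mask, clip_eps=0.2):
    # log_probs, old_log_probs, response_mask are assumed to be aligned [B, T] tensors
    # Importance sampling ratio between the current policy and the collection-time policy
    ratio = torch.exp(log_probs - old_log_probs)
    clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # advantages: [B, 1] -> broadcast over the token dimension
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    # Keep only the model-generated tokens, then average
    masked = surrogate * response_mask
    return -(masked.sum() / response_mask.sum().clamp(min=1))

Note that averaging over all unmasked tokens, as in this sketch, is one of the normalization variants discussed in the DAPO paper; the per-response normalization from DeepSeek works as well.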
You can use additional entropy losses here, or minimize the KL divergence against a reference model as suggested in the original DeepSeek-R1 paper, but later papers have concluded that these leash the training process and are not a requirement.
4. Warming up with Supervised Fine-tuning
Technically, we could attempt to run a massive RL training right away and hope that the small models pull through and conquer our tasks. However, the probability of that is incredibly low.
There’s one big problem: out of the box, our small models are not trained to generate correctly formatted outputs or to perform well on these tasks. Their responses do have some logical flow to them, thanks to the pretraining or instruction tuning from their original developers, but they are not good enough for our target task.

Think about it: RL trains by collecting experiences and updating the policy to reinforce the good ones. But if most of the experiences are completely bad and the model receives 0 rewards, it has no way to optimize, since it gets no signal to improve at all. So the recommended approach is to first teach the model the behaviour you want to train using supervised fine-tuning. Here is a simple script:
import asyncio
import json

import backoff
import openai
from reasoning_gym import create_dataset

client = openai.AsyncClient()
ENVIRONMENT = "propositional_logic"
model = "gpt-4.1-mini"
semaphore = asyncio.Semaphore(50)
num_datapoints = 200

system_prompt = (
    system_prompt
    + """You will also be provided the true answer. Your thinking should eventually result in producing the true answer."""
)

dataloader = create_dataset(name=ENVIRONMENT, size=num_datapoints)

@backoff.on_exception(backoff.expo, openai.RateLimitError)
async def generate_response(item):
    async with semaphore:
        messages = [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": f"""
Question: {item['question']}
Metadata: {item['metadata']}
Answer: {item['answer']}
""",
            },
        ]
        response = await client.chat.completions.create(messages=messages, model=model)
        return {
            "question": item["question"],
            "metadata": item["metadata"],
            "answer": item["answer"],
            "response": response.choices[0].message.content,
        }

async def main():
    responses = await asyncio.gather(*[generate_response(item) for item in dataloader])
    fname = f"responses_{ENVIRONMENT}_{model}.json"
    json.dump(responses, open(fname, "w"), indent=4)
    print(f"Saved responses to {fname}")

if __name__ == "__main__":
    asyncio.run(main())
To generate the fine-tuning dataset, I first generated the thinking and answer tags with a small LLM like GPT-4.1-mini. Doing this is incredibly easy: we sample 200 or so examples for each task, call the OpenAI API to generate a response, and save it to disk.
During SFT, we load the base model we want to train, attach a trainable LoRA adapter, and do parameter-efficient fine-tuning. Here are the LoRA configurations I used.
lora:
  r: 32
  lora_alpha: 64
  lora_dropout: 0
  target_modules: ["q_proj", "v_proj", "k_proj", "o_proj",
                   "up_proj", "down_proj", "gate_proj"]
LoRA allows the training process to be more memory efficient and also reduces the risk of corrupting the original model. You can find the details of parameter-efficient supervised fine-tuning in my YouTube video here.
I trained a LoRA adapter on 200 examples of syllogism data with the smallest language model I could find, the HuggingfaceTB/SmolLM-135M-Instruct, and it got us an accuracy of 46%. Roughly, this means that we generate a correct answer 46% of the time. More importantly, we often get the formatting right, so our extract_answer function can safely pull answers out of the responses more often than not.
Some more optimizations for SLMs and practical considerations
- Not all reasoning tasks can be solved by all models. A simple way to check whether a task is too hard or too easy for a model is to just measure the model’s base accuracy on your task. If it is, say, below 10-20%, the task is likely very hard and you need additional supervised warmup fine-tuning.
- SFT, even on small datasets, can generally show massive accuracy gains on small models. If you can acquire a dataset, you may not even need to do Reinforcement Learning in many scenarios. SLMs are immensely tunable.
- Papers like DAPO and Critical Perspectives on R1 have claimed that the original loss normalization from DeepSeek has a length bias. They have proposed other normalization methods that are worth exploring. For my project, the regular DeepSeek loss just worked.
- DAPO also mentions removing the KLD term from the original R1 recipe. Originally, the goal of this loss was to ensure that the updating policy never drifts too far from the base policy, but DAPO suggests dropping it, since the behaviour of the policy can change drastically during reasoning training, making the KLD term an unnecessary regularisation that may restrict the model’s intelligence.
- Generating diverse responses IS KEY to making RL possible. If you only generated correct responses, or if you only generated incorrect responses, the advantage would be 0, and this would give the RL algorithm no training signal at all. We can generate diverse responses by increasing the temperature, top_p, and num_return_sequences parameters in generate().
- You can also diversify the rewards by adding more terms to the reward function, for example, a length reward that penalizes overly long reasoning (see the sketch at the end of this list).
- The following parameters increase the stability of training at the cost of more computation: increasing the number of generations per rollout, increasing the size of the buffer, and lowering the learning rate.
- Use gradient accumulation (and even gradient checkpointing) if you have limited resources to train these models.
- There’s some fine print I skipped in this article related to padding. When saving experiences into the buffer, it’s best practice to remove the pad tokens altogether and recreate them when loading a minibatch during training.
- You should leave whitespace around <think> and <answer> (and their closing tags). This results in consistent tokenization and makes training slightly easier for the SLMs.
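As an example of the extra reward terms mentioned above, a simple length penalty could look like this (a sketch; the budget value is arbitrary):

def length_reward(num_response_tokens, budget=300):
    # 0 while under the token budget, then an increasing penalty down to -1
    overshoot = max(0, num_response_tokens - budget)
    return -min(1.0, overshoot / budget)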
5. Results
Here is my YouTube video that explains everything in this blog post more pictorially and provides a hands-on tutorial on how to code such a thing.
With RL on top of the supervised-fine-tuned SmolLM-135M on the syllogism task, we got a bump to 60%! You can see the reward curve here: the healthy standard deviation of the rewards shows that we were indeed getting diverse responses throughout, which is exactly what we want when training with RL.

Here is a set of hyperparameters that worked well for me.
config:
  name: "path/to/sft_model"
  max_new_tokens: 300       # reasoning + answer token budget
  exploration_batchsize: 8  # number of questions per batch during rollout
  G: 6                      # num responses per group
  temperature: 0.7
  batch_size: 16            # minibatch size during training
  gradient_accumulation_steps: 12
  learning_rate: 0.000001   # Recommended to keep this low, like 1e-6 or 1e-7
  top_p: 0.95
  buffer_size: 500
I also repeated this experiment with larger models: the SmolLM-360M-Instruct and the Qwen3-0.6B model. With the latter, I was able to get accuracies up to 81%, which is awesome! We got a 20% additive bump on average in the syllogism task!
In the propositional logic task, which in my opinion is a harder reasoning task, I also saw similar gains across all the small models! I’m sure that with more instruction tuning and RL fine-tuning, possibly on multiple tasks at once, we can raise the intelligence of these models a lot higher. Training on a single task can generate quick results, which is what I wanted for this YouTube video, but it can also act as a bottleneck for the model’s overall intelligence.
Let’s end this article with a GIF of the small models outputting reasoning data and solving tasks. Enjoy, and stay magnificent!

References
Author’s YouTube channel: https://www.youtube.com/@avb_fj
Author’s Patreon: www.patreon.com/NeuralBreakdownwithAVB
Author’s Twitter (X) account: https://x.com/neural_avb
Deepseek Math: https://arxiv.org/pdf/2402.03300
DeepSeek R1: https://arxiv.org/abs/2501.12948
DAPO: https://arxiv.org/abs/2503.14476
Critical Perspectives on R1: https://arxiv.org/abs/2503.20783
Reasoning Gym Library: github.com/open-thought/reasoning-gym
A good place to read about reasoning: https://github.com/willccbb/verifiers
A great place to study code: https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py