How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo


Welcome to part 2 of my LLM deep dive. If you haven't read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

  1. Pre-training — Learning from massive datasets to form a base model.
  2. Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful.

Now, we're diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

I've taken reference from Andrej Karpathy's widely popular 3.5-hour YouTube video. Andrej is a founding member of OpenAI, and his insights are gold — you get the idea.

Let’s go 🚀

What's the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What's intuitive for us — like basic arithmetic — might not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it difficult for human annotators to provide the "perfect" set of labels that consistently guide an LLM toward the best answer.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic — meaning their responses aren't fixed. Even with the same prompt, the output varies because it's sampled from a probability distribution.

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from its own generated outputs.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
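To make this concrete, here's a minimal sketch of the idea, assuming a Hugging Face-style `model` and `tokenizer` and a hypothetical `score_fn` that rates a decoded response (none of these names come from the article itself):

```python
def reinforce_better_paths(model, tokenizer, prompt, score_fn, num_samples=16):
    """Toy sketch: sample many candidate responses, keep the best, train on it."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # Explore: sample several candidate responses for the same prompt.
    candidates = [
        model.generate(**inputs, do_sample=True, max_new_tokens=128)
        for _ in range(num_samples)
    ]

    # Score each candidate (e.g. with a verifier, a reward model, or human labels).
    rewards = [
        score_fn(tokenizer.decode(c[0], skip_special_tokens=True))
        for c in candidates
    ]

    # Reinforce: treat the best-scoring response as a training target,
    # nudging the model toward the better paths.
    best = candidates[rewards.index(max(rewards))]
    loss = model(input_ids=best, labels=best).loss
    loss.backward()
    return loss
```

Real pipelines are far more sophisticated than "train on the single best sample", but the explore-score-reinforce loop is the core intuition.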

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not "new" — it can surpass human expertise (AlphaGo, 2016)

A great example of RL's power is DeepMind's AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model lots of good examples to imitate), it was able to reach human-level performance, but never surpass it.

The dotted line represents Lee Sedol's performance — the best Go player in the world.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from AlphaGo 2016 paper

RL represents an exciting frontier in AI — models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

RL foundations recap

Let's quickly recap the key components of a typical RL setup:

Image by author
  • Agent — The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
  • Environment — The external system in which the agent operates.
  • State — A snapshot of the environment at a given step t.

At each timestep, the agent performs an action in the environment, which may change the environment's state to a new one. The agent will also receive feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
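A minimal sketch of that interaction loop, with `env` and `policy` as placeholder objects rather than any particular library:

```python
def run_episode(env, policy, max_steps=100):
    """Generic RL interaction loop: observe, act, receive a reward, learn."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.select_action(state)               # agent picks an action
        next_state, reward, done = env.step(action)        # environment reacts
        policy.update(state, action, reward, next_state)   # learn from the feedback
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```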

Policy

The policy is the agent's strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it's a function that determines the probability of different actions for a given state — πθ(a|s).
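For an LLM, this boils down to a softmax over the model's next-token logits. A tiny illustrative snippet (the logits here are random stand-ins for a real forward pass, and the vocabulary size is arbitrary):

```python
import torch
import torch.nn.functional as F

# Stand-in for the logits an LLM produces for the next token,
# given the current state s (the text so far).
logits = torch.randn(50_000)

pi = F.softmax(logits, dim=-1)      # π_θ(a | s): a probability for every possible next token
action = torch.multinomial(pi, 1)   # sampling from the policy = generating the next token
```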

Value function

An estimate of how good it is to be in a certain state, considering the expected future reward. For an LLM, the reward might come from human feedback or a reward model.

Actor-Critic architecture

It's a popular RL setup that combines two components:

  1. Actor — Learns and updates the policy (πθ), deciding which action to take in each state.
  2. Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.

How it works:

  • The actor picks an action based on its current policy.
  • The critic evaluates the outcome (reward + next state) and updates its value estimate.
  • The critic's feedback helps the actor refine its policy so that future actions lead to higher rewards (see the sketch below).
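Here's a simplified sketch of one actor-critic update, with `actor`, `critic`, and their optimisers as placeholder PyTorch modules. It uses a one-step TD error as the critic's feedback signal; this is an illustrative version, not any specific paper's implementation:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, gamma=0.99):
    """One simplified actor-critic update (a sketch, not a full implementation)."""
    # Critic: estimate values and compute the TD error -- the feedback signal.
    value = critic(state)                       # V(s)
    next_value = critic(next_state).detach()    # V(s') used as a fixed target
    td_error = reward + gamma * next_value - value

    # Critic update: move V(s) toward the observed outcome.
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: make actions with a positive advantage more likely.
    log_prob = torch.log(actor(state)[action])  # log π_θ(a | s)
    actor_loss = -(log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```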

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (e.g. trained from human feedback) tells the model how good or bad its generated text is.

The policy is the model's strategy for selecting the next token, while the value function estimates how promising the current text context is in terms of eventually producing high-quality responses.

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL's importance, let's explore DeepSeek-R1, a reasoning model that achieves top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

  • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
  • DeepSeek-R1 builds on it, addressing the challenges encountered.

Let's dive into some of these key points.

1. RL algorithm: Group Relative Policy Optimisation (GRPO)

One game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024.

PPO struggles with reasoning tasks due to:

  1. Dependency on a critic model.
    PPO needs a separate critic model, effectively doubling memory and compute.
    Training the critic can be complex for nuanced or subjective tasks.
  2. High computational cost, as RL pipelines demand substantial resources to evaluate and optimise responses.
  3. Absolute reward evaluations
    When you rely on an absolute reward — meaning there's a single standard or metric to judge whether an answer is "good" or "bad" — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.

GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged against a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers and learn from one another. Over time, performance converges toward higher quality.
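Concretely, each response's reward is normalised against its own group: reward minus the group mean, divided by the group standard deviation. This is the group-relative advantage described in the DeepSeekMath paper. A minimal sketch:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Score each response relative to its own group of sampled answers.

    rewards: shape (G,) -- scalar rewards for G responses to the same query.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one query, scored by some reward function.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
# Above-average answers get a positive advantage, below-average ones a negative advantage.
```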

How does GRPO fit into the whole training process?

GRPO modifies how the loss is calculated while keeping the other training steps unchanged:

  1. Gather data (queries + responses)
    – For LLMs, queries are like questions
    – The old policy (an older snapshot of the model) generates several candidate answers for each query
  2. Assign rewards — each response in the group is scored (the "reward").
  3. Compute the GRPO loss (see the sketch after this list)
    Traditionally, you'd compute a loss — which shows the deviation between the model's prediction and the true label.
    In GRPO, however, you measure:
    a) How likely is the new policy to produce those past responses?
    b) Are those responses relatively better or worse?
    c) Apply clipping to prevent extreme updates.
    This yields a scalar loss.
  4. Backpropagation + gradient descent
    – Backpropagation calculates how each parameter contributed to the loss
    – Gradient descent updates those parameters to reduce the loss
    – Over many iterations, this gradually shifts the new policy to prefer higher-reward responses
  5. Update the old policy occasionally to match the new policy.
    This refreshes the baseline for the next round of comparisons.
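A simplified sketch of steps 3a to 3c, showing how the clipped loss is assembled from the policy ratio and the group-relative advantages. The KL penalty term and per-token details from the paper are omitted for brevity, so treat this as an illustration rather than the exact objective:

```python
import torch

def grpo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Simplified clipped surrogate loss for a group of G responses.

    new_logprobs, old_logprobs: shape (G,) -- log-probability of each sampled
        response under the new and old policies.
    advantages: shape (G,) -- group-relative advantages from step 2.
    """
    # (a) How much more (or less) likely is the new policy to produce
    #     those past responses, relative to the old policy?
    ratio = torch.exp(new_logprobs - old_logprobs)

    # (b) Weight each response by whether it was relatively better or worse.
    unclipped = ratio * advantages

    # (c) Clip the ratio so a single update can't move the policy too far.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Maximising the surrogate objective = minimising its negative (a scalar loss).
    return -torch.min(unclipped, clipped).mean()
```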

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI's o1 model also leverages this, as noted in its September 2024 report: o1's performance improves with more RL (train-time compute) and more time spent reasoning (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning. 

Image taken from DeepSeek-R1 paper

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an "aha moment" (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Image taken from DeepSeek-R1 paper

Note: Unlike DeepSeek-R1, OpenAI does not show the full, exact chains of thought in o1, as it is concerned about a distillation risk — where someone imitates those reasoning traces and recovers much of the reasoning performance through imitation alone. Instead, o1 shows only summaries of those chains of thought.

Reinforcement learning with Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there's no single "correct" answer?

This is where human feedback comes in — but naive RL approaches are unscalable.

Image by author

Let's look at the naive approach with some arbitrary numbers.

Image by author

That's one billion human evaluations needed! This is too costly, slow, and unscalable. Hence, a smarter solution is to train an AI "reward model" to learn human preferences, dramatically reducing human effort.

Image by author
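A reward model is commonly trained on pairs of responses ranked by humans, using a pairwise (Bradley-Terry style) loss that pushes the preferred response's score above the rejected one's. A minimal sketch, with `reward_model`, `chosen`, and `rejected` as placeholders rather than names from any specific library:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen, rejected):
    """Pairwise preference loss: the preferred response should score higher.

    `reward_model` is assumed to map a (tokenised) response to a scalar score;
    `chosen` / `rejected` are the human-preferred and less-preferred responses.
    """
    r_chosen = reward_model(chosen)      # score for the answer humans preferred
    r_rejected = reward_model(rejected)  # score for the answer they rejected
    # Loss is small when the preferred answer scores higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained, this reward model stands in for the human labellers, scoring millions of candidate responses during RL at a fraction of the cost.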

Upsides of RLHF

  • Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
  • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

  • The reward model is an approximation — it might not perfectly reflect human preferences.
  • RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that's a wrap! I hope you enjoyed Part 2 🙂 If you haven't already read Part 1, do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments — I'd love to hear your thoughts. See you in the next article!
