introduction
Welcome to part 2 of the LLM deep dive. In part 1, we covered pre-training and supervised fine-tuning (SFT) — the foundational stages of building an LLM. Now, we're diving into reinforcement learning (RL) — where things get really interesting.
This article was heavily inspired by Andrej Karpathy's 3.5-hour YouTube video on how LLMs work. If you haven't watched it, I'd highly recommend it.
what's the purpose of reinforcement learning (RL)?
Humans and LLMs process information very differently. What's intuitive for us may not be intuitive for an LLM, and vice versa. RL bridges this gap by allowing the model to learn from its own experience.
Instead of relying on explicit labels provided by humans, the model explores different token sequences and receives reward signals that tell it how well it's doing. Over time, it learns to generate sequences that maximise these rewards.
intuition behind RL
LLMs are stochastic — given the same prompt, they can produce different outputs each time. We can harness this randomness by generating thousands of responses in parallel, then training the model on the token sequences that lead to better outcomes.
Unlike SFT, where human experts provide labelled data for the model to imitate, RL allows the model to learn from itself. It doesn't need a human to show it the "right" answer — it discovers what works through exploration.
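To make that stochasticity concrete, here is a toy sketch in pure Python. The logits are invented for illustration; a real model would compute them from the prompt:

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_token(logits):
    """Sample one token index from a softmax over the logits."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # numerically stable softmax
    return random.choices(range(len(logits)), weights=weights)[0]

# Made-up logits over a 4-token vocabulary (a real model computes these).
logits = [2.0, 1.0, 0.5, 0.1]

# The same "prompt" yields different tokens on repeated sampling; this is
# the randomness RL harnesses when generating many candidate responses.
samples = [sample_token(logits) for _ in range(1000)]
counts = {i: samples.count(i) for i in range(len(logits))}
print(counts)  # the highest-logit token dominates, but every token appears
```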
RL is not "new" — it can surpass human expertise (AlphaGo, 2016)
One of the most striking demonstrations of RL's power came from DeepMind's AlphaGo in 2016.
AlphaGo was first trained using SFT on expert human games of Go. This got it to roughly human-level performance — but never beyond it. SFT is fundamentally about replication, not innovation. The model can only be as good as the data it learns from.

The breakthrough came when RL was introduced. By playing against itself millions of times, AlphaGo discovered strategies that no human had ever conceived of. It didn't just match human expertise — it exceeded it. The dotted line in the graph represents Lee Sedol, one of the world's best Go players at the time.
RL foundations recap
Before we apply RL to LLMs, let's quickly recap the key components:
- Agent: the learner and decision maker
- Environment: what the agent interacts with
- State: the current situation the agent is in
- Action: what the agent can do
- Reward: feedback signal from the environment
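These pieces interact in a simple loop: the agent observes the state, chooses an action, and the environment returns the next state and a reward. A minimal sketch with a made-up one-dimensional environment (nothing here is a real RL library):

```python
import random

random.seed(0)

class ToyEnv:
    """A made-up environment: the state is a position on a number line,
    and the episode ends when the agent reaches +3 (reward 1) or -3 (reward 0)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                      # action is -1 or +1
        done = abs(self.state) >= 3
        reward = 1.0 if self.state == 3 else 0.0  # feedback from the environment
        return self.state, reward, done

env = ToyEnv()
done, total_reward = False, 0.0
while not done:
    action = random.choice([-1, 1])           # a (purely random) policy
    state, reward, done = env.step(action)    # environment transition
    total_reward += reward
print(state, total_reward)
```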
Two important concepts:
- Policy — the agent's strategy for choosing actions given a state, denoted πθ(a|s)
- Value function — estimates how beneficial a particular state is, helping the agent evaluate long-term outcomes
Actor-Critic architecture
A common RL architecture splits the model into two roles:
- Actor — learns the policy (what action to take)
- Critic — evaluates the value function (how good is this state?)
The Actor proposes actions, and the Critic provides feedback on how good those actions are — helping the Actor improve over time.
putting it all together for LLMs
When we apply RL to language models, the components map as follows:
- State = the current text (prompt + tokens generated so far)
- Action = the next token to generate
- Reward = a score for the generated text, produced in practice by a reward model trained on human feedback
- Policy = the LLM's strategy for picking the next token
- Value function = estimates how beneficial the current text context is for producing a high-reward response
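A toy sketch of this mapping (the vocabulary, policy, and reward function below are all invented for illustration; a real LLM's policy conditions on the full text so far):

```python
import random

random.seed(1)

VOCAB = ["the", "cat", "sat", ".", "<eos>"]

def policy(state):
    """Stand-in policy π(a|s): uniform over the vocabulary. A real LLM
    would compute these probabilities from the text generated so far."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def reward_model(text):
    """Stand-in reward model: +1 per non-terminal token, for illustration."""
    return float(sum(1 for t in text if t != "<eos>"))

state = ["the"]                                  # the prompt is the initial state
while state[-1] != "<eos>" and len(state) < 8:
    probs = policy(state)                        # distribution over actions
    action = random.choices(range(len(VOCAB)), weights=probs)[0]
    state.append(VOCAB[action])                  # next state = old state + new token
print(state, reward_model(state))                # reward arrives for the whole text
```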
DeepSeek-R1 (published 22 Jan 2025)
DeepSeek released two notable models:
- DeepSeek-R1-Zero — trained solely via large-scale RL, completely skipping SFT
- DeepSeek-R1 — the full model incorporating both SFT and RL stages
RL algorithm: Group Relative Policy Optimisation (GRPO)
GRPO is a variant of Proximal Policy Optimisation (PPO) — one of the most widely used RL algorithms. So why did DeepSeek choose GRPO over PPO?
PPO struggles with a few issues:
- Dependency on a critic model — this effectively doubles memory and compute requirements
- High computational cost — training both an actor and critic is expensive
- Absolute reward evaluations — PPO evaluates responses in absolute terms, which can be noisy and unstable
GRPO addresses these by eliminating the critic model entirely. Instead of absolute evaluations, it uses relative evaluation — responses are compared within a group.
Think of it like students comparing answers with each other, rather than a teacher grading each student individually.
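In code, the within-group comparison is typically just reward normalisation: subtract the group's mean reward and divide by its standard deviation, so each response's advantage says how it did relative to its peers. A minimal sketch:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: normalise a group's rewards to zero mean
    and unit standard deviation, so each score is relative to the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        std = 1.0  # every response scored the same: no relative signal
    return [(r - mean) / std for r in rewards]

# Four responses to the same prompt, scored by a reward model:
advantages = group_relative_advantages([1.0, 3.0, 2.0, 2.0])
print([round(a, 3) for a in advantages])  # → [-1.414, 1.414, 0.0, 0.0]
```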
The GRPO training loop works like this:
- Gather data — generate a group of responses for a given prompt
- Assign rewards — score each response using the reward model
- Compute GRPO loss — this considers:
  - How likely is the new policy to produce the past responses?
  - Are the responses relatively better or worse within the group?
  - A clipping term that prevents the policy from changing too drastically
- Backpropagation + gradient descent — update the model weights
- Update old policy — periodically sync the reference policy
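The clipping step can be sketched with the PPO-style clipped surrogate that GRPO inherits. This is a single-response simplification: real implementations work on per-token log-probabilities across batches, and GRPO also adds a KL penalty against a reference policy.

```python
import math

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped term: the probability ratio between the new and
    old policy is clipped to [1 - eps, 1 + eps], and the more pessimistic
    of the clipped and unclipped candidates is kept."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the upside of a large ratio is capped at 1 + eps.
print(clipped_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))   # → 1.2
# Negative advantage: the min keeps the pessimistic (clipped) candidate.
print(clipped_objective(logp_new=-0.5, logp_old=0.0, advantage=-1.0))  # → -0.8
```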
chain of thought (CoT)
One of the most fascinating findings from DeepSeek-R1-Zero was that by skipping SFT and directly training with RL, the model spontaneously developed chain of thought reasoning.
CoT is like how humans think through tough questions — breaking down a problem step by step before arriving at a final answer. OpenAI's o1 model also leverages this kind of extended reasoning.
A key observation: as training progressed, the model's responses grew longer and more detailed. The model learned to "think" more before answering, allocating more compute to harder problems.
The researchers noted an "aha moment" — a point during training where the model spontaneously learned to pause and re-evaluate its initial approach, producing markedly better reasoning chains.
A note on OpenAI o1: OpenAI doesn't show the full reasoning chains in o1's outputs. One reason is to mitigate the risk of distillation — competitors could use those detailed reasoning traces to train their own models.
Reinforcement Learning with Human Feedback (RLHF)
Not all tasks have easily verifiable outputs. For things like summarisation, creative writing, or conversational quality, there's no simple way to automatically check if an answer is "correct."
This is where RLHF comes in.
the problem with naive approaches
Consider this: if you wanted to evaluate every possible response a model could generate, you'd need humans to score an astronomical number of outputs — potentially over a billion evaluations. This is completely unscalable.

a smarter solution: train an AI reward model
Instead of having humans evaluate every response, we train a reward model that learns to predict human preferences. The key insight is that ranking responses is much easier than scoring them absolutely.

Human labellers compare pairs of responses and indicate which is better. This ranking data is then used to train the reward model, which can score new responses at scale.
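One common formulation (used, for example, in InstructGPT-style RLHF) is a Bradley-Terry pairwise loss: the reward model is trained to give the preferred response a higher score by minimising -log sigmoid(score_chosen - score_rejected). A minimal sketch of the loss itself:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(score_chosen - score_rejected)).
    It falls toward 0 as the reward model separates the preferred response."""
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the margin between chosen and rejected grows:
print(round(preference_loss(2.0, 0.0), 4))  # → 0.1269
print(round(preference_loss(0.5, 0.0), 4))  # → 0.4741
print(round(preference_loss(0.0, 0.0), 4))  # → 0.6931 (no separation: log 2)
```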
upsides
- Domain-agnostic — can be applied to any task where humans can express preferences
- Ranking is easier — human labellers find it much more natural to rank responses than to assign absolute scores
downsides
- The reward model is an approximation — it won't perfectly capture human preferences
- RL can game the reward model — the model may learn to exploit quirks in the reward model rather than genuinely improving
- RLHF is not the same as traditional RL — it's more accurately described as a fine-tuning step guided by learned human preferences
wrapping up
RL is what transforms LLMs from sophisticated pattern matchers into models capable of reasoning, self-improvement, and even surpassing human expertise in certain domains. From AlphaGo to DeepSeek-R1 to OpenAI's o1, the trajectory is clear — RL is a critical piece of the puzzle.
The combination of techniques like GRPO, CoT, and RLHF gives us a powerful toolkit for aligning models with human preferences while pushing the boundaries of what they can achieve.