Welcome to part 4 of our LLM series! Today, we're exploring one of the most exciting frontiers in AI: reasoning models. These aren't just chatbots that parrot information—they're systems that can genuinely break down complex problems, think step-by-step, and arrive at solutions through logical deduction.
Let me start with a puzzle that reveals the difference between a standard language model and a reasoning model:
"A bat and ball cost $1.10 together. The bat costs $1 more than the ball. How much does the ball cost?"
Most people—and most standard LLMs—instinctively say "10 cents." But that's wrong. The correct answer is 5 cents, and arriving at it requires actual reasoning, not just pattern matching.
From Intuition to Reasoning: A Fundamental Shift
First, let's clarify what we mean by "reasoning" in AI. It's not about being smarter or knowing more facts. It's about being more deliberate. When you ask a reasoning model a question, it doesn't jump to an answer. Instead, it breaks the problem down, explores different approaches, checks its work, and then—and only then—produces a final answer.
# The core difference: Intuition vs Reasoning
question = "Bat ($1 more than ball) + Ball = $1.10. Ball price?"

def intuitive_model():
    """System 1 thinking: Fast, associative"""
    return "10 cents"  # ❌ Quick, wrong

def reasoning_model():
    """System 2 thinking: Slow, analytical"""
    steps = [
        "Let ball price = x",
        "Then bat price = x + 1.00",
        "Total: x + (x + 1.00) = 1.10",
        "2x + 1.00 = 1.10",
        "2x = 0.10",
        "x = 0.05",
    ]
    print("\n".join(steps))              # Show the work
    return "The ball costs 5 cents"  # ✅ Methodical, correct
Traditional language models work through what psychologists call "System 1" thinking: fast, intuitive, associative. Reasoning models engage in "System 2" thinking: slow, analytical, deliberate.
The Chain of Thought Revolution
The breakthrough came in 2022 with Chain of Thought (CoT) prompting. Researchers discovered that showing models a few worked examples, or even simply adding the phrase "Let's think step by step" to a prompt, made them significantly better at math problems, logical puzzles, and other tasks requiring reasoning.
# Traditional vs CoT prompting
def traditional_prompt():
    return "Q: What is 25% of 80?\nA:"

def cot_prompt():
    return """Q: What is 25% of 80?
Let's think step by step:
1. 25% means 25 per 100, or one quarter
2. To find 25% of 80, we can calculate 80 ÷ 4
3. 80 ÷ 4 = 20
4. Therefore, 25% of 80 is 20
A: 20"""
But prompting was just the beginning. The real revolution came when researchers started training models specifically for reasoning, creating systems like OpenAI's o1, DeepSeek R1, and Google's Gemini 2.0 Flash Thinking.
The Training Challenge: Why Reasoning is Hard
You might wonder: if reasoning is so valuable, why didn't we build reasoning models from the start? The answer lies in how these models are trained.
Traditional language models are trained through Supervised Fine-Tuning (SFT): you show them examples of questions and answers, and they learn to mimic the pattern. But this approach falls short for reasoning because:
- Human reasoning data is scarce and expensive (experts who can solve complex problems and explain their thinking are rare)
- There are often multiple valid reasoning paths to the same answer
- Models might discover better reasoning strategies than humans use
Imagine trying to teach someone chess by only showing them the final positions of games. They might memorize some patterns, but they won't learn strategy or tactics. That's the limitation of SFT for reasoning tasks.
Reinforcement Learning: The Right Tool for the Job
RL is perfect for reasoning because reasoning tasks have clear, verifiable outcomes. Did the code compile? Did it pass the test cases? Is the math answer correct? These are binary rewards that RL can optimize for.
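For example, a verifiable reward for a math problem can be as simple as an exact-match check. Here's a minimal sketch; extract_final_answer is a hypothetical helper that parses whatever answer format your model produces:

def math_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches the known solution, else 0.0."""
    predicted = extract_final_answer(model_output)  # hypothetical parser for the model's final answer
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0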
The most common RL approach for reasoning is called Proximal Policy Optimization (PPO). But PPO has a problem: it's computationally expensive. It requires training not just the main model, but also a separate "value function" that predicts how good each partial solution is.
Enter GRPO (Group Relative Policy Optimization), a newer, more elegant approach.
GRPO: The Secret Sauce of Modern Reasoning Models
GRPO takes a clever shortcut. Instead of trying to predict absolute quality at every step, it simply compares solutions against each other:
import torch
import numpy as np
class GRPOTrainer:
    """
    Group Relative Policy Optimization
    Simplified implementation
    """
    def __init__(self, model, num_groups=4):
        self.model = model
        self.num_groups = num_groups

    def generate_group(self, prompt):
        """Generate multiple solutions for the same prompt"""
        solutions = []
        for _ in range(self.num_groups):
            solution = self.model.generate(
                prompt,
                temperature=0.8,  # For diversity
                max_length=500
            )
            solutions.append(solution)
        return solutions

    def score_solution(self, solution):
        """Task-specific reward, e.g. 1.0 if the final answer is correct, else 0.0 (placeholder)"""
        raise NotImplementedError

    def compute_relative_rewards(self, solutions):
        """
        Key insight: Compare against the group average, not an absolute threshold
        """
        scores = [self.score_solution(s) for s in solutions]
        group_mean = np.mean(scores)
        group_std = np.std(scores) + 1e-8

        # Relative rewards (z-scores)
        relative_rewards = [(s - group_mean) / group_std for s in scores]
        return relative_rewards

    def grpo_loss(self, log_probs, relative_rewards):
        """Optimize the policy based on relative performance"""
        log_probs = torch.stack(log_probs)
        rewards = torch.tensor(relative_rewards)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Policy gradient loss
        loss = -(log_probs * rewards).mean()
        return loss

# Why GRPO beats PPO for reasoning:
advantages = {
    "simplicity": "No value function needed",
    "efficiency": "Single forward/backward pass",
    "stability": "Relative comparisons are more stable",
    "diversity": "Encourages multiple solution paths"
}
The beauty of GRPO is its simplicity. Models learn by competing against themselves. If one approach works better than others, that approach gets reinforced. Over time, the model discovers effective reasoning strategies through pure trial and error.
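To make that loop concrete, here is a minimal sketch of a single GRPO update using the class above. It assumes a policy object with generate and log_prob methods, an implemented score_solution, and a standard PyTorch optimizer; all of these are placeholders, not a real API:

trainer = GRPOTrainer(model=my_policy, num_groups=4)       # my_policy is a placeholder model wrapper

prompt = "A bat and a ball cost $1.10. The bat costs $1 more than the ball. What does the ball cost?"
solutions = trainer.generate_group(prompt)                  # 4 candidate reasoning chains
rewards = trainer.compute_relative_rewards(solutions)       # z-scored against the group mean

log_probs = [my_policy.log_prob(prompt, s) for s in solutions]  # assumed: one scalar tensor per solution
loss = trainer.grpo_loss(log_probs, rewards)

# optimizer: an ordinary torch optimizer over my_policy's parameters (assumed to exist)
optimizer.zero_grad()
loss.backward()   # above-average solutions get reinforced, below-average ones suppressed
optimizer.step()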
The Verbosity Problem and Its Solutions
GRPO has a known issue: length bias. Models learn that longer answers often get higher rewards because:
- More verbose solutions are less likely to make careless errors
- Graders often reward thoroughness
- There's more room to include partial credit steps
The result can be excessively verbose reasoning. Researchers have developed several fixes:
def fix_length_bias(log_probs, rewards, lengths):
    """Solutions to the verbosity problem"""
    # Method 1: Length normalization
    normalized_rewards = [r / (l ** 0.5) for r, l in zip(rewards, lengths)]

    # Method 2: Token-level DPO
    #   Compare token-by-token preferences

    # Method 3: GRPO "done right"
    #   Equalize each token's contribution to the loss

    return normalized_rewards
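As a quick sanity check of Method 1: two equally correct solutions, one four times longer than the other, no longer receive equal credit.

# Two correct solutions (raw reward 1.0 each); the second is four times longer
print(fix_length_bias(log_probs=None, rewards=[1.0, 1.0], lengths=[100, 400]))
# [0.1, 0.05] -- the shorter solution keeps more of its reward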
DeepSeek R1: A Masterclass in Building Reasoning Models

One of the most impressive reasoning models is DeepSeek R1. Its training pipeline reveals what makes reasoning models work:
class DeepSeekR1Pipeline:
    """The step-by-step recipe for a reasoning model"""
    def train(self):
        # Phase 1: Cold-start SFT
        #   Start with minimal high-quality human reasoning data,
        #   just enough to bootstrap the reasoning capability

        # Phase 2: Reinforcement Learning (GRPO)
        #   Generate millions of synthetic problems and let the model
        #   discover reasoning strategies through trial and error

        # Phase 3: Rejection-sampling SFT
        #   Have the model generate many solutions to each problem,
        #   keep only the correct ones, and fine-tune on these
        #   "self-curated" examples

        # Phase 4: Final alignment
        #   Make the model helpful, harmless, and honest
        pass
What's particularly fascinating is an experiment DeepSeek ran called R1-Zero. They took a pre-trained language model and applied RL (with no SFT at all). The model discovered reasoning on its own, but with quirks: it mixed languages, had poor formatting, and was hard to read. This proved that RL alone can teach reasoning, but it needs refinement to be useful.
Evaluating Reasoning Models: The Pass@K Metric
You can't improve what you can't measure. For reasoning models, we use specialized benchmarks and metrics. The most important is Pass@K:
import math
def pass_at_k(total_samples, correct_samples, k_attempts):
    """
    Calculate the probability of at least one success in k attempts.

    Example: the model generates 100 solutions and 15 are correct.
    With 5 attempts, the probability of success is ≈ 56%.
    """
    if total_samples - correct_samples < k_attempts:
        return 1.0
    # Probability that all k attempts fail
    fail_prob = math.comb(total_samples - correct_samples, k_attempts) / math.comb(total_samples, k_attempts)
    return 1.0 - fail_prob
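Plugging in the numbers from the docstring:

print(round(pass_at_k(total_samples=100, correct_samples=15, k_attempts=1), 2))  # 0.15 -- one shot
print(round(pass_at_k(total_samples=100, correct_samples=15, k_attempts=5), 2))  # 0.56 -- five shots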
Why does this matter? Real users don't just try once. They retry, rephrase, experiment. Pass@5 or Pass@10 gives us a realistic success rate that reflects actual usage.
Reasoning Benchmarks: The AI Olympics
Different reasoning models excel at different tasks:
- Mathematics: GSM8K (grade school), MATH (high school), AIME (olympiad)
- Coding: HumanEval (function completion), SWE-bench (real GitHub issues)
- Science: MMLU-STEM, PubMedQA
As of early 2025, the state-of-the-art looks something like this:
- GSM8K: Top models scoring 95%+ (grade school math is essentially saturated)
- MATH: Top models in the 90-95% range
- SWE-bench: Still challenging, with top models around 45-50%
The Economics of Reasoning: Cost vs. Value
There's a practical problem with reasoning models: they're expensive. All that thinking takes computational resources. OpenAI's o1 models, for example, cost several times more per token than standard GPT-4-class models, and the hidden reasoning tokens are billed as output too.
class ReasoningEconomics:
    def compare_costs(self):
        problem = "Solve: ∫(x² + 3x + 2) dx from 0 to 5"

        standard_llm = {
            "response": "The integral is 89.17",  # = 125/3 + 75/2 + 10 ≈ 89.17
            "tokens_used": 10,
            "cost": "$0.0001",
            "correct": "Maybe?"  # No visible work, so you can't tell without checking
        }

        reasoning_model = {
            "thinking_tokens": 150,  # All that step-by-step work
            "answer_tokens": 5,
            "total_tokens": 155,
            "cost": "$0.00155",  # 15.5x more expensive!
            "correct": "Verified",
            "value": "Shows work, can debug, teaches user"
        }
        return {"standard": standard_llm, "reasoning": reasoning_model}
Making Reasoning Practical: Knowledge Distillation
The solution to the cost problem is knowledge distillation: training smaller, cheaper models to mimic the reasoning of larger ones.
class ReasoningDistillation:
    """
    Train small models to mimic big models' reasoning
    """
    def train_small_model(self, large_model, small_model):
        # Step 1: Have the large model solve many problems
        # Step 2: Capture not just the answer, but the entire reasoning chain
        # Step 3: Train the small model to reproduce the exact reasoning tokens
        # The result: a model that "thinks like" the big model
        #             but runs 10-100x cheaper
        pass
This approach typically gets small models to 70-90% of the large model's capability at a fraction of the cost.
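Concretely, the distillation data is just prompt/reasoning-trace pairs produced by the large model and filtered for correctness. A single record might look roughly like this (the field names are illustrative, not any specific dataset schema):

distillation_record = {
    "prompt": "What is 25% of 80?",
    "target": (
        "Let's think step by step:\n"
        "1. 25% is one quarter\n"
        "2. 80 ÷ 4 = 20\n"
        "Answer: 20"
    ),
    "verified_correct": True,  # only records whose final answer checks out are kept
}
# The small model is then fine-tuned with ordinary supervised learning on these targets.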
Practical Guide: Building Your Own Reasoning Model
Step 1: Start with a Strong Base
base_models = {
    "llama_3_70b": {
        "reasoning_potential": "Good",
        "cost": "Medium",
        "recommendation": "Best balance"
    },
    "mistral_8b": {
        "reasoning_potential": "Limited but trainable",
        "cost": "Low",
        "recommendation": "For experimentation"
    }
}
Step 2: Collect/Build Training Data
def build_reasoning_dataset():
    sources = [
        ("GSM8K", "math word problems"),
        ("MATH", "competition math"),
        ("HumanEval", "coding problems"),
        ("synthetic_math", "generate with rules"),
        ("your_domain", "domain-specific problems")
    ]
    # Key: every example needs a step-by-step solution!
    return sources
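Whatever the source, each example should pair the problem with a worked solution and a machine-checkable final answer. Roughly like this (the field names are illustrative):

training_example = {
    "problem": "A train travels 120 miles in 2 hours. How fast is it going?",
    "solution_steps": [
        "Speed = distance / time",
        "Speed = 120 miles / 2 hours",
        "Speed = 60 miles per hour",
    ],
    "final_answer": "60 mph",  # checkable, so it can double as an RL reward signal later
}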
Step 3: Implement GRPO Training
# GRPO training with Hugging Face TRL
# (argument names follow recent TRL releases; check the docs for your version)
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    output_dir="your-output-dir",
    learning_rate=1e-6,
    num_generations=8,   # Group size
    temperature=0.8,     # For diversity
)

def reward_function(completions, **kwargs):
    """Reward = correctness minus a small verbosity penalty."""
    rewards = []
    for completion in completions:
        score = check_correctness(completion)             # your task-specific checker (placeholder)
        length_penalty = len(completion.split()) / 1000   # Penalize verbosity
        rewards.append(score - 0.1 * length_penalty)
    return rewards
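Wiring it together is then a matter of handing the model, the reward function, and a dataset of prompts to the trainer (a sketch; your_dataset is a placeholder, and argument names may differ slightly across TRL versions):

trainer = GRPOTrainer(
    model="your-base-model",       # base checkpoint to fine-tune
    reward_funcs=reward_function,  # critical: this is the signal the policy optimizes
    args=grpo_config,
    train_dataset=your_dataset,    # placeholder: a dataset with a "prompt" column
)
trainer.train()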
The Future of Reasoning Models
Where is this all heading? Several exciting directions:
- Multimodal reasoning: Models that can reason about images, audio, and video
- Tool use: Models that can use calculators, code interpreters, web search
- Long-horizon reasoning: Planning complex projects, writing research papers
- Self-improvement: Models that can critique and refine their own reasoning
- Selective reasoning: Knowing when to think deeply vs. when to answer quickly
Key Takeaways
- Reasoning isn't magic—it's just giving models time and structure to think
- RL beats SFT for teaching reasoning, but needs careful implementation
- GRPO is currently state-of-the-art for efficient reasoning training
- Watch out for length bias—verbose doesn't always mean better
- Evaluate with Pass@K—it reflects real-world usage
- Consider distillation for production use—big reasoning is expensive
Try It Yourself
The best way to understand reasoning models is to use them. Try this puzzle with both a standard model and a reasoning model:
A snail climbs 3 feet up a wall each day but slips back 2 feet each night.
The wall is 30 feet tall. How many days to reach the top?
Hint: The answer isn't 30 days. Watch how reasoning models methodically work through the problem while standard models often jump to the wrong conclusion.
Next in our series: We'll explore Agentic AI.
What reasoning tasks have you found models surprisingly good (or bad) at? What domain-specific reasoning would be most valuable for your work? Let's discuss in the comments.
