The Thinking Machines: How AI Learned to Reason Step-by-Step

Welcome to part 4 of our LLM series! Today, we're exploring one of the most exciting frontiers in AI: reasoning models. These aren't just chatbots that parrot information—they're systems that can genuinely break down complex problems, think step-by-step, and arrive at solutions through logical deduction.

Let me start with a puzzle that reveals the difference between a standard language model and a reasoning model:

"A bat and ball cost $1.10 together. The bat costs $1 more than the ball. How much does the ball cost?"

Most people—and most standard LLMs—instinctively say "10 cents." But that's wrong. The correct answer is 5 cents, and arriving at it requires actual reasoning, not just pattern matching.

From Intuition to Reasoning: A Fundamental Shift

First, let's clarify what we mean by "reasoning" in AI. It's not about being smarter or knowing more facts. It's about being more deliberate. When you ask a reasoning model a question, it doesn't jump to an answer. Instead, it breaks the problem down, explores different approaches, checks its work, and then—and only then—produces a final answer.

# The core difference: Intuition vs Reasoning
question = "Bat ($1 more than ball) + Ball = $1.10. Ball price?"

def intuitive_model():
    """System 1 thinking: Fast, associative"""
    return "10 cents"  # ❌ Quick, wrong

def reasoning_model():
    """System 2 thinking: Slow, analytical"""
    steps = [
        "Let ball price = x",
        "Then bat price = x + 1.00",
        "Total: x + (x + 1.00) = 1.10",
        "2x + 1.00 = 1.10",
        "2x = 0.10",
        "x = 0.05",
    ]
    # Show the work, then the answer
    return "\n".join(steps) + "\nThe ball costs 5 cents"  # ✅ Methodical, correct

Traditional language models work through what psychologists call "System 1" thinking: fast, intuitive, associative. Reasoning models engage in "System 2" thinking: slow, analytical, deliberate.

The Chain of Thought Revolution

The breakthrough came in 2022 with Chain of Thought (CoT) prompting. Researchers discovered that if you simply add the phrase "Let's think step by step" to a prompt, models become significantly better at math problems, logical puzzles, and other tasks requiring reasoning.

# Traditional vs CoT prompting (illustrated with the question "What is 25% of 80?")
def traditional_prompt(question):
    return f"Q: {question}\nA:"

def cot_prompt(question):
    return f"""Q: {question}

Let's think step by step:
1. 25% means 25 per 100, or one quarter
2. To find 25% of 80, we can calculate 80 ÷ 4
3. 80 ÷ 4 = 20
4. Therefore, 25% of 80 is 20

A: 20"""

But prompting was just the beginning. The real revolution came when researchers started training models specifically for reasoning, creating systems like OpenAI's o1, DeepSeek R1, and Google's Gemini 2.0 Flash Thinking.

The Training Challenge: Why Reasoning is Hard

You might wonder: if reasoning is so valuable, why didn't we build reasoning models from the start? The answer lies in how these models are trained.

Traditional language models are trained through Supervised Fine-Tuning (SFT): you show them examples of questions and answers, and they learn to mimic the pattern. But this approach falls short for reasoning because:

  1. Human reasoning data is scarce and expensive (experts who can solve complex problems and explain their thinking are rare)
  2. There are often multiple valid reasoning paths to the same answer
  3. Models might discover better reasoning strategies than humans use

Imagine trying to teach someone chess by only showing them the final positions of games. They might memorize some patterns, but they won't learn strategy or tactics. That's the limitation of SFT for reasoning tasks.
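To make that concrete, here's a minimal sketch (my own illustration with dummy tensors, not code from any real training pipeline) of what SFT actually optimizes: token-level cross-entropy against a single reference solution. A reasoning path that differs from the reference gets penalized just as much as a wrong one.

import torch
import torch.nn.functional as F

# Dummy stand-ins for a real model's output and a human-written reference chain
vocab_size = 50
seq_len = 6
logits = torch.randn(seq_len, vocab_size)                      # model's token scores
reference_solution = torch.randint(0, vocab_size, (seq_len,))  # one reference reasoning path

# SFT objective: imitate the reference token-by-token.
sft_loss = F.cross_entropy(logits, reference_solution)
print(sft_loss.item())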

Reinforcement Learning: The Right Tool for the Job

RL is perfect for reasoning because reasoning tasks have clear, verifiable outcomes. Did the code compile? Did it pass the test cases? Is the math answer correct? These are binary rewards that RL can optimize for.
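For example, a reward for a math problem can be as simple as checking the final answer against a known result. Here's a hypothetical checker (the "Answer:" output format is just an assumption for this sketch):

import re

def verifiable_reward(model_output: str, expected_answer: str) -> float:
    """Binary reward from a checkable outcome: did the final answer match?
    Assumes the model ends its output with a line like 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)", model_output)
    if match is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if match.group(1) == expected_answer else 0.0

print(verifiable_reward("2x = 0.10, so x = 0.05.\nAnswer: 0.05", "0.05"))  # 1.0
print(verifiable_reward("Quick guess.\nAnswer: 0.10", "0.05"))             # 0.0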

The most common RL approach for reasoning is called Proximal Policy Optimization (PPO). But PPO has a problem: it's computationally expensive. It requires training not just the main model, but also a separate "value function" that predicts how good each partial solution is.

Enter GRPO (Group Relative Policy Optimization), a newer, more elegant approach.

GRPO: The Secret Sauce of Modern Reasoning Models

GRPO takes a clever shortcut. Instead of trying to predict absolute quality at every step, it simply compares solutions against each other:

import torch
import numpy as np

class GRPOTrainer:
    """
    Group Relative Policy Optimization
    Simplified implementation
    """

    def __init__(self, model, group_size=4):
        self.model = model
        self.group_size = group_size  # number of solutions sampled per prompt

    def generate_group(self, prompt):
        """Generate multiple candidate solutions for the same prompt"""
        solutions = []
        for _ in range(self.group_size):
            solution = self.model.generate(  # assumes a generate() API on the model
                prompt,
                temperature=0.8,  # For diversity
                max_length=500
            )
            solutions.append(solution)
        return solutions

    def score_solution(self, solution):
        """Task-specific correctness check (e.g., verify the final answer).
        Placeholder: plug in your own verifier here."""
        raise NotImplementedError

    def compute_relative_rewards(self, solutions):
        """
        Key insight: Compare against group average, not absolute threshold
        """
        scores = [self.score_solution(s) for s in solutions]
        group_mean = np.mean(scores)
        group_std = np.std(scores) + 1e-8

        # Relative rewards (z-scores)
        relative_rewards = [(s - group_mean) / group_std for s in scores]
        return relative_rewards

    def grpo_loss(self, log_probs, relative_rewards):
        """Optimize policy based on relative performance"""
        log_probs = torch.stack(log_probs)
        rewards = torch.tensor(relative_rewards)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Policy gradient loss
        loss = -(log_probs * rewards).mean()
        return loss

# Why GRPO beats PPO for reasoning:
advantages = {
    "simplicity": "No value function needed",
    "efficiency": "Single forward/backward pass",
    "stability": "Relative comparisons more stable",
    "diversity": "Encourages multiple solution paths"
}

The beauty of GRPO is its simplicity. Models learn by competing against themselves. If one approach works better than others, that approach gets reinforced. Over time, the model discovers effective reasoning strategies through pure trial and error.
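You can see the self-competition in miniature with a toy calculation (the scores below are made up for illustration): when two of four sampled solutions are correct, the correct ones get a positive advantage and the incorrect ones a negative one.

import numpy as np

scores = [1.0, 0.0, 1.0, 0.0]               # two of four sampled solutions were correct
mean, std = np.mean(scores), np.std(scores) + 1e-8
relative_rewards = [(s - mean) / std for s in scores]
print(relative_rewards)                      # ≈ [1.0, -1.0, 1.0, -1.0]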

The Verbosity Problem and Its Solutions

GRPO has a known issue: length bias. Models learn that longer answers often get higher rewards because:

  1. More verbose solutions are less likely to make careless errors
  2. Graders often reward thoroughness
  3. There's more room to include partial credit steps

The result can be excessively verbose reasoning. Researchers have developed several fixes:

def fix_length_bias(log_probs, rewards, lengths):
    """Solutions to the verbosity problem"""
    # Method 1: Length normalization
    normalized_rewards = [r / (l ** 0.5) for r, l in zip(rewards, lengths)]

    # Method 2: Token-level DPO
    # Compare token-by-token preferences

    # Method 3: GRPO "done right"
    # Equalize token contributions
    return normalized_rewards

DeepSeek R1: A Masterclass in Building Reasoning Models


One of the most impressive reasoning models is DeepSeek R1. Its training pipeline reveals what makes reasoning models work:

class DeepSeekR1Pipeline:
    """The step-by-step recipe for a reasoning model"""

    def train(self):
        # Phase 1: Cold Start SFT
        # Start with minimal high-quality human reasoning data
        # Just enough to bootstrap the reasoning capability

        # Phase 2: Reinforcement Learning (GRPO)
        # Generate millions of synthetic problems
        # Let the model discover reasoning strategies through trial and error

        # Phase 3: Rejection Sampling SFT
        # Have the model generate many solutions to each problem
        # Keep only the correct ones
        # Fine-tune on these "self-curated" examples

        # Phase 4: Final Alignment
        # Make helpful, harmless, and honest
        pass

What's particularly fascinating is an experiment DeepSeek ran called R1-Zero. They took a pre-trained language model and applied RL (with no SFT at all). The model discovered reasoning on its own, but with quirks: it mixed languages, had poor formatting, and was hard to read. This proved that RL alone can teach reasoning, but it needs refinement to be useful.

Evaluating Reasoning Models: The Pass@K Metric

You can't improve what you can't measure. For reasoning models, we use specialized benchmarks and metrics. The most important is Pass@K:

import math

def pass_at_k(total_samples, correct_samples, k_attempts):
    """
    Calculate probability of success with k attempts
    Example: Model generates 100 solutions, 15 are correct
    With 5 attempts, probability ≈ 56%
    """
    if total_samples - correct_samples < k_attempts:
        return 1.0

    # Probability all attempts fail
    fail_prob = math.comb(total_samples - correct_samples, k_attempts) / math.comb(total_samples, k_attempts)
    return 1.0 - fail_prob

Why does this matter? Real users don't just try once. They retry, rephrase, experiment. Pass@5 or Pass@10 gives us a realistic success rate that reflects actual usage.
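Plugging the numbers from the docstring into pass_at_k shows how quickly success compounds with retries:

# 100 sampled solutions, 15 of them correct (using pass_at_k from above)
for k in (1, 5, 10):
    print(f"Pass@{k}: {pass_at_k(100, 15, k):.0%}")
# Pass@1: 15%, Pass@5: 56%, Pass@10: 82%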

Reasoning Benchmarks: The AI Olympics

Different reasoning models excel at different tasks:

  • Mathematics: GSM8K (grade school), MATH (high school), AIME (olympiad)
  • Coding: HumanEval (function completion), SWE-bench (real GitHub issues)
  • Science: MMLU-STEM, PubMedQA

As of early 2025, the state-of-the-art looks something like this:

  • GSM8K: Models scoring 99%+ (essentially perfect on grade school math)
  • MATH: Top models in the 90-95% range
  • SWE-bench: Still challenging, with top models around 45-50%

The Economics of Reasoning: Cost vs. Value

There's a practical problem with reasoning models: they're expensive. All that thinking takes computational resources. OpenAI's o1 models, for example, charge several times more per token than standard GPT-4-class models, and the hidden reasoning tokens multiply the per-query cost even further.

class ReasoningEconomics:
    def compare_costs(self):
        problem = "Solve: ∫(x² + 3x + 2) dx from 0 to 5"

        standard_llm = {
            "response": "The integral is 145.83",  # stated confidently, but wrong (true value ≈ 89.17)
            "tokens_used": 10,
            "cost": "$0.0001",
            "correct": "Maybe?"
        }

        reasoning_model = {
            "thinking_tokens": 150,  # All that step-by-step work
            "answer_tokens": 5,
            "total_tokens": 155,
            "cost": "$0.00155",  # 15.5x more expensive!
            "correct": "Verified",
            "value": "Shows work, can debug, teaches user"
        }

        return {"standard": standard_llm, "reasoning": reasoning_model}

Making Reasoning Practical: Knowledge Distillation

The solution to the cost problem is knowledge distillation: training smaller, cheaper models to mimic the reasoning of larger ones.

class ReasoningDistillation:
    """
    Train small models to mimic big models' reasoning
    """

    def train_small_model(self, large_model, small_model):
        # Step 1: Have the large model solve many problems
        # Step 2: Capture not just the answer, but the entire reasoning chain
        # Step 3: Train the small model to reproduce the exact reasoning tokens

        # The result: A model that "thinks like" the big model
        # But runs 10-100x cheaper
        pass

This approach typically gets small models to 70-90% of the large model's capability at a fraction of the cost.
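As a rough sketch of that recipe (with hypothetical teacher and verifier callables, not any particular library), the teacher's verified reasoning traces simply become ordinary fine-tuning examples for the student:

def build_distillation_dataset(teacher_generate, problems, is_correct):
    """Collect full reasoning traces from the large model, keeping only the
    verified-correct ones. teacher_generate and is_correct are placeholders."""
    dataset = []
    for problem in problems:
        trace = teacher_generate(problem)   # includes the entire chain of thought
        if is_correct(problem, trace):      # e.g., check the final answer
            dataset.append({"prompt": problem, "completion": trace})
    return dataset

# The small model is then fine-tuned on these prompt/completion pairs with
# plain next-token prediction, learning to reproduce the reasoning style.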

Practical Guide: Building Your Own Reasoning Model

Step 1: Start with a Strong Base

base_models = {
    "llama_3_70b": {
        "reasoning_potential": "Good",
        "cost": "Medium",
        "recommendation": "Best balance"
    },
    "mistral_8b": {
        "reasoning_potential": "Limited but trainable",
        "cost": "Low",
        "recommendation": "For experimentation"
    }
}

Step 2: Collect/Build Training Data

def build_reasoning_dataset():
    sources = [
        ("GSM8K", "math word problems"),
        ("MATH", "competition math"),
        ("HumanEval", "coding problems"),
        ("synthetic_math", "generate with rules"),
        ("your_domain", "domain-specific problems")
    ]
    # Key: Need step-by-step solutions!
    return sources

Step 3: Implement GRPO Training

# Sketch based on Hugging Face TRL's GRPOTrainer API (as of early 2025);
# check the TRL docs for your installed version. check_correctness and
# train_dataset are placeholders you supply.
from trl import GRPOConfig, GRPOTrainer

def reward_function(completions, **kwargs):
    """Reward = correctness minus a small verbosity penalty."""
    rewards = []
    for completion in completions:
        score = check_correctness(completion)            # your task-specific checker
        length_penalty = len(completion.split()) / 1000  # Penalize verbosity
        rewards.append(score - 0.1 * length_penalty)
    return rewards

grpo_config = GRPOConfig(
    output_dir="grpo-reasoning",
    learning_rate=1e-6,
    num_generations=8,  # Group size
    temperature=0.8,    # For diversity
)

trainer = GRPOTrainer(
    model="your-base-model",
    reward_funcs=reward_function,   # Critical!
    args=grpo_config,
    train_dataset=train_dataset,    # needs a "prompt" column
)
trainer.train()

The Future of Reasoning Models

Where is this all heading? Several exciting directions:

  1. Multimodal reasoning: Models that can reason about images, audio, and video
  2. Tool use: Models that can use calculators, code interpreters, web search
  3. Long-horizon reasoning: Planning complex projects, writing research papers
  4. Self-improvement: Models that can critique and refine their own reasoning
  5. Selective reasoning: Knowing when to think deeply vs. when to answer quickly

Key Takeaways

  1. Reasoning isn't magic—it's just giving models time and structure to think
  2. RL beats SFT for teaching reasoning, but needs careful implementation
  3. GRPO is currently state-of-the-art for efficient reasoning training
  4. Watch out for length bias—verbose doesn't always mean better
  5. Evaluate with Pass@K—it reflects real-world usage
  6. Consider distillation for production use—big reasoning is expensive

Try It Yourself

The best way to understand reasoning models is to use them. Try this puzzle with both a standard model and a reasoning model:

A snail climbs 3 feet up a wall each day but slips back 2 feet each night.
The wall is 30 feet tall. How many days to reach the top?

Hint: The answer isn't 30 days. Watch how reasoning models methodically work through the problem while standard models often jump to the wrong conclusion.
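Once you've made your own attempt, a tiny simulation is enough to check the answer:

# Simulate the snail's climb; run this only after trying the puzzle yourself
height, day = 0, 0
while True:
    day += 1
    height += 3        # climbs 3 feet during the day
    if height >= 30:   # reaches the top before slipping back
        break
    height -= 2        # slips back 2 feet at night
print(day)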


Next in our series: We'll explore Agentic AI.

What reasoning tasks have you found models surprisingly good (or bad) at? What domain-specific reasoning would be most valuable for your work? Let's discuss in the comments.
