This week has been all about DeepSeek-R1 making headlines. So, in this post, let's understand what the DeepSeek-R1 model is and how it works internally, in depth.
Firstly, what's DeepSeek-R1?
DeepSeek-R1 is an open-source reasoning model developed by DeepSeek, a Chinese AI company. It is designed for tasks that require logical inference, mathematical problem-solving, and real-time decision-making.
What sets reasoning models like DeepSeek-R1 and OpenAI's o1 apart from traditional large language models (LLMs) is their ability to show how they arrived at a conclusion.
As you can see in the image above, DeepSeek-R1 shows the steps it follows while reasoning about a prompt, which makes its output easier to understand and, if necessary, challenge. This capability gives reasoning models an edge in fields where outcomes need to be explainable, like research or complex decision-making.
This model also challenges the industry's reliance on supervised fine-tuning (SFT) by showing that reinforcement learning (RL) alone can improve reasoning capabilities. But beyond the points above, what makes it revolutionary?
- Autonomous Skill Emergence: Unlike GPT-4 or Claude 3.5 Sonnet, which require human-curated reasoning examples, R1-Zero develops skills like self-verification and multi-step planning through pure RL.
- Cost: Distilled 7B models outperform GPT-4o at 1/100th the training cost.
- Open Source: Full release of model weights and training code.
Technical Architecture:
Base Model Foundation:
It's built on top of the DeepSeek-V3-Base model, a 671B-parameter Mixture-of-Experts model (MoE = integrating multiple specialized sub-models, or "experts," to solve complex problems more effectively) with:
- Expert Networks: specialized sub-models for math, code, logic, etc.
- Dynamic Activation: 37B parameters activated per token through learned routing (see the toy sketch after this list).
- Pre-Training: 14.8T (yes, trillion) tokens spanning 52 languages and technical domains, including STEM papers and GitHub repositories.
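To make the "dynamic activation" idea concrete, here is a toy sketch of top-k expert routing in an MoE layer. This is a generic illustration, not DeepSeek-V3's actual router; the hidden size, expert count, and top-k value below are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: a learned router activates only top-k experts per token."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # learned routing scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)          # normalize weights of the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += gates[:, k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([5, 64]) -- each token only touched 2 of 8 experts
```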
The R1 Variants:

| Model | Parameters | Training Approach | Key Innovation |
|---|---|---|---|
| R1-Zero | 671B MoE | Pure RL (no SFT) | Autonomous reasoning discovery |
| R1 | 671B MoE | Multi-stage SFT + RL | Human-aligned CoT generation |
| R1-Distill | 1.5B–70B | SFT on R1 outputs | Cost-efficient deployment |
DeepSeek Internals in Depth:
1. Reinforcement Learning at its Core:
DeepSeek-R1's most groundbreaking feature is its reliance on reinforcement learning (RL) to develop reasoning capabilities. Unlike traditional LLMs that depend on supervised fine-tuning (SFT) with human-curated examples, DeepSeek-R1 uses RL to autonomously discover reasoning patterns. Here's how it works:
A. Group Relative Policy Optimization (GRPO)
This is a critic-free RL framework that reduces compute costs by about 40% compared to Proximal Policy Optimization (PPO).
The algorithm works as follows:
- Group Sampling: For each prompt, the model generates G = 16 responses using the current policy. These responses form a group, which is later used to compute rewards and advantages.
- Reward Normalization: Each response in the group is assigned a reward based on accuracy, format, and language consistency, and its advantage A_i is computed by normalizing that reward against the group's mean and standard deviation. This normalization helps stabilize training by reducing variance.
- Policy Update: The policy is updated to maximize the advantage while constraining the KL divergence. (Kullback-Leibler (KL) divergence is a statistical metric that measures the difference between two probability distributions.) In the equation below, β = 0.01 controls the strength of the KL penalty, ensuring the policy doesn't deviate too far from the reference.
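For reference, the GRPO objective in the form given in the DeepSeekMath/DeepSeek-R1 papers is shown below (ε is the usual PPO-style clipping parameter):

```latex
J_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G}
  \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\;
  \mathrm{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
  - \beta\, D_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) \right]

A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}
```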
B. Hybrid Reward Engineering:
This is a three-tiered reward system that prevents reward hacking. (Reward hacking occurs when a reinforcement learning agent exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning or completing the intended task. It exists because RL environments are often imperfect, and it is fundamentally challenging to specify a reward function accurately.)
| Reward Type | Calculation Method | Weight (λ) |
|---|---|---|
| Accuracy (r_acc) | Binary (1 if the final answer is correct) | 1.0 |
| Format (r_fmt) | Cosine similarity to the <think>/<answer> template | 0.3 |
| Language (r_lang) | % of tokens in the target language | 0.2 |
Total Reward: r_total = r_acc + λ1·r_fmt + λ2·r_lang
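Here is a minimal sketch of how this reward mix and GRPO's group normalization could be wired together. The check functions below are simplified stand-ins for illustration, not DeepSeek's actual reward code.

```python
import re
import statistics

LAMBDA_FMT, LAMBDA_LANG = 0.3, 0.2  # weights from the table above

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text inside <answer> matches the reference exactly, else 0.0 (toy check)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Crude stand-in for template similarity: does the response use both tags?"""
    return 1.0 if "<think>" in response and "<answer>" in response else 0.0

def language_reward(response: str) -> float:
    """Fraction of word tokens that are ASCII, as a proxy for target-language consistency."""
    tokens = re.findall(r"[^\W\d_]+", response)
    return sum(t.isascii() for t in tokens) / max(len(tokens), 1)

def total_reward(response: str, reference_answer: str) -> float:
    return (accuracy_reward(response, reference_answer)
            + LAMBDA_FMT * format_reward(response)
            + LAMBDA_LANG * language_reward(response))

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward by the group's mean and standard deviation."""
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

group = [
    "<think>2 + 2 = 4</think><answer>4</answer>",
    "<answer>5</answer>",
]
rewards = [total_reward(r, "4") for r in group]
print(rewards)                 # [1.5, 0.2]
print(group_advantages(rewards))
```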
2. Cold-Start Supervised Fine-Tuning (SFT):
Before applying RL, DeepSeek-R1 goes through a cold-start SFT phase, which seeds the model with basic reasoning patterns. This phase involves:
A. Curated Dataset
- ~1,000 high-quality Chain-of-Thought (CoT) examples are manually curated.
- Each example follows a strict XML-style template:
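Concretely, a cold-start CoT example in this format would be structured roughly like this (the content below is a made-up illustration, not an actual sample from the dataset):

```
<think>
The user asks for 15% of 80. 15% equals 0.15, and 0.15 × 80 = 12.
</think>
<answer>
12
</answer>
```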
B. Template Enforcement:
The model is fine-tuned to generate responses in the <think>/<answer> format. This ensures that the reasoning process is structured and interpretable.
3. Rejection Sampling for High-Quality Data:
After the RL process, DeepSeek-R1 generates 600K high-quality reasoning samples through rejection sampling. It works as follows (a small code sketch follows the list):
- Sample Generation: The RL model generates multiple responses for each prompt, and only the responses that pass rule-based checks are retained.
- Semantic Filtering: Responses with low semantic coherence or incorrect reasoning are discarded.
- Final Dataset: The filtered dataset is used for further fine-tuning and distillation.
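Here is an illustrative sketch of that filtering loop. The model stub and the `passes_rule_checks` / `is_coherent` functions are hypothetical placeholders for the RL policy and DeepSeek's actual rule-based and semantic filters.

```python
import random

def generate_candidates(model, prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate responses for a prompt from the current RL policy (stubbed here)."""
    return [model(prompt) for _ in range(n)]

def passes_rule_checks(response: str) -> bool:
    """Rule-based checks: well-formed format, verifiable final answer, etc. (toy version)."""
    return "<think>" in response and "<answer>" in response and "</answer>" in response

def is_coherent(response: str) -> bool:
    """Semantic filter: drop incoherent or trivially short reasoning (stubbed as a length check)."""
    return len(response.split()) >= 5

def rejection_sample(model, prompts: list[str]) -> list[dict]:
    """Keep only responses that survive both filters; the survivors become the SFT dataset."""
    dataset = []
    for prompt in prompts:
        for response in generate_candidates(model, prompt):
            if passes_rule_checks(response) and is_coherent(response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy usage with a fake "model" that sometimes produces well-formed answers.
fake_model = lambda p: random.choice([
    "<think>six times seven equals forty-two</think> <answer>42</answer>",
    "incomplete rambling",
])
print(len(rejection_sample(fake_model, ["What is 6 x 7?"])))
```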
4. Distillation to Smaller Models
DeepSeek-R1's reasoning capabilities are distilled into smaller models (1.5B–70B parameters) for cost-efficient deployment. The distillation process involves:
- Dataset Creation: 800K samples are generated from the RL-trained model, covering both reasoning (600K) and general tasks (200K).
- Fine-Tuning: Smaller models (like Qwen-7B and Llama-70B) are fine-tuned on the distilled dataset. No RL is applied during distillation, which makes it computationally efficient (a minimal sketch of this step follows the list).
- Performance: The distilled 7B model achieves 55.5% pass@1 on AIME 2024, outperforming GPT-4o (9.3%) at a fraction of the cost.
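Conceptually, the fine-tuning step is plain next-token-prediction SFT on the teacher's (prompt, response) pairs. Below is a minimal sketch using the Hugging Face Trainer; the model identifier, file path, and hyperparameters are placeholder assumptions, not DeepSeek's actual pipeline.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

STUDENT = "Qwen/Qwen2.5-7B"  # placeholder student base model
tokenizer = AutoTokenizer.from_pretrained(STUDENT)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(STUDENT)

# Each JSONL record is assumed to hold a prompt plus the teacher's <think>/<answer> completion.
dataset = load_dataset("json", data_files="r1_distill_samples.jsonl")["train"]

def to_text(example):
    # Concatenate prompt and teacher response into one training sequence.
    return {"text": example["prompt"] + example["teacher_response"]}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = dataset.map(to_text).map(
    tokenize, remove_columns=dataset.column_names + ["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="r1-distill-student",
        per_device_train_batch_size=1,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard causal-LM (next-token prediction) loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```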
Performance Analysis: Benchmarks
Mathematical Reasoning
| Benchmark | R1 | R1-Zero | GPT-4o | Human Expert |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 79.8% | 71.0% | 9.3% | 85% |
| MATH-500 (pass@1) | 97.3% | 95.9% | 74.6% | 98% |
| IMO Problem Formalization | 81% | N/A | 22% | 89% |
Key Insight: R1 achieves near-human performance on Olympiad-level problems through:
- Step Recycling: Reusing partial solutions across similar problems
- Symbolic-Statistical Fusion: Combining neural intuition with algebraic simplifications
Coding & Software Engineering
| Task | R1 | GPT-4o | SWE Human |
|---|---|---|---|
| LiveCodeBench (pass@1) | 65.9% | 32.9% | 72% |
| Codeforces Elo | 2029 | 759 | 2100 (95th percentile) |
| SWE-Bench Resolved | 49.2% | 38.8% | 58% |
Breakthroughs:
- Debugging Chains: Automatically generates test cases to validate code patches
- Cross-Language Transfer: Solves Python problems then ports solutions to Rust