This is a Plain English Papers summary of a research paper called DeepDistill: New LLM Reasoning Method Outperforms Distilled Models, Nears SOTA.
Enhancing LLM Reasoning: The DeepDistill Approach
Despite remarkable achievements on complex reasoning tasks, the academic community still lacks an in-depth understanding of how to effectively train base models for long-form reasoning. This research addresses this gap by constructing a large-scale, difficulty-graded reasoning dataset with approximately 3.34 million unique queries and 40 million distilled responses from multiple models.
The authors demonstrate that by leveraging pass rate and Coefficient of Variation (CV), they can identify the most valuable training data to enhance reasoning capabilities. They observe that reasoning-focused training requires higher learning rates than traditional post-training methods when working with base models.
Using carefully selected data, they improve base model reasoning capabilities significantly, achieving a 79.2% pass rate on the AIME2024 mathematical reasoning benchmark—surpassing most current distilled models and approaching state-of-the-art performance. Similar approaches are discussed in Improving Mathematical Reasoning Capabilities of Small Language Models.

Figure 1: Benchmark performance of open-source models on AIME2024.
Creating a Comprehensive Reasoning Dataset
Diverse Data Collection Across Multiple Domains
To ensure comprehensive coverage, the researchers collected datasets from multiple publicly available sources, carefully categorizing them into six main domains:
Mathematical Reasoning - Datasets that demand advanced numerical logic, including OpenR1-Math-220k, AIME_1983_2024, and others.
Code Generation - Programming problem-solving tasks from sources like PRIME, DeepCoder, and KodCode.
Scientific Reasoning - Datasets assessing performance in natural sciences and logical reasoning.
Instruction Following (IF) - Tasks focused on accurately comprehending and executing instructions.
Multi-turn Conversations - Datasets emphasizing context coherence across multiple interactions.
Others/General Reasoning - A wide range of general knowledge and everyday logical reasoning tasks.
This diverse collection enables thorough assessment of the proposed data distillation method across various domains, similar to approaches seen in 14 Million Open Source Distilled Reasoning Dataset.
Rigorous Query Processing for High-Quality Data
The researchers applied meticulous preprocessing procedures to ensure data quality:
- **Deduplication and Filtering**
  - Exact deduplication to remove identical queries
  - Unicode ratio filtering to eliminate corrupted content
  - Incomplete query filtering to ensure completeness
  - Special content filtering to remove URLs and tabular content
- **Decontamination**
  - Exact matching filtering to remove overlap with evaluation sets
  - Semantic deduplication using embedding techniques to eliminate similar content
These stringent processes significantly improved data quality and prevented information leakage between training and evaluation datasets.
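A minimal sketch of what this preprocessing could look like in Python. The hashing scheme, the unicode-ratio threshold, and the URL check are illustrative assumptions, not the paper's exact implementation:

```python
import hashlib

def unicode_ratio(text: str) -> float:
    """Fraction of characters outside the ASCII range."""
    if not text:
        return 1.0
    return sum(ord(c) > 127 for c in text) / len(text)

def clean_queries(queries, max_unicode_ratio=0.5):
    """Exact-dedup and filter a list of query strings."""
    seen, kept = set(), []
    for q in queries:
        key = hashlib.sha256(q.strip().lower().encode()).hexdigest()
        if key in seen:          # exact duplicate (up to case/whitespace)
            continue
        seen.add(key)
        if unicode_ratio(q) > max_unicode_ratio:
            continue             # likely corrupted or mis-encoded content
        if "http://" in q or "https://" in q:
            continue             # special-content filtering: drop URL-bearing queries
        kept.append(q)
    return kept
```

In practice the decontamination step would also compare query embeddings against the evaluation sets; that part is omitted here for brevity.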
Multi-Model Distillation Approach
To assess query difficulty accurately, the team conducted data distillation using three models with progressively increasing capabilities:
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- DeepSeek-R1
Each query was independently distilled four times by each model, generating approximately 40 million responses in total. This approach captured variations among model outputs, facilitating subsequent difficulty assessment.
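Structurally, the distillation loop might look like the sketch below; `sample_model` is a hypothetical stub standing in for an actual call to each DeepSeek model with sampling temperature > 0, so the four samples per query would genuinely differ:

```python
from dataclasses import dataclass, field

MODELS = ["DeepSeek-R1-Distill-Qwen-1.5B", "DeepSeek-R1-Distill-Qwen-7B", "DeepSeek-R1"]
SAMPLES_PER_MODEL = 4

@dataclass
class DistilledQuery:
    query: str
    responses: dict = field(default_factory=dict)  # model name -> list of sampled responses

def sample_model(model: str, query: str, i: int) -> str:
    # Hypothetical stub; a real pipeline would invoke the named model here.
    return f"[{model}] sample {i} for: {query}"

def distill(queries):
    records = []
    for q in queries:
        rec = DistilledQuery(query=q)
        for m in MODELS:
            rec.responses[m] = [sample_model(m, q, i) for i in range(SAMPLES_PER_MODEL)]
        records.append(rec)
    return records
```

Each query thus yields 12 responses (3 models × 4 samples), which is what makes the later per-query variance statistics meaningful.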
Comprehensive Ground Truth Verification
The researchers designed rigorous verification methods tailored to different data categories:
- Mathematical Reasoning: Two-stage validation using Math-Verify and Qwen2.5-7B-Instruct
- Code Generation: Sandbox fusion tests on selected test cases
- Scientific Reasoning: Validation using Qwen2.5-7B-Instruct
- Instruction Following: Validation using ifeval with additional constraints
- Multi-turn Conversations and Others: Evaluation using Decision-Tree-Reward-Llama-3.1-8B on coherence, correctness, and helpfulness
These verification methods produced category-specific scores that determined whether a response passed verification thresholds.
Additional Quality Control Measures
Beyond the primary processing procedures, the team implemented supplementary quality assurance measures:
- Perplexity-based Filtering: Removing instances with perplexity scores exceeding 20
- High-Frequency Ngram Filtering: Eliminating repetitive content
- Additional Logical Checks: Ensuring structural integrity and consistency
These measures further enhanced data quality, similar to approaches in OpenCodereasoning: Advancing Data Distillation for Competitive Coding.
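The high-frequency n-gram filter above can be sketched as follows; the n-gram size (4) and the 0.2 share threshold are illustrative assumptions, as the paper does not state exact values:

```python
from collections import Counter

def max_ngram_share(text: str, n: int = 4) -> float:
    """Share of all word n-grams taken up by the single most frequent one."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams).most_common(1)[0][1] / len(ngrams)

def passes_repetition_filter(text: str, threshold: float = 0.2) -> bool:
    # Responses dominated by one repeated n-gram are likely degenerate loops.
    return max_ngram_share(text) <= threshold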
Finding the Most Valuable Training Data
Using Coefficient of Variation to Identify Learning Opportunities
Not all generated data contribute equally to model training. Queries with extremely high average verification scores are too simple, while those with very low scores might be too difficult. The researchers focused on queries with higher learning potential by employing the Coefficient of Variation (CV) as a key indicator.
The CV is defined as:

CV = σ / μ, where σ = √( (1/n) Σᵢ (verify_scoreᵢ − μ)² ) and μ = (1/n) Σᵢ verify_scoreᵢ
This normalized metric objectively reflects data variability. High CV values indicate instability in model performance, highlighting significant room for improvement. For example:
- Query A with scores [0.5, 0.5, 0.5, 0.5, 0.5] has μ = 0.5, σ = 0, CV = 0
- Query B with scores [0.9, 0.1, 0.7, 0.3, 0.5] has μ = 0.5, σ ≈ 0.28, CV ≈ 0.57 (using the population 1/n definition above; a sample 1/(n−1) standard deviation would give σ ≈ 0.32, CV ≈ 0.63)
Although both queries have the same mean score, Query B's scores vary far more: the model's performance on it is unstable, so it carries greater learning value.
This approach to identifying valuable training data aligns with strategies discussed in How Difficulty-Aware Staged Reinforcement Learning Enhances.
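The metric is straightforward to compute. A small sketch, using the population (1/n) standard deviation; note this yields CV ≈ 0.57 for Query B's scores, while a sample (1/(n−1)) deviation would yield ≈ 0.63:

```python
import math

def coefficient_of_variation(scores):
    """CV = sigma / mu, with the population (1/n) standard deviation."""
    mu = sum(scores) / len(scores)
    if mu == 0:
        return float("inf")  # mean-zero queries are maximally unstable/unsolved
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
    return sigma / mu
```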
Strategic Data Selection Process
The researchers implemented a two-stage data selection approach:
Stage I: Large Scale Reasoning Data Selection
The team filtered data based on verify_score and CV, retaining only high-quality examples with strong learning value. Their algorithm:
- Computes maximum verify_score for each query
- Discards queries below category-specific thresholds
- Computes CV for remaining queries
- Retains challenging queries (CV > 0.05) with high verify_scores
- For easy queries (CV ≤ 0.05), retains only a portion of other/multiturn categories
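The Stage I algorithm above can be sketched as the filter below. The per-category `thresholds` and the 10% easy-query sampling rate are assumed values for illustration; only the CV cutoff of 0.05 comes from the paper:

```python
import random
import statistics

CV_CUTOFF = 0.05
EASY_KEEP_FRACTION = 0.1               # assumed sampling rate for easy queries
EASY_KEEP_CATEGORIES = {"other", "multiturn"}

def cv(scores):
    mu = statistics.fmean(scores)
    return float("inf") if mu == 0 else statistics.pstdev(scores) / mu

def select_stage1(records, thresholds, rng=None):
    """records: dicts with 'category' and 'verify_scores' (one score per response).
    thresholds: per-category minimum for the best response's verify_score."""
    rng = rng or random.Random(0)
    kept = []
    for r in records:
        scores = r["verify_scores"]
        if max(scores) < thresholds[r["category"]]:
            continue                    # no response clears the bar: discard as too hard
        if cv(scores) > CV_CUTOFF:
            kept.append(r)              # unstable performance -> high learning value
        elif r["category"] in EASY_KEEP_CATEGORIES and rng.random() < EASY_KEEP_FRACTION:
            kept.append(r)              # keep a small slice of easy other/multiturn data
    return kept
```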

Figure 2: Distribution of training data types in Supervised Fine-Tuning (SFT) Stage I. The left pie chart illustrates the proportion at the instance-level, while the right pie chart shows the distribution at the answer token-level.
Stage II: Annealed Data Selection
After the model improved from Stage I training, the researchers increased data difficulty by focusing exclusively on queries with higher variability:
- Applied stringent verify_score threshold of 0.99
- Selected only queries with CV > 0.05
- Retained queries with the highest CV values
- Randomly sampled one response per query for training
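The Stage II steps above amount to a stricter version of the same filter. A sketch, assuming a `top_k` budget parameter and the record layout from Stage I (both illustrative):

```python
import random
import statistics

def cv(scores):
    mu = statistics.fmean(scores)
    return float("inf") if mu == 0 else statistics.pstdev(scores) / mu

def select_stage2(records, top_k, rng=None):
    """Keep queries that at least one response solves well (max verify_score >= 0.99)
    yet remain unstable (CV > 0.05); take the top_k highest-CV queries and
    randomly sample one response from each for training."""
    rng = rng or random.Random(0)
    eligible = [r for r in records
                if max(r["verify_scores"]) >= 0.99 and cv(r["verify_scores"]) > 0.05]
    eligible.sort(key=lambda r: cv(r["verify_scores"]), reverse=True)
    return [{"query": r["query"], "response": rng.choice(r["responses"])}
            for r in eligible[:top_k]]
```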

Figure 3: Distribution of training data types in Supervised Fine-Tuning (SFT) Stage II. The left pie chart illustrates the proportion at the instance-level, the right pie chart shows the distribution at the answer token-level.
This carefully designed selection strategy produced a dataset with approximately 5 million reasoning-intensive samples for Stage I and more challenging data for Stage II.
Experimental Validation and Results
Higher Learning Rates for Reasoning-Focused Training
The researchers observed a crucial training pattern shift: reasoning-focused training on base models requires higher learning rates than traditional post-training methods. They implemented:
- Learning rate of 8×10⁻⁵ for one epoch
- Packing strategy with 32k token maximum length
- Global batch size of 64
- Cosine learning rate scheduler with 5% warmup
This higher learning rate proved essential for capturing complex reasoning patterns, as lower rates led to underfitting on long-sequence reasoning tasks (see the full paper, DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training, for details).
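The scheduler described above is a standard cosine decay with linear warmup; a minimal sketch using the reported hyperparameters (sequence packing and the 32k context limit would be handled separately by the data loader):

```python
import math

def lr_at(step, total_steps, peak_lr=8e-5, warmup_frac=0.05, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup over the first 5% of steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warmup to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```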
Two-Stage Training with Difficulty Annealing
In the annealing stage (Stage II), the researchers:
- Started from the optimal Stage I model
- Used a more stringent selection strategy (CV > 0.05)
- Discontinued packing to focus on individual complex queries
- Reduced learning rate to 8×10⁻⁶
- Trained for two epochs
This strategy allowed the model to further refine its reasoning capabilities on the most challenging examples.
Evaluation on Challenging Reasoning Benchmarks
The researchers evaluated their models on three challenging benchmarks:
- AIME2024 - 30 integer-answer questions from the 2024 American Invitational Mathematics Examination
- LiveCodeBench - A comprehensive coding benchmark with programming challenges
- GPQA-Diamond - 198 high-difficulty graduate-level multiple-choice questions in science
These benchmarks provide rigorous evaluation of different reasoning capabilities.
Competitive Performance Through Supervised Fine-Tuning Alone
| Model | AIME2024 (%) | GPQA-Diamond (%) | LiveCodeBench (%) |
|---|---|---|---|
| DS-Distill-32B (SFT) | 72.6 | 62.1 | 57.2 |
| QwQ-32B (RL) | 79.5 | 65.9 | 63.4 |
| DS-Distill-70B (SFT) | 70.0 | 65.2 | 57.5 |
| DeepSeek-R1 (RL) | 79.8 | 71.5 | 65.9 |
| Ours-Distill-32B (SFT) | 75.8 | 66.3 | 64.2 |
| Ours-Distill-72B (SFT) | 79.2 | 65.7 | 63.8 |
Table 1: Performance comparison of various models.
The 72B model achieved a 79.2% pass rate on AIME2024, significantly outperforming DeepSeek-Distill models and approaching reinforcement learning-based models like DeepSeek-R1 (79.8%), despite using only supervised fine-tuning.
The two-stage training strategy further improved performance on key benchmarks:
| Model | AIME2024 (%) | GPQA-Diamond (%) | LiveCodeBench (%) |
|---|---|---|---|
| Ours-32B (Stage I) | 75.8 | 66.3 | 64.2 |
| Ours-32B (Stage II) | 77.9 | 66.9 | 61.7 |
Table 2: Performance of the two-stage SFT models.
On AIME2024, the 32B model improved from 75.8% to 77.9% after Stage II, though LiveCodeBench performance decreased slightly, indicating a need for further tuning.

Figure 4: Loss curves of 72B model training.
The training loss curves clearly demonstrate that higher learning rates allow the model to better fit complex reasoning data.

Figure 5: Variations of the 32B model's AIME2024 score, generation stop ratio, and average generated token length across training steps.
The model's performance improved continuously during training, with the generation stop ratio improving significantly and average token length stabilizing, demonstrating effective adaptation to the QA format.
Future Directions and Limitations
This research introduces a large-scale, difficulty-graded reasoning dataset and an effective training methodology that enhances base model performance on complex reasoning tasks. The observed training pattern shift—requiring higher learning rates for reasoning tasks—represents an important finding for future research.
The two-stage training approach with annealing shows promise for further performance improvements. Future work will focus on developing more refined methods for evaluating data quality and investigating how models with varying initial capabilities influence subsequent reinforcement learning outcomes.
The publicly released datasets and methods aim to facilitate rapid progress in developing open-source LLMs with exceptional reasoning capabilities. By sharing these resources, the researchers contribute significantly to advancing the field's understanding of effective reasoning training strategies.