DeepSeek-R1-0528 now demonstrates performance comparable to Gemini Pro and Claude 4, closing in on OpenAI's o3
DeepSeek-R1-0528 on Hugging Face
1. Breakthrough in Reasoning Capabilities
Computational Scaling & Algorithmic Optimization
- Increased average reasoning depth per complex problem: 12K → 23K tokens (AIME test set)
- AIME 2025 accuracy: 70% → 87.5% (+17.5 percentage points)
- HMMT 2025 pass rate: 41.7% → 79.4% (+37.7 percentage points, about a 90% relative improvement)
- Mathematical Olympiad performance:
  - CNMO 2024: 78.8% → 86.9%
  - AIME 2024: 79.8% → 91.4%
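The gains in this post are quoted sometimes in percentage points and sometimes as relative improvements, so it helps to compute both side by side. The snippet below is a minimal sketch that does so for the reasoning scores listed above; the helper name `describe_change` is made up for illustration and is not part of any DeepSeek tooling.

```python
def describe_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return round(absolute, 1), round(relative, 1)


# Scores (in percent) copied from the list above: (previous R1, R1-0528).
scores = {
    "AIME 2025": (70.0, 87.5),
    "HMMT 2025": (41.7, 79.4),
    "CNMO 2024": (78.8, 86.9),
    "AIME 2024": (79.8, 91.4),
}

for name, (before, after) in scores.items():
    points, relative = describe_change(before, after)
    print(f"{name}: {points:+.1f} points, {relative:+.1f}% relative")
```

For HMMT 2025 this gives +37.7 points and roughly +90% relative, which is where the "90%" figure comes from.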
2. Hallucination Rate Reduction
- SimpleQA benchmark: correctness went from 30.1% → 27.8% (a decrease of 2.3 percentage points)
- Accuracy gains on fact-intensive tasks:
  - FRAMES accuracy: 82.5% → 83.0% (+0.5 percentage points)
  - GPQA-Diamond: 71.5% → 81.0% (+9.5 percentage points, about 13.3% relative)
3. Enhanced Function Calling Support
- Tool utilization benchmarks:
  - BFCL_v3_MultiTurn accuracy: 37.0% (first-time measurement)
  - Tau-Bench performance:
    - Airline domain: 53.5%
    - Retail domain: 63.9%
- SWE Verified resolution rate: 49.2% → 57.6% (+8.4 percentage points, about 17% relative)
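To make "function calling" concrete, here is a minimal sketch of the client side of a tool call. It assumes the OpenAI-style `tools` message format; the post itself does not describe the API. The tool `get_flight_status`, the `REGISTRY` dict, and the simulated model output are all hypothetical, and no network request is made.

```python
import json


# Hypothetical tool for illustration; not part of any benchmark or DeepSeek API.
def get_flight_status(flight_number: str) -> dict:
    return {"flight_number": flight_number, "status": "on time"}


# Tool description in the OpenAI-style "tools" format (an assumption here).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_flight_status",
        "description": "Look up the current status of a flight.",
        "parameters": {
            "type": "object",
            "properties": {"flight_number": {"type": "string"}},
            "required": ["flight_number"],
        },
    },
}]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_flight_status": get_flight_status}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one model-requested tool call and build the reply message."""
    name = tool_call["function"]["name"]
    if name not in REGISTRY:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            # Arguments arrive as a JSON string and must be parsed first.
            args = json.loads(tool_call["function"]["arguments"])
            result = REGISTRY[name](**args)
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


# Simulated model output; a real one would come back from the chat endpoint.
simulated_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "get_flight_status",
        "arguments": '{"flight_number": "CA123"}',
    },
}
reply = run_tool_call(simulated_call)
print(reply["content"])
```

Multi-turn benchmarks such as BFCL_v3_MultiTurn and Tau-Bench exercise this loop repeatedly: the model emits a call, the client runs it and returns a `tool` message, and the model decides what to do next.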
4. Optimized Vibe Coding Experience
- LiveCodeBench (2024-08 to 2025-05): Pass@1 rate rose from 63.5% → 73.3% (+9.8 percentage points)
- Aider-Polyglot accuracy: 53.3% → 71.6% (+18.3 percentage points, about 34.3% relative)
- Key enhancements:
  - Context-aware autocompletion
  - Real-time error prediction
  - Multi-language pattern recognition
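For readers unfamiliar with the Pass@1 metric quoted for LiveCodeBench: it is the probability that a single sampled solution passes all tests for a problem, averaged over problems. The sketch below uses the commonly cited unbiased pass@k estimator; whether the LiveCodeBench figures above were computed exactly this way is not stated in the source, and the sample counts in the example are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed all tests."""
    if n - c < k:
        # Fewer than k failures: every draw of k samples contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n samples, c correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Invented per-problem results: (samples drawn, samples that passed).
example = [(10, 7), (10, 10), (10, 0), (10, 5)]
print(f"pass@1 = {benchmark_pass_at_k(example, k=1):.3f}")
```

For k = 1 the estimator reduces to c / n per problem, so Pass@1 is simply the average fraction of passing samples.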