DeepSeek-R1-0528 now demonstrates performance comparable to Gemini Pro and Claude 4, closing in on OpenAI's o3
DeepSeek-R1-0528 on Hugging Face
1. Breakthrough in Reasoning Capabilities
Computational Scaling & Algorithmic Optimization
- Average reasoning depth per complex problem increased: 12K → 23K tokens (AIME test set)
- AIME 2025 accuracy: 70% → 87.5% (+17.5 points)
- HMMT 2025 pass rate: 41.7% → 79.4% (+37.7 points, about 90% relative)
- Mathematical Olympiad performance:
  - CNMO 2024: 78.8% → 86.9%
  - AIME 2024: 79.8% → 91.4%
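The parenthesized gains attached to benchmark scores like these can be read two ways: as a percentage-point difference or as a relative change. A minimal sketch for checking which one a figure is, using scores from the list above (the helper name `gains` is hypothetical, introduced only for illustration):

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    point_gain = after - before
    relative_gain = point_gain / before * 100
    return round(point_gain, 1), round(relative_gain, 1)


# Scores taken from the list above (percent accuracy / pass rate).
print(gains(70.0, 87.5))   # AIME 2025
print(gains(41.7, 79.4))   # HMMT 2025
```

For AIME 2025 this gives 17.5 points (25.0% relative); for HMMT 2025 it gives 37.7 points (90.4% relative).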
 
 
2. Hallucination Rate Reduction
- SimpleQA benchmark: Correct rate changed from 30.1% → 27.8% (-2.3 points)
- Error rate reduction in fact-intensive tasks:
  - FRAMES accuracy: 82.5% → 83.0%
  - GPQA-Diamond: 71.5% → 81.0% (+9.5 points, about 13.3% relative)
 
 
3. Enhanced Function Calling Support
- Tool utilization benchmarks:
  - BFCL_v3_MultiTurn accuracy: 37.0% (first-time measurement)
  - Tau-Bench performance:
    - Airline domain: 53.5%
    - Retail domain: 63.9%
- Reliability on agentic coding tasks improved by about 17% relative (SWE Verified resolved rate: 49.2% → 57.6%, +8.4 points)
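To make "function calling" concrete, here is a minimal sketch of the client side of a tool call. It assumes an OpenAI-style tool schema, which many chat APIs accept; whether a given DeepSeek-R1-0528 deployment uses exactly this shape is an assumption. The tool `get_flight_status`, the registry, and the simulated model output are all hypothetical, and no network request is made.

```python
import json

# Hypothetical tool description in the OpenAI-style "tools" format (an assumption).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_status",
            "description": "Look up the status of a flight by its number.",
            "parameters": {
                "type": "object",
                "properties": {"flight_number": {"type": "string"}},
                "required": ["flight_number"],
            },
        },
    }
]


def get_flight_status(flight_number: str) -> dict:
    """Stand-in implementation; a real tool would query an airline system."""
    return {"flight_number": flight_number, "status": "on time"}


# Maps tool names the model may emit to local Python callables.
REGISTRY = {"get_flight_status": get_flight_status}


def dispatch(tool_call: dict) -> str:
    """Run one model-emitted tool call and return the result as a JSON string."""
    name = tool_call["function"]["name"]
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    # In this format, arguments arrive as a JSON-encoded string.
    args = json.loads(tool_call["function"]["arguments"])
    return json.dumps(REGISTRY[name](**args))


# Simulated model output, hand-written for illustration.
simulated_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_flight_status", "arguments": '{"flight_number": "XY123"}'},
}
print(dispatch(simulated_call))
```

Multi-turn benchmarks such as the ones listed above exercise this loop repeatedly: the model emits a call, the client runs it, and the result is fed back as a new message before the next turn.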
 
4. Optimized Vibe Coding Experience
- LiveCodeBench (2024-08 to 2025-05): Pass@1 rate rose from 63.5% → 73.3% (+9.8 points)
- Aider-Polyglot accuracy: 53.3% → 71.6% (+18.3 points, about 34.3% relative)
- Key enhancements:
  - Context-aware autocompletion
  - Real-time error prediction
  - Multi-language pattern recognition
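For readers unfamiliar with the Pass@1 metric, the sketch below shows the unbiased pass@k estimator commonly used in code-generation evaluation, which for k = 1 reduces to the fraction of sampled solutions that pass. The post does not state how many samples per problem were drawn for the LiveCodeBench numbers, so the counts here are made up and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, given (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up per-problem counts, for illustration only.
toy_results = [(10, 10), (10, 5), (10, 0), (10, 3)]
print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.3f}")
```

With these toy counts the per-problem estimates are 1.0, 0.5, 0.0, and 0.3, so the benchmark-level pass@1 is their mean.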
 
 

    