Has anyone else noticed that DeepSeek-R1 is crushing the benchmarks?

If you take a look at the numbers below, DeepSeek-R1 compares remarkably well against other AI models on most benchmarks:

MMLU: DeepSeek-R1 achieved a score of 90.8%, outperforming Claude-3.5-Sonnet-1022, GPT-4o 0513, and OpenAI o1-mini, but landing slightly below OpenAI o1-1217.
MMLU-Redux: DeepSeek-R1 scored 92.9%, surpassing all other models listed.
MMLU-Pro: DeepSeek-R1 performed with a score of 84.0%, leading all other models.
DROP: DeepSeek-R1 achieved 92.2%, outperforming all other models.
IF-Eval: DeepSeek-R1 scored 83.3%, which is slightly lower than GPT-4o 0513, Claude-3.5-Sonnet-1022, and DeepSeek V3.
GPQA-Diamond: DeepSeek-R1 scored 71.5%, which is lower than OpenAI o1-1217 but higher than other models.
SimpleQA: DeepSeek-R1 scored 30.1%, which is lower than OpenAI o1-1217 and GPT-4o 0513 but higher than Claude-3.5-Sonnet-1022.
FRAMES: DeepSeek-R1 achieved 82.5%, outperforming all other models.
AlpacaEval2.0: DeepSeek-R1 scored 87.6%, significantly higher than other models.
ArenaHard: DeepSeek-R1 achieved 92.3%, outperforming all other models.
LiveCodeBench: DeepSeek-R1 scored 65.9%, outperforming all other models.
Codeforces: DeepSeek-R1 reached the 96.3 percentile, which is slightly lower than OpenAI o1-1217 but higher than other models.
SWE Verified: DeepSeek-R1 scored 49.2%, slightly higher than OpenAI o1-1217.
Aider-Polyglot: DeepSeek-R1 scored 53.3%, which is lower than OpenAI o1-1217 but higher than other models.
AIME 2024: DeepSeek-R1 achieved 79.8%, which is slightly higher than OpenAI o1-1217.
MATH-500: DeepSeek-R1 scored 97.3%, leading all other models.
CNMO 2024: DeepSeek-R1 achieved 78.8%, outperforming all other models.
CLUEWSC: DeepSeek-R1 scored 92.8%, outperforming all other models.
C-Eval: DeepSeek-R1 achieved 91.8%, outperforming all other models.
C-SimpleQA: DeepSeek-R1 scored 63.7%, which is lower than DeepSeek V3 but higher than other models.

Distilled versions of DeepSeek-R1 also show strong performance:

DeepSeek-R1-Distill-Qwen-1.5B: Achieved 83.9% on MATH-500 and 954 on CodeForces rating.
DeepSeek-R1-Distill-Qwen-7B: Achieved 92.8% on MATH-500 and 1189 on CodeForces rating.
DeepSeek-R1-Distill-Qwen-14B: Achieved 93.9% on MATH-500 and 1481 on CodeForces rating.
DeepSeek-R1-Distill-Qwen-32B: Achieved 94.3% on MATH-500 and 1691 on CodeForces rating.
DeepSeek-R1-Distill-Llama-8B: Achieved 89.1% on MATH-500 and 1205 on CodeForces rating.
DeepSeek-R1-Distill-Llama-70B: Achieved 94.5% on MATH-500 and 1633 on CodeForces rating.
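
If you want to compare the distilled models yourself, here is a rough Python sketch that ranks them by either metric. The numbers are copied straight from the list above; the `distilled` dict and the `rank_by` helper are just names I made up for illustration, not part of any DeepSeek tooling.

```python
# Scores copied from the list above: (MATH-500 %, CodeForces rating).
# The dict name and helper function are illustrative, not an official API.
distilled = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (83.9, 954),
    "DeepSeek-R1-Distill-Qwen-7B": (92.8, 1189),
    "DeepSeek-R1-Distill-Qwen-14B": (93.9, 1481),
    "DeepSeek-R1-Distill-Qwen-32B": (94.3, 1691),
    "DeepSeek-R1-Distill-Llama-8B": (89.1, 1205),
    "DeepSeek-R1-Distill-Llama-70B": (94.5, 1633),
}


def rank_by(metric_index):
    """Return model names sorted best-first by the chosen metric.

    metric_index 0 is MATH-500, metric_index 1 is the CodeForces rating.
    """
    return sorted(
        distilled,
        key=lambda name: distilled[name][metric_index],
        reverse=True,
    )


best_math = rank_by(0)[0]
best_codeforces = rank_by(1)[0]

if __name__ == "__main__":
    print("Best on MATH-500:", best_math)
    print("Best on CodeForces:", best_codeforces)
```

One thing this makes easy to see: the top model is not the same on both metrics, so bigger is not automatically better across the board.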

Just thought this was interesting. Thanks for reading my post!
