jyomama28

Has anyone else noticed that DeepSeek-R1 is destroying the benchmarks?

Take a look at these benchmark comparisons between DeepSeek-R1 and other AI models:

MMLU: DeepSeek-R1 achieved a score of 90.8%, outperforming Claude-3.5-Sonnet-1022, GPT-4o 0513, and OpenAI o1-mini, but slightly lower than OpenAI o1-1217.
MMLU-Redux: DeepSeek-R1 scored 92.9%, surpassing all other models listed.
MMLU-Pro: DeepSeek-R1 performed with a score of 84.0%, leading all other models.
DROP: DeepSeek-R1 achieved 92.2%, outperforming all other models.
IF-Eval: DeepSeek-R1 scored 83.3%, which is slightly lower than Claude-3.5-Sonnet-1022, GPT-4o 0513, DeepSeek V3, and OpenAI o1-mini.
GPQA-Diamond: DeepSeek-R1 scored 71.5%, which is lower than OpenAI o1-1217 but higher than other models.
SimpleQA: DeepSeek-R1 scored 30.1%, which is lower than OpenAI o1-1217 and GPT-4o 0513 but higher than Claude-3.5-Sonnet-1022.
FRAMES: DeepSeek-R1 achieved 82.5%, outperforming all other models.
AlpacaEval2.0: DeepSeek-R1 scored 87.6%, significantly higher than other models.
ArenaHard: DeepSeek-R1 achieved 92.3%, outperforming all other models.
LiveCodeBench: DeepSeek-R1 scored 65.9%, outperforming all other models.
Codeforces: DeepSeek-R1 reached the 96.3 percentile, which is slightly lower than OpenAI o1-1217 but higher than other models.
SWE Verified: DeepSeek-R1 scored 49.2%, slightly higher than OpenAI o1-1217.
Aider-Polyglot: DeepSeek-R1 scored 53.3%, which is lower than OpenAI o1-1217 but higher than other models.
AIME 2024: DeepSeek-R1 achieved 79.8%, which is slightly higher than OpenAI o1-1217.
MATH-500: DeepSeek-R1 scored 97.3%, leading all other models.
CNMO 2024: DeepSeek-R1 achieved 78.8%, outperforming all other models.
CLUEWSC: DeepSeek-R1 scored 92.8%, outperforming all other models.
C-Eval: DeepSeek-R1 achieved 91.8%, outperforming all other models.
C-SimpleQA: DeepSeek-R1 scored 63.7%, which is lower than DeepSeek V3 but higher than other models.
Distilled versions of DeepSeek-R1 also show strong performance:

DeepSeek-R1-Distill-Qwen-1.5B: Achieved 83.9% on MATH-500 and a CodeForces rating of 954.
DeepSeek-R1-Distill-Qwen-7B: Achieved 92.8% on MATH-500 and a CodeForces rating of 1189.
DeepSeek-R1-Distill-Qwen-14B: Achieved 93.9% on MATH-500 and a CodeForces rating of 1481.
DeepSeek-R1-Distill-Qwen-32B: Achieved 94.3% on MATH-500 and a CodeForces rating of 1691.
DeepSeek-R1-Distill-Llama-8B: Achieved 89.1% on MATH-500 and a CodeForces rating of 1205.
DeepSeek-R1-Distill-Llama-70B: Achieved 94.5% on MATH-500 and a CodeForces rating of 1633.
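
If you want to play with the distilled numbers yourself, here is a rough Python sketch that puts the six results above in a list and ranks them. The scores are copied straight from the list; the names `DISTILLED`, `rank_by`, and `best` are just made up for this example and don't come from any DeepSeek tooling.

```python
# Distilled-model results copied from the list above.
# Each row: (model name, MATH-500 score in %, CodeForces rating)
DISTILLED = [
    ("DeepSeek-R1-Distill-Qwen-1.5B", 83.9, 954),
    ("DeepSeek-R1-Distill-Qwen-7B", 92.8, 1189),
    ("DeepSeek-R1-Distill-Qwen-14B", 93.9, 1481),
    ("DeepSeek-R1-Distill-Qwen-32B", 94.3, 1691),
    ("DeepSeek-R1-Distill-Llama-8B", 89.1, 1205),
    ("DeepSeek-R1-Distill-Llama-70B", 94.5, 1633),
]

MATH500 = 1      # column index of the MATH-500 score
CODEFORCES = 2   # column index of the CodeForces rating


def rank_by(results, column):
    """Return the rows sorted from best to worst on the given column."""
    return sorted(results, key=lambda row: row[column], reverse=True)


def best(results, column):
    """Return the name of the top-ranked model on the given column."""
    return rank_by(results, column)[0][0]


if __name__ == "__main__":
    print("Ranked by CodeForces rating:")
    for name, math500, rating in rank_by(DISTILLED, CODEFORCES):
        print(f"  {name:<32} {math500:5.1f}%  {rating:5d}")
    print("Best on MATH-500:", best(DISTILLED, MATH500))
```

Ranking this way shows that the two benchmarks don't agree on a winner: the Llama-70B distill leads on MATH-500, while the Qwen-32B distill has the highest CodeForces rating.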

Just thought this was interesting. Thanks for reading my post!
