<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jyomama28</title>
    <description>The latest articles on DEV Community by jyomama28 (@jyoeymama).</description>
    <link>https://dev.to/jyoeymama</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2780555%2F103a9ec1-e5a4-46a2-99f2-2bfcb8848a40.jpeg</url>
      <title>DEV Community: jyomama28</title>
      <link>https://dev.to/jyoeymama</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jyoeymama"/>
    <language>en</language>
    <item>
      <title>Has anyone else been noticing that Deepseek-r1 is destroying in benchmarks?</title>
      <dc:creator>jyomama28</dc:creator>
      <pubDate>Thu, 30 Jan 2025 02:06:38 +0000</pubDate>
      <link>https://dev.to/jyoeymama/has-anyone-else-been-noticing-that-deepseek-r1-is-destroying-in-benchmarks-4c4f</link>
      <guid>https://dev.to/jyoeymama/has-anyone-else-been-noticing-that-deepseek-r1-is-destroying-in-benchmarks-4c4f</guid>
      <description>&lt;p&gt;Take a look at the numbers below: they show the crazy benchmark results of DeepSeek-R1 compared to other AI models.&lt;/p&gt;

&lt;p&gt;MMLU: DeepSeek-R1 achieved a score of 90.8%, outperforming Claude-3.5-Sonnet-1022, GPT-4o 0513, OpenAI o1-mini, and OpenAI o1-1217.&lt;br&gt;
MMLU-Redux: DeepSeek-R1 scored 92.9%, surpassing all other models listed.&lt;br&gt;
MMLU-Pro: DeepSeek-R1 scored 84.0%, leading all other models.&lt;br&gt;
DROP: DeepSeek-R1 achieved 92.2%, outperforming all other models.&lt;br&gt;
IF-Eval: DeepSeek-R1 scored 83.3%, which is lower than GPT-4o 0513 but higher than Claude-3.5-Sonnet-1022 and DeepSeek V3.&lt;br&gt;
GPQA-Diamond: DeepSeek-R1 scored 71.5%, which is lower than OpenAI o1-1217 but higher than other models.&lt;br&gt;
SimpleQA: DeepSeek-R1 scored 30.1%, which is lower than OpenAI o1-1217 but higher than GPT-4o 0513 and Claude-3.5-Sonnet-1022.&lt;br&gt;
FRAMES: DeepSeek-R1 achieved 82.5%, outperforming all other models.&lt;br&gt;
AlpacaEval2.0: DeepSeek-R1 scored 87.6%, significantly higher than other models.&lt;br&gt;
ArenaHard: DeepSeek-R1 achieved 92.3%, outperforming all other models.&lt;br&gt;
LiveCodeBench: DeepSeek-R1 scored 65.9%, outperforming all other models.&lt;br&gt;
Codeforces: DeepSeek-R1 achieved 96.3%, which is slightly lower than OpenAI o1-1217 but higher than other models.&lt;br&gt;
SWE Verified: DeepSeek-R1 scored 49.2%, slightly higher than OpenAI o1-1217.&lt;br&gt;
Aider-Polyglot: DeepSeek-R1 scored 53.3%, outperforming all other models.&lt;br&gt;
AIME 2024: DeepSeek-R1 achieved 79.8%, which is slightly higher than OpenAI o1-1217.&lt;br&gt;
MATH-500: DeepSeek-R1 scored 97.3%, leading all other models.&lt;br&gt;
CNMO 2024: DeepSeek-R1 achieved 78.8%, outperforming all other models.&lt;br&gt;
CLUEWSC: DeepSeek-R1 scored 92.8%, outperforming all other models.&lt;br&gt;
C-Eval: DeepSeek-R1 achieved 91.8%, outperforming all other models.&lt;br&gt;
C-SimpleQA: DeepSeek-R1 scored 63.7%, outperforming all other models.&lt;br&gt;
Distilled versions of DeepSeek-R1 also show strong performance:&lt;/p&gt;

&lt;p&gt;DeepSeek-R1-Distill-Qwen-1.5B: Achieved 83.9% on MATH-500 and a Codeforces rating of 954.&lt;br&gt;
DeepSeek-R1-Distill-Qwen-7B: Achieved 92.8% on MATH-500 and a Codeforces rating of 1189.&lt;br&gt;
DeepSeek-R1-Distill-Qwen-14B: Achieved 93.9% on MATH-500 and a Codeforces rating of 1481.&lt;br&gt;
DeepSeek-R1-Distill-Qwen-32B: Achieved 94.3% on MATH-500 and a Codeforces rating of 1691.&lt;br&gt;
DeepSeek-R1-Distill-Llama-8B: Achieved 89.1% on MATH-500 and a Codeforces rating of 1205.&lt;br&gt;
DeepSeek-R1-Distill-Llama-70B: Achieved 94.5% on MATH-500 and a Codeforces rating of 1633.&lt;/p&gt;

&lt;p&gt;Just thought this was interesting. Thanks for reading my post!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
