<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sky</title>
    <description>The latest articles on DEV Community by Sky (@sky_05).</description>
    <link>https://dev.to/sky_05</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843102%2F52391195-31fb-437e-a1de-210a58b1da77.png</url>
      <title>DEV Community: Sky</title>
      <link>https://dev.to/sky_05</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sky_05"/>
    <language>en</language>
    <item>
      <title>New Benchmark for Open-Source Agents: What is Claw-Eval? How Step 3.5 Flash Secured the #2 Spot</title>
      <dc:creator>Sky</dc:creator>
      <pubDate>Wed, 25 Mar 2026 12:53:21 +0000</pubDate>
      <link>https://dev.to/sky_05/new-benchmark-for-open-source-agents-what-is-claw-eval-how-step-35-flash-secured-the-2-spot-592d</link>
      <guid>https://dev.to/sky_05/new-benchmark-for-open-source-agents-what-is-claw-eval-how-step-35-flash-secured-the-2-spot-592d</guid>
      <description>&lt;p&gt;Recently, a new Agent evaluation framework called &lt;strong&gt;Claw-Eval&lt;/strong&gt; has sparked significant discussion within the developer community. In its latest rankings, &lt;strong&gt;Step 3.5 Flash&lt;/strong&gt; emerged as the #2 open-source model, trailing only GLM 5, while sharing the top spot for the Pass@3 metric.&lt;/p&gt;

&lt;p&gt;What makes this leaderboard unique is that it doesn't test "knowledge breadth" or "abstract reasoning." Instead, it focuses on a more fundamental question: Can the model actually call tools, execute steps, and complete tasks reliably in a real-world environment?&lt;/p&gt;

&lt;p&gt;Today, we’ll explore the design philosophy behind Claw-Eval and analyze why Step 3.5 Flash performed so exceptionally under this rigorous evaluation system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claw-Eval: Testing "Doing," Not Just "Knowing"
&lt;/h2&gt;

&lt;p&gt;Developed by a joint team from Peking University and the University of Hong Kong, Claw-Eval features tasks that are entirely human-verified. Its positioning is clear: &lt;strong&gt;End-to-end testing of an AI Agent’s ability to complete tasks in the real world.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional benchmarks (like MMLU, MATH, or HumanEval) measure whether a model "knows the answer." Claw-Eval answers a different question: Given a live operational environment, can the model successfully complete a task by calling tools and executing multi-step operations?&lt;/p&gt;

&lt;p&gt;To achieve this, Claw-Eval built a comprehensive testing ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;104 Tasks&lt;/strong&gt;: Covering real-world scenarios like calendar management, file operations, web search, code execution, financial analysis, and email processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15 Mock Enterprise Services&lt;/strong&gt;: Creating an interactive tool-calling environment rather than just paper-based Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Sandbox Isolation&lt;/strong&gt;: Each test runs in an independent environment to ensure no cross-interference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Verification&lt;/strong&gt;: Every task is verified by humans—no "LLM-as-a-judge"—to eliminate biases inherent in automated scoring.&lt;/li&gt;
&lt;/ul&gt;
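&lt;p&gt;As a rough illustration of the sandbox idea above, per-task isolation could be sketched like this (a minimal sketch only; the image name and entry command are hypothetical and not taken from the Claw-Eval codebase):&lt;/p&gt;

```python
import subprocess

def sandbox_argv(image, command):
    # One throwaway container per task: --rm discards all state afterwards,
    # and --network none blocks cross-task (and outbound) interference.
    return ["docker", "run", "--rm", "--network", "none", image, *command]

def run_task(image, command, timeout=300):
    # Returns (passed, stdout); a non-zero exit code counts as a failed run.
    result = subprocess.run(sandbox_argv(image, command),
                            capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, result.stdout
```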




&lt;h2&gt;
  
  
  Pass³: Stability Through Triple Consistency
&lt;/h2&gt;

&lt;p&gt;The most critical design element of Claw-Eval is its core scoring mechanism: &lt;strong&gt;Pass³&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While most benchmarks calculate scores based on a single run, Claw-Eval is far stricter. A task is only considered successful if it passes &lt;strong&gt;three independent runs consecutively&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The logic is simple: One success might be luck; three consecutive successes prove capability.&lt;/p&gt;
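&lt;p&gt;The arithmetic behind that logic is stark: assuming independent runs, a model that clears a task 80% of the time in a single attempt clears three runs in a row only about half the time.&lt;/p&gt;

```python
# Probability that a model with single-run success rate p
# clears all three independent runs, as Pass-cubed requires.
def triple_pass(p):
    return p ** 3

# An "80% model" survives triple verification only about half the time;
# even a 95% model drops to roughly 86%.
eighty = triple_pass(0.80)       # ~0.512
ninety_five = triple_pass(0.95)  # ~0.857
```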

&lt;p&gt;The scoring formula is as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task_score = safety × (0.8 × completion + 0.2 × robustness)
Threshold: pass ≥ 75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
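&lt;p&gt;In code, the scoring rule might look like this (a sketch under assumptions: completion and robustness on a 0–100 scale, safety normalized to 0–1 so it acts as a multiplicative gate):&lt;/p&gt;

```python
def task_score(completion, robustness, safety):
    # Safety gates the whole score: an unsafe run (safety = 0) scores 0
    # no matter how well the task itself was completed.
    return safety * (0.8 * completion + 0.2 * robustness)

def run_passes(completion, robustness, safety, threshold=75.0):
    return task_score(completion, robustness, safety) >= threshold
```

&lt;p&gt;For example, a run with completion 85, robustness 90, and full safety scores 0.8 × 85 + 0.2 × 90 = 86 and passes; the same run with safety 0.5 scores 43 and fails.&lt;/p&gt;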



&lt;p&gt;The leaderboard reports four dimensions, each emphasizing a different strength:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pass³&lt;/strong&gt;: The percentage of tasks passed in all three independent runs (the primary ranking metric).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion&lt;/strong&gt;: The quality of the task outcome.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness&lt;/strong&gt;: Stability when facing edge cases or anomalous inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt;: Security and safety during the execution process.&lt;/li&gt;
&lt;/ul&gt;
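&lt;p&gt;The difference between Pass³ and Pass@3 is easy to state in code: over the same three runs, Pass³ demands that all of them succeed, while Pass@3 is satisfied by any single success (a minimal sketch of the two aggregations):&lt;/p&gt;

```python
def pass_cubed(runs):
    # runs: booleans for the three independent attempts at one task
    return all(runs)

def pass_at_3(runs):
    return any(runs)

runs = [True, True, False]
# This task counts toward Pass@3 but not toward Pass-cubed,
# which is exactly the gap between "can do it" and "does it reliably".
```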

&lt;p&gt;This mechanism essentially tests &lt;strong&gt;"dependable stability"&lt;/strong&gt;—the most critical hurdle an Agent must clear to move from a "prototype" to a "production-ready tool."&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Leaderboard (Open-Source, General Category)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Pass³&lt;/th&gt;
&lt;th&gt;Pass@3&lt;/th&gt;
&lt;th&gt;Completion&lt;/th&gt;
&lt;th&gt;Robustness&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇 1&lt;/td&gt;
&lt;td&gt;GLM 5&lt;/td&gt;
&lt;td&gt;Zhipu AI&lt;/td&gt;
&lt;td&gt;57.7%&lt;/td&gt;
&lt;td&gt;70.2%&lt;/td&gt;
&lt;td&gt;68.9 ±2.0&lt;/td&gt;
&lt;td&gt;95.4 ±0.3&lt;/td&gt;
&lt;td&gt;93.9 ±0.6&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;73.0&lt;/strong&gt; ±1.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈 &lt;strong&gt;2&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Step 3.5 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;StepFun&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68.3 ±0.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.4 ±0.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.3 ±0.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;72.3&lt;/strong&gt; ±0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉 3&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Moonshot AI&lt;/td&gt;
&lt;td&gt;52.9%&lt;/td&gt;
&lt;td&gt;73.1%&lt;/td&gt;
&lt;td&gt;67.4 ±1.3&lt;/td&gt;
&lt;td&gt;94.2 ±0.8&lt;/td&gt;
&lt;td&gt;92.6 ±0.6&lt;/td&gt;
&lt;td&gt;71.6 ±0.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;51.0%&lt;/td&gt;
&lt;td&gt;71.2%&lt;/td&gt;
&lt;td&gt;63.9 ±0.5&lt;/td&gt;
&lt;td&gt;93.1 ±0.3&lt;/td&gt;
&lt;td&gt;92.0 ±0.6&lt;/td&gt;
&lt;td&gt;68.4 ±0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;MiniMax&lt;/td&gt;
&lt;td&gt;51.0%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;65.5 ±0.4&lt;/td&gt;
&lt;td&gt;93.6 ±0.6&lt;/td&gt;
&lt;td&gt;92.0 ±0.6&lt;/td&gt;
&lt;td&gt;69.9 ±0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;MiMo V2 Flash&lt;/td&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;48.1%&lt;/td&gt;
&lt;td&gt;67.3%&lt;/td&gt;
&lt;td&gt;63.3 ±0.5&lt;/td&gt;
&lt;td&gt;94.7 ±0.5&lt;/td&gt;
&lt;td&gt;92.9 ±0.6&lt;/td&gt;
&lt;td&gt;68.4 ±0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Qwen3.5 397A17B&lt;/td&gt;
&lt;td&gt;Alibaba&lt;/td&gt;
&lt;td&gt;48.1%&lt;/td&gt;
&lt;td&gt;67.3%&lt;/td&gt;
&lt;td&gt;66.4 ±2.4&lt;/td&gt;
&lt;td&gt;93.8 ±0.5&lt;/td&gt;
&lt;td&gt;92.0 ±0.6&lt;/td&gt;
&lt;td&gt;70.7 ±2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Data Source: &lt;a href="https://claw-eval.github.io" rel="noopener noreferrer"&gt;claw-eval.github.io&lt;/a&gt;, Filter: "Open-Source" + General category. Snapshot date: 2026-03-25.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Several interesting insights can be drawn from this data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second in Pass³, Zero Variance in Safety.&lt;/strong&gt; Step 3.5 Flash achieved a Safety score of 93.3 ±0.0. A standard deviation of zero means its safety performance was perfectly consistent across all runs. For an Agent system being deployed into a production environment, this predictability is more valuable than peak performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass@3 Tied for First.&lt;/strong&gt; Step 3.5 Flash and GLM 5 both hit 70.2% for Pass@3, showing they are neck-and-neck in single-run success rates. The slight difference in Pass³ (57.7% vs 56.7%) reflects a minor gap in triple-run stability rather than raw capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Notable Speed Advantage.&lt;/strong&gt; According to Claw-Eval’s "Pass Rate vs. Speed" scatter plot, Step 3.5 Flash sits in the "High Speed + High Pass Rate" quadrant. With an average task time of 50–70 seconds, it is significantly faster than other models in its class.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Agent-Specific Rankings Matter
&lt;/h2&gt;

&lt;p&gt;Many models shine on traditional benchmarks like math or coding but stumble in real-world Agent scenarios. This is because the challenges of Agent tasks are fundamentally different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-step Chains&lt;/strong&gt;: If any single step fails, the entire task fails. A simple calendar invite might require searching, parsing, and then writing; a failure at any point collapses the workflow.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;High Precision for Tool Calling&lt;/strong&gt;: Formatting errors, missing parameters, or selecting the wrong tool will immediately break the task.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reliability is the True Capability&lt;/strong&gt;: Succeeding once is easy; succeeding every time is hard.&lt;/li&gt;
&lt;/ol&gt;
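&lt;p&gt;Point 1 compounds quickly: if each step succeeds independently with probability p, a chain of n steps succeeds with probability p&lt;sup&gt;n&lt;/sup&gt; (a simplification, since real agent steps are rarely independent, but the compounding effect is the same in spirit):&lt;/p&gt;

```python
def chain_success(p_step, n_steps):
    # Probability the whole multi-step chain completes, assuming
    # each step succeeds independently with probability p_step.
    return p_step ** n_steps

# A seemingly reliable 95%-per-step agent finishes a
# ten-step workflow only about 60% of the time.
ten_step = chain_success(0.95, 10)  # ~0.599
```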

&lt;p&gt;Step 3.5 Flash’s performance—56.7% Pass³, 94.4 Robustness, and zero safety variance—indicates it is a model you can "actually rely on" for Agent workflows, rather than just a set of impressive numbers on a chart.&lt;/p&gt;

&lt;p&gt;From an engineering perspective, you wouldn't put a model that "works when it's lucky and crashes when it's not" into a production pipeline. Pass³ measures the exact stability required for trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parameter Efficiency: High Performance at Low Cost
&lt;/h2&gt;

&lt;p&gt;Looking at Claw-Eval’s "Pass Rate vs. Cost" analysis, Step 3.5 Flash occupies a very low-cost bracket. This isn't accidental; it’s a result of its architectural design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;196B Total Parameters, only 11B Active&lt;/strong&gt; (Sparse MoE architecture).&lt;/li&gt;
&lt;li&gt;In 128K context scenarios, inference costs are roughly &lt;strong&gt;1/6th&lt;/strong&gt; that of DeepSeek V3.2.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;MTP-3 (Multi-Token Prediction)&lt;/strong&gt; heads enable generation speeds of &lt;strong&gt;100–300 tok/s&lt;/strong&gt;, peaking at 350 tok/s.&lt;/li&gt;
&lt;/ul&gt;
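&lt;p&gt;The first two points follow from the sparse activation ratio (back-of-the-envelope only; real serving cost also depends on context length, batching, and hardware):&lt;/p&gt;

```python
total_params = 196e9   # 196B total parameters (sparse MoE)
active_params = 11e9   # 11B activated per token

# Roughly 5.6% of the weights participate in each forward pass,
# which is where most of the per-token compute saving comes from.
active_ratio = active_params / total_params

# MTP-3 speculates up to 3 extra tokens per decoding step, so one
# forward pass can yield up to 4 accepted tokens in the best case.
best_case_tokens_per_pass = 1 + 3
```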

&lt;p&gt;For applications requiring high-frequency Agent calls—such as automated workflows, multi-turn research tasks, or large-scale data processing—this cost advantage translates directly into significant savings. The balance between high performance and low cost is a core characteristic of Step 3.5 Flash.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resource Links
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claw-Eval Leaderboard&lt;/td&gt;
&lt;td&gt;&lt;a href="https://claw-eval.github.io" rel="noopener noreferrer"&gt;https://claw-eval.github.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claw-Eval GitHub&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/claw-eval/claw-eval" rel="noopener noreferrer"&gt;https://github.com/claw-eval/claw-eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 3.5 Flash GitHub&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/stepfun-ai/Step-3.5-Flash" rel="noopener noreferrer"&gt;https://github.com/stepfun-ai/Step-3.5-Flash&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StepFun Open Platform (Global)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.stepfun.ai" rel="noopener noreferrer"&gt;https://platform.stepfun.ai&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StepFun Open Platform (China)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.stepfun.com" rel="noopener noreferrer"&gt;https://platform.stepfun.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace Models&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/stepfun-ai/Step-3.5-Flash" rel="noopener noreferrer"&gt;https://huggingface.co/stepfun-ai/Step-3.5-Flash&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ModelScope&lt;/td&gt;
&lt;td&gt;&lt;a href="https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash" rel="noopener noreferrer"&gt;https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Report&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/abs/2602.10604" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.10604&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you are building Agent-related applications or are interested in how models perform in real-world scenarios, feel free to join the discussion in the comments or connect with the StepFun developer community (scan the QR code on our GitHub home page).&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
