DEV Community

Sky
Sky

Posted on

New Benchmark for Open-Source Agents: What is Claw-Eval? How Step 3.5 Flash Secured the #2 Spot

Recently, a new Agent evaluation framework called Claw-Eval has sparked significant discussion within the developer community. In its latest rankings, Step 3.5 Flash emerged as the #2 open-source model, trailing only GLM 5, while sharing the top spot for the Pass@3 metric.

What makes this leaderboard unique is that it doesn't test "knowledge breadth" or "abstract reasoning." Instead, it focuses on a more fundamental question: Can the model actually call tools, execute steps, and complete tasks reliably in a real-world environment?

Today, we’ll explore the design philosophy behind Claw-Eval and analyze why Step 3.5 Flash performed so exceptionally under this rigorous evaluation system.


Claw-Eval: Testing "Doing," Not Just "Knowing"

Developed by a joint team from Peking University and the University of Hong Kong, Claw-Eval features tasks that are entirely human-verified. Its positioning is clear: End-to-end testing of an AI Agent’s ability to complete tasks in the real world.

Traditional benchmarks (like MMLU, MATH, or HumanEval) measure whether a model "knows the answer." Claw-Eval answers a different question: Given a live operational environment, can the model successfully complete a task by calling tools and executing multi-step operations?

To achieve this, Claw-Eval built a comprehensive testing ecosystem:

  • 104 Tasks: Covering real-world scenarios like calendar management, file operations, web search, code execution, financial analysis, and email processing.
  • 15 Mock Enterprise Services: Creating an interactive tool-calling environment rather than just paper-based Q&A.
  • Docker Sandbox Isolation: Each test runs in an independent environment to ensure no cross-interference.
  • Human Verification: Every task is verified by humans—no "LLM-as-a-judge"—to eliminate biases inherent in automated scoring.

Pass³: Stability Through Triple Consistency

The most critical design element of Claw-Eval is its core scoring mechanism: Pass³.

While most benchmarks calculate scores based on a single run, Claw-Eval is far stricter. A task is only considered successful if it passes three independent runs consecutively.

The logic is simple: One success might be luck; three consecutive successes prove capability.

The scoring formula is as follows:

task_score = safety × (0.8 × completion + 0.2 × robustness)
Threshold: pass ≥ 75
Enter fullscreen mode Exit fullscreen mode

The four dimensions emphasize different strengths:

  • Pass³: The percentage of tasks passed in all three independent runs (the primary ranking metric).
  • Completion: The quality of the task outcome.
  • Robustness: Stability when facing edge cases or anomalous inputs.
  • Safety: Security and safety during the execution process.

This mechanism essentially tests "dependable stability"—the most critical hurdle an Agent must clear to move from a "prototype" to a "production-ready tool."


Current Leaderboard (Open-Source, General Category)

Rank Model Source Pass³ Pass@3 Completion Robustness Safety Avg Score
🥇 1 GLM 5 Zhipu AI 57.7% 70.2% 68.9 ±2.0 95.4 ±0.3 93.9 ±0.6 73.0 ±1.6
🥈 2 Step 3.5 Flash StepFun 56.7% 70.2% 68.3 ±0.8 94.4 ±0.3 93.3 ±0.0 72.3 ±0.8
🥉 3 Kimi K2.5 Moonshot AI 52.9% 73.1% 67.4 ±1.3 94.2 ±0.8 92.6 ±0.6 71.6 ±0.9
4 DeepSeek V3.2 DeepSeek 51.0% 71.2% 63.9 ±0.5 93.1 ±0.3 92.0 ±0.6 68.4 ±0.4
5 MiniMax M2.5 MiniMax 51.0% 69.2% 65.5 ±0.4 93.6 ±0.6 92.0 ±0.6 69.9 ±0.3
6 MiMo V2 Flash Xiaomi 48.1% 67.3% 63.3 ±0.5 94.7 ±0.5 92.9 ±0.6 68.4 ±0.3
7 Qwen3.5 397A17B Alibaba 48.1% 67.3% 66.4 ±2.4 93.8 ±0.5 92.0 ±0.6 70.7 ±2.0

Data Source: claw-eval.github.io, Filter: "Open-Source" + General category. Snapshot date: 2026-03-25.

Several interesting insights can be drawn from this data:

Second in Pass³, Zero Variance in Safety. Step 3.5 Flash achieved a Safety score of 93.3 ±0.0. A standard deviation of zero means its safety performance was perfectly consistent across all runs. For an Agent system being deployed into a production environment, this predictability is more valuable than peak performance.

Pass@3 Tied for First. Step 3.5 Flash and GLM 5 both hit 70.2% for Pass@3, showing they are neck-and-neck in single-run success rates. The slight difference in Pass³ (57.7% vs 56.7%) reflects a minor gap in triple-run stability rather than raw capability.

A Notable Speed Advantage. According to Claw-Eval’s "Pass Rate vs. Speed" scatter plot, Step 3.5 Flash sits in the "High Speed + High Pass Rate" quadrant. With an average task time of 50–70 seconds, it is significantly faster than other models in its class.


Why Agent-Specific Rankings Matter

Many models shine on traditional benchmarks like math or coding but stumble in real-world Agent scenarios. This is because the challenges of Agent tasks are fundamentally different:

  1. Multi-step Chains: If any single step fails, the entire task fails. A simple calendar invite might require searching, parsing, and then writing; a failure at any point collapses the workflow.
  2. High Precision for Tool Calling: Formatting errors, missing parameters, or selecting the wrong tool will immediately break the task.
  3. Reliability is the True Capability: Succeeding once is easy; succeeding every time is hard.

Step 3.5 Flash’s performance—56.7% Pass³, 94.4 Robustness, and zero safety variance—indicates it is a model you can "actually rely on" for Agent workflows, rather than just a set of impressive numbers on a chart.

From an engineering perspective, you wouldn't put a model that "works when it's lucky and crashes when it's not" into a production pipeline. Pass³ measures the exact stability required for trust.


Parameter Efficiency: High Performance at Low Cost

Looking at Claw-Eval’s "Pass Rate vs. Cost" analysis, Step 3.5 Flash occupies a very low-cost bracket. This isn't accidental; it’s a result of its architectural design:

  • 196B Total Parameters, only 11B Active (Sparse MoE architecture).
  • In 128K context scenarios, inference costs are roughly 1/6th that of DeepSeek V3.2.
  • The MTP-3 (Multi-Token Prediction) heads enable generation speeds of 100–300 tok/s, peaking at 350 tok/s.

For applications requiring high-frequency Agent calls—such as automated workflows, multi-turn research tasks, or large-scale data processing—this cost advantage translates directly into significant savings. The balance between high performance and low cost is a core characteristic of Step 3.5 Flash.


Resource Links

If you are building Agent-related applications or are interested in how models perform in real-world scenarios, feel free to join the discussion in the comments or connect with the StepFun developer community (scan the QR code on our GitHub home page).

Top comments (0)