Recently, a new Agent evaluation framework called Claw-Eval has sparked significant discussion within the developer community. In its latest rankings, Step 3.5 Flash emerged as the #2 open-source model, trailing only GLM 5, while sharing the top spot for the Pass@3 metric.
What makes this leaderboard unique is that it doesn't test "knowledge breadth" or "abstract reasoning." Instead, it focuses on a more fundamental question: Can the model actually call tools, execute steps, and complete tasks reliably in a real-world environment?
Today, we’ll explore the design philosophy behind Claw-Eval and analyze why Step 3.5 Flash performed so exceptionally under this rigorous evaluation system.
Claw-Eval: Testing "Doing," Not Just "Knowing"
Developed by a joint team from Peking University and the University of Hong Kong, Claw-Eval features tasks that are entirely human-verified. Its positioning is clear: End-to-end testing of an AI Agent’s ability to complete tasks in the real world.
Traditional benchmarks (like MMLU, MATH, or HumanEval) measure whether a model "knows the answer." Claw-Eval answers a different question: Given a live operational environment, can the model successfully complete a task by calling tools and executing multi-step operations?
To achieve this, Claw-Eval built a comprehensive testing ecosystem:
- 104 Tasks: Covering real-world scenarios like calendar management, file operations, web search, code execution, financial analysis, and email processing.
- 15 Mock Enterprise Services: Creating an interactive tool-calling environment rather than just paper-based Q&A.
- Docker Sandbox Isolation: Each test runs in an independent environment to ensure no cross-interference.
- Human Verification: Every task is verified by humans rather than by an "LLM-as-a-judge", eliminating the biases inherent in automated scoring.
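Per-task isolation of this kind is typically implemented by launching a fresh, throwaway container for each run. The sketch below shows the general pattern using Python's subprocess module; the image name, entrypoint arguments, and helper names are hypothetical, not Claw-Eval's actual harness:

```python
import subprocess

def isolation_cmd(task_id: str, image: str = "claw-eval/sandbox:latest") -> list[str]:
    """Build a docker command for one isolated task run.
    --rm discards the container afterward so no state leaks between tasks;
    --network=none cuts off outside access unless the mock services provide it.
    (Image name and entrypoint are illustrative assumptions.)"""
    return ["docker", "run", "--rm", "--network=none", image, "run-task", task_id]

def run_task_isolated(task_id: str) -> bool:
    """Run one task in its own container; success is a zero exit code."""
    result = subprocess.run(isolation_cmd(task_id), capture_output=True,
                            text=True, timeout=600)
    return result.returncode == 0
```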
Pass³: Stability Through Triple Consistency
The most critical design element of Claw-Eval is its core scoring mechanism: Pass³.
While most benchmarks score a single run, Claw-Eval is far stricter: a task counts as successful only if it passes all three independent runs.
The logic is simple: One success might be luck; three consecutive successes prove capability.
The scoring formula is as follows:
task_score = safety × (0.8 × completion + 0.2 × robustness)
Threshold: pass ≥ 75
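The formula above can be sketched in Python. Two assumptions worth flagging: scores are taken to be on a 0–100 scale (matching the leaderboard columns), and safety is treated as a 0–1 multiplier (i.e. safety/100); the function names are illustrative, not Claw-Eval's actual implementation:

```python
def task_score(safety: float, completion: float, robustness: float) -> float:
    """Per-task score: safety gates a weighted mix of completion (80%)
    and robustness (20%). Inputs assumed to be on a 0-100 scale, with
    safety normalized to a 0-1 multiplier (an assumption)."""
    return (safety / 100.0) * (0.8 * completion + 0.2 * robustness)

def task_passes(score: float, threshold: float = 75.0) -> bool:
    """A task passes when its score meets the 75-point threshold."""
    return score >= threshold

# A perfectly safe run with completion 80 and robustness 70
# scores 0.8*80 + 0.2*70 = 78, which clears the 75 threshold.
print(task_passes(task_score(100.0, 80.0, 70.0)))  # True
```

Note how safety acts as a gate rather than a weighted term: any safety violation drags the whole score down multiplicatively, no matter how good the outcome was.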
The four reported metrics capture different aspects of performance:
- Pass³: The percentage of tasks passed in all three independent runs (the primary ranking metric).
- Completion: The quality of the task outcome.
- Robustness: Stability when facing edge cases or anomalous inputs.
- Safety: Security and safety during the execution process.
This mechanism essentially tests "dependable stability": the most critical hurdle an Agent must clear to move from a "prototype" to a "production-ready tool."
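The difference between Pass³ (all three runs succeed) and Pass@3 (at least one run succeeds) can be made concrete with a toy example; the task names and outcomes below are invented for illustration:

```python
# Hypothetical outcomes of three independent runs per task.
runs_by_task = {
    "calendar_invite": [True, True, True],    # stable: counts toward Pass3
    "file_cleanup":    [True, False, True],   # flaky: counts toward Pass@3 only
    "web_search":      [False, False, False], # fails outright
}

# Pass3: fraction of tasks where every run passed.
pass_cubed = sum(all(runs) for runs in runs_by_task.values()) / len(runs_by_task)
# Pass@3: fraction of tasks where at least one run passed.
pass_at_3 = sum(any(runs) for runs in runs_by_task.values()) / len(runs_by_task)

print(f"Pass3: {pass_cubed:.1%}, Pass@3: {pass_at_3:.1%}")
# Pass3: 33.3%, Pass@3: 66.7%
```

The gap between the two numbers is precisely a measure of flakiness: tasks a model can sometimes do, but cannot do reliably.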
Current Leaderboard (Open-Source, General Category)
| Rank | Model | Source | Pass³ | Pass@3 | Completion | Robustness | Safety | Avg Score |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | GLM 5 | Zhipu AI | 57.7% | 70.2% | 68.9 ±2.0 | 95.4 ±0.3 | 93.9 ±0.6 | 73.0 ±1.6 |
| 🥈 2 | Step 3.5 Flash | StepFun | 56.7% | 70.2% | 68.3 ±0.8 | 94.4 ±0.3 | 93.3 ±0.0 | 72.3 ±0.8 |
| 🥉 3 | Kimi K2.5 | Moonshot AI | 52.9% | 73.1% | 67.4 ±1.3 | 94.2 ±0.8 | 92.6 ±0.6 | 71.6 ±0.9 |
| 4 | DeepSeek V3.2 | DeepSeek | 51.0% | 71.2% | 63.9 ±0.5 | 93.1 ±0.3 | 92.0 ±0.6 | 68.4 ±0.4 |
| 5 | MiniMax M2.5 | MiniMax | 51.0% | 69.2% | 65.5 ±0.4 | 93.6 ±0.6 | 92.0 ±0.6 | 69.9 ±0.3 |
| 6 | MiMo V2 Flash | Xiaomi | 48.1% | 67.3% | 63.3 ±0.5 | 94.7 ±0.5 | 92.9 ±0.6 | 68.4 ±0.3 |
| 7 | Qwen3.5 397A17B | Alibaba | 48.1% | 67.3% | 66.4 ±2.4 | 93.8 ±0.5 | 92.0 ±0.6 | 70.7 ±2.0 |
Data Source: claw-eval.github.io, Filter: "Open-Source" + General category. Snapshot date: 2026-03-25.
Several interesting insights can be drawn from this data:
Second in Pass³, Zero Variance in Safety. Step 3.5 Flash achieved a Safety score of 93.3 ±0.0. A standard deviation of zero means its safety performance was perfectly consistent across all runs. For an Agent system being deployed into a production environment, this predictability is more valuable than peak performance.
Pass@3 Tied for First. Step 3.5 Flash and GLM 5 both hit 70.2% for Pass@3, showing they are neck-and-neck in single-run success rates. The slight difference in Pass³ (57.7% vs 56.7%) reflects a minor gap in triple-run stability rather than raw capability.
A Notable Speed Advantage. According to Claw-Eval’s "Pass Rate vs. Speed" scatter plot, Step 3.5 Flash sits in the "High Speed + High Pass Rate" quadrant. With an average task time of 50–70 seconds, it is significantly faster than other models in its class.
Why Agent-Specific Rankings Matter
Many models shine on traditional benchmarks like math or coding but stumble in real-world Agent scenarios. This is because the challenges of Agent tasks are fundamentally different:
- Multi-step Chains: If any single step fails, the entire task fails. A simple calendar invite might require searching, parsing, and then writing; a failure at any point collapses the workflow.
- High Precision for Tool Calling: Formatting errors, missing parameters, or selecting the wrong tool will immediately break the task.
- Reliability is the True Capability: Succeeding once is easy; succeeding every time is hard.
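The first point, chained failure, compounds faster than intuition suggests. A back-of-the-envelope sketch, assuming independent steps with an identical per-step success rate (the 95% figure is illustrative, not measured):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability an entire multi-step chain succeeds, assuming
    independent steps with the same per-step success rate."""
    return per_step ** steps

# A step that works 95% of the time looks reliable in isolation,
# but a 10-step chain built from it succeeds only ~60% of the time.
print(f"{chain_success(0.95, 10):.3f}")  # 0.599
```

This is why per-step tool-calling precision matters so much: small reliability gains at each step multiply into large gains at the task level.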
Step 3.5 Flash’s performance (56.7% Pass³, 94.4 Robustness, and zero safety variance) indicates it is a model you can "actually rely on" for Agent workflows, rather than just a set of impressive numbers on a chart.
From an engineering perspective, you wouldn't put a model that "works when it's lucky and crashes when it's not" into a production pipeline. Pass³ measures the exact stability required for trust.
Parameter Efficiency: High Performance at Low Cost
Looking at Claw-Eval’s "Pass Rate vs. Cost" analysis, Step 3.5 Flash occupies a very low-cost bracket. This isn't accidental; it’s a result of its architectural design:
- 196B Total Parameters, only 11B Active (Sparse MoE architecture).
- In 128K context scenarios, inference costs are roughly 1/6th that of DeepSeek V3.2.
- The MTP-3 (Multi-Token Prediction) heads enable generation speeds of 100–300 tok/s, peaking at 350 tok/s.
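The arithmetic behind the first bullet is straightforward: in a sparse MoE, only the routed experts' weights participate in each forward pass. A rough sketch, treating per-token compute as proportional to active parameters (which ignores attention, routing overhead, and memory bandwidth effects):

```python
total_params = 196e9   # total parameters in the model
active_params = 11e9   # parameters active per token (sparse MoE routing)

active_ratio = active_params / total_params
print(f"Active fraction per token: {active_ratio:.1%}")
# Active fraction per token: 5.6%
```

Under this simplification, each token touches roughly 1/18th of the model's weights, which is where the low serving cost comes from: you pay for a 196B model's capacity but (approximately) an 11B model's compute.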
For applications requiring high-frequency Agent calls—such as automated workflows, multi-turn research tasks, or large-scale data processing—this cost advantage translates directly into significant savings. The balance between high performance and low cost is a core characteristic of Step 3.5 Flash.
Resource Links
| Resource | Link |
|---|---|
| Claw-Eval Leaderboard | https://claw-eval.github.io |
| Claw-Eval GitHub | https://github.com/claw-eval/claw-eval |
| Step 3.5 Flash GitHub | https://github.com/stepfun-ai/Step-3.5-Flash |
| StepFun Open Platform (Global) | https://platform.stepfun.ai |
| StepFun Open Platform (China) | https://platform.stepfun.com |
| HuggingFace Models | https://huggingface.co/stepfun-ai/Step-3.5-Flash |
| ModelScope | https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash |
| Technical Report | https://arxiv.org/abs/2602.10604 |
If you are building Agent-related applications or are interested in how models perform in real-world scenarios, feel free to join the discussion in the comments or connect with the StepFun developer community (scan the QR code on our GitHub home page).