Recently, a new Agent evaluation framework called Claw-Eval has sparked significant discussion within the developer community. In its latest rankings, Step 3.5 Flash emerged as the #2 open-source model, trailing only GLM 5, while sharing the top spot for the Pass@3 metric.
What makes this leaderboard unique is that it doesn't test "knowledge breadth" or "abstract reasoning." Instead, it focuses on a more fundamental question: Can the model actually call tools, execute steps, and complete tasks reliably in a real-world environment?
Today, we’ll explore the design philosophy behind Claw-Eval and analyze why Step 3.5 Flash performed so exceptionally under this rigorous evaluation system.
Claw-Eval: Testing "Doing," Not Just "Knowing"
Developed by a joint team from Peking University and the University of Hong Kong, Claw-Eval features tasks that are entirely human-verified. Its positioning is clear: End-to-end testing of an AI Agent’s ability to complete tasks in the real world.
Traditional benchmarks (like MMLU, MATH, or HumanEval) measure whether a model "knows the answer." Claw-Eval answers a different question: Given a live operational environment, can the model successfully complete a task by calling tools and executing multi-step operations?
To achieve this, Claw-Eval built a comprehensive testing ecosystem:
- 104 Tasks: Covering real-world scenarios like calendar management, file operations, web search, code execution, financial analysis, and email processing.
- 15 Mock Enterprise Services: Creating an interactive tool-calling environment rather than just paper-based Q&A.
- Docker Sandbox Isolation: Each test runs in an independent environment to ensure no cross-interference.
- Human Verification: Every task is verified by humans rather than by an "LLM-as-a-judge", eliminating the biases inherent in automated scoring.
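Per-task isolation of this kind is typically implemented by launching a fresh, throwaway container for each run. The sketch below shows the general pattern using Python's subprocess module; the image name, entrypoint arguments, and helper names are hypothetical, not Claw-Eval's actual harness:

```python
import subprocess

def isolation_cmd(task_id: str, image: str = "claw-eval/sandbox:latest") -> list[str]:
    """Build a docker command for one isolated task run.
    --rm discards the container afterward so no state leaks between tasks;
    --network=none cuts off outside access unless the mock services provide it.
    (Image name and entrypoint are illustrative assumptions.)"""
    return ["docker", "run", "--rm", "--network=none", image, "run-task", task_id]

def run_task_isolated(task_id: str) -> bool:
    """Run one task in its own container; success is a zero exit code."""
    result = subprocess.run(isolation_cmd(task_id), capture_output=True,
                            text=True, timeout=600)
    return result.returncode == 0
```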
Pass³: Stability Through Triple Consistency
The most critical design element of Claw-Eval is its core scoring mechanism: Pass³.
While most benchmarks score a single run, Claw-Eval is far stricter: a task counts as successful only if it passes all three independent runs.
The logic is simple: One success might be luck; three consecutive successes prove capability.
The scoring formula is as follows:
task_score = safety × (0.8 × completion + 0.2 × robustness)
Threshold: pass ≥ 75
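The formula above can be sketched in Python. Two assumptions worth flagging: scores are taken to be on a 0–100 scale (matching the leaderboard columns), and safety is treated as a 0–1 multiplier (i.e. safety/100); the function names are illustrative, not Claw-Eval's actual implementation:

```python
def task_score(safety: float, completion: float, robustness: float) -> float:
    """Per-task score: safety gates a weighted mix of completion (80%)
    and robustness (20%). Inputs assumed to be on a 0-100 scale, with
    safety normalized to a 0-1 multiplier (an assumption)."""
    return (safety / 100.0) * (0.8 * completion + 0.2 * robustness)

def task_passes(score: float, threshold: float = 75.0) -> bool:
    """A task passes when its score meets the 75-point threshold."""
    return score >= threshold

# A perfectly safe run with completion 80 and robustness 70
# scores 0.8*80 + 0.2*70 = 78, which clears the 75 threshold.
print(task_passes(task_score(100.0, 80.0, 70.0)))  # True
```

Note how safety acts as a gate rather than a weighted term: any safety violation drags the whole score down multiplicatively, no matter how good the outcome was.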
The four reported metrics capture different aspects of performance:
- Pass³: The percentage of tasks passed in all three independent runs (the primary ranking metric).
- Completion: The quality of the task outcome.
- Robustness: Stability when facing edge cases or anomalous inputs.
- Safety: Security and safety during the execution process.
This mechanism essentially tests "dependable stability": the most critical hurdle an Agent must clear to move from a "prototype" to a "production-ready tool."
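The difference between Pass³ (all three runs succeed) and Pass@3 (at least one run succeeds) can be made concrete with a toy example; the task names and outcomes below are invented for illustration:

```python
# Hypothetical outcomes of three independent runs per task.
runs_by_task = {
    "calendar_invite": [True, True, True],    # stable: counts toward Pass3
    "file_cleanup":    [True, False, True],   # flaky: counts toward Pass@3 only
    "web_search":      [False, False, False], # fails outright
}

# Pass3: fraction of tasks where every run passed.
pass_cubed = sum(all(runs) for runs in runs_by_task.values()) / len(runs_by_task)
# Pass@3: fraction of tasks where at least one run passed.
pass_at_3 = sum(any(runs) for runs in runs_by_task.values()) / len(runs_by_task)

print(f"Pass3: {pass_cubed:.1%}, Pass@3: {pass_at_3:.1%}")
# Pass3: 33.3%, Pass@3: 66.7%
```

The gap between the two numbers is precisely a measure of flakiness: tasks a model can sometimes do, but cannot do reliably.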
Current Leaderboard (Open-Source, General Category)
| Rank | Model | Source | Pass³ | Pass@3 | Completion | Robustness | Safety | Avg Score |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | GLM 5 | Zhipu AI | 57.7% | 70.2% | 68.9 ±2.0 | 95.4 ±0.3 | 93.9 ±0.6 | 73.0 ±1.6 |
| 🥈 2 | Step 3.5 Flash | StepFun | 56.7% | 70.2% | 68.3 ±0.8 | 94.4 ±0.3 | 93.3 ±0.0 | 72.3 ±0.8 |
| 🥉 3 | Kimi K2.5 | Moonshot AI | 52.9% | 73.1% | 67.4 ±1.3 | 94.2 ±0.8 | 92.6 ±0.6 | 71.6 ±0.9 |
| 4 | DeepSeek V3.2 | DeepSeek | 51.0% | 71.2% | 63.9 ±0.5 | 93.1 ±0.3 | 92.0 ±0.6 | 68.4 ±0.4 |
| 5 | MiniMax M2.5 | MiniMax | 51.0% | 69.2% | 65.5 ±0.4 | 93.6 ±0.6 | 92.0 ±0.6 | 69.9 ±0.3 |
| 6 | MiMo V2 Flash | Xiaomi | 48.1% | 67.3% | 63.3 ±0.5 | 94.7 ±0.5 | 92.9 ±0.6 | 68.4 ±0.3 |
| 7 | Qwen3.5 397A17B | Alibaba | 48.1% | 67.3% | 66.4 ±2.4 | 93.8 ±0.5 | 92.0 ±0.6 | 70.7 ±2.0 |
Data Source: claw-eval.github.io, Filter: "Open-Source" + General category. Snapshot date: 2026-03-25.
Several interesting insights can be drawn from this data:
Second in Pass³, Zero Variance in Safety. Step 3.5 Flash achieved a Safety score of 93.3 ±0.0. A standard deviation of zero means its safety performance was perfectly consistent across all runs. For an Agent system being deployed into a production environment, this predictability is more valuable than peak performance.
Pass@3 Tied for First. Step 3.5 Flash and GLM 5 both hit 70.2% for Pass@3, showing they are neck-and-neck in single-run success rates. The slight difference in Pass³ (57.7% vs 56.7%) reflects a minor gap in triple-run stability rather than raw capability.
A Notable Speed Advantage. According to Claw-Eval’s "Pass Rate vs. Speed" scatter plot, Step 3.5 Flash sits in the "High Speed + High Pass Rate" quadrant. With an average task time of 50–70 seconds, it is significantly faster than other models in its class.
Why Agent-Specific Rankings Matter
Many models shine on traditional benchmarks like math or coding but stumble in real-world Agent scenarios. This is because the challenges of Agent tasks are fundamentally different:
- Multi-step Chains: If any single step fails, the entire task fails. A simple calendar invite might require searching, parsing, and then writing; a failure at any point collapses the workflow.
- High Precision for Tool Calling: Formatting errors, missing parameters, or selecting the wrong tool will immediately break the task.
- Reliability is the True Capability: Succeeding once is easy; succeeding every time is hard.
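The first point, chained failure, compounds faster than intuition suggests. A back-of-the-envelope sketch, assuming independent steps with an identical per-step success rate (the 95% figure is illustrative, not measured):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability an entire multi-step chain succeeds, assuming
    independent steps with the same per-step success rate."""
    return per_step ** steps

# A step that works 95% of the time looks reliable in isolation,
# but a 10-step chain built from it succeeds only ~60% of the time.
print(f"{chain_success(0.95, 10):.3f}")  # 0.599
```

This is why per-step tool-calling precision matters so much: small reliability gains at each step multiply into large gains at the task level.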
Step 3.5 Flash’s performance (56.7% Pass³, 94.4 Robustness, and zero safety variance) indicates it is a model you can "actually rely on" for Agent workflows, rather than just a set of impressive numbers on a chart.
From an engineering perspective, you wouldn't put a model that "works when it's lucky and crashes when it's not" into a production pipeline. Pass³ measures the exact stability required for trust.
Parameter Efficiency: High Performance at Low Cost
Looking at Claw-Eval’s "Pass Rate vs. Cost" analysis, Step 3.5 Flash occupies a very low-cost bracket. This isn't accidental; it’s a result of its architectural design:
- 196B Total Parameters, only 11B Active (Sparse MoE architecture).
- In 128K context scenarios, inference costs are roughly 1/6th that of DeepSeek V3.2.
- The MTP-3 (Multi-Token Prediction) heads enable generation speeds of 100–300 tok/s, peaking at 350 tok/s.
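The arithmetic behind the first bullet is straightforward: in a sparse MoE, only the routed experts' weights participate in each forward pass. A rough sketch, treating per-token compute as proportional to active parameters (which ignores attention, routing overhead, and memory bandwidth effects):

```python
total_params = 196e9   # total parameters in the model
active_params = 11e9   # parameters active per token (sparse MoE routing)

active_ratio = active_params / total_params
print(f"Active fraction per token: {active_ratio:.1%}")
# Active fraction per token: 5.6%
```

Under this simplification, each token touches roughly 1/18th of the model's weights, which is where the low serving cost comes from: you pay for a 196B model's capacity but (approximately) an 11B model's compute.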
For applications requiring high-frequency Agent calls—such as automated workflows, multi-turn research tasks, or large-scale data processing—this cost advantage translates directly into significant savings. The balance between high performance and low cost is a core characteristic of Step 3.5 Flash.
Resource Links
| Resource | Link |
|---|---|
| Claw-Eval Leaderboard | https://claw-eval.github.io |
| Claw-Eval GitHub | https://github.com/claw-eval/claw-eval |
| Step 3.5 Flash GitHub | https://github.com/stepfun-ai/Step-3.5-Flash |
| StepFun Open Platform (Global) | https://platform.stepfun.ai |
| StepFun Open Platform (China) | https://platform.stepfun.com |
| HuggingFace Models | https://huggingface.co/stepfun-ai/Step-3.5-Flash |
| ModelScope | https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash |
| Technical Report | https://arxiv.org/abs/2602.10604 |
If you are building Agent-related applications or are interested in how models perform in real-world scenarios, feel free to join the discussion in the comments or connect with the StepFun developer community (scan the QR code on our GitHub home page).