
Alessandro Potenza

I tested 5 AI agents 100 times each. Single-run benchmarks are lying to you.

Your agent passes on Monday, fails on Wednesday. Same prompt, same model, same code. You run it again — it works. Is it fixed? You have no idea.

I ran into this problem building agents with LangGraph. My agent scored well on manual tests, but in production it failed unpredictably. So I started running it multiple times. The results changed everything about how I think about agent evaluation.

The experiment

I took 5 agents representing common archetypes and ran each one 400 times (100 trials × 4 test cases). Not once. Not ten times. Four hundred times, with statistical confidence intervals on every metric.

Here's what I found.

Agent                   Pass Rate   95% CI            Avg Cost   Cost per Success
Reliable RAG            91.0%       [87.8%, 93.4%]    $0.014     $0.016
Expensive Multi-Model   87.5%       [83.9%, 90.4%]    $0.141     $0.161
Inconsistent            69.2%       [64.6%, 73.6%]    $0.036     $0.052
Flaky Coding            65.5%       [60.7%, 70.0%]    $0.052     $0.079
Fast-But-Wrong          45.2%       [40.4%, 50.1%]    $0.003     $0.007

Three things jumped out.

1. The confidence interval is the real number, not the point estimate.

The Flaky Coding agent scored 65.5%. But the 95% CI is [60.7%, 70.0%]. That's a 9-point range. Run a small batch of tests on a lucky day and you'd report 80%; on an unlucky one, 50%. Neither number tells you what the agent actually does.

Every benchmark you've seen reports a single number. No error bars. No confidence interval. That number is one sample from a distribution you've never seen.
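To make the error bars concrete, here's a minimal sketch of the standard Wilson score interval in Python. It's illustrative rather than agentrial's actual code, but feeding it the Flaky Coding agent's 262 passes out of 400 runs reproduces the interval from the table above.

import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

low, high = wilson_ci(262, 400)  # Flaky Coding: 262 passes in 400 runs
print(f"[{low:.1%}, {high:.1%}]")  # [60.7%, 70.0%]

Unlike the naive p ± 1.96·sqrt(p(1-p)/n) interval, the Wilson interval behaves sensibly near 0% and 100%, which matters when you only have a few dozen trials.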

2. Cost per success is the metric that matters, and nobody reports it.

The Expensive Multi-Model agent has an 87.5% pass rate — only 3.5 points below the Reliable RAG. Sounds close, right?

But it costs 10× more per successful call ($0.161 vs $0.016). That 3.5-point gap hides a 10× cost difference. In production, every 1,000 successful tasks costs about $161 with the expensive agent and about $16 with the cheaper one.

The Fast-But-Wrong agent looks incredibly cheap at $0.003/call. But at 45% success, its cost per completed task is $0.007; on average you'd need more than two runs to get one success. The real cost is the wasted compute on the 55% that fail.
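The cost-per-success column is easy to reproduce: it's average cost per call divided by pass rate, since you pay for every call but only the passing fraction delivers a result. A quick illustrative check in Python (not agentrial's code), with the numbers copied from the table above:

# (pass rate, avg cost per call) taken from the table above
agents = {
    "Expensive Multi-Model": (0.875, 0.141),
    "Flaky Coding":          (0.655, 0.052),
    "Fast-But-Wrong":        (0.452, 0.003),
}

for name, (pass_rate, avg_cost) in agents.items():
    print(f"{name}: ${avg_cost / pass_rate:.3f} per success")
# Expensive Multi-Model: $0.161 per success
# Flaky Coding: $0.079 per success
# Fast-But-Wrong: $0.007 per success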

3. Failure attribution tells you WHERE to fix, not just that something broke.

This was the most useful part. Every agent had a characteristic failure pattern:

  • Reliable RAG: 100% of failures in the retrieve step — it's a retrieval quality problem, not a reasoning problem
  • Flaky Coding: 71% in execute, 29% in plan — mostly runtime errors, some bad plans
  • Expensive Multi-Model: 100% in validate — the expensive final check is the weak link
  • Fast-But-Wrong: 100% in respond — no verification step means garbage output
  • Inconsistent: 100% in reason — bimodal: either works perfectly or crashes entirely

When your agent fails, knowing which step fails changes what you fix. Without this, you're guessing.
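To make the attribution idea concrete: one plausible way to do it is, for each step, to test whether trials in which that step raised an error fail more often than trials in which it didn't, using Fisher's exact test, and then apply a Benjamini-Hochberg correction because several steps are tested at once (the same tests agentrial's feature list mentions below). This is a sketch with made-up counts loosely modeled on the Flaky Coding agent, not agentrial's actual implementation.

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical counts for a Flaky-Coding-style agent (262 passes, 138 failures).
# Each 2x2 table: [[step errored & trial failed, step clean & trial failed],
#                  [step errored & trial passed, step clean & trial passed]]
counts = {
    "plan":    [[40,  98], [3, 259]],
    "execute": [[98,  40], [5, 257]],
    "respond": [[ 2, 136], [4, 258]],
}

steps = list(counts)
pvals = [fisher_exact(counts[s])[1] for s in steps]          # raw per-step p-values
reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")  # Benjamini-Hochberg

for step, p, flagged in zip(steps, p_adj, reject):
    print(f"{step:8s} adjusted p = {p:.2g}" + ("  <- failure driver" if flagged else ""))

The point is the shape of the output: a per-step verdict you can act on, not a single pass/fail number.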

The tool

I built agentrial to do this automatically. It's a CLI that runs your agent N times and gives you statistics instead of anecdotes.

pip install agentrial
agentrial init
agentrial run --trials 100

What it does:

  • Runs your agent N times (default 10, configurable)
  • Reports pass rate with Wilson score confidence intervals
  • Tracks cost and latency per run with bootstrap CIs (see the sketch right after this list)
  • Attributes failures to specific steps using Fisher's exact test with Benjamini-Hochberg correction
  • Works with LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, and smolagents
  • Plugs into CI/CD — block PRs when reliability drops below a threshold
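On the bootstrap CIs for cost and latency: a percentile bootstrap is the standard way to put an interval around a mean without assuming a distribution. A minimal sketch with made-up per-run costs (illustrative only, not agentrial's implementation):

import numpy as np

rng = np.random.default_rng(0)
per_run_costs = rng.lognormal(mean=-4.3, sigma=0.4, size=100)  # stand-in for 100 logged trial costs

# Resample the observed costs with replacement and recompute the mean 10,000 times.
boot_means = np.array([
    rng.choice(per_run_costs, size=per_run_costs.size, replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean cost ${per_run_costs.mean():.4f}/run, 95% bootstrap CI [${low:.4f}, ${high:.4f}]")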

It's open-source (MIT), runs locally, and your data never leaves your machine.

Why this matters

The AI agent ecosystem has a measurement problem. Models score 80%+ on SWE-bench, but only 10% of enterprises successfully deploy agents in production. The gap isn't capability — it's reliability, and we don't have the tools to measure reliability properly.

If you're deploying agents and you're not running multi-trial evaluations, you're flying blind. A single test run is an anecdote. A hundred runs with confidence intervals is data.

GitHub repo — stars appreciated if this is useful.
