
Alessandro Potenza

I tested 5 AI agents 100 times each. Single-run benchmarks are lying to you.

Your agent passes on Monday, fails on Wednesday. Same prompt, same model, same code. You run it again — it works. Is it fixed? You have no idea.

I ran into this problem building agents with LangGraph. My agent scored well on manual tests, but in production it failed unpredictably. So I started running it multiple times. The results changed everything about how I think about agent evaluation.

The experiment

I took 5 agents representing common archetypes and ran each one 400 times (100 trials × 4 test cases). Not once. Not ten times. Four hundred times, with statistical confidence intervals on every metric.

Here's what I found.

Agent                   Pass Rate   95% CI            Avg Cost   Cost per Success
Reliable RAG            91.0%       [87.8%, 93.4%]    $0.014     $0.016
Expensive Multi-Model   87.5%       [83.9%, 90.4%]    $0.141     $0.161
Inconsistent            69.2%       [64.6%, 73.6%]    $0.036     $0.052
Flaky Coding            65.5%       [60.7%, 70.0%]    $0.052     $0.079
Fast-But-Wrong          45.2%       [40.4%, 50.1%]    $0.003     $0.007

Three things jumped out.

1. The confidence interval is the real number, not the point estimate.

The Flaky Coding agent scored 65.5%. But the 95% CI is [60.7%, 70.0%]. That's a 9-point range. Run a small batch of tests on a lucky day and you'd report 80%; on an unlucky one, 50%. Neither number tells you what the agent actually does.

Every benchmark you've seen reports a single number. No error bars. No confidence interval. That number is one sample from a distribution you've never seen.
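To make the error bars concrete, here's a minimal sketch of the standard Wilson score interval in Python. It's illustrative rather than agentrial's actual code, but feeding it the Flaky Coding agent's 262 passes out of 400 runs reproduces the interval from the table above.

import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

low, high = wilson_ci(262, 400)  # Flaky Coding: 262 passes in 400 runs
print(f"[{low:.1%}, {high:.1%}]")  # [60.7%, 70.0%]

Unlike the naive p ± 1.96·sqrt(p(1-p)/n) interval, the Wilson interval behaves sensibly near 0% and 100%, which matters when you only have a few dozen trials.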

2. Cost per success is the metric that matters, and nobody reports it.

The Expensive Multi-Model agent has an 87.5% pass rate — only 3.5 points below the Reliable RAG. Sounds close, right?

But it costs 10× more per successful call ($0.161 vs $0.016). That 3.5-point gap hides a 10× cost difference. In production, every 1,000 successful tasks costs about $161 with the expensive agent and about $16 with the cheaper one.

The Fast-But-Wrong agent looks incredibly cheap at $0.003/call. But at 45% success, its cost per completed task is $0.007; on average you'd need more than two runs to get one success. The real cost is the wasted compute on the 55% that fail.
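The cost-per-success column is easy to reproduce: it's average cost per call divided by pass rate, since you pay for every call but only the passing fraction delivers a result. A quick illustrative check in Python (not agentrial's code), with the numbers copied from the table above:

# (pass rate, avg cost per call) taken from the table above
agents = {
    "Expensive Multi-Model": (0.875, 0.141),
    "Flaky Coding":          (0.655, 0.052),
    "Fast-But-Wrong":        (0.452, 0.003),
}

for name, (pass_rate, avg_cost) in agents.items():
    print(f"{name}: ${avg_cost / pass_rate:.3f} per success")
# Expensive Multi-Model: $0.161 per success
# Flaky Coding: $0.079 per success
# Fast-But-Wrong: $0.007 per success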

3. Failure attribution tells you WHERE to fix, not just that something broke.

This was the most useful part. Every agent had a characteristic failure pattern:

  • Reliable RAG: 100% of failures in the retrieve step — it's a retrieval quality problem, not a reasoning problem
  • Flaky Coding: 71% in execute, 29% in plan — mostly runtime errors, some bad plans
  • Expensive Multi-Model: 100% in validate — the expensive final check is the weak link
  • Fast-But-Wrong: 100% in respond — no verification step means garbage output
  • Inconsistent: 100% in reason — bimodal: either works perfectly or crashes entirely

When your agent fails, knowing which step fails changes what you fix. Without this, you're guessing.
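To make the attribution idea concrete: one plausible way to do it is, for each step, to test whether trials in which that step raised an error fail more often than trials in which it didn't, using Fisher's exact test, and then apply a Benjamini-Hochberg correction because several steps are tested at once (the same tests agentrial's feature list mentions below). This is a sketch with made-up counts loosely modeled on the Flaky Coding agent, not agentrial's actual implementation.

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical counts for a Flaky-Coding-style agent (262 passes, 138 failures).
# Each 2x2 table: [[step errored & trial failed, step clean & trial failed],
#                  [step errored & trial passed, step clean & trial passed]]
counts = {
    "plan":    [[40,  98], [3, 259]],
    "execute": [[98,  40], [5, 257]],
    "respond": [[ 2, 136], [4, 258]],
}

steps = list(counts)
pvals = [fisher_exact(counts[s])[1] for s in steps]          # raw per-step p-values
reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")  # Benjamini-Hochberg

for step, p, flagged in zip(steps, p_adj, reject):
    print(f"{step:8s} adjusted p = {p:.2g}" + ("  <- failure driver" if flagged else ""))

The point is the shape of the output: a per-step verdict you can act on, not a single pass/fail number.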

The tool

I built agentrial to do this automatically. It's a CLI that runs your agent N times and gives you statistics instead of anecdotes.

pip install agentrial
agentrial init
agentrial run --trials 100

What it does:

  • Runs your agent N times (default 10, configurable)
  • Reports pass rate with Wilson score confidence intervals
  • Tracks cost and latency per run with bootstrap CIs (see the sketch right after this list)
  • Attributes failures to specific steps using Fisher's exact test with Benjamini-Hochberg correction
  • Works with LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, and smolagents
  • Plugs into CI/CD — block PRs when reliability drops below a threshold
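On the bootstrap CIs for cost and latency: a percentile bootstrap is the standard way to put an interval around a mean without assuming a distribution. A minimal sketch with made-up per-run costs (illustrative only, not agentrial's implementation):

import numpy as np

rng = np.random.default_rng(0)
per_run_costs = rng.lognormal(mean=-4.3, sigma=0.4, size=100)  # stand-in for 100 logged trial costs

# Resample the observed costs with replacement and recompute the mean 10,000 times.
boot_means = np.array([
    rng.choice(per_run_costs, size=per_run_costs.size, replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean cost ${per_run_costs.mean():.4f}/run, 95% bootstrap CI [${low:.4f}, ${high:.4f}]")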

It's open-source (MIT), runs locally, and your data never leaves your machine.

Why this matters

The AI agent ecosystem has a measurement problem. Models score 80%+ on SWE-bench, but only 10% of enterprises successfully deploy agents in production. The gap isn't capability — it's reliability, and we don't have the tools to measure reliability properly.

If you're deploying agents and you're not running multi-trial evaluations, you're flying blind. A single test run is an anecdote. A hundred runs with confidence intervals is data.

GitHub repo — stars appreciated if this is useful.
