Your agent passes on Monday, fails on Wednesday. Same prompt, same model, same code. You run it again — it works. Is it fixed? You have no idea.
I ran into this problem building agents with LangGraph. My agent scored well on manual tests, but in production it failed unpredictably. So I started running it multiple times. The results changed everything about how I think about agent evaluation.
The experiment
I took 5 agents representing common archetypes and ran each one 400 times (100 trials × 4 test cases). Not once. Not ten times. Four hundred times, with statistical confidence intervals on every metric.
Here's what I found.
| Agent | Pass Rate | 95% CI | Avg Cost | Cost per Success |
|---|---|---|---|---|
| Reliable RAG | 91.0% | [87.8%, 93.4%] | $0.014 | $0.016 |
| Expensive Multi-Model | 87.5% | [83.9%, 90.4%] | $0.141 | $0.161 |
| Inconsistent | 69.2% | [64.6%, 73.6%] | $0.036 | $0.052 |
| Flaky Coding | 65.5% | [60.7%, 70.0%] | $0.052 | $0.079 |
| Fast-But-Wrong | 45.2% | [40.4%, 50.1%] | $0.003 | $0.007 |
Three things jumped out.
1. The confidence interval is the real number, not the point estimate.
The Flaky Coding agent scored 65.5%. But the 95% CI is [60.7%, 70.0%]. That's a 9-point range. If you tested it once and got lucky, you'd report 80%. If you got unlucky, 50%. Neither tells you what the agent actually does.
Every benchmark you've seen reports a single number. No error bars. No confidence interval. That number is one sample from a distribution you've never seen.
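If you want to sanity-check those intervals yourself, the Wilson score interval (the one agentrial reports, described below) is a few lines of Python. The 262-out-of-400 figure here is just the Flaky Coding pass rate re-expressed as counts; it reproduces the [60.7%, 70.0%] interval from the table.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Flaky Coding: 262 passes out of 400 trials (65.5%)
low, high = wilson_interval(262, 400)
print(f"[{low:.1%}, {high:.1%}]")  # [60.7%, 70.0%]
```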
2. Cost per success is the metric that matters, and nobody reports it.
The Expensive Multi-Model agent has an 87.5% pass rate — only 3.5 points below the Reliable RAG. Sounds close, right?
But it costs 10× more per successful call ($0.161 vs $0.016). That 3.5-point gap hides a 10× cost difference. In production at scale, you'd burn $56 to do what the cheaper agent does for $6.
The Fast-But-Wrong agent looks incredibly cheap at $0.003/call. But at 45% success, its cost per completed task is $0.007, and on average you need more than two runs to get a single success. The real cost is the wasted compute on the 55% of calls that fail.
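The arithmetic behind that column is simple: you pay for every call, but only the passing fraction produces something usable, so cost per success is average cost per call divided by pass rate. A quick check against the table's point estimates (small differences come from rounding in the cost column):

```python
# (avg cost per call in $, pass rate) taken from the table above
agents = {
    "Reliable RAG":          (0.014, 0.910),
    "Expensive Multi-Model": (0.141, 0.875),
    "Fast-But-Wrong":        (0.003, 0.452),
}

for name, (avg_cost, pass_rate) in agents.items():
    # every call is billed, but only pass_rate of them count as a success
    print(f"{name}: ${avg_cost / pass_rate:.3f} per success")
```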
3. Failure attribution tells you WHERE to fix, not just that something broke.
This was the most useful part. Every agent had a characteristic failure pattern:
- Reliable RAG: 100% of failures in the `retrieve` step. It's a retrieval quality problem, not a reasoning problem.
- Flaky Coding: 71% in `execute`, 29% in `plan`. Mostly runtime errors, some bad plans.
- Expensive Multi-Model: 100% in `validate`. The expensive final check is the weak link.
- Fast-But-Wrong: 100% in `respond`. No verification step means garbage output.
- Inconsistent: 100% in `reason`. Bimodal: either works perfectly or crashes entirely.
When your agent fails, knowing which step fails changes what you fix. Without this, you're guessing.
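You can get this kind of attribution from raw trial logs with a per-step contingency table: for each step, compare how often it errored in failed trials versus passing trials, test the difference with Fisher's exact test, and correct for testing several steps at once (agentrial, described below, uses a Benjamini-Hochberg correction for this). Here's a rough sketch; the trial-record shape is hypothetical and the numbers are placeholders, not data from the experiment.

```python
from scipy.stats import fisher_exact

# Hypothetical per-trial records: (trial_passed, steps_that_errored)
trials = [
    (True,  set()),
    (False, {"execute"}),
    (False, {"plan", "execute"}),
    (True,  {"execute"}),  # a step can error and still recover
    # ... one record per trial
]
steps = ["plan", "execute", "respond"]

n_fail = sum(1 for passed, _ in trials if not passed)
n_pass = len(trials) - n_fail

# One one-sided Fisher's exact test per step: do errors in this step
# show up disproportionately in failed trials?
pvals = {}
for step in steps:
    fail_with = sum(1 for passed, errs in trials if not passed and step in errs)
    pass_with = sum(1 for passed, errs in trials if passed and step in errs)
    table = [[fail_with, n_fail - fail_with],
             [pass_with, n_pass - pass_with]]
    _, p_value = fisher_exact(table, alternative="greater")
    pvals[step] = p_value

# Benjamini-Hochberg: keep the false discovery rate across steps at 5%
ranked = sorted(pvals.items(), key=lambda kv: kv[1])
m = len(ranked)
cutoff = 0
for i, (_, p) in enumerate(ranked, start=1):
    if p <= 0.05 * i / m:
        cutoff = i
implicated_steps = [step for step, _ in ranked[:cutoff]]
print(implicated_steps)
```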
The tool
I built agentrial to do this automatically. It's a CLI that runs your agent N times and gives you statistics instead of anecdotes.
```bash
pip install agentrial
agentrial init
agentrial run --trials 100
```
What it does:
- Runs your agent N times (default 10, configurable)
- Reports pass rate with Wilson score confidence intervals
- Tracks cost and latency per run with bootstrap CIs (see the sketch after this list)
- Attributes failures to specific steps using Fisher's exact test with Benjamini-Hochberg correction
- Works with LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, and smolagents
- Plugs into CI/CD — block PRs when reliability drops below a threshold
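A percentile bootstrap is the standard way to put an interval on a mean cost or latency without assuming a distribution: resample the per-trial values with replacement, recompute the mean each time, and take the 2.5th and 97.5th percentiles. A minimal sketch of the general idea (not agentrial's internals); the cost list is placeholder data:

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap 95% CI for the mean of per-trial values."""
    means = sorted(
        statistics.fmean(random.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Per-trial costs in dollars, one value per run (placeholder numbers)
costs = [0.012, 0.015, 0.013, 0.018, 0.011, 0.016, 0.014, 0.013]
print(bootstrap_mean_ci(costs))
```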
It's open-source (MIT), runs locally, and your data never leaves your machine.
Why this matters
The AI agent ecosystem has a measurement problem. Models score 80%+ on SWE-bench, but only 10% of enterprises successfully deploy agents in production. The gap isn't capability — it's reliability, and we don't have the tools to measure reliability properly.
If you're deploying agents and you're not running multi-trial evaluations, you're flying blind. A single test run is an anecdote. A hundred runs with confidence intervals is data.
GitHub repo — stars appreciated if this is useful.