<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alessandro Potenza</title>
    <description>The latest articles on DEV Community by Alessandro Potenza (@alepot55).</description>
    <link>https://dev.to/alepot55</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3755483%2F1c8476a9-b40d-473f-baa4-7890b7bd8978.png</url>
      <title>DEV Community: Alessandro Potenza</title>
      <link>https://dev.to/alepot55</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alepot55"/>
    <language>en</language>
    <item>
      <title>I tested 5 AI agents 100 times each. Single-run benchmarks are lying to you.</title>
      <dc:creator>Alessandro Potenza</dc:creator>
      <pubDate>Fri, 06 Feb 2026 12:16:51 +0000</pubDate>
      <link>https://dev.to/alepot55/i-tested-5-ai-agents-100-times-each-single-run-benchmarks-are-lying-to-you-4e7p</link>
      <guid>https://dev.to/alepot55/i-tested-5-ai-agents-100-times-each-single-run-benchmarks-are-lying-to-you-4e7p</guid>
      <description>&lt;p&gt;Your agent passes on Monday, fails on Wednesday. Same prompt, same model, same code. You run it again — it works. Is it fixed? You have no idea.&lt;/p&gt;

&lt;p&gt;I ran into this problem building agents with LangGraph. My agent scored well on manual tests, but in production it failed unpredictably. So I started running it multiple times. The results changed everything about how I think about agent evaluation.&lt;/p&gt;

&lt;h2&gt;The experiment&lt;/h2&gt;

&lt;p&gt;I took 5 agents representing common archetypes and ran each one 400 times (100 trials × 4 test cases). Not once. Not ten times. Four hundred times, with statistical confidence intervals on every metric.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;95% CI&lt;/th&gt;
&lt;th&gt;Avg Cost&lt;/th&gt;
&lt;th&gt;Cost per Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliable RAG&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;td&gt;[87.8%, 93.4%]&lt;/td&gt;
&lt;td&gt;$0.014&lt;/td&gt;
&lt;td&gt;$0.016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expensive Multi-Model&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;td&gt;[83.9%, 90.4%]&lt;/td&gt;
&lt;td&gt;$0.141&lt;/td&gt;
&lt;td&gt;$0.161&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;[64.6%, 73.6%]&lt;/td&gt;
&lt;td&gt;$0.036&lt;/td&gt;
&lt;td&gt;$0.052&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flaky Coding&lt;/td&gt;
&lt;td&gt;65.5%&lt;/td&gt;
&lt;td&gt;[60.7%, 70.0%]&lt;/td&gt;
&lt;td&gt;$0.052&lt;/td&gt;
&lt;td&gt;$0.079&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast-But-Wrong&lt;/td&gt;
&lt;td&gt;45.2%&lt;/td&gt;
&lt;td&gt;[40.4%, 50.1%]&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.007&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things jumped out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The confidence interval is the real number, not the point estimate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Flaky Coding agent scored 65.5%. But the 95% CI is [60.7%, 70.0%]. That's a 9-point range even after 400 runs. If you'd tested it a handful of times and gotten lucky, you'd report 80%. Unlucky, 50%. Neither tells you what the agent actually does.&lt;/p&gt;

&lt;p&gt;Every benchmark you've seen reports a single number. No error bars. No confidence interval. That number is one sample from a distribution you've never seen.&lt;/p&gt;
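
&lt;p&gt;You can watch this play out in a few lines of Python. A toy simulation (my numbers, taken from the table above): give 1,000 people the same 4-case suite and let each run it once against an agent whose true pass rate is exactly 65.5%.&lt;/p&gt;

```python
import random

random.seed(0)

TRUE_RATE = 0.655   # Flaky Coding agent's long-run pass rate, from the table
CASES = 4           # test cases in one benchmark run

# 1,000 people each run the 4-case suite exactly once and report a score.
scores = []
for _ in range(1000):
    passes = sum(1 for _ in range(CASES) if TRUE_RATE > random.random())
    scores.append(passes / CASES)

# Single-run "benchmarks" of the same agent land all over the place.
print(f"reported scores range from {min(scores):.0%} to {max(scores):.0%}")
```

&lt;p&gt;With only four cases per run, single-run scores scatter wildly around 65.5% even though the agent never changed.&lt;/p&gt;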

&lt;p&gt;&lt;strong&gt;2. Cost per success is the metric that matters, and nobody reports it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Expensive Multi-Model agent has an 87.5% pass rate — only 3.5 points below the Reliable RAG. Sounds close, right?&lt;/p&gt;

&lt;p&gt;But it costs 10× more per successful call ($0.161 vs $0.016). That 3.5-point gap hides a 10× cost difference. In production, every 1,000 successful calls would cost you roughly $161 instead of $16.&lt;/p&gt;

&lt;p&gt;The Fast-But-Wrong agent looks incredibly cheap at $0.003/call. But at 45% success, its cost per &lt;em&gt;completed task&lt;/em&gt; is $0.007 — and on average you need more than two runs to get one success. The real cost is the wasted compute on the 55% that fail.&lt;/p&gt;
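
&lt;p&gt;The arithmetic behind that column is one line; here it is as a sanity check against the table (plain Python, figures copied from above):&lt;/p&gt;

```python
def cost_per_success(avg_cost_per_run, pass_rate):
    # Failed runs still burn tokens, so divide by the success rate.
    return avg_cost_per_run / pass_rate

# Figures from the table above
print(round(cost_per_success(0.141, 0.875), 3))  # 0.161 (Expensive Multi-Model)
print(round(cost_per_success(0.003, 0.452), 3))  # 0.007 (Fast-But-Wrong)
```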

&lt;p&gt;&lt;strong&gt;3. Failure attribution tells you WHERE to fix, not just that something broke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most useful part. Every agent had a characteristic failure pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable RAG: 100% of failures in the &lt;code&gt;retrieve&lt;/code&gt; step — it's a retrieval quality problem, not a reasoning problem&lt;/li&gt;
&lt;li&gt;Flaky Coding: 71% in &lt;code&gt;execute&lt;/code&gt;, 29% in &lt;code&gt;plan&lt;/code&gt; — mostly runtime errors, some bad plans&lt;/li&gt;
&lt;li&gt;Expensive Multi-Model: 100% in &lt;code&gt;validate&lt;/code&gt; — the expensive final check is the weak link&lt;/li&gt;
&lt;li&gt;Fast-But-Wrong: 100% in &lt;code&gt;respond&lt;/code&gt; — no verification step means garbage output&lt;/li&gt;
&lt;li&gt;Inconsistent: 100% in &lt;code&gt;reason&lt;/code&gt; — bimodal: either works perfectly or crashes entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your agent fails, knowing &lt;em&gt;which step&lt;/em&gt; fails changes what you fix. Without this, you're guessing.&lt;/p&gt;

&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;agentrial&lt;/a&gt; to do this automatically. It's a CLI that runs your agent N times and gives you statistics instead of anecdotes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentrial
agentrial init
agentrial run &lt;span class="nt"&gt;--trials&lt;/span&gt; 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs your agent N times (default 10, configurable)&lt;/li&gt;
&lt;li&gt;Reports pass rate with Wilson score confidence intervals&lt;/li&gt;
&lt;li&gt;Tracks cost and latency per run with bootstrap CIs&lt;/li&gt;
&lt;li&gt;Attributes failures to specific steps using Fisher's exact test with Benjamini-Hochberg correction&lt;/li&gt;
&lt;li&gt;Works with LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, and smolagents&lt;/li&gt;
&lt;li&gt;Plugs into CI/CD — block PRs when reliability drops below a threshold&lt;/li&gt;
&lt;/ul&gt;
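
&lt;p&gt;For the curious, step-level attribution of this kind boils down to one contingency test per step plus a multiple-comparison correction. Below is a minimal stdlib-only sketch of the idea (my own simplification using a one-sided Fisher test via the hypergeometric tail, not agentrial's actual code):&lt;/p&gt;

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: runs that failed / passed overall.
    Cols: runs where this step diverged / behaved normally.
    A small p means this step diverges far more often in failing runs.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    hi = min(row1, col1)
    denom = comb(n, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, hi + 1)) / denom

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean flags: which p-values survive FDR control at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if alpha * rank / m >= pvals[i]:
            cutoff = rank
    flags = [False] * m
    for rank, i in enumerate(order, start=1):
        flags[i] = cutoff >= rank
    return flags

# Example: a step that diverged in 9 of 10 failing runs but 1 of 10 passing runs
print(round(fisher_one_sided(9, 1, 1, 9), 4))  # 0.0005
```

&lt;p&gt;Testing every step inflates false positives, which is why the correction matters: with one clearly guilty step among several noisy ones, BH keeps only the real signal.&lt;/p&gt;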

&lt;p&gt;It's open-source (MIT), runs locally, and your data never leaves your machine.&lt;/p&gt;

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;The AI agent ecosystem has a measurement problem. Models score 80%+ on SWE-bench, but only 10% of enterprises successfully deploy agents in production. The gap isn't capability — it's reliability, and we don't have the tools to measure reliability properly.&lt;/p&gt;

&lt;p&gt;If you're deploying agents and you're not running multi-trial evaluations, you're flying blind. A single test run is an anecdote. A hundred runs with confidence intervals is data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; — stars appreciated if this is useful.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I built a pytest-like tool for AI agents because "it passed once" isn't good enough</title>
      <dc:creator>Alessandro Potenza</dc:creator>
      <pubDate>Thu, 05 Feb 2026 20:10:25 +0000</pubDate>
      <link>https://dev.to/alepot55/i-built-a-pytest-like-tool-for-ai-agents-because-it-passed-once-isnt-good-enough-2j30</link>
      <guid>https://dev.to/alepot55/i-built-a-pytest-like-tool-for-ai-agents-because-it-passed-once-isnt-good-enough-2j30</guid>
      <description>&lt;p&gt;You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results.&lt;/p&gt;

&lt;p&gt;I spent way too much time debugging agents that "sometimes" failed. The worst part wasn't the failures - it was not knowing &lt;em&gt;why&lt;/em&gt;. Was it the tool selection? The prompt? The model having a bad day?&lt;/p&gt;

&lt;p&gt;Existing eval tools didn't help much. They run your test once, check the output, done. But agents aren't deterministic. Running a test once tells you almost nothing.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;agentrial&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;What it does&lt;/h2&gt;

&lt;p&gt;It's basically pytest, but it runs each test multiple times and gives you actual statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentrial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agentrial.yml&lt;/span&gt;
&lt;span class="na"&gt;suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-agent&lt;/span&gt;
&lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_module.agent&lt;/span&gt;
&lt;span class="na"&gt;trials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;

&lt;span class="na"&gt;cases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;basic-math&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;37?"&lt;/span&gt;
    &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;output_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;555"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;tool_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentrial run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┬────────┬──────────────┬──────────┐
│ Test Case            │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply        │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population    │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step      │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└──────────────────────┴────────┴──────────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The parts that actually helped me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Confidence intervals instead of pass/fail&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That "95% CI" column is a Wilson score interval. With 10 trials, a 100% pass rate actually means "somewhere between 72% and 100% with 95% confidence". Sounds obvious in retrospect, but seeing "100% (72-100%)" instead of just "100%" completely changed how I thought about agent reliability.&lt;/p&gt;
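
&lt;p&gt;The Wilson interval itself is only a few lines. This sketch (stdlib Python, my own transcription of the textbook formula) reproduces the CI column in the table above:&lt;/p&gt;

```python
from math import sqrt

def wilson_ci(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives a 95% CI)."""
    p = passes / trials
    zz = z * z
    center = p + zz / (2 * trials)
    spread = z * sqrt(p * (1 - p) / trials + zz / (4 * trials * trials))
    denom = 1 + zz / trials
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_ci(10, 10)
print(f"10/10 passes: {lo:.1%} - {hi:.1%}")  # 72.2% - 100.0%
```

&lt;p&gt;Plug in 9/10 and 7/10 and you get the 59.6%-98.2% and 39.7%-89.2% rows from the output table.&lt;/p&gt;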

&lt;p&gt;&lt;strong&gt;Step-level failure attribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a test fails, it tells you which step diverged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turns out my agent was occasionally picking the wrong tool on ambiguous queries. Would have taken me hours to figure that out manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real cost tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pulls actual token usage from the API response metadata. Ran 100 trials across 10 test cases, cost me 6 cents total. Now I know exactly how much each test costs before I scale up.&lt;/p&gt;
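
&lt;p&gt;The per-run math is nothing exotic; a sketch of the idea, with placeholder prices that are not any provider's real rates:&lt;/p&gt;

```python
def run_cost(prompt_tokens, completion_tokens,
             price_in_per_mtok, price_out_per_mtok):
    # Token counts come back in the API response metadata;
    # prices here are per 1 million tokens.
    return (prompt_tokens * price_in_per_mtok
            + completion_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical run: 1,200 prompt tokens in, 300 completion tokens out,
# at $0.15 / $0.60 per million tokens (made-up placeholder prices).
print(f"${run_cost(1200, 300, 0.15, 0.60):.6f}")  # $0.000360
```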

&lt;h2&gt;How I actually use it&lt;/h2&gt;

&lt;p&gt;I have a GitHub Action that runs on every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alepot55/agentrial@v0.1.4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If pass rate drops below 80%, the PR gets blocked. Caught two regressions last week that I would have shipped otherwise.&lt;/p&gt;

&lt;h2&gt;What it doesn't do (yet)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only supports LangGraph right now. CrewAI and AutoGen adapters are next.&lt;/li&gt;
&lt;li&gt;No fancy UI - it's CLI only&lt;/li&gt;
&lt;li&gt;No LLM-as-judge for semantic evaluation (coming later)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The code&lt;/h2&gt;

&lt;p&gt;It's open source, MIT licensed: &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;github.com/alepot55/agentrial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built the whole thing in about a week using Claude Code. The statistical stuff (Wilson intervals, Fisher's exact test for regression detection, Benjamini-Hochberg correction) was the fun part.&lt;/p&gt;

&lt;p&gt;If you're building agents and tired of "it works on my machine", give it a shot. And let me know what metrics would actually be useful for your workflows - I'm still figuring out what to prioritize next.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>python</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
