You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results.
I spent way too much time debugging agents that "sometimes" failed. The worst part wasn't the failures - it was not knowing why. Was it the tool selection? The prompt? The model having a bad day?
Existing eval tools didn't help much. They run your test once, check the output, done. But agents aren't deterministic. Running a test once tells you almost nothing.
So I built agentrial.
What it does
It's basically pytest, but it runs each test multiple times and gives you actual statistics:
pip install agentrial
# agentrial.yml
suite: my-agent
agent: my_module.agent
trials: 10
threshold: 0.85
cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
agentrial run
Output looks like this:
┌───────────────────┬────────┬──────────────┬──────────┐
│ Test Case         │ Pass   │ 95% CI       │ Avg Cost │
├───────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply     │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step   │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└───────────────────┴────────┴──────────────┴──────────┘
The parts that actually helped me
Confidence intervals instead of pass/fail
That "95% CI" column is Wilson score interval. With 10 trials, a 100% pass rate actually means "somewhere between 72% and 100% with 95% confidence". Sounds obvious in retrospect, but seeing "100% (72-100%)" instead of just "100%" completely changed how I thought about agent reliability.
Step-level failure attribution
When a test fails, it tells you which step diverged:
Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
Turns out my agent was occasionally picking the wrong tool on ambiguous queries. Would have taken me hours to figure that out manually.
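The check itself is conceptually simple: walk the expected tool calls and the recorded trace in parallel and report the first step where they differ. This is just a sketch of the idea, not agentrial's actual code, and the function and argument names are made up:

def first_divergence(expected_tools: list[str], actual_tools: list[str]) -> str | None:
    """Describe the first step where the actual tool-call trace diverges from the expected one."""
    for step, (want, got) in enumerate(zip(expected_tools, actual_tools)):
        if want != got:
            return f"Step {step} (tool_selection): called '{got}' instead of '{want}'"
    if len(actual_tools) < len(expected_tools):
        missing = expected_tools[len(actual_tools)]
        return f"Step {len(actual_tools)} (tool_selection): expected '{missing}' but the agent stopped"
    return None  # traces match

print(first_divergence(["lookup_country_info"], ["calculate"]))
# Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'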
Real cost tracking
Pulls actual token usage from the API response metadata. Running 100 trials across 10 test cases cost me 6 cents total. Now I know exactly how much each test costs before I scale up.
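The math behind the Avg Cost column is nothing fancy: take the token counts the provider already reports in its usage metadata and multiply by your model's prices. A rough sketch - the prices below are placeholders for a hypothetical cheap model, not anyone's real price sheet:

def trial_cost(prompt_tokens: int, completion_tokens: int,
               input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Dollar cost of one trial, given token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_1m +
            completion_tokens * output_price_per_1m) / 1_000_000

# e.g. ~1,900 prompt + 350 completion tokens at $0.15/$0.60 per 1M tokens -> ~$0.0005 per trial
print(trial_cost(1900, 350, 0.15, 0.60))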
How I actually use it
I have a GitHub Action that runs on every PR:
- uses: alepot55/agentrial@v0.1.4
  with:
    trials: 10
    threshold: 0.80
If the pass rate drops below 80%, the PR gets blocked. It caught two regressions last week that I would have shipped otherwise.
What it doesn't do (yet)
- Only supports LangGraph right now. CrewAI and AutoGen adapters are next.
- No fancy UI - it's CLI only
- No LLM-as-judge for semantic evaluation (coming later)
The code
It's open source, MIT licensed: github.com/alepot55/agentrial
Built the whole thing in about a week using Claude Code. The statistical stuff (Wilson intervals, Fisher exact test for regression detection, Benjamini-Hochberg correction) was the fun part.
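For what it's worth, "regression detection" here means something concrete, roughly: compare a test case's pass count on the baseline against the PR branch with a Fisher exact test, then apply Benjamini-Hochberg across all test cases so a big suite doesn't drown you in false positives. Here's a textbook sketch of that check using scipy - again, not agentrial's internals, and the data format is invented for the example:

from scipy.stats import fisher_exact

def detect_regressions(baseline: dict[str, tuple[int, int]],
                       current: dict[str, tuple[int, int]],
                       alpha: float = 0.05) -> list[str]:
    """Flag test cases whose pass rate dropped, with Benjamini-Hochberg correction.

    baseline and current map test-case name -> (passes, trials).
    """
    names, p_values = [], []
    for name, (base_pass, base_n) in baseline.items():
        cur_pass, cur_n = current[name]
        table = [[base_pass, base_n - base_pass],
                 [cur_pass, cur_n - cur_pass]]
        # one-sided: is the baseline pass rate higher than the current one?
        _, p = fisher_exact(table, alternative="greater")
        names.append(name)
        p_values.append(p)

    # Benjamini-Hochberg: largest rank k with p_(k) <= k/m * alpha; flag the k smallest p-values
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return [names[order[j]] for j in range(cutoff)]

# A drop from 9/10 to 2/10 gets flagged; 10/10 to 9/10 doesn't
print(detect_regressions({"hard-multi-step": (9, 10), "basic-math": (10, 10)},
                         {"hard-multi-step": (4, 10), "basic-math": (9, 10)}))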
If you're building agents and tired of "it works on my machine", give it a shot. And let me know what metrics would actually be useful for your workflows - I'm still figuring out what to prioritize next.