You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results.
I spent way too much time debugging agents that "sometimes" failed. The worst part wasn't the failures - it was not knowing why. Was it the tool selection? The prompt? The model having a bad day?
Existing eval tools didn't help much. They run your test once, check the output, done. But agents aren't deterministic. Running a test once tells you almost nothing.
So I built agentrial.
What it does
It's basically pytest, but it runs each test multiple times and gives you actual statistics:
pip install agentrial
# agentrial.yml
suite: my-agent
agent: my_module.agent
trials: 10
threshold: 0.85
cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
agentrial run
Output looks like this:
┌───────────────────┬────────┬──────────────┬──────────┐
│ Test Case         │ Pass   │ 95% CI       │ Avg Cost │
├───────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply     │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step   │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└───────────────────┴────────┴──────────────┴──────────┘
The parts that actually helped me
Confidence intervals instead of pass/fail
That "95% CI" column is Wilson score interval. With 10 trials, a 100% pass rate actually means "somewhere between 72% and 100% with 95% confidence". Sounds obvious in retrospect, but seeing "100% (72-100%)" instead of just "100%" completely changed how I thought about agent reliability.
Step-level failure attribution
When a test fails, it tells you which step diverged:
Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
Turns out my agent was occasionally picking the wrong tool on ambiguous queries. Would have taken me hours to figure that out manually.
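The check itself is conceptually simple: walk the expected tool calls and the recorded trace in parallel and report the first step where they differ. This is just a sketch of the idea, not agentrial's actual code, and the function and argument names are made up:

def first_divergence(expected_tools: list[str], actual_tools: list[str]) -> str | None:
    """Describe the first step where the actual tool-call trace diverges from the expected one."""
    for step, (want, got) in enumerate(zip(expected_tools, actual_tools)):
        if want != got:
            return f"Step {step} (tool_selection): called '{got}' instead of '{want}'"
    if len(actual_tools) < len(expected_tools):
        missing = expected_tools[len(actual_tools)]
        return f"Step {len(actual_tools)} (tool_selection): expected '{missing}' but the agent stopped"
    return None  # traces match

print(first_divergence(["lookup_country_info"], ["calculate"]))
# Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'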
Real cost tracking
Pulls actual token usage from the API response metadata. Running 100 trials across 10 test cases cost me 6 cents total. Now I know exactly how much each test costs before I scale up.
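The math behind the Avg Cost column is nothing fancy: take the token counts the provider already reports in its usage metadata and multiply by your model's prices. A rough sketch - the prices below are placeholders for a hypothetical cheap model, not anyone's real price sheet:

def trial_cost(prompt_tokens: int, completion_tokens: int,
               input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Dollar cost of one trial, given token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_1m +
            completion_tokens * output_price_per_1m) / 1_000_000

# e.g. ~1,900 prompt + 350 completion tokens at $0.15/$0.60 per 1M tokens -> ~$0.0005 per trial
print(trial_cost(1900, 350, 0.15, 0.60))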
How I actually use it
I have a GitHub Action that runs on every PR:
- uses: alepot55/agentrial@v0.1.4
  with:
    trials: 10
    threshold: 0.80
If the pass rate drops below 80%, the PR gets blocked. It caught two regressions last week that I would have shipped otherwise.
What it doesn't do (yet)
- Only supports LangGraph right now. CrewAI and AutoGen adapters are next.
- No fancy UI - it's CLI only
- No LLM-as-judge for semantic evaluation (coming later)
The code
It's open source, MIT licensed: github.com/alepot55/agentrial
Built the whole thing in about a week using Claude Code. The statistical stuff (Wilson intervals, Fisher exact test for regression detection, Benjamini-Hochberg correction) was the fun part.
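For what it's worth, "regression detection" here means something concrete, roughly: compare a test case's pass count on the baseline against the PR branch with a Fisher exact test, then apply Benjamini-Hochberg across all test cases so a big suite doesn't drown you in false positives. Here's a textbook sketch of that check using scipy - again, not agentrial's internals, and the data format is invented for the example:

from scipy.stats import fisher_exact

def detect_regressions(baseline: dict[str, tuple[int, int]],
                       current: dict[str, tuple[int, int]],
                       alpha: float = 0.05) -> list[str]:
    """Flag test cases whose pass rate dropped, with Benjamini-Hochberg correction.

    baseline and current map test-case name -> (passes, trials).
    """
    names, p_values = [], []
    for name, (base_pass, base_n) in baseline.items():
        cur_pass, cur_n = current[name]
        table = [[base_pass, base_n - base_pass],
                 [cur_pass, cur_n - cur_pass]]
        # one-sided: is the baseline pass rate higher than the current one?
        _, p = fisher_exact(table, alternative="greater")
        names.append(name)
        p_values.append(p)

    # Benjamini-Hochberg: largest rank k with p_(k) <= k/m * alpha; flag the k smallest p-values
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return [names[order[j]] for j in range(cutoff)]

# A drop from 9/10 to 2/10 gets flagged; 10/10 to 9/10 doesn't
print(detect_regressions({"hard-multi-step": (9, 10), "basic-math": (10, 10)},
                         {"hard-multi-step": (4, 10), "basic-math": (9, 10)}))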
If you're building agents and tired of "it works on my machine", give it a shot. And let me know what metrics would actually be useful for your workflows - I'm still figuring out what to prioritize next.