<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alessandro Potenza</title>
    <description>The latest articles on DEV Community by Alessandro Potenza (@alepot55).</description>
    <link>https://dev.to/alepot55</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3755483%2F1c8476a9-b40d-473f-baa4-7890b7bd8978.png</url>
      <title>DEV Community: Alessandro Potenza</title>
      <link>https://dev.to/alepot55</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alepot55"/>
    <language>en</language>
    <item>
      <title>I tested 5 AI agents 100 times each. Single-run benchmarks are lying to you.</title>
      <dc:creator>Alessandro Potenza</dc:creator>
      <pubDate>Fri, 06 Feb 2026 12:16:51 +0000</pubDate>
      <link>https://dev.to/alepot55/i-tested-5-ai-agents-100-times-each-single-run-benchmarks-are-lying-to-you-4e7p</link>
      <guid>https://dev.to/alepot55/i-tested-5-ai-agents-100-times-each-single-run-benchmarks-are-lying-to-you-4e7p</guid>
      <description>&lt;p&gt;Your agent passes on Monday, fails on Wednesday. Same prompt, same model, same code. You run it again — it works. Is it fixed? You have no idea.&lt;/p&gt;

&lt;p&gt;I ran into this problem building agents with LangGraph. My agent scored well on manual tests, but in production it failed unpredictably. So I started running it multiple times. The results changed everything about how I think about agent evaluation.&lt;/p&gt;

&lt;h2&gt;The experiment&lt;/h2&gt;

&lt;p&gt;I took 5 agents representing common archetypes and ran each one 400 times (100 trials × 4 test cases). Not once. Not ten times. Four hundred times, with statistical confidence intervals on every metric.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;95% CI&lt;/th&gt;
&lt;th&gt;Avg Cost&lt;/th&gt;
&lt;th&gt;Cost per Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliable RAG&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;td&gt;[87.8%, 93.4%]&lt;/td&gt;
&lt;td&gt;$0.014&lt;/td&gt;
&lt;td&gt;$0.016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expensive Multi-Model&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;td&gt;[83.9%, 90.4%]&lt;/td&gt;
&lt;td&gt;$0.141&lt;/td&gt;
&lt;td&gt;$0.161&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;[64.6%, 73.6%]&lt;/td&gt;
&lt;td&gt;$0.036&lt;/td&gt;
&lt;td&gt;$0.052&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flaky Coding&lt;/td&gt;
&lt;td&gt;65.5%&lt;/td&gt;
&lt;td&gt;[60.7%, 70.0%]&lt;/td&gt;
&lt;td&gt;$0.052&lt;/td&gt;
&lt;td&gt;$0.079&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast-But-Wrong&lt;/td&gt;
&lt;td&gt;45.2%&lt;/td&gt;
&lt;td&gt;[40.4%, 50.1%]&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.007&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things jumped out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The confidence interval is the real number, not the point estimate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Flaky Coding agent scored 65.5%. But the 95% CI is [60.7%, 70.0%]. That's a 9-point range even after 400 runs. If you'd tested it a handful of times and gotten lucky, you'd report 80%. Unlucky, 50%. Neither tells you what the agent actually does.&lt;/p&gt;

&lt;p&gt;Every benchmark you've seen reports a single number. No error bars. No confidence interval. That number is one sample from a distribution you've never seen.&lt;/p&gt;
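
&lt;p&gt;You can watch this play out in a few lines of Python. A toy simulation (my numbers, taken from the table above): give 1,000 people the same 4-case suite and let each run it once against an agent whose true pass rate is exactly 65.5%.&lt;/p&gt;

```python
import random

random.seed(0)

TRUE_RATE = 0.655   # Flaky Coding agent's long-run pass rate, from the table
CASES = 4           # test cases in one benchmark run

# 1,000 people each run the 4-case suite exactly once and report a score.
scores = []
for _ in range(1000):
    passes = sum(1 for _ in range(CASES) if TRUE_RATE > random.random())
    scores.append(passes / CASES)

# Single-run "benchmarks" of the same agent land all over the place.
print(f"reported scores range from {min(scores):.0%} to {max(scores):.0%}")
```

&lt;p&gt;With only four cases per run, single-run scores scatter wildly around 65.5% even though the agent never changed.&lt;/p&gt;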

&lt;p&gt;&lt;strong&gt;2. Cost per success is the metric that matters, and nobody reports it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Expensive Multi-Model agent has an 87.5% pass rate — only 3.5 points below the Reliable RAG. Sounds close, right?&lt;/p&gt;

&lt;p&gt;But it costs 10× more per successful call ($0.161 vs $0.016). That 3.5-point gap hides a 10× cost difference. In production, every 1,000 successful calls would cost you roughly $161 instead of $16.&lt;/p&gt;

&lt;p&gt;The Fast-But-Wrong agent looks incredibly cheap at $0.003/call. But at 45% success, its cost per &lt;em&gt;completed task&lt;/em&gt; is $0.007 — and on average you need more than two runs to get one success. The real cost is the wasted compute on the 55% that fail.&lt;/p&gt;
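
&lt;p&gt;The arithmetic behind that column is one line; here it is as a sanity check against the table (plain Python, figures copied from above):&lt;/p&gt;

```python
def cost_per_success(avg_cost_per_run, pass_rate):
    # Failed runs still burn tokens, so divide by the success rate.
    return avg_cost_per_run / pass_rate

# Figures from the table above
print(round(cost_per_success(0.141, 0.875), 3))  # 0.161 (Expensive Multi-Model)
print(round(cost_per_success(0.003, 0.452), 3))  # 0.007 (Fast-But-Wrong)
```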

&lt;p&gt;&lt;strong&gt;3. Failure attribution tells you WHERE to fix, not just that something broke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most useful part. Every agent had a characteristic failure pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable RAG: 100% of failures in the &lt;code&gt;retrieve&lt;/code&gt; step — it's a retrieval quality problem, not a reasoning problem&lt;/li&gt;
&lt;li&gt;Flaky Coding: 71% in &lt;code&gt;execute&lt;/code&gt;, 29% in &lt;code&gt;plan&lt;/code&gt; — mostly runtime errors, some bad plans&lt;/li&gt;
&lt;li&gt;Expensive Multi-Model: 100% in &lt;code&gt;validate&lt;/code&gt; — the expensive final check is the weak link&lt;/li&gt;
&lt;li&gt;Fast-But-Wrong: 100% in &lt;code&gt;respond&lt;/code&gt; — no verification step means garbage output&lt;/li&gt;
&lt;li&gt;Inconsistent: 100% in &lt;code&gt;reason&lt;/code&gt; — bimodal: either works perfectly or crashes entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your agent fails, knowing &lt;em&gt;which step&lt;/em&gt; fails changes what you fix. Without this, you're guessing.&lt;/p&gt;

&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;agentrial&lt;/a&gt; to do this automatically. It's a CLI that runs your agent N times and gives you statistics instead of anecdotes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentrial
agentrial init
agentrial run &lt;span class="nt"&gt;--trials&lt;/span&gt; 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs your agent N times (default 10, configurable)&lt;/li&gt;
&lt;li&gt;Reports pass rate with Wilson score confidence intervals&lt;/li&gt;
&lt;li&gt;Tracks cost and latency per run with bootstrap CIs&lt;/li&gt;
&lt;li&gt;Attributes failures to specific steps using Fisher's exact test with Benjamini-Hochberg correction&lt;/li&gt;
&lt;li&gt;Works with LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, and smolagents&lt;/li&gt;
&lt;li&gt;Plugs into CI/CD — block PRs when reliability drops below a threshold&lt;/li&gt;
&lt;/ul&gt;
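
&lt;p&gt;For the curious, step-level attribution of this kind boils down to one contingency test per step plus a multiple-comparison correction. Below is a minimal stdlib-only sketch of the idea (my own simplification using a one-sided Fisher test via the hypergeometric tail, not agentrial's actual code):&lt;/p&gt;

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: runs that failed / passed overall.
    Cols: runs where this step diverged / behaved normally.
    A small p means this step diverges far more often in failing runs.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    hi = min(row1, col1)
    denom = comb(n, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, hi + 1)) / denom

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean flags: which p-values survive FDR control at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if alpha * rank / m >= pvals[i]:
            cutoff = rank
    flags = [False] * m
    for rank, i in enumerate(order, start=1):
        flags[i] = cutoff >= rank
    return flags

# Example: a step that diverged in 9 of 10 failing runs but 1 of 10 passing runs
print(round(fisher_one_sided(9, 1, 1, 9), 4))  # 0.0005
```

&lt;p&gt;Testing every step inflates false positives, which is why the correction matters: with one clearly guilty step among several noisy ones, BH keeps only the real signal.&lt;/p&gt;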

&lt;p&gt;It's open-source (MIT), runs locally, and your data never leaves your machine.&lt;/p&gt;

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;The AI agent ecosystem has a measurement problem. Models score 80%+ on SWE-bench, but only 10% of enterprises successfully deploy agents in production. The gap isn't capability — it's reliability, and we don't have the tools to measure reliability properly.&lt;/p&gt;

&lt;p&gt;If you're deploying agents and you're not running multi-trial evaluations, you're flying blind. A single test run is an anecdote. A hundred runs with confidence intervals is data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; — stars appreciated if this is useful.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I built a pytest-like tool for AI agents because "it passed once" isn't good enough</title>
      <dc:creator>Alessandro Potenza</dc:creator>
      <pubDate>Thu, 05 Feb 2026 20:10:25 +0000</pubDate>
      <link>https://dev.to/alepot55/i-built-a-pytest-like-tool-for-ai-agents-because-it-passed-once-isnt-good-enough-2j30</link>
      <guid>https://dev.to/alepot55/i-built-a-pytest-like-tool-for-ai-agents-because-it-passed-once-isnt-good-enough-2j30</guid>
      <description>&lt;p&gt;You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results.&lt;/p&gt;

&lt;p&gt;I spent way too much time debugging agents that "sometimes" failed. The worst part wasn't the failures - it was not knowing &lt;em&gt;why&lt;/em&gt;. Was it the tool selection? The prompt? The model having a bad day?&lt;/p&gt;

&lt;p&gt;Existing eval tools didn't help much. They run your test once, check the output, done. But agents aren't deterministic. Running a test once tells you almost nothing.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;agentrial&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;What it does&lt;/h2&gt;

&lt;p&gt;It's basically pytest, but it runs each test multiple times and gives you actual statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentrial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agentrial.yml&lt;/span&gt;
&lt;span class="na"&gt;suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-agent&lt;/span&gt;
&lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_module.agent&lt;/span&gt;
&lt;span class="na"&gt;trials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;

&lt;span class="na"&gt;cases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;basic-math&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;37?"&lt;/span&gt;
    &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;output_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;555"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;tool_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentrial run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┬────────┬──────────────┬──────────┐
│ Test Case            │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply        │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population    │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step      │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└──────────────────────┴────────┴──────────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The parts that actually helped me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Confidence intervals instead of pass/fail&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That "95% CI" column is a Wilson score interval. With 10 trials, a 100% pass rate actually means "somewhere between 72% and 100% with 95% confidence". Sounds obvious in retrospect, but seeing "100% (72-100%)" instead of just "100%" completely changed how I thought about agent reliability.&lt;/p&gt;
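
&lt;p&gt;The Wilson interval itself is only a few lines. This sketch (stdlib Python, my own transcription of the textbook formula) reproduces the CI column in the table above:&lt;/p&gt;

```python
from math import sqrt

def wilson_ci(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives a 95% CI)."""
    p = passes / trials
    zz = z * z
    center = p + zz / (2 * trials)
    spread = z * sqrt(p * (1 - p) / trials + zz / (4 * trials * trials))
    denom = 1 + zz / trials
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_ci(10, 10)
print(f"10/10 passes: {lo:.1%} - {hi:.1%}")  # 72.2% - 100.0%
```

&lt;p&gt;Plug in 9/10 and 7/10 and you get the 59.6%-98.2% and 39.7%-89.2% rows from the output table.&lt;/p&gt;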

&lt;p&gt;&lt;strong&gt;Step-level failure attribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a test fails, it tells you which step diverged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turns out my agent was occasionally picking the wrong tool on ambiguous queries. Would have taken me hours to figure that out manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real cost tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pulls actual token usage from the API response metadata. Ran 100 trials across 10 test cases, cost me 6 cents total. Now I know exactly how much each test costs before I scale up.&lt;/p&gt;
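
&lt;p&gt;The per-run math is nothing exotic; a sketch of the idea, with placeholder prices that are not any provider's real rates:&lt;/p&gt;

```python
def run_cost(prompt_tokens, completion_tokens,
             price_in_per_mtok, price_out_per_mtok):
    # Token counts come back in the API response metadata;
    # prices here are per 1 million tokens.
    return (prompt_tokens * price_in_per_mtok
            + completion_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical run: 1,200 prompt tokens in, 300 completion tokens out,
# at $0.15 / $0.60 per million tokens (made-up placeholder prices).
print(f"${run_cost(1200, 300, 0.15, 0.60):.6f}")  # $0.000360
```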

&lt;h2&gt;How I actually use it&lt;/h2&gt;

&lt;p&gt;I have a GitHub Action that runs on every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alepot55/agentrial@v0.1.4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If pass rate drops below 80%, the PR gets blocked. Caught two regressions last week that I would have shipped otherwise.&lt;/p&gt;

&lt;h2&gt;What it doesn't do (yet)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only supports LangGraph right now. CrewAI and AutoGen adapters are next.&lt;/li&gt;
&lt;li&gt;No fancy UI - it's CLI only&lt;/li&gt;
&lt;li&gt;No LLM-as-judge for semantic evaluation (coming later)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The code&lt;/h2&gt;

&lt;p&gt;It's open source, MIT licensed: &lt;a href="https://github.com/alepot55/agentrial" rel="noopener noreferrer"&gt;github.com/alepot55/agentrial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built the whole thing in about a week using Claude Code. The statistical stuff (Wilson intervals, Fisher's exact test for regression detection, Benjamini-Hochberg correction) was the fun part.&lt;/p&gt;

&lt;p&gt;If you're building agents and tired of "it works on my machine", give it a shot. And let me know what metrics would actually be useful for your workflows - I'm still figuring out what to prioritize next.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>python</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
