<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: J. S. Morris</title>
    <description>The latest articles on DEV Community by J. S. Morris (@dingomanhammer).</description>
    <link>https://dev.to/dingomanhammer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3792853%2F251c07ef-e6e7-4755-aac5-25c0526d5f0d.jpeg</url>
      <title>DEV Community: J. S. Morris</title>
      <link>https://dev.to/dingomanhammer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dingomanhammer"/>
    <language>en</language>
    <item>
      <title>Why Your AI Agent Works in Demo But Fails in Production</title>
      <dc:creator>J. S. Morris</dc:creator>
      <pubDate>Mon, 02 Mar 2026 08:36:26 +0000</pubDate>
      <link>https://dev.to/dingomanhammer/why-your-ai-agent-works-in-demo-but-fails-in-production-4e51</link>
      <guid>https://dev.to/dingomanhammer/why-your-ai-agent-works-in-demo-but-fails-in-production-4e51</guid>
      <description>&lt;h2&gt;
  
  
  Why Your AI Agent Works in Demo But Fails in Production
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;And the 5 failure modes nobody tests for.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every agent demo is a magic trick. You walk through the happy path, the LLM nails the tool call, the output is clean, and the audience nods. Ship it.&lt;/p&gt;

&lt;p&gt;Then production happens.&lt;/p&gt;

&lt;p&gt;The agent loops on ambiguous inputs. It hallucinates tool parameters that pass schema validation but produce garbage downstream. It burns $40 in API calls on a task that should cost $0.12. It works perfectly 93% of the time — and the other 7% files a support ticket, or worse, executes a wrong action with full confidence.&lt;/p&gt;

&lt;p&gt;This isn’t a prompting problem. It’s an evaluation problem. And the reason most teams don’t catch these failures before users do is that they’re testing the wrong things.&lt;/p&gt;

&lt;p&gt;I’ve spent the last year building &lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;AgentProbe&lt;/a&gt;, an open-source evaluation framework for agentic systems. Here are the five failure modes I see teams miss over and over — and the specific tests that catch them.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Confident Wrong Turn
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it looks like:&lt;/strong&gt; The agent selects a tool, passes valid parameters, receives a valid response — and it’s the completely wrong tool for the task.&lt;/p&gt;

&lt;p&gt;This is the most dangerous failure mode because nothing errors out. Your logs look clean. Your schema validation passes. The agent just… did the wrong thing. Confidently.&lt;/p&gt;

&lt;p&gt;Traditional evals miss this because they test tool calls in isolation: “Given this prompt, did the agent call the right function?” That works for single-turn interactions. In multi-step workflows, the problem is rarely that the agent can’t call the right tool — it’s that it calls a &lt;em&gt;plausible&lt;/em&gt; tool when the &lt;em&gt;correct&lt;/em&gt; tool requires contextual reasoning across prior steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Test the full decision chain, not individual calls. Define expected tool sequences for representative scenarios, then assert on the &lt;em&gt;path&lt;/em&gt;, not just the final output. In AgentProbe, this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentprobe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expect_tool_sequence&lt;/span&gt;

&lt;span class="nd"&gt;@probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi-step-booking-flow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_booking_requires_availability_before_reserve&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Book the cheapest available room for March 15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;expect_tool_sequence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Must check availability first
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare_prices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Then compare
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_reservation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# Then book
&lt;/span&gt;    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your agent jumps straight to &lt;code&gt;create_reservation&lt;/code&gt; because it “remembers” a room from a previous conversation turn, that’s a failure — even if the booking succeeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Invisible Cost Explosion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it looks like:&lt;/strong&gt; The agent completes the task correctly. The output is great. It cost 47x what it should have.&lt;/p&gt;

&lt;p&gt;This happens when agents enter reasoning loops — restating the problem, re-reading context, calling tools redundantly, or generating intermediate chain-of-thought that balloons token consumption without improving output quality. In development, nobody notices because you’re watching &lt;em&gt;behavior&lt;/em&gt;, not &lt;em&gt;spend&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In production at scale, this is how you get a $12,000 bill for a feature that was projected to cost $800/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Set cost and token budgets per task class and treat overruns as test failures. This isn’t monitoring — it’s a pre-deployment gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentprobe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BudgetConstraint&lt;/span&gt;

&lt;span class="nd"&gt;@probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize-document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;BudgetConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_summary_stays_within_budget&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this quarterly report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_quality_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
    &lt;span class="c1"&gt;# Test passes only if quality is high AND cost is within bounds
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The insight: quality without cost-awareness is a demo metric, not a production metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The State Bleed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it looks like:&lt;/strong&gt; Agent A handles User 1’s request and retains context that leaks into Agent A’s handling of User 2’s request. Or: a sub-agent inherits parent context that changes its behavior in ways the orchestrator didn’t intend.&lt;/p&gt;

&lt;p&gt;This is the multi-agent version of a global variable bug, and it’s endemic in frameworks that pass context through shared memory, vector stores, or poorly scoped conversation histories.&lt;/p&gt;

&lt;p&gt;The symptom is non-determinism that you can’t reproduce in isolation. The agent works fine in unit tests. In integration tests with concurrent users or multi-agent pipelines, it produces subtly wrong outputs — different every time, depending on who else hit the system recently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Run identical inputs through the agent under concurrent load and assert on output consistency. If the same input produces materially different outputs depending on system state, you have a bleed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentprobe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isolation_test&lt;/span&gt;

&lt;span class="nd"&gt;@probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context-isolation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_cross_user_contamination&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;isolation_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my account balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_contexts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;runs_per_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Each user's responses must reference ONLY their own data
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runs_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;42000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runs_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1500&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re building multi-agent systems and you’re not testing for context isolation under concurrency, you are shipping a data leak.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Graceful Degradation Failure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it looks like:&lt;/strong&gt; A downstream tool times out, an API returns a 500, a rate limit kicks in — and the agent either hangs indefinitely, retries until it exhausts your budget, or surfaces a raw error trace to the user.&lt;/p&gt;

&lt;p&gt;Most teams test the happy path exhaustively and the sad path not at all. But in production, your agent &lt;em&gt;will&lt;/em&gt; encounter degraded dependencies. The question is whether it fails gracefully — with a useful fallback, a clear error message, and bounded retry behavior — or whether it fails catastrophically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Inject failures into the tool layer and assert on recovery behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentprobe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inject_fault&lt;/span&gt;

&lt;span class="nd"&gt;@probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-timeout-recovery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_handles_tool_timeout&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;inject_fault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fault&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Birmingham?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;  &lt;span class="c1"&gt;# Agent didn't hang
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_retries&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Bounded retry
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;try again&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Agent communicated the failure, didn't hallucinate weather data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worst version of this failure is when the agent &lt;em&gt;hallucinates a response&lt;/em&gt; instead of admitting the tool failed. It confidently tells the user it’s 72°F and sunny when it never successfully called the weather API. This is a trust-destroying failure, and the only way to catch it is to simulate the fault path.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Regression You Didn’t Know You Shipped
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it looks like:&lt;/strong&gt; You update your system prompt, swap a model version, or change a tool schema. Your existing tests still pass. But a behavior the tests don’t cover — something users depend on — silently breaks.&lt;/p&gt;

&lt;p&gt;This is the most common failure mode in teams that &lt;em&gt;do&lt;/em&gt; test their agents. The tests are too narrow. They cover the scenarios you thought of when you wrote them, but they don’t cover the emergent behaviors that users discovered and came to rely on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Behavioral regression testing across prompt and model changes. Record production interactions as golden datasets, then replay them after every change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentprobe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;

&lt;span class="nd"&gt;@probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regression-suite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_behavioral_regression&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production_interactions_v12.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;regression_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;regression_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;
    &lt;span class="c1"&gt;# Up to 5% degradation allowed for model swaps
&lt;/span&gt;    &lt;span class="c1"&gt;# but any critical-path regression is a hard failure
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regression_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;critical_regressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the test that turns “we think this prompt change is safe” into “we measured that this prompt change is safe.” It’s the difference between engineering and hope.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evaluation Gap Is the Product Gap
&lt;/h2&gt;

&lt;p&gt;The current landscape of agent tooling is rich in &lt;em&gt;orchestration&lt;/em&gt; — LangChain, CrewAI, AutoGen, dozens of others helping you build and run agents. It’s remarkably poor in &lt;em&gt;evaluation&lt;/em&gt; — helping you prove those agents actually work reliably before your users do the testing for you.&lt;/p&gt;

&lt;p&gt;That’s the gap &lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;AgentProbe&lt;/a&gt; is built to fill. It’s an open-source evaluation framework purpose-built for agentic systems: deterministic assertions on non-deterministic behavior, cost-aware testing, fault injection, context isolation validation, and behavioral regression tracking.&lt;/p&gt;

&lt;p&gt;If you’re building agents and you don’t have answers to these five questions, you’re not ready for production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does my agent take the right &lt;em&gt;path&lt;/em&gt;, not just produce the right output?&lt;/li&gt;
&lt;li&gt;Does it stay within cost bounds under real workloads?&lt;/li&gt;
&lt;li&gt;Does it maintain strict context isolation under concurrency?&lt;/li&gt;
&lt;li&gt;Does it degrade gracefully when dependencies fail?&lt;/li&gt;
&lt;li&gt;Can I measure behavioral regression across every prompt and model change?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;AgentProbe on GitHub →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building agentic systems that need to work in production, not just in demo? &lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;Star AgentProbe&lt;/a&gt; and start testing what actually matters.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>development</category>
    </item>
    <item>
      <title>The Prompt Change That Broke Production at 2am</title>
      <dc:creator>J. S. Morris</dc:creator>
      <pubDate>Fri, 27 Feb 2026 10:24:39 +0000</pubDate>
      <link>https://dev.to/dingomanhammer/the-prompt-change-that-broke-production-at-2am-2alg</link>
      <guid>https://dev.to/dingomanhammer/the-prompt-change-that-broke-production-at-2am-2alg</guid>
      <description>&lt;h2&gt;
  
  
  Why This Keeps Happening
&lt;/h2&gt;

&lt;p&gt;When you test traditional software, you test a deterministic function. Same input, same output. If the output changes, something broke, the test fails, you investigate.&lt;/p&gt;

&lt;p&gt;LLM agents are not deterministic functions. They’re &lt;strong&gt;probabilistic systems with behavioral contracts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The contract isn’t “return exactly this string.” The contract is: &lt;em&gt;given this class of inputs, the output must satisfy these structural and semantic properties.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The word “liability” should appear. The summary should be in bullet points. The termination clause should be mentioned. These are the invariants your downstream systems depend on — and they’re completely untested in most production LLM pipelines.&lt;/p&gt;

&lt;p&gt;The industry’s response to this has been more evals. MMLU benchmarks, human preference ratings, red-team suites. Valuable for model builders. &lt;strong&gt;Useless for application developers&lt;/strong&gt; who need to know whether &lt;em&gt;their specific prompts&lt;/em&gt; still produce outputs &lt;em&gt;their specific systems&lt;/em&gt; can rely on.&lt;/p&gt;

&lt;p&gt;You’re not trying to measure whether Claude is generally intelligent. You’re trying to know whether your summarization prompt still hits the contract your parser expects. Those are completely different questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap Nobody Is Filling
&lt;/h2&gt;

&lt;p&gt;Here’s what the current tooling landscape looks like for an engineer who wants to regression-test their agent behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Unit tests with mocked LLMs.&lt;/strong&gt;&lt;br&gt;
Fast, deterministic, CI-friendly. Catches exactly nothing about actual model behavior because the model is mocked out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Manual spot-checking.&lt;/strong&gt;&lt;br&gt;
“Looks good to me.” Works until it doesn’t. Doesn’t scale. Doesn’t run on every deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Hosted eval platforms&lt;/strong&gt; (LangSmith, etc.)&lt;br&gt;
Powerful, but coupled to specific frameworks. Requires accounts, dashboards, infrastructure. Not a &lt;code&gt;pip install&lt;/code&gt; and a YAML file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 4: Nothing.&lt;/strong&gt;&lt;br&gt;
Most common option. “We’ll deal with it when something breaks.”&lt;/p&gt;

&lt;p&gt;What nobody has built is the boring, obvious thing: a &lt;code&gt;pytest&lt;/code&gt; for agent behavior. A tool that runs your scenarios, checks that outputs satisfy your contracts, compares against a baseline, and exits with code 1 when something drifts. Zero infrastructure. Works in any CI.&lt;/p&gt;


&lt;h2&gt;
  
  
  What We Actually Need
&lt;/h2&gt;

&lt;p&gt;The minimum viable agent regression test looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenarios/summarize_contract.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_contract&lt;/span&gt;
&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Summarize this contract clause in 5 bullet points:&lt;/span&gt;
  &lt;span class="s"&gt;"...The Contractor shall indemnify...termination upon 30 days notice..."&lt;/span&gt;
&lt;span class="na"&gt;expected_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;liability&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;termination&lt;/span&gt;
&lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One file. Declares the input and the semantic anchors that must appear. Not exact strings — anchors. The things your downstream systems depend on.&lt;/p&gt;

&lt;p&gt;Then you run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentprobe run scenarios/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--backend&lt;/span&gt; anthropic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--baseline&lt;/span&gt; baseline.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tolerance&lt;/span&gt; 0.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model : claude-opus-4-6
────────────────────────────────────────────────
  ✓ PASS  summarize_contract
  ✗ FAIL  extract_parties
          Drift detected: similarity 0.61 &amp;lt; 0.80
          Missing expected terms: ['indemnification']
────────────────────────────────────────────────
  1/2 passed  (50%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 1. CI fails. Nobody merges that prompt change until the contract is satisfied again.&lt;/p&gt;

&lt;p&gt;The Tuesday incident gets caught on Wednesday morning, before it reaches production.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Baseline Comparison Works
&lt;/h2&gt;

&lt;p&gt;The drift detection is simple and effective: &lt;a href="https://en.wikipedia.org/wiki/Jaccard_index" rel="noopener noreferrer"&gt;Jaccard similarity&lt;/a&gt; on output tokens compared against a saved baseline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Save a baseline after a known-good run&lt;/span&gt;
agentprobe run scenarios/ &lt;span class="nt"&gt;--backend&lt;/span&gt; anthropic &lt;span class="nt"&gt;--save-baseline&lt;/span&gt; baseline.json

&lt;span class="c"&gt;# Future runs compare against it&lt;/span&gt;
agentprobe run scenarios/ &lt;span class="nt"&gt;--backend&lt;/span&gt; anthropic &lt;span class="nt"&gt;--baseline&lt;/span&gt; baseline.json &lt;span class="nt"&gt;--tolerance&lt;/span&gt; 0.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--tolerance 0.8&lt;/code&gt; means: allow up to 20% variance from baseline. Drop below that, fail.&lt;/p&gt;

&lt;p&gt;This is deliberately not semantic similarity. Jaccard is fast, deterministic, and catches structural changes — the ones that break parsers — better than embeddings-based approaches for most cases. Semantic similarity is on the roadmap for v0.2.&lt;/p&gt;

&lt;p&gt;The baseline captures the behavioral fingerprint of a known-good run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-02-24T00:00:00+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scenarios"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"summarize_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"raw_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4e5f6a7b8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"found_terms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"liability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"termination"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;prompt_hash&lt;/code&gt; is there for a reason: if someone edits the prompt, the hash changes. You know the baseline may no longer be valid. You re-run and save a new one intentionally, rather than silently inheriting drift.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI Integration Is One Step
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/agent-tests.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Agent regression tests&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install "agentprobe[anthropic]"&lt;/span&gt;
    &lt;span class="s"&gt;agentprobe run scenarios/ \&lt;/span&gt;
      &lt;span class="s"&gt;--backend anthropic \&lt;/span&gt;
      &lt;span class="s"&gt;--baseline baseline.json \&lt;/span&gt;
      &lt;span class="s"&gt;--tolerance 0.8&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. Every push. Every prompt change. Every model update. The test runs. If behavior drifts past tolerance, the build fails.&lt;/p&gt;

&lt;p&gt;The Tuesday incident becomes: commit fails CI, engineer sees the drift report, reviews whether the change was intentional, either adjusts the prompt or updates the baseline deliberately.&lt;/p&gt;

&lt;p&gt;No more 2am pages about empty liability clause arrays.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With Three Scenarios
&lt;/h2&gt;

&lt;p&gt;You don’t need to test everything. Start with the three behaviors your system absolutely depends on.&lt;/p&gt;

&lt;p&gt;For most LLM pipelines, these are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The happy path&lt;/strong&gt; — core task with all expected output present&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A safety or refusal case&lt;/strong&gt; — inputs the agent should decline or handle carefully&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The format-sensitive case&lt;/strong&gt; — where downstream parsing depends on structural output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Write a YAML for each. Run them once with &lt;code&gt;--save-baseline&lt;/code&gt;. Add the CI step. Done in an afternoon.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The repo:&lt;/strong&gt; &lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;github.com/fallenone269/agentprobe&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"agentprobe[anthropic]"&lt;/span&gt;
agentprobe init-scenario my_first_test scenarios/my_first_test.yaml
agentprobe run scenarios/ &lt;span class="nt"&gt;--backend&lt;/span&gt; mock  &lt;span class="c"&gt;# no API key needed to start&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you try it and hit friction, &lt;strong&gt;open an issue.&lt;/strong&gt; The roughest edges get smoothed first.&lt;/p&gt;

&lt;p&gt;If it saves you a 2am page, &lt;strong&gt;star the repo.&lt;/strong&gt; It helps other engineers find it.&lt;/p&gt;

&lt;p&gt;If you have a use case that isn’t covered, &lt;strong&gt;start a discussion.&lt;/strong&gt; The roadmap is driven by real production problems.&lt;/p&gt;




&lt;p&gt;The agent testing gap is real, the 2am incidents are happening, and the tooling to prevent them is a &lt;code&gt;pip install&lt;/code&gt; away.&lt;/p&gt;

&lt;p&gt;The only missing piece was someone building it. That’s what this is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>architecture</category>
      <category>learning</category>
    </item>
    <item>
      <title>Why Agent Testing is Broken</title>
      <dc:creator>J. S. Morris</dc:creator>
      <pubDate>Wed, 25 Feb 2026 22:37:22 +0000</pubDate>
      <link>https://dev.to/dingomanhammer/why-agent-testing-is-broken-12a2</link>
      <guid>https://dev.to/dingomanhammer/why-agent-testing-is-broken-12a2</guid>
      <description>&lt;h2&gt;
  
  
  Why Agent Testing Is Broken
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;And what to do about it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Software testing has been solved for decades. You write a function, you assert its output, your CI turns green, you ship. The contract is clear: same input, same output, always.&lt;/p&gt;

&lt;p&gt;LLM agents broke this contract completely — and most teams haven’t noticed yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody’s Talking About
&lt;/h2&gt;

&lt;p&gt;Ask your agent “summarize this contract” today and get a good response. Ask it again tomorrow after a model update, a prompt tweak, or a context window change, and get something subtly different. Not wrong, exactly. Just… different. Different enough that the downstream system parsing it breaks silently at 2am.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. It’s happening in production right now at companies that thought they were shipping stable systems.&lt;/p&gt;

&lt;p&gt;The failure mode is insidious because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It doesn’t throw exceptions.&lt;/strong&gt; The agent responds. It always responds. The response is even plausible. The failure is semantic, not syntactic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It’s not reproducible on demand.&lt;/strong&gt; You can’t &lt;code&gt;git bisect&lt;/code&gt; a drift in model behavior. The model didn’t change — your prompts did, or the model got a silent update from your API provider, or the context you’re injecting shifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your existing tests don’t catch it.&lt;/strong&gt; Unit tests mock the LLM entirely. Integration tests check that the API call completes. Neither checks whether the &lt;em&gt;content&lt;/em&gt; of the response still satisfies your downstream expectations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You have no regression suite for cognition. You’re flying blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;Traditional software is deterministic. LLMs are stochastic systems operating on learned representations of language. When you update a model, you’re not patching a function — you’re shifting a distribution.&lt;/p&gt;

&lt;p&gt;A 3% shift in how Claude-3.5 vs Claude-4 responds to a legal summarization prompt might be invisible in manual review and catastrophic in a pipeline that expects the word “termination” to appear in every output.&lt;/p&gt;

&lt;p&gt;The industry’s response has been to add more evals — elaborate human preference datasets, MMLU benchmarks, red-teaming suites. These are valuable for model builders. They are nearly useless for application developers.&lt;/p&gt;

&lt;p&gt;What application developers need is not “is this model generally capable?” They need: “does this model, with my specific prompts, in my specific context, still produce outputs my system can rely on?”&lt;/p&gt;

&lt;p&gt;That question has no good answer today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broken Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here’s a real pattern seen across teams shipping LLM applications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1:&lt;/strong&gt; Team writes prompts, ships agent, manually verifies outputs look good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 2:&lt;/strong&gt; Someone tweaks a system prompt “slightly” to improve tone. Three downstream parsers start failing intermittently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3:&lt;/strong&gt; The model provider silently updates the model behind the same API endpoint. Response format drifts by 15%. The agent still works in demos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4:&lt;/strong&gt; A customer reports that summarized contracts are missing liability clauses. Postmortem reveals the issue started in month 2. Nobody noticed because there were no behavioral tests.&lt;/p&gt;

&lt;p&gt;This is the norm, not the exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Right Mental Model
&lt;/h2&gt;

&lt;p&gt;Stop thinking about agent outputs as function return values. Think about them as documents produced by a probabilistic process with a behavioral contract.&lt;/p&gt;

&lt;p&gt;The contract is: &lt;em&gt;given this class of inputs, the output must satisfy these structural and semantic properties.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Testing that contract requires:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Baseline capture.&lt;/strong&gt; Run your scenarios against a known-good version of the system and record the outputs. This is your behavioral fingerprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Containment checks.&lt;/strong&gt; Define what must appear in every output. Not the exact text — that would fail on every run. The semantic anchors: key terms, required sections, structural elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Drift detection.&lt;/strong&gt; Compare new outputs against your baseline. When similarity drops below your tolerance threshold, fail the build. Let the engineer decide if the change is intentional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CI integration.&lt;/strong&gt; Run this on every push. On every model version change. On every prompt edit. The same way you run unit tests.&lt;/p&gt;

&lt;p&gt;This is not complicated. It’s just not being done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Nobody’s Done It Yet
&lt;/h2&gt;

&lt;p&gt;The tooling doesn’t exist yet in a usable form.&lt;/p&gt;

&lt;p&gt;Existing evaluation frameworks (RAGAS, LangSmith, etc.) are either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coupled to specific frameworks (LangChain, etc.)&lt;/li&gt;
&lt;li&gt;Focused on RAG quality metrics rather than behavioral regression&lt;/li&gt;
&lt;li&gt;Require hosted infrastructure and accounts&lt;/li&gt;
&lt;li&gt;Too complex to add to a CI pipeline in an afternoon&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the market needs is a &lt;code&gt;pytest&lt;/code&gt; for agents. Lightweight. Composable. Runs locally. Zero-infrastructure. Exits with code 1 when behavior breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Solution Looks Like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenarios/summarize_contract.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_contract&lt;/span&gt;
&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Summarize this contract clause in 5 bullet points:&lt;/span&gt;
  &lt;span class="s"&gt;"...The Contractor shall indemnify...termination upon 30 days notice..."&lt;/span&gt;
&lt;span class="na"&gt;expected_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;liability&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;termination&lt;/span&gt;
&lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run against real model, compare to baseline&lt;/span&gt;
agentprobe run scenarios/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--backend&lt;/span&gt; anthropic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--baseline&lt;/span&gt; baseline.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tolerance&lt;/span&gt; 0.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ PASS  summarize_contract
✗ FAIL  extract_parties
        Drift detected: similarity 0.61 &amp;lt; 0.80
        Missing expected terms: ['indemnification']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 1. CI fails. Engineer investigates before merge.&lt;/p&gt;

&lt;p&gt;This is the minimum viable interface for agent regression testing. One command. One config file. Works in any CI system. No accounts. No dashboards.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deeper Issue
&lt;/h2&gt;

&lt;p&gt;The reason agent testing is broken isn’t technical. The tooling is straightforward to build.&lt;/p&gt;

&lt;p&gt;The reason is cultural. The teams shipping LLM applications came from two worlds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ML engineers&lt;/strong&gt; think about evaluation as a training-time concern. You eval the model, you ship the model, done. Application behavior is someone else’s problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software engineers&lt;/strong&gt; think about testing as a code correctness concern. The LLM is a black box — you can’t unit test a neural network, so you don’t test at all.&lt;/p&gt;

&lt;p&gt;Neither group has internalized that LLM applications are probabilistic systems with testable behavioral contracts. That’s a new thing. It requires a new practice.&lt;/p&gt;

&lt;p&gt;That practice is agent regression testing. It needs to become as routine as writing unit tests. The tools to do it are simple — they just need to exist and be usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start Today
&lt;/h2&gt;

&lt;p&gt;You don’t need a framework. Here’s the minimum viable version:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick your three most critical agent behaviors.&lt;/li&gt;
&lt;li&gt;Write a scenario YAML for each: input, and 2-3 terms that must appear in every valid output.&lt;/li&gt;
&lt;li&gt;Run your agent against those scenarios and save the outputs as a baseline JSON.&lt;/li&gt;
&lt;li&gt;On every deploy, run again and diff against the baseline.&lt;/li&gt;
&lt;li&gt;Fail the deploy if outputs drift beyond your tolerance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it. If you want a tool that does this out of the box: &lt;a href="https://github.com/fallenone269/agentprobe" rel="noopener noreferrer"&gt;github.com/fallenone269/agentprobe&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Agent testing is broken because nobody built the right tool yet. That’s a solvable problem.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
