DEV Community: J. S. Morris

Why Your AI Agent Works in Demo But Fails in Production

J. S. Morris — Mon, 02 Mar 2026 08:36:26 +0000

Why Your AI Agent Works in Demo But Fails in Production

And the 5 failure modes nobody tests for.

Every agent demo is a magic trick. You walk through the happy path, the LLM nails the tool call, the output is clean, and the audience nods. Ship it.

Then production happens.

The agent loops on ambiguous inputs. It hallucinates tool parameters that pass schema validation but produce garbage downstream. It burns $40 in API calls on a task that should cost $0.12. It works perfectly 93% of the time — and the other 7% files a support ticket, or worse, executes a wrong action with full confidence.

This isn’t a prompting problem. It’s an evaluation problem. And the reason most teams don’t catch these failures before users do is that they’re testing the wrong things.

I’ve spent the last year building AgentProbe, an open-source evaluation framework for agentic systems. Here are the five failure modes I see teams miss over and over — and the specific tests that catch them.

1. The Confident Wrong Turn

What it looks like: The agent selects a tool, passes valid parameters, receives a valid response — and it’s the completely wrong tool for the task.

This is the most dangerous failure mode because nothing errors out. Your logs look clean. Your schema validation passes. The agent just… did the wrong thing. Confidently.

Traditional evals miss this because they test tool calls in isolation: “Given this prompt, did the agent call the right function?” That works for single-turn interactions. In multi-step workflows, the problem is rarely that the agent can’t call the right tool — it’s that it calls a plausible tool when the correct tool requires contextual reasoning across prior steps.

How to catch it: Test the full decision chain, not individual calls. Define expected tool sequences for representative scenarios, then assert on the path, not just the final output. In AgentProbe, this looks like:

from agentprobe import probe, expect_tool_sequence

@probe("multi-step-booking-flow")
async def test_booking_requires_availability_before_reserve():
    result = await agent.run("Book the cheapest available room for March 15")
    expect_tool_sequence(result, [
        "search_availability",  # Must check availability first
        "compare_prices",       # Then compare
        "create_reservation"    # Then book
    ])

If your agent jumps straight to create_reservation because it “remembers” a room from a previous conversation turn, that’s a failure — even if the booking succeeds.

2. The Invisible Cost Explosion

What it looks like: The agent completes the task correctly. The output is great. It cost 47x what it should have.

This happens when agents enter reasoning loops — restating the problem, re-reading context, calling tools redundantly, or generating intermediate chain-of-thought that balloons token consumption without improving output quality. In development, nobody notices because you’re watching behavior, not spend.

In production at scale, this is how you get a $12,000 bill for a feature that was projected to cost $800/month.

How to catch it: Set cost and token budgets per task class and treat overruns as test failures. This isn’t monitoring — it’s a pre-deployment gate.

from agentprobe import probe, BudgetConstraint

@probe("summarize-document", constraints=[
    BudgetConstraint(max_tokens=4000, max_tool_calls=3)
])
async def test_summary_stays_within_budget():
    result = await agent.run("Summarize this quarterly report")
    assert result.output_quality_score > 0.85
    # Test passes only if quality is high AND cost is within bounds

The insight: quality without cost-awareness is a demo metric, not a production metric.

3. The State Bleed

What it looks like: Agent A handles User 1’s request and retains context that leaks into Agent A’s handling of User 2’s request. Or: a sub-agent inherits parent context that changes its behavior in ways the orchestrator didn’t intend.

This is the multi-agent version of a global variable bug, and it’s endemic in frameworks that pass context through shared memory, vector stores, or poorly scoped conversation histories.

The symptom is non-determinism that you can’t reproduce in isolation. The agent works fine in unit tests. In integration tests with concurrent users or multi-agent pipelines, it produces subtly wrong outputs — different every time, depending on who else hit the system recently.

How to catch it: Run identical inputs through the agent under concurrent load and assert on output consistency. If the same input produces materially different outputs depending on system state, you have a bleed.

from agentprobe import probe, isolation_test

@probe("context-isolation")
async def test_no_cross_user_contamination():
    results = await isolation_test(
        agent=agent,
        input="What's my account balance?",
        user_contexts=[
            {"user_id": "alice", "balance": 1500},
            {"user_id": "bob", "balance": 42000},
        ],
        concurrency=10,
        runs_per_context=5
    )
    # Each user's responses must reference ONLY their own data
    for run in results.runs_for("alice"):
        assert "42000" not in run.output
    for run in results.runs_for("bob"):
        assert "1500" not in run.output

If you’re building multi-agent systems and you’re not testing for context isolation under concurrency, you are shipping a data leak.

4. The Graceful Degradation Failure

What it looks like: A downstream tool times out, an API returns a 500, a rate limit kicks in — and the agent either hangs indefinitely, retries until it exhausts your budget, or surfaces a raw error trace to the user.

Most teams test the happy path exhaustively and the sad path not at all. But in production, your agent will encounter degraded dependencies. The question is whether it fails gracefully — with a useful fallback, a clear error message, and bounded retry behavior — or whether it fails catastrophically.

How to catch it: Inject failures into the tool layer and assert on recovery behavior.

from agentprobe import probe, inject_fault

@probe("api-timeout-recovery")
async def test_agent_handles_tool_timeout():
    with inject_fault("weather_api", fault="timeout", after_ms=3000):
        result = await agent.run("What's the weather in Birmingham?")

    assert result.completed  # Agent didn't hang
    assert result.tool_retries <= 3  # Bounded retry
    assert "unable" in result.output.lower() or "try again" in result.output.lower()
    # Agent communicated the failure, didn't hallucinate weather data

The worst version of this failure is when the agent hallucinates a response instead of admitting the tool failed. It confidently tells the user it’s 72°F and sunny when it never successfully called the weather API. This is a trust-destroying failure, and the only way to catch it is to simulate the fault path.

5. The Regression You Didn’t Know You Shipped

What it looks like: You update your system prompt, swap a model version, or change a tool schema. Your existing tests still pass. But a behavior the tests don’t cover — something users depend on — silently breaks.

This is the most common failure mode in teams that do test their agents. The tests are too narrow. They cover the scenarios you thought of when you wrote them, but they don’t cover the emergent behaviors that users discovered and came to rely on.

How to catch it: Behavioral regression testing across prompt and model changes. Record production interactions as golden datasets, then replay them after every change.

from agentprobe import probe, golden_dataset

@probe("regression-suite")
async def test_no_behavioral_regression():
    dataset = golden_dataset("production_interactions_v12.jsonl")

    results = await agent.run_batch(dataset.inputs)

    regression_report = dataset.compare(results, threshold=0.90)
    assert regression_report.pass_rate > 0.95
    # Up to 5% degradation allowed for model swaps
    # but any critical-path regression is a hard failure
    assert len(regression_report.critical_regressions) == 0

This is the test that turns “we think this prompt change is safe” into “we measured that this prompt change is safe.” It’s the difference between engineering and hope.

The Evaluation Gap Is the Product Gap

The current landscape of agent tooling is rich in orchestration — LangChain, CrewAI, AutoGen, dozens of others helping you build and run agents. It’s remarkably poor in evaluation — helping you prove those agents actually work reliably before your users do the testing for you.

That’s the gap AgentProbe is built to fill. It’s an open-source evaluation framework purpose-built for agentic systems: deterministic assertions on non-deterministic behavior, cost-aware testing, fault injection, context isolation validation, and behavioral regression tracking.

If you’re building agents and you don’t have answers to these five questions, you’re not ready for production:

Does my agent take the right path, not just produce the right output?
Does it stay within cost bounds under real workloads?
Does it maintain strict context isolation under concurrency?
Does it degrade gracefully when dependencies fail?
Can I measure behavioral regression across every prompt and model change?

AgentProbe on GitHub →

Building agentic systems that need to work in production, not just in demo? Star AgentProbe and start testing what actually matters.

The Prompt Change That Broke Production at 2am

J. S. Morris — Fri, 27 Feb 2026 10:24:39 +0000

Why This Keeps Happening

When you test traditional software, you test a deterministic function. Same input, same output. If the output changes, something broke, the test fails, you investigate.

LLM agents are not deterministic functions. They’re probabilistic systems with behavioral contracts.

The contract isn’t “return exactly this string.” The contract is: given this class of inputs, the output must satisfy these structural and semantic properties.

The word “liability” should appear. The summary should be in bullet points. The termination clause should be mentioned. These are the invariants your downstream systems depend on — and they’re completely untested in most production LLM pipelines.

The industry’s response to this has been more evals. MMLU benchmarks, human preference ratings, red-team suites. Valuable for model builders. Useless for application developers who need to know whether their specific prompts still produce outputs their specific systems can rely on.

You’re not trying to measure whether Claude is generally intelligent. You’re trying to know whether your summarization prompt still hits the contract your parser expects. Those are completely different questions.

The Gap Nobody Is Filling

Here’s what the current tooling landscape looks like for an engineer who wants to regression-test their agent behavior:

Option 1: Unit tests with mocked LLMs.
Fast, deterministic, CI-friendly. Catches exactly nothing about actual model behavior because the model is mocked out.

Option 2: Manual spot-checking.
“Looks good to me.” Works until it doesn’t. Doesn’t scale. Doesn’t run on every deploy.

Option 3: Hosted eval platforms (LangSmith, etc.)
Powerful, but coupled to specific frameworks. Requires accounts, dashboards, infrastructure. Not a pip install and a YAML file.

Option 4: Nothing.
Most common option. “We’ll deal with it when something breaks.”

What nobody has built is the boring, obvious thing: a pytest for agent behavior. A tool that runs your scenarios, checks that outputs satisfy your contracts, compares against a baseline, and exits with code 1 when something drifts. Zero infrastructure. Works in any CI.

What We Actually Need

The minimum viable agent regression test looks like this:

# scenarios/summarize_contract.yaml
name: summarize_contract
input: |
  Summarize this contract clause in 5 bullet points:
  "...The Contractor shall indemnify...termination upon 30 days notice..."
expected_contains:
  - liability
  - termination
max_tokens: 512

One file. Declares the input and the semantic anchors that must appear. Not exact strings — anchors. The things your downstream systems depend on.

Then you run it:

agentprobe run scenarios/ \
  --backend anthropic \
  --baseline baseline.json \
  --tolerance 0.8

And you get:

Model : claude-opus-4-6
────────────────────────────────────────────────
  ✓ PASS  summarize_contract
  ✗ FAIL  extract_parties
          Drift detected: similarity 0.61 < 0.80
          Missing expected terms: ['indemnification']
────────────────────────────────────────────────
  1/2 passed  (50%)

Exit code 1. CI fails. Nobody merges that prompt change until the contract is satisfied again.

The Tuesday incident gets caught on Wednesday morning, before it reaches production.

How Baseline Comparison Works

The drift detection is simple and effective: Jaccard similarity on output tokens compared against a saved baseline.

# Save a baseline after a known-good run
agentprobe run scenarios/ --backend anthropic --save-baseline baseline.json

# Future runs compare against it
agentprobe run scenarios/ --backend anthropic --baseline baseline.json --tolerance 0.8

--tolerance 0.8 means: allow up to 20% variance from baseline. Drop below that, fail.

This is deliberately not semantic similarity. Jaccard is fast, deterministic, and catches structural changes — the ones that break parsers — better than embeddings-based approaches for most cases. Semantic similarity is on the roadmap for v0.2.

The baseline captures the behavioral fingerprint of a known-good run:

{
  "model": "claude-opus-4-6",
  "created_at": "2025-02-24T00:00:00+00:00",
  "scenarios": {
    "summarize_contract": {
      "raw_output": "...",
      "metrics": {
        "prompt_hash": "a1b2c3d4e5f6a7b8",
        "found_terms": ["liability", "termination"]
      }
    }
  }
}

The prompt_hash is there for a reason: if someone edits the prompt, the hash changes. You know the baseline may no longer be valid. You re-run and save a new one intentionally, rather than silently inheriting drift.

The CI Integration Is One Step

# .github/workflows/agent-tests.yml
- name: Agent regression tests
  run: |
    pip install "agentprobe[anthropic]"
    agentprobe run scenarios/ \
      --backend anthropic \
      --baseline baseline.json \
      --tolerance 0.8
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

That’s it. Every push. Every prompt change. Every model update. The test runs. If behavior drifts past tolerance, the build fails.

The Tuesday incident becomes: commit fails CI, engineer sees the drift report, reviews whether the change was intentional, either adjusts the prompt or updates the baseline deliberately.

No more 2am pages about empty liability clause arrays.

Start With Three Scenarios

You don’t need to test everything. Start with the three behaviors your system absolutely depends on.

For most LLM pipelines, these are:

The happy path — core task with all expected output present
A safety or refusal case — inputs the agent should decline or handle carefully
The format-sensitive case — where downstream parsing depends on structural output

Write a YAML for each. Run them once with --save-baseline. Add the CI step. Done in an afternoon.

Try It

The repo: github.com/fallenone269/agentprobe

pip install "agentprobe[anthropic]"
agentprobe init-scenario my_first_test scenarios/my_first_test.yaml
agentprobe run scenarios/ --backend mock  # no API key needed to start

If you try it and hit friction, open an issue. The roughest edges get smoothed first.

If it saves you a 2am page, star the repo. It helps other engineers find it.

If you have a use case that isn’t covered, start a discussion. The roadmap is driven by real production problems.

The agent testing gap is real, the 2am incidents are happening, and the tooling to prevent them is a pip install away.

The only missing piece was someone building it. That’s what this is.

Why Agent Testing is Broken

J. S. Morris — Wed, 25 Feb 2026 22:37:22 +0000

Why Agent Testing Is Broken

And what to do about it.

Software testing has been solved for decades. You write a function, you assert its output, your CI turns green, you ship. The contract is clear: same input, same output, always.

LLM agents broke this contract completely — and most teams haven’t noticed yet.

The Problem Nobody’s Talking About

Ask your agent “summarize this contract” today and get a good response. Ask it again tomorrow after a model update, a prompt tweak, or a context window change, and get something subtly different. Not wrong, exactly. Just… different. Different enough that the downstream system parsing it breaks silently at 2am.

This is not a hypothetical. It’s happening in production right now at companies that thought they were shipping stable systems.

The failure mode is insidious because:

It doesn’t throw exceptions. The agent responds. It always responds. The response is even plausible. The failure is semantic, not syntactic.
It’s not reproducible on demand. You can’t git bisect a drift in model behavior. The model didn’t change — your prompts did, or the model got a silent update from your API provider, or the context you’re injecting shifted.
Your existing tests don’t catch it. Unit tests mock the LLM entirely. Integration tests check that the API call completes. Neither checks whether the content of the response still satisfies your downstream expectations.

You have no regression suite for cognition. You’re flying blind.

Why This Happens

Traditional software is deterministic. LLMs are stochastic systems operating on learned representations of language. When you update a model, you’re not patching a function — you’re shifting a distribution.

A 3% shift in how Claude-3.5 vs Claude-4 responds to a legal summarization prompt might be invisible in manual review and catastrophic in a pipeline that expects the word “termination” to appear in every output.

The industry’s response has been to add more evals — elaborate human preference datasets, MMLU benchmarks, red-teaming suites. These are valuable for model builders. They are nearly useless for application developers.

What application developers need is not “is this model generally capable?” They need: “does this model, with my specific prompts, in my specific context, still produce outputs my system can rely on?”

That question has no good answer today.

What Broken Looks Like in Practice

Here’s a real pattern seen across teams shipping LLM applications:

Month 1: Team writes prompts, ships agent, manually verifies outputs look good.

Month 2: Someone tweaks a system prompt “slightly” to improve tone. Three downstream parsers start failing intermittently.

Month 3: The model provider silently updates the model behind the same API endpoint. Response format drifts by 15%. The agent still works in demos.

Month 4: A customer reports that summarized contracts are missing liability clauses. Postmortem reveals the issue started in month 2. Nobody noticed because there were no behavioral tests.

This is the norm, not the exception.

The Right Mental Model

Stop thinking about agent outputs as function return values. Think about them as documents produced by a probabilistic process with a behavioral contract.

The contract is: given this class of inputs, the output must satisfy these structural and semantic properties.

Testing that contract requires:

1. Baseline capture. Run your scenarios against a known-good version of the system and record the outputs. This is your behavioral fingerprint.

2. Containment checks. Define what must appear in every output. Not the exact text — that would fail on every run. The semantic anchors: key terms, required sections, structural elements.

3. Drift detection. Compare new outputs against your baseline. When similarity drops below your tolerance threshold, fail the build. Let the engineer decide if the change is intentional.

4. CI integration. Run this on every push. On every model version change. On every prompt edit. The same way you run unit tests.

This is not complicated. It’s just not being done.

Why Nobody’s Done It Yet

The tooling doesn’t exist yet in a usable form.

Existing evaluation frameworks (RAGAS, LangSmith, etc.) are either:

Coupled to specific frameworks (LangChain, etc.)
Focused on RAG quality metrics rather than behavioral regression
Require hosted infrastructure and accounts
Too complex to add to a CI pipeline in an afternoon

What the market needs is a pytest for agents. Lightweight. Composable. Runs locally. Zero-infrastructure. Exits with code 1 when behavior breaks.

What a Solution Looks Like

# scenarios/summarize_contract.yaml
name: summarize_contract
input: |
  Summarize this contract clause in 5 bullet points:
  "...The Contractor shall indemnify...termination upon 30 days notice..."
expected_contains:
  - liability
  - termination
max_tokens: 512

# Run against real model, compare to baseline
agentprobe run scenarios/ \
  --backend anthropic \
  --baseline baseline.json \
  --tolerance 0.8

✓ PASS  summarize_contract
✗ FAIL  extract_parties
        Drift detected: similarity 0.61 < 0.80
        Missing expected terms: ['indemnification']

Exit code 1. CI fails. Engineer investigates before merge.

This is the minimum viable interface for agent regression testing. One command. One config file. Works in any CI system. No accounts. No dashboards.

The Deeper Issue

The reason agent testing is broken isn’t technical. The tooling is straightforward to build.

The reason is cultural. The teams shipping LLM applications came from two worlds:

ML engineers think about evaluation as a training-time concern. You eval the model, you ship the model, done. Application behavior is someone else’s problem.

Software engineers think about testing as a code correctness concern. The LLM is a black box — you can’t unit test a neural network, so you don’t test at all.

Neither group has internalized that LLM applications are probabilistic systems with testable behavioral contracts. That’s a new thing. It requires a new practice.

That practice is agent regression testing. It needs to become as routine as writing unit tests. The tools to do it are simple — they just need to exist and be usable.

Start Today

You don’t need a framework. Here’s the minimum viable version:

Pick your three most critical agent behaviors.
Write a scenario YAML for each: input, and 2-3 terms that must appear in every valid output.
Run your agent against those scenarios and save the outputs as a baseline JSON.
On every deploy, run again and diff against the baseline.
Fail the deploy if outputs drift beyond your tolerance.

That’s it. If you want a tool that does this out of the box: github.com/fallenone269/agentprobe.

Agent testing is broken because nobody built the right tool yet. That’s a solvable problem.