J. S. Morris

Why Your AI Agent Works in Demo But Fails in Production


And the 5 failure modes nobody tests for.


Every agent demo is a magic trick. You walk through the happy path, the LLM nails the tool call, the output is clean, and the audience nods. Ship it.

Then production happens.

The agent loops on ambiguous inputs. It hallucinates tool parameters that pass schema validation but produce garbage downstream. It burns $40 in API calls on a task that should cost $0.12. It works perfectly 93% of the time — and the other 7% files a support ticket, or worse, executes a wrong action with full confidence.

This isn’t a prompting problem. It’s an evaluation problem. And the reason most teams don’t catch these failures before users do is that they’re testing the wrong things.

I’ve spent the last year building AgentProbe, an open-source evaluation framework for agentic systems. Here are the five failure modes I see teams miss over and over — and the specific tests that catch them.


1. The Confident Wrong Turn

What it looks like: The agent selects a tool, passes valid parameters, receives a valid response — and it’s the completely wrong tool for the task.

This is the most dangerous failure mode because nothing errors out. Your logs look clean. Your schema validation passes. The agent just… did the wrong thing. Confidently.

Traditional evals miss this because they test tool calls in isolation: “Given this prompt, did the agent call the right function?” That works for single-turn interactions. In multi-step workflows, the problem is rarely that the agent can’t call the right tool — it’s that it calls a plausible tool when the correct tool requires contextual reasoning across prior steps.

How to catch it: Test the full decision chain, not individual calls. Define expected tool sequences for representative scenarios, then assert on the path, not just the final output. In AgentProbe, this looks like:

```python
from agentprobe import probe, expect_tool_sequence

@probe("multi-step-booking-flow")
async def test_booking_requires_availability_before_reserve():
    result = await agent.run("Book the cheapest available room for March 15")
    expect_tool_sequence(result, [
        "search_availability",  # Must check availability first
        "compare_prices",       # Then compare
        "create_reservation"    # Then book
    ])
```

If your agent jumps straight to create_reservation because it “remembers” a room from a previous conversation turn, that’s a failure — even if the booking succeeds.
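Under the hood, a path assertion doesn't need a framework. Here's a minimal, framework-agnostic sketch (not AgentProbe's actual implementation) that checks an expected tool order appears, in order, within a recorded trace of tool-call names:

```python
from typing import List

def assert_tool_subsequence(called: List[str], expected: List[str]) -> None:
    """Fail unless `expected` appears in order (gaps allowed)
    within the recorded tool-call names."""
    remaining = iter(called)
    for tool in expected:
        if tool not in remaining:  # consumes the iterator up to the match
            raise AssertionError(
                f"expected {expected} in order, got {called}; "
                f"'{tool}' missing or out of order"
            )

# An agent that ran an extra step in between still passes:
good = ["search_availability", "compare_prices", "create_reservation"]
assert_tool_subsequence(good, ["search_availability", "create_reservation"])
```

Asserting on a subsequence rather than an exact match tolerates benign extra calls (logging, a retry) while still failing on skipped or reordered steps.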


2. The Invisible Cost Explosion

What it looks like: The agent completes the task correctly. The output is great. It cost 47x what it should have.

This happens when agents enter reasoning loops — restating the problem, re-reading context, calling tools redundantly, or generating intermediate chain-of-thought that balloons token consumption without improving output quality. In development, nobody notices because you’re watching behavior, not spend.

In production at scale, this is how you get a $12,000 bill for a feature that was projected to cost $800/month.

How to catch it: Set cost and token budgets per task class and treat overruns as test failures. This isn’t monitoring — it’s a pre-deployment gate.

```python
from agentprobe import probe, BudgetConstraint

@probe("summarize-document", constraints=[
    BudgetConstraint(max_tokens=4000, max_tool_calls=3)
])
async def test_summary_stays_within_budget():
    result = await agent.run("Summarize this quarterly report")
    assert result.output_quality_score > 0.85
    # Test passes only if quality is high AND cost is within bounds
```

The insight: quality without cost-awareness is a demo metric, not a production metric.
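If you aren't using a harness with budget constraints built in, the same gate can be hand-rolled. A sketch, with illustrative per-token prices (real prices vary by model and provider) and a hypothetical `Usage` record standing in for whatever your client library reports:

```python
from dataclasses import dataclass

# Illustrative prices per 1K tokens -- NOT real provider pricing.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    tool_calls: int

def run_cost_usd(usage: Usage) -> float:
    return (usage.input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (usage.output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def within_budget(usage: Usage, max_usd: float, max_tool_calls: int) -> bool:
    return run_cost_usd(usage) <= max_usd and usage.tool_calls <= max_tool_calls

# A looping agent that re-read its context several times blows the gate,
# even though its final answer may be perfectly correct:
looping = Usage(input_tokens=220_000, output_tokens=9_000, tool_calls=11)
assert not within_budget(looping, max_usd=0.25, max_tool_calls=3)
```

The gate is intentionally two-sided: a run can fail on dollars or on tool-call count, whichever trips first.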


3. The State Bleed

What it looks like: Agent A handles User 1’s request and retains context that leaks into Agent A’s handling of User 2’s request. Or: a sub-agent inherits parent context that changes its behavior in ways the orchestrator didn’t intend.

This is the multi-agent version of a global variable bug, and it’s endemic in frameworks that pass context through shared memory, vector stores, or poorly scoped conversation histories.

The symptom is non-determinism that you can’t reproduce in isolation. The agent works fine in unit tests. In integration tests with concurrent users or multi-agent pipelines, it produces subtly wrong outputs — different every time, depending on who else hit the system recently.

How to catch it: Run identical inputs through the agent under concurrent load and assert on output consistency. If the same input produces materially different outputs depending on system state, you have a bleed.

```python
from agentprobe import probe, isolation_test

@probe("context-isolation")
async def test_no_cross_user_contamination():
    results = await isolation_test(
        agent=agent,
        input="What's my account balance?",
        user_contexts=[
            {"user_id": "alice", "balance": 1500},
            {"user_id": "bob", "balance": 42000},
        ],
        concurrency=10,
        runs_per_context=5
    )
    # Each user's responses must reference ONLY their own data
    for run in results.runs_for("alice"):
        assert "42000" not in run.output
    for run in results.runs_for("bob"):
        assert "1500" not in run.output
```

If you’re building multi-agent systems and you’re not testing for context isolation under concurrency, you are shipping a data leak.
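A framework-free version of the same check is easy to sketch. Here `fake_agent` is a stand-in for a real agent; the harness is the point: fan the same input out concurrently across user contexts, then assert that no context's private data surfaces in another context's output:

```python
import asyncio

async def fake_agent(user_ctx: dict, message: str) -> str:
    # Stand-in for a real agent; a correct agent reads ONLY its own context.
    return f"Your balance is {user_ctx['balance']}"

async def isolation_check(agent, contexts, message, runs=5):
    """Run the same input concurrently across user contexts and assert
    no context's private value appears in another context's outputs."""
    tasks = [agent(ctx, message) for ctx in contexts for _ in range(runs)]
    outputs = await asyncio.gather(*tasks)  # gather preserves input order
    grouped = [outputs[i * runs:(i + 1) * runs] for i in range(len(contexts))]
    for i, ctx in enumerate(contexts):
        foreign = [str(c["balance"]) for j, c in enumerate(contexts) if j != i]
        for out in grouped[i]:
            assert not any(f in out for f in foreign), \
                f"bleed into {ctx['user_id']}: {out!r}"

contexts = [
    {"user_id": "alice", "balance": 1500},
    {"user_id": "bob", "balance": 42000},
]
asyncio.run(isolation_check(fake_agent, contexts, "What's my account balance?"))
```

An agent that stashes anything in module-level or shared mutable state fails this check the moment runs interleave, which is exactly the condition unit tests never create.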


4. The Graceful Degradation Failure

What it looks like: A downstream tool times out, an API returns a 500, a rate limit kicks in — and the agent either hangs indefinitely, retries until it exhausts your budget, or surfaces a raw error trace to the user.

Most teams test the happy path exhaustively and the sad path not at all. But in production, your agent will encounter degraded dependencies. The question is whether it fails gracefully — with a useful fallback, a clear error message, and bounded retry behavior — or whether it fails catastrophically.

How to catch it: Inject failures into the tool layer and assert on recovery behavior.

```python
from agentprobe import probe, inject_fault

@probe("api-timeout-recovery")
async def test_agent_handles_tool_timeout():
    with inject_fault("weather_api", fault="timeout", after_ms=3000):
        result = await agent.run("What's the weather in Birmingham?")

    assert result.completed  # Agent didn't hang
    assert result.tool_retries <= 3  # Bounded retry
    assert "unable" in result.output.lower() or "try again" in result.output.lower()
    # Agent communicated the failure, didn't hallucinate weather data
```

The worst version of this failure is when the agent hallucinates a response instead of admitting the tool failed. It confidently tells the user it’s 72°F and sunny when it never successfully called the weather API. This is a trust-destroying failure, and the only way to catch it is to simulate the fault path.
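The retry-and-admit behavior itself is simple to implement; what matters is that the fallback is an explicit "I couldn't," leaving the model no room to improvise. A minimal sketch (`ToolTimeout` and the result shape are illustrative, not from any particular library):

```python
import time

class ToolTimeout(Exception):
    """Illustrative error type for a tool call that timed out."""

def call_with_retry(tool, *args, max_retries=3, backoff_s=0.5):
    """Retry a flaky tool a bounded number of times, then return an
    explicit failure the agent must surface -- never invent data for."""
    for attempt in range(1, max_retries + 1):
        try:
            return {"ok": True, "data": tool(*args)}
        except ToolTimeout:
            if attempt == max_retries:
                return {"ok": False,
                        "error": "service unavailable, please try again later"}
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```

If the system prompt instructs the model to relay `error` verbatim whenever `ok` is false, a confident "72°F and sunny" can't reach the user through this path.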


5. The Regression You Didn’t Know You Shipped

What it looks like: You update your system prompt, swap a model version, or change a tool schema. Your existing tests still pass. But a behavior the tests don’t cover — something users depend on — silently breaks.

This is the most common failure mode in teams that do test their agents. The tests are too narrow. They cover the scenarios you thought of when you wrote them, but they don’t cover the emergent behaviors that users discovered and came to rely on.

How to catch it: Behavioral regression testing across prompt and model changes. Record production interactions as golden datasets, then replay them after every change.

```python
from agentprobe import probe, golden_dataset

@probe("regression-suite")
async def test_no_behavioral_regression():
    dataset = golden_dataset("production_interactions_v12.jsonl")

    results = await agent.run_batch(dataset.inputs)

    regression_report = dataset.compare(results, threshold=0.90)
    assert regression_report.pass_rate > 0.95
    # Up to 5% degradation allowed for model swaps
    # but any critical-path regression is a hard failure
    assert len(regression_report.critical_regressions) == 0
```

This is the test that turns “we think this prompt change is safe” into “we measured that this prompt change is safe.” It’s the difference between engineering and hope.
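The comparison mechanics can be sketched with nothing but the standard library. `difflib`'s ratio is a crude stand-in for whatever semantic similarity measure you would actually use; the structure — a threshold pass rate plus a zero-tolerance critical list — is the part that matters:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude character-level similarity; swap in an embedding-based
    # or LLM-judged score for real semantic comparison.
    return SequenceMatcher(None, a, b).ratio()

def replay_report(golden, candidate, threshold=0.90, critical_ids=()):
    """Compare new outputs against recorded golden outputs.
    Returns (pass_rate, critical_regressions)."""
    passes, critical_regressions = 0, []
    for case_id, (old, new) in enumerate(zip(golden, candidate)):
        if similarity(old, new) >= threshold:
            passes += 1
        elif case_id in critical_ids:
            critical_regressions.append(case_id)
    return passes / len(golden), critical_regressions

golden = ["Refund issued to original payment method.",
          "Your order ships in 2-3 business days."]
new = ["Refund issued to the original payment method.",
       "We cannot help with that."]

rate, critical = replay_report(golden, new, critical_ids={1})
assert critical == [1]  # critical-path drift -> hard failure
```

Small wording drift (case 0) clears the threshold; the critical-path case that changed meaning (case 1) is flagged regardless of the aggregate pass rate.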


The Evaluation Gap Is the Product Gap

The current landscape of agent tooling is rich in orchestration — LangChain, CrewAI, AutoGen, dozens of others helping you build and run agents. It’s remarkably poor in evaluation — helping you prove those agents actually work reliably before your users do the testing for you.

That’s the gap AgentProbe is built to fill. It’s an open-source evaluation framework purpose-built for agentic systems: deterministic assertions on non-deterministic behavior, cost-aware testing, fault injection, context isolation validation, and behavioral regression tracking.

If you’re building agents and you don’t have answers to these five questions, you’re not ready for production:

  1. Does my agent take the right path, not just produce the right output?
  2. Does it stay within cost bounds under real workloads?
  3. Does it maintain strict context isolation under concurrency?
  4. Does it degrade gracefully when dependencies fail?
  5. Can I measure behavioral regression across every prompt and model change?

AgentProbe on GitHub →


Building agentic systems that need to work in production, not just in demo? Star AgentProbe and start testing what actually matters.
