It's 2am. Your agent ran 47 tool calls instead of 3. Your API bill spiked $200. The output is confidently wrong. No error was thrown. No alert fired. You have no idea what happened.
This is the reality of AI agent production failures — and they're fundamentally different from normal software bugs. Traditional code fails loudly: stack traces, exceptions, 500 errors. Agents fail quietly, producing plausible-looking wrong behavior five steps downstream from the actual cause. You can't grep for this. You can't set a breakpoint.
After shipping and debugging agent workflows in production, I've watched the same five failure patterns surface again and again. Here's what they look like, how to spot each one, and exactly how to fix them.
## Why AI Agents Are Hard to Debug
Normal software is deterministic. Given the same inputs, you get the same outputs. Failures are local — a function throws, a request returns 4xx, you fix that line.
Agents are different in three ways that make debugging genuinely hard:
- Non-determinism: The same prompt can produce different tool call sequences on different runs.
- Multi-step causality: The failure you see at step 8 was caused by a bad decision at step 2. The stack trace points nowhere useful.
- Silent success: The agent "completes" successfully — returns a result, exits cleanly — but the output is wrong. No exception. No alert. Users notice before you do.
To debug any agent failure, you need to answer three questions: What did it decide to do? Which tools did it call and in what order? Where did its behavior diverge from expected? Without execution traces, you're debugging blind. Every production agent needs at minimum a structured run log before you ship it.
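A run log doesn't need infrastructure to start. Here's a minimal sketch of one as a plain dict appended to a JSONL file — the field names are illustrative, not a standard, and map to the three questions above:

```python
import json
import time
import uuid

def new_run_log(agent_name: str) -> dict:
    """Create a structured log for one agent run."""
    return {
        "run_id": f"run_{uuid.uuid4().hex[:8]}",
        "agent": agent_name,
        "started_at": time.time(),
        "decisions": [],    # answers: what did it decide to do?
        "tool_calls": [],   # answers: which tools, in what order?
        "outcome": None,    # answers: where did behavior diverge?
    }

def log_tool_call(run_log: dict, tool: str, args: dict, status: str) -> None:
    run_log["tool_calls"].append({"tool": tool, "args": args, "status": status})

def persist(run_log: dict, path: str = "runs.jsonl") -> None:
    """Append the run as one JSON line you can grep and query later."""
    with open(path, "a") as f:
        f.write(json.dumps(run_log) + "\n")
```

Even this much lets you answer "what did run X actually do?" after the fact, which is the whole point.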
## Failure Mode #1: The Infinite Helpfulness Loop
What it looks like: Your agent keeps adding steps. It re-checks the inbox one more time. It retries the API call "just to be sure." It loops back to verify something it already verified. Step count balloons. Latency triples. Your bill doubles. Output quality doesn't improve.
This is the infinite helpfulness loop — an agent that has no concept of "done enough." It's trying to be thorough, but there's no stop condition, no step budget, no notion of diminishing returns.
How to spot it: Step count per run spikes suddenly with no corresponding improvement in output quality. Cost-per-successful-run increases. Execution time grows run over run. Add a `step_count` field to every run log and alert when it exceeds your expected ceiling.
The fix: Enforce hard budgets and define explicit stop conditions.
```python
import time

MAX_TOOL_CALLS = 10
MAX_RETRIES_PER_TOOL = 2
MAX_RUNTIME_SECONDS = 30

def run_agent(task):
    step_count = 0
    start_time = time.time()
    while not task.is_complete():
        # Explicit stop conditions: step budget and wall-clock budget
        if step_count >= MAX_TOOL_CALLS:
            return partial_result_with_escalation(task, reason="step_budget_exceeded")
        if time.time() - start_time > MAX_RUNTIME_SECONDS:
            return partial_result_with_escalation(task, reason="runtime_budget_exceeded")
        task.execute_next_step()
        step_count += 1
```
The key is the `partial_result_with_escalation` path — don't just error out. Return what you have, flag it for human review, and log the reason. Agents that hit budgets should surface as alerts, not silent billing spikes. Nebula agents have configurable step and runtime budgets built in, so loops escalate via email notification instead of showing up on your invoice.
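The escalation helper itself can be small. A minimal sketch of what `partial_result_with_escalation` might do (the `AgentResult` shape, the `task.partial_output()` accessor, and the alert hook are assumptions, not a fixed API):

```python
from dataclasses import dataclass, field

@dataclass
class AgentResult:
    status: str                  # "complete" | "partial"
    output: dict = field(default_factory=dict)
    escalation_reason: str = ""

def partial_result_with_escalation(task, reason: str) -> AgentResult:
    """Return whatever the agent has so far and flag it for review."""
    result = AgentResult(
        status="partial",
        output=task.partial_output(),  # assumed: task exposes its work-in-progress
        escalation_reason=reason,
    )
    # Surface the budget breach as an alert, not a silent billing spike.
    # Swap print for your email/Slack/pager hook.
    print(f"[ALERT] agent escalated: {reason}")
    return result
```

The important design choice is that the return type distinguishes "partial" from "complete", so downstream code can't mistake an escalated run for a clean one.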
## Failure Mode #2: Tool Schema Mismatch
What it looks like: The agent calls the right tool but with malformed arguments. The tool returns a 400 or 422. The agent retries with "close enough" parameters — sometimes successfully, sometimes not. Entire workflows silently degrade. You never see a hard failure; you just see a subtly wrong output.
How to spot it: A healthy agent run should have zero tool errors. If you see any 4xx errors in your tool call logs — even ones the agent recovered from — treat them as signal. Look for clusters: three calls to the same tool with incrementally different arguments is a schema mismatch in progress.
Classic example: a calendar booking agent that passes `date: "March 7th"` instead of `date: "2026-03-07"`. The API rejects it. The agent "corrects" it — maybe to `"March 7, 2026"`. Rejected again. Third attempt: `"03/07/2026"`. Finally accepted, but now you've burned three round trips and the agent has learned nothing for next time.
The fix: Validate tool inputs before the LLM sees them, and give the agent structured error codes it can actually reason about.
```python
from pydantic import BaseModel, validator
from datetime import datetime

class BookingInput(BaseModel):
    date: str  # must be ISO 8601: YYYY-MM-DD
    duration_mins: int
    attendee_email: str

    @validator('date')
    def validate_iso_date(cls, v):
        try:
            datetime.strptime(v, '%Y-%m-%d')
        except ValueError:
            raise ValueError(f"date must be ISO 8601 format (YYYY-MM-DD), got: {v!r}")
        return v
```
Beyond input validation, make your error responses agent-readable. `{"error": "invalid_date_format", "expected": "YYYY-MM-DD", "received": "March 7th"}` gives the LLM something to act on. `{"error": "Bad Request"}` gives it nothing.
Log the exact tool input on every call — not just success/failure — so you can diff what changed between retries.
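Both habits fit in one thin wrapper around every tool call. A sketch, with an illustrative `checked_tool_call` name and log shape (adapt to whatever your stack uses):

```python
def checked_tool_call(tool_fn, tool_name: str, args: dict, log: list) -> dict:
    """Execute a tool call, logging its exact input and returning structured errors."""
    entry = {"tool": tool_name, "input": args}  # exact input, not just pass/fail
    try:
        result = tool_fn(**args)
        entry["status"] = "ok"
        return result
    except ValueError as e:
        # Structured error the LLM can reason about, not a bare "Bad Request"
        entry["status"] = "error"
        error = {"error": "invalid_argument", "detail": str(e), "received": args}
        entry["error"] = error
        return error
    finally:
        log.append(entry)
```

Because every call logs its full input, diffing retry N against retry N+1 becomes a one-liner instead of guesswork.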
## Failure Mode #3: Retrieval Pollution (RAG Agents)
What it looks like: Your RAG agent retrieves context, reasons from it, and returns a confident, fluent answer — that's wrong. Not because the model hallucinated. Because it retrieved bad chunks and reasoned correctly from incorrect inputs.
This is the most misdiagnosed failure in AI systems. Teams assume it's a prompt problem or a model problem and spend days tuning — when the actual issue is upstream in the retrieval layer.
How to spot it: Watch for groundedness score drops, agents citing information outside the original query scope, or users reporting "hallucinations" that your logs show were actually sourced from retrieved context. Add retrieval metadata to every run log: chunk count, source IDs, and relevance scores.
The fix: Three constraints that catch most retrieval pollution:
- Cap chunk injection. Top-5 chunks maximum. Injecting 20 chunks fills the context window with noise and degrades reasoning quality on what actually matters.
- Score-gate your retrieval. Discard any chunk with a relevance score below your minimum threshold. Returning no context is better than returning low-confidence context.
- Log what was retrieved, not just what was returned. You need to know which chunks fed the answer to diagnose drift.
```python
def retrieve_context(query: str, index, top_k=5, min_score=0.75):
    results = index.search(query, top_k=top_k)
    # Score-gate: drop low-confidence chunks rather than inject noise
    filtered = [r for r in results if r.score >= min_score]
    # Log what was retrieved, not just what was returned
    log_retrieval(
        query=query,
        chunks_retrieved=len(results),
        chunks_used=len(filtered),
        sources=[r.source_id for r in filtered],
        scores=[r.score for r in filtered],
    )
    return filtered
```
If your agent doesn't use RAG today, flag this section for when you add memory or context retrieval — it will apply.
## Failure Mode #4: The Overconfident Wrong Answer
What it looks like: No errors. No loops. No retrieval issues. The agent completes the task, returns a result, exits cleanly. And the output is wrong. It extracted the wrong date from the contract. It classified the support ticket incorrectly. It generated a summary that missed the core point — but formatted it perfectly.
This is the hardest failure to catch because your monitoring says everything is fine. Step count: normal. Tool errors: zero. Runtime: fast. The agent succeeded at running; it failed at the actual job.
How to spot it: You can't catch this with logs alone. You need evaluation metrics — a definition of what "correct" looks like for each workflow. Without that definition, you'll only find out from users.
The fix: Add a lightweight verification step to every workflow that has a defined success criterion.
For structured outputs, validate against expected schema:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class VerificationResult:
    status: str
    contract_id: str
    reason: str = ""

def verify_extraction_output(result: dict, contract_id: str) -> VerificationResult:
    required_fields = ["effective_date", "party_a", "party_b", "term_months"]
    missing = [f for f in required_fields if not result.get(f)]
    if missing:
        return VerificationResult(
            status="failed",
            reason=f"Missing required fields: {missing}",
            contract_id=contract_id,
        )
    # Validate date is parseable
    try:
        datetime.strptime(result["effective_date"], "%Y-%m-%d")
    except ValueError:
        return VerificationResult(
            status="failed",
            reason=f"effective_date is not valid ISO format: {result['effective_date']}",
            contract_id=contract_id,
        )
    return VerificationResult(status="passed", contract_id=contract_id)
```
For free-text outputs, an LLM-as-judge check is cheap and effective: prompt a second, lightweight model to evaluate whether the output actually answers the original question. Track `task_success_rate` as a top-level metric alongside latency and error rate.
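The judge can be one prompt plus a thin function. A model-agnostic sketch, where `call_model` is a placeholder for whatever client you use and the PASS/FAIL protocol is one simple option among several:

```python
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer actually address the question? Reply with exactly PASS or FAIL."""

def llm_judge(question: str, answer: str, call_model) -> bool:
    """Ask a second, cheaper model to grade the output.

    `call_model` is any function that takes a prompt string and returns
    the model's text response; wire in your own client here.
    """
    verdict = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

Keeping the judge behind a plain callable means you can swap models (or stub it in tests) without touching the workflow code.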
## Failure Mode #5: Prompt Regression After an Update
What it looks like: Everything worked. Then you updated the system prompt, added a new tool, or swapped to a cheaper model. Now a workflow that ran perfectly for three weeks produces subtly wrong outputs. No error. No alert. Just quiet drift.
This is prompt regression — and it's entirely preventable. It happens because most teams don't treat prompts as versioned artifacts with regression tests. They edit in place, ship, and hope.
How to spot it: Monitor output quality metrics (not just error rates) across version changes. If your task success rate drops after a deployment, that's your signal. If you have no quality metrics, you won't notice until users complain.
Real scenario: a support triage agent that worked well on GPT-4o starts misclassifying 15% of tickets after you swap to a smaller model to cut costs. The error rate in logs is zero. Ticket routing is silently degraded.
The fix: Version prompts like code and run evals before every change ships.
```yaml
# eval_config.yaml
eval_set: golden_cases_v4.json   # 30-50 hand-labeled test cases
metrics:
  - task_success_rate
  - output_format_validity
  - classification_accuracy
threshold:
  task_success_rate: 0.90        # block deploy if below 90%
  classification_accuracy: 0.85
```

```python
# Run before any prompt or model change
def run_eval_gate(eval_config_path: str) -> bool:
    config = load_yaml(eval_config_path)
    results = run_eval_suite(config["eval_set"])
    for metric, threshold in config["threshold"].items():
        if results[metric] < threshold:
            print(f"EVAL FAILED: {metric} = {results[metric]:.2%} (threshold: {threshold:.2%})")
            return False
    return True  # safe to deploy
```
Your golden eval set doesn't need to be large — 30 to 50 cases that cover your core workflows and known edge cases is enough to catch regressions before they hit production.
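For reference, a single golden case can be as simple as an input plus expected labels. A hypothetical entry (the field names echo the eval config above, but there's no standard format):

```json
{
  "id": "golden_017",
  "input": "My invoice from last month was charged twice, please refund one.",
  "expected": {
    "classification": "billing",
    "priority": "high",
    "output_format": "ticket_json"
  },
  "notes": "Known edge case: refund requests were misrouted to 'general' in a past version"
}
```

The `notes` field is worth the extra keystrokes: every case that encodes a past regression keeps that regression from coming back.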
## The Minimal Observability Stack You Need Right Now
You don't need LangSmith, Langfuse, or a full APM platform to catch these five failures. You need five fields logged on every agent run:
- `run_id`, `agent_name`, `started_at`, `ended_at`, `cost_usd`
- Tool call list: `[{tool, status, duration_ms, retry_count}]`
- `step_count` and `budget_hit` (boolean)
- `outcome`: one of `success | partial | failed | escalated`
- `error_category` on failure: `tool_schema | loop | retrieval | output_quality | regression | unknown`
With just these five fields, you can diagnose 90% of production incidents. Here's the schema:
```json
{
  "run_id": "run_abc123",
  "agent": "contract-extractor-v3",
  "started_at": "2026-03-07T14:22:00Z",
  "ended_at": "2026-03-07T14:22:04Z",
  "duration_ms": 4821,
  "step_count": 6,
  "budget_hit": false,
  "tool_calls": [
    {"tool": "fetch_contract", "status": "ok", "duration_ms": 1200, "retries": 0},
    {"tool": "extract_fields", "status": "ok", "duration_ms": 890, "retries": 0},
    {"tool": "verify_output", "status": "ok", "duration_ms": 340, "retries": 0}
  ],
  "outcome": "success",
  "error_category": null,
  "cost_usd": 0.0043
}
```
Log this to a table or a flat JSON store. Query it when something goes wrong. Alert on `budget_hit: true` and any `outcome` that isn't `success`. That's your entire observability stack to start.
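Querying that store doesn't need extra infrastructure either. A sketch that scans a JSONL run log for runs worth a look, assuming one JSON object per line with the field names from the schema above:

```python
import json

def runs_needing_attention(log_path: str) -> list[dict]:
    """Return runs that hit a budget or didn't end in clean success."""
    flagged = []
    with open(log_path) as f:
        for line in f:
            run = json.loads(line)
            if run.get("budget_hit") or run.get("outcome") != "success":
                flagged.append(run)
    return flagged
```

Run it from a cron job and pipe non-empty results into email or Slack, and you have alerting for the price of ten lines.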
## How Nebula Handles This Out of the Box
If you're evaluating agent platforms, here's what to look for on observability:
- Full execution traces per run — every tool call, every decision, every state transition — stored automatically, queryable after the fact
- Built-in step and runtime budget controls that escalate to you (email, Slack) instead of silently billing
- Multi-agent delegation that bubbles sub-agent failures up to the orchestrator with full context — so you're not debugging a downstream agent with no visibility into what the parent asked it to do
Nebula runs agents with all of this built in. If you're wiring it up yourself, the schema above is your starting point.
## TL;DR
The five AI agent failures that will hit you in production:
- Loop (#1): No step budget. Agent retries forever. Fix: enforce `MAX_TOOL_CALLS` and escalate on breach.
- Schema mismatch (#2): LLM passes malformed tool args. Fix: validate inputs with Pydantic/Zod before the call.
- Retrieval pollution (#3): RAG agent reasons correctly from bad chunks. Fix: score-gate retrieval, cap chunk count, log sources.
- Overconfident wrong answer (#4): Agent "succeeds" but output is wrong. Fix: define success criteria per workflow, add verification step.
- Prompt regression (#5): A prompt/model update silently breaks behavior. Fix: version prompts, run evals before deploy.
The goal isn't perfect agents. It's agents that fail visibly — so you can find the failure, understand it, and improve. Silent failure is the enemy.
Which of these have you hit? Drop it in the comments — and if I missed a failure mode you've seen in the wild, I want to hear it.