Ismail Haddou

Posted on May 19

AI Agents in Production: Why They Fail Silently and How to Catch It

#discuss

Your agent passed all the tests. It's been running in production for three weeks. And it has been quietly wrong the entire time.

This is not a hypothetical. Galileo's 2026 production data shows multi-agent systems failing at rates between 41% and 86.7% in real deployments. Datadog logged 8.4 million rate limit errors in a single measurement window. Gartner predicts 40% of agentic AI projects will be scrapped by 2027.

The models are not the problem. The architecture is.

Here is what is actually breaking production agents and what to do about it.

The Core Mismatch

Traditional software fails loudly. An exception is raised, a stack trace is logged, an alert fires. You know immediately something is wrong.

Agents fail convincingly. The model does not throw an exception when it misunderstands a prompt — it generates a plausible-looking output based on a slightly wrong interpretation. Chain ten of those together and you have a system that is confidently producing garbage with no signal that anything is wrong.

The teams getting burned by this are not doing anything obviously wrong. They tested their agents. The tests passed. The problem is that they tested for "does the agent return a value?" instead of "does the agent return a correct value?" — and those are completely different things once you are running on real production data.
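
To make that difference concrete, here is a minimal sketch. `extract_price` is a hypothetical stand-in for an agent step (a real one would call a model), and both test helpers are illustrative names, not part of any framework.

```python
import re

def extract_price(text: str) -> float:
    # Hypothetical agent step: deliberately naive, grabs the first number it sees.
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0

def weak_test(text: str) -> bool:
    # Passes whenever the step returns any float at all.
    return isinstance(extract_price(text), float)

def strong_test(text: str, expected: float) -> bool:
    # Passes only when the value matches a hand-labelled expectation.
    return abs(extract_price(text) - expected) < 1e-9

sample = "Pack of 3 for $19.99"
print(weak_test(sample))           # True
print(strong_test(sample, 19.99))  # False: the step returned 3.0
```

The weak test stays green forever. The strong test needs labelled examples drawn from real production inputs, which is exactly the work that tends to get skipped.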

The Four Failure Modes That Actually Kill Production Agents

1. Specification drift

Your prompt was written for the happy path. Production surfaces edge cases your prompt never anticipated. The model improvises and the outputs start diverging from your intent in ways that only become visible when you look at them carefully.

2. Error compounding in multi-step pipelines

Step 3 gets slightly malformed output from Step 2. Step 3 has no validation, so it processes it. Step 4 receives corrupted context and generates a confident, well-formatted, completely wrong result. No step failed. No exception was raised. The pipeline ran to completion.

3. Context window degradation

Long-running agents fill the context window. Earlier instructions get truncated or summarized away by the agent framework, or simply receive less of the model's attention. Your agent at step 40 is running with different effective context than at step 4. If you never tested step 40, you do not know what it does.
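
One common mitigation is to pin the instructions and trim only the conversation history. The sketch below assumes a character count as a rough proxy for tokens; `build_context` and its parameters are hypothetical names, and a real implementation would use the tokenizer for your model.

```python
def build_context(system_prompt: str, history: list, max_chars: int) -> list:
    """Keep the system prompt pinned and drop the oldest turns first."""
    budget = max_chars - len(system_prompt)
    kept = []
    # Walk history newest-first so the most recent turns survive trimming.
    for turn in reversed(history):
        if len(turn) > budget:
            break
        kept.append(turn)
        budget -= len(turn)
    # Restore chronological order, with the instructions always in front.
    return [system_prompt] + list(reversed(kept))

context = build_context("RULES", ["aaaa", "bbbb", "cccc"], max_chars=13)
print(context)  # ['RULES', 'bbbb', 'cccc']
```

Whatever trimming strategy you use, the point is that the instructions at step 40 are the same ones the agent saw at step 4.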

4. Unhandled API failures

Rate limits, timeouts, and transient errors happen in every production system. If your agent has no retry logic and no fallback behavior, a 429 silently terminates the pipeline or produces a partial output that gets treated as complete.

What Production-Grade Agent Architecture Requires

Schema Validation on Every Output

Every agent step that produces data should be validated against a strict schema before it touches anything downstream.

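import json
from typing import Optional

from pydantic import BaseModel, ValidationError, validator

class AgentValidationError(Exception):
    """Raised when a step's output fails schema validation."""

class ExtractionResult(BaseModel):
    product_name: str
    price: float
    availability: bool
    sku: Optional[str] = None

    # Pydantic v1-style validator; on Pydantic v2, use field_validator instead.
    @validator('price')
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError(f'Invalid price: {v}')
        return v

def extract_product_data(raw_llm_output: str) -> ExtractionResult:
    try:
        parsed = json.loads(raw_llm_output)
        return ExtractionResult(**parsed)
    except (json.JSONDecodeError, ValidationError, TypeError) as e:
        # TypeError covers valid JSON that is not an object (for example, a list).
        raise AgentValidationError(f"Step failed schema validation: {e}") from e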

When validation fails, you do not pass it forward. You route to a fallback path or stop the pipeline. Silent propagation of bad data is the thing that costs you six weeks.
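
Here is one way the "fallback or stop" routing could look. This is a standard-library sketch so it runs on its own; `StepValidationError`, `parse_product`, and `run_step` are hypothetical names, and the price check is a simplified stand-in for a real schema.

```python
import json
from typing import Callable, Optional

class StepValidationError(Exception):
    """Hypothetical error raised when a step's output fails validation."""

def parse_product(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise StepValidationError("Expected a JSON object")
    price = data.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        raise StepValidationError(f"Invalid price: {price!r}")
    return data

def run_step(raw: str, fallback: Optional[Callable[[str], dict]] = None) -> dict:
    try:
        return parse_product(raw)
    except (json.JSONDecodeError, StepValidationError):
        if fallback is None:
            raise  # no fallback configured: stop the pipeline here
        return fallback(raw)

flagged = run_step("not json", fallback=lambda raw: {"status": "needs_review"})
print(flagged)  # {'status': 'needs_review'}
```

The design choice that matters is that bad output never reaches the next stage unmarked: it is either replaced by an explicitly flagged result or the exception propagates.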

Verification Gates Between Pipeline Stages

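import logging

def alert(message: str) -> None:
    # Placeholder: wire this to your paging or chat alerting.
    logging.getLogger("agent").error(message)

def verify_stage_output(output: dict, stage_name: str) -> bool:
    # One cheap sanity check per stage; stages without a check pass through.
    checks = {
        "extraction": lambda o: all(k in o for k in ["price", "sku", "availability"]),
        "enrichment": lambda o: o.get("confidence_score", 0) > 0.7,
        "report_gen": lambda o: len(o.get("summary", "")) > 100,
    }
    check = checks.get(stage_name)
    if check and not check(output):
        alert(f"Stage {stage_name} failed verification gate")
        return False
    return True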
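
A gate only helps if the pipeline actually stops when it fails. Below is a minimal sketch of a runner that halts at the first failed gate. `run_pipeline` and the three toy stages are hypothetical, and each gate is passed in directly so the example is self-contained.

```python
from typing import Callable, Optional

def run_pipeline(stages: list, payload: dict) -> tuple:
    """Run (name, step, gate) stages in order.

    Returns (last payload that passed its gate, name of the failed stage or None).
    """
    for name, step, gate in stages:
        result = step(payload)
        if not gate(result):
            return payload, name  # halt: do not hand bad output downstream
        payload = result
    return payload, None

stages = [
    ("extraction", lambda p: {**p, "price": 19.99}, lambda o: "price" in o),
    ("enrichment", lambda p: {**p, "confidence_score": 0.4},
     lambda o: o.get("confidence_score", 0) > 0.7),
    ("report_gen", lambda p: {**p, "summary": "..."}, lambda o: True),
]

final, failed = run_pipeline(stages, {"sku": "A1"})
print(failed)  # enrichment
print(final)   # {'sku': 'A1', 'price': 19.99}
```

Returning the name of the failed stage gives your alerting something specific to report, and the report stage never runs on low-confidence data.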

Sampling-Based Production Monitoring

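import random

def route_for_quality_check(output: dict, sample_rate: float = 0.03):
    # Send roughly 3% of outputs to human review; the rest pass straight through.
    # send_to_review_queue() is whatever pushes an item into your review tool.
    if random.random() < sample_rate:
        send_to_review_queue(output)
    return output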

Wire this to a review interface where a team member can mark outputs as correct or incorrect. Track the error rate over time. If it moves, something in your pipeline has drifted.
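
Tracking that error rate can be as simple as a rolling window over reviewer labels. The sketch below is one possible shape. `ErrorRateTracker` is a hypothetical name, and the window size, baseline, and tolerance are made-up numbers you would tune to your own pipeline.

```python
from collections import deque

class ErrorRateTracker:
    """Rolling error rate over the most recent reviewed outputs."""

    def __init__(self, window: int = 200, baseline: float = 0.02, tolerance: float = 0.03):
        self.labels = deque(maxlen=window)  # True = reviewer marked it correct
        self.baseline = baseline            # assumed normal error rate
        self.tolerance = tolerance          # how far above baseline counts as drift

    def record(self, is_correct: bool) -> None:
        self.labels.append(is_correct)

    def error_rate(self) -> float:
        if not self.labels:
            return 0.0
        return self.labels.count(False) / len(self.labels)

    def has_drifted(self) -> bool:
        return self.error_rate() > self.baseline + self.tolerance

tracker = ErrorRateTracker(window=10)
for label in [True] * 8 + [False] * 2:
    tracker.record(label)
print(tracker.error_rate())   # 0.2
print(tracker.has_drifted())  # True
```

With a 3% sample rate the window fills slowly, so expect this to catch sustained drift rather than a single bad hour.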

Explicit Retry and Fallback Logic

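import anthropic
from anthropic import RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Retry only on rate limits, with exponential backoff between 4 and 60 seconds.
# After the third failed attempt tenacity raises RetryError; catch it upstream
# to trigger your fallback path.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RateLimitError)
)
def call_model(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text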
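
Retries cover the transient case. The fallback half is what happens once retries are exhausted. Here is a minimal sketch, written without tenacity so it runs standalone: `TransientError`, `call_with_fallback`, and the model names are all hypothetical, and `call` stands in for whatever function actually hits your provider.

```python
import time
from typing import Callable, Sequence

class TransientError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""

def call_with_fallback(
    call: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    # Try each model in order; retry transient failures with exponential backoff.
    for model in models:
        for attempt in range(attempts):
            try:
                return call(prompt, model)
            except TransientError:
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    # Fail loudly instead of returning a partial or empty result.
    raise RuntimeError(f"All models failed after {attempts} attempts each")

def flaky_call(prompt: str, model: str) -> str:
    if model == "primary-model":
        raise TransientError("429")
    return f"ok from {model}"

result = call_with_fallback(
    flaky_call, "hello", ["primary-model", "backup-model"], sleep=lambda s: None
)
print(result)  # ok from backup-model
```

The final `raise` is the important line: when every option is exhausted, the pipeline stops with an error rather than passing along something that looks complete.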

Structure Change Detection for Data Extraction Agents

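import hashlib

def structure_fingerprint(element) -> str:
    # Hash tag names and class attributes only. Hashing str(element) would also
    # hash the text, so every routine price change would look like a redesign.
    parts = [
        f"{tag.name}.{'.'.join(sorted(tag.get('class') or []))}"
        for tag in [element, *element.find_all(True)]
    ]
    return hashlib.md5("/".join(parts).encode()).hexdigest()

def check_site_structure(soup, selector: str, stored_hash: str) -> bool:
    # stored_hash must come from structure_fingerprint() on a known-good page.
    # alert() is your alerting hook.
    element = soup.select_one(selector)
    if element is None:
        alert(f"Selector {selector} not found; site structure may have changed")
        return False
    if structure_fingerprint(element) != stored_hash:
        alert(f"Structure change detected at {selector}")
        return False
    return True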

This is a few lines of logic that would have caught six weeks of silent failure in a real client pipeline we worked on. The agent had been running for a month pulling competitive pricing data. Three target sites updated their HTML in week two. No error was raised. The sales team found out when a deal did not make sense.

The Point

The hard part of building agents in production is not getting the model to generate good output in a demo. That is straightforward. The hard part is building the validation, observability, and failure-handling layer that makes the system reliable when it encounters inputs you never anticipated.

Every production AI agent deserves the same operational rigor you would give any other critical pipeline: schema validation, monitoring, alerting, retry logic, and a clear answer to what happens when this step fails.

If you are hitting this in production and want a second set of eyes, feel free to DM me — happy to dig in.

DEV Community