Your AI Agent Said 'Done.' Here's How I Found Out It Actually Failed Three Hours Later.

#ai #llm #agents #llmtools

Last Tuesday my AI agent sent 47 incorrect pricing emails to active customers. It took three hours before a human noticed and flagged it. By then the damage was done: three customers had already replied, confused, and one had escalated to support.

The agent had completed its task. It logged "Done." It moved on. The failure happened silently, in a place the logs never looked.

This is the production observability gap nobody talks about in demos. We spend enormous effort making agents do more. We spend almost nothing making sure we know when they've failed.

The Difference Between Logs and Observability

Most agent frameworks give you logs. What you need is observability — the ability to ask "did the agent actually accomplish what I asked, and how do I know?"

There's a structural reason this is hard. When a traditional service fails, you get an exception, a stack trace, a 500 error. When an AI agent fails, it usually fails silently — it produces a plausible but wrong output, or it uses the wrong tool, or it completes a task that was itself based on stale data. The error is in the output, not in the execution.

Here is the pattern I see repeatedly:

# What most people ship
result = agent.run(task)
logger.info(f"Agent completed: {result}")

That tells you the agent ran. It tells you nothing about whether the result is correct, whether the agent did what you expected, or whether it even worked on the right problem.

Three hours of wasted time could have been avoided with a 15-minute observability layer.

The Observability Stack That Actually Works

After shipping a half-dozen agent systems in production, I've settled on a minimal observability stack that covers the failure modes I actually encounter. It has four components.

1. Structured Trace Context

Every agent session gets a UUID the moment it starts. Every log line, every tool call, every model response includes that ID. Without this, you cannot correlate what happened across a multi-step agent run.

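import contextvars
import json
import uuid
from datetime import datetime

# Lets code that never receives the session object still read the current trace ID
trace_id_var = contextvars.ContextVar("trace_id", default=None)

class AgentSession:
    def __init__(self, task_description: str):
        # Short ID for easy grepping; keep the full UUID if you run many sessions
        self.trace_id = str(uuid.uuid4())[:8]
        trace_id_var.set(self.trace_id)
        self.task = task_description
        self.started_at = datetime.utcnow()
        self.tool_calls = []
        self.steps = []

    def log(self, level: str, message: str, **kwargs):
        # One JSON object per line, always tagged with the trace ID
        print(json.dumps({
            "ts": datetime.utcnow().isoformat(),
            "trace_id": self.trace_id,
            "level": level,
            "message": message,
            **kwargs
        }))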

Pass this session object through every tool call. When something breaks, you grep for the trace ID and get the full execution history.
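
When a tool sits several calls deep, threading the session through every signature gets tedious. One alternative is to carry the trace ID in a context variable. This is a minimal sketch, not part of the stack above: `start_trace`, `log_event`, `lookup_price`, and `run_traced` are hypothetical names, and the price is a made-up value.

```python
import contextvars
import json
import uuid

# Holds the trace ID for whichever agent run is currently executing.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace() -> str:
    """Create a short trace ID and bind it to the current context."""
    trace_id = uuid.uuid4().hex[:8]
    trace_id_var.set(trace_id)
    return trace_id

def log_event(level: str, message: str, **fields) -> None:
    """Emit one JSON log line tagged with the current trace ID."""
    print(json.dumps({
        "trace_id": trace_id_var.get(),
        "level": level,
        "message": message,
        **fields,
    }))

def lookup_price(sku: str) -> float:
    """Hypothetical tool: it never receives a session, yet its logs are tagged."""
    log_event("INFO", "price_lookup", sku=sku)
    return 19.99  # made-up value for illustration

def run_traced(sku: str) -> tuple[str, float]:
    """Start a trace, call the tool, and return the trace ID with the result."""
    trace_id = start_trace()
    return trace_id, lookup_price(sku)
```

Context variables are scoped per thread and per asyncio task, so concurrent agent runs should not see each other's IDs, provided each run calls `start_trace` in its own thread or task.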

2. Tool Call Shadow Validation

The most common failure mode isn't "the agent crashed" — it's "the agent called the right tool with the wrong parameters" or "the agent called a reasonable tool but the output was garbage."

Capture both the call and a structural validation of the response, without slowing down the agent:

def tracked_tool_call(session: AgentSession, tool_name: str, params: dict, result):
    # Record every call on the session so it can be replayed by trace ID
    session.tool_calls.append({
        "tool": tool_name,
        "params": params,
        "result_preview": str(result)[:200],  # first 200 chars
        "timestamp": datetime.utcnow().isoformat()
    })

    # Validate response shape if we know what to expect
    if tool_name == "send_email":
        if not isinstance(result, dict) or "message_id" not in result:
            # session.log already adds the trace ID to every line
            session.log("ERROR", "email_tool_returned_unexpected_shape",
                        result=str(result)[:100])

    return result

This catches the class of failure where the tool runs but returns something useless — without blocking the agent's execution.
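
The `if tool_name == ...` branch gets awkward once more than a couple of tools need checks. One possible refactor, sketched here under the assumption that every tool returns a dict, is a registry that maps tool names to shape validators. `require_keys`, `VALIDATORS`, `validate_shape`, and the `query_customers` tool are hypothetical names.

```python
from typing import Any, Callable, Optional

# A validator returns None when the shape is fine, or a short reason string.
Validator = Callable[[Any], Optional[str]]

def require_keys(*keys: str) -> Validator:
    """Build a validator that checks the result is a dict containing the given keys."""
    def check(result: Any) -> Optional[str]:
        if not isinstance(result, dict):
            return f"expected dict, got {type(result).__name__}"
        missing = [k for k in keys if k not in result]
        return f"missing keys: {missing}" if missing else None
    return check

VALIDATORS: dict[str, Validator] = {
    "send_email": require_keys("message_id"),
    "query_customers": require_keys("rows", "queried_at"),  # hypothetical tool
}

def validate_shape(tool_name: str, result: Any) -> Optional[str]:
    """Return a problem description, or None if the result looks structurally sane."""
    validator = VALIDATORS.get(tool_name)
    return validator(result) if validator else None
```

A tool with no registered validator passes silently, which keeps the wrapper non-blocking. It also means a new tool gets no checks until someone registers one.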

3. Completion Criteria Assertions

Before the agent marks a task complete, run a set of assertions against the output. These are not unit tests — they are runtime checks against the actual result.

def validate_completion(session: AgentSession, task: str, result) -> bool:
    # Each check is a (name, zero-argument callable returning bool) pair
    checks = [
        ("result_is_string", lambda: isinstance(result, str) and len(result) > 10),
        ("no_obvious_hallucination_markers", lambda: not any(
            phrase in str(result).lower()
            for phrase in ["as an ai", "i cannot", "i'm sorry", "undefined"]
        )),
    ]

    results = []  # outcome of every check, passed or failed
    for name, check_fn in checks:
        try:
            ok = check_fn()
            results.append({"check": name, "ok": ok})
            if not ok:
                session.log("WARN", "completion_check_failed", check=name)
        except Exception as e:
            # A check that crashes counts as a failed check
            results.append({"check": name, "ok": False, "error": str(e)})
            session.log("ERROR", "completion_check_error", check=name,
                        error=str(e))

    return all(r["ok"] for r in results)

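These checks won't catch every failure. Generic ones like these catch the cheap-to-spot silent failures: empty output, refusals, unrendered placeholders. Catching a confident wrong answer that looks legitimate until you read it takes checks tied to the task itself, such as comparing every price in an outgoing email against the system of record.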
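
The checks above are generic. For the pricing emails in this story, a task-specific check might compare every price quoted in an email body against a set of allowed prices. This is a sketch under assumptions: `prices_match_catalog` is a hypothetical helper, prices are assumed to appear as `$49` or `$49.00`, and the allowed set is assumed to come from a trusted source outside the agent.

```python
import re
from decimal import Decimal

# Matches "$49" or "$49.00". Thousands separators ("$1,299.00") are not handled.
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")

def prices_match_catalog(email_body: str, allowed_prices: set[Decimal]) -> bool:
    """True only if the email quotes at least one price and every quoted price is allowed."""
    quoted = [Decimal(m) for m in PRICE_PATTERN.findall(email_body)]
    return bool(quoted) and all(price in allowed_prices for price in quoted)
```

Treating "no price found" as a failure is a deliberate choice in this sketch: a pricing email with no recognizable price is more likely a formatting problem than a valid message.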

4. Time and Cost Budgeting

Every agent run gets a time budget and a token budget. When either is exceeded, the agent stops — even if it hasn't reached a conclusion. This prevents the "agent runs for 20 minutes and outputs nothing useful" failure mode.

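import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

logger = logging.getLogger(__name__)

MAX_DURATION_SECONDS = 120
MAX_TOKENS = 8000  # not enforced here; check it inside the agent loop, where usage is visible

def run_with_budget(agent, task: str) -> str:
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent.run, task)
    try:
        # Stop waiting as soon as the time budget is spent
        return future.result(timeout=MAX_DURATION_SECONDS)
    except FutureTimeout:
        elapsed = time.monotonic() - start
        # Budget exceeded: fail loudly
        logger.error(f"Budget exceeded: {elapsed:.1f}s > {MAX_DURATION_SECONDS}s")
        raise TimeoutError("Agent exceeded time budget")
    finally:
        # A running thread can't be killed; use a subprocess if you need a hard stop
        pool.shutdown(wait=False)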
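
`MAX_TOKENS` is declared above but `run_with_budget` never checks it, because token usage is only visible inside the agent loop. A minimal sketch of that half follows, assuming the agent can be driven step by step and that each step reports how many tokens it used. `TokenBudget`, `BudgetExceeded`, and `run_steps` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

class BudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than it was allotted."""

@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        """Record token usage and fail loudly once the limit is crossed."""
        self.used += tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"token budget exceeded: {self.used} > {self.limit}")

def run_steps(steps: Iterable[Callable[[], tuple[str, int]]],
              budget: TokenBudget) -> list[str]:
    """Run steps in order; each hypothetical step returns (output, tokens_used)."""
    outputs = []
    for step in steps:
        output, tokens = step()
        budget.charge(tokens)  # checked after every step, not at the end of the run
        outputs.append(output)
    return outputs
```

Because the check runs after each step, the run can overshoot the limit by at most one step's worth of tokens before it stops.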

What I Learned the Hard Way

The email incident taught me three things I now build into every agent from day one.

Fail loudly at decision points. The email agent had a confidence threshold — below it, it was supposed to escalate to a human. The threshold existed in the prompt. The model ignored it on Tuesday morning, probably because the query phrasing triggered a confident-but-wrong path. Prompts are not contracts. Hard-code critical business logic outside the model's control.
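
Moving the threshold out of the prompt can be as small as a gate function that the agent's proposal must pass before anything is sent. The sketch below is illustrative, not the code from the incident: the threshold value, the recipient cap, and the names `ProposedSend` and `decide` are all assumptions, and the confidence score is assumed to come from the agent or a separate scorer.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85          # hypothetical value; tune per use case
MAX_RECIPIENTS_WITHOUT_REVIEW = 10   # hypothetical cap on unreviewed bulk sends

@dataclass
class ProposedSend:
    recipients: list[str]
    confidence: float  # assumed to be supplied by the agent or a separate scorer

def decide(proposal: ProposedSend) -> str:
    """Return "send" or "escalate". The model proposes; this code decides."""
    if proposal.confidence < CONFIDENCE_THRESHOLD:
        return "escalate"
    if len(proposal.recipients) > MAX_RECIPIENTS_WITHOUT_REVIEW:
        return "escalate"
    return "send"
```

The recipient cap is an extra safeguard beyond the threshold described above. A self-reported confidence score can be wrong in exactly the cases that matter, so a limit that does not depend on the model is worth having alongside it.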

Correlate tool calls with outcomes. The email tool logged that it sent 47 emails. It did not log why those were the 47 emails it chose. When I investigated, I found the agent had selected the customer list using a query that returned stale data. The tool worked perfectly. The data pipeline feeding it was broken. Without the trace context linking the tool call to the query that preceded it, I would have blamed the email tool.
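
One way to make that link explicit is to have every send event name the query it depended on and refuse to proceed when the underlying data is too old. This is a sketch under assumptions: it presumes the data pipeline exposes a last-refreshed timestamp, and `MAX_DATA_AGE`, `is_fresh`, and `record_send` are hypothetical names.

```python
from datetime import datetime, timedelta

MAX_DATA_AGE = timedelta(hours=1)  # hypothetical freshness window

def is_fresh(data_updated_at: datetime, now: datetime,
             max_age: timedelta = MAX_DATA_AGE) -> bool:
    """True if the source data was refreshed within the allowed window."""
    return now - data_updated_at <= max_age

def record_send(trace: list[dict], query_id: str, data_updated_at: datetime,
                recipients: list[str], now: datetime) -> dict:
    """Append a send event that names its source query; raise if the data is stale."""
    if not is_fresh(data_updated_at, now):
        raise ValueError(
            f"query {query_id} used stale data, last updated {data_updated_at.isoformat()}"
        )
    event = {
        "event": "send_email",
        "source_query": query_id,
        "data_age_seconds": (now - data_updated_at).total_seconds(),
        "recipient_count": len(recipients),
    }
    trace.append(event)
    return event
```

With `source_query` on the event, the question "why these 47 recipients?" becomes a lookup instead of an investigation.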

Your eval suite will not catch this. My agent had passed every eval test before deployment. The eval suite checked whether the agent could complete the task correctly when everything went right. It didn't check what happened when the upstream data was stale, when the model's confidence calibration was off, or when the agent encountered an ambiguous instruction. Production failure modes are not in your eval suite. You find them with observability.

The three hours I lost to that email incident cost more than the 15 minutes it would have taken to add the observability layer. That's the math I keep relearning.

If you're shipping AI agents in production and you're not logging trace IDs, validating tool call outputs, checking completion criteria, and budgeting execution time — you are running the same experiment I ran, and you'll learn the same lesson I learned. The agent will tell you it's done. It won't tell you if it failed.

What's your most painful production agent failure story? I'd love to hear what you learned — find me on DEV.to or drop a comment below.