DEV Community

Milo Antaeus

Your AI Agent Returns 200 and Is Wrong: The Silent-Success Drift Pattern

Your agent returned HTTP 200. The model call succeeded. The tool returned "status": "ok". The orchestrator marked the run complete. Dashboards are green. Three days later a customer emails you: the wrong invoice was sent. The agent "succeeded" — the outcome was wrong.

This is the failure mode I see most often when teams bring me logs in 2026, and it is the one that traditional observability tools handle worst. Datadog, LangSmith, Langfuse, Helicone — they all instrument the LLM call envelope. They tell you the call happened and returned. They do not tell you whether the effect on the world was the one you intended.

I call it silent-success drift: the agent's execution succeeds, but its outcome drifts away from intent. The dashboard stays green. Pager stays quiet. Revenue leaks.

The 3 signal classes that catch it

You don't need a new vendor. You need three classes of signal that almost no team is logging, and a 10-minute wiring change to add them.

Signal 1: The pre-action intent line

Before every tool call, log one structured line:

{"ts": "...", "kind": "intent", "run_id": "...", "step": "send_invoice",
 "args_summary": {"to": "customer_42", "amount_usd": 149},
 "expected_outcome": "customer_42 receives invoice for $149",
 "rollback_artifact": "stripe_invoice_id_if_applicable"}

The key fields are expected_outcome (a human-readable sentence describing what should be true if this call works) and rollback_artifact (what you'd need to undo). If you can't write a one-sentence expected_outcome for a tool call, the call is under-specified — that's already a smell.
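A minimal sketch of emitting that line, assuming a JSONL sink. The `log_intent` helper and its `sink` parameter are hypothetical names for illustration, not part of any library; the field names mirror the example above. It refuses to log when `expected_outcome` is empty, which is one way to enforce the "under-specified call" smell.

```python
import io
import json
from datetime import datetime, timezone


def log_intent(sink, run_id, step, args_summary, expected_outcome, rollback_artifact=None):
    """Write one structured intent line to `sink` before the tool call runs."""
    if not expected_outcome or not expected_outcome.strip():
        # No one-sentence expected outcome means the call is under-specified.
        raise ValueError(f"step {step!r} has no expected_outcome")
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "intent",
        "run_id": run_id,
        "step": step,
        "args_summary": args_summary,
        "expected_outcome": expected_outcome,
        "rollback_artifact": rollback_artifact,
    }
    sink.write(json.dumps(event) + "\n")
    return event


# Usage with an in-memory sink; in production this would be the run log file.
sink = io.StringIO()
intent = log_intent(
    sink,
    run_id="run_001",  # hypothetical run id
    step="send_invoice",
    args_summary={"to": "customer_42", "amount_usd": 149},
    expected_outcome="customer_42 receives invoice for $149",
    rollback_artifact="stripe_invoice_id_if_applicable",
)
```

Call it immediately before the tool call, so that a crash mid-call still leaves the intent on record.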

Signal 2: The post-action verify line

After the tool returns, log a second line based on a separate read, not the tool's own return value:

{"ts": "...", "kind": "post_verify", "run_id": "...", "step": "send_invoice",
 "verify_query": "GET /v1/invoices/{id}",
 "observed": {"status": "open", "amount": 14900, "customer": "cus_42"},
 "matches_intent": true}

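Notice: this is a second API call to an authoritative source, not a re-read of the tool's response. The whole point of silent-success drift is that the tool's own return is unreliable, so we don't trust it. We re-query. Watch the units when you compare: the intent line above records amount_usd as 149 while the observed amount is 14900, presumably in cents, so matches_intent has to be computed after normalizing both to the same unit.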

For a database write, the post-verify is a SELECT with a where-clause matching what you just wrote. For a file write, it's a stat or a cat. For an email, it's a search of the sent-folder API.
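One way to sketch the verify step, assuming the authoritative read is injected as a callable. `post_verify`, `fetch_invoice`, the `FAKE_BILLING` store, and the invoice id `in_123` are hypothetical stand-ins for your own client and data. The `mismatches` field is an extra that the example line above does not have. The expected amount is converted from dollars to cents on the assumption that the observed `amount` is in minor units.

```python
import json
from datetime import datetime, timezone


def post_verify(run_id, step, verify_query, fetch, expected):
    """Build a post_verify line from a fresh read of the authoritative source.

    `fetch` is a zero-argument callable that performs the separate read.
    `expected` maps observed field names to the values the intent implies.
    """
    observed = fetch() or {}  # a missing record counts as an empty observation
    mismatches = {
        key: {"expected": want, "observed": observed.get(key)}
        for key, want in expected.items()
        if observed.get(key) != want
    }
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "post_verify",
        "run_id": run_id,
        "step": step,
        "verify_query": verify_query,
        "observed": observed,
        "matches_intent": not mismatches,
        "mismatches": mismatches,  # extra field, not in the example line above
    }


# Hypothetical in-memory stand-in for the billing system of record.
FAKE_BILLING = {"in_123": {"status": "open", "amount": 14900, "customer": "cus_42"}}


def fetch_invoice():
    return FAKE_BILLING.get("in_123")


# Normalize the intent's dollars to cents before comparing.
intent_args = {"to": "cus_42", "amount_usd": 149}
expected = {"customer": intent_args["to"], "amount": round(intent_args["amount_usd"] * 100)}
line = post_verify("run_001", "send_invoice", "GET /v1/invoices/{id}", fetch_invoice, expected)
print(json.dumps(line))
```

Passing `fetch` in keeps the comparison independent of the tool's own client, so the same helper can wrap a SELECT, a file stat, or a sent-folder search.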

Signal 3: The outcome-assertion line

This is the one that takes the most discipline: at the end of the run, log whether the user's original ask was actually satisfied. Not whether the agent finished — whether the user's intent was met.

{"ts": "...", "kind": "outcome", "run_id": "...",
 "user_intent": "Send invoice for $149 to customer 42 and email them a receipt",
 "agent_reported_status": "success",
 "outcome_satisfied": true,
 "evidence": "stripe_invoice_id=in_xxx, sent_folder_msg_id=eml_yyy"}

outcome_satisfied is the only field a human should be paged on. The intent and post_verify lines are inputs to deciding it.
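A sketch of how the first two signals might feed the third. It assumes a run counts as satisfied only when every intent step has a post_verify line with matches_intent set to true. That rule and the `build_outcome` helper are illustrative assumptions; many runs will also need a run-level check against the user's original ask.

```python
from datetime import datetime, timezone


def build_outcome(run_id, user_intent, agent_reported_status, events):
    """Derive the outcome line from the intent and post_verify lines of one run."""
    intent_steps = [e.get("step") for e in events if e.get("kind") == "intent"]
    verified_steps = {
        e.get("step")
        for e in events
        if e.get("kind") == "post_verify" and e.get("matches_intent") is True
    }
    unverified = [s for s in intent_steps if s not in verified_steps]
    # A run with no intent lines at all cannot be called satisfied.
    satisfied = bool(intent_steps) and not unverified
    if satisfied:
        evidence = "all intent steps verified"
    else:
        evidence = "unverified_steps=" + ",".join(str(s) for s in unverified)
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "outcome",
        "run_id": run_id,
        "user_intent": user_intent,
        "agent_reported_status": agent_reported_status,
        "outcome_satisfied": satisfied,
        "evidence": evidence,
    }


# The agent reports success, but the receipt email was never verified.
events = [
    {"kind": "intent", "run_id": "run_001", "step": "send_invoice"},
    {"kind": "post_verify", "run_id": "run_001", "step": "send_invoice", "matches_intent": True},
    {"kind": "intent", "run_id": "run_001", "step": "email_receipt"},
]
outcome = build_outcome(
    "run_001",
    "Send invoice for $149 to customer 42 and email them a receipt",
    "success",
    events,
)
```

Here `agent_reported_status` is "success" while `outcome_satisfied` is false, which is exactly the gap the pattern is meant to surface.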

A 10-minute audit script you can run today

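Drop this next to your agent's run log. It walks the most recent runs (up to 100) and flags the ones that logged an intent line but never logged outcome_satisfied: true, including runs that logged no outcome line at all. Those are the silent-success candidates: the agent finished, but nothing confirms the outcome.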

# silent_success_audit.py
# Walks your JSONL agent log and finds runs that logged an intent but no
# outcome_satisfied=true line. Run from cron every 6h.
import json
from collections import defaultdict
from pathlib import Path

LOG = Path("/var/log/agent/runs.jsonl")
WINDOW = 100
SILENT_SUCCESS_OUT = Path("/tmp/silent_success_audit.txt")

# Group events by run_id. Only the latest event of each kind is kept per run.
runs = defaultdict(dict)
for line in LOG.read_text().splitlines()[-5000:]:
    try:
        ev = json.loads(line)
    except json.JSONDecodeError:
        continue
    if not isinstance(ev, dict):
        continue  # valid JSON, but not an event object
    rid = ev.get("run_id")
    if rid:
        runs[rid][ev.get("kind")] = ev

silent = []
audited = 0  # runs in the window that actually have structured logging
for rid, kinds in list(runs.items())[-WINDOW:]:
    intent = kinds.get("intent")
    outcome = kinds.get("outcome")
    if not intent:
        continue  # no structured logging on this run, skip
    audited += 1
    if not outcome or outcome.get("outcome_satisfied") is not True:
        silent.append({
            "run_id": rid,
            "agent_status": outcome.get("agent_reported_status") if outcome else "no_outcome_logged",
            "intent_step": intent.get("step"),  # step of the last intent logged
            "event_kinds": len(kinds),  # distinct event kinds seen for this run
        })

SILENT_SUCCESS_OUT.write_text(
    f"# silent-success audit: {len(silent)} of {audited} audited runs\n\n"
    + "\n".join(json.dumps(s) for s in silent)
)
print(f"wrote {len(silent)} silent-success runs to {SILENT_SUCCESS_OUT}")

If the silent list holds more than ~5% of the runs you audited, you have a silent-success drift problem. Most teams I've worked with discover it's 15–30% the first time they run this.
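To make the ~5% rule of thumb mechanical, here is a small sketch that turns the two counts into a rate and an alert decision. `silent_rate` and `should_alert` are hypothetical helper names, and the counts in the usage lines are made-up inputs, not measurements.

```python
def silent_rate(silent_count, audited_count):
    """Fraction of audited runs flagged as silent successes (0.0 if none were audited)."""
    return silent_count / audited_count if audited_count else 0.0


def should_alert(silent_count, audited_count, threshold=0.05):
    """True when the silent-success rate is strictly above the threshold."""
    return silent_rate(silent_count, audited_count) > threshold


# Made-up counts for illustration: 18 flagged runs out of 100 audited.
rate = silent_rate(18, 100)
print(f"silent-success rate: {rate:.0%}, alert: {should_alert(18, 100)}")
```

Dividing by the number of audited runs rather than a fixed window keeps the rate honest when some runs have no structured logging.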

Why traditional observability misses this

LangSmith and friends are built around the LLM call envelope: input tokens, output tokens, latency, model name, tool name. They instrument what the agent did. They do not instrument whether the world changed the way you wanted.

The shift is from execution observability (did the call happen?) to outcome observability (did reality match intent?). Most production teams still only have the first. Adding the three signal classes above is the smallest change that gets you the second.

The cost is not zero: a post_verify line requires a second API call per tool use. But for any tool that has a side effect a customer can feel (sending email, charging a card, writing a row, posting a message), that cost is far smaller than the cost of the wrong outcome. I've never seen a team regret adding it.

What to do this week

  1. Add intent and post_verify lines to the top three side-effecting tools in your agent.
  2. Add the outcome line at the end of every run.
  3. Run the audit script above once a day for a week.
  4. Triage the silent-success runs. Most are configuration drift. Some are real bugs.

If you'd rather have someone read the logs and tell you which of your silent-success runs are configuration drift vs. real bugs, that's the AI Ops Checkup — a one-shot forensic read of your agent logs that distinguishes drift from defect and tells you the three fixes to make first. Link in profile.
