Your agent returned HTTP 200. The model call succeeded. The tool returned "status": "ok". The orchestrator marked the run complete. Dashboards are green. Three days later a customer emails you: the wrong invoice was sent. The agent "succeeded" — the outcome was wrong.
This is the failure mode I see most often when teams bring me logs in 2026, and it is the one that traditional observability tools handle worst. Datadog, LangSmith, Langfuse, Helicone — they all instrument the LLM call envelope. They tell you the call happened and returned. They do not tell you whether the effect on the world was the one you intended.
I call it silent-success drift: the agent's execution succeeds, but its outcome drifts away from intent. The dashboard stays green. Pager stays quiet. Revenue leaks.
The 3 signal classes that catch it
You don't need a new vendor. You need three classes of signal that almost no team is logging, and a 10-minute wiring change to add them.
Signal 1: The pre-action intent line
Before every tool call, log one structured line:
{"ts": "...", "kind": "intent", "run_id": "...", "step": "send_invoice",
 "args_summary": {"to": "customer_42", "amount_usd": 149},
 "expected_outcome": "customer_42 receives invoice for $149",
 "rollback_artifact": "stripe_invoice_id_if_applicable"}
The key fields are expected_outcome (a human-readable sentence describing what should be true if this call works) and rollback_artifact (what you'd need to undo). If you can't write a one-sentence expected_outcome for a tool call, the call is under-specified — that's already a smell.
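A minimal sketch of a helper that emits this line, assuming a JSONL log where each event is serialized onto a single line. The function name `log_intent` and the `stream` parameter are illustrative, not part of any library, and rejecting an empty `expected_outcome` is one possible way to enforce the rule above.

```python
import io
import json
import sys
from datetime import datetime, timezone


def log_intent(run_id, step, args_summary, expected_outcome,
               rollback_artifact=None, stream=sys.stdout):
    """Write one JSONL "intent" event and return it (hypothetical helper)."""
    if not expected_outcome or not expected_outcome.strip():
        # No one-sentence expected outcome means the call is under-specified.
        raise ValueError(f"step {step!r} has no expected_outcome")
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "intent",
        "run_id": run_id,
        "step": step,
        "args_summary": args_summary,
        "expected_outcome": expected_outcome,
        "rollback_artifact": rollback_artifact,
    }
    # json.dumps without indent emits a single line, which keeps the log valid JSONL.
    stream.write(json.dumps(event) + "\n")
    return event


# Usage: call this immediately before the tool call it describes.
buf = io.StringIO()  # stands in for the real log file
log_intent("run_001", "send_invoice",
           {"to": "customer_42", "amount_usd": 149},
           "customer_42 receives invoice for $149",
           rollback_artifact="stripe_invoice_id_if_applicable",
           stream=buf)
print(buf.getvalue(), end="")
```

Writing each event on one physical line matters if a line-by-line parser will read the log later.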
Signal 2: The post-action verify line
After the tool returns, log a second line based on a separate read, not the tool's own return value:
{"ts": "...", "kind": "post_verify", "run_id": "...", "step": "send_invoice",
 "verify_query": "GET /v1/invoices/{id}",
 "observed": {"status": "open", "amount": 14900, "customer": "cus_42"},
 "matches_intent": true}
Notice: this is a second API call to an authoritative source, not a re-read of the tool's response. The whole point of silent-success drift is that the tool's own return is unreliable — so we don't trust it. We re-query.
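Deciding `matches_intent` means comparing the re-queried record against the intent's arguments, including any unit or ID translation between the two. The sketch below assumes the authoritative record stores money in integer cents (as the 14900 in the sample payload suggests) and that `customer_42` in your logs corresponds to `cus_42` upstream. Both mappings, the function names, and the `mismatched_fields` key are hypothetical.

```python
import json
from datetime import datetime, timezone


def diff_against_intent(args_summary, observed):
    """Return the names of fields where the re-queried record disagrees with intent."""
    expected = {
        # Assumption: the authoritative record stores money in integer cents.
        "amount": round(args_summary["amount_usd"] * 100),
        # Hypothetical ID mapping: "customer_42" in our logs is "cus_42" upstream.
        "customer": args_summary["to"].replace("customer_", "cus_", 1),
    }
    return [name for name, want in expected.items() if observed.get(name) != want]


def post_verify_event(run_id, step, verify_query, args_summary, observed):
    """Build the post_verify line from a fresh read, never from the tool's return."""
    mismatches = diff_against_intent(args_summary, observed)
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "post_verify",
        "run_id": run_id,
        "step": step,
        "verify_query": verify_query,
        "observed": observed,
        "matches_intent": not mismatches,
        "mismatched_fields": mismatches,  # hypothetical extra field for triage
    }


intent_args = {"to": "customer_42", "amount_usd": 149}
good = post_verify_event(
    "run_001", "send_invoice", "GET /v1/invoices/{id}", intent_args,
    {"status": "open", "amount": 14900, "customer": "cus_42"},
)
bad = post_verify_event(
    "run_002", "send_invoice", "GET /v1/invoices/{id}", intent_args,
    {"status": "open", "amount": 1490, "customer": "cus_42"},  # off by 10x
)
print(json.dumps(good))
print(json.dumps(bad))
```

The second call is the silent-success case: the tool could have returned "ok" while the stored amount was wrong, and only the fresh read exposes it.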
For a database write, the post-verify is a SELECT with a where-clause matching what you just wrote. For a file write, it's a stat or a cat. For an email, it's a search of the sent-folder API.
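For the database case, a rough sketch using the standard-library `sqlite3` module with an in-memory table. The `invoices` schema and the `write_and_verify` name are made up for illustration. A real setup would ideally run the SELECT on a separate connection or replica so the read is independent of the writer.

```python
import sqlite3

VERIFY_SQL = "SELECT customer, amount_cents FROM invoices WHERE id = ?"


def write_and_verify(conn, invoice_id, customer, amount_cents):
    """Insert a row, then re-read it and compare against what we meant to write."""
    conn.execute(
        "INSERT INTO invoices (id, customer, amount_cents) VALUES (?, ?, ?)",
        (invoice_id, customer, amount_cents),
    )
    conn.commit()

    # Post-verify: a SELECT whose WHERE clause targets the row we just wrote.
    row = conn.execute(VERIFY_SQL, (invoice_id,)).fetchone()
    observed = None if row is None else {"customer": row[0], "amount_cents": row[1]}
    intended = {"customer": customer, "amount_cents": amount_cents}
    return {
        "kind": "post_verify",
        "verify_query": VERIFY_SQL,
        "observed": observed,
        "matches_intent": observed == intended,
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id TEXT PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
)
event = write_and_verify(conn, "in_001", "cus_42", 14900)
print(event)
```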
Signal 3: The outcome-assertion line
This is the one that takes the most discipline: at the end of the run, log whether the user's original ask was actually satisfied. Not whether the agent finished — whether the user's intent was met.
{"ts": "...", "kind": "outcome", "run_id": "...",
 "user_intent": "Send invoice for $149 to customer 42 and email them a receipt",
 "agent_reported_status": "success",
 "outcome_satisfied": true,
 "evidence": "stripe_invoice_id=in_xxx, sent_folder_msg_id=eml_yyy"}
outcome_satisfied is the only field a human should be paged on. The intent and post_verify lines are inputs to deciding it.
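One way to derive the outcome line from the other two signals is sketched below. The policy is an assumption, not the only reasonable one: a run counts as satisfied only if it logged at least one intent and every intent has a post_verify with `matches_intent` set to true. It also assumes each step name appears at most once per run. `build_outcome` and the `unverified_steps` key are hypothetical, and the `evidence` field is omitted for brevity.

```python
def build_outcome(run_id, user_intent, agent_reported_status, events):
    """Derive the outcome line for one run from its intent and post_verify events."""
    intents = [e for e in events if e.get("kind") == "intent"]
    verifies = {e["step"]: e for e in events if e.get("kind") == "post_verify"}

    # A step is unverified if its post_verify is missing or did not match.
    unverified = [
        e["step"] for e in intents
        if verifies.get(e["step"], {}).get("matches_intent") is not True
    ]
    return {
        "kind": "outcome",
        "run_id": run_id,
        "user_intent": user_intent,
        "agent_reported_status": agent_reported_status,
        # Deliberately ignores agent_reported_status: the agent's own claim
        # is not evidence that the outcome happened.
        "outcome_satisfied": bool(intents) and not unverified,
        "unverified_steps": unverified,  # hypothetical field for triage
    }


events = [
    {"kind": "intent", "step": "send_invoice"},
    {"kind": "post_verify", "step": "send_invoice", "matches_intent": True},
    {"kind": "intent", "step": "email_receipt"},  # never verified
]
outcome = build_outcome(
    "run_001",
    "Send invoice for $149 to customer 42 and email them a receipt",
    "success",
    events,
)
print(outcome["outcome_satisfied"], outcome["unverified_steps"])
```

Here the agent reported success, but the receipt step has no verification, so the run is not marked satisfied.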
A 10-minute audit script you can run today
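Drop this in your agent's logging path. It reads the last 5,000 log lines, groups them by run_id, and checks the most recent 100 runs. It flags every run that logged an intent but never logged outcome_satisfied: true, which covers runs that reported success without a verified outcome as well as runs that logged no outcome at all.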
# silent_success_audit.py
# Walks your JSONL agent log and finds runs that "succeeded" without an
# outcome_satisfied=true line. Run from cron every 6h.
import json
from collections import defaultdict
from pathlib import Path

LOG = Path("/var/log/agent/runs.jsonl")
WINDOW = 100
SILENT_SUCCESS_OUT = Path("/tmp/silent_success_audit.txt")

# Group events by run_id, keeping the latest event of each kind per run.
runs = defaultdict(dict)
for line in LOG.read_text().splitlines()[-5000:]:
    try:
        ev = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip partial or non-JSON lines
    if not isinstance(ev, dict):
        continue  # valid JSON but not an event object
    rid = ev.get("run_id")
    if rid:
        runs[rid][ev.get("kind")] = ev

silent = []
for rid, kinds in list(runs.items())[-WINDOW:]:
    intent = kinds.get("intent")
    outcome = kinds.get("outcome")
    if not intent:
        continue  # no structured logging on this run, skip
    if not outcome or outcome.get("outcome_satisfied") is not True:
        silent.append({
            "run_id": rid,
            "agent_status": outcome.get("agent_reported_status") if outcome else "no_outcome_logged",
            "intent_step": intent.get("step"),
            "age_steps": len(kinds),  # distinct event kinds seen for this run
        })

SILENT_SUCCESS_OUT.write_text(
    f"# silent-success audit: {len(silent)} of last {WINDOW} runs\n\n"
    + "\n".join(json.dumps(s) for s in silent)
)
print(f"wrote {len(silent)} silent-success runs to {SILENT_SUCCESS_OUT}")
If silent is more than ~5% of your runs, you have a silent-success drift problem. Most teams I've worked with discover it's 15–30% the first time they run this.
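To turn the count into the percentage discussed here, divide by the number of runs the audit actually examined (those with an intent line), not by the window size, since runs without structured logging are skipped. A small sketch, with `silent_success_rate` as a hypothetical helper and the 5% threshold taken from the rule of thumb above:

```python
def silent_success_rate(silent_count, audited_count, threshold=0.05):
    """Return (rate, over_threshold) for one audit window."""
    if audited_count <= 0:
        return 0.0, False  # nothing audited, nothing to alert on
    rate = silent_count / audited_count
    return rate, rate > threshold


# Example: 12 flagged runs out of 80 that had structured logging.
rate, over = silent_success_rate(12, 80)
print(f"{rate:.1%} of audited runs flagged, over threshold: {over}")
```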
Why traditional observability misses this
LangSmith and friends are built around the LLM call envelope: input tokens, output tokens, latency, model name, tool name. They instrument what the agent did. They do not instrument whether the world changed the way you wanted.
The shift is from execution observability (did the call happen?) to outcome observability (did reality match intent?). Most production teams still only have the first. Adding the three signal classes above is the smallest change that gets you the second.
The cost is not zero: a post_verify line requires a second API call per tool use. But for any tool that has a side effect a customer can feel (sending email, charging a card, writing a row, posting a message), that cost is trivial compared with the cost of the wrong outcome. I've never seen a team regret adding it.
What to do this week
- Add intent and post_verify lines to the top three side-effecting tools in your agent.
- Add the outcome line at the end of every run.
- Run the audit script above once a day for a week.
- Triage the silent-success runs. Most are configuration drift. Some are real bugs.
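For the first item, one low-friction way to wire a tool is a decorator that logs the intent, calls the tool, runs a separate verification read, and logs the result. This is a sketch under assumptions: `verified_tool`, the `verify` callback signature, and the in-memory `OUTBOX` standing in for a mail provider are all hypothetical, and the events are trimmed (no ts or args_summary) to keep it short.

```python
import functools
import io
import json
import sys


def verified_tool(step, expected_outcome, verify, stream=sys.stdout):
    """Wrap a side-effecting tool so each call logs intent, then post_verify.

    verify(args, kwargs, result) must do its own read of the authoritative
    source and return (observed, matches_intent).
    """
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(run_id, *args, **kwargs):
            def emit(event):
                line = {"run_id": run_id, "step": step, **event}
                stream.write(json.dumps(line) + "\n")

            emit({"kind": "intent", "expected_outcome": expected_outcome})
            result = tool(*args, **kwargs)
            observed, ok = verify(args, kwargs, result)
            emit({"kind": "post_verify", "observed": observed, "matches_intent": ok})
            return result
        return wrapper
    return decorator


OUTBOX = {}            # hypothetical stand-in for a mail provider's sent folder
LOG_BUF = io.StringIO()  # stands in for the real JSONL log


def check_outbox(args, kwargs, result):
    """Look the message up by ID instead of trusting the tool's return value."""
    msg = OUTBOX.get(result)
    return msg, msg is not None and msg["to"] == args[0]


@verified_tool("send_receipt", "customer receives a receipt email",
               check_outbox, stream=LOG_BUF)
def send_receipt(to, body):
    msg_id = f"eml_{len(OUTBOX) + 1}"
    OUTBOX[msg_id] = {"to": to, "body": body}
    return msg_id


send_receipt("run_001", "customer_42@example.com", "Receipt for $149")
print(LOG_BUF.getvalue(), end="")
```

The wrapper takes `run_id` as its first argument so both lines can be joined to the run later. A per-call `expected_outcome` would be more precise than the static one used here.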
If you'd rather have someone read the logs and tell you which of your silent-success runs are configuration drift vs. real bugs, that's the AI Ops Checkup — a one-shot forensic read of your agent logs that distinguishes drift from defect and tells you the three fixes to make first. Link in profile.