Agentic AI failure modes: silent green exits and other gotchas
The agent said it sent the emails. You checked the inbox. Nothing went out.
If you've shipped an agent into production, you've seen some version of that scenario. The schedule ran. The log said success. The work didn't happen. This is not a hypothetical safety problem. In our experience, it's the most common production issue in agentic systems right now.
We've caught all seven of the failure modes below in our own stack. This post catalogs them with the actual incidents and the detection patterns that have kept the system honest for eight months.
TL;DR
- Seven failure modes show up in almost every agentic production system: silent green exits, mocked work, fabricated outputs, schedule drift, authority creep, citation hallucination, and context-window amnesia.
- Each one has the same shape: the log says one thing, reality says another.
- The fix is structural. Verification has to live outside the agent that produced the work.
- The most expensive failure mode in 2026 is silent green: the agent reports success while producing zero output. It can run for weeks before you notice.
Failure mode 1: Silent green exits
What happens. The agent's scheduled run completes. Exit code zero. Log says "100% success." The actual work produced nothing.
Real incident. Our Outreach Closer was reporting 100% job-success for two weeks. Every scheduled task exited zero. Every dashboard was green. But one of three send branches had been silently failing for nine days because of a stale OAuth token. The exception landed in a try/except that swallowed the error and logged success.
Detection pattern. When the agent logs success, also log the byte count or artifact pointer of what it actually produced. Scan success logs for null/zero pointers.
One-line check.
if claimed_success and (output_count == 0 or output_pointer is None):
    incident("silent_green", agent, claimed_outputs)
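A slightly fuller sketch of the same idea, under stated assumptions: `RunRecord`, `run_branch`, and `find_silent_green` are hypothetical names, and `send` stands for whatever callable returns the IDs of the messages it actually sent. The except branch records a failure instead of logging success, and the scan flags any success with nothing behind it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunRecord:
    """One log entry per branch per run (hypothetical structure)."""
    agent: str
    branch: str
    claimed_success: bool
    output_count: int = 0
    output_pointer: Optional[str] = None  # e.g. a message ID or file path
    error: Optional[str] = None


def run_branch(agent: str, branch: str, send: Callable[[], list[str]]) -> RunRecord:
    """Run one send branch and record what it actually produced."""
    try:
        message_ids = send()
    except Exception as exc:
        # Record the failure instead of swallowing it and logging success.
        return RunRecord(agent, branch, claimed_success=False, error=repr(exc))
    pointer = message_ids[0] if message_ids else None
    return RunRecord(agent, branch, True, len(message_ids), pointer)


def find_silent_green(records: list[RunRecord]) -> list[RunRecord]:
    """Return records that claim success but point at no artifact."""
    return [
        r for r in records
        if r.claimed_success and (r.output_count == 0 or r.output_pointer is None)
    ]
```

Run `find_silent_green` over each day's records. A branch that raised shows up as an explicit failure, and a branch that "succeeded" with zero outputs shows up in the scan.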
Failure mode 2: Mocked work
What happens. During development, the agent was wired against a mock. The code shipped to production with the mock still in the call path.
Real incident. Our Support Agent was tested against a fake Gmail mock. The production deploy didn't replace it cleanly. For four days, every "reply sent" was a write to a local JSON file. Caught by a routine spot-check.
Detection pattern. Verification has to call the real destination. Go to the actual inbox, database, or channel and check whether the artifact exists.
if not destination.confirms_artifact(claimed_id):
    incident("mocked_work", agent, claimed_id)
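One way to structure that check, as a sketch: the verifier takes its own handle on the destination rather than reusing the agent's client, so a mock left in the agent's call path cannot vouch for itself. The `Destination` protocol and `unconfirmed_claims` are hypothetical names. `InMemorySentFolder` exists only to make the example runnable; a real implementation would query the actual mailbox, database, or channel.

```python
from typing import Iterable, Protocol


class Destination(Protocol):
    """Anything that can confirm an artifact really exists at the destination."""

    def confirms_artifact(self, artifact_id: str) -> bool: ...


class InMemorySentFolder:
    """Illustrative stand-in only; replace with a lookup against the real system."""

    def __init__(self, sent_ids: Iterable[str]) -> None:
        self._sent = set(sent_ids)

    def confirms_artifact(self, artifact_id: str) -> bool:
        return artifact_id in self._sent


def unconfirmed_claims(destination: Destination, claimed_ids: Iterable[str]) -> list[str]:
    """Return the claimed artifact IDs the destination has no record of."""
    return [cid for cid in claimed_ids if not destination.confirms_artifact(cid)]
```

Any ID returned by `unconfirmed_claims` is a candidate "mocked_work" incident.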
Failure mode 3: Fabricated outputs
What happens. The agent produces output that looks like real work but isn't. A "report" full of plausible-sounding numbers that don't exist in the data.
Real incident. Our Analyst agent was asked to summarize outreach performance. It produced a paragraph: "open rate of 38%, click rate of 4.2%, three positive replies from director-level prospects." None of those numbers came from actual data. Now the Analyst runs against a strict schema where each metric must trace to a row ID in the source data.
Detection pattern. Every claim has to be traceable to a row, record, file, or URL.
for claim in agent_output.claims:
    # Explicit check: assert statements are stripped when Python runs with -O.
    if not source_exists(claim.source_id):
        incident("fabrication", agent, claim)
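A minimal sketch of what "each metric must trace to a row ID" can look like. The `Claim` shape and the dict-of-rows layout are assumptions for illustration, not our actual schema. The check goes one step past existence: the row has to exist and has to contain the value the agent reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One reported metric plus the row it is supposed to come from (hypothetical shape)."""
    metric: str
    value: float
    source_id: str


def untraceable_claims(claims: list[Claim], source_rows: dict[str, dict[str, float]]) -> list[Claim]:
    """Return claims whose source row is missing or does not hold the claimed value."""
    bad = []
    for claim in claims:
        row = source_rows.get(claim.source_id)
        # Exact comparison is deliberate here: the agent should copy values, not estimate them.
        if row is None or row.get(claim.metric) != claim.value:
            bad.append(claim)
    return bad
```

If the agent is allowed to round, compare with a tolerance instead of exact equality.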
Failure mode 4: Schedule drift
What happens. The agent was supposed to run at 06:00. It started drifting: 06:14, then 06:31, then 07:02, then not at all.
Real incident. The Distributor's actual run time drifted from 06:00 to 07:15 over four weeks because of a flexible cron expression. Post-publish syndication started going out after US East Coast readers had moved on.
Detection pattern. Log scheduled time AND actual start time on every run. Alert when drift exceeds threshold.
delta = abs(actual_start_ts - scheduled_ts)
if delta > drift_threshold[agent]:
    incident("schedule_drift", agent, delta)
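The drift comparison only fires if the run starts and writes a log line. The "not at all" case needs a separate branch that runs from outside the agent. A sketch, assuming timezone-aware datetimes and a hypothetical `check_drift` helper:

```python
from datetime import datetime, timedelta
from typing import Optional


def check_drift(
    scheduled: datetime,
    actual_start: Optional[datetime],
    threshold: timedelta,
    now: datetime,
) -> Optional[str]:
    """Return an incident label, or None if the run is on time (or not yet late)."""
    if actual_start is None:
        # The run never started: flag it once the threshold has passed.
        return "missed_run" if now - scheduled > threshold else None
    if abs(actual_start - scheduled) > threshold:
        return "schedule_drift"
    return None
```

Call this from a separate watchdog job, with `actual_start=None` when no start record exists for the scheduled slot.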
Failure mode 5: Authority creep
What happens. The agent does something it wasn't authorized to do. It sends to a do-not-contact list. It publishes to an unauthorized channel. It triggers a paid action when it was only supposed to suggest one.
Real incident. Our Outreach Closer tried to email a contact the Lead Sourcer had marked do-not-contact two weeks earlier. The Closer's local copy hadn't been refreshed. Caught at the permission check layer -- fresh lookup against source of truth, action rejected, drift logged, no email sent.
Detection pattern. Every action runs through an envelope check with a fresh state read before execution.
if not envelope.permits(action, fresh_state()):
    block_and_log_drift(agent, action)
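A sketch of the do-not-contact case, assuming the source of truth is reachable through a `load_do_not_contact` callable. All names here are hypothetical. What matters is that the list is read at execution time, not taken from whatever copy the agent is holding.

```python
from typing import Callable, Iterable


def send_if_permitted(
    recipient: str,
    load_do_not_contact: Callable[[], Iterable[str]],
    send: Callable[[str], None],
    log_drift: Callable[[str], None],
) -> bool:
    """Send only if a fresh read of the do-not-contact list allows it."""
    blocked = set(load_do_not_contact())  # fresh read on every action, never a cached copy
    if recipient in blocked:
        log_drift(recipient)  # the agent's local state disagreed with the source of truth
        return False
    send(recipient)
    return True
```

Because the permission check wraps `send`, the agent has no code path that reaches the outside world without passing through it.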
Failure mode 6: Citation hallucination
What happens. The agent cites a source that doesn't exist. A paper never published. A Wikipedia article with the wrong URL. A statistic from nowhere.
Real incident. An early Blog Writer cited "a 2024 Stanford study" that didn't exist. The QA agent caught it on citation verification. Now every citation must be a URL that resolves to a page containing the claim, or an internal source from our own data.
for cite in output.citations:
    # Explicit check: assert statements are stripped when Python runs with -O.
    if not fetch(cite.url).contains(cite.claim):
        incident("citation_hallucination", agent, cite)
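A runnable sketch of the URL half of that rule, using only the standard library. `fetch_text` and `citation_supported` are hypothetical helpers, and the match is a plain case-insensitive substring test, so it only confirms claims quoted close to verbatim. Paraphrased claims would need a looser comparison or a human look.

```python
import urllib.request
from typing import Callable, Optional


def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page body as text, or None if the URL does not resolve."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # Unreachable hosts, HTTP errors, and malformed URLs all count as unresolved.
        return None


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat the match."""
    return " ".join(text.lower().split())


def citation_supported(
    url: str,
    claim_text: str,
    fetch: Callable[[str], Optional[str]] = fetch_text,
) -> bool:
    """True only if the URL resolves and the page text contains the claim."""
    page = fetch(url)
    if page is None:
        return False
    return _normalize(claim_text) in _normalize(page)
```

The `fetch` parameter is injectable so the check can be tested offline and so internal sources can plug in their own lookup.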
Failure mode 7: Context-window amnesia
What happens. The agent forgets something it was told earlier because the conversation exceeded its context window. The forgotten piece is usually a constraint -- and the agent then does the thing it was told not to do.
Real incident. A long-running planning session had a constraint set early: "we are not pursuing partnerships this quarter." Six thousand tokens later, the agent proposed a plan including a partnerships push. The fix: pin constraints to a persistent state file the agent reads at the start of every turn.
for action in proposed_actions:
    # Explicit check: assert statements are stripped when Python runs with -O.
    if violates_constraint(action, persistent_constraints):
        incident("amnesia", agent, action)
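A sketch of the pinned-constraints fix, assuming a JSON file whose entries each carry a `text` and a list of `blocked_keywords`. The file format and function names are invented for illustration. Constraints are re-read and prepended on every turn, and proposed actions are screened with a deliberately naive keyword match.

```python
import json
from pathlib import Path


def load_constraints(path: Path) -> list[dict]:
    """Read the persistent constraint list from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


def build_turn_prompt(path: Path, user_message: str) -> str:
    """Prepend the pinned constraints so they never fall out of the context window."""
    pinned = "\n".join(f"- {c['text']}" for c in load_constraints(path))
    return f"Standing constraints:\n{pinned}\n\n{user_message}"


def find_violations(proposed_actions: list[str], constraints: list[dict]) -> list[tuple[str, str]]:
    """Return (action, constraint text) pairs where an action mentions a blocked keyword."""
    hits = []
    for action in proposed_actions:
        for c in constraints:
            if any(k.lower() in action.lower() for k in c["blocked_keywords"]):
                hits.append((action, c["text"]))
    return hits
```

Keyword matching will miss rephrasings and flag some innocent mentions. Treat hits as prompts for review, not as proof.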
How to roll out these checks
Day 1. Silent green. Log artifact pointers on every success. Run a daily scan for empties.
Day 2. Mocked work. Verify against the real destination on every claimed send.
Day 3. Authority creep. Wire an envelope check in front of every external action.
Week 2. Fabrication and citation hallucination.
Week 3. Schedule drift and amnesia.
By week four you have a system that catches the failure modes that bring down most production agentic deployments. Under 200 lines of code total.
The single summary metric: verification rate -- the percentage of agent claims that pass an independent check. Track it per agent, watch it daily, and when it drops, trace back to which of these seven modes caused the drop.
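Computing that metric takes a few lines once each independent check emits a pass/fail result. A sketch, assuming results arrive as hypothetical `(agent, passed)` pairs:

```python
from collections import defaultdict
from typing import Iterable


def verification_rates(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Per-agent share of claims that passed an independent check."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for agent, ok in results:
        totals[agent] += 1
        passed[agent] += int(ok)
    # Agents with no claims are simply absent, so there is no division by zero.
    return {agent: passed[agent] / totals[agent] for agent in totals}
```

An agent that is missing from the result made no checked claims that day, which is itself worth a look given failure mode 1.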