If you have 30 seconds. Working with an AI agent over the long run reveals something strange. The failures that cost you aren't the loud ones (a crash, a red build, a blank page); they're the ones that slip through the cracks of clean code. After 35 effective days, 109,000 lines and 517 commits on a solo ERP, I isolated five recurring silent modes: the fix that doesn't fix, the test that passes by construction, the memory that confabulates, the count that lies, the scope that creeps. One scene per mode, one rule per scene. A discipline doesn't get planned, it settles.
The agent doesn't fail at random
Late April 2026, I reread my accumulated feedback files (close to a hundred, dated, indexed) and found they clustered into five families. The loud errors (red terminal, Sentry alert) you learn to recognize as you go. The silent ones cost more. The code passed, the agent announced green, production ran. And yet something had slipped. Five modes, five scenes, five rules. None of them was decided cold.
Mode 1 — The fix that doesn't fix
An intermittent error shows up in Sentry on a sensitive endpoint. The agent proposes a patch. Three lines, elegant, that make the report disappear. Except what disappears is the symptom; the cause keeps flowing. The malformed payload upstream still produces a null, but it's now silently returned to a consumer expecting an object. The corrupted data propagates quietly into two or three tables, and you only notice when a counter you thought reliable stops matching the rest.
// app/api/leads/elementor/route.ts (condensed form)
import { NextResponse } from 'next/server'

export async function POST(req: Request) {
  try {
    const body = await req.json()
    return await processLead(body)
  } catch {
    return NextResponse.json({ ok: true }) // silent workaround
  }
}
A workaround can be legitimate, but only if it's explicitly owned in the commit message and a feedback file. Silent workarounds are forbidden. When a fix looks too simple for the symptom, I demand the full input → output pipeline before accepting.
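By contrast, an owned workaround names itself. A minimal sketch of what I now accept instead, assuming the same processLead and the Sentry wiring mentioned above (the import paths here are mine, not the repo's): the failure is recorded and the consumer gets an honest status.

// app/api/leads/elementor/route.ts (the owned version, a sketch)
import * as Sentry from '@sentry/nextjs'
import { NextResponse } from 'next/server'
import { processLead } from '@/lib/leads' // hypothetical path

export async function POST(req: Request) {
  try {
    const body = await req.json()
    return await processLead(body)
  } catch (err) {
    // WORKAROUND, assumed in the commit and a feedback file: malformed
    // payloads are rejected loudly instead of swallowed.
    Sentry.captureException(err)
    return NextResponse.json({ ok: false }, { status: 422 })
  }
}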
Mode 2 — The test that passes by construction
ADR-0044, shipped May 2nd. Five contract tests for DB ↔ code enums (statuses, roles). On first run, all five pass in three seconds. Too fast. Diffuse sensation of a meter running by itself.
I add an explicit negative case, a variant that must fail because I deliberately misalign the DB enum and the TS enum.
// tests/contracts/inscription_statuses.contract.test.ts
import { expect, it } from 'vitest' // assuming Vitest; adapt to your runner
import { assertEnumStable } from './helpers' // wherever the project keeps its contract helper

it('throws when given a deliberately restricted set (anti-tautology)', async () => {
  await expect(
    assertEnumStable({
      table: 'inscriptions',
      column: 'status',
      expected: ['enrolled'], // deliberate subset
      contractRef: '(negative test)',
    }),
  ).rejects.toThrow(/Drift DB ↔ code detected/)
})
With the enums deliberately misaligned, four of the five tests still pass. The assertion helper was silently swallowing comparisons. Without the negative case, I would have shipped a suite of Potemkin tests, green by construction, with no actual capacity to detect anything. The rule comes out in one line: every contract test suite contains at least one negative case, otherwise it tests nothing. The presence of red is what validates green.
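For the record, here is the shape of the bug, reconstructed as a minimal sketch. The real assertEnumStable ships in the companion repo; the db client and this exact failure path are my assumptions. One classic way a helper swallows comparisons: Array.prototype.every returns true on an empty array.

// tests/contracts/helpers.ts (sketch of the failure shape, not the real helper)
declare const db: { query(sql: string): Promise<Record<string, string>[]> } // stand-in client

type EnumContract = { table: string; column: string; expected: string[]; contractRef: string }

export async function assertEnumStable({ table, column, expected, contractRef }: EnumContract) {
  const rows = await db.query(`SELECT DISTINCT ${column} FROM ${table}`)
  // Trap: a typo'd table name or a swallowed query error yields [],
  // and [].every(...) is true, so the suite goes green with nothing compared.
  const ok = rows.every((row) => expected.includes(row[column]))
  if (!ok) throw new Error(`Drift DB ↔ code detected: ${table}.${column} (${contractRef})`)
}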
Mode 3 — The memory that confabulates
First week of May, billing refactor. I tell the agent, "we had chosen pattern B for emission via the partner accounting API, right?". The agent confirms, restates what it thinks is the ADR, proposes the next step. Three hours later, I happen to reopen ADR-0007 for an unrelated detail. The sentence jumps out at me in the Decision section. It's the inverse of what the agent just confirmed. Carved there since late April.
This mode is the most insidious of the five because it leverages the human's trust in their own memory; I had validated without rereading. Memory is a point of entry, never a point of arrival. Before asserting anything from a memory file, I reopen the current ADR or code. "Do you remember..." has become, for me as for the agent, the trigger for an immediate Read on the associated memory, not a request for confirmation.
Mode 4 — The count that lies
A morning in late April, Françoise crosses the hallway with her mug, the one with her own face printed on it, an office gag she keeps up every morning. She stops at the door. "How many enrolled in May at Maisons-Laffitte?". I pass the question to the embedded analytical agent. The number arrives in six seconds, clean, formatted. She swivels toward her cockpit (Excel attendance sheet on the left, accounting software on the right, the ERP in the middle) and runs through her sheet with her finger under each name, out loud. "Yeah, that's it." And then, without changing tone, "Seven missing."
The DB enum had been renamed five days earlier on another workstream. The generated SQL was flawless, except it returned zero rows for the values it queried. The agent confabulated a business explanation ("there are no overdue invoices") instead of a structural one. Five days of drift with no monitoring barking.
Françoise sees the wrong number before me because she has her own cockpit, her own attendance sheet, her own habit of comparing line by line. It's the house's anachronistic advantage: a human still tallies on paper. But the rule cannot rest on Françoise's vigilance. Every number relayed to a human comes with its provenance query, and a quarterly DB ↔ code audit is mandatory. A semantic layer without audit is a delayed fragmentation bomb. You don't know when it goes off, you just know it will.
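Concretely, the provenance rule can be as small as refusing to hand over a bare integer. A sketch of the idea; the type name and the column names are mine, not the ERP's.

// Every figure relayed to a human travels with the query that produced it.
declare const db: { query(sql: string, params: unknown[]): Promise<{ n: string }[]> } // stand-in client

type ProvenancedCount = {
  value: number
  sql: string        // replayable as-is in a DB console
  executedAt: string // makes five days of silent drift visible
}

async function countEnrolled(site: string, month: string): Promise<ProvenancedCount> {
  const sql = `SELECT COUNT(*) AS n FROM inscriptions
               WHERE status = 'enrolled' AND site = $1 AND month = $2`
  const [{ n }] = await db.query(sql, [site, month])
  return { value: Number(n), sql, executedAt: new Date().toISOString() }
}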
Mode 5 — The scope that creeps
Visible bug, minimal scope. A button opens a drawer on the wrong route, the fix is two lines. The agent, "while I'm in the file," renames three props to harmonize them with another component, moves two helpers to lib/, creates a new utility file, and cleans up a few orphan imports along the way. Fourteen files touched. The diff is unreadable. Review impossible. Two regressions on landing, one on a drawer unrelated to the original bug.
# The diff I should have received — strict, two lines
- href={`/admin/${item.slug}/sessions`}
+ href={`/crm/${item.slug}/sessions`}
With an AI agent, scope creep takes on a particular intensity because the agent doesn't feel the cost of review for a human who has to read the diff tomorrow morning. The cleaner the code, the more tempting the adjacent refactor. Strict fix scope. Adjacent refactor = separate ticket, never under cover of a fix.
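The rule can be made mechanical. A sketch, not the repo's actual hook: a commit-msg script (mine, hypothetical, threshold included) that refuses a fix commit touching more than three files.

// scripts/check-fix-scope.ts (call from a commit-msg hook: node scripts/check-fix-scope.ts "$1")
import { execSync } from 'node:child_process'
import { readFileSync } from 'node:fs'

const message = readFileSync(process.argv[2], 'utf8')
if (/^fix[:(]/.test(message)) {
  const staged = execSync('git diff --cached --name-only', { encoding: 'utf8' })
    .split('\n')
    .filter(Boolean)
  if (staged.length > 3) {
    console.error(`fix commit touches ${staged.length} files; adjacent refactor = separate ticket`)
    process.exit(1)
  }
}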
What you can copy into your project
Full snippets (the structured feedback template with Rule / Why / How to apply, the anti-tautology contract negative case, the DB ↔ code enum audit script) are in the silent-failure-modes/ folder of the series companion repo, MIT.
Three directly applicable practices if you work with an AI agent over the long run:
A structured feedback file per incident, in the session it happens. Not at the end of the project, not "when I have time." Five minutes to write, three hours saved three weeks later when the same mode comes back. Without that written trace, the same mistake comes back every two weeks, and no one remembers it well enough to name it.
A negative expect(...).rejects.toThrow() case in every contract test suite. Without it, a buggy assertion helper renders every contract green by construction. The presence of red is what validates green.
A quarterly DB ↔ code audit of shared enums. Half an hour per quarter, a SELECT DISTINCT on each enum column compared to the associated TypeScript constant (a sketch follows this list). Any semantic layer without audit is a delayed fragmentation bomb.
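The audit itself fits in a few lines. A sketch of the idea, assuming a hypothetical exported INSCRIPTION_STATUSES constant and a stand-in db client; the real script ships in the companion repo.

// scripts/audit-enums.ts (compare live DB values to the TypeScript constant)
import { INSCRIPTION_STATUSES } from '@/lib/enums' // hypothetical export

declare const db: { query(sql: string): Promise<{ status: string }[]> } // stand-in client

export async function auditInscriptionStatuses() {
  const rows = await db.query('SELECT DISTINCT status FROM inscriptions')
  const inDb = rows.map((r) => r.status)
  const inCode: string[] = [...INSCRIPTION_STATUSES]

  const unknownToCode = inDb.filter((v) => !inCode.includes(v))
  const absentFromDb = inCode.filter((v) => !inDb.includes(v))

  if (unknownToCode.length) {
    // The Mode 4 scenario: the DB moved and the code never heard about it.
    throw new Error(`DB values unknown to code: ${unknownToCode.join(', ')}`)
  }
  if (absentFromDb.length) {
    console.warn(`Code values unused in DB (possibly fine): ${absentFromDb.join(', ')}`)
  }
}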
And you, which of these five modes has already cost you a session without you taking the time to name it? I read the comments.
What settles
Five modes is not the closed list. It's a snapshot at day 35. At day 70 there will be seven or eight, and some of the five will subdivide. What stays invariant is the grammar. An incident, a rule, a dated memory, and the next session that inherits what was learned the week before. A discipline doesn't get planned, it settles.
Companion code, rembrandt-samples/silent-failure-modes/, anti-tautology contract negative case + DB ↔ code audit script, MIT.
