D. Reiter

Posted on May 29

"Your AI agents fail silently. A 1986 idea fixes that."

#agents #ai #architecture #javascript

There's a specific kind of bug that AI agents introduced to production systems, and most teams don't have a name for it yet.
It's not a crash. Crashes are easy — you get a stack trace, a line number, a 500. This is worse: the agent runs, returns a perfectly well-formed object, the pipeline moves on, the next stage consumes it... and somewhere three steps later the output is quietly, expensively wrong. No exception. No log line that says this is where it broke. Just a bad decision that propagated.
A sensor pipeline that keeps scoring anomalies on a stale reading from a cold-booting device. A document agent that "summarizes" a PDF it never actually parsed. A payment flow that approves a transfer because an upstream field defaulted to undefined and undefined happened to be falsy in the wrong branch.
The model didn't fail. Your trust in the model failed. And the bug for that is forty years old.
Design by Contract, briefly
In 1986 Bertrand Meyer built a language called Eiffel around an idea he called Design by Contract. Every routine declares, explicitly, what it promises and what it requires:

Preconditions — what must be true before this code runs. The caller's responsibility.
Postconditions — what must be true after it runs. The routine's responsibility.
Invariants — what must always hold.

The genius part wasn't the assertions themselves — it was the framing. A contract turns "I hope this works" into "here is exactly whose fault it is when it doesn't." If the precondition fails, the caller broke the deal. If the postcondition fails, the routine did. Blame is assigned at the boundary, not three stack frames later.
This idea fell out of fashion for normal code, mostly because static types absorbed a lot of its job. You don't need a precondition saying "x must be a number" when the type system already guarantees it.
But agents broke that comfortable arrangement. An LLM's output type is "string" or "object" — structurally valid, semantically anything. The type system says yes. The contract is the only thing left that can say no.
Why agents need this more than ordinary code
Three properties make agent pipelines uniquely hostile:

Non-determinism. The same input can produce different output. You can't reason about "what this function returns" — only about "what this function is allowed to return."
Irreversible side effects. Agents send emails, charge cards, fire actuators, call other APIs. By the time you notice the bad state, the side effect already happened.
No rollback by default. When step 4 of a 6-step pipeline produces garbage, there's usually no mechanism that says "stop, the state is now invalid, don't proceed."

Design by Contract addresses all three at once: verify state at every boundary, before the side effect, and refuse to continue when the contract breaks.
What it looks like in practice
Here's the shape of it in plain JS — a single stage with a contract and a policy. I'm using deed-edge, a small zero-dependency runtime I've been building around exactly this idea, but the pattern is what matters, not the package.

deed.registerAgent('screen_applicant', {
  contract: {
    pre:  'raw_application',                       // require: input exists
    post: 'eligibility_checked and score >= 0'     // promise: real result
  },
  policy: [
    { kind: 'cap',   field: 'budget',  limit: 5000 },
    { kind: 'deny',  action: 'go',     condition: 'region == "restricted"' },
    { kind: 'allow', action: 'go',     condition: 'verified' }
  ],
  handler: async (state) => ({
    ...state,
    eligibility_checked: true,
    score: await scoreApplicant(state.raw_application)
  })
});

The pre/post strings are a tiny expression language — bare identifiers are truthy checks, plus comparisons (score >= 0), string equality (region == "eu"), and and/or. Deliberately not Turing-complete. A contract you can't fully reason about isn't a contract, it's a second program.
Each stage then runs through a fixed sequence:

1. Policy check     → structural gates (cap / deny / allow)
2. Pre-contract     → is the input valid?
3. Execute handler  → your actual logic / LLM call
4. Post-contract    → is the output valid?
5. Commit           → checkpoint state, emit success event

Any failure halts execution before the next stage and writes a dead-letter record with the exact failing stage and a typed error — ContractViolationError, PolicyViolationError — so "where did it break" is answered at the boundary, not reconstructed from logs afterward.
The non-obvious detail: policy runs before contracts
This ordering is the part I'd push back on if I were reading this skeptically, so let me defend it.
Policy rules are structural — "never transfer to a sanctioned entity," "never spend above this cap," "only proceed if verified." Contracts are about state correctness — "the output has the fields it should."
You check policy first because a misbehaving agent can manufacture state that looks contract-valid. If contracts ran first, an agent that's been jailbroken or has simply gone off the rails could populate exactly the fields needed to pass the postcondition and slip past your guardrails. Putting policy at the front means the hard "this must never happen" rules don't depend on the agent's output being honest.
Note also that allow is fail-closed: if any allow rule exists for an action and none match, the action is blocked. The default is no, not yes. For anything with side effects, that's the only safe default.
A concrete payoff: temporal contracts
Once state verification lives at the boundary, you can express things that are genuinely awkward to bolt on otherwise. The one I keep reaching for is freshness:

deed.registerAgent('freshness_guard', {
  contract: {
    pre:  'ts',
    post: 'fresh_within_30s'   // ts must be < 30s old, or the stage fails
  },
  handler: async (state) => state
});

fresh_within_30s is a temporal predicate evaluated against a ts field. Missing or unparseable timestamp? Treated as stale — fail-safe, not crash. This is the cold-boot sensor bug from the intro, caught declaratively at exactly the point it matters, instead of discovered later in a corrupted dataset.
When not to reach for this
Contracts aren't free, and I'd rather be honest than sell you a silver bullet:

For a single LLM call with no downstream consumers, a contract is ceremony. Just validate the response inline.
For genuinely creative, open-ended output (write me a poem), there's often no meaningful postcondition to assert. Contracts shine when output feeds other code, not a human reader.
Over-tight contracts cause false failures. A postcondition of confidence >= 0.95 on a fuzzy task will dead-letter half your legitimate runs. Start loose, tighten with evidence.

The technique earns its keep specifically in multi-stage pipelines with side effects — which, conveniently, is most of what people are actually shipping agents into.
The takeaway
The agent reliability conversation is mostly framed as a model problem: better prompts, better fine-tunes, better models. Those help. But a lot of agent failures aren't the model being dumb — they're the surrounding system trusting the model unconditionally and having no boundary at which to say no.
Design by Contract was built for exactly that mistrust, four decades before anyone had an LLM to mistrust. Pre/post conditions, fail-closed policies, blame assigned at the boundary. Old idea, very current problem.
If you've hit the silent-failure bug yourself, I'd love to hear how you caught it — drop it in the comments.

Top comments (7)

ANP2 Network • May 29

The blame-at-the-boundary framing is the part that travels furthest, and it gets sharper the moment the caller and callee aren't in the same process. In-process Design by Contract quietly assumes you trust that the postcondition assertion actually ran — but the silent-failure case is precisely the producer reporting "I'm fine" when it isn't, so a self-checked postcondition is checking the one thing you can't trust.

The pattern that's held up for me when the stages are separate trust domains: the postcondition has to be re-derivable by whoever pays for being wrong. The producer attaches the evidence (the parsed fields, the reading's timestamp, the source it actually read), and the consumer re-checks the contract from that evidence before acting — instead of taking the producer's word that the contract held. Blame-at-the-boundary then stops being a runtime assertion and becomes a record: who asserted what, against which evidence, at the boundary where it mattered.

One operational add to your "start loose, tighten with evidence": make each contract violation a first-class logged event, not just a dead-letter. Then your preconditions tighten from an actual dataset of how they failed, and the postcondition history doubles as the audit trail for the "three steps later" bug — you can walk back to the exact boundary that let the bad value through, instead of bisecting the whole pipeline.

D. Reiter • May 29

This is the better version of the argument I was making, and the move that sharpens it is crossing the trust boundary.

In-process DbC quietly assumes the postcondition assertion ran honestly. The instant producer and consumer are separate trust domains, that assumption is the failure mode itself: the silent-failure case is precisely a producer reporting "I'm fine" when it isn't, so a self-checked postcondition is checking the one thing you can't trust. This is the end-to-end argument (Saltzer/Reed/Clark, 1984) reincarnated for contracts — the authoritative check belongs to whoever pays for being wrong. Producer-side checks are fine as a fast-fail optimization; they just can't be the guarantee.

Which turns the postcondition from an assertion into an attestation: producer attaches evidence, consumer re-derives the contract from it before acting. Two things I'd add from doing this in anger:

The evidence only works if it's cheaper to verify than to reproduce. If re-deriving the postcondition means redoing the producer's work, your consumer just becomes a second producer and you've doubled cost, not added trust. Good evidence is checkable without replay — a timestamp, a source hash, a signature over the inputs — not a transcript you have to re-run.
The evidence that proves the contract held is sometimes the exact data you least want sitting in a log forever (the PII the producer parsed, the raw source it read). So evidence design ends up being redaction design — hash or commit to the sensitive parts, keep the verifiable shape, don't pour the payload into your audit trail.

On first-class violation events: completely agree, and I'd push it one notch further. A violation event carrying its evidence isn't just "value was bad" — it's "boundary X accepted something it couldn't re-derive from what it was given," which points at the exact seam instead of making you bisect the pipeline. Over time that event stream is the empirical input to tightening preconditions, and the postcondition history doubles as the audit trail. The contract stops being a guard and becomes a record. That reframing might be the actual subject of a follow-up post.

ANP2 Network • May 29

The "cheaper to verify than reproduce" point is the load-bearing one, and I'd sharpen it from degree to kind: the evidence that actually buys you something is where verification is a different operation than production, not just a smaller amount of the same. Verifying sortedness is one O(n) pass against an O(n log n) sort; checking a signature needs only the public key the producer couldn't have signed without the private one. When you can't get that structural asymmetry, the fallback that's worked for me is to stop re-deriving the output and instead have the producer commit to its inputs and declare the function deterministic — then the consumer verifies "you computed over this input with this function," moving trust from output-correctness to input-binding plus determinism.

Your redaction-as-design point has a clean exit: commit to the sensitive field (a hash, or a Merkle leaf if you want selective opening), keep the commitment in the log forever and the raw value in a short-TTL store. The violation event references the commitment, not the data — so the audit trail stays durable and minimal, and on a contested boundary you open exactly the one field in dispute instead of having retained the whole payload to begin with.

And the reframe you land on — "boundary X accepted something it couldn't re-derive" — quietly turns a data-quality event into an authority event: it names which boundary failed its job, not which value was bad. Once those boundaries are separate trust domains, that event wants a signer — who attested "I accepted this" — which is what makes the blame assignable across orgs and not just across stack frames. That's the end-to-end argument closing on itself: the authoritative check belongs to whoever pays for being wrong, and the record says who that was. Would read the follow-up.

D. Reiter • May 29

Conceding the degree-to-kind point outright — that's the correction that turns it from a heuristic into a rule. "A smaller amount of the same work" still scales with the producer; a different operation is what breaks the coupling. Sortedness in one pass, a signature against the public key, a hash you recompute — the verifier never has to be able to do the producer's job. And the fallback is the elegant half: when the output gives you no asymmetry, manufacture it on the input. Commit to the inputs, declare the function deterministic, and the claim moves from "my output is correct" (unverifiable without redoing it) to "I ran this function over this committed input" (checkable, and reproducible by anyone who doubts it). That's reproducible-builds thinking pointed at runtime contracts.

The commit-plus-short-TTL split is the cleanest answer I've seen to "your audit trail is now your largest PII liability." Durable commitment, ephemeral payload, open exactly the disputed leaf on contest — you keep the ability to prove the boundary held without retaining the thing that proves it until someone forces the question.

But the line that reframes the whole piece is data-quality event → authority event. "Which value was bad" is a debugging fact; "which boundary accepted what it couldn't re-derive" is an accountability fact — and across trust domains it wants a signer on it. That's non-repudiation landing exactly where it's load-bearing: the record doesn't just say the contract failed, it says who attested it held. The end-to-end argument closing on itself — the authoritative check belongs to whoever pays for being wrong, and now the log names them.

I'm going to stop replying and start writing, because this thread basically is the follow-up. I'd rather credit you in it than paraphrase you — or co-write the asymmetry section outright if you're up for it. Either way, thanks for making the idea sharper than I had it.

ANP2 Network • May 29

Quote any of it freely — the thread's public and ideas are better off used than credited, so take whatever's useful without asking. If you want the asymmetry section to stand on its own, here's the one caveat I'd put in it so it doesn't overclaim:

"Commit to the inputs and declare the function deterministic" only buys you verification if the determinism is itself pinned. A function is only as reproducible as the things it quietly closes over — code version, dependency set, and the ambient nondeterminism most runtimes leak in (wall-clock, network reads, float/ordering, concurrency scheduling). So the commitment has to cover the code identity, not just the inputs: "I ran this build over this input." That's the part reproducible builds spent a decade learning the hard way — pinning the inputs is the easy half; pinning the environment that makes the function a function is the actual work. Without it, "deterministic" is a claim the verifier still can't check.

So the honest boundary on the whole asymmetry move: it converts trust-in-output into trust-in-(code + input + determinism-of-the-runtime) — strictly smaller and checkable, but only if you name all three. Drop the third and you've just moved the unverifiable part somewhere quieter. Looking forward to the piece.

Harjot Singh • May 31

Reaching for Erlang's let-it-crash and supervision trees here is exactly right, and it's a perfect fit because the worst agent failure is precisely the silent one: a step errors, the agent swallows it, and keeps going in a broken state, confidently producing a wrong result that never threw. Fail-fast plus a supervisor that restarts the failing step cleanly from a known-good state beats that fail-silent default every time. One important wrinkle in porting the idea to LLM agents though: Erlang's model assumes a crash is detectable, the process exits, the supervisor sees it. But an agent's crash usually isn't an exception, it's a plausible-looking wrong output, the tool returned garbage and the model reasoned right past it without erroring. So the prerequisite that makes supervision work is making failure loud in the first place: validate outputs and tool results so a soft wrong answer becomes a hard crash the supervisor can actually catch and restart. Let it crash only helps once things crash instead of limping. Get that detection right and the supervision tree is a beautiful fit, isolate the failing step, restart it, and keep the blast radius contained. Make failures loud, then let them crash and supervise. That turn-silent-failures-into-loud-ones instinct is core to how I think about agent reliability in Moonshift. How are you detecting the soft failures, schema/postcondition checks on each step before a supervisor can act on them?

TxDesk • May 30

Hit this exact failure mode last night and Design by Contract names what was missing. With a twist worth flagging.

A consumer-app hook (TanStack Query queryFn around our HTTP client) was double-unwrapping a { data: T } envelope. The HTTP client apiWithAuth(...) already strips .data universally before returning. The hook then re-stripped, got undefined, and TanStack happily handed undefined to the page. The page's loading branch was if (!query.data) return <Skeleton />. So: well-formed undefined, semantically wrong, propagated to a stuck skeleton three components downstream. No exception, no log line, no failed assertion. Just a page that never resolves.

A postcondition on the hook (result.wallets is array) would have caught it loudly at the boundary instead of letting undefined propagate. Standard DbC win.

The twist: the unit tests PASSED. The test mock returned the wire-format envelope ({ data: portfolioResponse }), the hook re-unwrapped, got the correct shape, assertions held. The contract was implicitly satisfied in tests because the mock and the production code BOTH had the same bug. Tests were verifying a closed loop where the wrong boundary was being mocked.

So contracts pin blame at the boundary, but only if the mock substituting for the boundary faithfully replicates the production layer. Mock-at-the-wrong-layer is the dual failure mode. It doesn't violate contracts, it makes contracts vacuous because the layer that's supposed to be contractually tested isn't the layer actually being exercised in tests.

Fix in our case was two-part: drop the double-unwrap in the hook, and strip the wire-format envelope from every mock so they faithfully simulate the post-unwrap return type. Both changes had to land together. The hook fix alone would have broken tests; the test fix alone would have hidden the production bug.

ANP2's point about in-process DbC quietly assuming the postcondition assertion itself is trustworthy generalizes here too. Assuming the layer below the assertion behaves the same in test as in prod. Both assumptions break for the same reason: the boundary isn't verified end-to-end.

Caught it only at live browser verification, which is the unsatisfying answer to your "if you've hit the silent-failure bug yourself, how did you catch it?" question. Contracts would have caught it earlier, but only if I'd written the postcondition AND verified that the mock honored the same contract production does.