<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jonny</title>
    <description>The latest articles on DEV Community by Jonny (@deadlyreiter).</description>
    <link>https://dev.to/deadlyreiter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931900%2F2635456d-32b0-4a95-8a7d-7a029ab81523.png</url>
      <title>DEV Community: Jonny</title>
      <link>https://dev.to/deadlyreiter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deadlyreiter"/>
    <language>en</language>
    <item>
      <title>Why Your LLM Agent Needs Contracts, Not Just Logs</title>
      <dc:creator>Jonny</dc:creator>
      <pubDate>Thu, 14 May 2026 20:09:27 +0000</pubDate>
      <link>https://dev.to/deadlyreiter/why-your-llm-agent-needs-contracts-not-just-logs-oo7</link>
      <guid>https://dev.to/deadlyreiter/why-your-llm-agent-needs-contracts-not-just-logs-oo7</guid>
      <description>&lt;p&gt;How we stopped debugging agent failures after the fact and started preventing them upfront&lt;/p&gt;

&lt;p&gt;The Problem&lt;br&gt;
You're running an LLM agent pipeline in production. Something goes wrong.&lt;br&gt;
You open the logs. You see what the agent returned. You see that it failed. But you have no idea what the state of the system was before it happened — what data went in, whether preconditions were valid, which policy was silently violated three steps earlier.&lt;br&gt;
Logging tells you what occurred.&lt;br&gt;
It doesn't tell you what was allowed to occur.&lt;br&gt;
This is the gap we kept hitting. Every team we've talked to that runs agents in production has some version of this problem. Most solve it with ad-hoc assertions, careful logging, and hope. We wanted something systematic.&lt;br&gt;
So we built DEED.&lt;/p&gt;

&lt;p&gt;The Wrong Mental Model&lt;br&gt;
When something breaks in a traditional service, you look at the request that came in and the response that went out. The failure boundary is clear.&lt;br&gt;
LLM agent pipelines don't work like that. Each step transforms a shared state object. The agent at step 3 is operating on output that was shaped by steps 1 and 2. By the time you see the failure, the system has already passed through multiple states — and none of them were validated.&lt;br&gt;
The standard fix is to add assertions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enriched&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works until it doesn't. Assertions are scattered across executor code. They don't tell you why a condition wasn't met. They don't write to a dead-letter queue. They don't checkpoint state so you can replay from the failure point. And they're invisible to anyone who isn't reading your Python.&lt;/p&gt;

&lt;p&gt;A Different Layer: Contracts&lt;br&gt;
DEED introduces a declarative contract layer that sits between your pipeline definition and your agent executors.&lt;br&gt;
Every agent has a contract: what must be true before it runs (pre-condition), and what must be true after (post-condition). Every agent also has a policy: what actions are allowed, what's capped, what's explicitly denied.&lt;br&gt;
Here's what that looks like in DEED's .dd format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent score_agent
  description "ICP scoring agent — evaluates company fit 0.0-1.0"
  capabilities ["score_company"]

  policy
    cap budget_tokens &amp;lt;= 3000
    allow score_company if enriched

  contract score_contract
    pre  enriched
    post scored

  observe
    trace true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before score_agent runs: the runtime checks that enriched is truthy in the current state. If it's not — the step is rejected, state is preserved as-is, and a DLQ entry is written with the full context snapshot.&lt;br&gt;
After the agent runs: the runtime checks that scored is now present. If the post-condition fails — same outcome, plus automatic credit refund if you're using the metering layer.&lt;br&gt;
The policy runs before the LLM call. allow score_company if enriched means that if enriched somehow drops to false between the pre-check and the action, the action is blocked before it executes.&lt;/p&gt;
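
&lt;p&gt;To make that ordering concrete, here's a minimal sketch of the enforcement loop in plain Python. This is not DEED's internals or its public API (run_step and write_dlq are hypothetical names), just the order of operations the contract layer guarantees: policy and pre-condition before the call, post-condition after, and a DLQ entry with a state snapshot on any violation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the enforcement order -- not deed-runtime's actual code.
import copy
import json
import time


class ContractViolation(Exception):
    pass


def write_dlq(stage, predicate, snapshot):
    # A DLQ entry carries enough context to diagnose the failure and replay.
    entry = {"stage": stage, "predicate": predicate,
             "snapshot": snapshot, "ts": time.time()}
    with open("dlq.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")


async def run_step(agent, state, pre, post, policy_allows):
    snapshot = copy.deepcopy(state)      # preserve state as-is on rejection
    if not pre(state):                   # pre-condition gate
        write_dlq(agent.name, "pre", snapshot)
        raise ContractViolation(f"{agent.name}: pre-condition failed")
    if not policy_allows(state):         # policy check, before the LLM call fires
        write_dlq(agent.name, "policy", snapshot)
        raise ContractViolation(f"{agent.name}: action denied by policy")
    new_state = await agent.run(state)
    if not post(new_state):              # post-condition gate
        write_dlq(agent.name, "post", snapshot)
        raise ContractViolation(f"{agent.name}: post-condition failed")
    return new_state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;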

&lt;p&gt;The Pipeline&lt;br&gt;
Contracts live next to the pipeline spec, not buried in executor code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline sales_intelligence
  description "End-to-end sales intelligence workflow"
  input company_profile

  stage enrich
    agent data_agent
    -&amp;gt; enrich_company()
    checkpoint after
    on_error retry

  stage score
    agent score_agent
    -&amp;gt; score_company()
    checkpoint after
    on_error retry

  stage brief
    agent brief_agent
    -&amp;gt; generate_brief()
    -&amp;gt; persist_result()
    on_error deadletter

  observe
    trace true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage has explicit error handling. checkpoint after means the state is written to disk after the stage completes — so if the pipeline crashes mid-run, you replay from the last checkpoint, not from the beginning.&lt;br&gt;
Side effects already executed are tracked via idempotency keys. No double-charges. No duplicate writes.&lt;/p&gt;
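
&lt;p&gt;For illustration, here's what that guard can look like (a sketch, not deed-runtime's implementation): derive a stable key from the stage and its input, and skip any side effect whose key has already been recorded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of an idempotency guard -- illustrative, not deed-runtime's code.
import hashlib
import json

_executed: set[str] = set()  # in production this would be durable storage


def idempotency_key(stage: str, payload: dict) -&amp;gt; str:
    # Stable hash of the stage name plus its exact input.
    raw = stage + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def run_side_effect(stage: str, payload: dict, effect) -&amp;gt; None:
    key = idempotency_key(stage, payload)
    if key in _executed:
        return  # already ran on a previous attempt: no double-charge
    effect(payload)
    _executed.add(key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;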

&lt;p&gt;What Happens on Failure&lt;br&gt;
A real example from the mushroom_safety pipeline — a four-stage safety-critical workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline foraged_mushroom_safety
  input mushroom_observation

  stage intake
    agent intake_agent
    -&amp;gt; normalize_observation
    checkpoint after
    on_error deadletter

  stage taxonomy
    agent taxonomy_agent
    -&amp;gt; classify_candidate
    -&amp;gt; detect_lookalikes
    checkpoint after
    on_error retry(2)

  stage risk
    agent risk_agent
    -&amp;gt; assess_toxicity_risk
    -&amp;gt; compute_confidence
    checkpoint after
    on_error retry(2)

  stage safety
    agent safety_agent
    -&amp;gt; generate_safety_advisory
    -&amp;gt; persist_case
    checkpoint after
    on_error deadletter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with a deliberate failure and here's what you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python run.py --fail
What you get:
[intake]    ✓ pre: mushroom_observation
[intake]    ✓ post: normalized
[taxonomy]  ✓ pre: normalized
[taxonomy]  ✕ post: species_identified — ContractViolation
             → state preserved
             → DLQ entry written: stage=taxonomy, predicate=species_identified
             → context snapshot attached
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline stops at the exact failure point. The DLQ entry contains everything you need to understand what happened — the state before the step, the state after, which predicate failed.&lt;br&gt;
Fix the issue. Replay from taxonomy. Steps before it don't re-execute.&lt;/p&gt;
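
&lt;p&gt;Mechanically, replaying from taxonomy means loading the checkpoint written after intake and resuming from there. A sketch of those semantics (hypothetical names, not the actual runtime API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of replay-from-checkpoint semantics -- not deed-runtime's API.
import json


def replay(pipeline, from_stage: str):
    names = [s.name for s in pipeline.stages]
    idx = names.index(from_stage)
    if idx == 0:
        raise RuntimeError("first stage has no prior checkpoint; rerun from start")
    # Load the state written by 'checkpoint after' on the last completed stage.
    with open(f"checkpoints/{names[idx - 1]}.json") as f:
        state = json.load(f)
    # Re-run only the failed stage and everything downstream of it.
    for stage in pipeline.stages[idx:]:
        state = stage.run(state)
    return state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;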

&lt;p&gt;Why a DSL?&lt;br&gt;
The .dd format is intentionally readable by non-engineers — compliance reviewers, domain experts, QA. The contract file is an artifact you can show an auditor, not something buried in a decorator chain.&lt;br&gt;
There's also a practical reason: docs/MASTER_MANUAL_FOR_LLM.md in the repo is a system prompt that teaches LLMs to generate .dd files from domain descriptions. Describe your workflow in plain language, get a contract spec back. A small, constrained format is far easier for an LLM to generate reliably than arbitrary code; a sketch of that loop follows below.&lt;br&gt;
A native Python API is on the roadmap; we know the DSL is a barrier for some workflows.&lt;/p&gt;
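
&lt;p&gt;A sketch of that generation loop (assuming the OpenAI Python client; the model name and the example prompt are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: generate a .dd spec from a plain-language workflow description.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
manual = Path("docs/MASTER_MANUAL_FOR_LLM.md").read_text()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": manual},
        {"role": "user", "content": "A three-stage invoice triage workflow: "
                                    "ingest, classify by vendor, flag anomalies."},
    ],
)
Path("invoice_triage.dd").write_text(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;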

&lt;p&gt;What This Is Not&lt;br&gt;
DEED is not an observability tool. It doesn't replace LangSmith or Langfuse. Those tell you what happened — DEED enforces what's allowed to happen before it does. Different layer. You'd use both.&lt;br&gt;
DEED is not a workflow orchestrator. It doesn't replace Temporal or Prefect. You could run a DEED pipeline inside a Temporal workflow — DEED handles the contract layer, Temporal handles scheduling and retries at the workflow level.&lt;/p&gt;

&lt;p&gt;Try It&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;deed-runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero dependencies. Python 3.10+.&lt;br&gt;
Three examples in the repo: mushroom_safety (safety-critical pipeline with deliberate failure mode), sales_agent (B2B scoring with policy deny on restricted regions), orchid_rescue (reference spec only — conservation triage workflow).&lt;/p&gt;
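
&lt;p&gt;To reproduce the failure run shown above (directory layout assumed from the example names, so check the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Deadly-Reiter/deed
&lt;span class="nb"&gt;cd &lt;/span&gt;deed/examples/mushroom_safety  &lt;span class="c"&gt;# path assumed&lt;/span&gt;
python run.py --fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;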

&lt;p&gt;GitHub: github.com/Deadly-Reiter/deed&lt;br&gt;
Docs:   deed-docs.onrender.com&lt;/p&gt;

&lt;p&gt;If you're running agents in production and have a different approach to this problem, I'm genuinely curious what you're doing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
