How we stopped debugging agent failures after the fact and started preventing them upfront
The Problem
You're running an LLM agent pipeline in production. Something goes wrong.
You open the logs. You see what the agent returned. You see that it failed. But you have no idea what the state of the system was before it happened — what data went in, whether preconditions were valid, which policy was silently violated three steps earlier.
Logging tells you what occurred.
It doesn't tell you what was allowed to occur.
This is the gap we kept hitting. Every team we talked to running agents in production has some version of this problem. Most solve it with ad-hoc assertions, careful logging, and hope. We wanted something systematic.
So we built DEED.
The Wrong Mental Model
When something breaks in a traditional service, you look at the request that came in and the response that went out. The failure boundary is clear.
LLM agent pipelines don't work like that. Each step transforms a shared state object. The agent at step 3 is operating on output that was shaped by steps 1 and 2. By the time you see the failure, the system has already passed through multiple states — and none of them were validated.
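In miniature, the failure mode looks like this. (Hypothetical step functions for illustration, not DEED code — each step trusts whatever state it receives, and nothing checks the boundaries.)

```python
def enrich(state):
    state["enriched"] = True
    return state

def score(state):
    # Silently assumes "enriched" is truthy -- nothing enforces it.
    state["score"] = 0.8 if state.get("enriched") else None
    return state

state = {"company": "Acme"}
for step in (enrich, score):
    state = step(state)  # no contract checked at any step boundary
```

If a refactor ever makes `enrich` stop setting that flag, `score` quietly produces `None` and the failure surfaces somewhere downstream.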
The standard fix is to add assertions:
result = await agent.run(state)
assert result.get("score") is not None
assert result.get("enriched") is True
This works until it doesn't. Assertions are scattered across executor code. They don't tell you why a condition wasn't met. They don't write to a dead-letter queue. They don't checkpoint state so you can replay from the failure point. And they're invisible to anyone who isn't reading your Python.
A Different Layer: Contracts
DEED introduces a declarative contract layer that sits between your pipeline definition and your agent executors.
Every agent has a contract: what must be true before it runs (pre-condition), and what must be true after (post-condition). Every agent also has a policy: what actions are allowed, what's capped, what's explicitly denied.
Here's what that looks like in DEED's .dd format:
agent score_agent
description "ICP scoring agent — evaluates company fit 0.0-1.0"
capabilities ["score_company"]
policy
cap budget_tokens <= 3000
allow score_company if enriched
contract score_contract
pre enriched
post scored
observe
trace true
Before score_agent runs: the runtime checks that enriched is truthy in the current state. If it's not — the step is rejected, state is preserved as-is, and a DLQ entry is written with the full context snapshot.
After the agent runs: the runtime checks that scored is now present. If the post-condition fails — same outcome, plus automatic credit refund if you're using the metering layer.
The policy runs before the LLM call. allow score_company if enriched means that if enriched somehow dropped to false between the pre-check and the action, the action is blocked before it executes.
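Conceptually, the enforcement order can be sketched in a few lines of Python. This is an illustration of the pre/post flow described above, not DEED's actual internals; predicates are modeled as simple truthiness checks on a state dict, and the DLQ as a plain list.

```python
class ContractViolation(Exception):
    pass

def run_with_contract(agent, state, pre, post, dlq):
    # 1. Pre-condition: reject the step before the agent ever runs.
    if not state.get(pre):
        dlq.append({"phase": "pre", "predicate": pre, "snapshot": dict(state)})
        raise ContractViolation(f"pre: {pre}")
    new_state = agent(dict(state))  # agent works on a copy; old state survives
    # 2. Post-condition: reject the result and keep the pre-step state.
    if not new_state.get(post):
        dlq.append({"phase": "post", "predicate": post, "snapshot": dict(state)})
        raise ContractViolation(f"post: {post}")
    return new_state
```

The important property is that a violation never advances the shared state: the snapshot in the DLQ entry is exactly what the next attempt will see.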
The Pipeline
Contracts live next to the pipeline spec, not buried in executor code:
pipeline sales_intelligence
description "End-to-end sales intelligence workflow"
input company_profile
stage enrich
agent data_agent
-> enrich_company()
checkpoint after
on_error retry
stage score
agent score_agent
-> score_company()
checkpoint after
on_error retry
stage brief
agent brief_agent
-> generate_brief()
-> persist_result()
on_error deadletter
observe
trace true
Each stage has explicit error handling. checkpoint after means the state is written to disk after the stage completes — so if the pipeline crashes mid-run, you replay from the last checkpoint, not from the beginning.
Side effects already executed are tracked via idempotency keys. No double-charges. No duplicate writes.
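A common way to implement that dedupe is to derive a key from the stage name plus the input that feeds the side effect, and skip the effect when the key has already been seen. A hypothetical sketch, not DEED's implementation:

```python
import hashlib
import json

_executed = set()  # in production this would be durable storage, not memory

def run_effect(stage, payload, effect):
    # Canonical serialization so the same input always yields the same key.
    key = hashlib.sha256(
        json.dumps({"stage": stage, "payload": payload}, sort_keys=True).encode()
    ).hexdigest()
    if key in _executed:
        return "skipped"  # replay: this side effect already happened
    effect(payload)
    _executed.add(key)
    return "executed"
```

On replay, the effect for an already-completed stage resolves to a no-op instead of a second charge or a duplicate write.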
What Happens on Failure
A real example from the mushroom_safety pipeline — a four-stage safety-critical workflow:
pipeline foraged_mushroom_safety
input mushroom_observation
stage intake
agent intake_agent
-> normalize_observation
checkpoint after
on_error deadletter
stage taxonomy
agent taxonomy_agent
-> classify_candidate
-> detect_lookalikes
checkpoint after
on_error retry(2)
stage risk
agent risk_agent
-> assess_toxicity_risk
-> compute_confidence
checkpoint after
on_error retry(2)
stage safety
agent safety_agent
-> generate_safety_advisory
-> persist_case
checkpoint after
on_error deadletter
Run it with a deliberate failure:
python run.py --fail
What you get:
[intake] ✓ pre: mushroom_observation
[intake] ✓ post: normalized
[taxonomy] ✓ pre: normalized
[taxonomy] ✕ post: species_identified — ContractViolation
→ state preserved
→ DLQ entry written: stage=taxonomy, predicate=species_identified
→ context snapshot attached
The pipeline stops at the exact failure point. The DLQ entry contains everything you need to understand what happened — the state before the step, the state after, which predicate failed.
Fix the issue. Replay from taxonomy. Steps before it don't re-execute.
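Replay-from-checkpoint can be sketched as follows. This is a hypothetical, file-based illustration of the behavior described above; DEED's own checkpoint format and storage are not shown here.

```python
import json
from pathlib import Path

def run_pipeline(stages, state, ckpt_dir):
    ckpt_dir = Path(ckpt_dir)
    for name, fn in stages:
        ckpt = ckpt_dir / f"{name}.json"
        if ckpt.exists():
            # Stage already completed on a previous run: restore, don't re-run.
            state = json.loads(ckpt.read_text())
            continue
        state = fn(state)
        ckpt.write_text(json.dumps(state))  # "checkpoint after"
    return state
```

Running the pipeline a second time restores each completed stage's state from disk and resumes work at the first stage with no checkpoint, which is exactly the failure point.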
Why a DSL?
The .dd format is intentionally readable by non-engineers — compliance reviewers, domain experts, QA. The contract file is an artifact you can show an auditor, not something buried in a decorator chain.
There's also a practical reason: docs/MASTER_MANUAL_FOR_LLM.md in the repo is a system prompt that teaches LLMs to generate .dd files from domain descriptions. Describe your workflow in plain language, get a contract spec back. That works well with a structured format.
Native Python API is on the roadmap — we know the DSL is a barrier for some workflows.
What This Is Not
DEED is not an observability tool. It doesn't replace LangSmith or Langfuse. Those tell you what happened — DEED enforces what's allowed to happen before it does. Different layer. You'd use both.
DEED is not a workflow orchestrator. It doesn't replace Temporal or Prefect. You could run a DEED pipeline inside a Temporal workflow — DEED handles the contract layer, Temporal handles scheduling and retries at the workflow level.
Try It
pip install deed-runtime
Zero dependencies. Python 3.10+.
Three examples in the repo: mushroom_safety (safety-critical pipeline with deliberate failure mode), sales_agent (B2B scoring with policy deny on restricted regions), orchid_rescue (reference spec only — conservation triage workflow).
GitHub: github.com/Deadly-Reiter/deed
Docs: deed-docs.onrender.com
If you're running agents in production and have a different approach to this problem — genuinely curious what you're doing.
Top comments (2)
This is a solid framing. I like the shift from post-hoc logs to explicit pre/post conditions around shared state. In production agent systems, the real pain is usually not the failed step itself, but discovering which invariant drifted two hops earlier. The checkpointing + idempotency angle is especially practical.
Exactly this. The failure you see is almost never where the problem started — it's just where the state drift finally became visible.

The two-hops-earlier problem is what made us go declarative. When contracts live in the spec rather than scattered across executor code, you can trace back through the pipeline and see exactly which predicate was already false before the failing step ran.

Checkpointing was the other non-negotiable. Replay without idempotency keys is just re-running side effects with extra steps — which is sometimes worse than the original failure.

What's your current stack for agent pipelines? Curious whether you're seeing this at the orchestration layer or deeper in the executor logic.