Jonny

Why Your LLM Agent Needs Contracts, Not Just Logs

How we stopped debugging agent failures after the fact and started preventing them upfront

The Problem
You're running an LLM agent pipeline in production. Something goes wrong.
You open the logs. You see what the agent returned. You see that it failed. But you have no idea what the state of the system was before it happened — what data went in, whether preconditions were valid, which policy was silently violated three steps earlier.
Logging tells you what occurred.
It doesn't tell you what was allowed to occur.
This is the gap we kept hitting. Every team we talked to running agents in production has some version of this problem. Most solve it with ad-hoc assertions, careful logging, and hope. We wanted something systematic.
So we built DEED.

The Wrong Mental Model
When something breaks in a traditional service, you look at the request that came in and the response that went out. The failure boundary is clear.
LLM agent pipelines don't work like that. Each step transforms a shared state object. The agent at step 3 is operating on output that was shaped by steps 1 and 2. By the time you see the failure, the system has already passed through multiple states — and none of them were validated.
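To make that concrete, here is a tiny hypothetical shared-state pipeline in plain Python (nothing DEED-specific, all names invented for illustration). A flag planted at step 1 is what keeps step 3 from blowing up, but nothing in the code makes that dependency visible:

```python
# Illustrative sketch, not DEED's API: each step mutates a shared state
# dict, so a failure at step 3 can be caused by state shaped at step 1.

def enrich(state: dict) -> dict:
    # Step 1: sets a flag that later steps silently depend on.
    state["enriched"] = True
    return state

def score(state: dict) -> dict:
    # Step 2: assumes "enriched" is set; writes None if it isn't.
    state["score"] = 0.8 if state.get("enriched") else None
    return state

def brief(state: dict) -> dict:
    # Step 3: this is where a missing upstream flag finally surfaces.
    if state["score"] is None:
        raise ValueError("no score to brief on")
    state["brief"] = f"fit score: {state['score']}"
    return state

state: dict = {}
for step in (enrich, score, brief):
    state = step(state)
```

If `enrich` ever stops setting the flag, the traceback points at `brief`, two hops away from the actual cause.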
The standard fix is to add assertions:

result = await agent.run(state)
assert result.get("score") is not None
assert result.get("enriched") is True

This works until it doesn't. Assertions are scattered across executor code. They don't tell you why a condition wasn't met. They don't write to a dead-letter queue. They don't checkpoint state so you can replay from the failure point. And they're invisible to anyone who isn't reading your Python.

A Different Layer: Contracts
DEED introduces a declarative contract layer that sits between your pipeline definition and your agent executors.
Every agent has a contract: what must be true before it runs (pre-condition), and what must be true after (post-condition). Every agent also has a policy: what actions are allowed, what's capped, what's explicitly denied.
Here's what that looks like in DEED's .dd format:

agent score_agent
  description "ICP scoring agent — evaluates company fit 0.0-1.0"
  capabilities ["score_company"]

  policy
    cap budget_tokens <= 3000
    allow score_company if enriched

  contract score_contract
    pre  enriched
    post scored

  observe
    trace true

Before score_agent runs: the runtime checks that enriched is truthy in the current state. If it's not — the step is rejected, state is preserved as-is, and a DLQ entry is written with the full context snapshot.
After the agent runs: the runtime checks that scored is now present. If the post-condition fails — same outcome, plus automatic credit refund if you're using the metering layer.
The policy runs before the LLM call. allow score_company if enriched means if somehow enriched dropped to false between the pre-check and the action — the action is blocked before it executes.
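A minimal sketch of that check order in plain Python: pre-condition, agent call, post-condition, with a DLQ write on every violation. The names here (`ContractViolation`, `dlq`, `run_step`) are illustrative assumptions, not DEED's actual internals:

```python
# Illustrative sketch, not DEED's runtime. In DEED, the policy gate would
# also run between the pre-check and the agent call.

class ContractViolation(Exception):
    pass

dlq: list[dict] = []  # stand-in for a persistent dead-letter queue

def run_step(agent_fn, state: dict, pre: str, post: str) -> dict:
    # Pre-condition: the named predicate must be truthy in current state.
    if not state.get(pre):
        dlq.append({"phase": "pre", "predicate": pre, "snapshot": dict(state)})
        raise ContractViolation(f"pre failed: {pre}")

    new_state = agent_fn(dict(state))  # agent works on a copy

    # Post-condition: the predicate must hold in the resulting state.
    # On failure the original state object is preserved untouched.
    if not new_state.get(post):
        dlq.append({"phase": "post", "predicate": post, "snapshot": new_state})
        raise ContractViolation(f"post failed: {post}")
    return new_state

ok = run_step(lambda s: {**s, "scored": True},
              {"enriched": True}, "enriched", "scored")
```

The point of the structure, rather than scattered `assert`s, is that every rejection lands in one place with a full snapshot attached.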

The Pipeline
Contracts live next to the pipeline spec, not buried in executor code:

pipeline sales_intelligence
  description "End-to-end sales intelligence workflow"
  input company_profile

  stage enrich
    agent data_agent
    -> enrich_company()
    checkpoint after
    on_error retry

  stage score
    agent score_agent
    -> score_company()
    checkpoint after
    on_error retry

  stage brief
    agent brief_agent
    -> generate_brief()
    -> persist_result()
    on_error deadletter

  observe
    trace true

Each stage has explicit error handling. checkpoint after means the state is written to disk after the stage completes — so if the pipeline crashes mid-run, you replay from the last checkpoint, not from the beginning.
Side effects already executed are tracked via idempotency keys. No double-charges. No duplicate writes.

What Happens on Failure
A real example from the mushroom_safety pipeline — a four-stage safety-critical workflow:

pipeline foraged_mushroom_safety
  input mushroom_observation

  stage intake
    agent intake_agent
    -> normalize_observation
    checkpoint after
    on_error deadletter

  stage taxonomy
    agent taxonomy_agent
    -> classify_candidate
    -> detect_lookalikes
    checkpoint after
    on_error retry(2)

  stage risk
    agent risk_agent
    -> assess_toxicity_risk
    -> compute_confidence
    checkpoint after
    on_error retry(2)

  stage safety
    agent safety_agent
    -> generate_safety_advisory
    -> persist_case
    checkpoint after
    on_error deadletter

Run it with a deliberate failure:

python run.py --fail
What you get:
[intake]    ✓ pre: mushroom_observation
[intake]    ✓ post: normalized
[taxonomy]  ✓ pre: normalized
[taxonomy]  ✕ post: species_identified — ContractViolation
             → state preserved
             → DLQ entry written: stage=taxonomy, predicate=species_identified
             → context snapshot attached

The pipeline stops at the exact failure point. The DLQ entry contains everything you need to understand what happened — the state before the step, the state after, which predicate failed.
Fix the issue. Replay from taxonomy. Steps before it don't re-execute.

Why a DSL?
The .dd format is intentionally readable by non-engineers — compliance reviewers, domain experts, QA. The contract file is an artifact you can show an auditor, not something buried in a decorator chain.
There's also a practical reason: docs/MASTER_MANUAL_FOR_LLM.md in the repo is a system prompt that teaches LLMs to generate .dd files from domain descriptions. Describe your workflow in plain language, get a contract spec back. That works well with a structured format.
A native Python API is on the roadmap — we know the DSL is a barrier for some workflows.

What This Is Not
DEED is not an observability tool. It doesn't replace LangSmith or Langfuse. Those tell you what happened — DEED enforces what's allowed to happen before it does. Different layer. You'd use both.
DEED is not a workflow orchestrator. It doesn't replace Temporal or Prefect. You could run a DEED pipeline inside a Temporal workflow — DEED handles the contract layer, Temporal handles scheduling and retries at the workflow level.

Try It

pip install deed-runtime

Zero dependencies. Python 3.10+.
Three examples in the repo: mushroom_safety (safety-critical pipeline with deliberate failure mode), sales_agent (B2B scoring with policy deny on restricted regions), orchid_rescue (reference spec only — conservation triage workflow).

GitHub: github.com/Deadly-Reiter/deed
Docs: deed-docs.onrender.com

If you're running agents in production and have a different approach to this problem, I'm genuinely curious what you're doing.

Top comments (2)

Vic Chen

This is a solid framing. I like the shift from post-hoc logs to explicit pre/post conditions around shared state. In production agent systems, the real pain is usually not the failed step itself, but discovering which invariant drifted two hops earlier. The checkpointing + idempotency angle is especially practical.

Jonny

Exactly this. The failure you see is almost never where the problem started — it's just where the state drift finally became visible.

The two-hops-earlier problem is what made us go declarative. When contracts live in the spec rather than scattered across executor code, you can trace back through the pipeline and see exactly which predicate was already false before the failing step ran.

Checkpointing was the other non-negotiable. Replay without idempotency keys is just re-running side effects with extra steps — which is sometimes worse than the original failure.

What's your current stack for agent pipelines? Curious whether you're seeing this at the orchestration layer or deeper in the executor logic.