<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jonny</title>
    <description>The latest articles on DEV Community by Jonny (@deadlyreiter).</description>
    <link>https://dev.to/deadlyreiter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931900%2F2635456d-32b0-4a95-8a7d-7a029ab81523.png</url>
      <title>DEV Community: Jonny</title>
      <link>https://dev.to/deadlyreiter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deadlyreiter"/>
    <language>en</language>
    <item>
      <title>Why Your LLM Agent Needs Contracts, Not Just Logs</title>
      <dc:creator>Jonny</dc:creator>
      <pubDate>Thu, 14 May 2026 20:09:27 +0000</pubDate>
      <link>https://dev.to/deadlyreiter/why-your-llm-agent-needs-contracts-not-just-logs-oo7</link>
      <guid>https://dev.to/deadlyreiter/why-your-llm-agent-needs-contracts-not-just-logs-oo7</guid>
      <description>&lt;p&gt;How we stopped debugging agent failures after the fact and started preventing them upfront&lt;/p&gt;

&lt;p&gt;The Problem&lt;br&gt;
You're running an LLM agent pipeline in production. Something goes wrong.&lt;br&gt;
You open the logs. You see what the agent returned. You see that it failed. But you have no idea what the state of the system was before it happened — what data went in, whether preconditions were valid, which policy was silently violated three steps earlier.&lt;br&gt;
Logging tells you what occurred.&lt;br&gt;
It doesn't tell you what was allowed to occur.&lt;br&gt;
This is the gap we kept hitting. Every team we've talked to that runs agents in production has some version of this problem. Most solve it with ad-hoc assertions, careful logging, and hope. We wanted something systematic.&lt;br&gt;
So we built DEED.&lt;/p&gt;

&lt;p&gt;The Wrong Mental Model&lt;br&gt;
When something breaks in a traditional service, you look at the request that came in and the response that went out. The failure boundary is clear.&lt;br&gt;
LLM agent pipelines don't work like that. Each step transforms a shared state object. The agent at step 3 is operating on output that was shaped by steps 1 and 2. By the time you see the failure, the system has already passed through multiple states — and none of them were validated.&lt;br&gt;
The standard fix is to add assertions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enriched&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works until it doesn't. Assertions are scattered across executor code. They don't tell you why a condition wasn't met. They don't write to a dead-letter queue. They don't checkpoint state so you can replay from the failure point. And they're invisible to anyone who isn't reading your Python.&lt;/p&gt;

&lt;p&gt;A Different Layer: Contracts&lt;br&gt;
DEED introduces a declarative contract layer that sits between your pipeline definition and your agent executors.&lt;br&gt;
Every agent has a contract: what must be true before it runs (pre-condition), and what must be true after (post-condition). Every agent also has a policy: what actions are allowed, what's capped, what's explicitly denied.&lt;br&gt;
Here's what that looks like in DEED's .dd format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent score_agent
  description "ICP scoring agent — evaluates company fit 0.0-1.0"
  capabilities ["score_company"]

  policy
    cap budget_tokens &amp;lt;= 3000
    allow score_company if enriched

  contract score_contract
    pre  enriched
    post scored

  observe
    trace true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before score_agent runs: the runtime checks that enriched is truthy in the current state. If it's not — the step is rejected, state is preserved as-is, and a DLQ entry is written with the full context snapshot.&lt;br&gt;
After the agent runs: the runtime checks that scored is now present. If the post-condition fails — same outcome, plus automatic credit refund if you're using the metering layer.&lt;br&gt;
The policy runs before the LLM call. allow score_company if enriched means that if enriched somehow drops to false between the pre-check and the action, the action is blocked before it executes.&lt;/p&gt;
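
&lt;p&gt;To make that ordering concrete, here's a minimal sketch of the enforcement loop in plain Python. This is not DEED's internals or its public API (run_step and write_dlq are hypothetical names), just the order of operations the contract layer guarantees: policy and pre-condition before the call, post-condition after, and a DLQ entry with a state snapshot on any violation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the enforcement order -- not deed-runtime's actual code.
import copy
import json
import time


class ContractViolation(Exception):
    pass


def write_dlq(stage, predicate, snapshot):
    # A DLQ entry carries enough context to diagnose the failure and replay.
    entry = {"stage": stage, "predicate": predicate,
             "snapshot": snapshot, "ts": time.time()}
    with open("dlq.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")


async def run_step(agent, state, pre, post, policy_allows):
    snapshot = copy.deepcopy(state)      # preserve state as-is on rejection
    if not pre(state):                   # pre-condition gate
        write_dlq(agent.name, "pre", snapshot)
        raise ContractViolation(f"{agent.name}: pre-condition failed")
    if not policy_allows(state):         # policy check, before the LLM call fires
        write_dlq(agent.name, "policy", snapshot)
        raise ContractViolation(f"{agent.name}: action denied by policy")
    new_state = await agent.run(state)
    if not post(new_state):              # post-condition gate
        write_dlq(agent.name, "post", snapshot)
        raise ContractViolation(f"{agent.name}: post-condition failed")
    return new_state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;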

&lt;p&gt;The Pipeline&lt;br&gt;
Contracts live next to the pipeline spec, not buried in executor code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline sales_intelligence
  description "End-to-end sales intelligence workflow"
  input company_profile

  stage enrich
    agent data_agent
    -&amp;gt; enrich_company()
    checkpoint after
    on_error retry

  stage score
    agent score_agent
    -&amp;gt; score_company()
    checkpoint after
    on_error retry

  stage brief
    agent brief_agent
    -&amp;gt; generate_brief()
    -&amp;gt; persist_result()
    on_error deadletter

  observe
    trace true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage has explicit error handling. checkpoint after means the state is written to disk after the stage completes — so if the pipeline crashes mid-run, you replay from the last checkpoint, not from the beginning.&lt;br&gt;
Side effects already executed are tracked via idempotency keys. No double-charges. No duplicate writes.&lt;/p&gt;
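
&lt;p&gt;For illustration, here's what that guard can look like (a sketch, not deed-runtime's implementation): derive a stable key from the stage and its input, and skip any side effect whose key has already been recorded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of an idempotency guard -- illustrative, not deed-runtime's code.
import hashlib
import json

_executed: set[str] = set()  # in production this would be durable storage


def idempotency_key(stage: str, payload: dict) -&amp;gt; str:
    # Stable hash of the stage name plus its exact input.
    raw = stage + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def run_side_effect(stage: str, payload: dict, effect) -&amp;gt; None:
    key = idempotency_key(stage, payload)
    if key in _executed:
        return  # already ran on a previous attempt: no double-charge
    effect(payload)
    _executed.add(key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;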

&lt;p&gt;What Happens on Failure&lt;br&gt;
A real example from the mushroom_safety pipeline — a four-stage safety-critical workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline foraged_mushroom_safety
  input mushroom_observation

  stage intake
    agent intake_agent
    -&amp;gt; normalize_observation
    checkpoint after
    on_error deadletter

  stage taxonomy
    agent taxonomy_agent
    -&amp;gt; classify_candidate
    -&amp;gt; detect_lookalikes
    checkpoint after
    on_error retry(2)

  stage risk
    agent risk_agent
    -&amp;gt; assess_toxicity_risk
    -&amp;gt; compute_confidence
    checkpoint after
    on_error retry(2)

  stage safety
    agent safety_agent
    -&amp;gt; generate_safety_advisory
    -&amp;gt; persist_case
    checkpoint after
    on_error deadletter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with a deliberate failure and here's what you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python run.py --fail
What you get:
[intake]    ✓ pre: mushroom_observation
[intake]    ✓ post: normalized
[taxonomy]  ✓ pre: normalized
[taxonomy]  ✕ post: species_identified — ContractViolation
             → state preserved
             → DLQ entry written: stage=taxonomy, predicate=species_identified
             → context snapshot attached
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline stops at the exact failure point. The DLQ entry contains everything you need to understand what happened — the state before the step, the state after, which predicate failed.&lt;br&gt;
Fix the issue. Replay from taxonomy. Steps before it don't re-execute.&lt;/p&gt;
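
&lt;p&gt;Mechanically, replaying from taxonomy means loading the checkpoint written after intake and resuming from there. A sketch of those semantics (hypothetical names, not the actual runtime API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of replay-from-checkpoint semantics -- not deed-runtime's API.
import json


def replay(pipeline, from_stage: str):
    names = [s.name for s in pipeline.stages]
    idx = names.index(from_stage)
    if idx == 0:
        raise RuntimeError("first stage has no prior checkpoint; rerun from start")
    # Load the state written by 'checkpoint after' on the last completed stage.
    with open(f"checkpoints/{names[idx - 1]}.json") as f:
        state = json.load(f)
    # Re-run only the failed stage and everything downstream of it.
    for stage in pipeline.stages[idx:]:
        state = stage.run(state)
    return state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;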

&lt;p&gt;Why a DSL?&lt;br&gt;
The .dd format is intentionally readable by non-engineers — compliance reviewers, domain experts, QA. The contract file is an artifact you can show an auditor, not something buried in a decorator chain.&lt;br&gt;
There's also a practical reason: docs/MASTER_MANUAL_FOR_LLM.md in the repo is a system prompt that teaches LLMs to generate .dd files from domain descriptions. Describe your workflow in plain language, get a contract spec back. A small, constrained format is far easier for an LLM to generate reliably than arbitrary code; a sketch of that loop follows below.&lt;br&gt;
A native Python API is on the roadmap; we know the DSL is a barrier for some workflows.&lt;/p&gt;
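
&lt;p&gt;A sketch of that generation loop (assuming the OpenAI Python client; the model name and the example prompt are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: generate a .dd spec from a plain-language workflow description.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
manual = Path("docs/MASTER_MANUAL_FOR_LLM.md").read_text()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": manual},
        {"role": "user", "content": "A three-stage invoice triage workflow: "
                                    "ingest, classify by vendor, flag anomalies."},
    ],
)
Path("invoice_triage.dd").write_text(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;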

&lt;p&gt;What This Is Not&lt;br&gt;
DEED is not an observability tool. It doesn't replace LangSmith or Langfuse. Those tell you what happened — DEED enforces what's allowed to happen before it does. Different layer. You'd use both.&lt;br&gt;
DEED is not a workflow orchestrator. It doesn't replace Temporal or Prefect. You could run a DEED pipeline inside a Temporal workflow — DEED handles the contract layer, Temporal handles scheduling and retries at the workflow level.&lt;/p&gt;

&lt;p&gt;Try It&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;deed-runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero dependencies. Python 3.10+.&lt;br&gt;
Three examples in the repo: mushroom_safety (safety-critical pipeline with deliberate failure mode), sales_agent (B2B scoring with policy deny on restricted regions), orchid_rescue (reference spec only — conservation triage workflow).&lt;/p&gt;
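
&lt;p&gt;To reproduce the failure run shown above (directory layout assumed from the example names, so check the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Deadly-Reiter/deed
&lt;span class="nb"&gt;cd &lt;/span&gt;deed/examples/mushroom_safety  &lt;span class="c"&gt;# path assumed&lt;/span&gt;
python run.py --fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;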

&lt;p&gt;GitHub: github.com/Deadly-Reiter/deed&lt;br&gt;
Docs:   deed-docs.onrender.com&lt;/p&gt;

&lt;p&gt;If you're running agents in production and have a different approach to this problem, I'm genuinely curious what you're doing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
