<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vlad</title>
    <description>The latest articles on DEV Community by Vlad (@mrvlad).</description>
    <link>https://dev.to/mrvlad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3977904%2F70e34f00-e03a-483d-9785-166f364b76f7.png</url>
      <title>DEV Community: Vlad</title>
      <link>https://dev.to/mrvlad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrvlad"/>
    <language>en</language>
    <item>
      <title>The write your agents lost — and why nothing errored</title>
      <dc:creator>Vlad</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:56:21 +0000</pubDate>
      <link>https://dev.to/mrvlad/the-write-your-agents-lost-and-why-nothing-errored-k1n</link>
      <guid>https://dev.to/mrvlad/the-write-your-agents-lost-and-why-nothing-errored-k1n</guid>
      <description>&lt;h3&gt;
  
  
  Three ways an agent fleet loses work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario one: the parallel sessions.&lt;/strong&gt; &lt;br&gt;
Two coding agents work the same repository — one refactoring, one writing tests, both reading and updating the shared &lt;code&gt;plan.md&lt;/code&gt;. Session B commits a revised plan. Session A, which read the plan twenty minutes ago, finishes its task and writes its version back. B's revision is gone. No exception, no conflict marker, no log line. The next agent to read the plan builds on the wrong one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario two: the orchestrator fleet.&lt;/strong&gt; &lt;br&gt;
A planner dispatches six workers; each appends its result to a shared decisions document or store key. Two workers finish in the same instant. Both writes "succeed." One of them isn't there afterward. With humans this is the oldest concurrency bug in the book; with agents it's worse, because nobody re-reads the document with suspicion — the next prompt just inherits whatever survived.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario three: the overnight agent.&lt;/strong&gt; &lt;br&gt;
A long-running agent stalls mid-task while holding the write lock. Your recovery logic — correctly — reclaims the lock so the rest of the fleet isn't blocked. Hours later the stalled process wakes up and completes its write. Here's the trap: if nothing else changed the artifact in between, the version number still matches. Every version check passes. The zombie's stale commit lands on top of a state the system has long since moved past.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why agents make this worse than microservices
&lt;/h3&gt;

&lt;p&gt;Distributed systems have had these bugs for fifty years. What's new is the failure &lt;em&gt;presentation&lt;/em&gt;. A service that reads stale state usually crashes or returns something visibly wrong. An agent that reads stale state confabulates continuity — it produces fluent, confident output built on the wrong version, and the error surfaces three steps downstream as "the model hallucinated" or "the agent forgot."&lt;/p&gt;

&lt;p&gt;So teams debug the model. They rewrite prompts, swap providers, add retries. But the bug isn't in the model — it's in the write path. Until the state layer can refuse a stale write, every layer above it inherits silent corruption.&lt;/p&gt;
&lt;h3&gt;
  
  
  What kind of enforcement fixes it?
&lt;/h3&gt;

&lt;p&gt;agent-coherence started as a coherence protocol: MESI-style ownership and invalidation over shared artifacts, so a write from a stale view is denied fail-closed and the writer must re-read before it can land anything. That covers scenario one — the sequential stale-read-then-write.&lt;/p&gt;

&lt;p&gt;The most recent version (out now on PyPI) completes the picture with enforcement for the two cases that ownership alone can't catch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent writers — optimistic commit-CAS.&lt;/strong&gt; &lt;br&gt;
&lt;code&gt;write_cas&lt;/code&gt; commits only if the artifact version still equals the version the writer read. Two agents racing the same key resolve to exactly one winner; the loser receives a typed conflict and a bounded retry path — read fresh, recompute, commit again. Scenario two stops being a coin flip and becomes a protocol. The invariant has a name: &lt;code&gt;NoLostUpdate&lt;/code&gt;.&lt;/p&gt;
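
&lt;p&gt;To make the commit-CAS loop concrete, here is a minimal sketch on a toy in-memory store. It is not the agent-coherence implementation: &lt;code&gt;ToyStore&lt;/code&gt;, &lt;code&gt;VersionConflict&lt;/code&gt;, and &lt;code&gt;append_with_retry&lt;/code&gt; are hypothetical names, and only the name &lt;code&gt;write_cas&lt;/code&gt; is borrowed from the text, with an assumed signature. Six workers append to one key and each retries on conflict.&lt;/p&gt;

```python
import threading


class VersionConflict(Exception):
    """Typed conflict: the artifact moved since the writer read it."""


class ToyStore:
    """In-memory stand-in for a versioned artifact store (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key: (version, value)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, ""))

    def write_cas(self, key, expected_version, value):
        # Compare and persist under one lock so check and write are atomic.
        with self._lock:
            current_version, _ = self._data.get(key, (0, ""))
            if current_version != expected_version:
                raise VersionConflict(key)
            self._data[key] = (current_version + 1, value)
            return current_version + 1


def append_with_retry(store, key, line, max_attempts=10):
    """Read fresh, recompute, commit again; give up after max_attempts."""
    for _ in range(max_attempts):
        version, text = store.read(key)
        try:
            return store.write_cas(key, version, text + line + "\n")
        except VersionConflict:
            continue  # someone else won this round; re-read and retry
    raise VersionConflict(key)


store = ToyStore()
workers = [
    threading.Thread(target=append_with_retry, args=(store, "decisions", f"worker-{i}"))
    for i in range(6)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

final_version, final_text = store.read("decisions")
```

&lt;p&gt;In this sketch a writer can lose at most five rounds, one per competing commit, so the retry bound is never reached and all six lines end up in the document.&lt;/p&gt;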

&lt;p&gt;&lt;strong&gt;Crash-reclaimed writers — the read-generation fence.&lt;/strong&gt; &lt;br&gt;
Every reclamation bumps the artifact's ownership epoch; every claim captures the epoch it was made under; and commit compares the two in the same atomic step that persists the new version. The overnight zombie from scenario three is rejected even though the version number never moved, with a typed, retryable reason rather than a silent overwrite. The invariant: &lt;code&gt;NoStaleApply&lt;/code&gt;.&lt;/p&gt;
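
&lt;p&gt;The epoch check can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: &lt;code&gt;ToyCoordinator&lt;/code&gt;, &lt;code&gt;StaleWrite&lt;/code&gt;, and every method name are hypothetical. The claimed epoch is kept on the coordinator side, so the caller never passes a fence value.&lt;/p&gt;

```python
class StaleWrite(Exception):
    """Typed, retryable rejection of a commit made from an outdated claim."""


class ToyCoordinator:
    """Single-process stand-in for a coordinator (hypothetical names)."""

    def __init__(self):
        self.version = 0
        self.epoch = 0
        self.value = ""
        self._claims = {}  # agent_id: (version, epoch) captured at claim time

    def claim(self, agent_id):
        # The coordinator, not the caller, records the epoch of the claim.
        self._claims[agent_id] = (self.version, self.epoch)

    def reclaim(self):
        # Recovery takes the grant back from a stalled holder.
        # Only the epoch moves; the version stays where it was.
        self.epoch += 1

    def commit(self, agent_id, value):
        # A real coordinator would run both checks and the persist atomically.
        # In this toy, committing without a prior claim raises KeyError.
        claimed_version, claimed_epoch = self._claims.pop(agent_id)
        if claimed_epoch != self.epoch:
            raise StaleWrite("claim predates a reclamation")
        if claimed_version != self.version:
            raise StaleWrite("version moved since the claim")
        self.version += 1
        self.value = value
        return self.version


coord = ToyCoordinator()
coord.claim("overnight-agent")   # agent takes the grant, then stalls
coord.reclaim()                  # recovery reclaims; version is still 0

try:
    coord.commit("overnight-agent", "stale plan")
    zombie_result = "committed"
except StaleWrite:
    zombie_result = "rejected"

coord.claim("fresh-agent")
fresh_version = coord.commit("fresh-agent", "current plan")
```

&lt;p&gt;A version-only check would have accepted the first commit, because the version was still 0 when the stalled agent woke up. The epoch comparison is what rejects it.&lt;/p&gt;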

&lt;p&gt;The fence is also what makes automatic reclamation safe to run at all. A crashed agent holding EXCLUSIVE forever would block the fleet, so a heartbeat/TTL sweep reclaims stale grants automatically (on by default since the previous version), and the fence ensures a reclaimed holder cannot commit afterward.&lt;/p&gt;
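
&lt;p&gt;As a rough sketch of a heartbeat/TTL sweep, assuming a single grant and caller-supplied timestamps: &lt;code&gt;ToyGrantTable&lt;/code&gt; and its methods are hypothetical and do not reflect the library's API or its default TTL.&lt;/p&gt;

```python
class ToyGrantTable:
    """Tracks one EXCLUSIVE grant and its last heartbeat (hypothetical names)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.holder = None
        self.last_heartbeat = 0.0
        self.epoch = 0

    def acquire(self, agent_id, now):
        if self.holder is not None:
            return False  # someone else holds the grant
        self.holder = agent_id
        self.last_heartbeat = now
        return True

    def heartbeat(self, agent_id, now):
        # Only the current holder can refresh the grant.
        if self.holder == agent_id:
            self.last_heartbeat = now

    def sweep(self, now):
        """Reclaim the grant if its holder has been silent for a full TTL."""
        if self.holder is None:
            return False
        if now - self.last_heartbeat >= self.ttl_seconds:
            self.holder = None
            self.epoch += 1  # every reclamation bumps the ownership epoch
            return True
        return False


table = ToyGrantTable(ttl_seconds=30.0)
table.acquire("overnight-agent", now=0.0)
table.heartbeat("overnight-agent", now=10.0)

swept_early = table.sweep(now=25.0)             # 15s of silence: grant stays
blocked = table.acquire("worker-2", now=25.0)   # still held, so refused
swept_late = table.sweep(now=45.0)              # 35s of silence: reclaimed
unblocked = table.acquire("worker-2", now=45.0)
```

&lt;p&gt;Timestamps are passed in explicitly to keep the example deterministic; a real sweep would read a monotonic clock on a timer.&lt;/p&gt;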
&lt;h3&gt;
  
  
  The rigor part
&lt;/h3&gt;

&lt;p&gt;Every guarantee above is a safety invariant model-checked with TLA+/TLC. Four specs, covering the MESI protocol, crash recovery, optimistic concurrency, and the fence, run in CI on every push. Each spec carries a documented mutant (remove the guard, weaken the check) that must turn the model checker red; if the checker still passes with the mutant applied, the invariant isn't load-bearing and the build fails. The fence itself is server-side by design: no public write API accepts a generation or fence argument, and a CI signature guard enforces that boundary.&lt;/p&gt;

&lt;p&gt;This is the difference between "we added locking" and "here is the state machine, here is the invariant, here is the checker run that explores every interleaving up to the model bounds."&lt;/p&gt;
&lt;h3&gt;
  
  
  Scope, honestly
&lt;/h3&gt;

&lt;p&gt;The guarantees hold for writers that go through the coordinator, under a single coordinator on one host. Concurrent same-key writers on one host are covered. Cross-host fencing is on the roadmap and demand-gated: if your fleet spans machines and you need it, open an issue. That's the signal that pulls it forward.&lt;/p&gt;
&lt;h3&gt;
  
  
  The economics come along for free
&lt;/h3&gt;

&lt;p&gt;Correctness is the wedge, but the same protocol is also why the token bill drops: writes publish ~12-token invalidation signals instead of rebroadcasting full artifacts, so read-heavy fleets stop re-paying for state they already hold. Measured on real LangGraph graphs: 69% savings on a read-heavy planning workload, 47% on moderate code review, 29% on high-churn writes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Try it in five minutes
&lt;/h3&gt;

&lt;p&gt;LangGraph — one import change, no node code changes:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from ccs.adapters import CCSStore
store = CCSStore(strategy="lazy")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Plain files shared across processes, no framework required:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ccs.adapters.coherent_volume&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CoherentVolume&lt;/span&gt;
&lt;span class="n"&gt;vol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CoherentVolume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;managed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plans/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plans/plan.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plans/plan.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;revised_plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# stale view? denied, fail-closed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CrewAI, AutoGen, and the OpenAI Agents SDK ship as adapters on the same protocol; the runnable lost-update demo is in the repo (&lt;code&gt;python -m examples.coherent_volume.main&lt;/code&gt;), and the formal protocol and verification story is on arXiv (2603.15183).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-coherence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The ask
&lt;/h3&gt;

&lt;p&gt;If you're running a fleet that shares state — parallel coding sessions, an orchestrator with workers, agents with shared memory — I'm looking for early design partners, and the first conversation is me listening to how your system fails. Repo: &lt;a href="https://github.com/hipvlady/agent-coherence" rel="noopener noreferrer"&gt;https://github.com/hipvlady/agent-coherence&lt;/a&gt; — or message me here.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>langchain</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
