Vlad

Posted on Jun 10

The write your agents lost — and why nothing errored

#ai #agents #langchain #claude

Three ways an agent fleet loses work

Scenario one: the parallel sessions.
Two coding agents work the same repository — one refactoring, one writing tests, both reading and updating the shared plan.md. Session B commits a revised plan. Session A, which read the plan twenty minutes ago, finishes its task and writes its version back. B's revision is gone. No exception, no conflict marker, no log line. The next agent to read the plan builds on the wrong one.

Scenario two: the orchestrator fleet.
A planner dispatches six workers; each appends its result to a shared decisions document or store key. Two workers finish in the same instant. Both writes "succeed." One of them isn't there afterward. With humans this is the oldest concurrency bug in the book; with agents it's worse, because nobody re-reads the document with suspicion — the next prompt just inherits whatever survived.

Scenario three: the overnight agent.
A long-running agent stalls mid-task while holding the write lock. Your recovery logic — correctly — reclaims the lock so the rest of the fleet isn't blocked. Hours later the stalled process wakes up and completes its write. Here's the trap: if nothing else changed the artifact in between, the version number still matches. Every version check passes. The zombie's stale commit lands on top of a state the system has long since moved past.

Why agents make this worse than microservices

Distributed systems have had these bugs for fifty years. What's new is the failure presentation. A service that reads stale state usually crashes or returns something visibly wrong. An agent that reads stale state confabulates continuity — it produces fluent, confident output built on the wrong version, and the error surfaces three steps downstream as "the model hallucinated" or "the agent forgot."

So teams debug the model. They rewrite prompts, swap providers, add retries. But the bug isn't in the model — it's in the write path. Until the state layer can refuse a stale write, every layer above it inherits silent corruption.

What "enforcement" can fix it?

agent-coherence started as a coherence protocol: MESI-style ownership and invalidation over shared artifacts, so a write from a stale view is denied fail-closed and the writer must re-read before it can land anything. That covers scenario one — the sequential stale-read-then-write.

In the recent version (out now on PyPI), it completes the picture with enforcement for the two cases ownership alone can't catch:

Concurrent writers — optimistic commit-CAS.
write_cas commits only if the artifact version still equals the version the writer read. Two agents racing the same key resolve to exactly one winner; the loser receives a typed conflict and a bounded retry path — read fresh, recompute, commit again. Scenario two stops being a coin flip and becomes a protocol. The invariant has a name: NoLostUpdate.

Crash-reclaimed writers — the read-generation fence.
Every reclamation bumps the artifact's ownership epoch; every claim captures the epoch it was made under; commit checks them atomically with the version persist. The overnight zombie from scenario three is rejected even though the version number never moved — with a typed, retryable reason, not a silent overwrite. The invariant: NoStaleApply.

And the piece that makes reclamation safe to run at all: a crashed agent holding EXCLUSIVE forever would block the fleet, so a heartbeat/TTL sweep reclaims stale grants automatically — on by default since the previous version.

The rigor part

Every guarantee above is a safety invariant model-checked with TLA+/TLC. Four specs — the MESI protocol, crash recovery, optimistic concurrency, and the fence — run in CI on every push. Each spec carries a documented mutant (remove the guard, weaken the check) that must turn the model checker red; if the mutant passes, the invariant isn't load-bearing and the build fails the
review. The fence itself is server-side by design: no public write API accepts
a generation or fence argument, and a CI signature guard enforces that
boundary.

This is the difference between "we added locking" and "here is the state
machine, here is the invariant, here is the checker run that explores every
interleaving up to the model bounds."

Scope, honestly

The guarantees hold for writers that go through the coordinator, under a
single coordinator — one host. Concurrent same-key writers on one host are
covered. Cross-host fencing is on the roadmap and demand-gated: if your fleet
spans machines and you need it, open an issue — that's the signal that pulls
it forward.

The economics come along for free

Correctness is the wedge, but the same protocol is why the token bill drops:
writes publish ~12-token invalidation signals instead of rebroadcasting full
artifacts, so read-heavy fleets stop re-paying for state they already hold.
Measured on real LangGraph graphs: 69% savings on a read-heavy planning
workload, 47% on moderate code review, 29% on high-churn writes.

Try it in five minutes

LangGraph — one import change, no node code changes:

from ccs.adapters import CCSStore
store = CCSStore(strategy="lazy")

Plain files shared across processes — no framework required:

from ccs.adapters.coherent_volume import CoherentVolume
vol = CoherentVolume(workspace_root, managed=("plans/**",))
plan = vol.read("plans/plan.md")
vol.write("plans/plan.md", revised_plan)   # stale view? denied, fail-closed

CrewAI, AutoGen, and the OpenAI Agents SDK ship as adapters on the same protocol; the runnable lost-update demo is in the repo
(python -m examples.coherent_volume.main), and the formal protocol + verification story is on arXiv (2603.15183).

pip install agent-coherence

The ask

If you're running a fleet that shares state — parallel coding sessions, an orchestrator with workers, agents with shared memory — I'm looking for early design partners, and the first conversation is me listening to how your system fails. Repo: https://github.com/hipvlady/agent-coherence — or message me here.

Top comments (4)

Aliaksei Zelianouski • Jun 10

A service reads stale state and crashes; an agent reads it and just keeps going - fluent, confident output on the wrong version, surfacing three steps later as "the model hallucinated" so everyone debugs the prompt instead of the write path. That part's right and badly under-said. But MESI plus TLA+ over shared artifacts is a lot of machine for what's usually an architecture smell - two agents writing the same file. The CAS I buy. The coherence protocol feels like solving concurrency you could have just not had.

Vlad • Jun 10

The part you're agreeing with is the part CAS doesn't fix — that's the asymmetry I'd push back on.

The misattribution failure is a read-side problem: the agent is three minutes into reasoning on a view that died at someone else's commit. CAS is a write-side check — it tells a writer it lost, at commit, on the key it's writing. It says nothing to a reader. And the worst case is common: an agent reads the plan, then writes other artifacts informed by it. It never writes the plan itself, so no CAS ever fires anywhere, and the stale read quietly contaminates everything downstream. Invalidation is the read-side mechanism: a commit pushes "your view is dead" to every cached holder, so the stale agent fails fast at its next touch — or aborts mid-compute — instead of fluently finishing on the wrong version. With agent writers there's an economic edge too: every CAS retry is a re-generation, minutes and dollars, so learning you lost at T+10s beats discovering it after the full pass.

On the architecture smell: agreed more than you'd guess. If two agents share a file by accident, partition it — that's the first thing I suggest, and the protocol costs nothing on partitioned keys (no contention, reads serve from local cache). The residue is the shared plan, the decisions log, the memory — where sharing is the collaboration. And single-writer-by-design is itself a coherence protocol: the degenerate one where ownership never moves. The first handoff — a reviewer takes the plan, agent B resumes A's work — and you're hand-rolling invalidate-on-transfer.

On "a lot of machine": the protocol is four stable states and one local coordinator process; the TLA+ runs in CI, not in the request path. The realistic alternative isn't no protocol — it's an unnamed one smeared across retry loops and prompt instructions. Naming it is what made it small enough to check.

Aliaksei Zelianouski • Jun 11

Not trying to talk you out of it - you built the thing, it solves a real shape, cool. I'd just have come at it from the other side.

My instinct is to avoid the conflict instead of coordinating it. In your examples the agents are updating a shared plan, memory, a store key mid-process. Do they really need to do it concurrently? Why not let them finish their current steps and sync up?

The MESI ownership/invalidation part reminds me of low-level concurrency control in Java over shared resources. In practice people tend to avoid it where they can: lock-free structures where possible, higher-level blocking sync everywhere else. Slower sometimes, but a lot simpler.

Anyway, if it's working for you, great.

Vlad • Jun 11

No argument with the instinct — avoid beats coordinate wherever the workflow allows it, and "finish the step, sync at a join" is the right shape when steps reliably finish. That's the assumption that leaked for me: a barrier needs a timeout, or one stalled worker wedges the whole run. And the moment it has a timeout, you've manufactured the exact case from the post — the step that "finishes" after the system already moved on. A barrier with a timeout has the same hole as a lock with a timeout. The straggler is the one participant you can't design away; the fence is just what makes its late return harmless instead of silently destructive.

On the Java analogy — I'd actually read it as the argument for this. In practice people don't avoid coordination, they avoid hand-rolling it: they reach for java.util.concurrent so the low-level part is someone else's verified code. That's the intent here — j.u.c for shared agent state. Nobody should write MESI by hand, me included. I just wrote it once, with a model checker watching.

Appreciated the pushback throughout — it sharpened the framing more than the agreement did.