DEV Community: Vlad

Two AI agents, one Postgres row: the bug your version check won't catch

Vlad — Mon, 20 Jul 2026 16:26:11 +0000

If you have two AI agents writing the same Postgres row, you have probably already added a conditional write:

UPDATE profiles SET value = $1, version = version + 1
WHERE id = $2 AND version = $3

When a peer moved the version first, you get rowcount = 0 and reject the write. That is real, it works, and you should keep it. It closes half the problem.
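To make the rowcount behaviour concrete, here is a minimal sketch of that conditional write. It is an illustration under stated assumptions: an in-memory SQLite table stands in for Postgres so it runs without a server, and `cas_update` is a hypothetical helper name, not part of any library.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres. The conditional
# UPDATE has the same shape; only the placeholder syntax differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id TEXT PRIMARY KEY, value TEXT, version INTEGER)")
conn.execute("INSERT INTO profiles VALUES ('user-42', 'v7-body', 7)")


def cas_update(conn, row_id, new_value, expected_version):
    """Apply the write only if the row is still at expected_version."""
    cur = conn.execute(
        "UPDATE profiles SET value = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, row_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # rowcount 0 means a peer moved the version first


b_won = cas_update(conn, "user-42", "from-B", 7)  # B commits version 8
a_won = cas_update(conn, "user-42", "from-A", 7)  # A still expects version 7
final = conn.execute(
    "SELECT value, version FROM profiles WHERE id = 'user-42'"
).fetchone()
```

The second call matches no row, so A's write is rejected and the row keeps B's value.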

This post is about the other half, which is quieter and shows up as a bug you will blame on the model.

The failure

Here is the timeline that bit me.

1. Agent A reads the row at version 7 and starts a multi-step edit. Call it two minutes of reasoning.
2. Agent B commits version 8.
3. A finally writes. The conditional write comes back rowcount = 0 and A's write is rejected.

The row is fine. That is exactly what the CAS is for.

But look at what happened after step 2. For the rest of those two minutes, A was deriving work from a value that was already dead. The plan it wrote, the summary it saved, the tool call it fired. None of that went through the row's CAS, so none of it was rejected. It all landed normally.

The row survived. The work is contaminated.

A version check protects the write. It does not protect the reader.

This is specific to agents in a way it is not for ordinary CRUD. A normal service reads and writes in the same breath, so the window is milliseconds. An agent reads, reasons for a long time, then acts. The window is the whole reasoning pass, and everything it emits in that window escapes before the CAS ever fires.
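The timeline can be replayed in a few lines. This is a toy model, not the demo from the repo: `Row`, `SideEffects`, and `cas` are hypothetical names, and the row lives in memory. It only shows that the derived work never passes through the version check.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    value: str
    version: int


@dataclass
class SideEffects:
    """Everything an agent emits that is not the row write itself."""
    log: list = field(default_factory=list)


def cas(row, new_value, expected_version):
    """Version-checked write: the only operation the CAS guards."""
    if row.version != expected_version:
        return False
    row.value, row.version = new_value, row.version + 1
    return True


row = Row(value="tier=free", version=7)
effects = SideEffects()

# Step 1: A reads at version 7 and starts reasoning.
seen_value, seen_version = row.value, row.version

# Step 2: B commits version 8.
cas(row, "tier=enterprise", expected_version=7)

# Still inside A's reasoning pass: derived work lands with no check at all.
effects.log.append(f"plan based on {seen_value}")
effects.log.append(f"summary saved for {seen_value}")

# Step 3: A's write is rejected, exactly as intended.
a_committed = cas(row, "tier=free;note=upsell", expected_version=seen_version)
```

The row ends at B's version 8, while both log entries were built from the version 7 value and stay where they landed.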

Reproduce it

The demos are offline and need no credentials, so you can watch this happen in about a minute:

git clone https://github.com/Cohexa-ai/agent-coherence
cd agent-coherence
pip install -e ".[coherent-row]"

python -m examples.coherent_row.main --baseline

The --baseline arm runs the uncoordinated agent first, so you see it act on its stale cache, and then runs the guarded version. The deny gets measured against its absence rather than just asserted at you. Exit code is 0 only if the contract held.

The fix, in code

The piece that was missing is telling A that its cached view died, before A acts. That is what the binding adds. Condensed from examples/coherent_row/main.py:

from ccs.adapters.coherent_row import CoherentRow
from ccs.adapters.substrate import CoordinatedSubstrate, SubstrateCoordinatorSession
from ccs.core.exceptions import StaleView

# root, DSN, edited, and redecide() are not shown in this condensed excerpt.
session = SubstrateCoordinatorSession(root, managed=("**",))
agent_a = CoordinatedSubstrate(CoherentRow("profiles", dsn=DSN), session)

value, token = agent_a.read("user-42")
# ... A reasons for a while. Meanwhile B commits a new version ...

try:
    agent_a.commit("user-42", expected_token=token, new_bytes=edited)
except StaleView:
    fresh, fresh_token = agent_a.reacquire("user-42")
    agent_a.commit("user-42", expected_token=fresh_token, new_bytes=redecide(fresh))

The detail I care most about is what does not happen. When StaleView is raised, the write never reaches Postgres. The binding denies it at the coordinator before the substrate is touched, and the demo asserts exactly that by checking the CAS was never called. You are not catching a rejection after the fact. You are stopping the act.
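The "deny before the substrate is touched" property can be shown with a toy model. Everything here is a hypothetical stand-in written for this illustration (`ToyCoordinator`, `CountingSubstrate`, and a local `StaleView`); it is not the library's implementation, only a sketch of the ordering of the two checks.

```python
class StaleView(Exception):
    """Toy stand-in: raised when an agent acts on an invalidated view."""


class CountingSubstrate:
    """Stand-in for the database; counts how often the CAS is attempted."""

    def __init__(self, value):
        self.value, self.version, self.cas_calls = value, 1, 0

    def cas(self, new_value, expected_version):
        self.cas_calls += 1
        if self.version != expected_version:
            return False
        self.value, self.version = new_value, self.version + 1
        return True


class ToyCoordinator:
    """Tracks, per agent, whether the view it handed out is still valid."""

    def __init__(self, substrate):
        self.substrate = substrate
        self.valid = {}  # agent name -> bool

    def read(self, agent):
        self.valid[agent] = True
        return self.substrate.value, self.substrate.version

    def commit(self, agent, new_value, expected_version):
        if not self.valid.get(agent, False):
            raise StaleView(agent)  # denied before the substrate is touched
        ok = self.substrate.cas(new_value, expected_version)
        if ok:
            for other in self.valid:  # invalidate every other agent's view
                if other != agent:
                    self.valid[other] = False
        return ok


db = CountingSubstrate("old")
coord = ToyCoordinator(db)
_, a_token = coord.read("A")
_, b_token = coord.read("B")
coord.commit("B", "from-B", b_token)

calls_before = db.cas_calls
try:
    coord.commit("A", "from-A", a_token)
    denied = False
except StaleView:
    denied = True
calls_after = db.cas_calls
```

A's commit raises while the CAS counter stays at 1, which is the same shape of assertion the post describes the demo making.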

Recovery is one verb, reacquire(), and it is the same verb whether the state is a row, an S3 object, or a file on disk. CoherentObject gives you the identical story against S3 with If-Match underneath.

Your bytes stay in your database

The thing that would make me suspicious of a library like this is whether it quietly becomes a second store. It does not, and the design makes that checkable.

The coordinator holds a monotonic version, per-agent state, and a fixed-width content hash. The binding declares SENDS_CONTENT_TO_COORDINATOR = False, the conformance kit asserts it, and composition refuses any binding that sets it True. The row body never leaves Postgres. Your database keeps its backups, permissions, and durability story, and stays the system of record. Remove the library tomorrow and your data is exactly where it always was.
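As a rough picture of what "metadata only" means, here is a sketch of a per-key record holding a version, per-agent state, and a fixed-width hash. The names (`CoordinatorRecord`, `record_commit`) and the choice of SHA-256 are assumptions for illustration; the post does not say which hash the library uses.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CoordinatorRecord:
    """What this toy coordinator keeps per key: metadata, never the body."""
    version: int = 0
    content_hash: bytes = b""
    agent_state: dict = field(default_factory=dict)


def digest(body: bytes) -> bytes:
    # Assumption: SHA-256. Its output is 32 bytes whatever the input size.
    return hashlib.sha256(body).digest()


def record_commit(rec: CoordinatorRecord, agent: str, body: bytes) -> CoordinatorRecord:
    """Update the metadata for a commit; the body itself is not retained."""
    rec.version += 1
    rec.content_hash = digest(body)
    rec.agent_state[agent] = "current"
    return rec


small = record_commit(CoordinatorRecord(), "A", b"x")
large = record_commit(CoordinatorRecord(), "A", b"x" * 1_000_000)
```

A one-byte row and a one-megabyte row leave records of the same size, which is what lets the body stay in the database.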

Nobody should migrate production state to get a correctness guarantee.

When you do not need this

Worth saying plainly, because it is a real answer for some of you.

If you wrap every read and write in a pg_advisory_lock and never cache a read between them, the bare CAS is enough and you can stop here. The binding is for the agent that reads, reasons, then acts. If your window between read and write is short and you never derive anything inside it, you do not have this problem.

Honest scope

The limits matter more than the pitch, so here they are.

Single host only. Two writers on two machines are not covered by a shipped guarantee. The cross-host transport is a demo and is not production fencing.
The bindings are cooperative. A writer that goes straight to Postgres without the binding is not caught at write time. The coordinator notices the divergence on the next mediated read from the content hash, and detection after the fact is not prevention.
The demo models the row in memory so it runs offline. The real Postgres CAS is exercised against actual Postgres in the conformance kit's real_substrate arm. In production you point the same binding at a real DSN.
No read-generation fence in this binding. The v1 writers ride admit-on-absent plus the version CAS. The binding surfaces invalidation. The fence is a documented roadmap item, not shipped behaviour here.
It does not make a weak substrate strong. Each binding declares a capability tier derived from what the substrate can actually enforce, and the conformance kit checks the declared tier against observed behaviour.
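The cooperative-bindings limit above is easy to see in miniature. In this sketch a dictionary stands in for the substrate and `mediated_read` is a hypothetical function; SHA-256 is an assumed hash. A writer that bypasses the binding is only noticed on the next mediated read, after its write has already landed.

```python
import hashlib


class Divergence(Exception):
    """Toy signal: the stored bytes no longer match the recorded hash."""


store = {"user-42": b"name=Ada"}  # stand-in for the substrate
known_hash = {"user-42": hashlib.sha256(store["user-42"]).digest()}


def mediated_read(key):
    """Read through the binding and compare against the recorded hash."""
    body = store[key]
    if hashlib.sha256(body).digest() != known_hash[key]:
        raise Divergence(key)
    return body


first = mediated_read("user-42")

store["user-42"] = b"name=Eve"  # a writer goes straight to the substrate

try:
    mediated_read("user-42")
    detected = False
except Divergence:
    detected = True
```

The divergence is detected, but the out-of-band bytes are already in the store: detection after the fact, not prevention.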

Try it

pip install "agent-coherence[coherent-row]"    # Postgres
pip install "agent-coherence[coherent-object]" # S3

Repo: https://github.com/Cohexa-ai/agent-coherence
Full write-up: https://agent-coherence.dev/blog/shipped-byo-substrate/

If you have hit the read-reason-act window in your own system, I would like to hear how it surfaced for you. It almost never gets reported as a concurrency bug.

Write returned success. The file was never there.

Vlad — Thu, 25 Jun 2026 13:47:57 +0000

Four issues filed in the past week describe the same failure: an agent writes to persistent storage, the write API returns without error, and the data is gone. No exception, no log entry, no indication that anything went wrong until something tries to read what was written.

The symptoms vary. In one case, a Write tool call reports success while a concurrent disk check from a separate process shows nothing written. In another, 28 concurrent agent workflows report started=1, result=0 in their journals with no abort marker. In a third, two processes writing to the same data directory produce 157 GB of growth and a kernel panic. The corruption accumulated silently over days before the system failed. In a fourth, a memory layer agent skips writes entirely or writes partial records. The store fills with fragments no future session can act on.

The failure is structural. A write that looks atomic to the caller is not atomic to the filesystem when multiple processes share state. The write API returns when the calling process hands off to the OS or a downstream layer, not when durability is confirmed across all concurrent writers. If two writers race on the same file, one loses. If a shared runtime dies mid-flight, in-progress writes evaporate. The caller gets no signal either way.

What makes this hard to debug is where the evidence lands. The write site looks clean. The gap shows up at the read site: a future session, a downstream consumer, or a human checking disk from outside the agent's process. By then, the causal chain is several hops from where the failure occurred.

Closing this class requires three things.

Writes to shared state need to go through a coordination layer that enforces at-most-one-writer semantics. File locks, atomic renames, or a mediating coordinator all work. The mechanism matters less than the invariant: concurrent writes to the same artifact are serialized, not raced.

That coordination layer needs to sit in the critical path of the write. If the agent can bypass it, the invariant breaks under concurrent load.

And failures need to surface at the write site, not the read site. A write that cannot be confirmed as durable should return an error to the caller. A write that silently succeeds but leaves nothing behind is a lie the next session has to investigate.
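One way to get the third property on a local filesystem is the write-to-temp, fsync, rename pattern, sketched below. `durable_write` is a hypothetical helper, not part of agent-coherence. It covers partial writes and surfaces failures at the call site; it does not by itself serialize concurrent writers (the last rename wins), and it does not fsync the containing directory.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write so readers see the old file or the new one, never a partial.

    Raises OSError at the call site if any step fails.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to stable storage
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Read back so a missing or mismatched file fails here, not in a later session.
    with open(path, "rb") as check:
        if check.read() != data:
            raise OSError(f"write to {path} could not be confirmed")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "plan.md")
durable_write(target, b"revised plan\n")
with open(target, "rb") as fh:
    written = fh.read()
leftovers = [name for name in os.listdir(workdir) if name.startswith(".tmp-")]

try:
    durable_write(os.path.join(workdir, "missing-dir", "plan.md"), b"data")
    raised = False
except OSError:
    raised = True
```

The second call targets a directory that does not exist, so the caller gets an exception instead of a success that leaves nothing behind.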

None of this is new. Distributed databases and cache coherence protocols solved this class decades ago. What's changed is that multi-agent systems are hitting it at the filesystem and plugin layer, where the coordination primitives are still thin.

We built agent-coherence to address this for the AI agent case. The coordinator enforces single-writer invariants across concurrent sessions and surfaces write failures at the call site instead of the read site.

Library at github.com/hipvlady/agent-coherence, with adapters for LangGraph, CrewAI, and Claude Code workflows.

The write your agents lost — and why nothing errored

Vlad — Wed, 10 Jun 2026 14:56:21 +0000

Three ways an agent fleet loses work

Scenario one: the parallel sessions.
Two coding agents work the same repository — one refactoring, one writing tests, both reading and updating the shared plan.md. Session B commits a revised plan. Session A, which read the plan twenty minutes ago, finishes its task and writes its version back. B's revision is gone. No exception, no conflict marker, no log line. The next agent to read the plan builds on the wrong one.

Scenario two: the orchestrator fleet.
A planner dispatches six workers; each appends its result to a shared decisions document or store key. Two workers finish in the same instant. Both writes "succeed." One of them isn't there afterward. With humans this is the oldest concurrency bug in the book; with agents it's worse, because nobody re-reads the document with suspicion — the next prompt just inherits whatever survived.

Scenario three: the overnight agent.
A long-running agent stalls mid-task while holding the write lock. Your recovery logic — correctly — reclaims the lock so the rest of the fleet isn't blocked. Hours later the stalled process wakes up and completes its write. Here's the trap: if nothing else changed the artifact in between, the version number still matches. Every version check passes. The zombie's stale commit lands on top of a state the system has long since moved past.

Why agents make this worse than microservices

Distributed systems have had these bugs for fifty years. What's new is the failure presentation. A service that reads stale state usually crashes or returns something visibly wrong. An agent that reads stale state confabulates continuity — it produces fluent, confident output built on the wrong version, and the error surfaces three steps downstream as "the model hallucinated" or "the agent forgot."

So teams debug the model. They rewrite prompts, swap providers, add retries. But the bug isn't in the model — it's in the write path. Until the state layer can refuse a stale write, every layer above it inherits silent corruption.

What kind of enforcement fixes it

agent-coherence started as a coherence protocol: MESI-style ownership and invalidation over shared artifacts, so a write from a stale view is denied fail-closed and the writer must re-read before it can land anything. That covers scenario one — the sequential stale-read-then-write.

The latest release (out now on PyPI) completes the picture with enforcement for the two cases ownership alone can't catch:

Concurrent writers — optimistic commit-CAS.
write_cas commits only if the artifact version still equals the version the writer read. Two agents racing the same key resolve to exactly one winner; the loser receives a typed conflict and a bounded retry path — read fresh, recompute, commit again. Scenario two stops being a coin flip and becomes a protocol. The invariant has a name: NoLostUpdate.
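The read-fresh, recompute, commit-again loop can be sketched with a toy in-memory store. `VersionedStore`, `Conflict`, and `append_result` are hypothetical names for this illustration; the `write_cas` method here only borrows the name and is not the library's implementation.

```python
import threading


class Conflict(Exception):
    """Toy typed signal: the version moved since the read."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._value, self._version = [], 0

    def read(self):
        with self._lock:
            return list(self._value), self._version

    def write_cas(self, new_value, expected_version):
        with self._lock:  # compare and swap as a single step
            if self._version != expected_version:
                raise Conflict(expected_version)
            self._value, self._version = new_value, self._version + 1


def append_result(store, item, max_attempts=50):
    """Bounded retry: read fresh, recompute, commit again."""
    for _ in range(max_attempts):
        value, version = store.read()
        try:
            store.write_cas(value + [item], version)
            return True
        except Conflict:
            continue
    return False


store = VersionedStore()
workers = [
    threading.Thread(target=append_result, args=(store, f"worker-{i}"))
    for i in range(6)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
final_value, final_version = store.read()
```

Every conflict means some other worker committed, so each of the six workers retries at most five times and all six results survive.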

Crash-reclaimed writers — the read-generation fence.
Every reclamation bumps the artifact's ownership epoch; every claim captures the epoch it was made under; commit checks them atomically with the version persist. The overnight zombie from scenario three is rejected even though the version number never moved — with a typed, retryable reason, not a silent overwrite. The invariant: NoStaleApply.

And the piece that makes reclamation safe to run at all: a crashed agent holding EXCLUSIVE forever would block the fleet, so a heartbeat/TTL sweep reclaims stale grants automatically — on by default since the previous version.
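Here is a single-threaded toy of the epoch fence and the TTL sweep working together. All names (`FencedArtifact`, `Claim`, `StaleClaim`) are hypothetical. One deliberate simplification: in this toy the claim object carries the epoch into `commit`, whereas the real fence is described as server-side with no fence argument on the public write API.

```python
from dataclasses import dataclass


class StaleClaim(Exception):
    """Toy typed rejection: the claim predates a reclamation or a commit."""


@dataclass
class Claim:
    agent: str
    version: int
    epoch: int
    last_heartbeat: float


class FencedArtifact:
    def __init__(self, value, ttl_seconds):
        self.value, self.version, self.epoch = value, 1, 0
        self.ttl = ttl_seconds
        self.holder = None

    def claim(self, agent, now):
        if self.holder is not None:
            raise RuntimeError("already held")
        self.holder = Claim(agent, self.version, self.epoch, now)
        return self.holder

    def sweep(self, now):
        """Reclaim a grant whose heartbeat is older than the TTL; bump the epoch."""
        if self.holder and now - self.holder.last_heartbeat > self.ttl:
            self.holder = None
            self.epoch += 1
            return True
        return False

    def commit(self, claim, new_value):
        # Epoch and version are checked together with the update.
        if claim.epoch != self.epoch or claim.version != self.version:
            raise StaleClaim(claim.agent)
        self.value, self.version = new_value, self.version + 1
        self.holder = None


art = FencedArtifact("plan v1", ttl_seconds=30)
zombie = art.claim("overnight-agent", now=0.0)
reclaimed = art.sweep(now=3600.0)  # the agent stalled; the grant is reclaimed
version_unchanged = zombie.version == art.version
try:
    art.commit(zombie, "zombie write")
    rejected = False
except StaleClaim:
    rejected = True
```

The version never moved, so a version check alone would pass; the bumped epoch is what rejects the late write.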

The rigor part

Every guarantee above is a safety invariant model-checked with TLA+/TLC. Four specs (the MESI protocol, crash recovery, optimistic concurrency, and the fence) run in CI on every push. Each spec carries a documented mutant (remove the guard, weaken the check) that must turn the model checker red; if the mutant passes, the invariant isn't load-bearing and the build fails the review. The fence itself is server-side by design: no public write API accepts a generation or fence argument, and a CI signature guard enforces that boundary.

This is the difference between "we added locking" and "here is the state machine, here is the invariant, here is the checker run that explores every interleaving up to the model bounds."

Scope, honestly

The guarantees hold for writers that go through the coordinator, under a single coordinator, meaning one host. Concurrent same-key writers on one host are covered. Cross-host fencing is on the roadmap and demand-gated: if your fleet spans machines and you need it, open an issue; that's the signal that pulls it forward.

The economics come along for free

Correctness is the wedge, but the same protocol is why the token bill drops: writes publish ~12-token invalidation signals instead of rebroadcasting full artifacts, so read-heavy fleets stop re-paying for state they already hold. Measured on real LangGraph graphs: 69% savings on a read-heavy planning workload, 47% on moderate code review, 29% on high-churn writes.

Try it in five minutes

LangGraph — one import change, no node code changes:

from ccs.adapters import CCSStore
store = CCSStore(strategy="lazy")

Plain files shared across processes — no framework required:

from ccs.adapters.coherent_volume import CoherentVolume
vol = CoherentVolume(workspace_root, managed=("plans/**",))
plan = vol.read("plans/plan.md")
vol.write("plans/plan.md", revised_plan)   # stale view? denied, fail-closed

CrewAI, AutoGen, and the OpenAI Agents SDK ship as adapters on the same protocol; the runnable lost-update demo is in the repo (python -m examples.coherent_volume.main), and the formal protocol + verification story is on arXiv (2603.15183).

pip install agent-coherence

The ask

If you're running a fleet that shares state — parallel coding sessions, an orchestrator with workers, agents with shared memory — I'm looking for early design partners, and the first conversation is me listening to how your system fails. Repo: https://github.com/hipvlady/agent-coherence — or message me here.