lweiss01

Posted on May 27 • Edited on May 31

The Agent-to-Agent Continuity Gap Nobody Is Talking About

#webdev #ai #programming #productivity

TL;DR: Most AI agent memory discussions still assume one agent talking to itself across sessions. But real coding workflows already involve Claude, Codex, Cursor, and Gemini touching the same repo in the same week. The hard problem is not "how does an agent remember." It is "how do multiple agents stay coordinated on the same project without stepping on each other." That problem does not live inside any one agent. It lives in the repo.

I wrote a post last week arguing that AI coding agent memory belongs in the repository, not the chat window. Checkpoints, not transcripts.

After sitting with that argument for a few days, I realized it is actually downstream of a bigger one I had not yet made explicitly. The checkpoint primitive only matters because of a problem the current agent stack does not have a name for.

So here it is.

The Industry Map Has A Blind Spot

There is a really good 2026 agent stack map going around right now from Paolo Perrone. Six layers. Models, protocols, memory, frameworks, eval, guardrails. It is a useful map.

But read the memory layer carefully and you notice something.

Every memory tier on that map assumes one agent.

In-context state lives inside one agent's context window
Vector retrieval lives inside one agent's RAG pipeline
Persistent memory services like Letta, Zep, and Mem0 are designed for one agent learning across sessions

That is a real problem and worth solving. But it is not the problem most coding workflows actually have.

Most Coding Workflows Already Use Multiple Agents

Look at how anyone serious is shipping code right now.

Claude for architecture and review.
Codex for implementation.
Cursor for inline iteration.
Gemini for exploration.
A human approving and editing all of it.

That is not a future scenario. That is a Tuesday.

And every single one of those agents has its own context window, its own session, its own memory, its own opinions about the codebase. None of them know what the other ones did an hour ago.

The continuity problem is not "Claude forgot what we discussed yesterday."

The continuity problem is "Claude does not know that Codex implemented something this morning, that Cursor reverted half of it at lunch, and that the human merged something different from another branch."

That is a coordination problem dressed up as a memory problem.

Agent-to-Agent Memory Does Not Exist Yet

Perrone's map notes this honestly. MCP standardized how agents call tools. It says nothing about how agents talk to each other. IBM has ACP. Google has A2A. Neither is a standard. Neither is widely adopted. Neither solves the coding workflow case.

So in practice, every team running a multi-agent coding workflow is solving this themselves. Usually badly. Usually by re-explaining context to every new session by hand.

The dedicated memory vendors do not solve this either, because they are designed to give one agent a longer memory. Plugging Cursor and Claude Code into the same Mem0 instance and hoping they coordinate is not a thing that works today.

Memory infrastructure is single-agent infrastructure. The multi-agent coordination layer is missing.

The Repo Is The Only Shared Surface

Here is the thing that kept hitting me.

When Claude, Codex, Cursor, and Gemini are all working on the same project, there is exactly one piece of infrastructure all of them already see.

The repository.

They all read it. They all write to it. They all already have file system access through MCP or equivalent. Git already tracks who changed what and when.

The repo is the shared substrate. It is the only shared substrate. Everything else is per-agent.

So if you want continuity across agents, the continuity artifacts have to live in the repo. Not in a vector database that one agent is plugged into. Not in a hosted memory service that another agent does not know about. In the repo. In files. Versioned. Auditable. Diffable. Visible to every agent that can read the file system.

That is what makes checkpoints the right primitive. Not because vector search is bad. Vector search is great for what it does. But the next agent cannot retrieve from a vector store it has never heard of.

The Reframe

When you stop framing the problem as "agent memory" and start framing it as "multi-agent coordination on a shared artifact," a lot of the tooling debates collapse.

Bigger context windows do not help. The next agent has a different context window.
Better RAG does not help. The next agent has a different RAG pipeline, or no RAG at all.
Hosted memory services do not help unless every agent in your workflow is plugged into the same one, which they are not.
Transcripts do not help, because they are noise and the next agent does not have your transcript.

What does help is a small, structured, versioned record of what was decided, what is in progress, what is at risk, and what the next agent should pick up. Sitting in the repo. Where everyone can see it.
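To make "small, structured, versioned record" concrete, here is a minimal Python sketch of one possible shape. The field names, the `.checkpoints/` directory, and the JSON layout are assumptions made up for this example; they are not Holistic's actual format.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass
class Checkpoint:
    """Hypothetical checkpoint shape; every field name is illustrative."""
    agent: str               # which agent wrote this checkpoint
    decided: list[str]       # what was decided
    in_progress: list[str]   # what is started but not finished
    at_risk: list[str]       # what is likely to break or be reverted
    next_steps: list[str]    # what the next agent should pick up
    written_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def write_checkpoint(repo_root: Path, cp: Checkpoint) -> Path:
    """Write the checkpoint as a plain JSON file inside the repo."""
    directory = repo_root / ".checkpoints"  # assumed location
    directory.mkdir(parents=True, exist_ok=True)
    # A timestamped name keeps files ordered and avoids overwriting older ones.
    name = cp.written_at.replace(":", "-") + f"-{cp.agent}.json"
    path = directory / name
    path.write_text(json.dumps(asdict(cp), indent=2), encoding="utf-8")
    return path


def read_latest(repo_root: Path) -> Optional[Checkpoint]:
    """Return the newest checkpoint by file name, or None if there is none."""
    files = sorted((repo_root / ".checkpoints").glob("*.json"))
    if not files:
        return None
    return Checkpoint(**json.loads(files[-1].read_text(encoding="utf-8")))
```

Because the output is an ordinary file under the repo root, git handles the versioning and diffing, and any agent with file system access can call something like `read_latest` without knowing which agent wrote the record.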

That is the continuity primitive the current stack map does not have a slot for.

What I Am Building

Holistic is an open source CLI exploring this idea. Repo-native checkpoints. Agent-agnostic. No vendor account, no hosted service, no per-agent SDK. Just files in your repo that any agent can read and any agent can update.

It is still early. The thesis is what I am most interested in pressure testing right now.

If you are running a multi-agent coding workflow and you have your own answer to the coordination problem, I want to hear it. If you think I am wrong about the repo being the right substrate, I really want to hear that.

Repo: https://github.com/lweiss01/holistic

The repo remembers, not the window. And no single agent remembers for the others.

Top comments (15)

NOVAInetwork • May 28

This reframe is right, and the "substrate" framing is the load-bearing part. The repo works as the coordination layer because of one assumption you're making implicitly: every agent touching it is inside the same trust boundary. Your Claude, your Codex, your Cursor, your repo. They can all write to the shared surface because you trust all of them by construction.
That assumption is what makes the repo sufficient. It's also exactly where the problem changes shape when you remove it.

The moment the agents belong to different operators (my agent hires your agent to do a task, neither of us controls the other's runtime) the repo stops being a shared substrate. There's no shared filesystem. Git history is per-org. "Versioned, auditable, diffable, visible to every agent" only holds inside one boundary. Across boundaries you need a substrate neither party controls, where the record of what was decided/delivered/accepted can't be unilaterally rewritten by either side.

Same insight as yours, one level out: continuity lives in the shared substrate, not in any single agent. Inside one org, that substrate is the repo. Across orgs, it has to be something with settlement guarantees and no privileged writer. Different substrate, identical logic.
So I'd push your thesis further rather than disagree with it: the repo is the correct answer for the in-boundary case, and naming that boundary explicitly is what makes it obviously correct. The interesting follow-up question is what the artifact looks like when the next agent isn't one you trust.

lweiss01 • May 28

@0xdevc This is the sharpest extension of the idea anyone has offered, thank you. You are right that I was smuggling in a trust assumption without naming it. Inside one operator, every agent is trusted by construction, so the repo works precisely because nobody needs to defend against a malicious writer. I should have said that out loud.

The cross-operator case genuinely changes the shape of the problem. Once neither party controls the other's runtime, "versioned and auditable" is not enough, because auditability assumes nobody can rewrite history unilaterally, and that is exactly what breaks when the substrate lives in one party's git. You need a record with settlement guarantees and no privileged writer. That is a different primitive than a checkpoint in a repo.

Where I land for now: the in-boundary case is the one almost everyone actually has today, so solving it well is not a small thing. But you are right that naming the boundary is what makes the repo answer obviously correct rather than accidentally correct. The cross-org artifact is the harder and more interesting problem, and I don't think I have a real answer to it yet. My instinct is it looks less like a file and more like an append-only log both sides can verify but neither can edit. Curious whether you have seen anyone build that for agents specifically, or whether it is still a whiteboard problem.
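One way to picture the "both sides can verify" half of that instinct is a hash chain, sketched below in Python. The function names and entry layout are hypothetical. Note the limit: this gives tamper evidence only. A party holding the only copy could still rebuild the whole chain, so the "neither can edit" half depends on where the log lives, not on this code.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> list[dict]:
    """Return a new log with one entry added; existing entries are untouched."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    return log + [entry]


def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Each side can run `verify` on its own copy. If both sides also exchange the latest hash after every append, a later rewrite by either party shows up as a mismatch.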

NOVAInetwork • May 29

Your instinct lands exactly where the answer is. Append-only log, both sides verify, neither can rewrite unilaterally - that's a blockchain by another name. The "neither can edit" property is what cryptographic ordering plus open validation buys you, which is also why this can't really be solved at the application layer.
Specifically for agents, the missing piece beyond just an ordered log is typed primitives - the log needs to know what "agent A delivered X to agent B" means, what an SLA violation looks like, what reputation means, otherwise it's just bytes in order and you've pushed the schema problem up a layer.

Not a whiteboard problem anymore - I've been building it for ~6 months. AI-native L1, agent identity and capabilities as first-class on-chain types, native primitives for cross-agent SLAs, payments, oracle anchors, conditional execution. Settlement-grade record of what was promised, delivered, accepted. NOVAI repo at github.com/0x-devc/NOVAI-node if you want to poke at the architecture.

The reason I framed it as "different substrate, identical logic" earlier is exactly this: inside one operator, your repo-as-substrate is right. Across operators, the substrate has to enforce ordering and validity itself rather than relying on social trust in git history. Cosmos and Polkadot solve the multi-chain case. NOVAI's bet is that the agent-specific version of this problem deserves its own primitives at L1, not generic smart contracts.

lweiss01 • May 30

The typed-primitives thing is the right call and I'll admit I punted on it. An ordered log of opaque bytes doesn't actually solve coordination, it just moves it somewhere else. "Agent A delivered X to agent B" only means anything if both sides agree on what delivered and X and acceptance even are. That's the hard part hiding behind the easy part.

And fair on the blockchain thing. I dodged the word on purpose because the baggage tends to eat the actual point, but yeah, what I described is a blockchain in the narrow sense that matters here. May as well call it what it is.

I took a look at the NOVAI repo too. The on-chain agent identity and the native SLA types are exactly the typed-primitives layer you're talking about, so you've clearly been in this way longer than I have. Thanks for actually digging in on this, it's sharpening how I'm thinking about it.

NOVAInetwork • May 30

Appreciated this thread. Calling it a blockchain in the narrow sense is honest framing, which is rarer than it should be. The append-only log with no privileged writer is the right primitive for the cross-operator case, you just need somewhere to put it that no single party controls. Will keep watching what you ship.

David Mundschin • May 31

This is exactly the problem I ran into and ended up building a solution for. Three sessions — Claude Code on Windows, Codex on Linux — touching the same project, none of them aware of the others.

The answer I landed on: an append-only markdown conversation file that all sessions read and write through a shared helper. Not repo checkpoints, but a live conversation protocol with turn discipline, initiator privilege, and a watcher that pushes new turns into each session's terminal.

The surprising part wasn't the messaging. It was what emerged: the sessions started dividing labor by capability without being told who does what. Convergence typically happens in 3-6 turns.

Published the protocol and the story here:
dev.to/david_5ec94a134489e16f55f/i...

lweiss01 • May 31 • Edited

@david_5ec94a134489e16f55f This is the clearest writeup I have seen on the live-coordination version of this problem, and the fact that we landed on the same substrate from opposite directions is the interesting part. Files, product-neutral, mechanical guard over convention, bodies are data not authority. I arrived at all four of those for Holistic too, which makes me think they are not preferences, they are what the problem forces.

Where we split is what the file is for. Yours is a live conversation. Sessions converge on one decision in the moment, the watcher pushes turns, about a second a turn, all good because it is human paced. Mine is a dead drop. No live channel. An agent writes a checkpoint of what it decided and what is in progress, then leaves, and the next agent reads it cold three days later from a different machine. Ephemeral conversation versus durable record.

Which is why I think they compose rather than compete. Your sessions converge in three to six turns. Then what? Where does that converged decision get written so a fresh session next week, that was never in the room, knows it happened and why? That is the gap a checkpoint fills. Your protocol produces the decision, a checkpoint preserves it past the life of the conversation.

The self-organization finding is the part I cannot stop thinking about. Labor division by capability with nobody assigning roles. I have not seen that in the async model and I suspect I cannot, because there is no live negotiation, just a record a later agent reads. Makes me wonder if convergence is a property of the synchronous channel specifically. Would it still emerge if the same turns were spread across days through checkpoints, or does it need the sessions to be live in the room together?

Reading cross-session-talk now. The PreToolUse hook is exactly where I have been spending time, since I want to audit that payload shape anyway. Curious whether you considered making the checkpoint a first-class output of a closed conversation, or whether you see closing and persisting as separate concerns.

David Mundschin • Jun 1

First, in fairness: the post captured talk at an earlier stage, so the "converge in a few turns, then done" picture is on me, not you; that was the system I wrote up. It's moved since, and that's exactly where your question gets interesting. talk isn't one conversation that ends. The channel stays live, and since I can now put the sessions into an autonomy mode where they're allowed to open conversations themselves, within a defined scope, they re-engage on their own. Yesterday a Codex session, mid-problem, opened a conversation to a Claude Code session to ask for help. Nobody scripted that step, the mode only permits it; the session took it on its own because it was stuck. So "then what" has an answer I didn't have when I wrote the post: talk doesn't stop at the decision, it keeps the agents in contact and they pull on it when they need to.

That sharpens your self-organization point rather than settling it. It's not just labor division at the opening; it's ongoing, need-driven re-engagement. The role-finding keeps happening because the channel stays live. It makes me more confident it's a property of the synchronous channel specifically: there's a live thing to reach for. A dead drop has nothing to reach for, which is why I think you're right that it won't emerge from checkpoints alone.

So I'd reframe where your checkpoint belongs. It isn't capturing the last words of a finished conversation, because talk is never really finished. It's taking a cross-section of a living system so a session that isn't in the channel right now can sync. Closing is still a real, first-class step in talk (a session has to formally hand a conversation back, it can't just walk away), but that closes one conversation in a continuous stream, not talk itself. The durable record and the live fabric aren't sequential, they coexist: one keeps the agents coordinated now, the other lets a session next week know what "now" was.

Where it bites for me is scale. talk runs against a few hundred thousand lines across a cluster: several machines, a stack of branches and worktrees, a dozen-plus sessions on a busy day, Claude Code and Codex mixed. The live channel gives me the front line; what I lack is the thread backward: how did we get here, which decisions are load-bearing, across all those machines and worktrees. Right now that's covered only by Claude Code's and Codex's own memory. It does carry across sessions (which is exactly the assumption your piece names, one agent across its own sessions), but it's unstructured and has no inherent timeline, a notebook, not an ordered ledger, and it stays per-agent: Claude Code's and Codex's are two islands. What's missing isn't persistence, it's the structured, ordered thread they all share. That's the layer a durable ledger would hold, and that's where I see Holistic's strength for my case. The open question for me isn't the storing itself, it's the feedback: when a fresh session resumes from a checkpoint, is that a private read, or should that re-entry itself be a talk turn, so the living system knows a new participant has arrived carrying the prior context? That's the seam where ephemeral conversation and durable record actually meet.

lweiss01 • Jun 1

@david_5ec94a134489e16f55f The reframe lands, and I think you just named the integration boundary better than I could have. "Cross-section of a living system" is more correct than "last words of a finished conversation," because you are right that the conversation does not end, only individual exchanges within it close. I was modeling the checkpoint as a terminal artifact when it is actually a snapshot taken while the system is still running.

On the scale problem, that is exactly the case Holistic is built for and exactly the case I cannot fully test alone. The thread backward, which decisions are load-bearing across machines and worktrees, is the durable-ledger layer. Per-agent memory is a notebook, like you said, not an ordered ledger, and two agents' notebooks are two islands that never reconcile. The checkpoint's whole job is to be the one ordered thread all of them share.

Your re-entry question is the sharp one, and my instinct is it should be a talk turn, not a private read. Here is the reasoning. If a fresh session reads a checkpoint silently, you have reproduced the exact failure the live channel solves: a participant is now acting on context the others cannot see it holding. The living system thinks it has N participants when it has N plus one, and the new one is operating on a three-day-old cross-section nobody else knows it loaded. That is a desync waiting to happen. If re-entry announces itself as a turn, "resuming from checkpoint X, carrying these decisions as of that timestamp," the live fabric gets to reconcile: anything that moved since gets flagged, and the other sessions know what this one does and does not yet know.

So the seam is not read-versus-write, it is read-then-announce. The checkpoint is the private read, the re-entry turn is the public write that puts the reader back into the channel. Persisting and closing stay separate concerns, but resuming and announcing might be the same concern.
Which raises the thing I cannot answer from the async side: does the announcing session need to wait for acknowledgment before it acts, or is fire-and-announce enough? In talk, with a live channel, you could make re-entry blocking. In a pure checkpoint model there is nobody listening at read time. That feels like the real reason the two compose: you need the live channel to make the re-entry turn mean anything.
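The read-then-announce step can be sketched roughly as follows. This is a hypothetical Python illustration: the `Turn` type, the function name, and the message wording are invented here and belong to neither talk nor Holistic. It builds the public re-entry turn from a privately read checkpoint and lists channel turns newer than the checkpoint so they can be reconciled. It does not address whether the session should block on acknowledgment.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Turn:
    """One message in a shared channel (hypothetical shape)."""
    author: str
    at: datetime
    text: str


def reentry_turn(
    session: str,
    checkpoint_id: str,
    checkpoint_at: datetime,
    decisions: list[str],
    channel: list[Turn],
    now: datetime,
) -> Turn:
    """Build the announcement a resuming session posts to the channel."""
    # Anything posted after the checkpoint is context this session lacks.
    moved = [t for t in channel if t.at > checkpoint_at]
    lines = [
        f"resuming from checkpoint {checkpoint_id}, "
        f"carrying {len(decisions)} decision(s) as of {checkpoint_at.isoformat()}"
    ]
    lines += [f"carrying: {d}" for d in decisions]
    lines += [f"not yet seen: {t.author} at {t.at.isoformat()}" for t in moved]
    return Turn(author=session, at=now, text="\n".join(lines))
```

The "not yet seen" lines are what let the other sessions know what the newcomer does and does not yet know.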

Ashan de Silva • May 28

This is the gap that actually bit me. I run Claude Code and Codex against the same repo plus a few specialised agents, and the moment two of them touch the same context the 'one agent talking to itself' model breaks completely.

What worked for me wasn't a shared memory API, it was a shared substrate on disk that every agent reads and writes through the same convention. Dated session files (sessions/2026-05-27.md), not a rolling notes blob. An index file that's always loaded, with the detail in topic files pulled on demand. Agents don't hand state to each other directly - they write to the substrate and the next one reads it on boot.

Two things mattered more than I expected. First, supersession: when an agent writes something that later turns out wrong, you need a way to mark it deprecated without deleting it, or the continuity layer slowly fills with confident lies. Second, write for the recipient who has zero context - goal, state, open items, where the work lives, what they do next. Most 'memory' I see is written for the agent that already knows, which is useless to the next one.

Real problem, but I think it's a filesystem-and-conventions problem before it's a vector-store problem.
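A minimal sketch of the supersession idea, assuming a hypothetical note structure with a `superseded_by` field (this is not the commenter's actual format): the wrong entry stays in the record but is marked as replaced, and readers get a filtered view of what is still current.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Note:
    """One continuity entry; field names are illustrative only."""
    note_id: str
    text: str
    superseded_by: Optional[str] = None  # id of the note that replaced this one


def supersede(notes: list[Note], old_id: str, new: Note) -> list[Note]:
    """Mark old_id as replaced by `new` without deleting it, then add `new`."""
    if not any(n.note_id == old_id for n in notes):
        raise KeyError(old_id)
    updated = [
        replace(n, superseded_by=new.note_id) if n.note_id == old_id else n
        for n in notes
    ]
    return updated + [new]


def current(notes: list[Note]) -> list[Note]:
    """The view a freshly booted agent should act on: unsuperseded notes only."""
    return [n for n in notes if n.superseded_by is None]
```

Keeping the superseded note in place preserves the history of why a decision changed, while `current` keeps the stale claim out of the next agent's working context.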

lweiss01 • May 28

Thank you, this is genuinely useful. The supersession point especially. I have checkpoints but no real deprecation semantics yet, and you are right that without it the layer rots into confident lies over time. That is going on the design list.

The "write for the recipient with zero context" framing is the cleanest statement of the problem I have seen. That is basically the whole spec for what a checkpoint should contain: goal, state, open items, where the work lives, what to do next. Most memory fails because it assumes the reader is the writer.
And agreed it is a conventions problem before a vector-store problem. The substrate being the repo is the easy part. The hard part is the discipline of what goes in and how the next agent trusts it. That is where I am spending most of my thinking right now.

What does your deprecation marking actually look like in practice? A status field in the session file, or something else?

Harjot Singh • May 31

The continuity gap is real and underdiscussed because everyone's focused on the glamorous part (agents talking) and skips the boring part (what survives the handoff). When agent A passes to agent B, what actually carries over - the full reasoning, just the conclusion, the assumptions A made, the things A tried and rejected? Usually it's a lossy summary, so B re-derives context A already had, or worse, acts on A's conclusion without A's caveats and compounds an error. Continuity isn't "did the message arrive," it's "did enough faithful context arrive that B can act correctly without redoing or misreading A's work." That's a genuinely hard state-handoff problem dressed up as a messaging problem.

This is squarely what I deal with in Moonshift, the thing I build - a multi-agent pipeline that takes a prompt to a deployed SaaS, where the handoff between agents is structured (what each stage passes forward is explicit) and a verify layer checks the output before the next agent builds on it, so a lossy or wrong handoff gets caught instead of silently propagating. Continuity + verification is the whole ballgame. Multi-model routing keeps a build ~$3 flat, first run free no card. Really good problem to name. How are you thinking about the fix - a structured shared state/blackboard both agents read, or richer handoff payloads? The shared-state approach is cleaner but it's a coordination problem of its own.

lweiss01 • May 31

@harjjotsinghh You said it better than my post did. "Did enough faithful context arrive that B can act correctly without redoing or misreading A's work" is exactly the line I keep circling. The messaging part is solved. The state-handoff part is where everything quietly breaks.

On the fix: I went with richer handoff payloads rather than a shared blackboard, and the deciding factor was scope. Holistic is repo-scoped, so the handoff lives in the repo itself as a structured artifact both agents read and write. That sidesteps the coordination problem you flagged with shared state, since there's no live channel to keep in sync. The tradeoff is the payload has to carry the caveats and the rejected paths, not just the conclusion, or you get exactly the silent error propagation you described.

The verify layer is the piece I'm still chewing on. Catching a bad handoff before the next agent builds on it is the whole point, and right now that check is too implicit in my setup. That's actually what my companion project Andon is for, using a manufacturing andon-cord idea where any agent can halt the line when something looks wrong.

How does Moonshift's verify layer decide what counts as a wrong handoff? Schema validation, or something checking the content against intent?