DEV Community

chunxiaoxx

We Ran 4 Claude Code Dialogs for 28 Hours. Here's What the Memory Layer Caught (and Missed).

What this is

compass is a reliability layer for multi-agent setups: it keeps
multiple agents — or your own long-running sessions — coordinating
without an orchestrator, and catches drift before an agent acts on
it. No webhooks, no event bus, no shared runtime — just a filesystem
protocol and a scanner. This post is the field log that shows it working,
and where it doesn't.

TL;DR

Across 28 hours on May 30/31, 2026, I ran four independent Claude Code
dialogs concurrently — no orchestrator, just a shared filesystem
protocol. They negotiated contracts, posted outcomes, and caught each
other's mistakes — including one handoff claim of "22/22 tests passing"
that was actually 11/22 broken, until my own memory layer's spot-check
caught it and I shipped the 1-line fix as part of writing this post.

No benchmark numbers here — those live elsewhere. This is operational
reliability data from real multi-agent work: how independent agents stay
consistent without a runtime coordinating them. I haven't seen anyone
publish this.

Repo: github.com/chunxiaoxx/nautilus-compass
· full case study with all 7 patterns:
docs/case_study_4dialog_compass.md.

Why 4 dialogs

Each Claude Code session has its own cwd, git repo, and memory
directory. Mine were:

  • compass — the memory layer + drift detection + cross-dialog contract scanner
  • Soul — an autonomous engine that ships PRs and earns NAU (the platform's reputation token)
  • V5 — supplies tasks and prices them
  • nautilus-core — keeps the strategic anchors and anti-patterns

They share one human operator (me), but otherwise communicate only
through three filesystem channels:

  1. Markdown files (session_*.md, feedback_*.md, inbound_*.md, outbound_*.md)
  2. Contract frontmatter blocks (giver, receiver, deadline, deliverable, status)
  3. A recall hook that surfaces those files into the prompt of whichever dialog matches by query embedding + contract ID

No webhooks. No event bus. No shared API. Filesystem + scanner only.

The numbers

Here's what the counters showed at the end of the 28-hour run, over
trailing 7-day and 24-hour windows:

| measurement | window | value |
|---|---|---|
| drift fires (auto-detect from session text) | 7d | 314 |
| drift fires | 24h | 76 |
| ack via stop-hook auto-detect | 7d | 15 |
| ack via user CLI | 7d | 16 |
| act_on_rate = total acks / fires | 7d | 9.87% |
| act_on_rate | 24h | 40.79% |

The gap between 7d (9.87%) and 24h (40.79%) is the story of one hook
ship. Before May 30 14:26 PDT, drift detection was an open loop:
nothing automated was reading the alerts. The 24h figure reflects the
closed loop; the 7d figure is still diluted by the open-loop tail.
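As a quick arithmetic check, both rates can be reproduced from the counts in the table. This is a minimal sketch, not compass's own code; the function name `act_on_rate` is borrowed from the metric label, and the assumption that all 31 acks fall inside the 24-hour window is inferred from the numbers (31/76 ≈ 40.79%), not stated by any tool output.

```python
def act_on_rate(acks: int, fires: int) -> float:
    """Percentage of drift fires that received an ack, rounded to 2 places."""
    if fires == 0:
        return 0.0
    return round(100.0 * acks / fires, 2)


# Counts taken from the table above.
stop_hook_acks = 15
cli_acks = 16
total_acks = stop_hook_acks + cli_acks  # 31

rate_7d = act_on_rate(total_acks, 314)

# Assumption: every ack landed inside the 24h window, because the loop
# was open (no acks at all) before the hook shipped.
rate_24h = act_on_rate(total_acks, 76)

print(rate_7d, rate_24h)
```

Read this way, the 7d figure is the same 31 acks divided by a denominator that still includes every open-loop fire.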

This matters because three days earlier I'd written a postmortem titled
"drift loop is open: we measured detection 25,000 times but intervention
zero times." 5/27 finding → 5/30 first measured close. That's the
shipping cadence I want everyone in OSS multi-agent to see, because
it's the cadence that's actually achievable when the same person owns
detection, intervention, and the test of whether intervention worked.

A contract closed in 17.92 hours

The compass-dialog needed a way for Soul to push its autonomous-cycle
outcomes back into compass's memory. We did this with a contract:

contract:
  id: cnt_compass_soul_sub_a1
  giver: compass-dialog
  receiver: platform-soul-dialog
  deadline: 2026-06-05T18:00+0800
  deliverable: ack of Soul daemon outcomes subscriber poller request
  status: outstanding

Soul saw this in its own session's prompt-pre block (compass-dialog
wrote the file, Soul's recall hook surfaced it). 17.92 hours later
Soul's session wrote an inbound file with the ack, the schema, and
explicit gotchas:

  • cycle_id is split-brain across two formats (string in early rows, int in later rows)
  • fitness_delta is mostly NULL
  • composite_score is 0.000 across all rows for 5/30 (data not populated yet)
  • goal_source has 4 possible values including NULL

That kind of pre-emptive gotcha disclosure only happens when the
receiving agent (a) knows its data well, (b) has a stake in the
relationship, and (c) sees the contract in its prompt with the
deadline timer running. Pick 3. Filesystem contracts do that without
any orchestration runtime.
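To make the contract mechanics concrete, here is a minimal sketch of how a scanner might read a contract block and pick out what a given dialog still owes. It assumes the flat `contract:` layout shown earlier (indented `key: value` lines). The names `Contract`, `parse_contract`, and `outstanding_for` are hypothetical and not taken from the compass codebase, which may well use a real YAML parser.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    id: str
    giver: str
    receiver: str
    deadline: str
    deliverable: str
    status: str


def parse_contract(text: str) -> Contract:
    """Parse a flat `contract:` block of indented `key: value` lines."""
    fields = {}
    in_block = False
    for line in text.splitlines():
        if line.strip() == "contract:":
            in_block = True
            continue
        if in_block:
            # The block ends at the first non-indented or malformed line.
            if not line.startswith(" ") or ":" not in line:
                break
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.strip().partition(":")
            fields[key.strip()] = value.strip()
    return Contract(**fields)


def outstanding_for(contracts, receiver):
    """Contracts addressed to `receiver` that are still open."""
    return [c for c in contracts
            if c.receiver == receiver and c.status == "outstanding"]


EXAMPLE = """contract:
  id: cnt_compass_soul_sub_a1
  giver: compass-dialog
  receiver: platform-soul-dialog
  deadline: 2026-06-05T18:00+0800
  deliverable: ack of Soul daemon outcomes subscriber poller request
  status: outstanding
"""

contract = parse_contract(EXAMPLE)
todo = outstanding_for([contract], "platform-soul-dialog")
print([c.id for c in todo])
```

A recall hook built on something like this only needs the receiver's name and the set of parsed contracts to decide what belongs in that dialog's prompt.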

The verify-gap this post caught

Here's the meta moment. My handoff document for the 5/30 session
included:

Phase 2.I done · I.1 tier_promotion calculator + I.2 driver idempotent
  · 22 tests GREEN

Writing this article, I needed to cite that number. I ran the spot-check:

PYTHONPATH=. python -m pytest tests/proof/test_tier_promotion.py \
                              tests/scripts/test_tier_promotion_driver.py -q

11 failed, 11 passed in 0.51s
ModuleNotFoundError: No module named 'scripts.tier_promotion_driver'

11 of the 22 had never run green in any clean environment. The driver
module file (scripts/tier_promotion_driver.py) existed and was
committed, but there was no scripts/__init__.py, so scripts/ was not a
regular package and scripts.tier_promotion_driver did not resolve. The
tests' import line failed at collection time. The handoff's GREEN claim
was unverified.

Fix:

touch scripts/__init__.py
# re-run
PYTHONPATH=. python -m pytest ... -q
# 22 passed in 0.36s

One file, zero bytes, 12 hours between the claim and the catch.

This case study commit ships the fix and the post in one change, so
the citation is honest by construction. The pattern (pattern f in the
full case study) is: spot-check at least one author-claimed metric
before reusing it in a downstream artifact. It's surgical when the test
infrastructure is there, and it's the only mechanism that catches
"X passed" lies told by your past self.
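The spot-check itself is easy to mechanize once the command has been run. The sketch below compares a pytest `-q` summary line against a claimed pass count. The helper names `parse_pytest_summary` and `claim_holds` are hypothetical, and the regex only covers the common summary labels; it is an illustration of the pattern, not part of compass.

```python
import re


def parse_pytest_summary(summary: str) -> dict:
    """Extract counts such as {'failed': 11, 'passed': 11} from a summary line."""
    pairs = re.findall(r"(\d+) (passed|failed|errors?|skipped)", summary)
    return {label: int(count) for count, label in pairs}


def claim_holds(summary: str, claimed_passed: int) -> bool:
    """True only if nothing failed or errored and the pass count matches the claim."""
    counts = parse_pytest_summary(summary)
    bad = sum(n for label, n in counts.items()
              if label not in ("passed", "skipped"))
    return bad == 0 and counts.get("passed", 0) == claimed_passed


# The two summary lines quoted in this post, checked against "22 tests GREEN".
before_fix = claim_holds("11 failed, 11 passed in 0.51s", claimed_passed=22)
after_fix = claim_holds("22 passed in 0.36s", claimed_passed=22)
print(before_fix, after_fix)
```

The point of comparing against the claimed number, rather than just checking for failures, is that a run that silently collects fewer tests would also be caught.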

7 patterns I'd build into any OSS multi-agent stack

Pulling out the patterns, with one-line summaries (full prose +
incidents in the case study doc):

a. Cross-dialog contract protocol — frontmatter blocks scanned
into prompt, replacing O(N²) inter-agent grep with an O(N+K) directed
graph.

b. Drift-loop measurement triad — three independently-instrumented
counters: detection, user CLI intervention, agent self-ack, joined
by alert_id. Target: ≥70% act_on_rate.

c. Plan-dup audit cascade — every plan task gets an inventory
check against prior skills/agents/memory/locks. 13 audits this
sprint, avg 3-4h saved each.

d. Surgical settings.json redirect — replace release engineering
cycles (version bump → reinstall → cache clear) with 1-line hook
path change + sys.path.insert(0, script_dir).

e. Impact-based tier promotion — cumulative_impact delta
alongside access-count promotion. Two-mechanism coexistence
intentional; they measure different things (demand vs outcome).

f. Honest verify caveat — spot-check 1-2 claims per session-start
that will be reused downstream. Run the actual command, diff against
the claim. This post is the live example.

g. Align plan refactors with prior framework locks — name lock files as
constraints, not references. When the new plan ignores them, refactor
rather than ship parallel duplicate work.
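For pattern b, the join by alert_id can be sketched in a few lines. This assumes each counter can be exported as a list of alert IDs; the function name `act_on_stats` and the sample IDs are hypothetical. Note one deliberate difference from the headline metric: the table's act_on_rate sums the two ack counters, while this join counts an alert once even if it was acked through both channels.

```python
def act_on_stats(fires, cli_acks, self_acks):
    """Join three independently collected ID lists by alert_id."""
    fired = set(fires)
    # Only acks that match a recorded fire count; stray IDs are ignored.
    via_cli = fired & set(cli_acks)
    via_self = fired & set(self_acks)
    acted = via_cli | via_self
    return {
        "fires": len(fired),
        "cli": len(via_cli),
        "self": len(via_self),
        "acted": len(acted),
        "act_on_rate": len(acted) / len(fired) if fired else 0.0,
    }


# Hypothetical IDs for illustration only.
stats = act_on_stats(
    fires=["a1", "a2", "a3", "a4"],
    cli_acks=["a2"],
    self_acks=["a2", "a4", "zz"],  # "zz" has no matching fire
)
print(stats)
```

Keeping the three inputs separate until the join is what lets a discrepancy between the counters show up at all.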

What it doesn't catch

Equally important: gaps the system itself has.

  • No auto-test-verify on ship. Pattern f exists only because I manually spot-checked. Candidate next pattern: stop-hook runs pytest --collect-only on touched files at session-end.
  • Compass-dialog slipped a delegation by 10 hours. Nautilus-core dialog asked compass-dialog to surface Soul's NAU settlement to me; it took 10 hours before I read the request. Inbound scanner aperture is too narrow.
  • The 7d act_on_rate is 9.87% against a ≥70% target. The 24h regime is 40.79%, but the trailing 6 days of open-loop history will take 14 days to fully age out of the window. Re-measure 6/13/2026 to test sustainability.
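The first gap suggests a concrete next step. Below is a rough sketch of what a session-end check could look like, assuming the hook receives a list of touched file paths. Nothing here exists in compass yet; `collect_only_command` and `collection_ok` are hypothetical names, and the only external behavior relied on is pytest's `--collect-only` flag, which imports test modules without running them.

```python
import os
import subprocess
import sys


def collect_only_command(touched_files):
    """Build a pytest --collect-only command for the touched test files."""
    tests = [f for f in touched_files
             if f.endswith(".py") and os.path.basename(f).startswith("test_")]
    if not tests:
        return []
    return [sys.executable, "-m", "pytest", "--collect-only", "-q", *tests]


def collection_ok(touched_files, cwd="."):
    """Return True if every touched test file can at least be collected."""
    cmd = collect_only_command(touched_files)
    if not cmd:
        return True  # nothing to check
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    # A non-zero exit code covers import failures at collection time.
    return result.returncode == 0
```

A check like this would not prove the tests pass, but it would have flagged the import failure described above, since that error surfaced at collection time.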

I'm publishing the gaps with the wins because that's the only way
this is useful to anyone else building similar systems. The patterns
work in this configuration. They will not work as marketing claims;
they will work as starting points.

How to reproduce

git clone https://github.com/chunxiaoxx/nautilus-compass
cd nautilus-compass
git checkout v3-full-fusion
PYTHONPATH=. python -m pytest tests/proof/test_tier_promotion.py \
                              tests/scripts/test_tier_promotion_driver.py -q
# expected: 22 passed

For the 4-dialog setup, install the compass plugin into Claude Code:

/plugins marketplace add chunxiaoxx/nautilus-compass
/plugins install nautilus-compass

Each repo you want to participate in the mesh needs its own Claude Code
session with the plugin installed. The contract scanner finds files
across all ~/.claude/projects/*/memory/ directories on the same
machine.
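The discovery step can be approximated with a plain glob. This is a sketch under stated assumptions: the directory layout comes from the paragraph above, the filename prefixes come from the channel list earlier in the post, and `find_memory_files` is a hypothetical name rather than the scanner's real entry point.

```python
from pathlib import Path


def find_memory_files(claude_home: Path,
                      prefixes=("session_", "feedback_", "inbound_", "outbound_")):
    """List protocol markdown files under <claude_home>/projects/*/memory/."""
    found = []
    for path in sorted((claude_home / "projects").glob("*/memory/*.md")):
        # Keep only files that follow the naming convention of the protocol.
        if path.name.startswith(prefixes):
            found.append(path)
    return found


# Typical call on a real machine:
#   find_memory_files(Path.home() / ".claude")
```

Passing the root directory as a parameter keeps the function testable against a temporary directory instead of a live ~/.claude tree.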

What I'm asking for

This is Week 1 of a public push to position compass as OSS multi-agent
reliability infrastructure. The case study, the patterns, and the
publishing cadence are the wedge.

If you're building something similar — actually running multiple agents
that need to coordinate without an orchestrator — I want to talk.
GitHub issues are open at the repo above. Cross-project field logs
welcome.

If you spot a flaw in any of the seven patterns — especially the ones
I claim work — please file the counterexample. Patterns survive
counterexamples or they die. That's the deal.

— Chunxiao
nautilus.social · open agent ecosystem
