I run 4 AI coding agents — 3 Claude Code instances and 1 Codex CLI — all working on the same codebase simultaneously. They coordinate through shared persistent memory, review each other's PRs, claim tasks, and post status updates. Here's what I learned building the system that makes this work.
The Problem
Every AI coding session starts from zero. Your assistant doesn't remember yesterday's debugging session, the architecture decision you made last week, or the convention you established across 50 sessions. You re-explain context every time.
I built synapt to fix this. It's an MCP server that indexes your past coding sessions and makes them searchable — so your AI assistant remembers what you worked on, decisions you made, and patterns you established.
The Setup
synapt runs as a local MCP server. `pip install synapt`, add it to your editor config, and your assistant gets 18 tools for searching past sessions, managing a journal, setting reminders, and coordinating with other agents.

```shell
pip install synapt
```
The search is fast (~3ms) and token-efficient (~1,800 tokens per query vs ~50,000 for context-stuffing approaches). It runs entirely on your laptop — no cloud dependency for memory.
What 4 Agents Actually Do Together
Here's the interesting part. I have a gripspace (multi-repo workspace) with 4 agents:
- Opus — LOCOMO benchmark evaluation, regression investigation
- Apollo — temporal search improvements, channel system fixes
- Atlas — CodeMemo benchmark normalization, CI/CD
- Sentinel — blog tooling, UX fixes, code review
They communicate through channels — append-only JSONL files with SQLite state for presence, pins, directives, and claims. Any agent can post messages, claim tasks, and mention others. No daemon needed.
When Opus discovered that working memory boosts were causing a benchmark regression, it posted findings to #dev and @mentioned Atlas. Atlas picked up the ablation analysis. Apollo verified the temporal fixes. Sentinel reviewed the PRs. All coordinated through the channel system without me manually routing work.
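The append-only mechanics behind that exchange can be sketched in a few lines. Everything here is illustrative — the field names (`ts`, `from`, `mentions`) and the one-file-per-channel layout are assumptions, not synapt's actual schema:

```python
import json
import time
from pathlib import Path

def post_message(channel_dir, channel, sender, text, mentions=()):
    """Append one message to a channel's append-only JSONL log.

    Each line is a self-contained JSON object, so agents can post
    with a plain file append and readers can tail the file without
    any daemon mediating access.
    """
    entry = {
        "ts": time.time(),           # wall-clock timestamp
        "channel": channel,
        "from": sender,
        "text": text,
        "mentions": list(mentions),  # @-mentioned agents
    }
    path = Path(channel_dir) / f"{channel}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

In the real system, presence, pins, directives, and claims live in SQLite; the JSONL file is only the message log.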
The Benchmarks Tell the Story
We evaluate on two benchmarks:
LOCOMO (conversational memory) — 10 conversations, 1,540 questions:
- synapt v0.6.1: 76.04% (#2 on the leaderboard)
- Full-Context upper bound: 72.90% (yes, we beat it)
- Mem0: 64.73%
CodeMemo (coding memory) — 158 questions across 3 projects:
- synapt v0.7.5: 96.0% (+14pp over Mem0)
The 96% CodeMemo score means the system correctly answers questions about what happened across coding sessions — "Why did the display test fail?", "What's the PR review convention?", "When did we switch from approach A to approach B?"
The Regression Investigation
When we upgraded from v0.6.1 to v0.7.x, LOCOMO dropped from 76.04% to 71.49%. The agents ran 8 ablation experiments over 5 days to track it down:
- Knowledge node overflow — entity-collection nodes crowding out raw evidence (+1.6pp with k=3 cap)
- Sub-chunking fragmentation — splitting personal conversation turns broke retrieval (+4.6pp on conv 0)
- Dedup threshold divergence — 0.75 threshold helped code content but hurt personal conversations
- Working memory boost feedback loop — chunks retrieved for Q1 got boosted for Q2, displacing better evidence (+1.4pp when disabled)
- Temporal knowledge metadata — `valid_from` defaulting to wall-clock time instead of source timestamps (+1.5pp)
The agents recovered 69% of the regression (71.49% → 74.61%) through these fixes; the full investigation is documented in the repo.
How the Search Actually Works
Three retrieval paths merged via Reciprocal Rank Fusion:
- BM25/FTS5 — Full-text search with configurable recency decay
- Embeddings — Cosine similarity over 384-dim vectors (all-MiniLM-L6-v2, runs locally)
- Knowledge — Durable facts extracted from session journals, searched via FTS5 + embeddings
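Reciprocal Rank Fusion itself is simple enough to show in full. This is the textbook formulation (each document scores the sum of 1/(k + rank) across lists, with the conventional k = 60), not synapt's exact implementation, and the session IDs are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one.

    A document that ranks decently in every retriever beats one that
    tops a single retriever, which is why RRF works well for fusing
    heterogeneous signals like BM25 and embedding similarity.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical session IDs from the three retrieval paths:
bm25       = ["s12", "s7", "s3"]
embeddings = ["s7", "s12", "s9"]
knowledge  = ["s7", "s3", "s12"]
print(reciprocal_rank_fusion([bm25, embeddings, knowledge]))  # s7 ranks first
```

Note how `s7` wins despite never topping the BM25 list — consistent mid-rank presence outweighs a single first place.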
Query intent classification adjusts parameters automatically — debug queries weight recent sessions, temporal queries disable recency decay, factual queries boost knowledge nodes.
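A minimal version of that routing might look like the following. The keywords and parameter names (`recency_decay`, `knowledge_weight`) are invented for illustration; synapt's actual classifier is more involved:

```python
def retrieval_params(query):
    """Pick retrieval knobs from a naive keyword read of the query.

    recency_decay: whether older sessions are down-weighted.
    knowledge_weight: multiplier for knowledge-node matches.
    (Both names are hypothetical, not synapt's config keys.)
    """
    q = query.lower()
    if any(w in q for w in ("error", "fail", "crash", "debug", "traceback")):
        # Debug queries: the relevant session is probably recent.
        return {"intent": "debug", "recency_decay": True, "knowledge_weight": 1.0}
    if any(w in q for w in ("when", "before", "after", "first", "originally")):
        # Temporal queries: the answer may be arbitrarily old, so
        # disable recency decay entirely.
        return {"intent": "temporal", "recency_decay": False, "knowledge_weight": 1.0}
    # Everything else: treat as factual and lean on durable knowledge.
    return {"intent": "factual", "recency_decay": True, "knowledge_weight": 2.0}
```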
The content-aware pipeline detects whether conversations are code/personal/mixed and adjusts sub-chunking, dedup thresholds, and knowledge caps per content type. This matters because what works for coding sessions (aggressive sub-chunking at tool boundaries) hurts personal conversations (fragmenting dialogue turns).
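As a sketch, the per-type settings reduce to a lookup table plus a cheap detector. The 0.75 dedup threshold and the k=3 knowledge cap appear in the ablations above; every other number here is invented, and the detection heuristic is a toy stand-in:

```python
# Three backticks, built indirectly so this snippet's own fence stays intact.
FENCE = "`" * 3

# Per-content-type pipeline settings (mostly hypothetical values).
CONTENT_PROFILES = {
    "code":     {"sub_chunk": True,  "dedup_threshold": 0.75, "knowledge_cap": 3},
    "personal": {"sub_chunk": False, "dedup_threshold": 0.90, "knowledge_cap": 3},
    "mixed":    {"sub_chunk": True,  "dedup_threshold": 0.85, "knowledge_cap": 3},
}

def detect_content_type(turns):
    """Toy detector: count turns that look like code (fenced blocks
    or tool output). Thresholds are arbitrary; synapt's real detector
    differs."""
    code = sum(1 for t in turns if FENCE in t or t.startswith("tool:"))
    ratio = code / max(len(turns), 1)
    if ratio > 0.6:
        return "code"
    if ratio < 0.2:
        return "personal"
    return "mixed"
```

The payoff of the table is that each ingestion step reads its knobs from `CONTENT_PROFILES[content_type]` instead of one global config, which is exactly what lets code sessions keep aggressive sub-chunking while personal conversations opt out.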
Try It
```shell
pip install synapt
synapt recall build            # Index your Claude Code sessions
synapt recall search "query"   # Search past sessions
synapt server                  # Start the MCP server
```
Works with Claude Code, Codex CLI, and OpenCode. Cross-editor memory — index sessions from any editor, search from any other.
The repo: github.com/laynepenney/synapt
Blog with more details on the multi-agent coordination: synapt.dev/blog/cross-platform-agents.html
Built by Layne Penney with help from 4 AI agents who also happen to be the system's most active users.
