<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikita Benkovich</title>
    <description>The latest articles on DEV Community by Nikita Benkovich (@nikita_benkovich_eb86e54d).</description>
    <link>https://dev.to/nikita_benkovich_eb86e54d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3801565%2F78591683-574b-4474-994c-a33c345b62a5.png</url>
      <title>DEV Community: Nikita Benkovich</title>
      <link>https://dev.to/nikita_benkovich_eb86e54d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikita_benkovich_eb86e54d"/>
    <language>en</language>
    <item>
      <title>We Investigated Why Claude’s Memory Fails. Here’s What We Learned</title>
      <dc:creator>Nikita Benkovich</dc:creator>
      <pubDate>Wed, 01 Apr 2026 06:47:24 +0000</pubDate>
      <link>https://dev.to/nikita_benkovich_eb86e54d/we-investigated-why-claudes-memory-fails-heres-what-we-learned-3pl6</link>
      <guid>https://dev.to/nikita_benkovich_eb86e54d/we-investigated-why-claudes-memory-fails-heres-what-we-learned-3pl6</guid>
      <description>&lt;p&gt;Memory is arguably the most fundamental capability for a useful agent. Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent can't adapt to a user's preferences over time&lt;/li&gt;
&lt;li&gt;It repeats the same mistakes across sessions&lt;/li&gt;
&lt;li&gt;The user has to re-prompt the same rules every conversation ("always use strict typing", "never mock the database in tests", "our auth token expiry is 900s")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a minor UX annoyance — it's the difference between an agent that gets better the more you use it and one that resets to zero every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with Claude's native memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code gives you two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; — you write it, you maintain it. Rules the agent should follow, project context, conventions. Fully manual.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY.md&lt;/strong&gt; — Claude maintains it automatically, appending notes during sessions. No manual work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MEMORY.md sounds like it solves the problem. In practice it has three hard constraints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-repo scope&lt;/td&gt;
&lt;td&gt;Global preferences (code style, team conventions) have to be duplicated into every project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-time startup injection&lt;/td&gt;
&lt;td&gt;The file is read once at session start, not re-surfaced when relevant mid-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size-bounded by position&lt;/td&gt;
&lt;td&gt;Only the first N lines are read — notes get dropped by position, not by relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The result: as MEMORY.md grows, older notes fall off the bottom regardless of how important they are. And a rule written on day 1 of project A doesn't exist in project B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we wanted instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We set a design target before writing any code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context-activated&lt;/strong&gt; — hints surface when relevant to the current work, not at startup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No extra user-facing calls&lt;/strong&gt; — injection should be automatic, invisible, fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unbounded storage&lt;/strong&gt; — adding notes shouldn't degrade performance or relevance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt; — preferences like "never use duck typing when object structure is defined" should apply across all repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheap&lt;/strong&gt; — not a $0.10/tool-call overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to install&lt;/strong&gt; — one command installation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The core idea: a dedicated LLM whose context is your memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional retrieval (keyword search, vector embeddings) matches on surface similarity. It doesn't understand what the agent is currently doing.&lt;/p&gt;

&lt;p&gt;The approach here is different: instead of a search index, use a separate LLM that holds your stored notes as its context (system prompt), and prompt it with the current run context to retrieve relevant hints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Memory LLM                     │
│  system prompt: stored notes    │  ← memory lives here as context
│                                 │
│  user prompt: current transcript│  ← what is the agent doing right now?
│              + upcoming tool    │
│                                 │
│  output: relevant hints         │  ← surfaces only what matters
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM reasons about relevance rather than matching strings. Each note is stored with a plain-language &lt;code&gt;--when&lt;/code&gt; activation condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmr-memory write &lt;span class="s2"&gt;"auth token expiry is 900s not 3600s"&lt;/span&gt; &lt;span class="nt"&gt;--when&lt;/span&gt; &lt;span class="s2"&gt;"working on auth, tokens, config.ts"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note with &lt;code&gt;--when "working on authentication"&lt;/code&gt; will surface when the agent is editing a JWT middleware file, even if neither "authentication" nor "JWT" appears in the transcript at that moment. The memory LLM understands context — it's not doing substring search.&lt;/p&gt;
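&lt;p&gt;As a rough sketch (hypothetical code, not the actual cmr-memory internals), the retrieval prompts could be assembled like this: the stored notes become the memory LLM's system prompt, and the tail of the live transcript becomes the user prompt.&lt;/p&gt;

```python
# Hypothetical sketch only: shows how notes with --when conditions could be
# packed into the memory LLM's system prompt, with the current transcript
# as the user prompt. Function and field names are assumptions.

def build_retrieval_prompts(notes, transcript_tail):
    """notes: list of dicts with 'text' and 'when' keys."""
    lines = [
        "You are a memory retriever. Given the notes below, return only",
        "the notes relevant to what the agent is doing right now.",
    ]
    for i, note in enumerate(notes, 1):
        lines.append(f"{i}. {note['text']} (activate when: {note['when']})")
    system_prompt = "\n".join(lines)
    user_prompt = "Current agent activity:\n" + "\n".join(transcript_tail)
    return system_prompt, user_prompt

notes = [
    {"text": "auth token expiry is 900s not 3600s",
     "when": "working on auth, tokens, config.ts"},
]
system, user = build_retrieval_prompts(notes, ["Edit: src/auth/jwt.ts"])
```

&lt;p&gt;The memory LLM then reasons over both prompts at once, which is what lets a note about "authentication" fire while the agent edits a JWT middleware file.&lt;/p&gt;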

&lt;p&gt;&lt;strong&gt;Deduplication keeps context bounded&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once a hint is injected, it's already in the transcript. Re-injecting it on the next tool call adds tokens but zero information. Before running retrieval, we extract hints already present in the conversation and skip anything that's already there — including semantically equivalent hints worded differently.&lt;/p&gt;

&lt;p&gt;Context overhead stays bounded by the number of unique topics the agent works on in a session, not by session length or total notes stored.&lt;/p&gt;
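&lt;p&gt;The article's dedup is semantic (the memory LLM recognizes equivalently worded hints); a minimal sketch of just the cheap normalized-string pass you would run first might look like this:&lt;/p&gt;

```python
# Simplified deduplication pass (illustrative only): drop any candidate hint
# whose normalized text already appears in the transcript. The real system
# additionally filters semantically equivalent hints via the memory LLM.
import re

def normalize(text):
    return re.sub(r"\s+", " ", text.strip().lower())

def filter_already_injected(candidates, transcript_text):
    seen = normalize(transcript_text)
    fresh = []
    for hint in candidates:
        if normalize(hint) not in seen:
            fresh.append(hint)
    return fresh

hints = ["Auth token expiry is 900s", "Never mock the database in tests"]
transcript = "system-reminder: auth token expiry is 900s"
print(filter_already_injected(hints, transcript))
# prints ['Never mock the database in tests']
```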

&lt;p&gt;&lt;strong&gt;Second core concept: map-reduce over memory chunks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory retrieval runs on every tool call — not occasionally, but every single time the agent does anything. That makes latency and cost non-negotiable. A single LLM call over all notes fails on three dimensions simultaneously: context window limits (1,000 notes × ~400 tokens = ~400k tokens), attention degradation in long contexts ("Lost in the Middle", Liu et al. 2023), and cost at scale.&lt;/p&gt;

&lt;p&gt;The solution — split notes into fixed-size chunks and apply a map-reduce pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Notes split into chunks (~50 notes each)
         │
    ┌────┴────┐
  chunk-1  chunk-2  ...  chunk-N     ← map: one parallel Haiku call per chunk
    │         │               │
  hints     hints           hints
    └────┬────┘
       reduce: merge + deduplicate against hints already in context
         │
    inject only new hints as &amp;lt;system-reminder&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map (scatter)&lt;/strong&gt;: one Haiku call per chunk fires in parallel. Each chunk is small enough for accurate attention. Calls are parallel, so time doesn't grow with memory size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce (gather)&lt;/strong&gt;: merges all candidates, filters against hints already in the transcript, returns only what's new.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose Haiku specifically for this role: it's fast, cheap, and performs well on focused tasks with small contexts — exactly what each chunk call is. You don't need a frontier model to decide whether "auth token expiry is 900s" is relevant to what the agent is currently doing.&lt;/p&gt;

&lt;p&gt;Two compounding benefits make this fast and cheap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parallelism&lt;/strong&gt; — 20 chunks take roughly the same wall-clock time as 1 chunk, because all map calls fire simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; — each chunk's notes live in the system prompt, which is stable and never changes once the chunk is sealed (full). Anthropic's prompt caching means repeated retrievals against the same chunk are served from cache — dramatically lower cost and faster responses on every subsequent tool call.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
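&lt;p&gt;A minimal scatter-gather sketch, with a stub standing in for the per-chunk Haiku call (&lt;code&gt;query_chunk&lt;/code&gt; and the note schema are assumptions, not the real implementation):&lt;/p&gt;

```python
# Map-reduce over memory chunks, sketched with a thread pool. In the real
# system each map call is a parallel Haiku request whose system prompt is
# the (prompt-cached) chunk of notes; here query_chunk is a keyword stub.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 50

def chunked(notes, size=CHUNK_SIZE):
    return [notes[i:i + size] for i in range(0, len(notes), size)]

def query_chunk(chunk, context):
    # Stub for the map step: the real call asks Haiku which notes
    # are relevant to the current context.
    return [n for n in chunk if any(w in context for w in n["when"].split())]

def retrieve(notes, context, already_injected):
    chunks = chunked(notes)
    with ThreadPoolExecutor() as pool:  # map: all chunk calls in parallel
        results = pool.map(lambda c: query_chunk(c, context), chunks)
    merged = [n for hits in results for n in hits]  # reduce: merge
    # reduce: drop hints already present in the conversation
    return [n for n in merged if n["text"] not in already_injected]
```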

&lt;p&gt;&lt;strong&gt;Hooks make it fully automatic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire retrieval pipeline runs via Claude Code's PreToolUse/PostToolUse hooks — the agent doesn't call memory explicitly at all.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PreToolUse hook&lt;/strong&gt;: fires before every tool call. Reads the conversation transcript to understand current context, runs scatter-gather retrieval, and injects relevant hints as a &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; block. The agent sees the hints without doing anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostToolUse hook&lt;/strong&gt;: fires after every tool call. Sends a static nudge (~15 tokens, no LLM call) asking the agent whether anything noteworthy just happened. If yes, the agent writes a note. No forced writes — the agent decides.&lt;/li&gt;
&lt;/ul&gt;
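&lt;p&gt;For orientation, hook registration in Claude Code happens through its settings JSON; the sketch below generates a plausible shape for it. The &lt;code&gt;cmr-memory hook pre/post&lt;/code&gt; subcommands are hypothetical placeholders, not documented CLI flags.&lt;/p&gt;

```python
# Illustrative sketch of what "registering the hooks" could produce in
# Claude Code's settings JSON. The hook subcommand names are assumptions.
import json

settings = {
    "hooks": {
        "PreToolUse": [
            {"matcher": "", "hooks": [
                {"type": "command", "command": "cmr-memory hook pre"}]}
        ],
        "PostToolUse": [
            {"matcher": "", "hooks": [
                {"type": "command", "command": "cmr-memory hook post"}]}
        ],
    }
}
print(json.dumps(settings, indent=2))
```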

&lt;p&gt;The transcript is the key input: it gives the memory LLM a full picture of what the agent is working on right now, which is what makes context-activated retrieval possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation: a single CLI package&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything ships as one npm package. Running &lt;code&gt;init&lt;/code&gt; does three things automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Registers the hooks&lt;/strong&gt; — wires PreToolUse and PostToolUse into Claude Code's config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injects memory rules&lt;/strong&gt; — adds instructions to CLAUDE.md so the agent knows how and when to write notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configures the memory LLM&lt;/strong&gt; — sets up Haiku as the retrieval model with your Anthropic API key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two commands to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @agynio/cmr-memory
cmr-memory init &lt;span class="nt"&gt;--api-key&lt;/span&gt; sk-ant-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now Claude can use memory.&lt;/p&gt;




&lt;p&gt;We built this as part of our open-source research into multi-agent engineering systems at &lt;a href="https://agyn.io" rel="noopener noreferrer"&gt;agyn.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Coding Agent Teams Outperform Solo Agents: 72.2% on SWE-bench Verified</title>
      <dc:creator>Nikita Benkovich</dc:creator>
      <pubDate>Mon, 02 Mar 2026 12:10:49 +0000</pubDate>
      <link>https://dev.to/nikita_benkovich_eb86e54d/coding-agent-teams-outperform-solo-agents-722-on-swe-bench-verified-4of5</link>
      <guid>https://dev.to/nikita_benkovich_eb86e54d/coding-agent-teams-outperform-solo-agents-722-on-swe-bench-verified-4of5</guid>
      <description>&lt;p&gt;Most AI coding agents work alone. You give them an issue, they figure it out, they hand you a fix. It's the AI equivalent of a lone wolf developer — capable, but not how real software teams actually operate.&lt;/p&gt;

&lt;p&gt;A team of researchers at &lt;a href="https://agyn.io" rel="noopener noreferrer"&gt;Agyn&lt;/a&gt; asked a different question: &lt;strong&gt;what if instead of a single agent, you used a coding agent team — with real roles, real review loops, and real coordination?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The results are hard to ignore.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: Stop Treating Issue Resolution as a Solo Task
&lt;/h2&gt;

&lt;p&gt;Real software development involves coordination. A problem lands, someone researches it, someone else implements a fix, a reviewer pushes back, things iterate. The system that emerges from that process is more robust than anything one person (or one agent) would ship alone.&lt;/p&gt;

&lt;p&gt;The Agyn system — described in a &lt;a href="https://arxiv.org/abs/2602.01465" rel="noopener noreferrer"&gt;paper published on arXiv&lt;/a&gt; — encodes this directly. Rather than routing a GitHub issue through a single agent with a big context window, it spins up a team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manager&lt;/strong&gt; — coordinates execution, communication, and knows when to stop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — explores the repository, gathers context, writes the specification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineer&lt;/strong&gt; — implements the fix, debugs failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewer&lt;/strong&gt; — evaluates the PR and enforces acceptance criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has a clearly scoped role, runs in its own isolated sandbox, and communicates through standard GitHub artifacts — commits, PR descriptions, and review comments. Just like a real team would.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Coding Agent Teams Work Better Than Solo Agents
&lt;/h2&gt;

&lt;p&gt;A few design decisions make this more than just "more agents":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolated execution environments.&lt;/strong&gt; Each agent gets its own sandbox with shell access. No shared filesystem. Agents can install dependencies, run tests, and configure their environment without stepping on each other. Failures are easy to attribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit role enforcement.&lt;/strong&gt; Every role specifies which model to use, what reasoning level, what tools, and what responsibilities. This prevents the "do everything" trap where a single agent accumulates too much context and starts hallucinating. It also means you can allocate expensive, high-reasoning models only where they're needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured communication, not a fixed pipeline.&lt;/strong&gt; The Manager dynamically coordinates execution rather than following a script. If the Reviewer rejects the PR, the Engineer iterates. The system adapts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context management for long tasks.&lt;/strong&gt; Large artifacts are persisted to the filesystem rather than stuffed into the model context. Accumulated context is summarized automatically. This is how you run a system end-to-end on complex issues without it falling apart.&lt;/p&gt;
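&lt;p&gt;To make the role-enforcement idea concrete, here is a purely illustrative role table. This is not Agyn's actual configuration format; the model names and tool lists are assumptions drawn only from the roles described above.&lt;/p&gt;

```python
# Illustrative only: each role pins a model, a reasoning level, and a
# scoped toolset, so no agent accumulates "do everything" context.
ROLES = {
    "manager":    {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["assign", "message", "close"]},
    "researcher": {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["shell", "read_repo", "write_spec"]},
    "engineer":   {"model": "gpt-5-codex", "reasoning": "medium",
                   "tools": ["shell", "edit", "run_tests", "open_pr"]},
    "reviewer":   {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["read_pr", "comment", "approve", "reject"]},
}

def model_for(role):
    return ROLES[role]["model"]
```

&lt;p&gt;The point of the structure is visible even in a toy table: the expensive coding-specialized model goes only where code is written, and every other role runs a cheaper configuration.&lt;/p&gt;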

&lt;h2&gt;
  
  
  The Benchmark Results
&lt;/h2&gt;

&lt;p&gt;The team evaluated the system on SWE-bench Verified — a widely used benchmark where models must resolve real GitHub issues by modifying codebases and producing PRs that pass the project's test suite.&lt;/p&gt;

&lt;p&gt;The system resolved &lt;strong&gt;72.2% of tasks&lt;/strong&gt;, using GPT-5 and GPT-5-Codex at medium reasoning levels.&lt;/p&gt;

&lt;p&gt;Here's how that compares to other top systems at evaluation time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Model(s)&lt;/th&gt;
&lt;th&gt;Resolved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;agyn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-5 / GPT-5-Codex (medium reasoning)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenHands&lt;/td&gt;
&lt;td&gt;GPT-5 (high reasoning)&lt;/td&gt;
&lt;td&gt;71.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mini-SWE-agent&lt;/td&gt;
&lt;td&gt;GPT-5.2 (high reasoning)&lt;/td&gt;
&lt;td&gt;71.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mini-SWE-agent&lt;/td&gt;
&lt;td&gt;GPT-5 (medium reasoning)&lt;/td&gt;
&lt;td&gt;65.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key detail: &lt;strong&gt;this system wasn't tuned for the benchmark&lt;/strong&gt;. The same prompts, role definitions, tools, and execution model used in production were applied directly. It outperformed competitors using higher-reasoning model variants — without needing them.&lt;/p&gt;

&lt;p&gt;The 7.2-point gain over the single-agent baseline using the same model class (72.2% vs. 65.0% for mini-SWE-agent with GPT-5 at medium reasoning) comes purely from the team structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Agent Design
&lt;/h2&gt;

&lt;p&gt;The paper makes an argument that's easy to overlook in the current race to improve models: &lt;strong&gt;organizational design matters as much as model quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We've spent a lot of energy making individual models smarter. But real-world software development scaled because of &lt;em&gt;how teams work&lt;/em&gt; — division of labor, code review, shared artifacts, iteration. Replicating that structure in an agent system produces measurable gains without touching the underlying model.&lt;/p&gt;

&lt;p&gt;The results suggest a few things worth taking seriously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Role separation reduces errors.&lt;/strong&gt; When each agent has a narrow job, there's less opportunity for confusion and accumulated mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review loops improve output quality.&lt;/strong&gt; Having a dedicated Reviewer that can send work back to the Engineer catches problems before they become permanent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You don't always need the biggest model.&lt;/strong&gt; Allocating medium-reasoning models across a well-structured team can beat a single high-reasoning agent doing everything.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The Agyn platform is open source on GitHub: &lt;a href="https://github.com/agynio/platform" rel="noopener noreferrer"&gt;https://github.com/agynio/platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We believe the future is not a single general-purpose “super agent,” but &lt;strong&gt;teams of specialized agents&lt;/strong&gt;, organized the way real organizations operate. Different roles. Different responsibilities. Clear coordination. Explicit review. Shared context.&lt;/p&gt;

&lt;p&gt;And we’re building toward that vision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coming Next
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Flexible, Modular Agent Organizations
&lt;/h4&gt;

&lt;p&gt;Instead of a fixed pipeline, you’ll be able to compose agent teams like building blocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define custom roles&lt;/li&gt;
&lt;li&gt;Assign different models per role&lt;/li&gt;
&lt;li&gt;Configure tools and permissions&lt;/li&gt;
&lt;li&gt;Isolate execution environments&lt;/li&gt;
&lt;li&gt;Design explicit coordination flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a monolith. An organization.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. New Agent Communication Paradigms
&lt;/h4&gt;

&lt;p&gt;Real teams do not operate in a single synchronous loop. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open threads
&lt;/li&gt;
&lt;li&gt;Leave structured comments
&lt;/li&gt;
&lt;li&gt;Request reviews
&lt;/li&gt;
&lt;li&gt;Resume work later
&lt;/li&gt;
&lt;li&gt;Escalate decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are introducing structured communication protocols between agents, including &lt;strong&gt;asynchronous collaboration&lt;/strong&gt;, so coordination can happen across time, not just across steps.&lt;/p&gt;

&lt;p&gt;The lone wolf agent had a good run. The team might take it from here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Paper: &lt;a href="https://arxiv.org/abs/2602.01465" rel="noopener noreferrer"&gt;Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering&lt;/a&gt; — Nikita Benkovich, Vitalii Valkov (2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Blog post: &lt;a href="https://agyn.io/blog/we-tested-ai-team-swe-bench-verified" rel="noopener noreferrer"&gt;We tested how an AI team improves issue resolution on SWE-bench Verified&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
