ORCHESTRATE

Our AI Had 64 Memories and Thought It Had Zero

We ran a 14-person AI team review yesterday. Nine of fourteen personas reported the same finding: the memory system is dead. Every persona recalled nothing. The institutional learning that was supposed to make this platform intelligent was gone.

We filed it as a critical systemic issue. Updated the risk register. Wrote a blog post about it. Prepared to engage the MCP framework developer.

Then the developer looked at the data and said: "There are 64 memories stored for that persona."

The Bug

The memory system has two operations: store and recall. Store worked perfectly. Every time an AI agent completed a ticket, it stored a LESSON entry attributed to its persona. The enforcement gate (a blocking check at the DONE phase) confirmed storage. 64 entries for a single persona across Sprint 8.

Recall failed every time. Not because the data was missing, but because the search was wrong.

The recall query looked like this:

```
query: "auth middleware JWT route extraction lessons learned"
```
The search implementation used AND logic: every word in the query must appear in the stored content. So the database was looking for entries containing ALL of these words: auth AND middleware AND JWT AND route AND extraction AND lessons AND learned.

No single memory entry contained all seven words. The search returned empty. The agent concluded: no memories exist.
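The failure mode is easy to reproduce. Here is a minimal sketch of an AND-over-keywords matcher; the names `recall` and `MEMORIES` are illustrative, not the framework's actual code:

```python
# Minimal sketch of AND-logic recall. `recall` and `MEMORIES` are
# illustrative names, not the real MCP memory implementation.

MEMORIES = [
    "LESSON: validate JWT claims before the auth middleware touches the request",
    "LESSON: route extraction broke on paths with trailing slashes",
]

def recall(query: str) -> list[str]:
    """Return entries containing EVERY word of the query (AND logic)."""
    words = query.lower().split()
    return [m for m in MEMORIES if all(w in m.lower() for w in words)]

# Natural-language query: all seven words must co-occur in one entry.
print(recall("auth middleware JWT route extraction lessons learned"))  # -> []

# Keyword query: matches.
print(recall("auth middleware"))  # -> one entry
```

No single entry contains all seven words, so the natural-language query returns nothing while the two-keyword query finds the entry: exactly the behavior the personas hit.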

Why Nine Personas Agreed

During the team review, each persona queried their memories using natural language:

  • "What did I learn about auth middleware and JWT implementation?"
  • "Recall lessons from route extraction and API testing"
  • "Search for Docker Compose deployment learnings"

Every query was 6-10 words. Every query returned empty. Every persona concluded their memory was dead.

The fix was two lines in the tool description:

```
query: Search query. Matching uses AND logic: every word must appear in
       the stored content. Use 1-3 keywords, not sentences.
```
After the fix, the same lookup, rewritten as a short keyword query, returned the correct entry immediately.

What This Actually Teaches

Tool descriptions are the API contract for AI agents. The parameter was documented as "Search query" with no guidance on how the search worked. Every AI agent naturally wrote queries the way humans write search queries: as phrases or questions. The implementation expected keywords. Nobody told the agents.
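The contrast is clearest side by side. A sketch, with the tool name and schema shape assumed (loosely MCP-style tool definitions, not the framework's actual code):

```python
# Hypothetical before/after tool definitions -- names and schema shape
# are assumptions, not the actual MCP framework's code.

before = {
    "name": "memory_recall",
    "parameters": {
        "query": {"type": "string", "description": "Search query"},
    },
}

after = {
    "name": "memory_recall",
    "parameters": {
        "query": {
            "type": "string",
            "description": (
                "Search query. Matching uses AND logic: every word must "
                "appear in the stored entry. Use 1-3 keywords, not "
                "sentences or questions."
            ),
        },
    },
}
```

The `before` description is technically accurate and behaviorally useless; the `after` description teaches the agent how to succeed.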

This is not a bug in the traditional sense. The code did exactly what it was designed to do. The specification was correct. The tests passed. The problem was in the space between the tool and the agent using it: the description that teaches behavior.

A team of 14 AI personas can unanimously agree on something false. Every persona ran the same flawed query pattern because they all share the same language model. The same natural-language instinct that makes LLMs useful (write queries as phrases) made them all fail the same way. Consensus among AI agents does not equal correctness when they share a common mode of failure.

The most expensive bugs look like missing features. We spent an entire review cycle documenting the "dead memory system" as a critical risk. We updated the risk register. We wrote a blog post. We prepared architectural remediation plans. All because the search tool did not explain itself in two sentences.

The cost of the bug: zero data lost. The cost of the misdiagnosis: half a sprint of planning work based on a false premise.

Test your recall, not just your storage. We had a memory enforcement gate that verified storage happened (ADR-037). It worked perfectly. But we never tested whether stored memories could be retrieved. We tested the lock but not the key.

The Pattern to Steal

If you are building tools for AI agents:

  1. Document the search semantics in the tool description. Not just "query: string" but "AND logic, use 1-3 keywords, not sentences."
  2. Test the round-trip. Store something, then retrieve it using the same interface agents will use. If your test writes a perfect query that an agent would never write, the test is lying.
  3. When multiple agents agree on a failure, check the shared assumption. The failure mode was not in 9 different systems. It was in one query pattern that all 9 inherited.
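Point 2 can be made concrete with a sketch. The helper names are assumptions, and the recall logic mirrors the AND semantics described earlier:

```python
# Round-trip sketch: store through the same interface agents use, then
# recall with a query an agent would actually write. Helper names are
# assumptions, not the real MCP API.

store: list[str] = []

def memory_store(entry: str) -> None:
    store.append(entry)

def memory_recall(query: str) -> list[str]:
    # AND semantics: every query word must appear in the entry.
    words = query.lower().split()
    return [e for e in store if all(w in e.lower() for w in words)]

memory_store("LESSON: JWT claims must be validated in auth middleware")

# Agent-realistic query: the round-trip FAILS under AND semantics.
print(bool(memory_recall("What did I learn about auth middleware and JWT?")))  # False

# Hand-tuned keyword query: passes -- and hides the bug from your tests.
print(bool(memory_recall("auth middleware")))  # True
```

A test suite that only ever issues the second kind of query will report the memory system healthy while every real agent sees it as dead.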

64 memories. Zero retrieved. Two lines of documentation fixed it. The most dangerous bugs are the ones that make you redesign the architecture instead of reading the error message.
