Qiushi · Originally published at claw-stack.com

Building a Persistent Memory System for AI Agents

The canonical advice for giving an AI agent memory is: use a vector database. Store embeddings, do similarity search, retrieve relevant chunks. This is good advice for retrieval-augmented generation systems where the query pattern is "find documents similar to this question." It's not necessarily the right answer for an agent that needs to remember what it did last Tuesday and what decisions it made about project X six weeks ago.

Here's how we built the Claw-Stack memory system, why it looks the way it does, and what we learned along the way.

The problem with stateless agents

Every Claude session starts fresh. The model has no memory of previous sessions unless you explicitly inject that context at the start. For a research assistant that you talk to once, this is fine. For an autonomous agent that runs every day, accumulates knowledge about your projects, and needs to maintain consistent behavior over weeks, it's a fundamental problem.

The naive solution is to dump everything into the system prompt. This works until you've accumulated a few hundred KB of context, at which point two things happen: you start hitting context limits, and the model's ability to use the early parts of a very long context degrades. The agent starts ignoring things you told it three months ago because they're too far from the current interaction.

We needed a memory system with two properties: it had to be selective (only inject what's relevant to the current session), and it had to be human-readable (we needed to be able to audit, edit, and correct what the agent believed).

The three-layer architecture

The memory system has three layers:

Layer 1: MEMORY.md — a compact index loaded at the start of every session. This is a structured Markdown file with sections for recent activity, active projects, key contacts, and infrastructure notes. It's intentionally kept short — the system enforces a byte-size cap — so it doesn't consume the context budget on sessions with large task descriptions.

Layer 2: Per-topic files — longer Markdown files in memory/ that go into depth on specific subjects. projects/claw-stack.md, contacts/key-people.md, infrastructure/servers.md. These aren't loaded automatically. The agent has a read_memory tool that fetches a specific file when it needs depth on a topic.

Layer 3: SQLite + QMD vector search — a SQLite database with FTS5 full-text search and a QMD (a vector embedding tool built on top of SQLite) index for semantic search. When the agent gets a query it can't answer from MEMORY.md and the per-topic files, it runs a vector search across all memory content to find relevant fragments.
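The layered lookup can be sketched in a few lines. This is a minimal illustration, not the Claw-Stack implementation: the `memory/` layout and the `read_memory` helper name come from the description above, while `load_session_briefing` and the path-escape check are assumptions for the example.

```python
from pathlib import Path

MEMORY_ROOT = Path("memory")  # hypothetical layout; adjust to your repo

def load_session_briefing() -> str:
    """Layer 1: the compact MEMORY.md index, injected at session start."""
    return (MEMORY_ROOT / "MEMORY.md").read_text(encoding="utf-8")

def read_memory(topic_path: str) -> str:
    """Layer 2: fetch a per-topic file on demand,
    e.g. read_memory('projects/claw-stack.md')."""
    target = (MEMORY_ROOT / topic_path).resolve()
    # Refuse paths that escape the memory directory.
    if MEMORY_ROOT.resolve() not in target.parents:
        raise ValueError(f"path escapes memory root: {topic_path}")
    return target.read_text(encoding="utf-8")
```

Layer 3 (full-text and vector search) only kicks in when these two cheap lookups fail, which keeps most sessions off the index entirely.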

Why not a vector database

The short answer: for our scale and access patterns, the operational overhead of a standalone vector database isn't worth it.

The main reasons we chose SQLite + FTS5 over a dedicated vector database:

  1. Transparency. A dedicated vector database is opaque: you can't easily inspect whether a retrieval was correct without tooling to query it. A Markdown file opens in any text editor, our SQLite database opens with any SQLite tool, and the schema is tables we wrote ourselves.

  2. Operational simplicity. The entire memory store is a single .db file plus a directory of Markdown files. No separate process to manage, no format migrations, no version compatibility issues between the database binary and your data.

  3. Sufficient for our scale. We have around 50,000 words of memory content across all files. SQLite FTS5 can do full-text search across that in milliseconds. The cases where vector similarity is meaningfully better than keyword search are real but rare enough that the operational overhead isn't worth it.
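To make the scale argument concrete, here is what keyword search over memory fragments looks like with FTS5. The table name and sample rows are invented for illustration; the point is that this is plain `sqlite3` from the standard library, with nothing extra to operate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# An FTS5 virtual table over memory fragments; schema is illustrative.
con.execute("CREATE VIRTUAL TABLE mem USING fts5(path, body)")
con.executemany(
    "INSERT INTO mem VALUES (?, ?)",
    [
        ("projects/example.md", "the project uses SQLite FTS5 for memory search"),
        ("contacts/key-people.md", "Alice maintains the deployment scripts"),
    ],
)
# bm25() ranks matches; lower scores are better.
rows = con.execute(
    "SELECT path FROM mem WHERE mem MATCH ? ORDER BY bm25(mem)",
    ("deployment",),
).fetchall()
print(rows)  # → [('contacts/key-people.md',)]
```

On ~50,000 words this kind of query returns in well under a millisecond, which is why the standalone vector database never earned its keep for us.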

QMD (the vector search layer) sits on top of SQLite. Embeddings are computed locally using a small quantized model and stored in a SQLite table alongside the text. Re-indexing takes a few seconds.
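QMD's internals aren't covered here, but the core idea of "embeddings in a SQLite table" can be sketched with a brute-force cosine search. Everything below is a toy: three-dimensional vectors stand in for real embeddings, and `pack`/`unpack`/`cosine` are hypothetical helpers, not QMD's API.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Store a float vector as a BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emb (path TEXT, vector BLOB)")
# Toy 3-dim embeddings; a real system would use a local embedding model.
con.execute("INSERT INTO emb VALUES (?, ?)", ("a.md", pack([1.0, 0.0, 0.0])))
con.execute("INSERT INTO emb VALUES (?, ?)", ("b.md", pack([0.0, 1.0, 0.0])))

query = [0.9, 0.1, 0.0]
best = max(
    con.execute("SELECT path, vector FROM emb"),
    key=lambda row: cosine(query, unpack(row[1])),
)[0]
print(best)  # → a.md
```

A linear scan like this is fine at our scale; an approximate-nearest-neighbor index only starts to matter orders of magnitude later.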

The organizer pipeline

Memory doesn't manage itself. After every session, an organizer pipeline runs:

```
raw session files
  → scan memory/*.md (MD5 hash check, skip unchanged)
  → extract facts per category (project updates, decisions, contacts)
  → deduplicate against existing memory
  → write updated per-topic files
  → rebuild SQLite FTS5 index
  → update MEMORY.md index
```
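The first step, skipping unchanged files via an MD5 check, might look like this. The state-file location and function name are assumptions for the sketch; the hash-and-compare logic is the part described above.

```python
import hashlib
import json
from pathlib import Path

STATE = Path("organizer-state.json")  # hypothetical location for stored hashes

def changed_files(memory_dir: Path) -> list[Path]:
    """Return memory/*.md files whose MD5 differs from the last run."""
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    out = []
    for path in sorted(memory_dir.glob("*.md")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:
            out.append(path)
            seen[str(path)] = digest
    STATE.write_text(json.dumps(seen))
    return out
```

Skipping unchanged files matters because the expensive step downstream is the LLM extraction; there's no reason to pay for it on files the last run already processed.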

The extraction step uses an LLM (Gemini as primary, with Claude Haiku as fallback): it reads a session transcript and produces structured notes in a specific format. The deduplication step is rule-based: if the new fact is a substring of an existing entry, skip it; if it contradicts an existing entry, flag it for human review.
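The substring rule from the deduplication step is simple enough to sketch directly. The function name and sample facts are invented for the example; contradiction detection is flagged for human review in the real pipeline and deliberately not modeled here.

```python
def dedupe(new_fact: str, existing: list[str]) -> str:
    """Rule-based dedup: if the new fact is a substring of an
    existing entry, skip it; otherwise keep it."""
    normalized = new_fact.strip().lower()
    for entry in existing:
        if normalized in entry.lower():
            return "skip"
    return "keep"

existing = ["The project deploys via GitHub Actions on push to main."]
print(dedupe("deploys via GitHub Actions", existing))  # → skip
print(dedupe("The staging server moved to a new host", existing))  # → keep
```

A substring check is crude, but it's predictable and auditable, which fits the human-readable design goal better than a fuzzy similarity threshold would.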

The pipeline runs on a cron schedule (every few hours during active work periods) rather than immediately after every session. This batches the processing cost and avoids writing memory files that will be immediately overwritten by a subsequent session.

The MEMORY.md bloat problem

The most painful lesson was about MEMORY.md growth.

We started with no limit on MEMORY.md length. The organizer kept adding to it. After six weeks, MEMORY.md was over 700 lines long. This had a predictable effect: session startup consumed most of the context budget before any actual task content was loaded, and the model was visibly struggling to synthesize a several-hundred-line brief while also doing useful work.

The fix was to change the organizer's behavior and enforce a size cap. Instead of appending new facts to MEMORY.md directly, the organizer writes them to per-topic files and updates MEMORY.md with pointers — one line that says "see projects/claw-stack.md for current status" rather than embedding the full status in MEMORY.md. The system now enforces a byte-size limit on MEMORY.md to prevent runaway growth.
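The cap itself is a one-line check; the interesting part is measuring bytes rather than lines, since line counts are easy to game with long lines. The specific cap value below is illustrative, not the one we use.

```python
MEMORY_MD_BYTE_CAP = 8 * 1024  # illustrative cap; the real limit is a tuning choice

def check_memory_index(text: str) -> None:
    """Reject a MEMORY.md rewrite that would blow the byte budget.
    The organizer is expected to replace inline detail with pointer
    lines like 'see projects/claw-stack.md for current status'."""
    size = len(text.encode("utf-8"))
    if size > MEMORY_MD_BYTE_CAP:
        raise ValueError(
            f"MEMORY.md would be {size} bytes (cap {MEMORY_MD_BYTE_CAP}); "
            "move detail into per-topic files and leave a pointer"
        )
```

Failing loudly at write time, instead of silently truncating, forces the organizer to make a real editorial decision about what stays in the briefing.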

This required us to rethink what MEMORY.md is for. It's not a summary of everything the agent knows. It's a session briefing — the minimum context needed to orient the agent at the start of a session. Anything beyond that is fetched on demand.

After the refactor, session startup is noticeably faster, and the model makes better use of the context it has. Keeping MEMORY.md truly compact is an ongoing discipline — we found that a strict line count is less useful than a byte-size limit, and even that requires the organizer to be aggressive about using pointers rather than inline content.

Memory as human-readable state

The design philosophy behind the system is that agent memory should be human-readable and human-editable. This is a constraint we imposed deliberately.

When the agent develops incorrect beliefs — and it does, occasionally — we can find the wrong entry in a Markdown file, edit it, and the fix takes effect in the next session. With a vector database, correcting a wrong belief requires knowing which embedding to update, deleting it, writing a new one, and potentially invalidating cached retrievals. With Markdown files, you open the file and change the text.

This also makes auditing straightforward. Before trusting an autonomous agent to make decisions on your behalf, you need to be able to read its beliefs and verify they're correct. The entire memory system is a directory of Markdown files. Any text editor works.

The tradeoff is that the format is fixed. Our memory files follow a specific schema that the organizer knows how to parse and update. If you want to add a new category of memory, you need to update both the file schema and the organizer. For a research project with one operator, that's acceptable. For a production system with many agents and many types of memory, you'd want something more flexible.

What we'd do differently

If we were starting over:

Use a smaller MEMORY.md from day one. We wasted weeks cleaning up the bloat that could have been avoided with an initial size cap. A byte-size limit with pointer-based entries is a better target than a fixed line count for a daily-use assistant.

Separate episodic from semantic memory earlier. "What happened in Tuesday's session" (episodic) and "what is the Claw-Stack architecture" (semantic) are different types of memory that benefit from different retrieval strategies. We mixed them initially and spent time later separating them.

Build the audit tooling first. The hardest part of maintaining an agent memory system isn't the indexing or retrieval — it's knowing when the memory is wrong. We built the audit view (a script that shows you what the agent believes about a given topic) too late. It should have been the first tool we wrote.
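An audit view over the FTS5 layer can be as small as this. The `mem(path, body)` schema is the same illustrative one used earlier in this post, not our actual schema; `snippet()` is a built-in FTS5 auxiliary function that returns the matching fragment with the hit highlighted.

```python
import sqlite3

def audit(con: sqlite3.Connection, topic: str) -> list[str]:
    """Show what the agent 'believes' about a topic, with the source
    file as provenance, so a human can spot wrong entries.
    Assumes an FTS5 table mem(path, body); adapt to your schema."""
    rows = con.execute(
        "SELECT path, snippet(mem, 1, '[', ']', '…', 12) "
        "FROM mem WHERE mem MATCH ? ORDER BY bm25(mem)",
        (topic,),
    ).fetchall()
    return [f"{path}: {fragment}" for path, fragment in rows]
```

Because each result carries the file path, fixing a wrong belief is one step away: open that Markdown file and edit the text.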

The memory system is one of the parts of Claw-Stack we're most satisfied with. It's boring infrastructure that works reliably, which is exactly what memory should be.

This article was originally published on claw-stack.com. We're building an open-source AI agent runtime — check out the docs or GitHub.
