When does flat-file memory beat a vector DB for your agent?

#ai #llm #opensource #productivity

Most "give my agent long-term memory" tutorials jump straight to the same recipe: embed everything, dump it in a vector DB, retrieve top-k by similarity at runtime. For retrieval over a large corpus (the user's documents, a codebase, past support tickets), that's exactly right.

But a lot of what an agent needs to remember isn't a corpus. It's a small set of always-true facts: who the user is, their stack, their hard preferences, the decisions behind the project you're working on. For that slice, RAG is the wrong tool — and I think people reach for it out of habit.

Here's the decision rule I've landed on:

If the context should be loaded 100% of the time, it does not belong in a retrieval layer. Retrieval is for things you load sometimes. Anything you'd want in every prompt should just be in the prompt.

Why a vector store loses on the always-true slice

1. Similarity retrieval is lossy for must-have context. "Always load the user's stack and constraints" is a guarantee. Top-k cosine similarity is a probability. If the user's question doesn't land close to the stored preference in embedding space, the preference misses the top k, and the agent "forgets" something it was supposed to always know. A guarantee implemented as a probability is a bug.

2. You can't read or hand-edit an embedding. When the agent remembers something wrong, you want to open a file and fix the line. With a vector store you're re-embedding and praying.

3. Operational weight. A DB to host, an embedding model in the loop, a retrieval step that can fail — for a few KB of facts that never change shape. That's a lot of moving parts to remember "prefers Go, hates verbose output."
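To make the first point concrete, here is a toy sketch of top-k retrieval skipping a must-have fact. The 3-d vectors are hand-made stand-ins for real embeddings, and the stored notes, the query, and the helper names (`cosine`, `top_k`) are all made up for illustration; a real embedding model would give different numbers, but the failure has the same shape.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hand-made vectors (assumption). Axes, loosely: billing, auth, user preference.
PREFERENCE = "prefers Go, hates verbose output"
STORE = [
    ("note about invoice exports", [0.9, 0.1, 0.0]),
    ("note about refund approvals", [0.8, 0.2, 0.1]),
    ("note about login tokens", [0.1, 0.9, 0.0]),
    (PREFERENCE, [0.0, 0.1, 0.9]),
]

# A billing-flavoured request: it sits nowhere near the preference vector.
query = [0.9, 0.2, 0.0]
retrieved = top_k(query, STORE, k=2)

# The preference applies to every answer, yet it never makes the top 2.
preference_was_loaded = PREFERENCE in retrieved
```

Raising `k` until the preference shows up only moves the problem: you end up loading most of the store on every call to be sure of one line.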

The boring option that scales

For the always-true slice I went with structured markdown files the agent loads at startup, with one strict rule that makes it scale:

An index file is the only thing always in context. It's hard-capped (~200 lines) and holds one-line pointers to topic files, plus the handful of facts that apply to every session. Detail lives in topic files the agent opens only when relevant.
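A minimal sketch of how an agent harness might resolve those pointers, assuming a made-up pointer syntax (`- topic -> file.md`) and a made-up file name (`INDEX.md`); neither is taken from any particular tool, so adapt both to whatever your agent reads at startup.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pointer syntax for index lines: "- topic-name -> relative/path.md"
POINTER = re.compile(r"^- (?P<topic>[\w-]+) -> (?P<path>\S+)$")

def load_index(index_path: Path) -> tuple[str, dict[str, Path]]:
    """Return the always-on text plus a topic -> file map parsed from it."""
    text = index_path.read_text(encoding="utf-8")
    topics = {}
    for line in text.splitlines():
        match = POINTER.match(line.strip())
        if match:
            topics[match["topic"]] = index_path.parent / match["path"]
    return text, topics

def load_topic(topics: dict[str, Path], name: str) -> str:
    """Open a topic file only when the current task calls for it."""
    return topics[name].read_text(encoding="utf-8")

# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "testing.md").write_text("Never mock the DB in tests.\n", encoding="utf-8")
    (root / "INDEX.md").write_text(
        "Prefers Go, hates verbose output.\n"
        "- testing -> testing.md\n",
        encoding="utf-8",
    )
    always_on, topics = load_index(root / "INDEX.md")
    testing_detail = load_topic(topics, "testing")
```

Only `always_on` goes into every prompt. `testing_detail` is read when the agent is about to touch tests, which is how the detail stays off the always-on budget.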

That cap is the whole trick. It prevents the two failure modes that kill naive "just stuff it in the system prompt" memory:

1. Index bloat → silent truncation. If you let the always-loaded file grow unbounded, it eventually exceeds the context budget and gets cut, and you lose memory with no error. Capping the index and pushing detail into linked files keeps the always-on footprint tiny.

2. Storing derivable facts. Don't memorize what the agent can reconstruct: code conventions (it reads the code), file layout (it greps), git history (git log is authoritative). Memory is for the things that aren't written down anywhere the agent can look: "this rewrite is driven by a compliance deadline, not tech debt," "never mock the DB in tests," "prefers Go."
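One way to turn silent truncation into a loud failure is to check the index before it ever reaches the model, for example at agent startup or in CI. The sketch below assumes the ~200-line cap from above; the character cap and the function names are my own additions, so pick numbers that match your actual context budget.

```python
def check_index(text: str, max_lines: int = 200, max_chars: int = 20_000) -> list[str]:
    """Return human-readable problems; an empty list means the index fits."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"index has {line_count} lines, cap is {max_lines}")
    if len(text) > max_chars:
        problems.append(f"index has {len(text)} chars, cap is {max_chars}")
    return problems

def require_index_fits(text: str, **caps: int) -> str:
    """Raise instead of letting an oversized index get cut silently downstream."""
    problems = check_index(text, **caps)
    if problems:
        raise ValueError("; ".join(problems))
    return text

small = "Prefers Go, hates verbose output.\n" * 3
bloated = "one more remembered fact\n" * 250

small_problems = check_index(small)      # fits: no problems reported
bloated_problems = check_index(bloated)  # 250 lines: over the line cap
checked = require_index_fits(small)      # returns the text unchanged
```

A failing check is the cue to move detail out of the index and into a topic file, not to raise the cap.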

Use both layers

This isn't RAG-versus-files. It's two layers answering two different questions:

RAG → "what's relevant to this query?" (the corpus, which grows)
Flat files → "what's true every time?" (the always-on context, which is bounded)

The split is deliberate so the unbounded part never touches the always-on budget. A person's stack and preferences don't grow much; their document pile does. Keep them in different layers.
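A rough sketch of what that split can look like at prompt-assembly time. It counts characters as a stand-in for tokens and assumes the retriever hands back chunks ranked best-first; `build_context` and its parameters are illustrative names, not part of any library.

```python
def build_context(always_on: str, retrieved: list[str], total_budget: int) -> str:
    """Always-on text goes in whole; retrieved chunks fill whatever budget remains."""
    if len(always_on) > total_budget:
        raise ValueError("always-on layer alone exceeds the budget; trim the index")
    remaining = total_budget - len(always_on)
    kept = []
    for chunk in retrieved:  # assumed to be ranked best-first already
        cost = len(chunk) + 1  # +1 for the newline that joins it to the rest
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost
    return "\n".join([always_on, *kept])

# The bounded layer is never dropped; the unbounded layer is what gets trimmed.
context = build_context(
    always_on="Prefers Go.",
    retrieved=["chunk A", "chunk B is longer"],
    total_budget=25,
)
```

The asymmetry is the design choice: when space runs out, the retrieved layer loses chunks, while an oversized always-on layer is treated as an error to fix at the source.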

The implementation

I packaged the convention as zero-dependency markdown templates plus a setup script that scaffolds the folder structure, so you don't have to hand-roll it:

github.com/LuciferForge/claude-code-memory (MIT, free)

It defaults to Claude Code's paths because that's what I built it for, but the convention is agent-agnostic — anything with a startup-context file can use it.

Genuinely want to hear how others draw this line. Where's your cutoff between "goes in retrieval" and "goes in the always-loaded prompt"? And has anyone hit the silent-truncation failure mode on their always-on context — how did you catch it?
