Vitalii Cherepanov

RAG isn't memory. It's Ctrl+F with embeddings.

Part 1 of 3 — "Memory for AI agents"
Deconstructing the long-term memory myth in LLM systems



It's 3 AM. I'm on my third night debugging an AI agent. I'm standing in the kitchen with a mug of tea, staring at a diff, swearing quietly. The agent has confidently rewritten the auth function — based on a chunk that belongs to a branch that was deleted from the repo two months ago.

The chunk lives in Qdrant. Its cosine similarity to my query is high. Top-1 in the retrieval. The agent honestly grabbed it, honestly stitched it into the prompt, honestly generated the "correct" patch. Against code from a different reality.

I close the laptop and think: okay, I have RAG. I have vectors. I have long-term memory. I have everything every AI conference deck has been promising for the last two years. Why did my agent just propose a fix based on code that doesn't exist anymore?

Because my agent doesn't have memory. My agent has search results with cosine instead of BM25. And between those two sentences lies the entire difference between "AI you can trust in production" and "AI you have to babysit on every line."

This piece is about that difference. And about why we, as engineers, are the ones to blame for not seeing it anymore.


The devaluation of the word "memory"

Let's be honest. What is the typical "memory" of an AI agent in 2026?

```
text → split into 512-1024 token chunks
     → embedding (bge / text-embedding-3 / openai)
     → vector DB (Qdrant / pgvector / Chroma / Pinecone)
     → cosine similarity top-k
     → concatenate into prompt
```

This is not memory. This is search. It's old-school Lucene from 2003, repainted in neural colors. Cosine instead of TF-IDF. Embeddings instead of an inverted index. Same thing.
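
For concreteness, here is that same pipeline as a minimal Python sketch. The embed() stub stands in for whatever embedding model you actually call, and the in-memory list stands in for the vector DB; nothing here is specific to any library, because the shape is the point: chunk, embed, nearest-neighbor, concatenate.

```python
import numpy as np

corpus = ["...design docs, tickets, chat logs, whatever gets 'remembered'..."]

def embed(text: str) -> np.ndarray:
    # stand-in for bge / text-embedding-3: any function text -> dense vector
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def chunk(text: str, size: int = 512) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "write to memory": chunk -> embed -> store
store = [(c, embed(c)) for doc in corpus for c in chunk(doc)]

# "read from memory": embed the query, take top-k by cosine, paste into the prompt
def recall(query: str, k: int = 5) -> str:
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return "\n\n".join(text for text, _ in ranked[:k])
```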

If we just called it that — "vector search," "semantic retrieval" — I'd have no complaints. Call Lucene Lucene, no problem. But when it's sold under the banner "my AI has long-term memory" — sorry. My AI has déjà vu and amnesia at the same time.

This isn't a terminology gripe. It's a question of expectations. When an engineer hears "memory," they imagine a system that remembers: who said what, when, in what context, what was true then versus what's true now. When an engineer gets RAG, they get Ctrl+F. And instead of building honest architecture around that Ctrl+F — with honest constraints — they build a sandcastle and wonder why the agent confuses past with present.


Three holes you can drive a truck through

Three concrete failures. Each one I caught in production. Not theory.

Hole #1: A chunk doesn't know it's a chunk.

Take a perfectly normal declaration from a design doc:

"We moved to JWT because opaque sessions didn't scale to our traffic profile. The alternative was stateful sessions with a Redis cluster, but we ruled it out because of audit requirements from a customer — they don't allow session state outside their perimeter. JWT solves both, but adds invalidation complexity, which we mitigate with short TTLs and refresh tokens."

The chunker splits this into four 512-token pieces. On retrieval, a query comes in: "why did we pick JWT?" Top-3 returns three fragments of the same decision. With no causality. Without the alternative we ruled out. Without the trade-off we accepted.

A decision that was whole turns into three parallel "factoids." The model honestly stitches them into plausible text — and invents the missing connections. Because its job is to generate plausible text. And it will, without blinking.

This isn't a bug in the chunker. This is an architectural property of the entire approach. Any decision declaration you have gets ground into powder and reassembled with structural loss. Every single time.
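
You can watch the grinding happen with a toy splitter. The chunk size is shrunk to fourteen words so the effect fits on a screen, but the mechanics are the same at 512 tokens:

```python
decision = (
    "We moved to JWT because opaque sessions didn't scale to our traffic profile. "
    "The alternative was stateful sessions with a Redis cluster, but we ruled it out "
    "because of audit requirements from a customer. JWT solves both, but adds "
    "invalidation complexity, which we mitigate with short TTLs and refresh tokens."
)

def naive_chunks(text: str, size: int = 14) -> list[str]:
    # fixed-size splitter: blind to sentence boundaries, causality,
    # and the fact that this is one decision, not four unrelated facts
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

for i, piece in enumerate(naive_chunks(decision)):
    print(f"chunk {i}: {piece}")
# Each piece carries no pointer to its siblings: no "alternative we ruled out",
# no "trade-off we accepted". Retrieval returns whichever fragments score highest,
# and the model invents the connective tissue.
```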

Hole #2: There's no structure in memory. Only cosine.

When a human explains a project to you, they say:

  • here's the goal
  • here are the options we considered
  • here's what we picked and why
  • here's what broke two months later
  • here's what we changed, and that decision now supersedes the old one

In RAG, none of this exists. Zero. RAG doesn't distinguish "hypothesis," "confirmed fact," "rejected alternative," "deprecated decision moved to archive." For RAG, all of these are equivalent points in a 384-dimensional space.

Imagine you're trying to record thirty years of life into a single flat table entries(text, vector) and then search it by cosine. Surprised your memories blur together? That's not your memory failing. That's the structure you crammed it into — a structure that doesn't allow distinctions between "I thought about it" and "I did it," between "I tried it and it worked" and "I tried it and it hurt."

In RAG, there are no fields for these distinctions. Not because the developers didn't think of it. Because the vector-plus-distance paradigm itself doesn't accommodate causality and time. It's a mathematical limitation. You don't fix it with product features.
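
To make "no place for these distinctions" concrete, here is the schema vanilla RAG actually stores next to the minimum a memory entry would need. The field names are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    HYPOTHESIS = "hypothesis"      # "I thought about it"
    CONFIRMED = "confirmed"        # "I tried it and it worked"
    REJECTED = "rejected"          # considered and ruled out
    DEPRECATED = "deprecated"      # was true, superseded since

@dataclass
class FlatEntry:
    # what vanilla RAG stores: equivalent points in a 384-dimensional space
    text: str
    vector: list[float]

@dataclass
class MemoryEntry:
    # the minimum structure needed to tell "decision" from "rejected alternative"
    text: str
    vector: list[float]
    status: Status
    supports: list[str] = field(default_factory=list)  # ids of facts this one justifies
    superseded_by: str | None = None                    # id of the fact that replaced it
```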

Hole #3: Time doesn't exist as a first-class concept.

Three weeks ago I wrote into the agent's memory: "we use Postgres." Today I wrote: "we migrated to ClickHouse for analytics, Postgres is OLTP only now." In RAG, both facts sit there. Both have high cosine to a database query. Top-k returns both. The model picks the one that "sounds" better in its pretraining — usually Postgres, because it appears more often in the training data.

This is not memory. This is a roulette wheel disguised as confidence.

When was the last time you saw valid_from, valid_until, deprecated_by, replaced_by, superseded_by fields in a production RAG system? I never have. Because in standard RAG, they're not in the schema. And again — not because devs are lazy. Because the schema "text plus embedding" has no place for the lifecycle of knowledge. No notion of "this is true now" versus "this was true then." Everything collapses into a single time slice — a present that somehow contains yesterday, last year, and deprecated-three-quarters-ago all at once.
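
A sketch of what first-class time looks like at retrieval. The field names and dates are hypothetical; the filter is the point: a superseded fact never even reaches the similarity ranking.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Fact:
    text: str
    valid_from: datetime
    valid_until: datetime | None = None   # None = still considered true
    superseded_by: str | None = None      # id of the fact that replaced this one

now = datetime.now(timezone.utc)
facts = [
    Fact("we use Postgres for everything",
         valid_from=now - timedelta(days=90),
         valid_until=now - timedelta(days=21),   # superseded three weeks ago
         superseded_by="fact-042"),
    Fact("ClickHouse for analytics, Postgres is OLTP only",
         valid_from=now - timedelta(days=21)),
]

def currently_valid(fact: Fact, at: datetime) -> bool:
    return fact.valid_from <= at and (fact.valid_until is None or at < fact.valid_until)

candidates = [f for f in facts if currently_valid(f, now)]
# Only the ClickHouse fact survives; the stale one never competes on cosine at all.
```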

Ctrl+F with embeddings doesn't remember. It finds. Different verbs.


"But memory frameworks fix this, right?"

Okay, the believer says. There's mem0, Letta, Zep, Cognee, MemGPT, the whole long-term memory zoo. They added a meaning layer on top of RAG. They're memory-aware.

Let's be honest. I've used them. One after another. For a long time. Looked under the hood, not just at the landing pages.

Each of them takes one piece of real memory — for some it's LLM-extraction before write, for some it's a buffer hierarchy like an OS, for some it's post-hoc graph extraction from dialogues, for some it's per-fact temporal validity — and implements that one piece, without weaving it into the rest.

This is warmer than vanilla Qdrant. It's not a solution.

Because real memory requires seven properties working together. Each of them, in isolation, already exists in the literature or in open source. As far as I can tell, no one has assembled all seven into a single system. Which seven, exactly — that's part 2 of this series. Here, I'll name only the one limitation shared by every flat-fact solution, no matter how it dresses itself up:

None of them have the right to say "I don't know."

Show me any one of these systems with a formal abstain mechanism: a gate through which a fact will not pass into prompt context if it has no source, no confidence, no temporal validity, or an unresolved contradiction. I'll wait.

In the standard flow of all these frameworks, the system's response to "there's a contradiction in memory or not enough data" is "well, the model will figure it out." Which translates from marketing to engineering as "the model will hallucinate, and that becomes your problem in production."
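
The gate itself isn't complicated; the hard part is making it a non-negotiable step of the flow. A minimal sketch, with made-up field names and a placeholder threshold:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float                  # retrieval similarity
    confidence: float | None      # how much the system trusts this fact
    source: str | None            # where the fact came from
    temporally_valid: bool        # passed the valid_from / valid_until check
    contradicted: bool            # an unresolved conflicting fact exists

def gate(candidates: list[Candidate], min_conf: float = 0.7) -> list[Candidate] | None:
    passed = [
        c for c in candidates
        if c.source is not None
        and c.confidence is not None and c.confidence >= min_conf
        and c.temporally_valid
        and not c.contradicted
    ]
    # None means abstain: the answer is "I don't know", not
    # "let the model improvise from whatever scored highest"
    return passed or None
```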

Good memory isn't "remembering a lot." It's knowing the boundary of what you don't remember. Part 2 of this series is built around that thesis.


"Why not just push context to 1M tokens?"

This is the second fad of the last two years, and it deserves its own breakdown, because it leads the industry into the same dead end under a different banner. "Why do we need memory if Gemini has 2M context and Claude has 1M?"

Four problems, no preamble.

One — economics. A single project conversation at 800K tokens with prompt caching off costs tens of dollars per request. Without aggressive caching, you're broke in a week. With aggressive caching, you're building exactly the same hierarchy as Letta — just more expensive and locked to one vendor.

Two — recall. Every long-context benchmark (needle-in-a-haystack, RULER, LongMemEval) shows the same thing: models drown in their own context past 200-300K tokens. Attention is unevenly distributed. This is lost-in-the-middle, and it doesn't get fixed by window size — it gets partially mitigated by architectural tricks inside the model, but it doesn't go away. The more you stuff in, the less of it actually gets considered.

Three — persistence. Context isn't saved. Close the session, gone. Tomorrow the same agent shows up with a clean context. So you have to feed it 800K tokens of "history" again. The problem isn't solved — it's hidden inside your wallet and your latency.

Four — learning. If the agent made a mistake yesterday and you corrected it, that experience isn't structured for the future. Tomorrow it'll repeat the mistake. Context is RAM, not disk. And when someone says "just increase context instead of building memory" — that's the same as saying "why do I need a database, I have a terabyte of RAM." Technically the words rhyme. In practice they're incomparable concepts.

Big context doesn't replace memory. It lets you stuff more into one session — and that's it.


What to do about it tomorrow morning

If you've read this far and you're thinking "okay, agreed, RAG is search, not memory. Now what?" — I have two pieces of news.

The bad: a systemically correct solution requires rewriting the memory layer from schema up through lifecycle, and that's months of work. Not a weekend.

The good: there are several things you can do tomorrow morning that already remove half the pain. Not magic — just engineering hygiene.

  • Drop the word "memory" from your stack if what you have is RAG. Call it retrieval or search — instantly more honest. That alone removes 80% of inflated expectations from users and the team.
  • Introduce valid_from and valid_until for every fact. Any fact without temporal validity is a hypothesis, not a fact. Old facts should drop out of retrieval automatically, not compete with new ones on cosine.
  • Distinguish staging, working, consolidated, archived. Don't dump everything into one collection. A fact that just arrived and a piece of knowledge confirmed by tests are different entities with different weight in retrieval.
  • Make abstain a first-class outcome. If no fact passed the confidence threshold during retrieve, the system must have the right to say "I don't know, I need data." And that "I don't know" should become a task in the backlog, not a dead end for the user.
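
Here's a rough sketch of how the last three items fit together in one ranking step. The stage weights and threshold are placeholders, not recommendations; tune them against your own data:

```python
from enum import Enum

class Stage(Enum):
    STAGING = "staging"            # just arrived, unverified
    WORKING = "working"            # in active use
    CONSOLIDATED = "consolidated"  # confirmed by tests or repeated use
    ARCHIVED = "archived"          # kept for history, excluded from retrieval

# illustrative weights: how much a fact's stage scales its retrieval score
STAGE_WEIGHT = {
    Stage.STAGING: 0.3,
    Stage.WORKING: 0.8,
    Stage.CONSOLIDATED: 1.0,
    Stage.ARCHIVED: 0.0,
}

def rank(candidates, threshold: float = 0.6):
    """candidates: iterable of (text, similarity, stage, currently_valid) tuples."""
    scored = [
        (sim * STAGE_WEIGHT[stage], text)
        for text, sim, stage, valid in candidates
        if valid and STAGE_WEIGHT[stage] > 0.0
    ]
    scored = [item for item in scored if item[0] >= threshold]
    # An empty result is an abstain: say "I don't know" and file a task to close the gap.
    return sorted(scored, reverse=True) or None
```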

This isn't a complete list — it's the minimum to start the transition from "I have RAG, I call it memory" to "I have memory, and it knows its boundaries." The full list of seven principles is in part 2.


Where this comes from

I live deep in this world — Claude Code, Cursor, Codex, Windsurf, MCP servers, mem0, Zep, local RAG stacks on Postgres + pgvector, Qdrant, Chroma. Over the last few months I've tried, I think, everything on the market. I have my own MCP memory server with about fifteen hundred entries, which I've rewritten from scratch three times because each time I hit one of the three holes above.

At some point, I got tired. Not of AI — of what we call memory in AI. I sat down and started writing my own cognitive runtime: one that doesn't pretend to know, that knows what it doesn't know, and that sets its own tasks to close the gaps. Called it braincore. One Go binary, local, MCP-stdio, Apache-2.0. Not a pitch; it's open source. Just evidence that when I say "this can be done," I'm not speaking theoretically.

Seven architectural principles it's built on — that's part 2 of this series. Drops in a week. I'll cover atomic knowledge units, lifecycle, strict mode, causal decision chains, AST-based identity for code, internal git as memory versioning, memory scoring, and negative memory.

And why all of that combined produces a qualitatively different result than any of those pieces in isolation.

Part 3 is philosophical — about the right of an AI agent to stay silent, and why the right metric for production AI isn't accuracy but zero confidently-wrong actions at an acceptable abstain rate. About self-tasking. About why cognitive runtime matters more than model size.


If you read this far and recognized yourself in the opening paragraph — we're in the same boat. If you have RAG that you call memory and it works — tell me how, seriously, I want to know, I might be wrong.

The one thing you can't do is stay silent.


Part 1 of 3. Next — "Seven principles of real memory for AI agents" — drops next Tuesday.
