Rayne Robinson

Your AI Forgot Everything You Told It Yesterday. Mine Didn't

Every MCP memory server I've seen is a JSON file with a save button.

That's not memory. That's a notepad.

The "persistent memory for AI" space is filling up with CRUD wrappers — store a string, retrieve a string, list all strings. Some add vector search. A few throw in Qdrant or Redis. Every one of them solves the wrong problem.

The problem isn't storage. The problem is that your AI doesn't know what to forget.

The Actual Pain

If you're using Claude Code, Cursor, Codex, or any MCP-compatible agent, you've felt this. You spend an hour debugging a Docker networking issue — localhost resolves to IPv6 but Docker only binds IPv4. You fix it. The session ends. Next session, the agent rediscovers the same issue from scratch.

The standard solutions don't help. Static project files bloat and go stale. Key-value stores remember everything equally — that Docker fix and what you ate for lunch have the same priority. Vector-only retrieval is better, but it can't tell you that a decision from January was superseded by one from February. It just returns both.

What none of these systems have is the thing that makes biological memory actually work: decay.

What Forgetting Gets You

Human memory isn't a database. It's a living system where salience determines survival. You remember your first dog's name but not last Tuesday's breakfast. That's not a bug — it's compression.

Flat storage fails predictably: it fills up, retrieval quality drops, and your AI drowns in stale context that's technically correct but practically useless. So I built a memory engine called Mnemos that works the way biological memory does.

Importance-based exponential decay. Every memory has a salience score that decays at a rate matched to its importance:

- High (architectural decisions, key patterns): ~2-year half-life
- Medium (gotchas, workflows): ~5-month half-life
- Low (session notes, progress logs): ~1-month half-life

A decision about your database schema stays sharp for years. A note about which terminal tab had the server running fades in weeks. No manual curation.
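Under the hood this is just half-life decay. Here's a minimal sketch using the tiers above — the function and field names are my assumptions, not Mnemos internals:

```python
# Hypothetical sketch of importance-tiered exponential decay.
# Half-lives mirror the tiers in the post; everything else is assumed.
HALF_LIFE_DAYS = {"high": 730, "medium": 150, "low": 30}

def decayed_salience(initial: float, importance: str, age_days: float) -> float:
    """Salience after age_days: initial * 0.5 ** (age / half_life)."""
    half_life = HALF_LIFE_DAYS[importance]
    return initial * 0.5 ** (age_days / half_life)

# A high-importance memory barely fades in a month...
print(round(decayed_salience(1.0, "high", 30), 3))  # ~0.972
# ...while a low-importance session note loses half its weight.
print(round(decayed_salience(1.0, "low", 30), 3))   # 0.5
```

The point of the tiering: one formula, three time constants, no per-memory tuning.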

Reinforcement on access. When a memory gets retrieved and used, its salience ticks back up. Memories that keep being useful survive. Memories that don't, fade. Natural selection for context.
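A reinforcement rule like that can be as simple as an asymptotic bump toward a cap. This is a sketch of the idea, not the actual Mnemos rule — the `boost` and `cap` values are assumptions:

```python
def reinforce(salience: float, boost: float = 0.1, cap: float = 1.0) -> float:
    """On retrieval, move salience a fraction of the way toward the cap.

    Frequently-used memories climb back up; diminishing returns near the cap
    keep any single memory from dominating forever.
    """
    return min(cap, salience + boost * (cap - salience))

s = 0.5
for _ in range(3):          # three retrievals in a row
    s = reinforce(s)
print(round(s, 3))          # 0.636
```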

Cognitive Sectors, Not Flat Labels

Most memory servers give you a flat list of facts. Mnemos classifies every memory across five weighted sectors — semantic (facts), procedural (how-to), episodic (events), emotional (frustrations/breakthroughs), and reflective (decisions/trade-offs).

A single memory can be 80% procedural and 20% semantic. The sectors aren't buckets — they're weights. At query time, "how do I configure Docker networking?" boosts procedural matches. "Why did we choose Ollama over vLLM?" boosts reflective ones. Classification is automated by a local 8B model. Cost: zero.
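A sector-weighted boost at query time might look like this sketch — the intent labels and boost table are my assumptions for illustration:

```python
# One memory's sector profile: weights, not buckets.
memory_sectors = {"procedural": 0.8, "semantic": 0.2}

# Assumed mapping from query intent to sector emphasis.
QUERY_BOOSTS = {
    "how-to": {"procedural": 1.0, "semantic": 0.3},
    "why":    {"reflective": 1.0, "semantic": 0.3},
}

def sector_score(sectors: dict, intent: str) -> float:
    """Dot product of the memory's sector weights with the query's boosts."""
    boosts = QUERY_BOOSTS[intent]
    return sum(w * boosts.get(name, 0.0) for name, w in sectors.items())

# "How do I configure Docker networking?" favors the procedural memory.
print(round(sector_score(memory_sectors, "how-to"), 2))  # 0.86
# "Why did we choose Ollama?" barely matches it.
print(round(sector_score(memory_sectors, "why"), 2))     # 0.06
```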

Composite Retrieval

Vector similarity alone asks one question: "how semantically close is this memory to the query?" Necessary, not sufficient.

Mnemos scores every candidate across six signals: vector similarity (35%), keyword match via FTS5 (20%), salience (15%), recency (15%), coactivation — memories frequently retrieved together (10%), and sector match (5%).

The composite score means a highly relevant memory from three months ago beats a vaguely related one from yesterday. A procedural memory about Docker config outranks an episodic memory about the same topic on a "how to" query.
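The weighted sum is straightforward to sketch. Only the weights below come from the post; the signal values are made-up illustrations of the "relevant but old" vs. "fresh but vague" case:

```python
# Weights from the post; signal names are illustrative.
WEIGHTS = {
    "vector": 0.35, "keyword": 0.20, "salience": 0.15,
    "recency": 0.15, "coactivation": 0.10, "sector": 0.05,
}

def composite(signals: dict) -> float:
    """Weighted sum over signals, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

old_relevant = composite({"vector": 0.9, "keyword": 0.8, "salience": 0.7,
                          "recency": 0.2, "coactivation": 0.5, "sector": 1.0})
fresh_vague  = composite({"vector": 0.4, "keyword": 0.3, "salience": 0.5,
                          "recency": 1.0, "coactivation": 0.1, "sector": 0.5})
print(old_relevant > fresh_vague)  # True
```

With recency capped at 15% of the score, freshness can break ties but can't rescue an irrelevant memory.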

Temporal Awareness

This is the piece nobody else is building. Facts change — you upgrade a model, reverse a decision, realize a pattern was wrong. In flat storage, old and new coexist. Your AI retrieves both.

Mnemos tracks valid_from and valid_to timestamps with explicit supersession chains. "What was our embedding model on February 1st?" returns the old answer. "What is it now?" returns the current one. Old knowledge isn't deleted — it's retired. The decision trail stays auditable.
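A minimal sketch of that kind of validity-interval lookup in SQLite — the schema and the hypothetical old model name `nomic-embed-text` are my assumptions, not the actual Mnemos tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE facts (
    key TEXT, value TEXT, valid_from TEXT, valid_to TEXT)""")
# Superseded row closes its valid_to; the current row leaves it NULL.
db.execute("INSERT INTO facts VALUES ('embedding_model', "
           "'nomic-embed-text', '2025-01-01', '2025-02-10')")
db.execute("INSERT INTO facts VALUES ('embedding_model', "
           "'qwen3-embedding:4b', '2025-02-10', NULL)")

def fact_as_of(key: str, date: str):
    """Return the value that was valid on the given date."""
    row = db.execute(
        "SELECT value FROM facts WHERE key = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)", (key, date, date)).fetchone()
    return row[0] if row else None

print(fact_as_of("embedding_model", "2025-02-01"))  # nomic-embed-text
print(fact_as_of("embedding_model", "2025-06-01"))  # qwen3-embedding:4b
```

The retired row is still there — that's what keeps the decision trail auditable.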

The Stack

Same hardware from my last post. Same RTX 5080 laptop. Same zero operating cost.

- SQLite — WAL mode, FTS5, triggers for full-text sync. Single file. Backups are cp.
- Ollama — two models: qwen3-embedding:4b (2560d embeddings, ~3.5GB VRAM) and qwen3:8b (classification, ~5.2GB VRAM). Total: ~8.7GB of 16GB.
- MCP via STDIO — Claude Code spawns the process directly. No HTTP, no ports, no network surface. 10 tools exposed: store, query, search, related, get, list, reinforce, timeline, consolidate, stats.
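The FTS5-with-triggers pattern from the stack list fits in a few lines. This is a generic sketch of the technique, not the Mnemos schema — table and trigger names are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE memories(id INTEGER PRIMARY KEY, content TEXT);
    -- External-content FTS5 table: index only, text lives in memories.
    CREATE VIRTUAL TABLE memories_fts USING fts5(
        content, content='memories', content_rowid='id');
    -- Trigger keeps the full-text index in sync on insert.
    CREATE TRIGGER memories_ai AFTER INSERT ON memories BEGIN
        INSERT INTO memories_fts(rowid, content) VALUES (new.id, new.content);
    END;
""")
db.execute("INSERT INTO memories(content) VALUES "
           "('Docker binds IPv4 only; localhost resolved to IPv6')")
hit = db.execute("SELECT content FROM memories_fts "
                 "WHERE memories_fts MATCH 'docker'").fetchone()
print(hit[0])
```

A real setup also needs UPDATE and DELETE triggers, but the shape is the same: the database maintains its own index, and backup really is just copying one file.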

In Practice

Session 1: I fix an OCR date-parsing bug. Mnemos stores the pattern as procedural knowledge, medium importance.

Session 5: Different project, similar issue. The agent queries date validation patterns. Mnemos returns the fix — not because I searched for "OCR" but because the semantic content matched. The coactivation link strengthens.

Session 20: That OCR memory has been retrieved four times. Its salience is reinforced. A session note about which port I tested on has decayed to near-zero and gets archived during consolidation. No manual curation. The system self-maintains.

Why Not [mem0 / Recall / Hive Memory]?

They solve a simpler problem. mem0 needs Qdrant + Neo4j + an embedding service — three dependencies for vector search with a graph bolted on. Recall, EasyMemory, Hive Memory are variations on persist-across-sessions. They fix "my agent forgets between sessions." They don't fix "my agent drowns in stale context after 200 sessions."

None have decay. None have temporal supersession. None have composite multi-signal retrieval. These aren't nice-to-haves — they're the difference between a notepad and a memory system.

The Bottom Line

Your AI agent is only as good as the context it operates in. Most agents start every session with amnesia — or with a flat dump of everything you've ever told them. Neither works at scale.

Mnemos is a single Python file. It runs on a consumer GPU. It costs nothing to operate. And it means my agent on session 50 is smarter than it was on session 1 — the knowledge compounds and the noise decays.

That's not a notepad. That's memory.


This is Part 2 of my Local AI Architecture series. Part 1 covered dual-model orchestration — routing 80% of AI workloads to a free local model. Next up: vision pipelines and why I stopped paying for OCR APIs. (Or I'll eventually get to it.)


I build zero-cost AI tools on consumer hardware. The factory runs on Docker, Ollama, and one GPU. The tools cost nothing to operate.
