Carloshperc

Posted on Jul 3

AI agent memory you verify, not just trust

#productivity #ai #automation #programming

TL;DR

A CLAUDE.md solves static knowledge: commands, conventions, preferences and rules. What it does not solve is episodic memory: yesterday's investigated bug, the approach that already failed, the decision that explains why a PR was abandoned. That is why I built agent-memory-hub. The turn came later: I discovered the memory itself could fail silently. This article is the story of how persistent memory became verifiable memory, and how a product idea became an open-source tool I actually use every day.

Who this article is for

Devs using Claude Code, Codex or Cursor who feel the agent knows the rules but forgets the story.
Anyone maintaining CLAUDE.md, AGENTS.md or persistent memory who wants to separate static knowledge from episodic memory.
Anyone who enjoys an honest post-mortem: a silent pipeline bug, observability and a deliberate product call.

The whole story in one line

If I told this in the most obvious way, the summary would be: I built a memory, found bugs, added observability. But that summary starts in the wrong place. The story did not become interesting because a bug appeared. It became interesting because the bug showed I was trusting a memory without a receipt.

That order changes the weight of everything. agent-memory-hub started as a continuity layer for coding agents. Then it became a practical lesson about ingestion: anything that captures data automatically also needs to prove that it captured it. Without that, memory only looks like memory.

The pain wasn't intelligence. It was memory.

Over the last few months I practically stopped coding without AI agents. Claude Code, Codex, Cursor: when you give them enough context, each tool gets very good at entering a project and producing.

So I did what almost everyone does: I wrote a good CLAUDE.md. I put in build commands, conventions, how I like tests written, how I organize modules, how I open PRs. It worked very well, until I noticed a curious split: the agent knew how I wanted it to work, but it did not remember what had already happened.

It knew how to compile. But it did not remember that yesterday we had already tried another approach. It knew how to run tests. But it forgot that a bug had already been investigated for two hours. It knew the rules. But it did not know the story.

The habit that masked the problem

For a while I compensated in the most human way possible: by repeating myself. I would open a new session and dump context. "Yesterday we tried this." "Do not touch that layer because it has odd coupling." "This test fails when it runs in parallel." "The previous PR decision was to take another path."

The curious thing is that this repetition felt like a natural part of using agents. Since the answer improved after I explained everything again, I did not see it as a system failure. It felt like an entry ritual: before working, recap. The problem is that manual ritual charges interest when you move across projects, machines and tools.

That was the first important insight: the agent did not only need better instructions. It needed continuity. And continuity is not a long prompt. It is the ability to carry consequences from one session to the next.

There are two kinds of knowledge

Once I separated the problem, it became clear that I was calling two different things "context". One is static: how to compile, how to deploy, how to structure a module, how to write commits.

That knowledge belongs in CLAUDE.md. The other part changes every day: yesterday's investigated bug, the decision made in the last session, the experiment that failed, the reason a PR was abandoned. That does not belong in CLAUDE.md, because the file would need to be rewritten constantly. It is memory. Episodic, living, automatic.

CLAUDE.md / AGENTS.md	agent-memory-hub
How to work	What already happened
Static	Dynamic
Written by you	Captured automatically
Rules	History

That is where agent-memory-hub was born. Not to replace CLAUDE.md, but to solve exactly what it was never meant to solve.

What I wanted the memory to remember

When you use agents occasionally, repeating context feels like an annoyance. When they enter the daily workflow, it becomes operational cost. The problem was not storing "preferences"; it was preserving continuity between sessions that, to the tool, looked independent.

In practice, I wanted memory to keep five kinds of episodes: decisions made, hypotheses discarded, bugs already investigated, PR state of mind and small project patterns that only show up after several conversations.

That boundary protected the project from becoming a dump of everything. If everything is memory, nothing is memory. The value was in capturing what I would actually forget between sessions, especially when I moved across machines, tools and contexts.

From there, the question stopped being "which memory tool exists?" and became something else: what kind of system could capture those episodes without trapping my history inside a single tool?

Why I built it myself

I could have stopped there and adopted an existing tool. There are good options for agent memory, and I was not trying to reinvent vector search for sport. But my problem had very personal requirements: work across Claude Code and Codex, cross machines, remain mine when I changed tools, and be auditable with a simple SQL query.

Choosing Postgres via Supabase was less about technical glamour and more about predictability. I wanted data ownership, pg_dump, direct inspection, a table I could open without asking any product for permission. For personal work memory, that matters. My session history is part of my process, not just a tool cache.

That decision raised the stakes for what came next. When you buy a tool, you accept part of the black box as the package. When you build the pipeline, responsibility moves: if it fails silently, it is not the fault of some distant platform. It is the architecture you chose to tolerate.

The first version looked perfect

The architecture was simple: every session was captured automatically and, at the start of the next conversation, the agent received a digest of relevant sessions. Underneath it, hybrid search (keyword + pgvector) and an optional facts layer. Everything was saved to my own Postgres (via Supabase), with no SaaS, no lock-in, and the ability to run pg_dump at any time.

The design: three lifecycle hooks write to and read from the same Postgres table. Capture is idempotent (upsert by session_id); the Stop checkpoint keeps the session alive even through an abrupt kill.

It worked across machines. It worked across tools. It worked so well that I stopped thinking about it. That was the danger: the automation had disappeared from my attention.

You stop verifying because the system looks reliable, and the system looks reliable precisely because failures do not surface. Until the day I decided to check one specific session.

The day of the ghost session

The session that made me pull this thread did not look special. It was just another long conversation, the kind where you stack context, small decisions, command output and course corrections. At some point, a simple question landed: "did this get saved?".

The question was almost bureaucratic. I was not hunting for a bug. I only wanted to confirm that memory had done its job. I checked the log. Nothing. I checked the database. Nothing. The session that had just happened did not exist for the system that was supposed to remember it.

That kind of moment changes your relationship with a tool. Until then, I thought of memory as convenience. After that, I started seeing it as infrastructure. And infrastructure that fails without signaling is not infrastructure; it is a silent bet.

The error was hidden in `exit 0`

When I followed the trace, the pattern became more uncomfortable than one missing session. There were intermittent errors: session yes, session no. Some conversations reached the database; others disappeared on the way.

The capture hook had an except that swallowed any error and moved on with exit 0. From Claude Code's point of view, all good. From the database's point of view, whole sessions were missing.

And that's worse than a crash. A crash you see: it stops, it screams, it shows up. Silence you don't see. The agent simply "forgets" things it should know, and you chalk it up to a model limitation, not a broken pipeline. The memory looked like it worked because nobody had to prove that it worked.

Why silent failure is so deceptive

Silent failure does not compete with your system. It competes with your interpretation of the system. When the agent forgets something, plausible explanations remain: the model did not weigh it enough, the summary was weak, the initial prompt did not include context, semantic search did not find the right session.

All of those explanations can be true. That is exactly why they are dangerous. They give you a comfortable story for a problem that may sit one step earlier: the information was never captured. You keep tuning recall, embeddings and prompts while the hole is in ingestion.

That is when the word "trust" started to feel wrong. Trust, in that context, was just the absence of evidence against. What I needed was evidence for.

The investigation: follow the data, not intuition

The initial temptation was to fix the first visible error and declare victory. But ingestion pipelines rarely break in only one place. The data crossed shell, Python parser, sanitization, Postgres insert and logs. Any one of those layers could be lying by omission.

So I followed the payload like a trail: was the original input valid? Did it arrive unchanged in Python? Did the parser accept it? Did the database reject it? Did the log record the failure? Did the hook return an error to the tool?

That changed the tone of the investigation. The bug stopped being "why did one session disappear?" and became "how many sessions could disappear without me noticing?". The second question is much more uncomfortable, and much more useful.

Three bugs, all silent

What looked like one problem turned into three, at different layers of the same pipeline. The detail is worth it because this pattern shows up in any ingestion system that seems too "simple" to fail.

Bug 1: control characters

Each session payload is JSON. Python's strict parser rejected any payload with unescaped control characters, and the content of a terminal session is full of them (command output, box-drawing, ANSI sequences). The result was a JSONDecodeError the except swallowed.

The fix is one line, but it only works because the error was in the parse:

payload = json.load(sys.stdin, strict=False)  # accept control chars inside strings

Bug 2: the NUL byte

With a tolerant parser, another session failed, now at the database. Postgres refuses the NUL byte in a text column, with the error 22P05: unsupported Unicode escape sequence. So: the parse passed, but the INSERT was rejected. The fix was to sanitize the content, recursively, before uploading:

def strip_nul(obj):
    if isinstance(obj, str):
        return obj.replace("\x00", "")
    if isinstance(obj, dict):
        return {k: strip_nul(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_nul(v) for v in obj]
    return obj

Bug 3: the `echo` that corrupted the JSON

This one was the subtlest, and the most instructive. The hook command captured the payload and handed it to Python like this:

payload=$(cat); echo "$payload" | python3 capture_session.py ...

The problem is echo. In sh/dash, echo interprets backslash sequences. Any payload with \ (and sessions have many: ANSI sequences, regex, \u) reached the parser corrupted, with the error Invalid \escape. That's why it alternated: a session with a backslash failed, a session without one passed. The parser never had a chance; the data arrived already broken.

The fix is to swap echo for printf '%s', which doesn't interpret escapes. The side-by-side test proves the point, with the same valid JSON input:

# same valid input; via echo it breaks, via printf it passes
sh -c 'p=$(cat in.json); echo "$p"'        | python3 -c 'import json,sys; json.load(sys.stdin)'
# -> JSONDecodeError: Invalid \escape

sh -c 'p=$(cat in.json); printf "%s" "$p"' | python3 -c 'import json,sys; json.load(sys.stdin)'
# -> ok

The cross-cutting lesson: every layer (parser, database, shell) had a different trap for the same data. A "simple" three-stage pipeline hid three independent failure modes, all silent.

Fixing the present was not enough

After fixing the hook, one annoying question remained: what about the past? The sessions that failed would not magically return just because the new code was better. If memory is history, old gaps matter too. Sometimes they matter more, because that is where the decisions and attempts I did not want to repeat were stored.

Backfill became part of the fix. I had to reconcile local transcripts, identify sessions that existed on the filesystem but not in the database, sanitize old payloads, attach subagent sessions and deal with a second account that also had relevant history. It was not glamorous. It was infrastructure cleanup.

flowchart LR
  L["Local transcripts"] --> C["Classify valid sessions"]
  C --> M["Find missing in DB"]
  M --> S["Sanitize old payload"]
  S --> U["Idempotent upsert"]
  U --> H["health confirms coverage"]
  H --> R["History whole again"]

That detail is easy to underestimate. Many pipeline fixes stop at "from now on it is fine". But for memory, the past is the product. If you only fix the future, you leave the most valuable part of the tool with invisible gaps.

The pivot: the problem isn't remembering, it's verifying

Fixing the bugs was the easy part. What stuck was the question they forced: I spent weeks trusting something broken and never knew. The problem was not only technical. It was epistemological: why did I think that memory deserved trust?

Because the metric the whole category optimizes is "remember better": smarter recall, better embeddings, denser summaries. That matters. But the pain that caught me was one step earlier: proof that capture happened. Nobody tells you when memory fails. There's no receipt, no health check, no gap detection. You find out when the agent forgets, if you find out.

So I stopped treating trust as a feeling and started building verification. Two pieces:

A coverage health check. A command that reconciles local transcripts (from every tool) against what's in the database, ignores empty sessions, and watches the capture error rate over the last 24 hours. The output is direct:

agent-memory-hub · health

✓ coverage    ██████████████████████ 86/86 local sessions saved
✓ capture     last 24h: 51 ok, 0 errors
✓ subagents   13/13 sessions with subagents attached

flowchart LR
  L["Local transcripts<br/>(all tools)"] --> R{"health: reconcile"}
  B[("DB · public.sessions")] --> R
  R -->|"gap or capture error"| ALERT["surfaced: coverage under 100%"]
  R -->|"all matched"| OKN["86/86 · 0 errors / 24h"]

When something breaks, it surfaces, instead of going unnoticed. It's the difference between memory you trust and memory you verify.

An MCP server for on-demand recall. The default recall is passive: it injects a digest at the start of the session, before you even say what you want to do. I exposed the memory as MCP tools (pure stdlib, no dependencies) so the agent queries the memory with the task in hand: recall_relevant, recent_sessions, get_facts, get_session. The idea was simple: remember at the right time, not only at the beginning of the conversation. The heart of the server is a minimal JSON-RPC handler:

elif method == "tools/call":
    text = _call(params["name"], params.get("arguments") or {})
    _reply(rid, {"content": [{"type": "text", "text": text}]})

The numbers, after closing everything: 148 sessions, 36 projects, local↔database coverage reconciled, with a second account's sessions and subagent transcripts finally included in the backfill.

The system changed personality

Before, memory was an optimistic promise: "relax, I saved it". Afterward, it became a system with a more mature posture: "I saved these sessions, I did not save those, and here is the difference between what exists locally and what reached the database".

That also changed how I use agents. When recall feels poor, there is now a question before model quality: was the memory available? Was the relevant session captured? Did the agent query the right history? Without that layer, every failure looks like "the AI's fault". With it, forgetting becomes something you can investigate.

What the project does today

After that sequence of fixes, agent-memory-hub stopped being just a pair of hooks. It became a small personal memory platform for agents: capture, query, verification, console, adapters, backup, profile and quality evaluation. It did not start that way; it became that every time a failure asked for a better surface.

Layer	What it solves today
Capture	Claude Code hooks save sessions; adapters pull history from Codex and Cursor.
Query	Session-start recall, on-demand MCP and the `mem` console to search, inspect and summarize.
Observability	`health` reconciles local transcripts with the database; `log` shows what happened.
Quality	Pytest suite, CI and a recall evaluation harness with hit@k/MRR.
Evolution	The cross-project profile proposes rules; you approve before they become behavior.

The commits tell that evolution well. First came the capture bugs. Then health and log. Then MCP, standup, adapters for other tools, packaging for the mem command, CI tests and recall evaluation. Even ranking tuning entered the loop: a recency bias was tested and rejected by data when it hurt recall.

That changes the reading of the project. It is not an idea abandoned after market research. It is a living, used tool, with commits stacking real learning. The care was different: not confusing open-source traction and daily usefulness with an obligation to turn everything into a company immediately.

Before turning it into a startup, I stress-tested the thesis

The seductive version of the story would stop here and say: I found a real pain, built a solution, now there is an opportunity. And for a few hours, I wanted to believe that. "Verifiable memory for agents" sounds like a category. It seems to have a name, urgency and a clear angle.

But a real pain is not automatically a company. So I changed the question. Instead of looking for confirmation, I looked for friction: who is already close to this? What can incumbents copy quickly? What is deep pain and what is an adjacent feature? If I were my own competitor, how long would it take to neutralize this slice?

That research did not reduce the value of the project. On the contrary: it made the value clearer. As a personal tool, it saves me time and reduces repetition every day. As an open-source project, it has already started getting stars and signaling that the pain resonates outside my own machine. As a startup, it might still need a broader thesis than "proof of capture". And that is fine. Not every useful project needs to become a company before it finishes learning its own shape.

What worked: daily use and real signals

The part I am proudest of is that the project did not stay in idea-land. It is in my daily workflow. It captures sessions, shows up at the start of new conversations, helps recover old decisions and reduces the annoying work of re-explaining context to the agent.

It also did not stay invisible to other people. The repo has already received stars, and that matters as a small but honest signal: not market validation, but proof that the pain is recognizable. The adversarial research was not there to abandon the project; it was there to calibrate ambition. The result was sober:

Tool	What it already covers	The gap that still remains
MemGuard	"Datadog for agent memory": health and trust-scoring.	Proof of capture, silent-gap detection, "memory receipts". Real, but thin, and copyable by incumbents in weeks.
Zep	Enterprise governance: SOC 2 Type II, HIPAA, access audit.
MemTrust (paper, 2026)	Crypto-chained audit and memory attestation.
mem0	An audit Events API already built in.

The "observe/trust the memory" theme is filling up fast. The exact slice still open (proof of capture, silent-gap detection, "memory receipts") is real, but it's thin, and copyable by incumbents in weeks (mem0, for instance, already has an audit Events API).

So the decision was to keep using it, keep opening it, keep improving it and not force a startup narrative before its time. Today the value is clear: real dogfood, open-source, public signs of interest, and useful enough to stay in my daily workflow.

One honest technical caveat, for anyone replicating: the hook fix only applies to new sessions, because hooks load at the start of the session. The safety net is the session-end event, which saves the whole transcript cleanly, so even with the intermediate checkpoints failing, the final state isn't lost.

Conclusion

If you run agents with memory, do one thing this week: check that it's actually capturing. Don't assume. Look for the equivalent of my "checked the log, checked the database". The most dangerous failure mode of an ingestion pipeline isn't the one that screams, it's the one that fails quietly and leaves you confident.

The principle generalizes beyond AI agents: for any system that ingests data and can fail silently, observability beats optimization. Verifying is more important than trusting. And the product meta-lesson, maybe the most useful: running adversarial research to calibrate your own ambition is a superpower, not a defeat. Sometimes the best outcome of an investigation is understanding that the project should stay alive, just in the right shape.

The project is open-source and self-hosted. If it saves you a single session re-explaining your project to the agent, it was worth it.

Originally published at buildcomcarlos.com.

DEV Community

AI agent memory you verify, not just trust

TL;DR

Who this article is for

The whole story in one line

The pain wasn't intelligence. It was memory.

The habit that masked the problem

There are two kinds of knowledge

What I wanted the memory to remember

Why I built it myself

The first version looked perfect

The day of the ghost session

The error was hidden in `exit 0`

Why silent failure is so deceptive

The investigation: follow the data, not intuition

Three bugs, all silent

Bug 1: control characters

Bug 2: the NUL byte

Bug 3: the `echo` that corrupted the JSON

Fixing the present was not enough

The pivot: the problem isn't remembering, it's verifying

The system changed personality

What the project does today

Before turning it into a startup, I stress-tested the thesis

What worked: daily use and real signals

Conclusion

Top comments (0)

TL;DR

Who this article is for

The whole story in one line

The pain wasn't intelligence. It was memory.

The habit that masked the problem

There are two kinds of knowledge

What I wanted the memory to remember

Why I built it myself

The first version looked perfect

The day of the ghost session

The error was hidden in exit 0

Why silent failure is so deceptive

The investigation: follow the data, not intuition

Three bugs, all silent

Bug 1: control characters

Bug 2: the NUL byte

Bug 3: the echo that corrupted the JSON

Fixing the present was not enough

The pivot: the problem isn't remembering, it's verifying

The system changed personality

What the project does today

Before turning it into a startup, I stress-tested the thesis

What worked: daily use and real signals

Conclusion

The error was hidden in `exit 0`

Bug 3: the `echo` that corrupted the JSON