DEV Community

Michael Kayode Onyekwere

My AI remembered the wrong thing and broke my build. So I built memory governance.

Six weeks ago I gave my AI assistant a memory. It worked. No more re-explaining the project every session. Bugs got fixed once and stayed fixed.

Then it followed a rule from January that I'd overridden in February, and the audio in my video sounded like a robot reading through a tin can.

The old rule said "always apply loudnorm to voice audio." The new rule said "never do that — it lifts the noise floor." Both were in memory. Both active. Both ranked the same. The agent grabbed the wrong one and I didn't catch it until I listened to the export.

330 memories, no idea which ones were still valid

I build YouTube Shorts with my AI assistant. Daily production — voice generation, image prompting, FFmpeg assembly, captions, upload. Over two months, the memory database grew to 330 entries. Bugs, fixes, decisions, settings, voice profiles, pipeline steps.

Some of those entries were battle-tested rules that saved me hours every week. Some were from the first week, when I was still figuring things out, and they were just wrong. The database treated them all the same. No status. No history. No way to tell what was current.

I looked at Mem0, Letta, Mengram. Good at storing and retrieving. None of them answer "is this memory still true, or did something else in this database already replace it?"

Lifecycle states

Every memory now gets a status:

hypothesis  →  active  →  validated
                          deprecated  •  superseded

New observations start as hypothesis. They get promoted when confirmed. They get deprecated when disproven. If a new rule replaces an old one, the old one is marked superseded and points to the replacement.

Deprecated and superseded memories don't show up in search. The agent only sees current truth.

from agentmem import Memory

mem = Memory()

mem.add(type="decision", title="Never apply loudnorm to voice",
        content="Lifts noise floor. Use per-track volume instead.",
        status="validated")

# Kill the old wrong rule
mem.deprecate(old_rule_id, reason="Causes robotic audio")

# Agent now only sees the correct rule
context = mem.recall("audio processing", max_tokens=2000)

That alone would have prevented the tin-can incident.
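The lifecycle rules above can be sketched as a small state machine. This is an illustration of the idea, not quilmem's internals; the transition table and helper names are my own:

```python
# Illustrative sketch of the lifecycle described in the article.
# State names match the post; the transition table is an assumption.
VALID_TRANSITIONS = {
    "hypothesis": {"active", "deprecated"},
    "active": {"validated", "deprecated", "superseded"},
    "validated": {"deprecated", "superseded"},
    "deprecated": set(),   # terminal
    "superseded": set(),   # terminal
}

def transition(memory: dict, new_status: str) -> dict:
    """Move a memory to a new lifecycle state, rejecting invalid jumps."""
    current = memory["status"]
    if new_status not in VALID_TRANSITIONS[current]:
        raise ValueError(f"cannot move {current!r} -> {new_status!r}")
    return {**memory, "status": new_status}

def searchable(memories: list) -> list:
    """Only current truth reaches the agent: dead states never surface."""
    return [m for m in memories
            if m["status"] not in ("deprecated", "superseded")]
```

The filter in `searchable` is the whole trick: retrieval never has to reason about history, because history is simply invisible to it.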

1,848 conflicts (then 7)

I ran conflict detection on the full production database.

from agentmem import detect_conflicts

conflicts = detect_conflicts(mem._conn)

First run: 1,848. Matching any two memories that share a few words and contain "never" somewhere is not precise. After tuning — requiring 25% topic overlap and sentence-level negation matching instead of whole-document scanning — it found 7.

All real. Duplicate entries from two different sync runs. A bug stored once as a "decision" and once as a "bug." Two session snapshots that were never properly superseded. The kind of thing that sits quietly in your database and degrades trust so slowly you don't notice until your agent does something wrong and you can't explain why.
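The tuned heuristic can be sketched roughly like this. It is a minimal reconstruction from the description above (25% topic-word overlap, sentence-level negation), not quilmem's actual detector; all names and the negation list are assumptions:

```python
# Rough sketch of the tuned conflict heuristic described above:
# flag a pair only when topic words overlap by >= 25% (Jaccard) AND
# exactly one side contains a negation at the sentence level.
import re

NEGATIONS = {"never", "don't", "do not", "avoid"}

def topic_words(text: str) -> set:
    """Crude topic extraction: lowercase words longer than 3 chars."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def overlap(a: str, b: str) -> float:
    wa, wb = topic_words(a), topic_words(b)
    return len(wa & wb) / max(len(wa | wb), 1)

def has_negation(sentence: str) -> bool:
    s = sentence.lower()
    return any(n in s for n in NEGATIONS)

def conflicting(a: str, b: str, min_overlap: float = 0.25) -> bool:
    if overlap(a, b) < min_overlap:
        return False  # not even about the same topic
    # sentence-level check: one memory negates, the other does not
    negated_a = any(has_negation(s) for s in re.split(r"[.!?]", a))
    negated_b = any(has_negation(s) for s in re.split(r"[.!?]", b))
    return negated_a != negated_b
```

The overlap gate is what collapses 1,848 down toward 7: without it, any two memories sharing the word "never" look like a contradiction.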

Staleness

Every memory now tracks where it came from — the source file, the section heading, and a hash of the content at import time. If the source file changes, the memory is flagged stale.

from agentmem import detect_stale

stale = detect_stale(mem._conn, stale_days=14)
# [decision] "Use atempo 0.90" — Source changed since import (hash mismatch)

I had memories referencing files that had been renamed weeks ago. Rules updated in the markdown source but never re-synced to the database. Without provenance tracking, I'd have never known.
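The provenance check itself is simple. Here is a sketch of the mechanism as described above (my assumed record shape, not quilmem's schema): hash the source content at import, re-hash later, and flag any mismatch or missing file:

```python
# Sketch of provenance-based staleness, assuming a memory record that
# stores "source_file" and "source_hash" at import time. The record
# shape is illustrative, not quilmem's actual schema.
import hashlib
from pathlib import Path

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_stale(memory: dict, source_dir: Path) -> bool:
    """True if the source file was renamed, deleted, or edited since import."""
    src = source_dir / memory["source_file"]
    if not src.exists():
        return True  # renamed or deleted: the memory is orphaned
    return content_hash(src.read_text()) != memory["source_hash"]
```

A renamed file and an edited file produce the same signal, which is the point: either way, the memory no longer matches what you actually wrote down.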

Health score

One number. Can I trust what my agent knows right now?

$ agentmem health

Memory Health: 85/100
Total: 226
By status: validated: 14, active: 198, hypothesis: 12, deprecated: 2
Conflicts: 1
Stale: 3

The score penalises conflicts, stale entries, orphaned supersedes, and having zero validated memories. If you never explicitly confirm anything, the score reflects that. My production database started at 55.
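The shape of the scoring is easy to sketch. The weights below are my guesses, not quilmem's real formula; I have only tuned them to be consistent with the output shown above:

```python
# Illustrative health score matching the penalties described above.
# Weights are assumptions, NOT quilmem's actual formula.
def health_score(by_status: dict, conflicts: int, stale: int,
                 orphaned: int) -> int:
    score = 100
    score -= 6 * conflicts          # contradictions are the worst signal
    score -= 3 * stale              # source drifted since import
    score -= 4 * orphaned           # superseded_by points at nothing
    if by_status.get("validated", 0) == 0:
        score -= 10                 # nothing was ever explicitly confirmed
    return max(score, 0)
```

The zero-validated penalty is the interesting design choice: it punishes a database that was never curated at all, even if nothing in it is provably wrong.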

65 videos later

Before governance: 330 memories, all "active," 7 hidden contradictions, 104 duplicates. The agent sometimes followed outdated rules. I'd catch it in QA or I wouldn't.

After: 226 active, 104 properly superseded, contradictions surfaced and resolved. Recall prioritises validated canonical rules over unprovenanced imports.

65 videos built since the governance engine went live. Zero repeated production bugs.

The database isn't bigger. It's cleaner. And the agent makes fewer mistakes because what it retrieves is actually right.

Install

pip install quilmem

Open source, local-first, no API keys, no cloud. SQLite underneath. Ships with a CLI, a Python API, and an MCP server with 13 tools that works with Claude Code, Cursor, and Codex.

The governance stuff — lifecycle states, conflict detection, staleness, health scoring, provenance tracking — is all in the free core.

GitHub / Landing page / PyPI
