ישראל חן

Posted on Jun 23

Agent memory v2 — seven rules after the poisoning

#agents #ai #architecture #llm

Over a month ago I posted about my agent storing its own hallucinations as facts. The fix I was halfway through designing did not survive contact with the comment thread.

The thread raised points I hadn't considered and reshaped the v2 design. This post covers the result: seven rules I'm now building the memory layer around, plus an honest snapshot of what's actually built vs. what's still on paper.

The 30-second recap

Sonnet 4.6 (the bot's live model at the time), asked about a model it lacked access to, confidently denied the model existed. My summarization layer extracted the denial and wrote it into the memories table with category=fact and source=summary. Four days later, a fresh session asked the same question, retrieved the stored row, and served the hallucination as ground truth. The full walkthrough is in the first post.

The root cause sits in two words: no provenance. The schema couldn't tell the difference between "a person verified this" and "the model said it." So when a model said it, the schema treated it as fact. The label "self-poisoning" describes that shape rather than a vulnerability class: any time an agent's own output is re-ingested as input without provenance, it is the same bug in a different costume.

What follows is the architecture I'm building so that's no longer possible.

One note on the count: seven is a writing choice, not a partition. Rules 2, 3, and 7 are tightly coupled (states, fail-closed posture, the subsystem that operates them); Rules 5 and 6 are too (tag survives retrieval, gate at use). I'm presenting them as separate rules because each one is easier to reason about and verify on its own — collapsed into composite rules, the discipline buries itself.

Rule 1 — Provenance is a required field, not a column you might fill

The v1 schema had a free-text source column. "summary" was a valid value. That's how the hallucination passed through — no enum, no enforcement, no "hey, this isn't verified." Just a string the writer was free to set or leave alone.

v2 makes provenance a Pydantic Literal["verified", "unverified", "unavailable_at_write_time"] — a required field on every write. There is no path that writes a memory row without picking one of the three. You can't accidentally store a model assertion as a fact because the type system won't let you.

from typing import Literal

from pydantic import BaseModel

# v1: what shipped (the bug)
class MemoryEntry(BaseModel):
    content: str
    source: str = "explicit"      # free-text; "summary" was a valid value
    # ... category, embedding, timestamps, etc.

# v2: provenance required, three-valued, no default
class MemoryEntry(BaseModel):
    content: str
    source: str = "explicit"
    # No default here, so omitting provenance fails validation at write time
    provenance: Literal["verified", "unverified", "unavailable_at_write_time"]
    # ... rest unchanged

The schema change pairs with audit logging: every provenance transition (write, promote, demote) lands in agent_logs with writer, source, and reason. This closes the same observability gap that made attribution hard in post #1, where model_used was silently empty in the log and I couldn't even tell which model had asserted the false fact. If a row can change state, the log has to record why.

The same turtles-all-the-way-down caveat from Rule 4 applies here: the audit log is only as trustworthy as the writer of the audit log. Append-only storage and per-write signing make tampering detectable, not impossible. For a solo system, detectable is enough; for a multi-tenant one, you'd want immutable storage underneath.
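A minimal sketch of the transition log, assuming SQLite. Only the agent_logs table name and the writer, source, and reason fields come from the design above; the other column names and both helper functions are hypothetical. It covers only the "fields can't be silently empty" part, not signing or append-only storage.

```python
import sqlite3


def open_audit_log(conn: sqlite3.Connection) -> None:
    """Create the log table; empty writer/source/reason values are rejected."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS agent_logs (
            id        INTEGER PRIMARY KEY,
            row_id    INTEGER NOT NULL,
            old_state TEXT,               -- NULL on the initial write
            new_state TEXT NOT NULL,
            writer    TEXT NOT NULL CHECK (writer <> ''),
            source    TEXT NOT NULL CHECK (source <> ''),
            reason    TEXT NOT NULL CHECK (reason <> ''),
            logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def log_transition(conn, row_id, old_state, new_state, writer, source, reason) -> None:
    """Record one provenance transition (write, promote, or demote)."""
    conn.execute(
        "INSERT INTO agent_logs (row_id, old_state, new_state, writer, source, reason) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (row_id, old_state, new_state, writer, source, reason),
    )
    conn.commit()
```

With the CHECK constraints in place, a write that leaves the writer blank raises sqlite3.IntegrityError instead of landing as an empty string.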

Obvious objection — migration cost. A solo-dev DB with a few thousand rows is fine to rewrite in a script. Larger systems hate this — every existing untyped row needs an inferred or assigned provenance value, and a wrong default on a million rows is expensive. My answer: default existing rows to unverified and let the corroboration step promote what's worth promoting. Treating legacy data as unverified is closer to the truth than treating it as fact.
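A sketch of that migration for the small-DB case, assuming SQLite and a memories table with content and source columns. The table layout and the add_provenance_column helper are illustrative, not the real schema.

```python
import sqlite3


def add_provenance_column(conn: sqlite3.Connection) -> int:
    """Add the provenance column; every legacy row starts as 'unverified'."""
    conn.execute(
        "ALTER TABLE memories "
        "ADD COLUMN provenance TEXT NOT NULL DEFAULT 'unverified'"
    )
    conn.commit()
    # Everything counted here is now waiting for corroboration.
    (pending,) = conn.execute(
        "SELECT COUNT(*) FROM memories WHERE provenance = 'unverified'"
    ).fetchone()
    return pending


# Demo on an in-memory database with two legacy rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO memories (content, source) VALUES (?, ?)",
    [("user prefers dark mode", "explicit"), ("model X does not exist", "summary")],
)
pending = add_provenance_column(conn)
```

The column-level default exists only to backfill legacy rows. New writes would still go through the application model, which requires an explicit value.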

Rule 2 — Three states, not two

Binary admit/reject collapses the moment the verifier itself is down. That is exactly what happened to me: Ollama stalled and went to sleep on the calls that mattered, so the verification step never ran and the claim sailed through to [fact].

The three states (verified, unverified, unavailable_at_write_time) travel with the row. A claim that arrives while the verifier is down does not get written as fact. It gets written as unavailable_at_write_time and queued for promotion later. Timeout, retry, and async behavior belong to Rule 7's subsystem; this rule only guarantees the state exists for the queue to hold.

The verifier-down case is not an edge case to handle later. It's the case to design first, because it's the case that fired for me.
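As a sketch, the write path only has to map three outcomes onto three states. VerifierUnavailable and provenance_for are hypothetical names, and the verify callable stands in for whatever corroboration step is wired up.

```python
from typing import Callable, Literal

Provenance = Literal["verified", "unverified", "unavailable_at_write_time"]


class VerifierUnavailable(Exception):
    """Hypothetical error: the verifier timed out or could not be reached."""


def provenance_for(claim: str, verify: Callable[[str], bool]) -> Provenance:
    """Pick the state to store alongside the claim."""
    try:
        corroborated = verify(claim)
    except VerifierUnavailable:
        # The check never ran. Record that instead of guessing either way.
        return "unavailable_at_write_time"
    return "verified" if corroborated else "unverified"


def verifier_down(claim: str) -> bool:
    """Stand-in for a stalled verifier."""
    raise VerifierUnavailable("timed out")
```

Calling provenance_for with verifier_down yields unavailable_at_write_time, so a verifier outage can no longer be read as a pass.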

Obvious objection — three states aren't enough. What about a fact that was verified but became wrong over time? A deprecated model, a changed config, a person who moved? The fix is decay-on-contradiction, not an expired state. When a higher-provenance source contradicts a verified fact, the row gets demoted back to unverified and re-queued. Adding expired turns every read into a four-way switch; demotion keeps it at three and pushes the work to the moment the contradiction lands, which is when you actually have the new information. The demotion mechanism itself sits inside the promotion subsystem in Rule 7.

Rule 3 — The unverified path fails closed

Worst case shifts from "false fact stored silently" to "pending label visible in the row." A delay or a flag instead of a confidently wrong answer downstream.

You can't delete the failure mode. You can make it announce itself instead of masquerading as truth. That's the philosophy in one sentence — the architecture that delivers it is Rule 7's promotion subsystem.

Rule 4 — Confidence orders the pile. It does not promote.

The thread converged on a confidence threshold as the fix, and the counter-argument was the one that landed hardest: Sonnet 4.6's denial in this incident was high confidence. That doesn't generalize to every hallucination (uncertainty-tuned models do hedge), but for the class of hallucinations that are fluent and certain, which this one was, a confidence threshold re-admits exactly the bug. So promotion needed a mechanism that doesn't depend on how sure the model sounds.

Promotion has to be independent corroboration — a different source, a tool result, a second model with a different training prior. Confidence only decides which unverified claim gets corroborated next.

"Higher-provenance" refers to an ordering over source types, not a score comparison. A tool result outranks a confident model even when the model is more confident. Two confident models can share a training prior and be wrong together.

Obvious objection — turtles all the way down. Independent corroboration requires the corroborator to be honest. What corroborates the corroborator? I don't have a clean answer and I don't think one exists. The pragmatic version is a provenance hierarchy that bottoms out at the user: tool results outrank confident models, primary sources outrank tool results, user confirmation outranks both. That's a limit, not a solve. v2 will surface unresolved corroboration chains rather than hide them; "we don't yet know" should be visible, not silently treated as "yes."

I'd rather the unresolved chain show itself in the row than discover it on a bus ride, fearing the worst.
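One way to keep the two roles apart in code, as a sketch: rank decides who wins, confidence only decides who gets checked next. The enum values follow the hierarchy described in this rule; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceRank(IntEnum):
    """Provenance ordering from this rule, lowest to highest."""
    MODEL_ASSERTION = 1
    TOOL_RESULT = 2
    PRIMARY_SOURCE = 3
    USER_CONFIRMATION = 4


@dataclass(frozen=True)
class Evidence:
    claim: str
    rank: SourceRank
    confidence: float  # model-reported, 0.0 to 1.0


def outranks(a: Evidence, b: Evidence) -> bool:
    """Compare by source type only; confidence is deliberately ignored."""
    return a.rank > b.rank


def corroboration_order(pending: list[Evidence]) -> list[Evidence]:
    """Confidence orders the queue of unverified claims and nothing else."""
    return sorted(pending, key=lambda e: e.confidence, reverse=True)
```

Under this sketch a tool result at confidence 0.6 still outranks a model assertion at 0.99, and the 0.99 assertion simply lands first in line for corroboration.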

Rule 5 — The tag survives retrieval

This was the deepest cut from the thread, and the one I'd most underestimated.

Most of the poisoning fires at read time, not write time. The row gets flattened into prompt context as plain text, the unverified marker drops off in the join, and the downstream model never sees it. The gate everyone designed at write time never fires because the read path silently strips it.

v2 carries provenance inline through retrieval into the prompt: every fact that lands in context arrives tagged. The model sees (asserted by Sonnet, unverified, 2026-04-17), not just the bare claim. The gate is only as good as what the model actually reads.

Concretely: the Apr 17 "Claude Mythos" denial got summarized and stored. Four days later retrieval flattened it into prompt context as:

Memory: Claude Mythos is not a real AI model or cybersecurity system.

Indistinguishable from a user-confirmed fact. Under v2 the same row arrives as:

Memory (asserted by Sonnet, unverified, 2026-04-17): Claude Mythos is not a real AI model or cybersecurity system.

Same content; the tag is the difference between "the model defers" and "the model repeats."
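A sketch of the read-path change: the renderer refuses to emit a bare claim. MemoryRow and render_for_prompt are hypothetical names; the output format copies the tagged example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRow:
    content: str
    provenance: str   # "verified", "unverified", or "unavailable_at_write_time"
    asserted_by: str  # who or what made the claim
    written_on: str   # ISO date of the write


def render_for_prompt(row: MemoryRow) -> str:
    """Flatten a row into prompt text without dropping its provenance tag."""
    tag = f"asserted by {row.asserted_by}, {row.provenance}, {row.written_on}"
    return f"Memory ({tag}): {row.content}"


row = MemoryRow(
    content="Claude Mythos is not a real AI model or cybersecurity system.",
    provenance="unverified",
    asserted_by="Sonnet",
    written_on="2026-04-17",
)
line = render_for_prompt(row)
```

Because the tag is built inside the only function that produces prompt text, a join that forgets the provenance column fails loudly at construction time instead of silently producing an untagged line.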

Caveat — this is the bet, not a result. The local model in v2 will get a system prompt that explicitly tells it to treat unverified rows as untrusted context; that behavior we can guarantee by construction. Whether Sonnet (or any remote model) respects the inline tag on its own without a system-prompt nudge is untested. If it doesn't, the tag still does its job: it surfaces to the user that pending data got pulled into context, so the user can intercept before the model commits to an answer.

Rule 6 — The gate is at the point of USE, not just the point of WRITE

I asked a follow-up about a race: if the write is gated but a subagent reads pending data before promotion, doesn't the subagent just act on stale information?

The cleanest answer I got back is the one I'm building toward. Pending data is readable as context — subagents can see it. But pending data cannot authorize a state-changing or irreversible action until it's promoted. Strict read-only on subagents is too coarse; it blocks legit reads. The gate moves to the point of use: every consumer of a memory row checks the tag before acting on it, not just the writer before storing it.

Same fail-closed idea, one layer down. Verification gates the data going in. Consumer discipline gates the data going out.

Obvious objection — N-consumer rewrite cost. Every agent that reads memory needs to be rewritten to check tags before acting. Multi-agent system = N rewrites. Two answers. First, the rewrite is small: a provenance check before any state-changing tool call. Cost-per-agent is hours, not weeks. Second, the boundary that matters isn't the row read, it's the tool call — agents pass conclusions to each other, and a conclusion derived from unverified data can still trigger an irreversible action two hops downstream. So the gate goes in the agent base class at the tool-call boundary: every tool_call(args) invocation checks the provenance of every memory row that fed its inputs before firing. New agents inherit it for free. Row-level gating in each consumer is too narrow; gating at the tool-call boundary catches transitive use.
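A sketch of the base-class gate. It assumes the caller can already list which memory rows fed the tool inputs; tracking that lineage across agents is the hard part and is not shown. BaseAgent, MemoryRef, UnverifiedInputError, and the state_changing flag are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class MemoryRef:
    row_id: int
    provenance: str


class UnverifiedInputError(Exception):
    """Raised when pending data tries to authorize a state-changing call."""


class BaseAgent:
    def tool_call(
        self,
        tool: Callable[..., Any],
        args: dict[str, Any],
        *,
        sources: Sequence[MemoryRef],
        state_changing: bool,
    ) -> Any:
        """Run a tool, refusing state changes backed by unpromoted rows."""
        if state_changing:
            blocked = [s.row_id for s in sources if s.provenance != "verified"]
            if blocked:
                raise UnverifiedInputError(f"rows {blocked} are not verified")
        # Read-only calls pass through: pending data stays readable as context.
        return tool(**args)
```

Subclasses inherit the check without any per-agent code, which is the "new agents inherit it for free" property described above.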

Rule 7 — Promotion is a subsystem, not a function call

The mistake I almost made was wiring promotion as a single check inside the write path: if corroborated: row.provenance = "verified". That's a function, not an architecture. Once you think through when promotion fires, what it checks, and what happens when the corroborator disagrees, the function is gone and a subsystem is in its place.

Three triggers fire promotion: an async background worker walks unverified rows on its own schedule, a fresh retrieval re-checks the tag at read time, and a user can confirm a row out of band. The subsystem resolves them by Rule 4's provenance ordering — tool result outranks a second-model prior outranks a confident assertion, with user confirmation terminal until contradicted. Same ordering, applied to a queue instead of a moment.

Two cases the function-call version would have gotten wrong. Verifier down: promotion doesn't fail — the row stays at unavailable_at_write_time and re-queues. Higher-provenance contradiction: the row gets demoted back to unverified and re-queues. Both are normal traffic, not edge cases.
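Both cases fit in one step function once "verifier down" is treated as a normal input rather than an exception. This is a sketch under simplifying assumptions: ranks are plain integers, the queue is a list, and Row, Check, and promotion_step are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    claim: str
    provenance: str
    source_rank: int  # rank of the source that originally asserted the claim


@dataclass(frozen=True)
class Check:
    rank: int     # rank of the corroborating source
    agrees: bool  # whether it supports the stored claim


def promotion_step(row: Row, check: Optional[Check], queue: list[Row]) -> None:
    """Apply one corroboration result; check is None when the verifier is down."""
    if check is None or check.rank <= row.source_rank:
        # No usable higher-provenance evidence: state is untouched.
        if row.provenance != "verified":
            queue.append(row)  # still pending, so try again later
        return
    if check.agrees:
        row.provenance = "verified"
    else:
        # Higher-provenance contradiction: demote and re-queue.
        row.provenance = "unverified"
        queue.append(row)
```

Neither branch raises, so an outage or a contradiction shows up as queue traffic rather than as a failed write.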

Convergence policy is explicit: during read and write, the subsystem's default position on any unresolved row is unverified — promotion has to be earned, not assumed. The three triggers don't race because the default holds the line until evidence accumulates. The one exception is user contradiction: a previously verified row gets demoted back to unverified by an explicit user signal, and the subsystem re-enters the promotion queue from scratch. User wins; the rest is process.

The verifier has finite throughput, so budget is read-driven: hot rows (frequently retrieved) get checked first; cold rows can sit indefinitely because Rule 6 blocks action on them regardless of how long they wait. Cheap cold backlog beats expensive hot latency.
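A sketch of read-driven budgeting, assuming retrieval counts are already tracked per row id. The next_to_verify helper and the shape of the counts mapping are hypothetical.

```python
import heapq


def next_to_verify(retrieval_counts: dict[int, int], budget: int) -> list[int]:
    """Return up to `budget` row ids, most frequently retrieved first."""
    return heapq.nlargest(
        budget, retrieval_counts, key=retrieval_counts.__getitem__
    )


# Row 1 is hot, row 2 has never been retrieved.
counts = {1: 40, 2: 0, 3: 7, 4: 19}
batch = next_to_verify(counts, budget=2)
```

Rows that never make the batch simply stay pending, which is safe only because the use-time gate blocks action on them.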

Obvious objection — this is over-engineered for a solo system. A single function would ship faster. True — and it would re-introduce the original bug the moment the verifier hiccups or a tool result contradicts a stored assertion. The subsystem isn't sized to the system today; it's sized to the failure modes the function call would silently absorb. The discipline is what's load-bearing, not the line count.

Status: all seven rules are wired into the design; none of them are in code yet. Rule 4's promotion path is still v1's confidence threshold — the live wound.

Honest snapshot

All seven have written designs above; none are in code, schema, or test plans yet. "Designed" here means "argued through and committed to," not "formalized into a spec doc." Schema and write path ship first, then retrieval, then consumer enforcement, then the promotion subsystem — several weeks of work, with a build log per piece when it lands. I'm publishing the design before the code is in because the design is the part the thread shaped, and I'd rather hear it's wrong now than after I've written the migration.

Credit where it's load-bearing

The thread on the first post was load-bearing on five of these seven rules.

Rule 1 — Particular thanks to Harjot Singh — building Moonshift — for the verify-before-you-persist framing.
Rule 2 — Shaped by this comment pushing the verifier-down case from edge to design-first.
Rule 5 — Shaped by this comment arguing the tag has to survive the retrieval boundary, not just sit in a column.
Rule 6 — Shaped by this reply walking through the race where a subagent acts on pending data before promotion.
Rule 7 — Shaped by this comment arguing promotion has to be independent corroboration, not confidence — and the tag has to gate behavior, not just sit in the row.

Other commenters pushed back on the cleaner-sounding wrong answers I was about to ship, and the architecture is better for it.

The HECE forensics methodology — the actual SQLite walkthrough I used to find the poisoning, the queries, the false leads, the audit pattern other builders can run on their own agents — is a companion post in this series.

If you're building agent memory and any of these seven rules look wrong, I'd rather find out now than after the migration. Reply, DM, or punch holes in the comments — the post is here precisely to be challenged.

Top comments (2)

Max Quimby • Jun 25

The reframe that "self-poisoning" is just re-ingesting your own output without provenance is the cleanest statement of this bug I've read — it really is the same costume on a recurring problem. Making provenance a required Literal instead of a free-text column is the right call; the type system refusing to let you store a model assertion as fact closes the whole class.

One thing I'd push on from running a similar layer: provenance solves write-time truth, but it doesn't cover decay. A row stamped verified in March can be flat wrong by June — your own opening example (a model's own capabilities) is exactly that kind of fact with a shelf life. We ended up treating temporal validity as a separate axis from provenance, because retrieval by embedding similarity will cheerfully hand back a stale-but-verified row next to a fresh one with no tiebreak. Have you thought about where re-verification lives in the v2 state machine — is there a path from verified back to unverified on age, or only on contradiction? That demotion trigger was the hardest part for us to get right without re-checking everything constantly.

ישראל חן • Jun 26

It's the right thing to push on. Provenance is the write-time axis I primarily designed for; temporal validity is a second axis I underweighted. Two corrections to what I wrote:

Age-only demotion does exist in v2, but per-category, not uniform. "Model capabilities" carries a months-scale half-life; identity facts carry none. Categories declare their decay function at schema-time; decay reduces score within state, and a category-specific threshold triggers the verified → unverified transition. Without per-category half-life the framing collapses on the first counter-example (a user's birthday doesn't decay).

Re-verification fires at retrieval, not on a background scan — but only for rows that (a) score below their category threshold and (b) are in the retrieval set for the current query. You pay the verifier cost on the rows you're about to use, not on the whole DB. That bounds the "constantly re-checking" problem without eliminating it.
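A sketch of how those two points could combine, assuming exponential decay. The 90-day half-life, the 0.5 threshold, and every name here are illustrative placeholders, not values from the v2 design.

```python
from typing import Optional

# Hypothetical per-category half-lives in days; None means the fact never decays.
HALF_LIFE_DAYS: dict[str, Optional[float]] = {
    "model_capabilities": 90.0,
    "identity": None,
}
THRESHOLD = 0.5  # placeholder; the design calls for a per-category threshold


def freshness(category: str, age_days: float) -> float:
    """Score in (0, 1]: halves every half-life, constant for non-decaying facts."""
    half_life = HALF_LIFE_DAYS[category]
    if half_life is None:
        return 1.0
    return 0.5 ** (age_days / half_life)


def needs_reverification(category: str, age_days: float, in_retrieval_set: bool) -> bool:
    """Pay the verifier cost only for stale rows that are about to be used."""
    return in_retrieval_set and freshness(category, age_days) < THRESHOLD
```

A birthday stored under identity never trips the check, while a capability claim does once it is older than one half-life and actually shows up in a retrieval set.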

The part I don't have a clean answer for is the one you flagged: who counts as a "reliable source" per category. API docs for capabilities, primary source for historical, live API for prices, user-confirm for personal. But the registry of oracles is the core work in the v2 architecture for my agent, not a footnote. If you've published how you wired yours, I'd read it.