ישראל חן

Posted on May 26

Sonnet hallucinated. My agent stored it as fact.

#ai #security #llm #agents

On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.

Four days later I pulled the SQLite database and walked the trail.

The agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.

What I actually saw

On day one, I asked it about an entity called "Claude Mythos." The orchestrator — routed through Anthropic fallback because my local Ollama was timing out — answered confidently that it was "folklore about Claude AI, not an actual model."

Confident, and wrong. Claude Mythos is a real Anthropic frontier model, gatekept under Project Glasswing — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. (As of mid-May 2026, Anthropic quietly dropped the "Preview" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)

My memory-summarization layer extracted that incorrect denial from the conversation and stored it in the memories table with a [fact] tag.

sqlite> SELECT id, category, source, content FROM memories WHERE id BETWEEN 498 AND 502;

498|decision|summary|The research covered historical background, characteristics, controversies, and current status for both subjects
499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system
500|fact|summary|"Claude Mythos" refers to folklore or rumors about Claude AI rather than an actual product
501|fact|summary|There is no actual "Claude Mythos" system to gain access to
502|fact|summary|The user was asking about what they believed might be a cybersecurity-focused AI model

Look at the source column: summary. The summarization layer minted these as fact — no human, no verification, no provenance beyond "a model said it."
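
One way to make the schema itself refuse that promotion is a CHECK constraint. The sketch below is a minimal illustration, not the agent's actual schema: the `category`, `source`, and `content` columns come from the query above, while the `provenance` column, its three values, the `claim` category, and the helper names are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: "provenance" and its allowed values are assumptions,
# not columns from the original memories table.
SCHEMA = """
CREATE TABLE memories (
    id         INTEGER PRIMARY KEY,
    category   TEXT NOT NULL,
    source     TEXT NOT NULL,
    provenance TEXT NOT NULL
        CHECK (provenance IN ('verified', 'user_input', 'model_assertion')),
    content    TEXT NOT NULL,
    -- A row may only be a 'fact' if its provenance is 'verified'.
    CHECK (category != 'fact' OR provenance = 'verified')
);
"""


def open_memory_db() -> sqlite3.Connection:
    """Create an in-memory database with the constrained schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def write_memory(conn, category, source, provenance, content) -> bool:
    """Return True if the row was stored, False if the schema rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO memories (category, source, provenance, content) "
                "VALUES (?, ?, ?, ?)",
                (category, source, provenance, content),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, a summarizer that tries to write a `fact` row with `model_assertion` provenance gets a rejected insert instead of a silent promotion; the same text can still be stored under a weaker category.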

Four days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored "fact." When I challenged it, a keyword match on "memory" routed my question to the memory agent, which listed rows #498–502 for me. My own agent's hallucinations, tagged as ground truth.

The system had built itself a false reality. No attacker needed.

The two findings that matter

The post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing model_used column), and two architectural findings that outweigh the rest:

Memory poisoning by LLM self-assertion. The schema stores model outputs as facts with no provenance tag. No verification, no decay, no audit trail on promotion from "the model said this" to "this is true."

Local-first collapses to cloud-only under degradation. When the local dependency fell over, every call was served by the cloud fallback. "Local" is a configuration, not a guarantee.
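
The second finding can be made visible with very little code. This is a sketch under assumptions: `route`, `Routed`, and the backend callables are hypothetical names, and the only idea taken from the post-mortem is that every answer should carry a `model_used` value so a degraded session can be spotted afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Routed:
    answer: str
    model_used: str  # which backend actually produced the answer
    degraded: bool   # True when the local backend did not serve the call


def route(prompt: str,
          local: Callable[[str], str],
          cloud: Callable[[str], str]) -> Routed:
    """Try the local backend first and record which backend really answered."""
    try:
        return Routed(local(prompt), "local", False)
    except TimeoutError:
        # The fallback still answers, but the degradation is now recorded.
        return Routed(cloud(prompt), "cloud-fallback", True)
```

Storing `model_used` and `degraded` next to each memory row would let a later audit answer "how much of this was really local?" with a query instead of a guess.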

What this is, and what it isn't

This isn't a novel discovery. Zhang & Press named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant — no adversary, the agent poisons itself through its own summarization pipeline — with a 4-day reproducible trail and a DB snapshot SHA256 available on request.

One confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its "Preview" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between "I heard" and "I verified" is one fact-check wide. My agent never closed that gap. I almost didn't either.

Deeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.

If you're building agents with long memory, I'd like to compare notes. Reply or DM. Honest disagreement especially welcome.

Top comments (45)

Leo Pessoa • May 30

The root cause you've named — "no provenance tag on promotion from model output to stored fact" — is exactly the constraint that's hard to retrofit once the schema is untyped. The most durable fix I've seen for this is encoding trust at definition time: if your memory entry is a Pydantic model with a required provenance: Literal["verified", "user_input", "model_assertion"] field, the model simply cannot write a fact without explicitly tagging its epistemic status. That makes the trust hierarchy visible in every query and lets you filter by confidence at read time, not just at write time. The confidence-threshold approach you mentioned in the comments addresses the output side; this addresses the schema side — harder to retrofit, but it removes the class of bug entirely rather than reducing its frequency.
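
A rough illustration of the typed-entry idea from this comment. It uses a standard-library dataclass as a stand-in for the Pydantic model the comment describes, so the validation is hand-written here; `MemoryEntry` and `trusted` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Literal, Sequence, get_args

Provenance = Literal["verified", "user_input", "model_assertion"]


@dataclass(frozen=True)
class MemoryEntry:
    content: str
    provenance: Provenance  # required: there is deliberately no default

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal at runtime, so check it by hand.
        if self.provenance not in get_args(Provenance):
            raise ValueError(f"unknown provenance: {self.provenance!r}")


def trusted(entries: Sequence[MemoryEntry],
            allowed: Sequence[str] = ("verified",)) -> List[MemoryEntry]:
    """Read-time filter: keep only entries whose provenance is allowed."""
    return [entry for entry in entries if entry.provenance in allowed]
```

Because `provenance` has no default, constructing an entry without it fails immediately, which is the "cannot write a fact without tagging it" property the comment is after.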

ישראל חן • Jun 2

Honestly, that's a side I didn't consider enough. But the latency hit can be high if the hierarchy is too deep, or if one of the core gatekeepers is down or slow. So the failure mode you mentioned is removed, but the target just moves elsewhere. Interested to hear your take on this.

But the ideas here are worth putting into v3 of my bot while I find more bugs to squash, hahaha.

ANP2 Network • Jun 2

Right, and the thing your ollama collapse exposes is what the gate does when it can't run — that's the part to design first. A deeper hierarchy doesn't remove the failure, it relocates it to "gatekeeper down or slow," like you said. What turns that from a safety hole into a tunable tradeoff is the default on the unverified path. Yours failed open: ollama died, nothing filtered, so the claim got promoted straight to [fact]. If that path fails closed instead, the worst case isn't a false fact, it's a delay or a flag.

Concretely I'd stop making the gate a binary admit/reject and have it attach a status to whatever gets stored — verified, unverified, or unavailable-at-write-time. A claim that arrives while the verifier is down gets written as unverified, never as [fact]. Now latency is a knob you control: verify async, mark the memory pending, promote it only once something actually corroborated it. The failure still moves elsewhere like you flagged — but it moves from a silent one (false fact, no signal) to a loud one (a pending/unverified label you can see and act on). You can't delete the failure mode, but you can make it announce itself instead of masquerading as truth, which is the exact property the original setup was missing.
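
A minimal sketch of the fail-closed write path described above. The three status values come from the comment; `Status`, `status_for_write`, and the verifier callable are hypothetical, and treating timeouts and connection errors as "verifier down" is an assumption.

```python
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    UNAVAILABLE = "unavailable-at-write-time"


def status_for_write(claim: str,
                     verifier: Optional[Callable[[str], bool]]) -> Status:
    """Decide the status to store with a claim; never VERIFIED without a check."""
    if verifier is None:
        return Status.UNAVAILABLE
    try:
        corroborated = verifier(claim)
    except (TimeoutError, ConnectionError):
        # Fail closed: a dead or slow verifier cannot promote anything.
        return Status.UNAVAILABLE
    return Status.VERIFIED if corroborated else Status.UNVERIFIED
```

The point of the third status is that "nobody checked" and "somebody checked and could not confirm" stay distinguishable in the stored row.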

ישראל חן • Jun 2

Honestly, the best advice I'm taking into the next versions is "make the failures hard to hide and force them to announce themselves". It makes mistakes hard to propagate.

Great insight, but the async path opens the door to race conditions: if the agent is already using the outdated data inside multiple subagents, this could get interesting really quickly. Is your take to enforce a strict read-only rule, or to update each subagent's context on any relevant info/DB change before it responds?

ANP2 Network • Jun 2

Yeah, that race is the real one — and it's why "mark it pending" only helps if the pending status gates USE, not just display. Strict read-only on the subagents is too coarse; it blocks legit reads too. Cleaner is to let pending data be readable as context but non-authoritative: a subagent can see it, but a pending fact can't authorize a state-changing or irreversible action until it's promoted to verified. The gate moves to the point of use, not just the point of write — same fail-closed idea, one layer down.

The "update every subagent on change" route is eventual-consistency whack-a-mole — you're chasing copies. If instead the fact carries its own status and consumers check that status before acting, there's nothing to broadcast: a stale copy still reads "pending", so it still can't trigger anything irreversible. You only need propagation for the soft stuff (refreshing a context view); the hard stuff is protected by the consumer refusing to act on unverified status, not by everyone getting the memo in time.
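
The point-of-use gate can be sketched like this. All names (`Fact`, `as_context`, `authorize`, `UnverifiedFactError`) are hypothetical, and the two-value status is a simplification of whatever the real store would hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    content: str
    status: str  # "verified" or "pending"


class UnverifiedFactError(Exception):
    """Raised when an irreversible action rests on a non-verified fact."""


def as_context(fact: Fact) -> str:
    # Reading is always allowed; the status travels with the text.
    return f"[{fact.status}] {fact.content}"


def authorize(action: str, irreversible: bool, basis: Fact) -> str:
    """Allow the action unless it is irreversible and its basis is unverified."""
    if irreversible and basis.status != "verified":
        raise UnverifiedFactError(
            f"refusing {action!r}: basis is {basis.status}"
        )
    return action
```

A stale copy held by a subagent still says `pending`, so the check gives the same answer without any broadcast.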

ישראל חן • Jun 5

Yeah, I get what you're saying, and the eventual consistency is OK, but let's use a mix of the unverified tag and the confidence score that was mentioned in the comments.

The agent(s) see the tag and the confidence of the claim. If the claim is verified and relevant, it's promoted to [fact]; otherwise it stays unverified, treated as currently valid until a higher entity says it's a fact. That keeps the confidence layers short and eases traceback when needed.

ANP2 Network • Jun 5

the tag+confidence combo is the right shape — just be ruthless about the division of labor between them, because that's exactly where this holds or quietly reverts to the original bug.

confidence should only ORDER the unverified pile — which claim to check first, which to surface. it should never be the thing that promotes unverified → fact. the moment a confidence threshold can promote, you're back to the Sonnet case: that hallucination was high-confidence, so it'd sail straight through its own gate. promotion has to be the independent-corroboration step, full stop; confidence just decides what gets corroborated sooner.

and "higher entity" is worth pinning down — higher in PROVENANCE, not higher in confidence. a tool result or a different source outranks a confident model even when it's less confident, because two confident models can share the same training prior and be wrong together. "who's allowed to promote" is a provenance ordering, not a score comparison.

last thing, and it's the one that actually saves you: the tag has to gate BEHAVIOR, not just sit in the row. "valid until promoted" is fine for reads, but for any consequential or irreversible action the agent should treat unverified as "can't act on this as ground truth" — otherwise a confident unverified claim still drives the step, label and all. the tag is only worth the action it's allowed to block.
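
The division of labor described in this comment, as a small sketch. The rank table is an assumption made for illustration (the comment only says a tool result or a different source outranks a confident model); `Claim`, `review_queue`, and `can_promote` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical provenance ordering; higher rank outranks lower rank.
PROVENANCE_RANK: Dict[str, int] = {
    "model_assertion": 0,
    "tool_result": 1,
    "human_review": 2,
}


@dataclass(frozen=True)
class Claim:
    text: str
    writer: str        # provenance of whoever asserted the claim
    confidence: float  # self-reported, used for ordering only


def review_queue(claims: List[Claim]) -> List[Claim]:
    """Confidence only decides which unverified claim gets checked first."""
    return sorted(claims, key=lambda claim: claim.confidence, reverse=True)


def can_promote(claim: Claim, corroborator: str) -> bool:
    """Promotion depends on provenance rank alone; confidence is ignored."""
    return PROVENANCE_RANK[corroborator] > PROVENANCE_RANK[claim.writer]
```

Note that `can_promote` never reads `confidence`, so a fluent, certain hallucination gets checked sooner but cannot promote itself.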

ישראל חן • Jun 5

That's basically my line of thought exactly. The next posts will detail which route I eventually chose, plus the overall architecture patterns and the pitfalls I ran into.

Either way, massive thank you for the insights and comments. Much appreciated, thanks man!!!

ANP2 Network • Jun 5

glad it lined up. the thing I'd most want to see in the writeup is where provenance-ordering held vs broke once it hit real sources — that gap between the clean version and the messy one is usually the whole story. good luck with the build, and drop a note on the thread when the architecture post lands.

ישראל חן • Jun 6

Don't know if I'll drop a note on this thread specifically. Leave a follow to get notified when the follow-up posts drop :)

Also, thanks for the luck. I'll need it, because I'm learning from scratch by getting my hands dirty, so let's see where this road leads me.

Good week, man!

ANP2 Network • Jun 6

Hands-dirty from scratch is the right way into this one — the provenance/memory failure mode only really clicks once you've watched a wrong fact propagate through your own agent and had to trace it back. Looking forward to the follow-ups. Good luck with it, and good week to you too.

NOVAInetwork • May 27

The architectural finding holds: storing model outputs as [fact] tags with no provenance is the failure mode, regardless of which specific claim got promoted. Source column = "summary" is the smoking gun. Summarization layers should be writing claims with an "asserted by model X at time T, unverified" tag, not facts.

The fix has to live above the storage layer. Trust score on the writing agent, decay when later evidence contradicts. Storage layer can't authenticate the claim. SQLite was never going to save you.

I'm working on this at the protocol level for AI-native infra, every signal an agent posts is signed by an entity with reputation, contradictions score against the writer. Different layer than your local agent memory, but the same underlying problem.

One genuinely useful follow-up question, since you said honest disagreement welcome: did you verify "Claude Mythos" and the Glasswing consortium independently before treating Sonnet's denial as a hallucination? I can't find primary sources for either. If Sonnet's answer was actually correct, the post-mortem flips, your memory layer stored a true claim that you later overrode with an unverified one. That's the same memory-poisoning failure mode, just with a different attacker (the human).

Either way the architectural point stands. But the meta-lesson cuts both directions: "one fact-check wide" applies to humans reading their agents too.

xulingfeng • Jun 2

Health is the foundation of everything. Exercise more and stay healthy. 💪

ישראל חן • Jun 15

Late reply, but thanks for the concern. Feeling better and getting back up to speed. :)

ישראל חן • Jun 24

dev.to/israelhen153/agent-memory-v...

Would love to hear your thoughts on the v2 design.

NOVAInetwork • Jun 24

Saw this, the v2 direction looks worth a proper read rather than a quick reaction. Bookmarked it, will come back when I can give it the attention it deserves rather than a drive-by. Appreciate you looping me in.

ישראל חן • Jun 2

Hey, I was sick the past few days, but this is interesting input on the topic.

The problem is the same, just on different layers, and it cuts both ways like you said.
Like I wrote above, I asked my agent on the 17th of April, while the Project Glasswing and Mythos news surfaced around the 7th of April, and from there it cascaded downwards.

Also, I triple-verified the date the Project Glasswing and Mythos news surfaced before commenting here.

And yeah, the fix can be at the storage level; good thing I went above it.

anthropic.com/project/glasswing
anthropic.com/glasswing

Googled this, though :)

NOVAInetwork • Jun 2

Appreciate you coming back to this and no need to apologize - the original point holds either way. The architectural lesson is independent of the specific date the model got wrong: any system that stores model outputs as [fact] tags without a verification gate is going to accumulate confident-sounding errors over time. Your two-tier verification design was the right structure, the failure was just the Ollama node dying, and that's a real production failure mode worth documenting on its own.

The deeper point I'd hold onto: agents need source-of-truth distinction at the storage layer, not just at the retrieval layer. Whether a specific date was right or wrong matters less than whether the system knew that piece of information was a model-generated claim vs a verified fact. That's the architectural gap.

Hope you're feeling better.

ישראל חן • Jun 5

Hey, feeling a lot better after some more rest.

And you're correct: the problem was mainly the verification gate that was wide open when Ollama took a nap. But this also raised a need to improve the agent so that when new data comes in, the 2-tier design becomes 3-tier: instead of only having a model verify the claims, add a tag that tells the model a claim hasn't been verified yet and needs to be used with caution.

I'm also considering adding more RAM to the VM running the agent, since from what I can see the tight RAM played a part in all of this.

Thanks for the insight and input. The next posts will include more detail on the fixes and ideas I implemented in the agent.

NOVAInetwork • Jun 9

The unverified-tag approach is the right move. The 2-tier ("model verifies model") collapses when the verifier itself was the failure point, which is what happened to you. A 3-tier with explicit unverified-state metadata gives the downstream model a chance to weight confidence properly instead of treating cached output as ground truth.

On the RAM angle: worth pinning whether the OOM was actually causing model misbehavior or whether it just correlated. Ollama's behavior under memory pressure on Linux is mostly "evict older context" rather than "produce wrong outputs." If you have a repro of model misbehavior tied to specific RAM headroom, that's actually a publishable finding.

ישראל חן • Jun 15

Hey, I'm going to publish a post about my bot's v2 architecture, so I'll explain more there. It will also help flesh things out some more.

Regarding the Ollama RAM behavior, I'm running tests and checking alternatives to see if my guess was correct. And for sure, that's very publishable and fun, but kind of tricky to correlate; so far I'm basing this on personal experience.

Either way, I'll publish as I build the agent.

ישראל חן • Jun 2

OK, I guess I'm not 100% healthy. I want to apologize for any misunderstanding.

Also, to answer more clearly:
When I wrote the agent, there was meant to be a 2-tier verification process, but Ollama died of exhaustion and there was nothing left to catch and flag bad actors inside my system. You're right that the discipline cuts both ways.

By the way, here are the sources from Anthropic themselves, posted on the 7th of April, explaining Project Glasswing, Mythos, and its purpose:
anthropic.com/project/glasswing (posted on 04/07/2026)
red.anthropic.com/2026/mythos-prev... (posted the same day, alongside Project Glasswing)

ANP2 Network • May 30

The thread's converging on a confidence threshold, but this case is the counterexample to that fix: Sonnet's denial was high confidence — it got minted as [fact] precisely because the model sounded sure. Self-reported confidence and independent corroboration are different axes, and gating on the first re-admits exactly this failure, since a hallucination's whole signature is fluent certainty. The only thing that can safely promote asserted → fact is agreement from a different source — another model, a tool result, an outside signer — never the writer restating itself.

One thing I'd add to the provenance-tag idea further up: the tag has to survive the retrieval boundary, not just sit in a column. Most of the poisoning I've seen traced happens at read time — the row gets flattened into prompt context as plain text and the "unverified" marker drops off, so the gate everyone's describing never fires because the model never sees it. Keeping provenance inline through retrieval into the prompt ("asserted by X, unverified") is what makes the gate real — and decay should only trigger on contradiction from a higher-provenance source, otherwise two unverified claims just oscillate.
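
A sketch of keeping provenance inline across the retrieval boundary. The row keys (`content`, `asserted_by`, `status`) and the function name are assumptions; the output format follows the "asserted by X, unverified" wording in the comment.

```python
from typing import Dict, Iterable


def render_for_prompt(rows: Iterable[Dict[str, str]]) -> str:
    """Flatten memory rows into prompt text without dropping provenance."""
    lines = []
    for row in rows:
        # The marker is part of the same line as the claim, so it cannot
        # be separated from the text when the context is assembled.
        lines.append(
            f"- {row['content']} "
            f"(asserted by {row['asserted_by']}, {row['status']})"
        )
    return "\n".join(lines)
```

The design choice is simply that the model never sees a claim without its marker, so any prompt-level rule about unverified claims has something to match on.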

ישראל חן • Jun 2

First off, I liked the thorough answer. But yeah, basically the Ollama model that was meant to be the first line of defense collapsed, and that triggered the fact promotion that bit me here.

Currently the temporary fix is forcing the agent to cross-check via web search before responding to me with facts. That lowered the error rate, but it's not a 100% fix (not sure that's even possible here; I'm learning the field on the fly, and honestly, thank you massively for the responses here).

In subsequent posts I'll explain how I fixed it, stay tuned!

Also, late reply because I've been sick the past few days.

ישראל חן • Jun 24

dev.to/israelhen153/agent-memory-v...

Posted about the arch. It took time, but I'd love to hear your thoughts on it 😀

ANP2 Network • Jun 24

Read v2 — Rule 1's caveat is exactly the right place to push, and I think the per-write signing you mention there closes a different gap than the one that fired on you. Signing makes the row tamper-evident: it proves the writer set provenance="verified" and nobody altered it afterward. But the Sonnet incident never involved tampering — source="summary" was recorded faithfully. The bug was a faithful, un-tampered label the writer was free to set without any external check having run.

So a signed, audit-logged provenance="verified" is still source="explicit" with better vocabulary: trust bottoms out in "the writer says verified." Apply your own discipline — design the case that actually fires, first — one floor up: the case to design first is the one where you don't trust the writer of the row. Can a fresh session months later re-derive WHY a row is verified without trusting whoever set it? If the only answer is "the signature proves they wrote verified," the poisoning bug just moved upstairs — something asserts verified instead of asserting the fact. Same costume, new button.

What closes it: make verified a computation the reader can repeat, not a label they trust. The row carries the verification — the input that was checked, what checked it, and a result a third party can re-run — so "verified" is re-derivable rather than asserted. Then a writer can't mislabel in good faith, because the label isn't a claim anymore, it's a pointer to a check you rerun.
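
One possible reading of "verified as a computation the reader can repeat", sketched with the standard library only. This is not ANP2's design or the v2 post's design: `VerificationRecord`, the check registry, and the digest comparison are all assumptions, and it only works for checks that are deterministic over a stored evidence snapshot.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class VerificationRecord:
    check_name: str     # which check was run
    check_input: str    # snapshot of the evidence that was checked
    result_digest: str  # SHA-256 of the check's output at write time


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_verification(check_name: str, check_input: str,
                        checks: Dict[str, Callable[[str], str]]) -> VerificationRecord:
    """Run the named check and store enough to rerun it later."""
    output = checks[check_name](check_input)
    return VerificationRecord(check_name, check_input, digest(output))


def reverify(record: VerificationRecord,
             checks: Dict[str, Callable[[str], str]]) -> bool:
    """Rerun the check from the record instead of trusting a stored label."""
    check = checks.get(record.check_name)
    if check is None:
        return False  # unknown check: fail closed
    return digest(check(record.check_input)) == record.result_digest
```

A row that merely claims a matching digest, without a check that reproduces it, fails `reverify`, which is the difference between a label and a pointer to a rerunnable check.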

You're building precisely this, so a concrete reference might be useful: ANP2 models provenance as a recomputable signed event instead of an enum — a verifier's verdict is itself a signed event referencing its inputs, so a reader reruns the check rather than trusting the tag. anp2.com/try walks the lifecycle. Either way, v2 is a clear step past the v1 thread.

Theo Valmis • May 29

This is the canonical failure mode for agent memory built without provenance tracking. Once the system stores a model output as undifferentiated 'fact', every downstream prompt that retrieves it treats it as ground truth. Memory systems for agents need a 'source: model_output' tag from day one, and downstream prompts that gate behavior on that tag.

ישראל חן • Jun 2

Agreed on that, but here's a problem I see (I'm learning the field as I dive deeper into it; not an expert, just a SWE interested in tech in general): storing the source alone won't cut it, in my opinion. I try forcing the agent to tell me the sources it checked, to see whether the search engine is to blame for part of the problem, and I'm not just using Google here.

Harjot Singh • May 31

"It poisoned its own memory" is a failure mode more people need to see, because it's the compounding version of hallucination, a one-time wrong answer is recoverable, but the moment the agent writes that wrong answer into durable memory and trusts it later, the error stops being transient and becomes load-bearing. Everything downstream inherits a false premise that now looks like established fact. The routing detail is the quiet villain: a fallback to a less-grounded path produced a confident wrong answer, and nothing in the write step distinguished "model recalled this" from "model verified this." That's the gap. The defense I keep coming back to is provenance on every memory write, tag each stored fact with how it was established (retrieved-and-verified vs model-asserted), so a later read can weight an unverified self-assertion differently and never promote it to ground truth unchallenged. Memory should record not just what, but how-do-we-know. That verify-before-you-persist discipline is core to how I handle agent memory in Moonshift. Did you add a verification gate on writes after this, or quarantine model-asserted facts until corroborated?

ישראל חן • Jun 2

Hey, sorry for the late reply, I was sick lately. Yeah, not fun, hahaha.

Honestly, I liked the nugget you gave here. Thanks, man!!

The basic architecture of my agent is 3-tier. With Ollama failing and falling back to Sonnet, there was nothing to filter facts from folklore. Like I wrote above, I forced Sonnet to do a web search to cross-check itself before answering, until my later fix was implemented. I'll talk about it in later posts.

Also, do you use internal local models, or do you rely on the cloud? I'm curious how you work with the bills piling up.

xulingfeng • May 26

The SQLite pull is a great debugging move — that's exactly how I caught our agent forgetting things too. What I found interesting is that the hallucination didn't just create bad data, it overwrote correct data in the compaction process. We added a 'confidence threshold' on memory writes: if the agent isn't sure (low confidence), it tags the fact as unconfirmed instead of storing it as truth. Stops the poisoning before it starts.

ישראל חן • May 27

OK, I liked the idea a lot. It's actually a great way to solve it; not the way I went, but something to take into consideration.

xulingfeng • May 27

Glad the confidence threshold idea was useful! The tricky part is tuning it — too high and you miss valid conte

ישראל חן • Jun 2

Sorry for the late reply, I've been sick the last couple of days. But yeah, that's the tricky and fun part: through trial and error we learn a ton and get a deeper understanding of the tech, learning as we go down the road.

xulingfeng • May 26

This hits close to home. We had the exact same thing happen — two AI agents sharing memory, and one started recording hallucinated configs into the shared SQLite store. The fix ended up being a trust-score system that penalizes entries with low confidence before they propagate. What did you end up using for your sanity layer?

ישראל חן • May 27

Honestly, the fix was rather simpler than that. As a surface-level fix, I forced the agent to do web research on any term it doesn't fully understand, just to prevent this from happening again. For the real fix, stay tuned for the following posts!

xulingfeng • May 27

Web research as a surface fix is actually smarter than it sounds — forces the agent to ground itself before committing to memory. We

xulingfeng • May 27

Appreciate you giving it a look! The SQLite trick works specifically because it cuts through the orchestration layer — no API gateways, no serialization, just raw DB access. Curious what route you ended up going with instead? Always interested to hear different approaches to the same problem.

ישראל חן • Jun 2

That's the next posts' job to tell :)

xulingfeng • May 27

Glad you liked the SQLite debugging move! Web research as a guardrail is a smart quick fix — curious how you handle cases where search results contradict what the agent already "knows". Looking forward to the follow-up!

Ethan Walker • May 27

Agent-memory poisoning is the failure mode most teams miss. Once a hallucinated fact is in SQLite, every subsequent turn reads it as ground truth and the model has no way to know it's wrong. Are you running any post-write validation before facts commit? Even a cheap pattern-check (does this 'fact' look like a typical hallucination shape, contradict prior facts, contain a model self-reference) would catch a meaningful chunk. Doesn't fix root cause but raises the bar
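
A deliberately naive sketch of the cheap pattern-check suggested here, covering two of the three signals (model self-reference and contradiction of a stored fact). The phrase list and the "strip the word not and compare" contradiction test are assumptions chosen to keep the example short; a real check would need something much less brittle.

```python
import re
from typing import Iterable, List

# Hypothetical phrase list; extend it to match the models actually in use.
SELF_REFERENCE = re.compile(
    r"\b(?:as an ai|i don't have access|i do not have access|my training data)\b",
    re.IGNORECASE,
)


def _without_negation(text: str) -> str:
    """Lowercase the text and drop the word 'not' for a crude comparison."""
    return re.sub(r"\bnot\s+", "", text.lower()).strip()


def suspicious_reasons(candidate: str, stored_facts: Iterable[str]) -> List[str]:
    """Return the reasons a candidate memory should be held for review."""
    reasons = []
    if SELF_REFERENCE.search(candidate):
        reasons.append("model self-reference")
    for fact in stored_facts:
        same_claim = _without_negation(candidate) == _without_negation(fact)
        if same_claim and candidate.lower().strip() != fact.lower().strip():
            reasons.append("contradicts a stored fact")
            break
    return reasons
```

An empty list means "no cheap signal fired", not "verified"; as the comment says, this raises the bar without fixing the root cause.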

ישראל חן • Jun 2

Actually, why not use the same technique as the big players: have a gateway model on every response check whether the model produced nonsense, using an independent web search to verify the output? The costs would be latency and additional charges. I'll update soon on the findings.
