DEV Community

ישראל חן
ישראל חן

Posted on

Sonnet hallucinated. My agent stored it as fact.

Sonnet hallucinated. My agent stored it as fact.

On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.

Four days later I pulled the SQLite database and walked the trail.

The agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.

What I actually saw

On day one, I asked it about an entity called "Claude Mythos." The orchestrator — routed through Anthropic fallback because my local Ollama was timing out — answered confidently that it was "folklore about Claude AI, not an actual model."

Confident, and wrong. Claude Mythos is a real Anthropic frontier model, gatekept under Project Glasswing — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. (As of mid-May 2026, Anthropic quietly dropped the "Preview" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)

My memory-summarization layer extracted that incorrect denial from the conversation and stored it in the memories table with a [fact] tag.

sqlite> SELECT id, category, source, content FROM memories WHERE id BETWEEN 498 AND 502;

498|decision|summary|The research covered historical background, characteristics, controversies, and current status for both subjects
499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system
500|fact|summary|"Claude Mythos" refers to folklore or rumors about Claude AI rather than an actual product
501|fact|summary|There is no actual "Claude Mythos" system to gain access to
502|fact|summary|The user was asking about what they believed might be a cybersecurity-focused AI model
Enter fullscreen mode Exit fullscreen mode

Look at the source column: summary. The summarization layer minted these as fact — no human, no verification, no provenance beyond "a model said it."

Four days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored "fact." When I challenged it, a keyword match on "memory" routed my question to the memory agent, which listed rows #498–502 for me. My own agent's hallucinations, tagged as ground truth.

The system had built itself a false reality. No attacker needed.

The two findings that matter

The post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing model_used column), and two architectural findings that outweigh the rest:

Memory poisoning by LLM self-assertion. The schema stores model outputs as facts with no provenance tag. No verification, no decay, no audit trail on promotion from "the model said this" to "this is true."

Local-first collapses to cloud-only under degradation. When the local dependency fell over, every call was served by the cloud fallback. "Local" is a configuration, not a guarantee.

What this is, and what it isn't

This isn't a novel discovery. Zhang & Press named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant — no adversary, the agent poisons itself through its own summarization pipeline — with a 4-day reproducible trail and a DB snapshot SHA256 available on request.

One confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its "Preview" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between "I heard" and "I verified" is one fact-check wide. My agent never closed that gap. I almost didn't either.

Deeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.

If you're building agents with long memory , I'd like to compare notes. Reply or DM. Honest disagreement especially welcome.

Top comments (1)

Collapse
 
xulingfeng profile image
xulingfeng

This hits close to home. We had the exact same thing happen — two AI agents sharing memory, and one started recording hallucinated configs into the shared SQLite store. The fix ended up being a trust-score system that penalizes entries with low confidence before they propagate. What did you end up using for your sanity layer?