<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ישראל חן</title>
    <description>The latest articles on DEV Community by ישראל חן (@israelhen153).</description>
    <link>https://dev.to/israelhen153</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907724%2F4b735d65-6cd7-4483-8424-8616ee7052b4.png</url>
      <title>DEV Community: ישראל חן</title>
      <link>https://dev.to/israelhen153</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/israelhen153"/>
    <language>en</language>
    <item>
      <title>HECE — a forensic protocol for AI agent incidents</title>
      <dc:creator>ישראל חן</dc:creator>
      <pubDate>Tue, 23 Jun 2026 16:48:14 +0000</pubDate>
      <link>https://dev.to/israelhen153/hece-a-forensic-protocol-for-ai-agent-incidents-5gnj</link>
      <guid>https://dev.to/israelhen153/hece-a-forensic-protocol-for-ai-agent-incidents-5gnj</guid>
      <description>&lt;p&gt;When my agent started returning incoherent responses on the morning of April 17, I was on a bus on a mobile hotspot. I had no way to tell whether it had been hijacked, prompt-injected, hit a framework bug, or just broken under its own weight.&lt;/p&gt;

&lt;p&gt;Containment-first was the only correct move there — pull the bot offline, get to a trusted network, then diagnose. The first post in this series told that story. This post is about what I did once I was actually at a keyboard.&lt;/p&gt;

&lt;p&gt;I did not guess. I walked HECE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesize. Evidence signatures. Check. Eliminate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unglamorous, but for a first-time incident like this one it worked. This is the protocol, the actual commands I ran on my own agent, the two false leads it killed, and a checklist you can run on yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why guessing is not a debugging method when dealing with AI agents
&lt;/h2&gt;

&lt;p&gt;Most AI incident response I see is vibes-driven. The agent did something weird, the developer guesses, patches the guess, and either the symptom returns later in a different shape or it doesn't and the developer concludes the guess was right. Neither outcome is a diagnosis.&lt;/p&gt;

&lt;p&gt;The diagnostic question is not &lt;em&gt;"what's a story that fits the symptom?"&lt;/em&gt; It's &lt;em&gt;"what evidence would each candidate cause leave behind, and which evidence is actually present?"&lt;/em&gt; Same forensic discipline as any other incident — agents don't get a pass just because the failure modes are newer.&lt;/p&gt;

&lt;p&gt;HECE is the simplest version of that discipline I've found that survives a real outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Hypothesize
&lt;/h2&gt;

&lt;p&gt;Write down every cause you can think of, even the ones you think are unlikely. The point is not to be right at this step. The point is to be exhaustive, so the next step has something to test against.&lt;/p&gt;

&lt;p&gt;For the April 17 incident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Account compromise — someone is talking to my agent as me&lt;/li&gt;
&lt;li&gt;Prompt injection — a malicious payload landed via RSS feed, web fetch, or message content&lt;/li&gt;
&lt;li&gt;Framework bug — &lt;code&gt;python-telegram-bot&lt;/code&gt;, LiteLLM, or another dep did something wrong&lt;/li&gt;
&lt;li&gt;Dependency degradation — Ollama, SearXNG, or another service is malfunctioning&lt;/li&gt;
&lt;li&gt;Webhook hijack — Telegram is routing to someone else's endpoint&lt;/li&gt;
&lt;li&gt;Memory poisoning — the agent is recalling a bad fact and propagating it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Six hypotheses. Four turned out to be wrong. Two were load-bearing.&lt;/p&gt;

&lt;p&gt;A note on completeness: include the hypotheses you think are stupid. The dumb directions are exactly the ones an outside reader would have flagged and you didn't. The two that surprised me on this run were memory poisoning (which I had not seen written up the way it actually fired) and dependency-induced fallback to a different model (a fallback I had configured deliberately, without ever modeling its &lt;em&gt;failure mode&lt;/em&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Evidence signatures
&lt;/h2&gt;

&lt;p&gt;For each hypothesis, write down what evidence it would leave behind if it were true. Be specific.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;th&gt;If true, you'd expect to see&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Account compromise&lt;/td&gt;
&lt;td&gt;New user_ids in &lt;code&gt;auth&lt;/code&gt; logs, requests from unfamiliar IPs, login events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection&lt;/td&gt;
&lt;td&gt;Crafted payload in a message, RSS item, or fetched page — recognizable shape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework bug&lt;/td&gt;
&lt;td&gt;Stack traces in journald, repeatable across same code path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency degradation&lt;/td&gt;
&lt;td&gt;Connection errors, timeouts, retries, fallback events in logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Webhook hijack&lt;/td&gt;
&lt;td&gt;Telegram &lt;code&gt;getWebhookInfo&lt;/code&gt; shows wrong URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory poisoning&lt;/td&gt;
&lt;td&gt;Stored facts in DB that look like model assertions, no provenance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point of this table is to make the next step mechanical. You're not staring at logs hoping a story jumps out. You're looking for &lt;em&gt;specific shapes&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Check
&lt;/h2&gt;

&lt;p&gt;Run the commands. Don't skip ahead to interpretation. Collect, then read.&lt;/p&gt;

&lt;p&gt;Concrete commands I ran on TONY (Ubuntu, SQLite, systemd, Telegram bot):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Account compromise:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 data/nexus.db &lt;span class="s2"&gt;"SELECT DISTINCT user_id FROM conversations &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  WHERE timestamp &amp;gt; datetime('now', '-7 days');"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One user_id (mine). Killed hypothesis 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 data/nexus.db &lt;span class="s2"&gt;"SELECT content FROM conversations &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  WHERE timestamp BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00' &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  AND role = 'user';"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No crafted payloads. Just my own normal questions. Killed hypothesis 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework bug:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nexus.service &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2026-04-17 06:00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--until&lt;/span&gt; &lt;span class="s2"&gt;"2026-04-17 08:00"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s2"&gt;"traceback|exception|error"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No stack traces in the failing window. Killed hypothesis 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency degradation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nexus.service &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2026-04-17 06:00"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s2"&gt;"fallback|timeout|connection"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one &lt;em&gt;lit up&lt;/em&gt;. Lines like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: LLM call failed with 'ollama' provider, falling back to 'anthropic':
  litellm.Timeout: Connection timed out after 60.0 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every single orchestrator call in the incident window had this pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook hijack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://api.telegram.org/bot&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/getWebhookInfo"&lt;/span&gt; | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;URL matched my Caddy endpoint. Killed hypothesis 5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory poisoning:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 data/nexus.db &lt;span class="s2"&gt;"SELECT id, category, source, content FROM memories &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  WHERE created_at BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00';"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A model-generated assertion stored as &lt;code&gt;category=fact&lt;/code&gt; with &lt;code&gt;source=summary&lt;/code&gt;. Hypothesis 6, partially confirmed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Eliminate
&lt;/h2&gt;

&lt;p&gt;Cross every hypothesis off the list that's not supported by the evidence. What's left is your actual diagnosis.&lt;/p&gt;

&lt;p&gt;Four hypotheses killed. Two surviving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency degradation (Ollama timing out, every call falling back to Anthropic)&lt;/li&gt;
&lt;li&gt;Memory poisoning (model assertions stored as facts with no provenance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then the thing HECE is actually for: those two aren't separate. They're the same incident at two layers. Ollama died, every orchestrator call went to a cloud model the system was told to trust, the cloud model confidently asserted something false, the summarization layer wrote that assertion into memory as &lt;code&gt;[fact]&lt;/code&gt;, and subsequent sessions read it back as ground truth.&lt;/p&gt;

&lt;p&gt;The Ollama-timeout fix alone would have left the poisoned memory rows in the database, and the next fresh session would still have replayed the hallucination. The two-layer view is what made the second fix obvious.&lt;/p&gt;
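
&lt;p&gt;The four steps can be sketched as a small data structure plus one function. This is a minimal illustration, not the tooling I ran: the &lt;code&gt;Hypothesis&lt;/code&gt; class, the &lt;code&gt;eliminate&lt;/code&gt; function, and the lambda checks are hypothetical stand-ins for the real SQL queries and &lt;code&gt;journalctl&lt;/code&gt; greps.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    name: str
    signature: str               # what evidence this cause would leave behind
    check: Callable[[], bool]    # returns True if that evidence is present


def eliminate(hypotheses):
    """Run every check first, then keep only the hypotheses the evidence supports."""
    results = [(h, h.check()) for h in hypotheses]   # collect, then read
    return [h.name for h, present in results if present]


# Hypothetical stand-ins for the real checks (SQL queries, journalctl greps).
hypotheses = [
    Hypothesis("account compromise", "unfamiliar user_ids", lambda: False),
    Hypothesis("dependency degradation", "fallback/timeout log lines", lambda: True),
    Hypothesis("memory poisoning", "model assertions stored as facts", lambda: True),
]

survivors = eliminate(hypotheses)
print(survivors)
```

&lt;p&gt;Whatever survives is still only a list. Noticing that two survivors are one incident at two layers remains a human step.&lt;/p&gt;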

&lt;h2&gt;
  
  
  The two false leads HECE saved me from
&lt;/h2&gt;

&lt;p&gt;I want to be specific about this because the value of a protocol is in what it stops you from chasing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First false lead: account compromise.&lt;/strong&gt; My first instinct was hijack. I had a Telegram bot on a public endpoint, I was on a sketchy network, and the responses were nonsense — every red-team reflex said "someone else is in here." Step 3 took thirty seconds and killed it cold. There is exactly one user_id in my agent's auth logs and it's mine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second false lead: framework bug.&lt;/strong&gt; My second instinct was that something in &lt;code&gt;python-telegram-bot&lt;/code&gt; or LiteLLM had broken under a recent dependency bump. Step 3 took two minutes — &lt;code&gt;journalctl | grep traceback&lt;/code&gt; over the incident window — and there were no exceptions. Whatever was happening, the code paths were completing without crashing. They were just completing wrong.&lt;/p&gt;

&lt;p&gt;Both false leads would have eaten hours if I'd run with the first plausible story instead of checking it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A checklist you can run on your own agent
&lt;/h2&gt;

&lt;p&gt;If you're operating an agent in production and you want to be able to walk HECE in under an hour during an incident, get the following in place now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Conversation log with timestamps, user_ids, and role.&lt;/strong&gt; SQLite is fine. Just don't lose the raw history to summarization.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Per-call provider logging.&lt;/strong&gt; Every LLM call records which provider/model actually served it, not just which was requested. (TONY's &lt;code&gt;agent_logs.model_used&lt;/code&gt; column was empty during the incident. Don't ship that mistake.)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Structured journald output.&lt;/strong&gt; stdlib logging with a JSON formatter. &lt;code&gt;journalctl | grep&lt;/code&gt; is your forensic substrate.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Memory rows that include a source field.&lt;/strong&gt; Even if the field is "summary" or "manual," you need something to filter on.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;&lt;code&gt;getWebhookInfo&lt;/code&gt; and equivalent control-plane checks bookmarked.&lt;/strong&gt; You shouldn't be figuring out how to verify your own webhook during an outage.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;A DB snapshot procedure that works under pressure.&lt;/strong&gt; Mine is &lt;code&gt;sqlite3 data/nexus.db ".backup data/snapshots/$(date -u +%Y%m%dT%H%M%SZ).db"&lt;/code&gt;. Practice it before you need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are missing, you cannot diagnose. You can only guess.&lt;/p&gt;
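
&lt;p&gt;For the per-call provider logging and structured output items, here is a rough sketch using only stdlib &lt;code&gt;logging&lt;/code&gt;. The &lt;code&gt;JsonFormatter&lt;/code&gt; class, the &lt;code&gt;log_llm_call&lt;/code&gt; helper, the &lt;code&gt;model_requested&lt;/code&gt; field, and the model name strings are assumptions for illustration; only &lt;code&gt;model_used&lt;/code&gt; comes from TONY.&lt;/p&gt;

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so grep and jq both work."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Fields passed via `extra=`; None means the caller did not record them.
            "model_requested": getattr(record, "model_requested", None),
            "model_used": getattr(record, "model_used", None),
        }
        return json.dumps(payload)


def log_llm_call(logger, requested, used):
    """Record which provider/model actually served the call, not just the request."""
    logger.info("llm_call", extra={"model_requested": requested, "model_used": used})


logger = logging.getLogger("agent")
handler = logging.StreamHandler()   # stderr; systemd forwards it to journald by default
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Placeholder model names, not the ones TONY runs.
log_llm_call(logger, requested="ollama/local-model", used="anthropic/fallback-model")
```

&lt;p&gt;A &lt;code&gt;null&lt;/code&gt; in &lt;code&gt;model_used&lt;/code&gt; is then greppable evidence of an instrumentation gap rather than a silent blank.&lt;/p&gt;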

&lt;h2&gt;
  
  
  When HECE doesn't work
&lt;/h2&gt;

&lt;p&gt;HECE relies on evidence existing. If your agent isn't logging the things that would distinguish your hypotheses, the Check step is empty and you're back to vibes.&lt;/p&gt;

&lt;p&gt;This is why instrumentation has to ship before incidents, not after. The HECE protocol is only as good as the substrate underneath it. The dominant failure mode in the few builder agents I've inspected — mine included — is forensic blindness: the agent did something wrong, and there is no log that distinguishes which subsystem did it. Small sample, consistent shape.&lt;/p&gt;

&lt;p&gt;If you're reading this and your agent is forensic-blind, the most leveraged hour of work you can do this week is adding &lt;code&gt;model_used&lt;/code&gt; and a structured journald formatter. Future-you, at midnight, on a hotspot, will thank you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Companion post
&lt;/h2&gt;

&lt;p&gt;The architecture post — what I'm rebuilding the memory layer around after the comment thread on the first post reshaped my fix — is the companion to this one. The use-time gating idea that came out of that thread — promoting at write isn't enough; consumers have to check provenance before acting — is the spine of v2.&lt;/p&gt;

&lt;p&gt;The companion post is here:&lt;br&gt;
&lt;a href="https://dev.to/israelhen153/agent-memory-v2-seven-rules-after-the-poisoning-2d9h"&gt;Agent memory v2: seven rules after the poisoning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've used HECE or something like it on your own agent and the protocol broke down somewhere, I'd like to hear where. Comment, reply, or DM. Challenges land harder than nods, so don't soften.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>debugging</category>
    </item>
    <item>
      <title>Agent memory v2 — seven rules after the poisoning</title>
      <dc:creator>ישראל חן</dc:creator>
      <pubDate>Tue, 23 Jun 2026 16:44:19 +0000</pubDate>
      <link>https://dev.to/israelhen153/agent-memory-v2-seven-rules-after-the-poisoning-2d9h</link>
      <guid>https://dev.to/israelhen153/agent-memory-v2-seven-rules-after-the-poisoning-2d9h</guid>
      <description>&lt;p&gt;Over a month ago I posted about my agent storing its own hallucinations as facts. The fix I was halfway through designing did not survive contact with the comment thread.&lt;/p&gt;

&lt;p&gt;The thread raised points I hadn't thought about and reshaped the v2 design. This post covers the seven rules I'm now building the memory layer around, plus an honest snapshot of what's actually built vs. what's still on paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-second recap
&lt;/h2&gt;

&lt;p&gt;Sonnet 4.6 (the bot's live model at the time), asked about a model it lacked access to, confidently denied the model existed. My summarization layer extracted the denial and wrote it into the &lt;code&gt;memories&lt;/code&gt; table with &lt;code&gt;category=fact&lt;/code&gt; and &lt;code&gt;source=summary&lt;/code&gt;. Four days later, a fresh session asked the same question, retrieved the stored row, and served the hallucination as ground truth. The full walkthrough is in &lt;a href="https://dev.to/israelhen153/sonnet-hallucinated-my-agent-stored-it-as-fact-3nl5"&gt;the first post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The root cause sits in two words: &lt;strong&gt;no provenance&lt;/strong&gt;. The schema couldn't tell the difference between "a person verified this" and "the model said it." So when a model said it, the schema treated it as fact. The label "self-poisoning" is descriptive of that shape, not a vulnerability class — any time an agent's own output is re-ingested as input without provenance, this is the same bug under a different costume.&lt;/p&gt;

&lt;p&gt;What follows is the architecture I'm building so that's no longer possible.&lt;/p&gt;

&lt;p&gt;One note on the count: seven is a writing choice, not a partition. Rules 2, 3, and 7 are tightly coupled (states, fail-closed posture, the subsystem that operates them); Rules 5 and 6 are too (tag survives retrieval, gate at use). I'm presenting them as separate rules because each one is easier to reason about and verify on its own — collapsed into composite rules, the discipline buries itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 1 — Provenance is a required field, not a column you might fill
&lt;/h2&gt;

&lt;p&gt;The v1 schema had a free-text &lt;code&gt;source&lt;/code&gt; column. &lt;code&gt;"summary"&lt;/code&gt; was a valid value. That's how the hallucination passed through — no enum, no enforcement, no "hey, this isn't verified." Just a string the writer was free to set or leave alone.&lt;/p&gt;

&lt;p&gt;v2 makes provenance a Pydantic &lt;code&gt;Literal["verified", "unverified", "unavailable_at_write_time"]&lt;/code&gt; — a required field on every write. There is no path that writes a memory row without picking one of the three. You can't accidentally store a model assertion as a fact because the type system won't let you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
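&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="c1"&gt;# v1: what shipped (the bug)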
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explicit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# free-text; "summary" was a valid value
&lt;/span&gt;    &lt;span class="c1"&gt;# ... category, embedding, timestamps, etc.
&lt;/span&gt;
&lt;span class="c1"&gt;# v2 — provenance required, three-valued, no default
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explicit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;provenance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unverified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unavailable_at_write_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# ... rest unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema change pairs with audit logging — every provenance transition (write, promote, demote) lands in &lt;code&gt;agent_logs&lt;/code&gt; with writer, source, and reason. Same observability gap that made post #1's attribution hard: &lt;code&gt;model_used&lt;/code&gt; was silently empty in the log, so I couldn't even tell &lt;em&gt;which&lt;/em&gt; model had asserted the false fact. If a row can change state, the log has to record why.&lt;/p&gt;

&lt;p&gt;Same turtles-all-the-way-down caveat from Rule 4 applies here: the audit log is only as trustworthy as the writer of the audit log. Append-only storage and per-write signing make tampering detectable, not impossible — for a solo system, detectable is enough; for a multi-tenant one, you'd want immutable storage underneath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obvious objection — migration cost.&lt;/strong&gt; A solo-dev DB with a few thousand rows is fine to rewrite in a script. Larger systems hate this — every existing untyped row needs an inferred or assigned provenance value, and a wrong default on a million rows is expensive. My answer: default existing rows to &lt;code&gt;unverified&lt;/code&gt; and let the corroboration step promote what's worth promoting. Treating legacy data as unverified is closer to the truth than treating it as fact.&lt;/p&gt;
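
&lt;p&gt;The default-to-unverified migration can be sketched against SQLite in a few lines. This is an illustration under assumptions: the in-memory database and the four columns shown stand in for the real &lt;code&gt;memories&lt;/code&gt; table, and the migration I eventually ship may look different.&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the real database; the `memories` columns shown
# here are an assumed subset of the v1 schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, category TEXT, source TEXT, content TEXT)"
)
conn.execute(
    "INSERT INTO memories (category, source, content) VALUES (?, ?, ?)",
    ("fact", "summary", "a model assertion stored without provenance"),
)

# Every legacy row gets 'unverified' via the column default; nothing is
# promoted by the migration itself.
conn.execute(
    "ALTER TABLE memories ADD COLUMN provenance TEXT NOT NULL DEFAULT 'unverified'"
)
conn.commit()

legacy = conn.execute("SELECT DISTINCT provenance FROM memories").fetchall()
print(legacy)
```

&lt;p&gt;The column default covers legacy rows only. New writes should still pass provenance explicitly, so the application-level type check stays the real enforcement point.&lt;/p&gt;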

&lt;h2&gt;
  
  
  Rule 2 — Three states, not two
&lt;/h2&gt;

&lt;p&gt;Binary admit/reject collapses the moment the verifier itself is down. That is exactly what happened to me — Ollama stalled, slowed down, and went to sleep on the calls that mattered, the verification step never ran, and the claim sailed through to &lt;code&gt;[fact]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The three states (&lt;code&gt;verified&lt;/code&gt;, &lt;code&gt;unverified&lt;/code&gt;, &lt;code&gt;unavailable_at_write_time&lt;/code&gt;) travel with the row. A claim that arrives while the verifier is down does not get written as &lt;code&gt;fact&lt;/code&gt;. It gets written as &lt;code&gt;unavailable_at_write_time&lt;/code&gt; and queued for promotion later. Timeout, retry, and async behavior belong to Rule 7's subsystem; this rule only guarantees the state exists for the queue to hold.&lt;/p&gt;

&lt;p&gt;The verifier-down case is not an edge case to handle later. It's the case to design first, because it's the case that fired for me.&lt;/p&gt;
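
&lt;p&gt;A minimal sketch of how the write path could pick a state, assuming the verifier is a plain callable that raises &lt;code&gt;TimeoutError&lt;/code&gt; or &lt;code&gt;ConnectionError&lt;/code&gt; when it is unreachable. The names &lt;code&gt;provenance_at_write&lt;/code&gt; and &lt;code&gt;verifier_down&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from typing import Callable, Literal

Provenance = Literal["verified", "unverified", "unavailable_at_write_time"]


def provenance_at_write(claim: str, verify: Callable[[str], bool]):
    """Pick the state a claim is written with; never default to a trusted state."""
    try:
        corroborated = verify(claim)
    except (TimeoutError, ConnectionError):
        # Verifier down: record that verification never ran and queue for later.
        return "unavailable_at_write_time"
    return "verified" if corroborated else "unverified"


def verifier_down(claim: str):
    # Hypothetical verifier that simulates the stalled-backend case.
    raise TimeoutError("verifier timed out")


state: Provenance = provenance_at_write("some model assertion", verifier_down)
print(state)
```

&lt;p&gt;The property that matters is that no exception path can land on &lt;code&gt;verified&lt;/code&gt;.&lt;/p&gt;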

&lt;p&gt;&lt;strong&gt;Obvious objection — three states aren't enough.&lt;/strong&gt; What about a fact that &lt;em&gt;was&lt;/em&gt; verified but became wrong over time? A deprecated model, a changed config, a person who moved? The fix is decay-on-contradiction, not an &lt;code&gt;expired&lt;/code&gt; state. When a higher-provenance source contradicts a verified fact, the row gets demoted back to &lt;code&gt;unverified&lt;/code&gt; and re-queued. Adding &lt;code&gt;expired&lt;/code&gt; turns every read into a four-way switch; demotion keeps it at three and pushes the work to the moment the contradiction lands, which is when you actually have the new information. The demotion mechanism itself sits inside the promotion subsystem in Rule 7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 3 — The unverified path fails closed
&lt;/h2&gt;

&lt;p&gt;Worst case shifts from "false fact stored silently" to "pending label visible in the row." A delay or a flag instead of a confidently wrong answer downstream.&lt;/p&gt;

&lt;p&gt;You can't delete the failure mode. You can make it announce itself instead of masquerading as truth. That's the &lt;em&gt;philosophy&lt;/em&gt; in one sentence — the architecture that delivers it is Rule 7's promotion subsystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 4 — Confidence orders the pile. It does not promote.
&lt;/h2&gt;

&lt;p&gt;The thread converged on a confidence threshold as the fix, and the counter-argument was the one that landed hardest: Sonnet 4.6's denial in this incident was &lt;em&gt;high confidence&lt;/em&gt;. That doesn't generalize to every hallucination, since uncertainty-tuned models do hedge, but for the class of hallucinations that &lt;em&gt;are&lt;/em&gt; fluent and certain (which this one was), a confidence threshold re-admits exactly the bug. So promotion needed a different mechanism.&lt;/p&gt;

&lt;p&gt;Promotion has to be &lt;strong&gt;independent corroboration&lt;/strong&gt; — a different source, a tool result, a second model with a different training prior. Confidence only decides which unverified claim gets corroborated next.&lt;/p&gt;

&lt;p&gt;"Higher provenance" is an ordering of source types, not a score comparison. A tool result outranks a confident model even when the model is more confident. Two confident models can share a training prior and be wrong together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obvious objection — turtles all the way down.&lt;/strong&gt; Independent corroboration requires the corroborator to be honest. What corroborates the corroborator? I don't have a clean answer and I don't think one exists. The pragmatic version is a provenance hierarchy that bottoms out at the user: tool results outrank confident models, primary sources outrank tool results, user confirmation outranks both. That's a limit, not a solve. v2 will surface unresolved corroboration chains rather than hide them; "we don't yet know" should be visible, not silently treated as "yes."&lt;/p&gt;

&lt;p&gt;I'd rather the unresolved chain show itself in the row than discover it on a bus ride, fearing the worst.&lt;/p&gt;
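
&lt;p&gt;The split between ordering and promotion can be sketched like this. The &lt;code&gt;SourceRank&lt;/code&gt; enum, the &lt;code&gt;Claim&lt;/code&gt; class, both functions, and the confidence numbers are hypothetical; only the ordering itself (model assertion, tool result, primary source, user confirmation) comes from the rule.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceRank(IntEnum):
    """Provenance ordering from this rule; the numbers only encode order."""
    MODEL_ASSERTION = 1
    TOOL_RESULT = 2
    PRIMARY_SOURCE = 3
    USER_CONFIRMATION = 4


@dataclass
class Claim:
    text: str
    rank: SourceRank
    confidence: float   # self-reported by the source; never used to promote


def winner(a, b):
    """Resolve a contradiction by provenance rank alone."""
    return max(a, b, key=lambda c: c.rank)


def corroboration_order(unverified):
    """Confidence only decides which unverified claim gets checked next."""
    return sorted(unverified, key=lambda c: c.confidence, reverse=True)


model = Claim("X does not exist", SourceRank.MODEL_ASSERTION, confidence=0.99)
tool = Claim("X exists", SourceRank.TOOL_RESULT, confidence=0.60)
print(winner(model, tool).text)
```

&lt;p&gt;The more confident model claim loses the contradiction but goes to the front of the corroboration queue, which is the only place confidence has a say.&lt;/p&gt;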

&lt;h2&gt;
  
  
  Rule 5 — The tag survives retrieval
&lt;/h2&gt;

&lt;p&gt;This was the deepest cut from the thread, and the one I'd most underestimated.&lt;/p&gt;

&lt;p&gt;Most of the poisoning fires at &lt;em&gt;read&lt;/em&gt; time, not write time. The row gets flattened into prompt context as plain text, the &lt;code&gt;unverified&lt;/code&gt; marker drops off in the join, and the downstream model never sees it. The gate everyone designed at write time never fires because the read path silently strips it.&lt;/p&gt;

&lt;p&gt;v2 carries provenance inline through retrieval into the prompt: every fact that lands in context arrives tagged. The model sees &lt;code&gt;(asserted by Sonnet, unverified, 2026-04-17)&lt;/code&gt;, not just the bare claim. The gate is only as good as what the model actually reads.&lt;/p&gt;

&lt;p&gt;Concretely: the Apr 17 "Claude Mythos" denial got summarized and stored. Four days later retrieval flattened it into prompt context as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Memory: Claude Mythos is not a real AI model or cybersecurity system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Indistinguishable from a user-confirmed fact. Under v2 the same row arrives as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Memory (asserted by Sonnet, unverified, 2026-04-17): Claude Mythos is not a real AI model or cybersecurity system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same content; the tag is the difference between "the model defers" and "the model repeats."&lt;/p&gt;

&lt;p&gt;Caveat — this is the bet, not a result. The local model in v2 will get a system prompt that explicitly tells it to treat &lt;code&gt;unverified&lt;/code&gt; rows as untrusted context; that behavior we can guarantee by construction. Whether Sonnet (or any remote model) respects the inline tag on its own without a system-prompt nudge is untested. If it doesn't, the tag still does its job: it surfaces to the user that pending data got pulled into context, so the user can intercept before the model commits to an answer.&lt;/p&gt;
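
&lt;p&gt;A rough sketch of the read path keeping the tag attached. The &lt;code&gt;MemoryRow&lt;/code&gt; fields and the &lt;code&gt;render_for_prompt&lt;/code&gt; name are assumptions for illustration; the output format mirrors the tagged example above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class MemoryRow:
    content: str
    asserted_by: str
    provenance: str
    written_on: str   # ISO date string


def render_for_prompt(row):
    """Keep the provenance tag attached when the row is flattened into context."""
    tag = f"asserted by {row.asserted_by}, {row.provenance}, {row.written_on}"
    return f"Memory ({tag}): {row.content}"


row = MemoryRow(
    content="Claude Mythos is not a real AI model or cybersecurity system.",
    asserted_by="Sonnet",
    provenance="unverified",
    written_on="2026-04-17",
)
print(render_for_prompt(row))
```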

&lt;h2&gt;
  
  
  Rule 6 — The gate is at the point of USE, not just the point of WRITE
&lt;/h2&gt;

&lt;p&gt;I asked a follow-up about a race: if the write is gated but a subagent reads pending data before promotion, doesn't the subagent just act on stale information?&lt;/p&gt;

&lt;p&gt;The cleanest answer I got back is the one I'm building toward. Pending data is &lt;em&gt;readable&lt;/em&gt; as context — subagents can see it. But pending data cannot &lt;em&gt;authorize&lt;/em&gt; a state-changing or irreversible action until it's promoted. Strict read-only on subagents is too coarse; it blocks legit reads. The gate moves to the point of use: every consumer of a memory row checks the tag before acting on it, not just the writer before storing it.&lt;/p&gt;

&lt;p&gt;Same fail-closed idea, one layer down. Verification gates the data going in. Consumer discipline gates the data going out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obvious objection — N-consumer rewrite cost.&lt;/strong&gt; Every agent that reads memory needs to be rewritten to check tags before acting. Multi-agent system = N rewrites. Two answers. First, the rewrite is small: a provenance check before any state-changing tool call. Cost-per-agent is hours, not weeks. Second, the boundary that matters isn't the row read, it's the &lt;em&gt;tool call&lt;/em&gt; — agents pass conclusions to each other, and a conclusion derived from &lt;code&gt;unverified&lt;/code&gt; data can still trigger an irreversible action two hops downstream. So the gate goes in the agent base class at the tool-call boundary: every &lt;code&gt;tool_call(args)&lt;/code&gt; invocation checks the provenance of every memory row that fed its inputs before firing. New agents inherit it for free. Row-level gating in each consumer is too narrow; gating at the tool-call boundary catches transitive use.&lt;/p&gt;
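
&lt;p&gt;One way the tool-call boundary gate might look, as a sketch under assumptions. &lt;code&gt;AgentBase&lt;/code&gt;, &lt;code&gt;UnverifiedInputError&lt;/code&gt;, the &lt;code&gt;source_rows&lt;/code&gt; argument, and the &lt;code&gt;state_changing&lt;/code&gt; flag are all hypothetical, and tracking which rows fed a call is the hard part this sketch takes as given.&lt;/p&gt;

```python
class UnverifiedInputError(Exception):
    """Raised when pending data would authorize a state-changing action."""


class AgentBase:
    """Hypothetical base class; subclasses inherit the gate without extra code."""

    def tool_call(self, tool, args, source_rows, state_changing=True):
        # source_rows: every memory row that fed the inputs, including rows
        # behind conclusions handed over by other agents (transitive use).
        pending = [r for r in source_rows if r["provenance"] != "verified"]
        if state_changing and pending:
            raise UnverifiedInputError(f"{len(pending)} unpromoted row(s) fed this call")
        return tool(**args)


def send_message(text):
    # Stand-in for an irreversible tool.
    return f"sent: {text}"


agent = AgentBase()
rows = [{"id": 499, "provenance": "unverified"}]
try:
    agent.tool_call(send_message, {"text": "hello"}, source_rows=rows)
    blocked = False
except UnverifiedInputError:
    blocked = True
print(blocked)
```

&lt;p&gt;Calls marked as not state-changing pass through with pending rows, which keeps pending data readable as context.&lt;/p&gt;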

&lt;h2&gt;
  
  
  Rule 7 — Promotion is a subsystem, not a function call
&lt;/h2&gt;

&lt;p&gt;The mistake I almost made was wiring promotion as a single check inside the write path: &lt;code&gt;if corroborated: row.provenance = "verified"&lt;/code&gt;. That's a function, not an architecture. Once you think through &lt;em&gt;when&lt;/em&gt; promotion fires, &lt;em&gt;what&lt;/em&gt; it checks, and &lt;em&gt;what happens when the corroborator disagrees&lt;/em&gt;, the function is gone and a subsystem is in its place.&lt;/p&gt;

&lt;p&gt;Three triggers fire promotion: an async background worker walks &lt;code&gt;unverified&lt;/code&gt; rows on its own schedule, a fresh retrieval re-checks the tag at read time, and a user can confirm a row out of band. The subsystem resolves them by Rule 4's provenance ordering — tool result outranks a second-model prior outranks a confident assertion, with user confirmation terminal until contradicted. Same ordering, applied to a queue instead of a moment.&lt;/p&gt;

&lt;p&gt;Two cases the function-call version would have gotten wrong. Verifier down: promotion doesn't fail — the row stays at &lt;code&gt;unavailable_at_write_time&lt;/code&gt; and re-queues. Higher-provenance contradiction: the row gets demoted back to &lt;code&gt;unverified&lt;/code&gt; and re-queues. Both are normal traffic, not edge cases.&lt;/p&gt;

&lt;p&gt;Convergence policy is explicit: during read and write, the subsystem's default position on any unresolved row is &lt;code&gt;unverified&lt;/code&gt;; promotion has to be earned, not assumed. The three triggers don't race because the default holds the line until evidence accumulates. The one exception is user contradiction: a previously &lt;code&gt;verified&lt;/code&gt; row gets demoted back to &lt;code&gt;unverified&lt;/code&gt; by an explicit user signal, and the row re-enters the promotion queue from scratch. User wins; the rest is process.&lt;/p&gt;
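
&lt;p&gt;The transitions described in the last two paragraphs fit in a small table. This is a sketch under stated assumptions, not a spec: the event names and the &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;effective_provenance&lt;/code&gt; helpers are hypothetical, and only the three provenance tags come from the design itself.&lt;/p&gt;

```python
# Hypothetical event names; the three provenance tags come from the design.
TRANSITIONS = {
    "corroborated": ("verified", False),
    "higher_provenance_contradiction": ("unverified", True),
    "user_contradiction": ("unverified", True),
}


def step(state: str, event: str) -> tuple[str, bool]:
    """Return (new_state, requeue) for one promotion attempt on a row."""
    if event == "verifier_down":
        # Not a failure: the row keeps its current tag and waits for a retry.
        return (state, True)
    if event not in TRANSITIONS:
        raise ValueError(f"unknown event: {event}")
    return TRANSITIONS[event]


def effective_provenance(state: str) -> str:
    """What readers and writers see: anything unresolved counts as unverified."""
    return "verified" if state == "verified" else "unverified"
```

&lt;p&gt;Keeping the default in &lt;code&gt;effective_provenance&lt;/code&gt; rather than in each trigger is what lets the triggers run independently: whichever one fires first, consumers see &lt;code&gt;unverified&lt;/code&gt; until a transition explicitly says otherwise.&lt;/p&gt;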

&lt;p&gt;The verifier has finite throughput, so budget is read-driven: hot rows (frequently retrieved) get checked first; cold rows can sit indefinitely because Rule 6 blocks action on them regardless of how long they wait. Cheap cold backlog beats expensive hot latency.&lt;/p&gt;
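
&lt;p&gt;A read-driven budget can be as small as a top-N selection over retrieval counts. The sketch below assumes a hypothetical &lt;code&gt;read_counts&lt;/code&gt; mapping from unverified row id to how often the row was retrieved; how those counts are collected is out of scope here.&lt;/p&gt;

```python
import heapq


def pick_rows_to_verify(read_counts: dict[int, int], budget: int) -> list[int]:
    """Return up to `budget` unverified row ids, most frequently read first.

    Cold rows that miss the cut simply wait for a later pass; they stay
    unverified, so consumers are still blocked from acting on them.
    """
    return heapq.nlargest(budget, read_counts, key=read_counts.get)
```

&lt;p&gt;For example, &lt;code&gt;pick_rows_to_verify({499: 12, 500: 3, 501: 7}, budget=2)&lt;/code&gt; selects rows 499 and 501 and leaves row 500 in the backlog.&lt;/p&gt;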

&lt;p&gt;&lt;strong&gt;Obvious objection — this is over-engineered for a solo system.&lt;/strong&gt; A single function would ship faster. True — and it would re-introduce the original bug the moment the verifier hiccups or a tool result contradicts a stored assertion. The subsystem isn't sized to the system today; it's sized to the failure modes the function call would silently absorb. The discipline is what's load-bearing, not the line count.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Status: all seven rules are wired into the design; none of them are in code yet. Rule 4's promotion path is still v1's confidence threshold — the live wound.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest snapshot
&lt;/h2&gt;

&lt;p&gt;All seven have written designs above; none are in code, schema, or test plans yet. "Designed" here means "argued through and committed to," not "formalized into a spec doc." Schema and write path ship first, then retrieval, then consumer enforcement, then the promotion subsystem — several weeks of work, with a build log per piece when it lands. I'm publishing the design before the code is in because the design is the part the thread shaped, and I'd rather hear it's wrong now than after I've written the migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credit where it's load-bearing
&lt;/h2&gt;

&lt;p&gt;The thread on the first post was load-bearing on five of these seven rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule 1&lt;/strong&gt; — Particular thanks to &lt;a href="https://dev.to/harjjotsinghh"&gt;Harjot Singh&lt;/a&gt; — building &lt;a href="https://moonshift.io" rel="noopener noreferrer"&gt;Moonshift&lt;/a&gt; — for the verify-before-you-persist framing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule 2&lt;/strong&gt; — Shaped by &lt;a href="https://dev.to/anp2network/comment/38pg1"&gt;this comment&lt;/a&gt; pushing the verifier-down case from edge to design-first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule 5&lt;/strong&gt; — Shaped by &lt;a href="https://dev.to/anp2network/comment/38jgb"&gt;this comment&lt;/a&gt; arguing the tag has to survive the retrieval boundary, not just sit in a column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule 6&lt;/strong&gt; — Shaped by &lt;a href="https://dev.to/anp2network/comment/38pmi"&gt;this reply&lt;/a&gt; walking through the race where a subagent acts on pending data before promotion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule 7&lt;/strong&gt; — Shaped by &lt;a href="https://dev.to/anp2network/comment/39496"&gt;this comment&lt;/a&gt; arguing promotion has to be independent corroboration, not confidence — and the tag has to gate behavior, not just sit in the row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other commenters pushed back on the cleaner-sounding wrong answers I was about to ship, and the architecture is better for it.&lt;/p&gt;

&lt;p&gt;The HECE forensics methodology — the actual SQLite walkthrough I used to find the poisoning, the queries, the false leads, the audit pattern other builders can run on their own agents — is a companion post in this series.&lt;/p&gt;

&lt;p&gt;If you're building agent memory and any of these seven rules look wrong, I'd rather find out now than after the migration. Reply, DM, or punch holes in the comments — the post is here precisely to be challenged.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Sonnet hallucinated. My agent stored it as fact.</title>
      <dc:creator>ישראל חן</dc:creator>
      <pubDate>Tue, 26 May 2026 03:21:52 +0000</pubDate>
      <link>https://dev.to/israelhen153/sonnet-hallucinated-my-agent-stored-it-as-fact-3nl5</link>
      <guid>https://dev.to/israelhen153/sonnet-hallucinated-my-agent-stored-it-as-fact-3nl5</guid>
      <description>&lt;h1&gt;
  
  
  Sonnet hallucinated. My agent stored it as fact.
&lt;/h1&gt;

&lt;p&gt;On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.&lt;/p&gt;

&lt;p&gt;Four days later I pulled the SQLite database and walked the trail.&lt;/p&gt;

&lt;p&gt;The agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually saw
&lt;/h2&gt;

&lt;p&gt;On day one, I asked it about an entity called "Claude Mythos." The orchestrator, routed through the Anthropic fallback because my local Ollama was timing out, answered confidently that it was "folklore about Claude AI, not an actual model."&lt;/p&gt;

&lt;p&gt;Confident, and wrong. &lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Claude Mythos&lt;/a&gt; is a real Anthropic frontier model, gatekept under &lt;a href="https://www.anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt; — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. (As of mid-May 2026, Anthropic quietly dropped the "Preview" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)&lt;/p&gt;

&lt;p&gt;My memory-summarization layer extracted that incorrect denial from the conversation and stored it in the &lt;code&gt;memories&lt;/code&gt; table with a &lt;code&gt;[fact]&lt;/code&gt; tag.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;498&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="mi"&gt;498&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;research&lt;/span&gt; &lt;span class="n"&gt;covered&lt;/span&gt; &lt;span class="n"&gt;historical&lt;/span&gt; &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;characteristics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;controversies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;current&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="n"&gt;subjects&lt;/span&gt;
&lt;span class="mi"&gt;499&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;Claude&lt;/span&gt; &lt;span class="n"&gt;Mythos&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;real&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;cybersecurity&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;
&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nv"&gt;"Claude Mythos"&lt;/span&gt; &lt;span class="n"&gt;refers&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;folklore&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;rumors&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;Claude&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="n"&gt;rather&lt;/span&gt; &lt;span class="k"&gt;than&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;
&lt;span class="mi"&gt;501&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;There&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nv"&gt;"Claude Mythos"&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;gain&lt;/span&gt; &lt;span class="k"&gt;access&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt;
&lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;asking&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;they&lt;/span&gt; &lt;span class="n"&gt;believed&lt;/span&gt; &lt;span class="n"&gt;might&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;cybersecurity&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;focused&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the &lt;code&gt;source&lt;/code&gt; column: &lt;code&gt;summary&lt;/code&gt;. The summarization layer minted these as &lt;code&gt;fact&lt;/code&gt; — no human, no verification, no provenance beyond "a model said it."&lt;/p&gt;
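
&lt;p&gt;The same check can be scripted so it runs across a whole table rather than a hand-picked id range. A minimal Python sketch, assuming only the four columns visible in the query above; the column types and the helper name &lt;code&gt;find_model_minted_facts&lt;/code&gt; are hypothetical, and the demo runs on an in-memory table seeded with two rows abbreviated from that output.&lt;/p&gt;

```python
import sqlite3


def find_model_minted_facts(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Return (id, content) for rows tagged fact whose source is a summary."""
    cur = conn.execute(
        "SELECT id, content FROM memories "
        "WHERE category = 'fact' AND source = 'summary' ORDER BY id"
    )
    return cur.fetchall()


# Demo on an in-memory table; column types are assumed, and the two rows are
# abbreviated from the query output shown earlier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, category TEXT, source TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO memories VALUES (?, ?, ?, ?)",
    [
        (498, "decision", "summary", "The research covered historical background"),
        (499, "fact", "summary", "Claude Mythos is not a real AI model or cybersecurity system"),
    ],
)
suspects = find_model_minted_facts(conn)
```

&lt;p&gt;Every row this returns is a claim that went from model output to stored fact with nothing in between, which makes it a reasonable first list to review by hand.&lt;/p&gt;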

&lt;p&gt;Four days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored "fact." When I challenged it, a keyword match on "memory" routed my question to the memory agent, which listed rows &lt;code&gt;#498–502&lt;/code&gt; for me. My own agent's hallucinations, tagged as ground truth.&lt;/p&gt;

&lt;p&gt;The system had built itself a false reality. No attacker needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two findings that matter
&lt;/h2&gt;

&lt;p&gt;The post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing &lt;code&gt;model_used&lt;/code&gt; column), and two architectural findings that outweigh the rest:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory poisoning by LLM self-assertion.&lt;/strong&gt; The schema stores model outputs as facts with no provenance tag. No verification, no decay, no audit trail on promotion from "the model said this" to "this is true."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local-first collapses to cloud-only under degradation.&lt;/strong&gt; When the local dependency fell over, every call was served by the cloud fallback. "Local" is a configuration, not a guarantee.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is, and what it isn't
&lt;/h2&gt;

&lt;p&gt;This isn't a novel discovery. Zhang et al. named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant (no adversary; the agent poisons itself through its own summarization pipeline), with a 4-day reproducible trail and a DB snapshot SHA256 available on request.&lt;/p&gt;

&lt;p&gt;One confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its "Preview" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between "I heard" and "I verified" is one fact-check wide. My agent never closed that gap. I almost didn't either.&lt;/p&gt;

&lt;p&gt;Deeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.&lt;/p&gt;

&lt;p&gt;If you're building agents with long memory, I'd like to compare notes. Reply or DM. Honest disagreement especially welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
