DEV Community: ישראל חן

HECE — a forensic protocol for AI agent incidents

ישראל חן — Tue, 23 Jun 2026 16:48:14 +0000

When my agent started returning incoherent responses on the morning of April 17, I was on a bus on a mobile hotspot. I had no way to tell whether it had been hijacked, prompt-injected, hit a framework bug, or just broken under its own weight.

Containment-first was the only correct move there — pull the bot offline, get to a trusted network, then diagnose. The first post in this series told that story. This post is about what I did once I was actually at a keyboard.

I did not guess. I walked HECE.

Hypothesize. Evidence signatures. Check. Eliminate.

Unglamorous, but for a first-time incident like this one it worked. This is the protocol, the actual commands I ran on my own agent, the two false leads it killed, and a checklist you can run on yours.

Why guessing is not a debugging method when dealing with AI agents

Most AI incident response I see is vibes-driven. The agent does something weird, the developer guesses and patches the guess, and then either the symptom returns later in a different shape, or it doesn't and the developer concludes the guess was right. Neither outcome is a diagnosis.

The diagnostic question is not "what's a story that fits the symptom?" It's "what evidence would each candidate cause leave behind, and which evidence is actually present?" Same forensic discipline as any other incident — agents don't get a pass just because the failure modes are newer.

HECE is the simplest version of that discipline I've found that survives a real outage.

Step 1 — Hypothesize

Write down every cause you can think of, even the ones you think are unlikely. The point is not to be right at this step. The point is to be exhaustive, so the next step has something to test against.

For the April 17 incident:

Account compromise — someone is talking to my agent as me
Prompt injection — a malicious payload landed via RSS feed, web fetch, or message content
Framework bug — python-telegram-bot, LiteLLM, or another dep did something wrong
Dependency degradation — Ollama, SearXNG, or another service is malfunctioning
Webhook hijack — Telegram is routing to someone else's endpoint
Memory poisoning — the agent is recalling a bad fact and propagating it

Six hypotheses. Four turned out to be wrong. Two were load-bearing.

A note on completeness: include the hypotheses you think are stupid. The dumb directions are exactly the ones an outside reader would have flagged and you didn't. The two that surprised me on this run were memory poisoning (which I had not seen written up the way it actually fired) and dependency-induced fallback to a different model (a path I had configured deliberately without ever modeling how it could fail).

Step 2 — Evidence signatures

For each hypothesis, write down what evidence it would leave behind if it were true. Be specific.

Hypothesis	If true, you'd expect to see
Account compromise	New user_ids in `auth` logs, requests from unfamiliar IPs, login events
Prompt injection	Crafted payload in a message, RSS item, or fetched page — recognizable shape
Framework bug	Stack traces in journald, repeatable across same code path
Dependency degradation	Connection errors, timeouts, retries, fallback events in logs
Webhook hijack	Telegram `getWebhookInfo` shows wrong URL
Memory poisoning	Stored facts in DB that look like model assertions, no provenance

The point of this table is to make the next step mechanical. You're not staring at logs hoping a story jumps out. You're looking for specific shapes.

Step 3 — Check

Run the commands. Don't skip ahead to interpretation. Collect, then read.

Concrete commands I ran on TONY (Ubuntu, SQLite, systemd, Telegram bot):

Account compromise:

sqlite3 data/nexus.db "SELECT DISTINCT user_id FROM conversations \
  WHERE timestamp > datetime('now', '-7 days');"

One user_id (mine). Killed hypothesis 1.

Prompt injection:

sqlite3 data/nexus.db "SELECT content FROM conversations \
  WHERE timestamp BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00' \
  AND role = 'user';"

No crafted payloads. Just my own normal questions. Killed hypothesis 2.

Framework bug:

journalctl -u nexus.service --since "2026-04-17 06:00" \
  --until "2026-04-17 08:00" | grep -iE "traceback|exception|error"

No stack traces in the failing window. Killed hypothesis 3.

Dependency degradation:

journalctl -u nexus.service --since "2026-04-17 06:00" | \
  grep -iE "fallback|timeout|connection"

This one lit up. Lines like:

WARNING: LLM call failed with 'ollama' provider, falling back to 'anthropic':
  litellm.Timeout: Connection timed out after 60.0 seconds

Every single orchestrator call in the incident window had this pattern.
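To put a number on "every single call", the grep output can be counted instead of eyeballed. This is a minimal sketch that assumes the warning wording shown above; the function name `count_fallbacks` and the sample lines are illustrative, not part of TONY.

```python
import re
from collections import Counter

# Matches the warning format shown above; adjust if your log wording differs.
FALLBACK_RE = re.compile(
    r"LLM call failed with '(?P<failed>[^']+)' provider, "
    r"falling back to '(?P<fallback>[^']+)'"
)


def count_fallbacks(lines):
    """Count (failed_provider, fallback_provider) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[(match["failed"], match["fallback"])] += 1
    return counts


# Illustrative input: the two lines quoted above plus one unrelated line.
sample = [
    "WARNING: LLM call failed with 'ollama' provider, falling back to 'anthropic':",
    "  litellm.Timeout: Connection timed out after 60.0 seconds",
    "INFO: request served",
]
print(count_fallbacks(sample))
```

Feeding it the saved journalctl output for the incident window and comparing the total against the number of orchestrator calls turns "this one lit up" into a ratio you can write down.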

Webhook hijack:

curl -s "https://api.telegram.org/bot${TOKEN}/getWebhookInfo" | jq .

URL matched my Caddy endpoint. Killed hypothesis 5.

Memory poisoning:

sqlite3 data/nexus.db "SELECT id, category, source, content FROM memories \
  WHERE created_at BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00';"

Rows like:

499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system

A model-generated assertion stored as category=fact with source=summary. Hypothesis 6, partially confirmed.
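The same check can be kept as a reusable function, so the next incident doesn't depend on retyping the query. This is a sketch that assumes the `memories` columns used in the query above; the function name and the demo timestamps are illustrative.

```python
import sqlite3


def unprovenanced_facts(conn, start, end):
    """Return rows stored as facts by the summarizer inside [start, end]."""
    return conn.execute(
        "SELECT id, category, source, content FROM memories "
        "WHERE category = 'fact' AND source = 'summary' "
        "AND created_at BETWEEN ? AND ? ORDER BY id",
        (start, end),
    ).fetchall()


# Demo on an in-memory database; timestamps are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, category TEXT, "
    "source TEXT, content TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?)",
    [
        (498, "decision", "summary",
         "The research covered historical background, characteristics, "
         "controversies, and current status for both subjects",
         "2026-04-17 06:30"),
        (499, "fact", "summary",
         "Claude Mythos is not a real AI model or cybersecurity system",
         "2026-04-17 06:31"),
    ],
)
suspects = unprovenanced_facts(conn, "2026-04-17 06:00", "2026-04-17 08:00")
print(suspects)
```

Filtering on `category` and `source` together is the point: a fact whose only source is the summarizer is exactly the shape the evidence table predicted.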

Step 4 — Eliminate

Cross every hypothesis off the list that's not supported by the evidence. What's left is your actual diagnosis.

Four hypotheses killed. Two surviving:

Dependency degradation (Ollama timing out, every call falling to Anthropic)
Memory poisoning (model assertions stored as facts with no provenance)

And then the thing HECE is actually for: those two aren't separate. They're the same incident at two layers. Ollama died, every orchestrator call went to a cloud model the system was told to trust, the cloud model confidently asserted something false, the summarization layer wrote that assertion into memory as [fact], and subsequent sessions read it back as ground truth.

The Ollama-timeout fix alone would have left the poisoned memory rows in the database, and the next fresh session would still have replayed the hallucination. The two-layer view is what made the second fix obvious.
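The bookkeeping for this step is small enough to write down. The sketch below assumes each hypothesis ends Step 3 with a yes/no answer; the function name `eliminate` and the dictionary layout are illustrative, and the values are the outcomes reported above.

```python
def eliminate(results):
    """Split checked hypotheses into survivors and eliminated.

    `results` maps hypothesis name -> True (evidence found), False (not found),
    or None (not checked yet). Unchecked hypotheses block elimination.
    """
    unchecked = [name for name, found in results.items() if found is None]
    if unchecked:
        raise ValueError(f"run Check first for: {', '.join(unchecked)}")
    survivors = [name for name, found in results.items() if found]
    eliminated = [name for name, found in results.items() if not found]
    return survivors, eliminated


# Outcomes of Step 3 as reported above.
april_17 = {
    "account compromise": False,
    "prompt injection": False,
    "framework bug": False,
    "dependency degradation": True,
    "webhook hijack": False,
    "memory poisoning": True,
}
survivors, eliminated = eliminate(april_17)
print(survivors)
```

Refusing to eliminate while anything is unchecked is deliberate: it stops you from declaring a diagnosis as soon as the first hypothesis lights up.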

The two false leads HECE saved me from

I want to be specific about this because the value of a protocol is in what it stops you from chasing.

First false lead: account compromise. My first instinct was hijack. I had a Telegram bot on a public endpoint, I was on a sketchy network, and the responses were nonsense — every red-team reflex said "someone else is in here." Step 3 took thirty seconds and killed it cold. There is exactly one user_id in my agent's auth logs and it's mine.

Second false lead: framework bug. My second instinct was that something in python-telegram-bot or LiteLLM had broken under a recent dependency bump. Step 3 took two minutes — journalctl | grep traceback over the incident window — and there were no exceptions. Whatever was happening, the code paths were completing without crashing. They were just completing wrong.

Both false leads would have eaten hours if I'd run with the first plausible story instead of checking it.

A checklist you can run on your own agent

If you're operating an agent in production and you want to be able to walk HECE in under an hour during an incident, get the following in place now:

[ ] Conversation log with timestamps, user_ids, and role. SQLite is fine. Just don't lose the raw history to summarization.
[ ] Per-call provider logging. Every LLM call records which provider/model actually served it, not just which was requested. (TONY's agent_logs.model_used column was empty during the incident. Don't ship that mistake.)
[ ] Structured journald output. stdlib logging with a JSON formatter. journalctl | grep is your forensic substrate.
[ ] Memory rows that include a source field. Even if the field is "summary" or "manual," you need something to filter on.
[ ] getWebhookInfo and equivalent control-plane checks bookmarked. You shouldn't be figuring out how to verify your own webhook during an outage.
[ ] A DB snapshot procedure that works under pressure. Mine is sqlite3 data/nexus.db ".backup data/snapshots/$(date -u +%Y%m%dT%H%M%SZ).db". Practice it before you need it.

If any of these are missing, you cannot diagnose. You can only guess.
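For agents that would rather take the snapshot from inside Python than shell out to the sqlite3 CLI, the standard library exposes the same online backup mechanism through `Connection.backup`. This is a sketch under that assumption; the `snapshot` function and the demo table are illustrative.

```python
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot(db_path, snapshot_dir):
    """Copy a SQLite database to a UTC-timestamped file and return its path."""
    snapshot_dir = Path(snapshot_dir)
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_path = snapshot_dir / f"{stamp}.db"
    source = sqlite3.connect(str(db_path))
    target = sqlite3.connect(str(target_path))
    try:
        source.backup(target)  # SQLite online backup API
    finally:
        target.close()
        source.close()
    return target_path


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "nexus.db"
    live = sqlite3.connect(str(db))
    live.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
    live.execute("INSERT INTO memories VALUES (499, 'example row')")
    live.commit()
    live.close()

    copy_path = snapshot(db, Path(tmp) / "snapshots")
    copy = sqlite3.connect(str(copy_path))
    restored = copy.execute("SELECT id FROM memories").fetchall()
    copy.close()
```

One caveat: `sqlite3.connect` creates an empty database if the path does not exist, so a production version should check the source path before connecting.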

When HECE doesn't work

HECE relies on evidence existing. If your agent isn't logging the things that would distinguish your hypotheses, the Check step is empty and you're back to vibes.

This is why instrumentation has to ship before incidents, not after. The HECE protocol is only as good as the substrate underneath it. The dominant failure mode in the few builder agents I've inspected — mine included — is forensic blindness: the agent did something wrong, and there is no log that distinguishes which subsystem did it. Small sample, consistent shape.

If you're reading this and your agent is forensic-blind, the most leveraged hour of work you can do this week is adding model_used and a structured journald formatter. Future-you, at midnight, on a hotspot, will thank you.
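That hour can look roughly like this: a stdlib formatter that writes one JSON object per line and carries the provider that actually served the call. The field names `model_requested` and `model_used` and the model strings are placeholders, and the sketch assumes the service writes to stderr, which systemd forwards to the journal by default.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    EXTRA_FIELDS = ("model_requested", "model_used")  # placeholder field names

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the per-call fields only when the caller supplied them.
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()  # stderr
handler.setFormatter(JsonFormatter())
log = logging.getLogger("agent.llm")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Placeholder model names; record what was asked for and what answered.
log.info(
    "llm call served",
    extra={"model_requested": "ollama/local", "model_used": "anthropic/fallback"},
)
```

Logging both fields is what makes a silent fallback visible: a line where the two differ is the evidence signature for dependency degradation.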

Companion post

The architecture post — what I'm rebuilding the memory layer around after the comment thread on the first post reshaped my fix — is the companion to this one. The use-time gating idea that came out of that thread — promoting at write isn't enough; consumers have to check provenance before acting — is the spine of v2.

Link to the post: V2 Arch for agent Memory system

If you've used HECE or something like it on your own agent and the protocol broke down somewhere, I'd like to hear where. Comments, reply, or DM — challenges land harder than nods, so don't soften.

Agent memory v2 — seven rules after the poisoning

ישראל חן — Tue, 23 Jun 2026 16:44:19 +0000

About a month ago I posted about my agent storing its own hallucinations as facts. The fix I was halfway through designing did not survive contact with the comment thread.

The thread raised points I hadn't considered and reshaped the v2 design. What follows is the seven rules I'm now building the memory layer around, plus an honest snapshot of what's actually built versus what's still on paper.

The 30-second recap

Sonnet 4.6 (the bot's live model at the time), asked about a model it lacked access to, confidently denied the model existed. My summarization layer extracted the denial and wrote it into the memories table with category=fact and source=summary. Four days later, a fresh session asked the same question, retrieved the stored row, and served the hallucination as ground truth. The full walkthrough is in the first post.

The root cause sits in two words: no provenance. The schema couldn't tell the difference between "a person verified this" and "the model said it." So when a model said it, the schema treated it as fact. The label "self-poisoning" is descriptive of that shape, not a vulnerability class — any time an agent's own output is re-ingested as input without provenance, this is the same bug under a different costume.

What follows is the architecture I'm building so that's no longer possible.

One note on the count: seven is a writing choice, not a partition. Rules 2, 3, and 7 are tightly coupled (states, fail-closed posture, the subsystem that operates them); Rules 5 and 6 are too (tag survives retrieval, gate at use). I'm presenting them as separate rules because each one is easier to reason about and verify on its own — collapsed into composite rules, the discipline buries itself.

Rule 1 — Provenance is a required field, not a column you might fill

The v1 schema had a free-text source column. "summary" was a valid value. That's how the hallucination passed through — no enum, no enforcement, no "hey, this isn't verified." Just a string the writer was free to set or leave alone.

v2 makes provenance a Pydantic Literal["verified", "unverified", "unavailable_at_write_time"] — a required field on every write. There is no path that writes a memory row without picking one of the three. You can't accidentally store a model assertion as a fact because the type system won't let you.

from typing import Literal

from pydantic import BaseModel


# v1: what shipped (the bug)
class MemoryEntry(BaseModel):
    content: str
    source: str = "explicit"      # free-text; "summary" was a valid value
    # ... category, embedding, timestamps, etc.


# v2: provenance required, three-valued, no default
# (same class name as v1 because it replaces that definition)
class MemoryEntry(BaseModel):
    content: str
    source: str = "explicit"
    provenance: Literal["verified", "unverified", "unavailable_at_write_time"]
    # ... rest unchanged

The schema change pairs with audit logging — every provenance transition (write, promote, demote) lands in agent_logs with writer, source, and reason. Same observability gap that made post #1's attribution hard: model_used was silently empty in the log, so I couldn't even tell which model had asserted the false fact. If a row can change state, the log has to record why.

Same turtles-all-the-way-down caveat from Rule 4 applies here: the audit log is only as trustworthy as the writer of the audit log. Append-only storage and per-write signing make tampering detectable, not impossible — for a solo system, detectable is enough; for a multi-tenant one, you'd want immutable storage underneath.

Obvious objection — migration cost. A solo-dev DB with a few thousand rows is fine to rewrite in a script. Larger systems hate this — every existing untyped row needs an inferred or assigned provenance value, and a wrong default on a million rows is expensive. My answer: default existing rows to unverified and let the corroboration step promote what's worth promoting. Treating legacy data as unverified is closer to the truth than treating it as fact.
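For a SQLite store like the one in this series, the "default legacy rows to unverified" step could be sketched as follows. The table layout is assumed from the queries in the first post, and `add_provenance_column` is a hypothetical helper, not shipped code.

```python
import sqlite3


def add_provenance_column(conn: sqlite3.Connection) -> None:
    """Add a provenance column; every pre-existing row becomes 'unverified'."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(memories)")]
    if "provenance" in columns:
        return  # already migrated
    conn.execute(
        "ALTER TABLE memories ADD COLUMN provenance TEXT NOT NULL "
        "DEFAULT 'unverified' CHECK (provenance IN "
        "('verified', 'unverified', 'unavailable_at_write_time'))"
    )
    conn.commit()


# Demo on an in-memory database standing in for the legacy schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, category TEXT, "
    "source TEXT, content TEXT)"
)
conn.execute(
    "INSERT INTO memories VALUES (499, 'fact', 'summary', "
    "'Claude Mythos is not a real AI model or cybersecurity system')"
)
add_provenance_column(conn)
legacy_states = conn.execute("SELECT DISTINCT provenance FROM memories").fetchall()

try:
    conn.execute("UPDATE memories SET provenance = 'fact' WHERE id = 499")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # values outside the three states violate the CHECK
```

The column-level default exists only so legacy rows get a value during the migration; the application-level model should still require an explicit choice on every new write.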

Rule 2 — Three states, not two

Binary admit/reject collapses the moment the verifier itself is down. That is exactly what happened to me — Ollama stalled, slowed down, and went to sleep on the calls that mattered, the verification step never ran, and the claim sailed through to [fact].

The three states (verified, unverified, unavailable_at_write_time) travel with the row. A claim that arrives while the verifier is down does not get written as fact. It gets written as unavailable_at_write_time and queued for promotion later. Timeout, retry, and async behavior belong to Rule 7's subsystem; this rule only guarantees the state exists for the queue to hold.

The verifier-down case is not an edge case to handle later. It's the case to design first, because it's the case that fired for me.
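A write path that treats verifier-down as a first-class outcome might look like this. It is a stdlib sketch rather than the Pydantic model: `write_memory`, `VerifierUnavailable`, and the `verify` callable are hypothetical names, and queueing is left to the promotion subsystem.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Provenance = Literal["verified", "unverified", "unavailable_at_write_time"]


class VerifierUnavailable(Exception):
    """Raised by a verifier that could not run (timeout, connection error)."""


@dataclass
class MemoryRow:
    content: str
    provenance: Provenance


def write_memory(content: str, verify: Callable[[str], bool]) -> MemoryRow:
    """Pick a provenance state for a new claim; never assume 'verified'."""
    try:
        corroborated = verify(content)
    except VerifierUnavailable:
        # Verifier down: record that explicitly so the row can be queued later.
        return MemoryRow(content, "unavailable_at_write_time")
    return MemoryRow(content, "verified" if corroborated else "unverified")


def verifier_down(_claim: str) -> bool:
    raise VerifierUnavailable("timed out")


row = write_memory("Claude Mythos is not a real AI model", verifier_down)
print(row.provenance)
```

The branch that matters is the `except`: in the binary version that path either crashes the write or silently admits the claim, and neither leaves a state the queue can act on.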

Obvious objection — three states aren't enough. What about a fact that was verified but became wrong over time? A deprecated model, a changed config, a person who moved? The fix is decay-on-contradiction, not an expired state. When a higher-provenance source contradicts a verified fact, the row gets demoted back to unverified and re-queued. Adding expired turns every read into a four-way switch; demotion keeps it at three and pushes the work to the moment the contradiction lands, which is when you actually have the new information. The demotion mechanism itself sits inside the promotion subsystem in Rule 7.
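Decay-on-contradiction can be sketched as a single transition. The integer `source_rank` scale, the `Row` fields, and the list standing in for the re-verification queue are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Row:
    id: int
    provenance: str    # "verified" | "unverified" | "unavailable_at_write_time"
    source_rank: int   # higher = more trusted source (hypothetical scale)


def on_contradiction(row: Row, contradicting_rank: int, queue: List[int]) -> Row:
    """Demote a verified row when a strictly higher-ranked source disagrees."""
    if row.provenance == "verified" and contradicting_rank > row.source_rank:
        queue.append(row.id)  # re-queue for corroboration
        return replace(row, provenance="unverified")
    return row  # lower or equal rank: the row keeps its state


queue: List[int] = []
stored = Row(id=1, provenance="verified", source_rank=2)
demoted = on_contradiction(stored, contradicting_rank=3, queue=queue)
print(demoted.provenance, queue)
```

Reads still only ever see three states; the extra work happens once, at the moment the contradiction arrives.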

Rule 3 — The unverified path fails closed

Worst case shifts from "false fact stored silently" to "pending label visible in the row." A delay or a flag instead of a confidently wrong answer downstream.

You can't delete the failure mode. You can make it announce itself instead of masquerading as truth. That's the philosophy in one sentence — the architecture that delivers it is Rule 7's promotion subsystem.

Rule 4 — Confidence orders the pile. It does not promote.

The thread converged on a confidence threshold as the fix, and the counter-argument was the one that landed hardest: Sonnet 4.6's denial in this incident was high confidence. That doesn't generalize to every hallucination (uncertainty-tuned models do hedge), but for the class of hallucinations that are fluent and certain, which this one was, a confidence threshold re-admits exactly the bug. So promotion needs a different mechanism.

Promotion has to be independent corroboration — a different source, a tool result, a second model with a different training prior. Confidence only decides which unverified claim gets corroborated next.

"Higher entity" is a provenance ordering, not a score comparison. A tool result outranks a confident model even when the model is more confident. Two confident models can share a training prior and be wrong together.

Obvious objection — turtles all the way down. Independent corroboration requires the corroborator to be honest. What corroborates the corroborator? I don't have a clean answer and I don't think one exists. The pragmatic version is a provenance hierarchy that bottoms out at the user: tool results outrank confident models, primary sources outrank tool results, user confirmation outranks both. That's a limit, not a solve. v2 will surface unresolved corroboration chains rather than hide them; "we don't yet know" should be visible, not silently treated as "yes."

I'd rather the unresolved chain show itself in the row than discover it on a bus ride, fearing the worst.
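The split between ordering and promoting can be made concrete. In the sketch below, the `SourceRank` values are my reading of the hierarchy described in this post, and `Claim`, `next_to_corroborate`, and `promote` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class SourceRank(IntEnum):
    """Provenance ordering, lowest to highest (one reading of the hierarchy above)."""
    MODEL_ASSERTION = 1
    SECOND_MODEL = 2
    TOOL_RESULT = 3
    PRIMARY_SOURCE = 4
    USER_CONFIRMATION = 5


@dataclass
class Claim:
    content: str
    confidence: float          # the model's self-reported confidence, 0..1
    asserted_by: SourceRank
    provenance: str = "unverified"


def next_to_corroborate(claims: List[Claim]) -> Optional[Claim]:
    """Confidence only orders the pile: pick the most confident unverified claim."""
    pending = [c for c in claims if c.provenance == "unverified"]
    return max(pending, key=lambda c: c.confidence, default=None)


def promote(claim: Claim, corroborator: SourceRank, agrees: bool) -> Claim:
    """Promote only when a strictly higher-ranked source agrees; confidence is ignored."""
    if agrees and corroborator > claim.asserted_by:
        claim.provenance = "verified"
    return claim


denial = Claim("Claude Mythos is not a real AI model", 0.99, SourceRank.MODEL_ASSERTION)
hedge = Claim("some other claim", 0.40, SourceRank.MODEL_ASSERTION)
first = next_to_corroborate([hedge, denial])
promote(denial, SourceRank.MODEL_ASSERTION, agrees=True)  # same rank: no promotion
```

Note that `promote` never reads `confidence`. A 0.99 denial corroborated only by a source of the same rank stays unverified, which is the behavior the incident called for.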

Rule 5 — The tag survives retrieval

This was the deepest cut from the thread, and the one I'd most underestimated.

Most of the poisoning fires at read time, not write time. The row gets flattened into prompt context as plain text, the unverified marker drops off in the join, and the downstream model never sees it. The gate everyone designed at write time never fires because the read path silently strips it.

v2 carries provenance inline through retrieval into the prompt: every fact that lands in context arrives tagged. The model sees (asserted by Sonnet, unverified, 2026-04-17), not just the bare claim. The gate is only as good as what the model actually reads.

Concretely: the Apr 17 "Claude Mythos" denial got summarized and stored. Four days later retrieval flattened it into prompt context as:

Memory: Claude Mythos is not a real AI model or cybersecurity system.

Indistinguishable from a user-confirmed fact. Under v2 the same row arrives as:

Memory (asserted by Sonnet, unverified, 2026-04-17): Claude Mythos is not a real AI model or cybersecurity system.

Same content; the tag is the difference between "the model defers" and "the model repeats."
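The rendering step that keeps the tag attached is short. This sketch assumes each row carries who asserted it and when it was written; the `asserted_by` and `written_on` fields and the `render_for_prompt` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryRow:
    content: str
    provenance: str      # "verified" | "unverified" | "unavailable_at_write_time"
    asserted_by: str     # hypothetical field: who made the claim
    written_on: str      # hypothetical field: ISO date of the write


def render_for_prompt(row: MemoryRow) -> str:
    """Flatten a row into prompt text without dropping its provenance tag."""
    tag = f"asserted by {row.asserted_by}, {row.provenance}, {row.written_on}"
    return f"Memory ({tag}): {row.content}"


row = MemoryRow(
    content="Claude Mythos is not a real AI model or cybersecurity system.",
    provenance="unverified",
    asserted_by="Sonnet",
    written_on="2026-04-17",
)
print(render_for_prompt(row))
```

Making this function the only route from a row to prompt text is what stops the tag from being dropped in the join.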

Caveat — this is the bet, not a result. The local model in v2 will get a system prompt that explicitly tells it to treat unverified rows as untrusted context; that behavior we can guarantee by construction. Whether Sonnet (or any remote model) respects the inline tag on its own without a system-prompt nudge is untested. If it doesn't, the tag still does its job: it surfaces to the user that pending data got pulled into context, so the user can intercept before the model commits to an answer.

Rule 6 — The gate is at the point of USE, not just the point of WRITE

I asked a follow-up about a race: if the write is gated but a subagent reads pending data before promotion, doesn't the subagent just act on stale information?

The cleanest answer I got back is the one I'm building toward. Pending data is readable as context — subagents can see it. But pending data cannot authorize a state-changing or irreversible action until it's promoted. Strict read-only on subagents is too coarse; it blocks legit reads. The gate moves to the point of use: every consumer of a memory row checks the tag before acting on it, not just the writer before storing it.

Same fail-closed idea, one layer down. Verification gates the data going in. Consumer discipline gates the data going out.

Obvious objection — N-consumer rewrite cost. Every agent that reads memory needs to be rewritten to check tags before acting. Multi-agent system = N rewrites. Two answers. First, the rewrite is small: a provenance check before any state-changing tool call. Cost-per-agent is hours, not weeks. Second, the boundary that matters isn't the row read, it's the tool call — agents pass conclusions to each other, and a conclusion derived from unverified data can still trigger an irreversible action two hops downstream. So the gate goes in the agent base class at the tool-call boundary: every tool_call(args) invocation checks the provenance of every memory row that fed its inputs before firing. New agents inherit it for free. Row-level gating in each consumer is too narrow; gating at the tool-call boundary catches transitive use.
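A base-class gate at the tool-call boundary might be sketched like this. It assumes the caller can say which memory rows fed the call (the hard part, and not solved here) and whether the tool changes state; `BaseAgent`, `fed_by`, and `state_changing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass(frozen=True)
class MemoryRow:
    id: int
    provenance: str  # "verified" | "unverified" | "unavailable_at_write_time"


class UnverifiedInputError(RuntimeError):
    """A state-changing tool call was fed by rows that are not verified."""


class BaseAgent:
    def tool_call(
        self,
        tool: Callable[..., Any],
        args: Dict[str, Any],
        fed_by: Iterable[MemoryRow],
        state_changing: bool,
    ) -> Any:
        """Run a tool; block state-changing calls that rest on pending rows."""
        if state_changing:
            pending = [row.id for row in fed_by if row.provenance != "verified"]
            if pending:
                raise UnverifiedInputError(f"blocked by pending rows: {pending}")
        return tool(**args)


agent = BaseAgent()
pending_row = MemoryRow(id=499, provenance="unverified")

# Reading pending data as context is allowed.
summary = agent.tool_call(lambda text: text.upper(), {"text": "ok"}, [pending_row], False)

# Acting on it is not.
try:
    agent.tool_call(lambda path: None, {"path": "/tmp/x"}, [pending_row], True)
    blocked = False
except UnverifiedInputError:
    blocked = True
```

Subclasses inherit the check without rewriting it, but the gate is only as good as the `fed_by` bookkeeping that tracks which rows reached a conclusion across hops.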

Rule 7 — Promotion is a subsystem, not a function call

The mistake I almost made was wiring promotion as a single check inside the write path: if corroborated: row.provenance = "verified". That's a function, not an architecture. Once you think through when promotion fires, what it checks, and what happens when the corroborator disagrees, the function is gone and a subsystem is in its place.

Three triggers fire promotion: an async background worker walks unverified rows on its own schedule, a fresh retrieval re-checks the tag at read time, and a user can confirm a row out of band. The subsystem resolves them by Rule 4's provenance ordering — tool result outranks a second-model prior outranks a confident assertion, with user confirmation terminal until contradicted. Same ordering, applied to a queue instead of a moment.

Two cases the function-call version would have gotten wrong. Verifier down: promotion doesn't fail — the row stays at unavailable_at_write_time and re-queues. Higher-provenance contradiction: the row gets demoted back to unverified and re-queues. Both are normal traffic, not edge cases.

Convergence policy is explicit: during read and write, the subsystem's default position on any unresolved row is unverified — promotion has to be earned, not assumed. The three triggers don't race because the default holds the line until evidence accumulates. The one exception is user contradiction: a previously verified row gets demoted back to unverified by an explicit user signal, and the subsystem re-enters the promotion queue from scratch. User wins; the rest is process.

The verifier has finite throughput, so budget is read-driven: hot rows (frequently retrieved) get checked first; cold rows can sit indefinitely because Rule 6 blocks action on them regardless of how long they wait. Cheap cold backlog beats expensive hot latency.
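Read-driven budgeting reduces to picking the hottest unverified rows first. This sketch assumes a per-row retrieval counter is available; `pick_for_verification` and the counts are illustrative.

```python
import heapq
from typing import Dict, List


def pick_for_verification(read_counts: Dict[int, int], budget: int) -> List[int]:
    """Return up to `budget` unverified row ids, most frequently retrieved first."""
    return heapq.nlargest(budget, read_counts, key=read_counts.get)


# Illustrative retrieval counts keyed by row id.
reads = {499: 12, 500: 3, 501: 0, 502: 7}
batch = pick_for_verification(reads, budget=2)
print(batch)
```

Rows that never make the cut simply stay pending, which is acceptable only because the use-time gate blocks action on them in the meantime.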

Obvious objection — this is over-engineered for a solo system. A single function would ship faster. True — and it would re-introduce the original bug the moment the verifier hiccups or a tool result contradicts a stored assertion. The subsystem isn't sized to the system today; it's sized to the failure modes the function call would silently absorb. The discipline is what's load-bearing, not the line count.

Status: all seven rules are wired into the design; none of them are in code yet. Rule 4's promotion path is still v1's confidence threshold — the live wound.

Honest snapshot

All seven have written designs above; none are in code, schema, or test plans yet. "Designed" here means "argued through and committed to," not "formalized into a spec doc." Schema and write path ship first, then retrieval, then consumer enforcement, then the promotion subsystem — several weeks of work, with a build log per piece when it lands. I'm publishing the design before the code is in because the design is the part the thread shaped, and I'd rather hear it's wrong now than after I've written the migration.

Credit where it's load-bearing

The thread on the first post was load-bearing on five of these seven rules.

Rule 1 — Particular thanks to Harjot Singh — building Moonshift — for the verify-before-you-persist framing.
Rule 2 — Shaped by this comment pushing the verifier-down case from edge to design-first.
Rule 5 — Shaped by this comment arguing the tag has to survive the retrieval boundary, not just sit in a column.
Rule 6 — Shaped by this reply walking through the race where a subagent acts on pending data before promotion.
Rule 7 — Shaped by this comment arguing promotion has to be independent corroboration, not confidence — and the tag has to gate behavior, not just sit in the row.

Other commenters pushed back on the cleaner-sounding wrong answers I was about to ship, and the architecture is better for it.

The HECE forensics methodology — the actual SQLite walkthrough I used to find the poisoning, the queries, the false leads, the audit pattern other builders can run on their own agents — is a companion post in this series.

If you're building agent memory and any of these seven rules look wrong, I'd rather find out now than after the migration. Reply, DM, or punch holes in the comments — the post is here precisely to be challenged.

Sonnet hallucinated. My agent stored it as fact.

ישראל חן — Tue, 26 May 2026 03:21:52 +0000

On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.

Four days later I pulled the SQLite database and walked the trail.

The agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.

What I actually saw

On day one, I asked it about an entity called "Claude Mythos." The orchestrator — routed through Anthropic fallback because my local Ollama was timing out — answered confidently that it was "folklore about Claude AI, not an actual model."

Confident, and wrong. Claude Mythos is a real Anthropic frontier model, gatekept under Project Glasswing — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. (As of mid-May 2026, Anthropic quietly dropped the "Preview" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)

My memory-summarization layer extracted that incorrect denial from the conversation and stored it in the memories table with a [fact] tag.

sqlite> SELECT id, category, source, content FROM memories WHERE id BETWEEN 498 AND 502;

498|decision|summary|The research covered historical background, characteristics, controversies, and current status for both subjects
499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system
500|fact|summary|"Claude Mythos" refers to folklore or rumors about Claude AI rather than an actual product
501|fact|summary|There is no actual "Claude Mythos" system to gain access to
502|fact|summary|The user was asking about what they believed might be a cybersecurity-focused AI model

Look at the source column: summary. The summarization layer minted these as fact — no human, no verification, no provenance beyond "a model said it."

Four days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored "fact." When I challenged it, a keyword match on "memory" routed my question to the memory agent, which listed rows #498–502 for me. My own agent's hallucinations, tagged as ground truth.

The system had built itself a false reality. No attacker needed.

The two findings that matter

The post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing model_used column), and two architectural findings that outweigh the rest:

Memory poisoning by LLM self-assertion. The schema stores model outputs as facts with no provenance tag. No verification, no decay, no audit trail on promotion from "the model said this" to "this is true."

Local-first collapses to cloud-only under degradation. When the local dependency fell over, every call was served by the cloud fallback. "Local" is a configuration, not a guarantee.

What this is, and what it isn't

This isn't a novel discovery. Zhang & Press named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant — no adversary, the agent poisons itself through its own summarization pipeline — with a 4-day reproducible trail and a DB snapshot SHA256 available on request.

One confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its "Preview" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between "I heard" and "I verified" is one fact-check wide. My agent never closed that gap. I almost didn't either.

Deeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.

If you're building agents with long memory, I'd like to compare notes. Reply or DM. Honest disagreement especially welcome.