ישראל חן

Posted on Jun 23

HECE — a forensic protocol for AI agent incidents

#ai #agents #security #debugging

When my agent started returning incoherent responses on the morning of April 17, I was on a bus, connected through a mobile hotspot. I had no way to tell whether it had been hijacked or prompt-injected, had hit a framework bug, or had just broken under its own weight.

Containment-first was the only correct move there — pull the bot offline, get to a trusted network, then diagnose. The first post in this series told that story. This post is about what I did once I was actually at a keyboard.

I did not guess. I walked HECE.

Hypothesize. Evidence signatures. Check. Eliminate.

Unglamorous, but for a first-time incident like this one it worked. This is the protocol, the actual commands I ran on my own agent, the two false leads it killed, and a checklist you can run on yours.

Why guessing is not a debugging method when dealing with AI agents

Most AI incident response I see is vibes-driven. The agent did something weird, the developer guesses, patches the guess, and then either the symptom returns later in a different shape, or it doesn't and the developer concludes the guess was right. Neither outcome is a diagnosis.

The diagnostic question is not "what's a story that fits the symptom?" It's "what evidence would each candidate cause leave behind, and which evidence is actually present?" Same forensic discipline as any other incident — agents don't get a pass just because the failure modes are newer.

HECE is the simplest version of that discipline I've found that survives a real outage.

Step 1 — Hypothesize

Write down every cause you can think of, even the ones you think are unlikely. The point is not to be right at this step. The point is to be exhaustive, so the next step has something to test against.

For the April 17 incident:

1. Account compromise: someone is talking to my agent as me
2. Prompt injection: a malicious payload landed via RSS feed, web fetch, or message content
3. Framework bug: python-telegram-bot, LiteLLM, or another dep did something wrong
4. Dependency degradation: Ollama, SearXNG, or another service is malfunctioning
5. Webhook hijack: Telegram is routing to someone else's endpoint
6. Memory poisoning: the agent is recalling a bad fact and propagating it

Six hypotheses. Four turned out to be wrong. Two were load-bearing.

A note on completeness: include the hypotheses you think are stupid. The dumb directions are exactly the ones an outside reader would have flagged and you didn't. The two that surprised me on this run were memory poisoning (which I had not seen written up the way it actually fired) and dependency-induced fallback to a different model (which I had configured deliberately, but whose failure mode I had never modeled).

Step 2 — Evidence signatures

For each hypothesis, write down what evidence it would leave behind if it were true. Be specific.

Hypothesis	If true, you'd expect to see
Account compromise	New user_ids in `auth` logs, requests from unfamiliar IPs, login events
Prompt injection	Crafted payload in a message, RSS item, or fetched page — recognizable shape
Framework bug	Stack traces in journald, repeatable across same code path
Dependency degradation	Connection errors, timeouts, retries, fallback events in logs
Webhook hijack	Telegram `getWebhookInfo` shows wrong URL
Memory poisoning	Stored facts in DB that look like model assertions, no provenance

The point of this table is to make the next step mechanical. You're not staring at logs hoping a story jumps out. You're looking for specific shapes.
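One way to keep that table next to the tooling is to write it down as data. The sketch below is only an illustration: the `Hypothesis` class, the `where` field, and `check_plan` are hypothetical names, and the `where` values are my paraphrase of where each signature would show up on a setup like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    signature: str  # evidence this cause would leave behind if it were true
    where: str      # where to look for that evidence (illustrative values)


# Rows transcribed from the table above.
HYPOTHESES = [
    Hypothesis("account compromise", "new user_ids, unfamiliar IPs, login events",
               "auth and conversation logs"),
    Hypothesis("prompt injection", "a crafted payload with a recognizable shape",
               "stored messages, RSS items, fetched pages"),
    Hypothesis("framework bug", "stack traces, repeatable on the same code path",
               "journald"),
    Hypothesis("dependency degradation", "connection errors, timeouts, retries, fallbacks",
               "journald"),
    Hypothesis("webhook hijack", "a URL that is not yours",
               "Telegram getWebhookInfo"),
    Hypothesis("memory poisoning", "stored facts that look like model assertions",
               "the memory table"),
]


def check_plan(hypotheses):
    """Turn the table into a numbered list of specific shapes to look for."""
    return [
        f"{i}. {h.name}: check {h.where} for {h.signature}"
        for i, h in enumerate(hypotheses, start=1)
    ]


for line in check_plan(HYPOTHESES):
    print(line)
```

Writing the list down before touching any logs is the point; the code is just a convenient place to keep it.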

Step 3 — Check

Run the commands. Don't skip ahead to interpretation. Collect, then read.

Concrete commands I ran on TONY (Ubuntu, SQLite, systemd, Telegram bot):

Account compromise:

sqlite3 data/nexus.db "SELECT DISTINCT user_id FROM conversations \
  WHERE timestamp > datetime('now', '-7 days');"

One user_id (mine). Killed hypothesis 1.
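If you would rather have the comparison done for you than eyeball the output, the same query can be wrapped in a few lines of Python. This is a sketch, not what I ran during the incident: `unexpected_user_ids` and the `allowed` set are hypothetical, and it assumes the `conversations` table stores `timestamp` as text that SQLite can compare.

```python
import sqlite3


def unexpected_user_ids(conn, allowed, days=7):
    """Return user_ids seen in the last `days` days that are not in `allowed`."""
    rows = conn.execute(
        "SELECT DISTINCT user_id FROM conversations "
        "WHERE timestamp > datetime('now', ?)",
        (f"-{days} days",),
    ).fetchall()
    return {row[0] for row in rows} - set(allowed)


# Usage (path from the command above; the allowed id is a placeholder):
#   conn = sqlite3.connect("data/nexus.db")
#   print(unexpected_user_ids(conn, {"<your own user_id>"}))
```

An empty set is the result that kills the hypothesis.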

Prompt injection:

sqlite3 data/nexus.db "SELECT content FROM conversations \
  WHERE timestamp BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00' \
  AND role = 'user';"

No crafted payloads. Just my own normal questions. Killed hypothesis 2.

Framework bug:

journalctl -u nexus.service --since "2026-04-17 06:00" \
  --until "2026-04-17 08:00" | grep -iE "traceback|exception|error"

No stack traces in the failing window. Killed hypothesis 3.

Dependency degradation:

journalctl -u nexus.service --since "2026-04-17 06:00" | \
  grep -iE "fallback|timeout|connection"

This one lit up. Lines like:

WARNING: LLM call failed with 'ollama' provider, falling back to 'anthropic':
  litellm.Timeout: Connection timed out after 60.0 seconds

Every single orchestrator call in the incident window had this pattern.
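To turn "every single call" into a number, the fallback lines can be counted per provider pair. A minimal sketch, assuming the log line format matches the excerpt above exactly; `FALLBACK_RE` and `count_fallbacks` are hypothetical names, and the pattern will need adjusting if your log format differs.

```python
import re
from collections import Counter

# Pattern derived from the WARNING line shown above.
FALLBACK_RE = re.compile(
    r"LLM call failed with '([^']+)' provider, falling back to '([^']+)'"
)


def count_fallbacks(lines):
    """Count (failed_provider, fallback_provider) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Usage: save the journalctl output to a file first, then
#   with open("incident.log") as f:
#       print(count_fallbacks(f))
```

Comparing that count against the number of orchestrator calls in the same window tells you whether the fallback was occasional or total.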

Webhook hijack:

curl -s "https://api.telegram.org/bot${TOKEN}/getWebhookInfo" | jq .

URL matched my Caddy endpoint. Killed hypothesis 5.
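The same check can be scripted so the comparison is not done by eye. This is a sketch under assumptions: `fetch_webhook_info` and `webhook_matches` are hypothetical helpers, and the expected URL is whatever your own reverse proxy exposes. It relies on the Bot API returning a JSON object with `ok` and a `result` containing `url`.

```python
import json
import urllib.request


def fetch_webhook_info(token: str) -> dict:
    """Call Telegram's getWebhookInfo and return the decoded JSON response."""
    url = f"https://api.telegram.org/bot{token}/getWebhookInfo"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def webhook_matches(info: dict, expected_url: str) -> bool:
    """True only if the call succeeded and the registered URL is the expected one."""
    return bool(info.get("ok")) and info.get("result", {}).get("url") == expected_url


# Usage (token and URL are placeholders):
#   info = fetch_webhook_info(os.environ["TOKEN"])
#   print(webhook_matches(info, "https://your-endpoint.example/webhook"))
```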

Memory poisoning:

sqlite3 data/nexus.db "SELECT id, category, source, content FROM memories \
  WHERE created_at BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00';"

Rows like:

499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system

A model-generated assertion stored as category=fact with source=summary. Hypothesis 6, partially confirmed.
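The shape to look for here, a row marked as a fact whose only source is a summary, can be filtered directly. A sketch assuming the `memories` columns used in the query above; `unverified_facts` is a hypothetical helper, and treating `source = 'summary'` as "no provenance" is my reading of this incident, not a general rule.

```python
import sqlite3


def unverified_facts(conn, start, end):
    """Return (id, content) for fact rows that came from summarization in a window."""
    return conn.execute(
        "SELECT id, content FROM memories "
        "WHERE category = 'fact' AND source = 'summary' "
        "AND created_at BETWEEN ? AND ? "
        "ORDER BY id",
        (start, end),
    ).fetchall()


# Usage:
#   conn = sqlite3.connect("data/nexus.db")
#   for row in unverified_facts(conn, "2026-04-17 06:00", "2026-04-17 08:00"):
#       print(row)
```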

Step 4 — Eliminate

Cross every hypothesis off the list that's not supported by the evidence. What's left is your actual diagnosis.

Four hypotheses killed. Two surviving:

Dependency degradation (Ollama timing out, every call falling to Anthropic)
Memory poisoning (model assertions stored as facts with no provenance)

And then the thing HECE is actually for: those two aren't separate. They're the same incident at two layers. Ollama died, every orchestrator call went to a cloud model the system was told to trust, the cloud model confidently asserted something false, the summarization layer wrote that assertion into memory as [fact], and subsequent sessions read it back as ground truth.

The Ollama-timeout fix alone would have left the poisoned memory rows in the database, and the next fresh session would still have replayed the hallucination. The two-layer view is what made the second fix obvious.
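The elimination step itself is small enough to write as code. In this sketch each finding is three-valued: `True` means the signature was present, `False` means it was checked and absent, and `None` means it could not be checked at all. The function name and the three-valued encoding are my own illustration, not part of any library.

```python
from typing import Dict, List, Optional, Tuple


def eliminate(findings: Dict[str, Optional[bool]]) -> Tuple[List[str], List[str]]:
    """Split hypotheses into survivors (evidence present) and unresolved (no evidence to check)."""
    survivors = [name for name, seen in findings.items() if seen is True]
    unresolved = [name for name, seen in findings.items() if seen is None]
    return survivors, unresolved


# Findings from the April 17 checks described above.
APRIL_17 = {
    "account compromise": False,
    "prompt injection": False,
    "framework bug": False,
    "dependency degradation": True,
    "webhook hijack": False,
    "memory poisoning": True,
}

survivors, unresolved = eliminate(APRIL_17)
print("surviving:", survivors)
print("unresolved:", unresolved)
```

Keeping `None` separate from `False` matters: a hypothesis you could not check has not been eliminated.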

The two false leads HECE saved me from

I want to be specific about this because the value of a protocol is in what it stops you from chasing.

First false lead: account compromise. My first instinct was hijack. I had a Telegram bot on a public endpoint, I was on a sketchy network, and the responses were nonsense. Every red-team reflex said "someone else is in here." Step 3 took thirty seconds and killed it cold. There is exactly one user_id in the last seven days of my agent's conversation history, and it's mine.

Second false lead: framework bug. My second instinct was that something in python-telegram-bot or LiteLLM had broken under a recent dependency bump. Step 3 took two minutes — journalctl | grep traceback over the incident window — and there were no exceptions. Whatever was happening, the code paths were completing without crashing. They were just completing wrong.

Both false leads would have eaten hours if I'd run with the first plausible story instead of checking it.

A checklist you can run on your own agent

If you're operating an agent in production and you want to be able to walk HECE in under an hour during an incident, get the following in place now:

[ ] Conversation log with timestamps, user_ids, and role. SQLite is fine. Just don't lose the raw history to summarization.
[ ] Per-call provider logging. Every LLM call records which provider/model actually served it, not just which was requested. (TONY's agent_logs.model_used column was empty during the incident. Don't ship that mistake.)
[ ] Structured journald output. stdlib logging with a JSON formatter. journalctl | grep is your forensic substrate.
[ ] Memory rows that include a source field. Even if the field is "summary" or "manual," you need something to filter on.
[ ] getWebhookInfo and equivalent control-plane checks bookmarked. You shouldn't be figuring out how to verify your own webhook during an outage.
[ ] A DB snapshot procedure that works under pressure. Mine is sqlite3 data/nexus.db ".backup data/snapshots/$(date -u +%Y%m%dT%H%M%SZ).db". Practice it before you need it.

If any of these are missing, you cannot diagnose. You can only guess.
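For the snapshot item, the same backup can be scripted with Python's built-in `sqlite3` backup API, with an integrity check on the copy. This is a sketch, not the procedure I use (that is the one-liner in the checklist); `snapshot` is a hypothetical helper and the paths are placeholders.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def snapshot(db_path, snapshot_dir):
    """Copy a live SQLite DB to a UTC-timestamped file and verify the copy."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(snapshot_dir) / f"{stamp}.db"
    target.parent.mkdir(parents=True, exist_ok=True)

    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # online backup, safe while the source is in use
        status = dst.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"snapshot failed integrity check: {status}")
    finally:
        dst.close()
        src.close()
    return target


# Usage: snapshot("data/nexus.db", "data/snapshots")
```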

When HECE doesn't work

HECE relies on evidence existing. If your agent isn't logging the things that would distinguish your hypotheses, the Check step is empty and you're back to vibes.

This is why instrumentation has to ship before incidents, not after. The HECE protocol is only as good as the substrate underneath it. The dominant failure mode in the few builder agents I've inspected — mine included — is forensic blindness: the agent did something wrong, and there is no log that distinguishes which subsystem did it. Small sample, consistent shape.

If you're reading this and your agent is forensic-blind, the highest-leverage hour of work you can do this week is adding model_used and a structured journald formatter. Future-you, at midnight, on a hotspot, will thank you.
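A minimal version of that hour of work might look like the following. It is a sketch using only the standard library: `JsonFormatter`, `make_logger`, and the `model_requested` field are hypothetical names, and it assumes the service runs under systemd so that anything written to stderr ends up in journald.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, including which model actually served the call."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Fields passed via `extra=`; None if the caller did not set them.
            "model_requested": getattr(record, "model_requested", None),
            "model_used": getattr(record, "model_used", None),
        }
        return json.dumps(payload)


def make_logger(name="agent", stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

After every LLM call, log both values, for example `log.info("llm call served", extra={"model_requested": requested, "model_used": served})`. When the two differ, a silent fallback becomes a line you can grep for.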

Companion post

The companion to this post is the architecture post: what I'm rebuilding the memory layer around, after the comment thread on the first post reshaped my fix. The spine of v2 is the use-time gating idea that came out of that thread. Promoting at write time isn't enough; consumers have to check provenance before acting.
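To make use-time gating concrete, here is a toy sketch of a consumer that checks provenance before treating a memory as ground truth. It is not the v2 design; `Memory`, `TRUSTED_SOURCES`, and `render_for_prompt` are hypothetical, and trusting only `manual` rows is an assumption chosen for the example.

```python
from dataclasses import dataclass

# Assumption for this sketch: only hand-entered rows count as trusted.
TRUSTED_SOURCES = {"manual"}


@dataclass(frozen=True)
class Memory:
    category: str
    source: str
    content: str


def usable_as_fact(memory: Memory) -> bool:
    """Gate at use time: the category label alone is not enough, the source must be trusted."""
    return memory.category == "fact" and memory.source in TRUSTED_SOURCES


def render_for_prompt(memories):
    """Label each memory so downstream code can tell verified facts from unverified ones."""
    lines = []
    for memory in memories:
        label = "FACT" if usable_as_fact(memory) else "UNVERIFIED"
        lines.append(f"[{label}] {memory.content}")
    return lines
```

With a gate like this, a row written as category=fact with source=summary would be read back as unverified instead of as ground truth.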

Read it here: V2 Arch for agent Memory system

If you've used HECE or something like it on your own agent and the protocol broke down somewhere, I'd like to hear where. Comment, reply, or DM. Challenges land harder than nods, so don't soften.
