DEV Community

ישראל חן


HECE — a forensic protocol for AI agent incidents

When my agent started returning incoherent responses on the morning of April 17, I was on a bus, connected through a mobile hotspot. I had no way to tell whether it had been hijacked or prompt-injected, had hit a framework bug, or had just broken under its own weight.

Containment-first was the only correct move there — pull the bot offline, get to a trusted network, then diagnose. The first post in this series told that story. This post is about what I did once I was actually at a keyboard.

I did not guess. I walked HECE.

Hypothesize. Evidence signatures. Check. Eliminate.

Unglamorous, but for a first-time incident like this one it worked. This is the protocol, the actual commands I ran on my own agent, the two false leads it killed, and a checklist you can run on yours.

Why guessing is not a debugging method when dealing with AI agents

Most AI incident response I see is vibes-driven. The agent does something weird, the developer guesses and patches the guess, and then either the symptom returns later in a different shape, or it doesn't and the developer concludes the guess was right. Neither outcome is a diagnosis.

The diagnostic question is not "what's a story that fits the symptom?" It's "what evidence would each candidate cause leave behind, and which evidence is actually present?" Same forensic discipline as any other incident — agents don't get a pass just because the failure modes are newer.

HECE is the simplest version of that discipline I've found that survives a real outage.

Step 1 — Hypothesize

Write down every cause you can think of, even the ones you think are unlikely. The point is not to be right at this step. The point is to be exhaustive, so the next step has something to test against.

For the April 17 incident:

  1. Account compromise — someone is talking to my agent as me
  2. Prompt injection — a malicious payload landed via RSS feed, web fetch, or message content
  3. Framework bug — python-telegram-bot, LiteLLM, or another dep did something wrong
  4. Dependency degradation — Ollama, SearXNG, or another service is malfunctioning
  5. Webhook hijack — Telegram is routing to someone else's endpoint
  6. Memory poisoning — the agent is recalling a bad fact and propagating it

Six hypotheses. Four turned out to be wrong. Two were load-bearing.

A note on completeness: include the hypotheses you think are stupid — the dumb directions are exactly the ones an outside reader would have flagged that you didn't. The two that surprised me on this run were memory poisoning (which I had not seen written up the way it actually fired) and dependency-induced fallback to a different model (which I had configured deliberately but had not modeled the failure mode of).

Step 2 — Evidence signatures

For each hypothesis, write down what evidence it would leave behind if it were true. Be specific.

| Hypothesis | If true, you'd expect to see |
| --- | --- |
| Account compromise | New user_ids in auth logs, requests from unfamiliar IPs, login events |
| Prompt injection | Crafted payload in a message, RSS item, or fetched page, with a recognizable shape |
| Framework bug | Stack traces in journald, repeatable across the same code path |
| Dependency degradation | Connection errors, timeouts, retries, fallback events in logs |
| Webhook hijack | Telegram getWebhookInfo shows the wrong URL |
| Memory poisoning | Stored facts in the DB that look like model assertions, with no provenance |

The point of this table is to make the next step mechanical. You're not staring at logs hoping a story jumps out. You're looking for specific shapes.
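One way to keep the table honest during an incident is to hold it as data, so each hypothesis carries its signature and an explicit "not checked yet" state. The sketch below is only an illustration: the `Hypothesis` class, `WORKSHEET`, `unchecked`, and `survivors` are hypothetical names, not part of the agent described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    name: str
    signature: str  # what you would expect to see if it were true
    evidence_found: Optional[bool] = None  # None until the Check step has run


# The six hypotheses and signatures from the table above.
WORKSHEET = [
    Hypothesis("Account compromise", "New user_ids, unfamiliar IPs, login events"),
    Hypothesis("Prompt injection", "Crafted payload in a message, RSS item, or fetched page"),
    Hypothesis("Framework bug", "Stack traces in journald, repeatable on the same code path"),
    Hypothesis("Dependency degradation", "Connection errors, timeouts, retries, fallback events"),
    Hypothesis("Webhook hijack", "getWebhookInfo shows the wrong URL"),
    Hypothesis("Memory poisoning", "Stored facts that look like model assertions, no provenance"),
]


def unchecked(worksheet):
    """Hypotheses that still need a Check before any interpretation."""
    return [h.name for h in worksheet if h.evidence_found is None]


def survivors(worksheet):
    """Hypotheses whose signature was actually observed."""
    return [h.name for h in worksheet if h.evidence_found is True]
```

Keeping `None` distinct from `False` matters: a hypothesis that was never checked should not be mistaken for one that was eliminated.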

Step 3 — Check

Run the commands. Don't skip ahead to interpretation. Collect, then read.

Concrete commands I ran on TONY (Ubuntu, SQLite, systemd, Telegram bot):

Account compromise:

sqlite3 data/nexus.db "SELECT DISTINCT user_id FROM conversations \
  WHERE timestamp > datetime('now', '-7 days');"

One user_id (mine). Killed hypothesis 1.
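The same check can be scripted so it compares against an allowlist instead of relying on eyeballing the output. This is a minimal sketch, assuming the `conversations` table and its `user_id` and `timestamp` columns look the way the query above implies; `unexpected_user_ids` and `allowed_ids` are hypothetical names.

```python
import sqlite3


def unexpected_user_ids(conn, allowed_ids, days=7):
    """Return user_ids seen in the last `days` days that are not in allowed_ids."""
    rows = conn.execute(
        "SELECT DISTINCT user_id FROM conversations "
        "WHERE timestamp > datetime('now', ?)",
        (f"-{int(days)} days",),  # SQLite date modifier, e.g. '-7 days'
    ).fetchall()
    seen = {row[0] for row in rows}
    return sorted(seen - set(allowed_ids))
```

An empty list is the signature for "no evidence of a second user in the conversation log". It says nothing about login events or source IPs, which need their own logs.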

Prompt injection:

sqlite3 data/nexus.db "SELECT content FROM conversations \
  WHERE timestamp BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00' \
  AND role = 'user';"

No crafted payloads. Just my own normal questions. Killed hypothesis 2.

Framework bug:

journalctl -u nexus.service --since "2026-04-17 06:00" \
  --until "2026-04-17 08:00" | grep -iE "traceback|exception|error"

No stack traces in the failing window. Killed hypothesis 3.

Dependency degradation:

journalctl -u nexus.service --since "2026-04-17 06:00" | \
  grep -iE "fallback|timeout|connection"

This one lit up. Lines like:

WARNING: LLM call failed with 'ollama' provider, falling back to 'anthropic':
  litellm.Timeout: Connection timed out after 60.0 seconds

Every single orchestrator call in the incident window had this pattern.
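To turn "every call had this pattern" into a number, the fallback lines can be counted per provider pair. The sketch below assumes the log wording shown above stays stable; the regex and the `count_fallbacks` name are illustrative, not part of the agent.

```python
import re
from collections import Counter

# Matches the warning format shown above; adjust if your log wording differs.
FALLBACK_RE = re.compile(
    r"LLM call failed with '(?P<failed>[^']+)' provider, "
    r"falling back to '(?P<fallback>[^']+)'"
)


def count_fallbacks(lines):
    """Count (failed_provider, fallback_provider) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[(match["failed"], match["fallback"])] += 1
    return counts
```

Piping `journalctl` output into a script that calls `count_fallbacks(sys.stdin)` gives a count to compare against the number of orchestrator calls in the same window.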

Webhook hijack:

curl -s "https://api.telegram.org/bot${TOKEN}/getWebhookInfo" | jq .

URL matched my Caddy endpoint. Killed hypothesis 5.
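A scripted version of this check removes the "does that URL look right?" judgment call. This is a sketch under assumptions: `fetch_webhook_info` and `webhook_matches` are hypothetical helper names, and the expected URL is whatever your own reverse proxy exposes. The response shape (`ok`, `result.url`) follows the Telegram Bot API.

```python
import json
import urllib.request


def fetch_webhook_info(token):
    """Call Telegram's getWebhookInfo and return the parsed JSON response."""
    url = f"https://api.telegram.org/bot{token}/getWebhookInfo"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def webhook_matches(info, expected_url):
    """True only if the API call succeeded and the registered URL is the expected one."""
    return bool(info.get("ok")) and info.get("result", {}).get("url") == expected_url
```

An empty `url` in the response means no webhook is registered at all, which also fails this check and is worth distinguishing from a hijack when you read the result.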

Memory poisoning:

sqlite3 data/nexus.db "SELECT id, category, source, content FROM memories \
  WHERE created_at BETWEEN '2026-04-17 06:00' AND '2026-04-17 08:00';"

Rows like:

499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system

A model-generated assertion stored as category=fact with source=summary. Hypothesis 6, partially confirmed.
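The signature here is a combination of two columns, so it can be queried directly rather than read off by eye. A minimal sketch, assuming the `memories` schema implied by the query above; `unverified_facts` and the `untrusted_sources` default are hypothetical, and which sources count as untrusted is a judgment call for your own system.

```python
import sqlite3


def unverified_facts(conn, start, end, untrusted_sources=("summary",)):
    """Return memory rows stored as facts whose source is model-generated."""
    placeholders = ",".join("?" for _ in untrusted_sources)
    query = (
        "SELECT id, category, source, content FROM memories "
        "WHERE category = 'fact' "
        f"AND source IN ({placeholders}) "
        "AND created_at BETWEEN ? AND ?"
    )
    return conn.execute(query, (*untrusted_sources, start, end)).fetchall()
```

Any row this returns is a claim the model made about the world that later sessions would read back as established fact.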

Step 4 — Eliminate

Cross every hypothesis off the list that's not supported by the evidence. What's left is your actual diagnosis.

Four hypotheses killed. Two surviving:

  • Dependency degradation (Ollama timing out, every call falling to Anthropic)
  • Memory poisoning (model assertions stored as facts with no provenance)

And then the thing HECE is actually for: those two aren't separate. They're the same incident at two layers. Ollama died, every orchestrator call went to a cloud model the system was told to trust, the cloud model confidently asserted something false, the summarization layer wrote that assertion into memory as [fact], and subsequent sessions read it back as ground truth.

The Ollama-timeout fix alone would have left the poisoned memory rows in the database, and the next fresh session would still have replayed the hallucination. The two-layer view is what made the second fix obvious.

The two false leads HECE saved me from

I want to be specific about this because the value of a protocol is in what it stops you from chasing.

First false lead: account compromise. My first instinct was hijack. I had a Telegram bot on a public endpoint, I was on a sketchy network, and the responses were nonsense. Every red-team reflex said "someone else is in here." Step 3 took thirty seconds and killed it cold. There is exactly one user_id in the last seven days of my agent's conversation log, and it's mine.

Second false lead: framework bug. My second instinct was that something in python-telegram-bot or LiteLLM had broken under a recent dependency bump. Step 3 took two minutes — journalctl | grep traceback over the incident window — and there were no exceptions. Whatever was happening, the code paths were completing without crashing. They were just completing wrong.

Both false leads would have eaten hours if I'd run with the first plausible story instead of checking it.

A checklist you can run on your own agent

If you're operating an agent in production and you want to be able to walk HECE in under an hour during an incident, get the following in place now:

  • [ ] Conversation log with timestamps, user_ids, and role. SQLite is fine. Just don't lose the raw history to summarization.
  • [ ] Per-call provider logging. Every LLM call records which provider/model actually served it, not just which was requested. (TONY's agent_logs.model_used column was empty during the incident. Don't ship that mistake.)
  • [ ] Structured journald output. stdlib logging with a JSON formatter. journalctl | grep is your forensic substrate.
  • [ ] Memory rows that include a source field. Even if the field is "summary" or "manual," you need something to filter on.
  • [ ] getWebhookInfo and equivalent control-plane checks bookmarked. You shouldn't be figuring out how to verify your own webhook during an outage.
  • [ ] A DB snapshot procedure that works under pressure. Mine is sqlite3 data/nexus.db ".backup data/snapshots/$(date -u +%Y%m%dT%H%M%SZ).db". Practice it before you need it.

If any of these are missing, you cannot diagnose. You can only guess.
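For the snapshot item, the same backup can be taken from Python if the agent already runs there. This is a sketch, not the procedure used on TONY: `snapshot_db` is a hypothetical helper built on the standard library's `sqlite3.Connection.backup`, and the paths are whatever your deployment uses.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def snapshot_db(db_path, snapshot_dir):
    """Copy a live SQLite database to a UTC-timestamped snapshot file."""
    snapshot_dir = Path(snapshot_dir)
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = snapshot_dir / f"{stamp}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # SQLite online backup, safe while the source is in use
    finally:
        dst.close()
        src.close()
    return target
```

Note that `sqlite3.connect` creates an empty database if `db_path` does not exist, so check the path before trusting the snapshot.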

When HECE doesn't work

HECE relies on evidence existing. If your agent isn't logging the things that would distinguish your hypotheses, the Check step is empty and you're back to vibes.

This is why instrumentation has to ship before incidents, not after. The HECE protocol is only as good as the substrate underneath it. The dominant failure mode in the few builder agents I've inspected — mine included — is forensic blindness: the agent did something wrong, and there is no log that distinguishes which subsystem did it. Small sample, consistent shape.

If you're reading this and your agent is forensic-blind, the highest-leverage hour of work you can do this week is adding model_used and a structured journald formatter. Future-you, at midnight, on a hotspot, will thank you.
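The standard library has no built-in JSON formatter, so that hour mostly goes into a small `logging.Formatter` subclass plus one log call per LLM request. The sketch below is one possible shape under stated assumptions: `JsonFormatter`, `build_logger`, `log_llm_call`, and the `model_requested` field are hypothetical names, and it relies on systemd's default of capturing a service's stderr into journald.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("model_requested", "model_used"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name="agent", stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


def log_llm_call(logger, model_requested, model_used):
    """Record which model was asked for and which one actually answered."""
    level = logging.INFO if model_requested == model_used else logging.WARNING
    logger.log(
        level,
        "llm_call",
        extra={"model_requested": model_requested, "model_used": model_used},
    )
```

Logging a mismatch at WARNING means a silent fallback shows up in a severity filter even before anyone thinks to grep for it.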

Companion post

The companion to this one is the architecture post: what I'm rebuilding the memory layer around, after the comment thread on the first post reshaped my fix. The use-time gating idea that came out of that thread is the spine of v2: promoting at write isn't enough; consumers have to check provenance before acting.

Companion post: V2 Arch for agent Memory system

If you've used HECE or something like it on your own agent and the protocol broke down somewhere, I'd like to hear where. Comment, reply, or DM. Challenges land harder than nods, so don't soften.
