Your AI agent double-charged a customer at 03:14 UTC. You have the trace, the timestamps, the LLM call envelope, the LangSmith/Helicone/Langfuse dashboard with a green checkbox. You open the postmortem template your SRE team uses for normal services.
It does not fit.
That is the entire problem with AI agent postmortems in 2026. The templates were built for services whose failures look like 500s, stack traces, and rolling restarts. Agent failures look like:
- A tool call that succeeded but did the wrong thing.
- A retry loop that reported success three times, though only the third attempt ran the side effect.
- A prompt-injection that succeeded by tricking the dispatcher into an edge the developer never coded.
- A "success" on every observability dashboard, followed by a customer email three days later saying the result is wrong.
I have read roughly two dozen real production agent postmortems in 2026 (anonymized, from teams paying for forensic log audits). Every single one is missing at least 3 of the 6 sections below. These are the sections that actually prevent the next incident — and the ones a generic SRE template will never prompt you to write.
Here is the one-page postmortem shape that closes the gap. Use it the next time something goes wrong in production. If you would rather have a human write it for you, the $149 AI Ops Checkup is the same checklist, with someone who has read hundreds of these doing the reading.
1. Replay Fixture (not just a log)
A normal postmortem has a timeline. An agent postmortem needs a replay fixture: the exact input + tool-call sequence + model response that produced the failure, in a form you can re-run.
Most teams skip this because their agent framework doesn't make it easy — the dispatcher is stateful, the LLM call is non-deterministic, and the side effect already happened. But without a replay fixture, you cannot prove the fix works. You ship a patch, you wait for the same failure to recur, and you patch again.
Minimum viable replay fixture:
- The exact user message (or upstream trigger)
- The agent's state at the start of the failing turn (messages, tool schemas, system prompt version)
- A seeded LLM call (if you use temperature > 0, save the seed)
- The exact tool call sequence and return values
- A flag indicating which side effects already ran (and which need to be re-mocked)
If you cannot replay it, you have not actually understood the failure yet. You have just observed it.
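The five items in the minimum viable fixture can be captured in one serializable record. What follows is a minimal sketch, not any framework's real API: `ReplayFixture`, `ToolCall`, and every field name are hypothetical, chosen only to mirror the list above. It assumes your tool arguments and return values are JSON-serializable.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]
    return_value: Any
    side_effect_ran: bool  # True if the real-world effect already happened


@dataclass
class ReplayFixture:
    """Everything needed to re-run the failing turn (hypothetical shape)."""
    trigger: str                       # exact user message or upstream trigger
    messages: list[dict[str, str]]     # agent state at the start of the failing turn
    tool_schemas: list[dict[str, Any]]
    system_prompt_version: str
    llm_seed: Optional[int]            # None if no seed was available
    llm_temperature: float
    llm_response: str                  # recorded model output, replayed verbatim
    tool_calls: list[ToolCall] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested ToolCall records.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ReplayFixture":
        data = json.loads(raw)
        data["tool_calls"] = [ToolCall(**tc) for tc in data["tool_calls"]]
        return cls(**data)

    def calls_to_mock(self) -> list[str]:
        """Tools whose side effects already ran and must be mocked on replay."""
        return [tc.name for tc in self.tool_calls if tc.side_effect_ran]
```

Storing the recorded model response next to the seed is a deliberate choice in this sketch: a seed alone may not reproduce the same output, so the replay can fall back to the saved response.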
2. The Policy-Path Evidence Lattice
A software postmortem says "the code path was A -> B -> C." An agent postmortem has to answer a different question: which policy was supposed to constrain the model, and which policy did it actually violate?
This is the section that 23 of the 24 postmortems I read in 2026 left out entirely. It looks like this:
ALLOWED_POLICY: "do not call refund() without manager approval"
ACTUAL_PATH: supervisor -> refund(amount=$X) [no approval call]
EVIDENCE: trace event ts=03:14:11.4, prompt=..., tool=refund
ROOT_CAUSE: supervisor system prompt had a typo in the manager-approval rule; the model followed the rule that was actually written.
Without this, you cannot tell the difference between:
- A prompt-injection attack that succeeded (the model's policy was correct, an attacker rewrote it)
- A hallucination that succeeded (the model had no policy, it improvised)
- A typo that succeeded (the policy was never going to hold)
These three root causes need three different fixes. A postmortem that says "agent went rogue" has not done the work.
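One way to force that distinction is to make the postmortem record the facts that separate the three cases, then derive the label from them. This is a sketch under assumptions: `PolicyPathEvidence`, `RootCause`, and `classify` are hypothetical names, the two boolean fields are human judgments made while reading the prompt and the trace, and the `UNEXPLAINED` fallback is my addition for cases that fit none of the three.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RootCause(Enum):
    PROMPT_INJECTION = "policy was correct, an attacker rewrote it"
    HALLUCINATION = "no policy existed, the model improvised"
    TYPO = "policy as written was never going to hold"
    UNEXPLAINED = "fits none of the three; keep investigating"  # assumption


@dataclass
class PolicyPathEvidence:
    intended_policy: str            # what the team meant the rule to be
    written_policy: Optional[str]   # what the deployed prompt said; None if absent
    written_matches_intent: bool    # judged by a human reading the prompt
    override_found_in_trace: bool   # untrusted input countermanded the rule
    actual_path: str                # e.g. "supervisor -> refund [no approval call]"
    evidence: str                   # pointer to the trace event


def classify(e: PolicyPathEvidence) -> RootCause:
    """Map recorded evidence to one of the three root causes in the text."""
    if e.written_policy is None:
        return RootCause.HALLUCINATION
    if not e.written_matches_intent:
        return RootCause.TYPO
    if e.override_found_in_trace:
        return RootCause.PROMPT_INJECTION
    return RootCause.UNEXPLAINED
```

For the refund example above, the written rule exists but does not match the intent, so `classify` returns `RootCause.TYPO`, and the fix belongs in the prompt, not in injection defenses.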
3. The Outcome-Assert Line
Every agent has an "I think I succeeded" line in the trace. Almost none have an "I asserted this actually happened" line.
A normal postmortem section is "what the service did." An agent postmortem needs:
LLM_SAYS_IT_DID: "refund of $X to customer Y, confirmation Z"
OUTCOME_ASSERT: GET /v1/refunds/Z -> 200, amount=$X, status=settled
LATENCY_ASSERT: customer-visible latency < 2s (actual: 14s)
INTEGRITY_ASSERT: refund.Z is in customer's actual billing history
The most expensive agent failures of 2026 were all "outcome-assert gap" failures: the agent said it did the thing, the dashboard said green, the customer said nothing happened (or the opposite happened). The Sinch 2026 study found the rollback rate for teams without full eval coverage was 47%, vs 9% for teams with it. That gap is almost entirely the outcome-assert line.
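The outcome-assert line can be as small as a comparison between what the model claimed and what the system of record returns. Here is a minimal sketch, assuming you can supply a function that reads the real state; `assert_outcome`, `OutcomeAssert`, and the `fetch_actual` callable are hypothetical names, and the callable stands in for something like the `GET /v1/refunds/Z` lookup shown above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional


@dataclass
class OutcomeAssert:
    claimed: Mapping[str, Any]             # what the model said it did
    observed: Optional[Mapping[str, Any]]  # what the system of record says
    mismatches: dict[str, tuple]           # field -> (claimed, observed)

    @property
    def passed(self) -> bool:
        return self.observed is not None and not self.mismatches


def assert_outcome(
    claimed: Mapping[str, Any],
    fetch_actual: Callable[[], Optional[Mapping[str, Any]]],
) -> OutcomeAssert:
    """Compare the model's claim against the actual world state."""
    observed = fetch_actual()
    if observed is None:
        # Nothing exists in the system of record: every claimed field fails.
        missing = {k: (v, None) for k, v in claimed.items()}
        return OutcomeAssert(claimed, None, missing)
    mismatches = {
        k: (v, observed.get(k))
        for k, v in claimed.items()
        if observed.get(k) != v
    }
    return OutcomeAssert(claimed, observed, mismatches)
```

Run it right after the tool call and write the result into the trace, so a failed assertion shows up at call time instead of in a customer email. Amounts are best compared as integer minor units (cents) to avoid float equality problems.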
4. The Idempotency-Key Audit
A retry that succeeds twice is a feature in a normal service. A retry that succeeds twice and runs the side effect twice is a refund issued twice, an email sent twice, a database row created twice.
Your postmortem needs a section that says, in plain language:
WAS_THIS_CALL_IDEMPOTENT: no
SHOULD_IT_HAVE_BEEN: yes
KEY_PRESENT_IN_TRACE: no
KEY_PRESENT_IN_TOOL_SCHEMA: yes (tool spec required it)
KEY_PRESENT_IN_AGENT_PROMPT: no (model was never told to read it)
KEY_PRESENT_IN_LIVE_TRACE: no
DEDUPE_AT_DB_LAYER: yes (we got lucky this time)
DEDUPE_AT_API_LAYER: no
REAL_HARM_OCCURRED: yes (1 customer charged 2x, 1 charged 3x)
If the postmortem doesn't answer "was this idempotent, and if not, why didn't our safety net catch it," you have not found the real root cause. You have found the symptom. The real cause is the gap between "tool spec says idempotency-key is required" and "agent runtime never enforced it."
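Closing that gap means the runtime, not the prompt, refuses a side-effecting call that arrives without a key and returns the stored result when the same key is retried. The sketch below shows the idea only: `IdempotentDispatcher` and `MissingIdempotencyKey` are hypothetical names, and the in-memory dictionary is an assumption that a production system would replace with a durable store shared across workers.

```python
from typing import Any, Callable, Optional


class MissingIdempotencyKey(Exception):
    """Raised when a side-effecting tool is called without an idempotency key."""


class IdempotentDispatcher:
    """Enforces idempotency keys at the runtime layer (hypothetical sketch)."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[Callable[..., Any], bool]] = {}
        # (tool name, idempotency key) -> stored result. In-memory only.
        self._results: dict[tuple[str, str], Any] = {}

    def register(self, name: str, fn: Callable[..., Any], side_effecting: bool) -> None:
        self._tools[name] = (fn, side_effecting)

    def call(
        self,
        name: str,
        args: dict[str, Any],
        idempotency_key: Optional[str] = None,
    ) -> Any:
        fn, side_effecting = self._tools[name]
        if not side_effecting:
            return fn(**args)  # read-only tools are safe to retry as-is
        if not idempotency_key:
            # Enforce what the tool spec only documented.
            raise MissingIdempotencyKey(f"{name} requires an idempotency key")
        cache_key = (name, idempotency_key)
        if cache_key in self._results:
            return self._results[cache_key]  # retry: do not run the effect again
        result = fn(**args)
        self._results[cache_key] = result
        return result
```

With this in place, three retries of the same refund under one key run the side effect once, and a call with no key fails loudly. This sketch does not handle concurrent retries or calls that fail midway; both need a lock or a transactional store.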
5. The Customer-Visible Truth Statement
This is the section nobody writes because it is uncomfortable. A normal postmortem says "MTTR was 47 minutes." An agent postmortem needs to say:
"Between 03:14 UTC and 04:02 UTC, 14 customers received an email that did not match their actual order. 3 of those customers were charged twice. 1 of those customers initiated a chargeback. The agent's dashboard said success the entire time. The on-call engineer was not paged. The customer told us at 06:30 UTC."
Most teams skip this because it is the part that makes the postmortem feel like a confession. But the postmortem exists to prevent the next incident. If the next incident is going to look the same — wrong outcome, green dashboard, late customer report — you have to name the specific failure shape you are trying to prevent. "MTTR" is not specific enough.
6. The Cross-Cutting Counterfactual
The last section normal templates forget. A software postmortem says "what we changed." An agent postmortem needs:
"If we had added the outcome-assert line for the refund tool 60 days ago, this incident would have been caught at the time of the call, not at the time of the customer email. We did not add it because the LangSmith trace looked fine, the eval set did not include outcome-state, and the side-effecting tool was not in the regression suite. This is the third incident in 2026 where the missing line was outcome-assert; the other two were [linked] and [linked]."
The counterfactual is what turns a postmortem from a record into a tool. It says: "here is the next thing we will not skip, here is the previous time we did skip it, and here is the specific class of incident that this kind of skip produces."
Without it, your postmortem collection is a graveyard. With it, each one is a prevention mechanism.
The one-page version
If a real production incident lands in your lap at 03:14 and you have 30 minutes to write the postmortem that will get reviewed in the morning, write these six sections in this order, one sentence each:
- Replay fixture — can you re-run this? If no, what is missing?
- Policy-path evidence — which policy was supposed to block this, and what did the model actually do?
- Outcome-assert — what did the trace claim happened, and what is the actual world state?
- Idempotency-key audit — was this call supposed to be safe to retry, and did our safety net catch the retries that ran?
- Customer-visible truth — what did the customer actually experience, in their words, and when did we find out?
- Cross-cutting counterfactual — what is the specific class of incident we are trying to prevent, and which of our prior incidents in this class did we already have a fix for?
If you cannot fill in all six in 30 minutes, that is your action item list for the next 90 days. The first four are log-shape work (a single line of code each). The fifth is a customer-communication workstream. The sixth is a postmortem-review process change.
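If you want the template to produce that action item list for you, the six sections fit in a small record whose empty fields are the backlog. This is only a sketch: `OnePagePostmortem` and its field names are hypothetical, taken directly from the six section titles above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class OnePagePostmortem:
    """One sentence per section, in the order given above (hypothetical shape)."""
    replay_fixture: Optional[str] = None
    policy_path_evidence: Optional[str] = None
    outcome_assert: Optional[str] = None
    idempotency_key_audit: Optional[str] = None
    customer_visible_truth: Optional[str] = None
    cross_cutting_counterfactual: Optional[str] = None

    def action_items(self) -> list[str]:
        """Sections left empty or blank, in template order."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]
```

A postmortem with only the outcome-assert and customer-visible sections filled in returns the other four names, which is the list you carry into the next 90 days.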
What this looks like in the wild
I have read 24 production agent postmortems in 2026 (anonymized, under NDA). Of the 24:
- 0/24 had a working replay fixture
- 1/24 had a policy-path evidence lattice
- 3/24 had an outcome-assert line in the trace (the other 21 reconstructed it after the fact)
- 6/24 had a written idempotency-key audit
- 11/24 had a customer-visible truth statement (the other 13 used MTTR as a proxy)
- 2/24 had a cross-cutting counterfactual that named a prior incident in the same class
The 2 postmortems with the counterfactual were both from the same team. That team's repeat-incident rate was 0 in 2026. Everyone else's was not.
The $149 AI Ops Checkup is, in practice, the same six sections applied to a team's actual production log archive. You send me a week's worth of traces; I read them; I send back the one-page postmortem shape with the sections you should have had, and the specific log lines that prove each one. It is not a vendor, not a dashboard, not a $300/month eval platform. It is a human reading your agent's logs the same way a security consultant reads your auth flow.
If you have a production agent incident in the next 90 days, the one-page version above is the minimum viable postmortem. If you want a second pair of eyes to fill in the six sections for you, the link is on this page.