Most AI incidents are documented too late and too vaguely.
The team remembers the frustration, but not the evidence.
So a week later the postmortem sounds like this:
- "The model got weird."
- "Retrieval seemed off."
- "Tool calling was flaky."
- "We think the prompt change may have caused it."
That kind of report is not useful.
If you want incidents to improve the system instead of just creating a document, the write-up has to force clarity.
This is the lightweight template I actually like for production AI incidents.
## What makes AI incidents annoying
AI incidents usually cross more than one layer:
- model behavior
- prompt or policy changes
- retrieval quality
- tool execution
- downstream parsing
- logging gaps
That is why generic incident templates often fail here. They capture "what happened" but not the behavioral context needed to debug probabilistic systems.
You do not need a giant framework. You do need a report that makes the team answer the right questions.
## The template
This is the copy-paste version.
```markdown
# AI Incident Report
## 1. Incident Summary
- Incident ID:
- Date / time:
- Owner:
- Status:
- User-visible impact:
## 2. What failed?
- [ ] wrong answer
- [ ] hallucinated citation / unsupported claim
- [ ] tool-call failure
- [ ] structured output parse failure
- [ ] latency spike
- [ ] cost spike
- [ ] policy / refusal regression
- [ ] other:
## 3. Scope
- Affected feature:
- Affected tenants / cohorts:
- Approx request volume:
- First detected:
- Detection method:
## 4. Request-Level Evidence
- request_id examples:
- model_version:
- prompt_version:
- retrieval_version:
- index_version:
- tool_schema_version:
- policy_version:
## 5. Failure Classification
- Suspected primary layer:
- Suspected secondary layer:
- What evidence supports this?
- What evidence contradicts this?
## 6. Timeline
- Change deployed:
- First bad signal:
- Escalation:
- Mitigation:
- Recovery confirmed:
## 7. Root Cause
- Direct cause:
- Contributing factors:
- Why existing checks did not catch it:
## 8. Fix
- Immediate mitigation:
- Permanent fix:
- Owner:
- Due date:
## 9. Guardrail to Add
- [ ] eval case
- [ ] alert
- [ ] dashboard / query
- [ ] release gate
- [ ] logging field
- [ ] rollback rule
## 10. Proof of Recovery
- Before / after metric:
- Sample requests reviewed:
- Residual risk:
```
That is already enough for many teams.
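If you want to store these reports in a queryable form rather than loose documents, the sections above map naturally onto a small structured record. A minimal sketch in Python; the field names mirror the template but are my own, not any standard schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structured form of the incident template above.
# Field names mirror the markdown sections; this is not a standard schema.
@dataclass
class AIIncidentReport:
    incident_id: str
    user_visible_impact: str
    failure_modes: List[str]       # e.g. ["parse_failure", "latency_spike"]
    request_ids: List[str]         # request-level evidence
    model_version: str
    prompt_version: str
    suspected_primary_layer: str   # e.g. "tool_execution"
    root_cause: str = ""
    guardrails_added: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A report without evidence or a new guardrail is not done."""
        return bool(self.request_ids) and bool(self.guardrails_added)

report = AIIncidentReport(
    incident_id="INC-2041",
    user_visible_impact="Assistant returned empty answers for ~2% of requests",
    failure_modes=["parse_failure"],
    request_ids=["req_8f3a", "req_9c1d"],
    model_version="m-2024-06-01",
    prompt_version="p-17",
    suspected_primary_layer="structured_output",
)
```

The `is_complete` check encodes the point of the whole template: a report that lacks request evidence or ends without a guardrail is not finished.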
## The 4 sections that matter most
Not every incident doc gets read in full. These four parts do most of the real work.
### 1. Request-level evidence
This is the difference between diagnosis and storytelling.
If the incident doc does not include actual request examples plus the relevant version fields, the team is operating from memory.
At minimum, I want:
- a few request IDs
- the active model version
- the prompt version
- the retrieval or index version if RAG is involved
- the tool schema version if tools are involved
Without this, the root-cause section is usually weaker than people think.
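One cheap way to guarantee this evidence exists is to stamp every request log with the active version fields at call time, so the incident report can copy them instead of reconstructing them. A sketch, assuming a dict-based structured logger; `emit` and the version registry are stand-ins for whatever your stack actually uses:

```python
import json
import uuid

# Hypothetical active-version registry; in a real system these would come
# from deployment config or a feature-flag service.
ACTIVE_VERSIONS = {
    "model_version": "m-2024-06-01",
    "prompt_version": "p-17",
    "retrieval_version": "r-4",
    "index_version": "idx-2024-05-28",
    "tool_schema_version": "ts-3",
    "policy_version": "pol-9",
}

def emit(record: dict) -> str:
    """Stand-in for a structured logging backend: one JSON line per request."""
    return json.dumps(record, sort_keys=True)

def log_request(feature: str, outcome: str) -> dict:
    """Attach a request_id plus every active version field to the record."""
    record = {
        "request_id": f"req_{uuid.uuid4().hex[:8]}",
        "feature": feature,
        "outcome": outcome,
        **ACTIVE_VERSIONS,
    }
    emit(record)
    return record

rec = log_request("answer_generation", "parse_failure")
```

With this in place, filling in section 4 of the template is a log query, not an archaeology project.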
### 2. Failure classification
Teams move faster when they force themselves to name the failing layer.
For example:
- retrieval miss
- ranking issue
- context assembly issue
- tool selection issue
- tool execution issue
- validation issue
If the incident report only says "bad answer," it is too abstract to improve operations.
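Forcing the layer to come from a fixed vocabulary rather than free text is what makes incidents comparable over time. A minimal sketch; the layer names come from the list above, not any standard taxonomy:

```python
from enum import Enum

class FailureLayer(Enum):
    """Fixed vocabulary for the 'suspected layer' field; names are illustrative."""
    RETRIEVAL_MISS = "retrieval_miss"
    RANKING = "ranking"
    CONTEXT_ASSEMBLY = "context_assembly"
    TOOL_SELECTION = "tool_selection"
    TOOL_EXECUTION = "tool_execution"
    VALIDATION = "validation"

def classify(layer_name: str) -> FailureLayer:
    """Reject free-text labels like 'bad answer' that are not in the taxonomy."""
    try:
        return FailureLayer(layer_name)
    except ValueError:
        raise ValueError(
            f"'{layer_name}' is not a named layer; pick one of "
            f"{[layer.value for layer in FailureLayer]}"
        )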
### 3. Why checks did not catch it
This is my favorite line in the template.
It reveals whether the real problem was:
- no eval coverage
- no alert
- no rollback trigger
- weak traces
- unclear ownership
That is often more valuable than the immediate bug itself.
### 4. Guardrail to add
Every recurring AI incident means one of the system's feedback loops is missing.
A good incident report should end by adding at least one control:
- a new eval case
- a version field in logs
- a release gate
- an alert tied to action
- a rollback condition
If the report produces no new guardrail, the same class of incident usually comes back.
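The cheapest guardrail on that list is usually the first one: turn the failing request into a permanent eval case. A sketch of what that might look like for a structured-output failure, assuming the model is expected to return JSON containing a `tool_args` field; the function and field names are illustrative:

```python
import json

def passes_structured_output_check(model_output: str, required_field: str) -> bool:
    """Eval case: output must be valid JSON and contain the required field.

    This encodes one past incident so the same regression is caught
    before the next deploy instead of after it.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return required_field in parsed

# The request that failed in production becomes a fixed regression case.
FAILING_CASE = '{"tool": "search"}'  # missing tool_args: must fail the check
FIXED_CASE = '{"tool": "search", "tool_args": {"q": "refund policy"}}'
```

Run it in CI against the current prompt and model, and the incident has produced a durable control instead of a document.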
## An example of a weak root cause
Weak:
"The model produced inconsistent outputs."
That sentence explains almost nothing.
Stronger:
"A prompt edit increased tool invocation frequency, but the new tool schema required a field the model was not reliably generating. Parse failures rose immediately after deployment, and no alert existed for that failure mode."
Now the team has something operational:
- the trigger
- the failing layer
- the missing guardrail
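The missing guardrail in that example, an alert on parse-failure rate, is only a few lines of monitoring logic. A sketch, assuming you can collect per-request outcomes for a time window; the 5% threshold is an illustrative default, not a recommendation:

```python
def parse_failure_alert(outcomes, threshold=0.05):
    """Flag a window whose parse-failure rate exceeds the threshold.

    `outcomes` is a list of per-request outcome strings for one window,
    e.g. ["ok", "ok", "parse_failure", ...]. The threshold is illustrative.
    """
    if not outcomes:
        return False
    failures = sum(1 for o in outcomes if o == "parse_failure")
    return failures / len(outcomes) > threshold

# After the prompt edit in the example above, the rate jumps and fires.
healthy_window = ["ok"] * 98 + ["parse_failure"] * 2     # 2% failure rate
regressed_window = ["ok"] * 80 + ["parse_failure"] * 20  # 20% failure rate
```

Had this existed before the prompt edit, the incident's detection method would have been "alert fired" instead of "users complained."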
## Keep the report small
AI teams sometimes overreact to messy incidents by creating giant forms nobody wants to complete.
I would not start there.
The goal is:
- short enough to be filled in during a real week
- structured enough to support debugging
- consistent enough to compare incidents over time
If the template is too heavy, people stop using it.
If it is too loose, the reports become fiction.
## Closing
A useful AI incident report should help you answer three things quickly:
- What failed?
- Which layer most likely failed?
- What control do we add so this exact failure is easier to catch next time?
That is enough to turn incidents into system improvement instead of another vague postmortem folder.
For AI systems, the quality of the incident report often determines whether the team learns anything real.