Opswald

Posted on Jun 1

AI Agent Debugging Checklist: From Failed Run to Root Cause

#ai #llm #devtools #playwright

When an AI agent fails in production, the first instinct is usually to tweak the prompt and rerun the workflow.

That can make the incident harder to understand.

The rerun may change the model output, retrieved context, tool state, timing, permissions, or external API response. If the agent already sent an email, issued a refund, changed a ticket, or called a Model Context Protocol (MCP) tool, a naive rerun can also repeat a side effect.

A better workflow starts by preserving evidence from the failed run before changing anything.

This checklist is for developers debugging production AI agents that use tools, retrieval, memory, workflows, or external APIs. The goal is not to make every run deterministic. The goal is to find the first unsupported decision, meaning the earliest step where the agent acted on something the evidence in front of it did not justify, and turn the failure into a replayable regression.

1. Capture the run identity

Start by making sure the failed run can be found again.

Record:

Trace ID or run ID
User/session/job ID
Agent version
Deployment SHA
Model and provider
Prompt or instruction version
Tool registry version
Retrieval index version
Environment and region
Timestamp and timezone

Without this, incident review becomes archaeology. Screenshots and copied logs are not enough. The team needs a stable identifier that joins model calls, tool calls, application logs, queue jobs, and external API writes.
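One way to make that identifier stick is to attach the full run identity to every log event. The sketch below is a minimal illustration, not a prescribed schema: the `RunIdentity` class, the `tag_event` helper, and all field names and values are assumptions made up for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunIdentity:
    """Identifiers that let one failed run be found again (illustrative field names)."""
    trace_id: str
    session_id: str
    agent_version: str
    deployment_sha: str
    model: str
    provider: str
    prompt_version: str
    tool_registry_version: str
    retrieval_index_version: str
    environment: str
    region: str
    started_at: str  # ISO 8601 timestamp with an explicit UTC offset


def tag_event(identity: RunIdentity, event: str, **fields) -> str:
    """Serialize one log event as JSON with the run identity attached."""
    return json.dumps({**asdict(identity), "event": event, **fields}, sort_keys=True)


# Hypothetical values, for illustration only.
identity = RunIdentity(
    trace_id="run-0001",
    session_id="sess-42",
    agent_version="1.8.0",
    deployment_sha="abc1234",
    model="example-model",
    provider="example-provider",
    prompt_version="support-v3",
    tool_registry_version="2024-05-01",
    retrieval_index_version="kb-17",
    environment="production",
    region="eu-west",
    started_at="2024-06-01T09:30:00+00:00",
)
line = tag_event(identity, "tool_call", tool="lookup_order")
```

Because every event carries the same `trace_id`, model calls, tool calls, and application logs can be joined on one key later.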

2. Preserve the original trigger and context

The same user request can produce different behavior depending on context.

Capture:

Original user input or job payload
System/developer instructions active for the run
Retrieved documents or chunks
Memory entries used by the agent
Account, tenant, role, and permission scope
Relevant product state before the run
Feature flags and routing decisions

A common failure pattern is a plausible model answer built from incomplete or stale context. If you only inspect the final response, the agent may look unreasonable. If you inspect the context it actually saw, the failure often becomes obvious.
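A context snapshot can be stored together with a content digest, so two runs with the same user request can be told apart by what the agent actually saw. This is a rough sketch under assumptions: `snapshot_context` and its keys are hypothetical, and it only works if every captured value is JSON-serializable.

```python
import hashlib
import json


def snapshot_context(user_input, instructions, chunks, memory, scope, flags):
    """Bundle what the agent saw and add a SHA-256 digest of its canonical JSON form."""
    snapshot = {
        "user_input": user_input,
        "instructions": instructions,
        "retrieved_chunks": chunks,
        "memory_entries": memory,
        "scope": scope,
        "feature_flags": flags,
    }
    # Sorted keys and fixed separators make the digest independent of dict ordering.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**snapshot, "digest": digest}


# Hypothetical runs: same request, different retrieved chunk.
base = dict(
    user_input="Where is my order?",
    instructions="support-v3",
    memory=[],
    scope={"tenant": "t1", "role": "customer"},
    flags={"new_router": False},
)
good = snapshot_context(chunks=[{"id": "kb-1", "text": "Ships in 2 days"}], **base)
failed = snapshot_context(chunks=[{"id": "kb-1", "text": "Ships in 5 days"}], **base)
context_changed = good["digest"] != failed["digest"]
```

A differing digest does not say what changed, only that the final response alone is not the whole story and the two snapshots are worth diffing.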

3. Inspect the decision, not just the output

For agents, the important question is often not “what did the model answer?” It is “why did the agent choose this next action?”

Look for:

Selected tool or branch
Alternatives the agent could have chosen
Guardrails or policy checks that ran
Guardrails or policy checks that should have run but did not
Missing facts
Assumptions introduced by summaries or retrieved content
Handoffs between agents or workflow steps

A bad final response is visible. A bad intermediate decision can stay hidden while the final response looks fine.

4. Treat tool calls as model decisions plus API events

Tool calls are where agent debugging diverges from normal request tracing.

For each tool call, preserve:

Tool name and schema
Generated arguments
Validation result
Permission or auth scope
Raw tool response
Normalized tool response
Latency and timeout behavior
Retry count
Error payloads
External mutation or durable receipt ID

A tool call can “succeed” at the API boundary while still being the wrong action. The tool endpoint returned 200. The agent still issued the wrong refund, queried the wrong account, trusted a partial response, or retried a write.
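A thin wrapper around each tool invocation can capture most of the fields above in one place. The following is a sketch only: `ToolCallRecord`, `call_and_record`, and the fake `issue_refund` tool are invented for illustration, and the record covers a subset of the list (no schema or validation result). It swallows the exception to keep the example short; a real wrapper would probably re-raise after recording.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCallRecord:
    tool_name: str
    arguments: dict
    auth_scope: str
    attempt: int
    raw_response: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0
    receipt_id: Optional[str] = None


def call_and_record(tool_name, tool_fn, arguments, auth_scope, attempt=1):
    """Invoke a tool and record arguments, response or error, latency, and receipt."""
    record = ToolCallRecord(tool_name, dict(arguments), auth_scope, attempt)
    start = time.perf_counter()
    try:
        record.raw_response = tool_fn(**arguments)
        # Assumption: mutating tools return a dict that may carry a "receipt_id".
        if isinstance(record.raw_response, dict):
            record.receipt_id = record.raw_response.get("receipt_id")
    except Exception as exc:
        record.error = f"{type(exc).__name__}: {exc}"
    finally:
        record.latency_ms = (time.perf_counter() - start) * 1000
    return record


def issue_refund(order_id, amount):
    """Fake tool standing in for a real refund API."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"status": 200, "receipt_id": "rf_123", "order_id": order_id}


ok = call_and_record("issue_refund", issue_refund, {"order_id": "A-100", "amount": 20}, "tenant:t1")
failed = call_and_record("issue_refund", issue_refund, {"order_id": "A-100", "amount": 0}, "tenant:t1", attempt=2)
```

Note that `ok` records a 200 and a receipt, which says nothing about whether refunding order `A-100` was the right action. That judgment still needs the generated arguments next to the context the agent saw.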

5. Separate read-only tools from mutating tools

Do not debug all tools the same way.

Classify each tool as:

Read-only
Write
Risky write
Human approval required
External side effect

For mutating tools, capture before/after state and an idempotency key. For emails, tickets, refunds, database writes, calendar changes, and external workflow updates, capture a durable receipt.

The key replay rule: debugging should not repeat production side effects.
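The classification and the idempotency key can both live in a small registry. This is a minimal sketch under assumptions: the tool names, the `ToolClass` enum, and the key derivation from run ID, step, tool, and arguments are all hypothetical choices, not a standard.

```python
import hashlib
import json
from enum import Enum


class ToolClass(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    RISKY_WRITE = "risky_write"
    HUMAN_APPROVAL = "human_approval_required"
    EXTERNAL_SIDE_EFFECT = "external_side_effect"


# Hypothetical registry; real tool names will differ.
TOOL_CLASSES = {
    "lookup_order": ToolClass.READ_ONLY,
    "update_ticket": ToolClass.WRITE,
    "issue_refund": ToolClass.RISKY_WRITE,
    "send_email": ToolClass.EXTERNAL_SIDE_EFFECT,
}


def is_mutating(tool_name: str) -> bool:
    """Treat unclassified tools as mutating so they are never replayed blindly."""
    return TOOL_CLASSES.get(tool_name) is not ToolClass.READ_ONLY


def idempotency_key(run_id: str, step: int, tool_name: str, arguments: dict) -> str:
    """Derive a stable key so a retried write can be deduplicated downstream."""
    payload = json.dumps([run_id, step, tool_name, arguments], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


key_a = idempotency_key("run-0001", 3, "issue_refund", {"order_id": "A-100", "amount": 20})
key_b = idempotency_key("run-0001", 3, "issue_refund", {"amount": 20, "order_id": "A-100"})
key_c = idempotency_key("run-0001", 3, "issue_refund", {"order_id": "A-100", "amount": 25})
```

Defaulting unknown tools to "mutating" is a deliberate design choice here: forgetting to classify a tool then blocks a replay instead of silently repeating a side effect.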

6. Check retrieval and memory before blaming the model

Many agent failures are context failures.

Ask:

Did retrieval return the right source?
Was the source stale?
Was the relevant fact omitted by chunking or summarization?
Did memory introduce old user preferences or obsolete state?
Did the agent cite or rely on evidence that was not actually present?
Did a later tool result contradict earlier retrieved context?

If the model was given bad or incomplete evidence, prompt tuning may hide the symptom without fixing the system.
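Two of the questions above can be checked mechanically if the trace stores chunk metadata. The sketch below assumes each retrieved chunk carries hypothetical `chunk_id`, `indexed_at`, and `source_updated_at` fields; your retrieval layer may expose different metadata or none at all.

```python
from datetime import datetime, timezone


def stale_chunks(chunks):
    """Return IDs of chunks whose source changed after the chunk was indexed."""
    return [c["chunk_id"] for c in chunks if c["source_updated_at"] > c["indexed_at"]]


def unsupported_citations(cited_ids, chunks):
    """Return cited chunk IDs that were not among the chunks actually retrieved."""
    retrieved = {c["chunk_id"] for c in chunks}
    return sorted(set(cited_ids) - retrieved)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical retrieval result from the failed run.
chunks = [
    {"chunk_id": "kb-1", "indexed_at": utc(2024, 5, 1), "source_updated_at": utc(2024, 4, 20)},
    {"chunk_id": "kb-2", "indexed_at": utc(2024, 5, 1), "source_updated_at": utc(2024, 5, 20)},
]
stale = stale_chunks(chunks)
missing = unsupported_citations(["kb-1", "kb-7"], chunks)
```

Here `kb-2` was indexed before its source last changed, and `kb-7` was cited without ever being retrieved. Neither finding is a prompt problem.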

7. Compare the failed run to a known-good run

A good comparison can save hours.

Compare:

Same user intent, different outcome
Same tool, different arguments
Same retrieval query, different chunks
Same workflow branch, different guardrail result
Same external API, different permission scope
Same prompt version, different model/provider response

The goal is to find the first divergence that matters. A timeline shows the order in which steps ran. A graph of decisions and their inputs shows which earlier step caused a later one.
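If both runs are stored as ordered lists of steps, finding the first divergence is a short loop. This is a sketch under assumptions: the step dictionaries and the compared field names (`step`, `tool`, `arguments`, `guardrail`, `scope`) are invented for illustration.

```python
from itertools import zip_longest

COMPARED_FIELDS = ("step", "tool", "arguments", "guardrail", "scope")


def first_divergence(good_run, failed_run, fields=COMPARED_FIELDS):
    """Return the first step index and field where two runs differ, or None."""
    for index, (good, bad) in enumerate(zip_longest(good_run, failed_run)):
        if good is None or bad is None:
            # One run has a step the other never reached.
            return {"index": index, "field": "<missing step>", "good": good, "failed": bad}
        for name in fields:
            if good.get(name) != bad.get(name):
                return {"index": index, "field": name,
                        "good": good.get(name), "failed": bad.get(name)}
    return None


# Hypothetical runs with the same user intent.
good_run = [
    {"step": "classify", "tool": None, "arguments": None, "guardrail": "pass", "scope": "tenant:t1"},
    {"step": "lookup", "tool": "lookup_order", "arguments": {"order_id": "A-100"}, "guardrail": "pass", "scope": "tenant:t1"},
    {"step": "refund", "tool": "issue_refund", "arguments": {"order_id": "A-100", "amount": 20}, "guardrail": "pass", "scope": "tenant:t1"},
]
failed_run = [
    dict(good_run[0]),
    {**good_run[1], "arguments": {"order_id": "A-101"}},
    {**good_run[2], "arguments": {"order_id": "A-101", "amount": 20}},
]
divergence = first_divergence(good_run, failed_run)
```

The wrong refund at step 2 is the visible symptom, but the comparison points at step 1, where the lookup already used the wrong order ID.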

8. Make replay safe before rerunning

Replay does not have to mean regenerating every token exactly. It means preserving enough evidence to ask disciplined questions.

Before replaying, pin or stub:

User input
Prompt/instruction version
Retrieved context
Tool outputs
External API responses
Current time
Random IDs
Mutating tool behavior
Secrets and sensitive fields after redaction

Safe replay lets the team test whether a fix changes the failing decision without sending another email, creating another ticket, issuing another refund, or mutating production state.
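A replay harness can enforce those rules by serving recorded tool outputs, stubbing mutating tools, and pinning the clock. The sketch below is one possible shape, not a reference implementation: `ReplayTools`, `UnsafeReplayError`, `pinned_clock`, and the recorded data are all hypothetical.

```python
import json
from datetime import datetime


class UnsafeReplayError(RuntimeError):
    """Raised when a replay asks for a read that was never recorded."""


class ReplayTools:
    def __init__(self, recorded_calls, mutating_tools):
        self._recorded = {
            (c["tool"], self._key(c["arguments"])): c["response"] for c in recorded_calls
        }
        self._mutating = set(mutating_tools)
        self.attempted_mutations = []

    @staticmethod
    def _key(arguments):
        return json.dumps(arguments, sort_keys=True)

    def call(self, tool, arguments):
        if tool in self._mutating:
            # Never touch the real system: log the attempt and return a stub.
            self.attempted_mutations.append((tool, arguments))
            return {"status": "stubbed", "receipt_id": None}
        try:
            return self._recorded[(tool, self._key(arguments))]
        except KeyError:
            raise UnsafeReplayError(f"no recorded output for {tool} {arguments}") from None


def pinned_clock(iso_timestamp):
    """Return a zero-argument function that always reports the recorded time."""
    moment = datetime.fromisoformat(iso_timestamp)
    return lambda: moment


# Hypothetical evidence preserved from the failed run.
tools = ReplayTools(
    recorded_calls=[
        {"tool": "lookup_order", "arguments": {"order_id": "A-100"},
         "response": {"status": "already_refunded"}},
    ],
    mutating_tools={"issue_refund", "send_email"},
)
now = pinned_clock("2024-06-01T09:30:00+00:00")

order = tools.call("lookup_order", {"order_id": "A-100"})
refund = tools.call("issue_refund", {"order_id": "A-100", "amount": 20})
```

After a replay, `tools.attempted_mutations` shows which writes the agent would have made, which is exactly what a fix should change.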

9. Convert the incident into a regression

After root cause is found, keep the evidence.

Create a regression fixture with:

Minimal failing input
Pinned context
Expected decision or blocked action
Tool output stubs
Assertions for side effects
Notes on the root cause
Link back to the production trace

Good regression fixtures prevent the same class of failure from coming back through a future prompt change, model upgrade, retrieval change, or tool schema edit.
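A fixture can be as small as a dictionary plus a checker that runs the agent's decision logic against pinned context. Everything in this sketch is hypothetical: the fixture keys, the `check_fixture` helper, the toy agents, and the invented root cause are there only to show the shape.

```python
FIXTURE = {
    "trace_id": "run-0001",  # link back to the production trace (made-up ID)
    "input": "Refund my last order",
    "pinned_context": {"order": {"id": "A-100", "status": "already_refunded"}},
    "expected_decision": "escalate_to_human",
    "forbidden_tools": ["issue_refund"],
    "root_cause": "Illustrative only: agent ignored order status from lookup.",
}


def check_fixture(agent_decide, fixture):
    """Run the decision function on pinned context and return a list of failures."""
    decision = agent_decide(fixture["input"], fixture["pinned_context"])
    failures = []
    if decision["action"] != fixture["expected_decision"]:
        failures.append(
            f"expected {fixture['expected_decision']!r}, got {decision['action']!r}"
        )
    used = set(decision.get("tool_calls", []))
    for tool in fixture["forbidden_tools"]:
        if tool in used:
            failures.append(f"forbidden tool called: {tool}")
    return failures


def buggy_agent(user_input, context):
    """Toy stand-in for the agent before the fix: refunds without checking status."""
    return {"action": "refund", "tool_calls": ["lookup_order", "issue_refund"]}


def fixed_agent(user_input, context):
    """Toy stand-in for the agent after the fix: checks order status first."""
    if context["order"]["status"] == "already_refunded":
        return {"action": "escalate_to_human", "tool_calls": ["lookup_order"]}
    return {"action": "refund", "tool_calls": ["lookup_order", "issue_refund"]}


before = check_fixture(buggy_agent, FIXTURE)
after = check_fixture(fixed_agent, FIXTURE)
```

The fixture asserts on the decision and on the absence of a side effect, not on exact model wording, so it should survive prompt and model changes that keep the behavior correct.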

10. Use a short incident review template

A useful post-incident review for agents can be simple:

Incident: what happened?
Impact: who or what was affected?
First unsupported decision: where did the agent become wrong?
Evidence: what prompt/context/tool/state proved it?
Root cause category: context, tool, permission, memory, orchestration, model, guardrail, or side effect?
Fix: what changed?
Regression: what replay or test now catches this?

The “first unsupported decision” line is the important part. It keeps the review from collapsing into vague statements like “the model hallucinated” or “the prompt was bad.”

Quick checklist

Before changing prompts or code, capture:

[ ] Run ID / trace ID
[ ] Agent, model, prompt, tool, and deployment versions
[ ] Original user input or job payload
[ ] Retrieved context and memory entries
[ ] Selected tools and skipped alternatives
[ ] Tool schemas, arguments, outputs, retries, and errors
[ ] Permission and tenant/account scope
[ ] Side effects and durable receipt IDs
[ ] Before/after state for writes
[ ] Replay plan with stubs for external mutations
[ ] Regression fixture and assertion

Closing thought

Production agent debugging is less about watching a pretty trace and more about preserving the evidence behind a decision.

If you can answer what the agent saw, what it chose, what changed, and how to replay it safely, you can debug the failure. If you cannot, you are guessing.
