How I use Claude Code to debug production incidents — a step-by-step workflow
Production is down. Alerts are firing. Your phone is blowing up at 2am.
This is when you need your tools to work fast. Claude Code has become my go-to for production incident response — here's exactly how I use it.
The problem with debugging alone
Production incidents are chaotic. You're context-switching between:
- Error logs (thousands of lines)
- Metrics dashboards
- Code that might be the culprit
- Slack threads from panicking stakeholders
Your brain is operating in fight-or-flight mode. You make mistakes. You miss things.
Claude Code helps you slow down and think systematically — even when everything is on fire.
My incident response workflow
Step 1: Dump the context
The first thing I do is paste everything into Claude Code:
claude
Then:
We have a production incident. Here's what I know:
Error: "Cannot read property 'userId' of undefined" in production
First seen: 14:32 UTC
Affected: ~15% of checkout requests
Recent deploys: payments-service v2.4.1 at 14:15 UTC
Here's the error trace:
[paste full stack trace]
Here's the relevant code:
[paste payments-service checkout handler]
What could cause this and what should I check first?
Claude Code reads everything at once and gives you a ranked list of suspects.
Step 2: Log analysis
For large log files, Claude Code can analyze patterns you'd miss manually:
cat /var/log/app/production.log | tail -1000 > /tmp/recent.log
Then in Claude Code:
Analyze /tmp/recent.log. Find:
1. The first occurrence of this error
2. Any patterns in which users are affected
3. Correlation with any other errors nearby in time
Claude Code reads the file and identifies patterns — like "this error always appears 30 seconds after a Redis timeout" — that you'd need 20 minutes to spot manually.
Step 3: Trace the code path
Trace the full code path for a checkout request. Start from the route handler and follow every function call. Show me everywhere userId could be undefined.
Claude Code reads all the related files and maps the execution path. It finds the bug I couldn't see because I was too stressed.
Step 4: Write the hotfix
Write a minimal hotfix that adds a null check for userId before the payment call. Include a log line that captures the request details when this happens so we can debug the root cause later.
This is where sessions get token-heavy. You've loaded:
- The original stack trace
- Log file analysis
- Multiple source files across the call chain
- The fix discussion
Long incidents will hit Claude's rate limits. When that happens, I use SimplyLouie as a fallback endpoint:
export ANTHROPIC_BASE_URL=https://simplylouie.com/api
$2/month. No interruptions during incidents. Worth it.
Step 5: Write the postmortem
Once the incident is resolved:
Based on everything we analyzed today, write a postmortem. Include:
- Timeline
- Root cause
- Impact
- Fix applied
- Action items to prevent recurrence
Keep it factual and blameless.
Claude Code writes a complete postmortem from your incident conversation. In 2 minutes instead of 2 hours.
The full incident timeline with Claude Code
| Time | Without Claude Code | With Claude Code |
|---|---|---|
| T+0 | Panic, start reading logs | Paste context, get ranked suspects |
| T+10 | Still reading logs | Identified root cause |
| T+20 | Maybe have a theory | Hotfix written and reviewed |
| T+30 | Writing the fix | Fix deployed, monitoring |
| T+60 | Still debugging | Writing postmortem |
| T+90 | Incident resolved | Postmortem sent, sleeping |
Tips for incident response with Claude Code
Keep a CLAUDE.md in your service repos with:
- Architecture overview
- Common failure modes
- Key files to check
- Runbook for common incidents
This saves 5 minutes of context-setting during every incident.
Don't paste the whole log file. Tail the last 1000 lines and grep for the error first. Smaller context = faster analysis = fewer rate limit hits.
Use --continue to resume. If your session hits a rate limit mid-incident:
claude --continue
Or switch to the SimplyLouie endpoint so you don't lose momentum:
export ANTHROPIC_BASE_URL=https://simplylouie.com/api
claude
Have a runbook Claude.md. Create a file called INCIDENTS.md in your repo:
# Incident Response Runbook
## Common failures
- Redis timeout → check connection pool settings in config/redis.js
- Payment failures → check Stripe webhook logs first
- Auth failures → check JWT_SECRET env var and token expiry
## Key log locations
- App: /var/log/app/production.log
- Nginx: /var/log/nginx/error.log
- Database: /var/log/postgresql/postgresql.log
At the start of every incident: claude -f INCIDENTS.md
The mental shift
The biggest change isn't speed — it's that I stop second-guessing myself.
When production is down and I'm stressed, I make bad assumptions. I chase the wrong thing for 30 minutes.
Claude Code reads the code, reads the logs, and tells me what the data actually says. No emotions. No assumptions. Just pattern matching.
My Claude sessions during incidents are long — loading 10+ files, multiple rounds of analysis, writing the fix, writing the postmortem. When sessions hit rate limits at the worst possible moment, I use SimplyLouie as a fallback API endpoint. ✌️2/month. Keep it in your incident toolkit.
Top comments (0)