DEV Community

brian austin

How I use Claude Code to debug production incidents — a step-by-step workflow

Production is down. Alerts are firing. Your phone is blowing up at 2am.

This is when you need your tools to work fast. Claude Code has become my go-to for production incident response — here's exactly how I use it.

The problem with debugging alone

Production incidents are chaotic. You're context-switching between:

  • Error logs (thousands of lines)
  • Metrics dashboards
  • Code that might be the culprit
  • Slack threads from panicking stakeholders

Your brain is operating in fight-or-flight mode. You make mistakes. You miss things.

Claude Code helps you slow down and think systematically — even when everything is on fire.

My incident response workflow

Step 1: Dump the context

The first thing I do is paste everything into Claude Code:

claude

Then:

We have a production incident. Here's what I know:

Error: "Cannot read property 'userId' of undefined" in production
First seen: 14:32 UTC
Affected: ~15% of checkout requests
Recent deploys: payments-service v2.4.1 at 14:15 UTC

Here's the error trace:
[paste full stack trace]

Here's the relevant code:
[paste payments-service checkout handler]

What could cause this and what should I check first?

Claude Code reads everything at once and gives you a ranked list of suspects.

Step 2: Log analysis

For large log files, Claude Code can analyze patterns you'd miss manually:

tail -n 1000 /var/log/app/production.log > /tmp/recent.log

Then in Claude Code:

Analyze /tmp/recent.log. Find:
1. The first occurrence of this error
2. Any patterns in which users are affected
3. Correlation with any other errors nearby in time

Claude Code reads the file and identifies patterns — like "this error always appears 30 seconds after a Redis timeout" — that you'd need 20 minutes to spot manually.
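
The first question in that prompt, plus a rough timeline of the error, can also be cross-checked by hand. The following is a minimal sketch, assuming GNU grep and coreutils, and assuming every log line starts with an ISO-8601 timestamp such as 2024-05-01T14:32:07Z (the real format of production.log is not shown in this post). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for cross-checking Claude Code's log analysis.
# Assumption: each log line begins with an ISO-8601 timestamp,
# e.g. "2024-05-01T14:32:07Z ERROR ...".

# Print the first line (prefixed with its line number) that contains the error text.
first_occurrence() {
  local log_file="$1" pattern="$2"
  # -n: show line numbers, -m 1: stop after the first match, -F: literal string
  grep -n -m 1 -F -- "$pattern" "$log_file"
}

# Count matching lines per minute to see when the error started and how it ramps up.
errors_per_minute() {
  local log_file="$1" pattern="$2"
  # Characters 1-16 of an ISO-8601 timestamp are "YYYY-MM-DDTHH:MM".
  grep -F -- "$pattern" "$log_file" | cut -c1-16 | sort | uniq -c
}
```

For example, first_occurrence /tmp/recent.log "Cannot read property 'userId'" prints the earliest matching line in the tail, and errors_per_minute with the same arguments prints one count per minute. If the per-minute counts start right after the 14:15 UTC deploy, that supports the deploy as a suspect.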

Step 3: Trace the code path

Trace the full code path for a checkout request. Start from the route handler and follow every function call. Show me everywhere userId could be undefined.

Claude Code reads all the related files and maps the execution path. It finds the bug I couldn't see because I was too stressed.

Step 4: Write the hotfix

Write a minimal hotfix that adds a null check for userId before the payment call. Include a log line that captures the request details when this happens so we can debug the root cause later.

This is where sessions get token-heavy. You've loaded:

  • The original stack trace
  • Log file analysis
  • Multiple source files across the call chain
  • The fix discussion

Long incidents can hit Claude's rate limits. When that happens, I use SimplyLouie as a fallback endpoint:

export ANTHROPIC_BASE_URL=https://simplylouie.com/api

$2/month. No interruptions during incidents. Worth it.

Step 5: Write the postmortem

Once the incident is resolved:

Based on everything we analyzed today, write a postmortem. Include:
- Timeline
- Root cause
- Impact
- Fix applied
- Action items to prevent recurrence

Keep it factual and blameless.

Claude Code writes a complete postmortem from your incident conversation. In 2 minutes instead of 2 hours.

The full incident timeline with Claude Code

| Time (min) | Without Claude Code | With Claude Code |
|---|---|---|
| T+0 | Panic, start reading logs | Paste context, get ranked suspects |
| T+10 | Still reading logs | Identified root cause |
| T+20 | Maybe have a theory | Hotfix written and reviewed |
| T+30 | Writing the fix | Fix deployed, monitoring |
| T+60 | Still debugging | Writing postmortem |
| T+90 | Incident resolved | Postmortem sent, sleeping |

Tips for incident response with Claude Code

Keep a CLAUDE.md in your service repos with:

  • Architecture overview
  • Common failure modes
  • Key files to check
  • Runbook for common incidents

This saves 5 minutes of context-setting during every incident.

Don't paste the whole log file. Tail the last 1000 lines and grep for the error first. Smaller context = faster analysis = fewer rate limit hits.
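
A minimal sketch of that pre-filtering step, assuming GNU grep and tail and reusing the log path and error text from earlier in this post. The helper name prefilter_log and the two lines of surrounding context are illustrative choices, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only recent lines that mention the error,
# plus a little surrounding context, so the file handed to Claude Code stays small.
prefilter_log() {
  local log_file="$1"       # e.g. /var/log/app/production.log
  local pattern="$2"        # literal error text to look for
  local lines="${3:-1000}"  # how many trailing lines to scan (default 1000)

  # tail first so grep only scans the recent window;
  # -F treats the pattern as a literal string, -C 2 keeps 2 lines of context.
  tail -n "$lines" "$log_file" | grep -F -C 2 -- "$pattern"
}
```

Usage: prefilter_log /var/log/app/production.log "Cannot read property 'userId'" > /tmp/recent.log. If nothing matches in the window, grep exits non-zero and the output file is empty, which is itself a hint to widen the window.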

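Use --continue to resume. If your session hits a rate limit mid-incident, this reopens the most recent conversation in the current directory once you can send requests again: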

claude --continue

Or switch to the SimplyLouie endpoint so you don't lose momentum:

export ANTHROPIC_BASE_URL=https://simplylouie.com/api
claude

Keep an incident runbook. Create a file called INCIDENTS.md in your repo:

# Incident Response Runbook

## Common failures
- Redis timeout → check connection pool settings in config/redis.js
- Payment failures → check Stripe webhook logs first
- Auth failures → check JWT_SECRET env var and token expiry

## Key log locations
- App: /var/log/app/production.log
- Nginx: /var/log/nginx/error.log
- Database: /var/log/postgresql/postgresql.log

At the start of every incident, point Claude Code at the runbook in your opening prompt: claude "Read INCIDENTS.md before we start on this incident"

The mental shift

The biggest change isn't speed — it's that I stop second-guessing myself.

When production is down and I'm stressed, I make bad assumptions. I chase the wrong thing for 30 minutes.

Claude Code reads the code, reads the logs, and tells me what the data actually says. No emotions. No assumptions. Just pattern matching.


My Claude sessions during incidents are long: loading 10+ files, multiple rounds of analysis, writing the fix, writing the postmortem. When sessions hit rate limits at the worst possible moment, I use SimplyLouie as a fallback API endpoint. $2/month. Keep it in your incident toolkit.
