brian austin

Posted on Apr 9

How I use Claude Code to debug production incidents — a step-by-step workflow

#claudecode #debugging #devops #productivity

Production is down. Alerts are firing. Your phone is blowing up at 2am.

This is when you need your tools to work fast. Claude Code has become my go-to for production incident response — here's exactly how I use it.

The problem with debugging alone

Production incidents are chaotic. You're context-switching between:

Error logs (thousands of lines)
Metrics dashboards
Code that might be the culprit
Slack threads from panicking stakeholders

Your brain is operating in fight-or-flight mode. You make mistakes. You miss things.

Claude Code helps you slow down and think systematically — even when everything is on fire.

My incident response workflow

Step 1: Dump the context

The first thing I do is paste everything into Claude Code:

claude

Then:

We have a production incident. Here's what I know:

Error: "Cannot read property 'userId' of undefined" in production
First seen: 14:32 UTC
Affected: ~15% of checkout requests
Recent deploys: payments-service v2.4.1 at 14:15 UTC

Here's the error trace:
[paste full stack trace]

Here's the relevant code:
[paste payments-service checkout handler]

What could cause this and what should I check first?

Claude Code reads everything at once and gives you a ranked list of suspects.

Step 2: Log analysis

For large log files, Claude Code can analyze patterns you'd miss manually:

tail -n 1000 /var/log/app/production.log > /tmp/recent.log

Then in Claude Code:

Analyze /tmp/recent.log. Find:
1. The first occurrence of this error
2. Any patterns in which users are affected
3. Correlation with any other errors nearby in time

Claude Code reads the file and identifies patterns — like "this error always appears 30 seconds after a Redis timeout" — that you'd need 20 minutes to spot manually.
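To make the "error X follows error Y" idea concrete, here is a minimal sketch of that kind of correlation done by hand. It assumes each log line looks like `<ISO timestamp> <LEVEL> <message>`, which may not match your logger. The sample lines, the `correlate` helper, and the 60-second window are made up for illustration.

```javascript
// Parse one log line of the assumed form "<ISO timestamp> <LEVEL> <message>".
// Returns null for lines that do not fit that shape.
function parseLine(line) {
  const match = /^(\S+)\s+(\w+)\s+(.*)$/.exec(line);
  if (!match) return null;
  const time = Date.parse(match[1]);
  if (Number.isNaN(time)) return null;
  return { time, level: match[2], message: match[3] };
}

// Find the first line containing `needle`, plus any other ERROR lines
// logged in the `windowMs` milliseconds leading up to it.
function correlate(logText, needle, windowMs = 60000) {
  const entries = logText.split("\n").map(parseLine).filter(Boolean);
  const first = entries.find((e) => e.message.includes(needle));
  if (!first) return { first: null, nearby: [] };
  const nearby = entries.filter(
    (e) =>
      e.level === "ERROR" &&
      !e.message.includes(needle) &&
      e.time <= first.time &&
      first.time - e.time <= windowMs
  );
  return { first, nearby };
}

// Made-up sample data, not real production logs.
const sample = [
  "2024-04-09T14:31:40Z INFO checkout started",
  "2024-04-09T14:31:58Z ERROR Redis timeout after 5000ms",
  "2024-04-09T14:32:28Z ERROR Cannot read property 'userId' of undefined",
  "2024-04-09T14:33:05Z ERROR Cannot read property 'userId' of undefined",
].join("\n");

const result = correlate(sample, "userId");
console.log(new Date(result.first.time).toISOString());
console.log(result.nearby.map((e) => e.message));
```

A script like this only finds correlations you already thought to look for, which is why handing the file to Claude Code and asking open-ended questions is the faster route mid-incident.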

Step 3: Trace the code path

Trace the full code path for a checkout request. Start from the route handler and follow every function call. Show me everywhere the object we read userId from could be undefined.

Claude Code reads all the related files and maps the execution path. It finds the bug I couldn't see because I was too stressed.

Step 4: Write the hotfix

Write a minimal hotfix that checks, before the payment call, that userId and the object we read it from are both defined. Include a log line that captures the request details when this happens so we can debug the root cause later.
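For reference, a hotfix of that shape might look roughly like the sketch below. Everything in it is hypothetical: `checkoutHandler`, `chargeCustomer`, `req.session`, and the response shape are stand-ins, not the real payments-service code. Note that "Cannot read property 'userId' of undefined" means the object holding userId was undefined, so the guard covers the parent object as well as the field.

```javascript
// Hypothetical checkout handler; names and shapes are assumptions.
function checkoutHandler(req, chargeCustomer, log = console.warn) {
  // Guard the parent object first: reading req.session.userId throws
  // when req.session itself is undefined.
  const userId = req.session ? req.session.userId : undefined;
  if (userId === undefined || userId === null) {
    // Capture request details so the root cause can be investigated later.
    log("checkout: missing userId", {
      requestId: req.id,
      path: req.path,
      hasSession: Boolean(req.session),
    });
    return { status: 401, body: { error: "Not authenticated" } };
  }
  return { status: 200, body: chargeCustomer(userId, req.body.amount) };
}

// Usage with stubbed dependencies.
const logged = [];
const charge = (userId, amount) => ({ userId, amount, charged: true });

const bad = checkoutHandler(
  { id: "req-1", path: "/checkout", body: { amount: 10 } },
  charge,
  (msg, details) => logged.push({ msg, details })
);
const good = checkoutHandler(
  { id: "req-2", path: "/checkout", session: { userId: "u1" }, body: { amount: 10 } },
  charge
);
console.log(bad.status, good.status);
```

The guard stops the crash and records when it fires, but it does not explain why the object was missing in the first place. That is the root cause the log line is there to help find.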

This is where sessions get token-heavy. You've loaded:

The original stack trace
Log file analysis
Multiple source files across the call chain
The fix discussion

Long incidents can hit Claude's rate limits. When that happens, I use SimplyLouie as a fallback endpoint:

export ANTHROPIC_BASE_URL=https://simplylouie.com/api

$2/month. No interruptions during incidents. Worth it.

Step 5: Write the postmortem

Once the incident is resolved:

Based on everything we analyzed today, write a postmortem. Include:
- Timeline
- Root cause
- Impact
- Fix applied
- Action items to prevent recurrence

Keep it factual and blameless.

Claude Code writes a complete postmortem from your incident conversation. In 2 minutes instead of 2 hours.

The full incident timeline with Claude Code

Time	Without Claude Code	With Claude Code
T+0	Panic, start reading logs	Paste context, get ranked suspects
T+10	Still reading logs	Identified root cause
T+20	Maybe have a theory	Hotfix written and reviewed
T+30	Writing the fix	Fix deployed, monitoring
T+60	Still debugging	Writing postmortem
T+90	Incident resolved	Postmortem sent, sleeping

Tips for incident response with Claude Code

Keep a CLAUDE.md in your service repos with:

Architecture overview
Common failure modes
Key files to check
Runbook for common incidents

This saves 5 minutes of context-setting during every incident.

Don't paste the whole log file. Tail the last 1000 lines and grep for the error first. Smaller context = faster analysis = fewer rate limit hits.

Use --continue to resume. If your session hits a rate limit mid-incident:

claude --continue

Or switch to the SimplyLouie endpoint so you don't lose momentum:

export ANTHROPIC_BASE_URL=https://simplylouie.com/api
claude

Keep an incident runbook. Create a file called INCIDENTS.md in your repo:

# Incident Response Runbook

## Common failures
- Redis timeout → check connection pool settings in config/redis.js
- Payment failures → check Stripe webhook logs first
- Auth failures → check JWT_SECRET env var and token expiry

## Key log locations
- App: /var/log/app/production.log
- Nginx: /var/log/nginx/error.log
- Database: /var/log/postgresql/postgresql.log

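At the start of every incident, point Claude Code at the runbook in your opening prompt: claude "Read INCIDENTS.md before we start triaging this incident"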

The mental shift

The biggest change isn't speed — it's that I stop second-guessing myself.

When production is down and I'm stressed, I make bad assumptions. I chase the wrong thing for 30 minutes.

Claude Code reads the code, reads the logs, and tells me what the data actually says. No emotions. No assumptions. Just pattern matching.

My Claude sessions during incidents are long: loading 10+ files, multiple rounds of analysis, writing the fix, writing the postmortem. When sessions hit rate limits at the worst possible moment, I use SimplyLouie as a fallback API endpoint. $2/month. Keep it in your incident toolkit.

DEV Community
