brian austin

Posted on Apr 9

How I use Claude Code to debug production errors — a complete workflow

#claudecode #debugging #devops #productivity

Production is down. The error logs are cryptic. Your team is pinging you. This is the moment where Claude Code either saves your day or wastes an hour of your time.

After debugging dozens of production incidents with Claude Code, I've developed a workflow that consistently gets to root cause faster than any other approach I've tried.

Here's exactly how I do it.

The core problem with production debugging

Production errors are different from development errors:

Stack traces reference minified or compiled code
The error might be intermittent
You can't reproduce it locally
Multiple systems are involved
Time pressure is real

Most debugging approaches stall because they're linear: you look at one thing at a time. Claude Code lets you reason across logs, code, and configuration in a single session.

Step 1: Dump the error context immediately

The moment an alert fires, I run:

claude "Production error investigation. Here's the error:
$(tail -n 50 /var/log/app/error.log)

And the stack trace:
$(tail -n 30 /var/log/app/stack.log)

Start by telling me: (1) what's actually failing, (2) what caused it, (3) what files to check first."

This gets Claude oriented before I start showing it code. The initial orientation matters — it sets up the right mental model for the rest of the session.
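
If you run this step often, it can be wrapped in a small function so that a missing log file fails loudly instead of producing a prompt with an empty section. This is a minimal sketch: the function name `build_incident_prompt` is an assumption for illustration, and the log paths in the usage comment are the ones from the command above.

```shell
# Hypothetical helper: assemble the Step 1 prompt from two log files.
# Prints the prompt on stdout; returns 1 if either log is unreadable.
build_incident_prompt() {
  local error_log="$1" stack_log="$2" f
  for f in "$error_log" "$stack_log"; do
    if [ ! -r "$f" ]; then
      echo "cannot read log file: $f" >&2
      return 1
    fi
  done

  printf '%s\n' "Production error investigation. Here's the error:"
  tail -n 50 "$error_log"
  printf '\n%s\n' "And the stack trace:"
  tail -n 30 "$stack_log"
  printf '\n%s\n' "Start by telling me: (1) what's actually failing, (2) what caused it, (3) what files to check first."
}

# Usage (assumes the claude CLI is installed and on PATH):
#   claude "$(build_incident_prompt /var/log/app/error.log /var/log/app/stack.log)"
```

Building the prompt separately from sending it also lets you read it once before it goes out, which is a cheap way to spot secrets sitting in the log tail.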

Step 2: The 3-file rule

Claude Code is most effective when you give it 3 files max to start:

The file where the error originates
The file that calls it
The configuration/environment file

claude "Read these three files and identify the bug:
1. src/services/payment.js
2. src/routes/checkout.js
3. config/production.env.example

The error is: 'Cannot read property stripe_key of undefined' at payment.js:47"

Don't dump the entire codebase. Three files, targeted question, clear error message.
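
One way to hold yourself to the limit is to let a helper refuse a fourth file. The sketch below is only an illustration: `build_file_prompt` is a hypothetical name, and the prompt wording is adapted from the example above.

```shell
# Hypothetical helper: build the Step 2 prompt and enforce the 3-file limit.
# Usage: build_file_prompt "<error message>" file1 [file2] [file3]
build_file_prompt() {
  local error_msg f i
  if [ "$#" -lt 2 ] || [ "$#" -gt 4 ]; then
    echo "expected an error message and 1 to 3 files" >&2
    return 1
  fi
  error_msg="$1"
  shift

  # Validate everything first so a failure never emits a half-built prompt.
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "no such file: $f" >&2
      return 1
    fi
  done

  printf '%s\n' "Read these files and identify the bug:"
  i=1
  for f in "$@"; do
    printf '%d. %s\n' "$i" "$f"
    i=$((i + 1))
  done
  printf '\n%s\n' "The error is: $error_msg"
}

# Usage (assumes the claude CLI is installed and on PATH):
#   claude "$(build_file_prompt "Cannot read property stripe_key of undefined at payment.js:47" \
#     src/services/payment.js src/routes/checkout.js config/production.env.example)"
```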

Step 3: Hypothesis-first debugging

Before Claude reads any more files, ask for hypotheses:

claude "Before we go deeper — list your top 3 hypotheses for what's causing this. For each one, tell me what evidence would confirm or rule it out."

This is the most underused Claude Code pattern. Getting explicit hypotheses means:

You're not just following Claude blindly
You can quickly test each hypothesis
You catch cases where Claude is wrong early
The session stays focused

Step 4: The environment check

In my experience, a large share of production bugs turn out to be environment issues. Always do this:

claude "Check if this could be an environment configuration issue. Look at:
- What environment variables does this code need?
- Which ones might be missing or wrong in production?
- How would I verify each one without exposing secrets?"

I've saved hours by catching a missing env var in minute 3 instead of minute 45.
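
For the "without exposing secrets" part, a check that only reports whether each variable is present is usually enough. Here is a minimal sketch, assuming the variables are exported in the shell you run it from. The function name `check_env_vars` and the variable names in the usage comment are hypothetical.

```shell
# Report whether each named environment variable is set, without printing its value.
# Returns 1 if any variable is missing or empty.
check_env_vars() {
  local name missing=0
  for name in "$@"; do
    # printenv prints the value of an exported variable; we test it but never echo it.
    if [ -n "$(printenv "$name")" ]; then
      printf '%s: set\n' "$name"
    else
      printf '%s: MISSING\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the production host (variable names are placeholders for your own):
#   check_env_vars STRIPE_KEY DATABASE_URL SESSION_SECRET
```

The output is safe to paste into the Claude session, since it contains names and status only.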

Step 5: The diff approach for regressions

If this worked yesterday and broke today:

# Get the recent changes
git log --oneline -20 > /tmp/recent-commits.txt
git diff HEAD~5 HEAD -- src/services/payment.js > /tmp/payment-diff.txt

# Keep the command substitutions inside the quoted prompt so the shell expands them into it
claude "Something broke between yesterday and today. Here are the recent commits and the diff for the suspect file. What changed that could cause: [error message]?

Recent commits:
$(cat /tmp/recent-commits.txt)

Diff:
$(cat /tmp/payment-diff.txt)"

Claude is remarkably good at reading diffs and spotting the subtle change that caused the regression.
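
A fixed range like `HEAD~5` only approximates "since yesterday". A variation is to let git pick the last commit older than a given date and diff from there. This is a sketch under assumptions: the function names are hypothetical, and it relies on `git rev-list --before`, which filters on committer date.

```shell
# Print the hash of the newest commit on the current branch older than the given date.
last_commit_before() {
  git rev-list -n 1 --before="$1" HEAD
}

# Hypothetical helper: build the regression prompt for one suspect file,
# covering everything committed after that date.
build_regression_prompt() {
  local since="$1" file="$2" error_msg="$3" base
  base="$(last_commit_before "$since")"
  if [ -z "$base" ]; then
    echo "no commit found before: $since" >&2
    return 1
  fi

  printf '%s\n\n' "Something broke since $since. Here are the commits and the diff for the suspect file. What changed that could cause: $error_msg?"
  git log --oneline "$base..HEAD"
  printf '\n'
  git diff "$base" HEAD -- "$file"
}

# Usage (run inside the repository; assumes the claude CLI is on PATH):
#   claude "$(build_regression_prompt "1 day ago" src/services/payment.js "[error message]")"
```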

Step 6: The fix + test combo

When Claude proposes a fix, never just apply it:

claude "Before applying that fix:
1. Write a test that would have caught this bug
2. Explain why the fix works
3. List any edge cases where the fix might fail
4. Tell me if there are similar patterns elsewhere in the codebase that need the same fix"

This four-part check has saved me from shipping fixes that were worse than the bug.

Step 7: The blast radius check

Always end with:

claude "Before I deploy this fix:
- What else could this change break?
- What should I monitor for 30 minutes after deploying?
- Is there a safer way to roll this out (feature flag, canary deploy)?"
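
For the 30-minute watch, it helps to have one number to compare before and after the deploy. The sketch below computes the share of 5xx responses in an access log. It assumes common/combined log format, where the status code is the ninth whitespace-separated field; the function name `error_rate` and the access log path in the usage comment are hypothetical.

```shell
# Print the percentage of requests that returned a 5xx status.
# Reads the file given as $1, or stdin when no file is given.
error_rate() {
  awk '
    { total++ }
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", (errors / total) * 100
    }
  ' "${1:--}"
}

# Usage: sample the last 1000 requests once a minute for 30 minutes.
#   for i in $(seq 1 30); do
#     tail -n 1000 /var/log/app/access.log | error_rate
#     sleep 60
#   done
```

Take a reading before deploying so you have a baseline; a log with one 500 in four requests prints `25.00`.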

Real example: the intermittent 500 error

Last month I had an intermittent 500 that only happened under load. Here's the condensed session:

Me: Here's an error that happens on 1% of requests under load:
[error log]

Claude: Based on this, my top 3 hypotheses:
1. Race condition in the session store (most likely — note the timestamp pattern)
2. Connection pool exhaustion under load
3. Memory pressure causing GC pauses

To confirm #1: check if errors cluster at high concurrency...

Me: Yes, errors only occur when there are more than 50 concurrent users.

Claude: That confirms hypothesis 1. The issue is in your session middleware...
[identifies the exact race condition]

Me: Write the fix.

Claude: Here's the fix with the mutex pattern:
[code]

Also — you have the same pattern in auth.js:89 that will fail the same way under load.

Me: Fix both and write tests for both.

Total time: 23 minutes. Without Claude Code, I'd estimate that bug would have taken me a day.

The token burn reality

Production debugging sessions are long. You're context-switching between files, running commands, testing hypotheses. A serious incident can easily run 2-3 hours with Claude Code actively involved.

That's a lot of tokens. Claude Code's rate limits can hit during a production incident — the worst possible time.

If you're doing serious production debugging, you need a fallback. SimplyLouie gives you a Claude-compatible API endpoint for $2/month that doesn't share Anthropic's rate limits. Set it as your backup:

# In your .env
ANTHROPIC_BASE_URL=https://api.simplylouie.com

Switch when Anthropic's limits hit. Switch back when they reset. $2/month is worth it for any production incident.
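
Switching mid-incident is easier with two shell functions than with editing a file. A minimal sketch, assuming Claude Code picks up `ANTHROPIC_BASE_URL` from the environment of the shell that launches it; the function names are hypothetical, and whether the fallback needs its own API key is not covered here.

```shell
# Point new Claude Code sessions started from this shell at the fallback endpoint.
use_fallback_endpoint() {
  export ANTHROPIC_BASE_URL="https://api.simplylouie.com"
}

# Go back to the default endpoint by removing the override.
use_default_endpoint() {
  unset ANTHROPIC_BASE_URL
}

# Usage:
#   use_fallback_endpoint   # when the rate limit hits
#   use_default_endpoint    # once the limit resets
```

The change only affects sessions started afterwards from the same shell, so restart `claude` after switching.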

The CLAUDE.md for debugging

Add a debugging section to your CLAUDE.md:

## Production Debugging Protocol
- Always start with error context + 3 files max
- Always ask for explicit hypotheses before reading more code
- Always check environment variables first
- Always get the fix + test combo
- Always do blast radius check before deploying
- Logs are at: /var/log/app/
- Staging environment: staging.myapp.com
- Rollback command: ./scripts/rollback.sh [version]

This means every Claude session starts with the right context. No need to explain your debugging process every time.
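
If you maintain several repositories, the section can be added with a small idempotent script rather than by hand. This is a sketch: `add_debug_protocol` is a hypothetical name, and the section body is copied from the example above, so adjust the paths and hostnames to your own setup.

```shell
# Append the debugging protocol to CLAUDE.md unless the heading is already present.
# Usage: add_debug_protocol [path/to/CLAUDE.md]
add_debug_protocol() {
  local file="${1:-CLAUDE.md}"
  if [ -f "$file" ] && grep -qF '## Production Debugging Protocol' "$file"; then
    return 0  # already there, nothing to do
  fi
  cat >> "$file" <<'EOF'

## Production Debugging Protocol
- Always start with error context + 3 files max
- Always ask for explicit hypotheses before reading more code
- Always check environment variables first
- Always get the fix + test combo
- Always do blast radius check before deploying
- Logs are at: /var/log/app/
- Staging environment: staging.myapp.com
- Rollback command: ./scripts/rollback.sh [version]
EOF
}

# Usage (from the repository root):
#   add_debug_protocol
```

Running it twice leaves a single copy of the section, so it is safe to include in a repository bootstrap script.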

Summary

The production debugging workflow:

Dump the error context immediately
Give 3 files max to start
Get explicit hypotheses before going deeper
Always check environment variables
Use git diff for regressions
Always get fix + test + edge cases + similar patterns
Always do blast radius check before deploying

Production is where Claude Code proves its value. But only if you use it right.
