How I use Claude Code to debug production errors — a complete workflow
Production is down. The error logs are cryptic. Your team is pinging you. This is the moment where Claude Code either saves your day or wastes an hour of your time.
After debugging dozens of production incidents with Claude Code, I've developed a workflow that consistently gets to root cause faster than any other approach I've tried.
Here's exactly how I do it.
The core problem with production debugging
Production errors are different from development errors:
- Stack traces reference minified or compiled code
- The error might be intermittent
- You can't reproduce it locally
- Multiple systems are involved
- Time pressure is real
Most debugging approaches stall because they're linear: you inspect one log, one file, one system at a time. Claude Code lets you reason over the logs, the code, and the configuration together.
Step 1: Dump the error context immediately
The moment an alert fires, I run:
claude "Production error investigation. Here's the error:
$(tail -n 50 /var/log/app/error.log)
And the stack trace:
$(tail -n 30 /var/log/app/stack.log)
Start by telling me: (1) what's actually failing, (2) what caused it, (3) what files to check first."
This gets Claude oriented before I start showing it code. The initial orientation matters — it sets up the right mental model for the rest of the session.
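One weakness of the one-liner is that a missing or unreadable log silently becomes an empty string in the prompt. Below is a minimal sketch of a helper that labels each log and flags the ones it could not read. The function name collect_log_tails is my own invention for illustration, and the paths in the usage comment are the ones from the command above.

```shell
# Print the last N lines of each readable log file, with a label per file.
# Unreadable or missing files are reported instead of being silently skipped.
# (collect_log_tails is a hypothetical helper name, not part of any tool.)
collect_log_tails() (
  lines=$1
  shift
  for log in "$@"; do
    if [ -r "$log" ]; then
      printf '=== last %s lines of %s ===\n' "$lines" "$log"
      tail -n "$lines" "$log"
    else
      printf '=== %s: not readable, skipped ===\n' "$log"
    fi
  done
)

# Usage:
# claude "Production error investigation. Here are the logs:
# $(collect_log_tails 50 /var/log/app/error.log /var/log/app/stack.log)
# Start by telling me: (1) what's actually failing, (2) what caused it, (3) what files to check first."
```

The body runs in a subshell (parentheses instead of braces), so its variables don't leak into your interactive shell.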
Step 2: The 3-file rule
Claude Code is most effective when you give it 3 files max to start:
- The file where the error originates
- The file that calls it
- The configuration/environment file
claude "Read these three files and identify the bug:
1. src/services/payment.js
2. src/routes/checkout.js
3. config/production.env.example
The error is: 'Cannot read property stripe_key of undefined' at payment.js:47"
Don't dump the entire codebase. Three files, targeted question, clear error message.
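If you want the three-file limit enforced rather than remembered, a small guard can build the prompt and refuse a fourth path. This is only a sketch: three_file_prompt is a hypothetical helper, and the prompt wording simply mirrors the example above.

```shell
# Build a "read these files" prompt from an error message and 1 to 3 file paths.
# Fails (non-zero status) if more than three files are passed.
# (three_file_prompt is a hypothetical helper name used for illustration.)
three_file_prompt() (
  error_msg=$1
  shift
  if [ "$#" -lt 1 ] || [ "$#" -gt 3 ]; then
    printf 'three_file_prompt: expected 1 to 3 files, got %s\n' "$#" >&2
    exit 1   # the body is a subshell, so this only ends the function call
  fi
  printf 'Read these files and identify the bug:\n'
  i=1
  for f in "$@"; do
    printf '%s. %s\n' "$i" "$f"
    i=$((i + 1))
  done
  printf 'The error is: %s\n' "$error_msg"
)

# Usage:
# claude "$(three_file_prompt "Cannot read property stripe_key of undefined at payment.js:47" \
#   src/services/payment.js src/routes/checkout.js config/production.env.example)"
```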
Step 3: Hypothesis-first debugging
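Before Claude reads any more files, ask for hypotheses. Send this as a follow-up in the session you already have open (or resume it with claude --continue), because a fresh claude invocation starts without the earlier context: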
claude "Before we go deeper — list your top 3 hypotheses for what's causing this. For each one, tell me what evidence would confirm or rule it out."
This is the most underused Claude Code pattern. Getting explicit hypotheses means:
- You're not just following Claude blindly
- You can quickly test each hypothesis
- You catch cases where Claude is wrong early
- The session stays focused
Step 4: The environment check
In my experience, a large share of production bugs turn out to be environment issues. Always do this:
claude "Check if this could be an environment configuration issue. Look at:
- What environment variables does this code need?
- Which ones might be missing or wrong in production?
- How would I verify each one without exposing secrets?"
I've saved hours by catching a missing env var in minute 3 instead of minute 45.
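For the "verify without exposing secrets" part, one option is to report only whether each variable is set and how long its value is. The sketch below assumes the variables are exported in the environment of the shell you run it from. check_env_vars and the variable names in the usage comment (STRIPE_KEY, DATABASE_URL) are hypothetical, so substitute whatever your code actually reads.

```shell
# Report whether each named environment variable is set, without printing its value.
# Only the length is shown, which is usually enough to spot an empty or truncated value.
# Returns non-zero if any variable is missing or empty.
# (check_env_vars is a hypothetical helper name used for illustration.)
check_env_vars() (
  missing=0
  for name in "$@"; do
    # printenv fails for unset variables; "|| true" keeps that from aborting under set -e
    value=$(printenv "$name" || true)
    if [ -n "$value" ]; then
      printf '%s: set (%s chars)\n' "$name" "${#value}"
    else
      printf '%s: MISSING or empty\n' "$name"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
)

# Usage on the production host (variable names are placeholders):
# check_env_vars STRIPE_KEY DATABASE_URL
```

The output is safe to paste back into the Claude session because it contains names and lengths only.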
Step 5: The diff approach for regressions
If this worked yesterday and broke today:
# Get the recent changes
git log --oneline -20 > /tmp/recent-commits.txt
git diff HEAD~5 HEAD -- src/services/payment.js > /tmp/payment-diff.txt
# Keep the command substitutions inside the quotes so their output becomes part of the prompt
claude "Something broke between yesterday and today. Here are the recent commits and the diff for the suspect file. What changed that could cause: [error message]?
$(cat /tmp/recent-commits.txt)
$(cat /tmp/payment-diff.txt)"
Claude is remarkably good at reading diffs and spotting the subtle change that caused the regression.
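If you do this often, the commit list and the diff can come from one function, which avoids the temp files. A minimal sketch, assuming you run it inside the repository and that the suspect file changed within the last N commits. regression_context is a hypothetical name, and the default of 5 commits just mirrors the HEAD~5 used above.

```shell
# Print recent commits plus the diff of one suspect file over the last N commits.
# Run from inside the repository; N must not exceed the number of earlier commits.
# (regression_context is a hypothetical helper name used for illustration.)
regression_context() (
  file=$1
  commits_back=${2:-5}
  printf 'Recent commits:\n'
  git log --oneline -n 20
  printf '\nDiff of %s over the last %s commits:\n' "$file" "$commits_back"
  git diff "HEAD~$commits_back" HEAD -- "$file"
)

# Usage:
# claude "Something broke between yesterday and today. What changed that could cause: [error message]?
# $(regression_context src/services/payment.js 5)"
```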
Step 6: The fix + test combo
When Claude proposes a fix, never just apply it:
claude "Before applying that fix:
1. Write a test that would have caught this bug
2. Explain why the fix works
3. List any edge cases where the fix might fail
4. Tell me if there are similar patterns elsewhere in the codebase that need the same fix"
This four-part check has saved me from shipping fixes that were worse than the bug.
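Point 4 is easy to cross-check yourself, so you aren't relying on Claude's answer alone. The sketch below lists every matching line in the JavaScript sources as file:line:text. find_similar_patterns is a hypothetical helper, and the config\.stripe_key pattern in the usage comment is only a guess at what the buggy access looks like in this example.

```shell
# List every line in the .js files under a directory that matches an extended regex,
# printed as file:line:text and sorted for stable output.
# (find_similar_patterns is a hypothetical helper name used for illustration.)
find_similar_patterns() (
  pattern=$1
  dir=${2:-src}
  # /dev/null is passed as an extra file so grep always prints the file name,
  # even when find hands it a single match.
  find "$dir" -type f -name '*.js' -exec grep -nE -- "$pattern" /dev/null {} + | sort
)

# Usage (the pattern is an assumption about how the code reads the key):
# find_similar_patterns 'config\.stripe_key' src
```

If grep finds call sites that Claude didn't mention, that is a good prompt for a follow-up question before you ship.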
Step 7: The blast radius check
Always end with:
claude "Before I deploy this fix:
- What else could this change break?
- What should I monitor for 30 minutes after deploying?
- Is there a safer way to roll this out (feature flag, canary deploy)?"
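For the 30-minute watch, a rough approach is to record the log's length just before deploying and then count only the error lines appended after that point. This sketch assumes error lines contain the literal text ERROR and that the log is not rotated during the window. Both function names are hypothetical.

```shell
# Print the current number of lines in a log file; record this just before deploying.
# (log_line_count and errors_since are hypothetical helper names.)
log_line_count() {
  wc -l < "$1" | tr -d ' '
}

# Count lines containing "ERROR" that were appended after the recorded line count.
# Assumes the log has not been rotated or truncated since the baseline was taken.
errors_since() (
  log=$1
  baseline=$2
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that status
  tail -n "+$((baseline + 1))" "$log" | grep -c 'ERROR' || true
)

# Usage:
# baseline=$(log_line_count /var/log/app/error.log)
# ...deploy the fix...
# errors_since /var/log/app/error.log "$baseline"   # rerun every few minutes
```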
Real example: the intermittent 500 error
Last month I had an intermittent 500 that only happened under load. Here's the condensed session:
Me: Here's an error that happens on 1% of requests under load:
[error log]
Claude: Based on this, my top 3 hypotheses:
1. Race condition in the session store (most likely — note the timestamp pattern)
2. Connection pool exhaustion under load
3. Memory pressure causing GC pauses
To confirm #1: check if errors cluster at high concurrency...
Me: Yes, the errors only occur when concurrent users > 50.
Claude: That confirms hypothesis 1. The issue is in your session middleware...
[identifies the exact race condition]
Me: Write the fix.
Claude: Here's the fix with the mutex pattern:
[code]
Also — you have the same pattern in auth.js:89 that will fail the same way under load.
Me: Fix both and write tests for both.
Total time: 23 minutes. Without Claude Code, I estimate that bug would have taken me a day.
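The clustering check from that session can be approximated from the log itself by bucketing errors per minute and comparing the spikes against your traffic graphs. This is a sketch under two assumptions that may not match your logs: every line starts with an ISO 8601 timestamp such as 2024-05-01T12:00:05Z, and error lines contain the text ERROR. errors_per_minute is a hypothetical name.

```shell
# Count ERROR lines per minute, assuming each line starts with an ISO 8601 timestamp.
# The first 16 characters of the timestamp (YYYY-MM-DDTHH:MM) identify the minute.
# (errors_per_minute is a hypothetical helper name used for illustration.)
errors_per_minute() {
  awk '/ERROR/ { count[substr($1, 1, 16)]++ }
       END { for (minute in count) print minute, count[minute] }' "$1" | sort
}

# Usage:
# errors_per_minute /var/log/app/error.log | tail -n 20
```

It only shows clustering in time. Whether those minutes line up with high concurrency still has to be checked against whatever records your traffic.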
The token burn reality
Production debugging sessions are long. You're context-switching between files, running commands, testing hypotheses. A serious incident can easily run 2-3 hours with Claude Code actively involved.
That's a lot of tokens. Claude Code's rate limits can hit during a production incident — the worst possible time.
If you're doing serious production debugging, you need a fallback. SimplyLouie gives you a Claude-compatible API endpoint for $2/month that doesn't share Anthropic's rate limits. Set it as your backup:
# In your .env
ANTHROPIC_BASE_URL=https://api.simplylouie.com
Switch when Anthropic's limits hit. Switch back when they reset. $2/month is worth it for any production incident.
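If you would rather switch per shell session than edit a file mid-incident, a pair of functions can set and clear the override. This is a sketch that takes the article's endpoint and its compatibility claim at face value. The function names are hypothetical, and any credentials the fallback expects are not covered here.

```shell
# Point Claude Code at the fallback endpoint for the current shell session.
# (use_fallback_endpoint and use_default_endpoint are hypothetical helper names.)
use_fallback_endpoint() {
  export ANTHROPIC_BASE_URL="https://api.simplylouie.com"
}

# Go back to the default endpoint by removing the override.
use_default_endpoint() {
  unset ANTHROPIC_BASE_URL
}

# Usage:
# use_fallback_endpoint   # when the rate limit hits
# use_default_endpoint    # once the limit resets
```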
The CLAUDE.md for debugging
Add a debugging section to your CLAUDE.md:
## Production Debugging Protocol
- Always start with error context + 3 files max
- Always ask for explicit hypotheses before reading more code
- Always check environment variables before digging deeper
- Always get the fix + test combo
- Always do blast radius check before deploying
- Logs are at: /var/log/app/
- Staging environment: staging.myapp.com
- Rollback command: ./scripts/rollback.sh [version]
This means every Claude session starts with the right context. No need to explain your debugging process every time.
Summary
The production debugging workflow:
- Dump the error context immediately
- Give 3 files max to start
- Get explicit hypotheses before going deeper
- Always check environment variables
- Use git diff for regressions
- Always get fix + test + edge cases + similar patterns
- Always do blast radius check before deploying
Production is where Claude Code proves its value. But only if you use it right.