Samson Tanimawo

3am Incident Response: What I Learned from 200+ Pages

The First 5 Minutes Matter Most

I've been paged over 200 times in my career. The pattern is always the same: the first 5 minutes determine whether you resolve in 15 minutes or 3 hours.

Here's what I've learned.

The 3am Brain Problem

At 3am, your cognitive function is a fraction of its daytime baseline. You're making decisions about production systems with half a brain. That's terrifying.

The solution isn't caffeine. It's preparation.

My Incident Response Template

Every time I get paged, I follow the same script. No thinking required:

## STEP 1: ASSESS (0-2 minutes)
- [ ] Read the alert summary
- [ ] Check: Is this customer-facing?
- [ ] Check: Is this data-loss risk?
- [ ] Open the relevant dashboard

## STEP 2: SCOPE (2-5 minutes)
- [ ] When did it start? (check the graph)
- [ ] What changed? (check deploy log)
- [ ] Who else is affected? (check dependent services)
- [ ] Is it getting worse or stable?

## STEP 3: ACT (5+ minutes)
- If deploy-related → rollback
- If traffic-related → scale up
- If dependency-related → check status page
- If unknown → engage secondary on-call
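STEP 3's branching is simple enough to encode ahead of time, so there is nothing left to decide at 3am. A minimal Python sketch — the category names and action strings are my own placeholders, not part of any real tooling:

```python
# Hypothetical triage helper: maps the alert category you settled on in
# STEPS 1-2 to the pre-decided action from STEP 3.
TRIAGE_ACTIONS = {
    "deploy": "rollback to last known good",
    "traffic": "scale up",
    "dependency": "check provider status page",
}

def next_action(category: str) -> str:
    # Anything you can't classify goes straight to the secondary on-call.
    return TRIAGE_ACTIONS.get(category, "engage secondary on-call")

print(next_action("deploy"))   # rollback to last known good
print(next_action("mystery"))  # engage secondary on-call
```

The point is not the dictionary; it's that the mapping was written while you were awake.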

Pattern: 70% of 3am Pages Are Deploy-Related

After tracking root causes for a year, the data was clear:

Deploy-related:      70%
Traffic spike:       12%
Dependency failure:   8%
Infrastructure:       6%
Truly unknown:        4%

This means your first question at 3am should always be: "What deployed in the last 6 hours?"

# First command I always run
git log --oneline --since='6 hours ago' --all

# Or if using a deploy tracker
curl -s https://deploy-tracker.internal/api/recent | jq '.[] | .service + " @ " + .time'

The Rollback-First Mindset

At 3am, your goal is not to fix the bug. Your goal is to restore service. There's a massive difference.

Wrong approach at 3am:

  1. Read the error logs
  2. Understand the root cause
  3. Write a fix
  4. Deploy the fix
  5. Verify

Right approach at 3am:

  1. Confirm the issue
  2. Rollback to last known good
  3. Verify service restored
  4. Go back to sleep
  5. Debug tomorrow with full brain power
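The rollback-first loop is mechanical enough to script. A hedged sketch — `confirm_issue`, `rollback`, and `service_healthy` are stand-ins for whatever your deploy tooling actually provides (kubectl, a deploy API, a health endpoint):

```python
import time

def rollback_first(confirm_issue, rollback, service_healthy,
                   checks: int = 3, interval: float = 1.0) -> bool:
    """Restore service; don't fix the bug. The three callables are
    placeholders for your own tooling."""
    if not confirm_issue():
        return True  # false alarm: service is fine, go back to sleep
    rollback()
    # Verify restoration a few times before declaring victory.
    for _ in range(checks):
        if service_healthy():
            return True
        time.sleep(interval)
    return False  # still broken after rollback: time to escalate

# Usage with stubbed tooling:
state = {"bad": True}
restored = rollback_first(
    confirm_issue=lambda: state["bad"],
    rollback=lambda: state.update(bad=False),
    service_healthy=lambda: not state["bad"],
)
print(restored)  # True
```

Note the return value when the rollback doesn't help: that's your signal to jump to the escalation template rather than start debugging.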

Communication Templates

I keep pre-written messages ready:

Internal Slack:
"@oncall-secondary Investigating [SERVICE] alert. 
Impact: [SCOPE]. Will update in 15 min."

Status page (if customer-facing):
"We are investigating reports of [SYMPTOM] 
affecting [SERVICE]. Updates will follow."

Escalation:
"Need help with [SERVICE]. Current state: [STATE]. 
I've tried: [ACTIONS]. Need someone with [EXPERTISE]."
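Pre-written messages work even better as a fill-in function, so you're never hand-editing placeholders with half a brain. A sketch using Python's `str.format` — the template keys mirror the bracketed fields above:

```python
# Templates from the runbook; {fields} correspond to the [PLACEHOLDERS].
TEMPLATES = {
    "internal": ("@oncall-secondary Investigating {service} alert. "
                 "Impact: {scope}. Will update in 15 min."),
    "status_page": ("We are investigating reports of {symptom} "
                    "affecting {service}. Updates will follow."),
    "escalation": ("Need help with {service}. Current state: {state}. "
                   "I've tried: {actions}. Need someone with {expertise}."),
}

def incident_message(kind: str, **fields) -> str:
    # A KeyError here means a missing field, which is exactly the kind
    # of mistake you want surfaced before the message goes out.
    return TEMPLATES[kind].format(**fields)

print(incident_message("internal",
                       service="checkout-api",
                       scope="5xx errors on EU traffic"))
```

Wire the output into a Slack webhook or status-page API of your choice; the function just guarantees the wording stays consistent.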

The Post-Incident Ritual

The morning after every 3am page, I spend 20 minutes writing a mini-retro:

  1. Timeline: What happened when
  2. Detection: How fast did we know?
  3. Resolution: What fixed it?
  4. Prevention: How do we stop this from happening again?
  5. Automation: What manual step can we automate?

This compounds. After 200+ incidents, my team has automated away 60% of the common failure modes.
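The mini-retro is easy to scaffold so the morning-after habit survives busy weeks. A sketch that emits the five sections as a markdown stub — the heading format and filenames are my own choices, not a standard:

```python
from datetime import date

# The five questions from the ritual above.
RETRO_SECTIONS = [
    "Timeline: What happened when",
    "Detection: How fast did we know?",
    "Resolution: What fixed it?",
    "Prevention: How do we stop this from happening again?",
    "Automation: What manual step can we automate?",
]

def retro_stub(service: str, day: date) -> str:
    lines = [f"# Mini-retro: {service} ({day.isoformat()})", ""]
    for i, section in enumerate(RETRO_SECTIONS, start=1):
        lines += [f"## {i}. {section}", "", "TODO", ""]
    return "\n".join(lines)

stub = retro_stub("checkout-api", date(2024, 3, 1))
print(stub.splitlines()[0])
# # Mini-retro: checkout-api (2024-03-01)
```

Twenty minutes filling in five TODOs is a much lower bar than writing a retro from a blank page.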

The Metric That Changed Everything

We started tracking "Time to Confident" — the time from page to knowing what's wrong. Not resolution, just understanding. This dropped our overall MTTR by 55% because confident responders take decisive action instead of flailing.
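"Time to Confident" is trivial to compute once you log a timestamp for the moment a responder declares a diagnosis. A sketch over hypothetical incident records — the tuple layout is illustrative, not a real schema:

```python
from datetime import datetime
from statistics import mean

def minutes(td) -> float:
    return td.total_seconds() / 60

# Hypothetical records: (paged_at, confident_at, resolved_at).
incidents = [
    (datetime(2024, 3, 1, 3, 0), datetime(2024, 3, 1, 3, 8),
     datetime(2024, 3, 1, 3, 20)),
    (datetime(2024, 3, 5, 2, 30), datetime(2024, 3, 5, 2, 42),
     datetime(2024, 3, 5, 3, 10)),
]

ttc = mean(minutes(c - p) for p, c, _ in incidents)   # page -> diagnosis
mttr = mean(minutes(r - p) for p, _, r in incidents)  # page -> resolution
print(f"TTC: {ttc:.0f} min, MTTR: {mttr:.0f} min")
# TTC: 10 min, MTTR: 30 min
```

Tracking both numbers side by side is what reveals the pattern: when TTC drops, MTTR tends to follow.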

If you're tired of chaotic 3am incidents and want AI-assisted triage, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
