Samson Tanimawo

3am Incident Response: What I Learned from 200+ Pages

The First 5 Minutes Matter Most

I've been paged over 200 times in my career. The pattern is always the same: the first 5 minutes determine whether you resolve in 15 minutes or 3 hours.

Here's what I've learned.

The 3am Brain Problem

At 3am, your cognitive function is a fraction of its daytime baseline. You're making decisions about production systems with half a brain. That's terrifying.

The solution isn't caffeine. It's preparation.

My Incident Response Template

Every time I get paged, I follow the same script. No thinking required:

## STEP 1: ASSESS (0-2 minutes)
- [ ] Read the alert summary
- [ ] Check: Is this customer-facing?
- [ ] Check: Is this data-loss risk?
- [ ] Open the relevant dashboard

## STEP 2: SCOPE (2-5 minutes)
- [ ] When did it start? (check the graph)
- [ ] What changed? (check deploy log)
- [ ] Who else is affected? (check dependent services)
- [ ] Is it getting worse or stable?

## STEP 3: ACT (5+ minutes)
- If deploy-related → rollback
- If traffic-related → scale up
- If dependency-related → check status page
- If unknown → engage secondary on-call
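STEP 3's branching is simple enough to encode ahead of time, so there is nothing left to decide at 3am. A minimal Python sketch — the category names and action strings are my own placeholders, not part of any real tooling:

```python
# Hypothetical triage helper: maps the alert category you settled on in
# STEPS 1-2 to the pre-decided action from STEP 3.
TRIAGE_ACTIONS = {
    "deploy": "rollback to last known good",
    "traffic": "scale up",
    "dependency": "check provider status page",
}

def next_action(category: str) -> str:
    # Anything you can't classify goes straight to the secondary on-call.
    return TRIAGE_ACTIONS.get(category, "engage secondary on-call")

print(next_action("deploy"))   # rollback to last known good
print(next_action("mystery"))  # engage secondary on-call
```

The point is not the dictionary; it's that the mapping was written while you were awake.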

Pattern: 70% of 3am Pages Are Deploy-Related

After tracking root causes for a year, the data was clear:

Deploy-related:      70%
Traffic spike:       12%
Dependency failure:   8%
Infrastructure:       6%
Truly unknown:        4%

This means your first question at 3am should always be: "What deployed in the last 6 hours?"

# First command I always run
git log --oneline --since='6 hours ago' --all

# Or if using a deploy tracker
curl -s https://deploy-tracker.internal/api/recent | jq '.[] | .service + " @ " + .time'

The Rollback-First Mindset

At 3am, your goal is not to fix the bug. Your goal is to restore service. There's a massive difference.

Wrong approach at 3am:

  1. Read the error logs
  2. Understand the root cause
  3. Write a fix
  4. Deploy the fix
  5. Verify

Right approach at 3am:

  1. Confirm the issue
  2. Rollback to last known good
  3. Verify service restored
  4. Go back to sleep
  5. Debug tomorrow with full brain power
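The rollback-first loop is mechanical enough to script. A hedged sketch — `confirm_issue`, `rollback`, and `service_healthy` are stand-ins for whatever your deploy tooling actually provides (kubectl, a deploy API, a health endpoint):

```python
import time

def rollback_first(confirm_issue, rollback, service_healthy,
                   checks: int = 3, interval: float = 1.0) -> bool:
    """Restore service; don't fix the bug. The three callables are
    placeholders for your own tooling."""
    if not confirm_issue():
        return True  # false alarm: service is fine, go back to sleep
    rollback()
    # Verify restoration a few times before declaring victory.
    for _ in range(checks):
        if service_healthy():
            return True
        time.sleep(interval)
    return False  # still broken after rollback: time to escalate

# Usage with stubbed tooling:
state = {"bad": True}
restored = rollback_first(
    confirm_issue=lambda: state["bad"],
    rollback=lambda: state.update(bad=False),
    service_healthy=lambda: not state["bad"],
)
print(restored)  # True
```

Note the return value when the rollback doesn't help: that's your signal to jump to the escalation template rather than start debugging.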

Communication Templates

I keep pre-written messages ready:

Internal Slack:
"@oncall-secondary Investigating [SERVICE] alert. 
Impact: [SCOPE]. Will update in 15 min."

Status page (if customer-facing):
"We are investigating reports of [SYMPTOM] 
affecting [SERVICE]. Updates will follow."

Escalation:
"Need help with [SERVICE]. Current state: [STATE]. 
I've tried: [ACTIONS]. Need someone with [EXPERTISE]."
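Pre-written messages work even better as a fill-in function, so you're never hand-editing placeholders with half a brain. A sketch using Python's `str.format` — the template keys mirror the bracketed fields above:

```python
# Templates from the runbook; {fields} correspond to the [PLACEHOLDERS].
TEMPLATES = {
    "internal": ("@oncall-secondary Investigating {service} alert. "
                 "Impact: {scope}. Will update in 15 min."),
    "status_page": ("We are investigating reports of {symptom} "
                    "affecting {service}. Updates will follow."),
    "escalation": ("Need help with {service}. Current state: {state}. "
                   "I've tried: {actions}. Need someone with {expertise}."),
}

def incident_message(kind: str, **fields) -> str:
    # A KeyError here means a missing field, which is exactly the kind
    # of mistake you want surfaced before the message goes out.
    return TEMPLATES[kind].format(**fields)

print(incident_message("internal",
                       service="checkout-api",
                       scope="5xx errors on EU traffic"))
```

Wire the output into a Slack webhook or status-page API of your choice; the function just guarantees the wording stays consistent.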

The Post-Incident Ritual

The morning after every 3am page, I spend 20 minutes writing a mini-retro:

  1. Timeline: What happened when
  2. Detection: How fast did we know?
  3. Resolution: What fixed it?
  4. Prevention: How do we stop this from happening again?
  5. Automation: What manual step can we automate?

This compounds. After 200+ incidents, my team has automated away 60% of the common failure modes.
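The mini-retro is easy to scaffold so the morning-after habit survives busy weeks. A sketch that emits the five sections as a markdown stub — the heading format and filenames are my own choices, not a standard:

```python
from datetime import date

# The five questions from the ritual above.
RETRO_SECTIONS = [
    "Timeline: What happened when",
    "Detection: How fast did we know?",
    "Resolution: What fixed it?",
    "Prevention: How do we stop this from happening again?",
    "Automation: What manual step can we automate?",
]

def retro_stub(service: str, day: date) -> str:
    lines = [f"# Mini-retro: {service} ({day.isoformat()})", ""]
    for i, section in enumerate(RETRO_SECTIONS, start=1):
        lines += [f"## {i}. {section}", "", "TODO", ""]
    return "\n".join(lines)

stub = retro_stub("checkout-api", date(2024, 3, 1))
print(stub.splitlines()[0])
# # Mini-retro: checkout-api (2024-03-01)
```

Twenty minutes filling in five TODOs is a much lower bar than writing a retro from a blank page.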

The Metric That Changed Everything

We started tracking "Time to Confident" — the time from page to knowing what's wrong. Not resolution, just understanding. This dropped our overall MTTR by 55% because confident responders take decisive action instead of flailing.
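"Time to Confident" is trivial to compute once you log a timestamp for the moment a responder declares a diagnosis. A sketch over hypothetical incident records — the tuple layout is illustrative, not a real schema:

```python
from datetime import datetime
from statistics import mean

def minutes(td) -> float:
    return td.total_seconds() / 60

# Hypothetical records: (paged_at, confident_at, resolved_at).
incidents = [
    (datetime(2024, 3, 1, 3, 0), datetime(2024, 3, 1, 3, 8),
     datetime(2024, 3, 1, 3, 20)),
    (datetime(2024, 3, 5, 2, 30), datetime(2024, 3, 5, 2, 42),
     datetime(2024, 3, 5, 3, 10)),
]

ttc = mean(minutes(c - p) for p, c, _ in incidents)   # page -> diagnosis
mttr = mean(minutes(r - p) for p, _, r in incidents)  # page -> resolution
print(f"TTC: {ttc:.0f} min, MTTR: {mttr:.0f} min")
# TTC: 10 min, MTTR: 30 min
```

Tracking both numbers side by side is what reveals the pattern: when TTC drops, MTTR tends to follow.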

If you're tired of chaotic 3am incidents and want AI-assisted triage, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
