The First 5 Minutes Matter Most
I've been paged over 200 times in my career. The pattern is always the same: the first 5 minutes determine whether you resolve in 15 minutes or 3 hours.
Here's what I've learned.
The 3am Brain Problem
At 3am, your cognitive function is roughly 50% of normal. You're making decisions about production systems with half a brain. That's terrifying.
The solution isn't caffeine. It's preparation.
My Incident Response Template
Every time I get paged, I follow the same script. No thinking required:
## STEP 1: ASSESS (0-2 minutes)
- [ ] Read the alert summary
- [ ] Check: Is this customer-facing?
- [ ] Check: Is this data-loss risk?
- [ ] Open the relevant dashboard
## STEP 2: SCOPE (2-5 minutes)
- [ ] When did it start? (check the graph)
- [ ] What changed? (check deploy log)
- [ ] Who else is affected? (check dependent services)
- [ ] Is it getting worse or stable?
## STEP 3: ACT (5+ minutes)
- If deploy-related → rollback
- If traffic-related → scale up
- If dependency-related → check status page
- If unknown → engage secondary on-call
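The three steps above can be sketched as a small shell function. This is a minimal sketch of the assess/scope/act flow, not a real tool — the service name is whatever your pager hands you, and the deploy check falls back gracefully outside a git repo:

```shell
# Minimal sketch of the assess/scope/act flow as a shell function.
triage() {
  local service="${1:?usage: triage <service>}"
  echo "== STEP 1: ASSESS == alert for: $service (open its dashboard now)"

  echo "== STEP 2: SCOPE =="
  # Recent commits are the most likely culprit at 3am.
  if git rev-parse --git-dir >/dev/null 2>&1; then
    git log --oneline --since='6 hours ago' --all | head -20
  else
    echo "(not a git repo -- check your deploy tracker instead)"
  fi

  echo "== STEP 3: ACT == deploy-related? rollback. traffic? scale. unknown? escalate."
}
```

Pin this to a shell alias so the script runs itself while your brain warms up.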
Pattern: 70% of 3am Pages Are Deploy-Related
After tracking root causes for a year, the data was clear:
| Root cause | Share of pages |
| --- | --- |
| Deploy-related | 70% |
| Traffic spike | 12% |
| Dependency failure | 8% |
| Infrastructure | 6% |
| Truly unknown | 4% |
This means your first question at 3am should always be: "What deployed in the last 6 hours?"
```shell
# First command I always run
git log --oneline --since='6 hours ago' --all

# Or if using a deploy tracker
curl -s https://deploy-tracker.internal/api/recent | jq '.[] | .service + " @ " + .time'
```
The Rollback-First Mindset
At 3am, your goal is not to fix the bug. Your goal is to restore service. There's a massive difference.
Wrong approach at 3am:
- Read the error logs
- Understand the root cause
- Write a fix
- Deploy the fix
- Verify
Right approach at 3am:
- Confirm the issue
- Rollback to last known good
- Verify service restored
- Go back to sleep
- Debug tomorrow with full brain power
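Here's the rollback-first move demonstrated in a throwaway repo — a sandbox sketch, not a production deploy pipeline. The commit messages and `app.conf` file are invented for the demo; the point is that `git revert` restores the last known good state without you understanding the bug:

```shell
# Sandbox demo: revert the suspect commit instead of writing a fix at 3am.
repo=$(mktemp -d)
cd "$repo" && git init -q
git config user.email demo@example.com && git config user.name demo

echo "v1 (known good)" > app.conf && git add . && git commit -qm "good deploy"
echo "v2 (broken)"     > app.conf && git commit -qam "bad deploy"

git revert --no-edit HEAD >/dev/null   # creates an inverse commit of the bad deploy
cat app.conf                           # back to the known-good contents
```

Redeploy from the reverted HEAD, verify, sleep. The forward fix can wait for daylight.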
Communication Templates
I keep pre-written messages ready:
Internal Slack:

```
@oncall-secondary Investigating [SERVICE] alert.
Impact: [SCOPE]. Will update in 15 min.
```

Status page (if customer-facing):

```
We are investigating reports of [SYMPTOM]
affecting [SERVICE]. Updates will follow.
```

Escalation:

```
Need help with [SERVICE]. Current state: [STATE].
I've tried: [ACTIONS]. Need someone with [EXPERTISE].
```
The Post-Incident Ritual
The morning after every 3am page, I spend 20 minutes writing a mini-retro:
- Timeline: What happened when
- Detection: How fast did we know?
- Resolution: What fixed it?
- Prevention: How do we stop this from happening again?
- Automation: What manual step can we automate?
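The five headings above can be scaffolded into a dated file the moment you wake up, so the 20 minutes go into thinking, not formatting. The filename convention here is my own invention — adapt it to wherever your team keeps retros:

```shell
# Scaffold the morning-after mini-retro as a dated markdown file.
retro_file="retro-$(date +%F).md"
cat > "$retro_file" <<'EOF'
# Mini-retro
- Timeline:
- Detection:
- Resolution:
- Prevention:
- Automation:
EOF
echo "wrote $retro_file"
```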
This compounds. After 200+ incidents, my team has automated away 60% of the common failure modes.
The Metric That Changed Everything
We started tracking "Time to Confident" — the time from page to knowing what's wrong. Not resolution, just understanding. This dropped our overall MTTR by 55% because confident responders take decisive action instead of flailing.
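Computing Time to Confident is just timestamp arithmetic: page time to the moment you knew what was wrong. A sketch with invented timestamps, assuming GNU `date` (the `-d` flag; BSD/macOS `date` differs):

```shell
# "Time to Confident": minutes from page to understanding.
# Both timestamps are hypothetical examples.
paged_at="2024-03-12T03:07:00Z"
confident_at="2024-03-12T03:19:00Z"

ttc_min=$(( ( $(date -d "$confident_at" +%s) - $(date -d "$paged_at" +%s) ) / 60 ))
echo "Time to Confident: ${ttc_min} min"
```

Log these two timestamps per incident and the metric falls out for free.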
If you're tired of chaotic 3am incidents and want AI-assisted triage, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com