## The Monday Morning Disaster
Every Monday, the same story: the incoming on-call engineer has no idea what happened over the weekend. The outgoing engineer left a cryptic Slack message at 11pm and went to bed.
We lost 2 hours every Monday rebuilding context.
## The Structured Handoff
We built a handoff template that takes 15 minutes to write and saves hours of confusion:
````markdown
# On-Call Handoff: [DATE] → [DATE]

## Outgoing: @engineer_a | Incoming: @engineer_b

### Active Issues

| Issue | Status | Next Step | ETA |
|-------|--------|-----------|-----|
| DB replication lag | Monitoring | Auto-resolves if < 5s | Check at noon |
| Cert expiry api.prod | Fix scheduled | Deploy cert-bot PR #234 | Tuesday AM |

### Incidents This Shift

1. **[P2] Payment timeout spike** — 2024-03-15 02:30 UTC
   - Resolved: Increased connection pool from 20 → 50
   - Post-mortem: Scheduled for Wednesday
   - Lingering risk: Pool size is a band-aid; we need a connection pooler

### Upcoming Risks

- Major deploy of auth-service v3 on Tuesday
- Black Friday load test on Thursday
- AWS maintenance window Friday 2-6am UTC

### Helpful Context

- The cache-service has been flaky — a restart fixes it (known bug, JIRA-456)
- New on-call runbook for search-service is at [link]
- PagerDuty schedule was updated — check your shifts

### Metrics to Watch

- DB replication lag: should be < 1s (currently 0.8s)
- Payment success rate: should be > 99.8% (currently 99.7%)
- API error rate: baseline is 0.05% (currently 0.04%)
````
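The "Metrics to Watch" baselines can be checked programmatically so the incoming engineer starts from hard numbers rather than memory. A minimal sketch, using the three metrics and thresholds from the template above; the dictionary layout and hard-coded current values are illustrative stand-ins for a real metrics API:

```python
# Baselines mirror the handoff template; "direction" says which side of
# the threshold is healthy. Current values here are hard-coded examples.
BASELINES = {
    "db_replication_lag_s": {"current": 0.8, "threshold": 1.0, "direction": "below"},
    "payment_success_rate": {"current": 0.997, "threshold": 0.998, "direction": "above"},
    "api_error_rate": {"current": 0.0004, "threshold": 0.0005, "direction": "below"},
}

def check_baselines(baselines):
    """Return the names of metrics that are outside their baseline."""
    breaches = []
    for name, m in baselines.items():
        if m["direction"] == "below":
            healthy = m["current"] < m["threshold"]
        else:
            healthy = m["current"] > m["threshold"]
        if not healthy:
            breaches.append(name)
    return breaches
```

With the example values above, `check_baselines(BASELINES)` flags `payment_success_rate`, matching the template (99.7% against a 99.8% target), which is exactly the kind of thing the incoming engineer should hear about on the call.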
## Automating the Handoff
We automated 80% of this with a bot:
```python
def generate_handoff_report(outgoing_shift_start, outgoing_shift_end):
    report = {
        'incidents': get_incidents(outgoing_shift_start, outgoing_shift_end),
        'active_alerts': get_active_alerts(),
        'recent_deploys': get_deploys(hours=48),
        'upcoming_maintenance': get_maintenance_windows(days=7),
        'slo_status': get_slo_status(),
        'open_tickets': get_oncall_tickets(status='open')
    }

    # Auto-generate summary
    summary = []
    if report['incidents']:
        summary.append(f"{len(report['incidents'])} incidents during shift")
    if report['active_alerts']:
        summary.append(f"{len(report['active_alerts'])} active alerts to monitor")
    if any(slo['budget_remaining'] < 30 for slo in report['slo_status']):
        summary.append("WARNING: SLO budget low for some services")

    return format_handoff(report, summary)
```
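The `format_handoff` helper isn't shown in the snippet above; one plausible sketch renders the report dict plus summary bullets as Markdown ready to drop into Slack or a wiki. The layout here is our own assumption, not the post's actual implementation:

```python
def format_handoff(report, summary):
    """Render a report dict and summary bullets as a Markdown handoff.

    Hypothetical sketch: section names come from the report's keys, and
    every item is rendered as a plain bullet.
    """
    lines = ["# On-Call Handoff", ""]
    if summary:
        lines.append("## Summary")
        lines.extend(f"- {item}" for item in summary)
        lines.append("")
    for section, items in report.items():
        lines.append(f"## {section.replace('_', ' ').title()}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)
```

For example, `format_handoff({"incidents": ["P2 payment timeout spike"]}, ["1 incident during shift"])` yields a document with `## Summary` and `## Incidents` sections, one bullet each.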
## The 15-Minute Handoff Call
The bot generates the report. The humans spend 15 minutes on video:
- **0-5 min:** Outgoing reviews active issues and incidents
- **5-10 min:** Walk through upcoming risks and context
- **10-15 min:** Incoming asks questions and confirms understanding
Critical rule: The outgoing engineer is NOT released until the incoming engineer says "I'm good."
## The Handoff Score
We rate every handoff:
```yaml
handoff_score:
  report_completed: +1
  call_happened: +1
  all_incidents_documented: +1
  active_issues_listed: +1
  upcoming_risks_noted: +1
  metrics_baseline_included: +1
  max_score: 6
  target: ">= 5"
```
We track this weekly. Teams that consistently hit the target of 5 or above have 60% fewer "lost context" incidents.
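The rubric translates directly into code. A minimal sketch, assuming each check is recorded as a boolean on the handoff (the check names come from the YAML above; the dict-of-booleans input shape is our assumption):

```python
# Check names taken from the handoff_score rubric; each is worth 1 point.
CHECKS = [
    "report_completed",
    "call_happened",
    "all_incidents_documented",
    "active_issues_listed",
    "upcoming_risks_noted",
    "metrics_baseline_included",
]
TARGET = 5  # out of a max score of 6

def score_handoff(handoff):
    """Score a handoff given a dict of booleans; missing checks count as failed."""
    return sum(1 for check in CHECKS if handoff.get(check, False))

def meets_target(handoff):
    return score_handoff(handoff) >= TARGET
```

A handoff that skipped only the call still scores 5 and passes; one that skipped the report and the call scores 4 and gets flagged in the weekly review.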
## Results
| Metric | Before | After |
|---|---|---|
| Monday morning incidents due to lost context | 3-4/month | 0-1/month |
| Time to rebuild context | 2 hours | 15 minutes |
| Incoming on-call confidence (1-5) | 2.3 | 4.6 |
| Escalations due to missing info | 8/month | 1/month |
The best part: engineers actually look forward to handoffs now because they're quick and useful instead of stressful.
If you want AI-generated on-call handoff reports that capture everything automatically, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com