## The Monday Morning Disaster
Every Monday, the same story: the incoming on-call engineer has no idea what happened over the weekend. The outgoing engineer left a cryptic Slack message at 11pm and went to bed.
We lost 2 hours every Monday rebuilding context.
## The Structured Handoff
We built a handoff template that takes 15 minutes to write and saves hours of confusion:
````markdown
# On-Call Handoff: [DATE] → [DATE]

## Outgoing: @engineer_a | Incoming: @engineer_b

### Active Issues

| Issue | Status | Next Step | ETA |
|-------|--------|-----------|-----|
| DB replication lag | Monitoring | Auto-resolves if < 5s | Check at noon |
| Cert expiry api.prod | Fix scheduled | Deploy cert-bot PR #234 | Tuesday AM |

### Incidents This Shift

1. **[P2] Payment timeout spike** — 2024-03-15 02:30 UTC
   - Resolved: Increased connection pool from 20 → 50
   - Post-mortem: Scheduled for Wednesday
   - Lingering risk: Pool size is a band-aid; we need a connection pooler

### Upcoming Risks

- Major deploy of auth-service v3 on Tuesday
- Black Friday load test on Thursday
- AWS maintenance window Friday 2-6am UTC

### Helpful Context

- The cache-service has been flaky — a restart fixes it (known bug, JIRA-456)
- New on-call runbook for search-service is at [link]
- PagerDuty schedule was updated — check your shifts

### Metrics to Watch

- DB replication lag: should be < 1s (currently 0.8s)
- Payment success rate: should be > 99.8% (currently 99.7%)
- API error rate: baseline is 0.05% (currently 0.04%)
````
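The "Metrics to Watch" baselines can be checked programmatically so the incoming engineer starts from hard numbers rather than memory. A minimal sketch, using the three metrics and thresholds from the template above; the dictionary layout and hard-coded current values are illustrative stand-ins for a real metrics API:

```python
# Baselines mirror the handoff template; "direction" says which side of
# the threshold is healthy. Current values here are hard-coded examples.
BASELINES = {
    "db_replication_lag_s": {"current": 0.8, "threshold": 1.0, "direction": "below"},
    "payment_success_rate": {"current": 0.997, "threshold": 0.998, "direction": "above"},
    "api_error_rate": {"current": 0.0004, "threshold": 0.0005, "direction": "below"},
}

def check_baselines(baselines):
    """Return the names of metrics that are outside their baseline."""
    breaches = []
    for name, m in baselines.items():
        if m["direction"] == "below":
            healthy = m["current"] < m["threshold"]
        else:
            healthy = m["current"] > m["threshold"]
        if not healthy:
            breaches.append(name)
    return breaches
```

With the example values above, `check_baselines(BASELINES)` flags `payment_success_rate`, matching the template (99.7% against a 99.8% target), which is exactly the kind of thing the incoming engineer should hear about on the call.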
## Automating the Handoff
We automated 80% of this with a bot:
```python
def generate_handoff_report(outgoing_shift_start, outgoing_shift_end):
    report = {
        'incidents': get_incidents(outgoing_shift_start, outgoing_shift_end),
        'active_alerts': get_active_alerts(),
        'recent_deploys': get_deploys(hours=48),
        'upcoming_maintenance': get_maintenance_windows(days=7),
        'slo_status': get_slo_status(),
        'open_tickets': get_oncall_tickets(status='open')
    }

    # Auto-generate summary
    summary = []
    if report['incidents']:
        summary.append(f"{len(report['incidents'])} incidents during shift")
    if report['active_alerts']:
        summary.append(f"{len(report['active_alerts'])} active alerts to monitor")
    if any(slo['budget_remaining'] < 30 for slo in report['slo_status']):
        summary.append("WARNING: SLO budget low for some services")

    return format_handoff(report, summary)
```
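The `format_handoff` helper isn't shown in the snippet above; one plausible sketch renders the report dict plus summary bullets as Markdown ready to drop into Slack or a wiki. The layout here is our own assumption, not the post's actual implementation:

```python
def format_handoff(report, summary):
    """Render a report dict and summary bullets as a Markdown handoff.

    Hypothetical sketch: section names come from the report's keys, and
    every item is rendered as a plain bullet.
    """
    lines = ["# On-Call Handoff", ""]
    if summary:
        lines.append("## Summary")
        lines.extend(f"- {item}" for item in summary)
        lines.append("")
    for section, items in report.items():
        lines.append(f"## {section.replace('_', ' ').title()}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)
```

For example, `format_handoff({"incidents": ["P2 payment timeout spike"]}, ["1 incident during shift"])` yields a document with `## Summary` and `## Incidents` sections, one bullet each.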
## The 15-Minute Handoff Call
The bot generates the report. The humans spend 15 minutes on video:
- **0-5 min:** Outgoing reviews active issues and incidents
- **5-10 min:** Walk through upcoming risks and context
- **10-15 min:** Incoming asks questions and confirms understanding
Critical rule: The outgoing engineer is NOT released until the incoming engineer says "I'm good."
## The Handoff Score
We rate every handoff:
```yaml
handoff_score:
  report_completed: +1
  call_happened: +1
  all_incidents_documented: +1
  active_issues_listed: +1
  upcoming_risks_noted: +1
  metrics_baseline_included: +1
  max_score: 6
  target: ">= 5"
```
We track this weekly. Teams that consistently hit the target of 5 or above have 60% fewer "lost context" incidents.
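The rubric translates directly into code. A minimal sketch, assuming each check is recorded as a boolean on the handoff (the check names come from the YAML above; the dict-of-booleans input shape is our assumption):

```python
# Check names taken from the handoff_score rubric; each is worth 1 point.
CHECKS = [
    "report_completed",
    "call_happened",
    "all_incidents_documented",
    "active_issues_listed",
    "upcoming_risks_noted",
    "metrics_baseline_included",
]
TARGET = 5  # out of a max score of 6

def score_handoff(handoff):
    """Score a handoff given a dict of booleans; missing checks count as failed."""
    return sum(1 for check in CHECKS if handoff.get(check, False))

def meets_target(handoff):
    return score_handoff(handoff) >= TARGET
```

A handoff that skipped only the call still scores 5 and passes; one that skipped the report and the call scores 4 and gets flagged in the weekly review.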
## Results
| Metric | Before | After |
|---|---|---|
| Monday morning incidents due to lost context | 3-4/month | 0-1/month |
| Time to rebuild context | 2 hours | 15 minutes |
| Incoming on-call confidence (1-5) | 2.3 | 4.6 |
| Escalations due to missing info | 8/month | 1/month |
The best part: engineers actually look forward to handoffs now because they're quick and useful instead of stressful.
If you want AI-generated on-call handoff reports that capture everything automatically, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com