Samson Tanimawo

Posted on Apr 22

Effective On-Call Rotations: Lessons From Building Fair Schedules

#oncall #sre #devops #culture

The Rotation Nobody Wants

Our on-call rotation was a spreadsheet. Updated manually. Someone always got scheduled during their vacation. Two people occasionally got double-booked. Holidays were a battleground.

Designing Fair Rotations

Principle 1: Equal Burden Distribution

Track total on-call hours, not just shift count:

def calculate_oncall_burden(engineer, period_days=90):
    # get_shifts() is assumed to return the engineer's shifts in the period
    shifts = get_shifts(engineer, period_days)
    return {
        'total_hours': sum(s.duration_hours for s in shifts),
        'weekend_hours': sum(s.duration_hours for s in shifts if s.is_weekend),
        'holiday_hours': sum(s.duration_hours for s in shifts if s.is_holiday),
        'night_hours': sum(s.duration_hours for s in shifts if s.is_night),
        'pages_received': sum(s.page_count for s in shifts),
        'burden_score': calculate_weighted_score(shifts)
    }

def calculate_weighted_score(shifts):
    """Weight different types of on-call differently."""
    score = 0
    for s in shifts:
        base = s.duration_hours
        # Multipliers compound: a weekend holiday night shift counts 1.5 * 2.0 * 1.3 = 3.9x
        if s.is_weekend: base *= 1.5
        if s.is_holiday: base *= 2.0
        if s.is_night: base *= 1.3
        score += base
    return round(score, 1)
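
To make the weighting concrete, here is a small self-contained sketch. The `Shift` dataclass is a hypothetical stand-in for whatever your scheduler returns; only the field names and multipliers come from the code above.

```python
from dataclasses import dataclass

@dataclass
class Shift:
    """Hypothetical shift record; field names mirror the ones used above."""
    duration_hours: float
    is_weekend: bool = False
    is_holiday: bool = False
    is_night: bool = False
    page_count: int = 0

def weighted_score(shifts):
    """Same weighting as calculate_weighted_score: multipliers compound."""
    score = 0.0
    for s in shifts:
        base = s.duration_hours
        if s.is_weekend:
            base *= 1.5
        if s.is_holiday:
            base *= 2.0
        if s.is_night:
            base *= 1.3
        score += base
    return round(score, 1)

shifts = [
    Shift(duration_hours=12),                                  # 12.0
    Shift(duration_hours=12, is_weekend=True),                 # 12 * 1.5 = 18.0
    Shift(duration_hours=12, is_holiday=True, is_night=True),  # 12 * 2.0 * 1.3 = 31.2
]
total = weighted_score(shifts)
print(total)  # 61.2
```

Three shifts of identical length produce very different scores, which is the point: counting shifts alone would rate them as equal.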

Principle 2: Respect Preferences

oncall_preferences:
  alice:
    blackout_dates: ["2024-03-25", "2024-04-01:2024-04-05"]  # Vacation
    preferred_days: ["Mon", "Tue", "Wed"]  # Family on weekends
    max_consecutive_days: 3

  bob:
    blackout_dates: ["2024-04-10"]
    preferred_days: ["any"]
    max_consecutive_days: 7
    prefers_weekends: true  # Weekend differential pay
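
A scheduler has to turn those blackout entries into something it can check. The sketch below is one way to do it, assuming the `start:end` range format shown above is inclusive on both ends; the function names are illustrative, not part of any existing tool.

```python
from datetime import date

def parse_blackouts(entries):
    """Turn "2024-04-10" or "2024-04-01:2024-04-05" into (start, end) date pairs."""
    ranges = []
    for entry in entries:
        start_str, _, end_str = entry.partition(":")
        start = date.fromisoformat(start_str)
        # A single date is treated as a one-day range
        end = date.fromisoformat(end_str) if end_str else start
        ranges.append((start, end))
    return ranges

def is_available(prefs, day):
    """Return False if `day` falls inside any blackout range (inclusive)."""
    blackouts = parse_blackouts(prefs.get("blackout_dates", []))
    return not any(start <= day <= end for start, end in blackouts)

alice = {"blackout_dates": ["2024-03-25", "2024-04-01:2024-04-05"]}
print(is_available(alice, date(2024, 4, 3)))  # False: inside the vacation range
print(is_available(alice, date(2024, 4, 6)))  # True: the day after it ends
```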

Principle 3: Minimum Pool Size

The math on sustainable rotations:

Pool size	Frequency	Burnout risk
3 people	1 week on / 2 off	HIGH (unsustainable)
4 people	1 week on / 3 off	MEDIUM (barely okay)
5 people	1 week on / 4 off	LOW (comfortable)
6+ people	1 week on / 5+ off	MINIMAL (ideal)

Rule: Minimum 5 people per rotation.
If you have fewer, reduce on-call scope or hire.
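
The arithmetic behind the table is simple enough to sketch. This assumes an even weekly rotation with one person on call at a time; `rotation_load` is a hypothetical helper name.

```python
def rotation_load(pool_size, weeks_per_year=52):
    """Weeks off between shifts and on-call weeks per year for a weekly rotation."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return {
        "weeks_off_between_shifts": pool_size - 1,
        "weeks_on_call_per_year": round(weeks_per_year / pool_size, 1),
    }

for size in (3, 4, 5, 6):
    print(size, rotation_load(size))
```

With three people, each engineer carries roughly 17 weeks of on-call a year; with five, that drops to about 10.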

Principle 4: Escalation Tiers

escalation_chain:
  tier_1:  # Primary on-call
    response_time: 5 minutes
    scope: all pages

  tier_2:  # Secondary on-call (backup)
    response_time: 15 minutes
    scope: escalated or unacknowledged

  tier_3:  # Engineering manager
    response_time: 30 minutes
    scope: P1 only or when both T1+T2 unavailable

  tier_4:  # CTO/VP Engineering
    response_time: 60 minutes
    scope: Extended P1 (>1 hour), customer escalation
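
The config does not say exactly when a page moves from one tier to the next, so the sketch below is only one possible reading. It assumes a page leaves tier 1 after 5 unacknowledged minutes, leaves tier 2 after a further 15, and reaches tier 4 only for a P1 that has run past an hour. The function name and that timing model are assumptions, not the behavior of any particular paging tool.

```python
def current_tier(minutes_unacked, is_p1):
    """Which tier should hold an unacknowledged page (assumed timing model)."""
    if minutes_unacked < 5:
        return 1  # Primary still inside its 5-minute window
    if minutes_unacked < 5 + 15:
        return 2  # Secondary gets the next 15 minutes
    if is_p1 and minutes_unacked > 60:
        return 4  # Extended P1 (>1 hour) goes to CTO/VP Engineering
    return 3      # T1 and T2 both unresponsive: engineering manager

print(current_tier(3, is_p1=False))   # 1
print(current_tier(10, is_p1=False))  # 2
print(current_tier(30, is_p1=True))   # 3
print(current_tier(90, is_p1=True))   # 4
```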

The Override System

Life happens. Make swaps easy:

def request_swap(requesting_engineer, target_date, volunteer=None):
    """Allow easy on-call swaps."""

    if volunteer:
        # Direct swap: Alice asks Bob to cover
        execute_swap(requesting_engineer, volunteer, target_date)
        notify_team(f"{requesting_engineer} swapped with {volunteer} for {target_date}")
    else:
        # Open request: Alice needs coverage, anyone can take it
        post_to_channel(
            f"{requesting_engineer} needs coverage for {target_date}. "
            f"Reply to volunteer. Comp: standard on-call rate."
        )

    # Key: NO manager approval needed for swaps
    # This reduces friction dramatically
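
The swap itself can be very small. As a minimal sketch, assuming the schedule is a plain dict mapping a date string to the engineer on call (a hypothetical structure, as is the `apply_swap` name):

```python
def apply_swap(schedule, requester, volunteer, target_date):
    """Return a new schedule with `volunteer` covering `target_date`."""
    if schedule.get(target_date) != requester:
        raise ValueError(f"{requester} is not on call on {target_date}")
    updated = dict(schedule)  # Copy so the original stays available for audit
    updated[target_date] = volunteer
    return updated

schedule = {"2024-04-10": "alice", "2024-04-11": "bob"}
new_schedule = apply_swap(schedule, "alice", "carol", "2024-04-10")
print(new_schedule["2024-04-10"])  # carol
```

The only guard is that the requester really holds the shift. A real system would probably also check the volunteer's blackout dates before accepting.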

Holiday Fairness

The holiday rotation is separate and tracked year-over-year:

holidays_2024 = [
    'New Years', 'MLK Day', 'Presidents Day', 'Memorial Day',
    'July 4th', 'Labor Day', 'Thanksgiving', 'Christmas'
]

def assign_holidays(team, year):
    # NOTE: the holiday list is hard-coded for 2024; `year` is not used yet

    # Get historical holiday assignments
    # (history maps engineer -> number of holidays covered)
    history = get_holiday_history(team, years=3)

    # Sort by who has covered the FEWEST holidays recently
    sorted_team = sorted(team, key=lambda e: history.get(e, 0))

    # Round-robin through the sorted team, wrapping if holidays outnumber people
    assignments = {}
    for i, holiday in enumerate(holidays_2024):
        engineer = sorted_team[i % len(sorted_team)]
        assignments[holiday] = engineer

    return assignments
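
Here is the same idea as a runnable sketch with the history passed in directly instead of fetched. The `assign_by_fewest` name and the sample data are made up for illustration.

```python
def assign_by_fewest(team, holidays, history):
    """Round-robin holidays, starting with whoever has covered the fewest."""
    ordered = sorted(team, key=lambda e: history.get(e, 0))
    return {h: ordered[i % len(ordered)] for i, h in enumerate(holidays)}

team = ["alice", "bob", "carol"]
history = {"alice": 3, "bob": 1}  # carol has covered no holidays yet
result = assign_by_fewest(team, ["Thanksgiving", "Christmas"], history)
print(result)  # {'Thanksgiving': 'carol', 'Christmas': 'bob'}
```

When holidays outnumber engineers, the wrap-around gives the extra ones to the people with the lightest history, which is the intended bias.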

Metrics We Track

Metric	Target	Current
Burden score variance	< 15%	8%
Swap request fulfillment	> 95%	98%
Pages per shift (average)	< 3	1.8
NPS for on-call experience	> 0	+32
Holiday coverage fairness	< 1 shift variance	0.5
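
The table does not spell out how the burden score variance is calculated. One plausible way to express it as a percentage is the coefficient of variation (standard deviation divided by mean). That choice, the function name, and the sample scores below are all assumptions.

```python
from statistics import mean, pstdev

def burden_spread_pct(scores):
    """Population standard deviation as a percentage of the mean."""
    avg = mean(scores)
    if avg == 0:
        return 0.0
    return round(pstdev(scores) / avg * 100, 1)

# Hypothetical 90-day burden scores for a five-person rotation
team_scores = [100, 110, 90, 105, 95]
print(burden_spread_pct(team_scores))  # 7.1
```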

If you want AI-powered on-call scheduling that optimizes for fairness automatically, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
