On-Call Management Kit
On-call doesn't have to burn out your team. This kit provides production-ready PagerDuty and OpsGenie configurations, fair rotation schedules, escalation policies that actually work, and incident communication templates that keep stakeholders informed without drowning responders. Built from patterns that keep teams healthy while maintaining rapid incident response.
Key Features
- PagerDuty service configs — Terraform-ready HCL for services, escalation policies, and maintenance windows
- OpsGenie integration templates — API-driven setup scripts for teams, schedules, and routing rules
- Rotation schedule generator — Python script that creates fair schedules accounting for weekends, holidays, and time zones
- Escalation policy framework — Multi-tier policies (L1 → L2 → L3 → management) with configurable timeouts
- Incident communication templates — Pre-written status updates for Slack, email, and status pages at each severity level
- On-call handoff checklist — Structured handoff template ensuring context transfer between shifts
- Compensation tracker — YAML for tracking on-call hours and off-hours interrupts
- Fatigue analysis script — Analyzes page volume to identify alert fatigue
Quick Start
unzip on-call-management-kit.zip && cd on-call-management-kit/
# Generate a rotation schedule
python3 src/on_call_management_kit/core.py generate-schedule \
  --team-members "Alice,Bob,Carol,Dave,Eve" \
  --rotation-length 7 --start-date 2026-04-01 \
  --timezone "America/New_York" --output schedule.yaml

# Analyze page fatigue
python3 src/on_call_management_kit/utils.py fatigue-report \
  --input pagerduty_incidents_export.csv --window 30
Architecture / How It Works
SCHEDULE           →  ESCALATE        →  COMMUNICATE
Fair rotations        L1→L2→L3           Severity-specific
with TZ awareness     timeout-based      templates
                      promotion          for comms
- Schedule — Generate rotation schedules that distribute on-call burden fairly across team members
- Escalate — Define escalation tiers with timeout-based promotion. If L1 doesn't acknowledge in 5 minutes, L2 gets paged
- Communicate — Use severity-specific templates to keep stakeholders informed without overwhelming the responder
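To make timeout-based promotion concrete, here is a minimal sketch that works out which tier is being paged after an incident has gone unacknowledged for a given number of minutes. The `Tier` class and `paged_tier` function are hypothetical names for illustration and are not part of the kit; the 5/10/15-minute timeouts are taken from the example configuration, and looping through the policy (`num_loops`) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # who gets paged at this tier
    timeout_minutes: int  # unacknowledged minutes before promoting to the next tier


# Hypothetical tiers; timeouts mirror the 5/10/15 values in config.example.yaml.
TIERS = [
    Tier("L1", 5),
    Tier("L2", 10),
    Tier("L3", 15),
    Tier("management", 0),  # final tier: nothing further to promote to
]


def paged_tier(minutes_unacknowledged: float, tiers: list[Tier] = TIERS) -> str:
    """Return the tier currently being paged for an unacknowledged incident."""
    elapsed = 0
    for tier in tiers[:-1]:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier.name
    return tiers[-1].name


for minutes in (0, 5, 15, 30):
    print(f"{minutes:>2} min unacknowledged -> paging {paged_tier(minutes)}")
```

Under these assumptions the timeouts are cumulative: L2 is paged at minute 5, L3 at minute 15, and management at minute 30.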
Usage Examples
PagerDuty Escalation Policy (Terraform)
resource "pagerduty_escalation_policy" "platform_team" {
  name      = "Platform Team Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.platform_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}
Rotation Schedule Generator
from on_call_management_kit.core import ScheduleGenerator

# Weekly rotation across five people, with holidays marked
generator = ScheduleGenerator(
    team=["Alice", "Bob", "Carol", "Dave", "Eve"],
    rotation_days=7,
    timezone="America/New_York",
    holidays=["2026-12-25", "2027-01-01"],
)

# Generate Q2 2026, with a blackout date for Alice
schedule = generator.generate(
    start="2026-04-01",
    end="2026-06-30",
    constraints={"Alice": {"blackout_dates": ["2026-05-15"]}},
)

for shift in schedule.shifts:
    print(f"{shift.start} → {shift.end}: {shift.assignee}")
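For readers who want to see the idea behind a generator like this, the following is a minimal standalone sketch of round-robin assignment with blackout dates. It is not the kit's `ScheduleGenerator` implementation; `round_robin_shifts` and its parameters are hypothetical, and it ignores time zones, holidays, and weekend weighting.

```python
from datetime import date, timedelta


def round_robin_shifts(team, start, end, rotation_days=7, blackouts=None):
    """Assign consecutive shifts in rotation order, skipping blacked-out members.

    blackouts maps a member name to a set of dates they cannot be on call.
    Returns (shift_start, shift_end, assignee) tuples; shift_end is inclusive.
    """
    blackouts = blackouts or {}
    shifts = []
    queue = list(team)  # the front of the queue is next in line
    shift_start = start
    while shift_start <= end:
        shift_end = min(shift_start + timedelta(days=rotation_days - 1), end)
        days = {shift_start + timedelta(days=i)
                for i in range((shift_end - shift_start).days + 1)}
        # Pick the first queued member with no blackout date inside this shift.
        for member in queue:
            if not (blackouts.get(member, set()) & days):
                break
        else:
            raise ValueError(f"nobody is available for the shift starting {shift_start}")
        queue.remove(member)
        queue.append(member)  # the assignee moves to the back of the queue
        shifts.append((shift_start, shift_end, member))
        shift_start = shift_end + timedelta(days=1)
    return shifts


shifts = round_robin_shifts(
    team=["Alice", "Bob", "Carol", "Dave", "Eve"],
    start=date(2026, 4, 1),
    end=date(2026, 6, 30),
    blackouts={"Alice": {date(2026, 5, 15)}},
)
for shift_start, shift_end, assignee in shifts:
    print(f"{shift_start} -> {shift_end}: {assignee}")
```

A skipped member stays at the front of the queue, so they pick up the next shift they are available for instead of losing their turn.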
Incident Communication Template
# templates/incident_communication.yaml
severity_1:
  initial_notification:
    slack: |
      :rotating_light: *SEV1 Incident Declared*
      *Service:* {{ service_name }} | *IC:* {{ ic_name }}
      *Impact:* {{ impact_description }}
    stakeholders: [vp_engineering, support_lead]
    update_cadence: every_15_minutes
  resolution:
    slack: |
      :white_check_mark: *SEV1 Resolved*
      *Duration:* {{ duration_minutes }} min | *Postmortem:* {{ postmortem_date }}
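The `{{ name }}` placeholders need to be filled in before a message is posted. The sketch below shows one way to do that with only the standard library; the `render` helper is hypothetical and not the kit's own renderer, and a full templating engine may be a better fit if the templates grow conditionals or loops.

```python
import re

# Two lines modeled on the severity_1 initial notification template.
TEMPLATE = (
    "*Service:* {{ service_name }} | *IC:* {{ ic_name }}\n"
    "*Impact:* {{ impact_description }}"
)

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder, failing loudly on a missing value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


message = render(TEMPLATE, {
    "service_name": "checkout-api",
    "ic_name": "Alice",
    "impact_description": "Elevated error rate on payments",
})
print(message)
```

Raising on a missing value is a deliberate choice here: a half-filled status update during a SEV1 is worse than a visible failure the responder can fix immediately.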
Configuration
# config.example.yaml
team:
  name: Platform Engineering
  members:
    - name: Alice Smith
      email: alice@example.com
      timezone: America/New_York
      pagerduty_id: P1A2B3C
    - name: Bob Jones
      email: bob@example.com
      timezone: America/Chicago
      pagerduty_id: P4D5E6F

schedule:
  rotation_length_days: 7        # Length of each shift
  handoff_time: "09:00"          # Local time for shift handoff
  handoff_day: monday            # Day of week for handoff
  require_secondary: true        # Always have a backup on-call

escalation:
  l1_timeout_minutes: 5          # Time before escalating to L2
  l2_timeout_minutes: 10         # Time before escalating to L3
  l3_timeout_minutes: 15         # Time before escalating to management
  num_loops: 2                   # Times to loop through policy

fatigue_thresholds:
  pages_per_shift_warning: 10    # Flag shifts with high page count
  pages_per_shift_critical: 25   # Escalate to management
  off_hours_pct_warning: 0.40    # Warning if >40% pages are off-hours
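As an illustration of how the `fatigue_thresholds` values might be applied, here is a small sketch that flags a single shift from its page counts. The `shift_flags` function is hypothetical and is not the kit's fatigue script. Treating the page-count thresholds as inclusive is an assumption; the off-hours check is strict, following the ">40%" comment in the config.

```python
# Threshold values copied from the fatigue_thresholds block in config.example.yaml.
THRESHOLDS = {
    "pages_per_shift_warning": 10,
    "pages_per_shift_critical": 25,
    "off_hours_pct_warning": 0.40,
}


def shift_flags(total_pages: int, off_hours_pages: int, thresholds: dict = THRESHOLDS) -> list:
    """Return the fatigue flags raised by one shift's page counts."""
    flags = []
    # Assumption: page-count thresholds are inclusive (>=).
    if total_pages >= thresholds["pages_per_shift_critical"]:
        flags.append("critical: escalate to management")
    elif total_pages >= thresholds["pages_per_shift_warning"]:
        flags.append("warning: high page count")
    # Skip the ratio on a quiet shift to avoid dividing by zero.
    if total_pages and off_hours_pages / total_pages > thresholds["off_hours_pct_warning"]:
        flags.append("warning: off-hours share above threshold")
    return flags


print(shift_flags(total_pages=30, off_hours_pages=15))
print(shift_flags(total_pages=3, off_hours_pages=0))
```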
Best Practices
- Minimum 5 people in a rotation: fewer leads to unsustainable on-call frequency
- Hand off during business hours: Monday 9 AM handoffs let the new on-call review recent issues with the team present
- Always have a secondary: the primary handles the page, and the secondary is the backup if the primary is unreachable
- Review page volume monthly: if a service pages more than twice per shift, it needs engineering investment, not more on-call
- Compensate fairly: track off-hours pages and provide comp time or pay
- Automate the handoff: use the checklist template so the outgoing on-call documents open issues, recent deploys, and known risks
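The arithmetic behind the minimum rotation size is simple enough to sketch. The `on_call_load` helper below is hypothetical and assumes an evenly shared primary rotation with no secondary duty counted.

```python
def on_call_load(team_size: int, rotation_days: int = 7) -> dict:
    """Estimate primary on-call load per person for an evenly shared rotation."""
    if team_size < 1 or rotation_days < 1:
        raise ValueError("team_size and rotation_days must be positive")
    return {
        "fraction_of_time": 1 / team_size,
        "days_between_shifts": rotation_days * (team_size - 1),
        "shifts_per_year": 365 / (rotation_days * team_size),
    }


for size in (3, 5, 8):
    load = on_call_load(size)
    print(f"{size} people: on call {load['fraction_of_time']:.0%} of the time, "
          f"{load['days_between_shifts']} days off between shifts")
```

With five people on weekly shifts, each person is primary 20% of the time with 28 days between shifts; with three, that becomes a third of the time with only 14 days to recover.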
Troubleshooting
Schedule generator produces uneven distribution
This usually happens when blackout dates cluster for certain team members. Use --balance-mode strict to force even distribution, which may override some non-critical blackout preferences.
PagerDuty webhook not triggering
Verify the webhook URL is reachable from PagerDuty's infrastructure. Check Extensions → Generic Webhooks in your PagerDuty service. Ensure the webhook endpoint returns 2xx within 5 seconds.
OpsGenie alerts not routing correctly
Check the order of your routing rules: OpsGenie evaluates them top to bottom and stops at the first match.
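First-match evaluation is easy to get wrong when a broad rule sits above a specific one. The sketch below models the behavior in plain Python to show the shadowing effect; `RoutingRule`, `route`, and the team names are hypothetical and do not use the OpsGenie API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoutingRule:
    name: str
    matches: Callable[[dict], bool]  # predicate over the alert payload
    route_to: str


def route(alert: dict, rules: list) -> Optional[str]:
    """Evaluate rules top to bottom and return the first matching destination."""
    for rule in rules:
        if rule.matches(alert):
            return rule.route_to
    return None


catch_all = RoutingRule("any platform alert", lambda a: a.get("team") == "platform", "platform-default")
urgent = RoutingRule("platform P1", lambda a: a.get("team") == "platform" and a.get("priority") == "P1", "platform-urgent")

alert = {"team": "platform", "priority": "P1"}

# Broad rule first: the specific P1 rule is never reached.
print(route(alert, [catch_all, urgent]))
# Specific rule first: the P1 alert reaches the urgent destination.
print(route(alert, [urgent, catch_all]))
```

If an alert lands in the wrong place, look for a more general rule above the one you expected to match.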
Fatigue report shows zero pages
Ensure the CSV export includes the created_at column in ISO 8601 format.
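Before rerunning the report, it can help to check the export directly. The following is a minimal sketch of such a check; `check_created_at` is a hypothetical helper, not part of the kit, and it only verifies that the column exists and that each value parses as an ISO 8601 timestamp.

```python
import csv
import io
from datetime import datetime


def check_created_at(csv_text: str, column: str = "created_at"):
    """Return (valid_row_count, problems) for the timestamp column of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        return 0, [f"missing column: {column}"]
    valid, problems = 0, []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        value = (row[column] or "").strip()
        try:
            # fromisoformat rejects a trailing 'Z' before Python 3.11, so normalize it.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            valid += 1
        except ValueError:
            problems.append(f"line {line_no}: unparseable timestamp {value!r}")
    return valid, problems


sample = "id,created_at\n1,2026-04-01T03:15:00Z\n2,04/01/2026 03:15\n"
valid, problems = check_created_at(sample)
print(f"{valid} valid row(s)")
for problem in problems:
    print(problem)
```

To run it against a real export, read the file contents and pass them in, for example `check_created_at(open("pagerduty_incidents_export.csv").read())`.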
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete On-Call Management Kit with all files, templates, and documentation for $29.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.