Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

On-Call Management Kit

On-call doesn't have to burn out your team. This kit provides production-ready PagerDuty and OpsGenie configurations, fair rotation schedules, escalation policies that actually work, and incident communication templates that keep stakeholders informed without drowning responders. Built from patterns that keep teams healthy while maintaining rapid incident response.

Key Features

  • PagerDuty service configs — Terraform-ready HCL for services, escalation policies, and maintenance windows
  • OpsGenie integration templates — API-driven setup scripts for teams, schedules, and routing rules
  • Rotation schedule generator — Python script that creates fair schedules accounting for weekends, holidays, and time zones
  • Escalation policy framework — Multi-tier policies (L1 → L2 → L3 → management) with configurable timeouts
  • Incident communication templates — Pre-written status updates for Slack, email, and status pages at each severity level
  • On-call handoff checklist — Structured handoff template ensuring context transfer between shifts
  • Compensation tracker — YAML for tracking on-call hours and off-hours interrupts
  • Fatigue analysis script — Analyzes page volume to identify alert fatigue

Quick Start

unzip on-call-management-kit.zip && cd on-call-management-kit/

# Generate a rotation schedule
python3 src/on_call_management_kit/core.py generate-schedule \
  --team-members "Alice,Bob,Carol,Dave,Eve" \
  --rotation-length 7 --start-date 2026-04-01 \
  --timezone "America/New_York" --output schedule.yaml

# Analyze page fatigue
python3 src/on_call_management_kit/utils.py fatigue-report \
  --input pagerduty_incidents_export.csv --window 30

Architecture / How It Works

SCHEDULE → ESCALATE → COMMUNICATE
  Fair       L1→L2→L3    Severity-
  rotations  timeout-    specific
  with TZ    based       templates
  awareness  promotion   for comms
  1. Schedule — Generate rotation schedules that distribute on-call burden fairly across team members
  2. Escalate — Define escalation tiers with timeout-based promotion. If L1 doesn't acknowledge in 5 minutes, L2 gets paged
  3. Communicate — Use severity-specific templates to keep stakeholders informed without overwhelming the responder
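
The timeout-based promotion in step 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the kit's own code: `TIERS` and `tier_to_page` are hypothetical names, the timeouts mirror the defaults in `config.example.yaml` (5, 10, and 15 minutes), and the sketch assumes each timeout starts counting when its own tier is paged. It ignores `num_loops`.

```python
# Hypothetical tier table: (tier name, minutes to wait for an acknowledgement
# before promoting to the next tier). None marks the final tier.
TIERS = [("L1", 5), ("L2", 10), ("L3", 15), ("management", None)]


def tier_to_page(minutes_unacknowledged: float) -> str:
    """Return the highest tier that has been paged after the given wait.

    Assumes each timeout starts when its own tier is paged, so L2 is paged
    at 5 min, L3 at 15 min, and management at 30 min.
    """
    elapsed = 0
    for name, timeout in TIERS:
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return name
        elapsed += timeout
    return TIERS[-1][0]


for minutes in (0, 4, 5, 14, 15, 30):
    print(f"{minutes:>2} min unacknowledged -> {tier_to_page(minutes)}")
```

If your paging tool measures timeouts from the moment the incident was created instead, the cumulative sum in the loop would need to be replaced with absolute thresholds.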

Usage Examples

PagerDuty Escalation Policy (Terraform)

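resource "pagerduty_escalation_policy" "platform_team" {
  name      = "Platform Team Escalation"
  num_loops = 2 # Repeat the policy up to 2 times if nobody acknowledges

  # Tier 1: page whoever is on the primary schedule (defined elsewhere).
  rule {
    escalation_delay_in_minutes = 5 # Escalate after 5 unacknowledged minutes
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.platform_primary.id
    }
  }
  # Tier 2: page the engineering manager (user defined elsewhere).
  rule {
    escalation_delay_in_minutes = 15 # Wait 15 minutes before the policy repeats
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}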

Rotation Schedule Generator

from on_call_management_kit.core import ScheduleGenerator

generator = ScheduleGenerator(
    team=["Alice", "Bob", "Carol", "Dave", "Eve"],
    rotation_days=7,
    timezone="America/New_York",
    holidays=["2026-12-25", "2027-01-01"],
)

schedule = generator.generate(
    start="2026-04-01",
    end="2026-06-30",
    constraints={"Alice": {"blackout_dates": ["2026-05-15"]}}
)

for shift in schedule.shifts:
    print(f"{shift.start} → {shift.end}: {shift.assignee}")
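
To show the idea behind a fair rotation without relying on the kit, here is a standalone round-robin sketch. It is not how `ScheduleGenerator` is implemented: `round_robin_shifts` is a hypothetical helper that assigns shifts in team order and skips anyone with a blackout date inside the shift. Time zones and holidays are deliberately left out.

```python
from datetime import date, timedelta
from itertools import cycle


def round_robin_shifts(team, start, end, rotation_days=7, blackouts=None):
    """Assign consecutive shifts in team order, skipping anyone whose
    blackout dates fall inside a shift.

    Returns a list of (shift_start, shift_end, assignee) tuples; shift_end
    is inclusive and the final shift is truncated at `end`.
    """
    blackouts = blackouts or {}
    order = cycle(team)
    shifts = []
    shift_start = start
    while shift_start <= end:
        shift_end = min(shift_start + timedelta(days=rotation_days - 1), end)
        # Try each member at most once per shift before giving up.
        for _ in range(len(team)):
            candidate = next(order)
            blocked = any(shift_start <= d <= shift_end
                          for d in blackouts.get(candidate, ()))
            if not blocked:
                shifts.append((shift_start, shift_end, candidate))
                break
        else:
            raise ValueError(f"No one is available from {shift_start}")
        shift_start = shift_end + timedelta(days=1)
    return shifts


for s, e, who in round_robin_shifts(
    ["Alice", "Bob", "Carol", "Dave", "Eve"],
    date(2026, 4, 1),
    date(2026, 6, 30),
    blackouts={"Alice": [date(2026, 5, 15)]},
):
    print(f"{s} to {e}: {who}")
```

A skipped person simply loses that turn in this sketch, so clustered blackout dates skew the distribution. A real generator would need to track shift counts per person and rebalance.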

Incident Communication Template

# templates/incident_communication.yaml
severity_1:
  initial_notification:
    slack: |
      :rotating_light: **SEV1 Incident Declared**
      *Service:* {{ service_name }} | *IC:* {{ ic_name }}
      *Impact:* {{ impact_description }}
    stakeholders: [vp_engineering, support_lead]

  update_cadence: every_15_minutes

  resolution:
    slack: |
      :white_check_mark: **SEV1 Resolved**
      *Duration:* {{ duration_minutes }} min | *Postmortem:* {{ postmortem_date }}
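
The `{{ name }}` placeholders look like Jinja-style variables, though the template engine is not stated here. As a dependency-free sketch, the substitution can be done with a regular expression. The `render` helper and the sample values (`checkout-api` and so on) are hypothetical.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder with values[name].

    Raises KeyError if a placeholder has no value, so an incomplete
    status update is never sent by accident.
    """
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


sev1_initial = (
    ":rotating_light: **SEV1 Incident Declared**\n"
    "*Service:* {{ service_name }} | *IC:* {{ ic_name }}\n"
    "*Impact:* {{ impact_description }}"
)

print(render(sev1_initial, {
    "service_name": "checkout-api",  # sample value, not from the kit
    "ic_name": "Alice",
    "impact_description": "Elevated error rate on payments",
}))
```

Failing loudly on a missing value is a deliberate choice: a SEV1 message with an unfilled placeholder is worse than a short delay while the responder supplies it.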

Configuration

# config.example.yaml
team:
  name: Platform Engineering
  members:
    - name: Alice Smith
      email: alice@example.com
      timezone: America/New_York
      pagerduty_id: P1A2B3C
    - name: Bob Jones
      email: bob@example.com
      timezone: America/Chicago
      pagerduty_id: P4D5E6F

schedule:
  rotation_length_days: 7          # Length of each shift
  handoff_time: "09:00"            # Local time for shift handoff
  handoff_day: monday              # Day of week for handoff
  require_secondary: true          # Always have a backup on-call

escalation:
  l1_timeout_minutes: 5            # Time before escalating to L2
  l2_timeout_minutes: 10           # Time before escalating to L3
  l3_timeout_minutes: 15           # Time before escalating to management
  num_loops: 2                     # Times to loop through policy

fatigue_thresholds:
  pages_per_shift_warning: 10      # Flag shifts with high page count
  pages_per_shift_critical: 25     # Escalate to management
  off_hours_pct_warning: 0.40      # Warning if >40% pages are off-hours
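
As a rough illustration of how the `fatigue_thresholds` values might be applied to a single shift, consider the sketch below. `classify_shift` is a hypothetical function, not part of the kit. It assumes the page-count thresholds trigger at or above the configured value, while the off-hours share triggers only when strictly above 40%, as the config comment states.

```python
# Thresholds copied from the fatigue_thresholds section of config.example.yaml.
PAGES_WARNING = 10
PAGES_CRITICAL = 25
OFF_HOURS_PCT_WARNING = 0.40


def classify_shift(total_pages: int, off_hours_pages: int) -> list[str]:
    """Return the flags one shift would raise under the thresholds above."""
    flags = []
    # Assumption: a count equal to the threshold already triggers the flag.
    if total_pages >= PAGES_CRITICAL:
        flags.append("critical: page count")
    elif total_pages >= PAGES_WARNING:
        flags.append("warning: page count")
    # Guard against division by zero for quiet shifts.
    if total_pages and off_hours_pages / total_pages > OFF_HOURS_PCT_WARNING:
        flags.append("warning: off-hours share")
    return flags


print(classify_shift(total_pages=12, off_hours_pages=6))
print(classify_shift(total_pages=3, off_hours_pages=0))
```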

Best Practices

  • Minimum 5 people in a rotation — fewer leads to unsustainable on-call frequency
  • Hand off during business hours — Monday 9 AM handoffs let the new on-call review recent issues with the team present
  • Always have a secondary — the primary handles the page, the secondary is backup if primary is unreachable
  • Review page volume monthly — if a service pages more than 2x per shift, it needs engineering investment, not more on-call
  • Compensate fairly — track off-hours pages and provide comp time or pay
  • Automate the handoff — use the checklist template so outgoing on-call documents open issues, recent deploys, and known risks
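
The arithmetic behind the five-person minimum is easy to check. The sketch below uses hypothetical helper names and assumes the secondary is drawn from the same team as the primary, so each person carries two roles per cycle when `require_secondary` is enabled.

```python
def weeks_between_shifts(team_size: int, rotation_days: int = 7) -> float:
    """Weeks from the start of one primary shift to the start of the next."""
    return team_size * rotation_days / 7


def on_call_share(team_size: int, with_secondary: bool = False) -> float:
    """Fraction of time one person is on call as primary, plus secondary
    if every shift also has a backup drawn from the same team."""
    roles = 2 if with_secondary else 1
    return min(1.0, roles / team_size)


for size in (3, 5, 8):
    print(
        f"{size} people: primary every {weeks_between_shifts(size):.0f} weeks, "
        f"{on_call_share(size):.0%} primary only, "
        f"{on_call_share(size, with_secondary=True):.0%} with secondary"
    )
```

With three people and a secondary, each person is on call in some role about two thirds of the time. With five, that drops to 40%.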

Troubleshooting

Schedule generator produces uneven distribution
This usually happens when blackout dates cluster for certain team members. Use --balance-mode strict to force even distribution, which may override some non-critical blackout preferences.

PagerDuty webhook not triggering
Verify the webhook URL is reachable from PagerDuty's infrastructure. Check Extensions → Generic Webhooks in your PagerDuty service. Ensure the webhook endpoint returns 2xx within 5 seconds.

OpsGenie alerts not routing correctly
Check the order of your routing rules: OpsGenie evaluates them top to bottom and stops at the first match, so a broad rule placed above a specific one will capture the alert first.

Fatigue report shows zero pages
Ensure the CSV export includes the created_at column in ISO 8601 format.
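
A quick way to test an export before running the report is to count how many `created_at` values parse as ISO 8601. This is a standalone sketch using only the standard library. `check_created_at` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from datetime import datetime


def check_created_at(csv_text: str) -> tuple[int, int]:
    """Return (parseable_rows, bad_rows) for the created_at column.

    Raises ValueError if the column is missing entirely.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if "created_at" not in (reader.fieldnames or []):
        raise ValueError("CSV has no created_at column")
    good = bad = 0
    for row in reader:
        value = (row["created_at"] or "").strip()
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            good += 1
        except ValueError:
            bad += 1
    return good, bad


sample = "id,created_at\n1,2026-04-01T03:15:00Z\n2,04/01/2026 03:15\n"
print(check_created_at(sample))
```

Any rows counted as bad are likely the ones the fatigue report is silently dropping.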


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete On-Call Management Kit with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →

