Thesius Code

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

On-Call Management Kit

#kubernetes #monitoring #sre #devops

On-call doesn't have to burn out your team. This kit provides production-ready PagerDuty and OpsGenie configurations, fair rotation schedules, multi-tier escalation policies with configurable timeouts, and incident communication templates that keep stakeholders informed without drowning responders. It is built from patterns that keep teams healthy while maintaining rapid incident response.

Key Features

PagerDuty service configs — Terraform-ready HCL for services, escalation policies, and maintenance windows
OpsGenie integration templates — API-driven setup scripts for teams, schedules, and routing rules
Rotation schedule generator — Python script that creates fair schedules accounting for weekends, holidays, and time zones
Escalation policy framework — Multi-tier policies (L1 → L2 → L3 → management) with configurable timeouts
Incident communication templates — Pre-written status updates for Slack, email, and status pages at each severity level
On-call handoff checklist — Structured handoff template ensuring context transfer between shifts
Compensation tracker — YAML for tracking on-call hours and off-hours interrupts
Fatigue analysis script — Analyzes page volume to identify alert fatigue

Quick Start

unzip on-call-management-kit.zip && cd on-call-management-kit/

# Generate a rotation schedule
python3 src/on_call_management_kit/core.py generate-schedule \
  --team-members "Alice,Bob,Carol,Dave,Eve" \
  --rotation-length 7 --start-date 2026-04-01 \
  --timezone "America/New_York" --output schedule.yaml

# Analyze page fatigue
python3 src/on_call_management_kit/utils.py fatigue-report \
  --input pagerduty_incidents_export.csv --window 30

Architecture / How It Works

SCHEDULE → ESCALATE → COMMUNICATE
  Fair       L1→L2→L3    Severity-
  rotations  timeout-    specific
  with TZ    based       templates
  awareness  promotion   for comms

Schedule: Generate rotation schedules that distribute the on-call burden fairly across team members.
Escalate: Define escalation tiers with timeout-based promotion. If L1 doesn't acknowledge within 5 minutes, L2 is paged.
Communicate: Use severity-specific templates to keep stakeholders informed without overwhelming the responder.
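
To make the timeout-based promotion concrete, here is a minimal sketch of how the tier holding a page could be derived from how long the page has gone unacknowledged. It is an illustration, not the kit's code: the `Tier` class and `current_tier` function are hypothetical names, the timeouts mirror the example configuration values (5, 10, and 15 minutes), and policy looping is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    timeout_minutes: int  # time this tier has to acknowledge before promotion


# Hypothetical tiers; timeouts mirror the example config (5 / 10 / 15 minutes).
TIERS = [Tier("L1", 5), Tier("L2", 10), Tier("L3", 15)]
FINAL_TIER = "management"


def current_tier(minutes_unacknowledged: float, tiers=TIERS, final=FINAL_TIER) -> str:
    """Return the tier that holds the page after the given unacknowledged time."""
    elapsed = 0
    for tier in tiers:
        elapsed += tier.timeout_minutes  # timeouts accumulate tier by tier
        if minutes_unacknowledged < elapsed:
            return tier.name
    return final  # every tier timed out


for minutes in (2, 7, 20, 45):
    print(minutes, current_tier(minutes))
```

Under these assumptions the timeouts are cumulative: L2 is paged at minute 5, L3 at minute 15, and management at minute 30.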

Usage Examples

PagerDuty Escalation Policy (Terraform)

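# Assumes pagerduty_schedule.platform_primary and pagerduty_user.engineering_manager
# are defined elsewhere in your Terraform configuration.
resource "pagerduty_escalation_policy" "platform_team" {
  name      = "Platform Team Escalation"
  num_loops = 2 # repeat the policy up to 2 times if the incident stays unacknowledged

  # Rule 1: page whoever is on the primary schedule
  rule {
    escalation_delay_in_minutes = 5 # move to the next rule after 5 minutes without an acknowledgment
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.platform_primary.id
    }
  }
  # Rule 2: page the engineering manager
  rule {
    escalation_delay_in_minutes = 15 # wait 15 minutes before the policy loops
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}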

Rotation Schedule Generator

from on_call_management_kit.core import ScheduleGenerator

generator = ScheduleGenerator(
    team=["Alice", "Bob", "Carol", "Dave", "Eve"],
    rotation_days=7,
    timezone="America/New_York",
    holidays=["2026-12-25", "2027-01-01"],
)

schedule = generator.generate(
    start="2026-04-01",
    end="2026-06-30",
    constraints={"Alice": {"blackout_dates": ["2026-05-15"]}}
)

for shift in schedule.shifts:
    print(f"{shift.start} → {shift.end}: {shift.assignee}")
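
For intuition about what a fair rotation involves, the sketch below assigns each shift to the eligible person with the fewest shifts so far and skips anyone with a blackout date inside the shift. It is not the kit's `ScheduleGenerator`: `round_robin_shifts` is a hypothetical helper, and it deliberately ignores time zones, holidays, and weekend weighting.

```python
from datetime import date, timedelta


def round_robin_shifts(team, start, end, rotation_days=7, blackouts=None):
    """Assign consecutive shifts to the least-loaded person who is available.

    blackouts maps a name to a list of dates on which that person cannot be on call.
    Returns a list of (shift_start, shift_end, assignee) tuples.
    """
    blackouts = blackouts or {}
    counts = {name: 0 for name in team}
    shifts = []
    shift_start = start
    while shift_start <= end:
        shift_end = min(shift_start + timedelta(days=rotation_days - 1), end)
        # Anyone with a blackout date inside this shift is not eligible.
        eligible = [
            name for name in team
            if not any(shift_start <= d <= shift_end for d in blackouts.get(name, ()))
        ]
        if not eligible:
            raise ValueError(f"no one is available for the shift starting {shift_start}")
        # Fewest shifts so far wins; ties fall back to team order.
        assignee = min(eligible, key=lambda name: (counts[name], team.index(name)))
        counts[assignee] += 1
        shifts.append((shift_start, shift_end, assignee))
        shift_start = shift_end + timedelta(days=1)
    return shifts


shifts = round_robin_shifts(
    ["Alice", "Bob", "Carol", "Dave", "Eve"],
    start=date(2026, 4, 1),
    end=date(2026, 6, 30),
    blackouts={"Alice": [date(2026, 4, 3)]},
)
for shift_start, shift_end, assignee in shifts:
    print(f"{shift_start} -> {shift_end}: {assignee}")
```

Picking the least-loaded eligible person keeps shift counts within one of each other as long as blackout dates don't cluster on the same people.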

Incident Communication Template

# templates/incident_communication.yaml
severity_1:
  initial_notification:
    slack: |
      :rotating_light: *SEV1 Incident Declared*
      *Service:* {{ service_name }} | *IC:* {{ ic_name }}
      *Impact:* {{ impact_description }}
    stakeholders: [vp_engineering, support_lead]

  update_cadence: every_15_minutes

  resolution:
    slack: |
      :white_check_mark: *SEV1 Resolved*
      *Duration:* {{ duration_minutes }} min | *Postmortem:* {{ postmortem_date }}
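
The `{{ name }}` placeholders in the template have to be filled in before a message is sent. The kit's own rendering step isn't shown here, so the following is only a sketch using the standard library: `render` is a hypothetical helper, and the service and person names are made up for illustration.

```python
import re

# Matches placeholders such as {{ service_name }}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} with values[name]; raise KeyError if one is missing."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing template value: {key}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


message = render(
    "*Service:* {{ service_name }} | *IC:* {{ ic_name }}",
    {"service_name": "checkout-api", "ic_name": "Alice"},
)
print(message)
```

Failing loudly on a missing value is a deliberate choice in this sketch: a status update with a blank impact line is worse than a rendering error caught before sending.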

Configuration

# config.example.yaml
team:
  name: Platform Engineering
  members:
    - name: Alice Smith
      email: alice@example.com
      timezone: America/New_York
      pagerduty_id: P1A2B3C
    - name: Bob Jones
      email: bob@example.com
      timezone: America/Chicago
      pagerduty_id: P4D5E6F

schedule:
  rotation_length_days: 7          # Length of each shift
  handoff_time: "09:00"            # Local time for shift handoff
  handoff_day: monday              # Day of week for handoff
  require_secondary: true          # Always have a backup on-call

escalation:
  l1_timeout_minutes: 5            # Time before escalating to L2
  l2_timeout_minutes: 10           # Time before escalating to L3
  l3_timeout_minutes: 15           # Time before escalating to management
  num_loops: 2                     # Times to loop through policy

fatigue_thresholds:
  pages_per_shift_warning: 10      # Flag shifts with high page count
  pages_per_shift_critical: 25     # Escalate to management
  off_hours_pct_warning: 0.40      # Warning if >40% pages are off-hours
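
As a rough illustration of how the `fatigue_thresholds` values might be applied to a single shift, consider the sketch below. `classify_shift` is a hypothetical function, not part of the kit, and treating the page-count thresholds as inclusive is an assumption; the off-hours check is strict, following the ">40%" wording in the config comment.

```python
def classify_shift(
    pages: int,
    off_hours_pages: int,
    warning: int = 10,                    # pages_per_shift_warning
    critical: int = 25,                   # pages_per_shift_critical
    off_hours_pct_warning: float = 0.40,  # off_hours_pct_warning
) -> list[str]:
    """Return the fatigue flags raised by one shift's page counts."""
    flags = []
    if pages >= critical:  # assumption: thresholds are inclusive
        flags.append("critical: page volume")
    elif pages >= warning:
        flags.append("warning: page volume")
    # Strictly greater than the configured share, matching the ">40%" comment.
    if pages > 0 and off_hours_pages / pages > off_hours_pct_warning:
        flags.append("warning: off-hours share")
    return flags


print(classify_shift(pages=12, off_hours_pages=3))
print(classify_shift(pages=30, off_hours_pages=20))
```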

Best Practices

Minimum 5 people in a rotation: with fewer, on a weekly rotation each person is on call more often than one week in five, which is hard to sustain
Hand off during business hours: Monday 9 AM handoffs let the incoming on-call review recent issues while the team is present
Always have a secondary: the primary handles the page, and the secondary is the backup if the primary is unreachable
Review page volume monthly: if a service pages more than twice per shift, it needs engineering investment, not more on-call coverage
Compensate fairly: track off-hours pages and provide comp time or pay
Automate the handoff: use the checklist template so the outgoing on-call documents open issues, recent deploys, and known risks
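
The reasoning behind the five-person minimum is simple arithmetic, sketched below under the assumption of an evenly shared rotation. `on_call_share` is a hypothetical helper, not something the kit provides.

```python
def on_call_share(team_size: int, require_secondary: bool = False) -> float:
    """Fraction of time each person carries a pager in an evenly shared rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    slots = 2 if require_secondary else 1  # primary, plus secondary if required
    return min(1.0, slots / team_size)


for size in (3, 5, 8):
    primary_only = on_call_share(size)
    with_backup = on_call_share(size, require_secondary=True)
    print(f"{size} people: primary {primary_only:.0%}, primary + secondary {with_backup:.0%}")
```

Requiring a secondary doubles each person's share, so team size matters even more when `require_secondary` is enabled.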

Troubleshooting

Schedule generator produces uneven distribution
This usually happens when blackout dates cluster for certain team members. Use --balance-mode strict to force even distribution, which may override some non-critical blackout preferences.

PagerDuty webhook not triggering
Verify the webhook URL is reachable from PagerDuty's infrastructure. Check Extensions → Generic Webhooks in your PagerDuty service. Ensure the webhook endpoint returns 2xx within 5 seconds.

OpsGenie alerts not routing correctly
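Check the order of your routing rules. OpsGenie evaluates them top to bottom and stops at the first match, so a broad rule placed above a specific one will capture that rule's alerts.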
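
The effect of first-match evaluation is easy to reproduce in a few lines. This is a toy model, not the OpsGenie API: `first_match`, the rule names, and the team names are all hypothetical.

```python
def first_match(rules, alert):
    """Return (rule_name, team) for the first rule whose predicate accepts the alert."""
    for name, predicate, team in rules:
        if predicate(alert):
            return name, team  # evaluation stops here; later rules are never checked
    return None


# Hypothetical rules: a catch-all and a more specific database rule.
catch_all = ("catch-all", lambda alert: True, "default-team")
database = ("database", lambda alert: "db" in alert["tags"], "dba-team")

alert = {"message": "replica lag high", "tags": ["db", "prod"]}

misordered = first_match([catch_all, database], alert)  # catch-all listed first
fixed = first_match([database, catch_all], alert)       # specific rule listed first
print(misordered)
print(fixed)
```

With the catch-all listed first, the database rule never fires. Moving the specific rule above it restores the intended routing.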

Fatigue report shows zero pages
Ensure the CSV export includes a created_at column with timestamps in ISO 8601 format (for example, 2026-04-01T09:00:00Z).
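
Before re-running the report, a quick check like the one below can confirm whether the export is the problem. It is a sketch that uses only the standard library: `check_created_at` is a hypothetical helper, and the sample rows are invented. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601 on Python versions before 3.11, which is why a trailing `Z` is rewritten first.

```python
import csv
import io
from datetime import datetime


def check_created_at(csv_text: str) -> tuple[int, int]:
    """Return (parseable, unparseable) counts for the created_at column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "created_at" not in reader.fieldnames:
        raise ValueError("CSV has no created_at column")
    ok = bad = 0
    for row in reader:
        value = (row["created_at"] or "").strip()
        if value.endswith("Z"):
            # Older Python versions reject the "Z" suffix, so spell out the UTC offset.
            value = value[:-1] + "+00:00"
        try:
            datetime.fromisoformat(value)
            ok += 1
        except ValueError:
            bad += 1
    return ok, bad


sample = "id,created_at\n1,2026-04-01T09:00:00Z\n2,04/01/2026 09:00\n"
print(check_created_at(sample))
```

To check a real export, read the file's contents into a string and pass that string to the function. A high unparseable count suggests the timestamps need converting before the report can use them.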

This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete On-Call Management Kit with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →
