DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Runbook Template Library

When production is on fire, nobody wants to improvise. This library provides 50+ operational runbooks for the incidents your team will actually face: database connection exhaustion, memory leaks, traffic spikes, deployment rollbacks, certificate expiration, and DNS failures. Each runbook follows a consistent structure — symptoms, diagnosis steps, remediation commands, and verification checks — so any on-call engineer can resolve incidents confidently, even at 3 AM.

Key Features

  • 50+ production-ready runbooks — Covering databases, compute, networking, deployments, storage, and security incidents
  • Consistent structure — Every runbook follows the same format: symptoms → diagnosis → remediation → verification → escalation
  • Copy-paste commands — Real shell commands, kubectl operations, and SQL queries you can run immediately
  • Severity-tagged — Each runbook includes severity classification and expected resolution time
  • Environment-aware configs — YAML configs for development and production environments
  • Runbook index — YAML index mapping alert names to the corresponding runbook
  • Customization guide — Instructions for adapting templates to your infrastructure

Quick Start

unzip runbook-template-library.zip && cd runbook-template-library/

# Browse the runbook index
python3 src/runbook_template_library/core.py list --category database

# Look up a runbook by alert name
python3 src/runbook_template_library/core.py lookup \
  --alert "PostgresConnectionPoolExhausted"

Architecture / How It Works

  1. Index — YAML file mapping Prometheus alert names to runbook paths. Link this from your alerting tool.
  2. Runbooks — Markdown files with structured sections. Each is self-contained and runnable.
  3. Configs — Environment-specific YAML with endpoints and thresholds that runbooks reference.

Usage Examples

Runbook: Database Connection Pool Exhausted

# PostgreSQL Connection Pool Exhausted

**Severity:** SEV2 | **Expected Resolution:** 15-30 min
**Alert:** `PostgresConnectionPoolExhausted`

## Symptoms
- Application logs show "connection pool exhausted"
- Prometheus: `sum by (instance) (pg_stat_activity_count) > on (instance) pg_settings_max_connections * 0.9`
- New requests timing out or returning 503

## Diagnosis

# Check current connection count
kubectl exec -it postgres-primary-0 -n database -- \
psql -U postgres -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Identify connection-hogging services
kubectl exec -it postgres-primary-0 -n database -- \
psql -U postgres -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC LIMIT 10;"


## Remediation

**Option A: Kill idle connections**
kubectl exec -it postgres-primary-0 -n database -- \
psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '10 minutes';"


**Option B: Restart the offending service**
kubectl rollout restart deployment/<deployment-name> -n production


## Verification
kubectl exec -it postgres-primary-0 -n database -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"


## Escalation
If connections keep climbing after remediation, escalate to the database team.

Alert-to-Runbook Index

# index/alert_runbook_map.yaml
alerts:
  PostgresConnectionPoolExhausted:
    runbook: runbooks/database/connection-pool-exhausted.md
    severity: SEV2

  HighMemoryUtilization:
    runbook: runbooks/compute/high-memory-utilization.md
    severity: SEV3

  CertificateExpiringSoon:
    runbook: runbooks/security/certificate-expiring.md
    severity: SEV3

  DeploymentRollbackRequired:
    runbook: runbooks/deployment/rollback-procedure.md
    severity: SEV2

  DiskSpaceCritical:
    runbook: runbooks/storage/disk-space-critical.md
    severity: SEV2
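The lookup against this index can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's actual `core.py`: the `ALERT_INDEX` dict mirrors part of the YAML above (in practice it would be loaded with a YAML parser), and `RunbookRef` and `lookup_runbook` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Hypothetical in-memory copy of index/alert_runbook_map.yaml (two entries shown).
ALERT_INDEX = {
    "PostgresConnectionPoolExhausted": {
        "runbook": "runbooks/database/connection-pool-exhausted.md",
        "severity": "SEV2",
    },
    "DiskSpaceCritical": {
        "runbook": "runbooks/storage/disk-space-critical.md",
        "severity": "SEV2",
    },
}


class RunbookRef(NamedTuple):
    path: str
    severity: str


def lookup_runbook(alert_name: str, index: dict = ALERT_INDEX) -> Optional[RunbookRef]:
    """Return the runbook for an alert, or None if the name has no exact match."""
    entry = index.get(alert_name)
    if entry is None:
        return None
    return RunbookRef(path=entry["runbook"], severity=entry["severity"])


if __name__ == "__main__":
    print(lookup_runbook("PostgresConnectionPoolExhausted"))
    # Dict keys are case-sensitive, so a lowercased name finds nothing.
    print(lookup_runbook("postgresconnectionpoolexhausted"))
```

Returning `None` rather than raising keeps the "no match" case easy to handle in whatever tool calls the lookup.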

Environment Configuration

# configs/production.yaml
environment: production
database:
  host: postgres-primary.database.svc.cluster.local
  port: 5432
  max_connections: 150
  connection_critical_threshold: 0.90
monitoring:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com
kubernetes:
  context: production-cluster
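With these values, the critical threshold works out to 150 × 0.90 = 135 connections. A small sketch of that check, assuming the config has already been parsed into a dict; `critical_connection_count` and `is_pool_critical` are hypothetical helpers, not part of the toolkit.

```python
# Values copied from configs/production.yaml above; parsing the YAML is omitted.
DATABASE_CONFIG = {
    "max_connections": 150,
    "connection_critical_threshold": 0.90,
}


def critical_connection_count(config: dict) -> int:
    """Number of connections at which the pool counts as critical."""
    # round() guards against floating-point artifacts in the multiplication.
    return round(config["max_connections"] * config["connection_critical_threshold"])


def is_pool_critical(current_connections: int, config: dict = DATABASE_CONFIG) -> bool:
    """True once the connection count exceeds the critical threshold."""
    return current_connections > critical_connection_count(config)
```

The strict `>` comparison follows the Prometheus expression in the example runbook, so exactly 135 connections is not yet critical under this sketch.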

Configuration

# config.example.yaml
runbooks:
  base_dir: runbooks/
  index_file: index/alert_runbook_map.yaml
  required_sections: [Symptoms, Diagnosis, Remediation, Verification, Escalation]
  review_cadence: quarterly

validation:
  check_commands: true                  # Validate shell command syntax
  require_severity: true                # Every runbook needs a severity tag

environments:
  - name: development
    config: configs/development.yaml
  - name: production
    config: configs/production.yaml
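To make the `required_sections` and `require_severity` settings concrete, here is a rough sketch of what such a check might look like. It is not the toolkit's validator: the function name, the `## ` header convention, and the `**Severity:**` pattern are assumptions based on the example runbook shown earlier.

```python
import re

REQUIRED_SECTIONS = ["Symptoms", "Diagnosis", "Remediation", "Verification", "Escalation"]

# Matches level-2 Markdown headers such as "## Diagnosis".
HEADER_RE = re.compile(r"^##\s+(.+?)\s*$", re.MULTILINE)
# Matches a severity tag such as "**Severity:** SEV2".
SEVERITY_RE = re.compile(r"\*\*Severity:\*\*\s*SEV\d")


def validate_runbook(markdown: str, required=REQUIRED_SECTIONS) -> list:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    headers = set(HEADER_RE.findall(markdown))
    for section in required:
        # Exact, case-sensitive comparison: "diagnosis" does not satisfy "Diagnosis".
        if section not in headers:
            problems.append(f"missing section: {section}")
    if not SEVERITY_RE.search(markdown):
        problems.append("missing severity tag")
    return problems
```

Collecting every problem in one pass, instead of stopping at the first, lets an author fix a runbook in a single edit.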

Best Practices

  • Link runbooks from alerts — PagerDuty and OpsGenie support runbook URLs in alert metadata
  • Write for the worst case — assume a junior engineer at 3 AM
  • Include verification steps — every remediation ends with "how do I know this worked?"
  • Review quarterly — stale runbooks are worse than none
  • Version control everything — Git = audit trail + rollback
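The quarterly review can be partly automated. The sketch below flags runbooks whose last review is older than the review interval. It rests on assumptions: "quarterly" is approximated as 90 days, and the `last_reviewed` mapping is hypothetical (the dates might come from Git history or a field in each runbook).

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def stale_runbooks(last_reviewed: dict, today: date) -> list:
    """Return runbook paths whose last review is older than the review interval."""
    return sorted(
        path
        for path, reviewed_on in last_reviewed.items()
        if today - reviewed_on > REVIEW_INTERVAL
    )


if __name__ == "__main__":
    reviews = {
        "runbooks/database/connection-pool-exhausted.md": date(2024, 1, 10),
        "runbooks/storage/disk-space-critical.md": date(2024, 5, 1),
    }
    print(stale_runbooks(reviews, today=date(2024, 6, 1)))
```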

Troubleshooting

Runbook validation fails on "missing section"
Check that your Markdown headers exactly match the required_sections list in config. The validator is case-sensitive: "Diagnosis" works, "diagnosis" doesn't.

Alert-to-runbook lookup returns "no match"
The lookup uses exact alert name matching. Ensure names in Prometheus rules match keys in alert_runbook_map.yaml.
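When a lookup comes back empty, a near-miss search can point to the mismatched name. The helper below is a hypothetical debugging aid, not a toolkit feature. It checks for case-only differences first, then falls back to fuzzy matching with the standard library's `difflib`.

```python
import difflib


def suggest_alert_names(alert_name: str, known_names: list) -> list:
    """Suggest index keys that closely resemble an alert name that failed to match."""
    # Case-only mismatches are the most common cause, so check them first.
    case_matches = [
        name for name in known_names
        if name.lower() == alert_name.lower() and name != alert_name
    ]
    if case_matches:
        return case_matches
    # Otherwise fall back to fuzzy matching on the spelling.
    return difflib.get_close_matches(alert_name, known_names, n=3, cutoff=0.8)


if __name__ == "__main__":
    names = ["PostgresConnectionPoolExhausted", "HighMemoryUtilization", "DiskSpaceCritical"]
    print(suggest_alert_names("postgresconnectionpoolexhausted", names))
```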

Environment config not loading
Verify the YAML file path in config matches the actual file location. Relative paths resolve from the toolkit root.
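The resolution rule can be reproduced with `pathlib` to check which file a given setting points at. This is an illustrative sketch: `resolve_config_path` is a hypothetical helper, and the toolkit root is passed in explicitly because its location depends on where you unzipped the library.

```python
from pathlib import Path


def resolve_config_path(config_path: str, toolkit_root: str) -> Path:
    """Resolve a config path; relative paths are taken relative to the toolkit root."""
    path = Path(config_path)
    # Absolute paths are used as-is; relative ones are anchored at the root.
    return path if path.is_absolute() else Path(toolkit_root) / path
```

Calling `.exists()` on the result confirms whether the file is where the config says it is.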

Commands in runbooks fail in production
Ensure kubernetes.context matches your kubeconfig context. Run kubectl config get-contexts to verify.


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Runbook Template Library with all files, templates, and documentation for $29.

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

