Runbook Template Library
When production is on fire, nobody wants to improvise. This library provides 50+ operational runbooks for the incidents your team will actually face: database connection exhaustion, memory leaks, traffic spikes, deployment rollbacks, certificate expiration, and DNS failures. Each runbook follows a consistent structure — symptoms, diagnosis steps, remediation commands, and verification checks — so any on-call engineer can resolve incidents confidently, even at 3 AM.
Key Features
- 50+ production-ready runbooks — Covering databases, compute, networking, deployments, storage, and security incidents
- Consistent structure — Every runbook follows the same format: symptoms → diagnosis → remediation → verification → escalation
- Copy-paste commands — Real shell commands, kubectl operations, and SQL queries you can run immediately
- Severity-tagged — Each runbook includes severity classification and expected resolution time
- Environment-aware configs — YAML configs for development and production environments
- Runbook index — YAML index mapping alert names to the corresponding runbook
- Customization guide — Instructions for adapting templates to your infrastructure
Quick Start
unzip runbook-template-library.zip && cd runbook-template-library/
# Browse the runbook index
python3 src/runbook_template_library/core.py list --category database
# Look up a runbook by alert name
python3 src/runbook_template_library/core.py lookup \
  --alert "PostgresConnectionPoolExhausted"
Architecture / How It Works
- Index — YAML file mapping Prometheus alert names to runbook paths. Link this from your alerting tool.
- Runbooks — Markdown files with structured sections. Each is self-contained and runnable.
- Configs — Environment-specific YAML with endpoints and thresholds that runbooks reference.
Usage Examples
Runbook: Database Connection Pool Exhausted
# PostgreSQL Connection Pool Exhausted
**Severity:** SEV2 | **Expected Resolution:** 15-30 min
**Alert:** `PostgresConnectionPoolExhausted`
## Symptoms
- Application logs show "connection pool exhausted"
- Prometheus: `pg_stat_activity_count > pg_settings_max_connections * 0.9`
- New requests timing out or returning 503
## Diagnosis
bash
# Check current connection count
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Identify connection-hogging services
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC LIMIT 10;"
## Remediation
**Option A: Kill idle connections**
bash
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '10 minutes';"
**Option B: Restart the offending service**
bash
# Replace <deployment-name> with the service identified during diagnosis
kubectl rollout restart deployment/<deployment-name> -n production
## Verification
bash
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
## Escalation
If connections keep climbing after remediation, escalate to the database team.
Alert-to-Runbook Index
# index/alert_runbook_map.yaml
alerts:
  PostgresConnectionPoolExhausted:
    runbook: runbooks/database/connection-pool-exhausted.md
    severity: SEV2
  HighMemoryUtilization:
    runbook: runbooks/compute/high-memory-utilization.md
    severity: SEV3
  CertificateExpiringSoon:
    runbook: runbooks/security/certificate-expiring.md
    severity: SEV3
  DeploymentRollbackRequired:
    runbook: runbooks/deployment/rollback-procedure.md
    severity: SEV2
  DiskSpaceCritical:
    runbook: runbooks/storage/disk-space-critical.md
    severity: SEV2
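Resolving an alert against this index is an exact key match. The sketch below illustrates that, assuming the index has already been loaded into a dict of the same shape. The data is inlined so the example runs without a YAML parser, and `lookup_runbook` is a hypothetical helper name, not the toolkit's actual API.

```python
from typing import Optional

# Partial mirror of index/alert_runbook_map.yaml, inlined to avoid dependencies.
ALERT_INDEX = {
    "alerts": {
        "PostgresConnectionPoolExhausted": {
            "runbook": "runbooks/database/connection-pool-exhausted.md",
            "severity": "SEV2",
        },
        "DiskSpaceCritical": {
            "runbook": "runbooks/storage/disk-space-critical.md",
            "severity": "SEV2",
        },
    }
}


def lookup_runbook(index: dict, alert_name: str) -> Optional[dict]:
    """Return the index entry for an alert, or None when there is no exact match."""
    return index.get("alerts", {}).get(alert_name)


entry = lookup_runbook(ALERT_INDEX, "PostgresConnectionPoolExhausted")
if entry is not None:
    print(f"{entry['severity']}: {entry['runbook']}")
```

Returning `None` rather than raising keeps the "no match" case explicit for the caller, which can then fall back to a default escalation path.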
Environment Configuration
# configs/production.yaml
environment: production
database:
  host: postgres-primary.database.svc.cluster.local
  port: 5432
  max_connections: 150
  connection_critical_threshold: 0.90
monitoring:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com
kubernetes:
  context: production-cluster
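With these production values the critical threshold works out to 150 × 0.90 = 135 connections. The sketch below shows that arithmetic, assuming `connection_critical_threshold` is a fraction of `max_connections` (consistent with the `* 0.9` in the sample runbook's Prometheus expression). The function names are hypothetical.

```python
# Values copied from configs/production.yaml.
PRODUCTION = {
    "database": {
        "max_connections": 150,
        "connection_critical_threshold": 0.90,
    }
}


def critical_connection_count(config: dict) -> float:
    """Connection count above which the pool is treated as critical."""
    db = config["database"]
    return db["max_connections"] * db["connection_critical_threshold"]


def is_pool_critical(current_connections: int, config: dict) -> bool:
    """True when current connections exceed the configured fraction of max_connections."""
    return current_connections > critical_connection_count(config)


print(is_pool_critical(140, PRODUCTION))
print(is_pool_critical(100, PRODUCTION))
```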
Configuration
# config.example.yaml
runbooks:
  base_dir: runbooks/
  index_file: index/alert_runbook_map.yaml
  required_sections: [Symptoms, Diagnosis, Remediation, Verification, Escalation]
  review_cadence: quarterly
validation:
  check_commands: true     # Validate shell command syntax
  require_severity: true   # Every runbook needs a severity tag
environments:
  - name: development
    config: configs/development.yaml
  - name: production
    config: configs/production.yaml
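To make the `required_sections` and `require_severity` settings concrete, here is a rough sketch of what such a check could look like. It is not the toolkit's validator: it assumes sections are level-2 Markdown headings and that severity appears as a `**Severity:** SEVn` line, as in the sample runbook. `validate_runbook` is a hypothetical name.

```python
import re

REQUIRED_SECTIONS = ("Symptoms", "Diagnosis", "Remediation", "Verification", "Escalation")


def validate_runbook(markdown: str, required_sections=REQUIRED_SECTIONS, require_severity=True):
    """Return a list of problems; an empty list means the runbook passes."""
    # Collect level-2 headings such as "## Symptoms" (exact, case-sensitive text).
    headings = set(re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE))
    problems = [f"missing section: {name}" for name in required_sections if name not in headings]
    # Assumed severity format: "**Severity:** SEV2".
    if require_severity and not re.search(r"\*\*Severity:\*\*\s*SEV\d", markdown):
        problems.append("missing severity tag")
    return problems


sample = (
    "# PostgreSQL Connection Pool Exhausted\n"
    "**Severity:** SEV2 | **Expected Resolution:** 15-30 min\n"
    "## Symptoms\n## diagnosis\n## Remediation\n## Verification\n## Escalation\n"
)
print(validate_runbook(sample))
```

The lowercase `## diagnosis` heading in the sample is reported as a missing `Diagnosis` section, because the comparison is case-sensitive.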
Best Practices
- Link runbooks from alerts — PagerDuty and OpsGenie support runbook URLs in alert metadata
- Write for the worst case — assume a junior engineer at 3 AM
- Include verification steps — every remediation ends with "how do I know this worked?"
- Review quarterly — stale runbooks are worse than none
- Version control everything — Git = audit trail + rollback
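The quarterly review can be backed by a small staleness check. The sketch below assumes you track a last-reviewed date per runbook (for example from Git history or front matter), which the templates themselves may not provide, and it approximates "quarterly" as 90 days. All names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def stale_runbooks(last_reviewed: dict, today: date, interval: timedelta = REVIEW_INTERVAL) -> list:
    """Return runbook paths whose last review is older than the interval, sorted by path."""
    return sorted(path for path, reviewed in last_reviewed.items() if today - reviewed > interval)


# Hypothetical review dates, for illustration only.
reviews = {
    "runbooks/database/connection-pool-exhausted.md": date(2024, 1, 10),
    "runbooks/storage/disk-space-critical.md": date(2024, 5, 20),
}
print(stale_runbooks(reviews, today=date(2024, 6, 1)))
```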
Troubleshooting
Runbook validation fails on "missing section"
Check that your Markdown headers exactly match the `required_sections` list in the config. The validator is case-sensitive: "Diagnosis" passes, "diagnosis" does not.
Alert-to-runbook lookup returns "no match"
The lookup uses exact alert name matching. Ensure the alert names in your Prometheus rules match the keys in `alert_runbook_map.yaml` character for character, including case.
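When a lookup misses, it helps to see which index keys come close. The helper below is a hypothetical debugging aid, not part of the toolkit: it checks for case-only mismatches first, then falls back to fuzzy matching with the standard library's `difflib`.

```python
import difflib


def suggest_alert_names(alert_name: str, known_alerts: list) -> list:
    """Suggest index keys that resemble an alert name that failed exact lookup."""
    # Case-only mismatches are the most common cause, so check them first.
    folded = [key for key in known_alerts if key.lower() == alert_name.lower()]
    if folded:
        return folded
    # Otherwise return up to three keys with a similarity ratio of at least 0.8.
    return difflib.get_close_matches(alert_name, known_alerts, n=3, cutoff=0.8)


known = [
    "PostgresConnectionPoolExhausted",
    "HighMemoryUtilization",
    "CertificateExpiringSoon",
]
print(suggest_alert_names("postgresconnectionpoolexhausted", known))
```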
Environment config not loading
Verify the YAML file path in config matches the actual file location. Relative paths resolve from the toolkit root.
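A quick way to see which file a configured path points at is to resolve it yourself. This sketch assumes the resolution rule stated above (relative paths hang off the toolkit root). The install location and function name are hypothetical.

```python
from pathlib import Path


def resolve_config_path(toolkit_root: Path, configured: str) -> Path:
    """Join a configured path onto the toolkit root; absolute paths are kept as-is."""
    # Path's "/" operator returns the right-hand side unchanged when it is absolute.
    return toolkit_root / configured


root = Path("/opt/runbook-template-library")  # hypothetical install location
path = resolve_config_path(root, "configs/production.yaml")
print(path, "exists" if path.is_file() else "is missing")
```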
Commands in runbooks fail in production
Ensure `kubernetes.context` in the environment config matches a context in your kubeconfig. Run `kubectl config get-contexts` to list the available contexts.
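The same check can be scripted as a guard before running any runbook commands. The sketch below shells out to `kubectl config current-context`. The function names are hypothetical, and the context getter is injectable so the comparison can be exercised without a cluster.

```python
import subprocess
from typing import Callable


def current_kube_context() -> str:
    """Return the active kubeconfig context by calling kubectl."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def context_matches(expected: str, get_current: Callable[[], str] = current_kube_context) -> bool:
    """Compare the configured kubernetes.context with the active context."""
    return get_current() == expected


# Stubbed context so the example runs without kubectl installed.
print(context_matches("production-cluster", get_current=lambda: "staging-cluster"))
```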
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Runbook Template Library with all files, templates, and documentation for $29.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.