Thesius Code

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

Runbook Template Library

#kubernetes #monitoring #devops #sre

When production is on fire, nobody wants to improvise. This library provides 50+ operational runbooks for the incidents your team will actually face: database connection exhaustion, memory leaks, traffic spikes, deployment rollbacks, certificate expiration, and DNS failures. Each runbook follows a consistent structure — symptoms, diagnosis steps, remediation commands, and verification checks — so any on-call engineer can resolve incidents confidently, even at 3 AM.

Key Features

50+ production-ready runbooks — Covering databases, compute, networking, deployments, storage, and security incidents
Consistent structure — Every runbook follows the same format: symptoms → diagnosis → remediation → verification → escalation
Copy-paste commands — Real shell commands, kubectl operations, and SQL queries you can run immediately
Severity-tagged — Each runbook includes severity classification and expected resolution time
Environment-aware configs — YAML configs for development and production environments
Runbook index — YAML index mapping alert names to the corresponding runbook
Customization guide — Instructions for adapting templates to your infrastructure

Quick Start

unzip runbook-template-library.zip && cd runbook-template-library/

# Browse the runbook index
python3 src/runbook_template_library/core.py list --category database

# Look up a runbook by alert name
python3 src/runbook_template_library/core.py lookup \
  --alert "PostgresConnectionPoolExhausted"

Architecture / How It Works

Index — YAML file mapping Prometheus alert names to runbook paths. Link this from your alerting tool.
Runbooks — Markdown files with structured sections. Each is self-contained and runnable.
Configs — Environment-specific YAML with endpoints and thresholds that runbooks reference.
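
To illustrate what "structured sections" could mean in practice, here is a minimal sketch that splits a runbook's Markdown into its level-2 sections. The function name `split_sections` is hypothetical and is not taken from the toolkit's `core.py`; the sketch also assumes no `## ` lines appear inside code blocks.

```python
import re


def split_sections(markdown_text: str) -> dict:
    """Map each level-2 heading ("## Title") to the text beneath it."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            # Start collecting lines for a new section.
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Abbreviated runbook text, for illustration only.
runbook = """# PostgreSQL Connection Pool Exhausted

## Symptoms
- New requests timing out or returning 503

## Escalation
If connections keep climbing after remediation, escalate to the database team.
"""

sections = split_sections(runbook)
print(list(sections))
```

Because each runbook is self-contained, a parser like this needs nothing beyond the file's own text.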

Usage Examples

Runbook: Database Connection Pool Exhausted

# PostgreSQL Connection Pool Exhausted

**Severity:** SEV2 | **Expected Resolution:** 15-30 min
**Alert:** `PostgresConnectionPoolExhausted`

## Symptoms
- Application logs show "connection pool exhausted"
- Prometheus: `pg_stat_activity_count > pg_settings_max_connections * 0.9`
- New requests timing out or returning 503

## Diagnosis

```bash
# Check current connection count
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Identify connection-hogging services
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC LIMIT 10;"
```


## Remediation

**Option A: Kill idle connections**

```bash
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '10 minutes';"
```


**Option B: Restart the offending service**

```bash
# Replace <deployment-name> with the service identified during diagnosis
kubectl rollout restart deployment/<deployment-name> -n production
```


## Verification

```bash
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
```


## Escalation
If connections keep climbing after remediation, escalate to the database team.

Alert-to-Runbook Index

# index/alert_runbook_map.yaml
alerts:
  PostgresConnectionPoolExhausted:
    runbook: runbooks/database/connection-pool-exhausted.md
    severity: SEV2

  HighMemoryUtilization:
    runbook: runbooks/compute/high-memory-utilization.md
    severity: SEV3

  CertificateExpiringSoon:
    runbook: runbooks/security/certificate-expiring.md
    severity: SEV3

  DeploymentRollbackRequired:
    runbook: runbooks/deployment/rollback-procedure.md
    severity: SEV2

  DiskSpaceCritical:
    runbook: runbooks/storage/disk-space-critical.md
    severity: SEV2
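
As a rough sketch of how a lookup against this index might behave, the snippet below keeps a few of the entries above in a plain dictionary and does an exact-match lookup. The dictionary stands in for the parsed YAML file, and `lookup_runbook` is a hypothetical name, not the toolkit's actual API.

```python
from typing import Optional

# Hypothetical in-memory copy of part of index/alert_runbook_map.yaml.
ALERT_INDEX = {
    "PostgresConnectionPoolExhausted": {
        "runbook": "runbooks/database/connection-pool-exhausted.md",
        "severity": "SEV2",
    },
    "HighMemoryUtilization": {
        "runbook": "runbooks/compute/high-memory-utilization.md",
        "severity": "SEV3",
    },
    "DiskSpaceCritical": {
        "runbook": "runbooks/storage/disk-space-critical.md",
        "severity": "SEV2",
    },
}


def lookup_runbook(alert_name: str, index: dict = ALERT_INDEX) -> Optional[dict]:
    """Return the index entry for an alert, or None when there is no exact match."""
    return index.get(alert_name)


entry = lookup_runbook("PostgresConnectionPoolExhausted")
print(entry["runbook"], entry["severity"])

# Exact matching is case-sensitive, so a lowercased name finds nothing.
print(lookup_runbook("postgresconnectionpoolexhausted"))
```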

Environment Configuration

# configs/production.yaml
environment: production
database:
  host: postgres-primary.database.svc.cluster.local
  port: 5432
  max_connections: 150
  connection_critical_threshold: 0.90
monitoring:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com
kubernetes:
  context: production-cluster

Configuration

# config.example.yaml
runbooks:
  base_dir: runbooks/
  index_file: index/alert_runbook_map.yaml
  required_sections: [Symptoms, Diagnosis, Remediation, Verification, Escalation]
  review_cadence: quarterly

validation:
  check_commands: true                  # Validate shell command syntax
  require_severity: true                # Every runbook needs a severity tag

environments:
  - name: development
    config: configs/development.yaml
  - name: production
    config: configs/production.yaml
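
The `required_sections` and `require_severity` settings suggest a validation pass over each runbook. A minimal sketch of such a check follows, assuming sections are level-2 Markdown headings and the severity tag looks like `**Severity:** SEV2`, as in the example runbook. The function `validate_runbook` is hypothetical, and the real validator may work differently.

```python
import re

REQUIRED_SECTIONS = ["Symptoms", "Diagnosis", "Remediation", "Verification", "Escalation"]


def validate_runbook(markdown_text, required_sections=REQUIRED_SECTIONS, require_severity=True):
    """Return a list of problems; an empty list means the runbook passed."""
    # Collect level-2 headings exactly as written (comparison is case-sensitive).
    headings = set(re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE))
    problems = [f"missing section: {name}" for name in required_sections if name not in headings]
    if require_severity and not re.search(r"\*\*Severity:\*\*\s*SEV\d+", markdown_text):
        problems.append("missing severity tag")
    return problems


# A draft with a lowercase "diagnosis" heading, for illustration.
draft = """# PostgreSQL Connection Pool Exhausted

**Severity:** SEV2 | **Expected Resolution:** 15-30 min

## Symptoms
## diagnosis
## Remediation
## Verification
## Escalation
"""

print(validate_runbook(draft))
```

Returning a list of problems rather than raising on the first one lets a reviewer fix everything in a single pass.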

Best Practices

Link runbooks from alerts — PagerDuty and OpsGenie support runbook URLs in alert metadata
Write for the worst case — assume a junior engineer at 3 AM
Include verification steps — every remediation ends with "how do I know this worked?"
Review quarterly — stale runbooks are worse than none
Version control everything — Git = audit trail + rollback

Troubleshooting

Runbook validation fails on "missing section"
Check that your Markdown headers exactly match the required_sections list in config. The validator is case-sensitive: "Diagnosis" works, "diagnosis" doesn't.

Alert-to-runbook lookup returns "no match"
The lookup uses exact alert name matching. Ensure names in Prometheus rules match keys in alert_runbook_map.yaml.
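
When an exact lookup fails, a typo or a casing difference is a common cause. One way to spot it, sketched here with Python's standard `difflib` module, is to list the index keys closest to the name you searched for. This is an illustrative helper (`suggest_alert_names` is a hypothetical name), not a feature the toolkit is known to provide.

```python
import difflib


def suggest_alert_names(alert_name, known_alerts, limit=3):
    """Return known alert names that closely resemble alert_name, ignoring case."""
    lowered = {name.lower(): name for name in known_alerts}
    matches = difflib.get_close_matches(
        alert_name.lower(), list(lowered), n=limit, cutoff=0.8
    )
    return [lowered[match] for match in matches]


known = [
    "PostgresConnectionPoolExhausted",
    "HighMemoryUtilization",
    "CertificateExpiringSoon",
    "DeploymentRollbackRequired",
    "DiskSpaceCritical",
]

# A misspelled alert name (missing "t" in "Exhausted").
print(suggest_alert_names("PostgresConnectionPoolExhaused", known))
```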

Environment config not loading
Verify the YAML file path in config matches the actual file location. Relative paths resolve from the toolkit root.

Commands in runbooks fail in production
Ensure kubernetes.context matches your kubeconfig context. Run kubectl config get-contexts to verify.

This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Runbook Template Library with all files, templates, and documentation for $29.

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.
