Every SRE team has a list of things they intend to automate. The list grows faster than it shrinks. New services join the platform and generate new alert categories. Compliance requirements expand and generate new evidence collection obligations. Incident volumes increase and generate new runbook entries. Each item on the list is a reasonable automation candidate. Evaluated individually, each looks tractable. The list as a whole represents a structural failure — not of execution, but of classification.
The problem with most SRE automation backlogs is that they are organised by symptom rather than by pattern. "Automate the pod restart for OOM events on the payments service." "Automate the quarterly credential rotation for the database clusters." "Automate the MTTR report that goes to leadership every Friday." Each item is a specific toil instance. None reveals the underlying automation pattern that, once implemented, eliminates not just that specific toil but the entire class of toil it represents.
A taxonomy changes this. When you classify toil by structural pattern rather than surface manifestation, automation investment compounds: the event-driven remediation framework you build for OOM restarts handles disk pressure remediation, certificate expiry remediation, and unhealthy endpoint remediation with minor configuration changes. The evidence synthesis pipeline you build for the MTTR report generates the compliance evidence package, the SLO summary, and the capacity forecast from the same infrastructure. The gate enforcement mechanism you build for error budget policy enforces security scanning gates, dependency vulnerability gates, and SLO regression gates with the same architecture.
This post proposes a systematic taxonomy of SRE automation patterns — a classification framework that organises automation by structure rather than symptom, enabling compound rather than linear returns on automation investment.
The Two Classification Dimensions
Every SRE automation pattern can be characterised along two independent dimensions: the class of toil it eliminates, and the execution model by which it operates. The intersection defines the automation pattern — and determines the implementation architecture.
Dimension 1 — Automation Class: What Kind of Work Does It Eliminate?
Five automation classes cover the full spectrum of operational toil in a production SRE environment:
- Class 1 — Reactive Remediation: Automated response to detected failures. A system enters an undesirable state; the automation detects it and restores it without human intervention. The human designs the detection and remediation logic but does not execute it.
- Class 2 — Proactive Scaling: Automated capacity adjustment ahead of degradation. The system anticipates demand changes and adjusts capacity proactively, eliminating the manual capacity management cycle and the alert-response-scale-verify toil loop.
- Class 3 — Drift Correction: Automated detection and reconciliation of divergence between desired and actual system state. Configuration drift, policy violations, and infrastructure deviation from IaC definitions are detected and corrected continuously rather than discovered during incidents or audits.
- Class 4 — Evidence Synthesis: Automated generation of operational artefacts — postmortems, compliance evidence packages, SLO reports, capacity forecasts — from existing telemetry. Eliminates the high-toil, high-frequency manual assembly of information that already exists in the observability stack.
- Class 5 — Gate Enforcement: Automated policy enforcement at workflow boundaries — deployment gates, change approval gates, security scanning gates, SLO regression gates. Replaces manual committee deliberation with automated policy evaluation, reducing both toil and the inconsistency that manual gate application introduces.
Dimension 2 — Execution Model: How Does the Automation Trigger and Operate?
- Event-Driven: Triggered by discrete state transitions — an alert firing, a webhook payload, a Kubernetes resource state change, a git commit. Dormant until the triggering event occurs, then executes to completion.
- Schedule-Driven: Triggered by time — a CronJob, a maintenance window, a quarterly compliance cycle. Executes at defined intervals regardless of system state.
- Continuous-Reconciliation: Always running, continuously comparing observed state against desired state and correcting divergence. Kubernetes controllers and GitOps operators use this model. The automation never completes; it operates as a persistent control loop.
AUTOMATION TAXONOMY MATRIX
────────────────────────────────────────────────────────────────────────────────
              EVENT-DRIVEN         SCHEDULE-DRIVEN      CONTINUOUS-RECONCILIATION
────────────────────────────────────────────────────────────────────────────────
Reactive      Alert webhook        Scheduled health     Controller-based
Remediation   → K8s Job            check + repair       self-healing loop

Proactive     Load spike           Pre-shift warm-up    HPA / KEDA
Scaling       detection →          CronJob              continuous autoscaling
              burst scale

Drift         Webhook on           Periodic config      Argo CD / Kyverno
Correction    resource change      audit job            continuous sync

Evidence      Incident close       Weekly SLO report    Continuous metric
Synthesis     → postmortem         CronJob              aggregation pipeline
              generator

Gate          PreSync hook         Scheduled SLO        Admission controller
Enforcement   error budget         regression check     (Kyverno / OPA)
              gate
────────────────────────────────────────────────────────────────────────────────
Taxonomy Principle: Identify the automation class first — this determines what the automation must accomplish. Identify the execution model second — this determines the implementation architecture. Conflating the two produces brittle automation that is hard to reason about, hard to test, and hard to extend.
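The two-step classification can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool named in this post: the `ToilItem` type, the `group_by_pattern` helper, and the sample backlog entries are all hypothetical. It shows how tagging each backlog item with a class and an execution model makes shared frameworks visible.

```python
from dataclasses import dataclass
from enum import Enum


class AutomationClass(Enum):
    REACTIVE_REMEDIATION = 1
    PROACTIVE_SCALING = 2
    DRIFT_CORRECTION = 3
    EVIDENCE_SYNTHESIS = 4
    GATE_ENFORCEMENT = 5


class ExecutionModel(Enum):
    EVENT_DRIVEN = "event-driven"
    SCHEDULE_DRIVEN = "schedule-driven"
    CONTINUOUS_RECONCILIATION = "continuous-reconciliation"


@dataclass(frozen=True)
class ToilItem:
    description: str
    automation_class: AutomationClass   # step 1: what the automation must accomplish
    execution_model: ExecutionModel     # step 2: how it triggers and operates


def group_by_pattern(items):
    """Group backlog items by (class, execution model) pattern."""
    groups = {}
    for item in items:
        key = (item.automation_class, item.execution_model)
        groups.setdefault(key, []).append(item.description)
    return groups


# Hypothetical backlog entries, classified by pattern rather than symptom.
backlog = [
    ToilItem("Restart pods after OOM kill",
             AutomationClass.REACTIVE_REMEDIATION, ExecutionModel.EVENT_DRIVEN),
    ToilItem("Remediate node disk pressure",
             AutomationClass.REACTIVE_REMEDIATION, ExecutionModel.EVENT_DRIVEN),
    ToilItem("Friday MTTR report",
             AutomationClass.EVIDENCE_SYNTHESIS, ExecutionModel.SCHEDULE_DRIVEN),
]

groups = group_by_pattern(backlog)
for (cls, model), descriptions in groups.items():
    print(f"{cls.name} / {model.value}: {descriptions}")
```

Any pattern that collects more than one item is a candidate for a shared framework rather than separate point solutions.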
Class 1 — Reactive Remediation Automation
Reactive remediation is the most commonly implemented and most commonly misimplemented automation class. The pattern is deceptively simple: detect an undesirable state, execute a remediation, verify restoration. The failure mode is equally simple: remediation that restores the surface symptom without instrumenting the root cause, generating a toil loop rather than eliminating one.
The correct implementation architecture has four mandatory components. Detection produces a structured event with sufficient context for the remediation to execute without additional lookups. The remediation executes idempotently — running it twice must not cause harm. Verification confirms the desired state has been restored, not just that the remediation command completed. Escalation fires if verification fails, routing to human on-call with the full execution context attached.
# Step 1: AlertManager routes OOMKill alert to remediation webhook
receivers:
  - name: oom-remediation-webhook
    webhook_configs:
      - url: "http://remediation-controller.sre-platform.svc:8080/remediate"
        send_resolved: false
        http_config:
          bearer_token_file: /var/run/secrets/webhook-token
        # Payload includes: namespace, pod_name, container_name,
        # alert_labels, current_memory_usage, memory_limit
route:
  routes:
    - match:
        alertname: KubePodOOMKilled
      receiver: oom-remediation-webhook
      group_wait: 30s # Debounce flapping pods
      group_interval: 5m
      repeat_interval: 1h
# Step 2: Remediation controller spawns a Job, one Job per remediation event.
# The Job is the unit of auditability: outcome logged to Splunk as structured data.
apiVersion: batch/v1
kind: Job
metadata:
  name: oom-remediation-{{ pod_name }}-{{ timestamp }}
  namespace: sre-platform
  labels:
    automation-class: reactive-remediation
    trigger: oom-kill
  annotations:
    sre.internal/incident-id: "{{ incident_id }}"
spec:
  backoffLimit: 1 # One retry; if it fails twice, escalate
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: remediation-executor-sa
      containers:
        - name: oom-remediator
          image: sre-platform/remediator:v3.2.0
          env:
            - name: TARGET_NAMESPACE
              value: "{{ target_namespace }}"
            - name: TARGET_POD
              value: "{{ pod_name }}"
            - name: REMEDIATION_ACTION
              value: "rolling-restart-deployment"
            - name: VERIFY_HEALTHY_REPLICAS
              value: "true"
            - name: VERIFY_TIMEOUT_SECONDS
              value: "90"
            - name: ESCALATE_ON_FAILURE
              value: "true"
            - name: ESCALATION_CHANNEL
              value: "sre-on-call"
            - name: SPLUNK_HEC_URL
              valueFrom:
                secretKeyRef:
                  name: splunk-hec-creds
                  key: url
# Execution sequence:
# 1. Confirm OOMKill via kubectl events (not just alert label)
# 2. Check if deployment already has open remediation in flight
# 3. Execute rolling restart (paced by the Deployment's maxUnavailable / maxSurge;
#    a PodDisruptionBudget only governs evictions, not rollouts)
# 4. Wait for all replicas healthy (readiness probe passing)
# 5. Emit Splunk event: remediation_outcome, duration,
#    root_cause_hint (memory_at_kill / limit ratio),
#    escalated flag
# 6. If verify fails: post Slack with full context, exit 1
The root_cause_hint field in the Splunk payload is the detail that distinguishes a remediation automation from a remediation loop. A pod consistently OOMKilled at 98% of its memory limit will be restored — but the Splunk event creates the longitudinal dataset that surfaces the pattern as a sizing problem, not an operational problem. The automation contains the immediate cost; the telemetry drives the root cause investment.
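The remediate, verify, escalate, emit sequence can be sketched as a single function. This is an illustrative outline only, not the code inside the remediator image: `RemediationEvent`, `run_remediation`, and the four callables are hypothetical names, and real Kubernetes, Slack, and Splunk calls are replaced by injected functions so the control flow stays visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationEvent:
    namespace: str
    deployment: str
    memory_at_kill_bytes: int
    memory_limit_bytes: int


def run_remediation(event, remediate, verify, escalate, emit):
    """Run one remediation and always emit a structured telemetry record."""
    remediate(event)           # must be idempotent: a second run causes no harm
    healthy = verify(event)    # confirm desired state, not just command success
    if not healthy:
        escalate(event)        # hand off to human on-call with full context
    record = {
        "namespace": event.namespace,
        "deployment": event.deployment,
        "remediation_outcome": "restored" if healthy else "escalated",
        "escalated": not healthy,
        # Ratio of memory at kill to the limit; a value that is persistently
        # near 1.0 points at sizing rather than at an operational fault.
        "root_cause_hint": round(
            event.memory_at_kill_bytes / event.memory_limit_bytes, 2
        ),
    }
    emit(record)               # emitted on success and on failure alike
    return record


# Stubbed usage: every callable below stands in for a real integration.
emitted = []
event = RemediationEvent("production", "payments", 980, 1000)
result = run_remediation(
    event,
    remediate=lambda e: None,
    verify=lambda e: True,
    escalate=lambda e: None,
    emit=emitted.append,
)
print(result)
```

Emitting the record unconditionally is the design choice that matters: the recurrence data exists even when the remediation succeeds quietly.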
Istio STRICT mTLS note: If the remediation Job runs with an Envoy sidecar, its calls to other in-mesh services are subject to PeerAuthentication policy enforcement and depend on the workload certificate issued for the Job's service account identity. Pod deletions and deployment rollout commands go to the Kubernetes API server, which authorises them through RBAC rather than through mesh policy. Scope the remediation executor's RBAC to the minimum necessary namespaces to reduce the blast radius of a misconfigured policy.
Class 2 — Proactive Scaling Automation
Proactive scaling automation eliminates the reactive capacity management cycle: observe saturation → manually increase capacity → verify relief → update runbook. In a well-instrumented system with the right autoscaling configuration, this cycle should never involve a human for routine load changes.
The critical design decision is metric selection. CPU-based HPA is the most common and most frequently wrong choice. CPU utilisation measures how hard the pods are working, not how much work the service is being asked to do. Under JVM workloads, CPU can remain low while request queue depth climbs because the garbage collector is pausing request processing. Under connection-pool-bounded services, CPU can stay near zero while new requests time out because all available connections are occupied. Request-rate-based scaling avoids these failure modes by measuring demand directly.
# Request-Rate-Based HPA
# Scales on RPS per replica, not CPU.
# SOT (Safe Operating Throughput) derived from load testing:
# p95 latency exceeds SLO at > 150 RPS/replica.
# HPA target: 120 RPS/replica (80% of SOT = burst headroom).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-rps-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # Sourced from Istio Envoy telemetry
        target:
          type: AverageValue
          averageValue: "120"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30 # Fast scale-up: respond in 30s
      policies:
        - type: Percent
          value: 100 # Can double replica count per interval
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300 # Slow scale-down: avoid flapping
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
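The arithmetic behind the HPA target can be checked with a short Python sketch. It applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), using the 120 RPS target and the 3 to 50 replica bounds from the manifest above. The function name is hypothetical, and the sketch deliberately ignores the controller's tolerance band, the stabilisation windows, and the scale-up and scale-down rate policies.

```python
import math


def desired_replicas(current_replicas, avg_rps_per_pod,
                     target_rps_per_pod=120.0,
                     min_replicas=3, max_replicas=50):
    """Approximate the HPA replica calculation for an AverageValue target."""
    raw = math.ceil(current_replicas * avg_rps_per_pod / target_rps_per_pod)
    # Clamp to the bounds declared in the HorizontalPodAutoscaler spec.
    return max(min_replicas, min(max_replicas, raw))


# 10 replicas each serving 180 RPS against a 120 RPS target.
print(desired_replicas(10, 180))
# 10 replicas each serving 60 RPS: demand has halved.
print(desired_replicas(10, 60))
```

Because the target sits at 80% of the 150 RPS safe operating throughput, a replica can absorb a 25% burst above target before p95 latency is expected to breach the SLO.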
# KEDA Multi-Dimensional Autoscaling
# Combines request-rate, queue depth, and scheduled burst preparation
# in a single ScaledObject: all three execution models in one resource.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: payment-processor
  minReplicaCount: 5
  maxReplicaCount: 80
  cooldownPeriod: 60
  triggers:
    # Trigger 1: Request rate from Prometheus (continuous reconciliation)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_per_second
        query: |
          sum(
            rate(istio_requests_total{
              destination_service_name="payment-processor",
              reporter="destination"
            }[2m])
          ) / count(kube_pod_info{
            namespace="production",
            pod=~"payment-processor-.*"
          })
        threshold: "120"
    # Trigger 2: Kafka queue depth (event-driven: reactive to upstream load)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: payment_queue_depth
        query: |
          sum(kafka_consumer_group_lag{
            topic="payment-requests",
            group="payment-processor"
          })
        threshold: "500"
    # Trigger 3: Pre-market open warm-up (schedule-driven: proactive burst prep)
    # JVM cold-start latency is ~45s. Scale before demand arrives, not after.
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "20 9 * * 1-5" # 09:20 ET: pre-warm before market open
        end: "0 10 * * 1-5"   # 10:00 ET: return to demand-driven scaling
        desiredReplicas: "25"
    # Trigger 4: Business-hours floor for off-hours scale-to-zero
    # (non-production namespaces only). Holds 3 replicas 07:00-20:00 ET on weekdays.
    # Scale-to-zero outside that window also requires minReplicaCount: 0 and no
    # other active trigger; with minReplicaCount: 5, as in this production
    # ScaledObject, this trigger has no effect.
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "0 7 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "3"
The pre-market open warm-up is the pattern that separates proactive from reactive scaling. Scheduled pre-warming converts a known operational risk — cold-start latency at a predictable burst window — into an automated operational guarantee, with zero on-call involvement.
Class 3 — Drift Correction Automation
Configuration drift is the silent accumulation of divergence between the desired state of a system and its actual running state. It accumulates through manual interventions made under incident pressure, through partial rollout failures, and through environment-specific overrides that were never cleaned up.
In regulated environments, drift is a compliance concern as much as an operational one. CIP-010 configuration change management, SOC 2 change management controls, and PCI-DSS configuration baseline requirements all presuppose that the actual state of production systems is known, documented, and under control.
The continuous-reconciliation execution model is the correct architecture because drift does not announce itself. A schedule-driven audit running daily leaves a gap of up to 24 hours. A Kubernetes controller checking desired versus actual state every 30 seconds reduces that window to seconds.
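A reconciliation pass reduces to a diff between desired and observed state followed by a correction. The sketch below is a simplified illustration using flat dictionaries, not how Argo CD or a Kubernetes controller is implemented; the function names and sample keys are hypothetical. It mirrors three behaviours discussed in this section: self-heal, prune, and ignored fields.

```python
def detect_drift(desired, actual, ignore=frozenset()):
    """Return every key whose live value diverges from the desired value."""
    drift = {}
    for key in desired.keys() | actual.keys():
        if key in ignore:
            continue  # e.g. replica count owned by an autoscaler
        if desired.get(key) != actual.get(key):
            drift[key] = {"desired": desired.get(key), "actual": actual.get(key)}
    return drift


def reconcile(desired, actual, ignore=frozenset()):
    """One pass of the control loop: correct drift and report what changed."""
    drift = detect_drift(desired, actual, ignore)
    corrected = dict(actual)
    for key in drift:
        if key in desired:
            corrected[key] = desired[key]   # self-heal: restore declared value
        else:
            del corrected[key]              # prune: remove undeclared setting
    return corrected, drift


desired = {"image": "api:v2", "spec.replicas": 3}
actual = {"image": "api:v1", "spec.replicas": 7, "debug": "true"}

corrected, drift = reconcile(desired, actual, ignore={"spec.replicas"})
print(sorted(drift))
print(corrected)
```

Running `reconcile` again on the corrected state reports no drift, which is the property that lets the loop run continuously without side effects.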
# Argo CD Continuous Reconciliation + CIP-010 Compliance Audit Trail
# Self-heal corrects drift automatically.
# Every sync event, planned or drift-triggered, emits to Splunk
# as a structured compliance record.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-sync-failed.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-health-degraded.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-sync-status-unknown.slack: "sre-drift-alerts"
spec:
  project: production
  source:
    repoURL: https://git.internal/platform/k8s-manifests
    targetRevision: main
    path: clusters/prod/api-platform
  destination:
    server: https://tkg-production.internal:6443
    namespace: production
  syncPolicy:
    automated:
      prune: true    # Remove resources absent from git (prevents orphan drift)
      selfHeal: true # Reconcile live state to git automatically
    syncOptions:
      - RespectIgnoreDifferences=true
      - ServerSideApply=true
    retry:
      limit: 5
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas # HPA manages this; exclude from drift detection
# Kyverno: Drift Prevention at Admission Layer
# Enforces standards before non-compliant state can enter the cluster.
# Converts periodic manual audit toil into continuous automated enforcement.
# Policy 1: Require resource limits on all production containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits-production
spec:
  validationFailureAction: Enforce
  background: true # Audit existing resources, not just new admissions
  rules:
    - name: check-container-resource-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaces: [production, staging]
      validate:
        message: >
          Resource limits required for all containers in production/staging.
          See https://wiki.internal/sre/standards/resources
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"
---
# Policy 2: AI-ops service accounts must not hold cluster-admin binding
# Enforces HolmesGPT and LiteLLM Proxy RBAC standards continuously
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-ai-ops-rbac
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-cluster-admin-for-ai-ops
      match:
        any:
          - resources:
              kinds: [ClusterRoleBinding]
      validate:
        message: "AI-ops service accounts must not hold cluster-admin binding."
        deny:
          conditions:
            all:
              - key: "{{ request.object.subjects[].name }}"
                operator: AnyIn
                value:
                  - holmesgpt-sa
                  - litellm-proxy-sa
              - key: "{{ request.object.roleRef.name }}"
                operator: Equals
                value: "cluster-admin"
The self-healing sync policy combined with the Splunk notification webhook is not just operational convenience — it is a continuous compliance assertion. The git commit history, Argo CD sync log, and Splunk audit trail together constitute a CIP-010 compliance record that is richer, more tamper-evident, and less labour-intensive than documentation-first approaches.
Class 4 — Evidence Synthesis Automation
Evidence synthesis is the most underautomated class in most SRE environments, and carries the highest toil density in regulated enterprises. Postmortems, SLO reports, compliance evidence packages, capacity forecasts, and DORA metric summaries are almost universally assembled manually from data that already exists in the observability stack. The data is available; the assembly is toil.
The automation architecture follows a consistent pattern regardless of the artefact: define the data sources, define the assembly logic, trigger on the appropriate event or schedule, emit the artefact to the appropriate destination.
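The source, assembly, artefact pattern can be sketched without any external system. The example below is a hedged illustration: the function names, the incident record shape, and the sample timestamps are hypothetical, and it assumes every incident minute counts fully against the error budget. It derives MTTR and error budget consumption from incident timestamps and renders a plain-text summary.

```python
from datetime import datetime
from statistics import mean


def synthesise_slo_summary(incidents, slo_target, window_minutes):
    """Assemble MTTR and error budget consumption from incident records."""
    budget_minutes = (1 - slo_target) * window_minutes
    if budget_minutes <= 0:
        raise ValueError("slo_target must leave a non-zero error budget")
    durations = [
        (i["resolved"] - i["started"]).total_seconds() / 60 for i in incidents
    ]
    return {
        "incident_count": len(durations),
        "mttr_minutes": round(mean(durations), 1) if durations else 0.0,
        "budget_consumed_pct": round(100 * sum(durations) / budget_minutes, 1),
    }


def render(summary):
    """Emit the artefact; a real pipeline would post this to a wiki or Slack."""
    return "\n".join(f"{key}: {value}" for key, value in summary.items())


# Hypothetical incident data standing in for the PagerDuty timeline.
incidents = [
    {"started": datetime(2025, 1, 6, 9, 0), "resolved": datetime(2025, 1, 6, 9, 10)},
    {"started": datetime(2025, 1, 14, 22, 0), "resolved": datetime(2025, 1, 14, 22, 20)},
]

summary = synthesise_slo_summary(incidents, slo_target=0.999, window_minutes=30 * 24 * 60)
print(render(summary))
```

Swapping the data source or the renderer changes the artefact without touching the assembly logic, which is what allows one pipeline to serve several reports.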
# Automated Postmortem Generation
# Event-driven in intent: runs when an incident resolves in PagerDuty.
# Implemented here as a 15-minute poll; prefer a webhook trigger where available.
# Produces structured postmortem draft in xWiki Syntax 2.1
# Eliminates 2–4 hours of manual timeline reconstruction per major incident
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postmortem-synthesiser
  namespace: sre-platform
spec:
  schedule: "*/15 * * * *" # Poll resolved incidents; webhook preferred where available
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: evidence-synthesiser-sa
          containers:
            - name: postmortem-generator
              image: sre-platform/evidence-synthesiser:v2.0.0
              env:
                - name: PAGERDUTY_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: pagerduty-creds
                      key: api-token
                - name: SPLUNK_API_URL
                  value: "https://splunk.internal:8089"
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring.svc:9090"
                - name: XWIKI_API_URL
                  value: "https://wiki.internal/rest/wikis/xwiki"
                - name: POSTMORTEM_TEMPLATE_PAGE
                  value: "SRE.Postmortem.Template"
# Synthesis sequence per resolved incident:
# 1. Fetch PagerDuty timeline (alerts, acks, actions)
# 2. Query Splunk for log events in window ±30min
# 3. Query Prometheus for SLI drop, burn rate spike, saturation events
# 4. Correlate Argo CD sync log with incident start time
# 5. Calculate: error budget consumed, MTTR, contributing alerts
# 6. Render xWiki Syntax 2.1 postmortem draft:
#    Auto-populated: timeline, metrics, budget impact, deploy context
#    Left blank: root cause, action items (require human input)
# 7. Create page in SRE.Postmortems namespace
# 8. Emit Splunk event: postmortem_created, incident_id,
#    budget_consumed_pct, mttr_minutes, deployment_correlated
-- Splunk SPL: Weekly SLO Compliance Summary (Schedule-Driven)
-- Run as a scheduled Splunk report; output forwarded to Slack + leadership email
index=sre_metrics sourcetype="sre:error_budget"
earliest=-7d latest=now
| stats
avg(budget_remaining_pct) as avg_budget_remaining,
min(budget_remaining_pct) as min_budget_remaining,
max(burn_rate_1h) as peak_burn_rate_1h,
count(eval(deployment_gate_status="BLOCKED")) as deployments_blocked,
avg(budget_monetary_value_remaining) as avg_monetary_remaining
by service
| eval slo_status = case(
min_budget_remaining > 75, "HEALTHY",
min_budget_remaining > 25, "DEGRADED",
true(), "EXHAUSTED"
)
| eval trend = case(
avg_budget_remaining > 60, "IMPROVING",
avg_budget_remaining > 40, "STABLE",
true(), "WORSENING"
)
| table service, slo_status, avg_budget_remaining, min_budget_remaining,
peak_burn_rate_1h, deployments_blocked, avg_monetary_remaining, trend
| sort slo_status, -peak_burn_rate_1h
-- Splunk SPL: Quarterly CIP-010 / SOC 2 Change Management Evidence Package
-- Eliminates 8–12 hours of manual evidence collection per audit cycle
index=argocd sourcetype=argocd:audit
earliest="01/01/2025:00:00:00" latest="03/31/2025:23:59:59"
| where action="sync" AND environment="production"
| eval
change_initiated_by = coalesce(actor, "automated-gitops"),
change_authorised_via = case(
isnull(override_annotation), "git-approval-workflow",
true(), "sre-manual-override"
),
change_outcome = if(status="Succeeded", "SUCCESSFUL", "FAILED-ROLLED-BACK")
| join application [
search index=cab_system sourcetype=cab:decisions
| rename application_name as application
| fields application, cab_ticket_id, approver, approval_timestamp
]
| table
_time, application, change_initiated_by, change_authorised_via,
cab_ticket_id, approver, change_outcome, git_commit_sha
| outputlookup compliance_evidence_Q1_2025.csv
Class 5 — Gate Enforcement Automation
Gate enforcement automation replaces human deliberation at workflow decision points with automated policy evaluation. The organisational value is not just toil reduction — it is consistency. Manual gate application is inherently inconsistent: the same change reviewed by different CAB members under different operational pressures may receive different outcomes. Automated gate enforcement applies policy deterministically, with a tamper-evident audit trail.
The critical design principle is the separation of policy definition from policy enforcement. Policy is defined by humans and expressed as code in a version-controlled repository. Enforcement is automated against that policy.
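The separation of policy definition from enforcement can be sketched as data plus a pure evaluation function. The names `GatePolicy` and `evaluate_gate` are hypothetical; the default thresholds are the 0.1% error rate and 300 ms p95 latency limits used in the canary analysis in this section. In practice the policy values would be loaded from a version-controlled file rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Policy definition: owned by humans, stored and reviewed as code."""
    max_error_rate: float = 0.001          # < 0.1% error rate
    max_p95_latency_seconds: float = 0.3   # p95 < 300 ms


def evaluate_gate(policy, metrics):
    """Policy enforcement: deterministic, with the reasons recorded."""
    violations = []
    if metrics["error_rate"] >= policy.max_error_rate:
        violations.append("error_rate")
    if metrics["p95_latency_seconds"] >= policy.max_p95_latency_seconds:
        violations.append("p95_latency_seconds")
    return {
        "decision": "BLOCKED" if violations else "ALLOWED",
        "violations": violations,
    }


policy = GatePolicy()
healthy = evaluate_gate(policy, {"error_rate": 0.0004, "p95_latency_seconds": 0.21})
degraded = evaluate_gate(policy, {"error_rate": 0.0030, "p95_latency_seconds": 0.21})
print(healthy)
print(degraded)
```

The same inputs always produce the same decision, and the returned record doubles as the audit entry for the gate.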
# Canary Analysis Gate: Argo Rollouts + Prometheus
# Replaces manual canary traffic monitoring and promotion decisions.
# Promotes to 100% only if SLI metrics meet thresholds; rolls back automatically.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 20
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sli-quality-gate
            args:
              - name: service-name
                value: api-gateway
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sli-quality-gate
            args:
              - name: service-name
                value: api-gateway
        - setWeight: 100 # Only reached if both gates pass
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: sli-quality-gate
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    # Gate 1: Error rate must not exceed SLO error budget at 1× burn
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.001 # < 0.1% error rate
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(istio_requests_total{
              destination_service_name="{{args.service-name}}",
              response_code=~"5..",
              reporter="destination"
            }[2m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="{{args.service-name}}",
              reporter="destination"
            }[2m]))
    # Gate 2: p95 latency must remain within SLO threshold
    - name: p95-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.3 # p95 < 300ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="{{args.service-name}}",
                reporter="destination"
              }[2m])) by (le)
            ) / 1000
# Kyverno Admission Gate: Supply Chain and Observability Standards
# Continuous-reconciliation execution model at the admission layer.
# Enforces standards before non-compliant state can enter the cluster.
# Gate 1: Production images must come from internal registry
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-internal-registry-production
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-image-registry
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [production]
      validate:
        message: >
          Production images must be sourced from registry.internal.
        pattern:
          spec:
            containers:
              - image: "registry.internal/*"
            # Conditional anchor on the list key: checked only when the Pod
            # declares initContainers, so Pods without them are not rejected.
            =(initContainers):
              - image: "registry.internal/*"
---
# Gate 2: AI-ops deployments must declare Splunk log forwarding
# Enforces HolmesGPT / LiteLLM Proxy observability standards at admission
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: ai-ops-observability-standards
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-splunk-logging-annotation
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaces: [ai-ops, holmesgpt]
      validate:
        message: "AI-ops deployments must declare Splunk log forwarding annotation."
        pattern:
          metadata:
            annotations:
              splunk.logging/enabled: "true"
              splunk.logging/index: "?*"
The Automation Investment Decision Framework
Not all toil has equal automation ROI. The decision of which automation to build first benefits from evaluation against four criteria before any code is written.
────────────────────────────────────────────────────────────────────────────
AUTOMATION ROI FRAMEWORK
────────────────────────────────────────────────────────────────────────────
CRITERION 1: FREQUENCY × DURATION (Toil Volume)
  Score = occurrences_per_month × avg_minutes_per_occurrence
  > 120 min/month    → Priority 1: automate immediately
  30–120 min/month   → Priority 2: automate this quarter
  < 30 min/month     → Priority 3: defer unless pattern clusters with others

CRITERION 2: CONSISTENCY (Automation Suitability)
  Remediation identical every occurrence?        → High suitability: Class 1
  Follows a decision tree with < 5 branches?     → Medium: add conditional logic
  Requires contextual human judgment each time?  → Low: automate data gathering
                                                   only, not the decision

CRITERION 3: BLAST RADIUS (Automation Risk)
  High (e.g., scale down production database)    → Human confirmation required;
                                                   automate detection + staging
  Medium (e.g., rolling restart stateless svc)   → Automate with verification
                                                   step + auto-rollback on fail
  Low (e.g., generate report, send notification) → Automate fully

CRITERION 4: PATTERN GENERALISABILITY (Compound Return)
  Applies to > 1 service or > 1 toil category?
  → Yes: invest more in the framework; amortise across all instances
  → No: build a narrow point solution; do not over-engineer
────────────────────────────────────────────────────────────────────────────
EXECUTION MODEL SELECTION:
  Detected via alert / event?   → Event-Driven
  Must occur at known time?     → Schedule-Driven
  Must be continuously true?    → Continuous-Reconciliation
  All three apply?              → Layered: continuous detection +
                                  event-driven remediation +
                                  scheduled evidence synthesis
────────────────────────────────────────────────────────────────────────────
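Criterion 1 is simple enough to encode directly, which keeps backlog scoring consistent between reviewers. The sketch below is a minimal illustration with hypothetical function and item names; the thresholds are the ones in the framework above, with a score of exactly 120 treated as Priority 2.

```python
def toil_priority(occurrences_per_month, avg_minutes_per_occurrence):
    """Return (score in minutes per month, priority 1 to 3) for one toil item."""
    score = occurrences_per_month * avg_minutes_per_occurrence
    if score > 120:
        priority = 1   # automate immediately
    elif score >= 30:
        priority = 2   # automate this quarter
    else:
        priority = 3   # defer unless the pattern clusters with others
    return score, priority


# Hypothetical backlog: (description, occurrences per month, minutes each).
backlog = [
    ("OOM pod restart", 40, 5),
    ("Credential rotation", 1, 90),
    ("Ad hoc log export", 2, 10),
]

ranked = sorted(
    ((name, *toil_priority(count, minutes)) for name, count, minutes in backlog),
    key=lambda row: row[1],
    reverse=True,
)
for name, score, priority in ranked:
    print(f"P{priority}  {score:>4} min/month  {name}")
```

The score covers only toil volume; consistency, blast radius, and generalisability still need a human assessment before an item is scheduled.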
The Automation Maturity Stack
The five automation classes have a natural dependency ordering. Class 3 (Drift Correction) must precede Class 1 (Reactive Remediation) in practice — remediations executed against a drifted configuration produce unpredictable results. Class 2 (Proactive Scaling) requires the observability infrastructure that feeds Class 4 (Evidence Synthesis). Build from the bottom up.
────────────────────────────────────────────────────────────────────────────
LEVEL 5 — PREDICTIVE AUTOMATION
  AI-assisted anomaly prediction (HolmesGPT correlation)
  Capacity forecast with auto-provisioning triggers
  Automated SLO target recalibration from usage patterns
  Requires: Levels 1–4 fully operational

LEVEL 4 — EVIDENCE SYNTHESIS
  Automated postmortem generation
  Continuous compliance evidence pipeline
  Automated DORA + five-metric quarterly report
  Requires: incident data (L1), metric data (L2), change audit data (L3)

LEVEL 3 — GATE ENFORCEMENT
  Error budget PreSync gates (Argo CD)
  Canary analysis with automatic rollback (Argo Rollouts)
  Admission controller policies (Kyverno)
  Requires: SLI data for gates (L2), observability stack (L1)

LEVEL 2 — PROACTIVE SCALING
  Request-rate-based HPA
  KEDA multi-dimensional autoscaling
  Off-hours scale-to-zero (non-production)
  Requires: metric instrumentation for scaling signals (L1)

LEVEL 1 — OBSERVABILITY AND DRIFT CORRECTION FOUNDATION
  Four Golden Signals instrumented (Envoy proxy + application)
  Argo CD self-heal + prune enabled
  Kyverno baseline policies deployed
  Splunk HEC ingesting structured events
  AlertManager routing with structured payloads
  *** This layer is the prerequisite for all automation above it. ***
  *** Without it, higher-class automation executes against        ***
  *** unreliable signal and produces unreliable outcomes.         ***
────────────────────────────────────────────────────────────────────────────
Common Antipatterns
The Automation-as-Suppression antipattern → Building reactive remediation that restores the surface symptom without instrumenting root cause. An OOM restart automation running forty times per month has not eliminated toil; it has automated a symptom while the memory leak continues accumulating. Every automated remediation must emit a structured Splunk event that makes the recurrence pattern visible. The automation contains the cost; the telemetry drives the fix.
The Single-Instance Automation antipattern → Tightly coupling automation to a single service rather than parameterising it against the class of problem. The OOM restart automation should be configurable for any deployment in any namespace via manifest change, not code change. Automation that cannot be generalised produces a proliferation of point solutions with compounding maintenance toil.
The Untested Automation antipattern → Deploying remediation automation to production without testing against simulated failure conditions. Untested automation creates a second failure mode layered on top of the original one. Reactive remediations should be exercised with chaos tooling against non-production environments on a regular schedule — not only at initial deployment.
The Missing Blast-Radius Assessment antipattern → Building full automation for high-blast-radius actions without a human confirmation step or automatic rollback gate. The error budget PreSync hook blocks a deployment — relatively low blast radius. An automation that scales down a production database because a metric threshold was breached — high blast radius. Execution model must be calibrated to the consequence of incorrect execution, not just the efficiency of correct execution.
The Wrong Execution Model antipattern → Using schedule-driven execution for state that must be continuously true. A CronJob checking policy compliance once per hour is not a drift correction mechanism; it is a periodic audit with a one-hour detection gap. A Kyverno admission controller enforcing the same policy at every resource creation is a drift correction mechanism. Compliance state that matters continuously must be enforced continuously.
Maturity Progression
────────────────────────────────────────────────────────────────────────────
STAGE        AUTOMATION STATE                     NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     Toil invisible and unclassified.     All remediation is
             No taxonomy. Automation =            manual and ad hoc.
             bash scripts in runbooks.            Toil Ratio unknown.

Defined      Toil categorised by class.           Level 1 foundation
             ROI framework applied to             deployed. First Class 1
             backlog. Taxonomy adopted.           or Class 2 automation live.

Measured     Classes 1–3 deployed.                Toil Ratio measured
             Automation coverage tracked          and below 40%.
             as % of toil categories              Automation measurably
             with coverage.                       reduces MTTR.

Optimised    Classes 1–5 deployed.                Toil Ratio ≤ 25%.
             Evidence synthesis eliminates        Postmortems generated
             governance toil. Gate                automatically. DORA
             enforcement eliminates manual        metrics automated.
             CAB deliberation.                    Compliance evidence
                                                  pipeline live.

Generative   Level 5 (predictive) active.         HolmesGPT correlation
             Automation patterns shared as        surfaces unknown unknowns
             platform primitives across teams.    ahead of incidents.
             Taxonomy published and cited.        Engineering time is
                                                  almost entirely
                                                  compounding work.
────────────────────────────────────────────────────────────────────────────
Five Action Items for This Week
Run the recurring-incident Splunk query and classify each output item by automation class. Sort by toil score (occurrence × average resolution time). For each item in the top ten, assign it to one of the five classes. Items clustering in the same class are candidates for a shared framework rather than individual point solutions. The classification exercise transforms a task list into an engineering programme.
Audit your existing automation against the execution model taxonomy. For every CronJob, controller, webhook handler, and script in your SRE tooling repo, identify which execution model it uses and whether it is the correct model for the problem it solves. Schedule-driven automation covering for a missing continuous-reconciliation mechanism is a common finding — and a reliability risk, because it leaves a detection gap between execution intervals.
Apply the ROI framework to your top three toil items before writing any code. Score each against frequency × duration, consistency, blast radius, and generalisability. The scoring often reveals that the highest-effort request is not the highest-ROI investment — and that a lower-effort generalised framework would address multiple items simultaneously.
Verify that every existing reactive remediation emits a structured root cause telemetry event. Does each automation emit a Splunk event with fields that distinguish first occurrence from recurrence and capture the leading indicators of the triggering condition? Any automation that restores state without emitting this data is suppressing toil visibility rather than eliminating toil.
Deploy one Kyverno policy that enforces a standard you are currently auditing manually. Pick the compliance or governance standard generating the most recurring audit toil — resource limits, image registry provenance, logging annotations. Implement it as a ClusterPolicy with validationFailureAction: Enforce. Enforcement moves from scheduled detection to continuous prevention, and the policy itself becomes the compliance evidence the manual audit was previously generating.
"The goal of automation in SRE is not to make humans faster at operational work. It is to make humans unnecessary for operational work that follows a known pattern — so that human attention is reserved for the work that does not yet have a pattern. A team that has automated all its known toil categories is not idle; it is free to discover the toil categories that do not yet have names."