Nijo George Payyappilly

Automating Toil Elimination: A Systematic Taxonomy of SRE Automation Patterns

Every SRE team has a list of things they intend to automate. The list grows faster than it shrinks. New services join the platform and generate new alert categories. Compliance requirements expand and generate new evidence collection obligations. Incident volumes increase and generate new runbook entries. Each item on the list is a reasonable automation candidate. Evaluated individually, each looks tractable. The list as a whole represents a structural failure — not of execution, but of classification.

The problem with most SRE automation backlogs is that they are organised by symptom rather than by pattern. "Automate the pod restart for OOM events on the payments service." "Automate the quarterly credential rotation for the database clusters." "Automate the MTTR report that goes to leadership every Friday." Each item is a specific toil instance. None reveals the underlying automation pattern that, once implemented, eliminates not just that specific toil but the entire class of toil it represents.

A taxonomy changes this. When you classify toil by structural pattern rather than surface manifestation, automation investment compounds: the event-driven remediation framework you build for OOM restarts handles disk pressure remediation, certificate expiry remediation, and unhealthy endpoint remediation with minor configuration changes. The evidence synthesis pipeline you build for the MTTR report generates the compliance evidence package, the SLO summary, and the capacity forecast from the same infrastructure. The gate enforcement mechanism you build for error budget policy enforces security scanning gates, dependency vulnerability gates, and SLO regression gates with the same architecture.

This post proposes a systematic taxonomy of SRE automation patterns — a classification framework that organises automation by structure rather than symptom, enabling compound rather than linear returns on automation investment.


The Two Classification Dimensions

Every SRE automation pattern can be characterised along two independent dimensions: the class of toil it eliminates, and the execution model by which it operates. The intersection defines the automation pattern — and determines the implementation architecture.

Dimension 1 — Automation Class: What Kind of Work Does It Eliminate?

Five automation classes cover the full spectrum of operational toil in a production SRE environment:

  • Class 1 — Reactive Remediation: Automated response to detected failures. A system enters an undesirable state; the automation detects it and restores it without human intervention. The human designs the detection and remediation logic but does not execute it.
  • Class 2 — Proactive Scaling: Automated capacity adjustment ahead of degradation. The system anticipates demand changes and adjusts capacity proactively, eliminating the manual capacity management cycle and the alert-response-scale-verify toil loop.
  • Class 3 — Drift Correction: Automated detection and reconciliation of divergence between desired and actual system state. Configuration drift, policy violations, and infrastructure deviation from IaC definitions are detected and corrected continuously rather than discovered during incidents or audits.
  • Class 4 — Evidence Synthesis: Automated generation of operational artefacts — postmortems, compliance evidence packages, SLO reports, capacity forecasts — from existing telemetry. Eliminates the high-toil, high-frequency manual assembly of information that already exists in the observability stack.
  • Class 5 — Gate Enforcement: Automated policy enforcement at workflow boundaries — deployment gates, change approval gates, security scanning gates, SLO regression gates. Replaces manual committee deliberation with automated policy evaluation, reducing both toil and the inconsistency that manual gate application introduces.

Dimension 2 — Execution Model: How Does the Automation Trigger and Operate?

  • Event-Driven: Triggered by discrete state transitions — an alert firing, a webhook payload, a Kubernetes resource state change, a git commit. Dormant until the triggering event occurs, then executes to completion.
  • Schedule-Driven: Triggered by time — a CronJob, a maintenance window, a quarterly compliance cycle. Executes at defined intervals regardless of system state.
  • Continuous-Reconciliation: Always running, continuously comparing observed state against desired state and correcting divergence. Kubernetes controllers and GitOps operators use this model. The automation never completes; it operates as a persistent control loop.
AUTOMATION TAXONOMY MATRIX
────────────────────────────────────────────────────────────────────────────────
                      EVENT-DRIVEN    SCHEDULE-DRIVEN    CONTINUOUS-RECONCILIATION
────────────────────────────────────────────────────────────────────────────────
Reactive              Alert webhook   Scheduled health   Controller-based
Remediation           → K8s Job       check + repair     self-healing loop

Proactive             Load spike      Pre-shift warm-up  HPA / KEDA
Scaling               detection →     CronJob            continuous autoscaling
                      burst scale

Drift                 Webhook on      Periodic config    Argo CD / Kyverno
Correction            resource change audit job          continuous sync

Evidence              Incident close  Weekly SLO report  Continuous metric
Synthesis             → postmortem    CronJob            aggregation pipeline
                      generator

Gate                  PreSync hook    Scheduled SLO      Admission controller
Enforcement           error budget    regression check   (Kyverno / OPA)
                      gate
────────────────────────────────────────────────────────────────────────────────

Taxonomy Principle: Identify the automation class first — this determines what the automation must accomplish. Identify the execution model second — this determines the implementation architecture. Conflating the two produces brittle automation that is hard to reason about, hard to test, and hard to extend.
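As a minimal sketch of this principle, the backlog items quoted earlier can be keyed by (class, execution model) rather than by symptom. The enum names, the `ToilItem` type, and the `group_by_pattern` helper are illustrative assumptions, not part of any tool named in this post.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class AutomationClass(Enum):
    REACTIVE_REMEDIATION = 1
    PROACTIVE_SCALING = 2
    DRIFT_CORRECTION = 3
    EVIDENCE_SYNTHESIS = 4
    GATE_ENFORCEMENT = 5


class ExecutionModel(Enum):
    EVENT_DRIVEN = "event-driven"
    SCHEDULE_DRIVEN = "schedule-driven"
    CONTINUOUS_RECONCILIATION = "continuous-reconciliation"


@dataclass(frozen=True)
class ToilItem:
    description: str
    automation_class: AutomationClass   # what the automation must accomplish
    execution_model: ExecutionModel     # how it triggers and runs


def group_by_pattern(items):
    """Group backlog items by (class, execution model) so shared patterns surface."""
    patterns = defaultdict(list)
    for item in items:
        key = (item.automation_class, item.execution_model)
        patterns[key].append(item.description)
    return dict(patterns)


backlog = [
    ToilItem("Restart pods after OOM kills on payments",
             AutomationClass.REACTIVE_REMEDIATION, ExecutionModel.EVENT_DRIVEN),
    ToilItem("Remediate disk pressure",
             AutomationClass.REACTIVE_REMEDIATION, ExecutionModel.EVENT_DRIVEN),
    ToilItem("Friday MTTR report for leadership",
             AutomationClass.EVIDENCE_SYNTHESIS, ExecutionModel.SCHEDULE_DRIVEN),
]

for (cls, model), descriptions in group_by_pattern(backlog).items():
    print(f"{cls.name} / {model.value}: {len(descriptions)} item(s)")
```

Two of the three items collapse into one pattern, which is the signal that a shared framework is worth more than two point solutions.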


Class 1 — Reactive Remediation Automation

Reactive remediation is the most commonly implemented and most commonly misimplemented automation class. The pattern is deceptively simple: detect an undesirable state, execute a remediation, verify restoration. The failure mode is equally simple: remediation that restores the surface symptom without instrumenting the root cause, generating a toil loop rather than eliminating one.

The correct implementation architecture has four mandatory components. Detection produces a structured event with sufficient context for the remediation to execute without additional lookups. The remediation executes idempotently — running it twice must not cause harm. Verification confirms the desired state has been restored, not just that the remediation command completed. Escalation fires if verification fails, routing to human on-call with the full execution context attached.
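The control flow behind those four components can be sketched in a few lines. This is a hedged illustration only: `RemediationEvent`, `run_remediation`, and the callback names are hypothetical, and a real controller would back each callback with Kubernetes API calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RemediationEvent:
    namespace: str
    deployment: str
    context: dict = field(default_factory=dict)  # everything detection already knows


def run_remediation(
    event: RemediationEvent,
    already_in_flight: Callable[[RemediationEvent], bool],
    remediate: Callable[[RemediationEvent], None],
    verify: Callable[[RemediationEvent], bool],
    escalate: Callable[[RemediationEvent, str], None],
) -> str:
    """Remediate, verify, and escalate on failure. Returns an outcome label."""
    # Idempotency guard: a duplicate event must not trigger a second remediation.
    if already_in_flight(event):
        return "skipped-duplicate"
    remediate(event)
    # Verification checks the desired state, not the remediation's exit code.
    if verify(event):
        return "restored"
    escalate(event, "verification failed after remediation")
    return "escalated"


calls = []
event = RemediationEvent("production", "payments", {"alert": "KubePodOOMKilled"})
outcome = run_remediation(
    event,
    already_in_flight=lambda e: False,
    remediate=lambda e: calls.append("remediate"),
    verify=lambda e: False,   # simulate replicas that never become healthy
    escalate=lambda e, reason: calls.append(f"escalate: {reason}"),
)
print(outcome, calls)
```

Passing the four steps in as callables keeps the flow testable without a cluster, which matters because the escalation path is the one least often exercised in production.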

# Step 1: AlertManager routes OOMKill alert to remediation webhook
receivers:
  - name: oom-remediation-webhook
    webhook_configs:
      - url: "http://remediation-controller.sre-platform.svc:8080/remediate"
        send_resolved: false
        http_config:
          bearer_token_file: /var/run/secrets/webhook-token
        # Alertmanager posts its standard webhook payload; the alert's labels
        # and annotations must carry namespace, pod_name, container_name,
        # current_memory_usage, and memory_limit.

route:
  routes:
    - match:
        alertname: KubePodOOMKilled
      receiver: oom-remediation-webhook
      group_wait: 30s       # Debounce flapping pods
      group_interval: 5m
      repeat_interval: 1h
# Step 2: Remediation controller spawns a Job — one Job per remediation event.
# The Job is the unit of auditability: outcome logged to Splunk as structured data.

apiVersion: batch/v1
kind: Job
metadata:
  name: oom-remediation-{{ pod_name }}-{{ timestamp }}
  namespace: sre-platform
  labels:
    automation-class: reactive-remediation
    trigger: oom-kill
  annotations:
    sre.internal/incident-id: "{{ incident_id }}"
spec:
  backoffLimit: 1           # One retry; if it fails twice, escalate
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: remediation-executor-sa
      containers:
        - name: oom-remediator
          image: sre-platform/remediator:v3.2.0
          env:
            - name: TARGET_NAMESPACE
              value: "{{ target_namespace }}"
            - name: TARGET_POD
              value: "{{ pod_name }}"
            - name: REMEDIATION_ACTION
              value: "rolling-restart-deployment"
            - name: VERIFY_HEALTHY_REPLICAS
              value: "true"
            - name: VERIFY_TIMEOUT_SECONDS
              value: "90"
            - name: ESCALATE_ON_FAILURE
              value: "true"
            - name: ESCALATION_CHANNEL
              value: "sre-on-call"
            - name: SPLUNK_HEC_URL
              valueFrom:
                secretKeyRef:
                  name: splunk-hec-creds
                  key: url
          # Execution sequence:
          # 1. Confirm OOMKill via kubectl events (not just alert label)
          # 2. Check if deployment already has open remediation in flight
          # 3. Execute rolling restart (paced by the Deployment's rollingUpdate
          #    maxUnavailable / maxSurge; PodDisruptionBudgets govern evictions,
          #    not rollouts)
          # 4. Wait for all replicas healthy (readiness probe passing)
          # 5. Emit Splunk event: remediation_outcome, duration,
          #    root_cause_hint (memory_at_kill / limit ratio),
          #    escalated flag
          # 6. If verify fails: post Slack with full context, exit 1

The root_cause_hint field in the Splunk payload is the detail that distinguishes a remediation automation from a remediation loop. A pod consistently OOMKilled at 98% of its memory limit will be restored — but the Splunk event creates the longitudinal dataset that surfaces the pattern as a sizing problem, not an operational problem. The automation contains the immediate cost; the telemetry drives the root cause investment.
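To illustrate how that longitudinal dataset might be used, the sketch below flags deployments whose repeated OOM kills all happen close to the memory limit. The event field names (`memory_at_kill_bytes`, `memory_limit_bytes`) and the thresholds of three kills and a 0.95 ratio are assumptions chosen for the example, not values from the remediator.

```python
from collections import defaultdict


def root_cause_hint(memory_at_kill_bytes: int, memory_limit_bytes: int) -> float:
    """Ratio of memory in use at kill time to the container's memory limit."""
    return round(memory_at_kill_bytes / memory_limit_bytes, 2)


def sizing_candidates(events, min_kills=3, min_ratio=0.95):
    """Return deployments whose repeated OOM kills all occur close to the limit."""
    ratios = defaultdict(list)
    for e in events:
        ratios[e["deployment"]].append(
            root_cause_hint(e["memory_at_kill_bytes"], e["memory_limit_bytes"])
        )
    return sorted(
        name for name, rs in ratios.items()
        if len(rs) >= min_kills and min(rs) >= min_ratio
    )


MiB = 1024 * 1024
events = [
    {"deployment": "payments", "memory_at_kill_bytes": 502 * MiB, "memory_limit_bytes": 512 * MiB},
    {"deployment": "payments", "memory_at_kill_bytes": 505 * MiB, "memory_limit_bytes": 512 * MiB},
    {"deployment": "payments", "memory_at_kill_bytes": 500 * MiB, "memory_limit_bytes": 512 * MiB},
    {"deployment": "search", "memory_at_kill_bytes": 300 * MiB, "memory_limit_bytes": 512 * MiB},
]
print(sizing_candidates(events))   # ['payments']
```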

Istio STRICT mTLS note: PeerAuthentication governs inbound traffic to workloads in the mesh. Under STRICT mode, AlertManager therefore needs a sidecar (and with it a mesh-issued client certificate) to reach the remediation controller's webhook, and so does the remediation Job for any verification calls it makes to in-mesh services. Pod deletions and deployment rollout commands go to the Kubernetes API server and are authorised by RBAC, not by PeerAuthentication. Scope the remediation executor's RBAC to the minimum necessary namespace to reduce the blast radius of a misconfigured policy.


Class 2 — Proactive Scaling Automation

Proactive scaling automation eliminates the reactive capacity management cycle: observe saturation → manually increase capacity → verify relief → update runbook. In a well-instrumented system with the right autoscaling configuration, this cycle should never involve a human for routine load changes.

The critical design decision is metric selection. CPU-based HPA is the most common and most frequently wrong choice. CPU utilisation measures how hard the pods are working, not how much work the service is being asked to do. Under JVM workloads, CPU can remain low while request queue depth climbs because the garbage collector is pausing request processing. Under connection-pool-bounded services, CPU can stay near zero while new requests time out because all available connections are occupied. Request-rate-based scaling avoids these failure modes by measuring demand directly.

# Request-Rate-Based HPA
# Scales on RPS per replica, not CPU.
# SOT (Safe Operating Throughput) derived from load testing:
# p95 latency exceeds SLO at > 150 RPS/replica.
# HPA target: 120 RPS/replica (80% of SOT = burst headroom).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-rps-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second    # Sourced from Istio Envoy telemetry
        target:
          type: AverageValue
          averageValue: "120"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30        # Fast scale-up: respond in 30s
      policies:
        - type: Percent
          value: 100                         # Can double replica count per interval
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300       # Slow scale-down: avoid flapping
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
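The arithmetic the HPA applies to the manifest above can be approximated as follows. This is a simplified sketch: it models the documented ratio formula, the default 10% tolerance, and the 100% scale-up policy, and it deliberately ignores stabilisation windows, the scale-down policy, and not-yet-ready pods. The function name and defaults are illustrative.

```python
import math


def desired_replicas(current_replicas, avg_rps_per_pod, target_rps=120.0,
                     min_replicas=3, max_replicas=50, tolerance=0.1):
    """Approximate the HPA calculation for a Pods metric with an AverageValue target."""
    ratio = avg_rps_per_pod / target_rps
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # within tolerance: no scaling action
    desired = math.ceil(current_replicas * ratio)
    if desired > current_replicas:
        # scaleUp policy: at most +100% of current replicas per 30s period
        desired = min(desired, current_replicas * 2)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, 180))   # 15: each pod sees 180 RPS against a 120 target
print(desired_replicas(10, 125))   # 10: within 10% of target, no change
print(desired_replicas(4, 400))    # 8: wants 14, capped by the 100% policy
```

The third case shows why the scale-up policy matters during a sharp spike: reaching 14 replicas from 4 takes two policy periods, not one.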
# KEDA Multi-Dimensional Autoscaling
# Combines request-rate, queue depth, and scheduled burst preparation
# in a single ScaledObject — all three execution models in one resource.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: payment-processor
  minReplicaCount: 5
  maxReplicaCount: 80
  cooldownPeriod: 60
  triggers:

    # Trigger 1: Request rate from Prometheus (continuous reconciliation)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_per_second
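        # Query returns total RPS for the service. KEDA's default metric type
        # (AverageValue) divides the value by the replica count before
        # comparing it with the threshold, so the query must not divide by
        # pod count itself.
        query: |
          sum(
            rate(istio_requests_total{
              destination_service_name="payment-processor",
              reporter="destination"
            }[2m])
          )
        threshold: "120"    # Target RPS per replica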

    # Trigger 2: Kafka queue depth (event-driven — reactive to upstream load)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: payment_queue_depth
        query: |
          sum(kafka_consumer_group_lag{
            topic="payment-requests",
            group="payment-processor"
          })
        threshold: "500"

    # Trigger 3: Pre-market open warm-up (schedule-driven — proactive burst prep)
    # JVM cold-start latency is ~45s. Scale before demand arrives, not after.
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "20 9 * * 1-5"   # 09:20 ET: pre-warm before market open
        end:   "0 10 * * 1-5"   # 10:00 ET: return to demand-driven scaling
        desiredReplicas: "25"

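    # Trigger 4: Business-hours replica floor (07:00 to 20:00 ET, weekdays).
    # Shown for non-production use: scale-to-zero outside the window also needs
    # minReplicaCount: 0. In this ScaledObject the floor of 3 never takes effect,
    # because minReplicaCount is 5 and the highest trigger request wins.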
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "0 7 * * 1-5"
        end:   "0 20 * * 1-5"
        desiredReplicas: "3"

The pre-market open warm-up is the pattern that separates proactive from reactive scaling. Scheduled pre-warming converts a known operational risk — cold-start latency at a predictable burst window — into an automated operational guarantee, with zero on-call involvement.
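A simplified model of how the triggers combine helps explain why the warm-up works: each trigger asks for a replica count, and the highest request wins within the ScaledObject's bounds. The sketch below assumes that behaviour and ignores activation thresholds and stabilisation; the traffic figures are made up for illustration.

```python
import math


def trigger_replicas(metric_value: float, threshold: float) -> int:
    """Replicas a metric trigger asks for: metric value over the per-replica threshold."""
    return math.ceil(metric_value / threshold)


def effective_replicas(requests, min_replicas=5, max_replicas=80):
    """The highest request from any trigger wins, bounded by the ScaledObject limits."""
    return max(min_replicas, min(max_replicas, max(requests)))


# Inside the warm-up window: traffic is still low, but the cron trigger is active.
rps_request = trigger_replicas(metric_value=600, threshold=120)    # asks for 5
lag_request = trigger_replicas(metric_value=1000, threshold=500)   # asks for 2
cron_request = 25

print(effective_replicas([rps_request, lag_request, cron_request]))  # 25
# After the window closes, the cron trigger drops out and demand decides.
print(effective_replicas([rps_request, lag_request]))                # 5
```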


Class 3 — Drift Correction Automation

Configuration drift is the silent accumulation of divergence between the desired state of a system and its actual running state. It accumulates through manual interventions made under incident pressure, through partial rollout failures, and through environment-specific overrides that were never cleaned up.

In regulated environments, drift is a compliance concern as much as an operational one. CIP-010 configuration change management, SOC 2 change management controls, and PCI-DSS configuration baseline requirements all presuppose that the actual state of production systems is known, documented, and under control.

The continuous-reconciliation execution model is the correct architecture because drift does not announce itself. A schedule-driven audit running daily leaves a gap of up to 24 hours. A Kubernetes controller checking desired versus actual state every 30 seconds reduces that window to seconds.
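One pass of such a control loop reduces to comparing desired and live state and reporting where they differ. The sketch below shows only that comparison step, under the assumption that both states are plain nested dictionaries; it is not Argo CD's diff algorithm, and `find_drift` is a hypothetical helper.

```python
def find_drift(desired: dict, live: dict, ignore_paths=frozenset(), _prefix=""):
    """Return JSON-pointer-style paths where live state differs from desired state."""
    drifted = []
    for key in sorted(set(desired) | set(live)):
        path = f"{_prefix}/{key}"
        if path in ignore_paths:
            continue
        d, l = desired.get(key), live.get(key)
        if isinstance(d, dict) and isinstance(l, dict):
            drifted += find_drift(d, l, ignore_paths, path)
        elif d != l:
            drifted.append(path)
    return drifted


desired = {"spec": {"replicas": 3, "template": {"image": "api:v1.4.2"}}}
live = {"spec": {"replicas": 7, "template": {"image": "api:v1.4.2-hotfix"}}}

# Replica count is owned by the autoscaler, so it is excluded from comparison.
print(find_drift(desired, live, ignore_paths={"/spec/replicas"}))
```

Excluding autoscaler-owned fields is what stops the reconciler and the HPA from fighting over the same value.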

# Argo CD Continuous Reconciliation + CIP-010 Compliance Audit Trail
# Self-heal corrects drift automatically.
# Every sync event — planned or drift-triggered — emits to Splunk
# as a structured compliance record.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-sync-failed.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-health-degraded.splunk: "compliance-audit"
    notifications.argoproj.io/subscribe.on-sync-status-unknown.slack: "sre-drift-alerts"
spec:
  project: production
  source:
    repoURL: https://git.internal/platform/k8s-manifests
    targetRevision: main
    path: clusters/prod/api-platform
  destination:
    server: https://tkg-production.internal:6443
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Remove resources absent from git (prevents orphan drift)
      selfHeal: true     # Reconcile live state to git automatically
    syncOptions:
      - RespectIgnoreDifferences=true
      - ServerSideApply=true
    retry:
      limit: 5
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas    # HPA manages this; exclude from drift detection
# Kyverno — Drift Prevention at Admission Layer
# Enforces standards before non-compliant state can enter the cluster.
# Converts periodic manual audit toil into continuous automated enforcement.

# Policy 1: Require resource limits on all production containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits-production
spec:
  validationFailureAction: Enforce
  background: true    # Audit existing resources, not just new admissions
  rules:
    - name: check-container-resource-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaces: [production, staging]
      validate:
        message: >
          Resource limits required for all containers in production/staging.
          See https://wiki.internal/sre/standards/resources
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"

---
# Policy 2: AI-ops service accounts must not hold cluster-admin binding
# Enforces HolmesGPT and LiteLLM Proxy RBAC standards continuously
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-ai-ops-rbac
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-cluster-admin-for-ai-ops
      match:
        any:
          - resources:
              kinds: [ClusterRoleBinding]
      validate:
        message: "AI-ops service accounts must not hold cluster-admin binding."
        deny:
          conditions:
            all:
              - key: "{{ request.object.subjects[].name }}"
                operator: AnyIn
                value:
                  - holmesgpt-sa
                  - litellm-proxy-sa
              - key: "{{ request.object.roleRef.name }}"
                operator: Equals
                value: "cluster-admin"

The self-healing sync policy combined with the Splunk notification webhook is not just operational convenience — it is a continuous compliance assertion. The git commit history, Argo CD sync log, and Splunk audit trail together constitute a CIP-010 compliance record that is richer, more tamper-evident, and less labour-intensive than documentation-first approaches.


Class 4 — Evidence Synthesis Automation

Evidence synthesis is the most underautomated class in most SRE environments, and carries the highest toil density in regulated enterprises. Postmortems, SLO reports, compliance evidence packages, capacity forecasts, and DORA metric summaries are almost universally assembled manually from data that already exists in the observability stack. The data is available; the assembly is toil.

The automation architecture follows a consistent pattern regardless of the artefact: define the data sources, define the assembly logic, trigger on the appropriate event or schedule, emit the artefact to the appropriate destination.
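The assembly logic is usually simple arithmetic over data that already exists. As a hedged sketch, the two figures most postmortems need (time to resolve and error budget consumed) can be derived as below. The function names, the incident ID, and the request counts are illustrative assumptions.

```python
from datetime import datetime, timedelta


def mttr_minutes(triggered_at: datetime, resolved_at: datetime) -> float:
    """Minutes between the incident being triggered and being resolved."""
    return (resolved_at - triggered_at).total_seconds() / 60


def budget_consumed_pct(failed_requests: int, total_requests_in_window: int,
                        slo_target: float) -> float:
    """Share of the SLO window's error budget burned by this incident, in percent."""
    allowed_failures = (1 - slo_target) * total_requests_in_window
    return round(100 * failed_requests / allowed_failures, 1)


def render_summary(incident_id, triggered_at, resolved_at, failed, total, slo_target):
    return {
        "incident_id": incident_id,
        "mttr_minutes": mttr_minutes(triggered_at, resolved_at),
        "budget_consumed_pct": budget_consumed_pct(failed, total, slo_target),
        "root_cause": None,     # left blank for human analysis
        "action_items": [],     # left blank for human analysis
    }


start = datetime(2025, 3, 4, 14, 2)
summary = render_summary("INC-0001", start, start + timedelta(minutes=47),
                         failed=12_000, total=40_000_000, slo_target=0.999)
print(summary)
```

With a 99.9% target over 40 million requests, the budget is 40,000 failed requests, so 12,000 failures consume 30% of it.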

# Automated Postmortem Generation
# Event-driven in intent: acts on incidents resolved in PagerDuty.
# This CronJob approximates the trigger by polling every 15 minutes.
# Produces structured postmortem draft in xWiki Syntax 2.1
# Eliminates 2–4 hours of manual timeline reconstruction per major incident

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postmortem-synthesiser
  namespace: sre-platform
spec:
  schedule: "*/15 * * * *"    # Poll resolved incidents; webhook preferred where available
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: evidence-synthesiser-sa
          containers:
            - name: postmortem-generator
              image: sre-platform/evidence-synthesiser:v2.0.0
              env:
                - name: PAGERDUTY_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: pagerduty-creds
                      key: api-token
                - name: SPLUNK_API_URL
                  value: "https://splunk.internal:8089"
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring.svc:9090"
                - name: XWIKI_API_URL
                  value: "https://wiki.internal/rest/wikis/xwiki"
                - name: POSTMORTEM_TEMPLATE_PAGE
                  value: "SRE.Postmortem.Template"
              # Synthesis sequence per resolved incident:
              # 1. Fetch PagerDuty timeline (alerts, acks, actions)
              # 2. Query Splunk for log events in window ±30min
              # 3. Query Prometheus for SLI drop, burn rate spike, saturation events
              # 4. Correlate Argo CD sync log with incident start time
              # 5. Calculate: error budget consumed, MTTR, contributing alerts
              # 6. Render xWiki Syntax 2.1 postmortem draft:
              #    Auto-populated: timeline, metrics, budget impact, deploy context
              #    Left blank: root cause, action items (require human input)
              # 7. Create page in SRE.Postmortems namespace
              # 8. Emit Splunk event: postmortem_created, incident_id,
              #    budget_consumed_pct, mttr_minutes, deployment_correlated
-- Splunk SPL: Weekly SLO Compliance Summary (Schedule-Driven)
-- Run as a scheduled Splunk report; output forwarded to Slack + leadership email

index=sre_metrics sourcetype="sre:error_budget"
  earliest=-7d latest=now
| stats
    avg(budget_remaining_pct)            as avg_budget_remaining,
    min(budget_remaining_pct)            as min_budget_remaining,
    max(burn_rate_1h)                    as peak_burn_rate_1h,
    count(eval(deployment_gate_status="BLOCKED")) as deployments_blocked,
    avg(budget_monetary_value_remaining) as avg_monetary_remaining
    by service
| eval slo_status = case(
    min_budget_remaining > 75, "HEALTHY",
    min_budget_remaining > 25, "DEGRADED",
    true(),                    "EXHAUSTED"
  )
| eval trend = case(
    avg_budget_remaining > 60, "IMPROVING",
    avg_budget_remaining > 40, "STABLE",
    true(),                    "WORSENING"
  )
| table service, slo_status, avg_budget_remaining, min_budget_remaining,
    peak_burn_rate_1h, deployments_blocked, avg_monetary_remaining, trend
| sort slo_status, -peak_burn_rate_1h
-- Splunk SPL: Quarterly CIP-010 / SOC 2 Change Management Evidence Package
-- Eliminates 8–12 hours of manual evidence collection per audit cycle

index=argocd sourcetype=argocd:audit
  earliest="01/01/2025:00:00:00" latest="03/31/2025:23:59:59"
| where action="sync" AND environment="production"
| eval
    change_initiated_by = coalesce(actor, "automated-gitops"),
    change_authorised_via = case(
      isnull(override_annotation), "git-approval-workflow",
      true(),                       "sre-manual-override"
    ),
    change_outcome = if(status="Succeeded", "SUCCESSFUL", "FAILED-ROLLED-BACK")
| join application [
    search index=cab_system sourcetype=cab:decisions
    | rename application_name as application
    | fields application, cab_ticket_id, approver, approval_timestamp
  ]
| table
    _time, application, change_initiated_by, change_authorised_via,
    cab_ticket_id, approver, change_outcome, git_commit_sha
| outputlookup compliance_evidence_Q1_2025.csv

Class 5 — Gate Enforcement Automation

Gate enforcement automation replaces human deliberation at workflow decision points with automated policy evaluation. The organisational value is not just toil reduction — it is consistency. Manual gate application is inherently inconsistent: the same change reviewed by different CAB members under different operational pressures may receive different outcomes. Automated gate enforcement applies policy deterministically, with a tamper-evident audit trail.

The critical design principle is the separation of policy definition from policy enforcement. Policy is defined by humans and expressed as code in a version-controlled repository. Enforcement is automated against that policy.
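A minimal sketch of that separation: the policy is data, and enforcement is a pure function of the policy and the observed signals. The `GatePolicy` fields and the thresholds (25% budget remaining, 2.0 burn rate) are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Policy definition: reviewed and versioned like any other code."""
    min_budget_remaining_pct: float
    max_burn_rate_1h: float


def evaluate_gate(policy: GatePolicy, budget_remaining_pct: float, burn_rate_1h: float):
    """Policy enforcement: same inputs always give the same decision and reasons."""
    reasons = []
    if budget_remaining_pct < policy.min_budget_remaining_pct:
        reasons.append(
            f"budget remaining {budget_remaining_pct}% < {policy.min_budget_remaining_pct}%"
        )
    if burn_rate_1h > policy.max_burn_rate_1h:
        reasons.append(f"1h burn rate {burn_rate_1h} > {policy.max_burn_rate_1h}")
    return ("BLOCKED", reasons) if reasons else ("ALLOWED", reasons)


policy = GatePolicy(min_budget_remaining_pct=25.0, max_burn_rate_1h=2.0)
print(evaluate_gate(policy, budget_remaining_pct=60.0, burn_rate_1h=0.8))
print(evaluate_gate(policy, budget_remaining_pct=10.0, burn_rate_1h=3.5))
```

Returning the reasons alongside the decision gives the audit trail its content: every block records which clause of which policy version fired.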

# Canary Analysis Gate — Argo Rollouts + Prometheus
# Replaces manual canary traffic monitoring and promotion decisions.
# Promotes to 100% only if SLI metrics meet thresholds; rolls back automatically.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 20
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sli-quality-gate
            args:
              - name: service-name
                value: api-gateway
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sli-quality-gate
            args:
              - name: service-name
                value: api-gateway
        - setWeight: 100    # Only reached if both gates pass

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: sli-quality-gate
  namespace: production
spec:
  args:
    - name: service-name
  metrics:

    # Gate 1: Error rate must not exceed SLO error budget at 1× burn
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.001    # < 0.1% error rate
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(istio_requests_total{
              destination_service_name="{{args.service-name}}",
              response_code=~"5..",
              reporter="destination"
            }[2m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="{{args.service-name}}",
              reporter="destination"
            }[2m]))

    # Gate 2: p95 latency must remain within SLO threshold
    - name: p95-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.3     # p95 < 300ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="{{args.service-name}}",
                reporter="destination"
              }[2m])) by (le)
            ) / 1000
# Kyverno Admission Gate — Supply Chain and Observability Standards
# Continuous-reconciliation execution model at the admission layer.
# Enforces standards before non-compliant state can enter the cluster.

# Gate 1: Production images must come from internal registry
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-internal-registry-production
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-image-registry
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [production]
      validate:
        message: >
          Production images must be sourced from registry.internal.
        pattern:
          spec:
            containers:
              - image: "registry.internal/*"
            =(initContainers):
              - image: "registry.internal/*"

---
# Gate 2: AI-ops deployments must declare Splunk log forwarding
# Enforces HolmesGPT / LiteLLM Proxy observability standards at admission
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: ai-ops-observability-standards
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-splunk-logging-annotation
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaces: [ai-ops, holmesgpt]
      validate:
        message: "AI-ops deployments must declare Splunk log forwarding annotation."
        pattern:
          metadata:
            annotations:
              splunk.logging/enabled: "true"
              splunk.logging/index: "?*"

The Automation Investment Decision Framework

Not all toil has equal automation ROI. The decision of which automation to build first benefits from evaluation against four criteria before any code is written.

────────────────────────────────────────────────────────────────────────────
AUTOMATION ROI FRAMEWORK
────────────────────────────────────────────────────────────────────────────
CRITERION 1: FREQUENCY × DURATION (Toil Volume)
  Score = occurrences_per_month × avg_minutes_per_occurrence
  > 120 min/month  → Priority 1: automate immediately
  30–120 min/month → Priority 2: automate this quarter
  < 30 min/month   → Priority 3: defer unless pattern clusters with others

CRITERION 2: CONSISTENCY (Automation Suitability)
  Remediation identical every occurrence?         → High suitability: Class 1
  Follows a decision tree with < 5 branches?      → Medium: add conditional logic
  Requires contextual human judgment each time?   → Low: automate data gathering
                                                     only, not the decision

CRITERION 3: BLAST RADIUS (Automation Risk)
  High (e.g., scale down production database)     → Human confirmation required;
                                                     automate detection + staging
  Medium (e.g., rolling restart stateless svc)   → Automate with verification
                                                     step + auto-rollback on fail
  Low (e.g., generate report, send notification) → Automate fully

CRITERION 4: PATTERN GENERALISABILITY (Compound Return)
  Applies to > 1 service or > 1 toil category?
    → Yes: invest more in the framework; amortise across all instances
    → No: build a narrow point solution; do not over-engineer

────────────────────────────────────────────────────────────────────────────
EXECUTION MODEL SELECTION:

  Detected via alert / event?      → Event-Driven
  Must occur at known time?        → Schedule-Driven
  Must be continuously true?       → Continuous-Reconciliation
  All three apply?                 → Layered: continuous detection +
                                     event-driven remediation +
                                     scheduled evidence synthesis
────────────────────────────────────────────────────────────────────────────
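The four criteria and the execution-model questions can be expressed as a small scoring helper. What follows is a minimal Python sketch rather than a definitive implementation: the `ToilItem` fields, the enum names, and the returned labels are hypothetical names chosen for illustration, while the thresholds are copied from the framework above. Two choices are assumptions of the sketch: Criteria 2 and 3 are folded into one function in which the need for human judgment takes precedence over blast radius, and, because the framework does not say what to do when exactly two execution-model questions apply, every matching model is returned.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Consistency(Enum):
    IDENTICAL = "identical"          # same remediation every occurrence
    DECISION_TREE = "decision_tree"  # decision tree with fewer than 5 branches
    JUDGMENT = "judgment"            # contextual human judgment each time


class BlastRadius(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class ToilItem:
    """Hypothetical record for one backlog item; field names are illustrative."""
    name: str
    occurrences_per_month: float
    avg_minutes_per_occurrence: float
    consistency: Consistency
    blast_radius: BlastRadius
    services_affected: int = 1
    toil_categories: int = 1


def toil_volume(item: ToilItem) -> float:
    """Criterion 1: minutes of toil per month."""
    return item.occurrences_per_month * item.avg_minutes_per_occurrence


def priority(item: ToilItem) -> int:
    """Criterion 1 thresholds: >120 -> 1, 30-120 -> 2, <30 -> 3."""
    minutes = toil_volume(item)
    if minutes > 120:
        return 1
    if minutes >= 30:
        return 2
    return 3


def automation_scope(item: ToilItem) -> str:
    """Criteria 2 and 3 combined: how much of the task to automate."""
    if item.consistency is Consistency.JUDGMENT:
        return "automate data gathering only"
    if item.blast_radius is BlastRadius.HIGH:
        return "automate detection and staging; human confirms"
    if item.blast_radius is BlastRadius.MEDIUM:
        return "automate with verification and auto-rollback"
    return "automate fully"


def is_generalisable(item: ToilItem) -> bool:
    """Criterion 4: worth a shared framework rather than a point solution."""
    return item.services_affected > 1 or item.toil_categories > 1


def execution_models(alert_driven: bool, fixed_time: bool,
                     continuously_true: bool) -> list[str]:
    """Return every matching model; three matches means a layered design."""
    models = []
    if continuously_true:
        models.append("continuous-reconciliation")
    if alert_driven:
        models.append("event-driven")
    if fixed_time:
        models.append("schedule-driven")
    return models
```

As a usage example, an OOM restart that fires 40 times a month at an assumed 5 minutes per occurrence scores 200 minutes per month, which places it in Priority 1.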

The Automation Maturity Stack

The five automation classes have a natural dependency ordering. In practice, Class 3 (Drift Correction) must precede Class 1 (Reactive Remediation), because remediations executed against a drifted configuration produce unpredictable results. Class 2 (Proactive Scaling) requires the observability infrastructure that feeds Class 4 (Evidence Synthesis). The stack below is therefore numbered by build order rather than by class number. Build from the bottom up.

────────────────────────────────────────────────────────────────────────────
LEVEL 5 — PREDICTIVE AUTOMATION
  AI-assisted anomaly prediction (HolmesGPT correlation)
  Capacity forecast with auto-provisioning triggers
  Automated SLO target recalibration from usage patterns
  Requires: Levels 1–4 fully operational

LEVEL 4 — EVIDENCE SYNTHESIS
  Automated postmortem generation
  Continuous compliance evidence pipeline
  Automated DORA + five-metric quarterly report
  Requires: incident data (L1), metric data (L2), change audit data (L3)

LEVEL 3 — GATE ENFORCEMENT
  Error budget PreSync gates (Argo CD)
  Canary analysis with automatic rollback (Argo Rollouts)
  Admission controller policies (Kyverno)
  Requires: SLI data for gates (L2), observability stack (L1)

LEVEL 2 — PROACTIVE SCALING
  Request-rate-based HPA
  KEDA multi-dimensional autoscaling
  Off-hours scale-to-zero (non-production)
  Requires: metric instrumentation for scaling signals (L1)

LEVEL 1 — OBSERVABILITY AND DRIFT CORRECTION FOUNDATION
  Four Golden Signals instrumented (Envoy proxy + application)
  Argo CD self-heal + prune enabled
  Kyverno baseline policies deployed
  Splunk HEC ingesting structured events
  AlertManager routing with structured payloads

  *** This layer is the prerequisite for all automation above it. ***
  *** Without it, higher-class automation executes against          ***
  *** unreliable signal and produces unreliable outcomes.           ***
────────────────────────────────────────────────────────────────────────────
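The "Requires" lines in the stack form a small dependency graph, which makes the build order checkable. The sketch below encodes those prerequisites as a mapping and answers two questions: which level can be built next, and which running levels lack their foundation. The names `REQUIRES`, `next_buildable`, and `unsupported` are hypothetical, and treating each level as simply "operational or not" is a simplifying assumption.

```python
from __future__ import annotations

# Prerequisites per level, transcribed from the "Requires" lines in the stack.
REQUIRES: dict[int, set[int]] = {
    1: set(),          # observability and drift correction foundation
    2: {1},            # proactive scaling
    3: {1, 2},         # gate enforcement
    4: {1, 2, 3},      # evidence synthesis
    5: {1, 2, 3, 4},   # predictive automation
}


def next_buildable(operational: set[int]) -> list[int]:
    """Levels not yet operational whose prerequisites are all in place."""
    return sorted(
        level for level, deps in REQUIRES.items()
        if level not in operational and deps <= operational
    )


def unsupported(operational: set[int]) -> list[int]:
    """Operational levels that are missing at least one prerequisite."""
    return sorted(
        level for level in operational
        if not REQUIRES[level] <= operational
    )
```

A team running gate enforcement on top of Level 1 without Level 2 would see Level 3 reported by `unsupported({1, 3})`, which is the situation the warning in the stack describes: automation executing against unreliable signal.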

Common Antipatterns

  • The Automation-as-Suppression antipattern → Building reactive remediation that restores the surface symptom without instrumenting root cause. An OOM restart automation running forty times per month has not eliminated toil; it has automated a symptom while the memory leak continues accumulating. Every automated remediation must emit a structured Splunk event that makes the recurrence pattern visible. The automation contains the cost; the telemetry drives the fix.

  • The Single-Instance Automation antipattern → Tightly coupling automation to a single service rather than parameterising it against the class of problem. The OOM restart automation should be configurable for any deployment in any namespace via manifest change, not code change. Automation that cannot be generalised produces a proliferation of point solutions with compounding maintenance toil.

  • The Untested Automation antipattern → Deploying remediation automation to production without testing against simulated failure conditions. Untested automation creates a second failure mode layered on top of the original one. Reactive remediations should be exercised with chaos tooling against non-production environments on a regular schedule — not only at initial deployment.

  • The Missing Blast-Radius Assessment antipattern → Building full automation for high-blast-radius actions without a human confirmation step or automatic rollback gate. An error budget PreSync hook that blocks a deployment has a relatively low blast radius. An automation that scales down a production database because a metric threshold was breached has a high one. The execution model must be calibrated to the consequence of incorrect execution, not just the efficiency of correct execution.

  • The Wrong Execution Model antipattern → Using schedule-driven execution for state that must be continuously true. A CronJob checking policy compliance once per hour is not a drift correction mechanism; it is a periodic audit with a one-hour detection gap. A Kyverno admission controller enforcing the same policy at every resource creation is a drift correction mechanism. Compliance state that matters continuously must be enforced continuously.
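The first two antipatterns point at the same design: a remediation that is parameterised by namespace and workload, and that reports every run as a structured event with recurrence made explicit. The following is a minimal sketch under stated assumptions. `RemediationLedger`, the event field names, and the sample indicator values are all hypothetical, the counter lives in memory purely for illustration, and shipping the event to Splunk is left out.

```python
import json
from collections import Counter
from datetime import datetime, timezone


class RemediationLedger:
    """Counts remediations per (action, namespace, workload).

    In-memory for illustration only; a real system would derive the count
    from the event store so that it survives restarts.
    """

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, action: str, namespace: str, workload: str,
               trigger: str, indicators: dict) -> dict:
        """Build one structured event for a completed remediation."""
        key = (action, namespace, workload)
        self._counts[key] += 1
        occurrence = self._counts[key]
        return {
            "event_type": "automated_remediation",  # hypothetical field names
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "namespace": namespace,
            "workload": workload,
            "trigger": trigger,
            "leading_indicators": indicators,
            "occurrence": occurrence,
            "is_recurrence": occurrence > 1,
        }


# The same code path serves any workload; only the arguments change.
ledger = RemediationLedger()
first = ledger.record("restart", "payments", "api", "OOMKilled",
                      {"memory_working_set_pct": 98})
second = ledger.record("restart", "payments", "api", "OOMKilled",
                       {"memory_working_set_pct": 99})
print(json.dumps(second, indent=2))
```

Because `occurrence` and `is_recurrence` travel with every event, a query over these events can show that a restart has fired repeatedly for the same workload, which is the signal that the underlying leak still needs a fix.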


Maturity Progression

────────────────────────────────────────────────────────────────────────────
STAGE        AUTOMATION STATE                    NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     Toil invisible and unclassified.    All remediation is
             No taxonomy. Automation =           manual and ad hoc.
             bash scripts in runbooks.           Toil Ratio unknown.

Defined      Toil categorised by class.          Level 1 foundation
             ROI framework applied to            deployed. First Class 1
             backlog. Taxonomy adopted.          or Class 2 automation live.

Measured     Classes 1–3 deployed.               Toil Ratio measured
             Automation coverage tracked         and below 40%.
             as % of toil categories             Automation measurably
             with coverage.                      reduces MTTR.

Optimised    Classes 1–4 deployed.               Toil Ratio ≤ 25%.
             Evidence synthesis eliminates       Postmortems generated
             governance toil. Gate               automatically. DORA
             enforcement eliminates manual       metrics automated.
             CAB deliberation.                   Compliance evidence
                                                 pipeline live.

Generative   Class 5 (predictive) active.        HolmesGPT correlation
             Automation patterns shared as       surfaces unknown unknowns
             platform primitives across teams.   ahead of incidents.
             Taxonomy published and cited.       Engineering time is
                                                 almost entirely
                                                 compounding work.
────────────────────────────────────────────────────────────────────────────
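The table can double as a self-assessment if its thresholds are read as cumulative gates. The sketch below takes that reading, which is an assumption: each stage must satisfy every stage beneath it, the Toil Ratio is passed as a fraction (0.25 for 25%), and `None` stands for "not yet measured". The function name `maturity_stage` and its parameters are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def maturity_stage(classes_deployed: set[int],
                   toil_ratio: Optional[float],
                   foundation_deployed: bool) -> str:
    """Return the highest stage whose conditions, and all earlier ones, hold."""
    # Defined: Level 1 foundation plus a first Class 1 or Class 2 automation.
    if not (foundation_deployed and classes_deployed & {1, 2}):
        return "Reactive"
    # Measured: Classes 1-3 deployed, Toil Ratio measured and below 40%.
    if (toil_ratio is None or toil_ratio >= 0.40
            or not {1, 2, 3} <= classes_deployed):
        return "Defined"
    # Optimised: Classes 1-4 deployed, Toil Ratio at or below 25%.
    if toil_ratio > 0.25 or not {1, 2, 3, 4} <= classes_deployed:
        return "Measured"
    # Generative: Class 5 active on top of everything else.
    if 5 not in classes_deployed:
        return "Optimised"
    return "Generative"
```

The qualitative signals in the table, such as automatically generated postmortems or patterns shared as platform primitives, are not captured here and still need a human reading.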

Five Action Items for This Week

  1. Run the recurring-incident Splunk query and classify each output item by automation class. Sort the output by toil score (occurrences per month × average resolution time), then assign each of the top ten items to one of the five classes. Items clustering in the same class are candidates for a shared framework rather than individual point solutions. The classification exercise transforms a task list into an engineering programme.

  2. Audit your existing automation against the execution model taxonomy. For every CronJob, controller, webhook handler, and script in your SRE tooling repo, identify which execution model it uses and whether it is the correct model for the problem it solves. Schedule-driven automation covering for a missing continuous-reconciliation mechanism is a common finding — and a reliability risk, because it leaves a detection gap between execution intervals.

  3. Apply the ROI framework to your top three toil items before writing any code. Score each against frequency × duration, consistency, blast radius, and generalisability. The scoring often reveals that the highest-effort request is not the highest-ROI investment — and that a lower-effort generalised framework would address multiple items simultaneously.

  4. Verify that every existing reactive remediation emits a structured root cause telemetry event. Does each automation emit a Splunk event with fields that distinguish first occurrence from recurrence and capture the leading indicators of the triggering condition? Any automation that restores state without emitting this data is suppressing toil visibility rather than eliminating toil.

  5. Deploy one Kyverno policy that enforces a standard you are currently auditing manually. Pick the compliance or governance standard generating the most recurring audit toil — resource limits, image registry provenance, logging annotations. Implement it as a ClusterPolicy with validationFailureAction: Enforce. Enforcement moves from scheduled detection to continuous prevention, and the policy itself becomes the compliance evidence the manual audit was previously generating.
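The classification exercise in item 1 can be sketched in a few lines once the query output is in hand. The example below is illustrative only: it assumes each row has already been labelled with an automation class, the tuple layout and function names are hypothetical, and the sample numbers are invented for the demonstration.

```python
from collections import defaultdict


def cluster_by_class(items, top_n=10):
    """Rank items by toil score and group the top N by automation class.

    Each item is (name, automation_class, occurrences, avg_minutes).
    """
    ranked = sorted(items, key=lambda i: i[2] * i[3], reverse=True)[:top_n]
    clusters = defaultdict(list)
    for name, automation_class, occurrences, minutes in ranked:
        clusters[automation_class].append((name, occurrences * minutes))
    return dict(clusters)


def shared_framework_candidates(clusters):
    """Classes holding more than one item: build a framework, not point fixes."""
    return {cls: members for cls, members in clusters.items()
            if len(members) > 1}


# Invented sample rows standing in for the query output.
backlog = [
    ("oom-restart", "reactive-remediation", 40, 5),
    ("disk-pressure-cleanup", "reactive-remediation", 6, 15),
    ("mttr-report", "evidence-synthesis", 4, 45),
]
clusters = cluster_by_class(backlog)
candidates = shared_framework_candidates(clusters)
```

With these sample rows, the two reactive remediation items cluster together and become a shared-framework candidate, while the single evidence synthesis item remains a standalone entry until more items of its class appear.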


"The goal of automation in SRE is not to make humans faster at operational work. It is to make humans unnecessary for operational work that follows a known pattern — so that human attention is reserved for the work that does not yet have a pattern. A team that has automated all its known toil categories is not idle; it is free to discover the toil categories that do not yet have names."

