Nijo George Payyappilly

Posted on Jun 15

The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations

#sre #devops #ai #kubernetes

On June 8, 2021, a Fastly CDN configuration change triggered a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. The time required for a human operator to identify the cause and initiate the rollback was approximately forty-nine minutes longer than that.

The Fastly incident is not primarily a story about automation failure. It is a story about the speed asymmetry between automated propagation and human response — and about what happens when the automation layer between a human decision and its production consequence moves faster than the accountability layer designed to govern it.

This asymmetry is the defining operational challenge of AI-assisted SRE. The capability to automate incident detection, root cause hypothesis generation, and even remediation is now accessible at costs and latencies that were unavailable five years ago. The operational risk is not that this capability will be under-used. The risk is that it will be deployed without a rigorous escalation policy — a formal framework that defines exactly where automated execution ends and human judgement begins, under what conditions the boundary shifts, and how accountability is preserved for every action the AI takes on behalf of an operator who may not have been in the room when it was taken.

The Human-in-the-Loop Spectrum

AI-assisted SRE operations do not exist at a single point on the autonomy spectrum. They exist across a range, and the appropriate position on that range is a function of confidence, blast radius, novelty, and regulatory context — not of how sophisticated the AI system is.

THE AUTOMATION AUTONOMY SPECTRUM
────────────────────────────────────────────────────────────────────────────

LEVEL 0 — MANUAL
  AI generates no recommendations. Human observes raw telemetry and decides.
  Appropriate when: AI system is unavailable, untrusted, or context is
  outside AI training distribution entirely.

LEVEL 1 — ASSISTED
  AI surfaces relevant context, correlated signals, and historical patterns.
  Human makes all decisions. AI does not recommend actions.
  Appropriate when: novel failure pattern; first occurrence of incident type;
  regulated change requiring documented human judgement.

LEVEL 2 — SUPERVISED
  AI recommends specific actions with confidence scores. Human approves
  each action before execution. AI does not execute autonomously.
  Appropriate when: high blast radius; unfamiliar but not novel pattern;
  action is reversible but consequential.

LEVEL 3 — CONDITIONAL AUTONOMOUS
  AI executes actions autonomously within pre-approved policy boundaries.
  Human is notified after execution. Human can abort within a defined window.
  Appropriate when: well-characterised failure pattern; low blast radius;
  action is fully reversible; pattern seen > N times with consistent outcome.

LEVEL 4 — AUTONOMOUS
  AI executes and verifies remediation without human notification unless
  verification fails. Audit trail maintained.
  Appropriate when: toil pattern fully characterised; action is idempotent;
  blast radius is bounded to a single service; recurrence rate justifies
  zero-latency response.

────────────────────────────────────────────────────────────────────────────
CRITICAL CONSTRAINT: No action may exist permanently at Level 4.
Every Level 4 automation must have a scheduled re-qualification review
that reassesses whether the failure pattern is still well-characterised
and the blast radius assumption still holds.
────────────────────────────────────────────────────────────────────────────

The critical constraint — that no action may exist permanently at Level 4 — is not conservatism. It is the engineering response to a specific failure mode: automation that was correctly calibrated at deployment time and has silently drifted out of calibration as the system evolved. An OOM restart automation that was safe when first deployed becomes unsafe the moment the underlying cause shifts from a memory leak to a data corruption event that is triggering the same symptom. The re-qualification review is the mechanism that catches this drift before it produces an incident.
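
One way to make the re-qualification constraint mechanical is to attach a qualification date to each automation and refuse Level 4 execution once it lapses. The following is a minimal Python sketch, not part of HolmesGPT: the `AutonomyLevel` enum, the `Automation` record, the 90-day interval (borrowed from the policy document later in this post), and the choice to demote to Level 2 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class AutonomyLevel(IntEnum):
    MANUAL = 0
    ASSISTED = 1
    SUPERVISED = 2
    CONDITIONAL_AUTONOMOUS = 3
    AUTONOMOUS = 4


@dataclass(frozen=True)
class Automation:
    name: str                  # hypothetical identifier for the automation
    level: AutonomyLevel       # level granted at the last review
    last_qualified: date       # date of the last re-qualification review


# Assumed cadence; set this to whatever your policy document specifies.
REQUALIFICATION_INTERVAL = timedelta(days=90)


def effective_level(automation: Automation, today: date) -> AutonomyLevel:
    """Demote a lapsed Level 4 automation to SUPERVISED until it is re-qualified."""
    overdue = today - automation.last_qualified > REQUALIFICATION_INTERVAL
    if automation.level == AutonomyLevel.AUTONOMOUS and overdue:
        return AutonomyLevel.SUPERVISED
    return automation.level
```

The point of the design is that the lapse fails closed: an automation nobody reviewed keeps working, but only with a human approving each action.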

The Four Escalation Triggers

Every escalation policy is built from four primitive triggers. Each trigger defines a condition under which the automation must shift toward more human involvement, never less. On the spectrum above, that means moving to a lower level number.

Trigger 1 — Confidence Threshold Breach

The AI system's confidence in its diagnosis or recommended action has fallen below a defined threshold. In the context of LLM-based operations (HolmesGPT, LiteLLM Proxy routing), confidence is expressed as a combination of model-reported token probability distributions and domain-specific heuristics applied to the recommendation output.

A low-confidence diagnosis means the AI has identified a plausible pattern match but lacks sufficient corroborating signal to recommend action without human review. Executing actions based on low-confidence diagnoses is the operational equivalent of acting on a single data point in a monitoring dashboard: occasionally correct, reliably dangerous as a policy.

Trigger 2 — Blast Radius Threshold

The proposed action affects more infrastructure than the policy authorises for autonomous execution. Blast radius is assessed across three dimensions: service count (how many services are affected), traffic fraction (what percentage of user requests are served by the affected infrastructure), and reversibility (can the action be undone in under five minutes with a single command).

High blast radius is not a disqualifying condition for automation. It is a condition that requires the automation level to drop to Level 2 (supervised) or lower, regardless of confidence score.
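
The three dimensions can be checked independently, with any single breach forcing human approval. A rough Python sketch follows; the `BlastRadius` type and the default limits are illustrative assumptions (the five-minute rollback bound comes from the text above, the other two defaults are placeholders to be replaced by your own policy values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    service_count: int        # how many services the action touches
    traffic_fraction: float   # share of user requests served, 0.0 to 1.0
    rollback_seconds: int     # estimated time to undo with a single command


def exceeds_autonomous_limit(
    radius: BlastRadius,
    max_services: int = 1,                # assumed limit
    max_traffic_fraction: float = 0.20,   # assumed limit
    max_rollback_seconds: int = 300,      # "under five minutes"
) -> bool:
    """Return True when any one dimension is over its limit."""
    return (
        radius.service_count > max_services
        or radius.traffic_fraction > max_traffic_fraction
        or radius.rollback_seconds > max_rollback_seconds
    )
```

Using `or` rather than a weighted score is deliberate: a small, slow-to-reverse change should not be able to "average out" against a low service count.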

Trigger 3 — Novelty Detection

The failure pattern does not match any pattern in the AI system's training corpus or historical incident database. Novelty is the most dangerous condition for autonomous execution because it is precisely the condition where the AI's pattern-matching capability provides the least value — and where a confident-sounding but incorrect recommendation carries the highest operational cost.

Novelty detection is the hardest trigger to implement well, because it requires the AI system to accurately assess the boundaries of its own knowledge. A system that cannot reliably distinguish "I have seen this pattern and am confident" from "I have seen a superficially similar pattern and am extrapolating" should not be operating at Level 3 or Level 4.
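
A common first approximation is to compare a feature vector for the current incident against vectors for past incidents and require a minimum number of close matches. The Python sketch below assumes such vectors already exist (how they are produced is out of scope here), and its defaults of 0.80 similarity and 10 matches are illustrative values to tune against your own incident history.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_novel(
    incident: Sequence[float],
    history: Sequence[Sequence[float]],
    similarity_threshold: float = 0.80,   # assumed default
    min_matches: int = 10,                # assumed default
) -> bool:
    """Treat the incident as novel unless enough past incidents are close to it."""
    matches = sum(
        1 for past in history
        if cosine_similarity(incident, past) >= similarity_threshold
    )
    return matches < min_matches
```

This only measures surface similarity. It does nothing about the harder case described above, where a superficially similar pattern has a different cause, so it is best used as a floor for escalation and not as proof of familiarity.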

Trigger 4 — Regulatory Boundary

The proposed action would touch a regulated asset, require a documented change record, affect a system subject to NERC CIP, PCI-DSS, HIPAA, or equivalent obligations, or generate a compliance event. In regulated environments, no automated action may bypass the change management governance framework, regardless of confidence score or blast radius.

This trigger is absolute. It does not have a confidence threshold exception. An AI system that correctly diagnoses a production issue with 99% confidence and proposes a remediation that would constitute an undocumented change to a regulated asset must hand execution to a human and generate a change record, even if the remediation would restore service faster without it. The policy document below pins regulated assets to Level 1 for exactly this reason.

Designing the Escalation Policy Document

The escalation policy is an operational governance document, not a configuration file. It must be version-controlled, reviewed and approved by SRE leadership and compliance, and referenced in every AI-assisted automation's runtime configuration. Its authority derives from human review, not from the AI system that consults it.

ESCALATION POLICY: AI-ASSISTED INCIDENT RESPONSE
────────────────────────────────────────────────────────────────────────────
Service:       production-platform (all services)
AI System:     HolmesGPT + LiteLLM Proxy + Ollama / GitHub Models
Policy Version: v1.3  |  Approved: SRE Lead + VP Engineering
Last Reviewed: 2025-Q1  |  Next Review: 2025-Q2
────────────────────────────────────────────────────────────────────────────

SECTION 1: AUTONOMOUS EXECUTION AUTHORISED (Level 4)
  Conditions required (ALL must be true):
    ✓ Confidence score ≥ 0.85 (model-reported + heuristic composite)
    ✓ Pattern seen ≥ 10 times in incident history with consistent outcome
    ✓ Blast radius: single service, single namespace, ≤ 20% of replicas
    ✓ Action is idempotent and fully reversible in ≤ 5 minutes
    ✓ No regulated asset in scope
    ✓ Error budget > 75% remaining (not in Tier 2 degraded or Tier 3 freeze)
  Authorised actions at Level 4:
    → Rolling restart of single stateless deployment (OOM, deadlock)
    → Scale-up of single HPA-managed deployment by ≤ 2 replicas
    → Certificate rotation on non-production workloads
    → Log pipeline gateway restart (telemetry outage, no production impact)
  Required logging: structured Splunk event per action (mandatory)
  Re-qualification: every 90 days or after any incident where autonomous
                   action was taken and outcome was suboptimal

SECTION 2: SUPERVISED EXECUTION (Level 2 — Human Approval Required)
  Conditions triggering Level 2 (ANY is sufficient):
    ⚠ Confidence score 0.60–0.84
    ⚠ Blast radius: > 20% of replicas OR > 1 service OR cross-namespace
    ⚠ First or second occurrence of this failure pattern
    ⚠ Error budget between 25–75% (Tier 2 degraded)
    ⚠ Action affects shared infrastructure (Argo CD, Prometheus, Istio)
  Approval mechanism: Slack approval button with 10-minute timeout
  Timeout behaviour: escalate to on-call if no response in 10 minutes
  Required logging: recommendation + approval/rejection + outcome

SECTION 3: ASSISTED ONLY (Level 1 — No Action Authorised)
  Conditions triggering Level 1 (ANY is sufficient):
    ✗ Confidence score < 0.60
    ✗ Novel failure pattern (no match in incident history)
    ✗ Regulated asset in scope (NERC CIP, PCI-DSS, HIPAA boundary)
    ✗ Error budget < 25% (Tier 3 freeze — deployment freeze active)
    ✗ Active P0 incident in progress (human incident commander owns scope)
    ✗ Multiple simultaneous incidents (blast radius assessment unreliable)
  AI role at Level 1: surface correlated signals, historical context only
  Human owns: diagnosis, action decision, execution, verification

SECTION 4: ACCOUNTABILITY CHAIN
  Every AI-assisted action must trace to one of:
    a) Direct human approval (Level 2 Slack approval button)
    b) This policy document (Level 4 autonomous execution)
  "The AI decided" is not a complete accountability chain.
  Policy document owner: SRE Lead
  Policy review and approval authority: SRE Lead + VP Engineering
────────────────────────────────────────────────────────────────────────────
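
The three sections of the policy can be read as an ordered decision: check the most restrictive section first and fall through to the least restrictive. The Python sketch below encodes that reading. The `Assessment` fields are hypothetical names for inputs the policy mentions, the thresholds are copied from the document above, and the final fallback to Level 2 for anything the policy does not explicitly authorise is an assumption (the sample policy does not define Level 3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    confidence: float
    occurrences: int               # prior incidents matching this pattern
    replica_fraction: float
    service_count: int
    cross_namespace: bool
    reversible_idempotent: bool
    regulated_asset: bool
    error_budget_remaining: float  # 0.0 to 1.0
    active_p0: bool
    concurrent_incidents: int
    shared_infrastructure: bool


def autonomy_level(a: Assessment) -> int:
    # Section 3: ANY condition forces Level 1 (assisted only).
    if (a.confidence < 0.60 or a.occurrences == 0 or a.regulated_asset
            or a.error_budget_remaining < 0.25 or a.active_p0
            or a.concurrent_incidents > 1):
        return 1
    # Section 2: ANY condition forces Level 2 (human approval).
    if (a.confidence < 0.85 or a.replica_fraction > 0.20
            or a.service_count > 1 or a.cross_namespace
            or a.occurrences <= 2 or a.error_budget_remaining <= 0.75
            or a.shared_infrastructure):
        return 2
    # Section 1: ALL remaining conditions must hold for Level 4.
    if a.occurrences >= 10 and a.reversible_idempotent:
        return 4
    # Assumed default: anything not explicitly authorised stays supervised.
    return 2
```

Evaluating restrictive sections first means an error budget in the 25 to 75% band always resolves to Level 2, whatever the confidence score.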

HolmesGPT Escalation Architecture

The escalation policy document defines the governance rules. The escalation architecture implements those rules as runtime logic in the AI-assisted operations stack. The architecture shown here is specific to the HolmesGPT + LiteLLM Proxy + Ollama deployment pattern in a regulated on-premises environment.

# HolmesGPT Escalation Policy ConfigMap
# Consumed by HolmesGPT at runtime to determine autonomy level per action
# Version-controlled in git; updated only via Argo CD sync (change record enforced)

apiVersion: v1
kind: ConfigMap
metadata:
  name: holmesgpt-escalation-policy
  namespace: holmesgpt
  annotations:
    sre.internal/policy-version: "v1.3"
    sre.internal/approved-by: "sre-lead,vp-engineering"
    sre.internal/approved-date: "2025-03-15"
    sre.internal/next-review: "2025-06-15"
    sre.internal/review-enforced-by: "kyverno-policy/ai-ops-policy-review"
data:
  escalation_policy.yaml: |
    confidence_thresholds:
      autonomous:   0.85
      supervised:   0.60
      assisted_only: 0.0

    blast_radius_limits:
      autonomous:
        max_replica_fraction: 0.20
        max_service_count: 1
        max_namespace_count: 1
        cross_namespace_allowed: false
        regulated_assets_allowed: false

    autonomous_actions_allowlist:
      - action: rolling_restart_stateless
        max_replicas_affected: 5
        requires_pdb_check: true
      - action: hpa_scale_up
        max_replica_delta: 2
        requires_current_below_sot: true
      - action: log_pipeline_restart
        namespaces: [monitoring, sre-platform]
        production_namespaces_blocked: true

    error_budget_gates:
      tier_3_freeze_blocks_autonomous: true
      tier_2_degrades_to_supervised: true

    regulatory_boundary:
      always_level_1_namespaces:
        - pci-zone
        - hipaa-zone
        - nerc-cip-zone
      always_level_1_labels:
        - "compliance.internal/regulated=true"

    novelty_detection:
      min_historical_occurrences_for_autonomous: 10
      similarity_threshold: 0.80
      unknown_pattern_forces_level_1: true

    approval_workflow:
      slack_channel: "sre-aiops-approvals"
      timeout_minutes: 10
      timeout_action: escalate_to_oncall

    audit:
      splunk_sourcetype: "sre:holmesgpt:decisions"
      log_all_recommendations: true
      log_operator_overrides: true
      override_feeds_prompt_review: true

Model Routing for Escalation Quality

The LiteLLM Proxy's model routing configuration is a first-class component of the escalation architecture. Routing to the right model at the right confidence tier is not a performance optimisation — it is a safety mechanism.

# LiteLLM Proxy — Model Routing for Escalation Tiers
# Smaller local models for low blast radius / routine patterns
# Larger models with greater context window for high blast radius / novel patterns
# On-premises models for regulated asset investigations (data sovereignty)

model_list:
  # Tier 1: Routine investigation — local Ollama model
  # Low latency, no data egress, adequate for well-characterised patterns
  - model_name: holmesgpt-routine
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama.ai-ops.svc.cluster.local:11434
      timeout: 30
      max_tokens: 2048

  # Tier 2: Complex investigation — larger local model
  # Higher accuracy for multi-service correlation and novel patterns
  - model_name: holmesgpt-complex
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://ollama.ai-ops.svc.cluster.local:11434
      timeout: 90
      max_tokens: 8192

  # Tier 3: High-stakes / novel pattern — GitHub Models
  # Largest context window for multi-service incident correlation
  # Data classification check required before routing: no PII, no regulated data
  - model_name: holmesgpt-highstakes
    litellm_params:
      model: github/gpt-4o
      api_base: https://models.inference.ai.azure.com
      api_key: "os.environ/GITHUB_MODELS_PAT"
      timeout: 120
      max_tokens: 16384

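router_settings:
  # NOTE: "routing_strategy: custom" with inline routing_logic is pseudocode.
  # LiteLLM does not, per its documented router options, evaluate logic from
  # YAML; implement this decision in the caller (choose model_name per request)
  # or in a custom routing strategy written in Python. The fallback_* keys
  # below are schematic too; LiteLLM expresses fallbacks as a `fallbacks` list.
  routing_strategy: custom
  routing_logic: |
    # Route by blast_radius_tier header set by HolmesGPT pre-routing assessment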
    if blast_radius_tier == "low" and pattern_novelty == "known":
        return "holmesgpt-routine"
    elif blast_radius_tier == "high" or pattern_novelty == "novel":
        # Data classification gate before external model routing
        if data_contains_regulated_fields:
            return "holmesgpt-complex"  # Stay on-premises
        return "holmesgpt-highstakes"
    else:
        return "holmesgpt-complex"

  fallback_model: holmesgpt-complex    # Always fall back to on-premises
  fallback_on_status_codes: [429, 500, 503]
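
The same routing decision can live in the caller, which picks a `model_name` before the request reaches the proxy. This is a minimal Python sketch of that approach; it assumes the three inputs are computed upstream by hypothetical pre-routing checks, and it reuses the model names from the configuration above.

```python
ON_PREM_MODELS = {"holmesgpt-routine", "holmesgpt-complex"}


def select_model(
    blast_radius_tier: str,        # "low", "medium" or "high" (assumed values)
    pattern_novelty: str,          # "known" or "novel" (assumed values)
    contains_regulated_data: bool,
) -> str:
    """Pick a model alias by blast radius and novelty, keeping regulated data local."""
    if blast_radius_tier == "low" and pattern_novelty == "known":
        choice = "holmesgpt-routine"
    elif blast_radius_tier == "high" or pattern_novelty == "novel":
        choice = "holmesgpt-highstakes"
    else:
        choice = "holmesgpt-complex"
    # Data-sovereignty gate applied last so that no branch can bypass it.
    if contains_regulated_data and choice not in ON_PREM_MODELS:
        choice = "holmesgpt-complex"
    return choice
```

Applying the data classification gate as the final step, instead of inside one branch, keeps the sovereignty rule intact when new tiers are added later.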

The Recommendation Quality Feedback Loop

The operational risk of AI-assisted recommendations is not static. It evolves as the system changes and as the model's training distribution diverges from the current operational reality. An AI recommendation quality feedback loop is the mechanism that makes this drift visible before it produces a damaging autonomous action.

# Prometheus Recording and Alerting Rules — AI Recommendation Quality Tracking
# Measures whether HolmesGPT recommendations are operationally valuable
# High override rate or low action rate = recommendation quality degrading

groups:
  - name: holmesgpt.recommendation_quality
    rules:

      # Recommendation acceptance rate: fraction of recommendations
      # that operators acted on (approved or executed autonomously)
      # versus rejected or ignored
      - record: holmesgpt:recommendation_acceptance_rate:rate7d
        expr: |
          sum(rate(holmesgpt_recommendations_acted_on_total[7d]))
          /
          sum(rate(holmesgpt_recommendations_total[7d]))

      # Operator override rate: fraction of autonomous actions that
      # were manually reversed by an operator after execution
      # High rate = autonomous confidence thresholds are too permissive
      - record: holmesgpt:autonomous_override_rate:rate7d
        expr: |
          sum(rate(holmesgpt_autonomous_actions_reversed_total[7d]))
          /
          sum(rate(holmesgpt_autonomous_actions_total[7d]))

      # False positive rate: recommendations made but outcome was
      # NOT the recommended action resolving the incident
      - record: holmesgpt:false_positive_rate:rate7d
        expr: |
          sum(rate(holmesgpt_recommendations_outcome_mismatch_total[7d]))
          /
          sum(rate(holmesgpt_recommendations_acted_on_total[7d]))

      # Alert: recommendation quality degrading
      - alert: HolmesGPT_RecommendationQualityDegrading
        expr: |
          holmesgpt:autonomous_override_rate:rate7d > 0.15
          OR
          holmesgpt:false_positive_rate:rate7d > 0.20
        for: 1d
        labels:
          severity: ticket
          domain: ai_ops_quality
        annotations:
          summary: >
            HolmesGPT recommendation quality below threshold.
            Override rate: {{ with query "holmesgpt:autonomous_override_rate:rate7d" }}
            {{ . | first | value | humanizePercentage }}{{ end }}.
            Action: review recent overrides, update prompt context,
            consider reducing autonomous confidence threshold.
          runbook: "https://wiki.internal/sre/runbooks/holmesgpt-quality-review"

      # Alert: recommendation volume causing alert fatigue risk
      # More than 3 recommendations per incident = cognitive overload signal
      - alert: HolmesGPT_RecommendationVolumeHigh
        expr: |
          sum(rate(holmesgpt_recommendations_total[1h]))
          /
          sum(rate(incidents_opened_total[1h])) > 3
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: >
            HolmesGPT generating > 3 recommendations per incident on average.
            Risk: alert fatigue causing operators to ignore recommendations.
            Action: tighten confidence floor or reduce recommendation scope.
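
To sanity-check the thresholds before wiring up alerts, the same three ratios can be computed offline from counter totals. The Python sketch below mirrors the recording rules; the `QualityCounters` type is a hypothetical container for the five counter values over one window, and the 0.15 and 0.20 thresholds are taken from the alert above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityCounters:
    recommendations: int       # holmesgpt_recommendations_total
    acted_on: int              # holmesgpt_recommendations_acted_on_total
    autonomous_actions: int    # holmesgpt_autonomous_actions_total
    autonomous_reversed: int   # holmesgpt_autonomous_actions_reversed_total
    outcome_mismatch: int      # holmesgpt_recommendations_outcome_mismatch_total


def ratio(numerator: int, denominator: int) -> float:
    """Safe division; returns 0.0 when there is no activity in the window."""
    return numerator / denominator if denominator else 0.0


def quality_report(c: QualityCounters) -> dict:
    override = ratio(c.autonomous_reversed, c.autonomous_actions)
    false_positive = ratio(c.outcome_mismatch, c.acted_on)
    return {
        "acceptance_rate": ratio(c.acted_on, c.recommendations),
        "override_rate": override,
        "false_positive_rate": false_positive,
        # Same condition as the HolmesGPT_RecommendationQualityDegrading alert.
        "degrading": override > 0.15 or false_positive > 0.20,
    }
```

One difference from the PromQL version: a zero denominator yields 0.0 here, whereas in Prometheus the division produces NaN or no sample, so an idle window needs separate handling there.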

The Accountability Chain Principle and NIST AI RMF Alignment

The accountability chain principle — that every AI-assisted action must trace back to a human decision, either a direct approval or a policy that a human wrote and approved — is the operational implementation of the NIST AI Risk Management Framework's GOVERN function.

The NIST AI RMF establishes four core functions for AI risk management: GOVERN (policies, accountability), MAP (risk identification), MEASURE (risk quantification), and MANAGE (risk response). Each function maps directly to components of the escalation policy architecture.

NIST AI RMF MAPPING: AI-ASSISTED SRE OPERATIONS
────────────────────────────────────────────────────────────────────────────

GOVERN — Accountability and Policy
  Who owns the AI system's outputs?
    → SRE Lead owns escalation policy; VP Engineering co-approves
  Who approves autonomous action boundaries?
    → Policy document with named approvers and review cadence
  How are accountability chains maintained?
    → Splunk audit trail: every recommendation, decision, and outcome
  SRE implementation: escalation policy document + approval workflow

MAP — Risk Identification
  What failure modes does the AI system face?
    → Confidence decay: model accuracy degrades as system evolves
    → Distribution shift: production patterns diverge from training data
    → Novel pattern extrapolation: confident recommendation on unfamiliar input
    → Blast radius miscalculation: action scope larger than assessed
  SRE implementation: four escalation triggers + novelty detection

MEASURE — Risk Quantification
  How do you measure AI recommendation quality over time?
    → Acceptance rate: fraction of recommendations acted on
    → Override rate: fraction of autonomous actions manually reversed
    → False positive rate: recommendations where predicted outcome was wrong
    → Confidence calibration: does 85% confidence actually mean 85% accuracy?
  SRE implementation: Prometheus quality recording rules + 7-day rolling metrics

MANAGE — Risk Response
  What happens when AI recommendation quality degrades?
    → Automatic downgrade of autonomous confidence threshold
    → Prompt context refresh from recent incident postmortems
    → Temporary suspension of Level 4 autonomy pending review
  SRE implementation: quality alert → runbook → policy review cadence
────────────────────────────────────────────────────────────────────────────

Splunk Audit Trail: The Irreplaceable Governance Layer

In regulated environments, the audit trail for AI-assisted actions is not optional. It is the documentary evidence that demonstrates human accountability over automated decisions — the record that answers the auditor's question: "Who authorised this change to your production system?"

# Splunk HEC Forwarder — HolmesGPT Decision Audit Trail
# Every recommendation, escalation decision, and outcome → Splunk
# This record is the accountability chain in documentary form

# Splunk event structure (sourcetype: sre:holmesgpt:decisions):
# {
#   "timestamp": "2025-04-15T14:23:07Z",
#   "incident_id": "INC-20250415-0047",
#   "alert_name": "KubePodOOMKilled",
#   "service": "payments-api",
#   "namespace": "production",
#
#   "investigation": {
#     "model_used": "holmesgpt-routine",
#     "model_backend": "ollama/llama3.1:8b",
#     "confidence_score": 0.91,
#     "diagnosis": "Memory limit (2Gi) exceeded by 847MB under high load...",
#     "recommended_action": "rolling_restart_stateless",
#     "blast_radius_assessment": {
#       "services_affected": 1,
#       "replica_fraction": 0.15,
#       "reversible": true,
#       "regulated_asset": false
#     }
#   },
#
#   "escalation_decision": {
#     "autonomy_level": 4,
#     "policy_version": "v1.3",
#     "triggers_evaluated": ["confidence", "blast_radius", "novelty", "regulatory"],
#     "triggers_fired": [],
#     "decision": "AUTONOMOUS_EXECUTE",
#     "policy_authority": "holmesgpt-escalation-policy v1.3 (approved: sre-lead)"
#   },
#
#   "execution": {
#     "action_taken": "rolling_restart_stateless",
#     "execution_start": "2025-04-15T14:23:09Z",
#     "verification_result": "HEALTHY",
#     "mttr_seconds": 67,
#     "operator_override": false
#   },
#
#   "quality_signals": {
#     "prediction_matched_outcome": true,
#     "error_budget_consumed_pct": 0.002,
#     "operator_satisfaction": null    # Populated by post-incident feedback
#   }
# }

The policy_authority field in the escalation decision block is the accountability chain closure. It names the specific policy document version and its human approvers. When an auditor asks who authorised the autonomous action, the answer is not "the AI decided" — it is "the SRE Lead and VP Engineering approved escalation policy v1.3 on 2025-03-15, and this action fell within the boundaries of Section 1 of that policy."
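
A small builder function can enforce that no decision event is emitted without a policy version and named approvers. The Python sketch below constructs only the `escalation_decision` portion of the event shown above; the function name and its validation rules are assumptions for illustration, and shipping the resulting JSON to Splunk is left out.

```python
import json
from datetime import datetime, timezone

TRIGGERS = ("confidence", "blast_radius", "novelty", "regulatory")


def build_decision_event(
    incident_id: str,
    autonomy_level: int,
    triggers_fired: list,
    policy_version: str,
    approvers: list,
) -> dict:
    """Build an audit event; refuse to build one with a broken accountability chain."""
    unknown = set(triggers_fired) - set(TRIGGERS)
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    if not approvers:
        raise ValueError("policy_authority requires at least one named approver")
    if autonomy_level == 4 and triggers_fired:
        raise ValueError("Level 4 requires that no escalation trigger fired")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "incident_id": incident_id,
        "escalation_decision": {
            "autonomy_level": autonomy_level,
            "policy_version": policy_version,
            "triggers_evaluated": list(TRIGGERS),
            "triggers_fired": sorted(triggers_fired),
            "policy_authority": (
                f"holmesgpt-escalation-policy {policy_version} "
                f"(approved: {', '.join(approvers)})"
            ),
        },
    }


event = build_decision_event("INC-20250415-0047", 4, [], "v1.3", ["sre-lead"])
payload = json.dumps(event)  # ready to forward to the audit pipeline
```

Rejecting a Level 4 event that lists a fired trigger catches the case where the decision logic and the audit record disagree, which is exactly the discrepancy an auditor would look for.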

The Confidence Calibration Problem

A confidence score of 0.85 from a language model does not intrinsically mean that the recommendation is correct 85% of the time. Language models are notoriously poorly calibrated — they express high confidence in incorrect outputs and sometimes express low confidence in correct ones. The confidence threshold in the escalation policy must be calibrated against the AI system's actual historical accuracy, not against the model's self-reported certainty.

-- Splunk SPL: Confidence Calibration Assessment
-- Compares model-reported confidence bands against actual outcome accuracy
-- Run monthly; output informs confidence threshold calibration in policy
-- (Lines starting with "--" are annotations, not SPL; strip them before running.)

index=sre_holmesgpt sourcetype="sre:holmesgpt:decisions"
| spath
| rename "investigation.confidence_score"             as confidence_score,
         "investigation.model_used"                   as model_used,
         "quality_signals.prediction_matched_outcome" as matched_raw,
         "execution.operator_override"                as override_raw
| eval matched    = if(matched_raw="true", 1, 0),
       overridden = if(override_raw="true", 1, 0)
| eval confidence_band = case(
    confidence_score >= 0.90, "90-100%",
    confidence_score >= 0.85, "85-89%",
    confidence_score >= 0.80, "80-84%",
    confidence_score >= 0.70, "70-79%",
    confidence_score >= 0.60, "60-69%",
    true(),                   "<60%"
  )
| stats
    count                  as total_recommendations,
    sum(matched)           as correct_predictions,
    avg(matched)           as empirical_accuracy,
    avg(confidence_score)  as mean_reported_confidence,
    sum(overridden)        as operator_overrides
    by confidence_band, model_used
| eval
    calibration_delta = empirical_accuracy - mean_reported_confidence,
    calibration_status = if(abs(calibration_delta) < 0.10, "CALIBRATED", "MISCALIBRATED")
| table
    confidence_band, model_used, total_recommendations, mean_reported_confidence,
    empirical_accuracy, calibration_delta, calibration_status, operator_overrides
| sort confidence_band

-- The audit event nests these fields and stores booleans as true/false, so
-- they are flattened and converted to 1/0 before aggregation. The delta is
-- taken against the mean reported confidence in each band.
-- If empirical_accuracy at "85-89%" band is actually 0.71:
-- The 0.85 autonomous threshold is accepting actions that are only
-- correct 71% of the time. Raise threshold or re-evaluate model.
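
The same calibration check is easy to prototype outside Splunk on an exported sample. The Python sketch below takes `(confidence, was_correct)` pairs, groups them into the bands used above, and compares empirical accuracy with the mean reported confidence in each band. Comparing against the band mean and the 0.10 tolerance are choices made for this example.

```python
from collections import defaultdict

# (lower bound, label), checked from highest to lowest.
BANDS = [
    (0.90, "90-100%"), (0.85, "85-89%"), (0.80, "80-84%"),
    (0.70, "70-79%"), (0.60, "60-69%"), (0.00, "<60%"),
]


def band_for(confidence: float) -> str:
    for floor, label in BANDS:
        if confidence >= floor:
            return label
    return "<60%"


def calibration_table(decisions, tolerance: float = 0.10) -> dict:
    """decisions: iterable of (confidence, was_correct) pairs."""
    grouped = defaultdict(list)
    for confidence, correct in decisions:
        grouped[band_for(confidence)].append((confidence, correct))
    table = {}
    for label, rows in grouped.items():
        reported = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[label] = {
            "n": len(rows),
            "reported": reported,
            "accuracy": accuracy,
            "calibrated": abs(accuracy - reported) < tolerance,
        }
    return table
```

Bands with only a handful of samples will swing wildly, so it is worth checking `n` before acting on a "MISCALIBRATED" result.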

Common Antipatterns

The Confidence Theatre antipattern → Using model-reported confidence scores as the primary autonomous execution gate without calibration against empirical outcome accuracy. A model that reports 0.92 confidence but is empirically correct 68% of the time is a dangerous basis for autonomous action. Calibration against historical outcomes must precede the deployment of any confidence-based gate.
The Policy-as-Default antipattern → Deploying the AI system with permissive defaults and planning to tighten the escalation policy "after we see how it performs in production." The escalation policy must be the first artefact produced, not a retroactive constraint on a system that is already taking autonomous actions. Permissive defaults in AI operations systems are not starting points; they are incident preconditions.
The Accountability Diffusion antipattern → Designing the system so that no single person is clearly accountable for an autonomous AI action. "The AI did it" is not an accountability chain. "The escalation policy approved by [names] on [date] authorised this class of action" is. In regulated environments, the inability to name a responsible human for a production change is itself a compliance finding.
The Alert Fatigue Transfer antipattern → Moving from a system that generates too many monitoring alerts to a system that generates too many AI recommendations. If HolmesGPT surfaces seven recommendations per incident, operators will start ignoring them at the same rate they ignore high-volume monitoring alerts. Recommendation volume should be governed by the same principles as alert volume: every recommendation must be actionable, and when there is doubt about whether a recommendation is worth an operator's attention, the default should be to suppress it.
The Permanent Level 4 antipattern → Classifying an autonomous action as Level 4 and never re-qualifying it. The re-qualification cadence is the mechanism that prevents a well-calibrated autonomous action from silently becoming a dangerous one as the system evolves. Every Level 4 action must carry a next-review annotation (sre.internal/next-review in the ConfigMap above, the equivalent of sre.internal/sot-next-review) and a Kyverno policy that generates a ticket when the date passes.

Maturity Progression

────────────────────────────────────────────────────────────────────────────
STAGE        AI-OPS ESCALATION STATE             NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     No AI-assisted operations.          All investigation is
             Operators work from raw             manual. MTTR limited
             telemetry only.                     by human availability.

Defined      HolmesGPT deployed at              AI operating at Level
             Level 1 only. Escalation           1–2 only. Context
             policy drafted but not             surfacing measurably
             yet governing autonomous           reduces investigation
             action.                            time.

Measured     Escalation policy governs          Recommendation quality
             Level 3–4 boundaries.              metrics tracked. Confidence
             Audit trail in Splunk.             calibration assessed
             Quality metrics active.            monthly. Override rate
                                                below 15%.

Optimised    Confidence calibration             Level 4 actions cover
             cycle running quarterly.           top-5 toil remediations.
             Model routing by blast             MTTR for covered patterns
             radius operational.                < 5 minutes (automated).
             NIST AI RMF aligned.               Audit trail satisfies
                                                regulatory review.

Generative   Escalation policy published        Policy cited in industry
             as reference architecture.         guidance. Recommendation
             Feedback loop feeds               quality above 85%.
             prompt engineering cycle.          AI-ops layer itself
             AI-ops treated as a               has SLO and error budget.
             production service.
────────────────────────────────────────────────────────────────────────────

Five Action Items for This Week

Draft your escalation policy document before configuring any autonomous action in HolmesGPT. Start with the accountability chain section: who owns the policy, who approves autonomous action boundaries, and what the change record looks like. A policy document that exists on paper but has not been approved by SRE leadership and VP Engineering is not a governance artefact — it is a draft. The approval is the governance act.
Run the Splunk confidence calibration query against your last 90 days of HolmesGPT decisions. If you do not yet have 90 days of data, start collecting it now at Level 1 only. Calibration data must precede autonomous execution boundaries. The calibration query is the empirical basis for your confidence thresholds — thresholds chosen without it are guesses with operational consequences.
Map every existing automated remediation to an autonomy level and a blast radius assessment. For each automation in your Class 1 (Reactive Remediation) category from the automation taxonomy post, assess: what is its blast radius under worst-case conditions, and what confidence mechanism governs when it executes? Automations with no explicit blast radius boundary and no confidence mechanism are operating at implicit Level 4 without a policy. Make the policy explicit before the next incident.
Configure the recommendation quality Prometheus rules and set a 30-day baseline. Even if you are operating at Level 1 only, begin measuring acceptance rate and false positive rate now. The first meaningful governance conversation about elevating to Level 3 or Level 4 should be anchored in empirical quality data, not in enthusiasm about the capability.
Add the four escalation triggers as literal fields to your HolmesGPT Splunk audit events. Every decision event should record: confidence_trigger_fired: true/false, blast_radius_trigger_fired: true/false, novelty_trigger_fired: true/false, regulatory_trigger_fired: true/false. Over time, this data reveals which triggers are governing your escalation decisions most frequently — and which failure modes your autonomous boundary is most exposed to.

"The risk in AI-assisted SRE is not that the automation will fail to act. The risk is that it will act confidently, at scale, on a pattern it has only partially understood — and that the human who approved the policy that authorised the action will not be reachable, will not remember what the policy said, or will not realise the policy applied to this situation. The escalation policy is not a constraint on AI capability. It is the engineering discipline that makes AI capability safe to deploy in systems where the cost of being confidently wrong is borne by users, not by the model."

What Comes Next

The escalation policy governs how AI recommendations become actions. The harder engineering problem is the quality of the recommendations themselves — specifically, how to evaluate LLM reliability for incident diagnosis with the same rigour that SRE applies to any other production dependency. The next post examines what it means to apply an SLO framework to an AI system: defining SLIs for recommendation accuracy, precision, and recall; setting error budgets for the AI-ops layer; and designing the automated quality gates that prevent a degrading LLM backend from silently undermining the operational decisions that depend on it.
