On June 8, 2021, a customer configuration change triggered a latent software bug in Fastly's CDN and caused a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. Identifying the cause and restoring most of the network took human operators roughly forty-nine minutes.
The Fastly incident is not primarily a story about automation failure. It is a story about the speed asymmetry between automated propagation and human response — and about what happens when the automation layer between a human decision and its production consequence moves faster than the accountability layer designed to govern it.
This asymmetry is the defining operational challenge of AI-assisted SRE. The capability to automate incident detection, root cause hypothesis generation, and even remediation is now accessible at costs and latencies that were unavailable five years ago. The operational risk is not that this capability will be under-used. The risk is that it will be deployed without a rigorous escalation policy — a formal framework that defines exactly where automated execution ends and human judgement begins, under what conditions the boundary shifts, and how accountability is preserved for every action the AI takes on behalf of an operator who may not have been in the room when it was taken.
The Human-in-the-Loop Spectrum
AI-assisted SRE operations do not exist at a single point on the autonomy spectrum. They exist across a range, and the appropriate position on that range is a function of confidence, blast radius, novelty, and regulatory context — not of how sophisticated the AI system is.
THE AUTOMATION AUTONOMY SPECTRUM
────────────────────────────────────────────────────────────────────────────
LEVEL 0 — MANUAL
AI generates no recommendations. Human observes raw telemetry and decides.
Appropriate when: AI system is unavailable, untrusted, or context is
outside AI training distribution entirely.
LEVEL 1 — ASSISTED
AI surfaces relevant context, correlated signals, and historical patterns.
Human makes all decisions. AI does not recommend actions.
Appropriate when: novel failure pattern; first occurrence of incident type;
regulated change requiring documented human judgement.
LEVEL 2 — SUPERVISED
AI recommends specific actions with confidence scores. Human approves
each action before execution. AI does not execute autonomously.
Appropriate when: high blast radius; unfamiliar but not novel pattern;
action is reversible but consequential.
LEVEL 3 — CONDITIONAL AUTONOMOUS
AI executes actions autonomously within pre-approved policy boundaries.
Human is notified after execution. Human can abort within a defined window.
Appropriate when: well-characterised failure pattern; low blast radius;
action is fully reversible; pattern seen > N times with consistent outcome.
LEVEL 4 — AUTONOMOUS
AI executes and verifies remediation without human notification unless
verification fails. Audit trail maintained.
Appropriate when: toil pattern fully characterised; action is idempotent;
blast radius is bounded to a single service; recurrence rate justifies
zero-latency response.
────────────────────────────────────────────────────────────────────────────
CRITICAL CONSTRAINT: No action may exist permanently at Level 4.
Every Level 4 automation must have a scheduled re-qualification review
that reassesses whether the failure pattern is still well-characterised
and the blast radius assumption still holds.
────────────────────────────────────────────────────────────────────────────
The critical constraint — that no action may exist permanently at Level 4 — is not conservatism. It is the engineering response to a specific failure mode: automation that was correctly calibrated at deployment time and has silently drifted out of calibration as the system evolved. An OOM restart automation that was safe when first deployed becomes unsafe the moment the underlying cause shifts from a memory leak to a data corruption event that is triggering the same symptom. The re-qualification review is the mechanism that catches this drift before it produces an incident.
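The re-qualification rule can be enforced mechanically rather than by memory. The following is a minimal sketch, assuming each automation records a hypothetical `last_qualified` date and that a lapsed Level 4 automation is demoted to Level 2 until reviewed; both the field name and the demotion target are illustrative choices, not HolmesGPT behaviour. The 90-day default mirrors the cadence in the example policy later in this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class AutonomyLevel(IntEnum):
    MANUAL = 0
    ASSISTED = 1
    SUPERVISED = 2
    CONDITIONAL_AUTONOMOUS = 3
    AUTONOMOUS = 4


@dataclass
class Automation:
    name: str
    level: AutonomyLevel
    last_qualified: date  # hypothetical field: date of the last re-qualification review


def effective_level(automation: Automation, today: date,
                    max_age: timedelta = timedelta(days=90)) -> AutonomyLevel:
    """Return the level the automation may actually run at today.

    A Level 4 automation whose review has lapsed is demoted to Level 2
    (human approval required) until it is re-qualified. Other levels are
    returned unchanged.
    """
    lapsed = today - automation.last_qualified > max_age
    if automation.level == AutonomyLevel.AUTONOMOUS and lapsed:
        return AutonomyLevel.SUPERVISED
    return automation.level
```

Computing the effective level at execution time, instead of trusting the stored level, means a missed review fails safe: the automation keeps working, but only with a human approving each action.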
The Four Escalation Triggers
Every escalation policy is built from four primitive triggers. Each trigger defines a condition under which the automation level must shift upward — toward more human involvement, not less.
Trigger 1 — Confidence Threshold Breach
The AI system's confidence in its diagnosis or recommended action has fallen below a defined threshold. In the context of LLM-based operations (HolmesGPT, LiteLLM Proxy routing), confidence is expressed as a combination of model-reported token probability distributions and domain-specific heuristics applied to the recommendation output.
A low-confidence diagnosis means the AI has identified a plausible pattern match but lacks sufficient corroborating signal to recommend action without human review. Executing actions based on low-confidence diagnoses is the operational equivalent of acting on a single data point in a monitoring dashboard: occasionally correct, reliably dangerous as a policy.
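One way to combine the two signals is a weighted blend. This is a sketch under stated assumptions: the model backend exposes per-token log-probabilities, the heuristic score is already normalised to [0, 1], and the 50/50 weighting is a placeholder to be tuned against real outcomes. The function names are hypothetical.

```python
import math
from typing import Sequence


def composite_confidence(token_logprobs: Sequence[float],
                         heuristic_score: float,
                         model_weight: float = 0.5) -> float:
    """Blend model-reported certainty with a domain heuristic score in [0, 1]."""
    if not token_logprobs:
        return 0.0  # no model signal at all: treat as zero confidence
    # Geometric-mean token probability: exp of the mean log-probability.
    model_score = math.exp(sum(token_logprobs) / len(token_logprobs))
    blended = model_weight * model_score + (1.0 - model_weight) * heuristic_score
    return max(0.0, min(1.0, blended))


def confidence_trigger_fired(confidence: float, threshold: float = 0.85) -> bool:
    """Trigger 1 fires when confidence falls below the policy threshold."""
    return confidence < threshold
```

The composite is only as trustworthy as its calibration. A blended score still has to be checked against historical outcomes before it gates anything.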
Trigger 2 — Blast Radius Threshold
The proposed action affects more infrastructure than the policy authorises for autonomous execution. Blast radius is assessed across three dimensions: service count (how many services are affected), traffic fraction (what percentage of user requests are served by the affected infrastructure), and reversibility (can the action be undone in under five minutes with a single command).
High blast radius is not a disqualifying condition for automation. It is a condition that requires the automation level to shift to at least Level 2 (supervised) regardless of confidence score.
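The three dimensions translate directly into a small check. A minimal sketch, assuming the limits shown as defaults: the five-minute rollback bound comes from the text above, while the single-service and 20% traffic limits are illustrative values that a real policy would set explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    service_count: int        # how many services the action touches
    traffic_fraction: float   # share of user requests served by affected infrastructure (0..1)
    rollback_seconds: int     # estimated time to undo the action with a single command


def blast_radius_trigger_fired(radius: BlastRadius,
                               max_services: int = 1,
                               max_traffic_fraction: float = 0.20,
                               max_rollback_seconds: int = 300) -> bool:
    """Trigger 2 fires when ANY dimension exceeds its autonomous limit."""
    return (radius.service_count > max_services
            or radius.traffic_fraction > max_traffic_fraction
            or radius.rollback_seconds > max_rollback_seconds)
```

Because the dimensions are combined with `or`, a small, single-service action that cannot be rolled back quickly still forces human approval.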
Trigger 3 — Novelty Detection
The failure pattern does not match any pattern in the AI system's training corpus or historical incident database. Novelty is the most dangerous condition for autonomous execution because it is precisely the condition where the AI's pattern-matching capability provides the least value — and where a confident-sounding but incorrect recommendation carries the highest operational cost.
Novelty detection is the hardest trigger to implement well, because it requires the AI system to accurately assess the boundaries of its own knowledge. A system that cannot reliably distinguish "I have seen this pattern and am confident" from "I have seen a superficially similar pattern and am extrapolating" should not be operating at Level 3 or Level 4.
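A deliberately simple way to make the three states explicit is to compare an incident's symptom signature against history. The sketch below uses Jaccard similarity over sets of symptom tags as a stand-in; a production system would more likely compare embeddings or structured incident features. The thresholds mirror the example policy values used later in this post, and the function and label names are hypothetical.

```python
from typing import Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two symptom signatures, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_novelty(signature: Set[str],
                     history: Iterable[Set[str]],
                     similarity_threshold: float = 0.80,
                     min_occurrences: int = 10) -> str:
    """Return "novel", "unfamiliar", or "known" for an incident signature."""
    matches = sum(1 for past in history
                  if jaccard(signature, past) >= similarity_threshold)
    if matches == 0:
        return "novel"        # no close match: Trigger 3 fires
    if matches < min_occurrences:
        return "unfamiliar"   # seen, but too rarely to trust autonomous action
    return "known"
```

The middle category matters: "seen a few times" is where extrapolation is most tempting, so it is kept distinct from both "known" and "novel".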
Trigger 4 — Regulatory Boundary
The proposed action would touch a regulated asset, require a documented change record, affect a system subject to NERC CIP, PCI-DSS, HIPAA, or equivalent obligations, or generate a compliance event. In regulated environments, no automated action may bypass the change management governance framework, regardless of confidence score or blast radius.
This trigger is absolute. It does not have a confidence threshold exception. An AI system that correctly diagnoses a production issue with 99% confidence and proposes a remediation that would constitute an undocumented change to a regulated asset must hand the decision to a human and generate a change record, even if the remediation would restore service faster without it. Level 2 is the most autonomy this trigger ever permits; the example policy later in this post is stricter and pins regulated assets to Level 1.
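Because this trigger ignores confidence entirely, it is best implemented as a cap applied after every other assessment. A minimal sketch, assuming regulated scope is identified by namespace or by a label; the namespace names and label key are the ones used in the example ConfigMap later in this post, and the Level 1 cap follows that example policy.

```python
from typing import Mapping

REGULATED_NAMESPACES = frozenset({"pci-zone", "hipaa-zone", "nerc-cip-zone"})
REGULATED_LABEL = ("compliance.internal/regulated", "true")


def regulatory_trigger_fired(namespace: str, labels: Mapping[str, str]) -> bool:
    """Trigger 4 fires if the target sits in a regulated namespace or carries the label."""
    key, value = REGULATED_LABEL
    return namespace in REGULATED_NAMESPACES or labels.get(key) == value


def cap_level(requested_level: int, namespace: str, labels: Mapping[str, str],
              regulated_cap: int = 1) -> int:
    """Apply the regulatory cap last, so no confidence score can override it."""
    if regulatory_trigger_fired(namespace, labels):
        return min(requested_level, regulated_cap)
    return requested_level
```

Note that `cap_level` takes no confidence argument at all. That omission is the point: there is no input through which a high score could raise the ceiling.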
Designing the Escalation Policy Document
The escalation policy is an operational governance document, not a configuration file. It must be version-controlled, reviewed and approved by SRE leadership and compliance, and referenced in every AI-assisted automation's runtime configuration. Its authority derives from human review, not from the AI system that consults it.
ESCALATION POLICY: AI-ASSISTED INCIDENT RESPONSE
────────────────────────────────────────────────────────────────────────────
Service: production-platform (all services)
AI System: HolmesGPT + LiteLLM Proxy + Ollama / GitHub Models
Policy Version: v1.3 | Approved: SRE Lead + VP Engineering
Last Reviewed: 2025-Q1 | Next Review: 2025-Q2
────────────────────────────────────────────────────────────────────────────
SECTION 1: AUTONOMOUS EXECUTION AUTHORISED (Level 4)
Conditions required (ALL must be true):
✓ Confidence score ≥ 0.85 (model-reported + heuristic composite)
✓ Pattern seen ≥ 10 times in incident history with consistent outcome
✓ Blast radius: single service, single namespace, ≤ 20% of replicas
✓ Action is idempotent and fully reversible in ≤ 5 minutes
✓ No regulated asset in scope
✓ Error budget > 75% remaining (not in Tier 2 degraded or Tier 3 freeze)
Authorised actions at Level 4:
→ Rolling restart of single stateless deployment (OOM, deadlock)
→ Scale-up of single HPA-managed deployment by ≤ 2 replicas
→ Certificate rotation on non-production workloads
→ Log pipeline gateway restart (telemetry outage, no production impact)
Required logging: structured Splunk event per action (mandatory)
Re-qualification: every 90 days or after any incident where autonomous
action was taken and outcome was suboptimal
SECTION 2: SUPERVISED EXECUTION (Level 2 — Human Approval Required)
Conditions triggering Level 2 (ANY is sufficient):
⚠ Confidence score 0.60–0.84
⚠ Blast radius: > 20% of replicas OR > 1 service OR cross-namespace
⚠ First or second occurrence of this failure pattern
⚠ Error budget between 25–75% (Tier 2 degraded)
⚠ Action affects shared infrastructure (Argo CD, Prometheus, Istio)
Approval mechanism: Slack approval button with 10-minute timeout
Timeout behaviour: escalate to on-call if no response in 10 minutes
Required logging: recommendation + approval/rejection + outcome
SECTION 3: ASSISTED ONLY (Level 1 — No Action Authorised)
Conditions triggering Level 1 (ANY is sufficient):
✗ Confidence score < 0.60
✗ Novel failure pattern (no match in incident history)
✗ Regulated asset in scope (NERC CIP, PCI-DSS, HIPAA boundary)
✗ Error budget < 25% (Tier 3 freeze — deployment freeze active)
✗ Active P0 incident in progress (human incident commander owns scope)
✗ Multiple simultaneous incidents (blast radius assessment unreliable)
AI role at Level 1: surface correlated signals, historical context only
Human owns: diagnosis, action decision, execution, verification
SECTION 4: ACCOUNTABILITY CHAIN
Every AI-assisted action must trace to one of:
a) Direct human approval (Level 2 Slack approval button)
b) This policy document (Level 4 autonomous execution)
"The AI decided" is not a complete accountability chain.
Policy document owner: SRE Lead
Policy review and approval authority: SRE Lead + VP Engineering
────────────────────────────────────────────────────────────────────────────
HolmesGPT Escalation Architecture
The escalation policy document defines the governance rules. The escalation architecture implements those rules as runtime logic in the AI-assisted operations stack. The architecture shown here is specific to the HolmesGPT + LiteLLM Proxy + Ollama deployment pattern in a regulated on-premises environment.
# HolmesGPT Escalation Policy ConfigMap
# Consumed by HolmesGPT at runtime to determine autonomy level per action
# Version-controlled in git; updated only via Argo CD sync (change record enforced)
apiVersion: v1
kind: ConfigMap
metadata:
  name: holmesgpt-escalation-policy
  namespace: holmesgpt
  annotations:
    sre.internal/policy-version: "v1.3"
    sre.internal/approved-by: "sre-lead,vp-engineering"
    sre.internal/approved-date: "2025-03-15"
    sre.internal/next-review: "2025-06-15"
    sre.internal/review-enforced-by: "kyverno-policy/ai-ops-policy-review"
data:
  escalation_policy.yaml: |
    confidence_thresholds:
      autonomous: 0.85
      supervised: 0.60
      assisted_only: 0.0
    blast_radius_limits:
      autonomous:
        max_replica_fraction: 0.20
        max_service_count: 1
        max_namespace_count: 1
        cross_namespace_allowed: false
        regulated_assets_allowed: false
    autonomous_actions_allowlist:
      - action: rolling_restart_stateless
        max_replicas_affected: 5
        requires_pdb_check: true
      - action: hpa_scale_up
        max_replica_delta: 2
        requires_current_below_sot: true
      - action: log_pipeline_restart
        namespaces: [monitoring, sre-platform]
        production_namespaces_blocked: true
    error_budget_gates:
      tier_3_freeze_blocks_autonomous: true
      tier_2_degrades_to_supervised: true
    regulatory_boundary:
      always_level_1_namespaces:
        - pci-zone
        - hipaa-zone
        - nerc-cip-zone
      always_level_1_labels:
        - "compliance.internal/regulated=true"
    novelty_detection:
      min_historical_occurrences_for_autonomous: 10
      similarity_threshold: 0.80
      unknown_pattern_forces_level_1: true
    approval_workflow:
      slack_channel: "sre-aiops-approvals"
      timeout_minutes: 10
      timeout_action: escalate_to_oncall
    audit:
      splunk_sourcetype: "sre:holmesgpt:decisions"
      log_all_recommendations: true
      log_operator_overrides: true
      override_feeds_prompt_review: true
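To show how these values could drive a runtime decision, here is a minimal evaluator sketch. It assumes the policy has already been parsed into a plain dictionary (flattened for brevity) and that a pre-assessment step supplies the fields of the hypothetical `Assessment` record. The error-budget cut-offs of 25% and 75% come from the policy document above. None of this is HolmesGPT's own API.

```python
from dataclasses import dataclass

# Hypothetical, flattened view of the ConfigMap values above.
POLICY = {
    "confidence_thresholds": {"autonomous": 0.85, "supervised": 0.60},
    "blast_radius_limits": {"max_replica_fraction": 0.20, "max_service_count": 1},
    "always_level_1_namespaces": {"pci-zone", "hipaa-zone", "nerc-cip-zone"},
    "autonomous_actions_allowlist": {
        "rolling_restart_stateless", "hpa_scale_up", "log_pipeline_restart",
    },
    "min_historical_occurrences_for_autonomous": 10,
}


@dataclass(frozen=True)
class Assessment:
    action: str
    namespace: str
    confidence: float
    replica_fraction: float
    service_count: int
    historical_occurrences: int
    error_budget_remaining: float  # fraction of budget left, 0..1


def autonomy_level(a: Assessment, policy: dict = POLICY) -> int:
    """Map an assessment to Level 1, 2, or 4 under the example policy."""
    thresholds = policy["confidence_thresholds"]
    limits = policy["blast_radius_limits"]
    # Level 1 conditions: any single one is sufficient.
    if (a.namespace in policy["always_level_1_namespaces"]
            or a.confidence < thresholds["supervised"]
            or a.historical_occurrences == 0
            or a.error_budget_remaining < 0.25):
        return 1
    # Level 4 requires every autonomous condition to hold at once.
    autonomous_ok = (
        a.action in policy["autonomous_actions_allowlist"]
        and a.confidence >= thresholds["autonomous"]
        and a.replica_fraction <= limits["max_replica_fraction"]
        and a.service_count <= limits["max_service_count"]
        and a.historical_occurrences
        >= policy["min_historical_occurrences_for_autonomous"]
        and a.error_budget_remaining > 0.75
    )
    return 4 if autonomous_ok else 2
```

The ordering is deliberate: Level 1 conditions are checked first and short-circuit, and anything that is not provably Level 4 falls through to human approval instead of defaulting to autonomy.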
Model Routing for Escalation Quality
The LiteLLM Proxy's model routing configuration is a first-class component of the escalation architecture. Routing to the right model at the right confidence tier is not a performance optimisation — it is a safety mechanism.
# LiteLLM Proxy - Model Routing for Escalation Tiers
# Smaller local models for low blast radius / routine patterns
# Larger models with greater context window for high blast radius / novel patterns
# On-premises models for regulated asset investigations (data sovereignty)
model_list:
  # Tier 1: Routine investigation, local Ollama model
  # Low latency, no data egress, adequate for well-characterised patterns
  - model_name: holmesgpt-routine
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama.ai-ops.svc.cluster.local:11434
      timeout: 30
      max_tokens: 2048
  # Tier 2: Complex investigation, larger local model
  # Higher accuracy for multi-service correlation and novel patterns
  - model_name: holmesgpt-complex
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://ollama.ai-ops.svc.cluster.local:11434
      timeout: 90
      max_tokens: 8192
  # Tier 3: High-stakes / novel pattern, GitHub Models
  # Largest context window for multi-service incident correlation
  # Data classification check required before routing: no PII, no regulated data
  - model_name: holmesgpt-highstakes
    litellm_params:
      model: github/gpt-4o
      api_base: https://models.inference.ai.azure.com
      api_key: "os.environ/GITHUB_MODELS_PAT"
      timeout: 120
      max_tokens: 16384
router_settings:
  # NOTE: the "custom" strategy, inline routing_logic, and the fallback_* keys
  # below express the intended behaviour as pseudocode. They are not known to be
  # built-in LiteLLM YAML options; verify against the LiteLLM version you run.
  # In practice the tier decision is typically made by the caller, which then
  # requests one of the model_name aliases defined above.
  routing_strategy: custom
  routing_logic: |
    # Route by blast_radius_tier header set by HolmesGPT pre-routing assessment
    if blast_radius_tier == "low" and pattern_novelty == "known":
      return "holmesgpt-routine"
    elif blast_radius_tier == "high" or pattern_novelty == "novel":
      # Data classification gate before external model routing
      if data_contains_regulated_fields:
        return "holmesgpt-complex"  # Stay on-premises
      return "holmesgpt-highstakes"
    else:
      return "holmesgpt-complex"
  fallback_model: holmesgpt-complex  # Always fall back to on-premises
  fallback_on_status_codes: [429, 500, 503]
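The routing decision itself is small enough to express as a plain function that runs before the request reaches the proxy. This sketch assumes the three inputs are produced by a pre-routing assessment step; the function name is hypothetical, and the returned strings are the model aliases defined in the configuration above.

```python
def select_model(blast_radius_tier: str,
                 pattern_novelty: str,
                 data_contains_regulated_fields: bool) -> str:
    """Pick a model alias for an investigation.

    Regulated data never leaves the on-premises models, regardless of how
    high-stakes or novel the incident is.
    """
    if blast_radius_tier == "low" and pattern_novelty == "known":
        return "holmesgpt-routine"
    if blast_radius_tier == "high" or pattern_novelty == "novel":
        # Data classification gate before any externally hosted model.
        if data_contains_regulated_fields:
            return "holmesgpt-complex"
        return "holmesgpt-highstakes"
    return "holmesgpt-complex"
```

The caller would pass the returned alias as the `model` field of its request to the proxy. Keeping the data classification gate in code that the team owns and tests makes the data sovereignty guarantee auditable independently of proxy configuration.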
The Recommendation Quality Feedback Loop
The operational risk of AI-assisted recommendations is not static. It evolves as the system changes and as the model's training distribution diverges from the current operational reality. An AI recommendation quality feedback loop is the mechanism that makes this drift visible before it produces a damaging autonomous action.
# Prometheus Recording Rules - AI Recommendation Quality Tracking
# Measures whether HolmesGPT recommendations are operationally valuable
# High override rate or low action rate = recommendation quality degrading
groups:
  - name: holmesgpt.recommendation_quality
    rules:
      # Recommendation acceptance rate: fraction of recommendations
      # that operators acted on (approved or executed autonomously)
      # versus rejected or ignored
      - record: holmesgpt:recommendation_acceptance_rate:rate7d
        expr: |
          sum(rate(holmesgpt_recommendations_acted_on_total[7d]))
          /
          sum(rate(holmesgpt_recommendations_total[7d]))
      # Operator override rate: fraction of autonomous actions that
      # were manually reversed by an operator after execution
      # High rate = autonomous confidence thresholds are too permissive
      - record: holmesgpt:autonomous_override_rate:rate7d
        expr: |
          sum(rate(holmesgpt_autonomous_actions_reversed_total[7d]))
          /
          sum(rate(holmesgpt_autonomous_actions_total[7d]))
      # False positive rate: recommendations made but outcome was
      # NOT the recommended action resolving the incident
      - record: holmesgpt:false_positive_rate:rate7d
        expr: |
          sum(rate(holmesgpt_recommendations_outcome_mismatch_total[7d]))
          /
          sum(rate(holmesgpt_recommendations_acted_on_total[7d]))
      # Alert: recommendation quality degrading
      - alert: HolmesGPT_RecommendationQualityDegrading
        expr: |
          holmesgpt:autonomous_override_rate:rate7d > 0.15
          or
          holmesgpt:false_positive_rate:rate7d > 0.20
        for: 1d
        labels:
          severity: ticket
          domain: ai_ops_quality
        annotations:
          summary: >
            HolmesGPT recommendation quality below threshold.
            Override rate: {{ with query "holmesgpt:autonomous_override_rate:rate7d" }}
            {{ . | first | value | humanizePercentage }}{{ end }}.
            Action: review recent overrides, update prompt context,
            consider reducing autonomous confidence threshold.
          runbook: "https://wiki.internal/sre/runbooks/holmesgpt-quality-review"
      # Alert: recommendation volume causing alert fatigue risk
      # More than 3 recommendations per incident = cognitive overload signal
      - alert: HolmesGPT_RecommendationVolumeHigh
        expr: |
          sum(rate(holmesgpt_recommendations_total[1h]))
          /
          sum(rate(incidents_opened_total[1h])) > 3
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: >
            HolmesGPT generating > 3 recommendations per incident on average.
            Risk: alert fatigue causing operators to ignore recommendations.
            Action: tighten confidence floor or reduce recommendation scope.
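The quality alert reduces to two ratios and two thresholds, which makes it easy to sanity-check against raw counts during a review. A minimal worked sketch, assuming the four counts have been read from the counters above for the same 7-day window; the function names are hypothetical, and the sustained-for-one-day condition of the alert is not modelled here.

```python
from typing import Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when there is no data."""
    return numerator / denominator if denominator else None


def quality_degrading(reversed_actions: int, autonomous_actions: int,
                      outcome_mismatches: int, acted_on: int,
                      max_override_rate: float = 0.15,
                      max_false_positive_rate: float = 0.20) -> bool:
    """Mirror the alert expression: fire if EITHER rate exceeds its threshold."""
    override_rate = safe_ratio(reversed_actions, autonomous_actions)
    false_positive_rate = safe_ratio(outcome_mismatches, acted_on)
    return ((override_rate is not None and override_rate > max_override_rate)
            or (false_positive_rate is not None
                and false_positive_rate > max_false_positive_rate))
```

For example, 4 reversals out of 20 autonomous actions is a 20% override rate, which exceeds the 15% threshold even if every other signal looks healthy.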
The Accountability Chain Principle and NIST AI RMF Alignment
The accountability chain principle — that every AI-assisted action must trace back to a human decision, either a direct approval or a policy that a human wrote and approved — is the operational implementation of the NIST AI Risk Management Framework's GOVERN function.
The NIST AI RMF establishes four core functions for AI risk management: GOVERN (policies, accountability), MAP (risk identification), MEASURE (risk quantification), and MANAGE (risk response). Each function maps directly to components of the escalation policy architecture.
NIST AI RMF MAPPING: AI-ASSISTED SRE OPERATIONS
────────────────────────────────────────────────────────────────────────────
GOVERN — Accountability and Policy
Who owns the AI system's outputs?
→ SRE Lead owns escalation policy; VP Engineering co-approves
Who approves autonomous action boundaries?
→ Policy document with named approvers and review cadence
How are accountability chains maintained?
→ Splunk audit trail: every recommendation, decision, and outcome
SRE implementation: escalation policy document + approval workflow
MAP — Risk Identification
What failure modes does the AI system face?
→ Confidence decay: model accuracy degrades as system evolves
→ Distribution shift: production patterns diverge from training data
→ Novel pattern extrapolation: confident recommendation on unfamiliar input
→ Blast radius miscalculation: action scope larger than assessed
SRE implementation: four escalation triggers + novelty detection
MEASURE — Risk Quantification
How do you measure AI recommendation quality over time?
→ Acceptance rate: fraction of recommendations acted on
→ Override rate: fraction of autonomous actions manually reversed
→ False positive rate: recommendations where predicted outcome was wrong
→ Confidence calibration: does 85% confidence actually mean 85% accuracy?
SRE implementation: Prometheus quality recording rules + 7-day rolling metrics
MANAGE — Risk Response
What happens when AI recommendation quality degrades?
→ Automatic downgrade of autonomous confidence threshold
→ Prompt context refresh from recent incident postmortems
→ Temporary suspension of Level 4 autonomy pending review
SRE implementation: quality alert → runbook → policy review cadence
────────────────────────────────────────────────────────────────────────────
Splunk Audit Trail: The Irreplaceable Governance Layer
In regulated environments, the audit trail for AI-assisted actions is not optional. It is the documentary evidence that demonstrates human accountability over automated decisions — the record that answers the auditor's question: "Who authorised this change to your production system?"
# Splunk HEC Forwarder - HolmesGPT Decision Audit Trail
# Every recommendation, escalation decision, and outcome → Splunk
# This record is the accountability chain in documentary form
# Splunk event structure (sourcetype: sre:holmesgpt:decisions):
# {
#   "timestamp": "2025-04-15T14:23:07Z",
#   "incident_id": "INC-20250415-0047",
#   "alert_name": "KubePodOOMKilled",
#   "service": "payments-api",
#   "namespace": "production",
#
#   "investigation": {
#     "model_used": "holmesgpt-routine",
#     "model_backend": "ollama/llama3.1:8b",
#     "confidence_score": 0.91,
#     "diagnosis": "Memory limit (2Gi) exceeded by 847MB under high load...",
#     "recommended_action": "rolling_restart_stateless",
#     "blast_radius_assessment": {
#       "services_affected": 1,
#       "replica_fraction": 0.15,
#       "reversible": true,
#       "regulated_asset": false
#     }
#   },
#
#   "escalation_decision": {
#     "autonomy_level": 4,
#     "policy_version": "v1.3",
#     "triggers_evaluated": ["confidence", "blast_radius", "novelty", "regulatory"],
#     "triggers_fired": [],
#     "decision": "AUTONOMOUS_EXECUTE",
#     "policy_authority": "holmesgpt-escalation-policy v1.3 (approved: sre-lead, vp-engineering)"
#   },
#
#   "execution": {
#     "action_taken": "rolling_restart_stateless",
#     "execution_start": "2025-04-15T14:23:09Z",
#     "verification_result": "HEALTHY",
#     "mttr_seconds": 67,
#     "operator_override": false
#   },
#
#   "quality_signals": {
#     "prediction_matched_outcome": true,
#     "error_budget_consumed_pct": 0.002,
#     "operator_satisfaction": null   (populated later by post-incident feedback)
#   }
# }
The policy_authority field in the escalation decision block is the accountability chain closure. It names the specific policy document version and its human approvers. When an auditor asks who authorised the autonomous action, the answer is not "the AI decided" — it is "the SRE Lead and VP Engineering approved escalation policy v1.3 on 2025-03-15, and this action fell within the boundaries of Section 1 of that policy."
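A small builder function can guarantee that no decision event is emitted without a named policy authority. This is a sketch, not HolmesGPT's event writer: the function name is hypothetical, the `ESCALATE_TO_HUMAN` decision value is an assumed counterpart to `AUTONOMOUS_EXECUTE`, and the per-trigger boolean fields follow the naming suggested in the action items at the end of this post.

```python
import json
from typing import Iterable, List

TRIGGERS = ("confidence", "blast_radius", "novelty", "regulatory")


def build_escalation_decision(triggers_fired: Iterable[str],
                              autonomy_level: int,
                              policy_version: str,
                              approvers: List[str]) -> dict:
    """Build the escalation_decision block of an audit event."""
    fired = set(triggers_fired)
    unknown = fired - set(TRIGGERS)
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    if not approvers:
        # Refuse to emit an event that cannot name a responsible human.
        raise ValueError("accountability chain requires at least one named approver")
    autonomous = autonomy_level == 4 and not fired
    event = {
        "autonomy_level": autonomy_level,
        "policy_version": policy_version,
        "triggers_evaluated": list(TRIGGERS),
        "triggers_fired": sorted(fired),
        "decision": "AUTONOMOUS_EXECUTE" if autonomous else "ESCALATE_TO_HUMAN",
        "policy_authority": (
            f"holmesgpt-escalation-policy {policy_version} "
            f"(approved: {', '.join(approvers)})"
        ),
    }
    # One explicit boolean per trigger makes later aggregation trivial.
    for trigger in TRIGGERS:
        event[f"{trigger}_trigger_fired"] = trigger in fired
    return event


if __name__ == "__main__":
    decision = build_escalation_decision([], 4, "v1.3", ["sre-lead", "vp-engineering"])
    print(json.dumps(decision, indent=2))
```

Raising on an empty approver list turns "the AI decided" from an audit finding into a runtime error, which is a cheaper place to discover it.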
The Confidence Calibration Problem
A confidence score of 0.85 from a language model does not intrinsically mean that the recommendation is correct 85% of the time. Language models are notoriously poorly calibrated — they express high confidence in incorrect outputs and sometimes express low confidence in correct ones. The confidence threshold in the escalation policy must be calibrated against the AI system's actual historical accuracy, not against the model's self-reported certainty.
-- Splunk SPL: Confidence Calibration Assessment
-- Compares model-reported confidence bands against actual outcome accuracy
-- Run monthly; output informs confidence threshold calibration in policy
-- (Lines beginning "--" are annotations; remove them before running the search.)
-- The audit event nests its fields, so they are extracted with spath and
-- the JSON booleans are converted to 1/0 before aggregation.
index=sre_holmesgpt sourcetype="sre:holmesgpt:decisions"
| spath
| rename "investigation.confidence_score" as confidence_score,
         "investigation.model_used" as model_used
| eval matched = if('quality_signals.prediction_matched_outcome' == "true", 1, 0),
       overridden = if('execution.operator_override' == "true", 1, 0)
| eval confidence_band = case(
    confidence_score >= 0.90, "90-100%",
    confidence_score >= 0.85, "85-89%",
    confidence_score >= 0.80, "80-84%",
    confidence_score >= 0.70, "70-79%",
    confidence_score >= 0.60, "60-69%",
    true(), "<60%"
  )
| stats
    count as total_recommendations,
    sum(matched) as correct_predictions,
    avg(matched) as empirical_accuracy,
    sum(overridden) as operator_overrides
  by confidence_band, model_used
| eval
    calibration_delta = empirical_accuracy - (tonumber(substr(confidence_band,1,2))/100),
    calibration_status = if(abs(calibration_delta) < 0.10, "CALIBRATED", "MISCALIBRATED")
| table
    confidence_band, model_used, total_recommendations,
    empirical_accuracy, calibration_delta, calibration_status, operator_overrides
| sort confidence_band
-- If empirical_accuracy at "85-89%" band is actually 0.71:
-- The 0.85 autonomous threshold is accepting actions that are only
-- correct 71% of the time. Raise threshold or re-evaluate model.
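The same calibration arithmetic can be run offline, for instance in a notebook over exported audit events. A minimal sketch, assuming each decision has been reduced to a `(confidence_score, prediction_matched_outcome)` pair; the band boundaries and the 0.10 tolerance are the ones used in the search above, and the `NOT_ASSESSED` status for the lowest band is an illustrative choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (lower bound, label) pairs, checked from highest to lowest.
BANDS = [(0.90, "90-100%"), (0.85, "85-89%"), (0.80, "80-84%"),
         (0.70, "70-79%"), (0.60, "60-69%")]


def band_for(confidence: float) -> Tuple[Optional[float], str]:
    """Return the band's lower bound and label; no bound below 0.60."""
    for lower, label in BANDS:
        if confidence >= lower:
            return lower, label
    return None, "<60%"


def calibration_report(decisions: Iterable[Tuple[float, bool]],
                       tolerance: float = 0.10) -> Dict[str, dict]:
    """Compare empirical accuracy per band with the band's lower bound."""
    buckets = defaultdict(lambda: {"n": 0, "correct": 0, "lower": None})
    for confidence, matched in decisions:
        lower, label = band_for(confidence)
        bucket = buckets[label]
        bucket["n"] += 1
        bucket["correct"] += int(matched)
        bucket["lower"] = lower
    report = {}
    for label, bucket in buckets.items():
        accuracy = bucket["correct"] / bucket["n"]
        if bucket["lower"] is None:
            delta, status = None, "NOT_ASSESSED"
        else:
            delta = accuracy - bucket["lower"]
            status = "CALIBRATED" if abs(delta) < tolerance else "MISCALIBRATED"
        report[label] = {"n": bucket["n"], "empirical_accuracy": accuracy,
                         "calibration_delta": delta, "calibration_status": status}
    return report
```

With 71 correct outcomes out of 100 decisions in the 85-89% band, the delta is 0.71 - 0.85 = -0.14, outside the 0.10 tolerance, so the band is reported as miscalibrated.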
Common Antipatterns
The Confidence Theatre antipattern → Using model-reported confidence scores as the primary autonomous execution gate without calibration against empirical outcome accuracy. A model that reports 0.92 confidence but is empirically correct 68% of the time is a dangerous basis for autonomous action. Calibration against historical outcomes must precede the deployment of any confidence-based gate.
The Policy-as-Default antipattern → Deploying the AI system with permissive defaults and planning to tighten the escalation policy "after we see how it performs in production." The escalation policy must be the first artefact produced, not a retroactive constraint on a system that is already taking autonomous actions. Permissive defaults in AI operations systems are not starting points; they are incident preconditions.
The Accountability Diffusion antipattern → Designing the system so that no single person is clearly accountable for an autonomous AI action. "The AI did it" is not an accountability chain. "The escalation policy approved by [names] on [date] authorised this class of action" is. In regulated environments, the inability to name a responsible human for a production change is itself a compliance finding.
The Alert Fatigue Transfer antipattern → Moving from a system that generates too many monitoring alerts to a system that generates too many AI recommendations. If HolmesGPT surfaces seven recommendations per incident, operators will start ignoring them at the same rate they ignore high-volume monitoring alerts. Recommendation volume should be governed by the same principles as alert volume: every recommendation must be actionable, and the threshold for surfacing should be higher than the threshold for suppressing.
The Permanent Level 4 antipattern → Classifying an autonomous action as Level 4 and never re-qualifying it. The re-qualification cadence is the mechanism that prevents a well-calibrated autonomous action from silently becoming a dangerous one as the system evolves. Every Level 4 action must carry an annotation equivalent to sre.internal/sot-next-review and a Kyverno policy that generates a ticket when the date passes.
Maturity Progression
────────────────────────────────────────────────────────────────────────────
STAGE        AI-OPS ESCALATION STATE        NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     No AI-assisted operations.     All investigation is
             Operators work from raw        manual. MTTR limited
             telemetry only.                by human availability.

Defined      HolmesGPT deployed at          AI operating at Level
             Level 1 only. Escalation       1–2 only. Context
             policy drafted but not         surfacing measurably
             yet governing autonomous       reduces investigation
             action.                        time.

Measured     Escalation policy governs      Recommendation quality
             Level 3–4 boundaries.          metrics tracked. Confidence
             Audit trail in Splunk.         calibration assessed
             Quality metrics active.        monthly. Override rate
                                            below 15%.

Optimised    Confidence calibration         Level 4 actions cover
             cycle running quarterly.       top-5 toil remediations.
             Model routing by blast         MTTR for covered patterns
             radius operational.            < 5 minutes (automated).
             NIST AI RMF aligned.           Audit trail satisfies
                                            regulatory review.

Generative   Escalation policy published    Policy cited in industry
             as reference architecture.     guidance. Recommendation
             Feedback loop feeds            quality above 85%.
             prompt engineering cycle.      AI-ops layer itself
             AI-ops treated as a            has SLO and error budget.
             production service.
────────────────────────────────────────────────────────────────────────────
Five Action Items for This Week
Draft your escalation policy document before configuring any autonomous action in HolmesGPT. Start with the accountability chain section: who owns the policy, who approves autonomous action boundaries, and what the change record looks like. A policy document that exists on paper but has not been approved by SRE leadership and VP Engineering is not a governance artefact — it is a draft. The approval is the governance act.
Run the Splunk confidence calibration query against your last 90 days of HolmesGPT decisions. If you do not yet have 90 days of data, start collecting it now at Level 1 only. Calibration data must precede autonomous execution boundaries. The calibration query is the empirical basis for your confidence thresholds — thresholds chosen without it are guesses with operational consequences.
Map every existing automated remediation to an autonomy level and a blast radius assessment. For each automation in your Class 1 (Reactive Remediation) category from the automation taxonomy post, assess: what is its blast radius under worst-case conditions, and what confidence mechanism governs when it executes? Automations with no explicit blast radius boundary and no confidence mechanism are operating at implicit Level 4 without a policy. Make the policy explicit before the next incident.
Configure the recommendation quality Prometheus rules and set a 30-day baseline. Even if you are operating at Level 1 only, begin measuring acceptance rate and false positive rate now. The first meaningful governance conversation about elevating to Level 3 or Level 4 should be anchored in empirical quality data, not in enthusiasm about the capability.
Add the four escalation triggers as literal fields to your HolmesGPT Splunk audit events. Every decision event should record confidence_trigger_fired, blast_radius_trigger_fired, novelty_trigger_fired, and regulatory_trigger_fired, each as true/false. Over time, this data reveals which triggers are governing your escalation decisions most frequently, and which failure modes your autonomous boundary is most exposed to.
"The risk in AI-assisted SRE is not that the automation will fail to act. The risk is that it will act confidently, at scale, on a pattern it has only partially understood — and that the human who approved the policy that authorised the action will not be reachable, will not remember what the policy said, or will not realise the policy applied to this situation. The escalation policy is not a constraint on AI capability. It is the engineering discipline that makes AI capability safe to deploy in systems where the cost of being confidently wrong is borne by users, not by the model."
What Comes Next
The escalation policy governs how AI recommendations become actions. The harder engineering problem is the quality of the recommendations themselves — specifically, how to evaluate LLM reliability for incident diagnosis with the same rigour that SRE applies to any other production dependency. The next post examines what it means to apply an SLO framework to an AI system: defining SLIs for recommendation accuracy, precision, and recall; setting error budgets for the AI-ops layer; and designing the automated quality gates that prevent a degrading LLM backend from silently undermining the operational decisions that depend on it.
