Nijo George Payyappilly

Posted on May 19

Energy Grid Observability: What the Power Sector Can Learn from Google SRE

#sre #observability #devops #reliability

On August 14, 2003, a software bug silenced an alarm. The alarm was part of the energy management system at FirstEnergy Corporation in Ohio, the system whose job was to alert operators when the transmission network left its safe operating envelope. The bug had been present for months. That afternoon it suppressed alerts for more than an hour before anyone in the control room realised. By the time operators understood what was happening, three high-voltage transmission lines had sagged into untrimmed trees. The cascading failure that followed spread across eight US states and into Canada, and fifty-five million people were without power in the largest blackout in North American history.

The official investigation report ran to two hundred and thirty-eight pages. Its conclusion, at root, was simple: the grid failed because the humans operating it had lost situational awareness. Not because the sensors stopped working. Not because the transmission infrastructure was inadequate. Because the software layer between the physical grid and the human operators had stopped faithfully representing reality — and no one knew it.

That is an observability failure. And it is the same class of failure that Site Reliability Engineering was designed to prevent in software systems. The power sector has not yet fully recognised that it is facing the same problem under a different name.

Two Reliability Disciplines Separated by Vocabulary

Grid operations and Site Reliability Engineering evolved independently, serving different physical systems and different regulatory regimes. But their foundational concerns are identical: how do you know the current state of a complex, distributed system? How do you define and measure acceptable failure? How do you detect degradation before it becomes catastrophe?

Grid operators have answered these questions with decades of engineering practice. SCADA systems provide real-time telemetry from thousands of sensors. Energy Management Systems (EMS) run continuous state estimation to model grid topology under current load conditions. Protection relay systems execute sub-second automated fault isolation when abnormal conditions are detected. The grid, in narrow technical terms, is one of the most instrumented physical systems ever built.

And yet the 2003 Northeast blackout happened. Texas Winter Storm Uri in February 2021 caused the failure of over one-third of the state's generating capacity. The California heat dome of August 2020 forced rolling blackouts, and the September 2022 event brought the grid to the edge of them again, despite years of grid modernisation investment.

The common thread is not sensor failure or infrastructure inadequacy. It is the gap between monitoring and observability — between knowing that something is happening and understanding why, between seeing individual metric thresholds breach and comprehending the causal chain that connects them.

The core distinction: Monitoring tells you a transmission line is at 98% capacity. Observability tells you why it got there, what will happen next, and which of seventeen possible interventions will resolve it without triggering a cascading failure elsewhere in the network.

Mapping the Four Golden Signals to Grid Operations

Google SRE's Four Golden Signals — Latency, Traffic, Errors, and Saturation — were formulated for software services, but their underlying logic is domain-agnostic. Each characterises a different dimension of system health from the perspective of the entity being served.

Latency — Control System Response Time and State Estimation Convergence

In software services, latency measures how long it takes to serve a request. In grid operations, the equivalent is the time dimension of control system responsiveness: how long does it take for a SCADA command to be executed and confirmed? How long does the state estimation algorithm take to converge after a topology change?

The 2003 Northeast blackout was materially worsened because the state estimator at the regional reliability coordinator, MISO, was not producing valid solutions for much of the afternoon, so the people responsible for the wider network were working from a stale model that they trusted as current. The latency of the state estimation update cycle was the hidden variable that turned a manageable contingency into a cascading failure.

Grid observability requires tracking not just whether state estimation is running, but how fresh its output is. A state estimation system that converges in 30 seconds normally but 8 minutes during a topology change is exhibiting a reliability signal that warrants an alert — because 8-minute-old models during fast-moving contingencies are operationally dangerous.
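
One way to make "how fresh is the model" an explicit signal is sketched below. It is a minimal illustration, not an EMS interface: the `EstimatorRun` record, the function names, and the 60-second and 300-second thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EstimatorRun:
    finished_at: float  # Unix time at which the run ended
    converged: bool     # True if the solver reached a stable solution


def model_age_seconds(runs: List[EstimatorRun], now: float) -> Optional[float]:
    """Age of the newest converged model, or None if no run has converged."""
    converged_times = [r.finished_at for r in runs if r.converged]
    if not converged_times:
        return None
    return now - max(converged_times)


def freshness_status(runs: List[EstimatorRun], now: float,
                     warn_after: float = 60.0,
                     page_after: float = 300.0) -> str:
    """Classify model freshness; a missing model is treated as the worst case."""
    age = model_age_seconds(runs, now)
    if age is None or age >= page_after:
        return "page"
    if age >= warn_after:
        return "warn"
    return "ok"
```

The point of the design is that failed runs do not reset the clock: an estimator that keeps running but never converges still ages towards a page.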

Traffic — Load Demand, Frequency Deviation, and Interchange Flows

Traffic in SRE terms is the demand signal. On the grid, the more operationally sensitive metric is frequency deviation: the departure of grid frequency from its nominal value (60 Hz in North America) as the system balances generation against demand in real time.

The rate of frequency change (ROCOF — Rate of Change of Frequency) is the derivative signal that provides early warning of generation-load imbalance events before frequency has deviated enough to trigger protection systems.

ROCOF behaves like an SRE burn rate metric applied to the physical grid. A high ROCOF means the error budget, here the grid's tolerance for frequency deviation, is being consumed faster than the system can respond. The analogy is not decorative: both quantities measure how quickly a finite tolerance is being used up, and both are judged against the time the responders need to act.
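
To make the burn rate reading of ROCOF concrete, the sketch below estimates ROCOF as a least-squares slope over recent frequency samples and converts it into "seconds until a threshold is reached". The function names are hypothetical, and the 59.3 Hz figure in the usage note is borrowed from the alert rules later in this post rather than from any standard.

```python
from typing import List, Optional, Tuple


def rocof_hz_per_s(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (time_s, frequency_hz) samples, in Hz/s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one instant")
    return num / den


def seconds_to_threshold(current_hz: float, rocof: float,
                         threshold_hz: float) -> Optional[float]:
    """Time until frequency falls to threshold_hz at the current ROCOF.

    Returns 0.0 if already at or below the threshold, and None if the
    frequency is not falling.
    """
    if current_hz <= threshold_hz:
        return 0.0
    if rocof >= 0:
        return None
    return (current_hz - threshold_hz) / -rocof
```

With four one-second samples falling from 60.00 Hz to 59.85 Hz, the slope is about -0.05 Hz/s, which leaves roughly eleven seconds before 59.3 Hz. That remaining time is the number to compare against how fast each response tier can act.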

Errors — Protection Relay Operations, SCADA Command Failures, and Communication Outages

Grid errors require careful categorisation, in exactly the same way that HTTP error codes require categorisation to distinguish user errors (4xx) from system failures (5xx). A protection relay operation may be a correctly executed fault isolation. But a relay operation not followed by the expected reclosing sequence is a signal that warrants investigation.

SCADA command failures are the grid equivalent of failed write operations in a database: the operator believes a state change has occurred when it has not. These are the silent errors that accumulate into the situational awareness gap that precedes major events.
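
A minimal sketch of surfacing those silent failures is to reconcile issued commands against confirmations. The data shapes here (a dict of issue times and a set of confirmed IDs) are assumptions for illustration, and the 10-second timeout mirrors the SCADA command SLI defined later in this post.

```python
from typing import Dict, List, Set


def unconfirmed_commands(issued: Dict[str, float], confirmed: Set[str],
                         now: float, timeout_s: float = 10.0) -> List[str]:
    """IDs of commands issued more than timeout_s ago with no confirmation.

    issued maps a command ID to the Unix time at which it was sent;
    confirmed holds the IDs the field device has acknowledged as executed.
    """
    return sorted(
        command_id
        for command_id, issued_at in issued.items()
        if command_id not in confirmed and now - issued_at > timeout_s
    )
```

Anything this returns is a state change the operator may believe has happened when it has not, so it belongs on the display rather than in a log file.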

Saturation — Thermal Loading, Voltage Margins, and Short-Circuit Capacity

The critical insight from SRE practice is that saturation signals are predictive: you see saturation approaching before the error occurs. A transmission line at 85% of its thermal rating is a leading indicator; the sag-into-tree contact that initiated the 2003 blackout is the lagging consequence. An observability architecture that alerts on saturation approaching threshold provides the intervention window that reactive monitoring misses.
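
The predictive use of saturation can be sketched as a simple extrapolation. This is a deliberately naive linear projection from the first and last samples, with hypothetical function names; a real thermal model would account for ambient temperature, wind, and conductor dynamics.

```python
from typing import List, Optional, Tuple


def loading_percent(flow_mva: float, rating_mva: float) -> float:
    """Line loading as a percentage of its thermal rating."""
    if rating_mva <= 0:
        raise ValueError("rating must be positive")
    return 100.0 * flow_mva / rating_mva


def minutes_to_rating(history: List[Tuple[float, float]],
                      rating_mva: float) -> Optional[float]:
    """Minutes until flow reaches the rating, by linear extrapolation.

    history is a list of (minute, flow_mva) samples, oldest first.
    Returns 0.0 if already at the rating, and None if there are too few
    samples or the flow is not rising.
    """
    if len(history) < 2:
        return None
    (t0, f0), (t1, f1) = history[0], history[-1]
    if f1 >= rating_mva:
        return 0.0
    if t1 == t0:
        return None
    slope = (f1 - f0) / (t1 - t0)
    if slope <= 0:
        return None
    return (rating_mva - f1) / slope
```

A line at 85% that has climbed 50 MVA in ten minutes on a 1000 MVA rating has about thirty minutes left at that rate, which is the intervention window the paragraph above describes.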

────────────────────────────────────────────────────────────────────────────
GOLDEN SIGNAL    GRID EQUIVALENT                   KEY METRIC
────────────────────────────────────────────────────────────────────────────
Latency          State estimation convergence       Time-to-stable-model (s)
                 SCADA command round-trip           Command confirm latency (ms)
                 EMS display refresh lag            Telemetry staleness (s)

Traffic          Real-time load demand              MW by zone/area
                 Frequency deviation                Hz delta from 60.00
                 Rate of Change of Frequency        Hz/s (ROCOF)

Errors           Unplanned protection relay ops     Events/hour by substation
                 SCADA command failures             Failed commands / total
                 Communication outages              Unobservable assets count

Saturation       Transmission line loading          % of thermal rating
                 Transformer utilisation            % of nameplate MVA
                 Voltage margin                     % deviation from nominal
────────────────────────────────────────────────────────────────────────────

SLIs and SLOs for Grid Reliability

The power sector already has its own reliability metrics. SAIDI (System Average Interruption Duration Index), SAIFI (System Average Interruption Frequency Index), and CAIDI (Customer Average Interruption Duration Index) have been used by utilities for decades. But these are lagging, aggregated metrics: they measure what already happened, averaged across a customer base, reported quarterly. They are the equivalent of measuring software reliability by counting support tickets filed last quarter.

An SLO framework applied to grid operations would define SLIs at the control system and communication layer — not just at the customer impact layer — with rolling windows short enough to drive operational decisions in real time.

# Grid Observability SLI/SLO Definitions
# Prometheus recording rules for a modernised grid monitoring stack

groups:
  - name: grid.slo.definitions
    interval: 30s
    rules:

      # SLI 1: State Estimation Freshness
      # Fraction of state estimation runs in the last 5 minutes that converged.
      # The success counter is assumed to count only runs that reached a
      # stable solution within 60 seconds of a topology change.
      # SLO Target: 99.5% of intervals over rolling 7-day window
      - record: sli:state_estimation_freshness:ratio_rate5m
        expr: |
          sum(rate(ems_state_estimation_convergence_success_total[5m]))
          /
          sum(rate(ems_state_estimation_runs_total[5m]))

      # SLI 2: SCADA Command Execution Success
      # Fraction of SCADA commands confirmed executed within 10s
      # SLO Target: 99.9% of commands over rolling 24-hour window
      - record: sli:scada_command_success:ratio_rate5m
        expr: |
          sum(rate(scada_commands_confirmed_total[5m]))
          /
          sum(rate(scada_commands_issued_total[5m]))

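      # SLI 3: Substation Communication Availability
      # Fraction of monitored substations whose last update is under 60s old
      # SLO Target: 99.8% of substations observable at all times
      # 'or vector(0)' keeps the ratio at 0, rather than absent, when every
      # substation is stale and the filtered count returns no series
      - record: sli:substation_communication_availability:ratio
        expr: |
          (count(scada_substation_last_update_seconds < 60) or vector(0))
          /
          count(scada_substation_monitored == 1)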
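
An SLI only becomes operationally useful once it is turned into a remaining error budget. The arithmetic is sketched below; the function name is hypothetical, and the counts would come from whatever window the SLO is defined over.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    slo_target is a fraction strictly between 0 and 1, for example 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 1.0  # no events yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad
```

For the SCADA command SLO above, 100,000 commands at a 99.9% target allow 100 failures. Fifty failures leaves half the budget; two hundred means the budget is overspent by its full size.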

The OT/IT Convergence Problem as an Observability Architecture Challenge

The energy sector's most distinctive observability challenge is the boundary between Operational Technology (OT) and Information Technology (IT). OT systems — SCADA, protection relays, intelligent electronic devices (IEDs), phasor measurement units (PMUs) — were designed in an era when network isolation was the primary security model. They run proprietary protocols (DNP3, Modbus, IEC 61850) on dedicated networks with multi-decade operational lifetimes.

The consequence is an observability architecture with a structural gap at the OT/IT boundary: rich physical telemetry on one side, modern observability infrastructure on the other, and a brittle, manually maintained integration layer connecting them.

The SRE approach is to treat the OT/IT integration layer as a service with its own SLIs, SLOs, and error budgets. The data pipeline carrying PMU measurements from substations to the EMS is not a background infrastructure concern; it is a first-class service whose reliability directly determines the quality of state estimation output.

# OT/IT Integration Pipeline — SLO and Automated Recovery
# Architecture:
#   IED/RTU (substation) → DNP3/IEC 61850 → Protocol Gateway
#   → MQTT/gRPC → Kafka → Prometheus Exporter → Metrics Platform

groups:
  - name: grid.pipeline.slo
    rules:

      # Pipeline throughput: fraction of expected telemetry points received
      - record: sli:telemetry_pipeline_completeness:ratio_rate5m
        expr: |
          sum(rate(telemetry_points_received_total[5m]))
          /
          sum(rate(telemetry_points_expected_total[5m]))

      # Staleness alert: substation with no update for more than 120 seconds,
      # held for a further 2 minutes (the 'for' clause) before paging
      - alert: TelemetryPipelineStale
        expr: |
          (time() - telemetry_substation_last_received_timestamp) > 120
        for: 2m
        labels:
          severity: page
          domain: grid_observability
        annotations:
          summary: >
            Substation {{ $labels.substation_id }} telemetry stale for
            {{ $value | humanizeDuration }} — state estimation input degraded
          runbook: "https://wiki.internal/sre/runbooks/telemetry-pipeline-stale"
          automation: "https://wiki.internal/sre/automation/pipeline-recovery"

Automation-first recovery: A stale substation telemetry link whose recovery procedure is "operator identifies failure → calls substation technician → technician resets gateway → operator confirms recovery" is a toil pattern. The same procedure, triggered automatically by the staleness alert and confirmed by automated verification of resumed telemetry flow, eliminates human latency from the MTTR calculation — and eliminates the risk that the alert is missed during high-tempo operations.
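
The control flow of that automated procedure can be sketched in a few lines. The three callables are placeholders for whatever the real environment provides (a gateway restart, a telemetry check, a paging hook); nothing here is the actual recovery controller.

```python
from typing import Callable


def recover_telemetry(restart_gateway: Callable[[], None],
                      telemetry_resumed: Callable[[], bool],
                      escalate: Callable[[], None],
                      max_attempts: int = 2) -> str:
    """Restart the gateway, verify telemetry, and page a human if it fails.

    Each attempt is one restart followed by one verification. Escalation
    happens only after every attempt has failed verification.
    """
    for attempt in range(1, max_attempts + 1):
        restart_gateway()
        if telemetry_resumed():
            return f"recovered after {attempt} attempt(s)"
    escalate()
    return "escalated"
```

The verification step is what separates this from a blind restart script: the procedure reports success only when telemetry is confirmed to be flowing again, and a human is still paged when automation cannot fix the problem.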

# Automated Telemetry Recovery — Kubernetes Job triggered by AlertManager webhook
apiVersion: batch/v1
kind: Job
metadata:
  name: telemetry-recovery-{{ substation_id }}
  namespace: grid-ops
  labels:
    trigger: alert-automation
    domain: ot-it-pipeline
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      serviceAccountName: grid-automation-sa
      containers:
        - name: recovery-controller
          image: grid-ops/pipeline-recovery:v2.1.0
          env:
            - name: SUBSTATION_ID
              value: "{{ substation_id }}"
            - name: RECOVERY_MODE
              value: "gateway-restart"
            - name: VERIFY_TIMEOUT_SECONDS
              value: "90"
            - name: ESCALATE_ON_FAILURE
              value: "true"    # Page on-call if automated recovery fails
            - name: SPLUNK_HEC_URL
              valueFrom:
                secretKeyRef:
                  name: splunk-hec-creds
                  key: url

NERC CIP Compliance as an SLO Problem

NERC CIP standards define mandatory reliability and security requirements for bulk power system operators. The dominant industry approach is documentation-first: maintain records sufficient to demonstrate compliance during audits. This is a lagging, manual process that is expensive to maintain and provides limited operational value between audit cycles.

The SRE reframing is to treat compliance requirements as SLOs with continuous automated verification rather than periodic manual attestation. CIP-010 requires detection of unauthorised configuration changes — this is a drift detection requirement that GitOps tooling implements as a built-in operational posture, not a compliance add-on.

# Argo CD Application — Grid Monitoring Stack
# GitOps enforces CIP-010 configuration change management automatically:
# every configuration change is a git commit, every drift is detected,
# and the remediation path (sync) is the compliance record.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grid-observability-stack
  namespace: argocd
  annotations:
    # CIP-010 audit trail: all sync events logged to Splunk via webhook
    notifications.argoproj.io/subscribe.on-sync-succeeded.splunk: "grid-cip-compliance"
    notifications.argoproj.io/subscribe.on-sync-failed.splunk: "grid-cip-compliance"
    notifications.argoproj.io/subscribe.on-health-degraded.splunk: "grid-cip-compliance"
spec:
  project: grid-operations
  source:
    repoURL: https://git.internal/grid-ops/observability-config
    targetRevision: main
    path: clusters/grid-control/monitoring
  destination:
    server: https://tkg-grid-control.internal:6443
    namespace: grid-monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # Drift auto-remediated: CIP-010 compliance continuous

The self-healing sync policy is not just an operational convenience — it is a continuous compliance assertion. The git commit history, Argo CD sync log, and Splunk audit trail together constitute a CIP-010 compliance record that is richer and less labour-intensive to maintain than the documentation-first approach most utilities currently employ.
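
The underlying idea of drift detection is small enough to sketch directly. This is an illustration of the concept, not how Argo CD implements it: the function names and the flat-dictionary configuration shape are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

_MISSING = object()  # sentinel so an absent key differs from a key set to None


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 fingerprint of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(desired: Dict[str, Any], live: Dict[str, Any]) -> List[str]:
    """Keys whose live value differs from the desired (version-controlled) value."""
    return sorted(
        key
        for key in desired.keys() | live.keys()
        if desired.get(key, _MISSING) != live.get(key, _MISSING)
    )
```

Comparing fingerprints answers "has anything drifted" cheaply. Listing the drifted keys gives the detail an auditor would ask for, and an empty list on every check is the continuously verified form of the requirement.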

Applying Multi-Window Burn Rate Alerting to Grid Frequency Events

Grid frequency management operates on timescales that map precisely to the multi-window burn rate alerting model. Primary frequency response operates in the 0–30 second window. Secondary response (AGC) operates in the 30-second to 10-minute window. Tertiary response operates in the 10-minute to 60-minute window.

This layered response hierarchy is structurally identical to the 14.4×/6×/3×/1× burn rate model: different urgency thresholds triggering different response actors with different response times, calibrated to the rate at which the budget is being consumed.
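
For readers less familiar with the software side of the analogy, the burn rate calculation is sketched below. The function names are hypothetical. The default threshold of 14.4 is the fast-burn figure from the Google SRE Workbook example, where it corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Multi-window rule: page only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Lower thresholds over longer windows follow the same shape and route to slower responders, which is the part that mirrors the primary, secondary, and tertiary tiers of frequency response.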

# Grid Frequency — Burn Rate Equivalent Alerting
# NERC BAL-003 sets each Balancing Authority's frequency response obligation;
# primary frequency response is expected to act within seconds of a deviation

groups:
  - name: grid.frequency.alerts
    rules:

      # CRITICAL: Under-Frequency Load Shedding imminent
      # Frequency < 59.3 Hz AND declining (negative slope over the last 60s)
      - alert: GridFrequency_Critical_UFLS
        expr: |
          grid_frequency_hz < 59.3
          and
          deriv(grid_frequency_hz[60s]) < 0
        for: 0s    # No pending period: fire immediately, no false-positive tolerance
        labels:
          severity: critical
          response_tier: primary
        annotations:
          summary: >
            Grid frequency {{ $value }} Hz and declining — UFLS arming imminent

      # PAGE: Secondary response required
      # Frequency 59.3–59.7 Hz: primary response engaged, AGC correction needed
      - alert: GridFrequency_Page_SecondaryResponse
        expr: |
          grid_frequency_hz < 59.7
          AND
          grid_frequency_hz >= 59.3
        for: 30s
        labels:
          severity: page
          response_tier: secondary

      # TICKET: Sustained deviation requiring operator review
      - alert: GridFrequency_Ticket_TertiaryReview
        expr: |
          abs(grid_frequency_hz - 60.0) > 0.1
        for: 5m
        labels:
          severity: ticket
          response_tier: tertiary

Target-State Observability Architecture

────────────────────────────────────────────────────────────────────────────
LAYER              GRID EQUIVALENT            SRE EQUIVALENT
────────────────────────────────────────────────────────────────────────────
Physical           IEDs, PMUs, RTUs,          Application instrumentation
Instrumentation    smart meters               (OTel SDK, Prometheus client)

Protocol           DNP3/IEC61850 →            OpenTelemetry Collector
Translation        MQTT/gRPC gateway          protocol normalisation

Streaming          Kafka / event broker       OTLP metrics/trace pipeline
Transport

Time-Series        Historian (OSIsoft PI,     Prometheus / Thanos
Storage            Emerson Ovation)

Log Aggregation    Splunk Enterprise          Splunk Enterprise
                   (SCADA events, relay       (application + audit logs)
                   records, CIP trails)

Analysis           EMS / DMS analytics        Grafana / Splunk dashboards
Platform                                      SLO burn rate views

Alerting           Upgraded alarm mgmt        Prometheus Alertmanager
                   (SLO-aware)                with burn rate rules

Automation         SCADA automated            Kubernetes controllers,
Response           switching sequences        event-driven remediations
────────────────────────────────────────────────────────────────────────────

A unified Splunk deployment that ingests SCADA event streams, protection relay operation records, CIP audit logs, and control system application logs creates the cross-domain correlation capability that is the difference between detecting individual anomalies and understanding cascading failure chains before they propagate.

Common Antipatterns

The Alarm Flood antipattern → Grid control centres routinely operate with hundreds of active alarms in normal conditions. Operators learn to filter by experience rather than by signal quality. Every alarm must trace to one of the Four Golden Signal categories and must have a defined response action. Alarms without response actions are not alarms; they are noise.
The SCADA-as-Source-of-Truth antipattern → Treating the SCADA display as ground truth rather than a model that must be continuously validated. A SCADA system that has lost communication with a substation will often display the last known state rather than an explicit unknown indicator — creating exactly the situational awareness gap that preceded the 2003 blackout.
The Compliance-as-Observability antipattern → Instrumenting grid systems to satisfy CIP audit requirements rather than to maximise operational situational awareness. These goals overlap but are not identical. CIP drives documentation of security events; operational observability requires telemetry completeness, latency minimisation, and cross-domain correlation that compliance frameworks do not mandate.
The OT/IT Separation antipattern → Maintaining strict organisational separation between OT operations and IT/SRE teams, preventing the application of modern observability practices to grid control systems. The security rationale for network segmentation is valid; the operational rationale for organisational siloing is not.
The Event-Driven-Only Observability antipattern → Relying solely on discrete event logs without continuous time-series telemetry at the control system layer. Event logs capture what happened; time-series telemetry captures the leading indicators of what is about to happen.
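
The SCADA-as-Source-of-Truth antipattern above has a simple structural remedy: never render a value without checking its age. A minimal sketch follows, with a hypothetical `PointReading` record and an assumed 60-second staleness limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PointReading:
    value: float        # last measured value, for example kV or MW
    received_at: float  # Unix time at which the reading arrived


def displayed_value(reading: Optional[PointReading], now: float,
                    max_age_s: float = 60.0) -> str:
    """Text to show an operator: the value if fresh, otherwise UNKNOWN.

    A missing or stale reading is never rendered as if it were current.
    """
    if reading is None or now - reading.received_at > max_age_s:
        return "UNKNOWN"
    return f"{reading.value:.1f}"
```

Showing UNKNOWN is less comfortable than showing the last known value, but it keeps the display honest about what the system actually knows.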

Maturity Progression

────────────────────────────────────────────────────────────────────────────
STAGE        GRID OBSERVABILITY STATE            NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     SCADA alarms threshold-based.       Operators filter noise
             Alarm flooding common.              by experience, not design.
             OT/IT data in silos.

Defined      Four Golden Signals instrumented    SLIs defined for state
             at control system layer.            estimation, SCADA
             OT/IT pipeline has SLIs.            commands, comms.

Measured     SLOs established with error         Burn rate alerts replace
             budgets. DORA metrics applied       threshold alerts. CIP
             to control system changes.          compliance via GitOps.

Optimised    Automated pipeline recovery.        Cross-domain Splunk
             Model-driven switching orders.      correlation detects
             AGC/EMS performance SLO-gated.      cascade precursors.
                                                 MTTR < 15 minutes.

Generative   Grid observability platform         Development teams for
             shared across OT and IT.            EMS/SCADA own their SLOs.
             PMU-based wide-area monitoring      N-1 contingency analysis
             SLO-anchored.                       automated.
────────────────────────────────────────────────────────────────────────────

Five Action Items for This Week

Map your grid control systems to the Four Golden Signals framework. For each critical system (EMS, DMS, SCADA, outage management), identify which metrics correspond to Latency, Traffic, Errors, and Saturation. The mapping exercise itself surfaces gaps in current instrumentation.
Instrument your OT/IT data pipeline as a first-class service. Define an SLI for telemetry completeness and pipeline latency. The pipeline carrying substation data to your EMS is more reliability-critical than most services your organisation has SLOs for — and it is almost certainly running without them.
Audit your alarm rationalisation state against the Four Golden Signals. Count how many active alarms in your control centre do not trace to a specific Golden Signal category. Any alarm without a defined response action is a candidate for suppression. Alarm count reduction is an operational safety improvement.
Reframe one CIP compliance requirement as a continuously verified SLO. Pick CIP-010 (configuration change management) or CIP-007 (security event logging) and identify the SLI that would express that requirement as a continuously monitored objective rather than a periodic audit artefact.
Identify the top three manual toil categories in your control centre operations. Switching order preparation, shift handover documentation, and reliability metric reporting are the most common high-toil categories. Quantifying them in operator-hours per month creates the business case for automation investment that operations leadership can act on.

"The 2003 Northeast blackout did not fail for lack of sensors. It failed for lack of observability — the ability to ask questions the designers had not anticipated, about a failure mode they had not modelled, in time to intervene. The power sector has spent two decades strengthening its physical infrastructure since that day. The software layer that mediates between the physical grid and the humans who operate it deserves the same rigour. Google SRE built that rigour for the internet. The grid needs it now."

What Comes Next

The energy grid is the most visible critical infrastructure use case for SRE observability principles, but it is not the only one. Financial services present a different set of constraints — sub-millisecond latency requirements, regulatory reporting obligations, and systemic risk considerations that raise the stakes of error budget decisions beyond any single institution's boundaries. The next post examines how SRE error budgets quantify the hidden economic cost of downtime and why managing that cost is a matter of national economic infrastructure, not just engineering performance.
