Nijo George Payyappilly

Posted on Jun 1

Beyond DORA: A Five-Metric Framework for SRE Maturity in Regulated Enterprises

#sre #devops #productivity #reliability

The DORA research programme is the most rigorous empirical study of software delivery performance ever conducted. Its four key metrics — Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore — have done more to give engineering organisations a common performance vocabulary than any other framework in the discipline's history. If you work in software and you have not read the State of DevOps Report, stop and read it before finishing this paragraph.

Now: the DORA Four were derived primarily from organisations with cloud-native architectures, on-demand deployment infrastructure, and relatively unconstrained ability to release software when it is ready. The research cohort skews toward technology companies that have already made the cultural and architectural investments that make high-frequency, low-risk deployment possible.

This is not a criticism of the research. It is an observation about its generalisability — and it has a specific consequence for practitioners who work in regulated enterprises: banks, healthcare systems, utilities, insurance carriers, government agencies. In these environments, the DORA Four are necessary but structurally insufficient. They measure the delivery pipeline accurately. They do not measure the operational sustainability of the team running that pipeline — and in regulated enterprises, operational sustainability is where SRE programmes go to die quietly, years before anyone realises the damage is permanent.

This post proposes a fifth metric. Not to replace the DORA Four, but to complete them — to close the measurement gap that leaves regulated enterprise SRE teams flying blind on the dimension that most reliably predicts long-term programme failure.

What the DORA Four Measure and What They Do Not

Before proposing an extension, the limitations deserve precise characterisation. Imprecise criticism of a well-validated framework is noise. The limitations described here are structural — arising from the design scope of the DORA research — and specific to the regulated enterprise context.

Deployment Frequency in Regulated Environments

DORA defines elite performance as on-demand deployment, multiple times per day. In regulated environments, this benchmark is structurally unachievable for reasons that have nothing to do with engineering capability. Change Advisory Board processes exist. Regulatory change freeze windows exist — financial institutions freeze changes around year-end, tax season, and quarterly reporting periods. Healthcare systems freeze around Joint Commission accreditation cycles. Utilities freeze around NERC CIP audit windows.

A regulated enterprise that deploys weekly, not because its engineering is poor but because a mandatory weekly CAB review cycle exists, is capped below the Elite cohort on Deployment Frequency however well it executes. That classification is accurate relative to the DORA benchmark. It is misleading as a diagnostic of SRE maturity, because it conflates regulatory compliance overhead with engineering capability.

The metric that would actually be useful here is deployment frequency normalised to available deployment windows: how often does the organisation deploy relative to how often it is permitted to deploy? An organisation that deploys on every available window is performing at elite level within its constraints, regardless of where that frequency sits in the absolute DORA distribution.
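As a minimal sketch of that normalisation, assume the team can count how many deployment windows were permitted in a period and how many of them carried at least one deployment. The function name and the example figures below are hypothetical and are not part of the DORA definitions.

```python
def window_utilisation(windows_available: int, windows_used: int) -> float:
    """Fraction of permitted deployment windows that carried a deployment."""
    if windows_available <= 0:
        raise ValueError("windows_available must be positive")
    if not 0 <= windows_used <= windows_available:
        raise ValueError("windows_used must be between 0 and windows_available")
    return windows_used / windows_available


# Hypothetical quarter: 13 weekly CAB windows, 2 lost to a change freeze,
# and deployments shipped in 10 of the remaining 11.
available = 13 - 2
used = 10
print(f"{window_utilisation(available, used):.0%}")  # 91%
```

Counting windows rather than raw deployments keeps freeze periods out of the denominator, so a freeze does not register as an engineering shortfall.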

Lead Time for Changes in Regulated Environments

DORA's Lead Time measures commit to production deployment. In cloud-native environments, this is dominated by CI/CD pipeline execution. In regulated enterprises, it is frequently dominated by CAB review cycle time, regulatory approval lead time, and documentation preparation overhead.

A team with a two-day CI/CD pipeline and a five-day CAB review cycle has a seven-day lead time. Halving the CI/CD pipeline reduces total lead time by 14%. Halving the CAB review cycle reduces total lead time by 36%. But the DORA metric provides no signal about which investment yields the larger return, because it does not decompose lead time into its technical and process components.
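The arithmetic behind those two percentages can be checked in a few lines. This is only a worked example of the figures in the paragraph above; the helper name is hypothetical.

```python
def reduction_from_halving(component_days: float, total_days: float) -> float:
    """Share of total lead time removed by halving one component."""
    return (component_days / 2) / total_days


technical_days = 2.0  # CI/CD pipeline
process_days = 5.0    # CAB review cycle
total_days = technical_days + process_days

print(round(reduction_from_halving(technical_days, total_days) * 100))  # 14
print(round(reduction_from_halving(process_days, total_days) * 100))    # 36
```

The same calculation works for any decomposition: the component with the larger share of total lead time always offers the larger return from a proportional improvement.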

Change Failure Rate in Regulated Environments

DORA's CFR measures the percentage of changes requiring remediation after deployment. In regulated environments, this definition has a gap: it captures technical failures but not compliance failures. A change that deploys without technical error but violates a data residency requirement, triggers a regulatory notification obligation, or creates an audit finding is a failure by a name DORA does not have. In regulated enterprises, compliance failures are often more expensive than technical failures — they generate regulatory scrutiny, potential fines, and mandatory remediation programmes.
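One way to make the gap measurable is to record two outcome flags per change and compute two rates from the same denominator. The sketch below assumes a hypothetical `Change` record; the field names are illustrative and would need mapping onto whatever your change management system actually stores.

```python
from dataclasses import dataclass


@dataclass
class Change:
    change_id: str
    technical_failure: bool   # required remediation after deployment
    compliance_failure: bool  # e.g. audit finding or notification obligation


def change_failure_rates(changes: list) -> tuple:
    """Return (technical CFR, compliance CFR) over the same set of changes."""
    if not changes:
        raise ValueError("at least one change is required")
    total = len(changes)
    technical = sum(c.technical_failure for c in changes) / total
    compliance = sum(c.compliance_failure for c in changes) / total
    return technical, compliance


# Hypothetical sample: CHG-3 deployed cleanly but breached a compliance rule.
changes = [
    Change("CHG-1", technical_failure=False, compliance_failure=False),
    Change("CHG-2", technical_failure=True, compliance_failure=False),
    Change("CHG-3", technical_failure=False, compliance_failure=True),
    Change("CHG-4", technical_failure=False, compliance_failure=False),
]
print(change_failure_rates(changes))  # (0.25, 0.25)
```

A change can carry both flags, so the two rates are reported separately rather than summed.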

Mean Time to Restore in Regulated Environments

DORA's MTTR measures time from service degradation to restoration. In regulated environments, restoration is not the end of the timeline; it is the beginning of the compliance timeline. A financial institution that restores service in twelve minutes may still have to notify its primary federal regulator within 36 hours of determining that a notification incident has occurred (under the US banking agencies' computer-security incident notification rule, which the OCC applies to national banks), document root cause within ten days, and potentially submit a formal incident report.

More critically: in regulated environments, the fastest remediation path is not always the permitted path. Rolling back a database schema change may restore service in minutes but create a compliance audit gap. The DORA MTTR reflects not engineering capability but the friction between technical and compliance requirements — and the metric provides no visibility into which is the binding constraint.
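To see both timelines side by side, each incident needs three timestamps instead of two. The following is a minimal sketch with made-up incidents; the tuple layout is an assumption, and it uses calendar time, so a business-day calendar would have to be added for regulatory deadlines expressed in business days.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(intervals: list) -> timedelta:
    """Average length of (start, end) datetime pairs."""
    seconds = [(end - start).total_seconds() for start, end in intervals]
    return timedelta(seconds=mean(seconds))


# Hypothetical incidents: (degraded_at, restored_at, compliance_closed_at)
incidents = [
    (datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 9, 12), datetime(2025, 1, 15, 9, 0)),
    (datetime(2025, 2, 3, 14, 0), datetime(2025, 2, 3, 14, 34), datetime(2025, 2, 6, 14, 0)),
]

technical_mttr = mean_duration([(d, r) for d, r, _ in incidents])
regulatory_mttr = mean_duration([(d, c) for d, _, c in incidents])

print(technical_mttr)   # 0:23:00
print(regulatory_mttr)  # 4 days, 0:00:00
```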

The structural gap: The DORA Four measure the delivery pipeline and its production consequences. They do not measure the operational sustainability of the team executing that pipeline — the ratio of engineering investment to operational burden that determines whether an SRE programme compounds in capability over time or slowly collapses under the weight of its own toil.

The Fifth Metric: Toil Ratio

Google SRE defines toil precisely: manual, repetitive, automatable work that scales linearly with service growth and produces no enduring improvement to service reliability. Responding to a recurring alert whose remediation is always the same sequence of commands is toil. Manually rotating credentials on a quarterly compliance schedule is toil. Preparing CAB documentation for a deployment that has been executed identically fifty times is toil.

The Toil Ratio is the fraction of operational time consumed by toil work:

─────────────────────────────────────────────────────────────────────────────
TOIL RATIO DEFINITION

  Toil Ratio = Toil Hours / Total Operational Hours

  Where:
    Toil Hours              = Time spent on manual, repetitive, automatable work
                              that scales with service growth and produces no
                              enduring reliability improvement

    Total Operational Hours = Toil Hours + Engineering Hours
                              (Engineering Hours = automation, tooling, reliability
                              work, observability: work that compounds over time)

  Target (Google SRE):             ≤ 0.50
  Regulated Enterprise Target:     ≤ 0.40
  (Stricter because compliance obligations that fall outside the toil
  definition still consume team capacity, so the effective engineering
  headroom is already reduced)

─────────────────────────────────────────────────────────────────────────────
TOIL CATEGORIES IN A REGULATED ENTERPRISE:

  Operational toil:
    ✓ Recurring alert response with identical remediation steps
    ✓ Manual deployment steps not yet automated in CI/CD
    ✓ On-call handover documentation compiled manually
    ✓ Capacity reporting assembled manually from monitoring platforms

  Compliance toil:
    ✓ CAB documentation for low-risk, high-frequency changes
    ✓ Quarterly access review execution (manual steps)
    ✓ Evidence collection for audit requests not yet automated
    ✓ Change freeze exception requests for standard changes

  Governance toil:
    ✓ Manual SLO report generation for leadership review
    ✓ DORA metric calculation from raw data (not yet automated)
    ✓ Incident timeline reconstruction for postmortems

  NOT toil (engineering work that compounds):
    ✗ Writing the automation that eliminates the manual deployment step
    ✗ Building the alert runbook automation
    ✗ Implementing the SLO dashboard that replaces the manual report
─────────────────────────────────────────────────────────────────────────────
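The definition above reduces to a short calculation once hours are tagged by category. This sketch assumes hours are logged under four hypothetical category names that mirror the toil categories listed; the figures are illustrative only.

```python
TOIL_CATEGORIES = {"toil-operational", "toil-compliance", "toil-governance"}
ENGINEERING_CATEGORIES = {"engineering"}
REGULATED_TARGET = 0.40


def toil_ratio(hours_by_category: dict) -> float:
    """Toil hours divided by (toil hours + engineering hours)."""
    toil = sum(h for c, h in hours_by_category.items() if c in TOIL_CATEGORIES)
    engineering = sum(h for c, h in hours_by_category.items() if c in ENGINEERING_CATEGORIES)
    total = toil + engineering
    if total == 0:
        raise ValueError("no operational hours recorded")
    return toil / total


# Hypothetical quarter for one team.
quarter = {
    "toil-operational": 60.0,
    "toil-compliance": 70.0,
    "toil-governance": 46.0,
    "engineering": 224.0,
}
ratio = toil_ratio(quarter)
print(f"{ratio:.2f}", "BREACH" if ratio > REGULATED_TARGET else "OK")  # 0.44 BREACH
```

Compliance and governance toil sit in the numerator alongside operational toil, which is what keeps "required" work visible in the ratio.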

Why Toil Ratio Predicts Regulated Enterprise SRE Programme Failure

The SRE programme failure mode in regulated enterprises is almost never a dramatic collapse. It is a slow, invisible accumulation of toil that crowds out engineering work over two to four years, until the team's posture has regressed from proactive reliability engineering back to reactive firefighting — under a different organisational label, with better job titles, but with the same fundamental dynamic that SRE was introduced to replace.

The mechanism is straightforward. Regulated enterprises impose compliance obligations — audit evidence collection, change documentation, access reviews, regulatory reporting — that generate toil linearly with service count and team size. An SRE team that does not explicitly manage its Toil Ratio will find that compliance toil expands to fill available capacity, leaving progressively less engineering time for the automation investment that would contain the toil growth. Each quarter, toil occupies a slightly larger fraction of team capacity. Each quarter, the automation investment that could reverse the trend is slightly smaller.

The DORA Four provide no warning signal for this failure mode. A team in the middle stages of toil accumulation may still show healthy Deployment Frequency, acceptable Lead Time, reasonable CFR, and adequate MTTR — performing well on every DORA dimension even as its long-term engineering capability is being quietly consumed by the toil ratchet.

The Toil Ratio makes the ratchet visible.
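The ratchet can be illustrated with a deliberately simple toy model. Everything in it is an assumption made for illustration: fixed team capacity, toil demand that grows by a fixed percentage each quarter, and a fixed number of recurring toil hours removed per hour of automation work. It is not calibrated against real data.

```python
def simulate_toil_ratio(
    quarters: int,
    capacity_hours: float,
    initial_toil_hours: float,
    toil_growth: float,
    automation_share: float,
    hours_saved_per_automation_hour: float,
) -> list:
    """Return the toil ratio at the start of each quarter under the toy model."""
    toil = initial_toil_hours
    ratios = []
    for _ in range(quarters):
        toil = min(toil, capacity_hours)      # toil cannot exceed team capacity
        engineering = capacity_hours - toil   # whatever is left after toil
        ratios.append(toil / capacity_hours)
        automation_hours = engineering * automation_share
        saved = automation_hours * hours_saved_per_automation_hour
        toil = max(toil * (1 + toil_growth) - saved, 0.0)
    return ratios


# Same starting point (38% toil), with and without automation investment.
unmanaged = simulate_toil_ratio(8, 1000.0, 380.0, 0.08, 0.0, 0.25)
managed = simulate_toil_ratio(8, 1000.0, 380.0, 0.08, 0.3, 0.25)

for quarter, (u, m) in enumerate(zip(unmanaged, managed), start=1):
    print(f"Q{quarter}: unmanaged {u:.0%}  managed {m:.0%}")
```

In this model the unmanaged ratio climbs every quarter while the managed one falls, and the gap widens over time because less toil leaves more engineering hours available for further automation.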

The Complete Five-Metric Framework

─────────────────────────────────────────────────────────────────────────────
THE FIVE-METRIC SRE MATURITY FRAMEWORK FOR REGULATED ENTERPRISES
─────────────────────────────────────────────────────────────────────────────

METRIC 1: DEPLOYMENT FREQUENCY (DORA)
  RE-Adjusted: Deployments per available deployment window
               Elite: ≥ 90% of available windows used

METRIC 2: LEAD TIME FOR CHANGES (DORA)
  RE-Adjusted: Decomposed into:
               → Technical lead time (commit to deployable artefact)
               → Process lead time  (artefact to production)
               Elite: technical < 1 hour; process < 2 business days

METRIC 3: CHANGE FAILURE RATE (DORA)
  RE-Adjusted: Extended to:
               → Technical CFR     (production incidents from changes)
               → Compliance CFR    (changes triggering compliance findings)
               Elite: technical < 5%; compliance = 0%

METRIC 4: MEAN TIME TO RESTORE (DORA)
  RE-Adjusted: Decomposed into:
               → Technical MTTR    (degradation to service restoration)
               → Regulatory MTTR   (incident to closed compliance obligation)
               Elite: technical < 30 min; regulatory < 5 business days

METRIC 5: TOIL RATIO (NEW)
  Definition:  Toil hours / total operational hours per sprint/quarter
  Target:      ≤ 0.40 for regulated enterprise SRE teams
  Elite:        ≤ 0.25 (automation-first posture fully operational)
  Measures:    Operational sustainability and long-term programme health
               — the leading indicator of SRE programme degradation
               that DORA does not capture

─────────────────────────────────────────────────────────────────────────────
FRAMEWORK PROPERTY: The five metrics form a causal chain that closes into a feedback loop.

  Toil Ratio → Deployment Frequency   (high toil crowds out deployment automation)
  Toil Ratio → Lead Time              (high compliance toil extends process lead time)
  Lead Time  → Change Failure Rate    (longer lead time = larger batch = higher risk)
  CFR        → MTTR                   (higher failure rate = more complex recovery)
  All four   → Toil Ratio             (poor pipeline health generates more toil)
─────────────────────────────────────────────────────────────────────────────

Measuring the Toil Ratio: Implementation

Toil Ratio measurement requires categorising time, which most engineering organisations do not do systematically. The measurement approach must be lightweight enough to not itself become toil — a real failure mode when instrumentation overhead exceeds the value of the signal it produces.

The recommended approach: categorical tagging of operational work at the sprint level, combined with automated extraction of time signals from existing tooling where possible.

# Toil Ratio from Linear sprint data via Prometheus exporter
# Linear issue labels:
#   sre/toil-operational     — alert response, manual remediation
#   sre/toil-compliance      — audit evidence, CAB docs, access reviews
#   sre/toil-governance      — manual reports, status updates
#   sre/engineering          — automation, tooling, reliability improvements

groups:
  - name: sre.toil_ratio
    rules:

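      # Toil ratio per sprint. Story points stand in for hours here.
      # The regex matcher assumes the exporter sets `category` to "toil" or to
      # a toil-* value (toil-operational, toil-compliance, toil-governance).
      - record: sre:toil_ratio:per_sprint
        expr: |
          sum(sre:sprint_points_completed:by_category{category=~"toil.*"})
          /
          sum(sre:sprint_points_completed:by_category)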

      # Rolling 90-day toil ratio (quarterly reporting view):
      # the mean of the per-sprint ratio samples in the window
      - record: sre:toil_ratio:rolling_90d
        expr: avg_over_time(sre:toil_ratio:per_sprint[90d])

      # Alert: breach of regulated enterprise target
      - alert: ToilRatio_PolicyBreach
        expr: sre:toil_ratio:rolling_90d > 0.40
        for: 1d
        labels:
          severity: ticket
          domain: sre_sustainability
        annotations:
          summary: >
            SRE toil ratio at {{ $value | humanizePercentage }} over rolling
            90 days — exceeds 40% regulated enterprise target.
            Programme sustainability risk: engineering capacity being displaced.

Automated toil detection from incident data catches what sprint tagging misses — the alert at 2 AM, the Slack message requiring immediate manual intervention. These appear in on-call tools and can be extracted without relying on disciplined categorisation.

-- Splunk SPL: Recurring incidents with identical remediation patterns
-- High recurrence on a single runbook = toil category candidate

index=incidents sourcetype=pagerduty
| stats
    count as occurrence_count,
    avg(time_to_resolve_minutes) as avg_ttm
    by alert_name, runbook_url
| where occurrence_count > 3
| eval toil_score = occurrence_count * avg_ttm
| sort -toil_score
| table alert_name, occurrence_count, avg_ttm, toil_score, runbook_url
| head 20

-- Output: ranked list of alerts by toil burden (occurrence × avg time)
-- Top entries are automation investment candidates, ranked by ROI

-- Splunk SPL: Compliance toil detection
-- Deployments that required manual CAB override despite passing automated gates

index=argocd sourcetype=argocd:audit action=sync status=Succeeded
| join deployment_id [
    search index=cab_system sourcetype=cab:decisions
    | where decision_type="exception_override"
    | rename deployment_ref as deployment_id
  ]
| eval week_of_year = strftime(_time, "%Y-W%V")
| stats count as override_count, values(application_name) as services
    by week_of_year
| eval signal = "CAB exception for automated-gate-passed deployment"

-- High counts signal CAB process not calibrated to trust automated gates:
-- a governance design problem that generates compliance toil visible
-- only through the Toil Ratio metric.
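For teams without Splunk, the recurrence scoring in the first query above can be approximated over any incident export. This is a sketch, not a port: it assumes incidents arrive as dictionaries carrying the same three field names the query uses, and the sample data is invented.

```python
from collections import defaultdict


def rank_toil_candidates(incidents: list, min_occurrences: int = 4, limit: int = 20) -> list:
    """Group incidents by (alert, runbook) and rank by occurrences x average minutes."""
    grouped = defaultdict(list)
    for incident in incidents:
        key = (incident["alert_name"], incident["runbook_url"])
        grouped[key].append(incident["time_to_resolve_minutes"])

    rows = []
    for (alert_name, runbook_url), durations in grouped.items():
        count = len(durations)
        if count < min_occurrences:  # same cut-off as `where occurrence_count > 3`
            continue
        avg_ttm = sum(durations) / count
        rows.append({
            "alert_name": alert_name,
            "runbook_url": runbook_url,
            "occurrence_count": count,
            "avg_ttm": avg_ttm,
            "toil_score": count * avg_ttm,
        })
    rows.sort(key=lambda row: row["toil_score"], reverse=True)
    return rows[:limit]


# Hypothetical export: two recurring alerts and one rare one.
sample = (
    [{"alert_name": "disk_full", "runbook_url": "rb/disk", "time_to_resolve_minutes": 10}] * 5
    + [{"alert_name": "cert_expiry", "runbook_url": "rb/cert", "time_to_resolve_minutes": 30}] * 4
    + [{"alert_name": "rare_alert", "runbook_url": "rb/rare", "time_to_resolve_minutes": 90}] * 2
)
for row in rank_toil_candidates(sample):
    print(row["alert_name"], row["occurrence_count"], row["toil_score"])
```

Because the score is occurrences multiplied by average duration, it equals the total minutes spent on that alert in the period, which is the quantity automation would recover.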

Regulatory Alignment

The five-metric framework's regulated enterprise extensions align with the operational resilience expectations being codified by financial regulators globally.

────────────────────────────────────────────────────────────────────────────
REGULATORY REQUIREMENT                    FIVE-METRIC MAPPING
────────────────────────────────────────────────────────────────────────────
OCC SR 21-3:
  Defined recovery time objectives        Technical MTTR with SLO backing
  Continuous resilience monitoring        Toil Ratio + burn rate alerting
  Board risk appetite for op. risk        Five-metric quarterly report
  Change management governance            Deployment Frequency +
                                          Process Lead Time

EU DORA (Digital Operational              Compliance CFR (changes that
Resilience Act):                          create ICT risk events)
  ICT incident reporting                  Regulatory MTTR (time to
  (notify within 4 hours)                 closed regulatory obligation)

UK PRA Operational Resilience:
  Important Business Services             SLO per IBS + error budget
  with defined impact tolerances          → Technical MTTR and
                                          Deployment Frequency during
                                          impact tolerance windows

NERC CIP (energy sector):
  Configuration change management         Compliance CFR (unauthorised
  (CIP-010)                               config changes) + Argo CD
  Security event logging (CIP-007)        GitOps drift detection
────────────────────────────────────────────────────────────────────────────

(Note: EU DORA — the Digital Operational Resilience Act — and the DORA research programme share an acronym. The naming collision is real and worth knowing.)

The Quarterly Five-Metric Report

─────────────────────────────────────────────────────────────────────────────
SRE MATURITY REPORT: Q1 2025  |  Illustrative example
─────────────────────────────────────────────────────────────────────────────

METRIC 1: DEPLOYMENT FREQUENCY
  Raw:          2.3 deployments/week
  RE-Adjusted:  87% of available windows utilised
  Trend:        ↑ +12% vs Q4 2024
  Signal:       13% of windows unused due to late artefact readiness
                → pipeline optimisation opportunity

METRIC 2: LEAD TIME FOR CHANGES
  Technical:    4.2 hours (commit → deployable artefact)
  Process:      3.1 business days (artefact → production)
  Trend:        Technical ↓ 18% improving | Process ↑ 6% worsening
  Signal:       CI/CD optimisation working. CAB review cycle lengthening
                — governance overhead growing faster than technical gains.

METRIC 3: CHANGE FAILURE RATE
  Technical CFR:    4.2%
  Compliance CFR:   0.8%  ← TARGET: 0%
  Signal:           2 compliance findings from config drift in non-prod.
                    GitOps self-heal remediation gap identified.

METRIC 4: MEAN TIME TO RESTORE
  Technical MTTR:   23 minutes (median P1/P2)
  Regulatory MTTR:  4.2 business days
  Trend:            Technical ↓ improving (was 41 min Q4 2024)
  Signal:           Automated remediation covering 3 of top 5 categories.

METRIC 5: TOIL RATIO
  Q1:           44%  ← BREACH: target ≤ 40%
  Rolling 90d:  42%  ← BREACH
  Trend:        ↑ worsening (was 38% Q4 2024)
  Top sources:  (1) CAB documentation: 12 hrs/sprint
                (2) Manual SLO report generation: 8 hrs/sprint
                (3) Quarterly access review: 18 hrs/quarter
  Signal:       PROGRAMME SUSTAINABILITY RISK.
                Automation backlog for top 3 sources: ~40 engineering hours.
                ROI positive within one quarter.
                Recommend: Q2 reliability sprint allocation.

─────────────────────────────────────────────────────────────────────────────
OVERALL: 4 of 5 metrics at target or improving.
Toil Ratio breach is the leading risk indicator for Q2.
─────────────────────────────────────────────────────────────────────────────
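A report like this can be checked mechanically against the thresholds stated in the framework. The sketch below encodes the framework's Elite lines, plus the 0.40 Toil Ratio target, as comparisons and applies them to the illustrative Q1 figures. The dictionary keys are hypothetical names, and the check covers thresholds only, not trends.

```python
import operator

# metric key -> (comparison, threshold), taken from the framework definitions
TARGETS = {
    "window_utilisation":        (operator.ge, 0.90),
    "technical_lead_time_hours": (operator.lt, 1.0),
    "process_lead_time_days":    (operator.lt, 2.0),
    "technical_cfr":             (operator.lt, 0.05),
    "compliance_cfr":            (operator.le, 0.0),
    "technical_mttr_minutes":    (operator.lt, 30.0),
    "regulatory_mttr_days":      (operator.lt, 5.0),
    "toil_ratio":                (operator.le, 0.40),
}


def evaluate(report: dict) -> dict:
    """Map each metric key to True if the reported value meets its threshold."""
    return {
        name: compare(report[name], threshold)
        for name, (compare, threshold) in TARGETS.items()
    }


# Values copied from the illustrative Q1 2025 report.
q1_report = {
    "window_utilisation": 0.87,
    "technical_lead_time_hours": 4.2,
    "process_lead_time_days": 3.1,
    "technical_cfr": 0.042,
    "compliance_cfr": 0.008,
    "technical_mttr_minutes": 23.0,
    "regulatory_mttr_days": 4.2,
    "toil_ratio": 0.44,
}
misses = [name for name, ok in evaluate(q1_report).items() if not ok]
print(misses)
# ['window_utilisation', 'technical_lead_time_hours', 'process_lead_time_days',
#  'compliance_cfr', 'toil_ratio']
```

Keeping thresholds in one table makes it straightforward to version them alongside the policy that defines them.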

Implementation Sequence for Resistant Organisations

The framework is most valuable in precisely the organisations where it is hardest to introduce. The sequence matters as much as the framework itself — instrument before enforcing, make visible before gating, demonstrate value before demanding authority.

────────────────────────────────────────────────────────────────────────────
QUARTER 1 — Instrument Silently
  Deploy DORA metric collection against existing CI/CD and incident data.
  Begin sprint-level toil tagging (SRE team only, no external visibility).
  Build five-metric dashboard for SRE internal use only.
  Goal: Establish baseline without triggering governance resistance.

QUARTER 2 — Make Visible to Engineering Leadership
  Present five-metric baseline to Engineering VPs.
  Frame Toil Ratio breach as programme sustainability risk, not a metric.
  Propose one automation investment to address the top toil source.
  Goal: Create internal champions before external exposure.

QUARTER 3 — Extend to Compliance and Risk Functions
  Introduce Compliance CFR and Regulatory MTTR to the compliance team.
  Frame as tools that give the compliance function better visibility.
  Map framework to existing regulatory reporting obligations.
  Goal: Convert compliance function from obstacle to framework ally.

QUARTER 4 — Gate and Govern
  Implement automated Toil Ratio alerting.
  Propose Deployment Frequency gate tied to error budget policy.
  Present five-metric annual trend to Board Risk Committee.
  Goal: Framework is now a governance mechanism, not a dashboard.
────────────────────────────────────────────────────────────────────────────

The compliance function as the adoption path is the contrarian insight in this sequence. In regulated enterprises, compliance has the organisational authority to mandate measurement that engineering leadership does not. Framing the Compliance CFR and Regulatory MTTR as tools for the compliance team — which they genuinely are — converts what is typically the most resistant stakeholder into the most powerful adoption sponsor.

Common Antipatterns

The Toil Ratio Exemption antipattern → Excluding compliance and governance toil from measurement on the grounds that it is "required" and therefore not actionable. This is the most consequential measurement error in regulated enterprise SRE. Required toil is the most important toil to eliminate, because it is the most reliably growing.
The DORA Benchmark Absolutism antipattern → Comparing regulated enterprise Deployment Frequency against the DORA elite benchmark without the RE-adjustment and concluding the organisation is underperforming when it is deploying on every available window. This drives the wrong investment decisions — optimising CI/CD speed when the binding constraint is the CAB review cycle.
The Metric Collection Without Policy antipattern → Implementing all five metrics as dashboard data without the policy infrastructure that converts measurement into organisational behaviour. Five metrics nobody acts on is five times as much instrumentation overhead as one metric nobody acts on.
The Compliance CFR Undercount antipattern → Calculating Compliance CFR only from audit findings and regulatory notifications, missing near-misses. Near-miss tracking is the leading indicator that Compliance CFR is about to worsen.
The Toil Ratio Gaming antipattern → Teams reclassifying toil work as engineering work under pressure to meet the target. The anti-gaming control is to derive the Toil Ratio from two independent signals: sprint tagging (team-categorised) and automated incident data extraction (not easily reclassified). Divergence between the two signals is itself a diagnostic.
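One possible way to operationalise the two-signal control is to treat incident-derived toil hours as a floor for what the team tags as operational toil. The sketch below is an assumption about how that comparison could be done; the function name and the 10% tolerance are hypothetical.

```python
def tagging_shortfall(
    tagged_operational_toil_hours: float,
    incident_derived_toil_hours: float,
    tolerance: float = 0.10,
) -> tuple:
    """Return (shortfall fraction, flagged) comparing sprint tags to incident data.

    A positive shortfall means on-call tooling recorded more recurring-alert
    time than the team tagged as operational toil.
    """
    if incident_derived_toil_hours <= 0:
        return 0.0, False
    shortfall = (
        incident_derived_toil_hours - tagged_operational_toil_hours
    ) / incident_derived_toil_hours
    return shortfall, shortfall > tolerance


# Hypothetical sprint: 30 hours tagged, 48 hours visible in incident data.
shortfall, flagged = tagging_shortfall(30.0, 48.0)
print(f"{shortfall:.1%} flagged={flagged}")  # 37.5% flagged=True
```

A flag is a prompt for a conversation about categorisation, not proof of gaming: untagged interrupt work produces the same divergence.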

Maturity Progression

────────────────────────────────────────────────────────────────────────────
STAGE        FIVE-METRIC STATE                   NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     DORA Four not measured.             No baseline exists.
             Toil invisible. CFR                 Toil Ratio likely
             conflated with technical.           60–80% unmeasured.

Defined      DORA Four baselined.                Toil Ratio first
             Toil Ratio measured.                measured; likely breaches
             Lead Time decomposed.               40% on first observation.

Measured     All five metrics tracked            Compliance CFR and
             quarterly. RE-adjusted              Regulatory MTTR baselines
             benchmarks applied.                 established. Toil Ratio
             Toil Ratio alert active.            trend visible.

Optimised    Five-metric report is a             Toil Ratio ≤ 0.35.
             compliance artefact.                Compliance CFR = 0.
             Automated toil detection            Process Lead Time declining.
             drives backlog.

Generative   Framework shared across             Board Risk Committee
             industry peers. Regulatory          receives annual report.
             bodies reference framework.         Toil Ratio ≤ 0.25.
             Data contributed to DORA            Framework cited in
             research programme.                 regulatory guidance.
────────────────────────────────────────────────────────────────────────────

Five Action Items for This Week

Decompose your last quarter's Lead Time into technical and process components. Pull your CI/CD pipeline data and your change management system data. If the process fraction exceeds 50%, your next lead time investment belongs in governance process redesign, not pipeline optimisation. This is the most frequently misallocated investment in regulated enterprise SRE.
Run the Splunk toil detection query against your last 90 days of incident data. Sort by toil score and identify the top three recurring alerts. Those three are your Toil Ratio improvement backlog, ranked by ROI. If any can be automated in less than one sprint, make the case for immediate prioritisation — the payback period is measured in weeks.
Add Compliance CFR as a separate dimension to your next postmortem template. For every production incident in the next quarter, record whether it created any compliance obligation. Even if the count is zero, the act of asking consistently creates the measurement culture Compliance CFR requires.
Measure your Deployment Frequency against available deployment windows, not the DORA absolute benchmark. If your window utilisation is below 80%, the constraint is not pipeline capability; it is late artefact readiness — a different engineering problem with different solutions.
Present the five-metric framework to your compliance or risk function first, not to your engineering leadership. Frame it as a tool that gives them better visibility into operational risk than they currently have. In regulated enterprises, the fastest path to measurement adoption runs through the compliance function, because compliance has the organisational authority to mandate measurement that engineering leadership does not.

"DORA gave the industry a common language for delivery performance. It did not give regulated enterprises a language for operational sustainability — for the question of whether the team executing the delivery pipeline will still be able to do so in three years without burning out, regressing to firefighting, or accumulating the kind of invisible toil debt that compounds silently until the programme it was supposed to protect has already failed. The Toil Ratio is that language. Measure it before you need it."

What Comes Next

The five-metric framework provides the measurement layer for SRE maturity assessment. But measurement without organisational strategy is data without leverage. The hardest problem in regulated enterprise SRE is not building the observability stack or implementing the error budget policy — it is earning the organisational trust and cross-functional authority to do those things in an environment designed to resist them. The next post examines the phased influence strategy: how to position SRE as a solution to pain that already exists, how to create the visible artefacts that build leadership credibility, and how to use the five-metric framework itself as the coalition-building tool that converts the compliance function from an obstacle into an ally.

DEV Community