Muskan

Posted on Jun 25 • Originally published at zop.dev

Self-healing vs. on-call closing the loop in under 90 seconds

#kubernetes #devops #finops #terraform

The On-Call Model Is Broken by Design

The on-call model fails at the architectural level, not the execution level. Paging a human, waiting for acknowledgment, and then diagnosing a live incident introduces latency that compounds into outage duration. That latency is not a process problem you can train away. It is structural.

A human on-call engineer moves through a fixed sequence: receive alert, wake up or context-switch, authenticate into tooling, read dashboards, form a hypothesis, act. Each step takes time. In aggregate, that sequence routinely exceeds 90 seconds before a single remediation command runs. Self-healing systems, by contrast, close the full incident loop in under 90 seconds because they skip the human latency chain entirely.

Alert fatigue compounds the problem

The mechanism is elimination, not acceleration.

Alert fatigue degrades response quality. Engineers who receive dozens of pages per week begin filtering signals. The cognitive cost of repeated 3 a.m. interruptions is not abstract. It produces slower acknowledgment times, higher rates of missed escalations, and eventual attrition.

When your fastest responder leaves, mean time to respond climbs for every incident that follows.

Runbook variance under pressure

Diagnosis latency is where outages grow. Detection is fast. Modern observability stacks fire alerts within seconds of a threshold breach. The gap lives between alert and action. A human must correlate logs, traces, and metrics under pressure, often across systems they did not build.

That correlation step alone exceeds the entire 90-second automated resolution window in most production environments.

Runbook execution is inconsistent by nature. Runbooks encode institutional knowledge, but humans execute them differently under stress. A step gets skipped. A flag is mistyped. The remediation takes three attempts instead of one.

Calculating the revenue exposure

Automation executes the same runbook identically every time, at machine speed, without the variance that pressure introduces.

The revenue exposure from this structural latency is real. An e-commerce platform processing USD 10,000 per minute loses that revenue for every minute the on-call loop stays open. The 90-second automated closure window is not a benchmark to admire. It is the ceiling your incident response process should be engineered to meet.

The fix is not a faster pager. It is removing the human from the loop for the class of incidents that automation resolves reliably. Start by auditing your last 30 days of incidents and tagging every one where the remediation step was a known runbook action. That set is your automation backlog.

What 'Closing the Loop' Actually Means

"Closing the loop" means one thing precisely: the system that detected the fault also resolved it, without transferring control to a human at any point in between.

Where the loop stays open

They have detection. They route alerts into Slack, fire PagerDuty notifications, and surface dashboards automatically. Some have gone further and automated diagnosis, correlating signals into a probable root cause. But they stop there.

A human still reads the diagnosis and decides whether to act. That handoff point is where the loop stays open, and an open loop is not self-healing. It is assisted triage.

Detection without action is monitoring, not remediation. An alert that fires in two seconds and sits unacknowledged for four minutes has not closed anything. The detection step is solved. The resolution step has not started. Treating these as equivalent is the most common mistake teams make when they claim to have automated incident response.

Three stages, one closed loop

Diagnosis without execution is a recommendation engine. A system that identifies the probable cause and surfaces it to an engineer has compressed the cognitive work. That compression has real value. But the loop is still open because execution depends on a human reading the output, agreeing with it, and issuing the remediation command. Under load, that agreement step introduces the same latency that full automation eliminates.

Execution without verification is incomplete closure. A system that restarts a crashed pod but does not confirm the service returned to a healthy state has acted without closing. True loop closure requires a post-remediation check: the same signal that triggered detection must return to a normal state before the incident is marked resolved.

The 90-second window referenced in self-healing system design is only achievable when all three stages, detection, diagnosis, and execution, run in sequence without a human handoff between any of them. On-call human response routinely exceeds that 90-second window because the handoff between detection and human acknowledgment alone consumes most of it.

Stage	Partial Automation	Closed Loop
Detection	Automated	Automated
Diagnosis	Automated	Automated
Execution	Human decision required	Automated
Verification	Manual or absent	Automated
Loop state	Open at execution	Closed at verification

Testing your own loop

The practical test is blunt: pick any incident from the past 30 days and trace where control transferred to a human. If that transfer happened before execution, the loop was open. Inventory those transfer points. Each one is a specific integration to build, not a process to improve.

The 90-Second Target: What It Takes to Hit It

Sub-90-second automated resolution is achievable only when three architectural preconditions are met simultaneously, and it breaks down the moment any one of them is absent.

The preconditions are not aspirational. They are mechanical. The 90-second window is the product of how long each stage takes when running without human involvement. Detection fires in two to five seconds on a well-tuned observability stack.

Automated diagnosis, pattern matching against known failure signatures, completes in under ten seconds for classified incident types. Automated execution of a remediation action, a pod restart, a circuit breaker trip, a traffic reroute, takes under five seconds. Verification, confirming the triggering signal returned to normal, adds another ten to fifteen seconds. The math closes inside 90 seconds only when every stage runs sequentially with no queue time between them.

Queue time breaks the math

Queue time is the silent budget killer. When a remediation action must wait for a human to read a diagnosis output, that wait is not measured in seconds. In production environments we measured, the median time between an automated diagnosis surfacing and an engineer issuing the remediation command was over four minutes during business hours. At 2 a.m., that number climbed further.

The 90-second target does not account for any queue. It assumes zero handoff latency between stages.

Signal classification must be pre-computed. The automation cannot spend time during an active incident deciding what kind of failure it is looking at. Classification logic must run continuously against the telemetry stream, tagging signals before an incident fires. If the system encounters an unclassified signal at incident time, it falls back to alerting a human, and the 90-second window is already gone.

Remediation actions must be atomic and bounded. A restart is atomic. A database schema migration is not. The 90-second target applies only to incidents where the remediation is a single, reversible, low-blast-radius action. Complex remediations that touch multiple services, require coordination across teams, or carry rollback risk above a defined threshold must route to human review.

Blast radius scoring model

Attempting to automate unbounded actions inside the 90-second window produces incorrect remediations that create secondary incidents.

The blast radius of each automated action must be scored in advance. We use a scoring model we call the Blast Radius Score, a numeric value from 1 to 10 assigned to every automated remediation action based on the number of downstream services affected, the reversibility of the action, and the historical false-positive rate of the triggering signal. Actions scoring 4 or below run automatically. Actions scoring 5 or above require human approval before execution. This threshold is not arbitrary.

It is calibrated by reviewing every automated remediation from the first 30 days of deployment and scoring the actual downstream impact.

Condition	Sub-90s Achievable	Reason
Classified signal, atomic action, no handoff	Yes	All stages run sequentially at machine speed
Unclassified signal	No	Classification delay exceeds available budget
Multi-service remediation required	No	Coordination latency breaks the sequential model
High Blast Radius Score action	No	Human approval gate reintroduces queue time
Post-remed

| Post-remediation verification absent | No | Loop stays open without signal normalization check |

Where automation scope ends

The claim breaks down specifically at the boundary between known and unknown failure modes. Self-healing automation resolves incidents it has seen before, or incidents that match a classified pattern with a bounded fix. Novel failures, the ones that have never triggered that signal combination in that environment, fall outside the automation envelope by definition. Routing those to a human is not a failure of the system.

It is the correct behavior of a well-scoped one.

On-call human response routinely exceeds the 90-second window because the structural latency of the human chain consumes the budget before remediation starts. That is the comparison the 90-second target is built against. But the target is not universal. It applies to the subset of incidents that are classifiable, bounded, and carry a Blast Radius Score low enough to execute without approval.

In our testing, that subset represented roughly two-thirds of total incident volume by count, but a much smaller fraction by severity. High-severity incidents, the ones with the largest revenue impact, were disproportionately novel or multi-service, and therefore outside the automation envelope. The practical implication is direct: automating the high-volume, low-severity tier frees on-call engineers to focus exclusively on the incidents that require human judgment. That reallocation is where the operational value concentrates.

The next concrete step is to pull your incident data from the past 90 days, filter for incidents where the remediation action was a single atomic operation, and score each one using your own blast radius criteria. That filtered list is the exact scope where the 90-second target is achievable today, without architectural changes to your remediation pipeline.

Where Self-Healing Falls Short

Self-healing automation fails at a predictable boundary: incidents that are novel, multi-service, or carry consequences that exceed what any pre-scored remediation action can safely absorb.

The previous sections established where automation succeeds. This one addresses where it does not, and the failure modes are structural, not incidental. Understanding them prevents the most costly mistake in self-healing system design: expanding the automation envelope past the point where the system's confidence in its own actions is justified.

Structural failure modes

Novel failure signatures. When a signal combination has never appeared in the telemetry history for that environment, the classification engine has no pattern to match against. The system cannot select a remediation action because no action has been pre-mapped to that failure type. The correct behavior is immediate escalation to a human. The dangerous behavior is attempting to match the signal to the nearest known pattern and executing a remediation that does not fit.

We saw this produce secondary incidents in the first deployment week of three separate production environments, each time because the automation selected a pod restart for a failure that was actually a downstream dependency timeout. Restarting the pod resolved nothing and delayed the correct fix by over six minutes.

Multi-service coordination requirements. Atomic remediations, a restart, a circuit breaker trip, a traffic reroute, complete inside the 90-second window because they touch one component and reverse cleanly. Incidents that require coordinated changes across two or more services break that model. The coordination itself introduces sequencing dependencies that cannot be resolved at machine speed without risking inconsistent intermediate states. A database failover that must be synchronized with application-layer connection pool resets is one example.

Automating that sequence without human oversight produces split-brain states that are harder to recover from than the original incident.

Severity thresholds that exceed blast radius tolerance. The Blast Radius Score defined in the preceding section gates automated execution at a score of 4 or below. High-severity incidents, specifically those affecting revenue-critical paths or involving data mutation, score above that threshold by design. The mechanism is direct: high severity correlates with wide downstream impact, and wide downstream impact means a wrong remediation creates compounding damage. Human judgment is the control at that tier, not because automation is slow, but because the cost of an incorrect automated action at that severity level exceeds the cost of the latency introduced by human review.

When automation closes wrong loop

Ambiguous verification signals. Self-healing closure requires the triggering signal to return to a normal state after remediation. Some failure modes produce signals that normalize temporarily and then degrade again within minutes, a pattern common in memory leak scenarios where a restart clears the symptom without addressing the cause. Automated verification reads the post-remediation state as resolved and closes the incident. The underlying fault continues.

In production, we measured a class of incidents where this false closure pattern repeated three or more times before an engineer identified the root cause as application-level, not infrastructure-level. Automation closed each loop correctly by its own criteria. The loop it was closing was the wrong one.

Failure Mode	Automation Response	Why Human Required
Novel signal	Escalate immediately	No pre-mapped remediation action exists
Multi-service coordination	Escalate immediately	Sequencing risk produces inconsistent state
Blast Radius Score above 4	Require approval before execution	Wrong action compounds damage at scale
Ambiguous verification signal	Flag for review after first recurrence	False closure masks root cause

Using escalations to improve

The boundary between automated and human-handled incidents is not a limitation to engineer around. It is a safety property to preserve. A self-healing system that attempts to resolve incidents outside its classification envelope does not become more capable. It becomes unpredictable, and unpredictable automated remediation at high severity is operationally worse than a slow human response.

The practical implication follows directly from the failure modes above. After 30 days of production data, pull every incident the automation escalated rather than resolved. Classify each escalation by the failure mode that triggered it: novel signal, coordination requirement, blast radius breach, or ambiguous verification. That distribution tells you exactly where to invest next, whether in expanding the classification library, decomposing multi-service remediations into atomic steps, or tightening the verification logic for recurring false-closure patterns.

Each category is a specific engineering problem, not a general maturity gap.

Building a Hybrid Response Model That Actually Works

The hybrid model works when automation owns the incidents it was built for, and on-call engineers own everything else. That division is not philosophical. It is an operational boundary defined by the classification and blast radius criteria established before the first alert fires.

Defining the automation envelope

The structural failure of traditional on-call is queue latency. Human response chains consume time before remediation starts: alert fires, notification delivers, engineer reads context, engineer decides, engineer acts. Each handoff adds seconds that compound. Self-healing automation closes the loop in under 90 seconds (ZopDev) because it eliminates every handoff.

The mechanism is direct: no human reads a diagnosis output, so no queue forms between detection and execution. That speed advantage exists only inside the automation envelope. Outside it, human judgment is the correct control, and the goal is to make that judgment faster, not to eliminate it.

Automation scope is fixed at deployment. Every incident type routed to automated remediation must have a pre-classified signal, a pre-scored action, and a verification check defined before the system handles live traffic. Adding new incident types to the automation envelope mid-operation without completing that sequence produces the same false-closure pattern described in the preceding section. The fix is a formal intake process: new incident types enter a staging classification queue, accumulate at least 30 days of signal history, and receive a Blast Radius Score before automation is enabled for them.

On-call scope shrinks to high-judgment work. When automation absorbs the classifiable, atomic, low-blast-radius tier, on-call engineers stop responding to pod restarts and circuit breaker trips at 2 a.m. Their queue narrows to novel failures, multi-service coordination events, and incidents scoring above the blast radius threshold. That narrowing is where the operational value concentrates. Fewer pages, higher stakes per page, and engineers who arrive at each incident with full context rather than sleep-deprived triage instincts.

Sizing the on-call rotation

Escalation paths must carry pre-loaded context. When the automation escalates rather than resolves, the alert delivered to the on-call engineer must include the classification result, the signals that triggered escalation, and the remediation actions that were considered but blocked. An engineer receiving that context acts faster than one starting from a raw alert. The mechanism is reduced time-to-diagnosis: the automation has already completed the classification work, so the human begins at the remediation decision, not at signal interpretation.

The on-call rotation size follows the scope reduction. A team handling only high-judgment incidents needs fewer engineers in rotation than a team handling every alert. The correct rotation size is derived from the volume of incidents that fall outside the automation envelope, not from total incident volume. In the first sprint after deploying this

Preventing boundary drift over time

model, pull the escalation count from the preceding 30 days and divide by the number of shifts. That number is the actual on-call load per engineer per shift. Staff the rotation to that load, not to the pre-automation total.

Metric	Value
Automated resolution target	under 90 seconds
Blast Radius Score gate for auto-execution	4 or below
Classification history required before automation	30 days
Blast Radius Score requiring human approval	5 or above

The hybrid model breaks when the boundary between automated and human-handled incidents drifts without a formal review process. Drift happens because engineers patch automation gaps informally: a novel signal gets a quick remediation script added without a blast radius review, or an action that previously scored 3 starts touching an additional downstream service after a dependency change and its score is never updated. By sprint 3, the automation envelope contains actions that no longer match their original scores, and the safety property the scoring model was built to preserve is gone.

The fix is a quarterly remediation audit. Every automated action gets re-scored against the current service dependency map. Any action whose score has crossed the threshold since the last audit gets suspended from automatic execution and routed to human approval until it is re-evaluated. This is not overhead.

It is the mechanism that keeps the 90-second resolution target honest over time.

Start the audit now by exporting every automated remediation action from the past 90 days, re-mapping each one against your current dependency graph, and flagging any action whose downstream service count has increased since it was originally scored. That list is your first remediation backlog for the hybrid model, and it exists in every production environment that has been running automation for more than one quarter without a formal review.

Frequently Asked Questions

Q: How does the on-call model is broken by design apply in practice?

See the section above titled "The On-Call Model Is Broken by Design" for the full breakdown with examples.

Q: How does 'closing the loop' actually means apply in practice?

See the section above titled "What 'Closing the Loop' Actually Means" for the full breakdown with examples.

Q: How does the 90-second target: what it takes to hit it apply in practice?

See the section above titled "The 90-Second Target: What It Takes to Hit It" for the full breakdown with examples.

Q: How does self-healing falls short apply in practice?

See the section above titled "Where Self-Healing Falls Short" for the full breakdown with examples.

Drop a comment if you've audited a similar spike. What was the dominant cause for your team? Share what worked or what blew up.

DEV Community