Blast Radius Before Execution: Why Autonomous Cloud Must Check Idle Resources First


Autonomous cloud remediation fails the same way every time. The recommendation is correct. The action is correct. The scope is wrong.

Stop the idle RDS instance. Correct recommendation. The instance has averaged 2% CPU for 30 days. It is genuinely idle. But it is also the database backing an internal integration endpoint that three production services call once a day, on different schedules, from different accounts. Stop it and you have a production incident at the next scheduled call time: 2:00 AM on a Tuesday.

The recommendation engine did not fail. The blast radius model was missing.

Autonomous systems that act without a blast radius check are not autonomous. They are automated. Automation executes instructions. Autonomy includes a model of consequences. Every auto-remediation action ZopNight certifies runs through a blast radius check before execution. This post defines what that check contains and how to score it.

The Correct-But-Wrong-Scope Problem

Cost tools are good at identifying idle resources. They look at CPU, memory, network, and storage I/O over a 14-30 day window. A resource below thresholds on all four dimensions is flagged as idle. The recommendation is statistically correct.
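The idle test described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's logic: the threshold values and the `UtilizationWindow` field names are assumptions made for the example, since the post does not state actual cutoffs.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; real tools choose their own cutoffs.
IDLE_THRESHOLDS = {
    "cpu_pct": 5.0,
    "memory_pct": 10.0,
    "network_bytes_per_s": 1_000.0,
    "storage_iops": 1.0,
}

MIN_WINDOW_DAYS = 14  # lower end of the 14-30 day observation window


@dataclass
class UtilizationWindow:
    """Average utilization of one resource over an observation window."""
    cpu_pct: float
    memory_pct: float
    network_bytes_per_s: float
    storage_iops: float
    window_days: int


def is_flagged_idle(u: UtilizationWindow) -> bool:
    """Flag a resource as idle only if it is below threshold on all four dimensions."""
    if u.window_days < MIN_WINDOW_DAYS:
        return False  # not enough history to make the call
    return all(getattr(u, name) < limit for name, limit in IDLE_THRESHOLDS.items())


# The RDS instance from the introduction: 2% CPU over 30 days.
rds = UtilizationWindow(cpu_pct=2.0, memory_pct=8.0, network_bytes_per_s=200.0,
                        storage_iops=0.3, window_days=30)
flagged = is_flagged_idle(rds)
```

Note what the function takes as input: utilization only. Nothing in it can express "three production services call this once a day."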

The tool does not know what depends on that resource. It cannot see the cross-account IAM role that calls the RDS instance from a Lambda in another account. It cannot see the CloudFront distribution that caches responses from the EC2 instance it flagged. It cannot see that the "idle" ElastiCache cluster is the cache warming target for a batch job that runs quarterly.

In the accounts we analyzed, 12-18% of resources flagged as idle have active downstream dependencies that would cause an incident if acted on without verification. That means roughly 1 in 8 to 1 in 6 autonomous actions on a naive system would create an incident. No engineering team will accept that failure rate for unattended automation.

The blast radius check is the gate that separates the safe 82-88% from the risky 12-18%.

Defining Blast Radius: Three Inputs

Blast radius is the set of resources and services that fail or degrade if the target resource is stopped, modified, or deleted mid-action. It is a pre-execution estimate, not a post-incident measurement.

Three inputs define it.

The dependency graph. VPC Flow Logs record metadata about IP traffic to and from each network interface, and the graph is built from the past 14 days of those records. A resource with no inbound or outbound connections in 14 days has a low dependency graph score. A resource with 200 connections per day across 4 VPCs has a high one. Service mesh telemetry (Istio, Linkerd) adds application-layer connection data that flow logs cannot see (same-host connections, gRPC multiplexed streams). Flow log data reaches the dependency graph with a 6-hour lag. Resources with less than 6 hours of flow log data default to high blast radius.
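A rough sketch of how the dependency input could be derived from connection records. The `FlowRecord` shape and the returned dictionary are assumptions for illustration; real flow log records carry more fields and would need parsing first. The 14-day window and the 6-hour default come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW_DAYS = 14         # lookback window for the dependency graph
MIN_COVERAGE_HOURS = 6   # below this much data, default to high blast radius


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, already-parsed connection record for one resource."""
    peer: str
    seen_at: datetime


def dependency_signal(records, coverage_hours, now):
    """Summarize connection activity for one resource over the lookback window."""
    if coverage_hours < MIN_COVERAGE_HOURS:
        # Too little data to trust an empty graph, so assume the worst.
        return {"signal": "high", "connections_per_day": None, "distinct_peers": 0}
    cutoff = now - timedelta(days=WINDOW_DAYS)
    recent = [r for r in records if cutoff <= r.seen_at <= now]
    return {
        "signal": "observed" if recent else "none",
        "connections_per_day": len(recent) / WINDOW_DAYS,
        "distinct_peers": len({r.peer for r in recent}),
    }
```

The important branch is the first one. An absence of records only means "no dependencies" when there was enough data to have seen them.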

Criticality tier. Resource tags encode business context that infrastructure metrics cannot. A resource tagged env=production and tier=critical scores high blast radius regardless of its CPU utilization. A resource tagged env=dev and team=platform scores lower. Tags are not perfectly reliable: 23% of resources in typical accounts have stale or missing criticality tags. When the criticality tag is absent, blast radius defaults to medium.
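The tag rules can be written down as a small lookup. This sketch assumes the `env` and `tier` tag keys from the examples above, and treats a missing `env` tag as "criticality tag absent"; which tag counts as the criticality tag in a given account is an assumption here.

```python
def criticality_tier(tags: dict) -> str:
    """Map resource tags to a coarse criticality level: 'high', 'medium', or 'low'."""
    env = tags.get("env", "").strip().lower()
    tier = tags.get("tier", "").strip().lower()

    if env == "production" or tier == "critical":
        return "high"
    if not env:
        # Assumption: no env tag means the criticality tag is absent.
        # Absent tags default to medium, never to low.
        return "medium"
    return "low"


examples = {
    "prod-db": criticality_tier({"env": "production", "tier": "critical"}),
    "dev-box": criticality_tier({"env": "dev", "team": "platform"}),
    "untagged": criticality_tier({}),
}
```

Defaulting to medium rather than low is the point: with 23% of tags stale or missing, treating "no tag" as "not important" would push unknowns into the unattended queue.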

Recency. CloudTrail records write API calls against each resource, and the timestamp of the most recent one, LastWriteTimestamp, is the third input. A resource with no write operations in 24 hours is idle by the write signal. A resource with a write 18 hours ago is not. Resources with writes in the last 24 hours get a recency penalty that raises their blast radius score regardless of CPU.
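A sketch of the recency input, assuming LastWriteTimestamp has already been extracted and is passed in as a `datetime` (or `None` when no write is on record). The 24-hour penalty window is the one described above; the 6-hour cutoff is the one the scoring section uses for its high-signal override. The label names are made up for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

PENALTY_WINDOW = timedelta(hours=24)  # writes inside this window raise the score
HIGH_WINDOW = timedelta(hours=6)      # writes inside this window force the top band


def recency_signal(last_write: Optional[datetime], now: datetime) -> str:
    """Classify the most recent write as 'high', 'penalty', or 'none'."""
    if last_write is None:
        return "none"
    age = now - last_write
    if age <= HIGH_WINDOW:
        return "high"
    if age <= PENALTY_WINDOW:
        return "penalty"
    return "none"
```

A write 18 hours ago returns `"penalty"`: not enough to block the action outright, but enough to keep the resource out of the unattended band.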

The Blast Radius Score: 0-100 and What Each Band Means

The three inputs produce a numeric score from 0 to 100. The score determines the action policy.

Score range	Action policy	What it means
0-29	Unattended execution	No active dependencies, non-production tag, no recent writes. Safe to act without human review.
30-69	Notification window	Possible dependencies or ambiguous tags. Action queued with 15-minute notification. Human can cancel.
70-100	Approval required	Active dependencies confirmed, production tag, or recent writes. Action blocked until explicit approval.
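The table maps directly onto a small routing function. The enum and function names below are illustrative, not ZopNight's API.

```python
from enum import Enum


class Policy(Enum):
    UNATTENDED = "unattended execution"
    NOTIFICATION = "notification window"
    APPROVAL = "approval required"


def action_policy(score: int) -> Policy:
    """Route a blast radius score (0-100) to its action policy band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < 30:
        return Policy.UNATTENDED
    if score < 70:
        return Policy.NOTIFICATION
    return Policy.APPROVAL
```

The band edges matter: 29 runs unattended, 30 waits for the notification window, and 70 is blocked until someone approves it.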

A resource scores below 30 only if all three conditions hold: dependency graph shows no connections in 14 days, criticality tag is non-production, and no writes in 24 hours. This is the conservative definition of "safely idle."

A resource scores 70 or above if any one of these conditions holds: 50+ connections per day in flow logs, production or critical tag, or writes in the last 6 hours. One high-signal input overrides the others. An RDS instance tagged dev that has 200 daily connections scores 70 or above. The dev tag does not override the dependency signal.

The 30-69 band handles the ambiguous cases: resources with missing tags, flow log gaps, or moderate connection counts. The 15-minute notification window gives an engineer time to cancel an action they know is risky, without requiring them to pre-approve every action in the queue.
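The band rules above can be sketched as a decision function. The post does not give the formula that produces the exact number within a band, so this sketch returns only the band as a `(low, high)` range. The `Signals` structure and its field names are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Signals:
    """Hypothetical container for the three blast radius inputs."""
    connections_per_day: float               # averaged over the 14-day window
    tags: dict = field(default_factory=dict)
    hours_since_last_write: Optional[float] = None  # None: no write on record


def score_band(s: Signals) -> tuple:
    """Return the (low, high) score band implied by the three inputs."""
    env = s.tags.get("env", "").lower()
    tier = s.tags.get("tier", "").lower()
    wrote = s.hours_since_last_write

    # Any single high-signal input forces the approval band.
    if (s.connections_per_day >= 50
            or env == "production"
            or tier == "critical"
            or (wrote is not None and wrote <= 6)):
        return (70, 100)

    # The unattended band requires all three conditions at once.
    no_connections = s.connections_per_day == 0
    non_production = env not in ("", "production")  # tag present and not prod
    no_recent_write = wrote is None or wrote > 24
    if no_connections and non_production and no_recent_write:
        return (0, 29)

    # Everything else is ambiguous: missing tags, moderate traffic, older writes.
    return (30, 69)
```

The asymmetry is deliberate. One strong signal is enough to block an action, but all three inputs must agree before it runs unattended.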

How ZopNight Gates Actions

Every automated remediation in ZopNight runs the blast radius check before execution. The check adds 2-4 seconds to the action pipeline. For actions that run unattended at 3:00 AM, 4 seconds is an acceptable gate latency.

In a typical month, across the ZopNight customer fleet, 67% of triggered remediations score below 30 and run unattended. Another 24% score 30-69 and enter the notification queue; 91% of those proceed after the window with no cancellation. The remaining 9% score 70 or above and require approval; approval is granted for 78% of those, with 22% cancelled by the reviewing engineer.
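Those percentages combine into two headline numbers: the share of triggered remediations that eventually execute, and the share a human stops. A short worked calculation, using only the figures above and assuming every action in the unattended band executes:

```python
# Share of triggered remediations per band, and the fraction of each that proceeds.
FLEET = {
    "unattended":   {"share": 0.67, "proceeds": 1.00},  # assumed: all execute
    "notification": {"share": 0.24, "proceeds": 0.91},
    "approval":     {"share": 0.09, "proceeds": 0.78},
}


def executed_fraction(fleet: dict) -> float:
    """Fraction of all triggered remediations that end up running."""
    return sum(band["share"] * band["proceeds"] for band in fleet.values())


def stopped_by_human_fraction(fleet: dict) -> float:
    """Fraction of all triggered remediations cancelled or denied by a person."""
    return sum(band["share"] * (1 - band["proceeds"]) for band in fleet.values())


executed = executed_fraction(FLEET)         # 0.67 + 0.2184 + 0.0702
stopped = stopped_by_human_fraction(FLEET)  # 0.0216 + 0.0198
```

Roughly 95.9% of triggered remediations run, and about 4.1% are stopped by an engineer. That 4.1% is what the notification and approval bands exist to catch.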

The autonomous action log records the blast radius score alongside every action. This creates the audit trail: not just what action ran, but why the system considered it safe to run without human review.
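One way such a log entry could be shaped, as a sketch. The field names and the JSON-lines format are assumptions for illustration, not ZopNight's actual log schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ActionLogEntry:
    """Hypothetical audit record: the action plus the evidence that gated it."""
    resource_id: str
    action: str
    blast_radius_score: int
    policy: str        # which band the score fell into
    inputs: dict       # the dependency, criticality, and recency values used
    executed_at: str   # ISO 8601 timestamp


def to_log_line(entry: ActionLogEntry) -> str:
    """Serialize one entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = ActionLogEntry(
    resource_id="i-0123456789abcdef0",  # placeholder id
    action="stop_instance",
    blast_radius_score=15,
    policy="unattended execution",
    inputs={"connections_14d": 0, "env": "dev", "hours_since_last_write": None},
    executed_at="2024-01-15T03:00:04+00:00",
)
line = to_log_line(entry)
```

Storing the inputs next to the score is what makes the entry an explanation and not just a receipt.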

The Half-Action Problem: Idempotency as a Blast Radius Input

Blast radius measures the risk of acting on a resource. It does not by itself measure the risk of acting and then failing halfway through.

A multi-step remediation that stops an EC2 instance, modifies its tags, and starts it again has three steps. If the network fails after step 1 (stopped) and before step 3 (started), the instance is stopped but not restarted. The resource is in a worse state than before the action ran. The original state was idle but running. The post-failure state is stopped unexpectedly. Recovery requires manual intervention.

Idempotency is the property that makes a remediation safe to retry from any point. A stop-and-delete action is not idempotent: running it twice on an already-deleted resource produces an error. A tag-update action is idempotent: running it twice produces the same result as running it once.
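The difference is easy to demonstrate against an in-memory stand-in for a cloud account. `FakeCloud` and `is_idempotent` are made up for this example; the check runs an action once on one fresh state and twice on another, then compares.

```python
class FakeCloud:
    """Tiny in-memory stand-in for a cloud account (illustration only)."""

    def __init__(self):
        self.resources = {"snap-1": {"tags": {}}}

    def set_tag(self, rid, key, value):
        self.resources[rid]["tags"][key] = value

    def delete(self, rid):
        if rid not in self.resources:
            raise KeyError(f"{rid} not found")
        del self.resources[rid]


def is_idempotent(action, make_state) -> bool:
    """True if running the action twice matches running it once, without errors."""
    once = make_state()
    action(once)

    twice = make_state()
    action(twice)
    try:
        action(twice)  # the retry
    except Exception:
        return False   # the retry failed, so the action is not safe to repeat
    return once.resources == twice.resources


tag_update = lambda cloud: cloud.set_tag("snap-1", "cost-center", "42")
delete_snapshot = lambda cloud: cloud.delete("snap-1")

tag_is_idempotent = is_idempotent(tag_update, FakeCloud)
delete_is_idempotent = is_idempotent(delete_snapshot, FakeCloud)
```

The tag update passes and the delete fails on its second run. A delete can be made retry-safe by treating "already gone" as success, but that is a design decision the remediation has to make explicitly.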

Non-idempotent remediations get a blast radius floor of at least 50 (70 for cross-account IAM changes), regardless of their dependency graph, criticality, and recency scores. This forces them into the notification queue at minimum, never into the unattended queue.

Remediation type	Idempotent	Blast radius floor	Example
Tag update	Yes	0 (no floor)	Add cost-center tag to EC2
Stop instance	Yes	0 (no floor)	Stop idle EC2 (safe to retry)
Delete snapshot	No	50	Deleted snapshot cannot be recovered
Schema migration	No	50	Partial schema change leaves DB inconsistent
Cross-account IAM change	No	70	IAM changes have immediate effect
Stop + reconfigure + start	Partially	50	Failure mid-sequence leaves wrong config
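Applying the floor is a single `max`. The floor values below are taken from the table; the remediation type keys are names invented for this sketch, and rejecting unknown types rather than guessing a floor is an assumption about how a careful system would behave.

```python
# Floors from the table above, keyed by hypothetical remediation type names.
BLAST_RADIUS_FLOORS = {
    "tag_update": 0,
    "stop_instance": 0,
    "delete_snapshot": 50,
    "schema_migration": 50,
    "cross_account_iam_change": 70,
    "stop_reconfigure_start": 50,
}


def apply_floor(raw_score: int, remediation_type: str) -> int:
    """Raise the raw blast radius score to the floor for this remediation type."""
    if remediation_type not in BLAST_RADIUS_FLOORS:
        # Refuse to guess: an unknown remediation type has no known floor.
        raise ValueError(f"unknown remediation type: {remediation_type}")
    return max(raw_score, BLAST_RADIUS_FLOORS[remediation_type])


# A snapshot on a genuinely idle dev resource still lands in the notification band.
floored = apply_floor(15, "delete_snapshot")
```

The floor only ever raises a score. A schema migration that already scored 85 stays at 85.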

The blast radius score is a pre-execution risk estimate. It is not a guarantee. A resource that scores 15 can still cause an incident if the flow logs had a 12-hour gap and missed an overnight connection. The score reduces the probability of the wrong-scope failure from 12-18% to under 1%. It does not reduce it to zero. The ZopNight trust score model uses blast radius as one input alongside recommendation confidence and business hours context. No single signal is the gate. The gate is the composite.

Autonomous cloud is safe when the system knows what it does not know and routes accordingly. Blast radius is the model of what it does not know.