Bala Paranj

Every scanner checks what exists. Nobody checks what's missing

When cloud resources are deleted, the references to them persist — in IAM policies, event triggers, compute configs, and trust relationships. These orphaned references create exploitable gaps that no per-resource scanner can detect. The finding doesn't live on any single resource. It lives in the space between what's referenced and what exists.

The assumption every scanner makes

Cloud security scanners work by iterating over resources. For each S3 bucket, check its configuration. For each IAM role, check its policies. For each security group, check its rules. The resource exists. The scanner examines it. The finding describes what's wrong with it.

This is a reasonable architecture. It covers the vast majority of cloud security risks. Misconfigured resources — public buckets, overprivileged roles, open security groups — are the bread and butter of cloud security posture management.

But every scanner built on this architecture shares a blind spot: when a resource is deleted, it disappears from the scan. The scanner has nothing to examine. The resource is gone.

The references to it are not.

What deletion leaves behind

Cloud infrastructure is a graph of interconnected references. An IAM policy doesn't exist in isolation — it references S3 buckets, KMS keys, Lambda functions, and SQS queues by ARN. An EventBridge rule references a Lambda function as its target. A CloudWatch alarm references an SNS topic as its notification action. An ECS task definition references an ECR image by tag and a Secrets Manager secret by ARN.

When any of these referenced resources is deleted, the reference persists. The IAM policy still says "Allow PutObject to arn:aws:s3:::prod-audit-logs." The EventBridge rule still targets the Lambda function. The CloudWatch alarm still notifies the SNS topic. The ECS task definition still pulls the image.

The resource is gone. The references are not. And depending on the resource type, those references may be actively exploitable.

Three classes of orphaned references

Not every orphaned reference is equally dangerous. The risk depends on what the reference does and whether the deleted resource's identity is reclaimable.

Class 1: Reclaimable names with active permissions

S3 bucket names are globally unique across all AWS accounts. When a bucket is deleted, its name becomes available for registration by anyone, anywhere. An IAM policy that grants PutObject to that bucket name is now granting write access to whoever claims it next.

This is the most dangerous class. The organization's systems are actively trying to send data — audit logs, backups, application output — to a resource name that an attacker can claim. The attacker registers the bucket, configures it to accept writes, and data starts flowing. The Lambda function writing audit logs doesn't error. The S3 client library doesn't warn. The write succeeds. It goes to the wrong place.

A healthcare organization's HIPAA audit logs — the very records required to prove compliance — could be delivered to an attacker's bucket. The organization continues generating compliance evidence and delivering it to an adversary.

KMS key policies with orphaned principal references follow a related pattern. AWS protects IAM trust policies by replacing deleted role ARNs with internal unique IDs. But resource-based policies — on S3 buckets, KMS keys, SQS queues, SNS topics — evaluate the ARN string directly. A new role created with the same name as a deleted one matches the policy and inherits every permission it grants. For a KMS key policy, that means decrypt access to everything the key protects.
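The cross-check for this class is mechanical: walk the policy statements, pull out the bucket names they grant access to, and flag any name absent from the account's bucket inventory. A minimal sketch, assuming both the policy document and the bucket inventory have already been collected (for example via the IAM and S3 APIs); the policy and bucket names are hypothetical:

```python
def reclaimable_bucket_grants(policy_statements, existing_buckets):
    """Flag Allow statements whose S3 resource ARNs point at buckets
    that no longer exist -- names an attacker could re-register."""
    findings = []
    for stmt in policy_statements:
        if stmt.get("Effect") != "Allow":
            continue
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        for arn in resources:
            if not arn.startswith("arn:aws:s3:::"):
                continue
            # Strip the ARN prefix and any object-key suffix to get the bucket name
            bucket = arn[len("arn:aws:s3:::"):].split("/", 1)[0]
            if bucket not in existing_buckets:
                findings.append((bucket, stmt.get("Action")))
    return findings

# The bucket was deleted, but the policy still grants writes to its name.
policy = [{"Effect": "Allow", "Action": "s3:PutObject",
           "Resource": "arn:aws:s3:::prod-audit-logs/*"}]
print(reclaimable_bucket_grants(policy, existing_buckets={"prod-app-data"}))
# -> [('prod-audit-logs', 's3:PutObject')]
```

Note that the finding depends on nothing inside the policy itself; the same statement is benign while the bucket exists and an exfiltration path the moment it doesn't.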

Class 2: Silent monitoring failures

A CloudWatch alarm fires when a metric breaches a threshold. The alarm's action sends a notification to an SNS topic. If the SNS topic has been deleted, the alarm fires into the void. The metric breaches the threshold. The alarm enters ALARM state. The notification goes nowhere. The dashboard shows the alarm is configured. The console shows the alarm is active. Nobody receives the alert.

This class is insidious because the system appears to work. The alarm exists. The EventBridge rule exists. The S3 event notification exists. The configuration looks correct. The targets are gone. Events are generated, matched, and silently dropped. Security automation that the organization built and maintains and monitors through dashboards has stopped functioning — and nothing indicates the failure.

An attacker who discovers this can exploit it deliberately. Delete the SNS topic that a critical alarm notifies. The alarm still fires. The team never knows. The attacker operates under the alarm's detection threshold, and even when they don't, the alarm's notification pipeline is broken. The alarm fires. Nobody comes.
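Detecting this class means cross-referencing each alarm's action ARNs against the SNS topic inventory. A sketch under the assumption that alarm descriptions (shaped like the CloudWatch `DescribeAlarms` output, with `AlarmName` and `AlarmActions` fields) and the topic inventory have already been collected; the alarm name and account ID are made up:

```python
def silent_alarms(alarms, existing_topic_arns):
    """Find alarms whose notification actions target SNS topics that
    no longer exist -- alarms that fire into the void."""
    broken = {}
    for alarm in alarms:
        dead = [a for a in alarm.get("AlarmActions", [])
                if a.startswith("arn:aws:sns:") and a not in existing_topic_arns]
        if dead:
            broken[alarm["AlarmName"]] = dead
    return broken

alarms = [{"AlarmName": "prod-cpu-high",
           "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:oncall"]}]
print(silent_alarms(alarms, existing_topic_arns=set()))
# -> {'prod-cpu-high': ['arn:aws:sns:us-east-1:111122223333:oncall']}
```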

Class 3: Compute dependencies

An ECS task definition references a container image by tag in a registry. The image is deleted. The next deployment pulls — what? If the registry is private, the pull fails. If the registry is public, whatever image currently holds that tag. An attacker who pushes a malicious image with the matching name and tag controls what code runs in the container. The malicious code executes with the task role's IAM permissions.

A Lambda function references a layer that's been deleted. The function deploys without the layer's contents. If the layer provided a security-relevant dependency — a TLS certificate bundle, an encryption library, authentication middleware — the function runs without it. The function serves traffic. The security dependency is silently absent.

A launch template references a deregistered AMI. The auto-scaling group can't launch new instances. During a DDoS attack, when the organization needs to scale response capacity, the scaling group discovers it can't scale. The launch template looks correct. The AMI it depends on is gone.
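The container-image variant of this class can be sketched the same way: check each task definition's image references against what the registry actually holds. This assumes the task definitions (shaped like the ECS API's `containerDefinitions` structure) and a registry inventory mapping each repository URI to its pushed tags have already been collected; the repository and family names are hypothetical:

```python
def dangling_image_refs(task_definitions, existing_images):
    """Find task definitions whose container image tags are no longer
    present in the registry inventory (repo URI -> set of pushed tags)."""
    findings = []
    for td in task_definitions:
        for container in td.get("containerDefinitions", []):
            image = container.get("image", "")
            repo, _, tag = image.partition(":")
            if tag and tag not in existing_images.get(repo, set()):
                findings.append((td["family"], image))
    return findings

tds = [{"family": "billing-worker",
        "containerDefinitions": [
            {"image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/billing:v3"}]}]
registry = {"111122223333.dkr.ecr.us-east-1.amazonaws.com/billing": {"v1", "v2"}}
print(dangling_image_refs(tds, registry))
# -> [('billing-worker', '111122223333.dkr.ecr.us-east-1.amazonaws.com/billing:v3')]
```

For a public registry, a hit in this check is exactly the tag an attacker could claim; for a private one, it is the deployment that will fail at the worst possible time.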

Why per-resource scanners can't detect this

The architectural limitation is fundamental. A per-resource scanner iterates over resources that exist and evaluates their properties. An orphaned reference finding doesn't live on any existing resource.

The IAM policy exists and looks normal — valid JSON, well-formed ARNs, reasonable permissions. The S3 bucket it references doesn't exist, but the scanner evaluating the IAM policy doesn't know that because it's evaluating the policy, not cross-referencing it against the full resource inventory.

The CloudWatch alarm exists and looks correct — metric configured, threshold set, action defined. The SNS topic it targets doesn't exist, but the scanner evaluating the alarm doesn't cross-reference action ARNs against the SNS inventory.

Cross-inventory reasoning requires holding two datasets simultaneously: the set of all ARNs referenced in configurations and the set of all resources that actually exist. The finding is the difference between these two sets. No single resource carries it. The scanner must reason about the gap — about what's referenced but absent.

Per-resource scanners aren't poorly built. They're architecturally incapable of this detection class. Adding ghost reference detection to a per-resource scanner requires changing its fundamental evaluation model from "for each resource, check properties" to "for each reference, check whether the target exists." That's a different architecture.
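The alternative evaluation model is small enough to state in full. Given a list of (source, target ARN) references extracted from configurations and the set of ARNs that actually exist, the finding is literally the set difference; the resource names below are illustrative:

```python
def ghost_references(references, existing_arns):
    """For each (source, target ARN) reference, report the ones whose
    target is absent from the resource inventory. No single resource
    carries the finding -- it lives in the gap between the two sets."""
    return [(src, target) for src, target in references
            if target not in existing_arns]

references = [
    ("iam-policy/audit-writer", "arn:aws:s3:::prod-audit-logs"),
    ("events-rule/on-upload",
     "arn:aws:lambda:us-east-1:111122223333:function:thumbnailer"),
]
existing = {"arn:aws:lambda:us-east-1:111122223333:function:thumbnailer"}
print(ghost_references(references, existing))
# -> [('iam-policy/audit-writer', 'arn:aws:s3:::prod-audit-logs')]
```

The hard part is not this loop; it is building both inputs completely, because a missing entry in `existing_arns` produces a false ghost and a missing entry in `references` hides a real one.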

What about AWS-native tools?

AWS Config records configuration changes over time, including resource deletions. When a bucket is deleted, Config records the deletion event. But Config doesn't cross-reference the deletion against IAM policies that still reference the bucket. It records "bucket was deleted" — it doesn't conclude "and three policies still grant access to its name."

AWS CloudTrail records API calls, including DeleteBucket. But CloudTrail records the event, not the consequence. It tells you the bucket was deleted. It doesn't tell you which policies, triggers, and configurations are now orphaned.

AWS Security Hub aggregates findings from other services. None of the services it aggregates detects orphaned references.

The information exists in AWS — the deletion event is recorded, and the persisting references are visible through API queries. But no AWS-native service connects these two data points into a finding. The deletion is observed. The consequence is not.

The temporal dimension

Single-snapshot absence detection has an inherent uncertainty: if a resource doesn't appear in the inventory, maybe it was deleted. Or maybe the scanner didn't collect it. An incomplete scan produces false ghost references — resources that exist but weren't captured look like deletions.

Temporal detection resolves this. If a resource appeared in snapshot N-1 and is absent in snapshot N, the resource was genuinely deleted. The scanner collected it before. It's gone now. Two independent observations confirm the deletion. The ghost reference is verified — not just "we can't find it" but "we watched it disappear."

Temporal ghost detection is the highest-confidence version: compare the resource inventory across two consecutive snapshots, identify deletions, then cross-reference persisting references against the confirmed deletions. The finding says: "This resource existed on March 15. It's gone as of March 22. These seven policies still reference it."

The lifecycle gap

Cloud security tools generally treat infrastructure as a static snapshot. What's deployed right now? Is it configured correctly? This covers creation and configuration. It doesn't cover decommissioning.

The lifecycle of a cloud resource has three security-relevant phases:

Creation. The resource is provisioned. Policies are written to grant access. Triggers are configured to reference it. Compute definitions are updated to depend on it. Trust relationships are established. Every scanner covers this phase — the resource exists and can be evaluated.

Operation. The resource runs. Its configuration may drift. Permissions may expand. New references may be added. Scanners cover this phase too — they detect drift, overprivilege, and misconfiguration on the running resource.

Deletion. The resource is removed. The policies aren't updated. The triggers aren't cleaned up. The compute definitions still reference it. The trust relationships persist. No scanner covers this phase — the resource is gone, and the orphaned references are invisible to per-resource evaluation.

The deletion phase is where ghost references are created. And deletion is a normal operation — teams decommission services, migrate architectures, consolidate accounts, sunset products. Every deletion that doesn't include a full reference cleanup creates potential ghost references. In large organizations with hundreds of services and thousands of cross-references, the ghost reference count grows continuously.

What this means in practice

Consider a typical enterprise migration: a team moves from a legacy logging pipeline to a new one. The old pipeline used an S3 bucket for audit log storage, a Lambda function for processing, an SNS topic for alerting, and a KMS key for encryption. The new pipeline uses different resources with different names.

The migration is successful. The new pipeline works. The old resources are deleted. The cleanup checklist says:

  • [x] Delete old S3 bucket
  • [x] Delete old Lambda function
  • [x] Delete old SNS topic
  • [ ] Update IAM policy that granted write access to old bucket
  • [ ] Update EventBridge rule that targeted old Lambda
  • [ ] Update CloudWatch alarm that notified old SNS topic
  • [ ] Update KMS key policy that trusted old Lambda's role
  • [ ] Update ECS task definition that injected old Secrets Manager secret

The first three items are the resources. The last five are the references. The resources are deleted because they're the visible artifacts of the old pipeline. The references persist because they're scattered across IAM policies, event configurations, alarm actions, key policies, and task definitions — managed by different teams, in different consoles, with different change management processes.

No single team owns all the references. The application team deletes the Lambda function. The IAM team doesn't know the Lambda was deleted. The monitoring team doesn't know the SNS topic is gone. The platform team doesn't update the KMS key policy. The container team doesn't update the ECS task definition.

The migration succeeded. Five ghost references were created. Each is a potential security gap. One of them (the IAM policy granting write access to the deleted S3 bucket name) is an active exfiltration path — the bucket name is globally reclaimable.

Deleted and forgotten

The reason ghost references persist for months and years is simple: everything still works.

When you delete a resource and something breaks — a deployment fails, an API returns 500, a dashboard goes red — you notice immediately. You trace the error, find the missing dependency, and fix it. Broken things get fixed because broken things are visible.

Ghost references don't break anything. The IAM policy with a dangling ARN still loads. The CloudWatch alarm with a dead SNS target still evaluates its metric. The ECS task definition with a deleted secret still sits in the registry. The EventBridge rule with a missing Lambda target still matches events. Nothing errors. Nothing warns. Nothing crashes. The system runs exactly as it did before — minus one resource that nobody is looking for because it was intentionally deleted.

In complex cloud setups with hundreds of services, thousands of policies, and dozens of teams, the gap between "resource deleted" and "every reference cleaned up" isn't a failure of discipline. It's a structural impossibility. No team has visibility into every configuration surface that references their resources. The application team knows the Lambda function exists. They don't know which EventBridge rules target it, which KMS key policies trust its role, or which monitoring alarms depend on the SNS topic it publishes to. They delete the Lambda. Everything else keeps running. There's nothing to fix because nothing is broken.

That's the trap. The absence of failure is the failure. The system's continued operation is what makes ghost references invisible. If a dangling reference caused an error, every organization would have already solved this problem. It doesn't. So nobody has.

Detection without execution

Ghost reference detection doesn't require running code against live infrastructure, deploying agents, or performing active scanning. It requires two things: a complete inventory of what exists, and a complete inventory of what's referenced. The finding is the set difference.

This makes it suitable for air-gapped environments, compliance-sensitive workloads, and organizations that prohibit active scanning. The evaluation runs over configuration snapshots — captured state, evaluated offline, no credentials needed beyond the initial snapshot collection.

The detection is deterministic. Given the same two inventories, the same ghost references are found every time. No false positives from timing issues, network conditions, or scanner state. The reference either resolves or it doesn't. The resource either exists or it doesn't. The finding is binary.

Ghost references are dangerous and universal

Every organization that has ever deleted a cloud resource and didn't update every reference to it has ghost references in their infrastructure. The question is not whether they exist — the question is how many and how dangerous.

The organizations most at risk are the ones that have been operating longest. Years of migrations, decommissions, team changes, and architectural evolution create layers of orphaned references. Each one is individually small — a single ARN in a single policy. Collectively, they represent an unmapped attack surface that grows with every deletion and shrinks only through deliberate cleanup that nobody is doing because nobody can see the gaps.

The tools they rely on for security posture management evaluate what exists. The gaps exist in what does not exist.


Cross-inventory ghost reference detection is implemented in Stave, an open-source security CLI. 23 controls detect orphaned references across IAM policies, resource-based policies, event triggers, compute configurations, network infrastructure, cross-account trust, and temporal confirmation. The finding lives in the space between what's referenced and what exists.
