Why a deleted backup Lambda kept billing 9,400 EBS snapshots

#cloud #cost #triage #costspikes

The EBS Snapshot line on the monthly bill was $1,830. There was no active EBS snapshot policy on the account. The backup Lambda that had produced these snapshots had been deleted thirteen months earlier, replaced by AWS Backup, and forgotten. Nobody had deleted what it created. Two volumes' worth of daily snapshots times 400 days came to 9,408 orphans sitting on 14 TB of storage, billed at the EBS Snapshot rate every month since.

Problem signals:

- The EBS Snapshot line runs to several hundred dollars a month, yet no EBS snapshot pipeline is active on the account
- aws ec2 describe-snapshots --owner-ids self returns thousands of entries when you expect dozens
- Sampling a few snapshots shows VolumeId values that no longer resolve in describe-volumes
- A backup Lambda or custom snapshot script was deprecated in the last 12 to 24 months
- AWS Backup is the active tool and its dashboard shows normal counts, but the cost line tells a different story

$1,830 a month on a backup product the account no longer used

The line item that should have been zero

The EBS Snapshot line had been climbing slowly for thirteen months. Nobody had flagged it. The quarterly cost review surfaced it because the line item ranked sixth on the account, and the team's mental model said it should have ranked nowhere. There was no EBS snapshot policy running. AWS Backup had taken over RDS and EBS backups a year earlier, with the old Lambda plus EventBridge pipeline retired the same week.

The first instinct in the room was to pull AWS Backup's plan and see if a retention window had widened. The plan was clean. Snapshot counts there were in the low dozens, exactly what the new policy specified. So the snapshots driving the bill were coming from somewhere else.

$ aws ec2 describe-snapshots --owner-ids self \
    --query 'length(Snapshots)' --output text
9408

The number that turned a routine cost review into an incident.

That was the moment the room got quiet. AWS Backup writes maybe forty snapshots a month on this account. Nine thousand was a different category of problem.

AWS Backup was clean, so who made these 9,408 snapshots

Ruling out the obvious suspect

With AWS Backup ruled out and no other named pipeline running, the question became: who created these 9,408 snapshots, and is anything still creating more? We pulled the StartTime field on the most recent hundred. The newest one was thirteen months old. Whatever pipeline made them had stopped, which meant we were looking at a stable population, not a leak that was still growing. That mattered because it meant the cleanup had a known size.
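The freshness check can be sketched against a saved inventory instead of the live API. This is a minimal sketch, not the exact command we ran: it assumes a tab-separated file with SnapshotId, VolumeId, and StartTime in columns 1 to 3, and the function name newest_snapshot_time is our own, not part of the AWS CLI.

```shell
# Sketch: report the newest StartTime in a snapshot inventory.
# Assumes a TSV with columns SnapshotId, VolumeId, StartTime, e.g. from:
#   aws ec2 describe-snapshots --owner-ids self \
#     --query 'Snapshots[*].[SnapshotId,VolumeId,StartTime]' --output text
# ISO 8601 timestamps in one consistent format sort correctly as plain strings.
newest_snapshot_time() {
  cut -f3 "$1" | sort | tail -n 1
}
```

If the value printed is months old, nothing is still writing snapshots and the population is fixed in size.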

The next question was whether the source volumes were still around. We sampled twenty random snapshots and ran describe-volumes against the VolumeId each one recorded. All twenty came back InvalidVolume.NotFound. The pattern was clear: the snapshots referenced two specific volume IDs (the daily Lambda backed up two production EBS volumes), both of which had been deleted along with the EC2 instances they served when the application moved to a managed service.

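# Dump every snapshot this account owns: ID, source volume ID, start time.
aws ec2 describe-snapshots --owner-ids self \
    --query 'Snapshots[*].[SnapshotId,VolumeId,StartTime]' \
    --output text > all-snapshots.tsv

# For each distinct source volume, record it as an orphan only when the API
# says the volume is gone. Throttling or auth errors must not count as orphans.
awk -F'\t' '{print $2}' all-snapshots.tsv | sort -u \
  | while read -r vid; do
      if aws ec2 describe-volumes --volume-ids "$vid" 2>&1 >/dev/null \
          | grep -qF 'InvalidVolume.NotFound'; then
        echo "$vid orphan"
      fi
    done > orphan-source-volumes.txt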

Group snapshots by their source volume, then check which source volumes still exist.

Only two volume IDs appeared in the orphan list. Two volumes, 400 days of daily snapshots each, give or take retries, gave 9,408. The arithmetic lined up. The Lambda that snapshotted them was gone, but AWS does not garbage-collect snapshots when their creator disappears. Snapshots are first-class objects with their own lifecycle, and that lifecycle is whatever you set when you create them. The Lambda set nothing.
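To see how the total splits across source volumes, the inventory can be grouped and counted. A minimal sketch, assuming the same three-column TSV layout (SnapshotId, VolumeId, StartTime); the function name snapshots_per_volume is hypothetical.

```shell
# Sketch: count snapshots per source volume, largest group first.
# Input: TSV with columns SnapshotId, VolumeId, StartTime.
# Output: "<count><TAB><volume id>" per distinct volume.
snapshots_per_volume() {
  awk -F'\t' '{ n[$2]++ } END { for (v in n) printf "%d\t%s\n", n[v], v }' "$1" \
    | sort -k1,1nr -k2,2
}
```

A handful of volume IDs carrying nearly all the snapshots points at one pipeline; a long tail of small groups points at several.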

Why we sampled twenty before touching the other 9,388

What we did before running delete-snapshot in a loop

The temptation at this point is to write a one-line loop and delete everything. But delete-snapshot is irreversible. The cost was real, $1,830 a month to store data that referenced infrastructure that no longer existed, and we still held off on the loop for two reasons.

First, orphan is sometimes a transient state. A volume gets deleted on Tuesday during a planned migration. On Wednesday the orphan-finder runs. A snapshot taken two hours before the volume's deletion looks orphaned but is actually the most recent backup of a service that was just migrated. Deleting it would destroy the only remaining copy of that data. We checked the StartTime on every snapshot in our sample against the deletion date of its source volume. Every one was older than the deletion by at least nine months. The cohort was uniformly historical. No active workflow could be depending on any of them.
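The same check can be run over the whole inventory, not just a sample. A minimal sketch, assuming the three-column TSV layout (SnapshotId, VolumeId, StartTime) and a cutoff date supplied by the caller, for example the volume's deletion date minus a safety margin. How the deletion date is found (CloudTrail, change tickets) is outside the sketch, and snapshots_since is a hypothetical name.

```shell
# Sketch: list snapshots taken on or after a cutoff, i.e. too close to the
# source volume's deletion to be treated as purely historical.
# $1 = inventory TSV, $2 = cutoff as an ISO date or timestamp.
# The "" concatenation forces a string comparison in awk.
snapshots_since() {
  awk -F'\t' -v c="$2" '($3 "") >= (c "") { print $1 "\t" $3 }' "$1"
}
```

Empty output means every snapshot predates the cutoff. Anything printed deserves a human look before it goes near a delete loop.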

Second, we needed to be sure these snapshots were not being referenced as the base for any AMI or any live AWS Backup recovery point. We ran describe-images with a block-device-mapping.snapshot-id filter on the sample, expecting nothing, and got nothing. We checked the AWS Backup recovery point inventory. None of the orphan snapshot IDs appeared there. The deletion was safe.
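The AMI check lends itself to a small helper. This is a sketch under assumptions: it uses the describe-images filter named above, restricts the search to AMIs owned by the account (--owners self), and the two function names are ours. It does not cover the AWS Backup recovery point inventory.

```shell
# Sketch: print the IDs of AMIs owned by this account whose block device
# mappings reference the given snapshot. Empty output means no AMI uses it.
amis_using_snapshot() {
  aws ec2 describe-images --owners self \
    --filters "Name=block-device-mapping.snapshot-id,Values=$1" \
    --query 'Images[].ImageId' --output text
}

# Sketch: read snapshot IDs on stdin, print only those that back an AMI,
# as "<snapshot id><TAB><ami ids>". Stops with status 1 if the API call fails.
snapshots_backing_amis() {
  while read -r sid; do
    amis=$(amis_using_snapshot "$sid") || return 1
    [ -n "$amis" ] && printf '%s\t%s\n' "$sid" "$amis"
  done
  return 0
}
```

Feeding it the sampled IDs, for example snapshots_backing_amis < sample-ids.txt (a hypothetical file name), should print nothing if the sample is safe on this count.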

The actual delete loop took three calendar days. delete-snapshot is rate-limited at roughly 5 requests per second per account, with bursts. At 9,400 deletes, with retries on the occasional 503, the math runs to about 30 wall-clock minutes at perfect throughput. We never get perfect throughput. We wrote the loop with a 250ms sleep and a checkpoint in the form of an append-only deleted.log, so we could resume after any interruption without retrying deletes that had already succeeded.
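The throughput arithmetic is easy to check. At 5 requests per second, 9,400 deletes need at least 1,880 seconds, about 31 minutes. A 250ms sleep between sequential calls sets a floor of 2,350 seconds, about 39 minutes, before any API latency. A small sketch, with the rate and sleep taken from the figures quoted above, not read from AWS:

```shell
# Sketch: lower bound on wall-clock seconds for N sequential deletes.
# $1 = number of deletes, $2 = allowed requests per second,
# $3 = sleep between calls in milliseconds.
min_delete_seconds() {
  by_rate=$(( $1 / $2 ))            # bound imposed by the rate limit
  by_sleep=$(( $1 * $3 / 1000 ))    # bound imposed by the sleep alone
  if [ "$by_sleep" -gt "$by_rate" ]; then
    echo "$by_sleep"
  else
    echo "$by_rate"
  fi
}

min_delete_seconds 9400 5 250   # prints 2350
```

Either bound is well under an hour, so the three calendar days came from everything around the loop, not from the rate limit itself.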

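touch deleted.log   # the checkpoint must exist before the first grep
while read -r sid; do
  # Skip IDs already deleted in an earlier run.
  if grep -qxF "$sid" deleted.log; then continue; fi
  # if/else rather than && ||, so a failed log write is never recorded as a failed delete.
  if aws ec2 delete-snapshot --snapshot-id "$sid"; then
    echo "$sid" >> deleted.log
  else
    echo "$sid" >> failed.log
  fi
  sleep 0.25
done < orphan-snapshot-ids.txt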

Resumable, rate-limited delete loop. The checkpoint file is the load-bearing part.
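For the occasional throttled call, one option is to wrap each delete in a retry with a doubling delay. This is a generic sketch, not the loop we ran: retry_with_backoff is a hypothetical helper, and it retries on any non-zero exit status, so it cannot tell a throttle from a permanent error.

```shell
# Sketch: run a command, retrying with a doubling delay on failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
# Returns 0 on the first success, 1 once MAX_ATTEMPTS have all failed.
retry_with_backoff() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Inside a delete loop the call would look like retry_with_backoff 5 1 aws ec2 delete-snapshot --snapshot-id "$sid", with the attempt count and starting delay chosen to taste.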

After three days the EBS Snapshot line on the next monthly forecast dropped to under $20. The fourteen terabytes of orphan storage was gone.

Tag at creation, schedule the cleanup, watch the lines that should be zero

The rule that meant the next deprecated pipeline could not do this

The deletion fixed the symptom. The interesting part of this engagement was the cause. AWS does not couple a snapshot's lifecycle to the lifecycle of whatever process created it. A Lambda gets deleted, an EventBridge rule gets removed, the IAM role goes with them, and the snapshots they made keep existing and keep being billed, forever, until something explicitly deletes them. There is no warning email. There is no dashboard widget. The only signal is the monthly bill, and the bill takes a year to be loud enough to investigate.

Two changes went in after the cleanup. The first was a tag-at-creation rule. Every snapshot the account creates now carries three tags applied at creation time: Owner (a team or service name), Retention (an ISO date past which the snapshot is safe to delete), and CreatedBy (the pipeline that made it). AWS Backup applies these automatically through its backup plan. The handful of custom Lambdas that survived the migration were rewritten to apply them. A weekly cleanup Lambda walks the account, deletes anything past its Retention date, and flags anything older than 90 days with no Retention tag. For the first 60 days the Lambda posted a Slack message and waited for a thumbs-up before deleting. After that it ran automatically.
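Tagging at creation means passing the tags in the create call itself, not in a follow-up create-tags call that can fail or be skipped. A minimal CLI sketch, assuming tag values without commas or spaces; the wrapper name create_tagged_snapshot and its argument order are our own, and the rewritten Lambdas would do the equivalent through the SDK.

```shell
# Sketch: create a snapshot that carries Owner, Retention, and CreatedBy
# from the moment it exists.
# $1 = volume ID, $2 = owner, $3 = retention date (YYYY-MM-DD),
# $4 = name of the pipeline creating the snapshot.
create_tagged_snapshot() {
  aws ec2 create-snapshot --volume-id "$1" \
    --tag-specifications \
    "ResourceType=snapshot,Tags=[{Key=Owner,Value=$2},{Key=Retention,Value=$3},{Key=CreatedBy,Value=$4}]"
}
```

Because the tags travel with the create request, a snapshot cannot exist in an untagged state between two calls.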

The second change was to the quarterly cost review process. It now starts with the line items that should be zero or near zero, not the ones that are already big. The big lines get watched constantly by capacity planners. The lines that should be zero are where deleted infrastructure leaves footprints, and they are the ones least likely to be on anybody's dashboard. EBS Snapshot on a no-EBS-snapshot account. Lambda invocations on a service that was migrated to ECS six months ago. NAT Gateway hours on a workload that should not need cross-AZ egress. These are the lines where deprecated pipelines keep paying rent.
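The should-be-zero pass can be mechanised once the expectations are written down. A sketch under assumptions: costs.tsv is a two-column export (line item name, monthly cost in dollars) from whatever billing tool is in use, expected-zero.txt is a hand-kept list of line item names, and all three names here are hypothetical.

```shell
# Sketch: print line items that are expected to be near zero but are not.
# $1 = costs TSV (line item name, monthly cost in dollars)
# $2 = file with one expected-zero line item name per line
# $3 = dollar threshold below which a line is ignored
nonzero_expected_zero() {
  awk -F'\t' -v limit="$3" '
    NR == FNR { zero[$0] = 1; next }                      # first file: the expectations
    ($1 in zero) && ($2 + 0) > (limit + 0) { print $1 "\t" $2 }
  ' "$2" "$1"
}
```

The list of expectations is the valuable artifact: it records what the team believes is switched off, which is exactly what the bill is being checked against.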

The lifecycle every snapshot now goes through. Untagged snapshots cannot live past 90 days without an explicit decision.
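The decision rule in that lifecycle is small enough to sketch. This is an illustration of the rule, not the Lambda's code: it assumes a TSV with SnapshotId, StartTime, and the Retention tag value (empty or "None" when the tag is missing), and takes both dates from the caller so the logic does not depend on any one date tool.

```shell
# Sketch: decide what the weekly cleanup does with each snapshot.
# $1 = TSV with columns SnapshotId, StartTime, Retention tag value
# $2 = today as YYYY-MM-DD
# $3 = today minus 90 days as YYYY-MM-DD
# Output: "<snapshot id><TAB><delete|flag|keep>"
classify_snapshots() {
  awk -F'\t' -v today="$2" -v flag_before="$3" '
    {
      ret = $3
      started = substr($2, 1, 10)        # date part of the timestamp
      if (ret != "" && ret != "None") {
        action = ((ret "") < (today "")) ? "delete" : "keep"
      } else {
        action = ((started "") < (flag_before "")) ? "flag" : "keep"
      }
      print $1 "\t" action
    }' "$1"
}
```

Only tagged snapshots past their date are ever marked delete; an untagged snapshot older than 90 days is marked flag, which leaves the explicit decision to a person.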

Cost archaeology on accounts where a deprecated pipeline is still paying rent

When the bill is the only thing telling you what you forgot

The shape of this incident is common. A pipeline gets shipped, the engineer who wrote it leaves, the policy gets replaced but the outputs survive, and the bill slowly bends upward. EBS snapshots are the most common shape we see. Detached EIPs are close behind. Idle NAT gateways and orphaned ElastiCache clusters round out the top four. None of these line items alarm on a CloudWatch dashboard because nothing is actively misbehaving. The deprecated pipeline is the misbehavior, and the pipeline no longer exists.

We run these cost-archaeology engagements regularly. In the last quarter we walked through three accounts where a single deprecated backup pipeline accounted for more than half of the account's EBS Snapshot line. We have an inventory script that finds orphan snapshots, detached volumes, unused EIPs, and idle NAT gateways across an account in about 20 minutes, plus a sample-then-delete workflow we walk the team through live so nothing irreversible happens on autopilot. The deletion is always the easy half. The work is figuring out which orphans are safe and writing the tag-at-creation policy that stops the next one.

If your bill has a line that does not match anything that should be running, the orphan audit is usually the fastest way to find out where it is going. Request an infrastructure review and we will run the audit with your team on a 30-minute diagnostic call this week. You can also see the broader pattern in our services overview for cloud cost spike work.

Originally published at https://infraforge.agency/insights/orphan-ebs-snapshots-deleted-backup-pipeline-cost-spike/.
