folarin oyenuga

How Systematic EBS Optimization Delivered Significant Annual Savings

## Definitions & Scope

Orphaned volume (in this context): An EBS volume that is unattached, not referenced by any PersistentVolume/PersistentVolumeClaim, not part of backup/DR workflows, not snapshot-dependent for active AMI pipelines, and confirmed deletable by volume owners.

Scope: Production EKS cluster spanning multiple AWS regions, supporting 1000+ namespaces.


The Problem

In cloud environments, EBS volumes accumulate silently. Engineers spin up instances for testing, experiments are abandoned, services are decommissioned—and volumes are left behind. Still attached to your AWS bill.

While investigating resources lacking mandatory tags for a compliance initiative, I found over a thousand unattached EBS volumes consuming 60 TB of storage. The original goal was to add tags to untagged resources. But I saw an opportunity: many of these volumes appeared genuinely orphaned. If I could safely identify and remove them, the cost savings would be significant.

The naive approach, "delete all unattached volumes," often causes incidents. Volumes might be referenced by automation, support disaster recovery, or be temporarily detached during deployments. I'd seen rushed cleanup campaigns end in emergency snapshot restores, lost engineering hours, and damaged trust.

I needed a systematic way to determine which volumes could be safely deleted, when, and how.


The Solution: 7-Step Verification Methodology

Every volume passed through 7 independent checks before deletion.

Step 1: Volume Age & IOPS Analysis (Triage)

Check: How old is the volume? What's the IOPS configuration?

Volume age correlates with orphan likelihood. Volumes untouched for 6+ months are strong candidates. But age alone isn't enough. IOPS analysis surfaces cost outliers that deserve priority attention.
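As a sketch, the age-and-state triage can be expressed as a filter over `describe-volumes` output. This is an illustration, not the author's actual tooling; the field names mirror the AWS API response, and the 180-day threshold mirrors the 6-month heuristic above.

```python
from datetime import datetime, timedelta, timezone

def old_unattached(volumes, now=None, min_age_days=180):
    """Return volumes that are unattached and older than min_age_days.

    `volumes` mirrors the `Volumes` list from `aws ec2 describe-volumes`:
    each item carries a `State` and an ISO-8601 `CreateTime`.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available"
        and datetime.fromisoformat(v["CreateTime"]) < cutoff
    ]

# Synthetic sample data for illustration.
volumes = [
    {"VolumeId": "vol-aaa", "State": "available", "CreateTime": "2023-01-10T00:00:00+00:00"},
    {"VolumeId": "vol-bbb", "State": "in-use",    "CreateTime": "2023-01-10T00:00:00+00:00"},
    {"VolumeId": "vol-ccc", "State": "available", "CreateTime": "2025-06-01T00:00:00+00:00"},
]
now = datetime(2025, 7, 1, tzinfo=timezone.utc)
candidates = old_unattached(volumes, now=now)
print([v["VolumeId"] for v in candidates])  # ['vol-aaa']
```

Only the old, unattached volume survives the filter; the in-use and recently created ones are excluded from triage.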

During triage, I discovered a 750 GB io1 volume provisioned at 15,000 IOPS, with a significant annual cost for a single volume. It's easy to focus only on storage (GB) and nothing more, but the real cost driver for io1/io2 volumes is provisioned IOPS.
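To see why provisioned IOPS dominates, here is the arithmetic for a volume of that shape, using assumed us-east-1 list prices (prices vary by region and change over time, so treat the constants as placeholders):

```python
# Assumed us-east-1 list prices; check current pricing for your region.
IO1_GB_MONTH = 0.125    # $ per GB-month of io1 storage
IO1_IOPS_MONTH = 0.065  # $ per provisioned IOPS-month

size_gb, iops = 750, 15_000
storage_cost = size_gb * IO1_GB_MONTH   # monthly storage charge
iops_cost = iops * IO1_IOPS_MONTH       # monthly provisioned-IOPS charge
annual = 12 * (storage_cost + iops_cost)

print(f"storage: ${storage_cost:,.2f}/mo, IOPS: ${iops_cost:,.2f}/mo, annual: ${annual:,.2f}")
```

At these assumed rates, the IOPS charge is roughly ten times the storage charge, which is why a GB-only analysis misses the biggest line item.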

Lesson: IOPS analysis should not be excluded from your verification, especially for io1/io2 volumes.

Step 2: Attachment State Verification

Check: Is this volume currently attached to any EC2 instance?

```bash
aws ec2 describe-volumes --filters Name=status,Values=available
```

Status `available` means unattached. But unattached ≠ deletable, so I moved on to further verification.

Step 3: Kubernetes PV/PVC Cross-Reference

Check: Is this volume referenced by any PersistentVolume or PersistentVolumeClaim?

```bash
kubectl get pv -o json | jq '.items[] | select(.spec.awsElasticBlockStore.volumeID != null)'
```

A volume might be unattached at the EC2 level but still referenced by Kubernetes. Deleting it would break the next pod that tries to mount it.
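The cross-reference can be sketched as a function that collects every volume ID a PV claims. As an assumption beyond the `kubectl` command above, this version also handles the EBS CSI driver field (`spec.csi.volumeHandle` with driver `ebs.csi.aws.com`), since newer clusters provision through CSI rather than the legacy in-tree field:

```python
def referenced_volume_ids(pv_items):
    """Collect EBS volume IDs referenced by PersistentVolumes.

    Handles the legacy in-tree field (spec.awsElasticBlockStore.volumeID,
    which may look like "aws://us-east-1a/vol-123") and the EBS CSI
    driver field (spec.csi.volumeHandle).
    """
    ids = set()
    for pv in pv_items:
        spec = pv.get("spec", {})
        in_tree = spec.get("awsElasticBlockStore", {}).get("volumeID")
        if in_tree:
            ids.add(in_tree.rsplit("/", 1)[-1])  # strip the aws://zone/ prefix
        csi = spec.get("csi", {})
        if csi.get("driver") == "ebs.csi.aws.com" and csi.get("volumeHandle"):
            ids.add(csi["volumeHandle"])
    return ids

# Synthetic PV items for illustration.
pvs = [
    {"spec": {"awsElasticBlockStore": {"volumeID": "aws://us-east-1a/vol-123"}}},
    {"spec": {"csi": {"driver": "ebs.csi.aws.com", "volumeHandle": "vol-456"}}},
]
print(sorted(referenced_volume_ids(pvs)))  # ['vol-123', 'vol-456']
```

Any unattached volume whose ID appears in this set stays off the deletion list.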

Step 4: CloudTrail Activity Analysis

Check: When was this volume last accessed? By whom?

CloudTrail reveals the last AttachVolume, DetachVolume, or CreateSnapshot events. Volumes with no activity for 6+ months are strong deletion candidates. Recent activity means someone's still using it, even if it's currently detached.
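The idle-time decision over those events can be sketched as a pure function. The event shape below mirrors what `aws cloudtrail lookup-events` returns (`Events` entries with `EventName` and `EventTime`); the function itself is an illustration, not the author's tooling:

```python
from datetime import datetime, timedelta, timezone

RELEVANT = {"AttachVolume", "DetachVolume", "CreateSnapshot"}

def idle_for(events, now, min_idle_days=180):
    """True if the volume has had no relevant CloudTrail activity
    within the last min_idle_days."""
    cutoff = now - timedelta(days=min_idle_days)
    recent = [
        e for e in events
        if e["EventName"] in RELEVANT
        and datetime.fromisoformat(e["EventTime"]) >= cutoff
    ]
    return not recent

now = datetime(2025, 7, 1, tzinfo=timezone.utc)
events = [{"EventName": "DetachVolume", "EventTime": "2024-09-01T00:00:00+00:00"}]
print(idle_for(events, now))  # True: last activity ~10 months ago
```

A volume with any attach, detach, or snapshot event inside the window is excluded, even if it is currently detached.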

Step 5: Snapshot & AMI Pipeline Check

Check: Is this volume part of an active snapshot workflow or AMI build process?

Snapshots and AMIs indicate the volume is referenced by automation, launch templates, or image build pipelines. Even if the volume itself isn't directly backing an AMI (snapshots do that), its existence in an active workflow means deletion could break restore processes or future builds.
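One way to sketch this check: take the volume's snapshots (as from `aws ec2 describe-snapshots --filters Name=volume-id,...`) and your account's AMIs (as from `aws ec2 describe-images --owners self`), and flag any snapshot that backs a registered image. The field names mirror the AWS API; the function is illustrative:

```python
def ami_linked_snapshots(snapshots, images):
    """Return IDs of this volume's snapshots that back a registered AMI."""
    snap_ids = {s["SnapshotId"] for s in snapshots}
    used = set()
    for img in images:
        for bdm in img.get("BlockDeviceMappings", []):
            snap = bdm.get("Ebs", {}).get("SnapshotId")
            if snap in snap_ids:
                used.add(snap)
    return used

# Synthetic sample data for illustration.
snapshots = [{"SnapshotId": "snap-1"}, {"SnapshotId": "snap-2"}]
images = [{"BlockDeviceMappings": [{"Ebs": {"SnapshotId": "snap-2"}}]}]
print(sorted(ami_linked_snapshots(snapshots, images)))  # ['snap-2']
```

A non-empty result means the volume sits in an active image pipeline and deletion needs extra scrutiny, even though the AMI technically depends on the snapshot rather than the volume.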

Step 6: Tag & Naming Convention Analysis

Check: Do tags or naming patterns indicate purpose?

Tags like kubernetes.io/created-for/pvc/name or naming patterns like prometheus-data-* reveal original purpose. Cross-reference with current cluster state to determine if that workload still exists.
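Extracting the originating workload from those tags can be sketched as follows; the `kubernetes.io/created-for/pvc/*` keys are the ones the Kubernetes EBS provisioners set, and the tag-list shape mirrors the `Tags` field on a volume:

```python
def pvc_from_tags(tags):
    """Return (namespace, pvc_name) recorded on an EBS volume by the
    Kubernetes provisioner, or None if the tags are absent."""
    kv = {t["Key"]: t["Value"] for t in tags}
    ns = kv.get("kubernetes.io/created-for/pvc/namespace")
    name = kv.get("kubernetes.io/created-for/pvc/name")
    return (ns, name) if name else None

# Synthetic tag list for illustration.
tags = [
    {"Key": "kubernetes.io/created-for/pvc/namespace", "Value": "monitoring"},
    {"Key": "kubernetes.io/created-for/pvc/name", "Value": "prometheus-data-0"},
]
print(pvc_from_tags(tags))  # ('monitoring', 'prometheus-data-0')
```

The recovered namespace/PVC pair is what you then check against the live cluster: if the workload is gone, the volume is likely orphaned.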

Step 7: Senior Team Validation for High-Value Volumes

Check: For volumes >1TB or with significant cost impact, get senior team sign-off.

After successfully deleting 725 volumes across earlier phases (small and medium-value batches), I had 4 remaining high-value volumes including the expensive io1 volume from Step 1.

I scheduled a live review session with senior engineers. I walked through my 7-step verification for each volume. They applied their own independent verification methods. Both approaches confirmed: all 4 volumes were safe to delete.

All 4 were deleted during the call. Zero incidents.


Phased Execution

I didn't delete 729 volumes on day one. Instead, I rolled deletions out in phases:

Phase 1: 10 oldest, smallest volumes. 48-hour monitoring. Zero issues.

Phase 2: Incremental batches (15 → 20 → 30 → 50 volumes), 24-48 hour monitoring windows between each.

Phase 3: 281 medium-value Prometheus volumes from decommissioned workloads.

Phase 4: 4 high-value volumes with senior team validation.

This was done under change control with documented approvals and audit-friendly evidence.
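The escalating batch sizes above can be sketched as a simple planner (the monitoring window between batches happens out of band; the ramp-up sizes mirror the phases described, and the cap at the largest size is my assumption):

```python
def batches(candidates, sizes=(10, 15, 20, 30, 50)):
    """Split deletion candidates into escalating batches; after the
    ramp-up sizes are exhausted, continue at the largest size."""
    out, i, step = [], 0, 0
    while i < len(candidates):
        n = sizes[min(step, len(sizes) - 1)]
        out.append(candidates[i:i + n])
        i += n
        step += 1
    return out

vols = [f"vol-{k:03d}" for k in range(60)]
plan = batches(vols)
print([len(b) for b in plan])  # [10, 15, 20, 15]
```

Each batch only proceeds after the previous one has cleared its 24-48 hour monitoring window with zero issues.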

Total timeline: 5 weeks | Success rate: 100% (729/729) | Production incidents: 0


Results

| Metric | Value |
| --- | --- |
| Volumes investigated | 1000+ |
| Volumes deleted | 732 (729 by me, 3 via methodology adoption) |
| Storage reclaimed | 60 TB |
| Annual savings | Significant |
| Production incidents | 0 |

The team appreciated the systematic approach. A colleague subsequently adopted the methodology on another cluster, identifying 3 additional high-value volumes, demonstrating that reusable frameworks multiply impact beyond direct execution.


Key Takeaways

Systematic beats aggressive. I could have deleted everything in a day. The 5-week phased approach with monitoring windows meant zero incidents and built enough trust that the methodology was adopted elsewhere.

IOPS costs dominate io1/io2 volumes. It's easy to focus only on storage (GB), yet a single io1 volume can cost more annually than twenty gp3 volumes combined.

Dual verification builds trust. For high-impact deletions, having senior engineers independently confirm your methodology eliminates doubt, enables organizational adoption and also provides an opportunity to see things through fresh perspectives.

Look for adjacent opportunities. The original task was tagging compliance. The cost savings emerged from noticing that many untagged volumes were also unused. Sometimes the bigger win is next to the assigned work.

Automate what you've validated. Once you trust the methodology, bake it into your processes. Manual verification doesn't scale; time is invaluable.


Have you tackled similar cloud cost challenges? What worked for you?
