
Noureldin ehab for AWS Community Builders

Posted on • Originally published at stakpak.gitbook.io

Investigate and Clean Up Unused Cloud Resources

Overview

By the end of this tutorial, you'll know how to use Stakpak to investigate zombie resources in a live AWS production account. You'll identify the detached volumes, idle load balancers, orphaned snapshots, and forgotten instances silently accruing charges, apply the right cleanups safely, and validate that production stays healthy throughout. Finally, you'll configure Stakpak Autopilot to help detect similar resource sprawl automatically in the future.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

AWS environments naturally accumulate unused resources over time: detached volumes, old snapshots, idle load balancers, unassociated Elastic IPs, and forgotten S3 buckets.

Finding them is easy. Determining whether they're safe to delete is not.

A resource may look unused, but it could still support a production workload, backup process, or undocumented dependency. Safely cleaning up cloud waste requires connecting usage, ownership, and activity data across your environment before taking action.

Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

  • storefront: Customer facing Next.js app on Vercel.

  • api-gateway: Public REST and GraphQL edge on EKS.

  • orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.

  • payments-service: Java on ECS Fargate, integrates with Stripe.

  • inventory-worker: Celery workers on EKS draining an SQS queue.

  • search-indexer: Rust Lambda keeping OpenSearch in sync.

  • admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic. None of the recent deploys touched infrastructure.

The application itself is well understood and accounted for, but the AWS account it runs in has accumulated years of side projects, migrations, and experiments that nobody has audited. Anything we find outside of this catalog is a candidate for cleanup, as long as we can prove it isn't quietly supporting one of these workloads.
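The "outside of this catalog" rule can be sketched as a simple filter. This is only an illustration, not how Stakpak works internally: it assumes each resource carries a hypothetical `service` tag naming its owning component, and the `Id` field and sample records are invented for the example. A resource that fails this check is a candidate for investigation, not yet proven safe to delete.

```python
# Component names taken from the application catalog above.
CATALOG = {
    "storefront", "api-gateway", "orders-service", "payments-service",
    "inventory-worker", "search-indexer", "admin-console",
}


def tags_to_dict(tags):
    """Convert an AWS-style [{'Key': k, 'Value': v}] tag list into a dict."""
    return {t["Key"]: t["Value"] for t in (tags or [])}


def cleanup_candidates(resources, catalog=CATALOG, tag_key="service"):
    """Return IDs of resources whose owner tag is missing or not in the catalog.

    `tag_key` is a hypothetical tagging convention; adjust it to whatever
    ownership tag your account actually uses.
    """
    candidates = []
    for res in resources:
        owner = tags_to_dict(res.get("Tags")).get(tag_key, "").lower()
        if owner not in catalog:
            candidates.append(res["Id"])
    return candidates


# Hypothetical sample data for illustration only.
sample = [
    {"Id": "vol-1", "Tags": [{"Key": "service", "Value": "orders-service"}]},
    {"Id": "vol-2", "Tags": []},
    {"Id": "vol-3"},
]
print(cleanup_candidates(sample))
```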

Now that we understand the app and architecture, we can start investigating the account.

Step-by-Step Guide

Prerequisites

  1. Install Stakpak

  2. AWS credentials configured locally

Troubleshooting

  1. Open Stakpak and ask it to audit our AWS account for unused and zombie resources.

Now let's let it do its magic.

Stakpak audited the AWS account for unused and zombie resources across compute, network, storage, IAM, data, and operational categories and found a small but real pool of recurring waste with no business value attached to any of it.

It identified ~$97/month of avoidable spend spread across 15 zombie resources in us-east-1, none tied to any active application. The signals came from EC2 state checks, EBS volume status, ELB target health, CloudWatch metrics for S3 and Lambda, IAM credential reports, and tag/name pattern analysis (*-OLD, *-DEPRECATED, rakesh-test-*, marketing-campaign-2022, loadtest-runner-2024-q1).
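The name-pattern part of that analysis can be approximated with shell-style globs. The sketch below is an assumption-laden illustration rather than Stakpak's actual logic: the wildcard placement in `ZOMBIE_PATTERNS` is a guess based on the names in the report, and a pattern match is only a hint that a resource deserves a closer look.

```python
from fnmatch import fnmatchcase

# Glob patterns modeled on the names in the audit report.
# The exact wildcard positions are assumptions made for this example.
ZOMBIE_PATTERNS = [
    "*-OLD",
    "*-DEPRECATED",
    "rakesh-test-*",
    "*marketing-campaign-2022*",
    "loadtest-runner-2024-q1",
]


def matching_pattern(name, patterns=ZOMBIE_PATTERNS):
    """Return the first pattern that matches `name` (case-sensitive), or None."""
    for pattern in patterns:
        if fnmatchcase(name, pattern):
            return pattern
    return None


for name in ["northstar-mysql-data-OLD", "rakesh-test-bucket", "prod-invoices"]:
    print(name, "->", matching_pattern(name))
```

Matching is case-sensitive on purpose, so a suffix like `-OLD` does not accidentally catch names such as `threshold`.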

Then it:

  • Terminated the stopped loadtest-runner-2024-q1 EC2 instance, abandoned since the Q1 2024 load test campaign

  • Deleted five unattached EBS volumes totaling 371 GB, including a 200 GB elasticsearch-data-node-3 orphan from the search-v1 deprecation and a 100 GB northstar-mysql-data-OLD volume

  • Deregistered the northstar-golden-image-v2-DEPRECATED AMI and removed its backing snapshot

  • Released four unassociated Elastic IPs, including old-nat-gateway-eip from a decommissioned NAT and jenkins-static-ip from the Jenkins-to-GHA migration

  • Deleted two abandoned ALBs (northstar-internal-tools with an empty target group, and a canary ALB with all targets unhealthy) and the northstar-legacy-clb Classic ELB tied to the deprecated checkout-v1 project

  • Removed the unused openclaw-sg security group and its orphan openclaw-key key pair

  • Emptied and deleted six zombie S3 buckets including northstar-marketing-campaign-2022, rakesh-test-bucket (employee left), tempdata-export, and northstar-checkout-v1-logs

  • Cleaned up two empty CloudWatch log groups (/aws/lambda/feedbackboard-server, /aws/lambda/feedbackboard-warmer) left behind by deleted Lambda functions
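To make two of the signals behind these cleanups concrete, here is a minimal sketch that filters data shaped like the EC2 `DescribeVolumes` and `DescribeAddresses` responses. It is not Stakpak's implementation, and the sample records (IDs and sizes) are hypothetical. An EBS volume in the `available` state has no attachments, and an Elastic IP without an `AssociationId` is not associated with any instance or network interface.

```python
def unattached_volumes(describe_volumes_response):
    """Return volumes in the 'available' state with no attachments."""
    return [
        v for v in describe_volumes_response.get("Volumes", [])
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def unassociated_addresses(describe_addresses_response):
    """Return Elastic IPs that have no AssociationId."""
    return [
        a for a in describe_addresses_response.get("Addresses", [])
        if "AssociationId" not in a
    ]


def total_size_gb(volumes):
    """Sum the Size field (GiB) across the given volumes."""
    return sum(v["Size"] for v in volumes)


# Hypothetical sample responses for illustration only.
volumes_response = {
    "Volumes": [
        {"VolumeId": "vol-aaa", "Size": 200, "State": "available", "Attachments": []},
        {"VolumeId": "vol-bbb", "Size": 100, "State": "available", "Attachments": []},
        {"VolumeId": "vol-ccc", "Size": 50, "State": "in-use",
         "Attachments": [{"InstanceId": "i-123", "State": "attached"}]},
    ]
}
addresses_response = {
    "Addresses": [
        {"AllocationId": "eipalloc-1", "PublicIp": "198.51.100.10"},
        {"AllocationId": "eipalloc-2", "PublicIp": "198.51.100.11",
         "AssociationId": "eipassoc-9"},
    ]
}

orphans = unattached_volumes(volumes_response)
print([v["VolumeId"] for v in orphans], total_size_gb(orphans), "GB")
print([a["AllocationId"] for a in unassociated_addresses(addresses_response)])
```

In a real account, the two response dicts would come from the corresponding EC2 API calls. Being unattached is necessary but not sufficient for deletion, which is why the audit also cross-checks names, tags, and activity.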

After the changes were applied, Stakpak verified that:

  • All 15 zombie resources are gone

  • No surviving production resources (bastion, api-canary, staging-app, prod-invoices bucket) were impacted

  • us-west-2 remained clean (only the default VPC, no workloads)

  • Projected monthly waste dropped from ~$97/month to $0, a 100% reduction on identified zombies
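The verification above boils down to two set checks and one percentage. The sketch below shows that logic under stated assumptions: the function names are hypothetical, resources are identified by plain name strings, and `remaining` stands in for whatever inventory you collect after the cleanup.

```python
def verify_cleanup(remaining_ids, deleted_ids, protected_ids):
    """Compare a post-cleanup inventory against expectations.

    Returns zombies that still exist and protected resources that went missing.
    Both lists should be empty after a successful cleanup.
    """
    remaining = set(remaining_ids)
    return {
        "zombies_still_present": sorted(set(deleted_ids) & remaining),
        "protected_missing": sorted(set(protected_ids) - remaining),
    }


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`; `before` must be positive."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Protected resource names come from the verification list above;
# the deleted list is a small subset used for illustration.
protected = ["bastion", "api-canary", "staging-app", "prod-invoices"]
deleted = ["loadtest-runner-2024-q1", "northstar-legacy-clb"]
remaining = ["bastion", "api-canary", "staging-app", "prod-invoices"]

print(verify_cleanup(remaining, deleted, protected))
print(percent_reduction(97, 0))
```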

Now everything is cleaned up 🥳

Now it's asking us if we want to set up Stakpak Autopilot to keep zombie resources from piling up again.

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.

Monitoring

  1. First, it asks how often we want to run the checks.

  2. Then it asks if we want Stakpak to take action.

  3. Then it asks where we want to be alerted.

  4. And that's it.

