Investigate and Clean Up Unused Cloud Resources

#ai #aws #finops #devops

Overview

By the end of this tutorial, you'll know how to use Stakpak to investigate zombie resources in a live AWS production account. You'll identify every detached volume, idle load balancer, orphaned snapshot, and forgotten instance silently accruing charges, apply the right cleanups safely, and validate that production stays healthy throughout. Finally, you'll configure Stakpak Autopilot to help detect similar resource sprawl automatically in the future.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

AWS environments naturally accumulate unused resources over time: detached volumes, old snapshots, idle load balancers, unassociated Elastic IPs, and forgotten S3 buckets.

Finding them is easy. Determining whether they're safe to delete is not.

A resource may look unused, but it could still support a production workload, backup process, or undocumented dependency. Safely cleaning up cloud waste requires connecting usage, ownership, and activity data across your environment before taking action.

Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

storefront: Customer-facing Next.js app on Vercel.
api-gateway: Public REST and GraphQL edge on EKS.
orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.
payments-service: Java on ECS Fargate, integrates with Stripe.
inventory-worker: Celery workers on EKS draining an SQS queue.
search-indexer: Rust Lambda keeping OpenSearch in sync.
admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic. None of the recent deploys touched infrastructure.

The application itself is well understood and accounted for, but the AWS account it runs in has accumulated years of side projects, migrations, and experiments that nobody has audited. Anything we find outside of this catalog is a candidate for cleanup, as long as we can prove it isn't quietly supporting one of these workloads.

Now that we understand the app and architecture, we can start investigating the account.

Step-by-Step Guide

Prerequisites

Troubleshooting

Open Stakpak and ask it to audit our AWS account for unused and zombie resources.

Now let's let it do its magic.

Stakpak audited the AWS account for unused and zombie resources across compute, network, storage, IAM, data, and operational categories and found a small but real pool of recurring waste with no business value attached to any of it.

It identified ~$97/month of avoidable spend spread across 15 zombie resources in us-east-1, none tied to any active application. The signals came from EC2 state checks, EBS volume status, ELB target health, CloudWatch metrics for S3 and Lambda, IAM credential reports, and tag/name pattern analysis (*-OLD, *-DEPRECATED, rakesh-test-*, marketing-campaign-2022, loadtest-runner-2024-q1).
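The name-pattern signal is the easiest one to reproduce by hand. The following is a minimal Python sketch of that idea, not Stakpak's actual implementation: the wildcard placement in the patterns and the helper names (matching_pattern, flag_suspicious) are assumptions made for illustration. A name match only marks a resource as a candidate; it says nothing about whether deletion is safe.

```python
from fnmatch import fnmatchcase

# Assumption: glob patterns reconstructed from the names seen in this audit.
# The exact patterns Stakpak uses are not documented here.
ZOMBIE_NAME_PATTERNS = [
    "*-OLD",
    "*-DEPRECATED",
    "rakesh-test-*",
    "*marketing-campaign-2022*",
    "loadtest-runner-2024-q1",
]


def matching_pattern(name, patterns=ZOMBIE_NAME_PATTERNS):
    """Return the first pattern that matches the resource name, or None."""
    for pattern in patterns:
        # fnmatchcase is case-sensitive, so "old-nat-gateway-eip" does not hit "*-OLD".
        if fnmatchcase(name, pattern):
            return pattern
    return None


def flag_suspicious(names):
    """Map each suspicious resource name to the pattern that flagged it."""
    flagged = {}
    for name in names:
        pattern = matching_pattern(name)
        if pattern is not None:
            flagged[name] = pattern
    return flagged


if __name__ == "__main__":
    names = [
        "northstar-mysql-data-OLD",
        "northstar-golden-image-v2-DEPRECATED",
        "rakesh-test-bucket",
        "orders-service",
    ]
    for name, pattern in flag_suspicious(names).items():
        print(f"{name} matched {pattern}")
```

In practice this check would be combined with the other signals listed above (attachment state, target health, CloudWatch activity) before anything is deleted.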

Then it:

Terminated the stopped loadtest-runner-2024-q1 EC2 instance, abandoned since the Q1 2024 load test campaign
Deleted five unattached EBS volumes totaling 371 GB, including a 200 GB elasticsearch-data-node-3 orphan from the search-v1 deprecation and a 100 GB northstar-mysql-data-OLD volume
Deregistered the northstar-golden-image-v2-DEPRECATED AMI and removed its backing snapshot
Released four unassociated Elastic IPs, including old-nat-gateway-eip from a decommissioned NAT and jenkins-static-ip from the Jenkins-to-GHA migration
Deleted two abandoned ALBs (northstar-internal-tools with an empty target group, and a canary ALB with all targets unhealthy) and the northstar-legacy-clb Classic ELB tied to the deprecated checkout-v1 project
Removed the unused openclaw-sg security group and its orphaned openclaw-key key pair
Emptied and deleted six zombie S3 buckets including northstar-marketing-campaign-2022, rakesh-test-bucket (employee left), tempdata-export, and northstar-checkout-v1-logs
Cleaned up two empty CloudWatch log groups (/aws/lambda/feedbackboard-server, /aws/lambda/feedbackboard-warmer) left behind by deleted Lambda functions
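Two of the checks behind this list, unattached EBS volumes and unassociated Elastic IPs, can be sketched in a few lines of Python. The functions below operate on dictionaries shaped like the responses of the EC2 DescribeVolumes and DescribeAddresses APIs, so they run without AWS credentials. The function names and all sample IDs are hypothetical, and this is only an illustration of the kind of filtering involved, not how Stakpak performs it.

```python
def unattached_volumes(describe_volumes_response):
    """Return (volume_id, size_gib) pairs for EBS volumes with no attachment."""
    found = []
    for volume in describe_volumes_response.get("Volumes", []):
        # "available" is the EBS state for a volume not attached to any instance.
        if volume.get("State") == "available" and not volume.get("Attachments"):
            found.append((volume["VolumeId"], volume["Size"]))
    return found


def unassociated_addresses(describe_addresses_response):
    """Return identifiers of Elastic IPs that are not associated with anything."""
    return [
        # Fall back to the public IP when no allocation ID is present.
        address.get("AllocationId", address.get("PublicIp"))
        for address in describe_addresses_response.get("Addresses", [])
        if not address.get("AssociationId")
    ]


if __name__ == "__main__":
    # Hypothetical sample data; IDs and sizes are made up for illustration.
    volumes = {
        "Volumes": [
            {"VolumeId": "vol-0aaa", "State": "available", "Size": 200, "Attachments": []},
            {"VolumeId": "vol-0bbb", "State": "in-use", "Size": 50,
             "Attachments": [{"InstanceId": "i-0123"}]},
        ]
    }
    addresses = {
        "Addresses": [
            {"AllocationId": "eipalloc-0aaa", "PublicIp": "203.0.113.10"},
            {"AllocationId": "eipalloc-0bbb", "PublicIp": "203.0.113.11",
             "AssociationId": "eipassoc-0ccc"},
        ]
    }
    print(unattached_volumes(volumes))
    print(unassociated_addresses(addresses))
```

With boto3 installed and credentials configured, the same functions could be fed the output of the EC2 client's describe_volumes() and describe_addresses() calls. Being unattached is only a starting signal: as the Problem section notes, a volume or address can still back a restore process or an undocumented dependency.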

After the changes were applied, Stakpak verified that:

All 15 zombie resources are gone
No surviving production resources (bastion, api-canary, staging-app, prod-invoices bucket) were impacted
us-west-2 remained clean (only the default VPC, no workloads)
Projected monthly waste dropped from ~$97/month to $0, a 100% reduction on identified zombies
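The first two checks in this list amount to a set comparison: nothing that was deleted should still exist, and nothing on the protected list should have disappeared. A minimal Python sketch of that comparison follows. The function name verify_cleanup and the idea of passing plain name lists are assumptions for illustration; the resource names are taken from this walkthrough.

```python
def verify_cleanup(deleted, protected, remaining):
    """Compare a post-cleanup inventory against what was deleted and what must survive."""
    remaining = set(remaining)
    return {
        # Deleted resources that still show up in the inventory.
        "still_present": sorted(set(deleted) & remaining),
        # Protected resources that are missing from the inventory.
        "missing_protected": sorted(set(protected) - remaining),
    }


if __name__ == "__main__":
    report = verify_cleanup(
        deleted=["loadtest-runner-2024-q1", "northstar-legacy-clb"],
        protected=["bastion", "api-canary", "staging-app", "prod-invoices"],
        # Hypothetical inventory snapshot taken after the cleanup.
        remaining=["bastion", "api-canary", "staging-app", "prod-invoices"],
    )
    # Both lists are empty when the cleanup went as intended.
    print(report)
```

An empty result for both keys corresponds to the outcome reported above: the zombies are gone and the surviving resources are untouched.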

Now everything is cleaned up 🥳

Now it's asking us if we want to set up Stakpak Autopilot to keep zombie resources from accumulating again.

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.