Noureldin ehab for AWS Community Builders

Posted on Jun 18 • Originally published at stakpak.gitbook.io

Investigate Why AWS Costs Suddenly Increased

#ai #aws #cloudnative #agents

Overview

By the end of this tutorial, you'll know how to use Stakpak to investigate zombie resources in a live AWS production account: identify every detached volume, idle load balancer, orphaned snapshot, and forgotten instance silently accruing charges, apply the right cleanups safely, and validate that production stays healthy throughout. You'll also configure Stakpak Autopilot to help detect similar resource sprawl automatically in the future.

Note: Stakpak is open source and works with any model you choose.

Problem

Your AWS production application is healthy. The pipeline is green, the SLOs are green, and the on-call channel is quiet.

But your FinOps lead just pinged the team:

"We're $4,300 over budget this month and trending 35% above last. Nothing in the apps catalog has changed."

You start the usual cost investigation loop: Cost Explorer by service and by tag, VPCs and NAT Gateways, unattached EBS volumes, stale snapshots, idle Elastic IPs, VPC endpoints, RDS instances, CloudWatch log retention, S3 lifecycle policies, and CloudTrail events.

Cost Explorer shows the highest cost is from EC2, Other, EKS, and CloudWatch. The rest is scattered across eight services in chunks too small to feel urgent on their own. Tag breakdowns are messy because half the spend rolls up under (no tag) or Owner=unknown, and the biggest single CUR line item is a Fargate workload nobody on the current team recognizes.
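The by-service breakdown described above can be reproduced outside the console. The following is a minimal sketch, assuming boto3 is installed and credentials with Cost Explorer access are configured; the helper names `fetch_cost_by_service`, `totals_by_group`, and `top_groups` are illustrative and not part of Stakpak or AWS.

```python
from collections import defaultdict


def fetch_cost_by_service(start: str, end: str) -> dict:
    """Query Cost Explorer for monthly unblended cost grouped by service.

    Dates are 'YYYY-MM-DD' and the end date is exclusive.
    This sketch ignores NextPageToken, so very large results may be truncated.
    """
    import boto3  # imported lazily so the pure helpers below work without it

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


def totals_by_group(response: dict) -> dict:
    """Sum UnblendedCost per group key across every period in the response."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            key = group["Keys"][0]
            totals[key] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def top_groups(totals: dict, n: int = 5) -> list:
    """Return the n most expensive groups as (key, amount) pairs."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Changing `GroupBy` to `{"Type": "TAG", "Key": "Owner"}` gives the tag view instead, which is where the untagged spend shows up.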

Is the $890 NAT Gateway data line the orphaned VPC nobody decommissioned, or production traffic that should be flowing through a VPC endpoint?
Are the 1,400+ EBS snapshots load-bearing, or from a Lambda deprecated 18 months ago and never disabled?
Is the RDS instance tagged Environment=staging-old truly idle, or does some nightly job still touch it?
Which of the 12 likely cost drivers, if any, would be the wrong thing to delete?

Cost Explorer gives you part of the picture. AWS resource APIs give you the rest. But you still have to connect them, attribute them to owners, correlate them with utilization, and decide what is safe to remediate.
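To make the "connect, attribute, correlate, decide" loop concrete, here is a rough sketch of a triage rule that joins cost, ownership, and utilization for a single resource. The `Candidate` fields, the thresholds, and the verdict names are all assumptions made for illustration; they do not describe how Stakpak reasons internally. Note that no branch deletes anything automatically.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    resource_id: str
    monthly_cost: float
    owner: Optional[str]            # value of an Owner tag, if any
    peak_cpu_pct: float             # peak CPU over the lookback window
    connections_30d: int            # connections or requests in the last 30 days
    referenced_by: List[str] = field(default_factory=list)  # e.g. route tables, jobs


def triage(c: Candidate, cpu_idle_threshold: float = 5.0) -> str:
    """Classify a resource; thresholds are illustrative assumptions."""
    in_use = (
        bool(c.referenced_by)
        or c.connections_30d > 0
        or c.peak_cpu_pct >= cpu_idle_threshold
    )
    if in_use:
        return "keep"
    if c.owner and c.owner != "unknown":
        return "confirm-with-owner"
    # Idle and unowned: check provenance (for example CloudTrail) before acting.
    return "investigate-provenance"
```

The design choice here is to treat any sign of use as a reason to keep the resource, so a nightly job that still touches a database is enough to block cleanup.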

Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

storefront: Customer-facing Next.js app on Vercel.
api-gateway: Public REST and GraphQL edge on EKS.
orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.
payments-service: Java on ECS Fargate, integrates with Stripe.
inventory-worker: Celery workers on EKS draining an SQS queue.
search-indexer: Rust Lambda keeping OpenSearch in sync.
admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic, and none of the recent deploys touched infrastructure. That is what makes a 35% cost jump suspicious: the bill is growing faster than the application is.

Now that we understand the app and architecture, we can start investigating the cost spike.

Step-by-Step Guide

Prerequisites

Troubleshooting

Open Stakpak and ask it to investigate the cloud cost spike

Now let's let it do its magic.

Stakpak traced the cost spike across billing, utilization, and infrastructure signals and identified multiple sources of unnecessary spend driving the 35% increase.

It found that the $4,270 June overage came from 12 distinct cost drivers totaling ~$6,800/month of avoidable spend, none caused by application changes. The signals were spread across Cost Explorer deltas, tag anomalies (staging-old +5,854%, intern-summer-2025 +9,677%), CUR line items, CloudWatch utilization, and CloudTrail provenance.
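Tag anomalies like the ones above are month-over-month percentage changes in spend per tag value. A minimal sketch of that calculation follows; the function names and the choice to report brand-new spend as infinity are assumptions for illustration, and the inputs are expected to be plain `{tag: cost}` dictionaries you have already pulled from billing data.

```python
def pct_change(previous: float, current: float) -> float:
    """Percentage change from previous to current.

    Spend that appears from a zero baseline is reported as infinity
    so that it sorts to the top of the anomaly list.
    """
    if previous == 0:
        return float("inf") if current > 0 else 0.0
    return (current - previous) / previous * 100.0


def rank_anomalies(prev: dict, curr: dict, min_pct: float = 50.0) -> list:
    """Return (tag, pct_change) pairs above min_pct, largest first."""
    changes = [
        (tag, pct_change(prev.get(tag, 0.0), cost)) for tag, cost in curr.items()
    ]
    flagged = [(tag, pct) for tag, pct in changes if pct >= min_pct]
    return sorted(flagged, key=lambda kv: kv[1], reverse=True)
```

Percentages alone can mislead when the baseline is tiny, so it helps to read them next to the absolute dollar change.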

Then it:

Deleted the orphaned legacy VPC and its NAT Gateway, abandoned since the 2024 EKS migration
Terminated three m5.2xlarge legacy batch workers idling at 2% CPU
Deleted the forgotten eks-dev-intern cluster and its Fargate Spot profile, running since July 2025
Deleted the staging-old RDS instance after 30 days of zero connections
Removed five unattached EBS volumes, three idle Elastic IPs, and 1,400+ stale snapshots from a deprecated 2023 backup Lambda
Disabled GuardDuty in eu-west-1 and ap-southeast-1 where no workloads exist
Added an S3 Gateway VPC endpoint to the production VPC, eliminating $890/month of NAT data processing
Applied lifecycle rules to northstar-prod-edge-logs and 30-day retention to three "Never expire" log groups
Fixed cross-AZ traffic on orders-service
Deployed AWS Budgets with anomaly detection, tag-enforcement SCPs, and Config rules
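Before removing anything in a list like this, it is worth listing the candidates in a read-only pass. Below is a hedged sketch for one item, unattached EBS volumes, assuming boto3 and read access to EC2; `list_available_volumes` and `stale_volumes` are hypothetical helper names, and the 30-day age cutoff is an arbitrary assumption. It only reports volume IDs and deletes nothing.

```python
from datetime import datetime, timedelta, timezone


def list_available_volumes(region: str) -> list:
    """Return EBS volumes in the 'available' (unattached) state for one region."""
    import boto3  # imported lazily so stale_volumes can be used without it

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [volume for page in pages for volume in page["Volumes"]]


def stale_volumes(volumes: list, now: datetime, min_age_days: int = 30) -> list:
    """Return IDs of unattached volumes created at least min_age_days ago.

    CreateTime is when the volume was created, not when it was detached,
    so treat the result as a list to review rather than a list to delete.
    """
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available"
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]
```

A typical call would be `stale_volumes(list_available_volumes("us-east-1"), datetime.now(timezone.utc))`, with the output reviewed against owners and snapshots before any cleanup.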

After the changes were applied, Stakpak verified that:

All 12 driver resources are gone or reconfigured
Every production workload remained healthy with no SLO regressions
Projected run-rate dropped to ~$9,600/month, below the January baseline

Now everything is cleaned up 🥳

Now it's asking us if we want to set up Stakpak Autopilot to avoid future cost spikes.

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.
