DEV Community: Noureldin ehab

Set Up a Local AWS Environment with MiniStack

Noureldin ehab — Tue, 30 Jun 2026 17:00:00 +0000

Overview

Running cloud infrastructure locally is now easier than ever with tools like MiniStack.

This shift became even more important recently, as LocalStack changed its model to require accounts and authentication tokens and introduced paid plans for full usage.

As a result, many developers are looking for simpler, fully local, and free alternatives.

MiniStack lets you run AWS-like services locally using real containers (Postgres, Redis, S3), making development faster, cheaper, and fully offline.

In this guide, we will use MiniStack to spin up a local AWS-like environment and then use Stakpak to interact with it, configure it, and operate it.

LocalStack moved core services behind a paid plan. If you relied on LocalStack Community for local dev and CI/CD, MiniStack is your free, MIT-licensed drop-in replacement. No sign up, no API key, no telemetry.

Problem

Setting up and working with local cloud environments manually can still be painful:

You need to configure services (databases, storage, networking)
You have to remember CLI commands or SDK usage
You need to debug issues across multiple containers
You manually test if services are actually working
You document setup steps for future use

Even with tools like MiniStack replacing LocalStack for many use cases, operating local infrastructure is still manual work.

Small mistakes like misconfigured services, missing environment variables, or broken connections can slow down development.

Stakpak is open source, vendor neutral, and works with any model you choose.

Step-by-Step Guide

Prerequisites

Install Stakpak
MiniStack installed (or you can ask Stakpak to install it)
Docker Installed (or you can ask Stakpak to install it)

Architecture

Application

What the app does

A user uploads a CSV or JSON file.

It lands in S3.
An event sends a job to SQS.
A Lambda reads the message.
The Lambda parses the file and stores results in Postgres.
It writes job state to Redis.
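The parsing step of this pipeline can be sketched in plain Python. This is a minimal, hypothetical sketch: the function names `parse_upload` and `job_state` are assumptions, and the state fields are modeled on the test result shown in this post rather than taken from the app's actual code. The S3, SQS, Postgres, and Redis calls are left out so the block runs anywhere.

```python
import csv
import io
import json


def parse_upload(filename: str, body: str) -> list[dict]:
    """Parse an uploaded CSV or JSON document into a list of row dicts."""
    if filename.lower().endswith(".csv"):
        return list(csv.DictReader(io.StringIO(body)))
    if filename.lower().endswith(".json"):
        data = json.loads(body)
        # Accept either a single object or a list of objects.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported file type: {filename}")


def job_state(rows: list[dict]) -> dict:
    """Build the job-state record a worker might write to Redis (assumed shape)."""
    return {"status": "done", "rows_inserted": len(rows)}


# Hypothetical two-row upload, standing in for the object fetched from S3.
sample = "id,name\n1,alpha\n2,beta\n"
rows = parse_upload("sample.csv", sample)
print(job_state(rows))
```

In the real app the Lambda would fetch the object named in the SQS message, insert the rows into Postgres, and then write the state record to Redis.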

You can check the source code here

Now that we understand the app and the architecture, we can start deploying it.

Deployment

Open Stakpak and ask it to deploy your app locally. That's it!
It read the codebase and understood the architecture
Spun up MiniStack, Postgres, and Redis via Docker Compose
Bootstrapped AWS resources: S3 bucket, SQS queue, and Lambda
Connected SQS to Lambda so uploads trigger processing automatically

Now let's test it.

Testing

It's working 🥳

Here is what happened:

We uploaded sample.csv -> got back a job_id
Lambda fired within 5 seconds (SQS polling)
Redis shows status: done, rows_inserted: 5
Postgres has all 5 rows parsed and stored


Load Test to Optimize Cloud Costs

Noureldin ehab — Mon, 29 Jun 2026 17:00:00 +0000

Overview

Most load testing tools focus on performance metrics like response times and throughput. They don't show you how much your scaling decisions cost in real time.

This guide shows you how to use Stakpak to see both performance & cost during load tests. You'll learn exactly how much each scaling decision costs, so you can find the cheapest way to keep your app running smoothly.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

Companies face a lot of challenges while trying to optimize cloud costs through load testing:

The limitations of traditional load testing

Standard load testing tools focus on performance metrics such as response times, throughput, and error rates.
Teams then use these results to make architectural decisions like scaling services, adding replicas, or reconfiguring infrastructure without considering the financial impact.
They only realize the true cost of those decisions once the cloud bill arrives.

This disconnect makes it difficult to align technical performance goals with cost optimization. Performance and cost live in two different universes.

Business Impact

Unpredictable monthly cloud bills
Inefficient resource utilization (often 20-30% waste)
Difficulty in capacity planning and budgeting
Competitive disadvantage due to higher operational costs

Step-by-Step Guide

Prerequisites

Install Stakpak
Cloud provider credentials configured
Application deployed and accessible
Basic understanding of your application's architecture.
Choose the endpoint you want to test (Staging or Ephemeral Environment)
Make sure you have explicit permission to run load tests.

In this guide, we will be load testing hackathon-judge-app. Let's take a look at the Cloud Architecture and what this app does.

Architecture

This setup deploys the hackathon-judge-app on Amazon ECS Fargate in the eu-north-1 region.
Traffic flows through an Application Load Balancer (80/443 -> Target Group 8501) into ECS tasks running across two availability zones for high availability.

Networking: VPC (10.0.0.0/16) with public subnets (NAT Gateways) and private app subnets.
Compute: ECS Fargate cluster with auto scaling (1–2 tasks, CPU 70%, memory 80%).
Registry: Amazon ECR stores the container images.
Observability: CloudWatch Logs (7 days) and Container Insights enabled.

Now, let's take a look at the app

Application

Hackathon Judge App

A Streamlit web application for judging hackathon pitches. Designed to be accessible through mobile web browsers with persistent data storage.

Features

Judge selection: Each judge can select their name before scoring teams
Team scoring: Judges can score teams based on configurable criteria
Score persistence: All scores are saved to a local JSON file
Mobile friendly design: Optimized for use on mobile devices
Configurable through YAML: Easy to adjust teams, judges, and judging criteria
Custom branding: Add your event logo and title for a personalized experience
Authentication: Password protection for judges to secure the scoring process

Now that we understand the app and the architecture, we can start load testing our app.

Open your terminal
Open Stakpak by typing stakpak

In this guide, we will use Apache Bench to load test our app

Now let's ask Stakpak to load test our app [insert app link] with Apache Bench and monitor its resource utilization

That's it. Stakpak will automatically figure out what to do and how to use Apache Bench.

Here we see that CPU utilization spiked to 17.88%, memory remained stable at 15%, and auto scaling wasn't triggered.

Now let's run the high load test: 200 concurrent users for 120 seconds.
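Stakpak picks the Apache Bench flags on its own, so the exact command it ran is not shown here. As a rough sketch, a run with the parameters above could be assembled like this, where `-c` sets the number of concurrent requests and `-t` sets the time limit in seconds. The URL is a hypothetical placeholder, and `ab_command` is an illustrative helper, not part of Stakpak.

```python
import shlex


def ab_command(url: str, concurrency: int, seconds: int) -> list[str]:
    """Build an Apache Bench argument list for a time-limited run."""
    # ab expects a URL that includes a path, so keep the trailing slash.
    return ["ab", "-c", str(concurrency), "-t", str(seconds), url]


# Hypothetical staging endpoint; replace it with your own app link.
cmd = ab_command("http://staging.example.com/", concurrency=200, seconds=120)
print(shlex.join(cmd))
```

Note that ab caps a time-limited run at 50,000 requests by default, so a fast endpoint may finish before the 120 seconds are up.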

Even under stress testing, the highest CPU spike was 42.53%, far below the 70% auto-scaling threshold. This shows that the current infrastructure is over-provisioned for the tested workload and has plenty of headroom before scaling becomes necessary.
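To put numbers on that headroom, here is a small sketch of the arithmetic, using the two CPU peaks reported above and the 70% CPU scaling threshold from the architecture. The helper names are illustrative only.

```python
def headroom_points(peak_pct: float, threshold_pct: float) -> float:
    """Percentage points left between the observed peak and the scaling threshold."""
    return round(threshold_pct - peak_pct, 2)


def share_of_threshold(peak_pct: float, threshold_pct: float) -> float:
    """Observed peak expressed as a percentage of the scaling threshold."""
    return round(peak_pct * 100 / threshold_pct, 1)


CPU_THRESHOLD = 70.0  # auto scaling CPU target from the architecture section

for label, peak in [("first test", 17.88), ("high load test", 42.53)]:
    print(label, headroom_points(peak, CPU_THRESHOLD), share_of_threshold(peak, CPU_THRESHOLD))
```

The high load peak leaves 27.47 percentage points of headroom, which is about 60.8% of the way to the threshold.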

If you want to see this in action you can see our Stakpak Ship It session where we did that live

Now that we’ve confirmed our infrastructure is over-provisioned, the next step is to evaluate cost efficiency. With Stakpak, you can go beyond performance testing and ask it to:

Estimate cloud costs for the current setup (it will use the Infrastructure Cost Estimation Rulebooks)
Generate a detailed report breaking down where resources (CPU, memory, storage, networking) are underutilized
Provide actionable recommendations for cost optimization, for example rightsizing instances, adjusting auto scaling policies, or switching to more efficient pricing models


Investigate Why an EC2 Application is Not Reachable

Noureldin ehab — Fri, 26 Jun 2026 17:00:00 +0000

Overview

In this tutorial, we'll use Stakpak to investigate and fix an AWS networking incident where an application running on EC2 is healthy, but unreachable from the internet.

Rather than manually inspecting EC2, VPC, subnet, route table, security group, network ACL, systemd, nginx, and application logs one by one, we'll use Stakpak to:

Investigate the incident
Identify the root cause
Apply the fix
Validate that the EC2 application becomes reachable again

By the end of this tutorial, you'll learn how to use Stakpak to troubleshoot EC2 application reachability issues across both the instance and AWS networking layers. We will also set up Stakpak Autopilot so it monitors our infrastructure 24/7, auto-fixes issues when it's safe, and pings us when human judgment is needed.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

You deploy a simple web application to an EC2 instance, and everything seems fine at first.

The Terraform deployment succeeds.

The EC2 instance is running.

The instance has a public IP address.

The security group appears to allow HTTP traffic.

The application process is healthy.

nginx is running.

But when you try to access the application from the internet, the request times out.

curl -v --connect-timeout 5 --max-time 10 http://ec2-3-236-155-58.compute-1.amazonaws.com/health

So you start the usual EC2 reachability debugging loop:

aws ec2 describe-instances \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-instance-status \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-security-groups \
  --group-ids sg-0d133f86e2d08a392 \
  --region us-east-1

aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-001f8813b0d78f5e3 \
  --region us-east-1

aws ec2 describe-subnets \
  --subnet-ids subnet-07083683f7e1d2f09 \
  --region us-east-1

aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-07083683f7e1d2f09 \
  --region us-east-1

Then you start checking the instance

aws ssm start-session \
  --target i-0a2bf3df8a5769989 \
  --region us-east-1

Now you have to figure out what actually matters.

Is the instance unhealthy?
Is nginx down?
Is the app listening on the wrong interface?
Is the security group blocking traffic?
Is the subnet public?
Is the route table missing an internet route?
Is the public IP missing?
Is another VPC networking control blocking the request?

AWS gives you the clues, but you still have to connect them.

Application

The application is a simple web service running on an EC2 instance.

It represents a small catalog preview service for the Northstar Commerce platform.

The app exposes a health endpoint and a basic HTML page. It runs locally on the instance and is served to external clients through nginx.

The main components are:

EC2 Instance: Runs the application and nginx.
Python Web Application: Provides the demo web service and health endpoint.
systemd Service: Keeps the application process running.
nginx: Listens on HTTP port 80 and proxies requests to the local app.
Security Group: Controls instance-level inbound and outbound traffic.
Subnet: Places the instance inside the VPC network.
Route Table: Defines how traffic leaves the subnet.
Internet Gateway: Provides internet connectivity for the VPC.
Network ACL: Applies subnet-level traffic rules.
IAM Instance Profile: Allows access through AWS Systems Manager Session Manager.

The normal request flow is:

A user sends an HTTP request to the EC2 public DNS name, traffic enters the VPC through the internet gateway, reaches the public subnet, passes the subnet and instance network controls, reaches nginx on port 80, nginx proxies the request to the local Python app on 127.0.0.1:8080, and the app returns a health response.

The expected health endpoint is: GET /health

When the application is working correctly, it returns:

{
"status": "ok",
"service": "northstar-catalog-preview"
}
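For reference, a service with this behavior can be sketched with Python's standard library alone. This is a minimal sketch, not the actual Northstar app: the `route`, `Handler`, and `serve` names are assumptions, and only the response body, the `/health` path, and the `127.0.0.1:8080` bind address come from the description above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_BODY = {"status": "ok", "service": "northstar-catalog-preview"}


def route(path: str) -> tuple[int, bytes]:
    """Return (status code, JSON body) for a request path."""
    if path == "/health":
        return 200, json.dumps(HEALTH_BODY).encode()
    return 404, b'{"error": "not found"}'


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Listen on loopback only; nginx on port 80 is the public entry point."""
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

Calling `serve()` starts the listener. Because it binds to loopback, a local `curl http://127.0.0.1:8080/health` can succeed even when nothing outside the instance can reach it, which is exactly the situation in this incident.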

In this incident, the application is healthy from inside the instance, but unreachable from the internet.

Now that we understand the app, we can start troubleshooting.

Step-by-Step Guide

Prerequisites

Install Stakpak
Cloud provider credentials configured

Troubleshooting

Open Stakpak and ask it to investigate the EC2 issue

Now let's let it do its magic

Stakpak started by investigating why the EC2 /health endpoint was timing out by checking DNS, EC2 status, SSM access, security groups, route tables, NACLs, and local app health.

It found that the EC2 instance and app were healthy, but the subnet Network ACL was blocking outbound ephemeral response traffic. The instance could receive traffic on port 80, but couldn’t send responses back to clients.
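The reason this failure is easy to miss is that network ACLs are stateless: the request and the response are evaluated separately, and rules are checked in ascending rule-number order with the first match winning. The sketch below models that behavior in plain Python. The rule numbers, the existing outbound rule, and the client port are hypothetical values chosen for illustration; only the port 80 inbound path and the TCP 1024-65535 outbound fix come from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules: list[Rule], port: int) -> bool:
    """Check rules lowest number first; first match wins, no match means deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def round_trip_ok(inbound: list[Rule], outbound: list[Rule],
                  server_port: int, client_port: int) -> bool:
    """The request is checked on the server port, the response on the client's port."""
    return evaluate(inbound, server_port) and evaluate(outbound, client_port)


inbound = [Rule(100, 80, 80, True)]
broken_outbound = [Rule(100, 443, 443, True)]                     # hypothetical existing rule
fixed_outbound = broken_outbound + [Rule(110, 1024, 65535, True)]  # the ephemeral-port fix

client_port = 50000  # hypothetical ephemeral port picked by the client
print(round_trip_ok(inbound, broken_outbound, 80, client_port))  # False
print(round_trip_ok(inbound, fixed_outbound, 80, client_port))   # True
```

A security group would not need the second rule, because security groups are stateful and allow return traffic automatically.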

Then it:

Verified EC2 status checks were passing
Confirmed SSM access was online
Confirmed nginx and the app were running locally
Verified local /health returned 200 OK
Confirmed the security group and route table were correct
Added an outbound NACL rule for TCP 1024-65535
Ran Terraform validation
Applied the Terraform fix

During apply, Terraform replaced the EC2 instance because the AL2023 AMI changed.

After the fix, Stakpak verified that:

The new instance i-04244ee1e1e4ef422 was running
The new URL was http://ec2-44-223-99-238.compute-1.amazonaws.com
The NACL allowed outbound ephemeral traffic
/health returned HTTP/1.1 200 OK

Now everything is working🥳

Let's ask it to set up Stakpak Autopilot so we avoid waking up at 3am because of an incident🤡

Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.

Monitoring

That's it, now it won't haunt us in our nightmares at 3 am.


Deploy your own OpenVPN Server on AWS with one prompt

Noureldin ehab — Thu, 25 Jun 2026 17:00:00 +0000

Overview

Most of your AWS resources should be in private subnets for security reasons, but that also means they’re not directly accessible from the internet. To reach them securely, you need a VPN.

In this tutorial, we’ll use OpenVPN on AWS to create a secure, encrypted connection to your private resources so your team can access them safely.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

AWS resources in private subnets aren’t accessible from the internet by default.
Teams often try to solve this by opening ports or using bastion hosts, which increases security risks.
These workarounds also add complexity to network management and access control.
A VPN is needed to provide secure and simple access without exposing services publicly.

Business Impact

Without a VPN, secure remote access is harder, slower, and riskier. A VPN simplifies access and keeps development and operations running securely.

But what is a VPN?

A VPN (Virtual Private Network) is a secure, encrypted connection that allows you to access a private network over the internet as if you were physically inside it. It’s commonly used to safely reach internal servers, databases, or applications without exposing them to the public.

Step-by-Step Guide

Prerequisites

Install Stakpak
Cloud provider credentials configured
Then just ask it: "I want to install OpenVPN on AWS so I can access my private resources"
Here you choose your preferences

I want to know more about the different architectures, so let's ask about it

Here I chose

Which AWS Region? EU West 1
Do you have a VPC set up? Yes, I have a VPC
How many people need VPN access? Just one person needs access
AWS Client VPN, self-hosted OpenVPN, or OpenVPN from the Marketplace? Self-hosted OpenVPN

I will just tell it to continue with the defaults

Now we can review the commands and press Enter to continue. It will:

Get the VPC details
Get the subnet details
Check the internet gateway

Now it will create a security group for OpenVPN and get the latest Ubuntu version

Now it will create the security group rules and the SSH key, and launch the EC2 instance

Now that we have the EC2 instance ready, Stakpak will start setting up OpenVPN

That's it, now we can use OpenVPN


Migrate from NGINX to Caddy on AWS

Noureldin ehab — Wed, 24 Jun 2026 17:00:00 +0000

Why Migrate to Caddy?

Caddy is open source, and it provides automatic HTTPS and certificate renewal out of the box, removing the need for Certbot or cron jobs. It offers secure defaults and simpler configuration, which makes it a lightweight, low-maintenance replacement for nginx.

It acts as a reverse proxy, load balancer, and static file server out of the box, with secure defaults and minimal setup.
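To give a feel for how small that setup can be, the sketch below renders a minimal Caddyfile for a static site. It is an illustration under assumptions, not the configuration produced in this guide: `render_caddyfile` is a hypothetical helper and the site root is a placeholder path. The domain is the one used in this migration.

```python
def render_caddyfile(domain: str, site_root: str) -> str:
    """Render a minimal Caddyfile for a static site.

    Using a public domain as the site address is what enables Caddy's
    automatic HTTPS, so no certificate paths appear in the config.
    """
    return (
        f"{domain} {{\n"
        f"\troot * {site_root}\n"
        "\tfile_server\n"
        "}\n"
    )


# The site root is a hypothetical path, not taken from the setup in this post.
caddyfile = render_caddyfile("migratingtocaddy.guku.io", "/var/www/site")
print(caddyfile)
```

The `root` and `file_server` directives are all a static site needs; certificate issuance, renewal, and the HTTP to HTTPS redirect are handled by Caddy itself.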

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Step by Step Guide

Architecture

Our current setup uses a single tier architecture on AWS to host a static HTML website. It runs on a t3.micro EC2 instance using nginx 1.28.0, serving files from /usr/share/nginx/html/. The instance is part of the default VPC and resides in a public subnet, allowing direct internet access.

Traffic is managed by a security group with inbound rules open to:

SSH (port 22)
HTTP (port 80)
HTTPS (port 443)

DNS is handled through Amazon Route 53, where an A record points the domain migratingtocaddy.guku.io to the instance’s public IP. TLS certificates are issued by Let’s Encrypt and configured via Certbot with the nginx plugin, enabling automatic HTTPS redirection.

The problem with this architecture:

Depends on manual Certbot setup (The renewal cron job can easily be forgotten)
nginx configuration is unnecessarily complex
No built in automation for TLS or reloads
Higher maintenance for updates and security hardening

Let's see how we can fix these problems with Caddy

Prerequisites

Install Stakpak
Open your terminal and type "stakpak"
You should configure your cloud credentials before opening stakpak, since Stakpak will use your existing machine setup to work

Guide

Then ask Stakpak to Migrate from NGINX to Caddy with 0 downtime on AWS
First, Stakpak will check what our current setup on AWS looks like

Now, Stakpak recommended three zero-downtime strategies for the migration

Since we don't want any downtime related to DNS and TLS, let's choose the second option

Now that we have the ALB and target groups, Stakpak will install Caddy
After installing Caddy Stakpak will copy the website content
Now wait for the health checks so we make sure Caddy is working fine

Now Stakpak updates the DNS to point to the ALB
That's it, we are ready to redirect the traffic to Caddy, and since we are using an ALB we will be able to roll back if needed

Now it's working🥳

ps: don't forget to check our new Slack Integration👀


Free TLS with Caddy Web Server on AWS EC2 with Let's Encrypt

Noureldin ehab — Tue, 23 Jun 2026 17:00:00 +0000

Overview

In this tutorial, we will see how to deploy a static website on AWS EC2 using Caddy web server with automatic HTTPS certificates from Let's Encrypt.

What you'll build:

Complete AWS infrastructure (VPC, subnet, security groups, EC2 instance)
DNS configuration via Route 53
Caddy web server with automatic HTTPS
Static website accessible via custom subdomain
Production ready setup with HTTP to HTTPS redirect

all in less than 10 min

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Step by Step Guide

Prerequisites

Install Stakpak
Open your terminal and type "stakpak"
Cloud provider credentials configured

Tutorial

Then ask Stakpak to install Caddy on Ubuntu on EC2
It will start by checking the AWS creds and region

It will:

Create an Internet Gateway
Attach the Internet Gateway to the VPC
Create a public subnet
Create a route table with an internet route
Create a security group (ports 22, 80, 443)
Generate an SSH key pair

Now it will create the EC2 instance
Now we drink coffee while the EC2 instance starts
Now it will set the DNS and install Caddy, and test it

Now it's working🥳


Deploy Coolify on AWS & Deploy Your App

Noureldin ehab — Mon, 22 Jun 2026 11:04:57 +0000

Overview

Coolify lets you run your own platform similar to Heroku or Vercel, on your own infrastructure.

In this tutorial, you will:

Deploy Coolify on AWS EC2
Deploy a real application using Coolify
Test that everything is working
Then use Stakpak Autopilot to monitor and maintain it automatically

all with just one prompt

Problem

Setting up Coolify on AWS is mostly manual steps and trial and error.

Create the server
Install dependencies
Run the install script
Make sure everything is configured correctly

A small mistake can leave you with a broken setup that’s hard to debug.

Application

What the app does

We’ll use a simple todo app built with Next.js and Turso.

It:

lets you create and delete tasks
stores data in a Turso database (SQLite over HTTP)
uses a modern stack (Next.js, Drizzle ORM, Tailwind)

This is a small app, but it’s enough to test:

deployment
database connectivity
container behavior

You can check the source code here

Now that we understand the app, we can start deploying it

Step-by-Step Guide

Prerequisites

Install Stakpak
Configure Stakpak
Install Browser Extension(Optional)
Cloud provider credentials configured

Deployment

Open Stakpak and ask it to deploy my app on AWS with Coolify

As you can see, it automatically finds the Coolify skill, so let's press Enter

Now it's asking us which EC2 instance type we want to use, so let's choose t3.medium

Now it's asking us where we want to deploy our app, choose the closest region to your users

Now it's asking for the DB URL and Auth Token for our app

Now, let's restrict SSH to our IP only for security

Now that we have everything ready, let's press Enter

Now that Coolify is deployed, let's make an admin account

Then it configured Coolify by starting the reverse proxy and enabling API access
It created a project and connected the GitHub repo as a new application
It added the Turso database credentials as environment variables
Finally, it triggered the deployment, and Coolify built and launched the app automatically

Now everything is working🥳

Now, let's ask it to set up Stakpak Autopilot

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.

Monitoring

Prompt: Set up stakpak autopilot to monitor the app

That's it!!


Investigate and Clean Up Unused Cloud Resources

Noureldin ehab — Fri, 19 Jun 2026 12:05:58 +0000

Overview

By the end of this tutorial, you'll learn how to use Stakpak to investigate zombie resources in a live AWS production account, identify every detached volume, idle load balancer, orphaned snapshot, and forgotten instance silently accruing charges, apply the right cleanups safely, validate that production stays healthy throughout, and configure Stakpak Autopilot to help detect similar resource sprawl automatically in the future.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

AWS environments naturally accumulate unused resources over time: detached volumes, old snapshots, idle load balancers, unassociated Elastic IPs, and forgotten S3 buckets.

Finding them is easy. Determining whether they're safe to delete is not.

A resource may look unused, but it could still support a production workload, backup process, or undocumented dependency. Safely cleaning up cloud waste requires connecting usage, ownership, and activity data across your environment before taking action.

Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

storefront: Customer facing Next.js app on Vercel.
api-gateway: Public REST and GraphQL edge on EKS.
orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.
payments-service: Java on ECS Fargate, integrates with Stripe.
inventory-worker: Celery workers on EKS draining an SQS queue.
search-indexer: Rust Lambda keeping OpenSearch in sync.
admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic. None of the recent deploys touched infrastructure.

The application itself is well understood and accounted for, but the AWS account it runs in has accumulated years of side projects, migrations, and experiments that nobody has audited. Anything we find outside of this catalog is a candidate for cleanup, as long as we can prove it isn't quietly supporting one of these workloads.

Now that we understand the app and architecture, we can start investigating the account.

Step-by-Step Guide

Prerequisites

Troubleshooting

Open Stakpak and ask it to audit our AWS account for unused and zombie resources.

Now let's let it do its magic

Stakpak audited the AWS account for unused and zombie resources across compute, network, storage, IAM, data, and operational categories and found a small but real pool of recurring waste with no business value attached to any of it.

It identified ~$97/month of avoidable spend spread across 15 zombie resources in us-east-1, none tied to any active application. The signals came from EC2 state checks, EBS volume status, ELB target health, CloudWatch metrics for S3 and Lambda, IAM credential reports, and tag/name pattern analysis (*-OLD, -DEPRECATED, rakesh-test-, marketing-campaign-2022, loadtest-runner-2024-q1).
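The name-pattern signal is the simplest of these to reproduce. Here is a minimal sketch that flags resource names containing one of the markers listed above. The matching logic (a case-sensitive substring check) is an assumption about how such a scan could work, not how Stakpak implements it; the sample names are resources mentioned in this post.

```python
# Markers taken from the audit summary above.
MARKERS = (
    "-OLD",
    "-DEPRECATED",
    "rakesh-test-",
    "marketing-campaign-2022",
    "loadtest-runner-2024-q1",
)


def looks_abandoned(name: str) -> bool:
    """True if the resource name contains any of the known zombie markers."""
    return any(marker in name for marker in MARKERS)


names = [
    "northstar-mysql-data-OLD",
    "northstar-golden-image-v2-DEPRECATED",
    "rakesh-test-bucket",
    "northstar-marketing-campaign-2022",
    "prod-invoices",
    "api-canary",
]
flagged = [name for name in names if looks_abandoned(name)]
print(flagged)
```

A name match is only a hint. It still has to be confirmed against the other signals (state, attachment, target health, metrics) before anything is deleted.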

Then it:

Terminated the stopped loadtest-runner-2024-q1 EC2 instance, abandoned since the Q1 2024 load test campaign
Deleted five unattached EBS volumes totaling 371 GB, including a 200 GB elasticsearch-data-node-3 orphan from the search-v1 deprecation and a 100 GB northstar-mysql-data-OLD volume
Deregistered the northstar-golden-image-v2-DEPRECATED AMI and removed its backing snapshot
Released four unassociated Elastic IPs, including old-nat-gateway-eip from a decommissioned NAT and jenkins-static-ip from the Jenkins-to-GHA migration
Deleted two abandoned ALBs (northstar-internal-tools with an empty target group, and a canary ALB with all targets unhealthy) and the northstar-legacy-clb Classic ELB tied to the deprecated checkout-v1 project
Removed the unused openclaw-sg security group and its orphan openclaw-key key pair
Emptied and deleted six zombie S3 buckets including northstar-marketing-campaign-2022, rakesh-test-bucket (employee left), tempdata-export, and northstar-checkout-v1-logs
Cleaned up two empty CloudWatch log groups (/aws/lambda/feedbackboard-server, /aws/lambda/feedbackboard-warmer) left behind by deleted Lambda functions

After the changes were applied, Stakpak verified that:

All 15 zombie resources are gone
No surviving production resources (bastion, api-canary, staging-app, prod-invoices bucket) were impacted
us-west-2 remained clean (only the default VPC, no workloads)
Projected monthly waste dropped from ~$97/month to $0, a 100% reduction on identified zombies

Now everything is cleaned up 🥳

Now it's asking us if we want to set up Stakpak Autopilot to avoid having zombie resources

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.

Monitoring

First, it asks us about how often we want to run the checks

Then it asks if we want Stakpak to take action

Then it asks about where we want to get alerted

And that's it


Investigate Why AWS Costs Suddenly Increased

Noureldin ehab — Thu, 18 Jun 2026 12:56:43 +0000

Overview

By the end of this tutorial, you'll learn how to use Stakpak to investigate a sudden cost spike in a live AWS production account, trace it to the resources driving the extra spend, apply the right cleanups safely, validate that production stays healthy throughout, and configure Stakpak Autopilot to help detect similar cost spikes automatically in the future.

Note: Stakpak is open source and works with any model you choose.

Problem

Your AWS production application is healthy. The pipeline is green, the SLOs are green, the on call channel is quiet.

But your FinOps lead just pinged the team:

We're $4,300 over budget this month and trending 35% above last. Nothing in the apps catalog has changed.

You start the usual cost investigation loop: Cost Explorer by service and by tag, VPCs and NAT Gateways, unattached EBS volumes, stale snapshots, idle Elastic IPs, VPC endpoints, RDS instances, CloudWatch log retention, S3 lifecycle policies, CloudTrail events.

Cost Explorer shows the highest cost is from EC2, Other, EKS, and CloudWatch. The rest is scattered across eight services in chunks too small to feel urgent on their own. Tag breakdowns are messy because half the spend rolls up under (no tag) or Owner=unknown, and the biggest single CUR line item is a Fargate workload nobody on the current team recognizes.

Is the $890 NAT Gateway data line the orphaned VPC nobody decommissioned, or production traffic that should be flowing through a VPC endpoint?
Are the 1,400+ EBS snapshots load-bearing, or from a Lambda deprecated 18 months ago and never disabled?
Is the RDS instance tagged Environment=staging-old truly idle, or does some nightly job still touch it?
Which of the 12 likely cost drivers, if any, would be the wrong thing to delete?

Cost Explorer gives you part of the picture. AWS resource APIs give you the rest. But you still have to connect them, attribute them to owners, correlate them with utilization, and decide what is safe to remediate.

Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

storefront: Customer facing Next.js app on Vercel.
api-gateway: Public REST and GraphQL edge on EKS.
orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.
payments-service: Java on ECS Fargate, integrates with Stripe.
inventory-worker: Celery workers on EKS draining an SQS queue.
search-indexer: Rust Lambda keeping OpenSearch in sync.
admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic. None of the recent deploys touched infrastructure. Which is what makes a 35% cost jump suspicious: the bill is growing faster than the application is.

Now that we understand the app and architecture, we can start investigating the cost spike.

Step-by-Step Guide

Prerequisites

Troubleshooting

Open Stakpak and ask it to investigate the cloud cost spike

Now let's let it do its magic

Stakpak traced the cost spike across billing, utilization, and infrastructure signals and identified multiple sources of unnecessary spend driving the 35% increase.

It found that the $4,270 June overage came from 12 distinct cost drivers totaling ~$6,800/month of avoidable spend, none caused by application changes. The signals were spread across Cost Explorer deltas, tag anomalies (staging-old +5,854%, intern-summer-2025 +9,677%), CUR line items, CloudWatch utilization, and CloudTrail provenance.
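A tag anomaly like the ones above is a month-over-month percentage change in spend per tag. The sketch below shows that calculation and a simple threshold filter. The dollar amounts are hypothetical, chosen only so the results reproduce the two percentages quoted above; the function names and the 100% threshold are also assumptions.

```python
def pct_change(previous: float, current: float) -> float:
    """Month-over-month change in percent, rounded to one decimal place."""
    if previous <= 0:
        raise ValueError("previous spend must be positive")
    return round((current - previous) * 100 / previous, 1)


def flag_anomalies(previous: dict[str, float], current: dict[str, float],
                   threshold_pct: float = 100.0) -> dict[str, float]:
    """Return tags whose spend grew by at least threshold_pct."""
    flagged = {}
    for tag, now in current.items():
        before = previous.get(tag)
        if before:  # skip tags with no prior spend to compare against
            change = pct_change(before, now)
            if change >= threshold_pct:
                flagged[tag] = change
    return flagged


# Hypothetical monthly spend per tag, in dollars.
last_month = {"staging-old": 100, "intern-summer-2025": 100, "production": 8000}
this_month = {"staging-old": 5954, "intern-summer-2025": 9777, "production": 8200}

print(flag_anomalies(last_month, this_month))
```

Tags with small absolute spend can show huge percentages, which is why a signal like this is only useful alongside the Cost Explorer deltas and utilization data.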

Then it:

Deleted the orphaned legacy VPC and its NAT Gateway, abandoned since the 2024 EKS migration
Terminated three m5.2xlarge legacy batch workers idling at 2% CPU
Deleted the forgotten eks-dev-intern cluster and its Fargate Spot profile, running since July 2025
Deleted the staging-old RDS instance after 30 days of zero connections
Removed five unattached EBS volumes, three idle Elastic IPs, and 1,400+ stale snapshots from a deprecated 2023 backup Lambda
Disabled GuardDuty in eu-west-1 and ap-southeast-1 where no workloads exist
Added an S3 Gateway VPC endpoint to the production VPC, eliminating $890/month of NAT data processing
Applied lifecycle rules to northstar-prod-edge-logs and 30-day retention to three "Never expire" log groups
Fixed cross-AZ traffic on orders-service
Deployed AWS Budgets with anomaly detection, tag-enforcement SCPs, and Config rules

After the changes were applied, Stakpak verified that:

All 12 driver resources are gone or reconfigured
Every production workload remained healthy with no SLO regressions
Projected run-rate dropped to ~$9,600/month, below the January baseline

Now everything is cleaned up 🥳

Now it's asking us if we want to set up Stakpak Autopilot to avoid future cost spikes

Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.