<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Noureldin ehab</title>
    <description>The latest articles on DEV Community by Noureldin ehab (@noureldin_ehab).</description>
    <link>https://dev.to/noureldin_ehab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3023541%2F79a69a55-82e1-43b9-a5a7-5a687c98c094.jpg</url>
      <title>DEV Community: Noureldin ehab</title>
      <link>https://dev.to/noureldin_ehab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/noureldin_ehab"/>
    <language>en</language>
    <item>
      <title>Investigate Why AWS Costs Suddenly Increased</title>
      <dc:creator>Noureldin ehab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 12:56:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/investigate-why-aws-costs-suddenly-increased-5clk</link>
      <guid>https://dev.to/aws-builders/investigate-why-aws-costs-suddenly-increased-5clk</guid>
      <description>&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;By the end of this tutorial, you'll know how to use Stakpak to investigate zombie resources in a live AWS production account: identify every detached volume, idle load balancer, orphaned snapshot, and forgotten instance silently accruing charges, apply the right cleanups safely, and validate that production stays healthy throughout. You'll also configure &lt;a href="https://stakpak.gitbook.io/docs/how-it-works/autopilot" rel="noopener noreferrer"&gt;Stakpak Autopilot&lt;/a&gt; to help detect similar resource sprawl automatically in the future.&lt;/p&gt;

&lt;p&gt;Note: Stakpak is open source and works with any model you choose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;Your AWS production application is healthy. The pipeline is green, the SLOs are green, and the on-call channel is quiet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But your FinOps lead just pinged the team:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We're $4,300 over budget this month and trending 35% above last. Nothing in the apps catalog has changed. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You start the usual cost investigation loop: Cost Explorer by service and by tag, VPCs and NAT Gateways, unattached EBS volumes, stale snapshots, idle Elastic IPs, VPC endpoints, RDS instances, CloudWatch log retention, S3 lifecycle policies, and CloudTrail events.&lt;/p&gt;

&lt;p&gt;Cost Explorer shows the highest cost is from EC2, Other, EKS, and CloudWatch. The rest is scattered across eight services in chunks too small to feel urgent on their own. Tag breakdowns are messy because half the spend rolls up under (no tag) or Owner=unknown, and the biggest single CUR line item is a Fargate workload nobody on the current team recognizes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Is the $890 NAT Gateway data line the orphaned VPC nobody decommissioned, or production traffic that should be flowing through a VPC endpoint?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are the 1,400+ EBS snapshots load-bearing, or from a Lambda deprecated 18 months ago and never disabled?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the RDS instance tagged Environment=staging-old truly idle, or does some nightly job still touch it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Which of the 12 likely cost drivers, if any, would be the wrong thing to delete?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost Explorer gives you part of the picture. AWS resource APIs give you the rest. But you still have to connect them, attribute them to owners, correlate them with utilization, and decide what is safe to remediate.&lt;/p&gt;
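
&lt;p&gt;To make the "connect them" step concrete, here is a minimal sketch of one small piece of that loop: picking unattached EBS volumes out of inventory data and putting a rough monthly price on them. It is not how Stakpak works internally. The input is assumed to be a list of dicts shaped like the &lt;code&gt;Volumes&lt;/code&gt; entries of an EC2 DescribeVolumes response, the function names are hypothetical, and the price per GB-month is an assumption you should replace with current pricing for your region and volume types.&lt;/p&gt;

```python
# Assumed price per GB-month (hypothetical default); check current pricing
# for your own region and volume types before trusting the estimate.
ASSUMED_USD_PER_GB_MONTH = 0.08


def unattached_volumes(volumes):
    """Return volumes that are not attached to any instance.

    `volumes` is a list of dicts shaped like the "Volumes" entries of an
    EC2 DescribeVolumes response (VolumeId, State, Size in GiB, Attachments).
    """
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def estimated_monthly_cost(volumes, usd_per_gb_month=ASSUMED_USD_PER_GB_MONTH):
    """Rough monthly storage cost, ignoring IOPS and throughput charges."""
    return round(sum(v["Size"] for v in volumes) * usd_per_gb_month, 2)


# Made-up sample data for illustration only.
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-111"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 250, "Attachments": []},
]

orphans = unattached_volumes(sample)
print([v["VolumeId"] for v in orphans])
print(estimated_monthly_cost(orphans))
```

&lt;p&gt;A list like this is only a set of candidates. Whether a volume is safe to delete still depends on who owns it and whether anything expects to reattach it.&lt;/p&gt;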

&lt;h1&gt;
  
  
  Application
&lt;/h1&gt;

&lt;p&gt;Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;storefront: Customer-facing Next.js app on Vercel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;api-gateway: Public REST and GraphQL edge on EKS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;payments-service: Java on ECS Fargate, integrates with Stripe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;inventory-worker: Celery workers on EKS draining an SQS queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;search-indexer: Rust Lambda keeping OpenSearch in sync.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;admin-console: React SPA on S3 behind CloudFront.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager. &lt;/p&gt;

&lt;p&gt;Primary region is us-east-1, with us-west-2 as a disaster recovery region.&lt;/p&gt;

&lt;p&gt;Every workload in the catalog is healthy and serving traffic, and none of the recent deploys touched infrastructure. That is what makes a 35% cost jump suspicious: the bill is growing faster than the application is.&lt;/p&gt;

&lt;p&gt;Now that we understand the app and architecture, we can start investigating the cost spike.&lt;/p&gt;

&lt;h1&gt;
  
  
  Step-by-Step Guide
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.dev/" rel="noopener noreferrer"&gt;Install Stakpak&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html" rel="noopener noreferrer"&gt;AWS credentials configured locally&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open Stakpak and ask it to &lt;code&gt;investigate the cloud cost spike&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let's let it do its magic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpr31v2mvf2aatv3n0e1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpr31v2mvf2aatv3n0e1s.png" alt=" " width="620" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stakpak traced the cost spike across billing, utilization, and infrastructure signals and identified multiple sources of unnecessary spend driving the 35% increase.&lt;/p&gt;

&lt;p&gt;It found that the $4,270 June overage came from 12 distinct cost drivers totaling ~$6,800/month of avoidable spend, none caused by application changes. The signals were spread across Cost Explorer deltas, tag anomalies (staging-old +5,854%, intern-summer-2025 +9,677%), CUR line items, CloudWatch utilization, and CloudTrail provenance.&lt;/p&gt;
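
&lt;p&gt;Tag anomalies like the ones above are month-over-month percent changes in spend per tag value. The sketch below shows that arithmetic under stated assumptions: the function names, tag names, dollar figures, and the 100% threshold are all hypothetical, and the inputs are plain dicts of spend per tag rather than a real Cost Explorer response. For example, a tag that goes from $10 to $595 has grown by (595 - 10) / 10 = 5,850%.&lt;/p&gt;

```python
def percent_change(previous, current):
    """Percent change from previous to current; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0


def flag_tag_anomalies(previous_by_tag, current_by_tag, threshold_pct=100.0):
    """Return (tag, pct) pairs whose spend grew by at least threshold_pct.

    Tags that have spend now but no baseline are reported with pct=None so
    a person reviews them instead of them being silently dropped.
    """
    flagged = []
    for tag, current in current_by_tag.items():
        if current == 0:
            continue  # Nothing is being spent on this tag this month.
        pct = percent_change(previous_by_tag.get(tag, 0.0), current)
        if pct is None or pct >= threshold_pct:
            flagged.append((tag, pct))
    # Brand-new tags (pct is None) sort first, then the largest growth.
    return sorted(flagged, key=lambda item: (item[1] is not None, -(item[1] or 0.0)))


# Made-up monthly spend per tag value, in USD, for illustration only.
previous = {"team-a": 8000.0, "team-b": 10.0}
current = {"team-a": 8200.0, "team-b": 595.0, "team-new": 40.0}

flagged = flag_tag_anomalies(previous, current)
print(flagged)
```

&lt;p&gt;Percent change alone overstates small tags, so in practice it helps to read it next to the absolute dollar delta.&lt;/p&gt;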

&lt;p&gt;Then it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deleted the orphaned legacy VPC and its NAT Gateway, abandoned since the 2024 EKS migration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terminated three m5.2xlarge legacy batch workers idling at 2% CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deleted the forgotten eks-dev-intern cluster and its Fargate Spot profile, running since July 2025&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deleted the staging-old RDS instance after 30 days of zero connections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Removed five unattached EBS volumes, three idle Elastic IPs, and 1,400+ stale snapshots from a deprecated 2023 backup Lambda&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disabled GuardDuty in eu-west-1 and ap-southeast-1 where no workloads exist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added an S3 Gateway VPC endpoint to the production VPC, eliminating $890/month of NAT data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applied lifecycle rules to northstar-prod-edge-logs and 30-day retention to three "Never expire" log groups&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fixed cross-AZ traffic on orders-service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deployed AWS Budgets with anomaly detection, tag-enforcement SCPs, and Config rules&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
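
&lt;p&gt;Several of the cleanups above rest on an idleness judgment, such as instances sitting at 2% CPU or a database with 30 days of zero connections. The following is a hedged sketch of what such a check could look like, not Stakpak's actual logic. It assumes the inputs are lists of dicts shaped like the &lt;code&gt;Datapoints&lt;/code&gt; entries of a CloudWatch GetMetricStatistics response. The function names, the 5% CPU cap, and the minimum sample counts are hypothetical choices.&lt;/p&gt;

```python
def is_idle_cpu(datapoints, max_avg_pct=5.0, min_points=24):
    """True when there is enough data and every CPU average is at or below the cap.

    `datapoints` is a list of dicts shaped like the "Datapoints" entries of
    a CloudWatch GetMetricStatistics response, each with an "Average" value.
    """
    if min_points > len(datapoints):
        return False  # Not enough evidence, so treat the resource as in use.
    return all(max_avg_pct >= p["Average"] for p in datapoints)


def has_zero_connections(datapoints, min_points=30):
    """True when every sample of a connection-count metric has a "Maximum" of 0."""
    if min_points > len(datapoints):
        return False  # Same rule: missing data never counts as idle.
    return all(p["Maximum"] == 0 for p in datapoints)


# Made-up samples for illustration only.
quiet = [{"Average": 2.0}] * 24
spiky = quiet[:-1] + [{"Average": 40.0}]
no_conns = [{"Maximum": 0}] * 30
nightly_job = [{"Maximum": 0}] * 29 + [{"Maximum": 3}]

print(is_idle_cpu(quiet), is_idle_cpu(spiky))
print(has_zero_connections(no_conns), has_zero_connections(nightly_job))
```

&lt;p&gt;Both checks fail closed: a short or empty series is treated as "in use", so a gap in monitoring cannot by itself justify a deletion. Using the maximum rather than the average for connections means a single nightly job is enough to keep a database off the idle list.&lt;/p&gt;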

&lt;p&gt;After the changes were applied, Stakpak verified that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;All 12 driver resources are gone or reconfigured&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every production workload remained healthy with no SLO regressions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Projected run-rate dropped to ~$9,600/month, below the January baseline&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now everything is cleaned up 🥳&lt;/p&gt;

&lt;p&gt;Now it's asking us if we want to set up &lt;a href="https://stakpak.gitbook.io/docs/how-it-works/autopilot" rel="noopener noreferrer"&gt;Stakpak Autopilot&lt;/a&gt; to avoid future cost spikes.&lt;/p&gt;

&lt;p&gt;Note: Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Monitoring
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fy5nmt8p6cerm6semyq3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fy5nmt8p6cerm6semyq3l.png" alt=" " width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Extra Resources:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Related Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/tutorial/investigate-why-aws-costs-suddenly-increased" rel="noopener noreferrer"&gt;Investigate and Clean Up Unused Cloud Resources&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/tutorial/deploy-coolify-on-aws-and-deploy-your-app" rel="noopener noreferrer"&gt;Deploy Coolify on AWS &amp;amp; Deploy Your App&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/tutorial/load-test-to-optimize-cloud-costs" rel="noopener noreferrer"&gt;Load Test to Optimize Cloud Costs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/get-started/install-stakpak" rel="noopener noreferrer"&gt;Install Stakpak&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/get-started/configure-stakpak" rel="noopener noreferrer"&gt;Configure Stakpak&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html" rel="noopener noreferrer"&gt;Configuration and credential file settings in the AWS CLI&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/how-it-works/autopilot" rel="noopener noreferrer"&gt;Autopilot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/how-it-works/handling-secrets" rel="noopener noreferrer"&gt;Handling Secrets&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stakpak.gitbook.io/docs/how-it-works/warden-guardrails" rel="noopener noreferrer"&gt;Warden Guardrails&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>cloudnative</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
