Sumit Gautam

Posted on May 21

The Cloud Cost Spike Nobody Warned Me About

#aws #serverless #tutorial #productivity

I've discovered cloud cost problems every possible way. Here's what I learned each time.

I've been on the wrong end of an unexpected AWS bill more than once, and I've found out about it in just about every way there is.

A billing alert firing at 11pm on a Friday. A client call on a Monday morning where the first words were "why did our AWS bill double?" A routine Cost Explorer review that started as a 10-minute check and turned into a two-hour investigation. And yes, a month-end invoice that was simply higher than it should have been, with no prior warning because nobody had set one.

Each time, the root cause wasn't a bug. It wasn't a misconfiguration in any obvious sense. It was the natural output of infrastructure built by engineers — including me — who understood how AWS services work but hadn't fully internalized how AWS billing works.

Those are not the same thing. And the gap between them is where real money disappears.

This article is about that gap — the specific AWS cost patterns that look like correct architecture until you see the bill, and what I put in place after each incident to make sure it didn't happen the same way twice.

Cost Driver 1: NAT Gateway Data Transfer Charges

This is the one that surprises almost everyone the first time.

NAT Gateway pricing has two components that AWS documents clearly and engineers consistently underestimate in practice. The first is the hourly charge for the gateway existing: roughly $0.045/hour per gateway at us-east-1 rates (other regions, including the ap-south-1 used in the commands in this article, are priced differently), about $32/month. Noticeable but expected.

The second is the data processing charge — $0.045 per GB of data that passes through the gateway in either direction. This is the one that generates real bills.

The scenario I hit: a Kubernetes cluster on EKS with pods in private subnets pulling container images from ECR, sending logs to CloudWatch, and making API calls to various AWS services, all routed through a NAT Gateway. A moderately active cluster processing a few hundred GB of data per day can generate NAT Gateway charges that rival or exceed the EC2 costs underneath it.

The architecture is correct. Private subnets with NAT Gateway is the right pattern for production workloads. The billing implication just wasn't modeled.
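To make the missing modeling step concrete, here is a minimal Python sketch of the two-part NAT Gateway bill. The rates are the ones quoted in this section; the 720-hour month, the function name, and the three-gateway, 300 GB/day inputs are assumptions for illustration, not figures from the incident.

```python
# Rates quoted in this article; check current pricing for your region.
NAT_HOURLY_USD = 0.045   # per gateway, per hour
NAT_PER_GB_USD = 0.045   # per GB processed, either direction
HOURS_PER_MONTH = 720    # assumption: 30-day month


def nat_monthly_cost(gb_per_day: float, gateways: int = 1, days: int = 30) -> dict:
    """Split a month of NAT Gateway spend into its fixed and per-GB parts."""
    fixed = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH
    processing = gb_per_day * days * NAT_PER_GB_USD
    return {
        "fixed": round(fixed, 2),
        "processing": round(processing, 2),
        "total": round(fixed + processing, 2),
    }


# Hypothetical cluster: one gateway per AZ across three AZs, 300 GB/day in total.
estimate = nat_monthly_cost(gb_per_day=300, gateways=3)
print(estimate)
```

Under these assumptions the processing line (about $405) is roughly four times the fixed line (about $97), which is why the hourly rate alone is a poor predictor of the bill.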

What fixes this:

For traffic between your resources and AWS services specifically, use VPC Endpoints instead of routing through NAT Gateway. VPC Endpoints keep traffic on the AWS private network — no NAT Gateway processing charge, lower latency, and often better security posture:

# Create a VPC Endpoint for S3 (Gateway type — free)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.ap-south-1.s3 \
  --route-table-ids rtb-xxxxxxxx \
  --vpc-endpoint-type Gateway

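# Create Interface Endpoint for ECR (takes image pulls off the NAT path)
# Pulls also need a second Interface Endpoint for com.amazonaws.ap-south-1.ecr.api,
# plus the S3 Gateway Endpoint above, because image layers are served from S3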
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.ap-south-1.ecr.dkr \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx

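For S3 and DynamoDB, Gateway Endpoints are free. For ECR, CloudWatch, Secrets Manager, and other services, Interface Endpoints carry an hourly charge per endpoint per Availability Zone plus their own per-GB processing charge (about $0.01 per GB at us-east-1 rates, versus $0.045 through the NAT Gateway). For high-volume workloads, they're almost always cheaper than the equivalent NAT Gateway processing charges.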

Model this before you build. The break-even point is lower than you expect.
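A rough way to find that break-even point is sketched below. The Interface Endpoint rates ($0.01 per hour per AZ and $0.01 per GB) are assumed us-east-1 list prices rather than figures from my bills, the 720-hour month is an assumption, and the sketch assumes the NAT Gateway stays in place for internet-bound traffic, so only the per-GB difference counts as a saving.

```python
NAT_PER_GB_USD = 0.045        # NAT Gateway processing rate quoted in this article
ENDPOINT_HOURLY_USD = 0.01    # assumption: per interface endpoint, per AZ
ENDPOINT_PER_GB_USD = 0.01    # assumption: interface endpoint processing rate
HOURS_PER_MONTH = 720         # assumption: 30-day month


def endpoint_break_even_gb(azs: int = 3) -> float:
    """Monthly GB above which one interface endpoint is cheaper than NAT processing."""
    fixed = ENDPOINT_HOURLY_USD * HOURS_PER_MONTH * azs
    saving_per_gb = NAT_PER_GB_USD - ENDPOINT_PER_GB_USD
    return fixed / saving_per_gb


def monthly_saving(gb_per_month: float, azs: int = 3) -> float:
    """Net monthly saving from moving gb_per_month off NAT onto one endpoint."""
    fixed = ENDPOINT_HOURLY_USD * HOURS_PER_MONTH * azs
    return gb_per_month * (NAT_PER_GB_USD - ENDPOINT_PER_GB_USD) - fixed


print(f"break-even: {endpoint_break_even_gb(3):.0f} GB/month")
print(f"saving at 9 TB/month: ${monthly_saving(9000):.2f}")
```

With an endpoint spread over three AZs, the break-even under these assumptions lands a little above 600 GB a month, which is around 20 GB a day for that one service.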

Cost Driver 2: Forgotten and Idle Resources

This one is less glamorous than NAT Gateway math but responsible for more wasted spend across more accounts than anything else on this list.

The pattern is consistent: resources get created for a purpose, the purpose ends or changes, the resources remain. Nobody deletes them because nobody owns the cleanup. In a team environment, this compounds — everyone assumes someone else deprovisioned the staging environment from last quarter.

What I found in a Cost Explorer review of a client account:

Unattached EBS volumes from terminated EC2 instances: root volumes are deleted on termination by default, but additional data volumes persist unless you explicitly set DeleteOnTermination
Outdated RDS snapshots: automated backups expire with the retention window, but manual snapshots are kept until someone deletes them, so they accumulate well beyond the retention you thought you configured
Idle NAT Gateways in regions where workloads had been decommissioned: $32/month each, several of them, months after the workloads they served were gone
Old AMIs and their associated snapshots: AMIs are easy to create, easy to forget, and each one keeps incurring snapshot storage charges until it is deregistered and its snapshots are deleted

None of these are large individually. Together, across an account that had been running for two years without systematic cleanup, they were meaningful.

What fixes this:

Build a cleanup policy into your infrastructure practice, not your quarterly review calendar. At minimum:

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

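# Find snapshots older than 90 days (--owner-ids self limits results to your own account)
# GNU date syntax; on macOS use: date -u -v-90d +%Y-%m-%d
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='${CUTOFF}'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}" \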
  --output table

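# List NAT Gateways in the available state. The API has no "idle" filter, so
# cross-check each one against your route tables and its CloudWatch traffic metrics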
aws ec2 describe-nat-gateways \
  --filter Name=state,Values=available \
  --query 'NatGateways[*].{ID:NatGatewayId,VPC:VpcId,Created:CreateTime}' \
  --output table

For ongoing governance, enable AWS Config with rules for unattached volumes and idle resources, and use AWS Cost Anomaly Detection — it catches spend pattern changes faster than static billing alerts:

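# Create a cost anomaly monitor that tracks every AWS service separately
# (a monitor alone sends nothing; pair it with `aws ce create-anomaly-subscription` to get alerts)
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceMonitor",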
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

Tag everything at creation with an owner and a project. Resources without tags in a quarterly audit are candidates for deletion. Make this a policy, not a suggestion.
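The audit side of that policy can be as simple as the sketch below. It assumes resource records carrying a `Tags` list of `Key`/`Value` pairs, the shape EC2 describe calls return; the `Id` field, the sample records, and the function names are hypothetical.

```python
REQUIRED_TAGS = ("Owner", "Project")


def missing_tags(resource: dict, required=REQUIRED_TAGS) -> list:
    """Return the required tag keys that are absent or empty on one resource."""
    present = {t["Key"] for t in resource.get("Tags", []) if t.get("Value")}
    return [key for key in required if key not in present]


def untagged(resources: list, required=REQUIRED_TAGS) -> dict:
    """Map resource ID to its missing tag keys, skipping fully tagged resources."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource, required)
        if gaps:
            report[resource["Id"]] = gaps
    return report


# Hypothetical audit input, e.g. flattened from describe-volumes output.
sample = [
    {"Id": "vol-aaa", "Tags": [{"Key": "Owner", "Value": "platform"},
                               {"Key": "Project", "Value": "checkout"}]},
    {"Id": "vol-bbb", "Tags": [{"Key": "Owner", "Value": "data"}]},
    {"Id": "vol-ccc"},  # never tagged at all
]
report = untagged(sample)
print(report)
```

Anything that shows up in the report goes on the deletion-candidate list until an owner claims it.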

Cost Driver 3: Data Transfer Between Availability Zones

This is the most invisible cost driver on the list because it requires no misconfiguration and no forgotten resources. It's the direct result of building the high-availability architecture AWS recommends.

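AWS charges $0.01 per GB for data transferred between Availability Zones within the same region, and it bills that rate in each direction: once on the way out of the source AZ and once on the way into the destination, so every GB that crosses an AZ boundary effectively costs $0.02. This sounds trivial until you map it against what actually moves between AZs in a real distributed system.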

The scenario: a three-tier application deployed across three AZs for availability. Application servers in AZ-A making database calls to RDS in AZ-B. A caching layer in AZ-C that application servers across all three AZs read from. A Kubernetes cluster where pods are scheduled across AZs without affinity rules, meaning a pod in AZ-A routinely calls a service pod in AZ-C. Every one of these cross-AZ calls — database queries, cache reads, inter-service calls — generates data transfer charges.

At low volume, this is background noise. At production scale, cross-AZ transfer costs can match or exceed your compute costs for data-intensive workloads.
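A back-of-the-envelope model shows how quickly this scales. The sketch below uses the $0.01 per GB rate, charged on both sides of the transfer; the traffic volume and the cross-AZ fractions are assumptions (with three AZs and no zone awareness, about two thirds of calls land in another AZ if traffic is spread evenly).

```python
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # rate quoted in this article, billed on both sides


def cross_az_monthly_cost(gb_per_day: float, cross_az_fraction: float, days: int = 30) -> float:
    """Monthly charge for the share of internal traffic that crosses an AZ boundary."""
    cross_az_gb = gb_per_day * cross_az_fraction * days
    return cross_az_gb * CROSS_AZ_PER_GB_EACH_WAY * 2


# Hypothetical service mesh moving 1 TB/day between tiers.
unaware = cross_az_monthly_cost(1000, 2 / 3)   # evenly spread over three AZs
zone_aware = cross_az_monthly_cost(1000, 0.1)  # assumption: 10% still crosses zones
print(f"no zone awareness: ${unaware:.2f}/month")
print(f"zone-aware routing: ${zone_aware:.2f}/month")
```

The absolute numbers depend entirely on your traffic, but the ratio is the point: the charge tracks the cross-AZ fraction, which is the one variable routing policy can change.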

What fixes this:

The goal is AZ-aware traffic routing — keeping traffic within the same AZ wherever availability requirements permit:

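# Kubernetes Topology Aware Routing (this annotation applies to Kubernetes 1.27+;
# earlier versions used service.kubernetes.io/topology-aware-hints: "auto")
# Prefer endpoints in the same AZ before routing cross-zone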
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: my-service
  ports:
    - port: 80

For EKS specifically, enable Topology Aware Routing and configure pod affinity rules to co-locate services that communicate frequently:

# Pod spec fragment (goes under spec.template.spec in a Deployment)
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dependent-service
          topologyKey: topology.kubernetes.io/zone

For RDS, use RDS Proxy in the same AZ as your compute where possible, and be deliberate about which AZ your primary instance sits in relative to your application tier.

Cost Driver 4: S3 Storage and Request Costs

S3 feels cheap because the storage rate is low — $0.023 per GB per month for Standard storage. The request costs are what accumulate unexpectedly.

S3 charges per API request: $0.0004 per 1,000 GET requests, $0.005 per 1,000 PUT/COPY/POST/LIST requests. These numbers are small. Multiplied by millions of requests per day from an application that wasn't designed with S3 request patterns in mind, they add up.
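To see how they add up, the sketch below prices a write-heavy pipeline two ways using the PUT rate above. The event volume and batch size are hypothetical, and the model ignores storage and GET charges.

```python
import math

PUT_PER_1000_USD = 0.005  # PUT/COPY/POST/LIST rate quoted in this article


def put_cost_per_month(events_per_day: int, events_per_object: int = 1, days: int = 30) -> float:
    """Monthly PUT charge when events_per_object events are batched into each object."""
    puts_per_day = math.ceil(events_per_day / events_per_object)
    return puts_per_day * days / 1000 * PUT_PER_1000_USD


# Hypothetical logging pipeline: 50 million events a day.
per_event = put_cost_per_month(50_000_000)                         # one PUT per event
batched = put_cost_per_month(50_000_000, events_per_object=1000)   # 1,000 events per object
print(f"per-event writes: ${per_event:,.2f}/month")
print(f"batched writes:   ${batched:,.2f}/month")
```

Under these assumptions the same data costs about $7,500 a month written per event and about $7.50 batched, before a single byte of storage is billed.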

The patterns I've seen generate unexpected S3 costs:

Application code calling ListObjects in a loop instead of paginating once and reusing the result: each List call (one page of up to 1,000 keys) counts as a request, and tight loops can generate thousands per minute
Small file uploads: many small PUTs cost more in request charges than fewer large ones, which matters for logging pipelines that write per-event rather than batching
S3 access logs enabled and writing to the same bucket: every delivered log object is itself logged, which generates more logs, so the log volume and the storage behind it keep growing
Lifecycle policies absent: objects in Standard storage that should have transitioned to Infrequent Access or Glacier months ago

What fixes this:

Enable S3 Storage Lens at the account level — it gives you per-bucket visibility into request patterns, storage class distribution, and cost drivers without requiring manual investigation:

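# Create an account-level Storage Lens configuration with ID "default"
# (the built-in dashboard AWS creates for you is named default-account-dashboard)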
aws s3control put-storage-lens-configuration \
  --account-id YOUR_ACCOUNT_ID \
  --config-id default \
  --storage-lens-configuration '{
    "Id": "default",
    "IsEnabled": true,
    "AccountLevel": {
      "BucketLevel": {}
    }
  }'

Add lifecycle policies to every bucket at creation — treat it as a default, not an optimization:

{
  "Rules": [
    {
      "Id": "transition-to-ia",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        }
      ]
    }
  ]
}
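A quick estimate of what those transitions are worth can be sketched as follows. The Standard rate is the one quoted in this section; the Standard-IA and Glacier Instant Retrieval rates are assumed us-east-1 list prices, the 10 TB split is hypothetical, and the model ignores retrieval fees, transition request fees, and minimum storage durations.

```python
# USD per GB-month. STANDARD is quoted in this article; the other two are assumptions.
RATES = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER_IR": 0.004}


def monthly_storage_cost(gb_by_class: dict) -> float:
    """Sum storage charges for a bucket given GB held in each storage class."""
    return sum(RATES[storage_class] * gb for storage_class, gb in gb_by_class.items())


# Hypothetical 10 TB bucket, before and after the lifecycle rule has taken effect.
before = monthly_storage_cost({"STANDARD": 10_000})
after = monthly_storage_cost({"STANDARD": 1_000, "STANDARD_IA": 2_000, "GLACIER_IR": 7_000})
print(f"all Standard:   ${before:.2f}/month")
print(f"with lifecycle: ${after:.2f}/month")
```

The split is the assumption that matters most here: the saving only materializes if most of the bucket really is older than 90 days and rarely read.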

Cost Driver 5: Oversized Instances Running 24/7

This is the simplest cost driver and the one with the most straightforward fix — which is why it's last. Simple doesn't mean small.

The pattern: instances sized for peak load running continuously at 10-20% utilization. Development and staging environments sized to match production. Instances that were right-sized six months ago for a workload that has since shrunk.

On a client engagement, I reviewed Cost Explorer and found several m5.2xlarge instances ($0.384/hour on-demand at us-east-1 Linux rates, about $276/month each) running continuously at consistently low CPU and memory utilization. They had been provisioned for a load test, the load test had concluded, and the instances had continued running because nobody had a process for decommissioning them after the test.

What fixes this:

Enable AWS Compute Optimizer — it analyzes CloudWatch metrics and produces specific right-sizing recommendations with projected savings:

# Get EC2 right-sizing recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    RecommendedType:recommendationOptions[0].instanceType,
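    SavingsPercent:recommendationOptions[0].savingsOpportunity.savingsOpportunityPercentage,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value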
  }' \
  --output table

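For non-production environments, implement instance scheduling and stop instances outside working hours. An instance that runs 8 hours a day instead of 24 costs about 67% less in compute (attached EBS volumes keep billing while it is stopped), and more if it also stays off over the weekend: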

# AWS Instance Scheduler via CloudFormation (or use Lambda)
# Simple approach: tag-based stop/start with EventBridge

# Tag instances for scheduling
aws ec2 create-tags \
  --resources i-xxxxxxxxx \
  --tags Key=Schedule,Value=office-hours

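# EventBridge rule that fires at 7pm IST (13:30 UTC) on weekdays
# The rule does nothing by itself: attach a target with `aws events put-targets`,
# such as a Lambda function that stops instances tagged Schedule=office-hours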
aws events put-rule \
  --name StopDevInstances \
  --schedule-expression "cron(30 13 ? * MON-FRI *)" \
  --state ENABLED
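The savings from a schedule are simple arithmetic on running hours, sketched below. The 09:00 start time in the last example is an assumption (the rule above only defines the 7pm stop), and the figures cover instance-hours only, not storage.

```python
HOURS_PER_WEEK = 24 * 7


def running_fraction(hours_per_day: float, days_per_week: int = 7) -> float:
    """Share of the week an instance is running under a fixed daily schedule."""
    return hours_per_day * days_per_week / HOURS_PER_WEEK


def compute_saving_pct(hours_per_day: float, days_per_week: int = 7) -> float:
    """Percentage of instance-hours avoided compared with running 24/7."""
    return round((1 - running_fraction(hours_per_day, days_per_week)) * 100, 1)


print(compute_saving_pct(8))        # 8 hours every day
print(compute_saving_pct(8, 5))     # 8 hours, weekdays only
print(compute_saving_pct(10, 5))    # assumed 09:00 start, 19:00 stop, weekdays only
```

Restricting the schedule to weekdays moves the saving from roughly two thirds to roughly three quarters, which is usually an easy policy to agree on for dev and staging.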

What Every Discovery Method Taught Me

Every way I've found an AWS cost problem taught me something different.

The billing alert that fired at 11pm taught me to set thresholds before I think I need them — at 50%, 80%, and 100% of expected spend, not just at the number that feels alarming.

The client call on a Monday morning taught me that cost problems in team environments are invisible until they're someone else's problem to escalate. Shared accounts need shared visibility — Cost Explorer access for the whole team, not just the billing owner.

The routine review that turned into two hours taught me that Cost Explorer by service, checked weekly rather than monthly, surfaces anomalies while they're small. By month end, the pattern has been running for weeks.

The surprise invoice taught me the most: the absence of an alert is not the same as the absence of a problem. An unmonitored account is a guarantee of eventual surprise.

The actual lesson across all of them is the same: AWS billing is an observability problem. The same discipline you apply to application monitoring — alerts, anomaly detection, dashboards, regular review — applies to your cloud spend. Without it, cost issues are invisible until they're on an invoice.

The AWS services that generate surprising costs are almost always working exactly as documented. The surprise comes from not modeling the billing implications before the architecture is built, and not monitoring spend with the same rigor as uptime.

Model the billing first. Monitor it like production. Build the architecture second.

Quick Reference: The AWS Cost Governance Checklist

VPC Endpoints for S3, ECR, CloudWatch, Secrets Manager — eliminate NAT Gateway processing for AWS service traffic
Billing alerts at 50%, 80%, 100% of monthly budget threshold
Cost Anomaly Detection enabled at account level
AWS Config rules for unattached EBS volumes and idle resources
Topology Aware Routing on EKS to minimize cross-AZ data transfer
S3 lifecycle policies on every bucket at creation
Compute Optimizer enabled — review recommendations monthly
Instance scheduling for all non-production environments
Mandatory tagging policy — Owner, Project, Environment on every resource