Harshit Luthra

AWS Cost Optimization: How We Cut Our Bill by 60%

Originally published at harshit.cloud on 2024-12-05.


The CFO saw the AWS bill hit $50,000 a month and I got a calendar invite titled "We need to talk about AWS." I knew what the meeting was before I clicked accept.

Three months later we were at $20,000 a month, with better p95 latency than when we started. The interesting part is that none of the wins were clever. Most of them were a checkbox someone had skipped two years ago.

the starting point

The bill broke down like this: EC2 $28,000, RDS $12,000, data transfer $6,000, CloudWatch $2,000, everything else $2,000. Fifty grand a month. The cost-allocation tags were missing on roughly 40% of resources, so for the first week the work was just figuring out who owned what.
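To see where the money sat, the breakdown can be turned into shares of the total. This is only arithmetic on the figures quoted above; the dictionary keys are just labels, not Cost Explorer service names.

```python
# Monthly bill by category, in USD, as quoted above.
bill = {
    "EC2": 28_000,
    "RDS": 12_000,
    "Data transfer": 6_000,
    "CloudWatch": 2_000,
    "Everything else": 2_000,
}


def shares(costs):
    """Return each category's share of the total as a whole-number percentage."""
    total = sum(costs.values())
    return {name: round(100 * amount / total) for name, amount in costs.items()}


total = sum(bill.values())
for name, pct in shares(bill).items():
    print(f"{name:<16} {pct:>3}%")
print(f"{'Total':<16} ${total:,}")
```

EC2 and RDS together are 80% of the bill, which is why most of the sections below are about them.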

Most of it turned out to be waste. Not bad architecture, not premature scale, just defaults that nobody had revisited since the seed round.

rightsizing the EC2 fleet

Every app server in the fleet was running on m5.2xlarge. Not because anything needed eight vCPUs, but because the previous engineer picked an instance type once in 2022 and Terraform copy-pasted it forever after.

A month of CloudWatch told the real story:

# Check actual CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-xxxxx \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-31T23:59:59Z \
  --period 3600 \
  --statistics Average

Average CPU 12%. Average memory 30%. The fleet was a parking lot.
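The CLI call above prints JSON with a `Datapoints` list, one entry per hour, each carrying an `Average` field. A minimal sketch of boiling that down to a couple of numbers follows; the sample values are made up for illustration, and `summarize_cpu` is a hypothetical helper, not part of any AWS tooling.

```python
import json


def summarize_cpu(response_json):
    """Summarize hourly CPU averages from get-metric-statistics output."""
    datapoints = json.loads(response_json)["Datapoints"]
    values = sorted(point["Average"] for point in datapoints)
    if not values:
        raise ValueError("no datapoints returned for this instance")
    return {
        "mean": sum(values) / len(values),
        "busiest_hour": values[-1],  # highest hourly average, not a true peak
        "hours": len(values),
    }


# Made-up sample in the shape the CLI prints; real output has one entry per hour.
sample = json.dumps({
    "Label": "CPUUtilization",
    "Datapoints": [
        {"Timestamp": "2024-01-01T00:00:00Z", "Average": 9.0, "Unit": "Percent"},
        {"Timestamp": "2024-01-01T01:00:00Z", "Average": 12.0, "Unit": "Percent"},
        {"Timestamp": "2024-01-01T02:00:00Z", "Average": 15.0, "Unit": "Percent"},
    ],
})

summary = summarize_cpu(sample)
print(summary)
```

The busiest hour matters as much as the mean: an instance that averages 12% but spends one hour a day at 90% is a different rightsizing conversation.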

Dropping to m5.large cut the per-hour rate by 4x:

# Before
resource "aws_instance" "app" {
  instance_type = "m5.2xlarge"  # $0.384/hour
}

# After
resource "aws_instance" "app" {
  instance_type = "m5.large"     # $0.096/hour
}

That single change saved $18,000 a month. p95 latency went down because the new instances were on a newer hypervisor generation. (I have stopped being surprised by this.)
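The hourly rates in the Terraform comments translate into per-instance monthly figures as sketched below. The 730-hour month is an assumption (a common averaging convention), and the rates are the ones quoted in the comments, not looked up.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """On-Demand cost of one always-on instance for a month, in USD."""
    return hourly_rate * hours


before = monthly_cost(0.384)  # m5.2xlarge rate from the comment above
after = monthly_cost(0.096)   # m5.large rate from the comment above

print(f"m5.2xlarge: ${before:,.2f}/month")
print(f"m5.large:   ${after:,.2f}/month")
print(f"saved per instance: ${before - after:,.2f}/month")
```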

reserved instances for the steady-state fleet

The app servers ran 24/7. We were paying On-Demand for them anyway, because nobody had wanted to commit a year ahead during a hiring freeze.

The Cost Explorer recommendation API will tell you what to buy if you ask it nicely:

aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute" \
  --lookback-period-in-days SIXTY_DAYS \
  --term-in-years ONE_YEAR \
  --payment-option ALL_UPFRONT

We bought 1-year RIs for ten m5.large app servers and five c5.xlarge API servers. 40% off On-Demand, no architectural change, no risk. $4,000 a month back.

The argument against RIs is always "but what if our load profile changes." Three months later it hadn't.
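The "what if load changes" worry can be put into numbers. A minimal sketch, assuming an all-upfront RI priced at a flat discount off On-Demand and an 8,760-hour year: the RI is a fixed cost, On-Demand scales with hours used, so the break-even point is simply one minus the discount. Both function names are hypothetical.

```python
def break_even_utilization(discount):
    """Fraction of the term an instance must run for the RI to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def ri_saving(on_demand_hourly, discount, utilization, hours_in_term=8760):
    """Savings (positive) or loss (negative) versus On-Demand over the term, in USD."""
    on_demand_cost = on_demand_hourly * hours_in_term * utilization
    ri_cost = on_demand_hourly * hours_in_term * (1 - discount)
    return on_demand_cost - ri_cost


print(f"break-even utilization: {break_even_utilization(0.40):.0%}")
print(f"m5.large, always on:    ${ri_saving(0.096, 0.40, 1.0):,.2f} saved per year")
print(f"m5.large, half the time: ${ri_saving(0.096, 0.40, 0.5):,.2f}")
```

At a 40% discount the instance has to run more than 60% of the year for the purchase to win, which a 24/7 app server clears easily.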

spot for the things that can die

The CI fleet was On-Demand c5.xlarge runners that sat idle most of the day and got hammered for an hour around lunch. A perfect Spot workload — interruptible, parallelizable, with a queue in front.

resource "aws_launch_template" "ci_runner" {
  name_prefix   = "ci-runner-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "c5.xlarge"

  # No instance_market_options here: an Auto Scaling group with a
  # mixed_instances_policy rejects a launch template that requests Spot
  # itself. The On-Demand/Spot split lives in instances_distribution below.
}

resource "aws_autoscaling_group" "ci_runners" {
  name = "ci-runners"

  # Assumes the private subnets referenced elsewhere in this post
  vpc_zone_identifier = aws_subnet.private[*].id

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.ci_runner.id
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 1  # one runner always on
      on_demand_percentage_above_base_capacity = 0  # everything else is Spot
      spot_allocation_strategy                 = "capacity-optimized"
      spot_max_price                           = "0.10"  # hourly price cap
    }
  }

  min_size = 2
  max_size = 10
}

One On-Demand runner for the always-on baseline, the rest Spot, capacity-optimized strategy so AWS picks pools with low interruption rates. $2,500 a month. The CI team noticed the build queue was faster, not that the underlying instances had changed.
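How the two `instances_distribution` settings divide a given group size can be sketched in a few lines. This is an illustration of the arithmetic, not the Auto Scaling implementation; in particular, rounding a fractional On-Demand share upward is an assumption, and it does not come into play when the percentage is 0 as it is here.

```python
import math


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Assumption: a fractional On-Demand share is rounded up.
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


for size in (2, 10):
    on_demand, spot = split_capacity(size, on_demand_base=1, on_demand_pct_above_base=0)
    print(f"{size} runners -> {on_demand} On-Demand, {spot} Spot")
```

At `min_size` the group is one On-Demand and one Spot runner; at `max_size` it is one On-Demand and nine Spot.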

S3 lifecycle policies

We had 50 TB in S3, all in Standard. The application logs were the worst offender — every JSON line our services had ever emitted, sitting at $0.023 per GB-month, being read by exactly nobody.

aws s3api list-objects-v2 \
  --bucket my-bucket \
  --query "Contents[?LastModified<'2023-01-01'].[Key,Size]" \
  --output table

Most of it hadn't been touched in a year.

The lifecycle policy is the thing AWS lets you write once and forget:

{
  "Rules": [
    {
      "Id": "Archive old logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER_IR" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    },
    {
      "Id": "Delete old temp files",
      "Status": "Enabled",
      "Filter": { "Prefix": "temp/" },
      "Expiration": { "Days": 7 }
    },
    {
      "Id": "Intelligent tiering for backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "backups/" },
      "Transitions": [
        { "Days": 0, "StorageClass": "INTELLIGENT_TIERING" }
      ]
    }
  ]
}

Apply it once:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json

$3,000 a month. The work was reading enough of the data to be confident no on-call runbook secretly depended on a five-year-old log line. (One did. We rewrote the runbook.)
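When checking a runbook against the "logs/" rule, it helps to know which storage class an object of a given age should have reached. A minimal sketch that mirrors the three transitions in the policy; `storage_class_for_age` is a hypothetical helper, and since S3 applies lifecycle transitions asynchronously, real objects can lag the thresholds slightly.

```python
# (minimum age in days, storage class), mirroring the "logs/" rule above.
LOG_TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER_IR"),
    (180, "DEEP_ARCHIVE"),
]


def storage_class_for_age(age_days, transitions=LOG_TRANSITIONS):
    """Return the storage class an object of this age should have reached."""
    current = "STANDARD"
    for min_age, storage_class in sorted(transitions):
        if age_days >= min_age:
            current = storage_class
    return current


for age in (10, 45, 120, 400):
    print(f"{age:>3} days old -> {storage_class_for_age(age)}")
```

Anything a runbook needs to read quickly should stay above the DEEP_ARCHIVE line, because restores from that class take hours rather than milliseconds.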

RDS, where the real fat lived

The dev database was a db.r5.4xlarge. Sixteen vCPUs and 128 GB of RAM, running 24/7, used by maybe three engineers between 10am and 6pm in one timezone. It cost more than half the engineering team's laptops combined.

The fix was three changes. Drop the dev instance to db.t3.large. Auto-stop it at night and on weekends. Move staging to Aurora Serverless v2 so it scales to half a capacity unit when idle:

resource "aws_db_instance" "dev" {
  identifier     = "dev-database"
  instance_class = "db.t3.large"  # was db.r5.4xlarge

  iam_database_authentication_enabled = true
  auto_minor_version_upgrade          = true

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"
}

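resource "aws_rds_cluster" "staging" {
  cluster_identifier = "staging-aurora"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"  # Serverless v2 runs in provisioned mode

  serverlessv2_scaling_configuration {
    max_capacity = 2.0
    min_capacity = 0.5
  }
}

# Serverless v2 capacity comes from cluster instances of class db.serverless
resource "aws_rds_cluster_instance" "staging" {
  cluster_identifier = aws_rds_cluster.staging.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.staging.engine
}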

$5,000 a month. The complaints about staging being slow on the first request after lunch went away once people understood that two seconds of scaling up from half a capacity unit was the trade.
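The auto-stop piece is not shown in the Terraform above. A minimal sketch of the scheduling decision follows, assuming an 08:00 to 20:00 weekday window (the post only says nights and weekends) and a scheduled job that calls boto3's RDS `stop_db_instance` or `start_db_instance` depending on the answer. The function name and the window are assumptions.

```python
from datetime import datetime

# Assumed working window; the post only says "at night and on weekends".
START_HOUR = 8   # 08:00 local time
STOP_HOUR = 20   # 20:00 local time


def should_be_running(now):
    """True if the dev database should be up at this local time."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and START_HOUR <= now.hour < STOP_HOUR


checks = [
    datetime(2024, 12, 4, 11, 0),   # Wednesday morning
    datetime(2024, 12, 4, 23, 0),   # Wednesday night
    datetime(2024, 12, 7, 11, 0),   # Saturday morning
]
for moment in checks:
    state = "running" if should_be_running(moment) else "stopped"
    print(f"{moment:%a %H:%M} -> {state}")
```

RDS starts a stopped instance again on its own after seven days, so the schedule has to keep re-asserting the stopped state rather than stopping once and forgetting.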

CloudWatch logs, kept forever

CloudWatch logs default to "never expire," which is fine if you want to be the company paying $0.50 per GB to ingest and $0.03 per GB-month to keep a stack trace from 2021.

A short script set retention on every log group in the account:

import boto3

client = boto3.client('logs')

# describe_log_groups returns at most 50 groups per call, so paginate
paginator = client.get_paginator('describe_log_groups')

for page in paginator.paginate():
    for log_group in page['logGroups']:
        group_name = log_group['logGroupName']

        # prod keeps 30 days, everything else keeps 7
        retention_days = 30 if 'prod' in group_name else 7

        client.put_retention_policy(
            logGroupName=group_name,
            retentionInDays=retention_days
        )

        print(f"Set {group_name} to {retention_days} days")

$1,500 a month, recovered from log groups whose entire purpose was to exist.
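A rough model of what retention does and does not change, using the two prices quoted above. This is a sketch under simplifying assumptions: a steady daily ingest rate, a 30-day month, and stored volume equal to ingested volume, which overstates storage if the stored data is compressed. The 10 GB/day figure is invented for the example.

```python
INGEST_PER_GB = 0.50         # USD, from the figures above
STORAGE_PER_GB_MONTH = 0.03  # USD, from the figures above


def monthly_log_cost(daily_ingest_gb, retention_days, days_in_month=30):
    """Rough steady-state monthly cost of one log group, in USD."""
    ingest = daily_ingest_gb * days_in_month * INGEST_PER_GB
    stored_gb = daily_ingest_gb * retention_days  # upper bound, ignores compression
    storage = stored_gb * STORAGE_PER_GB_MONTH
    return ingest + storage


three_years = monthly_log_cost(10, retention_days=3 * 365)
one_week = monthly_log_cost(10, retention_days=7)
print(f"10 GB/day kept 3 years: ${three_years:,.2f}/month")
print(f"10 GB/day kept 7 days:  ${one_week:,.2f}/month")
```

Retention only shrinks the storage term. The ingest term is untouched, so a noisy log group is still expensive until the service logs less.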

the NAT gateway tax

Three NAT Gateways, one per AZ, $0.045 per hour each. The HA story was airtight. The actual traffic profile didn't justify it for the non-prod VPCs.

# Before: 3 NAT Gateways
resource "aws_nat_gateway" "az1" { /* ... */ }
resource "aws_nat_gateway" "az2" { /* ... */ }
resource "aws_nat_gateway" "az3" { /* ... */ }

# After: 1 NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = { Name = "main-nat-gateway" }
}

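resource "aws_route" "private_nat" {
  # aws_route_table.private is count-based elsewhere in this post,
  # so iterate by index rather than with for_each
  count = length(aws_route_table.private)

  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}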

$200 a month. We kept the three-gateway HA setup in production. The argument against single-NAT in dev is "but what if the AZ goes down?" The answer in dev is "then dev is down."
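The hourly charge is easy to put a monthly number on. A minimal sketch using the rate quoted above and an assumed 730-hour month; it covers only the per-gateway hourly fee, since the per-GB processing charge follows the traffic and does not change with the gateway count.

```python
NAT_HOURLY = 0.045     # USD per gateway-hour, from the figure above
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def nat_hourly_charge(gateways):
    """Monthly hourly-fee total for a number of NAT Gateways, in USD."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH


saving_per_vpc = nat_hourly_charge(3) - nat_hourly_charge(1)
print(f"3 gateways: ${nat_hourly_charge(3):,.2f}/month")
print(f"1 gateway:  ${nat_hourly_charge(1):,.2f}/month")
print(f"saved per VPC: ${saving_per_vpc:,.2f}/month")
```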

data transfer, the silent killer

$6,000 a month in data transfer fees, which is the kind of bill where you can't actually see what you're paying for until you turn on VPC Flow Logs and read them.

aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-xxxxx \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-flow-logs

Two culprits. App servers were pulling Docker images from external registries on every cold start, paying NAT data-processing charges on every layer. And one stale cron job was syncing a database snapshot across regions every hour for a use case that nobody could remember sponsoring.
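Reading the flow logs is mostly an aggregation problem. A minimal sketch, assuming the default (version 2) record format with its fourteen space-separated fields; the sample lines and addresses are invented, and `bytes_by_destination` is a hypothetical helper.

```python
from collections import Counter

# Field order of the default (version 2) flow log record.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def bytes_by_destination(lines):
    """Total bytes per destination address across flow log records."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # custom formats or truncated lines
        record = dict(zip(FIELDS, parts))
        if record["log_status"] != "OK":
            continue  # header rows, NODATA and SKIPDATA carry no byte counts
        totals[record["dstaddr"]] += int(record["bytes"])
    return totals


# Invented sample records using documentation address ranges.
sample = [
    "2 123456789012 eni-0abc 10.0.1.5 203.0.113.10 44321 443 6 20 4000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.6 203.0.113.10 44322 443 6 25 5000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.5 198.51.100.7 44323 443 6 5 1000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA",
]

totals = bytes_by_destination(sample)
for address, total in totals.most_common():
    print(f"{address:<15} {total:>8} bytes")
```

Sorting by bytes and resolving the top few addresses is usually enough to find the registry pulls and the cross-region sync.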

For images served from ECR, interface endpoints route the registry traffic privately, so it never leaves the VPC and never touches NAT:

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.vpc_endpoints.id]
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.vpc_endpoints.id]
}

The S3 gateway endpoint is free, which is the only kind of free that AWS hands out without an asterisk. ECR pulls need it too, because the image layers themselves are downloaded from S3:

resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  route_table_ids = aws_route_table.private[*].id
}

CloudFront went in front of the static asset bucket, which moved bytes out of the per-GB egress lane and into the CDN lane. $3,500 a month back, most of which was the ECR change alone.

budgets, so the next surprise isn't a surprise

The reason this whole exercise happened in the first place was that nobody had a budget alert. The fix is about twenty lines of Terraform:

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "25000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["alerts@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cfo@company.com"]
  }
}

The CFO gets the forecasted-overshoot alert. The on-call gets the 80%-of-actual alert. By the time the second one fires, somebody is already digging.
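The two notification rules reduce to two comparisons against the limit. A minimal sketch of that logic, useful for checking thresholds before applying the Terraform; it mirrors the `GREATER_THAN` operators and percentages above, and the spend figures in the example are invented.

```python
def alerts_due(limit, actual, forecast):
    """Return the notifications that would fire for the two rules above."""
    due = []
    if 100 * actual / limit > 80:
        due.append("ACTUAL > 80%: alerts@company.com")
    if 100 * forecast / limit > 100:
        due.append("FORECASTED > 100%: cfo@company.com")
    return due


LIMIT = 25_000  # limit_amount from the budget above

for actual, forecast in [(15_000, 24_000), (21_000, 24_000), (15_000, 26_000)]:
    print(f"actual ${actual:,}, forecast ${forecast:,} -> {alerts_due(LIMIT, actual, forecast)}")
```

Because the operator is strictly greater-than, spend of exactly 80% of the limit does not trigger the first alert.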

the receipts

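| Category      | Before  | After   | Savings |
|---------------|---------|---------|---------|
| EC2           | $28,000 | $12,000 | 57%     |
| RDS           | $12,000 | $7,000  | 42%     |
| Data Transfer | $6,000  | $2,500  | 58%     |
| CloudWatch    | $2,000  | $500    | 75%     |
| Other         | $2,000  | $1,000  | 50%     |
| Total         | $50,000 | $20,000 | 60%     |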

Six weeks of part-time work, no architecture rewrites, no migrations, no vendor changes. Mostly Terraform diffs and one Python script.

The line from the postmortem the CFO actually circulated was the part I keep coming back to: "The bill didn't grow because we scaled. The bill grew because nobody was looking."
