AWS cost waste is money spent on cloud resources that deliver zero value - orphaned volumes, logs stored forever, idle databases, and infrastructure nobody remembers deploying. In most accounts, it adds up to 27-35% of the total bill.
| # | Waste pattern | Typical annual cost | Fix effort |
|---|---|---|---|
| 1 | Orphaned EBS volumes | ~$1,000 per TB | 1 Terraform line |
| 2 | CloudWatch logs without retention | 15% of monthly bill | 1 CLI command per log group |
| 3 | Unnecessary NAT Gateways | $1,166/year per 3-AZ setup | Conditional Terraform |
| 4 | gp2 volumes instead of gp3 | 20% of EBS spend | In-place migration, zero downtime |
| 5 | Over-provisioned RDS | $350+/month per idle instance | Environment-aware sizing |
According to a Flexera report, organizations waste 27% of their cloud spending. I have mixed feelings about that number. In the audits I've conducted throughout my career, the result has more often been closer to 35%. But never mind the exact figure. What matters more is that almost nobody notices the wasted money until they actually go looking for it.
Interestingly, these aren't exotic edge cases. The same pattern repeats with almost every customer: the same five problems, over and over. In this article, I walk through the most common ones.
1. Orphaned EBS volumes
Did you spin up EC2 instances for testing? Great. Tested everything you needed to? Even better. Terminated the instances? Clearly you're a professional who cares about costs. But wait... did you actually have "Delete on termination" set on the EBS volumes? Oh, no? And you've probably launched hundreds of instances over the last year? Let's be optimistic and say 50 of them left volumes behind, at $0.08-0.10 per GB per month. I'll leave the math to you.
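Or, if you'd rather not do it in your head, here's the back-of-the-envelope version. The volume count and average size are hypothetical; the price is the standard gp3 rate:

```python
# Hypothetical fleet: 50 orphaned volumes, 30 GB each, gp3 at $0.08/GB-month.
volumes = 50
avg_size_gb = 30
price_per_gb_month = 0.08

monthly = volumes * avg_size_gb * price_per_gb_month
yearly = monthly * 12
print(f"${monthly:.2f}/month, ${yearly:.2f}/year")  # -> $120.00/month, $1440.00/year
```

Not dramatic on its own, but this is money for storage that literally nobody is using.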
One audit turned up 2.4 TB of orphaned volumes across three regions - roughly $2,300 a year going straight "into the cloud" without anyone noticing. But who's going to stop a rich man?
Find them:
```bash
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
  --output table
```
If that table has more than zero rows, you're paying for storage nobody uses.
Prevent with Terraform:
```hcl
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  root_block_device {
    volume_type           = "gp3"
    delete_on_termination = true # This is the line that matters
    encrypted             = true
  }
}
```
One line in your module. That's it. If your Terraform modules don't set this, every terminated instance leaves behind a volume that nobody will ever clean up.
2. CloudWatch logs that never expire
We like having application logs, don't we? So let's log everything: Lambda, every ECS task, every API Gateway - EVERYTHING! Retention? Better not set it - what if, in 15 years, someone asks why that ECS task crashed?
Logs are just text data, and text is cheap - it's hard to disagree. Logging absolutely everything to CloudWatch isn't the problem either. The problem is never setting retention on those log groups. Honestly, how often do you read logs older than a few days? Fine, that can happen. Logs from a month ago? Maybe once every five years, and even then you'd survive without them. But whether you read them or not, you pay for them. It sounds like peanuts at $0.03 per GB-month of storage, but it adds up faster than you think. I've seen situations where CloudWatch was 15% of the monthly bill.
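To see how it creeps up, here's a rough sketch with an assumed workload of 10 GB of logs per day and us-east-1 list prices ($0.50/GB ingestion, $0.03/GB-month storage):

```python
# Assumed workload: 10 GB of logs ingested per day, us-east-1 list prices.
INGEST_PER_GB = 0.50        # charged once, per GB ingested
STORAGE_PER_GB_MONTH = 0.03 # charged every month, forever, if retention is off

gb_per_day = 10
ingest_monthly = gb_per_day * 30 * INGEST_PER_GB               # fixed cost of logging
stored_after_year_gb = gb_per_day * 365                        # 3650 GB with no retention
storage_monthly = stored_after_year_gb * STORAGE_PER_GB_MONTH  # and still growing
print(f"${ingest_monthly:.2f} ingestion + ${storage_monthly:.2f} storage per month")
# -> $150.00 ingestion + $109.50 storage per month
```

After one year, the "free" decision to skip retention costs almost as much per month as the ingestion itself - and unlike ingestion, that number never stops growing.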
The conclusion is simple: if you let AWS create log groups automatically (which, contrary to appearances, is the default behavior), retention is infinite. Using Terraform? Create the log groups yourself with a retention policy and you won't have to worry about surprise bills.
Find log groups with no retention:
```bash
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].{Name:logGroupName,StoredBytes:storedBytes}' \
  --output table
```
Fix immediately:
```bash
# Set 30-day retention on a specific log group
aws logs put-retention-policy \
  --log-group-name "/aws/lambda/my-function" \
  --retention-in-days 30
```
Prevent with Terraform:
```hcl
# Create the log group BEFORE the Lambda, so you control retention
resource "aws_cloudwatch_log_group" "lambda" {
  name              = "/aws/lambda/${var.function_name}"
  retention_in_days = 30 # ALWAYS set this
}
```
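With the group created first, the function itself just has to line up with it. A sketch of the wiring, assuming a hypothetical Lambda resource in the same module (role, runtime, and handler omitted):

```hcl
# Sketch only. The key points: the group name above matches what Lambda would
# auto-create, and depends_on makes Terraform create the group first, so the
# function never gets a chance to create one with infinite retention.
resource "aws_lambda_function" "app" {
  function_name = var.function_name
  # ... role, runtime, handler ...
  depends_on = [aws_cloudwatch_log_group.lambda]
}
```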
3. NAT Gateways nobody needs
Oh, I love this topic. A NAT Gateway costs about $33 a month just for existing, before a single bit has passed through it. And imagine you have to adhere to HA, meaning one NAT Gateway in each AZ, and you have three of them - that's roughly $100 a month just to have them sit there. Not to mention the $0.045 per GB of data processing on top.
You know the problem? Most non-production environments seriously don't need three NAT Gateways. In fact, sometimes they don't need one at all.
Check utilization:
```bash
# Check bytes processed by each NAT Gateway over the last 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 604800 \
  --statistics Sum
```
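The hourly charges alone make the dev/staging case easy to quantify. A quick sketch using the us-east-1 list price (~$0.045/hour per gateway, ~730 hours per month), data processing excluded:

```python
# List price assumption (us-east-1): ~$0.045/hour per NAT Gateway.
HOURLY = 0.045
HOURS_PER_MONTH = 730

three_az = 3 * HOURLY * HOURS_PER_MONTH  # HA setup, before any traffic
one_az = 1 * HOURLY * HOURS_PER_MONTH    # enough for most non-prod environments
print(f"3 AZs: ${three_az:.2f}/mo, 1 AZ: ${one_az:.2f}/mo, "
      f"saved: ${(three_az - one_az) * 12:.2f}/yr")
# -> 3 AZs: $98.55/mo, 1 AZ: $32.85/mo, saved: $788.40/yr
```

Nearly $800 a year per environment, for gateways that in dev mostly route `apt-get update` traffic.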
Prevent with Terraform:
```hcl
variable "environment" {
  type = string
}

# 1 NAT Gateway in dev/staging, N in production
resource "aws_nat_gateway" "main" {
  count         = var.environment == "prod" ? length(var.azs) : 1
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
```
It's also worth checking whether your private subnets are using the internet at all. Maybe some of them only communicate with other AWS services? Endpoints are a much cheaper solution than NAT Gateways.
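Gateway endpoints for S3 and DynamoDB in particular cost nothing at all. A minimal sketch for S3, with illustrative VPC and route table resource names:

```hcl
# Free alternative for S3 traffic from private subnets.
# Resource names (aws_vpc.main, aws_route_table.private) are illustrative.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}
```

Once the route tables point S3 traffic at the endpoint, that traffic stops going through the NAT Gateway - and stops costing $0.045 per GB.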
4. gp2 volumes that should be gp3
This one is interesting because fixing it takes almost nothing, and yet I see it practically everywhere.
I can guess where it comes from. Common wisdom says the newer thing (and a higher version number reads as newer) must be more expensive. So someone who doesn't use AWS every day launches an EC2 instance, sees the choice between gp2 and gp3 EBS, and thinks: "I'll take the older, cheaper one." Mmm... good luck! gp3 is about 20% cheaper than gp2, with 3,000 IOPS and 125 MB/s of baseline throughput included. And yet, according to Datadog's State of Cloud Costs report, gp2 still accounts for 58% of EBS spending.
Generally, there's no scenario where gp2 is better - gp3 simply costs less and performs better. That's all.
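The arithmetic is as simple as it sounds. Using us-east-1 list prices ($0.10/GB-month for gp2, $0.08 for gp3) and a hypothetical 500 GB fleet:

```python
# List price assumption (us-east-1): gp2 $0.10/GB-month, gp3 $0.08/GB-month.
size_gb = 500  # hypothetical total across all volumes
gp2 = size_gb * 0.10
gp3 = size_gb * 0.08
print(f"gp2: ${gp2:.2f}/mo, gp3: ${gp3:.2f}/mo, "
      f"saved: {100 * (gp2 - gp3) / gp2:.0f}%")
# -> gp2: $50.00/mo, gp3: $40.00/mo, saved: 20%
```

A flat 20% off storage you're already paying for, with better baseline performance thrown in.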
Find all gp2 volumes:
```bash
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].{ID:VolumeId,Size:Size,State:State,Instance:Attachments[0].InstanceId}' \
  --output table
```
Migrate (no downtime):
```bash
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3
```
That's it. No shutdowns, no snapshots, no maintenance window. The migration occurs in the background while the volume remains connected and operational.
Prevent with Terraform:
```hcl
variable "volume_type" {
  type    = string
  default = "gp3"

  validation {
    condition     = var.volume_type != "gp2"
    error_message = "Use gp3 instead of gp2. It's 20% cheaper with better baseline performance."
  }
}
```
A validation block in the EC2 module rejects gp2 at plan time. This prevents anyone from accidentally deploying a costly option.
5. Over-provisioned RDS instances
Time for dessert. Oh, how many companies lose real money here. A typical example: production launch is eight months away, but the development environment is already running with the exact parameters planned for production. Say, a db.r6g.xlarge instance at roughly $350 a month on average. Needed for development? About as much as a fish needs a bicycle.
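The damage over those eight months is easy to tally. Prices here are ballpark on-demand assumptions (~$350/month for db.r6g.xlarge, ~$12/month for db.t4g.micro):

```python
# Ballpark on-demand assumptions: db.r6g.xlarge ~$350/mo, db.t4g.micro ~$12/mo.
months_until_launch = 8
oversized = 350 * months_until_launch    # prod-sized dev database
right_sized = 12 * months_until_launch   # what dev actually needs
print(oversized - right_sized)  # -> 2704
```

Over $2,700 spent before production even exists - and that's a single database in a single environment.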
And that's still the milder case. In production, I've more than once seen RDS instances with average CPU utilization of 5-8% - the kind of headroom that went out of fashion back in 2008, when the crisis forced everyone to count their money.
Check CPU utilization over the last 14 days:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-database \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average \
  --output table
```
Check for zero-connection databases:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-database \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Maximum \
  --output table
```
Prevent with Terraform:
```hcl
resource "aws_db_instance" "main" {
  instance_class    = var.environment == "prod" ? "db.r6g.large" : "db.t4g.micro"
  multi_az          = var.environment == "prod"
  allocated_storage = var.environment == "prod" ? 100 : 20
  storage_type      = "gp3"
}
```
Environment-aware sizing. Dev gets the minimum, production gets what it needs. No more copying production configs into staging and forgetting about it.
The pattern behind all five
You've probably noticed the key pattern? Most of these problems don't show up at startups or small businesses that count every cent twice - they show up at large companies. And what's worse, in difficult times those large companies look for savings in headcount rather than in the things I've listed. Why does nobody pay attention to this waste? Because the staff has shrunk...
And it's not that the teams I work with don't know AWS - usually they know it well. There's simply never enough capacity set aside for cloud cost optimization.
What to do about it
Simply take these ready-made commands and run them against your environment. It'll take maybe 10 minutes, and the savings can add up to a full-time salary.
If you want to go deeper - right-sizing over-allocated compute, auditing data transfer patterns, checking Savings Plans and Reserved Instance coverage - that's a longer conversation. But start with these five. They can be checked for free, and most can be fixed for free.
I built cloud-audit to automate the security side of these checks - it runs 30+ checks in ~12 seconds. For cost specifically, the five CLI commands above are your starting point.
Originally published at haitmg.pl