From dev/test auto-resizing to right-sizing and storage optimization: here's how to slash RDS costs with Terraform without sacrificing performance.
Your RDS database is probably your second-highest AWS cost after EC2. And if you're like most teams, you're overpaying by 50-70%.
Here's why:
- Dev/test databases running at full capacity 24/7 (you only need full power during business hours)
- Production instances sized for peak load that happens 1% of the time
- Multi-AZ enabled everywhere "just in case"
- gp2 storage when gp3 is 20% cheaper
- Snapshots piling up from instances deleted 2 years ago
Sound familiar? Let's fix all of this with Terraform automation.
💸 The RDS Cost Breakdown
A typical $5,000/month RDS bill looks like:
Instance hours: $3,200 (64%) ← Biggest target
Storage: $1,000 (20%) ← Easy wins
Backups/snapshots: $400 (8%) ← Often wasted
Data transfer: $300 (6%) ← Sneaky costs
Multi-AZ premium: $100 (2%) ← Necessary evil?
Our strategy: Attack each of these systematically with Terraform.
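Want to see your own version of this breakdown first? Here's a minimal Cost Explorer sketch (it assumes your credentials allow ce:GetCostAndUsage; grouping by usage type maps roughly onto the categories above):
import boto3
from collections import defaultdict
from datetime import date, timedelta

ce = boto3.client('ce')  # Cost Explorer
end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    Filter={'Dimensions': {'Key': 'SERVICE',
                           'Values': ['Amazon Relational Database Service']}},
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}],
)

# Sum each RDS usage type (instance hours, storage, backups, data transfer, ...)
totals = defaultdict(float)
for result in resp['ResultsByTime']:
    for group in result['Groups']:
        totals[group['Keys'][0]] += float(group['Metrics']['UnblendedCost']['Amount'])

for usage_type, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    if cost > 1:
        print(f'{usage_type}: ${cost:,.2f}')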
🎯 Strategy #1: Dev/Test Auto-Resize (45% Savings)
Automatically downsize instances during nights and weekends. Databases stay available 24/7, just at a significantly lower cost.
The Math:
- Dev database at db.t3.large 24/7: $200/month
- Auto-resize: db.t3.large (50 hrs/wk) + db.t3.small (118 hrs/wk): $110/month
- Savings: 45% 🎉 (worked through in the sketch below)
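Here's that math as a quick sketch. The hourly rates are placeholders, not current AWS pricing; plug in the on-demand rates for your region and engine. Depending on how far you step down, savings typically land somewhere in the 45-55% range.
# Placeholder on-demand rates (substitute real pricing for your region/engine)
RATE_BUSINESS = 0.27  # hypothetical $/hr for the business-hours class (db.t3.large)
RATE_OFF      = 0.07  # hypothetical $/hr for the off-hours class (db.t3.small)

BUSINESS_HOURS_PER_WEEK = 10 * 5  # 8 AM - 6 PM, Mon-Fri
OFF_HOURS_PER_WEEK = 168 - BUSINESS_HOURS_PER_WEEK
WEEKS_PER_MONTH = 52 / 12

full_time = RATE_BUSINESS * 24 * 7 * WEEKS_PER_MONTH
auto_sized = (RATE_BUSINESS * BUSINESS_HOURS_PER_WEEK
              + RATE_OFF * OFF_HOURS_PER_WEEK) * WEEKS_PER_MONTH

print(f'24/7 full size: ${full_time:.0f}/mo')
print(f'Auto-resized:   ${auto_sized:.0f}/mo ({100 * (1 - auto_sized / full_time):.0f}% cheaper)')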
Implementation
# modules/rds-auto-resize/main.tf
variable "rds_instances" {
type = map(object({
identifier = string
business_hours_class = string # e.g., "db.t3.large"
off_hours_class = string # e.g., "db.t3.small"
scale_up_cron = string
scale_down_cron = string
}))
}
resource "aws_lambda_function" "rds_resizer" {
filename = data.archive_file.lambda.output_path
function_name = "rds-auto-resizer"
role = aws_iam_role.lambda.arn
handler = "index.handler"
runtime = "python3.11"
timeout = 600
source_code_hash = data.archive_file.lambda.output_base64sha256
}
data "archive_file" "lambda" {
type = "zip"
output_path = "${path.module}/lambda.zip"
source {
content = <<-EOF
import boto3

rds = boto3.client('rds')

def handler(event, context):
    """Resize an RDS instance to the class passed in by the EventBridge rule."""
    db_identifier = event['db_identifier']
    target_class = event['target_instance_class']
    try:
        response = rds.describe_db_instances(DBInstanceIdentifier=db_identifier)
        current_class = response['DBInstances'][0]['DBInstanceClass']
        if current_class == target_class:
            return {'statusCode': 200, 'body': 'Already at target size'}
        rds.modify_db_instance(
            DBInstanceIdentifier=db_identifier,
            DBInstanceClass=target_class,
            ApplyImmediately=True,
        )
        return {'statusCode': 200, 'body': f'Resized to {target_class}'}
    except Exception as e:
        return {'statusCode': 500, 'body': str(e)}
EOF
filename = "index.py"
}
}
resource "aws_iam_role" "lambda" {
name = "rds-resizer-lambda"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "lambda.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy" "lambda_rds" {
role = aws_iam_role.lambda.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["rds:ModifyDBInstance", "rds:DescribeDBInstances", "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
Resource = "*"
}]
})
}
resource "aws_cloudwatch_event_rule" "scale_down" {
for_each = var.rds_instances
name = "rds-scale-down-${each.key}"
schedule_expression = "cron(${each.value.scale_down_cron})"
}
resource "aws_cloudwatch_event_rule" "scale_up" {
for_each = var.rds_instances
name = "rds-scale-up-${each.key}"
schedule_expression = "cron(${each.value.scale_up_cron})"
}
resource "aws_cloudwatch_event_target" "scale_down" {
for_each = var.rds_instances
rule = aws_cloudwatch_event_rule.scale_down[each.key].name
arn = aws_lambda_function.rds_resizer.arn
input = jsonencode({
db_identifier = each.value.identifier
target_instance_class = each.value.off_hours_class
})
}
resource "aws_cloudwatch_event_target" "scale_up" {
for_each = var.rds_instances
rule = aws_cloudwatch_event_rule.scale_up[each.key].name
arn = aws_lambda_function.rds_resizer.arn
input = jsonencode({
db_identifier = each.value.identifier
target_instance_class = each.value.business_hours_class
})
}
resource "aws_lambda_permission" "allow_eventbridge" {
for_each = merge(
{ for k, v in var.rds_instances : "down-${k}" => aws_cloudwatch_event_rule.scale_down[k].arn },
{ for k, v in var.rds_instances : "up-${k}" => aws_cloudwatch_event_rule.scale_up[k].arn }
)
statement_id = "AllowEventBridge-${each.key}"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.rds_resizer.function_name
principal = "events.amazonaws.com"
source_arn = each.value
}
Usage
module "rds_auto_resize" {
source = "./modules/rds-auto-resize"
rds_instances = {
dev = {
identifier = "myapp-dev"
business_hours_class = "db.t3.large"
off_hours_class = "db.t3.small"
scale_up_cron = "0 8 ? * MON-FRI *" # 8 AM weekdays
scale_down_cron = "0 18 ? * MON-FRI *" # 6 PM weekdays
}
staging = {
identifier = "myapp-staging"
business_hours_class = "db.t3.xlarge"
off_hours_class = "db.t3.medium"
scale_up_cron = "0 7 ? * MON-FRI *"
scale_down_cron = "0 20 ? * MON-FRI *"
}
}
}
Note: EventBridge cron schedules run in UTC, so shift the scale-up/scale-down hours to match your timezone. Resizing also causes a brief connection interruption (typically a few minutes) while the instance class changes; most apps with connection pooling and retries handle this automatically.
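Before trusting the schedule, invoke the resizer by hand against a dev instance and watch how your application behaves. This is a minimal sketch using the hypothetical identifiers from the usage example above:
import boto3
import json

lambda_client = boto3.client('lambda')
rds = boto3.client('rds')

# Trigger a scale-down with the same payload the EventBridge rule would send
response = lambda_client.invoke(
    FunctionName='rds-auto-resizer',
    Payload=json.dumps({
        'db_identifier': 'myapp-dev',
        'target_instance_class': 'db.t3.small',
    }),
)
print(json.loads(response['Payload'].read()))

# The modification is asynchronous: status flips to "modifying" before the
# new instance class shows up in describe_db_instances.
db = rds.describe_db_instances(DBInstanceIdentifier='myapp-dev')['DBInstances'][0]
print(db['DBInstanceClass'], db['DBInstanceStatus'])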
🎯 Strategy #2: Right-Sizing Instances (30-40% Savings)
Most RDS instances are sized for peak load. Downsize to match actual usage.
Find Oversized Instances
data "aws_cloudwatch_metric_statistics" "cpu" {
for_each = toset(["myapp-prod", "myapp-staging"])
namespace = "AWS/RDS"
metric_name = "CPUUtilization"
period = 86400
stat = "Average"
start_time = timeadd(timestamp(), "-30d")
end_time = timestamp()
dimensions = { DBInstanceIdentifier = each.key }
}
output "rightsizing_candidates" {
value = {
for id, stats in data.aws_cloudwatch_metric_statistics.cpu :
id => "Avg CPU: ${mean(stats.datapoints[*].average)}%"
if mean(stats.datapoints[*].average) < 40
}
}
Rule of thumb: If average CPU < 40%, downsize one tier (e.g., db.r6g.xlarge → db.r6g.large).
Use Burstable Instances for Dev/Test
resource "aws_db_instance" "dev" {
identifier = "myapp-dev"
instance_class = "db.t3.medium" # Burstable, much cheaper
engine = "postgres"
allocated_storage = 100
storage_type = "gp3"
multi_az = false
backup_retention_period = 7
skip_final_snapshot = true
}
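One caveat with burstable classes: a consistently busy workload will drain t3 CPU credits (and, in unlimited mode, rack up surplus-credit charges). A quick CloudWatch check, using the hypothetical myapp-dev identifier:
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='CPUCreditBalance',  # only emitted for burstable (t*) classes
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'myapp-dev'}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,
    Statistics=['Minimum'],
)

low_hours = [p for p in resp['Datapoints'] if p['Minimum'] < 20]
print(f'{len(low_hours)} hour(s) in the last week with a near-empty credit balance')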
🎯 Strategy #3: Storage Optimization (20% Savings)
Migrate gp2 → gp3
resource "aws_db_instance" "optimized" {
identifier = "myapp-prod"
storage_type = "gp3" # 20% cheaper than gp2
iops = 3000 # Baseline included (free)
throughput = 125 # MB/s baseline included (free)
allocated_storage = 500
}
Instant savings: 20% on storage costs with better performance. gp3 baseline (3,000 IOPS, 125 MB/s) is included at no extra cost.
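To see which instances are still on gp2 before planning the migration, here's a small inventory sketch:
import boto3

rds = boto3.client('rds')
paginator = rds.get_paginator('describe_db_instances')

# List every instance whose storage type is still gp2
for page in paginator.paginate():
    for db in page['DBInstances']:
        if db.get('StorageType') == 'gp2':
            print(f"{db['DBInstanceIdentifier']}: {db['AllocatedStorage']} GB still on gp2")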
🎯 Strategy #4: Snapshot Cleanup (30-50% Savings)
Old snapshots cost $0.095/GB-month and pile up quickly.
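Not sure how big the pile is? This sketch totals the allocated storage of manual snapshots older than 35 days; snapshot storage is billed incrementally, so treat the dollar figure as an upper bound.
import boto3
from datetime import datetime, timedelta, timezone

rds = boto3.client('rds')
cutoff = datetime.now(timezone.utc) - timedelta(days=35)

total_gb = 0
paginator = rds.get_paginator('describe_db_snapshots')
for page in paginator.paginate(SnapshotType='manual'):
    for snap in page['DBSnapshots']:
        if snap['SnapshotCreateTime'] < cutoff:
            total_gb += snap['AllocatedStorage']
            print(f"{snap['DBSnapshotIdentifier']}: {snap['AllocatedStorage']} GB, "
                  f"created {snap['SnapshotCreateTime']:%Y-%m-%d}")

print(f'Upper bound: ~${total_gb * 0.095:,.2f}/month at $0.095/GB-month')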
Automated Retention
resource "aws_db_instance" "prod" {
identifier = "myapp-prod"
backup_retention_period = 30 # Keep 30 days only
backup_window = "03:00-04:00"
skip_final_snapshot = var.environment != "production"
}
Manual Cleanup Script
# Delete manual snapshots older than 35 days (print the list first to sanity-check it)
aws rds describe-db-snapshots --snapshot-type manual \
  --query "DBSnapshots[?SnapshotCreateTime<='$(date -u -d '35 days ago' -Idate)'].DBSnapshotIdentifier" \
  --output text | xargs -n1 aws rds delete-db-snapshot --db-snapshot-identifier
🎯 Strategy #5: Multi-AZ Optimization (50% Savings)
Multi-AZ doubles your instance cost. Only use for critical production databases.
locals {
multi_az_config = {
production = true # Critical, customer-facing
staging = false # Can tolerate brief downtime
dev = false # Definitely not needed
}
}
resource "aws_db_instance" "db" {
for_each = local.multi_az_config
identifier = "myapp-${each.key}"
instance_class = each.key == "production" ? "db.r6g.large" : "db.t3.medium"
multi_az = each.value
backup_retention_period = each.value ? 30 : 7
}
Savings: $200/month per non-production database by removing unnecessary Multi-AZ.
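To find instances still paying the Multi-AZ premium outside production, here's a quick audit sketch (it assumes instances carry an Environment tag, as suggested in the pro tips below):
import boto3

rds = boto3.client('rds')
paginator = rds.get_paginator('describe_db_instances')

# Flag Multi-AZ instances that aren't tagged as production
for page in paginator.paginate():
    for db in page['DBInstances']:
        tags = {t['Key']: t['Value'] for t in db.get('TagList', [])}
        env = tags.get('Environment', 'unknown')
        if db['MultiAZ'] and env != 'production':
            print(f"{db['DBInstanceIdentifier']} ({env}): Multi-AZ enabled -- candidate for single-AZ")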
🎯 Strategy #6: Reserved Instances (40-60% Savings)
For stable production workloads, purchase RIs:
# Find your stable instances
aws rds describe-db-instances \
--query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,DBInstanceClass]' \
--output table
# Purchase via AWS Console: RDS → Reserved Instances → Purchase
# - 1-year RI: 40% savings
# - 3-year RI: 60% savings (all-upfront for max discount)
Pro tip: Start with 1-year RIs, then move to 3-year terms at renewal once the workload has proven stable.
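To spot which instance classes are still running purely on-demand, here's a rough coverage sketch (it matches on exact class and ignores size-flexible RI matching, so treat it as a starting point):
import boto3
from collections import Counter

rds = boto3.client('rds')

# Count running instances by class
running = Counter(
    db['DBInstanceClass']
    for db in rds.describe_db_instances()['DBInstances']
    if db['DBInstanceStatus'] == 'available'
)

# Count active reservations by class
reserved = Counter()
for ri in rds.describe_reserved_db_instances()['ReservedDBInstances']:
    if ri['State'] == 'active':
        reserved[ri['DBInstanceClass']] += ri['DBInstanceCount']

for instance_class, count in running.items():
    gap = count - reserved.get(instance_class, 0)
    if gap > 0:
        print(f'{instance_class}: {gap} instance(s) not covered by an RI')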
📊 Real-World Example: Complete Optimization
Before (Typical startup with 10 RDS instances):
3 Production (Multi-AZ, r6g.xlarge, gp2): $2,400/month
3 Staging (Multi-AZ, r6g.large, gp2): $1,200/month
4 Dev (t3.large, gp2, 24/7 full size): $800/month
Snapshots (500 GB): $48/month
Total: $4,448/month
After (Optimized with Terraform):
3 Production (Multi-AZ, r6g.large, gp3, RI): $960/month (RI discount + right-sized + gp3)
3 Staging (Single-AZ, t3.large, gp3): $450/month (removed Multi-AZ + gp3)
4 Dev (auto-resize t3.medium/small, gp3): $280/month (auto-resize + gp3)
Snapshots (200 GB with lifecycle): $19/month (automated cleanup)
Total: $1,709/month
Annual savings: $32,868 💰
⚡ Quick Implementation Checklist
Week 1: Quick wins (Low effort, high impact)
- ✅ Enable dev/test auto-resize module (45% savings immediately)
- ✅ Migrate gp2 → gp3 storage (20% storage savings, zero downtime)
- ✅ Run snapshot cleanup script (30-50% backup savings)
Week 2: Right-sizing (Requires analysis)
- ✅ Query CloudWatch metrics for CPU utilization
- ✅ Identify oversized instances (avg CPU < 40%)
- ✅ Downsize dev/test to burstable instances
- ✅ Test smaller instance classes in staging
Week 3: Architectural changes
- ✅ Remove unnecessary Multi-AZ from non-production
- ✅ Set up automated snapshot lifecycle policies
- ✅ Verify backups are working correctly
Week 4: Long-term commitments
- ✅ Analyze stable production workloads
- ✅ Purchase 1-year Reserved Instances
- ✅ Document RI strategy for future purchases
- ✅ Set up monthly cost review process
🎯 Summary: Savings by Strategy
| Strategy | Effort | Savings | Risk | Priority |
|---|---|---|---|---|
| Dev/test auto-resize | Low | 45% | Low | 🔥 Do first |
| gp2 → gp3 migration | Low | 20% | None | 🔥 Do first |
| Snapshot cleanup | Low | 30-50% | Low | High |
| Right-sizing | Medium | 30-40% | Medium | High |
| Remove unnecessary Multi-AZ | Low | 50% | Medium | Medium |
| Reserved Instances | Low | 40-60% | Low | Medium |
Expected total savings: 50-70% of your RDS bill
For a $5,000/month RDS bill, that's $2,500-$3,500/month saved = $30,000-$42,000/year 🚀
💡 Pro Tips
- Start with dev/test auto-resize - Easiest win, 45% savings, minimal risk
- Use Cost Explorer tags - Tag instances with Environment, Team, and CostCenter for tracking (see the tagging sketch after this list)
- Test resizing manually first - Verify your app handles brief connection interruptions
- Don't over-optimize production - Saving $100/month isn't worth a 3 AM outage
- Review quarterly - Workloads change, revisit right-sizing every 3 months
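For the tagging tip, here's a minimal boto3 sketch (in Terraform you'd simply set a tags block on aws_db_instance; the ARN below is a placeholder). Remember to activate the keys as cost allocation tags in the Billing console so they show up in Cost Explorer.
import boto3

rds = boto3.client('rds')

# Placeholder ARN -- substitute your instance's real ARN
arn = 'arn:aws:rds:us-east-1:123456789012:db:myapp-dev'

rds.add_tags_to_resource(
    ResourceName=arn,
    Tags=[
        {'Key': 'Environment', 'Value': 'dev'},
        {'Key': 'Team', 'Value': 'platform'},
        {'Key': 'CostCenter', 'Value': 'engineering'},
    ],
)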
What's your biggest RDS cost pain point? Share in the comments! 💬
Follow for more AWS cost optimization strategies! ⚡