DEV Community

June Gu

Downsizing Without Downtime: An SRE's Guide to Safe Cost Optimization

Tags: aws finops sre reliability kubernetes


In Part 1, I covered finding $12K/year in passive waste — abandoned VPCs, orphan log groups, stale WorkSpaces. Things nobody was using. That was the easy part.

This article is about the hard part: actively downsizing infrastructure that's still running in production — without breaking availability. This is where FinOps meets SRE, and where most cost-cutting initiatives fail.

I've seen teams blindly follow AWS Cost Explorer recommendations, downsize an RDS instance during peak hours, and trigger a 45-minute outage. The problem isn't the recommendation — it's executing it without an SRE mindset.

The SRE Guarantee: Every optimization in this article passes through three gates: error budget protection, assured minimum downtime, and reliability over savings. See the series introduction for the full guarantee. If any gate fails, we don't proceed — no matter how large the savings.

Here's the framework I use: every cost optimization must pass through the reliability filter first.

The SLO gate: when is it safe to cut?

Before touching any resource, I check three things:

  1. Error budget status — If we've burned >50% of this month's error budget, no changes. Period.
  2. Current resource utilization — CloudWatch metrics over 14+ days, not a snapshot.
  3. Blast radius — If this fails, what's the user impact? One service? All services?
Error budget > 50% remaining?
  ├─ No → Do nothing. Stability first.
  └─ Yes → Check utilization
       ├─ Avg CPU ≥ 20% → Skip, re-evaluate next month
       └─ Avg CPU < 20% for 14 days → Check blast radius
            ├─ Single service, non-critical path → Proceed with rollback plan
            └─ Shared or critical path → Schedule for maintenance window

This is the difference between FinOps and SRE-driven FinOps. Cost tools tell you what to cut. SRE tells you when and how.
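The gate reads naturally as code. A minimal sketch (the function and argument names are my own, not part of any toolkit):

```python
def slo_gate(error_budget_remaining: float,
             avg_cpu_14d: float,
             single_service: bool,
             non_critical_path: bool) -> str:
    """Decide what to do with a downsizing candidate, reliability first."""
    if error_budget_remaining <= 0.5:
        return "do-nothing"                  # stability first
    if avg_cpu_14d >= 0.20:
        return "skip"                        # not idle enough; re-evaluate next month
    if single_service and non_critical_path:
        return "proceed-with-rollback-plan"
    return "maintenance-window"              # larger blast radius: schedule it
```

With 78% budget remaining and a 15% average CPU on a single non-critical service, the gate says proceed with a rollback plan; burn past half the budget and it says do nothing, regardless of the other inputs.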

Automate this: finops scan runs all checks below in one command. Each section maps to a specific check in the toolkit.


1. EC2 / EKS node right-sizing with Pod Disruption Budgets

The problem: EKS worker nodes running at 15% CPU average. AWS says "downsize." But these nodes run 8 microservices — you can't just swap the instance type and hope pods reschedule gracefully.

The SRE approach:

# Ensure PDB exists BEFORE downsizing
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gateway-server-pdb
  namespace: connectorder
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gateway-server

Execution steps:

  1. Verify PDB exists for every service on the node group
  2. Add new node group with smaller instance type (t3.large → t3.medium)
  3. Cordon old nodes — Kubernetes respects PDBs during drain
  4. Monitor SLOs for 24 hours
  5. Remove old node group only after SLO confirmation

What we saved: t3.large ($0.0832/hr) → t3.medium ($0.0416/hr) = 50% per node. With 4 nodes across dev/staging, that's ~$120/month.

What could go wrong: Without PDBs, draining a node can kill all replicas of a service simultaneously. With PDBs, Kubernetes guarantees at least minAvailable pods stay running.

Toolkit check: finops scan --checks ec2_rightsizing — flags instances with avg CPU < 20% over 14 days. (source)
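If you want to reproduce that check by hand, the CloudWatch query is small. A sketch (the boto3 import stays inside the function so the threshold check stands on its own; region and threshold are assumptions):

```python
from datetime import datetime, timedelta, timezone

def avg_cpu_14d(instance_id: str, region: str = "ap-northeast-2") -> float:
    """Average CPUUtilization over the last 14 days, one datapoint per hour."""
    import boto3  # only needed for the live query
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = [p["Average"] for p in resp["Datapoints"]]
    return sum(points) / len(points) if points else 0.0

def is_rightsizing_candidate(avg_cpu_percent: float, threshold: float = 20.0) -> bool:
    """Flag instances whose 14-day average CPU sits below the threshold."""
    return avg_cpu_percent < threshold
```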


2. NAT Gateway → NAT Instance with high availability

The problem: NAT Gateways cost $32.40/month each (fixed) + data processing. In dev/staging environments processing <1 GB/month, you're paying $32 for almost nothing.

The SRE approach: Don't just replace with a single unmanaged NAT Instance — that's a single point of failure. Use NAT Instances in one-instance Auto Scaling Groups (one per AZ) so a failed instance is replaced automatically.

# NAT Instance with auto-recovery via ASG
resource "aws_autoscaling_group" "nat" {
  min_size         = 1
  max_size         = 1
  desired_capacity = 1

  launch_template {
    id      = aws_launch_template.nat.id
    version = "$Latest"
  }

  # Auto-replace if health check fails
  health_check_type         = "EC2"
  health_check_grace_period = 120

  tag {
    key                 = "Name"
    value               = "${local.name_prefix}-nat"
    propagate_at_launch = true
  }
}

# t4g.nano: $3.02/month — 10x cheaper than NAT Gateway
resource "aws_launch_template" "nat" {
  instance_type = "t4g.nano"
  image_id      = data.aws_ami.nat_instance.id
  # ... source_dest_check = false
}

What we saved: $32.40 → $3.02/month per environment. Across 3 dev/staging environments: ~$88/month.

The HA guarantee: ASG auto-replaces the instance within ~2 minutes if it fails. For dev/staging, 2 minutes of NAT downtime is acceptable. For prod, keep the managed NAT Gateway.

Real-world validation: Halodoc's engineering team documented their full migration from managed NAT Gateways to NAT instances using fck-nat, an open-source project that provides ready-to-use ARM-based AMIs supporting up to 5Gbps burst on a t4g.nano. They achieved over 90% cost reduction across non-prod environments. The fck-nat AMI handles IP forwarding, NAT rules, and CloudWatch alarms out of the box — it's essentially what I built manually with the ASG approach above, but packaged as a reusable AMI. If you're doing this at scale, consider fck-nat instead of rolling your own.

Toolkit check: finops scan --checks nat_gateway — flags NAT Gateways with 0 bytes processed in dev/staging accounts. (source)


3. Spot Instances for non-production EKS with graceful draining

The problem: Dev and staging EKS node groups run on-demand 24/7 for workloads that tolerate interruption.

The SRE approach: Spot saves 60-70%, but you need graceful handling of the 2-minute interruption notice.

# EKS managed node group with Spot + drain handler
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
  - name: spot-workers
    instanceTypes: ["t3.medium", "t3a.medium", "t3.large"]
    spot: true
    desiredCapacity: 3
    labels:
      lifecycle: spot
    taints:
      - key: spot
        value: "true"
        effect: PreferNoSchedule

Critical: Install the AWS Node Termination Handler. Without it, pods get killed mid-request.

helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true

What we saved: 3 on-demand t3.medium nodes ($0.0416/hr × 3 × 730hr) = $91/month → Spot (~$0.0125/hr × 3 × 730hr) = $27/month. $64/month savings per environment.
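The arithmetic, spelled out (the spot rate is the average we observed; actual spot prices vary by AZ and over time):

```python
HOURS_PER_MONTH = 730

def monthly(hourly_rate: float, nodes: int = 3) -> float:
    """Monthly cost for a node group at a given per-node hourly rate."""
    return hourly_rate * nodes * HOURS_PER_MONTH

on_demand = monthly(0.0416)   # t3.medium on-demand rate used above
spot = monthly(0.0125)        # observed average spot rate
print(round(on_demand), round(spot), round(on_demand - spot))  # 91 27 64
```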

The reliability rule: Never use Spot for production. Never use Spot for stateful workloads. Only use Spot where you have:

  • Multiple instance type fallbacks (capacity diversification)
  • Node Termination Handler installed
  • Pod anti-affinity so replicas spread across nodes
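For the last bullet, a minimal anti-affinity stanza (reusing the gateway-server label from section 1; adapt the selector to your service) that nudges the scheduler to spread replicas across nodes, so a single spot reclaim can't take them all:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: gateway-server
```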

Toolkit check: finops scan --checks spot_candidates — identifies stateless ASGs and EKS node groups eligible for Spot. (source)


4. RDS right-sizing without losing your safety net

The problem: RDS instances provisioned for peak load that only hits 2 hours per day. Average CPU: 8%. But it's a database — you can't just resize and pray.

The SRE approach:

| Before | After | Why it's safe |
| --- | --- | --- |
| db.r6g.xlarge (prod) | db.r6g.large (prod) | Read replica absorbs overflow |
| db.r6g.large (staging) | db.r6g.medium (staging) | No Multi-AZ needed in staging |
| Multi-AZ on staging | Single-AZ | Staging doesn't need failover |

Execution steps:

  1. Add a read replica BEFORE downsizing (safety net)
  2. Monitor replica lag for 48 hours
  3. Apply instance modification during low-traffic window (scheduled, not immediate)
  4. Monitor connection count and query latency for 1 week
  5. Remove old read replica only after confirming SLOs hold

What we saved:

  • Staging Multi-AZ removal: ~$200/month (you're paying 2x for staging redundancy nobody needs)
  • Right-sizing across 3 non-prod instances: ~$150/month

What NOT to touch: Production primary instances running at >40% CPU. Production Multi-AZ. Any RDS with burst credit dependency (t-class instances under load).

Parameter groups: the hidden risk

When you change an RDS instance class, memory-dependent parameters may break silently.

Default parameter groups auto-scale: shared_buffers, effective_cache_size, and work_mem in PostgreSQL (or innodb_buffer_pool_size in MySQL) adjust automatically with instance memory. If you're using the default parameter group, downsizing is straightforward.

Custom parameter groups with hardcoded values don't auto-scale. If someone set shared_buffers = 8GB explicitly for a db.r6g.xlarge (32GB RAM), downsizing to db.r6g.large (16GB RAM) means shared_buffers is now 50% of total RAM instead of 25%. That leaves almost nothing for OS cache and connections.

This is a known production pitfall. AWS documents that RDS replicas can get stuck in incompatible-parameters mode when created with a smaller instance class if the source's parameter group has hardcoded buffer values too large for the target. The same issue applies to downsizing: the instance may fail to start or perform poorly. AWS Prescriptive Guidance recommends using formula-based parameters (e.g., {DBInstanceClassMemory/32768}) that auto-scale with instance size, rather than hardcoded values.

Before downsizing, check:

# List parameter groups for the instance
aws rds describe-db-instances \
  --db-instance-identifier pn-sh-rds-prod \
  --query 'DBInstances[0].DBParameterGroups' \
  --profile dodo-dev

# Check for hardcoded memory parameters
aws rds describe-db-parameters \
  --db-parameter-group-name my-custom-pg15 \
  --query 'Parameters[?ParameterName==`shared_buffers` || ParameterName==`effective_cache_size` || ParameterName==`work_mem`].[ParameterName,ParameterValue,Source]' \
  --output table \
  --profile dodo-dev

The rule: If Source = user (not engine-default), the parameter is hardcoded. Recalculate it for the target instance size before downsizing.
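The recalculation itself is one line. RDS's engine default for PostgreSQL is {DBInstanceClassMemory/32768}, where the result is in 8 kB pages, which works out to 25% of RAM:

```python
def default_shared_buffers_pages(instance_memory_bytes: int) -> int:
    """RDS engine default: shared_buffers = DBInstanceClassMemory/32768 (8 kB pages)."""
    return instance_memory_bytes // 32768

GIB = 1024 ** 3
pages = default_shared_buffers_pages(16 * GIB)   # db.r6g.large: 16 GiB RAM
print(pages, pages * 8192 // GIB)                # 524288 pages = 4 GiB, i.e. 25% of RAM
```

A hardcoded 8 GB value carried over from the db.r6g.xlarge would be double that, which is exactly the 50%-of-RAM trap described above.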

CDC and logical replication: the blast radius multiplier

If the database has Change Data Capture (CDC) enabled via logical replication, downsizing becomes significantly riskier.

Why it matters:

  • Replication slots consume WAL: Logical replication slots prevent WAL cleanup until the consumer catches up. On a smaller instance with less I/O throughput, WAL can accumulate faster than it's consumed.
  • Replication lag increases: Smaller instance = less CPU and memory for WAL decoding. If your CDC pipeline (Debezium, DMS, custom) can't keep up, lag grows — and if the slot falls too far behind, you may need to recreate it.
  • Disk pressure: WAL accumulation on a smaller instance with less storage headroom can fill the disk, causing the primary to halt writes entirely.

This is not theoretical. Gunnar Morling (Debezium/Red Hat) documented the "insatiable" replication slot problem — when a CDC consumer stops, an idle RDS PostgreSQL instance accumulates 18 GB/day of WAL because RDS writes a heartbeat every 5 minutes into 64 MB WAL segments. His follow-up guide on mastering replication slots is essential reading. Artie's production guide calls slot bloat "the single most common way CDC pipelines take down production databases."
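Morling's 18 GB/day figure is easy to reproduce: one 64 MB segment forced every 5 minutes adds up fast.

```python
SEGMENT_MB = 64
SWITCHES_PER_DAY = 24 * 60 // 5   # a heartbeat forces a new segment every 5 minutes
wal_gb_per_day = SEGMENT_MB * SWITCHES_PER_DAY / 1024
print(wal_gb_per_day)             # 18.0
```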

Before downsizing a CDC-enabled database:

# Check for logical replication slots (PostgreSQL)
# Run via psql or RDS Data API:
# SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn
# FROM pg_replication_slots;

# Check replication lag via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicationSlotDiskUsage \
  --dimensions Name=DBInstanceIdentifier,Value=pn-sh-rds-prod \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Maximum \
  --profile dodo-dev

Critical safety net (PostgreSQL 13+): Set max_slot_wal_keep_size in your parameter group to cap how much WAL a replication slot can retain. Without this, an inactive slot will accumulate WAL indefinitely — Morling measured 18 GB/day on an idle RDS instance. Also set a CloudWatch alarm on OldestReplicationSlotLag — warning at 1 GB, critical at 10 GB.

The rule: If pg_replication_slots shows active logical slots, do NOT downsize without first confirming the CDC consumer can handle reduced throughput. Consider pausing CDC, downsizing, then resuming — but plan for a full re-sync if the slot is lost.

Cold cache: the first-hour tax

Every RDS instance modification restarts the database engine. When it comes back up, the buffer pool is empty. This is the cold cache problem.

What happens:

  • PostgreSQL's shared_buffers starts empty — every query hits disk
  • Query p99 latency spikes 3-10x for the first 30-60 minutes
  • Connection pool may hit timeouts as queries take longer
  • If you're monitoring SLOs, you'll see an error budget burn

Mitigation:

  1. Schedule the modification during the lowest-traffic window (e.g., 02:00-04:00 KST for our services)
  2. Use "Apply during maintenance window" — not "Apply immediately"
  3. Pre-warm with read replica promotion instead of in-place modification:
    • Create a read replica at the target (smaller) size
    • Let the replica's buffer pool warm up from replication traffic
    • Promote the replica to primary during maintenance window
    • The promoted instance already has a warm cache
  4. Budget the cold cache period into your SLO error budget — if you have 78% budget remaining, a 45-minute cache warm-up that degrades p99 by 3x might burn 2-3% of your monthly budget. That's acceptable. If you only have 50% remaining, it's not.

Blue-green consideration: RDS Blue/Green Deployments create a green (new) environment alongside the blue (current). This is safer for major changes but costs 2x during the switchover period. For a simple instance class change, in-place modification with read replica pre-warming is more cost-effective than blue-green.

What the industry uses: AWS published a detailed guide on automated cache pre-warming for Aurora PostgreSQL using the pg_prewarm extension, which loads specific tables and indexes into shared buffers before traffic arrives. For standard RDS PostgreSQL, the same extension is available — and there's even an open-source tool specifically designed to pre-warm RDS PostgreSQL instances after restarts. Aurora also offers Cluster Cache Management (CCM) which designates a replica to inherit the primary's buffer cache on failover — eliminating cold cache entirely for failover scenarios.
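If you go the pg_prewarm route, the warm-up is just a series of SQL calls against your hottest relations. A tiny generator as a sketch (the table list is yours to supply, and it assumes CREATE EXTENSION pg_prewarm has already been run on the instance):

```python
def prewarm_statements(relations):
    """Emit pg_prewarm calls that load tables/indexes into shared_buffers."""
    return [f"SELECT pg_prewarm('{rel}');" for rel in relations]

# Run the output via psql right after the resized instance comes back up.
for stmt in prewarm_statements(["orders", "orders_pkey", "users"]):
    print(stmt)
```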

Toolkit check: finops scan --checks rds_rightsizing — flags oversized RDS instances and unnecessary Multi-AZ in non-prod. (source)


5. ElastiCache scheduling for dev/staging

The problem: ElastiCache clusters running 24/7 in dev/staging. Developers use them 10 hours/day, 5 days/week. You're paying for 118 idle hours per week.

The SRE approach: Stop clusters outside business hours via EventBridge + Lambda.

# Lambda: stop dev ElastiCache at 8 PM, start at 8 AM (EventBridge-triggered)
import boto3

def handler(event, context):
    action = event.get('action')          # 'stop' or 'start'
    cluster_id = event.get('cluster_id')

    client = boto3.client('elasticache')
    if action == 'stop':
        # Classic clusters can't be stopped: delete with a final snapshot,
        # then recreate from it on 'start'. (Serverless caches scale down
        # on their own when idle, so they don't need this.)
        client.delete_cache_cluster(
            CacheClusterId=cluster_id,
            FinalSnapshotIdentifier=f'{cluster_id}-sched',  # must not already exist
        )
    elif action == 'start':
        client.create_cache_cluster(
            CacheClusterId=cluster_id,
            SnapshotName=f'{cluster_id}-sched',
            Engine='redis',
            CacheNodeType='cache.t4g.micro',  # match the original node type
            NumCacheNodes=1,
        )

What we saved: ~50% per cluster. 2 dev/staging clusters: ~$80/month.

The reliability check: Always test that the start/restore actually works before relying on scheduling. A cluster that won't restore Monday morning is worse than paying weekend costs.

Toolkit check: finops scan --checks elasticache_scheduling — detects dev/staging ElastiCache running 24/7. (source)


6. Reserved Instances: commit only after right-sizing

The problem: Teams buy RIs before optimizing. Then they downsize and the RI doesn't match. Money locked in for 1-3 years.

The SRE approach: RIs are the last step, not the first.

Week 1-2: Find waste (Part 1 — passive cleanup)
Week 3-4: Downsize safely (this article)
Week 5-6: Monitor — confirm new sizes are stable
Week 7-8: THEN buy RIs/Savings Plans for the right-sized resources

Decision matrix:

| Resource | Stable for 30+ days? | CPU predictable? | Action |
| --- | --- | --- | --- |
| Prod RDS (right-sized) | Yes | Yes, 35-45% | 1-year RI (All Upfront) |
| Prod EKS nodes | Yes | Yes, 40-60% | Compute Savings Plan |
| Dev anything | N/A | N/A | Never reserve — use Spot/scheduling |

What we projected: After right-sizing prod workloads, 1-year RIs would save an additional 30-40% on the new baseline — roughly $300-500/month for our scale.

Industry validation: CloudChipr's RDS right-sizing guide puts it bluntly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." The Flexera State of the Cloud Report consistently finds that 27% of cloud spend is wasted, with premature RI commitment being a top contributor. If you must reserve, use Compute Savings Plans over EC2 Instance Savings Plans — ProsperOps explains that Compute SPs offer instance family flexibility, so you can still right-size without breaking coverage.

Toolkit check: finops scan --checks reserved_instances — calculates RI/Savings Plans ROI for stable workloads. (source)


7. Orphan resource cleanup

The problem: EBS volumes from terminated instances, Elastic IPs not attached to anything, snapshots from 2 years ago, load balancers with zero targets.

The SRE approach: These are almost always safe to remove — but verify first.

Checklist before deletion:

  • [ ] EBS volume: not attached, no recent snapshots depending on it
  • [ ] EIP: not referenced in DNS or application config
  • [ ] Snapshot: original volume no longer exists, no AMI depends on it
  • [ ] ALB: zero registered targets for 7+ days, no DNS pointing to it

What we found: 12 orphan EBS volumes, 4 unused EIPs, 47 snapshots older than 90 days. ~$85/month in pure waste.

Toolkit check: finops scan --checks unused_resources — flags unattached EBS, unused EIPs, old snapshots, idle ALBs. (source)


The complete picture: what's safe and what's not

| Optimization | Risk | Prod safe? | Dev/Staging safe? | Savings |
| --- | --- | --- | --- | --- |
| Orphan cleanup | Very low | Yes | Yes | $85/mo |
| ElastiCache scheduling | Low | No | Yes | $80/mo |
| NAT Gateway → Instance | Low-Med | No | Yes | $88/mo |
| Spot for non-prod | Medium | No | Yes | $64/mo |
| EC2/EKS right-sizing | Medium | With PDB | Yes | $120/mo |
| RDS right-sizing | Medium | With replica | Yes | $350/mo |
| Reserved Instances | Lock-in risk | After sizing | Never | $300-500/mo |

Total from active downsizing: ~$787-1,087/month ($9.4-13K/year)
Combined with Part 1 (passive waste): $1,431-2,104/month ($17.2-25.2K/year)


The toolkit: automate the discovery

Everything in this article maps to a check in aws-finops-toolkit:

# Install
pip install aws-finops-toolkit

# Scan all checks across multiple accounts
finops scan --profiles dev,staging,prod

# Run only the downsizing-related checks
finops scan --checks ec2_rightsizing,nat_gateway,spot_candidates,rds_rightsizing,elasticache_scheduling,reserved_instances,unused_resources

# Generate HTML report for management
finops report --format html --output finops-downsizing.html

The tool finds the opportunities. The SRE decides which ones are safe to execute, and in what order.


What I learned

  1. FinOps without SRE is dangerous. Cost tools don't know your SLOs. They'll tell you to downsize a database that's already at its limit during peak hours.

  2. Always add safety before removing cost. Read replica before RDS downsize. PDB before node downsize. Drain handler before Spot. The safety net costs less than the savings — and it prevents the 2 AM page.

  3. Reserve last, not first. Right-size → stabilize → then commit. Buying RIs on oversized instances locks in waste.

  4. Prod and non-prod are different games. Non-prod is where you optimize aggressively (Spot, scheduling, single-AZ). Prod is where you optimize carefully (right-sizing with replicas, PDBs, maintenance windows).

  5. SLO data is your FinOps compass. If your error budget is healthy, you have room to experiment. If it's burned, don't touch anything — reliability comes first.


FinOps for SREs — Series Index


The checks in this article are implemented in aws-finops-toolkit — an open-source CLI for automated AWS cost scanning.
