<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: June Gu</title>
    <description>The latest articles on DEV Community by June Gu (@june-gu).</description>
    <link>https://dev.to/june-gu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811414%2Fb50a3c63-7961-4524-9da3-cc65a28941de.jpeg</url>
      <title>DEV Community: June Gu</title>
      <link>https://dev.to/june-gu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/june-gu"/>
    <language>en</language>
    <item>
      <title>The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:18:25 +0000</pubDate>
      <link>https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh</link>
      <guid>https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh</guid>
      <description>&lt;h1&gt;
  
  
  The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;Last month I saved $12K/year by cleaning up AWS waste across four accounts. But before I touched a single resource, I spent two days just &lt;em&gt;analyzing&lt;/em&gt;. Not because I'm cautious by nature — because I've seen what happens when people skip this step.&lt;/p&gt;

&lt;p&gt;A colleague at a previous company followed AWS Cost Explorer's recommendation to downsize an RDS instance. It averaged 12% CPU, so the downsize seemed obvious. What they didn't check: that instance handled a 4x traffic spike every Friday at 6 PM. The downsize turned Friday evening into a 90-minute outage, a rollback, and an incident report that took longer to write than the analysis would have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule: never optimize what you don't fully understand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is the pre-flight checklist I run before every cost optimization. It's conversational by design — I want you to internalize the &lt;em&gt;thinking&lt;/em&gt;, not just memorize a checklist.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The SRE Guarantee&lt;/strong&gt;: Before any optimization, we guarantee error budget protection, minimal downtime, and reliability over savings. See the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;series introduction&lt;/a&gt; for the full guarantee. Every check in this article enforces that guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate this&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; runs this entire analysis from your terminal.&lt;br&gt;
See &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggs02kpku9txirkais2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggs02kpku9txirkais2.png" alt="9 pre-flight checks flowing into GO/WAIT/STOP verdict" width="720" height="540"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Traffic: What's the actual load?
&lt;/h2&gt;

&lt;p&gt;The first question isn't "what does this cost?" — it's "what does this &lt;em&gt;do&lt;/em&gt;?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to pull:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current TPS / QPS (transactions or queries per second)&lt;/li&gt;
&lt;li&gt;Peak QPS over the last 30 days&lt;/li&gt;
&lt;li&gt;When the peak happens (time of day, day of week)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ALB request count — last 7 days, 1-hour intervals&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ApplicationELB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; RequestCount &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer,Value&lt;span class="o"&gt;=&lt;/span&gt;app/pn-sh-alb/abc123 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-2

&lt;span class="c"&gt;# RDS connections — peak over 14 days&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/RDS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; DatabaseConnections &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DBInstanceIdentifier,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-14d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Maximum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What I'm looking for:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Avg QPS: 320/s      ← This is what CPU metrics reflect
Peak QPS: 1,247/s   ← This is what the instance must survive
Ratio: 3.9x         ← If &amp;gt; 3x, be very careful downsizing
Peak window: 11-13h, 18-20h KST  ← Never change anything during these hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation with yourself:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This instance averages 12% CPU, but peaks at 47% during lunch hour. If I downsize from xlarge to large, the peak would hit 94% CPU on the smaller instance. That's not optimization — that's a time bomb."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference between an average and a peak can be the difference between a smooth optimization and a 2 AM page.&lt;/p&gt;
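
&lt;p&gt;&lt;strong&gt;Sketch&lt;/strong&gt;: the avg/peak/ratio numbers can be derived straight from the CloudWatch dump with a short jq pass. This is a hypothetical helper — the inlined sample data stands in for your real &lt;code&gt;get-metric-statistics&lt;/code&gt; output:&lt;/p&gt;

```shell
# Hypothetical helper: derive avg QPS, peak QPS and the peak:avg ratio from a
# get-metric-statistics dump (Sum per 1-hour period). The sample file below
# stands in for your real export.
cat > /tmp/alb-requests.json <<'EOF'
{"Datapoints":[{"Sum":1152000},{"Sum":4489200},{"Sum":1080000}]}
EOF
jq -r '
  [.Datapoints[].Sum] as $s
  | ($s | add / length / 3600) as $avg    # mean requests per second
  | ($s | max / 3600) as $peak            # worst hour, per second
  | "Avg QPS: \($avg|floor)  Peak QPS: \($peak|floor)  Ratio: \(($peak/$avg)*10|round/10)x"
' /tmp/alb-requests.json
```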

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target i-0abc123 --profile prod&lt;/code&gt; pulls this automatically from CloudWatch.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Quality of Service: Where are we against our SLOs?
&lt;/h2&gt;

&lt;p&gt;Before touching anything, I need to know: &lt;strong&gt;how much room do we have to experiment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current p99 latency vs target (e.g., p99 &amp;lt; 200ms)&lt;/li&gt;
&lt;li&gt;Availability % vs target (e.g., 99.9%)&lt;/li&gt;
&lt;li&gt;Error rate trend (stable, improving, degrading?)&lt;/li&gt;
&lt;li&gt;Error budget remaining this month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it (SigNoz):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# SigNoz ClickHouse query — p99 latency last 7 days
SELECT
  toStartOfHour(timestamp) as hour,
  quantile(0.99)(duration_nano) / 1e6 as p99_ms,
  count() as request_count,
  countIf(status_code &amp;gt;= 500) / count() * 100 as error_rate_pct
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'gateway-server'
  AND timestamp &amp;gt; now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The decision matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Budget Remaining&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 70%&lt;/td&gt;
&lt;td&gt;Green — safe to optimize, schedule at off-peak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40-70%&lt;/td&gt;
&lt;td&gt;Yellow — optimize only low-risk items (orphan cleanup, dev/staging)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 40%&lt;/td&gt;
&lt;td&gt;Red — &lt;strong&gt;do not touch anything&lt;/strong&gt;. Focus on reliability first.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget burned (SLO breached)&lt;/td&gt;
&lt;td&gt;Stop. Any optimization must IMPROVE reliability, not risk it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
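
&lt;p&gt;The matrix is mechanical enough to script. A minimal sketch — thresholds are this article's convention, not an SRE standard, so tune them to your SLO policy:&lt;/p&gt;

```shell
# Hypothetical helper encoding the decision matrix. Input: integer percent of
# error budget remaining. Thresholds follow the article's convention.
budget_verdict() {
  local pct=$1
  if   [ "$pct" -gt 70 ]; then echo "GREEN: safe to optimize, schedule at off-peak"
  elif [ "$pct" -ge 40 ]; then echo "YELLOW: low-risk items only (orphan cleanup, dev/staging)"
  elif [ "$pct" -gt 0  ]; then echo "RED: do not touch anything, reliability first"
  else                         echo "STOP: SLO breached, optimizations must improve reliability"
  fi
}
budget_verdict 78
```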

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Our gateway-server has 78% error budget remaining. p99 is 142ms against a 200ms target. That's a comfortable margin — we can proceed with dev/staging optimizations. But I'll hold off on prod RDS right-sizing until next month when we have a full 30-day baseline after the last deployment."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where FinOps meets SRE. A FinOps tool tells you to downsize. An SRE checks if the system can absorb the risk.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target gateway-server --apm signoz --apm-endpoint http://signoz.internal:3301&lt;/code&gt; queries SigNoz for SLO status.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Cache Strategy: What's already absorbing load?
&lt;/h2&gt;

&lt;p&gt;If a service is low-CPU because Redis handles 85% of requests, downsizing the backend might be fine. But if the cache fails, that backend needs to handle 100% — at the original capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit rate (ElastiCache / Redis)&lt;/li&gt;
&lt;li&gt;Cache eviction rate&lt;/li&gt;
&lt;li&gt;Cache TTL settings&lt;/li&gt;
&lt;li&gt;What happens on cache miss (DB query? External API call?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ElastiCache hit rate — last 7 days&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ElastiCache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CacheHitRate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CacheClusterId,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-redis-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# Eviction rate — if rising, cache is under pressure&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ElastiCache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; Evictions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CacheClusterId,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-redis-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Redis hit rate is 87%. That means only 13% of requests actually reach the database. Current DB CPU is 12% — but without cache, it would be ~92%. If I downsize this DB, I'm betting that Redis never goes down. Is that a bet I want to make?"&lt;/p&gt;

&lt;p&gt;Answer: In prod, no. In dev/staging where I can tolerate cache failures, yes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Factor cache dependency into every right-sizing decision. CPU utilization without cache context is misleading.&lt;/p&gt;
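
&lt;p&gt;The arithmetic behind that conversation is worth making explicit: the DB only sees the cache-miss fraction of traffic, so its no-cache load is roughly current CPU divided by the miss rate. A back-of-envelope sketch with the numbers above:&lt;/p&gt;

```shell
# Back-of-envelope: if the cache disappeared, what load would the DB see?
# effective_cpu = current_cpu / miss_fraction. Numbers from the example above.
hit_rate=87   # % of requests served by Redis
db_cpu=12     # % DB CPU with the cache in place
awk -v h="$hit_rate" -v c="$db_cpu" 'BEGIN {
  miss = (100 - h) / 100
  printf "No-cache DB CPU estimate: %.0f%%\n", c / miss
}'
```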




&lt;h2&gt;
  
  
  4. Incident History: What has broken before?
&lt;/h2&gt;

&lt;p&gt;The best predictor of future incidents is past incidents. Before touching a resource, I check: has anything involving this service broken in the last 90 days?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident count involving the target service (last 90 days)&lt;/li&gt;
&lt;li&gt;Root causes — was it capacity-related?&lt;/li&gt;
&lt;li&gt;Related services that were impacted&lt;/li&gt;
&lt;li&gt;Time to recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where to look:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SigNoz alerts history&lt;/li&gt;
&lt;li&gt;PagerDuty/Slack incident channels&lt;/li&gt;
&lt;li&gt;Post-mortem docs (our &lt;code&gt;pn-infra-docs/incidents/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;CloudWatch alarm history
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CloudWatch alarm history for the target&lt;/span&gt;
aws cloudwatch describe-alarm-history &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alarm-name&lt;/span&gt; &lt;span class="s2"&gt;"pn-sh-rds-prod-cpu-high"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--history-item-type&lt;/span&gt; StateUpdate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-90d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Two incidents in the last 90 days. One was a network blip (unrelated). The other was a connection pool exhaustion on this exact RDS instance during a traffic spike — we had to vertically scale up. That was 6 weeks ago."&lt;/p&gt;

&lt;p&gt;"If I downsize this instance now, I'm reducing the headroom that prevented that from happening again. Let me check the connection metrics more carefully before proceeding."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Red flags that block optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity-related incident in the last 60 days → wait&lt;/li&gt;
&lt;li&gt;Service was recently scaled UP to fix an issue → definitely wait&lt;/li&gt;
&lt;li&gt;Ongoing performance investigation → do not touch&lt;/li&gt;
&lt;/ul&gt;
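
&lt;p&gt;To turn the alarm history into a number, a jq pass over the &lt;code&gt;describe-alarm-history&lt;/code&gt; output works; the sample data here stands in for the real dump. More than a couple of transitions into ALARM in 90 days is itself a red flag:&lt;/p&gt;

```shell
# Count transitions into the ALARM state from a describe-alarm-history dump.
# The inlined sample stands in for the real AWS CLI output.
cat > /tmp/alarm-history.json <<'EOF'
{"AlarmHistoryItems":[
  {"HistorySummary":"Alarm updated from OK to ALARM"},
  {"HistorySummary":"Alarm updated from ALARM to OK"},
  {"HistorySummary":"Alarm updated from OK to ALARM"}
]}
EOF
jq '[.AlarmHistoryItems[] | select(.HistorySummary | test("to ALARM"))] | length' \
  /tmp/alarm-history.json
```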




&lt;h2&gt;
  
  
  5. Access Setup: Credentials for CLI Analysis
&lt;/h2&gt;

&lt;p&gt;This is practical, not conceptual. Before you can analyze anything, you need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS credentials:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify access to all target accounts&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev    &lt;span class="c"&gt;# shared&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-prod   &lt;span class="c"&gt;# dodopoint&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; now          &lt;span class="c"&gt;# nowwaiting&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; placen       &lt;span class="c"&gt;# nexus hub&lt;/span&gt;

&lt;span class="c"&gt;# Required IAM permissions (read-only):&lt;/span&gt;
&lt;span class="c"&gt;# - cloudwatch:GetMetricStatistics&lt;/span&gt;
&lt;span class="c"&gt;# - ec2:DescribeInstances, DescribeNatGateways, DescribeVolumes&lt;/span&gt;
&lt;span class="c"&gt;# - rds:DescribeDBInstances&lt;/span&gt;
&lt;span class="c"&gt;# - elasticache:DescribeCacheClusters&lt;/span&gt;
&lt;span class="c"&gt;# - s3:ListBuckets, GetBucketPolicy&lt;/span&gt;
&lt;span class="c"&gt;# - ce:GetCostAndUsage (Cost Explorer)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;APM access (SigNoz):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# SigNoz API — verify connectivity&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://signoz.internal:3301/api/v1/services | jq &lt;span class="s1"&gt;'.data | length'&lt;/span&gt;

&lt;span class="c"&gt;# If using SigNoz Cloud:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SIGNOZ_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"SIGNOZ-API-KEY: &lt;/span&gt;&lt;span class="nv"&gt;$SIGNOZ_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://your-instance.signoz.io/api/v1/services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The toolkit config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# finops.yaml — account + APM configuration&lt;/span&gt;
&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dodo-dev&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Shared (Dev)&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dodo-prod&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DodoPoint (Prod)&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;placen&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Nexus Hub&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-2&lt;/span&gt;

&lt;span class="na"&gt;apm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;signoz&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://signoz.internal:3301&lt;/span&gt;
  &lt;span class="c1"&gt;# or api_key: ${SIGNOZ_API_KEY}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Read-only access only. The analysis phase should never modify anything. If your credentials have write access, consider creating a dedicated &lt;code&gt;FinOpsReadOnly&lt;/code&gt; role.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --profile dodo-dev&lt;/code&gt; validates credentials and permissions before analysis.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Target Identification: Instance + APM Mapping
&lt;/h2&gt;

&lt;p&gt;Now we need to map the &lt;em&gt;infrastructure resource&lt;/em&gt; (EC2 instance, RDS instance) to the &lt;em&gt;service it runs&lt;/em&gt; and the &lt;em&gt;APM dashboard that monitors it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; AWS sees "i-0abc123" and "db.r6g.xlarge". Your team sees "gateway-server" and "the ordering database." FinOps decisions need both views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to build the mapping:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# EC2: get instance → service mapping from tags&lt;/span&gt;
aws ec2 describe-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=instance-state-name,Values=running"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Reservations[].Instances[].[InstanceId, InstanceType, Tags[?Key==`Name`].Value | [0], Tags[?Key==`Service`].Value | [0]]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# RDS: instance → service mapping&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[].[DBInstanceIdentifier, DBInstanceClass, Engine, EngineVersion]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result you want:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS Resource&lt;/th&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;SigNoz Dashboard&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i-0abc123&lt;/td&gt;
&lt;td&gt;t3.large&lt;/td&gt;
&lt;td&gt;EKS node (gateway)&lt;/td&gt;
&lt;td&gt;gateway-server&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pn-sh-rds-prod&lt;/td&gt;
&lt;td&gt;db.r6g.xlarge&lt;/td&gt;
&lt;td&gt;ConnectOrder DB&lt;/td&gt;
&lt;td&gt;connectorder-db&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pn-sh-redis-dev&lt;/td&gt;
&lt;td&gt;cache.t3.medium&lt;/td&gt;
&lt;td&gt;Session cache&lt;/td&gt;
&lt;td&gt;redis-metrics&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I see this RDS instance costs $380/month. But what service uses it? Ah — it's the ConnectOrder primary database. That means gateway-server, auth-server, and user-server all depend on it. That's a high blast radius. Let me check SigNoz for all three services, not just the database metrics."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Never optimize a resource without knowing what depends on it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target pn-sh-rds-prod&lt;/code&gt; discovers dependent services and maps to APM.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Traffic Pattern &amp;amp; Service Specification
&lt;/h2&gt;

&lt;p&gt;Now I zoom out. Not just "what's the current load" but "what does the traffic pattern look like over a week, a month?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to analyze:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekday vs weekend traffic ratio&lt;/li&gt;
&lt;li&gt;Daily peak patterns (lunch hour? evening?)&lt;/li&gt;
&lt;li&gt;Monthly patterns (start of month, end of month, paydays?)&lt;/li&gt;
&lt;li&gt;Seasonal patterns (holidays, events)&lt;/li&gt;
&lt;li&gt;Service type: stateless (can use Spot) vs stateful (cannot)&lt;/li&gt;
&lt;li&gt;Dependency chain: who calls this? who does this call?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I visualize it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hourly request count — last 30 days — export for pattern analysis&lt;/span&gt;
aws cloudwatch get-metric-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-data-queries&lt;/span&gt; &lt;span class="s1"&gt;'[{
    "Id": "requests",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/pn-sh-alb/abc123"}]
      },
      "Period": 3600,
      "Stat": "Sum"
    }
  }]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-30d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; traffic-30d.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pattern analysis result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service: gateway-server
Type: stateless (REST API gateway)
Dependencies: auth-server, user-server, SSE-server (downstream)

Traffic Pattern:
  Weekday avg:  420 QPS
  Weekend avg:  180 QPS (43% of weekday)
  Peak hours:   11:00-13:00, 18:00-20:00 KST
  Peak QPS:     1,247
  Low point:    02:00-06:00 KST (~30 QPS)

  Mon-Fri pattern: stable
  Saturday:     -40% from weekday
  Sunday:       -55% from weekday
  Month-end:    no significant spike

Recommendation:
  - Stateless → Spot candidate ✅
  - Predictable pattern → scheduling candidate ✅ (scale down 22:00-07:00)
  - High peak:avg ratio (3.9x) → careful with right-sizing ⚠️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Weekend traffic drops to 43% of weekday. That means weekend EKS nodes are 57% wasted. Instead of right-sizing (which affects all days), I could use HPA with lower weekend min replicas. Or scheduled scaling. That's safer than shrinking the instance type — I keep peak capacity on weekdays."&lt;/p&gt;
&lt;/blockquote&gt;
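
&lt;p&gt;For the scheduled-scaling route, one sketch is a pair of Auto Scaling scheduled actions. The group name here is hypothetical, and this only applies if the nodes sit in an ASG you manage directly — on EKS managed node groups you would adjust the node group's scaling config instead:&lt;/p&gt;

```shell
# Sketch: shrink a (hypothetical) gateway node ASG overnight and restore it
# before the morning ramp, matching the 22:00-07:00 KST low window above.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name pn-sh-gateway-nodes \
  --scheduled-action-name nightly-scale-down \
  --recurrence "0 22 * * *" --time-zone "Asia/Seoul" \
  --min-size 1 --desired-capacity 1 \
  --profile dodo-dev

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name pn-sh-gateway-nodes \
  --scheduled-action-name morning-scale-up \
  --recurrence "0 7 * * *" --time-zone "Asia/Seoul" \
  --min-size 3 --desired-capacity 3 \
  --profile dodo-dev
```

&lt;p&gt;This keeps weekday peak capacity untouched while cutting the quiet hours — the same reasoning applies per-service via HPA &lt;code&gt;minReplicas&lt;/code&gt; if the waste is in pods rather than nodes.&lt;/p&gt;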

&lt;h3&gt;
  
  
  Holiday traffic spikes
&lt;/h3&gt;

&lt;p&gt;If your services handle seasonal traffic — holidays, promotions, events — this changes everything about when you can optimize. &lt;a href="https://www.squadcast.com/blog/what-can-sres-do-to-make-holiday-seasons-peak-traffic-less-chaotic" rel="noopener noreferrer"&gt;Squadcast's SRE guide&lt;/a&gt; recommends analyzing postmortems from past holiday incidents to build a pre-season checklist — the same principle applies to FinOps freezes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For F&amp;amp;B/retail platforms&lt;/strong&gt; (like ours), Korean holidays drive 2-5x normal traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chuseok (Korean Thanksgiving): September, 3-5 day spike&lt;/li&gt;
&lt;li&gt;Lunar New Year: January/February, 3-5 day spike&lt;/li&gt;
&lt;li&gt;Christmas/year-end promotions: December&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze all FinOps changes 2 weeks before any holiday period&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Verify current capacity handles last year's holiday peak (check historical CloudWatch data)&lt;/li&gt;
&lt;li&gt;Schedule all optimizations for the quiet period after the holiday&lt;/li&gt;
&lt;li&gt;Document the holiday calendar in &lt;code&gt;finops.yaml&lt;/code&gt; so the toolkit warns you automatically
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# finops.yaml — holiday calendar&lt;/span&gt;
&lt;span class="na"&gt;preflight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;holidays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Chuseok&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-14"&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-17"&lt;/span&gt;
      &lt;span class="na"&gt;freeze_start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-01"&lt;/span&gt;  &lt;span class="c1"&gt;# 2 weeks before&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lunar New Year&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-28"&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-30"&lt;/span&gt;
      &lt;span class="na"&gt;freeze_start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-14"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; checks the holiday calendar and returns WAIT if within a freeze window.&lt;/p&gt;
&lt;/blockquote&gt;
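&lt;p&gt;To make that WAIT behavior concrete, here is a minimal sketch of what a freeze-window check could look like. It is illustrative, not the toolkit's actual code: the function name is hypothetical, and the entries mirror the &lt;code&gt;finops.yaml&lt;/code&gt; example above.&lt;/p&gt;

```python
from datetime import date

# Illustrative sketch only; not the toolkit's actual implementation.
# Entries mirror the finops.yaml holiday calendar above.
HOLIDAYS = [
    {"name": "Chuseok", "freeze_start": date(2026, 9, 1), "end": date(2026, 9, 17)},
    {"name": "Lunar New Year", "freeze_start": date(2027, 1, 14), "end": date(2027, 1, 30)},
]

def freeze_status(today, holidays=HOLIDAYS):
    """Return 'WAIT (holiday)' when inside a freeze window, else 'GO'."""
    for h in holidays:
        # In the window when: freeze_start on or before today, end on or after today
        if today >= h["freeze_start"] and h["end"] >= today:
            return f"WAIT ({h['name']})"
    return "GO"
```

&lt;p&gt;For example, &lt;code&gt;freeze_status(date(2026, 9, 10))&lt;/code&gt; lands inside the Chuseok freeze and returns &lt;code&gt;WAIT (Chuseok)&lt;/code&gt;.&lt;/p&gt;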

&lt;h3&gt;
  
  
  Batch systems
&lt;/h3&gt;

&lt;p&gt;Batch jobs are invisible to daily averages but define your actual capacity floor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common batch patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL pipelines (nightly or hourly)&lt;/li&gt;
&lt;li&gt;Billing runs (start/end of month)&lt;/li&gt;
&lt;li&gt;Data exports and report generation&lt;/li&gt;
&lt;li&gt;Scheduled sync jobs between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map every batch schedule: cron jobs, EventBridge rules, Airflow DAGs&lt;/li&gt;
&lt;li&gt;Check: does the batch peak overlap with the downsized capacity?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule: size for the batch peak, not the daily average.&lt;/strong&gt; If a nightly ETL uses 80% CPU for 2 hours, the instance must still handle that 80% — even if the 14-day average is 12%.&lt;/li&gt;
&lt;li&gt;If a batch runs only weekly or monthly, a 14-day CPU average is misleading — use peak CPU during the batch window instead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This RDS instance averages 12% CPU. But every Sunday at 2 AM, a billing reconciliation job runs for 3 hours at 75% CPU. If I downsize from xlarge to large, that Sunday job would hit 150% — it would fail or timeout."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; detects batch patterns by analyzing CloudWatch metric variance and flags resources with periodic spikes.&lt;/p&gt;
&lt;/blockquote&gt;
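&lt;p&gt;One plausible version of that variance analysis, as a sketch. The toolkit's real heuristic may differ; the 3.0x peak-to-average ratio and the coefficient-of-variation cutoff used here are assumptions.&lt;/p&gt;

```python
from statistics import mean, pstdev

# One plausible variance-based spike heuristic; the toolkit's real
# thresholds and logic may differ. Treat the numbers as assumptions.
def has_periodic_spike(cpu_samples, ratio_threshold=3.0, cv_threshold=1.0):
    """Flag a series whose peak dwarfs its average (e.g. a weekly batch job)."""
    avg = mean(cpu_samples)
    if avg == 0:
        return False
    peak_ratio = max(cpu_samples) / avg
    cv = pstdev(cpu_samples) / avg  # coefficient of variation (stdev relative to mean)
    return peak_ratio > ratio_threshold or cv > cv_threshold
```

&lt;p&gt;A series that idles at 12% CPU but hits 75% during a weekly batch window gets flagged (peak is roughly 5-6x the mean), while steady traffic does not — which is exactly the db.r6g.xlarge case in the conversation above.&lt;/p&gt;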




&lt;h2&gt;
  
  
  8. Priority &amp;amp; Freeze Check: Is it safe to act now?
&lt;/h2&gt;

&lt;p&gt;The final gate. Even if all metrics say "go," organizational context can say "stop."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;th&gt;Block if...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment freeze&lt;/td&gt;
&lt;td&gt;Team calendar, Slack announcements&lt;/td&gt;
&lt;td&gt;Any freeze active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Release pending&lt;/td&gt;
&lt;td&gt;Sprint board, release schedule&lt;/td&gt;
&lt;td&gt;Major release within 2 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service priority level&lt;/td&gt;
&lt;td&gt;Service catalog&lt;/td&gt;
&lt;td&gt;P0 service → prod changes need CAB approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active incidents&lt;/td&gt;
&lt;td&gt;PagerDuty, incident channels&lt;/td&gt;
&lt;td&gt;Any open incident on target service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error severity trend&lt;/td&gt;
&lt;td&gt;SigNoz alerts&lt;/td&gt;
&lt;td&gt;Error rate trending up (even if within SLO)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change window&lt;/td&gt;
&lt;td&gt;Team agreement&lt;/td&gt;
&lt;td&gt;Outside agreed change window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependent team availability&lt;/td&gt;
&lt;td&gt;Team calendar&lt;/td&gt;
&lt;td&gt;Owning team on vacation or unavailable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
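&lt;p&gt;The table reduces to a simple gate: if any blocking condition is active, the answer is WAIT. A hypothetical sketch (the check names and return format here are illustrative, not the toolkit's):&lt;/p&gt;

```python
# Hypothetical sketch of the table above as a gate: any active
# "Block if..." condition turns the recommendation into WAIT.
def org_gate(blockers):
    """blockers maps a check name to True when its blocking condition holds."""
    active = [name for name, blocked in blockers.items() if blocked]
    if active:
        return "WAIT: " + ", ".join(sorted(active))
    return "GO"
```

&lt;p&gt;The point of encoding it: the gate is all-or-nothing. One active blocker is enough to stop, no matter how good the metrics look.&lt;/p&gt;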

&lt;p&gt;&lt;strong&gt;Priority levels and what you can optimize:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service Priority&lt;/th&gt;
&lt;th&gt;Prod optimization?&lt;/th&gt;
&lt;th&gt;Dev/Staging?&lt;/th&gt;
&lt;th&gt;Requires approval?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P0 (critical path)&lt;/td&gt;
&lt;td&gt;Maintenance window only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes — team lead + SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P1 (important)&lt;/td&gt;
&lt;td&gt;Off-peak hours&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes — SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 (standard)&lt;/td&gt;
&lt;td&gt;Business hours OK&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P3 (non-critical)&lt;/td&gt;
&lt;td&gt;Anytime&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
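&lt;p&gt;This matrix is easy to encode so tooling can enforce it. A hypothetical sketch — the labels come from the table; the structure and function name are my own:&lt;/p&gt;

```python
# Hypothetical encoding of the priority matrix above. Labels are taken
# from the table; the dict structure itself is an assumption.
POLICY = {
    "P0": {"prod_window": "maintenance window only", "approval": "team lead + SRE"},
    "P1": {"prod_window": "off-peak hours", "approval": "SRE"},
    "P2": {"prod_window": "business hours OK", "approval": None},
    "P3": {"prod_window": "anytime", "approval": None},
}

def needs_approval(priority):
    """True when prod optimization on this tier requires sign-off."""
    return POLICY[priority]["approval"] is not None
```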

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"All metrics look good for downsizing the staging RDS. But wait — the ConnectOrder team is launching a new feature next Tuesday. They're running load tests on staging this week. If I downsize now, their load test results will be invalid."&lt;/p&gt;

&lt;p&gt;"Let me wait until after their launch. I'll schedule the optimization for the week after."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: FinOps is not urgent. Reliability is urgent. If there's any doubt about timing, wait. The waste will still be there next week.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Existing RI/SP Coverage: What's already committed?
&lt;/h2&gt;

&lt;p&gt;This check prevents one of the most expensive FinOps mistakes: downsizing an instance that's covered by a Reserved Instance and wasting the remaining reservation value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Active Reserved Instances: do any match the target instance type?&lt;/li&gt;
&lt;li&gt;Active Savings Plans: what type? (Compute vs EC2 Instance)&lt;/li&gt;
&lt;li&gt;If downsizing, will the new size still be covered?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List active Reserved Instances&lt;/span&gt;
aws ec2 describe-reserved-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=state,Values=active"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'ReservedInstances[].[InstanceType,InstanceCount,End,Scope]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# List active Savings Plans&lt;/span&gt;
aws savingsplans describe-savings-plans &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--states&lt;/span&gt; active &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'SavingsPlans[].[SavingsPlanType,Commitment,End]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The decision matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by RI, same instance type&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Do NOT downsize — calculate RI remaining value vs savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by Compute Savings Plan&lt;/td&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;td&gt;Safe to change instance family (Compute SP is flexible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by EC2 Instance Savings Plan&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Do NOT change instance family — SP is family-locked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No RI/SP coverage&lt;/td&gt;
&lt;td&gt;NONE&lt;/td&gt;
&lt;td&gt;Safe to proceed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I want to downsize this db.r6g.xlarge to db.r6g.large. Let me check... we have a 1-year RI for db.r6g.xlarge with 8 months remaining. The RI costs $3,060/year. Downsizing would waste $2,040 in remaining reservation value. The downsize would save $190/month = $1,520 over 8 months. Net loss: $520. &lt;strong&gt;Don't downsize until the RI expires.&lt;/strong&gt;"&lt;/p&gt;
&lt;/blockquote&gt;
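&lt;p&gt;The break-even math from that conversation generalizes to a one-liner worth keeping around. The function name is mine; the arithmetic is exactly what the conversation walks through:&lt;/p&gt;

```python
# Reproduces the break-even math from the conversation above.
def ri_downsize_net(ri_annual_cost, months_remaining, monthly_savings):
    """Net dollars from downsizing now; negative means ride out the RI."""
    wasted_reservation = ri_annual_cost * months_remaining / 12
    realized_savings = monthly_savings * months_remaining
    return realized_savings - wasted_reservation
```

&lt;p&gt;&lt;code&gt;ri_downsize_net(3060, 8, 190)&lt;/code&gt; returns &lt;code&gt;-520.0&lt;/code&gt;: the $520 net loss from the example, so don't downsize until the RI expires.&lt;/p&gt;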

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Always check RI/SP coverage before any right-sizing. The savings from downsizing can be completely negated by wasted reservations.&lt;/p&gt;

&lt;p&gt;This mistake is widespread. &lt;a href="https://cloudchipr.com/blog/aws-rds-right-sizing" rel="noopener noreferrer"&gt;CloudChipr's RDS guide&lt;/a&gt; warns explicitly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." &lt;a href="https://www.prosperops.com/blog/aws-reserved-instances/" rel="noopener noreferrer"&gt;ProsperOps&lt;/a&gt; notes that if usage falls below commitment, the unused portion goes to waste — making monitoring essential. And &lt;a href="https://www.linkedin.com/pulse/how-fix-aws-reserved-instance-mistakes-craig-deveson" rel="noopener noreferrer"&gt;Craig Deveson's LinkedIn article&lt;/a&gt; documents real strategies for recovering from RI mistakes, including instance size flexibility within the same family and the RI Marketplace for selling unused reservations. Finally, &lt;a href="https://blog.easecloud.io/startup-tech/aws-cost-optimization-mistakes/" rel="noopener noreferrer"&gt;EaseCloud's roundup of AWS cost-optimization mistakes&lt;/a&gt; cites the Flexera State of the Cloud Report's estimate that 27% of all cloud spend is wasted — and RI mismanagement is one of the top contributors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target &amp;lt;instance&amp;gt;&lt;/code&gt; checks active RIs and Savings Plans automatically and returns WAIT if downsizing would waste a reservation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Putting it all together: the &lt;code&gt;finops preflight&lt;/code&gt; report
&lt;/h2&gt;

&lt;p&gt;Here's what the complete analysis looks like when you run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ finops preflight --target pn-sh-rds-prod --profile dodo-dev --apm signoz

╭──────────────────────────────────────────────────────────────────╮
│                    PRE-FLIGHT ANALYSIS                            │
│  Target: pn-sh-rds-prod (db.r6g.xlarge)                        │
│  Account: Shared (468411441302)                                  │
│  Analyzed: 2026-03-14 09:32 KST                                │
╰──────────────────────────────────────────────────────────────────╯

📊 TRAFFIC
  Current QPS:     312 req/s
  Peak QPS (30d):  1,247 req/s
  Peak:Avg ratio:  3.9x
  Peak hours:      11:00-13:00, 18:00-20:00 KST
  Weekend drop:    -57%

📋 QUALITY OF SERVICE (SigNoz)
  p99 latency:     142ms / 200ms target     ✅ 29% headroom
  Availability:    99.94% / 99.9% target     ✅
  Error rate:      0.04%                     ✅
  Error budget:    78% remaining             ✅ GREEN

🗄️ CACHE DEPENDENCY
  ElastiCache:     pn-sh-redis-dev
  Hit rate:        87.3%                     ⚠️ 13% hits DB directly
  Eviction rate:   0.02%                     ✅ Stable
  Cache-miss load: ~40 QPS reaches DB

🔥 INCIDENT HISTORY (90 days)
  Total incidents: 2
  Capacity-related: 1 (connection pool, 6 weeks ago)    ⚠️
  Status:          Resolved, connection pool increased

📊 RESOURCE METRICS (14-day)
  CPU avg:         12.3%
  CPU peak:        47.2%
  Memory avg:      34.7%
  Connections avg: 23 / 1000 max
  IOPS avg:        145 / 3000 provisioned

🔗 DEPENDENCIES
  Services:        gateway-server, auth-server, user-server
  Blast radius:    HIGH (3 services depend on this)

💰 RI/SP COVERAGE
  Reserved Instances: 1 active (db.r6g.xlarge, 8 months remaining)
  Savings Plans:      1 Compute SP ($500/mo commitment)          ✅ Flexible
  RI match:           ⚠️ Target matches active RI
  SP family risk:     None (Compute SP)

🚦 PRIORITY CHECK
  Service level:   P0 (critical path)
  Deploy freeze:   None active
  Pending release: ConnectOrder v2.3 — March 18      ⚠️
  Team available:  Yes

╭──────────────────────────────────────────────────────────────────╮
│ RECOMMENDATION:  ⚠️  WAIT — PROCEED AFTER MARCH 18              │
│                                                                  │
│ Analysis supports right-sizing (CPU avg 12%, 78% error budget), │
│ but:                                                             │
│  1. Pending release March 18 — wait for post-release stability  │
│  2. Connection pool incident 6 weeks ago — verify pool config   │
│  3. P0 service — requires team lead + SRE approval              │
│  4. High blast radius — 3 dependent services                    │
│                                                                  │
│ After March 18 (if SLOs hold):                                  │
│  → Downsize db.r6g.xlarge → db.r6g.large                       │
│  → Add read replica as safety net before resize                 │
│  → Schedule: 02:00-04:00 KST (lowest traffic)                  │
│  → Estimated savings: $190/month ($2,280/year)                  │
│  → Rollback plan: modify-db-instance back to xlarge (&amp;lt;10 min)   │
╰──────────────────────────────────────────────────────────────────╯
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's the pre-flight.&lt;/strong&gt; One command, nine checks, a clear recommendation. No guessing, no "let's just try it and see."&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;This pre-flight analysis is the foundation of the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;FinOps for SREs series&lt;/a&gt;. After pre-flight clears:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: Finding Passive Waste&lt;/a&gt; — clean up what nobody uses&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt; — actively optimize with reliability guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The analysis is the foundation. Without it, you're guessing. And in production, guessing has a cost — measured in pages, not dollars.&lt;/p&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The pre-flight analysis is implemented in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; as the &lt;code&gt;finops preflight&lt;/code&gt; command.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Downsizing Without Downtime: An SRE's Guide to Safe Cost Optimization</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:17:49 +0000</pubDate>
      <link>https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck</link>
      <guid>https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck</guid>
      <description>&lt;h1&gt;
  
  
  Downsizing Without Downtime: An SRE's Guide to Safe Cost Optimization
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;kubernetes&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1&lt;/a&gt;, I covered finding $12K/year in passive waste — abandoned VPCs, orphan log groups, stale WorkSpaces. Things nobody was using. That was the easy part.&lt;/p&gt;

&lt;p&gt;This article is about the hard part: &lt;strong&gt;actively downsizing infrastructure that's still running in production&lt;/strong&gt; — without breaking availability. This is where FinOps meets SRE, and where most cost-cutting initiatives fail.&lt;/p&gt;

&lt;p&gt;I've seen teams blindly follow AWS Cost Explorer recommendations, downsize an RDS instance during peak hours, and trigger a 45-minute outage. The problem isn't the recommendation — it's executing it without an SRE mindset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The SRE Guarantee&lt;/strong&gt;: Every optimization in this article passes through three gates: error budget protection, minimal planned downtime, and reliability over savings. See the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;series introduction&lt;/a&gt; for the full guarantee. If any gate fails, we don't proceed — no matter how large the savings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the framework I use: &lt;strong&gt;every cost optimization must pass through the reliability filter first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bi3ztfokdz5t3ilv7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bi3ztfokdz5t3ilv7r.png" alt="SLO Gate decision tree" width="720" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLO gate: when is it safe to cut?
&lt;/h2&gt;

&lt;p&gt;Before touching any resource, I check three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Error budget status&lt;/strong&gt; — If we've burned &amp;gt;50% of this month's error budget, no changes. Period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current resource utilization&lt;/strong&gt; — CloudWatch metrics over 14+ days, not a snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius&lt;/strong&gt; — If this fails, what's the user impact? One service? All services?
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error budget &amp;gt; 50% remaining?
  └─ Yes → Check utilization
       └─ Avg CPU &amp;lt; 20% for 14 days?
            └─ Yes → Check blast radius
                 └─ Single service, non-critical path?
                      └─ Yes → Proceed with rollback plan
                      └─ No → Schedule for maintenance window
            └─ No → Skip, re-evaluate next month
  └─ No → Do nothing. Stability first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the difference between FinOps and SRE-driven FinOps. Cost tools tell you &lt;em&gt;what&lt;/em&gt; to cut. SRE tells you &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;
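&lt;p&gt;The decision tree above translates directly into code. This is an illustrative sketch, not the toolkit's implementation; the thresholds come from the tree itself:&lt;/p&gt;

```python
def slo_gate(error_budget_pct, avg_cpu_14d, single_service, critical_path):
    """Illustrative translation of the SLO-gate decision tree above."""
    if 50 >= error_budget_pct:          # half the budget already burned
        return "Do nothing. Stability first."
    if avg_cpu_14d >= 20:               # not clearly underutilized over 14 days
        return "Skip, re-evaluate next month"
    if single_service and not critical_path:
        return "Proceed with rollback plan"
    return "Schedule for maintenance window"
```

&lt;p&gt;Feeding in the numbers from the pre-flight report in Part 0 (78% budget remaining, 12.3% average CPU) lands on "proceed" or "maintenance window" purely depending on blast radius — which is why the blast-radius check comes last, as the tiebreaker.&lt;/p&gt;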

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Automate this&lt;/strong&gt;: &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;&lt;code&gt;finops scan&lt;/code&gt;&lt;/a&gt; runs all checks below in one command. Each section maps to a specific check in the toolkit.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. EC2 / EKS node right-sizing with Pod Disruption Budgets
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: EKS worker nodes running at 15% CPU average. AWS says "downsize." But these nodes run 8 microservices — you can't just swap the instance type and hope pods reschedule gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ensure PDB exists BEFORE downsizing&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-server-pdb&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connectorder&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Execution steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify PDB exists for every service on the node group&lt;/li&gt;
&lt;li&gt;Add new node group with smaller instance type (t3.large → t3.medium)&lt;/li&gt;
&lt;li&gt;Cordon old nodes — Kubernetes respects PDBs during drain&lt;/li&gt;
&lt;li&gt;Monitor SLOs for 24 hours&lt;/li&gt;
&lt;li&gt;Remove old node group only after SLO confirmation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: t3.large ($0.0832/hr) → t3.medium ($0.0416/hr) = &lt;strong&gt;50% per node&lt;/strong&gt;. With 4 nodes across dev/staging, that's ~$120/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could go wrong&lt;/strong&gt;: Without PDBs, draining a node can kill all replicas of a service simultaneously. With PDBs, Kubernetes guarantees at least &lt;code&gt;minAvailable&lt;/code&gt; pods stay running.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks ec2_rightsizing&lt;/code&gt; — flags instances with avg CPU &amp;lt; 20% over 14 days. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/ec2_rightsizing.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
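&lt;p&gt;The decision logic behind that check can be sketched as follows. The CloudWatch fetch is elided (in practice the daily averages come from the &lt;code&gt;AWS/EC2 CPUUtilization&lt;/code&gt; metric, one datapoint per day); the function is a simplified sketch, not the toolkit source:&lt;/p&gt;

```python
from statistics import mean

# Simplified sketch of the ec2_rightsizing flagging logic; not the
# toolkit source. daily_cpu_averages would come from CloudWatch
# (AWS/EC2 CPUUtilization, Period=86400, Statistic=Average).
def flag_for_rightsizing(daily_cpu_averages, threshold_pct=20.0, min_days=14):
    """Flag an instance whose CPU averaged under threshold_pct over the window."""
    if min_days > len(daily_cpu_averages):
        return False  # not enough data to judge; never flag on a snapshot
    return threshold_pct > mean(daily_cpu_averages)
```

&lt;p&gt;Note the early return: with fewer than 14 days of data the check refuses to flag anything, which is the "14+ days, not a snapshot" rule from the SLO gate enforced in code.&lt;/p&gt;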




&lt;h2&gt;
  
  
  2. NAT Gateway → NAT Instance with high availability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: NAT Gateways cost $32.40/month each (fixed) + data processing. In dev/staging environments processing &amp;lt;1 GB/month, you're paying $32 for almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Don't just swap in a bare NAT Instance — that's a single point of failure. Run NAT Instances behind an Auto Scaling group for auto-recovery, one per AZ if you need AZ redundancy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NAT Instance with auto-recovery via ASG&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_autoscaling_group"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;min_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="nx"&gt;max_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="nx"&gt;desired_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

  &lt;span class="nx"&gt;launch_template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_launch_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
    &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$Latest"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Auto-replace if health check fails&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2"&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_grace_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;

  &lt;span class="nx"&gt;tag&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Name"&lt;/span&gt;
    &lt;span class="nx"&gt;value&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${local.name_prefix}-nat"&lt;/span&gt;
    &lt;span class="nx"&gt;propagate_at_launch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# t4g.nano: $3.02/month — 10x cheaper than NAT Gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_launch_template"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t4g.nano"&lt;/span&gt;
  &lt;span class="nx"&gt;image_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="c1"&gt;# ... source_dest_check = false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: $32.40 → $3.02/month per environment. Across 3 dev/staging environments: &lt;strong&gt;~$88/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The HA guarantee&lt;/strong&gt;: ASG auto-replaces the instance within ~2 minutes if it fails. For dev/staging, 2 minutes of NAT downtime is acceptable. For prod, keep the managed NAT Gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world validation&lt;/strong&gt;: &lt;a href="https://blogs.halodoc.io/from-aws-nat-gateway-to-nat-instance-a-cost-optimized-networking-strategy/" rel="noopener noreferrer"&gt;Halodoc's engineering team&lt;/a&gt; documented their full migration from managed NAT Gateways to NAT instances using &lt;a href="https://fck-nat.dev/" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt;, an open-source project that provides ready-to-use ARM-based AMIs supporting up to 5Gbps burst on a t4g.nano. They achieved over 90% cost reduction across non-prod environments. The fck-nat AMI handles IP forwarding, NAT rules, and CloudWatch alarms out of the box — it's essentially what I built manually with the ASG approach above, but packaged as a reusable AMI. If you're doing this at scale, consider &lt;a href="https://github.com/AndrewGuenther/fck-nat" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt; instead of rolling your own.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks nat_gateway&lt;/code&gt; — flags NAT Gateways with 0 bytes processed in dev/staging accounts. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/nat_gateway.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Spot Instances for non-production EKS with graceful draining
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Dev and staging EKS node groups run on-demand 24/7 for workloads that tolerate interruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Spot saves 60-70%, but you need graceful handling of the 2-minute interruption notice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EKS managed node group with Spot + drain handler&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;
&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot-workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.medium"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3a.medium"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.large"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;spot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
    &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreferNoSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Install the AWS Node Termination Handler. Without it, pods get killed mid-request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;aws-node-termination-handler &lt;span class="se"&gt;\&lt;/span&gt;
  eks/aws-node-termination-handler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;enableSpotInterruptionDraining&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;enableScheduledEventDraining&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: 3 on-demand t3.medium nodes ($0.0416/hr × 3 × 730hr) = $91/month → Spot (~$0.0125/hr × 3 × 730hr) = $27/month. &lt;strong&gt;$64/month savings per environment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reliability rule&lt;/strong&gt;: Never use Spot for production. Never use Spot for stateful workloads. Only use Spot where you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple instance type fallbacks (capacity diversification)&lt;/li&gt;
&lt;li&gt;Node Termination Handler installed&lt;/li&gt;
&lt;li&gt;Pod anti-affinity so replicas spread across nodes&lt;/li&gt;
&lt;/ul&gt;
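&lt;p&gt;Those three prerequisites are easy to encode as a pre-flight gate in front of any automation. A minimal sketch, assuming a config dict shaped like the eksctl fields above; &lt;code&gt;has_termination_handler&lt;/code&gt; and &lt;code&gt;replica_anti_affinity&lt;/code&gt; are hypothetical flags you'd populate from your cluster state:&lt;/p&gt;

```python
def spot_ready(node_group):
    """Return the reasons a node group is NOT ready for Spot (empty list = go)."""
    problems = []
    # 1. Capacity diversification: multiple instance types = multiple Spot pools.
    if len(node_group.get("instanceTypes", [])) in (0, 1):
        problems.append("fewer than 2 instance types (no capacity diversification)")
    # 2. Graceful draining on the 2-minute interruption notice.
    if not node_group.get("has_termination_handler", False):
        problems.append("AWS Node Termination Handler not installed")
    # 3. Replicas spread across nodes so one reclaim can't take out a service.
    if not node_group.get("replica_anti_affinity", False):
        problems.append("no pod anti-affinity between replicas")
    return problems
```

&lt;p&gt;An empty result is a go; anything else is a blocker to fix before flipping &lt;code&gt;spot: true&lt;/code&gt;.&lt;/p&gt;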

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks spot_candidates&lt;/code&gt; — identifies stateless ASGs and EKS node groups eligible for Spot. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/spot_candidates.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. RDS right-sizing without losing your safety net
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: RDS instances provisioned for peak load that only hits 2 hours per day. Average CPU: 8%. But it's a database — you can't just resize and pray.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Why it's safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;db.r6g.xlarge (prod)&lt;/td&gt;
&lt;td&gt;db.r6g.large (prod)&lt;/td&gt;
&lt;td&gt;Read replica absorbs overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;db.r6g.large (staging)&lt;/td&gt;
&lt;td&gt;db.r6g.medium (staging)&lt;/td&gt;
&lt;td&gt;No Multi-AZ needed in staging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-AZ on staging&lt;/td&gt;
&lt;td&gt;Single-AZ&lt;/td&gt;
&lt;td&gt;Staging doesn't need failover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Execution steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a read replica BEFORE downsizing (safety net)&lt;/li&gt;
&lt;li&gt;Monitor replica lag for 48 hours&lt;/li&gt;
&lt;li&gt;Apply instance modification during low-traffic window (scheduled, not immediate)&lt;/li&gt;
&lt;li&gt;Monitor connection count and query latency for 1 week&lt;/li&gt;
&lt;li&gt;Remove old read replica only after confirming SLOs hold&lt;/li&gt;
&lt;/ol&gt;
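&lt;p&gt;Step 3 is where the Friday-spike failure mode from the intro bites: average CPU is not a go signal on its own. A sketch of the gate worth running first, over hourly &lt;em&gt;Maximum&lt;/em&gt; CPU samples from CloudWatch (the 60% headroom threshold is my own rule of thumb, not an AWS guideline):&lt;/p&gt;

```python
def safe_to_downsize(hourly_max_cpu, capacity_ratio, headroom=60.0):
    """Decide whether a smaller instance can absorb the observed peak.

    hourly_max_cpu -- hourly Maximum CPU samples over 2+ weeks, in percent
    capacity_ratio -- target capacity / current capacity (0.5 = one size down)
    headroom       -- projected peak must stay under this percent on the new size
    """
    peak = max(hourly_max_cpu)
    projected_peak = peak / capacity_ratio  # same work on a smaller box
    return headroom >= projected_peak
```

&lt;p&gt;An instance averaging 8% but spiking to 48% on Fridays projects to 96% after one size down, and the gate says no. That is exactly the case that turned into a 90-minute outage.&lt;/p&gt;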

&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging Multi-AZ removal: &lt;strong&gt;~$200/month&lt;/strong&gt; (you're paying 2x for staging redundancy nobody needs)&lt;/li&gt;
&lt;li&gt;Right-sizing across 3 non-prod instances: &lt;strong&gt;~$150/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What NOT to touch&lt;/strong&gt;: Production primary instances running at &amp;gt;40% CPU. Production Multi-AZ. Any RDS with burst credit dependency (t-class instances under load).&lt;/p&gt;
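&lt;p&gt;Those exclusions can sit in front of any right-sizing automation as a deny list. A sketch over &lt;code&gt;describe-db-instances&lt;/code&gt;-shaped fields; &lt;code&gt;env&lt;/code&gt; and &lt;code&gt;avg_cpu&lt;/code&gt; are hypothetical fields you'd join in from tags and CloudWatch, and the 20% t-class threshold is my assumption:&lt;/p&gt;

```python
def downsize_denylist(db):
    """Return the reasons an RDS instance must NOT be auto-downsized."""
    reasons = []
    prod = db.get("env") == "prod"
    if prod and db.get("avg_cpu", 0) > 40:
        reasons.append("prod primary above 40% CPU")
    if prod and db.get("MultiAZ", False):
        reasons.append("prod Multi-AZ (availability spend, not capacity spend)")
    # t-class instances under sustained load depend on burst credits.
    if db.get("DBInstanceClass", "").startswith("db.t") and db.get("avg_cpu", 0) > 20:
        reasons.append("t-class under load (burst credit dependency)")
    return reasons
```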

&lt;h3&gt;
  
  
  Parameter groups: the hidden risk
&lt;/h3&gt;

&lt;p&gt;When you change an RDS instance class, memory-dependent parameters may break silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default parameter groups auto-scale&lt;/strong&gt; — &lt;code&gt;shared_buffers&lt;/code&gt;, &lt;code&gt;effective_cache_size&lt;/code&gt;, and &lt;code&gt;work_mem&lt;/code&gt; in PostgreSQL (or &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; in MySQL) adjust automatically with instance memory. If you're using the default parameter group, downsizing is straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom parameter groups with hardcoded values don't auto-scale.&lt;/strong&gt; If someone set &lt;code&gt;shared_buffers = 8GB&lt;/code&gt; explicitly for a db.r6g.xlarge (32GB RAM), downsizing to db.r6g.large (16GB RAM) means &lt;code&gt;shared_buffers&lt;/code&gt; is now 50% of total RAM instead of 25%. That leaves almost nothing for OS cache and connections.&lt;/p&gt;

&lt;p&gt;This is a known production pitfall. &lt;a href="https://repost.aws/knowledge-center/rds-aurora-postgresql-shared-buffers" rel="noopener noreferrer"&gt;AWS documents&lt;/a&gt; that RDS replicas can get stuck in &lt;code&gt;incompatible-parameters&lt;/code&gt; mode when created with a smaller instance class if the source's parameter group has hardcoded buffer values too large for the target. The same issue applies to downsizing: the instance may fail to start or perform poorly. &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-postgresql-parameters/shared-buffers.html" rel="noopener noreferrer"&gt;AWS Prescriptive Guidance&lt;/a&gt; recommends using formula-based parameters (e.g., &lt;code&gt;{DBInstanceClassMemory/32768}&lt;/code&gt;) that auto-scale with instance size, rather than hardcoded values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before downsizing, check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List parameter groups for the instance&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].DBParameterGroups'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# Check for hardcoded memory parameters&lt;/span&gt;
aws rds describe-db-parameters &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-parameter-group-name&lt;/span&gt; my-custom-pg15 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Parameters[?ParameterName==`shared_buffers` || ParameterName==`effective_cache_size` || ParameterName==`work_mem`].[ParameterName,ParameterValue,Source]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If &lt;code&gt;Source&lt;/code&gt; = &lt;code&gt;user&lt;/code&gt; (not &lt;code&gt;engine-default&lt;/code&gt;), the parameter is hardcoded. Recalculate it for the target instance size before downsizing.&lt;/p&gt;
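&lt;p&gt;The recalculation itself is mechanical: PostgreSQL expresses &lt;code&gt;shared_buffers&lt;/code&gt; in 8 KB pages, and the common target is about 25% of instance memory. A sketch of the arithmetic (the 25% fraction is the usual rule of thumb; validate it against your workload):&lt;/p&gt;

```python
PAGE_SIZE_KB = 8  # shared_buffers is set in 8 KB pages

def shared_buffers_pages(instance_memory_gb, fraction=0.25):
    """Target shared_buffers value (in 8 KB pages) for a given instance size."""
    target_kb = instance_memory_gb * 1024 * 1024 * fraction
    return int(target_kb / PAGE_SIZE_KB)

# db.r6g.xlarge (32 GB) down to db.r6g.large (16 GB): the value must halve.
old = shared_buffers_pages(32)  # 1048576 pages = 8 GB
new = shared_buffers_pages(16)  # 524288 pages = 4 GB
```

&lt;p&gt;This is also why AWS's formula parameter works: &lt;code&gt;{DBInstanceClassMemory/32768}&lt;/code&gt; divides bytes by 32768, which is exactly 25% of memory expressed in 8 KB pages, so it tracks the instance size automatically.&lt;/p&gt;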

&lt;h3&gt;
  
  
  CDC and logical replication: the blast radius multiplier
&lt;/h3&gt;

&lt;p&gt;If the database has Change Data Capture (CDC) enabled via logical replication, downsizing becomes significantly riskier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replication slots consume WAL&lt;/strong&gt;: Logical replication slots prevent WAL cleanup until the consumer catches up. On a smaller instance with less I/O throughput, WAL can accumulate faster than it's consumed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication lag increases&lt;/strong&gt;: Smaller instance = less CPU and memory for WAL decoding. If your CDC pipeline (Debezium, DMS, custom) can't keep up, lag grows — and if the slot falls too far behind, you may need to recreate it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk pressure&lt;/strong&gt;: WAL accumulation on a smaller instance with less storage headroom can fill the disk, causing the primary to halt writes entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not theoretical. Gunnar Morling (Debezium/Red Hat) documented &lt;a href="https://www.morling.dev/blog/insatiable-postgres-replication-slot/" rel="noopener noreferrer"&gt;the "insatiable" replication slot problem&lt;/a&gt; — when a CDC consumer stops, an idle RDS PostgreSQL instance accumulates &lt;strong&gt;18 GB/day&lt;/strong&gt; of WAL because RDS writes a heartbeat every 5 minutes into 64 MB WAL segments. His &lt;a href="https://www.morling.dev/blog/mastering-postgres-replication-slots/" rel="noopener noreferrer"&gt;follow-up guide&lt;/a&gt; on mastering replication slots is essential reading. &lt;a href="https://www.artie.com/blogs/postgres-replication-slot-101-how-to-capture-cdc-without-breaking-production" rel="noopener noreferrer"&gt;Artie's production guide&lt;/a&gt; calls slot bloat "the single most common way CDC pipelines take down production databases."&lt;/p&gt;
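&lt;p&gt;Morling's 18 GB/day figure makes the disk math easy to run before a downsize. A back-of-envelope sketch, assuming WAL accumulates at a constant rate while the consumer is stalled; measure your own rate via &lt;code&gt;ReplicationSlotDiskUsage&lt;/code&gt; rather than trusting the default:&lt;/p&gt;

```python
def hours_until_disk_full(free_storage_gb, wal_gb_per_day=18.0):
    """Runway if the CDC consumer stalls and slot WAL starts piling up."""
    return free_storage_gb / (wal_gb_per_day / 24.0)

# Downsizing storage headroom from 100 GB free to 30 GB free cuts the
# runway for a stalled consumer from roughly 5.5 days to well under 2:
hours_until_disk_full(100)  # ~133 hours
hours_until_disk_full(30)   # 40.0 hours
```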

&lt;p&gt;&lt;strong&gt;Before downsizing a CDC-enabled database:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check for logical replication slots (PostgreSQL)&lt;/span&gt;
&lt;span class="c"&gt;# Run via psql or RDS Data API:&lt;/span&gt;
&lt;span class="c"&gt;# SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn&lt;/span&gt;
&lt;span class="c"&gt;# FROM pg_replication_slots;&lt;/span&gt;

&lt;span class="c"&gt;# Check replication lag via CloudWatch&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/RDS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; ReplicationSlotDiskUsage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DBInstanceIdentifier,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Maximum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical safety net&lt;/strong&gt; (PostgreSQL 13+): Set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; in your parameter group to cap how much WAL a replication slot can retain. Without this, an inactive slot will accumulate WAL indefinitely — &lt;a href="https://www.morling.dev/blog/insatiable-postgres-replication-slot/" rel="noopener noreferrer"&gt;Morling measured 18 GB/day on an idle RDS instance&lt;/a&gt;. Also set a CloudWatch alarm on &lt;code&gt;OldestReplicationSlotLag&lt;/code&gt; — warning at 1 GB, critical at 10 GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If &lt;code&gt;pg_replication_slots&lt;/code&gt; shows active logical slots, do NOT downsize without first confirming the CDC consumer can handle reduced throughput. Consider pausing CDC, downsizing, then resuming — but plan for a full re-sync if the slot is lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold cache: the first-hour tax
&lt;/h3&gt;

&lt;p&gt;Every RDS instance modification restarts the database engine. When it comes back up, the buffer pool is empty. This is the &lt;strong&gt;cold cache&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL's &lt;code&gt;shared_buffers&lt;/code&gt; starts empty — every query hits disk&lt;/li&gt;
&lt;li&gt;Query p99 latency spikes 3-10x for the first 30-60 minutes&lt;/li&gt;
&lt;li&gt;Connection pool may hit timeouts as queries take longer&lt;/li&gt;
&lt;li&gt;If you're monitoring SLOs, you'll see an error budget burn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schedule the modification during the lowest-traffic window&lt;/strong&gt; (e.g., 02:00-04:00 KST for our services)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use "Apply during maintenance window"&lt;/strong&gt; — not "Apply immediately"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-warm with read replica promotion&lt;/strong&gt; instead of in-place modification:

&lt;ul&gt;
&lt;li&gt;Create a read replica at the target (smaller) size&lt;/li&gt;
&lt;li&gt;Let the replica's buffer pool warm up from replication traffic&lt;/li&gt;
&lt;li&gt;Promote the replica to primary during maintenance window&lt;/li&gt;
&lt;li&gt;The promoted instance already has a warm cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget the cold cache period into your SLO error budget&lt;/strong&gt; — if you have 78% budget remaining, a 45-minute cache warm-up that degrades p99 by 3x might burn 2-3% of your monthly budget. That's acceptable. If you only have 50% remaining, it's not.&lt;/li&gt;
&lt;/ol&gt;
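&lt;p&gt;Item 4's arithmetic is worth making explicit. A sketch for a request-based latency SLO, assuming uniform traffic across the month; the 2% violation rate during warm-up is an illustrative assumption, not a measurement:&lt;/p&gt;

```python
def budget_burn_fraction(slo, window_min, violation_rate, period_min=30 * 24 * 60):
    """Fraction of the monthly error budget a degraded window consumes.

    slo            -- e.g. 0.999: 99.9% of requests must meet the latency target
    window_min     -- length of the degraded (cold cache) window, in minutes
    violation_rate -- fraction of requests missing the target during the window
    """
    budget = (1.0 - slo) * period_min  # budget in "bad request-minutes"
    burned = window_min * violation_rate
    return burned / budget

# 45-minute warm-up, 2% of requests slow, 99.9% SLO: about 2% of the budget.
budget_burn_fraction(0.999, 45, 0.02)  # ~0.021
```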

&lt;p&gt;&lt;strong&gt;Blue-green consideration:&lt;/strong&gt; RDS Blue/Green Deployments create a green (new) environment alongside the blue (current). This is safer for major changes but costs 2x during the switchover period. For a simple instance class change, in-place modification with read replica pre-warming is more cost-effective than blue-green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the industry uses&lt;/strong&gt;: AWS published a &lt;a href="https://aws.amazon.com/blogs/database/optimize-amazon-aurora-postgresql-auto-scaling-performance-with-automated-cache-pre-warming/" rel="noopener noreferrer"&gt;detailed guide on automated cache pre-warming&lt;/a&gt; for Aurora PostgreSQL using the &lt;code&gt;pg_prewarm&lt;/code&gt; extension, which loads specific tables and indexes into shared buffers before traffic arrives. For standard RDS PostgreSQL, the same extension is available — and there's even &lt;a href="https://github.com/robins/PrewarmRDSPostgres" rel="noopener noreferrer"&gt;an open-source tool&lt;/a&gt; specifically designed to pre-warm RDS PostgreSQL instances after restarts. Aurora also offers &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.cluster-cache-mgmt.html" rel="noopener noreferrer"&gt;Cluster Cache Management (CCM)&lt;/a&gt; which designates a replica to inherit the primary's buffer cache on failover — eliminating cold cache entirely for failover scenarios.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks rds_rightsizing&lt;/code&gt; — flags oversized RDS instances and unnecessary Multi-AZ in non-prod. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/rds_rightsizing.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. ElastiCache scheduling for dev/staging
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: ElastiCache clusters running 24/7 in dev/staging. Developers use them 10 hours/day, 5 days/week. You're paying for 118 idle hours per week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Stop clusters outside business hours via EventBridge + Lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Lambda: stop dev ElastiCache at 8 PM, start at 8 AM
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'stop' or 'start'
&lt;/span&gt;    &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;elasticache&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Serverless: just scale to 0 ECPUs
&lt;/span&gt;        &lt;span class="c1"&gt;# Classic: delete with final snapshot, recreate on start
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Restore from snapshot
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: ~50% per cluster. 2 dev/staging clusters: &lt;strong&gt;~$80/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reliability check&lt;/strong&gt;: Always test that the start/restore actually works before relying on scheduling. A cluster that won't restore Monday morning is worse than paying weekend costs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks elasticache_scheduling&lt;/code&gt; — detects dev/staging ElastiCache running 24/7. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/elasticache_scheduling.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Reserved Instances: commit only after right-sizing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Teams buy RIs before optimizing. Then they downsize and the RI doesn't match. Money locked in for 1-3 years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: RIs are the &lt;strong&gt;last step&lt;/strong&gt;, not the first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1-2: Find waste (Part 1 — passive cleanup)
Week 3-4: Downsize safely (this article)
Week 5-6: Monitor — confirm new sizes are stable
Week 7-8: THEN buy RIs/Savings Plans for the right-sized resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Decision matrix&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Stable for 30+ days?&lt;/th&gt;
&lt;th&gt;CPU predictable?&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prod RDS (right-sized)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, 35-45%&lt;/td&gt;
&lt;td&gt;1-year RI (All Upfront)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prod EKS nodes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, 40-60%&lt;/td&gt;
&lt;td&gt;Compute Savings Plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev anything&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Never reserve — use Spot/scheduling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What we projected&lt;/strong&gt;: After right-sizing prod workloads, 1-year RIs would save an additional &lt;strong&gt;30-40%&lt;/strong&gt; on the new baseline — roughly $300-500/month for our scale.&lt;/p&gt;
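&lt;p&gt;The "reserve last" ordering has simple arithmetic behind it: a commitment only wins if the resource survives at its committed size for long enough. A sketch for a no-upfront 1-year RI (the 40% discount is illustrative; pull real rates from AWS pricing):&lt;/p&gt;

```python
def ri_breakeven_months(discount, term_months=12):
    """Months a workload must stay unchanged before an RI beats on-demand.

    A no-upfront RI bills its discounted rate for the whole term whether you
    use the instance or not. If you downsize after m months, you paid
    (1 - discount) * term in on-demand-equivalent months, versus just m
    months on-demand. Break-even: m = (1 - discount) * term.
    """
    return (1.0 - discount) * term_months

ri_breakeven_months(0.40)  # ~7.2: downsize within ~7 months and the RI lost money
```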

&lt;p&gt;&lt;strong&gt;Industry validation&lt;/strong&gt;: &lt;a href="https://cloudchipr.com/blog/aws-rds-right-sizing" rel="noopener noreferrer"&gt;CloudChipr's RDS right-sizing guide&lt;/a&gt; puts it bluntly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." &lt;a href="https://blog.easecloud.io/startup-tech/aws-cost-optimization-mistakes/" rel="noopener noreferrer"&gt;The Flexera State of the Cloud Report&lt;/a&gt; consistently finds that 27% of cloud spend is wasted, with premature RI commitment being a top contributor. If you must reserve, use Compute Savings Plans over EC2 Instance Savings Plans — &lt;a href="https://www.prosperops.com/blog/aws-reserved-instances/" rel="noopener noreferrer"&gt;ProsperOps explains&lt;/a&gt; that Compute SPs offer instance family flexibility, so you can still right-size without breaking coverage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks reserved_instances&lt;/code&gt; — calculates RI/Savings Plans ROI for stable workloads. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/reserved_instances.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Orphan resource cleanup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: EBS volumes from terminated instances, Elastic IPs not attached to anything, snapshots from 2 years ago, load balancers with zero targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: These are almost always safe to remove — but verify first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checklist before deletion&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] EBS volume: not attached, no recent snapshots depending on it&lt;/li&gt;
&lt;li&gt;[ ] EIP: not referenced in DNS or application config&lt;/li&gt;
&lt;li&gt;[ ] Snapshot: original volume no longer exists, no AMI depends on it&lt;/li&gt;
&lt;li&gt;[ ] ALB: zero registered targets for 7+ days, no DNS pointing to it&lt;/li&gt;
&lt;/ul&gt;
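&lt;p&gt;The EBS item scripts cleanly against &lt;code&gt;describe-volumes&lt;/code&gt; and &lt;code&gt;describe-snapshots&lt;/code&gt; output. A sketch of the decision logic only, so it stays testable offline; wire in the boto3 responses for real use (the 90-day snapshot cutoff mirrors what we used):&lt;/p&gt;

```python
from datetime import datetime, timezone

def orphan_ebs_volumes(volumes, snapshots, recent_days=90):
    """Unattached volumes with no snapshot taken in the last recent_days."""
    now = datetime.now(timezone.utc)
    recently_snapshotted = {
        s["VolumeId"]
        for s in snapshots
        if recent_days >= (now - s["StartTime"]).days
    }
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["VolumeId"] not in recently_snapshotted
    ]
```

&lt;p&gt;&lt;code&gt;State == "available"&lt;/code&gt; is how the EC2 API reports an unattached volume; feed the function the &lt;code&gt;Volumes&lt;/code&gt; and &lt;code&gt;Snapshots&lt;/code&gt; lists from the corresponding describe calls.&lt;/p&gt;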

&lt;p&gt;&lt;strong&gt;What we found&lt;/strong&gt;: 12 orphan EBS volumes, 4 unused EIPs, 47 snapshots older than 90 days. &lt;strong&gt;~$85/month&lt;/strong&gt; in pure waste.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks unused_resources&lt;/code&gt; — flags unattached EBS, unused EIPs, old snapshots, idle ALBs. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/unused_resources.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhdxwd8xdtpdul01jwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhdxwd8xdtpdul01jwa.png" alt="Risk vs Savings: 7 optimizations ranked" width="720" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The complete picture: what's safe and what's not
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Prod safe?&lt;/th&gt;
&lt;th&gt;Dev/Staging safe?&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orphan cleanup&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$85/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache scheduling&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$80/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway → Instance&lt;/td&gt;
&lt;td&gt;Low-Med&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$88/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spot for non-prod&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$64/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2/EKS right-sizing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;With PDB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$120/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS right-sizing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;With replica&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$350/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reserved Instances&lt;/td&gt;
&lt;td&gt;Lock-in risk&lt;/td&gt;
&lt;td&gt;After sizing&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;td&gt;$300-500/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total from active downsizing&lt;/strong&gt;: ~$787/month realized now, or ~$1,087/month with the conservative end of the projected RI savings ($9.4-13K/year)&lt;br&gt;
&lt;strong&gt;Combined with Part 1 (passive waste)&lt;/strong&gt;: $1,431-2,104/month ($17.2-25.2K/year)&lt;/p&gt;


&lt;h2&gt;
  
  
  The toolkit: automate the discovery
&lt;/h2&gt;

&lt;p&gt;Everything in this article maps to a check in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aws-finops-toolkit

&lt;span class="c"&gt;# Scan all checks across multiple accounts&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--profiles&lt;/span&gt; dev,staging,prod

&lt;span class="c"&gt;# Run only the downsizing-related checks&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--checks&lt;/span&gt; ec2_rightsizing,nat_gateway,spot_candidates,rds_rightsizing,elasticache_scheduling,reserved_instances,unused_resources

&lt;span class="c"&gt;# Generate HTML report for management&lt;/span&gt;
finops report &lt;span class="nt"&gt;--format&lt;/span&gt; html &lt;span class="nt"&gt;--output&lt;/span&gt; finops-downsizing.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool finds the opportunities. The SRE decides which ones are safe to execute, and in what order.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FinOps without SRE is dangerous.&lt;/strong&gt; Cost tools don't know your SLOs. They'll tell you to downsize a database that's already at its limit during peak hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always add safety before removing cost.&lt;/strong&gt; Read replica before RDS downsize. PDB before node downsize. Drain handler before Spot. The safety net costs less than the savings — and it prevents the 2 AM page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve last, not first.&lt;/strong&gt; Right-size → stabilize → then commit. Buying RIs on oversized instances locks in waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prod and non-prod are different games.&lt;/strong&gt; Non-prod is where you optimize aggressively (Spot, scheduling, single-AZ). Prod is where you optimize carefully (right-sizing with replicas, PDBs, maintenance windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO data is your FinOps compass.&lt;/strong&gt; If your error budget is healthy, you have room to experiment. If it's burned, don't touch anything — reliability comes first.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: Downsizing Without Downtime&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The checks in this article are implemented in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — an open-source CLI for automated AWS cost scanning.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>reliability</category>
    </item>
    <item>
      <title>FinOps for SREs: Cutting Costs Without Breaking Things</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:17:33 +0000</pubDate>
      <link>https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk</link>
      <guid>https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk</guid>
      <description>&lt;h1&gt;
  
  
  FinOps for SREs: Cutting Costs Without Breaking Things
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;Most FinOps advice starts with a cost dashboard. This series starts with a different question: &lt;strong&gt;how do we cut costs without violating our SLOs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm an SRE at a subsidiary of one of Korea's largest tech companies, managing four AWS accounts connected via a Transit Gateway hub-spoke architecture. When I was asked to reduce cloud spend, I didn't open AWS Cost Explorer first. I opened our SigNoz dashboards and checked our error budgets.&lt;/p&gt;

&lt;p&gt;That's the difference between FinOps and &lt;strong&gt;SRE-driven FinOps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feditzn0z3mns74lesqle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feditzn0z3mns74lesqle.png" alt="The SRE Guarantee: Error Budget Protection, Assured Minimum Downtime, Reliability Over Savings" width="720" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SRE Guarantee
&lt;/h2&gt;

&lt;p&gt;Before any cost optimization begins, I guarantee three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Error Budget Protection&lt;/strong&gt;&lt;br&gt;
No optimization will be executed if it risks breaching SLOs. If our error budget is below 50%, all FinOps work stops — reliability comes first.&lt;/p&gt;
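&lt;p&gt;As a minimal sketch of that gate (with an illustrative 99.9% SLO and availability figures, not our production numbers), the decision reduces to the ratio of observed error to allowed error:&lt;/p&gt;

```python
# Sketch of the 50% error-budget gate. SLO target and availability
# numbers here are illustrative, not production figures.

def error_budget_remaining(slo_target, observed_availability):
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    allowed_error = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    observed_error = 1.0 - observed_availability
    spent = observed_error / allowed_error
    return max(0.0, 1.0 - spent)

def finops_paused(remaining):
    """True when strictly less than half the budget is left."""
    return max(0.0, 0.5 - remaining) != 0.0

remaining = error_budget_remaining(0.999, 0.9995)   # half the budget spent
print(round(remaining, 6), finops_paused(remaining))   # 0.5 False
```

&lt;p&gt;The point of making the gate mechanical is that nobody has to argue about it under deadline pressure.&lt;/p&gt;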

&lt;p&gt;&lt;strong&gt;2. Assured Minimum Downtime&lt;/strong&gt;&lt;br&gt;
Every change has a rollback plan, a maintenance window, and a blast radius assessment. Zero downtime is the target; documented, brief downtime inside a maintenance window is the acceptable fallback. Unplanned downtime is unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reliability Over Savings&lt;/strong&gt;&lt;br&gt;
If forced to choose between $500/month in savings and a 0.01% availability risk, we choose availability. Always. The cost of an outage — in customer trust, in engineering hours, in incident response — exceeds any monthly savings.&lt;/p&gt;

&lt;p&gt;This guarantee isn't just a principle. It's encoded in every check of the &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — the open-source CLI I built to automate this workflow.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Series
&lt;/h2&gt;

&lt;p&gt;This series walks through the complete FinOps workflow I used to identify $48-67K/year in savings across four AWS accounts — starting with analysis, through passive cleanup, to active downsizing with SRE guardrails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" alt="Waste breakdown across 4 AWS accounts" width="720" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;9 checks before cutting any cost.&lt;/strong&gt; Traffic analysis, SLO status, cache dependencies, incident history, RI/SP coverage, and more. This is the analysis phase — never optimize what you don't fully understand.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops preflight&lt;/code&gt; command (&lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;)&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Passive waste — things nobody uses.&lt;/strong&gt; Abandoned VPCs ($748/mo), orphan CloudWatch log groups ($110-165/mo), S3 lifecycle vs Intelligent-Tiering ($75-104/mo). Zero risk to production. Total: $933-1,017/month.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops scan&lt;/code&gt; — &lt;code&gt;vpc_waste&lt;/code&gt;, &lt;code&gt;cloudwatch_waste&lt;/code&gt;, &lt;code&gt;s3_lifecycle&lt;/code&gt; checks&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Active optimization — shrinking running infrastructure with SRE guardrails.&lt;/strong&gt; EC2/EKS right-sizing with PDBs, NAT Gateway replacement, Spot with drain handlers, RDS right-sizing with read replicas and cold cache planning, ElastiCache scheduling, and Reserved Instances (commit last, not first). Total: $787-1,087/month.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops scan&lt;/code&gt; — &lt;code&gt;ec2_rightsizing&lt;/code&gt;, &lt;code&gt;nat_gateway&lt;/code&gt;, &lt;code&gt;spot_candidates&lt;/code&gt;, &lt;code&gt;rds_rightsizing&lt;/code&gt;, &lt;code&gt;elasticache_scheduling&lt;/code&gt;, &lt;code&gt;reserved_instances&lt;/code&gt;, &lt;code&gt;unused_resources&lt;/code&gt; checks&lt;/p&gt;


&lt;h2&gt;
  
  
  Combined Savings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;th&gt;Annual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Part 1: Passive waste cleanup&lt;/td&gt;
&lt;td&gt;$933-1,017&lt;/td&gt;
&lt;td&gt;$11.2-12.2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Part 2: Active downsizing&lt;/td&gt;
&lt;td&gt;$787-1,087&lt;/td&gt;
&lt;td&gt;$9.4-13K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total identified&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,720-2,104&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$20.6-25.2K&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P0-P2 roadmap (pending)&lt;/td&gt;
&lt;td&gt;$3,995-5,565&lt;/td&gt;
&lt;td&gt;$48-67K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every optimization in this series passed through the SRE guarantee. Not a single SLO was breached. Not a single unplanned outage occurred.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Toolkit
&lt;/h2&gt;

&lt;p&gt;Everything in this series maps to &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — an open-source CLI that automates the discovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pre-flight analysis before any change&lt;/span&gt;
finops preflight &lt;span class="nt"&gt;--target&lt;/span&gt; pn-sh-rds-prod &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="nt"&gt;--apm&lt;/span&gt; signoz

&lt;span class="c"&gt;# Scan for cost waste across accounts&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--profiles&lt;/span&gt; dev,staging,prod

&lt;span class="c"&gt;# Generate report for stakeholders&lt;/span&gt;
finops report &lt;span class="nt"&gt;--format&lt;/span&gt; html &lt;span class="nt"&gt;--output&lt;/span&gt; finops-report.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool finds the opportunities. The SRE decides which ones are safe to execute.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the introduction to the "FinOps for SREs" series. Start with &lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist&lt;/a&gt; or jump to the part most relevant to your situation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm June, an SRE with 5+ years of experience at Korea's top tech companies including Coupang (NYSE: CPNG) and NAVER Corporation. I write about real-world infrastructure problems. Find me on &lt;a href="https://linkedin.com/in/junegu" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:11:52 +0000</pubDate>
      <link>https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp</link>
      <guid>https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp</guid>
      <description>&lt;h1&gt;
  
  
  How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;cloudcost&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;I joined a subsidiary of one of Korea's largest tech companies at the beginning of 2026 as the sole SRE. I inherited four AWS accounts — hub, shared, waitlist, and loyalty — connected via a Transit Gateway hub-spoke architecture. Each account ran its own mix of EKS clusters, RDS instances, Aurora clusters, legacy EC2 services, and networking stacks accumulated over several years by multiple teams.&lt;/p&gt;

&lt;p&gt;Nobody had done a cost audit since the accounts were created. Resources from decommissioned projects were still running. Log groups from deleted infrastructure were still ingesting. A privacy VPC whose six WorkSpaces nobody had logged into for months was quietly billing $525/month.&lt;/p&gt;

&lt;p&gt;Within two weeks of part-time analysis and execution, I cut $644/month in immediate waste and identified a total of $933-1,017/month ($11.2-12.2K/year) across three workstreams — all without touching a single production service. Beyond that, I mapped out a P0-P2 roadmap worth $48-67K/year that is now pending platform team approval.&lt;/p&gt;

&lt;p&gt;This is what the work actually looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" alt="Waste breakdown: VPC $748/mo, CloudWatch $110-165/mo, S3 $75-104/mo" width="720" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit: mapping costs across four accounts
&lt;/h2&gt;

&lt;p&gt;Before optimizing anything, I needed to understand what we were paying for and who owned it. Our four accounts mapped to distinct business units:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Account&lt;/th&gt;
&lt;th&gt;What runs there&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hub&lt;/td&gt;
&lt;td&gt;Transit Gateway, ArgoCD, ECR, bastion, monitoring (SigNoz)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shared&lt;/td&gt;
&lt;td&gt;Ordering platform (EKS, RDS, microservices)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;waitlist&lt;/td&gt;
&lt;td&gt;Waitlist service (Aurora clusters, legacy EC2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;Loyalty platform (RDS, Aurora, legacy WorkSpaces, VPN)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each account gets its own AWS bill, which is one of the underappreciated benefits of multi-account architecture for FinOps. No tagging allocation formulas, no arguments about which team caused a cost spike. The Transit Gateway attachment cost ($51.10/month per VPC) is the "tax" each spoke pays for connectivity — and it is transparent.&lt;/p&gt;
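&lt;p&gt;That $51.10 falls straight out of the hourly attachment rate. A quick check, assuming the $0.07/attachment-hour rate for our region (rates vary by region, so verify against current AWS pricing):&lt;/p&gt;

```python
# Per-VPC Transit Gateway attachment "tax", derived from the hourly rate.
# 0.07 USD/attachment-hour is an assumed regional rate, not a quote.
HOURS_PER_MONTH = 730
attachment_rate = 0.07
monthly_tax = attachment_rate * HOURS_PER_MONTH
print(round(monthly_tax, 2))   # 51.1
```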

&lt;p&gt;I added standardized cost allocation tags derived from our existing Terraform naming convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Org&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;       &lt;span class="c1"&gt;# company identifier&lt;/span&gt;
  &lt;span class="nx"&gt;Group&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt;     &lt;span class="c1"&gt;# "hub", "sh", "nw", "dp"&lt;/span&gt;
  &lt;span class="nx"&gt;Service&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt;   &lt;span class="c1"&gt;# "core", "ordering", etc.&lt;/span&gt;
  &lt;span class="nx"&gt;Env&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;       &lt;span class="c1"&gt;# "prod", "stage", "dev"&lt;/span&gt;
  &lt;span class="nx"&gt;ManagedBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these tags activated in AWS Cost Explorer, I could slice spend by team, service, and environment. But the real insights came from cross-referencing Cost Explorer data with SigNoz (our centralized observability platform running in the hub account), which collects resource utilization metrics from every spoke cluster. That combination — dollars from Cost Explorer, utilization from SigNoz — is what let me confidently identify waste rather than guess at it.&lt;/p&gt;

&lt;p&gt;The audit revealed three categories of waste, each requiring a different approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: VPC cleanup — $748/month saved
&lt;/h2&gt;

&lt;p&gt;This was the biggest win, and it was almost entirely abandoned infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The abandoned dev VPC ($64/month)
&lt;/h3&gt;

&lt;p&gt;The shared account contained a VPC called &lt;code&gt;fnb-dev&lt;/code&gt; that had been created for a food-and-beverage integration project. The project was cancelled, but the VPC lived on: two NAT Gateways, two Elastic IPs, an EC2 instance, an Internet Gateway, and the VPC itself. Nobody was using any of it.&lt;/p&gt;

&lt;p&gt;I confirmed zero traffic on the NAT Gateways via CloudWatch metrics (BytesIn/BytesOut flat at zero for 90+ days), verified no DNS records pointed to the EC2 instance, and tore the entire VPC down.&lt;/p&gt;
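&lt;p&gt;The idle check itself is trivial once the metrics are in hand. A sketch of the decision logic (the CloudWatch query for daily &lt;code&gt;BytesOutToDestination&lt;/code&gt; and &lt;code&gt;BytesInFromDestination&lt;/code&gt; sums is omitted; the function and parameter names are mine, not part of any SDK):&lt;/p&gt;

```python
# "Is this NAT Gateway idle?" applied to a list of daily byte sums
# pulled from CloudWatch. Hypothetical helper for illustration only.
def nat_gateway_is_idle(daily_byte_sums, min_days=90):
    """Idle = a full observation window in which every daily sum is zero."""
    enough_history = min(len(daily_byte_sums), min_days) == min_days
    all_zero = all(v == 0 for v in daily_byte_sums)
    return enough_history and all_zero

print(nat_gateway_is_idle([0] * 90))          # True: flat for 90 days
print(nat_gateway_is_idle([0] * 89 + [512]))  # False: one day of traffic
```

&lt;p&gt;Requiring the full window matters: a gateway created 10 days ago with zero traffic is not the same signal as one that has been flat for a quarter.&lt;/p&gt;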

&lt;h3&gt;
  
  
  The privacy VPC that outlived its purpose ($525/month)
&lt;/h3&gt;

&lt;p&gt;This was the expensive one. The loyalty account had a "privacy VPC" running six Amazon WorkSpaces, an AWS Directory Service instance, a Storage Gateway, and a NAT Gateway. It was originally set up for a compliance project that had since been handled differently.&lt;/p&gt;

&lt;p&gt;The six WorkSpaces alone cost roughly $300/month. The Directory Service added another $100+. None of the WorkSpaces had been logged into recently. I confirmed with the platform team that the compliance workflow no longer required this infrastructure, documented the teardown plan, and removed the entire VPC.&lt;/p&gt;

&lt;p&gt;$525/month for infrastructure that was doing literally nothing. This is the kind of waste that hides in multi-account setups — each account team assumes someone else needs it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unattached Elastic IPs and orphaned NAT Gateways ($11/month)
&lt;/h3&gt;

&lt;p&gt;Small individually, but they add up and signal a pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An unattached EIP in the hub account: $7.50/month&lt;/li&gt;
&lt;li&gt;A NAT Gateway EIP in shared-stage that was no longer needed: $3.50/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also identified two NAT Gateways in the loyalty account's security VPC worth ~$130/month that are pending vendor coordination before removal.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC cleanup totals
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Account&lt;/th&gt;
&lt;th&gt;Monthly savings&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fnb-dev VPC full teardown&lt;/td&gt;
&lt;td&gt;shared&lt;/td&gt;
&lt;td&gt;~$64&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy VPC full teardown&lt;/td&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;~$525&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unattached EIPs (2)&lt;/td&gt;
&lt;td&gt;hub, shared&lt;/td&gt;
&lt;td&gt;~$11&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;security-vpc NAT Gateways (2)&lt;/td&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;~$130&lt;/td&gt;
&lt;td&gt;Pending vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$748&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: The highest-ROI FinOps work is not right-sizing or reservations. It is finding entire stacks that should not exist. A single abandoned VPC with managed services can cost more per month than all your dev environment optimizations combined.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: CloudWatch log retention — $110-165/month saved
&lt;/h2&gt;

&lt;p&gt;CloudWatch Logs is one of those services that silently accumulates cost because the default retention is "never expire." When nobody sets explicit retention policies, logs grow forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three tiers of log waste
&lt;/h3&gt;

&lt;p&gt;I categorized every log group in the loyalty account (which had the most legacy services) into three buckets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Orphan log groups (delete immediately)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I tore down the privacy VPC, seven CloudWatch log groups were left behind: five from Storage Gateway and two from Lambda functions that had been part of the privacy workflow. These groups had no active log streams but still stored data. I deleted them outright.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Inactive service logs (set to 30-day retention)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eighteen log groups belonged to services that had not emitted a log event in 12+ months. Old message queue processors, abandoned feature branches that had been deployed and forgotten, food-and-beverage integration services that matched the cancelled project. These got a 30-day retention policy — enough time to investigate if someone suddenly asks "what happened with service X last month?" while ensuring the data does not accumulate indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 — Active but over-retained logs (reduce to 90 days)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main production service log group had its retention set to 731 days (two full years). It had accumulated 199 GB of log data. For a service whose logs are primarily useful for incident investigation (where you rarely look back more than a few weeks), two years is excessive. I reduced retention to 90 days.&lt;/p&gt;
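&lt;p&gt;Rough math on what that retention change is worth, assuming roughly uniform ingestion over the old window and the standard $0.03/GB-month CloudWatch Logs storage price (both are assumptions, not billed figures):&lt;/p&gt;

```python
# Storage effect of cutting retention from 731 to 90 days on a log
# group holding 199 GB, assuming ingestion has been roughly uniform.
stored_gb = 199
price_per_gb_month = 0.03                  # assumed storage rate
steady_state_gb = stored_gb * (90 / 731)   # what survives the new policy
saved_per_month = (stored_gb - steady_state_gb) * price_per_gb_month
print(round(steady_state_gb))        # 25 GB retained at steady state
print(round(saved_per_month, 2))     # 5.23 USD/month
```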

&lt;h3&gt;
  
  
  What is still pending
&lt;/h3&gt;

&lt;p&gt;The immediate changes saved roughly $5/month in ongoing storage costs, but the real savings come from items still awaiting platform team confirmation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transport log optimization&lt;/strong&gt;: 216 GB/month ingestion rate, costing $164/month. This one needs careful analysis of whether the log data feeds any dashboards or alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy dashboard cleanup&lt;/strong&gt;: 12 CloudWatch dashboards that nobody has viewed in months, costing $45-60/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unused alarm cleanup&lt;/strong&gt;: Alarms attached to deleted resources, $10-20/month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total potential: $110-165/month once all items are resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Log retention is a governance problem, not a technical one. Set a default retention policy (I recommend 30 days for non-production, 90 days for production) at the organizational level, and require explicit justification for anything longer. The cost of storing logs you will never read adds up faster than you expect.&lt;/p&gt;

&lt;p&gt;This is not unique to us. &lt;a href="https://medium.com/@dobeerman/tackling-hidden-aws-costs-the-cleanup-of-dormant-cloudwatch-log-groups-bonus-c08496449f05" rel="noopener noreferrer"&gt;One team discovered&lt;/a&gt; that thousands of dormant log groups across multiple regions were costing several hundred dollars per month storing logs nobody would ever read. Another &lt;a href="https://www.infracost.io/finops-policies/aws-cloudwatch-consider-using-a-retention-policy/" rel="noopener noreferrer"&gt;AWS case study&lt;/a&gt; showed a single log group dropping from $415/year to $18/year — a 95% reduction — simply by setting a 30-day retention policy. AWS even &lt;a href="https://aws.amazon.com/blogs/infrastructure-and-automation/reduce-log-storage-costs-by-automating-retention-settings-in-amazon-cloudwatch/" rel="noopener noreferrer"&gt;published an automation guide&lt;/a&gt; for enforcing retention policies at scale because this problem is so widespread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: S3 lifecycle policies — $75-104/month saved
&lt;/h2&gt;

&lt;p&gt;S3 storage costs are easy to ignore because individual buckets rarely cost more than a few dollars. But across four accounts with years of accumulated data, the total becomes significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  The lifecycle approach
&lt;/h3&gt;

&lt;p&gt;I applied a standard lifecycle policy to three buckets in the waitlist account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0-30 days:   S3 Standard (frequent access for recent data)
30-90 days:  S3 Standard-IA (infrequent access, lower storage cost)
90+ days:    S3 Glacier Instant Retrieval (archive, sub-millisecond access)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
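&lt;p&gt;For reference, the same three-tier policy expressed as the &lt;code&gt;Rules&lt;/code&gt; payload that boto3's &lt;code&gt;put_bucket_lifecycle_configuration&lt;/code&gt; expects (the rule ID is mine; apply per bucket after checking for prefix-specific exceptions):&lt;/p&gt;

```python
# The 30/90-day tiering policy as an S3 lifecycle Rules payload.
# Rule ID is illustrative. Pass as LifecycleConfiguration={"Rules": ...}
# to boto3's put_bucket_lifecycle_configuration.
LIFECYCLE_RULES = [
    {
        "ID": "tier-by-age-standard-ia-glacier-ir",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # every object in the bucket
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER_IR"},
        ],
    }
]
print([t["Days"] for t in LIFECYCLE_RULES[0]["Transitions"]])   # [30, 90]
```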



&lt;p&gt;The three buckets and their sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket pattern&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-papertrail&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,293 GB&lt;/td&gt;
&lt;td&gt;Application log exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-datalab-athena-tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;238 GB&lt;/td&gt;
&lt;td&gt;Analytical query results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-upload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;352 GB&lt;/td&gt;
&lt;td&gt;User-uploaded content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total: 1,883 GB across three buckets, with the vast majority of objects older than 90 days and rarely accessed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Intelligent-Tiering was the wrong answer
&lt;/h3&gt;

&lt;p&gt;My first instinct was S3 Intelligent-Tiering — let AWS automatically move objects between access tiers based on usage patterns. It sounds ideal. But when I ran the numbers, it was actually more expensive for our buckets.&lt;/p&gt;

&lt;p&gt;The reason: Intelligent-Tiering charges a monitoring fee of $0.0025 per 1,000 objects per month. For buckets with millions of small objects, this monitoring cost exceeds the storage savings from automatic tiering.&lt;/p&gt;

&lt;p&gt;Consider a bucket with 54.8 million objects averaging under 128 KB each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Intelligent-Tiering monitoring cost:
  54,800,000 objects / 1,000 × $0.0025 = $137/month in monitoring alone

Standard-IA storage savings for same bucket:
  Negligible — objects under 128 KB are charged minimum 128 KB in IA,
  so small objects can actually cost MORE in IA than Standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The monitoring fee alone was more than the entire bucket's current storage cost. Lifecycle policies with explicit transitions based on object age are cheaper and more predictable for buckets with high object counts or small average object sizes.&lt;/p&gt;
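&lt;p&gt;The fee math generalizes to any bucket you are sizing up. Treat it as an upper bound, since whether a given object is actually monitored depends on its size:&lt;/p&gt;

```python
# Intelligent-Tiering monitoring fee at the published $0.0025 per
# 1,000 objects/month rate. Upper bound: not every object is
# necessarily monitored.
def it_monitoring_fee(object_count, rate_per_1k=0.0025):
    return object_count / 1000 * rate_per_1k

print(it_monitoring_fee(54_800_000))   # 137.0
```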

&lt;p&gt;&lt;strong&gt;When Intelligent-Tiering does make sense&lt;/strong&gt;: Buckets with fewer, larger objects (think database backups, media files) where the monitoring fee per object is negligible compared to the storage cost delta between tiers. For log-style buckets with millions of small files, stick with lifecycle rules.&lt;/p&gt;

&lt;p&gt;This is a common trap. &lt;a href="https://sedai.io/blog/amazon-s3-intelligent-tiering-storage-optimization" rel="noopener noreferrer"&gt;Sedai's analysis&lt;/a&gt; confirmed that for workloads with millions of small files and predictable access patterns, explicit lifecycle rules are cheaper than Intelligent-Tiering because they eliminate the monitoring fee entirely while achieving the same storage outcome. Even &lt;a href="https://aws.amazon.com/s3/storage-classes/intelligent-tiering/" rel="noopener noreferrer"&gt;AWS's own pricing page&lt;/a&gt; notes that objects under 128 KB are never auto-tiered — they simply stay in the Frequent Access tier at the standard rate, so Intelligent-Tiering buys them nothing. If you know your data's access pattern (and for logs, you do), skip the automation and set explicit rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is still pending
&lt;/h3&gt;

&lt;p&gt;Five additional buckets totaling over 3 TB are awaiting platform team review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application logs bucket (916 GB, actively written)&lt;/li&gt;
&lt;li&gt;Access logging bucket (751 GB, 54.8M small objects)&lt;/li&gt;
&lt;li&gt;Device logs bucket (730 GB, 9.6M objects)&lt;/li&gt;
&lt;li&gt;VPC flow logs bucket (411 GB, actively written)&lt;/li&gt;
&lt;li&gt;Data dump bucket (241 GB, 19.1M objects)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several of these are candidates for expiration policies (delete after N days) rather than just tiering, which would further reduce costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: $644/month immediate, $933-1,017/month total
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workstream&lt;/th&gt;
&lt;th&gt;Immediate savings&lt;/th&gt;
&lt;th&gt;Pending platform approval&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1 VPC cleanup&lt;/td&gt;
&lt;td&gt;$600/mo&lt;/td&gt;
&lt;td&gt;$130/mo&lt;/td&gt;
&lt;td&gt;$748/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2 CloudWatch logs&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;td&gt;$105-160/mo&lt;/td&gt;
&lt;td&gt;$110-165/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#3 S3 lifecycle&lt;/td&gt;
&lt;td&gt;$39/mo&lt;/td&gt;
&lt;td&gt;$36-65/mo&lt;/td&gt;
&lt;td&gt;$75-104/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$644/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$271-355/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$933-1,017/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Annual: $11,196 - $12,204&lt;/strong&gt;&lt;/p&gt;
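&lt;p&gt;The headline range is just the three workstreams added up, which makes it a two-line sanity check before the numbers go in front of stakeholders:&lt;/p&gt;

```python
# Reproduce the headline monthly and annual ranges from the table.
low = 748 + 110 + 75       # VPC + CloudWatch + S3, conservative estimates
high = 748 + 165 + 104     # same workstreams, upper estimates
print(low, high)                 # 933 1017
print(low * 12, high * 12)       # 11196 12204
```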

&lt;p&gt;Every dollar saved here came from resources that were either completely unused or storing data that nobody was reading. No production services were modified. No architectural changes were required. No users were impacted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The roadmap: $48-67K/year still on the table
&lt;/h2&gt;

&lt;p&gt;The VPC/CloudWatch/S3 work was the low-hanging fruit — things I could verify and execute without risking service availability. The next phase requires platform team coordination because it involves production databases and compute.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Monthly savings&lt;/th&gt;
&lt;th&gt;Key items&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P0 Immediate&lt;/td&gt;
&lt;td&gt;Unused databases&lt;/td&gt;
&lt;td&gt;$1,772-2,172/mo&lt;/td&gt;
&lt;td&gt;Idle RDS instances, oversized ElastiCache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P1 Short-term&lt;/td&gt;
&lt;td&gt;Legacy compute&lt;/td&gt;
&lt;td&gt;$933-1,113/mo&lt;/td&gt;
&lt;td&gt;More unused DBs, idle EC2, DocumentDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 Medium-term&lt;/td&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;$1,290-2,280/mo&lt;/td&gt;
&lt;td&gt;EKS Karpenter, dev scheduling, gp2-to-gp3 migration, Redis EOL upgrades&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Full roadmap: $3,995-5,565/month = $48,000-67,000/year&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The P0 items alone — a few RDS instances that are running but not connected to any application — would nearly double the savings achieved so far. But these require the platform team to confirm that the databases are truly unused, not just "used once a quarter for a batch job nobody documented."&lt;/p&gt;

&lt;p&gt;This is why the 3-phase methodology matters: the cost of accidentally deleting a database that someone needs is orders of magnitude higher than the monthly savings from removing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig6my7pt5f01po24tk4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig6my7pt5f01po24tk4x.png" alt="3-phase methodology: Analyze → Confirm → Execute" width="720" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The methodology: why process matters more than tools
&lt;/h2&gt;

&lt;p&gt;Every optimization followed a 3-phase workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Analyze&lt;/strong&gt;: Gather metrics from Cost Explorer and SigNoz. Calculate exact savings. Map dependencies. Classify the downtime risk: zero-impact (unused resource), brief disruption (restart required), or service-affecting (production traffic).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Platform team confirm&lt;/strong&gt;: For anything touching production or anything where ownership is ambiguous, I create a Confluence page with the analysis, tag the responsible team, and wait for explicit confirmation. This is the slow part, and it should be. Rushing this step is how you delete the database that runs the quarterly compliance report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3 — Execute&lt;/strong&gt;: Cross-check the plan one final time, send a Slack notification to the operations channel, execute the change, verify the expected cost reduction appears in the next billing cycle, and document the outcome.&lt;/p&gt;

&lt;p&gt;The prioritization within each phase follows two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Downtime risk first&lt;/strong&gt;: Zero-impact items (unused resources) before brief-disruption items before service-affecting items. This builds trust with the platform team — they see you removing dead weight before you propose changes to anything live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings amount second&lt;/strong&gt;: Within the same risk tier, tackle the highest-dollar items first for maximum ROI on your analysis time.&lt;/li&gt;
&lt;/ol&gt;
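&lt;p&gt;The two rules collapse into a single sort key. A toy sketch -- the backlog items below are invented for illustration, not the real inventory:&lt;/p&gt;

```python
# The two prioritization rules as one sort key:
# risk tier first (zero-impact before brief-disruption before
# service-affecting), then largest monthly savings within the same tier.

RISK_ORDER = {"zero-impact": 0, "brief-disruption": 1, "service-affecting": 2}

def prioritize(items):
    return sorted(items, key=lambda it: (RISK_ORDER[it["risk"]],
                                         -it["monthly_savings"]))

backlog = [
    {"name": "downsize prod RDS", "risk": "service-affecting", "monthly_savings": 900},
    {"name": "delete idle RDS", "risk": "zero-impact", "monthly_savings": 600},
    {"name": "remove orphaned log groups", "risk": "zero-impact", "monthly_savings": 50},
    {"name": "gp2-to-gp3 migration", "risk": "brief-disruption", "monthly_savings": 200},
]

for item in prioritize(backlog):
    print(item["name"])
```

&lt;p&gt;Note that the $900/month item sorts last despite being the biggest number: trust comes before savings.&lt;/p&gt;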

&lt;p&gt;This workflow is not exciting. It does not involve a fancy FinOps platform or automated recommendation engine. But it works, and it ensures that every change is reversible, documented, and approved by someone who understands the service context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next: automating the audit
  What's next: automating the audit
&lt;/h2&gt;

&lt;p&gt;The manual audit across four accounts took about two weeks of part-time work. Most of that time was spent on the same repetitive queries: find unattached EIPs, find log groups with no recent events, find S3 buckets without lifecycle policies, find resources with zero utilization.&lt;/p&gt;
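&lt;p&gt;Most of those queries are one-liners once you know which response field to filter on. Take unattached EIPs: &lt;code&gt;DescribeAddresses&lt;/code&gt; only includes an &lt;code&gt;AssociationId&lt;/code&gt; when the address is attached to something. A sketch on sample data -- in a real scan the list would come from &lt;code&gt;boto3.client("ec2").describe_addresses()["Addresses"]&lt;/code&gt;:&lt;/p&gt;

```python
# One of the repetitive audit queries: unattached Elastic IPs.
# The dicts mirror the ec2 DescribeAddresses response shape, where
# "AssociationId" is present only when the address is attached; the sample
# data stands in for a live API call so this runs without credentials.

def find_unattached_eips(addresses):
    """Return allocation IDs of EIPs with no association (billed while idle)."""
    return [a["AllocationId"] for a in addresses if "AssociationId" not in a]

sample = [
    {"AllocationId": "eipalloc-aaa", "AssociationId": "eipassoc-111"},
    {"AllocationId": "eipalloc-bbb"},  # unattached -&gt; candidate for release
]
print(find_unattached_eips(sample))  # ['eipalloc-bbb']
```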

&lt;p&gt;I am building an open-source CLI tool — &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — to automate these patterns across multi-account AWS environments. The goal is to reduce the initial audit from two weeks to an afternoon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-account scanning via AWS Organizations or assumed roles&lt;/li&gt;
&lt;li&gt;Automatic detection of orphaned resources (unattached EIPs, empty log groups, idle NAT Gateways)&lt;/li&gt;
&lt;li&gt;S3 lifecycle policy recommendations based on access patterns and object size distribution&lt;/li&gt;
&lt;li&gt;CloudWatch log group analysis with retention recommendations&lt;/li&gt;
&lt;li&gt;Cost-per-resource estimates using the AWS Pricing API&lt;/li&gt;
&lt;li&gt;Markdown and CSV report generation for stakeholder review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you manage multiple AWS accounts and have seen the same patterns I described here, the repo could use contributors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The biggest savings are in resources that should not exist.&lt;/strong&gt; Right-sizing and reservations get all the blog posts, but a single abandoned VPC with six WorkSpaces cost more per month than all the node right-sizing I could do across every dev environment. Always start by looking for entire stacks that can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-account architecture is a FinOps feature.&lt;/strong&gt; Per-account billing makes cost ownership unambiguous. When I proposed removing the privacy VPC in the loyalty account, I was talking to one team about one account's bill. There was no allocation debate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Intelligent-Tiering is not a universal answer.&lt;/strong&gt; For high-object-count buckets with small files, the per-object monitoring fee can exceed the storage savings. Always run the numbers before enabling it. Lifecycle policies with explicit age-based transitions are cheaper and more predictable for log-style workloads.&lt;/p&gt;
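&lt;p&gt;The arithmetic is worth showing. Intelligent-Tiering charges a flat monitoring fee per object, while the savings scale with object size, so there is a break-even size below which tiering loses money. The prices below are illustrative us-east-1 list prices at the time of writing -- check current pricing before relying on them:&lt;/p&gt;

```python
# Back-of-envelope break-even size for S3 Intelligent-Tiering.
# Illustrative us-east-1 list prices; verify against current AWS pricing.

MONITORING_PER_OBJECT = 0.0025 / 1000   # $/object/month ($0.0025 per 1,000 objects)
STANDARD_PER_GB = 0.023                 # $/GB/month, S3 Standard
INFREQUENT_PER_GB = 0.0125              # $/GB/month, Infrequent Access tier

# An object only saves money once (size_gb * price delta) exceeds the fee.
break_even_gb = MONITORING_PER_OBJECT / (STANDARD_PER_GB - INFREQUENT_PER_GB)
break_even_kib = break_even_gb * 1024 * 1024

print(f"break-even object size: ~{break_even_kib:.0f} KiB")
```

&lt;p&gt;At a break-even of roughly a quarter megabyte, a bucket full of small log objects mostly pays the monitoring fee without earning the savings. (AWS also excludes objects under 128 KB from monitoring and auto-tiering entirely, which softens but does not eliminate the problem.)&lt;/p&gt;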

&lt;p&gt;&lt;strong&gt;4. Log retention is a governance gap, not a technical problem.&lt;/strong&gt; When the default is "retain forever," every log group becomes a slowly growing cost center. Set organizational defaults (30 days non-prod, 90 days prod) and require justification for longer retention.&lt;/p&gt;
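&lt;p&gt;Enforcing those defaults starts with finding the unbounded groups. A sketch of the audit half, assuming log-group dicts that mirror &lt;code&gt;DescribeLogGroups&lt;/code&gt; output, where a missing &lt;code&gt;retentionInDays&lt;/code&gt; means retain forever; the defaults are the ones suggested above:&lt;/p&gt;

```python
# Flag log groups that still default to "retain forever" and propose the
# organizational default: 30 days non-prod, 90 days prod. The dicts mirror
# the CloudWatch Logs DescribeLogGroups response shape, where a missing
# "retentionInDays" key means never-expire; names here are made up.

DEFAULTS = {"prod": 90, "nonprod": 30}

def retention_plan(log_groups, env):
    """Return (log_group_name, proposed_retention_days) for unbounded groups."""
    return [
        (lg["logGroupName"], DEFAULTS[env])
        for lg in log_groups
        if "retentionInDays" not in lg  # no retention set = retain forever
    ]

groups = [
    {"logGroupName": "/aws/lambda/report", "retentionInDays": 14},
    {"logGroupName": "/aws/eks/cluster"},  # unbounded
]
print(retention_plan(groups, "nonprod"))  # [('/aws/eks/cluster', 30)]
```

&lt;p&gt;Applying a plan entry is then a single &lt;code&gt;put_retention_policy&lt;/code&gt; call per log group.&lt;/p&gt;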

&lt;p&gt;&lt;strong&gt;5. Process builds trust, trust unlocks bigger savings.&lt;/strong&gt; The $644/month I saved independently was useful, but the $48-67K/year roadmap requires platform team buy-in. By starting with zero-risk items (dead VPCs, orphaned log groups) and following a documented workflow, I built the credibility needed for the team to approve changes to production infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Cost optimization is not a project — it is a practice.&lt;/strong&gt; These savings will erode if nobody checks back in six months. I set up monthly cost reviews, budget alarms on every account, and Slack notifications for every optimization action. The next engineer who inherits this infrastructure will at least know what was changed and why.&lt;/p&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm June, an SRE with 5+ years of experience at Korea's top tech companies including Coupang (NYSE: CPNG) and NAVER Corporation. I write about real-world infrastructure problems. Find me on &lt;a href="https://linkedin.com/in/junegu" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloudcost</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
