DEV Community

June Gu

The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost

Tags: aws finops sre reliability devops


Last month I saved $12K/year by cleaning up AWS waste across four accounts. But before I touched a single resource, I spent two days just analyzing. Not because I'm cautious by nature — because I've seen what happens when people skip this step.

A colleague at a previous company followed AWS Cost Explorer's recommendation to downsize an RDS instance. It averaged 12% CPU — the downsize seemed obvious. What they didn't check: that instance handled a 4x traffic spike every Friday at 6 PM. The downsize turned Friday evening into a 90-minute outage, a rollback, and an incident report that took longer to write than the analysis would have.

The rule: never optimize what you don't fully understand.

This article is the pre-flight checklist I run before every cost optimization. It's conversational by design — I want you to internalize the thinking, not just memorize a checklist.

The SRE Guarantee: Before any optimization, we guarantee error budget protection, minimal downtime, and reliability over savings. See the series introduction for the full guarantee. Every check in this article enforces that guarantee.

Automate this: finops preflight runs this entire analysis from your terminal.
See aws-finops-toolkit.


1. Traffic: What's the actual load?

The first question isn't "what does this cost?" — it's "what does this do?"

What to pull:

  • Current TPS / QPS (transactions or queries per second)
  • Peak QPS over the last 30 days
  • When the peak happens (time of day, day of week)

How I check it:

# ALB request count — last 7 days, 1-hour intervals
# Note: `date -v-7d` is BSD/macOS syntax; on GNU/Linux use `date -u -d '7 days ago'`
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/pn-sh-alb/abc123 \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum \
  --profile dodo-dev \
  --region ap-northeast-2

# RDS connections — peak over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=pn-sh-rds-prod \
  --start-time $(date -u -v-14d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Maximum \
  --profile dodo-dev

What I'm looking for:

Avg QPS: 320/s      ← This is what CPU metrics reflect
Peak QPS: 1,247/s   ← This is what the instance must survive
Ratio: 3.9x         ← If > 3x, be very careful downsizing
Peak window: 11-13h, 18-20h KST  ← Never change anything during these hours
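Those numbers don't require a spreadsheet — the `get-metric-statistics` JSON can be reduced with `jq`. A sketch against a hypothetical response (the field shapes mirror the CLI output, but the values are invented):

```shell
# Hypothetical get-metric-statistics response — hourly request Sums (values invented)
json='{"Datapoints":[{"Sum":1152000},{"Sum":4489200},{"Sum":1036800}]}'

# Divide each hourly Sum by 3600 to get QPS, then report avg, peak, and the ratio
echo "$json" | jq -r '[.Datapoints[].Sum / 3600]
  | (add / length) as $avg | max as $peak
  | "Avg QPS: \($avg | round)  Peak QPS: \($peak | round)  Ratio: \(($peak / $avg * 10 | round) / 10)x"'
```

Pipe the real CLI output in instead of `$json` and you get the avg/peak/ratio line directly in the terminal.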

The conversation with yourself:

"This instance averages 12% CPU, but peaks at 47% during lunch hour. If I downsize from xlarge to large, the peak would hit 94% CPU on the smaller instance. That's not optimization — that's a time bomb."

The difference between an average and a peak can be the difference between a smooth optimization and a 2 AM page.

Toolkit: finops preflight --target i-0abc123 --profile prod pulls this automatically from CloudWatch.


2. Quality of Service: Where are we against our SLOs?

Before touching anything, I need to know: how much room do we have to experiment?

What to check:

  • Current p99 latency vs target (e.g., p99 < 200ms)
  • Availability % vs target (e.g., 99.9%)
  • Error rate trend (stable, improving, degrading?)
  • Error budget remaining this month

How I check it (SigNoz):

-- SigNoz ClickHouse query — p99 latency, last 7 days
SELECT
  toStartOfHour(timestamp) as hour,
  quantile(0.99)(duration_nano) / 1e6 as p99_ms,
  count() as request_count,
  countIf(status_code >= 500) / count() * 100 as error_rate_pct
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'gateway-server'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour

The decision matrix:

| Error Budget Remaining | Action |
| --- | --- |
| > 70% | Green — safe to optimize, schedule at off-peak |
| 40-70% | Yellow — optimize only low-risk items (orphan cleanup, dev/staging) |
| < 40% | Red — do not touch anything. Focus on reliability first. |
| Budget burned (SLO breached) | Stop. Any optimization must IMPROVE reliability, not risk it. |
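The budget-remaining percentage itself is simple arithmetic. A sketch for a 99.9% availability SLO over a 30-day window (the 9 minutes of observed downtime is an illustrative number):

```shell
# Error-budget math for a 99.9% availability SLO over 30 days (example numbers)
awk -v slo=0.999 -v total_min=43200 -v bad_min=9 'BEGIN {
  allowed   = (1 - slo) * total_min          # 43.2 minutes of allowed downtime
  remaining = (1 - bad_min / allowed) * 100  # fraction of the budget still unspent
  printf "Allowed downtime: %.1f min\nBudget remaining: %.1f%%\n", allowed, remaining
}'
```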

The conversation:

"Our gateway-server has 78% error budget remaining. p99 is 142ms against a 200ms target. That's a comfortable margin — we can proceed with dev/staging optimizations. But I'll hold off on prod RDS right-sizing until next month when we have a full 30-day baseline after the last deployment."

This is where FinOps meets SRE. A FinOps tool tells you to downsize. An SRE checks if the system can absorb the risk.

Toolkit: finops preflight --target gateway-server --apm signoz --apm-endpoint http://signoz.internal:3301 queries SigNoz for SLO status.


3. Cache Strategy: What's already absorbing load?

If a service is low-CPU because Redis handles 85% of requests, downsizing the backend might be fine. But if the cache fails, that backend needs to handle 100% — at the original capacity.

What to check:

  • Cache hit rate (ElastiCache / Redis)
  • Cache eviction rate
  • Cache TTL settings
  • What happens on cache miss (DB query? External API call?)

How I check it:

# ElastiCache hit rate — last 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name CacheHitRate \
  --dimensions Name=CacheClusterId,Value=pn-sh-redis-dev \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average \
  --profile dodo-dev

# Eviction rate — if rising, cache is under pressure
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name Evictions \
  --dimensions Name=CacheClusterId,Value=pn-sh-redis-dev \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum \
  --profile dodo-dev

The conversation:

"Redis hit rate is 87%. That means only 13% of requests actually reach the database. Current DB CPU is 12% — but without cache, it would be ~92%. If I downsize this DB, I'm betting that Redis never goes down. Is that a bet I want to make?"

Answer: In prod, no. In dev/staging where I can tolerate cache failures, yes.

The rule: Factor cache dependency into every right-sizing decision. CPU utilization without cache context is misleading.
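The "~92% without cache" figure comes from a simple linear model: divide the observed CPU by the cache-miss fraction. A sketch (this assumes DB load scales linearly with the requests that reach it, which is an approximation):

```shell
# Rough no-cache CPU estimate: observed CPU / miss fraction (linear-load assumption)
awk -v cpu=12 -v hit=0.87 'BEGIN {
  printf "No-cache CPU estimate: %.0f%%\n", cpu / (1 - hit)
}'
```

If that estimate lands anywhere near 100%, the instance size is effectively load-bearing for a cache outage, whatever its average CPU says.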


4. Incident History: What has broken before?

The best predictor of future incidents is past incidents. Before touching a resource, I check: has anything involving this service broken in the last 90 days?

What to check:

  • Incident count involving the target service (last 90 days)
  • Root causes — was it capacity-related?
  • Related services that were impacted
  • Time to recovery

Where to look:

  • SigNoz alerts history
  • PagerDuty/Slack incident channels
  • Post-mortem docs (our pn-infra-docs/incidents/)
  • CloudWatch alarm history
# CloudWatch alarm history for the target
aws cloudwatch describe-alarm-history \
  --alarm-name "pn-sh-rds-prod-cpu-high" \
  --history-item-type StateUpdate \
  --start-date $(date -u -v-90d +%Y-%m-%dT%H:%M:%S) \
  --end-date $(date -u +%Y-%m-%dT%H:%M:%S) \
  --profile dodo-dev
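The alarm history reduces to a single "how many times did this fire" number. A sketch against a sample response (the `HistorySummary` wording follows CloudWatch's "from X to ALARM" format, but the items here are invented):

```shell
# Sample describe-alarm-history response (stands in for the real CLI output)
history='{"AlarmHistoryItems":[
  {"HistorySummary":"Alarm updated from OK to ALARM"},
  {"HistorySummary":"Alarm updated from ALARM to OK"},
  {"HistorySummary":"Alarm updated from INSUFFICIENT_DATA to OK"},
  {"HistorySummary":"Alarm updated from OK to ALARM"}]}'

# Count only the transitions INTO the ALARM state
echo "$history" | jq '[.AlarmHistoryItems[]
  | select(.HistorySummary | endswith("to ALARM"))] | length'
```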

The conversation:

"Two incidents in the last 90 days. One was a network blip (unrelated). The other was a connection pool exhaustion on this exact RDS instance during a traffic spike — we had to vertically scale up. That was 6 weeks ago."

"If I downsize this instance now, I'm reducing the headroom that prevented that from happening again. Let me check the connection metrics more carefully before proceeding."

Red flags that block optimization:

  • Capacity-related incident in the last 60 days → wait
  • Service was recently scaled UP to fix an issue → definitely wait
  • Ongoing performance investigation → do not touch

5. Access Setup: Credentials for CLI Analysis

This is practical, not conceptual. Before you can analyze anything, you need:

AWS credentials:

# Verify access to all target accounts
aws sts get-caller-identity --profile dodo-dev    # shared
aws sts get-caller-identity --profile dodo-prod   # dodopoint
aws sts get-caller-identity --profile now          # nowwaiting
aws sts get-caller-identity --profile placen       # nexus hub

# Required IAM permissions (read-only):
# - cloudwatch:GetMetricStatistics
# - ec2:DescribeInstances, DescribeNatGateways, DescribeVolumes
# - rds:DescribeDBInstances
# - elasticache:DescribeCacheClusters
# - s3:ListBuckets, GetBucketPolicy
# - ce:GetCostAndUsage (Cost Explorer)

APM access (SigNoz):

# SigNoz API — verify connectivity
curl -s http://signoz.internal:3301/api/v1/services | jq '.data | length'

# If using SigNoz Cloud:
export SIGNOZ_API_KEY="your-api-key"
curl -s -H "SIGNOZ-API-KEY: $SIGNOZ_API_KEY" \
  https://your-instance.signoz.io/api/v1/services

The toolkit config:

# finops.yaml — account + APM configuration
accounts:
  - profile: dodo-dev
    name: Shared (Dev)
    region: ap-northeast-2
  - profile: dodo-prod
    name: DodoPoint (Prod)
    region: ap-northeast-1
  - profile: placen
    name: Nexus Hub
    region: ap-northeast-2

apm:
  provider: signoz
  endpoint: http://signoz.internal:3301
  # or api_key: ${SIGNOZ_API_KEY}

The rule: Read-only access only. The analysis phase should never modify anything. If your credentials have write access, consider creating a dedicated FinOpsReadOnly role.
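One way to enforce that is a dedicated policy on the `FinOpsReadOnly` role. A sketch of what it could look like — the actions beyond the comment list above cover other commands used in this article, and exact action names should be verified against your own CLI calls (note that `aws s3api list-buckets` requires `s3:ListAllMyBuckets`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FinOpsReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarmHistory",
        "ec2:DescribeInstances",
        "ec2:DescribeNatGateways",
        "ec2:DescribeVolumes",
        "ec2:DescribeReservedInstances",
        "rds:DescribeDBInstances",
        "elasticache:DescribeCacheClusters",
        "s3:ListAllMyBuckets",
        "s3:GetBucketPolicy",
        "ce:GetCostAndUsage",
        "savingsplans:DescribeSavingsPlans"
      ],
      "Resource": "*"
    }
  ]
}
```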

Toolkit: finops preflight --profile dodo-dev validates credentials and permissions before analysis.


6. Target Identification: Instance + APM Mapping

Now we need to map the infrastructure resource (EC2 instance, RDS instance) to the service it runs and the APM dashboard that monitors it.

Why this matters: AWS sees "i-0abc123" and "db.r6g.xlarge". Your team sees "gateway-server" and "the ordering database." FinOps decisions need both views.

How to build the mapping:

# EC2: get instance → service mapping from tags
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId, InstanceType, Tags[?Key==`Name`].Value | [0], Tags[?Key==`Service`].Value | [0]]' \
  --output table \
  --profile dodo-dev

# RDS: instance → service mapping
aws rds describe-db-instances \
  --query 'DBInstances[].[DBInstanceIdentifier, DBInstanceClass, Engine, EngineVersion]' \
  --output table \
  --profile dodo-dev
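If `Service` tags are applied consistently, the same mapping can be assembled programmatically. A sketch against a sample `describe-instances` response (the instance and tags here are hypothetical):

```shell
# Sample describe-instances response (stands in for the real CLI output)
instances='{"Reservations":[{"Instances":[
  {"InstanceId":"i-0abc123","InstanceType":"t3.large",
   "Tags":[{"Key":"Name","Value":"eks-gateway-node"},
           {"Key":"Service","Value":"gateway-server"}]}]}]}'

# Emit "instance  type  service" rows; fall back to "untagged" when the tag is missing
echo "$instances" | jq -r '.Reservations[].Instances[]
  | [.InstanceId, .InstanceType,
     ((.Tags[] | select(.Key == "Service") | .Value) // "untagged")]
  | @tsv'
```

Anything that comes out as `untagged` is itself a finding — you can't assess blast radius for a resource nobody claims.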

The result you want:

AWS Resource Instance Type Service SigNoz Dashboard Owner
i-0abc123 t3.large EKS node (gateway) gateway-server Platform
pn-sh-rds-prod db.r6g.xlarge ConnectOrder DB connectorder-db Platform
pn-sh-redis-dev cache.t3.medium Session cache redis-metrics Platform

The conversation:

"I see this RDS instance costs $380/month. But what service uses it? Ah — it's the ConnectOrder primary database. That means gateway-server, auth-server, and user-server all depend on it. That's a high blast radius. Let me check SigNoz for all three services, not just the database metrics."

The rule: Never optimize a resource without knowing what depends on it.

Toolkit: finops preflight --target pn-sh-rds-prod discovers dependent services and maps to APM.


7. Traffic Pattern & Service Specification

Now I zoom out. Not just "what's the current load" but "what does the traffic pattern look like over a week, a month?"

What to analyze:

  • Weekday vs weekend traffic ratio
  • Daily peak patterns (lunch hour? evening?)
  • Monthly patterns (start of month, end of month, paydays?)
  • Seasonal patterns (holidays, events)
  • Service type: stateless (can use Spot) vs stateful (cannot)
  • Dependency chain: who calls this? who does this call?

How I visualize it:

# Hourly request count — last 30 days — export for pattern analysis
aws cloudwatch get-metric-data \
  --metric-data-queries '[{
    "Id": "requests",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/pn-sh-alb/abc123"}]
      },
      "Period": 3600,
      "Stat": "Sum"
    }
  }]' \
  --start-time $(date -u -v-30d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --profile dodo-dev \
  --output json > traffic-30d.json

Pattern analysis result:

Service: gateway-server
Type: stateless (REST API gateway)
Dependencies: auth-server, user-server, SSE-server (downstream)

Traffic Pattern:
  Weekday avg:  420 QPS
  Weekend avg:  180 QPS (43% of weekday)
  Peak hours:   11:00-13:00, 18:00-20:00 KST
  Peak QPS:     1,247
  Low point:    02:00-06:00 KST (~30 QPS)

  Mon-Fri pattern: stable
  Saturday:     -40% from weekday
  Sunday:       -55% from weekday
  Month-end:    no significant spike

Recommendation:
  - Stateless → Spot candidate ✅
  - Predictable pattern → scheduling candidate ✅ (scale down 22:00-07:00)
  - High peak:avg ratio (3.9x) → careful with right-sizing ⚠️

The conversation:

"Weekend traffic drops to 43% of weekday. That means weekend EKS nodes are 57% wasted. Instead of right-sizing (which affects all days), I could use HPA with lower weekend min replicas. Or scheduled scaling. That's safer than shrinking the instance type — I keep peak capacity on weekdays."

Holiday traffic spikes

If your services handle seasonal traffic — holidays, promotions, events — this changes everything about when you can optimize. Squadcast's SRE guide recommends analyzing postmortems from past holiday incidents to build a pre-season checklist — the same principle applies to FinOps freezes.

For F&B/retail platforms (like ours), Korean holidays drive 2-5x normal traffic:

  • Chuseok (Korean Thanksgiving): September, 3-5 day spike
  • Lunar New Year: January/February, 3-5 day spike
  • Christmas/year-end promotions: December

The playbook:

  1. Freeze all FinOps changes 2 weeks before any holiday period
  2. Verify current capacity handles last year's holiday peak (check historical CloudWatch data)
  3. Schedule all optimizations for the quiet period after the holiday
  4. Document the holiday calendar in finops.yaml so the toolkit warns you automatically
# finops.yaml — holiday calendar
preflight:
  holidays:
    - name: Chuseok
      start: "2026-09-14"
      end: "2026-09-17"
      freeze_start: "2026-09-01"  # 2 weeks before
    - name: Lunar New Year
      start: "2027-01-28"
      end: "2027-01-30"
      freeze_start: "2027-01-14"
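The freeze check itself is one string comparison — ISO dates sort lexicographically. A minimal sketch with the Chuseok window hardcoded (the toolkit would read these from `finops.yaml` instead):

```shell
# Freeze-window check: ISO-8601 dates compare correctly as plain strings
today="2026-09-05"
freeze_start="2026-09-01"
holiday_end="2026-09-17"

verdict=$(awk -v d="$today" -v s="$freeze_start" -v e="$holiday_end" \
  'BEGIN { if (d >= s && d <= e) print "WAIT"; else print "GO" }')
echo "$verdict"   # inside the freeze window → WAIT
```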

Toolkit: finops preflight checks the holiday calendar and returns WAIT if within a freeze window.

Batch systems

Batch jobs are invisible to daily averages but define your actual capacity floor.

Common batch patterns:

  • ETL pipelines (nightly or hourly)
  • Billing runs (start/end of month)
  • Data exports and report generation
  • Scheduled sync jobs between services

The playbook:

  1. Map every batch schedule: cron jobs, EventBridge rules, Airflow DAGs
  2. Check: does the batch peak overlap with the downsized capacity?
  3. Rule: size for batch peak, not daily average. If a nightly ETL uses 80% CPU for 2 hours, the instance must handle 80% — even if the 14-day average is 12%.
  4. If batch is weekly or monthly, a 14-day CPU average is misleading — use the batch window CPU instead
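The "size for the batch peak" rule is easy to check numerically: compare the window's peak to its average. A sketch over hypothetical hourly CPU samples with a batch spike buried in them:

```shell
# Hypothetical hourly CPU samples (%); the 75 is a batch window, not noise
samples="12 11 13 12 75 12 11 13 12"

echo "$samples" | awk '{
  for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
  avg = sum / NF
  printf "avg=%.1f%% peak=%d%% ratio=%.1fx\n", avg, max, max / avg
  if (max / avg > 3)
    print "WARN: periodic spike detected — size for the peak, not the average"
}'
```

The 3x threshold is a judgment call, not a standard — tune it to how spiky your own batch workloads are.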

The conversation:

"This RDS instance averages 12% CPU. But every Sunday at 2 AM, a billing reconciliation job runs for 3 hours at 75% CPU. If I downsize from xlarge to large, that Sunday job would need 150% of the smaller instance's capacity — it would fail or time out."

Toolkit: finops preflight detects batch patterns by analyzing CloudWatch metric variance and flags resources with periodic spikes.


8. Priority & Freeze Check: Is it safe to act now?

The final gate. Even if all metrics say "go," organizational context can say "stop."

What to check:

| Check | How | Block if... |
| --- | --- | --- |
| Deployment freeze | Team calendar, Slack announcements | Any freeze active |
| Release pending | Sprint board, release schedule | Major release within 2 weeks |
| Service priority level | Service catalog | P0 service → prod changes need CAB approval |
| Active incidents | PagerDuty, incident channels | Any open incident on target service |
| Error severity trend | SigNoz alerts | Error rate trending up (even if within SLO) |
| Change window | Team agreement | Outside agreed change window |
| Dependent team availability | Team calendar | Owning team on vacation or unavailable |

Priority levels and what you can optimize:

| Service Priority | Prod optimization? | Dev/Staging? | Requires approval? |
| --- | --- | --- | --- |
| P0 (critical path) | Maintenance window only | Yes | Yes — team lead + SRE |
| P1 (important) | Off-peak hours | Yes | Yes — SRE |
| P2 (standard) | Business hours OK | Yes | No |
| P3 (non-critical) | Anytime | Yes | No |

The conversation:

"All metrics look good for downsizing the staging RDS. But wait — the ConnectOrder team is launching a new feature next Tuesday. They're running load tests on staging this week. If I downsize now, their load test results will be invalid."

"Let me wait until after their launch. I'll schedule the optimization for the week after."

The rule: FinOps is not urgent. Reliability is urgent. If there's any doubt about timing, wait. The waste will still be there next week.


9. Existing RI/SP Coverage: What's already committed?

This check prevents one of the most expensive FinOps mistakes: downsizing an instance that's covered by a Reserved Instance, and wasting the reservation.

What to check:

  • Active Reserved Instances: do any match the target instance type?
  • Active Savings Plans: what type? (Compute vs EC2 Instance)
  • If downsizing, will the new size still be covered?

How I check it:

# List active Reserved Instances
aws ec2 describe-reserved-instances \
  --filters "Name=state,Values=active" \
  --query 'ReservedInstances[].[InstanceType,InstanceCount,End,Scope]' \
  --output table \
  --profile dodo-dev

# List active Savings Plans
aws savingsplans describe-savings-plans \
  --states active \
  --query 'SavingsPlans[].[SavingsPlanType,Commitment,End]' \
  --output table \
  --profile dodo-dev

The decision matrix:

| Scenario | Risk | Action |
| --- | --- | --- |
| Target covered by RI, same instance type | HIGH | Do NOT downsize — calculate RI remaining value vs savings |
| Target covered by Compute Savings Plan | LOW | Safe to change instance family (Compute SP is flexible) |
| Target covered by EC2 Instance Savings Plan | HIGH | Do NOT change instance family — SP is family-locked |
| No RI/SP coverage | NONE | Safe to proceed |

The conversation:

"I want to downsize this db.r6g.xlarge to db.r6g.large. Let me check... we have a 1-year RI for db.r6g.xlarge with 8 months remaining. The RI costs $3,060/year. Downsizing would waste $2,040 in remaining reservation value. The downsize would save $190/month = $1,520 over 8 months. Net loss: $520. Don't downsize until the RI expires."

The rule: Always check RI/SP coverage before any right-sizing. The savings from downsizing can be completely negated by wasted reservations.

This mistake is widespread. CloudChipr's RDS guide warns explicitly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." ProsperOps notes that if usage falls below commitment, the unused portion goes to waste — making monitoring essential. And Craig Deveson's LinkedIn article documents real strategies for recovering from RI mistakes, including instance size flexibility within the same family and the RI Marketplace for selling unused reservations. The Flexera State of the Cloud Report estimates 27% of all cloud spend is wasted — and RI mismanagement is one of the top contributors.

Toolkit: finops preflight --target <instance> checks active RIs and Savings Plans automatically and returns WAIT if downsizing would waste a reservation.


Putting it all together: the finops preflight report

Here's what the complete analysis looks like when you run it:

$ finops preflight --target pn-sh-rds-prod --profile dodo-dev --apm signoz

╭──────────────────────────────────────────────────────────────────╮
│                    PRE-FLIGHT ANALYSIS                            │
│  Target: pn-sh-rds-prod (db.r6g.xlarge)                        │
│  Account: Shared (468411441302)                                  │
│  Analyzed: 2026-03-14 09:32 KST                                │
╰──────────────────────────────────────────────────────────────────╯

📊 TRAFFIC
  Current QPS:     312 req/s
  Peak QPS (30d):  1,247 req/s
  Peak:Avg ratio:  3.9x
  Peak hours:      11:00-13:00, 18:00-20:00 KST
  Weekend drop:    -57%

📋 QUALITY OF SERVICE (SigNoz)
  p99 latency:     142ms / 200ms target     ✅ 29% headroom
  Availability:    99.94% / 99.9% target     ✅
  Error rate:      0.04%                     ✅
  Error budget:    78% remaining             ✅ GREEN

🗄️ CACHE DEPENDENCY
  ElastiCache:     pn-sh-redis-dev
  Hit rate:        87.3%                     ⚠️ 13% hits DB directly
  Eviction rate:   0.02%                     ✅ Stable
  Cache-miss load: ~40 QPS reaches DB

🔥 INCIDENT HISTORY (90 days)
  Total incidents: 2
  Capacity-related: 1 (connection pool, 6 weeks ago)    ⚠️
  Status:          Resolved, connection pool increased

📊 RESOURCE METRICS (14-day)
  CPU avg:         12.3%
  CPU peak:        47.2%
  Memory avg:      34.7%
  Connections avg: 23 / 1000 max
  IOPS avg:        145 / 3000 provisioned

🔗 DEPENDENCIES
  Services:        gateway-server, auth-server, user-server
  Blast radius:    HIGH (3 services depend on this)

💰 RI/SP COVERAGE
  Reserved Instances: 1 active (db.r6g.xlarge, 8 months remaining)
  Savings Plans:      1 Compute SP ($500/mo commitment)          ✅ Flexible
  RI match:           ⚠️ Target matches active RI
  SP family risk:     None (Compute SP)

🚦 PRIORITY CHECK
  Service level:   P0 (critical path)
  Deploy freeze:   None active
  Pending release: ConnectOrder v2.3 — March 18      ⚠️
  Team available:  Yes

╭──────────────────────────────────────────────────────────────────╮
│ RECOMMENDATION:  ⚠️  WAIT — PROCEED AFTER MARCH 18              │
│                                                                  │
│ Analysis supports right-sizing (CPU avg 12%, 78% error budget), │
│ but:                                                             │
│  1. Pending release March 18 — wait for post-release stability  │
│  2. Connection pool incident 6 weeks ago — verify pool config   │
│  3. P0 service — requires team lead + SRE approval              │
│  4. High blast radius — 3 dependent services                    │
│                                                                  │
│ After March 18 (if SLOs hold):                                  │
│  → Downsize db.r6g.xlarge → db.r6g.large                       │
│  → Add read replica as safety net before resize                 │
│  → Schedule: 02:00-04:00 KST (lowest traffic)                  │
│  → Estimated savings: $190/month ($2,280/year)                  │
│  → Rollback plan: modify-db-instance back to xlarge (<10 min)   │
╰──────────────────────────────────────────────────────────────────╯

That's the pre-flight. One command, nine checks, a clear recommendation. No guessing, no "let's just try it and see."


What comes next

This pre-flight analysis is the foundation of the FinOps for SREs series; the articles that follow cover what to do once pre-flight clears.

Without the analysis, you're guessing. And in production, guessing has a cost — measured in pages, not dollars.




The pre-flight analysis is implemented in aws-finops-toolkit as the finops preflight command.
