<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: June Gu</title>
    <description>The latest articles on DEV Community by June Gu (@june-gu).</description>
    <link>https://dev.to/june-gu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811414%2Fb50a3c63-7961-4524-9da3-cc65a28941de.jpeg</url>
      <title>DEV Community: June Gu</title>
      <link>https://dev.to/june-gu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/june-gu"/>
    <language>en</language>
    <item>
      <title>The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:18:25 +0000</pubDate>
      <link>https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh</link>
      <guid>https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh</guid>
      <description>&lt;h1&gt;
  
  
  The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;Last month I saved $12K/year by cleaning up AWS waste across four accounts. But before I touched a single resource, I spent two days just &lt;em&gt;analyzing&lt;/em&gt;. Not because I'm cautious by nature — because I've seen what happens when people skip this step.&lt;/p&gt;

&lt;p&gt;A colleague at a previous company followed AWS Cost Explorer's recommendation to downsize an RDS instance. It averaged 12% CPU, so the downsize seemed obvious. What they didn't check: that instance handled a 4x traffic spike every Friday at 6 PM. The downsize turned Friday evening into a 90-minute outage, a rollback, and an incident report that took longer to write than the analysis would have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule: never optimize what you don't fully understand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is the pre-flight checklist I run before every cost optimization. It's conversational by design — I want you to internalize the &lt;em&gt;thinking&lt;/em&gt;, not just memorize a checklist.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The SRE Guarantee&lt;/strong&gt;: Before any optimization, we guarantee error budget protection, minimal downtime, and reliability over savings. See the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;series introduction&lt;/a&gt; for the full guarantee. Every check in this article enforces that guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate this&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; runs this entire analysis from your terminal.&lt;br&gt;
See &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggs02kpku9txirkais2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggs02kpku9txirkais2.png" alt="9 pre-flight checks flowing into GO/WAIT/STOP verdict" width="720" height="540"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Traffic: What's the actual load?
&lt;/h2&gt;

&lt;p&gt;The first question isn't "what does this cost?" — it's "what does this &lt;em&gt;do&lt;/em&gt;?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to pull:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current TPS / QPS (transactions or queries per second)&lt;/li&gt;
&lt;li&gt;Peak QPS over the last 30 days&lt;/li&gt;
&lt;li&gt;When the peak happens (time of day, day of week)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ALB request count — last 7 days, 1-hour intervals&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ApplicationELB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; RequestCount &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer,Value&lt;span class="o"&gt;=&lt;/span&gt;app/pn-sh-alb/abc123 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-2

&lt;span class="c"&gt;# RDS connections — peak over 14 days&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/RDS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; DatabaseConnections &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DBInstanceIdentifier,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-14d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Maximum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What I'm looking for:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Avg QPS: 320/s      ← This is what CPU metrics reflect
Peak QPS: 1,247/s   ← This is what the instance must survive
Ratio: 3.9x         ← If &amp;gt; 3x, be very careful downsizing
Peak window: 11-13h, 18-20h KST  ← Never change anything during these hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation with yourself:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This instance averages 12% CPU, but peaks at 47% during lunch hour. If I downsize from xlarge to large, the peak would hit 94% CPU on the smaller instance. That's not optimization — that's a time bomb."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference between an average and a peak can be the difference between a smooth optimization and a 2 AM page.&lt;/p&gt;
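
&lt;p&gt;&lt;strong&gt;Sketch&lt;/strong&gt;: the avg/peak/ratio numbers can be derived straight from the CloudWatch dump with a short jq pass. This is a hypothetical helper — the inlined sample data stands in for your real &lt;code&gt;get-metric-statistics&lt;/code&gt; output:&lt;/p&gt;

```shell
# Hypothetical helper: derive avg QPS, peak QPS and the peak:avg ratio from a
# get-metric-statistics dump (Sum per 1-hour period). The sample file below
# stands in for your real export.
cat > /tmp/alb-requests.json <<'EOF'
{"Datapoints":[{"Sum":1152000},{"Sum":4489200},{"Sum":1080000}]}
EOF
jq -r '
  [.Datapoints[].Sum] as $s
  | ($s | add / length / 3600) as $avg    # mean requests per second
  | ($s | max / 3600) as $peak            # worst hour, per second
  | "Avg QPS: \($avg|floor)  Peak QPS: \($peak|floor)  Ratio: \(($peak/$avg)*10|round/10)x"
' /tmp/alb-requests.json
```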

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target i-0abc123 --profile prod&lt;/code&gt; pulls this automatically from CloudWatch.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Quality of Service: Where are we against our SLOs?
&lt;/h2&gt;

&lt;p&gt;Before touching anything, I need to know: &lt;strong&gt;how much room do we have to experiment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current p99 latency vs target (e.g., p99 &amp;lt; 200ms)&lt;/li&gt;
&lt;li&gt;Availability % vs target (e.g., 99.9%)&lt;/li&gt;
&lt;li&gt;Error rate trend (stable, improving, degrading?)&lt;/li&gt;
&lt;li&gt;Error budget remaining this month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it (SigNoz):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# SigNoz ClickHouse query — p99 latency last 7 days
SELECT
  toStartOfHour(timestamp) as hour,
  quantile(0.99)(duration_nano) / 1e6 as p99_ms,
  count() as request_count,
  countIf(status_code &amp;gt;= 500) / count() * 100 as error_rate_pct
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'gateway-server'
  AND timestamp &amp;gt; now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The decision matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Budget Remaining&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 70%&lt;/td&gt;
&lt;td&gt;Green — safe to optimize, schedule at off-peak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40-70%&lt;/td&gt;
&lt;td&gt;Yellow — optimize only low-risk items (orphan cleanup, dev/staging)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 40%&lt;/td&gt;
&lt;td&gt;Red — &lt;strong&gt;do not touch anything&lt;/strong&gt;. Focus on reliability first.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget burned (SLO breached)&lt;/td&gt;
&lt;td&gt;Stop. Any optimization must IMPROVE reliability, not risk it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
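
&lt;p&gt;The matrix is mechanical enough to script. A minimal sketch — thresholds are this article's convention, not an SRE standard, so tune them to your SLO policy:&lt;/p&gt;

```shell
# Hypothetical helper encoding the decision matrix. Input: integer percent of
# error budget remaining. Thresholds follow the article's convention.
budget_verdict() {
  local pct=$1
  if   [ "$pct" -gt 70 ]; then echo "GREEN: safe to optimize, schedule at off-peak"
  elif [ "$pct" -ge 40 ]; then echo "YELLOW: low-risk items only (orphan cleanup, dev/staging)"
  elif [ "$pct" -gt 0  ]; then echo "RED: do not touch anything, reliability first"
  else                         echo "STOP: SLO breached, optimizations must improve reliability"
  fi
}
budget_verdict 78
```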

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Our gateway-server has 78% error budget remaining. p99 is 142ms against a 200ms target. That's a comfortable margin — we can proceed with dev/staging optimizations. But I'll hold off on prod RDS right-sizing until next month when we have a full 30-day baseline after the last deployment."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where FinOps meets SRE. A FinOps tool tells you to downsize. An SRE checks if the system can absorb the risk.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target gateway-server --apm signoz --apm-endpoint http://signoz.internal:3301&lt;/code&gt; queries SigNoz for SLO status.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Cache Strategy: What's already absorbing load?
&lt;/h2&gt;

&lt;p&gt;If a service is low-CPU because Redis handles 85% of requests, downsizing the backend might be fine. But if the cache fails, that backend needs to handle 100% — at the original capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit rate (ElastiCache / Redis)&lt;/li&gt;
&lt;li&gt;Cache eviction rate&lt;/li&gt;
&lt;li&gt;Cache TTL settings&lt;/li&gt;
&lt;li&gt;What happens on cache miss (DB query? External API call?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ElastiCache hit rate — last 7 days&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ElastiCache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CacheHitRate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CacheClusterId,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-redis-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# Eviction rate — if rising, cache is under pressure&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/ElastiCache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; Evictions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CacheClusterId,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-redis-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Redis hit rate is 87%. That means only 13% of requests actually reach the database. Current DB CPU is 12% — but without cache, it would be ~92%. If I downsize this DB, I'm betting that Redis never goes down. Is that a bet I want to make?"&lt;/p&gt;

&lt;p&gt;Answer: In prod, no. In dev/staging where I can tolerate cache failures, yes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Factor cache dependency into every right-sizing decision. CPU utilization without cache context is misleading.&lt;/p&gt;
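
&lt;p&gt;The arithmetic behind that conversation is worth making explicit: the DB only sees the cache-miss fraction of traffic, so its no-cache load is roughly current CPU divided by the miss rate. A back-of-envelope sketch with the numbers above:&lt;/p&gt;

```shell
# Back-of-envelope: if the cache disappeared, what load would the DB see?
# effective_cpu = current_cpu / miss_fraction. Numbers from the example above.
hit_rate=87   # % of requests served by Redis
db_cpu=12     # % DB CPU with the cache in place
awk -v h="$hit_rate" -v c="$db_cpu" 'BEGIN {
  miss = (100 - h) / 100
  printf "No-cache DB CPU estimate: %.0f%%\n", c / miss
}'
```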




&lt;h2&gt;
  
  
  4. Incident History: What has broken before?
&lt;/h2&gt;

&lt;p&gt;The best predictor of future incidents is past incidents. Before touching a resource, I check: has anything involving this service broken in the last 90 days?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident count involving the target service (last 90 days)&lt;/li&gt;
&lt;li&gt;Root causes — was it capacity-related?&lt;/li&gt;
&lt;li&gt;Related services that were impacted&lt;/li&gt;
&lt;li&gt;Time to recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where to look:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SigNoz alerts history&lt;/li&gt;
&lt;li&gt;PagerDuty/Slack incident channels&lt;/li&gt;
&lt;li&gt;Post-mortem docs (our &lt;code&gt;pn-infra-docs/incidents/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;CloudWatch alarm history
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CloudWatch alarm history for the target&lt;/span&gt;
aws cloudwatch describe-alarm-history &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alarm-name&lt;/span&gt; &lt;span class="s2"&gt;"pn-sh-rds-prod-cpu-high"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--history-item-type&lt;/span&gt; StateUpdate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-90d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Two incidents in the last 90 days. One was a network blip (unrelated). The other was a connection pool exhaustion on this exact RDS instance during a traffic spike — we had to vertically scale up. That was 6 weeks ago."&lt;/p&gt;

&lt;p&gt;"If I downsize this instance now, I'm reducing the headroom that prevented that from happening again. Let me check the connection metrics more carefully before proceeding."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Red flags that block optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity-related incident in the last 60 days → wait&lt;/li&gt;
&lt;li&gt;Service was recently scaled UP to fix an issue → definitely wait&lt;/li&gt;
&lt;li&gt;Ongoing performance investigation → do not touch&lt;/li&gt;
&lt;/ul&gt;
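
&lt;p&gt;To turn the alarm history into a number, a jq pass over the &lt;code&gt;describe-alarm-history&lt;/code&gt; output works; the sample data here stands in for the real dump. More than a couple of transitions into ALARM in 90 days is itself a red flag:&lt;/p&gt;

```shell
# Count transitions into the ALARM state from a describe-alarm-history dump.
# The inlined sample stands in for the real AWS CLI output.
cat > /tmp/alarm-history.json <<'EOF'
{"AlarmHistoryItems":[
  {"HistorySummary":"Alarm updated from OK to ALARM"},
  {"HistorySummary":"Alarm updated from ALARM to OK"},
  {"HistorySummary":"Alarm updated from OK to ALARM"}
]}
EOF
jq '[.AlarmHistoryItems[] | select(.HistorySummary | test("to ALARM"))] | length' \
  /tmp/alarm-history.json
```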




&lt;h2&gt;
  
  
  5. Access Setup: Credentials for CLI Analysis
&lt;/h2&gt;

&lt;p&gt;This is practical, not conceptual. Before you can analyze anything, you need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS credentials:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify access to all target accounts&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev    &lt;span class="c"&gt;# shared&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-prod   &lt;span class="c"&gt;# dodopoint&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; now          &lt;span class="c"&gt;# nowwaiting&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; placen       &lt;span class="c"&gt;# nexus hub&lt;/span&gt;

&lt;span class="c"&gt;# Required IAM permissions (read-only):&lt;/span&gt;
&lt;span class="c"&gt;# - cloudwatch:GetMetricStatistics&lt;/span&gt;
&lt;span class="c"&gt;# - ec2:DescribeInstances, DescribeNatGateways, DescribeVolumes&lt;/span&gt;
&lt;span class="c"&gt;# - rds:DescribeDBInstances&lt;/span&gt;
&lt;span class="c"&gt;# - elasticache:DescribeCacheClusters&lt;/span&gt;
&lt;span class="c"&gt;# - s3:ListBuckets, GetBucketPolicy&lt;/span&gt;
&lt;span class="c"&gt;# - ce:GetCostAndUsage (Cost Explorer)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;APM access (SigNoz):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# SigNoz API — verify connectivity&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://signoz.internal:3301/api/v1/services | jq &lt;span class="s1"&gt;'.data | length'&lt;/span&gt;

&lt;span class="c"&gt;# If using SigNoz Cloud:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SIGNOZ_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"SIGNOZ-API-KEY: &lt;/span&gt;&lt;span class="nv"&gt;$SIGNOZ_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://your-instance.signoz.io/api/v1/services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The toolkit config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# finops.yaml — account + APM configuration&lt;/span&gt;
&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dodo-dev&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Shared (Dev)&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dodo-prod&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DodoPoint (Prod)&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;placen&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Nexus Hub&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-northeast-2&lt;/span&gt;

&lt;span class="na"&gt;apm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;signoz&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://signoz.internal:3301&lt;/span&gt;
  &lt;span class="c1"&gt;# or api_key: ${SIGNOZ_API_KEY}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Read-only access only. The analysis phase should never modify anything. If your credentials have write access, consider creating a dedicated &lt;code&gt;FinOpsReadOnly&lt;/code&gt; role.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --profile dodo-dev&lt;/code&gt; validates credentials and permissions before analysis.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Target Identification: Instance + APM Mapping
&lt;/h2&gt;

&lt;p&gt;Now we need to map the &lt;em&gt;infrastructure resource&lt;/em&gt; (EC2 instance, RDS instance) to the &lt;em&gt;service it runs&lt;/em&gt; and the &lt;em&gt;APM dashboard that monitors it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; AWS sees "i-0abc123" and "db.r6g.xlarge". Your team sees "gateway-server" and "the ordering database." FinOps decisions need both views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to build the mapping:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# EC2: get instance → service mapping from tags&lt;/span&gt;
aws ec2 describe-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=instance-state-name,Values=running"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Reservations[].Instances[].[InstanceId, InstanceType, Tags[?Key==`Name`].Value | [0], Tags[?Key==`Service`].Value | [0]]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# RDS: instance → service mapping&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[].[DBInstanceIdentifier, DBInstanceClass, Engine, EngineVersion]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result you want:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS Resource&lt;/th&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;SigNoz Dashboard&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i-0abc123&lt;/td&gt;
&lt;td&gt;t3.large&lt;/td&gt;
&lt;td&gt;EKS node (gateway)&lt;/td&gt;
&lt;td&gt;gateway-server&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pn-sh-rds-prod&lt;/td&gt;
&lt;td&gt;db.r6g.xlarge&lt;/td&gt;
&lt;td&gt;ConnectOrder DB&lt;/td&gt;
&lt;td&gt;connectorder-db&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pn-sh-redis-dev&lt;/td&gt;
&lt;td&gt;cache.t3.medium&lt;/td&gt;
&lt;td&gt;Session cache&lt;/td&gt;
&lt;td&gt;redis-metrics&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I see this RDS instance costs $380/month. But what service uses it? Ah — it's the ConnectOrder primary database. That means gateway-server, auth-server, and user-server all depend on it. That's a high blast radius. Let me check SigNoz for all three services, not just the database metrics."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Never optimize a resource without knowing what depends on it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target pn-sh-rds-prod&lt;/code&gt; discovers dependent services and maps to APM.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Traffic Pattern &amp;amp; Service Specification
&lt;/h2&gt;

&lt;p&gt;Now I zoom out. Not just "what's the current load" but "what does the traffic pattern look like over a week, a month?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to analyze:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekday vs weekend traffic ratio&lt;/li&gt;
&lt;li&gt;Daily peak patterns (lunch hour? evening?)&lt;/li&gt;
&lt;li&gt;Monthly patterns (start of month, end of month, paydays?)&lt;/li&gt;
&lt;li&gt;Seasonal patterns (holidays, events)&lt;/li&gt;
&lt;li&gt;Service type: stateless (can use Spot) vs stateful (cannot)&lt;/li&gt;
&lt;li&gt;Dependency chain: who calls this? who does this call?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I visualize it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hourly request count — last 30 days — export for pattern analysis&lt;/span&gt;
aws cloudwatch get-metric-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-data-queries&lt;/span&gt; &lt;span class="s1"&gt;'[{
    "Id": "requests",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/pn-sh-alb/abc123"}]
      },
      "Period": 3600,
      "Stat": "Sum"
    }
  }]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-30d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; traffic-30d.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pattern analysis result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service: gateway-server
Type: stateless (REST API gateway)
Dependencies: auth-server, user-server, SSE-server (downstream)

Traffic Pattern:
  Weekday avg:  420 QPS
  Weekend avg:  180 QPS (43% of weekday)
  Peak hours:   11:00-13:00, 18:00-20:00 KST
  Peak QPS:     1,247
  Low point:    02:00-06:00 KST (~30 QPS)

  Mon-Fri pattern: stable
  Saturday:     -40% from weekday
  Sunday:       -55% from weekday
  Month-end:    no significant spike

Recommendation:
  - Stateless → Spot candidate ✅
  - Predictable pattern → scheduling candidate ✅ (scale down 22:00-07:00)
  - High peak:avg ratio (3.9x) → careful with right-sizing ⚠️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Weekend traffic drops to 43% of weekday. That means weekend EKS nodes are 57% wasted. Instead of right-sizing (which affects all days), I could use HPA with lower weekend min replicas. Or scheduled scaling. That's safer than shrinking the instance type — I keep peak capacity on weekdays."&lt;/p&gt;
&lt;/blockquote&gt;
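
&lt;p&gt;For the scheduled-scaling route, one sketch is a pair of Auto Scaling scheduled actions. The group name here is hypothetical, and this only applies if the nodes sit in an ASG you manage directly — on EKS managed node groups you would adjust the node group's scaling config instead:&lt;/p&gt;

```shell
# Sketch: shrink a (hypothetical) gateway node ASG overnight and restore it
# before the morning ramp, matching the 22:00-07:00 KST low window above.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name pn-sh-gateway-nodes \
  --scheduled-action-name nightly-scale-down \
  --recurrence "0 22 * * *" --time-zone "Asia/Seoul" \
  --min-size 1 --desired-capacity 1 \
  --profile dodo-dev

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name pn-sh-gateway-nodes \
  --scheduled-action-name morning-scale-up \
  --recurrence "0 7 * * *" --time-zone "Asia/Seoul" \
  --min-size 3 --desired-capacity 3 \
  --profile dodo-dev
```

&lt;p&gt;This keeps weekday peak capacity untouched while cutting the quiet hours — the same reasoning applies per-service via HPA &lt;code&gt;minReplicas&lt;/code&gt; if the waste is in pods rather than nodes.&lt;/p&gt;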

&lt;h3&gt;
  
  
  Holiday traffic spikes
&lt;/h3&gt;

&lt;p&gt;If your services handle seasonal traffic — holidays, promotions, events — this changes everything about when you can optimize. &lt;a href="https://www.squadcast.com/blog/what-can-sres-do-to-make-holiday-seasons-peak-traffic-less-chaotic" rel="noopener noreferrer"&gt;Squadcast's SRE guide&lt;/a&gt; recommends analyzing postmortems from past holiday incidents to build a pre-season checklist — the same principle applies to FinOps freezes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For F&amp;amp;B/retail platforms&lt;/strong&gt; (like ours), Korean holidays drive 2-5x normal traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chuseok (Korean Thanksgiving): September, 3-5 day spike&lt;/li&gt;
&lt;li&gt;Lunar New Year: January/February, 3-5 day spike&lt;/li&gt;
&lt;li&gt;Christmas/year-end promotions: December&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze all FinOps changes 2 weeks before any holiday period&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Verify current capacity handles last year's holiday peak (check historical CloudWatch data)&lt;/li&gt;
&lt;li&gt;Schedule all optimizations for the quiet period after the holiday&lt;/li&gt;
&lt;li&gt;Document the holiday calendar in &lt;code&gt;finops.yaml&lt;/code&gt; so the toolkit warns you automatically
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# finops.yaml — holiday calendar&lt;/span&gt;
&lt;span class="na"&gt;preflight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;holidays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Chuseok&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-14"&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-17"&lt;/span&gt;
      &lt;span class="na"&gt;freeze_start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-01"&lt;/span&gt;  &lt;span class="c1"&gt;# 2 weeks before&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lunar New Year&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-28"&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-30"&lt;/span&gt;
      &lt;span class="na"&gt;freeze_start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-01-14"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; checks the holiday calendar and returns WAIT if within a freeze window.&lt;/p&gt;
&lt;/blockquote&gt;
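&lt;p&gt;To make that WAIT behavior concrete, here is a minimal sketch of what a freeze-window check could look like. It is illustrative, not the toolkit's actual code: the function name is hypothetical, and the entries mirror the &lt;code&gt;finops.yaml&lt;/code&gt; example above.&lt;/p&gt;

```python
from datetime import date

# Illustrative sketch only; not the toolkit's actual implementation.
# Entries mirror the finops.yaml holiday calendar above.
HOLIDAYS = [
    {"name": "Chuseok", "freeze_start": date(2026, 9, 1), "end": date(2026, 9, 17)},
    {"name": "Lunar New Year", "freeze_start": date(2027, 1, 14), "end": date(2027, 1, 30)},
]

def freeze_status(today, holidays=HOLIDAYS):
    """Return 'WAIT (holiday)' when inside a freeze window, else 'GO'."""
    for h in holidays:
        # In the window when: freeze_start on or before today, end on or after today
        if today >= h["freeze_start"] and h["end"] >= today:
            return f"WAIT ({h['name']})"
    return "GO"
```

&lt;p&gt;For example, &lt;code&gt;freeze_status(date(2026, 9, 10))&lt;/code&gt; lands inside the Chuseok freeze and returns &lt;code&gt;WAIT (Chuseok)&lt;/code&gt;.&lt;/p&gt;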

&lt;h3&gt;
  
  
  Batch systems
&lt;/h3&gt;

&lt;p&gt;Batch jobs are invisible to daily averages but define your actual capacity floor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common batch patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL pipelines (nightly or hourly)&lt;/li&gt;
&lt;li&gt;Billing runs (start/end of month)&lt;/li&gt;
&lt;li&gt;Data exports and report generation&lt;/li&gt;
&lt;li&gt;Scheduled sync jobs between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map every batch schedule: cron jobs, EventBridge rules, Airflow DAGs&lt;/li&gt;
&lt;li&gt;Check: does the batch peak overlap with the downsized capacity?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule: size for the batch peak, not the daily average.&lt;/strong&gt; If a nightly ETL uses 80% CPU for 2 hours, the instance must still handle that 80% — even if the 14-day average is 12%.&lt;/li&gt;
&lt;li&gt;If a batch runs only weekly or monthly, a 14-day CPU average is misleading — use peak CPU during the batch window instead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This RDS instance averages 12% CPU. But every Sunday at 2 AM, a billing reconciliation job runs for 3 hours at 75% CPU. If I downsize from xlarge to large, that Sunday job would hit 150% — it would fail or timeout."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight&lt;/code&gt; detects batch patterns by analyzing CloudWatch metric variance and flags resources with periodic spikes.&lt;/p&gt;
&lt;/blockquote&gt;
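&lt;p&gt;One plausible version of that variance analysis, as a sketch. The toolkit's real heuristic may differ; the 3.0x peak-to-average ratio and the coefficient-of-variation cutoff used here are assumptions.&lt;/p&gt;

```python
from statistics import mean, pstdev

# One plausible variance-based spike heuristic; the toolkit's real
# thresholds and logic may differ. Treat the numbers as assumptions.
def has_periodic_spike(cpu_samples, ratio_threshold=3.0, cv_threshold=1.0):
    """Flag a series whose peak dwarfs its average (e.g. a weekly batch job)."""
    avg = mean(cpu_samples)
    if avg == 0:
        return False
    peak_ratio = max(cpu_samples) / avg
    cv = pstdev(cpu_samples) / avg  # coefficient of variation (stdev relative to mean)
    return peak_ratio > ratio_threshold or cv > cv_threshold
```

&lt;p&gt;A series that idles at 12% CPU but hits 75% during a weekly batch window gets flagged (peak is roughly 5-6x the mean), while steady traffic does not — which is exactly the db.r6g.xlarge case in the conversation above.&lt;/p&gt;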




&lt;h2&gt;
  
  
  8. Priority &amp;amp; Freeze Check: Is it safe to act now?
&lt;/h2&gt;

&lt;p&gt;The final gate. Even if all metrics say "go," organizational context can say "stop."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;th&gt;Block if...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment freeze&lt;/td&gt;
&lt;td&gt;Team calendar, Slack announcements&lt;/td&gt;
&lt;td&gt;Any freeze active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Release pending&lt;/td&gt;
&lt;td&gt;Sprint board, release schedule&lt;/td&gt;
&lt;td&gt;Major release within 2 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service priority level&lt;/td&gt;
&lt;td&gt;Service catalog&lt;/td&gt;
&lt;td&gt;P0 service → prod changes need CAB approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active incidents&lt;/td&gt;
&lt;td&gt;PagerDuty, incident channels&lt;/td&gt;
&lt;td&gt;Any open incident on target service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error severity trend&lt;/td&gt;
&lt;td&gt;SigNoz alerts&lt;/td&gt;
&lt;td&gt;Error rate trending up (even if within SLO)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change window&lt;/td&gt;
&lt;td&gt;Team agreement&lt;/td&gt;
&lt;td&gt;Outside agreed change window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependent team availability&lt;/td&gt;
&lt;td&gt;Team calendar&lt;/td&gt;
&lt;td&gt;Owning team on vacation or unavailable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
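&lt;p&gt;The table reduces to a simple gate: if any blocking condition is active, the answer is WAIT. A hypothetical sketch (the check names and return format here are illustrative, not the toolkit's):&lt;/p&gt;

```python
# Hypothetical sketch of the table above as a gate: any active
# "Block if..." condition turns the recommendation into WAIT.
def org_gate(blockers):
    """blockers maps a check name to True when its blocking condition holds."""
    active = [name for name, blocked in blockers.items() if blocked]
    if active:
        return "WAIT: " + ", ".join(sorted(active))
    return "GO"
```

&lt;p&gt;The point of encoding it: the gate is all-or-nothing. One active blocker is enough to stop, no matter how good the metrics look.&lt;/p&gt;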

&lt;p&gt;&lt;strong&gt;Priority levels and what you can optimize:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service Priority&lt;/th&gt;
&lt;th&gt;Prod optimization?&lt;/th&gt;
&lt;th&gt;Dev/Staging?&lt;/th&gt;
&lt;th&gt;Requires approval?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P0 (critical path)&lt;/td&gt;
&lt;td&gt;Maintenance window only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes — team lead + SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P1 (important)&lt;/td&gt;
&lt;td&gt;Off-peak hours&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes — SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 (standard)&lt;/td&gt;
&lt;td&gt;Business hours OK&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P3 (non-critical)&lt;/td&gt;
&lt;td&gt;Anytime&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
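&lt;p&gt;This matrix is easy to encode so tooling can enforce it. A hypothetical sketch — the labels come from the table; the structure and function name are my own:&lt;/p&gt;

```python
# Hypothetical encoding of the priority matrix above. Labels are taken
# from the table; the dict structure itself is an assumption.
POLICY = {
    "P0": {"prod_window": "maintenance window only", "approval": "team lead + SRE"},
    "P1": {"prod_window": "off-peak hours", "approval": "SRE"},
    "P2": {"prod_window": "business hours OK", "approval": None},
    "P3": {"prod_window": "anytime", "approval": None},
}

def needs_approval(priority):
    """True when prod optimization on this tier requires sign-off."""
    return POLICY[priority]["approval"] is not None
```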

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"All metrics look good for downsizing the staging RDS. But wait — the ConnectOrder team is launching a new feature next Tuesday. They're running load tests on staging this week. If I downsize now, their load test results will be invalid."&lt;/p&gt;

&lt;p&gt;"Let me wait until after their launch. I'll schedule the optimization for the week after."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: FinOps is not urgent. Reliability is urgent. If there's any doubt about timing, wait. The waste will still be there next week.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Existing RI/SP Coverage: What's already committed?
&lt;/h2&gt;

&lt;p&gt;This check prevents one of the most expensive FinOps mistakes: downsizing an instance that's covered by a Reserved Instance and wasting the remaining reservation value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to check:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Active Reserved Instances: do any match the target instance type?&lt;/li&gt;
&lt;li&gt;Active Savings Plans: what type? (Compute vs EC2 Instance)&lt;/li&gt;
&lt;li&gt;If downsizing, will the new size still be covered?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I check it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List active Reserved Instances&lt;/span&gt;
aws ec2 describe-reserved-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=state,Values=active"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'ReservedInstances[].[InstanceType,InstanceCount,End,Scope]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# List active Savings Plans&lt;/span&gt;
aws savingsplans describe-savings-plans &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--states&lt;/span&gt; active &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'SavingsPlans[].[SavingsPlanType,Commitment,End]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The decision matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by RI, same instance type&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Do NOT downsize — calculate RI remaining value vs savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by Compute Savings Plan&lt;/td&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;td&gt;Safe to change instance family (Compute SP is flexible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target covered by EC2 Instance Savings Plan&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Do NOT change instance family — SP is family-locked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No RI/SP coverage&lt;/td&gt;
&lt;td&gt;NONE&lt;/td&gt;
&lt;td&gt;Safe to proceed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I want to downsize this db.r6g.xlarge to db.r6g.large. Let me check... we have a 1-year RI for db.r6g.xlarge with 8 months remaining. The RI costs $3,060/year. Downsizing would waste $2,040 in remaining reservation value. The downsize would save $190/month = $1,520 over 8 months. Net loss: $520. &lt;strong&gt;Don't downsize until the RI expires.&lt;/strong&gt;"&lt;/p&gt;
&lt;/blockquote&gt;
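&lt;p&gt;The break-even math from that conversation generalizes to a one-liner worth keeping around. The function name is mine; the arithmetic is exactly what the conversation walks through:&lt;/p&gt;

```python
# Reproduces the break-even math from the conversation above.
def ri_downsize_net(ri_annual_cost, months_remaining, monthly_savings):
    """Net dollars from downsizing now; negative means ride out the RI."""
    wasted_reservation = ri_annual_cost * months_remaining / 12
    realized_savings = monthly_savings * months_remaining
    return realized_savings - wasted_reservation
```

&lt;p&gt;&lt;code&gt;ri_downsize_net(3060, 8, 190)&lt;/code&gt; returns &lt;code&gt;-520.0&lt;/code&gt;: the $520 net loss from the example, so don't downsize until the RI expires.&lt;/p&gt;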

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: Always check RI/SP coverage before any right-sizing. The savings from downsizing can be completely negated by wasted reservations.&lt;/p&gt;

&lt;p&gt;This mistake is widespread. &lt;a href="https://cloudchipr.com/blog/aws-rds-right-sizing" rel="noopener noreferrer"&gt;CloudChipr's RDS guide&lt;/a&gt; warns explicitly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." &lt;a href="https://www.prosperops.com/blog/aws-reserved-instances/" rel="noopener noreferrer"&gt;ProsperOps&lt;/a&gt; notes that if usage falls below commitment, the unused portion goes to waste — making monitoring essential. And &lt;a href="https://www.linkedin.com/pulse/how-fix-aws-reserved-instance-mistakes-craig-deveson" rel="noopener noreferrer"&gt;Craig Deveson's LinkedIn article&lt;/a&gt; documents real strategies for recovering from RI mistakes, including instance size flexibility within the same family and the RI Marketplace for selling unused reservations. Finally, &lt;a href="https://blog.easecloud.io/startup-tech/aws-cost-optimization-mistakes/" rel="noopener noreferrer"&gt;EaseCloud's roundup of AWS cost-optimization mistakes&lt;/a&gt; cites the Flexera State of the Cloud Report's estimate that 27% of all cloud spend is wasted — and RI mismanagement is one of the top contributors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit&lt;/strong&gt;: &lt;code&gt;finops preflight --target &amp;lt;instance&amp;gt;&lt;/code&gt; checks active RIs and Savings Plans automatically and returns WAIT if downsizing would waste a reservation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Putting it all together: the &lt;code&gt;finops preflight&lt;/code&gt; report
&lt;/h2&gt;

&lt;p&gt;Here's what the complete analysis looks like when you run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ finops preflight --target pn-sh-rds-prod --profile dodo-dev --apm signoz

╭──────────────────────────────────────────────────────────────────╮
│                    PRE-FLIGHT ANALYSIS                            │
│  Target: pn-sh-rds-prod (db.r6g.xlarge)                        │
│  Account: Shared (468411441302)                                  │
│  Analyzed: 2026-03-14 09:32 KST                                │
╰──────────────────────────────────────────────────────────────────╯

📊 TRAFFIC
  Current QPS:     312 req/s
  Peak QPS (30d):  1,247 req/s
  Peak:Avg ratio:  3.9x
  Peak hours:      11:00-13:00, 18:00-20:00 KST
  Weekend drop:    -57%

📋 QUALITY OF SERVICE (SigNoz)
  p99 latency:     142ms / 200ms target     ✅ 29% headroom
  Availability:    99.94% / 99.9% target     ✅
  Error rate:      0.04%                     ✅
  Error budget:    78% remaining             ✅ GREEN

🗄️ CACHE DEPENDENCY
  ElastiCache:     pn-sh-redis-dev
  Hit rate:        87.3%                     ⚠️ 13% hits DB directly
  Eviction rate:   0.02%                     ✅ Stable
  Cache-miss load: ~40 QPS reaches DB

🔥 INCIDENT HISTORY (90 days)
  Total incidents: 2
  Capacity-related: 1 (connection pool, 6 weeks ago)    ⚠️
  Status:          Resolved, connection pool increased

📊 RESOURCE METRICS (14-day)
  CPU avg:         12.3%
  CPU peak:        47.2%
  Memory avg:      34.7%
  Connections avg: 23 / 1000 max
  IOPS avg:        145 / 3000 provisioned

🔗 DEPENDENCIES
  Services:        gateway-server, auth-server, user-server
  Blast radius:    HIGH (3 services depend on this)

💰 RI/SP COVERAGE
  Reserved Instances: 1 active (db.r6g.xlarge, 8 months remaining)
  Savings Plans:      1 Compute SP ($500/mo commitment)          ✅ Flexible
  RI match:           ⚠️ Target matches active RI
  SP family risk:     None (Compute SP)

🚦 PRIORITY CHECK
  Service level:   P0 (critical path)
  Deploy freeze:   None active
  Pending release: ConnectOrder v2.3 — March 18      ⚠️
  Team available:  Yes

╭──────────────────────────────────────────────────────────────────╮
│ RECOMMENDATION:  ⚠️  WAIT — PROCEED AFTER MARCH 18              │
│                                                                  │
│ Analysis supports right-sizing (CPU avg 12%, 78% error budget), │
│ but:                                                             │
│  1. Pending release March 18 — wait for post-release stability  │
│  2. Connection pool incident 6 weeks ago — verify pool config   │
│  3. P0 service — requires team lead + SRE approval              │
│  4. High blast radius — 3 dependent services                    │
│                                                                  │
│ After March 18 (if SLOs hold):                                  │
│  → Downsize db.r6g.xlarge → db.r6g.large                       │
│  → Add read replica as safety net before resize                 │
│  → Schedule: 02:00-04:00 KST (lowest traffic)                  │
│  → Estimated savings: $190/month ($2,280/year)                  │
│  → Rollback plan: modify-db-instance back to xlarge (&amp;lt;10 min)   │
╰──────────────────────────────────────────────────────────────────╯
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's the pre-flight.&lt;/strong&gt; One command, nine checks, a clear recommendation. No guessing, no "let's just try it and see."&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;This pre-flight analysis is the foundation of the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;FinOps for SREs series&lt;/a&gt;. After pre-flight clears:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: Finding Passive Waste&lt;/a&gt; — clean up what nobody uses&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt; — actively optimize with reliability guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The analysis is the foundation. Without it, you're guessing. And in production, guessing has a cost — measured in pages, not dollars.&lt;/p&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The pre-flight analysis is implemented in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; as the &lt;code&gt;finops preflight&lt;/code&gt; command.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Downsizing Without Downtime: An SRE's Guide to Safe Cost Optimization</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:17:49 +0000</pubDate>
      <link>https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck</link>
      <guid>https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck</guid>
      <description>&lt;h1&gt;
  
  
  Downsizing Without Downtime: An SRE's Guide to Safe Cost Optimization
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;kubernetes&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1&lt;/a&gt;, I covered finding $12K/year in passive waste — abandoned VPCs, orphan log groups, stale WorkSpaces. Things nobody was using. That was the easy part.&lt;/p&gt;

&lt;p&gt;This article is about the hard part: &lt;strong&gt;actively downsizing infrastructure that's still running in production&lt;/strong&gt; — without breaking availability. This is where FinOps meets SRE, and where most cost-cutting initiatives fail.&lt;/p&gt;

&lt;p&gt;I've seen teams blindly follow AWS Cost Explorer recommendations, downsize an RDS instance during peak hours, and trigger a 45-minute outage. The problem isn't the recommendation — it's executing it without an SRE mindset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The SRE Guarantee&lt;/strong&gt;: Every optimization in this article passes through three gates: error budget protection, minimal planned downtime, and reliability over savings. See the &lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;series introduction&lt;/a&gt; for the full guarantee. If any gate fails, we don't proceed — no matter how large the savings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the framework I use: &lt;strong&gt;every cost optimization must pass through the reliability filter first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bi3ztfokdz5t3ilv7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bi3ztfokdz5t3ilv7r.png" alt="SLO Gate decision tree" width="720" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLO gate: when is it safe to cut?
&lt;/h2&gt;

&lt;p&gt;Before touching any resource, I check three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Error budget status&lt;/strong&gt; — If we've burned &amp;gt;50% of this month's error budget, no changes. Period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current resource utilization&lt;/strong&gt; — CloudWatch metrics over 14+ days, not a snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius&lt;/strong&gt; — If this fails, what's the user impact? One service? All services?
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error budget &amp;gt; 50% remaining?
  └─ Yes → Check utilization
       └─ Avg CPU &amp;lt; 20% for 14 days?
            └─ Yes → Check blast radius
                 └─ Single service, non-critical path?
                      └─ Yes → Proceed with rollback plan
                      └─ No → Schedule for maintenance window
            └─ No → Skip, re-evaluate next month
  └─ No → Do nothing. Stability first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the difference between FinOps and SRE-driven FinOps. Cost tools tell you &lt;em&gt;what&lt;/em&gt; to cut. SRE tells you &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;
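&lt;p&gt;The decision tree above translates directly into code. This is an illustrative sketch, not the toolkit's implementation; the thresholds come from the tree itself:&lt;/p&gt;

```python
def slo_gate(error_budget_pct, avg_cpu_14d, single_service, critical_path):
    """Illustrative translation of the SLO-gate decision tree above."""
    if 50 >= error_budget_pct:          # half the budget already burned
        return "Do nothing. Stability first."
    if avg_cpu_14d >= 20:               # not clearly underutilized over 14 days
        return "Skip, re-evaluate next month"
    if single_service and not critical_path:
        return "Proceed with rollback plan"
    return "Schedule for maintenance window"
```

&lt;p&gt;Feeding in the numbers from the pre-flight report in Part 0 (78% budget remaining, 12.3% average CPU) lands on "proceed" or "maintenance window" purely depending on blast radius — which is why the blast-radius check comes last, as the tiebreaker.&lt;/p&gt;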

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Automate this&lt;/strong&gt;: &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;&lt;code&gt;finops scan&lt;/code&gt;&lt;/a&gt; runs all checks below in one command. Each section maps to a specific check in the toolkit.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. EC2 / EKS node right-sizing with Pod Disruption Budgets
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: EKS worker nodes running at 15% CPU average. AWS says "downsize." But these nodes run 8 microservices — you can't just swap the instance type and hope pods reschedule gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ensure PDB exists BEFORE downsizing&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-server-pdb&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connectorder&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Execution steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify PDB exists for every service on the node group&lt;/li&gt;
&lt;li&gt;Add new node group with smaller instance type (t3.large → t3.medium)&lt;/li&gt;
&lt;li&gt;Cordon old nodes — Kubernetes respects PDBs during drain&lt;/li&gt;
&lt;li&gt;Monitor SLOs for 24 hours&lt;/li&gt;
&lt;li&gt;Remove old node group only after SLO confirmation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: t3.large ($0.0832/hr) → t3.medium ($0.0416/hr) = &lt;strong&gt;50% per node&lt;/strong&gt;. With 4 nodes across dev/staging, that's ~$120/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could go wrong&lt;/strong&gt;: Without PDBs, draining a node can kill all replicas of a service simultaneously. With PDBs, Kubernetes guarantees at least &lt;code&gt;minAvailable&lt;/code&gt; pods stay running.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks ec2_rightsizing&lt;/code&gt; — flags instances with avg CPU &amp;lt; 20% over 14 days. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/ec2_rightsizing.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
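&lt;p&gt;The decision logic behind that check can be sketched as follows. The CloudWatch fetch is elided (in practice the daily averages come from the &lt;code&gt;AWS/EC2 CPUUtilization&lt;/code&gt; metric, one datapoint per day); the function is a simplified sketch, not the toolkit source:&lt;/p&gt;

```python
from statistics import mean

# Simplified sketch of the ec2_rightsizing flagging logic; not the
# toolkit source. daily_cpu_averages would come from CloudWatch
# (AWS/EC2 CPUUtilization, Period=86400, Statistic=Average).
def flag_for_rightsizing(daily_cpu_averages, threshold_pct=20.0, min_days=14):
    """Flag an instance whose CPU averaged under threshold_pct over the window."""
    if min_days > len(daily_cpu_averages):
        return False  # not enough data to judge; never flag on a snapshot
    return threshold_pct > mean(daily_cpu_averages)
```

&lt;p&gt;Note the early return: with fewer than 14 days of data the check refuses to flag anything, which is the "14+ days, not a snapshot" rule from the SLO gate enforced in code.&lt;/p&gt;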




&lt;h2&gt;
  
  
  2. NAT Gateway → NAT Instance with high availability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: NAT Gateways cost $32.40/month each (fixed) + data processing. In dev/staging environments processing &amp;lt;1 GB/month, you're paying $32 for almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Don't just swap in a bare NAT Instance — that's a single point of failure. Run NAT Instances behind an Auto Scaling group for auto-recovery, one per AZ if you need AZ redundancy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NAT Instance with auto-recovery via ASG&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_autoscaling_group"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;min_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="nx"&gt;max_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="nx"&gt;desired_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

  &lt;span class="nx"&gt;launch_template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_launch_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
    &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$Latest"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Auto-replace if health check fails&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2"&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_grace_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;

  &lt;span class="nx"&gt;tag&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Name"&lt;/span&gt;
    &lt;span class="nx"&gt;value&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${local.name_prefix}-nat"&lt;/span&gt;
    &lt;span class="nx"&gt;propagate_at_launch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# t4g.nano: $3.02/month — 10x cheaper than NAT Gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_launch_template"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t4g.nano"&lt;/span&gt;
  &lt;span class="nx"&gt;image_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="c1"&gt;# ... source_dest_check = false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: $32.40 → $3.02/month per environment. Across 3 dev/staging environments: &lt;strong&gt;~$88/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The HA guarantee&lt;/strong&gt;: ASG auto-replaces the instance within ~2 minutes if it fails. For dev/staging, 2 minutes of NAT downtime is acceptable. For prod, keep the managed NAT Gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world validation&lt;/strong&gt;: &lt;a href="https://blogs.halodoc.io/from-aws-nat-gateway-to-nat-instance-a-cost-optimized-networking-strategy/" rel="noopener noreferrer"&gt;Halodoc's engineering team&lt;/a&gt; documented their full migration from managed NAT Gateways to NAT instances using &lt;a href="https://fck-nat.dev/" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt;, an open-source project that provides ready-to-use ARM-based AMIs supporting up to 5Gbps burst on a t4g.nano. They achieved over 90% cost reduction across non-prod environments. The fck-nat AMI handles IP forwarding, NAT rules, and CloudWatch alarms out of the box — it's essentially what I built manually with the ASG approach above, but packaged as a reusable AMI. If you're doing this at scale, consider &lt;a href="https://github.com/AndrewGuenther/fck-nat" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt; instead of rolling your own.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks nat_gateway&lt;/code&gt; — flags NAT Gateways with 0 bytes processed in dev/staging accounts. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/nat_gateway.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Spot Instances for non-production EKS with graceful draining
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Dev and staging EKS node groups run on-demand 24/7 for workloads that tolerate interruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Spot saves 60-70%, but you need graceful handling of the 2-minute interruption notice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EKS managed node group with Spot + drain handler&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;
&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot-workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.medium"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3a.medium"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.large"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;spot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
    &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreferNoSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Install the AWS Node Termination Handler. Without it, pods get killed mid-request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;aws-node-termination-handler &lt;span class="se"&gt;\&lt;/span&gt;
  eks/aws-node-termination-handler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;enableSpotInterruptionDraining&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;enableScheduledEventDraining&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: 3 on-demand t3.medium nodes ($0.0416/hr × 3 × 730hr) = $91/month → Spot (~$0.0125/hr × 3 × 730hr) = $27/month. &lt;strong&gt;$64/month savings per environment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reliability rule&lt;/strong&gt;: Never use Spot for production. Never use Spot for stateful workloads. Only use Spot where you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple instance type fallbacks (capacity diversification)&lt;/li&gt;
&lt;li&gt;Node Termination Handler installed&lt;/li&gt;
&lt;li&gt;Pod anti-affinity so replicas spread across nodes&lt;/li&gt;
&lt;/ul&gt;
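&lt;p&gt;Those three prerequisites are easy to encode as a pre-flight gate in front of any automation. A minimal sketch, assuming a config dict shaped like the eksctl fields above; &lt;code&gt;has_termination_handler&lt;/code&gt; and &lt;code&gt;replica_anti_affinity&lt;/code&gt; are hypothetical flags you'd populate from your cluster state:&lt;/p&gt;

```python
def spot_ready(node_group):
    """Return the reasons a node group is NOT ready for Spot (empty list = go)."""
    problems = []
    # 1. Capacity diversification: multiple instance types = multiple Spot pools.
    if len(node_group.get("instanceTypes", [])) in (0, 1):
        problems.append("fewer than 2 instance types (no capacity diversification)")
    # 2. Graceful draining on the 2-minute interruption notice.
    if not node_group.get("has_termination_handler", False):
        problems.append("AWS Node Termination Handler not installed")
    # 3. Replicas spread across nodes so one reclaim can't take out a service.
    if not node_group.get("replica_anti_affinity", False):
        problems.append("no pod anti-affinity between replicas")
    return problems
```

&lt;p&gt;An empty result is a go; anything else is a blocker to fix before flipping &lt;code&gt;spot: true&lt;/code&gt;.&lt;/p&gt;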

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks spot_candidates&lt;/code&gt; — identifies stateless ASGs and EKS node groups eligible for Spot. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/spot_candidates.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. RDS right-sizing without losing your safety net
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: RDS instances provisioned for peak load that only hits 2 hours per day. Average CPU: 8%. But it's a database — you can't just resize and pray.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Why it's safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;db.r6g.xlarge (prod)&lt;/td&gt;
&lt;td&gt;db.r6g.large (prod)&lt;/td&gt;
&lt;td&gt;Read replica absorbs overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;db.r6g.large (staging)&lt;/td&gt;
&lt;td&gt;db.r6g.medium (staging)&lt;/td&gt;
&lt;td&gt;No Multi-AZ needed in staging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-AZ on staging&lt;/td&gt;
&lt;td&gt;Single-AZ&lt;/td&gt;
&lt;td&gt;Staging doesn't need failover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Execution steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a read replica BEFORE downsizing (safety net)&lt;/li&gt;
&lt;li&gt;Monitor replica lag for 48 hours&lt;/li&gt;
&lt;li&gt;Apply instance modification during low-traffic window (scheduled, not immediate)&lt;/li&gt;
&lt;li&gt;Monitor connection count and query latency for 1 week&lt;/li&gt;
&lt;li&gt;Remove old read replica only after confirming SLOs hold&lt;/li&gt;
&lt;/ol&gt;
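&lt;p&gt;Step 3 is where the Friday-spike failure mode from the intro bites: average CPU is not a go signal on its own. A sketch of the gate worth running first, over hourly &lt;em&gt;Maximum&lt;/em&gt; CPU samples from CloudWatch (the 60% headroom threshold is my own rule of thumb, not an AWS guideline):&lt;/p&gt;

```python
def safe_to_downsize(hourly_max_cpu, capacity_ratio, headroom=60.0):
    """Decide whether a smaller instance can absorb the observed peak.

    hourly_max_cpu -- hourly Maximum CPU samples over 2+ weeks, in percent
    capacity_ratio -- target capacity / current capacity (0.5 = one size down)
    headroom       -- projected peak must stay under this percent on the new size
    """
    peak = max(hourly_max_cpu)
    projected_peak = peak / capacity_ratio  # same work on a smaller box
    return headroom >= projected_peak
```

&lt;p&gt;An instance averaging 8% but spiking to 48% on Fridays projects to 96% after one size down, and the gate says no. That is exactly the case that turned into a 90-minute outage.&lt;/p&gt;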

&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging Multi-AZ removal: &lt;strong&gt;~$200/month&lt;/strong&gt; (you're paying 2x for staging redundancy nobody needs)&lt;/li&gt;
&lt;li&gt;Right-sizing across 3 non-prod instances: &lt;strong&gt;~$150/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What NOT to touch&lt;/strong&gt;: Production primary instances running at &amp;gt;40% CPU. Production Multi-AZ. Any RDS with burst credit dependency (t-class instances under load).&lt;/p&gt;
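&lt;p&gt;Those exclusions can sit in front of any right-sizing automation as a deny list. A sketch over &lt;code&gt;describe-db-instances&lt;/code&gt;-shaped fields; &lt;code&gt;env&lt;/code&gt; and &lt;code&gt;avg_cpu&lt;/code&gt; are hypothetical fields you'd join in from tags and CloudWatch, and the 20% t-class threshold is my assumption:&lt;/p&gt;

```python
def downsize_denylist(db):
    """Return the reasons an RDS instance must NOT be auto-downsized."""
    reasons = []
    prod = db.get("env") == "prod"
    if prod and db.get("avg_cpu", 0) > 40:
        reasons.append("prod primary above 40% CPU")
    if prod and db.get("MultiAZ", False):
        reasons.append("prod Multi-AZ (availability spend, not capacity spend)")
    # t-class instances under sustained load depend on burst credits.
    if db.get("DBInstanceClass", "").startswith("db.t") and db.get("avg_cpu", 0) > 20:
        reasons.append("t-class under load (burst credit dependency)")
    return reasons
```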

&lt;h3&gt;
  
  
  Parameter groups: the hidden risk
&lt;/h3&gt;

&lt;p&gt;When you change an RDS instance class, memory-dependent parameters may break silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default parameter groups auto-scale&lt;/strong&gt; — &lt;code&gt;shared_buffers&lt;/code&gt;, &lt;code&gt;effective_cache_size&lt;/code&gt;, and &lt;code&gt;work_mem&lt;/code&gt; in PostgreSQL (or &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; in MySQL) adjust automatically with instance memory. If you're using the default parameter group, downsizing is straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom parameter groups with hardcoded values don't auto-scale.&lt;/strong&gt; If someone set &lt;code&gt;shared_buffers = 8GB&lt;/code&gt; explicitly for a db.r6g.xlarge (32GB RAM), downsizing to db.r6g.large (16GB RAM) means &lt;code&gt;shared_buffers&lt;/code&gt; is now 50% of total RAM instead of 25%. That leaves almost nothing for OS cache and connections.&lt;/p&gt;

&lt;p&gt;This is a known production pitfall. &lt;a href="https://repost.aws/knowledge-center/rds-aurora-postgresql-shared-buffers" rel="noopener noreferrer"&gt;AWS documents&lt;/a&gt; that RDS replicas can get stuck in &lt;code&gt;incompatible-parameters&lt;/code&gt; mode when created with a smaller instance class if the source's parameter group has hardcoded buffer values too large for the target. The same issue applies to downsizing: the instance may fail to start or perform poorly. &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-postgresql-parameters/shared-buffers.html" rel="noopener noreferrer"&gt;AWS Prescriptive Guidance&lt;/a&gt; recommends using formula-based parameters (e.g., &lt;code&gt;{DBInstanceClassMemory/32768}&lt;/code&gt;) that auto-scale with instance size, rather than hardcoded values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before downsizing, check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List parameter groups for the instance&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].DBParameterGroups'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev

&lt;span class="c"&gt;# Check for hardcoded memory parameters&lt;/span&gt;
aws rds describe-db-parameters &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-parameter-group-name&lt;/span&gt; my-custom-pg15 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Parameters[?ParameterName==`shared_buffers` || ParameterName==`effective_cache_size` || ParameterName==`work_mem`].[ParameterName,ParameterValue,Source]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If &lt;code&gt;Source&lt;/code&gt; = &lt;code&gt;user&lt;/code&gt; (not &lt;code&gt;engine-default&lt;/code&gt;), the parameter is hardcoded. Recalculate it for the target instance size before downsizing.&lt;/p&gt;
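&lt;p&gt;The recalculation itself is mechanical: PostgreSQL expresses &lt;code&gt;shared_buffers&lt;/code&gt; in 8 KB pages, and the common target is about 25% of instance memory. A sketch of the arithmetic (the 25% fraction is the usual rule of thumb; validate it against your workload):&lt;/p&gt;

```python
PAGE_SIZE_KB = 8  # shared_buffers is set in 8 KB pages

def shared_buffers_pages(instance_memory_gb, fraction=0.25):
    """Target shared_buffers value (in 8 KB pages) for a given instance size."""
    target_kb = instance_memory_gb * 1024 * 1024 * fraction
    return int(target_kb / PAGE_SIZE_KB)

# db.r6g.xlarge (32 GB) down to db.r6g.large (16 GB): the value must halve.
old = shared_buffers_pages(32)  # 1048576 pages = 8 GB
new = shared_buffers_pages(16)  # 524288 pages = 4 GB
```

&lt;p&gt;This is also why AWS's formula parameter works: &lt;code&gt;{DBInstanceClassMemory/32768}&lt;/code&gt; divides bytes by 32768, which is exactly 25% of memory expressed in 8 KB pages, so it tracks the instance size automatically.&lt;/p&gt;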

&lt;h3&gt;
  
  
  CDC and logical replication: the blast radius multiplier
&lt;/h3&gt;

&lt;p&gt;If the database has Change Data Capture (CDC) enabled via logical replication, downsizing becomes significantly riskier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replication slots consume WAL&lt;/strong&gt;: Logical replication slots prevent WAL cleanup until the consumer catches up. On a smaller instance with less I/O throughput, WAL can accumulate faster than it's consumed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication lag increases&lt;/strong&gt;: Smaller instance = less CPU and memory for WAL decoding. If your CDC pipeline (Debezium, DMS, custom) can't keep up, lag grows — and if the slot falls too far behind, you may need to recreate it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk pressure&lt;/strong&gt;: WAL accumulation on a smaller instance with less storage headroom can fill the disk, causing the primary to halt writes entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not theoretical. Gunnar Morling (Debezium/Red Hat) documented &lt;a href="https://www.morling.dev/blog/insatiable-postgres-replication-slot/" rel="noopener noreferrer"&gt;the "insatiable" replication slot problem&lt;/a&gt; — when a CDC consumer stops, an idle RDS PostgreSQL instance accumulates &lt;strong&gt;18 GB/day&lt;/strong&gt; of WAL because RDS writes a heartbeat every 5 minutes into 64 MB WAL segments. His &lt;a href="https://www.morling.dev/blog/mastering-postgres-replication-slots/" rel="noopener noreferrer"&gt;follow-up guide&lt;/a&gt; on mastering replication slots is essential reading. &lt;a href="https://www.artie.com/blogs/postgres-replication-slot-101-how-to-capture-cdc-without-breaking-production" rel="noopener noreferrer"&gt;Artie's production guide&lt;/a&gt; calls slot bloat "the single most common way CDC pipelines take down production databases."&lt;/p&gt;
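&lt;p&gt;Morling's 18 GB/day figure makes the disk math easy to run before a downsize. A back-of-envelope sketch, assuming WAL accumulates at a constant rate while the consumer is stalled; measure your own rate via &lt;code&gt;ReplicationSlotDiskUsage&lt;/code&gt; rather than trusting the default:&lt;/p&gt;

```python
def hours_until_disk_full(free_storage_gb, wal_gb_per_day=18.0):
    """Runway if the CDC consumer stalls and slot WAL starts piling up."""
    return free_storage_gb / (wal_gb_per_day / 24.0)

# Downsizing storage headroom from 100 GB free to 30 GB free cuts the
# runway for a stalled consumer from roughly 5.5 days to well under 2:
hours_until_disk_full(100)  # ~133 hours
hours_until_disk_full(30)   # 40.0 hours
```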

&lt;p&gt;&lt;strong&gt;Before downsizing a CDC-enabled database:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check for logical replication slots (PostgreSQL)&lt;/span&gt;
&lt;span class="c"&gt;# Run via psql or RDS Data API:&lt;/span&gt;
&lt;span class="c"&gt;# SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn&lt;/span&gt;
&lt;span class="c"&gt;# FROM pg_replication_slots;&lt;/span&gt;

&lt;span class="c"&gt;# Check replication lag via CloudWatch&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/RDS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; ReplicationSlotDiskUsage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DBInstanceIdentifier,Value&lt;span class="o"&gt;=&lt;/span&gt;pn-sh-rds-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Maximum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical safety net&lt;/strong&gt; (PostgreSQL 13+): Set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; in your parameter group to cap how much WAL a replication slot can retain. Without this, an inactive slot will accumulate WAL indefinitely — &lt;a href="https://www.morling.dev/blog/insatiable-postgres-replication-slot/" rel="noopener noreferrer"&gt;Morling measured 18 GB/day on an idle RDS instance&lt;/a&gt;. Also set a CloudWatch alarm on &lt;code&gt;OldestReplicationSlotLag&lt;/code&gt; — warning at 1 GB, critical at 10 GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If &lt;code&gt;pg_replication_slots&lt;/code&gt; shows active logical slots, do NOT downsize without first confirming the CDC consumer can handle reduced throughput. Consider pausing CDC, downsizing, then resuming — but plan for a full re-sync if the slot is lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold cache: the first-hour tax
&lt;/h3&gt;

&lt;p&gt;Every RDS instance modification restarts the database engine. When it comes back up, the buffer pool is empty. This is the &lt;strong&gt;cold cache&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL's &lt;code&gt;shared_buffers&lt;/code&gt; starts empty — every query hits disk&lt;/li&gt;
&lt;li&gt;Query p99 latency spikes 3-10x for the first 30-60 minutes&lt;/li&gt;
&lt;li&gt;Connection pool may hit timeouts as queries take longer&lt;/li&gt;
&lt;li&gt;If you're monitoring SLOs, you'll see an error budget burn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schedule the modification during the lowest-traffic window&lt;/strong&gt; (e.g., 02:00-04:00 KST for our services)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use "Apply during maintenance window"&lt;/strong&gt; — not "Apply immediately"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-warm with read replica promotion&lt;/strong&gt; instead of in-place modification:

&lt;ul&gt;
&lt;li&gt;Create a read replica at the target (smaller) size&lt;/li&gt;
&lt;li&gt;Let the replica's buffer pool warm up from replication traffic&lt;/li&gt;
&lt;li&gt;Promote the replica to primary during maintenance window&lt;/li&gt;
&lt;li&gt;The promoted instance already has a warm cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget the cold cache period into your SLO error budget&lt;/strong&gt; — if you have 78% budget remaining, a 45-minute cache warm-up that degrades p99 by 3x might burn 2-3% of your monthly budget. That's acceptable. If you only have 50% remaining, it's not.&lt;/li&gt;
&lt;/ol&gt;
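&lt;p&gt;Item 4's arithmetic is worth making explicit. A sketch for a request-based latency SLO, assuming uniform traffic across the month; the 2% violation rate during warm-up is an illustrative assumption, not a measurement:&lt;/p&gt;

```python
def budget_burn_fraction(slo, window_min, violation_rate, period_min=30 * 24 * 60):
    """Fraction of the monthly error budget a degraded window consumes.

    slo            -- e.g. 0.999: 99.9% of requests must meet the latency target
    window_min     -- length of the degraded (cold cache) window, in minutes
    violation_rate -- fraction of requests missing the target during the window
    """
    budget = (1.0 - slo) * period_min  # budget in "bad request-minutes"
    burned = window_min * violation_rate
    return burned / budget

# 45-minute warm-up, 2% of requests slow, 99.9% SLO: about 2% of the budget.
budget_burn_fraction(0.999, 45, 0.02)  # ~0.021
```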

&lt;p&gt;&lt;strong&gt;Blue-green consideration:&lt;/strong&gt; RDS Blue/Green Deployments create a green (new) environment alongside the blue (current). This is safer for major changes but costs 2x during the switchover period. For a simple instance class change, in-place modification with read replica pre-warming is more cost-effective than blue-green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the industry uses&lt;/strong&gt;: AWS published a &lt;a href="https://aws.amazon.com/blogs/database/optimize-amazon-aurora-postgresql-auto-scaling-performance-with-automated-cache-pre-warming/" rel="noopener noreferrer"&gt;detailed guide on automated cache pre-warming&lt;/a&gt; for Aurora PostgreSQL using the &lt;code&gt;pg_prewarm&lt;/code&gt; extension, which loads specific tables and indexes into shared buffers before traffic arrives. For standard RDS PostgreSQL, the same extension is available — and there's even &lt;a href="https://github.com/robins/PrewarmRDSPostgres" rel="noopener noreferrer"&gt;an open-source tool&lt;/a&gt; specifically designed to pre-warm RDS PostgreSQL instances after restarts. Aurora also offers &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.cluster-cache-mgmt.html" rel="noopener noreferrer"&gt;Cluster Cache Management (CCM)&lt;/a&gt; which designates a replica to inherit the primary's buffer cache on failover — eliminating cold cache entirely for failover scenarios.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks rds_rightsizing&lt;/code&gt; — flags oversized RDS instances and unnecessary Multi-AZ in non-prod. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/rds_rightsizing.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. ElastiCache scheduling for dev/staging
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: ElastiCache clusters running 24/7 in dev/staging. Developers use them 10 hours/day, 5 days/week. You're paying for 118 idle hours per week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: Stop clusters outside business hours via EventBridge + Lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Lambda: stop dev ElastiCache at 8 PM, start at 8 AM
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'stop' or 'start'
&lt;/span&gt;    &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;elasticache&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Serverless: just scale to 0 ECPUs
&lt;/span&gt;        &lt;span class="c1"&gt;# Classic: delete with final snapshot, recreate on start
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Restore from snapshot
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we saved&lt;/strong&gt;: ~50% per cluster. 2 dev/staging clusters: &lt;strong&gt;~$80/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reliability check&lt;/strong&gt;: Always test that the start/restore actually works before relying on scheduling. A cluster that won't restore Monday morning is worse than paying weekend costs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks elasticache_scheduling&lt;/code&gt; — detects dev/staging ElastiCache running 24/7. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/elasticache_scheduling.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Reserved Instances: commit only after right-sizing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Teams buy RIs before optimizing. Then they downsize and the RI doesn't match. Money locked in for 1-3 years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: RIs are the &lt;strong&gt;last step&lt;/strong&gt;, not the first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1-2: Find waste (Part 1 — passive cleanup)
Week 3-4: Downsize safely (this article)
Week 5-6: Monitor — confirm new sizes are stable
Week 7-8: THEN buy RIs/Savings Plans for the right-sized resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Decision matrix&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Stable for 30+ days?&lt;/th&gt;
&lt;th&gt;CPU predictable?&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prod RDS (right-sized)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, 35-45%&lt;/td&gt;
&lt;td&gt;1-year RI (All Upfront)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prod EKS nodes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, 40-60%&lt;/td&gt;
&lt;td&gt;Compute Savings Plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev anything&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Never reserve — use Spot/scheduling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What we projected&lt;/strong&gt;: After right-sizing prod workloads, 1-year RIs would save an additional &lt;strong&gt;30-40%&lt;/strong&gt; on the new baseline — roughly $300-500/month for our scale.&lt;/p&gt;
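&lt;p&gt;The "reserve last" ordering has simple arithmetic behind it: a commitment only wins if the resource survives at its committed size for long enough. A sketch for a no-upfront 1-year RI (the 40% discount is illustrative; pull real rates from AWS pricing):&lt;/p&gt;

```python
def ri_breakeven_months(discount, term_months=12):
    """Months a workload must stay unchanged before an RI beats on-demand.

    A no-upfront RI bills its discounted rate for the whole term whether you
    use the instance or not. If you downsize after m months, you paid
    (1 - discount) * term in on-demand-equivalent months, versus just m
    months on-demand. Break-even: m = (1 - discount) * term.
    """
    return (1.0 - discount) * term_months

ri_breakeven_months(0.40)  # ~7.2: downsize within ~7 months and the RI lost money
```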

&lt;p&gt;&lt;strong&gt;Industry validation&lt;/strong&gt;: &lt;a href="https://cloudchipr.com/blog/aws-rds-right-sizing" rel="noopener noreferrer"&gt;CloudChipr's RDS right-sizing guide&lt;/a&gt; puts it bluntly: "Buying a Reserved Instance for an overprovisioned database just optimizes the cost of waste." &lt;a href="https://blog.easecloud.io/startup-tech/aws-cost-optimization-mistakes/" rel="noopener noreferrer"&gt;The Flexera State of the Cloud Report&lt;/a&gt; consistently finds that 27% of cloud spend is wasted, with premature RI commitment being a top contributor. If you must reserve, use Compute Savings Plans over EC2 Instance Savings Plans — &lt;a href="https://www.prosperops.com/blog/aws-reserved-instances/" rel="noopener noreferrer"&gt;ProsperOps explains&lt;/a&gt; that Compute SPs offer instance family flexibility, so you can still right-size without breaking coverage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks reserved_instances&lt;/code&gt; — calculates RI/Savings Plans ROI for stable workloads. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/reserved_instances.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Orphan resource cleanup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: EBS volumes from terminated instances, Elastic IPs not attached to anything, snapshots from 2 years ago, load balancers with zero targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SRE approach&lt;/strong&gt;: These are almost always safe to remove — but verify first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checklist before deletion&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] EBS volume: not attached, no recent snapshots depending on it&lt;/li&gt;
&lt;li&gt;[ ] EIP: not referenced in DNS or application config&lt;/li&gt;
&lt;li&gt;[ ] Snapshot: original volume no longer exists, no AMI depends on it&lt;/li&gt;
&lt;li&gt;[ ] ALB: zero registered targets for 7+ days, no DNS pointing to it&lt;/li&gt;
&lt;/ul&gt;
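&lt;p&gt;The EBS item scripts cleanly against &lt;code&gt;describe-volumes&lt;/code&gt; and &lt;code&gt;describe-snapshots&lt;/code&gt; output. A sketch of the decision logic only, so it stays testable offline; wire in the boto3 responses for real use (the 90-day snapshot cutoff mirrors what we used):&lt;/p&gt;

```python
from datetime import datetime, timezone

def orphan_ebs_volumes(volumes, snapshots, recent_days=90):
    """Unattached volumes with no snapshot taken in the last recent_days."""
    now = datetime.now(timezone.utc)
    recently_snapshotted = {
        s["VolumeId"]
        for s in snapshots
        if recent_days >= (now - s["StartTime"]).days
    }
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["VolumeId"] not in recently_snapshotted
    ]
```

&lt;p&gt;&lt;code&gt;State == "available"&lt;/code&gt; is how the EC2 API reports an unattached volume; feed the function the &lt;code&gt;Volumes&lt;/code&gt; and &lt;code&gt;Snapshots&lt;/code&gt; lists from the corresponding describe calls.&lt;/p&gt;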

&lt;p&gt;&lt;strong&gt;What we found&lt;/strong&gt;: 12 orphan EBS volumes, 4 unused EIPs, 47 snapshots older than 90 days. &lt;strong&gt;~$85/month&lt;/strong&gt; in pure waste.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Toolkit check&lt;/strong&gt;: &lt;code&gt;finops scan --checks unused_resources&lt;/code&gt; — flags unattached EBS, unused EIPs, old snapshots, idle ALBs. (&lt;a href="https://github.com/junegu/aws-finops-toolkit/blob/main/src/finops/checks/unused_resources.py" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhdxwd8xdtpdul01jwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhdxwd8xdtpdul01jwa.png" alt="Risk vs Savings: 7 optimizations ranked" width="720" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The complete picture: what's safe and what's not
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Prod safe?&lt;/th&gt;
&lt;th&gt;Dev/Staging safe?&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orphan cleanup&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$85/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache scheduling&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$80/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway → Instance&lt;/td&gt;
&lt;td&gt;Low-Med&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$88/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spot for non-prod&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$64/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2/EKS right-sizing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;With PDB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$120/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS right-sizing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;With replica&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$350/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reserved Instances&lt;/td&gt;
&lt;td&gt;Lock-in risk&lt;/td&gt;
&lt;td&gt;After sizing&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;td&gt;$300-500/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total from active downsizing&lt;/strong&gt;: ~$787/month realized now, or ~$1,087/month with the conservative end of the projected RI savings ($9.4-13K/year)&lt;br&gt;
&lt;strong&gt;Combined with Part 1 (passive waste)&lt;/strong&gt;: $1,431-2,104/month ($17.2-25.2K/year)&lt;/p&gt;


&lt;h2&gt;
  
  
  The toolkit: automate the discovery
&lt;/h2&gt;

&lt;p&gt;Everything in this article maps to a check in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aws-finops-toolkit

&lt;span class="c"&gt;# Scan all checks across multiple accounts&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--profiles&lt;/span&gt; dev,staging,prod

&lt;span class="c"&gt;# Run only the downsizing-related checks&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--checks&lt;/span&gt; ec2_rightsizing,nat_gateway,spot_candidates,rds_rightsizing,elasticache_scheduling,reserved_instances,unused_resources

&lt;span class="c"&gt;# Generate HTML report for management&lt;/span&gt;
finops report &lt;span class="nt"&gt;--format&lt;/span&gt; html &lt;span class="nt"&gt;--output&lt;/span&gt; finops-downsizing.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool finds the opportunities. The SRE decides which ones are safe to execute, and in what order.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FinOps without SRE is dangerous.&lt;/strong&gt; Cost tools don't know your SLOs. They'll tell you to downsize a database that's already at its limit during peak hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always add safety before removing cost.&lt;/strong&gt; Read replica before RDS downsize. PDB before node downsize. Drain handler before Spot. The safety net costs less than the savings — and it prevents the 2 AM page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve last, not first.&lt;/strong&gt; Right-size → stabilize → then commit. Buying RIs on oversized instances locks in waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prod and non-prod are different games.&lt;/strong&gt; Non-prod is where you optimize aggressively (Spot, scheduling, single-AZ). Prod is where you optimize carefully (right-sizing with replicas, PDBs, maintenance windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO data is your FinOps compass.&lt;/strong&gt; If your error budget is healthy, you have room to experiment. If it's burned, don't touch anything — reliability comes first.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: Downsizing Without Downtime&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The checks in this article are implemented in &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — an open-source CLI for automated AWS cost scanning.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>reliability</category>
    </item>
    <item>
      <title>FinOps for SREs: Cutting Costs Without Breaking Things</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:17:33 +0000</pubDate>
      <link>https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk</link>
      <guid>https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk</guid>
      <description>&lt;h1&gt;
  
  
  FinOps for SREs: Cutting Costs Without Breaking Things
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt; &lt;code&gt;reliability&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;Most FinOps advice starts with a cost dashboard. This series starts with a different question: &lt;strong&gt;how do we cut costs without violating our SLOs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm an SRE at a subsidiary of one of Korea's largest tech companies, managing four AWS accounts connected via a Transit Gateway hub-spoke architecture. When I was asked to reduce cloud spend, I didn't open AWS Cost Explorer first. I opened our SigNoz dashboards and checked our error budgets.&lt;/p&gt;

&lt;p&gt;That's the difference between FinOps and &lt;strong&gt;SRE-driven FinOps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feditzn0z3mns74lesqle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feditzn0z3mns74lesqle.png" alt="The SRE Guarantee: Error Budget Protection, Assured Minimum Downtime, Reliability Over Savings" width="720" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SRE Guarantee
&lt;/h2&gt;

&lt;p&gt;Before any cost optimization begins, I guarantee three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Error Budget Protection&lt;/strong&gt;&lt;br&gt;
No optimization will be executed if it risks breaching SLOs. If our error budget is below 50%, all FinOps work stops — reliability comes first.&lt;/p&gt;
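&lt;p&gt;As a minimal sketch of that gate (with an illustrative 99.9% SLO and availability figures, not our production numbers), the decision reduces to the ratio of observed error to allowed error:&lt;/p&gt;

```python
# Sketch of the 50% error-budget gate. SLO target and availability
# numbers here are illustrative, not production figures.

def error_budget_remaining(slo_target, observed_availability):
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    allowed_error = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    observed_error = 1.0 - observed_availability
    spent = observed_error / allowed_error
    return max(0.0, 1.0 - spent)

def finops_paused(remaining):
    """True when strictly less than half the budget is left."""
    return max(0.0, 0.5 - remaining) != 0.0

remaining = error_budget_remaining(0.999, 0.9995)   # half the budget spent
print(round(remaining, 6), finops_paused(remaining))   # 0.5 False
```

&lt;p&gt;The point of making the gate mechanical is that nobody has to argue about it under deadline pressure.&lt;/p&gt;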

&lt;p&gt;&lt;strong&gt;2. Assured Minimum Downtime&lt;/strong&gt;&lt;br&gt;
Every change has a rollback plan, a maintenance window, and a blast radius assessment. Zero downtime is the target; documented, brief downtime inside a maintenance window is the acceptable fallback. Unplanned downtime is unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reliability Over Savings&lt;/strong&gt;&lt;br&gt;
If forced to choose between $500/month in savings and a 0.01% availability risk, we choose availability. Always. The cost of an outage — in customer trust, in engineering hours, in incident response — exceeds any monthly savings.&lt;/p&gt;

&lt;p&gt;This guarantee isn't just a principle. It's encoded in every check of the &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — the open-source CLI I built to automate this workflow.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Series
&lt;/h2&gt;

&lt;p&gt;This series walks through the complete FinOps workflow I used to identify $48-67K/year in savings across four AWS accounts — starting with analysis, through passive cleanup, to active downsizing with SRE guardrails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" alt="Waste breakdown across 4 AWS accounts" width="720" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;9 checks before cutting any cost.&lt;/strong&gt; Traffic analysis, SLO status, cache dependencies, incident history, RI/SP coverage, and more. This is the analysis phase — never optimize what you don't fully understand.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops preflight&lt;/code&gt; command (&lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt;)&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp"&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Passive waste — things nobody uses.&lt;/strong&gt; Abandoned VPCs ($748/mo), orphan CloudWatch log groups ($110-165/mo), S3 lifecycle vs Intelligent-Tiering ($75-104/mo). Zero risk to production. Total: $933-1,017/month.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops scan&lt;/code&gt; — &lt;code&gt;vpc_waste&lt;/code&gt;, &lt;code&gt;cloudwatch_waste&lt;/code&gt;, &lt;code&gt;s3_lifecycle&lt;/code&gt; checks&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Active optimization — shrinking running infrastructure with SRE guardrails.&lt;/strong&gt; EC2/EKS right-sizing with PDBs, NAT Gateway replacement, Spot with drain handlers, RDS right-sizing with read replicas and cold cache planning, ElastiCache scheduling, and Reserved Instances (commit last, not first). Total: $787-1,087/month.&lt;/p&gt;

&lt;p&gt;→ OSS: &lt;code&gt;finops scan&lt;/code&gt; — &lt;code&gt;ec2_rightsizing&lt;/code&gt;, &lt;code&gt;nat_gateway&lt;/code&gt;, &lt;code&gt;spot_candidates&lt;/code&gt;, &lt;code&gt;rds_rightsizing&lt;/code&gt;, &lt;code&gt;elasticache_scheduling&lt;/code&gt;, &lt;code&gt;reserved_instances&lt;/code&gt;, &lt;code&gt;unused_resources&lt;/code&gt; checks&lt;/p&gt;


&lt;h2&gt;
  
  
  Combined Savings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;th&gt;Annual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Part 1: Passive waste cleanup&lt;/td&gt;
&lt;td&gt;$933-1,017&lt;/td&gt;
&lt;td&gt;$11.2-12.2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Part 2: Active downsizing&lt;/td&gt;
&lt;td&gt;$787-1,087&lt;/td&gt;
&lt;td&gt;$9.4-13K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total identified&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,720-2,104&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$20.6-25.2K&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P0-P2 roadmap (pending)&lt;/td&gt;
&lt;td&gt;$3,995-5,565&lt;/td&gt;
&lt;td&gt;$48-67K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every optimization in this series passed through the SRE guarantee. Not a single SLO was breached. Not a single unplanned outage occurred.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Toolkit
&lt;/h2&gt;

&lt;p&gt;Everything in this series maps to &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — an open-source CLI that automates the discovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pre-flight analysis before any change&lt;/span&gt;
finops preflight &lt;span class="nt"&gt;--target&lt;/span&gt; pn-sh-rds-prod &lt;span class="nt"&gt;--profile&lt;/span&gt; dodo-dev &lt;span class="nt"&gt;--apm&lt;/span&gt; signoz

&lt;span class="c"&gt;# Scan for cost waste across accounts&lt;/span&gt;
finops scan &lt;span class="nt"&gt;--profiles&lt;/span&gt; dev,staging,prod

&lt;span class="c"&gt;# Generate report for stakeholders&lt;/span&gt;
finops report &lt;span class="nt"&gt;--format&lt;/span&gt; html &lt;span class="nt"&gt;--output&lt;/span&gt; finops-report.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool finds the opportunities. The SRE decides which ones are safe to execute.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the introduction to the "FinOps for SREs" series. Start with &lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist&lt;/a&gt; or jump to the part most relevant to your situation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm June, an SRE with 5+ years of experience at Korea's top tech companies including Coupang (NYSE: CPNG) and NAVER Corporation. I write about real-world infrastructure problems. Find me on &lt;a href="https://linkedin.com/in/junegu" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production</title>
      <dc:creator>June Gu</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:11:52 +0000</pubDate>
      <link>https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp</link>
      <guid>https://dev.to/june-gu/how-i-found-12kyear-in-aws-waste-across-4-accounts-without-touching-production-28bp</guid>
      <description>&lt;h1&gt;
  
  
  How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;aws&lt;/code&gt; &lt;code&gt;finops&lt;/code&gt; &lt;code&gt;cloudcost&lt;/code&gt; &lt;code&gt;sre&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;I joined a subsidiary of one of Korea's largest tech companies at the beginning of 2026 as the sole SRE. I inherited four AWS accounts — hub, shared, waitlist, and loyalty — connected via a Transit Gateway hub-spoke architecture. Each account ran its own mix of EKS clusters, RDS instances, Aurora clusters, legacy EC2 services, and networking stacks accumulated over several years by multiple teams.&lt;/p&gt;

&lt;p&gt;Nobody had done a cost audit since the accounts were created. Resources from decommissioned projects were still running. Log groups from deleted infrastructure were still ingesting. A privacy VPC whose six WorkSpaces nobody had logged into for months was quietly billing $525/month.&lt;/p&gt;

&lt;p&gt;Within two weeks of part-time analysis and execution, I cut $644/month in immediate waste and identified a total of $933-1,017/month ($11.2-12.2K/year) across three workstreams — all without touching a single production service. Beyond that, I mapped out a P0-P2 roadmap worth $48-67K/year that is now pending platform team approval.&lt;/p&gt;

&lt;p&gt;This is what the work actually looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mgddakkqmxlt3b43oy.png" alt="Waste breakdown: VPC $748/mo, CloudWatch $110-165/mo, S3 $75-104/mo" width="720" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit: mapping costs across four accounts
&lt;/h2&gt;

&lt;p&gt;Before optimizing anything, I needed to understand what we were paying for and who owned it. Our four accounts mapped to distinct business units:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Account&lt;/th&gt;
&lt;th&gt;What runs there&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hub&lt;/td&gt;
&lt;td&gt;Transit Gateway, ArgoCD, ECR, bastion, monitoring (SigNoz)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shared&lt;/td&gt;
&lt;td&gt;Ordering platform (EKS, RDS, microservices)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;waitlist&lt;/td&gt;
&lt;td&gt;Waitlist service (Aurora clusters, legacy EC2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;Loyalty platform (RDS, Aurora, legacy WorkSpaces, VPN)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each account gets its own AWS bill, which is one of the underappreciated benefits of multi-account architecture for FinOps. No tagging allocation formulas, no arguments about which team caused a cost spike. The Transit Gateway attachment cost ($51.10/month per VPC) is the "tax" each spoke pays for connectivity — and it is transparent.&lt;/p&gt;
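&lt;p&gt;That $51.10 falls straight out of the hourly attachment rate. A quick check, assuming the $0.07/attachment-hour rate for our region (rates vary by region, so verify against current AWS pricing):&lt;/p&gt;

```python
# Per-VPC Transit Gateway attachment "tax", derived from the hourly rate.
# 0.07 USD/attachment-hour is an assumed regional rate, not a quote.
HOURS_PER_MONTH = 730
attachment_rate = 0.07
monthly_tax = attachment_rate * HOURS_PER_MONTH
print(round(monthly_tax, 2))   # 51.1
```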

&lt;p&gt;I added standardized cost allocation tags derived from our existing Terraform naming convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Org&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;       &lt;span class="c1"&gt;# company identifier&lt;/span&gt;
  &lt;span class="nx"&gt;Group&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt;     &lt;span class="c1"&gt;# "hub", "sh", "nw", "dp"&lt;/span&gt;
  &lt;span class="nx"&gt;Service&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt;   &lt;span class="c1"&gt;# "core", "ordering", etc.&lt;/span&gt;
  &lt;span class="nx"&gt;Env&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;       &lt;span class="c1"&gt;# "prod", "stage", "dev"&lt;/span&gt;
  &lt;span class="nx"&gt;ManagedBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these tags activated in AWS Cost Explorer, I could slice spend by team, service, and environment. But the real insights came from cross-referencing Cost Explorer data with SigNoz (our centralized observability platform running in the hub account), which collects resource utilization metrics from every spoke cluster. That combination — dollars from Cost Explorer, utilization from SigNoz — is what let me confidently identify waste rather than guess at it.&lt;/p&gt;

&lt;p&gt;The audit revealed three categories of waste, each requiring a different approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: VPC cleanup — $748/month saved
&lt;/h2&gt;

&lt;p&gt;This was the biggest win, and it was almost entirely abandoned infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The abandoned dev VPC ($64/month)
&lt;/h3&gt;

&lt;p&gt;The shared account contained a VPC called &lt;code&gt;fnb-dev&lt;/code&gt; that had been created for a food-and-beverage integration project. The project was cancelled, but the VPC lived on: two NAT Gateways, two Elastic IPs, an EC2 instance, an Internet Gateway, and the VPC itself. Nobody was using any of it.&lt;/p&gt;

&lt;p&gt;I confirmed zero traffic on the NAT Gateways via CloudWatch metrics (BytesIn/BytesOut flat at zero for 90+ days), verified no DNS records pointed to the EC2 instance, and tore the entire VPC down.&lt;/p&gt;
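&lt;p&gt;The idle check itself is trivial once the metrics are in hand. A sketch of the decision logic (the CloudWatch query for daily &lt;code&gt;BytesOutToDestination&lt;/code&gt; and &lt;code&gt;BytesInFromDestination&lt;/code&gt; sums is omitted; the function and parameter names are mine, not part of any SDK):&lt;/p&gt;

```python
# "Is this NAT Gateway idle?" applied to a list of daily byte sums
# pulled from CloudWatch. Hypothetical helper for illustration only.
def nat_gateway_is_idle(daily_byte_sums, min_days=90):
    """Idle = a full observation window in which every daily sum is zero."""
    enough_history = min(len(daily_byte_sums), min_days) == min_days
    all_zero = all(v == 0 for v in daily_byte_sums)
    return enough_history and all_zero

print(nat_gateway_is_idle([0] * 90))          # True: flat for 90 days
print(nat_gateway_is_idle([0] * 89 + [512]))  # False: one day of traffic
```

&lt;p&gt;Requiring the full window matters: a gateway created 10 days ago with zero traffic is not the same signal as one that has been flat for a quarter.&lt;/p&gt;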

&lt;h3&gt;
  
  
  The privacy VPC that outlived its purpose ($525/month)
&lt;/h3&gt;

&lt;p&gt;This was the expensive one. The loyalty account had a "privacy VPC" running six Amazon WorkSpaces, an AWS Directory Service instance, a Storage Gateway, and a NAT Gateway. It was originally set up for a compliance project that had since been handled differently.&lt;/p&gt;

&lt;p&gt;The six WorkSpaces alone cost roughly $300/month. The Directory Service added another $100+. None of the WorkSpaces had been logged into recently. I confirmed with the platform team that the compliance workflow no longer required this infrastructure, documented the teardown plan, and removed the entire VPC.&lt;/p&gt;

&lt;p&gt;$525/month for infrastructure that was doing literally nothing. This is the kind of waste that hides in multi-account setups — each account team assumes someone else needs it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unattached Elastic IPs and orphaned NAT Gateways ($11/month)
&lt;/h3&gt;

&lt;p&gt;Small individually, but they add up and signal a pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An unattached EIP in the hub account: $7.50/month&lt;/li&gt;
&lt;li&gt;A NAT Gateway EIP in shared-stage that was no longer needed: $3.50/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also identified two NAT Gateways in the loyalty account's security VPC worth ~$130/month that are pending vendor coordination before removal.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC cleanup totals
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Account&lt;/th&gt;
&lt;th&gt;Monthly savings&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fnb-dev VPC full teardown&lt;/td&gt;
&lt;td&gt;shared&lt;/td&gt;
&lt;td&gt;~$64&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy VPC full teardown&lt;/td&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;~$525&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unattached EIPs (2)&lt;/td&gt;
&lt;td&gt;hub, shared&lt;/td&gt;
&lt;td&gt;~$11&lt;/td&gt;
&lt;td&gt;Done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;security-vpc NAT Gateways (2)&lt;/td&gt;
&lt;td&gt;loyalty&lt;/td&gt;
&lt;td&gt;~$130&lt;/td&gt;
&lt;td&gt;Pending vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$748&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: The highest-ROI FinOps work is not right-sizing or reservations. It is finding entire stacks that should not exist. A single abandoned VPC with managed services can cost more per month than all your dev environment optimizations combined.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: CloudWatch log retention — $110-165/month saved
&lt;/h2&gt;

&lt;p&gt;CloudWatch Logs is one of those services that silently accumulates cost because the default retention is "never expire." When nobody sets explicit retention policies, logs grow forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three tiers of log waste
&lt;/h3&gt;

&lt;p&gt;I categorized every log group in the loyalty account (which had the most legacy services) into three buckets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Orphan log groups (delete immediately)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I tore down the privacy VPC, seven CloudWatch log groups were left behind: five from Storage Gateway and two from Lambda functions that had been part of the privacy workflow. These groups had no active log streams but still stored data. I deleted them outright.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Inactive service logs (set to 30-day retention)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eighteen log groups belonged to services that had not emitted a log event in 12+ months. Old message queue processors, abandoned feature branches that had been deployed and forgotten, food-and-beverage integration services that matched the cancelled project. These got a 30-day retention policy — enough time to investigate if someone suddenly asks "what happened with service X last month?" while ensuring the data does not accumulate indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 — Active but over-retained logs (reduce to 90 days)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main production service log group had its retention set to 731 days (two full years). It had accumulated 199 GB of log data. For a service whose logs are primarily useful for incident investigation (where you rarely look back more than a few weeks), two years is excessive. I reduced retention to 90 days.&lt;/p&gt;
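&lt;p&gt;Rough math on what that retention change is worth, assuming roughly uniform ingestion over the old window and the standard $0.03/GB-month CloudWatch Logs storage price (both are assumptions, not billed figures):&lt;/p&gt;

```python
# Storage effect of cutting retention from 731 to 90 days on a log
# group holding 199 GB, assuming ingestion has been roughly uniform.
stored_gb = 199
price_per_gb_month = 0.03                  # assumed storage rate
steady_state_gb = stored_gb * (90 / 731)   # what survives the new policy
saved_per_month = (stored_gb - steady_state_gb) * price_per_gb_month
print(round(steady_state_gb))        # 25 GB retained at steady state
print(round(saved_per_month, 2))     # 5.23 USD/month
```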

&lt;h3&gt;
  
  
  What is still pending
&lt;/h3&gt;

&lt;p&gt;The immediate changes saved roughly $5/month in ongoing storage costs, but the real savings come from items still awaiting platform team confirmation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transport log optimization&lt;/strong&gt;: 216 GB/month ingestion rate, costing $164/month. This one needs careful analysis of whether the log data feeds any dashboards or alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy dashboard cleanup&lt;/strong&gt;: 12 CloudWatch dashboards that nobody has viewed in months, costing $45-60/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unused alarm cleanup&lt;/strong&gt;: Alarms attached to deleted resources, $10-20/month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total potential: $110-165/month once all items are resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Log retention is a governance problem, not a technical one. Set a default retention policy (I recommend 30 days for non-production, 90 days for production) at the organizational level, and require explicit justification for anything longer. The cost of storing logs you will never read adds up faster than you expect.&lt;/p&gt;

&lt;p&gt;This is not unique to us. &lt;a href="https://medium.com/@dobeerman/tackling-hidden-aws-costs-the-cleanup-of-dormant-cloudwatch-log-groups-bonus-c08496449f05" rel="noopener noreferrer"&gt;One team discovered&lt;/a&gt; that thousands of dormant log groups across multiple regions were costing several hundred dollars per month storing logs nobody would ever read. Another &lt;a href="https://www.infracost.io/finops-policies/aws-cloudwatch-consider-using-a-retention-policy/" rel="noopener noreferrer"&gt;AWS case study&lt;/a&gt; showed a single log group dropping from $415/year to $18/year — a 95% reduction — simply by setting a 30-day retention policy. AWS even &lt;a href="https://aws.amazon.com/blogs/infrastructure-and-automation/reduce-log-storage-costs-by-automating-retention-settings-in-amazon-cloudwatch/" rel="noopener noreferrer"&gt;published an automation guide&lt;/a&gt; for enforcing retention policies at scale because this problem is so widespread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: S3 lifecycle policies — $75-104/month saved
&lt;/h2&gt;

&lt;p&gt;S3 storage costs are easy to ignore because individual buckets rarely cost more than a few dollars. But across four accounts with years of accumulated data, the total becomes significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  The lifecycle approach
&lt;/h3&gt;

&lt;p&gt;I applied a standard lifecycle policy to three buckets in the waitlist account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0-30 days:   S3 Standard (frequent access for recent data)
30-90 days:  S3 Standard-IA (infrequent access, lower storage cost)
90+ days:    S3 Glacier Instant Retrieval (archive, sub-millisecond access)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
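&lt;p&gt;For reference, the same three-tier policy expressed as the &lt;code&gt;Rules&lt;/code&gt; payload that boto3's &lt;code&gt;put_bucket_lifecycle_configuration&lt;/code&gt; expects (the rule ID is mine; apply per bucket after checking for prefix-specific exceptions):&lt;/p&gt;

```python
# The 30/90-day tiering policy as an S3 lifecycle Rules payload.
# Rule ID is illustrative. Pass as LifecycleConfiguration={"Rules": ...}
# to boto3's put_bucket_lifecycle_configuration.
LIFECYCLE_RULES = [
    {
        "ID": "tier-by-age-standard-ia-glacier-ir",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # every object in the bucket
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER_IR"},
        ],
    }
]
print([t["Days"] for t in LIFECYCLE_RULES[0]["Transitions"]])   # [30, 90]
```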



&lt;p&gt;The three buckets and their sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket pattern&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-papertrail&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,293 GB&lt;/td&gt;
&lt;td&gt;Application log exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-datalab-athena-tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;238 GB&lt;/td&gt;
&lt;td&gt;Analytical query results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service}-upload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;352 GB&lt;/td&gt;
&lt;td&gt;User-uploaded content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total: 1,883 GB across three buckets, with the vast majority of objects older than 90 days and rarely accessed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Intelligent-Tiering was the wrong answer
&lt;/h3&gt;

&lt;p&gt;My first instinct was S3 Intelligent-Tiering — let AWS automatically move objects between access tiers based on usage patterns. It sounds ideal. But when I ran the numbers, it was actually more expensive for our buckets.&lt;/p&gt;

&lt;p&gt;The reason: Intelligent-Tiering charges a monitoring fee of $0.0025 per 1,000 objects per month. For buckets with millions of small objects, this monitoring cost exceeds the storage savings from automatic tiering.&lt;/p&gt;

&lt;p&gt;Consider a bucket with 54.8 million objects averaging under 128 KB each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Intelligent-Tiering monitoring cost:
  54,800,000 objects / 1,000 × $0.0025 = $137/month in monitoring alone

Standard-IA storage savings for same bucket:
  Negligible — objects under 128 KB are charged minimum 128 KB in IA,
  so small objects can actually cost MORE in IA than Standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The monitoring fee alone was more than the entire bucket's current storage cost. Lifecycle policies with explicit transitions based on object age are cheaper and more predictable for buckets with high object counts or small average object sizes.&lt;/p&gt;
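&lt;p&gt;The fee math generalizes to any bucket you are sizing up. Treat it as an upper bound, since whether a given object is actually monitored depends on its size:&lt;/p&gt;

```python
# Intelligent-Tiering monitoring fee at the published $0.0025 per
# 1,000 objects/month rate. Upper bound: not every object is
# necessarily monitored.
def it_monitoring_fee(object_count, rate_per_1k=0.0025):
    return object_count / 1000 * rate_per_1k

print(it_monitoring_fee(54_800_000))   # 137.0
```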

&lt;p&gt;&lt;strong&gt;When Intelligent-Tiering does make sense&lt;/strong&gt;: Buckets with fewer, larger objects (think database backups, media files) where the monitoring fee per object is negligible compared to the storage cost delta between tiers. For log-style buckets with millions of small files, stick with lifecycle rules.&lt;/p&gt;

&lt;p&gt;This is a common trap. &lt;a href="https://sedai.io/blog/amazon-s3-intelligent-tiering-storage-optimization" rel="noopener noreferrer"&gt;Sedai's analysis&lt;/a&gt; confirmed that for workloads with millions of small files and predictable access patterns, explicit lifecycle rules are cheaper than Intelligent-Tiering because they eliminate the monitoring fee entirely while achieving the same storage outcome. Even &lt;a href="https://aws.amazon.com/s3/storage-classes/intelligent-tiering/" rel="noopener noreferrer"&gt;AWS's own pricing page&lt;/a&gt; notes that objects under 128 KB are never auto-tiered — they simply stay in the Frequent Access tier at the standard rate, so Intelligent-Tiering buys them nothing. If you know your data's access pattern (and for logs, you do), skip the automation and set explicit rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is still pending
&lt;/h3&gt;

&lt;p&gt;Five additional buckets totaling over 3 TB are awaiting platform team review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application logs bucket (916 GB, actively written)&lt;/li&gt;
&lt;li&gt;Access logging bucket (751 GB, 54.8M small objects)&lt;/li&gt;
&lt;li&gt;Device logs bucket (730 GB, 9.6M objects)&lt;/li&gt;
&lt;li&gt;VPC flow logs bucket (411 GB, actively written)&lt;/li&gt;
&lt;li&gt;Data dump bucket (241 GB, 19.1M objects)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several of these are candidates for expiration policies (delete after N days) rather than just tiering, which would further reduce costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: $644/month immediate, $933-1,017/month total
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workstream&lt;/th&gt;
&lt;th&gt;Immediate savings&lt;/th&gt;
&lt;th&gt;Pending platform approval&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1 VPC cleanup&lt;/td&gt;
&lt;td&gt;$600/mo&lt;/td&gt;
&lt;td&gt;$130/mo&lt;/td&gt;
&lt;td&gt;$748/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2 CloudWatch logs&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;td&gt;$105-160/mo&lt;/td&gt;
&lt;td&gt;$110-165/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#3 S3 lifecycle&lt;/td&gt;
&lt;td&gt;$39/mo&lt;/td&gt;
&lt;td&gt;$36-65/mo&lt;/td&gt;
&lt;td&gt;$75-104/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$644/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$271-355/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$933-1,017/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Annual: $11,196 - $12,204&lt;/strong&gt;&lt;/p&gt;
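&lt;p&gt;The headline range is just the three workstreams added up, which makes it a two-line sanity check before the numbers go in front of stakeholders:&lt;/p&gt;

```python
# Reproduce the headline monthly and annual ranges from the table.
low = 748 + 110 + 75       # VPC + CloudWatch + S3, conservative estimates
high = 748 + 165 + 104     # same workstreams, upper estimates
print(low, high)                 # 933 1017
print(low * 12, high * 12)       # 11196 12204
```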

&lt;p&gt;Every dollar saved here came from resources that were either completely unused or storing data that nobody was reading. No production services were modified. No architectural changes were required. No users were impacted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The roadmap: $48-67K/year still on the table
&lt;/h2&gt;

&lt;p&gt;The VPC/CloudWatch/S3 work was the low-hanging fruit — things I could verify and execute without risking service availability. The next phase requires platform team coordination because it involves production databases and compute.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Monthly savings&lt;/th&gt;
&lt;th&gt;Key items&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P0 Immediate&lt;/td&gt;
&lt;td&gt;Unused databases&lt;/td&gt;
&lt;td&gt;$1,772-2,172/mo&lt;/td&gt;
&lt;td&gt;Idle RDS instances, oversized ElastiCache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P1 Short-term&lt;/td&gt;
&lt;td&gt;Legacy compute&lt;/td&gt;
&lt;td&gt;$933-1,113/mo&lt;/td&gt;
&lt;td&gt;More unused DBs, idle EC2, DocumentDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 Medium-term&lt;/td&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;$1,290-2,280/mo&lt;/td&gt;
&lt;td&gt;EKS Karpenter, dev scheduling, gp2-to-gp3 migration, Redis EOL upgrades&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Full roadmap: $3,995-5,565/month = $48,000-67,000/year&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The P0 items alone — a few RDS instances that are running but not connected to any application — would nearly double the savings achieved so far. But these require the platform team to confirm that the databases are truly unused, not just "used once a quarter for a batch job nobody documented."&lt;/p&gt;

&lt;p&gt;This is why the 3-phase methodology matters: the cost of accidentally deleting a database that someone needs is orders of magnitude higher than the monthly savings from removing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig6my7pt5f01po24tk4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig6my7pt5f01po24tk4x.png" alt="3-phase methodology: Analyze → Confirm → Execute" width="720" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The methodology: why process matters more than tools
&lt;/h2&gt;

&lt;p&gt;Every optimization followed a 3-phase workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Analyze&lt;/strong&gt;: Gather metrics from Cost Explorer and SigNoz. Calculate exact savings. Map dependencies. Classify the downtime risk: zero-impact (unused resource), brief disruption (restart required), or service-affecting (production traffic).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Platform team confirm&lt;/strong&gt;: For anything touching production or anything where ownership is ambiguous, I create a Confluence page with the analysis, tag the responsible team, and wait for explicit confirmation. This is the slow part, and it should be. Rushing this step is how you delete the database that runs the quarterly compliance report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3 — Execute&lt;/strong&gt;: Cross-check the plan one final time, send a Slack notification to the operations channel, execute the change, verify the expected cost reduction appears in the next billing cycle, and document the outcome.&lt;/p&gt;

&lt;p&gt;The prioritization within each phase follows two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Downtime risk first&lt;/strong&gt;: Zero-impact items (unused resources) before brief-disruption items before service-affecting items. This builds trust with the platform team — they see you removing dead weight before you propose changes to anything live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings amount second&lt;/strong&gt;: Within the same risk tier, tackle the highest-dollar items first for maximum ROI on your analysis time.&lt;/li&gt;
&lt;/ol&gt;
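&lt;p&gt;The two rules collapse into a single sort key. A toy sketch -- the backlog items below are invented for illustration, not the real inventory:&lt;/p&gt;

```python
# The two prioritization rules as one sort key:
# risk tier first (zero-impact before brief-disruption before
# service-affecting), then largest monthly savings within the same tier.

RISK_ORDER = {"zero-impact": 0, "brief-disruption": 1, "service-affecting": 2}

def prioritize(items):
    return sorted(items, key=lambda it: (RISK_ORDER[it["risk"]],
                                         -it["monthly_savings"]))

backlog = [
    {"name": "downsize prod RDS", "risk": "service-affecting", "monthly_savings": 900},
    {"name": "delete idle RDS", "risk": "zero-impact", "monthly_savings": 600},
    {"name": "remove orphaned log groups", "risk": "zero-impact", "monthly_savings": 50},
    {"name": "gp2-to-gp3 migration", "risk": "brief-disruption", "monthly_savings": 200},
]

for item in prioritize(backlog):
    print(item["name"])
```

&lt;p&gt;Note that the $900/month item sorts last despite being the biggest number: trust comes before savings.&lt;/p&gt;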

&lt;p&gt;This workflow is not exciting. It does not involve a fancy FinOps platform or automated recommendation engine. But it works, and it ensures that every change is reversible, documented, and approved by someone who understands the service context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next: automating the audit
  What's next: automating the audit
&lt;/h2&gt;

&lt;p&gt;The manual audit across four accounts took about two weeks of part-time work. Most of that time was spent on the same repetitive queries: find unattached EIPs, find log groups with no recent events, find S3 buckets without lifecycle policies, find resources with zero utilization.&lt;/p&gt;
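&lt;p&gt;Most of those queries are one-liners once you know which response field to filter on. Take unattached EIPs: &lt;code&gt;DescribeAddresses&lt;/code&gt; only includes an &lt;code&gt;AssociationId&lt;/code&gt; when the address is attached to something. A sketch on sample data -- in a real scan the list would come from &lt;code&gt;boto3.client("ec2").describe_addresses()["Addresses"]&lt;/code&gt;:&lt;/p&gt;

```python
# One of the repetitive audit queries: unattached Elastic IPs.
# The dicts mirror the ec2 DescribeAddresses response shape, where
# "AssociationId" is present only when the address is attached; the sample
# data stands in for a live API call so this runs without credentials.

def find_unattached_eips(addresses):
    """Return allocation IDs of EIPs with no association (billed while idle)."""
    return [a["AllocationId"] for a in addresses if "AssociationId" not in a]

sample = [
    {"AllocationId": "eipalloc-aaa", "AssociationId": "eipassoc-111"},
    {"AllocationId": "eipalloc-bbb"},  # unattached -&gt; candidate for release
]
print(find_unattached_eips(sample))  # ['eipalloc-bbb']
```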

&lt;p&gt;I am building an open-source CLI tool — &lt;a href="https://github.com/junegu/aws-finops-toolkit" rel="noopener noreferrer"&gt;aws-finops-toolkit&lt;/a&gt; — to automate these patterns across multi-account AWS environments. The goal is to reduce the initial audit from two weeks to an afternoon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-account scanning via AWS Organizations or assumed roles&lt;/li&gt;
&lt;li&gt;Automatic detection of orphaned resources (unattached EIPs, empty log groups, idle NAT Gateways)&lt;/li&gt;
&lt;li&gt;S3 lifecycle policy recommendations based on access patterns and object size distribution&lt;/li&gt;
&lt;li&gt;CloudWatch log group analysis with retention recommendations&lt;/li&gt;
&lt;li&gt;Cost-per-resource estimates using the AWS Pricing API&lt;/li&gt;
&lt;li&gt;Markdown and CSV report generation for stakeholder review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you manage multiple AWS accounts and have seen the same patterns I described here, the repo could use contributors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The biggest savings are in resources that should not exist.&lt;/strong&gt; Right-sizing and reservations get all the blog posts, but a single abandoned VPC with six WorkSpaces cost more per month than all the node right-sizing I could do across every dev environment. Always start by looking for entire stacks that can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-account architecture is a FinOps feature.&lt;/strong&gt; Per-account billing makes cost ownership unambiguous. When I proposed removing the privacy VPC in the loyalty account, I was talking to one team about one account's bill. There was no allocation debate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Intelligent-Tiering is not a universal answer.&lt;/strong&gt; For high-object-count buckets with small files, the per-object monitoring fee can exceed the storage savings. Always run the numbers before enabling it. Lifecycle policies with explicit age-based transitions are cheaper and more predictable for log-style workloads.&lt;/p&gt;
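&lt;p&gt;The arithmetic is worth showing. Intelligent-Tiering charges a flat monitoring fee per object, while the savings scale with object size, so there is a break-even size below which tiering loses money. The prices below are illustrative us-east-1 list prices at the time of writing -- check current pricing before relying on them:&lt;/p&gt;

```python
# Back-of-envelope break-even size for S3 Intelligent-Tiering.
# Illustrative us-east-1 list prices; verify against current AWS pricing.

MONITORING_PER_OBJECT = 0.0025 / 1000   # $/object/month ($0.0025 per 1,000 objects)
STANDARD_PER_GB = 0.023                 # $/GB/month, S3 Standard
INFREQUENT_PER_GB = 0.0125              # $/GB/month, Infrequent Access tier

# An object only saves money once (size_gb * price delta) exceeds the fee.
break_even_gb = MONITORING_PER_OBJECT / (STANDARD_PER_GB - INFREQUENT_PER_GB)
break_even_kib = break_even_gb * 1024 * 1024

print(f"break-even object size: ~{break_even_kib:.0f} KiB")
```

&lt;p&gt;At a break-even of roughly a quarter megabyte, a bucket full of small log objects mostly pays the monitoring fee without earning the savings. (AWS also excludes objects under 128 KB from monitoring and auto-tiering entirely, which softens but does not eliminate the problem.)&lt;/p&gt;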

&lt;p&gt;&lt;strong&gt;4. Log retention is a governance gap, not a technical problem.&lt;/strong&gt; When the default is "retain forever," every log group becomes a slowly growing cost center. Set organizational defaults (30 days non-prod, 90 days prod) and require justification for longer retention.&lt;/p&gt;
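&lt;p&gt;Enforcing those defaults starts with finding the unbounded groups. A sketch of the audit half, assuming log-group dicts that mirror &lt;code&gt;DescribeLogGroups&lt;/code&gt; output, where a missing &lt;code&gt;retentionInDays&lt;/code&gt; means retain forever; the defaults are the ones suggested above:&lt;/p&gt;

```python
# Flag log groups that still default to "retain forever" and propose the
# organizational default: 30 days non-prod, 90 days prod. The dicts mirror
# the CloudWatch Logs DescribeLogGroups response shape, where a missing
# "retentionInDays" key means never-expire; names here are made up.

DEFAULTS = {"prod": 90, "nonprod": 30}

def retention_plan(log_groups, env):
    """Return (log_group_name, proposed_retention_days) for unbounded groups."""
    return [
        (lg["logGroupName"], DEFAULTS[env])
        for lg in log_groups
        if "retentionInDays" not in lg  # no retention set = retain forever
    ]

groups = [
    {"logGroupName": "/aws/lambda/report", "retentionInDays": 14},
    {"logGroupName": "/aws/eks/cluster"},  # unbounded
]
print(retention_plan(groups, "nonprod"))  # [('/aws/eks/cluster', 30)]
```

&lt;p&gt;Applying a plan entry is then a single &lt;code&gt;put_retention_policy&lt;/code&gt; call per log group.&lt;/p&gt;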

&lt;p&gt;&lt;strong&gt;5. Process builds trust, trust unlocks bigger savings.&lt;/strong&gt; The $644/month I saved independently was useful, but the $48-67K/year roadmap requires platform team buy-in. By starting with zero-risk items (dead VPCs, orphaned log groups) and following a documented workflow, I built the credibility needed for the team to approve changes to production infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Cost optimization is not a project — it is a practice.&lt;/strong&gt; These savings will erode if nobody checks back in six months. I set up monthly cost reviews, budget alarms on every account, and Slack notifications for every optimization action. The next engineer who inherits this infrastructure will at least know what was changed and why.&lt;/p&gt;




&lt;h3&gt;
  
  
  FinOps for SREs — Series Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/finops-for-sres-cutting-costs-without-breaking-things-2fbk"&gt;Series Introduction: The SRE Guarantee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/the-pre-flight-checklist-9-things-to-analyze-before-cutting-any-aws-cost-35dh"&gt;Part 0: The Pre-Flight Checklist — 9 Checks Before Cutting Any Cost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: How I Found $12K/Year in AWS Waste&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/june-gu/downsizing-without-downtime-an-sres-guide-to-safe-cost-optimization-1lck"&gt;Part 2: Downsizing Without Downtime&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm June, an SRE with 5+ years of experience at Korea's top tech companies including Coupang (NYSE: CPNG) and NAVER Corporation. I write about real-world infrastructure problems. Find me on &lt;a href="https://linkedin.com/in/junegu" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloudcost</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
