<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rick Wise</title>
    <description>The latest articles on DEV Community by Rick Wise (@cloudwiseteam).</description>
    <link>https://dev.to/cloudwiseteam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3582447%2Fe7a88946-c7a3-4aad-9242-6d52380c09f1.png</url>
      <title>DEV Community: Rick Wise</title>
      <link>https://dev.to/cloudwiseteam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cloudwiseteam"/>
    <language>en</language>
    <item>
      <title>ElastiCache Pricing Breakdown: Where the Money Actually Goes</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:29:01 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/elasticache-pricing-breakdown-where-the-money-actually-goes-1jc5</link>
      <guid>https://dev.to/cloudwiseteam/elasticache-pricing-breakdown-where-the-money-actually-goes-1jc5</guid>
      <description>&lt;p&gt;ElastiCache looks straightforward on the bill. You pick a node type, maybe add a replica for high availability, and move on. Then the invoice arrives and the number is bigger than the mental math suggested.&lt;/p&gt;

&lt;p&gt;The gap usually comes from one of five places: engine choice, replication topology, extended support surcharges, idle clusters, or oversized nodes nobody ever right-sized. Let's break down exactly how ElastiCache charges — and where teams get surprised.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Engines, Three Price Points
&lt;/h2&gt;

&lt;p&gt;ElastiCache supports three engines: Valkey, Redis OSS, and Memcached. They don't cost the same.&lt;/p&gt;

&lt;p&gt;Valkey is &lt;strong&gt;20% cheaper&lt;/strong&gt; than Redis OSS and Memcached for node-based clusters, and &lt;strong&gt;33% cheaper&lt;/strong&gt; on ElastiCache Serverless. This isn't a promotional rate — it's the permanent pricing structure AWS launched with Valkey.&lt;/p&gt;

&lt;p&gt;For context, a cache.r7g.xlarge in us-east-1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Hourly Rate&lt;/th&gt;
&lt;th&gt;Monthly (730 hrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Valkey&lt;/td&gt;
&lt;td&gt;$0.3496&lt;/td&gt;
&lt;td&gt;~$255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis OSS&lt;/td&gt;
&lt;td&gt;$0.437&lt;/td&gt;
&lt;td&gt;~$319&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memcached&lt;/td&gt;
&lt;td&gt;$0.437&lt;/td&gt;
&lt;td&gt;~$319&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Prices shown for us-east-1, On-Demand.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a $64/month difference per node on a single instance type. Multiply that across a 12-node cluster and you're looking at roughly $766/month — just from engine choice. If you're running Redis OSS and don't need Redis-specific features that Valkey doesn't support, the migration saves real money.&lt;/p&gt;
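&lt;p&gt;A quick way to sanity-check that arithmetic (rates are the us-east-1 On-Demand figures from the table above):&lt;/p&gt;

```python
# Engine-choice savings for cache.r7g.xlarge, us-east-1 On-Demand.
HOURS_PER_MONTH = 730

redis_hourly = 0.437
valkey_hourly = 0.3496  # 20% below the Redis OSS rate

per_node_monthly_savings = (redis_hourly - valkey_hourly) * HOURS_PER_MONTH
cluster_monthly_savings = per_node_monthly_savings * 12  # 12-node cluster

print(f"Per node: ${per_node_monthly_savings:.2f}/month")
print(f"12-node cluster: ${cluster_monthly_savings:.2f}/month")
```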

&lt;h2&gt;
  
  
  Node-Based Pricing: You Pay Whether the Cache Is Hit or Not
&lt;/h2&gt;

&lt;p&gt;ElastiCache charges per node-hour from the moment a node is launched until it's terminated. Partial hours are billed as full hours. There is no scale-to-zero.&lt;/p&gt;

&lt;p&gt;A few common node types and what they cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node Type&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Hourly Rate&lt;/th&gt;
&lt;th&gt;Monthly (730 hrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cache.t3.micro&lt;/td&gt;
&lt;td&gt;0.5 GiB&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;~$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cache.m5.large&lt;/td&gt;
&lt;td&gt;6.38 GiB&lt;/td&gt;
&lt;td&gt;$0.156&lt;/td&gt;
&lt;td&gt;~$114&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cache.r7g.xlarge&lt;/td&gt;
&lt;td&gt;26.32 GiB&lt;/td&gt;
&lt;td&gt;$0.437&lt;/td&gt;
&lt;td&gt;~$319&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cache.r6g.16xlarge&lt;/td&gt;
&lt;td&gt;419.09 GiB&lt;/td&gt;
&lt;td&gt;$5.254&lt;/td&gt;
&lt;td&gt;~$3,835&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Prices shown for Redis OSS / Memcached in us-east-1, On-Demand. Valkey is 20% lower.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The important thing to internalize: a cache.t3.micro sitting idle costs the same $12/month as one handling thousands of requests per second. The meter runs on time, not usage.&lt;/p&gt;

&lt;p&gt;AWS recommends reserving 25% of a node's memory for non-data use (replication buffers, OS overhead, etc.), so the usable capacity of a cache.r7g.xlarge is roughly 19.74 GiB, not 26.32 GiB.&lt;/p&gt;
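&lt;p&gt;The 25% rule is easy to fold into capacity planning. A minimal sketch, using the memory figures quoted above:&lt;/p&gt;

```python
# Usable capacity after AWS's recommended 25% memory reservation
# (replication buffers, OS overhead). Node sizes from the table above.
def usable_gib(total_gib, reserved_fraction=0.25):
    """Return the memory actually available for cached data."""
    return total_gib * (1 - reserved_fraction)

print(f"cache.r7g.xlarge:    {usable_gib(26.32):.2f} GiB usable")
print(f"cache.r6g.16xlarge: {usable_gib(419.09):.2f} GiB usable")
```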

&lt;h2&gt;
  
  
  Replication Multiplies the Bill
&lt;/h2&gt;

&lt;p&gt;Most production deployments use replication for high availability. With Redis OSS or Valkey, you configure a replication group with a primary node and one or more replica nodes per shard.&lt;/p&gt;

&lt;p&gt;Every replica is a full node charged at the same hourly rate.&lt;/p&gt;

&lt;p&gt;A three-shard cluster with one replica per shard using cache.r7g.xlarge (Valkey):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 shards × 2 nodes per shard = 6 nodes
6 × $0.3496/hr = $2.10/hr → ~$1,531/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a second replica for read scaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 shards × 3 nodes per shard = 9 nodes
9 × $0.3496/hr = $3.15/hr → ~$2,297/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus, multi-AZ replication generates cross-AZ data transfer at $0.01/GiB in each direction. For a high-throughput cache doing 100,000 requests/second with 500-byte objects, that's roughly 167 GiB/hour of traffic. If 50% crosses AZ boundaries, that's an extra $0.84/hour — about $613/month in data transfer alone.&lt;/p&gt;
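&lt;p&gt;That estimate is straightforward to reproduce. A rough sketch of the same workload (the 50% cross-AZ fraction is the assumption from above):&lt;/p&gt;

```python
# Rough cross-AZ transfer estimate: 100,000 req/s, 500-byte objects,
# half the traffic crossing AZ boundaries.
GIB = 1024 ** 3

req_per_sec = 100_000
bytes_per_req = 500
cross_az_fraction = 0.5
rate_per_gib = 0.01  # $/GiB, charged on the EC2 side

gib_per_hour = req_per_sec * bytes_per_req * 3600 / GIB
cost_per_hour = gib_per_hour * cross_az_fraction * rate_per_gib
print(f"{gib_per_hour:.1f} GiB/hr, ${cost_per_hour:.2f}/hr, ~${cost_per_hour * 730:.0f}/month")
```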

&lt;p&gt;Teams often enable multi-AZ replication on dev and staging environments where a single node would be fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless: Simpler, But Not Always Cheaper
&lt;/h2&gt;

&lt;p&gt;ElastiCache Serverless removes the node sizing decision entirely. You pay for two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data stored&lt;/strong&gt; — billed in GB-hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElastiCache Processing Units (ECPUs)&lt;/strong&gt; — a unit combining vCPU time and data transferred&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Valkey&lt;/th&gt;
&lt;th&gt;Redis OSS&lt;/th&gt;
&lt;th&gt;Memcached&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data storage&lt;/td&gt;
&lt;td&gt;$0.084/GB-hr&lt;/td&gt;
&lt;td&gt;$0.125/GB-hr&lt;/td&gt;
&lt;td&gt;$0.125/GB-hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECPUs&lt;/td&gt;
&lt;td&gt;$0.0023/M&lt;/td&gt;
&lt;td&gt;$0.0034/M&lt;/td&gt;
&lt;td&gt;$0.0034/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum data stored&lt;/td&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Prices shown for us-east-1.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A simple GET or SET transferring under 1 KB consumes 1 ECPU. A command transferring 3.2 KB consumes 3.2 ECPUs. Commands that use more vCPU time (like SORT or ZADD) consume proportionally more.&lt;/p&gt;
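&lt;p&gt;For back-of-envelope planning, the transfer-based part of the ECPU rule can be sketched as follows. This models data transfer only and ignores the vCPU-time component, so treat it as a lower bound:&lt;/p&gt;

```python
# Transfer-based ECPU estimate, per the rule above: 1 ECPU per KB
# transferred, with a 1 ECPU minimum per command. vCPU-heavy commands
# consume proportionally more, which this sketch does not model.
def ecpus_for_command(kb_transferred):
    return max(1.0, kb_transferred)

def monthly_ecpu_cost(requests_per_sec, kb_per_request, price_per_million=0.0023):
    """Steady-rate monthly cost; default price is the Valkey rate above."""
    ecpus = ecpus_for_command(kb_per_request) * requests_per_sec * 3600 * 730
    return ecpus / 1_000_000 * price_per_million

print(ecpus_for_command(0.8))  # small GET: 1.0 (the minimum)
print(ecpus_for_command(3.2))  # 3.2 KB transfer: 3.2
print(f"1,000 req/s of 1 KB commands: ${monthly_ecpu_cost(1000, 1.0):.2f}/month")
```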

&lt;p&gt;Serverless can be cheaper for spiky workloads because you don't over-provision for peaks. But for stable, high-throughput workloads, node-based clusters are often significantly cheaper. One of AWS's published pricing examples shows a spiky workload costing $2.92/hour serverless vs. $5.66/hour on-demand nodes — but for steady traffic, the math can flip the other way.&lt;/p&gt;

&lt;p&gt;The minimum charge matters too. A Serverless cache for Redis OSS or Memcached is metered for at least 1 GB of data stored — roughly $91/month minimum even if you're storing almost nothing. Valkey's 100 MB minimum brings that floor down to about $6/month.&lt;/p&gt;
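&lt;p&gt;The floors fall straight out of the storage rates and minimums in the table above:&lt;/p&gt;

```python
# Monthly floor from the minimum metered data stored, per engine
# (rates and minimums from the Serverless table above).
def monthly_floor(min_gb, rate_per_gb_hr):
    return min_gb * rate_per_gb_hr * 730

print(f"Redis OSS / Memcached floor: ${monthly_floor(1.0, 0.125):.2f}/month")
print(f"Valkey floor:                ${monthly_floor(0.1, 0.084):.2f}/month")
```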

&lt;h2&gt;
  
  
  Extended Support: The Surcharge Nobody Budgets For
&lt;/h2&gt;

&lt;p&gt;When a Redis OSS or Memcached engine version reaches end-of-life, AWS continues providing security patches through Extended Support — at a steep premium.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Surcharge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Year 1–2 after EOL&lt;/td&gt;
&lt;td&gt;80% premium on node-hour rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Year 3 after EOL&lt;/td&gt;
&lt;td&gt;160% premium on node-hour rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A cache.m5.large running Redis 5 (EOL January 31, 2026) at $0.156/hour becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year 1–2:&lt;/strong&gt; $0.156 + ($0.156 × 80%) = &lt;strong&gt;$0.281/hour&lt;/strong&gt; (~$205/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Year 3:&lt;/strong&gt; $0.156 + ($0.156 × 160%) = &lt;strong&gt;$0.406/hour&lt;/strong&gt; (~$296/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's nearly triple the base cost by year three. Teams that don't track engine versions can drift into Extended Support without realizing their bill just jumped 80%.&lt;/p&gt;
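&lt;p&gt;The surcharge math generalizes to any node type. A small sketch using the cache.m5.large figures above:&lt;/p&gt;

```python
# Extended Support surcharge on top of the base node-hour rate.
def extended_support_rate(base_hourly, premium):
    return base_hourly * (1 + premium)

base = 0.156  # cache.m5.large, us-east-1 On-Demand
year_1_2 = extended_support_rate(base, 0.80)  # 80% premium
year_3 = extended_support_rate(base, 1.60)    # 160% premium

print(f"Year 1-2: ${year_1_2:.3f}/hr (~${year_1_2 * 730:.0f}/month)")
print(f"Year 3:   ${year_3:.3f}/hr (~${year_3 * 730:.0f}/month)")
```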

&lt;h2&gt;
  
  
  Backup Storage and Data Transfer
&lt;/h2&gt;

&lt;p&gt;Two cost categories that don't appear under the main "ElastiCache" line:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backup storage:&lt;/strong&gt; $0.085/GiB per month for all regions. No data transfer charges for creating or restoring backups. This is generally small unless you're snapshotting large clusters frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data transfer:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same AZ (EC2 ↔ ElastiCache)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-AZ (same Region)&lt;/td&gt;
&lt;td&gt;$0.01/GiB each way&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-Region (Global Datastore)&lt;/td&gt;
&lt;td&gt;$0.02/GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cross-AZ charge is easy to miss because it shows up as EC2 data transfer on the bill, not ElastiCache. You're only charged for the EC2 side — there's no ElastiCache data transfer charge for traffic in or out of the node itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Tiering: The Cost Saver Most Teams Don't Know About
&lt;/h2&gt;

&lt;p&gt;R6gd nodes combine memory and NVMe SSD, automatically moving least-frequently-accessed data to SSD. You get nearly 5× the total storage capacity compared to memory-only R6g nodes.&lt;/p&gt;

&lt;p&gt;AWS's example: a 1 TiB dataset needs 1 cache.r6gd.16xlarge node ($9.98/hour) vs. 4 cache.r6g.16xlarge nodes ($21.01/hour) — a 52% cost reduction.&lt;/p&gt;
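&lt;p&gt;Re-running that comparison as arithmetic:&lt;/p&gt;

```python
# AWS's data-tiering example above: one r6gd node with NVMe SSD vs.
# four memory-only r6g nodes for a ~1 TiB dataset.
tiered_hourly = 9.98            # 1 x cache.r6gd.16xlarge
memory_only_hourly = 5.254 * 4  # 4 x cache.r6g.16xlarge

savings = 1 - tiered_hourly / memory_only_hourly
print(f"Memory-only: ${memory_only_hourly:.2f}/hr")
print(f"Tiered:      ${tiered_hourly:.2f}/hr ({savings * 100:.1f}% cheaper)")
```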

&lt;p&gt;The trade-off: SSD-resident data has slightly higher latency on first access. If your workload regularly accesses less than 20% of the dataset, data tiering is worth evaluating.&lt;/p&gt;

&lt;p&gt;Data tiering is not available with ElastiCache Serverless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reserved Nodes: Up to 55% Off
&lt;/h2&gt;

&lt;p&gt;If your ElastiCache usage is stable, reserved nodes offer steep discounts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Commitment&lt;/th&gt;
&lt;th&gt;Discount vs. On-Demand&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-year, No Upfront&lt;/td&gt;
&lt;td&gt;Up to 48.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-year, Partial Upfront&lt;/td&gt;
&lt;td&gt;Up to 52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-year, All Upfront&lt;/td&gt;
&lt;td&gt;Up to 55%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reserved nodes are size-flexible — you can apply the discount across different node sizes within the same family. If you buy a reservation for cache.r7g.xlarge, it can cover cache.r7g.large nodes proportionally.&lt;/p&gt;
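&lt;p&gt;Size flexibility works on normalized capacity units. The doubling factors in this sketch reflect how AWS size flexibility generally works, but they're an assumption worth verifying against the documentation for your node family:&lt;/p&gt;

```python
# Sketch of size-flexible reservation coverage via normalized units.
# The per-size factors below (each size doubles the previous) are an
# assumption, not published figures; check your family's documentation.
SIZE_FACTORS = {"large": 4, "xlarge": 8, "2xlarge": 16, "4xlarge": 32}

def reservation_covers(reserved_size, reserved_count, running_size):
    """How many running nodes of running_size a reservation pool covers."""
    units = SIZE_FACTORS[reserved_size] * reserved_count
    return units // SIZE_FACTORS[running_size]

# One reserved cache.r7g.xlarge covers two cache.r7g.large nodes.
print(reservation_covers("xlarge", 1, "large"))  # 2
```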

&lt;p&gt;One useful detail: Redis OSS reservations automatically apply to Valkey nodes in the same family and region. Since Valkey is 20% cheaper, you get 20% more value from existing reservations after migrating.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: Idle Caches
&lt;/h2&gt;

&lt;p&gt;Here's what actually burns money: caches nobody is using.&lt;/p&gt;

&lt;p&gt;ElastiCache has no scale-to-zero for node-based clusters. A cache with zero hits costs exactly the same as one handling millions of requests. This is the pattern we see most often:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A team provisions a cache for a microservice, then the service is deprecated&lt;/li&gt;
&lt;li&gt;Dev/staging caches left running after the project ends&lt;/li&gt;
&lt;li&gt;A "temporary" cache for a migration that became permanent infrastructure&lt;/li&gt;
&lt;li&gt;A replicated cluster in non-production where a single node would suffice&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A three-node cache.r7g.xlarge cluster running idle for a year at Valkey on-demand rates: &lt;strong&gt;$9,186 wasted&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Over-Provisioned Caches Are Nearly as Bad
&lt;/h2&gt;

&lt;p&gt;Beyond idle caches, oversized nodes are the second biggest source of waste. Teams pick a large node type during initial setup, the workload stabilizes at a fraction of capacity, and nobody revisits the sizing.&lt;/p&gt;

&lt;p&gt;A cache.r6g.xlarge running at 6% CPU with active connections is doing real work — but it's doing it on a node that's 3–4× larger than needed. Downsizing from cache.r6g.xlarge to cache.r6g.large can cut costs by 40–50% with no performance impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Spot the Waste
&lt;/h2&gt;

&lt;p&gt;Check these CloudWatch metrics for each cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CacheHits:&lt;/strong&gt; Zero for 14+ days means nothing is reading from this cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CurrConnections:&lt;/strong&gt; Zero means nothing is even connecting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EngineCPUUtilization:&lt;/strong&gt; Consistently under 10% with active connections means the node is oversized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick CLI inventory of all your ElastiCache clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws elasticache describe-cache-clusters &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--show-cache-node-info&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'CacheClusters[*].{
    ClusterId:CacheClusterId,
    Engine:Engine,
    EngineVersion:EngineVersion,
    NodeType:CacheNodeType,
    NumNodes:NumCacheNodes,
    Status:CacheClusterStatus
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any of those clusters show an engine version approaching EOL, you're on the clock for an Extended Support surcharge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloudcostwise.io/" rel="noopener noreferrer"&gt;CloudWise&lt;/a&gt; detects idle ElastiCache clusters by analyzing CloudWatch cache hit metrics over 14 days, flags oversized nodes running under 10% CPU, and alerts you when clusters are approaching or already incurring Extended Support surcharges. Three detectors, one scan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloudcostwise.io/" rel="noopener noreferrer"&gt;CloudWise&lt;/a&gt; automates AWS cost analysis across 180+ waste detectors. Try it at &lt;a href="https://cloudcostwise.io/" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>elasticache</category>
      <category>redis</category>
      <category>valkey</category>
    </item>
    <item>
      <title>How Timestream Actually Bills: A Breakdown for Engineers</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:56:59 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/how-timestream-actually-bills-a-breakdown-for-engineers-56b3</link>
      <guid>https://dev.to/cloudwiseteam/how-timestream-actually-bills-a-breakdown-for-engineers-56b3</guid>
      <description>&lt;p&gt;Timestream can look simple on the bill until you break down the line items. Most teams think in terms of "stored data," but Amazon Timestream for LiveAnalytics is billed across multiple meters that move independently.&lt;/p&gt;

&lt;p&gt;If you only watch one number, you can miss where most of the spend actually comes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  First: Know Which Timestream Product You Are Using
&lt;/h2&gt;

&lt;p&gt;AWS now has two Timestream offerings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Timestream for LiveAnalytics&lt;/strong&gt; (serverless, billed by writes, query compute, memory store, magnetic store)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Timestream for InfluxDB&lt;/strong&gt; (managed InfluxDB, billed by DB instance-hours and storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post focuses on &lt;strong&gt;Timestream for LiveAnalytics&lt;/strong&gt;, where most billing misunderstandings happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LiveAnalytics Billing Model (What Actually Ticks)
&lt;/h2&gt;

&lt;p&gt;For Timestream for LiveAnalytics, AWS charges separately for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writes&lt;/strong&gt;: billed by the amount of data written, rounded up to the nearest KiB, often shown in pricing examples as a per-million 1 KiB write unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries&lt;/strong&gt;: billed by &lt;strong&gt;Timestream Compute Units (TCUs)&lt;/strong&gt; consumed over time (TCU-hours), not by a flat per-query fee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory store&lt;/strong&gt;: billed by &lt;strong&gt;GB-hour&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Magnetic store&lt;/strong&gt;: billed by &lt;strong&gt;GB-month&lt;/strong&gt; (with account/region minimums for magnetic storage usage).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the statement "Timestream is $0.10/GB-month" is not accurate for LiveAnalytics: no single storage rate captures a bill that moves across four independent meters.&lt;/p&gt;
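&lt;p&gt;The four meters can be sketched as one estimator. The rates below are placeholders, not published AWS prices; substitute the current figures from the Timestream pricing page for your region:&lt;/p&gt;

```python
# Sketch of the four independent LiveAnalytics meters. The default
# rates are PLACEHOLDERS, not published AWS prices; plug in the
# current figures from the Timestream pricing page.
def liveanalytics_monthly(writes_million_1kib, tcu_hours,
                          memory_gb_avg, magnetic_gb,
                          write_rate=0.50, tcu_rate=0.55,
                          memory_rate=0.036, magnetic_rate=0.03):
    return (writes_million_1kib * write_rate          # writes
            + tcu_hours * tcu_rate                    # query compute
            + memory_gb_avg * 730 * memory_rate       # memory store, GB-hours
            + magnetic_gb * magnetic_rate)            # magnetic store, GB-month

# Hypothetical workload: 100M 1 KiB writes, 200 TCU-hours,
# 50 GB average in memory store, 2 TB in magnetic store.
print(f"${liveanalytics_monthly(100, 200, 50, 2000):.2f}/month")
```

&lt;p&gt;Even with placeholder rates, the shape of the result is the point: memory-store retention usually dominates long before the "storage" line you were watching does.&lt;/p&gt;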

&lt;h2&gt;
  
  
  Why Teams Get Surprised
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Query charges are compute-time based, not per query count
&lt;/h3&gt;

&lt;p&gt;A dashboard running one heavy query every few seconds can cost more than many lightweight queries. Query cost follows TCU consumption and duration.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Memory retention is expensive relative to magnetic retention
&lt;/h3&gt;

&lt;p&gt;Keeping a long retention period in memory store drives GB-hour charges. Moving older data to magnetic store usually lowers storage cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) High write frequency amplifies ingestion cost fast
&lt;/h3&gt;

&lt;p&gt;Small records at high frequency still add up, especially without batching and schema optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) "Idle table" thinking can be misleading
&lt;/h3&gt;

&lt;p&gt;In LiveAnalytics, empty or unused tables are not the main cost driver. Spend usually comes from write volume, query compute, and retained data in memory/magnetic tiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Better Mental Model for Retention
&lt;/h2&gt;

&lt;p&gt;Retention decisions directly shape spend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short memory retention&lt;/strong&gt; for hot, low-latency workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longer magnetic retention&lt;/strong&gt; for historical analysis&lt;/li&gt;
&lt;li&gt;Keep only what is needed in memory store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your query patterns are mostly historical and not sub-second operational reads, memory retention is often set too high.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Cost Review Checklist
&lt;/h2&gt;

&lt;p&gt;Run this monthly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writes&lt;/strong&gt;: are records batched efficiently? Are you writing unnecessary dimensions/measures?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries&lt;/strong&gt;: which workloads consume most TCU time? Any dashboards refreshing too frequently?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory store&lt;/strong&gt;: is hot retention longer than real operational need?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Magnetic store&lt;/strong&gt;: is long-term retention aligned with compliance and analytics requirements?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table lifecycle&lt;/strong&gt;: are stale datasets still retained without business need?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  CLI: Inventory Retention Settings Across Tables
&lt;/h2&gt;

&lt;p&gt;Use this to review memory/magnetic retention quickly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws timestream-write list-databases &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Databases[].DatabaseName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\t'&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;db&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$db&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;continue
  &lt;/span&gt;aws timestream-write list-tables &lt;span class="nt"&gt;--database-name&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$db&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Tables[].TableName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\t'&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;tbl&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tbl&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;continue
    &lt;/span&gt;aws timestream-write describe-table &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--database-name&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$db&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--table-name&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tbl&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'{Database:Table.DatabaseName,Table:Table.TableName,MemoryHours:Table.RetentionProperties.MemoryStoreRetentionPeriodInHours,MagneticDays:Table.RetentionProperties.MagneticStoreRetentionPeriodInDays}'&lt;/span&gt;
  &lt;span class="k"&gt;done
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does not prove usage by itself, but it gives you the retention map you need before optimizing query and write behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Timestream for LiveAnalytics billing is multi-dimensional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writes&lt;/li&gt;
&lt;li&gt;query compute (TCUs)&lt;/li&gt;
&lt;li&gt;memory store&lt;/li&gt;
&lt;li&gt;magnetic store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat it as a single storage bill, you will miss the biggest optimization levers.&lt;/p&gt;

&lt;p&gt;CloudWise helps teams surface these hidden cost patterns and prioritize the fastest savings opportunities across AWS data services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloudcostwise.io/" rel="noopener noreferrer"&gt;CloudWise&lt;/a&gt; automates AWS cost analysis and waste detection. Try it at &lt;a href="https://cloudcostwise.io/" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloudcostoptimization</category>
      <category>timestream</category>
    </item>
    <item>
      <title>The Hidden Costs of Idle EMR Clusters (And How to Stop the Bleed)</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:36:43 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/the-hidden-costs-of-idle-emr-clusters-and-how-to-stop-the-bleed-5g2g</link>
      <guid>https://dev.to/cloudwiseteam/the-hidden-costs-of-idle-emr-clusters-and-how-to-stop-the-bleed-5g2g</guid>
      <description>&lt;p&gt;EMR looks simple on the bill. You spin up a cluster, run your Spark jobs, and shut it down. But most teams don't shut it down — and that's where the money disappears.&lt;/p&gt;

&lt;h2&gt;
  
  
  EMR Has Two Price Tags
&lt;/h2&gt;

&lt;p&gt;Every EMR instance carries &lt;strong&gt;two charges&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EC2 instance cost&lt;/strong&gt; — the standard on-demand rate for the instance type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMR surcharge&lt;/strong&gt; — an additional per-instance-hour fee, typically 20–25% of the EC2 price&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a common analytics instance like &lt;code&gt;m5.xlarge&lt;/code&gt; (4 vCPUs, 16 GB RAM) in us-east-1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Hourly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EC2&lt;/td&gt;
&lt;td&gt;$0.192&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EMR surcharge&lt;/td&gt;
&lt;td&gt;$0.048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.240/hr&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 5-node cluster of &lt;code&gt;m5.xlarge&lt;/code&gt; instances costs &lt;strong&gt;$1.20/hr&lt;/strong&gt; — roughly &lt;strong&gt;$876/month&lt;/strong&gt; if left running. That's just compute. Storage is extra.&lt;/p&gt;

&lt;p&gt;Most teams focus on the EC2 line item and completely miss the EMR surcharge. It doesn't show up as a separate line — it's bundled into the "Amazon Elastic MapReduce" charge on your bill, and it adds up fast across multiple clusters.&lt;/p&gt;
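&lt;p&gt;The two-price-tag math is trivial to model, which makes it easy to embed in cost reviews. Rates below are the m5.xlarge us-east-1 figures from the table above:&lt;/p&gt;

```python
# Total EMR node cost = EC2 rate + EMR surcharge (m5.xlarge, us-east-1).
def emr_node_hourly(ec2_rate, emr_rate):
    return ec2_rate + emr_rate

node_hourly = emr_node_hourly(0.192, 0.048)
cluster_hourly = node_hourly * 5  # 5-node cluster

print(f"Per node: ${node_hourly:.3f}/hr")
print(f"5 nodes:  ${cluster_hourly:.2f}/hr (~${cluster_hourly * 730:.0f}/month)")
```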

&lt;h2&gt;
  
  
  The EBS Trap
&lt;/h2&gt;

&lt;p&gt;Every EMR node gets EBS volumes attached. The default root volume is typically 10–15 GB, but core and task nodes often get larger volumes for HDFS or local shuffle storage.&lt;/p&gt;

&lt;p&gt;Current EBS pricing in us-east-1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume Type&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gp3 (default for new clusters)&lt;/td&gt;
&lt;td&gt;$0.08/GB-month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gp2 (legacy default)&lt;/td&gt;
&lt;td&gt;$0.10/GB-month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;io1 (provisioned IOPS)&lt;/td&gt;
&lt;td&gt;$0.125/GB-month + $0.065/IOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 5-node cluster with 100 GB gp3 per node adds &lt;strong&gt;$40/month&lt;/strong&gt; in storage alone. Not huge — but it never stops charging, even when the cluster is idle.&lt;/p&gt;

&lt;p&gt;The real problem isn't the per-GB rate. It's that &lt;strong&gt;EBS charges continue as long as the cluster exists&lt;/strong&gt;, regardless of whether any jobs are running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idle Cluster Problem
&lt;/h2&gt;

&lt;p&gt;Here's the scenario that burns money: a cluster in &lt;code&gt;WAITING&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;EMR clusters have three relevant states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RUNNING&lt;/strong&gt; — actively executing steps (Spark, Hive, Presto jobs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAITING&lt;/strong&gt; — cluster is up, all steps are complete, waiting for new work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TERMINATED&lt;/strong&gt; — shut down, no charges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;WAITING&lt;/code&gt; state is the silent budget killer. The cluster is fully provisioned — all EC2 instances running, all EBS volumes attached, EMR surcharge ticking — but doing zero work. It's an idle engine burning fuel in a parked car.&lt;/p&gt;

&lt;p&gt;This happens more often than you'd think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dev/test clusters&lt;/strong&gt; spun up for debugging, never terminated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled pipelines&lt;/strong&gt; where the cluster outlives the job by hours or days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Keep-alive" clusters&lt;/strong&gt; left running for ad-hoc queries that happen once a week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed termination&lt;/strong&gt; where auto-termination was configured but a step error left the cluster hanging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 10-node &lt;code&gt;r5.2xlarge&lt;/code&gt; cluster in WAITING state costs roughly &lt;strong&gt;$4,600/month&lt;/strong&gt; — EC2 ($0.504/hr × 10 × 730 = $3,679) plus EMR surcharge ($0.126/hr × 10 × 730 = $920), plus EBS on top. For processing zero bytes of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Actually Check
&lt;/h2&gt;

&lt;p&gt;If you want to audit your EMR spend, focus on three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Clusters in WAITING state for more than 24 hours&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws emr list-clusters &lt;span class="nt"&gt;--active&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Clusters[?Status.State==`WAITING`]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any cluster that's been waiting more than a day is almost certainly forgotten. Check the &lt;code&gt;ReadyDateTime&lt;/code&gt; in the timeline to see how long it's been idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Long-running clusters with no recent steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some teams run "persistent" EMR clusters for interactive workloads (Jupyter, Presto). These are valid — but they should be right-sized. Check &lt;code&gt;list-steps&lt;/code&gt; to see when the last step actually ran.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws emr list-steps &lt;span class="nt"&gt;--cluster-id&lt;/span&gt; j-XXXXX &lt;span class="nt"&gt;--step-states&lt;/span&gt; COMPLETED &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Steps[0].Status.Timeline.EndDateTime'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the last step completed weeks ago, the cluster is waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Auto-termination configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EMR supports auto-termination after idle timeout. If your clusters don't have this enabled, you're one forgotten SSH session away from a surprise bill.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws emr describe-cluster &lt;span class="nt"&gt;--cluster-id&lt;/span&gt; j-XXXXX &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Cluster.AutoTerminationPolicy'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;For batch workloads, the answer is straightforward: use &lt;strong&gt;transient clusters&lt;/strong&gt;. Spin up, process, terminate. EMR's step execution mode does this automatically — the cluster terminates after the last step completes.&lt;/p&gt;

&lt;p&gt;For interactive workloads, set aggressive auto-termination policies (1–2 hours of idle time) and right-size instance types based on actual utilization, not peak estimates from six months ago.&lt;/p&gt;

&lt;p&gt;And tag everything. You can't optimize what you can't attribute. Use &lt;code&gt;aws:elasticmapreduce:editor-id&lt;/code&gt; and custom cost allocation tags to tie clusters back to teams and projects.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise detects idle and long-running EMR clusters automatically, flags the monthly waste, and generates remediation plans. Try it at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>costoptimization</category>
      <category>devops</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Amazon MQ Pricing: What's Really on Your Bill</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 26 Mar 2026 12:37:23 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/amazon-mq-pricing-whats-really-on-your-bill-48j0</link>
      <guid>https://dev.to/cloudwiseteam/amazon-mq-pricing-whats-really-on-your-bill-48j0</guid>
      <description>&lt;p&gt;Amazon MQ looks simple on the bill until you pull it apart. Most teams spin up a managed ActiveMQ or RabbitMQ broker, send a few messages, and move on. Then the invoice arrives with line items they didn't expect.&lt;/p&gt;

&lt;p&gt;Let's break down exactly where the money goes — and what catches people off guard.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broker Instance: It's Not a Flat Rate
&lt;/h2&gt;

&lt;p&gt;Amazon MQ charges per broker instance-hour based on the instance type and engine. There's no single "broker fee." The range is wide:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Hourly Rate&lt;/th&gt;
&lt;th&gt;Monthly (730 hrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mq.t3.micro&lt;/td&gt;
&lt;td&gt;ActiveMQ&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;$24.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.t3.micro&lt;/td&gt;
&lt;td&gt;RabbitMQ&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;$24.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.m5.large&lt;/td&gt;
&lt;td&gt;ActiveMQ&lt;/td&gt;
&lt;td&gt;$0.276&lt;/td&gt;
&lt;td&gt;$201.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.m5.large&lt;/td&gt;
&lt;td&gt;RabbitMQ&lt;/td&gt;
&lt;td&gt;$0.276&lt;/td&gt;
&lt;td&gt;$201.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.m5.xlarge&lt;/td&gt;
&lt;td&gt;ActiveMQ&lt;/td&gt;
&lt;td&gt;$0.552&lt;/td&gt;
&lt;td&gt;$402.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.m5.2xlarge&lt;/td&gt;
&lt;td&gt;ActiveMQ&lt;/td&gt;
&lt;td&gt;$1.104&lt;/td&gt;
&lt;td&gt;$805.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mq.m5.4xlarge&lt;/td&gt;
&lt;td&gt;ActiveMQ&lt;/td&gt;
&lt;td&gt;$2.208&lt;/td&gt;
&lt;td&gt;$1,611.84&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Prices shown for us-east-1, On-Demand.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A common mistake is assuming the smallest broker is "cheap." Even an mq.t3.micro runs 24/7, costing roughly &lt;strong&gt;$25/month&lt;/strong&gt; before anything else. An mq.m5.large — the default many teams pick — is over &lt;strong&gt;$200/month per broker&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Availability Doubles the Compute Cost
&lt;/h2&gt;

&lt;p&gt;ActiveMQ supports active/standby deployment for high availability. This means &lt;strong&gt;two broker instances&lt;/strong&gt; in different Availability Zones. Your compute cost doubles immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-instance&lt;/strong&gt; mq.m5.large: $201.48/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active/standby&lt;/strong&gt; mq.m5.large: $402.96/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RabbitMQ achieves HA through a three-node cluster deployment, which means you're paying for &lt;strong&gt;three instances&lt;/strong&gt;. An mq.m5.large RabbitMQ cluster costs roughly &lt;strong&gt;$604/month&lt;/strong&gt; in compute alone.&lt;/p&gt;
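&lt;p&gt;The multiplier is easy to see in code. A quick sketch using the us-east-1 mq.m5.large rate from the table above (the helper name is ours):&lt;/p&gt;

```python
# How deployment mode multiplies Amazon MQ compute cost.
# Hourly rate is the us-east-1 On-Demand figure from the table above.
HOURS_PER_MONTH = 730

def broker_monthly_cost(hourly_rate: float, instance_count: int) -> float:
    """Compute-only monthly cost: every instance in the deployment bills per hour."""
    return hourly_rate * instance_count * HOURS_PER_MONTH

m5_large = 0.276
print(f"single instance:        ${broker_monthly_cost(m5_large, 1):,.2f}")  # $201.48
print(f"active/standby (HA):    ${broker_monthly_cost(m5_large, 2):,.2f}")  # $402.96
print(f"3-node RabbitMQ cluster: ${broker_monthly_cost(m5_large, 3):,.2f}")  # $604.44
```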

&lt;p&gt;Teams often enable HA for production brokers and forget about it on dev/staging environments — where a single instance would be fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage: EBS vs. EFS Is a 3× Price Difference
&lt;/h2&gt;

&lt;p&gt;ActiveMQ brokers on Amazon MQ offer two storage backends:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Type&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EBS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.10/GB-month&lt;/td&gt;
&lt;td&gt;Standard durability, faster throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EFS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.30/GB-month&lt;/td&gt;
&lt;td&gt;Shared storage for active/standby pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;EFS is required for active/standby deployments since both brokers need access to the same persistent store. That's a &lt;strong&gt;3× premium&lt;/strong&gt; over EBS.&lt;/p&gt;

&lt;p&gt;A broker with 50 GB of message storage on EFS costs $15/month in storage alone. Not huge, but it adds up across multiple brokers and environments.&lt;/p&gt;

&lt;p&gt;RabbitMQ brokers use EBS exclusively — no EFS option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Transfer: The Silent Multiplier
&lt;/h2&gt;

&lt;p&gt;Standard AWS data transfer charges apply to Amazon MQ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Same AZ&lt;/strong&gt;: Free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-AZ&lt;/strong&gt; (typical for HA): $0.01/GB each way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Region&lt;/strong&gt;: $0.02/GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet outbound&lt;/strong&gt;: $0.09/GB (first 10 TB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For active/standby deployments, replication traffic between AZs is cross-AZ data transfer. High-throughput brokers processing millions of messages per day can accumulate meaningful cross-AZ charges that don't appear under the "AmazonMQ" line item on your bill — they show up under general data transfer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: Idle Brokers
&lt;/h2&gt;

&lt;p&gt;Here's what actually burns money with Amazon MQ: &lt;strong&gt;brokers that nobody is using.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's remarkably common. A team provisions a broker for a proof of concept, connects a few services, then the project pivots. The broker sits there at $200+/month with zero messages flowing through it. No consumers, no producers, no connections — just a running instance charging by the hour.&lt;/p&gt;

&lt;p&gt;Unlike Lambda or SQS, Amazon MQ has no scale-to-zero capability. A broker with zero messages costs the same as one processing thousands per second.&lt;/p&gt;

&lt;p&gt;The pattern we see most often:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dev/staging brokers left running after the sprint ends&lt;/li&gt;
&lt;li&gt;Migration brokers kept "just in case" after switching to SQS or EventBridge&lt;/li&gt;
&lt;li&gt;HA-enabled brokers in non-production environments&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Spot the Waste
&lt;/h2&gt;

&lt;p&gt;Look at these CloudWatch metrics for each broker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TotalMessageCount&lt;/strong&gt;: If this is zero over 7–14 days, the broker is idle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CurrentConnectionsCount&lt;/strong&gt;: Zero connections means nothing is even trying to talk to it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TotalConsumerCount / TotalProducerCount&lt;/strong&gt;: Both at zero confirms no active clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If all three are zero for more than a week, you're paying for a parking spot nobody is using.&lt;/p&gt;
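&lt;p&gt;The decision rule is simple enough to codify. A hypothetical sketch (the real metric sums would come from CloudWatch &lt;code&gt;get-metric-statistics&lt;/code&gt;, which is omitted here; the values below are plain numbers):&lt;/p&gt;

```python
# Classify a broker as idle from the three CloudWatch metrics discussed above.
# In practice each argument is the metric's sum over a 7-14 day window.

def broker_is_idle(total_messages: int, connections: int,
                   consumers: int, producers: int) -> bool:
    """True when nothing has touched the broker over the observation window."""
    return (total_messages == 0 and connections == 0
            and consumers == 0 and producers == 0)

assert broker_is_idle(0, 0, 0, 0)        # forgotten PoC broker: candidate for shutdown
assert not broker_is_idle(0, 2, 0, 0)    # something is still connecting; investigate first
```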




&lt;p&gt;CloudWise detects idle Amazon MQ brokers automatically by analyzing CloudWatch connection and message metrics. If your broker has had zero activity for 14 days, CloudWise flags it with the full monthly cost so you can decide whether to keep it or shut it down.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise automates AWS cost analysis across 145+ waste detectors. Try it at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your OpenSearch Bill Is Bigger Than You Think: A Technical Cost Breakdown</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:02:09 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/your-opensearch-bill-is-bigger-than-you-think-a-technical-cost-breakdown-20gj</link>
      <guid>https://dev.to/cloudwiseteam/your-opensearch-bill-is-bigger-than-you-think-a-technical-cost-breakdown-20gj</guid>
      <description>&lt;p&gt;OpenSearch can look cheap at first glance, then surprise you in the monthly bill.&lt;/p&gt;

&lt;p&gt;Most teams look at one line item and assume that is “the OpenSearch cost.” In reality, OpenSearch spend is a composite of compute, storage, backups, and data movement, and each part scales differently under real workloads.&lt;/p&gt;

&lt;p&gt;If you are trying to reduce spend without breaking search performance, you need to model the full stack, not just one rate card.&lt;/p&gt;

&lt;h2&gt;
  
  
  1) Start with the right mental model
&lt;/h2&gt;

&lt;p&gt;There are two major OpenSearch consumption models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch Service domains (provisioned clusters)&lt;/li&gt;
&lt;li&gt;OpenSearch Serverless collections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post focuses on &lt;strong&gt;domain-based OpenSearch&lt;/strong&gt;, where cost generally includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data node instance-hours (often the biggest component)&lt;/li&gt;
&lt;li&gt;Dedicated master node instance-hours (if enabled)&lt;/li&gt;
&lt;li&gt;Warm/cold tier node-hours (if used)&lt;/li&gt;
&lt;li&gt;EBS storage attached to nodes&lt;/li&gt;
&lt;li&gt;EBS performance dimensions (for gp3: provisioned IOPS/throughput beyond baseline)&lt;/li&gt;
&lt;li&gt;Snapshot storage&lt;/li&gt;
&lt;li&gt;Data transfer and network-related charges&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your team uses OpenSearch Serverless, the billing dimensions are different (OCUs and storage), so avoid applying domain formulas to serverless workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  2) Why “simple price per GB” is misleading
&lt;/h2&gt;

&lt;p&gt;A common mistake is saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“OpenSearch data is $X/GB-month”&lt;/li&gt;
&lt;li&gt;“EBS is $Y/GB-month”&lt;/li&gt;
&lt;li&gt;“Snapshots are $Z/GB-month”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those values can be valid in a specific region and setup, but not as universal truths.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EBS pricing differs by volume type (gp2, gp3, io1, io2, st1, etc.)&lt;/li&gt;
&lt;li&gt;gp3 can add separate performance costs for extra IOPS/throughput&lt;/li&gt;
&lt;li&gt;snapshot charges depend on snapshot type/storage class and service context&lt;/li&gt;
&lt;li&gt;OpenSearch itself charges continuously for active domain infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: two domains with similar data size can have very different monthly costs depending on node family, node count, AZ architecture, and EBS configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  3) A practical monthly estimation formula
&lt;/h2&gt;

&lt;p&gt;For domain-based OpenSearch, a useful engineering estimate is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Monthly OpenSearch Cost ≈&lt;/code&gt;&lt;br&gt;
&lt;code&gt;(Σ node_hour_rate × node_count × 730)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ (EBS_GB × EBS_GB_rate)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ (gp3_extra_iops × iops_rate, if applicable)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ (gp3_extra_throughput × throughput_rate, if applicable)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ warm/cold tier costs&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ snapshot storage&lt;/code&gt;&lt;br&gt;
&lt;code&gt;+ data transfer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will still be an estimate, but it is far closer to reality than a single “per-GB” assumption.&lt;/p&gt;
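&lt;p&gt;The formula translates directly into code. A sketch with placeholder inputs (every rate below is hypothetical; pull real rates from the AWS price list for your region and node family):&lt;/p&gt;

```python
# The domain-cost estimation formula above, as a function.
# All rates here are hypothetical placeholders, not published AWS prices.
HOURS_PER_MONTH = 730

def opensearch_monthly_estimate(node_groups,          # list of (hourly_rate, node_count)
                                ebs_gb, ebs_gb_rate,
                                extra_iops=0, iops_rate=0.0,
                                extra_throughput=0, throughput_rate=0.0,
                                warm_cold=0.0, snapshots=0.0, transfer=0.0):
    """Sum every billing dimension of a domain-based OpenSearch deployment."""
    compute = sum(rate * count * HOURS_PER_MONTH for rate, count in node_groups)
    storage = ebs_gb * ebs_gb_rate
    perf = extra_iops * iops_rate + extra_throughput * throughput_rate
    return compute + storage + perf + warm_cold + snapshots + transfer

# Example: 3 data nodes at a hypothetical $0.167/hr, 3 masters at $0.068/hr,
# 1.5 TB of EBS at a hypothetical $0.08/GB-month
est = opensearch_monthly_estimate([(0.167, 3), (0.068, 3)], ebs_gb=1536, ebs_gb_rate=0.08)
print(f"${est:,.2f}/month")
```

&lt;p&gt;Feeding real CUR numbers into an estimate like this beats any single per-GB rule of thumb.&lt;/p&gt;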

&lt;h2&gt;
  
  
  4) What to inspect first (high ROI checks)
&lt;/h2&gt;

&lt;p&gt;When I audit OpenSearch cost, I check these first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Idle domains.&lt;/strong&gt; Domains with zero or near-zero search/index traffic for days or weeks. Easy wins: delete, downscale, or consolidate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overprovisioned data nodes.&lt;/strong&gt; Low CPU plus low indexing/search rates plus high instance count. Rightsize node families and counts cautiously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EBS mismatch.&lt;/strong&gt; gp2 where gp3 would be cheaper for the same durability target, or oversized volumes with consistently low utilization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Snapshot sprawl.&lt;/strong&gt; Old manual snapshots with no retention policy. Define lifecycle and retention rules.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-production environments running 24/7.&lt;/strong&gt; Dev/test domains that do not need full-time uptime. Schedule down periods where possible.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  5) Fast validation commands
&lt;/h2&gt;

&lt;p&gt;Use CLI/API checks before changing anything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Inventory domains:&lt;br&gt;
&lt;code&gt;aws opensearch list-domain-names&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Domain config and capacity:&lt;br&gt;
&lt;code&gt;aws opensearch describe-domain --domain-name &amp;lt;domain_name&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage and utilization metrics (CloudWatch):&lt;br&gt;
check request/indexing activity, CPU utilization, memory pressure, free storage, and write/read throughput over at least 14 days.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then correlate with CUR or billing exports before taking action.&lt;/p&gt;

&lt;h2&gt;
  
  
  6) A safer way to communicate pricing
&lt;/h2&gt;

&lt;p&gt;Instead of writing:&lt;br&gt;
“OpenSearch costs $0.25/GB, EBS costs $0.10/GB, snapshots cost $0.095/GB”&lt;/p&gt;

&lt;p&gt;Say this:&lt;br&gt;
“OpenSearch spend is a combination of node-hours, EBS storage/performance, and snapshot/storage retention. Exact rates vary by region, node family, storage class, and deployment choices.”&lt;/p&gt;

&lt;p&gt;That phrasing is both technically correct and operationally useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;OpenSearch cost optimization is not about finding one wrong number. It is about identifying the dominant cost driver for your current architecture, then changing one lever at a time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eliminate idle&lt;/li&gt;
&lt;li&gt;rightsize nodes&lt;/li&gt;
&lt;li&gt;tune EBS&lt;/li&gt;
&lt;li&gt;enforce snapshot retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you do those four consistently, you will usually reduce spend without hurting query latency or reliability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise automates AWS cost analysis across 42+ services. Try it at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>opensearch</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>The Silent $33/Month Charge: Understanding AWS NAT Gateway Costs</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 12 Mar 2026 13:12:47 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/the-silent-33month-charge-understanding-aws-nat-gateway-costs-19c3</link>
      <guid>https://dev.to/cloudwiseteam/the-silent-33month-charge-understanding-aws-nat-gateway-costs-19c3</guid>
      <description>&lt;p&gt;Most AWS cost conversations focus on EC2 instances and RDS databases. Meanwhile, NAT Gateways quietly burn $32.85 per month &lt;em&gt;each&lt;/em&gt; — whether they process a terabyte of data or zero bytes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How NAT Gateway Billing Works
&lt;/h2&gt;

&lt;p&gt;NAT Gateway pricing has two components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Rate (us-east-1)&lt;/th&gt;
&lt;th&gt;Monthly Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hourly base charge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.045/hour&lt;/td&gt;
&lt;td&gt;$32.85/month (730 hrs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.045/GB&lt;/td&gt;
&lt;td&gt;Varies by traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The base charge is the one that catches teams off guard. It runs 24/7 from the moment the NAT Gateway is created until it's deleted. No traffic? Doesn't matter — you're still paying $0.045 for every hour it exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Real Cost Hides: Multi-AZ Deployments
&lt;/h2&gt;

&lt;p&gt;The standard Terraform pattern for a production VPC creates one NAT Gateway per Availability Zone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_nat_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability_zones&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;allocation_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three AZs means three NAT Gateways. That's $98.55/month in base charges alone — before a single byte of data is processed. For a staging environment that mirrors production network architecture, you're paying nearly $100/month for network redundancy that staging doesn't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math at Scale
&lt;/h2&gt;

&lt;p&gt;Let's walk through realistic scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small team (1 VPC, 3 AZs):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 NAT Gateways × $32.85 = &lt;strong&gt;$98.55/month&lt;/strong&gt; base&lt;/li&gt;
&lt;li&gt;500 GB data processed × $0.045 = $22.50&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $121.05/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mid-size company (4 VPCs across dev/staging/prod/sandbox, 3 AZs each):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12 NAT Gateways × $32.85 = &lt;strong&gt;$394.20/month&lt;/strong&gt; base&lt;/li&gt;
&lt;li&gt;Most non-prod NAT Gateways processing near-zero traffic&lt;/li&gt;
&lt;li&gt;Likely waste: 6–9 idle gateways = &lt;strong&gt;$197–$296/month wasted&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise (20+ VPCs, multi-region):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;60+ NAT Gateways × $32.85 = &lt;strong&gt;$1,971/month&lt;/strong&gt; base&lt;/li&gt;
&lt;li&gt;Traffic typically concentrated in 1–2 AZs per VPC&lt;/li&gt;
&lt;li&gt;Idle NAT Gateways can easily exceed &lt;strong&gt;$500/month&lt;/strong&gt; in pure waste&lt;/li&gt;
&lt;/ul&gt;
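&lt;p&gt;All three scenarios fall out of the same two pricing components. A quick sketch (us-east-1 rates from the table above; function name is ours):&lt;/p&gt;

```python
# Monthly NAT Gateway cost from the two us-east-1 pricing components above.
HOURS_PER_MONTH = 730
HOURLY_BASE = 0.045     # per gateway, charged 24/7
PER_GB = 0.045          # data processing

def nat_monthly(gateways: int, gb_processed: float = 0.0) -> float:
    """Base charge for every gateway plus data processing for actual traffic."""
    return gateways * HOURLY_BASE * HOURS_PER_MONTH + gb_processed * PER_GB

print(f"small team (3 GW, 500 GB): ${nat_monthly(3, 500):,.2f}")  # → $121.05
print(f"mid-size   (12 GW, idle):  ${nat_monthly(12):,.2f}")      # → $394.20
print(f"enterprise (60 GW, idle):  ${nat_monthly(60):,.2f}")      # → $1,971.00
```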

&lt;h2&gt;
  
  
  How to Find Idle NAT Gateways
&lt;/h2&gt;

&lt;p&gt;Two CloudWatch metrics tell you everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BytesOutToDestination&lt;/code&gt;&lt;/strong&gt; — total bytes sent through the NAT Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ActiveConnectionCount&lt;/code&gt;&lt;/strong&gt; — number of concurrent active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If both are zero for 7+ days, the NAT Gateway is idle. Here's how to check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List all NAT Gateways&lt;/span&gt;
aws ec2 describe-nat-gateways &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NatGateways[?State==`available`].{
    ID:NatGatewayId,
    SubnetId:SubnetId,
    VpcId:VpcId,
    State:State
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# Check traffic for a specific NAT Gateway (last 7 days)&lt;/span&gt;
&lt;span class="c"&gt;# GNU date syntax shown; on macOS/BSD use: date -u -v-7d +%Y-%m-%dT%H:%M:%S&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/NATGateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; BytesOutToDestination &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NatGatewayId,Value&lt;span class="o"&gt;=&lt;/span&gt;nat-0abc123def456 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'7 days ago'&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 86400 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If every daily sum is &lt;code&gt;0.0&lt;/code&gt;, that NAT Gateway is costing you $32.85/month for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives for Low-Traffic Environments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. VPC Endpoints (Gateway type — free)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your private subnets only need to reach S3 or DynamoDB, a Gateway VPC Endpoint handles it with zero hourly or data processing charges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-abc123 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.s3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--route-table-ids&lt;/span&gt; rtb-abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single command can eliminate the NAT Gateway entirely for S3-only workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. NAT Instances (for dev/staging)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;t4g.nano&lt;/code&gt; instance running as a NAT instance costs ~$3.07/month — roughly 10x cheaper than a NAT Gateway. The tradeoff is no managed HA, no automatic scaling, and you manage the instance yourself. For non-production environments, that's often acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Consolidate AZs in non-production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Staging doesn't need 3 NAT Gateways. Route all private subnets through a single NAT Gateway in one AZ. Cross-AZ data transfer adds $0.01/GB, but at low staging traffic volumes, that's negligible compared to saving $65.70/month in base charges.&lt;/p&gt;
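&lt;p&gt;The break-even point is easy to compute. A sketch using the rates above (it assumes all rerouted traffic crosses an AZ boundary, the worst case):&lt;/p&gt;

```python
# Break-even check for consolidating staging to a single NAT Gateway:
# base-charge savings vs. added cross-AZ data transfer.
BASE_MONTHLY = 32.85        # $0.045/hr x 730 hrs per gateway
CROSS_AZ_PER_GB = 0.01

def break_even_gb(gateways_removed: int) -> float:
    """Monthly rerouted traffic at which cross-AZ charges eat the savings."""
    return gateways_removed * BASE_MONTHLY / CROSS_AZ_PER_GB

# Dropping 2 of 3 gateways saves $65.70/month in base charges; you would need
# to push this much cross-AZ traffic per month before consolidation loses money.
print(round(break_even_gb(2)))  # → 6570  (GB/month, far above typical staging traffic)
```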

&lt;h2&gt;
  
  
  The Regional NAT Gateway Option
&lt;/h2&gt;

&lt;p&gt;AWS recently introduced Regional NAT Gateways, which span multiple AZs but are billed per AZ per hour. If your Regional NAT Gateway covers 3 AZs, you're charged $0.045 × 3 = $0.135/hour — the same as running 3 individual NAT Gateways. The advantage is operational simplicity, not cost savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Quick Audit Checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Count your NAT Gateways&lt;/strong&gt;: &lt;code&gt;aws ec2 describe-nat-gateways --query 'NatGateways[?State==`available`]' | jq length&lt;/code&gt; — multiply by $32.85/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check for zero-traffic gateways&lt;/strong&gt;: Query CloudWatch for &lt;code&gt;BytesOutToDestination&lt;/code&gt; over the past 14 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review non-production VPCs&lt;/strong&gt;: Do dev/staging environments truly need NAT Gateway HA across 3 AZs?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate VPC Endpoints&lt;/strong&gt;: If traffic is primarily S3/DynamoDB, Gateway Endpoints are free&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NAT Gateways are one of those AWS resources where the "set it and forget it" mentality costs real money. A five-minute audit can often save $100–$300/month.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise automates AWS cost analysis across 38+ services — including idle NAT Gateway detection. Try it at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>devops</category>
      <category>cloudoptimization</category>
    </item>
    <item>
      <title>The Hidden Cost Layers of EC2 (And Why Stopped Instances Still Drain Your Budget)</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 05 Mar 2026 20:04:13 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/the-hidden-cost-layers-of-ec2-and-why-stopped-instances-still-drain-your-budget-5bmf</link>
      <guid>https://dev.to/cloudwiseteam/the-hidden-cost-layers-of-ec2-and-why-stopped-instances-still-drain-your-budget-5bmf</guid>
      <description>&lt;p&gt;EC2 looks simple on the bill — until you pull it apart. What most teams see is a single line item for instance hours. In reality, every running (and even stopped) EC2 instance generates charges across multiple dimensions, and the overlooked ones tend to accumulate quietly.&lt;/p&gt;

&lt;p&gt;Let's break down where EC2 costs actually come from, what keeps billing you after you click "Stop", and what you can do about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Obvious: Instance Hours
&lt;/h2&gt;

&lt;p&gt;On-Demand pricing ranges from roughly &lt;strong&gt;$0.0042/hr&lt;/strong&gt; for a &lt;code&gt;t4g.nano&lt;/code&gt; to over &lt;strong&gt;$30/hr&lt;/strong&gt; for GPU-accelerated instances like the &lt;code&gt;p4d.24xlarge&lt;/code&gt;. The exact rate depends on the instance family (general purpose, compute-optimized, memory-optimized, GPU, etc.), the instance size, and the region.&lt;/p&gt;

&lt;p&gt;Most teams have a reasonable handle on this cost. The real surprises come from everything else attached to those instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Persistent One: EBS Storage
&lt;/h2&gt;

&lt;p&gt;Every EC2 instance boots from at least one EBS volume, and most production instances have additional data volumes attached. EBS is billed &lt;strong&gt;per GB per month&lt;/strong&gt;, regardless of whether the instance is running:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume Type&lt;/th&gt;
&lt;th&gt;Price (us-east-1)&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gp3 (General Purpose SSD)&lt;/td&gt;
&lt;td&gt;$0.08/GB/mo&lt;/td&gt;
&lt;td&gt;Default for most workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gp2 (Previous Gen SSD)&lt;/td&gt;
&lt;td&gt;$0.10/GB/mo&lt;/td&gt;
&lt;td&gt;Legacy — still widely used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;io1/io2 (Provisioned IOPS SSD)&lt;/td&gt;
&lt;td&gt;$0.125/GB/mo + IOPS charges&lt;/td&gt;
&lt;td&gt;High-performance databases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;st1 (Throughput Optimized HDD)&lt;/td&gt;
&lt;td&gt;$0.045/GB/mo&lt;/td&gt;
&lt;td&gt;Big data, log processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sc1 (Cold HDD)&lt;/td&gt;
&lt;td&gt;$0.015/GB/mo&lt;/td&gt;
&lt;td&gt;Infrequent access archives&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When you &lt;strong&gt;stop&lt;/strong&gt; an EC2 instance, you stop paying for compute hours. But every EBS volume attached to that instance continues to incur storage charges at the same rate. A stopped instance with 500 GB of gp3 storage still costs you &lt;strong&gt;$40/month&lt;/strong&gt; in EBS alone — indefinitely.&lt;/p&gt;

&lt;p&gt;This is one of the most common sources of invisible cloud waste. Teams spin up instances for a project, stop them "temporarily", and forget about them. Months later, those volumes are still quietly billing.&lt;/p&gt;
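&lt;p&gt;The carrying cost is simple to estimate. A sketch using the gp3 rate from the table above (the volume sizes are hypothetical):&lt;/p&gt;

```python
# Carrying cost of a stopped instance's EBS volumes (us-east-1 gp3 rate above).
GP3_PER_GB_MONTH = 0.08

def stopped_instance_monthly(volume_gbs) -> float:
    """Compute drops to $0 when stopped; storage keeps billing on every attached volume."""
    return sum(volume_gbs) * GP3_PER_GB_MONTH

# 100 GB root + 400 GB data volume, "temporarily" stopped
print(f"${stopped_instance_monthly([100, 400]):.2f}/month")  # → $40.00/month
```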

&lt;h2&gt;
  
  
  The Migration Opportunity: gp2 → gp3
&lt;/h2&gt;

&lt;p&gt;If your account still has EBS volumes running on &lt;code&gt;gp2&lt;/code&gt;, you're overpaying. AWS released &lt;code&gt;gp3&lt;/code&gt; as a direct replacement that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20% cheaper&lt;/strong&gt; per GB ($0.08 vs $0.10)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher baseline performance&lt;/strong&gt;: 3,000 IOPS and 125 MB/s throughput included (gp2 baseline is only 3 IOPS per GB, minimum 100)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independently tunable IOPS and throughput&lt;/strong&gt; — with gp2, you have to increase volume size to get more IOPS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The migration is non-destructive and can be done live via &lt;code&gt;ModifyVolume&lt;/code&gt; with no downtime. There's almost no reason not to migrate.&lt;/p&gt;
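&lt;p&gt;As a sketch (the volume ID is a placeholder), the migration is a single CLI call, plus one more to watch progress:&lt;/p&gt;

```shell
# Convert a gp2 volume to gp3 in place -- no detach, no downtime
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3

# Watch the modification move from "modifying" to "optimizing" to "completed"
aws ec2 describe-volumes-modifications \
  --volume-ids vol-0123456789abcdef0 \
  --query 'VolumesModifications[].{State:ModificationState,Progress:Progress}'
```

&lt;p&gt;The volume stays attached and usable throughout, including the &lt;code&gt;optimizing&lt;/code&gt; phase.&lt;/p&gt;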

&lt;h2&gt;
  
  
  The Forgotten Ones: Unattached Volumes and Old Snapshots
&lt;/h2&gt;

&lt;p&gt;When an EC2 instance is terminated, its root volume is deleted by default — but any additional EBS volumes may persist as &lt;strong&gt;unattached volumes&lt;/strong&gt; (status: &lt;code&gt;available&lt;/code&gt;). These volumes have no instance connected but are billed at the full storage rate.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;strong&gt;EBS snapshots&lt;/strong&gt; accumulate over time. Each snapshot is billed based on the actual data blocks stored (not the full volume size), at $0.05/GB/month. A team that takes daily snapshots of a 500 GB volume without a retention policy can easily accumulate terabytes of snapshot storage within a year.&lt;/p&gt;
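&lt;p&gt;To gauge your exposure, sum the provisioned size of every snapshot you own. This is an upper bound, since snapshots bill on changed blocks rather than full volume size:&lt;/p&gt;

```shell
# Upper bound on snapshot storage: total provisioned GB across snapshots you own
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'sum(Snapshots[].VolumeSize)'
```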

&lt;h2&gt;
  
  
  The Sneaky Ones: Elastic IPs and Data Transfer
&lt;/h2&gt;

&lt;p&gt;Two more cost components are often overlooked:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elastic IPs&lt;/strong&gt;: A public IPv4 address attached to a running instance has historically been free, but as of February 2024, AWS charges &lt;strong&gt;$0.005/hr ($3.60/month)&lt;/strong&gt; for every public IPv4 address — whether attached to a running instance or not. An Elastic IP on a stopped instance, or one not attached to any instance at all, costs the same.&lt;/p&gt;
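&lt;p&gt;Unassociated addresses are easy to find (the backtick-quoted &lt;code&gt;null&lt;/code&gt; is JMESPath literal syntax):&lt;/p&gt;

```shell
# Elastic IPs not associated with anything -- each still bills ~$3.60/month
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].{IP:PublicIp,Alloc:AllocationId}' \
  --output table
```

&lt;p&gt;Release the ones you no longer need with &lt;code&gt;aws ec2 release-address&lt;/code&gt;.&lt;/p&gt;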

&lt;p&gt;&lt;strong&gt;Data Transfer&lt;/strong&gt;: Outbound data transfer from EC2 to the internet is $0.09/GB (first 10 TB/month tier in us-east-1). Cross-AZ traffic within the same region costs $0.01/GB in each direction. These charges don't appear under "EC2" in Cost Explorer — they show up under "Data Transfer", making them easy to miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idle Tax
&lt;/h2&gt;

&lt;p&gt;An instance that's running but not doing meaningful work is arguably the most expensive form of waste, because you're paying for both compute and storage. Common culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dev/staging instances&lt;/strong&gt; left running outside business hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy instances&lt;/strong&gt; that served a purpose months ago but were never decommissioned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-provisioned instances&lt;/strong&gt; where a &lt;code&gt;c5.4xlarge&lt;/code&gt; is running at 3% CPU because someone chose the instance size "just in case"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS CloudWatch metrics make it straightforward to identify instances with sustained low CPU utilization (below 5% over 14 days is a common threshold), but few teams audit this regularly.&lt;/p&gt;
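&lt;p&gt;The check itself is a single CloudWatch call per instance; the instance ID below is a placeholder:&lt;/p&gt;

```shell
# Average CPU over the last 14 days; below ~5% usually means idle or oversized
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 1209600 \
  --statistics Average \
  --query 'Datapoints[0].Average'
```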

&lt;h2&gt;
  
  
  What You Can Do
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit stopped instances&lt;/strong&gt;: If an instance has been stopped for more than 30 days, either terminate it (after snapshotting the volumes if needed) or at minimum, detach and delete unused volumes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrate gp2 → gp3&lt;/strong&gt;: Free performance improvement and 20% cost reduction on storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set snapshot retention policies&lt;/strong&gt;: Delete snapshots older than 90 days unless compliance requires otherwise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule dev/staging instances&lt;/strong&gt;: Use Instance Scheduler or Lambda-based automation to stop instances outside working hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean up unattached volumes&lt;/strong&gt;: Any EBS volume with status &lt;code&gt;available&lt;/code&gt; is costing you money for nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review Elastic IPs&lt;/strong&gt;: Release any EIPs not attached to running infrastructure.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;None of these are difficult individually. The challenge is doing them consistently across every account and region. That's the kind of thing automation handles better than humans.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise detects idle EC2 instances, stopped instances with EBS volumes, unattached volumes, gp2-to-gp3 migration opportunities, old snapshots, and more — across all your regions and accounts. Try it at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
      <category>cloudwise</category>
    </item>
    <item>
      <title>How SageMaker Actually Bills: A Breakdown for Engineers</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 26 Feb 2026 18:28:34 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/how-sagemaker-actually-bills-a-breakdown-for-engineers-1cb7</link>
      <guid>https://dev.to/cloudwiseteam/how-sagemaker-actually-bills-a-breakdown-for-engineers-1cb7</guid>
      <description>&lt;p&gt;You deployed a SageMaker notebook to prototype a model. A week later, your AWS bill has a $280 line item you can't explain.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;SageMaker is one of the most powerful ML platforms on AWS — and one of the most confusing to bill for. Unlike EC2 (one instance, one hourly rate), SageMaker has &lt;strong&gt;at least 12 independent billing dimensions&lt;/strong&gt; spread across notebooks, training, endpoints, storage, data processing, and more. Each one ticks on its own meter.&lt;/p&gt;

&lt;p&gt;This post breaks down every SageMaker billing component, shows you the real numbers, and highlights the traps that catch even experienced AWS engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mental Model: SageMaker Is Not One Service
&lt;/h2&gt;

&lt;p&gt;Think of SageMaker as a &lt;strong&gt;collection of services&lt;/strong&gt; that share a console. Each has its own pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                 SageMaker                       │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ Notebooks│  │ Training │  │   Endpoints   │  │
│  │ (Dev)    │  │ (Build)  │  │   (Serve)     │  │
│  │ $/hr     │  │ $/hr     │  │   $/hr 24/7   │  │
│  └──────────┘  └──────────┘  └───────────────┘  │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │Processing│  │ Storage  │  │  Data Wrangler│  │
│  │ (ETL)    │  │ (EBS+S3) │  │   (Prep)      │  │
│  │ $/hr     │  │ $/GB-mo  │  │   $/hr        │  │
│  └──────────┘  └──────────┘  └───────────────┘  │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ Canvas   │  │ Feature  │  │  Inference    │  │
│  │ (No-code)│  │ Store    │  │  Recommender  │  │
│  │ $/hr     │  │ $/GB+req │  │  (load test)  │  │
│  └──────────┘  └──────────┘  └───────────────┘  │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Each box bills independently.&lt;/strong&gt; You can have zero training cost but be paying hundreds for an idle endpoint. Let's walk through each.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Notebook Instances: The Silent $37/month Drain
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How it charges&lt;/strong&gt;: Per-second billing while the notebook is in &lt;code&gt;InService&lt;/code&gt; status. You pay for the instance whether or not you have a kernel running.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;$/Hour&lt;/th&gt;
&lt;th&gt;Monthly (24/7)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml.t3.medium&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$36.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.t3.large&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$73.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.large&lt;/td&gt;
&lt;td&gt;$0.115&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$83.95&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.xlarge&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$167.90&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.c5.xlarge&lt;/td&gt;
&lt;td&gt;$0.204&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$148.92&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: Notebooks keep billing even when you close the browser tab. The instance stays &lt;code&gt;InService&lt;/code&gt; until you explicitly &lt;strong&gt;stop&lt;/strong&gt; it. There's no auto-stop by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check for running notebooks right now&lt;/span&gt;
aws sagemaker list-notebook-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--status-equals&lt;/span&gt; InService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NotebookInstances[].{Name:NotebookInstanceName,Type:InstanceType,Created:CreationTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What catches teams off guard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You spin up a &lt;code&gt;ml.t3.medium&lt;/code&gt; to test a concept on Monday&lt;/li&gt;
&lt;li&gt;You use it during the week but never stop it on Friday&lt;/li&gt;
&lt;li&gt;It idles through 4 weekends = 192 extra hours = &lt;strong&gt;$9.60 wasted&lt;/strong&gt; per forgotten instance&lt;/li&gt;
&lt;li&gt;Multiply by a team of 5 data scientists doing this regularly = real money&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost-saving tip&lt;/strong&gt;: Use SageMaker &lt;strong&gt;Studio&lt;/strong&gt; notebooks with auto-shutdown instead of classic notebook instances. Or set a CloudWatch alarm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Alarm if notebook is InService for &amp;gt; 12 hours with no API activity&lt;/span&gt;
aws cloudwatch put-metric-alarm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alarm-name&lt;/span&gt; &lt;span class="s2"&gt;"sagemaker-notebook-idle"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"AWS/SageMaker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; &lt;span class="s2"&gt;"InvocationsPerInstance"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NotebookInstanceName,Value&lt;span class="o"&gt;=&lt;/span&gt;my-notebook &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistic&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 43200 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--comparison-operator&lt;/span&gt; LessThanOrEqualToThreshold &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--evaluation-periods&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alarm-actions&lt;/span&gt; arn:aws:sns:us-east-1:123456789012:alert-topic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Notebook Storage (Often Overlooked)
&lt;/h3&gt;

&lt;p&gt;Each notebook instance has an &lt;strong&gt;EBS volume&lt;/strong&gt; (default 5 GB, configurable up to 16 TB). You pay for it even when the notebook is stopped:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume Size&lt;/th&gt;
&lt;th&gt;$/Month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 GB (default)&lt;/td&gt;
&lt;td&gt;$0.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 GB&lt;/td&gt;
&lt;td&gt;$5.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 GB&lt;/td&gt;
&lt;td&gt;$58.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At $0.116/GB-month (gp2 pricing), a 500 GB volume costs &lt;strong&gt;$58/month&lt;/strong&gt; just sitting there — even while the notebook is stopped.&lt;/p&gt;
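&lt;p&gt;The math is simple enough to script. A minimal sketch of the per-volume charge at the $0.116/GB-month rate above:&lt;/p&gt;

```python
# Monthly EBS charge for a notebook volume, billed even while the notebook is stopped
def notebook_storage_monthly(volume_gb, rate_per_gb=0.116):
    return round(volume_gb * rate_per_gb, 2)

print(notebook_storage_monthly(5))    # 0.58 -- the default volume
print(notebook_storage_monthly(500))  # 58.0
```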




&lt;h2&gt;
  
  
  2. Training Jobs: Pay-Per-Second, But Instance Choice Matters Enormously
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How it charges&lt;/strong&gt;: Per-second billing while the training job runs. No charge when it completes. The clock starts at instance launch and stops at job completion or failure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;$/Hour&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.large&lt;/td&gt;
&lt;td&gt;$0.115&lt;/td&gt;
&lt;td&gt;Tabular data, small models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.xlarge&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;Medium models, preprocessing-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.c5.xlarge&lt;/td&gt;
&lt;td&gt;$0.204&lt;/td&gt;
&lt;td&gt;CPU-bound training (gradient boosting)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.p3.2xlarge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3.825&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU training (deep learning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.p3.8xlarge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$14.688&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-GPU training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.p3.16xlarge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$28.152&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed deep learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.p4d.24xlarge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$37.688&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large model training (8× A100 GPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.g5.xlarge&lt;/td&gt;
&lt;td&gt;$1.408&lt;/td&gt;
&lt;td&gt;Cost-effective GPU (single NVIDIA A10G)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.trn1.2xlarge&lt;/td&gt;
&lt;td&gt;$1.3438&lt;/td&gt;
&lt;td&gt;AWS Trainium — optimized for training&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: GPU instances are eye-wateringly expensive per hour. A single &lt;code&gt;ml.p3.2xlarge&lt;/code&gt; training job that takes 24 hours costs &lt;strong&gt;$91.80&lt;/strong&gt;. If your hyperparameter tuning job launches 20 variants in parallel, that's &lt;strong&gt;$1,836 in one day&lt;/strong&gt;.&lt;/p&gt;
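&lt;p&gt;Those figures fall out of a one-line estimate. The sketch below ignores storage and data transfer, which are usually noise next to GPU compute:&lt;/p&gt;

```python
# Rough training cost: instance-hours times the On-Demand rate from the table above
def training_cost(hourly_rate, hours, parallel_jobs=1):
    return round(hourly_rate * hours * parallel_jobs, 2)

print(training_cost(3.825, 24))      # 91.8  -- one 24-hour ml.p3.2xlarge job
print(training_cost(3.825, 24, 20))  # 1836.0 -- a 20-variant tuning sweep
```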

&lt;h3&gt;
  
  
  Training Cost Formula
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Training Cost = (instance_count × instance_price_per_second × training_duration_seconds)
              + (storage_gb × $0.116/GB-month × duration_fraction)
              + (data_download_from_s3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Spot Training: 60-90% Savings (With a Catch)
&lt;/h3&gt;

&lt;p&gt;SageMaker supports &lt;strong&gt;managed spot training&lt;/strong&gt; — using EC2 Spot Instances for training jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;use_spot_instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_wait&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Max time to wait for spot capacity
&lt;/span&gt;    &lt;span class="n"&gt;max_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Max training time
&lt;/span&gt;    &lt;span class="c1"&gt;# checkpoint_s3_uri for spot interruption recovery
&lt;/span&gt;    &lt;span class="n"&gt;checkpoint_s3_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/checkpoints/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Savings&lt;/strong&gt;: Typically &lt;strong&gt;60–90% off&lt;/strong&gt; On-Demand pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch&lt;/strong&gt;: Spot instances can be interrupted. Your training job gets a 2-minute warning, then terminates. Without checkpointing, you lose all progress and pay for the time already consumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Always set &lt;code&gt;checkpoint_s3_uri&lt;/code&gt; when using spot training. This saves model checkpoints to S3 so interrupted jobs can resume instead of restarting from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Warm Pools (New)
&lt;/h3&gt;

&lt;p&gt;If you run many training jobs in sequence (e.g., hyperparameter tuning), each job normally provisions a new instance from scratch (2–5 minutes startup). &lt;strong&gt;Warm pools&lt;/strong&gt; keep instances running between jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You pay for instance time during the keep-alive period&lt;/li&gt;
&lt;li&gt;But you skip the ~3 minute cold start per job&lt;/li&gt;
&lt;li&gt;Break-even: if you run enough sequential jobs that the saved startup time exceeds the keep-alive cost&lt;/li&gt;
&lt;/ul&gt;
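&lt;p&gt;That break-even can be sketched numerically. One simplifying assumption here: saved cold-start minutes are valued at the instance's own hourly rate, which treats startup purely as wasted wall-clock time:&lt;/p&gt;

```python
# Warm-pool break-even sketch: does the value of skipped cold starts exceed
# the keep-alive time you pay for between jobs?
def warm_pool_saves_money(jobs, cold_start_min, keep_alive_min, hourly_rate):
    cold_start_value = jobs * (cold_start_min / 60) * hourly_rate
    keep_alive_cost = (jobs - 1) * (keep_alive_min / 60) * hourly_rate
    return cold_start_value > keep_alive_cost

print(warm_pool_saves_money(20, 3, 2, 3.825))  # True  -- tight tuning loop
print(warm_pool_saves_money(2, 3, 60, 3.825))  # False -- long idle gaps
```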




&lt;h2&gt;
  
  
  3. Real-Time Endpoints: The Big One
&lt;/h2&gt;

&lt;p&gt;This is where most SageMaker overspend happens. &lt;strong&gt;Endpoints run 24/7 and bill continuously&lt;/strong&gt;, even with zero traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it charges&lt;/strong&gt;: Per-second billing while the endpoint is &lt;code&gt;InService&lt;/code&gt;. You pay for the full instance(s) whether they receive 0 or 10,000 requests per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly Endpoint Cost = instance_count × hourly_rate × 730 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;$/Hour&lt;/th&gt;
&lt;th&gt;Monthly (1 instance)&lt;/th&gt;
&lt;th&gt;Monthly (2 instances)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml.t2.medium&lt;/td&gt;
&lt;td&gt;$0.065&lt;/td&gt;
&lt;td&gt;$47.45&lt;/td&gt;
&lt;td&gt;$94.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.large&lt;/td&gt;
&lt;td&gt;$0.115&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$83.95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$167.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.xlarge&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$167.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$335.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.c5.xlarge&lt;/td&gt;
&lt;td&gt;$0.204&lt;/td&gt;
&lt;td&gt;$148.92&lt;/td&gt;
&lt;td&gt;$297.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.g4dn.xlarge&lt;/td&gt;
&lt;td&gt;$0.736&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$537.28&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,074.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.p3.2xlarge&lt;/td&gt;
&lt;td&gt;$3.825&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,792.25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5,584.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.inf1.xlarge&lt;/td&gt;
&lt;td&gt;$0.297&lt;/td&gt;
&lt;td&gt;$216.81&lt;/td&gt;
&lt;td&gt;$433.62&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Read that again&lt;/strong&gt;: A single &lt;code&gt;ml.p3.2xlarge&lt;/code&gt; endpoint costs &lt;strong&gt;$2,792/month&lt;/strong&gt;. Two instances for high availability: &lt;strong&gt;$5,585/month&lt;/strong&gt;. Many teams deploy this, see it works, and forget to right-size.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Idle Endpoint Problem
&lt;/h3&gt;

&lt;p&gt;A SageMaker endpoint with &lt;strong&gt;zero invocations&lt;/strong&gt; still costs the full hourly rate. Common scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model was deployed for a demo → demo ended → endpoint left running&lt;/li&gt;
&lt;li&gt;A/B testing: old variant endpoint wasn't deleted after the new model won&lt;/li&gt;
&lt;li&gt;Dev/staging endpoints running 24/7 when they're only used during business hours
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find endpoints with zero invocations in the last 7 days&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;endpoint &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;aws sagemaker list-endpoints &lt;span class="nt"&gt;--status-equals&lt;/span&gt; InService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Endpoints[].EndpointName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do

  &lt;/span&gt;&lt;span class="nv"&gt;invocations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/SageMaker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--metric-name&lt;/span&gt; Invocations &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EndpointName,Value&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$endpoint&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;VariantName,Value&lt;span class="o"&gt;=&lt;/span&gt;AllTraffic &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'7 days ago'&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--period&lt;/span&gt; 604800 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Datapoints[0].Sum'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$invocations&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"None"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$invocations&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"⚠️  IDLE: &lt;/span&gt;&lt;span class="nv"&gt;$endpoint&lt;/span&gt;&lt;span class="s2"&gt; (0 invocations in 7 days)"&lt;/span&gt;
  &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Model Endpoints: Pack More Models, Pay Less
&lt;/h3&gt;

&lt;p&gt;If you have many low-traffic models, a &lt;strong&gt;Multi-Model Endpoint (MME)&lt;/strong&gt; lets you load models on-demand into a single endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard:  10 models × ml.m5.large × 730 hrs = $839.50/month
MME:       1 endpoint × ml.m5.xlarge × 730 hrs = $167.90/month
                                      Savings:   $671.60/month (80%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tradeoff: cold-start latency when loading a model that isn't cached. Fine for batch-like traffic; bad for latency-sensitive real-time inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serverless Inference: Pay Only for What You Use
&lt;/h3&gt;

&lt;p&gt;For sporadic traffic (&amp;lt; ~1000 requests/hour), &lt;strong&gt;Serverless Inference&lt;/strong&gt; eliminates the always-on cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pricing:
  - Memory: $0.0000016/GB-second
  - Requests: included

Example: 1000 requests/day, 500ms avg, 4GB memory
  = 1000 × 0.5s × 4GB × $0.0000016/GB-s × 30 days
  = $0.096/month  ← vs $83.95/month for ml.m5.large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The catch&lt;/strong&gt;: Cold starts (30s–2min for the first invocation after an idle period) and a 4 MB max payload.&lt;/p&gt;
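&lt;p&gt;The comparison above can be reproduced in a few lines:&lt;/p&gt;

```python
# Serverless inference memory charge vs an always-on ml.m5.large endpoint
GB_SECOND = 0.0000016  # serverless price per GB-second of memory

def serverless_monthly(requests_per_day, avg_seconds, memory_gb, days=30):
    return round(requests_per_day * avg_seconds * memory_gb * GB_SECOND * days, 3)

print(serverless_monthly(1000, 0.5, 4))  # 0.096
print(round(0.115 * 730, 2))             # 83.95 -- always-on ml.m5.large
```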




&lt;h2&gt;
  
  
  4. Storage: Three Hidden Meters
&lt;/h2&gt;

&lt;p&gt;SageMaker storage costs come from three independent sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  a) Notebook EBS Volumes
&lt;/h3&gt;

&lt;p&gt;Already covered above: $0.116/GB-month, billed even when the notebook is stopped.&lt;/p&gt;

&lt;h3&gt;
  
  
  b) Training Job Storage
&lt;/h3&gt;

&lt;p&gt;Each training job gets a temporary EBS volume for input data and model artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default: 30 GB per instance&lt;/li&gt;
&lt;li&gt;Configurable up to 16 TB&lt;/li&gt;
&lt;li&gt;Billed only during training (per-second)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSD (gp2)&lt;/strong&gt;: $0.116/GB-month, prorated to seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  c) Model Artifacts in S3
&lt;/h3&gt;

&lt;p&gt;Trained models are stored in S3 as &lt;code&gt;.tar.gz&lt;/code&gt; archives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 Standard: $0.023/GB-month&lt;/li&gt;
&lt;li&gt;A 5 GB model × 20 training runs = 100 GB = &lt;strong&gt;$2.30/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;But: large language model checkpoints can be 50–200 GB each&lt;/li&gt;
&lt;li&gt;10 checkpoints × 100 GB = 1 TB = &lt;strong&gt;$23/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Set an S3 Lifecycle policy to move old model artifacts to S3 Glacier after 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Archive old SageMaker models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sagemaker/"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Processing Jobs (ETL/Feature Engineering)
&lt;/h2&gt;

&lt;p&gt;SageMaker Processing runs containerized data processing workloads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it charges&lt;/strong&gt;: Same as training — per-second billing for the instances used.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;$/Hour&lt;/th&gt;
&lt;th&gt;1-hour ETL job&lt;/th&gt;
&lt;th&gt;8-hour daily ETL (monthly)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.xlarge&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$55.20&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.m5.4xlarge&lt;/td&gt;
&lt;td&gt;$0.922&lt;/td&gt;
&lt;td&gt;$0.92&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$221.28&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml.r5.4xlarge&lt;/td&gt;
&lt;td&gt;$1.21&lt;/td&gt;
&lt;td&gt;$1.21&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$290.40&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: Processing jobs often run as part of a pipeline. If your pipeline runs daily with 4 instances for 3 hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4 instances × ml.m5.xlarge × $0.23/hr × 3 hrs × 30 days = $82.80/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not huge — but if someone accidentally sets the pipeline to run hourly instead of daily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4 × $0.23 × 3 × 24 × 30 = $1,987.20/month  ← oops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
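&lt;p&gt;A quick sanity check catches this class of schedule mistake before the invoice does. A minimal sketch, using the assumed on-demand rates from the table above:&lt;/p&gt;

```python
# Back-of-the-envelope cost estimator for a scheduled Processing pipeline.
# Rates are illustrative on-demand prices; check the AWS pricing page.
def pipeline_monthly_cost(instances, rate_per_hr, hours_per_run, runs_per_day, days=30):
    """Monthly cost of a scheduled SageMaker Processing pipeline."""
    return round(instances * rate_per_hr * hours_per_run * runs_per_day * days, 2)

# Daily schedule: 4 x ml.m5.xlarge (assumed $0.23/hr) for 3 hours
daily = pipeline_monthly_cost(4, 0.23, 3, runs_per_day=1)    # 82.8
# The same pipeline accidentally set to hourly
hourly = pipeline_monthly_cost(4, 0.23, 3, runs_per_day=24)  # 1987.2
```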






&lt;h2&gt;
  
  
  6. SageMaker Savings Plans
&lt;/h2&gt;

&lt;p&gt;AWS offers &lt;strong&gt;SageMaker Savings Plans&lt;/strong&gt; — commit to a $/hour spend for 1 or 3 years in exchange for a discount:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Commitment&lt;/th&gt;
&lt;th&gt;Discount vs On-Demand&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-year, no upfront&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-year, partial upfront&lt;/td&gt;
&lt;td&gt;~27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-year, all upfront&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-year, all upfront&lt;/td&gt;
&lt;td&gt;~64%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's covered&lt;/strong&gt;: Notebook instances, Studio notebooks, training, processing, batch transform, real-time inference, and serverless inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's NOT covered&lt;/strong&gt;: Data transfer, S3 storage, EBS storage, CloudWatch, and any non-SageMaker charges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break-even&lt;/strong&gt;: If you consistently spend &amp;gt; $100/month on SageMaker compute, a 1-year no-upfront plan likely saves you money. The commitment is dollar-based (e.g., "$0.50/hour"), not instance-based — so you can shift between instance types.&lt;/p&gt;
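&lt;p&gt;The expected saving is simply discount × covered spend. A sketch, assuming the commitment fully covers your steady-state compute and the ~20% no-upfront figure holds:&lt;/p&gt;

```python
# Rough Savings Plan sizing sketch. Assumes the commitment fully covers
# steady-state SageMaker compute; discount percentages are approximate.
def savings_plan_monthly_savings(monthly_compute_spend, discount=0.20):
    """Expected monthly savings if the plan covers this much on-demand spend."""
    return round(monthly_compute_spend * discount, 2)

savings_plan_monthly_savings(1500)  # 300.0 on $1,500/month of always-on endpoints
savings_plan_monthly_savings(100)   # 20.0 -- the break-even is real even at small scale
```

&lt;p&gt;Commit to your observed baseline, not your peak: unused commitment is money spent on nothing.&lt;/p&gt;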




&lt;h2&gt;
  
  
  7. Data Transfer: The Other Hidden Cost
&lt;/h2&gt;

&lt;p&gt;SageMaker data transfer charges are identical to EC2 data transfer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 → SageMaker (same region)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker → S3 (same region)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet → SageMaker&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker → Internet&lt;/td&gt;
&lt;td&gt;$0.09/GB (first 10 TB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-region S3 → SageMaker&lt;/td&gt;
&lt;td&gt;$0.01–0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-AZ (multi-instance training)&lt;/td&gt;
&lt;td&gt;$0.01/GB each way&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: Distributed training across multiple instances in different AZs generates &lt;strong&gt;inter-AZ data transfer&lt;/strong&gt; charges for gradient synchronization. A training job with 8 &lt;code&gt;ml.p3.16xlarge&lt;/code&gt; instances exchanging 100 GB of gradients per hour across AZs can add $2/hour in data transfer alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation&lt;/strong&gt;: Use &lt;strong&gt;SageMaker's managed instance placement&lt;/strong&gt; (it tries to co-locate instances in the same AZ). For distributed training, consider &lt;strong&gt;EFA (Elastic Fabric Adapter)&lt;/strong&gt; enabled instances (&lt;code&gt;ml.p4d.24xlarge&lt;/code&gt;, &lt;code&gt;ml.trn1.32xlarge&lt;/code&gt;) — inter-node traffic over EFA is not charged.&lt;/p&gt;
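&lt;p&gt;The $2/hour figure above falls out of simple arithmetic: each GB exchanged crosses the AZ boundary once outbound and once inbound, billed on both sides. A sketch with the assumed $0.01/GB rate:&lt;/p&gt;

```python
# Inter-AZ data transfer for distributed training.
# Assumed rate: $0.01/GB in each direction (billed on both sender and receiver).
def cross_az_cost_per_hour(gb_exchanged_per_hour, rate_per_gb=0.01):
    return round(gb_exchanged_per_hour * rate_per_gb * 2, 2)

cross_az_cost_per_hour(100)  # 2.0
```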




&lt;h2&gt;
  
  
  8. The Full Bill Breakdown — A Realistic Example
&lt;/h2&gt;

&lt;p&gt;Let's walk through a realistic monthly SageMaker bill for a mid-size ML team (3 data scientists, 2 models in production):&lt;/p&gt;

&lt;h3&gt;
  
  
  Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3 Notebook instances&lt;/td&gt;
&lt;td&gt;ml.m5.large, ~160 hrs/month each&lt;/td&gt;
&lt;td&gt;$55.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notebook storage&lt;/td&gt;
&lt;td&gt;3 × 50 GB&lt;/td&gt;
&lt;td&gt;$17.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subtotal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$72.60&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Training jobs (CPU)&lt;/td&gt;
&lt;td&gt;20 jobs × ml.m5.xlarge × 2 hrs&lt;/td&gt;
&lt;td&gt;$9.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training jobs (GPU)&lt;/td&gt;
&lt;td&gt;5 jobs × ml.g5.xlarge × 4 hrs&lt;/td&gt;
&lt;td&gt;$28.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HPO tuning&lt;/td&gt;
&lt;td&gt;1 job × 50 trials × ml.g5.xlarge × 1 hr&lt;/td&gt;
&lt;td&gt;$70.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training storage&lt;/td&gt;
&lt;td&gt;20 GB per job, 25 jobs&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subtotal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$107.90&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Inference (The Biggest Line Item)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prod endpoint (Model A)&lt;/td&gt;
&lt;td&gt;2× ml.m5.xlarge × 730 hrs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$335.80&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prod endpoint (Model B)&lt;/td&gt;
&lt;td&gt;2× ml.g4dn.xlarge × 730 hrs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,074.56&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staging endpoint&lt;/td&gt;
&lt;td&gt;1× ml.m5.large × 730 hrs&lt;/td&gt;
&lt;td&gt;$83.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subtotal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,494.31&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Other
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Processing (daily ETL)&lt;/td&gt;
&lt;td&gt;2× ml.m5.xlarge × 1 hr × 30 days&lt;/td&gt;
&lt;td&gt;$13.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model artifacts in S3&lt;/td&gt;
&lt;td&gt;200 GB across all experiments&lt;/td&gt;
&lt;td&gt;$4.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data transfer (internet)&lt;/td&gt;
&lt;td&gt;50 GB model serving responses&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch (metrics)&lt;/td&gt;
&lt;td&gt;Custom endpoint metrics&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subtotal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$25.90&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Total
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Development:   $72.60    (4.3%)
Training:      $107.90   (6.3%)
Inference:     $1,494.31 (87.9%)  ← 88% of the bill
Other:         $25.90    (1.5%)
──────────────────────────────────
TOTAL:         $1,700.71/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The punchline&lt;/strong&gt;: Nearly &lt;strong&gt;88% of this team's SageMaker spend is inference endpoints&lt;/strong&gt; running 24/7. The training — which is what the team actually thinks about and optimizes — is only 6% of the bill.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The 7 Most Common SageMaker Billing Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Leaving Notebook Instances Running
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: $37–$168/month per forgotten notebook&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Use SageMaker Studio with auto-shutdown, or set a lifecycle config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Auto-stop notebook after 1 hour of inactivity&lt;/span&gt;
&lt;span class="nv"&gt;IDLE_TIME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3600
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;jupyter notebook list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;aws sagemaker stop-notebook-instance &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--notebook-instance-name&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /opt/ml/metadata/resource-name&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Not Deleting Endpoints After Experimentation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: $84–$2,792/month per forgotten endpoint&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Tag endpoints with &lt;code&gt;environment=dev&lt;/code&gt; and run a nightly cleanup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete all dev endpoints older than 3 days&lt;/span&gt;
aws sagemaker list-endpoints &lt;span class="nt"&gt;--status-equals&lt;/span&gt; InService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Endpoints[?CreationTime&amp;lt;`2026-02-18`].EndpointName'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; aws sagemaker delete-endpoint &lt;span class="nt"&gt;--endpoint-name&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Over-Provisioning Instance Types
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: 2–10× the necessary spend&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Start with the smallest instance that works. Use CloudWatch to check actual utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check CPU utilization of an endpoint&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; /aws/sagemaker/Endpoints &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUUtilization &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EndpointName,Value&lt;span class="o"&gt;=&lt;/span&gt;my-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;VariantName,Value&lt;span class="o"&gt;=&lt;/span&gt;AllTraffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; 2026-02-14T00:00:00 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; 2026-02-21T00:00:00 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'sort_by(Datapoints, &amp;amp;Timestamp)[].{Time:Timestamp,CPU:Average}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If average CPU is &amp;lt; 20%, you're likely over-provisioned. An &lt;code&gt;ml.m5.xlarge&lt;/code&gt; at 15% utilization could be an &lt;code&gt;ml.m5.large&lt;/code&gt; (50% cheaper).&lt;/p&gt;
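&lt;p&gt;A crude decision rule, assuming each instance size step down roughly halves both vCPUs and price (true within the m5 family, but verify for your instance type):&lt;/p&gt;

```python
# Crude right-sizing check: if average CPU is far below 50%, one size down
# likely fits. Assumes the next size down costs ~50% of the current rate.
def downsize_candidate(avg_cpu_pct, threshold=20.0):
    """True if average CPU suggests the endpoint could drop an instance size."""
    return threshold > avg_cpu_pct

def monthly_savings_one_size_down(hourly_rate, hours=730):
    return round(hourly_rate * 0.5 * hours, 2)

downsize_candidate(15.0)                # True
monthly_savings_one_size_down(0.23)     # 83.95 -- ml.m5.xlarge down to ml.m5.large
```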

&lt;h3&gt;
  
  
  4. Running Staging Endpoints 24/7
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: $84–$538/month for endpoints used ~8 hrs/day&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Schedule endpoint creation/deletion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scale staging endpoint to 0 instances at 7 PM, back to 1 at 8 AM
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;application_autoscaling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application-autoscaling&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Register the endpoint as a scalable target
&lt;/span&gt;&lt;span class="n"&gt;application_autoscaling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_scalable_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ResourceId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint/staging-model/variant/AllTraffic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalableDimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker:variant:DesiredInstanceCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MinCapacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MaxCapacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Scale to 0 at 7 PM UTC
&lt;/span&gt;&lt;span class="n"&gt;application_autoscaling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_scheduled_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScheduledActionName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scale-down-evening&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ResourceId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint/staging-model/variant/AllTraffic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalableDimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker:variant:DesiredInstanceCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cron(0 19 * * ? *)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalableTargetAction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MinCapacity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MaxCapacity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Not Using Spot for Training
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: 2–10× overpayment on training jobs&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Add &lt;code&gt;use_spot_instances=True&lt;/code&gt; + &lt;code&gt;checkpoint_s3_uri&lt;/code&gt; to every training estimator.&lt;/p&gt;
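&lt;p&gt;Spot savings on SageMaker training typically land in the 60–70% range, varying by instance type and region. A back-of-the-envelope comparison (the 70% discount and the ml.g5.xlarge rate are assumptions):&lt;/p&gt;

```python
# Managed Spot Training savings sketch. The discount varies by instance
# and region; checkpointing to S3 makes interruptions survivable.
def spot_training_cost(on_demand_rate, hours, spot_discount=0.70):
    """Return (on_demand_cost, spot_cost) for a training job."""
    on_demand = on_demand_rate * hours
    spot = on_demand * (1 - spot_discount)
    return round(on_demand, 2), round(spot, 2)

spot_training_cost(1.408, 4)  # (5.63, 1.69) for a 4-hour ml.g5.xlarge job
```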
&lt;h3&gt;
  
  
  6. Ignoring Multi-Model Endpoints for Low-Traffic Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: $84+/month per model × N models&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Consolidate into a single MME. Works well for models with &amp;lt; 100 requests/hour.&lt;/p&gt;
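&lt;p&gt;The consolidation math is straightforward: N dedicated endpoints versus one (possibly larger) shared instance. A sketch with assumed per-endpoint costs:&lt;/p&gt;

```python
# Multi-model endpoint consolidation savings.
# Per-endpoint costs below are illustrative (ml.m5.large ~$83.95/mo assumed).
def mme_monthly_savings(n_models, per_endpoint_monthly, shared_endpoint_monthly):
    dedicated = n_models * per_endpoint_monthly
    return round(dedicated - shared_endpoint_monthly, 2)

# 10 dedicated ml.m5.large endpoints vs one shared ml.m5.xlarge
mme_monthly_savings(10, 83.95, 167.90)  # 671.6
```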
&lt;h3&gt;
  
  
  7. No SageMaker Savings Plan
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: 20–64% overpayment on steady-state compute&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Analyze 30 days of SageMaker usage → commit to a 1-year no-upfront Savings Plan for your baseline spend.&lt;/p&gt;


&lt;h2&gt;
  
  
  10. Quick-Reference Billing Cheat Sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Billing Model&lt;/th&gt;
&lt;th&gt;Minimum Charge&lt;/th&gt;
&lt;th&gt;Always On?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Notebook Instance&lt;/td&gt;
&lt;td&gt;Per-second (InService)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;Yes, until stopped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Studio Notebook&lt;/td&gt;
&lt;td&gt;Per-second (running kernel)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;No (auto-shutdown capable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training Job&lt;/td&gt;
&lt;td&gt;Per-second (job duration)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;No (job-scoped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Job&lt;/td&gt;
&lt;td&gt;Per-second (job duration)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;No (job-scoped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-Time Endpoint&lt;/td&gt;
&lt;td&gt;Per-second (InService)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes, 24/7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless Endpoint&lt;/td&gt;
&lt;td&gt;Per-request + memory-second&lt;/td&gt;
&lt;td&gt;None (pay per use)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async Endpoint&lt;/td&gt;
&lt;td&gt;Per-second (InService)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;Yes (but can scale to 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Transform&lt;/td&gt;
&lt;td&gt;Per-second (job duration)&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;No (job-scoped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Store&lt;/td&gt;
&lt;td&gt;Per-read/write + storage&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Depends on store type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EBS Storage&lt;/td&gt;
&lt;td&gt;Per-GB-month&lt;/td&gt;
&lt;td&gt;$0.116/GB-month&lt;/td&gt;
&lt;td&gt;Yes, even when stopped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 Artifacts&lt;/td&gt;
&lt;td&gt;Per-GB-month&lt;/td&gt;
&lt;td&gt;$0.023/GB-month&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transfer Out&lt;/td&gt;
&lt;td&gt;Per-GB&lt;/td&gt;
&lt;td&gt;$0.09/GB&lt;/td&gt;
&lt;td&gt;Only on egress&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  11. The One Metric That Matters Most
&lt;/h2&gt;

&lt;p&gt;If you track only &lt;strong&gt;one metric&lt;/strong&gt; for SageMaker cost efficiency, track this:&lt;/p&gt;

&lt;p&gt;$$\text{Cost per 1K Invocations} = \frac{\text{Monthly Endpoint Cost}}{\text{Monthly Invocations} \div 1000}$$&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endpoint: 2× &lt;code&gt;ml.m5.xlarge&lt;/code&gt; = $335.80/month&lt;/li&gt;
&lt;li&gt;Invocations: 500,000/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;$$\frac{\$335.80}{500} = \$0.67 \text{ per 1K invocations}$$&lt;/p&gt;

&lt;p&gt;If that number is above $1.00 — you're likely over-provisioned or should consider serverless inference.&lt;/p&gt;

&lt;p&gt;If it's above $5.00 — you either have very low traffic (delete the endpoint at night) or you're burning money on GPU instances that aren't needed.&lt;/p&gt;

&lt;p&gt;If it's above $20.00 — the endpoint is effectively idle. Delete it.&lt;/p&gt;
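&lt;p&gt;The metric is trivial to compute from your bill and CloudWatch invocation counts:&lt;/p&gt;

```python
# Cost per 1K invocations, the single most useful endpoint-efficiency metric.
def cost_per_1k_invocations(monthly_endpoint_cost, monthly_invocations):
    return round(monthly_endpoint_cost / (monthly_invocations / 1000), 2)

# The example above: 2x ml.m5.xlarge at $335.80/month, 500K invocations
cost_per_1k_invocations(335.80, 500_000)  # 0.67
```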


&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;SageMaker billing boils down to three rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Endpoints are the #1 cost driver&lt;/strong&gt; — they run 24/7. Everything else is transient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If it's &lt;code&gt;InService&lt;/code&gt;, you're paying&lt;/strong&gt; — notebooks, endpoints, anything with that status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The bill you expect (training) is rarely the bill you get (inference)&lt;/strong&gt; — teams optimize training time but ignore endpoint sprawl.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The engineers who keep SageMaker costs under control aren't the ones who pick the cheapest instance type. They're the ones who have a &lt;strong&gt;process for deleting things they're not using&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The best SageMaker cost optimization is a cron job&lt;/span&gt;
&lt;span class="c"&gt;# Run weekly: find and report idle SageMaker resources&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Idle Notebook Instances ==="&lt;/span&gt;
aws sagemaker list-notebook-instances &lt;span class="nt"&gt;--status-equals&lt;/span&gt; InService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NotebookInstances[].NotebookInstanceName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Idle Endpoints (0 invocations, 7d) ==="&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;ep &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;aws sagemaker list-endpoints &lt;span class="nt"&gt;--status-equals&lt;/span&gt; InService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Endpoints[].EndpointName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;inv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/SageMaker &lt;span class="nt"&gt;--metric-name&lt;/span&gt; Invocations &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EndpointName,Value&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ep&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;VariantName,Value&lt;span class="o"&gt;=&lt;/span&gt;AllTraffic &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--period&lt;/span&gt; 604800 &lt;span class="nt"&gt;--statistics&lt;/span&gt; Sum &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Datapoints[0].Sum'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$inv&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"None"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$inv&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  ⚠️  &lt;/span&gt;&lt;span class="nv"&gt;$ep&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Building something to automate this? We built &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;CloudWise&lt;/a&gt; to automatically detect idle SageMaker notebooks, endpoints, and oversized instances across all your AWS accounts — including air-gapped environments with no internet access. It's one of 90+ waste detectors that scan your infrastructure so you don't have to run scripts manually.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Found this useful?&lt;/strong&gt; Drop a 🔖 bookmark — this is the reference I wish I had when I first got a surprise SageMaker bill.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How EC2 + EBS Actually Bills: A Breakdown for Engineers</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 19 Feb 2026 15:18:29 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/how-ec2-ebs-actually-bills-a-breakdown-for-engineers-2al2</link>
      <guid>https://dev.to/cloudwiseteam/how-ec2-ebs-actually-bills-a-breakdown-for-engineers-2al2</guid>
      <description>&lt;h1&gt;
  
  
  The "Stopped Instance" Trap
&lt;/h1&gt;

&lt;p&gt;Every AWS engineer has done it. You spin up an EC2 instance for a quick test, run your script, and then "Stop" the instance thinking you've stopped the bleeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You haven't.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the &lt;em&gt;Compute&lt;/em&gt; meter has stopped spinning, the &lt;em&gt;Storage&lt;/em&gt; meter is still running at full speed. And if you're using high-performance storage or have elastic IPs attached, you might be bleeding cash without realizing it.&lt;/p&gt;

&lt;p&gt;In this post, I'm going to break down exactly how an EC2 instance is billed, component by component, so you can stop leaking money on "zombie" resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Compute Layer (EC2)
&lt;/h2&gt;

&lt;p&gt;This is the part everyone understands. When the instance is &lt;code&gt;Running&lt;/code&gt;, you pay. When it's &lt;code&gt;Stopped&lt;/code&gt; or &lt;code&gt;Terminated&lt;/code&gt;, you don't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;On-Demand:&lt;/strong&gt; You pay by the second (minimum 60 seconds).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Spot:&lt;/strong&gt; You pay the market price, which fluctuates.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Savings Plans/RIs:&lt;/strong&gt; You commit to usage in exchange for a discount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Gotcha:&lt;/strong&gt; If you use a "Hibernate" stop instead of a regular stop, you are still paying for the RAM state stored on disk (more on that below).&lt;/p&gt;
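&lt;p&gt;The per-second model with its 60-second floor can be sketched in a few lines (using the m5.large rate from below as the assumed example):&lt;/p&gt;

```python
# On-Demand Linux billing: per-second, with a 60-second minimum per run.
def billed_seconds(runtime_seconds):
    return max(60, runtime_seconds)

def on_demand_cost(hourly_rate, runtime_seconds):
    return round(hourly_rate * billed_seconds(runtime_seconds) / 3600, 4)

on_demand_cost(0.096, 45)   # 0.0016 -- a 45-second run bills as 60 seconds
on_demand_cost(0.096, 300)  # 0.008  -- five minutes bills as five minutes
```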

&lt;h2&gt;
  
  
  2. The Storage Layer (EBS) - The Silent Killer
&lt;/h2&gt;

&lt;p&gt;This is where 90% of "phantom costs" come from.&lt;/p&gt;

&lt;p&gt;When you launch an EC2 instance, it almost always comes with an EBS volume (the root drive). &lt;strong&gt;This volume exists independently of the instance.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Scenario:&lt;/strong&gt; You launch an &lt;code&gt;m5.large&lt;/code&gt; with a 100GB &lt;code&gt;gp3&lt;/code&gt; volume.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Action:&lt;/strong&gt; You stop the instance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Result:&lt;/strong&gt; You stop paying for the &lt;code&gt;m5.large&lt;/code&gt; ($0.096/hr), but you &lt;strong&gt;continue paying&lt;/strong&gt; for the 100GB &lt;code&gt;gp3&lt;/code&gt; volume ($0.08/GB/month).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have 100 "stopped" dev instances sitting around, that's 10TB of storage you're paying for every month. That's ~$800/month for literally nothing.&lt;/p&gt;
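&lt;p&gt;The fleet math, using the gp3 rate above:&lt;/p&gt;

```python
# Storage keeps billing while instances are stopped (gp3 at $0.08/GB-month).
def stopped_fleet_storage_cost(n_instances, gb_per_volume, rate_per_gb=0.08):
    return round(n_instances * gb_per_volume * rate_per_gb, 2)

stopped_fleet_storage_cost(100, 100)  # 800.0 per month, for zero compute
```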

&lt;h3&gt;
  
  
  The "IOPS" Trap
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;gp3&lt;/code&gt; and &lt;code&gt;io2&lt;/code&gt; volumes, you can provision extra IOPS and Throughput. These are billed &lt;strong&gt;separately&lt;/strong&gt; from the storage capacity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Storage:&lt;/strong&gt; $0.08/GB-month&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IOPS:&lt;/strong&gt; $0.005/provisioned IOPS-month (above 3,000)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Throughput:&lt;/strong&gt; $0.04/provisioned MB/s-month (above 125)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you provision 10,000 IOPS for a database test and then stop the instance, &lt;strong&gt;you are still paying for those 10,000 IOPS&lt;/strong&gt; even though the volume is doing zero reads/writes.&lt;/p&gt;
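&lt;p&gt;Putting the three gp3 line items together (rates as listed above; the free baselines are 3,000 IOPS and 125 MB/s):&lt;/p&gt;

```python
# gp3 bills capacity, provisioned IOPS above baseline, and provisioned
# throughput above baseline separately -- all of it keeps accruing on
# stopped instances.
def gp3_monthly_cost(size_gb, iops=3000, throughput_mbps=125):
    cost = size_gb * 0.08
    cost += max(0, iops - 3000) * 0.005
    cost += max(0, throughput_mbps - 125) * 0.04
    return round(cost, 2)

gp3_monthly_cost(100)               # 8.0  -- capacity only
gp3_monthly_cost(100, iops=10_000)  # 43.0 -- the extra 7,000 IOPS cost $35/month
```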

&lt;h2&gt;
  
  
  3. The Network Layer (Data Transfer &amp;amp; IPs)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Elastic IPs (EIPs)
&lt;/h3&gt;

&lt;p&gt;This is a classic AWS "tax."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Attached to Running Instance:&lt;/strong&gt; $0.005/hour. (Since February 2024, AWS charges for &lt;em&gt;all&lt;/em&gt; public IPv4 addresses, in use or not.)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Attached to Stopped Instance:&lt;/strong&gt; &lt;strong&gt;$0.005/hour.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Unattached:&lt;/strong&gt; &lt;strong&gt;$0.005/hour.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's roughly $3.65/month per address. An EIP parked on a stopped instance buys you nothing except a hold on a scarce IPv4 address — release it if you don't need the static IP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Transfer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Inbound:&lt;/strong&gt; Free.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Outbound (Internet):&lt;/strong&gt; Expensive (~$0.09/GB).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cross-AZ:&lt;/strong&gt; If your EC2 instance talks to an RDS database in a different Availability Zone, you pay &lt;strong&gt;$0.01/GB&lt;/strong&gt; in &lt;em&gt;each direction&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. The "Zombie" Snapshot
&lt;/h2&gt;

&lt;p&gt;When you terminate an instance, the root volume usually deletes with it (if "Delete on Termination" is checked). But any &lt;strong&gt;manual snapshots&lt;/strong&gt; you took of that volume remain.&lt;/p&gt;

&lt;p&gt;I've seen accounts with terabytes of snapshots from 2018 for instances that haven't existed in 5 years. At $0.05/GB-month, that adds up fast.&lt;/p&gt;
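
&lt;p&gt;Finding these zombies is a one-pager. The matching logic below is pure Python; the Boto3 wiring (sketched in the comments, assuming default credentials) just feeds it &lt;code&gt;describe_snapshots&lt;/code&gt; and &lt;code&gt;describe_volumes&lt;/code&gt; output:&lt;/p&gt;

```python
def orphaned_snapshots(snapshots, existing_volume_ids):
    """Return snapshots whose source volume no longer exists."""
    live = set(existing_volume_ids)
    return [s for s in snapshots if s.get("VolumeId") not in live]

# Boto3 wiring (assumed installed and credentialed):
#   ec2 = boto3.client("ec2")
#   snaps = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
#   vols = [v["VolumeId"] for v in ec2.describe_volumes()["Volumes"]]
#   for s in orphaned_snapshots(snaps, vols):
#       print(s["SnapshotId"], s["VolumeSize"] * 0.05)  # ~$/month at $0.05/GB
```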

&lt;h2&gt;
  
  
  The Solution: A "Clean" Shutdown Workflow
&lt;/h2&gt;

&lt;p&gt;Don't just click "Stop." If you're done with an instance for the day (or week), follow this checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Check for EIPs:&lt;/strong&gt; Release them if you don't need the static IP.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Snapshot &amp;amp; Delete:&lt;/strong&gt; If you need the data but not the compute, take a snapshot of the volume and &lt;strong&gt;delete the volume itself&lt;/strong&gt;. Snapshots are cheaper ($0.05/GB) than active volumes ($0.08/GB).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tagging:&lt;/strong&gt; Tag everything with &lt;code&gt;Owner&lt;/code&gt; and &lt;code&gt;ExpiryDate&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automation:&lt;/strong&gt; Use a tool (like CloudWise or a simple Lambda) to scan for "Available" volumes (volumes not attached to any instance) and delete them after 7 days.&lt;/li&gt;
&lt;/ol&gt;
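
&lt;p&gt;Step 4 is only a few lines of code. A minimal sketch of the age check (pure logic, so it's testable; the Boto3 calls are commented out, and &lt;code&gt;DryRun=True&lt;/code&gt; keeps the delete safe until you trust it):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def is_stale(volume, days=7, now=None):
    """True if an unattached ('available') EBS volume is older than `days`.

    CreateTime is a proxy for idle time; track detach events via CloudTrail
    if you need precision.
    """
    now = now or datetime.now(timezone.utc)
    return (volume["State"] == "available"
            and now - volume["CreateTime"] > timedelta(days=days))

# Boto3 wiring (assumed installed and credentialed):
#   ec2 = boto3.client("ec2")
#   vols = ec2.describe_volumes(
#       Filters=[{"Name": "status", "Values": ["available"]}])["Volumes"]
#   for v in vols:
#       if is_stale(v):
#           ec2.delete_volume(VolumeId=v["VolumeId"], DryRun=True)  # flip when trusted
```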

&lt;h2&gt;
  
  
  Summary Checklist
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Billed When Running?&lt;/th&gt;
&lt;th&gt;Billed When Stopped?&lt;/th&gt;
&lt;th&gt;Billed When Terminated?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EC2 Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EBS Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;YES&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;❌ No (if deleted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EBS IOPS/Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;YES&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;❌ No (if deleted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic IP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes (public IPv4 charge since Feb 2024)&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;YES&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;YES&lt;/strong&gt; (if not released)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Stop paying for air. Check your "Volumes" tab today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Rick, building &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;CloudWise&lt;/a&gt; to automate this cleanup for you. I write about AWS cost optimization and DevOps every week.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
      <category>finops</category>
    </item>
    <item>
      <title>How I Built an "Agentic" AWS Cost Optimizer (That Doesn't Break Production)</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Tue, 17 Feb 2026 19:12:08 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/how-i-built-an-agentic-aws-cost-optimizer-that-doesnt-break-production-d77</link>
      <guid>https://dev.to/cloudwiseteam/how-i-built-an-agentic-aws-cost-optimizer-that-doesnt-break-production-d77</guid>
<description>&lt;p&gt;I’ve spent 25 years in the software industry, and I’ve learned one universal truth: &lt;strong&gt;Engineers are terrified of deleting things.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all have that one EC2 instance named &lt;code&gt;test-do-not-delete-final&lt;/code&gt; that has been running for 3 years. We know it’s probably waste. The dashboard says it’s waste. But nobody deletes it. Why?&lt;/p&gt;

&lt;p&gt;Because the &lt;strong&gt;risk&lt;/strong&gt; of breaking production feels infinite, and the &lt;strong&gt;reward&lt;/strong&gt; of saving $50/month feels like zero.&lt;/p&gt;

&lt;p&gt;This is the "Fear Tax." And it’s why most FinOps tools fail. They give you a list of 1,000 "optimization opportunities," and you ignore them all because you don't have time to manually verify safety for each one.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;CloudWise Agentic Tier&lt;/strong&gt; to solve this. It’s an agent that doesn't just &lt;em&gt;find&lt;/em&gt; waste—it safely &lt;em&gt;removes&lt;/em&gt; it after explicit approval, with a rollback guarantee.&lt;/p&gt;

&lt;p&gt;Here is the technical deep dive on how I built the safety architecture using Python, Boto3, and Cross-Account IAM Roles.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: "Safety First"
&lt;/h2&gt;

&lt;p&gt;The core design philosophy is &lt;strong&gt;Reversibility&lt;/strong&gt;. Every destructive action must be reversible. If it can't be undone, the agent isn't allowed to touch it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Scan:&lt;/strong&gt; Identify idle resources (e.g., EBS volumes unattached &amp;gt; 7 days).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pre-Check:&lt;/strong&gt; Run read-only calls to verify the resource state and resolve dependencies.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Snapshot:&lt;/strong&gt; Take a final backup (e.g., &lt;code&gt;CreateSnapshot&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dry Run:&lt;/strong&gt; Simulate the deletion to check for IAM permissions and dependencies.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execute:&lt;/strong&gt; Perform the destructive action.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rollback (Optional):&lt;/strong&gt; If anything breaks, one-click restore.&lt;/li&gt;
&lt;/ol&gt;
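
&lt;p&gt;Step 4 deserves a footnote: EC2 dry runs &lt;em&gt;always&lt;/em&gt; raise an exception, and the error code is the actual answer. A sketch of how the agent interprets it (the verdict labels are my own; the error codes are what EC2 returns):&lt;/p&gt;

```python
def dry_run_verdict(error_code):
    """Interpret the ClientError code raised by an EC2 call made with DryRun=True."""
    if error_code == "DryRunOperation":
        return "would-succeed"       # permissions and resource state are fine
    if error_code == "UnauthorizedOperation":
        return "missing-permission"  # the role can't do this for real either
    return "blocked"                 # dependency violation, bad ID, etc.

# Usage (boto3 assumed):
#   try:
#       ec2.delete_volume(VolumeId=volume_id, DryRun=True)
#   except ClientError as e:
#       verdict = dry_run_verdict(e.response["Error"]["Code"])
```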

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmulzk4ss97amaz7gutnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmulzk4ss97amaz7gutnl.png" alt="CloudWise Agentic Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Secret Sauce": Pre-Checks &amp;amp; Placeholders
&lt;/h2&gt;

&lt;p&gt;Most tools just run &lt;code&gt;boto3.client('ec2').delete_volume()&lt;/code&gt;. That’s dangerous.&lt;/p&gt;

&lt;p&gt;My agent uses a &lt;strong&gt;Pre-Check Phase&lt;/strong&gt; to verify the resource state &lt;em&gt;before&lt;/em&gt; generating the execution plan. It also resolves dynamic placeholders.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Pre-Check Logic
&lt;/h3&gt;

&lt;p&gt;Before we even think about deleting, we run a read-only probe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_execute_pre_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre_checks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Run read-only API calls to verify resource state.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pre_checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# e.g. service="ec2", action="describe_volumes", params={"VolumeIds": ["vol-123"]}
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Dynamic Placeholder Resolution
&lt;/h3&gt;

&lt;p&gt;The planner doesn't always know the ID of the snapshot it &lt;em&gt;will&lt;/em&gt; create. So I implemented a placeholder system.&lt;/p&gt;

&lt;p&gt;The plan might look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;ec2:CreateSnapshot&lt;/code&gt; (Target: &lt;code&gt;vol-123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt; &lt;code&gt;ec2:DeleteVolume&lt;/code&gt; (Target: &lt;code&gt;vol-123&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But if we need to restore, we need the &lt;strong&gt;Snapshot ID&lt;/strong&gt; that hasn't been created yet.&lt;/p&gt;

&lt;p&gt;The system captures the output of Step 1 and injects it into the Rollback Plan using a placeholder like &lt;code&gt;SNAPSHOT_ID_FROM_PRECHECK&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolve_placeholders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre_check_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Resolve dynamic placeholders like VOLUME_ID_FROM_PRECHECK
    using data from the pre-check phase.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lookup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_build_precheck_lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pre_check_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resolved_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;api_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Replace placeholders with actual values
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;params_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params_str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resolved_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resolved_calls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Security: The "2-Hop" IAM Chain
&lt;/h2&gt;

&lt;p&gt;Security is the biggest blocker for SaaS tools. I use a &lt;strong&gt;2-Hop IAM Architecture&lt;/strong&gt; to ensure strict isolation.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Hop 1 (Service Role):&lt;/strong&gt; The Lambda function assumes a &lt;code&gt;CloudWiseServiceRole&lt;/code&gt; in my account. This acts as a bastion.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Hop 2 (Customer Role):&lt;/strong&gt; The Service Role assumes the &lt;code&gt;CloudWiseRemediationRole&lt;/code&gt; in the &lt;em&gt;customer's&lt;/em&gt; account.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why 2 Hops?
&lt;/h3&gt;

&lt;p&gt;It allows me to rotate the internal Lambda roles without asking 100 customers to update their Trust Policies. The customer only trusts &lt;strong&gt;one&lt;/strong&gt; static Service Role ARN.&lt;/p&gt;
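
&lt;p&gt;The two hops are just chained &lt;code&gt;sts:AssumeRole&lt;/code&gt; calls, the second one made with the first hop's temporary credentials. A sketch (the role ARNs and &lt;code&gt;ExternalId&lt;/code&gt; are placeholders):&lt;/p&gt;

```python
def session_kwargs(assume_role_response):
    """Map an STS AssumeRole response onto boto3.Session keyword arguments."""
    creds = assume_role_response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }

# The chain itself (boto3 assumed; ARNs are placeholders):
#   hop1 = boto3.client("sts").assume_role(
#       RoleArn="arn:aws:iam::MY_PROD_ACCOUNT_ID:role/CloudWiseServiceRole",
#       RoleSessionName="cloudwise-hop1")
#   bastion = boto3.Session(**session_kwargs(hop1))
#   hop2 = bastion.client("sts").assume_role(
#       RoleArn="arn:aws:iam::CUSTOMER_ACCOUNT_ID:role/CloudWiseRemediationRole",
#       RoleSessionName="cloudwise-hop2",
#       ExternalId="CUSTOMER_UNIQUE_ID")
#   customer = boto3.Session(**session_kwargs(hop2))
```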

&lt;h3&gt;
  
  
  The Customer Trust Policy
&lt;/h3&gt;

&lt;p&gt;This is the only thing the customer installs. It trusts &lt;strong&gt;my AWS Account&lt;/strong&gt;, not a specific user, but enforces an &lt;code&gt;ExternalId&lt;/code&gt; to prevent "Confused Deputy" attacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::MY_PROD_ACCOUNT_ID:root"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUSTOMER_UNIQUE_ID"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Trusting the account &lt;code&gt;root&lt;/code&gt; principal lets me rotate the specific internal role that performs the AssumeRole on my side, without the customer ever updating their trust policy.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Handling Edge Cases (The "In The Trenches" Stuff)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CloudFront is Weird
&lt;/h3&gt;

&lt;p&gt;You can't just update a CloudFront distribution. You need the current &lt;code&gt;ETag&lt;/code&gt; (version ID) to prove you aren't overwriting someone else's changes.&lt;/p&gt;

&lt;p&gt;My agent handles this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_prepare_cloudfront_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Fetch current config to get the ETag
&lt;/span&gt;    &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_distribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;etag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ETag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Merge our changes
&lt;/span&gt;    &lt;span class="n"&gt;current_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DistributionConfig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DistributionConfig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Return the payload with the ETag
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DistributionConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IfMatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;etag&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The "Agentic" Future
&lt;/h2&gt;

&lt;p&gt;The term "Agentic" is getting thrown around a lot, but in infrastructure, it has a specific meaning to me: &lt;strong&gt;Software that does the work, not just the analysis.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For FinOps to mature, we have to stop treating "Cost Optimization" as a homework assignment for engineers. It should be a garbage collection process that runs in the background—safe, reversible, and automated.&lt;/p&gt;

&lt;p&gt;If you want to see this in action (or critique my code/architecture), I’m building this in public. You can check out the live tool at &lt;a href="https://cloudcostwise.io" rel="noopener noreferrer"&gt;cloudcostwise.io&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I’m Rick, a solo founder building CloudWise. I write about AWS, Python, and the psychology of engineering.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>FinOps Implementation: A Roadmap for Cost Monitoring in 2026</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 12 Feb 2026 14:34:01 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/finops-implementation-a-roadmap-for-cost-monitoring-in-2026-4kkf</link>
      <guid>https://dev.to/cloudwiseteam/finops-implementation-a-roadmap-for-cost-monitoring-in-2026-4kkf</guid>
      <description>&lt;p&gt;As the founder of CloudWise, a focused AWS cost optimization platform, I have navigated the complexities of AWS cost management firsthand. With a commitment to helping businesses tackle their financial inefficiencies, I’ve learned that effective FinOps (Financial Operations) implementation is crucial for sustainable cloud cost management. In this article, I’ll outline a practical roadmap for cost monitoring that companies can adopt in 2026, leveraging insights gained from analyzing real AWS spending data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding FinOps
&lt;/h2&gt;

&lt;p&gt;FinOps is a cultural practice that combines finance, technology, and business to manage cloud costs effectively. It encourages collaboration between teams to ensure that spending aligns with business objectives. The rise of cloud services has made FinOps increasingly relevant as organizations migrate to the cloud and face unpredictable costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Current Landscape of AWS Costs
&lt;/h3&gt;

&lt;p&gt;AWS spending patterns reveal some striking insights. For instance, a recent analysis of hundreds of AWS accounts showed that up to &lt;strong&gt;30%&lt;/strong&gt; of cloud spending is wasted on underutilized or idle resources. This waste can stem from a lack of visibility into resource usage, poor budgeting practices, and inadequate monitoring of spending trends. &lt;/p&gt;

&lt;p&gt;In 2026, as cloud adoption continues to grow, organizations will need to refine their FinOps practices to address these challenges. Here’s a roadmap based on real-world data and experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Establish a FinOps Culture
&lt;/h2&gt;

&lt;p&gt;Creating a FinOps culture begins with aligning teams around the shared goal of cost efficiency. This involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Education and Training&lt;/strong&gt;: Ensure that finance, engineering, and product teams understand AWS billing and how decisions impact costs. Regular workshops can help demystify AWS pricing models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Functional Collaboration&lt;/strong&gt;: Foster open communication between teams. Utilize tools that promote transparency and accessibility to cost data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Insight from CloudWise
&lt;/h3&gt;

&lt;p&gt;From our experience, organizations that prioritize a FinOps culture see a &lt;strong&gt;15-20% reduction&lt;/strong&gt; in cloud costs within the first year, driven by better resource allocation and more informed decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Implement Robust Cost Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;To effectively manage AWS costs, it’s critical to have robust cost monitoring tools in place. At CloudWise, we focus on the following capabilities:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AWS Cost Analysis
&lt;/h3&gt;

&lt;p&gt;Utilizing real-time data for AWS cost analysis allows teams to identify spending trends and anomalies quickly. By analyzing historical data, organizations can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand which services account for the majority of costs.&lt;/li&gt;
&lt;li&gt;Identify usage spikes that could indicate potential overspending.&lt;/li&gt;
&lt;/ul&gt;
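
&lt;p&gt;The Cost Explorer API makes the first point concrete. Ranking services by spend from a &lt;code&gt;GetCostAndUsage&lt;/code&gt; response (the parsing is pure Python; the API call is sketched in the comments and assumes Boto3 plus &lt;code&gt;ce:GetCostAndUsage&lt;/code&gt; permissions):&lt;/p&gt;

```python
def top_services(groups, n=5):
    """Rank GetCostAndUsage SERVICE groups by unblended cost, descending."""
    costs = [(g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"]))
             for g in groups]
    return sorted(costs, key=lambda kv: kv[1], reverse=True)[:n]

# Fetching the groups (boto3 assumed; one month, grouped by service):
#   ce = boto3.client("ce")
#   resp = ce.get_cost_and_usage(
#       TimePeriod={"Start": "2026-01-01", "End": "2026-02-01"},
#       Granularity="MONTHLY",
#       Metrics=["UnblendedCost"],
#       GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}])
#   for name, amount in top_services(resp["ResultsByTime"][0]["Groups"]):
#       print(f"{name}: ${amount:,.2f}")
```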

&lt;h3&gt;
  
  
  2. Budget Alerts
&lt;/h3&gt;

&lt;p&gt;Establish budget alerts to notify teams when they approach or exceed set thresholds. This proactive approach allows for timely interventions before costs spiral out of control. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. Resource Optimization
&lt;/h3&gt;

&lt;p&gt;Regularly review resource utilization. Our analysis shows that organizations that actively optimize resources can save an average of &lt;strong&gt;20-30%&lt;/strong&gt; on their AWS bills. Key strategies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Right-Sizing Instances&lt;/strong&gt;: Continuously monitor instance types and sizes to ensure they match workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifying Idle Resources&lt;/strong&gt;: Use tools to detect underutilized resources and terminate or downscale them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example: Budget Alert Implementation
&lt;/h3&gt;

&lt;p&gt;Here’s an example of how to set up budget alerts using AWS Budgets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new budget&lt;/span&gt;
aws budgets create-budget &lt;span class="nt"&gt;--account-id&lt;/span&gt; &amp;lt;YOUR_ACCOUNT_ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--budget&lt;/span&gt; &lt;span class="s1"&gt;'{"BudgetName": "Monthly Budget", "BudgetLimit": {"Amount": "&amp;lt;YOUR_BUDGET_LIMIT&amp;gt;", "Unit": "USD"}, "BudgetType": "COST"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--notifications-with-subscribers&lt;/span&gt; &lt;span class="s1"&gt;'[{"Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN", "Threshold": 80}, "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "&amp;lt;YOUR_EMAIL&amp;gt;"}]}]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script creates a budget that alerts you when actual spending exceeds 80% of your set limit, helping you maintain control over your costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Continuous Improvement and Optimization
&lt;/h2&gt;

&lt;p&gt;FinOps is not a one-time effort; it’s a continuous process. In 2026, organizations will need to adopt a mindset of ongoing optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regular Reviews&lt;/strong&gt;: Schedule quarterly reviews of cloud spending and resource utilization. Use analytics to inform future decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly Detection&lt;/strong&gt;: Implement tools that utilize machine learning to identify unusual spending patterns. This can help catch unexpected costs before they become significant issues.&lt;/li&gt;
&lt;/ul&gt;
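
&lt;p&gt;You don't need machine learning to start. A rolling mean/standard-deviation check over daily spend catches the obvious spikes (a deliberately simple sketch; real anomaly detectors use smarter baselines and handle seasonality):&lt;/p&gt;

```python
from statistics import mean, stdev

def spend_anomalies(daily_costs, window=7, sigmas=3.0):
    """Indices of days whose cost exceeds mean + sigmas*stdev of the prior window."""
    flagged = []
    for i in range(window, len(daily_costs)):
        prior = daily_costs[i - window:i]
        if daily_costs[i] > mean(prior) + sigmas * stdev(prior):
            flagged.append(i)
    return flagged

# A flat ~$100/day baseline with a spike on the last day:
print(spend_anomalies([100, 101, 99, 100, 102, 98, 100, 101, 99, 400]))  # [9]
```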

&lt;h3&gt;
  
  
  Real Data Insight
&lt;/h3&gt;

&lt;p&gt;From data analysis, we found that companies leveraging anomaly detection tools could reduce unexpected costs by &lt;strong&gt;up to 40%&lt;/strong&gt;. These tools help identify spending spikes that may be overlooked during manual reviews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Leverage Advanced Analytics
&lt;/h2&gt;

&lt;p&gt;As businesses scale, the complexity of cost management increases. In 2026, advanced analytics will play a crucial role in FinOps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictive Analytics&lt;/strong&gt;: Use historical data to predict future spending patterns and make informed budgeting decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Allocation Tags&lt;/strong&gt;: Implement tagging strategies to allocate costs accurately across departments, projects, or teams. This promotes accountability and encourages responsible spending.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example: Tagging Strategy
&lt;/h3&gt;

&lt;p&gt;To implement a tagging strategy, you can use the AWS CLI to apply tags to your resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tagging an EC2 instance&lt;/span&gt;
aws ec2 create-tags &lt;span class="nt"&gt;--resources&lt;/span&gt; &amp;lt;INSTANCE_ID&amp;gt; &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="nv"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Department,Value&lt;span class="o"&gt;=&lt;/span&gt;Marketing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command tags your EC2 instance with the department responsible for its cost, enabling better tracking and accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing a successful FinOps strategy requires commitment, the right tools, and a culture of collaboration and accountability. As I reflect on the journey of building CloudWise, I’ve seen how organizations that embrace these principles can significantly reduce their AWS costs and optimize their cloud spending.&lt;/p&gt;

&lt;p&gt;By following this roadmap, businesses can navigate the complexities of cloud cost management in 2026 and beyond. At CloudWise, we remain dedicated to providing insights and tools to help organizations tackle their AWS cost challenges effectively. &lt;/p&gt;

&lt;p&gt;Let's work together to make cloud spending more transparent and manageable, ensuring that your business thrives in the cloud economy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise provides instant AWS cost insights. Check it out at cloudcostwise.io&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>EC2 Spot vs. Reserved: Which Saved Us $5,000 Last Quarter?</title>
      <dc:creator>Rick Wise</dc:creator>
      <pubDate>Thu, 05 Feb 2026 15:44:51 +0000</pubDate>
      <link>https://dev.to/cloudwiseteam/ec2-spot-vs-reserved-which-saved-us-5000-last-quarter-481b</link>
      <guid>https://dev.to/cloudwiseteam/ec2-spot-vs-reserved-which-saved-us-5000-last-quarter-481b</guid>
      <description>&lt;p&gt;As the founder of CloudWise, an AWS cost optimization platform, I’ve spent countless hours analyzing AWS spending patterns—not just for our clients but also for our own infrastructure. I wanted to share insights from our latest findings that saved us $5,000 last quarter by optimizing our EC2 instances, specifically by comparing Spot Instances and Reserved Instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding EC2 Pricing Models
&lt;/h2&gt;

&lt;p&gt;AWS offers several pricing models for EC2 instances, but the two most discussed are Spot Instances and Reserved Instances. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spot Instances&lt;/strong&gt; let you use spare EC2 capacity at discounts of up to 90% off the On-Demand price. (There is no longer a bidding process; you pay the current Spot price, optionally capped by a maximum price you set.) They are incredibly cost-effective, but AWS can reclaim them with a two-minute interruption notice when it needs the capacity back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserved Instances&lt;/strong&gt;, on the other hand, require a commitment to a one- or three-year term. In exchange, they provide a significant discount over On-Demand prices — up to roughly 72% at the high end.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
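&lt;p&gt;To make the trade-off concrete, here is a back-of-envelope quarter-long comparison. The hourly price and discount rates below are illustrative assumptions, not AWS quotes:&lt;/p&gt;

```shell
# Illustrative numbers only -- check current AWS pricing for real figures
OD=0.096          # assumed On-Demand $/hr for a mid-size instance
SPOT_DISCOUNT=70  # assumed average Spot discount, percent
RI_DISCOUNT=40    # assumed 1-year Reserved discount, percent
HOURS=2160        # roughly one quarter of continuous use (90 days * 24 h)

awk -v od="$OD" -v sd="$SPOT_DISCOUNT" -v rd="$RI_DISCOUNT" -v h="$HOURS" 'BEGIN {
  printf "On-Demand: $%.2f/quarter\n", od * h
  printf "Spot:      $%.2f/quarter\n", od * (1 - sd / 100) * h
  printf "Reserved:  $%.2f/quarter\n", od * (1 - rd / 100) * h
}'
```

&lt;p&gt;At these assumed rates, a single always-on instance runs about $207 per quarter On-Demand versus about $62 on Spot — the gap compounds quickly across a fleet.&lt;/p&gt;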

&lt;h3&gt;
  
  
  The Dilemma
&lt;/h3&gt;

&lt;p&gt;When I started CloudWise as a solo developer, I faced a common dilemma: how to optimize costs without sacrificing performance. As we scaled, our AWS costs grew, and I needed a strategy to curb spending while ensuring our services remained stable and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Using our own cost analysis capabilities, I dove into our AWS spending data. Here’s what I found:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workload Patterns&lt;/strong&gt;: Analyzing our usage patterns, I noticed that many of our workloads were not consistent. During peak hours, we needed reliability, but during off-peak hours, we had significant idle time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Analysis&lt;/strong&gt;: By analyzing our EC2 spending, we could see that while Reserved Instances provided savings, they didn’t align well with our fluctuating workloads. We were locking ourselves into costs for instances we didn’t always use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget Alerts&lt;/strong&gt;: Our budget alerts indicated that while we were well under budget, the savings could be improved further by employing Spot Instances for non-critical workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Approach
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: Identifying Workloads
&lt;/h4&gt;

&lt;p&gt;We categorized our workloads into two buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical Workloads&lt;/strong&gt;: These required high availability and reliability (e.g., production databases).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Critical Workloads&lt;/strong&gt;: Tasks that could be interrupted, such as batch processing jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 2: Implementing Spot Instances
&lt;/h4&gt;

&lt;p&gt;For non-critical workloads, we transitioned to Spot Instances, setting a maximum price we were willing to pay and letting capacity run whenever the Spot price stayed below it. Our analysis showed that we could save an average of 70% compared to On-Demand prices for these instances.&lt;/p&gt;
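&lt;p&gt;For reference, one way to launch Spot capacity today is &lt;code&gt;run-instances&lt;/code&gt; with market options — no bid, just an optional price cap. The AMI ID and max price here are illustrative placeholders:&lt;/p&gt;

```shell
# Launch a one-time Spot instance; MaxPrice caps what you'll pay per hour.
# ami-0123456789abcdef0 and 0.05 are placeholder values, not recommendations.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5.large \
  --instance-market-options '{"MarketType": "spot", "SpotOptions": {"MaxPrice": "0.05", "SpotInstanceType": "one-time"}}'
```

&lt;p&gt;Omitting &lt;code&gt;MaxPrice&lt;/code&gt; entirely is also valid — you then simply pay the current Spot price, capped at On-Demand.&lt;/p&gt;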

&lt;h4&gt;
  
  
  Step 3: Retaining Reserved Instances
&lt;/h4&gt;

&lt;p&gt;For our critical workloads, we maintained our Reserved Instances. This approach ensured we had the necessary capacity when needed while benefiting from the cost savings of long-term commitments.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Results
&lt;/h3&gt;

&lt;p&gt;By the end of the quarter, we analyzed our AWS spending and found that this hybrid approach—using Spot Instances for non-critical workloads and Reserved Instances for critical tasks—saved us nearly $5,000. &lt;/p&gt;

&lt;h4&gt;
  
  
  Key Metrics
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost of Spot Instances&lt;/strong&gt;: $2,000 for the quarter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost of Reserved Instances&lt;/strong&gt;: $7,000 for the quarter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Savings Compared to On-Demand&lt;/strong&gt;: $5,000&lt;/li&gt;
&lt;/ul&gt;
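&lt;p&gt;These figures can be sanity-checked in a few lines: $2,000 plus $7,000 is $9,000 of actual spend, and adding back the $5,000 saved implies the same workloads would have cost about $14,000 On-Demand — roughly a 36% reduction:&lt;/p&gt;

```shell
SPOT=2000; RESERVED=7000; SAVINGS=5000

ACTUAL=$((SPOT + RESERVED))      # what we actually paid: 9000
BASELINE=$((ACTUAL + SAVINGS))   # implied On-Demand baseline: 14000

echo "Actual spend:       \$${ACTUAL}"
echo "On-Demand baseline: \$${BASELINE}"
awk -v s="$SAVINGS" -v b="$BASELINE" 'BEGIN { printf "Savings rate: %.0f%%\n", 100 * s / b }'
```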

&lt;h3&gt;
  
  
  Challenges Faced
&lt;/h3&gt;

&lt;p&gt;Transitioning to a mixed strategy wasn’t without challenges. Here are a few hurdles we encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interruption Management&lt;/strong&gt;: Spot Instances can be terminated at any time. We had to implement a robust job queuing and retry mechanism to handle these interruptions gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Keeping track of Spot Instance pricing and availability required constant vigilance. We used our own cost analysis tools to monitor these metrics effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
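&lt;p&gt;The queuing-and-retry idea above hinges on the two-minute interruption notice that AWS publishes through instance metadata. Here is a minimal polling sketch — it only does anything on an actual EC2 Spot instance, and &lt;code&gt;drain_jobs&lt;/code&gt; is a hypothetical stand-in for your own shutdown hook:&lt;/p&gt;

```shell
# Poll IMDSv2 for the Spot interruption notice (404 until one is scheduled).
# drain_jobs is a placeholder for your own graceful-shutdown logic.
drain_jobs() { echo "re-queueing in-flight jobs before reclaim..."; }

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

while true; do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action")
  if [ "$CODE" = "200" ]; then
    drain_jobs   # roughly 2 minutes before the instance is reclaimed
    break
  fi
  sleep 5
done
```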

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility is Key&lt;/strong&gt;: The ability to adapt your infrastructure to your workload patterns can yield significant savings. It’s worth investing time in understanding your usage trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Automation&lt;/strong&gt;: Automating instance provisioning and monitoring can minimize the overhead of managing Spot Instances and help you respond quickly to changes in pricing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analyze Regularly&lt;/strong&gt;: Regularly analyzing your AWS spending data is crucial. We rely on our platform’s insights to make informed decisions, which has proven invaluable as we grow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Our journey toward optimizing AWS costs at CloudWise has been both challenging and rewarding. By leveraging a combination of Spot and Reserved Instances, we found a balance that not only saved us $5,000 last quarter but also provided the flexibility needed to scale our services effectively.&lt;/p&gt;

&lt;p&gt;If you're facing similar AWS cost challenges, I encourage you to take a data-driven approach. Analyze your spending patterns, identify your workload requirements, and don’t hesitate to mix and match instance types. The potential for savings is substantial, and you might be surprised at what you can achieve with the right insights.&lt;/p&gt;




&lt;p&gt;If you're interested in learning more about AWS cost optimization, stay tuned for future insights from our experiences at CloudWise!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CloudWise provides instant AWS cost insights. Check it out at cloudcostwise.io&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
