Muskan

Posted on Jun 18 • Originally published at zop.dev

FinOps savings decay: why commitments erode 18% by month four

#finops #platformengineering #aws #cloudgovernance

TL;DR Commitment-based cloud savings decay by 18% within four months of purchase, and that decay is not a surprise outcome. It is the predictable result of treating a purchasing decision as a finished task rather than the opening move in an ongoing governance cycle.

The Commitment Trap: When Cloud Savings Start Working Against You

Commitment-based cloud savings decay by 18% within four months of purchase, and that decay is not a surprise outcome. It is the predictable result of treating a purchasing decision as a finished task rather than the opening move in an ongoing governance cycle.

How coverage drift compounds

The mechanism is straightforward. When an engineering team commits to reserved instances or savings plans, the reservation reflects infrastructure as it exists at signing time. By month four, workloads have shifted. Services migrate. Teams grow and spin up new capacity on on-demand pricing because the committed pool no longer fits. The reservation keeps billing at the committed rate while actual coverage shrinks. The 18% erosion figure (FinOps Foundation) represents real dollars that were projected as savings but quietly reverted to on-demand cost.

Why no one catches it

We measured this pattern in production environments where the initial commitment purchase produced a clean win on the quarterly report, then disappeared from anyone's active agenda. No one tracked coverage drift. No one owned the gap between committed capacity and actual utilization. The savings eroded in the background while the team celebrated a number that was already stale.

Once the contract is signed, it leaves the backlog. The fix is to classify every active commitment as a live liability with a scheduled review date, not a closed ticket.

Closing the governance gap

Coverage drift as the primary decay driver. Infrastructure usage patterns shift continuously. A savings plan sized for a stable workload in January becomes a partial match by April because new services launched on-demand and old ones scaled down. The committed spend stays constant while the covered fraction shrinks, which is exactly how 18% of projected savings disappears.
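A minimal sketch of how the covered fraction shrinks while committed spend stays flat. The function name and the unit counts are illustrative assumptions, not figures from a billing export.

```python
def coverage_fraction(committed_units, matching_usage_units, total_usage_units):
    """Share of total usage that the commitment actually covers.

    committed_units: capacity the commitment pays for, whether used or not
    matching_usage_units: running usage the commitment can apply to
    total_usage_units: all running usage, including on-demand launches
    """
    covered = min(committed_units, matching_usage_units)
    return covered / total_usage_units


# January (hypothetical): the plan was sized for 100 units and all usage matches it.
january = coverage_fraction(100, 100, 100)

# April (hypothetical): 10 matching units were scaled down and 20 units of
# new services launched on-demand outside the commitment.
april = coverage_fraction(100, 90, 110)

print(f"January coverage: {january:.0%}")
print(f"April coverage:   {april:.0%}")
```

The commitment still bills for 100 units in both months. Only the numerator and denominator of the coverage ratio move.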

The governance gap. The 18% erosion exists because purchasing and engineering operate on different cycles. Finance renews commitments annually. Engineering ships weekly. Without a shared signal that connects deployment activity to commitment coverage, the two cycles never synchronize.

Metric	Value
Savings erosion by month four	18%
Typical commitment review cadence	Annual
Engineering deployment cadence	Weekly

The first concrete step is to set a 30-day post-purchase review for every commitment, specifically to measure whether actual instance usage still matches the reservation type and region.
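One way to sketch that 30-day check, assuming you can export reservations and running instances as simple (instance type, region) pairs. The function name and input shape are hypothetical, and the comparison deliberately ignores refinements such as instance size flexibility within a family.

```python
from collections import Counter


def review_commitment_match(reservations, running_instances):
    """Compare reserved (instance_type, region) counts with what is running.

    Both arguments are lists of (instance_type, region) tuples, one entry per
    reserved or running instance. Returns two dicts: reserved capacity with no
    matching usage, and running usage with no matching reservation.
    """
    reserved = Counter(reservations)
    running = Counter(running_instances)
    # Counter subtraction keeps only positive remainders.
    unused_reservations = dict(reserved - running)
    uncovered_usage = dict(running - reserved)
    return unused_reservations, uncovered_usage


# Hypothetical inventory 30 days after purchase.
reservations = [("m5.xlarge", "us-east-1")] * 4 + [("r5.2xlarge", "us-east-1")] * 2
running = [("m5.xlarge", "us-east-1")] * 3 + [("m5.xlarge", "eu-west-1")] * 2

unused, uncovered = review_commitment_match(reservations, running)
print("Reserved but not running:", unused)
print("Running but not reserved:", uncovered)
```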

How Savings Erode: The Three Drivers of Commitment Drift

Three distinct forces drive commitment drift, and each operates through a different causal mechanism. Understanding them separately matters because each one requires a different remediation trigger.

Workload, team, and migration drivers

Workload volatility. A savings plan purchased against a stable compute baseline becomes misaligned the moment that baseline changes. A service that ran 40 m5.xlarge instances in January may scale to 60 by March because of user growth, or collapse to 20 after a containerization effort. The committed rate keeps billing against the original 40-instance assumption. Neither direction, scale-up nor scale-down, preserves coverage efficiency.

Scale-up forces new capacity onto on-demand pricing. Scale-down leaves committed spend covering instances that no longer exist. Both outcomes erode the savings rate that justified the purchase.
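The two directions can be sketched with the 40-instance example above. This is a simplified model that counts instances only and treats every instance as the same type; the function name is an assumption for illustration.

```python
def volatility_effect(committed_instances, running_instances):
    """Split a fleet into covered, on-demand overflow, and stranded commitment."""
    return {
        "covered": min(committed_instances, running_instances),
        "on_demand_overflow": max(running_instances - committed_instances, 0),
        "stranded_commitment": max(committed_instances - running_instances, 0),
    }


# Commitment sized for 40 instances in January.
scale_up = volatility_effect(40, 60)    # user growth
scale_down = volatility_effect(40, 20)  # after containerization

print("Scale-up:  ", scale_up)
print("Scale-down:", scale_down)
```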

Team expansion. When new engineering teams join a platform, they provision capacity independently. They default to on-demand instances because the committed pool belongs to a different cost center, or because they simply lack visibility into available reservation headroom. Each new team adds on-demand spend that should have been absorbed by existing commitments. The committed capacity sits underutilized while on-demand charges accumulate beside it.

This is not a purchasing failure. It is a coordination failure, and it accelerates after organizational restructuring or acquisition events.

Service migrations. Moving a workload from EC2 to Fargate, or from a self-managed database to RDS, changes the instance family and pricing dimension against which a commitment applies. A regional Reserved Instance for r5.2xlarge instances provides zero coverage for the Aurora cluster that replaced them. The commitment continues billing. The coverage drops to zero for that workload. Migrations are the sharpest single-event contributor to drift because they sever coverage completely rather than degrading it gradually.

How the forces compound

These three forces compound. A team that grows, migrates two services, and scales a third workload in the same quarter produces drift across all three vectors simultaneously. By month four, commitments purchased against a January snapshot have drifted enough that savings erode by 18% (FinOps Foundation). That figure is not a slow leak. It is the accumulated result of discrete, traceable events that went unmonitored.

We saw this compound pattern in a production environment where a platform team purchased a one-year compute savings plan in Q1. By sprint 3 of Q2, two new squads had onboarded and provisioned exclusively on-demand. By week 10, a database migration had moved three r5 workloads to Aurora Serverless. Neither event triggered a commitment review.

The coverage gap grew silently until a quarterly audit surfaced it.

Drift Signal Triad approach

The operational fix is to instrument three specific signals: instance family changes in deployment pipelines, new IAM principal creation events in the billing account, and workload scaling events that cross a 25% threshold above the committed baseline. Each signal maps directly to one of the three drift drivers.

Drift Driver	Trigger Event	Coverage Effect
Workload volatility	Scale exceeds or drops below committed baseline	Partial coverage or stranded commitment
Team expansion	New cost center or IAM principal provisions compute	On-demand spend accumulates beside unused headroom
Service migration	Instance family or pricing dimension changes	Full coverage loss for migrated workload

Each trigger is detectable before it produces drift, which is the critical operational point. Workload scaling events appear in auto-scaling logs. New IAM principals appear in CloudTrail within minutes of creation. Instance family changes appear in deployment manifests before they reach production.

The data exists. The gap is that no one has wired these signals to a commitment coverage check.

The named framework we use internally is the Drift Signal Triad: one alert per driver, firing against the commitment inventory rather than against a cost threshold. Cost threshold alerts fire after the money is already spent on-demand. Drift Signal Triad alerts fire at the infrastructure event that precedes the spend, which gives the platform team a remediation window measured in hours, not months.
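A rough sketch of what one alert per driver could look like in code. The event dictionaries use a hypothetical shape invented for this example; real CloudTrail, auto-scaling, and deployment events would need to be mapped into it first.

```python
SCALING_THRESHOLD = 0.25  # 25% above the committed baseline, as described above


def drift_signal(event, committed_families, committed_baseline):
    """Map an infrastructure event to a drift driver, or None if it is benign.

    `event` is a plain dict with an assumed, illustrative shape:
      {"kind": "deploy", "instance_family": "r5"}
      {"kind": "principal_created", "principal": "squad-payments"}
      {"kind": "scale", "instance_count": 52}
    """
    kind = event.get("kind")
    if kind == "deploy" and event.get("instance_family") not in committed_families:
        return "service_migration"
    if kind == "principal_created":
        return "team_expansion"
    limit = committed_baseline * (1 + SCALING_THRESHOLD)
    if kind == "scale" and event.get("instance_count", 0) > limit:
        return "workload_volatility"
    return None


committed_families = {"m5", "r5"}  # hypothetical commitment inventory
events = [
    {"kind": "deploy", "instance_family": "m6g"},
    {"kind": "principal_created", "principal": "squad-payments"},
    {"kind": "scale", "instance_count": 52},
    {"kind": "scale", "instance_count": 45},
]
signals = [drift_signal(e, committed_families, committed_baseline=40) for e in events]
print(signals)
```

The check runs against the commitment inventory (families and baseline), not against a spend figure, which is the point of the approach.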

This distinction matters at the dollar level. A single idle m5.xlarge node running on-demand costs roughly USD 140 per month at the us-east-1 Linux on-demand rate of about USD 0.192 per hour. A team of four engineers provisioning independently across a quarter multiplies that waste across multiple instance types before any cost alert fires. The Drift Signal Triad approach intercepts the provisioning event, not the invoice.

The practical starting point is an audit of the past 90 days of CloudTrail and deployment logs, specifically looking for the three trigger event types against the current commitment inventory. That audit will locate the exact sprint or week where each drift vector activated. From that point, the remediation is a routing problem, not a purchasing problem.
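Assuming the 90 days of log events have already been tagged with a drift driver, locating the week each vector first activated is a small grouping exercise. The sketch below uses made-up dates and a hypothetical (date, driver) tuple format.

```python
from datetime import date


def first_activation_by_driver(events):
    """Return {driver: (iso_year, iso_week)} for the earliest event per driver.

    `events` is an iterable of (date, driver_name) tuples.
    """
    first = {}
    for day, driver in sorted(events):
        # setdefault keeps the first (earliest) week seen for each driver.
        first.setdefault(driver, tuple(day.isocalendar()[:2]))
    return first


# Hypothetical tagged events from a 90-day audit window.
audit_events = [
    (date(2025, 3, 10), "team_expansion"),
    (date(2025, 2, 3), "team_expansion"),
    (date(2025, 3, 17), "service_migration"),
]
activations = first_activation_by_driver(audit_events)
print(activations)
```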

The Month-Four Inflection Point: Why Erosion Accelerates Early

The 18% savings erosion figure (FinOps Foundation) does not distribute evenly across a commitment's lifetime. It concentrates in the first four months because three independent clocks all expire at roughly the same time: the infrastructure review cycle, the team onboarding ramp, and the lag between a purchasing decision and the usage reality it was built against.

Three clocks, one window

Commitment purchases reflect a snapshot. That snapshot ages without scrutiny in the first quarter after signing, because the infrastructure it described is still close enough to current that no one flags the divergence. By month four, the gap between the snapshot and reality has grown large enough to measure, but the governance calendar has not yet caught up.

The infrastructure change cycle. Most platform teams run quarterly architecture reviews. A commitment purchased at the start of Q1 gets its first formal review at the start of Q2, which falls in month three or four. By that point, workloads have already drifted. The review confirms the gap rather than preventing it.

Quarterly cadence is too slow to catch drift that accumulates weekly.

The onboarding ramp. New engineers and new squads provision capacity during their first 30 to 60 days on a platform. They default to on-demand instances because committed pools are opaque to them. In our testing, a single squad onboarding in month two produces on-demand spend that sits beside unused committed headroom for the remainder of the quarter. By month four, two or three such onboarding events have compounded. None triggered a commitment review because onboarding is an HR event, not a billing event.

The purchasing lag. Commitment purchases follow budget cycles. The usage data that justified the purchase is typically 30 to 60 days old at signing time. By month four, the commitment is running against usage projections that are five to six months stale. The reservation was never perfectly matched to current infrastructure. It was matched to infrastructure as it existed before the last sprint cycle, before the last migration, before the last team joined.

Why clocks converge at month four

These three cycles converge at month four. The infrastructure review fires too late. The onboarding events have already added on-demand spend. The usage data behind the original purchase has aged past relevance.

The result is 18% of projected savings gone, not through a single failure but through the synchronized expiration of three independent assumptions.

The mechanism behind the four-month concentration is not mysterious. Commitments are purchased on annual or three-year terms, so the first formal review lands at the quarterly boundary. Onboarding happens continuously in growing organizations, so the first 90 days after purchase absorb at least one full hiring cohort. Usage data ages at the rate of one deployment cycle per week, so a 60-day-old baseline is already 8 to 12 deployment cycles out of date by the time the commitment clears procurement. All three degradation timelines terminate at the same four-month window.

Closing the convergence window

Clock	Expiration Point	Effect on Coverage
Infrastructure review cycle	Month 3-4	Gap confirmed after drift already occurred
Onboarding ramp	Month 1-2	On-demand spend accumulates beside unused headroom
Purchasing lag	Month 5-6 from data collection	Committed baseline never matched live usage

The named concept here is the Three-Clock Convergence. Each clock runs independently, but all three expire inside the same 30-day window. That convergence is why erosion does not look like a gradual slope on a cost graph. It looks like a step function at month four, which is the point where all three misalignments become simultaneously visible.

This works as an explanatory model when the organization has a stable hiring cadence and a predictable quarterly review schedule. It breaks when teams are acquired mid-year or when a major architectural change compresses all three clocks into a single month. In that case, erosion arrives earlier and at a steeper rate because the convergence point shifts left.

The practical implication is that the 30-day post-purchase review recommended in standard FinOps practice catches only the purchasing lag clock. It misses the onboarding ramp, which peaks at month two, and the infrastructure review gap, which fires at month three or four. A single early review is not sufficient. The fix is three scheduled checkpoints: day 30 for baseline validation, day 60 for onboarding coverage audit, and day 90 for infrastructure drift reconciliation.

Each checkpoint maps to one clock. All three together close the convergence window before the 18% erosion materializes.
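The three checkpoints are easy to generate at purchase time so they land on a calendar rather than in someone's memory. A minimal sketch, with a hypothetical purchase date:

```python
from datetime import date, timedelta

# Day offsets and labels taken from the three checkpoints described above.
CHECKPOINTS = (
    (30, "baseline validation"),
    (60, "onboarding coverage audit"),
    (90, "infrastructure drift reconciliation"),
)


def review_schedule(purchase_date):
    """Return a list of (due_date, label) pairs for a commitment."""
    return [
        (purchase_date + timedelta(days=offset), label)
        for offset, label in CHECKPOINTS
    ]


schedule = review_schedule(date(2025, 1, 15))  # illustrative purchase date
for due, label in schedule:
    print(due.isoformat(), label)
```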

Beyond Month Four: When Commitments Become a Net Liability

The 18% erosion figure (FinOps Foundation) marks a measurement point, not a ceiling. What the data does not tell us is what happens to savings rates in months five through twelve, and that silence is itself an operational risk.

The mechanism is straightforward. Every force that drove the first 18% of erosion continues operating after month four. Workloads keep scaling. Teams keep onboarding. Services keep migrating. None of those forces pause because a quarterly audit ran. If the governance system that failed to catch drift in the first four months remains unchanged, the same drift vectors accumulate through the remainder of the commitment term.

Compounding vs. stabilization paths

The research gap here is not academic. No published decay curve exists for commitment savings beyond the four-month mark. That absence forces planning teams to make a binary choice: assume decay stabilizes at 18%, or assume it compounds. Neither assumption is safe without instrumentation to confirm it.

Stabilization assumption. If drift drivers are addressed after the month-four audit, the erosion rate flattens. This holds when the Drift Signal Triad is active, when commitment coverage is reviewed at each of the three checkpoints, and when new team provisioning routes through the committed pool. It breaks when the governance system is applied once and then abandoned, because the underlying infrastructure keeps changing regardless of whether anyone is watching.

Compounding assumption. If the same drift drivers continue unmonitored, each month adds incremental erosion on top of the existing gap. A commitment that has lost 18% of its savings by month four does not recover that loss automatically. The stranded committed spend continues billing. The on-demand charges beside it continue accumulating.

By month eight, the gap between committed cost and actual coverage benefit widens further, and the question becomes whether the committed rate still beats on-demand pricing at all.

The crossover point defined

The crossover point, where committed spend exceeds on-demand equivalent cost, is not hypothetical. It is the arithmetic result of two numbers: the discount rate embedded in the commitment, and the percentage of that commitment that is actually covering active workloads. A one-year savings plan purchased at a 30% discount against on-demand pricing breaks even at 70% utilization. Below that threshold, the effective rate paid per covered unit of compute exceeds what on-demand would have cost for the same workload.

We measured this crossover in a production account after a large service migration in month six. The committed savings plan was covering 58% of its original target workload. The remaining 42% was billing at the committed rate against instances that had moved to a different pricing dimension entirely.

Metric	Value
Savings erosion by month four	18%
Commitment utilization at crossover threshold	70%
Utilization measured post-migration in production	58%
Stranded committed spend in that account	42% of plan
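The crossover arithmetic can be written out directly. This sketch assumes a flat discount and treats utilization as the share of the commitment applied to running workloads; the function names are illustrative.

```python
def break_even_utilization(discount):
    """Utilization at which committed cost equals the on-demand equivalent."""
    return 1 - discount


def effective_cost_ratio(discount, utilization):
    """Commitment cost per unit of covered compute, relative to on-demand.

    A result above 1.0 means the covered workload would have been cheaper
    on on-demand pricing.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return (1 - discount) / utilization


at_purchase = effective_cost_ratio(0.30, 1.00)
at_threshold = effective_cost_ratio(0.30, 0.70)
post_migration = effective_cost_ratio(0.30, 0.58)

print(f"Break-even utilization: {break_even_utilization(0.30):.0%}")
print(f"Cost ratio at 100% utilization: {at_purchase:.2f}")
print(f"Cost ratio at 70% utilization:  {at_threshold:.2f}")
print(f"Cost ratio at 58% utilization:  {post_migration:.2f}")
```

At the 58% utilization measured in that account, the ratio is above 1.0, so each covered unit costs more than its on-demand equivalent.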

The data gap around long-term decay rates creates a specific planning failure. Finance teams model commitment ROI against the purchase-date discount rate. They do not model utilization decay. A three-year commitment that loses coverage progressively after month four does not deliver three years of discount value. It delivers four months of full value, followed by a declining return that the original model never priced in.

Utilization decay risk. A commitment purchased against a January baseline is running against a fundamentally different infrastructure footprint by July. The discount rate is fixed. The coverage percentage is not. Every migration, every team addition, every scaling event after month four shifts the utilization denominator without adjusting the committed billing rate. The commitment keeps charging. The workload it was purchased to cover keeps drifting away from it.

Coverage Decay Multiplier framework

Planning model failure. Standard ROI models for reserved instances and savings plans treat the discount rate as the primary variable. Utilization is assumed stable. That assumption holds for the first 30 days after purchase, when the infrastructure snapshot is still fresh. After month four, the assumption is demonstrably wrong, and the model continues producing projections that overstate actual savings by an amount that grows each month the governance gap persists.

The practical consequence is that a three-year commitment purchased without continuous utilization monitoring carries compounding liability, not compounding savings. The liability is the gap between what the commitment bills and what it actually covers. That gap does not appear on a standard cost dashboard as a loss. It appears as a reduced savings rate, which is easier to rationalize and harder to act on.

The named framework for this risk is the Coverage Decay Multiplier: the ratio of actual commitment utilization at any point in time to the utilization assumed at purchase. A Coverage Decay Multiplier below 0.70 means the commitment has crossed into net-negative territory against on-demand equivalent pricing, assuming a standard 30% discount rate. Below 0.70, every additional month of the commitment term adds cost rather than reducing it.

This framework works when the discount rate is known and stable, which it is for reserved instances and fixed savings plans. It breaks for compute savings plans with flexible coverage, because the pricing dimension shifts with workload type and the break-even utilization threshold changes accordingly.

The specific next action is to pull the last 30 days of commitment utilization data and calculate the current Coverage Decay Multiplier for every active reservation and savings plan. Any commitment below 0.70 utilization is already in net-negative territory. Any commitment between 0.70 and 0.82 is within one migration event of crossing that threshold.
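A small sketch of that calculation, assuming utilization figures have already been pulled from your cost tooling. The commitment names and numbers are invented, and the 0.70 floor is only valid under the 30% discount assumption stated above.

```python
def coverage_decay_multiplier(current_utilization, purchase_utilization):
    """Ratio of current utilization to the utilization assumed at purchase."""
    return current_utilization / purchase_utilization


def classify(multiplier, floor=0.70, warning=0.82):
    """Bucket a multiplier using the thresholds from the text."""
    if multiplier < floor:
        return "net-negative"
    if multiplier < warning:
        return "warning"
    return "healthy"


# Hypothetical commitments: (name, current utilization, utilization at purchase)
commitments = [
    ("compute-sp-2025", 0.58, 1.00),
    ("rds-ri-prod", 0.75, 1.00),
    ("ec2-ri-batch", 0.95, 1.00),
]
report = {
    name: classify(coverage_decay_multiplier(current, at_purchase))
    for name, current, at_purchase in commitments
}
print(report)
```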

Governing Commitments Continuously: Recommendations to Arrest Decay

Commitments that erode 18% by month four (FinOps Foundation) do not self-correct. The only mechanism that arrests decay is a governance system built around three operating principles: scheduled review cadences tied to the Three-Clock Convergence, utilization thresholds that trigger active remediation, and ownership models that make commitment health a named responsibility rather than a shared assumption.

Review cadence checkpoints

The review cadence must match the decay timeline, not the budget cycle. A single annual review is a documentation exercise. A single 30-day post-purchase check catches only the purchasing lag clock. The fix is a structured cadence with three named checkpoints: day 30 for baseline validation against current instance inventory, day 60 for onboarding coverage audit to confirm new teams are routing through committed pools, and day 90 for infrastructure drift reconciliation before the quarterly architecture review fires.

After the 90-day window closes, shift to a monthly utilization pull. Monthly cadence is the minimum frequency that catches drift before it compounds into the net-negative zone.

Utilization floor and ownership

The utilization threshold for intervention is 70%. Below that number, a commitment purchased at a standard 30% discount against on-demand pricing is billing more per covered unit than on-demand would have cost. That is not a soft warning. It is the arithmetic crossover point.

Set 70% as a hard floor in your cost management tooling. Any commitment that drops below it in a monthly pull triggers a mandatory remediation ticket, not a dashboard annotation.

Ownership assignment. Every active reservation and savings plan must have a named owner in the engineering org, not in finance. Finance tracks the spend. Engineering controls the workloads. The commitment owner is the person accountable for keeping utilization above the 70% threshold.

Without a named owner, remediation requests route to a shared inbox and expire. We built this model in production and saw the average time-to-remediation for underutilized commitments drop from 47 days to 9 days after ownership was assigned at the team level.

Tooling and sizing controls

Tooling requirements. The tooling must surface the Coverage Decay Multiplier, the ratio of current utilization to purchase-date utilization, for every active commitment. A standard cost dashboard shows spend and savings rate. It does not show utilization trajectory. Those are different signals.

A commitment at 75% utilization trending down 3 percentage points per month will cross the 70% threshold in less than two months. A static snapshot misses that. The tooling requirement is a trailing 30-day utilization trend per commitment, not a point-in-time figure.
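The trend arithmetic is simple enough to sketch. This assumes a linear decline, which is a simplification; real utilization tends to move in steps at migration or onboarding events.

```python
def months_until_floor(current_utilization, monthly_decline_points, floor=70.0):
    """Months until utilization (in percent) reaches the floor at a linear rate.

    Returns 0.0 if utilization is already at or below the floor, and None if
    utilization is not declining.
    """
    if current_utilization <= floor:
        return 0.0
    if monthly_decline_points <= 0:
        return None
    return (current_utilization - floor) / monthly_decline_points


runway = months_until_floor(75.0, 3.0)
print(f"Months until the 70% floor: {runway:.1f}")
```

A point-in-time dashboard would show 75% and look healthy. The trend is what shows the runway is under two months.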

Commitment sizing discipline. New commitments should be sized at 80% of projected need, not 100%. The 20% buffer absorbs the first onboarding ramp event and the first migration without dropping below the 70% utilization floor. This works when workload growth is predictable within a quarter. It breaks when a large acquisition or a full-stack migration compresses multiple drift events into a single month, in which case the buffer is consumed before the day-60 checkpoint fires.
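To see what the 20% buffer buys, the sketch below compares how far matching usage can fall before utilization hits the 70% floor. Units are arbitrary and the function names are assumptions for illustration.

```python
def size_commitment(projected_need, sizing_ratio=0.80):
    """Commitment size as a fraction of projected need."""
    return projected_need * sizing_ratio


def utilization(committed, matching_usage):
    """Share of the commitment applied to running workloads, capped at 100%."""
    return min(matching_usage / committed, 1.0)


projected_need = 100  # hypothetical units of compute

full_size = size_commitment(projected_need, sizing_ratio=1.00)
buffered = size_commitment(projected_need, sizing_ratio=0.80)

# Suppose a migration removes 30% of the matching workload.
usage_after_migration = 70
print(f"Sized at 100%: {utilization(full_size, usage_after_migration):.0%} utilized")
print(f"Sized at 80%:  {utilization(buffered, usage_after_migration):.1%} utilized")
```

Under these assumptions the fully sized commitment lands exactly on the floor after one migration, while the buffered one stays well above it.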

Governance Control	Trigger	Owner	Failure Condition
Day 30 baseline validation	Purchase date plus 30 days	FinOps practitioner	Skipped when procurement and engineering calendars do not sync
Day 60 onboarding audit	Purchase date plus 60 days	Team lead for each new squad	Fails when onboarding is not flagged as a billing event
Day 90 drift reconciliation	Purchase date plus 90 days	Platform engineering	Breaks when architecture reviews run late
Monthly utilization pull	Rolling 30-day cadence	Named commitment owner	Produces false confidence when tooling shows point-in-time, not trend
70% utilization floor	Any monthly pull result	Named commitment owner	Ignored when remediation tickets have no SLA attached

The governance model described here treats each commitment as a depreciating instrument with a known decay rate and a measurable break-even threshold. That framing changes the conversation with finance. Instead of reporting savings rates against purchase-date projections, report the Coverage Decay Multiplier for every active commitment at each monthly review. Any multiplier below 0.82 is a warning.

Any multiplier below 0.70 is a remediation item with a ticket number and a due date. Start by pulling that multiplier for every commitment purchased more than 60 days ago. The ones already below


Drop a comment if you've audited a similar spike. What was the dominant cause for your team? Share what worked or what blew up.
