DEV Community: Muskan

Railway went down for 10 hours, and it wasn't their fault. Here's the part nobody is talking about.

Muskan — Fri, 22 May 2026 11:16:30 +0000

22:10 UTC. May 19, 2026.

Railway's monitoring starts screaming.

Dashboard, 503. API, dead. Logins, failing. Within nine minutes, the on-call engineers have an answer, and honestly, it's almost worse than an outage:

Google Cloud suspended Railway's entire production account.

No warning. No email. No phone call. Just an automated enforcement action that flipped a switch on a company that spends over ten million dollars a year with them.

I put together a short breakdown of what actually happened, and walked through how we'd have spotted this kind of single-point-of-failure on the architecture canvas with Blast Radius before it bit. If you want the visual version of this post, there's a video walkthrough too.

If you've been on the internet long enough, you've seen this movie before. UniSuper, a $50B pension fund, had its Google Cloud subscription accidentally deleted by Google in 2024. Plenty of indie devs have been auto-banned by AWS and Google with zero recourse. The Railway one just happens to be the biggest "developer cloud gets locked out of the cloud" event so far.

But the part that got under my skin wasn't the suspension. It was what happened next.

Railway isn't even fully on GCP

Here's what makes this incident actually interesting for anyone running infra.

Railway runs workloads on three things:

Their own bare metal hardware (Railway Metal)
AWS
GCP

Smart. Multi-cloud. Exactly what every architecture deck on LinkedIn tells you to do.

But their network control plane, the thing that knows where everything lives and how to route traffic, was hosted on GCP. So when GCP suspended the account at 22:20 UTC:

22:20: control plane goes down
22:35: the routing cache at the edge starts expiring
~23:35: Railway Metal workloads start returning 404s
shortly after: AWS workloads do the same

By the time the routing cache fully expired, every single workload across every cloud was unreachable. Even the ones running on hardware that Railway owns outright, sitting in its own racks, completely untouched by Google's enforcement action.

The servers were fine. The applications were fine. Nobody could find them.

That's the blast radius of one upstream click.

From Railway's own postmortem, which is unusually honest and worth reading in full:

"In this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud. This meant that despite the mesh continuing to operate for an hour, when the route cache expired, the mesh failed to re-populate the routing tables."

And from Angelo on the Railway forum:

"This one was egregiously bad because it was a single and expected point of failure like a cloud account getting removed… to say we are livid is an understatement."

Account access came back in 9 minutes after a P0 escalation. But the customer-facing outage ran nearly 10 hours total, because once the edges have forgotten where everything lives, you have to wake up disks, restart compute, rebuild routes, and re-converge the mesh layer by layer. The technical recovery alone took the better part of 8 hours after access was restored. And then GitHub starts rate-limiting your OAuth during the recovery surge, because of course it does.

The thing every engineer felt reading this

If you write infrastructure for a living, you read the Railway postmortem with one specific feeling, and it's not schadenfreude.

It's the cold realization that you don't actually know what depends on what in your own stack.

You know the obvious stuff. The big "if RDS goes down, the app goes down" connections. But the long tail? The security group that's quietly shared between four services? The Lambda that's the only thing keeping a webhook alive? The "idle" read replica that's actually the cross-region failover for orders?

You don't know. I don't know. Nobody knows until the thing breaks.

This is the gap nobody fills. Cost tools are great at telling you what's wasted. Observability tools are great at telling you what's broken right now. Neither one tells you what will break if you touch this.

So teams do one of two things:

Delete it and pray.
Don't delete it, and sit on thousands of dollars of monthly waste because nobody wants to be the person who broke checkout on a Friday.

That second one is the actual norm, by the way. Talk to any cloud cost lead, and they'll tell you the bottleneck isn't finding the savings. It's getting anyone to confidently apply the recommendations.

What we built (and why Railway's story is exactly the use case)

This is the gap we've been building Blast Radius for in ZopNight.

The idea is dead simple. Before you apply any recommendation (delete this, stop that, modify the other thing), Blast Radius lights up the architecture canvas and shows you, in plain language, what's actually connected and what's about to break. (If you watched the video above, you've already seen the canvas in action: the RDS read replica that looked idle but was actually the cross-region failover for production orders. That's the kind of thing this catches.)

Here's how it works under the hood:

Adjacency graph. We build it from the metadata you already have: shared security groups, load balancer targets, volume attachments, Lambda triggers, and parent resources. No agents. No flow logs. Just the same metadata you can see in the AWS / GCP / Azure consoles, stitched into a real graph.

Resource-type-aware impact classification. A modification on a Lambda is not the same as a modification on an EC2 instance, which is not the same as a modification on a GKE node pool. We classified 131 resource types into three behavior buckets: onlineModify (live update, no interruption), restartModify (brief restart), and poolModify (children recreated). The classification respects that. Default for unmapped types is Warning, not Safe. Safe has to be proven.

Color on the canvas. Green = safe. Amber = a connection will break. Red = destroyed or non-functional. Everything unrelated gets dimmed so your eyes can find the actual story.

A risk score, 0 to 100. Severity of impact, whether it's prod/staging/dev, how many teams own pieces of it, and whether it's on an active schedule. Not a vibe. A number that explains itself.
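To make the adjacency-graph idea concrete, here is a minimal sketch. It assumes the metadata has already been flattened into (resource, resource, relationship) tuples; the resource names, the edge list, the type-to-bucket mapping, and the two-hop limit are all made up for illustration and are not how ZopNight actually stores or walks its graph.

```python
from collections import defaultdict, deque

# Hypothetical edges derived from console-visible metadata.
EDGES = [
    ("rds-replica", "orders-api", "cross-region failover"),
    ("orders-api", "alb-public", "load balancer target"),
    ("webhook-lambda", "orders-api", "lambda trigger"),
    ("batch-worker", "scratch-volume", "volume attachment"),
]

# Bucket names come from the post; which type lands in which bucket is assumed.
BUCKET_BY_TYPE = {
    "lambda": "onlineModify",
    "ec2": "restartModify",
    "gke_node_pool": "poolModify",
}


def build_graph(edges):
    """Build an undirected adjacency map from relationship tuples."""
    graph = defaultdict(set)
    for a, b, _relationship in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def blast_radius(graph, target, max_hops=2):
    """Breadth-first walk: return {resource: hops} within max_hops of target."""
    seen = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[target]
    return seen


def classify_modification(resource_type):
    """Unmapped types default to 'warning', never to safe."""
    return BUCKET_BY_TYPE.get(resource_type, "warning")


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(blast_radius(graph, "rds-replica"))
    print(classify_modification("some_new_service"))
```

The point of the sketch is the shape: the "idle" replica is one hop from the orders API and two hops from the public load balancer, while the unrelated batch worker never shows up at all.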

So when you click "delete this idle RDS read replica," and Blast Radius lights up red on a cross-region link you forgot existed, you don't have to be the person who broke DR on a Friday. You loop in the right team, or you skip that one and confidently apply the eleven others that came back green.

If Railway had been able to look at their own architecture and ask "what happens if our GCP control plane disappears for 10 hours", and see the answer light up across Metal and AWS in red, they'd have ripped that dependency out a year ago. They are now, by the way. The postmortem commits to:

Removing GCP from the data plane's hot path
A true mesh across Metal, AWS, and GCP where any one interconnect can fail
HA database shards split across AWS and Metal, so quorum survives losing a cloud

The lesson the rest of us pay nothing for

You can't prevent your cloud provider's billing department from having a bad day. You can't make their fraud algorithms call you first.

What you can do is make sure that when one upstream blinks, you already know, exactly, with receipts, which of your own resources will go dark with it.

The internet runs on three companies. Every once in a while, one of them hits the off switch. The question isn't whether it happens to your stack. It's whether you'll be able to point at a canvas and say, "I already knew that would fail. Here's what I did about it."

Visibility is always cheaper than recovery.

From auto-recommendation to one-click cloud remediation, the workflow most tools skip

Muskan — Thu, 21 May 2026 06:39:25 +0000

Every cloud cost tool I have ever opened shows the same big number near the top of the dashboard. You could save 487,000 dollars a year. Sometimes it is bigger. The number is real in the sense that the math checks out. The number is also a lie in the sense that almost none of it ever happens.

The recommended savings number on your dashboard is not the realised savings number on your bill. The gap between those two is where most FinOps tools quietly fail their users, and it is not a small gap. At most teams I have talked to, it sits somewhere between 80 and 95 per cent.

The ticket nobody picks up

Walk through what the dashboard asks you to do when it says to stop this idle EC2 instance and save 312 dollars a month. The path looks like this.

An engineer reads the recommendation.
Files a ticket because they do not own the resource.
Waits for the team that does.
That team schedules the work into a sprint.
Someone eventually logs into the AWS console.
Finds the resource. Confirms it is actually the right one.
Runs the stop action.
Verifies nothing downstream broke.
Updates the ticket.

Nine steps, three humans, two weeks of calendar time for a single 312 dollar recommendation. Multiply that by a few hundred recommendations a month, and the math becomes obvious. Nobody works through that queue. The recommendations pile up. The dashboard keeps showing the same big number. The bill keeps not going down.

The recommendations were never the problem. The execution layer was.

Recommendation is not remediation

Every cloud cost tool on the market does auto-recommendation well. It scans your account, finds the idle instances, the over-provisioned databases, and the orphaned storage, and surfaces them on a dashboard. Some tools are very good at this part. The recommendations are usually right.

What almost no tool does well is auto-remediation. Recommendation tells you what to do. Remediation actually does it. The first is a report. The second is a button that, when clicked, performs the cloud action, verifies it landed, and writes an audit log.

Most teams have spent the last five years drowning in recommendations they never executed on. The dashboards got more sophisticated. The list of suggested actions got longer. The realised savings number on the bill barely moved.

The reason every cost tool stops at recommendation and not remediation is that recommendation is safe to ship, and remediation is not. A number on a dashboard cannot break production. A cloud API call that stops a resource absolutely can. So the industry settled on a comfortable middle ground. Tell the user what to do. Let them deal with the consequences.

The obvious fix, and why it is harder than it looks

The obvious fix is a button on the recommendation that just does the thing. Click, instance stopped, ticket avoided. People have tried this. The reason it has not been a default feature in cost tools is that cloud actions are not safe to fire blindly, and the failure modes are bad enough that one wrong action poisons trust in the whole tool for a year.

The interesting engineering problem is not calling the stop API. That part takes an afternoon. The interesting part is everything around it.

I have been watching a team build this, and the workflow they ended up with is a useful artefact even if you never use their tool. The shape of it is what matters.

What sits behind one-click remediation

Five steps, and skipping any of them is how you get a 3 am incident.

1. Precondition check. Before doing anything, ask the cloud what state the resource is actually in right now. If somebody on the team manually stopped it an hour ago, the workflow stops here and reports already done. This single check is the difference between automation people trust and automation people turn off.

2. Optional approval. Some actions need a human gate. Production-tier rules, destructive operations, anything where the cost of a wrong call is worse than the cost of a slow call. The approval queues with full context: resource, savings, who initiated it, and what rule fired. An admin clicks approve or reject. Cheap actions skip this entirely.

3. Execute. Call the cloud API. Stop the EC2, pause the Synapse pool, scale the Lambda to zero. This is the boring part.

4. Validate. This is the part most tools get wrong. A 200 response from the cloud API does not mean the resource is actually stopped. The validate step polls the cloud state until it confirms the action genuinely landed. If the API said yes but the resource is still running, the workflow flags it as a system error instead of silently lying.

5. Audit. Every step, input, and result is written to a dedicated audit table. Six months from now, when someone asks who stopped the prod-adjacent Synapse pool on March 12, the answer is one query away.
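The five steps above can be sketched in a few dozen lines. This is a rough illustration, not the team's implementation: `FakeCloud` is a hypothetical in-memory stand-in for a cloud API, and the function name, return values, and audit record shape are all assumptions.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeCloud:
    """Hypothetical in-memory cloud API used only for this sketch."""
    states: dict

    def get_state(self, resource_id):
        return self.states[resource_id]

    def stop(self, resource_id):
        self.states[resource_id] = "stopped"
        return 200


def remediate_stop(cloud, resource_id, audit_log, needs_approval=False,
                   approver=None, poll_attempts=3, poll_interval=0.0):
    def audit(step, result):
        # 5. Audit: every step and its result is recorded as it happens.
        audit_log.append({"resource": resource_id, "step": step, "result": result})

    # 1. Precondition check: ask the cloud what state the resource is in now.
    if cloud.get_state(resource_id) == "stopped":
        audit("precondition", "already done")
        return "already_done"
    audit("precondition", "running")

    # 2. Optional approval: only gated actions wait for a human decision.
    if needs_approval:
        approved = bool(approver and approver(resource_id))
        audit("approval", "approved" if approved else "rejected")
        if not approved:
            return "rejected"

    # 3. Execute: the boring part.
    status = cloud.stop(resource_id)
    audit("execute", status)

    # 4. Validate: poll real state instead of trusting the API response.
    for _ in range(poll_attempts):
        if cloud.get_state(resource_id) == "stopped":
            audit("validate", "confirmed stopped")
            return "done"
        time.sleep(poll_interval)
    audit("validate", "system error: API accepted, resource still running")
    return "system_error"


if __name__ == "__main__":
    log = []
    cloud = FakeCloud({"i-idle": "running"})
    print(remediate_stop(cloud, "i-idle", log))
    print(remediate_stop(cloud, "i-idle", log))
    print(len(log), "audit entries")
```

Running it twice against the same resource shows the precondition check earning its keep: the second call reports that the work is already done instead of firing another stop.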

The other thing worth stealing from this design is how errors get categorised. Three buckets.

User action. Permission denied, quota hit. Shows the fix with a console link.
Transient. 429s, 5xx. Gets a retry button.
System. The cloud actually broke, or the API is unsupported. Gets a diagnostic and a support contact.

The category drives the UI. Retry only shows up where retry actually makes sense. This sounds small. It is the difference between an automation surface that engineers learn to trust and one they learn to ignore.
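Here is one way the three buckets could drive the UI, as a sketch. The error codes, the HTTP status mapping beyond "429s, 5xx", and the UI strings are assumptions for illustration; real cloud SDKs report errors in their own shapes.

```python
from enum import Enum


class ErrorCategory(Enum):
    USER_ACTION = "user_action"
    TRANSIENT = "transient"
    SYSTEM = "system"


# Hypothetical error codes standing in for "permission denied, quota hit".
USER_ACTION_CODES = {"AccessDenied", "QuotaExceeded"}

UI_AFFORDANCE = {
    ErrorCategory.USER_ACTION: "show fix with console link",
    ErrorCategory.TRANSIENT: "show retry button",
    ErrorCategory.SYSTEM: "show diagnostic and support contact",
}


def categorise(http_status, error_code=""):
    """Map a failed cloud call onto one of the three buckets."""
    if error_code in USER_ACTION_CODES or http_status in (401, 403):
        return ErrorCategory.USER_ACTION
    if http_status == 429 or 500 <= http_status < 600:
        return ErrorCategory.TRANSIENT
    # Everything else, including unsupported operations, is treated as system.
    return ErrorCategory.SYSTEM


if __name__ == "__main__":
    for status, code in [(403, "AccessDenied"), (429, ""), (400, "UnsupportedOperation")]:
        category = categorise(status, code)
        print(status, code or "-", "->", category.value, "->", UI_AFFORDANCE[category])
```

Because the retry button hangs off the category rather than the raw error, it cannot appear next to a permission failure where retrying would never help.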

The certification gate

One choice in this design that surprised me. Not every recommendation rule gets a Remediate button. The team ships the button only on rules that have been certified end-to-end on real cloud accounts. The certified set started at 20 rules covering stop, scale-to-zero, and pause actions across AWS, GCP, and Azure. The other rules render the recommendation card without the button.

The temptation when shipping automation is to ship it everywhere on day one. The discipline is to ship it only where you have proven the workflow handles the edge cases. A fake remediation that returns success but did not actually do anything is worse than no remediation at all, because it convinces the team that the savings are realised when they are not.

The Databases Rule

One more thing worth calling out. Customer data resources (RDS, Aurora, Cloud SQL, ElastiCache, the entire Postgres and MySQL family on every cloud) are excluded from any automated action. Not as a toggle. They are left off the allowlist at the code level, so they cannot be passed to the executor, regardless of what the rule says. It's the kind of safety rail you only see in tools built by people who have personally been responsible for a database outage and refuse to be again.
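A code-level guard like that might look something like the sketch below. The resource type names, the exception, and the `execute` signature are hypothetical; the only idea taken from the design is that the check lives in the executor itself, so no rule or setting can route a database to it.

```python
# Hypothetical set of resource types the executor is willing to touch.
# Database types are simply absent, so there is no flag to flip.
AUTOMATABLE_TYPES = frozenset({"ec2_instance", "lambda_function", "synapse_pool"})


class NotAutomatable(Exception):
    """Raised when a resource type is not on the executor's allowlist."""


def execute(resource_type, resource_id, action):
    # The guard runs before anything else, whatever the calling rule says.
    if resource_type not in AUTOMATABLE_TYPES:
        raise NotAutomatable(f"{resource_type} is never eligible for automated {action}")
    return f"{action}:{resource_id}"


if __name__ == "__main__":
    print(execute("ec2_instance", "i-123", "stop"))
    try:
        execute("rds_instance", "orders-db", "stop")
    except NotAutomatable as err:
        print("blocked:", err)
```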

What this changes

The recommended savings number on your dashboard becomes a number you can actually realise. The 20-minute ticket-and-console-hop becomes one click. The audit log behind every action means you can show the CFO not just what you saved, but who saved it, when, and how.

The interesting thing is that none of this is technically novel. Precondition, approval, execute, validate, audit. Five steps you would design on a whiteboard in an hour. The reason it matters is that almost no cost tool actually does it, and the ones that try usually skip the validation step and pretend the API response is the truth.

If you are evaluating cloud cost tools, the question to ask is not what the recommended savings number is. The question is, what happens after I click the button, and how do you know the resource is actually stopped?

That is the only number that ends up on your bill.

Blast Radius Before Execution: Why Autonomous Cloud Must Check Idle Resources First

Muskan — Tue, 19 May 2026 12:06:36 +0000

Autonomous cloud remediation fails the same way every time. The recommendation is correct. The action is correct. The scope is wrong.

Stop the idle RDS instance. Correct recommendation. The instance has averaged 2% CPU for 30 days. It is genuinely idle. But it is also the database backing an internal integration endpoint that three production services call once a day, on different schedules, from different accounts. Stop it and you have a production incident at the next scheduled call time: 2:00 AM on a Tuesday.

The recommendation engine did not fail. The blast radius model was missing.

Autonomous systems that act without a blast radius check are not autonomous. They are automated. Automation executes instructions. Autonomy includes a model of consequences. Every auto-remediation action ZopNight certifies runs through a blast radius check before execution. This post defines what that check contains and how to score it.

The Correct-But-Wrong-Scope Problem

Cost tools are good at identifying idle resources. They look at CPU, memory, network, and storage I/O over a 14-30 day window. A resource below thresholds on all four dimensions is flagged as idle. The recommendation is statistically correct.

The tool does not know what depends on that resource. It cannot see the cross-account IAM role that calls the RDS instance from a Lambda in another account. It cannot see the CloudFront distribution that caches responses from the EC2 instance it flagged. It cannot see that the "idle" ElastiCache cluster is the cache warming target for a batch job that runs quarterly.

In the accounts we analyzed, 12-18% of resources flagged as idle have active downstream dependencies that would cause an incident if acted on without verification. That means roughly 1 in 8 to 1 in 6 autonomous actions on a naive system would create an incident. No engineering team will accept that failure rate for unattended automation.

The blast radius check is the gate that separates the safe 82-88% from the risky 12-18%.

Defining Blast Radius: Three Inputs

Blast radius is the set of resources and services that fail or degrade if the target resource is stopped, modified, or deleted mid-action. It is a pre-execution estimate, not a post-incident measurement.

Three inputs define it.

The dependency graph. AWS VPC flow logs record every connection between resources in the past 14 days. A resource with no inbound or outbound connections in 14 days has a low dependency graph score. A resource with 200 connections per day across 4 VPCs has a high one. Service mesh telemetry (Istio, Linkerd) adds application-layer connection data for services that flow logs cannot see (same-host connections, gRPC multiplexed streams). The dependency graph has a 6-hour lag for VPC flow logs. Resources with less than 6 hours of flow log data default to high blast radius.

Criticality tier. Resource tags encode business context that infrastructure metrics cannot. A resource tagged env=production and tier=critical scores high blast radius regardless of its CPU utilization. A resource tagged env=dev and team=platform scores lower. Tags are not perfectly reliable: 23% of resources in typical accounts have stale or missing criticality tags. When the criticality tag is absent, blast radius defaults to medium.

Recency. A resource with no write operations in 24 hours is idle by the write signal. A resource with a write 18 hours ago is not idle. CloudTrail records write API calls against each resource. LastWriteTimestamp is the third input. Resources with writes in the last 24 hours get a recency penalty that raises their blast radius score regardless of CPU.

The Blast Radius Score: 0-100 and What Each Band Means

The three inputs produce a numeric score from 0 to 100. The score determines the action policy.

Score range	Action policy	What it means
0-29	Unattended execution	No active dependencies, non-production tag, no recent writes. Safe to act without human review.
30-69	Notification window	Possible dependencies or ambiguous tags. Action queued with 15-minute notification. Human can cancel.
70-100	Approval required	Active dependencies confirmed, production tag, or recent writes. Action blocked until explicit approval.

A resource scores below 30 only if all three conditions hold: dependency graph shows no connections in 14 days, criticality tag is non-production, and no writes in 24 hours. This is the conservative definition of "safely idle."

A resource scores 70 or above if any one of these conditions holds: 50+ connections per day in flow logs, production or critical tag, or writes in the last 6 hours. One high-signal input overrides the others. An RDS instance tagged dev that has 200 daily connections scores 70 or above. The dev tag does not override the dependency signal.

The 30-69 band handles the ambiguous cases: resources with missing tags, flow log gaps, or moderate connection counts. The 15-minute notification window gives an engineer time to cancel an action they know is risky, without requiring them to pre-approve every action in the queue.
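The band rules can be written down without knowing the exact scoring formula, which this post does not give. The sketch below maps the three inputs straight to an action policy; the `Signals` field names are assumptions, and a real scorer would produce a number rather than jumping directly to a band.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    connections_per_day: float              # from flow logs
    connections_last_14_days: int           # from flow logs
    env_tag: Optional[str]                  # None when the tag is missing
    tier_tag: Optional[str]                 # None when the tag is missing
    hours_since_last_write: Optional[float]  # None when no write is on record


def action_policy(s):
    # Any one high-signal input is enough for the approval band.
    high = (
        s.connections_per_day >= 50
        or s.env_tag == "production"
        or s.tier_tag == "critical"
        or (s.hours_since_last_write is not None and s.hours_since_last_write < 6)
    )
    if high:
        return "approval_required"

    # All three conditions must hold for the unattended band.
    safely_idle = (
        s.connections_last_14_days == 0
        and s.env_tag is not None
        and s.env_tag != "production"
        and (s.hours_since_last_write is None or s.hours_since_last_write >= 24)
    )
    if safely_idle:
        return "unattended"

    # Missing tags, moderate traffic, or a write in the last day land here.
    return "notification_window"


if __name__ == "__main__":
    busy_dev_db = Signals(200, 2800, "dev", None, None)
    print(action_policy(busy_dev_db))
```

The dev-tagged database with 200 daily connections lands in the approval band, which is the override behaviour described above.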

How ZopNight Gates Actions

Every automated remediation in ZopNight runs the blast radius check before execution. The check adds 2-4 seconds to the action pipeline. For actions that run unattended at 3:00 AM, 4 seconds is an acceptable gate latency.

In a typical month, across the ZopNight customer fleet: 67% of triggered remediations score below 30 and run unattended. 24% score 30-69 and enter the notification queue; 91% of those proceed after the window with no cancellation. 9% score 70 or above and require approval; approval is granted for 78% of those, with 22% cancelled by the reviewing engineer.

The autonomous action log records the blast radius score alongside every action. This creates the audit trail: not just what action ran, but why the system considered it safe to run without human review.

The Half-Action Problem: Idempotency as a Blast Radius Input

Blast radius measures the risk of acting on a resource. It does not by itself measure the risk of acting and then failing halfway through.

A multi-step remediation that stops an EC2 instance, modifies its tags, and starts it again has three steps. If the network fails after step 1 (stopped) and before step 3 (started), the instance is stopped but not restarted. The resource is in a worse state than before the action ran. The original state was idle but running. The post-failure state is stopped unexpectedly. Recovery requires manual intervention.

Idempotency is the property that makes a remediation safe to retry from any point. A stop-and-delete action is not idempotent: running it twice on an already-deleted resource produces an error. A tag-update action is idempotent: running it twice produces the same result as running it once.

Non-idempotent remediations get a blast radius floor of at least 50, regardless of their dependency graph, criticality, and recency scores. This forces them into the notification queue at minimum, never into the unattended queue.

Remediation type	Idempotent	Blast radius floor	Example
Tag update	Yes	0 (no floor)	Add cost-center tag to EC2
Stop instance	Yes	0 (no floor)	Stop idle EC2 (safe to retry)
Delete snapshot	No	50	Deleted snapshot cannot be recovered
Schema migration	No	50	Partial schema change leaves DB inconsistent
Cross-account IAM change	No	70	IAM changes have immediate effect
Stop + reconfigure + start	Partially	50	Failure mid-sequence leaves wrong config
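Applying the floor is a one-line `max`, sketched here with the floors from the table. The dictionary keys and the choice to treat unknown remediation types as non-idempotent are assumptions.

```python
# Floors taken from the table above; the key names are hypothetical.
BLAST_RADIUS_FLOOR = {
    "tag_update": 0,
    "stop_instance": 0,
    "delete_snapshot": 50,
    "schema_migration": 50,
    "cross_account_iam_change": 70,
    "stop_reconfigure_start": 50,
}


def gated_score(raw_score, remediation_type):
    """Raise the score to the remediation's floor; never lower it."""
    # Assumption: an unknown remediation type is treated as non-idempotent.
    floor = BLAST_RADIUS_FLOOR.get(remediation_type, 50)
    return max(raw_score, floor)


def action_policy(score):
    if score < 30:
        return "unattended"
    if score < 70:
        return "notification_window"
    return "approval_required"


if __name__ == "__main__":
    for kind in ("stop_instance", "delete_snapshot", "cross_account_iam_change"):
        score = gated_score(15, kind)
        print(kind, score, action_policy(score))
```

A raw score of 15 stays unattended for an idempotent stop, moves to the notification window for a snapshot delete, and needs approval for a cross-account IAM change.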

The blast radius score is a pre-execution risk estimate. It is not a guarantee. A resource that scores 15 can still cause an incident if the flow logs had a 12-hour gap and missed an overnight connection. The score reduces the probability of the wrong-scope failure from 12-18% to under 1%. It does not reduce it to zero. The ZopNight trust score model uses blast radius as one input alongside recommendation confidence and business hours context. No single signal is the gate. The gate is the composite.

Autonomous cloud is safe when the system knows what it does not know and routes accordingly. Blast radius is the model of what it does not know.

Most Traffic Spikes Are Predictable. So Why Are We Still Panic-Scaling?

Muskan — Tue, 19 May 2026 12:01:24 +0000

The usual playbook when a big event is coming: someone sends a Slack message three hours before launch asking, "Did we scale up?" A senior engineer logs into the AWS console, eyeballs the current desired count, multiplies by something, and manually bumps the number. Then forgets to roll it back.

That's the part nobody talks about. The spike passes, the instance count stays at 3x, and you're burning money for two days because everyone assumed someone else would fix it.
We ran into this enough times that we built a proper way to handle it.

What it does

Event Readiness turns a planned traffic spike into a structured scaling plan. You define the event window, set an expected load multiplier, and attach the autoscaling groups you want to scale.

ZopNight handles the rest: it pre-scales before the event starts, holds capacity during it, and rolls everything back when it ends. No manual intervention. No forgetting to scale down.

How it works

You create a plan against your existing autoscaler policies:

Pick the event window (start, end, timezone)
Set a load multiplier — e.g., 3x for a campaign expecting 3x traffic
Attach target policies: AWS ASG, GCP MIG, or Azure VMSS
Set a pre-scale buffer (we default to 30 minutes before event start)

ZopNight calculates the scaled min/max/desired for each target before you commit. If CPU metrics aren't available, it tells you it's estimating and why.
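As a rough sketch of that calculation: multiply each bound by the load multiplier and round up. The rounding rule, the guard that keeps max at or above desired, and the field names are assumptions; the actual product may also factor in CPU metrics, which this ignores.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class GroupSize:
    min_size: int
    max_size: int
    desired: int


def scale(size, multiplier):
    """Scale every bound by the multiplier, rounding up to whole instances."""
    desired = math.ceil(size.desired * multiplier)
    min_size = math.ceil(size.min_size * multiplier)
    # Keep max at or above desired so the group can actually reach it.
    max_size = max(math.ceil(size.max_size * multiplier), desired)
    return GroupSize(min_size, max_size, desired)


def prescale_time(event_start, buffer_minutes=30):
    """When the executor should start scaling up, given the pre-scale buffer."""
    return event_start - timedelta(minutes=buffer_minutes)


if __name__ == "__main__":
    current = GroupSize(min_size=2, max_size=10, desired=4)
    print(current, "->", scale(current, 3))
    print(prescale_time(datetime(2026, 5, 19, 12, 0)))
```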

Before saving, a preview step shows you exactly what will happen to every target: its current size, the scaled size, and the timestamps at which the executor will fire. No surprises.

Once scheduled, the plan moves through a clear state machine:

draft → scheduled → scaling_up → active → scaling_down → completed

If something fails, it lands in failed, and you can retry from draft.
Cancelling from any active state rolls back whatever was already scaled.
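That state machine is small enough to sketch as a transition table. The state names are the ones listed above; the `cancelled` state, the exact set of states that can fail, and which states count as "active" for rollback are assumptions made for the example.

```python
# Allowed transitions. "cancelled" is a hypothetical terminal state.
TRANSITIONS = {
    "draft": {"scheduled"},
    "scheduled": {"scaling_up", "cancelled"},
    "scaling_up": {"active", "failed", "cancelled"},
    "active": {"scaling_down", "failed", "cancelled"},
    "scaling_down": {"completed", "failed"},
    "failed": {"draft"},  # retry goes back through draft
    "completed": set(),
    "cancelled": set(),
}

# Assumption: capacity has already changed in these states.
NEEDS_ROLLBACK = {"scaling_up", "active"}


class Plan:
    def __init__(self):
        self.state = "draft"
        self.history = ["draft"]
        self.rolled_back = False

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "cancelled" and self.state in NEEDS_ROLLBACK:
            # A real executor would restore the original group sizes here.
            self.rolled_back = True
        self.state = new_state
        self.history.append(new_state)


if __name__ == "__main__":
    plan = Plan()
    for step in ("scheduled", "scaling_up", "active", "scaling_down", "completed"):
        plan.move_to(step)
    print(" -> ".join(plan.history))
```

Keeping the transitions in one table means an illegal jump, such as draft straight to active, fails loudly instead of leaving a plan in a state nobody planned for.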

The cost estimate

One thing we added that most teams don't have: upfront cost visibility. Before you schedule the plan, you can see the estimated extra cost per target, per hour, in dollars. Not after the event. Before it.

For a plan running 8 hours at 3x capacity across two ASGs, that number is usually a lot smaller than the cost of the event going down.
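For a feel of the arithmetic, here is a back-of-the-envelope version. It assumes extra cost is simply the added instances times an hourly price; the group names, sizes, and prices are invented, and the pre-scale buffer is left out for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    current_desired: int
    scaled_desired: int
    hourly_rate_usd: float  # hypothetical per-instance price


def extra_cost_per_hour(target):
    """Cost of the added instances only, per hour."""
    return (target.scaled_desired - target.current_desired) * target.hourly_rate_usd


def plan_extra_cost(targets, event_hours):
    return sum(extra_cost_per_hour(t) * event_hours for t in targets)


if __name__ == "__main__":
    # Two ASGs scaled to 3x for an 8-hour event.
    targets = [
        Target("web-asg", 4, 12, 0.10),
        Target("worker-asg", 2, 6, 0.20),
    ]
    for t in targets:
        print(f"{t.name}: ${extra_cost_per_hour(t):.2f} extra per hour")
    print(f"total for 8 hours: ${plan_extra_cost(targets, 8):.2f}")
```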

How does your team handle planned traffic spikes right now? Manual scaling, scripts, or something else?

Verified Schedule Savings vs Estimated Savings: Why the Difference Matters to Your CFO

Muskan — Mon, 18 May 2026 11:08:00 +0000

Every engineering team reports cloud savings at some point. The number goes into a slide, a Jira ticket, or a quarterly review. Then a CFO or finance lead asks one follow-up question: "Can you prove it?"

Most teams cannot. They have a configured schedule and a projected saving. They do not have evidence that the schedule executed, that the resource actually stopped, or that the cost was genuinely avoided. The projected number and the real number are different, and without the distinction, the savings figure is not auditable.

ZopNight's Cost Reports page reports two savings numbers deliberately: Estimated Schedule Savings and Verified Schedule Savings. The gap between them is not a reporting artefact. It is a governance metric called the savings verification gap. This post explains what each number measures, why the gap matters, and how to give your CFO the audit-ready view they need.

Estimated Savings Is a Projection, Not a Measurement

Estimated Schedule Savings are calculated from configuration. The formula is straightforward: scheduled downtime hours multiplied by the hourly resource rate. If a virtual machine costs $0.50 per hour and your non-production schedule stops it for 200 hours a month, the estimated saving is $100.

The calculation is correct when schedules execute cleanly. It is wrong in three specific ways.

Resource locks and dependency failures. Some resources cannot be stopped when a schedule triggers. A VM with an attached managed disk that another process is writing to, a database instance with an active connection pool that blocks shutdown, a container that a health check is actively querying. The schedule fires. The stop command fails. The resource stays running. The estimated saving is counted. The real saving is zero.

Manual overrides. A developer needs an environment to stay live past its scheduled shutdown. They override the schedule for the night. The schedule was configured. The saving was projected. The resource ran. This is legitimate in isolation. At scale, across a team of 20 engineers over a month, it accumulates into a consistent gap between what was projected and what actually happened.

Timezone and CRON misconfiguration. A schedule set to stop a resource at 8 PM in one timezone fires at 8 PM UTC instead. The resource runs for an additional 5 hours before the next maintenance window corrects it. The estimated saving assumed perfect timing. The actual saving was shorter.

Failure mode	Cause	Cost consequence
Resource lock	Dependency blocking stop command	Full estimated saving uncollected
Manual override	Developer keeps resource live past schedule	Partial or full saving lost for that period
CRON misconfiguration	Wrong timezone, incorrect window	Saving window shorter than configured

None of these failures appear in estimated savings. They are all invisible until you compare estimated against verified.
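The timezone case is easy to reproduce. This sketch assumes a team whose local time is UTC+5, which matches the 5 extra hours in the example; the dates, the fixed offset, and the $0.50 hourly rate are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical team timezone at a fixed UTC+5 offset.
LOCAL = timezone(timedelta(hours=5))

# What the team meant: stop at 8 PM local time.
intended_stop = datetime(2026, 5, 18, 20, 0, tzinfo=LOCAL)
# What the scheduler did: read the same "20:00" as UTC.
actual_stop = datetime(2026, 5, 18, 20, 0, tzinfo=timezone.utc)


def extra_runtime_hours(intended, actual):
    """Hours the resource kept running past the intended stop."""
    return (actual - intended).total_seconds() / 3600


if __name__ == "__main__":
    hours = extra_runtime_hours(intended_stop, actual_stop)
    hourly_rate = 0.50
    print(f"ran {hours:.0f} extra hours, ${hours * hourly_rate:.2f} of projected saving lost per night")
```

The estimated saving still assumes the stop happened at 8 PM local, which is exactly why this failure never shows up in the projected number.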

Verified Savings Is a Measurement, Not a Projection

Verified Schedule Savings are calculated from resource state transitions. ZopNight does not count a saving when a schedule fires. It counts a saving when the resource state changes from running to stopped and the state change is confirmed.

Each confirmed state transition is recorded in Execution History with a timestamp, a resource ID, the action taken, and the result. The saving is written against that record. If the state transition does not happen, the record shows a failure, and no saving is counted.

This produces a number that is always lower than or equal to estimated savings. It can never be higher. Every saving in the verified total has a corresponding execution record. That record is the audit trail.

The Execution History is what changes the conversation with finance. "We saved $8,200 this month" is a claim. "We saved $8,200 this month and here are 340 execution records showing each state transition that produced it" is evidence. Only one of those survives a finance review.

The Savings Verification Gap Is Your Schedule Reliability Score

The savings verification gap is Estimated Schedule Savings minus Verified Schedule Savings. Expressed as a percentage of Estimated Schedule Savings, it measures the fraction of configured savings that schedules failed to deliver.

A gap of 18% means 18% of projected savings were never delivered. Those resources ran when they should not have. The cost was incurred and was not offset by any verified saving. The gap does not tell you which resources failed, but Execution History does. Each entry in the failure log shows which resource, which schedule, which action, and what error caused the failure.

This is the accountability link between engineering and finance. Engineering teams commit to a savings target when they configure schedules. The gap measures how much of that commitment was delivered. A team with a consistent 5% gap has reliable schedules. A team with a 30% gap has a schedule execution problem, and the gap is the first place to look.

The gap also separates two different problems. A large gap caused primarily by resource locks points to a dependency management issue: schedules are configured for resources that cannot be stopped cleanly. A large gap caused primarily by manual overrides points to a process issue: engineers are bypassing schedules regularly. The Execution History distinguishes between the two.
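The two numbers and the gap between them fit in a short sketch. It assumes the gap is reported as a share of estimated savings; the record fields and the sample data are hypothetical and are not the Execution History schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRecord:
    resource_id: str
    scheduled_hours: float
    hourly_rate: float
    confirmed_stopped: bool  # True only when the state change was confirmed


def estimated_savings(records):
    """Projection: scheduled downtime hours times hourly rate."""
    return sum(r.scheduled_hours * r.hourly_rate for r in records)


def verified_savings(records):
    """Measurement: only confirmed state transitions count."""
    return sum(r.scheduled_hours * r.hourly_rate for r in records if r.confirmed_stopped)


def verification_gap(records):
    """Share of estimated savings that was never delivered."""
    estimated = estimated_savings(records)
    if estimated == 0:
        return 0.0
    return (estimated - verified_savings(records)) / estimated


if __name__ == "__main__":
    records = [
        ExecutionRecord("vm-dev-1", 200, 0.50, confirmed_stopped=True),
        ExecutionRecord("vm-dev-2", 100, 0.25, confirmed_stopped=False),
    ]
    print(f"estimated ${estimated_savings(records):.2f}")
    print(f"verified  ${verified_savings(records):.2f}")
    print(f"gap       {verification_gap(records):.0%}")
```

Because verified savings only ever sums a subset of the estimated records, it cannot exceed the estimate, which is the property that makes the number defensible in a finance review.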

Savings Rate and Budget Health: The CFO View

Savings Rate is Verified Schedule Savings divided by Current Estimated Spend, expressed as a percentage. If you are spending an estimated $40,000 per month and your verified savings are $9,200, your Savings Rate is 23%.

Savings Rate is the single executive metric. It answers "how effectively are our schedules converting configured downtime into actual savings?" It normalises verified savings against spend, so it remains meaningful as the infrastructure footprint changes.

Budget Health adds the commitment layer. zopnight's Budget Overview tracks Total Budget, Total Spend, and Budget Health at the organizational level. Budget Health answers whether current spend is inside committed budget. It connects verified savings, the amount actually reduced, to the broader financial picture.

Metric	What it measures	Who uses it	Audit-ready
Current Estimated Spend	Projected spend at current run rate	Engineering, FinOps	No
Verified Schedule Savings	Confirmed savings from executed state changes	Finance, CFO	Yes
Savings Rate	Verified savings as percentage of estimated spend	Leadership, executives	Yes
Cost Trends Over Time	Spend trajectory over configurable periods	FinOps, budget owners	No
Budget Health	Spend vs committed organizational budget	CFO, finance team	Yes

"Forecastable. Audit-ready." is a specific claim that splits these five metrics into two groups. Verified Schedule Savings, Savings Rate, and Budget Health are audit-ready because they are grounded in confirmed state transitions and committed budget figures. Current Estimated Spend and Cost Trends Over Time are forecastable because they project from current run rate. The distinction is not aesthetic. It determines what you can show a finance committee.

Governance Applied to the Governance Platform: RBAC on Financial Data

Access control on cost and budget data matters as much as access control on infrastructure. A finance lead reviewing verified savings should not be able to accidentally modify a schedule. A developer checking their team's budget health should not be able to adjust the organizational budget threshold.

zopnight's RBAC, rebuilt from the ground up, provides graduated access across three system roles:

Role	Policies	What it enables for Cost Reports
Viewer	16	Read access to Cost Reports, Budget Overview, Verified Savings, and Audit Logs
Editor	32	Viewer permissions plus schedule management, tag policy configuration
Admin	52	Full control including budget threshold management and user role assignment

The right pattern for financial reporting is that finance team members get the Viewer role. They can read every number on the Cost Reports page, drill into Budget Health, and export verified savings data. They cannot touch schedules, budgets, or governance policies. That separation is not a limitation. It is the governance guarantee that makes the numbers trustworthy.

Custom roles extend this further. If a specific finance lead needs read access to cost data but should not see infrastructure topology, a custom role scopes exactly that. The 16-policy Viewer baseline provides the floor. Custom roles allow precise trimming above it.

The phrase "governance applied to the governance platform itself" captures the design intent. zopnight enforces cloud governance policies for your infrastructure. Its own access model applies the same principle internally: every action is scoped to a role, every role is assigned explicitly, and no one gets more access than their function requires.

This is what makes the CFO conversation work. When a finance lead logs into zopnight with Viewer access and sees Verified Schedule Savings of $9,200 with a 23% Savings Rate and a Budget Health status of on-track, they are reading numbers produced by confirmed state transitions, scoped to their role, backed by an execution audit trail. That is a number they can put in a board report.

Estimated savings gets you to the conversation. Verified savings gets you through it.

The $90k Observability Bill: Why Your Cardinality Limit Is the One Knob That Matters

Muskan — Mon, 18 May 2026 11:07:37 +0000

The observability bill at a 50-engineer org goes from $8,000/month in year one to $90,000/month by year three. The growth never gets a budget review because each individual instrumentation change looks tiny: a new metric, a new tag on an existing metric, a per-customer dimension someone added to debug a spike. None of those changes show up on the monthly bill as a single line item. They show up as the bill quietly compounding because each one of them multiplied the number of unique series the vendor stores.

The teams that try to fix this usually focus on data volume (samples per second, ingest rate, log lines per day) because that is what the vendor's dashboard surfaces in big numbers. Data volume is 5-10% of the bill at most vendor pricing tiers in 2026. Cardinality, the count of unique time series your metrics generate, is 50-70% of the bill. Optimizing for ingest rate cuts 5% when 60% is available.

The single knob that actually controls observability cost is cardinality, specifically the count of unique tag-value combinations per metric. A 90-day cardinality-first review at a typical mid-market org cuts $35,000 to $60,000 from the monthly bill with no loss of diagnostic capability and no vendor migration. The work is 2-4 engineer-weeks. The payback is positive in month one and compounds because the cost growth curve flattens, not just the level.

This piece is the operator's guide to that review. It composes with the piece on the observability bill (Datadog vs CloudWatch costs), which works on vendor pricing models, but the angle here is one layer deeper: the property of your own instrumentation that drives the bill regardless of vendor.

The $90k observability bill nobody planned for

Look at the spend trajectory for a typical mid-market SaaS over three years.

Year	Monthly observability bill	Volume contribution	Cardinality contribution	What changed
1	$8,000	$1,800	$5,200	Initial instrumentation; ~250k series
2	$32,000	$4,000	$25,000	Per-customer dimensions added; ~1.4M series
3	$90,000	$7,500	$76,000	Trace IDs leaked into metric labels; ~4.8M series

The volume column grows linearly with the system (more requests, more events, more log lines per minute). The cardinality column grows faster than linearly because each new tag multiplies the existing series count. By year three the cardinality cost is 8-12x the volume cost on the same instrumentation surface.

The bill conversation usually starts with the wrong number. A platform team looks at the bill, sees that Datadog charges per ingested log GB and per million metric samples, and starts a project to "reduce ingest volume." They turn off DEBUG logs, sample non-critical traces, and lengthen the metric sample interval from 10s to 60s. The bill drops $3,000/month. The team celebrates, and the cost keeps growing because the cardinality knob is untouched.

The right conversation starts at cardinality: which metrics generate the most unique series, why, what are the tags driving the explosion, and which of those tags actually need to be on a metric versus on a trace or log.

Cardinality math: 4 tags can produce 1.4B series

Cardinality is the product of all tag values on a metric. A counter http_requests_total with no tags is 1 series. Add method (8 values: GET, POST, PUT, DELETE, etc.) and it is 8 series. Add endpoint (300 routes) and it is 8 × 300 = 2,400 series. Add status (12 HTTP codes) and it is 8 × 300 × 12 = 28,800 series. Still cheap. Now add user_id with 50,000 values: http_requests_total × method (8) × endpoint (300) × status (12) × user_id (50,000) = 1,440,000,000 potential series.

A single counter just produced 1.44 billion potential series. In practice the actual count is much lower (most combinations never fire) but the live cardinality typically lands at 30-60% of the potential, which on this example is 430M to 860M series. At Datadog's $0.05/series/month for the standard tier, that one counter costs $21M to $43M per month.
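
The multiplication is easy to reproduce. This is a small sketch using the tag counts and the $0.05/series/month figure quoted above; the 30-60% live fraction is the text's own estimate, and the variable names are just for illustration.

```python
from math import prod

# Distinct values per tag on http_requests_total, as in the example above.
tag_values = {"method": 8, "endpoint": 300, "status": 12, "user_id": 50_000}

potential = prod(tag_values.values())
without_user_id = prod(n for tag, n in tag_values.items() if tag != "user_id")

PRICE_PER_SERIES = 0.05  # $/series/month, the figure quoted in the text
live_low, live_high = 0.30 * potential, 0.60 * potential

print(f"potential series: {potential:,}")
print(f"without user_id:  {without_user_id:,}")
print(f"monthly cost: ${live_low * PRICE_PER_SERIES / 1e6:.1f}M "
      f"to ${live_high * PRICE_PER_SERIES / 1e6:.1f}M")
```

Dropping a single tag from the dictionary shows how much of the total that tag is responsible for, which is the question the rest of this piece keeps asking.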

The vendor does not stop you from creating this cardinality. It just bills you for it. The bill arrives, the platform team sees the spike, the question becomes: which metric did this, and which tag is the culprit?

Metric example	Tags	Tag value counts	Total series
`http_requests_total`	method, endpoint, status	8 × 300 × 12	28,800
`http_requests_total`	method, endpoint, status, region	8 × 300 × 12 × 5	144,000
`http_requests_total`	method, endpoint, status, region, user_id	8 × 300 × 12 × 5 × 50,000	7,200,000,000
`db_query_duration`	query_name, db_name	80 × 6	480
`db_query_duration`	query_name, db_name, customer_id	80 × 6 × 400	192,000
`db_query_duration`	query_name, db_name, customer_id, session_id	80 × 6 × 400 × 1,000,000	192,000,000,000

The high-uniqueness identifiers in the table (user_id, customer_id, session_id) are the cardinality detonators. Each one is a per-user, per-customer, or per-session identifier that has no business being a metric dimension. They belong on traces (where each trace is a single event, not a time series) or logs (where each line is a single record). Putting them on a metric multiplies the series count by the cardinality of the identifier.

The three high-cardinality offenders

Most observability bill overruns reduce to three specific tag classes. Removing or aggregating them recovers 40-65% of the bill with zero diagnostic loss.

Tag class	Why it lands on metrics	Where it actually belongs	Bill impact when removed
`user_id` / `customer_id`	"Per-tenant visibility" demand	Trace span attribute	30-50%
`trace_id` / `span_id`	Accidental metric labeling	Already in trace, never metric	10-25%
`version` / `build_id` / `git_sha`	Added for deploy debugging, never pruned	Trace metadata; metric only for last N versions	5-15%

user_id / customer_id on shared metrics. A team wants "per-tenant API latency." Someone adds customer_id to the api_request_duration histogram. Series count multiplies by customer count. The dashboard shows per-customer p99 latency, which the team uses three times in six months. The bill triples. The right answer: keep customer_id on the trace span; query traces for per-tenant analysis; keep the metric to a rollup (api_request_duration × endpoint × status only, no per-customer breakdown).

trace_id promoted to a metric label. A common bug pattern: an OpenTelemetry SDK is misconfigured to copy trace context attributes onto every emitted metric. trace_id is unique per request (effectively infinite cardinality from the metric's perspective). The vendor bill shows millions of one-sample series. The fix is at the SDK / collector level: explicit allow-list of attributes to copy from trace to metric, blocking trace_id and span_id by default.
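
The allow-list idea can be sketched in a few lines of plain Python. This is not the OpenTelemetry SDK or collector API; in a real deployment the same rule would live in the SDK or collector configuration. The allow-list contents and the function name are assumptions for illustration.

```python
# Hypothetical allow-list: only these attributes may become metric labels.
ALLOWED_METRIC_ATTRS = frozenset({"method", "endpoint", "status"})


def metric_labels(span_attrs: dict[str, str]) -> dict[str, str]:
    """Keep only allow-listed attributes when deriving metric labels from a span."""
    return {k: v for k, v in span_attrs.items() if k in ALLOWED_METRIC_ATTRS}


span_attrs = {
    "method": "GET",
    "endpoint": "/orders",
    "status": "200",
    "trace_id": "example-trace-id",  # unique per request, must never reach a metric
    "span_id": "example-span-id",
    "customer_id": "acme",
}
labels = metric_labels(span_attrs)
print(labels)
```

An allow-list fails closed: a new attribute added to spans next quarter stays off metrics until someone deliberately adds it, which a block-list of trace_id and span_id alone would not guarantee.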

version / build_id not pruned. Deploy lands; instrumentation tags every metric with version=v1.2.3 so the team can compare pre- and post-deploy behavior. Three weeks later there are 40 versions in the tag values, each with its own series. The team only ever queries the last 2-3 versions. The fix: tag with version, but at the collector level prune any version older than 30 days from the metric pipeline (traces and logs can keep the full history because they age out on their own retention curve).
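
One way to sketch the pruning rule, assuming the pipeline knows when each version was first deployed. Collapsing old versions into a single "older" bucket (instead of dropping those series) is a design choice made here for illustration; the names and dates are hypothetical.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # the 30-day window from the text


def prune_version(labels: dict, first_seen: dict, today: date) -> dict:
    """Rewrite the version label to 'older' once the version is past MAX_AGE."""
    if "version" not in labels:
        return labels
    # Versions we have never seen are treated as brand new.
    seen = first_seen.get(labels["version"], today)
    if today - seen > MAX_AGE:
        return {**labels, "version": "older"}
    return labels


first_seen = {"v1.2.3": date(2026, 5, 10), "v1.0.0": date(2026, 3, 1)}
today = date(2026, 5, 18)
recent = prune_version({"endpoint": "/orders", "version": "v1.2.3"}, first_seen, today)
stale = prune_version({"endpoint": "/orders", "version": "v1.0.0"}, first_seen, today)
print(recent, stale)
```

With this rule the version tag contributes at most "versions deployed in the last 30 days, plus one" values, no matter how long the service has been running.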

The pattern across all three: high-uniqueness identifiers belong on traces and logs (which the vendor bills very differently and which scale fine with cardinality) rather than on metrics (which compound). The OpenTelemetry three-pillar separation (metrics, traces, logs) exists precisely so that each type of telemetry can handle the data class it is good at. Cardinality goes on traces and logs; metrics stay aggregated.

The cardinality report (one engineer-day, highest-leverage observability work)

The work to make cardinality manageable starts with measurement. Every major observability vendor exposes per-metric cardinality somehow; the surface is just not in the default dashboard.

Vendor	Cardinality introspection	Where to find it
Datadog	Metric summary page; `datadog.estimated_usage.metrics_*`	Per-metric panel in Metrics Explorer
Prometheus	`prometheus_tsdb_head_series` for the total; `topk()` over `count by (__name__)` for per-metric counts	Self-monitoring scrape
Honeycomb	Dataset cardinality view	Per-dataset settings → cardinality
Grafana Mimir / Cortex	`cortex_ingester_active_series`	Self-monitoring
New Relic	Metric cardinality limit warnings in usage UI	Account → Usage

The weekly cardinality report is one engineer-day to build and is the single highest-leverage piece of observability work most teams can ship in 2026. It contains:

Column	Purpose
Metric name	Identification
Series count (now)	Current cardinality
Series count (7d ago)	Growth detection
Top 3 tags by value count	Which dimensions are driving it
Cap	The configured per-metric limit
Action	"Over cap", "Approaching cap", "Healthy"

The report runs weekly, posts to a #observability-cost Slack channel, and surfaces the top 20 metrics by series count plus any metric that grew >50% week-over-week. The platform team reviews the report in 15 minutes. Most weeks there is no action; the weeks where a new high-cardinality tag landed (often unintentionally) catch it before the next billing cycle.
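
The core of that report fits in a short script. This is a sketch under assumptions: the per-metric series counts are already fetched from the vendor, "Approaching cap" starts at 80% of the cap, and the "Top 3 tags" column is left out for brevity. The metric names and caps in the sample data are made up.

```python
from dataclasses import dataclass


@dataclass
class MetricStat:
    name: str
    series_now: int
    series_7d_ago: int
    cap: int


def action(m: MetricStat, warn_ratio: float = 0.8) -> str:
    if m.series_now > m.cap:
        return "Over cap"
    if m.series_now >= warn_ratio * m.cap:
        return "Approaching cap"
    return "Healthy"


def grew_fast(m: MetricStat, threshold: float = 0.5) -> bool:
    """True when the series count grew more than 50% week-over-week."""
    if m.series_7d_ago == 0:
        return m.series_now > 0
    return (m.series_now - m.series_7d_ago) / m.series_7d_ago > threshold


def weekly_report(stats: list[MetricStat], top_n: int = 20) -> list[tuple]:
    ranked = sorted(stats, key=lambda m: m.series_now, reverse=True)
    # Top N by series count, plus anything outside the top N that grew fast.
    rows = ranked[:top_n] + [m for m in ranked[top_n:] if grew_fast(m)]
    return [(m.name, m.series_now, m.series_7d_ago, m.cap, action(m)) for m in rows]


stats = [
    MetricStat("http_requests_total", 28_800, 27_000, 50_000),
    MetricStat("api_request_duration", 61_000, 9_000, 50_000),
    MetricStat("db_query_duration", 4_200, 4_100, 5_000),
]
report = weekly_report(stats)
for row in report:
    print(row)
```

Posting the rows to Slack and pulling real counts from the vendor are the parts that take the rest of the engineer-day.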

The team that does not have this report has no way to know which metric is driving the bill until the bill arrives. The team that has the report fixes the cardinality issue in the week it appears, not in the quarter after the bill review.

Aggregation: move high-cardinality data to traces and logs

The right place for high-cardinality data is determined by the OpenTelemetry three-pillar separation. Each pillar has a different cost-vs-detail tradeoff and the high-cardinality identifiers go to the pillars that handle them well.

Data class	Pillar	Why
Request counts by method/status/endpoint	Metric	Low cardinality, queried as time series
Per-customer latency analysis	Trace	High cardinality, queried per-request
Per-user error rate	Trace	High cardinality identifier
Aggregate error rate by service	Metric	Low cardinality
Audit events (who did what, when)	Log	Free-form, often compliance-driven
Trace-level diagnostic detail	Trace	Designed for it
Deploy markers (version comparison over 24h)	Metric (with TTL on version tag)	Pruned automatically

The rule of thumb: if a dimension's value count exceeds 100 unique values across a 7-day window, it does not belong on a metric. It belongs on a trace or a log. The vendor's trace and log products price differently (per-event, per-byte, with sampling) and high cardinality is normal for them. The metric product prices per series; high cardinality is the cost detonator.
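
As a sketch, the rule of thumb is a distinct-count check over a 7-day window. The function name and the sample tags are illustrative, and the input is assumed to be the set of values each tag took during the window.

```python
def placement(tag_values_7d: dict[str, set[str]], limit: int = 100) -> dict[str, str]:
    """Map each tag to 'metric' or 'trace/log' by its distinct value count."""
    return {
        tag: "metric" if len(values) <= limit else "trace/log"
        for tag, values in tag_values_7d.items()
    }


observed = {
    "status": {"200", "404", "500"},
    "customer_id": {f"c{i}" for i in range(400)},
}
print(placement(observed))
```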

Cap and alert per metric

The discipline that makes cardinality manageable in the long run is per-metric caps. Without caps, cardinality grows monotonically: every new dimension is a marginal addition that "doesn't seem that big." With caps, the team has to make a conscious decision when a metric approaches its limit: do we raise the cap (and accept the cost), remove the dimension (and lose some detail), or aggregate harder (and trade some precision)?

Metric tier	Cap	Typical use
Tier 1 (critical, high-value)	50,000 series	Customer-facing latency, error rates, SLO inputs
Tier 2 (standard)	5,000 series	Internal service health, deploy markers, batch job metrics
Tier 3 (debugging only)	500 series	Ephemeral metrics added during investigation, must be removed after

The cap alerts fire at 80% of the limit. The platform team gets a ping; the metric's owner has two weeks to either justify a cap raise (with a budget impact estimate) or reduce the cardinality. If neither happens, the metric is downgraded a tier (which lowers its cap and forces the owner to address it).

The numbers above are illustrative; the right caps depend on your vendor's pricing tier. The right way to set them is to start from the current cardinality distribution and pick caps that allow 95% of metrics to fit Tier 2 with the 5% legitimately-needed-high-cardinality metrics in Tier 1. Tier 3 is the safety valve for debugging that should always be temporary.
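
Picking the Tier 2 cap from the current distribution can be sketched as a nearest-rank percentile. The function name is hypothetical, and rounding the result up to a tidy number is left to the reader.

```python
import math


def tier2_cap(series_counts: list[int], coverage_pct: int = 95) -> int:
    """Smallest cap under which coverage_pct% of metrics fit (nearest-rank)."""
    if not series_counts:
        raise ValueError("need at least one metric")
    ordered = sorted(series_counts)
    rank = math.ceil(coverage_pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative distribution: 100 metrics with 1..100 series each.
counts = list(range(1, 101))
print(tier2_cap(counts))
```

Metrics above the returned value are the candidates for Tier 1, and each of them should come with a stated reason for the extra cardinality.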

The AI-agent special case

AI agent fleets create a cardinality problem that ordinary instrumentation rules do not catch. A fleet that emits per-invocation metrics labeled with agent_id (47 agents) and request_id (5 million requests/day per agent) produces 235 million unique series per day from just one metric. The cardinality compounds across the metric set; even a small agent fleet can outspend the rest of the org's observability bill in a quarter.

The fix is per-agent metric aggregation: emit one metric per agent per minute instead of one per invocation.

Approach	Cardinality / day	Diagnostic capability
Per-invocation metric (`agent_id` + `request_id`)	235,000,000 series	Per-request drilling (impossible to query anyway at this scale)
Per-agent per-minute aggregate (counter + histogram)	47 series × 1,440 min = 67,680	Per-agent rate + latency distribution
Per-request data → traces (sampled at 1%)	(cardinality moves to trace product)	Per-request when needed, sampled

The per-agent-per-minute aggregate uses a counter (agent_invocations_total{agent_id}) for rate and a histogram (agent_latency_ms{agent_id}) for distribution. Together they answer the questions the per-invocation metric was meant to answer (how often does each agent fire, what is the latency distribution) at 1/3000th the cardinality cost. The per-request detail that is genuinely needed (which request was slow, what was the failure) lives on traces with sampling, where the cost model handles per-request data natively.
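
A minimal sketch of that aggregation, assuming invocations arrive as (agent_id, request_id, unix_ts, latency_ms) tuples. The histogram bucket bounds are hypothetical, and a real implementation would use the metrics SDK's counter and histogram instruments instead of plain dictionaries.

```python
from collections import Counter, defaultdict

BUCKETS_MS = (100, 500, 1_000, 5_000, float("inf"))  # hypothetical bucket bounds


def aggregate(invocations):
    """Roll per-invocation events up to one counter and one histogram per agent-minute."""
    counts = Counter()
    histograms = defaultdict(Counter)
    for agent_id, _request_id, ts, latency_ms in invocations:
        key = (agent_id, int(ts // 60))  # request_id is deliberately dropped here
        counts[key] += 1
        bucket = next(b for b in BUCKETS_MS if latency_ms <= b)
        histograms[key][bucket] += 1
    return counts, histograms


events = [
    ("planner", "r1", 0, 80),
    ("planner", "r2", 30, 700),
    ("planner", "r3", 61, 90),
    ("coder", "r4", 10, 6_000),
]
counts, histograms = aggregate(events)
print(dict(counts))
```

The key never contains request_id, so the number of series is bounded by the number of agents, however many requests they handle.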

The pattern composes with the per-agent token quotas work: the quota system already knows each agent's identity and rate; the observability metric can be a side-effect of the quota counter rather than a separate instrumentation. One source of truth, one cardinality.

Why dropping the vendor is the wrong fix

The first-instinct fix when the observability bill shocks finance is to put the vendor up for review. Solicit a quote from Honeycomb, from Grafana Cloud, from a self-hosted Prometheus + Loki + Tempo stack. The numbers look compelling because the alternative vendor's bill is based on your current usage at their pricing, and migration projections always look optimistic.

The migration math is not optimistic in practice.

Item	Vendor migration	Cardinality fix
Time to first cost reduction	6-12 months (post-migration)	4-8 weeks
Engineer-weeks invested	24-50 (instrumentation rewrite, dashboard rebuild, alert recreation, runbook updates)	2-4
Risk of degraded incident response during transition	High (parallel systems, alert gaps, training cost)	None
Bill reduction after work complete	30-50% if cardinality is fixed on new vendor; 0% if not	40-65%
Cardinality problem replicates on new vendor?	Yes (it is a property of your instrumentation)	N/A (problem is removed)

The migration only pays off if the cardinality problem is fixed in the new system. Otherwise the new vendor's bill grows the same way the old one did, just from a lower starting base. Teams that migrate without fixing cardinality discover this in year two on the new vendor and are back where they started.

The cardinality fix on the current vendor is faster, cheaper, lower-risk, and reduces the bill by a similar percentage. The vendor switch may still make sense for product reasons (better trace UX, different SLO tooling, vendor-specific features), but it is not the cost fix. The cost fix is cardinality.

A typical mid-market org running the 90-day cardinality review recovers $35,000 to $60,000 per month within the first quarter. The compounding effect is more valuable than the level: the cost growth curve flattens because the cardinality discipline is now in place. By year four, an org that ran the cardinality review is at $40-50k/month observability spend; an org that did not is at $130-180k/month on the same engineering surface.

Set up the weekly cardinality report. Identify the top 5 metrics by series count. Find the user_id, trace_id, or version tag driving each one. Move those dimensions to traces or logs. The bill drops the next month and stops growing the way it used to. The one knob that matters is the one most teams never touch.

Every team has an architecture diagram. Nobody trusts it. Here's what we built instead.

Muskan — Mon, 18 May 2026 07:05:50 +0000

The actual problem with cloud architecture visibility.

The real issue isn't that teams don't document their infrastructure. It's that cloud infrastructure changes faster than any manual process can keep up with.

A developer spins up a debug RDS instance on a Friday. A new region gets added during a scaling event. A contractor deploys a service that nobody else knows about. None of these show up in any diagram because nobody updated it.

The other problem: existing tools either give you one cloud at a time, or they give you a billing view which tells you what you're spending, but not how anything connects.

What we wanted was: open Atlas, see everything, understand how it fits together. Across all three clouds. In real time.

What Atlas does

Atlas connects to your AWS, GCP, and Azure accounts in read-only mode and auto-discovers every resource. It then builds a dependency graph: not just a flat list of resources, but how they relate to each other. Which services talk to which? What sits behind which load balancer? Where are the cross-region connections?

The view scales from global (all your clouds, all your regions, one screen) down to service-level dependencies. You can zoom into a single VPC and see exactly what's running inside it.

Here's a short demo:

The part that was harder than expected

The interesting technical challenge was reconciling three completely different resource models.

AWS thinks in terms of VPCs, availability zones, and security groups. GCP thinks in terms of projects, networks, and firewall rules. Azure thinks in terms of subscriptions, resource groups, and virtual networks. Same concepts, completely different hierarchies and naming conventions.

Building a unified topology meant building a translation layer that could map these different models onto a consistent graph structure without flattening the differences that actually matter for understanding your architecture.

We also had to decide what "connected" means across clouds. A Lambda that calls a GCP Cloud Run service over HTTPS: are those connected in the topology? We landed on yes, and we show cross-cloud connections explicitly because they're often the least-understood part of a multi-cloud setup.
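
To make the idea concrete, here is a rough sketch of what a unified graph could look like. It is not Atlas's actual data model: the Node fields, the normalized kind names, and the resource names are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    cloud: str        # "aws", "gcp", or "azure"
    kind: str         # normalized kind shared across clouds
    native_type: str  # the provider's own term, kept so differences are not flattened
    name: str


@dataclass
class Topology:
    edges: set = field(default_factory=set)

    def connect(self, src: Node, dst: Node, via: str) -> None:
        self.edges.add((src, dst, via))

    def cross_cloud_edges(self) -> list:
        return [e for e in self.edges if e[0].cloud != e[1].cloud]


fn = Node("aws", "function", "lambda", "checkout-handler")
svc = Node("gcp", "service", "cloud_run", "pricing-api")
db = Node("aws", "database", "rds", "orders-db")

topo = Topology()
topo.connect(fn, svc, via="https")  # cross-cloud call
topo.connect(fn, db, via="tcp")     # same-cloud dependency
print(topo.cross_cloud_edges())
```

Keeping both a normalized kind and the provider's native type is what lets one graph hold all three clouds without pretending a resource group is a VPC.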

Cost Per Customer for SaaS: The Unit Economics Dashboard That Killed Three Pricing Mistakes

Muskan — Fri, 15 May 2026 07:02:56 +0000

Finance computes cost per customer as total infra cost / customer count once per quarter. The number is mathematically correct and operationally useless. A B2B SaaS at $8/customer/month sounds healthy until you look at the distribution and find that one customer costs $1,400/month and another costs $0.40. The average hides everything. The 10-15% of customers whose hosting cost exceeds their MRR are invisible. The pricing tier that loses money on heavy users is invisible. The free-tier customer who is silently burning through more compute than three paying customers combined is invisible.

The structural fix is per-customer cost attribution at the cost-record level, refreshed weekly, displayed in five dashboard views, owned by product and finance. The work is not the dashboard. The work is propagating customer_id through three layers (the request path, the workload identity, the storage layer) so every cost record knows which customer it belongs to. Most SaaS data pipelines were built without this discipline; the retrofit takes 4-8 weeks of data engineering. The payback is a margin recovery of 2-5% from pricing fixes and another 8-15% infra reduction from per-customer right-sizing.

The piece composes with the 4-field chargeback schema (which solves per-team attribution at the org level) but operates one layer deeper. Per-customer is more granular than per-team and serves a different audience: product and pricing, not finance and engineering.

The quarterly cost-per-customer number is useless

Look at a typical mid-market B2B SaaS at $4M ARR with 400 customers. The quarterly cost-per-customer number reads $310/month, which is fine if MRR averages $830. The distribution tells a different story.

Cohort	% of customers	Avg cost / mo	Avg MRR / mo	Cost-to-MRR
Top 10% by usage	10%	$1,600	$2,200	73% (negative-margin)
Heavy users	20%	$720	$1,100	65%
Average	50%	$190	$830	23%
Light users	15%	$45	$400	11%
Free tier	5%	$35	$0	infinite

The blended average ($310) hides the entire structure. The top 10% by usage is operating at a 73% cost-to-MRR ratio, which is unrecoverable on a SaaS unit-economics curve. The free tier (which leadership often defends as "low cost, high signal") costs nearly as much per user as a light paying customer ($35 versus $45) while bringing in no revenue. The pricing tier is misaligned with the cost structure in a way no quarterly average will surface.

The dashboard that fixes this has to update faster than quarterly. Pricing decisions get made in the week, not the quarter; if the cost data is months old, the pricing team is flying blind. Weekly is the right cadence: fresh enough to catch a customer who just spun up a heavy workload, slow enough that the dashboard does not thrash on day-to-day usage variance.

Three-layer attribution: request, workload, storage

The work is propagating customer_id through three layers of the system. Skip any layer and the cost record is unattributable; the dashboard ends up with a 10-30% "unattributed" bucket that defeats the per-customer view.

Layer 1: request path. Every API call gets the customer_id stamped into the OpenTelemetry span at the edge. Downstream services read the span context, propagate it, and the cost record for that span has the customer_id attached. This is the easiest layer: typically a 1-2 week change to the API gateway and request middleware.

Layer 2: workload identity. Every async or batch workload (Spark job, Lambda invocation, Kafka consumer, Snowflake query) must know which customer's data it is processing. The customer_id propagates through the queue header, the workload spec, the query tag. Without this, every batch cost lands in the "shared infrastructure" bucket and the per-customer dashboard misses 40-60% of variable cost. This layer is the hardest: 4-6 weeks of data engineering to retrofit on a typical pipeline.

Layer 3: storage layer. Every storage operation (S3 read/write, RDS query, DynamoDB read) needs to be billable to a customer. The convention that works: object keys prefixed with customer_id (customer-acme-corp/orders/2026-05-08.parquet), tables partitioned by customer_id, encrypted with customer-specific KMS keys when residency matters. The cost record reads the key prefix and assigns cost. This is the layer most easily forgotten because storage cost looks small in early stages and explodes later.

The attribution layer's quality is measurable: the unattributed cost bucket as a share of total monthly spend. Healthy is under 5%. Acceptable is under 10%. Above 15% and the per-customer dashboard's numbers are misleading enough that pricing decisions made from them will be wrong.
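
Here is a sketch of the storage-layer convention and the quality measure together, assuming cost records arrive as (object_key, cost) pairs and the first path segment is the customer prefix. The second customer, the keys, and the dollar amounts are made up.

```python
from collections import defaultdict


def attribute_by_prefix(cost_records, known_customers):
    """Assign each storage cost record to the customer named in its key prefix."""
    totals = defaultdict(float)
    for key, cost in cost_records:
        prefix = key.split("/", 1)[0]
        owner = prefix if prefix in known_customers else "unattributed"
        totals[owner] += cost
    return dict(totals)


def unattributed_share_pct(totals):
    total = sum(totals.values())
    return 100 * totals.get("unattributed", 0.0) / total if total else 0.0


records = [
    ("customer-acme-corp/orders/2026-05-08.parquet", 40.0),
    ("customer-globex/events/part-0.parquet", 56.0),
    ("tmp/scratch.parquet", 4.0),  # no customer prefix, so it cannot be attributed
]
totals = attribute_by_prefix(records, {"customer-acme-corp", "customer-globex"})
print(totals, f"{unattributed_share_pct(totals):.1f}% unattributed")
```

Tracking the unattributed share as its own number makes it obvious when a new bucket or table was created without the prefix convention.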

Three pricing mistakes the dashboard catches

The first month of the dashboard surfaces three specific pricing failures that no quarterly average would have shown. ZopDev customer rollout data is consistent on these three.

Mistake 1: flat-rate pricing for resource-heavy customers.

A $99/month flat plan that includes "unlimited API calls" works fine when most customers use 10K calls/month. The dashboard surfaces the 4-7% of customers who use 5M+ calls/month at $40-$120/month in infra cost. The heaviest of them are net-negative on a flat plan. The fix: introduce a fair-use cap or a usage-based overage; grandfather existing customers with notice.

Customer	API calls/mo	Plan	Plan revenue	Infra cost	Net
Healthy customer	10K	Flat $99	$99	$3	+$96
Heavy customer	5M	Flat $99	$99	$48	+$51
Outlier customer	23M	Flat $99	$99	$190	-$91

Mistake 2: per-seat pricing where seat usage does not correlate with cost.

A $25/seat/month plan looks linear. The dashboard shows that heavy users (analysts running daily reports) consume 8-10x the compute of light users (occasional readers). A 50-seat customer with 50 heavy users costs roughly five times as much to serve as a 50-seat customer with 5 heavy users and 45 light users, but both pay the same $1,250/month. The fix: per-seat tiering by user role, or a usage component layered on top.

Mistake 3: free-tier abuse.

The free tier brings in $0 in revenue and looks free in cost too, until you see the distribution. Typically 90% of free-tier users consume 5% of free-tier infra cost (light, low-engagement). The remaining 10% consume 95%: training models on the free-tier API, scraping data, running cron jobs against the free endpoints. The fix: rate limits per free account, automatic graduation to paid above usage thresholds.

The pattern across all three: the dashboard exposes the distribution that the average hides. The product team sees the distribution, the pricing team has the data to redesign the plan, and the engineering team has a target list of customers to right-size individually.

The five-view dashboard

The minimum-viable dashboard is five views on one screen. Anything more is noise; anything less misses a class of decision.

View	Sort	Decision it informs
Per-customer cost (descending)	$ desc	Right-sizing: who to optimize first
Cost-to-MRR ratio (descending)	ratio desc	Pricing: which customers lose money
Cost per customer over time	trend	Drift detection: who is suddenly spending more
Cost concentration (top 1% / top 10% as share of total)	none	Pricing tier design: how skewed is the distribution
Per-customer cost broken down by service	service stacked bar	Per-customer right-sizing: which service to target

The five fit on a single browser tab without scrolling. The pricing team opens this once a week. The product team opens it before any pricing or plan change. The CFO opens it before the board meeting. Different audiences, same data, no per-team dashboards that need to be reconciled.

The trend view is the under-appreciated one. A customer whose cost jumped 4x in three weeks is a smoke signal: usually a new use case the customer is exploring, sometimes a misconfigured integration burning compute on their side. Either way, the customer-success team wants to know in week three, not in next quarter's review.
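
The drift check behind that view can be sketched as a ratio against the cost three weeks earlier. The 4x factor and three-week window come from the example above; the data shape and customer names are assumptions.

```python
def cost_spikes(weekly_cost, factor=4.0, window_weeks=3):
    """Flag customers whose latest weekly cost is >= factor x the cost window_weeks ago.

    weekly_cost maps customer_id to a list of weekly costs, oldest first.
    """
    flagged = []
    for customer, series in weekly_cost.items():
        if len(series) <= window_weeks:
            continue  # not enough history to compare
        baseline, latest = series[-1 - window_weeks], series[-1]
        if baseline > 0 and latest / baseline >= factor:
            flagged.append(customer)
    return flagged


weekly_cost = {
    "acme": [100, 120, 250, 410],
    "globex": [200, 210, 190, 220],
    "newco": [50, 60],
}
print(cost_spikes(weekly_cost))
```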

The cost-concentration view answers the strategic question: are we a long-tail SaaS (top 10% of customers = 30% of cost, predictable scaling) or a power-law SaaS (top 1% of customers = 50% of cost, fragile)? The shape determines the right pricing strategy; the strategy is impossible to set without the data.
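
The concentration numbers are a sort and a sum. This sketch uses a made-up cost distribution and a hypothetical function name.

```python
import math


def top_share_pct(costs, top_percent):
    """Share of total cost carried by the top `top_percent`% of customers."""
    ordered = sorted(costs, reverse=True)
    k = max(1, math.ceil(len(ordered) * top_percent / 100))
    total = sum(ordered)
    return 100 * sum(ordered[:k]) / total if total else 0.0


# Illustrative: one very heavy customer and 99 small ones.
costs = [500] + [10] * 99
print(f"top 1%:  {top_share_pct(costs, 1):.1f}% of cost")
print(f"top 10%: {top_share_pct(costs, 10):.1f}% of cost")
```

Running both cuts side by side is what separates the two shapes: a long-tail business shows a modest top-1% share, a power-law business shows most of the top-10% share already sitting in the top 1%.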

Variable cost only: exclude the fixed

Per-customer cost should be the sum of three variable cost classes only: compute, storage, outbound bandwidth. Fixed costs (control plane, monitoring, security tooling, SSO, audit logging) are amortized at the org level and excluded from the per-customer number. Mixing them in produces misleading economics.

Cost class	Per-customer?	Reason
EC2 / Fargate compute	Yes	Scales with customer usage
RDS / Aurora compute	Yes	Per-tenant database load
S3 / object storage	Yes	Per-customer data volume
Outbound data transfer	Yes	Per-customer egress
Lambda invocations (per-customer functions)	Yes	Scales with customer events
Kubernetes control plane	No	Fixed; one cluster serves all
Datadog / observability bill	No	Fixed; not customer-driven
Vault, SSO, secret management	No	Org-level shared infra
CI/CD runtime	No	Engineering cost, not customer cost
Security tooling (WAF, GuardDuty)	No	Org-level, not customer-attributable

The reason for the exclusion is mechanical. A customer's "real" cost is what would disappear from the bill if the customer left. Variable costs do disappear; fixed costs do not. Including fixed costs in the per-customer number means a customer who churns "looks like" a $200/month savings on the dashboard when actually the savings is $60.
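
A sketch of the exclusion, assuming each cost line item carries a customer_id and a cost class. The class names and the split of the $200 into line items are invented to match the $200 versus $60 example above.

```python
VARIABLE_CLASSES = {"compute", "storage", "egress"}  # the three variable cost classes


def per_customer_variable_cost(line_items):
    """Sum only variable cost classes per customer; fixed overhead is ignored."""
    totals = {}
    for customer, cost_class, cost in line_items:
        if cost_class in VARIABLE_CLASSES:
            totals[customer] = totals.get(customer, 0.0) + cost
    return totals


line_items = [
    ("acme", "compute", 35.0),
    ("acme", "storage", 15.0),
    ("acme", "egress", 10.0),
    ("acme", "observability", 90.0),   # fixed: stays on the bill if acme churns
    ("acme", "control_plane", 50.0),   # fixed
]
variable = per_customer_variable_cost(line_items)
print(variable)
```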

The fixed costs still exist; they just need to be tracked separately. Add a sixth view to the dashboard if needed ("fixed org-level overhead, $X/month, Y% of total spend"). Most teams find that fixed overhead is 18-28% of total spend and stays roughly flat year-over-year as the company grows.

Cost-to-MRR ratio: the most actionable metric

A $400/month customer is fine if MRR is $4,000 (10% ratio) and a crisis if MRR is $200 (200% ratio). The absolute cost is meaningless without the revenue context; the ratio is the actionable number.

Ratio band	Status	Typical action
0-15%	Healthy	No action; this is the target
15-30%	Acceptable	Monitor for drift
30-50%	Yellow	Investigate; is the customer in a heavy onboarding phase?
50-100%	Red	Targeted right-sizing; consider pricing conversation
>100%	Negative-margin	Pricing change, contract renegotiation, or customer-success intervention required
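The bands translate directly into a lookup. A minimal sketch, assuming each boundary value belongs to the higher band (so exactly 15% is "Acceptable") and that exactly 100% is still "Red"; the table leaves those edges open, so treat them as a choice rather than a rule.

```python
def ratio_band(monthly_cost, mrr):
    """Classify a customer by cost-to-MRR ratio using the bands in the table."""
    if mrr <= 0:
        # No revenue at all: any cost is negative margin.
        return "Negative-margin"
    ratio = monthly_cost / mrr
    if ratio > 1.0:
        return "Negative-margin"
    if ratio >= 0.50:
        return "Red"
    if ratio >= 0.30:
        return "Yellow"
    if ratio >= 0.15:
        return "Acceptable"
    return "Healthy"


# The two customers from the paragraph above: same $400 cost, different MRR.
print(ratio_band(400, 4000))  # 10% ratio
print(ratio_band(400, 200))   # 200% ratio
```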

Typical mid-market SaaS has 8-15% of customers in the "red" or "negative-margin" band. Half of those are early-stage customers still ramping (which is fine, expected, often part of the land-and-expand motion). The other half are pricing mistakes: customers whose plan was set before the company understood their actual usage shape, who have stayed grandfathered, or who fell through the cracks of a plan transition.

The pricing-mistake set is what the dashboard surfaces that no quarterly report would. The fix per customer is usually a one-meeting conversation: explain the cost structure, offer a usage-based plan or a higher-tier package, sometimes write off the past 3 months as a relationship investment in exchange for the new plan going forward. Most customers accept; the ones who do not are signaling they would rather churn than pay, which is its own data point.

The hard part is customer_id propagation

The dashboard is the easy part. Most BI tools (Looker, Metabase, Superset) render the five views from a single fact table in a couple of days. The hard part is making the fact table correct, which requires customer_id on every cost record, which requires the propagation work at every layer of the system.

Layer	Retrofit complexity	Typical engineer-weeks
API gateway / request middleware	Low	1-2
Microservice spans (OTel propagation)	Low	1-3 (one engineer per service)
Async queues (Kafka, SQS headers)	Medium	2-4
Batch workloads (Spark, EMR, Lambda)	High	3-5
Data warehouse queries (Snowflake, BigQuery tags)	High	2-4
Object storage (key conventions)	High (data migration if existing)	4-8
Vector DB / search index	Medium	1-3

Across a typical mid-market SaaS stack the total is 14-29 engineer-weeks. Most teams finish the retrofit in 8-12 calendar weeks because the work parallelizes across services.

The discipline matters more than the speed. A retrofit that misses 30% of cost is not 70% useful; it is misleading, because the missing 30% is concentrated in a few services that distort the per-customer numbers. Either commit to full propagation or do not start. The middle path produces a dashboard that finance will not trust and that pricing will not use.

Once the propagation is in place, every new service must include customer_id in its spans, queue headers, storage keys, and warehouse queries from day one. Adding the discipline to the service template is the maintenance work that keeps the dashboard accurate as the company grows.

The dollar math

The dashboard pays back in 6-12 months on a $4M ARR product. The payback comes from two mechanisms, both visible on the dashboard.

Source	Typical contribution	Notes
Pricing changes that recover negative-economics customers	2-5% gross margin	Targeted; affects only the 8-15% of customers in red/negative band
Per-customer right-sizing of resource-heavy environments	8-15% infra reduction	Possible only after attribution makes the over-resourced ones visible
Total annual margin recovery	$50K-$200K	On a $4M ARR product with 25% infra-to-revenue ratio
Retrofit cost (one-time)	$80K-$160K	14-29 engineer-weeks at fully-loaded rates
Operating cost (dashboard + monthly refresh)	$20K/year	Minimal once the propagation is solid
Payback period	6-12 months	First-year ROI typically 1.3x-2x

The first-year ROI is fine but not extraordinary. The compound value is the operational difference: by year two, the company is making pricing and right-sizing decisions weekly with real per-customer data instead of quarterly with averaged data. Pricing changes that would have been argued about for a quarter ship in a sprint. Right-sizing that would have been speculative becomes targeted at the 10-15 specific customer environments where the savings actually exist.

The pattern is not "per-customer is the new average." It is "per-customer is the level of granularity at which SaaS pricing and infra decisions are actually made." The quarterly average remains useful for the board slide. The dashboard is what runs the business in between.

Stand up the dashboard, do the propagation work in parallel, and stop running unit economics off a number that hides everything. The first three pricing mistakes the dashboard catches will fund the next two years of the propagation work.

Per-Agent Quotas for MCP: The Token Budget That Stopped One Agent From Burning 80% of the Daily Spend

Muskan — Fri, 15 May 2026 06:57:57 +0000

The first ninety days of an MCP server in production are about correctness, not abuse. The team is busy proving the agents do the right thing: the policy lookups return what they should, the audit log captures the right fields, the structured errors are parsed by the agent framework correctly. Rate limiting is something the team plans to add "after we have real traffic." The team has real traffic on day 12 and forgets to add rate limiting. On day 87 the first runaway lands.

The runaway always has the same shape. One agent starts behaving badly: a test loop forgot to set max_iterations, a malformed prompt drove the model into a long-output failure mode, a retry policy got an aggressive backoff inverted. The agent calls the same MCP tool 400 times in 30 minutes, burning 70% to 90% of the day's token budget before any human sees the alert. By morning the bill shows a $4,200 charge against an Anthropic account that usually does $800/day.

The structural fix is per-agent token quotas baked into the MCP server. Each agent identity gets a budget across three windows (hourly, daily, weekly). The MCP server tracks consumption and rejects calls that would exceed the budget. The agent gets a structured error; the human operator gets one page per cycle instead of a Slack thread at 9 a.m. asking who is responsible for the bill.

The pattern composes with the MCP cost ledger (which tells you what each agent has spent) and the policy-aware MCP governance work. The ledger is descriptive; the quota is prescriptive. Together they turn per-agent cost into a managed budget rather than a billing artifact you read about three days later.

The first MCP runaway lands in 90 days

The runaway shapes are predictable. After auditing a dozen MCP-server rollouts at ZopDev customers, three failure modes account for almost every incident.

Failure mode	Trigger	Burn rate	Typical detection latency
Test loop left running	Developer's local agent forgets `max_iterations`	80-120 calls/hour for hours	6-12 hours (next morning)
Malformed prompt drives long-output mode	Prompt change ships with regex bug; model hits 8k token outputs every call	3-5x normal token cost per call	2-6 hours (when daily spend alert fires)
Inverted retry backoff	Retry policy doubles on success instead of failure	200-500 calls/hour against an idempotent tool	1-3 hours (when downstream service alarms)

The detection latencies are uncomfortable because none of them are inside the MCP server's control. The server sees the agent's calls and bills them honestly. It does not see the agent's intent, the retry logic, or the prompt change that landed an hour ago. By the time a human sees the spike in a billing dashboard or a cost alert, the runaway has been burning for hours.

The right place to catch this is inside the MCP server, on the call path. Every tool call passes through the server; every call has an agent identity attached (a service account, a session token, an API key). If the server checks the agent's running budget against a quota before allowing the call, the runaway stops at the quota boundary instead of at the 9 a.m. Slack ping.

Per-agent quotas as a tri-window check

A single daily quota is not enough. An agent that burns its full daily budget by 10 a.m. has fourteen hours to keep generating rejected requests; even rejected, the orchestration overhead (the agent's own LLM calls deciding what to do next) eats real tokens. A weekly quota catches the slow-creep agent that goes 8% over every day for five days and adds up to a meaningful Friday total that no daily check would have stopped.

Window	Catches	Typical cap (for 50K/day default agent)
Hourly	The fast runaway (within 60 min)	8K tokens/hour (typical: 2K-3K)
Daily	The within-day burn (within hours)	50K tokens/day
Weekly	The slow creep (within days)	250K tokens/week (typical: 200K)

The three caps compose. The agent is allowed if it is under all three. If any single cap trips, the call is rejected. This sounds expensive to check but in practice it is three counter reads from a Redis hash; the overhead is sub-millisecond.
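The three-counter check can be sketched without Redis. The version below keeps the counters in a plain dict and uses fixed windows aligned to the Unix epoch; both are simplifying assumptions, and the class and method names are hypothetical. A production version would read the same three counters from a shared store.

```python
import time

CAPS = {"hourly": 8_000, "daily": 50_000, "weekly": 250_000}
WINDOW_SECONDS = {"hourly": 3_600, "daily": 86_400, "weekly": 604_800}


class TriWindowQuota:
    def __init__(self, caps=CAPS):
        self.caps = dict(caps)
        self.usage = {}  # (agent_id, window, bucket index) -> tokens used

    def _key(self, agent_id, window, now):
        return (agent_id, window, int(now // WINDOW_SECONDS[window]))

    def check_and_reserve(self, agent_id, tokens, now=None):
        """Allow the call only if it fits under all three caps; return (allowed, tripped_window)."""
        now = time.time() if now is None else now
        keys = {w: self._key(agent_id, w, now) for w in self.caps}
        for window, key in keys.items():
            if self.usage.get(key, 0) + tokens > self.caps[window]:
                return False, window  # first cap that would be exceeded
        for key in keys.values():
            self.usage[key] = self.usage.get(key, 0) + tokens
        return True, None


quota = TriWindowQuota()
print(quota.check_and_reserve("agent-a", 5_000, now=0))   # fits under all three caps
print(quota.check_and_reserve("agent-a", 4_000, now=60))  # would push the hour to 9K
```

A rejected call reserves nothing, so the counters only reflect tokens that were actually allowed through.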

The hourly cap is the most underweighted of the three. Teams that ship daily-only quotas get caught by the morning runaway whose damage is done before the daily counter resets. The hourly cap means the worst case is 60 minutes of damage instead of 8 hours. For a 100x normal burn rate, the difference is $200 vs $2,000.

The weekly cap catches the agent that nobody pages on because no single day looks anomalous. An agent that does 60K tokens/day on a 50K cap looks within budget on the daily check (allowing the over-budget grace described below) but accumulates to 300K by Friday, and 420K over a full seven days, on a 250K cap. The weekly check catches the pattern that the per-day signal misses.

Default budget + adaptive growth

A new agent does not get a generous quota on day one. The default is small (50K tokens/day, 8K/hour, 250K/week) and grows with demonstrated usage. The growth is automatic, based on the agent's actual consumption pattern over the trailing 30 days.

Time window	Quota state	Behavior
Days 1-7	Default (50K/day)	Most agents stay under 30%; quota stays at default
Days 8-30	Default, monitoring	If utilization stays under 30%, auto-promotion candidate
Day 30	Auto-promotion	If utilization 10-50%, raise to 200K/day; if higher, page FinOps for review
Days 31-90	200K/day	If utilization stays under 40%, candidate for 500K/day at day 60
Day 60+	500K/day or higher	Manual review required for further increases

The shape: cheap by default, generous to proven workloads, never silently unlimited. A new agent that turns out to need 1M tokens/day gets there through a documented promotion path, not by accident. The same agent at day one would have been hard-blocked at 50K and the human would have set the right quota explicitly.

The promotion thresholds matter. If the auto-promotion fires whenever utilization is non-zero, every agent inflates to its peak day's quota and the protection erodes. If the threshold is too tight (e.g., only auto-promote at 50-70% utilization), most agents never grow and FinOps becomes a quota-approval bottleneck. The 30% / 40% bands above are the typical operating range; tighten or loosen based on the team's tolerance for false promotion vs friction.
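As one reading of the Day 30 row in the table, the promotion decision is a three-way branch on trailing utilization. This is a sketch, not the rule as shipped anywhere: the function name and return labels are hypothetical, and keeping agents below 10% utilization at the default is an assumption the table does not spell out.

```python
def day30_decision(avg_daily_tokens, current_cap=50_000):
    """Return (decision, new_daily_cap) from trailing-30-day average daily usage."""
    utilization = avg_daily_tokens / current_cap
    if utilization > 0.50:
        # Heavier than the auto-promotion band: a human sets the quota.
        return "page_finops", current_cap
    if utilization >= 0.10:
        return "auto_promote", 200_000
    # Assumption: barely-used agents stay on the default cap.
    return "keep_default", current_cap


print(day30_decision(15_000))  # 30% utilization
print(day30_decision(40_000))  # 80% utilization
print(day30_decision(2_000))   # 4% utilization
```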

Hard-block vs degrade: the design choice

When an agent exceeds quota, the MCP server has two broad response options, which the table below breaks into four concrete modes. Hard-block is the simple choice: reject the call with an error, the agent's task fails, the human investigates. Degrade is the more humane choice: route the call to a cheaper model (or a cached response, or a partial result), the agent's task completes but with lower quality, the cost stays under control.

Mode	Bill predictability	Task continuity	Output quality risk
Hard-block	High (cost stops at cap)	Low (agent task fails)	None (no degraded output)
Degrade to cheap model	Medium (cheap model still costs)	High (agent continues)	High (low-quality outputs may loop)
Return cached response	High (no model call)	Medium (only some calls cacheable)	Medium (cache may be stale)
Return structured "over budget"	High (no model call)	Medium (agent must handle)	None

Most teams ship hard-block first because the failure mode is contained: an agent that breaks under quota is visible immediately and gets a real fix. Degrade looks better in theory but introduces a subtle failure mode: a degraded agent producing low-quality outputs may loop trying to recover, generating more (cheaper but still real) calls and ultimately costing more than hard-blocking would have.

The middle path is to ship hard-block as the default and let agents opt in to degrade for specific tool classes where partial output is genuinely better than no output. Read-only tools (lookup, search, summarize) are good degrade candidates: a cached or cheap-model response is acceptable. Write tools (mutations, policy changes) should hard-block: a degraded write is worse than no write.
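The middle path fits in a few lines. A minimal sketch, assuming each tool carries a class label and each agent carries an explicit opt-in flag; both names are hypothetical.

```python
# Tool classes where a cached or cheap-model answer is acceptable.
READ_ONLY_CLASSES = {"lookup", "search", "summarize"}


def over_quota_mode(tool_class, agent_opted_into_degrade):
    """Pick the over-quota response: hard-block by default, degrade only for opted-in read-only tools."""
    if tool_class in READ_ONLY_CLASSES and agent_opted_into_degrade:
        return "degrade"
    # Writes, unknown classes, and agents that did not opt in all hard-block.
    return "hard_block"


print(over_quota_mode("search", agent_opted_into_degrade=True))
print(over_quota_mode("mutation", agent_opted_into_degrade=True))
print(over_quota_mode("search", agent_opted_into_degrade=False))
```

Defaulting unknown tool classes to hard-block keeps a newly added write tool from silently inheriting degrade behavior.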

Structured rejection so retries back off correctly

When the MCP server rejects a call, the error payload matters more than the rejection itself. An unstructured error ("quota exceeded") triggers the agent's default retry logic, which is usually "try again with exponential backoff." The agent retries, gets rejected, retries again, and burns more tokens on its own orchestration calls trying to figure out why the tool is failing.

A structured rejection includes the data the agent needs to back off correctly:

Field	Type	Example
`error_code`	string	`quota_exceeded_daily`
`agent_id`	string	`agent-fraud-classifier-prod`
`window`	string	`daily`
`cap`	integer	`50000`
`consumed`	integer	`52340`
`reset_at`	ISO timestamp	`2026-05-10T00:00:00Z`
`retry_after_seconds`	integer	`19260`
`suggested_action`	string	`wait_until_reset`

With this payload, the agent's retry logic can do the right thing: stop retrying until reset_at, surface the over-budget condition to its orchestrator, or fall back to a different code path. None of these are possible from "quota exceeded" alone.
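Both halves of that exchange can be sketched in Python: the server building the payload, and the agent turning it into a wait time. The field names come from the table above. The function names, the UTC-midnight reset for the daily window, and the 2-second default backoff are assumptions.

```python
from datetime import datetime, timedelta, timezone


def build_daily_rejection(agent_id, cap, consumed, now):
    """Server side: structured error for a daily-quota rejection (reset assumed at UTC midnight)."""
    reset_at = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return {
        "error_code": "quota_exceeded_daily",
        "agent_id": agent_id,
        "window": "daily",
        "cap": cap,
        "consumed": consumed,
        "reset_at": reset_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retry_after_seconds": int((reset_at - now).total_seconds()),
        "suggested_action": "wait_until_reset",
    }


def next_attempt_delay(error, default_backoff=2.0):
    """Agent side: honor retry_after_seconds when present, otherwise use the default backoff."""
    if isinstance(error, dict) and str(error.get("error_code", "")).startswith("quota_exceeded"):
        return float(error["retry_after_seconds"])
    return default_backoff


now = datetime(2026, 5, 9, 18, 39, 0, tzinfo=timezone.utc)
rejection = build_daily_rejection("agent-fraud-classifier-prod", 50_000, 52_340, now)
print(rejection["reset_at"], rejection["retry_after_seconds"])
print(next_attempt_delay(rejection), next_attempt_delay("quota exceeded"))
```

With the clock set to 18:39 UTC on May 9, the payload reproduces the `reset_at` and `retry_after_seconds` values from the example column.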

The cost difference between structured and unstructured errors is meaningful. A blind retry loop against a rejected MCP tool generates 5-15 orchestration LLM calls before the agent's policy gives up. At Claude Sonnet rates, those orchestration calls cost roughly the same as the rejected tool calls would have cost. Structured errors zero out that overhead.

Audit + composition with the cost ledger

Every quota check writes an audit log line, regardless of decision. This is the system of record for two things: postmortem of any cost incident, and input to monthly quota tuning.

The cost ledger and the quota check run on the same call path but serve different purposes. The ledger writes the actual token cost after the call returns (the source of truth for billing). The quota check uses a pre-call estimate (input tokens are known exactly, output tokens are projected from max_tokens). The estimate is conservative (assumes worst case); the ledger is exact.

This split matters for tuning. After 30 days, the ratio of estimated cost to actual cost is the input to whether the quota's grace margin needs to change. If estimates are systematically 20% higher than actuals, agents get blocked more often than the budget would warrant; the grace margin can shrink. If estimates are 10% lower, the budget is leaking; the grace margin needs to grow.
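The tuning input is a single ratio over the audit log. A minimal sketch, assuming each logged call records `input_tokens`, `max_tokens`, and the actual `output_tokens`; the field and function names are hypothetical.

```python
def estimate_tokens(input_tokens, max_tokens):
    """Pre-call worst case: exact input plus the full output allowance."""
    return input_tokens + max_tokens


def estimate_to_actual_ratio(calls):
    """Ratio of total estimated tokens to total actual tokens over a set of calls."""
    estimated = sum(estimate_tokens(c["input_tokens"], c["max_tokens"]) for c in calls)
    actual = sum(c["input_tokens"] + c["output_tokens"] for c in calls)
    return estimated / actual


# Hypothetical 30-day sample, collapsed to two representative calls.
calls = [
    {"input_tokens": 1000, "max_tokens": 500, "output_tokens": 250},
    {"input_tokens": 1000, "max_tokens": 500, "output_tokens": 250},
]
print(f"estimates run {estimate_to_actual_ratio(calls) - 1:.0%} above actuals")
```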

The composition with the MCP cost ledger is what makes the quota system trustworthy. The ledger answers "what did we spend"; the quota answers "what are we allowed to spend." Two complementary systems, one call path, one audit log.

First-month tuning: 2-4 unexpected runaways caught

The first month of the quota system in production produces a predictable mix of firings. ZopDev customer rollout data shows:

Week	Typical firings	Common cause	Action
1	3-7	False positives (default too low for legitimate workloads)	Raise quotas for the 1-3 affected agents
2	2-5	Legacy agents using more tokens than anyone realized	Investigate; tune quota or refactor agent prompt
3	1-3	First real runaways caught	Postmortem, no quota change
4	0-2	Mostly real	Steady-state
Month 2+	0-2/month	Real runaways only	Postmortem each

The week-1 false positives are a signal that some agents were running with consumption nobody had measured. This is itself the value: the team learns what its agents actually cost. Most teams discover at least one "we thought this agent did 20K tokens/day but it actually does 180K" surprise in the first two weeks.

The week-3 first real runaway is the moment the system earns its keep. The runaway gets caught within an hour (because of the hourly cap), the page goes out, the human reads the structured error, and the incident is closed in 45 minutes with $200 of damage instead of 8 hours and $5,000.

The FinOps engineer time across the month is 4-6 hours: classifying the firings, adjusting quotas, writing brief postmortems, updating the promotion thresholds based on observed utilization. The fleet-wide saving over the same month is typically 20-40% of the monthly token bill, mostly from runaways prevented and from the visibility into per-agent consumption that the audit log enables.

The dollar math

The numbers are simple. Per-agent quotas at a typical mid-market agent fleet:

Item	Value
Monthly token bill before quotas	$40,000 to $120,000
Reduction from prevented runaways	15-25%
Reduction from agent right-sizing (visible from audit)	5-15%
Total monthly reduction	$8,000 to $48,000
Quota system build cost (one-time)	$20,000 to $40,000
Operating cost (1 FinOps engineer, ~10% time)	$20,000/year
Payback period	1-4 months

The build cost varies because the implementation choices (Redis counters vs Postgres, structured errors vs not, adaptive promotion vs static caps) have different effort profiles. The lowest-effort version (a single daily cap per agent, hard-block on overage, structured errors) ships in two weeks; the full tri-window adaptive system with degrade support is a quarter of platform-engineer time.

The payback math works because the prevented-runaway saving is a real reduction in spend, not a forecasted one. The agent that would have burned $4,200 overnight burns $200 and stops. The cost ledger and the audit log show exactly how much was saved at each firing, which is the kind of receipt finance accepts as ROI evidence.

Per-agent quotas are not optional once an MCP server has more than three or four agents in production. The first runaway is a question of when, not if. Shipping the quota system before the first runaway costs a quarter of platform-engineer time; shipping it after the first runaway costs that plus the bill for the incident and the trust hit from finance. Set the default budget, wire the tri-window check, log the decisions, and stop relying on the morning Slack ping to catch agent runaways.

The Closed-Loop Budget Brake: How a $5k Daily Cap Stopped 2 A.M. Compute Runaways

Muskan — Fri, 15 May 2026 06:56:51 +0000

The 2 a.m. compute runaway is the canonical FinOps incident. A Spark job is misconfigured to provision new EMR nodes every minute it cannot find a leader. A test agent left running on a developer's laptop loops infinite Claude calls against the prod API key. An autoscaling group's max gets bumped from 20 to 2000 in a Terraform plan that nobody reviewed at the right line number. Everything is asleep. The hourly spend goes from $63 to $830 to $4,200. By 9 a.m. the team gets a Slack ping from finance asking why yesterday's bill spiked $47,000.

AWS Budgets fires a soft alert when daily spend crosses a threshold. The alert goes to an SNS topic that emails a distribution list and pings a Slack channel. Nobody reads the channel at 2 a.m. The on-call engineer is paged for production outages, not budget overages. By the time someone sees the alert, the damage is hours old and the runaway has either burned itself out or kept running because the alert did not actually stop anything.

The structural fix is to replace the email with an action. A closed-loop budget brake fires a remediation playbook when a hard daily cap is crossed: stop non-prod EC2 launches, pause non-prod autoscaling groups, freeze agent provisioning, throttle batch jobs, page the on-call. The 5-minute detect-decide-act-verify shape from the closed-loop FinOps work applies directly, with the cap value as the signal and the playbook as the action.

The piece composes with the closed-loop trust score (deciding which playbook tiers auto-fire) and runs alongside cost anomaly detection (which catches longer-horizon structural drift the brake cannot see).

The 2 a.m. runaway and why email alerts fail

Look at how the four common detection mechanisms catch a 2 a.m. runaway, and what they cost in dollars by the time someone acts.

Detection mechanism	Time to detect	Time to action	Spend lost before action
AWS Budgets soft alert (email/Slack)	8-12 hours after threshold crossed	Morning when someone reads it	$30,000 to $80,000
Hourly cost alarm (custom CloudWatch)	60-90 minutes after spike begins	Hours later if on-call is busy	$5,000 to $15,000
Cost anomaly detection (AWS or vendor)	24-72 hours after pattern shifts	After analyst review	$50,000 to $200,000
Closed-loop budget brake	5-15 minutes after cap crossed	Automatic playbook	$500 to $2,000

The dollar gap between the soft alert and the brake is the case for the brake. Soft alerts are real signals, but they are signals that go to humans who are not actively monitoring at 2 a.m. The brake removes the "wait for a human" step from the loop.

The other detectors are not redundant. Hourly cost alarms catch slower-building issues the brake's daily cap might miss within a single day. Cost anomaly detection catches structural shifts (a new feature with legitimate higher spend, a pricing change, a seasonal pattern) over multi-day windows. The brake handles the within-day catastrophic runaway. Three detectors, three different horizons.

The brake: short-circuit, not email

The brake's shape is the same four-stage loop as any other closed-loop FinOps system, with the cap as the input signal.

Detect samples live spend at a 5-minute interval from Cost Explorer or the equivalent on GCP / Azure. Five minutes is the floor: the billing APIs are eventually consistent and finer-grained polling produces false negatives (real spend that has not yet shown up in the API). Decide compares the running daily total against the cap. If crossed, the brake fires a tiered playbook. Act runs the playbook. Verify samples spend again 15 minutes after the playbook fires, confirming the runaway has stopped.

The playbook is per-account and lives in version control. A typical Tier 1 payload contains six actions: stop all env=non-prod EC2 instances launched in the last 60 minutes, pause non-prod autoscaling group scale-outs by setting max_size = current_size, freeze agent provisioning by revoking the agent service role's ec2:RunInstances, throttle non-prod batch queues to zero concurrency, snapshot the spend-by-service breakdown to an S3 bucket for postmortem, and notify the FinOps channel with the breakdown.

The brake does not delete anything. Everything Tier 1 does is reversible in seconds. Spend stops; nothing breaks. The on-call wakes up to a "brake fired" page, reads the breakdown, decides whether the spend was legitimate (and reverses the playbook) or a runaway (and starts the postmortem).
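One pass of the four-stage loop can be sketched with the cloud calls injected as plain functions, so the control flow is visible without any AWS dependency. Everything here is hypothetical scaffolding: the function name, the parameters, and the rule that "stopped" means the daily total grew by no more than `max_growth` between detection and verification.

```python
def run_brake_cycle(get_daily_spend, cap, run_playbook, wait, max_growth):
    """One detect-decide-act-verify pass; returns what happened."""
    spend_at_detect = get_daily_spend()      # Detect: sample the running daily total
    if spend_at_detect <= cap:               # Decide: cap not crossed, nothing to do
        return {"fired": False, "stopped": None}
    run_playbook()                           # Act: fire the tiered playbook
    wait()                                   # e.g. sleep 15 minutes before re-sampling
    growth = get_daily_spend() - spend_at_detect
    return {"fired": True, "stopped": growth <= max_growth}  # Verify


# Fake readings standing in for the billing API: over the cap, then nearly flat.
readings = iter([5200.0, 5210.0])
actions = []
result = run_brake_cycle(
    get_daily_spend=lambda: next(readings),
    cap=5040.0,
    run_playbook=lambda: actions.append("tier1"),
    wait=lambda: None,
    max_growth=50.0,
)
print(result, actions)
```

Because the daily total is cumulative, verification compares growth rather than expecting the number to fall.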

Sizing the cap: three inputs, one formula

The cap value is not a guess. It is computed from three inputs that already exist in the cost data.

Input	Source	Notes
Typical daily spend	Cost Explorer 7-day trailing average	Smooths weekly seasonality
Variance multiplier	Engineering judgement (1.5x to 2x)	Absorbs legitimate daily spikes
Emergency floor	Largest single-day spend in trailing 90d	The "this happened once and was real" line

The formula: cap = max(typical * variance, emergency_floor * 1.20). The second term is the emergency floor plus a 20% buffer.

Worked example for three account profiles:

Account profile	Typical daily	Variance (1.7x)	Emergency floor	Cap
Small (dev team of 12)	$500	$850	$1,200	$1,440
Mid-market (50-engineer org)	$1,500	$2,550	$4,200	$5,040
Large (200-engineer org)	$10,000	$17,000	$24,000	$28,800
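The worked example can be reproduced in a few lines. A small sketch of the formula as stated; the function and parameter names are hypothetical.

```python
def daily_cap(typical_daily, variance_multiplier, emergency_floor, floor_buffer_pct=20):
    """cap = max(typical * variance, emergency floor plus a percentage buffer)."""
    variance_cap = typical_daily * variance_multiplier
    floor_cap = emergency_floor * (100 + floor_buffer_pct) / 100
    return max(variance_cap, floor_cap)


profiles = {
    "Small": (500, 1200),
    "Mid-market": (1500, 4200),
    "Large": (10_000, 24_000),
}
for name, (typical, floor) in profiles.items():
    print(f"{name}: ${daily_cap(typical, 1.7, floor):,.0f}")
```

In all three profiles the buffered emergency floor is the binding term, which is why the caps sit well above 1.7x typical spend.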

The 20% buffer on the emergency floor matters. A legitimate spike that happened once (a launch event, a load test, an unusual data migration) might not happen the same day next month, but the cap has to be high enough that the same pattern would not trip the brake if it recurs. Without the 20% buffer, the brake fires on every recurrence of every legitimate pattern, and the on-call learns to ignore it.

Cap recomputation happens monthly. The typical daily spend drifts as the company grows. The emergency floor may shift if a new legitimate workload pattern emerges. A static cap that lasts more than a quarter starts to over-fire or under-fire because the inputs moved.

Tier the action: composing with the trust score

The brake's playbook is tiered by what the trust score allows. Without the trust score, the brake either over-reaches (touches production resources, causes a customer-facing incident, gets disabled forever) or under-reaches (only pages, doesn't actually stop the spend).

Tier	Trust threshold	Resources touched	Actions
Tier 1	Always-on	`env=non-prod` only	Stop new EC2 launches, pause ASG scale-outs, freeze agent provisioning, throttle batch queues
Tier 2	Trust > 0.7	ML training, ad-hoc analytics	Downscale training jobs, throttle batch ingestion, pause notebook compute
Tier 3	Always pages	`env=prod`	Page on-call with breakdown; no auto-action

Tier 1 is always-on because the actions are low-blast-radius and fully reversible. Stopping a non-prod EC2 instance launched in the last 60 minutes affects only the developer who launched it, and they can re-launch in 90 seconds. Pausing a non-prod ASG scale-out blocks new capacity but does not terminate existing capacity.

Tier 2 needs the trust score because the actions have wider blast radius. Throttling a training job interrupts the team running it; downscaling notebook compute kicks people out of their analyses. The trust score asks: is the spend signal high-confidence enough (cap crossed by 2x+, multiple services contributing) to justify the disruption? If yes, fire. If no, page only.

Tier 3 always pages. The brake does not touch production resources, ever. The math is simple: a production-impacting incident caused by the brake costs more than any single-day cost runaway. The brake's job is to give the human enough time to fix the runaway before the bill is catastrophic, not to be the fix itself.
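The tier table reads as a small decision function. A sketch under assumptions: the resource-class labels and return values are hypothetical, and anything not explicitly recognized is treated like production and only paged on.

```python
TIER2_CLASSES = {"ml-training", "adhoc-analytics"}


def select_action(resource_class, trust_score):
    """Map a resource class and trust score to the brake's response."""
    if resource_class == "non-prod":
        return "tier1_auto"  # always-on: low blast radius, reversible
    if resource_class in TIER2_CLASSES:
        # Wider blast radius: auto-fire only on a high-confidence signal.
        return "tier2_auto" if trust_score > 0.7 else "page_only"
    # Production, and anything unrecognized, is never touched automatically.
    return "tier3_page_only"


print(select_action("non-prod", 0.2))
print(select_action("ml-training", 0.9))
print(select_action("ml-training", 0.5))
print(select_action("prod", 1.0))
```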

Cap vs anomaly detection: different time horizons

A common mistake is treating the brake as a replacement for cost anomaly detection. They are not the same system. They run on different signals and they fire on different timescales.

The brake answers "is something burning right now?" The signal is daily-spend-rate exceeds cap. The action is immediate playbook.

Anomaly detection answers "is the spending pattern different from what we expect?" The signal is statistical: spend by service over a multi-day window deviates from forecasted baseline. The action is queued analyst review, often with a recommendation to update budgets or investigate a new workload.

Both run. The brake catches the rare catastrophic runaway. Anomaly detection catches the steady-state drift (a new feature legitimately moving the spend curve, a pricing change at a vendor, a regional capacity shift). Each is useless for the other's job: anomaly detection cannot stop a 2 a.m. runaway because the analyst is not online; the brake cannot tell you that your spend has structurally shifted because it only sees the daily total.

The first-week tuning ritual

The brake gets tuned in its first two weeks. The cap value comes from the formula, but the formula's inputs are estimates. The actual firing rate in week one tells you whether the cap is right.

Week	Firings/week	Typical action	Cap adjustment
1	3-6	Investigate each, classify real vs false positive	Raise cap by 10-15% per false positive
2	1-3	Continue classification	Raise/lower based on the week's data
3-4	0-1	Each firing produces a postmortem	Stable
Steady state	0-2/month	Each firing is real	Recompute cap monthly

The classification at each firing matters. The on-call writes a one-paragraph postmortem: was this a runaway (which workload, what was the root cause, how long would it have run unchecked) or a legitimate spike (which team, why, was it within budget, why did the cap not anticipate it). False positives raise the cap. Real positives keep the cap and become input to the trust score weights.

The discipline that makes the brake trusted is that every firing produces a postmortem. Without postmortems, the team starts to debate whether "the brake fires too much," raises the cap to make it stop firing, and the brake silently becomes a $50k cap that catches nothing. With postmortems, the cap value is defensible and the trust accumulates.

The exempt-tag escape valve

Some workloads legitimately spike beyond the cap. ML training that goes from $200/day to $8,000/day during a three-day training run looks identical to a runaway under a daily cap. Ad-hoc analytics that spin up 50 BigQuery slots for a quarterly report look identical. The right fix is not "raise the cap to absorb all training spikes" because that defeats the brake. The right fix is to exempt the workload and route it through a different cap.

The pattern: any resource tagged brake_exempt=true is excluded from the daily-cap calculation. Exempt resources go into a separate weekly cap (typically 3x the equivalent daily cap times 7) that catches truly anomalous training or analytics spend.

Workload type	Tag	Cap horizon
Steady-state services (web, API, batch)	(no tag)	Daily
ML training jobs	`brake_exempt=true, brake_class=training`	Weekly
Quarterly analytics, BI rebuilds	`brake_exempt=true, brake_class=analytics`	Weekly
Disaster recovery test environments	`brake_exempt=true, brake_class=dr-test`	Per-event (manual cap)

The exempt tag has to be opt-in and reviewed. A team that wants to exempt their workload submits a one-page rationale to the FinOps team. The exemption is granted with an expiry date and a recompute schedule. Without that discipline, every team eventually tags their workload exempt and the brake erodes back to nothing.
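The tag-based split can be sketched as a single pass over tagged resources. The tag keys come from the table above; the record shape, the function names, and the "unclassified" fallback are assumptions.

```python
def split_by_exemption(resources):
    """Return (spend counted against the daily cap, exempt spend grouped by brake_class)."""
    daily_total = 0.0
    exempt_by_class = {}
    for res in resources:
        tags = res.get("tags", {})
        if tags.get("brake_exempt") == "true":
            cls = tags.get("brake_class", "unclassified")
            exempt_by_class[cls] = exempt_by_class.get(cls, 0.0) + res["daily_usd"]
        else:
            daily_total += res["daily_usd"]
    return daily_total, exempt_by_class


def weekly_exempt_cap(daily_cap):
    """Weekly cap for exempt workloads: 3x the daily cap, times 7 days."""
    return 3 * daily_cap * 7


resources = [
    {"daily_usd": 1500.0, "tags": {}},
    {"daily_usd": 8000.0, "tags": {"brake_exempt": "true", "brake_class": "training"}},
]
print(split_by_exemption(resources))
print(weekly_exempt_cap(5040))
```

The training run's $8,000 day never reaches the daily-cap total, but it still counts against the weekly number.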

The dollar math

A 2 a.m. runaway on a mid-market AWS account typically costs $30,000 to $80,000 by the time anyone notices. Larger accounts can hit $200,000+ before the morning Slack ping. The frequency is low but not negligible: ZopDev customer audits show one runaway every 6-12 months on mid-market accounts, more frequent during periods of rapid infrastructure change.

Item	Cost / value
One prevented mid-market runaway	$30,000 to $80,000 saved
One prevented large-account runaway	$200,000+ saved
Annual frequency (typical mid-market)	1-2 runaways/year
Brake operating cost (half-time platform engineer)	~$30,000/year
Expected annual savings (mid-market)	$30,000 to $160,000

The brake pays back after the first prevented incident. After that, it is one of the highest-ROI items on a FinOps roadmap, ahead of right-sizing and behind only the cost-allocation work that lets you see where the spend goes in the first place.

The brake is not a substitute for budgets, anomaly detection, or right-sizing. It is the layer that catches what those other systems are not designed to catch: the within-day catastrophic spend event. Email alerts go to inboxes. The brake fires a playbook. The 2 a.m. runaway becomes a 5-minute incident with a $1,500 ceiling instead of an 8-hour incident with an $80,000 ceiling. Set the cap, write the playbook, watch the first two weeks of firings, and stop arguing with the morning bill.

The Golden Path Tax: 14 Hours/Week of Engineer Onboarding We Bought Back With 6 Months of IDP Work

Muskan — Fri, 15 May 2026 06:54:08 +0000

The cost of onboarding a new engineer at a mid-sized cloud-infrastructure org never shows up on a finance dashboard. There's no line item for "hours spent searching for the right runbook" or "Slack threads asking how deployment works this quarter." The cost is real: it runs 14 to 22 hours per engineer per week in the first few weeks, tapers over the first 8, and compounds with every hire because the senior engineers who answer the questions lose 4 to 7 hours per week each at the same time.

Six months of IDP work changes the shape of this cost. Not eliminates it. Reshapes it. The 8-week onboarding becomes 4-week onboarding. The 14 hours per week of access requests, runbook hunts, and deployment questions drops to 30 minutes per week of looking things up in the IDP catalog. The senior engineers stop being the canonical source of truth for "how do we deploy this quarter" because the IDP is. The engineering org gets back about $400k per year of engineer time on a 100-engineer team for an investment of $250-350k over the first six months.

The piece walks through what an IDP actually changes, why deployment is the keystone golden path, why templates beat documentation, the 6-month investment shape, the math, and the single instrumentation that tells you whether the IDP is working.

The 14-hour onboarding tax nobody puts on a dashboard

A new engineer at a mid-sized cloud-infra org goes through a fairly predictable cost curve in the first 8 weeks.

Week	Time spent on onboarding-friction work	Top sources
1	16-22 hours/week	AWS access, GitHub access, VPN, k8s kubeconfig, on-call rotation join
2-3	14-18 hours/week	First service deploy, learning CI conventions, finding the right Terraform repo
4-5	10-14 hours/week	Observability setup (logs, metrics, traces, alerts), on-call shadowing
6-7	6-10 hours/week	Cross-team integration patterns, edge cases in deploys, "where does this config live"
8	3-6 hours/week	Settling in; mostly questions that surface only during real incidents

Sum it: roughly 80 to 110 hours per engineer over 8 weeks lost to onboarding friction (the table's ranges add up to 79 to 112). At a fully loaded $80/hour, that's about $6,300 to $9,000 per engineer in lost time, paid every time you hire.
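
A quick way to sanity-check the total is to sum the table's weekly ranges directly. The sketch below treats weeks 2-3, 4-5, and 6-7 as two weeks each and uses the $80/hour figure from the text; with those assumptions the rows add up to 79 to 112 hours.

```python
# (number of weeks, low hours/week, high hours/week), copied from the table
ROWS = [
    (1, 16, 22),  # week 1
    (2, 14, 18),  # weeks 2-3
    (2, 10, 14),  # weeks 4-5
    (2, 6, 10),   # weeks 6-7
    (1, 3, 6),    # week 8
]
HOURLY_COST = 80  # fully loaded $/hour, from the text

def friction_totals(rows):
    """Return (low, high) total hours across all rows."""
    low = sum(weeks * lo for weeks, lo, _ in rows)
    high = sum(weeks * hi for weeks, _, hi in rows)
    return low, high

low_hours, high_hours = friction_totals(ROWS)
print(f"hours: {low_hours} to {high_hours}")
print(f"cost:  ${low_hours * HOURLY_COST:,} to ${high_hours * HOURLY_COST:,}")
```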

The cost on the senior engineer side is rarely measured. Each new engineer fires roughly 5-15 "how do I" messages per week into engineering Slack channels. Senior engineers context-switch to answer, draft a half-page response, sometimes screen-share for 20 minutes. The aggregate is 4-7 hours per senior engineer per week per onboarding overlap. With 3-4 new engineers in the first month, the senior who happens to know the answers loses a half-day per week to answering them.

This cost compounds. Year-over-year, the team grows. Each new hire fires the same questions because the answers live in senior engineers' heads, in Slack history, in three different wikis with conflicting versions, in a Confluence page somebody updated 18 months ago. The org pays the same onboarding tax for every new hire, plus the senior engineer time, plus the slow drift as conventions change and old answers become wrong.

Nobody dashboards this because nobody wants to. The engineers paying the cost don't want to flag it (they want to look productive). The senior engineers paying the cost don't want to flag it (they want to look helpful). The org chart doesn't include "onboarding friction" as a category. It's a tax that gets paid in invisible time and shows up as "engineering velocity feels slow" without a clear line item.

What an IDP actually changes

A working IDP — Backstage, Port, Cortex, or a homegrown equivalent — collapses the same 8-week onboarding into 4 weeks and the 14 hours per week of friction work into 30 minutes per week of catalog lookups.

Activity	Pre-IDP	Post-IDP
Get AWS account access	2-3 days, 4 Slack threads	30 min: self-service via IDP request flow with auto-approval rules
Find the deployment runbook	1-2 days, 5 Slack threads	5 min: deployment golden path is the IDP front page
Set up observability for new service	4-6 hours, 2 senior engineers consulted	20 min: template generates the right Datadog/Grafana hooks
Add on-call rotation membership	1-2 days, 3 Slack threads, often blocked on PagerDuty admin	15 min: self-service via IDP rotation manager
Get secret manager access for service X	2-3 days, 2 Slack threads, requires security team approval	30 min: IDP routes the request with the right context, security approves in batch
Create a new service from scratch	1-2 weeks, learn CI/CD conventions ad hoc	2 hours: template scaffolds repo + CI + observability + secrets

The activities don't disappear. The engineer still needs AWS access, still needs to find the deployment runbook, still needs to set up observability. The time per activity collapses because the IDP makes the answer findable and the action self-service. What used to take a week of Slack-thread-driven discovery takes 30 minutes of catalog navigation.

The senior engineer side is what makes the math work. The Slack questions don't go to a senior first; they go to the IDP. The IDP answers about 70 percent of them via templates, runbooks, and self-service flows. The remaining 30 percent, the genuinely novel questions, still go to senior engineers, so the volume of questions reaching seniors drops by 60-80 percent. Senior engineers get their week back; new engineers stop blocking on senior availability.

The deployment golden path is the keystone

Six months of IDP work is enough budget to ship deployment + observability + on-call + secrets, in that order. The order matters more than the budget. Deployment first is the keystone; the rest only get traction once engineers trust the IDP to handle deployment correctly.

Why deployment first: an engineer can survive bad observability for a week (the service runs, you'll fix the dashboards later). They cannot ship code without a clear deployment path. If the deployment golden path lives in the IDP and works, the engineer's first IDP interaction is a positive one. They go to the IDP next time something else needs doing. The IDP earns trust through use.

Trying to ship observability + secrets + deployment in parallel fails because there's no foothold for engineer trust. The engineer hits the IDP, finds a half-finished observability template that doesn't quite work, gives up, asks Slack. The IDP becomes the place engineers tried once and found broken. That perception is hard to recover from; better to ship one path well than three paths half-done.

The order after deployment is less critical, but observability and on-call go together because they share a workflow (alerts wake an on-call engineer; they consult the dashboards). Secret management can land third because it's a higher-friction problem (security review is involved) and engineers will tolerate the existing process longer. Environment provisioning is usually month 7-9 if scoped at all; many IDPs never ship it because it requires deeper cloud-account integration than the rest.

Templates beat documentation

The deepest mechanism in an IDP isn't the catalog or the docs. It's the template that generates the right thing instead of describing how to make the right thing.

Mechanism	Drift	Enforcement	Time to first success	Maintenance cost
Documentation	Drifts within weeks; nobody updates	None — engineer can ignore	2-6 hours of reading + iterating	Low to write, high to keep accurate
Template (golden path)	Doesn't drift; the template IS the convention	Strong — produces the right output	15-30 min from template run to working service	Higher to write, near-zero to keep accurate

A 5,000-word doc on "how to create a service" describes the right repo structure, the right CI config, the right observability hooks. The next engineer reads the doc, applies it imperfectly, ships a service that's 80 percent compliant with the conventions. Six months later that service has subtle differences from the canonical pattern. The doc gets updated by someone, the existing service doesn't. Drift sets in immediately.

A template that runs in the IDP produces the same artifacts as the doc would describe. The repo is created with the right structure. The CI config is generated from the same source as the docs. The observability hooks are wired by the same template that wires them in every other service. The engineer's "create service" interaction is a form they fill in (service name, owner, language) and a button they click. Two minutes later the service exists, compliant by construction.

The templates also enforce things docs can't. A doc can say "always tag your resources with cost_center." A template adds the tag automatically. A doc can say "always emit the request_id in logs." A template wires the logger to do it by default. The conventions move from "things engineers should remember" to "things the template does for them." Compliance ratios go from the typical 30-60 percent for documented conventions to 95-99 percent for template-enforced ones.
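
To make "the template does it for them" concrete, here is a minimal sketch of a scaffolding function. Everything in it is hypothetical (the function name, the file layout, the `managed_by` tag, passing `cost_center` as a form field); it is not Backstage's or Port's API. The point is only that the cost_center tag and the request_id log field are emitted by the template rather than remembered by the engineer.

```python
import json

# Conventions applied on every run (hypothetical values).
DEFAULT_TAGS = {"managed_by": "idp-template"}
LOG_FIELDS = ["timestamp", "level", "request_id", "message"]

def scaffold_service(name, owner, language, cost_center):
    """Return generated files for a new service, keyed by relative path."""
    tags = {**DEFAULT_TAGS, "service": name, "owner": owner,
            "cost_center": cost_center}
    service = {"name": name, "language": language, "tags": tags}
    logging_cfg = {"format": "json", "fields": LOG_FIELDS}
    return {
        f"{name}/service.json": json.dumps(service, indent=2),
        f"{name}/logging.json": json.dumps(logging_cfg, indent=2),
    }

files = scaffold_service("billing-api", "payments-team", "go", "cc-1042")
for path in sorted(files):
    print(path)
```

Changing a convention means changing `DEFAULT_TAGS` or `LOG_FIELDS` once; every service created afterward picks it up without anyone rereading a doc.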

The work to write a template is roughly 3-5x the work to write the equivalent doc. The maintenance cost is the inverse: docs need constant updating to stay accurate; templates only update when the underlying convention changes. Over a 2-year horizon, templates are cheaper than docs even before counting the engineer-time savings.

The 6-month investment shape

The typical IDP rollout for a 100-engineer org consumes roughly one platform engineer's full quarter for the deployment golden path, then a half-time engagement for the next quarter as the other paths land. Total team investment is roughly 1.5 engineer-quarters of platform time plus rotating part-time involvement from two service teams whose flows the IDP is encoding.

Month	Deliverable	Owner	Dependency
1	IDP catalog up; service inventory imported; access flows wired	Platform engineer (full-time)	Backstage/Port instance + GitHub integration
2	Deployment golden path: template + runbook for new service	Platform engineer + 1 service team part-time	Catalog + CI/CD integration
3	Deployment golden path: rollout to 5 services as pilots	Platform engineer + 5 pilot teams	Working template from month 2
4	Observability golden path: template for Datadog/Grafana hooks	Platform engineer + observability team	Deployment template establishes pattern
5	On-call golden path: PagerDuty + runbook integration	Platform engineer + SRE team	Observability template for alerts
6	Secrets golden path: routing through Vault/AWS Secrets Manager	Platform engineer + security team	Trust established from prior 5 months

Months 7-9 add environment provisioning if scoped. Most orgs don't get to it in the first year because the cloud-account integration is the deepest piece of work and the prior paths produce most of the time savings.

The platform engineer in months 1-3 is mostly heads-down on the deployment path. Months 4-6 the engineer becomes more of a coordinator, working with the observability/SRE/security teams who own the underlying systems. The IDP is the integration layer; it doesn't replace the underlying tools.

The pilot pattern in month 3 is critical. Five services going through the deployment template surfaces every edge case the template missed. Fix the edge cases, then roll out broadly in month 4. Skipping the pilot and rolling broadly in month 3 means the broad rollout hits all the edge cases at once, the template gets blamed, and the IDP loses trust.

The dollar math: $400k recovered, $300k invested

The math is straightforward but politically uncomfortable, because it requires putting a number on engineer time that nobody usually quantifies.

Input	Value	Notes
Engineers in onboarding overlap (avg)	12	Includes new hires + recent transfers within 8 weeks
Hours/week recovered per onboarding engineer	14	From 14 hrs/wk of friction to 30 min/wk
Senior engineer hours/week recovered	4 per senior × 5 affected seniors = 20	Less context-switching to answer questions
Total hours/week recovered	188	(12 × 14) + 20
Fully loaded hourly cost	$80	Median for senior engineers in cloud infra
Annualized recovered value	$782k	188 × $80 × 52 weeks
Adjustment for non-100% onboarding overlap	× 0.55	Onboarding overlap isn't always 12 engineers
Realistic recovered value	~$430k/year	Conservative
IDP investment year 1 (platform eng + tooling)	$250-350k	One platform engineer + Backstage hosting + integrations
Net year-1 ROI	+$80k to +$180k	Positive in year one
Year 2+ ROI	+$350-400k/year	Investment drops to ongoing maintenance ($80-120k/year)
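
The table's arithmetic can be reproduced in a few lines, which also makes it easy to plug in your own inputs. This is a sketch using the table's values as defaults; the function and constant names are hypothetical.

```python
HOURLY_COST = 80       # fully loaded $/hour
WEEKS_PER_YEAR = 52

def recovered_value(onboarding_engineers=12, hours_per_onboarder=14,
                    seniors=5, hours_per_senior=4, overlap_adjustment=0.55):
    """Return (hours/week, annualized $, adjusted $) recovered by the IDP."""
    hours_per_week = (onboarding_engineers * hours_per_onboarder
                      + seniors * hours_per_senior)
    annual = hours_per_week * HOURLY_COST * WEEKS_PER_YEAR
    return hours_per_week, annual, annual * overlap_adjustment

hours, annual, realistic = recovered_value()
print(f"{hours} hours/week, ${annual:,} annualized, ~${realistic:,.0f} adjusted")

# Year-1 net against the $250-350k investment range from the table.
for investment in (350_000, 250_000):
    print(f"investment ${investment:,}: net ${realistic - investment:,.0f}")
```

Because every term is a straight multiplication, halving the number of engineers in onboarding overlap roughly halves the recovered value.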

The investment side is more concrete than the recovery side. One platform engineer at fully-loaded $200k for the year, plus $30-50k for Backstage hosting + integrations + tooling, plus part-time involvement from the service teams (call it $70-100k of allocated time across 6 months). Total $300-350k in year one, the upper half of the $250-350k range in the table.

The recovery side has the most uncertainty around the "onboarding overlap" number. A 100-engineer org with 20 percent annual hiring has roughly 20 hires per year; with each hire in onboarding for 4 to 8 weeks, that works out to 20 × 4/52 ≈ 1.5 to 20 × 8/52 ≈ 3 engineers in friction-mode at any given time. The 12 number assumes a much higher hiring rate or many more internal transfers; adjust accordingly. The dollar value scales linearly.

The argument that lands better than "save $400k" is "recover one half-engineer of capacity per onboarding." Engineering leaders intuitively understand "we get our senior engineers' Mondays back" better than they understand annualized dollar projections.

How to know it worked: the 'how do I' Slack metric

The single instrumentation that tells you the IDP is working is the count of "how do I" messages in engineering Slack channels.

Pre-IDP, a typical 100-engineer org sees 50-100 such messages per week in #engineering, #infrastructure, #platform-help, and similar channels. Each one is a question that should have an answer in the IDP but doesn't, or that's in the IDP but the asker didn't find it.

Post-IDP (month 6 onward), the same channels see 10-20 such messages per week. The 60-80 percent drop is the most reliable signal of golden-path adoption. It's measured in Slack analytics, no instrumentation needed beyond a regex grep on channel history.
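
The "regex grep on channel history" can be as small as the sketch below. It assumes you have already exported messages as (date, text) pairs; the export step, the exact pattern, and the sample messages are assumptions, and a real pattern would need tuning for your channels.

```python
import re
from collections import Counter
from datetime import date

# Matches "how do I", "how can we", etc. (pattern is an assumption).
HOW_DO_I = re.compile(r"\bhow (?:do|can|would) (?:i|we)\b", re.IGNORECASE)

def count_how_do_i(messages):
    """Count matching messages per ISO (year, week)."""
    counts = Counter()
    for day, text in messages:
        if HOW_DO_I.search(text):
            year, week, _ = day.isocalendar()
            counts[(year, week)] += 1
    return counts

sample = [
    (date(2026, 5, 4), "How do I get a kubeconfig for staging?"),
    (date(2026, 5, 5), "deploy finished"),
    (date(2026, 5, 6), "how can we rotate the Vault token"),
    (date(2026, 5, 12), "How do I join the on-call rotation?"),
]
weekly = count_how_do_i(sample)
for (year, week), n in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {n}")
```

Tracking the weekly count before and after rollout gives the trend line; the matched messages themselves become the backlog of gaps to close.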

The platform team uses the remaining 10-20 messages as the prioritization signal for IDP improvements. Each unanswered question is either a gap in the IDP (add a template or catalog entry) or a discoverability problem (add a search hint or restructure the catalog). The metric drives the work; the work drives the metric down further.

The pattern that fails is letting the IDP roll out without instrumenting. Six months in, the platform team thinks the IDP is great because they built it. The actual signal of success is "are engineers using it instead of asking Slack." Without the Slack metric, the platform team optimizes for things engineers don't actually need; with it, the platform team's roadmap is driven by real friction.

What happens if you don't build it

The opportunity cost of not building an IDP is bounded but real. A growing engineering org without an IDP eventually hires a "developer experience" team to do ad hoc the work an IDP does at scale.

Year	Without IDP	With IDP
1	Onboarding takes 8 weeks; senior engineers spend 5 hrs/wk answering questions; 1 ad-hoc DX engineer hired	IDP investment + deployment golden path; 2 weeks shorter onboarding by year-end
2	DX team grows to 3 engineers maintaining scripts, runbooks, on-call docs ad hoc	IDP team is 1 platform engineer maintaining + extending; observability/secrets paths added
3	DX team is 5 engineers; "how do we deploy" still requires asking; same onboarding tax as year 1	IDP team is 1-2 engineers; onboarding is 4 weeks; senior engineer time recovered
4	DX team is 6 engineers; documentation has drifted again; new attempts to "fix it" begin	IDP is the canonical surface; new conventions land as templates; org grows without proportional friction growth

The DX team isn't a wasted investment — those 5 engineers are doing real work. The work is just less leveraged because it's documentation + scripts + ad-hoc processes instead of templates + catalog + self-service flows. Documentation drifts; templates don't. Scripts get forked; templates get versioned. Ad-hoc processes get replicated badly; self-service flows enforce consistency.

Year 3 is where the divergence becomes obvious. The IDP team is one or two engineers extending the platform; the DX team is five engineers reinventing the same flows for each new service. The IDP org's onboarding tax has stayed flat; the no-IDP org's onboarding tax has grown linearly with team size. Hiring more DX engineers doesn't fix the structural problem; it scales it.

The Backstage / Port / Cortex investment isn't free, and the six-month rollout is real work. But the alternative is paying the same cost as a recurring tax for as long as the org grows, and watching the senior engineers who could be building the next thing instead spend their Mondays answering "how do I." The 14 hours per week per onboarding engineer is the visible cost; the senior engineer time is the hidden one. The IDP recovers both, and the math turns positive within the first year.

Pod Scheduling for the Frugal: How We Cut EKS Node Cost 31% Without Touching a Workload

Muskan — Fri, 15 May 2026 06:49:54 +0000

A right-sized EKS cluster should not run at 40 percent node utilization. The pods declare requests that sum to 78 percent of node capacity. The cluster autoscaler provisions nodes to fit those requests. The bill goes to finance based on the nodes provisioned. And then the actual utilization metrics show 40 percent. The gap between 78 percent of node capacity claimed and 40 percent actually used is bin-packing inefficiency, and it survives any amount of right-sizing.

The pattern that fixes the gap is three scheduling-side changes that don't touch any workload. Switch the scheduler scoring from default to MostAllocated. Enable Karpenter's consolidation feature. Add a three-tier priority class so batch workloads can be evicted when high-priority pods need capacity. The combined effect on a typical EKS cluster is 25 to 35 percent reduction in node cost without any pod changing its resource requests.

The piece composes with right-sizing vs auto-scaling but starts from the opposite end. Right-sizing argues with every team about their CPU and memory requests. Scheduling improvements just change where pods land. The political cost is much lower, each lever fits in one to three sprints, and right-sizing becomes more effective afterward because the bin-packing baseline is healthier.

The 40% utilization gap on EKS clusters that already right-sized

Look at any EKS cluster that's already done a right-sizing pass. The numbers will be roughly:

Signal	Typical value
Pod CPU requests / node CPU capacity	75-82%
Pod memory requests / node memory capacity	70-78%
Actual node CPU utilization (averaged)	35-45%
Actual node memory utilization (averaged)	38-50%
Cluster Autoscaler / Karpenter target utilization	80%+

The first two rows say "the cluster is well-packed in theory." The middle two say "the cluster is half-empty in practice." The last row says "the autoscaler thinks it's running tight."

The gap is real, not a measurement artifact. Pod resource requests are declarations of what the pod might use; actual utilization is what the pod uses on average. The scheduler reserves the requested capacity even when the pod uses less. A 10-CPU node with five pods declaring 2 CPU each (10 CPU total reserved) but using 1 CPU each on average (5 CPU actual) is at 100 percent reserved and 50 percent utilized. The autoscaler sees the reservation, not the use.
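
The two ratios are easy to compute side by side. This is a minimal sketch of the five-pod example, assuming a node with 10 CPU of allocatable capacity; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    cpu_request: float  # what the scheduler reserves
    cpu_used: float     # average actual usage

def node_ratios(capacity_cpu, pods):
    """Return (reserved fraction, utilized fraction) for one node."""
    reserved = sum(p.cpu_request for p in pods) / capacity_cpu
    utilized = sum(p.cpu_used for p in pods) / capacity_cpu
    return reserved, utilized

pods = [Pod(cpu_request=2.0, cpu_used=1.0) for _ in range(5)]
reserved, utilized = node_ratios(10.0, pods)
print(f"reserved {reserved:.0%}, utilized {utilized:.0%}")
```

The scheduler and autoscaler only ever look at the first number, which is why the node counts as full while half of it sits idle.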

This is by design: the alternative (oversubscribing based on actual use) breaks under burst, and reservation-based scheduling is the well-established primitive for that reason. The fix isn't to change how the scheduler treats requests. The fix is to make the scheduler pack requests more efficiently and to let Karpenter consolidate the resulting headroom into fewer nodes.

Lever 1: switch to MostAllocated scoring

The default Kubernetes scheduler optimizes for predictable, evenly-spread placement. The default scoring strategy is LeastAllocated, which prefers nodes with more free capacity. The reasoning is fault tolerance: spread pods across nodes so a single node failure has bounded blast radius. This is the right default if you're not paying for the nodes.

MostAllocated is the opposite strategy: prefer nodes with less free capacity, packing pods tightly. The scoring is opt-in and rarely enabled. It's documented but has no auto-enablement signal: nothing in the cluster tells you "you'd save money if you flipped this."

Same workload, two scoring strategies, two outcomes. LeastAllocated produces 6 nodes at 50 percent each (50 percent waste). MostAllocated produces 4 nodes at 75 percent each (25 percent waste, 33 percent fewer nodes).
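
A toy simulation shows how the two strategies diverge on the same pods. This is a deliberately simplified sketch, not the real scheduler: it scores on CPU only, breaks ties by node index, and uses an assumed pool of six 4-CPU nodes with eight 1.5-CPU pods, numbers chosen to reproduce the 6-node versus 4-node outcome.

```python
def place(pod_requests, node_count, node_capacity, strategy):
    """Place pods one by one; return CPU allocated per node."""
    allocated = [0.0] * node_count
    for request in pod_requests:
        feasible = [i for i in range(node_count)
                    if allocated[i] + request <= node_capacity]
        if not feasible:
            raise RuntimeError("pod does not fit on any node")
        if strategy == "LeastAllocated":
            target = min(feasible, key=lambda i: allocated[i])  # emptiest
        elif strategy == "MostAllocated":
            target = max(feasible, key=lambda i: allocated[i])  # fullest
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        allocated[target] += request
    return allocated

def summarize(allocated, node_capacity):
    """Return (nodes in use, average allocation of the nodes in use)."""
    used = [a for a in allocated if a > 0]
    return len(used), sum(used) / (len(used) * node_capacity)

pods = [1.5] * 8
for strategy in ("LeastAllocated", "MostAllocated"):
    nodes, avg = summarize(place(pods, 6, 4.0, strategy), 4.0)
    print(f"{strategy}: {nodes} nodes at {avg:.0%}")
```

Under MostAllocated two of the six nodes never receive a pod, which is exactly the headroom an autoscaler can remove.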

The configuration is one block in the scheduler config:

Field	Default value	New value
`profiles[0].pluginConfig` entry `NodeResourcesFit`, `args.scoringStrategy.type`	`LeastAllocated`	`MostAllocated`
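
For reference, here is a sketch of what that block looks like in the `kubescheduler.config.k8s.io/v1` KubeSchedulerConfiguration API, as I understand it: the scoring strategy is set through the `NodeResourcesFit` plugin's `pluginConfig` args. The scheduler name and the equal cpu/memory weights are assumptions. The script prints JSON, which is also valid YAML, so the output can be saved as the config file.

```python
import json

scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        {
            # Hypothetical name for the bin-packing profile.
            "schedulerName": "bin-packing-scheduler",
            "pluginConfig": [
                {
                    "name": "NodeResourcesFit",
                    "args": {
                        "scoringStrategy": {
                            "type": "MostAllocated",
                            # Equal weights are an assumption; tune per fleet.
                            "resources": [
                                {"name": "cpu", "weight": 1},
                                {"name": "memory", "weight": 1},
                            ],
                        }
                    },
                }
            ],
        }
    ],
}

print(json.dumps(scheduler_config, indent=2))
```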

EKS does not expose the managed control plane's kube-scheduler configuration, so in practice this lands as a ConfigMap for a second scheduler deployed inside the cluster, with pods pointed at it through `schedulerName` (which a mutating admission policy can inject). The change applies to new pod placements; existing pods stay where they are until they restart for other reasons. The transition over a week is gentle: as pods naturally restart (deployments, image updates, node maintenance), the cluster gradually packs tighter.

The catch is that MostAllocated without PodTopologySpread constraints is dangerous. Left to its own devices, the scheduler will happily put all five replicas of a deployment on one node — maximum density, zero fault tolerance. Topology spread is the corrective. We get to that section in a moment.

The expected outcome on a fleet that previously ran 40 percent utilization: utilization rises to 55-65 percent over the first two weeks. The autoscaler notices fewer nodes are needed and provisions less. The bill drops 12-18 percent depending on workload composition.

Lever 2: enable Karpenter consolidation

Karpenter provisions nodes to fit incoming pods. By default, once a node exists, it stays. If pods leave (deployment scale-down, batch job completion), the node lingers under-utilized until it's empty enough that Karpenter's "expiration" rules kick in.

Consolidation is the active counterpart. Karpenter periodically evaluates the existing fleet, asks "could I run all these pods on fewer or smaller nodes," and re-provisions if yes. The evaluation runs hourly by default. Pods get gracefully evicted from the old nodes, the new (smaller or fewer) nodes get spun up, the old nodes terminate.

Six m5.xlarge nodes (4 vCPU each) at 30-40 percent become two m5.2xlarge nodes (8 vCPU each) at roughly 45-60 percent. Same pods, same requests. The node bill drops by about a third because the two larger nodes carry the capacity of four m5.xlarge instead of six.

The Karpenter NodePool config to enable it:

Field	Default	New
`disruption.consolidationPolicy`	`WhenUnderutilized`	`WhenUnderutilized` (already the default in recent versions; the Karpenter v1 API names it `WhenEmptyOrUnderutilized`)
`disruption.consolidateAfter`	unset	`30s` to `1m` (acts on transient under-utilization too)
`disruption.expireAfter`	`720h` (30 days)	`168h` (7 days) — forces fleet refresh

The consolidation policy WhenUnderutilized is what does the work. The consolidateAfter knob controls how aggressive the re-evaluation is; shorter values catch transient under-utilization (a deployment that just scaled down) faster. The expireAfter change is secondary but useful: shorter expiration forces the fleet to refresh more often, which catches drift between Karpenter's view and reality.

The catch with consolidation is that the node types it picks need to be a bounded set. If the NodePool allows 30 instance families, consolidation produces fragmentation: some pods on m5, some on c5, some on r5, none of the families used densely enough to consolidate further. The prerequisite work is pruning to 3-5 high-utility instance families that cover the typical pod resource shapes. Most clusters land on m6i for general purpose, c6i for CPU-bound, r6i for memory-bound, with one or two GPU types for ML workloads.

The expected outcome on a typical EKS fleet: 15-25 percent fewer nodes after the first week of consolidation passes. The first day shows the biggest drop (consolidation catches all the historical under-utilization at once); subsequent days are incremental as new under-utilization gets caught.

Lever 3: priority + preemption for batch workloads

The third lever is the one most teams skip. Kubernetes supports pod priority classes and preemption: high-priority pods can evict low-priority pods when capacity is contended, instead of triggering a node-up.

Most clusters end up with three priority classes:

Class	Priority `value`	Workloads	Preemption behavior
`critical`	1_000_000	Customer-facing services, control-plane components	Cannot be preempted
`standard`	500_000	Internal services, default for everything else	Preempts batch only
`batch`	100_000	Periodic jobs, ML training, data pipelines	Preempted by everything else

The priorityClassName field on the pod spec assigns the class. New deployments get the appropriate class via templates; existing deployments get tagged in a one-time PR. Critical workloads are usually a small set (under 20 percent of pods). Batch workloads are usually larger than people expect (often 30-40 percent of pods, mostly invisible: cronjobs, data pipelines, build runners).

The preemption behavior is what creates the savings. When the scheduler can't fit a standard or critical pod, it looks for batch pods to evict instead of triggering Karpenter to provision a new node. The batch pod gets evicted (re-queued for later), the standard pod takes its slot, and the cluster doesn't grow. The batch work runs slower but completes; the cluster runs leaner.
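
The eviction decision can be sketched as a small function. This is a simplification of what the real scheduler does (which also weighs PodDisruptionBudgets, victim counts across nodes, and graceful termination); it works on a single node, CPU only, and evicts the lowest-priority pods first. The function name and sample pods are hypothetical; the priority values come from the table above.

```python
PRIORITY = {"critical": 1_000_000, "standard": 500_000, "batch": 100_000}

def pick_victims(node_capacity, running, incoming):
    """running: list of (name, class, cpu); incoming: (name, class, cpu).

    Returns [] if the pod fits as-is, a list of pod names to evict if
    preemption makes room, or None if preemption cannot help.
    """
    free = node_capacity - sum(cpu for _, _, cpu in running)
    _, incoming_class, needed = incoming
    if needed <= free:
        return []
    # Only strictly lower-priority pods are eligible, lowest first.
    candidates = sorted(
        (p for p in running if PRIORITY[p[1]] < PRIORITY[incoming_class]),
        key=lambda p: PRIORITY[p[1]],
    )
    victims = []
    for name, _, cpu in candidates:
        victims.append(name)
        free += cpu
        if needed <= free:
            return victims
    return None

running = [("api", "critical", 4), ("etl", "batch", 2), ("report", "batch", 2)]
print(pick_victims(8, running, ("web", "standard", 3)))   # batch pods evicted
print(pick_victims(8, running, ("nightly", "batch", 1)))  # nothing to preempt
```

When the function returns None, the pod stays pending and the autoscaler provisions a node, which is the node-up event preemption is meant to avoid.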

The political work is agreeing on which workloads are evictable. Engineers tend to mark their work as critical by default. The agreement requires a clear definition: "critical means customer-facing degradation if evicted." Most internal infrastructure (monitoring, logging, build runners, batch ETL) is not critical by that definition. The clarification is the political work; the technical implementation is one yaml field per pod.

Preemption only adds savings when the cluster has enough batch workloads to absorb the eviction pressure. Clusters that are pure web-tier with no batch see less benefit (typically 2-3 percent). Clusters with ML training or large data pipelines see more (typically 8-12 percent).

PodTopologySpread is non-negotiable

MostAllocated without topology constraints will pack all replicas of a deployment onto one node. A node failure takes down the entire deployment. This is a real production incident, not a theoretical concern.

The fix is PodTopologySpread constraints on every deployment that has fault tolerance requirements. The yaml block:

Field	Value	Why
`topologyKey`	`topology.kubernetes.io/zone`	Spread across AZs first
`maxSkew`	`1`	At most 1 pod imbalance between AZs
`whenUnsatisfiable`	`ScheduleAnyway`	Soft constraint; better packed than crashed
Second constraint `topologyKey`	`kubernetes.io/hostname`	Then spread across nodes within AZ
Second constraint `maxSkew`	`2`	At most 2 pod imbalance between nodes

The two-constraint pattern says "spread across AZs first (for resilience), then across nodes (for further fault isolation), but never refuse to schedule because of either." Both constraints are soft. ScheduleAnyway is what makes the pattern compatible with MostAllocated: the scheduler treats spread as a scoring preference, so it can still pack tighter when no better-spread placement fits.
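
The `maxSkew` check itself is simple arithmetic. The sketch below mirrors my understanding of how skew is evaluated for a candidate placement: the pod count in the target domain after placement, minus the smallest count across domains. The zone names and pod counts are made up, and the functions are hypothetical helpers, not a Kubernetes API.

```python
def skew_after_placement(pods_per_domain, target):
    """Skew if one more pod lands in `target`."""
    return pods_per_domain[target] + 1 - min(pods_per_domain.values())

def allowed(pods_per_domain, target, max_skew, when_unsatisfiable):
    """DoNotSchedule rejects violating placements; ScheduleAnyway never does."""
    if when_unsatisfiable == "ScheduleAnyway":
        return True  # soft: violations only lower the node's score
    return skew_after_placement(pods_per_domain, target) <= max_skew

zones = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
print(skew_after_placement(zones, "us-east-1a"))          # skew would be 2
print(allowed(zones, "us-east-1a", 1, "DoNotSchedule"))   # hard: rejected
print(allowed(zones, "us-east-1a", 1, "ScheduleAnyway"))  # soft: permitted
```

With the soft setting the scheduler still prefers us-east-1b or us-east-1c, but it will not leave the pod pending if only us-east-1a has room.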

The cost of getting topology spread right is one YAML block per critical deployment. Tooling exists (OPA Gatekeeper, Kyverno) to enforce that deployments above a certain replica count have topology spread defined; we use a simple admission policy that warns on missing spread and blocks on critical-priority deployments without it.

The trade-off this creates: a cluster running MostAllocated + topology spread will, in steady state, run at 60-70 percent utilization rather than the theoretical maximum of 85 percent. The 15-25 point gap is the cost of fault tolerance. Closing it further means giving up AZ resilience, which is not a finance decision.

The 31% breakdown: 15 + 10 + 6

The combined impact on a typical EKS cluster decomposes roughly as:

Lever	Typical savings	Range	What drives variation
MostAllocated scoring	15%	12-18%	Higher savings on clusters with many small pods (better bin-packing wins)
Karpenter consolidation	10%	8-12%	Higher savings on clusters with bursty deployments (more transient under-utilization to catch)
Priority preemption	6%	2-12%	Higher savings on clusters with significant batch workload (more eviction-eligible pods)
Combined	31%	22-42%	Composition matters; effects are not exactly additive

The 31 percent figure is the simple sum of the three typical values; the realized number tends to come in slightly lower because the levers overlap. MostAllocated reduces under-utilization, which means consolidation has less work to do. Preemption reduces node-up events, which means consolidation sees a steadier fleet. The interactions are mild but real; planning around 25-35 percent total savings is more accurate than planning around exactly 31 percent.
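
One simple way to see why the combined figure lands below the sum is to assume the levers stack multiplicatively, each one shaving its percentage off whatever bill the previous lever left. That model is an assumption for illustration, not a measured result; the inputs are the per-lever values from the table.

```python
from math import prod

def combined_savings(levers):
    """Combined fractional savings if each lever applies to the remainder."""
    return 1 - prod(1 - s for s in levers)

typical = combined_savings([0.15, 0.10, 0.06])
low = combined_savings([0.12, 0.08, 0.02])
high = combined_savings([0.18, 0.12, 0.12])

print(f"typical: {typical:.1%} (simple sum: 31%)")
print(f"range:   {low:.1%} to {high:.1%}")
```

Under that assumption the typical case comes out near 28 percent, inside the 25-35 percent planning band.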

The exact mix depends on workload composition. A cluster that's 80 percent web traffic and 20 percent batch will see more value from MostAllocated and less from preemption. A cluster that's 50 percent ML training will see more from preemption (the training jobs are the ideal eviction targets) and less from MostAllocated (large pods don't bin-pack as well as small ones). The 15+10+6 split is the central tendency, not a guarantee.

Why scheduling-first is more politically tractable than right-sizing-first

Right-sizing argues with every team about their resource requests. The conversation is "your pod requests 4 CPU and uses 1.5; let's drop the request to 2." Each team pushes back because they remember the time the pod actually used 4 (during the incident two months ago, the deployment burst, the load test). Negotiating each one takes 30 to 60 minutes per service; a 200-service cluster is a quarter of FinOps time.

Scheduling changes don't argue with anyone. The pod requests stay the same. The pod runs the same code. The only thing that changes is which node the pod lands on (MostAllocated), which other nodes exist alongside it (consolidation), and what happens when capacity is tight (preemption). No engineer has to defend their resource requests because no resource requests are changing.

This makes the sequencing matter. Doing scheduling first:

Step	Time	Political cost	Savings unlocked
Enable MostAllocated + topology spread	1-2 sprints	Low (one config change, validated by SRE)	12-18%
Enable Karpenter consolidation + prune node families	1 sprint	Low (Karpenter team's domain)	8-12%
Define priority classes + tag batch workloads	2-3 sprints	Medium (workload classification debate)	2-12%
Right-size pod requests	1-2 quarters	High (per-service negotiation)	another 15-25% on top

By the time you get to right-sizing, the cluster's bin-packing is already healthy, so the right-sizing conversations land on a smaller per-service savings number. That's actually politically helpful: the engineers see "we already saved 31 percent without touching your pods, and now we're asking for the next 15 percent." The framing flips from "we're cutting your resources" to "we're tuning the last bit of headroom."

The 31 percent number is real and replicable. The work fits in one to three sprints per lever, takes little engineering time outside SRE beyond the priority-class tagging, and doesn't change any pod's resource requests. It's the cheapest savings on the EKS bill and it shows up before the harder right-sizing fight even starts.