<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muskan </title>
    <description>The latest articles on DEV Community by Muskan  (@muskan_8abedcc7e12).</description>
    <link>https://dev.to/muskan_8abedcc7e12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3814925%2F56a25a4c-6dc3-421c-9bec-b598c5c71423.png</url>
      <title>DEV Community: Muskan </title>
      <link>https://dev.to/muskan_8abedcc7e12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muskan_8abedcc7e12"/>
    <language>en</language>
    <item>
      <title>Railway went down for 10 hours, and it wasn't their fault. Here's the part nobody is talking about.</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 22 May 2026 11:16:30 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/the-railway-went-down-for-10-hours-and-it-wasnt-their-fault-heres-the-part-nobody-is-talking-4fh1</link>
      <guid>https://dev.to/muskan_8abedcc7e12/the-railway-went-down-for-10-hours-and-it-wasnt-their-fault-heres-the-part-nobody-is-talking-4fh1</guid>
      <description>&lt;h2&gt;
  
  
  22:10 UTC. May 19, 2026.
&lt;/h2&gt;

&lt;p&gt;Railway's monitoring starts screaming.&lt;/p&gt;

&lt;p&gt;Dashboard, 503. API, dead. Logins, failing. Within nine minutes, the on-call engineers have an answer, and honestly, it's almost worse than an outage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Cloud suspended Railway's entire production account.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No warning. No email. No phone call. Just an automated enforcement action that flipped a switch on a company that spends over ten million dollars a year with them.&lt;/p&gt;

&lt;p&gt;I put together a short breakdown of what actually happened, and walked through how we'd have spotted this kind of single-point-of-failure on the architecture canvas with Blast Radius before it bit. If you want the visual version of this post, it's here:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/TUNjidoDj48"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you've been on the internet long enough, you've seen this movie before. UniSuper, a $50B pension fund, had its Google Cloud environment accidentally deleted by GCP in 2024. Plenty of indie devs have been auto-banned by AWS and Google with zero recourse. The Railway one just happens to be the biggest "developer cloud gets locked out of the cloud" event so far.&lt;/p&gt;

&lt;p&gt;But the part that got under my skin wasn't the suspension. It was what happened &lt;em&gt;next&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Railway isn't even fully on GCP
&lt;/h2&gt;

&lt;p&gt;Here's what makes this incident actually interesting for anyone running infra.&lt;/p&gt;

&lt;p&gt;Railway runs workloads on &lt;strong&gt;three&lt;/strong&gt; things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Their own bare metal hardware (Railway Metal)&lt;/li&gt;
&lt;li&gt;AWS&lt;/li&gt;
&lt;li&gt;GCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Smart. Multi-cloud. Exactly what every architecture deck on LinkedIn tells you to do.&lt;/p&gt;

&lt;p&gt;But their &lt;strong&gt;network control plane&lt;/strong&gt;, the thing that knows where everything lives and how to route traffic, was hosted on GCP. So when GCP suspended the account at 22:20 UTC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;22:20: control plane goes down&lt;/li&gt;
&lt;li&gt;22:35: the routing cache at the edge starts expiring&lt;/li&gt;
&lt;li&gt;~23:35: Railway Metal workloads start returning 404s&lt;/li&gt;
&lt;li&gt;shortly after: AWS workloads do the same&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time the routing cache fully expired, &lt;strong&gt;every single workload across every cloud was unreachable&lt;/strong&gt;. Even the ones running on hardware that Railway owns outright, sitting in its own racks, completely untouched by Google's enforcement action.&lt;/p&gt;

&lt;p&gt;The servers were fine. The applications were fine. Nobody could find them.&lt;/p&gt;

&lt;p&gt;That's the blast radius of one upstream click.&lt;/p&gt;

&lt;p&gt;From Railway's own postmortem, which is unusually honest and worth reading in full:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud. This meant that despite the mesh continuing to operate for an hour, when the route cache expired, the mesh failed to re-populate the routing tables."&lt;/p&gt;
&lt;/blockquote&gt;
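
&lt;p&gt;To make that failure mode concrete, here is a minimal Python sketch of a TTL-based route cache that needs a control plane to refresh itself. Every name in it (&lt;code&gt;RouteCache&lt;/code&gt;, &lt;code&gt;fetch_route&lt;/code&gt;, the one-hour TTL, the address) is an assumption for illustration, not Railway's actual implementation.&lt;/p&gt;

```python
class ControlPlaneDown(Exception):
    """Raised by the (hypothetical) control plane client when it is unreachable."""


class RouteCache:
    """Edge-side cache of workload routes, refreshed from a control plane."""

    def __init__(self, fetch_route, ttl_seconds=3600):
        self.fetch_route = fetch_route  # callable: workload name to address
        self.ttl_seconds = ttl_seconds
        self.entries = {}  # workload name: (address, fetched_at)

    def lookup(self, workload, now):
        cached = self.entries.get(workload)
        if cached is not None and self.ttl_seconds > now - cached[1]:
            return cached[0]  # still fresh, no control plane call needed
        # Expired or never seen: this refresh is the hard dependency.
        address = self.fetch_route(workload)
        self.entries[workload] = (address, now)
        return address


control_plane = {"up": True}


def fetch_route(workload):
    if not control_plane["up"]:
        raise ControlPlaneDown(workload)
    return "10.0.0.7"


cache = RouteCache(fetch_route)
first = cache.lookup("checkout", now=0)          # populated while healthy
control_plane["up"] = False                      # upstream account suspended
during_ttl = cache.lookup("checkout", now=1800)  # still served from cache
try:
    cache.lookup("checkout", now=3601)           # TTL elapsed, refresh fails
    after_ttl = "reachable"
except ControlPlaneDown:
    after_ttl = "unreachable"
```

&lt;p&gt;The workload itself never changes in this sketch. It only becomes unreachable once the cached entry ages out and nothing is left to answer the refresh.&lt;/p&gt;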

&lt;p&gt;And from Angelo on the Railway forum:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This one was egregiously bad because it was a single and expected point of failure like a cloud account getting removed… to say we are livid is an understatement."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Account access came back in 9 minutes after a P0 escalation. But the customer-facing outage ran nearly 10 hours total, because once the edges have forgotten where everything lives, you have to wake up disks, restart compute, rebuild routes, and re-converge the mesh layer by layer. The technical recovery alone took the better part of 8 hours after access was restored. And then GitHub starts rate-limiting your OAuth during the recovery surge, because of course it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing every engineer felt reading this
&lt;/h2&gt;

&lt;p&gt;If you write infrastructure for a living, you read the Railway postmortem with one specific feeling, and it's not schadenfreude.&lt;/p&gt;

&lt;p&gt;It's the cold realization that &lt;strong&gt;you don't actually know what depends on what in your own stack&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You know the obvious stuff. The big "if RDS goes down, the app goes down" connections. But the long tail? The security group that's quietly shared between four services? The Lambda that's the only thing keeping a webhook alive? The "idle" read replica that's actually the cross-region failover for orders?&lt;/p&gt;

&lt;p&gt;You don't know. I don't know. Nobody knows until the thing breaks.&lt;/p&gt;

&lt;p&gt;This is the gap nobody fills. Cost tools are great at telling you what's wasted. Observability tools are great at telling you what's broken &lt;em&gt;right now&lt;/em&gt;. Neither one tells you &lt;strong&gt;what will break if you touch this&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So teams do one of two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete it and pray.&lt;/li&gt;
&lt;li&gt;Don't delete it, and sit on thousands of dollars of monthly waste because nobody wants to be the person who broke checkout on a Friday.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That second one is the actual norm, by the way. Talk to any cloud cost lead, and they'll tell you the bottleneck isn't &lt;em&gt;finding&lt;/em&gt; the savings. It's getting anyone to confidently apply the recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built (and why Railway's story is exactly the use case)
&lt;/h2&gt;

&lt;p&gt;This is the gap we've been building Blast Radius for in ZopNight.&lt;/p&gt;

&lt;p&gt;The idea is dead simple. Before you apply any recommendation (delete this, stop that, modify the other thing), Blast Radius lights up the architecture canvas and shows you, in plain language, what's actually connected and what's about to break. (If you watched the video above, you've already seen the canvas in action: the RDS read replica that &lt;em&gt;looked&lt;/em&gt; idle but was actually the cross-region failover for production orders. That's the kind of thing this catches.)&lt;/p&gt;

&lt;p&gt;Here's how it works under the hood:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adjacency graph.&lt;/strong&gt; We build it from the metadata you already have: shared security groups, load balancer targets, volume attachments, Lambda triggers, and parent resources. No agents. No flow logs. Just the same metadata you can see in the AWS / GCP / Azure consoles, stitched into a real graph.&lt;/p&gt;
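
&lt;p&gt;As a rough illustration of the idea (not ZopNight's actual code), a graph like this can be sketched in a few lines of Python. The resource names and the &lt;code&gt;EDGES&lt;/code&gt; records are made up; in practice they would come from the cloud metadata APIs.&lt;/p&gt;

```python
from collections import defaultdict, deque

# Hypothetical relationship records: (resource, depends_on, relationship kind).
EDGES = [
    ("web-alb", "web-ec2", "load_balancer_target"),
    ("web-ec2", "orders-db", "shared_security_group"),
    ("orders-replica", "orders-db", "parent_resource"),
    ("webhook-lambda", "orders-db", "lambda_trigger"),
    ("batch-ec2", "scratch-volume", "volume_attachment"),
]


def build_dependents(edges):
    """Map each resource to the resources that depend on it directly."""
    dependents = defaultdict(set)
    for resource, depends_on, _kind in edges:
        dependents[depends_on].add(resource)
    return dependents


def blast_radius(target, edges):
    """Breadth-first walk over everything that transitively depends on target."""
    dependents = build_dependents(edges)
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


affected = blast_radius("orders-db", EDGES)
```

&lt;p&gt;Here &lt;code&gt;affected&lt;/code&gt; contains the replica, the webhook Lambda, the EC2 instance, and the load balancer in front of it, which is the set you would want highlighted before touching &lt;code&gt;orders-db&lt;/code&gt;.&lt;/p&gt;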

&lt;p&gt;&lt;strong&gt;Resource-type-aware impact classification.&lt;/strong&gt; A modification on a Lambda is not the same as a modification on an EC2 instance, which is not the same as a modification on a GKE node pool. We classified 131 resource types into three behavior buckets: &lt;code&gt;onlineModify&lt;/code&gt; (live update, no interruption), &lt;code&gt;restartModify&lt;/code&gt; (brief restart), and &lt;code&gt;poolModify&lt;/code&gt; (children recreated). The classification respects that. Default for unmapped types is &lt;em&gt;Warning&lt;/em&gt;, not &lt;em&gt;Safe&lt;/em&gt;. Safe has to be proven.&lt;/p&gt;
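
&lt;p&gt;A minimal sketch of that lookup, assuming a simple table keyed by resource type. The type identifiers, which type lands in which bucket, and the impact label attached to each bucket are illustrative guesses; only the three bucket names and the Warning default come from the description above.&lt;/p&gt;

```python
# Assumed mapping of resource types to the three behaviour buckets.
BEHAVIOR_BY_TYPE = {
    "aws_lambda_function": "onlineModify",
    "aws_ec2_instance": "restartModify",
    "gke_node_pool": "poolModify",
}

# Assumed impact label per bucket: only a live update counts as Safe here.
IMPACT_BY_BEHAVIOR = {
    "onlineModify": "Safe",
    "restartModify": "Warning",
    "poolModify": "Warning",
}


def classify_modification(resource_type):
    """Return (bucket, impact). Unmapped types fall back to Warning, never Safe."""
    behavior = BEHAVIOR_BY_TYPE.get(resource_type)
    if behavior is None:
        return None, "Warning"
    return behavior, IMPACT_BY_BEHAVIOR[behavior]


known = classify_modification("aws_lambda_function")
unknown = classify_modification("some_new_resource_type")
```

&lt;p&gt;The design choice worth copying is the fallback: a type nobody has mapped yet gets the cautious answer by default.&lt;/p&gt;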

&lt;p&gt;&lt;strong&gt;Color on the canvas.&lt;/strong&gt; Green = safe. Amber = a connection will break. Red = destroyed or non-functional. Everything unrelated gets dimmed so your eyes can find the actual story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A risk score, 0 to 100.&lt;/strong&gt; It weighs the severity of impact, the environment (prod, staging, or dev), how many teams own pieces of it, and whether it's on an active schedule. Not a vibe. A number that explains itself.&lt;/p&gt;

&lt;p&gt;So when you click "delete this idle RDS read replica," and Blast Radius lights up red on a cross-region link you forgot existed, you don't have to be the person who broke DR on a Friday. You loop in the right team, or you skip that one and confidently apply the eleven others that came back green.&lt;/p&gt;

&lt;p&gt;If Railway had been able to look at their own architecture and ask "what happens if our GCP control plane disappears for 10 hours", and &lt;em&gt;see&lt;/em&gt; the answer light up across Metal and AWS in red, they'd have ripped that dependency out a year ago. They are now, by the way. The postmortem commits to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removing GCP from the data plane's hot path&lt;/li&gt;
&lt;li&gt;A true mesh across Metal, AWS, and GCP where any one interconnect can fail&lt;/li&gt;
&lt;li&gt;HA database shards split across AWS and Metal, so quorum survives losing a cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The lesson the rest of us pay nothing for
&lt;/h2&gt;

&lt;p&gt;You can't prevent your cloud provider's billing department from having a bad day. You can't make their fraud algorithms call you first.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;can&lt;/em&gt; do is make sure that when one upstream blinks, you already know, exactly, with receipts, which of your own resources will go dark with it.&lt;/p&gt;

&lt;p&gt;The internet runs on three companies. Every once in a while, one of them hits the off switch. The question isn't whether it happens to your stack. It's whether you'll be able to point at a canvas and say, "I already knew that would fail. Here's what I did about it."&lt;/p&gt;

&lt;p&gt;Visibility is always cheaper than recovery.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>sre</category>
      <category>postmortem</category>
    </item>
    <item>
      <title>From auto-recommendation to one-click cloud remediation, the workflow most tools skip</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Thu, 21 May 2026 06:39:25 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/from-auto-recommendation-to-one-click-cloud-remediation-the-workflow-most-tools-skip-5466</link>
      <guid>https://dev.to/muskan_8abedcc7e12/from-auto-recommendation-to-one-click-cloud-remediation-the-workflow-most-tools-skip-5466</guid>
      <description>&lt;p&gt;Every cloud cost tool I have ever opened shows the same big number near the top of the dashboard. You could save 487,000 dollars a year. Sometimes it is bigger. The number is real in the sense that the math checks out. The number is also a lie in the sense that almost none of it ever happens.&lt;/p&gt;

&lt;p&gt;The recommended savings number on your dashboard is not the realised savings number on your bill. The gap between those two is where most FinOps tools quietly fail their users, and it is not a small gap. At most teams I have talked to, it sits somewhere between 80 and 95 per cent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ticket nobody picks up
&lt;/h2&gt;

&lt;p&gt;Walk through what the dashboard asks you to do when it says to stop this idle EC2 instance and save 312 dollars a month. The path looks like this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An engineer reads the recommendation.&lt;/li&gt;
&lt;li&gt;Files a ticket because they do not own the resource.&lt;/li&gt;
&lt;li&gt;Waits for the team that does.&lt;/li&gt;
&lt;li&gt;That team schedules the work into a sprint.&lt;/li&gt;
&lt;li&gt;Someone eventually logs into the AWS console.&lt;/li&gt;
&lt;li&gt;Finds the resource. Confirms it is actually the right one.&lt;/li&gt;
&lt;li&gt;Runs the stop action.&lt;/li&gt;
&lt;li&gt;Verifies nothing downstream broke.&lt;/li&gt;
&lt;li&gt;Updates the ticket.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nine steps, three humans, two weeks of calendar time for a single 312 dollar recommendation. Multiply that by a few hundred recommendations a month, and the math becomes obvious. Nobody works through that queue. The recommendations pile up. The dashboard keeps showing the same big number. The bill keeps not going down.&lt;/p&gt;

&lt;p&gt;The recommendations were never the problem. The execution layer was.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendation is not remediation
&lt;/h2&gt;

&lt;p&gt;Every cloud cost tool on the market does auto-recommendation well. It scans your account, finds the idle instances, the over-provisioned databases, and the orphaned storage, and surfaces them on a dashboard. Some tools are very good at this part. The recommendations are usually right.&lt;/p&gt;

&lt;p&gt;What almost no tool does well is auto-remediation. Recommendation tells you what to do. Remediation actually does it. The first is a report. The second is a button that, when clicked, performs the cloud action, verifies it landed, and writes an audit log.&lt;/p&gt;

&lt;p&gt;Most teams have spent the last five years drowning in recommendations they never executed on. The dashboards got more sophisticated. The list of suggested actions got longer. The realised savings number on the bill barely moved.&lt;/p&gt;

&lt;p&gt;The reason every cost tool stops at recommendation and not remediation is that recommendation is safe to ship, and remediation is not. A number on a dashboard cannot break production. A cloud API call that stops a resource absolutely can. So the industry settled on a comfortable middle ground. Tell the user what to do. Let them deal with the consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The obvious fix, and why it is harder than it looks
&lt;/h2&gt;

&lt;p&gt;The obvious fix is a button on the recommendation that just does the thing. Click, instance stopped, ticket avoided. People have tried this. The reason it has not been a default feature in cost tools is that cloud actions are not safe to fire blindly, and the failure modes are bad enough that one wrong action poisons trust in the whole tool for a year.&lt;/p&gt;

&lt;p&gt;The interesting engineering problem is not calling the stop API. That part takes an afternoon. The interesting part is everything around it.&lt;/p&gt;

&lt;p&gt;I have been watching a team build this, and the workflow they ended up with is a useful artefact even if you never use their tool. The shape of it is what matters.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/q2deCXtMwI8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What sits behind one-click remediation
&lt;/h2&gt;

&lt;p&gt;Five steps, and skipping any of them is how you get a 3 am incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Precondition check.&lt;/strong&gt; Before doing anything, ask the cloud what state the resource is actually in right now. If somebody on the team manually stopped it an hour ago, the workflow stops here and reports "already done". This single check is the difference between automation people trust and automation people turn off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Optional approval.&lt;/strong&gt; Some actions need a human gate. Production-tier rules, destructive operations, anything where the cost of a wrong call is worse than the cost of a slow call. The approval request queues with full context: resource, savings, who initiated it, and what rule fired. An admin clicks approve or reject. Cheap actions skip this entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Execute.&lt;/strong&gt; Call the cloud API. Stop the EC2, pause the Synapse pool, scale the Lambda to zero. This is the boring part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Validate.&lt;/strong&gt; This is the part most tools get wrong. A 200 response from the cloud API does not mean the resource is actually stopped. The validate step polls the cloud state until it confirms the action genuinely landed. If the API said yes but the resource is still running, the workflow flags it as a system error instead of silently lying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Audit.&lt;/strong&gt; Every step, input, and result is written to a dedicated audit table. Six months from now, when someone asks who stopped the prod-adjacent Synapse pool on March 12, the answer is one query away.&lt;/p&gt;
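
&lt;p&gt;The five steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's implementation: &lt;code&gt;FakeCloud&lt;/code&gt; is an in-memory stand-in for a real provider SDK, the audit "table" is a plain list, and a real validate loop would wait between polls.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration)."""
    states: dict
    honest: bool = True  # when False, stop() returns 200 but changes nothing

    def get_state(self, resource_id):
        return self.states[resource_id]

    def stop(self, resource_id):
        if self.honest:
            self.states[resource_id] = "stopped"
        return 200


@dataclass
class Remediation:
    cloud: FakeCloud
    audit_log: list = field(default_factory=list)

    def record(self, resource_id, step, result):
        # 5. Audit: every step and its result is appended here.
        self.audit_log.append((resource_id, step, result))

    def stop_instance(self, resource_id, needs_approval=False, approved=False, polls=3):
        # 1. Precondition: ask the cloud for the current state.
        if self.cloud.get_state(resource_id) == "stopped":
            self.record(resource_id, "precondition", "already_done")
            return "already_done"
        self.record(resource_id, "precondition", "ok")
        # 2. Optional approval gate.
        if needs_approval and not approved:
            self.record(resource_id, "approval", "pending")
            return "pending_approval"
        # 3. Execute.
        status = self.cloud.stop(resource_id)
        self.record(resource_id, "execute", status)
        # 4. Validate: poll real state instead of trusting the response.
        for _ in range(polls):
            if self.cloud.get_state(resource_id) == "stopped":
                self.record(resource_id, "validate", "confirmed")
                return "done"
        self.record(resource_id, "validate", "system_error")
        return "system_error"


healthy = Remediation(FakeCloud({"i-1": "running"}))
first_run = healthy.stop_instance("i-1")
second_run = healthy.stop_instance("i-1")

lying = Remediation(FakeCloud({"i-2": "running"}, honest=False))
lying_run = lying.stop_instance("i-2")
```

&lt;p&gt;The &lt;code&gt;lying&lt;/code&gt; case is the one that matters: the API returns 200, the state never changes, and the workflow ends in &lt;code&gt;system_error&lt;/code&gt; instead of reporting success.&lt;/p&gt;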

&lt;p&gt;The other thing worth stealing from this design is how errors get categorised. Three buckets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User action.&lt;/strong&gt; Permission denied, quota hit. Shows the fix with a console link.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transient.&lt;/strong&gt; 429s, 5xx. Gets a retry button.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System.&lt;/strong&gt; The cloud actually broke, or the API is unsupported. Gets a diagnostic and a support contact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The category drives the UI. Retry only shows up where retry actually makes sense. This sounds small. It is the difference between an automation surface that engineers learn to trust and one they learn to ignore.&lt;/p&gt;
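
&lt;p&gt;One way to express that mapping, as a sketch: the error codes and control names below are assumptions, and a real tool would translate each provider's own error codes into the three buckets.&lt;/p&gt;

```python
from enum import Enum


class ErrorCategory(Enum):
    USER_ACTION = "user_action"
    TRANSIENT = "transient"
    SYSTEM = "system"


# Assumed error codes, for illustration only.
USER_ACTION_CODES = {"AccessDenied", "QuotaExceeded"}


def categorise(http_status, error_code=None):
    """Sort a failed cloud call into one of the three buckets."""
    if error_code in USER_ACTION_CODES or http_status == 403:
        return ErrorCategory.USER_ACTION
    if http_status == 429 or http_status >= 500:
        return ErrorCategory.TRANSIENT
    return ErrorCategory.SYSTEM


def ui_actions(category):
    """The category decides which controls the error card shows."""
    return {
        ErrorCategory.USER_ACTION: ["show_fix", "console_link"],
        ErrorCategory.TRANSIENT: ["retry"],
        ErrorCategory.SYSTEM: ["diagnostic", "contact_support"],
    }[category]


throttled = ui_actions(categorise(429))
denied = ui_actions(categorise(403, "AccessDenied"))
```

&lt;p&gt;Only the transient bucket ever produces a &lt;code&gt;retry&lt;/code&gt; control, so the retry button cannot appear on an error that retrying will not fix.&lt;/p&gt;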

&lt;h2&gt;
  
  
  The certification gate
&lt;/h2&gt;

&lt;p&gt;One choice in this design that surprised me. Not every recommendation rule gets a Remediate button. The team ships the button only on rules that have been certified end-to-end on real cloud accounts. The certified set started at 20 rules covering stop, scale-to-zero, and pause actions across AWS, GCP, and Azure. The other rules render the recommendation card without the button.&lt;/p&gt;

&lt;p&gt;The temptation when shipping automation is to ship it everywhere on day one. The discipline is to ship it only where you have proven the workflow handles the edge cases. A fake remediation that returns success but did not actually do anything is worse than no remediation at all, because it convinces the team that the savings are realised when they are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Databases Rule
&lt;/h2&gt;

&lt;p&gt;One more thing worth calling out. Customer data resources (RDS, Aurora, Cloud SQL, ElastiCache, the entire Postgres and MySQL family on every cloud) are excluded from any automated action. Not as a toggle. As a denylist enforced at the code level, so they cannot be passed to the executor, regardless of what the rule says. It is the kind of safety rail you only see in tools built by people who have personally been responsible for a database outage and refuse to be again.&lt;/p&gt;
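
&lt;p&gt;A code-level exclusion of this kind might look like the sketch below. The resource-type identifiers and the &lt;code&gt;execute&lt;/code&gt; signature are assumptions; the point is only that the check sits in front of the executor, not inside any rule.&lt;/p&gt;

```python
# Assumed resource-type identifiers; the real tool's names are not given.
PROTECTED_TYPES = frozenset({
    "aws_rds_instance",
    "aws_aurora_cluster",
    "gcp_cloud_sql_instance",
    "aws_elasticache_cluster",
})


class ProtectedResourceError(Exception):
    """Raised when a rule tries to act on a customer-data resource."""


def execute(action, resource_type, resource_id, run):
    """Refuse protected types before any cloud call, whatever the rule says."""
    if resource_type in PROTECTED_TYPES:
        raise ProtectedResourceError(
            f"{action} refused for {resource_type} {resource_id}"
        )
    return run(action, resource_id)


calls = []


def run(action, resource_id):
    calls.append((action, resource_id))
    return "ok"


allowed = execute("stop", "aws_ec2_instance", "i-1", run)
try:
    execute("stop", "aws_rds_instance", "orders-db", run)
    blocked = False
except ProtectedResourceError:
    blocked = True
```

&lt;p&gt;Because the guard raises before &lt;code&gt;run&lt;/code&gt; is called, a misconfigured rule cannot reach the cloud API for a protected type.&lt;/p&gt;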

&lt;h2&gt;
  
  
  What this changes
&lt;/h2&gt;

&lt;p&gt;The recommended savings number on your dashboard becomes a number you can actually realise. The 20-minute ticket-and-console-hop becomes one click. The audit log behind every action means you can show the CFO not just what you saved, but who saved it, when, and how.&lt;/p&gt;

&lt;p&gt;The interesting thing is that none of this is technically novel. Precondition, approval, execute, validate, audit. Five steps you would design on a whiteboard in an hour. The reason it matters is that almost no cost tool actually does it, and the ones that try usually skip the validation step and pretend the API response is the truth.&lt;/p&gt;

&lt;p&gt;If you are evaluating cloud cost tools, the question to ask is not what the recommended savings number is. The question is, what happens after I click the button, and how do you know the resource is actually stopped?&lt;/p&gt;

&lt;p&gt;That is the only number that ends up on your bill.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Blast Radius Before Execution: Why Autonomous Cloud Must Check Idle Resources First</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Tue, 19 May 2026 12:06:36 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/blast-radius-before-execution-why-autonomous-cloud-must-check-idle-resources-first-25in</link>
      <guid>https://dev.to/muskan_8abedcc7e12/blast-radius-before-execution-why-autonomous-cloud-must-check-idle-resources-first-25in</guid>
      <description>&lt;h1&gt;
  
  
  Blast Radius Before Execution: Why Autonomous Cloud Must Check Idle Resources First
&lt;/h1&gt;

&lt;p&gt;Autonomous cloud remediation fails the same way every time. The recommendation is correct. The action is correct. The scope is wrong.&lt;/p&gt;

&lt;p&gt;Stop the idle RDS instance. Correct recommendation. The instance has averaged 2% CPU for 30 days. It is genuinely idle. But it is also the database backing an internal integration endpoint that three production services call once a day, on different schedules, from different accounts. Stop it and you have a production incident at the next scheduled call time: 2:00 AM on a Tuesday.&lt;/p&gt;

&lt;p&gt;The recommendation engine did not fail. The blast radius model was missing.&lt;/p&gt;

&lt;p&gt;Autonomous systems that act without a blast radius check are not autonomous. They are automated. Automation executes instructions. Autonomy includes a model of consequences. Every &lt;a href="https://zop.dev/resources/blogs/auto-remediation-one-click-cloud-action-20-certified-rules" rel="noopener noreferrer"&gt;auto-remediation action ZopNight certifies&lt;/a&gt; runs through a blast radius check before execution. This post defines what that check contains and how to score it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Correct-But-Wrong-Scope Problem
&lt;/h2&gt;

&lt;p&gt;Cost tools are good at identifying idle resources. They look at CPU, memory, network, and storage I/O over a 14-30 day window. A resource below thresholds on all four dimensions is flagged as idle. The recommendation is statistically correct.&lt;/p&gt;

&lt;p&gt;The tool does not know what depends on that resource. It cannot see the cross-account IAM role that calls the RDS instance from a Lambda in another account. It cannot see the CloudFront distribution that caches responses from the EC2 instance it flagged. It cannot see that the "idle" ElastiCache cluster is the cache warming target for a batch job that runs quarterly.&lt;/p&gt;

&lt;p&gt;In the accounts we analyzed, 12-18% of resources flagged as idle have active downstream dependencies that would cause an incident if acted on without verification. That means roughly 1 in 8 to 1 in 6 autonomous actions on a naive system would create an incident. No engineering team will accept that failure rate for unattended automation.&lt;/p&gt;

&lt;p&gt;The blast radius check is the gate that separates the safe 82-88% from the risky 12-18%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Blast Radius: Three Inputs
&lt;/h2&gt;

&lt;p&gt;Blast radius is the set of resources and services that fail or degrade if the target resource is stopped, modified, or deleted mid-action. It is a pre-execution estimate, not a post-incident measurement.&lt;/p&gt;

&lt;p&gt;Three inputs define it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The dependency graph.&lt;/strong&gt; AWS VPC flow logs record every connection between resources in the past 14 days. A resource with no inbound or outbound connections in 14 days has a low dependency graph score. A resource with 200 connections &lt;a href="https://zop.dev/resources/blogs/fargate-tax-serverless-kubernetes-cost-38-percent" rel="noopener noreferrer"&gt;per day&lt;/a&gt; across 4 VPCs has a high one. Service mesh telemetry (Istio, Linkerd) adds application-layer connection data for services that flow logs cannot see (same-host connections, gRPC multiplexed streams). The dependency graph has a 6-hour lag for VPC flow logs. Resources with less than 6 hours of flow log data default to high blast radius.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Criticality tier.&lt;/strong&gt; Resource tags encode business context that infrastructure metrics cannot. A resource tagged &lt;code&gt;env=production&lt;/code&gt; and &lt;code&gt;tier=critical&lt;/code&gt; scores high blast radius regardless of its CPU utilization. A resource tagged &lt;code&gt;env=dev&lt;/code&gt; and &lt;code&gt;team=platform&lt;/code&gt; scores lower. Tags are not perfectly reliable: 23% of resources in typical accounts have stale or missing criticality tags. When the criticality tag is absent, blast radius defaults to medium.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recency.&lt;/strong&gt; A resource with no write operations in 24 hours is idle by the write signal. A resource with a write 18 hours ago is not idle. CloudTrail records write API calls against each resource. LastWriteTimestamp is the third input. Resources with writes in the last 24 hours get a recency penalty that raises their blast radius score regardless of CPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr5i0bz99e0c5vh212dv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr5i0bz99e0c5vh212dv.png" alt="diagram" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blast Radius Score: 0-100 and What Each Band Means
&lt;/h2&gt;

&lt;p&gt;The three inputs produce a numeric score from 0 to 100. The score determines the action policy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score range&lt;/th&gt;
&lt;th&gt;Action policy&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0-29&lt;/td&gt;
&lt;td&gt;Unattended execution&lt;/td&gt;
&lt;td&gt;No active dependencies, non-production tag, no recent writes. Safe to act without human review.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30-69&lt;/td&gt;
&lt;td&gt;Notification window&lt;/td&gt;
&lt;td&gt;Possible dependencies or ambiguous tags. Action queued with 15-minute notification. Human can cancel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70-100&lt;/td&gt;
&lt;td&gt;Approval required&lt;/td&gt;
&lt;td&gt;Active dependencies confirmed, production tag, or recent writes. Action blocked until explicit approval.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A resource scores below 30 only if all three conditions hold: dependency graph shows no connections in 14 days, criticality tag is non-production, and no writes in 24 hours. This is the conservative definition of "safely idle."&lt;/p&gt;

&lt;p&gt;A resource scores 70 or above if any one of these conditions holds: 50+ connections per day in flow logs, production or critical tag, or writes in the last 6 hours. One high-signal input overrides the others. An RDS instance tagged dev that has 200 daily connections scores 70 or above. The dev tag does not override the dependency signal.&lt;/p&gt;

&lt;p&gt;The 30-69 band handles the ambiguous cases: resources with missing tags, flow log gaps, or moderate connection counts. The 15-minute notification window gives an engineer time to cancel an action they know is risky, without requiring them to pre-approve every action in the queue.&lt;/p&gt;
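
&lt;p&gt;The banding rules can be sketched as below. The post does not give the numeric scoring formula, so this sketch only picks a band from the stated conditions and maps a score to its policy. The function names, parameter names, and tag keys are assumptions, and a resource with no recorded writes would pass &lt;code&gt;float("inf")&lt;/code&gt; for &lt;code&gt;hours_since_write&lt;/code&gt;.&lt;/p&gt;

```python
def action_policy(score):
    """Map a 0-100 blast radius score to the action policy from the table."""
    if score >= 70:
        return "approval_required"
    if score >= 30:
        return "notification_window"
    return "unattended_execution"


def score_band(connections_14d, connections_per_day, tags, hours_since_write):
    """Pick a band from the three inputs using the stated rules only."""
    env = tags.get("env")
    production = env == "production" or tags.get("tier") == "critical"
    # Any single high signal forces the top band.
    if connections_per_day >= 50 or production or 6 >= hours_since_write:
        return "70-100"
    # The bottom band needs all three quiet signals at once.
    # A missing env tag is treated as ambiguous, not as non-production.
    non_production = env is not None
    if connections_14d == 0 and non_production and hours_since_write > 24:
        return "0-29"
    return "30-69"


quiet_dev = score_band(0, 0, {"env": "dev"}, 100)
busy_dev = score_band(2800, 200, {"env": "dev"}, 100)
untagged = score_band(0, 0, {}, 100)
```

&lt;p&gt;The busy dev database lands in the top band despite its tag, and the untagged resource falls into the middle band, matching the rules described above.&lt;/p&gt;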

&lt;h2&gt;
  
  
  How ZopNight Gates Actions
&lt;/h2&gt;

&lt;p&gt;Every automated remediation in ZopNight runs the blast radius check before execution. The check adds 2-4 seconds to the action pipeline. For actions that run unattended at 3:00 AM, 4 seconds is an acceptable gate latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faudqpj5pzmssnby06bxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faudqpj5pzmssnby06bxl.png" alt="diagram" width="800" height="951"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a typical month, across the ZopNight customer fleet: 67% of triggered remediations score below 30 and run unattended. 24% score 30-69 and enter the notification queue; 91% of those proceed after the window with no cancellation. 9% score 70 or above and require approval; approval is granted for 78% of those, with 22% cancelled by the reviewing engineer.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://zop.dev/resources/blogs/autonomous-action-log-auditing-zopnight-decisions-production" rel="noopener noreferrer"&gt;autonomous action log&lt;/a&gt; records the blast radius score alongside every action. This creates the audit trail: not just what action ran, but why the system considered it safe to run without human review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Half-Action Problem: Idempotency as a Blast Radius Input
&lt;/h2&gt;

&lt;p&gt;Blast radius measures the risk of acting on a resource. It does not by itself measure the risk of acting and then failing halfway through.&lt;/p&gt;

&lt;p&gt;A multi-step remediation that stops an EC2 instance, modifies its tags, and starts it again has three steps. If the network fails after step 1 (stopped) and before step 3 (started), the instance is stopped but not restarted. The resource is in a worse state than before the action ran. The original state was idle but running. The post-failure state is stopped unexpectedly. Recovery requires manual intervention.&lt;/p&gt;

&lt;p&gt;Idempotency is the property that makes a remediation safe to retry from any point. A stop-and-delete action is not idempotent: running it twice on an already-deleted resource produces an error. A tag-update action is idempotent: running it twice produces the same result as running it once.&lt;/p&gt;

&lt;p&gt;Non-idempotent remediations get a blast radius floor of at least 50, regardless of their dependency graph, criticality, and recency scores. This forces them into the notification queue at minimum, never into the unattended queue.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Remediation type&lt;/th&gt;
&lt;th&gt;Idempotent&lt;/th&gt;
&lt;th&gt;Blast radius floor&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tag update&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;0 (no floor)&lt;/td&gt;
&lt;td&gt;Add cost-center tag to EC2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop instance&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;0 (no floor)&lt;/td&gt;
&lt;td&gt;Stop idle EC2 (safe to retry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete snapshot&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Deleted snapshot cannot be recovered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema migration&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Partial schema change leaves DB inconsistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account IAM change&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;IAM changes have immediate effect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop + reconfigure + start&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Failure mid-sequence leaves wrong config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
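
&lt;p&gt;A minimal sketch of how a floor could be applied on top of a raw score. The floor values come from the table above; the dictionary keys, the function names, and the threshold of 50 for the notification queue are assumptions made for illustration, not the product's actual implementation.&lt;/p&gt;

```python
# Floors copied from the table above; the keys are illustrative names.
BLAST_RADIUS_FLOORS = {
    "tag_update": 0,
    "stop_instance": 0,
    "delete_snapshot": 50,
    "schema_migration": 50,
    "cross_account_iam_change": 70,
    "stop_reconfigure_start": 50,
}

# Assumed threshold: scores at or above this go to the notification queue.
NOTIFICATION_THRESHOLD = 50


def effective_blast_radius(raw_score: int, remediation_type: str) -> int:
    """Raise the raw score to the floor for this remediation type."""
    # Unknown types are treated as non-idempotent (assumption: fail safe).
    floor = BLAST_RADIUS_FLOORS.get(remediation_type, 50)
    return max(raw_score, floor)


def route(raw_score: int, remediation_type: str) -> str:
    """Pick a queue from the floored score."""
    score = effective_blast_radius(raw_score, remediation_type)
    return "notification" if score >= NOTIFICATION_THRESHOLD else "unattended"
```

&lt;p&gt;Under these assumptions, a snapshot deletion whose raw score is 15 is still routed to the notification queue, while a tag update with the same raw score stays unattended.&lt;/p&gt;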

&lt;p&gt;The blast radius score is a pre-execution risk estimate. It is not a guarantee. A resource that scores 15 can still cause an incident if the flow logs had a 12-hour gap and missed an overnight connection. The score reduces the probability of the wrong-scope failure from 12-18% to under 1%. It does not reduce it to zero. The &lt;a href="https://zop.dev/resources/blogs/cloud-bill-is-a-control-problem" rel="noopener noreferrer"&gt;ZopNight trust score model&lt;/a&gt; uses blast radius as one input alongside recommendation confidence and business hours context. No single signal is the gate. The gate is the composite.&lt;/p&gt;

&lt;p&gt;Autonomous cloud is safe when the system knows what it does not know and routes accordingly. Blast radius is the model of what it does not know.&lt;/p&gt;

</description>
      <category>blast</category>
      <category>radius</category>
      <category>check</category>
      <category>autonomous</category>
    </item>
    <item>
      <title>Most Traffic Spikes Are Predictable. So Why Are We Still Panic-Scaling?</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Tue, 19 May 2026 12:01:24 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/most-traffic-spikes-are-predictable-so-why-are-we-still-panic-scaling-3hpo</link>
      <guid>https://dev.to/muskan_8abedcc7e12/most-traffic-spikes-are-predictable-so-why-are-we-still-panic-scaling-3hpo</guid>
      <description>&lt;p&gt;The usual playbook when a big event is coming: someone sends a Slack message three hours before launch asking, "Did we scale up?" A senior engineer logs into the AWS console, eyeballs the current desired count, multiplies by something, and manually bumps the number. Then forgets to roll it back.&lt;/p&gt;

&lt;p&gt;That's the part nobody talks about. The spike passes, the instance count stays at 3x, and you're burning money for two days because everyone assumed someone else would fix it.&lt;br&gt;
We ran into this enough times that we built a proper way to handle it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;Event Readiness turns a planned traffic spike into a structured scaling plan. You define the event window, set an expected load multiplier, and attach the autoscaling groups you want to scale.&lt;/p&gt;

&lt;p&gt;ZopNight handles the rest: it pre-scales before the event starts, holds capacity during it, and rolls everything back when it ends. No manual intervention. No forgetting to scale down.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/FvqGu_K9JeQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;You create a plan against your existing autoscaler policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick the event window (start, end, timezone)&lt;/li&gt;
&lt;li&gt;Set a load multiplier — e.g., 3x for a campaign expecting 3x traffic&lt;/li&gt;
&lt;li&gt;Attach target policies: AWS ASG, GCP MIG, or Azure VMSS&lt;/li&gt;
&lt;li&gt;Set a pre-scale buffer (we default to 30 minutes before event start)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ZopNight calculates the scaled min/max/desired for each target before you commit. If CPU metrics aren't available, it tells you it's estimating and why.&lt;/p&gt;

&lt;p&gt;Before saving, a preview step shows you exactly what will happen to every target: its current size, its scaled size, and the timestamps at which the executor will fire. No surprises.&lt;/p&gt;
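
&lt;p&gt;To make the preview concrete, here is a rough sketch of the arithmetic, assuming each bound is simply multiplied by the load multiplier and rounded up. The &lt;code&gt;TargetSize&lt;/code&gt; type and both function names are hypothetical, and the real calculation may also weigh CPU metrics.&lt;/p&gt;

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TargetSize:
    min_size: int
    max_size: int
    desired: int


def scale_target(current: TargetSize, multiplier: float) -> TargetSize:
    """Multiply each bound and round up so capacity is never under-provisioned."""
    return TargetSize(
        min_size=math.ceil(current.min_size * multiplier),
        max_size=math.ceil(current.max_size * multiplier),
        desired=math.ceil(current.desired * multiplier),
    )


def pre_scale_time(event_start: datetime, buffer_minutes: int = 30) -> datetime:
    """Timestamp at which scale-up should fire (30-minute default buffer)."""
    return event_start - timedelta(minutes=buffer_minutes)


# Example values are made up for illustration.
current = TargetSize(min_size=2, max_size=10, desired=4)
scaled = scale_target(current, 3.0)
start = datetime(2026, 6, 1, 18, 0, tzinfo=timezone.utc)
fire_at = pre_scale_time(start)
```

&lt;p&gt;With a 3x multiplier, a group at 2/10/4 previews as 6/30/12, and scale-up fires at 17:30 UTC for an 18:00 UTC start.&lt;/p&gt;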

&lt;p&gt;Once scheduled, the plan moves through a clear state machine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;draft → scheduled → scaling_up → active → scaling_down → completed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If something fails, it lands in &lt;code&gt;failed&lt;/code&gt;, and you can retry from &lt;code&gt;draft&lt;/code&gt;. Cancelling from any active state rolls back whatever was already scaled.&lt;/p&gt;
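
&lt;p&gt;The state machine can be sketched as a transition table. The happy path and the failed-to-draft retry come from the description above; the &lt;code&gt;cancelled&lt;/code&gt; state name and the exact set of states that can fail or be cancelled are assumptions for illustration.&lt;/p&gt;

```python
# Allowed transitions. "cancelled" is a hypothetical terminal state.
TRANSITIONS = {
    "draft": {"scheduled"},
    "scheduled": {"scaling_up", "cancelled"},
    "scaling_up": {"active", "failed", "cancelled"},
    "active": {"scaling_down", "failed", "cancelled"},
    "scaling_down": {"completed", "failed"},
    "failed": {"draft"},  # retry starts again from draft
    "completed": set(),
    "cancelled": set(),
}


def advance(state: str, new_state: str) -> str:
    """Return new_state if the transition is allowed, otherwise raise."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} to {new_state}")
    return new_state
```

&lt;p&gt;Keeping the transitions in one table means an executor cannot, for example, jump a plan from draft straight to active without passing through the scale-up step.&lt;/p&gt;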

&lt;h2&gt;
  
  
  The cost estimate
&lt;/h2&gt;

&lt;p&gt;One thing we added that most teams don't have: upfront cost visibility. Before you schedule the plan, you can see the estimated extra cost per target, per hour, in dollars. Not after the event. Before it.&lt;/p&gt;

&lt;p&gt;For a plan running 8 hours at 3x capacity across two ASGs, that number is usually a lot smaller than the cost of the event going down.&lt;/p&gt;

&lt;p&gt;How does your team handle planned traffic spikes right now? Manual scaling, scripts, or something else?&lt;/p&gt;

</description>
      <category>automation</category>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Verified Schedule Savings vs Estimated Savings: Why the Difference Matters to Your CFO</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Mon, 18 May 2026 11:08:00 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/verified-schedule-savings-vs-estimated-savings-why-the-difference-matters-to-your-cfo-1j6p</link>
      <guid>https://dev.to/muskan_8abedcc7e12/verified-schedule-savings-vs-estimated-savings-why-the-difference-matters-to-your-cfo-1j6p</guid>
      <description>&lt;h1&gt;
  
  
  Verified Schedule Savings vs Estimated Savings: Why the Difference Matters to Your CFO
&lt;/h1&gt;

&lt;p&gt;Every engineering team reports cloud savings at some point. The number goes into a slide, a Jira ticket, or a quarterly review. Then a CFO or finance lead asks one follow-up question: "Can you prove it?"&lt;/p&gt;

&lt;p&gt;Most teams cannot. They have a configured schedule and a projected saving. They do not have evidence that the schedule executed, that the resource actually stopped, or that the cost was genuinely avoided. The projected number and the real number are different, and without the distinction, the savings figure is not auditable.&lt;/p&gt;

&lt;p&gt;zopnight's Cost Reports page reports two savings numbers deliberately: Estimated Schedule Savings and Verified Schedule Savings. The gap between them is not a reporting artefact. It is a governance metric called the savings verification gap. This post explains what each number measures, why the gap matters, and how to give your CFO the audit-ready view they need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Estimated Savings Is a Projection, Not a Measurement
&lt;/h2&gt;

&lt;p&gt;Estimated Schedule Savings are calculated from configuration. The formula is straightforward: scheduled downtime hours multiplied by the hourly resource rate. If a virtual machine costs $0.50 per hour and your &lt;a href="https://zop.dev/resources/blog/automated-cloud-scheduling-non-prod-environments" rel="noopener noreferrer"&gt;non-production schedule&lt;/a&gt; stops it for 200 hours a month, the estimated saving is $100.&lt;/p&gt;
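
&lt;p&gt;As a quick sketch, the projection is a single multiplication. The function name is illustrative; the numbers are the ones from the example above.&lt;/p&gt;

```python
def estimated_schedule_savings(downtime_hours: float, hourly_rate: float) -> float:
    """Projected saving: configured downtime hours times the hourly rate."""
    return downtime_hours * hourly_rate


# The VM from the example: $0.50 per hour, stopped for 200 hours a month.
estimate = estimated_schedule_savings(200, 0.50)
```

&lt;p&gt;Note that nothing in this calculation checks whether the resource actually stopped. It only reads the schedule configuration.&lt;/p&gt;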

&lt;p&gt;The calculation is correct when schedules execute cleanly. It is wrong in three specific ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource locks and dependency failures.&lt;/strong&gt; Some resources cannot be stopped when a schedule triggers. A VM with an attached managed disk that another process is writing to, a database instance with an active connection pool that blocks shutdown, a container that a health check is actively querying. The schedule fires. The stop command fails. The resource stays running. The estimated saving is counted. The real saving is zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual overrides.&lt;/strong&gt; A developer needs an environment to stay live past its scheduled shutdown. They override the schedule for the night. The schedule was configured. The saving was projected. The resource ran. This is legitimate in isolation. At scale, across a team of 20 engineers over a month, it accumulates into a consistent gap between what was projected and what actually happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timezone and CRON misconfiguration.&lt;/strong&gt; A schedule set to stop a resource at 8 PM in one timezone fires at 8 PM UTC instead. The resource runs for an additional 5 hours before the next maintenance window corrects it. The estimated saving assumed perfect timing. The actual stopped window, and so the actual saving, was smaller.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Cost consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Resource lock&lt;/td&gt;
&lt;td&gt;Dependency blocking stop command&lt;/td&gt;
&lt;td&gt;Full estimated saving uncollected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual override&lt;/td&gt;
&lt;td&gt;Developer keeps resource live past schedule&lt;/td&gt;
&lt;td&gt;Partial or full saving lost for that period&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRON misconfiguration&lt;/td&gt;
&lt;td&gt;Wrong timezone, incorrect window&lt;/td&gt;
&lt;td&gt;Saving window shorter than configured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these failures appear in estimated savings. They are all invisible until you compare estimated against verified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verified Savings Is a Measurement, Not a Projection
&lt;/h2&gt;

&lt;p&gt;Verified Schedule Savings are calculated from resource state transitions. zopnight does not count a saving when a schedule fires. It counts a saving when the resource state changes from running to stopped and the state change is confirmed.&lt;/p&gt;

&lt;p&gt;Each confirmed state transition is recorded in Execution History with a timestamp, a resource ID, the action taken, and the result. The saving is written against that record. If the state transition does not happen, the record shows a failure, and no saving is counted.&lt;/p&gt;

&lt;p&gt;This produces a number that is always lower than or equal to estimated savings. It can never be higher. Every saving in the verified total has a corresponding execution record. That record is the audit trail.&lt;/p&gt;
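
&lt;p&gt;A rough sketch of the counting rule, assuming each execution record carries a confirmation flag and the hours the resource actually stayed stopped. The &lt;code&gt;ExecutionRecord&lt;/code&gt; fields are hypothetical and are not zopnight's real schema.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ExecutionRecord:
    resource_id: str
    action: str
    confirmed: bool       # True only if the running-to-stopped change was observed
    stopped_hours: float  # hours the resource actually stayed stopped
    hourly_rate: float    # dollars per hour while running


def verified_schedule_savings(records: list) -> float:
    """Sum savings only for records whose state transition was confirmed."""
    return sum(r.stopped_hours * r.hourly_rate for r in records if r.confirmed)


# Example records are made up for illustration.
records = [
    ExecutionRecord("vm-a", "stop", True, 10.0, 0.50),
    ExecutionRecord("vm-b", "stop", False, 0.0, 0.50),  # stop failed: no saving
    ExecutionRecord("db-c", "stop", True, 8.0, 1.25),
]
verified = verified_schedule_savings(records)
```

&lt;p&gt;Because unconfirmed records contribute nothing, the total can only match or fall below the projection for the same schedules.&lt;/p&gt;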

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4sk7csfblzrftyt6z3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4sk7csfblzrftyt6z3f.png" alt="diagram" width="800" height="1309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Execution History is what changes the conversation with finance. "We saved $8,200 this month" is a claim. "We saved $8,200 this month and here are 340 execution records showing each state transition that produced it" is evidence. Only one of those survives a finance review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Savings Verification Gap Is Your Schedule Reliability Score
&lt;/h2&gt;

&lt;p&gt;The savings verification gap is Estimated Schedule Savings minus Verified Schedule Savings. Expressed as a percentage of Estimated Schedule Savings, it measures the fraction of configured savings that schedules failed to deliver.&lt;/p&gt;

&lt;p&gt;A gap of 18% means 18% of the projected savings were never delivered, because some scheduled actions failed, were overridden, or ran late. Those resources ran when they should not have. The cost was incurred and was not offset by any verified saving. The gap does not tell you which resources failed, but Execution History does. Each entry in the failure log shows which resource, which schedule, which action, and what error caused the failure.&lt;/p&gt;
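
&lt;p&gt;One way to compute the gap, assuming the percentage is normalised by the estimated figure. The $8,200 verified value echoes the example earlier in this post; the $10,000 estimate is an illustrative assumption chosen to produce an 18% gap.&lt;/p&gt;

```python
def verification_gap(estimated: float, verified: float) -> tuple:
    """Return the gap in dollars and as a percentage of the estimate."""
    gap = estimated - verified
    gap_pct = 0.0 if estimated == 0 else 100.0 * gap / estimated
    return gap, gap_pct


# Illustrative month: $10,000 projected, $8,200 confirmed.
gap_dollars, gap_pct = verification_gap(10_000.0, 8_200.0)
```

&lt;p&gt;Tracking the percentage rather than the dollar figure keeps the metric comparable across teams with very different schedule footprints.&lt;/p&gt;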

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogjiswjxm7zsm5gj4vj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogjiswjxm7zsm5gj4vj1.png" alt="diagram" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the accountability link between engineering and finance. Engineering teams commit to a savings target when they configure schedules. The gap measures how much of that commitment was delivered. A team with a consistent 5% gap has reliable schedules. A team with a 30% gap has a schedule execution problem, and the gap is the first place to look.&lt;/p&gt;

&lt;p&gt;The gap also separates two different problems. A large gap caused primarily by resource locks points to a dependency management issue: schedules are configured for resources that cannot be stopped cleanly. A large gap caused primarily by manual overrides points to a process issue: engineers are bypassing schedules regularly. The Execution History distinguishes between the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Savings Rate and Budget Health: The CFO View
&lt;/h2&gt;

&lt;p&gt;Savings Rate is Verified Schedule Savings divided by Current Estimated Spend, expressed as a percentage. If you are spending an estimated $40,000 per month and your verified savings are $9,200, your Savings Rate is 23%.&lt;/p&gt;

&lt;p&gt;Savings Rate is the single executive metric. It answers "how effectively are our schedules converting configured downtime into actual savings?" It normalises verified savings against spend, so it remains meaningful as the infrastructure footprint changes.&lt;/p&gt;

&lt;p&gt;Budget Health adds the commitment layer. zopnight's Budget Overview tracks Total Budget, Total Spend, and &lt;a href="https://zop.dev/resources/blog/cloud-budget-health-finops-reporting" rel="noopener noreferrer"&gt;Budget Health&lt;/a&gt; at the organizational level. Budget Health answers whether current spend is inside committed budget. It connects verified savings, the amount actually reduced, to the broader financial picture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Who uses it&lt;/th&gt;
&lt;th&gt;Audit-ready&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Current Estimated Spend&lt;/td&gt;
&lt;td&gt;Projected spend at current run rate&lt;/td&gt;
&lt;td&gt;Engineering, FinOps&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verified Schedule Savings&lt;/td&gt;
&lt;td&gt;Confirmed savings from executed state changes&lt;/td&gt;
&lt;td&gt;Finance, CFO&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Savings Rate&lt;/td&gt;
&lt;td&gt;Verified savings as percentage of estimated spend&lt;/td&gt;
&lt;td&gt;Leadership, executives&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Trends Over Time&lt;/td&gt;
&lt;td&gt;Spend trajectory over configurable periods&lt;/td&gt;
&lt;td&gt;FinOps, budget owners&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget Health&lt;/td&gt;
&lt;td&gt;Spend vs committed organizational budget&lt;/td&gt;
&lt;td&gt;CFO, finance team&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;"Forecastable. Audit-ready." is a specific claim about two of these five metrics. Verified savings and Budget Health are audit-ready because they are grounded in confirmed state transitions and committed budget figures. Estimated Spend and Cost Trends are forecastable because they project from current run rate. The distinction is not aesthetic. It determines what you can show a finance committee.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance Applied to the Governance Platform: RBAC on Financial Data
&lt;/h2&gt;

&lt;p&gt;Access control on cost and budget data matters as much as access control on infrastructure. A finance lead reviewing verified savings should not be able to accidentally modify a schedule. A developer checking their team's budget health should not be able to adjust the organizational budget threshold.&lt;/p&gt;

&lt;p&gt;zopnight's RBAC, rebuilt from the ground up, provides graduated access across three system roles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Policies&lt;/th&gt;
&lt;th&gt;What it enables for Cost Reports&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Viewer&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Read access to Cost Reports, Budget Overview, Verified Savings, and Audit Logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Editor&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;Viewer permissions plus schedule management, tag policy configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Admin&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;Full control including budget threshold management and user role assignment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The right pattern for financial reporting is: finance team members get Viewer role. They can read every number on the Cost Reports page, drill into Budget Health, and export verified savings data. They cannot touch schedules, budgets, or governance policies. That separation is not a limitation. It is the governance guarantee that makes the numbers trustworthy.&lt;/p&gt;

&lt;p&gt;Custom roles extend this further. If a specific finance lead needs read access to cost data but should not see infrastructure topology, a custom role scopes exactly that. The 16-policy Viewer baseline provides the floor. Custom roles allow precise trimming above it.&lt;/p&gt;

&lt;p&gt;The phrase "governance applied to the governance platform itself" captures the design intent. zopnight enforces cloud &lt;a href="https://zop.dev/resources/blog/cloud-governance-rbac-viewer-editor-admin-custom-roles" rel="noopener noreferrer"&gt;governance policies&lt;/a&gt; for your infrastructure. Its own access model applies the same principle internally: every action is scoped to a role, every role is assigned explicitly, and no one gets more access than their function requires.&lt;/p&gt;

&lt;p&gt;This is what makes the CFO conversation work. When a finance lead logs into zopnight with Viewer access and sees Verified Schedule Savings of $9,200 with a 23% Savings Rate and a Budget Health status of on-track, they are reading numbers produced by confirmed state transitions, scoped to their role, backed by an execution audit trail. That is a number they can put in a board report.&lt;/p&gt;

&lt;p&gt;Estimated savings gets you to the conversation. Verified savings gets you through it.&lt;/p&gt;

</description>
      <category>estimated</category>
      <category>savings</category>
      <category>projection</category>
      <category>measurement</category>
    </item>
    <item>
      <title>The $90k Observability Bill: Why Your Cardinality Limit Is the One Knob That Matters</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Mon, 18 May 2026 11:07:37 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/the-90k-observability-bill-why-your-cardinality-limit-is-the-one-knob-that-matters-4h43</link>
      <guid>https://dev.to/muskan_8abedcc7e12/the-90k-observability-bill-why-your-cardinality-limit-is-the-one-knob-that-matters-4h43</guid>
      <description>&lt;h1&gt;
  
  
  The $90k Observability Bill: Why Your Cardinality Limit Is the One Knob That Matters
&lt;/h1&gt;

&lt;p&gt;The observability bill at a 50-engineer org goes from $8,000/month in year one to $90,000/month by year three. The growth never gets a budget review because each individual instrumentation change looks tiny: a new metric, a new tag on an existing metric, a per-customer dimension someone added to debug a spike. None of those changes show up on the monthly bill as a single line item. They show up as the bill quietly compounding because each one of them multiplied the number of unique series the vendor stores.&lt;/p&gt;

&lt;p&gt;The teams that try to fix this usually focus on data volume (samples per second, ingest rate, log lines per day) because that is what the vendor's dashboard surfaces in big numbers. Data volume is 5-10% of the bill at most vendor pricing tiers in 2026. Cardinality, the count of unique time series your metrics generate, is 50-70% of the bill. Optimizing for ingest rate cuts 5% when 60% is available.&lt;/p&gt;

&lt;p&gt;The single knob that actually controls observability cost is cardinality, specifically the count of unique tag-value combinations per metric. A 90-day cardinality-first review at a typical mid-market org cuts $35,000 to $60,000 from the monthly bill with no loss of diagnostic capability and no vendor migration. The work is 2-4 engineer-weeks. The payback is positive in month one and compounds because the cost growth curve flattens, not just the level.&lt;/p&gt;

&lt;p&gt;This piece is the operator's guide to that review. It pairs with the &lt;a href="https://zop.dev/resources/blogs/observability-bill-datadog-cloudwatch-costs/" rel="noopener noreferrer"&gt;observability bill: Datadog vs CloudWatch costs&lt;/a&gt; work on vendor pricing models, but the angle here is one layer deeper: the property of your own instrumentation that drives the bill regardless of vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The $90k observability bill nobody planned for
&lt;/h2&gt;

&lt;p&gt;Look at the spend trajectory for a typical mid-market SaaS over three years.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Monthly observability bill&lt;/th&gt;
&lt;th&gt;Volume contribution&lt;/th&gt;
&lt;th&gt;Cardinality contribution&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$8,000&lt;/td&gt;
&lt;td&gt;$1,800&lt;/td&gt;
&lt;td&gt;$5,200&lt;/td&gt;
&lt;td&gt;Initial instrumentation; ~250k series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$32,000&lt;/td&gt;
&lt;td&gt;$4,000&lt;/td&gt;
&lt;td&gt;$25,000&lt;/td&gt;
&lt;td&gt;Per-customer dimensions added; ~1.4M series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;$90,000&lt;/td&gt;
&lt;td&gt;$7,500&lt;/td&gt;
&lt;td&gt;$76,000&lt;/td&gt;
&lt;td&gt;Trace IDs leaked into metric labels; ~4.8M series&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The volume column grows linearly with the system (more requests, more events, more log lines per minute). The cardinality column grows faster than linearly because each new tag multiplies the existing series count. By year three the cardinality cost is 8-12x the volume cost on the same instrumentation surface.&lt;/p&gt;

&lt;p&gt;The bill conversation usually starts with the wrong number. A platform team looks at the bill, sees that Datadog charges per ingested log GB and per million metric samples, and starts a project to "reduce ingest volume." They turn off DEBUG logs, sample non-critical traces, and lengthen the metric sample interval from 10s to 60s. The bill drops $3,000/month. The team celebrates, and the cost keeps growing because the cardinality knob is untouched.&lt;/p&gt;

&lt;p&gt;The right conversation starts at cardinality: which metrics generate the most unique series, why, what are the tags driving the explosion, and which of those tags actually need to be on a metric versus on a trace or log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cardinality math: 4 tags can produce 1.4B series
&lt;/h2&gt;

&lt;p&gt;Potential cardinality is the product of the number of distinct values of each tag on a metric. A counter &lt;code&gt;http_requests_total&lt;/code&gt; with no tags is 1 series. Add &lt;code&gt;method&lt;/code&gt; (8 values: GET, POST, PUT, DELETE, etc.) and it is 8 series. Add &lt;code&gt;endpoint&lt;/code&gt; (300 routes) and it is 8 × 300 = 2,400 series. Add &lt;code&gt;status&lt;/code&gt; (12 HTTP codes) and it is 8 × 300 × 12 = 28,800 series. Still cheap. Now add &lt;code&gt;user_id&lt;/code&gt; with 50,000 values: &lt;code&gt;http_requests_total × method (8) × endpoint (300) × status (12) × user_id (50,000) = 1,440,000,000 potential series&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A single counter just produced 1.44 billion potential series. In practice the actual count is much lower (most combinations never fire) but the live cardinality typically lands at 30-60% of the potential, which on this example is 430M to 860M series. At Datadog's $0.05/series/month for the standard tier, that one counter costs $21M to $43M per month.&lt;/p&gt;
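
&lt;p&gt;The arithmetic is easy to reproduce. This sketch multiplies the tag value counts from the example and applies the 30-60% live fraction and the $0.05 per series figure quoted above; the function and variable names are illustrative.&lt;/p&gt;

```python
from math import prod


def potential_series(tag_value_counts: dict) -> int:
    """Potential cardinality: the product of distinct-value counts per tag."""
    return prod(tag_value_counts.values())


tags = {"method": 8, "endpoint": 300, "status": 12}
base = potential_series(tags)  # 28,800
with_user = potential_series({**tags, "user_id": 50_000})  # 1,440,000,000

PRICE_PER_SERIES = 0.05  # dollars per series per month, as quoted above
low_cost = with_user * 0.30 * PRICE_PER_SERIES   # 30% of combinations live
high_cost = with_user * 0.60 * PRICE_PER_SERIES  # 60% of combinations live
```

&lt;p&gt;Swapping in your own tag counts before a new dimension ships is a cheap way to see whether it multiplies the series count by 5 or by 50,000.&lt;/p&gt;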

&lt;p&gt;The vendor does not stop you from creating this cardinality. It just bills you for it. The bill arrives, the platform team sees the spike, the question becomes: which metric did this, and which tag is the culprit?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric example&lt;/th&gt;
&lt;th&gt;Tags&lt;/th&gt;
&lt;th&gt;Tag value counts&lt;/th&gt;
&lt;th&gt;Total series&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;method, endpoint, status&lt;/td&gt;
&lt;td&gt;8 × 300 × 12&lt;/td&gt;
&lt;td&gt;28,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;method, endpoint, status, region&lt;/td&gt;
&lt;td&gt;8 × 300 × 12 × 5&lt;/td&gt;
&lt;td&gt;144,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;method, endpoint, status, region, &lt;strong&gt;user_id&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;8 × 300 × 12 × 5 × 50,000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7,200,000,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;db_query_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;query_name, db_name&lt;/td&gt;
&lt;td&gt;80 × 6&lt;/td&gt;
&lt;td&gt;480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;db_query_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;query_name, db_name, &lt;strong&gt;customer_id&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;80 × 6 × 400&lt;/td&gt;
&lt;td&gt;192,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;db_query_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;query_name, db_name, customer_id, &lt;strong&gt;session_id&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;80 × 6 × 400 × 1,000,000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;192,000,000,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bolded tags are the cardinality detonators. Each one is a high-uniqueness identifier (per-user, per-customer, per-session) that has no business being a metric dimension. They belong on traces (where each trace is a single event, not a time series) or logs (where each line is a single record). Putting them on a metric multiplies the series count by the cardinality of the identifier.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three high-cardinality offenders
&lt;/h2&gt;

&lt;p&gt;Most observability bill overruns reduce to three specific tag classes. Removing or aggregating them recovers 40-65% of the bill with zero diagnostic loss.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag class&lt;/th&gt;
&lt;th&gt;Why it lands on metrics&lt;/th&gt;
&lt;th&gt;Where it actually belongs&lt;/th&gt;
&lt;th&gt;Bill impact when removed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;user_id&lt;/code&gt; / &lt;code&gt;customer_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;"Per-tenant visibility" demand&lt;/td&gt;
&lt;td&gt;Trace span attribute&lt;/td&gt;
&lt;td&gt;30-50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;trace_id&lt;/code&gt; / &lt;code&gt;span_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Accidental metric labeling&lt;/td&gt;
&lt;td&gt;Already in trace, never metric&lt;/td&gt;
&lt;td&gt;10-25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;version&lt;/code&gt; / &lt;code&gt;build_id&lt;/code&gt; / &lt;code&gt;git_sha&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Added for deploy debugging, never pruned&lt;/td&gt;
&lt;td&gt;Trace metadata; metric only for last N versions&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;user_id / customer_id on shared metrics.&lt;/strong&gt; A team wants "per-tenant API latency." Someone adds &lt;code&gt;customer_id&lt;/code&gt; to the &lt;code&gt;api_request_duration&lt;/code&gt; histogram. Series count multiplies by customer count. The dashboard shows per-customer p99 latency, which the team uses three times in six months. The bill triples. The right answer: keep &lt;code&gt;customer_id&lt;/code&gt; on the trace span; query traces for per-tenant analysis; keep the metric to a rollup (&lt;code&gt;api_request_duration&lt;/code&gt; × &lt;code&gt;endpoint&lt;/code&gt; × &lt;code&gt;status&lt;/code&gt; only, no per-customer breakdown).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;trace_id promoted to a metric label.&lt;/strong&gt; A common bug pattern: an OpenTelemetry SDK is misconfigured to copy trace context attributes onto every emitted metric. &lt;code&gt;trace_id&lt;/code&gt; is unique per request (effectively infinite cardinality from the metric's perspective). The vendor bill shows millions of one-sample series. The fix is at the SDK / collector level: explicit allow-list of attributes to copy from trace to metric, blocking &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; by default.&lt;/p&gt;
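
&lt;p&gt;The allow-list idea in plain Python. This is not the OpenTelemetry SDK or collector API; it only illustrates the filtering rule, and the attribute names in the allow-list are assumptions.&lt;/p&gt;

```python
# Only these attributes may be copied from a span onto a metric (assumed list).
ALLOWED_METRIC_ATTRIBUTES = frozenset({"method", "endpoint", "status", "region"})


def metric_attributes(span_attributes: dict) -> dict:
    """Keep allow-listed attributes; everything else is dropped by default."""
    return {
        key: value
        for key, value in span_attributes.items()
        if key in ALLOWED_METRIC_ATTRIBUTES
    }


# Example span attributes are made up for illustration.
span = {
    "method": "GET",
    "endpoint": "/orders",
    "status": "200",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
}
labels = metric_attributes(span)
```

&lt;p&gt;An allow-list fails safe: a new per-request attribute added to spans later never reaches the metric pipeline unless someone adds it deliberately.&lt;/p&gt;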

&lt;p&gt;&lt;strong&gt;version / build_id not pruned.&lt;/strong&gt; Deploy lands; instrumentation tags every metric with &lt;code&gt;version=v1.2.3&lt;/code&gt; so the team can compare pre- and post-deploy behavior. Three weeks later there are 40 versions in the tag values, each with its own series. The team only ever queries the last 2-3 versions. The fix: tag with version, but at the collector level prune any version older than 30 days from the metric pipeline (traces and logs can keep the full history because they age out on their own retention curve).&lt;/p&gt;
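
&lt;p&gt;A sketch of the 30-day pruning rule, assuming the collector knows when each version was deployed. The function names and the deploy dates are hypothetical.&lt;/p&gt;

```python
from datetime import date


def keep_version_tag(deployed_on: date, today: date, max_age_days: int = 30) -> bool:
    """True if the version is recent enough to stay on the metric as a tag value."""
    age_days = (today - deployed_on).days
    return max_age_days >= age_days


def prune_versions(deploy_dates: dict, today: date) -> set:
    """Return the version tag values the metric pipeline should keep."""
    return {v for v, d in deploy_dates.items() if keep_version_tag(d, today)}


# Example deploy dates are made up for illustration.
deploys = {
    "v1.2.1": date(2026, 3, 1),
    "v1.2.2": date(2026, 4, 25),
    "v1.2.3": date(2026, 5, 10),
}
kept = prune_versions(deploys, today=date(2026, 5, 18))
```

&lt;p&gt;Here the March deploy falls outside the window and is dropped from metrics, while the two recent versions remain comparable side by side.&lt;/p&gt;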

&lt;p&gt;The pattern across all three: high-uniqueness identifiers belong on traces and logs (which the vendor bills very differently and which scale fine with cardinality) rather than on metrics (which compound). The OpenTelemetry three-pillar separation (metrics, traces, logs) exists precisely so that each type of telemetry can handle the data class it is good at. High-cardinality identifiers go on traces and logs; metrics stay aggregated.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cardinality report (one engineer-day, highest-leverage observability work)
&lt;/h2&gt;

&lt;p&gt;The work to make cardinality manageable starts with measurement. Every major observability vendor exposes per-metric cardinality somehow; the surface is just not in the default dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Cardinality introspection&lt;/th&gt;
&lt;th&gt;Where to find it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;Metric summary page; &lt;code&gt;datadog.estimated_usage.metrics.*&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Per-metric panel in Metrics Explorer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt; for the total; &lt;code&gt;topk()&lt;/code&gt; over &lt;code&gt;count by (__name__)&lt;/code&gt; for per-metric counts&lt;/td&gt;
&lt;td&gt;Self-monitoring scrape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;Dataset cardinality view&lt;/td&gt;
&lt;td&gt;Per-dataset settings → cardinality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Mimir / Cortex&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cortex_ingester_active_series&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Self-monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Relic&lt;/td&gt;
&lt;td&gt;Metric cardinality limit warnings in usage UI&lt;/td&gt;
&lt;td&gt;Account → Usage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The weekly cardinality report is one engineer-day to build and is the single highest-leverage piece of observability work most teams can ship in 2026. It contains:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metric name&lt;/td&gt;
&lt;td&gt;Identification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Series count (now)&lt;/td&gt;
&lt;td&gt;Current cardinality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Series count (7d ago)&lt;/td&gt;
&lt;td&gt;Growth detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top 3 tags by value count&lt;/td&gt;
&lt;td&gt;Which dimensions are driving it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cap&lt;/td&gt;
&lt;td&gt;The configured per-metric limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action&lt;/td&gt;
&lt;td&gt;"Over cap", "Approaching cap", "Healthy"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The report runs weekly, posts to a #observability-cost Slack channel, and surfaces the top 20 metrics by series count plus any metric that grew &amp;gt;50% week-over-week. The platform team reviews the report in 15 minutes. Most weeks there is no action; in the weeks where a new high-cardinality tag landed (often unintentionally), the report catches it before the next billing cycle.&lt;/p&gt;

&lt;p&gt;The team that does not have this report has no way to know which metric is driving the bill until the bill arrives. The team that has the report fixes the cardinality issue in the week it appears, not in the quarter after the bill review.&lt;/p&gt;
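
&lt;p&gt;As a rough illustration of the report logic described above, here is a minimal Python sketch. The &lt;code&gt;MetricRow&lt;/code&gt; type, the function names, and the input shape are hypothetical; the 80% "Approaching cap" threshold is borrowed from the cap-alert rule later in this post, and the "Top 3 tags" column is left out because fetching it is vendor-specific.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class MetricRow:
    # Hypothetical row shape: one entry per metric in the weekly report.
    name: str
    series_now: int
    series_7d_ago: int
    cap: int


def action(row, warn_ratio=0.8):
    """Map a metric to the report's Action column."""
    if row.series_now > row.cap:
        return "Over cap"
    if row.series_now >= warn_ratio * row.cap:
        return "Approaching cap"
    return "Healthy"


def grew_fast(row, threshold=0.5):
    """True when the series count grew more than 50% week-over-week."""
    if row.series_7d_ago == 0:
        # A metric that did not exist last week counts as fast growth.
        return row.series_now > 0
    growth = (row.series_now - row.series_7d_ago) / row.series_7d_ago
    return growth > threshold


def weekly_report(rows, top_n=20):
    """Top N metrics by series count, plus fast growers outside the top N."""
    ranked = sorted(rows, key=lambda r: r.series_now, reverse=True)
    top = ranked[:top_n]
    extra = [r for r in ranked[top_n:] if grew_fast(r)]
    return [(r.name, r.series_now, r.series_7d_ago, action(r)) for r in top + extra]
```

&lt;p&gt;Feeding it the per-metric series counts exported from whichever vendor surface applies should be enough for a first version; formatting and Slack posting sit on top of the returned rows.&lt;/p&gt;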

&lt;h2&gt;
  
  
  Aggregation: move high-cardinality data to traces and logs
&lt;/h2&gt;

&lt;p&gt;The right place for high-cardinality data is determined by the OpenTelemetry three-pillar separation. Each pillar has a different cost-vs-detail tradeoff and the high-cardinality identifiers go to the pillars that handle them well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjf2r8ftwjkeo4afmq83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjf2r8ftwjkeo4afmq83.png" alt="diagram" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data class&lt;/th&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request counts by method/status/endpoint&lt;/td&gt;
&lt;td&gt;Metric&lt;/td&gt;
&lt;td&gt;Low cardinality, queried as time series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-customer latency analysis&lt;/td&gt;
&lt;td&gt;Trace&lt;/td&gt;
&lt;td&gt;High cardinality, queried per-request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-user error rate&lt;/td&gt;
&lt;td&gt;Trace&lt;/td&gt;
&lt;td&gt;High cardinality identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate error rate by service&lt;/td&gt;
&lt;td&gt;Metric&lt;/td&gt;
&lt;td&gt;Low cardinality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit events (who did what, when)&lt;/td&gt;
&lt;td&gt;Log&lt;/td&gt;
&lt;td&gt;Free-form, often compliance-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trace-level diagnostic detail&lt;/td&gt;
&lt;td&gt;Trace&lt;/td&gt;
&lt;td&gt;Designed for it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy markers (version comparison over 24h)&lt;/td&gt;
&lt;td&gt;Metric (with TTL on version tag)&lt;/td&gt;
&lt;td&gt;Pruned automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rule of thumb: if a dimension's value count exceeds 100 unique values across a 7-day window, it does not belong on a metric. It belongs on a trace or a log. The vendor's trace and log products price differently (per-event, per-byte, with sampling) and high cardinality is normal for them. The metric product prices per series; high cardinality is the cost detonator.&lt;/p&gt;
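
&lt;p&gt;The 100-value rule of thumb is easy to check mechanically. The sketch below assumes you can export the tag sets seen on a metric over the 7-day window as plain dictionaries; the function name &lt;code&gt;tags_to_move&lt;/code&gt; and the input shape are illustrative, not any vendor's API.&lt;/p&gt;

```python
from collections import defaultdict


def tags_to_move(samples, limit=100):
    """Return tag names whose distinct value count exceeds `limit`.

    `samples` is an iterable of dicts mapping tag name to the value
    observed on one data point within the 7-day window.
    """
    seen = defaultdict(set)
    for tags in samples:
        for name, value in tags.items():
            seen[name].add(value)
    return sorted(name for name, values in seen.items() if len(values) > limit)
```

&lt;p&gt;Any tag this returns is a candidate for moving to a span attribute or a log field instead of a metric dimension.&lt;/p&gt;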

&lt;h2&gt;
  
  
  Cap and alert per metric
&lt;/h2&gt;

&lt;p&gt;The discipline that makes cardinality manageable in the long run is per-metric caps. Without caps, cardinality grows monotonically: every new dimension is a marginal addition that "doesn't seem that big." With caps, the team has to make a conscious decision when a metric approaches its limit: do we raise the cap (and accept the cost), remove the dimension (and lose some detail), or aggregate harder (and trade some precision)?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric tier&lt;/th&gt;
&lt;th&gt;Cap&lt;/th&gt;
&lt;th&gt;Typical use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tier 1 (critical, high-value)&lt;/td&gt;
&lt;td&gt;50,000 series&lt;/td&gt;
&lt;td&gt;Customer-facing latency, error rates, SLO inputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 2 (standard)&lt;/td&gt;
&lt;td&gt;5,000 series&lt;/td&gt;
&lt;td&gt;Internal service health, deploy markers, batch job metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 3 (debugging only)&lt;/td&gt;
&lt;td&gt;500 series&lt;/td&gt;
&lt;td&gt;Ephemeral metrics added during investigation, must be removed after&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cap alerts fire at 80% of the limit. The platform team gets a ping; the metric's owner has two weeks to either justify a cap raise (with a budget impact estimate) or reduce the cardinality. If neither happens, the metric is downgraded a tier (which lowers its cap and forces the owner to address it).&lt;/p&gt;

&lt;p&gt;The numbers above are illustrative; the right caps depend on your vendor's pricing tier. The right way to set them is to start from the current cardinality distribution and pick caps that allow 95% of metrics to fit Tier 2 with the 5% legitimately-needed-high-cardinality metrics in Tier 1. Tier 3 is the safety valve for debugging that should always be temporary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI-agent special case
&lt;/h2&gt;

&lt;p&gt;AI agent fleets create a cardinality problem that ordinary instrumentation rules do not catch. A fleet that emits per-invocation metrics tagged with &lt;code&gt;agent_id&lt;/code&gt; (47 agents) and &lt;code&gt;request_id&lt;/code&gt; (5 million requests/day per agent) produces 235 million unique series per day from a single metric, because every request mints a new series. The cardinality compounds across the metric set; even a small agent fleet can outspend the rest of the org's observability bill in a quarter.&lt;/p&gt;

&lt;p&gt;The fix is per-agent metric aggregation: emit one data point per agent per minute instead of one per invocation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cardinality / day&lt;/th&gt;
&lt;th&gt;Diagnostic capability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-invocation metric (&lt;code&gt;agent_id&lt;/code&gt; + &lt;code&gt;request_id&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;235,000,000 series&lt;/td&gt;
&lt;td&gt;Per-request drilling (impossible to query anyway at this scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-agent per-minute aggregate (counter + histogram)&lt;/td&gt;
&lt;td&gt;47 series (× 1,440 min = 67,680 data points)&lt;/td&gt;
&lt;td&gt;Per-agent rate + latency distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-request data → traces (sampled at 1%)&lt;/td&gt;
&lt;td&gt;(cardinality moves to trace product)&lt;/td&gt;
&lt;td&gt;Per-request when needed, sampled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The per-agent-per-minute aggregate uses a counter (&lt;code&gt;agent_invocations_total{agent_id}&lt;/code&gt;) for rate and a histogram (&lt;code&gt;agent_latency_ms{agent_id}&lt;/code&gt;) for distribution. Together they answer the questions the per-invocation metric was meant to answer (how often does each agent fire, what is the latency distribution) at roughly 1/3,500th the daily volume (67,680 per-minute data points versus 235 million per-request series), and the series count now scales with the number of agents rather than with traffic. The per-request detail that is genuinely needed (which request was slow, what was the failure) lives on traces with sampling, where the cost model handles per-request data natively.&lt;/p&gt;
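
&lt;p&gt;To make the arithmetic and the roll-up concrete, here is a small Python sketch. The event tuple layout and the &lt;code&gt;aggregate&lt;/code&gt; function are assumptions for illustration, and a sum and max stand in for a real histogram to keep the example short.&lt;/p&gt;

```python
from collections import defaultdict

# Arithmetic from the table above.
AGENTS = 47
MINUTES_PER_DAY = 24 * 60
per_invocation = 235_000_000
per_agent_per_minute = AGENTS * MINUTES_PER_DAY  # 67,680
reduction = per_invocation / per_agent_per_minute  # roughly 3,470x


def aggregate(invocations):
    """Roll per-invocation events up to one record per (agent_id, minute).

    Each event is (agent_id, request_id, unix_ts_seconds, latency_ms).
    request_id is deliberately dropped: it belongs on the trace.
    """
    buckets = defaultdict(
        lambda: {"count": 0, "latency_sum_ms": 0.0, "latency_max_ms": 0.0}
    )
    for agent_id, _request_id, ts, latency_ms in invocations:
        minute = int(ts // 60) * 60  # start of the minute, in seconds
        b = buckets[(agent_id, minute)]
        b["count"] += 1
        b["latency_sum_ms"] += latency_ms
        b["latency_max_ms"] = max(b["latency_max_ms"], latency_ms)
    return dict(buckets)
```

&lt;p&gt;The key point is what the bucket key leaves out: once &lt;code&gt;request_id&lt;/code&gt; is gone, the number of distinct keys per minute is bounded by the number of agents, no matter how many requests arrive.&lt;/p&gt;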

&lt;p&gt;The pattern composes with the &lt;a href="https://zop.dev/resources/blogs/per-agent-mcp-token-quota-runaway-protection/" rel="noopener noreferrer"&gt;per-agent token quotas&lt;/a&gt; work: the quota system already knows each agent's identity and rate; the observability metric can be a side-effect of the quota counter rather than a separate instrumentation. One source of truth, one cardinality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why dropping the vendor is the wrong fix
&lt;/h2&gt;

&lt;p&gt;The first-instinct fix when the observability bill shocks finance is to put the vendor up for review. Solicit a quote from Honeycomb, from Grafana Cloud, from a self-hosted Prometheus + Loki + Tempo stack. The numbers look compelling because the alternative vendor's bill is based on your current usage at their pricing, and migration projections always look optimistic.&lt;/p&gt;

&lt;p&gt;The migration math is not optimistic in practice.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Vendor migration&lt;/th&gt;
&lt;th&gt;Cardinality fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to first cost reduction&lt;/td&gt;
&lt;td&gt;6-12 months (post-migration)&lt;/td&gt;
&lt;td&gt;4-8 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer-weeks invested&lt;/td&gt;
&lt;td&gt;24-50 (instrumentation rewrite, dashboard rebuild, alert recreation, runbook updates)&lt;/td&gt;
&lt;td&gt;2-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Risk of degraded incident response during transition&lt;/td&gt;
&lt;td&gt;High (parallel systems, alert gaps, training cost)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bill reduction after work complete&lt;/td&gt;
&lt;td&gt;30-50% if cardinality is fixed on new vendor; 0% if not&lt;/td&gt;
&lt;td&gt;40-65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cardinality problem replicates on new vendor?&lt;/td&gt;
&lt;td&gt;Yes (it is a property of your instrumentation)&lt;/td&gt;
&lt;td&gt;N/A (problem is removed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The migration only pays off if the cardinality problem is fixed in the new system. Otherwise the new vendor's bill grows the same way the old one did, just from a lower starting base. Teams that migrate without fixing cardinality discover this in year two on the new vendor and are back where they started.&lt;/p&gt;

&lt;p&gt;The cardinality fix on the current vendor is faster, cheaper, lower-risk, and reduces the bill by a similar percentage. The vendor switch may still make sense for product reasons (better trace UX, different SLO tooling, vendor-specific features), but it is not the cost fix. The cost fix is cardinality.&lt;/p&gt;

&lt;p&gt;A typical mid-market org running the 90-day cardinality review recovers $35,000 to $60,000 per month within the first quarter. The compounding effect is more valuable than the one-time saving: the cost growth curve flattens because the cardinality discipline is now in place. By year four, an org that ran the cardinality review is at $40-50k/month observability spend; an org that did not is at $130-180k/month on the same engineering surface.&lt;/p&gt;

&lt;p&gt;Set up the weekly cardinality report. Identify the top 5 metrics by series count. Find the user_id, trace_id, or version_id tag driving each one. Move those dimensions to traces or logs. The bill drops the next month and stops growing the way it used to. The one knob that matters is the one most teams never touch.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>bill</category>
      <category>cardinality</category>
      <category>limit</category>
    </item>
    <item>
      <title>Every team has an architecture diagram. Nobody trusts it. Here's what we built instead.</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Mon, 18 May 2026 07:05:50 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/every-team-has-an-architecture-diagram-nobody-trusts-it-heres-what-we-built-instead-13li</link>
      <guid>https://dev.to/muskan_8abedcc7e12/every-team-has-an-architecture-diagram-nobody-trusts-it-heres-what-we-built-instead-13li</guid>
      <description>&lt;p&gt;&lt;strong&gt;The actual problem with cloud architecture visibility.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real issue isn't that teams don't document their infrastructure. It's that cloud infrastructure changes faster than any manual process can keep up with.&lt;/p&gt;

&lt;p&gt;A developer spins up a debug RDS instance on a Friday. A new region gets added during a scaling event. A contractor deploys a service that nobody else knows about. None of these show up in any diagram because nobody updated it.&lt;/p&gt;

&lt;p&gt;The other problem: existing tools either give you one cloud at a time, or they give you a billing view, which tells you what you're spending but not how anything connects.&lt;/p&gt;

&lt;p&gt;What we wanted was: open Atlas, see everything, understand how it fits together. Across all three clouds. In real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Atlas does
&lt;/h2&gt;

&lt;p&gt;Atlas connects to your AWS, GCP, and Azure accounts in read-only mode and auto-discovers every resource. It then builds a dependency graph: not just a flat list of resources, but how they relate to each other. Which services talk to which, what sits behind which load balancer, and where the cross-region connections are.&lt;/p&gt;

&lt;p&gt;The view scales from global (all your clouds, all your regions, one screen) down to service-level dependencies. You can zoom into a single VPC and see exactly what's running inside it.&lt;/p&gt;

&lt;p&gt;Here's a short demo:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/BNBs-UG7MfE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that was harder than expected
&lt;/h2&gt;

&lt;p&gt;The interesting technical challenge was reconciling three completely different resource models.&lt;/p&gt;

&lt;p&gt;AWS thinks in terms of VPCs, availability zones, and security groups. GCP thinks in terms of projects, networks, and firewall rules. Azure thinks in terms of subscriptions, resource groups, and virtual networks. Same concepts, completely different hierarchies and naming conventions.&lt;/p&gt;

&lt;p&gt;Building a unified topology meant building a translation layer that could map these different models onto a consistent graph structure without flattening the differences that actually matter for understanding your architecture.&lt;/p&gt;

&lt;p&gt;We also had to decide what "connected" means across clouds. If a Lambda calls a GCP Cloud Run service over HTTPS, are those connected in the topology? We landed on: yes, and we show cross-cloud connections explicitly because they're often the least-understood part of a multi-cloud setup.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>aws</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Cost Per Customer for SaaS: The Unit Economics Dashboard That Killed Three Pricing Mistakes</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 15 May 2026 07:02:56 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/cost-per-customer-for-saas-the-unit-economics-dashboard-that-killed-three-pricing-mistakes-9cl</link>
      <guid>https://dev.to/muskan_8abedcc7e12/cost-per-customer-for-saas-the-unit-economics-dashboard-that-killed-three-pricing-mistakes-9cl</guid>
      <description>&lt;h1&gt;
  
  
  Cost Per Customer for SaaS: The Unit Economics Dashboard That Killed Three Pricing Mistakes
&lt;/h1&gt;

&lt;p&gt;Finance computes cost per customer as &lt;code&gt;total infra cost / customer count&lt;/code&gt; once per quarter. The number is mathematically correct and operationally useless. A B2B SaaS at $8/customer/month sounds healthy until you look at the distribution and find that one customer costs $1,400/month and another costs $0.40. The average hides everything. The 10-15% of customers whose hosting cost exceeds their MRR are invisible. The pricing tier that loses money on heavy users is invisible. The free-tier customer who is silently burning through more compute than three paying customers combined is invisible.&lt;/p&gt;

&lt;p&gt;The structural fix is per-customer cost attribution at the cost-record level, refreshed weekly, displayed in five dashboard views, owned by product and finance. The work is not the dashboard. The work is propagating customer_id through three layers (the request path, the workload identity, the storage layer) so every cost record knows which customer it belongs to. Most SaaS data pipelines were built without this discipline; the retrofit takes 4-8 weeks of data engineering. The payback is a margin recovery of 2-5% from pricing fixes and another 8-15% infra reduction from per-customer right-sizing.&lt;/p&gt;

&lt;p&gt;The piece composes with the &lt;a href="https://zop.dev/resources/blogs/chargeback-four-field-schema-replaces-tag-spreadsheet/" rel="noopener noreferrer"&gt;4-field chargeback schema&lt;/a&gt; (which solves per-team attribution at the org level) but operates one layer deeper. Per-customer is more granular than per-team and serves a different audience: product and pricing, not finance and engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quarterly cost-per-customer number is useless
&lt;/h2&gt;

&lt;p&gt;Look at a typical mid-market B2B SaaS at $4M ARR with 400 customers. The quarterly cost-per-customer number reads $310/month, which is fine if MRR averages $830. The distribution tells a different story.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cohort&lt;/th&gt;
&lt;th&gt;% of customers&lt;/th&gt;
&lt;th&gt;Avg cost / mo&lt;/th&gt;
&lt;th&gt;Avg MRR / mo&lt;/th&gt;
&lt;th&gt;Cost-to-MRR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Top 10% by usage&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;$1,600&lt;/td&gt;
&lt;td&gt;$2,200&lt;/td&gt;
&lt;td&gt;73% (red band)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy users&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$1,100&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;$190&lt;/td&gt;
&lt;td&gt;$830&lt;/td&gt;
&lt;td&gt;23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Light users&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;$45&lt;/td&gt;
&lt;td&gt;$400&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;$35&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;infinite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The blended average ($310) hides the entire structure. The top 10% by usage is operating at a 73% cost-to-MRR ratio, which is unrecoverable on a SaaS unit-economics curve. The free tier (which leadership often defends as "low cost, high signal") costs nearly as much per user ($35) as a light paying customer ($45) while bringing in no revenue. The pricing tier is misaligned with the cost structure in a way no quarterly average will surface.&lt;/p&gt;

&lt;p&gt;The dashboard that fixes this has to update faster than quarterly. Pricing decisions get made in the week, not the quarter; if the cost data is months old, the pricing team is flying blind. Weekly is the right cadence: fresh enough to catch a customer who just spun up a heavy workload, slow enough that the dashboard does not thrash on day-to-day usage variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-layer attribution: request, workload, storage
&lt;/h2&gt;

&lt;p&gt;The work is propagating customer_id through three layers of the system. Skip any layer and the cost record is unattributable; the dashboard ends up with a 10-30% "unattributed" bucket that defeats the per-customer view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ij286kl0uo19es982gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ij286kl0uo19es982gk.png" alt="diagram" width="800" height="996"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: request path.&lt;/strong&gt; Every API call gets the customer_id stamped into the OpenTelemetry span at the edge. Downstream services read the span context, propagate it, and the cost record for that span has the customer_id attached. This is the easiest layer: typically a 1-2 week change to the API gateway and request middleware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: workload identity.&lt;/strong&gt; Every async or batch workload (Spark job, Lambda invocation, Kafka consumer, Snowflake query) must know which customer's data it is processing. The customer_id propagates through the queue header, the workload spec, the query tag. Without this, every batch cost lands in the "shared infrastructure" bucket and the per-customer dashboard misses 40-60% of variable cost. This layer is the hardest: 4-6 weeks of data engineering to retrofit on a typical pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: storage layer.&lt;/strong&gt; Every storage operation (S3 read/write, RDS query, DynamoDB read) needs to be billable to a customer. The convention that works: object keys prefixed with customer_id (&lt;code&gt;customer-acme-corp/orders/2026-05-08.parquet&lt;/code&gt;), tables partitioned by customer_id, encrypted with customer-specific KMS keys when residency matters. The cost record reads the key prefix and assigns cost. This is the layer most easily forgotten because storage cost looks small in early stages and explodes later.&lt;/p&gt;

&lt;p&gt;The attribution layer's quality is measurable: the unattributed cost bucket as a share of total monthly spend. Healthy is under 5%. Acceptable is under 10%. Above 15% and the per-customer dashboard's numbers are misleading enough that pricing decisions made from them will be wrong.&lt;/p&gt;
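
&lt;p&gt;A minimal sketch of the storage-layer convention and the quality check, assuming cost records arrive as &lt;code&gt;(object_key, monthly_cost_usd)&lt;/code&gt; pairs and keys follow the &lt;code&gt;customer-...&lt;/code&gt; prefix shown above. The function names are hypothetical, and the 10-15% band, which the text does not name, is labeled "degraded" here as a placeholder.&lt;/p&gt;

```python
def customer_from_key(key, prefix="customer-"):
    """Return the customer id encoded in the first path segment, or None."""
    first = key.split("/", 1)[0]
    if first.startswith(prefix) and len(first) > len(prefix):
        return first[len(prefix):]
    return None


def unattributed_share(records):
    """Share of total cost whose key carries no customer prefix.

    `records` is an iterable of (object_key, monthly_cost_usd) pairs.
    """
    total = 0.0
    unattributed = 0.0
    for key, cost in records:
        total += cost
        if customer_from_key(key) is None:
            unattributed += cost
    return unattributed / total if total else 0.0


def attribution_health(share):
    """Bands from the text; 'degraded' (10-15%) is this sketch's own label."""
    if share > 0.15:
        return "misleading"
    if share >= 0.10:
        return "degraded"
    if share >= 0.05:
        return "acceptable"
    return "healthy"
```

&lt;p&gt;Tracking the output of &lt;code&gt;unattributed_share&lt;/code&gt; week over week gives a single number for whether the propagation work is holding.&lt;/p&gt;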

&lt;h2&gt;
  
  
  Three pricing mistakes the dashboard catches
&lt;/h2&gt;

&lt;p&gt;The first month of the dashboard surfaces three specific pricing failures that no quarterly average would have shown. ZopDev customer rollout data is consistent on these three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: flat-rate pricing for resource-heavy customers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A $99/month flat plan that includes "unlimited API calls" works fine when most customers use 10K calls/month. The dashboard surfaces the 4-7% of customers who use 5M+ calls/month at $40-$120/month in infra cost. At the low end of that range the margin is thin; once infra cost passes $99 the customer is net-negative on a flat plan. The fix: introduce a fair-use cap or a usage-based overage; grandfather existing customers with notice.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Customer&lt;/th&gt;
&lt;th&gt;API calls/mo&lt;/th&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Plan revenue&lt;/th&gt;
&lt;th&gt;Infra cost&lt;/th&gt;
&lt;th&gt;Net&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Healthy customer&lt;/td&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;Flat $99&lt;/td&gt;
&lt;td&gt;$99&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;+$96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy customer&lt;/td&gt;
&lt;td&gt;5M&lt;/td&gt;
&lt;td&gt;Flat $99&lt;/td&gt;
&lt;td&gt;$99&lt;/td&gt;
&lt;td&gt;$48&lt;/td&gt;
&lt;td&gt;+$51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outlier customer&lt;/td&gt;
&lt;td&gt;23M&lt;/td&gt;
&lt;td&gt;Flat $99&lt;/td&gt;
&lt;td&gt;$99&lt;/td&gt;
&lt;td&gt;$190&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-$91&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: per-seat pricing where seat usage does not correlate with cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A $25/seat/month plan looks linear. The dashboard shows that heavy users (analysts running daily reports) consume 8-10x the compute of light users (occasional readers). A 50-seat customer with 50 heavy users costs roughly five times as much to serve as a 50-seat customer with 5 heavy users and 45 light users, but both pay the same $1,250/month. The fix: per-seat tiering by user role, or a usage component layered on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: free-tier abuse.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The free tier brings in $0 in revenue and looks free in cost too, until you see the distribution. Typically 90% of free-tier users consume 5% of free-tier infra cost (light, low-engagement). The remaining 10% consume 95%: training models on the free-tier API, scraping data, running cron jobs against the free endpoints. The fix: rate limits per free account, automatic graduation to paid above usage thresholds.&lt;/p&gt;

&lt;p&gt;The pattern across all three: the dashboard exposes the distribution that the average hides. The product team sees the distribution, the pricing team has the data to redesign the plan, and the engineering team has a target list of customers to right-size individually.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five-view dashboard
&lt;/h2&gt;

&lt;p&gt;The minimum-viable dashboard is five views on one screen. Anything more is noise; anything less misses a class of decision.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;View&lt;/th&gt;
&lt;th&gt;Sort&lt;/th&gt;
&lt;th&gt;Decision it informs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-customer cost (descending)&lt;/td&gt;
&lt;td&gt;$ desc&lt;/td&gt;
&lt;td&gt;Right-sizing: who to optimize first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-to-MRR ratio (descending)&lt;/td&gt;
&lt;td&gt;ratio desc&lt;/td&gt;
&lt;td&gt;Pricing: which customers lose money&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per customer over time&lt;/td&gt;
&lt;td&gt;trend&lt;/td&gt;
&lt;td&gt;Drift detection: who is suddenly spending more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost concentration (top 1% / top 10% as share of total)&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;Pricing tier design: how skewed is the distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-customer cost broken down by service&lt;/td&gt;
&lt;td&gt;service stacked bar&lt;/td&gt;
&lt;td&gt;Per-customer right-sizing: which service to target&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The five fit on a single browser tab without scrolling. The pricing team opens this once a week. The product team opens it before any pricing or plan change. The CFO opens it before the board meeting. Different audiences, same data, no per-team dashboards that need to be reconciled.&lt;/p&gt;

&lt;p&gt;The trend view is the under-appreciated one. A customer whose cost jumped 4x in three weeks is a smoke signal: usually a new use case the customer is exploring, sometimes a misconfigured integration burning compute on their side. Either way, the customer-success team wants to know in week three, not in next quarter's review.&lt;/p&gt;

&lt;p&gt;The cost-concentration view answers the strategic question: are we a long-tail SaaS (top 10% of customers = 30% of cost, predictable scaling) or a power-law SaaS (top 1% of customers = 50% of cost, fragile)? The shape determines the right pricing strategy; the strategy is impossible to set without the data.&lt;/p&gt;
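
&lt;p&gt;The concentration numbers are cheap to compute once per-customer costs exist. A minimal sketch, assuming a plain list of monthly per-customer costs; &lt;code&gt;top_share&lt;/code&gt; is an illustrative name.&lt;/p&gt;

```python
def top_share(costs, fraction):
    """Share of total cost carried by the top `fraction` of customers."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Always include at least one customer, even for tiny customer lists.
    k = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0
```

&lt;p&gt;Calling it with &lt;code&gt;0.01&lt;/code&gt; and &lt;code&gt;0.10&lt;/code&gt; yields the two figures the view needs; comparing them against the long-tail and power-law shapes described above is then a matter of reading two numbers.&lt;/p&gt;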

&lt;h2&gt;
  
  
  Variable cost only: exclude the fixed
&lt;/h2&gt;

&lt;p&gt;Per-customer cost should be the sum of three variable cost classes only: compute, storage, outbound bandwidth. Fixed costs (control plane, monitoring, security tooling, SSO, audit logging) are amortized at the org level and excluded from the per-customer number. Mixing them in produces misleading economics.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost class&lt;/th&gt;
&lt;th&gt;Per-customer?&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EC2 / Fargate compute&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Scales with customer usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS / Aurora compute&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-tenant database load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 / object storage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-customer data volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound data transfer&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-customer egress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda invocations (per-customer functions)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Scales with customer events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes control plane&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Fixed; one cluster serves all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog / observability bill&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Fixed; not customer-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vault, SSO, secret management&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Org-level shared infra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD runtime&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Engineering cost, not customer cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security tooling (WAF, GuardDuty)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Org-level, not customer-attributable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason for the exclusion is mechanical. A customer's "real" cost is what would disappear from the bill if the customer left. Variable costs do disappear; fixed costs do not. Including fixed costs in the per-customer number means a customer who churns "looks like" a $200/month saving on the dashboard when the actual saving is $60.&lt;/p&gt;

&lt;p&gt;The fixed costs still exist; they just need to be tracked separately. Add a sixth view to the dashboard if needed ("fixed org-level overhead, $X/month, X% of total spend"). Most teams find that fixed overhead is 18-28% of total spend and stays roughly flat year-over-year as the company grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost-to-MRR ratio: the most actionable metric
&lt;/h2&gt;

&lt;p&gt;A $400/month customer is fine if MRR is $4,000 (10% ratio) and a crisis if MRR is $200 (200% ratio). The absolute cost is meaningless without the revenue context; the ratio is the actionable number.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ratio band&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Typical action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0-15%&lt;/td&gt;
&lt;td&gt;Healthy&lt;/td&gt;
&lt;td&gt;No action; this is the target&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15-30%&lt;/td&gt;
&lt;td&gt;Acceptable&lt;/td&gt;
&lt;td&gt;Monitor for drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30-50%&lt;/td&gt;
&lt;td&gt;Yellow&lt;/td&gt;
&lt;td&gt;Investigate; is the customer in a heavy onboarding phase?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-100%&lt;/td&gt;
&lt;td&gt;Red&lt;/td&gt;
&lt;td&gt;Targeted right-sizing; consider pricing conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;100%&lt;/td&gt;
&lt;td&gt;Negative-margin&lt;/td&gt;
&lt;td&gt;Pricing change, contract renegotiation, or customer-success intervention required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
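&lt;p&gt;The bands translate directly into a small classifier. This is a sketch, not a prescribed implementation: the function name is hypothetical, and because the table's bands share their edges (15%, 30%, 50%), the sketch assumes each edge belongs to the higher band, except that exactly 100% stays "Red".&lt;/p&gt;

```python
def ratio_band(monthly_cost, mrr):
    """Return the status band for a customer's cost-to-MRR ratio."""
    if not mrr > 0:
        # Assumption: cost with no revenue counts as negative-margin.
        return "Negative-margin"
    ratio = monthly_cost / mrr
    if ratio > 1.0:
        return "Negative-margin"
    if ratio >= 0.50:
        return "Red"
    if ratio >= 0.30:
        return "Yellow"
    if ratio >= 0.15:
        return "Acceptable"
    return "Healthy"


print(ratio_band(400, 4000))  # Healthy (10% ratio)
print(ratio_band(400, 200))   # Negative-margin (200% ratio)
```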

&lt;p&gt;Typical mid-market SaaS has 8-15% of customers in the "red" or "negative-margin" band. Half of those are early-stage customers still ramping (which is fine, expected, often part of the land-and-expand motion). The other half are pricing mistakes: customers whose plan was set before the company understood their actual usage shape, who have stayed grandfathered, or who fell through the cracks of a plan transition.&lt;/p&gt;

&lt;p&gt;The pricing-mistake set is what the dashboard surfaces that no quarterly report would. The fix per customer is usually a one-meeting conversation: explain the cost structure, offer a usage-based plan or a higher-tier package, sometimes write off the past 3 months as a relationship investment in exchange for the new plan going forward. Most customers accept; the ones who do not are signaling they would rather churn than pay, which is its own data point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard part is customer_id propagation
&lt;/h2&gt;

&lt;p&gt;The dashboard is the easy part. Most BI tools (Looker, Metabase, Superset) render the five views from a single fact table in a couple of days. The hard part is making the fact table correct, which requires customer_id on every cost record, which requires the propagation work at every layer of the system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Retrofit complexity&lt;/th&gt;
&lt;th&gt;Typical engineer-weeks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API gateway / request middleware&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;1-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microservice spans (OTel propagation)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;1-3 (one engineer per service)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async queues (Kafka, SQS headers)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;2-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch workloads (Spark, EMR, Lambda)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;3-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data warehouse queries (Snowflake, BigQuery tags)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;2-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Object storage (key conventions)&lt;/td&gt;
&lt;td&gt;High (data migration if existing)&lt;/td&gt;
&lt;td&gt;4-8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB / search index&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Across a typical mid-market SaaS stack the total is 14-29 engineer-weeks. Most teams finish the retrofit in 8-12 calendar weeks because the work parallelizes across services.&lt;/p&gt;

&lt;p&gt;The discipline matters more than the speed. A retrofit that misses 30% of cost is not 70% useful; it is misleading, because the missing 30% is concentrated in a few services that distort the per-customer numbers. Either commit to full propagation or do not start. The middle path produces a dashboard that finance will not trust and that pricing will not use.&lt;/p&gt;

&lt;p&gt;Once the propagation is in place, every new service must include customer_id in its spans, queue headers, storage keys, and warehouse queries from day one. Adding the discipline to the service template is the maintenance work that keeps the dashboard accurate as the company grows.&lt;/p&gt;
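&lt;p&gt;One way a service template could carry that discipline, sketched with Python's standard-library &lt;code&gt;contextvars&lt;/code&gt;. Every function name here is hypothetical, and the &lt;code&gt;customers/&lt;/code&gt; key prefix is an assumed convention; a real stack would more likely carry the value through its tracing library's context propagation rather than a hand-rolled variable.&lt;/p&gt;

```python
import contextvars

# Holds the customer for the request currently being handled.
_customer_id = contextvars.ContextVar("customer_id", default=None)


def start_request(customer_id):
    """Call once in gateway middleware, after authentication."""
    return _customer_id.set(customer_id)


def current_customer():
    cid = _customer_id.get()
    if cid is None:
        # Assumption: fail loudly rather than emit an unattributed record.
        raise RuntimeError("customer_id missing from request context")
    return cid


def queue_headers(extra=None):
    """Headers to attach to an outgoing Kafka/SQS message."""
    headers = dict(extra or {})
    headers["customer_id"] = current_customer()
    return headers


def storage_key(filename):
    """Object-storage key convention: prefix every key with the customer."""
    return f"customers/{current_customer()}/{filename}"


start_request("cust_42")
print(queue_headers({"trace": "abc"}))
print(storage_key("export.csv"))
```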

&lt;h2&gt;
  
  
  The dollar math
&lt;/h2&gt;

&lt;p&gt;On a $500K+ ARR product the dashboard pays for itself within the first year; the table below puts the payback period at 6-12 months for a $4M ARR product. The payback comes from two mechanisms, both visible on the dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Typical contribution&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing changes that recover negative-economics customers&lt;/td&gt;
&lt;td&gt;2-5% gross margin&lt;/td&gt;
&lt;td&gt;Targeted; affects only the 8-15% of customers in the red/negative-margin bands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-customer right-sizing of resource-heavy environments&lt;/td&gt;
&lt;td&gt;8-15% infra reduction&lt;/td&gt;
&lt;td&gt;Possible only after attribution makes the over-resourced ones visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total annual margin recovery&lt;/td&gt;
&lt;td&gt;$50K-$200K&lt;/td&gt;
&lt;td&gt;On a $4M ARR product with 25% infra-to-revenue ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrofit cost (one-time)&lt;/td&gt;
&lt;td&gt;$80K-$160K&lt;/td&gt;
&lt;td&gt;14-29 engineer-weeks at fully-loaded rates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operating cost (dashboard + monthly refresh)&lt;/td&gt;
&lt;td&gt;$20K/year&lt;/td&gt;
&lt;td&gt;Minimal once the propagation is solid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payback period&lt;/td&gt;
&lt;td&gt;6-12 months&lt;/td&gt;
&lt;td&gt;First-year ROI typically 1.3x-2x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first-year ROI is fine but not extraordinary. The compound value is the operational difference: by year two, the company is making pricing and right-sizing decisions weekly with real per-customer data instead of quarterly with averaged data. Pricing changes that would have been argued about for a quarter ship in a sprint. Right-sizing that would have been speculative becomes targeted at the 10-15 specific customer environments where the savings actually exist.&lt;/p&gt;

&lt;p&gt;The pattern is not "per-customer is the new average." It is "per-customer is the level of granularity at which SaaS pricing and infra decisions are actually made." The quarterly average remains useful for the board slide. The dashboard is what runs the business in between.&lt;/p&gt;

&lt;p&gt;Stand up the dashboard, do the propagation work in parallel, and stop running unit economics off a number that hides everything. The first three pricing mistakes the dashboard catches will fund the next two years of the propagation work.&lt;/p&gt;

</description>
      <category>cost</category>
      <category>saas</category>
      <category>price</category>
    </item>
    <item>
      <title>Per-Agent Quotas for MCP: The Token Budget That Stopped One Agent From Burning 80% of the Daily Spend</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 15 May 2026 06:57:57 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/per-agent-quotas-for-mcp-the-token-budget-that-stopped-one-agent-from-burning-80-of-the-daily-38o7</link>
      <guid>https://dev.to/muskan_8abedcc7e12/per-agent-quotas-for-mcp-the-token-budget-that-stopped-one-agent-from-burning-80-of-the-daily-38o7</guid>
      <description>&lt;p&gt;The first ninety days of an MCP server in production are about correctness, not abuse. The team is busy proving the agents do the right thing: the policy lookups return what they should, the audit log captures the right fields, the structured errors are parsed by the agent framework correctly. Rate limiting is something the team plans to add "after we have real traffic." The team has real traffic on day 12 and forgets to add rate limiting. On day 87 the first runaway lands.&lt;/p&gt;

&lt;p&gt;The runaway always has the same shape. One agent starts behaving badly: a test loop forgot to set max_iterations, a malformed prompt drove the model into a long-output failure mode, or a retry policy had its backoff inverted. The agent calls the same MCP tool 400 times in 30 minutes, burning 70% to 90% of the day's token budget before any human sees the alert. By morning the bill shows a $4,200 charge against an Anthropic account that usually does $800/day.&lt;/p&gt;

&lt;p&gt;The structural fix is per-agent token quotas baked into the MCP server. Each agent identity gets a budget across three windows (hourly, daily, weekly). The MCP server tracks consumption and rejects calls that would exceed the budget. The agent gets a structured error; the human operator gets one page per cycle instead of a Slack thread at 9 a.m. asking who is responsible for the bill.&lt;/p&gt;

&lt;p&gt;The pattern composes with the &lt;a href="https://zop.dev/resources/blogs/mcp-cost-ledger-billing-47-ai-agents/" rel="noopener noreferrer"&gt;MCP cost ledger&lt;/a&gt; (which tells you what each agent has spent) and the policy-aware MCP governance work. The ledger is descriptive; the quota is prescriptive. Together they turn per-agent cost into a managed budget rather than a billing artifact you read about three days later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first MCP runaway lands in 90 days
&lt;/h2&gt;

&lt;p&gt;The runaway shapes are predictable. Across a dozen MCP-server rollouts audited at ZopDev customers, three failure modes account for almost every incident.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Burn rate&lt;/th&gt;
&lt;th&gt;Typical detection latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test loop left running&lt;/td&gt;
&lt;td&gt;Developer's local agent forgets &lt;code&gt;max_iterations&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;80-120 calls/hour for hours&lt;/td&gt;
&lt;td&gt;6-12 hours (next morning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Malformed prompt drives long-output mode&lt;/td&gt;
&lt;td&gt;Prompt change ships with regex bug; model hits 8k token outputs every call&lt;/td&gt;
&lt;td&gt;3-5x normal token cost per call&lt;/td&gt;
&lt;td&gt;2-6 hours (when daily spend alert fires)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inverted retry backoff&lt;/td&gt;
&lt;td&gt;Retry policy doubles on success instead of failure&lt;/td&gt;
&lt;td&gt;200-500 calls/hour against an idempotent tool&lt;/td&gt;
&lt;td&gt;1-3 hours (when downstream service alarms)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The detection latencies are uncomfortable because none of them are inside the MCP server's control. The server sees the agent's calls and bills them honestly. It does not see the agent's intent, the retry logic, or the prompt change that landed an hour ago. By the time a human sees the spike in a billing dashboard or a cost alert, the runaway has been burning for hours.&lt;/p&gt;

&lt;p&gt;The right place to catch this is inside the MCP server, on the call path. Every tool call passes through the server; every call has an agent identity attached (a service account, a session token, an API key). If the server checks the agent's running budget against a quota before allowing the call, the runaway stops at the quota boundary instead of at the 9 a.m. Slack ping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-agent quotas as a tri-window check
&lt;/h2&gt;

&lt;p&gt;A single daily quota is not enough. An agent that burns its full daily budget by 10 a.m. has fourteen hours to keep generating rejected requests; even rejected, the orchestration overhead (the agent's own LLM calls deciding what to do next) eats real tokens. A weekly quota catches the slow-creep agent that goes 8% over every day for five days and adds up to a meaningful Friday total that no daily check would have stopped.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Catches&lt;/th&gt;
&lt;th&gt;Typical cap (for 50K/day default agent)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;The fast runaway (within 60 min)&lt;/td&gt;
&lt;td&gt;8K tokens/hour (typical: 2K-3K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;The within-day burn (within hours)&lt;/td&gt;
&lt;td&gt;50K tokens/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;The slow creep (within days)&lt;/td&gt;
&lt;td&gt;250K tokens/week (typical: 200K)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three caps compose. The agent is allowed if it is under all three. If any single cap trips, the call is rejected. This sounds expensive to check but in practice it is three counter reads from a Redis hash; the overhead is sub-millisecond.&lt;/p&gt;
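&lt;p&gt;The composed check can be sketched in a few lines of Python. A plain dict stands in for the Redis hash here, the function names are hypothetical, and window expiry (resetting each counter when its hour, day, or week rolls over) is left out to keep the sketch short.&lt;/p&gt;

```python
# Default caps from the table above, in tokens.
DEFAULT_CAPS = {"hourly": 8_000, "daily": 50_000, "weekly": 250_000}
WINDOWS = ("hourly", "daily", "weekly")


def check_quota(consumed, estimate, caps=DEFAULT_CAPS):
    """Allow the call only if it fits under every window's cap.

    consumed: tokens already used per window, e.g. {"hourly": 1200, ...}
    estimate: projected tokens for the call about to run
    Returns (allowed, tripped_window); tripped_window is None when allowed.
    """
    for window in WINDOWS:
        if consumed.get(window, 0) + estimate > caps[window]:
            return False, window
    return True, None


def record_usage(consumed, tokens):
    """After an allowed call, add its tokens to all three counters."""
    for window in WINDOWS:
        consumed[window] = consumed.get(window, 0) + tokens


usage = {"hourly": 7_800, "daily": 20_000, "weekly": 90_000}
print(check_quota(usage, 500))  # (False, 'hourly')
```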

&lt;p&gt;The hourly cap is the most underweighted of the three. Teams that ship daily-only quotas get caught by the morning runaway whose damage is done before the daily counter resets. The hourly cap means the worst case is 60 minutes of damage instead of 8 hours. For a 100x normal burn rate, the difference is $200 vs $2,000.&lt;/p&gt;

&lt;p&gt;The weekly cap catches the agent that nobody pages on because no single day looks anomalous. An agent that does 60K tokens/day on a 50K cap looks within budget on the daily check (allowing the over-budget grace described below) but accumulates to 300K after five days, and 420K over a full seven, against a 250K weekly cap. The weekly check catches the pattern that the per-day signal misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Default budget + adaptive growth
&lt;/h2&gt;

&lt;p&gt;A new agent does not get a generous quota on day one. The default is small (50K tokens/day, 8K/hour, 250K/week) and grows with demonstrated usage. The growth is automatic, based on the agent's actual consumption pattern over the trailing 30 days.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time window&lt;/th&gt;
&lt;th&gt;Quota state&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Days 1-7&lt;/td&gt;
&lt;td&gt;Default (50K/day)&lt;/td&gt;
&lt;td&gt;Most agents stay under 30%; quota stays at default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Days 8-30&lt;/td&gt;
&lt;td&gt;Default, monitoring&lt;/td&gt;
&lt;td&gt;If utilization stays under 30%, auto-promotion candidate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 30&lt;/td&gt;
&lt;td&gt;Auto-promotion&lt;/td&gt;
&lt;td&gt;If utilization 10-50%, raise to 200K/day; if higher, page FinOps for review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Days 31-90&lt;/td&gt;
&lt;td&gt;200K/day&lt;/td&gt;
&lt;td&gt;If utilization stays under 40%, candidate for 500K/day at day 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 60+&lt;/td&gt;
&lt;td&gt;500K/day or higher&lt;/td&gt;
&lt;td&gt;Manual review required for further increases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The shape: cheap by default, generous to proven workloads, never silently unlimited. A new agent that turns out to need 1M tokens/day gets there through a documented promotion path, not by accident. The same agent at day one would have been hard-blocked at 50K and the human would have set the right quota explicitly.&lt;/p&gt;

&lt;p&gt;The promotion thresholds matter. If the auto-promotion fires whenever utilization is non-zero, every agent inflates to its peak day's quota and the protection erodes. If the threshold is too tight (e.g., only auto-promote at 50-70% utilization), most agents never grow and FinOps becomes a quota-approval bottleneck. The 30% / 40% bands above are the typical operating range; tighten or loosen based on the team's tolerance for false promotion vs friction.&lt;/p&gt;
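&lt;p&gt;As an illustration, the day-30 row of the table could be coded roughly as follows. The function name and return values are hypothetical, and the table does not say what happens below 10% utilization; this sketch assumes the quota simply stays where it is.&lt;/p&gt;

```python
def day_30_decision(avg_daily_tokens, current_cap=50_000):
    """Apply the day-30 rule to trailing 30-day average usage.

    Returns (action, new_daily_cap).
    """
    utilization = avg_daily_tokens / current_cap
    if utilization > 0.50:
        # Heavy use of a default quota goes to a human, not the automation.
        return "page_finops", current_cap
    if utilization >= 0.10:
        return "promote", 200_000
    # Assumption: below 10% the quota is left unchanged.
    return "keep", current_cap


print(day_30_decision(15_000))  # ('promote', 200000)
print(day_30_decision(30_000))  # ('page_finops', 50000)
```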

&lt;h2&gt;
  
  
  Hard-block vs degrade: the design choice
&lt;/h2&gt;

&lt;p&gt;When an agent exceeds quota, the MCP server has two broad response options. Hard-block is the simple choice: reject the call with an error, the agent's task fails, the human investigates. Degrade is the more humane choice: route the call to a cheaper model (or return a cached response or a partial result), so the agent's task completes at lower quality while the cost stays under control.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Bill predictability&lt;/th&gt;
&lt;th&gt;Task continuity&lt;/th&gt;
&lt;th&gt;Output quality risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hard-block&lt;/td&gt;
&lt;td&gt;High (cost stops at cap)&lt;/td&gt;
&lt;td&gt;Low (agent task fails)&lt;/td&gt;
&lt;td&gt;None (no degraded output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Degrade to cheap model&lt;/td&gt;
&lt;td&gt;Medium (cheap model still costs)&lt;/td&gt;
&lt;td&gt;High (agent continues)&lt;/td&gt;
&lt;td&gt;High (low-quality outputs may loop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return cached response&lt;/td&gt;
&lt;td&gt;High (no model call)&lt;/td&gt;
&lt;td&gt;Medium (only some calls cacheable)&lt;/td&gt;
&lt;td&gt;Medium (cache may be stale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return structured "over budget"&lt;/td&gt;
&lt;td&gt;High (no model call)&lt;/td&gt;
&lt;td&gt;Medium (agent must handle)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams ship hard-block first because the failure mode is contained: an agent that breaks under quota is visible immediately and gets a real fix. Degrade looks better in theory but introduces a subtle failure mode: a degraded agent producing low-quality outputs may loop trying to recover, generating more (cheaper but still real) calls and ultimately costing more than hard-blocking would have.&lt;/p&gt;

&lt;p&gt;The middle path is to ship hard-block as the default and let agents opt in to degrade for specific tool classes where partial output is genuinely better than no output. Read-only tools (lookup, search, summarize) are good degrade candidates: a cached or cheap-model response is acceptable. Write tools (mutations, policy changes) should hard-block: a degraded write is worse than no write.&lt;/p&gt;
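&lt;p&gt;That default-plus-opt-in rule is small enough to sketch directly. The tool-class names and the function are hypothetical; only the read/write split comes from the text above.&lt;/p&gt;

```python
# Hypothetical tool classes that tolerate a degraded response.
READ_ONLY = {"lookup", "search", "summarize"}


def over_quota_mode(tool_class, agent_opted_in_to_degrade):
    """Pick the response mode for a call that would exceed quota."""
    if tool_class in READ_ONLY and agent_opted_in_to_degrade:
        return "degrade"
    # Default for every agent, and always for write tools.
    return "hard_block"


print(over_quota_mode("search", True))    # degrade
print(over_quota_mode("mutation", True))  # hard_block
```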

&lt;h2&gt;
  
  
  Structured rejection so retries back off correctly
&lt;/h2&gt;

&lt;p&gt;When the MCP server rejects a call, the error payload matters more than the rejection itself. An unstructured error ("quota exceeded") triggers the agent's default retry logic, which is usually "try again with exponential backoff." The agent retries, gets rejected, retries again, and burns more tokens on its own orchestration calls trying to figure out why the tool is failing.&lt;/p&gt;

&lt;p&gt;A structured rejection includes the data the agent needs to back off correctly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;quota_exceeded_daily&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent-fraud-classifier-prod&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;window&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;daily&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;integer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;50000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;consumed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;integer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;52340&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reset_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ISO timestamp&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-10T00:00:00Z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_after_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;integer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;19260&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;suggested_action&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;wait_until_reset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this payload, the agent's retry logic can do the right thing: stop retrying until &lt;code&gt;reset_at&lt;/code&gt;, surface the over-budget condition to its orchestrator, or fall back to a different code path. None of these are possible from "quota exceeded" alone.&lt;/p&gt;
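&lt;p&gt;A sketch of how a server might assemble that payload for a daily-cap rejection. The function name is hypothetical, and the sketch assumes the daily window resets at 00:00 UTC, which matches the example values in the table but is not stated as a requirement.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def daily_rejection(agent_id, cap, consumed, now):
    """Build the structured payload for a daily-quota rejection.

    Assumes a 00:00 UTC reset; `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return {
        "error_code": "quota_exceeded_daily",
        "agent_id": agent_id,
        "window": "daily",
        "cap": cap,
        "consumed": consumed,
        "reset_at": next_midnight.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retry_after_seconds": int((next_midnight - now).total_seconds()),
        "suggested_action": "wait_until_reset",
    }


payload = daily_rejection(
    "agent-fraud-classifier-prod",
    cap=50_000,
    consumed=52_340,
    now=datetime(2026, 5, 9, 18, 39, tzinfo=timezone.utc),
)
print(payload["reset_at"], payload["retry_after_seconds"])
# 2026-05-10T00:00:00Z 19260
```

&lt;p&gt;An agent that receives this can sleep for &lt;code&gt;retry_after_seconds&lt;/code&gt; or hand the condition to its orchestrator instead of entering its default backoff loop.&lt;/p&gt;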

&lt;p&gt;The cost difference between structured and unstructured errors is meaningful. A blind retry loop against a rejected MCP tool generates 5-15 orchestration LLM calls before the agent's policy gives up. At Claude Sonnet rates, those orchestration calls cost roughly the same as the rejected tool calls would have cost. Structured errors zero out that overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit + composition with the cost ledger
&lt;/h2&gt;

&lt;p&gt;Every quota check writes an audit log line, regardless of decision. This is the system of record for two things: postmortem of any cost incident, and input to monthly quota tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42urkumt5ew6ippqvi67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42urkumt5ew6ippqvi67.png" alt="diagram" width="800" height="924"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost ledger and the quota check run on the same call path but serve different purposes. The ledger writes the actual token cost after the call returns (the source of truth for billing). The quota check uses a pre-call estimate (input tokens are known exactly, output tokens are projected from max_tokens). The estimate is conservative (assumes worst case); the ledger is exact.&lt;/p&gt;

&lt;p&gt;This split matters for tuning. After 30 days, the ratio of estimated cost to actual cost is the input to whether the quota's grace margin needs to change. If estimates are systematically 20% higher than actuals, agents get blocked more often than the budget would warrant; the grace margin can shrink. If estimates are 10% lower, the budget is leaking; the grace margin needs to grow.&lt;/p&gt;
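&lt;p&gt;Both halves of that comparison are simple to compute. The sketch below assumes the audit log and the ledger can be joined into (estimated, actual) token pairs per call; the function names and sample numbers are hypothetical.&lt;/p&gt;

```python
def pre_call_estimate(input_tokens, max_tokens):
    """Worst-case projection: exact input plus the full output allowance."""
    return input_tokens + max_tokens


def estimate_to_actual_ratio(calls):
    """calls: list of (estimated_tokens, actual_tokens) pairs.

    A result above 1.0 means the estimates run high; below 1.0, low.
    """
    total_estimated = sum(est for est, _ in calls)
    total_actual = sum(act for _, act in calls)
    return total_estimated / total_actual


print(pre_call_estimate(1_200, 800))                            # 2000
print(estimate_to_actual_ratio([(2_000, 1_600), (1_000, 900)]))  # 1.2
```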

&lt;p&gt;The composition with the &lt;a href="https://zop.dev/resources/blogs/mcp-cost-ledger-billing-47-ai-agents/" rel="noopener noreferrer"&gt;MCP cost ledger&lt;/a&gt; is what makes the quota system trustworthy. The ledger answers "what did we spend"; the quota answers "what are we allowed to spend." Two complementary systems, one call path, one audit log.&lt;/p&gt;

&lt;h2&gt;
  
  
  First-month tuning: 2-4 unexpected runaways caught
&lt;/h2&gt;

&lt;p&gt;The first month of the quota system in production produces a predictable mix of firings. ZopDev customer rollout data shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Week&lt;/th&gt;
&lt;th&gt;Typical firings&lt;/th&gt;
&lt;th&gt;Common cause&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3-7&lt;/td&gt;
&lt;td&gt;False positives (default too low for legitimate workloads)&lt;/td&gt;
&lt;td&gt;Raise quotas for the 1-3 affected agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2-5&lt;/td&gt;
&lt;td&gt;Legacy agents using more tokens than anyone realized&lt;/td&gt;
&lt;td&gt;Investigate; tune quota or refactor agent prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1-3&lt;/td&gt;
&lt;td&gt;First real runaways caught&lt;/td&gt;
&lt;td&gt;Postmortem, no quota change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;Mostly real&lt;/td&gt;
&lt;td&gt;Steady-state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 2+&lt;/td&gt;
&lt;td&gt;0-2/month&lt;/td&gt;
&lt;td&gt;Real runaways only&lt;/td&gt;
&lt;td&gt;Postmortem each&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The week-1 false positives are a signal that some agents were running with consumption nobody had measured. This is itself the value: the team learns what its agents actually cost. Most teams discover at least one "we thought this agent did 20K tokens/day but it actually does 180K" surprise in the first two weeks.&lt;/p&gt;

&lt;p&gt;The week-3 first real runaway is the moment the system earns its keep. The runaway gets caught within an hour (because of the hourly cap), the page goes out, the human reads the structured error, and the incident is closed in 45 minutes with $200 of damage instead of 8 hours and $5,000.&lt;/p&gt;

&lt;p&gt;The FinOps engineer time across the month is 4-6 hours: classifying the firings, adjusting quotas, writing brief postmortems, updating the promotion thresholds based on observed utilization. The fleet-wide saving over the same month is typically 20-40% of the monthly token bill, mostly from runaways prevented and from the visibility into per-agent consumption that the audit log enables.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dollar math
&lt;/h2&gt;

&lt;p&gt;The numbers are simple. Per-agent quotas at a typical mid-market agent fleet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly token bill before quotas&lt;/td&gt;
&lt;td&gt;$40,000 to $120,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction from prevented runaways&lt;/td&gt;
&lt;td&gt;15-25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction from agent right-sizing (visible from audit)&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total monthly reduction&lt;/td&gt;
&lt;td&gt;$8,000 to $48,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quota system build cost (one-time)&lt;/td&gt;
&lt;td&gt;$20,000 to $40,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operating cost (1 FinOps engineer, ~10% time)&lt;/td&gt;
&lt;td&gt;$20,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payback period&lt;/td&gt;
&lt;td&gt;1-4 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
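&lt;p&gt;The reduction row follows from the rows above it, as this short check shows. The payback helper is an assumed formula (one-time build cost divided by monthly reduction net of operating cost), and the midpoint inputs in the usage line are illustrative rather than taken from the table.&lt;/p&gt;

```python
# Figures from the table above, in USD. Percentages are kept as integers
# so the arithmetic stays exact.
bill_low, bill_high = 40_000, 120_000
runaway_pct = (15, 25)      # reduction from prevented runaways
rightsizing_pct = (5, 15)   # reduction from agent right-sizing

reduction_low = bill_low * (runaway_pct[0] + rightsizing_pct[0]) // 100
reduction_high = bill_high * (runaway_pct[1] + rightsizing_pct[1]) // 100
print(reduction_low, reduction_high)  # 8000 48000


def payback_months(build_cost, monthly_reduction, operating_per_year=20_000):
    """Months until net monthly savings cover the one-time build cost."""
    net_monthly = monthly_reduction - operating_per_year / 12
    return build_cost / net_monthly


# Midpoints: $30,000 build cost, $28,000 monthly reduction.
print(round(payback_months(30_000, 28_000), 2))  # 1.14
```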

&lt;p&gt;The build cost varies because the implementation choices (Redis counters vs Postgres, structured errors vs not, adaptive promotion vs static caps) have different effort profiles. The lowest-effort version (a single daily cap per agent, hard-block on overage, structured errors) ships in two weeks; the full tri-window adaptive system with degrade support is a quarter of platform-engineer time.&lt;/p&gt;

&lt;p&gt;The payback math works because the prevented-runaway saving is a real reduction in spend, not a forecasted one. The agent that would have burned $4,200 overnight burns $200 and stops. The cost ledger and the audit log show exactly how much was saved at each firing, which is the kind of receipt finance accepts as ROI evidence.&lt;/p&gt;

&lt;p&gt;Per-agent quotas are not optional once an MCP server has more than three or four agents in production. The first runaway is a question of when, not if. Shipping the quota system before the first runaway costs a quarter of platform-engineer time; shipping it after the first runaway costs that plus the bill for the incident and the trust hit from finance. Set the default budget, wire the tri-window check, log the decisions, and stop relying on the morning Slack ping to catch agent runaways.&lt;/p&gt;

</description>
      <category>per</category>
      <category>agents</category>
      <category>mcp</category>
      <category>token</category>
    </item>
    <item>
      <title>The Closed-Loop Budget Brake: How a $5k Daily Cap Stopped 2 A.M. Compute Runaways</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 15 May 2026 06:56:51 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/the-closed-loop-budget-brake-how-a-5k-daily-cap-stopped-2-am-compute-runaways-3495</link>
      <guid>https://dev.to/muskan_8abedcc7e12/the-closed-loop-budget-brake-how-a-5k-daily-cap-stopped-2-am-compute-runaways-3495</guid>
      <description>&lt;p&gt;The 2 a.m. compute runaway is the canonical FinOps incident. A Spark job is misconfigured to provision new EMR nodes every minute it cannot find a leader. A test agent left running on a developer's laptop loops infinite Claude calls against the prod API key. An autoscaling group's max gets bumped from 20 to 2000 in a Terraform plan that nobody reviewed at the right line number. Everything is asleep. The hourly spend goes from $63 to $830 to $4,200. By 9 a.m. the team gets a Slack ping from finance asking why yesterday's bill spiked $47,000.&lt;/p&gt;

&lt;p&gt;AWS Budgets fires a soft alert when daily spend crosses a threshold. The alert goes to an SNS topic that emails a distribution list and pings a Slack channel. Nobody reads the channel at 2 a.m. The on-call engineer is paged for production outages, not budget overages. By the time someone sees the alert, the damage is hours old and the runaway has either burned itself out or kept running because the alert did not actually stop anything.&lt;/p&gt;

&lt;p&gt;The structural fix is to replace the email with an action. A closed-loop budget brake fires a remediation playbook when a hard daily cap is crossed: stop non-prod EC2 launches, pause non-prod autoscaling groups, freeze agent provisioning, throttle batch jobs, page the on-call. The 5-minute detect-decide-act-verify shape from the &lt;a href="https://zop.dev/resources/blogs/closed-loop-finops-detect-decide-act-verify/" rel="noopener noreferrer"&gt;closed-loop FinOps work&lt;/a&gt; applies directly, with the cap value as the signal and the playbook as the action.&lt;/p&gt;

&lt;p&gt;The piece composes with the &lt;a href="https://zop.dev/resources/blogs/closed-loop-trust-score-auto-remediation-decision/" rel="noopener noreferrer"&gt;closed-loop trust score&lt;/a&gt; (deciding which playbook tiers auto-fire) and runs alongside &lt;a href="https://zop.dev/resources/blogs/cloud-cost-anomaly-detection/" rel="noopener noreferrer"&gt;cost anomaly detection&lt;/a&gt; (which catches longer-horizon structural drift the brake cannot see).&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2 a.m. runaway and why email alerts fail
&lt;/h2&gt;

&lt;p&gt;Look at how the four common detection mechanisms catch a 2 a.m. runaway, and what they cost in dollars by the time someone acts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detection mechanism&lt;/th&gt;
&lt;th&gt;Time to detect&lt;/th&gt;
&lt;th&gt;Time to action&lt;/th&gt;
&lt;th&gt;Spend lost before action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Budgets soft alert (email/Slack)&lt;/td&gt;
&lt;td&gt;8-12 hours after threshold crossed&lt;/td&gt;
&lt;td&gt;Morning when someone reads it&lt;/td&gt;
&lt;td&gt;$30,000 to $80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly cost alarm (custom CloudWatch)&lt;/td&gt;
&lt;td&gt;60-90 minutes after spike begins&lt;/td&gt;
&lt;td&gt;Hours later if on-call is busy&lt;/td&gt;
&lt;td&gt;$5,000 to $15,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost anomaly detection (AWS or vendor)&lt;/td&gt;
&lt;td&gt;24-72 hours after pattern shifts&lt;/td&gt;
&lt;td&gt;After analyst review&lt;/td&gt;
&lt;td&gt;$50,000 to $200,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Closed-loop budget brake&lt;/td&gt;
&lt;td&gt;5-15 minutes after cap crossed&lt;/td&gt;
&lt;td&gt;Automatic playbook&lt;/td&gt;
&lt;td&gt;$500 to $2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dollar gap between the soft alert and the brake is the case for the brake. Soft alerts are real signals, but they are signals that go to humans who are not actively monitoring at 2 a.m. The brake removes the "wait for a human" step from the loop.&lt;/p&gt;

&lt;p&gt;The other detectors are not redundant. Hourly cost alarms catch slower-building issues the brake's daily cap might miss within a single day. Cost anomaly detection catches structural shifts (a new feature with legitimate higher spend, a pricing change, a seasonal pattern) over multi-day windows. The brake handles the within-day catastrophic runaway. Three detectors, three different horizons.&lt;/p&gt;

&lt;h2&gt;
  
  
  The brake: short-circuit, not email
&lt;/h2&gt;

&lt;p&gt;The brake's shape is the same four-stage loop as any other closed-loop FinOps system, with the cap as the input signal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fc5bzwjbk57d4pahzws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fc5bzwjbk57d4pahzws.png" alt="diagram" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detect samples live spend at a 5-minute interval from Cost Explorer or the equivalent on GCP / Azure. Five minutes is the floor: the billing APIs are eventually consistent and finer-grained polling produces false negatives (real spend that has not yet shown up in the API). Decide compares the running daily total against the cap. If crossed, the brake fires a tiered playbook. Act runs the playbook. Verify samples spend again 15 minutes after the playbook fires, confirming the runaway has stopped.&lt;/p&gt;

&lt;p&gt;The playbook is per-account and lives in version control. A typical Tier 1 payload contains six actions: stop all &lt;code&gt;env=non-prod&lt;/code&gt; EC2 instances launched in the last 60 minutes, pause non-prod autoscaling group scale-outs by setting &lt;code&gt;max_size = current_size&lt;/code&gt;, freeze agent provisioning by revoking the agent service role's &lt;code&gt;ec2:RunInstances&lt;/code&gt;, throttle non-prod batch queues to zero concurrency, snapshot the spend-by-service breakdown to an S3 bucket for postmortem, and notify the FinOps channel with the breakdown.&lt;/p&gt;

&lt;p&gt;The brake does not delete anything. Everything Tier 1 does is reversible in seconds. Spend stops; nothing breaks. The on-call wakes up to a "brake fired" page, reads the breakdown, decides whether the spend was legitimate (and reverses the playbook) or a runaway (and starts the postmortem).&lt;/p&gt;
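
&lt;p&gt;The four stages above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the production brake: &lt;code&gt;sample_spend&lt;/code&gt;, &lt;code&gt;playbook&lt;/code&gt;, &lt;code&gt;wait_minutes&lt;/code&gt; and &lt;code&gt;max_growth&lt;/code&gt; are hypothetical names, and the verify rule (spend added during the verify window stays under a tolerance) is one plausible reading of "the runaway has stopped".&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List

VERIFY_DELAY_MIN = 15  # verify delay taken from the text


@dataclass
class BrakeResult:
    fired: bool      # did the playbook run?
    verified: bool   # did spend growth stay within max_growth after it ran?


def run_brake_cycle(
    sample_spend: Callable[[], float],    # hypothetical: returns running daily total in dollars
    cap: float,
    playbook: List[Callable[[], None]],   # hypothetical: reversible Tier 1 actions
    wait_minutes: Callable[[int], None],  # injected so the sketch runs without sleeping
    max_growth: float,                    # assumed tolerance for spend added during verify
) -> BrakeResult:
    # Detect: read the running daily total.
    before = sample_spend()
    # Decide: nothing to do while the total is at or under the cap.
    if cap >= before:
        return BrakeResult(fired=False, verified=False)
    # Act: run each playbook action in order.
    for action in playbook:
        action()
    # Verify: sample again after the delay and check that growth has flattened.
    wait_minutes(VERIFY_DELAY_MIN)
    after = sample_spend()
    return BrakeResult(fired=True, verified=max_growth >= after - before)
```

&lt;p&gt;A scheduler would call this once per 5-minute sampling interval. Passing the sampler and the wait function in as arguments keeps the loop testable without touching a billing API.&lt;/p&gt;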

&lt;h2&gt;
  
  
  Sizing the cap: three inputs, one formula
&lt;/h2&gt;

&lt;p&gt;The cap value is not a guess. It is computed from three inputs that already exist in the cost data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Typical daily spend&lt;/td&gt;
&lt;td&gt;Cost Explorer 7-day trailing average&lt;/td&gt;
&lt;td&gt;Smooths weekly seasonality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Variance multiplier&lt;/td&gt;
&lt;td&gt;Engineering judgement (1.5x to 2x)&lt;/td&gt;
&lt;td&gt;Absorbs legitimate daily spikes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emergency floor&lt;/td&gt;
&lt;td&gt;Largest single-day spend in trailing 90d&lt;/td&gt;
&lt;td&gt;The "this happened once and was real" line&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The formula: &lt;code&gt;cap = max(typical * variance, emergency_floor * 1.2)&lt;/code&gt;, where the &lt;code&gt;1.2&lt;/code&gt; factor is the emergency floor plus a 20% buffer.&lt;/p&gt;

&lt;p&gt;Worked example for three account profiles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Account profile&lt;/th&gt;
&lt;th&gt;Typical daily&lt;/th&gt;
&lt;th&gt;Variance (1.7x)&lt;/th&gt;
&lt;th&gt;Emergency floor&lt;/th&gt;
&lt;th&gt;Cap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (dev team of 12)&lt;/td&gt;
&lt;td&gt;$500&lt;/td&gt;
&lt;td&gt;$850&lt;/td&gt;
&lt;td&gt;$1,200&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-market (50-engineer org)&lt;/td&gt;
&lt;td&gt;$1,500&lt;/td&gt;
&lt;td&gt;$2,550&lt;/td&gt;
&lt;td&gt;$4,200&lt;/td&gt;
&lt;td&gt;$5,040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large (200-engineer org)&lt;/td&gt;
&lt;td&gt;$10,000&lt;/td&gt;
&lt;td&gt;$17,000&lt;/td&gt;
&lt;td&gt;$24,000&lt;/td&gt;
&lt;td&gt;$28,800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
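
&lt;p&gt;The worked example can be reproduced in a few lines. This is a sketch of the formula as stated, with the 20% buffer applied as a multiplier on the emergency floor; the function and parameter names are illustrative.&lt;/p&gt;

```python
def compute_cap(typical_daily: float, variance: float, emergency_floor: float,
                floor_buffer: float = 0.20) -> float:
    """Daily cap: the larger of typical spend times the variance multiplier
    and the emergency floor plus its buffer."""
    return max(typical_daily * variance, emergency_floor * (1 + floor_buffer))


# (typical daily spend, emergency floor) for the three profiles in the table
profiles = {
    "small": (500, 1200),
    "mid-market": (1500, 4200),
    "large": (10000, 24000),
}

for name, (typical, floor) in profiles.items():
    print(name, round(compute_cap(typical, 1.7, floor)))
```

&lt;p&gt;In all three profiles the buffered emergency floor is the larger term, so it sets the cap. The variance term only takes over for accounts whose largest single day in the trailing 90 days sits close to their typical day.&lt;/p&gt;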

&lt;p&gt;The 20% buffer on the emergency floor matters. A legitimate spike that happened once (a launch event, a load test, an unusual data migration) might not happen the same day next month, but the cap has to be high enough that the same pattern would not trip the brake if it recurs. Without the 20% buffer, the brake fires on every recurrence of every legitimate pattern, and the on-call learns to ignore it.&lt;/p&gt;

&lt;p&gt;Cap recomputation happens monthly. The typical daily spend drifts as the company grows. The emergency floor may shift if a new legitimate workload pattern emerges. A static cap that lasts more than a quarter starts to over-fire or under-fire because the inputs moved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier the action: composing with the trust score
&lt;/h2&gt;

&lt;p&gt;The brake's playbook is tiered by what the &lt;a href="https://zop.dev/resources/blogs/closed-loop-trust-score-auto-remediation-decision/" rel="noopener noreferrer"&gt;trust score&lt;/a&gt; allows. Without the trust score, the brake either over-reaches (touches production resources, causes a customer-facing incident, gets disabled forever) or under-reaches (only pages, doesn't actually stop the spend).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Trust threshold&lt;/th&gt;
&lt;th&gt;Resources touched&lt;/th&gt;
&lt;th&gt;Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tier 1&lt;/td&gt;
&lt;td&gt;Always-on&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;env=non-prod&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;Stop new EC2 launches, pause ASG scale-outs, freeze agent provisioning, throttle batch queues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 2&lt;/td&gt;
&lt;td&gt;Trust &amp;gt; 0.7&lt;/td&gt;
&lt;td&gt;ML training, ad-hoc analytics&lt;/td&gt;
&lt;td&gt;Downscale training jobs, throttle batch ingestion, pause notebook compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 3&lt;/td&gt;
&lt;td&gt;Always pages&lt;/td&gt;
&lt;td&gt;&lt;code&gt;env=prod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Page on-call with breakdown; no auto-action&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tier 1 is always-on because the actions are low-blast-radius and fully reversible. Stopping a non-prod EC2 instance launched in the last 60 minutes affects only the developer who launched it, and they can re-launch in 90 seconds. Pausing a non-prod ASG scale-out blocks new capacity but does not terminate existing capacity.&lt;/p&gt;

&lt;p&gt;Tier 2 needs the trust score because its actions have a wider blast radius. Downscaling a training job interrupts the team running it; pausing notebook compute kicks people out of their analyses. The trust score asks: is the spend signal high-confidence enough (cap crossed by 2x or more, multiple services contributing) to justify the disruption? If yes, fire. If no, page only.&lt;/p&gt;

&lt;p&gt;Tier 3 always pages. The brake does not touch production resources, ever. The math is simple: a production-impacting incident caused by the brake costs more than any single-day cost runaway. The brake's job is to give the human enough time to fix the runaway before the bill is catastrophic, not to be the fix itself.&lt;/p&gt;
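
&lt;p&gt;The tiering rules reduce to a small decision function. A minimal sketch, assuming the trust score arrives as a number between 0 and 1 from a separate system; &lt;code&gt;plan_actions&lt;/code&gt; and the &lt;code&gt;"auto"&lt;/code&gt; / &lt;code&gt;"page"&lt;/code&gt; labels are hypothetical names.&lt;/p&gt;

```python
TIER2_TRUST_THRESHOLD = 0.7  # threshold from the tier table


def plan_actions(cap_crossed: bool, trust_score: float) -> dict:
    """Map a cap crossing to per-tier behaviour.

    "auto" means the tier's playbook fires; "page" means notify only.
    """
    if not cap_crossed:
        return {}
    return {
        "tier1": "auto",  # always-on, non-prod only
        "tier2": "auto" if trust_score > TIER2_TRUST_THRESHOLD else "page",
        "tier3": "page",  # production is never auto-actioned
    }
```

&lt;p&gt;Tier 3 is hard-coded to &lt;code&gt;"page"&lt;/code&gt; rather than derived from the trust score, so no score, however high, can cause the brake to act on production.&lt;/p&gt;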

&lt;h2&gt;
  
  
  Cap vs anomaly detection: different time horizons
&lt;/h2&gt;

&lt;p&gt;A common mistake is treating the brake as a replacement for cost anomaly detection. They are not the same system. They run on different signals and they fire on different timescales.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o5ptpdrhez5gj2w6c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o5ptpdrhez5gj2w6c8.png" alt="diagram" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The brake answers "is something burning right now?" The signal is the running daily total crossing the cap. The action is an immediate playbook.&lt;/p&gt;

&lt;p&gt;Anomaly detection answers "is the spending pattern different from what we expect?" The signal is statistical: spend by service over a multi-day window deviates from forecasted baseline. The action is queued analyst review, often with a recommendation to update budgets or investigate a new workload.&lt;/p&gt;

&lt;p&gt;Both run. The brake catches the rare catastrophic runaway. Anomaly detection catches the steady-state drift (a new feature legitimately moving the spend curve, a pricing change at a vendor, a regional capacity shift). Each is useless for the other's job: anomaly detection cannot stop a 2 a.m. runaway because the analyst is not online; the brake cannot tell you that your spend has structurally shifted because it only sees the daily total.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first-week tuning ritual
&lt;/h2&gt;

&lt;p&gt;The brake gets tuned in its first two weeks. The cap value comes from the formula, but the formula's inputs are estimates. The actual firing rate in week one tells you whether the cap is right.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Week&lt;/th&gt;
&lt;th&gt;Firings/week&lt;/th&gt;
&lt;th&gt;Typical action&lt;/th&gt;
&lt;th&gt;Cap adjustment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3-6&lt;/td&gt;
&lt;td&gt;Investigate each, classify real vs false positive&lt;/td&gt;
&lt;td&gt;Raise cap by 10-15% per false positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1-3&lt;/td&gt;
&lt;td&gt;Continue classification&lt;/td&gt;
&lt;td&gt;Raise/lower based on the week's data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-4&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;td&gt;Each firing produces a postmortem&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steady state&lt;/td&gt;
&lt;td&gt;0-2/month&lt;/td&gt;
&lt;td&gt;Each firing is real&lt;/td&gt;
&lt;td&gt;Recompute cap monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The classification at each firing matters. The on-call writes a one-paragraph postmortem: was this a runaway (which workload, what was the root cause, how long would it have run unchecked) or a legitimate spike (which team, why, was it within budget, why did the cap not anticipate it). False positives raise the cap. Real positives keep the cap and become input to the trust score weights.&lt;/p&gt;
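
&lt;p&gt;The cap adjustment from week one can be written down so the rule is applied the same way each time. This sketch assumes the 10-15% raise compounds per false positive, which the table does not specify; &lt;code&gt;retune_cap&lt;/code&gt; and the verdict labels are illustrative names.&lt;/p&gt;

```python
def retune_cap(cap: float, verdicts: list, raise_pct: float = 0.10) -> float:
    """Raise the cap once per false positive; real runaways leave it unchanged.

    raise_pct is expected to sit in the 0.10 to 0.15 range from the tuning table.
    """
    for verdict in verdicts:
        if verdict == "false_positive":
            cap *= 1 + raise_pct
    return cap


# One real runaway and one false positive in the week: the cap rises once.
print(round(retune_cap(5040.0, ["runaway", "false_positive"])))
```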

&lt;p&gt;The discipline that makes the brake trusted is that every firing produces a postmortem. Without postmortems, the team starts to debate whether "the brake fires too much," raises the cap to make it stop firing, and the brake silently becomes a $50k cap that catches nothing. With postmortems, the cap value is defensible and trust accumulates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exempt-tag escape valve
&lt;/h2&gt;

&lt;p&gt;Some workloads legitimately spike beyond the cap. ML training that goes from $200/day to $8,000/day during a three-day training run looks identical to a runaway under a daily cap. Ad-hoc analytics that spin up 50 BigQuery slots for a quarterly report look identical. The right fix is not "raise the cap to absorb all training spikes" because that defeats the brake. The right fix is to exempt the workload and route it through a different cap.&lt;/p&gt;

&lt;p&gt;The pattern: any resource tagged &lt;code&gt;brake_exempt=true&lt;/code&gt; is excluded from the daily-cap calculation. Exempt resources go into a separate weekly cap (typically 3x the equivalent daily cap, multiplied by 7 days) that catches truly anomalous training or analytics spend.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload type&lt;/th&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;Cap horizon&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Steady-state services (web, API, batch)&lt;/td&gt;
&lt;td&gt;(no tag)&lt;/td&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML training jobs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brake_exempt=true, brake_class=training&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quarterly analytics, BI rebuilds&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brake_exempt=true, brake_class=analytics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disaster recovery test environments&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brake_exempt=true, brake_class=dr-test&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-event (manual cap)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
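
&lt;p&gt;The routing in the table can be sketched as a lookup on resource tags. The tag keys and values come from the table; the function names and the string labels for the cap horizons are assumptions made for illustration.&lt;/p&gt;

```python
def route_to_cap(tags: dict) -> str:
    """Return which cap horizon a resource's spend counts against."""
    if tags.get("brake_exempt") != "true":
        return "daily"      # untagged steady-state services
    if tags.get("brake_class") == "dr-test":
        return "per-event"  # manual cap set for each DR test
    return "weekly"         # training and analytics workloads


def weekly_exempt_cap(equivalent_daily_cap: float, multiplier: float = 3.0) -> float:
    """Weekly cap for exempt workloads: multiplier x daily cap x 7 days."""
    return multiplier * equivalent_daily_cap * 7
```

&lt;p&gt;Defaulting to &lt;code&gt;"daily"&lt;/code&gt; when the tag is missing or malformed keeps the stricter cap as the fallback, so a typo in a tag cannot silently exempt a workload.&lt;/p&gt;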

&lt;p&gt;The exempt tag has to be opt-in and reviewed. A team that wants to exempt their workload submits a one-page rationale to the FinOps team. The exemption is granted with an expiry date and a recompute schedule. Without that discipline, every team eventually tags their workload exempt and the brake erodes back to nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dollar math
&lt;/h2&gt;

&lt;p&gt;A 2 a.m. runaway on a mid-market AWS account typically costs $30,000 to $80,000 by the time anyone notices. Larger accounts can hit $200,000+ before the morning Slack ping. The frequency is low but not negligible: ZopDev customer audits show one runaway every 4-7 months on mid-market accounts, more frequent during periods of rapid infrastructure change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Cost / value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One prevented mid-market runaway&lt;/td&gt;
&lt;td&gt;$30,000 to $80,000 saved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One prevented large-account runaway&lt;/td&gt;
&lt;td&gt;$200,000+ saved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual frequency (typical mid-market)&lt;/td&gt;
&lt;td&gt;1-2 runaways/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brake operating cost (half-time platform engineer)&lt;/td&gt;
&lt;td&gt;~$30,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expected annual savings (mid-market, before operating cost)&lt;/td&gt;
&lt;td&gt;$30,000 to $160,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
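
&lt;p&gt;The savings range in the table is the runaway frequency multiplied by the cost per runaway. The sketch below reproduces that arithmetic and also subtracts the operating cost, a step the table leaves implicit; the function name is illustrative.&lt;/p&gt;

```python
def expected_annual_savings(runaways_per_year: tuple, cost_per_runaway: tuple) -> tuple:
    """Multiply the low ends together and the high ends together."""
    low = runaways_per_year[0] * cost_per_runaway[0]
    high = runaways_per_year[1] * cost_per_runaway[1]
    return low, high


OPERATING_COST = 30_000  # half-time platform engineer, from the table

gross = expected_annual_savings((1, 2), (30_000, 80_000))  # (30000, 160000)
net = tuple(g - OPERATING_COST for g in gross)             # (0, 130000)
print(gross, net)
```

&lt;p&gt;At the low end (one $30,000 runaway in the year) the brake only breaks even, which is why the payback claim is tied to the first prevented incident and not to a fixed period.&lt;/p&gt;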

&lt;p&gt;The brake pays back after the first prevented incident. After that, it is one of the highest-ROI items on a FinOps roadmap, ahead of right-sizing and behind only the cost-allocation work that lets you see where the spend goes in the first place.&lt;/p&gt;

&lt;p&gt;The brake is not a substitute for budgets, anomaly detection, or right-sizing. It is the layer that catches what those other systems are not designed to catch: the within-day catastrophic spend event. Email alerts go to inboxes. The brake fires a playbook. The 2 a.m. runaway becomes a 5-to-15-minute incident with a $2,000 ceiling instead of an 8-to-12-hour incident with an $80,000 ceiling. Set the cap, write the playbook, watch the first two weeks of firings, and stop arguing with the morning bill.&lt;/p&gt;

</description>
      <category>closed</category>
      <category>loop</category>
      <category>budget</category>
      <category>brake</category>
    </item>
    <item>
      <title>The Golden Path Tax: 14 Hours/Week of Engineer Onboarding We Bought Back With 6 Months of IDP Work</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 15 May 2026 06:54:08 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/the-golden-path-tax-14-hoursweek-of-engineer-onboarding-we-bought-back-with-6-months-of-idp-work-50f7</link>
      <guid>https://dev.to/muskan_8abedcc7e12/the-golden-path-tax-14-hoursweek-of-engineer-onboarding-we-bought-back-with-6-months-of-idp-work-50f7</guid>
      <description>&lt;p&gt;The cost of onboarding a new engineer at a mid-sized cloud-infrastructure org never shows up on a finance dashboard. There's no line item for "hours spent searching for the right runbook" or "Slack threads asking how deployment works this quarter." The cost is real, it's measured in 14 to 22 hours per engineer per week for the first 8 weeks, and it compounds with every hire because the senior engineers who answer the questions lose 4 to 7 hours per week each at the same time.&lt;/p&gt;

&lt;p&gt;Six months of internal developer platform (IDP) work changes the shape of this cost. It doesn't eliminate it. It reshapes it. The 8-week onboarding becomes a 4-week onboarding. The 14 hours per week of access requests, runbook hunts, and deployment questions drop to 30 minutes per week of looking things up in the IDP catalog. The senior engineers stop being the canonical source of truth for "how do we deploy this quarter" because the IDP is. The engineering org gets back about $400k per year of engineer time on a 100-engineer team for an investment of $250-350k over the first six months.&lt;/p&gt;

&lt;p&gt;The piece walks through what an IDP actually changes, why deployment is the keystone golden path, why templates beat documentation, the 6-month investment shape, the math, and the single instrumentation that tells you whether the IDP is working.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 14-hour onboarding tax nobody puts on a dashboard
&lt;/h2&gt;

&lt;p&gt;A new engineer at a mid-sized cloud-infra org goes through a fairly predictable cost curve in the first 8 weeks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Week&lt;/th&gt;
&lt;th&gt;Time spent on onboarding-friction work&lt;/th&gt;
&lt;th&gt;Top sources&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;16-22 hours/week&lt;/td&gt;
&lt;td&gt;AWS access, GitHub access, VPN, k8s kubeconfig, on-call rotation join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2-3&lt;/td&gt;
&lt;td&gt;14-18 hours/week&lt;/td&gt;
&lt;td&gt;First service deploy, learning CI conventions, finding the right Terraform repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4-5&lt;/td&gt;
&lt;td&gt;10-14 hours/week&lt;/td&gt;
&lt;td&gt;Observability setup (logs, metrics, traces, alerts), on-call shadowing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6-7&lt;/td&gt;
&lt;td&gt;6-10 hours/week&lt;/td&gt;
&lt;td&gt;Cross-team integration patterns, edge cases in deploys, "where does this config live"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;3-6 hours/week&lt;/td&gt;
&lt;td&gt;Settling in; mostly questions that surface only during real incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sum it: roughly 80 to 110 hours per engineer over 8 weeks lost to onboarding friction (the table's weekly ranges add up to 79 to 112). At a fully loaded $80/hour, that's about $6,300 to $9,000 per engineer in lost time, paid every time you hire.&lt;/p&gt;

&lt;p&gt;The cost on the senior engineer side is rarely measured. Each new engineer fires roughly 5-15 "how do I" messages per week into engineering Slack channels. Senior engineers context-switch to answer, draft a half-page response, sometimes screen-share for 20 minutes. The aggregate is 4-7 hours per senior engineer per week while new hires are onboarding. With 3-4 new engineers in the first month, the senior who happens to know the answers loses at least a half-day per week to answering them.&lt;/p&gt;

&lt;p&gt;This cost compounds. Year-over-year, the team grows. Each new hire fires the same questions because the answers live in senior engineers' heads, in Slack history, in three different wikis with conflicting versions, in a Confluence page somebody updated 18 months ago. The org pays the same onboarding tax for every new hire, plus the senior engineer time, plus the slow drift as conventions change and old answers become wrong.&lt;/p&gt;

&lt;p&gt;Nobody dashboards this because nobody wants to. The engineers paying the cost don't want to flag it (they want to look productive). The senior engineers paying the cost don't want to flag it (they want to look helpful). The org chart doesn't include "onboarding friction" as a category. It's a tax that gets paid in invisible time and shows up as "engineering velocity feels slow" without a clear line item.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an IDP actually changes
&lt;/h2&gt;

&lt;p&gt;A working IDP — Backstage, Port, Cortex, or a homegrown equivalent — collapses the same 8-week onboarding into 4 weeks and the 14 hours per week of friction work into 30 minutes per week of catalog lookups.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Activity&lt;/th&gt;
&lt;th&gt;Pre-IDP&lt;/th&gt;
&lt;th&gt;Post-IDP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Get AWS account access&lt;/td&gt;
&lt;td&gt;2-3 days, 4 Slack threads&lt;/td&gt;
&lt;td&gt;30 min: self-service via IDP request flow with auto-approval rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Find the deployment runbook&lt;/td&gt;
&lt;td&gt;1-2 days, 5 Slack threads&lt;/td&gt;
&lt;td&gt;5 min: deployment golden path is the IDP front page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Set up observability for new service&lt;/td&gt;
&lt;td&gt;4-6 hours, 2 senior engineers consulted&lt;/td&gt;
&lt;td&gt;20 min: template generates the right Datadog/Grafana hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add on-call rotation membership&lt;/td&gt;
&lt;td&gt;1-2 days, 3 Slack threads, often blocked on PagerDuty admin&lt;/td&gt;
&lt;td&gt;15 min: self-service via IDP rotation manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get secret manager access for service X&lt;/td&gt;
&lt;td&gt;2-3 days, 2 Slack threads, requires security team approval&lt;/td&gt;
&lt;td&gt;30 min: IDP routes the request with the right context, security approves in batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create a new service from scratch&lt;/td&gt;
&lt;td&gt;1-2 weeks, learn CI/CD conventions ad hoc&lt;/td&gt;
&lt;td&gt;2 hours: template scaffolds repo + CI + observability + secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The activities don't disappear. The engineer still needs AWS access, still needs to find the deployment runbook, still needs to set up observability. The time per activity collapses because the IDP makes the answer findable and the action self-service. What used to take a week of Slack-thread-driven discovery takes 30 minutes of catalog navigation.&lt;/p&gt;

&lt;p&gt;The senior engineer side is what makes the math work. The Slack questions don't go to a senior; they go to the IDP. The IDP answers about 70 percent of them via templates, runbooks, and self-service flows. The remaining 30 percent, the genuinely novel questions, still go to senior engineers, so the volume of questions reaching them drops by 60-80 percent. Senior engineers get their week back; new engineers stop blocking on senior availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deployment golden path is the keystone
&lt;/h2&gt;

&lt;p&gt;Six months of IDP work is enough budget to ship deployment + observability + on-call + secrets, in that order. The order matters more than the budget. Deployment first is the keystone; the rest only get traction once engineers trust the IDP to handle deployment correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz4jj7xpq9ppf7alc9o9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz4jj7xpq9ppf7alc9o9.png" alt="diagram" width="800" height="1264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why deployment first: an engineer can survive bad observability for a week (the service runs, you'll fix the dashboards later). They cannot ship code without a clear deployment path. If the deployment golden path lives in the IDP and works, the engineer's first IDP interaction is a positive one. They go to the IDP next time something else needs doing. The IDP earns trust through use.&lt;/p&gt;

&lt;p&gt;Trying to ship observability + secrets + deployment in parallel fails because there's no foothold for engineer trust. The engineer hits the IDP, finds a half-finished observability template that doesn't quite work, gives up, asks Slack. The IDP becomes the place engineers tried once and found broken. That perception is hard to recover from; better to ship one path well than three paths half-done.&lt;/p&gt;

&lt;p&gt;The order after deployment is less critical, but observability and on-call go together because they share a workflow (alerts wake an on-call engineer, who then consults the dashboards). Secret management can land last because it's a higher-friction problem (security review is involved) and engineers will tolerate the existing process longer. Environment provisioning is usually month 7-9 if scoped at all; many IDPs never ship it because it requires deeper cloud-account integration than the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Templates beat documentation
&lt;/h2&gt;

&lt;p&gt;The deepest mechanism in an IDP isn't the catalog or the docs. It's the template that generates the right thing instead of describing how to make the right thing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Drift&lt;/th&gt;
&lt;th&gt;Enforcement&lt;/th&gt;
&lt;th&gt;Time to first success&lt;/th&gt;
&lt;th&gt;Maintenance cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Documentation&lt;/td&gt;
&lt;td&gt;Drifts within weeks; nobody updates&lt;/td&gt;
&lt;td&gt;None — engineer can ignore&lt;/td&gt;
&lt;td&gt;2-6 hours of reading + iterating&lt;/td&gt;
&lt;td&gt;Low to write, high to keep accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Template (golden path)&lt;/td&gt;
&lt;td&gt;Doesn't drift; the template IS the convention&lt;/td&gt;
&lt;td&gt;Strong — produces the right output&lt;/td&gt;
&lt;td&gt;15-30 min from template run to working service&lt;/td&gt;
&lt;td&gt;Higher to write, near-zero to keep accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 5,000-word doc on "how to create a service" describes the right repo structure, the right CI config, the right observability hooks. The next engineer reads the doc, applies it imperfectly, ships a service that's 80 percent compliant with the conventions. Six months later that service has subtle differences from the canonical pattern. The doc gets updated by someone, the existing service doesn't. Drift sets in immediately.&lt;/p&gt;

&lt;p&gt;A template that runs in the IDP produces the same artifacts as the doc would describe. The repo is created with the right structure. The CI config is generated from the same source as the docs. The observability hooks are wired by the same template that wires them in every other service. The engineer's "create service" interaction is a form they fill in (service name, owner, language) and a button they click. Two minutes later the service exists, compliant by construction.&lt;/p&gt;

&lt;p&gt;The templates also enforce things docs can't. A doc can say "always tag your resources with &lt;code&gt;cost_center&lt;/code&gt;." A template adds the tag automatically. A doc can say "always emit the &lt;code&gt;request_id&lt;/code&gt; in logs." A template wires the logger to do it by default. The conventions move from "things engineers should remember" to "things the template does for them." Compliance ratios go from the typical 30-60 percent for documented conventions to 95-99 percent for template-enforced ones.&lt;/p&gt;
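
&lt;p&gt;The difference between a documented convention and a template-enforced one can be shown with a toy scaffold. This is a sketch only and not how Backstage or Port templates are written; &lt;code&gt;scaffold_service&lt;/code&gt;, its output keys, and passing &lt;code&gt;cost_center&lt;/code&gt; as a form field are all assumptions.&lt;/p&gt;

```python
def scaffold_service(name: str, owner: str, language: str, cost_center: str) -> dict:
    """Return the generated config for a new service with conventions baked in.

    The caller fills in the form fields; the tag and log conventions are
    added by the template, not remembered by the engineer.
    """
    return {
        "repo": name,
        "owner": owner,
        "language": language,
        # Convention 1: every resource carries a cost_center tag.
        "resource_tags": {"service": name, "cost_center": cost_center},
        # Convention 2: every log line includes request_id.
        "log_fields": ["timestamp", "level", "request_id", "message"],
    }


service = scaffold_service("billing-api", "payments-team", "go", "cc-1234")
print(service["resource_tags"], service["log_fields"])
```

&lt;p&gt;Because the tag and the log field are emitted by the function itself, a service created this way cannot omit them, which is the mechanism behind the higher compliance ratio for template-enforced conventions.&lt;/p&gt;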

&lt;p&gt;The work to write a template is roughly 3-5x the work to write the equivalent doc. The maintenance cost is the inverse: docs need constant updating to stay accurate; templates only update when the underlying convention changes. Over a 2-year horizon, templates are cheaper than docs even before counting the engineer-time savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 6-month investment shape
&lt;/h2&gt;

&lt;p&gt;The typical IDP rollout for a 100-engineer org consumes roughly one platform engineer's full quarter for the deployment golden path, then a half-time engagement for the next quarter as the other paths land. Total platform investment is therefore roughly 1.5 engineer-quarters, plus rotating part-time involvement from two service teams whose flows the IDP is encoding.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Month&lt;/th&gt;
&lt;th&gt;Deliverable&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;th&gt;Dependency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;IDP catalog up; service inventory imported; access flows wired&lt;/td&gt;
&lt;td&gt;Platform engineer (full-time)&lt;/td&gt;
&lt;td&gt;Backstage/Port instance + GitHub integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Deployment golden path: template + runbook for new service&lt;/td&gt;
&lt;td&gt;Platform engineer + 1 service team part-time&lt;/td&gt;
&lt;td&gt;Catalog + CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Deployment golden path: rollout to 5 services as pilots&lt;/td&gt;
&lt;td&gt;Platform engineer + 5 pilot teams&lt;/td&gt;
&lt;td&gt;Working template from month 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Observability golden path: template for Datadog/Grafana hooks&lt;/td&gt;
&lt;td&gt;Platform engineer + observability team&lt;/td&gt;
&lt;td&gt;Deployment template establishes pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;On-call golden path: PagerDuty + runbook integration&lt;/td&gt;
&lt;td&gt;Platform engineer + SRE team&lt;/td&gt;
&lt;td&gt;Observability template for alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Secrets golden path: routing through Vault/AWS Secrets Manager&lt;/td&gt;
&lt;td&gt;Platform engineer + security team&lt;/td&gt;
&lt;td&gt;Trust established from prior 5 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Months 7-9 add environment provisioning if scoped. Most orgs don't get to it in the first year because the cloud-account integration is the deepest piece of work and the prior paths produce most of the time savings.&lt;/p&gt;

&lt;p&gt;The platform engineer in months 1-3 is mostly heads-down on the deployment path. Months 4-6 the engineer becomes more of a coordinator, working with the observability/SRE/security teams who own the underlying systems. The IDP is the integration layer; it doesn't replace the underlying tools.&lt;/p&gt;

&lt;p&gt;The pilot pattern in month 3 is critical. Five services going through the deployment template surfaces every edge case the template missed. Fix the edge cases, then roll out broadly in month 4. Skipping the pilot and rolling broadly in month 3 means the broad rollout hits all the edge cases at once, the template gets blamed, and the IDP loses trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dollar math: $400k recovered, $300k invested
&lt;/h2&gt;

&lt;p&gt;The math is straightforward but politically uncomfortable, because it requires putting a number on engineer time that nobody usually quantifies.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineers in onboarding overlap (avg)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Includes new hires + recent transfers within 8 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hours/week recovered per onboarding engineer&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;From 14 hrs/wk of friction to 30 min/wk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Senior engineer hours/week recovered&lt;/td&gt;
&lt;td&gt;4 per senior × 5 affected seniors = 20&lt;/td&gt;
&lt;td&gt;Less context-switching to answer questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total hours/week recovered&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;td&gt;(12 × 14) + 20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fully loaded hourly cost&lt;/td&gt;
&lt;td&gt;$80&lt;/td&gt;
&lt;td&gt;Median for senior engineers in cloud infra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annualized recovered value&lt;/td&gt;
&lt;td&gt;$782k&lt;/td&gt;
&lt;td&gt;188 × $80 × 52 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adjustment for non-100% onboarding overlap&lt;/td&gt;
&lt;td&gt;× 0.55&lt;/td&gt;
&lt;td&gt;Onboarding overlap isn't always 12 engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Realistic recovered value&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$430k/year&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conservative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDP investment year 1 (platform eng + tooling)&lt;/td&gt;
&lt;td&gt;$250-350k&lt;/td&gt;
&lt;td&gt;One platform engineer + Backstage hosting + integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Net year-1 ROI&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+$80k to +$180k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Positive in year one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Year 2+ ROI&lt;/td&gt;
&lt;td&gt;+$350-400k/year&lt;/td&gt;
&lt;td&gt;Investment drops to ongoing maintenance ($80-120k/year)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The investment side is more concrete than the recovery side. One platform engineer at fully-loaded $200k for the year, plus $30-50k for Backstage hosting + integrations + tooling, plus part-time involvement from the service teams (call it $70-100k of allocated time across 6 months). Total $300-350k in year one, which is the upper part of the $250-350k range in the table.&lt;/p&gt;

&lt;p&gt;The recovery side has the most uncertainty around the "onboarding overlap" number. A 100-engineer org with 20 percent annual hiring has roughly 20 hires per year; with a 4-week to 8-week onboarding window, that is about 20 × 4/52 ≈ 1.5 to 20 × 8/52 ≈ 3 new hires in friction-mode at any given time, before counting internal transfers. The 12 number assumes a much higher hiring rate or many more transfers; adjust accordingly. The dollar value scales roughly linearly with the overlap.&lt;/p&gt;
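
&lt;p&gt;The table's arithmetic can be rerun with your own overlap number. The sketch below is only an illustration: the inputs are copied from the table, while the function and parameter names are assumptions made up for this example.&lt;/p&gt;

```python
# Inputs copied from the table; names are illustrative assumptions.
HOURLY_COST_USD = 80
WEEKS_PER_YEAR = 52
OVERLAP_ADJUSTMENT_PCT = 55  # the table's "x 0.55" adjustment


def recovered_value_usd(onboarding_engineers, hours_per_onboarder=14,
                        seniors=5, hours_per_senior=4):
    """Return (hours_per_week, annualized_usd, adjusted_usd)."""
    hours_per_week = (onboarding_engineers * hours_per_onboarder
                      + seniors * hours_per_senior)
    annualized = hours_per_week * HOURLY_COST_USD * WEEKS_PER_YEAR
    # Integer math keeps the result exact in whole dollars.
    adjusted = annualized * OVERLAP_ADJUSTMENT_PCT // 100
    return hours_per_week, annualized, adjusted


def net_year_one_usd(adjusted_usd, investment_low=250_000,
                     investment_high=350_000):
    """Return (worst case, best case) net value for year one."""
    return adjusted_usd - investment_high, adjusted_usd - investment_low


hours, annualized, adjusted = recovered_value_usd(onboarding_engineers=12)
print(hours, annualized, adjusted)  # 188 782080 430144
print(net_year_one_usd(adjusted))   # (80144, 180144)
```

&lt;p&gt;Calling &lt;code&gt;recovered_value_usd(onboarding_engineers=6)&lt;/code&gt; is a quick way to see how much of the case depends on the overlap assumption.&lt;/p&gt;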

&lt;p&gt;The argument that lands better than "save $400k" is "recover one half-engineer of capacity per onboarding." Engineering leaders intuitively understand "we get our senior engineers' Mondays back" better than they understand annualized dollar projections.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to know it worked: the 'how do I' Slack metric
&lt;/h2&gt;

&lt;p&gt;The single instrumentation that tells you the IDP is working is the count of "how do I" messages in engineering Slack channels.&lt;/p&gt;

&lt;p&gt;Pre-IDP, a typical 100-engineer org sees 50-100 such messages per week in #engineering, #infrastructure, #platform-help, and similar channels. Each one is a question that should have an answer in the IDP but doesn't, or that's in the IDP but the asker didn't find it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2aj6pjz2o3k3l7jgj22n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2aj6pjz2o3k3l7jgj22n.png" alt="diagram" width="800" height="931"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Post-IDP (month 6 onward), the same channels see 10-20 such messages per week. The 60-80 percent drop is the most reliable signal of golden-path adoption. It can be measured from channel history alone; no instrumentation is needed beyond a regex grep over the exported messages.&lt;/p&gt;
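
&lt;p&gt;A minimal sketch of that regex grep, assuming the channel history has been exported as a list of messages with a &lt;code&gt;ts&lt;/code&gt; field (epoch seconds as a string) and a &lt;code&gt;text&lt;/code&gt; field. The pattern and the function name are assumptions for illustration; tune the pattern to how your engineers actually phrase questions.&lt;/p&gt;

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed pattern: "how do I", "how can we", "how should I", and so on.
HOW_DO_I = re.compile(r"\bhow (do|can|should) (i|we)\b", re.IGNORECASE)


def weekly_how_do_i_counts(messages):
    """Count matching messages per (ISO year, ISO week)."""
    counts = Counter()
    for message in messages:
        if not HOW_DO_I.search(message.get("text", "")):
            continue
        sent = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
        iso = sent.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


sample = [
    {"ts": "1747008000.000100", "text": "How do I rotate a secret?"},
    {"ts": "1747094400.000200", "text": "Quick one: how do I add a dashboard?"},
    {"ts": "1747094460.000300", "text": "deploy finished"},
    {"ts": "1747612800.000400", "text": "how can we get prod access"},
]
for week, count in sorted(weekly_how_do_i_counts(sample).items()):
    print(week, count)
```

&lt;p&gt;Plotting the weekly counts before and after rollout gives the trend line; the messages that still match are the backlog the platform team triages.&lt;/p&gt;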

&lt;p&gt;The platform team uses the remaining 10-20 messages as the prioritization signal for IDP improvements. Each unanswered question is either a gap in the IDP (add a template or catalog entry) or a discoverability problem (add a search hint or restructure the catalog). The metric drives the work; the work drives the metric down further.&lt;/p&gt;

&lt;p&gt;The pattern that fails is letting the IDP roll out without instrumenting. Six months in, the platform team thinks the IDP is great because they built it. The actual signal of success is "are engineers using it instead of asking Slack." Without the Slack metric, the platform team optimizes for things engineers don't actually need; with it, the platform team's roadmap is driven by real friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens if you don't build it
&lt;/h2&gt;

&lt;p&gt;The opportunity cost of not building an IDP is bounded but real. A growing engineering org without an IDP eventually hires a "developer experience" team to do ad hoc the work an IDP does at scale.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Without IDP&lt;/th&gt;
&lt;th&gt;With IDP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Onboarding takes 8 weeks; senior engineers spend 5 hrs/wk answering questions; 1 ad-hoc DX engineer hired&lt;/td&gt;
&lt;td&gt;IDP investment + deployment golden path; 2 weeks shorter onboarding by year-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;DX team grows to 3 engineers maintaining scripts, runbooks, on-call docs ad hoc&lt;/td&gt;
&lt;td&gt;IDP team is 1 platform engineer maintaining + extending; observability/secrets paths added&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;DX team is 5 engineers; "how do we deploy" still requires asking; same onboarding tax as year 1&lt;/td&gt;
&lt;td&gt;IDP team is 1-2 engineers; onboarding is 4 weeks; senior engineer time recovered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DX team is 6 engineers; documentation has drifted again; new attempts to "fix it" begin&lt;/td&gt;
&lt;td&gt;IDP is the canonical surface; new conventions land as templates; org grows without proportional friction growth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DX team isn't a wasted investment — those 5 engineers are doing real work. The work is just less leveraged because it's documentation + scripts + ad-hoc processes instead of templates + catalog + self-service flows. Documentation drifts; templates don't. Scripts get forked; templates get versioned. Ad-hoc processes get replicated badly; self-service flows enforce consistency.&lt;/p&gt;

&lt;p&gt;Year 3 is where the divergence becomes obvious. The IDP team is one or two engineers extending the platform; the DX team is five engineers reinventing the same flows for each new service. The IDP org's onboarding tax has stayed flat; the no-IDP org's onboarding tax has grown linearly with team size. Hiring more DX engineers doesn't fix the structural problem; it scales it.&lt;/p&gt;

&lt;p&gt;The Backstage / Port / Cortex investment isn't free, and the six-month rollout is real work. But the alternative is paying the same cost as a recurring tax for as long as the org grows, and watching the senior engineers who could be building the next thing instead spend their Mondays answering "how do I." The 14 hours per week per onboarding engineer is the visible cost; the senior engineer time is the hidden one. The IDP recovers both, and by the dollar math above the investment pays back within the first year.&lt;/p&gt;

</description>
      <category>golden</category>
      <category>path</category>
      <category>tax</category>
      <category>hours</category>
    </item>
    <item>
      <title>Pod Scheduling for the Frugal: How We Cut EKS Node Cost 31% Without Touching a Workload</title>
      <dc:creator>Muskan </dc:creator>
      <pubDate>Fri, 15 May 2026 06:49:54 +0000</pubDate>
      <link>https://dev.to/muskan_8abedcc7e12/pod-scheduling-for-the-frugal-how-we-cut-eks-node-cost-31-without-touching-a-workload-3ao7</link>
      <guid>https://dev.to/muskan_8abedcc7e12/pod-scheduling-for-the-frugal-how-we-cut-eks-node-cost-31-without-touching-a-workload-3ao7</guid>
      <description>&lt;p&gt;A right-sized EKS cluster should not run at 40 percent node utilization. The pods declare requests that sum to 78 percent of node capacity. The cluster autoscaler provisions nodes to fit those requests. The bill goes to finance based on the nodes provisioned. And then the actual utilization metrics show 40 percent. The gap between 78 percent of node capacity claimed and 40 percent actually used is bin-packing inefficiency, and it survives any amount of right-sizing.&lt;/p&gt;

&lt;p&gt;The pattern that fixes the gap is three scheduling-side changes that don't touch any workload. Switch the scheduler scoring from default to MostAllocated. Enable Karpenter's consolidation feature. Add a three-tier priority class so batch workloads can be evicted when high-priority pods need capacity. The combined effect on a typical EKS cluster is 25 to 35 percent reduction in node cost without any pod changing its resource requests.&lt;/p&gt;

&lt;p&gt;This piece complements &lt;a href="https://zop.dev/resources/blogs/right-sizing-vs-auto-scaling-which-saves-more-on-eks/" rel="noopener noreferrer"&gt;right-sizing vs auto-scaling&lt;/a&gt; but starts from the opposite end. Right-sizing argues with every team about their CPU and memory requests. Scheduling improvements just change where pods land. The political cost is much lower, each lever fits in one to three sprints, and right-sizing becomes more effective afterward because the bin-packing baseline is healthier.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 40% utilization gap on EKS clusters that already right-sized
&lt;/h2&gt;

&lt;p&gt;Look at any EKS cluster that's already done a right-sizing pass. The numbers will be roughly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Typical value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod CPU requests / node CPU capacity&lt;/td&gt;
&lt;td&gt;75-82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod memory requests / node memory capacity&lt;/td&gt;
&lt;td&gt;70-78%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual node CPU utilization (averaged)&lt;/td&gt;
&lt;td&gt;35-45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual node memory utilization (averaged)&lt;/td&gt;
&lt;td&gt;38-50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster Autoscaler / Karpenter target utilization&lt;/td&gt;
&lt;td&gt;80%+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two rows say "the cluster is well-packed in theory." The next two say "the cluster is half-empty in practice." The last row says "the autoscaler thinks it's running tight."&lt;/p&gt;

&lt;p&gt;The gap is real, not a measurement artifact. Pod resource requests are declarations of what the pod might use; actual utilization is what the pod uses on average. The scheduler reserves the requested capacity even when the pod uses less. A node with five pods declaring 2 CPU each (10 CPU total reserved) but using 1 CPU each on average (5 CPU actual) is at 100 percent reserved and 50 percent utilized. The autoscaler sees the reservation, not the use.&lt;/p&gt;
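
&lt;p&gt;The five-pod example works out as follows. This is a toy calculation with a made-up helper name, not a query against real metrics.&lt;/p&gt;

```python
def node_ratios(node_cpu, pod_requests, pod_usage):
    """Return (reserved_fraction, utilized_fraction) for one node."""
    return sum(pod_requests) / node_cpu, sum(pod_usage) / node_cpu


# Five pods on a 10-CPU node: each requests 2 CPU and uses 1 on average.
reserved, utilized = node_ratios(
    node_cpu=10, pod_requests=[2] * 5, pod_usage=[1] * 5
)
print(f"reserved {reserved:.0%}, utilized {utilized:.0%}")
# reserved 100%, utilized 50%
```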

&lt;p&gt;This is by design: the alternative (oversubscribing based on actual use) breaks under burst, which is why reservation-based scheduling remains the standard primitive. The fix isn't to change how the scheduler treats requests. The fix is to make the scheduler pack requests more efficiently and to let Karpenter consolidate the resulting headroom into fewer nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lever 1: switch to MostAllocated scoring
&lt;/h2&gt;

&lt;p&gt;The default Kubernetes scheduler optimizes for predictable, evenly-spread placement. The default scoring strategy is &lt;code&gt;LeastAllocated&lt;/code&gt;, which prefers nodes with more free capacity. The reasoning is fault tolerance: spread pods across nodes so a single node failure has bounded blast radius. This is the right default if you're not paying for the nodes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MostAllocated&lt;/code&gt; is the opposite strategy: prefer nodes with less free capacity, packing pods tightly. The scoring is opt-in and rarely enabled. It's documented but has no auto-enablement signal: nothing in the cluster tells you "you'd save money if you flipped this."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9mi15hkom7ddq4iaejb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9mi15hkom7ddq4iaejb.png" alt="diagram" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same workload, two scoring strategies, two outcomes. LeastAllocated produces 6 nodes at 50 percent each (50 percent waste). MostAllocated produces 4 nodes at 75 percent each (25 percent waste, 33 percent fewer nodes).&lt;/p&gt;
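
&lt;p&gt;To make the two strategies concrete, here is a toy simulation. It scores a single resource (CPU) on a fixed pool of nodes and ignores every other scoring plugin, so it is a simplification of what kube-scheduler does, and its numbers differ from the diagram. The function names are assumptions for this example.&lt;/p&gt;

```python
def score(requested_after, capacity, strategy):
    """Score a node 0-100 for one resource, counting the incoming pod."""
    if strategy == "MostAllocated":
        return requested_after * 100 / capacity
    return (capacity - requested_after) * 100 / capacity  # LeastAllocated


def place(pods, node_count, capacity, strategy):
    """Place pods one by one; return the CPU requested on each node."""
    requested = [0] * node_count
    for pod in pods:
        # Filter step: keep only nodes with room for this pod.
        fits = [i for i in range(node_count)
                if capacity - requested[i] >= pod]
        # Score step: the highest score wins, ties go to the lowest index.
        best = max(fits,
                   key=lambda i: score(requested[i] + pod, capacity, strategy))
        requested[best] += pod
    return requested


pods = [1] * 12  # twelve pods requesting 1 CPU each, on 4-CPU nodes
print(place(pods, node_count=6, capacity=4, strategy="LeastAllocated"))
# [2, 2, 2, 2, 2, 2]
print(place(pods, node_count=6, capacity=4, strategy="MostAllocated"))
# [4, 4, 4, 0, 0, 0]
```

&lt;p&gt;In the second run three nodes carry no pods at all, which is exactly the headroom an autoscaler can remove. It also shows the risk: nothing in the scoring itself keeps replicas apart.&lt;/p&gt;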

&lt;p&gt;The configuration is one block in the scheduler config:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default value&lt;/th&gt;
&lt;th&gt;New value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;KubeSchedulerProfiles[0].plugins.score.disabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(none)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;NodeResourcesFit&lt;/code&gt; (the default scorer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;KubeSchedulerProfiles[0].plugins.score.enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(default)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;NodeResourcesFit&lt;/code&gt; with &lt;code&gt;scoringStrategy.type: MostAllocated&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
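
&lt;p&gt;As a hedged sketch of what that block can look like: in the upstream &lt;code&gt;KubeSchedulerConfiguration&lt;/code&gt; API, the scoring strategy is set through &lt;code&gt;pluginConfig&lt;/code&gt; arguments for &lt;code&gt;NodeResourcesFit&lt;/code&gt;. The example builds the document as a Python dict and prints it as JSON, which YAML parsers generally accept. The scheduler name is a hypothetical placeholder, and the equal CPU and memory weights are an assumption.&lt;/p&gt;

```python
import json

# "bin-packing-scheduler" is a hypothetical name, not a built-in one.
scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        {
            "schedulerName": "bin-packing-scheduler",
            "pluginConfig": [
                {
                    "name": "NodeResourcesFit",
                    "args": {
                        "scoringStrategy": {
                            "type": "MostAllocated",
                            # Assumed weights: treat CPU and memory equally.
                            "resources": [
                                {"name": "cpu", "weight": 1},
                                {"name": "memory", "weight": 1},
                            ],
                        }
                    },
                }
            ],
        }
    ],
}

print(json.dumps(scheduler_config, indent=2))
```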

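&lt;p&gt;On a managed EKS cluster, the configuration of the control plane's built-in kube-scheduler is not something you can edit directly, so in practice this lands as a ConfigMap for a second kube-scheduler deployment that runs inside the cluster, with pods pointed at it through &lt;code&gt;schedulerName&lt;/code&gt; (for example, set once in a shared template or by an admission policy rather than by each team). The change applies to new pod placements; existing pods stay where they are until they restart for other reasons. The transition over a week is gentle: as pods naturally restart (deployments, image updates, node maintenance), the cluster gradually packs tighter.&lt;/p&gt;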

&lt;p&gt;The catch is that MostAllocated without &lt;code&gt;PodTopologySpread&lt;/code&gt; constraints is dangerous. Left to its own devices, the scheduler will happily put all five replicas of a deployment on one node — maximum density, zero fault tolerance. Topology spread is the corrective. We get to that section in a moment.&lt;/p&gt;

&lt;p&gt;The expected outcome on a fleet that previously ran 40 percent utilization: utilization rises to 55-65 percent over the first two weeks. The autoscaler notices fewer nodes are needed and provisions less. The bill drops 12-18 percent depending on workload composition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lever 2: enable Karpenter consolidation
&lt;/h2&gt;

&lt;p&gt;Karpenter provisions nodes to fit incoming pods. Without consolidation, once a node exists, it stays. If pods leave (deployment scale-down, batch job completion), the node lingers under-utilized until it is completely empty or old enough for Karpenter's expiration rule (&lt;code&gt;expireAfter&lt;/code&gt;) to replace it.&lt;/p&gt;

&lt;p&gt;Consolidation is the active counterpart. Karpenter periodically evaluates the existing fleet, asks "could I run all these pods on fewer or smaller nodes," and re-provisions if yes. The evaluation runs hourly by default. Pods get gracefully evicted from the old nodes, the new (smaller or fewer) nodes get spun up, the old nodes terminate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feon4r2mspiwckejg3ukm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feon4r2mspiwckejg3ukm.png" alt="diagram" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

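&lt;p&gt;Six m5.xlarge nodes (24 vCPU in total) at 30-40 percent become two m5.2xlarge nodes (16 vCPU in total) at roughly 45-60 percent. Same pods, same requests. The node bill drops by about a third: an m5.2xlarge is twice the size and roughly twice the price of an m5.xlarge, so the two new nodes cost about as much as four of the original six.&lt;/p&gt;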

&lt;p&gt;The Karpenter NodePool config to enable it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;New&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disruption.consolidationPolicy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WhenUnderutilized&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WhenUnderutilized&lt;/code&gt; (already enabled in recent versions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disruption.consolidateAfter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;unset&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;30s&lt;/code&gt; to &lt;code&gt;1m&lt;/code&gt; (acts on transient under-utilization too)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disruption.expireAfter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;720h&lt;/code&gt; (30 days)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;168h&lt;/code&gt; (7 days) — forces fleet refresh&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

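&lt;p&gt;The consolidation policy &lt;code&gt;WhenUnderutilized&lt;/code&gt; is what does the work. The &lt;code&gt;consolidateAfter&lt;/code&gt; knob controls how aggressive the re-evaluation is; shorter values catch transient under-utilization (a deployment that just scaled down) faster. The &lt;code&gt;expireAfter&lt;/code&gt; change is secondary but useful: shorter expiration forces the fleet to refresh more often, which catches drift between Karpenter's view and reality. Field names depend on the Karpenter API version: in the v1 NodePool API the policy is called &lt;code&gt;WhenEmptyOrUnderutilized&lt;/code&gt; and &lt;code&gt;expireAfter&lt;/code&gt; sits under &lt;code&gt;spec.template.spec&lt;/code&gt;, so check the table against the version you run.&lt;/p&gt;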

&lt;p&gt;The catch with consolidation is that the node types it picks need to be a bounded set. If the NodePool allows 30 instance families, consolidation produces fragmentation: some pods on m5, some on c5, some on r5, none of the families used densely enough to consolidate further. The prerequisite work is pruning to 3-5 high-utility instance families that cover the typical pod resource shapes. Most clusters land on m6i for general purpose, c6i for CPU-bound, r6i for memory-bound, with one or two GPU types for ML workloads.&lt;/p&gt;
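
&lt;p&gt;Put together, the disruption settings and the pruned instance families might look like the sketch below. It assumes the Karpenter v1 NodePool API, where the policy is named &lt;code&gt;WhenEmptyOrUnderutilized&lt;/code&gt; and &lt;code&gt;expireAfter&lt;/code&gt; lives under &lt;code&gt;spec.template.spec&lt;/code&gt;; older API versions use different names and placement. The NodePool name and the &lt;code&gt;default&lt;/code&gt; EC2NodeClass reference are hypothetical. The manifest is built as a Python dict and printed as JSON.&lt;/p&gt;

```python
import json

# Assumes the karpenter.sh/v1 API; verify field names for your version.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general"},  # hypothetical name
    "spec": {
        "template": {
            "spec": {
                # Hypothetical EC2NodeClass called "default".
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "expireAfter": "168h",  # 7 days, forces fleet refresh
                "requirements": [
                    {
                        # Bounded set of families, as described above.
                        "key": "karpenter.k8s.aws/instance-family",
                        "operator": "In",
                        "values": ["m6i", "c6i", "r6i"],
                    }
                ],
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "1m",
        },
    },
}

print(json.dumps(node_pool, indent=2))
```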

&lt;p&gt;The expected outcome on a typical EKS fleet: 15-25 percent fewer nodes after the first week of consolidation passes. The first day shows the biggest drop (consolidation catches all the historical under-utilization at once); subsequent days are incremental as new under-utilization gets caught.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lever 3: priority + preemption for batch workloads
&lt;/h2&gt;

&lt;p&gt;The third lever is the one most teams skip. Kubernetes supports pod priority classes and preemption: high-priority pods can evict low-priority pods when capacity is contended, instead of triggering a node-up.&lt;/p&gt;

&lt;p&gt;Most clusters end up with three priority classes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Priority &lt;code&gt;value&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Workloads&lt;/th&gt;
&lt;th&gt;Preemption behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;critical&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1_000_000&lt;/td&gt;
&lt;td&gt;Customer-facing services, control-plane components&lt;/td&gt;
&lt;td&gt;Cannot be preempted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;standard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;500_000&lt;/td&gt;
&lt;td&gt;Internal services, default for everything else&lt;/td&gt;
&lt;td&gt;Preempts batch only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100_000&lt;/td&gt;
&lt;td&gt;Periodic jobs, ML training, data pipelines&lt;/td&gt;
&lt;td&gt;Preempted by everything else&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The priorityClassName field on the pod spec assigns the class. New deployments get the appropriate class via templates; existing deployments get tagged in a one-time PR. Critical workloads are usually a small set (under 20 percent of pods). Batch workloads are usually larger than people expect (often 30-40 percent of pods, mostly invisible: cronjobs, data pipelines, build runners).&lt;/p&gt;
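
&lt;p&gt;A sketch of the three classes as &lt;code&gt;PriorityClass&lt;/code&gt; objects, built as Python dicts and printed as JSON. The names and values come from the table; marking &lt;code&gt;standard&lt;/code&gt; as &lt;code&gt;globalDefault&lt;/code&gt; is an assumption that follows from "default for everything else", and the helper name and descriptions are made up for this example.&lt;/p&gt;

```python
import json


def priority_class(name, value, description, global_default=False):
    """Build a scheduling.k8s.io/v1 PriorityClass manifest as a dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": global_default,
        "description": description,
    }


priority_classes = [
    priority_class("critical", 1_000_000, "Customer-facing services"),
    # Assumption: pods without a priorityClassName fall into "standard".
    priority_class("standard", 500_000, "Internal services",
                   global_default=True),
    priority_class("batch", 100_000, "Periodic jobs, ML training, pipelines"),
]

# The per-workload change is one field on the pod spec.
batch_pod_spec_fragment = {"priorityClassName": "batch"}

print(json.dumps(priority_classes + [batch_pod_spec_fragment], indent=2))
```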

&lt;p&gt;The preemption behavior is what creates the savings. When the scheduler can't fit a standard or critical pod, it looks for batch pods to evict instead of triggering Karpenter to provision a new node. The batch pod gets evicted (re-queued for later), the standard pod takes its slot, and the cluster doesn't grow. The batch work runs slower but completes; the cluster runs leaner.&lt;/p&gt;

&lt;p&gt;The political work is agreeing on which workloads are evictable. Engineers tend to mark their work as critical by default. The agreement requires a clear definition: "critical means customer-facing degradation if evicted." Most internal infrastructure (monitoring, logging, build runners, batch ETL) is not critical by that definition. The clarification is the political work; the technical implementation is one yaml field per pod.&lt;/p&gt;

&lt;p&gt;Preemption only adds savings when the cluster has enough batch workloads to absorb the eviction pressure. Clusters that are pure web-tier with no batch see less benefit (typically 2-3 percent). Clusters with ML training or large data pipelines see more (typically 8-12 percent).&lt;/p&gt;

&lt;h2&gt;
  
  
  PodTopologySpread is non-negotiable
&lt;/h2&gt;

&lt;p&gt;MostAllocated without topology constraints will pack all replicas of a deployment onto one node. A node failure takes down the entire deployment. This is a real production incident, not a theoretical concern.&lt;/p&gt;

&lt;p&gt;The fix is &lt;code&gt;PodTopologySpread&lt;/code&gt; constraints (the &lt;code&gt;topologySpreadConstraints&lt;/code&gt; field on the pod spec) on every deployment that has fault tolerance requirements. The fields in the yaml block:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topologyKey&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spread across AZs first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxSkew&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At most 1 pod imbalance between AZs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;whenUnsatisfiable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ScheduleAnyway&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Soft constraint; better packed than crashed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second constraint &lt;code&gt;topologyKey&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes.io/hostname&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Then spread across nodes within AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second constraint &lt;code&gt;maxSkew&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At most 2 pod imbalance between nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two-constraint pattern says "spread across AZs first (for resilience), then across nodes within an AZ (for further fault isolation), but never refuse to schedule because of either." Both constraints are soft. &lt;code&gt;ScheduleAnyway&lt;/code&gt; is what makes the pattern compatible with MostAllocated: the scheduler still prefers placements that keep the skew low, but when no such placement exists it packs the pod anyway instead of leaving it pending.&lt;/p&gt;
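
&lt;p&gt;The table translates into a pod spec fragment along these lines, shown as a Python dict printed as JSON. The &lt;code&gt;app: checkout&lt;/code&gt; label is a hypothetical selector; use whatever labels identify the deployment's pods.&lt;/p&gt;

```python
import json

APP_LABELS = {"app": "checkout"}  # hypothetical pod labels

topology_spread_constraints = [
    {
        # First constraint: spread across availability zones.
        "maxSkew": 1,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": APP_LABELS},
    },
    {
        # Second constraint: spread across nodes, with looser skew.
        "maxSkew": 2,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": APP_LABELS},
    },
]

pod_spec_fragment = {"topologySpreadConstraints": topology_spread_constraints}
print(json.dumps(pod_spec_fragment, indent=2))
```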

&lt;p&gt;The cost of getting topology spread right is one yaml block per critical deployment. Tooling exists (Open Policy Agent, Kyverno) to enforce that deployments above a certain replica count have topology spread defined; we use a simple admission policy that warns on missing spread and blocks critical-priority deployments without it.&lt;/p&gt;
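
&lt;p&gt;The decision logic of such a policy is small. The sketch below expresses it as a plain Python function over a Deployment manifest dict; it is not OPA or Kyverno syntax, and the function name, the return values, and the two-replica threshold are assumptions for illustration.&lt;/p&gt;

```python
def check_topology_spread(deployment, min_replicas=2):
    """Return "allow", "warn", or "deny" for a Deployment manifest dict."""
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})

    if pod_spec.get("topologySpreadConstraints"):
        return "allow"
    # No spread defined: block critical workloads outright.
    if pod_spec.get("priorityClassName") == "critical":
        return "deny"
    # Otherwise warn once the deployment has enough replicas to spread.
    if spec.get("replicas", 1) >= min_replicas:
        return "warn"
    return "allow"


example = {
    "spec": {
        "replicas": 5,
        "template": {"spec": {"priorityClassName": "critical"}},
    }
}
print(check_topology_spread(example))  # deny
```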

&lt;p&gt;The trade-off this creates: a cluster running MostAllocated + topology spread will, in steady state, run at 60-70 percent utilization rather than the theoretical maximum of 85 percent. The 15-25 point gap is the cost of fault tolerance. Closing it further means giving up AZ resilience, which is not a finance decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 31% breakdown: 15 + 10 + 6
&lt;/h2&gt;

&lt;p&gt;The combined impact on a typical EKS cluster decomposes roughly as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lever&lt;/th&gt;
&lt;th&gt;Typical savings&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;What drives variation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MostAllocated scoring&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;12-18%&lt;/td&gt;
&lt;td&gt;Higher savings on clusters with many small pods (better bin-packing wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Karpenter consolidation&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;8-12%&lt;/td&gt;
&lt;td&gt;Higher savings on clusters with bursty deployments (more transient under-utilization to catch)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Priority preemption&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;td&gt;2-12%&lt;/td&gt;
&lt;td&gt;Higher savings on clusters with significant batch workload (more eviction-eligible pods)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Combined&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22-42%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Composition matters; effects are not exactly additive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 31 percent headline is the simple sum of the three typical values; the realized figure is usually slightly lower because the levers overlap. MostAllocated reduces under-utilization, which means consolidation has less work to do. Preemption reduces node-up events, which means consolidation sees a steadier fleet. The interactions are mild but real; planning around 25-35 percent total savings is more accurate than planning around exactly 31 percent.&lt;/p&gt;
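
&lt;p&gt;One simple way to model the overlap is to assume each lever removes its percentage from the cost left over by the previous one, instead of from the original bill. That compounding assumption is ours, made for illustration; real interactions between the levers are messier.&lt;/p&gt;

```python
def combined_savings(percent_savings):
    """Compound a list of percentage savings into one overall fraction."""
    remaining = 1.0
    for pct in percent_savings:
        remaining *= (100 - pct) / 100
    return 1 - remaining


typical = combined_savings([15, 10, 6])
print(round(typical * 100, 1))  # 28.1, versus 31 for the simple sum
```

&lt;p&gt;Under that assumption the typical case lands a few points below the headline, which is consistent with planning around a 25-35 percent range.&lt;/p&gt;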

&lt;p&gt;The exact mix depends on workload composition. A cluster that's 80 percent web traffic and 20 percent batch will see more value from MostAllocated and less from preemption. A cluster that's 50 percent ML training will see more from preemption (the training jobs are the ideal eviction targets) and less from MostAllocated (large pods don't bin-pack as well as small ones). The 15+10+6 split is the central tendency, not a guarantee.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why scheduling-first is more politically tractable than right-sizing-first
&lt;/h2&gt;

&lt;p&gt;Right-sizing argues with every team about their resource requests. The conversation is "your pod requests 4 CPU and uses 1.5; let's drop the request to 2." Each team pushes back because they remember the time the pod actually used 4 (during the incident two months ago, the deployment burst, the load test). Negotiating each one takes 30 to 60 minutes per service; a 200-service cluster is 100 to 200 hours of negotiation, which in practice eats a quarter of FinOps time.&lt;/p&gt;

&lt;p&gt;Scheduling changes don't argue with anyone. The pod requests stay the same. The pod runs the same code. The only thing that changes is which node the pod lands on (MostAllocated), which other nodes exist alongside it (consolidation), and what happens when capacity is tight (preemption). No engineer has to defend their resource requests because no resource requests are changing.&lt;/p&gt;

&lt;p&gt;This makes the sequencing matter. Doing scheduling first:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Political cost&lt;/th&gt;
&lt;th&gt;Savings unlocked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Enable MostAllocated + topology spread&lt;/td&gt;
&lt;td&gt;1-2 sprints&lt;/td&gt;
&lt;td&gt;Low (one config change, validated by SRE)&lt;/td&gt;
&lt;td&gt;12-18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enable Karpenter consolidation + prune node families&lt;/td&gt;
&lt;td&gt;1 sprint&lt;/td&gt;
&lt;td&gt;Low (Karpenter team's domain)&lt;/td&gt;
&lt;td&gt;8-12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Define priority classes + tag batch workloads&lt;/td&gt;
&lt;td&gt;2-3 sprints&lt;/td&gt;
&lt;td&gt;Medium (workload classification debate)&lt;/td&gt;
&lt;td&gt;2-12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right-size pod requests&lt;/td&gt;
&lt;td&gt;1-2 quarters&lt;/td&gt;
&lt;td&gt;High (per-service negotiation)&lt;/td&gt;
&lt;td&gt;another 15-25% on top&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By the time you get to right-sizing, the cluster's bin-packing is already healthy, so the right-sizing conversations land on a smaller per-service savings number. That's actually politically helpful: the engineers see "we already saved 31 percent without touching your pods, and now we're asking for the next 15 percent." The framing flips from "we're cutting your resources" to "we're tuning the last bit of headroom."&lt;/p&gt;

&lt;p&gt;The 31 percent number is replicable as a central estimate. The work fits in one to three sprints per lever, asks little of any team outside SRE beyond the priority-class classification, and doesn't change any pod's code or resource requests. It's the cheapest savings on the EKS bill and it shows up before the harder right-sizing fight even starts.&lt;/p&gt;

</description>
      <category>pod</category>
      <category>scheduling</category>
      <category>frugal</category>
      <category>cut</category>
    </item>
  </channel>
</rss>
