Right now, 30 to 40% of your cloud bill is being spent on servers your engineers aren't using. Not your production workload. Not data transfer. The culprit is the accumulated fog of dev, staging, QA, and "temporary" environments spun up over the last 18 months by seven different teams. None of them got a proper owner. None got tagged. And none got a shutdown schedule.
Call it the Non-Production Tax. For a 50-engineer company spending $50k/month on cloud, that's $15,000 to $20,000 a month disappearing into infrastructure that nobody is actively using.
This doesn't happen because engineers are careless. It happens because non-prod environments are architecturally designed to accumulate cost, and nobody has built a system to stop them.
The always-on fallacy
The single biggest driver of the Non-Prod Tax is a simple mismatch: your engineers work roughly 40 to 50 hours a week, but the environments they use run 168 hours a week.
Do the math. If your dev environment stays active through the night and over the weekend, you're paying for 118 hours of idle compute every week. That's roughly 70% of the bill going toward empty servers that nobody is touching. You wouldn't leave every light in your office on all weekend. But that's exactly what most teams do with cloud infrastructure.
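If you want to check the arithmetic, it fits in four lines. The 50-hour figure is the generous estimate of active use from above:

```python
# Back-of-the-envelope: what share of an always-on week is idle?
HOURS_PER_WEEK = 24 * 7                        # 168
working_hours = 50                             # generous estimate of active use
idle_hours = HOURS_PER_WEEK - working_hours    # 118
print(f"Idle share: {idle_hours / HOURS_PER_WEEK:.0%}")  # -> Idle share: 70%
```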
Where the money actually goes
Staging databases sized like production. The conventional wisdom is that staging needs to mirror production exactly. That sounds responsible until you look at what you're paying for. Running a high-availability, multi-replica database setup for a single QA engineer running smoke tests is like renting a 50-seat bus so one person can drive to the corner store. A test environment rarely needs to be as powerful as your live product: a single db.t3.medium runs about $60/month, while a db.r6g.2xlarge runs about $1,000/month. For a database that gets wiped weekly, the extra spend is hard to justify.
Backup storage that compounds forever. Every database in your test environments is probably set to take automatic daily backups, retained for 30 or 90 days, matching your production configuration. Nobody disabled it because disabling backups on a database feels like the kind of thing that gets you paged at 3am. So the setting stays on. The storage accumulates. Because each GB is a small charge, nobody notices until there are terabytes of test database backups that will never be restored. Some teams end up spending more to store backups of their test databases than they spend on their entire production database.
NAT Gateway charges on near-empty pipes. When AWS routes outbound traffic from your instances to the internet, it goes through a NAT Gateway. AWS charges you by the hour just for having one, and again per GB of data that passes through it. In production, that per-hour cost gets spread across high traffic volumes and becomes negligible. In a test environment where one engineer runs a job at 2pm, you're paying the same fixed hourly rate for a pipe that's nearly always idle. Eliminate per-environment NAT Gateways for test workloads and route them through a shared gateway instead. The savings are immediate.
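If you want evidence before consolidating, CloudWatch publishes per-gateway traffic metrics. Here's a minimal sketch that lists each gateway's outbound volume over the past week, assuming boto3 credentials are already configured; the seven-day window is an arbitrary choice:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)

# List each active NAT Gateway's outbound traffic over the last 7 days.
# Gateways near zero are candidates for consolidation into a shared one.
for nat in ec2.describe_nat_gateways(
    Filters=[{"Name": "state", "Values": ["available"]}]
)["NatGateways"]:
    stats = cw.get_metric_statistics(
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": nat["NatGatewayId"]}],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=7 * 24 * 3600,  # one datapoint covering the whole week
        Statistics=["Sum"],
    )
    total_gb = sum(p["Sum"] for p in stats["Datapoints"]) / 1e9
    print(f"{nat['NatGatewayId']}: {total_gb:.2f} GB out in 7 days")
```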
Ghost environments that outlive their purpose. A developer spins up a staging environment for a feature branch. The feature ships. The developer moves on. The environment keeps running because nobody is certain it's still needed, and deleting it requires confidence that nothing will break. Six months later, it's still billing. The person who created it doesn't remember it exists. Without an automatic policy, such as flagging any environment with no activity for 30 days, ghost environments accumulate indefinitely.
Why this is hard to fix
The costs are scattered. You're not looking at a single $20,000/month line item. You're looking at $200 here, $400 there, $150 somewhere else, spread across dozens of accounts and services. No single number is alarming enough to trigger an investigation. But they add up to the equivalent of a full-time salary.
Nobody owns it. The engineer who built that staging environment six months ago is on a different team now. The infrastructure lives in code nobody actively maintains. Deleting it requires someone to take responsibility for any fallout. That responsibility is uncomfortable, so it gets deferred indefinitely.
There's no friction at creation. Spinning up a new environment takes minutes and requires no cost estimate, no approval, no expiry date. Deleting one requires someone to care enough to do it and to be certain nothing depends on it. Creation wins by default, every time.
How to stop the bleeding
Put servers to sleep on a schedule. This is the fastest win available. A Lambda function triggered by EventBridge rules can stop and start RDS instances and EC2 instances automatically. Set them to shut down at 6pm, wake at 8am on weekdays, and stay off entirely on weekends. That alone cuts test compute costs roughly in half, with zero change to developer workflow.
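A minimal version of that scheduler might look like the sketch below. The Schedule=office-hours tag is an opt-in convention invented for this example, not an AWS default, and the sketch assumes two EventBridge rules invoking the function with {"action": "stop"} at 6pm and {"action": "start"} at 8am on weekdays (cron(0 18 ? * MON-FRI *) and cron(0 8 ? * MON-FRI *)):

```python
import boto3

ec2 = boto3.client("ec2")
rds = boto3.client("rds")

TAG_KEY, TAG_VALUE = "Schedule", "office-hours"  # opt-in convention, assumed

def handler(event, context):
    action = event["action"]  # "stop" or "start", set by the EventBridge rule

    # EC2: find every instance opted in via the Schedule tag.
    instance_ids = [
        i["InstanceId"]
        for r in ec2.describe_instances(
            Filters=[{"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]}]
        )["Reservations"]
        for i in r["Instances"]
    ]
    if instance_ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=instance_ids)
        else:
            ec2.start_instances(InstanceIds=instance_ids)

    # RDS: tags live on the instance ARN, so check each database in turn.
    for db in rds.describe_db_instances()["DBInstances"]:
        tags = rds.list_tags_for_resource(ResourceName=db["DBInstanceArn"])["TagList"]
        if {"Key": TAG_KEY, "Value": TAG_VALUE} not in tags:
            continue
        if action == "stop" and db["DBInstanceStatus"] == "available":
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])
        elif action == "start" and db["DBInstanceStatus"] == "stopped":
            rds.start_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])
```

One caveat: AWS automatically restarts a stopped RDS instance after seven days, so a nightly stop rule quietly re-stops it on the next cycle.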
Route test environments through shared network infrastructure. Stop provisioning a dedicated NAT Gateway for every project environment. Your staging, QA, and dev environments should share network plumbing. They need to be isolated from production, not from each other. Moving from per-environment to shared NAT Gateways eliminates a fixed hourly charge that multiplies with every new environment you spin up.
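The change itself is a routing update. A sketch, assuming the test environments already share a VPC (sharing a NAT Gateway across VPCs needs VPC peering or a transit gateway, which is a bigger project); all IDs below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Point a test environment's private route table at the shared NAT Gateway
# instead of its dedicated one.
ec2.replace_route(
    RouteTableId="rtb-0aaaaaaaaaaaaaaaa",   # placeholder
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0shared0000000000",   # placeholder: the shared gateway
)

# Once no route references it, delete the dedicated gateway to stop
# its hourly charge. (Release its Elastic IP afterwards as well.)
ec2.delete_nat_gateway(NatGatewayId="nat-0dedicated00000000")
```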
Shrink the defaults. Use db.t3.medium or smaller for test databases. Turn off automated backups, or cap retention at 3 to 7 days instead of matching the 30-day production setting. A test database that gets reset weekly doesn't need a month of backup history. For RDS specifically, capping or disabling automated backups on a non-prod instance is a single configuration change, and the savings show up on the next bill.
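For illustration, both fixes are parameters on the same API call; the instance identifier is a placeholder:

```python
import boto3

rds = boto3.client("rds")

# Shrink a test database and cap its backup retention in one call.
# BackupRetentionPeriod=0 would disable automated backups entirely.
rds.modify_db_instance(
    DBInstanceIdentifier="qa-app-db",  # placeholder
    DBInstanceClass="db.t3.medium",
    BackupRetentionPeriod=3,
    ApplyImmediately=True,  # fine for non-prod; causes a brief restart
)
```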
Set automatic expiry. The rule should be simple and enforced by the system, not by memory: any environment with no deployment activity or traffic in the last 30 days gets flagged. After a 72-hour notification window with no response, it gets torn down. AWS Config rules and tag-based lifecycle policies can enforce this. So can a simple Lambda that queries CloudWatch metrics and posts to Slack before pulling the trigger. The key is that the default action is deletion, with humans opting in to keep an environment alive, not the other way around.
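Here's a sketch of the flagging half for EC2, using low CPU as the idleness signal. CPU is a crude proxy (deployment events or load-balancer requests are better signals but need more plumbing), and the webhook URL and 2% threshold are placeholders:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder
IDLE_DAYS = 30

def find_idle_instances():
    """Return running instances whose daily peak CPU stayed under 2%."""
    now = datetime.now(timezone.utc)
    idle = []
    for r in ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]:
        for inst in r["Instances"]:
            points = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=now - timedelta(days=IDLE_DAYS),
                EndTime=now,
                Period=86400,  # one datapoint per day
                Statistics=["Maximum"],
            )["Datapoints"]
            if points and all(p["Maximum"] < 2.0 for p in points):
                idle.append(inst["InstanceId"])
    return idle

def notify(instance_ids):
    """Post the kill list to Slack; teardown follows 72 hours later if nobody objects."""
    body = json.dumps(
        {"text": f"Flagged as idle for {IDLE_DAYS}+ days: {', '.join(instance_ids)}"}
    ).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```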
The bottom line
Cloud waste isn't a single red alert on your dashboard. It's a hundred small leaks running in parallel, each individually easy to ignore. But if your non-prod spend is 35% of your total bill and your cloud costs are $50k/month, you're looking at $210,000 a year in recoverable waste. That's not a rounding error. That's a headcount decision.
The fix isn't sophisticated. It's scheduling, right-sizing, shared infrastructure, and automatic cleanup. What it requires is someone deciding that non-production environments deserve the same architectural scrutiny as the systems your customers actually use.
Most teams haven't made that decision yet. That's why the bill keeps growing.
Your test environments are running 168 hours a week. Your engineers work 50. The math is not in your favor.
