Executive Summary
TL;DR: Cloud cost overruns stem from poor visibility and lack of ownership, exemplified by forgotten high-cost instances. The solution involves a multi-pronged FinOps approach, combining automated cleanup scripts, proactive policy-as-code guardrails, and fundamental organizational shifts towards showback and chargeback for sustained financial accountability.
Key Takeaways
- Implement "Janitor" scripts (e.g., AWS Lambda) to automatically identify and terminate untagged or abandoned cloud resources, acting as a reactive cost control.
- Enforce "Policy as Code" using tools like Sentinel or Open Policy Agent (OPA), plus Service Control Policies (SCPs), to prevent expensive or untagged resource provisioning at the IaC or AWS Organization level.
- Drive "Organizational Change" through FinOps practices like "showback" (displaying team-specific cloud spend) and "chargeback" (allocating costs to team budgets) to foster a culture of financial ownership.
Struggling with runaway cloud costs and immature FinOps practices? This guide, from a Senior DevOps Engineer, breaks down the real reasons for cloud waste and offers three concrete solutions, from quick scripts to permanent cultural shifts, to get your spending under control.
I Saw a $30k Weekend Bill. Let's Talk About Your Cloud Cost Problem.
I remember the Monday morning Slack message from Finance like it was yesterday: "Darian, can you explain this AWS spike?" I opened the billing console, and my stomach dropped. A developer, trying to test a new ML model, had spun up a p4d.24xlarge EC2 instance on Friday afternoon for a "quick test" and promptly forgotten about it. Over a single weekend, that one instance had racked up a five-figure bill. We hadn't set up any guardrails, alerts, or ownership policies. It was a free-for-all, and we were paying for it, literally.
This isn't a unique story. I see it play out in Reddit threads and hear it from colleagues constantly. Teams are handed the keys to the cloud kingdom with immense power to innovate, but without the financial literacy or guardrails to do it responsibly. That's the core of the FinOps maturity struggle. It's not about being cheap; it's about being efficient and accountable.
The "Why": It's a Problem of Visibility and Ownership
Before we jump into solutions, you have to understand the root cause. The problem isn't (usually) malicious developers trying to bankrupt the company. The problem is a toxic combination of two things:
- Lack of Visibility: Engineers can't see the cost of the infrastructure they're provisioning in real time. The `terraform apply` command doesn't come with a price tag attached. The bill is an abstract concept that someone else, somewhere else, deals with weeks later.
- Lack of Ownership: When no one is directly accountable for the cost of `dev-test-data-processing-cluster-04`, no one has an incentive to shut it down. It becomes "the company's infrastructure," a shared resource that is someone else's problem to clean up.
Fixing this isn't just about finding zombie servers. It's about fundamentally changing how your teams interact with the cloud. Here are three ways to tackle it, from a band-aid to a cure.
Solution 1: The Quick Fix (The "Janitor" Script)
This is the reactive, "stop the bleeding" approach. You're not fixing the culture, but you are stopping the immediate waste. The idea is to build automated janitorial services that hunt for and terminate untagged, abandoned, or oversized resources.
We did this with a simple AWS Lambda function, triggered by EventBridge on a nightly schedule. It scanned all EC2 instances and RDS databases in our dev accounts. If a resource was missing an `owner` tag or a `ttl` (Time To Live) tag, it would post a warning to a Slack channel, tagging the user who created it (if possible via CloudTrail). If it still wasn't tagged 24 hours later, a second function would terminate it. Harsh? Yes. Effective? Absolutely.
    # Super simplified Python/Boto3 logic for a Lambda Janitor
    import boto3

    def find_untagged_instances(event, context):
        ec2 = boto3.client('ec2', region_name='us-east-1')
        # Paginate so accounts with many instances are scanned completely
        paginator = ec2.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        untagged = []
        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_id = instance['InstanceId']
                    tag_keys = {tag['Key'] for tag in instance.get('Tags', [])}
                    if 'owner' not in tag_keys:
                        print(f"ALERT: Instance {instance_id} is missing 'owner' tag.")
                        untagged.append(instance_id)
                        # In a real script, you'd post this to Slack or SNS
                        # And maybe add a "pending_termination" tag
        return untagged
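The warn-then-terminate flow described above can be separated from the AWS calls and tested on its own. The following is a minimal sketch, not the function we actually ran: the `janitor_warned_at` tag name and the `decide_action` helper are assumptions made for illustration, and it presumes the first pass stamps the warning time on the resource as an ISO 8601 string.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("owner", "ttl")      # the two tags named in the text
WARNED_AT_TAG = "janitor_warned_at"   # hypothetical tag written when the warning is posted
GRACE_PERIOD = timedelta(hours=24)


def decide_action(tags, now):
    """Return 'ok', 'warn', 'wait', or 'terminate' for one resource.

    tags: dict of tag key -> value
    now:  timezone-aware datetime of the current run
    """
    if all(key in tags for key in REQUIRED_TAGS):
        return "ok"
    warned_at = tags.get(WARNED_AT_TAG)
    if warned_at is None:
        return "warn"  # first sighting: post to Slack and stamp the tag
    elapsed = now - datetime.fromisoformat(warned_at)
    return "terminate" if elapsed >= GRACE_PERIOD else "wait"


if __name__ == "__main__":
    run_time = datetime(2024, 1, 6, 19, 0, tzinfo=timezone.utc)
    print(decide_action({"owner": "darian"}, run_time))
```

Keeping the decision in a pure function means the destructive branch can be unit tested before it is ever wired to a terminate call.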
Warning: This is a hack, not a strategy. It cleans up the mess but doesn't teach anyone not to make one. You'll spend a lot of time maintaining this script and dealing with angry developers whose "important test server" got terminated. Use it to get initial control, but don't stop here.
Solution 2: The Permanent Fix (Policy as Code & Guardrails)
This is where you "shift left" and prevent the problem from happening in the first place. Instead of cleaning up messes, you make it impossible to create them. This is the true architectural solution.
The core principle is embedding cost controls directly into your IaC (Infrastructure as Code) pipeline and your cloud account structure.
- Mandatory Tagging with IaC Policies: Use tools like Sentinel with Terraform Cloud or Open Policy Agent (OPA) with your CI/CD pipeline. You can write policies that will fail a `terraform plan` if, for example, a resource is missing an `owner` tag or if an S3 bucket doesn't have a lifecycle policy. The developer gets immediate feedback *before* anything is ever deployed.
- Service Control Policies (SCPs): At the AWS Organization level, you can apply SCPs to your developer accounts. These are IAM policies on steroids. You can explicitly deny the ability to launch certain instance families. Worried about another GPU incident? Block all `p4`, `g5`, etc. instance types in any account that isn't your designated "ML Research" OU.
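To make the mandatory-tagging bullet concrete without committing to Sentinel or Rego syntax, here is a rough Python equivalent that a CI step could run against the JSON form of a plan (as produced by `terraform show -json <planfile>`). Treat it as a sketch under assumptions: the `TAGGABLE_TYPES` allow-list, the `find_tag_violations` helper, and the sample plan are all made up for illustration, and tag values that are only known at apply time will not appear in `after`.

```python
import json

REQUIRED_TAGS = {"owner"}
# Hypothetical allow-list of resource types this check cares about.
TAGGABLE_TYPES = {"aws_instance", "aws_db_instance", "aws_s3_bucket"}


def find_tag_violations(plan):
    """Return (address, missing_tags) for planned resources lacking required tags."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") not in TAGGABLE_TYPES:
            continue
        after = change.get("change", {}).get("after")
        if after is None:  # resource is being destroyed; nothing to check
            continue
        tags = after.get("tags") or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            violations.append((change["address"], missing))
    return violations


# Illustrative plan fragment, not output from a real run.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.ml_test", "type": "aws_instance",
     "change": {"actions": ["create"],
                "after": {"instance_type": "p4d.24xlarge", "tags": null}}},
    {"address": "aws_instance.api", "type": "aws_instance",
     "change": {"actions": ["create"],
                "after": {"instance_type": "t3.micro",
                          "tags": {"owner": "team-payments"}}}}
  ]
}
""")

if __name__ == "__main__":
    for address, missing in find_tag_violations(sample_plan):
        print(f"FAIL {address}: missing tags {missing}")
```

In a pipeline, exiting non-zero whenever the returned list is non-empty is enough to block the apply step.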
Example: SCP to block expensive instance types
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DenyExpensiveInstanceTypesInDev",
          "Effect": "Deny",
          "Action": "ec2:RunInstances",
          "Resource": "arn:aws:ec2:*:*:instance/*",
          "Condition": {
            "StringLike": {
              "ec2:InstanceType": [
                "p4d.*",
                "p3.*",
                "g5.*",
                "x2iezn.*",
                "u-12tb1.metal"
              ]
            }
          }
        }
      ]
    }
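Before attaching a policy like this, it helps to preview which instance types the patterns actually catch. The sketch below approximates the `StringLike` wildcard match with Python's `fnmatchcase`; that is an approximation I am assuming is close enough for these simple patterns (`fnmatch` also gives `[...]` a special meaning that IAM does not), and `is_denied` is a hypothetical helper, not an AWS API.

```python
from fnmatch import fnmatchcase

# Same patterns as the SCP condition above.
DENIED_PATTERNS = ["p4d.*", "p3.*", "g5.*", "x2iezn.*", "u-12tb1.metal"]


def is_denied(instance_type):
    """True if the instance type matches any denied pattern (case-sensitive)."""
    return any(fnmatchcase(instance_type, pattern) for pattern in DENIED_PATTERNS)


if __name__ == "__main__":
    for candidate in ["p4d.24xlarge", "g5.xlarge", "p3dn.24xlarge", "g5g.xlarge", "t3.micro"]:
        print(f"{candidate:15} denied={is_denied(candidate)}")
```

Note that `p3dn.24xlarge` and `g5g.xlarge` are not matched, because the dot in `p3.*` and `g5.*` is literal. If the goal is to block whole families, the pattern list needs to be broader than it first appears.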
This approach moves the responsibility from a cleanup script to the platform itself. It's more work to set up, but it scales and prevents entire classes of problems.
Solution 3: The "Nuclear" Option (Organizational Change)
The final, and most powerful, solution isn't technical at all. It's cultural. It's about implementing a true FinOps practice that gives teams both the visibility and the ownership I mentioned earlier.
This is where you build dashboards (using tools like Cloudability, Apptio, or even just detailed Cost and Usage Reports with QuickSight) that show each team their exact cloud spend, broken down by project and service. You're not hiding the bill anymore; you're putting it on a giant screen for everyone to see. This is called "showback."
The next level is "chargeback," where a team's cloud spend is actually allocated against their budget. Suddenly, the cost of `prod-db-01` isn't an abstract company expense; it's a line item that the Director of Engineering has to answer for. Nothing encourages a team to right-size an RDS instance faster than seeing its four-figure monthly cost coming out of their own budget.
Pro Tip: This is a massive organizational shift. You need buy-in from Finance, Engineering leadership, and Product. It requires a dedicated FinOps analyst or team to manage the tooling and reporting. It's the hardest path, but it's the only one that creates a permanent, cost-conscious engineering culture.
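At its core, showback is a group-by over tagged billing line items. Here is a minimal sketch of that roll-up, assuming each line item has already been exported as a dict with a `cost_usd` value and a `tags` mapping; that shape, the `team` tag key, and the sample figures are hypothetical rather than the real Cost and Usage Report schema.

```python
from collections import defaultdict


def showback(line_items, tag_key="team"):
    """Sum cost per team, largest first; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[team] += item["cost_usd"]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only.
sample_items = [
    {"resource": "prod-db-01", "cost_usd": 1200.0, "tags": {"team": "payments"}},
    {"resource": "api-asg", "cost_usd": 300.0, "tags": {"team": "payments"}},
    {"resource": "search-cluster", "cost_usd": 450.0, "tags": {"team": "search"}},
    {"resource": "dev-test-data-processing-cluster-04", "cost_usd": 75.5, "tags": {}},
]

if __name__ == "__main__":
    for team, total in showback(sample_items):
        print(f"{team:10} ${total:,.2f}")
```

The `UNTAGGED` bucket is the number to watch: it is the spend nobody owns yet, and it ties the reporting back to the tagging rules from the first two solutions.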
Comparing the Approaches
| Approach | Effort | Time to Implement | Long-Term Impact |
|---|---|---|---|
| 1. The Janitor Script | Low | Days | Low (Reactive) |
| 2. Policy & Guardrails | Medium | Weeks | High (Proactive) |
| 3. Organizational Change | High | Months/Quarters | Transformational |
Ultimately, a mature FinOps practice uses a combination of all three. You need the janitor script for what slips through, the guardrails to prevent most issues, and the cultural ownership to make everyone a responsible steward of cloud resources. Stop chasing surprise bills and start building a platform that makes financial responsibility the path of least resistance.
Read the original article on TechResolve.blog