Darian Vance

Posted on • Originally published at wp.me

Solved: Help us understand FinOps maturity & cloud cost challenges

🚀 Executive Summary

TL;DR: Cloud cost overruns stem from poor visibility and lack of ownership, exemplified by forgotten high-cost instances. The solution involves a multi-pronged FinOps approach, combining automated cleanup scripts, proactive policy-as-code guardrails, and fundamental organizational shifts towards showback and chargeback for sustained financial accountability.

🎯 Key Takeaways

  • Implement ‘Janitor’ scripts (e.g., AWS Lambda) to automatically identify and terminate untagged or abandoned cloud resources, acting as a reactive cost control.
  • Enforce ‘Policy as Code’ using tools like Sentinel or Open Policy Agent (OPA), and Service Control Policies (SCPs) to prevent expensive or untagged resource provisioning at the IaC or AWS Organization level.
  • Drive ‘Organizational Change’ through FinOps practices like ‘showback’ (displaying team-specific cloud spend) and ‘chargeback’ (allocating costs to team budgets) to foster a culture of financial ownership.

Struggling with runaway cloud costs and immature FinOps practices? This guide, from a Senior DevOps Engineer, breaks down the real reasons for cloud waste and offers three concrete solutions, from quick scripts to permanent cultural shifts, to get your spending under control.

I Saw a $30k Weekend Bill. Let’s Talk About Your Cloud Cost Problem.

I remember the Monday morning Slack message from Finance like it was yesterday: “Darian, can you explain this AWS spike?” I opened the billing console, and my stomach dropped. A developer, trying to test a new ML model, had spun up a p4d.24xlarge EC2 instance on Friday afternoon for a “quick test” and promptly forgotten about it. Over a single weekend, that one instance had racked up a five-figure bill. We hadn’t set up any guardrails, alerts, or ownership policies. It was a free-for-all, and we were paying for it—literally.

This isn’t a unique story. I see it play out in Reddit threads and hear it from colleagues constantly. Teams are handed the keys to the cloud kingdom with immense power to innovate, but without the financial literacy or guardrails to do it responsibly. That’s the core of the FinOps maturity struggle. It’s not about being cheap; it’s about being efficient and accountable.

The “Why”: It’s a Problem of Visibility and Ownership

Before we jump into solutions, you have to understand the root cause. The problem isn’t (usually) malicious developers trying to bankrupt the company. The problem is a toxic combination of two things:

  • Lack of Visibility: Engineers can’t see the cost of the infrastructure they’re provisioning in real-time. The terraform apply command doesn’t come with a price tag attached. The bill is an abstract concept that someone else, somewhere else, deals with weeks later.
  • Lack of Ownership: When no one is directly accountable for the cost of dev-test-data-processing-cluster-04, no one has an incentive to shut it down. It becomes “the company’s infrastructure,” a shared resource that is someone else’s problem to clean up.

Fixing this isn’t just about finding zombie servers. It’s about fundamentally changing how your teams interact with the cloud. Here are three ways to tackle it, from a band-aid to a cure.

Solution 1: The Quick Fix (The “Janitor” Script)

This is the reactive, “stop the bleeding” approach. You’re not fixing the culture, but you are stopping the immediate waste. The idea is to build automated janitorial services that hunt for and terminate untagged, abandoned, or oversized resources.

We did this with a simple AWS Lambda function, triggered by EventBridge on a nightly schedule. It scanned all EC2 instances and RDS databases in our dev accounts. If a resource was missing an owner tag or a ttl (Time To Live) tag, it would post a warning to a Slack channel, tagging the user who created it (if possible via CloudTrail). If it still wasn’t tagged 24 hours later, a second function would terminate it. Harsh? Yes. Effective? Absolutely.

# Super simplified Python/Boto3 logic for a Lambda janitor

import boto3

def find_untagged_instances(event, context):
    ec2 = boto3.client('ec2', region_name='us-east-1')
    # Use a paginator so instances beyond the first page aren't silently missed
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                tag_keys = [tag['Key'] for tag in instance.get('Tags', [])]

                if 'owner' not in tag_keys:
                    print(f"ALERT: Instance {instance_id} is missing 'owner' tag.")
                    # In a real script, you'd post this to Slack or SNS
                    # and maybe add a 'pending_termination' tag

Warning: This is a hack, not a strategy. It cleans up the mess but doesn’t teach anyone not to make one. You’ll spend a lot of time maintaining this script and dealing with angry developers whose “important test server” got terminated. Use it to get initial control, but don’t stop here.
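
The second-pass terminator mentioned above is easiest to get right if you separate the decision logic from the AWS calls. Here's a minimal sketch of that logic, assuming (my assumption, not specified above) that the first pass writes a `pending_termination` tag containing an ISO-8601 timestamp:

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=24)

def instances_to_terminate(instances, now=None):
    """Given boto3-style instance dicts, return the IDs whose
    'pending_termination' tag is older than the grace period
    and that still have no 'owner' tag."""
    now = now or datetime.now(timezone.utc)
    doomed = []
    for inst in instances:
        tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
        if 'owner' in tags:
            continue  # someone claimed it in time
        marked = tags.get('pending_termination')
        if marked and now - datetime.fromisoformat(marked) > GRACE_PERIOD:
            doomed.append(inst['InstanceId'])
    return doomed

# The Lambda handler would then call (requires boto3 / AWS credentials):
# ec2.terminate_instances(InstanceIds=instances_to_terminate(all_instances))
```

Keeping the selection logic pure like this lets you unit-test the "who gets terminated" rules without mocking AWS, which matters when a bug means killing someone's server.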

Solution 2: The Permanent Fix (Policy as Code & Guardrails)

This is where you “shift left” and prevent the problem from happening in the first place. Instead of cleaning up messes, you make it impossible to create them. This is the true architectural solution.

The core principle is embedding cost controls directly into your IaC (Infrastructure as Code) pipeline and your cloud account structure.

  • Mandatory Tagging with IaC Policies: Use tools like Sentinel with Terraform Cloud or Open Policy Agent (OPA) with your CI/CD pipeline. You can write policies that will fail a terraform plan if, for example, a resource is missing an owner tag or if an S3 bucket doesn’t have a lifecycle policy. The developer gets immediate feedback *before* anything is ever deployed.
  • Service Control Policies (SCPs): At the AWS Organization level, you can apply SCPs to your developer accounts. These are IAM policies on steroids. You can explicitly deny the ability to launch certain instance families. Worried about another GPU incident? Block all p4, g5, etc. instance types in any account that isn’t your designated “ML Research” OU.
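
If you're not ready to adopt Sentinel or OPA, you can approximate the tagging check with a short CI script over `terraform show -json` output. A minimal sketch, where the required tag key and the list of taggable resource types are my assumptions:

```python
REQUIRED_TAGS = {'owner'}  # tag keys every resource must carry (assumption)
TAGGABLE_TYPES = {'aws_instance', 'aws_db_instance', 'aws_s3_bucket'}

def missing_tag_violations(plan):
    """Scan a `terraform show -json <planfile>` document and return
    the addresses of taggable resources missing required tags."""
    violations = []
    for rc in plan.get('resource_changes', []):
        if rc.get('type') not in TAGGABLE_TYPES:
            continue
        after = (rc.get('change') or {}).get('after') or {}
        tags = after.get('tags') or {}
        if not REQUIRED_TAGS <= set(tags):
            violations.append(rc['address'])
    return violations

# In CI you would pipe the plan in and fail the build on any violations:
#   terraform show -json plan.out > plan.json
#   python check_tags.py plan.json
```

The key property is the same as with OPA: the developer gets the feedback at plan time, before anything is deployed.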

Example: SCP to block expensive instance types

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstanceTypesInDev",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": [
            "p4d.*",
            "p3.*",
            "g5.*",
            "x2iezn.*",
            "u-12tb1.metal"
          ]
        }
      }
    }
  ]
}

This approach moves the responsibility from a cleanup script to the platform itself. It’s more work to set up, but it scales and prevents entire classes of problems.

Solution 3: The “Nuclear” Option (Organizational Change)

The final, and most powerful, solution isn’t technical at all. It’s cultural. It’s about implementing a true FinOps practice that gives teams both the visibility and the ownership I mentioned earlier.

This is where you build dashboards (using tools like Cloudability, Apptio, or even just detailed Cost and Usage Reports with Quicksight) that show each team their exact cloud spend, broken down by project and service. You’re not hiding the bill anymore; you’re putting it on a giant screen for everyone to see. This is called “showback.”
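
If you're rolling your own showback report rather than buying a tool, the raw material is a Cost Explorer query grouped by a cost-allocation tag. Here's a sketch of the aggregation step, assuming the `owner` tag from earlier is activated as a cost-allocation tag (an assumption on my part):

```python
from collections import defaultdict

def spend_by_team(ce_response):
    """Aggregate a Cost Explorer get_cost_and_usage response
    (grouped by a cost-allocation tag) into per-team totals."""
    totals = defaultdict(float)
    for period in ce_response.get('ResultsByTime', []):
        for group in period.get('Groups', []):
            # Tag group keys look like 'owner$team-name'; an empty
            # value means the resource carried no such tag
            team = group['Keys'][0].split('$', 1)[-1] or 'untagged'
            totals[team] += float(group['Metrics']['UnblendedCost']['Amount'])
    return dict(totals)

# The response would come from (requires AWS credentials):
# ce = boto3.client('ce')
# resp = ce.get_cost_and_usage(
#     TimePeriod={'Start': '2024-01-01', 'End': '2024-02-01'},
#     Granularity='MONTHLY',
#     Metrics=['UnblendedCost'],
#     GroupBy=[{'Type': 'TAG', 'Key': 'owner'}],
# )
```

The 'untagged' bucket is worth surfacing prominently on the dashboard; watching it shrink is a decent proxy for your tagging hygiene improving.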

The next level is “chargeback,” where a team’s cloud spend is actually allocated against their budget. Suddenly, the cost of prod-db-01 isn’t an abstract company expense; it’s a line item that the Director of Engineering has to answer for. Nothing encourages a team to right-size an RDS instance faster than seeing its four-figure monthly cost coming out of their own budget.

Pro Tip: This is a massive organizational shift. You need buy-in from Finance, Engineering leadership, and Product. It requires a dedicated person or team (a FinOps Analyst or team) to manage the tooling and reporting. It’s the hardest path, but it’s the only one that creates a permanent, cost-conscious engineering culture.

Comparing the Approaches

| Approach | Effort | Time to Implement | Long-Term Impact |
| --- | --- | --- | --- |
| 1. The Janitor Script | Low | Days | Low (Reactive) |
| 2. Policy & Guardrails | Medium | Weeks | High (Proactive) |
| 3. Organizational Change | High | Months/Quarters | Transformational |

Ultimately, a mature FinOps practice uses a combination of all three. You need the janitor script for what slips through, the guardrails to prevent most issues, and the cultural ownership to make everyone a responsible steward of cloud resources. Stop chasing surprise bills and start building a platform that makes financial responsibility the path of least resistance.



👉 Read the original article on TechResolve.blog


☕ Support my work

If this article helped you, you can buy me a coffee:

👉 https://buymeacoffee.com/darianvance
