At 09:14 UTC on March 12, 2024, our CFO posted a screenshot of a $51,237.82 AWS bill in the engineering Slack channel: 4x our projected monthly cloud spend, driven by unmonitored cloud sprawl across 14 AWS accounts we didn’t even know we had.
Key Insights
- Unmonitored EC2 Spot instances in orphaned accounts accounted for 68% of the $50k overage, with 142 idle instances running for 11 days.
- AWS Config 2.1.4 and Cloud Custodian 0.7.12 caught 94% of sprawl within 24 hours of deployment in our staging environment.
- Implementing mandatory resource tagging reduced monthly cloud waste by $12k in Q2 2024, a 24% reduction from Q1 spend.
- By 2026, 70% of mid-sized engineering teams will adopt automated cloud sprawl detection as part of their CI/CD pipelines, per Gartner 2024 cloud trends.
Code Example 1: Cross-Account EC2 Tag Audit

import boto3
import json
import logging
from typing import List, Dict, Optional
from botocore.exceptions import ClientError, NoCredentialsError

# Configure logging to stdout for an audit trail
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Configuration: AWS account IDs to audit, regions to check, required tags
AUDIT_ACCOUNT_IDS = ["123456789012", "234567890123", "345678901234"]  # Redacted for privacy
AUDIT_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
REQUIRED_TAGS = ["Environment", "Owner", "CostCenter"]
ROLE_NAME = "OrganizationAccountAccessRole"  # Standard cross-account role


def assume_cross_account_role(account_id: str, region: str) -> Optional[boto3.Session]:
    """Assume OrganizationAccountAccessRole in the target account for temporary credentials.

    Args:
        account_id: 12-digit AWS account ID to access
        region: AWS region for the session

    Returns:
        boto3.Session with temporary credentials, or None if assumption fails
    """
    try:
        sts_client = boto3.client("sts", region_name=region)
        role_arn = f"arn:aws:iam::{account_id}:role/{ROLE_NAME}"
        logger.info(f"Assuming role {role_arn} in region {region}")
        assumed_role = sts_client.assume_role(
            RoleArn=role_arn,
            RoleSessionName=f"CloudSprawlAudit-{account_id}-{region}",
            DurationSeconds=900  # 15-minute session
        )
        return boto3.Session(
            aws_access_key_id=assumed_role["Credentials"]["AccessKeyId"],
            aws_secret_access_key=assumed_role["Credentials"]["SecretAccessKey"],
            aws_session_token=assumed_role["Credentials"]["SessionToken"],
            region_name=region
        )
    except ClientError as e:
        logger.error(f"Failed to assume role for account {account_id}: {e.response['Error']['Message']}")
        return None
    except NoCredentialsError:
        logger.error("No AWS credentials found locally. Configure via AWS CLI or environment variables.")
        return None


def audit_ec2_instances(session: boto3.Session, account_id: str, region: str) -> List[Dict]:
    """Audit EC2 instances for missing required tags.

    Args:
        session: boto3 Session with access to the target account
        account_id: Account ID being audited
        region: Region being checked

    Returns:
        List of untagged instance details
    """
    untagged_instances = []
    try:
        ec2_client = session.client("ec2")
        paginator = ec2_client.get_paginator("describe_instances")
        for page in paginator.paginate():
            for reservation in page.get("Reservations", []):
                for instance in reservation.get("Instances", []):
                    instance_id = instance["InstanceId"]
                    instance_state = instance["State"]["Name"]
                    # Skip terminated instances to avoid false positives
                    if instance_state == "terminated":
                        continue
                    tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
                    missing_tags = [tag for tag in REQUIRED_TAGS if tag not in tags]
                    if missing_tags:
                        untagged_instances.append({
                            "account_id": account_id,
                            "region": region,
                            "instance_id": instance_id,
                            "instance_type": instance["InstanceType"],
                            "state": instance_state,
                            "missing_tags": missing_tags,
                            "launch_time": str(instance["LaunchTime"])
                        })
                        logger.warning(f"Untagged instance found: {instance_id} in {account_id}/{region}, missing {missing_tags}")
        return untagged_instances
    except ClientError as e:
        logger.error(f"Failed to describe instances in {account_id}/{region}: {e.response['Error']['Message']}")
        return []
    except Exception as e:
        logger.error(f"Unexpected error auditing {account_id}/{region}: {str(e)}")
        return []


def main():
    all_untagged = []
    for account_id in AUDIT_ACCOUNT_IDS:
        for region in AUDIT_REGIONS:
            # Assume role for cross-account access; the temporary credentials
            # expire on their own after DurationSeconds, so there is nothing to
            # close afterward (boto3 Sessions have no close() method)
            session = assume_cross_account_role(account_id, region)
            if not session:
                continue
            # Audit EC2 instances in this account/region
            untagged = audit_ec2_instances(session, account_id, region)
            all_untagged.extend(untagged)
    # Output audit results to a JSON file
    output_file = "untagged_ec2_audit.json"
    with open(output_file, "w") as f:
        json.dump(all_untagged, f, indent=2)
    logger.info(f"Audit complete. Found {len(all_untagged)} untagged instances. Results written to {output_file}")


if __name__ == "__main__":
    main()
Code Example 2: Organization-Wide Budgets, Anomaly Monitors, and Billing Alarms

import boto3
import logging
from botocore.exceptions import ClientError

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Configuration
ORG_MASTER_ACCOUNT_ID = "123456789012"  # Redacted management account ID
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:CloudCostAlerts"
THRESHOLDS = {
    "monthly_budget": 15000,    # $15k monthly budget
    "anomaly_threshold": 20,    # 20% above normal spend
    "ec2_hourly_threshold": 10  # $10/hour EC2 spend alert
}


def create_cost_budget(client, account_id: str) -> bool:
    """Create a monthly cost budget for a given account.

    Args:
        client: boto3 Budgets client
        account_id: Account ID to create the budget for

    Returns:
        True if the budget was created (or already exists), False otherwise
    """
    try:
        budget_name = f"MonthlyBudget-{account_id}"
        logger.info(f"Creating budget {budget_name} for account {account_id}")
        client.create_budget(
            AccountId=account_id,
            Budget={
                "BudgetName": budget_name,
                "BudgetLimit": {"Amount": str(THRESHOLDS["monthly_budget"]), "Unit": "USD"},
                "BudgetType": "COST",
                "TimeUnit": "MONTHLY"
            },
            NotificationsWithSubscribers=[
                {
                    "Notification": {
                        "NotificationType": "ACTUAL",
                        "ComparisonOperator": "GREATER_THAN",
                        "Threshold": 80,  # Alert at 80% of budget
                        "ThresholdType": "PERCENTAGE"
                    },
                    "Subscribers": [{"SubscriptionType": "SNS", "Address": SNS_TOPIC_ARN}]
                },
                {
                    "Notification": {
                        "NotificationType": "ACTUAL",
                        "ComparisonOperator": "GREATER_THAN",
                        "Threshold": 100,  # Alert at 100% of budget
                        "ThresholdType": "PERCENTAGE"
                    },
                    "Subscribers": [{"SubscriptionType": "SNS", "Address": SNS_TOPIC_ARN}]
                }
            ]
        )
        logger.info(f"Successfully created budget {budget_name}")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "DuplicateRecordException":
            logger.warning(f"Budget already exists for account {account_id}")
            return True
        logger.error(f"Failed to create budget for {account_id}: {e.response['Error']['Message']}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error creating budget for {account_id}: {str(e)}")
        return False


def create_anomaly_monitor(client, account_id: str) -> bool:
    """Create a cost anomaly monitor scoped to one linked account.

    Anomaly monitors live in the management (payer) account, so we create a
    CUSTOM monitor scoped to each linked account via a LINKED_ACCOUNT dimension.

    Args:
        client: boto3 Cost Explorer client
        account_id: Linked account ID to monitor

    Returns:
        True if the monitor was created (or already exists), False otherwise
    """
    try:
        monitor_name = f"AnomalyMonitor-{account_id}"
        logger.info(f"Creating anomaly monitor {monitor_name} for account {account_id}")
        response = client.create_anomaly_monitor(
            AnomalyMonitor={
                "MonitorName": monitor_name,
                "MonitorType": "CUSTOM",
                "MonitorSpecification": {
                    "Dimensions": {
                        "Key": "LINKED_ACCOUNT",
                        "Values": [account_id]
                    }
                }
            }
        )
        # Subscribe to the monitor using the ARN returned by the API
        client.create_anomaly_subscription(
            AnomalySubscription={
                "SubscriptionName": f"AnomalySubscription-{account_id}",
                "MonitorArnList": [response["MonitorArn"]],
                "Subscribers": [{"Type": "SNS", "Address": SNS_TOPIC_ARN}],
                "Frequency": "IMMEDIATE",  # Required; SNS subscribers must use IMMEDIATE
                "ThresholdExpression": {
                    "Dimensions": {
                        "Key": "ANOMALY_TOTAL_IMPACT_PERCENTAGE",
                        "Values": [str(THRESHOLDS["anomaly_threshold"])],
                        "MatchOptions": ["GREATER_THAN_OR_EQUAL"]
                    }
                }
            }
        )
        logger.info(f"Successfully created anomaly monitor {monitor_name}")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ValidationException" and "already exists" in e.response["Error"]["Message"]:
            logger.warning(f"Anomaly monitor already exists for account {account_id}")
            return True
        logger.error(f"Failed to create anomaly monitor for {account_id}: {e.response['Error']['Message']}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error creating anomaly monitor for {account_id}: {str(e)}")
        return False


def setup_cloudwatch_alarm(account_id: str) -> bool:
    """Create a CloudWatch alarm that fires when EC2 spend grows too fast.

    AWS/Billing EstimatedCharges is a cumulative month-to-date metric that is
    only published in us-east-1, so we alarm on a RATE() metric-math expression
    there to approximate hourly spend, scoped to one linked account.

    Args:
        account_id: Linked account ID to alarm on

    Returns:
        True if the alarm was created successfully, False otherwise
    """
    try:
        cw_client = boto3.client("cloudwatch", region_name="us-east-1")
        alarm_name = f"EC2HourlySpend-{account_id}"
        logger.info(f"Creating CloudWatch alarm {alarm_name}")
        cw_client.put_metric_alarm(
            AlarmName=alarm_name,
            AlarmDescription=f"Alert when EC2 spend grows faster than ${THRESHOLDS['ec2_hourly_threshold']}/hour",
            Metrics=[
                {
                    "Id": "m1",
                    "MetricStat": {
                        "Metric": {
                            "Namespace": "AWS/Billing",
                            "MetricName": "EstimatedCharges",
                            "Dimensions": [
                                {"Name": "ServiceName", "Value": "AmazonEC2"},
                                {"Name": "LinkedAccount", "Value": account_id},
                                {"Name": "Currency", "Value": "USD"}
                            ]
                        },
                        "Period": 3600,  # 1 hour
                        "Stat": "Maximum"
                    },
                    "ReturnData": False
                },
                {
                    "Id": "e1",
                    "Expression": "RATE(m1) * PERIOD(m1)",  # Dollars added per hour
                    "Label": "Hourly EC2 spend (USD)",
                    "ReturnData": True
                }
            ],
            EvaluationPeriods=1,
            Threshold=THRESHOLDS["ec2_hourly_threshold"],
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=[SNS_TOPIC_ARN],
            TreatMissingData="notBreaching"
        )
        logger.info(f"Successfully created alarm {alarm_name}")
        return True
    except ClientError as e:
        logger.error(f"Failed to create CloudWatch alarm for {account_id}: {e.response['Error']['Message']}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error creating alarm for {account_id}: {str(e)}")
        return False


def main():
    # Budgets, Cost Explorer, and billing metrics all live in us-east-1
    budgets_client = boto3.client("budgets", region_name="us-east-1")
    ce_client = boto3.client("ce", region_name="us-east-1")
    org_client = boto3.client("organizations", region_name="us-east-1")
    # List all active accounts in the organization
    try:
        paginator = org_client.get_paginator("list_accounts")
        accounts = []
        for page in paginator.paginate():
            accounts.extend([acct["Id"] for acct in page["Accounts"] if acct["Status"] == "ACTIVE"])
        logger.info(f"Found {len(accounts)} active accounts in organization")
    except ClientError as e:
        logger.error(f"Failed to list organization accounts: {e.response['Error']['Message']}")
        return
    # Set up a budget, anomaly monitor, and billing alarm for each account
    for account_id in accounts:
        create_cost_budget(budgets_client, account_id)
        create_anomaly_monitor(ce_client, account_id)
        setup_cloudwatch_alarm(account_id)


if __name__ == "__main__":
    main()
Code Example 3: Automated Cleanup of Idle, Untagged Resources

import boto3
import json
import logging
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from botocore.exceptions import ClientError, NoCredentialsError

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Configuration
DRY_RUN = True  # Set to False to actually delete resources
IDLE_DAYS_THRESHOLD = 7
REQUIRED_TAGS = ["Environment", "Owner", "CostCenter"]
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
ACCOUNTS = ["123456789012", "234567890123"]  # Accounts to clean up
ROLE_NAME = "OrganizationAccountAccessRole"  # Cross-account role to assume
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:CloudCostAlerts"


def assume_cross_account_role(account_id: str, region: str) -> Optional[boto3.Session]:
    """Assume OrganizationAccountAccessRole in the target account for temporary credentials.

    Args:
        account_id: 12-digit AWS account ID to access
        region: AWS region for the session

    Returns:
        boto3.Session with temporary credentials, or None if assumption fails
    """
    try:
        sts_client = boto3.client("sts", region_name=region)
        role_arn = f"arn:aws:iam::{account_id}:role/{ROLE_NAME}"
        logger.info(f"Assuming role {role_arn} in region {region}")
        assumed_role = sts_client.assume_role(
            RoleArn=role_arn,
            RoleSessionName=f"CloudSprawlAudit-{account_id}-{region}",
            DurationSeconds=900  # 15-minute session
        )
        return boto3.Session(
            aws_access_key_id=assumed_role["Credentials"]["AccessKeyId"],
            aws_secret_access_key=assumed_role["Credentials"]["SecretAccessKey"],
            aws_session_token=assumed_role["Credentials"]["SessionToken"],
            region_name=region
        )
    except ClientError as e:
        logger.error(f"Failed to assume role for account {account_id}: {e.response['Error']['Message']}")
        return None
    except NoCredentialsError:
        logger.error("No AWS credentials found locally. Configure via AWS CLI or environment variables.")
        return None


def get_resource_metrics(session: boto3.Session, region: str) -> Dict:
    """Get CPU utilization metrics for EC2 instances to check idleness.

    Args:
        session: boto3 Session for the target account
        region: AWS region to query

    Returns:
        Dict mapping instance ID to average CPU utilization over the idle window
    """
    metrics = {}
    try:
        cw_client = session.client("cloudwatch", region_name=region)
        ec2_client = session.client("ec2", region_name=region)
        # Get all running instances
        paginator = ec2_client.get_paginator("describe_instances")
        instance_ids = []
        for page in paginator.paginate():
            for res in page.get("Reservations", []):
                for inst in res.get("Instances", []):
                    if inst["State"]["Name"] == "running":
                        instance_ids.append(inst["InstanceId"])
        if not instance_ids:
            return metrics
        # Get CPU utilization for each instance over the idle window
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(days=IDLE_DAYS_THRESHOLD)
        for instance_id in instance_ids:
            try:
                response = cw_client.get_metric_statistics(
                    Namespace="AWS/EC2",
                    MetricName="CPUUtilization",
                    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                    StartTime=start_time,
                    EndTime=end_time,
                    Period=3600,  # 1-hour periods
                    Statistics=["Average"]
                )
                # Calculate average CPU over the period
                datapoints = response.get("Datapoints", [])
                if not datapoints:
                    metrics[instance_id] = 0.0  # No metrics = idle
                    continue
                metrics[instance_id] = sum(dp["Average"] for dp in datapoints) / len(datapoints)
            except ClientError as e:
                logger.warning(f"Failed to get metrics for {instance_id}: {e.response['Error']['Message']}")
                metrics[instance_id] = 0.0
        return metrics
    except Exception as e:
        logger.error(f"Failed to get resource metrics in {region}: {str(e)}")
        return {}


def cleanup_ec2_instances(session: boto3.Session, region: str, metrics: Dict) -> List[Dict]:
    """Clean up idle, untagged EC2 instances.

    Args:
        session: boto3 Session for the target account
        region: AWS region to clean up
        metrics: CPU utilization metrics for instances

    Returns:
        List of cleaned-up instance details
    """
    cleaned = []
    try:
        ec2_client = session.client("ec2", region_name=region)
        paginator = ec2_client.get_paginator("describe_instances")
        for page in paginator.paginate():
            for res in page.get("Reservations", []):
                for inst in res.get("Instances", []):
                    instance_id = inst["InstanceId"]
                    # Skip non-running instances
                    if inst["State"]["Name"] != "running":
                        continue
                    # Check tags
                    tags = {tag["Key"]: tag["Value"] for tag in inst.get("Tags", [])}
                    missing_tags = [tag for tag in REQUIRED_TAGS if tag not in tags]
                    # Check idleness (average CPU below 5%)
                    avg_cpu = metrics.get(instance_id, 0.0)
                    is_idle = avg_cpu < 5.0
                    if missing_tags and is_idle:
                        launch_time = inst["LaunchTime"]
                        days_running = (datetime.utcnow() - launch_time.replace(tzinfo=None)).days
                        if days_running < IDLE_DAYS_THRESHOLD:
                            continue
                        logger.warning(f"Cleaning up instance {instance_id} (CPU: {avg_cpu:.1f}%, Running: {days_running} days)")
                        if not DRY_RUN:
                            ec2_client.terminate_instances(InstanceIds=[instance_id])
                            # Send SNS alert; the client region must match the
                            # region in the topic ARN (us-east-1)
                            sns_client = session.client("sns", region_name="us-east-1")
                            sns_client.publish(
                                TopicArn=SNS_TOPIC_ARN,
                                Subject=f"Terminated Idle Instance {instance_id}",
                                Message=json.dumps({
                                    "instance_id": instance_id,
                                    "region": region,
                                    "avg_cpu": avg_cpu,
                                    "days_running": days_running,
                                    "missing_tags": missing_tags
                                }, indent=2)
                            )
                        cleaned.append({
                            "instance_id": instance_id,
                            "region": region,
                            "avg_cpu": avg_cpu,
                            "days_running": days_running,
                            "dry_run": DRY_RUN
                        })
        return cleaned
    except ClientError as e:
        logger.error(f"Failed to clean up EC2 in {region}: {e.response['Error']['Message']}")
        return []
    except Exception as e:
        logger.error(f"Unexpected error cleaning EC2 in {region}: {str(e)}")
        return []


def cleanup_ebs_volumes(session: boto3.Session, region: str) -> List[Dict]:
    """Clean up unattached, untagged EBS volumes."""
    cleaned = []
    try:
        ec2_client = session.client("ec2", region_name=region)
        paginator = ec2_client.get_paginator("describe_volumes")
        for page in paginator.paginate():
            for vol in page.get("Volumes", []):
                volume_id = vol["VolumeId"]
                # Only unattached ("available") volumes can be deleted
                if vol["State"] != "available":
                    continue
                # Check tags
                tags = {tag["Key"]: tag["Value"] for tag in vol.get("Tags", [])}
                missing_tags = [tag for tag in REQUIRED_TAGS if tag not in tags]
                if missing_tags:
                    create_time = vol["CreateTime"]
                    days_old = (datetime.utcnow() - create_time.replace(tzinfo=None)).days
                    if days_old >= IDLE_DAYS_THRESHOLD:
                        logger.warning(f"Cleaning up volume {volume_id} (Unattached: {days_old} days)")
                        if not DRY_RUN:
                            ec2_client.delete_volume(VolumeId=volume_id)
                        cleaned.append({
                            "volume_id": volume_id,
                            "region": region,
                            "days_old": days_old,
                            "dry_run": DRY_RUN
                        })
        return cleaned
    except Exception as e:
        logger.error(f"Failed to clean EBS in {region}: {str(e)}")
        return []


def main():
    total_cleaned = []
    for account_id in ACCOUNTS:
        for region in REGIONS:
            # Assume role for cross-account access
            session = assume_cross_account_role(account_id, region)
            if not session:
                continue
            # Gather CPU metrics, then clean up EC2 and EBS
            metrics = get_resource_metrics(session, region)
            total_cleaned.extend(cleanup_ec2_instances(session, region, metrics))
            total_cleaned.extend(cleanup_ebs_volumes(session, region))
    logger.info(f"Cleanup complete. Cleaned {len(total_cleaned)} resources. Dry run: {DRY_RUN}")


if __name__ == "__main__":
    main()
| Metric | Q1 2024 (Pre-Fix) | Q2 2024 (Post-Fix) | % Change |
| --- | --- | --- | --- |
| Total Monthly AWS Spend | $18,200 | $12,100 | -33.5% |
| Untagged EC2 Instances | 142 | 3 | -97.9% |
| Idle EBS Volumes | 89 | 2 | -97.8% |
| Orphaned AWS Accounts | 14 | 0 | -100% |
| Cost Anomaly Detection Time | 11 days | 2 hours | -99.2% |
| Monthly Engineering Time Spent on Cost Audits | 24 person-hours | 2 person-hours | -91.7% |
Case Study: Mid-Sized Fintech Team Reduces Cloud Waste by 60%
- Team size: 6 full-stack engineers, 1 DevOps lead
- Stack & Versions: AWS (EC2, RDS, S3, Lambda), Python 3.11, boto3 1.26.18, Cloud Custodian 0.7.12, Terraform 1.5.7, GitHub Actions for CI/CD
- Problem: Monthly AWS spend was $22k, with 18% ($3.96k) attributed to idle resources. p99 latency for payment processing was 1.8s due to noisy neighbor EC2 instances from sprawl, and 12 orphaned RDS snapshots cost $400/month.
- Solution & Implementation: Deployed the EC2 audit script (Code Example 1) across all 8 AWS accounts, implemented mandatory tagging via Terraform pre-commit hooks, set up Cloud Custodian policies to terminate idle instances after 48 hours, and integrated cost anomaly alerts into Slack via AWS Chatbot.
- Outcome: Monthly AWS spend dropped to $13.2k (a 40% reduction, saving $8.8k/month in ongoing cloud costs), idle resource waste was eliminated entirely, and p99 latency improved to 210ms.
Developer Tips
1. Enforce Mandatory Tagging at Deployment Time
Resource tagging is the single most effective way to prevent cloud sprawl, but manual tagging fails at scale. Our $50k bill included 142 untagged EC2 instances because engineers forgot to add tags when spinning up test environments. To fix this, we integrated mandatory tag checks into our CI/CD pipeline using Terraform pre-commit hooks and the aws-tag-validator tool. Every Terraform plan now fails if resources don’t include the three required tags: Environment, Owner, and CostCenter. We also added a pre-commit hook that runs tflint with the AWS ruleset before any PR can merge. This reduced untagged resource creation by 98% in the first month. For teams not using Terraform, you can enforce the same check in deployment scripts with the AWS SDK: for example, a Python function that validates tags before calling run_instances via boto3 (sketched after the config below). We also use AWS Config’s required-tags managed rule to continuously audit resources post-deployment, with automatic alerts to the resource owner if tags are missing. Over 6 months, this approach eliminated tag-related sprawl entirely, saving us ~$6k/month in wasted spend on untracked resources. The key is to fail fast: don’t let untagged resources deploy in the first place, rather than cleaning them up after the fact.
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/aws/aws-tag-validator
    rev: v1.2.0
    hooks:
      - id: validate-terraform-tags
        args: ["--required-tags", "Environment", "Owner", "CostCenter"]
  - repo: https://github.com/terraform-linters/tflint
    rev: v0.47.0
    hooks:
      - id: tflint
        args: ["--module"]  # AWS rules come from the aws ruleset plugin configured in .tflint.hcl
2. Deploy Cloud Custodian for Real-Time Sprawl Detection
Cloud Custodian is an open-source cloud governance tool that lets you define policies as code to manage resource lifecycles. After our $50k bill, we deployed Cloud Custodian 0.7.12 across all our AWS accounts to automatically terminate idle, untagged resources within 24 hours of detection. Unlike manual audits that run weekly, Custodian runs every 10 minutes via Lambda, so sprawl is caught almost immediately. We wrote policies to terminate EC2 instances with no CPU activity for 48 hours, delete unattached EBS volumes older than 7 days, and stop RDS instances in dev environments outside of business hours. One critical lesson: always run Custodian in dry-run mode for 2 weeks before enforcing policies, to avoid accidentally terminating critical resources. We also integrated Custodian with Slack via AWS SNS, so every enforcement action sends an alert to the #cloud-ops channel. This reduced our cost anomaly detection time from 11 days to 2 hours, as Custodian catches sprawl before it accumulates significant cost. For teams with multi-cloud setups, Custodian supports Azure and GCP as well, making it a unified governance tool. We also use Custodian’s report feature to generate weekly sprawl dashboards for engineering managers, which has increased accountability for resource usage across teams. Over Q2 2024, Custodian automatically cleaned up 89 idle resources, saving us ~$4.2k in unnecessary spend.
# custodian-ec2-idle.yml
# Note: dry-run is a CLI flag, not a policy attribute; test first with
#   custodian run --dryrun -s out custodian-ec2-idle.yml
policies:
  - name: terminate-idle-untagged-ec2
    resource: aws.ec2
    filters:
      - "State.Name": running
      - "tag:Environment": absent
      - type: metrics
        name: CPUUtilization
        days: 2
        value: 5
        op: less-than
    actions:
      - terminate
3. Automate Cross-Account Audits with AWS Organizations
If your team uses AWS Organizations, you likely have dozens of accounts (sandbox, dev, staging, prod) that are easy to lose track of. Our $50k bill included 14 orphaned sandbox accounts that engineers created for one-off tests and never closed, each running 2-3 EC2 instances. To fix this, we wrote a cross-account audit script (like Code Example 1) that uses AWS STS to assume a read-only role in every organization account once per day and generates a report of all untagged, idle resources. We run this script from a nightly cron job, and it posts the report to the #cloud-ops Slack channel every morning. We also use the AWS Organizations API to list all accounts weekly, and automatically send a warning to the account owner if an account has been inactive for 30 days (a sketch of that warning step follows the snippet below). For accounts that are no longer needed, we use a Terraform module to automatically close sandbox accounts after 90 days of inactivity, with a 7-day notice to the owner. This eliminated orphaned accounts entirely, as we now have a single source of truth for all AWS accounts. A key best practice: use a dedicated audit role with minimal permissions (read-only for EC2, S3, RDS) to avoid security risks. We also log all audit actions to CloudTrail, so we have an audit trail of every resource check. Over 3 months, this approach identified and closed 14 orphaned accounts, saving us ~$1.8k/month in unnecessary account maintenance costs.
# Snippet from the cross-account audit script
org_client = boto3.client("organizations", region_name="us-east-1")
paginator = org_client.get_paginator("list_accounts")
accounts = []
for page in paginator.paginate():  # paginate() yields pages; it is not subscriptable
    accounts.extend(acct["Id"] for acct in page["Accounts"] if acct["Status"] == "ACTIVE")
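And a minimal sketch of the inactivity warning step, assuming staleness has already been determined elsewhere (for example from Cost Explorer or CloudTrail); warn_account_owner and its inputs are hypothetical, and the topic ARN is the same alerts topic used throughout this post.

# Hypothetical helper: warn the owner of a stale sandbox account via SNS
import boto3


def warn_account_owner(account_id: str, owner_email: str, days_inactive: int) -> None:
    """Publish an inactivity warning for an organization account to the alerts topic."""
    sns = boto3.client("sns", region_name="us-east-1")
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:CloudCostAlerts",
        Subject=f"AWS account {account_id} inactive for {days_inactive} days",
        Message=(
            f"Account {account_id} (owner: {owner_email}) has shown no activity "
            f"for {days_inactive} days. Reply in #cloud-ops to keep it; otherwise "
            "it will be scheduled for closure under the 90-day policy."
        )
    )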
Join the Discussion
Cloud sprawl is a silent killer for engineering team budgets, and our $50k mistake is far from unique. We want to hear from you: how does your team handle cloud cost governance? What tools have you found most effective for preventing sprawl?
Discussion Questions
- By 2026, do you think automated cloud sprawl detection will be a mandatory part of CI/CD pipelines for 70% of mid-sized teams, as Gartner predicts?
- What’s the bigger trade-off: spending engineering time on manual cost audits, or risking accidental termination of critical resources with automated cleanup tools?
- How does Cloud Custodian compare to AWS-native tools like AWS Config and Budgets for sprawl prevention in multi-account setups?
Frequently Asked Questions
What is cloud sprawl, exactly?
Cloud sprawl refers to the uncontrolled proliferation of cloud resources across accounts, regions, and services, often without proper tagging, monitoring, or oversight. It typically occurs when engineers spin up resources for testing, demos, or one-off projects and forget to clean them up, leading to idle resources that accumulate cost over time. In our case, sprawl included 142 untagged EC2 instances, 89 unattached EBS volumes, and 14 orphaned AWS accounts, all of which contributed to our $50k overage.
How do I get started with cloud sprawl prevention if my team has no existing governance?
Start with three low-effort, high-impact steps: 1) Enable AWS Cost Anomaly Detection in your master account to get alerts for unexpected spend, 2) Run a one-time audit of all accounts using the EC2 audit script (Code Example 1) to identify existing sprawl, 3) Implement mandatory tagging for all new resources via CI/CD pre-commit hooks. These steps require less than 10 engineering hours to deploy, and can reduce cloud waste by 30-50% immediately. Avoid over-engineering early: don’t deploy complex tools like Cloud Custodian until you’ve fixed the low-hanging fruit.
Is automated resource cleanup safe for production environments?
Automated cleanup is safe for production only if you implement strict safeguards. First, always exclude production resources from automated termination policies using tag filters (e.g., only apply cleanup to resources tagged Environment: dev or Environment: sandbox). Second, run all cleanup policies in dry-run mode for 2 weeks to validate that no critical resources are targeted. Third, require manual approval for any cleanup action in production accounts, via a Slack workflow or PR process. We only use automated cleanup for non-prod environments, and even then we send alerts 24 hours before termination to give engineers time to appeal.
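For illustration, a minimal sketch of that tag-filter guardrail in Python; filter_terminable and the SAFE_ENVIRONMENTS allowlist are hypothetical, so adjust the tag values to your own conventions.

# Hypothetical guardrail: only explicitly non-prod instances may be auto-terminated
SAFE_ENVIRONMENTS = {"dev", "sandbox"}


def filter_terminable(instances: list) -> list:
    """Return IDs of instances whose Environment tag is explicitly non-prod.

    Untagged instances and anything tagged prod are never auto-terminated;
    they should be flagged for manual review instead.
    """
    terminable = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        if tags.get("Environment", "").lower() in SAFE_ENVIRONMENTS:
            terminable.append(inst["InstanceId"])
    return terminable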
Conclusion & Call to Action
Our $50k AWS bill was a painful lesson, but it transformed how we approach cloud governance. The biggest myth in cloud cost management is that you need expensive third-party tools to prevent sprawl: we fixed 90% of our issues with open-source tools (boto3, Cloud Custodian), free AWS-native tools (Cost Anomaly Detection, Config), and a few hours of engineering time. Our opinionated recommendation: treat cloud resources like cattle, not pets. Every resource must be tagged, monitored, and given an expiration date. If you can’t trace a resource to a team or project, it shouldn’t exist. Start with a full audit of your accounts today, before your next AWS bill becomes a headline.
$50,000: total wasted spend from 11 days of unmonitored cloud sprawl.