Stop Managing Cloud Exceptions in Spreadsheets — Use Policy-as-Code Instead

#finops #devops #cloudcomputing #ai

TL;DR: Cloud exceptions in spreadsheets rot. Policy-as-code puts them in Git with automatic expiry dates, pull-request reviews, and cost thresholds. Here's how.

Tools like CleanCloud move exceptions into Git with expiry dates, PR reviews, and enforceable cost thresholds.

The Exception Spreadsheet Nobody Trusts

It starts innocently enough. Your team runs a cloud hygiene scan. A bastion host is intentionally stopped. A dev database has been idle for 45 days but is still in use. A NAT gateway shows no traffic because its workload is seasonal. Legitimate exceptions, all of them.

So you create a list.

Three months later:

Nobody knows which exceptions are still valid
Nobody remembers who approved them
Nobody knows when they expire
The spreadsheet hasn't been updated since 2024
You're suppressing the alerts instead of fixing the waste

Stats:

~40% of FinOps exceptions have zero documented expiry
Average exception age before anyone questions it: 6+ months
Typical cost of "forgotten" exceptions: $10-50K+/month

The exceptions spreadsheet is the FinOps equivalent of commenting out a security alert.

The waste is still running.

You've just stopped seeing it.

What If Exceptions Were Code?

This is exactly the problem policy-as-code solves.

Instead of tracking exceptions in spreadsheets or tickets, tools like CleanCloud treat them as version-controlled configuration — living alongside your infrastructure, with built-in expiry, review, and enforcement.

Your exceptions live in Git, in a single file:

cleancloud.yaml (repo root)

# cleancloud.yaml - commit to your repo root

defaults:
  confidence: MEDIUM      # skip low-signal findings
  min_cost: 10            # ignore cheap findings

exceptions:
  - rule_id: aws.ec2.instance.stopped
    resource_id: i-0bastion1234  # bastion host
    reason: "Bastion - started on demand for debugging"
    expires_at: "2026-12-31"      # auto-expires (forces review)

  - rule_id: aws.rds.instance.idle
    resource_id: "db-test-*"      # wildcard (all test databases)
    reason: "Test databases are ephemeral and intentionally idle"

  - rule_id: aws.nat.idle
    resource_id: nat-12345678
    reason: "NAT gateway for seasonal workload (Jan-Mar only)"
    expires_at: "2026-03-31"      # expires after season ends

thresholds:
  fail_on_confidence: HIGH        # CI gate: block on HIGH findings
  fail_on_cost: 500               # CI gate: block if waste > $500/month

Now every exception is:

Reviewable - Every change shows up in a pull request, with the reason documented
Auditable - Git history shows who approved what
Self-expiring - Entries with expires_at lapse automatically, so they can't be "forgotten"
Enforceable - CI fails when an exception expires or a threshold is breached
Version-controlled - Treated like infrastructure code (because it is)
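To make the expiry and wildcard behaviour concrete, here is a minimal Python sketch of how a scanner could decide whether a finding is suppressed. It is not CleanCloud's implementation. The function names, the in-memory EXCEPTIONS list, and the choice to treat expires_at as inclusive (valid through that date) are all assumptions made for illustration.

```python
from datetime import date
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the exceptions in cleancloud.yaml above.
EXCEPTIONS = [
    {"rule_id": "aws.ec2.instance.stopped", "resource_id": "i-0bastion1234",
     "expires_at": "2026-12-31"},
    {"rule_id": "aws.rds.instance.idle", "resource_id": "db-test-*"},
    {"rule_id": "aws.nat.idle", "resource_id": "nat-12345678",
     "expires_at": "2026-03-31"},
]


def is_active(exception, today):
    """No expires_at means the exception never lapses.

    Assumption: the exception stays valid through expires_at itself.
    """
    expires_at = exception.get("expires_at")
    return expires_at is None or today <= date.fromisoformat(expires_at)


def is_suppressed(rule_id, resource_id, exceptions, today):
    """True if any unexpired exception matches the finding's rule and resource."""
    return any(
        exc["rule_id"] == rule_id
        and fnmatchcase(resource_id, exc["resource_id"])  # glob-style wildcard
        and is_active(exc, today)
        for exc in exceptions
    )
```

With this logic the bastion finding is hidden during 2026 and resurfaces on the first scan in 2027, while any resource named db-test-something stays suppressed until someone removes the entry.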

How It Works in Practice

1. Run scans with your config

cleancloud scan --provider aws --all-regions
# Automatically picks up cleancloud.yaml from repo root

Without cleancloud.yaml: 47 findings
With cleancloud.yaml: 12 findings (the exceptions are suppressed)

2. Enforce in CI/CD

# .github/workflows/cloud-hygiene.yml
name: Cloud Hygiene Check

on:
  schedule:
    - cron: '0 9 * * MON'  # Weekly scan: Mondays at 09:00 UTC

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

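      # Assumes AWS credentials are already available to this job
      # (for example via aws-actions/configure-aws-credentials).
      - name: Scan cloud for waste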
        uses: getcleancloud/scan-action@v1
        with:
          provider: aws
          all-regions: true
          fail-on-cost: 500      # Exit code 2 if waste > $500/month

Result: Your build fails if an exception expires or the waste threshold is breached. No surprises.
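The gate itself is simple to reason about. Below is a rough Python sketch of the decision, under a few assumptions: findings carry a confidence level (LOW, MEDIUM, or HIGH) and an estimated monthly cost, the cost threshold applies to the total across all unsuppressed findings, and the process exits with code 2 on a breach, as the workflow comment says. The gate_exit_code function and the monthly_cost field name are hypothetical, not CleanCloud's API.

```python
# Assumed ordering of confidence levels, lowest to highest.
CONFIDENCE_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def gate_exit_code(findings, fail_on_confidence=None, fail_on_cost=None):
    """Return 2 when either threshold is breached, otherwise 0."""
    if fail_on_confidence is not None:
        floor = CONFIDENCE_RANK[fail_on_confidence]
        # Any single finding at or above the configured level blocks the build.
        if any(CONFIDENCE_RANK[f["confidence"]] >= floor for f in findings):
            return 2
    if fail_on_cost is not None:
        # Assumption: the threshold is compared against total monthly waste.
        total = sum(f["monthly_cost"] for f in findings)
        if total > fail_on_cost:
            return 2
    return 0


# Two MEDIUM findings that together exceed $500/month.
SAMPLE_FINDINGS = [
    {"confidence": "MEDIUM", "monthly_cost": 300},
    {"confidence": "MEDIUM", "monthly_cost": 250},
]
```

Calling gate_exit_code(SAMPLE_FINDINGS, "HIGH", 500) returns 2 because the combined $550 crosses the cost threshold, even though neither finding is HIGH confidence on its own.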

3. Update exceptions via PR

When your bastion exception expires (2026-12-31), the next scan will fail:

Cloud Hygiene Check: FAILED
  [HIGH] Stopped EC2 instance: i-0bastion1234
  Reason: Exception expired on 2026-12-31

  Action: Remove the exception or update expires_at

Someone opens a PR that either renews expires_at or deletes the exception, and the team reviews it. Zero magic. Total visibility.
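If you would rather open the renewal PR before the scan goes red, a small helper can list the entries that are about to lapse. This is a hypothetical sketch and not a CleanCloud feature: the expiring_soon name and the 30-day default window are assumptions, and the exceptions are passed in as plain dictionaries.

```python
from datetime import date, timedelta


def expiring_soon(exceptions, today, within_days=30):
    """Return exceptions already expired or expiring within the window.

    Entries without expires_at are permanent and are skipped.
    """
    cutoff = today + timedelta(days=within_days)
    due = [
        exc for exc in exceptions
        if "expires_at" in exc
        and date.fromisoformat(exc["expires_at"]) <= cutoff
    ]
    # ISO dates sort chronologically as strings, so soonest comes first.
    return sorted(due, key=lambda exc: exc["expires_at"])


# Hypothetical subset of the exceptions from cleancloud.yaml.
RENEWAL_CANDIDATES = [
    {"resource_id": "i-0bastion1234", "expires_at": "2026-12-31"},
    {"resource_id": "db-test-*"},
    {"resource_id": "nat-12345678", "expires_at": "2026-03-31"},
]
```

Run on 2026-03-15, this would flag only the NAT gateway exception, which gives the team two weeks to decide whether the seasonal workload still justifies it.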

Real-World Example: Multi-Account Exception

Managing exceptions across your AWS org:

exceptions:
  # Production RDS: kept for failover (intentional)
  - rule_id: aws.rds.instance.idle
    account_id: "123456789012"        # prod account
    region: us-east-1
    resource_id: db-failover
    reason: "Standby RDS in active-passive setup"
    expires_at: "2027-03-31"          # annual review date

  # Staging: ephemeral, safe to ignore
  - rule_id: aws.rds.instance.idle
    account_id: "210987654321"        # staging account
    resource_id: "db-*"               # glob pattern
    reason: "Staging databases are ephemeral"
    # No expires_at = permanent exception (reviewed manually)
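One way to read the optional account_id and region fields is as narrowing filters: if a field is present it must match the finding exactly, and if it is absent it matches anything. The Python sketch below illustrates that reading. It is an assumption about the semantics, not CleanCloud's documented behaviour, and the matches function and SCOPE_FIELDS tuple are hypothetical names.

```python
from fnmatch import fnmatchcase

# Optional fields that narrow an exception when present (assumed semantics).
SCOPE_FIELDS = ("account_id", "region")

PROD_FAILOVER = {
    "rule_id": "aws.rds.instance.idle",
    "account_id": "123456789012",
    "region": "us-east-1",
    "resource_id": "db-failover",
}
STAGING_ANY_DB = {
    "rule_id": "aws.rds.instance.idle",
    "account_id": "210987654321",
    "resource_id": "db-*",
}


def matches(exception, finding):
    """True if the exception covers the finding."""
    if exception["rule_id"] != finding["rule_id"]:
        return False
    if not fnmatchcase(finding["resource_id"], exception["resource_id"]):
        return False
    # A scope field left out of the exception places no restriction.
    return all(
        exception[field] == finding[field]
        for field in SCOPE_FIELDS
        if field in exception
    )
```

Under this reading, the staging entry covers db-orders in any region of account 210987654321, while the production entry covers only db-failover in us-east-1 of account 123456789012.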

The Policy-as-Code Difference

Approach	Where exceptions live	Audit trail	Expiry	Cost
Spreadsheet	Unstructured cells	Slack messages	Manual	Free (but $10-50K+/month in forgotten waste)
Ticketing	Jira tickets	Ticket comments	Forgotten	Free (but lost time)
UI toggle	Vendor SaaS	Dashboard logs	None	Vendor cost
Policy-as-code	Git	Full commit history	Automatic	Free (and tighter control)

Getting Started (5 Minutes)

Step 1: Try it without exceptions

pipx install cleancloud
cleancloud demo                    # See sample findings

Step 2: Add a config file

Create cleancloud.yaml in your repo root, using the YAML above as a starting point.

Step 3: Scan with your config

cleancloud scan --provider aws --all-regions
# Findings matched by the exceptions in cleancloud.yaml are now suppressed

Step 4: Commit to git

git add cleancloud.yaml
git commit -m "Add cloud hygiene exceptions with expiry dates"
git push

Your exceptions are now:

Version-controlled
Code-reviewed
Automatically expiring
Auditable

Why This Matters for Your Team

Platform teams: Enforce waste thresholds across departments without hunting for waste by hand
FinOps teams: An audit trail plus expiry dates means no more "forgotten" waste
DevOps/SREs: Exceptions live in Git and are treated like infrastructure code
Security/Compliance: Every exception is a documented, reviewable approval