Mateen Anjum

Posted on Jun 6

FinOps guardrails at provisioning time: stop paying for mistakes you could have blocked in Terraform

#devops #finops #terraform #opa

TL;DR: I wired Infracost into terraform plan, fed the JSON into OPA via conftest, and made the PR check fail if the change adds more than $500/month. Three months in, the gate has blocked roughly $2,400/month of accidental NAT Gateways, oversized RDS instances, and lonely Elastic IPs that nobody noticed. The invoice doesn't arrive anyway when the bad config never merges.

The problem nobody owns

The classic FinOps story goes like this. Someone opens a PR. The PR adds a NAT Gateway because the new VPC needed private subnet egress. Reviewer says "lgtm." It merges. Thirty days later, a finance person pings the platform channel: "what is this $1,847 line item." Nobody remembers the PR. Nobody owns the cost.

A 2026 platform engineering trend report says 73% of surveyed platform teams now enforce cost policy before merge, moving the cost conversation away from the invoice and into the PR.¹ That number tracks with what I'm seeing in the field. The teams that still treat FinOps as a quarterly cleanup exercise are the same teams that get surprised every month.

I built CICosts last year for the same reason at the CI layer: you can't fix what you can't see, and finance asking the question is too late. This article is the same idea pushed one layer down, at the provisioning layer, where the money actually gets committed.

The anti-pattern: "the invoice arrives anyway"

Before I show the gate, the thing it replaces. Most FinOps tooling sold in 2024 to 2025 was retrospective. You buy a SaaS, it ingests your Cost and Usage Report, it shows you pretty graphs of where the money went last month. That's fine for reporting. It does not stop a single dollar from being spent.

The pattern I keep seeing:

Platform team installs a cost dashboard.
Dashboard shows last month's spend, broken down by tag.
Team holds a "cost review" once a quarter.
Team writes tickets to delete the obvious waste.
By the time anyone deletes the resource, it has been running for 60 to 90 days.

The invoice arrives anyway. The dashboard is a receipt, not a guardrail. You can stare at a Grafana panel showing $4,000/month of idle NAT Gateways for as long as you want; the money already left the building.

The fix is to put the question in the developer's face at PR time, when they still have the keyboard and the context.

The gate, end to end

The pipeline has three moving parts. Terraform produces the plan. Infracost converts the plan into a JSON breakdown with monthly costs. OPA reads the breakdown and decides whether the PR check passes.

Step 1: terraform plan

Nothing exotic here. The CI job runs:

terraform init -backend-config=backend.hcl
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

The tfplan.json is what Infracost wants. Keep the binary plan too, because some teams like to attach it to the PR for review.
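Before handing the plan to Infracost, it can help to see what the plan actually adds. Here is a minimal sketch that counts planned changes per resource type, assuming the standard resource_changes[].change.actions layout of terraform show -json. The summarize_plan helper and the sample data are hypothetical, for illustration only.

```python
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count planned changes by (resource type, action) from `terraform show -json` output."""
    counts: Counter = Counter()
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # "no-op" and "read" entries do not add or remove infrastructure.
        if actions in (["no-op"], ["read"]):
            continue
        counts[(change["type"], "+".join(actions))] += 1
    return counts


# Hypothetical, heavily trimmed plan used only for illustration.
sample_plan = {
    "resource_changes": [
        {"type": "aws_nat_gateway", "change": {"actions": ["create"]}},
        {"type": "aws_nat_gateway", "change": {"actions": ["create"]}},
        {"type": "aws_nat_gateway", "change": {"actions": ["create"]}},
        {"type": "aws_db_instance", "change": {"actions": ["create"]}},
        {"type": "aws_vpc", "change": {"actions": ["no-op"]}},
    ]
}

for (rtype, action), n in sorted(summarize_plan(sample_plan).items()):
    print(f"{action:<8} {n} x {rtype}")
```

In CI you would load the real file with json.load(open("tfplan.json")) instead of the inline sample. Three NAT Gateways showing up in this summary is a cue to look at the module before you look at the price.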

Step 2: Infracost breakdown

Infracost reads the plan and looks up prices from its pricing API. It supports AWS, Azure, GCP, and a long tail of SaaS providers. The interesting flag is --format json, which gives a structured diff you can feed into a policy engine.

infracost breakdown \
  --path tfplan.json \
  --format json \
  --out-file infracost.json

Each project in the output carries a projects[].diff.totalMonthlyCost field, the net monthly change for that project. That's the number I care about. A small sample:

{
  "projects": [
    {
      "name": "platform/networking",
      "diff": {
        "totalMonthlyCost": "612.34",
        "resources": [
          {
            "name": "aws_nat_gateway.private_egress",
            "monthlyCost": "32.85",
            "monthlyQuantity": "730",
            "unit": "hours"
          },
          {
            "name": "aws_db_instance.analytics",
            "monthlyCost": "579.49"
          }
        ]
      }
    }
  ]
}

You can see exactly what's driving the delta. That analytics RDS instance is the suspicious one. The NAT Gateway is fine on its own; the problem is usually that someone adds three of them because the module spins one up per AZ.
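To make "what's driving the delta" concrete, here is a small sketch that ranks the resources in the sample above by monthly cost. It only relies on the fields shown in the sample (name, monthlyCost, totalMonthlyCost). The rank_resources helper is my own illustration, not part of Infracost.

```python
import json
from decimal import Decimal

# The sample breakdown from above, trimmed to the fields this sketch reads.
SAMPLE = json.loads("""
{
  "projects": [
    {
      "name": "platform/networking",
      "diff": {
        "totalMonthlyCost": "612.34",
        "resources": [
          {"name": "aws_nat_gateway.private_egress", "monthlyCost": "32.85"},
          {"name": "aws_db_instance.analytics", "monthlyCost": "579.49"}
        ]
      }
    }
  ]
}
""")


def rank_resources(breakdown: dict) -> list[tuple[str, Decimal]]:
    """Return (resource name, monthly cost) pairs, most expensive first."""
    rows = []
    for project in breakdown.get("projects", []):
        for res in (project.get("diff") or {}).get("resources", []):
            # Costs arrive as strings; treat a missing price as zero.
            rows.append((res["name"], Decimal(res.get("monthlyCost") or "0")))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_resources(SAMPLE)
for name, cost in ranked:
    print(f"${cost:>8}  {name}")
print("sum of resources:", sum(cost for _, cost in ranked))
```

Decimal keeps the arithmetic exact, so the two resources add up to the same 612.34 that the project reports as its total.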

Step 3: OPA policy via conftest

Conftest is a thin wrapper around OPA that lets you write Rego against any structured config file. I keep the policy in policy/cost.rego:

package main

# Hard limit: any PR that adds more than $500/mo of net cost fails.
threshold_monthly := 500

# Allow-list: resource types that are exempt from the per-resource cap below.
# The aggregate cap above still applies to them.
exempt_resource_types := {
  "aws_cloudwatch_log_group",
}

deny[msg] {
  delta := to_number(input.projects[_].diff.totalMonthlyCost)
  delta > threshold_monthly
  msg := sprintf(
    "Monthly cost delta is $%.2f, which exceeds the $%d limit. Break this into smaller changes or request an exception.",
    [delta, threshold_monthly],
  )
}

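# Block any single resource over $200/mo without a justification label.
# Note: resource_type and metadata.code_comment are the field names this policy
# assumes. Confirm them against your own infracost.json; if a field is missing,
# the exemption or justification simply never matches.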
deny[msg] {
  some i, j
  resource := input.projects[i].diff.resources[j]
  cost := to_number(resource.monthlyCost)
  cost > 200
  not exempt_resource_types[resource.resource_type]
  not has_justification(resource)
  msg := sprintf(
    "Resource %s costs $%.2f/mo. Add a # cost-justified: <reason> comment in the .tf file or split the PR.",
    [resource.name, cost],
  )
}

has_justification(resource) {
  startswith(resource.metadata.code_comment, "cost-justified:")
}

Two rules, both opinionated. The aggregate cap stops "death by a thousand cuts" PRs that ship 20 resources at $50 each. The per-resource cap stops one fat outlier from sneaking past the aggregate check.
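If Rego is new to your team, it can help to read the same two rules as plain Python. This is a sketch for understanding and unit-testing the logic, not a replacement for conftest. It uses the same field names the Rego policy assumes (resource_type, metadata.code_comment), and the deny function here is a hypothetical mirror of the policy.

```python
THRESHOLD_MONTHLY = 500      # aggregate cap per project, in dollars
PER_RESOURCE_LIMIT = 200     # per-resource cap, in dollars
EXEMPT_RESOURCE_TYPES = {"aws_cloudwatch_log_group"}


def has_justification(resource: dict) -> bool:
    """True when the resource carries a 'cost-justified:' comment (assumed field)."""
    comment = (resource.get("metadata") or {}).get("code_comment", "")
    return isinstance(comment, str) and comment.startswith("cost-justified:")


def deny(breakdown: dict) -> list[str]:
    """Return one message per violated rule, mirroring the two Rego deny rules."""
    messages = []
    for project in breakdown.get("projects", []):
        diff = project.get("diff") or {}
        # Rule 1: aggregate cap on the project's net monthly delta.
        delta = float(diff.get("totalMonthlyCost") or 0)
        if delta > THRESHOLD_MONTHLY:
            messages.append(
                f"Monthly cost delta is ${delta:.2f}, which exceeds the ${THRESHOLD_MONTHLY} limit."
            )
        # Rule 2: per-resource cap, skipped for exempt types and justified resources.
        for res in diff.get("resources", []):
            cost = float(res.get("monthlyCost") or 0)
            if (
                cost > PER_RESOURCE_LIMIT
                and res.get("resource_type") not in EXEMPT_RESOURCE_TYPES
                and not has_justification(res)
            ):
                messages.append(f"Resource {res['name']} costs ${cost:.2f}/mo.")
    return messages
```

Fed the sample breakdown from Step 2, this returns two messages: one for the 612.34 aggregate delta and one for the 579.49 RDS instance. Twenty resources at $50 each trip only the aggregate rule.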

To run it locally:

conftest test --policy policy/ infracost.json

If the policy denies, conftest exits non-zero, which fails the GitHub Actions job, which blocks the merge if you have branch protection on.

Step 4: GitHub Actions workflow

The full workflow lives in .github/workflows/cost-gate.yml:

name: cost-gate

on:
  pull_request:
    paths:
      - 'terraform/**'
      - '.github/workflows/cost-gate.yml'

permissions:
  contents: read
  pull-requests: write
  id-token: write  # required for the OIDC role assumption below

jobs:
  cost-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.5

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.PLAN_ROLE_ARN }}
          aws-region: us-east-1

      - name: terraform init and plan
        working-directory: terraform
        run: |
          terraform init -input=false
          terraform plan -out=tfplan.binary -input=false
          terraform show -json tfplan.binary > tfplan.json

      - name: Install Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Run Infracost
        working-directory: terraform
        run: |
          infracost breakdown \
            --path tfplan.json \
            --format json \
            --out-file ../infracost.json

      - name: Install conftest
        run: |
          curl -L https://github.com/open-policy-agent/conftest/releases/download/v0.55.0/conftest_0.55.0_Linux_x86_64.tar.gz \
            | tar -xz conftest
          sudo mv conftest /usr/local/bin/

      - name: Enforce cost policy
        run: conftest test --policy policy/ infracost.json

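      - name: Comment cost diff on PR
        if: always()
        run: |
          infracost comment github \
            --path=infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --github-token=${{ github.token }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --behavior=update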

A few things worth calling out. The plan uses a read-only IAM role assumed via OIDC, not a long-lived key. The conftest step is the actual gate; the Infracost PR comment is just a nicety so reviewers can see the breakdown without opening Actions logs. The paths: filter keeps the gate from running on docs-only PRs.

Step 5: Backstage scorecard

The gate is enforcement. The scorecard is visibility. Once you have the cost-gate workflow on every infra repo, you want a place to see which repos pass, which fail, and which never adopted the gate at all.

I use Backstage's Tech Insights module for scorecards. The check definition lives in app-config.yaml:

techInsights:
  factRetrievers:
    costGateRetriever:
      schedule:
        frequency: { hours: 6 }
        timeout: { minutes: 5 }
  scorecards:
    cost-policy-compliance:
      title: "Cost policy compliance"
      description: "Repos with the FinOps gate installed and passing"
      checks:
        - id: has-cost-gate-workflow
          type: json-rules-engine
          name: "Cost gate workflow exists"
          factIds:
            - github.workflows.exists
          rule:
            conditions:
              all:
                - fact: workflows
                  operator: contains
                  value: cost-gate.yml
        - id: cost-gate-passing
          type: json-rules-engine
          name: "Cost gate is passing on main"
          factIds:
            - github.workflows.lastStatus
          rule:
            conditions:
              all:
                - fact: lastStatus
                  operator: equal
                  value: success
        - id: under-monthly-budget
          type: json-rules-engine
          name: "Monthly infra cost under team budget"
          factIds:
            - infracost.monthlyTotal
            - team.monthlyBudget
          rule:
            conditions:
              all:
                - fact: monthlyTotal
                  operator: lessThanInclusive
                  value: { fact: monthlyBudget }

The scorecard surfaces on each component's overview page. Three checks per repo. Green means the gate is installed, the last main run was green, and the projected monthly cost is under whatever budget the team owner set. Anything else is a flag.
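The three checks boil down to three boolean comparisons over a handful of facts. Here is a sketch of that logic in Python, assuming the fact names from the YAML above (workflows, lastStatus, monthlyTotal, monthlyBudget). The evaluate_scorecard function is hypothetical and only illustrates what "green" means; it is not how Backstage evaluates rules internally.

```python
def evaluate_scorecard(facts: dict) -> dict[str, bool]:
    """Evaluate the three cost-policy checks against one repo's facts."""
    total = facts.get("monthlyTotal")
    budget = facts.get("monthlyBudget")
    return {
        # Check 1: the cost-gate workflow file is present in the repo.
        "has-cost-gate-workflow": "cost-gate.yml" in facts.get("workflows", []),
        # Check 2: the last run on main succeeded.
        "cost-gate-passing": facts.get("lastStatus") == "success",
        # Check 3: projected cost is at or below budget; missing facts count as a fail.
        "under-monthly-budget": total is not None and budget is not None and total <= budget,
    }


def is_green(facts: dict) -> bool:
    """A repo is green only when every check passes."""
    return all(evaluate_scorecard(facts).values())


# Hypothetical facts for two repos.
adopted = {"workflows": ["ci.yml", "cost-gate.yml"], "lastStatus": "success",
           "monthlyTotal": 1800, "monthlyBudget": 2500}
not_adopted = {"workflows": ["ci.yml"], "lastStatus": "success"}

print(is_green(adopted), is_green(not_adopted))
```

Treating a missing fact as a failed check is a deliberate choice in this sketch: a repo that never adopted the gate shows up as a flag instead of being silently skipped.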

The point is not to shame teams. The point is to make non-adoption visible. If 18 of 20 services have the gate and 2 don't, the conversation is now "why don't those 2," not "we should probably do something about cost someday."

Results

Three months of running this on a six-person platform team owning roughly 30 Terraform repos.

Metric	Before	After
Cost surprises per month	2 to 4	0
Time from spend to detection	30 days	~90 seconds
Monthly waste prevented (avg)	n/a	~$2,400
PR cycle time impact	n/a	+47 sec p50
Engineer pushback after week 2	n/a	none

The $2,400/month number comes from summing the blocked deltas across the three months and dividing by three. The largest single block was a developer trying to spin up a db.r6i.4xlarge for a staging workload because the template they copied was production-sized ($1,100/month avoided). Another was a misconfigured module that would have provisioned three NAT Gateways instead of one (about $99/month in hourly charges for the three, roughly $66 of it wasted).
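The NAT Gateway figure is easy to sanity-check from the Infracost sample in Step 2, which prices one gateway at 32.85 for 730 hours. A quick worked calculation, covering hourly charges only and assuming the same rate applies to every gateway (data processing fees are not included):

```python
from decimal import Decimal

HOURS_PER_MONTH = Decimal("730")   # monthlyQuantity from the sample output
NAT_MONTHLY = Decimal("32.85")     # monthlyCost of one gateway from the sample output

# Implied hourly rate for a single gateway.
hourly_rate = NAT_MONTHLY / HOURS_PER_MONTH


def nat_monthly_cost(count: int) -> Decimal:
    """Hourly charges for `count` gateways over one month."""
    return NAT_MONTHLY * count


three_gateways = nat_monthly_cost(3)
extra_over_one = three_gateways - nat_monthly_cost(1)

print("hourly rate:", hourly_rate)
print("three gateways:", three_gateways)
print("extra vs. one gateway:", extra_over_one)
```

Three gateways come to 98.55 a month, of which 65.70 is the cost of the two that were not needed.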

The PR cycle time impact is real but small. The whole gate, including init, plan, and policy eval, runs in under a minute and a half on an ubuntu-latest runner. Nobody has complained about it.

What I would do differently

A few honest notes from running this in production.

Start with a soft fail. The first two weeks, set the conftest step to continue-on-error: true. Just collect data. You'll find PRs that legitimately exceed the threshold (a one-time data warehouse provisioning, a region expansion) and you want to know your real distribution before you draw a line. I drew the line at $500 because that's about the 90th percentile of the PRs I sampled.
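Once the soft-fail period has produced a couple of weeks of deltas, picking the line is a percentile calculation. Here is a sketch using the nearest-rank method. The sampled deltas are made up for illustration and are not my real data.

```python
def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Return the pct-th percentile (integer pct, 1-100) using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one sampled delta")
    ordered = sorted(values)
    # ceil(pct / 100 * n) computed in integer math to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def blocked_share(values: list[float], threshold: float) -> float:
    """Fraction of sampled PRs whose delta would exceed the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical monthly cost deltas (in dollars) collected while the gate was soft-failing.
sampled_deltas = [0, 12, 35, 48, 60, 95, 140, 210, 480, 1850]

p90 = nearest_rank_percentile(sampled_deltas, 90)
print("p90 delta:", p90)
print("share of PRs blocked at $500:", blocked_share(sampled_deltas, 500))
```

Rounding the 90th percentile up to a memorable number gives you a cap that most PRs never notice, while the outliers go through the exception path.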

Make the exception path easy. A hard cap with no escape valve creates resentment fast. The Rego policy supports a # cost-justified: <reason> HCL comment on individual resources. Use the comment for things you actually want to ship anyway, and the comment becomes an audit trail. Reviewers can ask "is this justification real" without blocking the gate.

Don't tag-shame. I avoided building anything that publicly ranks teams or developers by cost. Cost is correlated with workload, and the team running the data warehouse will always cost more than the team running the marketing site. Build scorecards on policy compliance, not absolute cost.

Re-evaluate the threshold every quarter. Infrastructure changes, your business changes, your tolerance for cost noise changes. The $500 cap that made sense in Q1 might be too tight in Q4 when you're spinning up a new region. Treat the number as a config value, not a constant.

Try it yourself

The full reference repo (Terraform examples, the Rego policy, the workflow, the Backstage scorecard YAML) is in progress at github.com/mateenali66/finops-guardrails-terraform. If you want to copy the policy out of this post and drop it into your own pipeline, you should be 30 minutes from a working gate.

If you also want CI cost visibility on top of provisioning cost visibility, CICosts is the companion piece. Same philosophy, different layer.

¹ LeanOps and platformengineering.org joint 2026 trend report, "State of Platform Engineering," reports that 73% of surveyed platform teams have introduced cost policy enforcement before merge. Cite the published report when you reference this number in your own work.
