DevOps Start

Posted on Apr 24 • Originally published at devopsstart.com

Secure Terraform PRs with an Architecture Firewall

Stop the 'merge and pray' workflow. This guide explores how to implement an automated architecture firewall for your Terraform PRs using OPA.

Introduction

An architecture firewall is a governance layer integrated into your CI/CD pipeline that automatically blocks infrastructure changes violating security or organizational standards before they reach your environment. Unlike a network firewall that filters packets, this firewall filters Pull Requests (PRs). It transforms your infrastructure requirements from passive documentation in a wiki into active, executable code that cannot be ignored.

In this article, you will learn how to move beyond the "merge and pray" workflow by implementing Policy as Code (PaC). We will explore the technical bridge between a terraform plan and automated validation using tools like Open Policy Agent (OPA) and Checkov. You'll discover how to create a pipeline that converts Terraform plans to JSON, evaluates them against strict guardrails and provides immediate feedback to developers via PR comments. By the end, you will have a strategy to enforce encryption, restrict public access and prevent accidental resource deletion without slowing down your engineering velocity. This approach aligns with modern Terraform testing best practices, ensuring that your cloud footprint remains secure by design rather than by chance.

Why Manual PR Reviews Fail the Architecture Test

Relying solely on human peer reviews to catch security holes is a recipe for a production incident. In high-velocity environments, reviewers suffer from fatigue. When a developer submits a PR with 500 lines of HCL, a reviewer might miss a single 0.0.0.0/0 in a security group or a missing server_side_encryption_configuration block on an S3 bucket. Humans are great at reviewing logic and intent, but they are terrible at consistently auditing thousands of lines of configuration against a 50-page security compliance PDF.

The "Merge and Pray" workflow creates a dangerous gap where "architectural drift" occurs. This happens when the actual state of your cloud deviates from your intended security posture because a few "small" exceptions were merged over time. To solve this, you need an automated gate that operates on the terraform plan output. The plan is the most reliable input for that gate because it describes what Terraform intends to do after resolving variables and modules and comparing the configuration against the current state of the cloud.

For example, if you are using Terraform v1.9.0+ (JSON plan output has been available since v0.12), you can generate a machine-readable plan that your architecture firewall can analyze. This removes the ambiguity of reviewing the .tf files alone, which don't show the final resolved values.

# Generate the binary plan file
terraform plan -out=tfplan

# Convert the binary plan to JSON for policy evaluation
terraform show -json tfplan > tfplan.json

By shifting the audit from the code to the plan, you ensure that the firewall sees the final result, not just the intent. This is the foundation of a robust governance strategy.
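To get a feel for what the firewall actually sees, it can help to inspect the plan JSON before writing any policy. The following is a minimal Python sketch, assuming a tfplan.json produced by the commands above; the helper names load_plan and summarize_actions are illustrative and not part of any tool.

```python
import json
from collections import Counter


def load_plan(path: str) -> dict:
    """Read the output of `terraform show -json tfplan` from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize_actions(plan: dict) -> Counter:
    """Count planned changes, keyed by (resource type, joined actions)."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        # A replacement shows up as two actions, e.g. ["delete", "create"]
        actions = "+".join(rc["change"]["actions"])
        counts[(rc["type"], actions)] += 1
    return counts


# Example usage (assumes tfplan.json exists in the working directory):
#   for (rtype, actions), n in sorted(summarize_actions(load_plan("tfplan.json")).items()):
#       print(f"{rtype}: {actions} x{n}")
```

Every policy in the rest of this article reads the same resource_changes list, so a quick summary like this is a cheap way to confirm the plan contains what you expect.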

Implementing the Policy as Code Engine

To build an architecture firewall, you must choose a Policy as Code (PaC) engine. For simple, industry-standard checks, tools like Checkov or TFLint are excellent because they come with hundreds of pre-built policies. However, for complex organizational logic (such as "Production databases must be deployed in three availability zones and have a specific naming convention"), you need a general-purpose policy engine like Open Policy Agent (OPA). OPA uses a language called Rego to query JSON data.

The technical flow is straightforward: your CI pipeline runs the plan, converts it to JSON and pipes that JSON into OPA. If the Rego policy returns a "deny" result, the CI pipeline fails and the PR is blocked from merging. This turns your security requirements into a unit test for your infrastructure.

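Below is a practical example of a Rego policy that flags any aws_s3_bucket whose planned state lacks an inline server-side encryption block. Note that AWS provider v4 and later deprecate the inline block in favor of the separate aws_s3_bucket_server_side_encryption_configuration resource, so adapt the check to the provider version you use.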

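package terraform.analysis

import future.keywords.contains
import future.keywords.if

# Allow the plan only when no deny message was produced
default allow := false

allow if count(deny) == 0

# Violation: S3 bucket without encryption
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"

    # In Terraform JSON, 'after' contains the planned state;
    # it is null for pure deletions, which this rule skips
    resource.change.after != null
    not has_encryption(resource)
    msg := sprintf("Security Violation: S3 bucket %s must have encryption enabled", [resource.address])
}

# True only when the planned state has a non-empty encryption block.
# A missing attribute leaves this undefined, so the 'not' above fires.
has_encryption(resource) if {
    count(resource.change.after.server_side_encryption_configuration) > 0
}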

To run this against your plan in a GitHub Action or GitLab CI runner using OPA v0.60.0, you would execute:

# Run OPA evaluation and capture the deny rules
opa eval -i tfplan.json -d policy.rego "data.terraform.analysis.deny"

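If the output contains any messages, the firewall has triggered and the build should fail. Note that opa eval exits with status 0 by default even when deny is non-empty, so the pipeline has to inspect the result itself or pass the --fail-defined flag with a query such as data.terraform.analysis.deny[_]. Wired up this way, every plan that passes through the pipeline is checked against the same rule, regardless of who reviews the PR.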
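One way to turn the evaluation result into a pass/fail signal is a small wrapper around OPA's JSON output. This is a sketch under stated assumptions: it expects the output of opa eval --format json for the deny query, and the function names extract_denials and gate are hypothetical.

```python
import json
import sys


def extract_denials(opa_output: dict) -> list[str]:
    """Collect deny messages from `opa eval --format json` output."""
    messages = []
    for result in opa_output.get("result", []):
        for expr in result.get("expressions", []):
            value = expr.get("value")
            if isinstance(value, list):
                # A Rego set is serialized as a JSON array
                messages.extend(str(m) for m in value)
            elif isinstance(value, dict):
                # Rules written as deny[msg] may serialize as an object keyed by message
                messages.extend(str(k) for k in value)
    return sorted(messages)


def gate(opa_output: dict) -> int:
    """Print each violation and return a process exit code (1 = block the PR)."""
    denials = extract_denials(opa_output)
    for msg in denials:
        print(f"DENY: {msg}")
    return 1 if denials else 0


# Example usage in CI (gate.py is a hypothetical file name):
#   opa eval --format json -i tfplan.json -d policy.rego "data.terraform.analysis.deny" | python gate.py
# with this at the bottom of gate.py:
#   sys.exit(gate(json.load(sys.stdin)))
```

Keeping the exit-code decision in one small function makes it easy to unit test the gate without running Terraform or OPA.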

Real-World Application: Preventing Production Catastrophes

A common production nightmare is the accidental deletion of a critical resource, such as a primary database or a core VPC, due to a renaming error or a module refactor. A manual reviewer might not realize that changing a resource name in Terraform results in a "destroy and recreate" action. An architecture firewall can catch this by analyzing the actions array in the terraform plan JSON.

By writing a policy that flags any delete action on resources tagged as critical, you create a safety net. This doesn't mean you can never delete things; it means you must explicitly acknowledge the risk, perhaps through a "break-glass" label on the PR or a manual override from a Lead Architect.

Consider this scenario: a developer changes the name of an RDS instance to match a new naming convention. Terraform sees this as deleting the old DB and creating a new one. Without a firewall, the PR looks like a simple string change. With a firewall, the system sees a delete action on an aws_db_instance and blocks it.

Here is how you would implement a "Protection" rule in Rego to block deletions of production databases:

package terraform.analysis

import future.keywords.contains
import future.keywords.if

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"

    # Check if the action includes 'delete' (a replacement is ["delete", "create"])
    resource.change.actions[_] == "delete"

    # Only apply this to production environments by checking input variables;
    # the plan JSON stores each variable as an object with a 'value' field
    input.variables.environment.value == "prod"

    msg := sprintf("CRITICAL FAILURE: Attempting to delete production database %s. This action is blocked by the Architecture Firewall.", [resource.address])
}
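
The same protection logic, including the "break-glass" override mentioned earlier, can be prototyped in Python before it is ported to Rego. This is a rough sketch: the PROTECTED_TYPES set, the pr_labels argument, and the break-glass label name are assumptions for illustration, and how labels reach the script depends on your CI system.

```python
# Resource types treated as critical (assumption for this sketch)
PROTECTED_TYPES = {"aws_db_instance"}


def destructive_changes(plan: dict, pr_labels=frozenset(), override_label="break-glass") -> list:
    """Return addresses of protected resources that the plan would delete in prod."""
    # An explicit override label acknowledges the risk and disables the block
    if override_label in pr_labels:
        return []

    # Plan JSON stores each input variable as {"value": ...}
    env = plan.get("variables", {}).get("environment", {}).get("value")
    if env != "prod":
        return []

    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in PROTECTED_TYPES and "delete" in rc["change"]["actions"]
    ]
```

A rename that Terraform plans as a replacement carries the actions ["delete", "create"], so it is caught by the same "delete" membership test as an outright removal.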

Integrating this into your workflow requires a tight loop. You can use automation for Terraform reviews to post these specific error messages directly as comments on the offending line of the PR. This transforms the "No" from the security team into a helpful, automated suggestion from the platform.
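
As a simpler starting point than line-level annotations, the deny messages can be collected into a single summary comment. The sketch below only builds the comment body; posting it is left to whatever PR automation you already use, and format_pr_comment is a hypothetical helper name.

```python
def format_pr_comment(denials: list) -> str:
    """Build a Markdown comment body from a list of deny messages."""
    if not denials:
        return "Architecture firewall: all policies passed."

    lines = [f"Architecture firewall blocked this PR ({len(denials)} violation(s)):", ""]
    # One bullet per violation, in the order the policy engine returned them
    lines += [f"- {msg}" for msg in denials]
    return "\n".join(lines)
```

Because the function is pure, the wording of the feedback can be reviewed and tested like any other code.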

Best Practices for Architecture Guardrails

Implementing a firewall can create friction if handled poorly. If every PR is blocked by 50 different warnings, developers will find ways to bypass the system. Use these strategies to balance security with velocity.

Distinguish Between Warnings and Failures. Not every policy should block a merge. Use "Advisory" levels for things like "missing cost-center tag" (warning) and "Critical" levels for "open SSH port" (hard fail). This prevents the firewall from becoming a nuisance.
Version Your Policies. Treat your Rego or Checkov policies like application code. Store them in a separate Git repository, version them and test them against a suite of "known-bad" Terraform plans to ensure no regressions in your security posture.
Provide Remediation Guidance. A failure message like "Policy violation: SEC-01" is useless. Your firewall should return something like "Security Violation: Port 22 is open to 0.0.0.0/0. Please restrict this to the corporate VPN range (10.x.x.x)."
Implement an Exception Process. There will always be a legitimate reason to break a rule. Create a standardized way to grant exceptions, such as requiring a specific metadata tag (exception_id = "SEC-123") that the policy engine is programmed to ignore.
Shift Left with Local Pre-commit Hooks. Don't make the CI pipeline the first time a developer sees a failure. Provide a pre-commit configuration using tools like TFLint and Checkov so they can catch errors on their local machine.
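
The warning-versus-failure split and the exception process can be sketched together. The Finding record, the severity names, and the shape of tags_by_address below are assumptions for illustration; a real setup would populate them from the policy engine's output and the planned resource tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str    # e.g. "SEC-01"
    severity: str   # "advisory" or "critical"
    address: str    # Terraform resource address
    message: str    # remediation guidance shown to the developer


def triage(findings, tags_by_address):
    """Split findings into (blocking, warnings).

    Critical findings block the merge unless the resource carries a
    non-empty exception_id tag; everything else is reported as a warning.
    """
    blocking, warnings = [], []
    for f in findings:
        tags = tags_by_address.get(f.address) or {}
        if f.severity == "critical" and not tags.get("exception_id"):
            blocking.append(f)
        else:
            warnings.append(f)
    return blocking, warnings
```

In this sketch an excepted finding is downgraded to a warning rather than dropped, so granted exceptions stay visible in the PR feedback.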

FAQ

Does the architecture firewall replace the need for manual peer reviews?

No, it augments them. The firewall handles the "objective" checks (security, compliance, syntax) so that human reviewers can focus on the "subjective" checks (architecture design, business logic and efficiency). It removes the tedious parts of the review process, allowing engineers to have higher-level discussions about the implementation.

Which tool should I choose: OPA, Checkov, or Sentinel?

If you are using Terraform Cloud/Enterprise, Sentinel is the native choice and offers the deepest integration. If you need a free, industry-standard scanner that works out-of-the-box with minimal configuration, go with Checkov. If you have complex, custom business logic that spans multiple cloud providers and requires a powerful query language, Open Policy Agent (OPA) is the gold standard. I have seen mature platform teams use Checkov for general security and OPA for custom organizational guardrails.

How do I prevent the firewall from slowing down my deployment pipeline?

The opa eval step itself is usually quick; most of the added time comes from terraform plan, which many pipelines already run and which scales with the size of your state. To further optimize, you can run these checks in parallel with other tests. Additionally, by implementing local pre-commit hooks, you reduce the number of failed CI runs, meaning the pipeline mostly handles "clean" code.

Can I use this to manage costs?

Yes, this is a powerful use case. You can write policies that analyze the resource_changes for expensive instance types. For example, you can block any PR that attempts to spin up an aws_instance of type p4d.24xlarge unless the project has a specific "high-compute" approval tag. This prevents "bill shock" by catching expensive mistakes before the resources are actually provisioned.
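
A cost guardrail of this kind might look like the following sketch. The EXPENSIVE_TYPES set and the use of a resource tag named high-compute as the approval marker are assumptions; the article only says that some form of approval tag exists.

```python
# Instance types that require explicit approval (assumption for this sketch)
EXPENSIVE_TYPES = {"p4d.24xlarge"}
# Tag key that marks an approved high-compute resource (hypothetical)
APPROVAL_TAG = "high-compute"


def expensive_instances(plan: dict) -> list:
    """Return addresses of aws_instance resources using an unapproved expensive type."""
    hits = []
    for rc in plan.get("resource_changes", []):
        if rc["type"] != "aws_instance":
            continue
        # 'after' is null for deletions, and 'tags' may be null when unset
        after = rc["change"].get("after") or {}
        if after.get("instance_type") not in EXPENSIVE_TYPES:
            continue
        tags = after.get("tags") or {}
        if APPROVAL_TAG not in tags:
            hits.append(rc["address"])
    return hits
```

The list of flagged addresses can feed the same deny path as the security rules, so cost checks need no separate pipeline stage.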

Conclusion

Building an architecture firewall is about shifting your mindset from "trusting the reviewer" to "trusting the system." By implementing Policy as Code, you ensure that your security standards are consistently applied across every single PR, regardless of who is reviewing it. This creates a scalable governance model that allows your platform team to support hundreds of developers without becoming a bottleneck.

To get started, don't try to automate your entire security handbook at once. Start with the "low hanging fruit": block public S3 buckets and open SSH ports. Once your team is comfortable with the automated feedback loop, gradually introduce more complex architectural rules.

Your next steps are to install OPA or Checkov, integrate a terraform show -json step into your GitHub Actions or GitLab CI and write your first "deny" rule. This transition from manual oversight to automated guardrails is the defining characteristic of a mature Platform Engineering organization.
