Stop relying on 'Plan and Pray' for your infrastructure. This guide, originally published on devopsstart.com, explores how to build a rigorous IaC testing pyramid using native Terraform tests and Terratest.
Introduction
Most DevOps engineers treat terraform plan as their primary testing strategy. You run the plan, scan the output for any unexpected resource deletions, and if everything looks "mostly right", you run terraform apply and hope for the best. This is the "Plan and Pray" methodology. It works for small projects, but it fails miserably at scale. When you manage hundreds of modules across multiple environments, a single typo in a variable or a misunderstood dependency can trigger a catastrophic outage, delete a production database, or leave a security group wide open to the public internet.
Infrastructure as Code (IaC) is still code, yet we often treat it as configuration. To build resilient systems, you must stop relying on manual verification and start implementing a formal testing lifecycle. In this guide, you'll learn how to move beyond the plan phase by implementing a rigorous testing pyramid. We will cover static analysis to catch syntax errors and misconfigurations, the native terraform test framework for unit and integration testing, and Terratest for complex end-to-end validation.
By the end of this article, you will have a blueprint for a CI/CD pipeline that ensures no infrastructure change reaches production without being programmatically verified. For a deeper dive into the conceptual layers of this approach, check out our guide on /blog/testing-infrastructure-as-code-the-terraform-testing-pyramid.
The IaC Testing Pyramid and Static Analysis
In traditional software engineering, the testing pyramid suggests a high volume of fast unit tests, fewer integration tests and a handful of slow end to end tests. Infrastructure follows the same logic. At the base of the pyramid is Static Analysis. These tests are the fastest because they don't actually deploy any resources; they simply analyze the code for patterns, smells and security vulnerabilities.
Static analysis is your first line of defense. Tools like tflint check for provider-specific errors that the standard terraform validate command misses. For example, with its AWS ruleset enabled, tflint can flag an instance type that does not exist before you ever run a plan. Security scanners like Checkov or tfsec complement this by scanning your HCL for misconfigurations, such as S3 buckets without encryption or open SSH ports.
Consider a scenario where a developer accidentally opens port 22 to 0.0.0.0/0. A manual reviewer might miss this in a 500 line diff, but a static analyzer will catch it in milliseconds. To implement this, you can use the following commands in your local environment or CI runner:
# Install the plugins declared in .tflint.hcl (for example, the AWS ruleset)
tflint --init
# Run tflint to check for code quality issues
tflint
# Run checkov to scan for security regressions
checkov -d . --framework terraform
Example Checkov output for a security violation (abridged for illustration):
Check: CKV_AWS_24: Ensure no security groups allow ingress from 0.0.0.0/0 to port 22
File: /home/user/infra/main.tf:12-25
Guideline: Open SSH ports are a common entry point for attackers.
Result: FAILED
By integrating these tools, you eliminate "noise" from your later testing stages. You don't want to waste five minutes deploying a sandbox environment only to find out the deployment failed because of a typo in an AMI ID. Static analysis cleans the slate so your functional tests can focus on logic and state.
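To make concrete what a rule like the open SSH check looks for, here is a minimal sketch in Go that scans the JSON form of a plan (as produced by terraform show -json) for security groups exposing port 22 to 0.0.0.0/0. It is not how Checkov or tfsec are implemented, and it only inspects inline ingress blocks on aws_security_group resources; the function name findOpenSSH and the sample resource addresses are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planJSON models only the fields of the plan JSON that this sketch reads.
type planJSON struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			After json.RawMessage `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// ingressRule is one inline ingress block of an aws_security_group.
type ingressRule struct {
	FromPort   int      `json:"from_port"`
	ToPort     int      `json:"to_port"`
	Protocol   string   `json:"protocol"`
	CidrBlocks []string `json:"cidr_blocks"`
}

// findOpenSSH returns the addresses of security groups whose planned state
// allows port 22 from anywhere (hypothetical helper, not a Checkov API).
func findOpenSSH(planBytes []byte) ([]string, error) {
	var plan planJSON
	if err := json.Unmarshal(planBytes, &plan); err != nil {
		return nil, fmt.Errorf("parsing plan JSON: %w", err)
	}
	var offenders []string
	for _, rc := range plan.ResourceChanges {
		if rc.Type != "aws_security_group" || len(rc.Change.After) == 0 {
			continue
		}
		var after struct {
			Ingress []ingressRule `json:"ingress"`
		}
		if err := json.Unmarshal(rc.Change.After, &after); err != nil {
			return nil, fmt.Errorf("decoding %s: %w", rc.Address, err)
		}
		for _, rule := range after.Ingress {
			// Protocol "-1" means all protocols and ports.
			coversSSH := rule.Protocol == "-1" || (rule.FromPort <= 22 && 22 <= rule.ToPort)
			if coversSSH && contains(rule.CidrBlocks, "0.0.0.0/0") {
				offenders = append(offenders, rc.Address)
				break
			}
		}
	}
	return offenders, nil
}

func contains(items []string, want string) bool {
	for _, item := range items {
		if item == want {
			return true
		}
	}
	return false
}

func main() {
	// Hand-written sample; a real plan JSON contains many more fields.
	sample := []byte(`{"resource_changes":[{"address":"aws_security_group.bastion",
	  "type":"aws_security_group","change":{"after":{"ingress":[
	  {"from_port":22,"to_port":22,"protocol":"tcp","cidr_blocks":["0.0.0.0/0"]}]}}}]}`)
	offenders, err := findOpenSSH(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("security groups exposing SSH:", offenders)
}
```

In practice you would rely on the maintained rule sets of the scanners rather than hand-rolled checks like this one; the sketch only shows why such a rule is cheap to evaluate compared with a deployment.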
Deep Dive into the Native Terraform Test Framework
With the release of Terraform v1.6.0, HashiCorp introduced a native testing framework that fundamentally changes how we validate modules. Previously, testing required external wrappers or complex Go code. Now, you can write .tftest.hcl files that live alongside your code. This framework allows you to define "run" blocks that execute a plan or apply and then assert that the resulting attributes match your expectations.
The power of terraform test lies in its ability to simulate different scenarios using variables without modifying your actual production code. You can create a test file that validates that a module creates a public subnet when a specific flag is set to true and a private subnet when it is false. This is essentially unit testing for your infrastructure logic.
Here is a practical example of a test file for a VPC module. Assume you have a module that takes a cidr_block and a public_subnet_cidr variable. You can create a file named tests/vpc.tftest.hcl:
# tests/vpc.tftest.hcl

# Define the variables for the test run
variables {
  cidr_block         = "10.0.0.0/16"
  public_subnet_cidr = "10.0.1.0/24"
}

# Run a plan and verify the intended changes without deploying
run "verify_plan" {
  command = plan

  assert {
    condition     = aws_subnet.public.cidr_block == "10.0.1.0/24"
    error_message = "The public subnet CIDR does not match the input variable"
  }
}

# Run an apply to verify the resource actually exists in AWS
run "verify_deployment" {
  command = apply

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "The deployed VPC has an incorrect CIDR block"
  }
}
To execute these tests, initialize the working directory once so the providers are installed, then run the test command:
terraform init
terraform test
The framework handles provisioning and cleanup automatically. It keeps the state for each test file in memory rather than writing to your real backend, and destroys the resources after the assertions are evaluated. This prevents "state pollution" where test resources linger in your cloud account. When the command runs, you'll see output along these lines indicating which run blocks passed or failed:
tests/vpc.tftest.hcl... in progress
  run "verify_plan"... pass
  run "verify_deployment"... pass
tests/vpc.tftest.hcl... tearing down
tests/vpc.tftest.hcl... pass

Success! 2 passed, 0 failed.
This approach is significantly faster than manual testing. You can commit these tests to your repository, ensuring that any future change to the module doesn't break existing functionality. It transforms your infrastructure from a "hope it works" model to a "proven to work" model. For more details on the syntax, refer to the official Terraform documentation.
Advanced Integration Testing with Terratest
While the native terraform test framework is excellent for validating resource attributes, it cannot test the "behavior" of the infrastructure. For example, if you deploy a Load Balancer and an Auto Scaling Group, terraform test can tell you that the Load Balancer exists and has the correct DNS name. However, it cannot tell you if the Load Balancer actually returns a 200 OK response when you hit its URL.
This is where Terratest comes in. Terratest is a Go library that allows you to write tests that interact with your infrastructure from the outside. It follows a strict pattern: terraform init -> terraform apply -> Verify via API/SSH/HTTP -> terraform destroy. Because it's written in Go, you have access to the full power of a programming language, including the ability to make HTTP requests, check database connections or run shell scripts inside a VM.
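Consider a real-world use case: the load balancer scenario above. You want to ensure not only that the load balancer is "up", but that a request to it actually succeeds. The Go test below applies an example configuration, reads the load balancer's DNS name from a Terraform output, and issues a real HTTP request against it. The same pattern extends to a Kubernetes cluster deployment, where the verification step would use the Kubernetes client to check for node readiness.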
package test

import (
	"fmt"
	"testing"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestHttpServerDeployment(t *testing.T) {
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		// Path to the Terraform configuration under test
		TerraformDir: "../examples/http-server",
	})

	// At the end of the test, run 'terraform destroy'
	defer terraform.Destroy(t, terraformOptions)

	// Run 'terraform init' and 'terraform apply'
	terraform.InitAndApply(t, terraformOptions)

	// Get the output variable for the Load Balancer DNS
	lbDns := terraform.Output(t, terraformOptions, "lb_dns_name")

	// Perform a real HTTP request to verify the application is running.
	// HttpGet lives in Terratest's http-helper module and returns the
	// status code and the response body.
	url := fmt.Sprintf("http://%s:80", lbDns)
	httpCode, _ := http_helper.HttpGet(t, url, nil)
	assert.Equal(t, 200, httpCode)
}
To run this, you'll need Go installed on your machine:
go mod init infra_test
go mod tidy
go test -v -timeout 30m
The trade-off here is the learning curve. Your team must know Go, and the tests take much longer to run because they involve full deployments and network calls. I recommend a hybrid approach: use the native terraform test framework for roughly 80% of your checks (unit and attribute testing) and reserve Terratest for the 20% of critical paths that require behavioral validation.
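One source of flakiness in behavioral tests is timing: a freshly created load balancer often needs a while before its targets report healthy, so a single HTTP request made right after apply can fail even though the deployment is fine. The sketch below shows the retry idea using only the Go standard library. The names waitForStatus and httpProbe are hypothetical, and the local test server merely stands in for a real load balancer. Terratest's http-helper module also ships retrying variants of its HTTP helpers, which are usually the better choice inside an actual Terratest suite.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitForStatus calls probe until it returns the wanted status code
// or the attempts run out, sleeping between attempts.
func waitForStatus(probe func() (int, error), want, attempts int, sleep time.Duration) error {
	if attempts < 1 {
		return fmt.Errorf("attempts must be at least 1, got %d", attempts)
	}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		code, err := probe()
		if err == nil && code == want {
			return nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("got status %d, want %d", code, want)
		}
		if i < attempts {
			time.Sleep(sleep)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

// httpProbe returns a probe that issues a GET request against url.
func httpProbe(url string) func() (int, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	// Stand-in for a load balancer that is unhealthy for its first two requests.
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	err := waitForStatus(httpProbe(srv.URL), http.StatusOK, 5, 10*time.Millisecond)
	fmt.Println("healthy:", err == nil, "requests made:", atomic.LoadInt32(&calls))
}
```

Keeping the probe as a plain function makes the retry logic testable without any cloud resources, which helps keep this layer of the pyramid from becoming the flaky one.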
Best Practices for Infrastructure Testing
Implementing a testing strategy is not just about the tools; it is about the process. If your tests are flaky or slow, your team will eventually ignore them or disable them in CI. Follow these guidelines to maintain a healthy testing pipeline.
- Isolate Test State: Never run tests against your production or staging state files. Use a dedicated "sandbox" account or separate S3 buckets for test state. The native terraform test framework does this by default with temporary state, but for Terratest, you must explicitly manage separate workspaces.
- Automate Cleanup with Traps: Cloud costs can spiral if terraform destroy fails or is skipped. In your CI pipelines, use always() (GitHub Actions) or after_script (GitLab CI) to ensure that resources are torn down regardless of whether the test passed or failed.
- Test the "Negative" Path: Don't just test that the infrastructure works when inputs are correct. Write tests that provide invalid inputs to ensure your module fails gracefully with a helpful error message rather than a cryptic cloud provider error.
- Keep Tests Fast: Avoid deploying massive environments for every test. Break your infrastructure into small, composable modules. Test the VPC module separately from the Database module. This prevents a 30 minute wait time for a simple change.
- Shift Left: Run tflint and checkov on every commit. Only run full integration tests on Merge Requests. This ensures that developers get immediate feedback on syntax and security before the expensive deployment phase begins. If you are implementing a more complex delivery model, you might also look into /blog/testing-in-production-guide-to-progressive-delivery to manage risks after the deployment.
FAQ
Is terraform test a replacement for Terratest?
No, they serve different purposes. terraform test is an internal validator; it checks if Terraform is doing what you told it to do. Terratest is an external validator; it checks if the resulting infrastructure actually functions in the real world. Use the native framework for attribute validation and Terratest for end to end connectivity and application health checks.
How do I handle secrets in my infrastructure tests?
Never hardcode secrets in .tftest.hcl or Go files. Use environment variables (e.g., TF_VAR_db_password) or a secret manager. In CI/CD, inject these secrets as protected variables. Since test environments should be ephemeral and use mock data, avoid using production secrets entirely.
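As a small illustration of the environment variable approach, a Go test suite can check up front that the secrets it expects are present and stop with a clear message if they are not, instead of failing halfway through a deployment. This is a minimal sketch using only the standard library; requireSecrets is a hypothetical helper, and TF_VAR_api_token in the usage note is an invented name. Note that it reports only the names of missing variables and never prints their values.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requireSecrets returns the values of the named variables, or an error
// listing every variable that is unset or empty. The lookup function is
// injected so the check can be exercised without touching the real environment.
func requireSecrets(lookup func(string) (string, bool), names ...string) (map[string]string, error) {
	values := make(map[string]string, len(names))
	var missing []string
	for _, name := range names {
		value, ok := lookup(name)
		if !ok || value == "" {
			missing = append(missing, name)
			continue
		}
		values[name] = value
	}
	if len(missing) > 0 {
		return nil, fmt.Errorf("missing required environment variables: %s", strings.Join(missing, ", "))
	}
	return values, nil
}

func main() {
	// Terraform reads TF_VAR_db_password from the environment on its own;
	// this check only verifies that CI actually injected it.
	if _, err := requireSecrets(os.LookupEnv, "TF_VAR_db_password"); err != nil {
		fmt.Println("refusing to run infrastructure tests:", err)
		return
	}
	fmt.Println("all required secrets are present")
}
```

A suite that needs several secrets can pass them all in one call, for example requireSecrets(os.LookupEnv, "TF_VAR_db_password", "TF_VAR_api_token"), so that one run reports every missing variable at once.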
Won't running integration tests every time be too expensive?
Yes, if you deploy everything. To mitigate cost, use "Smoke Tests" for most commits and "Full Integration Tests" only before merging to the main branch. Additionally, use smaller instance sizes (e.g., t3.micro instead of m5.large) in your test variables to keep costs minimal.
What is the best way to handle provider authentication in CI?
The most secure way is using OIDC (OpenID Connect). For GitHub Actions, use the official aws-actions/configure-aws-credentials action on AWS (or google-github-actions/auth on Google Cloud) to assume a role via a short-lived token. This eliminates the need to store long-lived IAM access keys in your repository secrets.
Conclusion
Moving beyond "Plan and Pray" requires a cultural shift. You must stop viewing infrastructure as a static set of files and start viewing it as a software product that requires a testing lifecycle. By implementing a pyramid consisting of static analysis, native attribute tests and behavioral integration tests, you can deploy changes with confidence.
The transition doesn't happen overnight. Start by adding tflint and checkov to your local workflow today. Next, write a single .tftest.hcl file for your most critical module to validate its outputs. Finally, integrate these into a GitHub Action or GitLab CI pipeline so that a failing test blocks a merge. This discipline reduces the "fear of the apply" and allows your team to iterate faster without risking the stability of your production environment. Your next step is to audit your current modules and identify the one that causes the most production incidents; that is where your first test suite belongs.