Stop relying on hope-based deployments for your infrastructure. This guide, originally published on devopsstart.com, explores how to build a professional IaC testing pyramid using TFLint, Terraform's native test framework, and Terratest.
Introduction
Most DevOps engineers treat Infrastructure as Code (IaC) like a script rather than software. You write your HCL or YAML, run a terraform plan to see if it looks right, and then pray that the terraform apply doesn't destroy a production database because of a misplaced dependency. This "hope-based deployment" strategy works when you're managing three servers, but it fails catastrophically at scale. In environments with >100 modules across multiple clouds, a single typo in a variable can trigger a cascade of resource deletions.
Testing IaC is fundamentally different from testing application code. You aren't just checking logic in a virtual machine; you're interacting with real cloud APIs that have latency, rate limits and cost implications. To solve this, you need a structured approach that moves from fast, cheap checks to slow, expensive validations.
In this guide, you'll learn how to implement a professional IaC testing pyramid. We will move from static analysis and linting to the native terraform test framework introduced in Terraform v1.6.0, and finally to full integration testing with Terratest. By the end, you'll have a blueprint for a CI/CD pipeline that catches security vulnerabilities and logic errors before a single cloud resource is actually provisioned.
The IaC Testing Pyramid: Static Analysis and Linting
The base of your testing pyramid must be static analysis. These tests are fast because they don't require a provider connection or a cloud account. They analyze the code as text or an abstract syntax tree. If you rely solely on integration tests, your feedback loop will take 20 minutes per commit and your cloud bill will spike.
The first line of defense is terraform validate. Built into the core binary, it checks for internal consistency, such as whether you've omitted required arguments or used invalid attribute names. However, validation is not linting. To ensure your code follows best practices, you need tflint. It catches provider-specific errors that terraform validate misses, such as using an invalid EC2 instance type for a specific AWS region.
Beyond syntax, you need security scanning. Tools like checkov or tfsec scan your code for common misconfigurations, such as S3 buckets with public read access or security groups allowing SSH (port 22) from 0.0.0.0/0. This is where you prevent the security incident before it becomes a ticket.
# Install tflint and checkov via Homebrew
brew install tflint
brew install checkov
# Initialize tflint to download required plugins
tflint --init
# Run tflint to check for provider-specific issues
tflint
# Example tflint output:
# Error: aws_instance.web.instance_type is not a valid instance type in us-east-1
# /main.tf:12:24
# Run checkov to scan the current directory for security vulnerabilities
checkov -d .
# Example checkov output:
# Check: CKV_AWS_20: Ensure S3 bucket has block public access settings
# File: /s3.tf:5-15
# Result: FAILED
By integrating these tools into a pre-commit hook or the first stage of your GitHub Actions pipeline, you ensure that only clean code reaches the planning phase. This reduces the cognitive load on reviewers and stops obvious errors from ever hitting the cloud API. For more details on official standards, refer to the HashiCorp Terraform documentation.
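To make this concrete, here is a minimal GitHub Actions job for the static-analysis stage. This is an illustrative sketch, not a drop-in workflow: the file path, action versions, and repository layout are assumptions you should adapt.

```yaml
# .github/workflows/static-analysis.yml (illustrative sketch)
name: static-analysis
on: [pull_request]

jobs:
  lint-and-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      # Validate without touching any backend or cloud credentials
      - name: Terraform validate
        run: |
          terraform init -backend=false
          terraform validate

      - name: Set up tflint
        uses: terraform-linters/setup-tflint@v4

      - name: Run tflint
        run: |
          tflint --init
          tflint

      # Scan for security misconfigurations
      - name: Run checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
```

Because this job needs no cloud credentials, it can run on every push and fail fast before any plan or apply stage starts.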
Shifting Left with the Native Terraform Test Framework
For years, the industry struggled with "unit testing" for Terraform. You either had to use complex mocks or deploy real resources. Terraform v1.6.0 changed this by introducing the terraform test command. This framework allows you to write tests in HCL, keeping your test logic close to your infrastructure logic without needing to switch to a different language like Go or Python.
The native test framework operates by creating a temporary environment, applying the module, running assertions against the resulting state and then destroying the resources. While this interacts with the cloud, it is designed as a component test because it targets a specific module in isolation.
The power of terraform test lies in its ability to validate that your module outputs the expected values based on specific inputs. For example, if you have a module that should create a "small" or "large" instance based on a size variable, you can write a test suite to verify both scenarios.
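For context, assume the module under test looks roughly like the following hypothetical sketch; the variable names mirror the test file shown next, and the AMI ID is a placeholder.

```hcl
# main.tf (hypothetical module under test)
variable "instance_size" {
  type    = string
  default = "t3.micro"
}

variable "environment" {
  type    = string
  default = "dev"
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = var.instance_size

  tags = {
    Environment = var.environment
  }
}
```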
# Create a file at tests/web_server.tftest.hcl

variables {
  instance_size = "t3.micro"
  environment   = "test"
}

run "verify_instance_type" {
  command = apply

  assert {
    condition     = aws_instance.web.instance_type == "t3.micro"
    error_message = "The instance type did not match the requested size"
  }
}

run "verify_tags" {
  command = apply

  assert {
    condition     = aws_instance.web.tags["Environment"] == "test"
    error_message = "The Environment tag was not applied correctly"
  }
}
To execute these tests, run:
terraform test
The output shows which assertions passed and which failed, providing a deterministic way to verify module logic. This is a massive improvement over manually checking the Terraform state file or clicking through the AWS Console. It allows you to refactor your modules with confidence, knowing that the core requirements are still being met.
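Note that run blocks default to command = apply, but the framework also supports command = plan, which evaluates assertions against planned values without creating any real resources. That makes it a cheap way to check module logic on every commit. A hedged sketch (the file name and variable are assumptions matching the earlier example):

```hcl
# tests/validation.tftest.hcl (illustrative sketch)
run "verify_planned_instance_type" {
  # Evaluate assertions against the plan only; nothing is provisioned
  command = plan

  variables {
    instance_size = "t3.micro"
  }

  assert {
    condition     = aws_instance.web.instance_type == "t3.micro"
    error_message = "Planned instance type should match the input variable"
  }
}
```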
High-Fidelity Integration Testing with Terratest
While terraform test is great for module logic, it doesn't tell you if your infrastructure actually works in the real world. Just because Terraform says an ALB (Application Load Balancer) was created successfully doesn't mean the listener is correctly routing traffic to your target group. This is where Terratest comes in.
Terratest is a Go library that allows you to write integration tests that treat your infrastructure as a black box. The workflow is: terraform init → terraform apply → perform a real HTTP request or SSH check → terraform destroy. This is the gold standard for mission-critical infrastructure because it validates the actual data plane, not just the control plane.
Imagine you are deploying a Kubernetes cluster. You want to ensure that the API server is reachable and you can deploy a pod. Using Terratest, you can automate the deployment and use the Kubernetes Go client to verify the cluster state.
// Example Terratest snippet in Go
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestHttpServer(t *testing.T) {
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		// Path to the Terraform module to be tested
		TerraformDir: "../examples/web-server",
	})

	// Ensure resources are destroyed at the end of the test regardless of outcome
	defer terraform.Destroy(t, terraformOptions)

	// Run 'terraform init' and 'terraform apply'
	terraform.InitAndApply(t, terraformOptions)

	// Retrieve the output variable 'public_ip' from Terraform
	publicIp := terraform.Output(t, terraformOptions, "public_ip")

	// Perform a real HTTP request to verify the server is actually responding.
	// Retries up to 10 times, 5 seconds apart, to account for slow boot times;
	// the helper fails the test if a 200 response never arrives.
	http_helper.HttpGetWithRetryWithCustomValidation(t, "http://"+publicIp, nil, 10, 5*time.Second,
		func(statusCode int, _ string) bool {
			return statusCode == 200
		},
	)
}
Running this test requires a Go environment:
go mod init iac_test
go mod tidy
go test -v -timeout 30m
The trade-off here is time and cost. Integration tests are slow and can be expensive if you accidentally leave resources running. Always use a defer terraform.Destroy block to ensure infrastructure is torn down.
Best Practices for IaC Testing Pipelines
Implementing these tools is only half the battle. The other half is integrating them into a pipeline that doesn't frustrate developers. If your pipeline takes 40 minutes to tell a developer they missed a tag, they will start bypassing the process.
- Enforce a Strict Order of Execution: Your pipeline should follow the pyramid exactly: terraform validate → tflint → checkov → terraform test → Terratest. If any step fails, stop immediately. There is no point in running a 10-minute integration test if the code fails a 2-second lint check.
- Use Ephemeral Environments: Never run integration tests in a shared staging account. Use a dedicated sandbox account or create a dynamic project for every Pull Request. This prevents state locking conflicts and drift caused by multiple tests fighting over the same resource names.
- Parallelize Integration Tests: Since Terratest can be slow, run your test suites in parallel using Go's built-in testing capabilities or by spinning up multiple GitHub Action runners. This can reduce the feedback loop from hours to minutes.
- Implement Policy as Code (PaC): Use Open Policy Agent (OPA) or Terraform Sentinel to create organizational guardrails. While checkov finds general security holes, OPA can enforce business rules, such as "All resources in the Production account must have a CostCenter tag" or "Database instances cannot be larger than db.m5.large in the Dev environment."
- Version Your Modules: Never point your environments to the main branch of a module. Use semantic versioning (e.g., source = "...?ref=v1.2.0"). This allows you to test a new version in a sandbox without risking production stability.
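As an example of version pinning, a module source can reference an immutable Git tag rather than a branch (the repository URL here is hypothetical):

```hcl
module "network" {
  # Pin to an immutable release tag instead of a moving branch
  source = "git::https://github.com/example-org/terraform-aws-network.git?ref=v1.2.0"
}
```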
FAQ
Does terraform test replace Terratest?
No. terraform test validates that your HCL logic produces the expected state (e.g., "Did I create 3 subnets?"). Terratest validates that the infrastructure actually functions (e.g., "Can I actually ping the server in that subnet?"). Use terraform test for fast module validation and Terratest for critical path integration testing.
How do I handle secrets during integration tests?
Never hardcode secrets. Use environment variables or a secret manager like AWS Secrets Manager or HashiCorp Vault. Terratest can read environment variables using os.Getenv("AWS_ACCESS_KEY_ID"). In CI/CD, use GitHub Secrets or GitLab CI variables to inject these credentials into the runner.
Won't running integration tests on every commit be too expensive?
It can be. To mitigate this, implement Conditional Testing. Run linting and terraform test on every commit, but only run the full Terratest suite when a Pull Request is opened or code is merged into develop or main. You can also use "smoke tests" that only verify the most critical 10% of your infrastructure.
What is the best way to manage state for ephemeral tests?
Use a remote backend that supports dynamic keys. For example, if using AWS S3, append the GitHub Run ID to the state key: key = "tests/terraform.tfstate-${GITHUB_RUN_ID}". This ensures every test run has its own isolated state file, preventing collisions between concurrent CI jobs.
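Assuming an S3 backend, one way to sketch this is a partial backend configuration, with the dynamic key supplied at init time (the bucket name is hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket = "my-terraform-test-state" # hypothetical bucket
    region = "us-east-1"
    # 'key' is intentionally omitted here and supplied per CI run, e.g.:
    #   terraform init -backend-config="key=tests/terraform.tfstate-${GITHUB_RUN_ID}"
  }
}
```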
Conclusion
Moving from a manual check workflow to an automated IaC testing pyramid is the single biggest leap you can take in your DevOps maturity. By implementing static analysis with tflint and checkov, you eliminate the low-hanging fruit of security risks and syntax errors. By leveraging the native terraform test framework, you ensure your module logic is sound. Finally, by using Terratest, you prove that your infrastructure actually delivers the intended service.
The transition doesn't happen overnight. Start by adding tflint and terraform validate to your pre-commit hooks. Once those are stable, write three to five terraform test files for your most used modules. Finally, identify your most critical business path (e.g., the API gateway to the database) and write a Terratest integration test for it. This layered approach reduces risk, increases deployment velocity and ensures that infrastructure as code actually benefits from the rigor of software engineering.