How to Stop Wondering If Your Infrastructure Works and Start Knowing It Does
Day 18 of the 30-Day Terraform Challenge — and today I finally solved the problem that's been bothering me since Day 1.
How do you know your infrastructure actually works?
Manual testing gave me confidence, but it didn't scale. Every change meant re-running the same checks. Every environment meant more time. Every team member meant more coordination.
Today I automated everything.
The Three Layers of Testing
| Test Type | Tool | Deploys Real Infra | Time | Cost |
|---|---|---|---|---|
| Unit | terraform test | No | Seconds | Free |
| Integration | Terratest | Yes | Minutes | Low |
| End-to-End | Terratest | Yes | 15-30 min | Medium |
Each layer catches different failures. Together, they create confidence.
Layer 1: Unit Tests (Fast, Free, No AWS)
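Terraform 1.6+ includes a native testing framework. No external dependencies. With command = plan, no real infrastructure is deployed; the assertions run against the plan. (The AWS provider may still need credentials, or a mock_provider block from Terraform 1.7 onward, to produce that plan.)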
# webserver_cluster_test.tftest.hcl
variables {
  cluster_name  = "test-cluster"
  instance_type = "t3.micro"
  environment   = "dev"
}

run "validate_asg_name" {
  command = plan

  assert {
    condition     = can(regex("^test-cluster-asg-", aws_autoscaling_group.web.name_prefix))
    error_message = "ASG name prefix must start with cluster_name"
  }
}

run "validate_instance_type" {
  command = plan

  assert {
    condition     = aws_launch_template.web.instance_type == "t3.micro"
    error_message = "Instance type must match variable"
  }
}

run "validate_tags" {
  command = plan

  assert {
    condition     = aws_lb.web.tags["Environment"] == "dev"
    error_message = "ALB must have Environment tag = dev"
  }
}
Run with: terraform test
What it catches: Syntax errors, naming conventions, tag consistency, logic mistakes.
What it doesn't catch: DNS propagation, health check failures, actual HTTP responses.
Layer 2: Integration Tests (Real Infra, Real Assertions)
Integration tests deploy real infrastructure, run assertions against it, then destroy it.
// test/webserver_cluster_test.go
package test

import (
	"fmt"
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestWebserverClusterIntegration(t *testing.T) {
	t.Parallel()

	// A unique suffix keeps parallel runs from colliding on resource names.
	uniqueID := random.UniqueId()
	clusterName := fmt.Sprintf("test-cluster-%s", uniqueID)

	terraformOptions := &terraform.Options{
		TerraformDir: "../manual-test",
		Vars: map[string]interface{}{
			"cluster_name":  clusterName,
			"instance_type": "t3.micro",
			"min_size":      1,
			"max_size":      2,
			"environment":   "dev",
		},
	}

	// CRITICAL: Always destroy, even if test fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")
	url := fmt.Sprintf("http://%s", albDnsName)

	// Retry for 5 minutes: 30 attempts, 10 seconds apart (ALB takes time)
	http_helper.HttpGetWithRetryWithCustomValidation(
		t, url, nil, 30, 10*time.Second,
		func(status int, body string) bool {
			return status == 200
		},
	)
}
Run with: go test -v -timeout 30m ./...
What it catches: ALB DNS resolution, health check passing, actual HTTP responses, deployment ordering.
The critical piece: defer terraform.Destroy ensures cleanup even if tests fail. No orphaned resources. No surprise AWS bills.
Layer 3: End-to-End Tests (Full Stack)
E2E tests deploy everything — VPC, database, application — and verify the whole system works.
func TestFullStackEndToEnd(t *testing.T) {
	t.Parallel()
	uniqueID := random.UniqueId()

	// Deploy VPC
	vpcOptions := &terraform.Options{
		TerraformDir: "../modules/networking/vpc",
		Vars: map[string]interface{}{
			"vpc_name": fmt.Sprintf("test-vpc-%s", uniqueID),
		},
	}
	defer terraform.Destroy(t, vpcOptions)
	terraform.InitAndApply(t, vpcOptions)

	vpcID := terraform.Output(t, vpcOptions, "vpc_id")
	subnetIDs := terraform.OutputList(t, vpcOptions, "private_subnet_ids")

	// Deploy app using VPC outputs
	appOptions := &terraform.Options{
		TerraformDir: "../modules/services/webserver-cluster",
		Vars: map[string]interface{}{
			"cluster_name": fmt.Sprintf("test-app-%s", uniqueID),
			"vpc_id":       vpcID,
			"subnet_ids":   subnetIDs,
		},
	}
	// Deferred calls run last-in, first-out, so the app is destroyed before the VPC.
	defer terraform.Destroy(t, appOptions)
	terraform.InitAndApply(t, appOptions)

	// Expects status 200 and a response body equal to "Hello".
	albDnsName := terraform.Output(t, appOptions, "alb_dns_name")
	http_helper.HttpGetWithRetry(t, fmt.Sprintf("http://%s", albDnsName), nil, 200, "Hello", 30, 10*time.Second)
}
What it catches: Cross-module integration issues, networking problems, full stack failures that unit and integration tests miss.
The CI/CD Pipeline
Run the tests automatically, with the cheap ones on every pull request and the expensive ones only after a merge to main:
name: Terraform Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform test
        working-directory: manual-test

  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    if: github.event_name == 'push' # Only on merge to main
    needs: unit-tests
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: "1.21" }
      - run: go test -v -timeout 30m ./...
        working-directory: test
Job dependencies:
- Unit tests run on every PR (fast, cheap)
- Integration tests only run on merge to main (slower, costs money)
- E2E tests run on a schedule (once a day); that job is not shown in the workflow above
Why This Matters
Before automation, every change meant:
- Run terraform apply manually
- Wait 5 minutes
- Test with curl
- Remember to destroy
- Repeat for every environment
Now the pipeline handles it:
- Unit tests on every pull request (10 seconds)
- Integration tests on every merge to main (5 minutes)
- Confidence that it works
Infrastructure that is tested automatically is infrastructure you can trust.
The Results
| Test Type | What It Found | Time | Result |
|---|---|---|---|
| Unit | Missing tags, wrong naming | 10s | ✅ Caught before PR |
| Integration | Health check failures, 502 errors | 5min | ✅ Caught before merge |
| E2E | Cross-module networking | 15min | ✅ Caught before release |
What I Learned
Unit tests are your safety net. Run them on every commit. They cost nothing and catch naming, tagging, and logic mistakes before anything is deployed.
Integration tests are your confidence builder. Run them before merging. They cost a little but find real issues.
E2E tests are your release gate. Run them less frequently. They cost more but verify everything works together.
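defer terraform.Destroy is critical. Without it, failed tests leave resources running. With it, cleanup runs whether the test passes or fails. The one gap is a run that is killed outright, for example by hitting the go test -timeout limit, so keep that timeout generous.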
Secrets never go in code. Use GitHub Secrets for AWS credentials.
The Bottom Line
Manual testing gave me confidence for one deployment. Automated testing gives me confidence for every deployment.
| Before | After |
|---|---|
| Test once a day | Test every commit |
| Manual curl checks | Automated HTTP assertions |
| Hope cleanup works | defer guarantees cleanup |
| 30 minutes of manual work | 5 minutes of automated trust |
If you're not testing your infrastructure automatically, you're deploying with blind faith.