Mukami

Posted on Apr 10

Automating Terraform Testing: From Unit Tests to End-to-End Validation

#aws #terraform #test #30daychallenge

How to Stop Wondering If Your Infrastructure Works and Start Knowing It Does

Day 18 of the 30-Day Terraform Challenge — and today I finally solved the problem that's been bothering me since Day 1.

How do you know your infrastructure actually works?

Manual testing gave me confidence, but it didn't scale. Every change meant re-running the same checks. Every environment meant more time. Every team member meant more coordination.

Today I automated everything.

The Three Layers of Testing

Test Type	Tool	Deploys Real Infra	Time	Cost
Unit	`terraform test`	No	Seconds	Free
Integration	Terratest	Yes	Minutes	Low
End-to-End	Terratest	Yes	15-30 min	Medium

Each layer catches different failures. Together, they create confidence.

Layer 1: Unit Tests (Fast, Free, No AWS)

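Terraform 1.6+ includes a native testing framework. No external dependencies. Run blocks default to command = apply, so setting command = plan is what keeps real infrastructure from being deployed; the assertions run against the plan. One caveat: a plan against the real AWS provider still needs credentials, so fully credential-free runs need mocked providers (available from Terraform 1.7).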

# webserver_cluster_test.tftest.hcl

variables {
  cluster_name  = "test-cluster"
  instance_type = "t3.micro"
  environment   = "dev"
}

run "validate_asg_name" {
  command = plan

  assert {
    condition     = can(regex("^test-cluster-asg-", aws_autoscaling_group.web.name_prefix))
    error_message = "ASG name prefix must start with cluster_name"
  }
}

run "validate_instance_type" {
  command = plan

  assert {
    condition     = aws_launch_template.web.instance_type == var.instance_type
    error_message = "Instance type must match variable"
  }
}

run "validate_tags" {
  command = plan

  assert {
    condition     = aws_lb.web.tags["Environment"] == "dev"
    error_message = "ALB must have Environment tag = dev"
  }
}

Run with: terraform test

What it catches: Syntax errors, naming conventions, tag consistency, logic mistakes.

What it doesn't catch: DNS propagation, health check failures, actual HTTP responses.

Layer 2: Integration Tests (Real Infra, Real Assertions)

Integration tests deploy real infrastructure, run assertions against it, then destroy it.

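// test/webserver_cluster_test.go
package test

import (
  "fmt"
  "testing"
  "time"

  http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
  "github.com/gruntwork-io/terratest/modules/random"
  "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestWebserverClusterIntegration(t *testing.T) {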
  t.Parallel()

  uniqueID := random.UniqueId()
  clusterName := fmt.Sprintf("test-cluster-%s", uniqueID)

  terraformOptions := &terraform.Options{
    TerraformDir: "../manual-test",
    Vars: map[string]interface{}{
      "cluster_name":  clusterName,
      "instance_type": "t3.micro",
      "min_size":      1,
      "max_size":      2,
      "environment":   "dev",
    },
  }

  // CRITICAL: Always destroy, even if test fails
  defer terraform.Destroy(t, terraformOptions)

  terraform.InitAndApply(t, terraformOptions)

  albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")
  url := fmt.Sprintf("http://%s", albDnsName)

  // Retry for 5 minutes (ALB takes time)
  http_helper.HttpGetWithRetryWithCustomValidation(
    t, url, nil, 30, 10*time.Second,
    func(status int, body string) bool {
      return status == 200
    },
  )
}

Run with: go test -v -timeout 30m ./...

What it catches: ALB DNS resolution, health check passing, actual HTTP responses, deployment ordering.

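The critical piece: defer terraform.Destroy runs cleanup even when an assertion fails, so a failed test does not leave resources running and billing. It is not absolute, though. If go test hits its timeout and kills the process, or the destroy itself fails, resources can still be orphaned, which is one reason the timeout above is a generous 30m.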

Layer 3: End-to-End Tests (Full Stack)

E2E tests deploy everything — VPC, database, application — and verify the whole system works.

func TestFullStackEndToEnd(t *testing.T) {
  t.Parallel()
  uniqueID := random.UniqueId()

  // Deploy VPC
  vpcOptions := &terraform.Options{
    TerraformDir: "../modules/networking/vpc",
    Vars: map[string]interface{}{
      "vpc_name": fmt.Sprintf("test-vpc-%s", uniqueID),
    },
  }
  defer terraform.Destroy(t, vpcOptions)
  terraform.InitAndApply(t, vpcOptions)

  vpcID := terraform.Output(t, vpcOptions, "vpc_id")
  subnetIDs := terraform.OutputList(t, vpcOptions, "private_subnet_ids")

  // Deploy app using VPC outputs
  appOptions := &terraform.Options{
    TerraformDir: "../modules/services/webserver-cluster",
    Vars: map[string]interface{}{
      "cluster_name": fmt.Sprintf("test-app-%s", uniqueID),
      "vpc_id":       vpcID,
      "subnet_ids":   subnetIDs,
    },
  }
  defer terraform.Destroy(t, appOptions)
  terraform.InitAndApply(t, appOptions)

  albDnsName := terraform.Output(t, appOptions, "alb_dns_name")
  // Passes only on status 200 with a response body equal to "Hello" (not merely containing it)
  http_helper.HttpGetWithRetry(t, fmt.Sprintf("http://%s", albDnsName), nil, 200, "Hello", 30, 10*time.Second)
}

What it catches: Cross-module integration issues, networking problems, full stack failures that unit and integration tests miss.
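One detail the E2E test relies on: deferred calls run in last-in, first-out order, so the app (deferred second) is destroyed before the VPC it lives in. A minimal sketch of that ordering, using placeholder strings instead of real Terraform calls (deployStack is a made-up name):

```go
package main

import "fmt"

// deployStack records the order in which two dependent stacks would be
// applied and destroyed when each destroy is deferred right before its apply.
func deployStack() []string {
	var order []string
	func() {
		defer func() { order = append(order, "destroy vpc") }()
		order = append(order, "apply vpc")

		defer func() { order = append(order, "destroy app") }()
		order = append(order, "apply app")
	}()
	return order
}

func main() {
	for _, step := range deployStack() {
		fmt.Println(step)
	}
	// prints:
	// apply vpc
	// apply app
	// destroy app
	// destroy vpc
}
```

If the two defers were swapped, the VPC destroy would be attempted while the app still used it, which typically fails on dependency errors.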

The CI/CD Pipeline

Run the tests automatically with GitHub Actions: unit tests on every pull request, integration tests after a merge to main:

name: Terraform Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform test
        working-directory: manual-test

  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    if: github.event_name == 'push'  # Only on merge to main
    needs: unit-tests
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: "1.21" }
      - run: go test -v -timeout 30m ./...
        working-directory: test

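When each job runs (the integration job waits for unit tests via needs):

Unit tests run on every PR (fast, cheap)
Integration tests only run on merge to main (slower, costs money)
E2E tests run on a schedule (once a day), in a separate scheduled job that is not part of the workflow above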
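Because go test ./... picks up every test in the test directory, the E2E test needs some way to stay out of the integration run. One possible approach, sketched here under assumptions, is an environment-variable gate. RUN_E2E and e2eEnabled are hypothetical names chosen for this sketch, not something Terratest or GitHub Actions defines.

```go
package main

import (
	"fmt"
	"os"
)

// e2eEnabled reports whether the slow end-to-end tests should run.
// RUN_E2E is a hypothetical variable the scheduled job would set.
func e2eEnabled(getenv func(string) string) bool {
	return getenv("RUN_E2E") == "true"
}

func main() {
	if !e2eEnabled(os.Getenv) {
		fmt.Println("skipping E2E tests: set RUN_E2E=true to run them")
		return
	}
	fmt.Println("running E2E tests")
}
```

In a real test file, the same check would sit at the top of TestFullStackEndToEnd and call t.Skip when the gate is closed, so only the scheduled job (which sets the variable) pays for the full stack.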

Why This Matters

Before automation, every change meant:

Run terraform apply manually
Wait 5 minutes
Test with curl
Remember to destroy
Repeat for every environment

Now the pipeline handles it:

Unit tests on every PR (10 seconds)
Integration tests on every merge to main (5 minutes)
E2E tests once a day
Confidence that it works

Infrastructure that is tested automatically is infrastructure you can trust.

The Results

Test Type	What It Found	Time	Result
Unit	Missing tags, wrong naming	10s	✅ Caught before PR
Integration	Health check failures, 502 errors	5min	✅ Caught before merge
E2E	Cross-module networking	15min	✅ Caught before release

What I Learned

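Unit tests are your safety net. Run them on every commit. They cost nothing and catch naming, tagging, and logic mistakes before anything is deployed, but they say nothing about how the real infrastructure behaves.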

Integration tests are your confidence builder. Run them before merging. They cost a little but find real issues.

E2E tests are your release gate. Run them less frequently. They cost more but verify everything works together.

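defer terraform.Destroy is critical. Without it, failed tests leave resources running. With it, cleanup still runs when an assertion fails, though a killed process or a failed destroy can still leave leftovers worth checking for.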

Secrets never go in code. Use GitHub Secrets for AWS credentials.

The Bottom Line

Manual testing gave me confidence for one deployment. Automated testing gives me confidence for every deployment.

Before	After
Test once a day	Test every commit
Manual curl checks	Automated HTTP assertions
Hope cleanup works	`defer` runs cleanup even on failure
30 minutes of manual work	5 minutes of automated trust

If you're not testing your infrastructure automatically, you're deploying with blind faith.
