Mukami

Posted on Mar 30

The Importance of Manual Testing in Terraform

#testing #aws #terraform #tutorial

Why "It Works" Isn't Enough Until You Prove It

Day 17 of the 30-Day Terraform Challenge — and today I learned that my infrastructure "worked" until I actually tested it.

I had a webserver cluster. Terraform applied without errors. Everything looked perfect in the AWS Console. I was confident.

Then I ran a structured manual test. The results were humbling.

The Problem: Code Success ≠ Functional Success

Terraform told me:

✅ 11 resources created
✅ No errors
✅ State matches configuration

But when I actually tried to use my infrastructure:

$ curl http://my-alb-dns
502 Bad Gateway

The code worked. The infrastructure didn't.

This is why manual testing matters.

My Test Checklist

I built a structured test plan covering five categories:

1. Provisioning Verification

terraform init completes without errors
terraform validate passes cleanly
terraform plan shows expected resources
terraform apply completes successfully
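
Here is a minimal sketch of how this first category could be scripted, assuming terraform is on the PATH and the script runs from the directory that holds the configuration. The function name run_provisioning_checks and the plan file name tfplan are placeholders of my own, not anything Terraform defines.

```shell
# Run the four provisioning checks in order and stop at the first failure.
# Assumption: run from the directory that holds the Terraform configuration.
run_provisioning_checks() {
  local step
  for step in "init -input=false" \
              "validate" \
              "plan -input=false -out=tfplan" \
              "apply -input=false tfplan"; do
    # $step is left unquoted on purpose so it splits into subcommand and flags.
    if terraform $step >/dev/null; then
      echo "PASS terraform ${step%% *}"
    else
      echo "FAIL terraform ${step%% *}"
      return 1
    fi
  done
}
```

Calling run_provisioning_checks prints one PASS or FAIL line per step. Terraform's own error messages still go to stderr, so a failure stays visible.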

2. Resource Correctness

Resources visible in AWS Console
Names match variables
Tags match expected values
Security group rules exactly as defined

3. Functional Verification

ALB DNS resolves
curl returns expected response
ASG instances pass health checks
Instance termination triggers replacement
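
The curl check is the one that bit me, so it is worth polling instead of trying once. The sketch below is one way to do that, not a definitive implementation: wait_for_http_200 is a hypothetical helper name, and the default of 30 attempts with a 10 second delay is a guess that you would tune for your own setup.

```shell
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Usage: wait_for_http_200 URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http_200() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # %{http_code} prints 000 when no HTTP response arrived (DNS failure, timeout).
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || true
    echo "attempt $i: HTTP ${code:-000}"
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, wait_for_http_200 http://my-alb-dns would keep trying for roughly five minutes with the defaults and exit non-zero if the ALB never answers with a 200.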

4. State Consistency

terraform plan returns "No changes"
State file matches AWS resources
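
The "No changes" check does not have to be read by eye. terraform plan accepts a -detailed-exitcode flag that exits with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. A small sketch built on that flag (the function name check_state_consistency is my own):

```shell
# Turn the exit code of `terraform plan -detailed-exitcode` into a PASS/FAIL line.
# 0 = no changes, 1 = error, 2 = changes pending.
check_state_consistency() {
  local rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  case "$rc" in
    0) echo "PASS: no changes" ;;
    2) echo "FAIL: plan shows pending changes"; return 1 ;;
    *) echo "ERROR: terraform plan exited with $rc"; return 2 ;;
  esac
}
```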

5. Cleanup

terraform destroy completes
AWS Console verification shows no resources

What I Found

Passed: 12 tests ✅

Provisioning worked perfectly
All resources created with correct tags
State consistency was perfect
Destroy cleaned up properly

Failed: 2 tests ❌

ALB DNS resolution (timeout)
ALB returned 502 Bad Gateway

The Root Cause

The infrastructure was created, but the application wasn't working. Why?

ALB DNS takes time to propagate — I tested too early
Health checks were failing — Instances weren't responding to HTTP
User-data script may have failed — Apache probably wasn't running
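
To narrow down which of these causes applies, the target group itself can be asked why its targets are unhealthy. This is a sketch that assumes the AWS CLI is configured and that you know the ARN of the target group behind the ALB. The post does not give that ARN, so it is passed in as an argument, and both function names are hypothetical.

```shell
# Print one line per target: instance ID, health state, and the reason code.
# Assumption: $1 is the ARN of the target group attached to the ALB.
show_target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id, TargetHealth.State, TargetHealth.Reason]' \
    --output text
}

# Count targets whose state is anything other than "healthy".
count_unhealthy_targets() {
  show_target_health "$1" | awk '$2 != "healthy" { n++ } END { print n + 0 }'
}
```

A reason such as Target.FailedHealthChecks or Target.Timeout points at the instance rather than the load balancer, which is where a failed user-data script would show up.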

The code was correct. The application was not.

What Manual Testing Taught Me

Terraform applies successfully ≠ infrastructure works

Terraform only confirms that each resource was created through the AWS API and recorded in state. It doesn't verify that the application on those resources is actually running.

DNS propagation is real — Just because the ALB exists doesn't mean it's reachable immediately.
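
One way to avoid testing too early is to wait until the name resolves before sending any HTTP request. The sketch below assumes dig is installed. wait_for_dns is a hypothetical helper, and the default of 18 attempts with a 10 second delay is only a starting point.

```shell
# Return success once the name resolves to at least one record.
# Usage: wait_for_dns NAME [ATTEMPTS] [DELAY_SECONDS]
wait_for_dns() {
  local name="$1" attempts="${2:-18}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # dig +short prints nothing when the name does not resolve yet.
    if [ -n "$(dig +short "$name")" ]; then
      echo "resolved after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "still unresolved after $attempts attempt(s)"
  return 1
}
```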

Health checks are the real indicator — A running instance isn't enough. It needs to respond correctly.

Cleanup is harder than it looks — After terraform destroy, I found leftover instances. Manual verification is essential.

The Value of a Test Checklist

Before today, I'd run terraform apply and call it done.

Now I have a checklist that catches:

DNS propagation issues
Application startup failures
Health check problems
Cleanup gaps

Each failed test is a gap I can fix and later automate.

What I Learned About Cleanup

After terraform destroy, I verified with:

aws ec2 describe-instances --filters "Name=tag:Name,Values=*test-webserver*"

The command still returned five instances. Terraform had destroyed the ASG, but its instances were still shutting down, so for a short while the account was not actually clean. Manual verification caught what the destroy output didn't show.

Lesson: Always verify cleanup. Don't trust destroy alone.
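
The describe-instances check can be turned into a wait loop that only passes once nothing is left. This is a sketch under a few assumptions: the AWS CLI is configured for the right account and region, the Name tag pattern matches your instances, and the function names and default timings are mine. It adds an instance-state-name filter so that instances already in the terminated state are not counted.

```shell
# Count instances matching a Name tag pattern that are not yet terminated.
count_live_instances() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=$1" \
              "Name=instance-state-name,Values=pending,running,shutting-down,stopping,stopped" \
    --query 'length(Reservations[].Instances[])' \
    --output text
}

# Poll until that count reaches zero.
# Usage: wait_for_cleanup PATTERN [ATTEMPTS] [DELAY_SECONDS]
wait_for_cleanup() {
  local pattern="$1" attempts="${2:-20}" delay="${3:-15}" i=1 n
  while [ "$i" -le "$attempts" ]; do
    n="$(count_live_instances "$pattern")" || return 2
    if [ "$n" = "0" ]; then
      echo "cleanup verified"
      return 0
    fi
    echo "$n instance(s) not yet terminated"
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

Running wait_for_cleanup '*test-webserver*' right after terraform destroy would have reported my five lingering instances and then kept polling until they were gone.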

The Manual Test Results

Test	Result
terraform init	✅ PASS
terraform validate	✅ PASS
terraform plan	✅ PASS
terraform apply	✅ PASS
Resources in AWS	✅ PASS
Tags correct	✅ PASS
Security group rules	✅ PASS
ALB DNS resolution	❌ FAIL
ALB returns webpage	❌ FAIL
ASG instances running	✅ PASS
State consistency	✅ PASS
terraform destroy	✅ PASS
Cleanup verification	✅ PASS

12 passed, 2 failed.

Why This Matters

Manual testing isn't about checking boxes. It's about finding gaps before they become outages.

If I had deployed this infrastructure without testing:

Users would see 502 errors
I'd be debugging under pressure
The problem would take longer to find

Instead, I found the failure in a controlled environment. I can now fix it and write automated tests to prevent it from happening again.

The Big Lesson

Terraform applies successfully ≠ Infrastructure works

The gap between "code success" and "functional success" is where outages happen. Manual testing closes that gap.

Next Steps

Fix the user-data script to ensure Apache starts reliably
Set wait_for_capacity_timeout on the ASG so apply waits for instances to become healthy
Wait 2-3 minutes after apply before testing
Write automated tests to catch these issues in CI

P.S. The 502 Bad Gateway was humbling. But finding it manually before deployment was a win. Test early, test often, test manually before you automate. 🚀
