Why "It Works" Isn't Enough Until You Prove It
Day 17 of the 30-Day Terraform Challenge — and today I learned that my infrastructure "worked" until I actually tested it.
I had a webserver cluster. Terraform applied without errors. Everything looked perfect in the AWS Console. I was confident.
Then I ran a structured manual test. The results were humbling.
The Problem: Code Success ≠ Functional Success
Terraform told me:
- ✅ 11 resources created
- ✅ No errors
- ✅ State matches configuration
But when I actually tried to use my infrastructure:
$ curl http://my-alb-dns
502 Bad Gateway
The code worked. The infrastructure didn't.
This is why manual testing matters.
My Test Checklist
I built a structured test plan covering five categories:
1. Provisioning Verification
- terraform init completes without errors
- terraform validate passes cleanly
- terraform plan shows expected resources
- terraform apply completes successfully
2. Resource Correctness
- Resources visible in AWS Console
- Names match variables
- Tags match expected values
- Security group rules exactly as defined
3. Functional Verification
- ALB DNS resolves
- curl returns expected response
- ASG instances pass health checks
- Instance termination triggers replacement
4. State Consistency
- terraform plan returns "No changes"
- State file matches AWS resources
5. Cleanup
- terraform destroy completes
- AWS Console verification shows no resources
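The state-consistency item is the easiest one on this checklist to script. Here is a minimal sketch, assuming bash and a Terraform CLI on the PATH. It relies on `terraform plan -detailed-exitcode`, which exits with 0 for no changes, 1 for an error, and 2 when changes are pending. The function names are hypothetical and not part of my original test run.

```shell
#!/usr/bin/env bash

# Map the exit code of `terraform plan -detailed-exitcode` to a test result.
# 0 = no changes, 1 = error, 2 = changes pending.
classify_plan_exit() {
  case "$1" in
    0) echo "PASS: no changes" ;;
    2) echo "FAIL: plan shows pending changes" ;;
    *) echo "FAIL: terraform plan errored" ;;
  esac
}

# Hypothetical wrapper: run the plan in the current directory and report.
# Returns 0 only when the plan is empty.
check_state_consistency() {
  local rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  classify_plan_exit "$rc"
  [ "$rc" -eq 0 ]
}
```

Running `check_state_consistency` from the module directory right after apply turns the "No changes" check into a pass/fail line instead of something to eyeball.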
What I Found
Passed: 11 tests ✅
- Provisioning worked perfectly
- All resources created with correct tags
- State consistency was perfect
- Destroy cleaned up properly
Failed: 2 tests ❌
- ALB DNS resolution (timeout)
- ALB returned 502 Bad Gateway
The Root Cause
The infrastructure was created, but the application wasn't working. Why?
- ALB DNS takes time to propagate — I tested too early
- Health checks were failing — Instances weren't responding to HTTP
- User-data script may have failed — Apache probably wasn't running
The code was correct. The application was not.
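Two of these causes are timing problems, so a single curl right after apply is a weak test. Here is a rough sketch of a polling check, assuming bash and curl are available. `retry` and `http_ok` are hypothetical helper names, and `ALB_DNS` is a placeholder for the load balancer's DNS name.

```shell
#!/usr/bin/env bash

# Retry a check command until it succeeds or the attempts run out.
# Usage: retry <max_attempts> <sleep_seconds> <command...>
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max failed" >&2
    # Only wait if another attempt is coming.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Succeeds only if the URL answers with HTTP 200.
# -w '%{http_code}' prints the status code; -o /dev/null drops the body.
http_ok() {
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" || return 1
  [ "$code" = "200" ]
}

# Example (placeholder DNS name), polling every 10 seconds for up to 3 minutes:
# retry 18 10 http_ok "http://$ALB_DNS"
```

If the check still fails after a few minutes, the problem is probably not timing, and the user-data script and health checks are the next things to look at.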
What Manual Testing Taught Me
Terraform applies successfully ≠ infrastructure works
Terraform reports success once the AWS API confirms the resources exist. It doesn't verify that the application on those resources is actually running.
DNS propagation is real — Just because the ALB exists doesn't mean it's reachable immediately.
Health checks are the real indicator — A running instance isn't enough. It needs to respond correctly.
Cleanup is harder than it looks — After terraform destroy, I found leftover instances. Manual verification is essential.
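The health-check lesson can be checked from the command line as well as the console. Here is a rough sketch, assuming the AWS CLI is installed and configured. It uses `aws elbv2 describe-target-health`, which reports a state such as healthy, unhealthy, or initial for each target. `count_healthy` and `check_targets` are hypothetical names, and the target group ARN is a placeholder you would supply yourself.

```shell
#!/usr/bin/env bash

# Count entries equal to "healthy" in a whitespace-separated list of states.
count_healthy() {
  local n=0 state
  for state in $1; do
    if [ "$state" = "healthy" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Query the target group (ARN passed as $1) and report how many targets
# the load balancer currently considers healthy.
check_targets() {
  local states
  states="$(aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' \
    --output text)" || return 1
  echo "healthy targets: $(count_healthy "$states")"
}
```

A result of zero healthy targets would be consistent with the 502 I saw, since the load balancer then has nowhere to send requests.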
The Value of a Test Checklist
Before today, I'd run terraform apply and call it done.
Now I have a checklist that catches:
- DNS propagation issues
- Application startup failures
- Health check problems
- Cleanup gaps
Each failed test is a gap I can fix and later automate.
What I Learned About Cleanup
After terraform destroy, I verified with:
aws ec2 describe-instances --filters "Name=tag:Name,Values=*test-webserver*"
I found five instances that had not finished shutting down: Terraform had destroyed the ASG, but its instances were still terminating. Manual verification caught what automation missed.
Lesson: Always verify cleanup. Don't trust destroy alone.
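One caveat with the command above: describe-instances keeps listing terminated instances for a while, so an unfiltered result can look worse than it is. Here is a sketch of a stricter check, assuming a configured AWS CLI. The tag pattern is the one from my command. The state filter and the function names are additions for illustration.

```shell
#!/usr/bin/env bash

# List instance IDs that match the test tag and are not yet terminated.
list_leftover_instances() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=*test-webserver*" \
              "Name=instance-state-name,Values=pending,running,shutting-down,stopping,stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Pass only when the listing comes back empty.
verify_cleanup() {
  local ids
  ids="$(list_leftover_instances)" || { echo "FAIL: could not query AWS"; return 1; }
  if [ -z "$ids" ]; then
    echo "PASS: no leftover instances"
  else
    echo "FAIL: leftover instances: $ids"
    return 1
  fi
}
```

Because shutting-down is included in the filter, this check fails while instances are still terminating. Running it again a few minutes later shows whether they are genuinely stuck.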
The Manual Test Results
| Test | Result |
|---|---|
| terraform init | ✅ PASS |
| terraform validate | ✅ PASS |
| terraform plan | ✅ PASS |
| terraform apply | ✅ PASS |
| Resources in AWS | ✅ PASS |
| Tags correct | ✅ PASS |
| Security group rules | ✅ PASS |
| ALB DNS resolution | ❌ FAIL |
| ALB returns webpage | ❌ FAIL |
| ASG instances running | ✅ PASS |
| State consistency | ✅ PASS |
| terraform destroy | ✅ PASS |
| Cleanup verification | ✅ PASS |
11 passed, 2 failed.
Why This Matters
Manual testing isn't about checking boxes. It's about finding gaps before they become outages.
If I had deployed this infrastructure without testing:
- Users would see 502 errors
- I'd be debugging under pressure
- The problem would take longer to find
Instead, I found the failure in a controlled environment. I can now fix it and write automated tests to prevent it from happening again.
The Big Lesson
Terraform applies successfully ≠ Infrastructure works
The gap between "code success" and "functional success" is where outages happen. Manual testing closes that gap.
Next Steps
- Fix the user-data script to ensure Apache starts reliably
- Add wait_for_capacity_timeout to the ASG
- Wait 2-3 minutes after apply before testing
- Write automated tests to catch these issues in CI
P.S. The 502 Bad Gateway was humbling. But finding it manually before deployment was a win. Test early, test often, test manually before you automate. 🚀