gideonclottey

Posted on Jul 5

I Built My First AWS Network with Terraform. Here's Everything That Went Wrong.

#aws #beginners #infrastructure #terraform

Week 1 of a five-week Infrastructure-as-Code journey, told honestly: the 403s, the 400s, the missing resource, and the security group I got wrong twice.

I recently started a structured five-week Terraform course, and I made myself a promise: I would document the journey as it actually happened, not the cleaned-up version where every terraform apply works on the first try.

Spoiler: almost nothing worked on the first try. And that turned out to be the best part.

By the end of Week 1, the assignment was to build a complete, self-contained network on AWS using nothing but code: a VPC, a public subnet, a private subnet, an EC2 instance living in the public subnet, and a security group protecting it. Eight resources, all defined in HCL, all provisioned with a single command.

Here is how it really went.

The setup tax

Before writing a single line of Terraform, there is a setup tax to pay: installing Terraform and the AWS CLI, creating an IAM user (never the root account), generating access keys, and running aws configure.

I paid the tax, ran the verification command, and got the green light:

$ aws sts get-caller-identity
{
    "UserId": "AIDA...",
    "Arn": "arn:aws:iam::************:user/gclottey"
}

I felt ready. AWS disagreed.

Failure #1: The 403 that taught me to read error messages

My very first terraform plan did not even survive the AMI lookup:

Error: reading EC2 AMIs: ... StatusCode: 403 ...
User: arn:aws:iam::****:user/gclottey is not authorized to perform:
ec2:DescribeImages because no identity-based policy allows the
ec2:DescribeImages action

My instinct was to assume my code was broken. It wasn't. The error message tells you everything if you slow down and read it: which identity was rejected, which exact action failed, and why. My IAM user existed and could authenticate, but had no policy allowing it to do anything with EC2.

The fix was attaching a policy in the IAM console. The lesson was bigger: a 403 means AWS knows who you are and is saying no. Your code is fine. Your permissions are not.
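For context, here is a minimal sketch of the kind of AMI lookup that triggers this call. The label amazon_linux and the name pattern are assumptions, not the configuration from the course. Any aws_ami data source makes Terraform call ec2:DescribeImages during plan, which is why the 403 appears before a single resource is created.

```hcl
# Sketch only: the label "amazon_linux" and the name pattern are assumptions.
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"] # Amazon Linux images are published by Amazon

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"] # assumed Amazon Linux 2023 naming pattern
  }
}
```

Because data sources are read at plan time, the IAM user needs read permissions such as ec2:DescribeImages even for a plan that creates nothing.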

Bonus complication: I discovered mid-debugging that I was juggling two AWS accounts with the same username, and I had attached the policy in one account while my CLI was authenticated to the other. Same error, completely different cause. Check aws sts get-caller-identity before you check anything else.
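The same check can be made from inside Terraform itself. The sketch below uses the aws_caller_identity data source to show which account and identity a plan is running as; the output names are hypothetical.

```hcl
# Reads the identity behind the credentials Terraform is actually using.
data "aws_caller_identity" "current" {}

# Output names below are hypothetical.
output "caller_arn" {
  description = "ARN of the IAM identity Terraform authenticated as"
  value       = data.aws_caller_identity.current.arn
}

output "caller_account_id" {
  description = "ID of the AWS account this configuration runs against"
  value       = data.aws_caller_identity.current.account_id
}
```

If the account ID in the output is not the account where the policy was attached, the 403 has nothing to do with the policy itself.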

Failure #2: The 400 that knew more about my account than I did

Permissions fixed, I ran terraform apply, typed yes, and watched Terraform start creating my instance... and fail:

Error: creating EC2 Instance: ... StatusCode: 400 ...
InvalidParameterCombination: The specified instance type is not
eligible for Free Tier.

Every tutorial on the internet uses t2.micro. My AWS account, created recently on the newer free-tier plan, refused it. Newer accounts can only launch instance types flagged as free-tier eligible in their region, and in ca-central-1 that meant t3.micro, not t2.micro.

The error message even handed me the diagnostic command:

aws ec2 describe-instance-types \
  --filters "Name=free-tier-eligible,Values=true" \
  --query "InstanceTypes[].InstanceType" --output table

The fix was one variable. But notice the difference from the first failure: a 403 is "you are not allowed," a 400 is "your request itself is invalid." Two different layers of the system telling you two different things. Learning to distinguish them is half of cloud debugging.

I also learned something about Terraform variables here. I did not have to touch my code to test the fix:

terraform apply -var="instance_type=t3.micro"

Defaults are just defaults. The command line can override them, which is exactly the point of parameterizing your configuration in the first place.
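A declaration along these lines is what makes that override possible. Only the variable name instance_type comes from the command above; the description and the default shown here are assumptions.

```hcl
# Assumed declaration: only the name "instance_type" appears in the post.
variable "instance_type" {
  description = "EC2 instance type for the lab instance"
  type        = string
  default     = "t3.micro" # the type this account accepted in ca-central-1
}
```

A resource then reads the value as var.instance_type. Values passed with -var or -var-file take precedence over tfvars files, TF_VAR_ environment variables, and the default, so a quick experiment on the command line never requires editing the configuration.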

Failure #3: The plan said 7. I needed 8.

For the assignment, I counted my resources before planning: VPC, internet gateway, two subnets, route table, route table association, security group, instance. Eight.

Terraform's plan came back:

Plan: 7 to add, 0 to change, 0 to destroy.

I had forgotten the route table association. And here is what makes this failure the most instructive of the week: everything would have applied cleanly anyway. Seven green resources, no errors, a public IP in my outputs. And an instance that could not be reached from the internet, because my "public" subnet was never actually connected to the route table that points at the internet gateway.

A subnet is not public because you name it public. It is public because a route table with a 0.0.0.0/0 → internet gateway route is explicitly associated with it. Forget the association and your subnet silently falls back to the VPC's main route table, which by default holds only the local route for traffic inside the VPC and therefore has no path to the internet.
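The wiring can be sketched as follows. Resource labels and CIDR ranges are assumptions chosen for illustration; the last block is the association that was missing from the seven-resource plan.

```hcl
# Sketch: resource labels and CIDR ranges are assumptions.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true # instances get a public IP, but that alone is not reachability
}

# Route table whose default route points at the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# The easy-to-forget resource: without it, the subnet uses the VPC's main route table.
resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}
```

Deleting the final block still produces a configuration that validates and applies. Only the resource count in the plan changes.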

No syntax checker catches this. terraform validate passes. The only thing that catches it is reading the plan and counting. The plan is not a formality to scroll past. The plan is the code review.

Failure #4: My security group, reviewed like a pull request

I wrote my security group from memory, feeling confident. It had three problems.

Problem one: I wrote a rule intended for SSH like this:

ingress {
  from_port = 0
  to_port   = 0
  protocol  = "tcp"
  ...
}

SSH is port 22. With protocol = "tcp", ports 0 to 0 match only TCP port 0, which is not "everything"; it is nothing useful. (The "all traffic" pattern of from_port = 0 and to_port = 0 only works with protocol = "-1".)

Problem two: I had no egress block at all. This one is a genuine trap. A security group created through AWS itself starts with a rule allowing all outbound traffic, but Terraform's aws_security_group resource removes that default rule when it creates the group. Terraform's philosophy is that the configuration is the complete truth, so an unwritten rule is a rule that should not exist. Without an explicit egress block, my instance could answer inbound requests (security groups are stateful, so replies to allowed inbound traffic get through) but could not initiate a single outbound connection. Not even to download OS updates.

Problem three: a leftover comment claimed my Amazon Linux AMI was published by Canonical, which is Ubuntu's publisher. Small, harmless, and exactly the kind of copy-paste residue that erodes trust in a codebase.

The corrected version allows HTTP on 80, SSH on 22, and all outbound traffic. In a real environment, that SSH rule would be locked to my own IP with a /32 instead of open to the world. Knowing the difference between lab settings and production settings matters as much as knowing the syntax.
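A corrected group along those lines might look like the sketch below. The variable names, the resource label web, and the group name are hypothetical. The allowed_ssh_cidr default mirrors the lab setting described above and would be replaced with a /32 outside a lab.

```hcl
# Sketch: variable names, resource label, and group name are hypothetical.
variable "vpc_id" {
  description = "ID of the VPC the security group belongs to"
  type        = string
}

variable "allowed_ssh_cidr" {
  description = "CIDR allowed to reach SSH; use your own IP with /32 outside a lab"
  type        = string
  default     = "0.0.0.0/0"
}

resource "aws_security_group" "web" {
  name        = "week1-web"
  description = "HTTP and SSH in, all traffic out"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.allowed_ssh_cidr]
  }

  # Explicit allow-all egress: Terraform does not keep the AWS default rule.
  egress {
    description = "All outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Keeping the SSH source in a variable means the lab-versus-production difference is a single value rather than an edit to the rule.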

The moment it all clicked

With all eight resources finally applied, I grabbed my shiny public IP, pasted it into a browser, and got... a timeout.

Naturally, I assumed something else was broken. Nothing was. Port 80 was open in the firewall, but nothing on the instance was listening on it. I had never installed a web server. An open firewall port and a running service are two completely different layers, and confusing them is one of the most common mistakes in cloud networking.

A few lines of user_data later (a boot script installing Apache), I refreshed the browser and saw my own heading served from a server I had defined entirely in code:

user_data = <<-EOF
            #!/bin/bash
            dnf install -y httpd
            echo "<h1>Deployed with Terraform</h1>" > /var/www/html/index.html
            systemctl enable --now httpd
            EOF
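For orientation, here is a sketch of where that heredoc sits inside an instance definition. The three input variables are hypothetical stand-ins for the AMI, subnet, and security group defined elsewhere in the configuration, and the output name is an assumption.

```hcl
# Sketch: the variables below are hypothetical stand-ins for values defined elsewhere.
variable "ami_id" { type = string }
variable "public_subnet_id" { type = string }
variable "web_security_group_id" { type = string }

resource "aws_instance" "web" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  subnet_id              = var.public_subnet_id
  vpc_security_group_ids = [var.web_security_group_id]

  # Runs once at first boot; <<- strips the common leading indentation.
  user_data = <<-EOF
    #!/bin/bash
    dnf install -y httpd
    echo "<h1>Deployed with Terraform</h1>" > /var/www/html/index.html
    systemctl enable --now httpd
  EOF

  # Replace the instance when the script changes, so the new script actually runs.
  user_data_replace_on_change = true
}

output "public_ip" {
  description = "Public IP address to paste into the browser"
  value       = aws_instance.web.public_ip
}
```

The script needs outbound access to download the httpd package, which is one more reason the explicit egress rule matters.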

Then I ran terraform destroy, watched all eight resources dissolve in reverse dependency order, and confirmed my AWS bill for the whole exercise: effectively zero.

What Week 1 actually taught me

Not HCL syntax. Syntax is the easy part. Week 1 taught me:

Read the error message. All of it. AWS errors name the identity, the action, and the reason. The answer is usually in there.
403 and 400 are different conversations. One is about permissions, the other is about the request itself.
Count your resources before you apply. The plan output is a review artifact, not a loading screen.
Terraform manages exactly what you write. Nothing more. The missing egress rule was not a bug; it was the tool taking my configuration literally.
Public is a routing decision, not a name. Infrastructure is what it is wired to, not what you call it.
An open port is not a running service. Layers, always layers.

Week 2 is state management: moving the terraform.tfstate file off my laptop and into S3 with proper locking. Based on Week 1, I fully expect it to break in ways I have not imagined yet, and I will write those down too.

This is part 1 of a series documenting a five-week Infrastructure-as-Code course, failures included. If you are learning Terraform and everything is working perfectly on your first try, you are probably missing the good parts.

Let's share our learnings and get better together. If you're on a similar Terraform or AWS journey, I'd love to hear what broke for you and how you fixed it. Feel free to connect with me: Gideon Clottey | LinkedIn

Top comments (3)

FromZeroToShip • Jul 6

"Public is a routing decision, not a name" is going straight into my notes. What strikes me about your four failures is that three of them announced themselves with error codes — but the missing route table association just sat there, silently, looking finished. That silent category is the one that still scares me the most. I'm not a developer by background, and the single biggest mindset shift for me was exactly what you describe: error messages stopped being scary the day I started reading them as diagnostics instead of verdicts. 403 tells you who, 400 tells you what. Writing up the failures in this much detail is rarer than it should be — this post will save someone's afternoon.

gideonclottey • Jul 6

"Error messages as diagnostics instead of verdicts": that's exactly the shift, and you've said it better than I did. And you're right that the silent failure is the scariest category: the loud ones interrupt you, the quiet ones wait for production. Funny enough, the next posts in this series are basically about that: Week 2 is state management, which is largely Terraform's answer to "how do I notice what silently changed." Thanks for reading this closely.

FromZeroToShip • Jul 6

"The loud ones interrupt you, the quiet ones wait for production" — that's the line I'll be stealing, with attribution. Funny thing: without knowing any of the proper tooling, I ended up inventing a caveman version of state management years ago — making a full copy of everything before touching anything, so I could diff reality against "what was true yesterday" by hand. Sounds like week 2 is the grown-up version of that instinct. Following along — genuinely curious how Terraform formalizes it.