A comprehensive guide to creating a production-ready AWS infrastructure, including all the roadblocks I hit and how I solved them
Introduction
Hey there, fellow cloud explorer! Have you ever tried setting up AWS infrastructure and found yourself scratching your head, wondering, “Why isn’t this working?!” I’ve been there too! In this down-to-earth guide, I’ll walk you through building a full AWS Virtual Private Cloud (VPC) with an Application Load Balancer (ALB), EC2 instances, and more, while sharing every stumble I made along the way and how I fixed it.
This isn’t your typical polished tutorial. It’s the real deal, complete with those “uh-oh” moments and their practical solutions. By the end, you’ll have a production-ready setup and the know-how to troubleshoot like a pro. Let’s dive in!
What We’re Building
Imagine we’re setting up “Dublin Delights,” a small online store selling Irish goodies. Our final architecture will include:
- A VPC with public and private subnets
- An Internet Gateway for public access
- A NAT Gateway for private subnet internet access
- An Application Load Balancer to distribute traffic
- 3 EC2 instances (1 public, 2 private)
- Tight security groups and routing
- Working web servers behind the load balancer
Ready? Let’s get started!
Phase 1: Setting Up the VPC Foundation
Creating the VPC
First, let’s build the foundation, our Virtual Private Cloud:
- Head to the AWS Console > Search “VPC” > Click “VPC” > “Create VPC”.
- Choose “VPC and more” (this sets up everything in one go).
- Configuration:
  - Name: dublin-delights-vpc
  - IPv4 CIDR: 10.0.0.0/16 (gives us 65,536 IP addresses)
  - Number of Availability Zones (AZs): 2 (e.g., us-east-1a, us-east-1b)
  - Public subnets: 1 (e.g., 10.0.1.0/24)
  - Private subnets: 2 (e.g., 10.0.2.0/24, 10.0.3.0/24)
  - NAT gateways: 1 (in the public subnet)
  - VPC endpoints: None
- Hit “Create VPC” and wait a minute for it to spin up.
What This Creates:
- A VPC with the 10.0.0.0/16 CIDR block
- 1 public subnet (10.0.1.0/24)
- 2 private subnets (10.0.2.0/24, 10.0.3.0/24)
- An Internet Gateway
- A NAT Gateway in the public subnet
- Route tables (we’ll tweak these later)
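If the CIDR numbers feel like magic, here is a small sketch of the arithmetic behind them. A /N block leaves 32 - N host bits, so it holds 2^(32 - N) addresses. The function names are my own, and the "minus 5" reflects the five addresses AWS reserves in every subnet.

```shell
# Total addresses in an IPv4 CIDR block with the given prefix length.
cidr_size() {
    # $1 = prefix length, e.g. 16 for a /16
    echo $(( 1 << (32 - $1) ))
}

# Addresses you can actually assign in an AWS subnet
# (AWS reserves 5 addresses per subnet).
usable_in_subnet() {
    echo $(( (1 << (32 - $1)) - 5 ))
}

# How many subnets of one size fit inside a larger block.
subnets_in() {
    # $1 = VPC prefix length, $2 = subnet prefix length
    echo $(( 1 << ($2 - $1) ))
}

echo "VPC /16 addresses:        $(cidr_size 16)"
echo "Subnet /24 addresses:     $(cidr_size 24)"
echo "Usable per /24 subnet:    $(usable_in_subnet 24)"
echo "/24 subnets in one /16:   $(subnets_in 16 24)"
```

So the three /24 subnets in this guide use only a tiny slice of the VPC, which leaves plenty of room to add more later.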
The First Problem: DNS Resolution Woes
After creating the VPC, I launched instances but hit a snag: DNS wasn’t working! Updates failed with “temporary failure resolving” errors. Here’s the fix:
- VPC Console > “Your VPCs” > Select dublin-delights-vpc.
- “Actions” > “Edit VPC settings”.
- Enable “DNS hostnames” and “DNS resolution” (both checkboxes).
- Save changes.
Why This Matters: Without DNS resolution, instances can’t resolve domain names through the Amazon-provided DNS server (e.g., sudo apt-get update fails), and without DNS hostnames, instances with public IPs don’t get public DNS names. This little toggle saved my day!
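For anyone who prefers the terminal, the same two checkboxes can probably be flipped with the AWS CLI. This is a rough sketch, not what I ran at the time: the function names are made up, the VPC ID is a placeholder, and it assumes the CLI is installed and configured.

```shell
# Enable both DNS attributes on a VPC.
# modify-vpc-attribute accepts one attribute per call, hence two calls.
enable_vpc_dns() {
    # $1 = VPC ID
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}' &&
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}

# Print the current value of both attributes to verify the change.
show_vpc_dns() {
    # $1 = VPC ID
    aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsSupport
    aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsHostnames
}

# Example (placeholder VPC ID, replace with your own):
# enable_vpc_dns vpc-0123456789abcdef0 && show_vpc_dns vpc-0123456789abcdef0
```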
Phase 2: Launching EC2 Instances
Creating the Instances
Let’s launch our three servers:
- Public Instance (public-web-1):
  - EC2 Console > “Launch Instance”.
  - Configuration:
    - Name: public-web-1
    - AMI: Ubuntu Server 22.04 LTS
    - Instance type: t2.micro (Free Tier eligible)
    - Key pair: Select or create devops-key (download the .pem file)
    - VPC: dublin-delights-vpc
    - Subnet: public-subnet-1
    - Auto-assign Public IP: Enable
    - Security group: Create new “public-sg”
  - Launch and wait for “Running”.
- Private Instances (private-web-2, private-web-3):
  - Repeat the process, but:
    - Name: private-web-2 and private-web-3
    - Subnet: private-subnet-1 and private-subnet-2
    - Auto-assign Public IP: Disable
    - Security group: Create new “private-sg”
  - Launch both.
The Second Problem: Connection Timeouts
I couldn’t SSH to my instances and got “ssh: connect to host 52.23.157.242 port 22: Connection timed out.” Ouch! The culprit? Security groups.
- Solution - Fix Security Groups:
  - public-sg:
    - Inbound: SSH (22) from “My IP”, HTTP (80) from 0.0.0.0/0
    - Outbound: All traffic to 0.0.0.0/0
  - private-sg:
    - Inbound: SSH (22) from public-sg, HTTP (80) from the ALB security group (added later)
    - Outbound: All traffic to 0.0.0.0/0
  - Apply and test SSH: ssh -i devops-key.pem ubuntu@<public-ip>
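Here is a hedged sketch of how those inbound rules might look from the AWS CLI instead of the console. The helper names and all IDs are placeholders of mine. The point it illustrates is the difference between allowing a CIDR (your own IP) and allowing another security group as the source.

```shell
# Allow SSH (22) into a security group from a single IPv4 address.
allow_ssh_from_ip() {
    # $1 = security group ID, $2 = your public IP (without the /32)
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" --protocol tcp --port 22 --cidr "$2/32"
}

# Allow a TCP port into a security group from members of another group.
allow_port_from_group() {
    # $1 = target group ID, $2 = port, $3 = source security group ID
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" --protocol tcp --port "$2" --source-group "$3"
}

# Example with placeholder IDs:
# allow_ssh_from_ip     sg-0aaa1111 203.0.113.10       # public-sg: SSH from my IP
# allow_port_from_group sg-0bbb2222 22 sg-0aaa1111     # private-sg: SSH from public-sg
```

Referencing public-sg as the source means any instance in that group can reach the private instances on port 22, so the rule keeps working even if the bastion's private IP changes.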
The Third Problem: Session Manager Not Working
I tried AWS Session Manager and got “SSM Agent is not online.” The route table was the issue: my “public” subnet wasn’t really public!
- Solution - Fix Route Tables:
  - VPC Console > “Route Tables” > Find the one for public-subnet-1.
  - “Routes” tab > “Edit routes” > Add: Destination 0.0.0.0/0, Target “Internet Gateway” (dublin-igw).
  - Save.
- Result: SSH, Session Manager, and internet access worked!
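A quick way to check whether a subnet is really public is to look for a 0.0.0.0/0 route that targets an Internet Gateway. The sketch below is one possible way to do that with the AWS CLI; the function names and IDs are placeholders I made up. Note that a subnet without an explicit association uses the VPC's main route table, in which case the subnet filter returns nothing.

```shell
# Print the routes of the route table explicitly associated with a subnet,
# one per line: destination, gateway ID, NAT gateway ID.
show_subnet_routes() {
    # $1 = subnet ID
    aws ec2 describe-route-tables \
        --filters "Name=association.subnet-id,Values=$1" \
        --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
        --output text
}

# Read route lines on stdin; succeed only if a default route points at an IGW.
has_igw_default_route() {
    awk '$1 == "0.0.0.0/0" && $2 ~ /^igw-/ { found = 1 } END { exit !found }'
}

# Add the missing default route to an Internet Gateway.
add_default_route() {
    # $1 = route table ID, $2 = Internet Gateway ID
    aws ec2 create-route --route-table-id "$1" \
        --destination-cidr-block 0.0.0.0/0 --gateway-id "$2"
}

# Example with placeholder IDs:
# show_subnet_routes subnet-0123abcd | has_igw_default_route \
#     || add_default_route rtb-0123abcd igw-0123abcd
```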
Phase 3: Setting Up the Application Load Balancer
Creating the ALB
- EC2 Console > “Load Balancers” > “Create Load Balancer” > “Application Load Balancer”.
- Configuration:
  - Name: dublin-alb
  - Scheme: Internet-facing
  - IP address type: IPv4
  - VPC: dublin-delights-vpc
  - Availability Zones: Select both AZs with public-subnet-1
  - Security group: Create new “alb-sg”
- Create and wait 5 minutes.
Creating the Target Group
- EC2 Console > “Target Groups” > “Create target group”.
- Configuration:
  - Target type: Instances
  - Name: web-targets
  - Protocol: HTTP
  - Port: 80
  - VPC: dublin-delights-vpc
  - Health check path: /
- Register targets: Add all 3 instances.
- Create.
The Fourth Problem: 504 Gateway Timeout
Accessing the ALB gave “504 Gateway Time-out,” with targets “Unhealthy” (Target.Timeout).
- Root Cause #1: No Web Servers
  - Solution: Install Apache on each instance:
    sudo apt update
    sudo apt install -y apache2
    sudo systemctl start apache2
    sudo systemctl enable apache2
    echo "<h1>Web Server - $(hostname)</h1>" | sudo tee /var/www/html/index.html
    curl http://localhost
- Root Cause #2: ALB Security Group Outbound Rules Missing
  - Even with web servers, the 504 persisted. The ALB’s security group had no outbound rules!
  - Critical Fix:
    - EC2 Console > “Security Groups” > alb-sg.
    - “Outbound rules” > “Edit outbound rules” > Add: Type “All traffic”, Destination 0.0.0.0/0.
    - Save.
Why This Matters: The ALB needs outbound rules to send health checks and forward traffic to its targets.
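To check this from the command line, something like the following sketch should work. It lists the outbound rules of a security group and, if the list comes back empty, adds an allow-all rule. The function names and the group ID are placeholders, and it assumes a configured AWS CLI.

```shell
# Show the outbound (egress) rules of a security group.
show_egress() {
    # $1 = security group ID
    aws ec2 describe-security-groups --group-ids "$1" \
        --query 'SecurityGroups[0].IpPermissionsEgress'
}

# Add an "All traffic to 0.0.0.0/0" outbound rule.
# IpProtocol=-1 means all protocols and ports.
allow_all_egress() {
    # $1 = security group ID
    aws ec2 authorize-security-group-egress --group-id "$1" \
        --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
}

# Example with a placeholder ID for alb-sg:
# show_egress sg-0ccc3333
# allow_all_egress sg-0ccc3333
```

A tighter alternative would be to allow only port 80 to the instance security groups, which is all the ALB needs for this setup.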
Phase 4: Fine-Tuning and Testing
Proper Security Group Configuration
- alb-sg:
  - Inbound: HTTP (80) from 0.0.0.0/0
  - Outbound: All traffic to 0.0.0.0/0
- public-sg:
  - Inbound: SSH (22) from “My IP”, HTTP (80) from alb-sg
  - Outbound: All traffic to 0.0.0.0/0
- private-sg:
  - Inbound: SSH (22) from public-sg, HTTP (80) from alb-sg
  - Outbound: All traffic to 0.0.0.0/0
Testing the Complete Setup
- Test 1: Individual Instance Access
  - From public instance: curl http://10.0.2.135 and curl http://10.0.3.150.
- Test 2: Load Balancer
  - curl http://dublin-alb-689675767.eu-west-1.elb.amazonaws.com
  - Loop: for i in {1..5}; do curl http://dublin-alb-689675767.eu-west-1.elb.amazonaws.com; done (shows rotation).
- Test 3: NAT Gateway Functionality
  - From private instances: curl ifconfig.me (shows NAT IP), sudo apt update (works).
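Eyeballing five curl outputs works, but a tally makes the rotation easier to see. Here is a small sketch that requests a URL several times and counts how often each distinct response body comes back. The function name is my own, and it relies on each server returning its hostname in the page, as set up with the index.html above.

```shell
# Request a URL N times and count how often each distinct response appears.
tally_responses() {
    # $1 = URL, $2 = number of requests
    i=0
    while [ "$i" -lt "$2" ]; do
        curl -s --max-time 5 "$1"
        echo                      # make sure every body ends with a newline
        i=$((i + 1))
    done | grep -v '^$' | sort | uniq -c
}

# Example: 12 requests against the load balancer
# tally_responses http://dublin-alb-689675767.eu-west-1.elb.amazonaws.com 12
```

If only one hostname shows up, check target health before blaming the load balancer: unhealthy targets simply receive no traffic.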
Phase 5: Common Issues and Solutions
- Issue 1: “Instance is not in public subnet”
  - Fix: Add 0.0.0.0/0 → Internet Gateway in route table.
- Issue 2: “Referenced group id for existing IPv4 CIDR rule”
  - Fix: Delete the existing CIDR rule and add a new rule that references the security group, instead of editing the rule in place.
- Issue 3: Private instances can’t reach internet
  - Fix: Route 0.0.0.0/0 → NAT Gateway in private route table.
- Issue 4: Health checks failing with “Target.Timeout”
  - Fix: Install web server, fix security groups, ensure /var/www/html/index.html exists.
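For Issue 4 in particular, it helps to see the health state and failure reason per target in one place. A possible sketch, with function names of my own choosing and a target group ARN you would need to supply:

```shell
# Print one line per registered target: instance ID, state, reason.
show_target_health() {
    # $1 = target group ARN
    aws elbv2 describe-target-health --target-group-arn "$1" \
        --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
        --output text
}

# Count unhealthy targets from "id state reason" lines on stdin.
count_unhealthy() {
    awk '$2 == "unhealthy" { n++ } END { print n + 0 }'
}

# Example (replace <arn> with your target group ARN):
# show_target_health <arn>
# show_target_health <arn> | count_unhealthy
```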
Final Architecture Overview
🌐 INTERNET
|
┌────▼────┐
│ IGW │ dublin-igw
└────┬────┘
|
┌──────────▼──────────┐
│ Application LB │ dublin-alb
│ (alb-sg) │
└──┬────────┬────────┬┘
| | |
┌────────────▼─┐ ┌───▼───┐ ┌─▼────────────┐
│ AZ: us-east-1a│ │us-east│ │ AZ: us-east-1c│
│ │ │ -1b │ │ │
│ ┌───────────┐ │ │┌─────┐│ │ ┌───────────┐ │
│ │Public Sub │ │ ││Priv ││ │ │Private Sub│ │
│ │10.0.1.0/24│ │ ││Sub 1││ │ │10.0.3.0/24│ │
│ │ │ │ ││10.0.││ │ │ │ │
│ │┌─────────┐│ │ ││2.0/ ││ │ │┌─────────┐│ │
│ ││public- ││ │ ││24 ││ │ ││private- ││ │
│ ││web-1 ││ │ ││ ││ │ ││web-3 ││ │
│ ││(public- ││ │ ││┌───┐││ │ ││(private-││ │
│ ││sg) ││ │ │││web│││ │ ││sg) ││ │
│ │└─────────┘│ │ │││-2 │││ │ │└─────────┘│ │
│ │ │ │ ││└───┘││ │ │ │ │
│ │┌─────────┐│ │ │└─────┘│ │ │ │ │
│ ││NAT GW ││ │ └───────┘ │ │ │ │
│ │└─────────┘│ │ │ │ │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘
| |
└─────────┬─────────────────┘
|
┌────▼────┐
│ IGW │ (for NAT traffic)
└────┬────┘
|
🌐 INTERNET
- Network Flow: Inbound: Internet → IGW → ALB → Instances. Outbound: Public via IGW, Private via NAT.
- Security: Public SSH from specific IP, private via public, ALB open on 80.
Key Lessons Learned
- Security Groups Are Stateful: Return traffic for an allowed inbound connection is permitted automatically, but connections a resource initiates itself still need outbound rules.
- “Public” Subnet Needs Routing: The name doesn’t matter; a route to the Internet Gateway is what makes a subnet public.
- ALB Needs Outbound Rules: Missing rules cause 504 errors.
- NAT Gateway Placement: Must be in a public subnet.
- Health Checks Are Critical: Unhealthy targets stop traffic.
Cost Optimization Tips
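- Costs: 3 t2.micro (~$25/month), NAT Gateway (~$32/month), ALB (~$16/month), plus variable data transfer charges. Total ~$73/month before data transfer.
- Savings: Use a NAT instance instead of a NAT Gateway (~$0 on a Free Tier eligible instance), consider a Network Load Balancer, terminate unused instances, or buy Reserved Instances (30-60% off).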
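The total is just the sum of the three fixed line items. Here is a tiny sketch for redoing the math when you change the setup. The figures are my rough estimates from above, not official pricing, and the function name is made up.

```shell
# Sum whole-dollar monthly estimates passed as arguments.
monthly_total() {
    total=0
    for cost in "$@"; do
        total=$((total + cost))
    done
    echo "$total"
}

# 3x t2.micro, NAT Gateway, ALB (rough monthly estimates in USD)
echo "With NAT Gateway:    \$$(monthly_total 25 32 16)/month"
# Same setup with the NAT Gateway line item dropped
echo "Without NAT Gateway: \$$(monthly_total 25 16)/month"
```

The NAT Gateway is the single largest fixed cost here, which is why it is the first thing to look at for a hobby project.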
Production Readiness Checklist
- Security: Restrict SSH, enable VPC Flow Logs, CloudTrail, IAM roles, encryption.
- Monitoring: CloudWatch alarms, ALB logs, target health, SNS notifications.
- High Availability: Multi-AZ, Auto Scaling, health checks, disaster recovery.
- Performance: Right-size instances, ALB stickiness, CloudFront CDN.
Troubleshooting Commands Reference
- Connectivity Issues:
  telnet <ip> <port>
  aws ec2 describe-security-groups --group-ids sg-xxxxxxxxx
  nslookup google.com
  ip route show
- Web Server Issues:
  sudo systemctl status apache2
  sudo ss -tlnp | grep :80
  curl http://localhost
  sudo tail -f /var/log/apache2/error.log
- ALB Issues:
  curl -I http://your-alb-dns-name.elb.amazonaws.com
  aws elbv2 describe-target-health --target-group-arn <arn>
Conclusion
Building AWS infrastructure is like assembling a puzzle: every piece (VPC, subnets, security groups, ALB) must fit. My mistakes taught me to start with basics, test incrementally, and read error messages closely. Each “why isn’t this working?” turned into a “now I get it!” moment.
What’s Next?
Explore Auto Scaling Groups, RDS, CloudFront, Route 53, ACM, or Infrastructure as Code (CloudFormation/Terraform). Share your AWS adventures in the comments. Let’s learn together!