A brutally honest walkthrough of my AltSchool Cloud Engineering exam project
The Assignment That Humbled Me
I'm a student at AltSchool Africa studying Cloud Engineering, and for our semester 2 exam, we got this assignment where we had to roleplay as a Junior Cloud Engineer hired by a company. The task? Deploy a secure, highly available web application on AWS.
I thought "how hard could it be?"
Spoiler alert: Pretty hard.
But I learned more in those two days than I did in weeks of watching tutorials. This is my story of building real cloud infrastructure, breaking things, fixing them, breaking them again, and eventually getting it right.
What I Had to Build
The assignment was clear:
- 1 Bastion Host in a public subnet
- 2 Web Servers in a private subnet (NO public IPs)
- Everything automated with Ansible
- An Application Load Balancer distributing traffic
- All access through the load balancer only
The catch? Web servers couldn't be directly accessible from the internet. Everything had to go through proper security layers.
Here's what I ended up building:
Day 1: "This Should Be Easy"
Starting with Security Groups
I started by creating security groups because I figured I'd need those first. Three of them:
bastion-sg - For my jump box
- Allow SSH from my IP
- That's it
webserver-sg - For the application servers
- Allow SSH from bastion-sg only
- Allow HTTP from alb-sg only
- No direct access from internet
alb-sg - For the load balancer
- Allow HTTP from anywhere (0.0.0.0/0)
- This is the only thing exposed to the internet
This part actually went smoothly. I was feeling confident. Too confident.
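For reference, the same three groups can be sketched with the AWS CLI. The VPC ID, my IP, and the `sg-...` IDs below are placeholders, not the actual values from my project:

```shell
# Create each group (repeat for webserver-sg and alb-sg)
aws ec2 create-security-group --group-name bastion-sg \
  --description "Bastion jump box" --vpc-id <vpc-id>

# bastion-sg: SSH from my IP only
aws ec2 authorize-security-group-ingress --group-id <bastion-sg-id> \
  --protocol tcp --port 22 --cidr <my-ip>/32

# alb-sg: HTTP from anywhere
aws ec2 authorize-security-group-ingress --group-id <alb-sg-id> \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

# webserver-sg: SSH only from bastion-sg, HTTP only from alb-sg
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 22 --source-group <bastion-sg-id>
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <alb-sg-id>
```

The `--source-group` flag is what makes the rules reference each other by security group ID instead of CIDR range.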
Launching EC2 Instances
Next up: spinning up instances. I created an SSH key pair (saved it as cloud-assignment-key.pem) and launched:
- Bastion Host - Ubuntu 24.04, t2.micro, in a public subnet with a public IP
- Web-Server-1 - Ubuntu 24.04, t2.micro, NO public IP
- Web-Server-2 - Ubuntu 24.04, t2.micro, NO public IP
Here's where I made my first mistake: I put all three instances in the same subnet. I didn't think about the fact that a subnet is either public OR private, not both. This came back to bite me later.
Setting Up SSH Access
I tested SSH to the Bastion - worked fine:
```shell
ssh -i ~/.ssh/cloud-assignment-key.pem ubuntu@<bastion-public-ip>
```
But I needed to access the web servers through the Bastion. This is where ProxyJump saved my life. I created an SSH config file:
```
Host bastion
    HostName <BASTION_PUBLIC_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ForwardAgent yes
    StrictHostKeyChecking no

Host webserver1
    HostName <WEB_SERVER_1_PRIVATE_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ProxyJump bastion
    StrictHostKeyChecking no

Host webserver2
    HostName <WEB_SERVER_2_PRIVATE_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ProxyJump bastion
    StrictHostKeyChecking no
```
Now I could just type ssh webserver1 and it would automatically jump through the Bastion. Magic.
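If you don't want a config file, the same jump works ad hoc with `-J`, which is the flag `ProxyJump` maps to (IPs here are placeholders):

```shell
# One-off equivalent of the SSH config above
ssh -J ubuntu@<bastion-public-ip> -i ~/.ssh/cloud-assignment-key.pem ubuntu@<web-server-private-ip>
```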
Day 1 Evening: The Ansible Disaster
Okay, time for automation. I installed Ansible on my laptop and created an inventory file:
```ini
[webservers]
webserver1 ansible_host=172.31.x.x
webserver2 ansible_host=172.31.y.y

[webservers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/cloud-assignment-key.pem
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o ProxyJump=bastion'
```
Tested connection:
```shell
ansible webservers -i inventory.ini -m ping
```
Success! Both servers responded:
```
webserver1 | SUCCESS
webserver2 | SUCCESS
```
Great, I thought. Time to run the actual playbook. Here's what it looked like:
```yaml
---
- name: Deploy Web Application
  hosts: webservers
  become: yes

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install NGINX
      apt:
        name: nginx
        state: present

    - name: Start and enable NGINX
      systemd:
        name: nginx
        state: started
        enabled: yes

    - name: Get instance hostname
      shell: hostname
      register: instance_hostname

    - name: Get private IP
      shell: hostname -I | awk '{print $1}'
      register: private_ip

    - name: Deploy custom HTML page
      copy:
        dest: /var/www/html/index.html
        content: |
          <!DOCTYPE html>
          <html lang='en'>
          <head>
            <meta charset='UTF-8'>
            <title>Cloud Assignment - Alexin</title>
            <style>
              body {
                font-family: 'Segoe UI', sans-serif;
                max-width: 800px;
                margin: 50px auto;
                padding: 20px;
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                color: white;
              }
              .container {
                background: rgba(255, 255, 255, 0.1);
                padding: 40px;
                border-radius: 15px;
                backdrop-filter: blur(10px);
              }
              .info {
                background: rgba(255, 255, 255, 0.2);
                padding: 15px;
                border-radius: 8px;
                margin: 20px 0;
              }
            </style>
          </head>
          <body>
            <div class='container'>
              <h1>🚀 Cloud Assignment</h1>
              <div class='info'>
                <p><strong>Name:</strong> Alexin</p>
                <p><strong>Instance:</strong> {{ instance_hostname.stdout }}</p>
                <p><strong>Private IP:</strong> {{ private_ip.stdout }}</p>
              </div>
              <div class='about'>
                <h2>About Me</h2>
                <p>Backend engineer from Lagos, Nigeria. This was deployed with
                Ansible through a Bastion Host to demonstrate Infrastructure as Code.</p>
              </div>
            </div>
          </body>
          </html>
        mode: '0644'
      notify: Restart NGINX

  handlers:
    - name: Restart NGINX
      systemd:
        name: nginx
        state: restarted
```
Ran it:
```shell
ansible-playbook -i inventory.ini deploy-webserver.yml
```
IT FAILED.
The playbook started running, got to the "Install NGINX" task, and just... timed out. Connection errors everywhere. The servers were reachable via SSH, Ansible could ping them, but they couldn't download anything.
```
TASK [Install NGINX] *****
fatal: [webserver1]: FAILED! => {"changed": false, "msg": "Failed to update apt cache"}
fatal: [webserver2]: FAILED! => {"changed": false, "msg": "Failed to update apt cache"}
```
I spent an hour debugging this. Turns out my web servers couldn't reach the internet to download packages. They could talk to Ansible (via the Bastion), but they couldn't reach Ubuntu's package repositories.
The Internet Access Problem
Here's what I learned the hard way: My servers needed internet access to run apt update and apt install nginx. They were technically reachable via SSH, but they couldn't initiate outbound connections to the internet.
The proper solution? NAT Gateway.
But I was already tired and the deadline was approaching, so I did a "quick fix": I kept everything in the same public subnet (with Internet Gateway access) so the servers could download packages. Not best practice, but it worked for getting Ansible running.
After that fix, the playbook ran perfectly:
```
PLAY RECAP *****
webserver1: ok=8 changed=6 unreachable=0 failed=0
webserver2: ok=8 changed=6 unreachable=0 failed=0
```
Both servers configured in under 2 minutes. This is when I understood why everyone talks about automation - imagine doing this manually on 10 or 100 servers.
Why Ansible Changed Everything
Before this project, I'd always SSH'd into servers and ran commands manually. Copy-paste from StackOverflow, hope it works, repeat on the next server.
Ansible made me realize:
- Consistency - Both servers got the EXACT same config
- Speed - 2 minutes vs 30 minutes of manual work
- Documentation - My playbook IS my documentation
- Repeatability - I can run it again tomorrow and get the same result
The playbook grabs the hostname and private IP dynamically with shell commands, then injects them into the HTML template. Each server shows different info, which is what later proves the load balancer is actually distributing traffic.
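As an aside, the same values are available without shelling out. This is a sketch of a cleaner variant using Ansible's gathered facts (`ansible_hostname` and `ansible_default_ipv4` are populated automatically when `gather_facts` is on, which is the default):

```yaml
# Facts-based variant of the "Deploy custom HTML page" task (sketch)
- name: Deploy custom HTML page (facts version)
  copy:
    dest: /var/www/html/index.html
    content: |
      <p><strong>Instance:</strong> {{ ansible_hostname }}</p>
      <p><strong>Private IP:</strong> {{ ansible_default_ipv4.address }}</p>
    mode: '0644'
```

This also removes the two `shell`/`register` tasks entirely.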
Day 2: Load Balancer Time
Creating the Target Group
First, I created a target group:
- Name: webserver-tg
- Protocol: HTTP, Port: 80
- Registered both web servers
- Health checks: HTTP GET to /
Creating the ALB
Then the Application Load Balancer:
- Name: web-app-alb
- Internet-facing
- Selected multiple availability zones
- Security group: alb-sg
- Forwarded traffic to: webserver-tg
Waited for health checks... and both targets went healthy!
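For the record, the console steps above correspond roughly to these CLI calls (all ARNs, IDs, and subnet names are placeholders):

```shell
# Target group with health checks on /
aws elbv2 create-target-group --name webserver-tg \
  --protocol HTTP --port 80 --vpc-id <vpc-id> --health-check-path /

aws elbv2 register-targets --target-group-arn <tg-arn> \
  --targets Id=<web-server-1-id> Id=<web-server-2-id>

# Internet-facing ALB across two public subnets
aws elbv2 create-load-balancer --name web-app-alb \
  --subnets <public-subnet-a> <public-subnet-b> \
  --security-groups <alb-sg-id>

# HTTP listener forwarding to the target group
aws elbv2 create-listener --load-balancer-arn <alb-arn> \
  --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>
```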
Accessed the ALB DNS name in my browser:
```
http://web-app-alb-xxxxxxxxx.eu-north-1.elb.amazonaws.com
```
IT WORKED!
But I noticed it kept showing the same server. Turns out that's normal - browsers reuse keep-alive connections, so refreshes tend to land on the same backend (ALBs also support cookie-based "sticky sessions", but those are opt-in). I verified it was actually load balancing with curl, which opens a fresh connection for each request:
```shell
for i in {1..10}; do curl -s http://<alb-dns-name> | grep "Instance:"; done
```
And boom - I saw both server hostnames appearing. Load balancing confirmed.
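To make that verification more convincing, you can count how many requests each server answered. Here's the counting part in isolation, run against a hypothetical captured sample instead of a live ALB (the hostnames are made up):

```shell
# Hypothetical sample of four responses captured from the curl loop above
sample='<p><strong>Instance:</strong> ip-172-31-10-5</p>
<p><strong>Instance:</strong> ip-172-31-10-6</p>
<p><strong>Instance:</strong> ip-172-31-10-5</p>
<p><strong>Instance:</strong> ip-172-31-10-6</p>'

# Extract the hostnames and count hits per backend;
# two distinct hostnames with similar counts means round-robin is working
echo "$sample" | grep -o 'ip-[0-9-]*' | sort | uniq -c
```

Against a live ALB, replace the `echo` with the curl loop and pipe it into the same `grep | sort | uniq -c`.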
Why a Bastion Host?
At first, I didn't get why we needed a Bastion Host. Why not just SSH directly to the servers?
Here's what I learned:
Without Bastion (Bad):
- Every server exposed to the internet
- More attack surface
- Hard to control who accesses what
- Each server needs its own public IP
With Bastion (Good):
- Only ONE server exposed to internet
- Single point of entry I can monitor and secure
- Web servers hidden in private subnet
- I can log every SSH session
- If someone compromises the Bastion, web servers are still protected
Real companies use this because:
- Compliance - Many regulations require it
- Security - Defense in depth
- Auditing - Know who accessed what and when
The Bastion is like a security checkpoint. Everything goes through it, nothing bypasses it.
Load Balancer vs Direct Access: Why It Matters
When I started, I thought "why not just give each server a public IP and call it a day?"
Here's why that's a terrible idea:
Direct EC2 Access (What NOT to Do):
- Single Point of Failure - Server goes down? Your app is offline
- No Failover - Users get errors until you manually fix it
- Security Risk - Servers exposed directly to attacks
- Hard to Scale - Adding servers means changing DNS
- No Health Checks - You won't know a server died until users complain
Load Balancer Access (The Right Way):
- High Availability - One server dies? Traffic goes to the other
- Automatic Failover - Load balancer detects failures and reroutes
- Better Security - Servers stay in private subnet
- Easy Scaling - Add/remove servers without DNS changes
- Health Monitoring - Load balancer constantly checks if servers are healthy
- SSL Termination - Handle HTTPS at the load balancer, not on each server
In my setup, if Web-Server-1 crashes, the load balancer immediately stops sending traffic to it. Users never notice. That's production-grade architecture.
The "Oh Shit" Moment: Realizing I Needed to Rebuild
Later, I realized something: I had put all my instances in the same subnet. But the assignment specifically said "private subnet" for web servers.
A subnet is either public (its route table has a route to an Internet Gateway) or private (no IGW route - outbound traffic, if any, goes through a NAT Gateway). One subnet can't be both.
So I had to:
- Create a NAT Gateway (in a public subnet)
- Create a Private Route Table (routes outbound traffic through NAT)
- Associate that route table with a private subnet
- Terminate my web servers
- Launch new ones in the private subnet WITHOUT public IPs
- Update my inventory and SSH config
- Run Ansible again
- Update the target group
This took another hour, but it was the right way to do it.
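The NAT Gateway and route table steps can be sketched with the CLI like this (placeholders throughout; note the NAT Gateway itself lives in a *public* subnet and needs an Elastic IP):

```shell
# Elastic IP for the NAT Gateway
aws ec2 allocate-address --domain vpc

# NAT Gateway goes in a PUBLIC subnet
aws ec2 create-nat-gateway --subnet-id <public-subnet-id> \
  --allocation-id <eip-allocation-id>

# Private route table: default route via the NAT Gateway
aws ec2 create-route-table --vpc-id <vpc-id>
aws ec2 create-route --route-table-id <private-rt-id> \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <nat-gw-id>
aws ec2 associate-route-table --route-table-id <private-rt-id> \
  --subnet-id <private-subnet-id>
```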
The NAT Gateway Confusion
Here's something that confused me: If web servers are in a private subnet with NO internet access, how do they download packages?
Answer: NAT Gateway.
A NAT Gateway allows outbound-only internet access:
```
Web Server: "I need to apt install nginx"
        ↓
Private Subnet Route Table: "Send to NAT Gateway"
        ↓
NAT Gateway: "I'll translate your private IP to my public IP"
        ↓
Internet: "Here's nginx"
        ↓
NAT Gateway: "I remember who asked, sending back to Web Server"
        ↓
Web Server: "Got it, thanks!"
```
The key: Outbound connections work, but the internet CANNOT initiate connections to your private servers. They're truly isolated.
Without NAT Gateway, private servers are completely offline. Can't download anything, can't reach APIs, nothing.
All The Things That Went Wrong
Issue 1: "Connection Timed Out"
Problem: The playbook's apt tasks failed - the servers timed out trying to reach Ubuntu's package repositories.
Why: The web servers had no path to the internet. They sat in a public subnet, but without public IPs of their own (and without a NAT Gateway), outbound traffic had nowhere to go.
Solution: Quick fix - made the instances public. Later rebuilt properly with NAT Gateway.
Lesson: Private instances = NO internet unless you add NAT Gateway.
Issue 2: All Instances in Same Subnet
Problem: Assignment required private subnet for web servers, but I had everything in the same subnet.
Why: I didn't understand that subnets are either public OR private, not both.
Solution: Created separate private subnet with route table pointing to NAT Gateway. Launched new web servers there.
Lesson: Network architecture planning matters. Can't just throw everything anywhere.
Issue 3: Can't curl from Bastion to Web Servers
Problem: Tried to test web servers from Bastion using curl. Connection refused.
Why: Security group webserver-sg only allowed HTTP from alb-sg, not from bastion-sg.
Solution: Skipped this test. Verified through the load balancer instead.
Lesson: Security groups are strict. They block everything not explicitly allowed.
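If you do want that curl test from the Bastion, one option is a temporary rule allowing HTTP from bastion-sg, revoked once you're done (group IDs below are placeholders):

```shell
# Temporarily allow HTTP from the Bastion's security group
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <bastion-sg-id>

# ...run the curl test from the Bastion, then revoke:
aws ec2 revoke-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <bastion-sg-id>
```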
Issue 4: Load Balancer Only Shows One Server
Problem: Refreshing the page kept showing the same server. Thought load balancing was broken.
Why: Connection reuse - the browser keeps the same keep-alive connection (and therefore the same backend) across refreshes. Cookie-based sticky sessions would have the same effect, but on an ALB they're opt-in.
Solution: Used curl in a loop to verify both servers were getting traffic. They were.
Lesson: "Same server on refresh" doesn't mean load balancing is broken. Test with fresh connections before assuming the worst.
Issue 5: Target Group Shows "Unused"
Problem: After registering web servers in target group, status showed "Unused" with message "Target is in an Availability Zone that is not enabled for the load balancer"
Why: My web servers were in AZ eu-north-1c in a private subnet, but my load balancer wasn't configured to use that AZ since it couldn't use private subnets.
Solution: Created a public subnet in AZ 1c and added it to the load balancer's network mapping.
Lesson: Load balancers need to span the availability zones where your targets are.
Issue 6: NAT Gateway Not Appearing in Route Table
Problem: When trying to add route 0.0.0.0/0 => NAT Gateway, it wasn't showing up in the dropdown.
Why: NAT Gateway was still in "Pending" status. Hadn't finished creating yet.
Solution: Waited 2-3 minutes for status to become "Available", then refreshed the page.
Lesson: AWS resources take time to provision. Be patient.
Issue 7: New Subnet Not Showing in ALB Config
Problem: Created a new public subnet for the ALB, but it didn't appear in the subnet dropdown.
Why: New subnet wasn't associated with a route table that has Internet Gateway route.
Solution: Associated the subnet with the public route table (one with IGW route). Refreshed page.
Lesson: Subnets need proper route table associations to be recognized correctly.
What I Learned About Cloud Security
This project taught me security is about layers:
Layer 1: Network Segmentation
- Public subnet: Only Bastion, NAT, Load Balancer
- Private subnet: Application servers
Layer 2: Security Groups
- Each component has its own firewall rules
- Rules reference each other by security group ID
- Principle of least privilege
Layer 3: No Public IPs on App Servers
- Can't be attacked if they're not reachable
- All traffic forced through load balancer
Layer 4: SSH Keys, Not Passwords
- Cryptographic authentication
- Keys never leave my laptop
Layer 5: Bastion Host
- Single auditable entry point
- Can log all SSH sessions
- Additional barrier
This is what "defense in depth" means. You don't rely on one security measure, you stack them.
If I Did It Again
Things I'd add for a real production system:
1. HTTPS
Right now it's just HTTP. In production, I'd:
- Get a free SSL cert from AWS Certificate Manager
- Configure ALB to handle HTTPS on port 443
- Redirect HTTP to HTTPS
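As a sketch, the ALB side of that looks roughly like this with the CLI (all ARNs are placeholders, and the cert would come from ACM first):

```shell
# HTTPS listener on 443 using an ACM certificate
aws elbv2 create-listener --load-balancer-arn <alb-arn> \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=<acm-cert-arn> \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>

# Change the existing HTTP listener to redirect to HTTPS
aws elbv2 modify-listener --listener-arn <http-listener-arn> \
  --default-actions 'Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'
```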
2. Auto Scaling
Currently just 2 fixed servers. I'd add:
- Auto Scaling Group
- Scale up when CPU > 70%
- Scale down when CPU < 30%
- Minimum 2 instances for availability
3. Multiple Availability Zones for Web Servers
Right now both servers are in AZ 1c. For true high availability:
- Put Web-Server-1 in AZ 1b (since Bastion is in 1a)
- Put Web-Server-2 in AZ 1c
- Protected against entire data center failures
4. Database Layer
Currently no database. I'd add:
- Amazon RDS in private subnet
- Separate database security group
- Automated backups
5. Monitoring
Zero visibility right now. I'd add:
- CloudWatch alarms (CPU, memory, disk)
- CloudWatch Logs (application logs)
- SNS notifications for alerts
6. CI/CD Pipeline
Manual deployments don't scale. I'd add:
- GitHub Actions or AWS CodePipeline
- Automated testing
- Blue/green deployments
7. Infrastructure as Code
I clicked through the console for everything. Better approach:
- Terraform or CloudFormation
- Version control entire infrastructure
- Reproducible across environments
Real Talk: Was It Worth It?
Two days. Countless errors. Almost gave up multiple times.
But here's what I got out of it:
Technical Skills:
- Actually understand VPCs, subnets, route tables now
- Know how security groups work
- Can write Ansible playbooks
- Understand load balancer concepts
Problem-Solving:
- Learned to read error messages carefully
- Know how to debug connection issues
- Can think through network architecture
Confidence:
- Built something real, not a tutorial
- Can explain this in interviews
- Know I can figure things out when stuck
The assignment said "If you can explain this project clearly, you can explain 70% of junior cloud interviews."
They weren't lying. This project covers:
- Cloud networking fundamentals
- Security best practices
- Infrastructure automation
- High availability patterns
- Real-world troubleshooting
For Anyone Doing This Assignment
If you're an AltSchool student reading this:
Do's:
- Start early. This takes longer than you think
- Read error messages carefully
- Draw your architecture before building
- Test each component before moving to the next
- Take screenshots as you go (you'll need them)
- Ask for help when stuck
Don'ts:
- Don't put everything in the same subnet
- Don't skip the NAT Gateway (private servers need it)
- Don't give web servers public IPs (assignment requires private)
- Don't forget to delete resources after submission (AWS bills add up!)
- Don't just copy commands without understanding them
Key Resources:
- AWS Documentation
- Ansible Documentation
- This article (seriously, refer back to it)
Final Architecture Checklist
Before submitting, verify:
- 3 EC2 instances running (1 Bastion, 2 Web Servers)
- Only Bastion has public IP
- Web servers in different subnet than Bastion
- NAT Gateway created and available
- Private route table routes through NAT Gateway
- Can SSH: laptop => bastion => web servers
- Ansible playbook runs successfully
- Both web servers show different hostnames/IPs
- Application Load Balancer is active
- Both targets healthy in target group
- ALB DNS shows your page
- Refreshing shows different servers (use curl to verify)
- RESOURCES DELETED AFTER SUBMISSION
That last one is critical. NAT Gateway alone costs ~$32/month if you forget it.
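A teardown sketch for the expensive pieces (IDs and ARNs are placeholders - and always double-check in the console afterwards that nothing is still running):

```shell
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>
aws elbv2 delete-target-group --target-group-arn <tg-arn>
aws ec2 delete-nat-gateway --nat-gateway-id <nat-gw-id>
aws ec2 release-address --allocation-id <eip-allocation-id>  # after the NAT Gateway is gone
aws ec2 terminate-instances --instance-ids <bastion-id> <web-1-id> <web-2-id>
```

The Elastic IP is easy to forget: it's free while attached to a running NAT Gateway, but billed once it's just sitting there unreleased.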
Closing Thoughts
This assignment kicked my ass, but in the best way possible.
I went from "I've watched AWS tutorials" to "I've built AWS infrastructure". That's a huge difference.
The frustration when Ansible couldn't connect. The confusion about subnets. The moment when the load balancer finally showed both servers. The relief when all health checks turned green.
That's real learning.
If you're reading this because you're about to start this assignment: good luck. You'll need it. But you'll also learn more than you expect.
If you're reading this because you finished: congrats! We survived.
If you're a recruiter: this is what I know now. Want to see if I can build something for you?