A brutally honest walkthrough of my AltSchool Cloud Engineering exam project
The Assignment That Humbled Me
I'm a student at AltSchool Africa studying Cloud Engineering, and for our semester 2 exam, we got this assignment where we had to roleplay as a Junior Cloud Engineer hired by a company. The task? Deploy a secure, highly available web application on AWS.
I thought "how hard could it be?"
Spoiler alert: Pretty hard.
But I learned more in those two days than I did in weeks of watching tutorials. This is my story of building real cloud infrastructure, breaking things, fixing them, breaking them again, and eventually getting it right.
What I Had to Build
The assignment was clear:
- 1 Bastion Host in a public subnet
- 2 Web Servers in a private subnet (NO public IPs)
- Everything automated with Ansible
- An Application Load Balancer distributing traffic
- All access through the load balancer only
The catch? Web servers couldn't be directly accessible from the internet. Everything had to go through proper security layers.
Here's what I ended up building:
Day 1: "This Should Be Easy"
Starting with Security Groups
I started by creating security groups because I figured I'd need those first. Three of them:
bastion-sg - For my jump box
- Allow SSH from my IP
- That's it
webserver-sg - For the application servers
- Allow SSH from bastion-sg only
- Allow HTTP from alb-sg only
- No direct access from internet
alb-sg - For the load balancer
- Allow HTTP from anywhere (0.0.0.0/0)
- This is the only thing exposed to the internet
This part actually went smoothly. I was feeling confident. Too confident.
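For reference, the same three groups can be sketched with the AWS CLI. The VPC ID, my IP, and the `sg-...` IDs below are placeholders, not the actual values from my project:

```shell
# Create each group (repeat for webserver-sg and alb-sg)
aws ec2 create-security-group --group-name bastion-sg \
  --description "Bastion jump box" --vpc-id <vpc-id>

# bastion-sg: SSH from my IP only
aws ec2 authorize-security-group-ingress --group-id <bastion-sg-id> \
  --protocol tcp --port 22 --cidr <my-ip>/32

# alb-sg: HTTP from anywhere
aws ec2 authorize-security-group-ingress --group-id <alb-sg-id> \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

# webserver-sg: SSH only from bastion-sg, HTTP only from alb-sg
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 22 --source-group <bastion-sg-id>
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <alb-sg-id>
```

The `--source-group` flag is what makes the rules reference each other by security group ID instead of CIDR range.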
Launching EC2 Instances
Next up: spinning up instances. I created an SSH key pair (saved it as cloud-assignment-key.pem) and launched:
- Bastion Host - Ubuntu 24.04, t2.micro, in a public subnet with a public IP
- Web-Server-1 - Ubuntu 24.04, t2.micro, NO public IP
- Web-Server-2 - Ubuntu 24.04, t2.micro, NO public IP
Here's where I made my first mistake: I put all three instances in the same subnet. I didn't think about the fact that a subnet is either public OR private, not both. This came back to bite me later.
Setting Up SSH Access
I tested SSH to the Bastion - worked fine:
```shell
ssh -i ~/.ssh/cloud-assignment-key.pem ubuntu@<bastion-public-ip>
```
But I needed to access the web servers through the Bastion. This is where ProxyJump saved my life. I created an SSH config file:
```
Host bastion
    HostName <BASTION_PUBLIC_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ForwardAgent yes
    StrictHostKeyChecking no

Host webserver1
    HostName <WEB_SERVER_1_PRIVATE_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ProxyJump bastion
    StrictHostKeyChecking no

Host webserver2
    HostName <WEB_SERVER_2_PRIVATE_IP>
    User ubuntu
    IdentityFile ~/.ssh/cloud-assignment-key.pem
    ProxyJump bastion
    StrictHostKeyChecking no
```
Now I could just type ssh webserver1 and it would automatically jump through the Bastion. Magic.
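If you don't want a config file, the same jump works ad hoc with `-J`, which is the flag `ProxyJump` maps to (IPs here are placeholders):

```shell
# One-off equivalent of the SSH config above
ssh -J ubuntu@<bastion-public-ip> -i ~/.ssh/cloud-assignment-key.pem ubuntu@<web-server-private-ip>
```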
Day 1 Evening: The Ansible Disaster
Okay, time for automation. I installed Ansible on my laptop and created an inventory file:
```ini
[webservers]
webserver1 ansible_host=172.31.x.x
webserver2 ansible_host=172.31.y.y

[webservers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/cloud-assignment-key.pem
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o ProxyJump=bastion'
```
Tested connection:
```shell
ansible webservers -i inventory.ini -m ping
```
Success! Both servers responded:
```
webserver1 | SUCCESS
webserver2 | SUCCESS
```
Great, I thought. Time to run the actual playbook. Here's what it looked like:
```yaml
---
- name: Deploy Web Application
  hosts: webservers
  become: yes

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install NGINX
      apt:
        name: nginx
        state: present

    - name: Start and enable NGINX
      systemd:
        name: nginx
        state: started
        enabled: yes

    - name: Get instance hostname
      shell: hostname
      register: instance_hostname

    - name: Get private IP
      shell: hostname -I | awk '{print $1}'
      register: private_ip

    - name: Deploy custom HTML page
      copy:
        dest: /var/www/html/index.html
        content: |
          <!DOCTYPE html>
          <html lang='en'>
          <head>
            <meta charset='UTF-8'>
            <title>Cloud Assignment - Alexin</title>
            <style>
              body {
                font-family: 'Segoe UI', sans-serif;
                max-width: 800px;
                margin: 50px auto;
                padding: 20px;
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                color: white;
              }
              .container {
                background: rgba(255, 255, 255, 0.1);
                padding: 40px;
                border-radius: 15px;
                backdrop-filter: blur(10px);
              }
              .info {
                background: rgba(255, 255, 255, 0.2);
                padding: 15px;
                border-radius: 8px;
                margin: 20px 0;
              }
            </style>
          </head>
          <body>
            <div class='container'>
              <h1>🚀 Cloud Assignment</h1>
              <div class='info'>
                <p><strong>Name:</strong> Alexin</p>
                <p><strong>Instance:</strong> {{ instance_hostname.stdout }}</p>
                <p><strong>Private IP:</strong> {{ private_ip.stdout }}</p>
              </div>
              <div class='about'>
                <h2>About Me</h2>
                <p>Backend engineer from Lagos, Nigeria. This was deployed with
                Ansible through a Bastion Host to demonstrate Infrastructure as Code.</p>
              </div>
            </div>
          </body>
          </html>
        mode: '0644'
      notify: Restart NGINX

  handlers:
    - name: Restart NGINX
      systemd:
        name: nginx
        state: restarted
```
Ran it:
```shell
ansible-playbook -i inventory.ini deploy-webserver.yml
```
IT FAILED.
The playbook started running, got to the "Install NGINX" task, and just... timed out. Connection errors everywhere. The servers were reachable via SSH, Ansible could ping them, but they couldn't download anything.
```
TASK [Install NGINX] *****
fatal: [webserver1]: FAILED! => {"changed": false, "msg": "Failed to update apt cache"}
fatal: [webserver2]: FAILED! => {"changed": false, "msg": "Failed to update apt cache"}
```
I spent an hour debugging this. Turns out my web servers couldn't reach the internet to download packages. They could talk to Ansible (via the Bastion), but they couldn't reach Ubuntu's package repositories.
The Internet Access Problem
Here's what I learned the hard way: My servers needed internet access to run apt update and apt install nginx. They were technically reachable via SSH, but they couldn't initiate outbound connections to the internet.
The proper solution? NAT Gateway.
But I was already tired and the deadline was approaching, so I did a "quick fix": I kept everything in the same public subnet (with Internet Gateway access) so the servers could download packages. Not best practice, but it worked for getting Ansible running.
After that fix, the playbook ran perfectly:
```
PLAY RECAP *****
webserver1: ok=8 changed=6 unreachable=0 failed=0
webserver2: ok=8 changed=6 unreachable=0 failed=0
```
Both servers configured in under 2 minutes. This is when I understood why everyone talks about automation - imagine doing this manually on 10 or 100 servers.
Why Ansible Changed Everything
Before this project, I'd always SSH'd into servers and ran commands manually. Copy-paste from StackOverflow, hope it works, repeat on the next server.
Ansible made me realize:
- Consistency - Both servers got the EXACT same config
- Speed - 2 minutes vs 30 minutes of manual work
- Documentation - My playbook IS my documentation
- Repeatability - I can run it again tomorrow and get the same result
The playbook grabs the hostname and private IP dynamically with shell commands, then injects them into the HTML template. Each server shows different info, which is what later proves the load balancer is actually distributing traffic.
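As an aside, the same values are available without shelling out. This is a sketch of a cleaner variant using Ansible's gathered facts (`ansible_hostname` and `ansible_default_ipv4` are populated automatically when `gather_facts` is on, which is the default):

```yaml
# Facts-based variant of the "Deploy custom HTML page" task (sketch)
- name: Deploy custom HTML page (facts version)
  copy:
    dest: /var/www/html/index.html
    content: |
      <p><strong>Instance:</strong> {{ ansible_hostname }}</p>
      <p><strong>Private IP:</strong> {{ ansible_default_ipv4.address }}</p>
    mode: '0644'
```

This also removes the two `shell`/`register` tasks entirely.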
Day 2: Load Balancer Time
Creating the Target Group
First, I created a target group:
- Name: webserver-tg
- Protocol: HTTP, Port: 80
- Registered both web servers
- Health checks: HTTP GET to /
Creating the ALB
Then the Application Load Balancer:
- Name: web-app-alb
- Internet-facing
- Selected multiple availability zones
- Security group: alb-sg
- Forwarded traffic to: webserver-tg
Waited for health checks... and both targets went healthy!
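For the record, the console steps above correspond roughly to these CLI calls (all ARNs, IDs, and subnet names are placeholders):

```shell
# Target group with health checks on /
aws elbv2 create-target-group --name webserver-tg \
  --protocol HTTP --port 80 --vpc-id <vpc-id> --health-check-path /

aws elbv2 register-targets --target-group-arn <tg-arn> \
  --targets Id=<web-server-1-id> Id=<web-server-2-id>

# Internet-facing ALB across two public subnets
aws elbv2 create-load-balancer --name web-app-alb \
  --subnets <public-subnet-a> <public-subnet-b> \
  --security-groups <alb-sg-id>

# HTTP listener forwarding to the target group
aws elbv2 create-listener --load-balancer-arn <alb-arn> \
  --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>
```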
Accessed the ALB DNS name in my browser:
```
http://web-app-alb-xxxxxxxxx.eu-north-1.elb.amazonaws.com
```
IT WORKED!
But I noticed it kept showing the same server. Turns out that's normal - browsers reuse keep-alive connections, so refreshes tend to land on the same backend (ALBs also support cookie-based "sticky sessions", but those are opt-in). I verified it was actually load balancing with curl, which opens a fresh connection for each request:
```shell
for i in {1..10}; do curl -s http://<alb-dns-name> | grep "Instance:"; done
```
And boom - I saw both server hostnames appearing. Load balancing confirmed.
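To make that verification more convincing, you can count how many requests each server answered. Here's the counting part in isolation, run against a hypothetical captured sample instead of a live ALB (the hostnames are made up):

```shell
# Hypothetical sample of four responses captured from the curl loop above
sample='<p><strong>Instance:</strong> ip-172-31-10-5</p>
<p><strong>Instance:</strong> ip-172-31-10-6</p>
<p><strong>Instance:</strong> ip-172-31-10-5</p>
<p><strong>Instance:</strong> ip-172-31-10-6</p>'

# Extract the hostnames and count hits per backend;
# two distinct hostnames with similar counts means round-robin is working
echo "$sample" | grep -o 'ip-[0-9-]*' | sort | uniq -c
```

Against a live ALB, replace the `echo` with the curl loop and pipe it into the same `grep | sort | uniq -c`.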
Why a Bastion Host?
At first, I didn't get why we needed a Bastion Host. Why not just SSH directly to the servers?
Here's what I learned:
Without Bastion (Bad):
- Every server exposed to the internet
- More attack surface
- Hard to control who accesses what
- Each server needs its own public IP
With Bastion (Good):
- Only ONE server exposed to internet
- Single point of entry I can monitor and secure
- Web servers hidden in private subnet
- I can log every SSH session
- If someone compromises the Bastion, web servers are still protected
Real companies use this because:
- Compliance - Many regulations require it
- Security - Defense in depth
- Auditing - Know who accessed what and when
The Bastion is like a security checkpoint. Everything goes through it, nothing bypasses it.
Load Balancer vs Direct Access: Why It Matters
When I started, I thought "why not just give each server a public IP and call it a day?"
Here's why that's a terrible idea:
Direct EC2 Access (What NOT to Do):
- Single Point of Failure - Server goes down? Your app is offline
- No Failover - Users get errors until you manually fix it
- Security Risk - Servers exposed directly to attacks
- Hard to Scale - Adding servers means changing DNS
- No Health Checks - You won't know a server died until users complain
Load Balancer Access (The Right Way):
- High Availability - One server dies? Traffic goes to the other
- Automatic Failover - Load balancer detects failures and reroutes
- Better Security - Servers stay in private subnet
- Easy Scaling - Add/remove servers without DNS changes
- Health Monitoring - Load balancer constantly checks if servers are healthy
- SSL Termination - Handle HTTPS at the load balancer, not on each server
In my setup, if Web-Server-1 crashes, the load balancer immediately stops sending traffic to it. Users never notice. That's production-grade architecture.
The "Oh Shit" Moment: Realizing I Needed to Rebuild
Later, I realized something: I had put all my instances in the same subnet. But the assignment specifically said "private subnet" for web servers.
A subnet is either public (its route table has a route to an Internet Gateway) or private (no IGW route - outbound traffic, if any, goes through a NAT Gateway). One subnet can't be both.
So I had to:
- Create a NAT Gateway (in a public subnet)
- Create a Private Route Table (routes outbound traffic through NAT)
- Associate that route table with a private subnet
- Terminate my web servers
- Launch new ones in the private subnet WITHOUT public IPs
- Update my inventory and SSH config
- Run Ansible again
- Update the target group
This took another hour, but it was the right way to do it.
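The NAT Gateway and route table steps can be sketched with the CLI like this (placeholders throughout; note the NAT Gateway itself lives in a *public* subnet and needs an Elastic IP):

```shell
# Elastic IP for the NAT Gateway
aws ec2 allocate-address --domain vpc

# NAT Gateway goes in a PUBLIC subnet
aws ec2 create-nat-gateway --subnet-id <public-subnet-id> \
  --allocation-id <eip-allocation-id>

# Private route table: default route via the NAT Gateway
aws ec2 create-route-table --vpc-id <vpc-id>
aws ec2 create-route --route-table-id <private-rt-id> \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <nat-gw-id>
aws ec2 associate-route-table --route-table-id <private-rt-id> \
  --subnet-id <private-subnet-id>
```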
The NAT Gateway Confusion
Here's something that confused me: If web servers are in a private subnet with NO internet access, how do they download packages?
Answer: NAT Gateway.
A NAT Gateway allows outbound-only internet access:
```
Web Server: "I need to apt install nginx"
        ↓
Private Subnet Route Table: "Send to NAT Gateway"
        ↓
NAT Gateway: "I'll translate your private IP to my public IP"
        ↓
Internet: "Here's nginx"
        ↓
NAT Gateway: "I remember who asked, sending back to Web Server"
        ↓
Web Server: "Got it, thanks!"
```
The key: Outbound connections work, but the internet CANNOT initiate connections to your private servers. They're truly isolated.
Without NAT Gateway, private servers are completely offline. Can't download anything, can't reach APIs, nothing.
All The Things That Went Wrong
Issue 1: "Connection Timed Out"
Problem: The playbook's apt tasks failed - the servers timed out trying to reach Ubuntu's package repositories.
Why: The web servers had no path to the internet. They sat in a public subnet, but without public IPs of their own (and without a NAT Gateway), outbound traffic had nowhere to go.
Solution: Quick fix - made the instances public. Later rebuilt properly with NAT Gateway.
Lesson: Private instances = NO internet unless you add NAT Gateway.
Issue 2: All Instances in Same Subnet
Problem: Assignment required private subnet for web servers, but I had everything in the same subnet.
Why: I didn't understand that subnets are either public OR private, not both.
Solution: Created separate private subnet with route table pointing to NAT Gateway. Launched new web servers there.
Lesson: Network architecture planning matters. Can't just throw everything anywhere.
Issue 3: Can't curl from Bastion to Web Servers
Problem: Tried to test web servers from Bastion using curl. Connection refused.
Why: Security group webserver-sg only allowed HTTP from alb-sg, not from bastion-sg.
Solution: Skipped this test. Verified through the load balancer instead.
Lesson: Security groups are strict. They block everything not explicitly allowed.
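If you do want that curl test from the Bastion, one option is a temporary rule allowing HTTP from bastion-sg, revoked once you're done (group IDs below are placeholders):

```shell
# Temporarily allow HTTP from the Bastion's security group
aws ec2 authorize-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <bastion-sg-id>

# ...run the curl test from the Bastion, then revoke:
aws ec2 revoke-security-group-ingress --group-id <webserver-sg-id> \
  --protocol tcp --port 80 --source-group <bastion-sg-id>
```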
Issue 4: Load Balancer Only Shows One Server
Problem: Refreshing the page kept showing the same server. Thought load balancing was broken.
Why: Connection reuse - the browser keeps the same keep-alive connection (and therefore the same backend) across refreshes. Cookie-based sticky sessions would have the same effect, but on an ALB they're opt-in.
Solution: Used curl in a loop to verify both servers were getting traffic. They were.
Lesson: "Same server on refresh" doesn't mean load balancing is broken. Test with fresh connections before assuming the worst.
Issue 5: Target Group Shows "Unused"
Problem: After registering web servers in target group, status showed "Unused" with message "Target is in an Availability Zone that is not enabled for the load balancer"
Why: My web servers were in AZ eu-north-1c in a private subnet, but my load balancer wasn't configured to use that AZ since it couldn't use private subnets.
Solution: Created a public subnet in AZ 1c and added it to the load balancer's network mapping.
Lesson: Load balancers need to span the availability zones where your targets are.
Issue 6: NAT Gateway Not Appearing in Route Table
Problem: When trying to add route 0.0.0.0/0 => NAT Gateway, it wasn't showing up in the dropdown.
Why: NAT Gateway was still in "Pending" status. Hadn't finished creating yet.
Solution: Waited 2-3 minutes for status to become "Available", then refreshed the page.
Lesson: AWS resources take time to provision. Be patient.
Issue 7: New Subnet Not Showing in ALB Config
Problem: Created a new public subnet for the ALB, but it didn't appear in the subnet dropdown.
Why: New subnet wasn't associated with a route table that has Internet Gateway route.
Solution: Associated the subnet with the public route table (one with IGW route). Refreshed page.
Lesson: Subnets need proper route table associations to be recognized correctly.
What I Learned About Cloud Security
This project taught me security is about layers:
Layer 1: Network Segmentation
- Public subnet: Only Bastion, NAT, Load Balancer
- Private subnet: Application servers
Layer 2: Security Groups
- Each component has its own firewall rules
- Rules reference each other by security group ID
- Principle of least privilege
Layer 3: No Public IPs on App Servers
- Can't be attacked if they're not reachable
- All traffic forced through load balancer
Layer 4: SSH Keys, Not Passwords
- Cryptographic authentication
- Keys never leave my laptop
Layer 5: Bastion Host
- Single auditable entry point
- Can log all SSH sessions
- Additional barrier
This is what "defense in depth" means. You don't rely on one security measure, you stack them.
If I Did It Again
Things I'd add for a real production system:
1. HTTPS
Right now it's just HTTP. In production, I'd:
- Get a free SSL cert from AWS Certificate Manager
- Configure ALB to handle HTTPS on port 443
- Redirect HTTP to HTTPS
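As a sketch, the ALB side of that looks roughly like this with the CLI (all ARNs are placeholders, and the cert would come from ACM first):

```shell
# HTTPS listener on 443 using an ACM certificate
aws elbv2 create-listener --load-balancer-arn <alb-arn> \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=<acm-cert-arn> \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>

# Change the existing HTTP listener to redirect to HTTPS
aws elbv2 modify-listener --listener-arn <http-listener-arn> \
  --default-actions 'Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'
```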
2. Auto Scaling
Currently just 2 fixed servers. I'd add:
- Auto Scaling Group
- Scale up when CPU > 70%
- Scale down when CPU < 30%
- Minimum 2 instances for availability
3. Multiple Availability Zones for Web Servers
Right now both servers are in AZ 1c. For true high availability:
- Put Web-Server-1 in AZ 1b (since Bastion is in 1a)
- Put Web-Server-2 in AZ 1c
- Protected against entire data center failures
4. Database Layer
Currently no database. I'd add:
- Amazon RDS in private subnet
- Separate database security group
- Automated backups
5. Monitoring
Zero visibility right now. I'd add:
- CloudWatch alarms (CPU, memory, disk)
- CloudWatch Logs (application logs)
- SNS notifications for alerts
6. CI/CD Pipeline
Manual deployments don't scale. I'd add:
- GitHub Actions or AWS CodePipeline
- Automated testing
- Blue/green deployments
7. Infrastructure as Code
I clicked through the console for everything. Better approach:
- Terraform or CloudFormation
- Version control entire infrastructure
- Reproducible across environments
Real Talk: Was It Worth It?
Two days. Countless errors. Almost gave up multiple times.
But here's what I got out of it:
Technical Skills:
- Actually understand VPCs, subnets, route tables now
- Know how security groups work
- Can write Ansible playbooks
- Understand load balancer concepts
Problem-Solving:
- Learned to read error messages carefully
- Know how to debug connection issues
- Can think through network architecture
Confidence:
- Built something real, not a tutorial
- Can explain this in interviews
- Know I can figure things out when stuck
The assignment said "If you can explain this project clearly, you can explain 70% of junior cloud interviews."
They weren't lying. This project covers:
- Cloud networking fundamentals
- Security best practices
- Infrastructure automation
- High availability patterns
- Real-world troubleshooting
For Anyone Doing This Assignment
If you're an AltSchool student reading this:
Do's:
- Start early. This takes longer than you think
- Read error messages carefully
- Draw your architecture before building
- Test each component before moving to the next
- Take screenshots as you go (you'll need them)
- Ask for help when stuck
Don'ts:
- Don't put everything in the same subnet
- Don't skip the NAT Gateway (private servers need it)
- Don't give web servers public IPs (assignment requires private)
- Don't forget to delete resources after submission (AWS bills add up!)
- Don't just copy commands without understanding them
Key Resources:
- AWS Documentation
- Ansible Documentation
- This article (seriously, refer back to it)
Final Architecture Checklist
Before submitting, verify:
- 3 EC2 instances running (1 Bastion, 2 Web Servers)
- Only Bastion has public IP
- Web servers in different subnet than Bastion
- NAT Gateway created and available
- Private route table routes through NAT Gateway
- Can SSH: laptop => bastion => web servers
- Ansible playbook runs successfully
- Both web servers show different hostnames/IPs
- Application Load Balancer is active
- Both targets healthy in target group
- ALB DNS shows your page
- Refreshing shows different servers (use curl to verify)
- RESOURCES DELETED AFTER SUBMISSION
That last one is critical. NAT Gateway alone costs ~$32/month if you forget it.
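A teardown sketch for the expensive pieces (IDs and ARNs are placeholders - and always double-check in the console afterwards that nothing is still running):

```shell
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>
aws elbv2 delete-target-group --target-group-arn <tg-arn>
aws ec2 delete-nat-gateway --nat-gateway-id <nat-gw-id>
aws ec2 release-address --allocation-id <eip-allocation-id>  # after the NAT Gateway is gone
aws ec2 terminate-instances --instance-ids <bastion-id> <web-1-id> <web-2-id>
```

The Elastic IP is easy to forget: it's free while attached to a running NAT Gateway, but billed once it's just sitting there unreleased.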
Closing Thoughts
This assignment kicked my ass, but in the best way possible.
I went from "I've watched AWS tutorials" to "I've built AWS infrastructure". That's a huge difference.
The frustration when Ansible couldn't connect. The confusion about subnets. The moment when the load balancer finally showed both servers. The relief when all health checks turned green.
That's real learning.
If you're reading this because you're about to start this assignment: good luck. You'll need it. But you'll also learn more than you expect.
If you're reading this because you finished: congrats! We survived.
If you're a recruiter: this is what I know now. Want to see if I can build something for you?