I built a self-healing web app on AWS and watched it recover from failure in real time

#aws #devops #typescript #cloud

I wanted to actually understand AWS networking. Not "I followed a tutorial, and it worked" understand. More like "I can explain why this NAT Gateway exists and what breaks if I delete it" understand.

So I built CloudPulse. It's a TypeScript app that monitors its own infrastructure and displays the health of every instance on a dashboard. The interesting part: when you kill an instance, the system detects the failure and replaces it automatically while users never notice anything went wrong.

No Terraform. No CloudFormation. Raw AWS CLI calls in bash scripts, each one commented so I'd remember what it does in six months.

How it's wired together

Internet traffic hits an Application Load Balancer sitting in public subnets. The ALB forwards requests to EC2 instances in private subnets on port 3000. Those instances have no public IP; they can't be reached directly from the internet at all. When they need to talk to AWS APIs (publishing CloudWatch metrics, describing their own ASG), they go through NAT Gateways.

There's one NAT Gateway per availability zone. If the one in AZ-1 dies, only the instance in AZ-1 loses outbound connectivity. The instance in AZ-2 keeps working through its own NAT Gateway. That's the point of having two.
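The difference between one NAT Gateway per AZ and a single shared one can be modelled in a few lines. This is only a sketch: the `PrivateSubnet` type, the instance names, and the `sharedNat` comparison are assumptions made up for illustration, not taken from the CloudPulse scripts.

```typescript
// Hypothetical model of the routing described above; all names are illustrative.
type Az = "az-1" | "az-2";

interface PrivateSubnet {
  az: Az;
  instanceId: string;
  // AZ of the NAT Gateway that this subnet's default route (0.0.0.0/0) points at.
  natGatewayAz: Az;
}

// One NAT Gateway per AZ: each private subnet routes through the NAT in its own AZ.
const perAzNat: PrivateSubnet[] = [
  { az: "az-1", instanceId: "instance-a", natGatewayAz: "az-1" },
  { az: "az-2", instanceId: "instance-b", natGatewayAz: "az-2" },
];

// Cheaper alternative for comparison: both subnets share one NAT Gateway in az-1.
const sharedNat: PrivateSubnet[] = [
  { az: "az-1", instanceId: "instance-a", natGatewayAz: "az-1" },
  { az: "az-2", instanceId: "instance-b", natGatewayAz: "az-1" },
];

// Returns the instances that lose outbound connectivity when one NAT Gateway fails.
function instancesWithoutEgress(subnets: PrivateSubnet[], failedNatAz: Az): string[] {
  return subnets
    .filter((subnet) => subnet.natGatewayAz === failedNatAz)
    .map((subnet) => subnet.instanceId);
}

const perAzImpact = instancesWithoutEgress(perAzNat, "az-1");   // ["instance-a"]
const sharedImpact = instancesWithoutEgress(sharedNat, "az-1"); // ["instance-a", "instance-b"]
```

With the shared layout, a single NAT failure cuts off both instances, which is the failure mode the second gateway pays to avoid.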

The ALB checks /health every 30 seconds. Three consecutive failures and the instance gets pulled from the target group. The Auto Scaling Group notices the instance is unhealthy, terminates it, and launches a fresh one. No human involved.
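The "three consecutive failures" rule is a small state machine: any successful check resets the counter. Below is a rough sketch of that logic using the interval and threshold from the text. The function name is hypothetical, and the real ALB also applies per-check timeouts that are ignored here.

```typescript
// Settings taken from the numbers in the text.
const INTERVAL_SECONDS = 30;
const UNHEALTHY_THRESHOLD = 3;

// Returns the index of the check at which the target is marked unhealthy, or -1 if never.
// Each entry in `results` is one health check: true = passed, false = failed.
function firstUnhealthyCheck(results: boolean[], threshold: number): number {
  let consecutiveFailures = 0;
  for (let i = 0; i < results.length; i++) {
    consecutiveFailures = results[i] ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= threshold) return i;
  }
  return -1;
}

// A single success in between resets the counter, so this target stays in rotation.
const flaky = firstUnhealthyCheck([false, false, true, false, false], UNHEALTHY_THRESHOLD); // -1

// Three failures in a row: marked unhealthy on the fourth check (index 3).
const dead = firstUnhealthyCheck([true, false, false, false], UNHEALTHY_THRESHOLD); // 3

// Rough time for health checks alone to flag a dead target, ignoring timeouts.
const detectionSeconds = INTERVAL_SECONDS * UNHEALTHY_THRESHOLD; // 90
```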

The part I actually wanted to see: killing an instance

This is why I built the whole thing. I wanted to watch a system heal itself.

I ran aws ec2 terminate-instances on one of the two running instances. Then I sat there watching the dashboard refresh every 30 seconds.

Within about a minute, the terminated instance showed up as unhealthy. The ASG launched a replacement. The new instance booted Amazon Linux 2023, pulled my app from S3, installed dependencies, started the Node process, and began responding to the ALB health checks.

Total recovery time: under 2 minutes. And during those 2 minutes, the ALB was sending all traffic to the surviving instance. Nobody waiting for a page load would have noticed anything.

That's the thing about self-healing infrastructure. It's boring when it works. You kill something, wait a bit, and everything is back to normal. But getting to that boring place required wiring up health checks, ASG policies, target group settings, and IAM permissions correctly. The boring outcome is the proof that the wiring works.

Auto scaling under load

I connected to one of the instances via SSM Session Manager (no SSH keys anywhere in this setup) and ran stress --cpu 4 --timeout 180. This pegged the CPU at 100% for 3 minutes.

CloudWatch saw the CPUUtilization metric exceed 70% for 2 consecutive 60-second periods. The scale-out alarm fired, and the ASG added a third instance. When the stress test ended and CPU stayed below 30% for 2 minutes, the scale-in alarm fired, and the ASG removed the extra instance.

The scaling policies have a 300-second cooldown so they don't thrash back and forth.
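The alarm and cooldown behaviour can be approximated as follows. The thresholds come from the text, but the function names and the one-datapoint-per-minute input are assumptions. This is a simplified model, not how CloudWatch is actually implemented.

```typescript
// Thresholds from the text; the function names are hypothetical.
const SCALE_OUT_CPU = 70;
const SCALE_IN_CPU = 30;
const EVALUATION_PERIODS = 2;
const COOLDOWN_SECONDS = 300;

type Action = "scale-out" | "scale-in" | "none";

// cpuByMinute holds one average CPU datapoint per 60-second period, oldest first.
function evaluate(cpuByMinute: number[]): Action {
  const recent = cpuByMinute.slice(-EVALUATION_PERIODS);
  if (recent.length < EVALUATION_PERIODS) return "none";
  if (recent.every((cpu) => cpu > SCALE_OUT_CPU)) return "scale-out";
  if (recent.every((cpu) => cpu < SCALE_IN_CPU)) return "scale-in";
  return "none";
}

// A scaling action is skipped while the previous one is still cooling down.
function inCooldown(nowSeconds: number, lastActionSeconds: number | null): boolean {
  return lastActionSeconds !== null && nowSeconds - lastActionSeconds < COOLDOWN_SECONDS;
}

const spike = evaluate([12, 95]);          // "none": only one period above 70%
const sustained = evaluate([12, 95, 100]); // "scale-out"
const idle = evaluate([100, 8, 5]);        // "scale-in"
const blocked = inCooldown(400, 200);      // true: only 200 s since the last action
```

Requiring two consecutive periods filters out one-minute spikes, and the cooldown keeps a scale-out from being undone immediately.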

The instances themselves

Both instances are t3.micro (free tier eligible, sort of; you get 750 hours/month, but 2 instances running around the clock burn 1,440 hours in a 30-day month). Private subnets, no public IP, no SSH key pair. I access them through Systems Manager Session Manager when I need to poke around.
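The free tier arithmetic, spelled out. This assumes the instances run continuously for the whole month, which is not how the project is actually used, since the stack gets torn down between sessions.

```typescript
const FREE_TIER_HOURS = 750;
const HOURS_PER_DAY = 24;

// Instance-hours consumed by `instances` machines running nonstop for `days` days.
function instanceHours(instances: number, days: number): number {
  return instances * days * HOURS_PER_DAY;
}

const used = instanceHours(2, 30);                    // 1440
const billable = Math.max(0, used - FREE_TIER_HOURS); // 690
const singleInstance = instanceHours(1, 31);          // 744, fits within 750
```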

The IAM role attached to the instances allows exactly four things: publish CloudWatch metrics, describe EC2 instances, describe ASG state, and use SSM for shell access. Nothing else.
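A policy along these lines would cover the first three permissions. This is a guess at the shape, not the author's actual policy. The action names are real IAM actions, but the exact statements in iam.sh are not shown in this post. It also assumes SSM access is granted through the AWS managed policy rather than inline statements.

```typescript
// Hypothetical policy document; the statements in the real iam.sh may differ.
const instancePolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: [
        "cloudwatch:PutMetricData",              // publish custom metrics
        "ec2:DescribeInstances",                 // describe EC2 instances
        "autoscaling:DescribeAutoScalingGroups", // describe ASG state
      ],
      Resource: "*",
    },
  ],
};

// Assumption: Session Manager access comes from the AWS managed policy.
const managedPolicyArns = ["arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"];

// JSON form, as it would be passed to the AWS CLI.
const policyJson = JSON.stringify(instancePolicy, null, 2);
```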

One-command deployment

bash deploy.sh runs five scripts in order:

1. iam.sh creates the role and instance profile
2. vpc.sh builds the entire network (this takes ~3 minutes because NAT Gateways are slow to provision)
3. alb.sh creates security groups, the load balancer, target group, and listener
4. compute.sh creates the launch template and ASG (instances start booting here)
5. monitoring.sh creates the CloudWatch alarms

At the end, it prints the ALB URL. Wait 3-5 minutes for instances to pass health checks, then open it.

bash teardown.sh deletes everything in reverse order. Takes about 3 minutes. I run it every time I finish a learning session because NAT Gateways cost $2/day just sitting there.
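The ordering logic behind deploy and teardown is simple enough to sketch. The real scripts are bash. This TypeScript version only illustrates the idea, and `runInOrder` and the simulated failure are made up for the example.

```typescript
// Deployment order from the list above: each layer depends on the ones before it.
const deploySteps = ["iam.sh", "vpc.sh", "alb.sh", "compute.sh", "monitoring.sh"];

// Teardown walks the same layers backwards so nothing is deleted
// while something else still depends on it.
const teardownLayers = deploySteps.slice().reverse();

// Runs steps in order and stops at the first failure, returning the steps that completed.
function runInOrder(steps: string[], run: (step: string) => boolean): string[] {
  const completed: string[] = [];
  for (const step of steps) {
    if (!run(step)) break;
    completed.push(step);
  }
  return completed;
}

// Simulated run in which alb.sh fails: compute.sh and monitoring.sh never start.
const completedSteps = runInOrder(deploySteps, (step) => step !== "alb.sh"); // ["iam.sh", "vpc.sh"]
```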

What I used

The app itself is TypeScript on Express. Server-side rendered HTML with EJS (no frontend framework; the dashboard is one page that refreshes every 30 seconds). 101 tests across unit, property-based (fast-check), and integration (supertest).

The infrastructure is pure AWS CLI in bash. Every script sources a shared config file and a common utilities file. Resource IDs get saved to an env file so scripts can reference what previous scripts created.
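The env file pattern is just KEY=VALUE lines appended by one script and read by the next. Here is a sketch of that round trip in TypeScript. The real scripts do this in bash, and the key names and IDs below are placeholders, not values from the project.

```typescript
// Parses KEY=VALUE lines, skipping blank lines and # comments.
function parseEnv(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.charAt(0) === "#") continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue; // no key before "=", or no "=" at all
    result[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return result;
}

// Writes one KEY=VALUE line per entry.
function serializeEnv(values: Record<string, string>): string {
  return Object.keys(values).map((key) => `${key}=${values[key]}`).join("\n") + "\n";
}

// Placeholder IDs for illustration only.
const afterVpc = serializeEnv({ VPC_ID: "vpc-EXAMPLE", PRIVATE_SUBNET_1: "subnet-EXAMPLE" });
const loaded = parseEnv(afterVpc); // { VPC_ID: "vpc-EXAMPLE", PRIVATE_SUBNET_1: "subnet-EXAMPLE" }
```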

AWS services in this project: VPC, public/private subnets, Internet Gateway, NAT Gateways, route tables, NACLs, security groups, Application Load Balancer, EC2 via Launch Template, Auto Scaling Group, CloudWatch custom metrics, CloudWatch alarms, IAM roles with instance profiles, EBS gp3 volumes, SSM Session Manager, and S3 for code delivery.

What I learned the hard way

Git Bash on Windows rewrites any path starting with / to a Windows path. My health check path /health became C:/Program Files/Git/health during deployment. Took me a while to figure out why the target group health check was failing. Fix: export MSYS_NO_PATHCONV=1.
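A cheap guard against this class of bug is to validate the path before it reaches AWS. The check below is a hypothetical addition, not something the project is stated to include.

```typescript
// Detects a value that looks like a Windows path, e.g. "C:/Program Files/Git/health".
function looksLikeWindowsPath(value: string): boolean {
  return /^[A-Za-z]:[\\/]/.test(value);
}

// Returns the path unchanged if it is a plain URL path, otherwise throws.
function assertUrlPath(value: string): string {
  if (looksLikeWindowsPath(value) || value.charAt(0) !== "/") {
    throw new Error(`Health check path looks mangled: ${value}`);
  }
  return value;
}

const healthPath = assertUrlPath("/health");                         // "/health"
const mangled = looksLikeWindowsPath("C:/Program Files/Git/health"); // true
```

Failing fast here turns a confusing "target group is unhealthy" symptom into an error that names the actual cause.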

ALB network interfaces take 5-10 minutes to fully release after you delete the ALB. If you try to delete the security groups too early, you get "DependencyViolation" errors. The teardown script has to wait.
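"Has to wait" usually means retrying the delete until the dependency clears. This is a generic retry sketch under that assumption. The `RetryOptions` shape and the injected `sleep` are illustrative choices, and the actual teardown script is bash and may wait differently.

```typescript
interface RetryOptions {
  attempts: number;
  delayMs: number;
  // Injected so the sketch stays testable; in Node this could wrap setTimeout.
  sleep: (ms: number) => Promise<void>;
  // Decides whether an error is worth retrying.
  isRetryable: (error: Error) => boolean;
}

// Calls `operation` until it succeeds, a non-retryable error occurs, or attempts run out.
async function retry<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  let lastError: Error = new Error("retry called with attempts < 1");
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));
      if (!options.isRetryable(lastError) || attempt === options.attempts) break;
      await options.sleep(options.delayMs);
    }
  }
  throw lastError;
}

// Retry only while AWS still reports the resource as in use.
const isDependencyViolation = (error: Error): boolean =>
  error.message.indexOf("DependencyViolation") !== -1;
```

Retrying only on DependencyViolation matters: a permissions error or a typo in a group ID should fail immediately instead of burning ten minutes of retries.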

IAM is eventually consistent. If you create an instance profile and immediately reference it in a launch template, it sometimes fails because the profile hasn't propagated yet. I added a 10-second sleep after IAM operations. Ugly, but it works.

Security groups that reference each other can't be deleted independently. You have to remove the cross-reference rules first, then delete them. The teardown script handles this, but it was a pain to debug the first time.
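The ordering constraint can be expressed as a two-phase plan: revoke every cross-group rule, then delete. The group names and the `teardownPlan` helper below are hypothetical and only model the ordering. They do not call AWS.

```typescript
interface SecurityGroup {
  id: string;
  // IDs of other groups referenced by this group's rules.
  references: string[];
}

// Produces a teardown plan: all cross-group rule revocations first, then the deletions.
function teardownPlan(groups: SecurityGroup[]): string[] {
  const revokes: string[] = [];
  const deletes: string[] = [];
  for (const group of groups) {
    for (const ref of group.references) {
      revokes.push(`revoke rule in ${group.id} referencing ${ref}`);
    }
    deletes.push(`delete ${group.id}`);
  }
  return revokes.concat(deletes);
}

// Hypothetical pair of groups that reference each other.
const sgPlan = teardownPlan([
  { id: "alb-sg", references: ["app-sg"] },
  { id: "app-sg", references: ["alb-sg"] },
]);
// [
//   "revoke rule in alb-sg referencing app-sg",
//   "revoke rule in app-sg referencing alb-sg",
//   "delete alb-sg",
//   "delete app-sg",
// ]
```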