Overview
In this tutorial, we'll use Stakpak to investigate and fix an AWS networking incident where an application running on EC2 is healthy, but unreachable from the internet.
Rather than manually inspecting EC2, VPC, subnet, route table, security group, network ACL, systemd, nginx, and application logs one by one, we'll use Stakpak to:
- Investigate the incident
- Identify the root cause
- Apply the fix
- Validate that the EC2 application becomes reachable again
By the end of this tutorial, you'll know how to use Stakpak to troubleshoot EC2 application reachability issues across both the instance and AWS networking layers. We will also set up Stakpak Autopilot so it monitors our infrastructure 24/7, auto-fixes issues when it's safe, and pings us when human judgment is needed.
Note: Stakpak is open source, vendor neutral, and works with any model you choose.
Problem
You deploy a simple web application to an EC2 instance, and everything seems fine at first.
- The Terraform deployment succeeds.
- The EC2 instance is running.
- The instance has a public IP address.
- The security group appears to allow HTTP traffic.
- The application process is healthy.
- nginx is running.
But when you try to access the application from the internet, the request times out.
curl -v --connect-timeout 5 --max-time 10 http://ec2-3-236-155-58.compute-1.amazonaws.com/health
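A timeout and a refused connection point at different layers, so it helps to tell them apart before digging in. The sketch below is a minimal, standard-library Python probe; the function name `classify_probe` and its labels are illustrative assumptions, not part of Stakpak or the AWS CLI. A `timeout` usually means packets are being silently dropped somewhere (security group, network ACL, or routing), while `refused` means the host answered but nothing is listening on the port.

```python
import socket
import urllib.error
import urllib.request


def classify_probe(url: str, timeout: float = 5.0) -> str:
    """Return a coarse label describing how an HTTP GET to `url` ends.

    Hypothetical helper for illustration only.
    """
    # Ignore any configured HTTP proxy so the probe tests the direct path.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return f"http_{response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just with an error status (4xx/5xx).
        return f"http_{err.code}"
    except urllib.error.URLError as err:
        reason = err.reason
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timeout"      # packets likely dropped on the way
        if isinstance(reason, ConnectionRefusedError):
            return "refused"      # host reachable, nothing listening
        if isinstance(reason, socket.gaierror):
            return "dns_failure"  # the name did not resolve
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # Raised directly when the response read (not the connect) times out.
        return "timeout"
    except OSError:
        return "unreachable"
```

In the incident described here, a probe like this would be expected to return `timeout` against the public DNS name, which matches the `curl` behavior above.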
So you start the usual EC2 reachability debugging loop:
aws ec2 describe-instances \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-instance-status \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-security-groups \
  --group-ids sg-0d133f86e2d08a392 \
  --region us-east-1

aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-001f8813b0d78f5e3 \
  --region us-east-1

aws ec2 describe-subnets \
  --subnet-ids subnet-07083683f7e1d2f09 \
  --region us-east-1

aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-07083683f7e1d2f09 \
  --region us-east-1
Then you start checking the instance itself:
aws ssm start-session \
  --target i-0a2bf3df8a5769989 \
  --region us-east-1
Now you have to figure out what actually matters.
- Is the instance unhealthy?
- Is nginx down?
- Is the app listening on the wrong interface?
- Is the security group blocking traffic?
- Is the subnet public?
- Is the route table missing an internet route?
- Is the public IP missing?
- Is another VPC networking control blocking the request?
AWS gives you the clues, but you still have to connect them.
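As a rough illustration of connecting two of those clues, the sketch below reads the JSON that `aws ec2 describe-route-tables` and `aws ec2 describe-security-groups` return and answers two of the questions above: is there an active default route to an internet gateway, and does the security group admit TCP on a given port from anywhere? The function names are hypothetical, and the checks are deliberately simplified (IPv4 only, no prefix lists, and no check of which route table is actually associated with the subnet).

```python
def has_internet_route(route_tables: dict) -> bool:
    """True if any listed route table has an active 0.0.0.0/0 route to an IGW.

    `route_tables` is the parsed JSON from `aws ec2 describe-route-tables`.
    Simplification: does not verify the table is the one used by the subnet.
    """
    for table in route_tables.get("RouteTables", []):
        for route in table.get("Routes", []):
            gateway_id = route.get("GatewayId") or ""
            if (
                route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and gateway_id.startswith("igw-")
                and route.get("State") == "active"
            ):
                return True
    return False


def allows_inbound_tcp(security_groups: dict, port: int) -> bool:
    """True if any listed group allows TCP `port` inbound from 0.0.0.0/0.

    `security_groups` is the parsed JSON from `aws ec2 describe-security-groups`.
    """
    for group in security_groups.get("SecurityGroups", []):
        for perm in group.get("IpPermissions", []):
            protocol = perm.get("IpProtocol")
            if protocol == "-1":
                port_matches = True  # "-1" means all protocols and ports
            elif protocol == "tcp":
                port_matches = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
            else:
                continue
            open_to_world = any(
                ip_range.get("CidrIp") == "0.0.0.0/0"
                for ip_range in perm.get("IpRanges", [])
            )
            if port_matches and open_to_world:
                return True
    return False
```

To try it, save each command's output with `--output json`, load it with `json.load`, and pass the resulting dictionaries in. If both functions return `True` and the request still times out, the cause is likely in a layer these two checks do not cover.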
Application
The application is a simple web service running on an EC2 instance.
It represents a small catalog preview service for the Northstar Commerce platform.
The app exposes a health endpoint and a basic HTML page. It runs locally on the instance and is served to external clients through nginx.
The main components are:
- EC2 Instance: Runs the application and nginx.
- Python Web Application: Provides the demo web service and health endpoint.
- systemd Service: Keeps the application process running.
- nginx: Listens on HTTP port 80 and proxies requests to the local app.
- Security Group: Controls instance-level inbound and outbound traffic.
- Subnet: Places the instance inside the VPC network.
- Route Table: Defines how traffic leaves the subnet.
- Internet Gateway: Provides internet connectivity for the VPC.
- Network ACL: Applies subnet-level traffic rules.
- IAM Instance Profile: Allows access through AWS Systems Manager Session Manager.
The normal request flow is:
A user sends an HTTP request to the EC2 public DNS name, traffic enters the VPC through the internet gateway, reaches the public subnet, passes the subnet and instance network controls, reaches nginx on port 80, nginx proxies the request to the local Python app on 127.0.0.1:8080, and the app returns a health response.
The expected health endpoint is: GET /health
When the application is working correctly, it returns:
{
  "status": "ok",
  "service": "northstar-catalog-preview"
}
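The tutorial does not show the application source, so the following is only a minimal sketch of what such a service could look like, using Python's standard `http.server`. The class name `CatalogPreviewHandler`, the `make_server` helper, and the HTML body are assumptions for illustration; the health response and the 127.0.0.1:8080 bind address come from the description above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Health payload as described in the tutorial.
HEALTH_BODY = {"status": "ok", "service": "northstar-catalog-preview"}


class CatalogPreviewHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: JSON on /health, a basic HTML page elsewhere."""

    def do_GET(self):
        if self.path == "/health":
            payload = json.dumps(HEALTH_BODY).encode("utf-8")
            content_type = "application/json"
        else:
            payload = b"<h1>Northstar catalog preview</h1>"
            content_type = "text/html; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real service would log requests.
        pass


def make_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Build the server bound to the loopback address that nginx proxies to."""
    return HTTPServer((host, port), CatalogPreviewHandler)
```

Calling `make_server().serve_forever()` starts it on the loopback address. Because it binds to 127.0.0.1, only local processes such as nginx can reach it directly, which is why a local `/health` check can pass while the public path is still broken.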
In this incident, the application is healthy from inside the instance, but unreachable from the internet.
Now that we understand the app, we can start troubleshooting.
Step-by-Step Guide
Prerequisites
Cloud provider credentials configured
Troubleshooting
- Open Stakpak and ask it to investigate the EC2 issue.
Now let's let it do its magic.
Stakpak started by investigating why the EC2 /health endpoint was timing out. It checked DNS, EC2 status, SSM access, security groups, route tables, NACLs, and local app health.
It found that the EC2 instance and app were healthy, but the subnet Network ACL was blocking outbound ephemeral response traffic. The instance could receive traffic on port 80, but couldn’t send responses back to clients.
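Network ACLs are stateless and evaluated in rule-number order, with the first matching rule deciding the outcome, so return traffic needs its own outbound rule. The sketch below models that evaluation in Python over entries shaped like the `Entries` list from `aws ec2 describe-network-acls`. The function names, rule numbers, and sample entries are hypothetical (the tutorial does not show the actual NACL contents), and CIDR matching is omitted for brevity. The sample only reproduces the symptom: inbound port 80 is allowed, but the reply to the client's ephemeral port is denied until an outbound rule for TCP 1024-65535 exists.

```python
def nacl_allows(entries: list, *, egress: bool, port: int, protocol: str = "6") -> bool:
    """Evaluate NACL entries: lowest rule number first, first match wins.

    Simplification: every entry is assumed to match the peer's IP address.
    Protocol "6" is TCP and "-1" means all protocols.
    """
    direction = sorted(
        (entry for entry in entries if entry["Egress"] == egress),
        key=lambda entry: entry["RuleNumber"],
    )
    for entry in direction:
        if entry["Protocol"] not in ("-1", protocol):
            continue
        port_range = entry.get("PortRange")
        if entry["Protocol"] != "-1" and port_range is not None:
            if not port_range["From"] <= port <= port_range["To"]:
                continue
        return entry["RuleAction"] == "allow"
    return False  # nothing matched: treat as denied


def round_trip_allowed(entries: list, server_port: int, client_port: int) -> bool:
    """A request must pass inbound AND its reply must pass outbound."""
    return (
        nacl_allows(entries, egress=False, port=server_port)
        and nacl_allows(entries, egress=True, port=client_port)
    )


# Hypothetical entries that reproduce the symptom (not the real NACL).
BROKEN_NACL = [
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": False,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 80, "To": 80}},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": False,
     "CidrBlock": "0.0.0.0/0"},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": True,
     "CidrBlock": "0.0.0.0/0"},
]

# Same NACL plus an outbound allow for the ephemeral port range.
FIXED_NACL = BROKEN_NACL + [
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": True,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 1024, "To": 65535}},
]

print(round_trip_allowed(BROKEN_NACL, server_port=80, client_port=50000))  # False
print(round_trip_allowed(FIXED_NACL, server_port=80, client_port=50000))   # True
```

Security groups are stateful and allow replies automatically, which is one reason a correct-looking security group can coexist with this kind of timeout.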
Here is what Stakpak did, step by step:
- Verified EC2 status checks were passing
- Confirmed SSM access was online
- Confirmed nginx and the app were running locally
- Verified local /health returned 200 OK
- Confirmed the security group and route table were correct
- Added an outbound NACL rule for TCP 1024-65535
- Ran Terraform validation
- Applied the Terraform fix
During apply, Terraform also replaced the EC2 instance because the AL2023 AMI had changed, so the instance ID and public DNS name differ from the ones used earlier.
After the fix, Stakpak verified that:
- The new instance i-04244ee1e1e4ef422 was running
- The new URL was http://ec2-44-223-99-238.compute-1.amazonaws.com
- The NACL allowed outbound ephemeral traffic
- /health returned HTTP/1.1 200 OK
Now everything is working 🥳
Let's ask Stakpak to set up Stakpak Autopilot so we don't get woken up at 3 am by an incident 🤡
Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.
Monitoring
That's it. Now incidents won't haunt our nightmares at 3 am.



