Investigate Why an EC2 Application is Not Reachable

#aws #ec2 #devops #cloud

Overview

In this tutorial, we'll use Stakpak to investigate and fix an AWS networking incident where an application running on EC2 is healthy, but unreachable from the internet.

Rather than manually inspecting EC2, VPC, subnet, route table, security group, network ACL, systemd, nginx, and application logs one by one, we'll use Stakpak to:

Investigate the incident
Identify the root cause
Apply the fix
Validate that the EC2 application becomes reachable again

By the end of this tutorial, you'll know how to use Stakpak to troubleshoot EC2 application reachability issues across both the instance and AWS networking layers. We'll also set up Stakpak Autopilot so it monitors our infrastructure 24/7, auto-fixes issues when it's safe to do so, and pings us when human judgment is needed.

Note: Stakpak is open source, vendor neutral, and works with any model you choose.

Problem

You deploy a simple web application to an EC2 instance, and everything seems fine at first.

The Terraform deployment succeeds.

The EC2 instance is running.

The instance has a public IP address.

The security group appears to allow HTTP traffic.

The application process is healthy.

nginx is running.

But when you try to access the application from the internet, the request times out.

curl -v --connect-timeout 5 --max-time 10 http://ec2-3-236-155-58.compute-1.amazonaws.com/health
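How the request fails is itself a clue: a timeout usually means packets are being dropped silently somewhere on the path, while a refused connection means something answered. Here is a minimal sketch, assuming curl's standard exit codes (6 for a DNS failure, 7 for a failed connect, 28 for a timeout). The `classify_curl_exit` helper is hypothetical, and its hints are rules of thumb rather than a diagnosis:

```shell
#!/usr/bin/env bash
# Map a curl exit code to a rough hint about where to look next.
# classify_curl_exit is a hypothetical helper, not part of curl or Stakpak.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable: the request completed" ;;
    6)  echo "dns: the hostname did not resolve" ;;
    7)  echo "connect-failed: the connection was refused or the host was unreachable" ;;
    28) echo "timeout: packets are likely being dropped along the path" ;;
    *)  echo "other: see the curl man page for exit code $1" ;;
  esac
}

# Example usage (makes a real network request, so it is left commented out):
# curl -s -o /dev/null --connect-timeout 5 --max-time 10 "http://<public-dns>/health"
# classify_curl_exit "$?"
```

In this incident the request times out, which points toward security groups, network ACLs, or routing rather than a stopped web server.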

So you start the usual EC2 reachability debugging loop:

aws ec2 describe-instances \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-instance-status \
  --instance-ids i-0a2bf3df8a5769989 \
  --region us-east-1

aws ec2 describe-security-groups \
  --group-ids sg-0d133f86e2d08a392 \
  --region us-east-1

aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-001f8813b0d78f5e3 \
  --region us-east-1

aws ec2 describe-subnets \
  --subnet-ids subnet-07083683f7e1d2f09 \
  --region us-east-1

aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-07083683f7e1d2f09 \
  --region us-east-1

Then you start checking the instance itself:

aws ssm start-session \
  --target i-0a2bf3df8a5769989 \
  --region us-east-1

Now you have to figure out what actually matters.

Is the instance unhealthy?
Is nginx down?
Is the app listening on the wrong interface?
Is the security group blocking traffic?
Is the subnet public?
Is the route table missing an internet route?
Is the public IP missing?
Is another VPC networking control blocking the request?

AWS gives you the clues, but you still have to connect them.
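One of the questions above, whether a process is listening on the wrong interface, can be answered on the instance from the output of `ss -ltn`. The sketch below assumes the usual `ss` column layout, with the local address and port in the fourth column. The `listen_scope` helper and the sample output are made up for illustration:

```shell
#!/usr/bin/env bash
# Read `ss -ltn` output on stdin and report how a TCP port is bound.
# listen_scope is a hypothetical helper; it prints one of:
#   external       bound to a wildcard or non-loopback address
#   loopback-only  bound only to 127.0.0.1 or [::1]
#   not-listening  no listener found for that port
listen_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      # Everything before the final ":<port>" is the bind address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (addr == "127.0.0.1" || addr == "[::1]") loopback = 1
      else external = 1
    }
    END {
      if (external) print "external"
      else if (loopback) print "loopback-only"
      else print "not-listening"
    }
  '
}

# Illustrative sample, not captured from the real instance.
sample_ss_output='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      511    0.0.0.0:80         0.0.0.0:*
LISTEN 0      128    127.0.0.1:8080     0.0.0.0:*'

printf '%s\n' "$sample_ss_output" | listen_scope 80     # prints: external
printf '%s\n' "$sample_ss_output" | listen_scope 8080   # prints: loopback-only

# On the instance itself: ss -ltn | listen_scope 80
```

For a setup like the one in this tutorial, a loopback-only listener on 8080 is expected because nginx proxies to it locally, while port 80 must be bound externally to be reachable at all.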

Application

The application is a simple web service running on an EC2 instance.

It represents a small catalog preview service for the Northstar Commerce platform.

The app exposes a health endpoint and a basic HTML page. It runs locally on the instance and is served to external clients through nginx.

The main components are:

EC2 Instance: Runs the application and nginx.
Python Web Application: Provides the demo web service and health endpoint.
systemd Service: Keeps the application process running.
nginx: Listens on HTTP port 80 and proxies requests to the local app.
Security Group: Controls instance-level inbound and outbound traffic.
Subnet: Places the instance inside the VPC network.
Route Table: Defines how traffic leaves the subnet.
Internet Gateway: Provides internet connectivity for the VPC.
Network ACL: Applies subnet-level traffic rules.
IAM Instance Profile: Allows access through AWS Systems Manager Session Manager.

The normal request flow is:

A user sends an HTTP request to the EC2 public DNS name. Traffic enters the VPC through the internet gateway, reaches the public subnet, and passes the subnet and instance network controls (the network ACL and the security group). nginx receives the request on port 80 and proxies it to the local Python app on 127.0.0.1:8080, and the app returns a health response.

The expected health endpoint is: GET /health

When the application is working correctly, it returns:

{
  "status": "ok",
  "service": "northstar-catalog-preview"
}

In this incident, the application is healthy from inside the instance, but unreachable from the internet.
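"Healthy from inside the instance" can be confirmed from a Session Manager session by calling the health endpoint locally and checking the body. The sketch below uses a hypothetical `is_healthy` helper. It matches the body with grep rather than a JSON parser, so it is a loose check rather than strict validation:

```shell
#!/usr/bin/env bash
# Return success when a response body reports "status": "ok".
# is_healthy is a hypothetical helper; it uses grep instead of a JSON parser
# so it works on a minimal instance, at the cost of being a loose match.
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example usage on the instance (left commented out: requires the running app):
# is_healthy "$(curl -s http://127.0.0.1:8080/health)" && echo "app is healthy"
# is_healthy "$(curl -s http://127.0.0.1/health)"      && echo "nginx proxy is healthy"
```

If both local checks pass while the external request still times out, the problem sits in the network path rather than in the application or nginx.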

Now that we understand the app, we can start troubleshooting.

Step-by-Step Guide

Prerequisites

Install Stakpak
Configure your cloud provider credentials (AWS, in this tutorial)

Troubleshooting

Open Stakpak and ask it to investigate the EC2 issue.

Now let's let it do its magic.

Stakpak investigated why the EC2 /health endpoint was timing out. It checked DNS, EC2 status, SSM access, security groups, route tables, NACLs, and local app health.

It found that the EC2 instance and app were healthy, but the subnet's network ACL was blocking outbound response traffic to clients' ephemeral ports. The instance could receive traffic on port 80, but couldn't send responses back to clients.
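Network ACLs are stateless, so return traffic is not allowed automatically the way it is with a security group. The response leaves the instance bound for the client's ephemeral port, and it needs its own outbound rule. Rules are evaluated in ascending rule-number order, the first match decides, and anything unmatched is denied. The sketch below models that evaluation with a hypothetical `nacl_decision` helper. The rule sets are made up, since the incident's actual rules are not listed here, and 1024-65535 is assumed as the ephemeral range:

```shell
#!/usr/bin/env bash
# Evaluate a destination port against NACL-style rules.
# Each rule is one line: "<rule-number> <allow|deny> <from-port> <to-port>".
# nacl_decision is a hypothetical helper; the rule sets below are illustrative.
nacl_decision() {
  local port="$1" rules="$2" num action from to
  # Lowest rule number is evaluated first; the first matching rule decides.
  while read -r num action from to; do
    [ -z "$num" ] && continue
    if [ "$port" -ge "$from" ] && [ "$port" -le "$to" ]; then
      echo "$action"
      return 0
    fi
  done <<EOF
$(printf '%s\n' "$rules" | sort -n)
EOF
  # No rule matched: the implicit final rule denies the traffic.
  echo "deny"
}

# Outbound rules that only allow well-known ports: responses to clients are dropped.
broken_egress="100 allow 80 80
110 allow 443 443"

# Same rules plus the ephemeral range that client responses are sent to.
fixed_egress="100 allow 80 80
110 allow 443 443
120 allow 1024 65535"

nacl_decision 50000 "$broken_egress"   # prints: deny
nacl_decision 50000 "$fixed_egress"    # prints: allow
```

This is why the inbound side can look correct while requests still time out: the request arrives, but the reply is dropped on its way out of the subnet.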