Real Outage Simulation: SRE Networking Debugging
Architecture:
User
↓
Route 53 (DNS)
↓
ALB (public subnet / DMZ)
↓
Web EC2 (private subnet)
↓
DB (private subnet)
Your SRE troubleshooting order:
1. DNS
2. WAF / ALB
3. Target Group health
4. Security Groups
5. Route Tables
6. NAT / IGW
7. EC2 / Nginx
8. DB
9. Logs / Flow Logs
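The order above can be encoded as an ordered checklist and walked until the first failing layer. A minimal sketch; each check here is a stand-in callable, where in practice it would wrap dig, curl, or the AWS CLI:

```python
# Sketch: the troubleshooting order as an ordered checklist.
# Walk the layers in request-path order and stop at the first failure.
def first_failure(checks):
    """Return the first layer whose check fails, or None if all pass."""
    for layer, check in checks:
        if not check():
            return layer
    return None

# Illustrative checks: simulate an outage where target health is the problem.
checks = [
    ("DNS",             lambda: True),
    ("WAF / ALB",       lambda: True),
    ("Target health",   lambda: False),
    ("Security Groups", lambda: True),
]
print(first_failure(checks))  # Target health
```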
OUTAGE 1 — Website Completely Down
Symptom
User says:
app.company.com is not opening.
Browser shows:
This site can’t be reached
Step 1 — Check DNS
From your laptop:
nslookup app.company.com
Expected good output:
Name: app.company.com
Address: <ALB IP addresses>
Bad output:
server can't find app.company.com
Root cause
Route 53 record deleted or wrong.
Fix
Go to:
Route 53 → Hosted Zone → Create Record
Create:
Record type: A
Alias: Yes
Target: Application Load Balancer
SRE explanation
DNS was not resolving to the ALB, so traffic never reached AWS infrastructure.
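The nslookup step can also be scripted, which is useful for monitoring. A minimal sketch using the standard library; the hostnames below are illustrative:

```python
# Sketch of the nslookup check in Python.
import socket

def resolves(hostname):
    """Return the resolved IP address, or None if DNS lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:  # maps to "server can't find <name>"
        return None

print(resolves("localhost"))         # a name that resolves
print(resolves("missing.invalid"))   # .invalid is reserved, never resolves
```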
OUTAGE 2 — ALB Returns 503
Symptom
Browser opens, but shows:
503 Service Temporarily Unavailable
Meaning
ALB is reachable, but it has no healthy backend targets.
Step 1 — Check Target Group
Go to:
EC2 → Target Groups → sre-app-tg → Targets
Bad output:
Unhealthy
Step 2 — Check health reason
Possible reasons:
Health checks failed
Request timed out
Target.ResponseCodeMismatch
Step 3 — Check app server
Connect to the private EC2 instance using SSM Session Manager or a bastion host.
Run:
sudo systemctl status nginx
Bad output:
inactive (dead)
Fix
sudo systemctl start nginx
sudo systemctl enable nginx
Check:
curl localhost
Expected:
Hello from Web Server 1
SRE explanation
ALB returned 503 because the target group had no healthy instances. The web service was stopped, so the health check failed.
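What the ALB health check does can be reproduced locally: an HTTP GET with a short timeout, expecting a 2xx. In this sketch a throwaway local server stands in for the web EC2, and probing a closed port simulates the stopped nginx:

```python
# Sketch of an ALB-style health check probe.
import http.server
import threading
import urllib.error
import urllib.request

def health_check(url, timeout=3):
    """Return True if the target answers with an HTTP 2xx in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

# Throwaway local server standing in for the web EC2.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

healthy = health_check(f"http://127.0.0.1:{port}/")  # service running
server.shutdown()
unhealthy = health_check("http://127.0.0.1:1/")      # service stopped
print(healthy, unhealthy)  # True False
```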
OUTAGE 3 — ALB Target Unhealthy Because Security Group Is Wrong
Symptom
ALB returns:
503
Target group shows:
Unhealthy
Health check timeout
Check Security Group
Go to:
EC2 → Security Groups → web-sg → Inbound rules
Correct rule should be:
HTTP 80 from alb-sg
Bad rule example:
HTTP 80 from your IP
Root cause
The ALB cannot reach the web server because the web SG does not allow traffic from the ALB SG.
Fix
Edit web-sg inbound:
Type: HTTP
Port: 80
Source: alb-sg
Wait 1–2 minutes.
Expected:
Target health: Healthy
SRE explanation
The application itself was fine, but the firewall blocked ALB-to-web traffic.
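This check can be scripted against the JSON that `aws ec2 describe-security-groups` returns. A sketch of the rule evaluation; the rule sets and group IDs below are illustrative:

```python
# Sketch: is port 80 open to the ALB's security group?
# Field names match `aws ec2 describe-security-groups` output.
def allows_from_sg(ip_permissions, port, source_sg):
    """True if any inbound rule covers the port with source_sg as source."""
    for rule in ip_permissions:
        if rule["FromPort"] <= port <= rule["ToPort"]:
            if any(pair.get("GroupId") == source_sg
                   for pair in rule.get("UserIdGroupPairs", [])):
                return True
    return False

alb_sg = "sg-0alb1111111111111"  # illustrative group ID

bad_web_sg = [   # HTTP 80 only from your IP: the ALB is blocked
    {"FromPort": 80, "ToPort": 80,
     "IpRanges": [{"CidrIp": "203.0.113.5/32"}]},
]
good_web_sg = [  # HTTP 80 from the ALB's security group
    {"FromPort": 80, "ToPort": 80,
     "UserIdGroupPairs": [{"GroupId": alb_sg}]},
]
print(allows_from_sg(bad_web_sg, 80, alb_sg))   # False
print(allows_from_sg(good_web_sg, 80, alb_sg))  # True
```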
OUTAGE 4 — Private EC2 Cannot Install Packages
Symptom
On private EC2:
sudo apt update
Bad output:
Temporary failure resolving
or
Connection timed out
Step 1 — Check route table
Go to:
VPC → Route Tables → private-rt → Routes
Correct:
0.0.0.0/0 → NAT Gateway
Bad:
No default route
Step 2 — Check NAT
Go to:
VPC → NAT Gateways
Expected:
State: Available
Subnet: Public subnet
Elastic IP: attached
Step 3 — Check public route table
Public subnet must have:
0.0.0.0/0 → Internet Gateway
Fix
Add route:
Private Route Table
0.0.0.0/0 → NAT Gateway
SRE explanation
Private EC2 had no outbound internet because the private subnet was missing the NAT route.
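The route-table check can also be automated against `aws ec2 describe-route-tables` output. A sketch using that response's field names; the routes and IDs below are illustrative:

```python
# Sketch: does the route table have a 0.0.0.0/0 default route, and to what?
def default_route_target(routes):
    """Return the NAT/IGW target of the default route, or None if missing."""
    for route in routes:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            return route.get("NatGatewayId") or route.get("GatewayId")
    return None

broken_private_rt = [  # only the local VPC route: no outbound internet
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
]
fixed_private_rt = broken_private_rt + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example"},
]
print(default_route_target(broken_private_rt))  # None
print(default_route_target(fixed_private_rt))   # nat-0example
```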
OUTAGE 5 — Public EC2 / ALB Not Reachable
Symptom
Browser cannot reach ALB DNS.
Check ALB Security Group
Go to:
EC2 → Security Groups → alb-sg
Correct inbound:
HTTP 80 from 0.0.0.0/0
Bad:
No inbound rule
Fix
Add:
HTTP 80 → 0.0.0.0/0
SRE explanation
The ALB was healthy, but its security group blocked public HTTP traffic.
OUTAGE 6 — Web Server Cannot Connect to DB
Symptom
Application shows:
Database connection failed
Step 1 — From web EC2 test DB port
nc -vz <db-private-ip> 3306
Bad output:
connection timed out
Good output:
succeeded
Step 2 — Check DB SG
Correct inbound rule:
MySQL 3306 from web-sg
Bad rule:
MySQL 3306 from your IP
or
No MySQL rule
Fix
Edit db-sg:
Inbound:
MySQL/Aurora
Port: 3306
Source: web-sg
SRE explanation
The database was private and secure, but the app tier was not allowed by the DB security group.
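The `nc -vz` test can be expressed as a TCP connect with a timeout, which is handy on hosts where netcat is not installed. A sketch; host and port would be the DB's private IP and 3306 in practice:

```python
# Sketch of the nc -vz check: a TCP connect with a timeout.
import socket

def port_open(host, port, timeout=3):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timed out
        return False

print(port_open("127.0.0.1", 1))  # closed port: False
```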
OUTAGE 7 — One Web Server Down, Site Still Works
Symptom
Stop one EC2:
sre-web-1 stopped
Users still see the website.
Why?
ALB sends traffic only to healthy targets.
Check:
Target group:
web-1 → unused/unhealthy
web-2 → healthy
SRE explanation
This proves high availability. One instance failed, but ALB removed it from rotation and continued sending traffic to the healthy instance.
OUTAGE 8 — Both Web Servers in Same AZ
Symptom
The application works normally, but during an AZ failure everything goes down.
Root cause
Both web servers are in one Availability Zone.
Bad design:
web-1 → us-east-1a
web-2 → us-east-1a
Good design:
web-1 → us-east-1a
web-2 → us-east-1b
Fix
Launch web servers across different private subnets in different AZs.
SRE explanation
High availability requires spreading resources across multiple Availability Zones.
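AZ spread is easy to assert in a script, for example against instance placements pulled from `aws ec2 describe-instances`. A minimal sketch with illustrative placements:

```python
# Sketch: verify the fleet spans more than one Availability Zone.
def spans_multiple_azs(placements):
    """True if instances are spread across at least two AZs."""
    return len(set(placements.values())) >= 2

bad  = {"web-1": "us-east-1a", "web-2": "us-east-1a"}
good = {"web-1": "us-east-1a", "web-2": "us-east-1b"}
print(spans_multiple_azs(bad), spans_multiple_azs(good))  # False True
```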
OUTAGE 9 — Wrong Health Check Path
Symptom
Website works manually:
curl http://<private-ip>
Output:
Hello from Web Server
But ALB target is unhealthy.
Check health check path
Go to:
Target Group → Health checks
Bad path:
/health
But app only serves:
/
Fix
Change health check path to:
/
Or create /health endpoint.
SRE explanation
The application was running, but ALB health check was using a path that did not exist.
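The mismatch is easy to reproduce locally: the app serves `/` but the probe asks for `/health`. In this sketch a throwaway server on an empty temp directory stands in for the web EC2:

```python
# Sketch: health check path mismatch. "/" works, "/health" returns 404.
import functools
import http.server
import tempfile
import threading
import urllib.error
import urllib.request

def probe(url, timeout=3):
    """Return the HTTP status code the health checker would see."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

tmp = tempfile.mkdtemp()  # empty directory: "/" lists OK, "/health" missing
handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=tmp)
server = http.server.HTTPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

root_status = probe(f"http://127.0.0.1:{port}/")          # path the app serves
health_status = probe(f"http://127.0.0.1:{port}/health")  # path the ALB probes
server.shutdown()
print(root_status, health_status)  # 200 404
```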
OUTAGE 10 — NACL Blocks Return Traffic
Symptom
Security groups look correct. Route tables look correct. Still traffic times out.
Check NACL
Go to:
VPC → Network ACLs
Remember:
NACL is stateless
Inbound and outbound both must be allowed
For HTTP into the subnet, allow:
Inbound:
80 from the source
Outbound:
1024-65535 (ephemeral ports) to the source, for return traffic
If the instances also originate outbound requests, additionally allow outbound 80/443 and inbound ephemeral ports.
Root cause
NACL allowed inbound request but blocked return traffic.
Fix
Allow ephemeral ports.
SRE explanation
Security groups are stateful, but NACLs are stateless. Return traffic must be explicitly allowed.
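The statelessness can be modeled directly. In this sketch a rule set is a list of (direction, port_from, port_to) allow entries; real NACLs also carry rule numbers, CIDRs, and protocols, omitted here for clarity:

```python
# Sketch: why stateless NACLs need ephemeral port rules.
def allowed(rules, direction, port):
    """True if any allow entry in that direction covers the port."""
    return any(d == direction and lo <= port <= hi for d, lo, hi in rules)

def http_roundtrip_ok(rules, client_ephemeral_port=51515):
    """Stateless: the request must get in on 80 AND the reply must get
    back out to the client's ephemeral port. Both are checked."""
    return (allowed(rules, "inbound", 80) and
            allowed(rules, "outbound", client_ephemeral_port))

broken = [("inbound", 80, 80)]                 # request in, reply blocked
fixed = broken + [("outbound", 1024, 65535)]   # ephemeral return traffic
print(http_roundtrip_ok(broken), http_roundtrip_ok(fixed))  # False True
```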
OUTAGE 11 — VPC Peering Not Working
Symptom
EC2 in VPC A cannot reach EC2 in VPC B.
Check 1 — Peering status
VPC → Peering Connections
Expected:
Active
Bad:
Pending acceptance
Check 2 — Route tables
VPC A route table:
10.1.0.0/16 → peering connection
VPC B route table:
10.0.0.0/16 → peering connection
Check 3 — CIDR overlap
Bad:
VPC A: 10.0.0.0/16
VPC B: 10.0.0.0/16
Peering will not work. AWS fails a peering request between VPCs with overlapping CIDR blocks.
SRE explanation
VPC peering requires non-overlapping CIDR ranges, active peering, routes on both sides, and firewall rules allowing traffic.
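The CIDR overlap check can be done with the standard library before requesting the peering. A sketch with the CIDRs from the example above:

```python
# Sketch: pre-flight CIDR overlap check for VPC peering.
import ipaddress

def can_peer(cidr_a, cidr_b):
    """VPC peering requires non-overlapping CIDR ranges."""
    return not ipaddress.ip_network(cidr_a).overlaps(
        ipaddress.ip_network(cidr_b))

print(can_peer("10.0.0.0/16", "10.0.0.0/16"))  # False: overlap, peering fails
print(can_peer("10.0.0.0/16", "10.1.0.0/16"))  # True: routes can be added
```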
OUTAGE 12 — PrivateLink Works for One Service Only
Symptom
Consumer VPC can access one API through endpoint, but cannot reach other private EC2s in provider VPC.
Explanation
This is expected.
PrivateLink is not full network connectivity.
PrivateLink → service-level access
VPC Peering → network-level access
Transit Gateway → large network hub
SRE explanation
PrivateLink exposes only a specific service through an endpoint. It does not allow full VPC-to-VPC communication.
OUTAGE 13 — VPN Tunnel Down
Symptom
On-prem users cannot reach AWS private app.
Check
In AWS:
VPC → Site-to-Site VPN Connections
Bad output:
Tunnel 1: DOWN
Tunnel 2: DOWN
Check:
Customer gateway public IP correct?
On-prem firewall allows IPsec?
BGP routes advertised?
AWS route table has route to on-prem CIDR?
SRE explanation
VPN issues usually come from tunnel status, BGP route advertisement, firewall rules, or missing route table entries.
OUTAGE 14 — DNS Points to Old ALB
Symptom
New deployment completed, but users still hit old app.
Check
dig app.company.com
Compare with current ALB DNS.
Root cause
The Route 53 record points to the old ALB, or the cached record's TTL has not expired yet.
Fix
Update Route 53 alias record.
Lower TTL before planned migration.
SRE explanation
The application was not broken. DNS was pointing users to the wrong load balancer.
OUTAGE 15 — WAF Blocks Real Users
Symptom
Some users get:
403 Forbidden
Check
Go to:
AWS WAF → Web ACL → Logs / Sampled requests
Look for:
Blocked rule
Source IP
URI path
User agent
Fix
Options:
Adjust managed rule
Add allowlist
Change rule priority
Tune rate limit
SRE explanation
WAF protects the app, but rules can create false positives. SRE must verify blocked requests before disabling protection.
Final SRE Outage Debugging Script
In an interview, say:
When an outage happens, I do not guess. I follow the request path. First I check DNS resolution, then ALB status, listener, security group, target group health, application service, route tables, NAT or IGW, then database connectivity. I also use ALB logs, VPC Flow Logs, CloudWatch metrics, and application logs to prove where the traffic is failing.
Best Practice Summary
DNS issue → dig / nslookup
ALB issue → listener, SG, target group
503 → no healthy targets
504 → backend timeout
Private no net → NAT route
DB issue → DB SG from web SG
NACL issue → remember stateless
Peering issue → routes both sides
VPN issue → tunnel + BGP + routes
WAF issue → check blocked rules
Very strong interview answer
I troubleshoot production outages by following the traffic path from user to backend: DNS, WAF, load balancer, target group, security groups, route tables, NACLs, EC2 service, and database. I verify each layer with tools like dig, curl, nc, ALB health checks, CloudWatch metrics, ALB logs, and VPC Flow Logs. My goal is to quickly identify whether the problem is DNS, routing, firewall, load balancer, application, or database, then restore service and document the root cause.