DEV Community

Aisalkyn Aidarova

VPC, subnets, IGW, NAT, routing, firewall, DMZ, private DB, and troubleshooting part #3

Real Outage Simulation: SRE Networking Debugging

Architecture:

User
 ↓
Route 53 DNS
 ↓
ALB public subnet / DMZ
 ↓
Web EC2 private subnet
 ↓
DB private subnet

Your SRE troubleshooting order:

1. DNS
2. WAF / ALB
3. Target Group health
4. Security Groups
5. Route Tables
6. NAT / IGW
7. EC2 / Nginx
8. DB
9. Logs / Flow Logs

OUTAGE 1 — Website Completely Down

Symptom

User says:

app.company.com is not opening.

Browser shows:

This site can’t be reached

Step 1 — Check DNS

From your laptop:

nslookup app.company.com

Expected good output:

Name: app.company.com
Address: ALB-DNS or ALB IPs

Bad output:

server can't find app.company.com

Root cause

The Route 53 record was deleted or points to the wrong target.

Fix

Go to:

Route 53 → Hosted Zone → Create Record

Create:

Record type: A
Alias: Yes
Target: Application Load Balancer

SRE explanation

DNS was not resolving to the ALB, so traffic never reached AWS infrastructure.
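The DNS step can be wrapped in a small check for a runbook. This is a minimal sketch, assuming `dig` is installed; the function name `resolves` is made up for illustration and `app.company.com` is the placeholder hostname used above.

```shell
# resolves NAME: succeed if NAME returns at least one DNS answer.
# `dig +short` prints only answer values; lines starting with ";" are
# dig's own diagnostics (for example a timeout), so they are filtered out.
resolves() {
  answer=$(dig +short "$1" | grep -v '^;' || true)
  if [ -n "$answer" ]; then
    printf 'OK: %s resolves to:\n%s\n' "$1" "$answer"
  else
    echo "FAIL: $1 returned no DNS answer" >&2
    return 1
  fi
}
```

Run it as `resolves app.company.com`. A failure here means the problem is in Route 53 or the resolver path, so there is no point checking the ALB yet.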


OUTAGE 2 — ALB Returns 503

Symptom

Browser opens, but shows:

503 Service Temporarily Unavailable

Meaning

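ALB is reachable, but it has no backend target it can use. AWS documents 503 for a target group with no registered targets, or with every target in the unused state. When targets are registered but all unhealthy, the ALB fails open and still forwards to them, so a stopped web service can also surface as 502 or 504. In each case, start at the target group.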

Step 1 — Check Target Group

Go to:

EC2 → Target Groups → sre-app-tg → Targets

Bad output:

Unhealthy

Step 2 — Check health reason

Possible reasons:

Health checks failed
Request timed out
Target.ResponseCodeMismatch

Step 3 — Check app server

Connect to the private EC2 instance through SSM Session Manager or a bastion host.

Run:

sudo systemctl status nginx

Bad output:

inactive (dead)

Fix

sudo systemctl start nginx
sudo systemctl enable nginx

Check:

curl localhost

Expected:

Hello from Web Server 1

SRE explanation

ALB returned 503 because the target group had no healthy instances. The web service was stopped, so the health check failed.
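The target group check from the console can also be done with the AWS CLI. A minimal sketch, assuming the CLI is installed and has credentials; `unhealthy_targets` is a hypothetical helper name, and the ARN of `sre-app-tg` has to be supplied by you.

```shell
# unhealthy_targets TG_ARN: print every target that is not "healthy",
# with the state, reason code, and description the load balancer reports.
unhealthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]" \
    --output text
}
```

Empty output means every registered target is healthy (or nothing is registered at all). The reason column is where values such as `Target.ResponseCodeMismatch` show up.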


OUTAGE 3 — ALB Target Unhealthy Because Security Group Is Wrong

Symptom

ALB returns:

503

Target group shows:

Unhealthy
Health check timeout

Check Security Group

Go to:

EC2 → Security Groups → web-sg → Inbound rules

Correct rule should be:

HTTP 80 from alb-sg

Bad rule example:

HTTP 80 from your IP

Root cause

The ALB cannot reach the web server because web-sg does not allow inbound traffic from alb-sg.

Fix

Edit web-sg inbound:

Type: HTTP
Port: 80
Source: alb-sg

Wait 1–2 minutes for the health checks to pass again.

Expected:

Target health: Healthy

SRE explanation

The application itself was fine, but the firewall blocked ALB-to-web traffic.
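The inbound rule can be verified from a terminal too. This is a sketch under assumptions: the AWS CLI is configured, you pass real security group IDs for web-sg and alb-sg, and `sg_allows_from` is a name invented here. It only matches rules with an explicit port range, so an "all traffic" rule is not detected.

```shell
# sg_allows_from TARGET_SG SOURCE_SG PORT
# Succeed if TARGET_SG has an inbound rule covering PORT whose source is SOURCE_SG.
sg_allows_from() {
  sources=$(aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query "SecurityGroups[0].IpPermissions[?FromPort<=\`$3\` && ToPort>=\`$3\`].UserIdGroupPairs[].GroupId" \
    --output text)
  # The text output is whitespace-separated group IDs; look for the source SG.
  for sg in $sources; do
    [ "$sg" = "$2" ] && return 0
  done
  return 1
}

# To add the missing rule (IDs are placeholders):
#   aws ec2 authorize-security-group-ingress --group-id <web-sg-id> \
#     --protocol tcp --port 80 --source-group <alb-sg-id>
```

Usage: `sg_allows_from <web-sg-id> <alb-sg-id> 80 && echo allowed || echo blocked`.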


OUTAGE 4 — Private EC2 Cannot Install Packages

Symptom

On private EC2:

sudo apt update

Bad output:

Temporary failure resolving
or
Connection timed out

Step 1 — Check route table

Go to:

VPC → Route Tables → private-rt → Routes

Correct:

0.0.0.0/0 → NAT Gateway

Bad:

No default route

Step 2 — Check NAT

Go to:

VPC → NAT Gateways

Expected:

State: Available
Subnet: Public subnet
Elastic IP: attached

Step 3 — Check public route table

Public subnet must have:

0.0.0.0/0 → Internet Gateway

Fix

Add route:

Private Route Table
0.0.0.0/0 → NAT Gateway

SRE explanation

Private EC2 had no outbound internet because the private subnet was missing the NAT route.
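The route table check can be scripted as well. A minimal sketch, assuming a configured AWS CLI and the ID of private-rt; both function names are hypothetical.

```shell
# default_route RT_ID: print NAT gateway ID, gateway ID, and state of the
# 0.0.0.0/0 route. Missing fields print as "None"; no route prints nothing.
default_route() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'].[NatGatewayId,GatewayId,State]" \
    --output text
}

# has_nat_default_route RT_ID: succeed only if the default route targets a NAT gateway.
has_nat_default_route() {
  case "$(default_route "$1")" in
    nat-*) return 0 ;;
    *) return 1 ;;
  esac
}

# To add the missing route (IDs are placeholders):
#   aws ec2 create-route --route-table-id <private-rt-id> \
#     --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <nat-gateway-id>
```

Usage: `has_nat_default_route <private-rt-id> || echo "private subnet has no NAT default route"`.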


OUTAGE 5 — Public EC2 / ALB Not Reachable

Symptom

Browser cannot reach ALB DNS.

Check ALB Security Group

Go to:

EC2 → Security Groups → alb-sg

Correct inbound:

HTTP 80 from 0.0.0.0/0

Bad:

No inbound rule

Fix

Add:

HTTP 80 from 0.0.0.0/0

SRE explanation

The ALB was healthy, but its security group blocked public HTTP traffic.


OUTAGE 6 — Web Server Cannot Connect to DB

Symptom

Application shows:

Database connection failed

Step 1 — From web EC2 test DB port

nc -vz <db-private-ip> 3306

Bad output:

connection timed out

Good output:

succeeded

Step 2 — Check DB SG

Correct inbound rule:

MySQL 3306 from web-sg

Bad rule:

MySQL 3306 from your IP
or
No MySQL rule

Fix

Edit db-sg:

Inbound:
MySQL/Aurora
Port: 3306
Source: web-sg

SRE explanation

The database was private and secure, but the app tier was not allowed by the DB security group.
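The port test is easier to reuse with a timeout and a clear verdict. A small sketch, assuming `nc` is available on the web EC2; `check_db_port` is a made-up helper name and the 3-second timeout is an arbitrary choice.

```shell
# check_db_port HOST PORT: try a TCP connection with a 3-second timeout.
check_db_port() {
  if nc -z -w 3 "$1" "$2" >/dev/null 2>&1; then
    echo "open: $1:$2 accepts TCP connections"
  else
    echo "blocked: $1:$2 timed out or refused; check db-sg inbound from web-sg"
    return 1
  fi
}
```

Usage: `check_db_port <db-private-ip> 3306`. Run it from the web EC2, not your laptop, because the rule being tested is web-sg to db-sg.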


OUTAGE 7 — One Web Server Down, Site Still Works

Symptom

Stop one EC2:

sre-web-1 stopped

Users still see the website.

Why?

ALB sends traffic only to healthy targets.

Check:

Target group:
web-1 → unused/unhealthy
web-2 → healthy

SRE explanation

This proves high availability. One instance failed, but ALB removed it from rotation and continued sending traffic to the healthy instance.


OUTAGE 8 — Both Web Servers in Same AZ

Symptom

Application works normally, but during AZ failure everything goes down.

Root cause

Both web servers are in one Availability Zone.

Bad design:

web-1 → us-east-1a
web-2 → us-east-1a

Good design:

web-1 → us-east-1a
web-2 → us-east-1b

Fix

Launch web servers across different private subnets in different AZs.

SRE explanation

High availability requires spreading resources across multiple Availability Zones.
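AZ placement is easy to audit before an AZ failure exposes it. A minimal sketch, assuming a configured AWS CLI; `az_count` is a hypothetical helper and the instance IDs are yours to supply.

```shell
# az_count INSTANCE_ID...: print how many distinct Availability Zones
# the given instances are spread across.
az_count() {
  aws ec2 describe-instances \
    --instance-ids "$@" \
    --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
    --output text |
    tr '\t' '\n' | sort -u | awk 'NF { n++ } END { print n + 0 }'
}
```

Usage: `az_count <web-1-id> <web-2-id>`. A result of 1 is the bad design shown above; 2 or more means the web tier spans multiple AZs.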


OUTAGE 9 — Wrong Health Check Path

Symptom

Website works manually:

curl http://<private-ip>

Output:

Hello from Web Server

But ALB target is unhealthy.

Check health check path

Go to:

Target Group → Health checks

Bad path:

/health

But app only serves:

/

Fix

Change health check path to:

/

Or add a /health endpoint to the app.

SRE explanation

The application was running, but ALB health check was using a path that did not exist.
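You can reproduce what the health check sees by requesting the configured path yourself. A sketch under assumptions: the AWS CLI and `curl` are available, the health check uses plain HTTP on port 80, and `check_health_path` is a name invented for this example.

```shell
# check_health_path TG_ARN HOST: read the target group's health check path,
# request it from HOST, and print "<path> -> <status code>".
check_health_path() {
  path=$(aws elbv2 describe-target-groups \
    --target-group-arns "$1" \
    --query 'TargetGroups[0].HealthCheckPath' \
    --output text)
  # -m 5 caps the request at 5 seconds; curl reports 000 when there is no response.
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$2$path" || true)
  echo "$path -> $code"
}

# To point the health check at a path the app really serves (ARN is a placeholder):
#   aws elbv2 modify-target-group --target-group-arn <tg-arn> --health-check-path /
```

Run it from a host that can reach the instance, for example `check_health_path <tg-arn> <private-ip>`. If `/health` returns 404 while `/` returns 200, the path is the problem, not the app.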


OUTAGE 10 — NACL Blocks Return Traffic

Symptom

Security groups look correct. Route tables look correct. Traffic still times out.

Check NACL

Go to:

VPC → Network ACLs

Remember:

NACL is stateless
Inbound and outbound both must be allowed

For HTTP, the subnet's NACL needs rules in both directions.

Inbound:

80 from the client source (incoming requests)
1024-65535 ephemeral ports (replies to connections the subnet initiated)

Outbound:

80 (requests the subnet initiates)
1024-65535 ephemeral ports (replies back to clients)

Root cause

NACL allowed inbound request but blocked return traffic.

Fix

Allow the ephemeral port range (1024-65535) in the direction the return traffic flows.

SRE explanation

Security groups are stateful, but NACLs are stateless. Return traffic must be explicitly allowed.
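Listing the NACL entries for the affected subnet makes a missing return rule easy to spot. A minimal sketch, assuming a configured AWS CLI; `nacl_entries` is a hypothetical helper name.

```shell
# nacl_entries SUBNET_ID: list the NACL rules that apply to a subnet.
# Columns: egress flag, rule number, action, protocol, from port, to port, CIDR.
# Inbound rules (egress = False) sort first, then outbound, each by rule number.
nacl_entries() {
  aws ec2 describe-network-acls \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'NetworkAcls[].Entries[].[Egress,RuleNumber,RuleAction,Protocol,PortRange.From,PortRange.To,CidrBlock]' \
    --output text |
    sort -k1,1 -k2,2n
}
```

Usage: `nacl_entries <subnet-id>`. Rules are evaluated in rule-number order, so look for a deny, or the absence of an allow, covering 1024-65535 in the rows where the egress flag is True.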


OUTAGE 11 — VPC Peering Not Working

Symptom

EC2 in VPC A cannot reach EC2 in VPC B.

Check 1 — Peering status

VPC → Peering Connections

Expected:

Active

Bad:

Pending acceptance

Check 2 — Route tables

VPC A route table:

10.1.0.0/16 → peering connection

VPC B route table:

10.0.0.0/16 → peering connection

Check 3 — CIDR overlap

Bad:

VPC A: 10.0.0.0/16
VPC B: 10.0.0.0/16

Peering will not work: AWS rejects a peering connection between VPCs whose CIDR blocks match or overlap.

SRE explanation

VPC peering requires non-overlapping CIDR ranges, active peering, routes on both sides, and firewall rules allowing traffic.
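The overlap check is simple arithmetic: two CIDR blocks overlap exactly when they agree on the shorter of the two prefixes. Here is a self-contained sketch for IPv4 in plain shell; the function names are illustrative and it does not validate its input.

```shell
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# cidr_overlap CIDR1 CIDR2: succeed if the two IPv4 blocks share any address.
cidr_overlap() (
  len1=${1#*/}
  len2=${2#*/}
  # Compare on the shorter prefix length.
  len=$len1
  [ "$len2" -lt "$len" ] && len=$len2
  # Network mask for that prefix length, e.g. /16 -> 0xFFFF0000.
  mask=$(( 4294967295 - ((1 << (32 - len)) - 1) ))
  a=$(ip_to_int "${1%/*}")
  b=$(ip_to_int "${2%/*}")
  [ $(( a & mask )) -eq $(( b & mask )) ]
)
```

Usage: `cidr_overlap 10.0.0.0/16 10.1.0.0/16 && echo overlap || echo ok`. With the CIDRs from the route table example this reports no overlap, while the bad example (both VPCs on 10.0.0.0/16) reports an overlap.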


OUTAGE 12 — PrivateLink Works for One Service Only

Symptom

Consumer VPC can access one API through endpoint, but cannot reach other private EC2s in provider VPC.

Explanation

This is expected.

PrivateLink is not full network connectivity.

PrivateLink → service-level access
VPC Peering → network-level access
Transit Gateway → large network hub

SRE explanation

PrivateLink exposes only a specific service through an endpoint. It does not allow full VPC-to-VPC communication.


OUTAGE 13 — VPN Tunnel Down

Symptom

On-prem users cannot reach AWS private app.

Check

In AWS:

VPC → Site-to-Site VPN Connections

Bad output:

Tunnel 1: DOWN
Tunnel 2: DOWN

Check:

Customer gateway public IP correct?
On-prem firewall allows IPsec?
BGP routes advertised?
AWS route table has route to on-prem CIDR?

SRE explanation

VPN issues usually come from tunnel status, BGP route advertisement, firewall rules, or missing route table entries.
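Tunnel status is also available from the CLI, which helps when you want to alert on it. A minimal sketch, assuming a configured AWS CLI and your VPN connection ID; both function names are made up for illustration.

```shell
# vpn_tunnels VPN_ID: print outside IP, status, status message, and accepted
# route count for each tunnel of a Site-to-Site VPN connection.
vpn_tunnels() {
  aws ec2 describe-vpn-connections \
    --vpn-connection-ids "$1" \
    --query 'VpnConnections[0].VgwTelemetry[].[OutsideIpAddress,Status,StatusMessage,AcceptedRouteCount]' \
    --output text
}

# vpn_up_count VPN_ID: print how many tunnels report status UP.
vpn_up_count() {
  vpn_tunnels "$1" | awk '$2 == "UP" { n++ } END { print n + 0 }'
}
```

Usage: `vpn_up_count <vpn-connection-id>`. A result of 0 matches the bad output above. If a tunnel is UP but the accepted route count is 0 on a BGP-based connection, look at route advertisement next.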


OUTAGE 14 — DNS Points to Old ALB

Symptom

New deployment completed, but users still hit old app.

Check

dig app.company.com

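Compare the answer with the current ALB DNS name. An alias record returns IP addresses, so resolve the ALB DNS name as well and compare the two sets of addresses.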

Root cause

The Route 53 record still points to the old ALB, or resolvers are serving a cached answer whose TTL has not expired yet.

Fix

Update Route 53 alias record.

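Lower the TTL before a planned migration. This applies to non-alias records such as a CNAME; for an alias record that targets an ALB, Route 53 sets the TTL and you cannot change it.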

SRE explanation

The application was not broken. DNS was pointing users to the wrong load balancer.
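Comparing the two names can be automated. This is a rough sketch, assuming `dig` is installed; the function names are hypothetical. ALB addresses change over time and a large ALB may return only a subset per query, so treat a mismatch as a prompt to look closer, not as proof.

```shell
# resolved_ips NAME: print the sorted IPv4 addresses NAME resolves to.
resolved_ips() {
  dig +short A "$1" | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort
}

# same_targets NAME1 NAME2: succeed if both names resolve to the same,
# non-empty set of addresses.
same_targets() {
  ips1=$(resolved_ips "$1")
  ips2=$(resolved_ips "$2")
  [ -n "$ips1" ] && [ "$ips1" = "$ips2" ]
}
```

Usage: `same_targets app.company.com <new-alb-dns-name> || echo "DNS does not point at the new ALB"`.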


OUTAGE 15 — WAF Blocks Real Users

Symptom

Some users get:

403 Forbidden

Check

Go to:

AWS WAF → Web ACL → Logs / Sampled requests

Look for:

Blocked rule
Source IP
URI path
User agent

Fix

Options:

Adjust managed rule
Add allowlist
Change rule priority
Tune rate limit

SRE explanation

WAF protects the app, but rules can create false positives. An SRE must review the blocked requests before relaxing or disabling a rule.
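Sampled requests can be pulled with the CLI when the console is too slow to click through. A sketch under assumptions: the Web ACL is regional (attached to an ALB), you know the rule's metric name, and `blocked_requests` is a helper name invented here. The time window is passed in as UTC timestamps and has to fall within the last three hours.

```shell
# blocked_requests WEB_ACL_ARN RULE_METRIC_NAME START END
# Print client IP, URI, and matching rule for sampled requests that were blocked.
# START and END are UTC timestamps such as 2024-01-01T10:00:00Z.
blocked_requests() {
  aws wafv2 get-sampled-requests \
    --web-acl-arn "$1" \
    --rule-metric-name "$2" \
    --scope REGIONAL \
    --time-window "StartTime=$3,EndTime=$4" \
    --max-items 100 \
    --query "SampledRequests[?Action=='BLOCK'].[Request.ClientIP,Request.URI,RuleNameWithinRuleGroup]" \
    --output text
}
```

The output covers three of the fields listed above (source IP, URI path, blocked rule), which is usually enough to decide between an allowlist entry and a rule adjustment.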


Final SRE Outage Debugging Script

In an interview, say:

When an outage happens, I do not guess. I follow the request path. First I check DNS resolution, then ALB status, listener, security group, target group health, application service, route tables, NAT or IGW, then database connectivity. I also use ALB logs, VPC Flow Logs, CloudWatch metrics, and application logs to prove where the traffic is failing.

Best Practice Summary

DNS issue       → dig / nslookup
ALB issue       → listener, SG, target group
503             → no healthy targets
504             → backend timeout
Private no net  → NAT route
DB issue        → DB SG from web SG
NACL issue      → remember stateless
Peering issue   → routes both sides
VPN issue       → tunnel + BGP + routes
WAF issue       → check blocked rules

Very strong interview answer

I troubleshoot production outages by following the traffic path from user to backend: DNS, WAF, load balancer, target group, security groups, route tables, NACLs, EC2 service, and database. I verify each layer with tools like dig, curl, nc, ALB health checks, CloudWatch metrics, ALB logs, and VPC Flow Logs. My goal is to quickly identify whether the problem is DNS, routing, firewall, load balancer, application, or database, then restore service and document the root cause.
