Real Outage Simulation: SRE Networking Debugging
Architecture:
User
↓
Route 53 (DNS)
↓
ALB (public subnet / DMZ)
↓
Web EC2 (private subnet)
↓
DB (private subnet)
Your SRE troubleshooting order:
1. DNS
2. WAF / ALB
3. Target Group health
4. Security Groups
5. Route Tables
6. NAT / IGW
7. EC2 / Nginx
8. DB
9. Logs / Flow Logs
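The order above can be encoded as an ordered checklist and walked until the first failing layer. A minimal sketch; each check here is a stand-in callable, where in practice it would wrap dig, curl, or the AWS CLI:

```python
# Sketch: the troubleshooting order as an ordered checklist.
# Walk the layers in request-path order and stop at the first failure.
def first_failure(checks):
    """Return the first layer whose check fails, or None if all pass."""
    for layer, check in checks:
        if not check():
            return layer
    return None

# Illustrative checks: simulate an outage where target health is the problem.
checks = [
    ("DNS",             lambda: True),
    ("WAF / ALB",       lambda: True),
    ("Target health",   lambda: False),
    ("Security Groups", lambda: True),
]
print(first_failure(checks))  # Target health
```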
OUTAGE 1 — Website Completely Down
Symptom
User says:
app.company.com is not opening.
Browser shows:
This site can’t be reached
Step 1 — Check DNS
From your laptop:
nslookup app.company.com
Expected good output:
Name: app.company.com
Address: <ALB IP addresses>
Bad output:
server can't find app.company.com
Root cause
Route 53 record deleted or wrong.
Fix
Go to:
Route 53 → Hosted Zone → Create Record
Create:
Record type: A
Alias: Yes
Target: Application Load Balancer
SRE explanation
DNS was not resolving to the ALB, so traffic never reached AWS infrastructure.
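The nslookup step can also be scripted, which is useful for monitoring. A minimal sketch using the standard library; the hostnames below are illustrative:

```python
# Sketch of the nslookup check in Python.
import socket

def resolves(hostname):
    """Return the resolved IP address, or None if DNS lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:  # maps to "server can't find <name>"
        return None

print(resolves("localhost"))         # a name that resolves
print(resolves("missing.invalid"))   # .invalid is reserved, never resolves
```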
OUTAGE 2 — ALB Returns 503
Symptom
Browser opens, but shows:
503 Service Temporarily Unavailable
Meaning
ALB is reachable, but it has no healthy backend targets.
Step 1 — Check Target Group
Go to:
EC2 → Target Groups → sre-app-tg → Targets
Bad output:
Unhealthy
Step 2 — Check health reason
Possible reasons:
Health checks failed
Request timed out
Target.ResponseCodeMismatch
Step 3 — Check app server
Connect to the private EC2 instance using SSM Session Manager or a bastion host.
Run:
sudo systemctl status nginx
Bad output:
inactive (dead)
Fix
sudo systemctl start nginx
sudo systemctl enable nginx
Check:
curl localhost
Expected:
Hello from Web Server 1
SRE explanation
ALB returned 503 because the target group had no healthy instances. The web service was stopped, so the health check failed.
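What the ALB health check does can be reproduced locally: an HTTP GET with a short timeout, expecting a 2xx. In this sketch a throwaway local server stands in for the web EC2, and probing a closed port simulates the stopped nginx:

```python
# Sketch of an ALB-style health check probe.
import http.server
import threading
import urllib.error
import urllib.request

def health_check(url, timeout=3):
    """Return True if the target answers with an HTTP 2xx in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

# Throwaway local server standing in for the web EC2.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

healthy = health_check(f"http://127.0.0.1:{port}/")  # service running
server.shutdown()
unhealthy = health_check("http://127.0.0.1:1/")      # service stopped
print(healthy, unhealthy)  # True False
```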
OUTAGE 3 — ALB Target Unhealthy Because Security Group Is Wrong
Symptom
ALB returns:
503
Target group shows:
Unhealthy
Health check timeout
Check Security Group
Go to:
EC2 → Security Groups → web-sg → Inbound rules
Correct rule should be:
HTTP 80 from alb-sg
Bad rule example:
HTTP 80 from your IP
Root cause
The ALB cannot reach the web server because the web SG does not allow traffic from the ALB SG.
Fix
Edit web-sg inbound:
Type: HTTP
Port: 80
Source: alb-sg
Wait 1–2 minutes.
Expected:
Target health: Healthy
SRE explanation
The application itself was fine, but the firewall blocked ALB-to-web traffic.
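This check can be scripted against the JSON that `aws ec2 describe-security-groups` returns. A sketch of the rule evaluation; the rule sets and group IDs below are illustrative:

```python
# Sketch: is port 80 open to the ALB's security group?
# Field names match `aws ec2 describe-security-groups` output.
def allows_from_sg(ip_permissions, port, source_sg):
    """True if any inbound rule covers the port with source_sg as source."""
    for rule in ip_permissions:
        if rule["FromPort"] <= port <= rule["ToPort"]:
            if any(pair.get("GroupId") == source_sg
                   for pair in rule.get("UserIdGroupPairs", [])):
                return True
    return False

alb_sg = "sg-0alb1111111111111"  # illustrative group ID

bad_web_sg = [   # HTTP 80 only from your IP: the ALB is blocked
    {"FromPort": 80, "ToPort": 80,
     "IpRanges": [{"CidrIp": "203.0.113.5/32"}]},
]
good_web_sg = [  # HTTP 80 from the ALB's security group
    {"FromPort": 80, "ToPort": 80,
     "UserIdGroupPairs": [{"GroupId": alb_sg}]},
]
print(allows_from_sg(bad_web_sg, 80, alb_sg))   # False
print(allows_from_sg(good_web_sg, 80, alb_sg))  # True
```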
OUTAGE 4 — Private EC2 Cannot Install Packages
Symptom
On private EC2:
sudo apt update
Bad output:
Temporary failure resolving
or
Connection timed out
Step 1 — Check route table
Go to:
VPC → Route Tables → private-rt → Routes
Correct:
0.0.0.0/0 → NAT Gateway
Bad:
No default route
Step 2 — Check NAT
Go to:
VPC → NAT Gateways
Expected:
State: Available
Subnet: Public subnet
Elastic IP: attached
Step 3 — Check public route table
Public subnet must have:
0.0.0.0/0 → Internet Gateway
Fix
Add route:
Private Route Table
0.0.0.0/0 → NAT Gateway
SRE explanation
Private EC2 had no outbound internet because the private subnet was missing the NAT route.
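The route-table check can also be automated against `aws ec2 describe-route-tables` output. A sketch using that response's field names; the routes and IDs below are illustrative:

```python
# Sketch: does the route table have a 0.0.0.0/0 default route, and to what?
def default_route_target(routes):
    """Return the NAT/IGW target of the default route, or None if missing."""
    for route in routes:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            return route.get("NatGatewayId") or route.get("GatewayId")
    return None

broken_private_rt = [  # only the local VPC route: no outbound internet
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
]
fixed_private_rt = broken_private_rt + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example"},
]
print(default_route_target(broken_private_rt))  # None
print(default_route_target(fixed_private_rt))   # nat-0example
```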
OUTAGE 5 — Public EC2 / ALB Not Reachable
Symptom
Browser cannot reach ALB DNS.
Check ALB Security Group
Go to:
EC2 → Security Groups → alb-sg
Correct inbound:
HTTP 80 from 0.0.0.0/0
Bad:
No inbound rule
Fix
Add:
HTTP 80 → 0.0.0.0/0
SRE explanation
The ALB was healthy, but its security group blocked public HTTP traffic.
OUTAGE 6 — Web Server Cannot Connect to DB
Symptom
Application shows:
Database connection failed
Step 1 — From web EC2 test DB port
nc -vz <db-private-ip> 3306
Bad output:
connection timed out
Good output:
succeeded
Step 2 — Check DB SG
Correct inbound rule:
MySQL 3306 from web-sg
Bad rule:
MySQL 3306 from your IP
or
No MySQL rule
Fix
Edit db-sg:
Inbound:
MySQL/Aurora
Port: 3306
Source: web-sg
SRE explanation
The database was private and secure, but the app tier was not allowed by the DB security group.
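The `nc -vz` test can be expressed as a TCP connect with a timeout, which is handy on hosts where netcat is not installed. A sketch; host and port would be the DB's private IP and 3306 in practice:

```python
# Sketch of the nc -vz check: a TCP connect with a timeout.
import socket

def port_open(host, port, timeout=3):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timed out
        return False

print(port_open("127.0.0.1", 1))  # closed port: False
```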
OUTAGE 7 — One Web Server Down, Site Still Works
Symptom
Stop one EC2:
sre-web-1 stopped
Users still see the website.
Why?
ALB sends traffic only to healthy targets.
Check:
Target group:
web-1 → unused/unhealthy
web-2 → healthy
SRE explanation
This proves high availability. One instance failed, but ALB removed it from rotation and continued sending traffic to the healthy instance.
OUTAGE 8 — Both Web Servers in Same AZ
Symptom
The application works normally, but during an AZ failure everything goes down.
Root cause
Both web servers are in one Availability Zone.
Bad design:
web-1 → us-east-1a
web-2 → us-east-1a
Good design:
web-1 → us-east-1a
web-2 → us-east-1b
Fix
Launch web servers across different private subnets in different AZs.
SRE explanation
High availability requires spreading resources across multiple Availability Zones.
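AZ spread is easy to assert in a script, for example against instance placements pulled from `aws ec2 describe-instances`. A minimal sketch with illustrative placements:

```python
# Sketch: verify the fleet spans more than one Availability Zone.
def spans_multiple_azs(placements):
    """True if instances are spread across at least two AZs."""
    return len(set(placements.values())) >= 2

bad  = {"web-1": "us-east-1a", "web-2": "us-east-1a"}
good = {"web-1": "us-east-1a", "web-2": "us-east-1b"}
print(spans_multiple_azs(bad), spans_multiple_azs(good))  # False True
```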
OUTAGE 9 — Wrong Health Check Path
Symptom
Website works manually:
curl http://<private-ip>
Output:
Hello from Web Server
But ALB target is unhealthy.
Check health check path
Go to:
Target Group → Health checks
Bad path:
/health
But app only serves:
/
Fix
Change health check path to:
/
Or create /health endpoint.
SRE explanation
The application was running, but ALB health check was using a path that did not exist.
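The mismatch is easy to reproduce locally: the app serves `/` but the probe asks for `/health`. In this sketch a throwaway server on an empty temp directory stands in for the web EC2:

```python
# Sketch: health check path mismatch. "/" works, "/health" returns 404.
import functools
import http.server
import tempfile
import threading
import urllib.error
import urllib.request

def probe(url, timeout=3):
    """Return the HTTP status code the health checker would see."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

tmp = tempfile.mkdtemp()  # empty directory: "/" lists OK, "/health" missing
handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=tmp)
server = http.server.HTTPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

root_status = probe(f"http://127.0.0.1:{port}/")          # path the app serves
health_status = probe(f"http://127.0.0.1:{port}/health")  # path the ALB probes
server.shutdown()
print(root_status, health_status)  # 200 404
```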
OUTAGE 10 — NACL Blocks Return Traffic
Symptom
Security groups look correct. Route tables look correct. Still traffic times out.
Check NACL
Go to:
VPC → Network ACLs
Remember:
NACL is stateless
Inbound and outbound both must be allowed
For HTTP into the subnet, allow:
Inbound:
80 from the source
Outbound:
1024-65535 (ephemeral ports) to the source, for return traffic
If the instances also originate outbound requests, additionally allow outbound 80/443 and inbound ephemeral ports.
Root cause
NACL allowed inbound request but blocked return traffic.
Fix
Allow ephemeral ports.
SRE explanation
Security groups are stateful, but NACLs are stateless. Return traffic must be explicitly allowed.
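The statelessness can be modeled directly. In this sketch a rule set is a list of (direction, port_from, port_to) allow entries; real NACLs also carry rule numbers, CIDRs, and protocols, omitted here for clarity:

```python
# Sketch: why stateless NACLs need ephemeral port rules.
def allowed(rules, direction, port):
    """True if any allow entry in that direction covers the port."""
    return any(d == direction and lo <= port <= hi for d, lo, hi in rules)

def http_roundtrip_ok(rules, client_ephemeral_port=51515):
    """Stateless: the request must get in on 80 AND the reply must get
    back out to the client's ephemeral port. Both are checked."""
    return (allowed(rules, "inbound", 80) and
            allowed(rules, "outbound", client_ephemeral_port))

broken = [("inbound", 80, 80)]                 # request in, reply blocked
fixed = broken + [("outbound", 1024, 65535)]   # ephemeral return traffic
print(http_roundtrip_ok(broken), http_roundtrip_ok(fixed))  # False True
```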
OUTAGE 11 — VPC Peering Not Working
Symptom
EC2 in VPC A cannot reach EC2 in VPC B.
Check 1 — Peering status
VPC → Peering Connections
Expected:
Active
Bad:
Pending acceptance
Check 2 — Route tables
VPC A route table:
10.1.0.0/16 → peering connection
VPC B route table:
10.0.0.0/16 → peering connection
Check 3 — CIDR overlap
Bad:
VPC A: 10.0.0.0/16
VPC B: 10.0.0.0/16
Peering will not work. AWS fails a peering request between VPCs with overlapping CIDR blocks.
SRE explanation
VPC peering requires non-overlapping CIDR ranges, active peering, routes on both sides, and firewall rules allowing traffic.
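The CIDR overlap check can be done with the standard library before requesting the peering. A sketch with the CIDRs from the example above:

```python
# Sketch: pre-flight CIDR overlap check for VPC peering.
import ipaddress

def can_peer(cidr_a, cidr_b):
    """VPC peering requires non-overlapping CIDR ranges."""
    return not ipaddress.ip_network(cidr_a).overlaps(
        ipaddress.ip_network(cidr_b))

print(can_peer("10.0.0.0/16", "10.0.0.0/16"))  # False: overlap, peering fails
print(can_peer("10.0.0.0/16", "10.1.0.0/16"))  # True: routes can be added
```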
OUTAGE 12 — PrivateLink Works for One Service Only
Symptom
Consumer VPC can access one API through endpoint, but cannot reach other private EC2s in provider VPC.
Explanation
This is expected.
PrivateLink is not full network connectivity.
PrivateLink → service-level access
VPC Peering → network-level access
Transit Gateway → large network hub
SRE explanation
PrivateLink exposes only a specific service through an endpoint. It does not allow full VPC-to-VPC communication.
OUTAGE 13 — VPN Tunnel Down
Symptom
On-prem users cannot reach AWS private app.
Check
In AWS:
VPC → Site-to-Site VPN Connections
Bad output:
Tunnel 1: DOWN
Tunnel 2: DOWN
Check:
Customer gateway public IP correct?
On-prem firewall allows IPsec?
BGP routes advertised?
AWS route table has route to on-prem CIDR?
SRE explanation
VPN issues usually come from tunnel status, BGP route advertisement, firewall rules, or missing route table entries.
OUTAGE 14 — DNS Points to Old ALB
Symptom
New deployment completed, but users still hit old app.
Check
dig app.company.com
Compare with current ALB DNS.
Root cause
The Route 53 record points to the old ALB, or the cached record's TTL has not expired yet.
Fix
Update Route 53 alias record.
Lower TTL before planned migration.
SRE explanation
The application was not broken. DNS was pointing users to the wrong load balancer.
OUTAGE 15 — WAF Blocks Real Users
Symptom
Some users get:
403 Forbidden
Check
Go to:
AWS WAF → Web ACL → Logs / Sampled requests
Look for:
Blocked rule
Source IP
URI path
User agent
Fix
Options:
Adjust managed rule
Add allowlist
Change rule priority
Tune rate limit
SRE explanation
WAF protects the app, but rules can create false positives. SRE must verify blocked requests before disabling protection.
Final SRE Outage Debugging Script
In an interview, say:
When an outage happens, I do not guess. I follow the request path. First I check DNS resolution, then ALB status, listener, security group, target group health, application service, route tables, NAT or IGW, then database connectivity. I also use ALB logs, VPC Flow Logs, CloudWatch metrics, and application logs to prove where the traffic is failing.
Best Practice Summary
DNS issue → dig / nslookup
ALB issue → listener, SG, target group
503 → no healthy targets
504 → backend timeout
Private no net → NAT route
DB issue → DB SG from web SG
NACL issue → remember stateless
Peering issue → routes both sides
VPN issue → tunnel + BGP + routes
WAF issue → check blocked rules
Very strong interview answer
I troubleshoot production outages by following the traffic path from user to backend: DNS, WAF, load balancer, target group, security groups, route tables, NACLs, EC2 service, and database. I verify each layer with tools like dig, curl, nc, ALB health checks, CloudWatch metrics, ALB logs, and VPC Flow Logs. My goal is to quickly identify whether the problem is DNS, routing, firewall, load balancer, application, or database, then restore service and document the root cause.