
Aisalkyn Aidarova


Full SRE Networking Lecture: What You Must Know After Basic VPC

1. DNS and Route 53

DNS is one of the most important networking topics for SRE. Many production outages look like “application is down,” but the real issue is DNS.

DNS translates a name into an IP address or another DNS name.

Example:

jumptotech.com → ALB DNS name → EC2 targets

In AWS, the main DNS service is Amazon Route 53. Route 53 is used to manage domain records and route users to AWS resources like ALB, CloudFront, S3 static websites, or failover endpoints.

Important DNS record types:

A record     → domain to IPv4 address
AAAA record  → domain to IPv6 address
CNAME        → domain to another domain
ALIAS        → Route 53 record pointing to AWS resources like ALB or CloudFront
TXT          → verification, SPF, DKIM, security records
MX           → email routing

In production, for an application, the flow is usually:

User
 ↓
Route 53
 ↓
Application Load Balancer
 ↓
Private application servers

As an SRE, when a website is not reachable, you should not immediately check EC2. First check DNS.

Use:

nslookup example.com
dig example.com
dig example.com +short

Troubleshooting questions:

Does the domain resolve?
Does it point to the correct ALB?
Was DNS changed recently?
Is TTL too long?
Is Route 53 health check failing?
Is the record in a public or a private hosted zone?
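The first two questions above can also be scripted. Here is a minimal Python sketch (not a replacement for dig) that resolves a name with the standard library's socket.getaddrinfo and checks whether the domain and the load balancer's DNS name share at least one address. The function names and the resolver parameter, which is there so the logic can be exercised without network access, are illustrative assumptions. Load balancer addresses change and resolvers can return different subsets, so treat a mismatch as a hint rather than proof.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, unique IPv4 addresses for hostname ([] if it does not resolve)."""
    try:
        infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def shares_address(domain, lb_dns_name, resolver=socket.getaddrinfo):
    """True if the domain and the load balancer name resolve to a common address."""
    domain_ips = set(resolve_ipv4(domain, resolver))
    lb_ips = set(resolve_ipv4(lb_dns_name, resolver))
    return bool(domain_ips & lb_ips)
```

An empty list from resolve_ipv4 answers "Does the domain resolve?", and shares_address gives a rough answer to "Does it point to the correct ALB?".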

Route 53 can also use health checks and failover routing. AWS recommends evaluating target health for alias records when using health-based DNS routing, otherwise Route 53 may still route traffic to unhealthy resources. (AWS Documentation)

Interview answer:

Route 53 is the AWS DNS service. I use it to route user traffic to AWS resources such as ALB or CloudFront. As an SRE, I troubleshoot DNS by checking resolution, record type, TTL, hosted zone, and whether the DNS target is healthy.


2. Load Balancing: ALB vs NLB

A load balancer distributes traffic across multiple targets. In production, users should not directly access EC2 instances. They should access a load balancer.

Main AWS load balancers:

ALB = Application Load Balancer
NLB = Network Load Balancer

Application Load Balancer

ALB works at Layer 7, the application layer. It understands HTTP and HTTPS.

Use ALB for:

web applications
APIs
path-based routing
host-based routing
HTTPS termination
microservices
containerized apps

Example:

app.example.com      → frontend target group
api.example.com      → backend target group
/example/orders      → orders service
/example/payments    → payment service

ALB uses:

Listener
Rule
Target Group
Health Check

A listener receives traffic on a port, usually 80 or 443. A rule decides where to forward the request. A target group contains EC2, ECS tasks, IPs, or Lambda targets. Health checks decide whether the target should receive traffic. AWS documentation says ALB target groups route requests to registered targets and health checks are configured per target group. (AWS Documentation)
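To make the listener, rule, and target group relationship concrete, here is a small Python model of rule evaluation. It is a sketch under assumptions, not how the ALB is implemented: the Rule class, the route function, the target group names, and the priority numbers are hypothetical. It assumes rules are checked in priority order, the first match wins, and unmatched requests go to a default action. Python's fnmatchcase stands in for ALB wildcard patterns and accepts more syntax than an ALB does.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int                # lower number is evaluated first
    target_group: str            # where matching requests are forwarded
    host: Optional[str] = None   # host-header pattern, None matches any host
    path: Optional[str] = None   # path pattern, None matches any path


def route(rules, host, path, default="default-target-group"):
    """Return the target group of the first matching rule, else the default."""
    for rule in sorted(rules, key=lambda r: r.priority):
        # Host names are compared case-insensitively in this sketch.
        if rule.host is not None and not fnmatchcase(host.lower(), rule.host.lower()):
            continue
        if rule.path is not None and not fnmatchcase(path, rule.path):
            continue
        return rule.target_group
    return default


RULES = [
    Rule(10, "orders-service", path="/example/orders*"),
    Rule(20, "payment-service", path="/example/payments*"),
    Rule(30, "frontend", host="app.example.com"),
    Rule(40, "backend", host="api.example.com"),
]
```

With these rules, route(RULES, "api.example.com", "/example/orders/42") returns "orders-service" because the path rule has a higher priority than the host rule.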

Production flow:

Internet
 ↓
Route 53
 ↓
ALB in public subnet
 ↓
EC2/ECS/EKS app in private subnet
 ↓
RDS database in private subnet

As an SRE, if the ALB returns 503, it usually means the target group has no registered targets or no targets that can take traffic.

Check:

Are targets registered?
Are targets healthy?
Is health check path correct?
Is app listening on correct port?
Does app security group allow traffic from ALB security group?
Is the app returning 200 on health check path?

Useful commands:

curl -I http://alb-dns-name
curl http://private-app-ip:8080/health

Network Load Balancer

NLB works at Layer 4. It handles TCP, UDP, and TLS traffic.

Use NLB for:

very high performance
TCP applications
static IP requirement
low latency
non-HTTP protocols

Examples:

Kafka
database proxy
game servers
TCP services

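NLB health checks determine whether targets are available. AWS documentation says NLB uses active and passive health checks and routes traffic only to healthy targets. By default, each load balancer node sends traffic only to targets in its own Availability Zone; with cross-zone load balancing enabled, it can send traffic to healthy targets in all enabled Availability Zones. (AWS Documentation)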
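At Layer 4, the simplest availability question is whether a TCP connection can be opened at all. The Python sketch below performs that check from a client's point of view; the function name tcp_port_open and the 3-second default timeout are assumptions for illustration, and it is not a description of how NLB implements its health checks.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from a host in the same VPC, tcp_port_open("10.0.5.20", 5432) is a quick way to tell "nothing is listening or a firewall drops the traffic" apart from an application-level error. The address and port here are placeholders.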

Interview answer:

I use ALB for HTTP and HTTPS applications because it supports Layer 7 routing, TLS termination, host-based and path-based rules. I use NLB for high-performance TCP or UDP workloads where low latency or static IP is required.


3. Target Groups and Health Checks

Health checks are critical for reliability.

A load balancer should not send traffic to a broken server. That is why every target group has a health check.

Example health check:

Protocol: HTTP
Path: /health
Success code: 200
Interval: 30 seconds
Healthy threshold: 3
Unhealthy threshold: 2
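The thresholds in the example above decide how quickly a target leaves and rejoins rotation. The following Python sketch simulates that behaviour as a simple streak counter. The function name and the initial state are assumptions, and real load balancers also apply timeouts and registration delays that are ignored here.

```python
def simulate(results, healthy_threshold=3, unhealthy_threshold=2, initial="healthy"):
    """Return the target state after each health check result (True means the check passed)."""
    state = initial
    streak = 0  # consecutive results that contradict the current state
    history = []
    for passed in results:
        if (state == "healthy") == passed:
            streak = 0  # the result agrees with the current state
        else:
            streak += 1
            needed = unhealthy_threshold if state == "healthy" else healthy_threshold
            if streak >= needed:
                state = "unhealthy" if state == "healthy" else "healthy"
                streak = 0
        history.append(state)
    return history
```

With a 30-second interval, an unhealthy threshold of 2 means a broken target keeps receiving traffic for roughly two check intervals, and a healthy threshold of 3 means a recovered target waits roughly three intervals before it is used again.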

Bad health check:

/

Why? Maybe homepage works but database connection is broken.

Better health check:

/health

This endpoint should check:

application running
database reachable
required dependencies available

But do not make health checks too heavy. If /health runs expensive database queries every few seconds, the health check itself can overload the app.
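One way to keep /health cheap is to cache the dependency check for a few seconds, so frequent probes do not each hit the database. Below is a minimal Python sketch using only the standard library. CachedCheck, make_handler, serve, the 10-second TTL, and port 8080 are illustrative assumptions, and the dependency check itself is left as a callable you supply.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class CachedCheck:
    """Wrap an expensive check and reuse its result for ttl_seconds."""

    def __init__(self, check, ttl_seconds=10.0, clock=time.monotonic):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = None
        self._last = False

    def __call__(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            try:
                self._last = bool(self._check())
            except Exception:
                self._last = False  # a crashing check counts as unhealthy
            self._expires = now + self._ttl
        return self._last


def make_handler(is_healthy):
    """Build a request handler that answers GET /health with 200 or 503."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status = 200 if is_healthy() else 503
            body = b"ok" if status == 200 else b"unhealthy"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def serve(is_healthy, host="0.0.0.0", port=8080):
    """Listen on all interfaces so the load balancer can reach the endpoint."""
    HTTPServer((host, port), make_handler(is_healthy)).serve_forever()
```

Calling serve(CachedCheck(my_db_ping)) would start the endpoint, where my_db_ping is a hypothetical function that returns True when the database answers.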

As an SRE, when a deployment causes an outage, check target group health first.

Troubleshooting:

Target unhealthy because timeout?
Target unhealthy because 403?
Target unhealthy because 500?
Wrong port?
Wrong path?
Security group blocking ALB?
App binding to localhost only?

Common mistake:

Application listens on:

127.0.0.1:8080

But it should listen on:

0.0.0.0:8080

Interview answer:

Health checks allow the load balancer to remove unhealthy targets from rotation. As an SRE, I always verify the health check path, port, response code, security group, and application logs.


4. VPC Endpoints

VPC endpoints allow private resources to access AWS services without using the public internet.

AWS documentation says VPC endpoints privately connect your VPC to supported AWS services without requiring an internet gateway, NAT device, VPN, or Direct Connect. Traffic stays on the AWS network backbone. (AWS Documentation)

Without VPC endpoint:

Private EC2
 ↓
NAT Gateway
 ↓
Internet path
 ↓
S3

With VPC endpoint:

Private EC2
 ↓
VPC Endpoint
 ↓
S3

Types:

Gateway Endpoint   → S3, DynamoDB
Interface Endpoint → SSM, ECR, CloudWatch, Secrets Manager, STS, KMS

Why SREs use VPC endpoints:

increase security
reduce NAT Gateway dependency
reduce NAT data processing cost
allow private subnet access to AWS services
support private architecture

Very important real-world example:

You have a private EC2 instance with no public IP. You want to connect to it using AWS Systems Manager Session Manager.

You may need interface endpoints for:

ssm
ssmmessages
ec2messages

If your private EC2 needs to pull Docker images from ECR, you may need endpoints for:

ecr.api
ecr.dkr
s3
CloudWatch Logs
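When you create these endpoints, you select them by service name. Most follow the pattern com.amazonaws.<region>.<service>. The Python sketch below expands the two lists above into full names. The dictionary keys and the function name are hypothetical, "logs" is the suffix assumed here for CloudWatch Logs, and the exact names available in your region are worth confirming in the console before relying on them.

```python
# Suffixes taken from the two lists above; "logs" stands for CloudWatch Logs.
ENDPOINT_SUFFIXES = {
    "session-manager": ["ssm", "ssmmessages", "ec2messages"],
    "ecr-pull": ["ecr.api", "ecr.dkr", "s3", "logs"],
}


def endpoint_service_names(region, use_case):
    """Return the endpoint service names a use case may need in a region."""
    return [f"com.amazonaws.{region}.{suffix}" for suffix in ENDPOINT_SUFFIXES[use_case]]
```

For example, endpoint_service_names("us-east-1", "session-manager") starts with "com.amazonaws.us-east-1.ssm".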

Troubleshooting endpoint issues:

Is endpoint created in correct VPC?
Is private DNS enabled?
Is security group allowing HTTPS 443 to endpoint?
Is route table configured for gateway endpoint?
Does IAM policy allow access?
Is endpoint policy blocking access?

Interview answer:

I use VPC endpoints when private workloads need to access AWS services without going through the public internet or NAT Gateway. This improves security and can reduce cost.


5. AWS PrivateLink

PrivateLink is related to VPC endpoints, but it is more advanced.

AWS PrivateLink allows private connectivity between VPCs, AWS services, services in other AWS accounts, and Marketplace services without using public internet, NAT, VPN, or Direct Connect. (AWS Documentation)

Use case:

Company A exposes service privately.

Company B consumes it privately.

Consumer VPC
 ↓
Interface Endpoint
 ↓
PrivateLink
 ↓
Provider NLB
 ↓
Provider service

PrivateLink is service-level access, not full network access.

This is very important.

Difference:

VPC Peering     → connects networks
Transit Gateway → connects many networks
PrivateLink     → exposes only one service privately

Why this matters:

With VPC peering, VPCs can potentially route to many internal resources.

With PrivateLink, the consumer can access only the specific service exposed through the endpoint.

When to use PrivateLink:

SaaS provider exposing private API
shared internal service
cross-account service access
security-sensitive architecture
avoid full VPC-to-VPC routing

Interview answer:

PrivateLink is used when we want to expose a specific service privately without giving full network access between VPCs. It is more controlled than VPC peering.


6. VPC Peering

VPC peering connects two VPCs using private IPs.

Example:

VPC A: 10.0.0.0/16
VPC B: 10.1.0.0/16

After peering:

EC2 in VPC A → private IP → EC2 in VPC B

Rules:

CIDR blocks cannot overlap
Peering must be accepted
Routes must be added on both sides
Security groups must allow traffic
NACLs must allow traffic
Peering is not transitive
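The first rule is easy to verify before you request a peering connection. Here is a short Python check using the standard ipaddress module; the function name cidrs_overlap is an assumption for illustration.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))
```

The example VPCs above pass the check: cidrs_overlap("10.0.0.0/16", "10.1.0.0/16") is False. Two VPCs that both use 10.0.0.0/16, or where one block contains the other, would fail it.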

Not transitive means:

VPC A peers with VPC B
VPC B peers with VPC C
VPC A cannot automatically reach VPC C

Use peering when:

only two or a few VPCs need communication
simple architecture
low operational complexity

Do not use peering when you have many VPCs. It becomes hard to manage.

Troubleshooting peering:

Is peering active?
Are CIDRs overlapping?
Does VPC A route table point to peering connection?
Does VPC B route table point back?
Do SG/NACL allow traffic?
Is DNS resolution enabled if using private DNS names?

Interview answer:

VPC peering is private connectivity between two VPCs. It is simple and low-latency, but it does not support transitive routing and does not scale well for many VPCs.


7. Transit Gateway

Transit Gateway is like a cloud router.

Instead of creating many VPC peering connections, you attach VPCs to one central hub.

Without Transit Gateway:

VPC A ↔ VPC B
VPC A ↔ VPC C
VPC A ↔ VPC D
VPC B ↔ VPC C
...

This becomes messy.
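How messy is easy to quantify. Because peering is not transitive, connecting every VPC to every other VPC needs one peering connection per pair. The small Python sketch below compares that with a hub design. The function names are illustrative, and it counts connections only, not the route table entries each one also requires.

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC directly."""
    return vpc_count * (vpc_count - 1) // 2


def hub_attachments(vpc_count):
    """Attachments needed when every VPC connects to one central hub."""
    return vpc_count
```

Four VPCs need 6 peering connections, and ten VPCs need 45, compared with 4 and 10 attachments to a central hub.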

With Transit Gateway:

VPC A
  ↓
Transit Gateway
  ↑
VPC B
  ↑
VPC C
  ↑
VPN / Direct Connect

Use Transit Gateway for:

many VPCs
multi-account architecture
shared services VPC
hybrid cloud
centralized firewall inspection
enterprise networks

AWS VPC connectivity documentation lists Transit Gateway, VPC peering, PrivateLink, VPN, and Direct Connect as major private connectivity options. (AWS Documentation)

As an SRE, you need to know that Transit Gateway has route tables too.

Troubleshooting Transit Gateway:

Is VPC attached to TGW?
Is attachment available?
Is route propagated?
Is route associated with correct TGW route table?
Do subnet route tables point to TGW?
Do SG/NACL allow traffic?
Is there asymmetric routing?
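Several of these questions come down to which route a packet actually matches. Route tables pick the most specific matching prefix, and the Python sketch below models that lookup so you can reason about a table on paper. The lookup function, the route targets, and the CIDRs are hypothetical examples, not output from any AWS API.

```python
import ipaddress


def lookup(route_table, destination_ip):
    """Return the target of the most specific route matching destination_ip, or None."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the packet is dropped
    # The longest prefix (largest prefixlen) wins.
    return max(matches, key=lambda match: match[0])[1]


# A hypothetical private subnet route table.
SUBNET_ROUTES = {
    "10.0.0.0/16": "local",
    "10.0.0.0/8": "transit-gateway",
    "0.0.0.0/0": "nat-gateway",
}
```

Here lookup(SUBNET_ROUTES, "10.1.2.3") returns "transit-gateway", while an address inside 10.0.0.0/16 stays "local". If the 10.0.0.0/8 route were missing, traffic for the other VPC would follow the default route to the NAT Gateway instead, which is a common cause of confusing timeouts.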

Interview answer:

Transit Gateway is used as a central network hub to connect many VPCs and hybrid networks. It is better than VPC peering when the environment has many VPCs or accounts.


8. VPN and Direct Connect

These are used for hybrid cloud: connecting on-premises data centers to AWS.

Site-to-Site VPN

VPN creates encrypted tunnels over the public internet.

Flow:

On-prem router/firewall
 ↓ encrypted tunnel
Virtual Private Gateway / Transit Gateway
 ↓
VPC

Use VPN when:

quick setup
lower cost
backup connection
encrypted connection over internet

Limitations:

internet-dependent
latency can vary
bandwidth limited compared to Direct Connect

Direct Connect

Direct Connect is a dedicated private network connection from your data center or colocation to AWS.

Use Direct Connect when:

stable latency required
large data transfer
enterprise hybrid cloud
more predictable performance

AWS documentation describes connectivity options using Direct Connect, Site-to-Site VPN, and Transit Gateway for remote network to VPC connectivity. (AWS Documentation)

Production design often uses:

Direct Connect as primary
VPN as backup
Transit Gateway as hub

Troubleshooting hybrid connectivity:

Is tunnel up?
Is BGP established?
Are routes advertised?
Are security groups allowing traffic?
Are on-prem firewalls allowing traffic?
Is return route correct?
Is DNS resolving private names?

Interview answer:

VPN provides encrypted connectivity over the internet, while Direct Connect provides a dedicated private connection to AWS. In production, companies often use Direct Connect for stable performance and VPN as backup.


9. AWS WAF

WAF means Web Application Firewall.

Security Group controls ports and IP access.

NACL controls subnet-level traffic.

WAF protects web applications at Layer 7.

WAF can block:

SQL injection
cross-site scripting
bad bots
malicious IPs
rate-based attacks
suspicious headers

Common placement:

User
 ↓
Route 53
 ↓
CloudFront or ALB
 ↓
WAF
 ↓
Application

Use WAF when:

public web application
API exposed to internet
compliance requirement
need Layer 7 protection
need rate limiting

Troubleshooting WAF:

Is WAF blocking legitimate traffic?
Check WAF logs
Check rule priority
Check managed rule false positives
Check rate limit
Check IP reputation list

Interview answer:

WAF protects applications at Layer 7 from web attacks such as SQL injection, XSS, and bad bots. I use it in front of ALB or CloudFront for internet-facing applications.


10. CloudFront and CDN Basics

CloudFront is the AWS CDN.

CDN means content delivery network.

It caches content close to users.

Example:

Without CloudFront:

User in California → ALB in Virginia

With CloudFront:

User in California → nearest edge location → origin

Use CloudFront for:

static websites
images
videos
frontend apps
API acceleration
global users
DDoS protection with Shield
TLS termination

CloudFront origin can be:

S3
ALB
EC2
API Gateway
any custom HTTP origin

SRE troubleshooting:

Is cache serving old content?
Is origin healthy?
Is behavior path correct?
Is HTTPS certificate valid?
Is WAF blocking request?
Is DNS pointing to CloudFront?

Common issue:

You deploy a new frontend, but users still see the old version.

Fix:

CloudFront invalidation

Interview answer:

CloudFront improves performance by caching content at edge locations closer to users. As an SRE, I troubleshoot CloudFront by checking cache behavior, origin health, invalidations, certificates, WAF, and DNS.


11. Network Observability

An SRE must be able to prove what is happening in the network, not guess.

Important tools:

VPC Flow Logs
ALB access logs
CloudWatch metrics
CloudTrail
Route 53 query logs
WAF logs
Transit Gateway flow logs

VPC Flow Logs

VPC Flow Logs capture IP traffic metadata for network interfaces.

They help answer:

Was traffic accepted or rejected?
Which source IP connected?
Which destination port?
Which ENI?
Which subnet?

Use VPC Flow Logs for:

security investigation
NACL troubleshooting
SG troubleshooting
network visibility
unexpected traffic analysis

Example:

REJECT TCP 10.0.3.10 10.0.5.20 5432

This tells you that traffic from 10.0.3.10 to the database port (5432, the PostgreSQL default) on 10.0.5.20 was rejected.
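The line above is a simplified view. A raw record in the default flow log format is a space-separated line with 14 fields, and the protocol is a number (6 for TCP). The Python sketch below parses such a record and produces the simplified summary. The sample record, including its account ID and interface ID, is made up, and the parser assumes the default format rather than a custom one.

```python
# Field order of the default (version 2) flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
PROTOCOLS = {"1": "ICMP", "6": "TCP", "17": "UDP"}


def parse_flow_log(line):
    """Split one default-format flow log record into a dict of strings."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["protocol_name"] = PROTOCOLS.get(record["protocol"], record["protocol"])
    return record


def summarize(record):
    """Render a record as: ACTION PROTOCOL SRC DST DSTPORT."""
    return " ".join(
        record[key] for key in ("action", "protocol_name", "srcaddr", "dstaddr", "dstport")
    )


# Made-up sample record for illustration.
SAMPLE = (
    "2 123456789012 eni-0123456789abcdef0 10.0.3.10 10.0.5.20 "
    "49152 5432 6 1 60 1700000000 1700000060 REJECT OK"
)
```

Filtering parsed records on action == "REJECT" and dstport == "5432" is a quick way to confirm that a security group or NACL is blocking database traffic.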

ALB access logs

ALB logs show:

client IP
request path
target status code
load balancer status code
response time
target processing time

Useful for:

502
503
504
slow requests
bad target responses
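When triaging these errors, the useful comparison is the load balancer status code against the target status code. The Python sketch below reads the leading fields of an ALB access log entry and says which side produced a 5xx. The field order reflects my reading of the documented ALB log format and should be checked against the current AWS documentation. The sample line is invented and shortened (real entries carry more trailing fields), and the labels returned by error_source are hypothetical.

```python
# Leading fields of an ALB access log entry, in order.
ALB_FIELDS = (
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
)


def parse_alb_log(line):
    """Parse the first 12 fields plus the quoted request line of an ALB log entry."""
    parts = line.split(" ", len(ALB_FIELDS))
    if len(parts) <= len(ALB_FIELDS):
        raise ValueError("line is too short to be an ALB access log entry")
    entry = dict(zip(ALB_FIELDS, parts))
    quoted = parts[len(ALB_FIELDS)].split('"')
    entry["request"] = quoted[1] if len(quoted) > 2 else ""
    return entry


def error_source(entry):
    """Say whether a 5xx came from the target, the load balancer, or neither."""
    if entry["target_status_code"].startswith("5"):
        return "target"
    if entry["elb_status_code"].startswith("5"):
        return "load-balancer"  # no target response was recorded
    return "none"


# Invented, shortened sample entry for illustration.
SAMPLE_ALB = (
    "http 2024-01-01T00:00:00.000000Z app/demo-alb/0123456789abcdef "
    "203.0.113.10:40000 - -1 -1 -1 503 - 0 337 "
    '"GET http://app.example.com:80/ HTTP/1.1" "curl/8.0.0" - -'
)
```

A "-" in the target fields together with a 503 suggests the request never reached a target, which points back to target registration and health rather than application code.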

SRE book mindset

Google’s SRE book emphasizes that monitoring should help decide what should interrupt a human and what should not. Good monitoring is not collecting everything; good monitoring detects user-impacting issues. (Google SRE)

Interview answer:

For network observability, I use VPC Flow Logs, ALB logs, Route 53 logs, WAF logs, and CloudWatch metrics. These help me identify whether the issue is DNS, routing, firewall, load balancer, target health, or application.


12. Full Production Network Architecture

This is the architecture you must be able to explain in interviews.

Users
 ↓
Route 53
 ↓
CloudFront + WAF
 ↓
Application Load Balancer - public subnets
 ↓
Application servers / ECS / EKS - private app subnets
 ↓
RDS / ElastiCache - private DB subnets

Supporting components:

NAT Gateway       → private outbound internet
VPC Endpoint      → private AWS service access
Transit Gateway   → multi-VPC connectivity
VPN/DX            → on-prem connectivity
VPC Flow Logs     → network observability
CloudWatch        → metrics and alarms

Production rules:

ALB goes in public subnets
App goes in private subnets
DB goes in private subnets
NAT Gateway goes in public subnet
Private servers do not get public IPs
DB is never open to internet
Use SG references instead of hardcoded IPs
Use Multi-AZ for availability
Use VPC endpoints for private AWS service access
Use WAF for internet-facing apps

13. SRE Troubleshooting Framework

When production is down, do not guess.

Follow this order:

1. DNS
2. CDN / WAF
3. Load Balancer
4. Security Groups
5. Route Tables
6. NACL
7. Target Health
8. Application Logs
9. Database
10. Dependencies
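The ordered list above can be turned into a small runbook helper that runs one check per layer and stops at the first failure. The Python sketch below is deliberately bare: the first_failure function is an illustrative assumption, and the checks themselves are placeholders you would replace with real probes.

```python
def first_failure(checks):
    """Run (layer, check) pairs in order and return the first failing layer, or None."""
    for layer, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes is treated as a failure
        if not ok:
            return layer
    return None


# Placeholder checks in the order given above; replace the lambdas with real probes.
EXAMPLE_CHECKS = [
    ("DNS", lambda: True),
    ("CDN / WAF", lambda: True),
    ("Load Balancer", lambda: True),
    ("Security Groups", lambda: False),
    ("Route Tables", lambda: True),
]
```

With these placeholders, first_failure(EXAMPLE_CHECKS) returns "Security Groups", and the later checks are never run, which keeps the investigation on the first broken layer.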

Scenario 1: Website is down

Check:

dig app.example.com
curl -I https://app.example.com

Then:

Is Route 53 pointing to correct ALB/CloudFront?
Is certificate valid?
Is WAF blocking?
Is ALB reachable?
Are targets healthy?
Is app running?

Scenario 2: ALB returns 503

Meaning:

No registered targets, or no targets that can take traffic

Check:

Target group health
Health check path
Security group from ALB to app
App port
App logs
Deployment status

Scenario 3: ALB returns 504

Meaning:

Gateway timeout

Check:

App too slow?
DB slow?
Target not responding?
Timeout configuration?
Network path blocked?

Scenario 4: Private EC2 cannot access internet

Check:

Private route table has 0.0.0.0/0 → NAT Gateway
NAT Gateway is available
NAT Gateway has Elastic IP
NAT is in public subnet
Public subnet routes 0.0.0.0/0 → IGW
SG allows outbound
NACL allows ephemeral ports

Scenario 5: EC2 cannot access S3 privately

Check:

Gateway endpoint exists?
Route table associated?
Bucket policy allows endpoint?
IAM role allows S3?
Region correct?

Scenario 6: App cannot connect to RDS

Check:

RDS running?
Correct endpoint?
Correct port?
DB SG allows app SG?
App subnet route table has local route?
NACL allows traffic both directions?
Credentials correct?
DB max connections reached?

Scenario 7: VPC peering not working

Check:

Peering active?
CIDR non-overlapping?
Routes added both sides?
SG allows remote CIDR or SG?
NACL allows?
DNS resolution enabled?

14. What You Must Memorize for Interview

You must know this table:

Route 53        → DNS
CloudFront      → CDN / edge caching
WAF             → Layer 7 web protection
ALB             → HTTP/HTTPS load balancing
NLB             → TCP/UDP load balancing
VPC             → private network
Subnet          → network segment in one AZ
Route Table     → controls traffic direction
IGW             → internet access for public subnets
NAT Gateway     → outbound internet for private subnets
SG              → stateful resource firewall
NACL            → stateless subnet firewall
VPC Endpoint    → private access to AWS services
PrivateLink     → private service exposure
VPC Peering     → private VPC-to-VPC network connection
Transit Gateway → central router for many VPCs
VPN             → encrypted hybrid connection over internet
Direct Connect  → private dedicated hybrid connection
Flow Logs       → network traffic visibility

15. Strong Final Interview Answer

Use this answer:

I design AWS networking using layered architecture. I use Route 53 for DNS, CloudFront and WAF for edge performance and security, ALB for public application entry, private subnets for application workloads, and isolated private subnets for databases. I use NAT Gateway only for outbound internet from private subnets and VPC endpoints when private workloads need AWS service access without internet. For multi-VPC communication, I choose VPC peering for simple cases, Transit Gateway for large enterprise hub-and-spoke architecture, and PrivateLink when only one private service should be exposed. As an SRE, I troubleshoot from DNS to load balancer, route tables, security groups, NACLs, target health, logs, and application dependencies.
