DevOps/SRE Interview Guide
Complete preparation for DevOps and Site Reliability Engineering interviews. Covers infrastructure design, incident response scenarios, CI/CD pipeline architecture, observability, and reliability engineering concepts. Includes hands-on scenarios, on-call simulations, and production war stories.
Key Features
- 40 infrastructure design questions with reference architectures
- 15 incident response scenarios — timed postmortem exercises
- CI/CD pipeline design problems from basic to GitOps-scale
- Observability deep-dive — metrics, logs, traces, alerting strategies
- SLO/SLI/SLA workshop with calculation exercises
- Linux troubleshooting gauntlet — 25 real-world debugging scenarios
Content Breakdown
| Section | Items | Difficulty Range |
|---|---|---|
| Infrastructure Design | 40 | ★★ to ★★★★★ |
| Incident Response | 15 | ★★★ to ★★★★ |
| CI/CD Design | 12 | ★★ to ★★★★ |
| Observability | 10 | ★★ to ★★★★ |
| Linux / Networking | 25 | ★ to ★★★★ |
| Coding (Python/Bash) | 20 | ★★ to ★★★ |
Sample Content
Infrastructure Design: Highly Available Web Application
Prompt: Design a deployment architecture for a web application that handles 10K requests/second, requires 99.95% uptime, and serves users across North America and Europe.
                 ┌─────────────┐
                 │ Global DNS  │  (Route 53 / Cloudflare)
                 │   GeoDNS    │
                 └──────┬──────┘
          ┌─────────────┴─────────────┐
          ▼                           ▼
┌────────────────────┐      ┌────────────────────┐
│   US-East Region   │      │   EU-West Region   │
│                    │      │                    │
│  ┌──────────────┐  │      │  ┌──────────────┐  │
│  │     ALB      │  │      │  │     ALB      │  │
│  └──────┬───────┘  │      │  └──────┬───────┘  │
│  ┌──────┴───────┐  │      │  ┌──────┴───────┐  │
│  │ K8s Cluster  │  │      │  │ K8s Cluster  │  │
│  │  (3 nodes)   │  │      │  │  (3 nodes)   │  │
│  └──────┬───────┘  │      │  └──────┬───────┘  │
│  ┌──────┴───────┐  │      │  ┌──────┴───────┐  │
│  │  Primary DB  │◄─┼──────┼──│ Read Replica │  │
│  │ (PostgreSQL) │  │      │  │ (PostgreSQL) │  │
│  └──────────────┘  │      │  └──────────────┘  │
└────────────────────┘      └────────────────────┘
Discussion points: Active-active vs active-passive, database replication lag, cache invalidation across regions, failover testing strategy.
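Before debating active-active vs active-passive, it helps to turn the prompt's numbers into concrete targets. The sketch below is a rough back-of-the-envelope aid, not part of the reference architecture: it assumes a 30-day month and that a single surviving region must absorb all traffic during a regional failover. The function names are illustrative.

```python
def allowed_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Minutes of downtime an availability target permits over a period."""
    return period_minutes * (1 - availability_pct / 100)


def per_region_capacity(total_rps: float, regions: int, tolerated_failures: int = 1) -> float:
    """Requests/second each region must sustain so the survivors carry the full load."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return total_rps / survivors


MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: 30-day month

budget = allowed_downtime_minutes(99.95, MINUTES_PER_30_DAY_MONTH)
capacity = per_region_capacity(10_000, regions=2)

print(f"Downtime budget: {budget:.1f} min/month")
print(f"Per-region capacity: {capacity:.0f} req/s")
```

With two regions and one tolerated failure, each region has to be sized for the full 10K requests/second, which is a useful number to bring into the cost side of the discussion.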
Incident Response Scenario
Scenario: At 3:17 AM, PagerDuty fires: "API error rate >5% for 10 minutes." You're the on-call SRE.
Timeline exercise:
T+0min Alert fires. What do you check first?
T+5min Dashboard shows p99 latency spiked from 200ms to 8s.
Error logs show "connection refused" from the payment service.
T+10min Payment service pods are running but health checks failing.
T+15min Root cause: a config change deployed at 3:05 AM changed
the DB connection pool size from 20 to 2 (typo).
T+20min You roll back the config. How?
T+30min Service recovers. What's your immediate follow-up?
Expected answers include: Check error dashboards and recent deployments first. Communicate in the incident channel. Roll back via GitOps (revert the commit) or kubectl. Follow-up: incident report, blameless postmortem, add config validation.
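To make the "add config validation" follow-up concrete, here is a minimal sketch of a pre-deploy check that could have caught the 20-to-2 typo. Everything in it is hypothetical: the `PoolLimits` thresholds, the function name, and the idea of comparing against the currently deployed value are illustrative choices, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolLimits:
    """Hypothetical guardrails; tune the numbers to your own service."""
    min_size: int = 5
    max_size: int = 100
    max_shrink_ratio: float = 0.5  # flag changes that cut the pool by more than half


def validate_pool_change(current: int, proposed: int,
                         limits: PoolLimits = PoolLimits()) -> list[str]:
    """Return human-readable problems; an empty list means the change passes."""
    problems = []
    if not (limits.min_size <= proposed <= limits.max_size):
        problems.append(
            f"pool size {proposed} is outside [{limits.min_size}, {limits.max_size}]"
        )
    if proposed < current * limits.max_shrink_ratio:
        problems.append(
            f"pool size drops from {current} to {proposed}, "
            f"more than a {1 - limits.max_shrink_ratio:.0%} reduction"
        )
    return problems


# The typo from the scenario: 20 -> 2 trips both checks.
for problem in validate_pool_change(current=20, proposed=2):
    print("REJECT:", problem)
```

A check like this would typically run in CI before the config reaches production. In an interview, it is worth explaining why you would combine an absolute range with a relative-change limit: the range catches nonsense values, while the ratio catches plausible-looking values that are still a drastic change.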
SLO Calculation Exercise
Service: User Authentication API
SLO: 99.9% availability (measured monthly)
Monthly minutes: 43,200
Error budget: 43.2 minutes of downtime
Current month incidents:
- 12-minute outage on the 5th
- 3-minute blip on the 15th
Remaining budget: 43.2 - 12 - 3 = 28.2 minutes
Budget consumed: 34.7%
Question: Should you approve a risky database migration
this month? What factors influence your decision?
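The arithmetic in this exercise can be checked with a few lines of Python. This is a small sketch using the numbers above; the function name and the returned dictionary keys are illustrative.

```python
def error_budget_report(slo_pct: float, period_minutes: float,
                        incident_minutes: list[float]) -> dict[str, float]:
    """Summarize error-budget usage for an availability SLO over one period."""
    budget = period_minutes * (1 - slo_pct / 100)
    used = sum(incident_minutes)
    return {
        "budget_min": budget,
        "used_min": used,
        "remaining_min": budget - used,
        "consumed_pct": 100 * used / budget,
    }


# 99.9% SLO over a 30-day month (43,200 minutes), with outages of 12 and 3 minutes.
report = error_budget_report(99.9, 43_200, [12, 3])

print(f"Error budget:    {report['budget_min']:.1f} min")
print(f"Used:            {report['used_min']:.1f} min")
print(f"Remaining:       {report['remaining_min']:.1f} min")
print(f"Budget consumed: {report['consumed_pct']:.1f}%")
```

Re-running it with a hypothetical worst-case migration outage appended to the incident list is a quick way to frame the approval question: does the remaining budget still cover the rest of the month?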
Study Plan
| Week | Focus | Daily Time |
|---|---|---|
| 1 | Linux fundamentals, networking (TCP/IP, DNS, HTTP) | 45 min |
| 2 | Containers, Kubernetes architecture, pod networking | 60 min |
| 3 | CI/CD pipelines: design, security, GitOps patterns | 45 min |
| 4 | Infrastructure as Code: Terraform patterns, state management | 45 min |
| 5 | Observability: metrics, logging, tracing, alerting design | 45 min |
| 6 | Incident response, SLOs, chaos engineering concepts | 45 min |
| 7 | Infrastructure design problems (whiteboard practice) | 60 min |
| 8 | Mock interviews + coding (Python/Bash scripting) | 60 min |
Practice Tips
- Run things locally. Set up Minikube, write Terraform, build a CI pipeline. Hands-on > theory.
- Write postmortems for public incidents. Pick a well-known outage and write your own analysis.
- Know your monitoring stack. Be able to discuss Prometheus vs Datadog vs CloudWatch tradeoffs.
- Practice "think aloud" for design questions. Start with requirements, then constraints, then design.
- Prepare a production war story. Your best incident response with clear actions and outcome.
Contents
- src/: Design problems, incident scenarios, SLO exercises
- examples/: Reference architectures and postmortem templates
- docs/: Tool comparison guides, Linux troubleshooting flowcharts
This is 1 of 11 resources in the Interview Prep Pro toolkit. Get the complete [DevOps/SRE Interview Guide] with all files, templates, and documentation for $39.
Or grab the entire Interview Prep Pro bundle (11 products) for $199 — save 30%.