Thesius Code

Originally published at datanest-stores.pages.dev

DevOps/SRE Interview Guide

Complete preparation for DevOps and Site Reliability Engineering interviews. Covers infrastructure design, incident response scenarios, CI/CD pipeline architecture, observability, and reliability engineering concepts. Includes hands-on scenarios, on-call simulations, and production war stories.

Key Features

  • 40 infrastructure design questions with reference architectures
  • 15 incident response scenarios — timed postmortem exercises
  • CI/CD pipeline design problems from basic to GitOps-scale
  • Observability deep-dive — metrics, logs, traces, alerting strategies
  • SLO/SLI/SLA workshop with calculation exercises
  • Linux troubleshooting gauntlet — 25 real-world debugging scenarios

Content Breakdown

| Section | Items | Difficulty Range |
| --- | --- | --- |
| Infrastructure Design | 40 | ★★ to ★★★★★ |
| Incident Response | 15 | ★★★ to ★★★★ |
| CI/CD Design | 12 | ★★ to ★★★★ |
| Observability | 10 | ★★ to ★★★★ |
| Linux / Networking | 25 | ★ to ★★★★ |
| Coding (Python/Bash) | 20 | ★★ to ★★★ |

Sample Content

Infrastructure Design: Highly Available Web Application

Prompt: Design a deployment architecture for a web application that handles 10K requests/second, requires 99.95% uptime, and serves users across North America and Europe.

                    ┌─────────────┐
                    │ Global DNS  │ (Route 53 / Cloudflare)
                    │ GeoDNS      │
                    └──────┬──────┘
              ┌────────────┼────────────┐
              ▼                         ▼
    ┌─────────────────┐      ┌─────────────────┐
    │ US-East Region  │      │ EU-West Region  │
    │                 │      │                 │
    │  ┌───────────┐  │      │  ┌───────────┐  │
    │  │    ALB     │  │      │  │    ALB     │  │
    │  └─────┬─────┘  │      │  └─────┬─────┘  │
    │  ┌─────┼─────┐  │      │  ┌─────┼─────┐  │
    │  │ K8s Cluster│  │      │  │ K8s Cluster│  │
    │  │ (3 nodes)  │  │      │  │ (3 nodes)  │  │
    │  └─────┬─────┘  │      │  └─────┬─────┘  │
    │  ┌─────┼─────┐  │      │  ┌─────┼─────┐  │
    │  │Primary DB  │◄─┼──────┼──│ Read Replica│  │
    │  │(PostgreSQL)│  │      │  │(PostgreSQL)│  │
    │  └───────────┘  │      │  └───────────┘  │
    └─────────────────┘      └─────────────────┘

Discussion points: Active-active vs active-passive, database replication lag, cache invalidation across regions, failover testing strategy.
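
The numbers in the prompt can be turned into quick back-of-the-envelope figures before drawing anything. The sketch below is one way to do that, not a reference answer: the function names are hypothetical, the 30-day month is an assumption, and the parallel-availability formula assumes region failures are independent, which real regions sharing DNS or a primary database are not.

```python
def allowed_downtime_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime budget for an availability target over a period (default: 30-day month)."""
    return (1.0 - slo) * period_minutes


def parallel_availability(region_availability: float, regions: int) -> float:
    """Availability when any one of N regions can serve traffic.

    Assumes region failures are independent (an optimistic simplification).
    """
    return 1.0 - (1.0 - region_availability) ** regions


def required_region_capacity_rps(total_rps: float, regions: int, tolerated_failures: int) -> float:
    """Capacity each region needs so the survivors can absorb the full load."""
    survivors = regions - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one region must survive")
    return total_rps / survivors


if __name__ == "__main__":
    # 99.95% uptime from the prompt, over an assumed 30-day month.
    print(f"Downtime budget: {allowed_downtime_minutes(0.9995):.1f} min/month")
    # 0.999 per region is an illustrative figure, not taken from the prompt.
    print(f"Two-region availability: {parallel_availability(0.999, 2):.6f}")
    # Two regions, tolerate losing one: each must handle the full 10K req/s.
    print(f"Per-region capacity: {required_region_capacity_rps(10_000, 2, 1):.0f} req/s")
```

A 99.95% target leaves roughly 21.6 minutes of downtime per 30-day month, and tolerating the loss of one of two regions means each region has to be sized for the full 10K requests/second. Both figures feed directly into the active-active vs active-passive discussion.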

Incident Response Scenario

Scenario: At 3:17 AM, PagerDuty fires: "API error rate >5% for 10 minutes." You're the on-call SRE.

Timeline exercise:

T+0min   Alert fires. What do you check first?
T+5min   Dashboard shows p99 latency spiked from 200ms to 8s.
         Error logs show "connection refused" from the payment service.
T+10min  Payment service pods are running but health checks failing.
T+15min  Root cause: a config change deployed at 3:05 AM changed
         the DB connection pool size from 20 to 2 (typo).
T+20min  You roll back the config. How?
T+30min  Service recovers. What's your immediate follow-up?
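
It helps to be able to explain what an alert like "error rate >5% for 10 minutes" actually evaluates. The following is a minimal sketch of a sustained-threshold check, assuming one sample per minute; `MinuteSample` and `should_fire` are hypothetical names, and real alerting systems such as PagerDuty integrations or Prometheus rules have their own evaluation semantics.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """Request and error counts for one minute (hypothetical shape)."""
    requests: int
    errors: int


def should_fire(samples: list[MinuteSample],
                threshold: float = 0.05,
                for_minutes: int = 10) -> bool:
    """Fire only if the error rate exceeded the threshold in every one of
    the most recent `for_minutes` samples."""
    if len(samples) < for_minutes:
        return False  # not enough history to call the condition sustained
    recent = samples[-for_minutes:]
    return all(
        s.requests > 0 and s.errors / s.requests > threshold
        for s in recent
    )


if __name__ == "__main__":
    healthy = [MinuteSample(requests=1000, errors=2)] * 5
    failing = [MinuteSample(requests=1000, errors=80)] * 10
    print(should_fire(healthy + failing))       # sustained breach
    print(should_fire(healthy + failing[:9]))   # breach shorter than the window
```

The "for" duration trades detection speed for noise: a longer window suppresses short blips but delays the page by at least that long after the fault begins.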

Expected answers include: Check error dashboards and recent deployments first. Communicate in the incident channel. Roll back via GitOps (revert the commit) or kubectl. Follow-up: incident report, blameless postmortem, add config validation.
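
The "add config validation" follow-up is worth being able to sketch on a whiteboard. Below is one possible pre-deploy check, assuming pool sizes are plain integers; the function name, the bounds, and the change-factor limit are all illustrative assumptions rather than values from the scenario.

```python
def validate_pool_change(old_size: int,
                         new_size: int,
                         min_size: int = 5,
                         max_size: int = 100,
                         max_change_factor: float = 2.0) -> list[str]:
    """Return a list of problems with a proposed DB connection pool size.

    An empty list means the change passes both checks. All limits here are
    illustrative defaults, not recommendations.
    """
    problems = []

    # Absolute bounds: catches values that are never sensible.
    if not min_size <= new_size <= max_size:
        problems.append(
            f"pool size {new_size} is outside the allowed range "
            f"[{min_size}, {max_size}]"
        )

    # Relative change: catches typos that stay within bounds but are a
    # large jump from the currently deployed value.
    if old_size > 0 and new_size > 0:
        ratio = max(old_size, new_size) / min(old_size, new_size)
        if ratio > max_change_factor:
            problems.append(
                f"pool size changes by {ratio:.1f}x "
                f"({old_size} -> {new_size}); limit is {max_change_factor}x"
            )

    return problems


if __name__ == "__main__":
    # The typo from the scenario: 20 -> 2 fails both checks.
    for problem in validate_pool_change(old_size=20, new_size=2):
        print("REJECT:", problem)
```

Run as a CI step, a check like this would block the merge when the returned list is non-empty. The relative-change check is the more general of the two, since it needs no per-service knowledge of what a sensible pool size is.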

SLO Calculation Exercise

Service: User Authentication API
SLO: 99.9% availability (measured monthly)

Monthly minutes:  43,200
Error budget:     43.2 minutes of downtime

Current month incidents:
  - 12-minute outage on the 5th
  - 3-minute blip on the 15th

Remaining budget: 43.2 - 12 - 3 = 28.2 minutes
Budget consumed:  34.7%

Question: Should you approve a risky database migration
          this month? What factors influence your decision?
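
The arithmetic in this exercise is simple enough to script, which is a common ask in the coding portion of an interview. A minimal sketch follows; the function names and the returned dictionary shape are assumptions for illustration, and the pace comparison is one possible heuristic for the migration question, not the expected answer.

```python
def error_budget_report(slo: float,
                        period_minutes: float,
                        incident_minutes: list[float]) -> dict[str, float]:
    """Summarize error budget usage for an availability SLO."""
    budget = (1.0 - slo) * period_minutes
    consumed = sum(incident_minutes)
    return {
        "budget_minutes": budget,
        "consumed_minutes": consumed,
        "remaining_minutes": budget - consumed,
        "consumed_pct": 100.0 * consumed / budget,
    }


def burn_vs_elapsed(consumed_pct: float, day_of_month: int, days_in_month: int = 30) -> float:
    """Ratio of budget consumed to time elapsed.

    Above 1.0 means the budget is burning faster than the month is passing.
    """
    elapsed_pct = 100.0 * day_of_month / days_in_month
    return consumed_pct / elapsed_pct


if __name__ == "__main__":
    # Values from the exercise: 99.9% SLO, 43,200 minutes, 12 + 3 minutes of outage.
    report = error_budget_report(0.999, 43_200, [12, 3])
    for key, value in report.items():
        print(f"{key}: {value:.1f}")
    # Assumes today is the 15th; the exercise does not state the current date.
    print(f"burn vs elapsed: {burn_vs_elapsed(report['consumed_pct'], 15):.2f}")
```

With the exercise's inputs this reproduces the 28.2 remaining minutes and 34.7% consumed. If the second incident has just happened on the 15th, the budget is burning slower than the month is passing, which is one input to the migration decision alongside rollback time and blast radius.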

Study Plan

| Week | Focus | Daily Time |
| --- | --- | --- |
| 1 | Linux fundamentals, networking (TCP/IP, DNS, HTTP) | 45 min |
| 2 | Containers, Kubernetes architecture, pod networking | 60 min |
| 3 | CI/CD pipelines: design, security, GitOps patterns | 45 min |
| 4 | Infrastructure as Code: Terraform patterns, state management | 45 min |
| 5 | Observability: metrics, logging, tracing, alerting design | 45 min |
| 6 | Incident response, SLOs, chaos engineering concepts | 45 min |
| 7 | Infrastructure design problems (whiteboard practice) | 60 min |
| 8 | Mock interviews + coding (Python/Bash scripting) | 60 min |

Practice Tips

  1. Run things locally. Set up Minikube, write Terraform, build a CI pipeline. Hands-on practice beats theory alone.
  2. Write postmortems for public incidents. Pick a well-known outage and write your own analysis.
  3. Know your monitoring stack. Be able to discuss Prometheus vs Datadog vs CloudWatch tradeoffs.
  4. Practice "think aloud" for design questions. Start with requirements, then constraints, then design.
  5. Prepare a production war story. Pick your best incident response and tell it with clear actions and a clear outcome.

Contents

  • src/ — Design problems, incident scenarios, SLO exercises
  • examples/ — Reference architectures and postmortem templates
  • docs/ — Tool comparison guides, Linux troubleshooting flowcharts

This is 1 of 11 resources in the Interview Prep Pro toolkit. Get the complete DevOps/SRE Interview Guide with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Interview Prep Pro bundle (11 products) for $199 — save 30%.

Get the Complete Bundle →

