Thesius Code

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

DevOps/SRE Interview Guide

#career #algorithms #programming #interview

A preparation guide for DevOps and Site Reliability Engineering (SRE) interviews. It covers infrastructure design, incident response scenarios, CI/CD pipeline architecture, observability, and reliability engineering concepts, and includes hands-on scenarios, on-call simulations, and production war stories.

Key Features

40 infrastructure design questions with reference architectures
15 incident response scenarios — timed postmortem exercises
CI/CD pipeline design problems from basic to GitOps-scale
Observability deep-dive — metrics, logs, traces, alerting strategies
SLO/SLI/SLA workshop with calculation exercises
Linux troubleshooting gauntlet — 25 real-world debugging scenarios

Content Breakdown

Section	Items	Difficulty Range
Infrastructure Design	40	★★ to ★★★★★
Incident Response	15	★★★ to ★★★★
CI/CD Design	12	★★ to ★★★★
Observability	10	★★ to ★★★★
Linux / Networking	25	★ to ★★★★
Coding (Python/Bash)	20	★★ to ★★★

Sample Content

Infrastructure Design: Highly Available Web Application

Prompt: Design a deployment architecture for a web application that handles 10K requests/second, requires 99.95% uptime, and serves users across North America and Europe.

                   ┌─────────────┐
                   │ Global DNS  │ (Route 53 / Cloudflare)
                   │ GeoDNS      │
                   └──────┬──────┘
             ┌────────────┴────────────┐
             ▼                         ▼
    ┌──────────────────┐      ┌──────────────────┐
    │  US-East Region  │      │  EU-West Region  │
    │                  │      │                  │
    │  ┌────────────┐  │      │  ┌────────────┐  │
    │  │    ALB     │  │      │  │    ALB     │  │
    │  └─────┬──────┘  │      │  └─────┬──────┘  │
    │  ┌─────┴──────┐  │      │  ┌─────┴──────┐  │
    │  │ K8s Cluster│  │      │  │ K8s Cluster│  │
    │  │ (3 nodes)  │  │      │  │ (3 nodes)  │  │
    │  └─────┬──────┘  │      │  └─────┬──────┘  │
    │  ┌─────┴──────┐  │      │  ┌─────┴──────┐  │
    │  │ Primary DB │◄─┼──────┼──│Read Replica│  │
    │  │(PostgreSQL)│  │      │  │(PostgreSQL)│  │
    │  └────────────┘  │      │  └────────────┘  │
    └──────────────────┘      └──────────────────┘

Discussion points: Active-active vs active-passive, database replication lag, cache invalidation across regions, failover testing strategy.
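
Before debating topologies, it can help to turn the prompt's numbers into concrete budgets. The sketch below is an illustration only, not part of the reference architecture. It assumes a 30-day month, a 365-day year, and that each region must be able to absorb the full load if the other one fails (the prompt does not state this). The function names are hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Minutes of downtime permitted in a period at the given availability target."""
    return period_minutes * (1 - availability_pct / 100)


def per_region_capacity_rps(total_rps: float, regions: int, tolerate_failures: int = 1) -> float:
    """Capacity each region needs so the surviving regions can carry the full load.

    Assumes traffic is spread evenly over the regions that remain healthy.
    """
    surviving = regions - tolerate_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    return total_rps / surviving


MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60         # assumption: 365-day year

monthly = allowed_downtime_minutes(99.95, MINUTES_PER_30_DAY_MONTH)
yearly = allowed_downtime_minutes(99.95, MINUTES_PER_YEAR)
capacity = per_region_capacity_rps(10_000, regions=2)

print(f"Monthly downtime budget: {monthly:.1f} min")
print(f"Yearly downtime budget:  {yearly:.1f} min")
print(f"Capacity per region:     {capacity:.0f} req/s")
```

Under these assumptions, 99.95% leaves roughly 21.6 minutes of downtime per month, and with two regions each one has to be sized for the full 10K requests/second. Both numbers are useful talking points when weighing active-active against active-passive.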

Incident Response Scenario

Scenario: At 3:17 AM, PagerDuty fires: "API error rate >5% for 10 minutes." You're the on-call SRE.

Timeline exercise:

T+0min   Alert fires. What do you check first?
T+5min   Dashboard shows p99 latency spiked from 200ms to 8s.
         Error logs show "connection refused" from the payment service.
T+10min  Payment service pods are running but health checks failing.
T+15min  Root cause: a config change deployed at 3:10 AM changed
         the DB connection pool size from 20 to 2 (typo).
T+20min  You roll back the config. How?
T+30min  Service recovers. What's your immediate follow-up?

Expected answers include: check error dashboards and recent deployments first; communicate in the incident channel; roll back via GitOps (revert the commit) or kubectl. Follow-up: write an incident report, hold a blameless postmortem, and add config validation.
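
The config validation follow-up could be sketched as a pre-deploy check like the one below. This is a minimal illustration under assumptions: the key name `db_pool_size`, the allowed range, and the 50% change limit are all hypothetical and would come from your own service's requirements.

```python
def validate_pool_size(old: int, new: int, minimum: int = 5, maximum: int = 100,
                       max_change_ratio: float = 0.5) -> list[str]:
    """Return a list of problems; an empty list means the change passes.

    All bounds here are illustrative defaults, not recommended values.
    """
    problems = []
    # Absolute sanity bounds catch values that are never reasonable.
    if not minimum <= new <= maximum:
        problems.append(
            f"db_pool_size={new} is outside the allowed range {minimum}-{maximum}"
        )
    # A relative check catches large jumps that are likely typos.
    if old > 0 and abs(new - old) / old > max_change_ratio:
        problems.append(
            f"db_pool_size changed from {old} to {new}, "
            f"more than {max_change_ratio:.0%} in one deploy"
        )
    return problems


# The typo from the scenario: pool size 20 -> 2
for problem in validate_pool_size(old=20, new=2):
    print("REJECTED:", problem)
```

A check like this would typically run in CI before the config change is merged, so the typo is rejected before it reaches production.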

SLO Calculation Exercise

Service: User Authentication API
SLO: 99.9% availability (measured monthly)

Monthly minutes:  43,200 (30-day month)
Error budget:     43.2 minutes of downtime

Current month incidents:
  - 12-minute outage on the 5th
  - 3-minute blip on the 15th

Remaining budget: 43.2 - 12 - 3 = 28.2 minutes
Budget consumed:  34.7%

Question: Should you approve a risky database migration
          this month? What factors influence your decision?
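
The arithmetic above can be reproduced with a short script. This is a sketch, assuming a 30-day month as in the exercise. The helper names and the 20-minute worst-case estimate for the migration are hypothetical, chosen only to illustrate the decision.

```python
def error_budget(slo_pct: float, period_minutes: float,
                 incident_minutes: list[float]) -> dict[str, float]:
    """Compute the error budget and how much of it the incidents consumed."""
    budget = period_minutes * (1 - slo_pct / 100)
    used = sum(incident_minutes)
    return {
        "budget_min": budget,
        "used_min": used,
        "remaining_min": budget - used,
        "consumed_pct": used / budget * 100,
    }


def fits_in_budget(remaining_min: float, worst_case_outage_min: float) -> bool:
    """True if a worst-case outage would still leave the SLO intact."""
    return worst_case_outage_min <= remaining_min


result = error_budget(99.9, 30 * 24 * 60, [12, 3])
for key, value in result.items():
    print(f"{key}: {value:.1f}")

# Hypothetical estimate: the migration could cause up to 20 minutes of downtime.
print("Migration fits:", fits_in_budget(result["remaining_min"], 20))
```

The script prints the same figures as the exercise (43.2 minutes of budget, 28.2 remaining, 34.7% consumed). Whether a worst-case estimate "fits" is only one input; how much of the month is left and how confident you are in the rollback plan also matter.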

Study Plan

Week	Focus	Daily Time
1	Linux fundamentals, networking (TCP/IP, DNS, HTTP)	45 min
2	Containers, Kubernetes architecture, pod networking	60 min
3	CI/CD pipelines: design, security, GitOps patterns	45 min
4	Infrastructure as Code: Terraform patterns, state management	45 min
5	Observability: metrics, logging, tracing, alerting design	45 min
6	Incident response, SLOs, chaos engineering concepts	45 min
7	Infrastructure design problems (whiteboard practice)	60 min
8	Mock interviews + coding (Python/Bash scripting)	60 min

Practice Tips

Run things locally. Set up Minikube, write Terraform, build a CI pipeline. Hands-on > theory.
Write postmortems for public incidents. Pick a well-known outage and write your own analysis.
Know your monitoring stack. Be able to discuss Prometheus vs Datadog vs CloudWatch tradeoffs.
Practice "think aloud" for design questions. Start with requirements, then constraints, then design.
Prepare a production war story. Pick your best incident response and be ready to walk through your actions and the outcome.

src/ — Design problems, incident scenarios, SLO exercises
examples/ — Reference architectures and postmortem templates
docs/ — Tool comparison guides, Linux troubleshooting flowcharts

This is 1 of 11 resources in the Interview Prep Pro toolkit. Get the complete DevOps/SRE Interview Guide with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Interview Prep Pro bundle (11 products) for $199 — save 30%.

Get the Complete Bundle →
