Stage 1 — Linux & OS Internals
Every container, every Kubernetes component, every performance issue traces back to Linux. Start here.
Processes & threads
- How fork() and exec() work — process creation lifecycle
- Process states: running, sleeping (interruptible vs uninterruptible), zombie, stopped
- What a context switch is and why it has a cost
Namespaces (this is what containers ARE)
- pid namespace — isolated process trees, PID 1 inside a container
- net namespace — isolated network stack (interfaces, routes, iptables)
- mnt namespace — isolated mount points and filesystem view
- uts namespace — isolated hostname and domain name
- ipc namespace — isolated System V IPC, POSIX message queues
- user namespace — UID/GID remapping, unprivileged containers
- cgroup namespace — isolated cgroup root view
- How to experiment: unshare, nsenter, lsns commands
Control Groups / cgroups
- cgroups v1 vs v2 — why v2 (unified hierarchy) matters for containers
- CPU controller: cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us (v1 names; v2 uses cpu.weight and cpu.max)
- Memory controller: hard limit, soft limit, swap limit, memory.stat
- How Kubernetes maps resource requests and limits to cgroup settings
- The OOM killer — how it scores processes, why your container gets killed, /proc/<pid>/oom_score_adj
- pids controller — preventing fork bombs in containers
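The mapping from Kubernetes CPU requests and limits to the cgroup v1 files named above can be sketched as simple arithmetic. This is a minimal sketch, assuming the commonly documented conversion (1000 millicores = 1024 shares, quota proportional to a 100 ms CFS period); the function names and the floor values are illustrative assumptions, not kubelet source.

```python
CFS_PERIOD_US = 100_000  # assumed default value of cpu.cfs_period_us (100 ms)


def cpu_request_to_shares(millicores: int) -> int:
    """cpu.shares for a CPU request: 1000m corresponds to 1024 shares."""
    # The floor of 2 is an assumption about the smallest accepted value.
    return max(2, millicores * 1024 // 1000)


def cpu_limit_to_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """cpu.cfs_quota_us for a CPU limit: runnable microseconds per period."""
    # The floor of 1000 microseconds is likewise an assumption.
    return max(1000, millicores * period_us // 1000)


if __name__ == "__main__":
    # A container with requests.cpu=500m and limits.cpu=500m
    print("cpu.shares       =", cpu_request_to_shares(500))
    print("cpu.cfs_quota_us =", cpu_limit_to_quota_us(500))
```

Shares only matter under contention (they are relative weights), while the quota is a hard ceiling per period, which is why a limit can throttle a container even on an idle node.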
Memory management
- Page cache — how the kernel caches disk reads in RAM, impact on free output
- Memory metrics: RSS vs VSZ vs PSS vs USS — why RSS is misleading in containers
File systems
- Inodes — what they store, inode exhaustion problem
- Bind mounts — how Kubernetes volume mounts work under the hood
- /proc and /sys — virtual filesystems that expose kernel state
Signals and IPC
- SIGTERM vs SIGKILL — why your app must handle SIGTERM for graceful shutdown
- SIGCHLD — zombie process prevention, proper child reaping (PID 1 problem in containers)
- Why PID 1 in a container needs to reap children — tini and dumb-init
System call tracing and performance
- strace -p <pid> — trace syscalls of a running process
- /proc/<pid>/ (maps, status, fd, net) — per-process kernel state
- ss and netstat — socket state inspection
- lsof — open file descriptors per process
Stage 2 — Networking fundamentals
You cannot work in Kubernetes, Cilium, Cloudflare, or Fastly without deeply understanding networking.
TCP/IP stack
- IP addressing, subnets, CIDR notation, route tables
- TCP handshake (SYN, SYN-ACK, ACK), teardown (FIN, TIME_WAIT)
- TIME_WAIT storms — what causes them, why they matter at scale
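CIDR notation and route tables come together in longest-prefix matching: among all routes whose prefix contains the destination, the most specific one wins. The sketch below illustrates that rule with the standard ipaddress module; the route table entries and next-hop names are hypothetical.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, next hop or device)
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1"),   # default route
    ("10.0.0.0/8", "10.0.0.1"),
    ("10.244.1.0/24", "cni0"),
]


def pick_route(dest, routes):
    """Return the (cidr, next_hop) with the longest prefix containing dest."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for cidr, hop in routes:
        net = ipaddress.ip_network(cidr)
        if addr in net:
            matches.append((net, hop))
    if not matches:
        return None
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), hop


if __name__ == "__main__":
    for dest in ("10.244.1.7", "10.9.9.9", "8.8.8.8"):
        print(dest, "->", pick_route(dest, ROUTES))
```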
DNS
- How DNS resolution works end-to-end — recursive resolver, authoritative server
- DNS record types: A, AAAA, CNAME, MX, TXT, PTR, SRV
- TTL — caching, negative caching, TTL trade-offs
- ndots setting in Linux — how it affects resolution order (critical for Kubernetes)
- CoreDNS — how Kubernetes uses it, common misconfigurations, DNS debugging
Linux networking internals
- Network interfaces — physical, virtual (veth), bridge, loopback, dummy
- veth pairs — how they work, why they are used for container networking
- Linux bridge — how it connects veth pairs (like a virtual switch)
- conntrack — connection tracking table, how NAT works, conntrack -L
TLS and certificates
- TLS handshake — client hello, server hello, certificate exchange, key exchange
- Certificate chain — root CA, intermediate CA, leaf certificate
- mTLS — mutual authentication, both sides present certificates (used in service meshes)
- Certificate management — cert-manager in Kubernetes, Let's Encrypt, ACME protocol
Stage 3 — Go (Golang)
Go is the dominant language of the cloud-native ecosystem. Kubernetes, Prometheus, Terraform, ArgoCD, Cilium, Vault — all Go.
Language basics
- Packages, modules (go.mod, go.sum), workspace mode
- Basic types, structs, interfaces, methods, pointers
- Error handling — error interface, errors.Is(), errors.As(), wrapping errors with %w
- Defer, panic, recover — use cases and pitfalls
Interfaces and composition
- Implicit interface satisfaction — no implements keyword
- Embedding structs and interfaces
- The io.Reader / io.Writer / io.Closer interface family
- context.Context — cancellation, deadlines, value propagation (used everywhere in infra code)
Goroutines and concurrency
- Goroutines — lightweight threads managed by the Go runtime
- Channels — unbuffered vs buffered, direction, closing
- select statement — multiplexing channel operations
- Race detector — go run -race, go test -race
- Common concurrency mistakes: goroutine leaks, channel deadlocks
Memory and performance
- Stack vs heap allocation — escape analysis (go build -gcflags="-m")
Standard library for infra work
- net/http — building HTTP servers and clients, middleware pattern
- os/exec — running subprocesses safely
- flag and os.Args — CLI argument parsing
- time — duration arithmetic, ticker, timer
CLI tools and infra tooling patterns
- Config file loading — layered config (flags > env vars > config file > defaults)
- Writing a simple HTTP server with graceful shutdown on SIGTERM
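The layered config precedence listed above (flags > env vars > config file > defaults) is language independent. As a rough sketch, written in Python for brevity rather than Go, it is a chained lookup where the first layer that defines a key wins. The function name and the sample keys are assumptions for illustration.

```python
from collections import ChainMap


def load_config(flags, env, file_cfg, defaults):
    """Merge config layers; earlier arguments take precedence over later ones."""
    # Drop unset (None) values so an omitted flag does not mask a lower layer.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (flags, env, file_cfg, defaults)
    ]
    return dict(ChainMap(*layers))


if __name__ == "__main__":
    config = load_config(
        flags={"port": None, "log_level": "debug"},   # --log-level=debug only
        env={"port": "9090"},                          # hypothetical APP_PORT
        file_cfg={"port": "8080", "region": "eu"},
        defaults={"port": "80", "log_level": "info", "region": "us"},
    )
    print(config)
```

Treating "unset" explicitly matters: a flag parser that fills in its own default for every flag would otherwise always win and make the lower layers unreachable.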
Stage 4 — Kubernetes (the most important stage)
Most companies on your list either build on K8s, build for K8s, or expect you to operate it at scale.
4.1 Architecture and control plane
API server
- Central hub — all components communicate through the API server
- REST API — resource types, verbs (get/list/watch/create/update/patch/delete)
- Authentication — service account tokens (JWT), kubeconfig, OIDC, certificates
- Admission control chain — mutating admission webhooks run first, then validating
- etcd watch — how the API server streams changes to controllers
etcd
- Raft consensus — leader election, log replication, quorum (why 3 or 5 nodes)
- Key-value watch API — how controllers get notified of changes
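The "why 3 or 5 nodes" point follows from majority quorum arithmetic. A small sketch (function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the quorum without tolerating any extra failure, which is why even-sized clusters are usually avoided.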
Scheduler
- Scheduling cycle: filtering (predicates) → scoring (priorities) → binding
- Predicates: NodeSelector, NodeAffinity, PodAffinity, Taints/Tolerations, resource fit
Controller manager
- Informer pattern — List + Watch, local cache, event handlers
- Reconcile loop — compare desired state (spec) with actual state, take action to converge
- Key controllers: Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Endpoints
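The reconcile loop described above can be illustrated with a toy ReplicaSet-style controller. This is a simplified sketch, not controller-runtime code: state lives in a plain list, and the function and pod names are hypothetical.

```python
import itertools


def pod_names():
    """Endless supply of hypothetical pod names: pod-0, pod-1, ..."""
    return (f"pod-{i}" for i in itertools.count())


def reconcile(desired_replicas, running_pods):
    """One pass: compare desired with actual and return the actions to converge."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the newest ones (an arbitrary choice here).
        return [("delete", pod) for pod in running_pods[diff:]]
    return []  # already converged, nothing to do


def apply_actions(actions, running_pods, names):
    """Stand-in for the API calls a real controller would make."""
    for verb, pod in actions:
        if verb == "create":
            running_pods.append(next(names))
        else:
            running_pods.remove(pod)


if __name__ == "__main__":
    names, pods = pod_names(), []
    for desired in (3, 3, 1):
        actions = reconcile(desired, pods)
        apply_actions(actions, pods, names)
        print(f"desired={desired} actions={actions} pods={pods}")
```

The key property is that reconcile is level-triggered and idempotent: running it again on a converged state returns no actions, so missed or duplicated events do not cause drift.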
kubelet
- Watches the API server for pods assigned to its node
- CRI (Container Runtime Interface) — how kubelet talks to containerd or CRI-O
- Pod lifecycle: pending → pulling image → creating container → running → terminating
- Liveness vs readiness vs startup probes — how they work, when each probe type fails
- Eviction — memory pressure, disk pressure, node conditions
kube-proxy
- iptables mode — adds a rule per Service to the KUBE-SERVICES chain, which leads to DNAT rules for the backing pods
- How ClusterIP services work — virtual IP that only exists in iptables/IPVS rules
4.2 Workloads and objects
Core objects you must know cold
- Pod — smallest deployable unit, spec fields, container lifecycle hooks, init containers
- Deployment — rolling update strategy, maxSurge, maxUnavailable, rollback
- StatefulSet — stable network identity, ordered deployment, persistent volume claims
- DaemonSet — one pod per node, use cases (log shippers, monitoring agents, CNI plugins)
- Job and CronJob — completions, parallelism, failure handling, cron schedule format
Networking objects
- Endpoints and EndpointSlices — how services know which pods to route to
- Ingress — host/path-based routing, TLS termination, ingress controllers (Nginx, Traefik)
- Network Policy — ingress/egress rules, podSelector, namespaceSelector, default deny
Storage objects
- PersistentVolume (PV) and PersistentVolumeClaim (PVC) — static vs dynamic provisioning
- StorageClass — provisioner, reclaim policy, volume binding mode
- CSI — plugin interface for dynamic storage provisioning in Kubernetes
Resource management
- requests vs limits — requests used for scheduling, limits enforced by cgroups
- QoS classes: Guaranteed (requests = limits for CPU and memory in every container), Burstable (at least one request or limit set, but not Guaranteed), BestEffort (no requests or limits)
- LimitRange — default limits/requests for a namespace
- ResourceQuota — total resource budget for a namespace
- PodDisruptionBudget (PDB) — minimum available pods during voluntary disruptions
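The QoS class rules above can be expressed as a small classifier. This is a simplified sketch: containers are plain dicts, quantities are compared as strings (so "1" and "1000m" would not be treated as equal), and init containers are ignored. The function name and input shape are assumptions.

```python
def qos_class(containers):
    """Classify a pod from its containers' 'requests' and 'limits' maps."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit for that resource.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                      "limits": {"cpu": "500m", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "250m"}}]))
    print(qos_class([{}]))
```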
4.3 Kubernetes operators (concepts — implementation detail in deep-dive file)
- Operators extend Kubernetes with custom resources (CRDs) and controllers that reconcile desired vs actual state
- CRD — custom API object stored in etcd; separates spec (desired) from status (observed)
- Reconcile loop — compare spec to reality; create/update/delete until they match
- Finalizers — block deletion until cleanup (e.g. snapshot before delete) completes
- Full controller-runtime, webhooks, and operator patterns → see deep-dive file
4.4 Kubernetes networking
CNI (Container Network Interface)
- IPAM (IP Address Management) — how pods get IPs
Pod-to-pod networking
- Each pod gets its own network namespace
- veth pair — one end in pod namespace, one end in host namespace
- Linux bridge (cbr0 or similar) — connects all veth pairs on a node
- How packets travel between pods on the same node vs different nodes
- Overlay networks — VXLAN encapsulation for cross-node traffic
Services and kube-proxy
- How ClusterIP works — DNS → ClusterIP → iptables DNAT → pod IP
- NodePort — how traffic enters the cluster from outside
DNS in Kubernetes
- CoreDNS deployment — Deployment with 2 replicas, kube-dns Service
- DNS search path — <svc>.<ns>.svc.cluster.local, <svc>.<ns>, <svc>
- ndots:5 — names with fewer than 5 dots are tried against every search domain first, so external names incur several failed lookups before resolving (latency issue)
- Headless services — no ClusterIP, DNS returns pod IPs directly (used by StatefulSets)
- DNS debugging — kubectl exec into a pod, use nslookup, dig, check CoreDNS logs
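The effect of ndots on resolution order can be sketched as a function that lists the names a stub resolver would try, in order. This follows the behaviour described in resolv.conf(5); the search list below is a typical example for a pod in the default namespace and is an assumption, as is the function name. A real resolver stops at the first name that resolves.

```python
# Assumed search list for a pod in the "default" namespace
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def query_order(name, search, ndots):
    """Return candidate names in the order a resolver would try them."""
    if name.endswith("."):
        return [name[:-1]]  # fully qualified: the search list is skipped
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]      # too few dots: walk the search list first


if __name__ == "__main__":
    for candidate in query_order("example.com", SEARCH, ndots=5):
        print(candidate)
```

With this search list, an external name such as example.com is only tried as-is after three search-domain attempts have failed. Appending a trailing dot (example.com.) skips the search list entirely.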
4.5 Autoscaling and resource optimization
Horizontal Pod Autoscaler (HPA)
- Metrics server — provides CPU/memory metrics from kubelet
- HPA control loop — target metric value, current metric value, desired replicas formula
- Custom and external metrics — KEDA for event-driven scaling
- Stabilization window — prevents flapping (scale-down slower than scale-up)
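The desired replicas formula in the HPA control loop is a ratio scaled by the current replica count and rounded up. A minimal sketch, assuming the documented default tolerance of 0.1 and leaving out min/max replica clamping and stabilization:

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """ceil(currentReplicas * currentMetric / targetMetric), skipped within tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    print(desired_replicas(4, current_metric=200, target_metric=100))  # load doubled
    print(desired_replicas(4, current_metric=50, target_metric=100))   # load halved
    print(desired_replicas(4, current_metric=105, target_metric=100))  # within tolerance
```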
Cluster Autoscaler (CA)
- Scale-up trigger — unschedulable pods (Pending state)
- Scale-down trigger — underutilized nodes for 10 minutes (default)
- Node groups — CA works with cloud provider node groups (ASGs in AWS)
- CA and PDBs — CA respects PodDisruptionBudgets during scale-down
- Safe-to-evict annotation — cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
KEDA (Kubernetes Event-Driven Autoscaling)
- ScaledObject CRD — links a workload to a scaler
- Built-in scalers — Kafka consumer lag, queue depth, Prometheus metrics, cron
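- Scale to zero — KEDA can scale deployments down to 0 (HPA alone cannot by default; minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled)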
4.6 Cloud-native Kubernetes
Platform teams operate K8s on top of cloud infrastructure. You need to understand what the cloud layer provides.
Managed control planes (EKS / GKE / AKS)
- Who owns etcd — the cloud provider manages the control plane; you manage worker nodes and workloads
- API server endpoint — public vs private endpoint, implications for CI/CD and developer access
VPC and networking
- Public vs private subnets — worker nodes typically in private subnets, NAT gateway for egress
- Pod CIDR vs node subnet vs service CIDR — three separate address spaces that must not overlap
- Cloud load balancers — ALB/NLB/GCLB mapping to LoadBalancer Service type
- externalTrafficPolicy: Local vs Cluster — source IP preservation and health check trade-offs on cloud LBs
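The requirement that pod CIDR, node subnet, and service CIDR must not overlap is easy to check before provisioning. A small sketch using the standard ipaddress module; the range names and example values are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return pairs of range names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


if __name__ == "__main__":
    good = {"nodes": "10.0.0.0/16", "pods": "10.244.0.0/16", "services": "10.96.0.0/12"}
    bad = {"nodes": "10.0.0.0/8", "pods": "10.244.0.0/16", "services": "172.20.0.0/16"}
    print("good plan overlaps:", overlapping_pairs(good))
    print("bad plan overlaps: ", overlapping_pairs(bad))
```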
Cloud IAM and workload identity
- IAM roles, policies, trust relationships — who can assume what, least-privilege policy design
- AWS IRSA — OIDC provider on cluster, annotated ServiceAccount, projected token → STS AssumeRole
- GCP Workload Identity — Kubernetes SA bound to GCP SA, no long-lived keys on nodes
Managed services vs in-cluster
- When to use RDS/Aurora vs self-hosted Postgres in K8s — ops burden, HA, backups, patching
- ElastiCache/Memorystore vs Redis Cluster in K8s — same trade-off for caching
- Object storage (S3/GCS) — Loki/Thanos blocks, Terraform state, CI artifacts, backup targets
Cloud DNS and certificates
- ACM / Google-managed certs — integration with cloud load balancers and Ingress
4.7 Service mesh and gateways (awareness — detail in deep-dive file)
- North-south — traffic from outside the cluster (ingress, TLS termination)
- East-west — service-to-service traffic inside the cluster
- Service mesh — sidecar proxies add mTLS, traffic splitting, and observability between services
- Default-deny NetworkPolicy — baseline for multi-tenant clusters; explicitly allow required paths
- Envoy, Istio, Gateway API, API gateways → see deep-dive file
Stage 5 — Container Security
Critical for Aqua Security, Snyk, Chainguard. Also tested at GitLab, Harness, Datadog.
5.1 Container image security
Image layers and attack surface
- How Docker image layers work — each RUN, COPY, and ADD instruction creates a layer
- Base image choice — Alpine vs Debian vs distroless vs scratch
- Distroless images — no shell, no package manager, minimal attack surface
- Multi-stage builds — only copy the binary into the final stage, discard build tools
Vulnerability scanning
- Static scanning tools — Trivy, Grype, Snyk Container, Clair
- What scanners check — OS packages, language dependencies, Dockerfile misconfigs
- CVE prioritization — severity (CVSS score), exploitability, reachability
- Base image updates — automated PRs to update base images (Renovate, Dependabot)
- Scanning in CI — fail the pipeline on critical/high CVEs, policy as code
Software Bill of Materials (SBOM)
- What an SBOM is — list of all components in a software artifact
- Generating and attaching SBOMs — Syft and docker sbom generate them; cosign attest attaches one to an image as a signed attestation
5.2 Supply chain security
The problem
- SolarWinds attack — build system compromise, malicious code injected into signed artifacts
- Log4Shell — transitive dependency vulnerability, hard to find without SBOMs
- XZ Utils backdoor — malicious maintainer, social engineering, compromised source
- The threat model — compromised build system, malicious dependency, typosquatting
SLSA framework (Supply chain Levels for Software Artifacts)
- SLSA Level 1 — provenance document exists
- Provenance — who built the artifact, from what source, on what system, with what inputs
Sigstore stack
- Cosign — signs container images and other OCI artifacts
- Keyless signing — short-lived certificate from Fulcio CA, no long-lived private keys
5.3 Kubernetes RBAC and access control
RBAC model
- Role (namespace-scoped) vs ClusterRole (cluster-scoped)
- RoleBinding vs ClusterRoleBinding
- Subjects: ServiceAccount, User, Group
Least privilege patterns
- Never use cluster-admin for application workloads
- Namespace-scoped service accounts for every workload
- Projected service account tokens — short-lived, audience-bound, auto-rotated
Pod security
- Pod Security Standards — Privileged, Baseline, Restricted profiles
- Pod Security Admission controller — enforces standards at namespace level
- Security context — runAsNonRoot, runAsUser, readOnlyRootFilesystem, allowPrivilegeEscalation: false
5.5 Cloud security (essentials — detail in deep-dive file)
- IAM — least-privilege roles and policies; no long-lived keys on nodes or in CI
- Encryption — at rest (disks, S3) and in transit (TLS)
- Audit logs — CloudTrail / cloud audit logs for who changed what
- Permission boundaries, WAF, GuardDuty, compliance frameworks → deep-dive file
Stage 6 — Observability
Core product domain for Datadog, Grafana Labs, New Relic, Splunk.
6.1 Metrics and Prometheus
Prometheus data model
- Time series — metric name + label set + sequence of (timestamp, float64) samples
- Label cardinality — why high-cardinality labels (user_id, request_id) cause OOM
- Metric types:
- Counter — monotonically increasing (requests total, errors total)
- Gauge — can go up and down (memory usage, queue depth, temperature)
- Histogram — distribution of values in configurable buckets (request duration, response size)
- Summary — pre-calculated quantiles on client side (avoid if possible — not aggregatable)
PromQL
- Instant vector vs range vector — http_requests_total vs http_requests_total[5m]
- rate() — per-second rate of a counter over a range (use for counters, not gauges)
- increase() — total increase in a counter over a range
- sum by(), avg by(), max by() — aggregation operators, label dropping
- histogram_quantile() — calculate p50/p95/p99 from histogram buckets
- Alerting rules — for duration, labels, annotations, Alertmanager integration
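How histogram_quantile() derives a percentile from cumulative buckets can be sketched as linear interpolation inside the bucket that contains the target rank. This is a simplified illustration of the idea, not Prometheus source: it assumes sorted (upper_bound, cumulative_count) pairs ending in a +Inf bucket and a lower bound of 0 for the first bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0..1) from cumulative histogram buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Cannot interpolate into +Inf (or an empty bucket):
                # report the edge of the last usable bucket instead.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets in seconds: le=0.1, 0.5, 1.0, +Inf
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(f"p{int(q * 100)} ~= {histogram_quantile(q, BUCKETS):.3f}s")
```

The estimate is only as precise as the bucket it lands in, which is why bucket boundaries should sit close to the latency thresholds you care about.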
6.2 Distributed tracing
Tracing concepts
- Trace — end-to-end record of a request through a distributed system
- Span — a single unit of work within a trace (one service call, one DB query)
- Parent-child span relationship — forms a tree structure (the trace)
- Trace context propagation — W3C traceparent header, B3 headers
OpenTelemetry (OTel)
- Exporters — OTLP (preferred), Jaeger, Zipkin, Prometheus
- OTel Collector — receives spans/metrics/logs, processes them, exports to backends
Sampling strategies
- Head sampling — decision made at trace start (random % or rate limit; cannot key on errors or latency, which are not yet known)
- Tail sampling — decision made after seeing the full trace (can sample based on error, latency)
6.3 Logging
Log shipping pipeline
- Log sources — container stdout/stderr (collected by node agent), application log files
- DaemonSet log agents — Fluent Bit (lightweight), Fluentd (more plugins), Vector (Rust-based)
- Structured logging — JSON logs, consistent field names, log levels
Grafana Loki
- Loki's key design decision — indexes only labels (like Prometheus), not log content
- Why this matters — much cheaper to store and index than Elasticsearch-style full-text index
- Log streams — a stream is a set of logs with the same label set (like a Prometheus time series)
- LogQL — log query language, filter expressions such as {app="nginx"} |= "error", metric queries
6.4 SLOs and alerting
SLI/SLO/SLA
- SLI (Service Level Indicator) — the metric you measure (e.g., error rate, latency p99)
- SLO (Service Level Objective) — the target (e.g., 99.9% of requests under 200ms)
- Error budget — time you can be non-compliant (0.1% of 30 days = 43.2 minutes)
- Error budget burn rate — how fast you are consuming the error budget
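The error budget and burn rate arithmetic can be worked through in a few lines. A small sketch, assuming a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full non-compliance allowed by an availability SLO."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def budget_consumed(rate, hours, window_days=30):
    """Fraction of the window's budget used by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


if __name__ == "__main__":
    slo = 0.999
    print(f"budget: {error_budget_minutes(slo):.1f} minutes per 30 days")
    rate = burn_rate(error_ratio=0.0144, slo=slo)
    print(f"burn rate at 1.44% errors: {rate:.1f}x")
    print(f"budget used in 1 hour at that rate: {budget_consumed(rate, 1):.1%}")
```

A burn rate of 1 uses the budget exactly over the window; a sustained high burn rate over a short period is what fast-burn alerts are designed to catch.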
Multi-window burn rate alerts
- Alertmanager integration — Prometheus rules send alerts to Alertmanager
- Short window (5 min) + long window (1 hour) — two-condition alert to reduce false positives
- Routing trees — route alerts to correct team based on labels
6.5 SRE practices
Stage 6.4 covers SLO metrics and alerting. This section covers how platform/SRE teams operate.
Incident management
- Severity levels (SEV1–SEV4) — customer impact, response time expectations
- Incident commander role — coordinates response, comms, decision-making
- Incident lifecycle — detect → triage → mitigate → resolve → postmortem
- Status pages and stakeholder comms — internal vs external, update cadence
- Runbooks — symptom-based (not cause-based), links to dashboards and remediation steps
On-call and alert quality
- Alert design — page on symptoms (SLO burn, user-facing errors), not causes (CPU high)
- On-call rotation — follow-the-sun, escalation policies, handoff rituals
- Toil — repetitive manual work; measure and automate (platform team's core mandate)
- Error budget policy — when budget is exhausted, freeze features, focus on reliability
Reliability engineering
- Application-level patterns — timeouts, retries, circuit breakers, idempotency (Stage 7.5)
- Capacity planning — headroom targets, load testing before launches, saturation metrics (USE method, Stage 6.6)
- Failure domain isolation — blast radius, multi-AZ/region design (Stage 4.6)
Disaster recovery and resilience
- Backup strategy beyond etcd — application data, cross-region replication, restore drills
- Multi-AZ vs multi-region — zone failure tolerance vs region failure tolerance
- Game days and chaos engineering — Litmus/Chaos Mesh: pod kill, network partition, AZ failure
Postmortems
- Blameless culture — focus on systems and process, not individuals
- Timeline, contributing factors (not root cause singular), action items with owners
- Follow-through — track action items to completion, review in subsequent incidents
6.6 Performance engineering
Unifies performance concepts scattered across Stages 1, 3, 6, and 10 into a methodology.
Performance methodology
- Define the goal first — latency vs throughput vs tail behavior vs cost
- Measure before optimizing — establish baseline with load tests and production metrics
- One change at a time — isolate variables; validate with before/after comparison
Throughput vs latency
- Why higher throughput often worsens tail latency — queue buildup under saturation
- Concurrency limits — connection pools, worker counts, HPA max replicas as backpressure levers
- Backpressure — propagate slowness upstream instead of buffering indefinitely
Latency analysis and percentiles
- p50 (median) vs p95 vs p99 vs p999 — why averages lie; tail latency drives user experience
- Histogram buckets in Prometheus — choose bucket boundaries for your SLO thresholds (Stage 6.1)
- Why Summary metrics are problematic — pre-computed quantiles on client side are not aggregatable
- RED method — Rate, Errors, Duration (for request-driven services)
- USE method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)
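The claim that averages lie is easy to demonstrate with a skewed sample. The sketch below uses a nearest-rank percentile on made-up latencies in which 2% of requests are slow; the data and function name are illustrative.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones
LATENCIES_MS = [10] * 98 + [1000] * 2

if __name__ == "__main__":
    print("mean:", mean(LATENCIES_MS))
    for p in (50, 95, 99):
        print(f"p{p}:", percentile(LATENCIES_MS, p))
```

The mean lands at a value that no request actually experienced, while p50 and p99 describe the typical and the tail experience respectively.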
Finding bottlenecks
- Layered diagnosis — app → pod (cgroup metrics) → node (vmstat, iostat) → network → control plane
- Go-specific — pprof CPU/heap profiles, GC pause analysis, GOGC tuning (Stage 3)
- Database — slow query logs, connection pool exhaustion, replication lag (Stage 7)
Load testing
- Test types — smoke, load (steady state), stress (find breaking point), spike, soak (memory leaks)
- What platform teams validate — HPA response time, CA scale-up latency, PDB behavior under drain, ingress capacity
- Warm-up period — exclude from measurements; run long enough for GC and caches to stabilize
- Production-like data volume and cardinality — load test observability pipeline too (Stage 6.1 cardinality)
Caching and batching
- Cache hierarchy — CDN edge (Stage 10.3) → Redis (Stage 7.3) → application in-memory
- Connection pooling — DB pools, HTTP keep-alive; file descriptor and cgroup limits (Stage 1)
Stage 7 — Distributed systems and databases
Critical for CockroachDB, YugabyteDB, PlanetScale, ScyllaDB, Snowflake, Redis.
7.1 Distributed systems theory
Fundamental problems
CAP theorem
- Consistency — every read sees the most recent write
- Availability — every request gets a response (not necessarily the most recent data)
- Partition tolerance — system works despite network partitions
- Real-world: CP systems (ZooKeeper, etcd, CockroachDB), AP systems (Cassandra, DynamoDB)
PACELC
- Extends CAP — when there is no Partition (Else), the trade-off is between latency (L) and consistency (C)
- More practical than CAP for comparing real databases
Consistency levels
- Strong consistency / linearizability — operations appear instantaneous, globally ordered
- Eventual consistency — replicas will converge eventually, reads may be stale
Consensus algorithms
- Raft — designed for understandability, used in etcd, CockroachDB, TiKV
- Leader election — candidates request votes, majority wins, term numbers
Replication patterns
- Single-leader — all writes go to leader, replicated to followers (PostgreSQL, MySQL)
- Leaderless (Dynamo-style) — any node accepts writes, quorum reads/writes (Cassandra)
Clocks in distributed systems
- Physical clocks — NTP sync, still have drift, clock_gettime()
7.2 Distributed SQL (CockroachDB / YugabyteDB)
Architecture
- YugabyteDB — similar model, supports PostgreSQL and Cassandra APIs, DocDB storage layer
Distributed transactions
- MVCC (Multi-Version Concurrency Control) — every write creates a new version, readers see a consistent snapshot
Schema changes
Geo-distribution
- Region/zone topology — replicas placed in different regions/zones
7.3 Redis internals
Data structures and their implementations
Persistence
- RDB snapshot — BGSAVE forks the process, child writes snapshot using CoW, parent continues serving
- AOF (Append-Only File) — logs every write command, fsync policies: always, everysec, no
- Hybrid persistence — RDB + AOF combined, AOF replays only since last RDB snapshot
- No persistence mode — pure cache, data loss on restart acceptable
Replication
- REPLICAOF — replica connects to master, full sync (RDB transfer) then partial sync
- Replica lag — INFO replication shows master_repl_offset vs replica offset
Redis Cluster
Eviction policies
- noeviction — return error when maxmemory hit
- allkeys-lru — evict any key using LRU approximation
- volatile-lru — evict only keys with TTL set, using LRU
- allkeys-lfu — evict least frequently used keys (better for skewed access patterns)
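The idea behind allkeys-lru can be shown with an exact LRU cache. Note the difference from Redis itself: as the list says, Redis approximates LRU rather than keeping an exact recency order, and it limits by memory rather than key count. The class below is a teaching sketch with assumed names.

```python
from collections import OrderedDict


class LRUCache:
    """Exact LRU cache bounded by number of keys."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a read refreshes recency
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_keys:
            self._data.popitem(last=False)  # evict the least recently used key

    def keys(self):
        return list(self._data)


if __name__ == "__main__":
    cache = LRUCache(max_keys=2)
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")       # "a" is now more recent than "b"
    cache.set("c", 3)    # over capacity: "b" is evicted
    print(cache.keys())
```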
7.5 Backend patterns for platform engineers
Platform teams build controllers, webhooks, internal APIs, and golden-path services. These patterns apply.
API design and gRPC
- REST vs gRPC — REST for human-facing/admin APIs; gRPC for high-performance internal service-to-service
- Deadlines and cancellation — context.Context propagation, client-side timeouts (Stage 3)
- API versioning — URL path vs header vs protobuf package; deprecation policy
- Idempotent APIs — safe retries for POST/PUT; idempotency keys for create operations
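Idempotency keys for create operations can be sketched as a lookup that replays the stored result when the same key is seen again. This is an in-memory illustration with hypothetical names; a real service would persist keys with an expiry and handle concurrent requests for the same key.

```python
class OrderService:
    """Toy create endpoint that is safe to retry with the same idempotency key."""

    def __init__(self):
        self.orders = []
        self._results_by_key = {}

    def create_order(self, idempotency_key, item):
        # A retried request returns the original result instead of creating again.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        order = {"id": len(self.orders) + 1, "item": item}
        self.orders.append(order)
        self._results_by_key[idempotency_key] = order
        return order


if __name__ == "__main__":
    service = OrderService()
    first = service.create_order("key-123", "disk")
    retry = service.create_order("key-123", "disk")  # e.g. client timed out and retried
    print(first == retry, len(service.orders))
```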
PostgreSQL fundamentals
- MVCC — multi-version concurrency control, snapshots, vacuum, bloat
- Indexes — B-tree (default), partial indexes, covering indexes, when indexes hurt writes
- Connection limits — max_connections, connection pooling (PgBouncer), pool sizing vs pod count
- Replication — streaming replication, replication lag, synchronous vs asynchronous
- Isolation levels — Read Committed (default), Repeatable Read, Serializable
- Foundation for CockroachDB/Yugabyte (Stage 7.2) and PlanetScale/Vitess (Stage 11)
Message queues and event streaming
- Kafka fundamentals — topics, partitions, consumer groups, offset commits, consumer lag
- Delivery semantics — at-most-once, at-least-once, exactly-once (idempotent consumers + transactions)
- Dead-letter queues (DLQ) — poison messages, retry policies, manual inspection
- When to use what — Kafka (high-throughput log), SQS/RabbitMQ (task queues), NATS (low-latency pub/sub)
- KEDA integration — scale on Kafka consumer lag (Stage 4.5)
Reliability patterns in application code
- Timeouts on every outbound call — HTTP clients, DB queries, gRPC deadlines
- Retries with exponential backoff and jitter — max attempts, retry only on idempotent operations
- Circuit breakers — open/half-open/closed states, failure threshold, recovery probe
- Health checks — liveness (restart if broken) vs readiness (stop sending traffic) vs startup (Stage 4.1)
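Retries with exponential backoff and jitter can be sketched as follows. This uses the "full jitter" variant (a uniform random delay up to the exponential ceiling), which is one common choice rather than the only one. The parameter values and the choice of ConnectionError as the retryable error are assumptions, and the wrapped operation must be idempotent.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ConnectionError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retries(flaky), "after", calls["count"], "calls")
```

Injecting sleep and rng keeps the function testable without real waiting; the jitter spreads retries out so that many clients failing at once do not retry in lockstep.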
Caching and background work
- Cache-aside vs read-through vs write-through — invalidation strategies, TTL design
- Cache stampede protection — single-flight, lock-based refresh
- Background jobs — Job vs long-running Deployment worker in K8s (Stage 4.2)
7.6 Data migration strategies
Deployment (Stage 9.4) ships code; data migration moves state. These are separate problems.
Expand-contract pattern
- Expand — add new column/table/API field (backward compatible, old code still works)
- Migrate — backfill data, dual-read or dual-write during transition
- Contract — remove old column/table/API field once all code uses new path
- Why it matters — enables zero-downtime deploys with rolling updates (Stage 9.4)
Dual writes and reconciliation
- Write to old and new systems simultaneously during transition
- Reconciliation job — compare old vs new, fix drift, idempotency required
- Risk — inconsistency window if one write succeeds and the other fails; needs compensating transactions
Change Data Capture (CDC)
- CDC tools — Debezium, AWS DMS, Maxwell — stream DB changes to Kafka/message bus
- Use cases — real-time replication, event-driven architecture, incremental migration
- Initial snapshot + streaming — full load then switch to binlog/WAL streaming
Online schema migrations
- Expand-contract for indexes — create index concurrently, swap in application
- Migration ordering — schema before code (expand) or code before schema (contract) depending on direction
Cutover and verification
- Traffic shifting — percentage-based cutover, instant rollback if error rate spikes
- Backfill throttling — rate-limit backfill to protect production DB performance
- Rollback plan — can you revert if cutover fails? How long is old system kept warm?
Stage 8 — Infrastructure as Code
HashiCorp and Pulumi are on your list. IaC is also tested at almost every other company.
8.1 Terraform
Core concepts
- HCL (HashiCorp Configuration Language) — declarative configuration language
- Provider — plugin that manages a specific API (AWS, GCP, Kubernetes, Vault)
- Resource — infrastructure object managed by Terraform
- Data source — read-only reference to existing infrastructure
- Output — export values from a configuration
- Variable — input values, with type constraints and validation
State management
- State file (terraform.tfstate) — JSON file recording current state of all managed resources
- Remote backends — S3 + DynamoDB (locking), Terraform Cloud, GCS
- State locking — prevents concurrent applies, DynamoDB table for distributed lock
- terraform import — bring existing infrastructure under Terraform management
- State drift — real world diverges from state, terraform plan detects this
Plan and apply lifecycle
- Dependency graph — Terraform builds a DAG of all resources and their dependencies
- create_before_destroy lifecycle meta-argument — zero-downtime replacements
- prevent_destroy — protect critical resources from accidental deletion
- Targeted applies — terraform apply -target=aws_instance.foo (use sparingly)
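The dependency graph idea can be illustrated with a topological sort: a resource is only created after everything it depends on. The sketch below uses Python's graphlib (3.9+) on a hypothetical three-resource graph; it shows the ordering concept only, not how Terraform itself walks its graph.

```python
from graphlib import TopologicalSorter

# Hypothetical resources mapped to the set of resources they depend on
DEPENDS_ON = {
    "aws_vpc.main": set(),
    "aws_subnet.app": {"aws_vpc.main"},
    "aws_instance.foo": {"aws_subnet.app"},
}


def apply_order(depends_on):
    """Order in which resources can be created: dependencies first."""
    return list(TopologicalSorter(depends_on).static_order())


def destroy_order(depends_on):
    """Destroy in reverse: dependents go before what they depend on."""
    return list(reversed(apply_order(depends_on)))


if __name__ == "__main__":
    print("create: ", apply_order(DEPENDS_ON))
    print("destroy:", destroy_order(DEPENDS_ON))
```

Resources with no path between them in the graph have no ordering constraint, which is what allows them to be created in parallel.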
Modules
- Module structure — main.tf, variables.tf, outputs.tf
- Module versioning — source from Terraform Registry, GitHub with ?ref=v1.2.3
- Module composition patterns — root module calls child modules
8.2 Pulumi (awareness)
- Alternative to Terraform — define infrastructure in TypeScript, Python, or Go instead of HCL
- Same plan/apply/state model; details in deep-dive file when you use it
8.3 HashiCorp Vault
Architecture
- Core + storage backend — Vault core is stateless, all state in storage (Raft integrated or external like Consul)
- Auto-unseal — use cloud KMS (AWS KMS, GCP KMS) to automatically unseal on restart
Auth methods
- Kubernetes auth — pod presents service account token, Vault validates with K8s API server
- AWS IAM auth — use IAM role/instance profile to authenticate
- OIDC/JWT — integrate with any OIDC provider (GitHub Actions, GitLab CI)
Secret engines
- KV v2 — versioned key-value store, soft delete, max_versions per key
- Dynamic secrets — Vault generates credentials on-demand (DB passwords, AWS keys, certificates)
- Database secret engine — Vault creates a DB user, returns credentials, auto-revokes on lease expiry
Vault Agent
- Sidecar pattern — runs alongside your app, authenticates to Vault, writes secrets to file
- Kubernetes Vault Agent Injector — annotate pods, sidecar is automatically injected
8.4 FinOps for platform teams
Connects autoscaling (Stage 4.5), cloud infrastructure (Stage 4.6), and IaC (Stages 8.1–8.2).
Cost visibility and allocation
- Tagging strategy — mandatory tags: team, environment, service, cost-center
- Showback vs chargeback — visibility to teams vs actual billing
- Cost per namespace / per cluster / per service — Kubecost, CloudHealth, native cloud cost explorer
- Unit economics — cost per request, cost per GB ingested, cost per tenant
Compute optimization
- Rightsizing — VPA recommendations (Stage 4.5), instance type selection, CPU/memory fit
- Spot / preemptible nodes — cost savings vs interruption risk, taints/tolerations for fault-tolerant workloads
- Cluster Autoscaler price expander — prefer cheaper node groups (Stage 4.5)
- Idle resource detection — orphaned volumes, unused load balancers, over-provisioned node groups
- HPA min replicas — don't run 10 replicas at 3am if traffic allows 2
Storage and data costs
- Object storage lifecycle policies — S3 Intelligent-Tiering, Glacier for old Loki/Thanos blocks
- Persistent volume sizing — right-size PVCs, storage class selection (gp3 vs io2)
- Log and metrics retention — shorter retention = lower cost (Stage 6); cardinality = cost (Stage 6.1)
- Egress costs — cross-AZ, cross-region, internet egress; design to minimize (CDN, PrivateLink)
FinOps in IaC and CI/CD
- Cost estimation in PRs — Infracost, Terraform plan cost diff
- Policy as code — deny expensive instance types, enforce tagging in Terraform/Kyverno
- Environment lifecycle — tear down ephemeral preview environments (Stage 9.3), scheduled shutdown of dev clusters
- Reserved instances / savings plans vs on-demand — when commitment makes sense
Stage 9 — CI/CD, GitOps and Developer Platforms
GitLab, Harness, CircleCI on your list. GitOps is expected everywhere.
9.1 GitOps with ArgoCD
GitOps principles
- Git as the single source of truth for desired state
- Declarative — desired state expressed as files, not imperative commands
- Automated reconciliation — controller continuously syncs actual state to desired state
- Auditability — every change is a Git commit with author, timestamp, diff
ArgoCD architecture
- Application CRD — defines source (Git repo/path) and destination (cluster/namespace)
- Application controller — watches Applications, compares live state with desired state (Git)
- Repo server — clones Git repos, renders Helm/Kustomize/Jsonnet manifests
- API server — serves gRPC and REST API, handles sync triggers
App-of-apps pattern
- Enables managing hundreds of apps from a single Git repo
Multi-cluster GitOps
- Cluster credentials — stored as Secrets in ArgoCD namespace
- Progressive delivery across clusters — sync to dev → staging → prod with approvals
Secrets in GitOps
- External Secrets Operator — CRD points to Vault/AWS Secrets Manager, controller creates K8s Secret
9.2 CI/CD pipeline engineering
Pipeline concepts
- DAG execution — stages/steps as a directed acyclic graph, parallel by default
- Artifact passing — how outputs of one stage become inputs of the next
- Build cache — Docker layer cache, language-specific caches (Go module cache, npm cache)
- Pipeline triggers — push, MR/PR, schedule, API trigger, upstream pipeline
GitLab CI specifics
- `.gitlab-ci.yml` — pipeline definition, stages, jobs, rules, needs
- GitLab Runner — the agent that executes jobs, registered to a GitLab instance
- Executor types — Shell, Docker, Kubernetes (most scalable)
- Kubernetes executor — creates a pod per job, ephemeral, configurable resources
- Caching — `cache:` key with hash of lock file, stored in S3 or runner local cache
- Artifacts — `artifacts:` paths persisted and passed between jobs/stages
Security in CI/CD pipelines
- SAST scanning — GitLab AutoDevOps, Semgrep, CodeQL
- SCA (Software Composition Analysis) — Snyk, Trivy, `grype`
- Container scanning — scan image after build, before push
- Secret detection — gitleaks, trufflehog, GitLab secret detection
9.3 Internal Developer Platform (IDP)
The "platform engineering" product layer — what app teams interact with daily.
Platform as a product
- Internal customers — application developers, data engineers, ML engineers
- Golden paths — opinionated, supported, easy way to do the right thing
- Self-service vs guardrails — developers provision infra within policy boundaries
Developer portal and service catalog
- Service catalog metadata — owner, on-call rotation, dependencies, SLOs, runbooks
- Scaffolder templates — "Create microservice" → repo + CI + Dockerfile + K8s manifests + monitoring + RBAC
- TechDocs — docs-as-code in the repo, rendered in the portal
Golden path templates
- What a complete template includes — Git repo, `.gitlab-ci.yml`, container build, image signing (Stage 5.2), GitOps manifest (Stage 9.1), Prometheus alerts (Stage 6), NetworkPolicy (Stage 5.3)
- Template versioning — upgrade path when platform standards change
Environment management
- Dev / staging / prod promotion — GitOps sync waves across clusters (Stage 9.1)
- Ephemeral environments — preview apps per MR (Stage 9.2), namespace-per-branch, TTL-based cleanup
- Environment parity — same Helm chart, different values; avoid snowflake environments
Artifact management
- Container registries — ECR, GCR, Harbor; image retention policies, vulnerability scan gates (Stage 5.1)
- SBOM and provenance storage — attach to images in registry (Stage 5.2)
Policy in the delivery path
- Shift-left security — scan in CI before merge (Stage 9.2)
- Admission control at deploy — Kyverno/Gatekeeper enforce standards (Stage 5.3)
- Policy exceptions — audit mode, break-glass with approval workflow
9.4 Deployment and release strategies
How to ship changes safely. Coordinate with data migrations (Stage 7.6) and SLOs (Stage 6.4).
Choosing a strategy
| Strategy | Downtime | Rollback speed | Infrastructure cost | Best for |
|---|---|---|---|---|
| Rolling | None | Slow (re-deploy old version) | Low | Stateless services, default K8s |
| Blue-green | None | Fast (switch traffic) | 2x during deploy | Critical services, fast rollback needed |
| Canary | None | Fast (shift traffic back) | Low extra | High-traffic services, metric-gated promotion |
| Shadow | None | N/A (no user impact) | 2x compute | Validation before any user traffic |
Rolling deployment
- K8s Deployment — `maxSurge`, `maxUnavailable`, rolling update strategy (Stage 4.2)
- Readiness probes — new pods must pass before old pods terminate
- PodDisruptionBudget — minimum available during voluntary disruptions (Stage 4.2)
- Limitation — mixed versions run simultaneously; requires backward-compatible API and schema (Stage 7.6)
Blue-green deployment
- Two identical environments — blue (current) and green (new)
- Traffic switch — DNS, load balancer, or service mesh route flip
- Rollback — switch traffic back to blue instantly
- Cost — running double infrastructure during deploy window
- Database consideration — schema must be compatible with both versions (expand-contract, Stage 7.6)
Canary deployment
- Traffic split — 1% → 5% → 25% → 50% → 100%, gated by metrics at each step
- Metric gates — error rate, p99 latency, saturation (Stage 6.6); SLO burn rate (Stage 6.4)
- Automated rollback — Argo Rollouts / Flagger revert on failed analysis (Stage 9.1)
- Service mesh or ingress required — Istio VirtualService, NGINX canary annotations, Cilium (Stage 4.7)
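The metric-gated traffic split can be sketched as a small decision function. This is a minimal illustration, not how Argo Rollouts or Flagger implement analysis. The thresholds (1% error rate, 500 ms p99) and the names `Metrics` and `next_weight` are assumptions chosen for the example.

```python
from dataclasses import dataclass

CANARY_STEPS = (1, 5, 25, 50, 100)  # percent of traffic, from the list above


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float


def next_weight(current: int, m: Metrics,
                max_error_rate: float = 0.01,
                max_p99_ms: float = 500.0) -> int:
    """Return the next canary traffic weight, or 0 to signal rollback."""
    if m.error_rate > max_error_rate or m.p99_latency_ms > max_p99_ms:
        return 0  # gate failed: shift all traffic back to the stable version
    later = [s for s in CANARY_STEPS if s > current]
    return later[0] if later else 100
```

Each step is evaluated only after the canary has served enough traffic at the current weight for the metrics to be meaningful; how long that takes depends on request volume.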
Shadow / dark traffic
- Mirror production traffic to new version — no user-facing impact
- Compare responses — diff old vs new output, log discrepancies
- Use cases — validate rewrite, test new database backend, ML model comparison
Feature flags
- Decouple deploy from release — code is deployed but feature is off
- Flag types — release flags (short-lived), ops flags (kill switch), experiment flags (A/B)
- Kill switch — disable feature instantly without rollback deploy
- Flag hygiene — remove stale flags; tech debt if flags accumulate
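A release flag with a kill switch can be evaluated with a deterministic hash, so the same user always lands in the same bucket as the rollout percentage grows. The sketch below is a generic illustration; `flag_enabled` and its parameters are hypothetical and do not correspond to any specific feature-flag product.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int,
                 kill_switch: bool = False) -> bool:
    """Deterministic percentage rollout with an ops kill switch."""
    if kill_switch:
        return False  # ops flag wins over any rollout setting
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < rollout_percent
```

Hashing the flag name together with the user ID keeps buckets independent across flags, so the same users are not always the first to see every new feature.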
Deployment safety checklist
- Backward-compatible API and schema changes (Stage 7.6 expand phase)
- Feature flags for risky changes
- Dashboards and alerts ready before deploy (Stage 6)
- Rollback plan documented — code rollback vs schema rollback (schema rollback is hard)
- PDB and HPA configured — don't deploy during capacity constraints
- Error budget check — freeze deploys if budget exhausted (Stage 6.5)
Coordinating code and data deploys
- Expand before deploy — add new DB column/table before code that uses it
- Contract after deploy — remove old column only after all code migrated
- Dual-write period — both old and new code paths write to both stores (Stage 7.6)
- Never deploy breaking schema change with rolling update — old pods will crash
Stage 10 — eBPF and Advanced Networking (for Cilium, Cloudflare, Fastly)
10.1 Advanced networking awareness (details come later, in the deep-dive file)
Full eBPF, Cilium, and CDN content is in platform-engineering-deep-dive.md. For now:
- eBPF — programmable hooks in the Linux kernel for networking, security, and observability
- Cilium — Kubernetes networking and policy using eBPF instead of iptables
- CDN edge — caches responses by `Cache-Control` headers; mitigates DDoS at L3/L4 (volume) and L7 (HTTP-aware)
Stage 11 — Distributed databases continued (ScyllaDB / PlanetScale)
Wide-column (ScyllaDB/Cassandra) and sharded MySQL (Vitess/PlanetScale) — full detail in deep-dive file.
- ScyllaDB/Cassandra — partition key determines node; design tables for query patterns, not normalized joins
- Vitess/PlanetScale — MySQL sharded at scale; avoid scatter queries without a shard key
- LSM trees, VTGate, gh-ost, resharding → deep-dive file
Stage 12 — Architecture case studies
Apply everything from Stages 1–11. Each case study follows: problem → architecture → key decisions → failure modes → interview follow-ups.
12.1 Datadog metrics ingest pipeline
Problem
- Ingest millions of metrics per second from agents across customer infrastructure
- High cardinality risk — bad label design can OOM the pipeline
- Must query recent data fast; older data can be slower/cheaper
Architecture (conceptual)
- Agent (node/pod) → local aggregation → intake API (load balanced)
- Kafka or similar queue — decouple ingest from processing, absorb spikes
- Processing workers — normalize, validate, drop/blacklist high-cardinality series
- Hot storage — recent data, fast queries (like Prometheus TSDB, Stage 6.1)
- Cold storage — object storage (S3) for long retention, queried on demand
- Query layer — federates hot + cold, PromQL-compatible
Key decisions
- Why queue between intake and storage — backpressure, burst absorption
- Cardinality limits — per-metric, per-tag, per-customer quotas
- Downsampling — reduce resolution for older data to control storage cost (Stage 8.4)
- Sharding — by customer ID or metric hash for horizontal scale
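Two of the decisions above, per-metric cardinality limits and hash-based sharding, can be sketched in a few lines. This is an in-memory illustration under simplifying assumptions (a single process, no expiry of old series). The names `CardinalityLimiter` and `shard_for` are hypothetical and do not describe Datadog's actual implementation.

```python
import hashlib
from collections import defaultdict


class CardinalityLimiter:
    """Caps the number of distinct tag sets accepted per metric name."""

    def __init__(self, max_series_per_metric: int):
        self.max_series = max_series_per_metric
        self.seen = defaultdict(set)  # metric name -> set of series keys

    def accept(self, metric: str, tags: dict[str, str]) -> bool:
        key = tuple(sorted(tags.items()))
        series = self.seen[metric]
        if key in series:
            return True    # existing series: always accepted
        if len(series) >= self.max_series:
            return False   # new series over quota: drop
        series.add(key)
        return True


def shard_for(customer_id: str, num_shards: int) -> int:
    """Stable shard assignment from a hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Existing series keep flowing once the quota is hit; only new tag combinations are rejected, which limits the damage of a bad deployment without blanking a customer's dashboards.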
Failure modes
- Cardinality explosion — one bad deployment sends unique label per request
- Ingest lag — queue depth grows, delayed metrics, alert on pipeline lag not just app metrics
- Hot shard — uneven customer traffic distribution
Interview follow-ups
- How would you design cardinality limits?
- What happens if Kafka is down for 5 minutes?
- How do you migrate storage backends without downtime?
12.2 Cloudflare DDoS mitigation
Problem
- Mitigate multi-Tbps volumetric attacks without impacting legitimate traffic
- Must operate at line rate — cannot afford per-packet userspace processing at scale
Architecture (conceptual)
- Anycast BGP — same IP from every PoP, traffic routed to nearest edge (Stage 10.3)
- XDP/eBPF at NIC — drop malicious packets before kernel network stack (Stage 10.1)
- Flow tracking — stateful inspection for SYN floods, UDP amplification
- Rate limiting — token bucket per IP/ASN/fingerprint (Stage 10.3)
- Challenge layer — JS/CAPTCHA for suspicious but not clearly malicious traffic
- Origin shield — aggregate cache misses through single PoP to protect origin
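The token bucket used for rate limiting can be sketched as follows. This is a userspace Python illustration of the algorithm only; at the edge it would run in XDP/eBPF or similar, not in Python. The injectable `now` clock is an assumption added to make the behaviour testable.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        t = self.now()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Per-IP, per-ASN, or per-fingerprint limiting keeps one bucket per key, for example in a dictionary keyed by source IP, which is where memory pressure from millions of unique sources comes from.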
Key decisions
- XDP vs iptables — XDP for line-rate drop, iptables for complex stateful rules
- False positive vs false negative trade-off — blocking legit users vs letting attack through
- Attack signature updates — how fast can rules propagate to all PoPs globally?
Failure modes
- Origin overload during cache miss storm — origin-facing PoP becomes bottleneck
- SYN flood exhausting conntrack table (Stage 2) — eBPF replaces kernel conntrack at scale (Stage 10.2)
- L7 attacks that look like legitimate HTTP — require application-aware detection
Interview follow-ups
- How does anycast handle a PoP going offline?
- Design rate limiting for 10M unique IPs.
- How would you test DDoS mitigation without affecting production?
12.3 Multi-tenant Kubernetes platform
Problem
- Run 50+ teams on shared clusters with isolation, fair resource sharing, and cost allocation
Architecture (conceptual)
- Namespace per team — ResourceQuota, LimitRange (Stage 4.2)
- NetworkPolicy default-deny — explicit allow between namespaces (Stage 4.2, 4.7)
- Pod Security Standards — Restricted profile enforced via admission (Stage 5.3)
- RBAC — namespace-scoped roles, no cluster-admin for app teams (Stage 5.3)
- Cost allocation — Kubecost or cloud tags mapped to namespaces (Stage 8.4)
- IDP self-service — Backstage template creates namespace + quota + GitOps repo (Stage 9.3)
Key decisions
- Shared vs dedicated nodes — taints/tolerations for noisy-neighbor isolation
- Cluster per env vs cluster per team — blast radius vs operational overhead
- How much self-service — golden path vs bring-your-own-manifests
Failure modes
- Noisy neighbor — one team's memory spike triggers node OOM, evicts other teams' pods
- Quota exhaustion — team hits ResourceQuota, pods stuck Pending, unclear error message
- NetworkPolicy too restrictive — breaks legitimate cross-team dependencies
Interview follow-ups
- How do you handle a team that needs GPU nodes?
- Design chargeback for shared cluster costs.
- One team deploys a crypto miner — how do you detect and respond?
12.4 GitOps at scale (100+ clusters)
Problem
- Manage application deployments across hundreds of clusters from a central platform
- Balance consistency with cluster-specific overrides; control blast radius
Architecture (conceptual)
- ArgoCD hub — central instance managing remote clusters (Stage 9.1)
- App-of-apps / ApplicationSet — templated apps per cluster (Stage 9.1)
- Repo structure — base manifests + Kustomize overlays per cluster/environment
- Sync waves — CRDs first, then operators, then workloads
- Progressive sync — dev clusters auto-sync, prod requires manual approval
- Secrets — External Secrets Operator pulling from Vault (Stage 9.1, 8.3)
Key decisions
- Monorepo vs polyrepo — trade-off between visibility and access control
- Auto-sync vs manual sync for production — speed vs safety
- How to handle cluster-specific config — Kustomize overlays vs Helm values files
Failure modes
- Bad manifest synced to all clusters simultaneously — blast radius
- ArgoCD itself becomes SPOF — HA deployment, multiple replicas
- Secret rotation breaks sync — stale ExternalSecret, pods fail to start
- Drift — manual `kubectl edit` on cluster, GitOps fights live state
Interview follow-ups
- How do you roll out a platform-wide NetworkPolicy change safely?
- Design a canary cluster before promoting to all prod clusters.
- How do you handle a cluster that can't reach Git?
12.5 Secure CI/CD supply chain end-to-end
Problem
- Ensure only trusted, scanned, signed artifacts reach production clusters
Architecture (conceptual)
- Developer push → CI pipeline (Stage 9.2)
- SAST + SCA + secret detection in CI (Stage 9.2)
- Build container image → Trivy/Grype scan (Stage 5.1)
- Generate SBOM (Syft) + SLSA provenance (Stage 5.2)
- Sign with Cosign keyless signing via GitHub OIDC → Fulcio → Rekor (Stage 5.2)
- Push to registry with signature attached
- Admission webhook — Kyverno verify-image policy, reject unsigned or vulnerable images (Stage 5.3)
- GitOps deploy — ArgoCD syncs signed image to cluster (Stage 9.1)
- Runtime — Falco detects anomalous behavior (Stage 5.4)
Key decisions
- Where to enforce — CI gate vs registry gate vs admission gate (defense in depth)
- Keyless vs key-based signing — OIDC identity vs long-lived keys
- CVE policy — block critical, warn on high, allow with exception workflow
Failure modes
- Compromised CI runner — attacker pushes malicious signed image
- Policy bypass — `--privileged` pod admitted because namespace lacks Pod Security
- Stale base image — image signed but base layer has new CVE discovered later
Interview follow-ups
- How do you handle emergency hotfix bypass of scan gates?
- Design provenance verification that works across multiple CI systems.
- What if Rekor is unavailable — can you still verify signatures?
12.6 Globally distributed SQL (CockroachDB-style)
Problem
- PostgreSQL-compatible database that survives region failure with strong consistency
Architecture (conceptual)
- Keyspace split into ranges, each range = Raft group (Stage 7.2)
- Multi-Raft — independent consensus per range, scales horizontally
- Transaction coordinator — 2PC across ranges for distributed transactions
- Geo-partitioning — pin data to regions for latency and compliance (Stage 7.2)
- Follower reads — read from local replica at stale timestamp for lower latency
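The idea of splitting the keyspace into ranges, each owned by its own Raft group, can be illustrated with a sorted list of range start keys and a binary search. This is a toy model under stated assumptions (string keys, ranges identified by list index); `RangeDirectory` is a hypothetical name, not a CockroachDB API.

```python
import bisect


class RangeDirectory:
    """Maps a key to the range (and thus Raft group) that owns it."""

    def __init__(self, split_keys: list[str]):
        # Range i covers [starts[i], starts[i+1]); the first range starts at "".
        self.starts = [""] + sorted(split_keys)

    def range_for(self, key: str) -> int:
        return bisect.bisect_right(self.starts, key) - 1

    def split(self, at_key: str) -> None:
        """Split the range containing at_key into two ranges at at_key."""
        if at_key not in self.starts:
            bisect.insort(self.starts, at_key)
```

Splitting a hot range spreads its keys over two Raft groups, but it does not help when all writes target a single key or a monotonically increasing key prefix.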
Key decisions
- CP over AP — strong consistency, sacrifice availability during partition (CAP, Stage 7.1)
- Range size — too small = Raft overhead; too large = hot spots
- Survival goals — zone vs region failure tolerance
Failure modes
- Hot range — one range gets disproportionate writes, single Raft group bottleneck
- Clock skew — HLC mitigates but extreme skew causes transaction retries
- Region partition — CP system may become unavailable for affected ranges
Interview follow-ups
- How does CockroachDB handle a node failure mid-transaction?
- Design a schema migration for a globally distributed table.
- Compare to Spanner's TrueTime approach (Stage 7.1).
12.7 Observability pipeline at scale (Loki + Prometheus)
Problem
- Collect logs and metrics from 10,000+ pods without overwhelming storage or query performance
Architecture (conceptual)
- Metrics — Prometheus per cluster → remote_write → Mimir/Thanos (Stage 6.1)
- Logs — Fluent Bit DaemonSet → Loki distributor → ingester → S3 chunks (Stage 6.3)
- Traces — OTel Collector → tail sampling → Jaeger/Tempo (Stage 6.2)
- Unified query — Grafana dashboards correlating metrics + logs + traces
- Cardinality control — drop high-cardinality labels at ingest, recording rules for aggregates
Key decisions
- Loki label design — index only labels (not log content), low-cardinality labels only
- Retention tiers — 15 days hot, 90 days warm in object storage, delete after
- Sampling — head sampling for traces (99% dropped), tail sampling for errors (Stage 6.2)
Failure modes
- Label cardinality explosion in Loki — same problem as Prometheus, different storage
- Remote write backpressure — Prometheus WAL grows, disk fills
- Log volume spike — one service emitting debug-level output at ERROR level floods pipeline
Interview follow-ups
- How do you debug a production issue when traces were sampled out?
- Design log retention that meets compliance without bankrupting storage budget (Stage 8.4).
- How do you correlate a metric spike to the exact log lines?
12.8 Cilium replacing kube-proxy
Problem
- kube-proxy iptables mode doesn't scale to thousands of Services; need faster datapath
Architecture (conceptual)
- Cilium agent (DaemonSet) — programs eBPF on each node (Stage 10.2)
- eBPF LB map — service IP → backend pod IP, O(1) lookup, no iptables chain walk
- Identity-based policy — numeric security identity from labels, not IP (Stage 10.2)
- Hubble — flow-level observability from eBPF, no sidecar needed
- `--kube-proxy-replacement=strict` — Cilium owns all service routing
Key decisions
- eBPF over iptables — performance at scale, but requires kernel 4.19+ and BTF
- DSR (Direct Server Return) — reply bypasses load balancer node, lower latency
- Identity vs IP policy — IPs change on pod restart; identity is stable
Failure modes
- eBPF map full — service/backend limit hit, new services fail to program
- Kernel upgrade breaks eBPF programs — CO-RE (BTF) mitigates (Stage 10.1)
- Policy misconfiguration — identity mismatch blocks legitimate traffic silently
Interview follow-ups
- Walk through packet path for ClusterIP Service with Cilium eBPF vs iptables.
- How does Cilium handle a pod IP change during rolling update?
- Compare Cilium LB to IPVS mode kube-proxy (Stage 4.1).
12.9 Zero-downtime database migration
Problem
- Migrate a 500GB PostgreSQL table (monolith DB) to a new schema, shard, or datastore with zero downtime and a rollback path
- Application must keep serving traffic throughout; old and new code versions run simultaneously during rolling deploys (Stage 9.4)
Architecture (conceptual)
- Phase 1 (expand) — add new column/table/index in old DB; deploy code that writes to both old and new paths (Stage 7.6)
- Phase 2 (backfill) — batch or streaming job copies historical data; throttle to protect prod DB performance
- Phase 3 (CDC) — Debezium/DMS streams ongoing changes from old DB to new store, keeping new store in sync (Stage 7.6)
- Phase 4 (dual-read validation) — compare row counts, checksums, sample queries between old and new
- Phase 5 (cutover) — shift read traffic to new store (percentage-based or instant); monitor error rate and SLO burn (Stage 6.4)
- Phase 6 (contract) — remove old column/table once all code reads from new path; decommission old store
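The dual-read validation phase can be sketched as a comparison of row checksums between the old and new stores. The example below assumes both sides can be read as `{primary_key: row_dict}` snapshots, which is only realistic for sampled batches on a 500GB table. The function names are hypothetical.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Digest of one row's column values, independent of column order."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()


def diff_stores(old_rows: dict, new_rows: dict) -> dict:
    """Compare two {primary_key: row} snapshots and report discrepancies."""
    missing = sorted(k for k in old_rows if k not in new_rows)
    extra = sorted(k for k in new_rows if k not in old_rows)
    mismatched = sorted(
        k for k in old_rows
        if k in new_rows and row_checksum(old_rows[k]) != row_checksum(new_rows[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

While CDC is still streaming, some mismatches are expected from replication lag, so a validation job would typically re-check flagged keys after a short delay before reporting them.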
Key decisions
- Expand-contract over big-bang — only safe pattern with rolling K8s deploys (Stage 9.4)
- Dual-write vs CDC-only — dual-write simpler but risk of inconsistency; CDC cleaner but adds pipeline complexity
- Cutover strategy — percentage traffic shift vs DNS flip vs feature flag per tenant
- How long to keep old system warm — rollback window vs cost of running dual systems
Failure modes
- Dual-write partial failure — one write succeeds, other fails; needs idempotency and reconciliation job (Stage 7.5)
- Backfill overload — unthrottled backfill saturates DB I/O, degrades live traffic
- Schema incompatibility — new code deployed before expand phase completes, old pods crash
- Cutover with replication lag — reads from new store return stale data, user-visible inconsistency
- Rollback after contract phase — schema rollback is hard; may require forward-fix instead
Interview follow-ups
- How do you verify data correctness before cutover?
- What if CDC pipeline falls 30 minutes behind during peak traffic?
- Design migration for a table with 10K writes/sec and foreign key constraints.
12.10 Autoscaling under a traffic spike
Problem
- Traffic increases 20× in 10 minutes (product launch, Black Friday, viral event)
- Platform must scale pods, nodes, and ingress without breaching SLOs or exhausting error budget (Stage 6.4)
Architecture (conceptual)
- Ingress / load balancer — cloud ALB/NLB or CDN absorbs initial burst (Stages 4.6, 10.3)
- HPA — scales pod replicas based on CPU, memory, or custom metrics (Stage 4.5)
- Cluster Autoscaler — adds nodes when pods are Pending due to insufficient resources (Stage 4.5)
- KEDA — event-driven scaling on queue lag or external metrics; scale-to-zero off-peak (Stage 4.5)
- Pre-warming — raise HPA `minReplicas` and pre-provision node pool before known events
- Observability — RED metrics on autoscaling loop itself: time-to-new-pod-ready, time-to-new-node, scheduling latency (Stage 6.6)
Key decisions
- HPA metric choice — CPU lags behind request rate; custom metrics (RPS, queue depth) react faster
- CA scale-up delay — new node takes 2–5 minutes; pre-warm node groups for predictable events
- PDB vs scale-down — CA respects PodDisruptionBudgets; may block scale-down, leaving costly idle nodes (Stage 4.2)
- Spot/preemptible nodes — cost savings vs interruption during spike; use for fault-tolerant workloads only (Stage 8.4)
- Max replicas cap — prevent runaway scaling from bug or DDoS; balance cost vs availability
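The HPA scaling decision and the max replicas cap can be made concrete with the formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured bounds. The Python function below is a sketch of that arithmetic only; it ignores stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 replicas at 20× the target load, the formula asks for 80 replicas in one step; the cap decides whether that is honoured, which is why the cap has to be sized for the spike rather than for steady state.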
Failure modes
- HPA lag — metrics-server delay + cooldown window; pods not ready before traffic overwhelms existing replicas
- CA can't scale — hit node group max, instance quota, or IP address exhaustion in subnet
- Thundering herd on new pods — all new pods cold-start simultaneously, DB connection pool exhausted (Stage 7.5)
- Ingress bottleneck — pods scaled but ingress/controller becomes the limit
- Flapping — scale-up then rapid scale-down as metrics spike and drop; tune stabilization windows (Stage 4.5)
Interview follow-ups
- How do you load-test autoscaling behavior before a launch?
- HPA vs KEDA for a Kafka consumer workload — which and why?
- Traffic drops after spike — how fast should you scale down without causing another outage?
12.11 Building a production Kubernetes operator
Problem
- Platform team needs a CRD (e.g., `Database`, `Application`, `Tenant`) with a controller that provisions and manages lifecycle automatically
- Must be reliable, idempotent, and operable at scale across many clusters
Architecture (conceptual)
- CRD definition — OpenAPI validation schema, status subresource, printer columns (Stage 4.3)
- controller-runtime — `Manager`, `Reconciler`, work queue, shared informer cache (Stage 4.3)
- Reconcile loop — compare spec (desired) vs observed state; create/update/delete child resources
- Webhooks — mutating (defaults) and validating (reject invalid specs) admission (Stage 4.3)
- Finalizers — pre-delete cleanup (e.g., snapshot DB before CR deletion); prevent stuck resources
- Observability — controller metrics (reconcile duration, errors, queue depth), structured logs, tracing (Stage 6)
Key decisions
- Idempotent reconcile — calling reconcile N times has same effect as once; use `CreateOrUpdate` pattern (Stage 4.3)
- Error handling — transient errors requeue with backoff; permanent errors update status condition
- Owner references — child resources garbage-collected when parent CR deleted (Stage 4.3)
- Leader election — only one active controller replica; others standby (Stage 4.3)
- Secondary resource watches — trigger reconcile when child Secret or Deployment changes
- Testing — envtest for unit tests, kind cluster for integration, contract tests on CRD schema
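Idempotent reconcile and requeue-with-backoff can be illustrated without any Kubernetes client. The sketch below models the cluster as a plain dictionary of child resources; it is a language-neutral illustration in Python, not the controller-runtime API (which is Go). The names `reconcile` and `backoff_seconds` and the default delays are assumptions.

```python
def reconcile(desired: dict, cluster: dict) -> list[str]:
    """Drive `cluster` (name -> spec) toward `desired`; return actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in cluster:
            cluster[name] = dict(spec)
            actions.append(f"create {name}")
        elif cluster[name] != spec:
            cluster[name] = dict(spec)
            actions.append(f"update {name}")
    for name in sorted(set(cluster) - set(desired)):
        del cluster[name]
        actions.append(f"delete {name}")
    return actions


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential requeue delay for transient errors: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

Running `reconcile` a second time with the same inputs returns an empty action list, which is the property that makes resyncs and duplicate events safe.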
Failure modes
- Reconcile storm — API server blip causes resync of all objects; rate-limit queue, use predicates (Stage 4.3)
- Stuck finalizer — external dependency unavailable, CR can't delete; manual finalizer removal as break-glass
- Status update conflict — concurrent reconcilers or user edits cause optimistic locking conflict
- Webhook failure — invalid object rejected but error opaque to user; clear validation messages critical
- Partial provision — DB created but Secret not written; status must reflect partial state accurately
Interview follow-ups
- Walk through reconcile for a `Database` CR: create → running → upgrade → delete.
- How do you handle a controller bug that corrupted 50 resources — rollback strategy?
- How do you version CRD schemas without breaking existing resources?
12.12 Vault as the org-wide secrets platform
Problem
- 2,000+ microservices need dynamic DB credentials, PKI certs, and API keys without Vault becoming a single point of failure or bottleneck
- Must integrate with Kubernetes, CI/CD, and cloud IAM across multiple clusters and accounts
Architecture (conceptual)
- Vault HA cluster — Raft integrated storage, 3+ nodes, active/standby with auto-failover (Stage 8.3)
- Auto-unseal — AWS KMS / GCP KMS; no manual unseal on restart (Stage 8.3)
- Auth methods — Kubernetes auth (pod SA token), AWS IAM auth, AppRole for CI, OIDC for GitHub Actions (Stage 8.3)
- Secret engines — Database (dynamic creds), PKI (internal CA), KV v2 (static secrets), Transit (encryption-as-a-service)
- Vault Agent Injector — sidecar injected via pod annotation, renders secrets to file, auto-renews leases (Stage 8.3)
- External Secrets Operator — syncs Vault secrets to K8s Secret for GitOps compatibility (Stage 9.1)
Key decisions
- Agent sidecar vs ESO vs direct API — sidecar for app file-based secrets; ESO for GitOps; direct API for controllers
- Dynamic vs static secrets — dynamic DB creds auto-revoke on lease expiry; static secrets need rotation policy
- Namespace isolation — each team gets Vault policy scoped to their path; no cross-team secret access
- Performance standbys — standby nodes that serve reads for high read volume (Vault Enterprise feature); writes still go to active node (Stage 8.3)
- Break-glass — emergency root token procedure, audited, time-limited
Failure modes
- Vault sealed after restart — auto-unseal misconfigured, all secret retrieval fails across fleet
- Lease expiry without renewal — app crashes when DB cred expires; Agent must renew before TTL
- Rate limiting — thundering herd of pods restarting simultaneously overwhelms Vault auth endpoint
- Token leak — compromised SA token grants Vault access; short-lived tokens + narrow policies limit blast radius
- Raft quorum loss — 2 of 3 nodes down, no active node can be elected and Vault is unavailable until quorum is restored; multi-AZ placement critical
Interview follow-ups
- Vault is down for 10 minutes — what breaks, in what order?
- How do you rotate a database password for 500 services without restart?
- Design Vault topology for 5 K8s clusters across 2 cloud accounts.
12.13 FinOps: reducing a $2M/month K8s and cloud bill
Problem
- Platform spend growing 30% quarter-over-quarter; leadership demands ~40% reduction without SLO regression or team revolt
- Must identify waste, rightsize, and implement guardrails — not just cut capacity blindly
Architecture (conceptual)
- Cost visibility — Kubecost / CloudHealth / native cost explorer with mandatory tagging (team, env, service) (Stage 8.4)
- Compute — VPA recommendations, rightsizing requests/limits, spot/preemptible for fault-tolerant workloads (Stages 4.5, 8.4)
- Node efficiency — CA scale-down idle nodes, reduce max node group size, consolidate low-utilization clusters
- Storage — right-size PVCs, S3 lifecycle policies for logs/metrics/backups, delete orphaned volumes (Stage 8.4)
- Observability cost — reduce metrics cardinality, shorten retention, drop debug logs in prod (Stages 6.1, 8.4)
- Egress — CDN for static assets, PrivateLink for cross-service traffic, same-AZ preference (Stages 4.6, 8.4)
- Governance — Infracost in PRs, Kyverno policy blocking oversized instances, chargeback reports to teams (Stages 8.1, 8.4)
Key decisions
- What to cut first — idle resources, over-provisioned dev/staging, excessive retention; never cut prod headroom blindly
- Spot/preemptible adoption — start with stateless batch/CI workloads; keep on-demand for critical path (Stage 8.4)
- Chargeback vs showback — showback educates; chargeback creates accountability but needs accurate allocation
- Reserved instances / savings plans — commit for baseline load only; keep burst on-demand
- Unit economics — cost per request, per tenant, per GB ingested; track over time to prove savings didn't hurt reliability
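Unit economics and proportional chargeback both reduce to simple arithmetic, sketched below. The usage numbers in the note are made up for illustration, and the function names are hypothetical.

```python
def cost_per_unit(total_cost: float, units: float) -> float:
    """Cost per request, per tenant, or per GB, depending on what `units` counts."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def allocate_shared_cost(shared_cost: float,
                         usage_by_team: dict[str, float]) -> dict[str, float]:
    """Split a shared bill in proportion to each team's measured usage."""
    total = sum(usage_by_team.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {team: shared_cost * used / total for team, used in usage_by_team.items()}
```

For example, a $2M monthly bill over a hypothetical 4 billion requests gives `cost_per_unit(2_000_000, 4_000_000_000)`, or $0.0005 per request. Tracking that number alongside incident rate over time is one way to show a cost cut did not hurt reliability.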
Failure modes
- Aggressive rightsizing causes OOM kills during traffic spike — under-provisioned after cutting limits
- Spot interruption during peak — no on-demand fallback, SLO breach
- Retention cut too short — can't debug incident from last week; false economy
- Tagging gaps — 30% of spend is "untagged," can't allocate or optimize
- Team workaround — devs spin up resources outside platform to avoid chargeback, creating shadow IT
Interview follow-ups
- Show me your prioritization: what do you cut first, second, never?
- How do you prove a 40% cost cut didn't increase incident rate?
- Design chargeback model for a shared multi-tenant K8s cluster (Stage 12.3).
12.14 Chaos game day on a production-like environment
Problem
- Platform team needs confidence that the system survives realistic failures before they happen in production
- Run controlled experiments in a prod-like staging environment without customer impact
Architecture (conceptual)
- Environment — full prod parity: same K8s version, same operators, same observability stack, synthetic load at ~50% prod traffic (Stage 9.3)
- Chaos tools — Litmus Chaos, Chaos Mesh, or Gremlin; inject faults as K8s CRs or API calls (Stage 6.5)
- Experiment design — hypothesize steady state (SLOs hold), define blast radius, set abort conditions
- Fault types — pod kill, node drain, network partition (NetworkPolicy drop), AZ failure simulation, DNS failure, latency injection
- Observability during experiment — pre-built dashboards for SLO burn, error rate, latency; on-call team observes but doesn't intervene unless abort threshold hit (Stage 6)
- Post-experiment — blameless review, gap analysis, action items (runbook updates, new alerts, code fixes) (Stage 6.5)
Key decisions
- Prod-like vs prod — never inject chaos in prod without mature practice; staging with realistic load is the starting point
- Steady-state hypothesis — "p99 latency stays under 500ms during single pod kill" — must be measurable before starting
- Blast radius — one namespace/team at a time; don't kill all etcd members simultaneously
- Abort conditions — auto-abort if error rate exceeds 5% or SLO burn rate hits 10× (Stage 6.4)
- Frequency — quarterly game days for platform; smaller automated chaos in CI for individual services
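The abort conditions above can be expressed as a small check. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 10 means the budget is being spent ten times faster than planned. The function names and the way the two thresholds are combined are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_abort(error_ratio: float, slo_target: float,
                 max_error_ratio: float = 0.05,
                 max_burn_rate: float = 10.0) -> bool:
    """Abort when error rate exceeds 5% or burn rate reaches 10x."""
    return (error_ratio > max_error_ratio
            or burn_rate(error_ratio, slo_target) >= max_burn_rate)
```

With a 99.9% SLO, a 2% error ratio is already a burn rate of roughly 20, so the burn-rate condition usually trips well before the raw 5% error threshold.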
Failure modes
- Experiment exceeds blast radius — network partition CR affects wrong namespace, staging outage
- No abort condition — experiment runs too long, staging unusable for other teams for hours
- False confidence — staging lacks prod traffic patterns; passes game day but fails in prod
- Missing observability — can't tell if hypothesis passed or failed; experiment is worthless
- Action items not tracked — same failure found in 3 game days, never fixed
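The first two failure modes (an experiment escaping its blast radius, or running with no abort condition) can often be caught by a pre-flight check before any fault is injected. The sketch below assumes a hypothetical `experiment` dictionary with `namespaces` and `duration_s` keys; it does not match the CR schema of Litmus Chaos, Chaos Mesh, or Gremlin, and the function name is illustrative.

```python
def blast_radius_violations(experiment: dict, allowed_namespaces: set[str],
                            max_duration_s: int) -> list[str]:
    """Return human-readable problems; an empty list means safe to run."""
    problems = []

    targets = experiment.get("namespaces", [])
    if not targets:
        # An empty selector is treated as a violation, not as "match nothing".
        problems.append("no target namespaces listed")
    for ns in targets:
        if ns not in allowed_namespaces:
            problems.append(f"namespace {ns!r} is outside the allowed blast radius")

    duration = experiment.get("duration_s")
    if duration is None:
        problems.append("no duration set, so there is no time-based abort")
    elif duration > max_duration_s:
        problems.append(f"duration {duration}s exceeds limit {max_duration_s}s")

    return problems
```

Running this in the pipeline that submits experiments, and refusing to submit when the list is non-empty, keeps the check independent of whichever chaos tool is in use.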
Interview follow-ups
- Design a game day for "single AZ becomes unavailable."
- How is chaos different from load testing (Stage 6.6)?
- When would you allow chaos experiments in production (e.g., Netflix approach)?
12.15 Terraform/IaC at scale (monorepo, drift, blast radius)
Problem
- 500+ resources across 20 environments, 50 engineers contributing; a bad terraform apply can take down production
- State files grow large, modules proliferate, drift accumulates, and nobody knows what's actually deployed
Architecture (conceptual)
- Repo structure — monorepo with modules/ (reusable) and environments/ (dev/staging/prod overlays) or polyrepo per team (Stage 8.1)
- Remote state — S3 + DynamoDB locking; separate state file per environment; never share state across envs (Stage 8.1)
- CI pipeline — terraform plan on every PR, mandatory review for prod applies, terraform apply only from CI (Stage 9.2)
- Module registry — versioned modules (?ref=v1.2.3), semver, changelog; consumers pin versions (Stage 8.1)
- Drift detection — scheduled terraform plan in CI; alert on non-zero diff; investigate manual console changes (Stage 8.1)
- Policy as code — OPA/Sentinel/Checkov scan plans before apply; deny public S3, unencrypted volumes, missing tags (Stages 5.5, 8.4)
- Terragrunt — DRY backend config, dependency ordering between stacks (Stage 8.1)
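The scheduled drift check in the list above can be sketched with the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there is no diff, 1 on error, and 2 when the plan succeeded with a non-empty diff. The function names and the returned labels are assumptions for illustration; wiring the result into an alerting system is left out.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit status to a label."""
    if code == 0:
        return "in-sync"   # plan succeeded, empty diff
    if code == 2:
        return "drift"     # plan succeeded, non-empty diff
    return "error"         # 1 (or anything unexpected) means the plan failed


def check_drift(workdir: str) -> str:
    # Runs a read-only plan in the given stack directory.
    # Assumes `terraform init` has already been run there.
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Treating "error" separately from "drift" matters in practice: a failed plan (expired credentials, stuck lock) should page differently from a real diff.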
Key decisions
- Monorepo vs polyrepo — monorepo: visibility and consistency; polyrepo: team autonomy and blast radius isolation
- State granularity — one state per environment vs per service; smaller state = faster plan but more coordination
- Module boundaries — too granular = versioning overhead; too coarse = tight coupling and wide blast radius
- -target applies — escape hatch for emergencies; dangerous at scale, audit every use
- Import vs recreate — bringing existing infra under TF management without downtime requires careful terraform import (Stage 8.1)
Failure modes
- State lock stuck — crashed CI job holds DynamoDB lock; blocks all applies until manual force-unlock
- Module breaking change — v2 module removes attribute, terraform apply destroys and recreates production RDS
- Drift undetected for months — someone changed security group in console; next apply reverts it, breaks traffic
- Giant state file — plan takes 15 minutes, CI timeout, teams skip plan review
- Provider bug — provider v5 changes resource behavior, silent replacement of critical infrastructure
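Silent destroy-and-recreate of critical resources, as in the module and provider failure modes above, can be caught in CI by inspecting the plan before apply. The sketch below reads the JSON produced by `terraform show -json <planfile>` and flags any protected resource whose planned actions include `delete` (a replacement appears as a delete paired with a create). The set of protected types and the function name are assumptions; real setups typically express this rule in OPA, Sentinel, or Checkov.

```python
import json

# Assumption: resource types this team treats as never-destroy-in-CI.
PROTECTED_TYPES = frozenset({"aws_db_instance", "aws_rds_cluster"})


def destructive_changes(plan_json: str,
                        protected_types: frozenset = PROTECTED_TYPES) -> list[str]:
    """Return addresses of protected resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in protected_types:
            flagged.append(rc["address"])
    return flagged


if __name__ == "__main__":
    sample = json.dumps({"resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"actions": ["update"]}},
    ]})
    print(destructive_changes(sample))
```

A CI job would fail the pipeline when the returned list is non-empty and require an explicit override for intentional replacements.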
Interview follow-ups
- How do you structure Terraform modules for 50 teams with different needs?
- State file is 500MB and plans take 20 minutes — what do you do?
- Engineer runs terraform apply locally against prod — how do you prevent this?