kubeboiii

Posted on May 29

Infra Platform Engineering

#devops #infrastructure #linux #systems

Stage 1 — Linux & OS Internals

Every container, every Kubernetes component, every performance issue traces back to Linux. Start here.

Processes & threads

How fork() and exec() work — process creation lifecycle
Process states: running, sleeping (interruptible vs uninterruptible), zombie, stopped
What a context switch is and why it has a cost
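
The fork/exec lifecycle above can be tried from Python, whose os module wraps the same system calls. This is a minimal sketch for Linux or another POSIX system; spawn_and_wait is a hypothetical helper name, not a standard API.

```python
import os
import sys

def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()  # duplicate the calling process
    if pid == 0:
        # Child: replace this process image with the new program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            pass
        # Only reached if exec failed; exit without running parent cleanup.
        os._exit(127)
    # Parent: block until the child terminates. waitpid also reaps the
    # child, so no zombie entry is left in the process table.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # killed by a signal

if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "raise SystemExit(7)"])
    print("child exited with", code)
```

If the parent never calls waitpid, the finished child stays in the zombie state until the parent exits.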

Namespaces (this is what containers ARE)

pid namespace — isolated process trees, PID 1 inside a container
net namespace — isolated network stack (interfaces, routes, iptables)
mnt namespace — isolated mount points and filesystem view
uts namespace — isolated hostname and domain name
ipc namespace — isolated System V IPC, POSIX message queues
user namespace — UID/GID remapping, unprivileged containers
cgroup namespace — isolated cgroup root view
How to experiment: unshare, nsenter, lsns commands
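
Alongside lsns, the namespaces a process belongs to are visible as symlinks under /proc/<pid>/ns. The sketch below reads them; list_namespaces is a hypothetical helper, and it returns an empty dict on systems without a Linux-style /proc.

```python
import os

def list_namespaces(pid="self"):
    """Map namespace type to its identifier, e.g. 'pid' -> 'pid:[4026531836]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):  # not Linux, or /proc is not mounted
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or permission denied
    return result

if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Two processes share a namespace when the identifiers match, which is a quick way to check whether a process runs in the host network or PID namespace.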

Control Groups / cgroups

cgroups v1 vs v2 — why v2 (unified hierarchy) matters for containers
CPU controller: cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us (v1); cpu.weight, cpu.max (v2)
Memory controller: hard limit, soft limit, swap limit, memory.stat
How Kubernetes maps resource requests and limits to cgroup settings
The OOM killer — how it scores processes, why your container gets killed, /proc/<pid>/oom_score_adj
pids controller — preventing fork bombs in containers
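
To make the request/limit mapping concrete, here is a sketch of the arithmetic for the cgroup v1 CPU files. The constants (1024 shares per CPU, a floor of 2 shares, a 100 ms period, a 1 ms quota floor) reflect my understanding of the kubelet defaults and should be treated as assumptions; the function names are hypothetical.

```python
SHARES_PER_CPU = 1024        # cpu.shares granted per full CPU (assumed)
MIN_SHARES = 2               # smallest cpu.shares value used (assumed)
DEFAULT_PERIOD_US = 100_000  # cpu.cfs_period_us default, 100 ms
MIN_QUOTA_US = 1_000         # floor for very small limits (assumed)

def cpu_shares(request_millicpu):
    """CPU request in millicores -> relative weight under contention."""
    if request_millicpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, request_millicpu * SHARES_PER_CPU // 1000)

def cfs_quota_us(limit_millicpu, period_us=DEFAULT_PERIOD_US):
    """CPU limit in millicores -> runnable microseconds per period."""
    if limit_millicpu <= 0:
        return None  # no limit set, so no quota is enforced
    return max(MIN_QUOTA_US, limit_millicpu * period_us // 1000)

if __name__ == "__main__":
    print(cpu_shares(500), cfs_quota_us(500))
```

A request only matters when CPUs are contended (it is a weight), while a limit throttles the container every period even on an idle node.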

Memory management

Page cache — how the kernel caches disk reads in RAM, impact on free output
Memory metrics: RSS vs VSZ vs PSS vs USS — why RSS is misleading in containers

File systems

Inodes — what they store, inode exhaustion problem
Bind mounts — how Kubernetes volume mounts work under the hood
/proc and /sys — virtual filesystems that expose kernel state

Signals and IPC

SIGTERM vs SIGKILL — why your app must handle SIGTERM for graceful shutdown
SIGCHLD — zombie process prevention, proper child reaping (PID 1 problem in containers)
Why PID 1 in a container needs to reap children — tini and dumb-init
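
A minimal sketch of SIGTERM handling, assuming a single-process worker with a polling main loop. The serve function and its return value are hypothetical; the point is that the handler only records the request and the main loop does the cleanup.

```python
import os
import signal
import threading

shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Keep the handler tiny: just record that shutdown was requested.
    shutdown_requested.set()

def install_handlers():
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)

def serve(poll_interval=0.1):
    """Hypothetical main loop: work until SIGTERM arrives, then clean up."""
    while not shutdown_requested.wait(poll_interval):
        pass  # handle one unit of work here
    # Flush buffers and close connections here, then let the process exit.
    return "drained"

if __name__ == "__main__":
    install_handlers()
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the signal sent on pod termination
    print(serve())
```

SIGKILL cannot be caught, so whatever cleanup has not finished when the termination grace period ends is simply lost.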

System call tracing and performance

strace -p <pid> — trace syscalls of a running process
/proc/<pid>/ — maps, status, fd, net — per-process kernel state
ss and netstat — socket state inspection
lsof — open file descriptors per process

Stage 2 — Networking fundamentals

You cannot work in Kubernetes, Cilium, Cloudflare, or Fastly without deeply understanding networking.

TCP/IP stack

IP addressing, subnets, CIDR notation, route tables
TCP handshake (SYN, SYN-ACK, ACK), teardown (FIN, TIME_WAIT)
TIME_WAIT storms — what causes them, why they matter at scale

DNS

How DNS resolution works end-to-end — recursive resolver, authoritative server
DNS record types: A, AAAA, CNAME, MX, TXT, PTR, SRV
TTL — caching, negative caching, TTL trade-offs
ndots setting in Linux — how it affects resolution order (critical for Kubernetes)
CoreDNS — how Kubernetes uses it, common misconfigurations, DNS debugging

Linux networking internals

Network interfaces — physical, virtual (veth), bridge, loopback, dummy
veth pairs — how they work, why they are used for container networking
Linux bridge — how it connects veth pairs (like a virtual switch)
conntrack — connection tracking table, how NAT works, conntrack -L

TLS and certificates

TLS handshake — client hello, server hello, certificate exchange, key exchange
Certificate chain — root CA, intermediate CA, leaf certificate
mTLS — mutual authentication, both sides present certificates (used in service meshes)
Certificate management — cert-manager in Kubernetes, Let's Encrypt, ACME protocol

Stage 3 — Go (Golang)

Go is the dominant language of the cloud-native ecosystem. Kubernetes, Prometheus, Terraform, ArgoCD, Cilium, Vault — all written largely in Go.

Language basics

Packages, modules (go.mod, go.sum), workspace mode
Basic types, structs, interfaces, methods, pointers
Error handling — error interface, errors.Is(), errors.As(), wrapping errors with %w
Defer, panic, recover — use cases and pitfalls

Interfaces and composition

Implicit interface satisfaction — no implements keyword
Embedding structs and interfaces
The io.Reader / io.Writer / io.Closer interface family
context.Context — cancellation, deadlines, value propagation — used everywhere in infra code

Goroutines and concurrency

Goroutines — lightweight threads managed by the Go runtime
Channels — unbuffered vs buffered, direction, closing
select statement — multiplexing channel operations
Race detector — go run -race, go test -race
Common concurrency mistakes: goroutine leaks, channel deadlocks

Memory and performance

Stack vs heap allocation — escape analysis (go build -gcflags="-m")

Standard library for infra work

net/http — building HTTP servers and clients, middleware pattern
os/exec — running subprocesses safely
flag and os.Args — CLI argument parsing
time — duration arithmetic, ticker, timer

CLI tools and infra tooling patterns

Config file loading — layered config (flags > env vars > config file > defaults)
Writing a simple HTTP server with graceful shutdown on SIGTERM
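
The layered-config precedence (flags > env vars > config file > defaults) is language-independent. Here is a sketch in Python using collections.ChainMap; the keys and default values are made up for illustration.

```python
from collections import ChainMap

# Illustrative defaults, not taken from any real tool.
DEFAULTS = {"port": "8080", "log_level": "info", "timeout": "30s"}

def load_config(flags, env, file_values, defaults=DEFAULTS):
    """Merge layers so that flags beat env vars beat file values beat defaults."""
    # Drop unset (None) entries so they do not shadow lower layers.
    layers = [
        {k: v for k, v in layer.items() if v is not None}
        for layer in (flags, env, file_values)
    ]
    # ChainMap looks keys up left to right, so earlier maps win.
    return dict(ChainMap(*layers, defaults))

if __name__ == "__main__":
    print(load_config({"port": "9090"}, {"log_level": "debug"}, {"timeout": "5s"}))
```

Treating an unset flag as "no opinion" rather than as an empty value is what keeps lower layers visible.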

Stage 4 — Kubernetes (the most important stage)

Most companies on your list either build on K8s, build for K8s, or expect you to operate it at scale.

4.1 Architecture and control plane

API server

Central hub — all components communicate through the API server
REST API — resource types, verbs (get/list/watch/create/update/patch/delete)
Authentication — service account tokens (JWT), kubeconfig, OIDC, certificates
Admission control chain — mutating admission webhooks run first, then validating
etcd watch — how the API server streams changes to controllers

etcd

Raft consensus — leader election, log replication, quorum (why 3 or 5 nodes)
Key-value watch API — how controllers get notified of changes
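
The "why 3 or 5 nodes" question comes down to majority arithmetic. A short sketch (the function names are my own):

```python
def quorum(n):
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1

def failures_tolerated(n):
    """Members that can fail while a majority is still reachable."""
    return n - quorum(n)

if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

An even member count raises the quorum without raising the number of tolerated failures: 4 members tolerate one failure, the same as 3.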

Scheduler

Scheduling cycle: filtering (predicates) → scoring (priorities) → binding
Predicates: NodeSelector, NodeAffinity, PodAffinity, Taints/Tolerations, resource fit

Controller manager

Informer pattern — List + Watch, local cache, event handlers
Reconcile loop — compare desired state (spec) with actual state, take action to converge
Key controllers: Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Endpoints
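
The reconcile loop can be reduced to a pure function from (desired, actual) to a list of actions. This is a toy sketch in the spirit of a ReplicaSet controller, not the real implementation; the action tuples and the choice of which pods to delete are assumptions.

```python
def reconcile(desired_replicas, actual_pods):
    """Return the actions needed to converge actual state onto desired state."""
    diff = desired_replicas - len(actual_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Delete the surplus; here simply the last pods in name order.
        surplus = sorted(actual_pods)[diff:]
        return [("delete", name) for name in surplus]
    return []  # already converged, nothing to do

if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))
    print(reconcile(1, ["web-a", "web-b", "web-c"]))
```

The function looks only at current state and never at the event that triggered it, so running it again after a crash or a missed event is safe.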

kubelet

Watches the API server for pods assigned to its node
CRI (Container Runtime Interface) — how kubelet talks to containerd or CRI-O
Pod lifecycle: pending → pulling image → creating container → running → terminating
Liveness vs readiness vs startup probes — how they work, when each probe type fails
Eviction — memory pressure, disk pressure, node conditions

kube-proxy

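iptables mode — creates rules in the KUBE-SERVICES chain for every Service; these jump to per-Service (KUBE-SVC-*) and per-endpoint (KUBE-SEP-*) chains, where the DNAT to a pod IP happens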
How ClusterIP services work — virtual IP that only exists in iptables/IPVS rules

4.2 Workloads and objects

Core objects you must know cold

Pod — smallest deployable unit, spec fields, container lifecycle hooks, init containers
Deployment — rolling update strategy, maxSurge, maxUnavailable, rollback
StatefulSet — stable network identity, ordered deployment, persistent volume claims
DaemonSet — one pod per node, use cases (log shippers, monitoring agents, CNI plugins)
Job and CronJob — completions, parallelism, failure handling, cron schedule format

Networking objects

Endpoints and EndpointSlices — how services know which pods to route to
Ingress — host/path-based routing, TLS termination, ingress controllers (Nginx, Traefik)
Network Policy — ingress/egress rules, podSelector, namespaceSelector, default deny

Storage objects

PersistentVolume (PV) and PersistentVolumeClaim (PVC) — static vs dynamic provisioning
StorageClass — provisioner, reclaim policy, volume binding mode
CSI (Container Storage Interface) — plugin interface for provisioning and attaching storage in Kubernetes

Resource management

requests vs limits — requests used for scheduling, limits enforced by cgroups
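QoS classes: Guaranteed (every container sets CPU and memory requests equal to limits), Burstable (at least one request or limit set, but not Guaranteed), BestEffort (no requests or limits on any container)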
LimitRange — default limits/requests for a namespace
ResourceQuota — total resource budget for a namespace
PodDisruptionBudget (PDB) — minimum available pods during voluntary disruptions
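
A sketch of how a pod's QoS class follows from its containers' requests and limits. The input shape (a list of dicts) is a simplification of the pod spec, and quantities are compared as plain strings, so "1" and "1000m" would be treated as different here.

```python
RESOURCES = ("cpu", "memory")

def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in RESOURCES:
            limit = limits.get(res)
            # An unset request defaults to the limit.
            request = requests.get(res, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

if __name__ == "__main__":
    print(qos_class([{"requests": {"cpu": "100m"}}]))
```

The class matters at eviction time: under node memory pressure, BestEffort pods are the first candidates and Guaranteed pods the last.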

4.3 Kubernetes operators (concepts — implementation detail in deep-dive file)

Operators extend Kubernetes with custom resources (CRDs) and controllers that reconcile desired vs actual state
CRD — custom API object stored in etcd; separates spec (desired) from status (observed)
Reconcile loop — compare spec to reality; create/update/delete until they match
Finalizers — block deletion until cleanup (e.g. snapshot before delete) completes
Full controller-runtime, webhooks, and operator patterns → see deep-dive file

4.4 Kubernetes networking

CNI (Container Network Interface)

IPAM (IP Address Management) — how pods get IPs

Pod-to-pod networking

Each pod gets its own network namespace
veth pair — one end in pod namespace, one end in host namespace
Linux bridge (cbr0 or similar) — connects all veth pairs on a node
How packets travel between pods on the same node vs different nodes
Overlay networks — VXLAN encapsulation for cross-node traffic

Services and kube-proxy

How ClusterIP works — DNS → ClusterIP → iptables DNAT → pod IP
NodePort — how traffic enters the cluster from outside

DNS in Kubernetes

CoreDNS deployment — Deployment with 2 replicas, kube-dns Service
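DNS search path — <ns>.svc.cluster.local, svc.cluster.local, cluster.local; lets <svc>, <svc>.<ns>, and <svc>.<ns>.svc.cluster.local all resolve
ndots:5 — names with fewer than 5 dots are tried against every search domain first, so an external name costs one failed lookup per search domain (typically 3 or more) before the absolute query succeeds (latency issue)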
Headless services — no ClusterIP, DNS returns pod IPs directly (used by StatefulSets)
DNS debugging — kubectl exec into a pod, use nslookup, dig, check CoreDNS logs
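
The interaction between ndots and the search path is easier to see as a list of candidate queries. The sketch below models the usual stub-resolver ordering; query_order is a hypothetical helper, and the search list assumes a pod in the default namespace.

```python
K8S_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

def query_order(name, search=K8S_SEARCH, ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try as-is first
    return expanded + [absolute]      # too few dots: search list first

if __name__ == "__main__":
    for candidate in query_order("api.example.com"):
        print(candidate)
```

Writing external names with a trailing dot, or lowering ndots in the pod's dnsConfig, avoids the wasted lookups.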

4.5 Autoscaling and resource optimization

Horizontal Pod Autoscaler (HPA)

Metrics server — provides CPU/memory metrics from kubelet
HPA control loop — target metric value, current metric value, desired replicas formula
Custom and external metrics — KEDA for event-driven scaling
Stabilization window — prevents flapping (scale-down slower than scale-up)
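
The desired-replicas formula is small enough to write out. This sketch follows the documented form, ceil(currentReplicas × currentMetric / targetMetric), with a 10% tolerance band that I believe matches the default; min/max replica clamping and the stabilization window are left out.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA calculation, without min/max clamping or stabilization."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    return math.ceil(current_replicas * current_metric / target_metric)

if __name__ == "__main__":
    # 4 replicas at 90% average CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Because the result is rounded up, the HPA errs toward having slightly too many replicas rather than too few.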

Cluster Autoscaler (CA)

Scale-up trigger — unschedulable pods (Pending state)
Scale-down trigger — underutilized nodes for 10 minutes (default)
Node groups — CA works with cloud provider node groups (ASGs in AWS)
CA and PDBs — CA respects PodDisruptionBudgets during scale-down
Safe-to-evict annotation — cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

KEDA (Kubernetes Event-Driven Autoscaling)

ScaledObject CRD — links a workload to a scaler
Built-in scalers — Kafka consumer lag, queue depth, Prometheus metrics, cron
Scale to zero — KEDA can scale deployments down to 0 (HPA on its own cannot by default)

4.6 Cloud-native Kubernetes

Platform teams operate K8s on top of cloud infrastructure. You need to understand what the cloud layer provides.

Managed control planes (EKS / GKE / AKS)

Who owns etcd — the cloud provider manages the control plane; you manage worker nodes and workloads
API server endpoint — public vs private endpoint, implications for CI/CD and developer access

VPC and networking

Public vs private subnets — worker nodes typically in private subnets, NAT gateway for egress
Pod CIDR vs node subnet vs service CIDR — three separate address spaces that must not overlap
Cloud load balancers — ALB/NLB/GCLB mapping to LoadBalancer Service type
externalTrafficPolicy: Local vs Cluster — source IP preservation and health check trade-offs on cloud LBs
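
Checking that the node, pod, and service ranges do not overlap is a one-liner with Python's ipaddress module. The CIDRs below are illustrative values, not recommendations.

```python
import ipaddress
from itertools import combinations

def find_overlaps(ranges):
    """ranges: name -> CIDR string. Returns the pairs of names that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]

if __name__ == "__main__":
    plan = {"nodes": "10.0.0.0/16", "pods": "10.244.0.0/16", "services": "10.96.0.0/12"}
    print(find_overlaps(plan) or "no overlaps")
```

The same check is worth running against peered VPCs and on-premises ranges before a cluster is created, since these CIDRs are hard to change afterwards.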

Cloud IAM and workload identity

IAM roles, policies, trust relationships — who can assume what, least-privilege policy design
AWS IRSA — OIDC provider on cluster, annotated ServiceAccount, projected token → STS AssumeRoleWithWebIdentity
GCP Workload Identity — Kubernetes SA bound to GCP SA, no long-lived keys on nodes

Managed services vs in-cluster

When to use RDS/Aurora vs self-hosted Postgres in K8s — ops burden, HA, backups, patching
ElastiCache/Memorystore vs Redis Cluster in K8s — same trade-off for caching
Object storage (S3/GCS) — Loki/Thanos blocks, Terraform state, CI artifacts, backup targets

Cloud DNS and certificates

ACM / Google-managed certs — integration with cloud load balancers and Ingress

4.7 Service mesh and gateways (awareness — detail in deep-dive file)

North-south — traffic from outside the cluster (ingress, TLS termination)
East-west — service-to-service traffic inside the cluster
Service mesh — sidecar proxies add mTLS, traffic splitting, and observability between services
Default-deny NetworkPolicy — baseline for multi-tenant clusters; explicitly allow required paths
Envoy, Istio, Gateway API, API gateways → see deep-dive file

Stage 5 — Container Security

Critical for Aqua Security, Snyk, Chainguard. Also tested at GitLab, Harness, Datadog.

5.1 Container image security

Image layers and attack surface

How Docker image layers work — each filesystem-changing instruction (RUN, COPY, ADD) creates a layer
Base image choice — Alpine vs Debian vs distroless vs scratch
Distroless images — no shell, no package manager, minimal attack surface
Multi-stage builds — only copy the binary into the final stage, discard build tools

Vulnerability scanning

Static scanning tools — Trivy, Grype, Snyk Container, Clair
What scanners check — OS packages, language dependencies, Dockerfile misconfigs
CVE prioritization — severity (CVSS score), exploitability, reachability
Base image updates — automated PRs to update base images (Renovate, Dependabot)
Scanning in CI — fail the pipeline on critical/high CVEs, policy as code

Software Bill of Materials (SBOM)

What an SBOM is — list of all components in a software artifact
Generating SBOMs — Syft, docker sbom; attach to the image as a signed attestation with cosign attest

5.2 Supply chain security

The problem

SolarWinds attack — build system compromise, malicious code injected into signed artifacts
Log4Shell — transitive dependency vulnerability, hard to find without SBOMs
XZ Utils backdoor — malicious maintainer, social engineering, compromised source
The threat model — compromised build system, malicious dependency, typosquatting

SLSA framework (Supply chain Levels for Software Artifacts)

SLSA Level 1 — provenance document exists
Provenance — who built the artifact, from what source, on what system, with what inputs

Sigstore stack

Cosign — signs container images and other OCI artifacts
Keyless signing — short-lived certificate from Fulcio CA, no long-lived private keys

5.3 Kubernetes RBAC and access control

RBAC model

Role (namespace-scoped) vs ClusterRole (cluster-scoped)
RoleBinding vs ClusterRoleBinding
Subjects: ServiceAccount, User, Group

Least privilege patterns

Never use cluster-admin for application workloads
Namespace-scoped service accounts for every workload
Projected service account tokens — short-lived, audience-bound, auto-rotated

Pod security

Pod Security Standards — Privileged, Baseline, Restricted profiles
Pod Security Admission controller — enforces standards at namespace level
Security context — runAsNonRoot, runAsUser, readOnlyRootFilesystem, allowPrivilegeEscalation: false

5.5 Cloud security (essentials — detail in deep-dive file)

IAM — least-privilege roles and policies; no long-lived keys on nodes or in CI
Encryption — at rest (disks, S3) and in transit (TLS)
Audit logs — CloudTrail / cloud audit logs for who changed what
Permission boundaries, WAF, GuardDuty, compliance frameworks → deep-dive file

Stage 6 — Observability

Core product domain for Datadog, Grafana Labs, New Relic, Splunk.

6.1 Metrics and Prometheus

Prometheus data model

Time series — metric name + label set + sequence of (timestamp, float64) samples
Label cardinality — why high-cardinality labels (user_id, request_id) cause OOM
Metric types:
- Counter — monotonically increasing (requests total, errors total)
- Gauge — can go up and down (memory usage, queue depth, temperature)
- Histogram — distribution of values in configurable buckets (request duration, response size)
- Summary — pre-calculated quantiles on client side (avoid if possible — not aggregatable)

PromQL

Instant vector vs range vector — http_requests_total vs http_requests_total[5m]
rate() — per-second rate of a counter over a range (use for counters, not gauges)
increase() — total increase in a counter over a range
sum by(), avg by(), max by() — aggregation operators, label dropping
histogram_quantile() — calculate p50/p95/p99 from histogram buckets
Alerting rules — for duration, labels, annotations, Alertmanager integration
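
To see what histogram_quantile() does with buckets, here is a simplified re-implementation for classic cumulative buckets. It is a sketch of the linear-interpolation idea only; Prometheus itself handles more edge cases, and the bucket values in the demo are invented.

```python
import math

def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0) if i == 0 else buckets[i - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

if __name__ == "__main__":
    demo = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, demo))
```

The result is only as precise as the bucket layout, which is why bucket boundaries should sit near the latency thresholds you care about.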

6.2 Distributed tracing

Tracing concepts

Trace — end-to-end record of a request through a distributed system
Span — a single unit of work within a trace (one service call, one DB query)
Parent-child span relationship — forms a tree structure (the trace)
Trace context propagation — W3C traceparent header, B3 headers

OpenTelemetry (OTel)

Exporters — OTLP (preferred), Jaeger, Zipkin, Prometheus
OTel Collector — receives spans/metrics/logs, processes them, exports to backends

Sampling strategies

Head sampling — decision made at trace start (random %, rate limiting); cannot key on errors or latency, which are not yet known
Tail sampling — decision made after seeing the full trace (can sample based on error, latency)
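
The difference between the two strategies is what information the decision can use. In this sketch, the span dict shape and the 500 ms threshold are assumptions; the head sampler is modelled on trace-ID ratio sampling but is not any SDK's exact algorithm.

```python
def head_sample(trace_id_hex, ratio):
    """Decide at trace start, using only the trace ID (same answer in every service)."""
    # Treat the low 64 bits of the ID as a uniformly distributed number.
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * 2**64

def tail_sample(spans, latency_threshold_ms=500):
    """Decide after the trace has completed, with every span available."""
    return any(s["error"] or s["duration_ms"] > latency_threshold_ms for s in spans)

if __name__ == "__main__":
    print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.25))
    print(tail_sample([{"error": False, "duration_ms": 20},
                       {"error": True, "duration_ms": 5}]))
```

Tail sampling needs all spans of a trace buffered in one place, usually a collector tier, which is its main operational cost.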

6.3 Logging

Log shipping pipeline

Log sources — container stdout/stderr (collected by node agent), application log files
DaemonSet log agents — Fluent Bit (lightweight), Fluentd (more plugins), Vector (Rust-based)
Structured logging — JSON logs, consistent field names, log levels

Grafana Loki

Loki's key design decision — indexes only labels (like Prometheus), not log content
Why this matters — much cheaper to store and index than Elasticsearch-style full-text index
Log streams — a stream is a set of logs with the same label set (like a Prometheus time series)
LogQL — log query language, filter expressions {app="nginx"} |= "error", metric queries

6.4 SLOs and alerting

SLI/SLO/SLA

SLI (Service Level Indicator) — the metric you measure (e.g., error rate, latency p99)
SLO (Service Level Objective) — the target (e.g., 99.9% of requests under 200ms)
Error budget — time you can be non-compliant (0.1% of 30 days = 43.2 minutes)
Error budget burn rate — how fast you are consuming the error budget
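
The budget and burn-rate arithmetic, written out as a sketch (the function names are mine):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)

if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1), "minutes per 30 days")
    print(round(burn_rate(0.01, 0.999), 1), "x burn rate at 1% errors")
```

A burn rate of 1 uses up the budget exactly at the end of the window; a sustained burn rate of 10 uses it up in a tenth of the window, i.e. 3 days out of 30.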

Multi-window burn rate alerts

Alertmanager integration — Prometheus rules send alerts to Alertmanager
Short window (5 min) + long window (1 hour) — two-condition alert to reduce false positives
Routing trees — route alerts to correct team based on labels
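
The two-window condition can be sketched as a single predicate. The default threshold of 14.4 is an assumption taken from the common convention of paging when 2% of a 30-day budget is burned within one hour (0.02 × 720 h / 1 h); pick thresholds to match your own policy.

```python
def should_page(short_window_error_ratio, long_window_error_ratio, slo, threshold=14.4):
    """Fire only when both the 5-minute and the 1-hour burn rates exceed the threshold."""
    budget = 1 - slo
    long_burn = long_window_error_ratio / budget
    short_burn = short_window_error_ratio / budget
    # Long window: the problem is significant. Short window: it is still happening.
    return long_burn >= threshold and short_burn >= threshold

if __name__ == "__main__":
    print(should_page(0.02, 0.02, slo=0.999))   # ongoing incident
    print(should_page(0.0, 0.02, slo=0.999))    # already recovered
```

The short window makes the alert reset quickly once the errors stop, so an incident that has already been mitigated does not keep paging.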

6.5 SRE practices

Stage 6.4 covers SLO metrics and alerting. This section covers how platform/SRE teams operate.

Incident management

Severity levels (SEV1–SEV4) — customer impact, response time expectations
Incident commander role — coordinates response, comms, decision-making
Incident lifecycle — detect → triage → mitigate → resolve → postmortem
Status pages and stakeholder comms — internal vs external, update cadence
Runbooks — symptom-based (not cause-based), links to dashboards and remediation steps

On-call and alert quality

Alert design — page on symptoms (SLO burn, user-facing errors), not causes (CPU high)
On-call rotation — follow-the-sun, escalation policies, handoff rituals
Toil — repetitive manual work; measure and automate (platform team's core mandate)
Error budget policy — when budget is exhausted, freeze features, focus on reliability

Reliability engineering

Application-level patterns — timeouts, retries, circuit breakers, idempotency (Stage 7.5)
Capacity planning — headroom targets, load testing before launches, saturation metrics (USE method, Stage 6.6)
Failure domain isolation — blast radius, multi-AZ/region design (Stage 4.6)

Disaster recovery and resilience

Backup strategy beyond etcd — application data, cross-region replication, restore drills
Multi-AZ vs multi-region — zone failure tolerance vs region failure tolerance
Game days and chaos engineering — Litmus/Chaos Mesh: pod kill, network partition, AZ failure

Postmortems

Blameless culture — focus on systems and process, not individuals
Timeline, contributing factors (not root cause singular), action items with owners
Follow-through — track action items to completion, review in subsequent incidents

6.6 Performance engineering

Unifies performance concepts scattered across Stages 1, 3, 6, and 10 into a methodology.

Performance methodology

Define the goal first — latency vs throughput vs tail behavior vs cost
Measure before optimizing — establish baseline with load tests and production metrics
One change at a time — isolate variables; validate with before/after comparison

Throughput vs latency

Why higher throughput often worsens tail latency — queue buildup under saturation
Concurrency limits — connection pools, worker counts, HPA max replicas as backpressure levers
Backpressure — propagate slowness upstream instead of buffering indefinitely

Latency analysis and percentiles

p50 (median) vs p95 vs p99 vs p999 — why averages lie; tail latency drives user experience
Histogram buckets in Prometheus — choose bucket boundaries for your SLO thresholds (Stage 6.1)
Why Summary metrics are problematic — pre-computed quantiles on client side are not aggregatable
RED method — Rate, Errors, Duration (for request-driven services)
USE method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)
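
A small numeric illustration of why averages hide tail latency, using a nearest-rank percentile. The latency sample is invented: 98 fast requests and 2 slow ones.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented sample: 98 requests at 10 ms, 2 requests at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)

if __name__ == "__main__":
    print("mean", mean_ms)
    print("p50 ", percentile(latencies_ms, 50))
    print("p99 ", percentile(latencies_ms, 99))
```

The mean describes no request that actually happened, while p50 and p99 show the two populations directly.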

Finding bottlenecks

Layered diagnosis — app → pod (cgroup metrics) → node (vmstat, iostat) → network → control plane
Go-specific — pprof CPU/heap profiles, GC pause analysis, GOGC tuning (Stage 3)
Database — slow query logs, connection pool exhaustion, replication lag (Stage 7)

Load testing

Test types — smoke, load (steady state), stress (find breaking point), spike, soak (memory leaks)
What platform teams validate — HPA response time, CA scale-up latency, PDB behavior under drain, ingress capacity
Warm-up period — exclude from measurements; run long enough for GC and caches to stabilize
Production-like data volume and cardinality — load test observability pipeline too (Stage 6.1 cardinality)

Caching and batching

Cache hierarchy — CDN edge (Stage 10.3) → Redis (Stage 7.3) → application in-memory
Connection pooling — DB pools, HTTP keep-alive; file descriptor and cgroup limits (Stage 1)

Stage 7 — Distributed systems and databases

Critical for CockroachDB, YugabyteDB, PlanetScale, ScyllaDB, Snowflake, Redis.

7.1 Distributed systems theory

Fundamental problems

CAP theorem

Consistency — every read sees the most recent write
Availability — every request gets a response (not necessarily the most recent data)
Partition tolerance — system works despite network partitions
Real-world: CP systems (ZooKeeper, etcd, CockroachDB), AP systems (Cassandra, DynamoDB)
PACELC — extends CAP: if Partitioned (P), trade availability (A) against consistency (C); else (E), trade latency (L) against consistency (C)
More practical than CAP for comparing real databases

Consistency levels

Strong consistency / linearizability — operations appear instantaneous, globally ordered
Eventual consistency — replicas will converge eventually, reads may be stale

Consensus algorithms

Raft — designed for understandability, used in etcd, CockroachDB, TiKV
- Leader election — candidates request votes, majority wins, term numbers

Replication patterns

Single-leader — all writes go to leader, replicated to followers (PostgreSQL, MySQL)
Leaderless (Dynamo-style) — any node accepts writes, quorum reads/writes (Cassandra)
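
For leaderless replication, the quorum condition is that read and write sets must intersect. A sketch, with N replicas, W write acknowledgements, and R read responses:

```python
def quorum_overlaps(n, w, r):
    """True if every read set shares at least one replica with every write set."""
    return r + w > n

if __name__ == "__main__":
    for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1), (5, 3, 3)]:
        print(f"N={n} W={w} R={r} overlap={quorum_overlaps(n, w, r)}")
```

When the condition holds, at least one replica in any read has seen the latest acknowledged write; lowering R or W trades that guarantee for latency.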

Clocks in distributed systems

Physical clocks — NTP sync, still have drift, clock_gettime()

7.2 Distributed SQL (CockroachDB / YugabyteDB)

Architecture

YugabyteDB — similar model, supports PostgreSQL and Cassandra APIs, DocDB storage layer

Distributed transactions

MVCC (Multi-Version Concurrency Control) — every write creates a new version, readers see a consistent snapshot

Schema changes

Geo-distribution

Region/zone topology — replicas placed in different regions/zones

7.3 Redis internals

Data structures and their implementations

Persistence

RDB snapshot — BGSAVE forks the process, child writes snapshot using CoW, parent continues serving
AOF (Append-Only File) — logs every write command, fsync policies: always, everysec, no
Hybrid persistence — RDB + AOF combined, AOF replays only since last RDB snapshot
No persistence mode — pure cache, data loss on restart acceptable

Replication

REPLICAOF — replica connects to master, full sync (RDB transfer) then partial sync
Replica lag — INFO replication shows master_repl_offset vs replica offset

Redis Cluster

Eviction policies

noeviction — return error when maxmemory hit
allkeys-lru — evict any key using LRU approximation
volatile-lru — evict only keys with TTL set, using LRU
allkeys-lfu — evict least frequently used keys (better for skewed access patterns)
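
A toy model of allkeys-lru with maxmemory expressed as a key count. Note the assumption: this is exact LRU built on OrderedDict, whereas Redis approximates LRU by sampling a few keys, so it illustrates the policy and not the implementation.

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU over all keys, bounded by number of keys."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()  # oldest access first

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_keys:
            self._data.popitem(last=False)  # evict least recently used

if __name__ == "__main__":
    cache = LRUCache(max_keys=2)
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")          # touch "a", so "b" is now the oldest
    cache.set("c", 3)       # evicts "b"
    print(cache.get("b"), cache.get("a"), cache.get("c"))
```

Under noeviction the set call would fail instead of evicting, and under volatile-lru only keys that carry a TTL would be candidates.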

7.5 Backend patterns for platform engineers

Platform teams build controllers, webhooks, internal APIs, and golden-path services. These patterns apply.

API design and gRPC

REST vs gRPC — REST for human-facing/admin APIs; gRPC for high-performance internal service-to-service
Deadlines and cancellation — context.Context propagation, client-side timeouts (Stage 3)
API versioning — URL path vs header vs protobuf package; deprecation policy
Idempotent APIs — safe retries: PUT and DELETE are idempotent by definition; POST needs idempotency keys for create operations

PostgreSQL fundamentals

MVCC — multi-version concurrency control, snapshots, vacuum, bloat
Indexes — B-tree (default), partial indexes, covering indexes, when indexes hurt writes
Connection limits — max_connections, connection pooling (PgBouncer), pool sizing vs pod count
Replication — streaming replication, replication lag, synchronous vs asynchronous
Isolation levels — Read Committed (default), Repeatable Read, Serializable
Foundation for CockroachDB/Yugabyte (Stage 7.2) and PlanetScale/Vitess (Stage 11)

Message queues and event streaming

Kafka fundamentals — topics, partitions, consumer groups, offset commits, consumer lag
Delivery semantics — at-most-once, at-least-once, exactly-once (idempotent producers + transactions in Kafka; idempotent consumers elsewhere)
Dead-letter queues (DLQ) — poison messages, retry policies, manual inspection
When to use what — Kafka (high-throughput log), SQS/RabbitMQ (task queues), NATS (low-latency pub/sub)
KEDA integration — scale on Kafka consumer lag (Stage 4.5)

Reliability patterns in application code

Timeouts on every outbound call — HTTP clients, DB queries, gRPC deadlines
Retries with exponential backoff and jitter — max attempts, retry only on idempotent operations
Circuit breakers — open/half-open/closed states, failure threshold, recovery probe
Health checks — liveness (restart if broken) vs readiness (stop sending traffic) vs startup (Stage 4.1)
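
A sketch of retries with capped exponential backoff and full jitter. The sleep and rng parameters are injection points added for testability, make_flaky is a hypothetical helper for the demo, and a real caller should catch only errors it knows to be transient.

```python
import random
import time

def retry(fn, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)

def make_flaky(failures):
    """Demo helper: returns a function that fails `failures` times, then succeeds."""
    state = {"calls": 0}
    def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("transient")
        return "ok"
    return call

if __name__ == "__main__":
    print(retry(make_flaky(2), base=0.01))
```

Jitter spreads retries from many clients over time, so a brief outage is not followed by synchronized retry waves.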

Caching and background work

Cache-aside vs read-through vs write-through — invalidation strategies, TTL design
Cache stampede protection — single-flight, lock-based refresh
Background jobs — Job vs long-running Deployment worker in K8s (Stage 4.2)

7.6 Data migration strategies

Deployment (Stage 9.4) ships code; data migration moves state. These are separate problems.

Expand-contract pattern

Expand — add new column/table/API field (backward compatible, old code still works)
Migrate — backfill data, dual-read or dual-write during transition
Contract — remove old column/table/API field once all code uses new path
Why it matters — enables zero-downtime deploys with rolling updates (Stage 9.4)

Dual writes and reconciliation

Write to old and new systems simultaneously during transition
Reconciliation job — compare old vs new, fix drift, idempotency required
Risk — inconsistency window if one write succeeds and the other fails; needs compensating transactions

Change Data Capture (CDC)

CDC tools — Debezium, AWS DMS, Maxwell — stream DB changes to Kafka/message bus
Use cases — real-time replication, event-driven architecture, incremental migration
Initial snapshot + streaming — full load then switch to binlog/WAL streaming

Online schema migrations

Expand-contract for indexes — create index concurrently, swap in application
Migration ordering — schema before code (expand) or code before schema (contract) depending on direction

Cutover and verification

Traffic shifting — percentage-based cutover, instant rollback if error rate spikes
Backfill throttling — rate-limit backfill to protect production DB performance
Rollback plan — can you revert if cutover fails? How long is old system kept warm?

Stage 8 — Infrastructure as Code

HashiCorp and Pulumi are on your list. IaC is also tested at almost every other company.

8.1 Terraform

Core concepts

HCL (HashiCorp Configuration Language) — declarative configuration language
Provider — plugin that manages a specific API (AWS, GCP, Kubernetes, Vault)
Resource — infrastructure object managed by Terraform
Data source — read-only reference to existing infrastructure
Output — export values from a configuration
Variable — input values, with type constraints and validation

State management

State file (terraform.tfstate) — JSON file recording current state of all managed resources
Remote backends — S3 + DynamoDB (locking), Terraform Cloud, GCS
State locking — prevents concurrent applies, DynamoDB table for distributed lock
terraform import — bring existing infrastructure under Terraform management
State drift — real world diverges from state, terraform plan detects this

Plan and apply lifecycle

Dependency graph — Terraform builds a DAG of all resources and their dependencies
create_before_destroy lifecycle meta-argument — zero-downtime replacements
prevent_destroy — protect critical resources from accidental deletion
Targeted applies — terraform apply -target=aws_instance.foo (use sparingly)
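
The dependency graph idea can be illustrated with Python's graphlib (3.9+). This is not how Terraform is implemented; it is a sketch of deriving a valid apply order from declared dependencies, and the resource addresses are examples.

```python
from graphlib import TopologicalSorter

def apply_order(dependencies):
    """dependencies: resource -> set of resources it depends on."""
    # static_order yields each node only after all of its dependencies.
    return list(TopologicalSorter(dependencies).static_order())

EXAMPLE = {
    "aws_instance.web": {"aws_subnet.main", "aws_security_group.web"},
    "aws_subnet.main": {"aws_vpc.main"},
    "aws_security_group.web": {"aws_vpc.main"},
    "aws_vpc.main": set(),
}

if __name__ == "__main__":
    print(apply_order(EXAMPLE))
```

Resources with no path between them (the subnet and the security group here) can be created in parallel, and destroy runs the same graph in reverse.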

Modules

Module structure — main.tf, variables.tf, outputs.tf
Module versioning — source from Terraform Registry, GitHub with ?ref=v1.2.3
Module composition patterns — root module calls child modules

8.2 Pulumi (awareness)

Alternative to Terraform — define infrastructure in TypeScript, Python, or Go instead of HCL
Same plan/apply/state model; details in deep-dive file when you use it

8.3 HashiCorp Vault

Architecture

Core + storage backend — Vault core is stateless, all state in storage (Raft integrated or external like Consul)
Auto-unseal — use cloud KMS (AWS KMS, GCP KMS) to automatically unseal on restart

Auth methods

Kubernetes auth — pod presents service account token, Vault validates with K8s API server
AWS IAM auth — use IAM role/instance profile to authenticate
OIDC/JWT — integrate with any OIDC provider (GitHub Actions, GitLab CI)

Secret engines

KV v2 — versioned key-value store, soft delete, max_versions per key
Dynamic secrets — Vault generates credentials on-demand (DB passwords, AWS keys, certificates)
Database secret engine — Vault creates a DB user, returns credentials, auto-revokes on lease expiry

Vault Agent

Sidecar pattern — runs alongside your app, authenticates to Vault, writes secrets to file
Kubernetes Vault Agent Injector — annotate pods, sidecar is automatically injected

8.4 FinOps for platform teams

Connects autoscaling (Stage 4.5), cloud infrastructure (Stage 4.6), and IaC (Stages 8.1–8.2).

Cost visibility and allocation

Tagging strategy — mandatory tags: team, environment, service, cost-center
Showback vs chargeback — visibility to teams vs actual billing
Cost per namespace / per cluster / per service — Kubecost, CloudHealth, native cloud cost explorer
Unit economics — cost per request, cost per GB ingested, cost per tenant

Compute optimization

Rightsizing — VPA recommendations (Stage 4.5), instance type selection, CPU/memory fit
Spot / preemptible nodes — cost savings vs interruption risk, taints/tolerations for fault-tolerant workloads
Cluster Autoscaler price expander — prefer cheaper node groups (Stage 4.5)
Idle resource detection — orphaned volumes, unused load balancers, over-provisioned node groups
HPA min replicas — don't run 10 replicas at 3am if traffic allows 2

Storage and data costs

Object storage lifecycle policies — S3 Intelligent-Tiering, Glacier for old Loki/Thanos blocks
Persistent volume sizing — right-size PVCs, storage class selection (gp3 vs io2)
Log and metrics retention — shorter retention = lower cost (Stage 6); cardinality = cost (Stage 6.1)
Egress costs — cross-AZ, cross-region, internet egress; design to minimize (CDN, PrivateLink)

FinOps in IaC and CI/CD

Cost estimation in PRs — Infracost, Terraform plan cost diff
Policy as code — deny expensive instance types, enforce tagging in Terraform/Kyverno
Environment lifecycle — tear down ephemeral preview environments (Stage 9.3), scheduled shutdown of dev clusters
Reserved instances / savings plans vs on-demand — when commitment makes sense

Stage 9 — CI/CD, GitOps and Developer Platforms

GitLab, Harness, and CircleCI are on your list, and GitOps is expected everywhere.

9.1 GitOps with ArgoCD

GitOps principles

Git as the single source of truth for desired state
Declarative — desired state expressed as files, not imperative commands
Automated reconciliation — controller continuously syncs actual state to desired state
Auditability — every change is a Git commit with author, timestamp, diff
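
The reconciliation principle can be sketched as a diff between two dictionaries. This is a toy model, not how ArgoCD is implemented: `desired` stands in for manifests rendered from Git and `live` for cluster state, and both names are assumptions for the example.

```python
def diff_state(desired: dict[str, dict], live: dict[str, dict]) -> dict[str, list[str]]:
    """Return the actions needed to move live state to desired state."""
    return {
        "create": sorted(k for k in desired if k not in live),
        "update": sorted(k for k in desired if k in live and desired[k] != live[k]),
        "delete": sorted(k for k in live if k not in desired),
    }


def reconcile(desired: dict[str, dict], live: dict[str, dict]) -> dict[str, list[str]]:
    """Apply the diff to live state in place; a second call is a no-op."""
    plan = diff_state(desired, live)
    for name in plan["create"] + plan["update"]:
        live[name] = dict(desired[name])
    for name in plan["delete"]:
        del live[name]
    return plan


desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
live = {"web": {"replicas": 1}, "old-job": {"replicas": 1}}
print(reconcile(desired, live))  # first pass creates, updates, and prunes
print(reconcile(desired, live))  # second pass finds nothing to do
```

The second call returning an empty plan is the property that makes continuous syncing safe to run in a loop.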

ArgoCD architecture

Application CRD — defines source (Git repo/path) and destination (cluster/namespace)
Application controller — watches Applications, compares live state with desired state (Git)
Repo server — clones Git repos, renders Helm/Kustomize/Jsonnet manifests
API server — serves gRPC and REST API, handles sync triggers

App-of-apps pattern

A root Application whose source is a directory of child Application manifests; enables managing hundreds of apps from a single Git repo

Multi-cluster GitOps

Cluster credentials — stored as Secrets in ArgoCD namespace
Progressive delivery across clusters — sync to dev → staging → prod with approvals

Secrets in GitOps

External Secrets Operator — CRD points to Vault/AWS Secrets Manager, controller creates K8s Secret

9.2 CI/CD pipeline engineering

Pipeline concepts

DAG execution — stages/steps as a directed acyclic graph; jobs with no unmet dependencies run in parallel
Artifact passing — how outputs of one stage become inputs of the next
Build cache — Docker layer cache, language-specific caches (Go module cache, npm cache)
Pipeline triggers — push, MR/PR, schedule, API trigger, upstream pipeline
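
DAG execution can be illustrated with the standard-library `graphlib` module (Python 3.9+). The job names and the `needs` mapping below are hypothetical; a real CI system also handles retries, artifacts, and runner capacity, none of which is modeled here.

```python
from graphlib import TopologicalSorter


def execution_waves(needs: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave has all its dependencies met."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: each job maps to the set of jobs it needs.
pipeline = {
    "build": set(),
    "lint": set(),
    "unit-test": {"build"},
    "scan": {"build"},
    "deploy": {"unit-test", "scan", "lint"},
}
for wave in execution_waves(pipeline):
    print(wave)
```

Jobs inside one wave are independent, so a scheduler is free to run them at the same time.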

GitLab CI specifics

.gitlab-ci.yml — pipeline definition, stages, jobs, rules, needs
GitLab Runner — the agent that executes jobs, registered to a GitLab instance
Executor types — Shell, Docker, Kubernetes (most scalable)
Kubernetes executor — creates a pod per job, ephemeral, configurable resources
Caching — cache: key with hash of lock file, stored in S3 or runner local cache
Artifacts — artifacts: paths persisted and passed between jobs/stages
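
The idea behind keying a cache on the lock file can be shown in a few lines. GitLab computes its own key when `cache:key:files` is used; the function below is only an illustration of the principle, and the prefix and truncated digest length are arbitrary choices.

```python
import hashlib


def cache_key(prefix: str, lock_file_bytes: bytes) -> str:
    """Derive a cache key that changes only when the lock file content changes."""
    digest = hashlib.sha256(lock_file_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


# Hypothetical lock file contents before and after a dependency bump.
before = b"requests==2.31.0\n"
after = b"requests==2.32.0\n"
print(cache_key("pip", before))
print(cache_key("pip", after))
```

Identical dependencies reuse the cache across branches, while any dependency change produces a new key and a fresh cache.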

Security in CI/CD pipelines

SAST scanning — GitLab SAST (included in Auto DevOps), Semgrep, CodeQL
SCA (Software Composition Analysis) — Snyk, Trivy, grype
Container scanning — scan image after build, before push
Secret detection — gitleaks, trufflehog, GitLab secret detection

9.3 Internal Developer Platform (IDP)

The "platform engineering" product layer — what app teams interact with daily.

Platform as a product

Internal customers — application developers, data engineers, ML engineers
Golden paths — opinionated, supported, easy way to do the right thing
Self-service vs guardrails — developers provision infra within policy boundaries

Developer portal and service catalog

Service catalog metadata — owner, on-call rotation, dependencies, SLOs, runbooks
Scaffolder templates — "Create microservice" → repo + CI + Dockerfile + K8s manifests + monitoring + RBAC
TechDocs — docs-as-code in the repo, rendered in the portal

Golden path templates

What a complete template includes — Git repo, .gitlab-ci.yml, container build, image signing (Stage 5.2), GitOps manifest (Stage 9.1), Prometheus alerts (Stage 6), NetworkPolicy (Stage 5.3)
Template versioning — upgrade path when platform standards change

Environment management

Dev / staging / prod promotion — GitOps sync waves across clusters (Stage 9.1)
Ephemeral environments — preview apps per MR (Stage 9.2), namespace-per-branch, TTL-based cleanup
Environment parity — same Helm chart, different values; avoid snowflake environments

Artifact management

Container registries — ECR, Google Artifact Registry (successor to GCR), Harbor; image retention policies, vulnerability scan gates (Stage 5.1)
SBOM and provenance storage — attach to images in registry (Stage 5.2)

Policy in the delivery path

Shift-left security — scan in CI before merge (Stage 9.2)
Admission control at deploy — Kyverno/Gatekeeper enforce standards (Stage 5.3)
Policy exceptions — audit mode, break-glass with approval workflow

9.4 Deployment and release strategies

How to ship changes safely. Coordinate with data migrations (Stage 7.6) and SLOs (Stage 6.4).

Choosing a strategy

Strategy	Downtime	Rollback speed	Infrastructure cost	Best for
Rolling	None	Slow (re-deploy old version)	Low	Stateless services, default K8s
Blue-green	None	Fast (switch traffic)	2x during deploy	Critical services, fast rollback needed
Canary	None	Fast (shift traffic back)	Low extra	High-traffic services, metric-gated promotion
Shadow	None	N/A (no user impact)	2x compute	Validation before any user traffic

Rolling deployment

K8s Deployment — maxSurge, maxUnavailable, rolling update strategy (Stage 4.2)
Readiness probes — new pods must pass before old pods terminate
PodDisruptionBudget — minimum available during voluntary disruptions (Stage 4.2)
Limitation — mixed versions run simultaneously; requires backward-compatible API and schema (Stage 7.6)

Blue-green deployment

Two identical environments — blue (current) and green (new)
Traffic switch — DNS, load balancer, or service mesh route flip
Rollback — switch traffic back to blue instantly
Cost — running double infrastructure during deploy window
Database consideration — schema must be compatible with both versions (expand-contract, Stage 7.6)

Canary deployment

Traffic split — 1% → 5% → 25% → 50% → 100%, gated by metrics at each step
Metric gates — error rate, p99 latency, saturation (Stage 6.6); SLO burn rate (Stage 6.4)
Automated rollback — Argo Rollouts / Flagger revert on failed analysis (Stage 9.1)
Service mesh or ingress required — Istio VirtualService, NGINX canary annotations, Cilium (Stage 4.7)
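
A minimal sketch of metric-gated promotion, using the traffic steps listed above. The thresholds (1% error rate, 500 ms p99) and the `Metrics` type are assumptions for the example; Argo Rollouts and Flagger express the same idea declaratively through their own analysis resources.

```python
from dataclasses import dataclass

STEPS = [1, 5, 25, 50, 100]  # traffic percentages from the canary list above


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def next_weight(current: int, m: Metrics,
                max_error_rate: float = 0.01, max_p99_ms: float = 500.0) -> int:
    """Return the next canary weight, or 0 to shift all traffic back (rollback)."""
    if m.error_rate > max_error_rate or m.p99_latency_ms > max_p99_ms:
        return 0
    later = [s for s in STEPS if s > current]
    return later[0] if later else current


print(next_weight(5, Metrics(error_rate=0.002, p99_latency_ms=180.0)))   # healthy: promote
print(next_weight(25, Metrics(error_rate=0.04, p99_latency_ms=180.0)))   # failing: roll back
```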

Shadow / dark traffic

Mirror production traffic to new version — no user-facing impact
Compare responses — diff old vs new output, log discrepancies
Use cases — validate rewrite, test new database backend, ML model comparison

Feature flags

Decouple deploy from release — code is deployed but feature is off
Flag types — release flags (short-lived), ops flags (kill switch), experiment flags (A/B)
Kill switch — disable feature instantly without rollback deploy
Flag hygiene — remove stale flags; tech debt if flags accumulate
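
One way to picture a release flag with a kill switch is a stable hash bucket per user. This is a sketch under assumptions: the flag name, user IDs, and the hashing scheme are invented here, and hosted flag services use their own bucketing.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag name and user ID."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Kill switch wins over everything; otherwise compare bucket to rollout."""
    if kill_switch:
        return False
    return bucket(flag, user_id) < rollout_percent


# The same user always lands in the same bucket, so the experience is consistent.
print(is_enabled("new-checkout", "user-42", rollout_percent=100))
print(is_enabled("new-checkout", "user-42", rollout_percent=100, kill_switch=True))
```

Because the code path ships dark, flipping `kill_switch` disables the feature without a rollback deploy.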

Deployment safety checklist

Backward-compatible API and schema changes (Stage 7.6 expand phase)
Feature flags for risky changes
Dashboards and alerts ready before deploy (Stage 6)
Rollback plan documented — code rollback vs schema rollback (schema rollback is hard)
PDB and HPA configured — don't deploy during capacity constraints
Error budget check — freeze deploys if budget exhausted (Stage 6.5)

Coordinating code and data deploys

Expand before deploy — add new DB column/table before code that uses it
Contract after deploy — remove old column only after all code migrated
Dual-write period — both old and new code paths write to both stores (Stage 7.6)
Never deploy breaking schema change with rolling update — old pods will crash

Stage 10 — eBPF and Advanced Networking (for Cilium, Cloudflare, Fastly)

10.1 Advanced networking awareness (details come later in the deep-dive file)

Full eBPF, Cilium, and CDN content is in platform-engineering-deep-dive.md. For now:

eBPF — programmable hooks in the Linux kernel for networking, security, and observability
Cilium — Kubernetes networking and policy using eBPF instead of iptables
CDN edge — caches responses by Cache-Control headers; mitigates DDoS at L3/L4 (volume) and L7 (HTTP-aware)

Stage 11 — Distributed databases continued (ScyllaDB / PlanetScale)

Wide-column (ScyllaDB/Cassandra) and sharded MySQL (Vitess/PlanetScale) — full detail in deep-dive file.

ScyllaDB/Cassandra — partition key determines node; design tables for query patterns, not normalized joins
Vitess/PlanetScale — MySQL sharded at scale; avoid scatter queries without a shard key
LSM trees, VTGate, gh-ost, resharding → deep-dive file

Stage 12 — Architecture case studies

Apply everything from Stages 1–11. Each case study follows: problem → architecture → key decisions → failure modes → interview follow-ups.

12.1 Datadog metrics ingest pipeline

Problem

Ingest millions of metrics per second from agents across customer infrastructure
High cardinality risk — bad label design can OOM the pipeline
Must query recent data fast; older data can be slower/cheaper

Architecture (conceptual)

Agent (node/pod) → local aggregation → intake API (load balanced)
Kafka or similar queue — decouple ingest from processing, absorb spikes
Processing workers — normalize, validate, drop/blacklist high-cardinality series
Hot storage — recent data, fast queries (like Prometheus TSDB, Stage 6.1)
Cold storage — object storage (S3) for long retention, queried on demand
Query layer — federates hot + cold, PromQL-compatible

Key decisions

Why queue between intake and storage — backpressure, burst absorption
Cardinality limits — per-metric, per-tag, per-customer quotas
Downsampling — reduce resolution for older data to control storage cost (Stage 8.4)
Sharding — by customer ID or metric hash for horizontal scale
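
A per-metric cardinality quota could look roughly like this. It is an in-memory sketch only: the class name and limit are hypothetical, and a real pipeline would need expiry of idle series, per-customer scoping, and shared state across workers.

```python
from collections import defaultdict


class CardinalityLimiter:
    """Admit a series only while its metric stays under a unique-series quota."""

    def __init__(self, max_series_per_metric: int):
        self.max_series = max_series_per_metric
        self.seen: dict[str, set[frozenset]] = defaultdict(set)

    def admit(self, metric: str, tags: dict[str, str]) -> bool:
        series = frozenset(tags.items())
        known = self.seen[metric]
        if series in known:
            return True          # existing series always pass
        if len(known) >= self.max_series:
            return False         # new series over quota is dropped
        known.add(series)
        return True


limiter = CardinalityLimiter(max_series_per_metric=2)
print(limiter.admit("http.requests", {"path": "/a"}))
print(limiter.admit("http.requests", {"path": "/b"}))
print(limiter.admit("http.requests", {"path": "/c"}))  # third unique series is rejected
```

Existing series keep flowing when the quota is hit, so a bad deploy loses only its new labels rather than the whole metric.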

Failure modes

Cardinality explosion — one bad deployment sends unique label per request
Ingest lag — queue depth grows, delayed metrics, alert on pipeline lag not just app metrics
Hot shard — uneven customer traffic distribution

Interview follow-ups

How would you design cardinality limits?
What happens if Kafka is down for 5 minutes?
How do you migrate storage backends without downtime?

12.2 Cloudflare DDoS mitigation

Problem

Mitigate multi-Tbps volumetric attacks without impacting legitimate traffic
Must operate at line rate — cannot afford per-packet userspace processing at scale

Architecture (conceptual)

Anycast BGP — same IP from every PoP, traffic routed to nearest edge (Stage 10.3)
XDP/eBPF at NIC — drop malicious packets before kernel network stack (Stage 10.1)
Flow tracking — stateful inspection for SYN floods, UDP amplification
Rate limiting — token bucket per IP/ASN/fingerprint (Stage 10.3)
Challenge layer — JS/CAPTCHA for suspicious but not clearly malicious traffic
Origin shield — aggregate cache misses through single PoP to protect origin
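
The token bucket named in the rate-limiting bullet can be sketched in userspace Python. This shows the algorithm only; at the edge it would run per key (IP, ASN, fingerprint) in a much faster datapath, and the rate and capacity values are arbitrary.

```python
class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Time is passed in explicitly so the behavior is deterministic.
b = TokenBucket(rate=1.0, capacity=2.0)
print([b.allow(0.0), b.allow(0.0), b.allow(0.0)])  # burst of 2, then rejected
print(b.allow(1.0))                                # one token refilled after 1 second
```

Capacity bounds the burst a client may send, while the rate bounds its sustained throughput.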

Key decisions

XDP vs iptables — XDP for line-rate drop, iptables for complex stateful rules
False positive vs false negative trade-off — blocking legit users vs letting attack through
Attack signature updates — how fast can rules propagate to all PoPs globally?

Failure modes

Origin overload during cache miss storm — origin-facing PoP becomes bottleneck
SYN flood exhausting conntrack table (Stage 2) — eBPF replaces kernel conntrack at scale (Stage 10.2)
L7 attacks that look like legitimate HTTP — require application-aware detection

Interview follow-ups

How does anycast handle a PoP going offline?
Design rate limiting for 10M unique IPs.
How would you test DDoS mitigation without affecting production?

12.3 Multi-tenant Kubernetes platform

Problem

Run 50+ teams on shared clusters with isolation, fair resource sharing, and cost allocation

Architecture (conceptual)

Namespace per team — ResourceQuota, LimitRange (Stage 4.2)
NetworkPolicy default-deny — explicit allow between namespaces (Stage 4.2, 4.7)
Pod Security Standards — Restricted profile enforced via admission (Stage 5.3)
RBAC — namespace-scoped roles, no cluster-admin for app teams (Stage 5.3)
Cost allocation — Kubecost or cloud tags mapped to namespaces (Stage 8.4)
IDP self-service — Backstage template creates namespace + quota + GitOps repo (Stage 9.3)

Key decisions

Shared vs dedicated nodes — taints/tolerations for noisy-neighbor isolation
Cluster per env vs cluster per team — blast radius vs operational overhead
How much self-service — golden path vs bring-your-own-manifests

Failure modes

Noisy neighbor — one team's memory spike triggers node OOM, evicts other teams' pods
Quota exhaustion — team hits ResourceQuota, pods stuck Pending, unclear error message
NetworkPolicy too restrictive — breaks legitimate cross-team dependencies

Interview follow-ups

How do you handle a team that needs GPU nodes?
Design chargeback for shared cluster costs.
One team deploys a crypto miner — how do you detect and respond?

12.4 GitOps at scale (100+ clusters)

Problem

Manage application deployments across hundreds of clusters from a central platform
Balance consistency with cluster-specific overrides; control blast radius

Architecture (conceptual)

ArgoCD hub — central instance managing remote clusters (Stage 9.1)
App-of-apps / ApplicationSet — templated apps per cluster (Stage 9.1)
Repo structure — base manifests + Kustomize overlays per cluster/environment
Sync waves — CRDs first, then operators, then workloads
Progressive sync — dev clusters auto-sync, prod requires manual approval
Secrets — External Secrets Operator pulling from Vault (Stage 9.1, 8.3)

Key decisions

Monorepo vs polyrepo — trade-off between visibility and access control
Auto-sync vs manual sync for production — speed vs safety
How to handle cluster-specific config — Kustomize overlays vs Helm values files

Failure modes

Bad manifest synced to all clusters simultaneously — blast radius
ArgoCD itself becomes SPOF — HA deployment, multiple replicas
Secret rotation breaks sync — stale ExternalSecret, pods fail to start
Drift — manual kubectl edit on cluster, GitOps fights live state

Interview follow-ups

How do you roll out a platform-wide NetworkPolicy change safely?
Design a canary cluster before promoting to all prod clusters.
How do you handle a cluster that can't reach Git?

12.5 Secure CI/CD supply chain end-to-end

Problem

Ensure only trusted, scanned, signed artifacts reach production clusters

Architecture (conceptual)

Developer push → CI pipeline (Stage 9.2)
SAST + SCA + secret detection in CI (Stage 9.2)
Build container image → Trivy/Grype scan (Stage 5.1)
Generate SBOM (Syft) + SLSA provenance (Stage 5.2)
Sign with Cosign keyless signing via GitHub OIDC → Fulcio → Rekor (Stage 5.2)
Push to registry with signature attached
Admission webhook — Kyverno verify-image policy, reject unsigned or vulnerable images (Stage 5.3)
GitOps deploy — ArgoCD syncs signed image to cluster (Stage 9.1)
Runtime — Falco detects anomalous behavior (Stage 5.4)

Key decisions

Where to enforce — CI gate vs registry gate vs admission gate (defense in depth)
Keyless vs key-based signing — OIDC identity vs long-lived keys
CVE policy — block critical, warn on high, allow with exception workflow
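
The CVE policy bullet maps naturally onto a small gate function. The finding format and the placeholder IDs below are assumptions; real scanners such as Trivy and Grype emit their own JSON schemas that would need to be adapted.

```python
def evaluate(findings: list[dict], exceptions: set[str]) -> tuple[str, list[str]]:
    """Return ("block" | "warn" | "pass", IDs that drove the decision)."""
    active = [f for f in findings if f["id"] not in exceptions]
    critical = [f["id"] for f in active if f["severity"] == "CRITICAL"]
    if critical:
        return "block", critical
    high = [f["id"] for f in active if f["severity"] == "HIGH"]
    if high:
        return "warn", high
    return "pass", []


# Placeholder IDs, not real CVEs.
findings = [
    {"id": "CVE-0000-0001", "severity": "CRITICAL"},
    {"id": "CVE-0000-0002", "severity": "HIGH"},
]
print(evaluate(findings, exceptions=set()))
print(evaluate(findings, exceptions={"CVE-0000-0001"}))  # approved exception
```

Running the same function in CI, at the registry, and at admission gives defense in depth with a single policy definition.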

Failure modes

Compromised CI runner — attacker pushes malicious signed image
Policy bypass — privileged pod admitted because the namespace lacks Pod Security enforcement
Stale base image — image signed but base layer has new CVE discovered later

Interview follow-ups

How do you handle emergency hotfix bypass of scan gates?
Design provenance verification that works across multiple CI systems.
What if Rekor is unavailable — can you still verify signatures?

12.6 Globally distributed SQL (CockroachDB-style)

Problem

PostgreSQL-compatible database that survives region failure with strong consistency

Architecture (conceptual)

Keyspace split into ranges, each range = Raft group (Stage 7.2)
Multi-Raft — independent consensus per range, scales horizontally
Transaction coordinator — 2PC across ranges for distributed transactions
Geo-partitioning — pin data to regions for latency and compliance (Stage 7.2)
Follower reads — read from local replica at stale timestamp for lower latency
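
How a key finds its range can be modeled with a sorted list of split points. This is a toy model of the idea, not CockroachDB's data structures; the keys and split points are invented, and each returned index stands in for one Raft group.

```python
import bisect


class RangeMap:
    """Maps keys to ranges defined by sorted split points."""

    def __init__(self, split_keys: list[str]):
        self.splits = sorted(split_keys)

    def range_for(self, key: str) -> int:
        # Range i covers [splits[i-1], splits[i]); range 0 starts at the minimum key.
        return bisect.bisect_right(self.splits, key)

    def split(self, at: str) -> None:
        """Split the range containing `at`, for example to relieve a hot range."""
        if at not in self.splits:
            bisect.insort(self.splits, at)


ranges = RangeMap(["g", "p"])          # three ranges: [min, g), [g, p), [p, max)
print(ranges.range_for("apple"), ranges.range_for("mango"), ranges.range_for("zebra"))
ranges.split("k")                      # the middle range is split at "k"
print(ranges.range_for("mango"))
```

Splitting adds a Raft group, which is the trade-off behind the range-size decision: more groups spread load but add consensus overhead.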

Key decisions

CP over AP — strong consistency, sacrifice availability during partition (CAP, Stage 7.1)
Range size — too small = Raft overhead; too large = hot spots
Survival goals — zone vs region failure tolerance

Failure modes

Hot range — one range gets disproportionate writes, single Raft group bottleneck
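Clock skew — HLC mitigates, but skew widens uncertainty intervals and causes transaction retries; a node whose offset exceeds the configured maximum removes itself from the cluster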
Region partition — CP system may become unavailable for affected ranges

Interview follow-ups

How does CockroachDB handle a node failure mid-transaction?
Design a schema migration for a globally distributed table.
Compare to Spanner's TrueTime approach (Stage 7.1).

12.7 Observability pipeline at scale (Loki + Prometheus)

Problem

Collect logs and metrics from 10,000+ pods without overwhelming storage or query performance

Architecture (conceptual)

Metrics — Prometheus per cluster → remote_write → Mimir/Thanos (Stage 6.1)
Logs — Fluent Bit DaemonSet → Loki distributor → ingester → S3 chunks (Stage 6.3)
Traces — OTel Collector → tail sampling → Jaeger/Tempo (Stage 6.2)
Unified query — Grafana dashboards correlating metrics + logs + traces
Cardinality control — drop high-cardinality labels at ingest, recording rules for aggregates

Key decisions

Loki label design — index only labels (not log content), low-cardinality labels only
Retention tiers — 15 days hot, 90 days warm in object storage, delete after
Sampling — head sampling for traces (99% dropped), tail sampling for errors (Stage 6.2)
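
The sampling decision can be sketched as a deterministic head sample plus a tail rule that always keeps errors. Span fields, the slow threshold, and the hashing scheme are assumptions for illustration; the OTel Collector configures tail sampling through policies rather than code like this.

```python
import hashlib


def head_sampled(trace_id: str, keep_ratio: float) -> bool:
    """Deterministic decision from the trace ID, so every service agrees."""
    h = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:8], "big")
    return h < int(keep_ratio * 2**64)


def tail_keep(spans: list[dict], keep_ratio: float, slow_ms: float = 1000.0) -> bool:
    """Decide after the trace completes: keep errors and slow traces, sample the rest."""
    if any(s["error"] or s["duration_ms"] > slow_ms for s in spans):
        return True
    return head_sampled(spans[0]["trace_id"], keep_ratio)


ok = [{"trace_id": "t1", "error": False, "duration_ms": 20.0}]
failed = [{"trace_id": "t2", "error": True, "duration_ms": 20.0}]
print(tail_keep(ok, keep_ratio=0.01), tail_keep(failed, keep_ratio=0.01))
```

Tail sampling needs the whole trace buffered before deciding, which is why it runs in the collector and costs memory.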

Failure modes

Label cardinality explosion in Loki — same problem as Prometheus, different storage
Remote write backpressure — Prometheus WAL grows, disk fills
Log volume spike — one service with debug logging left on, or logging routine events at ERROR, floods the pipeline

Interview follow-ups

How do you debug a production issue when traces were sampled out?
Design log retention that meets compliance without bankrupting storage budget (Stage 8.4).
How do you correlate a metric spike to the exact log lines?

12.8 Cilium replacing kube-proxy

Problem

kube-proxy iptables mode doesn't scale to thousands of Services; need faster datapath

Architecture (conceptual)

Cilium agent (DaemonSet) — programs eBPF on each node (Stage 10.2)
eBPF LB map — service IP → backend pod IP, O(1) lookup, no iptables chain walk
Identity-based policy — numeric security identity from labels, not IP (Stage 10.2)
Hubble — flow-level observability from eBPF, no sidecar needed
--kube-proxy-replacement=true (strict before Cilium 1.14) — Cilium owns all service routing

Key decisions

eBPF over iptables — performance at scale, but requires a sufficiently recent kernel (4.19 was the floor for older Cilium releases; newer releases need more) and BTF for CO-RE
DSR (Direct Server Return) — reply bypasses load balancer node, lower latency
Identity vs IP policy — IPs change on pod restart; identity is stable

Failure modes

eBPF map full — service/backend limit hit, new services fail to program
Kernel upgrade breaks eBPF programs — CO-RE (BTF) mitigates (Stage 10.1)
Policy misconfiguration — identity mismatch blocks legitimate traffic silently

Interview follow-ups

Walk through packet path for ClusterIP Service with Cilium eBPF vs iptables.
How does Cilium handle a pod IP change during rolling update?
Compare Cilium LB to IPVS mode kube-proxy (Stage 4.1).

12.9 Zero-downtime database migration

Problem

Migrate a 500GB PostgreSQL table (monolith DB) to a new schema, shard, or datastore with zero downtime and a rollback path
Application must keep serving traffic throughout; old and new code versions run simultaneously during rolling deploys (Stage 9.4)

Architecture (conceptual)

Phase 1 (expand) — add new column/table/index in old DB; deploy code that writes to both old and new paths (Stage 7.6)
Phase 2 (backfill) — batch or streaming job copies historical data; throttle to protect prod DB performance
Phase 3 (CDC) — Debezium/DMS streams ongoing changes from old DB to new store, keeping new store in sync (Stage 7.6)
Phase 4 (dual-read validation) — compare row counts, checksums, sample queries between old and new
Phase 5 (cutover) — shift read traffic to new store (percentage-based or instant); monitor error rate and SLO burn (Stage 6.4)
Phase 6 (contract) — remove old column/table once all code reads from new path; decommission old store
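
Phase 4 validation can be approximated by comparing a row count and an order-independent checksum from each store. This sketch assumes both stores return rows as identical Python tuples (same types and column order), which a real migration would have to normalize first; the sample rows are invented.

```python
import hashlib


def table_checksum(rows: list[tuple]) -> tuple[int, str]:
    """Row count plus an order-independent digest of the rows."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        # Summing modulo 2**256 makes the result independent of row order.
        acc = (acc + int.from_bytes(digest, "big")) % 2**256
    return len(rows), f"{acc:064x}"


old_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
new_rows = [(3, "carol"), (1, "alice"), (2, "bob")]   # same data, different order
print(table_checksum(old_rows) == table_checksum(new_rows))
print(table_checksum(old_rows) == table_checksum(new_rows[:2]))  # missing row
```

On a large table this would be run per key range, so a mismatch narrows down to a small slice instead of the whole table.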

Key decisions

Expand-contract over big-bang — only safe pattern with rolling K8s deploys (Stage 9.4)
Dual-write vs CDC-only — dual-write simpler but risk of inconsistency; CDC cleaner but adds pipeline complexity
Cutover strategy — percentage traffic shift vs DNS flip vs feature flag per tenant
How long to keep old system warm — rollback window vs cost of running dual systems

Failure modes

Dual-write partial failure — one write succeeds, other fails; needs idempotency and reconciliation job (Stage 7.5)
Backfill overload — unthrottled backfill saturates DB I/O, degrades live traffic
Schema incompatibility — new code deployed before expand phase completes, old pods crash
Cutover with replication lag — reads from new store return stale data, user-visible inconsistency
Rollback after contract phase — schema rollback is hard; may require forward-fix instead

Interview follow-ups

How do you verify data correctness before cutover?
What if CDC pipeline falls 30 minutes behind during peak traffic?
Design migration for a table with 10K writes/sec and foreign key constraints.

12.10 Autoscaling under a traffic spike

Problem

Traffic increases 20× in 10 minutes (product launch, Black Friday, viral event)
Platform must scale pods, nodes, and ingress without breaching SLOs or exhausting error budget (Stage 6.4)

Architecture (conceptual)

Ingress / load balancer — cloud ALB/NLB or CDN absorbs initial burst (Stages 4.6, 10.3)
HPA — scales pod replicas based on CPU, memory, or custom metrics (Stage 4.5)
Cluster Autoscaler — adds nodes when pods are Pending due to insufficient resources (Stage 4.5)
KEDA — event-driven scaling on queue lag or external metrics; scale-to-zero off-peak (Stage 4.5)
Pre-warming — raise HPA minReplicas and pre-provision node pool before known events
Observability — RED metrics on autoscaling loop itself: time-to-new-pod-ready, time-to-new-node, scheduling latency (Stage 6.6)

Key decisions

HPA metric choice — CPU lags behind request rate; custom metrics (RPS, queue depth) react faster
CA scale-up delay — new node takes 2–5 minutes; pre-warm node groups for predictable events
PDB vs scale-down — CA respects PodDisruptionBudgets; may block scale-down, leaving costly idle nodes (Stage 4.2)
Spot/preemptible nodes — cost savings vs interruption during spike; use for fault-tolerant workloads only (Stage 8.4)
Max replicas cap — prevent runaway scaling from bug or DDoS; balance cost vs availability
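
The core HPA calculation, as described in the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below adds the min/max clamp and leaves out stabilization windows, scaling policies, and unready-pod handling.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation with tolerance and min/max clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical spike: 4 pods, target 100 RPS per pod.
print(desired_replicas(4, 200.0, 100.0, min_replicas=2, max_replicas=50))   # 2x load
print(desired_replicas(4, 2000.0, 100.0, min_replicas=2, max_replicas=50))  # 20x load, capped
```

The second call shows why the max replicas cap is a real decision: a 20× spike asks for 80 pods and gets 50.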

Failure modes

HPA lag — metrics-server delay + cooldown window; pods not ready before traffic overwhelms existing replicas
CA can't scale — hit node group max, instance quota, or IP address exhaustion in subnet
Thundering herd on new pods — all new pods cold-start simultaneously, DB connection pool exhausted (Stage 7.5)
Ingress bottleneck — pods scaled but ingress/controller becomes the limit
Flapping — scale-up then rapid scale-down as metrics spike and drop; tune stabilization windows (Stage 4.5)

Interview follow-ups

How do you load-test autoscaling behavior before a launch?
HPA vs KEDA for a Kafka consumer workload — which and why?
Traffic drops after spike — how fast should you scale down without causing another outage?

12.11 Building a production Kubernetes operator

Problem

Platform team needs a CRD (e.g., Database, Application, Tenant) with a controller that provisions and manages lifecycle automatically
Must be reliable, idempotent, and operable at scale across many clusters

Architecture (conceptual)

CRD definition — OpenAPI validation schema, status subresource, printer columns (Stage 4.3)
controller-runtime — Manager, Reconciler, work queue, shared informer cache (Stage 4.3)
Reconcile loop — compare spec (desired) vs observed state; create/update/delete child resources
Webhooks — mutating (defaults) and validating (reject invalid specs) admission (Stage 4.3)
Finalizers — pre-delete cleanup (e.g., snapshot DB before CR deletion); prevent stuck resources
Observability — controller metrics (reconcile duration, errors, queue depth), structured logs, tracing (Stage 6)

Key decisions

Idempotent reconcile — calling reconcile N times has same effect as once; use CreateOrUpdate pattern (Stage 4.3)
Error handling — transient errors requeue with backoff; permanent errors update status condition
Owner references — child resources garbage-collected when parent CR deleted (Stage 4.3)
Leader election — only one active controller replica; others standby (Stage 4.3)
Secondary resource watches — trigger reconcile when child Secret or Deployment changes
Testing — envtest (real API server and etcd, no kubelet) for controller tests, kind cluster for end-to-end tests, contract tests on CRD schema
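
The error-handling decision (requeue transient errors with backoff, surface permanent errors in status) can be sketched independently of any framework. Operators are usually written in Go with controller-runtime; this Python version only models the logic, and the `PermanentError` and `Result` types plus the 1 s base and 300 s cap are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class PermanentError(Exception):
    """An error that retrying cannot fix, such as an invalid spec."""


@dataclass
class Result:
    requeue_after_s: Optional[float]  # None means do not requeue
    condition: str                    # message to write into status


def backoff(failures: int, base_s: float = 1.0, max_s: float = 300.0) -> float:
    """Exponential backoff after `failures` consecutive failures (>= 1)."""
    if failures < 1:
        raise ValueError("failures must be >= 1")
    exponent = min(failures - 1, 32)  # cap the exponent so the power stays small
    return min(max_s, base_s * 2 ** exponent)


def handle_error(err: Exception, failures: int) -> Result:
    if isinstance(err, PermanentError):
        return Result(None, f"Failed: {err}")
    return Result(backoff(failures), f"Retrying: {err}")


print(handle_error(TimeoutError("cloud API timeout"), failures=3))
print(handle_error(PermanentError("storage size must be positive"), failures=1))
```

Writing the condition in both cases keeps the failure visible to whoever inspects the resource, rather than buried in controller logs.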

Failure modes

Reconcile storm — API server blip causes resync of all objects; rate-limit queue, use predicates (Stage 4.3)
Stuck finalizer — external dependency unavailable, CR can't delete; manual finalizer removal as break-glass
Status update conflict — concurrent reconcilers or user edits cause optimistic locking conflict
Webhook failure — invalid object rejected but error opaque to user; clear validation messages critical
Partial provision — DB created but Secret not written; status must reflect partial state accurately

Interview follow-ups

Walk through reconcile for a Database CR: create → running → upgrade → delete.
How do you handle a controller bug that corrupted 50 resources — rollback strategy?
How do you version CRD schemas without breaking existing resources?

12.12 Vault as the org-wide secrets platform

Problem

2,000+ microservices need dynamic DB credentials, PKI certs, and API keys without Vault becoming a single point of failure or bottleneck
Must integrate with Kubernetes, CI/CD, and cloud IAM across multiple clusters and accounts

Architecture (conceptual)

Vault HA cluster — Raft integrated storage, 3+ nodes, active/standby with auto-failover (Stage 8.3)
Auto-unseal — AWS KMS / GCP KMS; no manual unseal on restart (Stage 8.3)
Auth methods — Kubernetes auth (pod SA token), AWS IAM auth, AppRole for CI, OIDC for GitHub Actions (Stage 8.3)
Secret engines — Database (dynamic creds), PKI (internal CA), KV v2 (static secrets), Transit (encryption-as-a-service)
Vault Agent Injector — sidecar injected via pod annotation, renders secrets to file, auto-renews leases (Stage 8.3)
External Secrets Operator — syncs Vault secrets to K8s Secret for GitOps compatibility (Stage 9.1)

Key decisions

Agent sidecar vs ESO vs direct API — sidecar for app file-based secrets; ESO for GitOps; direct API for controllers
Dynamic vs static secrets — dynamic DB creds auto-revoke on lease expiry; static secrets need rotation policy
Namespace isolation — each team gets Vault policy scoped to their path; no cross-team secret access
Performance standbys — Vault Enterprise feature; read replicas for high read volume; writes still go to active node (Stage 8.3)
Break-glass — emergency root token procedure, audited, time-limited

Failure modes

Vault sealed after restart — auto-unseal misconfigured, all secret retrieval fails across fleet
Lease expiry without renewal — app crashes when DB cred expires; Agent must renew before TTL
Rate limiting — thundering herd of pods restarting simultaneously overwhelms Vault auth endpoint
Token leak — compromised SA token grants Vault access; short-lived tokens + narrow policies limit blast radius
Raft quorum loss — 2 of 3 nodes down, no leader can be elected, Vault unavailable until quorum is restored; multi-AZ placement critical
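
The quorum arithmetic behind the last failure mode is short enough to write out. It applies to any Raft cluster, not only Vault; the function names are made up for the example.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` voting members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many voters can fail while a majority remains."""
    return nodes - quorum(nodes)


def has_quorum(nodes: int, healthy: int) -> bool:
    return healthy >= quorum(nodes)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
print(has_quorum(3, 1))  # 2 of 3 nodes down
```

Four nodes tolerate no more failures than three, which is why odd cluster sizes are the usual choice.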

Interview follow-ups

Vault is down for 10 minutes — what breaks, in what order?
How do you rotate a database password for 500 services without restart?
Design Vault topology for 5 K8s clusters across 2 cloud accounts.

12.13 FinOps: reducing a $2M/month K8s and cloud bill

Problem

Platform spend growing 30% quarter-over-quarter; leadership demands ~40% reduction without SLO regression or team revolt
Must identify waste, rightsize, and implement guardrails — not just cut capacity blindly

Architecture (conceptual)

Cost visibility — Kubecost / CloudHealth / native cost explorer with mandatory tagging (team, env, service) (Stage 8.4)
Compute — VPA recommendations, rightsizing requests/limits, spot/preemptible for fault-tolerant workloads (Stages 4.5, 8.4)
Node efficiency — CA scale-down idle nodes, reduce max node group size, consolidate low-utilization clusters
Storage — right-size PVCs, S3 lifecycle policies for logs/metrics/backups, delete orphaned volumes (Stage 8.4)
Observability cost — reduce metrics cardinality, shorten retention, drop debug logs in prod (Stages 6.1, 8.4)
Egress — CDN for static assets, PrivateLink for cross-service traffic, same-AZ preference (Stages 4.6, 8.4)
Governance — Infracost in PRs, Kyverno policy blocking oversized instances, chargeback reports to teams (Stages 8.1, 8.4)

Key decisions

What to cut first — idle resources, over-provisioned dev/staging, excessive retention; never cut prod headroom blindly
Spot/preemptible adoption — start with stateless batch/CI workloads; keep on-demand for critical path (Stage 8.4)
Chargeback vs showback — showback educates; chargeback creates accountability but needs accurate allocation
Reserved instances / savings plans — commit for baseline load only; keep burst on-demand
Unit economics — cost per request, per tenant, per GB ingested; track over time to prove savings didn't hurt reliability

Failure modes

Aggressive rightsizing causes OOM kills during traffic spike — under-provisioned after cutting limits
Spot interruption during peak — no on-demand fallback, SLO breach
Retention cut too short — can't debug incident from last week; false economy
Tagging gaps — 30% of spend is "untagged," can't allocate or optimize
Team workaround — devs spin up resources outside platform to avoid chargeback, creating shadow IT

Interview follow-ups

Show me your prioritization: what do you cut first, second, never?
How do you prove a 40% cost cut didn't increase incident rate?
Design chargeback model for a shared multi-tenant K8s cluster (Stage 12.3).

12.14 Chaos game day on a production-like environment

Problem

Platform team needs confidence that the system survives realistic failures before they happen in production
Run controlled experiments in a prod-like staging environment without customer impact

Architecture (conceptual)

Environment — full prod parity: same K8s version, same operators, same observability stack, synthetic load at ~50% prod traffic (Stage 9.3)
Chaos tools — Litmus Chaos, Chaos Mesh, or Gremlin; inject faults as K8s CRs or API calls (Stage 6.5)
Experiment design — hypothesize steady state (SLOs hold), define blast radius, set abort conditions
Fault types — pod kill, node drain, network partition (NetworkPolicy drop), AZ failure simulation, DNS failure, latency injection
Observability during experiment — pre-built dashboards for SLO burn, error rate, latency; on-call team observes but doesn't intervene unless abort threshold hit (Stage 6)
Post-experiment — blameless review, gap analysis, action items (runbook updates, new alerts, code fixes) (Stage 6.5)

Key decisions

Prod-like vs prod — never inject chaos in prod without mature practice; staging with realistic load is the starting point
Steady-state hypothesis — "p99 latency stays under 500ms during single pod kill" — must be measurable before starting
Blast radius — one namespace/team at a time; don't kill all etcd members simultaneously
Abort conditions — auto-abort if error rate exceeds 5% or SLO burn rate hits 10× (Stage 6.4)
Frequency — quarterly game days for platform; smaller automated chaos in CI for individual services
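
The abort condition above can be checked with a few lines, taking burn rate as observed error rate divided by the error budget (1 minus the SLO target). The 5% and 10× thresholds come from the list; the SLO targets in the example are assumptions.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_abort(error_rate: float, slo_target: float,
                 max_error_rate: float = 0.05, max_burn_rate: float = 10.0) -> bool:
    """Abort if either the raw error rate or the burn rate crosses its limit."""
    return (error_rate > max_error_rate
            or burn_rate(error_rate, slo_target) >= max_burn_rate)


# With a 99.9% SLO, even 2% errors burns budget about 20x too fast.
print(should_abort(error_rate=0.002, slo_target=0.999))
print(should_abort(error_rate=0.02, slo_target=0.999))
```

Checking both limits matters because a tight SLO trips on burn rate long before the raw error rate looks alarming.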

Failure modes

Experiment exceeds blast radius — network partition CR affects wrong namespace, staging outage
No abort condition — experiment runs too long, staging unusable for other teams for hours
False confidence — staging lacks prod traffic patterns; passes game day but fails in prod
Missing observability — can't tell if hypothesis passed or failed; experiment is worthless
Action items not tracked — same failure found in 3 game days, never fixed

Interview follow-ups

Design a game day for "single AZ becomes unavailable."
How is chaos different from load testing (Stage 6.6)?
When would you allow chaos experiments in production (e.g., Netflix approach)?

12.15 Terraform/IaC at scale (monorepo, drift, blast radius)

Problem

500+ resources across 20 environments, 50 engineers contributing; a bad terraform apply can take down production
State files grow large, modules proliferate, drift accumulates, and nobody knows what's actually deployed

Architecture (conceptual)

Repo structure — monorepo with modules/ (reusable) and environments/ (dev/staging/prod overlays) or polyrepo per team (Stage 8.1)
Remote state — S3 backend with state locking (DynamoDB table, or the S3-native lockfile via use_lockfile in Terraform 1.10+); separate state file per environment; never share state across envs (Stage 8.1)
CI pipeline — terraform plan on every PR, mandatory review for prod applies, terraform apply only from CI (Stage 9.2)
Module registry — versioned modules (Git sources pinned with ?ref=v1.2.3, registry modules with a version constraint), semver, changelog; consumers pin versions (Stage 8.1)
Drift detection — scheduled terraform plan -detailed-exitcode in CI; alert when the exit code is 2 (changes present); investigate manual console changes (Stage 8.1)
Policy as code — OPA/Sentinel/Checkov scan plans before apply; deny public S3, unencrypted volumes, missing tags (Stages 5.5, 8.4)
Terragrunt — DRY backend config, dependency ordering between stacks (Stage 8.1)
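
The scheduled drift check can be sketched around the exit codes of terraform plan -detailed-exitcode (0 = no changes, 1 = error, 2 = changes present). The function names and status labels below are assumptions; the sketch also assumes terraform is on PATH and the working directory is already initialized:

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report in-sync / drift / error."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduler would call check_drift once per environment directory and page or open a ticket on "drift". Note that exit code 2 only says the plan is non-empty: it cannot distinguish a manual console change from merged code that was never applied, so the alert still needs a human look.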

Key decisions

Monorepo vs polyrepo — monorepo: visibility and consistency; polyrepo: team autonomy and blast radius isolation
State granularity — one state per environment vs per service; smaller state = faster plan but more coordination
Module boundaries — too granular = versioning overhead; too coarse = tight coupling and wide blast radius
-target applies — escape hatch for emergencies; dangerous at scale, audit every use
Import vs recreate — bringing existing infra under TF management without downtime requires careful terraform import (or import blocks in Terraform 1.5+), then a plan that shows no changes (Stage 8.1)

Failure modes

State lock stuck — crashed CI job holds DynamoDB lock; blocks all applies until someone runs terraform force-unlock with the lock ID
Module breaking change — v2 module renames a resource (no moved block) or drops an input that forces replacement; terraform apply destroys and recreates production RDS
Drift undetected for months — someone changed security group in console; next apply reverts it, breaks traffic
Giant state file — plan takes 15 minutes, CI timeout, teams skip plan review
Provider upgrade — unpinned provider jumps to a new major version (e.g., v5) that changes resource defaults or schema; critical infrastructure is replaced and nobody notices in plan review
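
Several of these failures end the same way: a plan that deletes or replaces a critical resource gets applied. One guard is to scan the JSON plan (from terraform show -json on a saved plan file) before apply. A minimal sketch; the function name and the PROTECTED_TYPES list are assumptions, and a real policy would more likely live in OPA, Sentinel, or Checkov:

```python
import json

# Hypothetical list of resource types a team treats as critical.
PROTECTED_TYPES = frozenset({"aws_db_instance", "aws_rds_cluster"})


def destructive_changes(plan_json: str, protected=PROTECTED_TYPES) -> list[str]:
    """Return addresses of protected resources the plan would delete or replace.

    A replacement appears in the plan JSON as actions ["delete", "create"]
    or ["create", "delete"], so checking for "delete" covers both cases.
    """
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") in protected and "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Example with a hand-written plan fragment (not real terraform output).
sample = json.dumps({"resource_changes": [
    {"address": "module.db.aws_db_instance.main", "type": "aws_db_instance",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"actions": ["update"]}},
]})
for address in destructive_changes(sample):
    print("blocked: plan would destroy", address)
```

CI can fail the pipeline whenever the returned list is non-empty and require an explicit override for intentional replacements.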

Interview follow-ups

How do you structure Terraform modules for 50 teams with different needs?
State file is 500MB and plans take 20 minutes — what do you do?
Engineer runs terraform apply locally against prod — how do you prevent this?