DEV Community

kubeboiii

Infra Platform Engineering

Stage 1 — Linux & OS Internals

Every container, every Kubernetes component, every performance issue traces back to Linux. Start here.

Processes & threads

  • How fork() and exec() work — process creation lifecycle
  • Process states: running, sleeping (interruptible vs uninterruptible), zombie, stopped
  • What a context switch is and why it has a cost

Namespaces (this is what containers ARE)

  • pid namespace — isolated process trees, PID 1 inside a container
  • net namespace — isolated network stack (interfaces, routes, iptables)
  • mnt namespace — isolated mount points and filesystem view
  • uts namespace — isolated hostname and domain name
  • ipc namespace — isolated System V IPC, POSIX message queues
  • user namespace — UID/GID remapping, unprivileged containers
  • cgroup namespace — isolated cgroup root view
  • How to experiment: unshare, nsenter, lsns commands

Control Groups / cgroups

  • cgroups v1 vs v2 — why v2 (unified hierarchy) matters for containers
  • CPU controller: cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us (v1); cpu.weight and cpu.max (v2)
  • Memory controller: hard limit, soft limit, swap limit, memory.stat
  • How Kubernetes maps resource requests and limits to cgroup settings
  • The OOM killer — how it scores processes, why your container gets killed, /proc/<pid>/oom_score_adj
  • pids controller — preventing fork bombs in containers
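
The mapping from Kubernetes CPU requests and limits to cgroup values can be sketched as plain arithmetic. This is a minimal sketch, assuming the cgroup v1 file names listed above and the conversion commonly described for the kubelet (1024 shares per core, a 100ms CFS period); the function and constant names are illustrative, not the kubelet's API.

```go
package main

import "fmt"

// Assumed constants for this sketch (cgroup v1 semantics).
const (
	sharesPerCPU  = 1024   // cpu.shares granted per full core
	minShares     = 2      // smallest cpu.shares value the kernel accepts
	defaultPeriod = 100000 // cpu.cfs_period_us: 100ms
	minQuota      = 1000   // smallest quota this conversion emits: 1ms
)

// milliCPUToShares converts a CPU request in millicores to cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * sharesPerCPU / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota converts a CPU limit in millicores to cpu.cfs_quota_us
// for the given period. A zero limit means "no quota", returned as 0 here.
func milliCPUToQuota(milliCPU, period int64) int64 {
	if milliCPU == 0 {
		return 0
	}
	quota := milliCPU * period / 1000
	if quota < minQuota {
		return minQuota
	}
	return quota
}

func main() {
	// A container with requests.cpu=250m and limits.cpu=500m.
	fmt.Println("cpu.shares:", milliCPUToShares(250))
	fmt.Println("cpu.cfs_quota_us:", milliCPUToQuota(500, defaultPeriod))
}
```

The request only sets a relative weight under contention, while the limit becomes a hard quota per period, which is why a container can be throttled even when the node has idle CPU.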

Memory management

  • Page cache — how the kernel caches disk reads in RAM, impact on free output
  • Memory metrics: RSS vs VSZ vs PSS vs USS — why RSS is misleading in containers

File systems

  • Inodes — what they store, inode exhaustion problem
  • Bind mounts — how Kubernetes volume mounts work under the hood
  • /proc and /sys — virtual filesystems that expose kernel state

Signals and IPC

  • SIGTERM vs SIGKILL — why your app must handle SIGTERM for graceful shutdown
  • SIGCHLD — zombie process prevention, proper child reaping (PID 1 problem in containers)
  • Why PID 1 in a container needs to reap children — tini and dumb-init

System call tracing and performance

  • strace -p <pid> — trace syscalls of a running process
  • /proc/<pid>/maps, status, fd, net — per-process kernel state
  • ss and netstat — socket state inspection
  • lsof — open file descriptors per process

Stage 2 — Networking fundamentals

You cannot work on Kubernetes or Cilium, or at Cloudflare or Fastly, without a deep understanding of networking.

TCP/IP stack

  • IP addressing, subnets, CIDR notation, route tables
  • TCP handshake (SYN, SYN-ACK, ACK), teardown (FIN, TIME_WAIT)
  • TIME_WAIT storms — what causes them, why they matter at scale

DNS

  • How DNS resolution works end-to-end — recursive resolver, authoritative server
  • DNS record types: A, AAAA, CNAME, MX, TXT, PTR, SRV
  • TTL — caching, negative caching, TTL trade-offs
  • ndots setting in Linux — how it affects resolution order (critical for Kubernetes)
  • CoreDNS — how Kubernetes uses it, common misconfigurations, DNS debugging

Linux networking internals

  • Network interfaces — physical, virtual (veth), bridge, loopback, dummy
  • veth pairs — how they work, why they are used for container networking
  • Linux bridge — how it connects veth pairs (like a virtual switch)
  • conntrack — connection tracking table, how NAT works, conntrack -L

TLS and certificates

  • TLS handshake — client hello, server hello, certificate exchange, key exchange
  • Certificate chain — root CA, intermediate CA, leaf certificate
  • mTLS — mutual authentication, both sides present certificates (used in service meshes)
  • Certificate management — cert-manager in Kubernetes, Let's Encrypt, ACME protocol

Stage 3 — Go (Golang)

Go is the dominant language of the cloud-native ecosystem. Kubernetes, Prometheus, Terraform, ArgoCD, Cilium, and Vault are all written largely in Go.

Language basics

  • Packages, modules (go.mod, go.sum), workspace mode
  • Basic types, structs, interfaces, methods, pointers
  • Error handling — error interface, errors.Is(), errors.As(), wrapping errors with %w
  • Defer, panic, recover — use cases and pitfalls

Interfaces and composition

  • Implicit interface satisfaction — no implements keyword
  • Embedding structs and interfaces
  • The io.Reader / io.Writer / io.Closer interface family
  • context.Context — cancellation, deadlines, value propagation — used everywhere in infra code

Goroutines and concurrency

  • Goroutines — lightweight threads managed by the Go runtime
  • Channels — unbuffered vs buffered, direction, closing
  • select statement — multiplexing channel operations
  • Race detector — go run -race, go test -race
  • Common concurrency mistakes: goroutine leaks, channel deadlocks
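
A small worker pool shows goroutines, channels, and the two mistakes named above in one place. This is a sketch, assuming a toy workload (squaring integers); sumSquares is a made-up name.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares fans work out to a fixed number of goroutines over one channel
// and collects results on a second channel.
func sumSquares(nums []int, workers int) int {
	if workers < 1 {
		workers = 1 // with zero workers the producer below would block forever
	}
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs { // loop ends when jobs is closed
				results <- n * n
			}
		}()
	}

	// Producer: closing jobs lets the workers exit (avoids a goroutine leak).
	go func() {
		for _, n := range nums {
			jobs <- n
		}
		close(jobs)
	}()

	// Closer: close results once all workers are done, so the range below
	// terminates (forgetting this is a classic deadlock).
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4}, 3))
}
```

Running it under go run -race is a cheap way to confirm that the only shared state is the channels.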

Memory and performance

  • Stack vs heap allocation — escape analysis (go build -gcflags="-m")

Standard library for infra work

  • net/http — building HTTP servers and clients, middleware pattern
  • os/exec — running subprocesses safely
  • flag and os.Args — CLI argument parsing
  • time — duration arithmetic, ticker, timer

CLI tools and infra tooling patterns

  • Config file loading — layered config (flags > env vars > config file > defaults)
  • Writing a simple HTTP server with graceful shutdown on SIGTERM
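
The layered-config precedence above (flags > env vars > config file > defaults) reduces to "first non-empty value wins". This is a minimal sketch; the flag name listen, the variable APP_LISTEN, and the in-memory fileCfg map are hypothetical stand-ins for a real config loader.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolve returns the first non-empty value in precedence order:
// flag > environment variable > config file > default.
func resolve(flagVal, envVal, fileVal, def string) string {
	for _, v := range []string{flagVal, envVal, fileVal} {
		if v != "" {
			return v
		}
	}
	return def
}

func main() {
	// "listen" and APP_LISTEN are hypothetical names for this sketch.
	listen := flag.String("listen", "", "listen address (overrides env and file)")
	flag.Parse()

	// Stand-in for values parsed from a config file.
	fileCfg := map[string]string{"listen": ":9090"}

	addr := resolve(*listen, os.Getenv("APP_LISTEN"), fileCfg["listen"], ":8080")
	fmt.Println("listening on", addr)
}
```

One limitation of this simple version: it cannot tell "explicitly set to empty" from "unset". A real loader would track presence separately.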

Stage 4 — Kubernetes (the most important stage)

Most companies on your list either build on K8s, build for K8s, or expect you to operate it at scale.

4.1 Architecture and control plane

API server

  • Central hub — all components communicate through the API server
  • REST API — resource types, verbs (get/list/watch/create/update/patch/delete)
  • Authentication — service account tokens (JWT), kubeconfig, OIDC, certificates
  • Admission control chain — mutating admission webhooks run first, then validating
  • etcd watch — how the API server streams changes to controllers

etcd

  • Raft consensus — leader election, log replication, quorum (why 3 or 5 nodes)
  • Key-value watch API — how controllers get notified of changes
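
The "why 3 or 5 nodes" point follows from majority arithmetic, sketched below. The function names are illustrative.

```go
package main

import "fmt"

// quorum is the smallest majority of an n-member cluster.
func quorum(n int) int { return n/2 + 1 }

// tolerated is how many members can fail while a majority still exists.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d failures\n", n, quorum(n), tolerated(n))
	}
}
```

An even-sized cluster tolerates no more failures than the odd size below it (4 members tolerate 1, same as 3), so the extra member adds replication cost without adding fault tolerance.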

Scheduler

  • Scheduling cycle: filtering (predicates) → scoring (priorities) → binding
  • Predicates: NodeSelector, NodeAffinity, PodAffinity, Taints/Tolerations, resource fit

Controller manager

  • Informer pattern — List + Watch, local cache, event handlers
  • Reconcile loop — compare desired state (spec) with actual state, take action to converge
  • Key controllers: Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Endpoints
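
The reconcile loop can be sketched without any Kubernetes libraries. This is a toy model, assuming a single integer "replicas" field; State and reconcile are made-up names, and a real controller would read from an informer cache and call the API server instead of mutating a struct.

```go
package main

import "fmt"

// State is a toy stand-in for a resource: desired comes from spec,
// actual is what was observed.
type State struct {
	DesiredReplicas int
	ActualReplicas  int
}

// reconcile takes one step toward the desired state and reports whether
// another pass is needed. It is level-triggered: it looks only at the
// current state, not at which event caused the call.
func reconcile(s *State) (requeue bool, action string) {
	switch {
	case s.ActualReplicas < s.DesiredReplicas:
		s.ActualReplicas++
		return true, "create pod"
	case s.ActualReplicas > s.DesiredReplicas:
		s.ActualReplicas--
		return true, "delete pod"
	default:
		return false, "in sync"
	}
}

func main() {
	s := &State{DesiredReplicas: 3, ActualReplicas: 1}
	for {
		requeue, action := reconcile(s)
		fmt.Printf("%s (actual=%d)\n", action, s.ActualReplicas)
		if !requeue {
			break
		}
	}
}
```

Because each pass is idempotent, it is safe to run it again after a crash or a missed event, which is the property that makes controllers robust.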

kubelet

  • Watches the API server for pods assigned to its node
  • CRI (Container Runtime Interface) — how kubelet talks to containerd or CRI-O
  • Pod lifecycle: pending → pulling image → creating container → running → terminating
  • Liveness vs readiness vs startup probes — how they work, when each probe type fails
  • Eviction — memory pressure, disk pressure, node conditions

kube-proxy

  • iptables mode — creates DNAT rules in KUBE-SERVICES chain for every Service
  • How ClusterIP services work — virtual IP that only exists in iptables/IPVS rules

4.2 Workloads and objects

Core objects you must know cold

  • Pod — smallest deployable unit, spec fields, container lifecycle hooks, init containers
  • Deployment — rolling update strategy, maxSurge, maxUnavailable, rollback
  • StatefulSet — stable network identity, ordered deployment, persistent volume claims
  • DaemonSet — one pod per node, use cases (log shippers, monitoring agents, CNI plugins)
  • Job and CronJob — completions, parallelism, failure handling, cron schedule format

Networking objects

  • Endpoints and EndpointSlices — how services know which pods to route to
  • Ingress — host/path-based routing, TLS termination, ingress controllers (Nginx, Traefik)
  • Network Policy — ingress/egress rules, podSelector, namespaceSelector, default deny

Storage objects

  • PersistentVolume (PV) and PersistentVolumeClaim (PVC) — static vs dynamic provisioning
  • StorageClass — provisioner, reclaim policy, volume binding mode
  • CSI — plugin interface for dynamic storage provisioning in Kubernetes

Resource management

  • requests vs limits — requests used for scheduling, limits enforced by cgroups
  • QoS classes: Guaranteed (every container sets CPU and memory requests equal to limits), Burstable (at least one request or limit set, but not Guaranteed), BestEffort (no requests or limits)
  • LimitRange — default limits/requests for a namespace
  • ResourceQuota — total resource budget for a namespace
  • PodDisruptionBudget (PDB) — minimum available pods during voluntary disruptions
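
QoS classification is a pure function of the containers' requests and limits, which the sketch below models. It is a simplification, assuming plain integers instead of the Kubernetes API types; Resources and qosClass are made-up names.

```go
package main

import "fmt"

// Resources holds one container's CPU (millicores) and memory (bytes).
// Zero means "not set". Simplified model, not the Kubernetes API types.
type Resources struct {
	CPURequest, CPULimit int64
	MemRequest, MemLimit int64
}

// qosClass derives the pod-level QoS class from all of its containers.
func qosClass(containers []Resources) string {
	anySet := false
	guaranteed := len(containers) > 0
	for _, c := range containers {
		// An unset request defaults to the limit when a limit is given.
		cpuReq, memReq := c.CPURequest, c.MemRequest
		if cpuReq == 0 {
			cpuReq = c.CPULimit
		}
		if memReq == 0 {
			memReq = c.MemLimit
		}
		if cpuReq != 0 || memReq != 0 {
			anySet = true
		}
		// Guaranteed needs both limits set and equal to the requests.
		if c.CPULimit == 0 || c.MemLimit == 0 || cpuReq != c.CPULimit || memReq != c.MemLimit {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	gib := int64(1 << 30)
	fmt.Println(qosClass([]Resources{{500, 500, gib, gib}}))
	fmt.Println(qosClass([]Resources{{250, 500, gib, gib}}))
	fmt.Println(qosClass([]Resources{{}}))
}
```

The class matters under node memory pressure: BestEffort pods are evicted first and Guaranteed pods last.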

4.3 Kubernetes operators (concepts — implementation detail in deep-dive file)

  • Operators extend Kubernetes with custom resources (CRDs) and controllers that reconcile desired vs actual state
  • CRD — custom API object stored in etcd; separates spec (desired) from status (observed)
  • Reconcile loop — compare spec to reality; create/update/delete until they match
  • Finalizers — block deletion until cleanup (e.g. snapshot before delete) completes
  • Full controller-runtime, webhooks, and operator patterns → see deep-dive file

4.4 Kubernetes networking

CNI (Container Network Interface)

  • IPAM (IP Address Management) — how pods get IPs

Pod-to-pod networking

  • Each pod gets its own network namespace
  • veth pair — one end in pod namespace, one end in host namespace
  • Linux bridge (cbr0 or similar) — connects all veth pairs on a node
  • How packets travel between pods on the same node vs different nodes
  • Overlay networks — VXLAN encapsulation for cross-node traffic

Services and kube-proxy

  • How ClusterIP works — DNS → ClusterIP → iptables DNAT → pod IP
  • NodePort — how traffic enters the cluster from outside

DNS in Kubernetes

  • CoreDNS deployment — Deployment with 2 replicas, kube-dns Service
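  • DNS names and search path — a Service resolves as <svc>, <svc>.<ns>, or <svc>.<ns>.svc.cluster.local; the pod's resolv.conf search list is <ns>.svc.cluster.local, svc.cluster.local, cluster.local
  • ndots:5 — names with fewer than 5 dots are tried against every search domain first, so an external name costs several failed lookups before it resolves (latency issue)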
  • Headless services — no ClusterIP, DNS returns pod IPs directly (used by StatefulSets)
  • DNS debugging — kubectl exec into a pod, use nslookup, dig, check CoreDNS logs
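
The effect of ndots on lookup order can be reproduced with a few lines of string handling. This is a sketch of glibc-style stub resolver behavior (other resolvers, such as musl's, differ in details); candidates is a made-up name and the search list assumes a pod in the default namespace with the default cluster domain.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the names a glibc-style stub resolver would try, in order.
func candidates(name string, ndots int, search []string) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // fully qualified: the search list is skipped
	}
	var viaSearch []string
	for _, domain := range search {
		viaSearch = append(viaSearch, name+"."+domain+".")
	}
	asIs := name + "."
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as given first, then the search list.
		return append([]string{asIs}, viaSearch...)
	}
	// Too few dots: walk the search list first, the literal name last.
	return append(viaSearch, asIs)
}

func main() {
	// Assumed search list for a pod in the "default" namespace.
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, q := range candidates("api.example.com", 5, search) {
		fmt.Println(q)
	}
}
```

With this search list, api.example.com is queried three times under cluster suffixes before the real name is tried. Writing the name with a trailing dot, or lowering ndots in the pod's dnsConfig, avoids the wasted lookups.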

4.5 Autoscaling and resource optimization

Horizontal Pod Autoscaler (HPA)

  • Metrics server — provides CPU/memory metrics from kubelet
  • HPA control loop — target metric value, current metric value, desired replicas formula
  • Custom and external metrics — KEDA for event-driven scaling
  • Stabilization window — prevents flapping (scale-down slower than scale-up)
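
The desired-replicas formula in the HPA control loop is a ratio and a ceiling. This is a minimal sketch of the documented formula, assuming the default 10% tolerance and a positive target; it ignores min/max replicas, pod readiness, and the stabilization window.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the assumed default: ratios within 10% of 1.0 cause no change.
const tolerance = 0.1

// desiredReplicas applies ceil(current * currentMetric / targetMetric).
// targetMetric is assumed to be greater than zero.
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	ratio := currentMetric / targetMetric
	if math.Abs(ratio-1.0) <= tolerance {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// 4 replicas averaging 90% CPU against a 60% target.
	fmt.Println(desiredReplicas(4, 90, 60))
	// 4 replicas at 63%: inside the tolerance band, so no change.
	fmt.Println(desiredReplicas(4, 63, 60))
}
```

Rounding up means the HPA errs toward over-provisioning, and the tolerance band keeps small metric noise from triggering a scale event.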

Cluster Autoscaler (CA)

  • Scale-up trigger — unschedulable pods (Pending state)
  • Scale-down trigger — underutilized nodes for 10 minutes (default)
  • Node groups — CA works with cloud provider node groups (ASGs in AWS)
  • CA and PDBs — CA respects PodDisruptionBudgets during scale-down
  • Safe-to-evict annotation — cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

KEDA (Kubernetes Event-Driven Autoscaling)

  • ScaledObject CRD — links a workload to a scaler
  • Built-in scalers — Kafka consumer lag, queue depth, Prometheus metrics, cron
  • Scale to zero — KEDA can scale deployments down to 0 (HPA cannot by default)

4.6 Cloud-native Kubernetes

Platform teams operate K8s on top of cloud infrastructure. You need to understand what the cloud layer provides.

Managed control planes (EKS / GKE / AKS)

  • Who owns etcd — the cloud provider manages the control plane; you manage worker nodes and workloads
  • API server endpoint — public vs private endpoint, implications for CI/CD and developer access

VPC and networking

  • Public vs private subnets — worker nodes typically in private subnets, NAT gateway for egress
  • Pod CIDR vs node subnet vs service CIDR — three separate address spaces that must not overlap
  • Cloud load balancers — ALB/NLB/GCLB mapping to LoadBalancer Service type
  • externalTrafficPolicy: Local vs Cluster — source IP preservation and health check trade-offs on cloud LBs
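
The "must not overlap" rule for the three address spaces can be checked mechanically with the standard net/netip package (Go 1.18+). This is a sketch; the CIDR values are made-up examples, and named and findOverlaps are illustrative names.

```go
package main

import (
	"fmt"
	"net/netip"
)

// named pairs a label with a CIDR string.
type named struct{ Name, CIDR string }

// findOverlaps reports every pair of ranges that share addresses.
func findOverlaps(ranges []named) ([]string, error) {
	prefixes := make([]netip.Prefix, len(ranges))
	for i, r := range ranges {
		p, err := netip.ParsePrefix(r.CIDR)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", r.Name, err)
		}
		prefixes[i] = p.Masked() // normalize, e.g. 10.0.0.5/16 -> 10.0.0.0/16
	}
	var overlaps []string
	for i := range prefixes {
		for j := i + 1; j < len(prefixes); j++ {
			if prefixes[i].Overlaps(prefixes[j]) {
				overlaps = append(overlaps, ranges[i].Name+" overlaps "+ranges[j].Name)
			}
		}
	}
	return overlaps, nil
}

func main() {
	// Hypothetical plan: the pod range falls inside the service /12.
	plan := []named{
		{"node", "10.0.0.0/16"},
		{"pod", "10.100.0.0/16"},
		{"service", "10.96.0.0/12"},
	}
	overlaps, err := findOverlaps(plan)
	fmt.Println(overlaps, err)
}
```

A check like this fits naturally into a pre-apply validation step, since changing a cluster's pod or service CIDR after creation is usually disruptive.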

Cloud IAM and workload identity

  • IAM roles, policies, trust relationships — who can assume what, least-privilege policy design
  • AWS IRSA — OIDC provider on cluster, annotated ServiceAccount, projected token → STS AssumeRoleWithWebIdentity
  • GCP Workload Identity — Kubernetes SA bound to GCP SA, no long-lived keys on nodes

Managed services vs in-cluster

  • When to use RDS/Aurora vs self-hosted Postgres in K8s — ops burden, HA, backups, patching
  • ElastiCache/Memorystore vs Redis Cluster in K8s — same trade-off for caching
  • Object storage (S3/GCS) — Loki/Thanos blocks, Terraform state, CI artifacts, backup targets

Cloud DNS and certificates

  • ACM / Google-managed certs — integration with cloud load balancers and Ingress

4.7 Service mesh and gateways (awareness — detail in deep-dive file)

  • North-south — traffic from outside the cluster (ingress, TLS termination)
  • East-west — service-to-service traffic inside the cluster
  • Service mesh — sidecar proxies add mTLS, traffic splitting, and observability between services
  • Default-deny NetworkPolicy — baseline for multi-tenant clusters; explicitly allow required paths
  • Envoy, Istio, Gateway API, API gateways → see deep-dive file

Stage 5 — Container Security

Critical for Aqua Security, Snyk, Chainguard. Also tested at GitLab, Harness, Datadog.

5.1 Container image security

Image layers and attack surface

  • How Docker image layers work — each filesystem-changing instruction (RUN, COPY, ADD) creates a layer
  • Base image choice — Alpine vs Debian vs distroless vs scratch
  • Distroless images — no shell, no package manager, minimal attack surface
  • Multi-stage builds — only copy the binary into the final stage, discard build tools

Vulnerability scanning

  • Static scanning tools — Trivy, Grype, Snyk Container, Clair
  • What scanners check — OS packages, language dependencies, Dockerfile misconfigs
  • CVE prioritization — severity (CVSS score), exploitability, reachability
  • Base image updates — automated PRs to update base images (Renovate, Dependabot)
  • Scanning in CI — fail the pipeline on critical/high CVEs, policy as code

Software Bill of Materials (SBOM)

  • What an SBOM is — list of all components in a software artifact
  • Generating SBOMs — Syft, docker sbom; attach as a signed attestation with cosign attest

5.2 Supply chain security

The problem

  • SolarWinds attack — build system compromise, malicious code injected into signed artifacts
  • Log4Shell — transitive dependency vulnerability, hard to find without SBOMs
  • XZ Utils backdoor — malicious maintainer, social engineering, compromised source
  • The threat model — compromised build system, malicious dependency, typosquatting

SLSA framework (Supply chain Levels for Software Artifacts)

  • SLSA Level 1 — provenance document exists
  • Provenance — who built the artifact, from what source, on what system, with what inputs

Sigstore stack

  • Cosign — signs container images and other OCI artifacts
  • Keyless signing — short-lived certificate from Fulcio CA, no long-lived private keys

5.3 Kubernetes RBAC and access control

RBAC model

  • Role (namespace-scoped) vs ClusterRole (cluster-scoped)
  • RoleBinding vs ClusterRoleBinding
  • Subjects: ServiceAccount, User, Group

Least privilege patterns

  • Never use cluster-admin for application workloads
  • Namespace-scoped service accounts for every workload
  • Projected service account tokens — short-lived, audience-bound, auto-rotated

Pod security

  • Pod Security Standards — Privileged, Baseline, Restricted profiles
  • Pod Security Admission controller — enforces standards at namespace level
  • Security context — runAsNonRoot, runAsUser, readOnlyRootFilesystem, allowPrivilegeEscalation: false

5.5 Cloud security (essentials — detail in deep-dive file)

  • IAM — least-privilege roles and policies; no long-lived keys on nodes or in CI
  • Encryption — at rest (disks, S3) and in transit (TLS)
  • Audit logs — CloudTrail / cloud audit logs for who changed what
  • Permission boundaries, WAF, GuardDuty, compliance frameworks → deep-dive file

Stage 6 — Observability

Core product domain for Datadog, Grafana Labs, New Relic, Splunk.

6.1 Metrics and Prometheus

Prometheus data model

  • Time series — metric name + label set + sequence of (timestamp, float64) samples
  • Label cardinality — why high-cardinality labels (user_id, request_id) cause OOM
  • Metric types:
    • Counter — monotonically increasing (requests total, errors total)
    • Gauge — can go up and down (memory usage, queue depth, temperature)
    • Histogram — distribution of values in configurable buckets (request duration, response size)
    • Summary — pre-calculated quantiles on client side (avoid if possible — not aggregatable)

PromQL

  • Instant vector vs range vector — http_requests_total vs http_requests_total[5m]
  • rate() — per-second rate of a counter over a range (use for counters, not gauges)
  • increase() — total increase in a counter over a range
  • sum by(), avg by(), max by() — aggregation operators, label dropping
  • histogram_quantile() — calculate p50/p95/p99 from histogram buckets
  • Alerting rules — for duration, labels, annotations, Alertmanager integration

6.2 Distributed tracing

Tracing concepts

  • Trace — end-to-end record of a request through a distributed system
  • Span — a single unit of work within a trace (one service call, one DB query)
  • Parent-child span relationship — forms a tree structure (the trace)
  • Trace context propagation — W3C traceparent header, B3 headers

OpenTelemetry (OTel)

  • Exporters — OTLP (preferred), Jaeger, Zipkin, Prometheus
  • OTel Collector — receives spans/metrics/logs, processes them, exports to backends

Sampling strategies

  • Head sampling — decision made at trace start (random %, or rules on attributes known up front; it cannot know yet whether the trace will contain an error)
  • Tail sampling — decision made after seeing the full trace (can sample based on error, latency)

6.3 Logging

Log shipping pipeline

  • Log sources — container stdout/stderr (collected by node agent), application log files
  • DaemonSet log agents — Fluent Bit (lightweight), Fluentd (more plugins), Vector (Rust-based)
  • Structured logging — JSON logs, consistent field names, log levels

Grafana Loki

  • Loki's key design decision — indexes only labels (like Prometheus), not log content
  • Why this matters — much cheaper to store and index than Elasticsearch-style full-text index
  • Log streams — a stream is a set of logs with the same label set (like a Prometheus time series)
  • LogQL — log query language, filter expressions {app="nginx"} |= "error", metric queries

6.4 SLOs and alerting

SLI/SLO/SLA

  • SLI (Service Level Indicator) — the metric you measure (e.g., error rate, latency p99)
  • SLO (Service Level Objective) — the target (e.g., 99.9% of requests under 200ms)
  • Error budget — time you can be non-compliant (0.1% of 30 days = 43.2 minutes)
  • Error budget burn rate — how fast you are consuming the error budget

Multi-window burn rate alerts

  • Alertmanager integration — Prometheus rules send alerts to Alertmanager
  • Short window (5 min) + long window (1 hour) — two-condition alert to reduce false positives
  • Routing trees — route alerts to correct team based on labels

6.5 SRE practices

Stage 6.4 covers SLO metrics and alerting. This section covers how platform/SRE teams operate.

Incident management

  • Severity levels (SEV1–SEV4) — customer impact, response time expectations
  • Incident commander role — coordinates response, comms, decision-making
  • Incident lifecycle — detect → triage → mitigate → resolve → postmortem
  • Status pages and stakeholder comms — internal vs external, update cadence
  • Runbooks — symptom-based (not cause-based), links to dashboards and remediation steps

On-call and alert quality

  • Alert design — page on symptoms (SLO burn, user-facing errors), not causes (CPU high)
  • On-call rotation — follow-the-sun, escalation policies, handoff rituals
  • Toil — repetitive manual work; measure and automate (platform team's core mandate)
  • Error budget policy — when budget is exhausted, freeze features, focus on reliability

Reliability engineering

  • Application-level patterns — timeouts, retries, circuit breakers, idempotency (Stage 7.5)
  • Capacity planning — headroom targets, load testing before launches, saturation metrics (USE method, Stage 6.6)
  • Failure domain isolation — blast radius, multi-AZ/region design (Stage 4.6)

Disaster recovery and resilience

  • Backup strategy beyond etcd — application data, cross-region replication, restore drills
  • Multi-AZ vs multi-region — zone failure tolerance vs region failure tolerance
  • Game days and chaos engineering — Litmus/Chaos Mesh: pod kill, network partition, AZ failure

Postmortems

  • Blameless culture — focus on systems and process, not individuals
  • Timeline, contributing factors (not root cause singular), action items with owners
  • Follow-through — track action items to completion, review in subsequent incidents

6.6 Performance engineering

Unifies performance concepts scattered across Stages 1, 3, 6, and 10 into a methodology.

Performance methodology

  • Define the goal first — latency vs throughput vs tail behavior vs cost
  • Measure before optimizing — establish baseline with load tests and production metrics
  • One change at a time — isolate variables; validate with before/after comparison

Throughput vs latency

  • Why higher throughput often worsens tail latency — queue buildup under saturation
  • Concurrency limits — connection pools, worker counts, HPA max replicas as backpressure levers
  • Backpressure — propagate slowness upstream instead of buffering indefinitely

Latency analysis and percentiles

  • p50 (median) vs p95 vs p99 vs p999 — why averages lie; tail latency drives user experience
  • Histogram buckets in Prometheus — choose bucket boundaries for your SLO thresholds (Stage 6.1)
  • Why Summary metrics are problematic — pre-computed quantiles on client side are not aggregatable
  • RED method — Rate, Errors, Duration (for request-driven services)
  • USE method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)

Finding bottlenecks

  • Layered diagnosis — app → pod (cgroup metrics) → node (vmstat, iostat) → network → control plane
  • Go-specific — pprof CPU/heap profiles, GC pause analysis, GOGC tuning (Stage 3)
  • Database — slow query logs, connection pool exhaustion, replication lag (Stage 7)

Load testing

  • Test types — smoke, load (steady state), stress (find breaking point), spike, soak (memory leaks)
  • What platform teams validate — HPA response time, CA scale-up latency, PDB behavior under drain, ingress capacity
  • Warm-up period — exclude from measurements; run long enough for GC and caches to stabilize
  • Production-like data volume and cardinality — load test observability pipeline too (Stage 6.1 cardinality)

Caching and batching

  • Cache hierarchy — CDN edge (Stage 10.3) → Redis (Stage 7.3) → application in-memory
  • Connection pooling — DB pools, HTTP keep-alive; file descriptor and cgroup limits (Stage 1)

Stage 7 — Distributed systems and databases

Critical for CockroachDB, YugabyteDB, PlanetScale, ScyllaDB, Snowflake, Redis.

7.1 Distributed systems theory

Fundamental problems

CAP theorem

  • Consistency — every read sees the most recent write
  • Availability — every request gets a response (not necessarily the most recent data)
  • Partition tolerance — system works despite network partitions
  • Real-world: CP systems (ZooKeeper, etcd, CockroachDB), AP systems (Cassandra, DynamoDB)

PACELC theorem

  • Extends CAP — during a Partition (P), choose Availability (A) or Consistency (C); Else (E), with no partition, trade off Latency (L) against Consistency (C)
  • More practical than CAP for comparing real databases

Consistency levels

  • Strong consistency / linearizability — operations appear instantaneous, globally ordered
  • Eventual consistency — replicas will converge eventually, reads may be stale

Consensus algorithms

  • Raft — designed for understandability, used in etcd, CockroachDB, TiKV
    • Leader election — candidates request votes, majority wins, term numbers

Replication patterns

  • Single-leader — all writes go to leader, replicated to followers (PostgreSQL, MySQL)
  • Leaderless (Dynamo-style) — any node accepts writes, quorum reads/writes (Cassandra)

Clocks in distributed systems

  • Physical clocks — NTP sync, still have drift, clock_gettime()

7.2 Distributed SQL (CockroachDB / YugabyteDB)

Architecture

  • YugabyteDB — similar model, supports PostgreSQL and Cassandra APIs, DocDB storage layer

Distributed transactions

  • MVCC (Multi-Version Concurrency Control) — every write creates a new version, readers see a consistent snapshot

Schema changes

Geo-distribution

  • Region/zone topology — replicas placed in different regions/zones

7.3 Redis internals

Data structures and their implementations

Persistence

  • RDB snapshot — BGSAVE forks the process, child writes snapshot using CoW, parent continues serving
  • AOF (Append-Only File) — logs every write command, fsync policies: always, everysec, no
  • Hybrid persistence — RDB + AOF combined, AOF replays only since last RDB snapshot
  • No persistence mode — pure cache, data loss on restart acceptable

Replication

  • REPLICAOF — replica connects to master, full sync (RDB transfer) then partial sync
  • Replica lag — INFO replication shows master_repl_offset vs replica offset

Redis Cluster

Eviction policies

  • noeviction — return error when maxmemory hit
  • allkeys-lru — evict any key using LRU approximation
  • volatile-lru — evict only keys with TTL set, using LRU
  • allkeys-lfu — evict least frequently used keys (better for skewed access patterns)

7.5 Backend patterns for platform engineers

Platform teams build controllers, webhooks, internal APIs, and golden-path services. These patterns apply.

API design and gRPC

  • REST vs gRPC — REST for human-facing/admin APIs; gRPC for high-performance internal service-to-service
  • Deadlines and cancellation — context.Context propagation, client-side timeouts (Stage 3)
  • API versioning — URL path vs header vs protobuf package; deprecation policy
  • Idempotent APIs — safe retries; PUT and DELETE are idempotent by definition, POST needs an idempotency key for create operations

PostgreSQL fundamentals

  • MVCC — multi-version concurrency control, snapshots, vacuum, bloat
  • Indexes — B-tree (default), partial indexes, covering indexes, when indexes hurt writes
  • Connection limits — max_connections, connection pooling (PgBouncer), pool sizing vs pod count
  • Replication — streaming replication, replication lag, synchronous vs asynchronous
  • Isolation levels — Read Committed (default), Repeatable Read, Serializable
  • Foundation for CockroachDB/Yugabyte (Stage 7.2) and PlanetScale/Vitess (Stage 11)

Message queues and event streaming

  • Kafka fundamentals — topics, partitions, consumer groups, offset commits, consumer lag
  • Delivery semantics — at-most-once, at-least-once, exactly-once (idempotent consumers + transactions)
  • Dead-letter queues (DLQ) — poison messages, retry policies, manual inspection
  • When to use what — Kafka (high-throughput log), SQS/RabbitMQ (task queues), NATS (low-latency pub/sub)
  • KEDA integration — scale on Kafka consumer lag (Stage 4.5)

Reliability patterns in application code

  • Timeouts on every outbound call — HTTP clients, DB queries, gRPC deadlines
  • Retries with exponential backoff and jitter — max attempts, retry only on idempotent operations
  • Circuit breakers — open/half-open/closed states, failure threshold, recovery probe
  • Health checks — liveness (restart if broken) vs readiness (stop sending traffic) vs startup (Stage 4.1)

Caching and background work

  • Cache-aside vs read-through vs write-through — invalidation strategies, TTL design
  • Cache stampede protection — single-flight, lock-based refresh
  • Background jobs — Job vs long-running Deployment worker in K8s (Stage 4.2)

7.6 Data migration strategies

Deployment (Stage 9.4) ships code; data migration moves state. These are separate problems.

Expand-contract pattern

  • Expand — add new column/table/API field (backward compatible, old code still works)
  • Migrate — backfill data, dual-read or dual-write during transition
  • Contract — remove old column/table/API field once all code uses new path
  • Why it matters — enables zero-downtime deploys with rolling updates (Stage 9.4)

Dual writes and reconciliation

  • Write to old and new systems simultaneously during transition
  • Reconciliation job — compare old vs new, fix drift, idempotency required
  • Risk — inconsistency window if one write succeeds and the other fails; needs compensating transactions

Change Data Capture (CDC)

  • CDC tools — Debezium, AWS DMS, Maxwell — stream DB changes to Kafka/message bus
  • Use cases — real-time replication, event-driven architecture, incremental migration
  • Initial snapshot + streaming — full load then switch to binlog/WAL streaming

Online schema migrations

  • Expand-contract for indexes — create index concurrently, swap in application
  • Migration ordering — schema before code (expand) or code before schema (contract) depending on direction

Cutover and verification

  • Traffic shifting — percentage-based cutover, instant rollback if error rate spikes
  • Backfill throttling — rate-limit backfill to protect production DB performance
  • Rollback plan — can you revert if cutover fails? How long is old system kept warm?

Stage 8 — Infrastructure as Code

HashiCorp and Pulumi are on your list. IaC is also tested at almost every other company.

8.1 Terraform

Core concepts

  • HCL (HashiCorp Configuration Language) — declarative configuration language
  • Provider — plugin that manages a specific API (AWS, GCP, Kubernetes, Vault)
  • Resource — infrastructure object managed by Terraform
  • Data source — read-only reference to existing infrastructure
  • Output — export values from a configuration
  • Variable — input values, with type constraints and validation

State management

  • State file (terraform.tfstate) — JSON file recording current state of all managed resources
  • Remote backends — S3 + DynamoDB (locking), Terraform Cloud, GCS
  • State locking — prevents concurrent applies, DynamoDB table for distributed lock
  • terraform import — bring existing infrastructure under Terraform management
  • State drift — real world diverges from state, terraform plan detects this

Plan and apply lifecycle

  • Dependency graph — Terraform builds a DAG of all resources and their dependencies
  • create_before_destroy lifecycle meta-argument — zero-downtime replacements
  • prevent_destroy — protect critical resources from accidental deletion
  • Targeted applies — terraform apply -target=aws_instance.foo (use sparingly)

Modules

  • Module structure — main.tf, variables.tf, outputs.tf
  • Module versioning — source from Terraform Registry, GitHub with ?ref=v1.2.3
  • Module composition patterns — root module calls child modules

8.2 Pulumi (awareness)

  • Alternative to Terraform — define infrastructure in TypeScript, Python, or Go instead of HCL
  • Same plan/apply/state model; details in deep-dive file when you use it

8.3 HashiCorp Vault

Architecture

  • Core + storage backend — Vault core is stateless, all state in storage (Raft integrated or external like Consul)
  • Auto-unseal — use cloud KMS (AWS KMS, GCP KMS) to automatically unseal on restart

Auth methods

  • Kubernetes auth — pod presents service account token, Vault validates with K8s API server
  • AWS IAM auth — use IAM role/instance profile to authenticate
  • OIDC/JWT — integrate with any OIDC provider (GitHub Actions, GitLab CI)

Secret engines

  • KV v2 — versioned key-value store, soft delete, max_versions per key
  • Dynamic secrets — Vault generates credentials on-demand (DB passwords, AWS keys, certificates)
  • Database secret engine — Vault creates a DB user, returns credentials, auto-revokes on lease expiry

Vault Agent

  • Sidecar pattern — runs alongside your app, authenticates to Vault, writes secrets to file
  • Kubernetes Vault Agent Injector — annotate pods, sidecar is automatically injected

8.4 FinOps for platform teams

Connects autoscaling (Stage 4.5), cloud infrastructure (Stage 4.6), and IaC (Stages 8.1–8.2).

Cost visibility and allocation

  • Tagging strategy — mandatory tags: team, environment, service, cost-center
  • Showback vs chargeback — visibility to teams vs actual billing
  • Cost per namespace / per cluster / per service — Kubecost, CloudHealth, native cloud cost explorer
  • Unit economics — cost per request, cost per GB ingested, cost per tenant
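
Showback and unit economics are mostly arithmetic. A minimal sketch, assuming usage is already measured per team (for example CPU-hours from a cost tool); the function names and figures are hypothetical.

```python
def allocate_shared_cost(total_cost, usage_by_team):
    """Split a shared bill proportionally to each team's measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to allocate cost")
    return {
        team: total_cost * used / total_usage
        for team, used in usage_by_team.items()
    }


def cost_per_request(monthly_cost, monthly_requests):
    """Unit cost: what one request costs to serve."""
    if monthly_requests <= 0:
        raise ValueError("request count must be positive")
    return monthly_cost / monthly_requests
```

Tracking `cost_per_request` over time is more useful than the raw bill, because the bill is expected to grow with traffic while the unit cost should stay flat or fall.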

Compute optimization

  • Rightsizing — VPA recommendations (Stage 4.5), instance type selection, CPU/memory fit
  • Spot / preemptible nodes — cost savings vs interruption risk, taints/tolerations for fault-tolerant workloads
  • Cluster Autoscaler price expander — prefer cheaper node groups (Stage 4.5)
  • Idle resource detection — orphaned volumes, unused load balancers, over-provisioned node groups
  • HPA min replicas — don't run 10 replicas at 3am if traffic allows 2

Storage and data costs

  • Object storage lifecycle policies — S3 Intelligent-Tiering, Glacier for old Loki/Thanos blocks
  • Persistent volume sizing — right-size PVCs, storage class selection (gp3 vs io2)
  • Log and metrics retention — shorter retention = lower cost (Stage 6); cardinality = cost (Stage 6.1)
  • Egress costs — cross-AZ, cross-region, internet egress; design to minimize (CDN, PrivateLink)

FinOps in IaC and CI/CD

  • Cost estimation in PRs — Infracost, Terraform plan cost diff
  • Policy as code — deny expensive instance types, enforce tagging in Terraform/Kyverno
  • Environment lifecycle — tear down ephemeral preview environments (Stage 9.3), scheduled shutdown of dev clusters
  • Reserved instances / savings plans vs on-demand — when commitment makes sense

Stage 9 — CI/CD, GitOps and Developer Platforms

GitLab, Harness, and CircleCI are on your list. GitOps is expected everywhere.

9.1 GitOps with ArgoCD

GitOps principles

  • Git as the single source of truth for desired state
  • Declarative — desired state expressed as files, not imperative commands
  • Automated reconciliation — controller continuously syncs actual state to desired state
  • Auditability — every change is a Git commit with author, timestamp, diff
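
Automated reconciliation boils down to diffing desired state against live state and acting on the difference. A minimal sketch with plain dicts standing in for manifests; `plan_sync` and `apply_actions` are hypothetical names, not ArgoCD APIs.

```python
def plan_sync(desired, live):
    """Return the actions needed to make live state match desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            actions.append(("prune", name))  # exists in cluster, not in Git
    return actions


def apply_actions(desired, live, actions):
    """Apply a plan to the live state in place."""
    for verb, name in actions:
        if verb == "prune":
            live.pop(name, None)
        else:
            live[name] = desired[name]
```

Running `plan_sync` again after applying returns an empty plan, which is the property a controller relies on when it loops forever.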

ArgoCD architecture

  • Application CRD — defines source (Git repo/path) and destination (cluster/namespace)
  • Application controller — watches Applications, compares live state with desired state (Git)
  • Repo server — clones Git repos, renders Helm/Kustomize/Jsonnet manifests
  • API server — serves gRPC and REST API, handles sync triggers

App-of-apps pattern

  • A parent Application whose manifests are themselves child Application resources; enables managing hundreds of apps from a single Git repo

Multi-cluster GitOps

  • Cluster credentials — stored as Secrets in ArgoCD namespace
  • Progressive delivery across clusters — sync to dev → staging → prod with approvals

Secrets in GitOps

  • External Secrets Operator — CRD points to Vault/AWS Secrets Manager, controller creates K8s Secret

9.2 CI/CD pipeline engineering

Pipeline concepts

  • DAG execution — stages/steps as a directed acyclic graph, parallel by default
  • Artifact passing — how outputs of one stage become inputs of the next
  • Build cache — Docker layer cache, language-specific caches (Go module cache, npm cache)
  • Pipeline triggers — push, MR/PR, schedule, API trigger, upstream pipeline

GitLab CI specifics

  • .gitlab-ci.yml — pipeline definition, stages, jobs, rules, needs
  • GitLab Runner — the agent that executes jobs, registered to a GitLab instance
  • Executor types — Shell, Docker, Kubernetes (most scalable)
  • Kubernetes executor — creates a pod per job, ephemeral, configurable resources
  • Caching — cache: key with hash of lock file, stored in S3 or runner local cache
  • Artifacts — artifacts: paths persisted and passed between jobs/stages
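
The "cache key from a hash of the lock file" idea can be illustrated as follows. GitLab computes this for you when `cache:key:files` is set; the sketch below is not GitLab's implementation, and the `cache_key` helper and prefix are hypothetical.

```python
import hashlib


def cache_key(prefix, lock_file_bytes):
    """Derive a cache key that changes only when the lock file content changes."""
    digest = hashlib.sha256(lock_file_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"
```

Because the key depends only on the lock file content, unrelated commits reuse the same cache, and a dependency bump automatically starts a fresh one.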

Security in CI/CD pipelines

  • SAST scanning — GitLab Auto DevOps, Semgrep, CodeQL
  • SCA (Software Composition Analysis) — Snyk, Trivy, Grype
  • Container scanning — scan image after build, before push
  • Secret detection — gitleaks, trufflehog, GitLab secret detection

9.3 Internal Developer Platform (IDP)

The "platform engineering" product layer — what app teams interact with daily.

Platform as a product

  • Internal customers — application developers, data engineers, ML engineers
  • Golden paths — opinionated, supported, easy way to do the right thing
  • Self-service vs guardrails — developers provision infra within policy boundaries

Developer portal and service catalog

  • Service catalog metadata — owner, on-call rotation, dependencies, SLOs, runbooks
  • Scaffolder templates — "Create microservice" → repo + CI + Dockerfile + K8s manifests + monitoring + RBAC
  • TechDocs — docs-as-code in the repo, rendered in the portal

Golden path templates

  • What a complete template includes — Git repo, .gitlab-ci.yml, container build, image signing (Stage 5.2), GitOps manifest (Stage 9.1), Prometheus alerts (Stage 6), NetworkPolicy (Stage 5.3)
  • Template versioning — upgrade path when platform standards change

Environment management

  • Dev / staging / prod promotion — GitOps sync waves across clusters (Stage 9.1)
  • Ephemeral environments — preview apps per MR (Stage 9.2), namespace-per-branch, TTL-based cleanup
  • Environment parity — same Helm chart, different values; avoid snowflake environments

Artifact management

  • Container registries — ECR, GCR, Harbor; image retention policies, vulnerability scan gates (Stage 5.1)
  • SBOM and provenance storage — attach to images in registry (Stage 5.2)

Policy in the delivery path

  • Shift-left security — scan in CI before merge (Stage 9.2)
  • Admission control at deploy — Kyverno/Gatekeeper enforce standards (Stage 5.3)
  • Policy exceptions — audit mode, break-glass with approval workflow

9.4 Deployment and release strategies

How to ship changes safely. Coordinate with data migrations (Stage 7.6) and SLOs (Stage 6.4).

Choosing a strategy

Strategy | Downtime | Rollback speed | Infrastructure cost | Best for
Rolling | None | Slow (re-deploy old version) | Low | Stateless services, default K8s
Blue-green | None | Fast (switch traffic) | 2x during deploy | Critical services, fast rollback needed
Canary | None | Fast (shift traffic back) | Low extra | High-traffic services, metric-gated promotion
Shadow | None | N/A (no user impact) | 2x compute | Validation before any user traffic

Rolling deployment

  • K8s Deployment — maxSurge, maxUnavailable, rolling update strategy (Stage 4.2)
  • Readiness probes — new pods must pass before old pods terminate
  • PodDisruptionBudget — minimum available during voluntary disruptions (Stage 4.2)
  • Limitation — mixed versions run simultaneously; requires backward-compatible API and schema (Stage 7.6)

Blue-green deployment

  • Two identical environments — blue (current) and green (new)
  • Traffic switch — DNS, load balancer, or service mesh route flip
  • Rollback — switch traffic back to blue instantly
  • Cost — running double infrastructure during deploy window
  • Database consideration — schema must be compatible with both versions (expand-contract, Stage 7.6)

Canary deployment

  • Traffic split — 1% → 5% → 25% → 50% → 100%, gated by metrics at each step
  • Metric gates — error rate, p99 latency, saturation (Stage 6.6); SLO burn rate (Stage 6.4)
  • Automated rollback — Argo Rollouts / Flagger revert on failed analysis (Stage 9.1)
  • Service mesh or ingress required — Istio VirtualService, NGINX canary annotations, Cilium (Stage 4.7)
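
Metric-gated promotion can be sketched as a pure function over the current traffic weight and observed metrics. The step list matches the split above; the thresholds and the `next_weight` helper are assumptions for illustration, not Argo Rollouts or Flagger behavior.

```python
STEPS = [1, 5, 25, 50, 100]  # percentage of traffic sent to the canary


def next_weight(current, error_rate, p99_ms, max_error_rate=0.01, max_p99_ms=500.0):
    """Return the next canary weight, or 0 to signal rollback."""
    if error_rate > max_error_rate or p99_ms > max_p99_ms:
        return 0  # gate failed: shift all traffic back to stable
    later = [step for step in STEPS if step > current]
    return later[0] if later else current  # already at 100%
```

A real controller would also require each gate to hold for a minimum observation window before promoting, so one quiet minute does not push a bad build to 50%.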

Shadow / dark traffic

  • Mirror production traffic to new version — no user-facing impact
  • Compare responses — diff old vs new output, log discrepancies
  • Use cases — validate rewrite, test new database backend, ML model comparison

Feature flags

  • Decouple deploy from release — code is deployed but feature is off
  • Flag types — release flags (short-lived), ops flags (kill switch), experiment flags (A/B)
  • Kill switch — disable feature instantly without rollback deploy
  • Flag hygiene — remove stale flags; tech debt if flags accumulate

Deployment safety checklist

  • Backward-compatible API and schema changes (Stage 7.6 expand phase)
  • Feature flags for risky changes
  • Dashboards and alerts ready before deploy (Stage 6)
  • Rollback plan documented — code rollback vs schema rollback (schema rollback is hard)
  • PDB and HPA configured — don't deploy during capacity constraints
  • Error budget check — freeze deploys if budget exhausted (Stage 6.5)

Coordinating code and data deploys

  • Expand before deploy — add new DB column/table before code that uses it
  • Contract after deploy — remove old column only after all code migrated
  • Dual-write period — both old and new code paths write to both stores (Stage 7.6)
  • Never deploy breaking schema change with rolling update — old pods will crash
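
The expand, dual-write, and backfill steps can be tried end to end with an in-memory SQLite database. A minimal sketch; the `users` table and `display_name` column are hypothetical, and the contract step is left as a comment because it should only run once no deployed code reads the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: add a nullable column. Old code that never mentions it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Old pods (still running during the rolling update) write only the old column.
conn.execute("INSERT INTO users (name) VALUES ('Alan Turing')")

# New pods dual-write both columns.
conn.execute(
    "INSERT INTO users (name, display_name) VALUES (?, ?)",
    ("Grace Hopper", "Grace Hopper"),
)

# Backfill rows written before or during the rollout.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

MISSING = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
TOTAL = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Contract (later, separate deploy): drop the old column only after every
# running version reads display_name.
```

The order matters: if the new column were added in the same deploy as the code that requires it, old pods would fail during the rollout window.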

Stage 10 — eBPF and Advanced Networking (for Cilium, Cloudflare, Fastly)

10.1 Advanced networking awareness (learn the detail later in the deep-dive file)

Full eBPF, Cilium, and CDN content is in platform-engineering-deep-dive.md. For now:

  • eBPF — programmable hooks in the Linux kernel for networking, security, and observability
  • Cilium — Kubernetes networking and policy using eBPF instead of iptables
  • CDN edge — caches responses by Cache-Control headers; mitigates DDoS at L3/L4 (volume) and L7 (HTTP-aware)

Stage 11 — Distributed databases continued (ScyllaDB / PlanetScale)

Wide-column (ScyllaDB/Cassandra) and sharded MySQL (Vitess/PlanetScale) — full detail in deep-dive file.

  • ScyllaDB/Cassandra — partition key determines node; design tables for query patterns, not normalized joins
  • Vitess/PlanetScale — MySQL sharded at scale; avoid scatter queries without a shard key
  • LSM trees, VTGate, gh-ost, resharding → deep-dive file

Stage 12 — Architecture case studies

Apply everything from Stages 1–11. Each case study follows: problem → architecture → key decisions → failure modes → interview follow-ups.

12.1 Datadog metrics ingest pipeline

Problem

  • Ingest millions of metrics per second from agents across customer infrastructure
  • High cardinality risk — bad label design can OOM the pipeline
  • Must query recent data fast; older data can be slower/cheaper

Architecture (conceptual)

  • Agent (node/pod) → local aggregation → intake API (load balanced)
  • Kafka or similar queue — decouple ingest from processing, absorb spikes
  • Processing workers — normalize, validate, drop/blacklist high-cardinality series
  • Hot storage — recent data, fast queries (like Prometheus TSDB, Stage 6.1)
  • Cold storage — object storage (S3) for long retention, queried on demand
  • Query layer — federates hot + cold, PromQL-compatible

Key decisions

  • Why queue between intake and storage — backpressure, burst absorption
  • Cardinality limits — per-metric, per-tag, per-customer quotas
  • Downsampling — reduce resolution for older data to control storage cost (Stage 8.4)
  • Sharding — by customer ID or metric hash for horizontal scale
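
A per-metric cardinality limit can be sketched as a set of seen tag combinations with a cap. This is an illustration under simple assumptions (in-memory state, one process), not how any particular vendor implements it; `CardinalityLimiter` is a hypothetical name.

```python
class CardinalityLimiter:
    """Reject new series for a metric once it has too many distinct tag sets."""

    def __init__(self, max_series_per_metric):
        self.max_series = max_series_per_metric
        self._seen = {}  # metric name -> set of tag tuples

    def admit(self, metric, tags):
        series = tuple(sorted(tags.items()))
        seen = self._seen.setdefault(metric, set())
        if series in seen:
            return True  # existing series: always accept new points
        if len(seen) >= self.max_series:
            return False  # new series would exceed the quota
        seen.add(series)
        return True
```

Existing series keep flowing after the cap is hit, so a cardinality bug in one deployment degrades only the new series instead of dropping the whole metric.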

Failure modes

  • Cardinality explosion — one bad deployment sends unique label per request
  • Ingest lag — queue depth grows, delayed metrics, alert on pipeline lag not just app metrics
  • Hot shard — uneven customer traffic distribution

Interview follow-ups

  • How would you design cardinality limits?
  • What happens if Kafka is down for 5 minutes?
  • How do you migrate storage backends without downtime?

12.2 Cloudflare DDoS mitigation

Problem

  • Mitigate multi-Tbps volumetric attacks without impacting legitimate traffic
  • Must operate at line rate — cannot afford per-packet userspace processing at scale

Architecture (conceptual)

  • Anycast BGP — same IP from every PoP, traffic routed to nearest edge (Stage 10.3)
  • XDP/eBPF at NIC — drop malicious packets before kernel network stack (Stage 10.1)
  • Flow tracking — stateful inspection for SYN floods, UDP amplification
  • Rate limiting — token bucket per IP/ASN/fingerprint (Stage 10.3)
  • Challenge layer — JS/CAPTCHA for suspicious but not clearly malicious traffic
  • Origin shield — aggregate cache misses through single PoP to protect origin
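
The token bucket mentioned above is small enough to write out. A minimal sketch with the clock passed in explicitly so it can be tested deterministically; production systems do this in the kernel or at the edge proxy, not in Python, and the class name is an assumption.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now, cost=1.0):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per key (IP, ASN, or fingerprint) gives per-client limits; the memory cost of those buckets is the interesting part of the "10M unique IPs" follow-up.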

Key decisions

  • XDP vs iptables — XDP for line-rate drop, iptables for complex stateful rules
  • False positive vs false negative trade-off — blocking legit users vs letting attack through
  • Attack signature updates — how fast can rules propagate to all PoPs globally?

Failure modes

  • Origin overload during cache miss storm — origin-facing PoP becomes bottleneck
  • SYN flood exhausting conntrack table (Stage 2) — eBPF replaces kernel conntrack at scale (Stage 10.2)
  • L7 attacks that look like legitimate HTTP — require application-aware detection

Interview follow-ups

  • How does anycast handle a PoP going offline?
  • Design rate limiting for 10M unique IPs.
  • How would you test DDoS mitigation without affecting production?

12.3 Multi-tenant Kubernetes platform

Problem

  • Run 50+ teams on shared clusters with isolation, fair resource sharing, and cost allocation

Architecture (conceptual)

  • Namespace per team — ResourceQuota, LimitRange (Stage 4.2)
  • NetworkPolicy default-deny — explicit allow between namespaces (Stage 4.2, 4.7)
  • Pod Security Standards — Restricted profile enforced via admission (Stage 5.3)
  • RBAC — namespace-scoped roles, no cluster-admin for app teams (Stage 5.3)
  • Cost allocation — Kubecost or cloud tags mapped to namespaces (Stage 8.4)
  • IDP self-service — Backstage template creates namespace + quota + GitOps repo (Stage 9.3)

Key decisions

  • Shared vs dedicated nodes — taints/tolerations for noisy-neighbor isolation
  • Cluster per env vs cluster per team — blast radius vs operational overhead
  • How much self-service — golden path vs bring-your-own-manifests

Failure modes

  • Noisy neighbor — one team's memory spike triggers node OOM, evicts other teams' pods
  • Quota exhaustion — team hits ResourceQuota, pods stuck Pending, unclear error message
  • NetworkPolicy too restrictive — breaks legitimate cross-team dependencies

Interview follow-ups

  • How do you handle a team that needs GPU nodes?
  • Design chargeback for shared cluster costs.
  • One team deploys a crypto miner — how do you detect and respond?

12.4 GitOps at scale (100+ clusters)

Problem

  • Manage application deployments across hundreds of clusters from a central platform
  • Balance consistency with cluster-specific overrides; control blast radius

Architecture (conceptual)

  • ArgoCD hub — central instance managing remote clusters (Stage 9.1)
  • App-of-apps / ApplicationSet — templated apps per cluster (Stage 9.1)
  • Repo structure — base manifests + Kustomize overlays per cluster/environment
  • Sync waves — CRDs first, then operators, then workloads
  • Progressive sync — dev clusters auto-sync, prod requires manual approval
  • Secrets — External Secrets Operator pulling from Vault (Stage 9.1, 8.3)

Key decisions

  • Monorepo vs polyrepo — trade-off between visibility and access control
  • Auto-sync vs manual sync for production — speed vs safety
  • How to handle cluster-specific config — Kustomize overlays vs Helm values files

Failure modes

  • Bad manifest synced to all clusters simultaneously — blast radius
  • ArgoCD itself becomes SPOF — HA deployment, multiple replicas
  • Secret rotation breaks sync — stale ExternalSecret, pods fail to start
  • Drift — manual kubectl edit on cluster, GitOps fights live state

Interview follow-ups

  • How do you roll out a platform-wide NetworkPolicy change safely?
  • Design a canary cluster before promoting to all prod clusters.
  • How do you handle a cluster that can't reach Git?

12.5 Secure CI/CD supply chain end-to-end

Problem

  • Ensure only trusted, scanned, signed artifacts reach production clusters

Architecture (conceptual)

  • Developer push → CI pipeline (Stage 9.2)
  • SAST + SCA + secret detection in CI (Stage 9.2)
  • Build container image → Trivy/Grype scan (Stage 5.1)
  • Generate SBOM (Syft) + SLSA provenance (Stage 5.2)
  • Sign with Cosign keyless signing via GitHub OIDC → Fulcio → Rekor (Stage 5.2)
  • Push to registry with signature attached
  • Admission webhook — Kyverno verify-image policy, reject unsigned or vulnerable images (Stage 5.3)
  • GitOps deploy — ArgoCD syncs signed image to cluster (Stage 9.1)
  • Runtime — Falco detects anomalous behavior (Stage 5.4)

Key decisions

  • Where to enforce — CI gate vs registry gate vs admission gate (defense in depth)
  • Keyless vs key-based signing — OIDC identity vs long-lived keys
  • CVE policy — block critical, warn on high, allow with exception workflow

Failure modes

  • Compromised CI runner — attacker pushes malicious signed image
  • Policy bypass — privileged pod (securityContext.privileged: true) admitted because namespace lacks Pod Security enforcement
  • Stale base image — image signed but base layer has new CVE discovered later

Interview follow-ups

  • How do you handle emergency hotfix bypass of scan gates?
  • Design provenance verification that works across multiple CI systems.
  • What if Rekor is unavailable — can you still verify signatures?

12.6 Globally distributed SQL (CockroachDB-style)

Problem

  • PostgreSQL-compatible database that survives region failure with strong consistency

Architecture (conceptual)

  • Keyspace split into ranges, each range = Raft group (Stage 7.2)
  • Multi-Raft — independent consensus per range, scales horizontally
  • Transaction coordinator — 2PC across ranges for distributed transactions
  • Geo-partitioning — pin data to regions for latency and compliance (Stage 7.2)
  • Follower reads — read from local replica at stale timestamp for lower latency
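
Mapping a key to its range is a sorted-boundary lookup. A conceptual sketch only; the split points are hypothetical, and a real system keeps this mapping in its own metadata ranges rather than a Python list.

```python
import bisect

# Hypothetical split points. Range 0 covers keys below "g", range 1 covers
# ["g", "n"), range 2 covers ["n", "t"), range 3 covers "t" and above.
SPLITS = ["g", "n", "t"]


def range_for_key(key):
    """Return the index of the range (and so the Raft group) that owns the key."""
    return bisect.bisect_right(SPLITS, key)


def ranges_touched(keys):
    """Ranges a transaction writes to; more than one means cross-range commit."""
    return {range_for_key(key) for key in keys}
```

A transaction whose keys all land in one range can commit through a single Raft group; touching several ranges is what brings in the distributed commit protocol and its extra round trips.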

Key decisions

  • CP over AP — strong consistency, sacrifice availability during partition (CAP, Stage 7.1)
  • Range size — too small = Raft overhead; too large = hot spots
  • Survival goals — zone vs region failure tolerance

Failure modes

  • Hot range — one range gets disproportionate writes, single Raft group bottleneck
  • Clock skew — HLC mitigates but extreme skew causes transaction retries
  • Region partition — CP system may become unavailable for affected ranges

Interview follow-ups

  • How does CockroachDB handle a node failure mid-transaction?
  • Design a schema migration for a globally distributed table.
  • Compare to Spanner's TrueTime approach (Stage 7.1).

12.7 Observability pipeline at scale (Loki + Prometheus)

Problem

  • Collect logs and metrics from 10,000+ pods without overwhelming storage or query performance

Architecture (conceptual)

  • Metrics — Prometheus per cluster → remote_write → Mimir/Thanos (Stage 6.1)
  • Logs — Fluent Bit DaemonSet → Loki distributor → ingester → S3 chunks (Stage 6.3)
  • Traces — OTel Collector → tail sampling → Jaeger/Tempo (Stage 6.2)
  • Unified query — Grafana dashboards correlating metrics + logs + traces
  • Cardinality control — drop high-cardinality labels at ingest, recording rules for aggregates

Key decisions

  • Loki label design — index only labels (not log content), low-cardinality labels only
  • Retention tiers — 15 days hot, 90 days warm in object storage, delete after
  • Sampling — head sampling for traces (99% dropped), tail sampling for errors (Stage 6.2)
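
Head sampling is usually made deterministic on the trace ID so every service reaches the same keep-or-drop decision without coordination. A minimal sketch; the hashing scheme here is an assumption for illustration, not the OpenTelemetry sampler's exact algorithm.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministic head sampling: same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    threshold = int(sample_rate * 2**64)
    return value < threshold
```

The weakness is visible in the signature: the decision is made before the outcome is known, so an error trace is dropped as often as a healthy one. Tail sampling fixes that at the cost of buffering spans in the collector.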

Failure modes

  • Label cardinality explosion in Loki — same problem as Prometheus, different storage
  • Remote write backpressure — Prometheus WAL grows, disk fills
  • Log volume spike — one service emitting debug-level output at ERROR severity floods pipeline

Interview follow-ups

  • How do you debug a production issue when traces were sampled out?
  • Design log retention that meets compliance without bankrupting storage budget (Stage 8.4).
  • How do you correlate a metric spike to the exact log lines?

12.8 Cilium replacing kube-proxy

Problem

  • kube-proxy iptables mode doesn't scale to thousands of Services; need faster datapath

Architecture (conceptual)

  • Cilium agent (DaemonSet) — programs eBPF on each node (Stage 10.2)
  • eBPF LB map — service IP → backend pod IP, O(1) lookup, no iptables chain walk
  • Identity-based policy — numeric security identity from labels, not IP (Stage 10.2)
  • Hubble — flow-level observability from eBPF, no sidecar needed
  • --kube-proxy-replacement=strict (true in Cilium 1.14 and later, where strict is deprecated) — Cilium owns all service routing

Key decisions

  • eBPF over iptables — performance at scale, but requires kernel 4.19+ and BTF
  • DSR (Direct Server Return) — reply bypasses load balancer node, lower latency
  • Identity vs IP policy — IPs change on pod restart; identity is stable

Failure modes

  • eBPF map full — service/backend limit hit, new services fail to program
  • Kernel upgrade breaks eBPF programs — CO-RE (BTF) mitigates (Stage 10.1)
  • Policy misconfiguration — identity mismatch blocks legitimate traffic silently

Interview follow-ups

  • Walk through packet path for ClusterIP Service with Cilium eBPF vs iptables.
  • How does Cilium handle a pod IP change during rolling update?
  • Compare Cilium LB to IPVS mode kube-proxy (Stage 4.1).

12.9 Zero-downtime database migration

Problem

  • Migrate a 500GB PostgreSQL table (monolith DB) to a new schema, shard, or datastore with zero downtime and a rollback path
  • Application must keep serving traffic throughout; old and new code versions run simultaneously during rolling deploys (Stage 9.4)

Architecture (conceptual)

  • Phase 1 (expand) — add new column/table/index in old DB; deploy code that writes to both old and new paths (Stage 7.6)
  • Phase 2 (backfill) — batch or streaming job copies historical data; throttle to protect prod DB performance
  • Phase 3 (CDC) — Debezium/DMS streams ongoing changes from old DB to new store, keeping new store in sync (Stage 7.6)
  • Phase 4 (dual-read validation) — compare row counts, checksums, sample queries between old and new
  • Phase 5 (cutover) — shift read traffic to new store (percentage-based or instant); monitor error rate and SLO burn (Stage 6.4)
  • Phase 6 (contract) — remove old column/table once all code reads from new path; decommission old store

Key decisions

  • Expand-contract over big-bang — only safe pattern with rolling K8s deploys (Stage 9.4)
  • Dual-write vs CDC-only — dual-write simpler but risk of inconsistency; CDC cleaner but adds pipeline complexity
  • Cutover strategy — percentage traffic shift vs DNS flip vs feature flag per tenant
  • How long to keep old system warm — rollback window vs cost of running dual systems

Failure modes

  • Dual-write partial failure — one write succeeds, other fails; needs idempotency and reconciliation job (Stage 7.5)
  • Backfill overload — unthrottled backfill saturates DB I/O, degrades live traffic
  • Schema incompatibility — new code deployed before expand phase completes, old pods crash
  • Cutover with replication lag — reads from new store return stale data, user-visible inconsistency
  • Rollback after contract phase — schema rollback is hard; may require forward-fix instead

Interview follow-ups

  • How do you verify data correctness before cutover?
  • What if CDC pipeline falls 30 minutes behind during peak traffic?
  • Design migration for a table with 10K writes/sec and foreign key constraints.

12.10 Autoscaling under a traffic spike

Problem

  • Traffic increases 20× in 10 minutes (product launch, Black Friday, viral event)
  • Platform must scale pods, nodes, and ingress without breaching SLOs or exhausting error budget (Stage 6.4)

Architecture (conceptual)

  • Ingress / load balancer — cloud ALB/NLB or CDN absorbs initial burst (Stages 4.6, 10.3)
  • HPA — scales pod replicas based on CPU, memory, or custom metrics (Stage 4.5)
  • Cluster Autoscaler — adds nodes when pods are Pending due to insufficient resources (Stage 4.5)
  • KEDA — event-driven scaling on queue lag or external metrics; scale-to-zero off-peak (Stage 4.5)
  • Pre-warming — raise HPA minReplicas and pre-provision node pool before known events
  • Observability — RED metrics on autoscaling loop itself: time-to-new-pod-ready, time-to-new-node, scheduling latency (Stage 6.6)

Key decisions

  • HPA metric choice — CPU lags behind request rate; custom metrics (RPS, queue depth) react faster
  • CA scale-up delay — new node takes 2–5 minutes; pre-warm node groups for predictable events
  • PDB vs scale-down — CA respects PodDisruptionBudgets; may block scale-down, leaving costly idle nodes (Stage 4.2)
  • Spot/preemptible nodes — cost savings vs interruption during spike; use for fault-tolerant workloads only (Stage 8.4)
  • Max replicas cap — prevent runaway scaling from bug or DDoS; balance cost vs availability
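
The HPA's core calculation, as documented by Kubernetes, is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). A sketch of that formula with min/max clamping; it ignores stabilization windows, missing metrics, and not-yet-ready pods, which the real controller also handles.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA formula asks for, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With a 20× spike the formula asks for 20× the replicas in one step, so the max replicas cap and node capacity, not the HPA itself, usually decide how far the scale-up actually gets.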

Failure modes

  • HPA lag — metrics-server delay + cooldown window; pods not ready before traffic overwhelms existing replicas
  • CA can't scale — hit node group max, instance quota, or IP address exhaustion in subnet
  • Thundering herd on new pods — all new pods cold-start simultaneously, DB connection pool exhausted (Stage 7.5)
  • Ingress bottleneck — pods scaled but ingress/controller becomes the limit
  • Flapping — scale-up then rapid scale-down as metrics spike and drop; tune stabilization windows (Stage 4.5)

Interview follow-ups

  • How do you load-test autoscaling behavior before a launch?
  • HPA vs KEDA for a Kafka consumer workload — which and why?
  • Traffic drops after spike — how fast should you scale down without causing another outage?

12.11 Building a production Kubernetes operator

Problem

  • Platform team needs a CRD (e.g., Database, Application, Tenant) with a controller that provisions and manages lifecycle automatically
  • Must be reliable, idempotent, and operable at scale across many clusters

Architecture (conceptual)

  • CRD definition — OpenAPI validation schema, status subresource, printer columns (Stage 4.3)
  • controller-runtime — Manager, Reconciler, work queue, shared informer cache (Stage 4.3)
  • Reconcile loop — compare spec (desired) vs observed state; create/update/delete child resources
  • Webhooks — mutating (defaults) and validating (reject invalid specs) admission (Stage 4.3)
  • Finalizers — pre-delete cleanup (e.g., snapshot DB before CR deletion); prevent stuck resources
  • Observability — controller metrics (reconcile duration, errors, queue depth), structured logs, tracing (Stage 6)

Key decisions

  • Idempotent reconcile — calling reconcile N times has same effect as once; use CreateOrUpdate pattern (Stage 4.3)
  • Error handling — transient errors requeue with backoff; permanent errors update status condition
  • Owner references — child resources garbage-collected when parent CR deleted (Stage 4.3)
  • Leader election — only one active controller replica; others standby (Stage 4.3)
  • Secondary resource watches — trigger reconcile when child Secret or Deployment changes
  • Testing — envtest for unit tests, kind cluster for integration, contract tests on CRD schema
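
Idempotent reconcile means computing the desired children from the spec and writing only what differs. A sketch with a dict standing in for the API server; it is not the controller-runtime API, and the CR shape and child names are hypothetical.

```python
def desired_children(cr):
    """Child objects a hypothetical Database CR should own."""
    name = cr["name"]
    return {
        f"{name}-credentials": {"kind": "Secret", "owner": name},
        f"{name}-db": {
            "kind": "StatefulSet",
            "owner": name,
            "replicas": cr["spec"]["replicas"],
        },
    }


def reconcile(cr, cluster):
    """Converge children for one CR; returns the names that had to be written."""
    changed = []
    for child_name, desired in desired_children(cr).items():
        if cluster.get(child_name) != desired:
            cluster[child_name] = desired  # create-or-update
            changed.append(child_name)
    return changed
```

The second call returning an empty list is the idempotency check: a resync of every object after an API server blip then costs reads, not writes.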

Failure modes

  • Reconcile storm — API server blip causes resync of all objects; rate-limit queue, use predicates (Stage 4.3)
  • Stuck finalizer — external dependency unavailable, CR can't delete; manual finalizer removal as break-glass
  • Status update conflict — concurrent reconcilers or user edits cause optimistic locking conflict
  • Webhook failure — invalid object rejected but error opaque to user; clear validation messages critical
  • Partial provision — DB created but Secret not written; status must reflect partial state accurately

Interview follow-ups

  • Walk through reconcile for a Database CR: create → running → upgrade → delete.
  • How do you handle a controller bug that corrupted 50 resources — rollback strategy?
  • How do you version CRD schemas without breaking existing resources?

12.12 Vault as the org-wide secrets platform

Problem

  • 2,000+ microservices need dynamic DB credentials, PKI certs, and API keys without Vault becoming a single point of failure or bottleneck
  • Must integrate with Kubernetes, CI/CD, and cloud IAM across multiple clusters and accounts

Architecture (conceptual)

  • Vault HA cluster — Raft integrated storage, 3+ nodes, active/standby with auto-failover (Stage 8.3)
  • Auto-unseal — AWS KMS / GCP KMS; no manual unseal on restart (Stage 8.3)
  • Auth methods — Kubernetes auth (pod SA token), AWS IAM auth, AppRole for CI, OIDC for GitHub Actions (Stage 8.3)
  • Secret engines — Database (dynamic creds), PKI (internal CA), KV v2 (static secrets), Transit (encryption-as-a-service)
  • Vault Agent Injector — sidecar injected via pod annotation, renders secrets to file, auto-renews leases (Stage 8.3)
  • External Secrets Operator — syncs Vault secrets to K8s Secret for GitOps compatibility (Stage 9.1)

Key decisions

  • Agent sidecar vs ESO vs direct API — sidecar for app file-based secrets; ESO for GitOps; direct API for controllers
  • Dynamic vs static secrets — dynamic DB creds auto-revoke on lease expiry; static secrets need rotation policy
  • Namespace isolation — each team gets Vault policy scoped to their path; no cross-team secret access
  • Performance standbys (Vault Enterprise) — read replicas for high read volume; writes still go to active node (Stage 8.3)
  • Break-glass — emergency root token procedure, audited, time-limited

Failure modes

  • Vault sealed after restart — auto-unseal misconfigured, all secret retrieval fails across fleet
  • Lease expiry without renewal — app crashes when DB cred expires; Agent must renew before TTL
  • Rate limiting — thundering herd of pods restarting simultaneously overwhelms Vault auth endpoint
  • Token leak — compromised SA token grants Vault access; short-lived tokens + narrow policies limit blast radius
  • Raft quorum loss — 2 of 3 nodes down, no leader can be elected, Vault unavailable until quorum is restored; multi-AZ placement critical

Interview follow-ups

  • Vault is down for 10 minutes — what breaks, in what order?
  • How do you rotate a database password for 500 services without restart?
  • Design Vault topology for 5 K8s clusters across 2 cloud accounts.

12.13 FinOps: reducing a $2M/month K8s and cloud bill

Problem

  • Platform spend growing 30% quarter-over-quarter; leadership demands ~40% reduction without SLO regression or team revolt
  • Must identify waste, rightsize, and implement guardrails — not just cut capacity blindly

Architecture (conceptual)

  • Cost visibility — Kubecost / CloudHealth / native cost explorer with mandatory tagging (team, env, service) (Stage 8.4)
  • Compute — VPA recommendations, rightsizing requests/limits, spot/preemptible for fault-tolerant workloads (Stages 4.5, 8.4)
  • Node efficiency — CA scale-down idle nodes, reduce max node group size, consolidate low-utilization clusters
  • Storage — right-size PVCs, S3 lifecycle policies for logs/metrics/backups, delete orphaned volumes (Stage 8.4)
  • Observability cost — reduce metrics cardinality, shorten retention, drop debug logs in prod (Stages 6.1, 8.4)
  • Egress — CDN for static assets, PrivateLink for cross-service traffic, same-AZ preference (Stages 4.6, 8.4)
  • Governance — Infracost in PRs, Kyverno policy blocking oversized instances, chargeback reports to teams (Stages 8.1, 8.4)

Key decisions

  • What to cut first — idle resources, over-provisioned dev/staging, excessive retention; never cut prod headroom blindly
  • Spot/preemptible adoption — start with stateless batch/CI workloads; keep on-demand for critical path (Stage 8.4)
  • Chargeback vs showback — showback educates; chargeback creates accountability but needs accurate allocation
  • Reserved instances / savings plans — commit for baseline load only; keep burst on-demand
  • Unit economics — cost per request, per tenant, per GB ingested; track over time to prove savings didn't hurt reliability

Failure modes

  • Aggressive rightsizing causes OOM kills during traffic spike — under-provisioned after cutting limits
  • Spot interruption during peak — no on-demand fallback, SLO breach
  • Retention cut too short — can't debug incident from last week; false economy
  • Tagging gaps — 30% of spend is "untagged," can't allocate or optimize
  • Team workaround — devs spin up resources outside platform to avoid chargeback, creating shadow IT
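
The tagging-gap failure mode is easy to measure before it bites. The sketch below assumes billing line items have already been loaded as dictionaries with a `cost_usd` value and a `tags` mapping (both hypothetical field names); the required tag keys are the ones named in the architecture list (team, env, service).

```python
REQUIRED_TAGS = ("team", "env", "service")


def is_fully_tagged(tags, required=REQUIRED_TAGS):
    """True if every required tag key is present with a non-empty value."""
    return all(tags.get(key) for key in required)


def untagged_share(line_items, required=REQUIRED_TAGS):
    """Fraction of total spend on resources missing any required tag."""
    total = sum(item["cost_usd"] for item in line_items)
    if total == 0:
        return 0.0
    untagged = sum(
        item["cost_usd"]
        for item in line_items
        if not is_fully_tagged(item.get("tags", {}), required)
    )
    return untagged / total


if __name__ == "__main__":
    items = [
        {"cost_usd": 70, "tags": {"team": "pay", "env": "prod", "service": "api"}},
        {"cost_usd": 20, "tags": {"team": "pay", "env": "prod"}},
        {"cost_usd": 10, "tags": {}},
    ]
    print(f"untagged share: {untagged_share(items):.0%}")
```

Tracking this number per week gives an early signal; allocation and chargeback are only as accurate as the tagged fraction of spend.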

Interview follow-ups

  • Show me your prioritization: what do you cut first, second, never?
  • How do you prove a 40% cost cut didn't increase incident rate?
  • Design a chargeback model for a shared multi-tenant K8s cluster (Stage 12.3).

12.14 Chaos game day on a production-like environment

Problem

  • Platform team needs confidence that the system survives realistic failures before they happen in production
  • Run controlled experiments in a prod-like staging environment without customer impact

Architecture (conceptual)

  • Environment — full prod parity: same K8s version, same operators, same observability stack, synthetic load at ~50% prod traffic (Stage 9.3)
  • Chaos tools — Litmus Chaos, Chaos Mesh, or Gremlin; inject faults as K8s CRs or API calls (Stage 6.5)
  • Experiment design — hypothesize steady state (SLOs hold), define blast radius, set abort conditions
  • Fault types — pod kill, node drain, network partition (NetworkPolicy drop), AZ failure simulation, DNS failure, latency injection
  • Observability during experiment — pre-built dashboards for SLO burn, error rate, latency; on-call team observes but doesn't intervene unless abort threshold hit (Stage 6)
  • Post-experiment — blameless review, gap analysis, action items (runbook updates, new alerts, code fixes) (Stage 6.5)
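
A steady-state hypothesis is only useful if it can be evaluated mechanically. As a rough sketch, assuming latency samples in milliseconds have been collected during the experiment, the check below computes a nearest-rank p99 and compares it to a threshold. The 500 ms default mirrors the example hypothesis used in this section; the function names are illustrative.

```python
import math


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state_holds(latencies_ms, threshold_ms=500):
    """True if observed p99 latency stays strictly under the threshold."""
    return p99(latencies_ms) < threshold_ms


if __name__ == "__main__":
    during_pod_kill = [100] * 197 + [450, 800, 900]
    print(p99(during_pod_kill), steady_state_holds(during_pod_kill))
```

Running the same check on a baseline window before injecting any fault confirms the hypothesis is measurable and currently true; if it fails at baseline, the experiment should not start.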

Key decisions

  • Prod-like vs prod — never inject chaos in prod without mature practice; staging with realistic load is the starting point
  • Steady-state hypothesis — "p99 latency stays under 500ms during single pod kill" — must be measurable before starting
  • Blast radius — one namespace/team at a time; don't kill all etcd members simultaneously
  • Abort conditions — auto-abort if error rate exceeds 5% or SLO burn rate hits 10× (Stage 6.4)
  • Frequency — quarterly game days for platform; smaller automated chaos in CI for individual services
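
The abort conditions above can be expressed as a small decision function. This is a sketch under assumptions: burn rate is taken as the observed error rate divided by the error budget (1 minus the SLO target), the 99.9% SLO target is a placeholder, and the 5% and 10× thresholds come from the bullet above. Treating "no traffic" as an abort is a design choice of this example.

```python
def burn_rate(total, failed, slo_target):
    """Error-budget burn rate: observed error rate / allowed error rate."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (failed / total) / budget


def should_abort(total, failed, slo_target=0.999,
                 max_error_rate=0.05, max_burn_rate=10.0):
    """Return (abort, reason) for the current observation window."""
    if total == 0:
        # Without traffic the hypothesis cannot be evaluated at all.
        return True, "no traffic observed; steady state cannot be evaluated"
    error_rate = failed / total
    if error_rate > max_error_rate:
        return True, f"error rate {error_rate:.2%} exceeds {max_error_rate:.0%}"
    rate = burn_rate(total, failed, slo_target)
    if rate >= max_burn_rate:
        return True, f"burn rate {rate:.1f}x reached {max_burn_rate:.0f}x"
    return False, "within limits"


if __name__ == "__main__":
    print(should_abort(total=10_000, failed=50))
    print(should_abort(total=10_000, failed=120))
```

In practice this check would run on a short rolling window and trigger deletion of the chaos resource automatically, so that aborting does not depend on a human watching a dashboard.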

Failure modes

  • Experiment exceeds blast radius — network partition CR affects wrong namespace, staging outage
  • No abort condition — experiment runs too long, staging unusable for other teams for hours
  • False confidence — staging lacks prod traffic patterns; passes game day but fails in prod
  • Missing observability — can't tell if hypothesis passed or failed; experiment is worthless
  • Action items not tracked — same failure found in 3 game days, never fixed
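
One way to reduce the risk of an experiment escaping its blast radius is to lint the chaos manifest before applying it. The sketch below builds a pod-kill manifest as a plain dictionary and checks that it targets exactly one approved namespace. The field layout follows Chaos Mesh's PodChaos resource as I understand it; verify it against the version you run. The helper names and the example namespace are hypothetical.

```python
def build_pod_kill(name, namespace, app_label):
    """Build a PodChaos-style manifest that kills one pod matching a label."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


def check_blast_radius(manifest, allowed_namespace):
    """Return a list of problems; an empty list means the scope looks safe."""
    selector = manifest.get("spec", {}).get("selector", {})
    targets = selector.get("namespaces", [])
    problems = []
    if not targets:
        problems.append("selector.namespaces is empty; scope is not explicitly pinned")
    for ns in targets:
        if ns != allowed_namespace:
            problems.append(f"namespace '{ns}' is outside the approved blast radius")
    if not selector.get("labelSelectors"):
        problems.append("no labelSelectors; every pod in the namespace is a candidate")
    return problems


if __name__ == "__main__":
    manifest = build_pod_kill("kill-checkout", "team-checkout", "checkout")
    print(check_blast_radius(manifest, allowed_namespace="team-checkout"))
```

A check like this fits naturally into the pipeline that applies chaos resources, alongside RBAC that restricts where the chaos tooling is allowed to act.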

Interview follow-ups

  • Design a game day for "single AZ becomes unavailable."
  • How is chaos different from load testing (Stage 6.6)?
  • When would you allow chaos experiments in production (e.g., Netflix approach)?

12.15 Terraform/IaC at scale (monorepo, drift, blast radius)

Problem

  • 500+ resources across 20 environments, 50 engineers contributing; a bad terraform apply can take down production
  • State files grow large, modules proliferate, drift accumulates, and nobody knows what's actually deployed

Architecture (conceptual)

  • Repo structure — monorepo with modules/ (reusable) and environments/ (dev/staging/prod overlays) or polyrepo per team (Stage 8.1)
  • Remote state — S3 + DynamoDB locking; separate state file per environment; never share state across envs (Stage 8.1)
  • CI pipeline — terraform plan on every PR, mandatory review for prod applies, terraform apply only from CI (Stage 9.2)
  • Module registry — versioned modules (?ref=v1.2.3), semver, changelog; consumers pin versions (Stage 8.1)
  • Drift detection — scheduled terraform plan in CI; alert on non-zero diff; investigate manual console changes (Stage 8.1)
  • Policy as code — OPA/Sentinel/Checkov scan plans before apply; deny public S3, unencrypted volumes, missing tags (Stages 5.5, 8.4)
  • Terragrunt — DRY backend config, dependency ordering between stacks (Stage 8.1)
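
Scheduled drift detection can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present. The wrapper below is a sketch: the `DriftStatus` enum and `check_drift` function are illustrative names, and `-lock=false` is an assumption chosen so a read-only scheduled job does not hold the state lock.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_exit_code(code):
    """Map the -detailed-exitcode result to a drift status."""
    if code == 0:
        return DriftStatus.IN_SYNC
    if code == 2:
        return DriftStatus.DRIFTED
    return DriftStatus.ERROR


def check_drift(workdir, run=subprocess.run):
    """Run a read-only plan in workdir and classify the outcome.

    `run` is injectable so the function can be tested without Terraform.
    """
    result = run(
        ["terraform", "plan", "-detailed-exitcode",
         "-input=false", "-lock=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)


if __name__ == "__main__":
    print(classify_exit_code(2))
```

A non-zero diff does not always mean a console change; it can also be merged code that was never applied, so the alert should link to the plan output rather than assert a cause.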

Key decisions

  • Monorepo vs polyrepo — monorepo: visibility and consistency; polyrepo: team autonomy and blast radius isolation
  • State granularity — one state per environment vs per service; smaller state = faster plan but more coordination
  • Module boundaries — too granular = versioning overhead; too coarse = tight coupling and wide blast radius
  • -target applies — escape hatch for emergencies; dangerous at scale, audit every use
  • Import vs recreate — bringing existing infra under TF management without downtime requires careful terraform import (Stage 8.1)

Failure modes

  • State lock stuck — crashed CI job holds DynamoDB lock; blocks all applies until manual force-unlock
  • Module breaking change — v2 of a module removes or renames an attribute; the next terraform apply plans a destroy-and-recreate of production RDS
  • Drift undetected for months — someone changed security group in console; next apply reverts it, breaks traffic
  • Giant state file — plan takes 15 minutes, CI timeout, teams skip plan review
  • Provider bug — provider v5 changes resource behavior, silent replacement of critical infrastructure
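
Several of these failure modes (silent replacement, destroy-and-recreate) are visible in the plan before apply. As a hedged sketch, the function below scans the JSON form of a saved plan, as produced by `terraform show -json tfplan`, and flags any protected resource type whose planned actions include a delete. The set of protected types and the function name are assumptions for illustration.

```python
def destructive_changes(plan, protected_types):
    """Return (address, actions) for protected resources that would be deleted.

    `plan` is the parsed JSON from `terraform show -json <planfile>`.
    A replacement shows up as actions ["delete", "create"] or
    ["create", "delete"], so checking for "delete" covers both.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in protected_types:
            flagged.append((rc.get("address"), tuple(actions)))
    return flagged


if __name__ == "__main__":
    sample_plan = {
        "resource_changes": [
            {"address": "aws_db_instance.main", "type": "aws_db_instance",
             "change": {"actions": ["delete", "create"]}},
            {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
             "change": {"actions": ["update"]}},
        ]
    }
    hits = destructive_changes(sample_plan, {"aws_db_instance"})
    for address, actions in hits:
        print(f"BLOCKED: {address} planned actions {actions}")
```

Failing the CI job when this list is non-empty forces an explicit, reviewed override for destructive changes; the same rule could be written in OPA or Sentinel if those are already in the pipeline.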

Interview follow-ups

  • How do you structure Terraform modules for 50 teams with different needs?
  • State file is 500MB and plans take 20 minutes — what do you do?
  • Engineer runs terraform apply locally against prod — how do you prevent this?
