<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kubeboiii</title>
    <description>The latest articles on DEV Community by kubeboiii (@kubeboiii).</description>
    <link>https://dev.to/kubeboiii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1570693%2F1098a3d4-4bc3-4381-beaf-2f61b97ac6ed.png</url>
      <title>DEV Community: kubeboiii</title>
      <link>https://dev.to/kubeboiii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kubeboiii"/>
    <language>en</language>
    <item>
      <title>Infra Platform Engineering</title>
      <dc:creator>kubeboiii</dc:creator>
      <pubDate>Fri, 29 May 2026 07:25:29 +0000</pubDate>
      <link>https://dev.to/kubeboiii/infra-platform-engineering-gbn</link>
      <guid>https://dev.to/kubeboiii/infra-platform-engineering-gbn</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 1 — Linux &amp;amp; OS Internals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Every container, every Kubernetes component, every performance issue traces back to Linux. Start here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processes &amp;amp; threads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;fork()&lt;/code&gt; and &lt;code&gt;exec()&lt;/code&gt; work — process creation lifecycle
&lt;/li&gt;
&lt;li&gt;Process states: running, sleeping (interruptible vs uninterruptible), zombie, stopped
&lt;/li&gt;
&lt;li&gt;What a context switch is and why it has a cost
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Namespaces (this is what containers ARE)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pid&lt;/code&gt; namespace — isolated process trees, PID 1 inside a container
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;net&lt;/code&gt; namespace — isolated network stack (interfaces, routes, iptables)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mnt&lt;/code&gt; namespace — isolated mount points and filesystem view
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uts&lt;/code&gt; namespace — isolated hostname and domain name
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ipc&lt;/code&gt; namespace — isolated System V IPC, POSIX message queues
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user&lt;/code&gt; namespace — UID/GID remapping, unprivileged containers
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cgroup&lt;/code&gt; namespace — isolated cgroup root view
&lt;/li&gt;
&lt;li&gt;How to experiment: &lt;code&gt;unshare&lt;/code&gt;, &lt;code&gt;nsenter&lt;/code&gt;, &lt;code&gt;lsns&lt;/code&gt; commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Control Groups / cgroups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cgroups v1 vs v2 — why v2 (unified hierarchy) matters for containers
&lt;/li&gt;
&lt;li&gt;CPU controller: &lt;code&gt;cpu.shares&lt;/code&gt;, &lt;code&gt;cpu.cfs_quota_us&lt;/code&gt;, &lt;code&gt;cpu.cfs_period_us&lt;/code&gt; in v1; &lt;code&gt;cpu.weight&lt;/code&gt; and &lt;code&gt;cpu.max&lt;/code&gt; in v2
&lt;/li&gt;
&lt;li&gt;Memory controller: hard limit, soft limit, swap limit, &lt;code&gt;memory.stat&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How Kubernetes maps resource &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt; to cgroup settings
&lt;/li&gt;
&lt;li&gt;The OOM killer — how it scores processes, why your container gets killed, &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/oom_score_adj&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pids&lt;/code&gt; controller — preventing fork bombs in containers&lt;/li&gt;
&lt;/ul&gt;
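&lt;p&gt;The mapping from Kubernetes CPU &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt; to the cgroups v1 files named above can be sketched roughly as follows. This is a minimal illustration, assuming the commonly documented kubelet conversion (1024 shares per CPU, a 100 ms CFS period). The function names are hypothetical, and cgroups v2 uses different files.&lt;/p&gt;

```python
# Sketch: how CPU requests/limits (in millicores) translate to cgroups v1 values.
# The constants are assumptions based on commonly documented kubelet defaults.
CFS_PERIOD_US = 100_000   # cpu.cfs_period_us: 100 ms scheduling period
SHARES_PER_CPU = 1024     # cpu.shares granted per full CPU requested
MIN_SHARES = 2            # smallest cpu.shares value the kernel accepts


def cpu_request_to_shares(milli_cpu: int) -> int:
    """CPU request -> cpu.shares (a relative weight, only matters under contention)."""
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def cpu_limit_to_quota_us(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU limit -> cpu.cfs_quota_us (CPU time allowed per period)."""
    return milli_cpu * period_us // 1000


if __name__ == "__main__":
    # A container with requests.cpu=250m and limits.cpu=500m
    print("cpu.shares       =", cpu_request_to_shares(250))   # 256
    print("cpu.cfs_quota_us =", cpu_limit_to_quota_us(500))   # 50000
```

&lt;p&gt;A quota of 50000 µs per 100000 µs period means the container is throttled once it has used half a CPU within a period, even when the node is otherwise idle.&lt;/p&gt;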

&lt;p&gt;&lt;strong&gt;Memory management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page cache — how the kernel caches disk reads in RAM, impact on &lt;code&gt;free&lt;/code&gt; output
&lt;/li&gt;
&lt;li&gt;Memory metrics: RSS vs VSZ vs PSS vs USS — why RSS is misleading in containers
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;File systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inodes — what they store, inode exhaustion problem
&lt;/li&gt;
&lt;li&gt;Bind mounts — how Kubernetes volume mounts work under the hood
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/proc&lt;/code&gt; and &lt;code&gt;/sys&lt;/code&gt; — virtual filesystems that expose kernel state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signals and IPC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SIGTERM&lt;/code&gt; vs &lt;code&gt;SIGKILL&lt;/code&gt; — why your app must handle SIGTERM for graceful shutdown
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SIGCHLD&lt;/code&gt; — zombie process prevention, proper child reaping (PID 1 problem in containers)
&lt;/li&gt;
&lt;li&gt;Why PID 1 in a container needs to reap children — &lt;code&gt;tini&lt;/code&gt; and &lt;code&gt;dumb-init&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;System call tracing and performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;strace -p &amp;lt;pid&amp;gt;&lt;/code&gt; — trace syscalls of a running process
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/&lt;/code&gt; — &lt;code&gt;maps&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;fd&lt;/code&gt;, &lt;code&gt;net&lt;/code&gt; — per-process kernel state
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ss&lt;/code&gt; and &lt;code&gt;netstat&lt;/code&gt; — socket state inspection
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lsof&lt;/code&gt; — open file descriptors per process&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 2 — Networking fundamentals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;You cannot work in Kubernetes, Cilium, Cloudflare, or Fastly without deeply understanding networking.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TCP/IP stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP addressing, subnets, CIDR notation, route tables
&lt;/li&gt;
&lt;li&gt;TCP handshake (SYN, SYN-ACK, ACK), teardown (FIN, TIME_WAIT)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TIME_WAIT&lt;/code&gt; storms — what causes them, why they matter at scale
&lt;/li&gt;
&lt;/ul&gt;
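&lt;p&gt;A back-of-the-envelope calculation shows why &lt;code&gt;TIME_WAIT&lt;/code&gt; matters at scale. The sketch below assumes typical Linux defaults that are not stated in the text: an ephemeral port range of 32768 to 60999 and a 60-second &lt;code&gt;TIME_WAIT&lt;/code&gt;. Treat the result as a rough upper bound, not a measurement.&lt;/p&gt;

```python
# Rough upper bound on new outbound connections per second from one source IP
# to one destination IP:port before ephemeral ports run out.
EPHEMERAL_PORT_RANGE = (32768, 60999)  # assumed net.ipv4.ip_local_port_range default
TIME_WAIT_SECONDS = 60                 # assumed Linux TIME_WAIT duration


def max_new_connections_per_second(port_range=EPHEMERAL_PORT_RANGE,
                                   time_wait_s=TIME_WAIT_SECONDS) -> float:
    low, high = port_range
    usable_ports = high - low + 1
    # Each closed connection keeps its local port busy for time_wait_s seconds.
    return usable_ports / time_wait_s


if __name__ == "__main__":
    print(f"{max_new_connections_per_second():.0f} connections/second")
```

&lt;p&gt;Only the side that closes the connection first holds the socket in &lt;code&gt;TIME_WAIT&lt;/code&gt;, which is why connection reuse (keep-alive, pooling) is the usual fix for clients and proxies.&lt;/p&gt;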

&lt;p&gt;&lt;strong&gt;DNS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How DNS resolution works end-to-end — recursive resolver, authoritative server
&lt;/li&gt;
&lt;li&gt;DNS record types: A, AAAA, CNAME, MX, TXT, PTR, SRV
&lt;/li&gt;
&lt;li&gt;TTL — caching, negative caching, TTL trade-offs
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ndots&lt;/code&gt; setting in Linux — how it affects resolution order (critical for Kubernetes)
&lt;/li&gt;
&lt;li&gt;CoreDNS — how Kubernetes uses it, common misconfigurations, DNS debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Linux networking internals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network interfaces — physical, virtual (&lt;code&gt;veth&lt;/code&gt;), bridge, loopback, dummy
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;veth&lt;/code&gt; pairs — how they work, why they are used for container networking
&lt;/li&gt;
&lt;li&gt;Linux bridge — how it connects veth pairs (like a virtual switch)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;conntrack&lt;/code&gt; — connection tracking table, how NAT works, &lt;code&gt;conntrack -L&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TLS and certificates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS handshake — client hello, server hello, certificate exchange, key exchange
&lt;/li&gt;
&lt;li&gt;Certificate chain — root CA, intermediate CA, leaf certificate
&lt;/li&gt;
&lt;li&gt;mTLS — mutual authentication, both sides present certificates (used in service meshes)
&lt;/li&gt;
&lt;li&gt;Certificate management — cert-manager in Kubernetes, Let's Encrypt, ACME protocol
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 3 — Go (Golang)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Go is the dominant language of the cloud-native ecosystem: Kubernetes, Prometheus, Terraform, Argo CD, Cilium, and Vault are all written largely in Go.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language basics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packages, modules (&lt;code&gt;go.mod&lt;/code&gt;, &lt;code&gt;go.sum&lt;/code&gt;), workspace mode
&lt;/li&gt;
&lt;li&gt;Basic types, structs, interfaces, methods, pointers
&lt;/li&gt;
&lt;li&gt;Error handling — &lt;code&gt;error&lt;/code&gt; interface, &lt;code&gt;errors.Is()&lt;/code&gt;, &lt;code&gt;errors.As()&lt;/code&gt;, wrapping errors with &lt;code&gt;%w&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Defer, panic, recover — use cases and pitfalls
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interfaces and composition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implicit interface satisfaction — no &lt;code&gt;implements&lt;/code&gt; keyword
&lt;/li&gt;
&lt;li&gt;Embedding structs and interfaces
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;io.Reader&lt;/code&gt; / &lt;code&gt;io.Writer&lt;/code&gt; / &lt;code&gt;io.Closer&lt;/code&gt; interface family
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;context.Context&lt;/code&gt; — cancellation, deadlines, value propagation — used everywhere in infra code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goroutines and concurrency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Goroutines — lightweight threads managed by the Go runtime
&lt;/li&gt;
&lt;li&gt;Channels — unbuffered vs buffered, direction, closing
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;select&lt;/code&gt; statement — multiplexing channel operations
&lt;/li&gt;
&lt;li&gt;Race detector — &lt;code&gt;go run -race&lt;/code&gt;, &lt;code&gt;go test -race&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Common concurrency mistakes: goroutine leaks, channel deadlocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory and performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stack vs heap allocation — escape analysis (&lt;code&gt;go build -gcflags="-m"&lt;/code&gt;)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Standard library for infra work&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;net/http&lt;/code&gt; — building HTTP servers and clients, middleware pattern
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;os/exec&lt;/code&gt; — running subprocesses safely
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flag&lt;/code&gt; and &lt;code&gt;os.Args&lt;/code&gt; — CLI argument parsing
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;time&lt;/code&gt; — duration arithmetic, ticker, timer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CLI tools and infra tooling patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config file loading — layered config (flags &amp;gt; env vars &amp;gt; config file &amp;gt; defaults)
&lt;/li&gt;
&lt;li&gt;Writing a simple HTTP server with graceful shutdown on &lt;code&gt;SIGTERM&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
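&lt;p&gt;The layered precedence (flags, then env vars, then config file, then defaults) is language-agnostic. A minimal sketch in Python, where the &lt;code&gt;APP_&lt;/code&gt; prefix and the config keys are hypothetical:&lt;/p&gt;

```python
from collections import ChainMap

# Hypothetical defaults for an imaginary service.
DEFAULTS = {"port": "8080", "log_level": "info", "timeout": "30s"}


def load_config(flags: dict, env: dict, file_values: dict) -> ChainMap:
    """Merge config layers; on lookup, the earliest map that has the key wins."""
    # Map hypothetical APP_* environment variables to config keys.
    env_values = {
        key[len("APP_"):].lower(): value
        for key, value in env.items()
        if key.startswith("APP_")
    }
    return ChainMap(flags, env_values, file_values, DEFAULTS)


if __name__ == "__main__":
    config = load_config(
        flags={"port": "9090"},
        env={"APP_PORT": "7070", "APP_LOG_LEVEL": "debug"},
        file_values={"log_level": "warn", "timeout": "10s"},
    )
    for key in ("port", "log_level", "timeout"):
        print(key, "=", config[key])
```

&lt;p&gt;Keeping the layers separate instead of merging them eagerly makes it easy to report where each effective value came from.&lt;/p&gt;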




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 4 — Kubernetes (the most important stage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Most companies on your list either build on K8s, build for K8s, or expect you to operate it at scale.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4.1 Architecture and control plane&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;API server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Central hub — all components communicate through the API server
&lt;/li&gt;
&lt;li&gt;REST API — resource types, verbs (get/list/watch/create/update/patch/delete)
&lt;/li&gt;
&lt;li&gt;Authentication — service account tokens (JWT), kubeconfig, OIDC, certificates
&lt;/li&gt;
&lt;li&gt;Admission control chain — mutating admission webhooks run first, then validating
&lt;/li&gt;
&lt;li&gt;etcd watch — how the API server streams changes to controllers
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;etcd&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raft consensus — leader election, log replication, quorum (why 3 or 5 nodes)
&lt;/li&gt;
&lt;li&gt;Key-value watch API — how controllers get notified of changes
&lt;/li&gt;
&lt;/ul&gt;
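&lt;p&gt;The "why 3 or 5 nodes" question comes down to majority arithmetic. A small sketch (function names are illustrative):&lt;/p&gt;

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

&lt;p&gt;Going from 3 to 4 members raises the quorum without tolerating any extra failure, which is why odd cluster sizes are preferred.&lt;/p&gt;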

&lt;p&gt;&lt;strong&gt;Scheduler&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduling cycle: filtering (predicates) → scoring (priorities) → binding
&lt;/li&gt;
&lt;li&gt;Predicates: &lt;code&gt;NodeSelector&lt;/code&gt;, &lt;code&gt;NodeAffinity&lt;/code&gt;, &lt;code&gt;PodAffinity&lt;/code&gt;, &lt;code&gt;Taints/Tolerations&lt;/code&gt;, resource fit
&lt;/li&gt;
&lt;/ul&gt;
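&lt;p&gt;The filter, score, bind cycle can be illustrated with a toy model. This is not the real scheduler: the node and pod dictionaries, the field names, and the "most free CPU" score are all assumptions made for the example.&lt;/p&gt;

```python
# Toy model of the scheduling cycle: filter nodes, score the survivors, pick the best.

def filter_nodes(pod, nodes):
    """Keep nodes that match the pod's nodeSelector and have enough free CPU and memory."""
    feasible = []
    for node in nodes:
        labels_match = all(node["labels"].get(k) == v
                           for k, v in pod["node_selector"].items())
        fits = (node["free_cpu_m"] >= pod["cpu_m"]
                and node["free_mem_mi"] >= pod["mem_mi"])
        if labels_match and fits:
            feasible.append(node)
    return feasible


def score_node(pod, node):
    """Prefer the node left with the most free CPU after placement (higher is better)."""
    return node["free_cpu_m"] - pod["cpu_m"]


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod is unschedulable (stays Pending)."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda node: score_node(pod, node))["name"]


if __name__ == "__main__":
    nodes = [
        {"name": "node-a", "labels": {"disk": "ssd"}, "free_cpu_m": 500, "free_mem_mi": 1024},
        {"name": "node-b", "labels": {"disk": "ssd"}, "free_cpu_m": 2000, "free_mem_mi": 4096},
        {"name": "node-c", "labels": {"disk": "hdd"}, "free_cpu_m": 4000, "free_mem_mi": 8192},
    ]
    pod = {"cpu_m": 250, "mem_mi": 512, "node_selector": {"disk": "ssd"}}
    print(schedule(pod, nodes))
```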

&lt;p&gt;&lt;strong&gt;Controller manager&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Informer pattern — List + Watch, local cache, event handlers
&lt;/li&gt;
&lt;li&gt;Reconcile loop — compare desired state (spec) with actual state, take action to converge
&lt;/li&gt;
&lt;li&gt;Key controllers: Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Endpoints&lt;/li&gt;
&lt;/ul&gt;
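&lt;p&gt;A reconcile loop reduces to "diff desired against observed, then emit actions". The sketch below models a ReplicaSet-like controller in a few lines; the action tuples and pod names are hypothetical, and a real controller would call the API server instead of returning a list.&lt;/p&gt;

```python
def reconcile(desired_replicas: int, running_pods: list) -> list:
    """Compare desired state with observed state and return actions that converge them."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create", "new-pod")] * diff
    if diff == 0:
        # Already converged: reconcile must be a no-op.
        return []
    # Too many pods: delete the surplus.
    return [("delete", name) for name in running_pods[desired_replicas:]]


if __name__ == "__main__":
    print(reconcile(3, ["web-1"]))
    print(reconcile(1, ["web-1", "web-2", "web-3"]))
    print(reconcile(2, ["web-1", "web-2"]))
```

&lt;p&gt;The function is level-triggered: it looks only at current state, so running it again after the actions are applied returns an empty list.&lt;/p&gt;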

&lt;p&gt;&lt;strong&gt;kubelet&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches the API server for pods assigned to its node
&lt;/li&gt;
&lt;li&gt;CRI (Container Runtime Interface) — how kubelet talks to containerd or CRI-O
&lt;/li&gt;
&lt;li&gt;Pod lifecycle: pending → pulling image → creating container → running → terminating
&lt;/li&gt;
&lt;li&gt;Liveness vs readiness vs startup probes — how they work, when each probe type fails
&lt;/li&gt;
&lt;li&gt;Eviction — memory pressure, disk pressure, node conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;kube-proxy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;iptables mode — creates DNAT rules in &lt;code&gt;KUBE-SERVICES&lt;/code&gt; chain for every Service
&lt;/li&gt;
&lt;li&gt;How ClusterIP services work — virtual IP that only exists in iptables/IPVS rules
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4.2 Workloads and objects&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Core objects you must know cold&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod — smallest deployable unit, spec fields, container lifecycle hooks, init containers
&lt;/li&gt;
&lt;li&gt;Deployment — rolling update strategy, &lt;code&gt;maxSurge&lt;/code&gt;, &lt;code&gt;maxUnavailable&lt;/code&gt;, rollback
&lt;/li&gt;
&lt;li&gt;StatefulSet — stable network identity, ordered deployment, persistent volume claims
&lt;/li&gt;
&lt;li&gt;DaemonSet — one pod per node, use cases (log shippers, monitoring agents, CNI plugins)
&lt;/li&gt;
&lt;li&gt;Job and CronJob — completions, parallelism, failure handling, cron schedule format
&lt;/li&gt;
&lt;/ul&gt;
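&lt;p&gt;For the rolling update strategy, &lt;code&gt;maxSurge&lt;/code&gt; and &lt;code&gt;maxUnavailable&lt;/code&gt; bound how many pods exist and how many are available during a rollout. A small sketch, assuming the documented rounding (percentages round up for surge, down for unavailable) and the default of 25% for both; the helper names are illustrative:&lt;/p&gt;

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute number or a percentage string such as "25%" against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    print(rollout_bounds(10))          # (13, 8)
    print(rollout_bounds(3, 1, 0))     # (4, 3)
```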

&lt;p&gt;&lt;strong&gt;Networking objects&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endpoints and EndpointSlices — how services know which pods to route to
&lt;/li&gt;
&lt;li&gt;Ingress — host/path-based routing, TLS termination, ingress controllers (Nginx, Traefik)
&lt;/li&gt;
&lt;li&gt;Network Policy — ingress/egress rules, podSelector, namespaceSelector, default deny&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage objects&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PersistentVolume (PV) and PersistentVolumeClaim (PVC) — static vs dynamic provisioning
&lt;/li&gt;
&lt;li&gt;StorageClass — provisioner, reclaim policy, volume binding mode
&lt;/li&gt;
&lt;li&gt;CSI (Container Storage Interface) — plugin interface for dynamic storage provisioning in Kubernetes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests&lt;/code&gt; vs &lt;code&gt;limits&lt;/code&gt; — requests used for scheduling, limits enforced by cgroups
&lt;/li&gt;
&lt;li&gt;QoS classes: Guaranteed (every container sets CPU and memory requests equal to limits), Burstable (at least one request or limit set, but not Guaranteed), BestEffort (no requests or limits on any container)
&lt;/li&gt;
&lt;li&gt;LimitRange — default limits/requests for a namespace
&lt;/li&gt;
&lt;li&gt;ResourceQuota — total resource budget for a namespace
&lt;/li&gt;
&lt;li&gt;PodDisruptionBudget (PDB) — minimum available pods during voluntary disruptions&lt;/li&gt;
&lt;/ul&gt;
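&lt;p&gt;The QoS class is derived from the pod spec, not set directly. A simplified sketch of the classification rules, assuming each container is represented as a plain dictionary (a hypothetical shape) and comparing quantities as strings rather than parsing them:&lt;/p&gt;

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' resources.

    Each container is a dict such as:
    {"requests": {"cpu": "500m", "memory": "256Mi"}, "limits": {"cpu": "500m", "memory": "256Mi"}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{}]))
```

&lt;p&gt;The class matters under node memory pressure: BestEffort pods are the first candidates for eviction and Guaranteed pods the last.&lt;/p&gt;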

&lt;h4&gt;
  
  
  &lt;strong&gt;4.3 Kubernetes operators (concepts — implementation detail in deep-dive file)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Operators extend Kubernetes with custom resources (CRDs) and controllers that reconcile desired vs actual state
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRD&lt;/strong&gt; — custom API object stored in etcd; separates spec (desired) from status (observed)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconcile loop&lt;/strong&gt; — compare spec to reality; create/update/delete until they match
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finalizers&lt;/strong&gt; — block deletion until cleanup (e.g. snapshot before delete) completes
&lt;/li&gt;
&lt;li&gt;Full controller-runtime, webhooks, and operator patterns → see deep-dive file&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4.4 Kubernetes networking&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;CNI (Container Network Interface)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IPAM (IP Address Management) — how pods get IPs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pod-to-pod networking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each pod gets its own network namespace
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;veth&lt;/code&gt; pair — one end in pod namespace, one end in host namespace
&lt;/li&gt;
&lt;li&gt;Linux bridge (&lt;code&gt;cbr0&lt;/code&gt; or similar) — connects all veth pairs on a node
&lt;/li&gt;
&lt;li&gt;How packets travel between pods on the same node vs different nodes
&lt;/li&gt;
&lt;li&gt;Overlay networks — VXLAN encapsulation for cross-node traffic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Services and kube-proxy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How ClusterIP works — DNS → ClusterIP → iptables DNAT → pod IP
&lt;/li&gt;
&lt;li&gt;NodePort — how traffic enters the cluster from outside&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DNS in Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoreDNS deployment — Deployment with 2 replicas, &lt;code&gt;kube-dns&lt;/code&gt; Service
&lt;/li&gt;
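&lt;li&gt;DNS search path — search domains &lt;code&gt;&amp;lt;ns&amp;gt;.svc.cluster.local&lt;/code&gt;, &lt;code&gt;svc.cluster.local&lt;/code&gt;, &lt;code&gt;cluster.local&lt;/code&gt; let &lt;code&gt;&amp;lt;svc&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;svc&amp;gt;.&amp;lt;ns&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;svc&amp;gt;.&amp;lt;ns&amp;gt;.svc.cluster.local&lt;/code&gt; all resolve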
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ndots:5&lt;/code&gt; — names with fewer than 5 dots are tried against every search domain first, so an external name costs several failed lookups before it resolves (latency issue)
&lt;/li&gt;
&lt;li&gt;Headless services — no ClusterIP, DNS returns pod IPs directly (used by StatefulSets)
&lt;/li&gt;
&lt;li&gt;DNS debugging — &lt;code&gt;kubectl exec&lt;/code&gt; into a pod, use &lt;code&gt;nslookup&lt;/code&gt;, &lt;code&gt;dig&lt;/code&gt;, check CoreDNS logs&lt;/li&gt;
&lt;/ul&gt;
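&lt;p&gt;The effect of &lt;code&gt;ndots&lt;/code&gt; is easiest to see by listing the queries a resolver would try. The sketch below approximates glibc-style resolver behaviour; the function name and the example search list are assumptions.&lt;/p&gt;

```python
def candidate_queries(name: str, search: list, ndots: int = 5) -> list:
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    # Fewer dots than ndots: every search domain is tried before the name itself.
    return expanded + [absolute]


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for query in candidate_queries("api.example.com", search):
        print(query)
```

&lt;p&gt;With a three-entry search list, &lt;code&gt;api.example.com&lt;/code&gt; produces three queries that fail before the real one is sent. Writing the name with a trailing dot, or lowering &lt;code&gt;ndots&lt;/code&gt; in the pod's &lt;code&gt;dnsConfig&lt;/code&gt;, avoids the extra round trips.&lt;/p&gt;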

&lt;h4&gt;
  
  
  &lt;strong&gt;4.5 Autoscaling and resource optimization&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Horizontal Pod Autoscaler (HPA)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics server — provides CPU/memory metrics from kubelet
&lt;/li&gt;
&lt;li&gt;HPA control loop — target metric value, current metric value, desired replicas formula
&lt;/li&gt;
&lt;li&gt;Custom and external metrics — KEDA for event-driven scaling
&lt;/li&gt;
&lt;li&gt;Stabilization window — prevents flapping (scale-down slower than scale-up)&lt;/li&gt;
&lt;/ul&gt;
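&lt;p&gt;The desired replicas formula from the Kubernetes documentation is &lt;code&gt;ceil(currentReplicas * currentMetricValue / targetMetricValue)&lt;/code&gt;. A minimal sketch follows; the 10% tolerance reflects the documented default, while the min and max bounds used here are arbitrary example values.&lt;/p&gt;

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Compute the HPA replica target for a single metric."""
    ratio = current_value / target_value
    # Within tolerance of the target: leave the replica count alone.
    if tolerance >= abs(ratio - 1.0):
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(4, current_value=200, target_value=100))   # 8
    print(desired_replicas(4, current_value=50, target_value=100))    # 2
```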

&lt;p&gt;&lt;strong&gt;Cluster Autoscaler (CA)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale-up trigger — unschedulable pods (Pending state)
&lt;/li&gt;
&lt;li&gt;Scale-down trigger — underutilized nodes for 10 minutes (default)
&lt;/li&gt;
&lt;li&gt;Node groups — CA works with cloud provider node groups (ASGs in AWS)
&lt;/li&gt;
&lt;li&gt;CA and PDBs — CA respects PodDisruptionBudgets during scale-down
&lt;/li&gt;
&lt;li&gt;Safe-to-evict annotation — &lt;code&gt;cluster-autoscaler.kubernetes.io/safe-to-evict: "false"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;KEDA (Kubernetes Event-Driven Autoscaling)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ScaledObject CRD — links a workload to a scaler
&lt;/li&gt;
&lt;li&gt;Built-in scalers — Kafka consumer lag, queue depth, Prometheus metrics, cron
&lt;/li&gt;
&lt;li&gt;Scale to zero — KEDA can scale deployments down to 0 (HPA cannot by default)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4.6 Cloud-native Kubernetes&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Platform teams operate K8s on top of cloud infrastructure. You need to understand what the cloud layer provides.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managed control planes (EKS / GKE / AKS)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns etcd — the cloud provider manages the control plane; you manage worker nodes and workloads
&lt;/li&gt;
&lt;li&gt;API server endpoint — public vs private endpoint, implications for CI/CD and developer access
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;VPC and networking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public vs private subnets — worker nodes typically in private subnets, NAT gateway for egress
&lt;/li&gt;
&lt;li&gt;Pod CIDR vs node subnet vs service CIDR — three separate address spaces that must not overlap
&lt;/li&gt;
&lt;li&gt;Cloud load balancers — ALB/NLB/GCLB mapping to &lt;code&gt;LoadBalancer&lt;/code&gt; Service type
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;externalTrafficPolicy: Local&lt;/code&gt; vs &lt;code&gt;Cluster&lt;/code&gt; — source IP preservation and health check trade-offs on cloud LBs
&lt;/li&gt;
&lt;/ul&gt;
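&lt;p&gt;Checking that the node, pod, and service ranges do not overlap is easy to automate before a cluster is created. A sketch using Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module; the CIDR values are made-up examples.&lt;/p&gt;

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict) -> list:
    """Return pairs of named CIDR ranges that overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    plan = {
        "node_subnet": "10.0.0.0/20",
        "pod_cidr": "10.244.0.0/16",
        "service_cidr": "10.96.0.0/12",
    }
    print(find_overlaps(plan))                                     # []
    print(ipaddress.ip_network(plan["pod_cidr"]).num_addresses)    # 65536
```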

&lt;p&gt;&lt;strong&gt;Cloud IAM and workload identity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM roles, policies, trust relationships — who can assume what, least-privilege policy design
&lt;/li&gt;
&lt;li&gt;AWS IRSA — OIDC provider on cluster, annotated ServiceAccount, projected token → STS &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;GCP Workload Identity — Kubernetes SA bound to GCP SA, no long-lived keys on nodes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Managed services vs in-cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When to use RDS/Aurora vs self-hosted Postgres in K8s — ops burden, HA, backups, patching
&lt;/li&gt;
&lt;li&gt;ElastiCache/Memorystore vs Redis Cluster in K8s — same trade-off for caching
&lt;/li&gt;
&lt;li&gt;Object storage (S3/GCS) — Loki/Thanos blocks, Terraform state, CI artifacts, backup targets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cloud DNS and certificates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACM / Google-managed certs — integration with cloud load balancers and Ingress
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4.7 Service mesh and gateways (awareness — detail in deep-dive file)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;North-south&lt;/strong&gt; — traffic from outside the cluster (ingress, TLS termination)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;East-west&lt;/strong&gt; — service-to-service traffic inside the cluster
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh&lt;/strong&gt; — sidecar proxies add mTLS, traffic splitting, and observability between services
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny NetworkPolicy&lt;/strong&gt; — baseline for multi-tenant clusters; explicitly allow required paths
&lt;/li&gt;
&lt;li&gt;Envoy, Istio, Gateway API, API gateways → see deep-dive file&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 5 — Container Security&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Critical for Aqua Security, Snyk, Chainguard. Also tested at GitLab, Harness, Datadog.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5.1 Container image security&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Image layers and attack surface&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Docker image layers work — each filesystem-changing instruction (&lt;code&gt;RUN&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;ADD&lt;/code&gt;) creates a layer
&lt;/li&gt;
&lt;li&gt;Base image choice — Alpine vs Debian vs distroless vs scratch
&lt;/li&gt;
&lt;li&gt;Distroless images — no shell, no package manager, minimal attack surface
&lt;/li&gt;
&lt;li&gt;Multi-stage builds — only copy the binary into the final stage, discard build tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vulnerability scanning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static scanning tools — Trivy, Grype, Snyk Container, Clair
&lt;/li&gt;
&lt;li&gt;What scanners check — OS packages, language dependencies, Dockerfile misconfigs
&lt;/li&gt;
&lt;li&gt;CVE prioritization — severity (CVSS score), exploitability, reachability
&lt;/li&gt;
&lt;li&gt;Base image updates — automated PRs to update base images (Renovate, Dependabot)
&lt;/li&gt;
&lt;li&gt;Scanning in CI — fail the pipeline on critical/high CVEs, policy as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software Bill of Materials (SBOM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What an SBOM is — list of all components in a software artifact
&lt;/li&gt;
&lt;li&gt;Generating SBOMs — Syft, &lt;code&gt;docker sbom&lt;/code&gt;; attaching them as signed attestations with &lt;code&gt;cosign attest&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5.2 Supply chain security&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SolarWinds attack — build system compromise, malicious code injected into signed artifacts
&lt;/li&gt;
&lt;li&gt;Log4Shell — transitive dependency vulnerability, hard to find without SBOMs
&lt;/li&gt;
&lt;li&gt;XZ Utils backdoor — malicious maintainer, social engineering, compromised source
&lt;/li&gt;
&lt;li&gt;The threat model — compromised build system, malicious dependency, typosquatting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SLSA framework (Supply chain Levels for Software Artifacts)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLSA Level 1 — provenance document exists
&lt;/li&gt;
&lt;li&gt;Provenance — who built the artifact, from what source, on what system, with what inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sigstore stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cosign — signs container images and other OCI artifacts
&lt;/li&gt;
&lt;li&gt;Keyless signing — short-lived certificate from Fulcio CA, no long-lived private keys
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5.3 Kubernetes RBAC and access control&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;RBAC model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role (namespace-scoped) vs ClusterRole (cluster-scoped)
&lt;/li&gt;
&lt;li&gt;RoleBinding vs ClusterRoleBinding
&lt;/li&gt;
&lt;li&gt;Subjects: ServiceAccount, User, Group
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Least privilege patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never use &lt;code&gt;cluster-admin&lt;/code&gt; for application workloads
&lt;/li&gt;
&lt;li&gt;Namespace-scoped service accounts for every workload
&lt;/li&gt;
&lt;li&gt;Projected service account tokens — short-lived, audience-bound, auto-rotated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pod security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod Security Standards — Privileged, Baseline, Restricted profiles
&lt;/li&gt;
&lt;li&gt;Pod Security Admission controller — enforces standards at namespace level
&lt;/li&gt;
&lt;li&gt;Security context — &lt;code&gt;runAsNonRoot&lt;/code&gt;, &lt;code&gt;runAsUser&lt;/code&gt;, &lt;code&gt;readOnlyRootFilesystem&lt;/code&gt;, &lt;code&gt;allowPrivilegeEscalation: false&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5.5 Cloud security (essentials — detail in deep-dive file)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IAM&lt;/strong&gt; — least-privilege roles and policies; no long-lived keys on nodes or in CI
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt; — at rest (disks, S3) and in transit (TLS)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs&lt;/strong&gt; — CloudTrail / cloud audit logs for who changed what
&lt;/li&gt;
&lt;li&gt;Permission boundaries, WAF, GuardDuty, compliance frameworks → deep-dive file&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 6 — Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Core product domain for Datadog, Grafana Labs, New Relic, Splunk.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6.1 Metrics and Prometheus&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prometheus data model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time series — metric name + label set + sequence of (timestamp, float64) samples
&lt;/li&gt;
&lt;li&gt;Label cardinality — why high-cardinality labels (&lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;request_id&lt;/code&gt;) cause OOM
&lt;/li&gt;
&lt;li&gt;Metric types:

&lt;ul&gt;
&lt;li&gt;Counter — monotonically increasing (requests total, errors total)
&lt;/li&gt;
&lt;li&gt;Gauge — can go up and down (memory usage, queue depth, temperature)
&lt;/li&gt;
&lt;li&gt;Histogram — distribution of values in configurable buckets (request duration, response size)
&lt;/li&gt;
&lt;li&gt;Summary — pre-calculated quantiles on client side (avoid if possible — not aggregatable)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
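&lt;p&gt;Label cardinality multiplies: the worst-case number of time series for one metric is the product of the distinct values of each label. A quick sketch with invented label counts:&lt;/p&gt;

```python
import math


def series_count(label_cardinalities: dict) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


if __name__ == "__main__":
    labels = {"method": 5, "status": 10, "endpoint": 50}
    print(series_count(labels))                           # 2500
    labels["user_id"] = 100_000
    print(series_count(labels))                           # 250000000
```

&lt;p&gt;Each active series costs memory in the Prometheus head block, so a single unbounded label can turn a cheap metric into an out-of-memory incident.&lt;/p&gt;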

&lt;p&gt;&lt;strong&gt;PromQL&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant vector vs range vector — &lt;code&gt;http_requests_total&lt;/code&gt; vs &lt;code&gt;http_requests_total[5m]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rate()&lt;/code&gt; — per-second rate of a counter over a range (use for counters, not gauges)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;increase()&lt;/code&gt; — total increase in a counter over a range
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sum by()&lt;/code&gt;, &lt;code&gt;avg by()&lt;/code&gt;, &lt;code&gt;max by()&lt;/code&gt; — aggregation operators, label dropping
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;histogram_quantile()&lt;/code&gt; — calculate p50/p95/p99 from histogram buckets
&lt;/li&gt;
&lt;li&gt;Alerting rules — &lt;code&gt;for&lt;/code&gt; duration, &lt;code&gt;labels&lt;/code&gt;, &lt;code&gt;annotations&lt;/code&gt;, Alertmanager integration&lt;/li&gt;
&lt;/ul&gt;
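&lt;p&gt;To make &lt;code&gt;histogram_quantile()&lt;/code&gt; less magical, here is a simplified re-implementation of the idea: find the bucket containing the requested rank and interpolate linearly inside it. It is a sketch of the technique, not Prometheus's exact code, and the bucket data is invented.&lt;/p&gt;

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative buckets [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound and end with (math.inf, total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # no observations at all


if __name__ == "__main__":
    # Request durations in seconds: 50 under 0.1 s, 90 under 0.5 s, 100 under 1 s.
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, buckets))
```

&lt;p&gt;Because of the interpolation, the accuracy of p95 or p99 depends entirely on how well the bucket boundaries bracket the real latencies.&lt;/p&gt;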

&lt;h4&gt;
  
  
  &lt;strong&gt;6.2 Distributed tracing&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tracing concepts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace — end-to-end record of a request through a distributed system
&lt;/li&gt;
&lt;li&gt;Span — a single unit of work within a trace (one service call, one DB query)
&lt;/li&gt;
&lt;li&gt;Parent-child span relationship — forms a tree structure (the trace)
&lt;/li&gt;
&lt;li&gt;Trace context propagation — W3C &lt;code&gt;traceparent&lt;/code&gt; header, B3 headers
&lt;/li&gt;
&lt;/ul&gt;
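&lt;p&gt;The W3C &lt;code&gt;traceparent&lt;/code&gt; header has four dash-separated hex fields: version, trace ID, parent span ID, and flags. A small parser sketch for version &lt;code&gt;00&lt;/code&gt; headers; the function name and returned dictionary shape are illustrative.&lt;/p&gt;

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Parse a W3C traceparent header into its fields, or return None if malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) % 2 == 1,  # lowest bit is the sampled flag
    }


if __name__ == "__main__":
    print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

&lt;p&gt;A service continuing the trace keeps the trace ID, generates a new span ID for its own work, and forwards that as the parent ID to downstream calls.&lt;/p&gt;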

&lt;p&gt;&lt;strong&gt;OpenTelemetry (OTel)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exporters — OTLP (preferred), Jaeger, Zipkin, Prometheus
&lt;/li&gt;
&lt;li&gt;OTel Collector — receives spans/metrics/logs, processes them, exports to backends
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sampling strategies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head sampling — decision made at trace start (random %, rate limit), before errors or latency are known
&lt;/li&gt;
&lt;li&gt;Tail sampling — decision made after seeing the full trace (can sample based on error, latency)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6.3 Logging&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Log shipping pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log sources — container stdout/stderr (collected by node agent), application log files
&lt;/li&gt;
&lt;li&gt;DaemonSet log agents — Fluent Bit (lightweight), Fluentd (more plugins), Vector (Rust-based)
&lt;/li&gt;
&lt;li&gt;Structured logging — JSON logs, consistent field names, log levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Grafana Loki&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loki's key design decision — indexes only labels (like Prometheus), not log content
&lt;/li&gt;
&lt;li&gt;Why this matters — much cheaper to store and index than Elasticsearch-style full-text index
&lt;/li&gt;
&lt;li&gt;Log streams — a stream is a set of logs with the same label set (like a Prometheus time series)
&lt;/li&gt;
&lt;li&gt;LogQL — log query language, filter expressions &lt;code&gt;{app="nginx"} |= "error"&lt;/code&gt;, metric queries
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6.4 SLOs and alerting&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;SLI/SLO/SLA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLI (Service Level Indicator) — the metric you measure (e.g., error rate, latency p99)
&lt;/li&gt;
&lt;li&gt;SLO (Service Level Objective) — the target (e.g., 99.9% of requests under 200ms)
&lt;/li&gt;
&lt;li&gt;Error budget — time you can be non-compliant (0.1% of 30 days = 43.2 minutes)
&lt;/li&gt;
&lt;li&gt;Error budget burn rate — how fast you are consuming the error budget&lt;/li&gt;
&lt;/ul&gt;
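
&lt;p&gt;The error budget and burn rate arithmetic can be written out in a few lines. The function names are hypothetical; the SLO is expressed as a fraction such as &lt;code&gt;0.999&lt;/code&gt;.&lt;/p&gt;

```python
def error_budget_minutes(slo, window_days):
    """Minutes of allowed non-compliance in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' the budget is being spent.

    1.0 means the budget lasts exactly the whole window; 2.0 means it is
    used up in half the window.
    """
    return observed_error_ratio / (1 - slo)
```

&lt;p&gt;For example, a service with a 99.9% SLO that is currently failing 1% of requests has a burn rate of about 10, so a 30-day budget would be gone in roughly 3 days.&lt;/p&gt;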

&lt;p&gt;&lt;strong&gt;Multi-window burn rate alerts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alertmanager integration — Prometheus rules send alerts to Alertmanager
&lt;/li&gt;
&lt;li&gt;Short window (5 min) + long window (1 hour) — two-condition alert to reduce false positives
&lt;/li&gt;
&lt;li&gt;Routing trees — route alerts to correct team based on labels&lt;/li&gt;
&lt;/ul&gt;
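
&lt;p&gt;A minimal sketch of the two-condition check follows. The threshold of 14.4 is an assumed example value, chosen because at that burn rate one hour consumes 14.4 / 720 = 2% of a 30-day budget; in practice this logic would live in a Prometheus alerting rule rather than in application code.&lt;/p&gt;

```python
def should_page(short_window_burn, long_window_burn, threshold=14.4):
    """Fire only when BOTH windows burn faster than the threshold.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now, so the
    alert stops firing soon after the problem is fixed.
    """
    return short_window_burn > threshold and long_window_burn > threshold
```

&lt;p&gt;A brief spike trips only the short window, and a problem that has already recovered trips only the long window; neither pages anyone.&lt;/p&gt;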

&lt;h4&gt;
  
  
  &lt;strong&gt;6.5 SRE practices&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Stage 6.4 covers SLO metrics and alerting. This section covers how platform/SRE teams operate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Severity levels (SEV1–SEV4) — customer impact, response time expectations
&lt;/li&gt;
&lt;li&gt;Incident commander role — coordinates response, comms, decision-making
&lt;/li&gt;
&lt;li&gt;Incident lifecycle — detect → triage → mitigate → resolve → postmortem
&lt;/li&gt;
&lt;li&gt;Status pages and stakeholder comms — internal vs external, update cadence
&lt;/li&gt;
&lt;li&gt;Runbooks — symptom-based (not cause-based), links to dashboards and remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On-call and alert quality&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert design — page on symptoms (SLO burn, user-facing errors), not causes (CPU high)
&lt;/li&gt;
&lt;li&gt;On-call rotation — follow-the-sun, escalation policies, handoff rituals
&lt;/li&gt;
&lt;li&gt;Toil — repetitive manual work; measure and automate (platform team's core mandate)
&lt;/li&gt;
&lt;li&gt;Error budget policy — when budget is exhausted, freeze features, focus on reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reliability engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application-level patterns — timeouts, retries, circuit breakers, idempotency (Stage 7.5)
&lt;/li&gt;
&lt;li&gt;Capacity planning — headroom targets, load testing before launches, saturation metrics (USE method, Stage 6.6)
&lt;/li&gt;
&lt;li&gt;Failure domain isolation — blast radius, multi-AZ/region design (Stage 4.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disaster recovery and resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup strategy beyond etcd — application data, cross-region replication, restore drills
&lt;/li&gt;
&lt;li&gt;Multi-AZ vs multi-region — zone failure tolerance vs region failure tolerance
&lt;/li&gt;
&lt;li&gt;Game days and chaos engineering — Litmus/Chaos Mesh: pod kill, network partition, AZ failure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Postmortems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blameless culture — focus on systems and process, not individuals
&lt;/li&gt;
&lt;li&gt;Timeline, contributing factors (plural, rather than a single root cause), action items with owners
&lt;/li&gt;
&lt;li&gt;Follow-through — track action items to completion, review in subsequent incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6.6 Performance engineering&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Unifies performance concepts scattered across Stages 1, 3, 6, and 10 into a methodology.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance methodology&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the goal first — latency vs throughput vs tail behavior vs cost
&lt;/li&gt;
&lt;li&gt;Measure before optimizing — establish baseline with load tests and production metrics
&lt;/li&gt;
&lt;li&gt;One change at a time — isolate variables; validate with before/after comparison
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Throughput vs latency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why higher throughput often worsens tail latency — queue buildup under saturation
&lt;/li&gt;
&lt;li&gt;Concurrency limits — connection pools, worker counts, HPA max replicas as backpressure levers
&lt;/li&gt;
&lt;li&gt;Backpressure — propagate slowness upstream instead of buffering indefinitely&lt;/li&gt;
&lt;/ul&gt;
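
&lt;p&gt;Queue buildup near saturation can be illustrated with the textbook M/M/1 queue, where mean time in the system is the service time divided by (1 - utilization). Real services are not M/M/1 queues, so this is only a model for intuition, assuming a single server with random arrivals.&lt;/p&gt;

```python
def mean_latency_ms(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: service_time / (1 - utilization)."""
    if utilization >= 1 or 0 > utilization:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


for u in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {u:.0%}: {mean_latency_ms(10, u):.0f} ms")
```

&lt;p&gt;With a 10 ms service time, mean latency is about 20 ms at 50% utilization, 100 ms at 90%, and 1000 ms at 99%. Pushing for the last few percent of throughput is what makes latency explode, which is the case for concurrency limits and backpressure.&lt;/p&gt;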

&lt;p&gt;&lt;strong&gt;Latency analysis and percentiles&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 (median) vs p95 vs p99 vs p999 — why averages lie; tail latency drives user experience
&lt;/li&gt;
&lt;li&gt;Histogram buckets in Prometheus — choose bucket boundaries for your SLO thresholds (Stage 6.1)
&lt;/li&gt;
&lt;li&gt;Why Summary metrics are problematic — pre-computed quantiles on client side are not aggregatable
&lt;/li&gt;
&lt;li&gt;RED method — Rate, Errors, Duration (for request-driven services)
&lt;/li&gt;
&lt;li&gt;USE method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)&lt;/li&gt;
&lt;/ul&gt;
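
&lt;p&gt;To see why averages mislead, the sketch below computes nearest-rank percentiles over a made-up set of latencies in which 2 out of 100 requests are slow. The data is hypothetical, and Prometheus estimates quantiles from histogram buckets rather than from raw samples like this.&lt;/p&gt;

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p / 100 * n) in integer math
    return ordered[max(rank, 1) - 1]


# 98 fast requests and 2 slow ones (hypothetical latencies in ms).
latencies = [10] * 98 + [1000] * 2
mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

&lt;p&gt;Here the mean is 29.8 ms, a value no request actually experienced, while p50 is 10 ms and p99 is 1000 ms. Only the tail percentile reveals the slow requests.&lt;/p&gt;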

&lt;p&gt;&lt;strong&gt;Finding bottlenecks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layered diagnosis — app → pod (cgroup metrics) → node (&lt;code&gt;vmstat&lt;/code&gt;, &lt;code&gt;iostat&lt;/code&gt;) → network → control plane
&lt;/li&gt;
&lt;li&gt;Go-specific — &lt;code&gt;pprof&lt;/code&gt; CPU/heap profiles, GC pause analysis, &lt;code&gt;GOGC&lt;/code&gt; tuning (Stage 3)
&lt;/li&gt;
&lt;li&gt;Database — slow query logs, connection pool exhaustion, replication lag (Stage 7)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Load testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test types — smoke, load (steady state), stress (find breaking point), spike, soak (memory leaks)
&lt;/li&gt;
&lt;li&gt;What platform teams validate — HPA response time, Cluster Autoscaler (CA) scale-up latency, PDB behavior under drain, ingress capacity
&lt;/li&gt;
&lt;li&gt;Warm-up period — exclude from measurements; run long enough for GC and caches to stabilize
&lt;/li&gt;
&lt;li&gt;Production-like data volume and cardinality — load test observability pipeline too (Stage 6.1 cardinality)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Caching and batching&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hierarchy — CDN edge (Stage 10.3) → Redis (Stage 7.3) → application in-memory
&lt;/li&gt;
&lt;li&gt;Connection pooling — DB pools, HTTP keep-alive; file descriptor and cgroup limits (Stage 1)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 7 — Distributed systems and databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Critical for CockroachDB, YugabyteDB, PlanetScale, ScyllaDB, Snowflake, Redis.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;7.1 Distributed systems theory&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Fundamental problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CAP theorem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency — every read sees the most recent write
&lt;/li&gt;
&lt;li&gt;Availability — every request gets a response (not necessarily the most recent data)
&lt;/li&gt;
&lt;li&gt;Partition tolerance — system works despite network partitions
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-world: CP systems (ZooKeeper, etcd, CockroachDB), AP systems (Cassandra, DynamoDB)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PACELC extends CAP — if there is a partition (P), choose availability (A) or consistency (C); else (E), trade off latency (L) against consistency (C)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More practical than CAP for comparing real databases&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consistency levels&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency / linearizability — operations appear instantaneous, globally ordered
&lt;/li&gt;
&lt;li&gt;Eventual consistency — replicas will converge eventually, reads may be stale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consensus algorithms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raft — designed for understandability, used in etcd, CockroachDB, TiKV

&lt;ul&gt;
&lt;li&gt;Leader election — candidates request votes, majority wins, term numbers
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-leader — all writes go to leader, replicated to followers (PostgreSQL, MySQL)
&lt;/li&gt;
&lt;li&gt;Leaderless (Dynamo-style) — any node accepts writes, quorum reads/writes (Cassandra)
&lt;/li&gt;
&lt;/ul&gt;
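
&lt;p&gt;The quorum condition for leaderless replication fits in one line: with N replicas, W write acknowledgements, and R read responses, reads are guaranteed to touch at least one up-to-date replica when W + R > N. The function names below are hypothetical.&lt;/p&gt;

```python
def quorums_overlap(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n


def majority(n):
    """Smallest quorum size for which any two quorums must overlap."""
    return n // 2 + 1
```

&lt;p&gt;With N = 3, choosing W = 2 and R = 2 overlaps, while W = 1 and R = 1 is faster but allows stale reads.&lt;/p&gt;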

&lt;p&gt;&lt;strong&gt;Clocks in distributed systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Physical clocks — NTP sync, still have drift, &lt;code&gt;clock_gettime()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;7.2 Distributed SQL (CockroachDB / YugabyteDB)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;YugabyteDB — similar model, supports PostgreSQL and Cassandra APIs, DocDB storage layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Distributed transactions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MVCC (Multi-Version Concurrency Control) — every write creates a new version, readers see a consistent snapshot
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Schema changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geo-distribution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Region/zone topology — replicas placed in different regions/zones
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;7.3 Redis internals&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Data structures and their implementations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDB snapshot — &lt;code&gt;BGSAVE&lt;/code&gt; forks the process, child writes snapshot using CoW, parent continues serving
&lt;/li&gt;
&lt;li&gt;AOF (Append-Only File) — logs every write command, &lt;code&gt;fsync&lt;/code&gt; policies: &lt;code&gt;always&lt;/code&gt;, &lt;code&gt;everysec&lt;/code&gt;, &lt;code&gt;no&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Hybrid persistence — RDB + AOF combined, AOF replays only since last RDB snapshot
&lt;/li&gt;
&lt;li&gt;No persistence mode — pure cache, data loss on restart acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;REPLICAOF&lt;/code&gt; — replica connects to master, full sync (RDB transfer) then partial sync
&lt;/li&gt;
&lt;li&gt;Replica lag — &lt;code&gt;INFO replication&lt;/code&gt; shows &lt;code&gt;master_repl_offset&lt;/code&gt; vs replica offset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Redis Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eviction policies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;noeviction&lt;/code&gt; — return error when maxmemory hit
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allkeys-lru&lt;/code&gt; — evict any key using LRU approximation
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;volatile-lru&lt;/code&gt; — evict only keys with TTL set, using LRU
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allkeys-lfu&lt;/code&gt; — evict least frequently used keys (better for skewed access patterns)
&lt;/li&gt;
&lt;/ul&gt;
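
&lt;p&gt;The behavior of &lt;code&gt;allkeys-lru&lt;/code&gt; can be approximated with a small exact-LRU cache. This is a teaching sketch and not how Redis is implemented: Redis limits memory rather than key count and samples a few keys to approximate LRU instead of tracking exact order. The class name and &lt;code&gt;max_keys&lt;/code&gt; parameter are assumptions for the example.&lt;/p&gt;

```python
from collections import OrderedDict


class LRUCache:
    """Exact LRU with a key-count limit, standing in for allkeys-lru."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_keys:
            self.data.popitem(last=False)  # evict the least recently used key
```

&lt;p&gt;Under &lt;code&gt;noeviction&lt;/code&gt;, the &lt;code&gt;set&lt;/code&gt; call would instead fail once the limit is reached; under &lt;code&gt;volatile-lru&lt;/code&gt;, only keys with a TTL would be candidates for the eviction step.&lt;/p&gt;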

&lt;h4&gt;
  
  
  &lt;strong&gt;7.5 Backend patterns for platform engineers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Platform teams build controllers, webhooks, internal APIs, and golden-path services. These patterns apply.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API design and gRPC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST vs gRPC — REST for human-facing/admin APIs; gRPC for high-performance internal service-to-service
&lt;/li&gt;
&lt;li&gt;Deadlines and cancellation — &lt;code&gt;context.Context&lt;/code&gt; propagation, client-side timeouts (Stage 3)
&lt;/li&gt;
&lt;li&gt;API versioning — URL path vs header vs protobuf package; deprecation policy
&lt;/li&gt;
&lt;li&gt;Idempotent APIs — safe retries for POST/PUT; idempotency keys for create operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MVCC — multi-version concurrency control, snapshots, vacuum, bloat
&lt;/li&gt;
&lt;li&gt;Indexes — B-tree (default), partial indexes, covering indexes, when indexes hurt writes
&lt;/li&gt;
&lt;li&gt;Connection limits — &lt;code&gt;max_connections&lt;/code&gt;, connection pooling (PgBouncer), pool sizing vs pod count
&lt;/li&gt;
&lt;li&gt;Replication — streaming replication, replication lag, synchronous vs asynchronous
&lt;/li&gt;
&lt;li&gt;Isolation levels — Read Committed (default), Repeatable Read, Serializable
&lt;/li&gt;
&lt;li&gt;Foundation for CockroachDB/Yugabyte (Stage 7.2) and PlanetScale/Vitess (Stage 11)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Message queues and event streaming&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka fundamentals — topics, partitions, consumer groups, offset commits, consumer lag
&lt;/li&gt;
&lt;li&gt;Delivery semantics — at-most-once, at-least-once, exactly-once (idempotent consumers + transactions)
&lt;/li&gt;
&lt;li&gt;Dead-letter queues (DLQ) — poison messages, retry policies, manual inspection
&lt;/li&gt;
&lt;li&gt;When to use what — Kafka (high-throughput log), SQS/RabbitMQ (task queues), NATS (low-latency pub/sub)
&lt;/li&gt;
&lt;li&gt;KEDA integration — scale on Kafka consumer lag (Stage 4.5)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reliability patterns in application code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeouts on every outbound call — HTTP clients, DB queries, gRPC deadlines
&lt;/li&gt;
&lt;li&gt;Retries with exponential backoff and jitter — max attempts, retry only on idempotent operations
&lt;/li&gt;
&lt;li&gt;Circuit breakers — open/half-open/closed states, failure threshold, recovery probe
&lt;/li&gt;
&lt;li&gt;Health checks — liveness (restart if broken) vs readiness (stop sending traffic) vs startup (Stage 4.1)&lt;/li&gt;
&lt;/ul&gt;
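
&lt;p&gt;Retries with exponential backoff and jitter can be sketched as follows. The "full jitter" variant is used here (a random delay between zero and the exponential ceiling); the function names, defaults, and the choice of &lt;code&gt;TimeoutError&lt;/code&gt; as the retryable error are assumptions for illustration.&lt;/p&gt;

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_attempts=4, retry_on=(TimeoutError,), sleep=time.sleep):
    """Run an idempotent operation, retrying only on the listed exceptions."""
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error
            sleep(delay)
```

&lt;p&gt;The jitter spreads retries from many clients over time so they do not hit a recovering service in synchronized waves, and the cap keeps the worst-case wait bounded.&lt;/p&gt;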

&lt;p&gt;&lt;strong&gt;Caching and background work&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache-aside vs read-through vs write-through — invalidation strategies, TTL design
&lt;/li&gt;
&lt;li&gt;Cache stampede protection — single-flight, lock-based refresh
&lt;/li&gt;
&lt;li&gt;Background jobs — Job vs long-running Deployment worker in K8s (Stage 4.2)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;7.6 Data migration strategies&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Deployment (Stage 9.4) ships code; data migration moves state. These are separate problems.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expand-contract pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand — add new column/table/API field (backward compatible, old code still works)
&lt;/li&gt;
&lt;li&gt;Migrate — backfill data, dual-read or dual-write during transition
&lt;/li&gt;
&lt;li&gt;Contract — remove old column/table/API field once all code uses new path
&lt;/li&gt;
&lt;li&gt;Why it matters — enables zero-downtime deploys with rolling updates (Stage 9.4)&lt;/li&gt;
&lt;/ul&gt;
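
&lt;p&gt;The expand and migrate phases can be walked through with an in-memory SQLite database. The table and column names are hypothetical, and a production backfill would also throttle itself and run against the real database engine.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("linus",), ("grace",)])

# Expand: add a nullable column. Old code that never mentions it keeps working.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn, batch_size=2):
    """Migrate: copy data into the new column in small batches."""
    while True:
        cursor = conn.execute(
            "UPDATE users SET display_name = name WHERE id IN "
            "(SELECT id FROM users WHERE display_name IS NULL "
            "AND name IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cursor.rowcount == 0:
            return


backfill(db)
remaining = db.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
# Contract (not run here): drop the old column only after every reader and
# writer has switched to display_name.
```

&lt;p&gt;Each phase is deployable on its own, so old and new application versions can run side by side during a rolling update.&lt;/p&gt;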

&lt;p&gt;&lt;strong&gt;Dual writes and reconciliation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write to old and new systems simultaneously during transition
&lt;/li&gt;
&lt;li&gt;Reconciliation job — compare old vs new, fix drift, idempotency required
&lt;/li&gt;
&lt;li&gt;Risk — inconsistency window if one write succeeds and the other fails; needs compensating transactions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CDC tools — Debezium, AWS DMS, Maxwell — stream DB changes to Kafka/message bus
&lt;/li&gt;
&lt;li&gt;Use cases — real-time replication, event-driven architecture, incremental migration
&lt;/li&gt;
&lt;li&gt;Initial snapshot + streaming — full load then switch to binlog/WAL streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Online schema migrations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand-contract for indexes — create index concurrently, swap in application
&lt;/li&gt;
&lt;li&gt;Migration ordering — schema before code (expand) or code before schema (contract) depending on direction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cutover and verification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic shifting — percentage-based cutover, instant rollback if error rate spikes
&lt;/li&gt;
&lt;li&gt;Backfill throttling — rate-limit backfill to protect production DB performance
&lt;/li&gt;
&lt;li&gt;Rollback plan — can you revert if cutover fails? How long is old system kept warm?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 8 — Infrastructure as Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;HashiCorp and Pulumi are on your list. IaC is also tested at almost every other company.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;8.1 Terraform&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Core concepts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HCL (HashiCorp Configuration Language) — declarative configuration language
&lt;/li&gt;
&lt;li&gt;Provider — plugin that manages a specific API (AWS, GCP, Kubernetes, Vault)
&lt;/li&gt;
&lt;li&gt;Resource — infrastructure object managed by Terraform
&lt;/li&gt;
&lt;li&gt;Data source — read-only reference to existing infrastructure
&lt;/li&gt;
&lt;li&gt;Output — export values from a configuration
&lt;/li&gt;
&lt;li&gt;Variable — input values, with type constraints and validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;State management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State file (&lt;code&gt;terraform.tfstate&lt;/code&gt;) — JSON file recording current state of all managed resources
&lt;/li&gt;
&lt;li&gt;Remote backends — S3 + DynamoDB (locking), Terraform Cloud, GCS
&lt;/li&gt;
&lt;li&gt;State locking — prevents concurrent applies, DynamoDB table for distributed lock
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform import&lt;/code&gt; — bring existing infrastructure under Terraform management
&lt;/li&gt;
&lt;li&gt;State drift — real world diverges from state, &lt;code&gt;terraform plan&lt;/code&gt; detects this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Plan and apply lifecycle&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency graph — Terraform builds a DAG of all resources and their dependencies
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;create_before_destroy&lt;/code&gt; lifecycle meta-argument — zero-downtime replacements
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prevent_destroy&lt;/code&gt; — protect critical resources from accidental deletion
&lt;/li&gt;
&lt;li&gt;Targeted applies — &lt;code&gt;terraform apply -target=aws_instance.foo&lt;/code&gt; (use sparingly)
&lt;/li&gt;
&lt;/ul&gt;
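
&lt;p&gt;The idea behind the dependency graph can be shown with a topological sort. This is not Terraform's implementation, only a sketch of the ordering it implies, using Python's standard &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9 or later) and hypothetical resource names.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the resources in its set.
dependencies = {
    "aws_vpc.main": set(),
    "aws_subnet.app": {"aws_vpc.main"},
    "aws_security_group.app": {"aws_vpc.main"},
    "aws_instance.app": {"aws_subnet.app", "aws_security_group.app"},
}

# Dependencies come first when creating, last when destroying.
create_order = list(TopologicalSorter(dependencies).static_order())
destroy_order = list(reversed(create_order))
```

&lt;p&gt;Resources with no path between them, such as the subnet and the security group here, have no required order, which is what allows them to be created in parallel.&lt;/p&gt;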

&lt;p&gt;&lt;strong&gt;Modules&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Module structure — &lt;code&gt;main.tf&lt;/code&gt;, &lt;code&gt;variables.tf&lt;/code&gt;, &lt;code&gt;outputs.tf&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Module versioning — source from Terraform Registry, GitHub with &lt;code&gt;?ref=v1.2.3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Module composition patterns — root module calls child modules
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;8.2 Pulumi (awareness)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Alternative to Terraform — define infrastructure in TypeScript, Python, or Go instead of HCL
&lt;/li&gt;
&lt;li&gt;Same preview/apply/state model as Terraform (&lt;code&gt;pulumi preview&lt;/code&gt;, &lt;code&gt;pulumi up&lt;/code&gt;); details in deep-dive file when you use it&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;8.3 HashiCorp Vault&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core + storage backend — Vault core is stateless, all state in storage (Raft integrated or external like Consul)
&lt;/li&gt;
&lt;li&gt;Auto-unseal — use cloud KMS (AWS KMS, GCP KMS) to automatically unseal on restart
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auth methods&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes auth — pod presents service account token, Vault validates with K8s API server
&lt;/li&gt;
&lt;li&gt;AWS IAM auth — use IAM role/instance profile to authenticate
&lt;/li&gt;
&lt;li&gt;OIDC/JWT — integrate with any OIDC provider (GitHub Actions, GitLab CI)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secret engines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KV v2 — versioned key-value store, soft delete, &lt;code&gt;max_versions&lt;/code&gt; per key
&lt;/li&gt;
&lt;li&gt;Dynamic secrets — Vault generates credentials on-demand (DB passwords, AWS keys, certificates)
&lt;/li&gt;
&lt;li&gt;Database secret engine — Vault creates a DB user, returns credentials, auto-revokes on lease expiry
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vault Agent&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sidecar pattern — runs alongside your app, authenticates to Vault, writes secrets to file
&lt;/li&gt;
&lt;li&gt;Kubernetes Vault Agent Injector — annotate pods, sidecar is automatically injected&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;8.4 FinOps for platform teams&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Connects autoscaling (Stage 4.5), cloud infrastructure (Stage 4.6), and IaC (Stages 8.1–8.2).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost visibility and allocation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tagging strategy — mandatory tags: team, environment, service, cost-center
&lt;/li&gt;
&lt;li&gt;Showback vs chargeback — visibility to teams vs actual billing
&lt;/li&gt;
&lt;li&gt;Cost per namespace / per cluster / per service — Kubecost, CloudHealth, native cloud cost explorer
&lt;/li&gt;
&lt;li&gt;Unit economics — cost per request, cost per GB ingested, cost per tenant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compute optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rightsizing — VPA recommendations (Stage 4.5), instance type selection, CPU/memory fit
&lt;/li&gt;
&lt;li&gt;Spot / preemptible nodes — cost savings vs interruption risk, taints/tolerations for fault-tolerant workloads
&lt;/li&gt;
&lt;li&gt;Cluster Autoscaler &lt;code&gt;price&lt;/code&gt; expander — prefer cheaper node groups (Stage 4.5)
&lt;/li&gt;
&lt;li&gt;Idle resource detection — orphaned volumes, unused load balancers, over-provisioned node groups
&lt;/li&gt;
&lt;li&gt;HPA min replicas — don't run 10 replicas at 3am if traffic allows 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage and data costs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object storage lifecycle policies — S3 Intelligent-Tiering, Glacier for old Loki/Thanos blocks
&lt;/li&gt;
&lt;li&gt;Persistent volume sizing — right-size PVCs, storage class selection (gp3 vs io2)
&lt;/li&gt;
&lt;li&gt;Log and metrics retention — shorter retention = lower cost (Stage 6); cardinality = cost (Stage 6.1)
&lt;/li&gt;
&lt;li&gt;Egress costs — cross-AZ, cross-region, internet egress; design to minimize (CDN, PrivateLink)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FinOps in IaC and CI/CD&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost estimation in PRs — Infracost, Terraform plan cost diff
&lt;/li&gt;
&lt;li&gt;Policy as code — deny expensive instance types, enforce tagging in Terraform/Kyverno
&lt;/li&gt;
&lt;li&gt;Environment lifecycle — tear down ephemeral preview environments (Stage 9.3), scheduled shutdown of dev clusters
&lt;/li&gt;
&lt;li&gt;Reserved instances / savings plans vs on-demand — when commitment makes sense&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 9 — CI/CD, GitOps and Developer Platforms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;GitLab, Harness, CircleCI on your list. GitOps is expected everywhere.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;9.1 GitOps with ArgoCD&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;GitOps principles&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git as the single source of truth for desired state
&lt;/li&gt;
&lt;li&gt;Declarative — desired state expressed as files, not imperative commands
&lt;/li&gt;
&lt;li&gt;Automated reconciliation — controller continuously syncs actual state to desired state
&lt;/li&gt;
&lt;li&gt;Auditability — every change is a Git commit with author, timestamp, diff&lt;/li&gt;
&lt;/ul&gt;
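
&lt;p&gt;Automated reconciliation boils down to diffing desired state against live state and acting on the difference. The sketch below models both as plain dictionaries keyed by resource name, which is a simplification; a real controller such as ArgoCD compares rendered Kubernetes manifests. The function name is hypothetical.&lt;/p&gt;

```python
def plan_sync(desired, live):
    """Compare desired state (from Git) with live state and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))  # pruning
    return actions
```

&lt;p&gt;Running this in a loop gives the reconciliation behavior: a manual change in the cluster shows up as an "update" on the next pass and is reverted to what Git declares.&lt;/p&gt;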

&lt;p&gt;&lt;strong&gt;ArgoCD architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application CRD — defines source (Git repo/path) and destination (cluster/namespace)
&lt;/li&gt;
&lt;li&gt;Application controller — watches Applications, compares live state with desired state (Git)
&lt;/li&gt;
&lt;li&gt;Repo server — clones Git repos, renders Helm/Kustomize/Jsonnet manifests
&lt;/li&gt;
&lt;li&gt;API server — serves gRPC and REST API, handles sync triggers
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;App-of-apps pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables managing hundreds of apps from a single Git repo
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-cluster GitOps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster credentials — stored as Secrets in ArgoCD namespace
&lt;/li&gt;
&lt;li&gt;Progressive delivery across clusters — sync to dev → staging → prod with approvals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secrets in GitOps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External Secrets Operator — CRD points to Vault/AWS Secrets Manager, controller creates K8s Secret
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;9.2 CI/CD pipeline engineering&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Pipeline concepts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG execution — stages/steps as a directed acyclic graph, parallel by default
&lt;/li&gt;
&lt;li&gt;Artifact passing — how outputs of one stage become inputs of the next
&lt;/li&gt;
&lt;li&gt;Build cache — Docker layer cache, language-specific caches (Go module cache, npm cache)
&lt;/li&gt;
&lt;li&gt;Pipeline triggers — push, MR/PR, schedule, API trigger, upstream pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitLab CI specifics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.gitlab-ci.yml&lt;/code&gt; — pipeline definition, stages, jobs, rules, needs
&lt;/li&gt;
&lt;li&gt;GitLab Runner — the agent that executes jobs, registered to a GitLab instance
&lt;/li&gt;
&lt;li&gt;Executor types — Shell, Docker, Kubernetes (most scalable)
&lt;/li&gt;
&lt;li&gt;Kubernetes executor — creates a pod per job, ephemeral, configurable resources
&lt;/li&gt;
&lt;li&gt;Caching — &lt;code&gt;cache:&lt;/code&gt; key with hash of lock file, stored in S3 or runner local cache
&lt;/li&gt;
&lt;li&gt;Artifacts — &lt;code&gt;artifacts:&lt;/code&gt; paths persisted and passed between jobs/stages
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security in CI/CD pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAST scanning — GitLab SAST (included in Auto DevOps), Semgrep, CodeQL
&lt;/li&gt;
&lt;li&gt;SCA (Software Composition Analysis) — Snyk, Trivy, &lt;code&gt;grype&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Container scanning — scan image after build, before push
&lt;/li&gt;
&lt;li&gt;Secret detection — gitleaks, trufflehog, GitLab secret detection
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;9.3 Internal Developer Platform (IDP)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;The "platform engineering" product layer — what app teams interact with daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform as a product&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal customers — application developers, data engineers, ML engineers
&lt;/li&gt;
&lt;li&gt;Golden paths — opinionated, supported, easy way to do the right thing
&lt;/li&gt;
&lt;li&gt;Self-service vs guardrails — developers provision infra within policy boundaries
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Developer portal and service catalog&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service catalog metadata — owner, on-call rotation, dependencies, SLOs, runbooks
&lt;/li&gt;
&lt;li&gt;Scaffolder templates — "Create microservice" → repo + CI + Dockerfile + K8s manifests + monitoring + RBAC
&lt;/li&gt;
&lt;li&gt;TechDocs — docs-as-code in the repo, rendered in the portal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Golden path templates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What a complete template includes — Git repo, &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;, container build, image signing (Stage 5.2), GitOps manifest (Stage 9.1), Prometheus alerts (Stage 6), NetworkPolicy (Stage 5.3)
&lt;/li&gt;
&lt;li&gt;Template versioning — upgrade path when platform standards change
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Environment management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev / staging / prod promotion — GitOps sync waves across clusters (Stage 9.1)
&lt;/li&gt;
&lt;li&gt;Ephemeral environments — preview apps per MR (Stage 9.2), namespace-per-branch, TTL-based cleanup
&lt;/li&gt;
&lt;li&gt;Environment parity — same Helm chart, different values; avoid snowflake environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Artifact management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container registries — ECR, GCR, Harbor; image retention policies, vulnerability scan gates (Stage 5.1)
&lt;/li&gt;
&lt;li&gt;SBOM and provenance storage — attach to images in registry (Stage 5.2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Policy in the delivery path&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shift-left security — scan in CI before merge (Stage 9.2)
&lt;/li&gt;
&lt;li&gt;Admission control at deploy — Kyverno/Gatekeeper enforce standards (Stage 5.3)
&lt;/li&gt;
&lt;li&gt;Policy exceptions — audit mode, break-glass with approval workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;9.4 Deployment and release strategies&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;How to ship changes safely. Coordinate with data migrations (Stage 7.6) and SLOs (Stage 6.4).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing a strategy&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Downtime&lt;/th&gt;
&lt;th&gt;Rollback speed&lt;/th&gt;
&lt;th&gt;Infrastructure cost&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rolling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Slow (re-deploy old version)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Stateless services, default K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blue-green&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Fast (switch traffic)&lt;/td&gt;
&lt;td&gt;2x during deploy&lt;/td&gt;
&lt;td&gt;Critical services, fast rollback needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Fast (shift traffic back)&lt;/td&gt;
&lt;td&gt;Low extra&lt;/td&gt;
&lt;td&gt;High-traffic services, metric-gated promotion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shadow&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;N/A (no user impact)&lt;/td&gt;
&lt;td&gt;2x compute&lt;/td&gt;
&lt;td&gt;Validation before any user traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rolling deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;K8s Deployment — &lt;code&gt;maxSurge&lt;/code&gt;, &lt;code&gt;maxUnavailable&lt;/code&gt;, rolling update strategy (Stage 4.2)
&lt;/li&gt;
&lt;li&gt;Readiness probes — new pods must pass before old pods terminate
&lt;/li&gt;
&lt;li&gt;PodDisruptionBudget — minimum available during voluntary disruptions (Stage 4.2)
&lt;/li&gt;
&lt;li&gt;Limitation — mixed versions run simultaneously; requires backward-compatible API and schema (Stage 7.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Blue-green deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two identical environments — blue (current) and green (new)
&lt;/li&gt;
&lt;li&gt;Traffic switch — DNS, load balancer, or service mesh route flip
&lt;/li&gt;
&lt;li&gt;Rollback — switch traffic back to blue instantly
&lt;/li&gt;
&lt;li&gt;Cost — running double infrastructure during deploy window
&lt;/li&gt;
&lt;li&gt;Database consideration — schema must be compatible with both versions (expand-contract, Stage 7.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Canary deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic split — 1% → 5% → 25% → 50% → 100%, gated by metrics at each step
&lt;/li&gt;
&lt;li&gt;Metric gates — error rate, p99 latency, saturation (Stage 6.6); SLO burn rate (Stage 6.4)
&lt;/li&gt;
&lt;li&gt;Automated rollback — Argo Rollouts / Flagger revert on failed analysis (Stage 9.1)
&lt;/li&gt;
&lt;li&gt;Service mesh or ingress required — Istio VirtualService, NGINX canary annotations, Cilium (Stage 4.7)&lt;/li&gt;
&lt;/ul&gt;
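
&lt;p&gt;The metric-gated promotion above can be sketched as a small decision function. This is an illustration only, not how Argo Rollouts or Flagger implement analysis; the thresholds (&lt;code&gt;max_error_rate&lt;/code&gt;, &lt;code&gt;max_p99_ms&lt;/code&gt;) and the convention of returning 0 for rollback are assumptions made for the example.&lt;/p&gt;

```python
STEPS = [1, 5, 25, 50, 100]  # traffic percentages from the list above


def next_weight(current: int, error_rate: float, p99_ms: float,
                max_error_rate: float = 0.01, max_p99_ms: float = 500.0) -> int:
    """Return the next canary weight, or 0 to signal rollback."""
    if error_rate > max_error_rate or p99_ms > max_p99_ms:
        return 0  # gate failed: shift all traffic back to the stable version
    for step in STEPS:
        if step > current:
            return step  # gate passed: promote to the next step
    return 100  # already fully promoted


print(next_weight(5, error_rate=0.002, p99_ms=180.0))
print(next_weight(25, error_rate=0.05, p99_ms=180.0))
```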

&lt;p&gt;&lt;strong&gt;Shadow / dark traffic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mirror production traffic to new version — no user-facing impact
&lt;/li&gt;
&lt;li&gt;Compare responses — diff old vs new output, log discrepancies
&lt;/li&gt;
&lt;li&gt;Use cases — validate rewrite, test new database backend, ML model comparison
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Feature flags&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decouple deploy from release — code is deployed but feature is off
&lt;/li&gt;
&lt;li&gt;Flag types — release flags (short-lived), ops flags (kill switch), experiment flags (A/B)
&lt;/li&gt;
&lt;li&gt;Kill switch — disable feature instantly without rollback deploy
&lt;/li&gt;
&lt;li&gt;Flag hygiene — remove stale flags; tech debt if flags accumulate&lt;/li&gt;
&lt;/ul&gt;
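
&lt;p&gt;A minimal sketch of a percentage release flag with a kill switch, assuming users are bucketed by a stable hash so the same user always gets the same answer. The function &lt;code&gt;flag_enabled&lt;/code&gt; and the flag name are hypothetical; real flag services add targeting rules and remote configuration on top of this idea.&lt;/p&gt;

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int,
                 kill_switch: bool = False) -> bool:
    """Decide whether a feature is on for a user."""
    if kill_switch:
        return False  # ops flag: disables the feature without a deploy
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return rollout_percent > bucket


users = [f"user-{i}" for i in range(1000)]
enabled = sum(flag_enabled("new-checkout", u, 10) for u in users)
print(f"{enabled} of {len(users)} users see the feature at 10% rollout")
```

&lt;p&gt;Because the bucket depends only on the flag name and user ID, raising the percentage only adds users; nobody who already had the feature loses it.&lt;/p&gt;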

&lt;p&gt;&lt;strong&gt;Deployment safety checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backward-compatible API and schema changes (Stage 7.6 expand phase)
&lt;/li&gt;
&lt;li&gt;Feature flags for risky changes
&lt;/li&gt;
&lt;li&gt;Dashboards and alerts ready before deploy (Stage 6)
&lt;/li&gt;
&lt;li&gt;Rollback plan documented — code rollback vs schema rollback (schema rollback is hard)
&lt;/li&gt;
&lt;li&gt;PDB and HPA configured — don't deploy during capacity constraints
&lt;/li&gt;
&lt;li&gt;Error budget check — freeze deploys if budget exhausted (Stage 6.5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coordinating code and data deploys&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand before deploy — add new DB column/table before code that uses it
&lt;/li&gt;
&lt;li&gt;Contract after deploy — remove old column only after all code migrated
&lt;/li&gt;
&lt;li&gt;Dual-write period — both old and new code paths write to both stores (Stage 7.6)
&lt;/li&gt;
&lt;li&gt;Never deploy breaking schema change with rolling update — old pods will crash&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 10 — eBPF and Advanced Networking (for Cilium, Cloudflare, Fastly)&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;10.1 Advanced networking awareness (full detail later, in the deep-dive file)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Full eBPF, Cilium, and CDN content is in &lt;code&gt;platform-engineering-deep-dive.md&lt;/code&gt;. For now:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF — programmable hooks in the Linux kernel for networking, security, and observability
&lt;/li&gt;
&lt;li&gt;Cilium — Kubernetes networking and policy using eBPF instead of iptables
&lt;/li&gt;
&lt;li&gt;CDN edge — caches responses by &lt;code&gt;Cache-Control&lt;/code&gt; headers; mitigates DDoS at L3/L4 (volume) and L7 (HTTP-aware)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 11 — Distributed databases continued (ScyllaDB / PlanetScale)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Wide-column (ScyllaDB/Cassandra) and sharded MySQL (Vitess/PlanetScale) — full detail in deep-dive file.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ScyllaDB/Cassandra&lt;/strong&gt; — partition key determines node; design tables for query patterns, not normalized joins
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vitess/PlanetScale&lt;/strong&gt; — MySQL sharded at scale; avoid scatter queries without a shard key
&lt;/li&gt;
&lt;li&gt;LSM trees, VTGate, gh-ost, resharding → deep-dive file&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 12 — Architecture case studies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Apply everything from Stages 1–11. Each case study follows: problem → architecture → key decisions → failure modes → interview follow-ups.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.1 Datadog metrics ingest pipeline&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingest millions of metrics per second from agents across customer infrastructure
&lt;/li&gt;
&lt;li&gt;High cardinality risk — bad label design can OOM the pipeline
&lt;/li&gt;
&lt;li&gt;Must query recent data fast; older data can be slower/cheaper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent (node/pod) → local aggregation → intake API (load balanced)
&lt;/li&gt;
&lt;li&gt;Kafka or similar queue — decouple ingest from processing, absorb spikes
&lt;/li&gt;
&lt;li&gt;Processing workers — normalize, validate, drop/blacklist high-cardinality series
&lt;/li&gt;
&lt;li&gt;Hot storage — recent data, fast queries (like Prometheus TSDB, Stage 6.1)
&lt;/li&gt;
&lt;li&gt;Cold storage — object storage (S3) for long retention, queried on demand
&lt;/li&gt;
&lt;li&gt;Query layer — federates hot + cold, PromQL-compatible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why queue between intake and storage — backpressure, burst absorption
&lt;/li&gt;
&lt;li&gt;Cardinality limits — per-metric, per-tag, per-customer quotas
&lt;/li&gt;
&lt;li&gt;Downsampling — reduce resolution for older data to control storage cost (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;Sharding — by customer ID or metric hash for horizontal scale&lt;/li&gt;
&lt;/ul&gt;
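
&lt;p&gt;A minimal sketch of a per-metric cardinality limit, assuming the processing workers track the set of tag combinations seen for each metric and reject new series once a quota is reached. The class &lt;code&gt;CardinalityLimiter&lt;/code&gt; is hypothetical and keeps everything in memory; it is not Datadog's implementation.&lt;/p&gt;

```python
from collections import defaultdict


class CardinalityLimiter:
    """Rejects new series for a metric once its quota of unique tag sets is used."""

    def __init__(self, max_series_per_metric: int):
        self.max_series = max_series_per_metric
        self.seen = defaultdict(set)     # metric name -> set of tag combinations
        self.dropped = defaultdict(int)  # metric name -> rejected points

    def admit(self, metric: str, tags: dict) -> bool:
        series = frozenset(tags.items())
        known = self.seen[metric]
        if series in known:
            return True  # existing series: always accepted
        if len(known) >= self.max_series:
            self.dropped[metric] += 1  # count drops so the pipeline can alert
            return False
        known.add(series)
        return True


limiter = CardinalityLimiter(max_series_per_metric=2)
for request_id in ("a", "b", "c"):
    ok = limiter.admit("http.requests", {"request_id": request_id})
    print(request_id, "admitted" if ok else "dropped")
```

&lt;p&gt;The loop mimics the failure mode where a deployment puts a unique value in a label: the third new series is dropped while existing ones keep flowing.&lt;/p&gt;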

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cardinality explosion — one bad deployment sends unique label per request
&lt;/li&gt;
&lt;li&gt;Ingest lag — queue depth grows, delayed metrics, alert on pipeline lag not just app metrics
&lt;/li&gt;
&lt;li&gt;Hot shard — uneven customer traffic distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How would you design cardinality limits?
&lt;/li&gt;
&lt;li&gt;What happens if Kafka is down for 5 minutes?
&lt;/li&gt;
&lt;li&gt;How do you migrate storage backends without downtime?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.2 Cloudflare DDoS mitigation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mitigate multi-Tbps volumetric attacks without impacting legitimate traffic
&lt;/li&gt;
&lt;li&gt;Must operate at line rate — cannot afford per-packet userspace processing at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anycast BGP — same IP from every PoP, traffic routed to nearest edge (Stage 10.3)
&lt;/li&gt;
&lt;li&gt;XDP/eBPF at NIC — drop malicious packets before kernel network stack (Stage 10.1)
&lt;/li&gt;
&lt;li&gt;Flow tracking — stateful inspection for SYN floods, UDP amplification
&lt;/li&gt;
&lt;li&gt;Rate limiting — token bucket per IP/ASN/fingerprint (Stage 10.3)
&lt;/li&gt;
&lt;li&gt;Challenge layer — JS/CAPTCHA for suspicious but not clearly malicious traffic
&lt;/li&gt;
&lt;li&gt;Origin shield — aggregate cache misses through single PoP to protect origin&lt;/li&gt;
&lt;/ul&gt;
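
&lt;p&gt;The token-bucket rate limiting mentioned above can be sketched in a few lines. This is a userspace illustration with one bucket per key (IP, ASN, or fingerprint), assuming the caller passes in the current time; the rate and burst values are arbitrary, and an edge implementation would do this in eBPF maps rather than a Python dict.&lt;/p&gt;

```python
class TokenBucket:
    """Refills at `rate_per_sec` up to `burst`; each request spends tokens."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}  # key (IP, ASN, or fingerprint) -> TokenBucket


def allow_request(key: str, now: float) -> bool:
    bucket = buckets.setdefault(key, TokenBucket(rate_per_sec=1.0, burst=2.0, now=now))
    return bucket.allow(now)


for t in (0.0, 0.0, 0.0, 1.0):
    print(t, allow_request("203.0.113.7", t))
```

&lt;p&gt;The burst size sets how many requests a key may send back to back; the rate sets the sustained limit once the burst is spent.&lt;/p&gt;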

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XDP vs iptables — XDP for line-rate drop, iptables for complex stateful rules
&lt;/li&gt;
&lt;li&gt;False positive vs false negative trade-off — blocking legit users vs letting attack through
&lt;/li&gt;
&lt;li&gt;Attack signature updates — how fast can rules propagate to all PoPs globally?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Origin overload during cache miss storm — origin-facing PoP becomes bottleneck
&lt;/li&gt;
&lt;li&gt;SYN flood exhausting conntrack table (Stage 2) — eBPF replaces kernel conntrack at scale (Stage 10.2)
&lt;/li&gt;
&lt;li&gt;L7 attacks that look like legitimate HTTP — require application-aware detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does anycast handle a PoP going offline?
&lt;/li&gt;
&lt;li&gt;Design rate limiting for 10M unique IPs.
&lt;/li&gt;
&lt;li&gt;How would you test DDoS mitigation without affecting production?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.3 Multi-tenant Kubernetes platform&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run 50+ teams on shared clusters with isolation, fair resource sharing, and cost allocation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespace per team — ResourceQuota, LimitRange (Stage 4.2)
&lt;/li&gt;
&lt;li&gt;NetworkPolicy default-deny — explicit allow between namespaces (Stage 4.2, 4.7)
&lt;/li&gt;
&lt;li&gt;Pod Security Standards — Restricted profile enforced via admission (Stage 5.3)
&lt;/li&gt;
&lt;li&gt;RBAC — namespace-scoped roles, no cluster-admin for app teams (Stage 5.3)
&lt;/li&gt;
&lt;li&gt;Cost allocation — Kubecost or cloud tags mapped to namespaces (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;IDP self-service — Backstage template creates namespace + quota + GitOps repo (Stage 9.3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared vs dedicated nodes — taints/tolerations for noisy-neighbor isolation
&lt;/li&gt;
&lt;li&gt;Cluster per env vs cluster per team — blast radius vs operational overhead
&lt;/li&gt;
&lt;li&gt;How much self-service — golden path vs bring-your-own-manifests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noisy neighbor — one team's memory spike triggers node OOM, evicts other teams' pods
&lt;/li&gt;
&lt;li&gt;Quota exhaustion — team hits ResourceQuota; new pods are rejected at admission rather than left Pending, and the error surfaces only as a FailedCreate event on the ReplicaSet
&lt;/li&gt;
&lt;li&gt;NetworkPolicy too restrictive — breaks legitimate cross-team dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you handle a team that needs GPU nodes?
&lt;/li&gt;
&lt;li&gt;Design chargeback for shared cluster costs.
&lt;/li&gt;
&lt;li&gt;One team deploys a crypto miner — how do you detect and respond?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.4 GitOps at scale (100+ clusters)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage application deployments across hundreds of clusters from a central platform
&lt;/li&gt;
&lt;li&gt;Balance consistency with cluster-specific overrides; control blast radius&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ArgoCD hub — central instance managing remote clusters (Stage 9.1)
&lt;/li&gt;
&lt;li&gt;App-of-apps / ApplicationSet — templated apps per cluster (Stage 9.1)
&lt;/li&gt;
&lt;li&gt;Repo structure — base manifests + Kustomize overlays per cluster/environment
&lt;/li&gt;
&lt;li&gt;Sync waves — CRDs first, then operators, then workloads
&lt;/li&gt;
&lt;li&gt;Progressive sync — dev clusters auto-sync, prod requires manual approval
&lt;/li&gt;
&lt;li&gt;Secrets — External Secrets Operator pulling from Vault (Stage 9.1, 8.3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monorepo vs polyrepo — trade-off between visibility and access control
&lt;/li&gt;
&lt;li&gt;Auto-sync vs manual sync for production — speed vs safety
&lt;/li&gt;
&lt;li&gt;How to handle cluster-specific config — Kustomize overlays vs Helm values files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad manifest synced to all clusters simultaneously — blast radius
&lt;/li&gt;
&lt;li&gt;ArgoCD itself becomes SPOF — mitigate with an HA deployment and multiple replicas
&lt;/li&gt;
&lt;li&gt;Secret rotation breaks sync — stale ExternalSecret, pods fail to start
&lt;/li&gt;
&lt;li&gt;Drift — manual &lt;code&gt;kubectl edit&lt;/code&gt; on cluster, GitOps fights live state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you roll out a platform-wide NetworkPolicy change safely?
&lt;/li&gt;
&lt;li&gt;Design a canary cluster before promoting to all prod clusters.
&lt;/li&gt;
&lt;li&gt;How do you handle a cluster that can't reach Git?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.5 Secure CI/CD supply chain end-to-end&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure only trusted, scanned, signed artifacts reach production clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer push → CI pipeline (Stage 9.2)
&lt;/li&gt;
&lt;li&gt;SAST + SCA + secret detection in CI (Stage 9.2)
&lt;/li&gt;
&lt;li&gt;Build container image → Trivy/Grype scan (Stage 5.1)
&lt;/li&gt;
&lt;li&gt;Generate SBOM (Syft) + SLSA provenance (Stage 5.2)
&lt;/li&gt;
&lt;li&gt;Sign with Cosign keyless signing via GitHub OIDC → Fulcio → Rekor (Stage 5.2)
&lt;/li&gt;
&lt;li&gt;Push to registry with signature attached
&lt;/li&gt;
&lt;li&gt;Admission webhook — Kyverno verify-image policy, reject unsigned or vulnerable images (Stage 5.3)
&lt;/li&gt;
&lt;li&gt;GitOps deploy — ArgoCD syncs signed image to cluster (Stage 9.1)
&lt;/li&gt;
&lt;li&gt;Runtime — Falco detects anomalous behavior (Stage 5.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where to enforce — CI gate vs registry gate vs admission gate (defense in depth)
&lt;/li&gt;
&lt;li&gt;Keyless vs key-based signing — OIDC identity vs long-lived keys
&lt;/li&gt;
&lt;li&gt;CVE policy — block critical, warn on high, allow with exception workflow&lt;/li&gt;
&lt;/ul&gt;
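
&lt;p&gt;The CVE policy above (block critical, warn on high, allow with exception) can be sketched as a pure function. The input shape, the severity strings, and the placeholder CVE IDs are assumptions for illustration; in practice this logic would live in a Kyverno policy or a CI gate fed by scanner output.&lt;/p&gt;

```python
def admission_decision(findings: list, exceptions: set) -> str:
    """findings: list of (cve_id, severity); exceptions: approved CVE IDs."""
    decision = "allow"
    for cve_id, severity in findings:
        if cve_id in exceptions:
            continue  # covered by an approved exception
        if severity == "CRITICAL":
            return "block"
        if severity == "HIGH":
            decision = "warn"
    return decision


# Placeholder IDs, not real CVEs.
scan = [("CVE-0000-0001", "HIGH"), ("CVE-0000-0002", "CRITICAL")]
print(admission_decision(scan, exceptions=set()))
print(admission_decision(scan, exceptions={"CVE-0000-0002"}))
```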

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compromised CI runner — attacker pushes malicious signed image
&lt;/li&gt;
&lt;li&gt;Policy bypass — privileged pod (&lt;code&gt;securityContext.privileged: true&lt;/code&gt;) admitted because namespace lacks Pod Security enforcement
&lt;/li&gt;
&lt;li&gt;Stale base image — image signed but base layer has new CVE discovered later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you handle emergency hotfix bypass of scan gates?
&lt;/li&gt;
&lt;li&gt;Design provenance verification that works across multiple CI systems.
&lt;/li&gt;
&lt;li&gt;What if Rekor is unavailable — can you still verify signatures?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.6 Globally distributed SQL (CockroachDB-style)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL-compatible database that survives region failure with strong consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyspace split into ranges, each range = Raft group (Stage 7.2)
&lt;/li&gt;
&lt;li&gt;Multi-Raft — independent consensus per range, scales horizontally
&lt;/li&gt;
&lt;li&gt;Transaction coordinator — 2PC across ranges for distributed transactions
&lt;/li&gt;
&lt;li&gt;Geo-partitioning — pin data to regions for latency and compliance (Stage 7.2)
&lt;/li&gt;
&lt;li&gt;Follower reads — read from local replica at stale timestamp for lower latency&lt;/li&gt;
&lt;/ul&gt;
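
&lt;p&gt;A toy model of the range-based keyspace described above, assuming ranges are identified by their sorted start keys and each maps to one Raft group. The split points and group names are made up; the point is that a key lookup is a binary search, and a transaction needs cross-range coordination only when its keys land in more than one range.&lt;/p&gt;

```python
import bisect

# Sorted start keys; range i covers keys from range_starts[i] up to the next start.
range_starts = ["", "g", "n", "t"]
raft_groups = ["r1", "r2", "r3", "r4"]  # hypothetical Raft group IDs


def range_for_key(key: str) -> str:
    """Find the Raft group that owns a key via binary search."""
    index = bisect.bisect_right(range_starts, key) - 1
    return raft_groups[index]


def needs_distributed_commit(keys: list) -> bool:
    """True when a transaction touches more than one range."""
    return len({range_for_key(k) for k in keys}) > 1


print(range_for_key("apple"), range_for_key("monkey"), range_for_key("zebra"))
print(needs_distributed_commit(["apple", "banana"]))
print(needs_distributed_commit(["apple", "zebra"]))
```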

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CP over AP — strong consistency, sacrifice availability during partition (CAP, Stage 7.1)
&lt;/li&gt;
&lt;li&gt;Range size — too small = Raft overhead; too large = hot spots
&lt;/li&gt;
&lt;li&gt;Survival goals — zone vs region failure tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot range — one range gets disproportionate writes, single Raft group bottleneck
&lt;/li&gt;
&lt;li&gt;Clock skew — HLC mitigates but extreme skew causes transaction retries
&lt;/li&gt;
&lt;li&gt;Region partition — CP system may become unavailable for affected ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does CockroachDB handle a node failure mid-transaction?
&lt;/li&gt;
&lt;li&gt;Design a schema migration for a globally distributed table.
&lt;/li&gt;
&lt;li&gt;Compare to Spanner's TrueTime approach (Stage 7.1).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.7 Observability pipeline at scale (Loki + Prometheus)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect logs and metrics from 10,000+ pods without overwhelming storage or query performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics — Prometheus per cluster → remote_write → Mimir/Thanos (Stage 6.1)
&lt;/li&gt;
&lt;li&gt;Logs — Fluent Bit DaemonSet → Loki distributor → ingester → S3 chunks (Stage 6.3)
&lt;/li&gt;
&lt;li&gt;Traces — OTel Collector → tail sampling → Jaeger/Tempo (Stage 6.2)
&lt;/li&gt;
&lt;li&gt;Unified query — Grafana dashboards correlating metrics + logs + traces
&lt;/li&gt;
&lt;li&gt;Cardinality control — drop high-cardinality labels at ingest, recording rules for aggregates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loki label design — index only labels (not log content), low-cardinality labels only
&lt;/li&gt;
&lt;li&gt;Retention tiers — 15 days hot, 90 days warm in object storage, delete after
&lt;/li&gt;
&lt;li&gt;Sampling — head sampling for traces (99% dropped), tail sampling for errors (Stage 6.2)&lt;/li&gt;
&lt;/ul&gt;
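
&lt;p&gt;A minimal sketch of a tail-sampling decision, assuming the collector has already buffered the whole trace and knows whether any span failed and how long the request took. The function &lt;code&gt;keep_trace&lt;/code&gt;, the 1-second slow threshold, and the 1% baseline are assumptions for the example, not OTel Collector configuration.&lt;/p&gt;

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, duration_ms: float,
               slow_ms: float = 1000.0, baseline_percent: int = 1) -> bool:
    """Decide after the trace completes whether to store it."""
    if had_error or duration_ms > slow_ms:
        return True  # always keep errors and slow requests
    # Keep a small, deterministic share of healthy traces as a baseline.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return baseline_percent > bucket


print(keep_trace("trace-1", had_error=True, duration_ms=40.0))
print(keep_trace("trace-2", had_error=False, duration_ms=2500.0))
```

&lt;p&gt;Hashing the trace ID rather than drawing a random number means every collector replica reaches the same decision for the same trace.&lt;/p&gt;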

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Label cardinality explosion in Loki — same problem as Prometheus, different storage
&lt;/li&gt;
&lt;li&gt;Remote write backpressure — Prometheus WAL grows, disk fills
&lt;/li&gt;
&lt;li&gt;Log volume spike — one service left at DEBUG level, or logging every request at ERROR, floods the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you debug a production issue when traces were sampled out?
&lt;/li&gt;
&lt;li&gt;Design log retention that meets compliance without bankrupting storage budget (Stage 8.4).
&lt;/li&gt;
&lt;li&gt;How do you correlate a metric spike to the exact log lines?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.8 Cilium replacing kube-proxy&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kube-proxy iptables mode doesn't scale to thousands of Services; need faster datapath&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cilium agent (DaemonSet) — programs eBPF on each node (Stage 10.2)
&lt;/li&gt;
&lt;li&gt;eBPF LB map — service IP → backend pod IP, O(1) lookup, no iptables chain walk
&lt;/li&gt;
&lt;li&gt;Identity-based policy — numeric security identity from labels, not IP (Stage 10.2)
&lt;/li&gt;
&lt;li&gt;Hubble — flow-level observability from eBPF, no sidecar needed
&lt;/li&gt;
&lt;li&gt;
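&lt;code&gt;--kube-proxy-replacement=strict&lt;/code&gt; — Cilium owns all service routing (Cilium 1.14+ deprecates &lt;code&gt;strict&lt;/code&gt; in favor of &lt;code&gt;true&lt;/code&gt;)&lt;/li&gt;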
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF over iptables — performance at scale, but requires a recent kernel (4.19+ for older Cilium releases; newer releases raise the minimum) and BTF for CO-RE-based programs
&lt;/li&gt;
&lt;li&gt;DSR (Direct Server Return) — reply bypasses load balancer node, lower latency
&lt;/li&gt;
&lt;li&gt;Identity vs IP policy — IPs change on pod restart; identity is stable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF map full — service/backend limit hit, new services fail to program
&lt;/li&gt;
&lt;li&gt;Kernel upgrade breaks eBPF programs — CO-RE (BTF) mitigates (Stage 10.1)
&lt;/li&gt;
&lt;li&gt;Policy misconfiguration — identity mismatch blocks legitimate traffic silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through packet path for ClusterIP Service with Cilium eBPF vs iptables.
&lt;/li&gt;
&lt;li&gt;How does Cilium handle a pod IP change during rolling update?
&lt;/li&gt;
&lt;li&gt;Compare Cilium LB to IPVS mode kube-proxy (Stage 4.1).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.9 Zero-downtime database migration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migrate a 500GB PostgreSQL table (monolith DB) to a new schema, shard, or datastore with zero downtime and a rollback path
&lt;/li&gt;
&lt;li&gt;Application must keep serving traffic throughout; old and new code versions run simultaneously during rolling deploys (Stage 9.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1 (expand) — add new column/table/index in old DB; deploy code that writes to both old and new paths (Stage 7.6)
&lt;/li&gt;
&lt;li&gt;Phase 2 (backfill) — batch or streaming job copies historical data; throttle to protect prod DB performance
&lt;/li&gt;
&lt;li&gt;Phase 3 (CDC) — Debezium/DMS streams ongoing changes from old DB to new store, keeping new store in sync (Stage 7.6)
&lt;/li&gt;
&lt;li&gt;Phase 4 (dual-read validation) — compare row counts, checksums, sample queries between old and new
&lt;/li&gt;
&lt;li&gt;Phase 5 (cutover) — shift read traffic to new store (percentage-based or instant); monitor error rate and SLO burn (Stage 6.4)
&lt;/li&gt;
&lt;li&gt;Phase 6 (contract) — remove old column/table once all code reads from new path; decommission old store&lt;/li&gt;
&lt;/ul&gt;
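
&lt;p&gt;The Phase 4 validation can be sketched as a row count plus an order-independent checksum computed on both stores. This is a simplified illustration that assumes rows arrive as tuples with identical value types on both sides; the helper &lt;code&gt;table_fingerprint&lt;/code&gt; is hypothetical, and a real migration would usually checksum in chunks by primary-key range so a mismatch can be localized.&lt;/p&gt;

```python
import hashlib


def table_fingerprint(rows) -> tuple:
    """Return (row_count, checksum); the checksum ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        # Summing per-row hashes modulo 2**64 makes the result order-independent.
        combined = (combined + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, combined


old_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
new_rows = [(3, "carol"), (1, "alice"), (2, "bob")]  # same data, different order
print(table_fingerprint(old_rows) == table_fingerprint(new_rows))
```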

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand-contract over big-bang — only safe pattern with rolling K8s deploys (Stage 9.4)
&lt;/li&gt;
&lt;li&gt;Dual-write vs CDC-only — dual-write simpler but risk of inconsistency; CDC cleaner but adds pipeline complexity
&lt;/li&gt;
&lt;li&gt;Cutover strategy — percentage traffic shift vs DNS flip vs feature flag per tenant
&lt;/li&gt;
&lt;li&gt;How long to keep old system warm — rollback window vs cost of running dual systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dual-write partial failure — one write succeeds, other fails; needs idempotency and reconciliation job (Stage 7.5)
&lt;/li&gt;
&lt;li&gt;Backfill overload — unthrottled backfill saturates DB I/O, degrades live traffic
&lt;/li&gt;
&lt;li&gt;Schema incompatibility — new code deployed before the expand phase completes fails on the missing column; a premature contract phase crashes old pods still reading the old column
&lt;/li&gt;
&lt;li&gt;Cutover with replication lag — reads from new store return stale data, user-visible inconsistency
&lt;/li&gt;
&lt;li&gt;Rollback after contract phase — schema rollback is hard; may require forward-fix instead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you verify data correctness before cutover?
&lt;/li&gt;
&lt;li&gt;What if CDC pipeline falls 30 minutes behind during peak traffic?
&lt;/li&gt;
&lt;li&gt;Design migration for a table with 10K writes/sec and foreign key constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.10 Autoscaling under a traffic spike&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic increases 20× in 10 minutes (product launch, Black Friday, viral event)
&lt;/li&gt;
&lt;li&gt;Platform must scale pods, nodes, and ingress without breaching SLOs or exhausting error budget (Stage 6.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingress / load balancer — cloud ALB/NLB or CDN absorbs initial burst (Stages 4.6, 10.3)
&lt;/li&gt;
&lt;li&gt;HPA — scales pod replicas based on CPU, memory, or custom metrics (Stage 4.5)
&lt;/li&gt;
&lt;li&gt;Cluster Autoscaler — adds nodes when pods are Pending due to insufficient resources (Stage 4.5)
&lt;/li&gt;
&lt;li&gt;KEDA — event-driven scaling on queue lag or external metrics; scale-to-zero off-peak (Stage 4.5)
&lt;/li&gt;
&lt;li&gt;Pre-warming — raise HPA &lt;code&gt;minReplicas&lt;/code&gt; and pre-provision node pool before known events
&lt;/li&gt;
&lt;li&gt;Observability — RED metrics on autoscaling loop itself: time-to-new-pod-ready, time-to-new-node, scheduling latency (Stage 6.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HPA metric choice — CPU lags behind request rate; custom metrics (RPS, queue depth) react faster
&lt;/li&gt;
&lt;li&gt;CA scale-up delay — new node takes 2–5 minutes; pre-warm node groups for predictable events
&lt;/li&gt;
&lt;li&gt;PDB vs scale-down — CA respects PodDisruptionBudgets; may block scale-down, leaving costly idle nodes (Stage 4.2)
&lt;/li&gt;
&lt;li&gt;Spot/preemptible nodes — cost savings vs interruption during spike; use for fault-tolerant workloads only (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;Max replicas cap — prevent runaway scaling from bug or DDoS; balance cost vs availability&lt;/li&gt;
&lt;/ul&gt;
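
&lt;p&gt;A sketch of the replica calculation the HPA documents: desired replicas is the current count scaled by the ratio of observed to target metric, rounded up, skipped when the ratio is within a tolerance (0.1 by default), and clamped to the min/max bounds. The function below is a simplification that ignores readiness handling and stabilization windows.&lt;/p&gt;

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * metric / target), then clamp."""
    ratio = metric / target
    if abs(ratio - 1.0) > tolerance:
        desired = math.ceil(current * ratio)
    else:
        desired = current  # within tolerance: no scaling action
    return max(min_replicas, min(max_replicas, desired))


# 10 pods at 200 RPS each against a 100 RPS target doubles the fleet.
print(desired_replicas(10, metric=200, target=100, min_replicas=2, max_replicas=50))
# A 20x spike is capped by max_replicas, which is the runaway-scaling guard.
print(desired_replicas(10, metric=2000, target=100, min_replicas=2, max_replicas=50))
```

&lt;p&gt;The second call shows why the max replicas cap is a trade-off: it bounds cost, but during a genuine 20× spike it also bounds how much load the service can absorb.&lt;/p&gt;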

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HPA lag — metrics-server delay + cooldown window; pods not ready before traffic overwhelms existing replicas
&lt;/li&gt;
&lt;li&gt;CA can't scale — hit node group max, instance quota, or IP address exhaustion in subnet
&lt;/li&gt;
&lt;li&gt;Thundering herd on new pods — all new pods cold-start simultaneously, DB connection pool exhausted (Stage 7.5)
&lt;/li&gt;
&lt;li&gt;Ingress bottleneck — pods scaled but ingress/controller becomes the limit
&lt;/li&gt;
&lt;li&gt;Flapping — scale-up then rapid scale-down as metrics spike and drop; tune stabilization windows (Stage 4.5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you load-test autoscaling behavior before a launch?
&lt;/li&gt;
&lt;li&gt;HPA vs KEDA for a Kafka consumer workload — which and why?
&lt;/li&gt;
&lt;li&gt;Traffic drops after spike — how fast should you scale down without causing another outage?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.11 Building a production Kubernetes operator&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform team needs a CRD (e.g., &lt;code&gt;Database&lt;/code&gt;, &lt;code&gt;Application&lt;/code&gt;, &lt;code&gt;Tenant&lt;/code&gt;) with a controller that provisions and manages lifecycle automatically
&lt;/li&gt;
&lt;li&gt;Must be reliable, idempotent, and operable at scale across many clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRD definition — OpenAPI validation schema, status subresource, printer columns (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;controller-runtime — &lt;code&gt;Manager&lt;/code&gt;, &lt;code&gt;Reconciler&lt;/code&gt;, work queue, shared informer cache (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Reconcile loop — compare spec (desired) vs observed state; create/update/delete child resources
&lt;/li&gt;
&lt;li&gt;Webhooks — mutating (defaults) and validating (reject invalid specs) admission (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Finalizers — pre-delete cleanup (e.g., snapshot DB before CR deletion); prevent stuck resources
&lt;/li&gt;
&lt;li&gt;Observability — controller metrics (reconcile duration, errors, queue depth), structured logs, tracing (Stage 6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotent reconcile — calling reconcile N times has same effect as once; use &lt;code&gt;CreateOrUpdate&lt;/code&gt; pattern (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Error handling — transient errors requeue with backoff; permanent errors update status condition
&lt;/li&gt;
&lt;li&gt;Owner references — child resources garbage-collected when parent CR deleted (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Leader election — only one active controller replica; others standby (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Secondary resource watches — trigger reconcile when child Secret or Deployment changes
&lt;/li&gt;
&lt;li&gt;Testing — envtest for unit tests, kind cluster for integration, contract tests on CRD schema&lt;/li&gt;
&lt;/ul&gt;
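
&lt;p&gt;A language-neutral sketch of an idempotent reconcile, written in Python with plain dicts standing in for the API server. Real operators use controller-runtime in Go; the names here (&lt;code&gt;reconcile&lt;/code&gt;, the resource names) are hypothetical. The property to notice is that a second call with the same desired state takes no action.&lt;/p&gt;

```python
def reconcile(desired: dict, cluster: dict) -> list:
    """Converge `cluster` (name -> spec) toward `desired`; return actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in cluster:
            cluster[name] = dict(spec)
            actions.append(("create", name))
        elif cluster[name] != spec:
            cluster[name] = dict(spec)  # overwrite drift with desired spec
            actions.append(("update", name))
    for name in list(cluster):
        if name not in desired:
            del cluster[name]  # child no longer wanted by the parent resource
            actions.append(("delete", name))
    return actions


desired = {"db-secret": {"user": "app"}, "db-deploy": {"replicas": 3}}
cluster = {"stale-job": {}}
print(reconcile(desired, cluster))  # first pass creates and cleans up
print(reconcile(desired, cluster))  # second pass is a no-op
```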

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reconcile storm — API server blip causes resync of all objects; rate-limit queue, use predicates (Stage 4.3)
&lt;/li&gt;
&lt;li&gt;Stuck finalizer — external dependency unavailable, CR can't delete; manual finalizer removal as break-glass
&lt;/li&gt;
&lt;li&gt;Status update conflict — concurrent reconcilers or user edits cause optimistic locking conflict
&lt;/li&gt;
&lt;li&gt;Webhook failure — invalid object rejected but error opaque to user; clear validation messages critical
&lt;/li&gt;
&lt;li&gt;Partial provision — DB created but Secret not written; status must reflect partial state accurately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through reconcile for a &lt;code&gt;Database&lt;/code&gt; CR: create → running → upgrade → delete.
&lt;/li&gt;
&lt;li&gt;How do you handle a controller bug that corrupted 50 resources — rollback strategy?
&lt;/li&gt;
&lt;li&gt;How do you version CRD schemas without breaking existing resources?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.12 Vault as the org-wide secrets platform&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2,000+ microservices need dynamic DB credentials, PKI certs, and API keys without Vault becoming a single point of failure or bottleneck
&lt;/li&gt;
&lt;li&gt;Must integrate with Kubernetes, CI/CD, and cloud IAM across multiple clusters and accounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vault HA cluster — Raft integrated storage, 3+ nodes, active/standby with auto-failover (Stage 8.3)
&lt;/li&gt;
&lt;li&gt;Auto-unseal — AWS KMS / GCP KMS; no manual unseal on restart (Stage 8.3)
&lt;/li&gt;
&lt;li&gt;Auth methods — Kubernetes auth (pod SA token), AWS IAM auth, AppRole for CI, OIDC for GitHub Actions (Stage 8.3)
&lt;/li&gt;
&lt;li&gt;Secret engines — Database (dynamic creds), PKI (internal CA), KV v2 (static secrets), Transit (encryption-as-a-service)
&lt;/li&gt;
&lt;li&gt;Vault Agent Injector — sidecar injected via pod annotation, renders secrets to file, auto-renews leases (Stage 8.3)
&lt;/li&gt;
&lt;li&gt;External Secrets Operator — syncs Vault secrets to K8s Secret for GitOps compatibility (Stage 9.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent sidecar vs ESO vs direct API — sidecar for app file-based secrets; ESO for GitOps; direct API for controllers
&lt;/li&gt;
&lt;li&gt;Dynamic vs static secrets — dynamic DB creds auto-revoke on lease expiry; static secrets need rotation policy
&lt;/li&gt;
&lt;li&gt;Namespace isolation — each team gets Vault policy scoped to their path; no cross-team secret access
&lt;/li&gt;
&lt;li&gt;Performance standbys — read replicas for high read volume (a Vault Enterprise feature); writes still go to active node (Stage 8.3)
&lt;/li&gt;
&lt;li&gt;Break-glass — emergency root token procedure, audited, time-limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vault sealed after restart — auto-unseal misconfigured, all secret retrieval fails across fleet
&lt;/li&gt;
&lt;li&gt;Lease expiry without renewal — app crashes when DB cred expires; Agent must renew before TTL
&lt;/li&gt;
&lt;li&gt;Rate limiting — thundering herd of pods restarting simultaneously overwhelms Vault auth endpoint
&lt;/li&gt;
&lt;li&gt;Token leak — compromised SA token grants Vault access; short-lived tokens + narrow policies limit blast radius
&lt;/li&gt;
&lt;li&gt;Raft quorum loss — 2 of 3 nodes down, no leader can be elected and Vault is unavailable until quorum is restored; multi-AZ placement critical&lt;/li&gt;
&lt;/ul&gt;
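
&lt;p&gt;Two of the failure modes above, lease expiry and the thundering herd, are both about timing. One way to reason about them is to renew at a fraction of the TTL and subtract random jitter so pods that started together do not renew together. This is a sketch under assumptions: the two-thirds fraction and 10% jitter are illustrative defaults, and &lt;code&gt;renewal_delay&lt;/code&gt; is a hypothetical helper, not Vault Agent's actual algorithm.&lt;/p&gt;

```python
import random
from typing import Optional


def renewal_delay(ttl_seconds: float,
                  fraction: float = 2 / 3,
                  jitter: float = 0.1,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before renewing a lease with the given TTL.

    fraction and jitter are assumed values for illustration only.
    """
    if 0 >= ttl_seconds:
        raise ValueError("ttl_seconds must be positive")
    rng = rng or random.Random()
    base = ttl_seconds * fraction
    # Jitter is subtracted, so the renewal is never later than the base point.
    return base * (1 - rng.uniform(0, jitter))


if __name__ == "__main__":
    rng = random.Random(42)
    # Three pods holding the same 1-hour lease get slightly different delays.
    for pod in ("pod-a", "pod-b", "pod-c"):
        print(pod, round(renewal_delay(3600, rng=rng), 1))
```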

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vault is down for 10 minutes — what breaks, in what order?
&lt;/li&gt;
&lt;li&gt;How do you rotate a database password for 500 services without restart?
&lt;/li&gt;
&lt;li&gt;Design Vault topology for 5 K8s clusters across 2 cloud accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.13 FinOps: reducing a $2M/month K8s and cloud bill&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform spend growing 30% quarter-over-quarter; leadership demands ~40% reduction without SLO regression or team revolt
&lt;/li&gt;
&lt;li&gt;Must identify waste, rightsize, and implement guardrails — not just cut capacity blindly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost visibility — Kubecost / CloudHealth / native cost explorer with mandatory tagging (team, env, service) (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;Compute — VPA recommendations, rightsizing requests/limits, spot/preemptible for fault-tolerant workloads (Stages 4.5, 8.4)
&lt;/li&gt;
&lt;li&gt;Node efficiency — Cluster Autoscaler (CA) scale-down of idle nodes, lower max node group size, consolidate low-utilization clusters
&lt;/li&gt;
&lt;li&gt;Storage — right-size PVCs, S3 lifecycle policies for logs/metrics/backups, delete orphaned volumes (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;Observability cost — reduce metrics cardinality, shorten retention, drop debug logs in prod (Stages 6.1, 8.4)
&lt;/li&gt;
&lt;li&gt;Egress — CDN for static assets, PrivateLink for cross-service traffic, same-AZ preference (Stages 4.6, 8.4)
&lt;/li&gt;
&lt;li&gt;Governance — Infracost in PRs, Kyverno policy blocking oversized instances, chargeback reports to teams (Stages 8.1, 8.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What to cut first — idle resources, over-provisioned dev/staging, excessive retention; never cut prod headroom blindly
&lt;/li&gt;
&lt;li&gt;Spot/preemptible adoption — start with stateless batch/CI workloads; keep on-demand for critical path (Stage 8.4)
&lt;/li&gt;
&lt;li&gt;Chargeback vs showback — showback educates; chargeback creates accountability but needs accurate allocation
&lt;/li&gt;
&lt;li&gt;Reserved instances / savings plans — commit for baseline load only; keep burst on-demand
&lt;/li&gt;
&lt;li&gt;Unit economics — cost per request, per tenant, per GB ingested; track over time to prove savings didn't hurt reliability&lt;/li&gt;
&lt;/ul&gt;
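
&lt;p&gt;The unit-economics point can be made concrete with a small calculation. The sketch below compares two monthly snapshots on cost per million requests and on error rate. All figures are hypothetical, and &lt;code&gt;MonthSnapshot&lt;/code&gt; and its fields are names invented for this example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthSnapshot:
    cost_usd: float
    requests: int
    failed_requests: int


def cost_per_million(s: MonthSnapshot) -> float:
    """Cloud spend divided by traffic, scaled to one million requests."""
    if 0 >= s.requests:
        raise ValueError("requests must be positive")
    return s.cost_usd * 1_000_000 / s.requests


def error_rate(s: MonthSnapshot) -> float:
    if 0 >= s.requests:
        raise ValueError("requests must be positive")
    return s.failed_requests / s.requests


def savings_without_regression(before: MonthSnapshot, after: MonthSnapshot,
                               tolerance: float = 0.0) -> bool:
    """True when unit cost fell and error rate did not rise beyond tolerance."""
    cheaper = cost_per_million(before) > cost_per_million(after)
    no_worse = error_rate(before) + tolerance >= error_rate(after)
    return cheaper and no_worse


if __name__ == "__main__":
    # Hypothetical months: spend cut from $2.0M to $1.2M while traffic grew.
    before = MonthSnapshot(2_000_000, 40_000_000_000, 40_000_000)
    after = MonthSnapshot(1_200_000, 42_000_000_000, 42_000_000)
    print(cost_per_million(before), cost_per_million(after))
    print(savings_without_regression(before, after))
```

&lt;p&gt;Tracking a ratio rather than the raw bill keeps traffic growth from hiding savings, or from hiding a regression.&lt;/p&gt;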

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggressive rightsizing causes OOM kills during traffic spike — under-provisioned after cutting limits
&lt;/li&gt;
&lt;li&gt;Spot interruption during peak — no on-demand fallback, SLO breach
&lt;/li&gt;
&lt;li&gt;Retention cut too short — can't debug incident from last week; false economy
&lt;/li&gt;
&lt;li&gt;Tagging gaps — 30% of spend is "untagged," can't allocate or optimize
&lt;/li&gt;
&lt;li&gt;Team workaround — devs spin up resources outside platform to avoid chargeback, creating shadow IT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show me your prioritization: what do you cut first, second, never?
&lt;/li&gt;
&lt;li&gt;How do you prove a 40% cost cut didn't increase incident rate?
&lt;/li&gt;
&lt;li&gt;Design chargeback model for a shared multi-tenant K8s cluster (Stage 12.3).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.14 Chaos game day on a production-like environment&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform team needs confidence that the system survives realistic failures before they happen in production
&lt;/li&gt;
&lt;li&gt;Run controlled experiments in a prod-like staging environment without customer impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment — full prod parity: same K8s version, same operators, same observability stack, synthetic load at ~50% prod traffic (Stage 9.3)
&lt;/li&gt;
&lt;li&gt;Chaos tools — Litmus Chaos, Chaos Mesh, or Gremlin; inject faults as K8s CRs or API calls (Stage 6.5)
&lt;/li&gt;
&lt;li&gt;Experiment design — hypothesize steady state (SLOs hold), define blast radius, set abort conditions
&lt;/li&gt;
&lt;li&gt;Fault types — pod kill, node drain, network partition (NetworkPolicy drop), AZ failure simulation, DNS failure, latency injection
&lt;/li&gt;
&lt;li&gt;Observability during experiment — pre-built dashboards for SLO burn, error rate, latency; on-call team observes but doesn't intervene unless abort threshold hit (Stage 6)
&lt;/li&gt;
&lt;li&gt;Post-experiment — blameless review, gap analysis, action items (runbook updates, new alerts, code fixes) (Stage 6.5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prod-like vs prod — never inject chaos in prod without mature practice; staging with realistic load is the starting point
&lt;/li&gt;
&lt;li&gt;Steady-state hypothesis — "p99 latency stays under 500ms during single pod kill" — must be measurable before starting
&lt;/li&gt;
&lt;li&gt;Blast radius — one namespace/team at a time; don't kill all etcd members simultaneously
&lt;/li&gt;
&lt;li&gt;Abort conditions — auto-abort if error rate exceeds 5% or SLO burn rate hits 10× (Stage 6.4)
&lt;/li&gt;
&lt;li&gt;Frequency — quarterly game days for platform; smaller automated chaos in CI for individual services&lt;/li&gt;
&lt;/ul&gt;
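
&lt;p&gt;The abort rule above (error rate over 5%, or burn rate at 10×) can be written as a small predicate. In this sketch burn rate is taken as the observed error ratio divided by the error budget, &lt;code&gt;1 - SLO target&lt;/code&gt;. The function names and the way the counts are obtained are assumptions; a real game day would read them from the metrics backend.&lt;/p&gt;

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if 0 >= budget:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0
    return (errors / total) / budget


def should_abort(errors: int, total: int, slo_target: float,
                 max_error_rate: float = 0.05,
                 max_burn_rate: float = 10.0) -> bool:
    """Abort when the error rate exceeds 5% or the burn rate reaches 10x."""
    if total == 0:
        return False  # no traffic observed yet, nothing to judge
    error_rate = errors / total
    return (error_rate > max_error_rate
            or burn_rate(errors, total, slo_target) >= max_burn_rate)


if __name__ == "__main__":
    # With a 99.9% SLO, a 2% error rate is roughly a 20x burn, so abort.
    print(should_abort(errors=20, total=1000, slo_target=0.999))
    # Against a loose 90% SLO, 4% errors is under both thresholds.
    print(should_abort(errors=40, total=1000, slo_target=0.9))
```

&lt;p&gt;For a tight SLO the burn-rate condition fires well before the 5% error-rate condition, so the two thresholds are not redundant.&lt;/p&gt;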

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment exceeds blast radius — network partition CR affects wrong namespace, staging outage
&lt;/li&gt;
&lt;li&gt;No abort condition — experiment runs too long, staging unusable for other teams for hours
&lt;/li&gt;
&lt;li&gt;False confidence — staging lacks prod traffic patterns; passes game day but fails in prod
&lt;/li&gt;
&lt;li&gt;Missing observability — can't tell if hypothesis passed or failed; experiment is worthless
&lt;/li&gt;
&lt;li&gt;Action items not tracked — same failure found in 3 game days, never fixed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design a game day for "single AZ becomes unavailable."
&lt;/li&gt;
&lt;li&gt;How is chaos different from load testing (Stage 6.6)?
&lt;/li&gt;
&lt;li&gt;When would you allow chaos experiments in production (e.g., Netflix approach)?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;12.15 Terraform/IaC at scale (monorepo, drift, blast radius)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500+ resources across 20 environments, 50 engineers contributing; a bad &lt;code&gt;terraform apply&lt;/code&gt; can take down production
&lt;/li&gt;
&lt;li&gt;State files grow large, modules proliferate, drift accumulates, and nobody knows what's actually deployed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture (conceptual)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo structure — monorepo with &lt;code&gt;modules/&lt;/code&gt; (reusable) and &lt;code&gt;environments/&lt;/code&gt; (dev/staging/prod overlays) or polyrepo per team (Stage 8.1)
&lt;/li&gt;
&lt;li&gt;Remote state — S3 + DynamoDB locking; separate state file per environment; never share state across envs (Stage 8.1)
&lt;/li&gt;
&lt;li&gt;CI pipeline — &lt;code&gt;terraform plan&lt;/code&gt; on every PR, mandatory review for prod applies, &lt;code&gt;terraform apply&lt;/code&gt; only from CI (Stage 9.2)
&lt;/li&gt;
&lt;li&gt;Module registry — versioned modules (Git sources pinned with &lt;code&gt;?ref=v1.2.3&lt;/code&gt;, or a registry version constraint), semver, changelog; consumers pin versions (Stage 8.1)
&lt;/li&gt;
&lt;li&gt;Drift detection — scheduled &lt;code&gt;terraform plan&lt;/code&gt; in CI; alert on non-zero diff; investigate manual console changes (Stage 8.1)
&lt;/li&gt;
&lt;li&gt;Policy as code — OPA/Sentinel/Checkov scan plans before apply; deny public S3, unencrypted volumes, missing tags (Stages 5.5, 8.4)
&lt;/li&gt;
&lt;li&gt;Terragrunt — DRY backend config, dependency ordering between stacks (Stage 8.1)&lt;/li&gt;
&lt;/ul&gt;
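
&lt;p&gt;The scheduled drift check can rely on &lt;code&gt;terraform plan -detailed-exitcode&lt;/code&gt;, which exits 0 when there is no diff, 2 when there is one, and 1 on error. A minimal sketch of a CI wrapper follows. The function names and result labels are illustrative, and note that a non-empty plan can also mean unapplied code changes, not only manual console edits.&lt;/p&gt;

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "drift"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the outcome.

    Assumes the terraform binary is on PATH and the directory is initialised.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, classify_plan_exit(code))
```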

&lt;p&gt;&lt;strong&gt;Key decisions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monorepo vs polyrepo — monorepo: visibility and consistency; polyrepo: team autonomy and blast radius isolation
&lt;/li&gt;
&lt;li&gt;State granularity — one state per environment vs per service; smaller state = faster plan but more coordination
&lt;/li&gt;
&lt;li&gt;Module boundaries — too granular = versioning overhead; too coarse = tight coupling and wide blast radius
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-target&lt;/code&gt; applies — escape hatch for emergencies; dangerous at scale, audit every use
&lt;/li&gt;
&lt;li&gt;Import vs recreate — bringing existing infra under TF management without downtime requires careful &lt;code&gt;terraform import&lt;/code&gt; (Stage 8.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State lock stuck — crashed CI job holds DynamoDB lock; blocks all applies until manual force-unlock
&lt;/li&gt;
&lt;li&gt;Module breaking change — v2 module removes attribute, &lt;code&gt;terraform apply&lt;/code&gt; destroys and recreates production RDS
&lt;/li&gt;
&lt;li&gt;Drift undetected for months — someone changed security group in console; next apply reverts it, breaks traffic
&lt;/li&gt;
&lt;li&gt;Giant state file — plan takes 15 minutes, CI timeout, teams skip plan review
&lt;/li&gt;
&lt;li&gt;Provider bug — provider v5 changes resource behavior, silent replacement of critical infrastructure&lt;/li&gt;
&lt;/ul&gt;
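
&lt;p&gt;Several of these failures end in an unplanned destroy-and-recreate. One guardrail is to scan the JSON form of a saved plan (from &lt;code&gt;terraform show -json&lt;/code&gt;) and fail CI when a protected resource type has &lt;code&gt;delete&lt;/code&gt; among its actions. This is a sketch: the protected-type list and the sample plan are hypothetical, and only the &lt;code&gt;resource_changes&lt;/code&gt; fields used here are assumed.&lt;/p&gt;

```python
import json
from typing import Iterable, List

# Hypothetical list of resource types that must never be deleted or replaced.
PROTECTED_TYPES = ("aws_db_instance", "aws_rds_cluster")


def destructive_changes(plan: dict,
                        protected_types: Iterable[str] = PROTECTED_TYPES) -> List[str]:
    """Return addresses of protected resources whose planned actions include delete."""
    protected = set(protected_types)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as a delete paired with a create.
        if "delete" in actions and rc.get("type") in protected:
            flagged.append(rc.get("address", "unknown"))
    return flagged


if __name__ == "__main__":
    # Hand-written sample shaped like plan JSON, for illustration only.
    sample = json.loads("""
    {"resource_changes": [
      {"address": "aws_db_instance.main", "type": "aws_db_instance",
       "change": {"actions": ["delete", "create"]}},
      {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
       "change": {"actions": ["update"]}}
    ]}
    """)
    print(destructive_changes(sample))
```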

&lt;p&gt;&lt;strong&gt;Interview follow-ups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you structure Terraform modules for 50 teams with different needs?
&lt;/li&gt;
&lt;li&gt;State file is 500MB and plans take 20 minutes — what do you do?
&lt;/li&gt;
&lt;li&gt;Engineer runs &lt;code&gt;terraform apply&lt;/code&gt; locally against prod — how do you prevent this?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>linux</category>
      <category>systems</category>
    </item>
  </channel>
</rss>
