cloud-sky-ops

Post 2/10 — Reliability by Design: Probes, PodDisruptionBudgets, and Topology Spread Constraints

Now that you’ve laid a baseline of namespace isolation, quotas, network policy, and PSA (if not, see Post 1 of the series), the next layer is reliability. In this post I walk you through the key Kubernetes primitives that help your workloads survive disruptions and evolve safely: probes, PDBs, topology spread constraints, and rollout strategies. This post dives deeper into these nuanced parts of Kubernetes; buckle up for some fun, and I hope you enjoy the ride.


Executive Summary

  • Use liveness, readiness, and startup probes to let Kubernetes detect and recover from unhealthy application states.
  • A PodDisruptionBudget (PDB) ensures voluntary disruptions (e.g. node drain, rolling upgrades) don’t violate your availability SLO.
  • TopologySpreadConstraints force pods to be balanced across failure domains (zones, nodes) to reduce blast radius.
  • Carefully configure rollout strategies (surge, maxUnavailable) in your Deployment to control downtime vs speed.
  • Together, these tools let you design reliability from the start—preventing cascading failures rather than firefighting.

Prereqs

  • You already have a Kubernetes cluster with kubectl access (as assumed in Post 1).
  • You have an existing Deployment (or create one) that you can modify.
  • You have at least two nodes (ideally in different zones or failure domains) to test spread constraints.
  • You can cordon/drain nodes (kubectl drain) to simulate disruption.

Concepts

A. Probes: Liveness / Readiness / Startup

  • Definition: Probes are periodic checks (HTTP, TCP, exec) that Kubernetes makes into containers to detect their health or readiness. Without them, a stuck process can stay “Running” indefinitely, and traffic may go to unhealthy pods.
  • Best practices:

    • Always include readiness in your services so endpoints only include truly ready pods.
    • Use startup probe for apps with long initialization (so liveness doesn’t kill them prematurely).
    • Be conservative: use gentle probe intervals and timeouts so transient GC pauses or background load don’t cause spurious failures.
    • Test locally to find thresholds under load.

Commands / YAML snippets:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 2

startupProbe:
  httpGet:
    path: /ready
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
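A quick sanity check worth doing: the time a startup probe grants an app to boot is failureThreshold × periodSeconds. With the values in the snippet above, that works out to 300 seconds before liveness checks take over (a sketch of the arithmetic, values copied from the snippet):

```shell
# Startup window = failureThreshold x periodSeconds.
# Values mirror the startupProbe snippet above.
failure_threshold=30
period_seconds=10
startup_window=$((failure_threshold * period_seconds))
echo "startup window: ${startup_window}s"   # prints "startup window: 300s"
```

If your app can take longer than that to initialize, raise failureThreshold rather than periodSeconds, so a healthy boot is still detected promptly.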

Before → After:

  • Before: Pod is “Running” perpetually, even if app crashes internally; Service routes traffic to a dead process.
  • After: Kubernetes restarts the pod automatically (liveness), and the pod is only added to load via readiness when healthy.

When to use: Always for production; startup probes if your app has long boot phases.
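HTTP isn’t the only probe type: the same checks can run over TCP or by executing a command inside the container, which is handy for non-HTTP workloads like databases. A hedged sketch (the port and the pg_isready command are illustrative, assuming a Postgres-style container, not something from this post’s demo app):

```yaml
# TCP probe: succeeds if the port accepts a connection.
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 5
  timeoutSeconds: 2

# Exec probe: succeeds if the command exits with status 0.
# pg_isready is an example; use whatever health command
# your image actually ships.
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 10
  timeoutSeconds: 2
```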


B. PodDisruptionBudget (PDB)

  • Definition: A PDB is a policy that defines the minimum number (or fraction) of pods that must remain available during voluntary disruptions. Ensures your system doesn’t accidentally violate availability during upgrades, node drains, or autoscaling events.
  • Best practices:

    • Use minAvailable when you want a floor on availability, or maxUnavailable for a cap on disruption.
    • Don’t set them too tight (you might block your own updates). Leave wiggle room.
    • Align PDBs with your rolling update strategy (surge / unavailable) to avoid deadlocks.
    • Monitor PDB status (kubectl get pdb) to detect stuck updates.

Commands / YAML snippet:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Before → After:

  • Before: During a node drain, all pods could be disrupted and cause downtime.
  • After: At most one pod can be evicted at a time, preserving the availability floor.

When to use: For any service with more than one replica; optional for batch jobs but still beneficial.
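The snippet above uses minAvailable as a floor; the cap-style alternative mentioned in the best practices uses maxUnavailable instead. A sketch (the percentage is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # At most 25% of matching pods may be down due to
  # voluntary disruptions at any one time.
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: my-app
```

Note that a single PDB may specify either minAvailable or maxUnavailable, not both; percentages scale with replica count, which makes maxUnavailable convenient for autoscaled workloads.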


C. TopologySpreadConstraints

  • Definition: A declaration in pod spec that controls how Kubernetes spreads pods across failure domains (nodes, zones) to enforce balance. Avoid overconcentration: if one zone or node goes down, you don’t lose your entire workload.
  • Best practices:

    • Use well-known node labels: e.g. topology.kubernetes.io/zone or kubernetes.io/hostname.
    • maxSkew = 1 is a typical starting point (difference <=1 pod across domains).
    • whenUnsatisfiable: use DoNotSchedule for strict spreading or ScheduleAnyway for softer enforcement.
    • Use the same spread constraints on all revisions of your Deployment to maintain consistency.

Commands / YAML snippet:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

Before → After:

  • Before: All pods can land in one zone or on one node (e.g. whichever has the most free resources).
  • After: Pods evenly distributed across zones/nodes, reducing risk if one fails.

When to use: For replicated workloads in HA setups, especially multi-zone or multi-node clusters.
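If strict spreading risks blocking scheduling in an imbalanced cluster, the softer form mentioned in the best practices swaps DoNotSchedule for ScheduleAnyway, turning the skew limit into a preference rather than a hard rule. A sketch:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    # Soft constraint: the scheduler prefers a balanced
    # placement but will still place the pod if none exists.
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
```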


D. Rollout Strategies (Surge / maxUnavailable)

  • Definition: In a Deployment’s strategy.rollingUpdate, maxSurge is how many extra pods can be created during upgrade; maxUnavailable is how many pods are allowed to drop at once. They control the trade-off between speed and availability during upgrades.
  • Best practices:

    • Use maxUnavailable: 0 and maxSurge: 1 (or more) for zero-downtime (if resources allow).
    • For batch or low-priority workloads, allow some unavailability for faster rollout (e.g. 20–30%).
    • Always test with your PDB + spread settings to ensure upgrade doesn’t stall.

Commands / YAML snippet:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Before → After:

  • Before: Default 25% surge/unavailable may drop too many pods at once, violating SLO during high load.
  • After: You control how many can be updated and how many remain live.

When to use: Always set explicitly rather than relying on defaults; vary per workload type.
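For the batch or low-priority case mentioned above, a percentage-based strategy trades some availability for rollout speed. A sketch with illustrative values:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    # Percentages are resolved against spec.replicas
    # (surge rounds up, unavailable rounds down).
    maxSurge: 25%
    maxUnavailable: 25%
```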


Mini-Lab

Let’s walk through building an app that has probes, PDB, and topology spread constraints. Then simulate node disruption and see your reliability in action.

Step 1: Namespace + sample deployment

kubectl create ns reliability-demo
kubectl config set-context --current --namespace=reliability-demo

Deploy a simple app (e.g. HTTP echo) with 3 replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  labels: { app: echo }
spec:
  replicas: 3
  selector:
    matchLabels: { app: echo }
  template:
    metadata:
      labels: { app: echo }
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=hello"
        ports:
        - containerPort: 5678
kubectl apply -f echo.yaml
kubectl get pods -o wide

Step 2: Add probes, PDB, and topology constraints

Edit the above spec:

# inside spec.template.spec.containers[0]:
        readinessProbe:
          httpGet:
            path: /health
            port: 5678
          initialDelaySeconds: 3
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /live
            port: 5678
          initialDelaySeconds: 10
          periodSeconds: 10
# Also add under the Deployment's spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
# And under spec.template.spec (the pod spec, alongside containers):
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: echo }
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: echo }

Create a PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: echo-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: echo

Apply both:

kubectl apply -f modified-echo.yaml
kubectl apply -f pdb-echo.yaml

Check status:

kubectl get pods -o wide
kubectl get pdb
kubectl describe pdb echo-pdb

Step 3: Simulate node drain and verify behavior

Pick one node:

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data

Watch pods:

kubectl get pods -o wide -w

Expectations:

  • One pod evicted (PDB ensures minAvailable 2 remain).
  • New pods rescheduled onto other nodes (respecting topology constraints).
  • No downtime for readiness endpoints if requests route to healthy pods.

Restore node:

kubectl uncordon $NODE

Bonus: Trigger a rollout:

kubectl set image deployment/echo app=hashicorp/http-echo:1.0.0
kubectl rollout status deployment/echo

Observe that upgrades are safe, respecting availability and spread.


Cheat Sheet Table

| Action | Command / YAML | Purpose / Note |
| --- | --- | --- |
| Add readiness probe | see snippet above | Pod receives traffic only when healthy |
| Add liveness probe | see snippet above | Restarts stuck pods |
| Add startup probe | similar pattern as above | Prevents liveness kill during slow init |
| Create PDB | `kubectl apply -f pdb.yaml` | Enforce minimal availability |
| Check PDB | `kubectl get pdb` / `kubectl describe pdb` | Confirm eviction limits |
| Set rollout strategy | `strategy.rollingUpdate` | Control surge / downtime |
| Drain a node | `kubectl drain <node> --ignore-daemonsets` | Simulate voluntary disruption |
| Check rollout | `kubectl rollout status deployment/<name>` | Wait for safe update |
| View pod distribution | `kubectl get pods -o wide` | Inspect zone/node distribution |

Pitfalls & Gotchas

  • Misconfigured probes kill healthy pods.
    Overly aggressive probe timeouts or thresholds cause spurious restarts under load. Tune them after load testing.

  • PDB deadlocks your upgrades.
    If your PDB demands minAvailable = replicas, node drains can never evict a pod and cluster upgrades stall. (Kubernetes rejects a rollingUpdate with both maxSurge: 0 and maxUnavailable: 0 outright.) Always allow some headroom (e.g. maxSurge: 1) or loosen the PDB.
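You can sanity-check the headroom with quick arithmetic before rolling out: the PDB permits replicas − minAvailable evictions, and a rolling update needs maxSurge + maxUnavailable > 0 to make progress. A sketch with illustrative values (not from this post’s manifests):

```shell
# Illustrative deadlock check for PDB + rolling-update settings.
replicas=3
min_available=3      # PDB floor equal to replicas: no evictions allowed
max_surge=0
max_unavailable=0

evictable=$((replicas - min_available))          # drains allowed by the PDB
rollout_room=$((max_surge + max_unavailable))    # pods the rollout can churn

if [ "$rollout_room" -eq 0 ]; then
  echo "DEADLOCK: rollout cannot replace any pod"
elif [ "$evictable" -eq 0 ]; then
  echo "WARNING: PDB blocks all node drains"
else
  echo "OK: rollout and drains can make progress"
fi
```

With these values the check prints the deadlock message; bumping max_surge to 1 and min_available down to 2 clears both conditions.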

  • Skew violations during rolling updates.
    Spread constraints sometimes misbehave during updates: the scheduler considers old and new pods together in balancing, so you may temporarily skew. Use ScheduleAnyway as fallback or trigger rescheduling.

  • Asymmetric zones or resource imbalance.
    If one zone has less capacity, strict constraints may block scheduling. Use a softer whenUnsatisfiable: ScheduleAnyway or allow some skew.

  • Spread constraints don’t rebalance after scale-down.
    When pods are removed, existing pods may end up unevenly distributed. Use a Descheduler or manual intervention to rebalance.

  • Startup probe missing or too permissive weakens liveness protection.
    Without a startup probe, a slow boot can trip liveness failures and restart loops. With one that is too lenient (a large failureThreshold × periodSeconds), you delay detecting genuinely broken pods.


Wrap-up & Bridge to Post 3

With probes, PDBs, topology spread, and rollout control, you now have a robust reliability foundation. Your services will survive node drains, upgrades, and zone outages while satisfying availability SLOs.

In Post 3, we’ll build on this: version upgrades, canary/blue-green deployments, cluster upgrades, and rollback strategies, so you can evolve your system safely under load.


Diagram 1 — Rolling update timeline

Diagram 2 — Zone spread visualization

Drop a comment if you learned something new, and share your thoughts. Thank you.
