Introduction
One of the fastest ways to identify whether an application was designed for production is to delete a pod.
If deleting a single pod causes customer-facing downtime, the application is not resilient regardless of how modern the platform appears.
I have encountered environments running on Kubernetes with autoscaling, ingress controllers, monitoring platforms, and multiple node pools, yet a single pod restart was enough to create an outage.
The problem is not Kubernetes.
The problem is assuming Kubernetes automatically provides resilience.
Kubernetes provides the tools required to build resilient workloads, but those tools must be implemented correctly.
This article explores practical techniques for preventing routine events such as pod restarts, node maintenance, upgrades, and scaling operations from becoming customer incidents.
Understanding Failure as a Normal Event
Many engineers approach Kubernetes as though failures are exceptional.
In reality, failures are expected.
Pods terminate.
Nodes reboot.
Containers crash.
Applications restart.
Deployments roll forward.
Deployments roll back.
A resilient application assumes these events will happen and continues serving traffic when they do.
The goal is not to prevent failure.
The goal is to prevent failure from becoming an outage.
The Single Replica Problem
One of the most common production risks is running a workload with a single replica.
Example:
replicas: 1
The architecture looks like:
Application
↓
Single Pod
Everything appears healthy until:
- Node maintenance occurs
- The pod crashes
- The image is updated
- The node is drained
- Memory pressure occurs
At that point:
Application
↓
No Running Pods
Customers experience downtime immediately.
A single replica deployment should be considered a single point of failure.
Start with Multiple Replicas
The first step toward resilience is redundancy.
Instead of:
replicas: 1
Use:
replicas: 3
Architecture:
      Application
           ↓
     ┌─────┼─────┐
     ↓     ↓     ↓
    Pod1  Pod2  Pod3
Now a single pod failure does not take the application down, provided the remaining replicas have enough capacity to absorb the load.
Traffic continues flowing through the remaining replicas.
This is the simplest resilience improvement most teams can make.
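As a minimal sketch, a Deployment with three replicas might look like the following. The `payment-api` name is borrowed from the Pod Disruption Budget example later in this article, and the image reference is a hypothetical placeholder. The rolling update settings are an assumption about what most teams want: never drop below the desired replica count during a rollout.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep all three replicas serving during a rollout
      maxSurge: 1         # add one new pod at a time before removing an old one
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:1.0.0   # hypothetical image
          ports:
            - containerPort: 80
```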
Readiness Probes Matter More Than Most People Think
Many outages occur because applications receive traffic before they are ready.
Consider a .NET application:
readinessProbe:
  httpGet:
    path: /health
    port: 80
When the application starts:
Container Started
does not necessarily mean:
Application Ready
The application may still be:
- Loading configuration
- Establishing database connections
- Building caches
- Initializing services
Without a readiness probe, Kubernetes treats the pod as ready as soon as its containers start, so traffic arrives immediately.
Users encounter failures.
With a readiness probe:
Pod Starts
↓
Application Initializes
↓
Probe Succeeds
↓
Traffic Arrives
The difference is significant.
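A slightly fuller readiness probe, shown here as a fragment of a pod spec, makes the timing explicit. The numbers are illustrative assumptions and should be tuned to how long the application really takes to respond. The container name and image are hypothetical.

```yaml
containers:
  - name: payment-api
    image: registry.example.com/payment-api:1.0.0   # hypothetical image
    ports:
      - containerPort: 80
    readinessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 5       # probe every 5 seconds
      timeoutSeconds: 2      # each probe must answer within 2 seconds
      failureThreshold: 3    # marked not ready after 3 consecutive failures
      successThreshold: 1    # one success marks the pod ready again
```

A failing readiness probe does not restart the container. It only removes the pod from the Service endpoints until the probe succeeds again.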
Liveness Probes Prevent Stuck Applications
Applications do not always crash.
Sometimes they stop responding.
Examples include:
- Deadlocks
- Thread starvation
- Dependency hangs
- Resource exhaustion
From Kubernetes' perspective:
Container Running
From the customer's perspective:
Application Down
A liveness probe helps detect these situations.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30
When the probe fails repeatedly (failureThreshold consecutive times, three by default), the kubelet restarts the container automatically.
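For applications with a slow or variable startup, a fixed initial delay can be replaced with a startup probe. The fragment below is a sketch with assumed values. Liveness and readiness checks stay disabled until the startup probe has succeeded once.

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 5
  failureThreshold: 24    # allows up to 5s x 24 = 120s for startup
livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3     # restart after 3 consecutive failures
```

It is usually worth keeping the liveness check cheap and independent of downstream dependencies, so a database outage does not turn into a restart loop.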
Graceful Shutdown Is Often Overlooked
A pod does not disappear instantly.
Kubernetes sends:
SIGTERM
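before terminating the container, then waits up to terminationGracePeriodSeconds (30 seconds by default) before force-killing it with SIGKILL.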
Applications should use this period to:
- Complete active requests
- Flush logs
- Release resources
- Close connections
Without graceful shutdown:
Customer Request
↓
Pod Terminated
↓
Request Lost
With graceful shutdown:
Customer Request
↓
Request Completes
↓
Pod Terminates
This becomes especially important during deployments and node upgrades.
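On the Kubernetes side, graceful shutdown can be given more room with two settings, sketched below. The values are assumptions, and the `preStop` command assumes the image contains a `sleep` binary. The application itself still has to handle SIGTERM and finish in-flight requests.

```yaml
spec:
  terminationGracePeriodSeconds: 45    # total time allowed before SIGKILL
  containers:
    - name: payment-api
      image: registry.example.com/payment-api:1.0.0   # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; gives load balancers time to
            # stop routing new requests to this pod.
            command: ["sleep", "10"]
```

The time spent in the `preStop` hook counts against the grace period, so the grace period should exceed the hook duration plus the longest expected request.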
Pod Disruption Budgets
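A Pod Disruption Budget (PDB) limits how many replicas can be taken down at once by voluntary disruptions such as node drains and cluster upgrades. It does not protect against involuntary failures such as a node crash.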
Example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api
This tells Kubernetes:
Always Keep At Least Two Pods Available During Voluntary Disruptions
Without a PDB:
Node Drain
↓
Multiple Pods Evicted
↓
Application Impact
With a PDB:
Node Drain
↓
Eviction Controlled
↓
Application Remains Available
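One caveat: a fixed `minAvailable: 2` blocks every eviction if the workload is ever scaled down to two replicas, which can stall a node drain. An alternative sketch uses `maxUnavailable`, which keeps working as the replica count changes.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  maxUnavailable: 1     # evict at most one matching pod at a time
  selector:
    matchLabels:
      app: payment-api
```

Only one of `minAvailable` or `maxUnavailable` can be set on a single budget.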
Avoid Running Everything on One Node
Multiple replicas alone do not guarantee resilience.
Consider:
Node 1
├─ Pod A
├─ Pod B
└─ Pod C
Three replicas exist.
Everything looks healthy.
Then:
Node Failure
All replicas disappear simultaneously.
The application is unavailable.
Replicas must be distributed across nodes, and Kubernetes does not guarantee this unless the workload asks for it.
Use:
topologySpreadConstraints
or
podAntiAffinity
to prevent workload concentration.
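A minimal sketch of a spread constraint in the pod template is shown below. It assumes the pods carry the `app: payment-api` label used in the PDB example.

```yaml
template:
  metadata:
    labels:
      app: payment-api
  spec:
    topologySpreadConstraints:
      - maxSkew: 1                          # pod counts per node may differ by at most 1
        topologyKey: kubernetes.io/hostname # spread across individual nodes
        whenUnsatisfiable: DoNotSchedule    # leave the pod pending rather than stack it
        labelSelector:
          matchLabels:
            app: payment-api
```

`DoNotSchedule` is the strict choice. `ScheduleAnyway` turns the constraint into a preference, which avoids pending pods on small clusters at the cost of weaker guarantees.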
Multi-Zone Deployments
For production environments, node-level resilience is often insufficient.
Availability Zones provide protection against larger failures.
Example:
Zone A
Zone B
Zone C
Pods distributed across zones:
Zone A → Pod1
Zone B → Pod2
Zone C → Pod3
If an entire zone becomes unavailable:
Zone A Lost
traffic continues flowing through:
Zone B
Zone C
This significantly improves platform resilience.
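Zone spreading can be expressed with the same mechanism used for node spreading, only with a different topology key. This sketch assumes the nodes carry the standard `topology.kubernetes.io/zone` label, which managed cloud clusters typically set.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread across availability zones
    whenUnsatisfiable: ScheduleAnyway          # prefer spreading, but still schedule if a zone is down
    labelSelector:
      matchLabels:
        app: payment-api
```

The soft `ScheduleAnyway` setting is a deliberate assumption here: when a zone is lost, replacement pods should still be able to start in the surviving zones.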
Horizontal Pod Autoscaler
Traffic rarely remains constant.
An application supporting:
100 Requests Per Minute
may suddenly receive:
5,000 Requests Per Minute
Without scaling:
High Latency
Timeouts
Failures
The Horizontal Pod Autoscaler can increase replica counts automatically.
Example:
minReplicas: 3
maxReplicas: 10
When demand increases:
3 Pods
↓
6 Pods
↓
10 Pods
The workload remains responsive.
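A complete autoscaler around those two values could look like the sketch below. The target Deployment name and the 70% CPU threshold are assumptions. CPU utilization is measured against the container's CPU request, so the workload needs resource requests defined, and the cluster needs a metrics source such as metrics-server.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold
```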
Cluster Autoscaler
Sometimes the problem is not pod capacity.
The problem is node capacity.
Consider:
10 Pods Required
but the cluster only has resources for:
6 Pods
The remaining pods stay pending.
Cluster Autoscaler solves this by adding nodes automatically.
Demand Increases
↓
Nodes Added
↓
Pods Scheduled
This allows the platform to grow with workload demand.
Protecting Against OOMKills
One of the most common causes of unexpected restarts is memory exhaustion.
Example:
Reason: OOMKilled
Applications should define realistic:
resources:
  requests:
  limits:
Monitoring memory consumption is critical.
Blindly increasing limits often masks the root cause.
A resilient workload understands its resource requirements.
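Filled in, the resources block might look like this. The numbers are placeholders, not recommendations; realistic values come from observing the application under load.

```yaml
containers:
  - name: payment-api
    image: registry.example.com/payment-api:1.0.0   # hypothetical image
    resources:
      requests:
        cpu: 250m         # used by the scheduler to place the pod
        memory: 256Mi
      limits:
        memory: 512Mi     # exceeding this gets the container OOMKilled
```

The memory limit is the value that triggers `OOMKilled`, so it should sit comfortably above the observed peak rather than above the average.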
Chaos Testing
One of the best ways to evaluate resilience is to introduce controlled failures.
Examples:
Delete a pod:
kubectl delete pod <pod-name>
Drain a node:
kubectl drain <node-name> --ignore-daemonsets
Restart a deployment:
kubectl rollout restart deployment <deployment-name>
The question is simple:
Do customers notice?
If they do, resilience improvements are required.
Monitoring for Resilience
Resilience requires visibility.
Monitor:
- Pod restarts
- OOMKills
- Node failures
- Deployment failures
- Availability
- Latency
Useful metrics include:
Restart Count
Error Rate
Request Duration
Availability Percentage
Problems identified early are far easier to address than production incidents.
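As one possible way to act on restart counts, the sketch below is a Prometheus alerting rule. It assumes Prometheus and kube-state-metrics are the monitoring stack, which this article does not prescribe, and the threshold and rule names are illustrative.

```yaml
groups:
  - name: resilience-alerts          # hypothetical rule group name
    rules:
      - alert: PodRestartingFrequently
        # Restart counter exposed by kube-state-metrics
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting frequently"
```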
Common Mistakes
Mistake 1
Running production workloads with:
replicas: 1
Mistake 2
Missing readiness probes.
Mistake 3
Missing liveness probes.
Mistake 4
No Pod Disruption Budget.
Mistake 5
All replicas scheduled on one node.
Mistake 6
No autoscaling strategy.
Mistake 7
Never testing failure scenarios.
A Simple Resilience Checklist
Before promoting a workload to production, verify:
- Multiple replicas exist.
- Readiness probes are configured.
- Liveness probes are configured.
- Resource requests and limits are defined.
- Pod Disruption Budgets exist.
- Replicas are distributed across nodes.
- Autoscaling is configured.
- Monitoring is enabled.
- Failure scenarios have been tested.
Final Thoughts
Kubernetes provides powerful resilience capabilities, but those capabilities are not automatic.
A deployment with a single replica remains a single point of failure regardless of how advanced the platform appears.
The most resilient environments assume that failures will occur every day.
Pods will restart.
Nodes will fail.
Applications will crash.
Upgrades will happen.
The difference between a resilient platform and an unreliable one is how well the workload continues serving customers when those events occur.
The true measure of resilience is not whether failures happen.
It is whether anyone notices when they do.
