Falolu Olaitan

Building Resilient AKS Workloads: Preventing Single-Pod Failures from Becoming Customer Incidents

Introduction

One of the fastest ways to identify whether an application was designed for production is to delete a pod.

If deleting a single pod causes customer-facing downtime, the application is not resilient, regardless of how modern the platform appears.

I have encountered environments running on Kubernetes with autoscaling, ingress controllers, monitoring platforms, and multiple node pools, yet a single pod restart was enough to create an outage.

The problem is not Kubernetes.

The problem is assuming Kubernetes automatically provides resilience.

Kubernetes provides the tools required to build resilient workloads, but those tools must be implemented correctly.

This article explores practical techniques for preventing routine events such as pod restarts, node maintenance, upgrades, and scaling operations from becoming customer incidents.


Understanding Failure as a Normal Event

Many engineers approach Kubernetes as though failures are exceptional.

In reality, failures are expected.

Pods terminate.

Nodes reboot.

Containers crash.

Applications restart.

Deployments roll forward.

Deployments roll back.

A resilient application assumes these events will happen and continues serving traffic when they do.

The goal is not to prevent failure.

The goal is to prevent failure from becoming an outage.


The Single Replica Problem

One of the most common production risks is running a workload with a single replica.

Example:

replicas: 1

The architecture looks like:

Application
    ↓
Single Pod

Everything appears healthy until:

  • Node maintenance occurs
  • The pod crashes
  • The image is updated
  • The node is drained
  • Memory pressure occurs

At that point:

Application
     ↓
No Running Pods

Customers experience downtime immediately.

A single replica deployment should be considered a single point of failure.


Start with Multiple Replicas

The first step toward resilience is redundancy.

Instead of:

replicas: 1

Use:

replicas: 3

Architecture:

Application
      ↓
 ┌────┼────┐
 ↓    ↓    ↓
Pod1 Pod2 Pod3

Now a single pod failure no longer takes the application down, provided the remaining replicas can absorb the load.

Traffic continues flowing through the remaining replicas.

This is the simplest resilience improvement most teams can make.
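As a rough sketch of where that setting lives, the Deployment below runs three replicas and rolls out updates without dropping below that count. The name `payment-api`, its label, and the image reference are placeholders chosen for illustration, not values from a real environment.

```yaml
# Hypothetical Deployment; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3                # three pods instead of one
  selector:
    matchLabels:
      app: payment-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # keep all three pods serving during a rollout
      maxSurge: 1            # start one new pod before removing an old one
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: example.azurecr.io/payment-api:1.0.0   # placeholder image
          ports:
            - containerPort: 80
```

The rolling update settings are one reasonable choice rather than a requirement. They trade a slightly slower rollout for never running below the desired replica count.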


Readiness Probes Matter More Than Most People Think

Many outages occur because applications receive traffic before they are ready.

Consider a .NET application:

readinessProbe:
  httpGet:
    path: /health
    port: 80

When the application starts:

Container Started

does not necessarily mean:

Application Ready

The application may still be:

  • Loading configuration
  • Establishing database connections
  • Building caches
  • Initializing services

Without a readiness probe, Kubernetes treats the pod as ready as soon as its containers start, so traffic arrives immediately.

Users encounter failures.

With a readiness probe:

Pod Starts
      ↓
Application Initializes
      ↓
Probe Succeeds
      ↓
Traffic Arrives

The difference is significant.
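The flow above depends on how often the probe runs and how many failures are tolerated. A slightly fuller version of the readiness probe might look like the sketch below. The timing values are illustrative assumptions and should be tuned to the application's real startup behaviour.

```yaml
# Container-level fragment; belongs under spec.template.spec.containers[] in a Deployment.
readinessProbe:
  httpGet:
    path: /health            # same endpoint as the earlier example
    port: 80
  initialDelaySeconds: 5     # wait before the first check (illustrative)
  periodSeconds: 10          # check every 10 seconds
  timeoutSeconds: 2          # a check slower than this counts as a failure
  failureThreshold: 3        # mark the pod not ready after 3 consecutive failures
```

A failing readiness probe does not restart the container. It only removes the pod from the Service endpoints until the probe succeeds again.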


Liveness Probes Prevent Stuck Applications

Applications do not always crash.

Sometimes they stop responding.

Examples include:

  • Deadlocks
  • Thread starvation
  • Dependency hangs
  • Resource exhaustion

From Kubernetes' perspective:

Container Running

From the customer's perspective:

Application Down

A liveness probe helps detect these situations.

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30

When the probe fails repeatedly (three consecutive failures by default), the kubelet restarts the container automatically.
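How quickly that restart happens depends on the probe timing. In the sketch below, the values beyond `initialDelaySeconds` are assumptions added for illustration. With a 10-second period and a threshold of 3, an unresponsive container is restarted after roughly 30 seconds of consecutive failures.

```yaml
# Container-level fragment; belongs under spec.template.spec.containers[] in a Deployment.
livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30    # from the example above
  periodSeconds: 10          # check every 10 seconds (illustrative)
  timeoutSeconds: 2          # a check slower than this counts as a failure
  failureThreshold: 3        # restart after 3 consecutive failures
```

Thresholds that are too aggressive can cause restart loops under load, so it is worth erring on the tolerant side at first.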


Graceful Shutdown Is Often Overlooked

A pod does not disappear instantly.

Kubernetes sends:

SIGTERM

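before terminating the container. If the process is still running when the grace period ends (terminationGracePeriodSeconds, 30 seconds by default), Kubernetes sends SIGKILL.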

Applications should use this period to:

  • Complete active requests
  • Flush logs
  • Release resources
  • Close connections

Without graceful shutdown:

Customer Request
       ↓
Pod Terminated
       ↓
Request Lost

With graceful shutdown:

Customer Request
       ↓
Request Completes
       ↓
Pod Terminates

This becomes especially important during deployments and node upgrades.
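On the Kubernetes side, the shutdown window can be configured in the pod template. The sketch below is one common pattern, not the only one. It assumes a placeholder container name and image, and it assumes the image ships a `sleep` binary for the `preStop` hook.

```yaml
# Pod-template fragment; belongs under spec.template.spec in a Deployment.
terminationGracePeriodSeconds: 45    # total time allowed before SIGKILL
containers:
  - name: payment-api                # placeholder name
    image: example.azurecr.io/payment-api:1.0.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent. Assumes `sleep` exists in the image.
          # The pause gives load balancers time to stop routing to this pod.
          command: ["sleep", "10"]
```

The `preStop` time counts against the grace period, so in this sketch the application has about 35 seconds after SIGTERM to finish in-flight requests. The application itself still needs to handle SIGTERM and stop accepting new work.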


Pod Disruption Budgets

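A Pod Disruption Budget (PDB) prevents Kubernetes from evicting too many replicas simultaneously during voluntary disruptions such as node drains and upgrades. It does not protect against involuntary failures such as a crashed pod or a lost node.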

Example:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api

This tells Kubernetes:

Always Keep
At Least
Two Pods Running

Without a PDB:

Node Drain
      ↓
Multiple Pods Evicted
      ↓
Application Impact

With a PDB:

Node Drain
      ↓
Eviction Controlled
      ↓
Application Remains Available

Avoid Running Everything on One Node

Multiple replicas alone do not guarantee resilience.

Consider:

Node 1
 ├─ Pod A
 ├─ Pod B
 └─ Pod C

Three replicas exist.

Everything looks healthy.

Then:

Node Failure

All replicas disappear simultaneously.

The application is unavailable.

Replicas must be distributed across nodes, and the scheduler only guarantees this when the workload asks for it.

Use:

topologySpreadConstraints

or

podAntiAffinity

to prevent workload concentration.
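As a minimal sketch of the first option, the fragment below asks the scheduler to keep the pod count per node within one of each other. The `app: payment-api` label is an assumption and must match the labels on the workload's own pods.

```yaml
# Pod-template fragment; belongs under spec.template.spec in a Deployment.
topologySpreadConstraints:
  - maxSkew: 1                           # per-node pod counts may differ by at most 1
    topologyKey: kubernetes.io/hostname  # treat each node as its own domain
    whenUnsatisfiable: DoNotSchedule     # leave the pod Pending rather than stack it
    labelSelector:
      matchLabels:
        app: payment-api                 # assumed pod label
```

`DoNotSchedule` is strict: if no node satisfies the constraint, the pod stays Pending. `ScheduleAnyway` is the softer alternative when availability of capacity matters more than perfect spread.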


Multi-Zone Deployments

For production environments, node-level resilience is often insufficient.

Availability Zones provide protection against larger failures.

Example:

Zone A
Zone B
Zone C

Pods distributed across zones:

Zone A → Pod1
Zone B → Pod2
Zone C → Pod3

If an entire zone becomes unavailable:

Zone A Lost

traffic continues flowing through:

Zone B
Zone C

This significantly improves platform resilience.
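Zone distribution can be requested in the pod template as well. The sketch below assumes the cluster's node pools actually span multiple Availability Zones and that pods carry an `app: payment-api` label. Without nodes in several zones, the constraint has nothing to spread across.

```yaml
# Pod-template fragment; belongs under spec.template.spec in a Deployment.
topologySpreadConstraints:
  - maxSkew: 1                                # per-zone pod counts may differ by at most 1
    topologyKey: topology.kubernetes.io/zone  # standard zone label on nodes
    whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but never block scheduling
    labelSelector:
      matchLabels:
        app: payment-api                      # assumed pod label
```

This entry can sit alongside other constraints in the same list. `ScheduleAnyway` is used here so that losing a zone does not leave replacement pods Pending.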


Horizontal Pod Autoscaler

Traffic rarely remains constant.

An application supporting:

100 Requests Per Minute

may suddenly receive:

5,000 Requests Per Minute

Without scaling:

High Latency
Timeouts
Failures

The Horizontal Pod Autoscaler can increase replica counts automatically.

Example:

minReplicas: 3
maxReplicas: 10

When demand increases:

3 Pods
  ↓
6 Pods
  ↓
10 Pods

The workload remains responsive.
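For context, those two fields belong to a HorizontalPodAutoscaler object. The sketch below shows one possible complete manifest. The target Deployment name `payment-api` and the 70% CPU target are assumptions for illustration.

```yaml
# Hypothetical HPA; the target name and CPU threshold are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```

CPU utilization is measured against the containers' CPU requests, so this only works when requests are defined and a metrics source such as metrics-server is available.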


Cluster Autoscaler

Sometimes the problem is not pod capacity.

The problem is node capacity.

Consider:

10 Pods Required

but the cluster only has resources for:

6 Pods

The remaining pods stay pending.

Cluster Autoscaler solves this by adding nodes automatically.

Demand Increases
      ↓
Nodes Added
      ↓
Pods Scheduled

This allows the platform to grow with workload demand.


Protecting Against OOMKills

One of the most common causes of unexpected restarts is memory exhaustion.

Example:

Reason: OOMKilled

Applications should define realistic:

resources:
  requests:
  limits:

Monitoring memory consumption is critical.

Blindly increasing limits often masks the root cause.

A resilient workload understands its resource requirements.
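A filled-in version of that block might look like the sketch below. The numbers are placeholders, not recommendations. Realistic values should come from observed usage under load.

```yaml
# Container-level fragment; belongs under spec.template.spec.containers[] in a Deployment.
resources:
  requests:
    cpu: 250m          # what the scheduler reserves on the node (placeholder)
    memory: 256Mi      # placeholder; base this on measured steady-state usage
  limits:
    memory: 512Mi      # exceeding this gets the container OOMKilled
```

Requests drive scheduling decisions, while the memory limit is the hard ceiling enforced at runtime. A limit set far below real peak usage guarantees OOMKills, and one set far above it hides leaks.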


Chaos Testing

One of the best ways to evaluate resilience is to introduce controlled failures.

Examples:

Delete a pod:

kubectl delete pod <pod-name>

Drain a node:

kubectl drain <node-name> --ignore-daemonsets

Restart a deployment:

kubectl rollout restart deployment <deployment-name>

The question is simple:

Do customers notice?

If they do, resilience improvements are required.


Monitoring for Resilience

Resilience requires visibility.

Monitor:

  • Pod restarts
  • OOMKills
  • Node failures
  • Deployment failures
  • Availability
  • Latency

Useful metrics include:

Restart Count
Error Rate
Request Duration
Availability Percentage

Problems identified early are far easier to address than production incidents.
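As one hedged example of turning the restart count into an early warning, the rule below is written in Prometheus alerting-rule format. It assumes the cluster is scraped by Prometheus and runs kube-state-metrics, which exposes `kube_pod_container_status_restarts_total`. The alert name, group name, and thresholds are invented for illustration.

```yaml
# Hypothetical Prometheus alerting rule; names and thresholds are illustrative.
groups:
  - name: resilience
    rules:
      - alert: PodRestartingFrequently
        # More than 3 container restarts within the last 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m                      # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
```

Other monitoring stacks can express the same idea. The point is to alert on the restart trend before it shows up as an availability drop.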


Common Mistakes

Mistake 1

Running production workloads with:

replicas: 1

Mistake 2

Missing readiness probes.


Mistake 3

Missing liveness probes.


Mistake 4

No Pod Disruption Budget.


Mistake 5

All replicas scheduled on one node.


Mistake 6

No autoscaling strategy.


Mistake 7

Never testing failure scenarios.


A Simple Resilience Checklist

Before promoting a workload to production, verify:

  • Multiple replicas exist.
  • Readiness probes are configured.
  • Liveness probes are configured.
  • Resource requests and limits are defined.
  • Pod Disruption Budgets exist.
  • Replicas are distributed across nodes.
  • Autoscaling is configured.
  • Monitoring is enabled.
  • Failure scenarios have been tested.

Final Thoughts

Kubernetes provides powerful resilience capabilities, but those capabilities are not automatic.

A deployment with a single replica remains a single point of failure regardless of how advanced the platform appears.

The most resilient environments assume that failures will occur every day.

Pods will restart.

Nodes will fail.

Applications will crash.

Upgrades will happen.

The difference between a resilient platform and an unreliable one is how well the workload continues serving customers when those events occur.

The true measure of resilience is not whether failures happen.

It is whether anyone notices when they do.
