Falolu Olaitan

Posted on Jun 8

Building Resilient AKS Workloads: Preventing Single-Pod Failures from Becoming Customer Incidents

#sre #pod #aks #azure

Introduction

One of the fastest ways to identify whether an application was designed for production is to delete a pod.

If deleting a single pod causes customer-facing downtime, the application is not resilient regardless of how modern the platform appears.

I have encountered environments running Kubernetes with autoscaling, ingress controllers, monitoring platforms, and multiple node pools where a single pod restart was still enough to cause an outage.

The problem is not Kubernetes.

The problem is assuming Kubernetes automatically provides resilience.

Kubernetes provides the tools required to build resilient workloads, but those tools must be implemented correctly.

This article explores practical techniques for preventing routine events such as pod restarts, node maintenance, upgrades, and scaling operations from becoming customer incidents.

Understanding Failure as a Normal Event

Many engineers approach Kubernetes as though failures are exceptional.

In reality, failures are expected.

Pods terminate.

Nodes reboot.

Containers crash.

Applications restart.

Deployments roll forward.

Deployments roll back.

A resilient application assumes these events will happen and continues serving traffic when they do.

The goal is not to prevent failure.

The goal is to prevent failure from becoming an outage.

The Single Replica Problem

One of the most common production risks is running a workload with a single replica.

Example:

replicas: 1

The architecture looks like:

Application
    ↓
Single Pod

Everything appears healthy until:

Node maintenance occurs
The pod crashes
The image is updated
The node is drained
Memory pressure occurs

At that point:

Application
     ↓
No Running Pods

Customers experience downtime immediately.

A single replica deployment should be considered a single point of failure.

Start with Multiple Replicas

The first step toward resilience is redundancy.

Instead of:

replicas: 1

Use:

replicas: 3

Architecture:

Application
      ↓
 ┌────┼────┐
 ↓    ↓    ↓
Pod1 Pod2 Pod3

Now a single pod failure no longer takes the application offline, provided the remaining replicas have enough capacity to absorb the traffic.

Traffic continues flowing through the remaining replicas.

This is the simplest resilience improvement most teams can make.
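As a sketch, the replica count sits inside a Deployment like the one below. The name payment-api, its label, and the image reference are placeholders rather than values from a real environment. The rollingUpdate settings are an added assumption so that an image update does not reduce the number of ready pods.

```yaml
# Minimal Deployment sketch. Names, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3                # three pods instead of one
  selector:
    matchLabels:
      app: payment-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never take an old pod down before a new one is ready
      maxSurge: 1            # add one extra pod at a time during a rollout
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: myregistry.azurecr.io/payment-api:1.0.0   # hypothetical image
          ports:
            - containerPort: 80
```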

Readiness Probes Matter More Than Most People Think

Many outages occur because applications receive traffic before they are ready.

Consider a .NET application:

readinessProbe:
  httpGet:
    path: /health
    port: 80

When the application starts:

Container Started

does not necessarily mean:

Application Ready

The application may still be:

Loading configuration
Establishing database connections
Building caches
Initializing services

Without a readiness probe, Kubernetes treats the pod as ready as soon as its containers start, so traffic arrives immediately.

Users encounter failures.

With a readiness probe:

Pod Starts
      ↓
Application Initializes
      ↓
Probe Succeeds
      ↓
Traffic Arrives

The difference is significant.
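The probe above relies on default timings. A slightly fuller sketch makes them explicit. The numbers are illustrative assumptions, not recommendations, and should be tuned to how long the application really takes to initialize.

```yaml
# Container-level fragment of a pod spec. Timing values are illustrative.
readinessProbe:
  httpGet:
    path: /health            # same endpoint as the example above
    port: 80
  initialDelaySeconds: 5     # wait before the first check
  periodSeconds: 10          # check every 10 seconds
  timeoutSeconds: 2          # a slower response counts as a failure
  failureThreshold: 3        # remove from Service endpoints after 3 consecutive failures
  successThreshold: 1        # one success is enough to receive traffic again
```

A failing readiness probe does not restart the container. It only stops traffic from being routed to that pod until the probe succeeds again.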

Liveness Probes Prevent Stuck Applications

Applications do not always crash.

Sometimes they stop responding.

Examples include:

Deadlocks
Thread starvation
Dependency hangs
Resource exhaustion

From Kubernetes' perspective:

Container Running

From the customer's perspective:

Application Down

A liveness probe helps detect these situations.

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30

When the probe fails repeatedly (three consecutive failures by default), Kubernetes restarts the container automatically.
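A fixed initialDelaySeconds is a guess about startup time. One alternative, sketched here with illustrative values, is a startupProbe: liveness checks are held back until the startup probe has succeeded once, so a slow start is not mistaken for a hang.

```yaml
# Container-level fragment of a pod spec. Timing values are illustrative.
startupProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 5
  failureThreshold: 24       # allows up to 24 * 5 = 120 seconds to start
livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3        # restart after 3 consecutive failed checks
```

It is worth checking what the /health endpoint actually tests. If it calls external dependencies, a dependency outage can fail the liveness probe on every replica and restart all of them at once.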

Graceful Shutdown Is Often Overlooked

A pod does not disappear instantly.

Kubernetes sends:

SIGTERM

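before terminating the container. If the process is still running when the grace period expires (30 seconds by default), it is killed with SIGKILL.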

Applications should use this period to:

Complete active requests
Flush logs
Release resources
Close connections

Without graceful shutdown:

Customer Request
       ↓
Pod Terminated
       ↓
Request Lost

With graceful shutdown:

Customer Request
       ↓
Request Completes
       ↓
Pod Terminates

This becomes especially important during deployments and node upgrades.
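On the Kubernetes side, two pod settings give graceful shutdown more room. This is a sketch under assumptions: the 45-second grace period and the 10-second preStop delay are illustrative, the container name and image are placeholders, and the preStop command assumes the image ships a sleep binary.

```yaml
# Pod template fragment. Values, names, and image are illustrative.
spec:
  terminationGracePeriodSeconds: 45      # total time allowed before SIGKILL (default is 30)
  containers:
    - name: payment-api
      image: myregistry.azurecr.io/payment-api:1.0.0   # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent. Assumes "sleep" exists in the image.
            command: ["sleep", "10"]
```

The preStop delay gives the Service and ingress time to stop routing new requests to the pod before SIGTERM arrives. The time spent in preStop counts against the grace period, and the application still has to handle SIGTERM itself.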

Pod Disruption Budgets

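A Pod Disruption Budget (PDB) limits how many replicas can be removed at once by voluntary disruptions such as node drains and cluster upgrades. It does not protect against crashes or node failures.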

Example:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api

This tells Kubernetes:

Always Keep
At Least
Two Pods Running

Without a PDB:

Node Drain
      ↓
Multiple Pods Evicted
      ↓
Application Impact

With a PDB:

Node Drain
      ↓
Eviction Controlled
      ↓
Application Remains Available
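A fixed minAvailable only works when the replica count stays above it. With exactly two replicas and minAvailable: 2, no eviction is ever allowed and a node drain will block. A variant that follows the replica count, sketched with the same hypothetical payment-api label, uses maxUnavailable instead:

```yaml
# Alternative PDB sketch: allow at most one pod to be evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  maxUnavailable: 1          # use either maxUnavailable or minAvailable, not both
  selector:
    matchLabels:
      app: payment-api
```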

Avoid Running Everything on One Node

Multiple replicas alone do not guarantee resilience.

Consider:

Node 1
 ├─ Pod A
 ├─ Pod B
 └─ Pod C

Three replicas exist.

Everything looks healthy.

Then:

Node Failure

All replicas disappear simultaneously.

The application is unavailable.

Kubernetes must be told to distribute replicas across nodes.

Use:

topologySpreadConstraints
podAntiAffinity

to prevent workload concentration.
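A minimal sketch of the first option, assuming the pods carry the hypothetical label app: payment-api:

```yaml
# Pod template fragment: spread replicas evenly across nodes.
spec:
  topologySpreadConstraints:
    - maxSkew: 1                           # node pod counts may differ by at most 1
      topologyKey: kubernetes.io/hostname  # each node is its own topology domain
      whenUnsatisfiable: DoNotSchedule     # leave the pod Pending rather than stack it
      labelSelector:
        matchLabels:
          app: payment-api
```

DoNotSchedule is the strict choice. ScheduleAnyway turns the constraint into a preference, which avoids Pending pods on small clusters at the cost of a weaker guarantee.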

Multi-Zone Deployments

For production environments, node-level resilience is often insufficient.

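Availability Zones provide protection against larger failures. On AKS, this requires node pools created with availability zones enabled in a region that supports them.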

Example:

Zone A
Zone B
Zone C

Pods distributed across zones:

Zone A → Pod1
Zone B → Pod2
Zone C → Pod3

If an entire zone becomes unavailable:

Zone A Lost

traffic continues flowing through:

Zone B
Zone C

This significantly improves platform resilience.
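Zone spreading can be expressed with the same mechanism as node spreading. The sketch below assumes the hypothetical label app: payment-api and relies on the standard topology.kubernetes.io/zone node label.

```yaml
# Pod template fragment: spread replicas evenly across availability zones.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone   # each zone is a topology domain
      whenUnsatisfiable: ScheduleAnyway          # prefer balance, but still schedule if a zone is full
      labelSelector:
        matchLabels:
          app: payment-api
```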

Horizontal Pod Autoscaler

Traffic rarely remains constant.

An application supporting:

100 Requests Per Minute

may suddenly receive:

5,000 Requests Per Minute

Without scaling:

High Latency
Timeouts
Failures

The Horizontal Pod Autoscaler can increase replica counts automatically based on observed metrics such as CPU utilization.

Example:

minReplicas: 3
maxReplicas: 10

When demand increases:

3 Pods
  ↓
6 Pods
  ↓
10 Pods

The workload remains responsive.
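In a full manifest, those two fields sit inside a HorizontalPodAutoscaler object. This is a sketch: the Deployment name payment-api and the 70% CPU target are assumptions for illustration.

```yaml
# HPA sketch targeting a hypothetical Deployment named payment-api.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```

Utilization is measured against the containers' CPU requests, so this only works when requests are set and a metrics source such as metrics-server is running.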

Cluster Autoscaler

Sometimes the problem is not pod capacity.

The problem is node capacity.

Consider:

10 Pods Required

but the cluster only has resources for:

6 Pods

The remaining pods stay pending.

Cluster Autoscaler solves this by adding nodes automatically when pods cannot be scheduled, up to the node pool's configured maximum.

Demand Increases
      ↓
Nodes Added
      ↓
Pods Scheduled

This allows the platform to grow with workload demand.

Protecting Against OOMKills

One of the most common causes of unexpected restarts is memory exhaustion.

Example:

Reason: OOMKilled

Applications should define realistic:

resources:
  requests:
  limits:

Monitoring memory consumption is critical.

Blindly increasing limits often masks the root cause, such as a memory leak or an unbounded cache.

A resilient workload understands its resource requirements.
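A filled-in version might look like the sketch below. The numbers are placeholders only. Real values should come from observed usage under load.

```yaml
# Container-level fragment of a pod spec. All values are placeholders.
resources:
  requests:
    cpu: 250m          # used by the scheduler to place the pod
    memory: 256Mi
  limits:
    memory: 512Mi      # exceeding this gets the container OOMKilled
```

The request is what the scheduler reserves on a node. The memory limit is the ceiling at which the container is killed, so a limit set below normal peak usage guarantees restarts.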

Chaos Testing

One of the best ways to evaluate resilience is to introduce controlled failures.

Examples:

Delete a pod:

kubectl delete pod <pod-name>

Drain a node:

kubectl drain <node-name> --ignore-daemonsets

Restart a deployment:

kubectl rollout restart deployment <deployment-name>

The question is simple:

Do customers notice?

If they do, resilience improvements are required.

Monitoring for Resilience

Resilience requires visibility.

Monitor:

Pod restarts
OOMKills
Node failures
Deployment failures
Availability
Latency

Useful metrics include:

Restart Count
Error Rate
Request Duration
Availability Percentage

Problems identified early are far easier to address than production incidents.
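As one possible illustration, restart counts can be turned into an alert. The sketch below is a Prometheus rule file and assumes Prometheus and kube-state-metrics are collecting cluster metrics, which this article has not covered. The thresholds, group name, and alert name are hypothetical.

```yaml
# Prometheus alerting rule sketch. Assumes kube-state-metrics is installed.
groups:
  - name: workload-resilience
    rules:
      - alert: PodRestartingFrequently
        # More than 3 container restarts within 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
```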

Common Mistakes

Mistake 1

Running production workloads with:

replicas: 1

Mistake 2

Missing readiness probes.

Mistake 3

Missing liveness probes.

Mistake 4

No Pod Disruption Budget.

Mistake 5

All replicas scheduled on one node.

Mistake 6

No autoscaling strategy.

Mistake 7

Never testing failure scenarios.

A Simple Resilience Checklist

Before promoting a workload to production, verify:

Multiple replicas exist.
Readiness probes are configured.
Liveness probes are configured.
Resource requests and limits are defined.
Pod Disruption Budgets exist.
Replicas are distributed across nodes.
Autoscaling is configured.
Monitoring is enabled.
Failure scenarios have been tested.

Final Thoughts

Kubernetes provides powerful resilience capabilities, but those capabilities are not automatic.

A deployment with a single replica remains a single point of failure regardless of how advanced the platform appears.

The most resilient environments assume that failures will occur every day.

Pods will restart.

Nodes will fail.

Applications will crash.

Upgrades will happen.

The difference between a resilient platform and an unreliable one is how well the workload continues serving customers when those events occur.

The true measure of resilience is not whether failures happen.

It is whether anyone notices when they do.
