Zero-Downtime AKS Node Patching

Introduction

Patching AKS node VMs sounds routine until you have a hundred of them backing production traffic. This article shares a real-world approach to patching AKS nodes safely, what went wrong, and the Azure-native practices that actually worked.

It started as a “simple” task: security patches were overdue, compliance was asking questions, and we had an AKS cluster backing a critical workload.

Then someone said the number out loud.

“We have just over 100 node VMs in this cluster.”

That’s when the confidence dropped.

If you’ve ever patched a handful of VMs, you know the drill. But patching 100 nodes in an AKS cluster, without breaking workloads, triggering mass pod evictions, or waking up on-call engineers at 2 a.m., is a very different game.

This article walks through how we approached patching at scale on AKS, what worked, what didn’t, and the Azure best practices I wish we had followed from day one.


The Backstory: Why This Matters

AKS abstracts away a lot of infrastructure pain until it doesn’t.

Under the hood, every AKS node is still a VM (or VMSS instance) that:

  • Needs OS security updates
  • Can reboot unexpectedly
  • Hosts multiple critical pods

In our case:

  • Multiple node pools
  • Mixed workloads (stateless + semi-stateful)
  • Strict SLOs
  • A hard compliance deadline

Manual patching was not an option. Blind automation was even worse.


The Core Idea: Let Kubernetes and Azure Do Their Jobs

The biggest mental shift was this:

We are not patching VMs. We are rotating nodes.

Instead of logging into machines or forcing updates, we leaned on:

  • AKS-managed upgrades
  • Node pool rotation
  • Proper pod disruption budgets
  • Controlled draining and surge capacity

If Kubernetes is given enough signals and room, it will protect your workloads.
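
A quick way to ground that shift is to look at what the nodes are actually running before you touch anything. This is a minimal check rather than a prescribed runbook step; the pool name is illustrative, and agentpool is the label AKS puts on its nodes:

# List nodes with their OS image and kernel version to spot pools
# that are lagging behind on patches before starting any rotation.
kubectl get nodes -o wide

# Narrow it down to a single node pool using the AKS agentpool label.
kubectl get nodes -l agentpool=nodepool1 -o wide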


Implementation: How We Patched 100 Nodes Safely

1. Split and Size Node Pools Intentionally

Large, catch-all node pools are fragile during maintenance.

We:

  • Reduced blast radius by splitting workloads across pools
  • Ensured critical workloads had dedicated pools
  • Verified autoscaler limits before touching anything

Rule of thumb: If draining one node hurts, your node pool is too dense.
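
As a rough sketch of what that splitting looked like (the names, counts, and labels here are illustrative, not our real setup), a dedicated pool for critical workloads can be added with the Azure CLI:

# Add a dedicated user-mode node pool for critical workloads.
# The taint and label ensure only pods that explicitly tolerate and
# select this pool get scheduled onto it.
az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name critical \
  --mode User \
  --node-count 3 \
  --labels workload=critical \
  --node-taints "workload=critical:NoSchedule" \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 6

Workloads then opt in with a matching toleration and node selector, which is what keeps the blast radius small when this pool is rotated.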


2. Set Pod Disruption Budgets (Seriously)

This was non-negotiable.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: api

Without PDBs:

  • Drains become chaos
  • Critical pods get evicted together

With PDBs:

  • Kubernetes pushes back
  • Drains slow down instead of breaking things
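
One thing worth doing before any drain: confirm the budgets actually allow disruptions. A PDB that permits zero disruptions will stall the rotation rather than protect it. A minimal check (the namespace is illustrative):

# Show each budget's thresholds and how many voluntary disruptions are
# currently allowed; 0 under ALLOWED DISRUPTIONS means drains will block.
kubectl get pdb -n prod

# Dig into a single budget if the numbers look off.
kubectl describe pdb api-pdb -n prod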

3. Enable Surge Upgrades on Node Pools

This was the unsung hero. By enabling max surge on node pools:

  • New nodes came up before old ones drained
  • Capacity stayed stable
  • Rollouts were predictable

Surge Upgrade Flow (Why This Prevents Outages)

Surge upgrades are so powerful because:

  • Capacity goes up before it goes down
  • Kubernetes has room to breathe
  • PDBs can actually do their job

This was the single biggest factor in keeping production stable. Enabling it is a one-line change per node pool:
az aks nodepool update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --max-surge 20%

Yes, it costs more temporarily. It’s worth it.
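
It is also worth confirming the setting stuck before kicking off an upgrade. A small sanity check, assuming the same resource names as above:

# Confirm the surge setting on the pool before starting the rotation.
az aks nodepool show \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --query upgradeSettings.maxSurge \
  --output tsv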


4. Use AKS Managed Node Image Upgrades

Instead of patching in-place, we:

  • Triggered node image upgrades
  • Let AKS cycle nodes gradually
  • Monitored pod rescheduling in real time

This aligned perfectly with Azure’s support model and saved us from custom scripts.
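
In CLI terms this boils down to two commands, reusing the resource names from the earlier examples; treat it as a sketch rather than our exact pipeline:

# See which node image version the pool is on and what is available.
az aks nodepool get-upgrades \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --nodepool-name nodepool1

# Upgrade only the node image (OS and security patches), not the
# Kubernetes version. AKS surges new nodes, drains the old ones, and
# respects PDBs along the way.
az aks nodepool upgrade \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-image-only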


5. Drain With Observability, Not Hope

Every drain was monitored:

  • Pod restart counts
  • API error rates
  • Queue depths
  • Customer-facing latency

If metrics spiked, we paused.

Automation is useless without a big red stop button.
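
AKS did the actual draining for us, but the same guardrails apply if you ever drain a node by hand. A minimal sketch with a human in the loop; the node name and checks are illustrative:

#!/usr/bin/env bash
set -euo pipefail

NODE="$1"  # e.g. aks-nodepool1-12345678-vmss000042

# Stop new pods landing on the node, then evict the existing ones.
# PodDisruptionBudgets are respected; DaemonSet pods are left alone.
kubectl cordon "$NODE"
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m

# Cheap sanity checks before touching the next node: anything stuck
# Pending, or crash-looping?
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get pods --all-namespaces | grep -i crashloop || true

# The big red stop button: a human checks the dashboards before continuing.
read -rp "Metrics healthy? Enter to continue, Ctrl+C to stop: "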


What Went Wrong (Lessons Learned)

We still made mistakes.

  • One node pool had no PDBs (legacy workload)
  • Autoscaler limits were too tight
  • A stateful pod pretended to be stateless

The result?

  • Longer drain times
  • One near-incident
  • A lot of humility

But nothing went down, and that’s the bar.


Best Practices We’d Follow Again

  • Treat node patching as capacity management, not maintenance
  • Always over-provision before you drain
  • Test node rotation in non-prod regularly
  • Keep node pools smaller and purpose-driven
  • Document rollback paths

Common Pitfalls to Avoid

  • SSHing into AKS nodes to patch manually
  • Running giant node pools “for simplicity”
  • Ignoring PDB warnings
  • Patching during peak traffic
  • Assuming stateless means safe

Community Discussion

I’m curious:

  • How do you handle node patching at scale?
  • Do you rely fully on AKS upgrades or custom pipelines?
  • Any horror stories or success stories?

Drop them in the comments. We all learn from scars.


FAQ

Do I need to patch AKS nodes manually?

No. Azure recommends using managed node image upgrades or node pool rotation.
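
If you would rather not trigger these upgrades yourself at all, AKS can also do it on a schedule. As a sketch (check the current AKS docs for the channels available to you), the node-image auto-upgrade channel keeps node images current automatically; pair it with a planned maintenance window so it never runs at peak:

# Let AKS roll out new node images automatically instead of on demand.
az aks update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --auto-upgrade-channel node-image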

Can this be zero-downtime?

Yes, if your workloads are designed for disruption.

What about stateful workloads?

They need extra care: dedicated pools, stronger PDBs, and slower rollouts.


Final Thoughts

Patching 100 VM nodes isn’t impressive.

Doing it without your users noticing is.

AKS gives you the tools, but only if you respect how Kubernetes wants to work. Give it signals, time, and capacity, and it will repay you with boring, predictable maintenance.

And boring is exactly what production needs.
