Zero-Downtime AKS Node Patching

Introduction

Patching AKS node VMs sounds routine until you have a hundred of them backing production traffic. This article shares a real-world approach to patching AKS nodes safely, what went wrong, and the Azure-native practices that actually worked.

It started as a “simple” task: security patches were overdue, compliance was asking questions, and we had an AKS cluster backing a critical workload.

Then someone said the number out loud.

“We have just over 100 node VMs in this cluster.”

That’s when the confidence dropped.

If you’ve ever patched a handful of VMs, you know the drill. But patching 100 nodes in an AKS cluster, without breaking workloads, triggering mass pod evictions, or waking up on-call engineers at 2 a.m., is a very different game.

This article walks through how we approached patching at scale on AKS, what worked, what didn’t, and the Azure best practices I wish we had followed from day one.


The Backstory: Why This Matters

AKS abstracts away a lot of infrastructure pain until it doesn’t.

Under the hood, every AKS node is still a VM (or VMSS instance) that:

  • Needs OS security updates
  • Can reboot unexpectedly
  • Hosts multiple critical pods

In our case:

  • Multiple node pools
  • Mixed workloads (stateless + semi-stateful)
  • Strict SLOs
  • A hard compliance deadline

Manual patching was not an option. Blind automation was even worse.


The Core Idea: Let Kubernetes and Azure Do Their Jobs

The biggest mental shift was this:

We are not patching VMs. We are rotating nodes.

Instead of logging into machines or forcing updates, we leaned on:

  • AKS-managed upgrades
  • Node pool rotation
  • Proper pod disruption budgets
  • Controlled draining and surge capacity

If Kubernetes is given enough signals and room, it will protect your workloads.
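
A quick way to ground that shift is to look at what the nodes are actually running before you touch anything. This is a minimal check rather than a prescribed runbook step; the pool name is illustrative, and agentpool is the label AKS puts on its nodes:

# List nodes with their OS image and kernel version to spot pools
# that are lagging behind on patches before starting any rotation.
kubectl get nodes -o wide

# Narrow it down to a single node pool using the AKS agentpool label.
kubectl get nodes -l agentpool=nodepool1 -o wide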


Implementation: How We Patched 100 Nodes Safely

1. Split and Size Node Pools Intentionally

Large, catch-all node pools are fragile during maintenance.

We:

  • Reduced blast radius by splitting workloads across pools
  • Ensured critical workloads had dedicated pools
  • Verified autoscaler limits before touching anything

Rule of thumb: If draining one node hurts, your node pool is too dense.
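
As a rough sketch of what that splitting looked like (the names, counts, and labels here are illustrative, not our real setup), a dedicated pool for critical workloads can be added with the Azure CLI:

# Add a dedicated user-mode node pool for critical workloads.
# The taint and label ensure only pods that explicitly tolerate and
# select this pool get scheduled onto it.
az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name critical \
  --mode User \
  --node-count 3 \
  --labels workload=critical \
  --node-taints "workload=critical:NoSchedule" \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 6

Workloads then opt in with a matching toleration and node selector, which is what keeps the blast radius small when this pool is rotated.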


2. Set Pod Disruption Budgets (Seriously)

This was non-negotiable.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: api

Without PDBs:

  • Drains become chaos
  • Critical pods get evicted together

With PDBs:

  • Kubernetes pushes back
  • Drains slow down instead of breaking things
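
One thing worth doing before any drain: confirm the budgets actually allow disruptions. A PDB that permits zero disruptions will stall the rotation rather than protect it. A minimal check (the namespace is illustrative):

# Show each budget's thresholds and how many voluntary disruptions are
# currently allowed; 0 under ALLOWED DISRUPTIONS means drains will block.
kubectl get pdb -n prod

# Dig into a single budget if the numbers look off.
kubectl describe pdb api-pdb -n prod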

3. Enable Surge Upgrades on Node Pools

This was the unsung hero. By enabling max surge on node pools:

  • New nodes came up before old ones drained
  • Capacity stayed stable
  • Rollouts were predictable

Surge Upgrade Flow (Why This Prevents Outages)

Surge upgrades are so powerful because:

  • Capacity goes up before it goes down
  • Kubernetes has room to breathe
  • PDBs can actually do their job

This was the single biggest factor in keeping production stable. Enabling it is a one-line change per node pool:
az aks nodepool update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --max-surge 20%

Yes, it costs more temporarily. It’s worth it.
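
It is also worth confirming the setting stuck before kicking off an upgrade. A small sanity check, assuming the same resource names as above:

# Confirm the surge setting on the pool before starting the rotation.
az aks nodepool show \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --query upgradeSettings.maxSurge \
  --output tsv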


4. Use AKS Managed Node Image Upgrades

Instead of patching in-place, we:

  • Triggered node image upgrades
  • Let AKS cycle nodes gradually
  • Monitored pod rescheduling in real time

This aligned perfectly with Azure’s support model and saved us from custom scripts.
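
In CLI terms this boils down to two commands, reusing the resource names from the earlier examples; treat it as a sketch rather than our exact pipeline:

# See which node image version the pool is on and what is available.
az aks nodepool get-upgrades \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --nodepool-name nodepool1

# Upgrade only the node image (OS and security patches), not the
# Kubernetes version. AKS surges new nodes, drains the old ones, and
# respects PDBs along the way.
az aks nodepool upgrade \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-image-only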


5. Drain With Observability, Not Hope

Every drain was monitored:

  • Pod restart counts
  • API error rates
  • Queue depths
  • Customer-facing latency

If metrics spiked, we paused.

Automation is useless without a big red stop button.
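
AKS did the actual draining for us, but the same guardrails apply if you ever drain a node by hand. A minimal sketch with a human in the loop; the node name and checks are illustrative:

#!/usr/bin/env bash
set -euo pipefail

NODE="$1"  # e.g. aks-nodepool1-12345678-vmss000042

# Stop new pods landing on the node, then evict the existing ones.
# PodDisruptionBudgets are respected; DaemonSet pods are left alone.
kubectl cordon "$NODE"
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m

# Cheap sanity checks before touching the next node: anything stuck
# Pending, or crash-looping?
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get pods --all-namespaces | grep -i crashloop || true

# The big red stop button: a human checks the dashboards before continuing.
read -rp "Metrics healthy? Enter to continue, Ctrl+C to stop: "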


What Went Wrong (Lessons Learned)

We still made mistakes.

  • One node pool had no PDBs (legacy workload)
  • Autoscaler limits were too tight
  • A stateful pod pretended to be stateless

The result?

  • Longer drain times
  • One near-incident
  • A lot of humility

But nothing went down, and that’s the bar.


Best Practices We’d Follow Again

  • Treat node patching as capacity management, not maintenance
  • Always over-provision before you drain
  • Test node rotation in non-prod regularly
  • Keep node pools smaller and purpose-driven
  • Document rollback paths

Common Pitfalls to Avoid

  • SSHing into AKS nodes to patch manually
  • Running giant node pools “for simplicity”
  • Ignoring PDB warnings
  • Patching during peak traffic
  • Assuming stateless means safe

Community Discussion

I’m curious:

  • How do you handle node patching at scale?
  • Do you rely fully on AKS upgrades or custom pipelines?
  • Any horror stories or success stories?

Drop them in the comments. We all learn from scars.


FAQ

Do I need to patch AKS nodes manually?

No. Azure recommends using managed node image upgrades or node pool rotation.
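
If you would rather not trigger these upgrades yourself at all, AKS can also do it on a schedule. As a sketch (check the current AKS docs for the channels available to you), the node-image auto-upgrade channel keeps node images current automatically; pair it with a planned maintenance window so it never runs at peak:

# Let AKS roll out new node images automatically instead of on demand.
az aks update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --auto-upgrade-channel node-image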

Can this be zero-downtime?

Yes, if your workloads are designed for disruption.

What about stateful workloads?

They need extra care: dedicated pools, stronger PDBs, and slower rollouts.


Final Thoughts

Patching 100 VM nodes isn’t impressive.

Doing it without your users noticing is.

AKS gives you the tools, but only if you respect how Kubernetes wants to work. Give it signals, time, and capacity, and it will repay you with boring, predictable maintenance.

And boring is exactly what production needs.
