Introduction

Patching AKS node VMs sounds routine until you have a hundred of them backing production traffic. This article shares a real-world approach to patching AKS nodes safely, what went wrong, and the Azure-native practices that actually worked.
It started as a “simple” task: security patches were overdue, compliance was asking questions, and we had an AKS cluster backing a critical workload.
Then someone said the number out loud.
“We have just over 100 node VMs in this cluster.”
That’s when the confidence dropped.
If you’ve ever patched a handful of VMs, you know the drill. But patching 100 nodes in an AKS cluster, without breaking workloads, triggering mass pod evictions, or waking up on-call engineers at 2 a.m., is a very different game.
This article walks through how we approached patching at scale on AKS, what worked, what didn’t, and the Azure best practices I wish we had followed from day one.
The Backstory: Why This Matters
AKS abstracts away a lot of infrastructure pain, until it doesn’t.
Under the hood, every AKS node is still a VM (or VMSS instance) that:
- Needs OS security updates
- Can reboot unexpectedly
- Hosts multiple critical pods
In our case:
- Multiple node pools
- Mixed workloads (stateless + semi-stateful)
- Strict SLOs
- A hard compliance deadline
Manual patching was not an option. Blind automation was even worse.
The Core Idea: Let Kubernetes and Azure Do Their Jobs
The biggest mental shift was this:
We are not patching VMs. We are rotating nodes.
Instead of logging into machines or forcing updates, we leaned on:
- AKS-managed upgrades
- Node pool rotation
- Proper pod disruption budgets
- Controlled draining and surge capacity
If Kubernetes is given enough signals and room, it will protect your workloads.
Implementation: How We Patched 100 Nodes Safely
1. Split and Size Node Pools Intentionally
Large, monolithic node pools are fragile during maintenance.
We:
- Reduced blast radius by splitting workloads across pools
- Ensured critical workloads had dedicated pools
- Verified autoscaler limits before touching anything
Rule of thumb: If draining one node hurts, your node pool is too dense.
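For illustration only (the pool name, sizes, and labels here are hypothetical, not our actual values), carving out a dedicated pool for critical workloads with sane autoscaler bounds looks roughly like this:

# Hypothetical example: a dedicated pool for critical workloads,
# tainted so only pods that tolerate it land there, with autoscaler headroom.
az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name critical \
  --node-count 3 \
  --labels workload=critical \
  --node-taints workload=critical:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 6

Pair this with matching tolerations and node selectors on the critical workloads so they actually stay on their dedicated pool.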
2. Set Pod Disruption Budgets (Seriously)
This was non-negotiable.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: api
Without PDBs:
- Drains become chaos
- Critical pods get evicted together
With PDBs:
- Kubernetes pushes back
- Drains slow down instead of breaking things
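A quick sanity check before any drain (a minimal sketch, assuming the PDB above is deployed in the default namespace):

# ALLOWED DISRUPTIONS tells you how many api pods a drain may evict right now
kubectl get pdb api-pdb
# Watch the pods reschedule across nodes while maintenance runs
kubectl get pods -l app=api -o wide --watch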
3. Enable Surge Upgrades on Node Pools
Surge Upgrade Flow (Why This Prevents Outages)

This is why surge upgrades are so powerful:
- Capacity goes up before it goes down
- Kubernetes has room to breathe
- PDBs can actually do their job
This was the single biggest factor in keeping production stable, and the unsung hero of the whole exercise.
By enabling max surge on node pools:
- New nodes came up before old ones drained
- Capacity stayed stable
- Rollouts were predictable
az aks nodepool update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --max-surge 20%
Yes, it costs more temporarily. It’s worth it.
4. Use AKS Managed Node Image Upgrades
Instead of patching in-place, we:
- Triggered node image upgrades
- Let AKS cycle nodes gradually
- Monitored pod rescheduling in real time
This aligned perfectly with Azure’s support model and saved us from custom scripts.
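For reference, checking and triggering a node image upgrade on a single pool is a couple of commands (cluster and pool names follow the earlier example; adjust to your environment):

# See which node image version the pool is currently running
az aks nodepool show \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --query nodeImageVersion
# Roll the pool to the latest node image; AKS cordons, drains, and
# replaces nodes one surge batch at a time
az aks nodepool upgrade \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-image-only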
5. Drain With Observability, Not Hope
Every drain was monitored:
- Pod restart counts
- API error rates
- Queue depths
- Customer-facing latency
If metrics spiked, we paused.
Automation is useless without a big red stop button.
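When you do want a hand on the brake, a manual cordon/drain is the stop button; a minimal sketch (the node name is a placeholder):

# Stop new pods landing on the node
kubectl cordon aks-nodepool1-12345678-vmss000000
# Drain with a timeout so a blocked PDB surfaces as an error instead of hanging forever
kubectl drain aks-nodepool1-12345678-vmss000000 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m
# If metrics spike, pause here and let pods schedule back
kubectl uncordon aks-nodepool1-12345678-vmss000000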
What Went Wrong (Lessons Learned)
We still made mistakes.
- One node pool had no PDBs (legacy workload)
- Autoscaler limits were too tight
- A stateful pod pretended to be stateless
The result?
- Longer drain times
- One near-incident
- A lot of humility
But nothing went down, and that’s the bar.
Best Practices We’d Follow Again
- Treat node patching as capacity management, not maintenance
- Always over-provision before you drain
- Test node rotation in non-prod regularly
- Keep node pools smaller and purpose-driven
- Document rollback paths
Common Pitfalls to Avoid
- SSHing into AKS nodes to patch manually
- Running giant node pools “for simplicity”
- Ignoring PDB warnings
- Patching during peak traffic
- Assuming stateless means safe
Community Discussion
I’m curious:
- How do you handle node patching at scale?
- Do you rely fully on AKS upgrades or custom pipelines?
- Any horror stories or success stories?
Drop them in the comments. We all learn from scars.
FAQ
Do I need to patch AKS nodes manually?
No. Azure recommends using managed node image upgrades or node pool rotation.
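If you want Azure to schedule node image rollouts for you, an auto-upgrade channel is one option (a sketch; whether this fits depends on your change-control and compliance requirements):

# Let AKS roll new node images automatically as they are released
az aks update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --auto-upgrade-channel node-image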
Can this be zero-downtime?
Yes, if your workloads are designed for disruption.
What about stateful workloads?
They need extra care: dedicated pools, stronger PDBs, and slower rollouts.
Final Thoughts
Patching 100 VM nodes isn’t impressive.
Doing it without your users noticing is.
AKS gives you the tools but only if you respect how Kubernetes wants to work. Give it signals, time, and capacity, and it will repay you with boring, predictable maintenance.
And boring is exactly what production needs.