DEV Community

Samson Tanimawo


Kubernetes Upgrades Without Downtime

Kubernetes upgrades used to terrify me. Then I learned to do them boringly. Here's the process.

Before the upgrade

1. Read the release notes for every minor version you're crossing. Check the 'removed APIs' and 'breaking changes' sections. A lot of upgrade outages trace back to someone skipping this.

2. Test in a non-prod cluster first. An identical cluster, not a toy one. Run your actual workloads. Let them bake for a week.

3. Deprecated API audit. Use pluto or kubent to find any manifests using deprecated APIs. Fix them before the upgrade, not during.

4. PodDisruptionBudgets for critical services. If you don't have PDBs, your critical services can all be drained at once during a node upgrade. PDBs prevent that.
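To make step 4 concrete, here is a minimal sketch of the arithmetic behind a PDB. It assumes an integer minAvailable (percentages are left out), and the function name disruptions_allowed is made up for illustration; it is not a Kubernetes API.

```python
from typing import Optional


def disruptions_allowed(healthy_pods: int, min_available: Optional[int]) -> int:
    """Pods that may be evicted right now without breaking the budget.

    min_available=None models a service with no PDB at all.
    """
    if min_available is None:
        # No budget: nothing stops a drain from evicting every pod at once.
        return healthy_pods
    # With a budget, only the surplus above min_available can go.
    return max(0, healthy_pods - min_available)


print(disruptions_allowed(3, None))  # 3
print(disruptions_allowed(3, 2))     # 1
print(disruptions_allowed(3, 3))     # 0
```

The last case is worth noticing: a budget that demands every replica stay up allows zero evictions, so a drain of that node will wait forever.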

During the upgrade

1. Control plane first, one component at a time. Never upgrade all control plane components simultaneously. Nodes come after, because a kubelet must never run a newer version than the API server.

2. Nodes in small batches. 10% at a time is a good default. Wait for health checks between batches.

3. Watch your PDBs. If a node drain is stuck, it's usually a PDB blocking the eviction. Don't force the drain. Fix the PDB first.

4. Monitor the specific metrics that matter. Control plane latency, API server error rate, pending pod count, node NotReady count.
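The batching rule from step 2 can be sketched roughly like this. The upgrade and healthy callbacks are hypothetical placeholders: in a real cluster they would wrap whatever tooling upgrades a node and whatever checks the metrics from step 4. Treat it as an outline of the control flow, not a finished tool.

```python
from typing import Callable, Iterator, List


def plan_batches(nodes: List[str], percent: int = 10) -> Iterator[List[str]]:
    """Yield consecutive batches of about `percent` of the nodes (at least one)."""
    # Integer ceiling division avoids floating point surprises.
    size = max(1, (len(nodes) * percent + 99) // 100)
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


def rolling_upgrade(
    nodes: List[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[List[str]], bool],
    percent: int = 10,
) -> List[str]:
    """Upgrade batch by batch and stop at the first unhealthy batch.

    Returns the nodes that were touched, so the caller knows the blast radius.
    """
    touched: List[str] = []
    for batch in plan_batches(nodes, percent):
        for node in batch:
            upgrade(node)
        touched.extend(batch)
        if not healthy(batch):
            break  # do not start the next batch
    return touched


nodes = [f"node-{i}" for i in range(25)]
print([len(b) for b in plan_batches(nodes)])  # [3, 3, 3, 3, 3, 3, 3, 3, 1]
```

Stopping instead of retrying is deliberate here: a failed health check is a signal for a human to look, not for the loop to push on.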

When it goes wrong

Rollback in Kubernetes is hard. Harder than people realize. The best rollback is 'don't upgrade more until the current batch is confirmed healthy.'

If you have to roll back a node upgrade, drain the new node, delete it, and let the autoscaler replace it with a node on the old version. Don't try to downgrade in place.
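As a rough sketch, the drain-and-delete part looks like the two kubectl calls below. The helper name rollback_commands and the node name worker-7 are made up for illustration. The script only prints the commands; it assumes your node group or autoscaler is still configured for the old version, and that removing the underlying machine is handled by your cloud tooling, since deleting the Node object alone may not do that.

```python
import shlex
from typing import List


def rollback_commands(node: str) -> List[List[str]]:
    """Build the kubectl calls that retire one upgraded node."""
    return [
        # Drain evicts pods through the eviction API, so PDBs still apply.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Remove the now-empty Node object from the cluster.
        ["kubectl", "delete", "node", node],
    ]


# Print instead of executing, so the plan can be reviewed first.
for argv in rollback_commands("worker-7"):
    print(shlex.join(argv))
```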

The schedule

My rule: Kubernetes upgrades happen in the first week of the month, Tuesday-Thursday, during business hours. Never Friday. Never the last week of a quarter.

Upgrades during off-hours feel safer, but they're actually riskier: fewer people are around to help if something breaks.
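If you want the rule enforced rather than remembered, a small guard like this could sit in front of the upgrade pipeline. The cutoffs are my reading of the rule and are assumptions: "first week" means days 1 to 7, and "business hours" means 09:00 to 17:00 local time.

```python
from datetime import datetime


def in_upgrade_window(when: datetime) -> bool:
    """True if `when` falls inside the upgrade schedule described above."""
    first_week = when.day <= 7                # assumed: days 1-7 of the month
    tue_to_thu = when.weekday() in (1, 2, 3)  # Monday is 0, so this excludes Friday
    business_hours = 9 <= when.hour < 17      # assumed: 09:00-17:00 local time
    # Days 1-7 can never land in the last week of a quarter,
    # so that rule needs no separate check.
    return first_week and tue_to_thu and business_hours


print(in_upgrade_window(datetime(2024, 6, 4, 10, 0)))   # True  (Tuesday morning)
print(in_upgrade_window(datetime(2024, 6, 7, 10, 0)))   # False (Friday)
print(in_upgrade_window(datetime(2024, 6, 4, 22, 0)))   # False (off-hours)
```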

The bigger lesson

Kubernetes upgrades are a test of your reliability discipline. Teams that can't upgrade smoothly have deeper problems: no test environments, no PDBs, no runbooks. The upgrade is just where those gaps get exposed.

Fix the discipline first. The upgrade gets boring after that.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
