Paulo Victor Leite Lima Gomes

Posted on Jul 4

kubernetes upgrades needed rollback more than another checklist

#kubernetes #aws #eks #platformengineering

AWS announced EKS Version Rollback this week, and I think it is one of those features that will sound boring to anyone who has never owned a Kubernetes upgrade calendar.

It is not boring.

For years, the practical answer to "what if the cluster upgrade goes wrong?" has been a mixture of planning, nervousness, staging environments, compatibility checks, careful node rollouts, and someone saying "we can always rebuild" in a tone that did not make anyone feel better.

Kubernetes is excellent at making infrastructure programmable. It has been much less comforting when the thing you just programmed is the control plane itself.

So yes, EKS rollback is a product feature. You can roll a cluster back by one minor version within a limited window. EKS runs readiness checks. Auto Mode can also handle worker node rollback as part of the path.

But the interesting part is not the button.

The interesting part is Kubernetes upgrade risk becoming an explicit platform contract.

upgrades are where abstraction gets honest

The happy Kubernetes story is that the control plane gives us a stable API for running software. We declare intent. Controllers reconcile. Workloads move around. The platform absorbs a huge amount of operational detail.

Then upgrade season arrives.

Suddenly the abstraction gets very honest. API removals matter. Add-on versions matter. Kubelet skew matters. Admission controllers matter. Custom resources matter. Webhooks matter. That ancient Helm chart nobody wants to touch matters again. The cluster is not just "Kubernetes." It is a living pile of workloads, policies, controllers, node images, network plugins, and organizational habits.

This is why cluster upgrades become such an emotional process inside serious teams. The upgrade itself might be a managed service operation, but the confidence around it is not managed for you.

You still need to know whether your workloads are compatible. You still need a bake period. You still need a plan for what happens when something strange appears after the green checkmark.

For a long time, that plan often meant "fix forward unless the situation is truly terrible."

That is not a plan. That is optimism with an incident channel.

rollback changes the shape of the conversation

Rollback does not make upgrades easy. It does something more useful: it gives the platform a real recovery primitive.

That changes the conversation before the upgrade even starts.

Instead of asking only "did we pass the upgrade checklist?", the team can ask:

Are we still inside the rollback window?
Which checks would block rollback?
Which workloads started using new APIs after the upgrade?
Did we separate control-plane and data-plane changes enough to preserve options?
What evidence do we need before moving to the next stage?

Those are better questions.

They turn rollback from an emergency fantasy into part of the upgrade design. The pipeline can know about it. The runbook can know about it. The change approval can know about it. The audit trail can show that the team had a tested path back, not just a screenshot of a staging cluster that looked fine yesterday.

This is the boring stuff that makes production systems survivable.

the bake period becomes a first-class object

The AWS post includes a Salesforce section that I liked because it says the quiet part out loud: to use rollback well, you may need to restructure the upgrade sequence.

If you immediately upgrade the control plane, add-ons, and data plane in one continuous push, you reduce your ability to learn from the control-plane upgrade before the nodes move. If something breaks later, you have a messier question: is this the API server, the kubelet, an add-on, a workload assumption, or the combination?

A rollback feature rewards a staged process.

Upgrade the managed add-ons in a compatible way. Upgrade the control plane. Wait. Watch the real system. Let the API server meet your weird controllers, your quiet batch jobs, your old admission policies, your noisy tenants, and your one service that still behaves like it was written during a different geological period of cloud computing.

Then upgrade the data plane.

That bake period is not wasted time. It is the observation window that makes rollback useful.

This is a pattern I wish more platform tools made obvious. The operational primitive and the workflow need to fit each other. A rollback API without a staged rollout process is just a fire extinguisher behind furniture.

checks are not bureaucracy

EKS rollback readiness checks look like the kind of thing people are tempted to dismiss as another vendor checklist: API compatibility, field changes, cluster health, kubelet skew, kube-proxy compatibility, add-on compatibility, disruption budgets, and related blockers.

But these checks are the feature.

A rollback path that ignores compatibility is not a safety net. It is a different way to break production.

This is especially true for Kubernetes because the control plane is not an isolated binary you downgrade in a vacuum. It is the coordination point for a moving system. Workloads may create resources that did not exist before. Controllers may start depending on new behavior. Nodes may already be at the new kubelet version. Add-ons may have advanced. A "go back" operation has to ask whether the old version can still understand the world the new version created.

That is why I like readiness insights as a concept. They move the conversation from "can we click rollback?" to "what would make rollback unsafe?"

That is much closer to how real operations work.

rollback is governance, not just resilience

There is also a compliance angle here, and I do not mean that in the boring checkbox way.

Many organizations need to prove that critical platform changes have a recovery path. For Kubernetes upgrades, that proof has often been awkward. You could show staging tests. You could show backups. You could show blue/green cluster strategies if you were wealthy enough, disciplined enough, or forced enough to maintain them. You could show a long runbook that mostly said "do not get here."

A managed rollback feature gives the platform team a clearer artifact.

Not a guarantee. A rollback can still be blocked. It has limits. It is one minor version. It has a window. Some node types and custom pieces remain the customer's responsibility. New API usage can close doors behind you.

But it is still an artifact the organization can design around.

That matters because platform governance is not only about saying no. Good governance creates paths that let teams move faster without hiding the risk. If rollback makes teams more willing to stay on supported Kubernetes versions instead of delaying upgrades for months, that is not just an operational win. It is a security win.

Old clusters are not free. They accumulate CVEs, unsupported behavior, add-on friction, and eventually extended support costs. The reason many teams delay upgrades is not laziness. It is fear of losing the weekend to a control-plane mystery.

Reduce the fear, and you may improve the patch posture.

managed services are growing undo buttons

The larger trend is that managed platforms are slowly admitting that operations need undo, not just automation.

For a long time, cloud services were very good at creating and updating things. They were less good at helping you retreat from a bad change in a principled way. The implicit message was: use infrastructure as code, test carefully, design your own fallback, and please enjoy this API.

That is fine for simple resources. It is not enough for complex control planes.

Kubernetes upgrades sit in the uncomfortable middle. They are routine enough that every serious organization must do them, but risky enough that each fleet develops its own folklore. The scripts, the sequencing, the "never upgrade that environment on a Friday," the person who knows which add-on version lied last time.

Rollback does not remove the folklore overnight.

It gives the platform a better place to put it.

The upgrade pipeline can surface rollback readiness before promotion. The internal developer platform can show which clusters are inside the rollback window. The SRE runbook can distinguish "fix forward" from "rollback is still viable." The audit system can record the checks and the decision.

This is how a feature becomes a control plane.

do not confuse rollback with carelessness

There is a bad version of this story where teams hear "rollback exists" and upgrade with less discipline.

That would be a mistake.

Rollback is not a permission slip to skip compatibility work. It is a reason to make compatibility work more explicit. The better upgrade process still has staging, version skew planning, add-on validation, workload testing, communication, dashboards, and a calm human who knows when to stop.

The difference is that stopping no longer means staring at the cluster and hoping the next fix-forward patch is obvious.

Sometimes the best incident response is to reduce novelty. If the new control plane created a strange failure mode and the rollback checks are clean, going back to the known version may be the most responsible move.

That is not failure.

That is operations.

the punchline

EKS Version Rollback is useful because Kubernetes upgrades have always needed something more concrete than another checklist.

The feature will not save teams from bad process, unmanaged add-ons, careless API usage, or magical thinking. It has constraints, and those constraints matter.

But it gives platform teams a recovery primitive they can build around. It makes the rollback window visible. It makes blockers inspectable. It encourages staged upgrades. It turns "what if this goes wrong?" into a question the system can partially answer instead of a late-night debate.

That is the direction I want more infrastructure to move.

Not just faster changes.

Not just safer defaults.

Recoverable operations with enough evidence that humans can make better calls under pressure.

Kubernetes made a lot of infrastructure declarative.

Now the upgrade path is starting to get the same treatment: not perfect, not magic, but a little less dependent on bravery.

And honestly, bravery was never a great cluster management strategy.

references

To test my projects, I use Railway. If you want $20 USD to get started, use this link.

DEV Community