<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: infantus godfrey</title>
    <description>The latest articles on DEV Community by infantus godfrey (@infantusgodfrey).</description>
    <link>https://dev.to/infantusgodfrey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3592561%2Fcc0ca924-afd3-43ff-8045-40fa830b2032.JPG</url>
      <title>DEV Community: infantus godfrey</title>
      <link>https://dev.to/infantusgodfrey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/infantusgodfrey"/>
    <language>en</language>
    <item>
      <title>Zero-Downtime AKS Node Patching</title>
      <dc:creator>infantus godfrey</dc:creator>
      <pubDate>Sun, 04 Jan 2026 19:28:00 +0000</pubDate>
      <link>https://dev.to/careerbytecode/zero-downtime-aks-node-patching-3j45</link>
      <guid>https://dev.to/careerbytecode/zero-downtime-aks-node-patching-3j45</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9083app510k9uvqfne0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9083app510k9uvqfne0g.png" alt="node-patch" width="800" height="350"&gt;&lt;/a&gt;&lt;br&gt;
Patching AKS node VMs sounds routine until you have a hundred of them backing production traffic. This article shares a real-world approach to patching AKS nodes safely, what went wrong, and the Azure-native practices that actually worked.&lt;br&gt;
It started as a “simple” task: security patches were overdue, compliance was asking questions, and we had an AKS cluster backing a critical workload.&lt;/p&gt;

&lt;p&gt;Then someone said the number out loud.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We have just over &lt;strong&gt;100 node VMs&lt;/strong&gt; in this cluster.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s when the confidence dropped.&lt;/p&gt;

&lt;p&gt;If you’ve ever patched a handful of VMs, you know the drill. But patching &lt;strong&gt;100 nodes in an AKS cluster&lt;/strong&gt;, without breaking workloads, triggering mass pod evictions, or waking up on-call engineers at 2 a.m., is a very different game.&lt;/p&gt;

&lt;p&gt;This article walks through how we approached patching at scale on AKS, what worked, what didn’t, and the Azure best practices I wish we had followed from day one.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Backstory: Why This Matters
&lt;/h2&gt;

&lt;p&gt;AKS abstracts away a lot of infrastructure pain, until it doesn’t.&lt;/p&gt;

&lt;p&gt;Under the hood, every AKS node is still a &lt;strong&gt;VM (or VMSS instance)&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needs OS security updates&lt;/li&gt;
&lt;li&gt;Can reboot unexpectedly&lt;/li&gt;
&lt;li&gt;Hosts multiple critical pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple node pools&lt;/li&gt;
&lt;li&gt;Mixed workloads (stateless + semi-stateful)&lt;/li&gt;
&lt;li&gt;Strict SLOs&lt;/li&gt;
&lt;li&gt;A hard compliance deadline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual patching was not an option. Blind automation was even worse.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Idea: Let Kubernetes and Azure Do Their Jobs
&lt;/h2&gt;

&lt;p&gt;The biggest mental shift was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We are not patching VMs. We are rotating nodes.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of logging into machines or forcing updates, we leaned on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AKS-managed upgrades&lt;/li&gt;
&lt;li&gt;Node pool rotation&lt;/li&gt;
&lt;li&gt;Proper pod disruption budgets&lt;/li&gt;
&lt;li&gt;Controlled draining and surge capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Kubernetes is given enough signals and room, it will protect your workloads.&lt;/p&gt;


&lt;h2&gt;
  
  
  Implementation: How We Patched 100 Nodes Safely
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Split and Size Node Pools Intentionally
&lt;/h3&gt;

&lt;p&gt;Large, single node pools are fragile during maintenance.&lt;/p&gt;

&lt;p&gt;We:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced blast radius by splitting workloads across pools&lt;/li&gt;
&lt;li&gt;Ensured critical workloads had dedicated pools&lt;/li&gt;
&lt;li&gt;Verified autoscaler limits before touching anything&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; If draining one node hurts, your node pool is too dense.&lt;/p&gt;
&lt;/blockquote&gt;
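&lt;p&gt;As a rough sketch (the pool, cluster, and resource-group names here are illustrative, not from our environment), carving out a dedicated pool for critical workloads looks like this:&lt;/p&gt;

```shell
# Create a dedicated node pool for critical workloads.
# The taint keeps general-purpose pods off the pool; the label lets
# critical deployments target it via nodeSelector or affinity.
az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name critical \
  --node-count 3 \
  --node-taints workload=critical:NoSchedule \
  --labels workload=critical \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10
```

&lt;p&gt;Critical pods then need a matching toleration and node selector to land on the pool.&lt;/p&gt;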


&lt;h3&gt;
  
  
  2. Set Pod Disruption Budgets (Seriously)
&lt;/h3&gt;

&lt;p&gt;This was non-negotiable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;80%&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without PDBs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drains become chaos&lt;/li&gt;
&lt;li&gt;Critical pods get evicted together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With PDBs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes pushes back&lt;/li&gt;
&lt;li&gt;Drains slow down instead of breaking things&lt;/li&gt;
&lt;/ul&gt;
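&lt;p&gt;A quick way to see that pushback before and during a drain (a minimal check, assuming kubectl access; the PDB name matches the example above):&lt;/p&gt;

```shell
# ALLOWED DISRUPTIONS shows how many pods can be evicted right now
# without violating each budget.
kubectl get pdb --all-namespaces

# For a single budget: 0 means a drain will wait, not break things.
kubectl get pdb api-pdb -o jsonpath='{.status.disruptionsAllowed}'
```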




&lt;h3&gt;
  
  
  3. Enable Surge Upgrades on Node Pools
&lt;/h3&gt;

&lt;p&gt;Surge Upgrade Flow (Why This Prevents Outages)&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjz1270kole9gzqtmh4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjz1270kole9gzqtmh4n.png" alt="surge node" width="800" height="1017"&gt;&lt;/a&gt;&lt;br&gt;
This is why surge upgrades are so powerful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity goes up before it goes down&lt;/li&gt;
&lt;li&gt;Kubernetes has room to breathe&lt;/li&gt;
&lt;li&gt;PDBs can actually do their job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was the single biggest factor in keeping production stable.&lt;/p&gt;

&lt;p&gt;By enabling &lt;strong&gt;max surge&lt;/strong&gt; on node pools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New nodes came up before old ones drained&lt;/li&gt;
&lt;li&gt;Capacity stayed stable&lt;/li&gt;
&lt;li&gt;Rollouts were predictable
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; rg-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; aks-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; nodepool1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-surge&lt;/span&gt; 20%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it costs more temporarily. It’s worth it.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Use AKS Managed Node Image Upgrades
&lt;/h3&gt;

&lt;p&gt;Instead of patching in-place, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggered node image upgrades&lt;/li&gt;
&lt;li&gt;Let AKS cycle nodes gradually&lt;/li&gt;
&lt;li&gt;Monitored pod rescheduling in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aligned perfectly with Azure’s support model and saved us from custom scripts.&lt;/p&gt;
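&lt;p&gt;The trigger itself is one CLI call per pool (resource names are illustrative); a sketch:&lt;/p&gt;

```shell
# See the node image version available vs. what the pool runs today
az aks nodepool get-upgrades \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --nodepool-name nodepool1

# Roll the pool to the latest node image without changing the
# Kubernetes version; AKS cordons, drains, and replaces nodes,
# honouring the pool's max-surge setting.
az aks nodepool upgrade \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-image-only
```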




&lt;h3&gt;
  
  
  5. Drain With Observability, Not Hope
&lt;/h3&gt;

&lt;p&gt;Every drain was monitored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod restart counts&lt;/li&gt;
&lt;li&gt;API error rates&lt;/li&gt;
&lt;li&gt;Queue depths&lt;/li&gt;
&lt;li&gt;Customer-facing latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If metrics spiked, we paused.&lt;/p&gt;

&lt;p&gt;Automation is useless without a big red stop button.&lt;/p&gt;
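&lt;p&gt;For the manual steps around a rollout (pausing, resuming, backing out of a single node), the commands look roughly like this; the node name is illustrative:&lt;/p&gt;

```shell
# Stop new pods landing on the node first
kubectl cordon aks-nodepool1-12345678-vmss000003

# Drain with a timeout so a blocked PDB pauses the rollout instead of
# hanging forever; --ignore-daemonsets is needed because DaemonSet pods
# are not evicted, and --delete-emptydir-data acknowledges scratch-data loss.
kubectl drain aks-nodepool1-12345678-vmss000003 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m

# The big red stop button: make the node schedulable again
kubectl uncordon aks-nodepool1-12345678-vmss000003
```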




&lt;h2&gt;
  
  
  What Went Wrong (Lessons Learned)
&lt;/h2&gt;

&lt;p&gt;We still made mistakes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One node pool had &lt;strong&gt;no PDBs&lt;/strong&gt; (legacy workload)&lt;/li&gt;
&lt;li&gt;Autoscaler limits were too tight&lt;/li&gt;
&lt;li&gt;A stateful pod pretended to be stateless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Longer drain times&lt;/li&gt;
&lt;li&gt;One near-incident&lt;/li&gt;
&lt;li&gt;A lot of humility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But nothing went down, and that’s the bar.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices We’d Follow Again
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Treat node patching as &lt;strong&gt;capacity management&lt;/strong&gt;, not maintenance&lt;/li&gt;
&lt;li&gt;Always over-provision before you drain&lt;/li&gt;
&lt;li&gt;Test node rotation in non-prod regularly&lt;/li&gt;
&lt;li&gt;Keep node pools smaller and purpose-driven&lt;/li&gt;
&lt;li&gt;Document rollback paths&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SSHing into AKS nodes to patch manually&lt;/li&gt;
&lt;li&gt;Running giant node pools “for simplicity”&lt;/li&gt;
&lt;li&gt;Ignoring PDB warnings&lt;/li&gt;
&lt;li&gt;Patching during peak traffic&lt;/li&gt;
&lt;li&gt;Assuming stateless means safe&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Community Discussion
&lt;/h2&gt;

&lt;p&gt;I’m curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you handle node patching at scale?&lt;/li&gt;
&lt;li&gt;Do you rely fully on AKS upgrades or custom pipelines?&lt;/li&gt;
&lt;li&gt;Any horror stories or success stories?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop them in the comments. We all learn from scars.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need to patch AKS nodes manually?
&lt;/h3&gt;

&lt;p&gt;No. Azure recommends using managed node image upgrades or node pool rotation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can this be zero-downtime?
&lt;/h3&gt;

&lt;p&gt;Yes, if your workloads are designed for disruption.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about stateful workloads?
&lt;/h3&gt;

&lt;p&gt;They need extra care: dedicated pools, stronger PDBs, and slower rollouts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Patching 100 VM nodes isn’t impressive.&lt;/p&gt;

&lt;p&gt;Doing it &lt;strong&gt;without your users noticing&lt;/strong&gt; is.&lt;/p&gt;

&lt;p&gt;AKS gives you the tools, but only if you respect how Kubernetes wants to work. Give it signals, time, and capacity, and it will repay you with boring, predictable maintenance.&lt;/p&gt;

&lt;p&gt;And boring is exactly what production needs.&lt;/p&gt;

</description>
      <category>aks</category>
      <category>kubernetes</category>
      <category>azure</category>
      <category>linux</category>
    </item>
    <item>
      <title>Introduction to the emma Cloud Management Platform</title>
      <dc:creator>infantus godfrey</dc:creator>
      <pubDate>Tue, 11 Nov 2025 01:58:22 +0000</pubDate>
      <link>https://dev.to/careerbytecode/introduction-to-emma-cloud-unlock-multi-cloud-operations-2m77</link>
      <guid>https://dev.to/careerbytecode/introduction-to-emma-cloud-unlock-multi-cloud-operations-2m77</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;What is emma?&lt;/li&gt;
&lt;li&gt;Key Features &amp;amp; Capabilities&lt;/li&gt;
&lt;li&gt;Use Case: Multi-Cloud Kubernetes Provisioning&lt;/li&gt;
&lt;li&gt;Use Case: Cloud Outage Resilience&lt;/li&gt;
&lt;li&gt;Use Case: FinOps &amp;amp; Cost Governance&lt;/li&gt;
&lt;li&gt;Integrations &amp;amp; Tooling&lt;/li&gt;
&lt;li&gt;Best Practices &amp;amp; Tips&lt;/li&gt;
&lt;li&gt;Common Questions &amp;amp; Answers&lt;/li&gt;
&lt;li&gt;Conclusion &amp;amp; Call to Action&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As organizations increasingly adopt multi-cloud and hybrid-cloud strategies, the roles of DevOps, Platform, and Cloud Engineers have grown more complex. You’re not just provisioning VMs or Kubernetes clusters; you’re managing cost, governance, security, sovereignty, and performance across disparate environments. Enter &lt;strong&gt;emma&lt;/strong&gt;, a cloud-agnostic cloud management platform designed to simplify and centralize those tasks. In this article I’ll take you through what the emma cloud management platform does, how you can use it in real-world scenarios, code snippets you can adapt, and best practices I’ve picked up.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is emma?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.emma.ms/" rel="noopener noreferrer"&gt;emma cloud management platform&lt;/a&gt; is a unified cloud-management platform that supports hybrid, multi-cloud and even sovereign-cloud environments. Some key differentiators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It provides a &lt;strong&gt;single pane of glass&lt;/strong&gt; across public clouds (AWS, GCP, Azure) and private/on-prem or hybrid setups.&lt;/li&gt;
&lt;li&gt;It supports self-service provisioning, policy-based automation, and governance guardrails so engineering teams can spin up resources without sacrificing control.&lt;/li&gt;
&lt;li&gt;It embeds FinOps capabilities (cost reporting, waste detection, rightsizing) alongside performance, governance and data-sovereignty features.&lt;/li&gt;
&lt;li&gt;It enables hybrid / multi-cloud backup and disaster recovery across providers.&lt;/li&gt;
&lt;li&gt;It promotes vendor-independence and cloud-agnostic operations (avoiding lock-in).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;emma’s proposition is aimed at teams that want both agility (developer teams get self-service) and control (platform or central cloud ops still define guardrails).&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Features &amp;amp; Capabilities
&lt;/h2&gt;

&lt;p&gt;Here is a breakdown of core capabilities that matter for a technically minded audience:&lt;/p&gt;

&lt;h3&gt;
  
  
  Provisioning &amp;amp; Infrastructure Automation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Self-service infra: Deploy environments via UI, CLI/API, or IaC.&lt;/li&gt;
&lt;li&gt;Kubernetes cluster management (single or multi-cloud) and VM provisioning across clouds.&lt;/li&gt;
&lt;li&gt;Spot/interruptible-instance support for cost optimisation. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Governance, Cost &amp;amp; FinOps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unified cost visibility across providers; waste detection; rightsizing recommendations.&lt;/li&gt;
&lt;li&gt;Enforcement of budgets, role-based access, data residency / sovereignty policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-Cloud &amp;amp; Hybrid Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manage AWS, Azure, GCP and even smaller/regional clouds from a single UI/API.&lt;/li&gt;
&lt;li&gt;Cross-cloud networking/backbone, backup/DR across clouds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Stability &amp;amp; Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auto-discovery of resources, inventory, logs and metrics from all clouds.&lt;/li&gt;
&lt;li&gt;Automated incident response, backups, snapshot management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Case Ready Templates &amp;amp; Workflows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pre-approved templates (guardrails) to enable self-service without rogue deployments.&lt;/li&gt;
&lt;li&gt;Cloud migration assistance: migrating Kubernetes pods between providers with minimal config changes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use Case: Multi-Cloud Kubernetes Provisioning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Your organization operates development and staging environments across multiple cloud providers: AWS, Azure, and GCP. Each engineering team requests Kubernetes clusters for testing, microservice deployments, or CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;Your challenge is to enable these teams to deploy clusters on-demand, in any cloud, without losing visibility or governance. You need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardize cluster configuration (versioning, node types, networking)&lt;/li&gt;
&lt;li&gt;Enforce cost and security policies across providers&lt;/li&gt;
&lt;li&gt;Provide a self-service interface so teams don’t depend on Ops for every deployment&lt;/li&gt;
&lt;li&gt;Maintain observability, backup, and access control consistently across all clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Previously, you might have maintained multiple Terraform modules or provider-specific scripts: one for EKS, one for GKE, one for AKS. But that quickly becomes unmanageable.&lt;/p&gt;

&lt;p&gt;With emma, you define a single cluster template that can be provisioned across providers while still applying organization-wide guardrails and policies. Developers choose a provider, and emma handles the provisioning workflow end-to-end including cost limits, RBAC enforcement, backup schedules, and monitoring integrations.&lt;/p&gt;

&lt;p&gt;This approach delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed: Teams deploy clusters in minutes, not days.&lt;/li&gt;
&lt;li&gt;Governance: Platform team maintains control via policies and guardrails.&lt;/li&gt;
&lt;li&gt;Visibility: emma consolidates metrics, logs, and cost data for all clusters.&lt;/li&gt;
&lt;li&gt;Portability: Migrate workloads or replicas between AWS, GCP, and private cloud with minimal reconfiguration.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use Case: Cloud Outage Resilience
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Your organization runs a critical service on Kubernetes pods deployed across multiple cloud environments, using the emma cloud management platform as the orchestration and management layer. For instance, your primary workloads run on AWS, while a secondary setup exists on Microsoft Azure (or Google Cloud) for continuity.&lt;/p&gt;

&lt;p&gt;During normal operations, your workloads run on AWS. However, in the event of an unexpected AWS region or provider outage, your team can quickly redeploy pods to Azure using emma's multi-cloud management capabilities, ensuring a faster recovery with minimal reconfiguration. Once AWS recovers, workloads can be rolled back or balanced between both providers.&lt;/p&gt;

&lt;p&gt;This approach enhances resilience and availability, reducing dependency on a single cloud vendor. emma simplifies this process by allowing teams to move Kubernetes workloads between providers with minimal configuration changes and to distribute workloads across clouds, helping organizations mitigate downtime risks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case: FinOps &amp;amp; Cost Governance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Your organisation has runaway cloud spend across multiple providers. Platform/Finance teams need to track spend per project, identify waste (unused VMs, idle clusters, oversized disk volumes), and enforce budgets.&lt;/p&gt;

&lt;p&gt;With emma:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It ingests cost data across clouds and presents unified dashboards. &lt;/li&gt;
&lt;li&gt;It flags idle or under-utilised resources and recommends rightsizing or shutdown.&lt;/li&gt;
&lt;li&gt;It enables setting budgets per project/team and triggers alerts when thresholds are exceeded.&lt;/li&gt;
&lt;li&gt;You can set automated remediation (e.g., shut down idle clusters after X days) via emma’s automation engine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; Leverage emma’s tagging/RBAC policies so cost attribution aligns with team/project owners and you can create FinOps reports by tag.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrations &amp;amp; Tooling
&lt;/h2&gt;

&lt;p&gt;When deploying emma in your stack, you’ll want to integrate with other platform/devops tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code (IaC):&lt;/strong&gt; emma cloud management platform supports templates and integrates with &lt;a href="https://registry.terraform.io/providers/emma-community/emma/latest/docs" rel="noopener noreferrer"&gt;Terraform modules&lt;/a&gt;. Note: at the time of writing, the provider covers Kubernetes cluster and VM creation, with more resources to be added.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD:&lt;/strong&gt; Link emma’s provisioning with Jenkins/GitHub Actions/GitLab pipelines for cluster or resource provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes / GitOps:&lt;/strong&gt; Tools like ArgoCD, Flux can use emma-managed clusters as targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Connect to tools like Prometheus, Grafana, and Elasticsearch; emma centralises logs/metrics across clouds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost / FinOps tooling:&lt;/strong&gt; Use alongside platforms like CloudHealth and Kubecost; emma gives unified cross-cloud visibility and remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup/DR tools:&lt;/strong&gt; emma’s multi-cloud backup module simplifies cross-cloud restore scenarios.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best Practices &amp;amp; Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start small with one use-case&lt;/strong&gt; (e.g., dev clusters) before broad multi-cloud rollout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce tagging discipline up-front&lt;/strong&gt; so cost and accountability work from day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define guardrails, not just policies&lt;/strong&gt;: enable self-service but within boundaries (cost, region, allowed services).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use spot/interruptible instances&lt;/strong&gt; where appropriate for non-critical workloads to reduce cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate resource reclamation&lt;/strong&gt;: idle clusters, detached volumes, orphaned snapshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor cost drift&lt;/strong&gt; across clouds: not only within each provider but across providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid provider lock-in&lt;/strong&gt;: emma’s cloud-agnostic approach helps you move workloads between providers as needs change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document workflows and provide training&lt;/strong&gt; for developer teams to understand self-service workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate with your existing IaC/CI/CD pipelines&lt;/strong&gt; rather than completely replacing them.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Questions &amp;amp; Answers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Does emma support on-premises or private cloud?&lt;/strong&gt;&lt;br&gt;
Yes, emma supports hybrid/cloud-agnostic operations including on-prem/private clouds as part of its unified management surface. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I migrate Kubernetes workloads between clouds using emma?&lt;/strong&gt;&lt;br&gt;
Yes, one of the features described is moving Kubernetes pods between providers with minimal config changes via emma’s template and abstraction layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How does emma help with cost optimisation?&lt;/strong&gt;&lt;br&gt;
It provides unified cost dashboards across clouds, waste detection, rightsizing recommendations, and automation for remediation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is there an API or CLI for emma?&lt;/strong&gt;&lt;br&gt;
Yes, emma offers API support for integrations and automation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about data sovereignty/compliance when dealing with multi-cloud?&lt;/strong&gt;&lt;br&gt;
emma includes data residency and sovereignty controls so you can enforce which region/cloud your data is allowed in, helping with compliance. &lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion &amp;amp; Call to Action
&lt;/h2&gt;

&lt;p&gt;For DevOps, Platform, Cloud, SRE Engineers dealing with the complexity of multi-cloud, hybrid, and sovereign environments, emma delivers a compelling proposition: unify visibility, governance, cost optimisation and provisioning across disparate clouds without stifling developer agility.&lt;/p&gt;

&lt;p&gt;If you’re looking to standardise your self-service infra, impose guardrails, and cut cloud waste while retaining agility, emma is worth evaluating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow &lt;a href="https://www.linkedin.com/in/infantusgodfrey/" rel="noopener noreferrer"&gt;me&lt;/a&gt; for more dev tutorials&lt;/strong&gt;, including multi-cloud tool workflows and practical, code-centric deep dives.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>multicloud</category>
      <category>cloudmanagement</category>
      <category>finops</category>
    </item>
  </channel>
</rss>
