
Solved: How do you manage maintenance across tens/hundreds of K8s clusters?

🚀 Executive Summary

TL;DR: Managing maintenance across tens to hundreds of Kubernetes clusters is challenging due to manual toil, inconsistencies, and security risks. Effective solutions involve a strategic shift towards automation and declarative management, leveraging approaches like scripted operations, GitOps for cluster lifecycle, or commercial multi-cluster platforms.

🎯 Key Takeaways

  • Scripted Automation utilizes custom scripts (Bash, Python, Go) with kubectl and Helm, orchestrated by CI/CD pipelines, to automate repetitive tasks by iterating through kubeconfig files across target clusters.
  • GitOps for Cluster Lifecycle Management extends the GitOps philosophy to cluster provisioning and configuration, using Cluster API (CAPI) for infrastructure, Crossplane for external services, and FluxCD/Argo CD for in-cluster resources, all driven by a Git repository as the single source of truth.
  • Commercial Multi-Cluster Platforms offer a unified control plane, often with a GUI, for provisioning, upgrading, and enforcing policies across a fleet of clusters via agents, abstracting complexity (e.g., Rancher, Anthos, Azure Arc).

Scaling Kubernetes cluster maintenance from tens to hundreds requires robust automation and strategic tooling. Discover effective approaches to streamline upgrades, security patches, and configuration management across your entire fleet.

The Multi-Cluster Maintenance Conundrum

As organizations increasingly adopt Kubernetes, the challenge often shifts from “how do we run Kubernetes?” to “how do we run hundreds of Kubernetes clusters?” Managing maintenance operations – from core Kubernetes upgrades and operating system patches to add-on deployments and security updates – across a sprawling fleet of clusters is a significant undertaking. Manual processes quickly become unsustainable, leading to inconsistencies, increased downtime, and heavy operational toil for DevOps and SRE teams.

Symptoms of a Stretched SRE Team

If your team is struggling with multi-cluster maintenance, you’re likely experiencing some of these common symptoms:

  • Inconsistent Configurations: Different clusters drift out of sync, making troubleshooting difficult and leading to “it works on my cluster” scenarios.
  • Manual Toil and Burnout: Engineers spend countless hours manually logging into clusters, running commands, and repeating the same upgrade or patch procedures.
  • Delayed Security Patches: The sheer volume of clusters makes it challenging to apply security updates promptly, leaving your infrastructure vulnerable.
  • Increased Downtime and Risk: Manual interventions increase the chance of human error during maintenance windows, potentially causing outages.
  • Lack of Visibility: Difficulty in gaining a unified view of the health, version, and compliance status of all clusters.
  • Inefficient Resource Utilization: Inconsistent cluster sizes or configurations lead to suboptimal use of underlying infrastructure.

Addressing these symptoms requires a strategic shift towards automation and declarative management. Let’s explore three primary solution patterns.

Solution 1: Scripted Automation & Centralized Tools

Overview

This approach involves using custom scripts (Bash, Python, Go) combined with standard Kubernetes tooling like kubectl and Helm to automate repetitive tasks across your clusters. A centralized CI/CD pipeline often orchestrates these scripts, pushing changes to multiple target clusters.

How it Works

The core idea is to maintain a list of your Kubernetes clusters, usually via their kubeconfig files or context names. Scripts then iterate through this list, executing specific commands or applying configurations. For application deployments and some add-ons, GitOps tools like Argo CD or FluxCD can be deployed on each cluster, with a central repository managing their configurations.

  • Centralized Kubeconfig Management: Store kubeconfig files securely and manage access to them. Tools like kubectx help with interactive context switching, but for automation, explicit kubeconfig paths or the KUBECONFIG environment variable are more common.
  • Scripted Operations: Write scripts that loop through clusters, performing actions such as:
    • Upgrading Helm charts (e.g., ingress controllers, monitoring agents).
    • Applying common Kubernetes resources (e.g., RBAC policies, network policies).
    • Running diagnostic checks or security scans.
  • CI/CD Integration: Integrate these scripts into your CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions) to trigger automated maintenance jobs based on schedules or code changes.

Example Configuration/Commands

Here’s a Bash script example to upgrade a common Helm chart (e.g., an Nginx Ingress Controller) across several clusters:

#!/bin/bash

# List of clusters and their kubeconfig paths
declare -A CLUSTERS
CLUSTERS["dev-cluster"]="/path/to/kubeconfigs/dev-cluster-kubeconfig"
CLUSTERS["staging-cluster"]="/path/to/kubeconfigs/staging-cluster-kubeconfig"
CLUSTERS["prod-us-cluster"]="/path/to/kubeconfigs/prod-us-cluster-kubeconfig"
CLUSTERS["prod-eu-cluster"]="/path/to/kubeconfigs/prod-eu-cluster-kubeconfig"

HELM_CHART_NAME="nginx-ingress"
HELM_NAMESPACE="ingress-nginx"
HELM_REPO_URL="https://kubernetes.github.io/ingress-nginx"
HELM_CHART_VERSION="4.8.2" # Target Helm chart version

echo "Starting Helm chart upgrade for ${HELM_CHART_NAME} to version ${HELM_CHART_VERSION}..."

for cluster_name in "${!CLUSTERS[@]}"; do
  KUBECONFIG_PATH="${CLUSTERS[${cluster_name}]}"

  if [[ ! -f "${KUBECONFIG_PATH}" ]]; then
    echo "ERROR: Kubeconfig not found for ${cluster_name} at ${KUBECONFIG_PATH}. Skipping."
    continue
  fi

  echo "--- Upgrading on cluster: ${cluster_name} (using kubeconfig: ${KUBECONFIG_PATH}) ---"

  # Add or update Helm repository (if not already done globally)
  KUBECONFIG=${KUBECONFIG_PATH} helm repo add ingress-nginx ${HELM_REPO_URL} --force-update > /dev/null 2>&1
  KUBECONFIG=${KUBECONFIG_PATH} helm repo update > /dev/null 2>&1

  # Check if the release already exists to decide between install or upgrade
  if KUBECONFIG=${KUBECONFIG_PATH} helm status ${HELM_CHART_NAME} -n ${HELM_NAMESPACE} &> /dev/null; then
    echo "Release '${HELM_CHART_NAME}' found. Performing upgrade."
    KUBECONFIG=${KUBECONFIG_PATH} helm upgrade ${HELM_CHART_NAME} ingress-nginx/${HELM_CHART_NAME} \
      --namespace ${HELM_NAMESPACE} \
      --version ${HELM_CHART_VERSION} \
      --atomic \
      --timeout 5m \
      --wait \
      --create-namespace || { echo "ERROR: Helm upgrade failed on ${cluster_name}"; exit 1; }
  else
    echo "Release '${HELM_CHART_NAME}' not found. Performing install."
    KUBECONFIG=${KUBECONFIG_PATH} helm install ${HELM_CHART_NAME} ingress-nginx/${HELM_CHART_NAME} \
      --namespace ${HELM_NAMESPACE} \
      --version ${HELM_CHART_VERSION} \
      --atomic \
      --timeout 5m \
      --wait \
      --create-namespace || { echo "ERROR: Helm install failed on ${cluster_name}"; exit 1; }
  fi

  echo "Successfully upgraded/installed ${HELM_CHART_NAME} on ${cluster_name}"
  echo ""
done

echo "Helm chart upgrade process completed."
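
As a sketch of the CI/CD integration mentioned above, the same script can run on a schedule from a pipeline instead of an engineer’s laptop. Here is a minimal GitHub Actions workflow, assuming the script above is committed as scripts/upgrade-ingress.sh and the kubeconfigs are delivered through a repository secret (both names are hypothetical; a proper secrets manager is preferable in production):

name: fleet-maintenance
on:
  schedule:
    - cron: "0 2 * * 1"   # every Monday at 02:00 UTC
  workflow_dispatch: {}    # allow manual runs

jobs:
  upgrade-ingress:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Helm
        uses: azure/setup-helm@v4
      - name: Restore kubeconfigs
        # Assumes a base64-encoded tarball of kubeconfigs stored as a secret;
        # the paths inside the script must point at the restored files.
        run: |
          mkdir -p kubeconfigs
          echo "${{ secrets.FLEET_KUBECONFIGS_TGZ }}" | base64 -d | tar -xz -C kubeconfigs
      - name: Run fleet upgrade script
        run: ./scripts/upgrade-ingress.sh

The same pattern maps directly onto GitLab CI schedules or a Jenkins cron trigger.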

Solution 2: GitOps for Cluster Lifecycle Management

Overview

Extending the GitOps philosophy from application deployment to the entire cluster lifecycle is a powerful strategy for managing hundreds of clusters. This involves treating all aspects of a Kubernetes cluster – from its provisioning and core Kubernetes version to its add-ons and configurations – as code stored in Git. Specialized Kubernetes operators then reconcile this desired state with the actual state of your clusters.

How it Works

At its core, GitOps for cluster lifecycle management relies on a “management cluster” (or sometimes a single-tenant control plane) and several key open-source projects:

  • Cluster API (CAPI): This project provides declarative APIs for creating, configuring, and upgrading Kubernetes clusters themselves. It allows you to define clusters as Kubernetes resources (Cluster, MachineDeployment, etc.) within a management cluster, which then orchestrates their lifecycle on target infrastructure (AWS, Azure, GCP, vSphere, etc.).
  • Crossplane: Extends Kubernetes to manage external infrastructure and cloud services. You can use Crossplane to provision the underlying infrastructure (VPCs, subnets, IAM roles, databases) that your CAPI-managed clusters depend on, all through Kubernetes APIs.
  • FluxCD / Argo CD: These are the GitOps engines. They run on your target clusters and continuously pull configuration and application manifests from Git, applying them to the cluster to maintain the desired state. For cluster lifecycle, they would manage cluster-specific add-ons (monitoring, logging agents, security tools, ingress controllers).

The workflow typically involves committing changes to a Git repository, which then triggers reconciliation by CAPI for cluster infrastructure, and by Flux/Argo CD for in-cluster resources.

Example Configuration/Commands

Here’s a simplified, illustrative Cluster API (CAPI) manifest defining an EKS cluster from a separate management cluster. With the AWS provider (CAPA), the EKS control plane is modelled by the AWSManagedControlPlane kind; exact field names and placement vary between provider versions, so treat this as a sketch and consult the CAPA documentation for the current schema:

# Define the Kubernetes Cluster itself
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east-cluster
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: AWSManagedControlPlane
    name: prod-us-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSManagedCluster
    name: prod-us-east-cluster
---
# Define the EKS Control Plane configuration
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: AWSManagedControlPlane
metadata:
  name: prod-us-east-control-plane
  namespace: default
spec:
  version: v1.28.0 # Target Kubernetes version
  # Optional: Define logging, subnets, etc.
  logging:
    clusterLogging:
      - types: ["api", "audit", "authenticator", "controllerManager", "scheduler"]
        enabled: true
---
# Define the underlying AWS infrastructure for the EKS cluster
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSManagedCluster
metadata:
  name: prod-us-east-cluster
  namespace: default
spec:
  region: us-east-1
  sshKeyName: default-eks-ssh-key
  networkSpec:
    vpc:
      cidrBlock: 10.0.0.0/16
    subnets:
      - id: subnet-0xxxxxxxxxxxxxxa # Public Subnet 1
        isPublic: true
        availabilityZone: us-east-1a
      - id: subnet-0xxxxxxxxxxxxxxb # Public Subnet 2
        isPublic: true
        availabilityZone: us-east-1b
      - id: subnet-0xxxxxxxxxxxxxxp # Private Subnet 1
        isPublic: false
        availabilityZone: us-east-1a
      - id: subnet-0xxxxxxxxxxxxxxq # Private Subnet 2
        isPublic: false
        availabilityZone: us-east-1b
  # Identity Management via AWS IAM
  roleRef:
    # IAM Role ARN for the EKS Cluster
    # Assumed by EKS service for creating AWS resources
    arn: arn:aws:iam::123456789012:role/eks-cluster-role
---
# Define a MachineDeployment for worker nodes
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: prod-us-east-worker-nodes
  namespace: default
spec:
  clusterName: prod-us-east-cluster
  replicas: 3
  template:
    spec:
      clusterName: prod-us-east-cluster
      version: v1.28.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: EKSConfigTemplate
          name: prod-us-east-node-config
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: prod-us-east-instance-template
---
# Define EKSConfigTemplate (for kubelet settings etc.)
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: EKSConfigTemplate
metadata:
  name: prod-us-east-node-config
  namespace: default
spec:
  template:
    spec:
      # Optional: Kubelet arguments, pre/post bootstrap commands
      # For example: --kube-reserved=cpu=100m,memory=200Mi
---
# Define the EC2 instance type for worker nodes
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: prod-us-east-instance-template
  namespace: default
spec:
  template:
    spec:
      instanceType: t3.medium
      ami:
        id: ami-0xxxxxxxxxxxxxx # EKS optimized AMI for v1.28
      sshKeyName: default-eks-ssh-key
      iamInstanceProfile: node-instance-profile-for-eks
      # Set up EC2 instance tags, root volume size etc.

Once the clusters are provisioned, FluxCD or Argo CD running on each cluster can manage common add-ons and applications. Here’s how you might define a FluxCD Kustomization to deploy baseline add-ons to a group of clusters:

# In a Git repository, e.g., 'clusters/prod-us-east/flux-system/addons.yaml'
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./addons/base # Points to a common directory for all base add-ons
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system # Reference to the main Git repository
  targetNamespace: default
  # Apply common overlays or substitutions
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-metadata # Contains cluster-specific variables
        namespace: flux-system
---
# Example: Deploying cert-manager using HelmRelease managed by Flux
# In 'addons/base/cert-manager-release.yaml'
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 5m
  chart:
    spec:
      chart: cert-manager
      version: "v1.13.0"
      sourceRef:
        kind: HelmRepository
        name: jetstack
        namespace: flux-system
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
  values:
    installCRDs: true
    global:
      leaderElection:
        namespace: cert-manager
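
Crossplane, the third component listed above, covers resources that live outside the cluster. Here is a minimal sketch of a single managed resource, an S3 bucket, assuming the Upbound AWS S3 provider is installed with a ProviderConfig named "default" (API groups and fields differ between provider versions, and the bucket name is hypothetical):

apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: prod-us-east-cluster-artifacts  # hypothetical bucket backing the cluster
spec:
  forProvider:
    region: us-east-1
    tags:
      managed-by: crossplane
      cluster: prod-us-east-cluster
  providerConfigRef:
    name: default

In practice such resources are usually wrapped in a Crossplane Composition, so each cluster definition pulls in its supporting VPCs, IAM roles, and databases from the same Git commit.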

Solution 3: Commercial Multi-Cluster Management Platforms

Overview

Commercial and open-source enterprise-grade platforms offer comprehensive solutions for managing Kubernetes fleets. These platforms abstract much of the underlying complexity, providing a unified control plane, often with a graphical user interface (GUI), centralized policy management, and integrated services.

How it Works

These platforms typically involve installing an agent or operator on each managed Kubernetes cluster. This agent registers the cluster with a central management plane, enabling it to:

  • Provision and Upgrade Clusters: Many platforms can provision new clusters on various clouds or on-premises infrastructure and manage their lifecycle (upgrades, scaling).
  • Centralized Policy Enforcement: Define and enforce security, compliance, and governance policies consistently across all clusters.
  • Unified Observability: Aggregate logs, metrics, and events from all clusters into a single dashboard.
  • Application Lifecycle Management: Deploy, manage, and monitor applications across multiple clusters from a central catalog or Git repository.
  • Identity and Access Management: Integrate with enterprise identity providers to manage user access to clusters and resources.
  • Fleet Management: Group clusters, perform phased rollouts, and monitor the health and configuration of the entire fleet.

Example Features (Rancher, Anthos, Azure Arc)

  • Rancher: An open-source and commercial platform that provides a complete software stack for teams to manage containerized applications. It supports diverse Kubernetes distributions (RKE, EKS, AKS, GKE, K3s, OpenShift) and offers a powerful UI for managing clusters, deploying applications, and enforcing policies.
  • Google Anthos / GKE Multi-cluster management: Google’s hybrid and multi-cloud platform extends GKE’s capabilities to manage Kubernetes clusters wherever they run. It offers unified control planes, policy enforcement (Anthos Policy Controller), service mesh (Anthos Service Mesh), and centralized logging/monitoring.
  • Azure Arc enabled Kubernetes: This Azure service allows you to attach and manage Kubernetes clusters located anywhere (on-premises, other cloud providers, edge) as if they were running in Azure. It enables capabilities like GitOps for configuration, Azure Monitor for observability, Azure Policy for governance, and Azure Defender for security.
  • OpenShift Advanced Cluster Management (ACM): Red Hat’s solution for managing OpenShift and Kubernetes clusters across hybrid and multi-cloud environments. It provides full lifecycle management, policy enforcement, and observability.

While the exact commands and configurations are platform-specific, the core idea is to interact with the platform’s API or GUI rather than directly with individual clusters for fleet-wide operations. For example, in Azure Arc, you would link a cluster using:

# Connect an existing Kubernetes cluster to Azure Arc
az connectedk8s connect --name my-hybrid-cluster --resource-group my-arc-rg --location eastus

Once connected, you can then apply GitOps configurations via Azure Arc:

# Apply a GitOps (Flux v2) configuration to an Azure Arc-connected cluster
az k8s-configuration flux create \
    --resource-group my-arc-rg \
    --cluster-name my-hybrid-cluster \
    --cluster-type connectedClusters \
    --name cluster-baseline \
    --namespace flux-system \
    --scope cluster \
    --url https://github.com/my-org/cluster-configs.git \
    --branch main \
    --kustomization name=base path=./base-config prune=true
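
Rancher expresses the same fleet-wide GitOps idea through its bundled Fleet controller: a GitRepo resource on the management cluster fans manifests out to every downstream cluster matching a label selector. A minimal sketch (the repository URL and cluster labels are hypothetical):

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: fleet-baseline
  namespace: fleet-default        # default workspace for downstream clusters
spec:
  repo: https://github.com/my-org/cluster-configs
  branch: main
  paths:
    - base-config
  targets:
    - name: production
      clusterSelector:
        matchLabels:
          env: production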

Choosing Your Path: A Comparison

Each solution offers distinct advantages and disadvantages. The best choice depends on your team’s size, expertise, existing infrastructure, compliance requirements, and desired level of control versus operational overhead.

| Feature | Scripted Automation & Centralized Tools | GitOps for Cluster Lifecycle (CAPI/Crossplane/Flux) | Commercial Platforms (Rancher, Anthos, Azure Arc) |
| --- | --- | --- | --- |
| Initial Complexity | Low (easy to start with familiar tools) | High (significant learning curve for CAPI, GitOps principles) | Moderate (platform setup, agent deployment) |
| Scalability | Poor (becomes unwieldy with many clusters/tasks) | Excellent (declarative nature scales well with Git) | Excellent (designed for fleet management) |
| Declarative State | Partial (scripts define actions, not always desired state) | Full (Git is the single source of truth for everything) | High (platform UI/API defines desired state, often Git-backed) |
| Consistency | Low to Medium (prone to drift, difficult to verify) | High (continuous reconciliation from Git) | High (centralized policy enforcement, continuous monitoring) |
| Flexibility/Customization | Very High (full control over scripts) | High (open-source, extensible components) | Medium (constrained by platform capabilities) |
| Operational Overhead (Long-term) | High (manual updates, troubleshooting, script maintenance) | Moderate (managing Git repos, operators, management cluster) | Low to Moderate (leveraging vendor services, but still requires platform ops) |
| Cost | Low (tooling), High (engineering effort) | Medium (infrastructure for management cluster, engineering effort) | High (licensing/subscription fees, underlying infrastructure) |
| Vendor Lock-in | Minimal | Minimal to None (open-source, Kubernetes-native) | Moderate to High (reliance on specific platform APIs, services) |
| Best Use Case | Small cluster count, specific one-off tasks, bootstrapping | Large-scale, standardized environments, teams preferring full control and open-source ecosystem | Large-scale, diverse environments (hybrid/multi-cloud), teams seeking reduced operational burden and integrated features |

Best Practices for Any Approach

Regardless of the solution you choose, a few best practices are critical for successful multi-cluster maintenance:

Gradual Rollouts and Canary Deployments

Never apply changes to all clusters simultaneously. Implement a phased rollout strategy (e.g., development clusters first, then staging, then production in batches) to minimize blast radius. Tools like Argo Rollouts can extend this to applications, while CAPI/commercial platforms often have built-in fleet management capabilities for phased cluster upgrades.
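
For the scripted approach from Solution 1, waves can be encoded directly in the driver script. A minimal sketch, assuming a per-cluster wrapper called upgrade-ingress.sh (hypothetical) and space-free cluster names:

#!/bin/bash
# Roll changes out in waves and halt on the first failure,
# limiting the blast radius to the current wave.
WAVE_1="dev-cluster"
WAVE_2="staging-cluster"
WAVE_3="prod-us-cluster prod-eu-cluster"

for wave in WAVE_1 WAVE_2 WAVE_3; do
  echo "=== Rolling out ${wave} ==="
  for cluster in ${!wave}; do  # indirect expansion; intentionally unquoted to split on spaces
    ./upgrade-ingress.sh "${cluster}" || { echo "Failure in ${wave}; halting rollout."; exit 1; }
  done
  echo "Soaking for 30 minutes before the next wave..."
  sleep 1800
done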

Observability and Alerting

Implement robust, centralized observability. Aggregate logs (e.g., Loki, Splunk), metrics (e.g., Prometheus, Grafana), and traces from all clusters. Set up proactive alerts for anomalies, health degradation, or configuration drift, allowing your team to identify and respond to issues rapidly across the entire fleet.
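
As one concrete example, assuming a central Prometheus or Thanos instance that attaches a cluster label to every scraped series (the label name is an assumption), a rule like the following catches clusters that silently stop reporting:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fleet-health
  namespace: monitoring
spec:
  groups:
    - name: fleet.health
      rules:
        - alert: ClusterStoppedReporting
          # Fires for any cluster that had scrape targets an hour ago but reports none now.
          expr: count by (cluster) (up offset 1h) unless count by (cluster) (up)
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Cluster {{ $labels.cluster }} has stopped reporting metrics"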

Standardization and Baselines

Define a set of baseline configurations, common add-ons, and security policies that apply to all clusters. Use Helm charts, Kustomize, or OPA Gatekeeper for policy enforcement to maintain consistency. This reduces complexity and simplifies troubleshooting.
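
As an illustration, assuming the K8sRequiredLabels ConstraintTemplate from the Gatekeeper policy library is already installed on each cluster, a single constraint can require an owner label on every namespace:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: namespaces-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    message: "Every namespace must carry an owner label"
    labels:
      - key: owner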

Security and Compliance

Automate security patching for Kubernetes and underlying operating systems. Implement consistent RBAC policies, network policies, and container security scanning across all clusters. Ensure secrets management is integrated and consistent. Use tools to regularly audit clusters against security benchmarks (e.g., CIS Kubernetes Benchmark) and compliance standards.
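
For benchmark auditing, one option is to run kube-bench as a Kubernetes Job on each cluster and collect the report, for example using the upstream manifest (the URL reflects the current layout of the aquasecurity/kube-bench repository and may change):

# Run the CIS benchmark checks on one cluster and fetch the report
kubectl --kubeconfig "${KUBECONFIG_PATH}" apply -f \
  https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl --kubeconfig "${KUBECONFIG_PATH}" wait --for=condition=complete job/kube-bench --timeout=300s
kubectl --kubeconfig "${KUBECONFIG_PATH}" logs job/kube-bench

This slots naturally into the per-cluster loop from Solution 1, so the same audit runs across the whole fleet.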

Conclusion

Managing maintenance across tens or hundreds of Kubernetes clusters is a problem of scale that demands automation and a declarative mindset. While scripted automation can be a starting point for smaller environments, true scalability and consistency are achieved through GitOps for cluster lifecycle management (leveraging tools like Cluster API and Flux/Argo CD) or by adopting commercial multi-cluster management platforms. Each approach has its trade-offs in terms of complexity, flexibility, and cost. By understanding these options and adhering to best practices, your organization can move beyond manual toil, achieve greater consistency, and ensure the health and security of your vast Kubernetes fleet.


Darian Vance

👉 Read the original article on TechResolve.blog
