Charlotte

The GKE Upgrade That Took Down Our Production Pods for 45 Minutes

I want to start this blog with the story that made me realise I had been running Kubernetes on GKE without actually understanding how GKE runs Kubernetes. This is about a node pool upgrade that should have been routine and wasn't.

The Setup

We run three GKE Standard clusters on GCP: one for production, one for staging and one for our internal tooling. The production cluster runs about 40 pods across 8 namespaces handling customer-facing workloads. Nothing exotic. Deployments, Services, a few StatefulSets for some background processing.

Node pool upgrades in GKE can be set to automatic. Google releases a new node version and GKE upgrades your nodes on a schedule you configure in the maintenance window. We had this set up and running for about eight months without incident. That made us complacent.

What Happened

On a Tuesday morning during business hours our alerting fired. Response times on our main API were spiking and several pods were showing as not ready. The timing was strange because we had not deployed anything. It took about three minutes to figure out that a node pool upgrade was in progress.

GKE was cycling our nodes as part of an automatic minor version upgrade. It was doing what we had told it to do. The problem was that we had not thought carefully enough about what that actually meant in practice.

GKE's default upgrade strategy is the surge upgrade. GKE adds a surge node to the pool, drains an existing node by evicting its pods and scheduling them onto other nodes, then removes the old node. It repeats this node by node until the pool is upgraded. With a surge of one it upgrades a single node at a time, which sounds safe.
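Surge behaviour is configurable per node pool. A sketch of the relevant gcloud flags, assuming placeholder pool, cluster and zone names (max-surge=1 with max-unavailable=0 is the default described above):

```shell
# Upgrade with one extra surge node at a time, keeping every existing
# node in service until its replacement is ready. Names are placeholders.
gcloud container node-pools update default-pool \
  --cluster=prod-cluster \
  --zone=us-central1-a \
  --max-surge-upgrade=1 \
  --max-unavailable-upgrade=0
```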

What we had not accounted for was the interaction between surge upgrades and our pod topology. We had eight nodes in the pool and most of our deployments had a replica count of two. Two replicas spread across eight nodes sounds fine. But when GKE drained a node it evicted one replica and tried to reschedule it. If the replacement pod took more than a few seconds to become ready, our deployment was running at half capacity during that window.

For stateless API pods that each handled a modest amount of traffic, losing half the replicas for thirty seconds was survivable. For two of our services it wasn't. One was a session validation service that our entire API depended on. The other was a rate limiting service. Both had two replicas. Both had replicas on different nodes. Both had replicas evicted during the same upgrade cycle, with not enough time between evictions for the replacement pods to start and pass readiness checks.

The result was a 45-minute window where requests to those services were either slow or failing depending on which node traffic landed on. It stopped when the upgrade cycle moved on to nodes that didn't host those particular pods and all replicas were available again.

What We Got Wrong

There were three things we had not set up that would have prevented this entirely.

Pod Disruption Budgets

A PodDisruptionBudget tells Kubernetes the minimum number of pods for a given workload that must stay available during voluntary disruptions such as node drains. If you set minAvailable to 1 on a two-replica deployment, the eviction API refuses any eviction that would drop the workload below one available pod, so GKE's drain has to wait until a replacement pod on another node is healthy and passing readiness checks before evicting the remaining one.

We had no PDBs configured on any of our workloads. This meant GKE's node draining had no constraints from the workload side. It evicted pods freely and trusted that Kubernetes would reschedule them fast enough. For most of our pods it was fast enough. For those two it wasn't.

Setting up a PDB is straightforward.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: session-validator-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: session-validator
```

With this in place GKE's node drain will wait for a replacement pod to be ready before evicting the existing one. The upgrade takes longer but your service stays available throughout.
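You can watch the budget in action during a drain. Assuming the PDB name and namespace from the example above, something like:

```shell
# ALLOWED DISRUPTIONS drops to 0 whenever evicting another pod
# would violate minAvailable, which is what stalls the node drain.
kubectl get poddisruptionbudget session-validator-pdb -n production

# Wider view: every PDB in the namespace during an upgrade
kubectl get pdb -n production -o wide
```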

Readiness Probes That Actually Reflected Readiness

One reason replacement pods were slow to become genuinely ready was that our readiness probes were too lenient. We had initialDelaySeconds set to 10 and the probe passed on the first check after that, even if the pod was still warming up its internal caches. Pods would receive traffic before they were actually ready to serve it usefully.

We tightened the probes to more accurately reflect when each service was genuinely ready to handle load. Longer initial delays for services that needed warm up time and more frequent checks to catch failures faster.

```yaml
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
```

This alone cut the gap between a pod being marked ready and it being genuinely able to handle traffic from around 15 seconds to around 5 seconds. That difference matters a lot during rolling upgrades.
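The probe settings only help if the endpoint behind /healthz/ready actually reflects warm-up state. A minimal sketch of that idea in Python; the class, names and threshold are illustrative, not our actual service:

```python
# Minimal readiness gate behind a /healthz/ready handler. The pod
# reports ready only once its internal caches are warm.
class ReadinessGate:
    def __init__(self, warmup_items_required=100):
        # Cache entries that must be loaded before reporting ready
        # (threshold is an illustrative assumption).
        self.warmup_items_required = warmup_items_required
        self.items_loaded = 0

    def record_cache_load(self, count):
        # Called by the cache-warming loop as entries are populated.
        self.items_loaded += count

    def is_ready(self):
        # The /healthz/ready handler should return 200 only when this
        # is True, so kubelet marks the pod ready after warm-up.
        return self.items_loaded >= self.warmup_items_required

gate = ReadinessGate()
gate.record_cache_load(40)
print(gate.is_ready())  # False: caches not yet warm
gate.record_cache_load(60)
print(gate.is_ready())  # True: safe to receive traffic
```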

A Maintenance Window That Matched Our Actual Low Traffic Period

Our maintenance window was set to a range that covered Tuesday mornings. We had set it up quickly without checking our actual traffic patterns carefully. Tuesdays at 9am turned out to be one of our busier windows because of a batch of automated customer jobs that ran at the start of the business week.

We moved the window to Saturday nights between 1am and 4am when our traffic was genuinely low. For a customer-facing product this sounds obvious in retrospect. At the time we had just picked a window that wasn't the middle of the business day and moved on.
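GKE expresses recurring maintenance windows in RFC 5545 recurrence syntax. A sketch of moving the window to Saturday 01:00-04:00, with placeholder cluster name, zone and timestamps:

```shell
# Recurring weekly window, Saturdays 01:00-04:00 UTC. The start/end
# timestamps set the time of day; recurrence sets the cadence.
gcloud container clusters update prod-cluster \
  --zone=us-central1-a \
  --maintenance-window-start="2024-01-06T01:00:00Z" \
  --maintenance-window-end="2024-01-06T04:00:00Z" \
  --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA"
```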

What the Cluster Looks Like Now

After the incident we went through every Deployment and StatefulSet in production and added PDBs. For anything with two replicas we set minAvailable to 1; for anything with three or more replicas we set it to 2. We also added a policy check using Kyverno that rejects any Deployment applied to the production namespace unless it declares its PDB via a pdb-configured annotation.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pdb
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-pdb-exists
      match:
        any:
          - resources:
              kinds:
                - Deployment
              namespaces:
                - production
      validate:
        message: "A PodDisruptionBudget is required for all production deployments."
        deny:
          conditions:
            all:
              # Deny unless the deployment carries pdb-configured: "true"
              - key: "{{ request.object.metadata.annotations.\"pdb-configured\" || '' }}"
                operator: NotEquals
                value: "true"
```

We also stopped relying entirely on GKE's automatic upgrade scheduling for production. We still use it for staging and tooling but for production we switched to manual upgrades on a defined schedule. The upgrade process now includes a pre-check step that confirms all deployments have PDBs and a post-upgrade step that runs a readiness check across all pods before we close the change ticket.
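Switching production to manual upgrades is two gcloud operations, sketched here with placeholder names:

```shell
# Turn off automatic node upgrades for the production pool
gcloud container node-pools update default-pool \
  --cluster=prod-cluster \
  --zone=us-central1-a \
  --no-enable-autoupgrade

# Later, upgrade the pool manually during our own change window
gcloud container clusters upgrade prod-cluster \
  --node-pool=default-pool \
  --zone=us-central1-a
```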

What I Would Tell Someone Starting With GKE

GKE's managed features are genuinely good and they make a lot of operational work disappear. But managed does not mean you can ignore how the underlying mechanisms work. Surge upgrades are safe in the general case but in your specific workload topology they may not be.

Set up PDBs before you need them. Write readiness probes that accurately reflect when your service is ready rather than when it has started. And check when your maintenance window actually fires against your real traffic patterns rather than your assumed ones.

The 45 minutes we lost was not caused by anything surprising or exotic. It was caused by defaults that are reasonable in general and wrong for our specific situation. Understanding where your setup diverges from the general case is most of the job.
