TechLogStack

Posted on May 22 • Originally published at techlogstack.com on May 21

OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

#devops #kubernetes #reliability #webdev

4h 22m outage — 3:16 PM to 7:38 PM PST, December 11 2024
25 minutes from rollout start (2:51 PM) to OpenAI products degrading (3:16 PM); the rollout itself took 29 minutes to reach every cluster
0 staging warnings — telemetry service passed validation completely
0 regions with staged rollout — applied to all clusters simultaneously
All OpenAI services affected: ChatGPT, the API, and Sora simultaneously
Engineers locked out of clusters — kubectl requires a control plane that was down

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability: to give engineers better visibility into how their clusters were behaving and to catch problems earlier. Within 29 minutes, the rollout had reached every cluster and overwhelmed the Kubernetes control plane in most of the large ones. ChatGPT, the API, and Sora were all unavailable. And the engineers responsible for fixing it couldn't run kubectl, because kubectl talks to the control plane, and the control plane was what had gone down.

The Story

Our tests didn't catch the impact the change was having on the Kubernetes control plane. DNS caching added a delay between making the change and when services started failing. Remediation was very slow because of the locked out effect.

— OpenAI, December 11 2024 Incident Postmortem, status.openai.com

The events unfolded with the particular cruelty of incidents where staging does not predict production. The telemetry service was deployed to a staging cluster on December 10 and verified as working correctly. On December 11 at 2:51 PM, the change began rolling out to all production clusters. At 3:16 PM, about four minutes before the 29-minute rollout finished, all OpenAI products began degrading. The root cause: a configuration that caused every node in every cluster to execute resource-intensive Kubernetes API operations simultaneously. The cost of these operations scaled with cluster size, so the largest, most critical clusters were hit hardest and fastest.

DNS Caching: The Hidden Time Bomb

The change looked safe for two reasons. First, the staging cluster was small: the telemetry service's API load scaled with cluster size, so a small staging cluster generated manageable load. Second, DNS caching masked the failure in production. When the telemetry service started overwhelming the Kubernetes API servers, services that had already cached DNS responses kept working on those cached entries for a while. Engineers saw a clean deployment and services continuing to function, until the cached entries expired and everything that hadn't failed yet failed all at once.
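
The masking effect can be illustrated with a toy TTL cache. This is a minimal sketch rather than a model of OpenAI's resolvers: the `TTLCache` class, the 30-second TTL, the hostname, and the address are all hypothetical, and time is passed in explicitly so the example runs instantly.

```python
class UpstreamDown(Exception):
    """Raised when the (hypothetical) authoritative DNS source cannot answer."""


class TTLCache:
    """Toy DNS cache: serves a stored answer until its TTL expires."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # name -> (address, expiry_time)

    def resolve(self, name: str, now: float, upstream) -> str:
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: upstream health is never consulted
        address = upstream(name)  # cache miss: raises if upstream is down
        self._entries[name] = (address, now + self.ttl)
        return address


def healthy_upstream(name: str) -> str:
    return "10.0.0.7"  # hypothetical address


def broken_upstream(name: str) -> str:
    raise UpstreamDown(name)


cache = TTLCache(ttl_seconds=30)
timeline = []
# t=0: upstream healthy, entry cached. t=10: upstream broken, cache still answers.
timeline.append(cache.resolve("api.internal", now=0, upstream=healthy_upstream))
timeline.append(cache.resolve("api.internal", now=10, upstream=broken_upstream))
try:
    # t=31: TTL expired, so the lookup finally reaches the broken upstream.
    cache.resolve("api.internal", now=31, upstream=broken_upstream)
    timeline.append("resolved")
except UpstreamDown:
    timeline.append("FAILED")
print(timeline)
```

The second lookup succeeds even though the upstream is already broken; only the third, after the TTL has lapsed, surfaces the failure.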

Problem

Telemetry Rollout to All Clusters in 29 Minutes

At 2:51 PM PST, the new telemetry service configuration began rolling out to all Kubernetes clusters simultaneously. The service's configuration caused every node in every cluster to issue simultaneous resource-intensive Kubernetes API calls — a load that scaled with cluster size, hitting the largest, most critical clusters hardest.

Cause

Kubernetes Control Plane Overwhelmed — DNS and Service Discovery Broken

With thousands of nodes simultaneously hammering the Kubernetes API servers, the control planes of most large clusters crashed. The control plane (the set of components managing overall cluster state: API server, etcd, scheduler, controller manager) is what keeps service discovery current, and in-cluster DNS resolution depends on it. When it failed, services could no longer find each other. DNS cache expiry then spread the failure to services that had been temporarily protected by cached entries, turning partial degradation into a complete cascading failure.

Solution

The Locked-Out Problem: No kubectl Access

Recovery required rolling back the telemetry configuration — but rolling back Kubernetes configurations requires kubectl, which requires a functioning Kubernetes control plane. The control plane was down. Engineers were effectively locked out of the clusters they needed to fix. Recovery required out-of-band mechanisms: directly accessing nodes through cloud provider management consoles, bypassing the Kubernetes layer entirely to remove the telemetry service's configuration.
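
The lock-out is a cycle in the recovery dependency graph, and cycles like this can be found before an incident. A minimal sketch, assuming a hand-written map of what each recovery step needs; `find_cycle` and the step names are hypothetical labels, not OpenAI's tooling.

```python
def find_cycle(depends_on: dict) -> list:
    """Return one dependency cycle as a list of steps, or [] if none exists."""
    visiting, done = set(), set()
    path = []

    def visit(step: str) -> list:
        if step in done:
            return []
        if step in visiting:
            # Reached a step already on the current path: that is a cycle.
            return path[path.index(step):] + [step]
        visiting.add(step)
        path.append(step)
        for dependency in depends_on.get(step, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(step)
        done.add(step)
        return []

    for start in depends_on:
        cycle = visit(start)
        if cycle:
            return cycle
    return []


standard_path = {
    "rollback": ["kubectl"],
    "kubectl": ["api_server"],
    "api_server": ["rollback"],  # API server recovers only once load is removed
}
break_glass_path = {
    "rollback": ["direct_node_access"],
    "direct_node_access": [],  # hypothetical out-of-band path
    "kubectl": ["api_server"],
    "api_server": ["rollback"],
}
locked_out = find_cycle(standard_path)
recoverable = find_cycle(break_glass_path)
print(locked_out, recoverable)
```

With only the standard path, the rollback depends on itself through kubectl and the API server. Adding a step with no Kubernetes dependency breaks the cycle.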

Result

4h 22m Outage, Full Postmortem Published

ChatGPT reached substantial recovery at 5:45 PM PST. Full recovery across all services was achieved at 7:38 PM PST — 4 hours and 22 minutes after the incident began. OpenAI published a detailed postmortem identifying four root causes and committing to specific architectural changes including break-glass emergency access mechanisms and staged rollouts for all infrastructure changes.

The Fix

What Actually Broke and Why Recovery Took Four Hours

The telemetry service's configuration caused each node to watch Kubernetes API resources continuously. The Watch API is a Kubernetes feature that lets clients receive a stream of events as resources change; each watcher holds a persistent connection to the API server and consumes server resources, so total cost grows with the number of watchers. Across thousands of nodes in large clusters, these operations compounded into an overwhelming flood and the API servers became saturated. Every read and write of cluster state goes through the API servers to etcd (the distributed key-value store backing all Kubernetes state: node metadata, pod specifications, service definitions), so once the API servers stopped responding, that state was effectively out of reach. Without responsive API servers, nothing could be changed, including the configuration generating the load. The cluster was in a deadlock.

4h 22m — total outage duration, 3:16 PM to 7:38 PM PST — longest single outage in ChatGPT's history at the time
29 min — time for the rollout to reach every cluster, with products degrading 25 minutes in; fast enough that the full fleet was affected before the scope was understood
All — services affected simultaneously: ChatGPT, API, Sora — every OpenAI product at once
0 — staging warnings — staging clusters were too small to reproduce the API call scaling behaviour that took down production

# Simplified model of the failure: telemetry service overwhelming K8s API
# Each node watches K8s API objects. Simplification: this model is linear in node count;
# if each call's cost also grows with cluster size, real load grows faster than linearly.

TELEMETRY_CONFIG = {
    "watch_all_pods": True,      # persistent connection per node to API server
    "watch_all_nodes": True,     # another persistent connection per node
    "watch_all_services": True,  # another persistent connection per node
    "poll_interval_ms": 100,     # aggressive — 10 checks/second per watcher
}

def api_calls_per_second(cluster_size: int) -> int:
    # 3 watchers per node × 10 calls/sec per watcher
    return cluster_size * 3 * 10

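# Staging cluster (100 nodes):
staging_load = api_calls_per_second(100)   # 3,000/sec: small enough to go unnoticed
# Illustrative API server capacity: ~1,000–2,000 requests/sec per replica,
# so a control plane with a few replicas absorbs the staging load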

# Large production cluster (5,000 nodes):
prod_load = api_calls_per_second(5000)     # 150,000/sec — CATASTROPHIC
# API server saturated within seconds → DNS breaks → services go blind
# kubectl stops working → engineers locked out

# THE LOCKED-OUT DEADLOCK:
# Fix requires: kubectl → needs API server → API server is down → needs fix
#
# RECOVERY PATH (bypassing K8s entirely):
# 1. SSH to nodes via cloud provider console (not through K8s)
# 2. Manually stop telemetry service process on each node
# 3. API server load drops → control plane recovers
# 4. kubectl works again → roll back config through standard channels
# 5. Monitor DNS propagation and service recovery across fleet
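
The model above is linear in node count. The incident description says the cost of each operation also scaled with cluster size, which would make aggregate load grow roughly quadratically. A hedged sketch of the difference, assuming (purely for illustration) that one call costs one work unit per node in the cluster; the 30 calls per node per second is the 3 watchers × 10 calls figure from the simplified model.

```python
def linear_load(nodes: int, calls_per_node: int = 30) -> int:
    """Aggregate load if every call costs the same (the simplified model)."""
    return nodes * calls_per_node


def size_dependent_load(nodes: int, calls_per_node: int = 30) -> int:
    """Aggregate load if each call's cost is proportional to cluster size.

    Assumption for illustration: one call costs `nodes` work units, for
    example because it lists an object set that grows with the cluster.
    """
    cost_per_call = nodes
    return nodes * calls_per_node * cost_per_call


staging, production = 100, 5_000
linear_ratio = linear_load(production) / linear_load(staging)
quadratic_ratio = size_dependent_load(production) / size_dependent_load(staging)
print(f"linear model: {linear_ratio:.0f}x staging load")
print(f"size-dependent model: {quadratic_ratio:.0f}x staging load")
```

Under the linear model a 5,000-node cluster carries 50 times the staging load; under the size-dependent assumption it carries 2,500 times. Which curve applies is exactly what a small staging cluster cannot tell you.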

The Four Root Causes from OpenAI's Postmortem

(1) Staging cluster too small — the failure only manifested at production cluster sizes. (2) DNS caching masked the initial failure — services continued on stale cache entries, giving engineers a false "clean deployment" signal before cache expiry revealed the truth. (3) No canary deployment — configuration applied to all clusters simultaneously rather than validated incrementally. (4) No break-glass mechanism — no pre-arranged out-of-band access path for the scenario where the standard Kubernetes management plane was unavailable.
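
Root cause (1) suggests measuring how load grows with cluster size instead of trusting a single small-cluster result. One rough approach, sketched here under the assumption that load follows a power law in node count, is to fit the exponent from two measurements and project to production size. The measurements below are invented for illustration, and `fit_power_law` and `extrapolate` are hypothetical helpers.

```python
import math


def fit_power_law(n1: float, load1: float, n2: float, load2: float) -> tuple:
    """Fit load = c * nodes**k through two measured points; return (c, k)."""
    k = math.log(load2 / load1) / math.log(n2 / n1)
    c = load1 / (n1 ** k)
    return c, k


def extrapolate(c: float, k: float, nodes: float) -> float:
    """Project load at a cluster size that was not tested."""
    return c * nodes ** k


# Hypothetical measurements at two test cluster sizes (not real data).
c, k = fit_power_law(100, 300_000, 200, 1_200_000)
projected = extrapolate(c, k, 5_000)
print(f"growth exponent: {k:.2f}, projected load at 5,000 nodes: {projected:,.0f}")
```

An exponent near 1 means load grows in step with the cluster. An exponent near 2 means a cluster 50 times larger carries about 2,500 times the load, which a passing 100-node staging run would not reveal.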

Recovery steps — bypassing Kubernetes entirely:

Access individual nodes directly through the cloud provider's management console — not through Kubernetes
Manually stop the telemetry service process on each node to eliminate the API call flood
With load removed, Kubernetes API servers begin recovering
Once kubectl is functional, roll back the telemetry service configuration through standard channels
Monitor service recovery and DNS propagation across the fleet

Post-incident engineering commitments:

Immediate — locked the telemetry configuration to prevent re-deployment
Short-term — implement break-glass emergency access that functions when the K8s control plane is unavailable
Medium-term — decouple observability infrastructure from the components it monitors
Long-term — all infrastructure configuration changes use staged deployment with continuous monitoring and the ability to halt at any percentage

The iOS 18.2 coincidence

Apple shipped iOS 18.2 — which introduced ChatGPT integration into Apple Intelligence — on the same day as the outage. Millions of users who updated and then tried ChatGPT saw it was unavailable. Social media immediately speculated that the iOS update had caused the outage. OpenAI's postmortem was explicit: iOS 18.2 had nothing to do with it. The telemetry failure had already begun degrading infrastructure before the iOS update's traffic could have any effect. Correlation — especially coincidence of timing — is not causation, and attributing outage causes to the most visible concurrent event is a common and often wrong instinct.

Architecture

OpenAI's Kubernetes architecture runs the inference clusters powering ChatGPT's model serving, the API gateway, and the Sora video generation pipeline. All of them depend on the Kubernetes control plane for service discovery, DNS resolution, pod scheduling, and configuration management. When a single telemetry service configuration saturated the API servers, it took all three down simultaneously.

The Failure Chain: From Telemetry Deployment to Complete Outage

View interactive diagram on TechLogStack →


Recovery Architecture: Bypassing Kubernetes to Restore It

View interactive diagram on TechLogStack →


Why Kubernetes Control Plane Failure Is Catastrophic

The control plane manages three things that are catastrophic to lose simultaneously: DNS resolution (services find each other by name, not IP — without DNS, microservices go blind), service discovery (load balancers can't route to healthy pods without the API server updating configuration), and pod scheduling (crashed pods can't be restarted, replicas can't be scaled). In most partial failures, you lose one of these. A control plane failure loses all three — and recovery requires the control plane to function, creating a circular dependency that demands pre-arranged out-of-band access.

The staged rollout that would have caught it

A staged rollout (1 cluster → verify 30 minutes → 10% of clusters → verify → 50% → verify → 100%) would likely have caught this failure at the 1-cluster stage, provided the first cluster was large enough to show the load and the verification window outlasted the DNS cache TTLs that delayed the symptoms. One large cluster showing API server saturation is a signal. Every large cluster saturating before engineers understand why is an outage. The difference between the two outcomes is a verification window between deployment stages: time to observe behaviour before the next stage commits. OpenAI's December 11 deployment had no such window. The configuration was applied to all clusters in 29 minutes without a verification pause.
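
The gating logic can be sketched in a few lines. This is an illustration under stated assumptions, not OpenAI's deployment system: `plan_stages`, `staged_rollout`, and the `apply_change`, `is_healthy`, and `soak` callbacks are hypothetical names, and a real health check would read control plane metrics such as API server latency and error rates.

```python
from typing import Callable, List, Sequence


def plan_stages(clusters: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split clusters into stages: one cluster first, then cumulative fractions."""
    stages, deployed = [], 0
    targets = [1] + [max(1, round(len(clusters) * f)) for f in fractions]
    for target in targets:
        target = min(target, len(clusters))
        if target > deployed:
            stages.append(list(clusters[deployed:target]))
            deployed = target
    return stages


def staged_rollout(
    clusters: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak: Callable[[], None],
    fractions: Sequence[float] = (0.10, 0.50, 1.00),
) -> List[str]:
    """Apply stage by stage and halt at the first unhealthy cluster.

    Returns the clusters that received the change before the halt.
    """
    touched: List[str] = []
    for stage in plan_stages(clusters, fractions):
        for cluster in stage:
            apply_change(cluster)
            touched.append(cluster)
        soak()  # verification window; should outlast DNS cache TTLs
        if not all(is_healthy(c) for c in touched):
            return touched  # halt: later stages never receive the change
    return touched


clusters = [f"cluster-{i}" for i in range(20)]
applied = []
touched = staged_rollout(
    clusters,
    apply_change=applied.append,
    is_healthy=lambda c: c != "cluster-0",  # pretend the first cluster saturates
    soak=lambda: None,  # a real soak would wait and watch metrics
)
print(touched)
```

In this run the first cluster fails its check after the soak, so the other 19 clusters never receive the change. Re-checking every touched cluster at each gate, not only the newest stage, is a deliberate choice: delayed symptoms on earlier stages still stop the rollout.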

Lessons

Observability infrastructure is production infrastructure. A telemetry service deployed across your entire fleet has the blast radius of your entire fleet. Deploy it with the same staged rollout rigor you apply to production services: one cluster, verify, one region, verify, full fleet. The December 11 rollout applied the configuration to all clusters in 29 minutes. A staged rollout would have revealed the problem on the first cluster before it cascaded.
DNS caching (storing DNS lookup results locally for a period defined by the record's TTL) is a reliability asset that becomes a diagnostic liability during incidents. When an infrastructure change breaks DNS, services continue functioning on cached entries — masking the failure until TTLs expire. If your deployment passes initial health checks and then fails minutes later at scale, DNS cache expiry is a likely explanation. Monitor DNS resolution success rates separately from application health checks.
Build break-glass emergency access before you need it. The December 11 engineers needed to access nodes directly, bypassing the Kubernetes control plane, using mechanisms that had not been pre-arranged. Pre-arrange them. Every Kubernetes deployment should have a documented, tested procedure for accessing nodes when kubectl is unavailable. Like any emergency procedure, it must be practiced before the emergency.
Size-dependent bugs (failures manifesting only at production scale because their severity is a non-linear function of system size) cannot be caught by functional testing at reduced scale. Load test infrastructure changes against production-equivalent cluster sizes. If production-scale testing is not feasible, measure load at two or more smaller scales, such as 5% and 10% of production, so that the growth curve rather than a single data point drives the extrapolation before applying to the full fleet.
Decouple the components that manage your infrastructure from the infrastructure they manage. The Kubernetes control plane should not be the only path to emergency recovery. If the control plane fails, some emergency management capability should remain available independently of the failed layer.
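
For the DNS lesson above, resolution success can be tracked as its own signal. A minimal sketch using the standard library's `socket.getaddrinfo`; the `DnsSuccessTracker` class, the window size, and the hostnames are hypothetical, and the demo injects a fake resolver so it runs without network access.

```python
import socket
from collections import deque


class DnsSuccessTracker:
    """Rolling success rate of DNS lookups, kept apart from app health checks."""

    def __init__(self, window: int = 20, resolver=socket.getaddrinfo):
        self._results = deque(maxlen=window)  # most recent probe outcomes
        self._resolver = resolver

    def probe(self, hostname: str) -> bool:
        try:
            self._resolver(hostname, None)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        self._results.append(ok)
        return ok

    def success_rate(self) -> float:
        if not self._results:
            return 1.0  # no probes yet: nothing to report as failing
        return sum(self._results) / len(self._results)


def fake_resolver(hostname, port):
    """Stand-in resolver for the demo: names starting with 'down.' fail."""
    if hostname.startswith("down."):
        raise socket.gaierror("simulated resolution failure")
    return []


tracker = DnsSuccessTracker(window=4, resolver=fake_resolver)
for name in ["svc.internal", "down.svc.internal", "down.svc.internal", "svc.internal"]:
    tracker.probe(name)
rate = tracker.success_rate()
print(f"DNS success rate: {rate:.0%}")
```

Lookups through `socket.getaddrinfo` may themselves be answered from a local cache, so a production probe would need to query the resolver in a way that bypasses caching, or it inherits the same blind spot.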

Engineering Glossary

Break-glass mechanism — a pre-arranged, out-of-band access path to infrastructure that functions even when the standard management layer is unavailable. Named after the physical "break glass in case of emergency" safety cabinet. The absence of a break-glass mechanism was one of OpenAI's four identified root causes.

DNS caching — storing the results of DNS lookups locally for a period defined by the record's TTL (Time to Live), allowing services to resolve domain names without contacting the DNS server on every request. A reliability asset under normal conditions; a diagnostic liability that masks failures during incidents.

etcd — the distributed key-value store that backs all Kubernetes cluster state — node metadata, pod specifications, service definitions. Kubernetes API servers cannot function without access to etcd; etcd unavailability produces total control plane failure.

Kubernetes control plane — the set of components managing overall Kubernetes cluster state: the API server (handles all REST operations), etcd (state store), the scheduler (assigns pods to nodes), and the controller manager (runs reconciliation loops). Typically runs on dedicated control plane nodes (historically called master nodes), separate from the data plane nodes running actual workloads.

Locked-out effect — the circular dependency where recovering from a Kubernetes control plane failure requires kubectl, which requires a functioning control plane. The cluster is frozen in a state where existing workloads continue running but nothing can be changed, fixed, scaled, or recovered through standard channels.

Size-dependent bug — a failure that only manifests at production scale because its severity is a non-linear function of system size. A 100-node staging cluster may pass cleanly while a 5,000-node production cluster fails catastrophically: the same configuration on 50× the nodes, and more than 50× the load when each operation's cost also grows with cluster size.

Watch API — a Kubernetes API feature allowing clients to receive a stream of events as resources change. More efficient than polling, but creates a persistent connection from each watching client to the API server, consuming server resources proportional to the number of watchers. In the simplified model above, three watches per node on a 5,000-node cluster add up to 15,000 persistent connections.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community