OpenAI · Reliability · 21 May 2026
On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, the rollout had reached every cluster and taken down the Kubernetes control plane in most of the large ones. ChatGPT, the API, and Sora were all degraded or unavailable for over four hours. The engineers trying to fix it couldn't run kubectl: the control plane that manages the clusters was down, and it was the only standard way back in.
- 3:16 PM → 7:38 PM PST (4h 22min)
- ALL OpenAI services affected
- Kubernetes control plane down in most large clusters
- DNS resolution broken — cascading failure
- Engineers locked out of clusters (no kubectl)
- iOS 18.2 launched same day — coincidence
The Story
There is a particular category of failure that strikes harder than a random bug or infrastructure crash: the failure of a system deployed specifically to prevent other failures. On December 11, 2024, OpenAI deployed a new telemetry service designed to improve observability of their Kubernetes control planes — to give engineers better visibility into how their clusters were behaving, to catch problems earlier. Within minutes of deployment, the telemetry service was itself the problem. By 3:16 PM PST, every OpenAI service was degraded or completely unavailable. ChatGPT. The API. Sora. All down. And the engineers responsible for fixing it had a problem that made everything worse: the Kubernetes control plane — the system through which you manage Kubernetes — was itself down, which meant engineers couldn't run kubectl. The tools for recovery depended on the infrastructure that had failed.
Our tests didn't catch the impact the change was having on the Kubernetes control plane. DNS caching added a delay between making the change and when services started failing. Remediation was very slow because of the locked out effect.
Source: OpenAI, December 11, 2024 Incident Postmortem, status.openai.com
The events unfolded with the particular cruelty of incidents where staging does not predict production. On December 10, the telemetry service was deployed to a staging cluster and verified as working correctly. On December 11 at 2:23 PM, the code was merged and the deployment pipeline triggered. From 2:51 PM to 3:20 PM, the change was applied to all production clusters. At 3:16 PM, four minutes before the rollout was even complete, all OpenAI products began degrading. The root cause was a configuration in the new service that caused every node in every cluster to execute resource-intensive Kubernetes API operations simultaneously. The cost of these operations scaled with the size of the cluster, which meant the largest clusters, which also happened to be the most critical, were hit hardest and fastest.
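The intervals in this timeline are easy to misread, so here is a quick check of the arithmetic. It uses only the timestamps quoted in this article; the helper name `at` and the variable names are illustrative.

```python
from datetime import datetime

DAY = "2024-12-11 "
FMT = "%Y-%m-%d %I:%M %p"


def at(clock: str) -> datetime:
    """Parse a wall-clock time on December 11, 2024 (PST, as quoted in the article)."""
    return datetime.strptime(DAY + clock, FMT)


rollout_start = at("2:51 PM")
rollout_end = at("3:20 PM")
impact_start = at("3:16 PM")
full_recovery = at("7:38 PM")

rollout_minutes = (rollout_end - rollout_start).seconds // 60
minutes_to_impact = (impact_start - rollout_start).seconds // 60
minutes_left_in_rollout = (rollout_end - impact_start).seconds // 60
outage = full_recovery - impact_start

print(rollout_minutes)          # 29
print(minutes_to_impact)        # 25
print(minutes_left_in_rollout)  # 4
print(outage)                   # 4:22:00
```

The rollout took 29 minutes end to end, customer impact began 25 minutes in, and the outage lasted 4 hours 22 minutes.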
DNS CACHING: THE HIDDEN TIME BOMB
Two factors hid the problem until it was too late. First, the staging cluster was small: the telemetry service's API load scaled with cluster size, so a small staging cluster generated manageable load and the change passed validation. Second, and more insidiously, DNS caching masked the failure in production. When the telemetry service started overwhelming the Kubernetes API servers, services that had already cached DNS responses continued functioning temporarily, because they could still reach their dependencies through cached entries. This created a delay between the moment the change was applied and the moment services began failing. Engineers saw a clean deployment, saw services continuing to function, and had no signal to stop the rollout, until the cached entries expired and services that had not yet failed began failing all at once.
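The masking effect can be reproduced with a toy TTL cache. This is a minimal sketch, not OpenAI's DNS setup: the `TTLCache` class, the hostname, the address, and the 300-second TTL are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny DNS-style cache: answers are reused until their TTL expires."""

    def __init__(self, resolve: Callable[[str], Optional[str]], ttl_seconds: int):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, int]] = {}  # name -> (address, expires_at)

    def lookup(self, name: str, now: int) -> Optional[str]:
        cached = self._entries.get(name)
        if cached and now < cached[1]:
            return cached[0]            # served from cache, upstream never consulted
        address = self._resolve(name)   # cache miss or expired entry: ask upstream
        if address is None:
            self._entries.pop(name, None)
            return None
        self._entries[name] = (address, now + self._ttl)
        return address


state = {"dns_up": True}


def upstream(name: str) -> Optional[str]:
    """Stand-in for the cluster DNS service (hypothetical address)."""
    return "10.0.0.7" if state["dns_up"] else None


cache = TTLCache(upstream, ttl_seconds=300)
cache.lookup("inference.internal", now=0)   # warm the cache while DNS is healthy
state["dns_up"] = False                     # control plane, and DNS with it, goes down

for t in (60, 299, 300):
    print(t, cache.lookup("inference.internal", now=t))
# 60 10.0.0.7
# 299 10.0.0.7
# 300 None
```

For the first 300 seconds after DNS breaks, every lookup still succeeds, so a health check that only exercises cached names reports nothing wrong.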
Problem
Telemetry Rollout to All Clusters in 29 Minutes
On December 11, 2024 at 2:51 PM PST, the new telemetry service configuration began rolling out to all Kubernetes clusters. The rollout completed at approximately 3:20 PM. The service's configuration caused every node in every cluster to issue simultaneous resource-intensive Kubernetes API calls — a load that scaled with cluster size, hitting the largest, most critical clusters hardest.
Cause
Kubernetes Control Plane Overwhelmed — DNS and Service Discovery Broken
With thousands of nodes simultaneously hammering the Kubernetes API servers, the control planes of most large clusters crashed. Kubernetes's control plane manages service discovery and DNS resolution — when it failed, services could no longer find each other. DNS cache expiry then propagated the failure to services that had been temporarily protected by stale cache entries, turning a partial degradation into a complete cascading failure.
Solution
The Locked-Out Problem: No kubectl Access
Recovery required rolling back the telemetry configuration — but rolling back Kubernetes configurations requires kubectl, which requires a functioning Kubernetes control plane. The control plane was down. Engineers were effectively locked out of the clusters they needed to fix. Recovery required out-of-band mechanisms: directly accessing nodes through cloud provider management consoles, bypassing the Kubernetes layer entirely to remove the telemetry service's configuration.
Result
4h 22min Outage, Full Postmortem Published
ChatGPT reached substantial recovery at 5:45 PM PST. Full recovery across all services was achieved at 7:38 PM PST — 4 hours and 22 minutes after the incident began. OpenAI published a detailed postmortem within days, identifying four root causes and committing to specific architectural changes including break-glass emergency access mechanisms and staged rollouts for all infrastructure changes.
📱
Apple released iOS 18.2 on December 11, 2024 — the same day as the ChatGPT outage. iOS 18.2 introduced ChatGPT integration into Apple Intelligence. The timing was spectacularly bad: millions of iPhone users who had just updated their OS to get ChatGPT access discovered ChatGPT was down. Many initially assumed the iOS update had caused the outage. OpenAI's postmortem explicitly confirmed it had not: the iOS 18.2 launch was coincidental. The real cause had nothing to do with the traffic spike from Apple users.
⚠️
The Circular Dependency That Made Recovery Hard
The deepest structural problem revealed by the December 11 outage was a circular dependency between the Kubernetes control plane and the services that depend on it. When the control plane failed, it took down: (1) DNS resolution for all services, (2) service discovery across the cluster, (3) the ability to schedule new pods or reschedule crashed ones, and (4) the primary mechanism engineers use to manage all of the above. Recovery from a Kubernetes control plane failure required access to a system that the control plane failure had disabled. This is the engineering equivalent of locking your keys inside your car — and the standard response (calling a locksmith) had not been pre-arranged.
The 'locked out effect' that OpenAI's postmortem names is a well-known failure mode in Kubernetes operations, though it often appears in less severe forms. Kubernetes is a complex distributed system where the control plane manages the state of the cluster, and the data plane (the nodes) depends on that state to function. But the management tools (kubectl, Helm, the Kubernetes API) also depend on the control plane. When the control plane goes down, the cluster enters a state where it continues running existing workloads on warm nodes (the data plane doesn't immediately die) but nothing can be changed, fixed, scaled, or recovered through standard channels. The cluster is frozen — and thawing it requires direct node access bypassing Kubernetes's own abstractions.
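One way to reason about the locked-out effect is to write down what each recovery tool depends on and ask which tools survive a given failure. The sketch below is illustrative only: the dependency map and the `cloud_console_ssh` entry are assumptions, not a description of OpenAI's tooling.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dependency map: each tool or component lists what it needs to work.
DEPENDS_ON: Dict[str, List[str]] = {
    "kubectl": ["api_server"],
    "helm": ["api_server"],
    "api_server": ["etcd"],
    "etcd": [],
    "cloud_console_ssh": [],  # out-of-band path with no Kubernetes dependency
}


def is_usable(component: str, failed: Set[str], seen: Optional[Set[str]] = None) -> bool:
    """A component is usable only if neither it nor anything it depends on has failed."""
    seen = set() if seen is None else seen
    if component in failed:
        return False
    if component in seen:  # already being checked; guards against dependency cycles
        return True
    seen.add(component)
    return all(is_usable(dep, failed, seen) for dep in DEPENDS_ON.get(component, []))


failed = {"api_server"}
for tool in ("kubectl", "helm", "cloud_console_ssh"):
    print(tool, is_usable(tool, failed))
# kubectl False
# helm False
# cloud_console_ssh True
```

Running this kind of check before an incident shows whether at least one recovery path remains when the API server is in the failed set.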
🔬
The Telemetry-Ironically-Causes-Outage Pattern
The December 11 outage belongs to a specific failure pattern category that has appeared at multiple major companies: the observability tool that causes the outage it's meant to detect. A new metrics agent deployed across a large fleet issues unexpected API calls. A distributed tracing system generates load spikes while capturing other services' load spikes. A log aggregation service fills up disk on the servers it monitors. The pattern is instructive: observability infrastructure touches every service in the fleet and therefore has a potential blast radius of everything. It requires the same staged rollout rigor as any production service deployment.
❌
The Scale of Impact
OpenAI's services reach hundreds of millions of users. On December 11, 2024, every one of them hit the same wall: ChatGPT was unavailable. Beyond consumer users, the OpenAI API powers thousands of production applications: startups that had built their products on top of it, enterprises that had integrated ChatGPT into customer support flows, developers whose production services were serving errors to their own users. A single infrastructure configuration error at OpenAI propagated into failures across the entire ecosystem of businesses built on its platform. This is the multiplier effect of platform outages.
DAYS AFTER SORA'S DEBUT
The December 11 outage came on the same day that OpenAI was also managing the pressure of Sora's recent launch — its video generation platform, which had seen immediate scaling challenges upon release. The Sora platform was itself affected by the December 11 outage (listed in the postmortem alongside ChatGPT and the API as impacted products). This confluence made December 11 OpenAI's most visible reliability day: its most-watched new product and its most-used existing product, both down simultaneously. The postmortem was unusually forthcoming about the operational context — acknowledging explicitly that the organization was managing multiple scaling challenges at once.
⏰
The most counterintuitive fact in the December 11 timeline: services began degrading at 3:16 PM, but the rollout had only started at 2:51 PM and wasn't complete until 3:20 PM. Services started failing while the rollout was still in progress. This is the DNS cache masking effect in action: the earliest-affected clusters (the large ones, which received the change first) started degrading first; clusters that received the change later showed the failure slightly later. From the engineers' monitoring dashboards, it looked like a gradual degradation, which masked the true cause until DNS caches expired everywhere and the pattern became undeniable.
The Fix
What Actually Broke and Why Recovery Took Four Hours
Understanding the December 11 recovery timeline requires understanding the specific Kubernetes failure mode. The telemetry service's configuration caused each node to watch Kubernetes API resources continuously, and the cost of each watch grew with the number of resources in the cluster. Across thousands of nodes in large clusters, these API calls compounded into an overwhelming flood. The Kubernetes API servers, the front door through which every read and write of cluster state passes, became saturated. All cluster state (node metadata, pod specifications, service definitions) lives in etcd, the distributed key-value store behind the API servers, but etcd is reachable only through them, so with the API servers unresponsive that state was effectively out of reach. As long as the telemetry load continued, the API servers couldn't recover. Without the API servers, nothing could be changed, including the telemetry configuration itself. The cluster was in a deadlock.
- 4h 22m — Total outage duration from 3:16 PM to 7:38 PM PST December 11, 2024 — the longest single outage in ChatGPT's history at the time
- 29 min — Rollout window from deployment start (2:51 PM) to completion (3:20 PM); products began degrading at 3:16 PM, 25 minutes in, fast enough that the full fleet was affected before the scope was understood
- All — Services affected simultaneously — ChatGPT, API, Sora, and all OpenAI products experienced degradation or complete unavailability at the same time
- 0 — Staging warnings — the telemetry service passed staging validation completely, because staging clusters were too small to reproduce the API call scaling behavior that took down production
# Simplified model of the failure mode: a telemetry service overwhelming the K8s API.
# Each node watches K8s API objects, so aggregate cost scales with cluster size.
# All names and numbers are illustrative, not taken from OpenAI's postmortem.

# The telemetry service configuration (simplified)
TELEMETRY_CONFIG = {
    "watch_all_pods": True,      # Watch all pod events in the cluster
    "watch_all_nodes": True,     # Watch all node events
    "watch_all_services": True,  # Watch all service definitions
    "poll_interval_ms": 100,     # Check for changes every 100 ms (aggressive)
}

WATCHERS_PER_NODE = 3  # pods, nodes, services


def api_calls_per_second(cluster_size: int) -> int:
    """Aggregate API calls/sec when every node runs the config above.

    Watches are modeled as polling here to keep the arithmetic simple.
    """
    calls_per_watcher = 1000 // TELEMETRY_CONFIG["poll_interval_ms"]  # 10 per second
    return cluster_size * WATCHERS_PER_NODE * calls_per_watcher


# Small staging cluster (100 nodes):
staging_load = api_calls_per_second(100)   # 3,000 API calls/sec: manageable

# Large production cluster (5,000 nodes):
prod_load = api_calls_per_second(5000)     # 150,000 API calls/sec: catastrophic

# Far beyond what the API servers are sized for (the ceiling varies by cluster):
#   - the API server becomes unresponsive
#   - DNS resolution breaks: services can't find each other
#   - kubectl stops working: engineers can't recover

# The fix: remove the watch configuration from the telemetry service.
#   But applying a config change needs kubectl,
#   kubectl needs a working API server,
#   and the API server is down because of the config.
#   ^ The locked-out effect ^

# Recovery path: bypass Kubernetes entirely
#   1. SSH directly to nodes via the cloud provider console
#   2. Manually stop the telemetry service process on each node
#   3. API server load drops
#   4. Control plane recovers
#   5. kubectl works again
#   6. Verify and clean up
THE FOUR ROOT CAUSES FROM OPENAI'S POSTMORTEM
OpenAI's postmortem identified four specific contributing factors: (1) The staging cluster was too small to reproduce the load scaling behavior — the failure only manifested at production cluster sizes. (2) DNS caching masked the initial failure — services continued functioning on stale cache entries, giving engineers a false signal that the deployment was clean before cache expiry revealed the truth. (3) No canary deployment — the configuration was applied to all clusters simultaneously rather than validated incrementally on one cluster first. (4) No break-glass mechanism — there was no pre-arranged out-of-band access path for exactly this class of failure where the standard Kubernetes management plane was unavailable.
ℹ️
The Recovery Steps
OpenAI's engineers recovered the cluster through a sequence that bypassed Kubernetes abstractions entirely: Step 1 — access individual nodes directly through the cloud provider's management console (not through Kubernetes), bypassing the downed control plane. Step 2 — manually stop the telemetry service process on each node to eliminate the API call flood. Step 3 — with load removed, Kubernetes API servers began recovering. Step 4 — once kubectl was functional, roll back the telemetry service configuration through standard channels. Step 5 — monitor service recovery and DNS propagation across the fleet. Each step added latency because it required manual execution across thousands of nodes.
✅
Post-Incident Actions: Four Engineering Commitments
OpenAI's postmortem committed to four concrete engineering changes: (1) Immediate: locked the telemetry configuration to prevent re-deployment. (2) Short-term: implement break-glass emergency access mechanisms that function even when the Kubernetes control plane is unavailable. (3) Medium-term: decouple observability infrastructure from the components it monitors, so a failing telemetry system cannot cascade into the monitored services. (4) Long-term: all infrastructure-related configuration changes will use staged deployment with continuous monitoring and the ability to halt at any percentage. The staged rollout commitment was the same lesson Cloudflare had learned twice.
⚠️
The Kubernetes Watch API Amplification
The specific mechanism was misuse of the Kubernetes Watch API, a feature that lets clients receive a stream of events as resources change. A watch is an efficient alternative to polling, but each one holds a persistent, long-lived connection from the client to the API server, so the API server's resource use grows with the number of watchers. The telemetry service created three watch connections per node; at 5,000 nodes in a large cluster, that's 15,000 persistent watch connections. The API server, sized for far fewer concurrent operations, had to maintain all of them and also deliver the event stream updates that each watch triggered as cluster state changed.
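The amplification can be put in numbers. This is a rough sketch under stated assumptions: the three watches per node and the 5,000-node cluster come from the paragraph above, while the rate of 20 change events per second is a made-up figure for illustration.

```python
def watch_connections(nodes: int, watches_per_node: int = 3) -> int:
    """Persistent watch connections the API server must hold open."""
    return nodes * watches_per_node


def event_deliveries_per_second(nodes: int, change_events_per_second: float) -> float:
    """Each change event is sent to every node watching that resource type,
    so deliveries grow with node count times event rate."""
    return nodes * change_events_per_second


print(watch_connections(5000))                  # 15000
print(event_deliveries_per_second(5000, 20.0))  # 100000.0
```

Larger clusters also tend to change more often, so the event rate itself rises with node count, which is one reason this kind of load can grow faster than linearly.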
✅
The Immediate Locking Action
Within hours of the outage's resolution, OpenAI took one immediate action while longer-term architectural work was planned: it locked the telemetry configuration so it could not be re-deployed in its original form without an explicit manual override. This lock-before-investigation pattern is a standard SRE practice: after a configuration causes a production incident, prevent it from being accidentally reapplied during the postmortem or by a team member who doesn't yet know about the incident. The lock is a cheap, immediate mitigation that buys time for proper architectural fixes. It is the equivalent of locking out and tagging a faulty electrical circuit so that nobody can re-energize it while the repair is still in progress.
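A lock like this can be as simple as a deploy-time check against a list of known-bad configuration fingerprints. The sketch below shows one possible shape for such a guard; the `DeployGuard` class and the config fields are hypothetical and are not drawn from OpenAI's tooling.

```python
import hashlib
import json
from typing import Any, Dict, Set


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 fingerprint of a config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DeployGuard:
    """Refuses to deploy any config that has been locked after an incident."""

    def __init__(self) -> None:
        self._locked: Set[str] = set()

    def lock(self, config: Dict[str, Any]) -> None:
        self._locked.add(fingerprint(config))

    def may_deploy(self, config: Dict[str, Any], manual_override: bool = False) -> bool:
        return manual_override or fingerprint(config) not in self._locked


bad_config = {"watch_all_pods": True, "poll_interval_ms": 100}
guard = DeployGuard()
guard.lock(bad_config)

print(guard.may_deploy({"poll_interval_ms": 100, "watch_all_pods": True}))    # False
print(guard.may_deploy(bad_config, manual_override=True))                     # True
print(guard.may_deploy({"watch_all_pods": False, "poll_interval_ms": 5000}))  # True
```

Fingerprinting the canonical form means the same settings are blocked regardless of key order, while a genuinely revised configuration passes without needing the override.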
Architecture
OpenAI's Kubernetes architecture runs the inference clusters that power ChatGPT's model serving, the API gateway that handles developer requests, and the Sora video generation pipeline. All of these depend on Kubernetes's control plane for service discovery, DNS resolution, pod scheduling, and configuration management. Understanding how a single telemetry service configuration could take all of them down requires understanding both the structure of Kubernetes and the specific amplification mechanism the December 11 configuration triggered.
The Failure Chain: From Telemetry Deployment to Complete Outage
[Interactive diagram of the failure chain; not reproduced in this text version.]
Recovery Architecture: Bypassing Kubernetes to Restore It
[Interactive diagram of the recovery path; not reproduced in this text version.]
WHY KUBERNETES CONTROL PLANE FAILURE IS CATASTROPHIC
The Kubernetes control plane manages three things that are catastrophic to lose simultaneously: DNS resolution (services find each other by name, not IP — without DNS, microservices go blind), service discovery (load balancers can't route to healthy pods without the API server updating their configuration), and pod scheduling (crashed pods can't be restarted, replicas can't be scaled). In most partial failures, you lose one of these. A control plane failure loses all three. And because the standard recovery path requires the control plane to function, recovery from total control plane failure requires out-of-band mechanisms that most teams haven't pre-arranged.
⚠️
The Staging Trap: Size-Dependent Bugs
The December 11 outage is a textbook example of a size-dependent bug: a failure that only manifests at production scale. The telemetry service worked correctly in staging because staging clusters were small enough that the aggregate API call load from all nodes was within the API server's capacity. Every small-scale test passed cleanly. At production scale, with thousands of nodes instead of a small staging fleet, the same configuration produced many times the load, enough to overwhelm even a properly sized API server. Size-dependent bugs require load testing at production scale, not just functional testing at a smaller scale. The standard 'test in staging' process is insufficient for infrastructure changes whose failure modes grow with cluster size, especially when that growth is non-linear.
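To see why a small cluster can be so misleading, compare two simple load models. In the first, each node's request has a fixed cost; in the second, each node's request gets more expensive as the cluster grows, which is the kind of behavior described above. The cluster sizes are illustrative assumptions, not OpenAI's actual figures.

```python
def aggregate_cost(nodes: int, per_node_cost_scales_with_cluster: bool) -> int:
    """Relative API load in arbitrary units.

    If each node's request has constant cost, total load is linear in node count.
    If each node's request grows with cluster size (for example, listing
    every pod), total load is quadratic.
    """
    per_node_cost = nodes if per_node_cost_scales_with_cluster else 1
    return nodes * per_node_cost


staging, production = 100, 5000  # illustrative sizes

linear_ratio = aggregate_cost(production, False) // aggregate_cost(staging, False)
quadratic_ratio = aggregate_cost(production, True) // aggregate_cost(staging, True)

print(linear_ratio)     # 50
print(quadratic_ratio)  # 2500
```

Under the quadratic model, a cluster 50 times larger carries 2,500 times the load, so a staging run that uses a small share of API server capacity says very little about production.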
ℹ️
The Kubernetes Control Plane Architecture
The Kubernetes control plane is the set of components that manage the overall state of a cluster: the API server (handles all REST operations), etcd (the distributed key-value store backing all cluster state), the scheduler (assigns pods to nodes), and the controller manager (runs reconciliation loops). It is itself a distributed system running on dedicated control-plane nodes. In OpenAI's architecture, the control plane runs on separate infrastructure from the data plane nodes that run model inference. When the control plane fails, data plane nodes continue running their existing workloads (model inference pods don't immediately die) but cannot be managed: pods can't be restarted, scaled, or reconfigured. Services that depended on service discovery (which uses control-plane-managed DNS) began failing immediately. Services with static configuration or warm DNS caches survived longer before failing.
Lessons
The December 11 ChatGPT outage is among the most instructive Kubernetes incidents ever publicly documented — partly because OpenAI published a detailed postmortem, and partly because the failure pattern recurs across the industry whenever teams deploy infrastructure changes without accounting for scale-dependent behavior.
- 01. Observability infrastructure is production infrastructure. A telemetry service deployed across your entire fleet has the blast radius of your entire fleet. Deploy it with the same staged rollout rigor you apply to production services: one cluster, verify, one region, verify, full fleet. The December 11 rollout applied the configuration to all clusters in 29 minutes. A staged rollout would have revealed the problem on the first cluster before it could cascade.
- 02. DNS caching (a mechanism where the results of DNS lookups are stored locally for a period defined by the record's TTL, allowing services to resolve domain names without contacting the DNS server on every request) is a reliability asset that can become a diagnostic liability during incidents. When an infrastructure change breaks DNS, services continue functioning on cached entries — masking the failure until cache TTLs expire. If your deployment passes initial health checks and then fails minutes later at scale, DNS cache expiry is a likely explanation. Monitor DNS resolution success rates separately from application health checks.
- 03. Build break-glass emergency access before you need it. The December 11 engineers needed to access nodes directly, bypassing the Kubernetes control plane, using mechanisms that had not been pre-arranged. Pre-arrange them. Every Kubernetes deployment should have a documented, tested procedure for accessing nodes and making configuration changes when kubectl is unavailable. Like any emergency procedure, it must be practiced before the emergency.
- 04. Size-dependent bugs (failures that manifest only at production scale because their severity is a non-linear function of system size — a 100-node staging cluster may pass cleanly while a 5,000-node production cluster fails catastrophically) cannot be caught by functional testing at representative scale. Load test infrastructure changes against production-equivalent cluster sizes. If production-scale testing is not feasible, test at 10% of production scale and extrapolate load metrics before applying to the full fleet.
- 05. Decouple the components that manage your infrastructure from the infrastructure they manage. The Kubernetes control plane should not be the only path to emergency recovery. OpenAI's post-incident commitment to decoupling Kubernetes components addresses this: if the control plane fails, some emergency management capability should remain available independently of the failed layer.
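Lesson 02 suggests tracking DNS resolution success separately from application health. A minimal probe might look like the sketch below. The function names and the 0.99 alert threshold are assumptions; a real probe would also need to bypass local caches, or it would be masked in exactly the way described above.

```python
import socket
from typing import Callable, Iterable


def default_resolver(hostname: str) -> bool:
    """Return True if the name resolves via the system resolver."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def dns_success_rate(hostnames: Iterable[str],
                     resolves: Callable[[str], bool] = default_resolver) -> float:
    """Fraction of hostnames that resolved; 1.0 means every lookup succeeded."""
    names = list(hostnames)
    if not names:
        return 1.0
    return sum(1 for name in names if resolves(name)) / len(names)


def should_alert(rate: float, threshold: float = 0.99) -> bool:
    """Alert when the success rate drops below the threshold."""
    return rate < threshold


# Example with a stand-in resolver so the check runs without network access.
fake_resolver = lambda host: host != "broken.internal"
rate = dns_success_rate(["a.internal", "b.internal", "broken.internal", "c.internal"],
                        fake_resolver)
print(rate, should_alert(rate))  # 0.75 True
```

Passing the resolver in as a parameter keeps the probe testable and makes it easy to swap in a resolver that queries the cluster DNS service directly.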
ℹ️
The iOS 18.2 Coincidence
Apple shipped iOS 18.2 — which introduced ChatGPT integration into Apple Intelligence — on the same day as the outage. Millions of users who updated and then tried ChatGPT saw it was unavailable. Social media immediately speculated that the iOS update had caused the outage. OpenAI's postmortem was explicit: iOS 18.2 had nothing to do with the outage. The telemetry service failure had already begun degrading the infrastructure before the iOS update's traffic could have any effect. The coincidence is a useful reminder that correlation — especially coincidence of timing — is not causation, and that attributing outage causes to the most visible concurrent event is a common and often wrong instinct.
THE STAGED ROLLOUT THAT WOULD HAVE CAUGHT IT
A staged rollout with the pattern 1 cluster → verify 30 minutes → 10% of clusters → verify 30 minutes → 50% → verify → 100% would have caught this failure at the 1-cluster stage, provided that first cluster was large enough to show the load. One large cluster showing API server saturation is a signal. Every cluster crashing before engineers even understood why is an outage. The difference between the two outcomes is the presence of a verification window between deployment stages: time in which the system's behavior can be observed before the next stage commits. OpenAI's December 11 deployment had no such window: the configuration was applied to all clusters in 29 minutes without a verification pause.
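The staged pattern described above can be sketched as a small loop with a verification gate between stages. This is an illustration, not OpenAI's deployment pipeline: `plan_stages`, `staged_rollout`, and the `apply`, `healthy`, and `wait` callbacks are hypothetical names, and the stage fractions mirror the 1 cluster, 10%, 50%, 100% pattern.

```python
from typing import Callable, List, Sequence


def plan_stages(clusters: Sequence[str],
                fractions: Sequence[float] = (0.1, 0.5, 1.0)) -> List[List[str]]:
    """Split clusters into cumulative stages: one canary first, then each fraction."""
    stages: List[List[str]] = []
    done = 0
    targets = [1] + [max(1, round(len(clusters) * f)) for f in fractions]
    for target in targets:
        target = min(target, len(clusters))
        if target > done:
            stages.append(list(clusters[done:target]))
            done = target
    return stages


def staged_rollout(clusters: Sequence[str],
                   apply: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   wait: Callable[[], None]) -> List[str]:
    """Apply stage by stage, pause, then verify; halt at the first unhealthy stage.

    Returns the clusters that received the change.
    """
    changed: List[str] = []
    for stage in plan_stages(clusters):
        for cluster in stage:
            apply(cluster)
            changed.append(cluster)
        wait()  # verification window, for example 30 minutes
        if not all(healthy(c) for c in changed):
            break  # halt: later stages never receive the change
    return changed


clusters = [f"cluster-{i}" for i in range(10)]
touched = staged_rollout(clusters, apply=lambda c: None,
                         healthy=lambda c: False,  # the change breaks whatever it touches
                         wait=lambda: None)
print(touched)  # ['cluster-0']
```

With a change that breaks every cluster it reaches, only the canary is affected. Because this failure was size-dependent, the canary would need to be one of the large clusters for the gate to trip.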
OpenAI deployed a service to watch their Kubernetes clusters more carefully, and the service watched so carefully it broke every cluster simultaneously — which is either ironic or a very expensive way to learn that observability infrastructure has a blast radius.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.