Pawan Kumar

Posted on Jun 12 • Originally published at dheeth.blog

OpenAI Already Told Us the Kubernetes Scaling Story. Most People Just Did Not Read It Closely

#kubernetes #ai #devops #mlops

Series links

Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes

Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?

Part 4: Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes

OpenAI already told us a lot about Kubernetes and large AI workloads. They did it years ago, before everyone started calling every GPU cluster an AI platform.

The posts are not product launches. They are engineering notes: scaling Kubernetes to 2,500 nodes in 2018, then scaling Kubernetes to 7,500 nodes in 2021. Read them closely and the lesson is not "use Kubernetes because OpenAI used Kubernetes." That would be lazy.

The better lesson is sharper:

Kubernetes can run serious AI infrastructure, but only when the team understands which parts of Kubernetes help, which parts get bypassed, and which parts become load-bearing at scale.

That distinction matters. A lot of teams are now trying to serve LLMs on Kubernetes by stacking tools until the diagram looks impressive. OpenAI's posts are a useful correction. They show a system that is surprisingly pragmatic. Whole-node pods. Direct pod IPs. MPI. Blob storage. Custom health checks. API server tuning. Less magic than you might expect.

This is Part 5 of the LLMs on Kubernetes series. I am keeping it narrow on purpose. This is not another introduction to GPU scheduling, token metrics, model parallelism, or autoscaling. We already covered enough of that earlier. This one is a teardown of what OpenAI publicly said and what a smaller platform team can steal from it without pretending to be OpenAI.

Kubernetes was the substrate, not the AI platform

OpenAI's 2018 post is very clear about why Kubernetes was useful. Their largest-scale workloads still managed bare cloud VMs directly at the time, but Kubernetes gave most experiments a faster iteration cycle, reasonable scalability, and less boilerplate.

That is the first lesson.

Kubernetes did not become useful because it magically understood deep learning. It became useful because it gave researchers a common substrate for running jobs, getting capacity, and iterating without hand-rolling the same infrastructure every time.

That sounds boring. It is also the part many teams skip.

A good LLM platform should not begin with "which AI gateway should we buy?" or "which model server wins?" Those questions matter later. The first platform question is simpler: can teams reliably ask for compute, run the workload, observe it, restart it, and get their data in and out without inventing a new workflow every week?

OpenAI's Kubernetes story is not about Kubernetes replacing every ML system. It is about Kubernetes becoming the shared operating layer for a messy research environment.

For a smaller team, the translation is simple: do not copy the node count. Copy the substrate mindset.

Give teams a boring path to run workloads. Make resource ownership clear. Make failures visible. Make restart behavior predictable. Let the specialized AI tools sit on top of that instead of turning the whole cluster into a science project.

Whole-node pods were a feature, not waste

One line from OpenAI's 7,500-node post is easy to miss: for many workloads, a single pod occupied the entire node.

In normal Kubernetes land, that sounds inefficient. We are trained to think about bin packing, fragmentation, spreading, utilization, requests, limits, and squeezing many workloads onto the same pool.

Large ML jobs can invert that instinct.

OpenAI explained the reason plainly. A large machine learning job can span many nodes and run most efficiently when it has access to all hardware resources on each node. GPUs may communicate through NVLink. GPUs may communicate directly with the NIC through GPUDirect. CPU, NUMA, PCIe, and local hardware topology stop being background details and become part of the job's performance envelope.

So the pod is not a small web replica anymore. It is closer to a worker slot in a coordinated compute job.

That is a different scheduling shape.

This does not mean every LLM inference service should use one pod per node. Most teams should not start there. But it does mean you should be careful with the default Kubernetes instinct of treating every GPU node as a bin-packing target.

If your workload depends on multiple GPUs behaving as a tight local group, the cleanest unit of scheduling may be the node. If the model server expects exclusive access to the local GPU topology, pretending the node is a general shared basket can make the system worse.

The smaller-team version is not "use whole-node pods everywhere." It is this:

Know when the node is the unit of performance.

For some LLM workloads, especially larger replicas, the useful abstraction is not "one container gets one device." It is "this serving replica owns this hardware shape." Kubernetes can schedule that shape, but you have to describe it honestly.
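Here is a minimal sketch of what "describing the shape honestly" can look like. It is my illustration, not something from OpenAI's posts. It assumes the NVIDIA device plugin is installed and exposes GPUs as nvidia.com/gpu; the function name, pod name, and image are placeholders.

```shell
#!/bin/sh
# Sketch: render a Pod that claims every GPU on a node, so the node rather
# than a single device becomes the unit of placement.
# Assumptions: GPUs are exposed as nvidia.com/gpu by the NVIDIA device plugin;
# render_whole_node_pod and its arguments are hypothetical names.
# Claiming every GPU keeps other GPU pods off the node. CPU and memory
# requests would need similar treatment for true exclusivity.
render_whole_node_pod() {
  name="$1"; image="$2"; gpus_per_node="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
  labels:
    app: ${name}
spec:
  containers:
  - name: worker
    image: ${image}
    resources:
      limits:
        nvidia.com/gpu: ${gpus_per_node}
EOF
}
```

On an 8-GPU node you could try it with: render_whole_node_pod llm-replica-0 registry.example.com/model-server:latest 8 | kubectl apply -f -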

Direct pod IPs beat pretty abstractions for some jobs

OpenAI also wrote that they did not rely heavily on Kubernetes load balancing. Their biggest jobs had very little HTTPS traffic. They were not doing A/B tests, blue/green deploys, or canary rollouts inside those jobs.

Pods communicated directly with each other on pod IPs, using MPI over SSH, not service endpoints. Discovery happened once at job startup: which pods are participating in this MPI job?

That is a very different world from a stateless HTTP service behind a Service object.

This is where OpenAI's post is most useful as a mindset check. Kubernetes has beautiful abstractions for service discovery and load balancing. But not every distributed AI workload wants that abstraction in the hot path.

For tightly coordinated jobs, the membership of the group matters. Rank 0, rank 1, rank 2, and the rest are not interchangeable web replicas. If one participant disappears, the whole job may stop. If traffic gets sprayed through a generic balancing layer, the abstraction can be actively wrong.

The practical takeaway is not to bypass Kubernetes networking casually. It is to separate two kinds of traffic:

User-facing service traffic, where Services, Gateways, ingress, and load balancing make sense.
Job-internal coordination traffic, where direct pod identity and predictable peer membership may matter more.

LLM serving teams will hit this distinction as systems get more advanced. A simple single-node inference server can look like any other HTTP deployment. A multi-node replica, a distributed prefill/decode setup, or a training-style job starts to care about peer identity and network paths.

The abstraction should match the workload. That is the point.
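To make the "discover peers once at startup" idea concrete, here is a rough sketch of a one-time lookup by label selector. This is not OpenAI's tooling. The function names and the selector are assumptions for illustration, and the output is just a plain list of pod IPs, one per line, in a stable order.

```shell
#!/bin/sh
# Sketch: one-time peer lookup at job start, with no Service in the hot path.
# list_job_peers and write_hostfile are hypothetical helper names.

# Print "pod-name pod-ip" for every pod matching the selector, sorted by name
# so each participant sees the same ordering. The sort is lexical, so this
# assumes pod names sort into the intended rank order.
list_job_peers() {
  selector="$1"
  kubectl get pods -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' \
    | sort
}

# Write one peer IP per line, skipping pods that have no IP assigned yet.
write_hostfile() {
  selector="$1"; hostfile="$2"
  list_job_peers "$selector" | awk 'NF == 2 { print $2 }' > "$hostfile"
}
```

A launcher could call write_hostfile job-name=llm-train /tmp/hostfile once before starting the workers, then never ask the API server about peers again for the life of the job.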

Checkpointing was not an optimization. It was survival.

OpenAI described their largest jobs as MPI jobs where all pods participate in a single communicator. If one pod dies, the whole job halts and needs to restart. The job checkpoints regularly and resumes from the last checkpoint.

That sentence is doing a lot of work.

In a normal web service, a pod dying is usually noise. A replica disappears. The endpoint controller updates. Traffic goes somewhere else. The user might never notice.

In a large coordinated ML job, one pod dying can waste a huge amount of work. The failure model is different. Kubernetes can replace the pod, but it cannot make the lost compute free.

That is why checkpointing belongs in the infrastructure conversation. Not as a nice ML detail. As part of reliability.

If your job takes hours or days, restart behavior is not a footnote. If your model takes minutes to load, restart behavior is not a footnote. If your serving replica spans multiple GPUs or nodes, restart behavior is not a footnote.

For smaller teams, the practical version is this: test the boring failure path before production tests it for you.

Kill a pod while the model is loaded. Drain a GPU node. Interrupt a worker. Restart the model server. Break access to the model bucket. Watch what happens.

Do not only measure the happy path where the endpoint starts once and serves a prompt. Measure how long it takes to become useful again.
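One cheap way to put a number on "useful again" is to time a restart. The sketch below is an assumption-heavy starting point: it assumes the model server runs as a Deployment and that its readiness probe only passes once the model is actually loaded. The function name is hypothetical. It measures a controlled restart, not a hard kill, so treat it as the friendliest failure path.

```shell
#!/bin/sh
# Sketch: measure how long a model-serving Deployment takes to become Ready
# again after a restart. measure_recovery_seconds is a hypothetical helper.
# Assumption: readiness reflects "model loaded", not just "process started".
measure_recovery_seconds() {
  namespace="$1"; deployment="$2"; timeout="$3"
  start="$(date +%s)"
  # Trigger a restart of every pod in the Deployment.
  kubectl -n "$namespace" rollout restart "deployment/$deployment" >/dev/null || return 1
  # Block until the new pods report Ready, or give up after the timeout.
  kubectl -n "$namespace" rollout status "deployment/$deployment" \
    --timeout="$timeout" >/dev/null || return 1
  end="$(date +%s)"
  echo $((end - start))
}
```

For example, measure_recovery_seconds default model-server 30m prints the elapsed seconds. Run it a few times, with and without a warm local cache, and keep the worst number.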

OpenAI's posts make this obvious at huge scale, but the same lesson shows up much earlier. A five-node LLM platform with bad restart behavior can still ruin your day.

Blob storage carried the boring weight

Another small but important detail: most OpenAI jobs interacted with blob storage. Jobs streamed dataset shards or checkpoints from blob storage, or cached data to fast local ephemeral disks. They used PersistentVolumes only where POSIX semantics were useful, and said blob storage was more scalable and avoided slow detach/attach operations.

That is a very practical choice.

Kubernetes storage discussions often get stuck on volumes. Which CSI driver? Which storage class? ReadWriteMany or ReadWriteOnce? How do we attach it? Can the pod move?

For AI workloads, a lot of the heavy data path may be better shaped around object storage and local cache instead. Model weights, checkpoints, datasets, tokenizer files, adapters, and artifacts often move through object storage more naturally than through classic attached disks.

The smaller-team lesson is not "never use PersistentVolumes." It is to be honest about the access pattern.

If the workload needs POSIX semantics, use the right volume system. If the workload mostly needs large immutable files, checkpoints, or artifacts, object storage plus local ephemeral cache may be simpler and easier to scale.

This also affects cold starts. A model server that pulls hundreds of gigabytes from object storage on every restart is not just a pod problem. It is a storage, cache, network, and readiness problem.

Kubernetes starts the container. Your platform still has to make the model arrive.
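The object storage plus local cache pattern can be sketched in a few lines. This is my illustration, not OpenAI's code. The function name, cache path, and marker file are assumptions, and the download command is passed in by the caller so the sketch does not depend on any particular storage CLI.

```shell
#!/bin/sh
# Sketch: fill a local ephemeral cache from object storage once, then reuse it.
# ensure_model_cached is a hypothetical helper. The first argument is the cache
# directory; the remaining arguments are the download command, which is run
# with the cache directory appended as its last argument.
ensure_model_cached() {
  cache_dir="$1"; shift
  marker="$cache_dir/.complete"
  if [ -f "$marker" ]; then
    echo "cache hit: $cache_dir"
    return 0
  fi
  mkdir -p "$cache_dir" || return 1
  # The marker is written only after the download succeeds, so a partial
  # download is never mistaken for a complete model.
  "$@" "$cache_dir" || return 1
  touch "$marker"
  echo "cache filled: $cache_dir"
}
```

An init container could run something like ensure_model_cached /mnt/local-nvme/models/my-model aws s3 sync s3://example-bucket/models/my-model, where the bucket and path are placeholders. A restart on the same node then skips the download entirely.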

The API server became part of the scaling story

The most Kubernetes-specific lesson in OpenAI's posts is not about GPUs. It is about the control plane.

At 2,500 nodes, OpenAI hit etcd latency, excessive API reads, Kubernetes Event pressure, image pull issues, KubeDNS hotspots, networking limits, and even ARP cache overflow. At 7,500 nodes, they paid close attention to API server 429 and 5xx rates, ran API servers and etcd on dedicated nodes, and observed up to 70 GB of heap usage per API server.

This is the part platform teams should read twice.

At AI scale, boring Kubernetes objects become infrastructure load.

A DaemonSet that watches the API server from every node may be harmless in a small cluster and painful in a large one. A monitoring agent that polls too aggressively can become control-plane traffic. A node join event is not just a node join event when hundreds of nodes arrive together. A Service with every node behind it can create huge watch traffic if the data structure is wrong.

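OpenAI specifically called out WATCHes on Endpoints. Some services, like kubelet and node-exporter, had every node as a member. When nodes were added or removed, endpoint watches fired broadly, and because each node was itself watching those services through kube-proxy, the response traffic grew roughly with the square of the node count. They said EndpointSlices, which they date to Kubernetes 1.17 (the API was beta in that release and reached GA in 1.21), cut that load by about 1000x.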

Most teams will never see OpenAI's numbers. That does not make the lesson irrelevant.

If you are building an LLM platform, ask boring control-plane questions early:

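# API request counts by verb, resource, and response code (look for 429 and 5xx)
kubectl get --raw /metrics | grep apiserver_request_total
# Requests rejected by API Priority and Fairness
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
# Confirm Services are backed by EndpointSlices
kubectl get endpointslices -A | head
# Most recent events across all namespaces
kubectl get events -A --sort-by=.lastTimestamp | tail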

Those commands do not make your platform OpenAI-scale. They just force you to look at the control plane as part of the system.
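Raw metrics output is noisy, so a small filter helps. The sketch below sums apiserver_request_total for one response code, which is a rough way to watch the 429 and 5xx rates OpenAI mentioned. The function name is hypothetical, and it assumes the metric carries a code label, as it does in current Kubernetes releases.

```shell
#!/bin/sh
# Sketch: sum apiserver_request_total samples for one HTTP response code.
# Reads Prometheus text-format metrics on stdin.
# sum_requests_by_code is a hypothetical helper name.
sum_requests_by_code() {
  code="$1"
  awk -v code="$code" '
    # Keep only apiserver_request_total samples whose code label matches.
    index($0, "apiserver_request_total{") == 1 &&
    index($0, "code=\"" code "\"") > 0 { total += $NF }
    END { printf "%d\n", total }
  '
}
```

Usage: kubectl get --raw /metrics | sum_requests_by_code 429. The counter is cumulative since the API server started, so compare two readings over time instead of reading a single value as a rate.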

The cluster is part of the model serving path.

If the API server is slow, if node joins are noisy, if events explode, if monitoring creates too much cardinality, if image pulls block startup, your LLM platform will feel unreliable even when the model server is technically fine.

Health checks had to understand hardware

OpenAI's 7,500-node post also described passive and active health checks. Passive checks watched basic resources, network reachability, disks, GPU errors, maintenance events, and signals such as Xid errors surfaced through DCGM. Active GPU tests ran at boot through a preflight taint and label, then periodically during node lifetime.

That is a useful pattern because it draws a line between node readiness and workload readiness.

A Kubernetes node can be Ready while still being a bad place to run an expensive GPU job. The host may be reachable. The kubelet may be alive. The node may pass generic checks. But the GPU can still be unhealthy, misbehaving, or not worth trusting for a long-running job.

OpenAI handled this by preventing normal workloads from landing until preflight passed. That is a very Kubernetes-native move: use taints and labels to keep the node out of rotation until hardware-specific validation completes.

Smaller teams should steal that idea.

You do not need OpenAI-scale automation to start. A practical GPU node readiness path can be simple:

Node joins with a temporary taint.
A validation job checks GPU visibility, driver behavior, runtime access, and a tiny CUDA or model-server smoke test.
The taint is removed only after the test passes.
Failed nodes stay out of the serving pool.
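The steps above can be sketched as a small preflight script. This is a simplified illustration, not OpenAI's implementation. It assumes the node registered with a NoSchedule taint under a hypothetical key (gpu-preflight), that the script runs on the node itself (for example in a pod that tolerates the taint) with permission to modify nodes, and that counting GPUs with nvidia-smi is an acceptable smoke test. A real check would probably run a tiny CUDA or model-server workload too.

```shell
#!/bin/sh
# Sketch: admit a GPU node only after a local smoke test passes.
# TAINT_KEY, gpu_smoke_test, and admit_or_quarantine are hypothetical names.
TAINT_KEY="gpu-preflight"
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Pass if nvidia-smi lists at least the expected number of GPUs.
gpu_smoke_test() {
  expected="$1"
  found="$("$NVIDIA_SMI" --list-gpus 2>/dev/null | grep -c '^GPU ' || true)"
  [ "$found" -ge "$expected" ]
}

# Remove the preflight taint on success; on failure, leave the taint in place
# and label the node so it is easy to find later.
admit_or_quarantine() {
  node="$1"; expected_gpus="$2"
  if gpu_smoke_test "$expected_gpus"; then
    kubectl taint nodes "$node" "${TAINT_KEY}:NoSchedule-"
  else
    kubectl label nodes "$node" "${TAINT_KEY}=failed" --overwrite
    return 1
  fi
}
```

Calling admit_or_quarantine "$NODE_NAME" 8 on an 8-GPU node either removes the taint or leaves the node fenced off. Running the same test on a schedule, and putting the taint back when it fails, covers the "trust over time" half.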

This is not the same as Part 4's GPU setup checklist. Part 4 was about preparing the node. This is about trust over time.

The question changes from "can Kubernetes see the GPU?" to "should we let this node take expensive work right now?"

Do not copy OpenAI. Copy the questions.

The wrong response to OpenAI's posts is to imitate the architecture directly.

Most teams do not need 7,500-node clusters. They do not need to expose every pod CIDR to researchers. They do not need the same networking model, the same quota system, the same per-team node tainting service, or the same autoscaling behavior.

But they do need the questions OpenAI was forced to answer:

What is the real unit of scheduling for this workload?
Does this job want service load balancing, or direct peer identity?
What happens when one participant dies?
Where do checkpoints, weights, and datasets actually live?
Which agents talk to the API server from every node?
What happens if hundreds of pods or nodes appear at once?
Can a node be Ready but still unsafe for GPU work?
Which metrics are useful, and which metrics are just expensive noise?

That list is more valuable than the node count.

OpenAI's posts are interesting because they make Kubernetes look both powerful and very ordinary. Kubernetes handled huge clusters, but not by pretending AI workloads were normal web apps. The team shaped the platform around the workload.

That is the real takeaway.

Do not turn Kubernetes into a shrine. Do not turn OpenAI into a cargo cult. Use Kubernetes where it gives you a common substrate, clear scheduling boundaries, and operational leverage. Bypass or reshape the abstractions when the workload proves they are wrong.

The cluster is not just where the model runs. At LLM scale, the cluster becomes part of the model's behavior.

In the next part, we will look at another public signal from the frontier labs: Claude on EKS, Trainium, and the rise of the AI megacluster.

If you are building or evaluating LLM serving on Kubernetes, subscribe to follow the rest of the series. I am also putting together a free LLM Serving on Kubernetes Production Readiness Checklist so teams can sanity-check GPU nodes, model loading, observability, scaling, cost, and failure recovery before production does it for them.