<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: saiyam1814</title>
    <description>The latest articles on DEV Community by saiyam1814 (@saiyam1814).</description>
    <link>https://dev.to/saiyam1814</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F225861%2F37ded022-03b0-4576-a65e-efd05897799d.jpeg</url>
      <title>DEV Community: saiyam1814</title>
      <link>https://dev.to/saiyam1814</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saiyam1814"/>
    <language>en</language>
    <item>
      <title>A Kubeconfig for GKE That Doesn't Need gcloud</title>
      <dc:creator>saiyam1814</dc:creator>
      <pubDate>Wed, 29 Apr 2026 06:23:30 +0000</pubDate>
      <link>https://dev.to/saiyam1814/a-kubeconfig-for-gke-that-doesnt-need-gcloud-5b8m</link>
      <guid>https://dev.to/saiyam1814/a-kubeconfig-for-gke-that-doesnt-need-gcloud-5b8m</guid>
      <description>&lt;p&gt;When you run &lt;code&gt;gcloud container clusters get-credentials&lt;/code&gt;, the kubeconfig it writes looks innocent — until you hand it to a teammate and they hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: exec plugin: invalid apiVersion "client.authentication.k8s.io/v1beta1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…or the classic &lt;code&gt;gke-gcloud-auth-plugin: executable not found&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's because the generated kubeconfig doesn't actually contain a credential. It contains an &lt;code&gt;exec:&lt;/code&gt; block that shells out to &lt;code&gt;gke-gcloud-auth-plugin&lt;/code&gt;, which in turn calls &lt;code&gt;gcloud&lt;/code&gt; to mint a fresh OAuth token on every kubectl call. If you look at the &lt;code&gt;users&lt;/code&gt; section of a stock GKE kubeconfig, this is what's in there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke_saiyam-project_us-east1-b_demo-test&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client.authentication.k8s.io/v1beta1&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gke-gcloud-auth-plugin&lt;/span&gt;
      &lt;span class="na"&gt;installHint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install gke-gcloud-auth-plugin for use with kubectl by following&lt;/span&gt;
        &lt;span class="s"&gt;https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin&lt;/span&gt;
      &lt;span class="na"&gt;interactiveMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfAvailable&lt;/span&gt;
      &lt;span class="na"&gt;provideClusterInfo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No token. No cert. Just "run this plugin and ask it for auth." No gcloud on the machine, no access.&lt;/p&gt;

&lt;p&gt;If you want a kubeconfig that &lt;em&gt;anyone&lt;/em&gt; can use — a CI runner, a contractor's laptop, a script on a VM — you need to swap that exec-plugin auth for something self-contained. The cleanest answer: a Kubernetes ServiceAccount and a bearer token.&lt;/p&gt;

&lt;p&gt;Here's the full flow, run end-to-end against a live GKE cluster.&lt;/p&gt;

&lt;h2&gt;The mental model&lt;/h2&gt;

&lt;p&gt;Four pieces, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt; — a ServiceAccount in the cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions&lt;/strong&gt; — a (Cluster)RoleBinding attaching a role to that SA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential&lt;/strong&gt; — a token the SA can present to the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable config&lt;/strong&gt; — a kubeconfig file wrapping the token + cluster endpoint + CA cert&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The API server validates the token itself. No Google, no gcloud, no OAuth round-trip.&lt;/p&gt;

&lt;h2&gt;Step 1: Identity and permissions&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create serviceaccount shared-access &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system

kubectl create clusterrolebinding shared-access-binding &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clusterrole&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cluster-admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--serviceaccount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kube-system:shared-access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serviceaccount/shared-access created
clusterrolebinding.rbac.authorization.k8s.io/shared-access-binding created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth calling out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SA lives in &lt;code&gt;kube-system&lt;/code&gt; because it's a cluster-wide utility identity. The namespace doesn't restrict its access — RBAC does.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster-admin&lt;/code&gt; is &lt;code&gt;*&lt;/code&gt; on &lt;code&gt;*&lt;/code&gt;. Scope it down in production. &lt;code&gt;view&lt;/code&gt;, &lt;code&gt;edit&lt;/code&gt;, or a custom ClusterRole are usually what you actually want. If you only need namespace-scoped access, use a &lt;code&gt;RoleBinding&lt;/code&gt; in that namespace instead of a &lt;code&gt;ClusterRoleBinding&lt;/code&gt;, as in the sketch below.&lt;/li&gt;
&lt;/ul&gt;
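
&lt;p&gt;A minimal sketch of that namespace-scoped variant, assuming a hypothetical &lt;code&gt;staging&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Grants the SA the built-in "edit" role inside "staging" only;
# nothing cluster-wide.
kubectl create rolebinding shared-access-staging \
  --clusterrole=edit \
  --serviceaccount=kube-system:shared-access \
  --namespace=staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;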

&lt;h2&gt;Step 2: Mint a long-lived token&lt;/h2&gt;

&lt;p&gt;Before Kubernetes 1.24, creating a ServiceAccount automatically created a companion Secret with a non-expiring token. That was removed — long-lived bearer tokens are a security footgun — so now you opt in explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
apiVersion: v1
kind: Secret
metadata:
  name: shared-access-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: shared-access
type: kubernetes.io/service-account-token
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;secret/shared-access-token created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is in two fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;type: kubernetes.io/service-account-token&lt;/code&gt;&lt;/strong&gt; — tells the token controller (built into &lt;code&gt;kube-controller-manager&lt;/code&gt;) "I'm a Secret you should populate."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;kubernetes.io/service-account.name&lt;/code&gt; annotation&lt;/strong&gt; — tells it &lt;em&gt;which&lt;/em&gt; ServiceAccount's identity to embed in the token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wait a couple of seconds, then inspect the Secret — the controller has filled in the data for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret shared-access-token &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ca.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVMVENDQXBXZ0F3SUJB...&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a3ViZS1zeXN0ZW0=&lt;/span&gt;
  &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklrWnNZMkk0VFRkWmFrVjN...&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-access&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;9e8d4bdb-46ea-4893-9306-d56bea6aa304&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-access-token&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/service-account-token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three fields got populated by the controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.data.token&lt;/code&gt; — a signed JWT, the actual bearer credential&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.data.ca.crt&lt;/code&gt; — the cluster's CA certificate (so your client can trust the API server's TLS)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.data.namespace&lt;/code&gt; — the SA's namespace&lt;/li&gt;
&lt;/ul&gt;
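
&lt;p&gt;Curious what's inside that JWT? A sketch that decodes the claims payload (the second dot-separated segment; it's base64url and may lack padding, hence the &lt;code&gt;tr&lt;/code&gt; and the silenced stderr):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret shared-access-token -n kube-system \
  -o jsonpath='{.data.token}' | base64 -d \
  | cut -d '.' -f2 | tr '_-' '/+' | base64 -d 2&amp;gt;/dev/null; echo
# Expect claims like "iss": "kubernetes/serviceaccount" and
# "sub": "system:serviceaccount:kube-system:shared-access".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;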

&lt;blockquote&gt;
&lt;p&gt;If you'd rather have a short-lived token, skip the Secret and run &lt;code&gt;kubectl create token shared-access -n kube-system --duration=24h&lt;/code&gt;. Good for automation that rotates. Bad for a "hand someone a file" use case, which is what we're doing here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Step 3: Extract the three things a kubeconfig needs&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SERVER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl config view &lt;span class="nt"&gt;--minify&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.clusters[0].cluster.server}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get secret shared-access-token &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.ca\.crt}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get secret shared-access-token &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.token}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SERVER = &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SERVER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CA     = &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CA&lt;/span&gt;:0:60&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TOKEN  = &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;:0:40&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;SERVER&lt;/span&gt; = &lt;span class="n"&gt;https&lt;/span&gt;://&lt;span class="m"&gt;35&lt;/span&gt;.&lt;span class="m"&gt;196&lt;/span&gt;.&lt;span class="m"&gt;129&lt;/span&gt;.&lt;span class="m"&gt;174&lt;/span&gt;
&lt;span class="n"&gt;CA&lt;/span&gt;     = &lt;span class="n"&gt;LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0WERQWERk1JSUVMVENDQXBXZ0F3SUJB&lt;/span&gt;...
&lt;span class="n"&gt;TOKEN&lt;/span&gt;  = &lt;span class="n"&gt;eyJhbGciOiJSUzIsImtpZCI6IkZsY2I4TTdZ&lt;/span&gt;...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SERVER&lt;/code&gt; — the GKE API endpoint, pulled straight from your current context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CA&lt;/code&gt; — already base64, drops straight into the kubeconfig as-is&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TOKEN&lt;/code&gt; — we decode it because kubeconfig wants the raw JWT string, not base64&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Step 4: Assemble the kubeconfig&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/shared-kubeconfig.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: v1
kind: Config
clusters:
- name: cluster-1
  cluster:
    server: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SERVER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
    certificate-authority-data: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
contexts:
- name: cluster-1
  context:
    cluster: cluster-1
    user: shared-access
current-context: cluster-1
users:
- name: shared-access
  user:
    token: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A kubeconfig is three independent lists — &lt;code&gt;clusters&lt;/code&gt;, &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;contexts&lt;/code&gt; — glued together by a &lt;code&gt;context&lt;/code&gt; that names one cluster + one user. Nothing more.&lt;/p&gt;

&lt;p&gt;Notice what's &lt;em&gt;not&lt;/em&gt; in the &lt;code&gt;users&lt;/code&gt; block: no &lt;code&gt;auth-provider&lt;/code&gt;, no &lt;code&gt;exec&lt;/code&gt;. kubectl has nothing to shell out to. It just sends &lt;code&gt;Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt; on every request and the API server validates the JWT.&lt;/p&gt;
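
&lt;p&gt;You can reproduce that request with no kubectl at all. A sketch reusing the Step 3 variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Trust the cluster CA, present the bearer token, hit the API directly.
echo "${CA}" | base64 -d &amp;gt; /tmp/ca.crt
curl --cacert /tmp/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  "${SERVER}/api/v1/namespaces/kube-system/pods?limit=1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;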

&lt;h2&gt;Step 5: Prove it works without gcloud&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/shared-kubeconfig.yaml kubectl get nodes
&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/shared-kubeconfig.yaml kubectl auth &lt;span class="nb"&gt;whoami
&lt;/span&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/shared-kubeconfig.yaml kubectl auth can-i &lt;span class="s1"&gt;'*'&lt;/span&gt; &lt;span class="s1"&gt;'*'&lt;/span&gt; &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                       STATUS   ROLES    AGE   VERSION
gke-demo-test-default-pool-a5aaa3f4-jcnk   Ready    &amp;lt;none&amp;gt;   18h   v1.35.1-gke.1396002

ATTRIBUTE   VALUE
Username    system:serviceaccount:kube-system:shared-access
UID         9e8d4bdb-46ea-4893-9306-d56bea6aa304
Groups      [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated]

yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole proof. The API server sees &lt;code&gt;system:serviceaccount:kube-system:shared-access&lt;/code&gt;, not your Google identity. You can put this file on a machine that has never seen &lt;code&gt;gcloud&lt;/code&gt; in its life, and it works.&lt;/p&gt;
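
&lt;p&gt;For a harsher version of that proof, run it from a container that ships only kubectl (a sketch, assuming the &lt;code&gt;bitnami/kubectl&lt;/code&gt; image and network reach to the endpoint):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Mount the kubeconfig read-only; the image's entrypoint is kubectl,
# and there is no gcloud anywhere in it.
docker run --rm -v /tmp/shared-kubeconfig.yaml:/kc:ro \
  bitnami/kubectl:latest --kubeconfig /kc get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;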

&lt;h2&gt;Things to know before you ship this&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Private clusters still need network reachability.&lt;/strong&gt; The kubeconfig removes the auth dependency, not the network one. If your control plane is private, the recipient still needs VPN, authorized networks, or a public endpoint. The token won't help if they can't reach the API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kubeconfig is a credential.&lt;/strong&gt; Anyone with the file has whatever RBAC you bound. Store it like you'd store an SSH key or an API token. Don't commit it to Git.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revocation is deletion.&lt;/strong&gt; To kill access, delete the Secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete secret shared-access-token &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To kill it harder, also delete the binding and the SA. There's no "rotate" — you mint a new Secret and redistribute the new kubeconfig.&lt;/p&gt;
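
&lt;p&gt;A quick way to confirm the revocation took effect (the exact error wording varies by client and server version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;KUBECONFIG=/tmp/shared-kubeconfig.yaml kubectl get nodes
# error: You must be logged in to the server (Unauthorized)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;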

&lt;p&gt;&lt;strong&gt;Scope down.&lt;/strong&gt; &lt;code&gt;cluster-admin&lt;/code&gt; is the demo default, not the production default. A &lt;code&gt;RoleBinding&lt;/code&gt; to &lt;code&gt;edit&lt;/code&gt; in a single namespace is usually closer to what a real sharing use case needs. &lt;code&gt;ClusterRoleBinding&lt;/code&gt; + &lt;code&gt;cluster-admin&lt;/code&gt; only when you truly mean it.&lt;/p&gt;

&lt;h2&gt;Wrap&lt;/h2&gt;

&lt;p&gt;The trick isn't really about GKE — it's about understanding what a kubeconfig &lt;em&gt;is&lt;/em&gt;. Once you see it as a glue file between a cluster endpoint and any credential the API server will accept, the exec-plugin auth stops feeling magical and the bearer-token swap becomes obvious.&lt;/p&gt;

&lt;p&gt;Same approach works for EKS (where the plugin is &lt;code&gt;aws-iam-authenticator&lt;/code&gt; / &lt;code&gt;aws eks get-token&lt;/code&gt;), AKS (&lt;code&gt;kubelogin&lt;/code&gt;), and anything else that ships exec-based auth. Replace the &lt;code&gt;user:&lt;/code&gt; block, keep the &lt;code&gt;cluster:&lt;/code&gt; block, and you've got a kubeconfig that travels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpafjnzrsj2637fd2i4t5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpafjnzrsj2637fd2i4t5.png" alt="The swap: only the users: block changes" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gke</category>
      <category>kubeconfig</category>
      <category>kubectl</category>
    </item>
    <item>
      <title>The Ingress NGINX Migration Just Got Easier: 119 Annotations, 3 Targets, Impact Ratings</title>
      <dc:creator>saiyam1814</dc:creator>
      <pubDate>Wed, 29 Apr 2026 06:23:28 +0000</pubDate>
      <link>https://dev.to/saiyam1814/the-ingress-nginx-migration-just-got-easier-119-annotations-3-targets-impact-ratings-27mj</link>
      <guid>https://dev.to/saiyam1814/the-ingress-nginx-migration-just-got-easier-119-annotations-3-targets-impact-ratings-27mj</guid>
      <description>&lt;p&gt;A few months ago, I built &lt;a href="https://github.com/saiyam1814/ing-switch" rel="noopener noreferrer"&gt;ing-switch&lt;/a&gt; and &lt;a href="https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days" rel="noopener noreferrer"&gt;wrote about it on kubesimplify&lt;/a&gt;. The response was incredible -- people loved the annotation mapping and the visual dashboard.&lt;/p&gt;

&lt;p&gt;Since then, &lt;strong&gt;ingress-nginx was officially archived&lt;/strong&gt; (March 24, 2026). March 31 is end of life -- zero security patches after that date.&lt;/p&gt;

&lt;p&gt;Based on community feedback from KubeCon, this is the biggest update yet: &lt;strong&gt;119 annotations&lt;/strong&gt; (up from 50), &lt;strong&gt;Gateway API with Traefik as the provider&lt;/strong&gt; (the #1 request), and &lt;strong&gt;impact ratings&lt;/strong&gt; on every annotation so you know exactly what matters.&lt;/p&gt;

&lt;p&gt;This post walks through a &lt;strong&gt;complete end-to-end migration&lt;/strong&gt; on a &lt;a href="https://github.com/loft-sh/vind" rel="noopener noreferrer"&gt;vind&lt;/a&gt; cluster with actual command outputs.&lt;/p&gt;

&lt;h2&gt;Why You Need to Migrate Now&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nov 11, 2025:&lt;/strong&gt; Kubernetes SIG Network announces ingress-nginx retirement&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jan 29, 2026:&lt;/strong&gt; Joint statement from Kubernetes Steering + Security Response Committees urging immediate migration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mar 24, 2026:&lt;/strong&gt; GitHub repository archived (read-only)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mar 31, 2026:&lt;/strong&gt; End of life -- zero support from this date&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chainguard maintains a fork for CVE-level fixes only -- no features, no community PRs, no pre-built images. You're on your own.&lt;/p&gt;

&lt;h2&gt;The Three Migration Paths&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd35nbr6q84q6ltm368g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd35nbr6q84q6ltm368g8.png" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;What Changes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traefik v3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest migration, lowest friction&lt;/td&gt;
&lt;td&gt;Keep Ingress API, swap annotations to Middleware CRDs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway API (Envoy)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Future-proof standard&lt;/td&gt;
&lt;td&gt;Replace Ingresses with HTTPRoutes, Envoy policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway API (Traefik)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rancher / k3s users&lt;/td&gt;
&lt;td&gt;Standard HTTPRoutes + Gateway resources, with Traefik as the controller implementation. Advanced features (rate limiting, auth, IP filtering) use Traefik Middleware CRDs as extension policies.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;The Annotation Problem&lt;/h2&gt;

&lt;p&gt;The real complexity isn't swapping controllers -- it's the &lt;strong&gt;annotations&lt;/strong&gt;. A typical production Ingress has 10-15 NGINX annotations for SSL, auth, rate limiting, CORS, session affinity, and more.&lt;/p&gt;

&lt;p&gt;ing-switch maps &lt;strong&gt;119 annotations&lt;/strong&gt; with impact ratings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Traefik&lt;/th&gt;
&lt;th&gt;Gateway API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Supported (direct equivalent)&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial (needs minor adjustment)&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsupported (with impact notes)&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every unsupported annotation gets an &lt;strong&gt;impact rating&lt;/strong&gt;: &lt;code&gt;NONE&lt;/code&gt; (safe to ignore), &lt;code&gt;LOW&lt;/code&gt; (better defaults), &lt;code&gt;MEDIUM&lt;/code&gt; (needs workaround), or &lt;code&gt;VARIES&lt;/code&gt; (review your snippets). Most teams discover &lt;strong&gt;70%+ of "unsupported" annotations are safe to ignore&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;End-to-End Demo: vCluster + ing-switch&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://asciinema.org/a/nOYDQukAC4bzdSVI" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fasciinema.org%2Fa%2FnOYDQukAC4bzdSVI.svg" alt="asciicast" width="690" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through a complete migration on a real cluster. We'll use &lt;a href="https://www.vcluster.com/" rel="noopener noreferrer"&gt;vCluster&lt;/a&gt; to spin up a Kubernetes cluster in Docker, deploy 3 services with NGINX annotations, and migrate them to Gateway API with Traefik.&lt;/p&gt;

&lt;h3&gt;Step 1: Create a Cluster&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster create demo &lt;span class="nt"&gt;--driver&lt;/span&gt; docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;info  Using vCluster driver 'docker' to create your virtual clusters
info  Ensuring environment for vCluster demo...
done  Created network vcluster.demo
info  Starting vCluster standalone demo
done  Successfully created virtual cluster demo
info  Waiting for vCluster to become ready...
done  vCluster is ready
done  Switched active kube context to vcluster-docker_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                 STATUS   AGE
default              Active   16s
kube-flannel         Active   6s
kube-node-lease      Active   16s
kube-public          Active   16s
kube-system          Active   16s
local-path-storage   Active   6s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 2: Install Ingress NGINX&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm &lt;span class="nb"&gt;install &lt;/span&gt;ingress-nginx ingress-nginx/ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; controller.service.type&lt;span class="o"&gt;=&lt;/span&gt;ClusterIP &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; controller.admissionWebhooks.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 120s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME: ingress-nginx
LAST DEPLOYED: Sun Mar 29 11:15:57 2026
NAMESPACE: ingress-nginx
STATUS: deployed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-5486dbd97f-vc9wv   1/1     Running   0          54s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 3: Deploy 3 Apps with NGINX Annotations&lt;/h3&gt;

&lt;p&gt;We deploy three services, each with different annotation patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;App 1 -- Basic web app&lt;/strong&gt; (SSL redirect + timeouts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-connect-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;App 2 -- API with CORS + rate limiting&lt;/strong&gt; (10 annotations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-cors&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/force-ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/enable-cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com,https://admin.example.com"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-methods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PUT,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DELETE,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Authorization,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;X-API-Key"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-max-age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;86400"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-rps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-burst-multiplier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-body-size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/v1&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;App 3 -- Auth-protected dashboard&lt;/strong&gt; (external auth + IP allowlist + session affinity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dashboard&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/auth-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auth.example.com/verify"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/auth-response-headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-User-ID,X-User-Email"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/whitelist-source-range&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8,172.16.0.0/12"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cookie"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/session-cookie-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard-session"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/session-cookie-max-age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dashboard.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dashboard&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After applying all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress &lt;span class="nt"&gt;-n&lt;/span&gt; demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME        CLASS   HOSTS                   ADDRESS   PORTS   AGE
api-cors    nginx   api.example.com                   80      5s
dashboard   nginx   dashboard.example.com             80      5s
web-app     nginx   web.example.com                   80      5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                           READY   STATUS    RESTARTS   AGE
api-service-5f99b6d99d-x7vmn   1/1     Running   0          24s
dashboard-9ddbf867-7dbgf       1/1     Running   0          24s
web-app-969c76b7c-7wqw5        1/1     Running   0          24s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3 ingresses, 20 NGINX annotations, 3 services running. Now let's see what ing-switch makes of this.&lt;/p&gt;

&lt;h3&gt;Step 4: Scan the Cluster&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ing-switch -- Cluster Scan Results
  Cluster: vcluster-docker_demo

  Ingress Controller Detected
  Type:      ingress-nginx
  Version:   unknown
  Namespace: ingress-nginx

  Found 3 Ingress resource(s)

  NAMESPACE   NAME        HOSTS                   ANNOTATIONS   TLS   COMPLEXITY
  ---------   ----        -----                   -----------   ---   ----------
  demo        api-cors    api.example.com         10            no    unsupported
  demo        dashboard   dashboard.example.com   7             no    complex
  demo        web-app     web.example.com         3             no    complex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ing-switch detected the NGINX controller and found all 3 ingresses with their annotation counts and complexity scores.&lt;/p&gt;

&lt;h3&gt;Step 5: Analyze Compatibility&lt;/h3&gt;

&lt;p&gt;Let's compare all three targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traefik v3:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch analyze &lt;span class="nt"&gt;--target&lt;/span&gt; traefik
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Summary
  -------
  Total ingresses:      3
  Fully compatible:     1
  Needs workarounds:    2
  Has unsupported:      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gateway API (Envoy):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch analyze &lt;span class="nt"&gt;--target&lt;/span&gt; gateway-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Summary
  -------
  Total ingresses:      3
  Fully compatible:     0
  Needs workarounds:    3
  Has unsupported:      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gateway API (Traefik):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch analyze &lt;span class="nt"&gt;--target&lt;/span&gt; gateway-api-traefik
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Summary
  -------
  Total ingresses:      3
  Fully compatible:     0
  Needs workarounds:    3
  Has unsupported:      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key insight: &lt;strong&gt;Traefik is the highest-compatibility target&lt;/strong&gt; for this workload (1 fully compatible out of 3). The CORS annotations map directly to Traefik's Headers middleware. For Gateway API, CORS is now also fully supported thanks to the native CORS filter in Gateway API v1.5.&lt;/p&gt;

&lt;p&gt;Here's the detailed annotation mapping for the API with CORS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  demo/api-cors
  -------------
  ANNOTATION               STATUS        TARGET RESOURCE                    NOTES
  enable-cors              [supported]   HTTPRoute (CORS filter)            Native CORS filter (GA in Gateway API v1.5)
  cors-allow-origin        [supported]   HTTPRoute (CORS filter)            allowOrigins in CORS filter
  cors-allow-methods       [supported]   HTTPRoute (CORS filter)            allowMethods in CORS filter
  cors-allow-headers       [supported]   HTTPRoute (CORS filter)            allowHeaders in CORS filter
  cors-allow-credentials   [supported]   HTTPRoute (CORS filter)            allowCredentials in CORS filter
  cors-max-age             [supported]   HTTPRoute (CORS filter)            maxAge in CORS filter
  force-ssl-redirect       [supported]   HTTPRoute (RequestRedirect filter) 301 redirect to HTTPS
  limit-rps                [partial]     BackendTrafficPolicy (RateLimit)   Envoy Gateway BackendTrafficPolicy
  limit-burst-multiplier   [partial]     BackendTrafficPolicy (RateLimit)   Burst configurable but uses tokens
  proxy-body-size          [partial]     BackendTrafficPolicy               requestBuffer.limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7 out of 10 annotations are fully supported. The 3 "partial" ones work -- they just use a slightly different API.&lt;/p&gt;

&lt;h3&gt;Step 6: Generate Migration Files&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch migrate &lt;span class="nt"&gt;--target&lt;/span&gt; gateway-api-traefik &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./migration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ing-switch -- Generating Migration Files
  Target:     gateway-api-traefik
  Output dir: ./migration

  + 00-migration-report.md
  + 01-install-gateway-api-crds/install.sh
  + 02-install-traefik-gateway/helm-install.sh
  + 02-install-traefik-gateway/values.yaml
  + 03-gateway/gatewayclass.yaml
  + 03-gateway/gateway.yaml
  + 04-httproutes/demo-api-cors.yaml
  + 04-httproutes/demo-dashboard.yaml
  + 04-httproutes/demo-web-app.yaml
  + 05-policies/demo-api-cors-ratelimit.yaml
  + 05-policies/demo-dashboard-forwardauth.yaml
  + 05-policies/demo-dashboard-ipallowlist.yaml
  + 06-verify.sh
  + 07-cleanup/remove-nginx.sh
  Generated 13 files in ./migration/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 7: Inspect the Generated YAML&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GatewayClass -- points to Traefik, not Envoy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GatewayClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;controllerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik.io/gateway-controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HTTPRoute with native CORS filter&lt;/strong&gt; (no more ResponseHeaderModifier hacks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-cors&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ing-switch-gateway&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PathPrefix&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1"&lt;/span&gt;
    &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CORS&lt;/span&gt;
      &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;allowOrigins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://admin.example.com"&lt;/span&gt;
        &lt;span class="na"&gt;allowMethods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUT"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
        &lt;span class="na"&gt;allowHeaders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key"&lt;/span&gt;
        &lt;span class="na"&gt;allowCredentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;maxAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;86400s"&lt;/span&gt;
    &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Traefik Middleware CRDs&lt;/strong&gt; (not Envoy-specific policies):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Rate Limiting&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Middleware&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-api-cors-ratelimit&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rateLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;average&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="na"&gt;burst&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ForwardAuth (external authentication)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Middleware&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-dashboard-forwardauth&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;forwardAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auth.example.com/verify"&lt;/span&gt;
  &lt;span class="na"&gt;authResponseHeaders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-User-ID"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-User-Email"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# IP AllowList&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Middleware&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-dashboard-ipallowlist&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ipAllowList&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sourceRange&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;172.16.0.0/12"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Review the Migration Report
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;migrate&lt;/code&gt; command automatically generates &lt;code&gt;00-migration-report.md&lt;/code&gt; in the output directory. Open it to see the full summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ./migration/00-migration-report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# ing-switch Migration Report&lt;/span&gt;
&lt;span class="gs"&gt;**Target Controller:**&lt;/span&gt; gateway-api-traefik

&lt;span class="gu"&gt;## Summary&lt;/span&gt;
| Metric | Count |
|--------|-------|
| Total Ingresses | 3 |
| Fully Compatible | 0 |
| Needs Workarounds | 3 |
| Has Unsupported Annotations | 0 |

&lt;span class="gu"&gt;## demo/api-cors -- Needs workaround&lt;/span&gt;
| Annotation | Status | Target Resource | Notes |
|-----------|--------|-----------------|-------|
| enable-cors | OK | HTTPRoute (CORS filter) | Native CORS filter (GA in v1.5) |
| cors-allow-origin | OK | HTTPRoute (CORS filter) | allowOrigins in CORS filter |
| limit-rps | WARN | BackendTrafficPolicy | Envoy Gateway BackendTrafficPolicy |
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 9: Apply (Dry-Run First)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko8uwedoqnd5cwsbg14u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko8uwedoqnd5cwsbg14u.png" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Gateway API CRDs&lt;/span&gt;
bash ./migration/01-install-gateway-api-crds/install.sh

&lt;span class="c"&gt;# Install Traefik with Gateway API provider&lt;/span&gt;
bash ./migration/02-install-traefik-gateway/helm-install.sh

&lt;span class="c"&gt;# Dry-run all resources first&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migration/03-gateway/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;server
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migration/04-httproutes/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;server

&lt;span class="c"&gt;# If dry-run passes, apply for real&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migration/03-gateway/
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migration/04-httproutes/
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migration/05-policies/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, &lt;strong&gt;both NGINX and Traefik are running side by side&lt;/strong&gt;. DNS still points to NGINX. Production traffic is untouched.&lt;/p&gt;
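
&lt;p&gt;A quick way to sanity-check that state from the terminal. The namespaces and service names below are the usual chart defaults, so adjust them for your install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Both data planes should own their own LoadBalancer IP
kubectl get svc -n ingress-nginx ingress-nginx-controller
kubectl get svc -n traefik traefik

# The new Gateway should report Programmed=True
kubectl get gateway -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;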

&lt;h3&gt;
  
  
  Step 10: Verify and Cutover
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run the generated verification script&lt;/span&gt;
bash ./migration/06-verify.sh

&lt;span class="c"&gt;# Once verified, update DNS to Traefik's IP&lt;/span&gt;
&lt;span class="c"&gt;# Then clean up NGINX&lt;/span&gt;
bash ./migration/07-cleanup/remove-nginx.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
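
&lt;p&gt;Before moving DNS, you can also smoke-test Traefik directly by sending a &lt;code&gt;Host&lt;/code&gt; header at its IP and comparing against the live NGINX path. The IP and hostname below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Substitute Traefik's external IP and your real hostname
TRAEFIK_IP=203.0.113.10
curl -si -H "Host: app.example.com" "http://${TRAEFIK_IP}/" | head -n 5

# Production traffic still flows through NGINX via DNS
curl -si "https://app.example.com/" | head -n 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;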



&lt;h3&gt;
  
  
  Step 11: Use the Web UI
&lt;/h3&gt;

&lt;p&gt;For teams that prefer a visual workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ing-switch ui
&lt;span class="c"&gt;# Opens http://localhost:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard provides four pages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detect&lt;/strong&gt; -- Scan your cluster and see all ingresses with annotation counts and complexity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vw8va3h6ikmk8lrv0ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vw8va3h6ikmk8lrv0ey.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze&lt;/strong&gt; -- Choose between 3 targets and see the full annotation compatibility matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g2uym5x6f5b507duym8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g2uym5x6f5b507duym8.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrate&lt;/strong&gt; -- One-click generation with step-by-step checklist and dry-run buttons:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2809i0ear8nw6su69hnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2809i0ear8nw6su69hnp.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View all generated files inline with syntax highlighting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtoiacddrvk1bl3mgxfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtoiacddrvk1bl3mgxfx.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See migration gaps with impact ratings and fix instructions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllhtxig606joibkv8va1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllhtxig606joibkv8va1.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt; -- Run live cluster checks to confirm your migration phase:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8u2a34jhmpxce4e1x7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8u2a34jhmpxce4e1x7a.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster delete demo &lt;span class="nt"&gt;--driver&lt;/span&gt; docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;done  Successfully deleted virtual cluster demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Makes ing-switch Different
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ing-switch&lt;/th&gt;
&lt;th&gt;ingress2gateway&lt;/th&gt;
&lt;th&gt;Manual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Annotation coverage&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;You count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traefik Ingress target&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gateway API (Traefik)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gateway API (Envoy)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Impact ratings&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web UI&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Install scripts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification scripts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS migration guide&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dry-run mode&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Ecosystem Is Ready
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gateway API v1.5&lt;/strong&gt; -- CORS filter, TLSRoute, BackendTLSPolicy all GA&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ingress2gateway v1.0&lt;/strong&gt; -- Official tool with emitter architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traefik v3.7&lt;/strong&gt; -- Native NGINX annotation provider (80+ annotations)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Envoy Gateway v1.7&lt;/strong&gt; -- XListenerSet, enhanced policies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cert-manager v1.20&lt;/strong&gt; -- Gateway API ListenerSet support&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes 1.36&lt;/strong&gt; -- Ships April 22, first release post-NGINX archival&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tools exist. The standards are stable. The only thing left is to actually run the migration.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Star it, fork it, migrate today:&lt;/strong&gt; &lt;a href="https://github.com/saiyam1814/ing-switch" rel="noopener noreferrer"&gt;github.com/saiyam1814/ing-switch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ing-switch is open source under the MIT license. PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>gatewayapi</category>
      <category>traefik</category>
    </item>
    <item>
      <title>What Actually Happens When You Run kubectl run nginx</title>
      <dc:creator>saiyam1814</dc:creator>
      <pubDate>Wed, 29 Apr 2026 06:23:26 +0000</pubDate>
      <link>https://dev.to/saiyam1814/what-actually-happens-when-you-run-kubectl-run-nginx-34bh</link>
      <guid>https://dev.to/saiyam1814/what-actually-happens-when-you-run-kubectl-run-nginx-34bh</guid>
      <description>&lt;p&gt;So you type &lt;code&gt;kubectl run nginx --image nginx&lt;/code&gt;. One line, one pod. About a second later on a warm cluster, the pod is Running. But what actually happens behind the scenes? Let us walk through it, step by step, step by step.&lt;/p&gt;

&lt;p&gt;%[&lt;a href="https://www.youtube.com/watch?v=LLuUhU3SwJo&amp;amp;t=4s" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=LLuUhU3SwJo&amp;amp;t=4s&lt;/a&gt;] &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR, the 23 steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kubectl&lt;/code&gt; parses argv and builds a minimal Pod object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It reads &lt;code&gt;~/.kube/config&lt;/code&gt; for cluster, user, and context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A TCP connection is opened to the API server. TLS 1.3 negotiates keys in one round trip with mutual cert auth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kubectl&lt;/code&gt; sends &lt;code&gt;POST /api/v1/namespaces/default/pods&lt;/code&gt; with a JSON body over HTTP/2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The API server authenticates the caller (x509, bearer token, OIDC, or webhook).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It authorizes the request against RBAC. Can this user create pods in default?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mutating admission runs. &lt;code&gt;ServiceAccount&lt;/code&gt; injects a projected token volume, &lt;code&gt;LimitRanger&lt;/code&gt; fills in default requests and limits, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The API server defaults missing fields (DNS policy, restart policy, termination grace period) and then validates against the OpenAPI schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validating admission runs. &lt;code&gt;ResourceQuota&lt;/code&gt;, &lt;code&gt;PodSecurity&lt;/code&gt;, any &lt;code&gt;ValidatingAdmissionWebhook&lt;/code&gt;, and the built in &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; CEL engine (GA since 1.30).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The API server writes to etcd via Raft. Leader replicates, followers fsync, a majority acks, and only then does the pod exist.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The API server returns &lt;code&gt;201 Created&lt;/code&gt;. &lt;code&gt;kubectl&lt;/code&gt; prints &lt;code&gt;pod/nginx created&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Watch fanout. Every component holding an open watch stream (scheduler, kubelets, controllers) is notified within milliseconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The scheduler runs Filter plugins. &lt;code&gt;NodeResourcesFit&lt;/code&gt;, &lt;code&gt;NodeAffinity&lt;/code&gt;, &lt;code&gt;TaintToleration&lt;/code&gt;, &lt;code&gt;PodTopologySpread&lt;/code&gt;, &lt;code&gt;VolumeBinding&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It runs Score plugins. &lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt;, &lt;code&gt;ImageLocality&lt;/code&gt;, &lt;code&gt;InterPodAffinity&lt;/code&gt;, &lt;code&gt;NodeAffinity&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The winning node gets picked. Scheduler POSTs to &lt;code&gt;/pods/nginx/binding&lt;/code&gt;, which updates &lt;code&gt;spec.nodeName&lt;/code&gt;. One more etcd write.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The kubelet on that node sees the bound pod through its watch. &lt;code&gt;syncPod&lt;/code&gt; fires.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubelet calls the container runtime over CRI (&lt;code&gt;RunPodSandbox&lt;/code&gt;). containerd creates the pause container, PID 1, calling &lt;code&gt;pause(2)&lt;/code&gt; and holding the pod's network namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CNI plugin (Calico, Flannel, Cilium, your choice) runs ADD. It creates a veth pair, allocates an IP from the pod CIDR, programs routes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Image pull. containerd fetches the manifest, then the layers, verifying each with SHA-256.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Container create. The runtime stacks image layers with overlayfs, writes the OCI runtime spec, and asks runc to create.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;runc takes over. &lt;code&gt;clone3&lt;/code&gt; with namespace flags (PID, mount, UTS, IPC), &lt;code&gt;setns&lt;/code&gt; into the sandbox's network namespace, mount &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;pivot_root&lt;/code&gt;, drop capabilities, apply the seccomp filter, &lt;code&gt;execve&lt;/code&gt; nginx.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubelet's PLEG notices the container started. Most clusters still poll the runtime every second. Evented PLEG is the newer event stream version but it is still alpha in 1.36, so don't assume it is on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The status manager patches &lt;code&gt;pod.status&lt;/code&gt; to Running back to the API server. Done.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting the stage
&lt;/h2&gt;

&lt;p&gt;I teach Kubernetes on the &lt;a href="https://www.youtube.com/@kubesimplify" rel="noopener noreferrer"&gt;Kubesimplify YouTube&lt;/a&gt; channel, and I still get asked the same question in workshops. What actually happens when I run &lt;code&gt;kubectl run&lt;/code&gt;? Most answers stop at "the API server writes to etcd and the scheduler picks a node." That is true, but it is the one line summary of a story that has twenty-three chapters.&lt;/p&gt;

&lt;p&gt;So this post is the long form of the six-minute video I just shipped, paired with an &lt;a href="https://kubernetes-explained.vercel.app/pod" rel="noopener noreferrer"&gt;interactive site&lt;/a&gt; you can scrub through step by step. If you are a platform engineer who already knows what a pod is, my goal is that by the end of this you can name the plugins, the syscalls, the admission chain order, and the CRI calls. And you should be able to point at the Kubernetes source tree when you need to go deeper.&lt;/p&gt;

&lt;p&gt;Everything below is checked against Kubernetes 1.36.0, which shipped on April 22, 2026. Where a feature gate matters, I call the version out explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1, the client side (kubectl)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: kubectl parses your command
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl run&lt;/code&gt; is a subcommand whose job is to take sparse user input and build a valid Pod object. The code lives in &lt;code&gt;staging/src/k8s.io/kubectl/pkg/cmd/run/run.go&lt;/code&gt;. For &lt;code&gt;kubectl run nginx --image nginx&lt;/code&gt;, the object kubectl builds locally is roughly this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So notice what is not there. No &lt;code&gt;restartPolicy&lt;/code&gt;, no &lt;code&gt;dnsPolicy&lt;/code&gt;, no &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;, no &lt;code&gt;serviceAccountName&lt;/code&gt;, no &lt;code&gt;imagePullPolicy&lt;/code&gt;. kubectl deliberately sends a minimal object. All those fields are filled in by the API server during defaulting, which happens after mutating admission and before schema validation. This is the first real insight: the object you POST and the object etcd ends up storing are not the same.&lt;/p&gt;
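
&lt;p&gt;You can see the gap yourself on any cluster by diffing the client-side object against what the server stores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The object kubectl builds locally, before any server-side defaulting
kubectl run nginx --image nginx --dry-run=client -o yaml

# Create it for real, then look at what the API server filled in
kubectl run nginx --image nginx
kubectl get pod nginx -o yaml | grep -E 'restartPolicy|dnsPolicy|terminationGracePeriod|serviceAccountName|imagePullPolicy'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;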

&lt;h3&gt;
  
  
  Step 2: Reading kubeconfig
&lt;/h3&gt;

&lt;p&gt;kubectl needs to know where to send the request. It reads &lt;code&gt;~/.kube/config&lt;/code&gt; (or whatever &lt;code&gt;$KUBECONFIG&lt;/code&gt; points at) and resolves three things. The cluster (API server URL, CA bundle), the user (client certs, token, exec plugin), and the context (which cluster and user pair plus a default namespace). The logic sits in &lt;code&gt;client-go/tools/clientcmd&lt;/code&gt;. If you run &lt;code&gt;kubectl --v=8&lt;/code&gt;, you can watch this resolution happen inline.&lt;/p&gt;
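
&lt;p&gt;Two commands make that resolution visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Which context is active, and what it resolves to
kubectl config current-context
kubectl config view --minify        # only the active cluster, user, and context

# Watch client-go load the kubeconfig and assemble the request
kubectl get pods --v=8 2&gt;&amp;1 | head -n 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;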

&lt;h3&gt;
  
  
  Step 3: TCP plus TLS 1.3 handshake
&lt;/h3&gt;

&lt;p&gt;kubectl opens a TCP connection to the API server on port 6443 and runs a TLS 1.3 handshake. TLS 1.3 is important here. It negotiates keys in a single round trip (TLS 1.2 needed two), and it does so with mutual authentication when you are using a client certificate. Both sides present certs, both sides verify against a CA. Same primitives your browser uses, nothing exotic. But worth noticing because every subsequent byte rides this mTLS tunnel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: HTTP/2 POST to the API server
&lt;/h3&gt;

&lt;p&gt;kubectl serializes the pod object to JSON, not YAML. YAML is a client side convenience, the wire format is JSON by default. Then it sends &lt;code&gt;POST /api/v1/namespaces/default/pods&lt;/code&gt; over HTTP/2. Content-Type is &lt;code&gt;application/json&lt;/code&gt;. HTTP/2 matters because all the watch streams later in the story will multiplex over the same connection.&lt;/p&gt;
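
&lt;p&gt;At verbosity 9, kubectl prints the request body and a curl-style trace of this exact POST, so you can watch it happen on your own machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run nginx --image nginx --v=9 2&gt;&amp;1 \
  | grep -E 'POST|Request Body' | head -n 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;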

&lt;h3&gt;
  
  
  Step 5: Request lands at the API server
&lt;/h3&gt;

&lt;p&gt;The request hits kube-apiserver. The code path is the generic API server filter chain in &lt;code&gt;staging/src/k8s.io/apiserver/pkg/server/filters&lt;/code&gt;. Every inbound request goes through the same stack of filters in order. Panic recovery, request deadline, auditing, authentication, impersonation, authorization, admission, validation. Most of the next phase is those filters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2, the API server gate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 6: Authentication, "who are you?"
&lt;/h3&gt;

&lt;p&gt;So the API server asks the first question. Who are you? The API server has four authenticator backends chained together. x509 client certificates, bearer tokens (static, service account, or OIDC), OIDC directly (with JWT verification against the configured issuer), and authentication webhooks (the TokenReview API). The first one that returns a positive identity wins.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;kubectl&lt;/code&gt; with a standard kubeconfig, you are usually on x509. The cert you presented in the TLS handshake is reused to populate &lt;code&gt;user.Info&lt;/code&gt; with the CN as the username and the O values as groups. Code: &lt;code&gt;staging/src/k8s.io/apiserver/pkg/authentication&lt;/code&gt;.&lt;/p&gt;
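
&lt;p&gt;If your kubeconfig carries a client certificate, you can decode exactly the identity the API server will see. The jsonpath below assumes the first &lt;code&gt;users&lt;/code&gt; entry is your active one, so adjust the index if needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' \
  | base64 -d | openssl x509 -noout -subject
# The CN becomes the username, each O value becomes a group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;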

&lt;h3&gt;
  
  
  Step 7: Authorization, "can you do this?"
&lt;/h3&gt;

&lt;p&gt;With identity established, the next question. Can this user perform create on the resource pods in the namespace default? The default authorizer is RBAC, backed by &lt;code&gt;Role&lt;/code&gt;, &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;, &lt;code&gt;ClusterRoleBinding&lt;/code&gt; objects. Multiple authorizers can be chained. In managed clusters you will often see &lt;code&gt;Node,RBAC&lt;/code&gt;. The Node authorizer restricts what a kubelet can ask for, RBAC handles everything else. A single "allow" is enough. Explicit denies don't exist in RBAC.&lt;/p&gt;
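
&lt;p&gt;You can put exactly this question to the authorizer with &lt;code&gt;kubectl auth can-i&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The question the API server is answering for this request
kubectl auth can-i create pods --namespace default

# Test someone else's RBAC via impersonation
kubectl auth can-i create pods -n default --as jane --as-group dev-team

# Everything the current identity may do in the namespace
kubectl auth can-i --list --namespace default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;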

&lt;h3&gt;
  
  
  Step 8: Mutating admission
&lt;/h3&gt;

&lt;p&gt;This is the fun one. Mutating admission plugins run first, before schema validation, and they are allowed to change the object. Built-in mutators that fire for a pod create include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;. Injects the projected service account token volume and the &lt;code&gt;automountServiceAccountToken&lt;/code&gt; default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DefaultStorageClass&lt;/code&gt;, &lt;code&gt;DefaultTolerationSeconds&lt;/code&gt;, &lt;code&gt;PodNodeSelector&lt;/code&gt;, &lt;code&gt;RuntimeClass&lt;/code&gt;, depending on cluster config.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;LimitRanger&lt;/code&gt;. Applies default &lt;code&gt;resources.requests&lt;/code&gt; and limits when a &lt;code&gt;LimitRange&lt;/code&gt; exists in the namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every &lt;code&gt;MutatingAdmissionWebhook&lt;/code&gt; you have registered. Service meshes like Istio inject their sidecar here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;MutatingAdmissionPolicy&lt;/code&gt;. The CEL based in-process alternative to webhooks. This went GA (v1) in 1.36, so you no longer need a feature gate for the stable path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each plugin runs sequentially. The order that ships in the API server defaults matters. &lt;code&gt;ServiceAccount&lt;/code&gt; before &lt;code&gt;LimitRanger&lt;/code&gt;, for example. Source: &lt;code&gt;plugin/pkg/admission&lt;/code&gt; in kubernetes/kubernetes.&lt;/p&gt;
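
&lt;p&gt;A minimal sketch of &lt;code&gt;LimitRanger&lt;/code&gt; at work, in a throwaway namespace with arbitrary values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace lr-demo
kubectl apply -n lr-demo -f - &lt;&lt;EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 64Mi
    default:
      cpu: 200m
      memory: 128Mi
EOF

# The pod you create with no resources at all...
kubectl run nginx --image nginx -n lr-demo
# ...comes back with requests and limits injected by LimitRanger
kubectl get pod nginx -n lr-demo -o jsonpath='{.spec.containers[0].resources}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;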

&lt;h3&gt;
  
  
  Step 9: Schema validation
&lt;/h3&gt;

&lt;p&gt;After mutation, the API server defaults remaining missing fields (&lt;code&gt;restartPolicy: Always&lt;/code&gt;, &lt;code&gt;dnsPolicy: ClusterFirst&lt;/code&gt;, &lt;code&gt;terminationGracePeriodSeconds: 30&lt;/code&gt;, &lt;code&gt;serviceAccountName: default&lt;/code&gt;) and validates the now complete object against the OpenAPI v3 schema published at &lt;code&gt;/openapi/v3&lt;/code&gt;. Invalid names, empty required fields, wrong field types, all rejected here with a &lt;code&gt;422 Invalid&lt;/code&gt;.&lt;/p&gt;
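
&lt;p&gt;Easy to trigger on purpose. Pod names must be lowercase RFC 1123 labels, so this fails at exactly this step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run Nginx_Pod --image nginx
# Rejected with a 422: metadata.name: Invalid value: "Nginx_Pod" ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;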

&lt;h3&gt;
  
  
  Step 10: Validating admission
&lt;/h3&gt;

&lt;p&gt;Validating admission is a second admission pass that cannot mutate. Built-ins include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ResourceQuota&lt;/code&gt;. Do the namespace's quotas have room for this pod's requests?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;PodSecurity&lt;/code&gt;. Does the pod meet the restricted, baseline, or privileged profile the namespace is labeled with?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every &lt;code&gt;ValidatingAdmissionWebhook&lt;/code&gt; you have registered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt;. CEL based in-process validation, GA since 1.30. A great replacement for Kyverno or OPA in many cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here is the subtle bit. Mutating admission runs before validating admission. If a user's webhook mutates a field, the validating chain sees the mutated value, not the original. This ordering is easy to get wrong in your head, and it matters when you are writing policy.&lt;/p&gt;
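
&lt;p&gt;As a sketch of the CEL path (assuming a 1.30+ cluster; the policy name and the &lt;code&gt;team&lt;/code&gt; label are made up for illustration), a &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; plus its binding looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  validations:
  - expression: "has(object.metadata.labels) &amp;&amp; 'team' in object.metadata.labels"
    message: "every pod needs a team label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label
spec:
  policyName: require-team-label
  validationActions: ["Deny"]
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;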

&lt;h3&gt;
  
  
  Step 11: etcd plus Raft quorum
&lt;/h3&gt;

&lt;p&gt;Now the API server persists the pod. This is not a plain disk write. etcd is a Raft replicated key value store. The leader appends the entry to its Raft log, replicates to followers, each node fsyncs to disk, and only after a majority (3 of 5 in a typical HA setup) acks does the leader commit. The API server's storage layer blocks on that commit.&lt;/p&gt;

&lt;p&gt;So if you ever see API latency spike, it is almost always etcd disk latency. Check &lt;code&gt;etcd_disk_wal_fsync_duration_seconds&lt;/code&gt;. Keep that metric in your head for the next time you are debugging a slow cluster.&lt;/p&gt;
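
&lt;p&gt;On a kubeadm-style control plane, etcd usually serves its metrics on localhost. The port depends on your &lt;code&gt;--listen-metrics-urls&lt;/code&gt; flag, so treat this as a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run on a control-plane node
curl -s http://127.0.0.1:2381/metrics \
  | grep etcd_disk_wal_fsync_duration_seconds | tail -n 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;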

&lt;h3&gt;
  
  
  Step 12: 201 Created
&lt;/h3&gt;

&lt;p&gt;The API server responds &lt;code&gt;201 Created&lt;/code&gt; with the full defaulted and mutated pod object in the body. kubectl prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pod/nginx created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From your terminal's perspective, it is done. From the cluster's perspective, the real work has not started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3, the control loop hands off
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 13: Watch fanout
&lt;/h3&gt;

&lt;p&gt;Every long running component in Kubernetes holds an HTTP/2 watch stream to the API server. The scheduler watches unscheduled pods. Every kubelet watches pods bound to its node. Controllers watch their respective resources.&lt;/p&gt;

&lt;p&gt;So when a new pod is written to etcd, the API server's watch cache broadcasts the event to all subscribers. No polling, no round trips, just a chunked HTTP/2 frame per event. Milliseconds. Source: &lt;code&gt;staging/src/k8s.io/apiserver/pkg/storage/cacher&lt;/code&gt;.&lt;/p&gt;
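
&lt;p&gt;You can hold one of these streams yourself and watch the fanout happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Terminal 1: hold a watch, printing the event type for each update
kubectl get pods --watch --output-watch-events

# Terminal 2: create the pod and watch ADDED / MODIFIED events stream in
kubectl run nginx --image nginx

# Or speak to the watch cache directly over the raw API
kubectl get --raw "/api/v1/namespaces/default/pods?watch=true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;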

&lt;h3&gt;
  
  
  Step 14: Scheduler, Filter
&lt;/h3&gt;

&lt;p&gt;kube-scheduler receives the event. The pod has no &lt;code&gt;spec.nodeName&lt;/code&gt;, so it is scheduler's problem. The scheduler runs a configurable pipeline of plugins, grouped into extension points. &lt;code&gt;PreFilter&lt;/code&gt;, &lt;code&gt;Filter&lt;/code&gt;, &lt;code&gt;PostFilter&lt;/code&gt;, &lt;code&gt;PreScore&lt;/code&gt;, &lt;code&gt;Score&lt;/code&gt;, &lt;code&gt;Reserve&lt;/code&gt;, &lt;code&gt;Permit&lt;/code&gt;, &lt;code&gt;PreBind&lt;/code&gt;, &lt;code&gt;Bind&lt;/code&gt;, &lt;code&gt;PostBind&lt;/code&gt;. For filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeResourcesFit&lt;/code&gt;. The node has enough allocatable CPU, memory, and ephemeral storage for the pod's requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeAffinity&lt;/code&gt;. The pod's &lt;code&gt;nodeAffinity&lt;/code&gt; and &lt;code&gt;nodeSelector&lt;/code&gt; match the node's labels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;TaintToleration&lt;/code&gt;. The pod tolerates the node's taints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;PodTopologySpread&lt;/code&gt;. The placement respects any topology spread constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;VolumeBinding&lt;/code&gt;. All unbound PVCs can be bound to volumes reachable from this node.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;InterPodAffinity&lt;/code&gt; (at the filter level for hard constraints).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any node that fails any filter is eliminated. Plugin source: &lt;code&gt;pkg/scheduler/framework/plugins&lt;/code&gt;.&lt;/p&gt;
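
&lt;p&gt;The fastest way to see a filter verdict in the wild is to make every node fail one. The selector below matches nothing on purpose, and the node counts in the event will reflect your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run stuck --image nginx \
  --overrides='{"spec":{"nodeSelector":{"disktype":"does-not-exist"}}}'
kubectl describe pod stuck | tail -n 3
# Warning  FailedScheduling ... 0/3 nodes are available:
# 3 node(s) didn't match Pod's node affinity/selector. ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;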

&lt;h3&gt;
  
  
  Step 15: Scheduler, Score
&lt;/h3&gt;

&lt;p&gt;Surviving nodes get scored by a second set of plugins.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt;. Prefers nodes with balanced CPU and memory utilization, so you don't pack a CPU heavy pod onto an already CPU saturated node.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ImageLocality&lt;/code&gt;. Prefers nodes that already have the container image cached locally. This saves image pull time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;InterPodAffinity&lt;/code&gt;. Soft affinity and anti-affinity preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeAffinity&lt;/code&gt;. Soft (preferred) affinity terms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;TaintToleration&lt;/code&gt;. Soft toleration scoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each plugin returns a score 0 to 100 per node. Scores are normalized, weighted, and summed. Highest total wins. Ties are broken with a random pick using Go's &lt;code&gt;rand.Int()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One thing to flag here. Kubernetes 1.36 graduated Dynamic Resource Allocation (DRA) to GA. If you are scheduling GPU workloads or other devices through DRA, the scheduler's resource claim handling is now stable. Worth reading the KEP if you are running AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 16: Scheduler, Bind
&lt;/h3&gt;

&lt;p&gt;The scheduler POSTs to the binding subresource. &lt;code&gt;POST /api/v1/namespaces/default/pods/nginx/binding&lt;/code&gt; with &lt;code&gt;target.name=node-1&lt;/code&gt;. This is what actually updates &lt;code&gt;spec.nodeName&lt;/code&gt; in etcd. One more Raft write.&lt;/p&gt;

&lt;p&gt;So here is a fun detail. The scheduler never writes &lt;code&gt;spec.nodeName&lt;/code&gt; directly on the pod. It always goes through binding. This exists precisely because binding is a separate privilege you can RBAC.&lt;/p&gt;
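
&lt;p&gt;You can drive the subresource by hand. A sketch: park a pod by pointing it at a scheduler that does not exist, then bind it yourself (&lt;code&gt;node-1&lt;/code&gt; is a placeholder for a real node name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run pinned --image nginx \
  --overrides='{"spec":{"schedulerName":"no-such-scheduler"}}'

# POST a Binding object to the subresource, exactly like kube-scheduler
kubectl create --raw /api/v1/namespaces/default/pods/pinned/binding -f - &lt;&lt;EOF
{"apiVersion":"v1","kind":"Binding","metadata":{"name":"pinned"},"target":{"apiVersion":"v1","kind":"Node","name":"node-1"}}
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;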

&lt;h2&gt;
  
  
  Phase 4, the kubelet brings the pod to life
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 17: Kubelet SyncPod
&lt;/h3&gt;

&lt;p&gt;Kubelet on the bound node has been watching &lt;code&gt;pods?fieldSelector=spec.nodeName=node-1&lt;/code&gt; since startup. It sees the update, runs its pod admission checks (eviction pressure, kubelet level &lt;code&gt;PodSecurityContext&lt;/code&gt; sanity), and calls &lt;code&gt;syncPod&lt;/code&gt; in &lt;code&gt;pkg/kubelet/kubelet.go&lt;/code&gt;. SyncPod is the reconciliation loop. It compares the desired pod spec with the current runtime state and issues CRI calls to bring them into alignment.&lt;/p&gt;
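
&lt;p&gt;You can reproduce the kubelet's filtered view as a one-off list (&lt;code&gt;node-1&lt;/code&gt; is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;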

&lt;h3&gt;
  
  
  Step 18: CRI, sandbox and the pause container
&lt;/h3&gt;

&lt;p&gt;Before any app container runs, the kubelet creates a pod sandbox. It calls &lt;code&gt;RunPodSandbox&lt;/code&gt; over the CRI gRPC API on the runtime's socket (&lt;code&gt;/run/containerd/containerd.sock&lt;/code&gt; by default). containerd launches the pause container. A tiny statically linked binary whose entire job is to call &lt;code&gt;pause(2)&lt;/code&gt; and block forever as PID 1.&lt;/p&gt;

&lt;p&gt;But why? Because the pause container is what owns the pod's Linux namespaces, especially the network namespace. When you add more containers to the pod, they &lt;code&gt;setns&lt;/code&gt; into the pause container's namespaces. If an app container dies and restarts, the namespaces (and the IP) survive because pause is still there.&lt;/p&gt;
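
&lt;p&gt;With node access and &lt;code&gt;crictl&lt;/code&gt; installed, the sandbox and its pause process are easy to spot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the node that won the binding
crictl pods --name nginx            # the sandbox, with its own ID
ps -ef | grep -w /pause | grep -v grep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;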

&lt;h3&gt;
  
  
  Step 19: CNI, pod gets networking
&lt;/h3&gt;

&lt;p&gt;With the sandbox up, the runtime invokes the CNI plugin specified in &lt;code&gt;/etc/cni/net.d/*.conflist&lt;/code&gt; (whichever is lexically first). Calico, Flannel, Cilium, Weave, the plugin you installed. CNI's contract is simple. A binary that reads JSON from stdin, takes an action (&lt;code&gt;ADD&lt;/code&gt;, &lt;code&gt;DEL&lt;/code&gt;, &lt;code&gt;CHECK&lt;/code&gt;), and returns JSON to stdout. The &lt;code&gt;ADD&lt;/code&gt; call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Creates a veth pair. One end in the pod's network namespace, one end on the node.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allocates an IP from the pod CIDR. IPAM is either a local store, Kubernetes IPAM, or an external controller.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Programs routes and iptables or eBPF rules on the host.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optionally sets up sysctls inside the pod's netns.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When this returns, &lt;code&gt;kubectl get pod -o wide&lt;/code&gt; will start showing &lt;code&gt;podIP&lt;/code&gt;.&lt;/p&gt;
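
&lt;p&gt;Both halves of that handoff are observable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod nginx -o wide       # podIP populated once ADD returns

# On the node: the host side of each veth pair
ip -brief link show type veth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;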

&lt;h3&gt;
  
  
  Step 20: Image pull
&lt;/h3&gt;

&lt;p&gt;Kubelet calls &lt;code&gt;PullImage&lt;/code&gt; over CRI. containerd resolves the reference (&lt;code&gt;nginx&lt;/code&gt; to &lt;code&gt;docker.io/library/nginx:latest&lt;/code&gt;), fetches the manifest, then pulls each layer in parallel, verifying SHA-256 digests on every chunk. First pull for a popular image over broadband is a few seconds. Cached? About 100 ms. containerd just revalidates the manifest and returns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 21: Container create
&lt;/h3&gt;

&lt;p&gt;With the image unpacked, the runtime assembles the container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stacks the image layers as read only lower layers and adds a writable upper layer using overlayfs. The result is the container's rootfs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Writes the OCI runtime spec (&lt;code&gt;config.json&lt;/code&gt;). A JSON document describing every mount, every namespace flag, every capability, the seccomp profile, the apparmor profile, the cgroup limits, the user, the entrypoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creates a bundle directory containing the rootfs and &lt;code&gt;config.json&lt;/code&gt; and hands it to runc with &lt;code&gt;runc create&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OCI runtime spec lives in the &lt;code&gt;opencontainers/runtime-spec&lt;/code&gt; repo. This is the same spec Podman, CRI-O, and gVisor use. It is the portability boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 5, runc, namespaces, and the first breath
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 22: runc
&lt;/h3&gt;

&lt;p&gt;So this is the single coolest part of the whole pipeline. runc takes the bundle and does the following.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Calls &lt;code&gt;clone3&lt;/code&gt; with flags &lt;code&gt;CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC&lt;/code&gt;. On a modern kernel, &lt;code&gt;clone3&lt;/code&gt; is preferred over the older &lt;code&gt;clone&lt;/code&gt; because it takes a structured argument and supports more namespace flags cleanly. The network namespace is not created here. Instead, runc uses &lt;code&gt;setns&lt;/code&gt; to enter the sandbox's network namespace that CNI created earlier, so the new container shares the pod IP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inside the new process, mounts &lt;code&gt;/proc&lt;/code&gt; for the new PID namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pivot_root&lt;/code&gt; into the overlayfs rootfs, then unmounts the old root.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drops Linux capabilities to the OCI spec's bounding set. The default for a non-privileged container is a tight whitelist. No &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt;, no &lt;code&gt;CAP_NET_ADMIN&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applies the seccomp filter. The runtime default profile blocks around 40 syscalls, like &lt;code&gt;kexec_load&lt;/code&gt;, certain &lt;code&gt;unshare&lt;/code&gt; flags, and &lt;code&gt;bpf&lt;/code&gt; without capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joins the cgroup v2 hierarchy with the configured CPU and memory limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calls &lt;code&gt;execve&lt;/code&gt; on the container's entrypoint, &lt;code&gt;nginx -g daemon off;&lt;/code&gt;. &lt;code&gt;execve&lt;/code&gt; is the syscall that replaces the current process image with a new program while keeping the PID. This is the moment nginx is alive.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you &lt;code&gt;strace -f&lt;/code&gt; runc during create, you will see this whole dance. It is worth doing once.&lt;/p&gt;
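
&lt;p&gt;If strace feels heavy, &lt;code&gt;lsns&lt;/code&gt; on the nginx process shows the resulting namespace split. This assumes nginx runs only in this one container on the node, otherwise pick the PID by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the node
NGINX_PID=$(pgrep -o -x nginx)
lsns -p "$NGINX_PID"
# pid/mnt/uts/ipc are private; net is shared with the pause container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;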

&lt;h3&gt;
  
  
  Step 23: PLEG and the Running status
&lt;/h3&gt;

&lt;p&gt;Kubelet needs to know the container started. Historically, kubelet's PLEG (Pod Lifecycle Event Generator) polled the runtime every second via &lt;code&gt;ListContainers&lt;/code&gt;, diffed the result, and emitted events. On a big node with hundreds of pods, this was a measurable source of CPU load.&lt;/p&gt;

&lt;p&gt;So there is a newer path called Evented PLEG. It opens a CRI event stream (&lt;code&gt;ContainerEventsRequest&lt;/code&gt;) so containerd pushes events like &lt;code&gt;CONTAINER_STARTED_EVENT&lt;/code&gt; and &lt;code&gt;CONTAINER_STOPPED_EVENT&lt;/code&gt; as they happen. But here is the thing. Evented PLEG is still alpha in 1.36. It was alpha in 1.25, promoted to beta in 1.27, then reverted to alpha in 1.30 after a static pod bug. It is disabled by default. So if you are reading kubelet code today, assume the polling path is what is actually running on your cluster.&lt;/p&gt;

&lt;p&gt;When kubelet sees a new container has started (through polling or evented), the status manager computes the pod's phase as Running and patches &lt;code&gt;pod.status&lt;/code&gt; back to the API server via a JSON merge patch. Watchers (you, with &lt;code&gt;kubectl get pod -w&lt;/code&gt;) see the transition. The status patch is also the signal to any controller waiting on this pod. For example, the endpoints controller, which is about to add the pod's IP to a Service's &lt;code&gt;EndpointSlice&lt;/code&gt;.&lt;/p&gt;
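
&lt;p&gt;The transition is easy to catch from the outside (the IP below is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod nginx --watch \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,IP:.status.podIP
# nginx   Pending   &lt;none&gt;
# nginx   Running   10.244.1.23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;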

&lt;p&gt;And that is the whole journey. From &lt;code&gt;argv[1]&lt;/code&gt; in your shell to nginx serving on port 80, about a second on a warm cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kubernetes/kubernetes" rel="noopener noreferrer"&gt;kubernetes/kubernetes&lt;/a&gt;. The source tree. Start in &lt;code&gt;pkg/kubelet&lt;/code&gt;, &lt;code&gt;pkg/scheduler&lt;/code&gt;, &lt;code&gt;staging/src/k8s.io/apiserver&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kubernetes/cri-api" rel="noopener noreferrer"&gt;CRI spec&lt;/a&gt;. The gRPC contract between kubelet and the runtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/containernetworking/cni" rel="noopener noreferrer"&gt;CNI spec&lt;/a&gt;. The plugin contract for pod networking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/opencontainers/runtime-spec" rel="noopener noreferrer"&gt;OCI runtime spec&lt;/a&gt;. The container bundle and config format runc consumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/opencontainers/image-spec" rel="noopener noreferrer"&gt;OCI image spec&lt;/a&gt;. Manifests, layers, and the SHA-256 content addressable model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg" rel="noopener noreferrer"&gt;KEP-3386 Evented PLEG&lt;/a&gt;. The design doc for the CRI event driven PLEG, still alpha in 1.36.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins" rel="noopener noreferrer"&gt;kube-scheduler plugin docs&lt;/a&gt;. The official list of in-tree plugins and their extension points.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Watch and play
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Video (6 min): &lt;a href="https://youtu.be/LLuUhU3SwJo?si=GyN5qYp71OgXMWFA" rel="noopener noreferrer"&gt;What Actually Happens When You Run kubectl run nginx (23 Steps)&lt;/a&gt; on the Kubesimplify YouTube channel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive site: &lt;a href="https://kubernetes-explained.vercel.app/pod" rel="noopener noreferrer"&gt;kubernetes-explained.vercel.app/pod&lt;/a&gt;. Pause, scrub, jump to any step, copy the code for yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if you liked this, the next one in the series is the scheduler deep-dive. How &lt;code&gt;kube-scheduler&lt;/code&gt; actually decides. Subscribe on the channel so you catch it, and tell me in the comments which step surprised you. That is how I know what to unpack next.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>containers</category>
    </item>
    <item>
      <title>What Actually Happens When kube-scheduler Picks a Node (13 Stages Inside Kubernetes)</title>
      <dc:creator>saiyam1814</dc:creator>
      <pubDate>Wed, 29 Apr 2026 06:23:24 +0000</pubDate>
      <link>https://dev.to/saiyam1814/what-actually-happens-when-kube-scheduler-picks-a-node-13-stages-inside-kubernetes-3g6o</link>
      <guid>https://dev.to/saiyam1814/what-actually-happens-when-kube-scheduler-picks-a-node-13-stages-inside-kubernetes-3g6o</guid>
      <description>

&lt;p&gt;Your pod has just been written to etcd. The API server returned &lt;code&gt;201 Created&lt;/code&gt;. The pod exists. But &lt;code&gt;spec.nodeName&lt;/code&gt; is still empty, and that is the entire reason this post exists.&lt;/p&gt;

&lt;p&gt;A pod with no node is not a real workload. It is a row in a database. Something has to look at it, decide which machine should run it, and atomically claim that machine. That something is &lt;code&gt;kube-scheduler&lt;/code&gt;, and the way it makes the decision is more interesting than "pick the node with the most free CPU."&lt;/p&gt;

&lt;p&gt;There are thirteen separate stages in modern scheduling. The Filter stage alone runs fourteen in-tree plugins, each one capable of disqualifying a candidate node with a single &lt;code&gt;Unschedulable&lt;/code&gt; verdict. There is no appeal, no second chance, no "best effort." Either every plugin says yes, or that node is out.&lt;/p&gt;

&lt;p&gt;This post walks every stage end-to-end against the v1.36 source code, with verbatim outputs from a real cluster at the bottom.&lt;/p&gt;

&lt;p&gt;%[&lt;a href="https://youtu.be/N-dDSCVWdqU" rel="noopener noreferrer"&gt;https://youtu.be/N-dDSCVWdqU&lt;/a&gt;] &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR, the 13 stages
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cub5gs94ac0vgw1c5ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cub5gs94ac0vgw1c5ub.png" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PreEnqueue.&lt;/strong&gt; Gating plugins decide if the pod is even allowed into the queue. SchedulingGates lives here. If a gate is set, the pod waits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QueueSort.&lt;/strong&gt; The activeQ orders pods by priority. Higher priority first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PreFilter.&lt;/strong&gt; Eleven plugins precompute what the pod actually wants. Resources, affinity terms, topology spread, all stashed in CycleState. Compute once, read many times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter.&lt;/strong&gt; Fourteen plugins each test every node in parallel. NodeUnschedulable, NodeName, TaintToleration, NodeAffinity, NodePorts, NodeResourcesFit, VolumeRestrictions, NodeVolumeLimits, VolumeBinding, VolumeZone, PodTopologySpread, InterPodAffinity, DynamicResources, NodeDeclaredFeatures. One Unschedulable verdict and the node is out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostFilter.&lt;/strong&gt; Only fires if every node failed Filter. DefaultPreemption asks, "if I evicted some lower priority pods, could this one fit?" If yes, it picks victims and the pod retries next cycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PreScore.&lt;/strong&gt; Same trick as PreFilter. Plugins that do heavy per node work during scoring precompute once and cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Score.&lt;/strong&gt; Nine plugins rate every surviving node, zero to one hundred. In parallel. Each plugin has a weight. TaintToleration is three. NodeAffinity, InterPodAffinity, PodTopologySpread, DynamicResources are all two. Rest are one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NormalizeScore.&lt;/strong&gt; Rescales every plugin's output. Then for each node, multiply scores by weights, add it all up. Highest sum wins. Ties? Go's &lt;code&gt;rand.Int&lt;/code&gt;. Yes, random. Deterministic ties would hot spot the same node every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve.&lt;/strong&gt; The scheduler subtracts the pod's requests from the winning node's in memory snapshot. So the next pod in the same cycle sees that node as already loaded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permit.&lt;/strong&gt; A hook. A plugin can Approve, Wait, or Reject. Stock cluster, no op. But Kueue, Volcano, Coscheduling all wait here for gang scheduling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PreBind.&lt;/strong&gt; Last chance to do work before the API server gets told. VolumeBinding finalizes PVC binds here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bind.&lt;/strong&gt; The DefaultBinder updates &lt;code&gt;spec.nodeName&lt;/code&gt; on the pod via the API server. Now etcd has the assignment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostBind.&lt;/strong&gt; Cleanup. The pod is gone from the scheduler's queue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the whole walkthrough. The rest of this post is the part that does not fit in a tweet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scheduling framework
&lt;/h2&gt;

&lt;p&gt;Since Kubernetes 1.19, kube-scheduler has been built on top of the &lt;strong&gt;scheduling framework&lt;/strong&gt; (KEP-624, beta in 1.18, GA in 1.19). The core of the binary is small and intentionally dumb. All of the actual decision-making lives in plugins, registered at well-defined extension points.&lt;/p&gt;

&lt;p&gt;This separation is what makes the rest of the ecosystem possible. You can disable plugins. You can write your own as a Go module or behind a webhook. You can run multiple scheduler profiles side by side and let pods pick one with &lt;code&gt;spec.schedulerName&lt;/code&gt;. Most installations never touch the configuration, but if you have ever wondered how Volcano, Kueue, or Coscheduling plug into the scheduler without forking it, this is the answer: they register against the framework's extension points and the core just calls them at the right time.&lt;/p&gt;

&lt;p&gt;The thirteen extension points are not arbitrary. Each one corresponds to a moment in the pod's lifecycle where it makes sense to ask plugins a question. &lt;em&gt;Should this pod even enter the queue?&lt;/em&gt; That is &lt;code&gt;PreEnqueue&lt;/code&gt;. &lt;em&gt;Is this node a candidate?&lt;/em&gt; That is &lt;code&gt;Filter&lt;/code&gt;. &lt;em&gt;Among the candidates, which one is the best fit?&lt;/em&gt; That is &lt;code&gt;Score&lt;/code&gt;. The framework gives you the seam; the plugin fills in the logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three queues, before any plugin runs
&lt;/h2&gt;

&lt;p&gt;Before any plugin gets called, the pod has to make it into the right queue. The scheduler maintains three of them, and they each serve a different purpose.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;activeQ&lt;/strong&gt; is a priority heap. Unscheduled pods are ordered by &lt;code&gt;spec.priority&lt;/code&gt;, and the scheduler always pops from the head. Higher-priority pods cut in line, which is exactly what you want for things like critical control-plane pods or paid-tier workloads.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;backoffQ&lt;/strong&gt; holds pods that just failed a scheduling attempt. They sit there for a small (and exponentially growing) timeout before being promoted back into the activeQ. This is not laziness; it is a correctness property. If a pod could not be scheduled in this cycle, retrying it immediately almost always fails the same way. Backoff lets the cluster state change first.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;unschedulableQ&lt;/strong&gt; (the source actually calls it &lt;code&gt;unschedulablePods&lt;/code&gt;, but the docs and the metrics use the queue suffix) is an indexed map of pods that have been declared unschedulable for now. These pods do not retry on a timer. They retry on &lt;em&gt;events&lt;/em&gt;. If a new node is added, an informer event fires &lt;code&gt;MoveAllToActiveOrBackoffQueue&lt;/code&gt; and they all get a fresh shot. Same thing if a pod is deleted and frees up resources. There is also a five-minute fallback timer for pods that have been waiting too long, in case the event stream missed an update.&lt;/p&gt;

&lt;p&gt;All three queues live in &lt;code&gt;pkg/scheduler/backend/queue/scheduling_queue.go&lt;/code&gt;. Their names are also exposed as labels on the &lt;code&gt;scheduler_pending_pods&lt;/code&gt; metric, which is the easiest way to debug a stuck cluster: a queue full of pods in &lt;code&gt;Unschedulable&lt;/code&gt; is telling you something different than a queue full of pods in &lt;code&gt;Backoff&lt;/code&gt;.&lt;/p&gt;
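
&lt;p&gt;A hedged sketch of reading that metric straight off the scheduler's secure port. This assumes a kubeadm-style layout, and the service account is hypothetical, stand in whatever identity you have bound to the &lt;code&gt;/metrics&lt;/code&gt; nonResourceURL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On a control-plane node
TOKEN=$(kubectl create token metrics-reader -n kube-system)   # hypothetical SA
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  https://127.0.0.1:10259/metrics | grep scheduler_pending_pods
# scheduler_pending_pods{queue="active"} 0
# scheduler_pending_pods{queue="backoff"} 2
# scheduler_pending_pods{queue="unschedulable"} 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;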

&lt;h2&gt;
  
  
  Stage 1: PreEnqueue (the gate)
&lt;/h2&gt;

&lt;p&gt;PreEnqueue plugins decide whether a pod is even allowed into the activeQ. If any plugin returns &lt;code&gt;Unschedulable&lt;/code&gt;, the pod sits in the unschedulableQ until something causes its gate to clear.&lt;/p&gt;

&lt;p&gt;The canonical example is the &lt;code&gt;SchedulingGates&lt;/code&gt; plugin. By setting &lt;code&gt;spec.schedulingGates&lt;/code&gt; on a pod, you can create the pod object now but defer its scheduling until you explicitly remove the gate. This pattern shows up in batch workloads, in cost-aware scheduling controllers, and in anything that wants to express "this pod exists but isn't ready to run yet."&lt;/p&gt;

&lt;p&gt;Most pods sail through PreEnqueue with no gates set, but it is the very first checkpoint and worth knowing about.&lt;/p&gt;
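
&lt;p&gt;A minimal gated pod, and the patch that lifts the gate (the gate name is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
apiVersion: v1
kind: Pod
metadata:
  name: gated
spec:
  schedulingGates:
  - name: example.com/quota-check
  containers:
  - name: app
    image: nginx
EOF

kubectl get pod gated     # STATUS shows SchedulingGated
kubectl patch pod gated --type json \
  -p '[{"op":"remove","path":"/spec/schedulingGates/0"}]'
kubectl get pod gated     # now it enters the activeQ and schedules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;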

&lt;h2&gt;
  
  
  Stage 2: QueueSort (the order)
&lt;/h2&gt;

&lt;p&gt;Pods waiting in the activeQ have to be ordered somehow. QueueSort plugins define that order. The default is &lt;code&gt;PrioritySort&lt;/code&gt;: it ranks pods by &lt;code&gt;spec.priority&lt;/code&gt; (an integer) descending, and falls back to creation timestamp for ties. Older pod with the same priority wins.&lt;/p&gt;

&lt;p&gt;There is one plugin, it does one thing, and you almost never want to change it. Worth a sentence in the model, not much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: PreFilter (cache once)
&lt;/h2&gt;

&lt;p&gt;Once a pod is popped off the activeQ, the scheduler's first real job is to look at what the pod actually wants. That is PreFilter, and it runs exactly once per pod per cycle.&lt;/p&gt;

&lt;p&gt;The default profile registers eleven PreFilter plugins, each one extracting a different facet of the pod's requirements: &lt;code&gt;NodeResourcesFit&lt;/code&gt; pulls out CPU, memory, and extended-resource requests; &lt;code&gt;NodeAffinity&lt;/code&gt; normalizes the affinity term tree; &lt;code&gt;PodTopologySpread&lt;/code&gt; builds its per-topology-key constraint sets; &lt;code&gt;InterPodAffinity&lt;/code&gt; walks the affinity and anti-affinity rules; &lt;code&gt;VolumeBinding&lt;/code&gt; figures out which PVCs still need binding; and so on.&lt;/p&gt;

&lt;p&gt;All of this work is cached in a &lt;code&gt;framework.CycleState&lt;/code&gt; object. Think of &lt;code&gt;CycleState&lt;/code&gt; as a per-pod scratch pad: compute the expensive things once, read them many times. The reason it matters becomes obvious in the next stage, where each plugin is about to be called several thousand times in tight loops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bhj96i2wtgu1qcbqcjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bhj96i2wtgu1qcbqcjt.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Filter (every node, every plugin, in parallel)
&lt;/h2&gt;

&lt;p&gt;Filter is where the bulk of the scheduling work actually happens. Fourteen plugins are run against every candidate node, with the nodes evaluated in parallel, and any single &lt;code&gt;Unschedulable&lt;/code&gt; verdict eliminates that node from the rest of the cycle.&lt;/p&gt;

&lt;p&gt;Here is the verified list, straight from &lt;code&gt;pkg/scheduler/apis/config/testing/defaults/defaults.go&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeUnschedulable&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeName&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;TaintToleration&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeAffinity&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodePorts&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeResourcesFit&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;VolumeRestrictions&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeVolumeLimits&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;VolumeBinding&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;VolumeZone&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;PodTopologySpread&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;InterPodAffinity&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DynamicResources&lt;/code&gt; (went GA in 1.36)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeDeclaredFeatures&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each plugin receives the pod, the candidate node's info, and the &lt;code&gt;CycleState&lt;/code&gt; that PreFilter built up. Each plugin returns &lt;code&gt;Success&lt;/code&gt; or &lt;code&gt;Unschedulable&lt;/code&gt;. If any of them says &lt;code&gt;Unschedulable&lt;/code&gt;, that node is gone. There is no aggregation, no scoring at this stage, no "well, three out of four said yes." It is binary, and that is what makes Filter fast: the scheduler evaluates candidate nodes in parallel (sixteen goroutines by default) and, on each node, runs the 14 plugins in sequence, short-circuiting on the first failure.&lt;/p&gt;

&lt;p&gt;Most engineers will only ever care about a handful of these by name. A quick tour:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeUnschedulable&lt;/code&gt; is the first line of defense. If the node has &lt;code&gt;spec.unschedulable: true&lt;/code&gt;, this plugin filters it out. That is exactly what &lt;code&gt;kubectl cordon&lt;/code&gt; does.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeName&lt;/code&gt; is the simplest possible filter. If the pod has &lt;code&gt;spec.nodeName&lt;/code&gt; set (you can set it manually and skip the scheduler entirely), only that node passes; the scheduler effectively becomes a no-op.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TaintToleration&lt;/code&gt; is the one most engineers will recognize. The node has taints, the pod has tolerations, and any unmatched &lt;code&gt;NoSchedule&lt;/code&gt; or &lt;code&gt;NoExecute&lt;/code&gt; taint kills candidacy. The "GPU node" pattern in the demo at the bottom of this post is just a &lt;code&gt;NoSchedule&lt;/code&gt; taint that nothing tolerates.&lt;/p&gt;
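&lt;p&gt;In YAML terms, the two halves of that contract look like this; names are illustrative, and the taint mirrors the demo's fake GPU node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# node side, applied with: kubectl taint nodes NODE workload=gpu:NoSchedule
# pod side, a pod that tolerates the taint and stays a candidate:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                      # hypothetical pod
spec:
  tolerations:
  - key: workload
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: main
    image: nginx:1.27
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;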

&lt;p&gt;&lt;code&gt;NodeAffinity&lt;/code&gt; evaluates the pod's &lt;code&gt;spec.affinity.nodeAffinity&lt;/code&gt; rules. Required affinity terms must match here at Filter; preferred terms get scored later.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeResourcesFit&lt;/code&gt; is the one most people intuitively understand. Does the node have enough free CPU, memory, and any other Kubernetes resource (hugepages, GPUs, custom resources) to fit the pod's requests? Notably, only requests are considered, not limits, which is why a node can be massively over-subscribed on limits and the scheduler still happily places more pods.&lt;/p&gt;
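&lt;p&gt;Concretely: in the sketch below, only the &lt;code&gt;requests&lt;/code&gt; block enters the scheduler's math. The &lt;code&gt;limits&lt;/code&gt; block is enforced later, at runtime, by the kubelet and the container runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: requests-vs-limits           # hypothetical pod
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:                      # what NodeResourcesFit counts against node capacity
        cpu: 500m
        memory: 256Mi
      limits:                        # invisible to the scheduler
        cpu: "4"
        memory: 2Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;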

&lt;p&gt;&lt;code&gt;VolumeBinding&lt;/code&gt; deserves a paragraph of its own. If the pod has PVCs that are not yet bound, VolumeBinding has to decide whether each unbound PVC &lt;em&gt;could&lt;/em&gt; be bound on this specific node. For a &lt;code&gt;WaitForFirstConsumer&lt;/code&gt; storage class the answer depends on zone, on the storage backend's topology, and on which PVs exist. VolumeBinding doesn't just filter; it also remembers the provisioning plan it chose, and that plan gets locked in during the Reserve stage further down.&lt;/p&gt;
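&lt;p&gt;The storage-class field that triggers this node-aware path is a single line. A sketch, with an invented name and a provisioner that depends on your platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd           # hypothetical name
provisioner: pd.csi.storage.gke.io   # example CSI driver; substitute your own
volumeBindingMode: WaitForFirstConsumer   # defer binding until a pod is scheduled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;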

&lt;p&gt;&lt;code&gt;DynamicResources&lt;/code&gt; is the new kid on the block. It implements the DRA framework, which went GA in 1.36. If your pod uses ResourceClaims (the modern way to ask for GPUs and other devices), DynamicResources is the plugin that figures out whether a node can satisfy the claim.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeDeclaredFeatures&lt;/code&gt; is newer still. It compares features the node has declared against the pod's required features, and is feature-gated in some configurations.&lt;/p&gt;

&lt;p&gt;Run the chain against every node, collect the verdicts, and whatever survives all 14 checks moves on. If nothing survives, the scheduler doesn't give up: it runs PostFilter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 5: PostFilter (preemption, the expensive escape hatch)
&lt;/h2&gt;

&lt;p&gt;If every node failed Filter, the scheduler is in trouble. The pod is unschedulable on the cluster as it stands today. PostFilter exists for exactly this case, and the default plugin is &lt;code&gt;DefaultPreemption&lt;/code&gt;. It asks a single question: &lt;em&gt;if I evicted some lower-priority pods, could this one fit?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The algorithm sounds simple but is genuinely expensive. For each node, the scheduler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Gathers all pods on the node with priority lower than the pending pod.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simulates evicting them one at a time, lowest priority first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After each simulated eviction, re-runs Filter on the hypothetical state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the pod becomes schedulable, the node is a candidate, and the minimum set of pods that need to die is recorded.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After all candidate nodes have been evaluated, the scheduler picks the "best" one: fewest victims, lowest victim priority, latest creation time as a tiebreaker. The exact ordering lives in &lt;code&gt;pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once a candidate is picked, two things happen. The scheduler sets &lt;code&gt;nominatedNodeName&lt;/code&gt; on the pending pod, so anyone watching the API can see this pod is targeting a specific node. Then it gracefully deletes the victims through the API server, respecting their &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;. The pending pod itself goes back into the activeQ to be retried in the next cycle.&lt;/p&gt;

&lt;p&gt;This whole process is &lt;em&gt;expensive&lt;/em&gt;. The first Filter sweep already touched every node. Now the scheduler is running Filter again, multiple times, against simulated state per candidate. Tens to hundreds of milliseconds easily, seconds on large clusters. The good news is that the vast majority of pods never hit this path; they schedule cleanly on the first try.&lt;/p&gt;

&lt;p&gt;PostFilter has a second plugin now: &lt;code&gt;DynamicResources&lt;/code&gt;. Same idea, but for ResourceClaims rather than pods. If a Filter cycle failed because of a claim that is busy, DynamicResources' PostFilter can deallocate idle claims to make room.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 6: PreScore (cache once again)
&lt;/h2&gt;

&lt;p&gt;Filter has done its work. Maybe four nodes are left, maybe forty. Either way, it is time to score them, and the scheduler reuses the same precompute trick from PreFilter.&lt;/p&gt;

&lt;p&gt;Some Score plugins do expensive per-node work. To avoid recomputing the same inputs for every node, those plugins do the shared work once in PreScore and stash the result in &lt;code&gt;CycleState&lt;/code&gt;. The default PreScore plugins are &lt;code&gt;TaintToleration&lt;/code&gt;, &lt;code&gt;NodeAffinity&lt;/code&gt;, &lt;code&gt;NodeResourcesFit&lt;/code&gt;, &lt;code&gt;VolumeBinding&lt;/code&gt;, &lt;code&gt;PodTopologySpread&lt;/code&gt;, &lt;code&gt;InterPodAffinity&lt;/code&gt;, and &lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;InterPodAffinity&lt;/code&gt; is the heaviest of the bunch. It walks the cluster's existing pods, builds a topology map of where each pod sits, and converts the new pod's affinity rules into an indexed structure. &lt;code&gt;PodTopologySpread&lt;/code&gt; does similar work, building per-topology-key counts.&lt;/p&gt;

&lt;p&gt;After PreScore, each individual Score call becomes effectively O(1): a lookup against precomputed state. Without it, scoring large clusters would be unworkable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 7: Score (the leaderboard, weighted)
&lt;/h2&gt;

&lt;p&gt;Now the actual ranking. Every Score plugin rates every surviving node from 0 to 100, in parallel.&lt;/p&gt;

&lt;p&gt;The default Score plugins, with weights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;TaintToleration&lt;/code&gt;, weight 3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeAffinity&lt;/code&gt;, weight 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeResourcesFit&lt;/code&gt;, weight 1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;VolumeBinding&lt;/code&gt;, weight 1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;PodTopologySpread&lt;/code&gt;, weight 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;InterPodAffinity&lt;/code&gt;, weight 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DynamicResources&lt;/code&gt;, weight 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt;, weight 1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ImageLocality&lt;/code&gt;, weight 1&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is nine plugins, all verified against &lt;code&gt;defaults.go&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The weights are not arbitrary. The source comments explain the reasoning: &lt;code&gt;TaintToleration&lt;/code&gt; is tripled because user-expressed taint preference is a strong signal. &lt;code&gt;NodeAffinity&lt;/code&gt;, &lt;code&gt;PodTopologySpread&lt;/code&gt;, &lt;code&gt;InterPodAffinity&lt;/code&gt;, and &lt;code&gt;DynamicResources&lt;/code&gt; are doubled because they encode user intent. The rest are weight 1 because they are infrastructure-level signals (balance, cache hits) that should influence the decision but not dominate it.&lt;/p&gt;

&lt;p&gt;It is worth zooming in on &lt;code&gt;ImageLocality&lt;/code&gt; for a moment. Once you understand it, you start noticing its effect everywhere.&lt;/p&gt;

&lt;p&gt;ImageLocality asks one question per node: how much of the container image is already cached there? A node holding all of the image's layers scores at or near 100; a cold node scores 0. (The raw score is proportional to the image bytes already present, scaled by how widely the image is spread across the cluster, but the intuition is binary: warm is good, cold is zero.)&lt;/p&gt;

&lt;p&gt;It matters because on a cold node, the kubelet has to pull the image over the network: seconds for a small image, tens of seconds for a fat ML or LLM image. On a warm node, the pod starts in milliseconds. ImageLocality is a soft preference (it doesn't filter), but it nudges the scheduler toward already-warm nodes when other things are equal, and the cumulative effect on workload startup latency is huge.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeResourcesFit&lt;/code&gt; also scores, and its scoring strategy is the knob you've probably tuned at some point. By default it uses &lt;code&gt;LeastAllocated&lt;/code&gt;, which prefers nodes with more free capacity (spreading the load). You can flip it to &lt;code&gt;MostAllocated&lt;/code&gt; for bin-packing, or to &lt;code&gt;RequestedToCapacityRatio&lt;/code&gt; for custom curves, via &lt;code&gt;KubeSchedulerConfiguration&lt;/code&gt;.&lt;/p&gt;
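&lt;p&gt;A sketch of the bin-packing flip, assuming the scheduler is started with &lt;code&gt;--config&lt;/code&gt; pointing at a file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated          # pack nodes tightly instead of spreading
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;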

&lt;p&gt;&lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt; is subtler. It rewards nodes whose CPU and memory utilization are balanced. A node at 80% CPU and 20% memory scores &lt;em&gt;worse&lt;/em&gt; than a node at 50%/50%, because imbalanced nodes are more likely to fragment future scheduling decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 8: NormalizeScore and picking a winner
&lt;/h2&gt;

&lt;p&gt;All nine plugins have scored every surviving node. The scheduler now picks the winner.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NormalizeScore&lt;/code&gt; rescales every plugin's output to a uniform 0 to 100 range. Some plugins return raw counts or other custom scales; this stage brings everything to the same units.&lt;/p&gt;

&lt;p&gt;For each node, the scheduler then sums &lt;code&gt;score × weight&lt;/code&gt; across all nine plugins. Highest sum wins.&lt;/p&gt;
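&lt;p&gt;A toy worked example, with invented numbers: suppose only &lt;code&gt;TaintToleration&lt;/code&gt; (weight 3) and &lt;code&gt;NodeAffinity&lt;/code&gt; (weight 2) are in play. Node A scores 100 and 0; node B scores 100 and 60. A totals 100×3 + 0×2 = 300; B totals 100×3 + 60×2 = 420. B wins, even though both matched the taint signal equally well, because the preferred-affinity signal broke the tie.&lt;/p&gt;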

&lt;p&gt;The interesting question is what happens when two nodes have exactly the same total. The scheduler picks one uniformly at random, and the randomness is deliberate.&lt;/p&gt;

&lt;p&gt;Random tie-breaking exists to prevent hot-spotting. Imagine two equally suitable nodes for a workload. If the scheduler always picked the first one in some deterministic order, every pod from that workload would pile onto the same node and the other one would sit empty. Randomization spreads the load.&lt;/p&gt;

&lt;p&gt;Uniformity is the point. A biased tiebreak (a naive modulo over a non-power-of-two range, say) would systematically favor some nodes, and over millions of scheduling decisions that skew turns into a real load imbalance. Go's standard-library random primitives are careful to avoid exactly that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 9: Reserve (claim the resources in memory, before the API knows)
&lt;/h2&gt;

&lt;p&gt;The winner is picked, but at this point the API server still does not know about the decision. As far as etcd is concerned, the pod is still unscheduled.&lt;/p&gt;

&lt;p&gt;Reserve fixes that locally. The scheduler takes the winning node's in-memory snapshot and subtracts whatever the pod requested: CPU, memory, extended resources, PVCs that need binding.&lt;/p&gt;

&lt;p&gt;A critical detail: the scheduler operates on &lt;strong&gt;requests&lt;/strong&gt;, not limits. And if your pod has no requests at all, the scheduler does not invent defaults; that is &lt;code&gt;LimitRanger&lt;/code&gt;'s job, much earlier at admission time. Here, Reserve subtracts whatever requests the pod has, even if it is zero. The scheduler's view of node capacity is purely request-based; a node could be massively over-subscribed on limits and the scheduler would never know or care.&lt;/p&gt;

&lt;p&gt;The reason Reserve happens in memory &lt;em&gt;before&lt;/em&gt; the bind is so the next pod in the same scheduling cycle sees this node as already loaded. Picture scaling a deployment to twenty replicas all at once: without Reserve, the scheduler's cache would still show the same node as fully free for every pod, and they would all pile onto it. Reserve makes the cache reflect the scheduler's intent immediately, even before etcd has acknowledged anything.&lt;/p&gt;

&lt;p&gt;If anything fails after Reserve, &lt;code&gt;Unreserve&lt;/code&gt; rolls it back. The in-memory subtraction is undone and the node looks free again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 10: Permit (the gang-scheduling hook)
&lt;/h2&gt;

&lt;p&gt;Permit is a hook with three possible outcomes per plugin. &lt;code&gt;Approve&lt;/code&gt; lets the bind proceed (the default). &lt;code&gt;Wait&lt;/code&gt; parks the pod, waiting for an external signal. &lt;code&gt;Reject&lt;/code&gt; fails scheduling outright.&lt;/p&gt;

&lt;p&gt;A stock cluster has no Permit plugins registered, so most pods sail through. But Permit is the seam where gang scheduling lives. Kueue, Volcano, and Coscheduling all register Permit plugins, and the pattern is the same: when the first pod of a gang arrives, return &lt;code&gt;Wait&lt;/code&gt; and park it. When the last pod of the gang arrives, signal all the parked pods to proceed. They all bind together, atomically.&lt;/p&gt;

&lt;p&gt;Without Permit, gang scheduling on Kubernetes would be effectively impossible. You would have to bind each pod individually and then evict the rest when one failed. Permit lets you wait at the right point, before any pod is bound, so failures cost nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stages 11, 12, 13: PreBind, Bind, PostBind (commit and clean up)
&lt;/h2&gt;

&lt;p&gt;Permit returned &lt;code&gt;Approve&lt;/code&gt;. Three stages left, all of them short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PreBind&lt;/strong&gt; is the last opportunity to do work before the API server is told. The biggest user is &lt;code&gt;VolumeBinding&lt;/code&gt;: for dynamically provisioned PVCs, this is where the PV is actually created and the PVC's &lt;code&gt;spec.volumeName&lt;/code&gt; is set. By the time Bind runs, the PVC is bound and ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bind&lt;/strong&gt; does the actual API call. The default is &lt;code&gt;DefaultBinder&lt;/code&gt;, which calls &lt;code&gt;pods.Bind()&lt;/code&gt; on the API server: a special subresource that accepts a &lt;code&gt;Binding&lt;/code&gt; object and sets the pod's &lt;code&gt;spec.nodeName&lt;/code&gt;. etcd persists it via Raft, followers fsync, and the pod is now officially assigned.&lt;/p&gt;

&lt;p&gt;The kubelet on the chosen node has been watching the API server for pods with its own &lt;code&gt;nodeName&lt;/code&gt;. The instant the bind lands, the kubelet's informer fires. The pod is no longer the scheduler's concern; it now belongs to a different deep-dive (image pull, runc, the five syscalls).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostBind&lt;/strong&gt; is cleanup. The scheduler removes the pod from its internal queue, and that scheduling cycle is done.&lt;/p&gt;

&lt;h2&gt;
  
  
The live demo: preemption in action
&lt;/h2&gt;

&lt;p&gt;Theory only carries so far. To watch the scheduler actually preempt a pod, we ran this against a real cluster (Kubernetes 1.36.1, three workers, one tainted). What follows are verbatim outputs from the live recording.&lt;/p&gt;

&lt;p&gt;The setup: three worker nodes, with &lt;code&gt;kube-worker-3&lt;/code&gt; tainted as a fake GPU node so the scheduler refuses to put general workloads there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
kube-cp-01      Ready    control-plane   41d   v1.36.1
kube-worker-1   Ready    &amp;lt;none&amp;gt;          41d   v1.36.1
kube-worker-2   Ready    &amp;lt;none&amp;gt;          41d   v1.36.1
kube-worker-3   Ready    &amp;lt;none&amp;gt;          12d   v1.36.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl describe node kube-worker-3 | grep -E 'Taints|cpu:|memory:'
Taints:             workload=gpu:NoSchedule
  cpu:                8
  memory:             32852Mi
  cpu:                7800m
  memory:             30100Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
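&lt;p&gt;The setup command itself was not shown in the recording, but that taint was presumably applied with the standard one-liner (the trailing-dash form removes it again):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl taint nodes kube-worker-3 workload=gpu:NoSchedule
node/kube-worker-3 tainted

$ kubectl taint nodes kube-worker-3 workload=gpu:NoSchedule-   # undo, later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;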



&lt;p&gt;We deploy a regular nginx pod requesting eight CPU. It schedules cleanly onto &lt;code&gt;kube-worker-1&lt;/code&gt; and starts up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f nginx-pod.yaml
pod/nginx-demo created

$ kubectl get events --sort-by=.lastTimestamp | tail -5
LAST SEEN   TYPE     REASON      OBJECT           MESSAGE
6s          Normal   Scheduled   pod/nginx-demo   Successfully assigned default/nginx-demo to kube-worker-1
5s          Normal   Pulling     pod/nginx-demo   Pulling image "nginx:1.27"
3s          Normal   Pulled      pod/nginx-demo   Successfully pulled image "nginx:1.27" in 1.812s
2s          Normal   Created     pod/nginx-demo   Created container: nginx
2s          Normal   Started     pod/nginx-demo   Started container nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pod nginx-demo -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
nginx-demo   1/1     Running   0          18s   10.244.2.47  kube-worker-1   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is now in a deliberately uncomfortable state. &lt;code&gt;kube-worker-1&lt;/code&gt; is mostly full. &lt;code&gt;kube-worker-2&lt;/code&gt; is similarly loaded. &lt;code&gt;kube-worker-3&lt;/code&gt; is empty but tainted. Then we apply a critical pod that asks for the same eight CPU, with priority &lt;code&gt;1,000,000&lt;/code&gt;, and no taint toleration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f payments-high-prio.yaml
pod/payments-critical created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
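&lt;p&gt;The manifest itself did not appear in the recording. A plausible reconstruction, matching the priority class name, priority value, image, and Guaranteed QoS visible in the events below (the memory figure is invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-payments
value: 1000000
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-critical
spec:
  priorityClassName: high-priority-payments
  containers:
  - name: payments
    image: payments:v2.4.1
    resources:
      requests:
        cpu: "8"                     # the same eight CPUs nginx-demo holds
        memory: 4Gi                  # invented figure
      limits:                        # limits == requests gives QoS class Guaranteed
        cpu: "8"
        memory: 4Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;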



&lt;p&gt;The first scheduling cycle has nothing to give it. Three nodes are insufficient, the fourth has the wrong taint. The scheduler turns to PostFilter, which walks each node looking for a preemption victim. The tainted node is no help. The non-tainted nodes each have a candidate to evict. The scheduler picks one, sets &lt;code&gt;nominatedNodeName&lt;/code&gt;, and gracefully evicts the lower-priority nginx pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl describe pod payments-critical | tail -14
QoS Class:        Guaranteed
Priority:         1000000
Priority Class:   high-priority-payments
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  14s   default-scheduler  0/4 nodes are available: 3 Insufficient cpu, 1 node(s) had untolerated taint {workload: gpu}. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
  Normal   Preempted         9s    default-scheduler  Preempted by default/nginx-demo on node kube-worker-1
  Warning  FailedScheduling  9s    default-scheduler  0/4 nodes are available: 3 Insufficient cpu. preemption: 0/4 nodes are available.
  Normal   Scheduled         4s    default-scheduler  Successfully assigned default/payments-critical to kube-worker-1
  Normal   Pulling           3s    kubelet            Pulling image "payments:v2.4.1"
  Normal   Pulled            1s    kubelet            Successfully pulled image "payments:v2.4.1" in 1.632s
  Normal   Created           1s    kubelet            Created container: payments
  Normal   Started           1s    kubelet            Started container payments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read those events carefully. There is a &lt;code&gt;FailedScheduling&lt;/code&gt; at 14s, then &lt;code&gt;Preempted by default/nginx-demo on node kube-worker-1&lt;/code&gt; at 9s, then another &lt;code&gt;FailedScheduling&lt;/code&gt; (the cycle right after preemption, where the nginx pod was still terminating), then &lt;code&gt;Scheduled&lt;/code&gt; at 4s. From request to running, on a real cluster, about ten seconds. That includes the graceful eviction of the victim, which is the slow part.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
payments-critical   1/1     Running   0          22s   10.244.2.58  kube-worker-1   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is preemption working as designed. A higher-priority pod arrives, the scheduler refuses to leave it pending when there is a lower-priority pod that could be moved, and the cluster reshuffles. No human intervention. No alert at 3 a.m.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three takeaways
&lt;/h2&gt;

&lt;p&gt;If only three things from this post stick with you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The scheduler is plugins all the way down.&lt;/strong&gt; Since 1.19, every meaningful decision is delegated to a plugin at one of thirteen extension points. You can write your own, disable the defaults, run multiple profiles in parallel. Volcano, Kueue, and Coscheduling exist because of this design; they did not have to fork the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Filter is binary; Score is weighted.&lt;/strong&gt; A single &lt;code&gt;Unschedulable&lt;/code&gt; verdict from any of fourteen Filter plugins kills a node's candidacy. But Score is a weighted vote across nine plugins, and the weights are not equal. &lt;code&gt;TaintToleration&lt;/code&gt; (×3) is the strongest single signal at scoring time, followed by the four ×2 plugins (&lt;code&gt;NodeAffinity&lt;/code&gt;, &lt;code&gt;PodTopologySpread&lt;/code&gt;, &lt;code&gt;InterPodAffinity&lt;/code&gt;, &lt;code&gt;DynamicResources&lt;/code&gt;). Weights matter much more than most engineers realize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reserve is why your scheduling is consistent.&lt;/strong&gt; When you scale a deployment from one to twenty replicas and they all hit the scheduling queue in the same one-second window, Reserve's in-memory subtraction is what stops them from piling onto the same node. The scheduler commits an opinion before the API server even confirms the bind, and that opinion is visible to the next pod's scheduling cycle immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go from here
&lt;/h2&gt;

&lt;p&gt;The full scheduler walkthrough on YouTube has the live demo, every stage animated, the preemption flow shown end-to-end. Link is at the top of this post.&lt;/p&gt;

&lt;p&gt;If you want to step through it yourself rather than watch, the interactive version at &lt;a href="https://kubernetes-explained.vercel.app/scheduler" rel="noopener noreferrer"&gt;https://kubernetes-explained.vercel.app/scheduler&lt;/a&gt; walks every internal step with annotations and lets you pause anywhere.&lt;/p&gt;

&lt;p&gt;Sources for every claim in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pkg/scheduler/apis/config/testing/defaults/defaults.go&lt;/code&gt;: plugin lists and weights&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pkg/scheduler/framework/plugins/&lt;/code&gt;: individual plugin implementations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pkg/scheduler/backend/queue/scheduling_queue.go&lt;/code&gt;: the three queues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go&lt;/code&gt;: the preemption algorithm&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;KEP-624: the scheduling framework's graduation history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;kubectl describe pod&lt;/code&gt; events shown in the demo above are verbatim from a real Kubernetes 1.36.1 cluster, captured for this post.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>scheduler</category>
      <category>internals</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
