"Your pod didn't just land on a node. It survived a tournament."
🎯 Who This Is For
You've deployed pods. You've written kubectl apply -f. You've watched pods go Running. But do you actually know how Kubernetes decides where your pod lives? Buckle up — because the answer is way more fascinating than "it picks a node."
🤯 Interesting Fact #1: Your Pod Goes Through a Tournament Before It's Born
Every unscheduled pod enters what Kubernetes internally calls the scheduling cycle — a ruthless, multi-round elimination process. It's part talent show, part gladiatorial arena.
Here's the battlefield:
API Server → Scheduling Queue → Filter Round → Score Round → Bind
Only nodes that survive all filters get to compete in the scoring round. The winner hosts your pod. Losers? They'll try again next pod.
📬 Phase 1: The Scheduling Queue — Not All Pods Are Equal
When your pod is created without a nodeName, it doesn't go straight to scheduling. It enters a priority queue.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
description: "For production workloads. Will preempt lower-priority pods."
```
🔥 Wild Fact: If a high-priority pod can't find a node, Kubernetes will evict lower-priority pods from existing nodes to make room. This is called preemption — your pod can literally kick others out of their homes.
Google SRE Insight: Define at least 3 priority tiers: critical, high, batch. Your SLOs depend on it. A batch job should never starve a user-facing service.
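As a starting point, a three-tier setup could look like this — the names and values below are illustrative assumptions, not Kubernetes defaults; only the larger-value-preempts-smaller semantics is fixed:

```yaml
# Illustrative three-tier setup: critical > high > batch.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "User-facing, SLO-bound services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
description: "Important internal services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 1000
description: "Best-effort batch jobs; first in line for preemption."
```

Leave plenty of numeric headroom between tiers so you can slot new classes in later without renumbering everything.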
🔍 Phase 2: Filtering — The Elimination Round
The scheduler runs your pod through a gauntlet of filter plugins. Each filter asks one question: "Can this node run this pod?"
| Filter Plugin | The Question It Asks |
|---|---|
| `NodeResourcesFit` | Does the node have enough CPU/Memory? |
| `NodeAffinity` | Do the node labels match? |
| `TaintToleration` | Does the pod tolerate the node's taints? |
| `VolumeBinding` | Can required PersistentVolumes be bound? |
| `PodTopologySpread` | Will placing here violate spread constraints? |
| `NodeUnschedulable` | Is the node cordoned? |
A node that fails any filter is immediately disqualified.
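The elimination logic can be sketched in a few lines of Python. This is a toy model, not the real scheduler: the plugin names mirror the table above, but the node and pod shapes are simplified assumptions.

```python
# Toy model of the filter phase: a node survives only if EVERY
# filter plugin answers "yes" for this pod.

def node_resources_fit(node, pod):
    # Does the node have enough free CPU and memory?
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def taint_toleration(node, pod):
    # Does the pod tolerate every taint on the node?
    return all(t in pod["tolerations"] for t in node["taints"])

def node_unschedulable(node, pod):
    # Cordoned nodes are out.
    return not node["cordoned"]

FILTERS = [node_resources_fit, taint_toleration, node_unschedulable]

def feasible_nodes(nodes, pod):
    # A single failing filter disqualifies the node immediately.
    return [n["name"] for n in nodes if all(f(n, pod) for f in FILTERS)]

nodes = [
    {"name": "node-a", "free_cpu": 4, "free_mem": 8, "taints": [], "cordoned": False},
    {"name": "node-b", "free_cpu": 1, "free_mem": 8, "taints": [], "cordoned": False},
    {"name": "node-c", "free_cpu": 4, "free_mem": 8, "taints": ["gpu"], "cordoned": False},
]
pod = {"cpu": 2, "mem": 4, "tolerations": []}
print(feasible_nodes(nodes, pod))  # → ['node-a']
```

node-b fails `NodeResourcesFit` (not enough CPU) and node-c fails `TaintToleration`, so only node-a advances to scoring.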
🤯 Mind-Blowing Fact: If zero nodes pass the filter phase, your pod enters the `Pending` state. But Kubernetes doesn't give up — it re-enqueues the pod and retries. If the Cluster Autoscaler is running, it can provision a brand-new node from your cloud provider on demand to unblock it.
Real-World Gotcha:
```shell
# Pod stuck Pending? Check this first:
kubectl describe pod <pod-name>
# Look for Events like:
# 0/5 nodes are available:
# 3 Insufficient memory, 2 node(s) had taint that the pod didn't tolerate.
```
🏆 Phase 3: Scoring — The Olympics of Node Selection
Now the fun begins. Every node that survived filtering enters the scoring round. Each node gets a score from 0 to 100 across multiple plugins, then scores are weighted and summed.
Final Score = Σ (plugin_score × plugin_weight)
Key scoring plugins:
LeastAllocated — Prefers nodes with MORE free resources. This naturally spreads load.
Score = (CPU_free% + Memory_free%) / 2
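Putting the two formulas together — the simplified `LeastAllocated` score above and the weighted sum from the previous section — a hand calculation looks like this (the numbers and weights are made-up assumptions):

```python
# Simplified LeastAllocated score, as in the formula above.
def least_allocated(free_cpu_pct, free_mem_pct):
    return (free_cpu_pct + free_mem_pct) / 2

# Final Score = Σ (plugin_score × plugin_weight)
def final_score(plugin_scores, weights):
    return sum(plugin_scores[p] * weights[p] for p in plugin_scores)

scores = {
    "LeastAllocated": least_allocated(80, 60),  # 70.0: node is mostly free
    "ImageLocality": 100,                       # image already cached here
}
weights = {"LeastAllocated": 1, "ImageLocality": 1}
print(final_score(scores, weights))  # → 170.0
```

The node with the highest final score across all surviving nodes wins the binding.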
InterPodAffinity — Scores nodes based on other pods already running there.
```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname
```
ImageLocality — Nodes that already have your container image cached get bonus points. No image pull = faster startup.
🎲 Fun Fact: When two nodes have identical final scores, the scheduler picks one at random. Pure coin flip. Your pod's home could be decided by entropy itself.
🔗 Phase 4: Binding — Sealing the Deal
Once a winner is chosen, the scheduler sends a Binding object to the API server:
```json
{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "my-pod" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "node-winner-42"
  }
}
```
The kubelet on that node watches the API server, sees its node is now assigned a pod, and immediately begins:
- Pulling the container image (if not cached)
- Creating the pod sandbox (network namespace, cgroups)
- Starting the containers
🧩 The Full Scheduling Pipeline
Here's the complete extension point chain — each is a plugin hook:
```
PreEnqueue
    ↓
QueueSort       ← determines priority order in queue
    ↓
PreFilter       ← pre-process / validation
    ↓
Filter          ← elimination round
    ↓
PostFilter      ← runs if NO nodes passed (preemption logic lives here)
    ↓
PreScore        ← prepare scoring metadata
    ↓
Score           ← score each node
    ↓
NormalizeScore  ← normalize scores to 0-100 range
    ↓
Reserve         ← optimistically reserve resources
    ↓
Permit          ← allow/deny/wait (used for gang scheduling)
    ↓
PreBind         ← e.g., bind PVCs before pod
    ↓
Bind            ← write Binding to API server
    ↓
PostBind        ← cleanup / notifications
```
🤯 Secret Weapon: The `Permit` phase enables Gang Scheduling — where a group of pods (like a distributed ML training job) waits until ALL of them can be scheduled simultaneously. No partial starts. This is how frameworks like Volcano work.
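The idea behind a gang-scheduling Permit plugin can be shown with a toy Python model — this is not Volcano's API or the real framework interface, just the wait-until-everyone-arrives logic:

```python
# Toy model of a Permit-phase gang gate: every pod in the group
# "waits" at Permit until the whole gang has reserved a node.
class GangPermit:
    def __init__(self, group, size):
        self.group = group      # gang name, e.g. an ML training job
        self.size = size        # number of pods that must arrive
        self.waiting = set()

    def permit(self, pod_name):
        # Returns "wait" until the full gang has arrived, then "allow".
        # (A real plugin would also enforce a timeout and reject on expiry.)
        self.waiting.add(pod_name)
        return "allow" if len(self.waiting) >= self.size else "wait"

gate = GangPermit("ml-train", size=3)
print(gate.permit("worker-0"))  # → wait
print(gate.permit("worker-1"))  # → wait
print(gate.permit("worker-2"))  # → allow
```

Until the last worker arrives, the first two hold their reserved resources at Permit instead of binding — which is exactly what prevents the half-started training job.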
🌍 Topology-Aware Scheduling: The Zone Survival Game
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
```
This tells Kubernetes: "Never let the count of my pods between any two zones differ by more than 1."
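The skew check itself is simple arithmetic: skew is the gap between the most- and least-populated zones for matching pods. A quick sketch (the zone counts are made-up assumptions):

```python
# Skew = (pods in fullest zone) - (pods in emptiest zone).
def skew(pods_per_zone):
    counts = pods_per_zone.values()
    return max(counts) - min(counts)

def placement_allowed(pods_per_zone, zone, max_skew=1):
    # Would adding one more pod to `zone` keep skew within maxSkew?
    trial = dict(pods_per_zone)
    trial[zone] = trial.get(zone, 0) + 1
    return skew(trial) <= max_skew

zones = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
print(placement_allowed(zones, "us-east-1c"))  # → True  (counts become 2/2/2)
print(placement_allowed(zones, "us-east-1a"))  # → False (counts become 3/2/1, skew 2)
```

With `whenUnsatisfiable: DoNotSchedule`, the `False` case acts like a hard filter; with `ScheduleAnyway` it only lowers the node's score.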
💡 SRE Insight: This is zone fault tolerance baked into scheduling. If us-east-1a goes down, you still have pods in 1b and 1c. No runbook needed — the scheduler enforced it from day one.
🚨 Interesting Fact #2: The Scheduler Is Pluggable — You Can Replace It
The entire kube-scheduler is built on the Scheduling Framework, a plugin-based architecture. You can:
- Write custom plugins in Go that hook into any phase
- Run multiple schedulers in the same cluster
- Select which scheduler handles each pod via `schedulerName`

```yaml
spec:
  schedulerName: my-custom-scheduler  # Your pod, your rules
```
Companies like Google (for Borg-like workloads) and NVIDIA (for GPU placement) run custom schedulers alongside the default one.
📊 SRE Golden Signals for the Scheduler
Monitor these metrics to keep your scheduling healthy:
```promql
# Scheduling latency P99 — should be < 100ms for most clusters
histogram_quantile(0.99,
  rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
)

# Pending pods — alert if > 0 for your critical namespace
kube_pod_status_phase{phase="Pending", namespace="production"} > 0

# Preemptions happening — signals resource pressure
rate(scheduler_preemption_victims_total[5m]) > 0

# Scheduling failures
rate(scheduler_schedule_attempts_total{result="error"}[5m]) > 0
```
⚠️ SRE Alert Rule: A pod stuck `Pending` for more than 2 minutes in a production namespace is a latent SLO burn. Page on it before your users feel it.
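If you run the Prometheus Operator, that rule could be expressed roughly like this — a sketch, assuming the `monitoring.coreos.com/v1` PrometheusRule CRD is installed; names, labels, and thresholds are placeholders to adapt:

```yaml
# Sketch: page when a production pod sits Pending for 2 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-pending-pods
spec:
  groups:
    - name: scheduling
      rules:
        - alert: PodStuckPending
          expr: kube_pod_status_phase{phase="Pending", namespace="production"} > 0
          for: 2m
          labels:
            severity: page
          annotations:
            summary: "Pod Pending for over 2m in production"
```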
🏁 TL;DR — The Pod Scheduling Cheat Sheet
| Phase | What Happens | Plugin Examples |
|---|---|---|
| Queue | Pod sorted by priority | `PrioritySort` |
| Filter | Unfit nodes eliminated | `NodeResourcesFit`, `TaintToleration` |
| Score | Fit nodes ranked 0-100 | `LeastAllocated`, `ImageLocality` |
| Bind | Winner assigned to pod | `DefaultBinder` |
As an SRE, I believe understanding the system beneath the system is what separates good engineers from great ones.