Nijo George Payyappilly
🧠 The Hidden Brain of Kubernetes: How Pod Scheduling Really Works (And Why It's Smarter Than You Think)

"Your pod didn't just land on a node. It survived a tournament."


🎯 Who This Is For

You've deployed pods. You've written kubectl apply -f. You've watched pods go Running. But do you actually know how Kubernetes decides where your pod lives? Buckle up — because the answer is way more fascinating than "it picks a node."


🤯 Interesting Fact #1: Your Pod Goes Through a Tournament Before It's Born

Every unscheduled pod enters what Kubernetes internally calls the scheduling cycle — a ruthless, multi-round elimination process. It's part talent show, part gladiatorial arena.

Here's the battlefield:

```
API Server → Scheduling Queue → Filter Round → Score Round → Bind
```

Only nodes that survive all filters get to compete in the scoring round. The winner hosts your pod. Losers? They'll try again next pod.


📬 Phase 1: The Scheduling Queue — Not All Pods Are Equal

When your pod is created without a nodeName, it doesn't go straight to scheduling. It enters a priority queue.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
description: "For production workloads. Will preempt lower-priority pods."
```

🔥 Wild Fact: If a high-priority pod can't find a node, Kubernetes will evict lower-priority pods from existing nodes to make room. This is called preemption — your pod can literally kick others out of their homes.

Google SRE Insight: Define at least 3 priority tiers: critical, high, batch. Your SLOs depend on it. A batch job should never starve a user-facing service.
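To use the `production-critical` class defined above, a pod references it by name via `priorityClassName`. A minimal sketch (the pod name and image here are illustrative, not from any real workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api        # hypothetical workload name
spec:
  priorityClassName: production-critical   # matches the PriorityClass above
  containers:
    - name: app
      image: nginx:1.25
```

If no `priorityClassName` is set, the pod gets the cluster's `globalDefault` class (or priority 0 if none exists).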


🔍 Phase 2: Filtering — The Elimination Round

The scheduler runs your pod through a gauntlet of filter plugins. Each filter asks one question: "Can this node run this pod?"

| Filter Plugin | The Question It Asks |
| --- | --- |
| NodeResourcesFit | Does the node have enough CPU/memory? |
| NodeAffinity | Do the node labels match? |
| TaintToleration | Does the pod tolerate the node's taints? |
| VolumeBinding | Can required PersistentVolumes be bound? |
| PodTopologySpread | Will placing here violate spread constraints? |
| NodeUnschedulable | Is the node cordoned? |

A node that fails any filter is immediately disqualified.

🤯 Mind-Blowing Fact: If zero nodes pass the filter phase, your pod enters Pending state. But Kubernetes doesn't give up — it re-enqueues the pod and retries. If Cluster Autoscaler is running, it can provision a brand new node from your cloud provider on-demand to unblock it.

Real-World Gotcha:

```shell
# Pod stuck Pending? Check this first:
kubectl describe pod <pod-name>

# Look for Events like:
# 0/5 nodes are available:
# 3 Insufficient memory, 2 node(s) had taint that the pod didn't tolerate.
```
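If the taint message above is the culprit, one fix is to give the pod a matching toleration. A sketch, assuming a hypothetical `dedicated=batch:NoSchedule` taint (substitute your cluster's actual taint key and value):

```yaml
# Assumes nodes were tainted with something like:
#   kubectl taint nodes <node> dedicated=batch:NoSchedule
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
```

The other common fix for `Insufficient memory` is lowering the pod's resource requests so NodeResourcesFit can find a home for it.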

🏆 Phase 3: Scoring — The Olympics of Node Selection

Now the fun begins. Every node that survived filtering enters the scoring round. Each node gets a score from 0 to 100 across multiple plugins, then scores are weighted and summed.

```
Final Score = Σ (plugin_score × plugin_weight)
```

Key scoring plugins:

LeastAllocated — Prefers nodes with MORE free resources. This naturally spreads load.

```
Score = (CPU_free% + Memory_free%) / 2
```
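The two formulas above can be sketched in a few lines of Python. This is a toy model of the arithmetic, not the scheduler's actual Go implementation:

```python
def least_allocated_score(cpu_free_pct: float, mem_free_pct: float) -> float:
    """LeastAllocated (toy): average of free CPU% and free memory%."""
    return (cpu_free_pct + mem_free_pct) / 2

def final_score(plugin_scores: dict[str, float],
                weights: dict[str, float]) -> float:
    """Final Score = sum of (plugin_score * plugin_weight)."""
    return sum(score * weights.get(name, 1.0)
               for name, score in plugin_scores.items())

# A node with 60% CPU free and 40% memory free:
print(least_allocated_score(60, 40))  # 50.0

# Combine two plugin scores with equal weight:
print(final_score({"LeastAllocated": 50, "ImageLocality": 100},
                  {"LeastAllocated": 1.0, "ImageLocality": 1.0}))  # 150.0
```

Note that the weighted sum is not capped at 100: a node strong on several plugins accumulates a higher total, which is exactly how one node beats another.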

InterPodAffinity — Scores nodes based on other pods already running there.

```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname
```

ImageLocality — Nodes that already have your container image cached get bonus points. No image pull = faster startup.

🎲 Fun Fact: When two nodes have identical final scores, the scheduler picks one at random. Pure coin flip. Your pod's home could be decided by entropy itself.
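A toy version of that tie-break: take the top score, then pick uniformly among the nodes that share it. A sketch (node names are made up):

```python
import random

def pick_node(scores: dict[str, float]) -> str:
    """Pick the highest-scoring node, breaking ties at random."""
    best = max(scores.values())
    winners = [node for node, s in scores.items() if s == best]
    return random.choice(winners)

scores = {"node-a": 87, "node-b": 92, "node-c": 92}
print(pick_node(scores))  # either node-b or node-c, chosen at random
```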


🔗 Phase 4: Binding — Sealing the Deal

Once a winner is chosen, the scheduler sends a Binding object to the API server:

```json
{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "my-pod" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "node-winner-42"
  }
}
```

The kubelet on that node watches the API server, sees its node is now assigned a pod, and immediately begins:

  1. Pulling the container image (if not cached)
  2. Creating the pod sandbox (network namespace, cgroups)
  3. Starting the containers

🧩 The Full Scheduling Pipeline

Here's the complete extension point chain — each is a plugin hook:

```
PreEnqueue
    ↓
QueueSort        ← determines priority order in queue
    ↓
PreFilter        ← pre-process / validation
    ↓
Filter           ← elimination round
    ↓
PostFilter       ← runs if NO nodes passed (preemption logic lives here)
    ↓
PreScore         ← prepare scoring metadata
    ↓
Score            ← score each node
    ↓
NormalizeScore   ← normalize scores to 0-100 range
    ↓
Reserve          ← optimistically reserve resources
    ↓
Permit           ← allow/deny/wait (used for gang scheduling)
    ↓
PreBind          ← e.g., bind PVCs before pod
    ↓
Bind             ← write Binding to API server
    ↓
PostBind         ← cleanup / notifications
```

🤯 Secret Weapon: The Permit phase enables Gang Scheduling — where a group of pods (like a distributed ML training job) waits until ALL of them can be scheduled simultaneously. No partial starts. This is how frameworks like Volcano work.
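The Permit-phase behavior can be modeled in a few lines. Real plugins like Volcano's are written in Go against the Scheduling Framework; this Python toy only captures the semantics: every pod in the gang answers "wait" until the whole gang has arrived, then all are allowed to bind:

```python
class GangPermit:
    """Toy model of Permit-phase gang scheduling."""

    def __init__(self, gang_size: int):
        self.gang_size = gang_size
        self.waiting: set[str] = set()

    def permit(self, pod: str) -> str:
        """Return 'allow' once the whole gang is present, else 'wait'."""
        self.waiting.add(pod)
        return "allow" if len(self.waiting) >= self.gang_size else "wait"

gang = GangPermit(gang_size=3)
print(gang.permit("trainer-0"))  # wait
print(gang.permit("trainer-1"))  # wait
print(gang.permit("trainer-2"))  # allow -> all three bind together
```

A real plugin also needs a timeout: if the gang never completes, the waiting pods are rejected and re-queued rather than holding their reservations forever.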


🌍 Topology-Aware Scheduling: The Zone Survival Game

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
```

This tells Kubernetes: "Never let the difference in my pod count between the most- and least-populated zones exceed 1."

💡 SRE Insight: This is zone fault tolerance baked into scheduling. If us-east-1a goes down, you still have pods in 1b and 1c. No runbook needed — the scheduler enforced it from day one.
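The maxSkew rule reduces to a simple invariant: after placement, max(zone counts) minus min(zone counts) must not exceed `maxSkew`. A sketch of that check (zone names and counts are illustrative):

```python
def violates_max_skew(pods_per_zone: dict[str, int],
                      candidate_zone: str,
                      max_skew: int = 1) -> bool:
    """Would placing one more pod in candidate_zone break max - min <= max_skew?"""
    counts = dict(pods_per_zone)
    counts[candidate_zone] = counts.get(candidate_zone, 0) + 1
    return max(counts.values()) - min(counts.values()) > max_skew

zones = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
print(violates_max_skew(zones, "us-east-1a"))  # True: skew would become 3 - 1 = 2
print(violates_max_skew(zones, "us-east-1c"))  # False: zones balance to 2-2-2
```

With `whenUnsatisfiable: DoNotSchedule`, a `True` here means the node is filtered out; with `ScheduleAnyway`, the violation only lowers the node's score.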


🚨 Interesting Fact #2: The Scheduler Is Pluggable — You Can Replace It

The entire kube-scheduler is built on the Scheduling Framework, a plugin-based architecture. You can:

  • Write custom plugins in Go that hook into any phase
  • Run multiple schedulers in the same cluster
  • Select which scheduler handles each pod via schedulerName
```yaml
spec:
  schedulerName: my-custom-scheduler  # Your pod, your rules
```

Companies like Google (for Borg-like workloads) and NVIDIA (for GPU placement) run custom schedulers alongside the default one.


📊 SRE Golden Signals for the Scheduler

Monitor these metrics to keep your scheduling healthy:

```
# Scheduling latency P99 — should be < 100ms for most clusters
histogram_quantile(0.99,
  rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
)

# Pending pods — alert if > 0 for your critical namespace
kube_pod_status_phase{phase="Pending", namespace="production"} > 0

# Preemptions happening — signals resource pressure
rate(scheduler_preemption_victims_total[5m]) > 0

# Scheduling failures
rate(scheduler_schedule_attempts_total{result="error"}[5m]) > 0
```

⚠️ SRE Alert Rule: A pod stuck Pending for more than 2 minutes in a production namespace is a latent SLO burn. Page on it before your users feel it.
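If you run the Prometheus Operator, that alert rule might look like the sketch below. This assumes kube-state-metrics is exporting `kube_pod_status_phase`; the rule and alert names are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-slo        # hypothetical name
spec:
  groups:
    - name: scheduling
      rules:
        - alert: PodStuckPending
          expr: kube_pod_status_phase{phase="Pending", namespace="production"} > 0
          for: 2m             # the 2-minute grace period from above
          labels:
            severity: page
```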


🏁 TL;DR — The Pod Scheduling Cheat Sheet

| Phase | What Happens | Plugin Examples |
| --- | --- | --- |
| Queue | Pod sorted by priority | PrioritySort |
| Filter | Unfit nodes eliminated | NodeResourcesFit, TaintToleration |
| Score | Fit nodes ranked 0-100 | LeastAllocated, ImageLocality |
| Bind | Winner assigned to pod | DefaultBinder |

As an SRE, I believe understanding the system beneath the system is what separates good engineers from great ones.


Found this useful? Drop a ❤️, share it with your team, and follow for more deep-dives into Kubernetes internals.
