"Your pod didn't just land on a node. It survived a tournament."
🎯 Who This Is For
You've deployed pods. You've written kubectl apply -f. You've watched pods go Running. But do you actually know how Kubernetes decides where your pod lives? Buckle up — because the answer is way more fascinating than "it picks a node."
🤯 Interesting Fact #1: Your Pod Goes Through a Tournament Before It's Born
Every unscheduled pod enters what Kubernetes internally calls the scheduling cycle — a ruthless, multi-round elimination process. It's part talent show, part gladiatorial arena.
Here's the battlefield:
API Server → Scheduling Queue → Filter Round → Score Round → Bind
Only nodes that survive all filters get to compete in the scoring round. The winner hosts your pod. Losers? They'll try again next pod.
📬 Phase 1: The Scheduling Queue — Not All Pods Are Equal
When your pod is created without a nodeName, it doesn't go straight to scheduling. It enters a priority queue.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
description: "For production workloads. Will preempt lower-priority pods."
```
🔥 Wild Fact: If a high-priority pod can't find a node, Kubernetes will evict lower-priority pods from existing nodes to make room. This is called preemption — your pod can literally kick others out of their homes.
Google SRE Insight: Define at least 3 priority tiers: critical, high, batch. Your SLOs depend on it. A batch job should never starve a user-facing service.
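As a starting point, a three-tier setup could look like this — the names and values below are illustrative assumptions, not Kubernetes defaults; only the larger-value-preempts-smaller semantics is fixed:

```yaml
# Illustrative three-tier setup: critical > high > batch.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "User-facing, SLO-bound services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
description: "Important internal services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 1000
description: "Best-effort batch jobs; first in line for preemption."
```

Leave plenty of numeric headroom between tiers so you can slot new classes in later without renumbering everything.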
🔍 Phase 2: Filtering — The Elimination Round
The scheduler runs your pod through a gauntlet of filter plugins. Each filter asks one question: "Can this node run this pod?"
| Filter Plugin | The Question It Asks |
|---|---|
| `NodeResourcesFit` | Does the node have enough CPU/Memory? |
| `NodeAffinity` | Do the node labels match? |
| `TaintToleration` | Does the pod tolerate the node's taints? |
| `VolumeBinding` | Can required PersistentVolumes be bound? |
| `PodTopologySpread` | Will placing here violate spread constraints? |
| `NodeUnschedulable` | Is the node cordoned? |
A node that fails any filter is immediately disqualified.
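The elimination logic can be sketched in a few lines of Python. This is a toy model, not the real scheduler: the plugin names mirror the table above, but the node and pod shapes are simplified assumptions.

```python
# Toy model of the filter phase: a node survives only if EVERY
# filter plugin answers "yes" for this pod.

def node_resources_fit(node, pod):
    # Does the node have enough free CPU and memory?
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def taint_toleration(node, pod):
    # Does the pod tolerate every taint on the node?
    return all(t in pod["tolerations"] for t in node["taints"])

def node_unschedulable(node, pod):
    # Cordoned nodes are out.
    return not node["cordoned"]

FILTERS = [node_resources_fit, taint_toleration, node_unschedulable]

def feasible_nodes(nodes, pod):
    # A single failing filter disqualifies the node immediately.
    return [n["name"] for n in nodes if all(f(n, pod) for f in FILTERS)]

nodes = [
    {"name": "node-a", "free_cpu": 4, "free_mem": 8, "taints": [], "cordoned": False},
    {"name": "node-b", "free_cpu": 1, "free_mem": 8, "taints": [], "cordoned": False},
    {"name": "node-c", "free_cpu": 4, "free_mem": 8, "taints": ["gpu"], "cordoned": False},
]
pod = {"cpu": 2, "mem": 4, "tolerations": []}
print(feasible_nodes(nodes, pod))  # → ['node-a']
```

node-b fails `NodeResourcesFit` (not enough CPU) and node-c fails `TaintToleration`, so only node-a advances to scoring.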
🤯 Mind-Blowing Fact: If zero nodes pass the filter phase, your pod enters the `Pending` state. But Kubernetes doesn't give up — it re-enqueues the pod and retries. If the Cluster Autoscaler is running, it can provision a brand-new node from your cloud provider on demand to unblock it.
Real-World Gotcha:
```shell
# Pod stuck Pending? Check this first:
kubectl describe pod <pod-name>
# Look for Events like:
# 0/5 nodes are available:
# 3 Insufficient memory, 2 node(s) had taint that the pod didn't tolerate.
```
🏆 Phase 3: Scoring — The Olympics of Node Selection
Now the fun begins. Every node that survived filtering enters the scoring round. Each node gets a score from 0 to 100 across multiple plugins, then scores are weighted and summed.
Final Score = Σ (plugin_score × plugin_weight)
Key scoring plugins:
LeastAllocated — Prefers nodes with MORE free resources. This naturally spreads load.
Score = (CPU_free% + Memory_free%) / 2
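Putting the two formulas together — the simplified `LeastAllocated` score above and the weighted sum from the previous section — a hand calculation looks like this (the numbers and weights are made-up assumptions):

```python
# Simplified LeastAllocated score, as in the formula above.
def least_allocated(free_cpu_pct, free_mem_pct):
    return (free_cpu_pct + free_mem_pct) / 2

# Final Score = Σ (plugin_score × plugin_weight)
def final_score(plugin_scores, weights):
    return sum(plugin_scores[p] * weights[p] for p in plugin_scores)

scores = {
    "LeastAllocated": least_allocated(80, 60),  # 70.0: node is mostly free
    "ImageLocality": 100,                       # image already cached here
}
weights = {"LeastAllocated": 1, "ImageLocality": 1}
print(final_score(scores, weights))  # → 170.0
```

The node with the highest final score across all surviving nodes wins the binding.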
InterPodAffinity — Scores nodes based on other pods already running there.
```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname
```
ImageLocality — Nodes that already have your container image cached get bonus points. No image pull = faster startup.
🎲 Fun Fact: When two nodes have identical final scores, the scheduler picks one at random. Pure coin flip. Your pod's home could be decided by entropy itself.
🔗 Phase 4: Binding — Sealing the Deal
Once a winner is chosen, the scheduler sends a Binding object to the API server:
```json
{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "my-pod" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "node-winner-42"
  }
}
```
The kubelet on that node watches the API server, sees its node is now assigned a pod, and immediately begins:
- Pulling the container image (if not cached)
- Creating the pod sandbox (network namespace, cgroups)
- Starting the containers
🧩 The Full Scheduling Pipeline
Here's the complete extension point chain — each is a plugin hook:
```
PreEnqueue
    ↓
QueueSort       ← determines priority order in queue
    ↓
PreFilter       ← pre-process / validation
    ↓
Filter          ← elimination round
    ↓
PostFilter      ← runs if NO nodes passed (preemption logic lives here)
    ↓
PreScore        ← prepare scoring metadata
    ↓
Score           ← score each node
    ↓
NormalizeScore  ← normalize scores to 0-100 range
    ↓
Reserve         ← optimistically reserve resources
    ↓
Permit          ← allow/deny/wait (used for gang scheduling)
    ↓
PreBind         ← e.g., bind PVCs before pod
    ↓
Bind            ← write Binding to API server
    ↓
PostBind        ← cleanup / notifications
```
🤯 Secret Weapon: The `Permit` phase enables Gang Scheduling — where a group of pods (like a distributed ML training job) waits until ALL of them can be scheduled simultaneously. No partial starts. This is how frameworks like Volcano work.
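The idea behind a gang-scheduling Permit plugin can be shown with a toy Python model — this is not Volcano's API or the real framework interface, just the wait-until-everyone-arrives logic:

```python
# Toy model of a Permit-phase gang gate: every pod in the group
# "waits" at Permit until the whole gang has reserved a node.
class GangPermit:
    def __init__(self, group, size):
        self.group = group      # gang name, e.g. an ML training job
        self.size = size        # number of pods that must arrive
        self.waiting = set()

    def permit(self, pod_name):
        # Returns "wait" until the full gang has arrived, then "allow".
        # (A real plugin would also enforce a timeout and reject on expiry.)
        self.waiting.add(pod_name)
        return "allow" if len(self.waiting) >= self.size else "wait"

gate = GangPermit("ml-train", size=3)
print(gate.permit("worker-0"))  # → wait
print(gate.permit("worker-1"))  # → wait
print(gate.permit("worker-2"))  # → allow
```

Until the last worker arrives, the first two hold their reserved resources at Permit instead of binding — which is exactly what prevents the half-started training job.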
🌍 Topology-Aware Scheduling: The Zone Survival Game
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
```
This tells Kubernetes: "Never let the count of my pods between any two zones differ by more than 1."
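The skew check itself is simple arithmetic: skew is the gap between the most- and least-populated zones for matching pods. A quick sketch (the zone counts are made-up assumptions):

```python
# Skew = (pods in fullest zone) - (pods in emptiest zone).
def skew(pods_per_zone):
    counts = pods_per_zone.values()
    return max(counts) - min(counts)

def placement_allowed(pods_per_zone, zone, max_skew=1):
    # Would adding one more pod to `zone` keep skew within maxSkew?
    trial = dict(pods_per_zone)
    trial[zone] = trial.get(zone, 0) + 1
    return skew(trial) <= max_skew

zones = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
print(placement_allowed(zones, "us-east-1c"))  # → True  (counts become 2/2/2)
print(placement_allowed(zones, "us-east-1a"))  # → False (counts become 3/2/1, skew 2)
```

With `whenUnsatisfiable: DoNotSchedule`, the `False` case acts like a hard filter; with `ScheduleAnyway` it only lowers the node's score.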
💡 SRE Insight: This is zone fault tolerance baked into scheduling. If us-east-1a goes down, you still have pods in 1b and 1c. No runbook needed — the scheduler enforced it from day one.
🚨 Interesting Fact #2: The Scheduler Is Pluggable — You Can Replace It
The entire kube-scheduler is built on the Scheduling Framework, a plugin-based architecture. You can:
- Write custom plugins in Go that hook into any phase
- Run multiple schedulers in the same cluster
- Select which scheduler handles each pod via `schedulerName`

```yaml
spec:
  schedulerName: my-custom-scheduler  # Your pod, your rules
```
Companies like Google (for Borg-like workloads) and NVIDIA (for GPU placement) run custom schedulers alongside the default one.
📊 SRE Golden Signals for the Scheduler
Monitor these metrics to keep your scheduling healthy:
```promql
# Scheduling latency P99 — should be < 100ms for most clusters
histogram_quantile(0.99,
  rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
)

# Pending pods — alert if > 0 for your critical namespace
kube_pod_status_phase{phase="Pending", namespace="production"} > 0

# Preemptions happening — signals resource pressure
rate(scheduler_preemption_victims_total[5m]) > 0

# Scheduling failures
rate(scheduler_schedule_attempts_total{result="error"}[5m]) > 0
```
⚠️ SRE Alert Rule: A pod stuck `Pending` for more than 2 minutes in a production namespace is a latent SLO burn. Page on it before your users feel it.
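If you run the Prometheus Operator, that rule could be expressed roughly like this — a sketch, assuming the `monitoring.coreos.com/v1` PrometheusRule CRD is installed; names, labels, and thresholds are placeholders to adapt:

```yaml
# Sketch: page when a production pod sits Pending for 2 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-pending-pods
spec:
  groups:
    - name: scheduling
      rules:
        - alert: PodStuckPending
          expr: kube_pod_status_phase{phase="Pending", namespace="production"} > 0
          for: 2m
          labels:
            severity: page
          annotations:
            summary: "Pod Pending for over 2m in production"
```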
🏁 TL;DR — The Pod Scheduling Cheat Sheet
| Phase | What Happens | Plugin Examples |
|---|---|---|
| Queue | Pod sorted by priority | `PrioritySort` |
| Filter | Unfit nodes eliminated | `NodeResourcesFit`, `TaintToleration` |
| Score | Fit nodes ranked 0-100 | `LeastAllocated`, `ImageLocality` |
| Bind | Winner assigned to pod | `DefaultBinder` |
As an SRE, I believe understanding the system beneath the system is what separates good engineers from great ones.