In the previous article we traced the full CRUD control flow in Kubernetes. We saw that resource creation passes through kube-scheduler at step ④. Now let's zoom into that step and understand exactly how the scheduler works.
1. Where Scheduling Fits in the Control Flow
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ kubectl / REST Request │
│ │ ① │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ kube-apiserver │ │
│ └──┬──────────────────────────┬──────────────────────────┬───┘ │
│ ② │ ③ │ ④ │ │
│ ▼ ▼ ▼ │
│ etcd kube-controller-manager kube-scheduler │
│ │
│ ⑤ binding → apiserver │
│ │ │
│ ▼ │
│ ┌───────────────────────────┐ │
│ │ kubelet │ │
│ │ ⑥ Pod │ ⑥ Pod │ │
│ │ [C]...[C] │ [C]...[C] │ │
│ └───────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Scheduling is step ④: kube-scheduler watches the resource queue, runs its algorithms, and writes the Pod-to-Node binding back to etcd via the apiserver.
2. The Scheduling Pipeline: Three Phases
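When kube-scheduler picks up an unscheduled Pod from the scheduling queue, it works through up to three phases. Filtering and scoring always run in that order; preemption is attempted only when filtering leaves no feasible node: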
Resource queue (unscheduled Pods) + Available node list
│
▼
┌─────────────────────────────────┐
│ Phase 1: FILTERING │ ← eliminate infeasible nodes
└─────────────────┬───────────────┘
│ feasible nodes
▼
┌─────────────────────────────────┐
│ Phase 2: SCORING │ ← rank feasible nodes
└─────────────────┬───────────────┘
│ optimal node selected
▼
┌─────────────────────────────────┐
│ Phase 3: PRIORITY & PREEMPTION│ ← handle scheduling failures
└─────────────────┬───────────────┘
│
▼
Binding result (Pod ↔ Node) written to etcd
3. Phase 1: Filtering
Goal: From all cluster nodes, select every node that is capable of running this Pod.
A node must pass all active filter algorithms to be considered feasible.
| Algorithm | What it checks |
|---|---|
| podFitsResources | Node has sufficient CPU and memory for the Pod's requests |
| podFitsHost | If nodeName is set, only that specific node passes |
| podFitsHostPorts | Required host ports are not already occupied on the node |
| podMatchNodeSelector | Node labels match the Pod's nodeSelector / nodeAffinity |
| NoDiskConflict | Required volumes are not exclusively mounted elsewhere |
| NoVolumeZoneConflict | Volume availability zone is compatible with the node's zone |
| MaxCSIVolumeCount | Node has not exceeded the CSI volume attachment limit |
| CheckNodeMemoryPressure | Node is not under memory pressure |
| CheckNodeDiskPressure | Node is not under disk pressure |
| CheckNodePIDPressure | Node is not under PID pressure |
| CheckNodeCondition | Node is in a healthy, ready condition |
| podToleratesNodeTaints | Pod tolerates every NoSchedule and NoExecute taint on the node |
| CheckVolumeBinding | Required PersistentVolumeClaims can be satisfied on this node |
Example: Pod requests 4 CPU, 8Gi memory
Node A: 8 CPU, 16Gi → 5 CPU free, 10Gi free ✅ passes all filters
Node B: 4 CPU, 8Gi → 1 CPU free, 2Gi free ❌ podFitsResources fails
Node C: 8 CPU, 16Gi → tainted NoSchedule ❌ podToleratesNodeTaints fails
Node D: 8 CPU, 16Gi → 6 CPU free, 12Gi free ✅ passes all filters
Feasible nodes after filtering: [A, D]
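The example above can be reproduced with a short Python sketch. The `Pod` and `Node` dataclasses and the two filter functions are hypothetical simplifications invented for illustration; the real predicates live in kube-scheduler's Go code and inspect far more state. Node C's free resources are not given in the example, so the sketch assumes the node is empty.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    cpu_request: float                  # CPU cores requested
    mem_request_gi: float               # memory requested, in Gi
    tolerations: set = field(default_factory=set)


@dataclass
class Node:
    name: str
    cpu_free: float
    mem_free_gi: float
    taints: set = field(default_factory=set)   # taint effects, e.g. {"NoSchedule"}


def fits_resources(pod, node):
    """Simplified stand-in for podFitsResources."""
    return node.cpu_free >= pod.cpu_request and node.mem_free_gi >= pod.mem_request_gi


def tolerates_taints(pod, node):
    """Simplified stand-in for podToleratesNodeTaints."""
    return node.taints <= pod.tolerations


FILTERS = [fits_resources, tolerates_taints]


def feasible_nodes(pod, nodes):
    # A node is feasible only if every filter accepts it.
    return [n.name for n in nodes if all(f(pod, n) for f in FILTERS)]


pod = Pod(cpu_request=4, mem_request_gi=8)
nodes = [
    Node("A", 5, 10),
    Node("B", 1, 2),
    Node("C", 8, 16, taints={"NoSchedule"}),   # assumed empty, but tainted
    Node("D", 6, 12),
]
print(feasible_nodes(pod, nodes))  # ['A', 'D']
```

Because the filters are combined with `all(...)`, adding another hard constraint only means appending one more function to `FILTERS`.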
4. Phase 2: Scoring
Goal: From the feasible nodes, select the single best node by ranking them with scoring algorithms.
Each algorithm returns a score from 0 to 100 for every feasible node. A node's final score is the weighted sum of those scores, and the node with the highest total wins.
| Algorithm | What it favors |
|---|---|
| SelectorSpreadPriority | Spread Pods of the same Service/ReplicaSet across different nodes (HA) |
| InterPodAffinityPriority | Nodes satisfying preferred inter-pod affinity/anti-affinity rules |
| LeastRequestedPriority | Nodes with the most remaining CPU + memory (load spreading) |
| MostRequestedPriority | Nodes with the least remaining resources (bin-packing, reduce active nodes) |
| RequestedToCapacityRatioPriority | Nodes scored by a configurable function of their requested-to-capacity ratio |
| BalancedResourceAllocation | Nodes where CPU and memory utilization are balanced (avoid skewed usage) |
| NodePreferAvoidPodsPriority | Avoid nodes annotated to repel certain Pod types |
| NodeAffinityPriority | Nodes matching preferredDuringSchedulingIgnoredDuringExecution affinity |
| TaintTolerationPriority | Nodes with fewer untolerated PreferNoSchedule taints (soft preference) |
| ImageLocalityPriority | Nodes that already have the required container images cached locally |
| ServiceSpreadingPriority | Further spread Pods belonging to the same Service |
| CalculateAntiAffinityPriorityMap | Penalize nodes that would violate anti-affinity preferences |
| EqualPriorityMap | Give all nodes equal score (used as a baseline / tie-breaker) |
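To make the contrast between LeastRequestedPriority and MostRequestedPriority concrete, here is a simplified sketch on the 0–100 scale. The function names are hypothetical, and averaging CPU and memory with equal weight is an assumption; treat it as an illustration of the idea rather than the scheduler's exact arithmetic.

```python
MAX_SCORE = 100


def least_requested_score(requested, capacity):
    """More free capacity -> higher score (spreads load)."""
    if capacity == 0 or requested > capacity:
        return 0
    return (capacity - requested) * MAX_SCORE // capacity


def most_requested_score(requested, capacity):
    """Less free capacity -> higher score (bin-packing)."""
    if capacity == 0 or requested > capacity:
        return 0
    return requested * MAX_SCORE // capacity


def node_score(score_fn, cpu_req, cpu_cap, mem_req, mem_cap):
    # Average the per-resource scores; equal CPU/memory weighting is an assumption.
    return (score_fn(cpu_req, cpu_cap) + score_fn(mem_req, mem_cap)) // 2


# Hypothetical node with 8 CPU / 16Gi capacity on which 6 CPU / 12Gi
# would be requested once the Pod is placed.
print(node_score(least_requested_score, 6, 8, 12, 16))  # 25
print(node_score(most_requested_score, 6, 8, 12, 16))   # 75
```

The same node scores low under load spreading and high under bin-packing, which is why the two algorithms are alternatives rather than companions.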
Scoring feasible nodes A and D (every weight = 1):
                      Node A    Node D
LeastRequested:         72        85
ImageLocality:          50       100   (image cached on D)
SelectorSpread:         80        60
BalancedResource:       75        80
─────────────────────────────────────
Weighted total:        277       325
Winner: Node D ✅
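The weighted-sum selection can be sketched in a few lines of Python. The helper names are hypothetical, and since the example does not state its weights the sketch assumes every algorithm has weight 1, so the totals it prints are plain sums. On a tie this sketch keeps the first node it sees instead of choosing at random.

```python
def weighted_total(scores, weights):
    """Sum of score * weight over all scoring algorithms."""
    return sum(scores[name] * weights.get(name, 1) for name in scores)


def pick_node(per_node_scores, weights):
    totals = {node: weighted_total(s, weights) for node, s in per_node_scores.items()}
    best = max(totals, key=totals.get)
    return best, totals


scores = {
    "A": {"LeastRequested": 72, "ImageLocality": 50, "SelectorSpread": 80, "BalancedResource": 75},
    "D": {"LeastRequested": 85, "ImageLocality": 100, "SelectorSpread": 60, "BalancedResource": 80},
}
weights = {name: 1 for name in scores["A"]}   # assumption: every weight is 1

best, totals = pick_node(scores, weights)
print(best, totals)  # D {'A': 277, 'D': 325}
```

Raising the weight of `SelectorSpread` far enough would flip the result to Node A, which is how weights let an operator express what matters most in a cluster.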
5. Phase 3: Priority & Preemption
Goal: Handle the case where a high-priority Pod cannot be scheduled because no feasible node exists.
Normal scheduling failure behavior
Under normal circumstances, when a Pod fails to schedule:
Pod scheduling fails
→ Pod status: Pending
→ Pod sits in queue
→ Retried only when: Pod spec is updated OR cluster state changes
This is fine for equal-priority workloads. But what if a critical system Pod can't be scheduled because lower-priority Pods are consuming all resources?
Priority & Preemption to the rescue
High-priority Pod scheduling fails (no feasible node)
│
▼
Preemption kicks in
│
▼
Scheduler finds a node where evicting low-priority Pods
would free enough resources for the high-priority Pod
│
▼
Low-priority Pods are evicted (graceful termination)
│
▼
High-priority Pod is scheduled on that node ✅
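| Algorithm | Role |
|---|---|
| podDisruptionBudget | Defines the minimum number of Pods that must remain available during voluntary disruptions. The scheduler tries to choose preemption victims without violating it, but only on a best-effort basis: if no other victims are available, PDB-protected Pods can still be evicted |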
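The victim search in the flow above can be sketched roughly as follows. Everything here is a hypothetical simplification: only CPU is considered, PodDisruptionBudgets are ignored, and picking the node that needs the fewest evictions is an assumed tie-breaking rule, not the scheduler's actual ranking.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    priority: int
    cpu: float


def victims_on_node(pending_priority, pending_cpu, cpu_free, running):
    """Return the Pods to evict so the pending Pod fits, or None if impossible."""
    victims = []
    # Only strictly lower-priority Pods are candidates; evict the lowest first.
    candidates = sorted((p for p in running if p.priority < pending_priority),
                        key=lambda p: p.priority)
    for pod in candidates:
        if cpu_free >= pending_cpu:
            break
        victims.append(pod)
        cpu_free += pod.cpu
    return victims if cpu_free >= pending_cpu else None


def choose_preemption_node(pending_priority, pending_cpu, nodes):
    """nodes maps name -> (cpu_free, [RunningPod]); fewest evictions wins."""
    best = None
    for name, (cpu_free, running) in nodes.items():
        victims = victims_on_node(pending_priority, pending_cpu, cpu_free, running)
        if victims is not None and (best is None or len(victims) < len(best[1])):
            best = (name, victims)
    return best


nodes = {
    "B": (1, [RunningPod("batch-1", 100, 1), RunningPod("batch-2", 100, 1)]),
    "C": (2, [RunningPod("web-1", 500, 2), RunningPod("db-1", 2_000_000, 4)]),
}
name, victims = choose_preemption_node(1_000_000, 4, nodes)
print(name, [v.name for v in victims])  # C ['web-1']
```

Node B is rejected because evicting both batch Pods still frees too little CPU, and `db-1` on Node C is never considered because its priority is higher than the pending Pod's.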
Defining Pod Priority
# Step 1: Create a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000        # higher value = higher priority
globalDefault: false  # do not apply to Pods that omit priorityClassName
---
# Step 2: Assign to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-app  # illustrative Pod name
spec:
  priorityClassName: high-priority
  containers:
    - name: critical-app
      image: my-app:latest
Protecting critical Pods with PodDisruptionBudget
# Keep at least 2 replicas of my-service available during voluntary
# disruptions such as node drains; preemption honors this on a best-effort basis
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service
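How a `minAvailable` budget translates into permitted evictions can be shown with a little arithmetic. The two helper functions below are hypothetical names used only for this sketch; they mirror the idea that the number of allowed disruptions is the count of healthy Pods minus the required minimum.

```python
def disruptions_allowed(current_healthy, min_available):
    """How many more Pods may be voluntarily disrupted without breaking the budget."""
    return max(0, current_healthy - min_available)


def would_violate_pdb(current_healthy, min_available, evictions):
    return evictions > disruptions_allowed(current_healthy, min_available)


# my-service with 3 healthy replicas and minAvailable: 2
print(disruptions_allowed(3, 2))      # 1
print(would_violate_pdb(3, 2, 1))     # False
print(would_violate_pdb(3, 2, 2))     # True
```

With three healthy replicas, one eviction stays within the budget and a second would break it.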
6. Complete Scheduling Decision Flow
kube-scheduler detects unscheduled Pod in queue
│
▼
┌─────────────────────────────────────────────────┐
│ FILTERING: run all predicate algorithms │
│ │
│ 0 nodes pass? │
│ ├── YES → go to Phase 3 (Preemption) │
│ └── NO → continue to Scoring │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ SCORING: run all priority algorithms │
│ Compute weighted sum per node │
│ Select highest score (random on tie) │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ PREEMPTION (if needed): │
│ Find node where evicting low-priority Pods │
│ would make room (PodDisruptionBudgets are │
│ honored on a best-effort basis) │
│ Evict → reschedule high-priority Pod │
└─────────────────────────────────────────────────┘
│
▼
Write Pod/Node binding → etcd via apiserver
│
▼
kubelet picks up binding → creates containers ✅
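Put together, the decision flow above might look like the following sketch. The `schedule_one` function, its callback signatures, and the dictionary-based Pods and nodes are all hypothetical; returning the preemption node directly is also a simplification made for brevity.

```python
def schedule_one(pod, nodes, filters, scorers, preempt):
    """Return the name of the node to bind `pod` to, or None if it stays Pending.

    filters: list of fn(pod, node) -> bool
    scorers: list of (fn(pod, node) -> int, weight)
    preempt: fn(pod, nodes) -> node name or None
    """
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        # No node passed filtering: try to make room by preemption.
        return preempt(pod, nodes)
    totals = {
        n["name"]: sum(weight * fn(pod, n) for fn, weight in scorers)
        for n in feasible
    }
    return max(totals, key=totals.get)


nodes = [
    {"name": "A", "cpu_free": 5},
    {"name": "B", "cpu_free": 1},
    {"name": "D", "cpu_free": 6},
]
filters = [lambda pod, n: n["cpu_free"] >= pod["cpu"]]
scorers = [(lambda pod, n: n["cpu_free"] - pod["cpu"], 1)]   # prefer most CPU left over
no_preemption = lambda pod, nodes: None

print(schedule_one({"cpu": 4}, nodes, filters, scorers, no_preemption))   # D
print(schedule_one({"cpu": 16}, nodes, filters, scorers, no_preemption))  # None
```

Passing filters, scorers, and the preemption strategy as arguments keeps the three phases independent, which matches the separation of concerns the pipeline is built around.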
7. Summary
| Phase | Strategy | Algorithms | Purpose |
|---|---|---|---|
| Filtering | Hard constraints | podFitsResources, podToleratesNodeTaints, CheckVolumeBinding, ... (13 total) | Eliminate nodes that cannot run the Pod |
| Scoring | Soft preferences | LeastRequestedPriority, ImageLocalityPriority, SelectorSpreadPriority, ... (13 total) | Rank feasible nodes to find the best one |
| Priority & Preemption | Eviction policy | podDisruptionBudget | Allow high-priority Pods to evict lower-priority ones when no space exists |
The three-phase pipeline gives Kubernetes scheduling a clean separation of concerns:
- Can it run here? → Filtering
- Where should it run? → Scoring
- What if nowhere works? → Preemption
Next in this series: Kubernetes Resource Orchestration: Deployments, ReplicaSets & Rolling Updates (Part 5)
Follow the series for more deep dives into Kubernetes internals.