Kubernetes Resource Scheduling: Filtering, Scoring & Priority Preemption

In the previous article we traced the full CRUD control flow in Kubernetes. We saw that resource creation passes through kube-scheduler at step ④. Now let's zoom into that step and understand exactly how the scheduler works.

1. Where Scheduling Fits in the Control Flow

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   kubectl / REST Request                                            │
│          │ ①                                                        │
│          ▼                                                          │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    kube-apiserver                           │   │
│   └──┬──────────────────────────┬──────────────────────────┬───┘   │
│    ② │                        ③ │                        ④ │       │
│      ▼                          ▼                          ▼       │
│    etcd              kube-controller-manager        kube-scheduler  │
│                                                                     │
│                                              ⑤ binding → apiserver │
│                                                        │            │
│                                                        ▼            │
│                                    ┌───────────────────────────┐    │
│                                    │         kubelet           │    │
│                                    │  ⑥ Pod    │    ⑥ Pod     │    │
│                                    │ [C]...[C] │  [C]...[C]   │    │
│                                    └───────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

Scheduling is step ④: kube-scheduler watches the apiserver for Pods that have no node assigned, runs its filtering and scoring algorithms, and sends the Pod-to-Node binding back to the apiserver (step ⑤), which persists it in etcd.

2. The Scheduling Pipeline: Three Phases

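When kube-scheduler picks a Pod from the scheduling queue, it works through up to three phases. Filtering and scoring run for every Pod; preemption runs only when filtering leaves no feasible node.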

Resource queue (unscheduled Pods)  +  Available node list
                  │
                  ▼
┌─────────────────────────────────┐
│   Phase 1: FILTERING            │  ← eliminate infeasible nodes
└─────────────────┬───────────────┘
                  │  feasible nodes
                  ▼
┌─────────────────────────────────┐
│   Phase 2: SCORING              │  ← rank feasible nodes
└─────────────────┬───────────────┘
                  │  optimal node selected
                  ▼
┌─────────────────────────────────┐
│   Phase 3: PRIORITY & PREEMPTION│  ← handle scheduling failures
└─────────────────┬───────────────┘
                  │
                  ▼
      Binding result (Pod ↔ Node) written to etcd

3. Phase 1: Filtering

Goal: From all cluster nodes, select every node that is capable of running this Pod.

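A node is feasible only if it passes every enabled filter. The names below are the legacy predicate names; recent Kubernetes releases implement the same checks as scheduling-framework filter plugins such as `NodeResourcesFit` and `TaintToleration`.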

Algorithm	What it checks
`PodFitsResources`	Node has sufficient CPU and memory for the Pod's `requests`
`PodFitsHost`	If `nodeName` is set, only that specific node passes
`PodFitsHostPorts`	Required host ports are not already occupied on the node
`PodMatchNodeSelector`	Node labels match the Pod's `nodeSelector` / required `nodeAffinity`
`NoDiskConflict`	Required volumes are not exclusively mounted elsewhere
`NoVolumeZoneConflict`	Volume availability zone is compatible with the node's zone
`MaxCSIVolumeCount`	Node has not exceeded the CSI volume attachment limit
`CheckNodeMemoryPressure`	Node is not under memory pressure
`CheckNodeDiskPressure`	Node is not under disk pressure
`CheckNodePIDPressure`	Node is not under PID pressure
`CheckNodeCondition`	Node is in a healthy, ready condition
`PodToleratesNodeTaints`	Pod tolerates every `NoSchedule` and `NoExecute` taint on the node (`PreferNoSchedule` taints only affect scoring)
`CheckVolumeBinding`	Required PersistentVolumeClaims can be satisfied on this node

Example: Pod requests 4 CPU, 8Gi memory

Node A: 8 CPU, 16Gi → 5 CPU free, 10Gi free  ✅ passes all filters
Node B: 4 CPU,  8Gi → 1 CPU free,  2Gi free  ❌ PodFitsResources fails
Node C: 8 CPU, 16Gi → tainted NoSchedule     ❌ PodToleratesNodeTaints fails
Node D: 8 CPU, 16Gi → 6 CPU free, 12Gi free  ✅ passes all filters

Feasible nodes after filtering: [A, D]
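The filtering example above can be reproduced with a small sketch. This is not the scheduler's code: the `Node` and `Pod` classes, the two filter functions, and the taint key `dedicated` are hypothetical, and Node C is assumed to have all of its capacity free, since the example only says it is tainted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: float       # cores not yet claimed by Pod requests
    free_mem_gi: float    # Gi not yet claimed by Pod requests
    taints: set = field(default_factory=set)  # keys of NoSchedule taints


@dataclass
class Pod:
    cpu: float
    mem_gi: float
    tolerations: set = field(default_factory=set)


def fits_resources(pod, node):
    # The Pod's requests must fit into what is still free on the node.
    return pod.cpu <= node.free_cpu and pod.mem_gi <= node.free_mem_gi


def tolerates_taints(pod, node):
    # Every NoSchedule taint on the node needs a matching toleration.
    return node.taints <= pod.tolerations


FILTERS = [fits_resources, tolerates_taints]


def feasible_nodes(pod, nodes):
    # A node is feasible only if every filter accepts it.
    return [n.name for n in nodes if all(f(pod, n) for f in FILTERS)]


nodes = [
    Node("A", 5, 10),
    Node("B", 1, 2),
    Node("C", 8, 16, taints={"dedicated"}),  # free capacity is an assumption
    Node("D", 6, 12),
]
pod = Pod(cpu=4, mem_gi=8)
print(feasible_nodes(pod, nodes))  # ['A', 'D']
```

Because the filters are combined with `all(...)`, a single failing check is enough to drop a node, which is why Node C is excluded despite having enough capacity.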

4. Phase 2: Scoring

Goal: From the feasible nodes, pick the single best node by ranking them with scoring algorithms.

Each algorithm returns a score of 0–100 per node. The scheduler multiplies each score by that algorithm's configured weight and adds the results together. The node with the highest total wins; ties are broken at random.

Algorithm	What it favors
`SelectorSpreadPriority`	Spread Pods of the same Service/ReplicaSet across different nodes (HA)
`InterPodAffinityPriority`	Nodes satisfying preferred inter-pod affinity/anti-affinity rules
`LeastRequestedPriority`	Nodes with the most remaining CPU + memory (load spreading)
`MostRequestedPriority`	Nodes with the least remaining resources (bin-packing, reduce active nodes)
`RequestedToCapacityRatioPriority`	Scores nodes with a configurable function of their requested-to-capacity ratio (can be shaped toward spreading or bin-packing)
`BalancedResourceAllocation`	Nodes where CPU and memory utilization are balanced (avoid skewed usage)
`NodePreferAvoidPodsPriority`	Avoid nodes annotated to repel certain Pod types
`NodeAffinityPriority`	Nodes matching `preferredDuringSchedulingIgnoredDuringExecution` affinity
`TaintTolerationPriority`	Nodes with fewer un-tolerated taints (soft preference)
`ImageLocalityPriority`	Nodes that already have the required container images cached locally
`ServiceSpreadingPriority`	Further spread Pods belonging to the same Service
`CalculateAntiAffinityPriorityMap`	Penalize nodes that would violate anti-affinity preferences
`EqualPriorityMap`	Give all nodes equal score (used as a baseline / tie-breaker)
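To make the difference between `LeastRequestedPriority` and `MostRequestedPriority` concrete, here is a minimal sketch of the two formulas. It follows the general shape of the upstream calculation (free or used fraction scaled to 0–100), but the function names are hypothetical and it assumes CPU and memory are weighted equally.

```python
MAX_SCORE = 100


def least_requested(requested, capacity):
    # More free capacity gives a higher score (spreads load).
    if capacity == 0 or requested > capacity:
        return 0
    return (capacity - requested) * MAX_SCORE // capacity


def most_requested(requested, capacity):
    # More used capacity gives a higher score (bin-packing).
    if capacity == 0 or requested > capacity:
        return 0
    return requested * MAX_SCORE // capacity


def node_score(per_resource, cpu, mem):
    # cpu and mem are (requested, capacity) pairs; average the two scores.
    return (per_resource(*cpu) + per_resource(*mem)) // 2


# An 8-core, 16 Gi node with 2 cores and 4 Gi requested (25% utilized).
print(node_score(least_requested, (2, 8), (4, 16)))  # 75
print(node_score(most_requested, (2, 8), (4, 16)))   # 25
```

The same node scores high under one policy and low under the other, so which of the two is enabled decides whether the cluster spreads Pods out or packs them together.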

Scoring feasible nodes A and D:

                        Node A    Node D
LeastRequested:           72        85
ImageLocality:            50       100   (image cached on D)
SelectorSpread:           80        60
BalancedResource:         75        80
─────────────────────────────────────
Total (all weights 1):   277       325

Winner: Node D  ✅
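The weighted sum can be checked with a few lines of code. The per-algorithm scores are the ones from the table above; the weights are an assumption (every algorithm weighted 1), since the article does not list them.

```python
scores = {
    "A": {"LeastRequested": 72, "ImageLocality": 50,
          "SelectorSpread": 80, "BalancedResource": 75},
    "D": {"LeastRequested": 85, "ImageLocality": 100,
          "SelectorSpread": 60, "BalancedResource": 80},
}

# Assumed weights: every scoring algorithm counts equally.
weights = {"LeastRequested": 1, "ImageLocality": 1,
           "SelectorSpread": 1, "BalancedResource": 1}


def total(node_scores, weights):
    # Weighted sum of the per-algorithm scores for one node.
    return sum(weights[name] * score for name, score in node_scores.items())


totals = {node: total(s, weights) for node, s in scores.items()}
winner = max(totals, key=totals.get)

print(totals)  # {'A': 277, 'D': 325}
print(winner)  # D
```

Raising the weight of `SelectorSpread`, the only algorithm where Node A leads, is the one change that could flip the result. Random tie-breaking is left out of this sketch.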

5. Phase 3: Priority & Preemption

Goal: Handle the case where a high-priority Pod cannot be scheduled because no feasible node exists.

Normal scheduling failure behavior

Under normal circumstances, when a Pod fails to schedule:

Pod scheduling fails
    → Pod status: Pending
    → Pod sits in queue
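    → Retried when the Pod is updated, when cluster state changes (for example, a node is added or another Pod terminates), or after the scheduler's periodic retry timeout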

This is fine for equal-priority workloads. But what if a critical system Pod can't be scheduled because lower-priority Pods are consuming all resources?

Priority & Preemption to the rescue

High-priority Pod scheduling fails (no feasible node)
    │
    ▼
Preemption kicks in
    │
    ▼
Scheduler finds a node where evicting low-priority Pods
would free enough resources for the high-priority Pod
    │
    ▼
Low-priority Pods are evicted (graceful termination)
    │
    ▼
High-priority Pod is retried and normally lands on that node (recorded in status.nominatedNodeName, but not guaranteed)  ✅

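Object	Role
`PodDisruptionBudget`	Declares the minimum number of Pods that must stay available during voluntary disruptions. The scheduler prefers preemption victims whose eviction does not violate a PDB, but only on a best-effort basis: if no other victims exist, PDB-protected Pods can still be preempted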
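A rough sketch of victim selection may help here. It is deliberately simpler than the real scheduler: it looks at CPU only, ignores PodDisruptionBudgets and graceful termination, and the `RunningPod` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    priority: int
    cpu: float


def victims_on_node(free_cpu, running, incoming_priority, incoming_cpu):
    """Return the Pods to evict on one node, or None if preemption cannot help."""
    victims = []
    candidates = (p for p in running if p.priority < incoming_priority)
    # Evict the lowest-priority Pods first, and only as many as needed.
    for p in sorted(candidates, key=lambda p: p.priority):
        if free_cpu >= incoming_cpu:
            break
        victims.append(p)
        free_cpu += p.cpu
    return victims if free_cpu >= incoming_cpu else None


def pick_node(nodes, incoming_priority, incoming_cpu):
    """nodes maps node name -> (free_cpu, [RunningPod, ...])."""
    best = None
    for name, (free, running) in nodes.items():
        victims = victims_on_node(free, running, incoming_priority, incoming_cpu)
        if victims is None:
            continue
        # Prefer the node whose most important victim is least important,
        # then the node with the fewest victims.
        cost = (max((v.priority for v in victims), default=-1), len(victims))
        if best is None or cost < best[0]:
            best = (cost, name, [v.name for v in victims])
    return None if best is None else best[1:]


nodes = {
    "A": (1, [RunningPod("web-1", 100, 2), RunningPod("batch-1", 10, 2)]),
    "B": (0, [RunningPod("db-1", 2000000, 4)]),
}
print(pick_node(nodes, incoming_priority=1000000, incoming_cpu=3))
# ('A', ['batch-1'])
```

Node B is never a candidate because its only Pod outranks the incoming one. Preemption can only remove Pods with a strictly lower priority.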

Defining Pod Priority

# Step 1: Create a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # higher value = higher priority
globalDefault: false

---
# Step 2: Assign to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority   # must match the PriorityClass name above
  containers:
  - name: critical-app
    image: my-app:latest

Protecting critical Pods with PodDisruptionBudget

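# Ask Kubernetes to keep at least 2 replicas of my-service available
# during voluntary disruptions such as node drains; the scheduler honors
# this only on a best-effort basis when it picks preemption victims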
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service
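The arithmetic behind `minAvailable` is simple enough to sketch. The function names below are hypothetical; the rule they encode is that the number of allowed disruptions is the count of healthy Pods minus `minAvailable`, never below zero.

```python
def disruptions_allowed(healthy_pods, min_available):
    # How many matching Pods may be disrupted right now.
    return max(0, healthy_pods - min_available)


def eviction_violates_pdb(healthy_pods, min_available, evictions=1):
    # True if evicting this many Pods would drop below the budget.
    return evictions > disruptions_allowed(healthy_pods, min_available)


# my-service with minAvailable: 2
print(disruptions_allowed(3, 2))     # 1
print(eviction_violates_pdb(3, 2))   # False
print(eviction_violates_pdb(2, 2))   # True
```

With exactly two healthy replicas the budget is exhausted, so a node drain would be blocked. During preemption the scheduler treats such a violation as a reason to prefer other victims, not as a hard stop.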

6. Complete Scheduling Decision Flow

kube-scheduler detects unscheduled Pod in queue
      │
      ▼
┌─────────────────────────────────────────────────┐
│  FILTERING: run all predicate algorithms        │
│                                                 │
│  0 nodes pass?                                  │
│  ├── YES → go to Phase 3 (Preemption)           │
│  └── NO  → continue to Scoring                 │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  SCORING: run all priority algorithms           │
│  Compute weighted sum per node                  │
│  Select highest score (random on tie)           │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  PREEMPTION (only if 0 nodes passed filtering): │
│  Find node where evicting low-priority Pods     │
│  would make room (PDBs honored best-effort)     │
│  Evict victims → retry the high-priority Pod    │
└─────────────────────────────────────────────────┘
      │
      ▼
Write Pod/Node binding → etcd via apiserver
      │
      ▼
kubelet picks up binding → creates containers  ✅
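The branching in this flow can be summarized in one function. This is a sketch of the control flow only: `schedule_one`, its return values, and the toy filter, scorer, and preemption callbacks are all hypothetical, and nodes are represented by their names.

```python
import random


def schedule_one(pod, nodes, filters, scorers, preempt):
    """Return ("bind", node), ("nominated", node) or ("pending", None)."""
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        # Preemption is attempted only when filtering leaves nothing.
        node = preempt(pod, nodes)
        return ("nominated", node) if node is not None else ("pending", None)
    # Weighted sum per feasible node; scorers is a list of (function, weight).
    totals = {n: sum(w * score(pod, n) for score, w in scorers) for n in feasible}
    top = max(totals.values())
    # Ties between top-scoring nodes are broken at random.
    return ("bind", random.choice([n for n, t in totals.items() if t == top]))


# Toy cluster: the Pod is just its CPU request, nodes are names.
free_cpu = {"A": 5, "B": 1, "D": 6}
filters = [lambda pod, n: free_cpu[n] >= pod]
scorers = [(lambda pod, n: free_cpu[n] * 10, 1)]
no_preemption = lambda pod, nodes: None

print(schedule_one(4, list(free_cpu), filters, scorers, no_preemption))
# ('bind', 'D')
print(schedule_one(7, list(free_cpu), filters, scorers, no_preemption))
# ('pending', None)
```

Scoring never runs for the second Pod: with no feasible node, the only remaining options are a preemption attempt or staying `Pending`.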

7. Summary

Phase	Strategy	Algorithms	Purpose
Filtering	Hard constraints	`PodFitsResources`, `PodToleratesNodeTaints`, `CheckVolumeBinding`, ... (13 total)	Eliminate nodes that cannot run the Pod
Scoring	Soft preferences	`LeastRequestedPriority`, `ImageLocalityPriority`, `SelectorSpreadPriority`, ... (13 total)	Rank feasible nodes to find the best one
Priority & Preemption	Eviction policy	Preemption logic, limited on a best-effort basis by `PodDisruptionBudget`	Let a high-priority Pod evict lower-priority ones when no node is feasible