
James Lee

Kubernetes Resource Scheduling: Filtering, Scoring & Priority Preemption

In the previous article we traced the full CRUD control flow in Kubernetes. We saw that resource creation passes through kube-scheduler at step ④. Now let's zoom into that step and understand exactly how the scheduler works.


1. Where Scheduling Fits in the Control Flow

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   kubectl / REST Request                                            │
│          │ ①                                                        │
│          ▼                                                          │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    kube-apiserver                           │   │
│   └──┬──────────────────────────┬──────────────────────────┬───┘   │
│    ② │                        ③ │                        ④ │       │
│      ▼                          ▼                          ▼       │
│    etcd              kube-controller-manager        kube-scheduler  │
│                                                                     │
│                                              ⑤ binding → apiserver │
│                                                        │            │
│                                                        ▼            │
│                                    ┌───────────────────────────┐    │
│                                    │         kubelet           │    │
│                                    │  ⑥ Pod    │    ⑥ Pod     │    │
│                                    │ [C]...[C] │  [C]...[C]   │    │
│                                    └───────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

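Scheduling is step ④: kube-scheduler watches the apiserver for Pods that have no node assigned (an empty spec.nodeName), runs its filtering and scoring algorithms, and writes the Pod-to-Node binding back through the apiserver, which persists it in etcd.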


2. The Scheduling Pipeline: Three Phases

When kube-scheduler detects a Pod in the scheduling queue, it runs up to three phases. Filtering and scoring run for every Pod; preemption runs only when filtering leaves no feasible node:

Resource queue (unscheduled Pods)  +  Available node list
                  │
                  ▼
┌─────────────────────────────────┐
│   Phase 1: FILTERING            │  ← eliminate infeasible nodes
└─────────────────┬───────────────┘
                  │  feasible nodes
                  ▼
┌─────────────────────────────────┐
│   Phase 2: SCORING              │  ← rank feasible nodes
└─────────────────┬───────────────┘
                  │  optimal node selected
                  ▼
┌─────────────────────────────────┐
│   Phase 3: PRIORITY & PREEMPTION│  ← handle scheduling failures
└─────────────────┬───────────────┘
                  │
                  ▼
      Binding result (Pod ↔ Node) written to etcd

3. Phase 1: Filtering

Goal: From all cluster nodes, select every node that is capable of running this Pod.

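A node must pass every enabled filter to be considered feasible. The names below are the legacy predicate names; recent Kubernetes releases implement the same checks as filter plugins in the scheduling framework (for example NodeResourcesFit and TaintToleration).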

| Algorithm | What it checks |
|---|---|
| podFitsResources | Node has enough unallocated CPU and memory for the Pod's requests |
| podFitsHost | If nodeName is set, only that specific node passes |
| podFitsHostPorts | Required host ports are not already occupied on the node |
| podMatchNodeSelector | Node labels match the Pod's nodeSelector and required nodeAffinity |
| NoDiskConflict | Required volumes are not exclusively mounted elsewhere |
| NoVolumeZoneConflict | Volume availability zone is compatible with the node's zone |
| MaxCSIVolumeCount | Node has not exceeded the CSI volume attachment limit |
| CheckNodeMemoryPressure | Node is not under memory pressure |
| CheckNodeDiskPressure | Node is not under disk pressure |
| CheckNodePIDPressure | Node is not under PID pressure |
| CheckNodeCondition | Node is in a healthy, ready condition |
| podToleratesNodeTaints | Pod tolerates every NoSchedule and NoExecute taint on the node |
| CheckVolumeBinding | Required PersistentVolumeClaims can be satisfied on this node |

Example: Pod requests 4 CPU, 8Gi memory

Node A: 8 CPU, 16Gi → 5 CPU free, 10Gi free  ✅ passes all filters
Node B: 4 CPU,  8Gi → 1 CPU free,  2Gi free  ❌ podFitsResources fails
Node C: 8 CPU, 16Gi → tainted NoSchedule     ❌ podToleratesNodeTaints fails
Node D: 8 CPU, 16Gi → 6 CPU free, 12Gi free  ✅ passes all filters

Feasible nodes after filtering: [A, D]
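The filtering example above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler's actual code: the `Node` and `Pod` classes, the two filter functions, and the taint model (a set of taint keys, all treated as NoSchedule) are simplifications assumed here. Node C's free capacity is not given in the example, so it is assumed to be fully free.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: float        # CPU cores not yet requested by other Pods
    free_mem_gi: float     # GiB of memory not yet requested
    taints: set = field(default_factory=set)       # simplified: NoSchedule taint keys


@dataclass
class Pod:
    cpu: float
    mem_gi: float
    tolerations: set = field(default_factory=set)  # simplified: tolerated taint keys


def fits_resources(pod, node):
    """Simplified stand-in for podFitsResources."""
    return node.free_cpu >= pod.cpu and node.free_mem_gi >= pod.mem_gi


def tolerates_taints(pod, node):
    """Simplified stand-in for podToleratesNodeTaints."""
    return node.taints <= pod.tolerations


FILTERS = [fits_resources, tolerates_taints]


def feasible_nodes(pod, nodes):
    # A node is feasible only if every filter accepts it.
    return [n.name for n in nodes if all(f(pod, n) for f in FILTERS)]


nodes = [
    Node("A", free_cpu=5, free_mem_gi=10),
    Node("B", free_cpu=1, free_mem_gi=2),
    Node("C", free_cpu=8, free_mem_gi=16, taints={"dedicated"}),  # assumed fully free
    Node("D", free_cpu=6, free_mem_gi=12),
]
pod = Pod(cpu=4, mem_gi=8)
print(feasible_nodes(pod, nodes))  # ['A', 'D']
```

Because a node must pass every filter, adding a new hard constraint only means appending another function to `FILTERS`.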

4. Phase 2: Scoring

Goal: From the feasible nodes, pick the single best node by ranking them with scoring algorithms.

Each algorithm returns a score of 0–100 for every feasible node. A node's final score is the weighted sum of those scores, and the node with the highest total wins.

| Algorithm | What it favors |
|---|---|
| SelectorSpreadPriority | Spread Pods of the same Service/ReplicaSet across different nodes (HA) |
| InterPodAffinityPriority | Nodes satisfying preferred inter-pod affinity/anti-affinity rules |
| LeastRequestedPriority | Nodes with the most remaining CPU + memory (load spreading) |
| MostRequestedPriority | Nodes with the least remaining resources (bin-packing, reduce active nodes) |
| RequestedToCapacityRatioPriority | Nodes scored by a configurable function of their requested-to-capacity ratio |
| BalancedResourceAllocation | Nodes where CPU and memory utilization are balanced (avoid skewed usage) |
| NodePreferAvoidPodsPriority | Avoid nodes annotated to repel certain Pod types |
| NodeAffinityPriority | Nodes matching preferredDuringSchedulingIgnoredDuringExecution affinity |
| TaintTolerationPriority | Nodes with fewer un-tolerated taints (soft preference) |
| ImageLocalityPriority | Nodes that already have the required container images cached locally |
| ServiceSpreadingPriority | Further spread Pods belonging to the same Service |
| CalculateAntiAffinityPriorityMap | Penalize nodes that would violate anti-affinity preferences |
| EqualPriorityMap | Give all nodes equal score (used as a baseline / tie-breaker) |

Scoring feasible nodes A and D:

                        Node A    Node D
LeastRequested:           72        85
ImageLocality:            50       100   (image cached on D)
SelectorSpread:           80        60
BalancedResource:         75        80
─────────────────────────────────────
Weighted total:           68        82

Winner: Node D  ✅
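To make the weighted sum concrete, here is a small Python sketch that combines the per-algorithm scores from the example above. The article does not state the weights, so equal weights are assumed, and the total is divided by the sum of the weights purely to keep it on the same 0–100 scale as the example. With these assumptions the totals come out close to, but not exactly, the 68 and 82 shown above.

```python
# Per-algorithm scores copied from the example above.
scores = {
    "A": {"LeastRequested": 72, "ImageLocality": 50, "SelectorSpread": 80, "BalancedResource": 75},
    "D": {"LeastRequested": 85, "ImageLocality": 100, "SelectorSpread": 60, "BalancedResource": 80},
}

# Hypothetical weights: not given in the article, so every algorithm counts equally.
weights = {"LeastRequested": 1, "ImageLocality": 1, "SelectorSpread": 1, "BalancedResource": 1}


def weighted_total(node_scores, weights):
    """Weighted sum divided by the total weight, so the result stays in 0-100."""
    total_weight = sum(weights.values())
    return sum(node_scores[name] * w for name, w in weights.items()) / total_weight


totals = {node: weighted_total(s, weights) for node, s in scores.items()}
winner = max(totals, key=totals.get)

print(totals)  # {'A': 69.25, 'D': 81.25}
print(winner)  # D
```

Raising the weight of one algorithm (for example `SelectorSpread`, where Node A scores higher) shifts the ranking toward the nodes that algorithm prefers.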

5. Phase 3: Priority & Preemption

Goal: Handle the case where a high-priority Pod cannot be scheduled because no feasible node exists.

Normal scheduling failure behavior

Under normal circumstances, when a Pod fails to schedule:

Pod scheduling fails
    → Pod status: Pending
    → Pod sits in queue
    → Retried when: Pod is updated, cluster state changes, or a retry interval elapses

This is fine for equal-priority workloads. But what if a critical system Pod can't be scheduled because lower-priority Pods are consuming all resources?

Priority & Preemption to the rescue

High-priority Pod scheduling fails (no feasible node)
    │
    ▼
Preemption kicks in
    │
    ▼
Scheduler finds a node where evicting low-priority Pods
would free enough resources for the high-priority Pod
    │
    ▼
Low-priority Pods are evicted (graceful termination)
    │
    ▼
High-priority Pod is retried and normally lands on that node  ✅
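
PodDisruptionBudget (PDB) is an API object rather than a scheduling algorithm. It defines the minimum number of Pods that must remain available during voluntary disruptions, and the scheduler tries to pick preemption victims whose eviction would not violate it. This is best effort: if no other victims exist, preemption proceeds anyway.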
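The victim search described above can be sketched as follows. This is a deliberately simplified model, not the scheduler's real algorithm: it considers CPU only, ignores PodDisruptionBudgets, and the rule "prefer the node with the fewest victims" is an assumption made for illustration. The names `RunningPod`, `pick_victims`, and `choose_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    priority: int
    cpu: float


def pick_victims(incoming_priority, incoming_cpu, free_cpu, running):
    """Return the Pods to evict on one node, or None if eviction cannot make room."""
    victims = []
    # Only Pods with strictly lower priority may be preempted; try the lowest first.
    candidates = sorted(
        (p for p in running if p.priority < incoming_priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        if free_cpu >= incoming_cpu:
            break
        victims.append(pod)
        free_cpu += pod.cpu
    return victims if free_cpu >= incoming_cpu else None


def choose_node(incoming_priority, incoming_cpu, nodes):
    """nodes maps node name -> (free_cpu, running Pods). Prefers the fewest victims."""
    best = None
    for name, (free_cpu, running) in nodes.items():
        victims = pick_victims(incoming_priority, incoming_cpu, free_cpu, running)
        if victims is not None and (best is None or len(victims) < len(best[1])):
            best = (name, victims)
    return best


nodes = {
    "A": (1.0, [RunningPod("batch-1", 100, 2.0), RunningPod("batch-2", 100, 2.0)]),
    "B": (2.0, [RunningPod("web-1", 2000000, 4.0)]),
}
node, victims = choose_node(incoming_priority=1000000, incoming_cpu=4.0, nodes=nodes)
print(node, [v.name for v in victims])  # A ['batch-1', 'batch-2']
```

Node B is skipped because its only Pod has a higher priority than the incoming Pod, so nothing there may be evicted.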

Defining Pod Priority

# Step 1: Create a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # higher value = higher priority
globalDefault: false    # do not apply to Pods that omit priorityClassName

---
# Step 2: Assign it to a Pod via priorityClassName
apiVersion: v1
kind: Pod
metadata:
  name: critical-app    # illustrative name
spec:
  priorityClassName: high-priority
  containers:
  - name: critical-app
    image: my-app:latest
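Priority also affects the order in which pending Pods are attempted: higher-priority Pods are placed ahead of lower-priority ones in the scheduling queue. A rough Python sketch of that ordering, using a heap, is shown below. The `enqueue` and `dequeue` helpers and the `web` Pod with priority 1000 are hypothetical; only `critical-app` and its value of 1000000 come from the manifest above.

```python
import heapq
import itertools

_arrival = itertools.count()  # tie-breaker: earlier arrivals come out first
queue = []


def enqueue(name, priority):
    # heapq is a min-heap, so the priority is negated to pop the highest value first.
    heapq.heappush(queue, (-priority, next(_arrival), name))


def dequeue():
    _, _, name = heapq.heappop(queue)
    return name


enqueue("batch-job", 0)           # no priorityClassName and no global default: priority 0
enqueue("critical-app", 1000000)  # uses the high-priority PriorityClass
enqueue("web", 1000)              # hypothetical mid-priority class

print([dequeue() for _ in range(3)])  # ['critical-app', 'web', 'batch-job']
```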

Protecting critical Pods with PodDisruptionBudget

# Keep at least 2 replicas of my-service available during voluntary
# disruptions such as node drains; preemption honors this only on a best-effort basis
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service
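The arithmetic behind `minAvailable` is simple enough to show directly. The helper below is a hypothetical illustration that handles only an integer `minAvailable` (percentages and `maxUnavailable` are left out): the number of Pods that may be disrupted at any moment is the count of currently healthy Pods minus the required minimum.

```python
def disruptions_allowed(healthy_pods, min_available):
    """How many matching Pods may be disrupted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


print(disruptions_allowed(healthy_pods=3, min_available=2))  # 1
print(disruptions_allowed(healthy_pods=2, min_available=2))  # 0
```

With `minAvailable: 2`, a third replica is what gives a drain or a preemption any room to proceed without breaching the budget.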

6. Complete Scheduling Decision Flow

kube-scheduler detects unscheduled Pod in queue
      │
      ▼
┌─────────────────────────────────────────────────┐
│  FILTERING: run all predicate algorithms        │
│                                                 │
│  0 nodes pass?                                  │
│  ├── YES → go to Phase 3 (Preemption)           │
│  └── NO  → continue to Scoring                 │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  SCORING: run all priority algorithms           │
│  Compute weighted sum per node                  │
│  Select highest score (random on tie)           │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  PREEMPTION (only if 0 nodes passed filtering): │
│  Find node where evicting low-priority Pods     │
│  would make room (PodDisruptionBudgets are      │
│  honored on a best-effort basis)                │
│  Evict victims → Pod retried in a later cycle   │
└─────────────────────────────────────────────────┘
      │
      ▼
Write Pod/Node binding → etcd via apiserver
      │
      ▼
kubelet picks up binding → creates containers  ✅
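The branch structure of this flow can be condensed into one function. The sketch below is an assumption-laden outline rather than kube-scheduler's implementation: nodes and Pods are plain dicts carrying CPU figures only, `schedule_one` is a hypothetical name, and the single scorer assumes 8-CPU nodes.

```python
def schedule_one(pod, nodes, filters, scorers):
    """Return ("bind", node) or ("preempt", None) for one scheduling attempt.

    filters: functions (pod, node) -> bool
    scorers: list of (weight, function (pod, node) -> score in 0..100)
    """
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        # No node passed filtering: the caller would now attempt preemption.
        return ("preempt", None)

    def total(node):
        return sum(weight * score(pod, node) for weight, score in scorers)

    # max() keeps the first of several equal-scoring nodes; a random pick among ties also works.
    return ("bind", max(feasible, key=total))


# Hypothetical data for illustration.
nodes = [{"name": "A", "free_cpu": 5}, {"name": "B", "free_cpu": 1}, {"name": "D", "free_cpu": 6}]
filters = [lambda pod, node: node["free_cpu"] >= pod["cpu"]]
scorers = [(1, lambda pod, node: 100 * (node["free_cpu"] - pod["cpu"]) / 8)]  # assumes 8-CPU nodes

action, node = schedule_one({"cpu": 4}, nodes, filters, scorers)
print(action, node["name"])                                   # bind D
print(schedule_one({"cpu": 16}, nodes, filters, scorers)[0])  # preempt
```

Scoring is skipped entirely when filtering returns nothing, which is why preemption is a fallback branch and not a stage every Pod passes through.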

7. Summary

| Phase | Strategy | Algorithms | Purpose |
|---|---|---|---|
| Filtering | Hard constraints | podFitsResources, podToleratesNodeTaints, CheckVolumeBinding, ... (13 total) | Eliminate nodes that cannot run the Pod |
| Scoring | Soft preferences | LeastRequestedPriority, ImageLocalityPriority, SelectorSpreadPriority, ... (13 total) | Rank feasible nodes to find the best one |
| Priority & Preemption | Eviction policy | PriorityClass, PodDisruptionBudget (best effort) | Allow high-priority Pods to evict lower-priority ones when no node has room |

The three-phase pipeline gives Kubernetes scheduling a clean separation of concerns:

  • Can it run here? → Filtering
  • Where should it run? → Scoring
  • What if nowhere works? → Preemption

Next in this series: Kubernetes Resource Orchestration: Deployments, ReplicaSets & Rolling Updates (Part 5)


Follow the series for more deep dives into Kubernetes internals.
