Guptaji Teegela

Beyond Scheduling: How Kubernetes Uses QoS, Priority, and Scoring to Keep Your Cluster Balanced

When every Pod screams for CPU and memory, who decides who lives, who waits, and who gets evicted?

Kubernetes isn't just a scheduler — it's a negotiator of fairness and efficiency.
Every second, it balances hundreds of workloads, deciding what runs, what waits, and what gets terminated — while maintaining reliability and cost efficiency.

This article unpacks how Quality of Service (QoS), Priority Classes, Preemption, and Bin-Packing Scoring come together to keep your cluster stable and fair.


⚙️ The Challenge: Competing Workloads in Shared Clusters

When multiple workloads share cluster resources, conflicts are inevitable:

  1. High-traffic apps starve lower-priority workloads.
  2. Batch jobs hog memory.
  3. Pods without limits cause unpredictable evictions.

Kubernetes addresses this by applying a layered decision-making model — QoS, Priority, Preemption, and Scoring.


🧭 QoS (Quality of Service): Who Gets Evicted First

Each Pod belongs to a QoS class based on CPU and memory configuration:

| QoS Class  | Description                           | Eviction Priority        |
|------------|---------------------------------------|--------------------------|
| Guaranteed | Requests = limits for all containers  | Evicted last             |
| Burstable  | Requests set, but not equal to limits | Evicted after BestEffort |
| BestEffort | No requests or limits set             | Evicted first            |

💡 Lesson: Always define requests and limits — QoS decides who survives under node pressure.
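For example, here is a minimal sketch of a Pod that lands in the Guaranteed class (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-app                                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/web-app:1.0  # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"      # limits equal requests on every container -> Guaranteed
          memory: "512Mi"

Set requests lower than limits (or omit limits) and the Pod becomes Burstable; omit both and it drops to BestEffort.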


🧱 Priority Classes: Who Runs First

QoS defines who stays, while Priority Classes define who starts.
Assigning an integer-valued PriorityClass lets the scheduler rank workloads when deciding what to place first.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-services
value: 100000          # higher value = scheduled first; user classes max out at 1,000,000,000
globalDefault: false   # Pods must reference this class explicitly
description: Critical platform workloads
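
A Pod opts in by referencing the class through priorityClassName; a minimal sketch (the Pod details are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: checkout-worker                   # hypothetical name
spec:
  priorityClassName: critical-services    # the PriorityClass defined above
  containers:
    - name: app
      image: registry.example.com/checkout:2.3   # placeholder image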

💡 Lesson: Reserve high priorities for mission-critical services.
Overusing "high" priority leads to chaos — not resilience.


⚔️ Preemption: Controlled Sacrifice, Not Chaos

When a high-priority Pod can't be scheduled:

  1. The scheduler identifies lower-priority Pods occupying the resources it needs.
  2. It marks those Pods (the victims) for graceful termination.
  3. It schedules the high-priority Pod once the resources free up.

The scheduler tries to honor PodDisruptionBudgets (PDBs) when selecting victims, limiting collateral damage; note that preemption is still allowed to violate a PDB if no compliant set of victims exists.

💡 Lesson: Preemption is controlled resilience — ensuring important workloads run while maintaining order.
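For instance, here is a minimal PDB sketch that keeps at least two replicas of a hypothetical batch-worker workload running through disruptions (the name and label are assumptions):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb    # hypothetical name
spec:
  minAvailable: 2           # never drop below 2 ready Pods
  selector:
    matchLabels:
      app: batch-worker     # assumed label on the target Pods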


⚖️ Scoring & Bin-Packing: Finding the Right Home

Once eligible nodes are filtered, Kubernetes enters the scoring phase to find the best fit.

Plugins involved (legacy names, with their modern scheduler-framework equivalents in parentheses):

  • LeastRequestedPriority (NodeResourcesFit with the LeastAllocated strategy) → favors underutilized nodes.
  • BalancedResourceAllocation (NodeResourcesBalancedAllocation) → balances CPU & memory use.
  • ImageLocalityPriority (ImageLocality) → prefers nodes with cached images.
  • NodeAffinityPriority (NodeAffinity) → honors affinity preferences.
  • TopologySpreadConstraint (PodTopologySpread) → ensures zone diversity.

Each node receives a score (0–100) from multiple plugins.
Weighted scores are combined:

final_score = (w1*s1) + (w2*s2) + ...

How weights work:

Scheduler plugins ship with default weights (exact defaults vary by Kubernetes version), and you can override them in the scheduler configuration. For example:

  • LeastRequestedPriority: weight 1 (default) — spreads pods across nodes
  • BalancedResourceAllocation: weight 1 (default) — prevents CPU/memory imbalance
  • ImageLocalityPriority: weight 1 (default) — prefers nodes with cached images
  • NodeAffinityPriority: weight 2 (default) — stronger preference for affinity matches

You can adjust these weights in the kube-scheduler config to prioritize different strategies. Higher weights mean that plugin's score has more influence on the final decision.
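As a sketch, here is a KubeSchedulerConfiguration that overrides a few score-plugin weights, using the modern plugin names (the weight values are illustrative, not recommendations):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 1       # keep CPU/memory consumption balanced
          - name: ImageLocality
            weight: 3       # strongly prefer nodes with the image cached
          - name: NodeAffinity
            weight: 2       # favor affinity matches

With these weights, a node scoring 80 (balanced), 90 (image locality), and 50 (affinity) totals 1*80 + 3*90 + 2*50 = 450 before normalization. The file is passed to kube-scheduler via its --config flag.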

QoS defines survivability.
Priority defines importance.
Scoring defines placement.

Together, they shape a stable and efficient cluster.


📖 Real-World Example: Critical Service Under Pressure

Imagine your payment service needs to scale during a traffic spike:

  1. Priority Class (value: 100000) ensures the payment pod is considered before batch jobs.
  2. QoS (Guaranteed) with matching requests/limits protects it from eviction when nodes fill up.
  3. Scoring evaluates nodes: Node A has the payment image cached (image-locality score: 85), Node B is underutilized (least-requested score: 90). With equal weights and other scores even, Node B wins.
  4. Preemption kicks in if no node has capacity: a lower-priority batch Pod is preempted to make room (preemption keys off priority; QoS governs node-pressure eviction).

Without these mechanisms:

  • Payment pods might wait behind batch jobs
  • Random evictions could kill critical services
  • Poor node selection causes slow startup times

With proper configuration:

  • Critical services schedule first
  • Predictable eviction order protects important workloads
  • Optimal node placement reduces latency
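
Putting the pieces together, the payment Pod from this scenario might look like the following sketch (names, image, and resource sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payment-api                       # hypothetical name
spec:
  priorityClassName: critical-services    # scheduled (and preempts) ahead of batch work
  containers:
    - name: app
      image: registry.example.com/payment:4.1   # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"        # requests = limits -> Guaranteed QoS
          memory: "1Gi"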

🧩 Visual Flow: Kubernetes Scheduling & Bin-Packing

[Image: Kubernetes scheduling flow]


🔧 Troubleshooting Common Issues

"Why is my high-priority pod still pending?"

  • Check node resources: kubectl describe nodes to see available CPU/memory
  • Verify PriorityClass is applied: kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName}'
  • Check for taints/tolerations: high priority doesn't bypass node taints
  • Review preemption logs: kubectl logs -n kube-system <scheduler-pod> for preemption attempts

"My Guaranteed QoS pod got evicted — why?"

  • Node pressure evictions respect QoS, but disk pressure can evict any pod
  • Check node conditions: kubectl describe node <node-name> and look for DiskPressure or MemoryPressure under Conditions
  • Verify requests/limits match exactly: kubectl describe pod <pod-name> to confirm the Guaranteed class (or query it directly, as shown below)
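
To read the assigned class straight from the Pod's status:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'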

"Pods are scheduling to the wrong nodes"

  • Review scoring plugins: check kube-scheduler config for disabled plugins
  • Verify node labels/affinity: kubectl get nodes --show-labels
  • Check resource requests: pods with large requests may have limited node options
  • Inspect scheduler events: kubectl get events --field-selector involvedObject.kind=Pod

"Preemption isn't working"

  • Ensure PriorityClass exists: kubectl get priorityclass
  • Check PDB constraints: PodDisruptionBudgets can prevent preemption
  • Verify pod priority values: lower-priority pods must exist for preemption to occur
  • Review scheduler configuration: preemption may be disabled in custom scheduler configs

🧠 Key Lessons for SREs & Platform Teams

✅ Always define CPU/memory requests & limits.
✅ Use PriorityClasses sparingly.
✅ Test evictions under simulated stress.
✅ Combine QoS + PDB + Priority for controlled resilience.
✅ Observe scheduling signals (e.g., kube_pod_status_phase from kube-state-metrics, scheduler_pending_pods from kube-scheduler) regularly.


🚀 Takeaway

Kubernetes doesn't just schedule Pods — it negotiates priorities.
Reliability doesn't come from overprovisioning, but from predictable, fair, and disciplined scheduling.

Resilience = Consistency in scheduling decisions.

💬 Connect with Me

✍️ If you found this helpful, follow me for more insights on Platform Engineering, SRE, and CloudOps strategies that scale reliability and speed.

🔗 Follow me on LinkedIn if you’d like to discuss reliability architecture, automation, or platform strategy.

Images are generated using Gemini-AI
