Guptaji Teegela

Beyond Scheduling: How Kubernetes Uses QoS, Priority, and Scoring to Keep Your Cluster Balanced

When every Pod screams for CPU and memory, who decides who lives, who waits, and who gets evicted?

Kubernetes isn't just a scheduler — it's a negotiator of fairness and efficiency.
Every second, it balances hundreds of workloads, deciding what runs, what waits, and what gets terminated — while maintaining reliability and cost efficiency.

This article unpacks how Quality of Service (QoS), Priority Classes, Preemption, and Bin-Packing Scoring come together to keep your cluster stable and fair.


⚙️ The Challenge: Competing Workloads in Shared Clusters

When multiple workloads share cluster resources, conflicts are inevitable:

  1. High-traffic apps starve lower-priority workloads.
  2. Batch jobs hog memory.
  3. Pods without limits cause unpredictable evictions.

Kubernetes addresses this by applying a layered decision-making model — QoS, Priority, Preemption, and Scoring.


🧭 QoS (Quality of Service): Who Gets Evicted First

Each Pod belongs to a QoS class based on CPU and memory configuration:

| QoS Class  | Description                           | Eviction Priority        |
|------------|---------------------------------------|--------------------------|
| Guaranteed | Requests = limits for all containers  | Evicted last             |
| Burstable  | Requests set, but not equal to limits | Evicted after BestEffort |
| BestEffort | No requests or limits set             | Evicted first            |

💡 Lesson: Always define requests and limits — QoS decides who survives under node pressure.
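For example, here is a minimal sketch of a Pod that lands in the Guaranteed class (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-app                                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/web-app:1.0  # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"      # limits equal requests on every container -> Guaranteed
          memory: "512Mi"

Set requests lower than limits (or omit limits) and the Pod becomes Burstable; omit both and it drops to BestEffort.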


🧱 Priority Classes: Who Runs First

QoS defines who stays, while Priority Classes define who starts.
Assigning an integer-valued PriorityClass lets the scheduler rank workloads when deciding what to place first.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-services
value: 100000          # higher value = scheduled first; user classes max out at 1,000,000,000
globalDefault: false   # Pods must reference this class explicitly
description: Critical platform workloads
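
A Pod opts in by referencing the class through priorityClassName; a minimal sketch (the Pod details are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: checkout-worker                   # hypothetical name
spec:
  priorityClassName: critical-services    # the PriorityClass defined above
  containers:
    - name: app
      image: registry.example.com/checkout:2.3   # placeholder image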

💡 Lesson: Reserve high priorities for mission-critical services.
Overusing "high" priority leads to chaos — not resilience.


⚔️ Preemption: Controlled Sacrifice, Not Chaos

When a high-priority Pod can't be scheduled:

  1. The scheduler identifies lower-priority Pods occupying the resources it needs.
  2. It marks those Pods (the victims) for graceful termination.
  3. It schedules the high-priority Pod once the resources free up.

The scheduler tries to honor PodDisruptionBudgets (PDBs) when selecting victims, limiting collateral damage; note that preemption is still allowed to violate a PDB if no compliant set of victims exists.

💡 Lesson: Preemption is controlled resilience — ensuring important workloads run while maintaining order.
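For instance, here is a minimal PDB sketch that keeps at least two replicas of a hypothetical batch-worker workload running through disruptions (the name and label are assumptions):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb    # hypothetical name
spec:
  minAvailable: 2           # never drop below 2 ready Pods
  selector:
    matchLabels:
      app: batch-worker     # assumed label on the target Pods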


⚖️ Scoring & Bin-Packing: Finding the Right Home

Once eligible nodes are filtered, Kubernetes enters the scoring phase to find the best fit.

Plugins involved (legacy names, with their modern scheduler-framework equivalents in parentheses):

  • LeastRequestedPriority (NodeResourcesFit with the LeastAllocated strategy) → favors underutilized nodes.
  • BalancedResourceAllocation (NodeResourcesBalancedAllocation) → balances CPU & memory use.
  • ImageLocalityPriority (ImageLocality) → prefers nodes with cached images.
  • NodeAffinityPriority (NodeAffinity) → honors affinity preferences.
  • TopologySpreadConstraint (PodTopologySpread) → ensures zone diversity.

Each node receives a score (0–100) from multiple plugins.
Weighted scores are combined:

final_score = (w1*s1) + (w2*s2) + ...

How weights work:

Scheduler plugins ship with default weights (exact defaults vary by Kubernetes version), and you can override them in the scheduler configuration. For example:

  • LeastRequestedPriority: weight 1 (default) — spreads pods across nodes
  • BalancedResourceAllocation: weight 1 (default) — prevents CPU/memory imbalance
  • ImageLocalityPriority: weight 1 (default) — prefers nodes with cached images
  • NodeAffinityPriority: weight 2 (default) — stronger preference for affinity matches

You can adjust these weights in the kube-scheduler config to prioritize different strategies. Higher weights mean that plugin's score has more influence on the final decision.
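As a sketch, here is a KubeSchedulerConfiguration that overrides a few score-plugin weights, using the modern plugin names (the weight values are illustrative, not recommendations):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 1       # keep CPU/memory consumption balanced
          - name: ImageLocality
            weight: 3       # strongly prefer nodes with the image cached
          - name: NodeAffinity
            weight: 2       # favor affinity matches

With these weights, a node scoring 80 (balanced), 90 (image locality), and 50 (affinity) totals 1*80 + 3*90 + 2*50 = 450 before normalization. The file is passed to kube-scheduler via its --config flag.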

QoS defines survivability.
Priority defines importance.
Scoring defines placement.

Together, they shape a stable and efficient cluster.


📖 Real-World Example: Critical Service Under Pressure

Imagine your payment service needs to scale during a traffic spike:

  1. Priority Class (value: 100000) ensures the payment pod is considered before batch jobs.
  2. QoS (Guaranteed) with matching requests/limits protects it from eviction when nodes fill up.
  3. Scoring evaluates nodes: Node A has the payment image cached (image-locality score: 85), Node B is underutilized (least-requested score: 90). With equal weights and other scores even, Node B wins.
  4. Preemption kicks in if no node has capacity: a lower-priority batch Pod is preempted to make room (preemption keys off priority; QoS governs node-pressure eviction).

Without these mechanisms:

  • Payment pods might wait behind batch jobs
  • Random evictions could kill critical services
  • Poor node selection causes slow startup times

With proper configuration:

  • Critical services schedule first
  • Predictable eviction order protects important workloads
  • Optimal node placement reduces latency
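
Putting the pieces together, the payment Pod from this scenario might look like the following sketch (names, image, and resource sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payment-api                       # hypothetical name
spec:
  priorityClassName: critical-services    # scheduled (and preempts) ahead of batch work
  containers:
    - name: app
      image: registry.example.com/payment:4.1   # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"        # requests = limits -> Guaranteed QoS
          memory: "1Gi"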

🧩 Visual Flow: Kubernetes Scheduling & Bin-Packing

[Image: Kubernetes scheduling flow]


🔧 Troubleshooting Common Issues

"Why is my high-priority pod still pending?"

  • Check node resources: kubectl describe nodes to see available CPU/memory
  • Verify PriorityClass is applied: kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName}'
  • Check for taints/tolerations: high priority doesn't bypass node taints
  • Review preemption logs: kubectl logs -n kube-system <scheduler-pod> for preemption attempts

"My Guaranteed QoS pod got evicted — why?"

  • Node pressure evictions respect QoS, but disk pressure can evict any pod
  • Check node conditions: kubectl describe node <node-name> and look for DiskPressure or MemoryPressure under Conditions
  • Verify requests/limits match exactly: kubectl describe pod <pod-name> to confirm the Guaranteed class (or query it directly, as shown below)
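
To read the assigned class straight from the Pod's status:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'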

"Pods are scheduling to the wrong nodes"

  • Review scoring plugins: check kube-scheduler config for disabled plugins
  • Verify node labels/affinity: kubectl get nodes --show-labels
  • Check resource requests: pods with large requests may have limited node options
  • Inspect scheduler events: kubectl get events --field-selector involvedObject.kind=Pod

"Preemption isn't working"

  • Ensure PriorityClass exists: kubectl get priorityclass
  • Check PDB constraints: PodDisruptionBudgets can prevent preemption
  • Verify pod priority values: lower-priority pods must exist for preemption to occur
  • Review scheduler configuration: preemption may be disabled in custom scheduler configs

🧠 Key Lessons for SREs & Platform Teams

✅ Always define CPU/memory requests & limits.
✅ Use PriorityClasses sparingly.
✅ Test evictions under simulated stress.
✅ Combine QoS + PDB + Priority for controlled resilience.
✅ Observe scheduling signals (e.g., kube_pod_status_phase from kube-state-metrics, scheduler_pending_pods from kube-scheduler) regularly.


🚀 Takeaway

Kubernetes doesn't just schedule Pods — it negotiates priorities.
Reliability doesn't come from overprovisioning, but from predictable, fair, and disciplined scheduling.

Resilience = Consistency in scheduling decisions.

💬 Connect with Me

✍️ If you found this helpful, follow me for more insights on Platform Engineering, SRE, and CloudOps strategies that scale reliability and speed.

🔗 Follow me on LinkedIn if you’d like to discuss reliability architecture, automation, or platform strategy.

Images are generated using Gemini-AI
