When every Pod screams for CPU and memory, who decides who lives, who waits, and who gets evicted?
Kubernetes isn't just a scheduler — it's a negotiator of fairness and efficiency.
Every second, it balances hundreds of workloads, deciding what runs, what waits, and what gets terminated — while maintaining reliability and cost efficiency.
This article unpacks how Quality of Service (QoS), Priority Classes, Preemption, and Bin-Packing Scoring come together to keep your cluster stable and fair.
⚙️ The Challenge: Competing Workloads in Shared Clusters
When multiple workloads share cluster resources, conflicts are inevitable:
- High-traffic apps starve lower-priority workloads.
- Batch jobs hog memory.
- Pods without limits cause unpredictable evictions.
Kubernetes addresses this by applying a layered decision-making model — QoS, Priority, Preemption, and Scoring.
🧭 QoS (Quality of Service): Who Gets Evicted First
Each Pod belongs to a QoS class based on CPU and memory configuration:
| QoS Class | Description | Eviction Priority |
|---|---|---|
| Guaranteed | Requests = Limits for every container | Evicted last |
| Burstable | Requests < Limits, or only some containers set them | Evicted after BestEffort |
| BestEffort | No requests or limits set on any container | Evicted first |
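For example, a minimal sketch of a Pod that lands in the Guaranteed class (the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                 # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:                  # equal to requests for every container -> Guaranteed
          cpu: "250m"
          memory: "256Mi"
```

You can confirm the class Kubernetes assigned with `kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'`.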
💡 Lesson: Always define requests and limits — QoS decides who survives under node pressure.
🧱 Priority Classes: Who Runs First
QoS defines who stays, while Priority Classes define who starts.
Assigning PriorityClass values (integer-based) helps rank workloads during scheduling.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-services
value: 100000
description: "Critical platform workloads"
```
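Pods opt in by referencing the class by name; a minimal sketch (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app                        # illustrative name
spec:
  priorityClassName: critical-services      # references the PriorityClass above
  containers:
    - name: app
      image: registry.example.com/app:v1    # illustrative image
```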
💡 Lesson: Reserve high priorities for mission-critical services.
Overusing "high" priority leads to chaos — not resilience.
⚔️ Preemption: Controlled Sacrifice, Not Chaos
When a high-priority Pod can't be scheduled:
- The scheduler identifies lower-priority Pods occupying resources.
- Marks them for termination.
- Reschedules the high-priority Pod.
Preemption respects PodDisruptionBudgets (PDBs) on a best-effort basis: the scheduler prefers victims whose eviction won't violate a PDB, though it may still violate one if there is no other way to place the high-priority Pod.
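A minimal PDB sketch that shields a hypothetical `batch-worker` workload from losing all replicas at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb       # illustrative name
spec:
  minAvailable: 1              # always keep at least one Pod running
  selector:
    matchLabels:
      app: batch-worker        # illustrative label
```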
💡 Lesson: Preemption is controlled resilience — ensuring important workloads run while maintaining order.
⚖️ Scoring & Bin-Packing: Finding the Right Home
Once eligible nodes are filtered, Kubernetes enters the scoring phase to find the best fit.
Plugins involved (legacy names shown with their modern scheduler-framework equivalents):
- LeastRequestedPriority (now NodeResourcesFit with the LeastAllocated strategy) → favors underutilized nodes.
- BalancedResourceAllocation (now NodeResourcesBalancedAllocation) → balances CPU & memory use.
- ImageLocalityPriority (now ImageLocality) → prefers nodes with cached images.
- NodeAffinityPriority (now NodeAffinity) → honors affinity preferences.
- TopologySpreadConstraint (now PodTopologySpread) → ensures zone diversity.
Each scoring plugin gives every candidate node a score from 0 to 100.
Weighted scores are combined:
final_score = (w1*s1) + (w2*s2) + ...
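For instance (illustrative numbers): with weights w1 = 1 and w2 = 2, a node scoring s1 = 90, s2 = 40 totals 1×90 + 2×40 = 170, while a node scoring s1 = 60, s2 = 80 totals 60 + 160 = 220 and wins despite the lower first score.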
How weights work:
Scheduler plugins have default weights that you can customize via the scheduler configuration. For example:
- LeastRequestedPriority: weight 1 (default), spreads Pods across nodes
- BalancedResourceAllocation: weight 1 (default), prevents CPU/memory imbalance
- ImageLocalityPriority: weight 1 (default), prefers nodes with cached images
- NodeAffinityPriority: weight 2 (default), gives a stronger preference to affinity matches
You can adjust these weights in the kube-scheduler config to prioritize different strategies. Higher weights mean that plugin's score has more influence on the final decision.
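A sketch of a kube-scheduler configuration that re-weights score plugins, using current scheduler-framework plugin names (the weight values here are illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: ImageLocality                     # favor nodes with cached images more strongly
            weight: 3
          - name: NodeResourcesBalancedAllocation   # balance CPU/memory usage
            weight: 2
```

The file is passed to kube-scheduler via its `--config` flag.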
QoS defines survivability.
Priority defines importance.
Scoring defines placement.
Together, they shape a stable and efficient cluster.
📖 Real-World Example: Critical Service Under Pressure
Imagine your payment service needs to scale during a traffic spike:
- Priority Class (value: 100000) ensures the payment Pod is considered before batch jobs.
- QoS (Guaranteed) with matching requests/limits protects it from eviction when nodes fill up.
- Scoring evaluates nodes: Node A has the payment image cached (ImageLocalityPriority: 85), Node B is underutilized (LeastRequestedPriority: 90). Node B wins.
- Preemption kicks in if no nodes have capacity: a low-priority batch job pod (BestEffort QoS) gets evicted to make room.
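Put together, the payment Pod's spec might look like this sketch (name, image, and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service                       # illustrative name
spec:
  priorityClassName: critical-services        # value: 100000
  containers:
    - name: payment
      image: registry.example.com/payment:v1  # illustrative image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:                               # equal to requests -> Guaranteed QoS
          cpu: "500m"
          memory: "512Mi"
```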
Without these mechanisms:
- Payment pods might wait behind batch jobs
- Random evictions could kill critical services
- Poor node selection causes slow startup times
With proper configuration:
- Critical services schedule first
- Predictable eviction order protects important workloads
- Optimal node placement reduces latency
🧩 Visual Flow: Kubernetes Scheduling & Bin-Packing
🔧 Troubleshooting Common Issues
"Why is my high-priority pod still pending?"
- Check node resources: `kubectl describe nodes` to see available CPU/memory
- Verify the PriorityClass is applied: `kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName}'`
- Check for taints/tolerations: high priority doesn't bypass node taints
- Review preemption logs: `kubectl logs -n kube-system <scheduler-pod>` for preemption attempts
"My Guaranteed QoS pod got evicted — why?"
- Node-pressure evictions respect QoS order, but disk pressure can evict any Pod
- Check node conditions: `kubectl describe nodes` for `DiskPressure` or `MemoryPressure`
- Verify requests/limits match exactly: `kubectl describe pod <pod-name>` to confirm the Guaranteed class
"Pods are scheduling to the wrong nodes"
- Review scoring plugins: check kube-scheduler config for disabled plugins
- Verify node labels/affinity: `kubectl get nodes --show-labels`
- Check resource requests: Pods with large requests may have limited node options
- Inspect scheduler events: `kubectl get events --field-selector involvedObject.kind=Pod`
"Preemption isn't working"
- Ensure the PriorityClass exists: `kubectl get priorityclass`
- Check PDB constraints: PodDisruptionBudgets can prevent preemption
- Verify pod priority values: lower-priority pods must exist for preemption to occur
- Review scheduler configuration: preemption may be disabled in custom scheduler configs
🧠 Key Lessons for SREs & Platform Teams
✅ Always define CPU/memory requests & limits.
✅ Use PriorityClasses sparingly.
✅ Test evictions under simulated stress.
✅ Combine QoS + PDB + Priority for controlled resilience.
✅ Observe scheduling metrics regularly (e.g., `kube_pod_status_phase` from kube-state-metrics and the kube-scheduler's `scheduler_pending_pods`).
🚀 Takeaway
Kubernetes doesn't just schedule Pods — it negotiates priorities.
Reliability doesn't come from overprovisioning, but from predictable, fair, and disciplined scheduling.
Resilience = Consistency in scheduling decisions.
💬 Connect with Me
✍️ If you found this helpful, follow me for more insights on Platform Engineering, SRE, and CloudOps strategies that scale reliability and speed.
🔗 Follow me on LinkedIn if you’d like to discuss reliability architecture, automation, or platform strategy.
Images are generated using Gemini AI.
