How We Used Kubernetes Advanced Scheduling to Fix Real Production Incidents
Kubernetes scheduling looks simple — until production traffic hits.
In lower environments, the default scheduler works beautifully.
In production, under load, across multiple AZs, with service mesh, sidecars, GPUs, and mixed workloads?
That’s when we discovered:
The scheduler is not “basic.”
It’s powerful — if we know how to use it.
Everything below comes from real issues we resolved in a live Kubernetes environment running critical workloads: genuine 2 AM incidents.
Incident 1: All Pods in One AZ – The Near-Outage
The Situation
We had a 3-zone cluster.
A critical API deployment had 9 replicas.
One evening, during a traffic spike, one Availability Zone had network degradation.
Suddenly, 6 of our 9 pods were unreachable.
Why?
Because the default scheduler had “helpfully” packed pods into fewer zones for efficiency.
ReplicaSet guarantees count.
It does NOT guarantee distribution.
That night we learned:
High Availability is not automatic.
The Fix: Pod Topology Spread Constraints
We implemented:
• topologyKey: topology.kubernetes.io/zone
• maxSkew: 1
• whenUnsatisfiable: DoNotSchedule
This forced the scheduler to maintain balanced pod distribution across zones.
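In the Deployment's pod template, the constraint looked roughly like this (the deployment name, label, and image are illustrative, not our real manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api        # hypothetical name
spec:
  replicas: 9
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1        # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard constraint, not a preference
          labelSelector:
            matchLabels:
              app: critical-api
      containers:
        - name: api
          image: registry.k8s.io/pause:3.9   # placeholder image
```

Note that the `labelSelector` must match the pods being spread; without it the constraint counts nothing.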
What Changed Immediately
• Pods spread 3-3-3 across AZs
• Rolling updates maintained balance
• Zone failure no longer meant partial collapse
Why This Matters
Unlike anti-affinity (which is binary), Topology Spread Constraints are quantitative.
They enforce distribution math.
That’s the difference.
Real Production Lesson
During a rolling update before the fix:
New pods filled one AZ first.
Traffic imbalance followed.
Latency spikes happened.
After implementing spread constraints:
Updates remained balanced from the first pod.
No traffic concentration.
Incident 2: Node Memory Pressure – “But Requests Look Fine”
The Situation
We were using:
• Service mesh (Istio sidecars)
• Kata Containers for isolated workloads
Pods were requesting 512Mi of memory.
Nodes were showing allocatable capacity sufficient for 20 pods.
But after deploying 15 pods:
• Memory pressure started
• Evictions happened
• Latency increased
Metrics didn’t make sense.
Until we realized:
The scheduler was blind to runtime overhead.
The Fix: Pod Overhead via RuntimeClass
We defined a RuntimeClass with:
    overhead:
      podFixed:
        memory: "50Mi"
        cpu: "50m"
Now every pod using that RuntimeClass automatically added overhead to scheduling math.
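Expanded into a full manifest, it looks roughly like this (the `kata` handler name is an assumption; it must match a handler configured in the node's container runtime, and the pod is illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata              # hypothetical name
handler: kata             # must match a CRI handler configured on the node
overhead:
  podFixed:
    memory: "50Mi"        # added to every pod's memory request for scheduling
    cpu: "50m"
---
# Any pod that opts in picks up the overhead automatically:
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload # illustrative
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        requests:
          memory: "512Mi"
```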
What Changed?
Before:
Scheduler counted 512Mi per pod.
After:
Scheduler counted 562Mi per pod (512Mi request + 50Mi overhead).
Node fitting became accurate.
Evictions stopped.
Critical Insight
Pod Overhead creates an “invisible container” in scheduling math.
Without it:
You overcommit silently.
With it:
Scheduling becomes financially and operationally accurate.
Incident 3: GPU Pods Landing on Spot Nodes
The Situation
We had:
• GPU nodes (on-demand)
• General compute nodes (Spot instances)
A training workload got scheduled on a Spot node without GPU.
Result:
• CrashLoopBackOff
• Wasted compute
• Delayed ML training pipeline
Node labels existed.
But scheduling preference wasn’t strict enough.
The Fix: Scheduler Profiles
Instead of rewriting the scheduler, we:
• Created a separate scheduler profile
• Configured specific scoring behavior for ML workloads
• Used taints + tolerations + stricter filtering
Now:
• General workloads used default scheduling
• ML workloads used a GPU-aware scheduling profile
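A sketch of the configuration, with caveats: the `gpu-scheduler` profile name, the `nvidia.com/gpu` taint, and the weights are all assumptions for illustration, not our exact setup.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # general workloads, untouched
  - schedulerName: gpu-scheduler       # hypothetical ML-facing profile
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated        # pack GPU nodes tightly
            resources:
              - name: nvidia.com/gpu
                weight: 5              # made-up weight favoring GPU fit
              - name: cpu
                weight: 1
---
# An ML pod opts in by name and tolerates the GPU node taint:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training        # illustrative
spec:
  schedulerName: gpu-scheduler
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: train
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The taint keeps general pods off GPU nodes; the GPU resource limit keeps ML pods off nodes without GPUs. The profile only tunes scoring among the nodes that survive filtering.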
Production Impact
• Zero misplacements
• Predictable GPU utilization
• Better cost control
Incident 4: Bin Packing Wasn’t Actually Packing
The Situation
We were optimizing for cost.
We used a bin-packing strategy (the MostAllocated scoring strategy of the NodeResourcesFit plugin).
But nodes were underutilized by 5–7%.
Finance asked:
“Why are we scaling nodes when CPU is still free?”
The missing variable?
Pod Overhead wasn’t being considered in our mental math.
The Real Scheduler Math
When the NodeResourcesFit plugin runs:
Total Pod Compute =
(Sum of container requests) + (Pod Overhead)
During scoring:
Scheduler evaluates:
(Current Node Usage + Total Pod Compute)
With overhead included,
Bin packing becomes more accurate.
Without understanding this,
You miscalculate cluster density.
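A quick sanity check of that math, using the numbers from Incident 2 (the 10 GiB of allocatable node memory is a made-up figure for illustration):

```python
# How Pod Overhead changes the scheduler's NodeResourcesFit memory math.
MI = 1024 * 1024  # bytes per MiB

def pods_that_fit(allocatable_mem, container_request, pod_overhead=0):
    """Pods per node as the scheduler counts them:
    total pod compute = sum of container requests + pod overhead."""
    per_pod = container_request + pod_overhead
    return allocatable_mem // per_pod

allocatable = 10 * 1024 * MI  # 10 GiB allocatable (hypothetical node)
request = 512 * MI            # each pod requests 512Mi
overhead = 50 * MI            # RuntimeClass podFixed memory overhead

print(pods_that_fit(allocatable, request))            # naive math: 20 pods
print(pods_that_fit(allocatable, request, overhead))  # with overhead: 18 pods
```

The two-pod gap per node is exactly the kind of density error that shows up as "free CPU" on the finance dashboard while the scheduler refuses to place more pods.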
After adjusting runtime overhead correctly:
• Node utilization improved
• Fewer nodes required
• Monthly cloud cost dropped noticeably
What Most Teams Don’t Realize
The scheduler is not a black box.
It has phases:
- Filtering
- Scoring
- Binding
You can influence each phase.
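Each phase has user-facing knobs, and a single pod spec can touch all of them. A hypothetical example (labels and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo               # illustrative
  labels:
    app: demo
spec:
  nodeSelector:            # Filtering: nodes without this label are rejected
    workload-type: general # hypothetical node label
  topologySpreadConstraints:
    - maxSkew: 1           # Scoring: ScheduleAnyway makes this a soft preference
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: demo
  schedulerName: default-scheduler  # selects which profile runs all phases, through Binding
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
```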
Advanced scheduling is not “nice to have.”
It is production engineering.
Lessons Learned
- Replica count ≠ High Availability: use Topology Spread Constraints.
- Container request ≠ Total consumption: use Pod Overhead when sidecars or microVMs are involved.
- One scheduler profile ≠ All workloads: use Scheduler Profiles for workload classes.
- Bin packing requires correct math: understand the NodeResourcesFit + Overhead interaction.
Running Kubernetes vs Engineering Kubernetes
Anyone can deploy pods.
Engineering Kubernetes means:
• Designing failure domains intentionally
• Accounting for invisible compute
• Aligning scheduling with business rules
• Optimizing cost through math, not guesswork
The scheduler is not just a placement tool.
It is:
A control plane decision engine.
And this separates:
Clusters that survive production
from
Clusters that collapse under it.
(You don't need to be a Kubestronaut to learn and apply the lesser-known features of Kubernetes. All you need is a crisis at 2 AM.)