How We Used Kubernetes Advanced Scheduling to Fix Real Production Incidents
Kubernetes scheduling looks simple — until production traffic hits.
In lower environments, the default scheduler works beautifully.
In production, under load, across multiple AZs, with service mesh, sidecars, GPUs, and mixed workloads?
That’s when we discovered:
The scheduler is not “basic.”
It’s powerful — if we know how to use it.
Everything below comes from real issues we resolved in a live Kubernetes environment running critical workloads: genuine 2 AM incidents.
Incident 1: All Pods in One AZ – The Near-Outage
The Situation
We had a 3-zone cluster.
A critical API deployment had 9 replicas.
One evening, during a traffic spike, one Availability Zone had network degradation.
Suddenly, 6 of our 9 pods were unreachable.
Why?
Because the default scheduler had “helpfully” packed pods into fewer zones for efficiency.
ReplicaSet guarantees count.
It does NOT guarantee distribution.
That night we learned:
High Availability is not automatic.
The Fix: Pod Topology Spread Constraints
We implemented:
• topologyKey: topology.kubernetes.io/zone
• maxSkew: 1
• whenUnsatisfiable: DoNotSchedule
This forced the scheduler to maintain balanced pod distribution across zones.
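In the Deployment's pod template, the constraint looked roughly like this (the deployment name, label, and image are illustrative, not our real manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api        # hypothetical name
spec:
  replicas: 9
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1        # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard constraint, not a preference
          labelSelector:
            matchLabels:
              app: critical-api
      containers:
        - name: api
          image: registry.k8s.io/pause:3.9   # placeholder image
```

Note that the `labelSelector` must match the pods being spread; without it the constraint counts nothing.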
What Changed Immediately
• Pods spread 3-3-3 across AZs
• Rolling updates maintained balance
• Zone failure no longer meant partial collapse
Why This Matters
Unlike anti-affinity (which is binary), Topology Spread Constraints are quantitative.
They enforce distribution math.
That’s the difference.
Real Production Lesson
During a rolling update before the fix:
New pods filled one AZ first.
Traffic imbalance followed.
Latency spikes happened.
After implementing spread constraints:
Updates remained balanced from the first pod.
No traffic concentration.
Incident 2: Node Memory Pressure – “But Requests Look Fine”
The Situation
We were using:
• Service mesh (Istio sidecars)
• Kata Containers for isolated workloads
Pods were requesting 512Mi of memory.
Nodes were showing allocatable capacity sufficient for 20 pods.
But after deploying 15 pods:
• Memory pressure started
• Evictions happened
• Latency increased
Metrics didn’t make sense.
Until we realized:
The scheduler was blind to runtime overhead.
The Fix: Pod Overhead via RuntimeClass
We defined a RuntimeClass with:
    overhead:
      podFixed:
        memory: "50Mi"
        cpu: "50m"
Now every pod using that RuntimeClass automatically added overhead to scheduling math.
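Expanded into a full manifest, it looks roughly like this (the `kata` handler name is an assumption; it must match a handler configured in the node's container runtime, and the pod is illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata              # hypothetical name
handler: kata             # must match a CRI handler configured on the node
overhead:
  podFixed:
    memory: "50Mi"        # added to every pod's memory request for scheduling
    cpu: "50m"
---
# Any pod that opts in picks up the overhead automatically:
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload # illustrative
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        requests:
          memory: "512Mi"
```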
What Changed?
Before:
Scheduler counted 512Mi per pod.
After:
Scheduler counted 562Mi per pod (512Mi request + 50Mi overhead).
Node fitting became accurate.
Evictions stopped.
Critical Insight
Pod Overhead creates an “invisible container” in scheduling math.
Without it:
You overcommit silently.
With it:
Scheduling becomes financially and operationally accurate.
Incident 3: GPU Pods Landing on Spot Nodes
The Situation
We had:
• GPU nodes (on-demand)
• General compute nodes (Spot instances)
A training workload got scheduled on a Spot node without GPU.
Result:
• CrashLoopBackOff
• Wasted compute
• Delayed ML training pipeline
Node labels existed.
But scheduling preference wasn’t strict enough.
The Fix: Scheduler Profiles
Instead of rewriting the scheduler, we:
• Created a separate scheduler profile
• Configured specific scoring behavior for ML workloads
• Used taints + tolerations + stricter filtering
Now:
• General workloads used default scheduling
• ML workloads used a GPU-aware scheduling profile
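A sketch of the configuration, with caveats: the `gpu-scheduler` profile name, the `nvidia.com/gpu` taint, and the weights are all assumptions for illustration, not our exact setup.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # general workloads, untouched
  - schedulerName: gpu-scheduler       # hypothetical ML-facing profile
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated        # pack GPU nodes tightly
            resources:
              - name: nvidia.com/gpu
                weight: 5              # made-up weight favoring GPU fit
              - name: cpu
                weight: 1
---
# An ML pod opts in by name and tolerates the GPU node taint:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training        # illustrative
spec:
  schedulerName: gpu-scheduler
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: train
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The taint keeps general pods off GPU nodes; the GPU resource limit keeps ML pods off nodes without GPUs. The profile only tunes scoring among the nodes that survive filtering.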
Production Impact
• Zero misplacements
• Predictable GPU utilization
• Better cost control
Incident 4: Bin Packing Wasn’t Actually Packing
The Situation
We were optimizing for cost.
We used a bin-packing strategy (the MostAllocated scoring strategy of the NodeResourcesFit plugin).
But nodes were underutilized by 5–7%.
Finance asked:
“Why are we scaling nodes when CPU is still free?”
The missing variable?
Pod Overhead wasn’t being considered in our mental math.
The Real Scheduler Math
When the NodeResourcesFit plugin runs:
Total Pod Compute =
(Sum of container requests) + (Pod Overhead)
During scoring:
Scheduler evaluates:
(Current Node Usage + Total Pod Compute)
With overhead included,
Bin packing becomes more accurate.
Without understanding this,
You miscalculate cluster density.
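A quick sanity check of that math, using the numbers from Incident 2 (the 10 GiB of allocatable node memory is a made-up figure for illustration):

```python
# How Pod Overhead changes the scheduler's NodeResourcesFit memory math.
MI = 1024 * 1024  # bytes per MiB

def pods_that_fit(allocatable_mem, container_request, pod_overhead=0):
    """Pods per node as the scheduler counts them:
    total pod compute = sum of container requests + pod overhead."""
    per_pod = container_request + pod_overhead
    return allocatable_mem // per_pod

allocatable = 10 * 1024 * MI  # 10 GiB allocatable (hypothetical node)
request = 512 * MI            # each pod requests 512Mi
overhead = 50 * MI            # RuntimeClass podFixed memory overhead

print(pods_that_fit(allocatable, request))            # naive math: 20 pods
print(pods_that_fit(allocatable, request, overhead))  # with overhead: 18 pods
```

The two-pod gap per node is exactly the kind of density error that shows up as "free CPU" on the finance dashboard while the scheduler refuses to place more pods.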
After adjusting runtime overhead correctly:
• Node utilization improved
• Fewer nodes required
• Monthly cloud cost dropped noticeably
What Most Teams Don’t Realize
The scheduler is not a black box.
It has phases:
- Filtering
- Scoring
- Binding
You can influence each phase.
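Each phase has user-facing knobs, and a single pod spec can touch all of them. A hypothetical example (labels and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo               # illustrative
  labels:
    app: demo
spec:
  nodeSelector:            # Filtering: nodes without this label are rejected
    workload-type: general # hypothetical node label
  topologySpreadConstraints:
    - maxSkew: 1           # Scoring: ScheduleAnyway makes this a soft preference
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: demo
  schedulerName: default-scheduler  # selects which profile runs all phases, through Binding
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
```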
Advanced scheduling is not “nice to have.”
It is production engineering.
Lessons Learned
- Replica count ≠ High Availability: use Topology Spread Constraints.
- Container request ≠ Total consumption: use Pod Overhead when sidecars or microVMs are involved.
- One scheduler profile ≠ All workloads: use Scheduler Profiles for workload classes.
- Bin packing requires correct math: understand the NodeResourcesFit + Overhead interaction.
Running Kubernetes vs Engineering Kubernetes
Anyone can deploy pods.
Engineering Kubernetes means:
• Designing failure domains intentionally
• Accounting for invisible compute
• Aligning scheduling with business rules
• Optimizing cost through math, not guesswork
The scheduler is not just a placement tool.
It is:
A control plane decision engine.
And this separates:
Clusters that survive production
from
Clusters that collapse under it.
(You don't need to be a Kubestronaut to learn and apply the lesser-known features of Kubernetes. All you need is a crisis at 2 AM.)