This is Part 2 of the Rack2Cloud Diagnostic Series, where we debug the silent killers of Kubernetes reliability.
The Series:
- Part 1: ImagePullBackOff: It’s Not the Registry (It’s IAM)
- Part 2: The Scheduler is Stuck (You are here)
- Part 3: The Network Layer (Why Ingress Fails)
- Part 4: Storage Has Gravity (Debugging PVCs)
The Fragmentation Trap: Why 50% “Free” CPU Doesn’t Mean You Can Schedule a Pod
Your Grafana dashboard says the cluster is only 45% used.
Finance keeps pinging you about cloud waste.
Meanwhile, kubectl get pods shows a bunch of Pending pods.
You’re not out of capacity. You’re just losing at Kubernetes Tetris.
Here’s the thing: in Kubernetes, “Total Capacity” is mostly for show. What really matters is Allocatable Continuity. Let’s say you’ve got 10 nodes with 1 CPU free on each. Technically, that’s 10 CPUs. But if you try to schedule a pod that needs 2 CPUs, you’re out of luck.
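To see fragmentation directly, look at per-node slack instead of the cluster-wide sum. A rough check, assuming you have kubectl access to the cluster:

```shell
# Per-node view of requested vs. allocatable resources.
# The gap between "Requests" and 100% is that node's usable slack.
kubectl describe nodes | grep -A 8 "Allocated resources"

# Or pull just allocatable CPU per node:
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU_ALLOCATABLE:.status.allocatable.cpu'
```

If every node shows a sliver of free CPU but none shows a slot big enough for your pod's request, you're looking at the Tetris problem.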
The Scheduler isn’t broken—it’s just doing exactly what you told it to do. It’s juggling all your rules: PDBs, topology constraints, affinity settings, you name it.
This is the Rack2Cloud Guide to fixing a “Stuck Scheduler.”
The Dashboard Lie: “Green” Doesn’t Mean Go
When a pod gets stuck, most engineers check cluster utilization and see plenty of spare CPU. The dashboard looks green, so they think there’s room.
But the scheduler doesn’t care about the sum. It hunts for a single node with enough continuous resources that also fits all your constraints.
If you see these kinds of errors, you don’t have a capacity problem—you have a placement problem:
```
0/10 nodes are available: 10 Insufficient cpu.
0/6 nodes are available: 3 node(s) had taint {node.kubernetes.io/unreachable}, 3 node(s) didn't match Pod's node affinity.
pod didn't trigger scale-up (it wouldn't fit if a new node is added)
```
That last one is a real head-scratcher. Even if the autoscaler spins up a shiny new node, your affinity rules still block the pod from landing.
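The scheduler's own explanation lives in the pod's events. A quick way to surface it (the pod name is a placeholder):

```shell
# The Events section at the bottom lists FailedScheduling reasons,
# grouped by how many nodes failed each check.
kubectl describe pod <pending-pod-name>
```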
The Fragmentation Trap (Or, The Tetris Problem)
Picture a Tetris board full of empty squares scattered all over. You have space, but you just can’t drop a long block anywhere. That’s exactly how Kubernetes nodes work.
Bin-Packing Failure: The Scheduler looks for a single contiguous slot, not the sum of free space.
The “Requests vs. Limits” Blind Spot
A classic mistake: thinking the scheduler cares about usage. It doesn’t. It only looks at Requests.
```yaml
resources:
  requests:
    cpu: "2"   # <--- Scheduler only cares about this
  limits:
    cpu: "4"
```
If a node has 4 CPUs, and pods have requested 3.5 CPUs, the scheduler sees that node as full—even if those pods are barely using any CPU at all. Once requested, that seat is reserved.
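You can see the gap yourself by putting requests next to live usage. A sketch, assuming metrics-server is installed in the cluster:

```shell
# What each pod has reserved (this is what the scheduler sees):
kubectl get pods -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'

# What each pod is actually burning right now (requires metrics-server):
kubectl top pods
```

Pods requesting whole CPUs while sipping a few millicores are where your "missing" capacity went.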
The DaemonSet Tax
DaemonSets are sneaky. If you’re running Datadog, Fluentd, and a security agent everywhere, each node pays a “fragmentation tax.”
It adds up: 150m CPU (Datadog) + 200m CPU (Fluentd) + 100m CPU (security agent) = 0.45 CPU lost per node. Across 20 nodes, that's 9 CPUs, scattered in crumbs too small to use.
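To tally your own DaemonSet tax, list the per-node CPU each DaemonSet requests, then multiply the total by your node count:

```shell
kubectl get daemonsets -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.template.spec.containers[*].resources.requests.cpu'
```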
The Silent Killer: Pod Disruption Budgets (PDBs)
PDBs are meant to keep your app safe during upgrades. But set them too strict, and they turn into a deadlock.
- The Scenario: You try to drain a node for a Kubernetes upgrade. The drain just hangs.
- The Cause: You set minAvailable: 100% (or maxUnavailable: 0).
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: "100%"  # <--- The Trap
  selector:
    matchLabels:
      app: web
```
If you have three replicas, the PDB says all three must always be up. Now Kubernetes can’t move a pod. Cluster Autoscaler can’t scale down. Node Upgrader can’t drain.
The Fix: Always leave room to maneuver. Use maxUnavailable: 1. Be careful with percentages: Kubernetes rounds minAvailable up, so minAvailable: 90% with three replicas still means all three must stay up.
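The same PDB, rewritten so the drain has room to work (names carried over from the example above):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  maxUnavailable: 1   # <--- One pod may be evicted at a time
  selector:
    matchLabels:
      app: web
```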
Topology Spread: The “Strict” Trap
We all want high availability. So we spread pods across availability zones (AZs). But if you set whenUnsatisfiable: DoNotSchedule, you turn a preference into an outage.
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule  # <--- The Hard Stop
    labelSelector:
      matchLabels:
        app: web
```
Here's the deadlock:
- AZ-A has 2 pods.
- AZ-B has 2 pods.
- AZ-C has 1 pod, and its nodes are full.
You try to schedule a 6th pod. Putting it in AZ-A or AZ-B would push the skew to "2" (breaking your rule), so the only legal zone is AZ-C. But AZ-C has no room. The pod just sits in Pending, even though there's open space elsewhere.
The Fix: Set whenUnsatisfiable to ScheduleAnyway, unless you’re legally required to keep perfect symmetry.
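The same constraint with the soft setting: the scheduler still prefers balance, but won't strand the pod over it.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # <--- Preference, not a hard stop
    labelSelector:
      matchLabels:
        app: web
```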
Advanced: Scheduler Cache Lag
This one’s for the deep-divers.
The scheduler doesn’t check the live state of every node every time. That would be way too slow. Instead, it uses a snapshot cache.
If your environment is chaotic—nodes scaling up and down, pods dying and spawning—the cache can get out of sync for a few seconds. So the scheduler “thinks” Node A is free and assigns a pod. But Node A is actually full, so the Kubelet rejects it. The pod goes back to the queue.
This shows up as weird, fleeting scheduling delays during heavy churn. Usually, it sorts itself out. But now you know why you sometimes see FailedScheduling events that vanish a few seconds later.
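If you suspect cache lag, watch for FailedScheduling events that appear and then resolve on their own:

```shell
# Sort scheduling failures by time; short-lived entries during
# heavy churn are a hint that the scheduler raced its own snapshot.
kubectl get events --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'
```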
Summary: Capacity Problems are Policy Problems
If your dashboard says you have CPU, but your pods are pending, stop buying nodes. You don’t need more hardware. You need to relax your rules.
- [x] Audit Requests: Are they realistic?
- [x] Relax PDBs: Set maxUnavailable: 1.
- [x] Soften Constraints: Use ScheduleAnyway.