A right-sized EKS cluster should not run at 40 percent node utilization. The pods declare requests that sum to 78 percent of node capacity. The cluster autoscaler provisions nodes to fit those requests. The bill goes to finance based on the nodes provisioned. And then the actual utilization metrics show 40 percent. The gap between 78 percent of node capacity claimed and 40 percent actually used is bin-packing inefficiency, and it survives any amount of right-sizing.
The pattern that fixes the gap is three scheduling-side changes that don't touch any workload. Switch the scheduler scoring from default to MostAllocated. Enable Karpenter's consolidation feature. Add a three-tier priority class so batch workloads can be evicted when high-priority pods need capacity. The combined effect on a typical EKS cluster is 25 to 35 percent reduction in node cost without any pod changing its resource requests.
This piece composes with the right-sizing vs. auto-scaling discussion but starts from the opposite end. Right-sizing argues with every team about their CPU and memory requests. Scheduling improvements just change where pods land. The political cost is much lower, the work fits in a one-sprint window, and right-sizing becomes more effective afterward because the bin-packing baseline is healthier.
## The 40% utilization gap on EKS clusters that have already right-sized
Look at any EKS cluster that's already done a right-sizing pass. The numbers will be roughly:
| Signal | Typical value |
|---|---|
| Pod CPU requests / node CPU capacity | 75-82% |
| Pod memory requests / node memory capacity | 70-78% |
| Actual node CPU utilization (averaged) | 35-45% |
| Actual node memory utilization (averaged) | 38-50% |
| Cluster Autoscaler / Karpenter target utilization | 80%+ |
The first two rows say "the cluster is well-packed in theory." The middle two say "the cluster is half-empty in practice." The last row says "the autoscaler thinks it's running tight."
The gap is real, not a measurement artifact. Pod resource requests are declarations of what the pod might use; actual utilization is what the pod uses on average. The scheduler reserves the requested capacity even when the pod uses less. A 10-CPU node with five pods declaring 2 CPU each (10 CPU reserved) but using 1 CPU each on average (5 CPU actual) is at 100 percent reserved and 50 percent utilized. The autoscaler sees the reservation, not the use.
This is by design — the alternative (oversubscribing based on actual use) breaks under burst, and the scheduling literature is unanimous that reservation-based scheduling is the right primitive. The fix isn't to change how the scheduler treats requests. The fix is to make the scheduler pack requests more efficiently and to let Karpenter consolidate the resulting headroom into fewer nodes.
## Lever 1: switch to MostAllocated scoring
The default Kubernetes scheduler optimizes for predictable, evenly-spread placement. The default scoring strategy is LeastAllocated, which prefers nodes with more free capacity. The reasoning is fault tolerance: spread pods across nodes so a single node failure has bounded blast radius. This is the right default if you're not paying for the nodes.
MostAllocated is the opposite strategy: prefer nodes with less free capacity, packing pods tightly. The scoring is opt-in and rarely enabled. It's documented but has no auto-enablement signal: nothing in the cluster tells you "you'd save money if you flipped this."
Same workload, two scoring strategies, two outcomes. LeastAllocated produces 6 nodes at 50 percent each (half the fleet idle). MostAllocated produces 4 nodes at 75 percent each (25 percent idle, 33 percent fewer nodes).
The configuration is one block in the scheduler config:
| Field | Default value | New value |
|---|---|---|
| `profiles[0].pluginConfig[].name` | (unset) | `NodeResourcesFit` (the default scorer) |
| `profiles[0].pluginConfig[].args.scoringStrategy.type` | `LeastAllocated` | `MostAllocated` |
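A minimal sketch of that block, assuming kube-scheduler 1.25+ and the `kubescheduler.config.k8s.io/v1` API; the cpu/memory weights are illustrative:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            # Default is LeastAllocated (prefer emptier nodes);
            # MostAllocated prefers fuller nodes, i.e. bin-packing.
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```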
EKS doesn't expose the managed control plane's scheduler config, so in practice this lands as a secondary scheduler: a kube-scheduler deployment carrying the config above, with workloads opting in via `schedulerName`. The change applies to new pod placements; existing pods stay where they are until they restart for other reasons. The transition over a week is gentle: as pods naturally restart (deployments, image updates, node maintenance), the cluster gradually packs tighter.
The catch is that MostAllocated without PodTopologySpread constraints is dangerous. Left to its own devices, the scheduler will happily put all five replicas of a deployment on one node — maximum density, zero fault tolerance. Topology spread is the corrective. We get to that section in a moment.
The expected outcome on a fleet that previously ran 40 percent utilization: utilization rises to 55-65 percent over the first two weeks. The autoscaler notices fewer nodes are needed and provisions less. The bill drops 12-18 percent depending on workload composition.
## Lever 2: enable Karpenter consolidation
Karpenter provisions nodes to fit incoming pods. By default, once a node exists, it stays. If pods leave (deployment scale-down, batch job completion), the node lingers under-utilized until it either empties out entirely or hits Karpenter's expiration window.
Consolidation is the active counterpart. Karpenter continuously evaluates the existing fleet, asks "could I run all these pods on fewer or smaller nodes," and re-provisions if yes. Pods get gracefully evicted from the old nodes, the new (smaller or fewer) nodes get spun up, the old nodes terminate.
Six m5.xlarge nodes (4 vCPU each) at 30-40 percent become two m5.2xlarge nodes (8 vCPU each) at 45-60 percent. Same pods, same requests, and the bill drops by a third: six xlarge-equivalents of capacity become four.
The Karpenter NodePool config to enable it:
| Field | Default | New |
|---|---|---|
| `disruption.consolidationPolicy` | `WhenUnderutilized` | `WhenUnderutilized` (already the default in recent versions) |
| `disruption.consolidateAfter` | unset | `30s` to `1m` (acts on transient under-utilization too) |
| `disruption.expireAfter` | `720h` (30 days) | `168h` (7 days) — forces fleet refresh |
The consolidation policy WhenUnderutilized is what does the work. The consolidateAfter knob controls how aggressive the re-evaluation is; shorter values catch transient under-utilization (a deployment that just scaled down) faster. The expireAfter change is secondary but useful: shorter expiration forces the fleet to refresh more often, which catches drift between Karpenter's view and reality.
The catch with consolidation is that the node types it picks need to be a bounded set. If the NodePool allows 30 instance families, consolidation produces fragmentation: some pods on m5, some on c5, some on r5, none of the families used densely enough to consolidate further. The prerequisite work is pruning to 3-5 high-utility instance families that cover the typical pod resource shapes. Most clusters land on m6i for general purpose, c6i for CPU-bound, r6i for memory-bound, with one or two GPU types for ML workloads.
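A minimal NodePool sketch with both the disruption settings and the pruned family list, using the `karpenter.sh/v1` schema (which renamed `WhenUnderutilized` to `WhenEmptyOrUnderutilized` and moved `expireAfter` under the node template; the table above shows the older v1beta1 layout). The pool name and the `default` EC2NodeClass are assumptions:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose                 # illustrative name
spec:
  disruption:
    # Actively re-pack pods onto fewer or smaller nodes.
    consolidationPolicy: WhenEmptyOrUnderutilized
    # How long a node must stay consolidatable before Karpenter acts.
    consolidateAfter: 30s
  template:
    spec:
      expireAfter: 168h                 # weekly fleet refresh
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                   # assumes an existing EC2NodeClass
      requirements:
        # Bounded instance set so consolidation can pack densely.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "c6i", "r6i"]
```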
The expected outcome on a typical EKS fleet: 15-25 percent fewer nodes after the first week of consolidation passes. The first day shows the biggest drop (consolidation catches all the historical under-utilization at once); subsequent days are incremental as new under-utilization gets caught.
## Lever 3: priority + preemption for batch workloads
The third lever is the one most teams skip. Kubernetes supports pod priority classes and preemption: high-priority pods can evict low-priority pods when capacity is contended, instead of triggering a node-up.
Most clusters end up with three priority classes:
| Class | Priority value | Workloads | Preemption behavior |
|---|---|---|---|
| `critical` | 1,000,000 | Customer-facing services, control-plane components | Cannot be preempted (nothing outranks it) |
| `standard` | 500,000 | Internal services, default for everything else | Preempts batch only |
| `batch` | 100,000 | Periodic jobs, ML training, data pipelines | Preempted by everything else |
The priorityClassName field on the pod spec assigns the class. New deployments get the appropriate class via templates; existing deployments get tagged in a one-time PR. Critical workloads are usually a small set (under 20 percent of pods). Batch workloads are usually larger than people expect (often 30-40 percent of pods, mostly invisible: cronjobs, data pipelines, build runners).
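A minimal sketch of the three classes; the values mirror the table, and `globalDefault` on `standard` is an assumption about how "default for everything else" gets implemented:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000        # highest value in the cluster, so nothing can preempt it
description: "Customer-facing degradation if evicted."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard
value: 500000
globalDefault: true   # assumption: unclassified pods land here
description: "Internal services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
description: "Evictable: cronjobs, pipelines, training."
```

A workload opts in with one field in its pod spec: `priorityClassName: batch`.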
The preemption behavior is what creates the savings. When the scheduler can't fit a standard or critical pod, it looks for batch pods to evict instead of triggering Karpenter to provision a new node. The batch pod gets evicted (re-queued for later), the standard pod takes its slot, and the cluster doesn't grow. The batch work runs slower but completes; the cluster runs leaner.
The political work is agreeing on which workloads are evictable. Engineers tend to mark their work as critical by default. The agreement requires a clear definition: "critical means customer-facing degradation if evicted." Most internal infrastructure (monitoring, logging, build runners, batch ETL) is not critical by that definition. The clarification is the political work; the technical implementation is one yaml field per pod.
Preemption only adds savings when the cluster has enough batch workloads to absorb the eviction pressure. Clusters that are pure web-tier with no batch see less benefit (typically 2-3 percent). Clusters with ML training or large data pipelines see more (typically 8-12 percent).
## PodTopologySpread is non-negotiable
MostAllocated without topology constraints will pack all replicas of a deployment onto one node. A node failure takes down the entire deployment. This is a real production incident, not a theoretical concern.
The fix is PodTopologySpread constraints on every deployment that has fault tolerance requirements. The yaml block boils down to two constraints:
| Field | Value | Why |
|---|---|---|
| `topologyKey` | `topology.kubernetes.io/zone` | Spread across AZs first |
| `maxSkew` | `1` | At most 1 pod of imbalance between AZs |
| `whenUnsatisfiable` | `ScheduleAnyway` | Soft constraint; better packed than crashed |
| Second constraint `topologyKey` | `kubernetes.io/hostname` | Then spread across nodes within an AZ |
| Second constraint `maxSkew` | `2` | At most 2 pods of imbalance between nodes |
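The same two constraints as yaml, in the deployment's pod template; the `app: web` label is a stand-in for whatever selector the deployment already uses:

```yaml
# spec.template.spec of the Deployment
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft: never blocks scheduling
    labelSelector:
      matchLabels:
        app: web                        # stand-in label
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```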
The two-constraint pattern says "spread across AZs first (for resilience), then across nodes within an AZ (for further fault isolation), but never refuse to schedule because of either." ScheduleAnyway is what makes it compatible with MostAllocated: when packing is the right choice for cost, the scheduler can violate the soft constraint and pack tighter.
The cost of getting topology spread right is one yaml block per critical deployment. Tooling exists (open-policy-agent, Kyverno) to enforce that deployments above a certain replica count have topology spread defined; we use a simple admission policy that warns on missing spread and blocks on critical-priority deployments without it.
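A sketch of the warning half in Kyverno, assuming its standard validate-pattern syntax; the policy name and the replica threshold are illustrative, and the pattern only checks that at least one constraint exists:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: warn-missing-topology-spread    # illustrative name
spec:
  validationFailureAction: Audit        # warn only; a stricter Enforce variant targets critical deployments
  background: true
  rules:
    - name: deployments-should-spread
      match:
        any:
          - resources:
              kinds: ["Deployment"]
      preconditions:
        all:
          # Only deployments with more than one replica need spread.
          - key: "{{ request.object.spec.replicas || `1` }}"
            operator: GreaterThan
            value: 1
      validate:
        message: "Multi-replica deployments should define topologySpreadConstraints."
        pattern:
          spec:
            template:
              spec:
                topologySpreadConstraints:
                  - topologyKey: "?*"
```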
The trade-off this creates: a cluster running MostAllocated + topology spread will, in steady state, run at 60-70 percent utilization rather than the theoretical maximum of 85 percent. The 15-25 percent gap is the cost of fault tolerance. Closing it further means giving up AZ resilience, which is not a finance decision.
## The 31% breakdown: 15 + 10 + 6
The combined impact on a typical EKS cluster decomposes roughly as:
| Lever | Typical savings | Range | What drives variation |
|---|---|---|---|
| MostAllocated scoring | 15% | 12-18% | Higher savings on clusters with many small pods (better bin-packing wins) |
| Karpenter consolidation | 10% | 8-12% | Higher savings on clusters with bursty deployments (more transient under-utilization to catch) |
| Priority preemption | 6% | 2-12% | Higher savings on clusters with significant batch workload (more eviction-eligible pods) |
| Combined | 31% | 22-42% | Composition matters; effects are not exactly additive |
The combined number is slightly less than the simple sum because the levers overlap. MostAllocated reduces under-utilization, which means consolidation has less work to do. Preemption reduces node-up events, which means consolidation sees a steadier fleet. Compounded multiplicatively, 1 - (0.85 × 0.90 × 0.94) ≈ 28 percent rather than 31. The interactions are mild but real; planning around 25-35 percent total savings is more accurate than planning around 31 percent.
The exact mix depends on workload composition. A cluster that's 80 percent web traffic and 20 percent batch will see more value from MostAllocated and less from preemption. A cluster that's 50 percent ML training will see more from preemption (the training jobs are the ideal eviction targets) and less from MostAllocated (large pods don't bin-pack as well as small ones). The 15+10+6 split is the central tendency, not a guarantee.
## Why scheduling-first is more politically tractable than right-sizing-first
Right-sizing argues with every team about their resource requests. The conversation is "your pod requests 4 CPU and uses 1.5; let's drop the request to 2." Each team pushes back because they remember the time the pod actually used 4 (the incident two months ago, the deployment burst, the load test). Negotiating each one takes 30 to 60 minutes per service; a 200-service cluster eats a quarter's worth of FinOps time.
Scheduling changes don't argue with anyone. The pod requests stay the same. The pod runs the same code. The only thing that changes is which node the pod lands on (MostAllocated), which other nodes exist alongside it (consolidation), and what happens when capacity is tight (preemption). No engineer has to defend their resource requests because no resource requests are changing.
This makes the sequencing matter. Doing scheduling first:
| Step | Time | Political cost | Savings unlocked |
|---|---|---|---|
| Enable MostAllocated + topology spread | 1-2 sprints | Low (one config change, validated by SRE) | 12-18% |
| Enable Karpenter consolidation + prune node families | 1 sprint | Low (Karpenter team's domain) | 8-12% |
| Define priority classes + tag batch workloads | 2-3 sprints | Medium (workload classification debate) | 2-12% |
| Right-size pod requests | 1-2 quarters | High (per-service negotiation) | another 15-25% on top |
By the time you get to right-sizing, the cluster's bin-packing is already healthy, so the right-sizing conversations land on a smaller per-service savings number. That's actually politically helpful: the engineers see "we already saved 31 percent without touching your pods, and now we're asking for the next 15 percent." The framing flips from "we're cutting your resources" to "we're tuning the last bit of headroom."
The 31 percent number is real and replicable. The work fits in one sprint per lever, takes no engineering team's time except SRE's, and doesn't risk any pod's runtime behavior. It's the cheapest savings on the EKS bill and it shows up before the harder right-sizing fight even starts.