Muskan

How to Right-Size Kubernetes Node Groups Without Breaking Production

Over-provisioned node groups are the most common source of Kubernetes compute waste. The average cluster runs at 30-40% CPU utilization of provisioned capacity. Most teams sized their nodes once, applied a "30% headroom" rule, and haven't revisited it since.

The risk of changing node sizes feels higher than the savings. That feeling is wrong. Done with the right process, right-sizing cuts compute costs by 25-35% without a single production incident. Done without the right process, it causes OOMKilled pods, scheduling failures, and drained clusters that won't come back up.

This post covers the measurement you need first, the rotation pattern that makes changes safe, and what to watch for afterward.

The Over-Provisioning Default

Teams provision node groups during initial cluster setup, apply a headroom percentage, and move on to shipping features. The headroom that made sense at launch, when traffic patterns were unknown and autoscaling wasn't tuned, becomes permanent waste six months later.

A 10-node EKS cluster running m5.2xlarge instances costs $2,764 per month in us-east-1 on-demand. If those nodes run at 35% average CPU utilization, you're paying for roughly 6.5 idle nodes. That's $1,797 per month in unused compute.

Right-sized to m5.xlarge with appropriate bin-packing, the same workload runs on 10 smaller nodes for $1,382 per month. Since m5.xlarge costs exactly half as much as m5.2xlarge, the bill halves: savings of $1,382 per month, or roughly $16,589 per year, from one cluster.

| Cluster state | Nodes | Instance type | Monthly cost | Avg CPU utilization |
|---|---|---|---|---|
| Over-provisioned (current) | 10 | m5.2xlarge | $2,764 | 35% |
| Right-sized (target) | 10 | m5.xlarge | $1,382 | 60% |
| Monthly savings | | | $1,382 | |
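The table's costs are plain arithmetic, which is worth sanity-checking before trusting any savings projection. A shell sketch, using the on-demand prices quoted above and assuming a 720-hour month to match the $2,764 figure:

```shell
# Monthly node-group cost = node count x hourly on-demand price x hours/month.
# Prices from the text: m5.2xlarge $0.384/hr, m5.xlarge $0.192/hr (us-east-1).
hours=720
over=$(awk -v h="$hours" 'BEGIN { printf "%d", 10 * 0.384 * h }')
right=$(awk -v h="$hours" 'BEGIN { printf "%d", 10 * 0.192 * h }')
echo "over-provisioned: \$$over/month"
echo "right-sized:      \$$right/month"
echo "monthly savings:  \$$((over - right))"
```

Swap in your own instance prices and node counts; the structure of the calculation is the same for any node group.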

The data you need is already in your cluster. Most teams haven't looked at it with right-sizing intent.

What You Need to Measure First

kubectl top nodes shows current utilization at the moment you run it. It does not show peak utilization, scheduled capacity, or the utilization distribution over time. All four matter before you touch anything.

Actual utilization is what your pods actually consume. kubectl top nodes gives you this as a point-in-time snapshot; for the 24-hour and longer views you need a metrics backend such as Prometheus.

Scheduled capacity is the sum of all pod CPU and memory requests assigned to each node. A node at 90% scheduled capacity but 35% actual utilization has over-requested pods, not an under-provisioned node. These are different problems with different fixes.

Peak utilization is the 95th percentile CPU and 99th percentile memory over at least 14 days. Right-sizing to average utilization causes failures during peak load periods. Weekly batch jobs, month-end reporting runs, and Monday morning traffic spikes all appear at the tail, not the average.

Pod Disruption Budgets determine the minimum number of pods that must stay available during a node drain. A PDB requiring 100% availability blocks the drain operation indefinitely. Audit all PDBs before you start.

| What to measure | Tool | Minimum time horizon |
|---|---|---|
| Current node utilization | kubectl top nodes | Real-time snapshot |
| Per-pod actual usage | kubectl top pods | Real-time snapshot |
| 95th percentile over time | Prometheus + kube-state-metrics | 14 days |
| Pod request vs actual gap | VPA recommender mode | 8 days |
| PDB configurations | kubectl get pdb -A | One-time audit |
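The PDB audit can be turned into an actionable list with a jq filter over kubectl get pdb -A -o json: any budget whose status currently allows zero disruptions will stall a drain. A sketch, with a here-doc standing in for live cluster output:

```shell
# Flag PDBs that currently allow zero disruptions -- these will block
# `kubectl drain`. The here-doc is sample data; on a live cluster, pipe
# `kubectl get pdb -A -o json` into the same jq filter instead.
blocking=$(jq -r '.items[]
    | select(.status.disruptionsAllowed == 0)
    | "\(.metadata.namespace)/\(.metadata.name)"' <<'EOF'
{"items": [
  {"metadata": {"namespace": "payments", "name": "api-pdb"},
   "status": {"disruptionsAllowed": 0}},
  {"metadata": {"namespace": "web", "name": "frontend-pdb"},
   "status": {"disruptionsAllowed": 2}}
]}
EOF
)
echo "PDBs that will block a drain: $blocking"
```

The namespaces and PDB names here are illustrative; what matters is the .status.disruptionsAllowed field, which the PDB controller keeps current against live replica counts.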

The Vertical Pod Autoscaler in recommendation mode (no auto-apply) is the most useful tool here. After 8 days of observation, it reports how much each pod actually uses versus what it requested. If VPA recommends cutting a pod's CPU request by 40%, the pod is systematically over-requesting, which inflates your scheduled capacity and forces larger nodes than necessary.
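A VPA object in recommendation mode looks like the following. The deployment name checkout-api is a placeholder; the important line is updateMode: "Off", which makes VPA record recommendations without ever evicting or resizing pods:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api      # placeholder -- point at your own workload
  updatePolicy:
    updateMode: "Off"       # recommend only; never act on the recommendation
```

After the observation window, kubectl describe vpa checkout-api-vpa shows per-container target and bound recommendations you can compare against the requests in the deployment spec.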

Do not start a node group resize until you have 14 days of utilization data. A 24-hour window misses every weekly and monthly usage spike.

The Blue-Green Node Rotation Pattern

Never resize a node group in-place. In-place resize terminates existing nodes before the replacements are fully ready. This creates scheduling pressure, may violate Pod Disruption Budgets, and gives you no safe rollback path once nodes are gone.

The safe pattern is blue-green rotation: create the new node group alongside the existing one, migrate pods gradually while the old nodes are still available, then delete the old group after a soak period.

Step 1: Create the new node group. Launch a new managed node group with the target instance type. Use the same availability zones, labels, and taints as the existing group. Let both groups coexist for 30 minutes and observe that new pods schedule correctly on the new nodes.

Step 2: Cordon old nodes. Mark every old node as unschedulable with kubectl cordon. No new pods will land on old nodes. Existing pods keep running. This step is fully reversible with kubectl uncordon and carries zero risk.

Step 3: Drain one node at a time. Drain evicts pods from one node, waits for replacement pods to start on another node, then moves to the next. Drain respects PDBs: if evicting a pod would violate a disruption budget, drain pauses and waits. Watch the drain output actively. A drain that has been waiting more than 10 minutes is blocked by a PDB — check kubectl get pdb -A to see what is holding it.

Step 4: Verify for 72 hours. After all pods are running on new nodes, do not immediately delete the old (cordoned) group. Keep it available as a rollback for 72 hours. During this window, run your normal observability checks and watch for the signals described below.

Step 5: Delete the old node group. Only after the 72-hour soak period passes cleanly. No earlier.
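Steps 2 and 3 reduce to a small shell function. This is a sketch, assuming the old group's nodes carry a label like node-group=blue (substitute whatever selector identifies your old group). Cordoning happens up front because it is reversible; drains run serially with a timeout so a PDB-blocked drain fails loudly instead of hanging:

```shell
# Cordon every node in the old group, then drain them one at a time.
rotate_old_nodes() {
  selector=$1   # e.g. "node-group=blue" -- an assumed label, adjust to yours

  old_nodes=$(kubectl get nodes -l "$selector" -o name) || return 1

  # Step 2: cordon all old nodes first (reversible with `kubectl uncordon`).
  for node in $old_nodes; do
    kubectl cordon "$node" || return 1
  done

  # Step 3: drain serially. --timeout turns a PDB-blocked drain into a
  # loud failure instead of an indefinite wait.
  for node in $old_nodes; do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
        --timeout=10m || {
      echo "drain blocked on $node -- check: kubectl get pdb -A" >&2
      return 1
    }
  done
}
```

Run it only after the new group has been scheduling pods cleanly for 30 minutes, and if any drain times out, uncordon and investigate rather than forcing the eviction.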

The active work for a 10-node cluster takes 3-4 hours. The soak period runs over three days. This is not a slow process — it is a safe one.

What Breaks (and When to Stop)

Right-sizing failures follow four predictable patterns. Each has a specific trigger and a specific prevention.

| Failure mode | Trigger condition | Prevention |
|---|---|---|
| OOMKilled pods | New node memory insufficient for burst workloads sized using averages | Size to 99th percentile memory, not average |
| Pending pods | Largest pod's requests exceed new node's allocatable memory | Verify max pod request fits new node before draining |
| Drain hangs indefinitely | PDB requires N replicas but only N are running; drain would violate it | Scale deployment to N+1 replicas before draining |
| DaemonSet resource failures | DaemonSet requests exceed allocatable capacity on smaller nodes | Check DaemonSet requests fit new node's allocatable |

OOMKilled is the most common failure after a node resize. It happens because average memory usage looks safe, but a nightly report that builds a large in-memory dataset hits the new node's memory ceiling. The pod restarts automatically, which masks the issue in short tests but becomes chronic in production. Prevention: always use the 99th percentile memory over 14+ days, not the mean.

Scheduling failures appear as Pending pods. They happen when a pod's CPU or memory requests exceed what any node in the cluster can offer after system reservations. Kubernetes cannot split a pod across nodes. Prevention: before draining, confirm the largest pod request in the cluster fits within the new node's allocatable capacity. Allocatable is always less than total node memory due to kubelet, kube-proxy, and OS reservations.

Drain hangs happen when a Pod Disruption Budget requires at least 2 replicas available, but only 2 replicas are running. Draining one pod would bring available count to 1, which violates the PDB. Drain waits indefinitely rather than violating it. The fix: scale the deployment to 3 replicas before draining. Once migration is complete, scale back down. Never modify a PDB to unblock a drain — PDBs exist to protect production availability.

DaemonSets run on every node, including new ones, automatically. If a DaemonSet's resource requests exceed the allocatable capacity on the new node type, the DaemonSet pod goes into Pending on every new node. Check DaemonSet resource requests explicitly before sizing down.

The signal to abort and roll back: any OOMKilled event or Pending pod that persists more than 5 minutes after drain completes. Uncordon the old nodes immediately. Pods will reschedule back onto the original nodes while you diagnose the cause.

Sizing the New Nodes: The Math

The target utilization after right-sizing should be 60-70% of node capacity at the 95th percentile. Below 60% and you're over-provisioning again. Above 70% and you have insufficient headroom for autoscaling to respond to traffic spikes.

The calculation for choosing the right node type:

| Step | Formula | Example |
|---|---|---|
| 1. Total scheduled CPU | Sum all pod CPU requests cluster-wide | 100 pods x 250m avg = 25 vCPU |
| 2. Required provisioned CPU | Total requests / target utilization (0.65) | 25 / 0.65 = 38.5 vCPU |
| 3. Node count | Required vCPU / vCPU per node, rounded up | 38.5 / 4 (m5.xlarge) = 10 nodes |
| 4. Verify largest pod fits | Max pod request < node allocatable | 3.5 vCPU request < 3.9 vCPU allocatable |
| 5. Add one node buffer | N+1 for drain headroom | 10 + 1 = 11 nodes during migration |
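The five steps collapse into a few lines of arithmetic. A sketch using the worked example's numbers (100 pods at 250m requests, 65% target utilization, 4 vCPU per m5.xlarge):

```shell
# Node-count math from the table above.
total_vcpu=$(awk 'BEGIN { printf "%.0f", 100 * 250 / 1000 }')          # step 1: 25 vCPU
required=$(awk -v t="$total_vcpu" 'BEGIN { printf "%.1f", t / 0.65 }') # step 2: 38.5 vCPU
nodes=$(awk -v r="$required" 'BEGIN {
  q = r / 4                            # step 3: 4 vCPU per m5.xlarge
  n = (q == int(q)) ? q : int(q) + 1   # round up -- a pod cannot span nodes
  print n
}')
echo "steady-state nodes: $nodes"
echo "migration nodes:    $((nodes + 1))"   # step 5: N+1 drain buffer
```

Step 4 is a comparison, not arithmetic: take your single largest pod request and confirm it is below the new node's allocatable capacity before committing.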

In this example: 11 m5.xlarge nodes at $0.192/hr = $1,521/month during migration, then 10 nodes at $1,382/month at steady state. Compare to the m5.2xlarge baseline at $2,764/month. Annual savings: roughly $16,589. The migration takes a week of calendar time and a day of engineering time.

One important check before committing: run kubectl describe node on any existing node and find the allocatable CPU and memory fields. These are the real scheduling limits. A 4 vCPU m5.xlarge has approximately 3.92 vCPU allocatable after system reservations. If your largest pod requests 4 vCPU exactly, it will not schedule.

The 72-Hour Watch

The soak period is not optional. The failure modes that don't appear in the first hour appear during nightly jobs, weekly batch runs, and traffic spikes that only happen on specific days or at the end of billing periods.

During the 72 hours, watch three signals:

| Signal | Where to find it | What it means if triggered |
|---|---|---|
| Pod restart count | kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | OOMKilled or crash loop — node may be undersized |
| HPA scaling events | kubectl get events -A --field-selector reason=SuccessfulRescale | Cluster approaching capacity, autoscaler compensating |
| CPU throttling rate | container_cpu_cfs_throttled_seconds_total in Prometheus | CPU limits or node size too small for actual load |

Any pod with more than 2 restarts during the soak period needs investigation before you delete the old node group. A throttling rate above 25% on any critical service indicates the node type is CPU-constrained.
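The restart check can be scripted with a jq filter over kubectl get pods -A -o json, summing restart counts across each pod's containers. The here-doc below is sample output for illustration; on a live cluster, pipe the real command into the same filter:

```shell
# List pods that restarted more than twice -- the soak-period threshold.
# Sample data stands in for `kubectl get pods -A -o json` output.
suspect=$(jq -r '.items[]
    | select(([.status.containerStatuses[]?.restartCount] | add // 0) > 2)
    | "\(.metadata.namespace)/\(.metadata.name)"' <<'EOF'
{"items": [
  {"metadata": {"namespace": "batch", "name": "report-gen"},
   "status": {"containerStatuses": [{"restartCount": 4}]}},
  {"metadata": {"namespace": "web", "name": "frontend"},
   "status": {"containerStatuses": [{"restartCount": 0}]}}
]}
EOF
)
echo "pods needing investigation: $suspect"
```

Run it once a day during the soak; an empty list for 72 hours is the green light to delete the old node group.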

If all three signals are clean after 72 hours, delete the old node group and schedule the next one. For clusters with multiple node groups, right-size one group at a time with a minimum two-week cadence between changes. Concurrent changes to multiple groups make incident diagnosis nearly impossible — you cannot isolate which change caused a problem.

Right-sizing is not a one-time event. Workloads change, teams add services, and traffic grows. Review node group utilization every quarter. The data is free to collect. The cost of not collecting it shows up on every cloud invoice.
