Muskan

Posted on Jun 23

Karpenter consolidation: 6 settings worth tuning in 2026

#kubernetes #aws #devops #finops

Quick take

Karpenter's defaults are tuned for compute cost, not for your workload's tolerance for disruption. Six settings decide whether consolidation saves you 40% or causes a Sev-2 every Tuesday. Tune them once and the autoscaler stops being a footgun.

If you only have 90 seconds, this is the shape:

consolidationPolicy: WhenEmptyOrUnderutilized is the right default for most prod fleets in 2026.
consolidateAfter at the wrong value causes either expensive idle nodes or thrashy churn.
disruption.budgets is the safety net that decides how many nodes can churn at once.

Why Karpenter defaults are not your defaults in 2026

Karpenter 1.0 hit GA in August 2024, and behavior has kept shifting across the 1.x releases since. The defaults work well for stateless web traffic and can badly break stateful workloads.

I have walked into three incidents in the last six months where the same root cause showed up: Karpenter aggressively consolidating during a deploy, killing pods mid-request, and the team blaming the application. The application was fine. The autoscaler was set to disrupt nodes faster than the rolling restart could finish.

Two things changed in 2026 that make this worse:

Spot interruption rates are higher in popular instance families like m7i and c7i, so Karpenter rotates more often.
Multi-arch fleets (Graviton4 plus x86) mean Karpenter has more right-sizing candidates, and "right-size everything" is the default instinct.

The fix is to tune six specific settings against your workload, not against Karpenter's idea of a workload.

1. consolidationPolicy

The most consequential setting. Two valid values in 2026.

WhenEmpty consolidates only when a node has zero workload pods. Safe but leaves a lot of money on the table because underutilized nodes (10% CPU, 30% memory) keep running.

WhenEmptyOrUnderutilized is the 2026 default for cost-sensitive teams. Karpenter will replace a half-empty node with a smaller one when the math says it saves money. The risk is more frequent disruption.

For stateful workloads, batch jobs, or anything with long-running connections, start with WhenEmpty. Move to WhenEmptyOrUnderutilized only after you have PDBs and graceful termination in place.
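
As a minimal sketch, the conservative starting point for a stateful pool could look like this in a v1 NodePool. The pool name is a placeholder, and the rest of the spec (template, nodeClassRef, requirements) is left out for brevity.

```yaml
# Fragment of a karpenter.sh/v1 NodePool; "stateful" is a hypothetical name.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful
spec:
  disruption:
    # Only remove nodes that carry no workload pods.
    consolidationPolicy: WhenEmpty
    # Placeholder wait time; pick a value that fits the workload.
    consolidateAfter: 10m
```

Switching to WhenEmptyOrUnderutilized later is a one-line change once PDBs and graceful termination are in place.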

2. consolidateAfter

The "wait this long before consolidating" timer. Most example manifests ship with 1m, and a v1 NodePool that omits the disruption block falls back to 0s. Both are almost always wrong.

The problem with 1 minute: the timer counts from the last time a pod was scheduled onto or removed from the node. A CI job that takes 90 seconds to start, runs for 4 minutes, and exits leaves the node quiet, and one minute later Karpenter treats it as consolidatable. It then disrupts the node that was about to receive the next job, and the queue backs up.

What I use in prod:

Stateful or latency-sensitive: consolidateAfter: 10m
Mixed workloads: consolidateAfter: 5m
Pure stateless web: consolidateAfter: 2m
Batch-heavy clusters: consolidateAfter: 15m to let job queues drain

Above 15 minutes you stop saving money because nodes stay overprovisioned. Below 2 minutes you cause thrashing.
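
A short sketch of where the timer lives, using the mixed-workload value from the list above. This is a fragment of a NodePool's spec, not a complete manifest.

```yaml
# Fragment of spec.disruption in a karpenter.sh/v1 NodePool.
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  # Mixed workloads. Swap in 10m (stateful), 2m (pure stateless web),
  # or 15m (batch-heavy) per the list above.
  consolidateAfter: 5m
```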

3. disruption.budgets

This is the blast-radius limiter and the single most important safety setting. A budgets block controls how many nodes can be consolidated at once.

The default is a single budget of nodes: 10%. That sounds safe and is fine for clusters with 50+ nodes. On a 4-node cluster it is not: Karpenter rounds percentages up, so 10% of 4 nodes (0.4) becomes 1 node, which is a quarter of the fleet at once. Set explicit numbers, not percentages, on small clusters.

A reasonable budget for a 20-to-100-node production cluster:

10% during business hours
25% during low-traffic windows (overnight, weekends)
0% during deploy windows or known traffic spikes

The schedule is set via the schedule field with cron syntax (evaluated in UTC), paired with a duration that sets how long the window lasts. Use it. Most teams that get burned by Karpenter never set a deploy-window budget of zero.
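
One way to express the three tiers above, as a sketch. When several budgets are active at once, Karpenter applies the most restrictive one, so the 25% entry acts as the baseline and the scheduled entries tighten it. The business hours and the Tuesday deploy window are assumptions for illustration, and all times are UTC.

```yaml
# Fragment of spec.disruption in a karpenter.sh/v1 NodePool.
disruption:
  budgets:
    # Baseline: in effect whenever no stricter budget is active
    # (overnight, weekends).
    - nodes: "25%"
    # Business hours, assumed here to be Mon-Fri 09:00-18:00 UTC.
    - nodes: "10%"
      schedule: "0 9 * * mon-fri"
      duration: 9h
    # Hypothetical deploy window: Tuesdays 14:00-16:00 UTC.
    # Zero blocks voluntary disruption entirely while it is active.
    - nodes: "0"
      schedule: "0 14 * * tue"
      duration: 2h
```

On a small cluster, replace the percentages with absolute counts such as nodes: "1".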

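4. expireAfter

The "replace any node older than this" setting. In the v1 API it sits at spec.template.spec.expireAfter rather than under disruption, and the default is 720h (30 days). Setting it to Never sounds good for stability but is a security problem and a hidden cost driver in 2026.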

Why expire nodes:

Long-lived nodes accumulate kernel and AMI drift. A node that booted three months ago is running an image whose security patches are also three months stale.
Spot instances that survive a long time tend to have higher interruption probability. Forcing replacement spreads the risk.
For Reserved Instance and Savings Plan optimization, predictable node turnover helps Karpenter pick the cheapest current option.

My defaults:

Prod stateless: expireAfter: 720h (30 days)
Stateful or DBs: expireAfter: 2160h (90 days)
Spot-heavy: expireAfter: 168h (7 days)
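
As a sketch of the placement: in a karpenter.sh/v1 NodePool the field goes on the node template, not in the disruption block. The value shown is the stateless default from the list above.

```yaml
# Fragment of a karpenter.sh/v1 NodePool.
spec:
  template:
    spec:
      # Replace nodes after 30 days. Use 2160h for stateful pools
      # or 168h for spot-heavy pools, per the list above.
      expireAfter: 720h
```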

5. terminationGracePeriodSeconds

The pod-level setting Karpenter respects when draining. The default is whatever your pod spec says, which is typically 30 seconds. That is wrong for most services with active connections.

For services with long-lived connections (gRPC, WebSocket, long polling, large file transfers), set terminationGracePeriodSeconds to somewhere between 60 and 120 seconds at the pod level. Karpenter will honor it during consolidation drain.

For database pods or stateful workloads, use 300+ seconds and pair it with a proper preStop hook that gives the load balancer time to remove the endpoint before the SIGTERM lands. The preStop time counts against the grace period, so budget for both.
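
A minimal sketch of the pod-level side for a connection-heavy service. The Deployment name, labels, and image are hypothetical, and the 15-second sleep is an assumed delay that should roughly match how long your load balancer takes to deregister a target.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-api                     # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: grpc-api
  template:
    metadata:
      labels:
        app: grpc-api
    spec:
      # Total time the pod gets between eviction and SIGKILL,
      # including the preStop hook below.
      terminationGracePeriodSeconds: 120
      containers:
        - name: app
          image: registry.example.com/grpc-api:1.0.0   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so endpoint removal can propagate.
                # Assumes the image ships a sleep binary.
                command: ["sleep", "15"]
```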

6. NodePool requirements and weights

The shape of your fleet matters as much as the consolidation timing.

Pin the instance families that match your workload. A NodePool that says "any instance" lets Karpenter pick a t3a.xlarge (burstable) for a sustained CPU workload, which throttles (or bills surplus credits in unlimited mode) once its CPU credits run out. Karpenter packs nodes by pod resource requests, not live utilization, so it will not notice the starvation and correct it for you. Pin to m7i, c7i, m8g (Graviton4), or similar based on actual workload shape.

Use NodePool weights to express preference. A weight: 100 for Graviton and weight: 10 for x86 tells Karpenter to try Graviton first and fall back to x86 only when needed. This is the cleanest way to drive a mixed-arch migration without forcing it.

Cap the maximum node size with node.kubernetes.io/instance-type constraints if your workload's largest pod is small. Without this, Karpenter will sometimes pick a c7i.4xlarge when a c7i.large would do, because the consolidation math looked at total spend, not per-pod fit.
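
The three ideas above can be sketched as a pair of weighted NodePools. The pool names and the EC2NodeClass called default are assumptions. The size cap here uses the AWS provider's karpenter.k8s.aws/instance-cpu label with Lt instead of listing instance types one by one, which is an alternative to the instance-type constraint mentioned above.

```yaml
# Preferred pool: Graviton, tried first because of the higher weight.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton                     # hypothetical name
spec:
  weight: 100
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass with this name
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m8g"]
        # Cap node size: fewer than 17 vCPUs, so 16 at most.
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values: ["17"]
---
# Fallback pool: x86, used only when the Graviton pool cannot satisfy a pod.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: x86-fallback                 # hypothetical name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m7i", "c7i"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values: ["17"]
```

Pods that must stay on one architecture still need their own nodeSelector or affinity on kubernetes.io/arch; the weights only set the order in which pools are tried.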

Common tuning pitfalls

The three failures I see most often.

Setting consolidateAfter too low. Causes a feedback loop where workloads churn between nodes faster than they stabilize. Symptoms: pods restart every 5 to 10 minutes. Fix by raising to at least 5 minutes.

Skipping disruption.budgets for known events. A planned deploy plus default 10% disruption budget can cause a wave of consolidation right when the new deployment is rolling out. Set a deploy-window budget of 0% via cron.

Disabling consolidation entirely. I see this on stateful clusters as the easy answer. It can double the cost over a quarter. Tune WhenEmpty plus a long consolidateAfter instead.

Where Karpenter tuning still falls short

The honest part.

Spot interruption is opaque. Karpenter can replace spot capacity proactively when AWS signals an interruption, but it does not know that a particular m7i.large pool is hot until it is. For workloads sensitive to spot churn, layer in mixed-instance pools with on-demand fallback.
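
A sketch of the on-demand fallback, assuming the AWS provider. When a NodePool allows both capacity types, Karpenter favors spot and launches on-demand when spot capacity is not available. The family list is illustrative; a wider list gives Karpenter more spot pools to choose from.

```yaml
# Fragment of spec.template.spec in a karpenter.sh/v1 NodePool.
requirements:
  # Allow both; spot is preferred, on-demand is the fallback.
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  # Several families so one hot spot pool does not stall provisioning.
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: ["m7i", "c7i", "m6i", "c6i"]
```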

The math is not multi-cluster aware. Each cluster optimizes locally. If you run dev and prod in the same account, you cannot use savings on one to fund headroom on the other without a manual policy.

No good defaults for AI workloads. GPU instances run for 8 hours then sit idle for 6. Karpenter's consolidation policy assumes utilization is a usable signal, and on GPUs it is mostly noise. Specialized GPU schedulers like KAI handle this better.

Frequently asked questions

Should I use Karpenter or the cluster autoscaler in 2026?
Karpenter for almost everything. The cluster autoscaler still has a niche in air-gapped or single-AZ deployments, but for most teams Karpenter is the default.

Is WhenEmptyOrUnderutilized safe for production?
Yes, with PDBs configured. The risk is not the policy, it is undeclared disruption tolerance at the workload level. Give every workload a PDB with either maxUnavailable or minAvailable first (a single PDB takes one or the other, not both).
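
A minimal PDB sketch for a hypothetical service labeled app: grpc-api. It lets voluntary disruption, including a Karpenter drain, take down at most one pod at a time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grpc-api                     # hypothetical name
spec:
  # Use maxUnavailable or minAvailable, not both.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: grpc-api
```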

How do I see what Karpenter is about to consolidate?
kubectl get nodeclaims -o wide shows the live picture. The Karpenter logs at debug level surface the consolidation math. Some commercial tools wrap this into a UI with a blast-radius preview before action.

Do these settings work the same on GKE Karpenter and Azure Karpenter?
Roughly yes, since they share the same Karpenter core. Cloud-specific instance type names and capacity-type values differ. The semantic settings are identical.

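What about terminationGracePeriod at the NodePool level?
It is part of the v1 NodePool API and lives at spec.template.spec.terminationGracePeriod, not under disruption. It is a ceiling, not a floor: it caps how long a node may spend draining before Karpenter forcibly removes it, even if PDBs or do-not-disrupt pods are still blocking. Useful for clusters where pod-level grace periods are inconsistent. I set it to 120s on most NodePools as a safety net; keep it at least as long as the longest pod-level grace period on that pool, otherwise the node deadline can cut a pod's shutdown short.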
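
For reference, a sketch of where the field sits in a karpenter.sh/v1 NodePool, using the 120s value mentioned above.

```yaml
# Fragment of a karpenter.sh/v1 NodePool.
spec:
  template:
    spec:
      # Maximum time a node may spend draining before Karpenter
      # forcibly terminates it.
      terminationGracePeriod: 120s
```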

What is your current Karpenter setup doing to your bill?

If you have not changed the consolidation defaults since adopting Karpenter, the question worth asking is whether your fleet rotates more often than your deploys. Drop your consolidateAfter value in the comments. I will tell you whether it is leaving money on the table or chasing every dollar at the cost of stability.
