NTCTech

Posted on • Originally published at rack2cloud.com
Kubernetes 1.35 Removes the Restart Tax — Why Stateful Workloads Just Got Easier to Operate

In Kubernetes 1.35, in-place pod resize graduates to stable — and with it, six years of a hidden operational tax on stateful workloads comes to an end.

Until now, if a container needed more CPU or memory, the only safe answer was a restart. That design made sense for stateless services. It was painful for everything else.

Increase memory on a JVM service and the JIT cache disappears. Resize a PostgreSQL pod and WAL replay starts again from the last checkpoint. Restart Redis and your cache warm-up becomes a production event that ripples across dependent services. Restart a Kafka broker and partition rebalancing begins — consuming cluster bandwidth while your application waits.

Because of that reality, most platform teams quietly shelved one of Kubernetes' most promising automation tools: Vertical Pod Autoscaler. VPA could recommend resource changes with reasonable accuracy. Actually applying them was a different calculation entirely. In production, the restart cost was often higher than the resource inefficiency it was fixing.

Kubernetes 1.35 removes that constraint.

Timeline diagram showing the evolution of Kubernetes in-place pod resource resize from alpha in version 1.27 to stable in Kubernetes 1.35.


Why This Took Six Years

The feature was not slow due to lack of demand. It was delayed because safely mutating a running container's resource envelope required solving several non-trivial problems simultaneously:

  • Pod immutability was foundational — the original design treated the pod spec as sealed at creation. Mutating resources in-flight required rethinking kubelet reconciliation from the ground up.
  • cgroup mutation needed safe exposure — changing CPU and memory limits for a running process requires direct cgroup manipulation. That path needed to be safe, auditable, and consistent across container runtimes.
  • Status reconciliation was complex — the kubelet needed to track desired resources, allocated resources, and actual resources independently.
  • Runtime support had to be universal — containerd and CRI-O both needed to implement live resource mutation before the feature could be considered stable. cgroup v2 became the minimum requirement.

The Restart Tax: What It Actually Cost

The restart was never free. For stateless services the cost was low enough to ignore. For the workload classes that make up most of enterprise infrastructure, the math was different.

| Workload | What a Restart Actually Caused | Downstream Impact |
| --- | --- | --- |
| JVM Applications | JIT profile loss + cold GC behavior | Performance degrades 10–30 min while JIT recompiles hot paths |
| PostgreSQL | WAL replay + connection pool churn | Recovery time scales with WAL backlog; pgBouncer pools drain and rebuild |
| Redis | Full cache warm-up required | Cache miss storm hits origin databases until warm-up completes |
| Kafka Brokers | Partition rebalance cascade | Consumer lag spikes across all partitions during leader re-election |
| ML Inference Services | Model reload from storage | Cold start latency while model weights reload — requests queue or fail |

The operational consequence was predictable: over-provisioning became the safe default. Requests and limits were set conservatively at deployment time and rarely revisited — not because the right values weren't known, but because the cost of correcting them was too high to justify outside a maintenance window.


How Kubernetes 1.35 In-Place Pod Resize Works

The mechanism is direct: when you patch a running pod's resource spec, the kubelet detects the change and writes the new values to the container's cgroup without triggering a container restart. The container keeps running. The process ID does not change. The network namespace is preserved.
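In practice, the patch looks like the sketch below. The pod and container names are placeholders, and on recent releases resource mutations are routed through the dedicated `resize` subresource rather than a plain pod patch:

```bash
# Sketch: raise CPU on a running pod without a restart.
# "postgres-0" and "postgres" are illustrative names; on clusters that
# enforce the resize subresource, --subresource resize is required.
kubectl patch pod postgres-0 --subresource resize --type merge \
  -p '{"spec":{"containers":[{"name":"postgres","resources":{"requests":{"cpu":"2"},"limits":{"cpu":"3"}}}]}}'
```

If the patch succeeds, the kubelet reconciles the new values into the container's cgroup while the process keeps running.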

Kubernetes 1.35 in-place pod resize flow diagram comparing before and after cgroup mutation

The resizePolicy Field

The behavior on resize is controlled per-container via a resizePolicy field:

  • NotRequired — the resource change is applied to the cgroup without restarting the container. Right for CPU changes on most workloads, and for memory increases where the application can consume additional heap without restart.
  • RestartContainer — the container is restarted when the specified resource is changed. Appropriate for memory decreases where the allocator won't release memory without a restart.
```yaml
containers:
- name: postgres
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  resizePolicy:
  - resourceName: cpu
    restartPolicy: NotRequired
  - resourceName: memory
    restartPolicy: RestartContainer
```

The Two-Field Model

1.35 introduces a clean separation between desired and actual state:

  • spec.containers[*].resources — the desired state. Mutable for CPU and memory.
  • status.containerStatuses[*].resources — the actual resources currently configured at the cgroup layer.

When a resize is in progress, these fields will differ. When complete, they converge.
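You can watch that divergence directly. A minimal check, assuming a pod named `postgres-0`:

```bash
# Compare desired (spec) vs. actual (status) resources during a resize.
# While the resize is in flight these two lines will differ.
kubectl get pod postgres-0 \
  -o jsonpath='{.spec.containers[0].resources}{"\n"}{.status.containerStatuses[0].resources}{"\n"}'
```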

The cgroup v2 Requirement

In-place resize requires cgroup v2. This is not optional. If your node fleet is still running cgroup v1, in-place resize will not function — and the 1.35 upgrade path has additional consequences covered below.


The Real Story Is VPA

For years, VPA had an awkward reputation. It could recommend resource changes with reasonable accuracy. Applying them automatically meant restarting pods — operationally risky enough that most teams disabled Auto and Recreate update modes entirely.

With Kubernetes 1.35 in-place pod resize now stable, the InPlaceOrRecreate VPA update mode graduates to beta:

  1. VPA attempts to resize the pod in-place by patching the resource spec
  2. The kubelet applies the change via cgroup mutation — no eviction, no restart (subject to resizePolicy)
  3. If the node lacks sufficient capacity, VPA falls back to the traditional evict-and-recreate path
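A VPA object using that mode might look like the following sketch. The target names are illustrative, and it assumes a VPA release recent enough to ship `InPlaceOrRecreate` (roughly v1.4+, with the beta mode enabled):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    # Try in-place resize first; fall back to evict-and-recreate
    updateMode: "InPlaceOrRecreate"
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        cpu: 500m
        memory: 1Gi
      maxAllowed:
        cpu: "4"
        memory: 8Gi
```

Bounding the recommender with `minAllowed`/`maxAllowed` keeps automation from walking a stateful workload outside a tested envelope.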

The database workload shift:

Before 1.35: Scaling memory → pod restart → connection loss → pgBouncer pool drain → application error spike

After 1.35: Scaling memory → cgroup mutation → workload continues → connections remain established

Diagram showing VPA operating modes before and after Kubernetes 1.35

Operator warning: In-place resize changes container limits immediately at the cgroup layer. The application may not adapt instantly. JVM heap sizing (-Xmx) is still a startup parameter — raising the memory limit without adjusting heap configuration produces a container with more headroom but the same effective ceiling. Test resize behavior before enabling VPA automation on stateful production workloads.
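One way to verify the ceiling mismatch described above is to inspect the JVM's effective max heap inside the resized container. This assumes the image ships the JDK's `jcmd` tool and the JVM runs as PID 1; the pod and container names are placeholders:

```bash
# After a memory resize, check whether the JVM's heap ceiling moved.
# If MaxHeapSize is unchanged, the new cgroup headroom is unusable heap.
kubectl exec my-jvm-pod -c app -- jcmd 1 VM.flags | grep -o 'MaxHeapSize=[0-9]*'
```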


Upgrade Landmines in Kubernetes 1.35

1 — cgroup v1 Is Gone

Not a deprecation. Removed.

```bash
# Check cgroup version on a node
stat -fc %T /sys/fs/cgroup/
# cgroup2fs = v2 (compatible)
# tmpfs = v1 (incompatible — must remediate before upgrade)

# Check containerd's cgroup driver
containerd config dump | grep SystemdCgroup

# Verify the kubelet's cgroup driver matches
grep cgroupDriver /var/lib/kubelet/config.yaml
```

All three values must be consistent. Nodes on cgroup v1 must be remediated at the OS level before 1.35 can be safely deployed.

2 — containerd 1.x Is End of Life

Minimum supported version is containerd 2.x. containerd 2.x changed default behaviors around cgroup driver configuration, image garbage collection thresholds, and shim management. Validate runtime behavior in non-production before rolling to production node pools.
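A quick fleet check, run on each node and cross-checked against what the API server reports:

```bash
# Confirm the runtime version on the node (2.x required)
containerd --version

# Confirm what the kubelet actually reports per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion
```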

3 — IPVS Mode Is Formally Deprecated

kube-proxy's IPVS mode is formally deprecated in 1.35. The project is moving toward nftables. Check your current mode:

```bash
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
```
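For teams planning the migration, the target state is a one-line change in the kube-proxy configuration. A sketch, assuming nodes with nftables support and a kube-proxy version recent enough to ship the mode:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# "iptables" and "ipvs" are the legacy values; nftables is the successor
mode: "nftables"
```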

What Platform Teams Should Do Before Upgrading

Kubernetes 1.35 upgrade readiness checklist for platform teams

Step 1 — Verify node runtime compatibility
Run the cgroup version check across all nodes. Verify containerd >= 2.x across the fleet.

Step 2 — Audit StatefulSet specs
Add resizePolicy explicitly. For JVM and database workloads, set restartPolicy: RestartContainer for memory.
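A hedged audit one-liner, assuming `jq` is available, to find StatefulSet containers that still rely on the default policy:

```bash
# List statefulset containers with no explicit resizePolicy
kubectl get statefulsets -A -o json | jq -r '
  .items[]
  | (.metadata.namespace + "/" + .metadata.name) as $sts
  | .spec.template.spec.containers[]
  | select(.resizePolicy == null)
  | $sts + " -> " + .name'
```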

Step 3 — Re-evaluate VPA
Many teams disabled VPA's automatic modes years ago. With InPlaceOrRecreate in beta, revisit those decisions — starting with non-critical stateful workloads in non-production.

Step 4 — Test resize behavior per workload class
Not all applications respond predictably to live resource changes. Run controlled resize tests before enabling automation on production StatefulSets.
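During those tests, resize progress is observable on the pod itself. Recent releases surface it through pod conditions (named `PodResizePending` / `PodResizeInProgress` in the upstream enhancement; treat the exact names as version-dependent):

```bash
# Watch pod conditions while a controlled resize runs ("postgres-0" is a placeholder)
kubectl get pod postgres-0 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```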


Architect's Verdict

Kubernetes is slowly removing the assumptions that made it feel stateless-only.

In-place pod resize closes a six-year operational gap that forced platform teams to choose between resource efficiency and workload stability. For the workload classes that make up most of enterprise infrastructure — databases, caches, brokers, long-running compute — that trade-off is now gone.

But the operational complexity doesn't disappear. It shifts.

Runtime compatibility, cgroup version consistency, and Day-2 resource drift still decide whether clusters stay stable after upgrade. The teams that will get the most out of 1.35 are the ones who audit their node fleet before upgrading, test resize behavior before enabling automation, and treat resizePolicy as an explicit architectural decision rather than a default.


