As the year draws to a close, the Kubernetes community has gifted us one last major update: Kubernetes 1.35, affectionately codenamed "Timbernetes".
The release theme, officially inspired by Yggdrasil (the World Tree of Norse mythology), is surprisingly apt. Just as a massive tree grows ring by ring, strengthening its core while expanding its reach, Kubernetes has moved past its chaotic growth phase. It is no longer just sprouting wild new APIs in every direction. Instead, it is thickening its trunk, solidifying its roots, and, most importantly, learning how to adapt without being uprooted.
If you’ve been managing Kubernetes clusters for the last decade, you know the mantra: Immutable Infrastructure. If you want to change something, you kill it and spawn a new one. Cattle, not pets.
Kubernetes 1.35 challenges this dogma. With the General Availability (GA) of In-Place Pod Vertical Scaling, we are witnessing a philosophical pivot. We are entering an era where our workloads are allowed to breathe, grow, and shrink without facing the executioner.
In this deep dive, we will explore the massive shifts in K8s 1.35, specifically tailored for application developers and platform engineers. We will look at why "Timbernetes" might be the most "resource-conscious" release yet, the heavy influence of AI on the scheduler, and the painful-but-necessary goodbyes to some legacy giants.
1. The Headliner: In-Place Pod Vertical Scaling (GA)
Let’s start with the feature everyone is talking about. After years of being teased in alpha and beta, In-Place Pod Vertical Scaling (KEP #1287) has finally graduated to Stable.
The Problem: The "Restart Tax"
For years, defining resource requests and limits (cpu and memory) was a gamble.
- Set them too high? You waste money and strand cluster capacity.
- Set them too low? Your app gets OOMKilled (Out of Memory) or throttled.
- Need to change them? You have to edit the Deployment, which triggers a rollout. The old pods are terminated, and new ones are scheduled.
This "Kill and Recreate" model works fine for stateless Nginx web servers. It is a nightmare for:
- Java applications: The JVM takes time to warm up. Restarting means cold caches and sluggish performance.
- Databases & StatefulSets: Restarting a primary DB node triggers leader elections, failovers, and potential downtime or latency spikes.
- AI/ML Training Jobs: If a training pod realizes it needs 10% more memory halfway through a 14-hour epoch, killing it means losing hours of computation (unless your checkpointing is perfect).
The Solution: Mutable Resources
With Kubernetes 1.35, you can now patch the resources of a running Pod, and the Kubelet will attempt to apply those changes without restarting the containers.
This is a game-changer for Vertical Pod Autoscalers (VPA). Previously, VPA was a "disruptive" autoscaler. It had to evict your pods to resize them. Now, VPA can act like a silent guardian, tuning your CPU and Memory requests in real-time as traffic fluctuates, with zero downtime.
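To make this concrete, here is a sketch of what a VPA riding on in-place resizing might look like. The `InPlaceOrRecreate` update mode and the exact field values are assumptions based on recent autoscaler releases, so check the docs for the VPA version you actually run:

```yaml
# Sketch: a VPA that prefers resizing pods in place, falling back to
# eviction only when the Kubelet cannot apply the change live.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: data-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-worker          # placeholder Deployment name
  updatePolicy:
    updateMode: "InPlaceOrRecreate"   # assumed mode; verify per VPA release
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      minAllowed:
        memory: "512Mi"
      maxAllowed:
        memory: "4Gi"
```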
How It Works (For the Devs)
Technically, this exposes a new resize subresource on Pods. From kubectl, you target it with the --subresource resize flag; beyond that, it looks like a standard patch.
Imagine you have a pod running a memory-hungry Python worker:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-worker
spec:
  containers:
  - name: worker
    image: my-python-worker:latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: NotRequired
```
If you notice the worker is struggling, you can patch it live:
```bash
kubectl patch pod data-worker --subresource resize --patch \
  '{"spec":{"containers":[{"name":"worker","resources":{"requests":{"cpu":"800m","memory":"1.5Gi"}}}]}}'
```
What happens under the hood?
- API Server: Accepts the change and updates `spec.containers[].resources`.
- Kubelet: Detects the divergence between `spec` and `status`.
- cgroup v2: The Kubelet talks to the underlying Linux cgroup manager to raise the memory limit and CPU shares instantly.
- Application: The app suddenly has more breathing room. No `SIGTERM`, no restart.
The "Thought-Provoking" Angle: Are we returning to Mutable Infrastructure?
This feature reopens an old debate. We moved to containers to ensure consistency. If we allow resources to change on the fly, does kubectl get pod still tell the whole truth?
The answer is yes, because the state is still declarative. You are not SSHing in and running sysctl. You are updating the manifest. However, it shifts the mindset from "disposable" to "adaptable." In a world where sustainability and cloud costs are board-level concerns, killing a healthy process just to give it 100MB more RAM is arguably wasteful. Kubernetes 1.35 makes the platform more eco-friendly by reducing the "churn" of re-scheduling.
2. Kubernetes as the AI OS: Gang Scheduling & DRA
If In-Place Scaling is for efficiency, the next set of features is pure power, driven by the insatiable hunger of AI/ML workloads.
Gang Scheduling (Alpha)
Kubernetes has traditionally been a "one-at-a-time" scheduler. If you submit a job needing 100 GPUs, K8s would happily schedule 50, run out of space, and leave those 50 sitting there idle while waiting for the other 50. This "partial allocation" deadlocks clusters and wastes expensive GPU hours.
In 1.35, we finally get native Gang Scheduling (KEP #4671).
This introduces the concept of "All-or-Nothing." You can define a PodGroup. The scheduler will look at the cluster and ask, "Can I fit ALL these pods right now?"
- Yes? Schedule them all simultaneously.
- No? Queue them. Don't start any.
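The alpha API surface is still settling, so treat the following as a purely hypothetical sketch of the "all-or-nothing" idea. The group, kind, and field names here are illustrative, not final:

```yaml
# Hypothetical sketch only - consult the KEP and your cluster's alpha
# APIs for the real group/kind and field names.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: llm-training
spec:
  minMember: 100   # place all 100 worker pods together, or none at all
```

Member pods would then reference the group (for example via a label), and the scheduler holds the entire set in the queue until capacity exists for every member.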
For developers training LLMs or running massive MPI (Message Passing Interface) jobs, this removes the need for third-party plugins like Volcano or Kueue (though those tools still offer advanced features). It’s Kubernetes acknowledging that batch processing is just as important as microservices.
Dynamic Resource Allocation (DRA) Maturity
DRA has been bubbling under the surface for a few releases, but in 1.35 it becomes the default way to handle specialized hardware.
The old "Device Plugin" model was rigid. You counted GPUs as integers (1, 2, 4).
DRA allows for structured parameters. You can request:
- "A GPU, but it must have at least 24GB VRAM."
- "A slice of a GPU (MIG) that shares a PCIe switch with the network card."
With 1.35, we see better support for extended resources in DRA. This is critical for the "Edge AI" use case, where hardware is heterogeneous and constrained.
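As a rough illustration, the "at least 24GB VRAM" request above might be expressed as a ResourceClaim with a CEL selector. The API version, device class name, capacity key, and CEL expression are all assumptions here; exact shapes vary between DRA revisions:

```yaml
# Sketch of a structured DRA request (field names assumed; verify
# against the resource.k8s.io version your cluster serves).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: big-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # placeholder DeviceClass
      selectors:
      - cel:
          # Illustrative capacity check: "at least 24Gi of VRAM"
          expression: device.capacity["example.com"].memory.compareTo(quantity("24Gi")) >= 0
```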
3. The Great Cleanup: Deprecations & Removals
You cannot have a "World Tree" with dead branches. 1.35 brings the axe to some long-standing components.
The End of cgroup v1
This is the "check your kernel" moment. Support for cgroup v1 is officially removed (or deeply deprecated depending on your provider's window). Kubernetes now relies entirely on cgroup v2.
Why? Cgroup v2 offers better resource isolation, reliable memory QoS, and is the prerequisite for features like the In-Place Pod Resizing we just discussed. If you are still running ancient nodes (looking at you, CentOS 7 diehards), your upgrade path just hit a wall.
The Fall of the Ingress Controller (Kind of)
Perhaps the most controversial "vibe shift" in 1.35 is the aggressive move away from the classic Ingress NGINX controller in favor of the Gateway API.
While Ingress resources aren't deleted, the community signal is clear: Stop building new things on Ingress. The Gateway API is no longer the "future" - it is the "now." It offers:
- Standardized traffic splitting (Canary releases).
- Header-based routing without insane annotation hacks.
- Cross-namespace route sharing (perfect for multi-team clusters).
If you are still writing huge nginx.ingress.kubernetes.io/configuration-snippet annotations, take 1.35 as your sign to learn the HTTPRoute resource.
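The canary and header-routing bullets above map directly onto a single HTTPRoute. The Gateway and Service names below are placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app-route
spec:
  parentRefs:
  - name: main-gateway           # placeholder Gateway
  rules:
  # Header-based routing: no annotation hacks required
  - matches:
    - headers:
      - name: x-canary
        value: "true"
    backendRefs:
    - name: my-app-canary
      port: 8080
  # Standardized traffic splitting: 90/10 canary
  - backendRefs:
    - name: my-app-stable
      port: 8080
      weight: 90
    - name: my-app-canary
      port: 8080
      weight: 10
```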
4. Developer Quality of Life: What’s in it for me?
Okay, maybe you aren't managing the cluster. You're just deploying to it. What does 1.35 give you?
Image Pull Credential Verification (Beta -> Default)
Have you ever had a security scare where Tenant A pulled a private image, and Tenant B (who shouldn't have access) managed to run it because it was cached on the node?
K8s 1.35 closes this loop. Where the old workaround was the heavyweight AlwaysPullImages admission controller, the Kubelet itself now re-verifies your credentials against the registry before serving a cached image. Just because the bits are on the disk doesn't mean you get to use them unless you hold the key.
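On the node side, this behavior is steered through a KubeletConfiguration policy. The field name and values below are taken on the assumption they match the KEP, so verify them against your Kubelet version before rolling out:

```yaml
# KubeletConfiguration fragment (policy field assumed from the KEP).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Re-check pull credentials even for images already cached on the node:
imagePullCredentialsVerificationPolicy: AlwaysVerify
```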
Job Mutability (Alpha)
Have you ever launched a batch Job, watched it crash due to OOM, and then sighed as you had to delete the job, edit the YAML, and re-apply?
1.35 introduces Mutable Container Resources for Suspended Jobs. You can suspend a failing job, patch the resource limits (give it more RAM), and resume it. It keeps the job history and identity intact. It’s a small change that saves huge amounts of frustration during debugging loops.
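The debugging loop then becomes three patches instead of a delete-and-reapply. This sketch assumes the relevant alpha feature gate is enabled on your cluster; `my-batch-job` is a placeholder name:

```bash
# 1. Suspend the failing Job (stops its pods, keeps the Job object and history)
kubectl patch job my-batch-job --type merge -p '{"spec":{"suspend":true}}'

# 2. Raise the memory limit on the pod template while it is suspended
#    (mutable only under the alpha feature)
kubectl patch job my-batch-job --type json -p \
  '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'

# 3. Resume it - replacement pods start with the bigger limit
kubectl patch job my-batch-job --type merge -p '{"spec":{"suspend":false}}'
```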
Pod-Level Resources (Alpha)
This is a sneak peek at the future. While container resizing is GA, 1.35 introduces Pod-Level Resource specifications. This allows you to set a "budget" for the whole pod, and let the containers inside (app, sidecar, logging agent) fight for it or share it dynamically. This is crucial for the "Sidecar" pattern, preventing a hungry logging agent from starving your main app, or vice versa.
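A sketch of what that pod-wide budget might look like, assuming the field placement from the KEP (alpha, behind a feature gate):

```yaml
# Sketch: a pod-level resource budget shared by the app and its sidecar.
apiVersion: v1
kind: Pod
metadata:
  name: budgeted-pod
spec:
  resources:            # pod-level budget (alpha; placement assumed)
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      memory: "2Gi"
  containers:
  - name: app
    image: my-app:latest        # placeholder images
  - name: log-agent
    image: my-log-agent:latest
```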
5. The Thought-Provoking Conclusion: The "Biological" Cluster
Why is "Timbernetes" such a pivotal release?
For the first 10 years, Kubernetes treated infrastructure as mechanical. Machines were cogs. Pods were widgets. If a widget was defective, you threw it in the trash and manufactured a new one.
Kubernetes 1.35 marks the transition to a biological view of infrastructure.
- Adaptability: Pods can grow and shrink like living cells (Vertical Scaling).
- Symbiosis: Gang scheduling acknowledges that some organisms must exist together or not at all.
- Evolution: The shedding of cgroup v1 and Ingress shows the system evolving its internal organs.
As developers, this empowers us. We can build applications that are more resilient and cost-effective. We can write "Eco-Mode" operators that shrink our pods at night without killing connections. We can train AI models without fighting the scheduler.
But it also demands more from us. We can no longer treat the cluster as a black box that just "runs containers." We need to understand the nuances of resource topology, the difference between cgroup versions, and the lifecycle of a mutable pod.
The World Tree is growing. It's time for us to climb it.
Quick Upgrade Checklist for Devs
- Check your manifests: Are you still using `apiVersion: networking.k8s.io/v1beta1` for anything? (You shouldn't be, but check.)
- Audit your resources: Look at your VPA configuration. If you were using "Off" mode because you feared restarts, try switching to "Auto" in a dev environment with 1.35.
- Learn Gateway API: Try converting one complex Ingress into a Gateway/HTTPRoute pair.
- Verify your nodes: Ensure your underlying node pools are running OS images that support cgroup v2.
Happy deploying! 🚀