A Deep Dive into the Kubelet, PLEG, and Controller Manager
One of the defining promises of Kubernetes is “Self-Healing.” When a service crashes, the platform automatically detects the failure and restores the workload without human intervention. But how does Kubernetes actually restart crashing pods?
Kubernetes does not technically restart pods; it replaces the containers within them. This process is managed by the Kubelet on each node, which uses the Pod Lifecycle Event Generator to monitor container states. When a container fails — indicated by a non-zero exit code, an OOMKilled signal, or a failed Liveness Probe — the Kubelet applies a restart action. This restart is throttled by an exponential backoff algorithm (doubling delay up to 300 seconds) to prevent CPU exhaustion, ensuring the system heals itself automatically.
1. The Architecture of Failure: Kubelet and PLEG
To understand detection, we must look at the Node Level. The Kubernetes Control Plane (API Server) is often too far removed to handle immediate process failures. The heavy lifting is performed locally by the Kubelet.
SyncLoop
The Kubelet runs a continuous control loop called the SyncLoop. Its job is simple: Reconcile Expected State with Actual State.
Expected State: “Run Nginx version 1.2.” (From API Server)
Actual State: “Nginx is running.” (From Runtime)
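You can watch both sides of this reconciliation from the outside. A minimal sketch (the pod name nginx is a placeholder):
# Expected state: the container spec stored in the API Server
kubectl get pod nginx -o jsonpath='{.spec.containers[0].image}'
# Actual state: what the container runtime reports back through the Kubelet
kubectl get pod nginx -o jsonpath='{.status.containerStatuses[0].state}'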
Problem with Polling
In early versions of Kubernetes, the Kubelet repeatedly polled the Docker daemon, asking “Are my containers running?” With 100+ pods per node, this polling choked the CPU.
Solution: PLEG (Pod Lifecycle Event Generator)
This is the internal mechanism that makes detection fast and efficient.
- Relisting: PLEG periodically relists all containers from the runtime.
- Comparison: It compares the old list with the new list.
- Event Generation: If it sees a change (e.g., container ID abc changed state from Running to Exited), it generates a ContainerDied event.
- Immediate Action: This event wakes up the Kubelet immediately, bypassing the standard polling cycle.
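When relisting falls behind (an overloaded runtime, too many containers), the node reports it in its Ready condition. A rough example of what to look for — the exact wording varies by Kubernetes version, and the node name is a placeholder:
kubectl describe node <node-name> | grep -i pleg
# Typical message on an unhealthy node:
#   PLEG is not healthy: pleg was last seen active 3m5s ago; threshold is 3m0s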
2. The Three Signals of Death
How does the runtime know a container has failed? It relies on three specific signals from the Linux Kernel and the Kubelet’s own probing logic.
A. Process Exit (Crash)
When the main process inside your container (PID 1) stops, it sends an exit code to the operating system.
- Exit Code 0: The process finished successfully. (Kubernetes considers this “Completed”).
- Exit Code 1–255: The process crashed or threw an error.
The Kubelet sees this non-zero code via the CRI and marks the container as Error.
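You can reproduce this end to end with a throwaway pod (the name crashdemo is illustrative):
# A pod whose main process immediately exits with code 1
kubectl run crashdemo --image=busybox --restart=Always -- sh -c "exit 1"
# After the first restart, the Kubelet records the previous attempt's exit code
kubectl get pod crashdemo -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'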
B. The OOMKilled Signal (Exit Code 137)
This is the most common and misunderstood crash.
- Scenario: Your application tries to allocate 512MB of RAM, but your Pod YAML limits it to 256MB.
- Kernel’s Reaction: The Linux kernel’s cgroups mechanism denies the memory request. The kernel invokes the OOM Killer (Out of Memory Killer), which immediately sends SIGKILL to the process.
- Result: The container dies instantly with Exit Code 137 (128 + 9 for SIGKILL).
What it looks like in kubectl describe pod:
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 01 Jan 2024 12:00:00 GMT
Finished: Mon, 01 Jan 2024 12:05:00 GMT
If you see Exit Code 137, restarting the pod won’t fix it. You must either fix the memory leak in your code or increase the resources.limits.memory in your YAML.
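On the YAML side, the fix looks something like this — the numbers are illustrative and should come from real usage data:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # raised so peak usage fits under the cgroup limit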
C. The Liveness Probe (The Deadlock)
Sometimes, PID 1 is still running, but the application is frozen (deadlocked) or stuck in an infinite loop. The process exists, so the kernel thinks everything is fine.
This is where Liveness Probes come in. You configure the Kubelet to actively “ping” your app.
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
If the endpoint returns a 500 error or times out 3 times in a row, the Kubelet decides the application is broken. It forcefully kills the container to trigger a restart.
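In kubectl describe pod, this shows up in the Events section roughly as follows (the container name api is illustrative and messages vary slightly by version):
Warning  Unhealthy  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
Normal   Killing    kubelet  Container api failed liveness probe, will be restarted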
3. The Recovery Logic: Restart Policies
Once a failure is confirmed, the Kubelet consults the restartPolicy defined in the Pod spec.
- “Always” Policy (Default): Used for standard web servers and long-running services. The Kubelet restarts the container regardless of why it stopped.
spec:
restartPolicy: Always
- “OnFailure” Policy: Used for batch jobs or data processing. The container is only restarted if it crashes (non-zero exit code). If it finishes cleanly (Exit Code 0), it stays stopped.
spec:
restartPolicy: OnFailure
- “Never” Policy: Used for debugging or one-off static pods. Kubernetes will never restart the container.
spec:
restartPolicy: Never
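Higher-level controllers constrain this field: Deployments only accept Always, while Jobs require OnFailure or Never. A minimal Job sketch using OnFailure (names are illustrative):
apiVersion: batch/v1
kind: Job
metadata:
  name: data-import
spec:
  template:
    spec:
      restartPolicy: OnFailure   # rerun only if the batch process crashes
      containers:
      - name: importer
        image: busybox
        command: ["sh", "-c", "echo importing && exit 0"]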
4. The Algorithm: CrashLoopBackOff
Imagine your database is down, and your API crashes immediately when it fails to connect. If Kubernetes restarted your API instantly every time, the node would burn its CPU in a tight crash-restart loop.
To prevent this, Kubernetes uses an Exponential Backoff Algorithm.
Math Behind the Wait
When a container crashes repeatedly, the Kubelet inserts a delay before attempting the next restart. The delay doubles with every crash:
- Crash 1: Immediate Restart.
- Crash 2: Wait 10s.
- Crash 3: Wait 20s.
- Crash 4: Wait 40s.
- …
- Max Delay: 300s
When you run kubectl get pods and see status CrashLoopBackOff, it means Kubernetes is currently waiting for this timer to expire before trying again.
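The output looks roughly like this (names and timings are illustrative; newer kubectl versions also show how long ago the last restart happened):
NAME                  READY   STATUS             RESTARTS      AGE
api-5d9c7b6f4-x2k8p   0/1     CrashLoopBackOff   5 (38s ago)   4m12s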
Resetting the Timer:
The timer doesn’t last forever. If the container starts and runs successfully for 10 minutes, the Kubelet resets the backoff counter to zero, and the next crash starts the delay sequence from the beginning.
5. Cluster-Level Recovery: When the Node Dies
The Kubelet handles local software failures. But what happens if the physical server (Node) fails? This scenario moves the responsibility from the Kubelet to the Kubernetes Controller Manager.
Node Controller Loop:
- Heartbeat Loss: Every node sends a status update to the API Server every 10 seconds.
- Timeout: If the API Server stops receiving heartbeats for the configured grace period, the Node Controller marks the node condition as Unknown or NotReady.
- Eviction: The controller applies a NoExecute taint to the node. By default, pods tolerate that taint for about 5 minutes (the behavior formerly controlled by --pod-eviction-timeout) before they are evicted.
- Rescheduling: The ReplicaSet Controller observes that the number of running replicas has dropped below the desired count and immediately creates replacement pods, which the scheduler places on healthy nodes.
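This is also why pods do not vanish from a dead node instantly: an admission controller injects default tolerations that let them sit on a NotReady or unreachable node for about 5 minutes. You can see them on any pod with kubectl get pod <pod-name> -o yaml:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300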
6. Advanced Engineering: Startup Probes & Sidecars
As Kubernetes evolves, new features allow for more granular control over restarts.
“Slow Start” Problem
Legacy Java apps or AI models loading large weights into GPU memory can take minutes to start. A standard Liveness Probe would kill these containers before they finish booting.
The Solution: Use a startupProbe.
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 10
- Logic: The probe checks every 10 seconds, up to 30 times, giving the app up to 300 seconds to start.
- Behavior: Liveness probes are disabled until the Startup probe succeeds once.
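In practice the two probes sit side by side on the same container; a sketch, with the image name and timings as placeholders:
containers:
- name: model-server
  image: registry.example.com/model-server:1.0
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 10
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 20
    failureThreshold: 3
On the sidecar front, Kubernetes 1.28 introduced native sidecar containers (init containers with restartPolicy: Always), which the Kubelet restarts on their own if they crash, independently of the main container.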
7. How to Debug Crash Loops
When facing a crash loop, use these three commands to diagnose the root cause:
Step 1: Check the Previous Logs. If the pod is currently crashing, standard logs might be empty. You need the logs of the previous instance that died.
kubectl logs <pod-name> --previous
Step 2: Inspect the Events and Last State. The “Events” section of kubectl describe tells you why the Kubelet killed it (e.g., Liveness probe failed, OOMKilled).
kubectl describe pod <pod-name>
Look for the Events entries and for Last State: Terminated with its Exit Code.
Step 3: Debug with an Ephemeral Container If the container crashes too fast to inspect, attach a debug shell to the running pod without restarting it.
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
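Because --target joins the crashing container’s process namespace (where the runtime supports it), you can inspect it from the debug shell, for example:
# Inside the busybox debug shell
ps    # the target container's processes are visible alongside the debug shell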
Conclusion
Kubernetes “Self-Healing” is not a single feature; it is a symphony of independent systems.
- PLEG ensures crashes are detected in milliseconds.
- CRI captures the specific exit codes to determine the cause.
- Backoff Algorithms prevent your infrastructure from being overwhelmed by failing applications.
- Controllers handle the catastrophic loss of physical hardware.
By understanding these internals, engineers can move beyond basic troubleshooting and architect systems that are resilient to both software bugs and infrastructure failures.