Mehmet TURAÇ

Posted on Jun 2

Great Stack to Doesn't Work #4 — Kubernetes: "Pod Is Running, App Is Dead"

#kubernetes #devops #backend #discuss

A survival guide for when everything goes wrong in production.

The pod is Running. STATUS says Running. kubectl says Running. The deployment shows 3/3 replicas available. Every signal says this thing is alive.

But your users are getting timeouts. The health check endpoint returns 200, but the application thread pool is exhausted. The container is up. The process is running. The application is dead.

Kubernetes trusts your probes. If your probes lie, Kubernetes believes the lie.

The Three Probes: liveness, readiness, startup

These three probes look similar but serve completely different purposes. Mixing them up is one of the most common causes of self-inflicted Kubernetes outages.

Liveness probe: "Is this container broken beyond recovery?" If it fails, Kubernetes kills the container and restarts it. This is a last resort. If your liveness probe checks a database connection and the database is down, Kubernetes restarts your pod. The pod comes back. The database is still down. The liveness probe fails again. CrashLoopBackOff. You now have zero capacity instead of degraded capacity.

Liveness probes should check if the process itself is stuck — deadlocked threads, corrupted internal state, unresponsive event loop. They should NOT check downstream dependencies.

Readiness probe: "Can this container handle traffic right now?" If it fails, Kubernetes removes the pod from the Service endpoints. Traffic stops flowing to it, but the container stays alive. When readiness passes again, traffic resumes.

Readiness probes SHOULD check downstream dependencies. If your app can't reach the database, it shouldn't receive requests. Remove it from the load balancer, let other healthy pods handle traffic, and wait for the dependency to recover.

Startup probe: "Is this container still booting?" Runs only during startup. While the startup probe is running, liveness and readiness probes are disabled. This exists for applications with long initialization times — JVM warmup, large model loading, database migration runs.

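Without a startup probe, an application that takes 60 seconds to start will fail the liveness probe and get killed before it ever finishes booting. With the defaults (a probe every 10 seconds, a 1-second timeout, 3 consecutive failures allowed), the container is restarted within roughly 30 seconds. CrashLoopBackOff on a perfectly healthy app that just needs more time.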

The correct pattern:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 5
  # 30 * 5 = 150 seconds to start up

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  # Only runs after startup succeeds

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
  # Can toggle on/off during lifetime

Three separate endpoints. Three different checks. Don't make them the same URL.

Resources: The Art of Requests and Limits

Requests tell the Kubernetes scheduler how much of each resource to reserve for the pod. If your pod requests 500m CPU and 256Mi memory, the scheduler only places it on a node with at least that much unreserved capacity.

Limits tell the kernel how much the container is allowed to use. Exceeding the memory limit triggers an OOMKill. Exceeding the CPU limit triggers throttling.
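As a minimal sketch, requests and limits sit side by side in the container spec. The request values are the ones from the example above; the container name, image, and limit values are placeholders chosen for illustration, not recommendations:

```yaml
# Fragment of a pod spec; names and limit values are illustrative assumptions.
containers:
  - name: api                     # hypothetical container name
    image: example.com/api:1.0    # hypothetical image
    resources:
      requests:
        cpu: 500m        # the scheduler reserves half a core on the chosen node
        memory: 256Mi    # the scheduler reserves 256Mi on the chosen node
      limits:
        cpu: "1"         # CPU use beyond one core is throttled
        memory: 512Mi    # memory use beyond 512Mi gets the container OOMKilled
```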

The three configurations and their risks:

No requests, no limits: The pod is a BestEffort class. It gets whatever's available. Under node pressure, it's the first to be evicted. Never do this in production.

Requests equal to limits (Guaranteed QoS): The pod gets exactly what it asks for. No bursting above, no getting evicted under pressure (unless the node itself is failing). Predictable but expensive — you're reserving resources even when idle.

Requests lower than limits (Burstable QoS): The pod is guaranteed its request amount and can burst up to its limit when resources are available. This is the most common production configuration. The risk: if many pods burst simultaneously and the node runs out of memory, the kubelet starts evicting pods that are using more than they requested.

The CPU throttling trap: CPU limits are enforced using CFS (Completely Fair Scheduler) bandwidth control, which hands out CPU time in periods of 100ms by default. If your pod's limit is 1000m (1 core), it gets 100ms of CPU time per period. A burst that runs two threads flat out spends that quota in 50ms, and the container is then throttled for the remaining 50ms of every period until the burst is over. Your application doesn't crash; it just gets mysteriously slow. The container_cpu_cfs_throttled_seconds_total metric in Prometheus will show you if this is happening. Many teams set CPU limits too low and spend weeks debugging intermittent latency before checking throttling metrics.

My recommendation: set CPU requests but consider leaving CPU limits unset. Let pods burst on CPU. Set memory limits strictly — memory overcommit leads to OOMKills, which are worse than CPU throttling.
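A sketch of what that recommendation could look like in a container spec. The numbers are placeholders; size them from your own usage metrics:

```yaml
# Fragment of a container spec; values are illustrative assumptions.
resources:
  requests:
    cpu: 500m        # reserved share, used for scheduling
    memory: 512Mi
  limits:
    memory: 512Mi    # strict memory ceiling, set equal to the request here
    # no cpu limit: the container may burst onto idle cores
```

Two things to keep in mind with this shape. Without a CPU limit the pod is Burstable rather than Guaranteed, even though the memory request and limit match. And if the namespace has a LimitRange with a default CPU limit, that default is applied to containers that leave the limit unset.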

OOMKilled: Death by Memory

When a container exceeds its memory limit, the kernel kills it instantly. No graceful shutdown. No SIGTERM. No chance to flush buffers or close connections. The process is gone.

kubectl describe pod shows the exit code: 137 (128 + 9, where 9 is SIGKILL).

Common causes:

Memory leak: Gradual growth over hours or days. The pod works fine after restart, then slowly dies again.
Spike under load: The application allocates memory proportional to concurrent requests. During traffic spikes, memory exceeds the limit.
JVM heap misconfiguration: The JVM's -Xmx is set higher than the container's memory limit. The JVM thinks it has 4GB but the container only allows 2GB. The moment the heap grows past 2GB, OOMKill.

For JVM apps, always set -Xmx to roughly 75% of the container memory limit. The remaining 25% covers metaspace, thread stacks, native memory, and OS overhead.

For Node.js apps, set --max-old-space-size explicitly. The V8 default may exceed your container limit.
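One way to wire these flags in is through environment variables, which keeps them next to the memory limit they depend on. This is a sketch under assumptions: the container names and images are hypothetical, the 2Gi limit is an example, and the two containers are listed together only for brevity.

```yaml
# Illustrative fragments; names, images, and sizes are assumptions.
containers:
  - name: jvm-app                       # hypothetical
    image: example.com/jvm-app:1.0      # hypothetical
    env:
      - name: JAVA_TOOL_OPTIONS         # picked up by the JVM at startup
        value: "-Xmx1536m"              # 75% of the 2Gi (2048Mi) limit below
    resources:
      limits:
        memory: 2Gi
  - name: node-app                      # hypothetical
    image: example.com/node-app:1.0     # hypothetical
    env:
      - name: NODE_OPTIONS              # picked up by Node.js at startup
        value: "--max-old-space-size=1536"   # old-space cap in MiB, assumed 75% here
    resources:
      limits:
        memory: 2Gi
```

On recent JVMs, -XX:MaxRAMPercentage=75.0 expresses the same ratio without hardcoding a number, so the heap follows the limit if you change it later.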

Evictions: When the Node Pushes Back

Eviction happens at the node level, not the pod level. When a node runs low on resources (memory, disk, or PIDs), kubelet starts evicting pods to protect the node.

Eviction order:

BestEffort pods (no requests/limits) — evicted first
Burstable pods exceeding their requests
Guaranteed pods — evicted last, only under extreme pressure

Priority classes let you influence this order. Create PriorityClasses for your workloads:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "Critical production services"

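Priority feeds into two decisions. When a node is under pressure, the kubelet takes pod priority into account when ranking eviction candidates. When the cluster is full, the scheduler can preempt lower-priority pods to make room for a higher-priority one. Your core payment service survives; your internal analytics job gets evicted.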
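A PriorityClass does nothing until a pod references it. A minimal sketch of a pod opting in to the class defined above (the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api                       # hypothetical
spec:
  priorityClassName: critical-service     # the PriorityClass defined above
  containers:
    - name: payment-api
      image: example.com/payment-api:1.0  # hypothetical image
```

In a Deployment, the same field goes under spec.template.spec.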

Node Affinity, Taints, and Tolerations

"Why won't my pod schedule?" is the second most common Kubernetes question (after "why is it crashing").

Node affinity: tells the scheduler which nodes the pod prefers or requires.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-type
              operator: In
              values: ["gpu"]

This pod will only run on nodes labeled node-type=gpu. If no such node exists, the pod stays Pending until one appears.

Taints: nodes repel pods. A tainted node won't accept pods unless they have a matching toleration.

kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

Now only pods that tolerate gpu=true can run there. This prevents CPU-only workloads from accidentally landing on expensive GPU nodes.
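The matching toleration on the pod side would look roughly like this. It is a sketch that mirrors the taint from the command above:

```yaml
# Fragment of a pod spec; mirrors the taint gpu=true:NoSchedule.
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"        # quoted so YAML keeps it a string
      effect: NoSchedule
```

A toleration only permits the pod to land on the tainted node; it does not steer it there. To make GPU workloads actually end up on GPU nodes, combine it with the node affinity rule shown earlier.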

Common scheduling failures:

Pod is Pending with "insufficient cpu/memory" — the requested resources exceed what's available on any node. Either reduce requests or add nodes.
Pod is Pending with "no nodes match pod topology spread constraints" — you have topology rules that can't be satisfied.
Pod is Pending with "0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate" — every node is tainted and your pod doesn't have the right tolerations.

Pod Disruption Budgets: Don't Take Everything Down at Once

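During a node drain or cluster upgrade, Kubernetes evicts the pods on each node and lets them be recreated elsewhere. Without a PDB, nothing stops a drain from evicting every replica of a service at the same time. (Rolling updates are a separate mechanism: their pace is set by the Deployment's maxUnavailable and maxSurge settings, not by a PDB.)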

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

This keeps at least 2 api pods available during voluntary disruptions such as node drains and cluster upgrades. An eviction that would violate the budget is refused until enough replacement pods are ready. It does not protect against involuntary disruptions such as a node crashing.
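For rollouts specifically, the pace is controlled on the Deployment itself. A minimal sketch that pairs with the budget above, assuming 3 replicas and a hypothetical image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3                 # assumed; must stay above the PDB's minAvailable
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1       # at most one pod down during a rollout
      maxSurge: 1             # at most one extra pod created during a rollout
  template:
    metadata:
      labels:
        app: api              # same label the PDB selects on
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # hypothetical image
```

Keep replicas higher than minAvailable. If they are equal, no pod can ever be evicted and node drains will block.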

The 7 Causes of CrashLoopBackOff

CrashLoopBackOff means the container starts, crashes, restarts, crashes again, and Kubernetes increases the delay between restarts exponentially (10s, 20s, 40s, up to 5 minutes).

Application error on startup. Missing config, bad environment variable, connection refused to a required service. Check kubectl logs.
OOMKilled. Memory limit too low. Check kubectl describe pod for OOMKilled in Last State.
Liveness probe too aggressive. The app takes 30 seconds to start, the liveness probe starts at 10 seconds. The probe kills the app before it's ready.
Image pull error masquerading as crash. ImagePullBackOff can look like CrashLoopBackOff in the events. Check events, not just status.
Entrypoint/command misconfiguration. The CMD in the Dockerfile expects arguments that aren't passed, or the entrypoint script has a bash error.
Permissions. The container runs as non-root but needs to write to a directory owned by root. Or a mounted secret has wrong permissions.
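Resource quota exhaustion. The namespace has a ResourceQuota and the pod's requests exceed what's left in the quota. Strictly speaking this is a lookalike rather than a crash loop: the API server rejects the pod at creation, so the ReplicaSet keeps trying and failing with FailedCreate events and the replica count stays short.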

The debugging flow:

kubectl describe pod <name>          # Events section
kubectl logs <name>                  # Current logs
kubectl logs <name> --previous       # Previous crash logs
kubectl get events --sort-by='.lastTimestamp'  # Cluster events

--previous is the one people forget. The current container has no logs because it just started. The previous container's logs show why it crashed.

War Story: The 1-Second Readiness Probe

A payment service. 8 replicas behind a Kubernetes Service. Readiness probe checked /health with a 1-second timeout. The health endpoint pinged the database.

Under normal conditions: 200ms response time. Readiness passes. Traffic flows.

Black Friday: database load increases. Health endpoint response time creeps up. 800ms. 900ms. 1.1 seconds. Readiness probe fails. Kubernetes removes the pod from endpoints.

Now 7 pods handle the traffic that 8 were handling. Each remaining pod gets more load. Their health endpoints slow down. More probes fail. 6 pods. 5 pods. Cascading failure.

Within 90 seconds, all 8 pods were removed from the Service. Zero pods receiving traffic. The application was running perfectly — every pod was healthy. But every readiness probe was timing out because the database was slow.

Fixes:

Increased readiness probe timeout to 5 seconds.
Separated the health check from the database check. Readiness verifies the application can accept HTTP connections. A separate monitoring check verifies database connectivity.
Added a circuit breaker — if the database is slow, the app returns degraded responses from cache instead of timing out.
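The first two fixes could translate into a probe like the following. Only the 5-second timeout comes from the incident; the path, port, period, and failure threshold are assumptions for illustration:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready    # assumed endpoint; checks the app only, not the database
    port: 8080             # assumed port
  timeoutSeconds: 5        # fix 1: raised from 1 second
  periodSeconds: 10        # assumed
  failureThreshold: 3      # assumed; one slow response no longer removes the pod
```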

The lesson: your readiness probe is a load balancer decision. If it's too sensitive, it amplifies problems instead of containing them.

Key Takeaways

Kubernetes doesn't know your application is healthy. It knows your probes pass. Design probes that reflect real application health, not infrastructure connectivity.

Set memory limits. Don't set CPU limits unless you have a specific reason. Check throttling metrics before assuming your app is slow.

CrashLoopBackOff is a symptom, not a diagnosis. kubectl logs --previous is your first tool. kubectl describe pod is your second. The answer is almost always in the Events section.

And if your readiness probe checks a downstream dependency, make sure the timeout is generous enough that temporary slowness doesn't cascade into a full outage.

Over to You

What's the sneakiest CrashLoopBackOff cause you've debugged? Have you ever had a readiness probe cascade like the one in this article?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.


This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.

Top comments (1)

E Lion Reigns • Jun 3

Pod running / app dead is such a familiar horror. I have been in the same place with PHP workers that return 200 with empty bodies. What was your first signal that the process was alive but the app was not?