alok shankar

Investigating & Resolving High-CPU Alerts in Kubernetes Pods

Introduction
Recently, I faced a production issue where our observability tooling flagged sustained CPU utilization above 95% on a particular pod in Kubernetes. Investigation revealed that the Java process was hitting the pod's 3-core CPU limit even though the node had spare capacity—pointing to application-level saturation.

Using kubectl and in-container diagnostics, I confirmed the JVM as the source.

In this post, I’ll walk through the step-by-step process: how I diagnosed it, the safe remediation (increasing pod CPU limits and optionally scaling replicas), and the follow-up JVM and query checks to prevent recurrence.

Goals

  1. Confirm the alert (pod and node metrics).
  2. Determine whether node or pod (application) caused high CPU.
  3. Identify what inside the pod is CPU-hot (process / JVM threads / GC / queries).
  4. Apply safe remediation and verify.

Step 1 — Confirm pod metrics (kubectl top)

Command:

kubectl top pod webapp-deployment-rfc4f -n stgapp


Output:

NAME                                          CPU(cores)   MEMORY(bytes)
webapp-deployment-rfc4f                         2863m        2662Mi


Values:

CPU = 2863m (≈ 2.86 cores)
Memory = 2662Mi (≈ 2.6 GiB)
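If the offending pod isn't known up front, the same check can be run namespace-wide and sorted by CPU (a small addition to the runbook; --sort-by needs a reasonably recent kubectl):

kubectl top pod -n stgapp --sort-by=cpu | head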

Step 2 — Check the pod's resource limits (deployment spec)

kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources


Output:

resources:
  limits:
    cpu: "3"
    memory: 3Gi
  requests:
    cpu: "3"

Interpretation: The pod is consuming ~2.86 cores — very close to its configured CPU limit.

CPU request = 3 cores
CPU limit = 3 cores
The pod is capped at 3 cores; usage of 2863m / 3000m ≈ 95.4% of the limit is exactly what tripped the >95% alert.
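As a cross-check against the running pod (rather than the Deployment spec), the effective requests and limits can also be read with kubectl describe:

kubectl describe pod webapp-deployment-rfc4f -n stgapp | grep -A 6 -i limits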

Step 3 — Confirm node health (is it node or pod that’s saturated?)

Command:

kubectl top node


Output:

NAME                                       CPU(cores)  CPU%   MEMORY(bytes)   MEMORY%
ip-xxxxxxxxxx.us-west-2.compute.internal   122m         1%     5575Mi          37%
ip-xxxxxxxxxx.us-west-2.compute.internal   181m         2%     9653Mi          65%
ip-xxxxxxxxxx.us-west-2.compute.internal   86m          1%     7030Mi          47%
ip-xxxxxxxxxx.us-west-2.compute.internal   3045m        39%    7057Mi          47%
... (other nodes show low CPU %)

Interpretation: Even the busiest node sits at only ~39% CPU, so there is no node-level saturation; the bottleneck is the pod's own 3-core CPU limit, i.e. application-level.
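Had the node itself looked saturated, the next step would have been to see what else is scheduled on it; the node name below is a placeholder:

kubectl describe node <node-name> | grep -A 10 "Allocated resources"
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o wide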

Step 4 — Check process level inside the pod

We exec'd into the pod and listed processes.

Command:

kubectl exec -it webapp-deployment-rfc4f -n stgapp -- ps aux --sort=-%cpu | head -20


Output:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
app_run+      46  225 17.1 8475108 2745168 pts/0 Sl+  04:42 129:33 /usr/lib/jvm/
app_run+       1  0.0  0.0   2664   960 pts/0    Ss   04:42   0:00 /usr/bin/tini
... (other minor processes)


Interpretation:

The Java process (PID 46) is the heavy consumer — observed at ~225% CPU (i.e. ~2.25 cores at that instant).

This continuous high CPU usage from Java explains the pod-level metric.
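To narrow it down from the process to individual threads, a per-thread view from inside the container helps. A sketch, assuming top is present in the image and PID 46 as above:

kubectl exec webapp-deployment-rfc4f -n stgapp -- top -H -b -n 1 -p 46 | head -20

The hottest thread IDs (TIDs) can then be matched against a JVM thread dump; see the sketch further below.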

We also obtained the thread count:

Command:

kubectl exec -it webapp-deployment-rfc4f -n stgapp -- bash -c "ps -eLf | grep java | wc -l"


Output:

180


Interpretation: ~180 Java threads — a large thread count for a Java service, which makes it worth checking what those threads are actually doing.
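A thread dump plus a quick GC check are the usual next steps here. This is a sketch, assuming the image ships the JDK tools (jcmd, jstat); otherwise an ephemeral debug container would be needed:

# Thread dump from the JVM (PID 46); hot TIDs from top correspond to the hex nid=0x... fields
kubectl exec webapp-deployment-rfc4f -n stgapp -- jcmd 46 Thread.print > threads.txt

# Convert a hot thread's TID to hex to locate it in the dump (example TID shown)
printf '%x\n' 1234

# GC activity: 5 samples, 1s apart; consistently high GC time would shift suspicion from app code to memory pressure
kubectl exec webapp-deployment-rfc4f -n stgapp -- jstat -gcutil 46 1000 5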

Root cause:

  1. The pod was configured with a CPU limit of 3 cores.
  2. The Java search process consistently consumed most of that (observed 225%–293% at different times).
  3. Node had spare capacity → this was application-level saturation (the application needed more CPU than allotted).
  4. No evidence of node-level resource pressure, and no sign that cgroup throttling was preventing the pod from running; the pod simply used its full quota (a quick way to double-check throttling is sketched below).
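For completeness, the CFS throttling counters can be read from inside the container. The exact path depends on the node's cgroup version, so both common variants are shown as a sketch:

# cgroup v2: check nr_throttled / throttled_usec
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu.stat
# cgroup v1: same counters under the cpu controller
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu/cpu.stat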

Solution implemented:

Below are two safe approaches:

Option A — Increase pod CPU limit (vertical fix)

If the application legitimately needs more CPU (sustained), increase limits.cpu. Example: raise limit from 3 → 4 or 6 cores.

Command to update the deployment (for Deployment-managed pods this performs a rolling update, so there is no hard downtime):

Command:

kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3


Or edit YAML:

kubectl edit deployment/webapp-deployment-rfc4f -n stgapp 
# update resources: limits.cpu to "4"

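Because the resource change triggers a rolling update, it's worth waiting for the rollout to finish before re-checking metrics:

kubectl rollout status deployment/webapp-deployment-rfc4f -n stgapp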

Verify:

kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources
# and
kubectl top pod webapp-deployment-rfc4f -n stgapp 


Expected result:

The pod gets a larger CPU quota. If the load stays the same, CPU% relative to the limit drops (≈ 2863m / 4000m ≈ 72% with a 4-core limit) and the alert recovers.

Option B — Scale replicas (horizontal fix)

If requests can be load-balanced across replicas, scale out to reduce per-pod load:

Command:

kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2


Verify:

Command:

kubectl get pods -n stgapp  -l app=webapp-deployment-rfc4f -o wide
kubectl top pods -n stgapp 


Expected result:

Per-pod CPU load drops (if incoming workload is split across replicas). If the load fluctuates, the same scale-out can also be automated with a HorizontalPodAutoscaler, as sketched below.
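A minimal autoscaling sketch; the 80% target and maximum of 4 replicas are illustrative values, and the metrics server must be installed:

kubectl autoscale deployment webapp-deployment-rfc4f -n stgapp --cpu-percent=80 --min=2 --max=4

Note that the HPA percentage is measured against the CPU request, not the limit.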

In our case we applied both fixes: the CPU limit was raised to 4 cores and the Deployment was scaled to 2 replicas.

Increase CPU limit:

Command:

kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3


Scale replicas:

kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2


Verify changes:

kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | sed -n '/resources:/,+6p'
kubectl get pods -n stgapp -l app=webapp-deployment-rfc4f -o wide
kubectl top pod -n stgapp 


Post-fix verification:

Deployment resources:

kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources


Output:

resources:
  limits:
    cpu: "4"
    memory: 3Gi
  requests:
    cpu: "3"


Summary

  1. The alert triggered because pod CPU usage (≈ 2.86 cores) sat above 95% of the 3-core limit, sustained.
  2. Investigation steps: kubectl top pod, kubectl top node, ps inside pod, check deployment resources.
  3. Root cause: Java search process saturating the pod CPU (application-level).
  4. Remediation: Increase CPU limit (vertical), or scale replicas (horizontal), and investigate hot threads/slow queries for permanent fixes.
