<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: IT Defined</title>
    <description>The latest articles on DEV Community by IT Defined (@it_defined_9fa44164c67442).</description>
    <link>https://dev.to/it_defined_9fa44164c67442</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3905931%2F084c6ea8-3136-4128-bc2e-66f4cf4503f2.png</url>
      <title>DEV Community: IT Defined</title>
      <link>https://dev.to/it_defined_9fa44164c67442</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/it_defined_9fa44164c67442"/>
    <language>en</language>
    <item>
      <title>Kubernetes Troubleshooting</title>
      <dc:creator>IT Defined</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:43:31 +0000</pubDate>
      <link>https://dev.to/it_defined_9fa44164c67442/kubernetes-troubleshooting-2l9j</link>
      <guid>https://dev.to/it_defined_9fa44164c67442/kubernetes-troubleshooting-2l9j</guid>
      <description>&lt;h2&gt;
  
  
  Why this exists
&lt;/h2&gt;

&lt;p&gt;I've been running K8s troubleshooting workshops for two years. We have a 200-student program at IT Defined where we throw broken clusters at people. Patterns emerged.&lt;/p&gt;

&lt;p&gt;Most failures aren't novel. The same 25-30 failure modes account for 90% of real-world K8s incidents. If you can confidently debug these, you'll handle most production incidents.&lt;/p&gt;

&lt;p&gt;Here are the 10 most critical scenarios; the full list of 26 is in the post linked at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. CrashLoopBackOff
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Pod restart count climbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod POD_NAME
kubectl logs POD_NAME &lt;span class="nt"&gt;--previous&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; App crashes on startup (config error, missing env var, can't connect to DB), liveness probe too aggressive, command/args misconfigured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Read the previous container's logs; the reason is usually right there. If the logs are empty, the container died before it could log anything, so check the entrypoint, command, and args.&lt;/p&gt;
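
&lt;p&gt;A minimal sketch of that check, assuming &lt;code&gt;POD_NAME&lt;/code&gt; is your pod and the app runs in the first container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# dump the configured entrypoint and args
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'

# exit code and reason from the last crashed container
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;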

&lt;h2&gt;
  
  
  2. ImagePullBackOff or ErrImagePull
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt; &lt;code&gt;kubectl describe pod&lt;/code&gt;, look at events at the bottom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Image name typo, image doesn't exist, registry credentials missing, wrong region (ECR is regional), node IAM role can't pull from ECR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Run &lt;code&gt;docker pull&lt;/code&gt; manually from a workstation with valid registry credentials. If it works there but the node still can't pull, it's a node permission or credentials issue.&lt;/p&gt;
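
&lt;p&gt;A rough sequence for isolating where the pull fails; the ECR account ID, region, and image below are placeholders for whatever your cluster actually uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# the exact image reference the kubelet is trying to pull
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].image}'

# recent pull errors
kubectl get events --field-selector involvedObject.name=POD_NAME --sort-by='.lastTimestamp'

# if the image lives in ECR, confirm you can authenticate and pull it yourself
aws ecr get-login-password --region us-east-1 | docker login --username AWS \
  --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;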

&lt;h2&gt;
  
  
  3. Pod stuck Pending
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt; &lt;code&gt;kubectl describe pod&lt;/code&gt;. Look for "0/3 nodes available: insufficient cpu" or "didn't match node selector."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Insufficient capacity, resource requests too high, taints/tolerations mismatch, PVC not bound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Check &lt;code&gt;kubectl describe nodes&lt;/code&gt; for available resources. If every node is maxed out, add capacity: scale the node group or let the cluster autoscaler handle it.&lt;/p&gt;
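
&lt;p&gt;Something like this shows whether it's a capacity problem or a scheduling constraint (&lt;code&gt;POD_NAME&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# the scheduler's own explanation is in the pod events
kubectl describe pod POD_NAME

# how much is already requested on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# taints that might be keeping the pod off a node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;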

&lt;h2&gt;
  
  
  4. OOMKilled
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt; &lt;code&gt;kubectl describe pod&lt;/code&gt; shows "Last State: Terminated, Reason: OOMKilled."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Container exceeded memory limit, JVM not configured for container limits, memory leak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Increase limits if the workload genuinely needs more. For Java apps, set &lt;code&gt;-XX:MaxRAMPercentage&lt;/code&gt; so the heap is sized from the container's memory limit rather than the node's total memory.&lt;/p&gt;
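
&lt;p&gt;A minimal way to confirm the kill and compare usage against the limit; the last kubectl command needs metrics-server, and the JVM flag shown is one common setting, not a universal value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# confirm the termination reason and the configured limits
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].resources}'

# live usage vs. that limit
kubectl top pod POD_NAME --containers

# for Java, an env var like this keeps the heap inside the container limit
# JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;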

&lt;h2&gt;
  
  
  5. Service unreachable
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get endpoints SVC_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; No endpoints (selector doesn't match pod labels), pod not listening on expected port, NetworkPolicy blocking traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; 99% of the time it's a label selector mismatch: if the endpoints list is empty, compare the Service's selector to the Pod's labels and make them match.&lt;/p&gt;
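
&lt;p&gt;A sketch of that selector check; &lt;code&gt;SVC_NAME&lt;/code&gt;, &lt;code&gt;POD_NAME&lt;/code&gt;, and port 8080 are stand-ins for your own names, and the exec only works if the image ships wget:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# what the Service selects vs. how the pods are labeled
kubectl get svc SVC_NAME -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels

# confirm the container is actually listening on the targetPort
kubectl exec POD_NAME -- wget -qO- http://localhost:8080/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;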

&lt;h2&gt;
  
  
  6. DNS resolution failing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt; &lt;code&gt;kubectl exec&lt;/code&gt; into pod, run nslookup. Check CoreDNS pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; CoreDNS pods crashed, NetworkPolicy blocking DNS, /etc/resolv.conf misconfigured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Restart CoreDNS if it's misbehaving. On EKS, the default CoreDNS replica count and resources are sometimes too low for busy clusters, so scale the deployment up.&lt;/p&gt;
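
&lt;p&gt;One way to test it end to end; the busybox tag and the replica count below are illustrative, not EKS-recommended values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# resolve a well-known in-cluster name from a throwaway pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default

# CoreDNS health and recent logs
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# scale up if the cluster has outgrown the default replica count
kubectl -n kube-system scale deployment coredns --replicas=4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;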

&lt;h2&gt;
  
  
  7. Ingress 502 Bad Gateway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Backend pod down, target group health check failing, port mismatch, slow startup so ALB marks unhealthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Check target group health in the AWS console. If targets are unhealthy, fix the readiness probe (or the health check path and port) so pods only receive traffic once they can actually serve it.&lt;/p&gt;
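
&lt;p&gt;Assuming the AWS Load Balancer Controller fronting an ALB, something along these lines narrows it down (the Ingress name and target group ARN are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# does the Ingress point at a Service that actually has endpoints?
kubectl describe ingress INGRESS_NAME
kubectl get endpoints SVC_NAME

# target health straight from the ALB side
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;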

&lt;h2&gt;
  
  
  8. PVC stuck Pending
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; No StorageClass set, EBS CSI driver not installed, missing IAM permissions for the driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix on EKS:&lt;/strong&gt; Install the EBS CSI driver as an EKS add-on. Its service account needs the right IAM role via IRSA.&lt;/p&gt;
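
&lt;p&gt;On EKS the check-and-install flow looks roughly like this; the cluster name and IAM role ARN are placeholders, and the role itself has to exist via IRSA before the add-on will work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# why is the claim stuck, and is there a StorageClass at all?
kubectl describe pvc PVC_NAME
kubectl get storageclass

# is the EBS CSI driver running?
kubectl -n kube-system get pods | grep ebs-csi

# install it as an EKS add-on with an IRSA-backed role
aws eks create-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::123456789012:role/AmazonEKS_EBS_CSI_DriverRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;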

&lt;h2&gt;
  
  
  9. Node Not Ready
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Kubelet crashed, container runtime issue, disk pressure, network plugin failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; SSH to the node (or use SSM Session Manager). Check &lt;code&gt;journalctl -u kubelet&lt;/code&gt;. Often it's a full disk from log accumulation.&lt;/p&gt;
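
&lt;p&gt;Roughly, with &lt;code&gt;NODE_NAME&lt;/code&gt; as a placeholder (the second half runs on the node itself over SSH or an SSM session):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# which condition is failing and what the kubelet last reported
kubectl describe node NODE_NAME

# on the node itself
journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50
df -h          # disk pressure is very often just a full disk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;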

&lt;h2&gt;
  
  
  10. HPA not scaling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Likely causes:&lt;/strong&gt; Metrics-server not installed, HPA targeting CPU but pod has no CPU requests, max replicas reached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;kubectl get hpa&lt;/code&gt;. If &lt;code&gt;&amp;lt;unknown&amp;gt;&lt;/code&gt; appears in the targets column, metrics aren't reaching the HPA: either metrics-server is broken or the pods have no CPU requests to compute a percentage against.&lt;/p&gt;
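
&lt;p&gt;A quick way to see which of those it is; &lt;code&gt;DEPLOY_NAME&lt;/code&gt; stands in for the scaled workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get hpa
kubectl top pods                                # fails if metrics-server is missing or broken
kubectl -n kube-system get deployment metrics-server

# percentage-based CPU targets need CPU requests on the pod template
kubectl get deployment DEPLOY_NAME -o jsonpath='{.spec.template.spec.containers[0].resources.requests}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;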

&lt;h2&gt;
  
  
  How to use this playbook
&lt;/h2&gt;

&lt;p&gt;When you hit a real incident, search for keywords from the symptom. Most day-to-day stuff is covered.&lt;/p&gt;

&lt;p&gt;If you want to actually practice these in a safe environment, our K8s troubleshooting labs at IT Defined are exactly this: broken clusters with planted issues that you fix under time pressure.&lt;/p&gt;

&lt;p&gt;Full 26 scenarios — including ConfigMap updates, Secret rotation, NetworkPolicy issues, PDB blocks, autoscaler problems, kube-proxy/CNI issues, Job failures, IRSA problems, webhook admission controllers, liveness probes, PV cleanup, and cluster upgrades — on &lt;a href="https://itdefined.org/blogs" rel="noopener noreferrer"&gt;itdefined.org&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
