Vijaya Bollu

# I Built an AI Kubernetes Pod Debugger — Diagnoses CrashLoopBackOff, OOMKilled, and More in Seconds

## Why I Built This

K8s error messages are designed for cluster operators, not for the developer who just shipped a feature and now has a pod stuck in CrashLoopBackOff at 11pm. `kubectl get pods` tells you something is broken. It doesn't tell you why, and it definitely doesn't tell you what to do about it. I kept doing the same 4-command loop — get pods, describe pod, logs, Google the error — and thought there had to be a better way.

So I automated the loop and added an AI in the middle.


## The 6 Failure Types It Handles

  1. ImagePullBackOff — wrong image name, missing credentials, private registry issues
  2. CrashLoopBackOff — app crash on startup, missing env vars, bad config
  3. OOMKilled — container exceeding memory limits
  4. Pending — insufficient cluster resources, node selector issues
  5. Failed — job completion errors, config map missing
  6. Init Container failures — init container exits non-zero
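At its core, matching these six types amounts to a lookup from the status kubectl reports to a failure category. A minimal sketch of that classification step — the table and names are hypothetical, not the tool's actual code:

```python
# Hypothetical lookup from the status kubectl reports to a failure category.
# The real tool's table may differ; this just illustrates the classification step.
FAILURE_TYPES = {
    "ImagePullBackOff": "image pull failure",
    "ErrImagePull": "image pull failure",
    "CrashLoopBackOff": "startup crash loop",
    "OOMKilled": "memory limit exceeded",
    "Pending": "unschedulable",
    "Failed": "job or config failure",
    "Init:CrashLoopBackOff": "init container failure",
}

def classify(status: str) -> str:
    """Map a pod/container status string to a failure category."""
    return FAILURE_TYPES.get(status, "unknown")
```

Anything outside the table falls through to "unknown" and still gets handed to the AI with raw logs and events.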

## How It Works

The tool runs three kubectl commands in sequence, then hands everything to Ollama.

First, `kubectl get pods -o json` scans the namespace and flags any pod that isn't Running, isn't ready, or has restarts > 0. For each unhealthy pod, it grabs the last 50 lines of logs and the Events section from `kubectl describe`. Both get truncated before hitting the AI — logs capped at 2000 chars, events at 1000 — to stay within a reasonable context window.
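The scan step can be sketched roughly like this — a minimal, hypothetical version of the detection logic (my own naming, not the repo's), assuming the JSON shape `kubectl get pods -o json` returns:

```python
import json
import subprocess
from typing import Dict, List

def summarize_pod(item: Dict) -> Dict:
    """Reduce one pod object from `kubectl get pods -o json` to the fields we need."""
    statuses = item.get("status", {}).get("containerStatuses", [])
    return {
        "name": item["metadata"]["name"],
        "status": item.get("status", {}).get("phase", "Unknown"),
        "ready": bool(statuses) and all(c.get("ready") for c in statuses),
        "restarts": sum(c.get("restartCount", 0) for c in statuses),
    }

def is_unhealthy(summary: Dict) -> bool:
    """Flag anything that isn't Running, isn't ready, or has restarted."""
    return (summary["status"] != "Running"
            or not summary["ready"]
            or summary["restarts"] > 0)

def find_unhealthy_pods(namespace: str = "default") -> List[Dict]:
    """Scan a namespace and return summaries of every flagged pod."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out).get("items", [])
    return [s for s in map(summarize_pod, pods) if is_unhealthy(s)]
```

Keeping the filter (`is_unhealthy`) separate from the kubectl call makes the logic testable without a live cluster.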

Then everything goes into a structured prompt:

```python
def analyze_pod_failure(self, pod: Dict, logs: str, events: str) -> str:
    logs_sample = logs[:2000] if logs else "No logs available"
    events_sample = events[:1000] if events else "No events available"

    prompt = f"""You are a Kubernetes expert helping debug pod failures.

Pod Information:
- Name: {pod['name']}
- Status: {pod['status']}
- Ready: {pod['ready']}
- Restart Count: {pod['restarts']}

Recent Logs (last 50 lines):
{logs_sample}

Recent Events:
{events_sample}

Provide a diagnosis in this format:

**ROOT CAUSE:**
[Explain in 1-2 sentences what's causing the pod to fail, using simple terms]

**WHY THIS HAPPENS:**
[Explain why this error occurs, use an analogy if helpful]

**HOW TO FIX:**
[Provide specific kubectl commands or YAML changes to fix the issue]
"""
    response = self.ask_ollama(prompt)
    return self._remove_repeated_suffix(response)
```

The `_remove_repeated_suffix` call at the end strips duplicated content that Llama 3.2 sometimes generates — more on that below.


## Demo: CrashLoopBackOff Diagnosis

Input — what the cluster shows:

```
$ kubectl get pods
NAME          READY   STATUS             RESTARTS   AGE
broken-pod    0/1     CrashLoopBackOff   4          3m
```

Logs show: `panic: runtime error: invalid memory address or nil pointer dereference`

Output from the AI:

```
🔍 KUBERNETES POD DEBUG REPORT
Pod: broken-pod | Status: CrashLoopBackOff | Restarts: 4

🤖 AI DIAGNOSIS:

**ROOT CAUSE:**
The application is crashing immediately on startup due to a nil pointer
dereference — it's trying to use a variable that was never initialized,
likely a missing environment variable or config value.

**WHY THIS HAPPENS:**
Think of it like a recipe that calls for eggs, but your fridge is empty.
The app tries to read a config value that doesn't exist, gets back nothing,
and then crashes when it tries to use it. Kubernetes keeps restarting it,
hoping it'll work — it won't until the missing config is there.

**HOW TO FIX:**
1. Check what env vars the container expects:
   kubectl describe pod broken-pod | grep -A 10 'Environment'
2. Add the missing values to your deployment YAML under 'env:'
3. Or create a ConfigMap and reference it:
   kubectl create configmap app-config --from-literal=KEY=value
4. Reapply: kubectl apply -f your-deployment.yaml

💡 SUGGESTED NEXT STEP:
   kubectl logs broken-pod --previous
```

That diagnosis took 12 seconds. The manual version of this took me 25 minutes the first time I hit a nil pointer crash in K8s.


## What Surprised Me

**`kubectl logs` silently fails on crashed containers.** If a container has already exited, the default `kubectl logs` command returns nothing — no error, just empty output. The fix is the `--previous` flag, which fetches logs from the last terminated container. I found this out the hard way when the AI kept saying "No logs available" for CrashLoopBackOff pods. The code now tries normal logs first, then falls back automatically:

```python
except subprocess.CalledProcessError:
    # Try to get previous logs if container already crashed
    result = subprocess.run(
        ["kubectl", "logs", pod_name, "-n", namespace,
         "--previous", f"--tail={tail}"],
        ...
    )
```

**Llama 3.2 sometimes repeats its own output.** Occasionally the model generates a full response and then starts over from the beginning, appending a duplicate. For a terminal tool this looks terrible. I had to write a suffix deduplication function that checks if the second half of the response is a repeat of any trailing segment of the first half, and strips it. Not something I expected to need when I started this project.
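The idea is simple enough to sketch. This is a hypothetical re-implementation of that dedup step, not the tool's actual function — it exploits the fact that a restarted duplicate can only live in the second half of the string:

```python
def remove_repeated_suffix(text: str, min_repeat: int = 80) -> str:
    """Drop a trailing block that restarts the response from the beginning.

    Hypothetical sketch: walk candidate split points in the second half of
    the string and cut the tail if it's a prefix of the whole response.
    min_repeat guards against chopping on short accidental matches.
    """
    n = len(text)
    for split in range(n // 2, n - min_repeat):
        tail = text[split:]
        if text.startswith(tail):
            return text[:split].rstrip()
    return text
```

This handles both a full restart (response doubled) and a partial one (the model got cut off mid-duplicate), at the cost of a worst-case quadratic scan — fine for responses a few KB long.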

**Minikube hangs silently when the base image is cached.** On Windows with Docker Desktop, `minikube start` can freeze indefinitely trying to pull an image that's already on disk. The fix — `minikube start --base-image=gcr.io/k8s-minikube/kicbase:v0.0.49` — forces it to use the local cache. This isn't documented prominently and cost me an hour of debugging a debugging tool.


## Try It

GitHub: https://github.com/ThinkWithOps
Demo video: https://youtu.be/LFF-987-uhA

```shell
# Prerequisites: Minikube running, Ollama + llama3.2 pulled
git clone https://github.com/ThinkWithOps/ai-devops-projects
cd ai-devops-projects/02-ai-k8s-debugger
pip install -r requirements.txt

# Deploy a broken pod to test with
kubectl apply -f demo/broken-pod.yaml

# Run the debugger
python src/k8s_debugger.py

# Or target a specific pod
python src/k8s_debugger.py --pod broken-pod
```

This is Project 2 in my AI+DevOps series — all tools run locally with Ollama, zero cloud costs. Project 1 was an AI Docker vulnerability scanner; Project 3 is an AI AWS Cost Detective. Links in my profile.


What K8s error do you dread seeing the most? For me it's still Pending with node affinity issues — the AI actually handles that one better than I expected. Drop yours in the comments.
