Part 12: Sherlock Holmes in the Cluster: A Practical Guide to Debugging

So far in our journey, we have built a robust, stateful, and well-behaved application. But even in the most well-managed cluster, things go wrong. An image tag has a typo. A configuration change breaks the application's startup logic. A pod runs out of memory.

Being a successful Kubernetes practitioner isn't about avoiding errors; it's about being able to efficiently diagnose and fix them. It's time to put on our detective hats and learn how to investigate when things go awry.

Your Detective Toolkit

When a Pod is misbehaving, you have three primary kubectl commands at your disposal. Knowing which one to use is the key to a speedy investigation.

  1. kubectl describe pod <pod-name>

    • The Case File: This is the most important command to start with. It gives you the full "biography" of a Pod, including its configuration, status, and IP address. Crucially, at the very bottom, it has an Events section. These events are the log of what Kubernetes itself has tried to do with your Pod. It's the first place to look for infrastructure-level issues.
  2. kubectl logs <pod-name>

    • The Witness Testimony: This command shows the standard output (stdout) and standard error (stderr) from the container running inside your Pod. It tells you what the application is saying. If the Pod is running but the app is throwing errors, this is where you'll find them.
  3. kubectl exec

    • Going Undercover: This command lets you open a shell directly inside a running container. It's the ultimate tool for hands-on investigation. You can check for configuration files, test network connectivity from within the Pod, or run diagnostic tools.
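
For example, here is a minimal sketch of using exec for a quick look around. The pod name is a placeholder, and it assumes the image ships a shell and basic tools (true for alpine-based images like the one we use below):

# Open an interactive shell inside the Pod's container (type `exit` to leave)
kubectl exec -it <pod-name> -- sh

# Or run a single diagnostic command without an interactive session
kubectl exec <pod-name> -- cat /etc/resolv.conf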

The Investigation: A Case of a Broken App

Let's investigate a crime scene. We'll deploy an application that is deliberately broken in two different ways.

Create a file named broken-app.yaml:

# broken-app.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: main-app
        image: nginxx:1.21-alpine # Clue #1: A typo
        command: ["sh", "-c", "echo 'Starting...' && sleep 5 && exit 1"] # Clue #2: A faulty command

Now, apply this broken configuration:

kubectl apply -f broken-app.yaml

Let the investigation begin!

Step 1: Survey the Scene
Check the status of your Pods.

kubectl get pods

You'll immediately see something is wrong.

NAME                          READY   STATUS              RESTARTS   AGE
broken-app-5b5f76f6b4-xyz12   0/1     ImagePullBackOff    0          20s

The status is ImagePullBackOff. This tells us Kubernetes is trying to pull the container image but is failing repeatedly.
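
If you prefer not to scan the whole table, you can pull that waiting reason straight from the Pod's status. This is just an optional shortcut, with the pod name as a placeholder for yours:

# Print the reason the container is stuck waiting (e.g. ImagePullBackOff or ErrImagePull)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'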

Step 2: Examine the Case File (describe)
Let's use describe to find out why. (Remember to use your specific Pod name).

kubectl describe pod broken-app-5b5f76f6b4-xyz12

Scroll down to the Events section at the bottom. You will find the smoking gun.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  1m                 default-scheduler  Successfully assigned default/broken-app...
  Normal   Pulling    25s (x2 over 1m)   kubelet            Pulling image "nginxx:1.21-alpine"
  Warning  Failed     23s (x2 over 1m)   kubelet            Failed to pull image "nginxx:1.21-alpine": rpc error...
  Warning  Failed     23s (x2 over 1m)   kubelet            Error: ErrImagePull

The event log is crystal clear: Failed to pull image "nginxx:1.21-alpine". We have a typo!
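
As an aside, the same events can be queried directly, without scrolling through describe. A rough sketch, with the pod name swapped for your own:

# List only the events involving this Pod, oldest first
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp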

Step 3: Correct Clue #1 and Re-apply
Fix the image name in broken-app.yaml from nginxx to nginx and apply the change.

# In broken-app.yaml
# ...
        image: nginx:1.21-alpine # Corrected
# ...

kubectl apply -f broken-app.yaml
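
If you'd rather patch the live Deployment without editing the file, kubectl set image can do the same thing. A sketch of that alternative, though editing the YAML keeps your manifest as the source of truth:

# Update the main-app container's image in place (equivalent to the YAML edit above)
kubectl set image deployment/broken-app main-app=nginx:1.21-alpine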

Step 4: A New Problem Arises
The old Pod will be terminated, and a new one will be created to replace it.
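
If you want to watch that handover happen live, the watch flag streams status changes as they arrive. This is optional, and your Pod names will differ:

# Stream Pod status updates as the old Pod terminates and the new one starts (Ctrl+C to stop)
kubectl get pods -w

Either way, let's check the status again.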

kubectl get pods

NAME                          READY   STATUS             RESTARTS   AGE
broken-app-7dcfc75c8d-abc45   0/1     CrashLoopBackOff   2          30s

A new error! CrashLoopBackOff means the container is starting, but the application inside is exiting with an error code almost immediately. Kubernetes tries to restart it, it crashes again, and the loop continues.
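
You can see the exit code Kubernetes recorded for the last failed attempt straight from the Pod's status. A hedged one-liner, with the pod name as a placeholder (for our faulty command it should report 1):

# Show the exit code from the container's most recent terminated run
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'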

Step 5: Question the Witness (logs)
The image is fine, so the problem must be inside the container. Let's check the application logs.

kubectl logs broken-app-7dcfc75c8d-abc45

The output is simply: Starting...

This tells us the command we specified is running, but it doesn't explain why the container is crashing: our command exits on its own after the sleep, without printing any error. When a container is stuck in a restart loop, it's also worth checking the logs from the previous attempt.

kubectl logs broken-app-7dcfc75c8d-abc45 --previous

The result is the same. The culprit is the exit 1 in our manifest's command: the container stops with a non-zero exit code, which Kubernetes treats as a crash and keeps restarting.
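
If you want to convince yourself that the exit code comes from the command itself and not from Kubernetes, you can run the same snippet in any local shell. A quick sanity check, nothing cluster-specific:

# Reproduce the container's command outside the cluster
sh -c "echo 'Starting...' && sleep 5 && exit 1"
echo $?   # prints 1 -- the non-zero code Kubernetes records as a crash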

Step 6: Correct Clue #2 and Close the Case
Remove the entire command section from broken-app.yaml to let the nginx image use its default startup command.

# In broken-app.yaml - REMOVE THE FOLLOWING LINES
#
#        command: ["sh", "-c", "echo 'Starting...' && sleep 5 && exit 1"]
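
Before applying, you can optionally sanity-check the edited manifest with a client-side dry run. Purely a precaution; it changes nothing in the cluster:

# Validate the manifest locally without applying it
kubectl apply -f broken-app.yaml --dry-run=client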

Apply the final fix:

kubectl apply -f broken-app.yaml

Check the status one last time:

kubectl get pods

NAME                          READY   STATUS    RESTARTS   AGE
broken-app-6447d96c4d-qrst6   1/1     Running   0          15s

Success! Our Pod is Running. By systematically using describe for cluster-level issues and logs for application-level issues, we solved the case.
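
To close the case properly, you can confirm the rollout finished and even go undercover with exec one last time. A short sketch, with the pod name as a placeholder (nginx -v works here because our image is nginx):

# Wait until the Deployment reports the new Pod as available
kubectl rollout status deployment/broken-app

# Ask the container which nginx it is running, proving the corrected image is in place
kubectl exec <pod-name> -- nginx -v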

What's Next

We now have the fundamental skills to diagnose and fix the most common problems in a Kubernetes cluster.

As our applications have grown more complex, so have our manifests. We now have YAML files for Deployments, Services, ConfigMaps, PVCs, and Ingress rules. Managing all these related files for a single application is becoming cumbersome. What if we want to share our application so someone else can deploy it with one command?

In the next part, we will solve this problem of YAML sprawl by introducing Helm, the package manager for Kubernetes.
