So far in our journey, we have built a robust, stateful, and well-behaved application. But even in the most well-managed cluster, things go wrong. An image tag has a typo. A configuration change breaks the application's startup logic. A pod runs out of memory.
Being a successful Kubernetes practitioner isn't about avoiding errors; it's about being able to efficiently diagnose and fix them. It's time to put on our detective hats and learn how to investigate when things go awry.
## Your Detective Toolkit
When a Pod is misbehaving, you have three primary `kubectl` commands at your disposal. Knowing which one to use is the key to a speedy investigation.

- **`kubectl describe pod <pod-name>` (The Case File):** This is the most important command to start with. It gives you the full "biography" of a Pod, including its configuration, status, and IP address. Crucially, at the very bottom, it has an `Events` section. These events are the log of what Kubernetes itself has tried to do with your Pod. It's the first place to look for infrastructure-level issues.
- **`kubectl logs <pod-name>` (The Witness Testimony):** This command streams the standard output (`stdout`) from the container running inside your Pod. It tells you what the application is saying. If the Pod is running but the app is throwing errors, this is where you'll find them.
- **`kubectl exec` (Going Undercover):** This command lets you open a shell directly inside a running container. It's the ultimate tool for hands-on investigation. You can check for configuration files, test network connectivity from within the Pod, or run diagnostic tools.
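As a quick sketch of "going undercover", here is what an `exec` session might look like. The Pod name is a placeholder, and the inner commands assume an `nginx` Alpine-based container (which ships `sh` and BusyBox `wget`, not `bash`):

```shell
# Open an interactive shell inside the container (placeholder Pod name)
kubectl exec -it my-app-pod -- sh

# Once inside, poke around:
cat /etc/nginx/nginx.conf        # inspect configuration files
wget -qO- http://localhost:80    # test connectivity from inside the Pod
exit

# Or run a single command without an interactive session:
kubectl exec my-app-pod -- ls /usr/share/nginx/html
```

Note the `--` separator: everything after it is passed to the container, not interpreted by `kubectl`.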
## The Investigation: A Case of a Broken App
Let's investigate a crime scene. We'll deploy an application that is deliberately broken in two different ways.
Create a file named `broken-app.yaml`:
```yaml
# broken-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: main-app
        image: nginxx:1.21-alpine # Clue #1: A typo
        command: ["sh", "-c", "echo 'Starting...' && sleep 5 && exit 1"] # Clue #2: A faulty command
```
Now, apply this broken configuration:

```shell
kubectl apply -f broken-app.yaml
```
Let the investigation begin!
### Step 1: Survey the Scene
Check the status of your Pods.

```shell
kubectl get pods
```

You'll immediately see something is wrong.

```
NAME                          READY   STATUS             RESTARTS   AGE
broken-app-5b5f76f6b4-xyz12   0/1     ImagePullBackOff   0          20s
```
The status is `ImagePullBackOff`. This tells us Kubernetes is trying to pull the container image but is failing repeatedly.
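You don't always need the full `describe` output to see why a container is stuck. As a sketch (using the Pod name from the output above), you can query the waiting reason straight from the Pod status, or list just this Pod's events:

```shell
# Print only the reason the container is waiting (e.g. ImagePullBackOff)
kubectl get pod broken-app-5b5f76f6b4-xyz12 \
  -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'

# List the Events for this Pod without the rest of the describe output
kubectl get events --field-selector involvedObject.name=broken-app-5b5f76f6b4-xyz12
```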
### Step 2: Examine the Case File (`describe`)
Let's use `describe` to find out why. (Remember to use your specific Pod name.)

```shell
kubectl describe pod broken-app-5b5f76f6b4-xyz12
```
Scroll down to the `Events` section at the bottom. You will find the smoking gun.
```
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  1m                default-scheduler  Successfully assigned default/broken-app...
  Normal   Pulling    25s (x2 over 1m)  kubelet            Pulling image "nginxx:1.21-alpine"
  Warning  Failed     23s (x2 over 1m)  kubelet            Failed to pull image "nginxx:1.21-alpine": rpc error...
  Warning  Failed     23s (x2 over 1m)  kubelet            Error: ErrImagePull
```
The event log is crystal clear: `Failed to pull image "nginxx:1.21-alpine"`. We have a typo!
### Step 3: Correct Clue #1 and Re-apply
Fix the image name in `broken-app.yaml` from `nginxx` to `nginx` and apply the change.

```yaml
# In broken-app.yaml
# ...
        image: nginx:1.21-alpine # Corrected
# ...
```

```shell
kubectl apply -f broken-app.yaml
```
### Step 4: A New Problem Arises
The old Pod will be terminated, and a new one will be created. Let's check the status again.
```shell
kubectl get pods
```

```
NAME                          READY   STATUS             RESTARTS   AGE
broken-app-7dcfc75c8d-abc45   0/1     CrashLoopBackOff   2          30s
```
A new error! `CrashLoopBackOff` means the container is starting, but the application inside is exiting with an error code almost immediately. Kubernetes tries to restart it, it crashes again, and the loop continues.
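When a Pod is in `CrashLoopBackOff`, the exit code of the last attempt is recorded in the Pod's status. A sketch, using the Pod name from the output above:

```shell
# Show the exit code of the most recent crash (1 for our faulty command)
kubectl get pod broken-app-7dcfc75c8d-abc45 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# describe shows the same information under "Last State: Terminated"
kubectl describe pod broken-app-7dcfc75c8d-abc45 | grep -A 3 'Last State'
```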
### Step 5: Question the Witness (`logs`)
The image is fine, so the problem must be inside the container. Let's check the application logs.
```shell
kubectl logs broken-app-7dcfc75c8d-abc45
```
The output is simply:

```
Starting...
```
This tells us the `command` we specified is running, but it doesn't tell us why it's crashing; the container exits too quickly to produce more output. Let's ask for the logs of the previous attempt.
```shell
kubectl logs broken-app-7dcfc75c8d-abc45 --previous
```
The result is the same. The `exit 1` command in our manifest is causing the container to stop with a non-zero exit code, which Kubernetes interprets as a crash.
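You can reproduce what the container is doing locally, without a cluster, by running the same command shape (shortened sleep for convenience):

```shell
# Same shape as the container's command: print, sleep, exit non-zero.
# The || branch reports the exit code the kubelet would see.
sh -c "echo 'Starting...' && sleep 1 && exit 1" || echo "exit code: $?"
```

Any non-zero exit code from the container's main process triggers exactly this restart-and-backoff loop.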
### Step 6: Correct Clue #2 and Close the Case
Remove the entire `command` section from `broken-app.yaml` to let the `nginx` image use its default startup command.

```yaml
# In broken-app.yaml - REMOVE THE FOLLOWING LINE:
#
#        command: ["sh", "-c", "echo 'Starting...' && sleep 5 && exit 1"]
```
Apply the final fix:

```shell
kubectl apply -f broken-app.yaml
```

Check the status one last time:

```shell
kubectl get pods
```

```
NAME                          READY   STATUS    RESTARTS   AGE
broken-app-6447d96c4d-qrst6   1/1     Running   0          15s
```
Success! Our Pod is `Running`. By systematically using `describe` for cluster-level issues and `logs` for application-level issues, we solved the case.
## What's Next
We now have the fundamental skills to diagnose and fix the most common problems in a Kubernetes cluster.
As our applications have grown more complex, so have our manifests. We now have YAML files for Deployments, Services, ConfigMaps, PVCs, and Ingress rules. Managing all these related files for a single application is becoming cumbersome. What if we want to share our application so someone else can deploy it with one command?
In the next part, we will solve this problem of YAML sprawl by introducing Helm, the package manager for Kubernetes.