Roman Belshevitz for Otomato

Posted on Apr 10, 2022 • Edited on Aug 17, 2022

Kubernetes: Can We Fix It With Insulating Tape? 👷

#devops #kubernetes #troubleshooting

Readers asked me to write "something about Pods", closer to the surface of the sea, and simpler. Well, OK, I tried. Enjoy!

There are a few common incidents that can occur in a Kubernetes deployment or service. Let's discuss how to respond to them. We'll assume that our knowledge base and "toolbox" are not huge.

🚧 1. Pod crashed

Uncover the cause of the crash and take corrective action. You can use the kubectl get pods command to get information about the crashed pod.

CrashLoopBackOff is a common error indicating a pod constantly crashing in an endless loop. This error can be caused by a variety of issues, including:

Insufficient resources: a lack of resources prevents the container from starting
Locked file: a file was already locked by another container
Locked database: the database is being used and locked by other pods
Failed reference: a reference to scripts or binaries that are not present in the container
Setup error: an issue with the init container setup in Kubernetes
Config loading error: a server cannot load the configuration file (check your YAMLs twice!)
Misconfigurations: a general file system misconfiguration
Connection issues: the container cannot resolve or reach a third-party service, for example because of DNS (kube-dns) problems
Deploying failed services: an attempt to deploy services/applications that have already failed (e.g. due to a lack of access to other services)
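Before digging into causes, it helps to know which pods are actually in this state. Here is a minimal sketch, assuming kubectl is already configured for your cluster. The function name crashlooping_pods is my own invention, not a kubectl feature:

```shell
# List pods whose STATUS column shows CrashLoopBackOff.
# Usage: crashlooping_pods [namespace]   (all namespaces when omitted)
# Note: crashlooping_pods is a hypothetical helper name.
crashlooping_pods() {
  if [ -n "${1:-}" ]; then
    kubectl get pods --namespace "$1" --no-headers | grep CrashLoopBackOff
  else
    kubectl get pods --all-namespaces --no-headers | grep CrashLoopBackOff
  fi
}
```

Call it as crashlooping_pods [mynamespace]. If nothing is crash-looping, it prints nothing.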

There are a few less obvious ways to manually troubleshoot the CrashLoopBackOff error.

🔬 Look at the logs of the failed Pod deployment

To look at the relevant logs, use this command:

$ kubectl logs [podname] -p

The -p flag (short for --previous) tells kubectl to retrieve the logs of the previous, failed instance of the container, which lets you see what happened at the application level. For instance, an important file may already be locked by a different container because it's in use.

🔬 Examine logs from preceding containers

If those logs can't pinpoint the problem, make sure you are reading the previous container instance in the right namespace. You can run this command to look at the previous Pod logs:

$ kubectl logs [podname] -n [mynamespace] --previous

You can run this command to retrieve only the last 20 lines of the previous instance's log:

$ kubectl logs [podname] --previous --tail=20

Look through the log to see why the Pod is constantly starting and crashing.
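If you type these commands often, a small wrapper saves keystrokes. This is only a sketch: previous_logs is a hypothetical name, and the defaults (namespace default, 20 lines) are my assumptions:

```shell
# Print the tail of the previous (crashed) container instance's log.
# Usage: previous_logs <pod> [namespace] [lines]
# Note: previous_logs is a hypothetical helper; defaults are assumptions.
previous_logs() {
  kubectl logs "$1" --namespace "${2:-default}" --previous --tail="${3:-20}"
}
```

For a Pod with several containers, kubectl logs also accepts -c [containername] to pick one of them.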

🔬 List the events

If the logs don't tell you anything, look at the cluster events, where Kubernetes records what happened before your Pod crashed. You can run this command:

$ kubectl get events --sort-by=.metadata.creationTimestamp

Add --namespace [mynamespace] as needed. The events will often show you what caused the crash.
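On a busy cluster the event list gets long. One way to narrow it down is a field selector, sketched below. The helper name pod_warnings is hypothetical; the selector fields involvedObject.name and type are standard event fields:

```shell
# Show only Warning events that refer to one pod, oldest first.
# Usage: pod_warnings <pod> [namespace]
# Note: pod_warnings is a hypothetical helper name.
pod_warnings() {
  kubectl get events --namespace "${2:-default}" \
    --field-selector "involvedObject.name=$1,type=Warning" \
    --sort-by=.metadata.creationTimestamp
}
```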

🔬 Look for "Back-off restarting failed container"

You may be able to find errors that you can't find otherwise by running this command:

$ kubectl describe pod [name]

If you get "Back-off restarting failed container", this means your container terminated shortly after Kubernetes started it, and Kubernetes is now waiting before it tries the next restart.
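The describe output is long, so you may prefer to pull out just the last termination reason and exit code. A rough sketch using kubectl's jsonpath output follows; last_termination is a hypothetical helper name:

```shell
# Print "<container> <reason> <exit code>" for the last terminated instance
# of every container in the pod (tab-separated).
# Usage: last_termination <pod> [namespace]
# Note: last_termination is a hypothetical helper name.
last_termination() {
  kubectl get pod "$1" --namespace "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A reason of OOMKilled (exit code 137) points to the memory limit, while Error with an application-specific exit code points back to the application logs.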

Often, this is the result of resource overload caused by increased activity. It can also happen when a liveness probe, which Kubernetes uses to detect and restart unhealthy containers, fails because the application is slow to start or respond. So you need to manage resources for containers and specify the right requests and limits. You should also consider increasing the probe's initialDelaySeconds so the application has more time to start before the first probe.
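To make the probe settings concrete, here is a sketch that writes a small manifest to a file. Everything in it is an assumption for illustration: the Pod name, the image, the /healthz path, port 8080 and the timing values are hypothetical and must be adapted to your application:

```shell
# Write an example Pod manifest with a more patient liveness probe.
# All names and values below are hypothetical placeholders.
cat > liveness-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # hypothetical image
    livenessProbe:
      httpGet:
        path: /healthz                     # assumed health endpoint
        port: 8080
      initialDelaySeconds: 30              # wait 30s before the first probe
      periodSeconds: 10                    # then probe every 10s
      failureThreshold: 3                  # restart after 3 failed probes
EOF
# Apply it once the values fit your application:
#   kubectl apply -f liveness-demo.yaml
```

The idea is simply to give a slow-starting application enough time before Kubernetes begins to judge it.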

🔬 Increase memory resources

Finally, you may be experiencing CrashLoopBackOff errors due to insufficient memory. You can raise the memory limit by changing resources.limits.memory in the container spec of the Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]

Here we're limiting the containerized stress tool (the polinux/stress image by Przemyslaw Ozgo). What an irony! 🙃

🚧 2. Cluster is unhealthy or overloaded

Take action to relieve the pressure. You can use console tools, metrics, or the Lens GUI to get information about the CPU and memory usage of the cluster. See my article about resource management.
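From the console, kubectl top is the quickest look at usage. This sketch assumes the metrics-server add-on is installed in the cluster (kubectl top does not work without it); top_consumers is a hypothetical helper name:

```shell
# Show node usage, then pods across all namespaces sorted by memory.
# Requires metrics-server in the cluster.
# Note: top_consumers is a hypothetical helper name.
top_consumers() {
  kubectl top nodes
  kubectl top pods --all-namespaces --sort-by=memory
}
```

Use --sort-by=cpu instead if CPU is the suspected bottleneck.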

🚧 3. Services are unavailable

Investigate the cause of the outage and take corrective action. You can use the kubectl get svc command to get information about the unavailable service.

A common problem with a malfunctioning service is that of missing or mismatched endpoints. For example, it’s important to ensure that a service connects to all the appropriate Pods: the service’s selector must match the Pods’ labels, and the service’s targetPort must match the containerPort the Pods actually listen on. Some other troubleshooting practices for services include:

Verifying that the service works by DNS name
Verifying that it works by IP Address
Ensuring that kube-proxy is functioning as intended
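The first checks can be sketched with a couple of kubectl calls. Assumptions here: the helper names are hypothetical, the cluster uses the default cluster.local DNS domain, and pulling the busybox:1.28 image is allowed in your cluster:

```shell
# Print the service's selector, then the endpoints it currently resolves to.
# An empty endpoints list usually means the selector matches no ready Pods.
# Usage: service_backends <service> [namespace]
service_backends() {
  kubectl get service "$1" --namespace "${2:-default}" \
    -o jsonpath='{.spec.selector}{"\n"}'
  kubectl get endpoints "$1" --namespace "${2:-default}"
}

# Resolve the service's DNS name from a throwaway Pod inside the cluster.
# Usage: service_dns_check <service> [namespace]
service_dns_check() {
  kubectl run dns-check --namespace "${2:-default}" --rm -i --restart=Never \
    --image=busybox:1.28 -- nslookup "$1.${2:-default}.svc.cluster.local"
}
```

If the selector looks right but the endpoints are empty, compare it with the output of kubectl get pods --show-labels.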

🚧 4. Pod is stuck in the Pending state

You may want to restart your pods. Some possible reasons are:

Resource requests and limits aren’t stated, or the software behaves in an unforeseen way. Check your resource limits and auto-scaling rules.
A pod is stuck in a terminating state
Mistaken deployments
Requesting persistent volumes that are not available
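To see which Pods are Pending and why the scheduler could not place them, something like the following may help. It is a sketch with hypothetical helper names; status.phase and reason are standard field selectors:

```shell
# List every Pod that is still in the Pending phase.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Show the scheduler's FailedScheduling events for one Pod.
# Usage: scheduling_failures <pod> [namespace]
scheduling_failures() {
  kubectl get events --namespace "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}
```

The event message typically names the obstacle, such as insufficient CPU or memory on the nodes or an unbound PersistentVolumeClaim.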

Determine the cause of the problem and take corrective action. There are at least four ways to restart pods.
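The restart methods are not listed here, so the sketch below shows three common ones; they are not necessarily the ones meant above. The helper names are hypothetical, and the examples assume the Pods are managed by a Deployment:

```shell
# 1. Rolling restart of all Pods in a Deployment (kubectl 1.15 or newer).
# Usage: restart_deployment <deployment> [namespace]
restart_deployment() {
  kubectl rollout restart "deployment/$1" --namespace "${2:-default}"
}

# 2. Delete a single Pod; its controller creates a replacement.
# Usage: recreate_pod <pod> [namespace]
recreate_pod() {
  kubectl delete pod "$1" --namespace "${2:-default}"
}

# 3. Scale to zero and back. This causes downtime while no Pods exist.
# Usage: bounce_deployment <deployment> <replicas> [namespace]
bounce_deployment() {
  kubectl scale "deployment/$1" --replicas=0 --namespace "${3:-default}"
  kubectl scale "deployment/$1" --replicas="$2" --namespace "${3:-default}"
}
```

The rolling restart is usually the gentlest option, because old Pods are replaced gradually.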

As you can see, the kubectl command will help you a lot. Think of it as what you have instead of insulating tape!

Thanks to Ben Hirschberg from 🐦ArmoSec and Patrick Londa from 🐦BlinkOps for the inspiration. Healthy clusters to you!

DEV Community