DEV Community

Barbara


Kubernetes Troubleshooting

Kubernetes can handle large and diverse workloads.
To keep track of all these processes, monitoring is essential.

Monitoring

To monitor the application, you need to collect metrics such as CPU, memory, disk usage and network bandwidth on your nodes.

Because Kubernetes is a distributed system, it needs to be monitored and traced cluster-wide.

You can use external tools like Prometheus to collect the metrics and Grafana to visualize them. But to get started, I recommend the Kubernetes dashboard, as it is very easy to set up and gives you a default user interface with the most important metrics.
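For a quick look at the same metrics from the command line, kubectl top can help. This is a minimal sketch, assuming the metrics-server add-on is installed in the cluster. The function name show_usage and the fallback namespace default are hypothetical and only there for illustration.

```shell
# Print current CPU and memory usage for all nodes,
# then for the pods in one namespace.
# Assumption: metrics-server is installed; without it `kubectl top` fails.
show_usage() {
  namespace="${1:-default}"   # hypothetical fallback, pass your own namespace
  kubectl top nodes
  kubectl top pods --namespace "$namespace"
}
```

Calling show_usage kube-system would list node usage first and then the usage of the pods in the kube-system namespace.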

Logging

If you have aggregated logs, you can search them in one place and visualize issues across the cluster.

In Kubernetes, container logs are written to local files on each node, and the kubelet makes them available through the command kubectl logs.
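A few flags of kubectl logs are handy in practice. The sketch below assumes a pod with a named container. The function name recent_logs and the limit of 100 lines are hypothetical choices.

```shell
# Show recent log lines of a container, then the logs of its
# previous instance (useful after a crash or restart).
recent_logs() {
  pod="$1"
  container="$2"
  # Last 100 lines of the currently running container
  kubectl logs "$pod" -c "$container" --tail=100
  # Same for the previous instance; this call reports an error
  # if the container has never been restarted
  kubectl logs "$pod" -c "$container" --previous --tail=100
}
```

To stream new log lines as they arrive, kubectl logs also accepts the -f flag.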

If you want to perform cluster-wide logging, you can use Fluentd to aggregate logs.
Fluentd agents run on each node via a DaemonSet, collect the logs and feed them to an Elasticsearch instance, where they can be visualized.
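To verify that a log agent is actually running on every node, you can compare the DaemonSet status with its pods. This is only a sketch: the label app=fluentd is an assumption and depends on how Fluentd was installed in your cluster, and check_log_agents is a hypothetical helper name.

```shell
# List the log-agent DaemonSets and their pods across all namespaces.
# Assumption: the agents carry the label passed in (default: app=fluentd).
check_log_agents() {
  selector="${1:-app=fluentd}"   # hypothetical label, adjust to your setup
  # DESIRED and READY should match the number of schedulable nodes
  kubectl get daemonsets --all-namespaces --selector "$selector"
  # -o wide adds the node each agent pod runs on
  kubectl get pods --all-namespaces --selector "$selector" -o wide
}
```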

Troubleshooting

Errors in the container

If you are not sure where to start, run

kubectl describe pod your-pod

This will report

  • the overall status of the pod: running, pending or an error state
  • the container configuration
  • the container events
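If you only need the events, without the rest of the description, you can query them directly. A minimal sketch, where pod_events is a hypothetical helper name:

```shell
# Show only the events that belong to one pod, oldest first.
pod_events() {
  pod="$1"
  kubectl get events \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Events are only kept for a limited time, so an empty result for an older pod does not mean that nothing happened.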

If the pod is already running, you can first look at the standard output of the container. One common issue is that the container does not have enough resources allocated.

kubectl logs your-pod -c your-container

You can look for error messages in the logs.
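To check whether missing resources are the cause, it can help to print what each container requested and why it was last terminated. This is a sketch under the assumption that the pod has already been restarted at least once. The helper name pod_resources is hypothetical.

```shell
# Print the resource requests/limits of every container in a pod,
# followed by the reason for each container's last termination
# (for example OOMKilled when the memory limit was exceeded).
pod_resources() {
  pod="$1"
  kubectl get pod "$pod" -o \
    jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
  # The reason is empty for containers that were never terminated
  kubectl get pod "$pod" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'
}
```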

If there are errors inside a container, you can open a shell in the container to see what is going on.

kubectl exec -it your-pod -- /bin/sh

Networking issues

If the container itself looks healthy, the network is the next place where issues can arise.
Check DNS, firewalls and general connectivity.
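Two of these checks can be sketched with kubectl alone. The example assumes that the container image ships nslookup, which minimal images often do not. The helper names check_dns and check_service are hypothetical.

```shell
# Resolve a name from inside a pod to test cluster DNS.
# Assumption: the container image contains nslookup.
check_dns() {
  pod="$1"
  name="${2:-kubernetes.default}"   # the API server's service name
  kubectl exec "$pod" -- nslookup "$name"
}

# Show a service and its endpoints. An empty endpoints list usually
# means the service selector does not match any ready pod.
check_service() {
  service="$1"
  kubectl get service "$service"
  kubectl get endpoints "$service"
}
```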

Security issues

Check your RBAC rules first.
SELinux and AppArmor are also common sources of issues, especially with network-centric applications.
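For RBAC, kubectl auth can-i answers whether a given identity is allowed to perform an action. The sketch below checks a service account. The helper name can_i and the fallback namespace default are hypothetical, and your own user needs permission to impersonate other accounts for --as to work.

```shell
# Ask the API server whether a service account may perform an action.
# Prints "yes" or "no".
can_i() {
  verb="$1"
  resource="$2"
  account="$3"
  namespace="${4:-default}"   # hypothetical fallback namespace
  kubectl auth can-i "$verb" "$resource" \
    --as "system:serviceaccount:$namespace:$account" \
    --namespace "$namespace"
}
```

For example, can_i list pods your-service-account checks whether that service account may list pods in the default namespace.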

If you don't know where to start, you can temporarily disable security in a test environment to narrow down the source of the issue. But be sure to re-enable security afterwards.

Another possible cause, not only for security issues, is a recent update. You can roll back to an earlier version to find out when the issue was introduced.
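For workloads managed by a Deployment, the rollback can be done with kubectl rollout. A minimal sketch, where show_history and roll_back are hypothetical helper names:

```shell
# List the recorded revisions of a deployment.
show_history() {
  kubectl rollout history "deployment/$1"
}

# Roll back to the previous revision, or to a specific one
# if a revision number is given as second argument.
roll_back() {
  deployment="$1"
  revision="$2"
  if [ -n "$revision" ]; then
    kubectl rollout undo "deployment/$deployment" --to-revision="$revision"
  else
    kubectl rollout undo "deployment/$deployment"
  fi
}
```

Stepping back one revision at a time and retesting after each step shows which update introduced the issue.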

Further reading:
Kubernetes dashboard
Prometheus
Fluentd
Troubleshoot a cluster
Troubleshoot applications
Debug Pods
