Docker Container Monitoring and Management Challenges

#docker #monitoring #containermonitoring #containers

Organizations that adopt container orchestration tools for application deployment face a new maintenance challenge. Orchestration tools like Kubernetes or Docker Swarm are designed to decide to which host a container should be deployed and potentially do that on an ongoing basis. Although this functionality is great for helping us make better use of the underlying infrastructure, it creates new challenges in production. While troubleshooting a common first question is “Which specific container is having issues?”. The second question is often “On which host is it running?“. Being able to map container deployments to the underlying container hosts is essential for troubleshooting. But there is more to learn about container monitoring, let us see why infrastructure monitoring is different for containers.

Monitoring a container infrastructure is different from traditional server monitoring. First of all, containers present a new infrastructure layer we simply didn’t have before. Secondly, we have to cope with dynamic placement in one or more clusters possibly running on different cloud services. Finally, containers provide new ways for resource management.

New Infrastructure Layers

Containers add a new layer to the infrastructure and the mapping of containers to servers lets us see where exactly in our infrastructure each container is running. Modern container monitoring tools must thus discover all running containers automatically in order to capture dynamic changes in the deployment and update the container to host mapping in real-time. Thus, the traditional server performance monitoring that is not designed with such requirements in mind is inadequate for monitoring containers.

New Dynamic Deployment and Orchestration

Container orchestration tools like Kubernetes or Docker Swarm are often used to dynamically allocate containers to the most suitable hosts, typically those that have sufficient resources to run them. Containers might thus move from one host to another, especially while scaling the number of containers up or down or during container redeployments. There is no static relation between hosts and services they are running anymore! For troubleshooting, this means one must first figure out which host is running which containers. And, vice versa, when a host exhibits poor performance, it’s highly valuable to be able to isolate which container is the source of the problem and if any containers suffer from the performance issue.

New Resource Management and Metrics

Containers use of resources can be restricted. Using inadequate container resource limits can lead to situations where a container performs poorly simply because it can’t allocate enough resources. At the same time, the cluster host itself might not be fully utilized. How could this problem be discovered and fixed? A good example is monitoring memory fail counters – one of the key container metrics or throttled CPU time. In such situations, monitoring just the overall server performance would not indicate any slowness of containers hitting resource limits. Only monitoring of the actual container metrics for each container helps in this situation. Setting the right resource limits requires detailed knowledge of the resources a container might need under load. A good practice is to monitor container metrics and adjust the resource limits to match the actual needs or scale the number of containers behind a load balancer.

New Log Management Needs

Docker not only changed the deployment of applications, but also the workflow for log management. Instead of writing logs to files, Docker logs are console output streams from containers. Docker Logging Drivers collect and forward logs to their destinations. To match this new logging paradigm, legacy applications need to be updated to write logs to the console instead of writing them to local log files. Some containers start multiple processes, therefore log streams might contain a mix of plain text messages from start scripts and unstructured or structured logs from different containerized applications. The problem is obvious – you can’t just take both log streams (stderr/stdout) from multiple processes and containers, all mixed up, and treat them like a blob, or assume they all use the same log structure and format. You need to be able to tell which log event belongs to what container, what app, parse it correctly, etc.

To do log parsing right, the origin of the container log output needs to be identified. That knowledge can then be used to apply the right parser and add metadata like container name, container image, container ID to each log event. Docker Log Drivers simplified logging a lot, but there are still many Docker logging gotchas. Luckily, modern log shippers integrate with Docker API or Kubernetes API and are able to apply log parsers for different applications.

New Microservice Architecture and Distributed Transaction Tracing

Microservices have been around for over a decade under one name or another. Now often deployed in separate containers it became obvious we needed a way to trace transactions through various microservice layers, from the client all the way down to queues, storage, calls to external services, etc. This created a new interest in Distributed Transaction Tracing that, although not new, has now re-emerged as the third pillar of observability.