
Ankit Anand ✨

Posted on • Originally published at signoz.io

Monitoring Docker Containers Using OpenTelemetry [Full Tutorial]

Monitoring Docker container metrics is essential for understanding the performance and health of your containers. The OpenTelemetry Collector can collect Docker container metrics and send them to a backend of your choice. In this tutorial, you will install an OpenTelemetry Collector to collect Docker container metrics and send them to SigNoz, an OpenTelemetry-native APM, for monitoring and visualization.

If you want to jump straight into implementation, start with the Pre-requisites section.

Containerizing applications with Docker has become a popular way to make application workloads portable. Containers help developers get rid of server-level dependencies and simplify testing and deployment of the applications themselves. With the adoption of cloud-native technologies, the adoption of Docker has naturally grown. This has created the need to monitor Docker containers running on various types of compute infrastructure.

Why monitor Docker container metrics?

Monitoring Docker container metrics can be crucial in various scenarios to avoid performance issues and assist developers in troubleshooting. For example, a container may start consuming an excessive amount of resources (CPU or memory), impacting other containers or the host system.

By monitoring CPU and memory usage, you can detect resource saturation early. This allows you to adjust resource allocation, optimize the application, or scale out your environment before users experience significant slowdowns or outages.

Some of the key factors why monitoring Docker containers is important are as follows:

  • Resource Optimization: It helps in allocating resources efficiently and scaling the containers as per the demand.
  • Performance Management: By understanding the resource utilization and demand, you can tune the performance of applications running inside the containers.
  • Troubleshooting: It enables quick identification and resolution of issues, reducing downtime and improving reliability.
  • Cost Management: In cloud environments, efficient resource usage can lead to significant cost savings.

We can use OpenTelemetry and a backend that supports OpenTelemetry-based data to monitor Docker containers efficiently. OpenTelemetry is quietly becoming the open-source standard for generating and collecting telemetry data.

A Brief Overview of OpenTelemetry

OpenTelemetry is a set of APIs, SDKs, libraries, and integrations aiming to standardize the generation, collection, and management of telemetry data (logs, metrics, and traces). It is backed by the Cloud Native Computing Foundation and is the leading open-source project in the observability domain.

The data you collect with OpenTelemetry is vendor-agnostic and can be exported in many formats. OpenTelemetry provides a tool called OpenTelemetry collector that we will use to collect Docker container metrics.

What is OpenTelemetry Collector?

OpenTelemetry Collector is a stand-alone service provided by OpenTelemetry. It can be used as a telemetry-processing system with a lot of flexible configurations to collect and manage telemetry data.

It can understand different data formats and send data to different backends, making it a versatile tool for building observability solutions.

Read our complete guide on OpenTelemetry Collector

How does OpenTelemetry Collector collect data?

A receiver is how data gets into the OpenTelemetry Collector. Receivers are configured via YAML under the top-level receivers tag. There must be at least one enabled receiver for a configuration to be considered valid.

Here’s an example of an otlp receiver:

receivers:
  otlp:
    protocols:
      grpc:
      http:

An OTLP receiver can receive data via gRPC or HTTP using the OTLP format. There are advanced configurations that you can enable via the YAML file.

Here’s a sample configuration for an otlp receiver.

receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"
        cors:
          allowed_origins:
            - http://test.com
            # Origins can have wildcards with *, use * by itself to match any origin.
            - https://*.example.com
          allowed_headers:
            - Example-Header
          max_age: 7200

You can find more details on advanced configurations here.

After configuring a receiver, you must enable it. Receivers are enabled via pipelines within the service section. A pipeline consists of a set of receivers, processors, and exporters.

The following is an example pipeline configuration:

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      exporters: [otlp, prometheus]
    traces:
      receivers: [otlp, jaeger]
      processors: [batch]
      exporters: [otlp, zipkin]

Pre-requisites

To gather metrics from Docker containers, you first need Docker installed. Once that is done, you can run a few simple containers and gather metrics from them. This section guides you through starting a few sample containers with Docker. You can skip the setup if you already have Docker running on your system and have a few containers started.

The below links can help you with the Docker installation:

Once Docker is installed, start a few containers using the commands below:

docker run -d -p 8080:80 nginx:latest
docker run -d -p 8081:80 httpd:latest
docker run -d -e MYSQL_ROOT_PASSWORD=mysecretpassword -p 3306:3306 mysql:latest

The above commands will start 3 containers on your system to allow us to gather some metrics when we start the OpenTelemetry collector. Next, let us start with the setup of OpenTelemetry Collector. It is assumed that you are setting up the OpenTelemetry collector on the same machine where you are running the Docker containers.
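
Before moving on, it can help to confirm that the sample containers are actually up. The following is a minimal sketch that assumes the docker CLI is on your PATH; the function name show_containers is made up for this example. docker stats reads per-container resource figures from the Docker daemon, so it gives a rough preview of the kind of numbers the collector will report later.

```shell
# Hypothetical helper: list running containers and print a one-off resource snapshot.
show_containers() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker CLI not found on PATH"
    return 1
  fi
  # Name, image, and status of every running container
  docker ps --format '{{.Names}}\t{{.Image}}\t{{.Status}}'
  # CPU, memory, network, and block I/O per container, sampled once
  docker stats --no-stream
}

# Usage:
#   show_containers
```

If one of the three containers is missing from the list, docker ps -a and docker logs with the container name usually show why it exited.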

Setting up OpenTelemetry Collector

The OpenTelemetry Collector offers various deployment options to suit different environments and preferences. It can be deployed using Docker, Kubernetes, Nomad, or directly on Linux systems. You can find all the installation options here.

We are going to discuss the manual installation here and resolve any hiccups that come in the way.

Step 1 - Downloading OpenTelemetry Collector

Download the appropriate binary package for your Linux or macOS distribution from the OpenTelemetry Collector releases page. We are using the latest version available at the time of writing this tutorial.

For macOS (arm64):

curl --proto '=https' --tlsv1.2 -fOL https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.116.0/otelcol-contrib_0.116.0_darwin_arm64.tar.gz
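
The command above downloads the macOS (arm64) build. As a sketch of how the archive name varies by platform, the helper below builds the download URL from the values printed by uname -s and uname -m. The function name otelcol_url is invented for this example, and it assumes the release publishes linux and darwin archives for amd64 and arm64 that follow the same naming pattern as the URL above; check the releases page if the download fails.

```shell
# Version used throughout this tutorial
OTELCOL_VERSION="0.116.0"

# Hypothetical helper: print the otelcol-contrib archive URL for an OS/architecture pair.
# The body runs in a subshell, so its variables and `exit` do not affect the caller.
otelcol_url() (
  # $1 = output of `uname -s`, $2 = output of `uname -m`
  case "$1" in
    Linux)  os="linux" ;;
    Darwin) os="darwin" ;;
    *) echo "unsupported OS: $1" >&2; exit 1 ;;
  esac
  case "$2" in
    x86_64|amd64)  arch="amd64" ;;
    arm64|aarch64) arch="arm64" ;;
    *) echo "unsupported architecture: $2" >&2; exit 1 ;;
  esac
  printf 'https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v%s/otelcol-contrib_%s_%s_%s.tar.gz\n' \
    "$OTELCOL_VERSION" "$OTELCOL_VERSION" "$os" "$arch"
)

# Usage:
#   curl --proto '=https' --tlsv1.2 -fOL "$(otelcol_url "$(uname -s)" "$(uname -m)")"
```

If you download a different archive, adjust the file name in the extraction step accordingly.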

Step 2 - Extracting the package

Create a new directory named otelcol-contrib and then extract the contents of the otelcol-contrib_0.116.0_darwin_arm64.tar.gz archive into this newly created directory with the following command:

mkdir otelcol-contrib && tar xvzf otelcol-contrib_0.116.0_darwin_arm64.tar.gz -C otelcol-contrib

Step 3 - Setting up the configuration file

Create a config.yaml file in the otelcol-contrib folder. This configuration file lets the collector connect to the Docker socket and controls settings such as how often container metrics are collected. The docker_stats receiver communicates directly with the Docker socket, which provides the metrics and other details needed for monitoring.

📝 Note

The configuration file should be created in the same directory where you unpacked the otelcol-contrib binary. If you have installed the binary globally, you can create the file at any path.

receivers:
  otlp:
    protocols:
      grpc:
      http:
  docker_stats:
    endpoint: unix:///var/run/docker.sock
    collection_interval: 30s
    timeout: 10s
    api_version: 1.24
    metrics:
      container.uptime:
        enabled: true
      container.restarts:
        enabled: true
      container.network.io.usage.rx_errors:
        enabled: true
      container.network.io.usage.tx_errors:
        enabled: true
      container.network.io.usage.rx_packets:
        enabled: true
      container.network.io.usage.tx_packets:
        enabled: true
processors:
  batch:
    send_batch_size: 1000
    timeout: 10s
  resourcedetection/env:
    detectors: [env]
    timeout: 2s
    override: false
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["dns", "os"]
  resourcedetection/docker:
    detectors: [env, docker]
    timeout: 2s
    override: false
exporters:
  otlp:
    endpoint: "ingest.{region}.signoz.cloud:443"
    tls:
      insecure: false
    headers:
      signoz-ingestion-key: "{signoz-token}"
  debug:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [otlp, docker_stats]
      processors: [resourcedetection/docker, batch]
      exporters: [otlp]

You need to replace {region} and {signoz-token} in the above file with the region of your choice (for SigNoz Cloud) and the token obtained from SigNoz Cloud → Settings → Ingestion Settings.
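
If you would rather not paste the ingestion key into the file by hand, the placeholders can be filled in from the shell. This is only a sketch: render_config, SIGNOZ_REGION, and SIGNOZ_INGESTION_KEY are names chosen for this example, and it assumes neither value contains the characters |, &, or \, which are special to sed in this position.

```shell
# Hypothetical helper: print a copy of the config with both placeholders filled in.
render_config() {
  # $1 = template file, $2 = region, $3 = ingestion key
  sed -e "s|{region}|$2|g" -e "s|{signoz-token}|$3|g" "$1"
}

# Usage (keeps config.yaml as the template and writes a rendered copy):
#   render_config config.yaml "$SIGNOZ_REGION" "$SIGNOZ_INGESTION_KEY" > config.rendered.yaml
#   chmod 600 config.rendered.yaml   # the rendered file now contains a secret
```

If you use this approach, pass the rendered file to --config when you start the collector.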

Find ingestion settings in SigNoz dashboard

The above configuration also defines resource detection processors that help identify host attributes. The Docker socket path is the same on UNIX-based systems; for other systems, refer to this documentation to learn more.
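
A common hiccup at this point is that the user running the collector cannot access the Docker socket. The sketch below checks the socket path used in config.yaml; check_docker_socket is a made-up helper name. The commented validate line assumes your collector build includes the validate subcommand, which checks a configuration file without starting the pipelines.

```shell
# Hypothetical helper: verify that the Docker socket exists and is accessible.
check_docker_socket() {
  # $1 = socket path; defaults to the path used in config.yaml
  sock="${1:-/var/run/docker.sock}"
  if [ -S "$sock" ] && [ -r "$sock" ] && [ -w "$sock" ]; then
    echo "ok: $sock is a readable, writable socket"
  else
    echo "problem: $sock is missing or not accessible to the current user" >&2
    return 1
  fi
}

# Usage:
#   check_docker_socket
#   ./otelcol-contrib validate --config ./config.yaml
```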

Step 4 - Running the collector service

The archive you extracted includes an otelcol-contrib executable. With the configuration done, you can now run the collector service.

From the otelcol-contrib directory, run the following command:

./otelcol-contrib --config ./config.yaml

To run it in the background, write its output to otelcol-output.log, and save its process ID to a file named otel-pid:

./otelcol-contrib --config ./config.yaml &> otelcol-output.log & echo $! > otel-pid

Step 5 - Debugging the output

The background process writes its logs to otelcol-output.log. You can follow them with:

tail -f -n 50 otelcol-output.log

Here, -n 50 prints the last 50 lines of otelcol-output.log, and -f keeps printing new lines as they are written.
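
The configuration from Step 3 defines a debug exporter, but the metrics pipeline only lists otlp, so the collected metrics themselves never show up in the log. One way to print them is to add debug to the pipeline's exporters. The sketch below does this with a small override file; the file name debug-override.yaml is chosen for this example, and it assumes the collector merges multiple --config files in order, with a list in a later file replacing the same list in an earlier one.

```shell
# Write an override that only changes the exporter list of the metrics pipeline.
cat > debug-override.yaml <<'EOF'
service:
  pipelines:
    metrics:
      exporters: [otlp, debug]
EOF

# Usage (restart the collector with both files):
#   ./otelcol-contrib --config ./config.yaml --config ./debug-override.yaml
```

Drop the second --config flag once you have confirmed that metrics are flowing, since the debug exporter adds noise to the log.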

You can stop the collector service with the following command:

kill "$(< otel-pid)"
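
To check whether the background collector is still alive, you can probe the process ID stored in the otel-pid file. This is a minimal sketch; collector_running is a made-up helper name, and kill -0 only tests whether the process exists without sending it a signal.

```shell
# Hypothetical helper: succeed if the process recorded in the PID file is running.
collector_running() {
  # $1 = PID file; defaults to the otel-pid file written when starting the collector
  pidfile="${1:-otel-pid}"
  [ -s "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Usage:
#   if collector_running; then echo "collector is running"; else echo "collector is not running"; fi
```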

You should start seeing the metrics in the SigNoz Cloud UI in about 30 seconds. You can import this (link to be added) dashboard JSON into your SigNoz environment to monitor your Docker containers.

Monitoring with SigNoz Dashboard

Once the above setup is done, you will be able to access the metrics in the SigNoz dashboard. You can go to the Dashboards tab and try adding a new panel. You can learn how to create dashboards in SigNoz here.

Docker container metrics collected by OpenTelemetry collector

You can easily create charts with query builder in SigNoz. Here are the steps to add a new panel to the dashboard.

Creating a dashboard panel for average memory usage per container

You can build a complete dashboard around various metrics emitted. Here’s a look at a sample dashboard we built out using the metrics collected. You can get started quickly with this dashboard by using the JSON here.

Dashboard for monitoring Docker Container Metrics in SigNoz

You can also create alerts on any metric. Learn how to create alerts here.

Create alerts on any Docker container metric and get notified in a notification channel of your choice

Reference: Docker container metrics and labels collected by OpenTelemetry Collector

| Name | Description | Availability (cgroup v1/v2) | Type |
| --- | --- | --- | --- |
| container.blockio.io_service_bytes_recursive | Number of bytes transferred to/from the disk by the group and descendant groups. | Both | Sum |
| container.cpu.usage.kernelmode | Time spent by tasks of the cgroup in kernel mode (Linux). Time spent by all container processes in kernel mode (Windows). | Both | Sum |
| container.cpu.usage.total | Total consumed CPU time | Both | Sum |
| container.cpu.usage.usermode | Time spent by tasks of the cgroup in user mode (Linux). Time spent by all container processes in user mode (Windows). | Both | Sum |
| container.cpu.utilization | Percentage usage of CPU | Both | Gauge |
| container.memory.file | Total memory used | cgroup v2 | Sum |
| container.memory.percent | Percentage of memory used. | cgroup v1 | Gauge |
| container.memory.total_cache | Total cache memory used by the processes of the cgroup | Both | Sum |
| container.memory.usage.limit | Memory limits for the container (if specified) | Both | Sum |
| container.memory.usage.total | Memory usage of the containers excluding cache | Both | Sum |
| container.network.io.usage.rx_bytes | Bytes received by the container | Both | Sum |
| container.network.io.usage.rx_dropped | Incoming packets dropped by the container | Both | Sum |
| container.network.io.usage.tx_bytes | Bytes transmitted by the container | Both | Sum |
| container.network.io.usage.tx_dropped | Outgoing packets that got dropped | Both | Sum |

Optional Metrics

The following metrics are not emitted by default. Each of them can be enabled by applying the following configuration:

metrics:
  <metric_name>:
    enabled: true
| Name | Description | Availability (cgroup v1/v2) | Type |
| --- | --- | --- | --- |
| container.blockio.io_merged_recursive | Number of bios/requests merged into requests belonging to this cgroup and its descendant cgroups | cgroup v1 | Sum |
| container.blockio.io_queued_recursive | Number of requests queued up for this cgroup and its descendant cgroups | cgroup v1 | Sum |
| container.blockio.io_service_time_recursive | Total amount of time in nanoseconds between request dispatch and request completion for the IOs done by this cgroup and descendant cgroups | cgroup v1 | Sum |
| container.blockio.io_serviced_recursive | Number of IOs (bio) issued to the disk by the group and descendant groups | cgroup v1 | Sum |
| container.blockio.io_time_recursive | Disk time allocated to cgroup (and descendant cgroups) per device in milliseconds | cgroup v1 | Sum |
| container.blockio.io_wait_time_recursive | Total amount of time the IOs for this cgroup (and descendant cgroups) spent waiting in the scheduler queues for service | cgroup v1 | Sum |
| container.blockio.sectors_recursive | Number of sectors transferred to/from disk by the group and descendant groups | cgroup v1 | Sum |
| container.cpu.limit | CPU limit set for the container. | Both | Sum |
| container.cpu.shares | CPU shares set for the container. | Both | Sum |
| container.cpu.throttling_data.periods | Number of periods with throttling active | Both | Sum |
| container.cpu.throttling_data.throttled_periods | Number of periods when the container hits its throttling limit. | Both | Sum |
| container.cpu.throttling_data.throttled_time | Aggregate time the container was throttled | Both | Sum |
| container.cpu.usage.percpu | Per-core CPU usage by the container | cgroup v1 | Sum |
| container.cpu.usage.system | System CPU usage, as reported by docker | Both | Sum |
| container.memory.active_anon | The amount of anonymous memory that has been identified as active by the kernel. | Both | Sum |
| container.memory.active_file | Cache memory that has been identified as active by the kernel. | Both | Sum |
| container.memory.anon | Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS) (Only available with cgroups v2). | cgroup v2 | Sum |
| container.memory.cache | The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device | cgroup v1 | Sum |
| container.memory.dirty | Bytes that are waiting to get written back to the disk, from this cgroup | cgroup v1 | Sum |
| container.memory.hierarchical_memory_limit | The maximum amount of physical memory that can be used by the processes of this control group | cgroup v1 | Sum |
| container.memory.hierarchical_memsw_limit | The maximum amount of RAM + swap that can be used by the processes of this control group | cgroup v1 | Sum |
| container.memory.inactive_anon | The amount of anonymous memory that has been identified as inactive by the kernel. | Both | Sum |
| container.memory.inactive_file | Cache memory that has been identified as inactive by the kernel. | Both | Sum |
| container.memory.mapped_file | Indicates the amount of memory mapped by the processes in the control group | cgroup v1 | Sum |
| container.memory.pgfault | Indicates the number of times that a process of the cgroup triggered a page fault. | Both | Sum |
| container.memory.pgmajfault | Indicates the number of times that a process of the cgroup triggered a major fault. | Both | Sum |
| container.memory.pgpgin | Number of pages read from disk by the cgroup | cgroup v1 | Sum |
| container.memory.pgpgout | Number of pages written to disk by the cgroup | cgroup v1 | Sum |
| container.memory.rss | The amount of memory that doesn’t correspond to anything on disk: stacks, heaps, and anonymous memory maps | cgroup v1 | Sum |
| container.memory.rss_huge | Number of bytes of anonymous transparent hugepages in this cgroup | cgroup v1 | Sum |
| container.memory.total_active_anon | The amount of anonymous memory that has been identified as active by the kernel. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_active_file | Cache memory that has been identified as active by the kernel. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_dirty | Bytes that are waiting to get written back to the disk, from this cgroup and descendants | cgroup v1 | Sum |
| container.memory.total_inactive_anon | The amount of anonymous memory that has been identified as inactive by the kernel. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_inactive_file | Cache memory that has been identified as inactive by the kernel. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_mapped_file | Indicates the amount of memory mapped by the processes in the control group and descendant groups | cgroup v1 | Sum |
| container.memory.total_pgfault | Indicates the number of times that a process of the cgroup (or descendant cgroups) triggered a page fault | cgroup v1 | Sum |
| container.memory.total_pgmajfault | Indicates the number of times that a process of the cgroup (or descendant cgroups) triggered a major fault | cgroup v1 | Sum |
| container.memory.total_pgpgin | Number of pages read from disk by the cgroup and descendant groups | cgroup v1 | Sum |
| container.memory.total_pgpgout | Number of pages written to disk by the cgroup and descendant groups | cgroup v1 | Sum |
| container.memory.total_rss | The amount of memory that doesn’t correspond to anything on disk: stacks, heaps, and anonymous memory maps. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_rss_huge | Number of bytes of anonymous transparent hugepages in this cgroup and descendant cgroups | cgroup v1 | Sum |
| container.memory.total_unevictable | The amount of memory that cannot be reclaimed. Includes descendant cgroups | cgroup v1 | Sum |
| container.memory.total_writeback | Number of bytes of file/anon cache that are queued for syncing to disk in this cgroup and descendants | cgroup v1 | Sum |
| container.memory.unevictable | The amount of memory that cannot be reclaimed. | Both | Sum |
| container.memory.usage.max | Maximum memory usage. | Both | Sum |
| container.memory.writeback | Number of bytes of file/anon cache that are queued for syncing to disk in this cgroup | cgroup v1 | Sum |
| container.network.io.usage.rx_errors | Received network errors | Both | Sum |
| container.network.io.usage.rx_packets | Packets received | Both | Sum |
| container.network.io.usage.tx_errors | Transmission errors | Both | Sum |
| container.network.io.usage.tx_packets | Packets sent | Both | Sum |
| container.pids.count | Total container PIDs | Both | Sum |
| container.pids.limit | PIDs limits | Both | Sum |
| container.restarts | Number of restarts for the container. | Both | Sum |
| container.uptime | Time elapsed since container start time. | Both | Gauge |
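
As a concrete instance of the metrics configuration shown above, the sketch below enables two of the optional metrics from this table without editing config.yaml. The file name optional-metrics.yaml is chosen for this example, and it assumes the collector merges multiple --config files so that these keys are added to the existing docker_stats receiver settings.

```shell
# Write an override that switches on two optional docker_stats metrics.
cat > optional-metrics.yaml <<'EOF'
receivers:
  docker_stats:
    metrics:
      container.cpu.throttling_data.throttled_periods:
        enabled: true
      container.pids.count:
        enabled: true
EOF

# Usage:
#   ./otelcol-contrib --config ./config.yaml --config ./optional-metrics.yaml
```

You can equally add the same keys under the metrics section of docker_stats in config.yaml, as was done there for container.uptime and container.restarts.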

Attributes

The attributes collected for all the metrics are as follows:

| Name | Description | Values | Enabled |
| --- | --- | --- | --- |
| container.command_line | The full command executed by the container. | Any Str | false |
| container.hostname | The hostname of the container. | Any Str | true |
| container.id | The ID of the container. | Any Str | true |
| container.image.id | The ID of the container image. | Any Str | false |
| container.image.name | The name of the docker image in use by the container. | Any Str | true |
| container.name | The name of the container. | Any Str | true |
| container.runtime | The runtime of the container. For this receiver, it will always be 'docker'. | Any Str | true |
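
The attributes listed with Enabled set to false can be switched on in a similar way to optional metrics. The sketch below assumes they are configured under a resource_attributes key on the docker_stats receiver, mirroring the metrics key; verify the key name against the receiver documentation for your collector version. The file name attributes-override.yaml is chosen for this example.

```shell
# Write an override that enables the two attributes that are off by default.
cat > attributes-override.yaml <<'EOF'
receivers:
  docker_stats:
    resource_attributes:
      container.image.id:
        enabled: true
      container.command_line:
        enabled: true
EOF

# Usage:
#   ./otelcol-contrib --config ./config.yaml --config ./attributes-override.yaml
```

Keep in mind that container.command_line can expose arguments passed on the command line, so enable it only if that is acceptable in your environment.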

Conclusion

In this tutorial, you installed an OpenTelemetry collector to collect Docker container metrics and sent the collected data to SigNoz for monitoring and visualization.

Visit our complete guide on OpenTelemetry Collector to learn more about it. OpenTelemetry is quietly becoming the world standard for open-source observability, and by using it, you can have advantages like a single standard for all telemetry signals, no vendor lock-in, etc.

SigNoz is an open-source OpenTelemetry-native APM that can be used as a single backend for all your observability needs.


Further Reading

Complete Guide on OpenTelemetry Collector

An OpenTelemetry-native APM
