Visualizing GPU Metrics with DCGM Exporter

1. Overview

In this article, we introduce the steps for visualizing the operating status of NVIDIA GPUs using NVIDIA’s DCGM Exporter together with Prometheus and Grafana.
DCGM (Data Center GPU Manager) is a toolkit for monitoring and managing GPUs, and by using DCGM Exporter you can obtain metrics in Prometheus format.

Of course, you can also monitor GPU status with the nvidia-smi command.
However, relying solely on nvidia-smi has the following limitations:

  • Manual polling — you have to run it in a loop yourself, e.g. from a shell script (see the sketch below)
  • Hard to monitor multiple hosts centrally — logging in to each host via SSH is cumbersome
  • No long‑term time‑series data — logging to CSV is possible, but visualizing it is awkward
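For illustration, the manual-polling approach typically ends up looking like this (a minimal sketch; the query fields, output file, and interval are arbitrary choices):

#!/usr/bin/env bash
# Append GPU utilization and memory usage to a CSV every 5 seconds.
# Works, but gives no central view, no dashboards, and no alerting.
while true; do
  nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
             --format=csv,noheader >> gpu_usage.csv
  sleep 5
done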

The method presented here eliminates these drawbacks and enables centralized monitoring in Grafana.
When finished, you will be able to check GPU usage on a Grafana dashboard like the one below.
[Screenshot: Grafana dashboard visualizing GPU usage]


2. Intended Readers

  • Those interested in visualizing resource usage on GPU‑equipped machines
  • Those who want to try collecting metrics with DCGM Exporter / Prometheus / Grafana
  • Those who want to build a GPU monitoring stack using Docker

3. Configuration and Prerequisites

We will build a two‑node setup consisting of a GPU server and a monitoring server.

| Server | Purpose | OS | GPU | Prerequisite software |
| --- | --- | --- | --- | --- |
| GPU server | Monitoring target; runs DCGM Exporter | Ubuntu 24.04 | Yes | Docker, NVIDIA Container Toolkit |
| Monitoring server | Runs Prometheus and Grafana | Ubuntu 24.04 | No | Docker, Compose plugin |

※ For small test environments, you can place both containers on the GPU server.

The flow of communication during operation is illustrated below.

[Diagram: Prometheus on the monitoring server scrapes DCGM Exporter on the GPU server (port 9400); Grafana queries Prometheus (port 9090)]


4. GPU Server: Setting Up DCGM Exporter

First, set up DCGM Exporter on the GPU server.
For further background, see the official manual.

4.1 Verify NVIDIA Container Toolkit Installation

Ensure that NVIDIA Container Toolkit is installed on the GPU server:

dpkg -l | grep nvidia-container-toolkit

Expected output:

ii  nvidia-container-toolkit      1.17.5-1  amd64  NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.17.5-1  amd64  NVIDIA Container Toolkit Base
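If the package is missing, install it first. A rough sketch for Ubuntu, assuming NVIDIA's apt repository has already been configured per the official install guide:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # register the NVIDIA runtime with Docker
sudo systemctl restart docker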

4.2 Pull and Run DCGM Exporter

DCGM Exporter is provided as a container image in the NVIDIA NGC Catalog. No API key is required for this public image.

docker pull nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04

Start the container (be sure to set the two options below):

  • --gpus all — passes the host's GPUs through to the container; only GPUs exposed this way are monitored
  • --cap-add SYS_ADMIN — grants the capability DCGM needs; without it, some metrics cannot be collected

docker run -d --rm \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04

Verify that the container is running:

docker ps -f name=dcgm-exporter

Expected output (example):

CONTAINER ID   IMAGE                                                     COMMAND                CREATED         STATUS         PORTS                    NAMES
462dde910a54   nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04  "/usr/local/dcgm/dcg…"  10 seconds ago  Up 9 seconds   0.0.0.0:9400->9400/tcp   dcgm-exporter

4.3 Confirm the Metrics Endpoint

curl -s http://localhost:9400/metrics | head -n 5

Expected output (example):

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-f8291959-100f-80a2-a0e5-3db0f7f94746",pci_bus_id="00000000:1B:00.0",device="nvidia0",modelName="NVIDIA H200",Hostname="462dde910a54",DCGM_FI_DRIVER_VERSION="570.124.06"} 345
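To inspect a single metric, filter by name; for example, GPU utilization (assuming DCGM_FI_DEV_GPU_UTIL is enabled in your counter set):

curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL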

5. Monitoring Server: Prometheus & Grafana Setup

5.1 Create Persistent Data Directories

Prometheus and Grafana use fixed internal UIDs/GIDs. Create the directories and set ownership:

sudo mkdir -p /opt/prometheus/data
sudo mkdir -p /opt/grafana/data

sudo chown -R 65534:65534 /opt/prometheus/data   # Prometheus UID/GID
sudo chown -R 472:472   /opt/grafana/data        # Grafana UID/GID
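A quick check that the ownership took effect:

ls -ld /opt/prometheus/data /opt/grafana/data   # owners should show as 65534 and 472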

5.2 Create Prometheus Configuration File

Create prometheus.yml in the same directory as your future docker-compose.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['<GPU-Server-IP>:9400']

  • scrape_interval — how often Prometheus scrapes each target
  • targets — the IP address and port (9400) of the GPU server running DCGM Exporter
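Before starting the stack, you can let Prometheus validate the file; a sketch using the promtool binary shipped in the prom/prometheus image:

docker run --rm \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
  --entrypoint promtool \
  prom/prometheus:v3.4.1 \
  check config /etc/prometheus/prometheus.yml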

5.3 Launch Prometheus & Grafana via Docker Compose

Pull the images:

docker pull prom/prometheus:v3.4.1
docker pull grafana/grafana:12.0.1

Next, create docker-compose.yml to start Prometheus and Grafana.
It simply wires together the pieces prepared above: the config file, the data directories, and the pinned images.

services:
  prometheus:
    image: prom/prometheus:v3.4.1
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - /opt/prometheus/data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:12.0.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=<AdminPassword>
    volumes:
      - /opt/grafana/data:/var/lib/grafana

Notes:

  • The volume mount places your local prometheus.yml at /etc/prometheus/prometheus.yml, the path Prometheus reads its configuration from by default.
  • GF_SECURITY_ADMIN_PASSWORD sets the initial password for the admin user. If omitted, the default is admin.
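Optionally, verify that the Compose file parses before starting anything; docker compose config resolves and prints the effective configuration and fails on syntax errors:

docker compose config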

Start the stack:

docker compose up -d

Verify that the containers are running:

docker compose ps

Expected output (example):

NAME         IMAGE                    COMMAND                  SERVICE      CREATED         STATUS         PORTS
prometheus   prom/prometheus:v3.4.1   "/bin/prometheus --c…"   prometheus   2 minutes ago   Up 2 minutes   0.0.0.0:9090->9090/tcp
grafana      grafana/grafana:12.0.1   "/run.sh"                grafana      2 minutes ago   Up 2 minutes   0.0.0.0:3000->3000/tcp
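Both services also expose health endpoints you can probe from the monitoring server itself (paths from the Prometheus and Grafana HTTP APIs):

curl -s http://localhost:9090/-/healthy    # Prometheus liveness endpoint
curl -s http://localhost:3000/api/health   # Grafana health endpoint (returns JSON)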

5.4 Confirm Prometheus Targets

Open a browser at http://<Monitoring-Server-IP>:9090/targets.
You should see dcgm-exporter listed with state UP.

[Screenshot: Prometheus Targets page with dcgm-exporter in state UP]
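The same check is scriptable through Prometheus's query API; up is 1 when the last scrape of a target succeeded:

curl -s 'http://localhost:9090/api/v1/query?query=up{job="dcgm-exporter"}'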

5.5 Log In to Grafana

Access http://<Monitoring-Server-IP>:3000 and the login screen will appear. Log in with admin / <AdminPassword> (or admin / admin if the variable was omitted).

[Screenshot: Grafana login screen]

When login succeeds, Grafana is running correctly.

[Screenshot: Grafana home screen after login]


6. Create Grafana Dashboard

We will use NVIDIA’s published template below.

https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard

6.1 Add Prometheus as a Data Source

From the Grafana console, open Add new connection and select Prometheus.

[Screenshot: Grafana "Add new connection" screen]

Enter http://<Monitoring-Server-IP>:9090 in the Connection field, then Save & Test.

[Screenshot: Prometheus data source connection settings]
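If you prefer configuration as code, the data source can also be provisioned from a file instead of through the UI. A sketch, assuming you mount it at /etc/grafana/provisioning/datasources/ inside the grafana container (on the shared Compose network, the service name prometheus also works as the hostname):

# datasource.yml, mounted under /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true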

6.2 Import the Dashboard Template

Open Dashboards → New → Import.

[Screenshot: Dashboards → New → Import menu]

Because we are using a public template, enter 12239 as the dashboard ID, click Load, and select the Prometheus data source when prompted.

[Screenshot: dashboard import screen with ID 12239]

If the dashboard imports successfully, GPU temperature, utilization, memory bandwidth, and more will be visualized in real time.

[Screenshot: imported NVIDIA DCGM Exporter dashboard]


7. Summary

We visualized GPU usage with DCGM Exporter, Prometheus, and Grafana.

Although only one GPU server was used here, you can register multiple servers in Prometheus and manage them together in Grafana.

For very small setups, running Prometheus and Grafana directly on the GPU server is also an option.


8. Extras

Although not covered in detail, the following are useful from an operations perspective.

| Topic | Point |
| --- | --- |
| Multiple GPU servers | Add more entries under targets in prometheus.yml (see the sketch below). |
| Alert settings | Forward conditions such as DCGM_FI_DEV_GPU_TEMP > 80 to Alertmanager. |
| Data retention period | Adjust with the Prometheus flag --storage.tsdb.retention.time=90d. |
| Version upgrades | Keep the DCGM version, NVIDIA driver, and exporter image tag compatible. |
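For the first row, scaling out really is just more targets; a sketch with example addresses:

scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets:
          - '192.0.2.10:9400'   # GPU server 1 (example address)
          - '192.0.2.11:9400'   # GPU server 2 (example address)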
