<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: murajo</title>
    <description>The latest articles on DEV Community by murajo (@murajo).</description>
    <link>https://dev.to/murajo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3249664%2Ffe00df32-4de7-4925-ab56-537f295f0ebd.jpg</url>
      <title>DEV Community: murajo</title>
      <link>https://dev.to/murajo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/murajo"/>
    <language>en</language>
    <item>
      <title>Visualizing GPU Metrics with DCGM Exporter</title>
      <dc:creator>murajo</dc:creator>
      <pubDate>Sun, 08 Jun 2025 20:48:50 +0000</pubDate>
      <link>https://dev.to/murajo/visualizing-gpu-metrics-with-dcgm-exporter-1268</link>
      <guid>https://dev.to/murajo/visualizing-gpu-metrics-with-dcgm-exporter-1268</guid>
      <description>&lt;h2&gt;
  
  
  1. Overview
&lt;/h2&gt;

&lt;p&gt;This article walks through visualizing the operating status of NVIDIA GPUs with NVIDIA’s &lt;strong&gt;DCGM Exporter&lt;/strong&gt;, &lt;strong&gt;Prometheus&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;br&gt;
DCGM (Data Center GPU Manager) is NVIDIA’s toolkit for monitoring and managing GPUs. DCGM Exporter exposes its metrics over HTTP in Prometheus format so that Prometheus can scrape them.&lt;/p&gt;

&lt;p&gt;Of course, you can also monitor GPU status with the &lt;code&gt;nvidia-smi&lt;/code&gt; command.&lt;br&gt;
However, relying solely on &lt;code&gt;nvidia-smi&lt;/code&gt; has the following limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual polling&lt;/strong&gt; — you need to loop it with a shell script, for example&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult to centrally monitor multiple hosts&lt;/strong&gt; — logging in to each host via SSH is cumbersome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No long‑term time‑series data&lt;/strong&gt; — CSV logging is possible but not easy to visualize&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The method presented here eliminates these drawbacks and enables centralized monitoring in Grafana.&lt;br&gt;
When finished, you will be able to check GPU usage on a Grafana dashboard like the one below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhc0scr8cm0l5k1t8vk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhc0scr8cm0l5k1t8vk1.png" alt="Image description" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Intended Readers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Those interested in visualizing resource usage on GPU‑equipped machines&lt;/li&gt;
&lt;li&gt;Those who want to try collecting metrics with DCGM Exporter / Prometheus / Grafana&lt;/li&gt;
&lt;li&gt;Those who want to build a GPU monitoring stack using Docker&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3. Configuration and Prerequisites
&lt;/h2&gt;

&lt;p&gt;We will build a two‑node setup consisting of a GPU server and a monitoring server.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;OS&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Prerequisite Software&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU server&lt;/td&gt;
&lt;td&gt;Target GPU server to be monitored; runs DCGM Exporter&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04&lt;/td&gt;
&lt;td&gt;Present&lt;/td&gt;
&lt;td&gt;Docker, NVIDIA Container Toolkit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring server&lt;/td&gt;
&lt;td&gt;Runs Prometheus and Grafana&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Docker, Docker Compose plugin&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: For small test environments, you can also run Prometheus and Grafana on the GPU server.&lt;/p&gt;

&lt;p&gt;The flow of communication during operation is illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69z5bbz6yxyn32oiuk54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69z5bbz6yxyn32oiuk54.png" alt="Image description" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  4. GPU Server: Setting Up DCGM Exporter
&lt;/h2&gt;

&lt;p&gt;First, set up DCGM Exporter on the GPU server. &lt;br&gt;
For further background, see the &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html" rel="noopener noreferrer"&gt;official manual&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 Verify NVIDIA Container Toolkit Installation
&lt;/h3&gt;

&lt;p&gt;Ensure that &lt;strong&gt;NVIDIA Container Toolkit&lt;/strong&gt; is installed on the GPU server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dpkg &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia-container-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ii  nvidia-container-toolkit      1.17.5-1  amd64  NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.17.5-1  amd64  NVIDIA Container Toolkit Base
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 Pull and Run DCGM Exporter
&lt;/h3&gt;

&lt;p&gt;DCGM Exporter is provided as a container image in the NVIDIA &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter" rel="noopener noreferrer"&gt;NGC Catalog&lt;/a&gt;. No API key is required for this public image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the container (be sure to set the two options below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--gpus all&lt;/code&gt; — only GPUs passed to the container are monitored&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--cap-add SYS_ADMIN&lt;/code&gt; — without this, some metrics cannot be collected
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cap-add&lt;/span&gt; SYS_ADMIN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9400:9400 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; dcgm-exporter &lt;span class="se"&gt;\&lt;/span&gt;
  nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
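
&lt;p&gt;Because &lt;code&gt;--rm&lt;/code&gt; removes the container when it stops, the exporter does not come back after a reboot. As a hedged alternative, the same options could be written as a Compose file on the GPU server. This is only a sketch: it assumes the Docker Compose plugin is also installed on the GPU server, and the &lt;code&gt;restart&lt;/code&gt; policy is an addition that is not part of the command above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# docker-compose.yml on the GPU server (sketch mirroring the docker run options above)
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04
    container_name: dcgm-exporter
    # Same purpose as --cap-add SYS_ADMIN
    cap_add:
      - SYS_ADMIN
    ports:
      - "9400:9400"
    # Assumption: restart automatically after a reboot or crash
    restart: unless-stopped
    # Same purpose as --gpus all: pass every GPU to the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use this instead of, not in addition to, the &lt;code&gt;docker run&lt;/code&gt; command, since both would claim the same container name and port. Start it with &lt;code&gt;docker compose up -d&lt;/code&gt; from the directory that contains the file.&lt;/p&gt;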



&lt;p&gt;Verify that the container is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dcgm-exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output (example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;CONTAINER ID   IMAGE                                                     COMMAND                CREATED         STATUS         PORTS                    NAMES
462dde910a54   nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04  &lt;span class="s2"&gt;"/usr/local/dcgm/dcg…"&lt;/span&gt;  10 seconds ago  Up 9 seconds   0.0.0.0:9400-&amp;gt;9400/tcp   dcgm-exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Confirm Metrics Endpoint
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:9400/metrics | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output (example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-f8291959-100f-80a2-a0e5-3db0f7f94746",pci_bus_id="00000000:1B:00.0",device="nvidia0",modelName="NVIDIA H200",Hostname="462dde910a54",DCGM_FI_DRIVER_VERSION="570.124.06"} 345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Monitoring Server: Prometheus &amp;amp; Grafana Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Create Persistent Data Directories
&lt;/h3&gt;

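&lt;p&gt;The official Prometheus and Grafana images run as fixed non-root users (UID/GID 65534 and 472, respectively), so the bind-mounted data directories must be owned by those IDs. Create the directories and set ownership:&lt;br&gt;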
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/prometheus/data
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/grafana/data

&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 65534:65534 /opt/prometheus/data   &lt;span class="c"&gt;# Prometheus UID/GID&lt;/span&gt;
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 472:472   /opt/grafana/data        &lt;span class="c"&gt;# Grafana UID/GID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Create Prometheus Configuration File
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;prometheus.yml&lt;/code&gt; in the same directory as your future &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;

&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-exporter'&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;GPU-Server-IP&amp;gt;:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scrape_interval&lt;/code&gt;: how often Prometheus scrapes each target (every 15 seconds here)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;targets&lt;/code&gt;: IP address and port (9400) of the GPU server running DCGM Exporter&lt;/li&gt;
&lt;/ul&gt;
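
&lt;p&gt;The &lt;code&gt;targets&lt;/code&gt; list can hold more than one exporter. The fragment below is a minimal sketch of monitoring two GPU servers from the same job. The placeholders &lt;code&gt;&amp;lt;GPU-Server-1-IP&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;GPU-Server-2-IP&amp;gt;&lt;/code&gt; are hypothetical and assume DCGM Exporter is listening on port 9400 on each host.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      # One entry per GPU server (placeholder addresses)
      - targets:
          - '&amp;lt;GPU-Server-1-IP&amp;gt;:9400'
          - '&amp;lt;GPU-Server-2-IP&amp;gt;:9400'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Prometheus stores each target with its own &lt;code&gt;instance&lt;/code&gt; label, which can be used to tell the hosts apart in queries.&lt;/p&gt;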

&lt;h3&gt;
  
  
  5.3 Launch Prometheus &amp;amp; Grafana via Docker Compose
&lt;/h3&gt;

&lt;p&gt;Pull the images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull prom/prometheus:v3.4.1
docker pull grafana/grafana:12.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create &lt;code&gt;docker-compose.yml&lt;/code&gt; to define the Prometheus and Grafana services.&lt;br&gt;
It ties together what you have prepared so far: the image versions, the configuration file, and the data directories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus:v3.4.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus.yml:/etc/prometheus/prometheus.yml&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/opt/prometheus/data:/prometheus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090:9090"&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:12.0.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=&amp;lt;AdminPassword&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/opt/grafana/data:/var/lib/grafana&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Prometheus image loads its configuration from &lt;code&gt;/etc/prometheus/prometheus.yml&lt;/code&gt; by default, so mount your file at exactly that path (note the &lt;code&gt;yml&lt;/code&gt; extension).&lt;/li&gt;
&lt;li&gt;The value of &lt;code&gt;GF_SECURITY_ADMIN_PASSWORD&lt;/code&gt; sets the initial password for user &lt;strong&gt;admin&lt;/strong&gt;. If omitted, the default is &lt;code&gt;admin&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
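
&lt;p&gt;The Compose file above starts Prometheus with the image's default flags, which keep data for 15 days. If you want a longer history, one possible approach is to override &lt;code&gt;command&lt;/code&gt; in the &lt;code&gt;prometheus&lt;/code&gt; service. The fragment below is a sketch rather than a required step. Overriding &lt;code&gt;command&lt;/code&gt; replaces the image's default arguments, so the configuration and storage paths are restated, and the 90-day value is only an example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment to merge into the prometheus service in docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.4.1
    command:
      # Restate the defaults that the override would otherwise drop
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Example value: keep 90 days of metrics
      - --storage.tsdb.retention.time=90d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Longer retention uses more disk space under &lt;code&gt;/opt/prometheus/data&lt;/code&gt;, so size the volume accordingly.&lt;/p&gt;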

&lt;p&gt;Start the stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that both containers are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output (example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME         IMAGE                    COMMAND                  SERVICE      CREATED         STATUS         PORTS
prometheus   prom/prometheus:v3.4.1   &lt;span class="s2"&gt;"/bin/prometheus --c…"&lt;/span&gt;   prometheus   2 minutes ago   Up 2 minutes   0.0.0.0:9090-&amp;gt;9090/tcp
grafana      grafana/grafana:12.0.1   &lt;span class="s2"&gt;"/run.sh"&lt;/span&gt;                grafana      2 minutes ago   Up 2 minutes   0.0.0.0:3000-&amp;gt;3000/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.4 Confirm Prometheus Targets
&lt;/h3&gt;

&lt;p&gt;Open a browser at &lt;code&gt;http://&amp;lt;Monitoring-Server-IP&amp;gt;:9090/targets&lt;/code&gt;.&lt;br&gt;
You should see &lt;strong&gt;dcgm-exporter&lt;/strong&gt; listed with state &lt;strong&gt;UP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzemlg8sdykld7xzc8rp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzemlg8sdykld7xzc8rp.png" alt="Image description" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.5 Log In to Grafana
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;http://&amp;lt;Monitoring-Server-IP&amp;gt;:3000&lt;/code&gt; in a browser to reach the login screen. Log in with &lt;strong&gt;admin&lt;/strong&gt; / &lt;code&gt;&amp;lt;AdminPassword&amp;gt;&lt;/code&gt; (or &lt;strong&gt;admin&lt;/strong&gt; / &lt;strong&gt;admin&lt;/strong&gt; if the variable was omitted).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx36f2upj6qmzz7rlgjn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx36f2upj6qmzz7rlgjn1.png" alt="Image description" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When login succeeds, Grafana is running correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppau67rpk3q18rwvij5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppau67rpk3q18rwvij5x.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Create Grafana Dashboard
&lt;/h2&gt;

&lt;p&gt;We will use NVIDIA’s published template below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard" rel="noopener noreferrer"&gt;https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Add Prometheus as a Data Source
&lt;/h3&gt;

&lt;p&gt;From the Grafana console, open &lt;strong&gt;Add new connection&lt;/strong&gt; and select &lt;strong&gt;Prometheus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ziy5dtv3gbr6nxe66lb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ziy5dtv3gbr6nxe66lb.png" alt="Image description" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter &lt;code&gt;http://&amp;lt;Monitoring-Server-IP&amp;gt;:9090&lt;/code&gt; in the Connection field, then click &lt;strong&gt;Save &amp;amp; Test&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ohod7ef7g2k24trsq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ohod7ef7g2k24trsq9.png" alt="Image description" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;
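
&lt;p&gt;The same data source can also be registered without the UI through Grafana's provisioning mechanism. The file below is a minimal sketch under a few assumptions: the file name &lt;code&gt;datasource.yml&lt;/code&gt; is hypothetical, the file has to be mounted into the Grafana container under &lt;code&gt;/etc/grafana/provisioning/datasources/&lt;/code&gt;, and the URL uses the Compose service name &lt;code&gt;prometheus&lt;/code&gt;, which only resolves when both containers run in the same Compose project as in section 5.3.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# datasource.yml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    # Grafana's backend proxies the queries to Prometheus
    access: proxy
    # Compose service name instead of the server IP
    url: http://prometheus:9090
    isDefault: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Grafana reads provisioning files at startup, so restart the Grafana container after adding the mount.&lt;/p&gt;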

&lt;h3&gt;
  
  
  6.2 Import the Dashboard Template
&lt;/h3&gt;

&lt;p&gt;Open &lt;strong&gt;Dashboards → New → Import&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o85x521a5zt3jqd75ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o85x521a5zt3jqd75ct.png" alt="Image description" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because we are using a public template, enter &lt;strong&gt;12239&lt;/strong&gt; as the template ID and click &lt;strong&gt;Load&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4i3hiuwxpfgsl1jt2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4i3hiuwxpfgsl1jt2f.png" alt="Image description" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the dashboard imports successfully, GPU temperature, utilization, memory bandwidth, and more will be visualized in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vsxaqajcmbn3ql2lve3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vsxaqajcmbn3ql2lve3.png" alt="Image description" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Summary
&lt;/h2&gt;

&lt;p&gt;We visualized GPU usage with DCGM Exporter, Prometheus, and Grafana.&lt;br&gt;&lt;br&gt;
Although only one GPU server was used here, you can register multiple servers in Prometheus and manage them together in Grafana.&lt;br&gt;&lt;br&gt;
For very small setups, running Prometheus and Grafana directly on the GPU server is also an option.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Extras
&lt;/h2&gt;

&lt;p&gt;Although not covered in detail, the following are useful from an operations perspective.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multiple GPU servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add each server's &lt;code&gt;&amp;lt;IP&amp;gt;:9400&lt;/code&gt; to the &lt;code&gt;targets&lt;/code&gt; list in &lt;code&gt;prometheus.yml&lt;/code&gt;, then restart or reload Prometheus.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alert settings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Define Prometheus alerting rules with conditions such as &lt;code&gt;DCGM_FI_DEV_GPU_TEMP &amp;gt; 80&lt;/code&gt; and route the resulting alerts through Alertmanager.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data retention period&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adjust with Prometheus flag &lt;code&gt;--storage.tsdb.retention.time=90d&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Version upgrades&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensure DCGM version, NVIDIA driver, and exporter tag are compatible.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
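
&lt;p&gt;As an illustration of the alert settings row, a Prometheus rule file for the temperature condition might look like the sketch below. The group name, alert name, &lt;code&gt;for&lt;/code&gt; duration, and severity label are all assumptions chosen for this example. The file would also need to be listed under &lt;code&gt;rule_files&lt;/code&gt; in &lt;code&gt;prometheus.yml&lt;/code&gt;, and an Alertmanager instance configured under &lt;code&gt;alerting&lt;/code&gt;, neither of which is set up in this article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# gpu-alerts.yml (hypothetical file name)
groups:
  - name: gpu-alerts
    rules:
      - alert: GpuTemperatureHigh
        # Condition taken from the table above
        expr: DCGM_FI_DEV_GPU_TEMP &amp;gt; 80
        # Assumption: fire only if the condition holds for 5 minutes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is above 80 C"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;gpu&lt;/code&gt; label used in the summary is one of the labels DCGM Exporter attaches to each metric, as seen in the endpoint output in section 4.2.&lt;/p&gt;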




&lt;h2&gt;
  
  
  9. References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DCGM Exporter GitHub — &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;https://github.com/NVIDIA/dcgm-exporter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA Docs — &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html" rel="noopener noreferrer"&gt;https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Prometheus Install — &lt;a href="https://prometheus.io/docs/prometheus/latest/installation/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/prometheus/latest/installation/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Grafana Install — &lt;a href="https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/" rel="noopener noreferrer"&gt;https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard ID 12239 — &lt;a href="https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/" rel="noopener noreferrer"&gt;https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nvidia</category>
      <category>grafana</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
