Minwook Je

NVIDIA Data Center GPU Manager (DCGM)

Why DCGM?

How can a custom Kubernetes operator enable auto-scaling based on GPU metrics?

The traditional HPA scales only on CPU and memory metrics, which falls short for GPU-heavy workloads. By exporting GPU usage metrics to Prometheus, an operator can make scaling decisions in response to real-time GPU demand.

DCGM Exporter emerged as a practical solution to this challenge.


NVML vs DCGM vs DCGM Exporter

| Item          | NVML                          | DCGM                                      | DCGM Exporter               |
|---------------|-------------------------------|-------------------------------------------|-----------------------------|
| Level         | Low-level (direct GPU access) | High-level (cluster-wide management)      | Monitoring layer            |
| Foundation    | GPU driver                    | Based on NVML                             | DCGM + NVML                 |
| Functionality | Real-time GPU status          | Batch collection, policies, health checks | Prometheus metrics endpoint |
| Purpose       | Library/API calls             | Management platform                       | Monitoring integration      |
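
To make the "Level" row concrete: NVML is the low-level path that talks to the driver device by device. A minimal sketch using NVIDIA's go-nvml bindings (not part of this post, shown only for illustration) looks like this:

package main

import (
    "fmt"
    "log"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    // NVML is the low-level path: query the driver directly, per device.
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    count, ret := nvml.DeviceGetCount()
    if ret != nvml.SUCCESS {
        log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
    }

    for i := 0; i < count; i++ {
        device, ret := nvml.DeviceGetHandleByIndex(i)
        if ret != nvml.SUCCESS {
            continue
        }
        util, ret := device.GetUtilizationRates()
        if ret != nvml.SUCCESS {
            continue
        }
        fmt.Printf("GPU %d: %d%% core, %d%% memory\n", i, util.Gpu, util.Memory)
    }
}

DCGM builds on this same data but adds cluster-wide collection, policies, and health checks; DCGM Exporter then turns that into a Prometheus endpoint.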

Core code (dcgm-exporter)

https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/cmd/app.go#L379

func startDCGMExporter(c *cli.Context) error {
    if err := configureLogger(c); err != nil {
        return err
    }

    for {
        // Create a new context for this run of the exporter
        // Runs are ended by various events (signals from OS or DCGM)
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        var version string
        if c != nil && c.App != nil {
            version = c.App.Version
        }

        slog.Info("Starting dcgm-exporter", slog.String("Version", version))

        config, err := contextToConfig(c)
        if err != nil {
            return err
        }

        err = prerequisites.Validate()
        if err != nil {
            return err
        }

        // Initialize DCGM Provider Instance
        dcgmprovider.Initialize(config)
        dcgmCleanup := dcgmprovider.Client().Cleanup

        // Initialize NVML Provider Instance
        nvmlprovider.Initialize()
        nvmlCleanup := nvmlprovider.Client().Cleanup

        slog.Info("DCGM successfully initialized!")
        slog.Info("NVML provider successfully initialized!")

        fillConfigMetricGroups(config)

        cs := getCounters(config)

        deviceWatchListManager := startDeviceWatchListManager(cs, config)

        hostname, err := hostname.GetHostname(config)
        if err != nil {
            nvmlCleanup()
            dcgmCleanup()
            return err
        }

        cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)

        cRegistry := registry.NewRegistry()
        for _, entityCollector := range cf.NewCollectors() {
            cRegistry.Register(entityCollector)
        }

        ch := make(chan string, 10)

        var wg sync.WaitGroup
        stop := make(chan interface{})

        wg.Add(1)

        server, cleanup, err := server.NewMetricsServer(config, ch, deviceWatchListManager, cRegistry)
        if err != nil {
            cRegistry.Cleanup()
            nvmlCleanup()
            dcgmCleanup()
            return err
        }

        go server.Run(ctx, stop, &wg)

        sigs := newOSWatcher(syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGHUP)

        go watchCollectorsFile(config.CollectorsFile, reloadMetricsServer(sigs))

        sig := <-sigs
        slog.Info("Received signal", slog.String("signal", sig.String()))
        close(stop)
        cancel() // Cancel the context for this iteration
        err = utils.WaitWithTimeout(&wg, time.Second*2)
        if err != nil {
            slog.Error(err.Error())
            cRegistry.Cleanup()
            nvmlCleanup()
            dcgmCleanup()
            cleanup()
            fatal()
        }

        // Call cleanup functions before continuing the loop
        cRegistry.Cleanup()
        nvmlCleanup()
        dcgmCleanup()
        cleanup()

        if sig != syscall.SIGHUP {
            return nil
        }

        // For SIGHUP, we'll continue the loop after cleanup
        slog.Info("Restarting dcgm-exporter after signal")
    }

    return nil
}

Understanding DCGM Exporter Structure

1. Initialization

// Initialize DCGM
dcgmprovider.Initialize(config)
dcgmCleanup := dcgmprovider.Client().Cleanup

// Initialize NVML
nvmlprovider.Initialize()
nvmlCleanup := nvmlprovider.Client().Cleanup

DCGM acts as the primary metric engine; NVML supplements with fine-grained details.

2. Collector and Metric Registration

fillConfigMetricGroups(config)
cs := getCounters(config)
deviceWatchListManager := startDeviceWatchListManager(cs, config)

cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)
cRegistry := registry.NewRegistry()
for _, entityCollector := range cf.NewCollectors() {
    cRegistry.Register(entityCollector)
}

This sets up the GPU metric groups to watch and registers each collector with the registry that backs the Prometheus metrics endpoint.

3. Restart Loop + Signal Handling

sig := <-sigs

cRegistry.Cleanup()
nvmlCleanup()
dcgmCleanup()
cleanup()

if sig == syscall.SIGHUP {
    // Enter restart loop
}

SIGINT, SIGTERM, and SIGQUIT terminate the exporter; SIGHUP triggers cleanup followed by a restart of the loop.
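
newOSWatcher itself is not shown in the excerpt above; a typical implementation (an assumption about its shape, not the actual dcgm-exporter source) simply wraps signal.Notify in a buffered channel:

import (
    "os"
    "os/signal"
)

// newOSWatcher returns a channel that receives the given OS signals.
// The buffer ensures a signal is not dropped while the main loop is busy.
func newOSWatcher(sigs ...os.Signal) chan os.Signal {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, sigs...)
    return sigChan
}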


How to use it?

Example: integrating DCGM Exporter into a Kubernetes operator

Core Concept

  • Deploy the exporter as a DaemonSet on GPU nodes (see the sketch after this list)
  • Configure it through a custom resource spec
  • Use Prometheus metrics to drive auto-scaling decisions
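
A minimal sketch of the DaemonSet piece, assuming a controller-runtime style operator and the hypothetical monitoringv1.DCGMExporter custom resource from the snippet below with an added Image field (the CR type, labels, and node-selector label are illustrative assumptions, not an upstream API):

import (
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildDaemonSet constructs the exporter DaemonSet from the custom resource spec.
func buildDaemonSet(cr *monitoringv1.DCGMExporter) *appsv1.DaemonSet {
    labels := map[string]string{"app": "dcgm-exporter"}
    return &appsv1.DaemonSet{
        ObjectMeta: metav1.ObjectMeta{
            Name:      cr.Name + "-dcgm-exporter",
            Namespace: cr.Namespace,
        },
        Spec: appsv1.DaemonSetSpec{
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    // Schedule only onto GPU nodes; this label is an assumption
                    // and should match your cluster's node-labelling scheme.
                    NodeSelector: map[string]string{"nvidia.com/gpu.present": "true"},
                    Containers: []corev1.Container{{
                        Name:  "dcgm-exporter",
                        Image: cr.Spec.Image, // e.g. an nvcr.io/nvidia/k8s/dcgm-exporter tag
                        Ports: []corev1.ContainerPort{{
                            Name:          "metrics",
                            ContainerPort: 9400, // dcgm-exporter's default listen port
                        }},
                    }},
                },
            },
        },
    }
}

The reconciler would then create or update this object with its client and point Prometheus at port 9400 (for example via a scrape config or a ServiceMonitor).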

Auto Scaling Logic Example

func (r *DCGMExporterReconciler) handleAutoScaling(ctx context.Context, dcgmExporter *monitoringv1.DCGMExporter) error {
    if !dcgmExporter.Spec.AutoScale.Enabled {
        return nil
    }

    current := r.getMetricCount(ctx, dcgmExporter)
    target := dcgmExporter.Spec.AutoScale.TargetMetricCount

    if current > target {
        return r.scaleUp(ctx, dcgmExporter)
    } else if current < target/2 {
        return r.scaleDown(ctx, dcgmExporter)
    }
    return nil
}
Enter fullscreen mode Exit fullscreen mode

Adjusts scale based on the number of metrics collected via Prometheus
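
getMetricCount is not defined in the snippet above; one way it could be implemented (a sketch, assuming the Prometheus Go client and the standard DCGM_FI_DEV_GPU_UTIL metric; the Prometheus address is a placeholder) is to count the exporter's GPU-utilization series:

import (
    "context"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// getMetricCount returns how many GPU-utilization series Prometheus currently
// sees from the exporter, or 0 if the query fails.
func (r *DCGMExporterReconciler) getMetricCount(ctx context.Context, cr *monitoringv1.DCGMExporter) int {
    client, err := api.NewClient(api.Config{Address: "http://prometheus-server:9090"}) // placeholder address
    if err != nil {
        return 0
    }
    promAPI := promv1.NewAPI(client)

    // Count the series exported for GPU utilization.
    result, _, err := promAPI.Query(ctx, `count(DCGM_FI_DEV_GPU_UTIL)`, time.Now())
    if err != nil {
        return 0
    }
    if vec, ok := result.(model.Vector); ok && len(vec) > 0 {
        return int(vec[0].Value)
    }
    return 0
}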


Summary

  • startDCGMExporter() is a self-contained loop that handles GPU metric collection, Prometheus server setup, and signal-based restarts.
  • A Kubernetes operator can manage this exporter and implement scaling logic based on observed metrics.
  • Collector configurations, GPU node discovery, and collection intervals can all be managed through the operator.

Ultimately, DCGM Exporter is a key enabler for dynamic scaling of GPU-intensive workloads.
