Minwook Je

NVIDIA Data Center GPU Manager (DCGM)

Why DCGM?

How can a custom Kubernetes operator enable auto-scaling based on GPU metrics?

The traditional HPA scales only on CPU and memory metrics, which falls short for GPU-heavy workloads. By exporting GPU usage metrics to Prometheus, an operator can make scaling decisions in response to real-time GPU demand.

DCGM Exporter emerged as a practical solution to this challenge.


NVML vs DCGM vs DCGM Exporter

| Item          | NVML                          | DCGM                                      | DCGM Exporter               |
|---------------|-------------------------------|-------------------------------------------|-----------------------------|
| Level         | Low-level (direct GPU access) | High-level (cluster-wide management)      | Monitoring layer            |
| Foundation    | GPU driver                    | Based on NVML                             | DCGM + NVML                 |
| Functionality | Real-time GPU status          | Batch collection, policies, health checks | Prometheus metrics endpoint |
| Purpose       | Library/API calls             | Management platform                       | Monitoring integration      |
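
To make the "Level" row concrete: NVML is the low-level path that talks to the driver device by device. A minimal sketch using NVIDIA's go-nvml bindings (not part of this post, shown only for illustration) looks like this:

package main

import (
    "fmt"
    "log"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    // NVML is the low-level path: query the driver directly, per device.
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    count, ret := nvml.DeviceGetCount()
    if ret != nvml.SUCCESS {
        log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
    }

    for i := 0; i < count; i++ {
        device, ret := nvml.DeviceGetHandleByIndex(i)
        if ret != nvml.SUCCESS {
            continue
        }
        util, ret := device.GetUtilizationRates()
        if ret != nvml.SUCCESS {
            continue
        }
        fmt.Printf("GPU %d: %d%% core, %d%% memory\n", i, util.Gpu, util.Memory)
    }
}

DCGM builds on this same data but adds cluster-wide collection, policies, and health checks; DCGM Exporter then turns that into a Prometheus endpoint.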

Core code (dcgm-exporter)

https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/cmd/app.go#L379

func startDCGMExporter(c *cli.Context) error {
    if err := configureLogger(c); err != nil {
        return err
    }

    for {
        // Create a new context for this run of the exporter
        // Runs are ended by various events (signals from OS or DCGM)
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        var version string
        if c != nil && c.App != nil {
            version = c.App.Version
        }

        slog.Info("Starting dcgm-exporter", slog.String("Version", version))

        config, err := contextToConfig(c)
        if err != nil {
            return err
        }

        err = prerequisites.Validate()
        if err != nil {
            return err
        }

        // Initialize DCGM Provider Instance
        dcgmprovider.Initialize(config)
        dcgmCleanup := dcgmprovider.Client().Cleanup

        // Initialize NVML Provider Instance
        nvmlprovider.Initialize()
        nvmlCleanup := nvmlprovider.Client().Cleanup

        slog.Info("DCGM successfully initialized!")
        slog.Info("NVML provider successfully initialized!")

        fillConfigMetricGroups(config)

        cs := getCounters(config)

        deviceWatchListManager := startDeviceWatchListManager(cs, config)

        hostname, err := hostname.GetHostname(config)
        if err != nil {
            nvmlCleanup()
            dcgmCleanup()
            return err
        }

        cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)

        cRegistry := registry.NewRegistry()
        for _, entityCollector := range cf.NewCollectors() {
            cRegistry.Register(entityCollector)
        }

        ch := make(chan string, 10)

        var wg sync.WaitGroup
        stop := make(chan interface{})

        wg.Add(1)

        server, cleanup, err := server.NewMetricsServer(config, ch, deviceWatchListManager, cRegistry)
        if err != nil {
            cRegistry.Cleanup()
            nvmlCleanup()
            dcgmCleanup()
            return err
        }

        go server.Run(ctx, stop, &wg)

        sigs := newOSWatcher(syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGHUP)

        go watchCollectorsFile(config.CollectorsFile, reloadMetricsServer(sigs))

        sig := <-sigs
        slog.Info("Received signal", slog.String("signal", sig.String()))
        close(stop)
        cancel() // Cancel the context for this iteration
        err = utils.WaitWithTimeout(&wg, time.Second*2)
        if err != nil {
            slog.Error(err.Error())
            cRegistry.Cleanup()
            nvmlCleanup()
            dcgmCleanup()
            cleanup()
            fatal()
        }

        // Call cleanup functions before continuing the loop
        cRegistry.Cleanup()
        nvmlCleanup()
        dcgmCleanup()
        cleanup()

        if sig != syscall.SIGHUP {
            return nil
        }

        // For SIGHUP, we'll continue the loop after cleanup
        slog.Info("Restarting dcgm-exporter after signal")
    }

    return nil
}

Understanding DCGM Exporter Structure

1. Initialization

// Initialize DCGM
dcgmprovider.Initialize(config)
dcgmCleanup := dcgmprovider.Client().Cleanup

// Initialize NVML
nvmlprovider.Initialize()
nvmlCleanup := nvmlprovider.Client().Cleanup

DCGM acts as the primary metric engine; NVML supplements with fine-grained details.

2. Collector and Metric Registration

fillConfigMetricGroups(config)
cs := getCounters(config)
deviceWatchListManager := startDeviceWatchListManager(cs, config)

cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)
cRegistry := registry.NewRegistry()
for _, entityCollector := range cf.NewCollectors() {
    cRegistry.Register(entityCollector)
}

This sets up the GPU metric groups to watch and registers each collector with the registry that backs the Prometheus metrics endpoint.

3. Restart Loop + Signal Handling

sig := <-sigs

cRegistry.Cleanup()
nvmlCleanup()
dcgmCleanup()
cleanup()

if sig == syscall.SIGHUP {
    // Enter restart loop
}

SIGINT, SIGTERM, and SIGQUIT terminate the exporter; SIGHUP triggers cleanup followed by a restart of the loop.
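
newOSWatcher itself is not shown in the excerpt above; a typical implementation (an assumption about its shape, not the actual dcgm-exporter source) simply wraps signal.Notify in a buffered channel:

import (
    "os"
    "os/signal"
)

// newOSWatcher returns a channel that receives the given OS signals.
// The buffer ensures a signal is not dropped while the main loop is busy.
func newOSWatcher(sigs ...os.Signal) chan os.Signal {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, sigs...)
    return sigChan
}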


How to use it?

Example: integrating DCGM Exporter into a Kubernetes operator

Core Concept

  • Deploy the exporter as a DaemonSet on GPU nodes (see the sketch after this list)
  • Configure it through a custom resource spec
  • Use Prometheus metrics to drive auto-scaling decisions
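
A minimal sketch of the DaemonSet piece, assuming a controller-runtime style operator and the hypothetical monitoringv1.DCGMExporter custom resource from the snippet below with an added Image field (the CR type, labels, and node-selector label are illustrative assumptions, not an upstream API):

import (
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildDaemonSet constructs the exporter DaemonSet from the custom resource spec.
func buildDaemonSet(cr *monitoringv1.DCGMExporter) *appsv1.DaemonSet {
    labels := map[string]string{"app": "dcgm-exporter"}
    return &appsv1.DaemonSet{
        ObjectMeta: metav1.ObjectMeta{
            Name:      cr.Name + "-dcgm-exporter",
            Namespace: cr.Namespace,
        },
        Spec: appsv1.DaemonSetSpec{
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    // Schedule only onto GPU nodes; this label is an assumption
                    // and should match your cluster's node-labelling scheme.
                    NodeSelector: map[string]string{"nvidia.com/gpu.present": "true"},
                    Containers: []corev1.Container{{
                        Name:  "dcgm-exporter",
                        Image: cr.Spec.Image, // e.g. an nvcr.io/nvidia/k8s/dcgm-exporter tag
                        Ports: []corev1.ContainerPort{{
                            Name:          "metrics",
                            ContainerPort: 9400, // dcgm-exporter's default listen port
                        }},
                    }},
                },
            },
        },
    }
}

The reconciler would then create or update this object with its client and point Prometheus at port 9400 (for example via a scrape config or a ServiceMonitor).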

Auto Scaling Logic Example

func (r *DCGMExporterReconciler) handleAutoScaling(ctx context.Context, dcgmExporter *monitoringv1.DCGMExporter) error {
    if !dcgmExporter.Spec.AutoScale.Enabled {
        return nil
    }

    current := r.getMetricCount(ctx, dcgmExporter)
    target := dcgmExporter.Spec.AutoScale.TargetMetricCount

    if current > target {
        return r.scaleUp(ctx, dcgmExporter)
    } else if current < target/2 {
        return r.scaleDown(ctx, dcgmExporter)
    }
    return nil
}
Enter fullscreen mode Exit fullscreen mode

Adjusts scale based on the number of metrics collected via Prometheus
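
getMetricCount is not defined in the snippet above; one way it could be implemented (a sketch, assuming the Prometheus Go client and the standard DCGM_FI_DEV_GPU_UTIL metric; the Prometheus address is a placeholder) is to count the exporter's GPU-utilization series:

import (
    "context"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// getMetricCount returns how many GPU-utilization series Prometheus currently
// sees from the exporter, or 0 if the query fails.
func (r *DCGMExporterReconciler) getMetricCount(ctx context.Context, cr *monitoringv1.DCGMExporter) int {
    client, err := api.NewClient(api.Config{Address: "http://prometheus-server:9090"}) // placeholder address
    if err != nil {
        return 0
    }
    promAPI := promv1.NewAPI(client)

    // Count the series exported for GPU utilization.
    result, _, err := promAPI.Query(ctx, `count(DCGM_FI_DEV_GPU_UTIL)`, time.Now())
    if err != nil {
        return 0
    }
    if vec, ok := result.(model.Vector); ok && len(vec) > 0 {
        return int(vec[0].Value)
    }
    return 0
}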


Summary

  • startDCGMExporter() is a self-contained loop that handles GPU metric collection, Prometheus server setup, and signal-based restarts.
  • A Kubernetes operator can manage this exporter and implement scaling logic based on observed metrics.
  • Collector configurations, GPU node discovery, and collection intervals can all be managed through the operator.

Ultimately, DCGM Exporter is a key enabler for dynamic scaling of GPU-intensive workloads.
