Why DCGM?
How can a custom Kubernetes operator enable auto-scaling based on GPU metrics?
Traditional HPA mechanisms are limited to CPU and memory metrics, which fall short for GPU-heavy workloads. By exporting GPU usage metrics to Prometheus, an operator can make scaling decisions in response to real-time GPU demand.
DCGM Exporter emerged as a practical solution to this challenge.
NVML vs DCGM vs DCGM Exporter
| Item | NVML | DCGM | DCGM Exporter |
| --- | --- | --- | --- |
| Level | Low-level (direct GPU access) | High-level (cluster-wide management) | Monitoring layer |
| Foundation | GPU driver | Built on NVML | DCGM + NVML |
| Functionality | Real-time GPU status | Batch collection, policies, health checks | Prometheus metrics endpoint |
| Purpose | Library/API calls | Management platform | Monitoring integration |
Core code (dcgm-exporter)
https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/cmd/app.go#L379
```go
func startDCGMExporter(c *cli.Context) error {
	if err := configureLogger(c); err != nil {
		return err
	}

	for {
		// Create a new context for this run of the exporter
		// Runs are ended by various events (signals from OS or DCGM)
		ctx, cancel := context.WithCancel(context.Background())
		defer cancel()

		var version string
		if c != nil && c.App != nil {
			version = c.App.Version
		}

		slog.Info("Starting dcgm-exporter", slog.String("Version", version))

		config, err := contextToConfig(c)
		if err != nil {
			return err
		}

		err = prerequisites.Validate()
		if err != nil {
			return err
		}

		// Initialize DCGM Provider Instance
		dcgmprovider.Initialize(config)
		dcgmCleanup := dcgmprovider.Client().Cleanup
		// Initialize NVML Provider Instance
		nvmlprovider.Initialize()
		nvmlCleanup := nvmlprovider.Client().Cleanup

		slog.Info("DCGM successfully initialized!")
		slog.Info("NVML provider successfully initialized!")

		fillConfigMetricGroups(config)

		cs := getCounters(config)

		deviceWatchListManager := startDeviceWatchListManager(cs, config)

		hostname, err := hostname.GetHostname(config)
		if err != nil {
			nvmlCleanup()
			dcgmCleanup()
			return err
		}

		cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)

		cRegistry := registry.NewRegistry()
		for _, entityCollector := range cf.NewCollectors() {
			cRegistry.Register(entityCollector)
		}

		ch := make(chan string, 10)

		var wg sync.WaitGroup
		stop := make(chan interface{})

		wg.Add(1)

		server, cleanup, err := server.NewMetricsServer(config, ch, deviceWatchListManager, cRegistry)
		if err != nil {
			cRegistry.Cleanup()
			nvmlCleanup()
			dcgmCleanup()
			return err
		}

		go server.Run(ctx, stop, &wg)

		sigs := newOSWatcher(syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGHUP)
		go watchCollectorsFile(config.CollectorsFile, reloadMetricsServer(sigs))

		sig := <-sigs
		slog.Info("Received signal", slog.String("signal", sig.String()))
		close(stop)
		cancel() // Cancel the context for this iteration

		err = utils.WaitWithTimeout(&wg, time.Second*2)
		if err != nil {
			slog.Error(err.Error())
			cRegistry.Cleanup()
			nvmlCleanup()
			dcgmCleanup()
			cleanup()
			fatal()
		}

		// Call cleanup functions before continuing the loop
		cRegistry.Cleanup()
		nvmlCleanup()
		dcgmCleanup()
		cleanup()

		if sig != syscall.SIGHUP {
			return nil
		}

		// For SIGHUP, we'll continue the loop after cleanup
		slog.Info("Restarting dcgm-exporter after signal")
	}

	return nil
}
```
Understanding DCGM Exporter Structure
1. Initialization
```go
// Initialize DCGM
dcgmprovider.Initialize(config)
dcgmCleanup := dcgmprovider.Client().Cleanup

// Initialize NVML
nvmlprovider.Initialize()
nvmlCleanup := nvmlprovider.Client().Cleanup
```
DCGM acts as the primary metric engine; NVML supplements with fine-grained details.
2. Collector and Metric Registration
```go
fillConfigMetricGroups(config)
cs := getCounters(config)
deviceWatchListManager := startDeviceWatchListManager(cs, config)

cf := collector.InitCollectorFactory(cs, deviceWatchListManager, hostname, config)
cRegistry := registry.NewRegistry()
for _, entityCollector := range cf.NewCollectors() {
	cRegistry.Register(entityCollector)
}
```
Sets up GPU metric groups and registers collectors with Prometheus registry.
3. Restart Loop + Signal Handling
```go
sig := <-sigs
cRegistry.Cleanup()
nvmlCleanup()
dcgmCleanup()
cleanup()

if sig == syscall.SIGHUP {
	// Enter restart loop
}
```
SIGINT/SIGTERM terminate the process after cleanup; SIGHUP triggers cleanup followed by reinitialization, so configuration changes take effect without restarting the container.
How to use?
e.g. Integrating DCGM Exporter into a Kubernetes Operator
Core Concept
- Deploy the exporter as a DaemonSet on GPU nodes
- Configure it through a custom resource spec
- Use Prometheus metrics to drive auto-scaling decisions
Auto Scaling Logic Example
```go
func (r *DCGMExporterReconciler) handleAutoScaling(ctx context.Context, dcgmExporter *monitoringv1.DCGMExporter) error {
	if !dcgmExporter.Spec.AutoScale.Enabled {
		return nil
	}

	current := r.getMetricCount(ctx, dcgmExporter)
	target := dcgmExporter.Spec.AutoScale.TargetMetricCount

	if current > target {
		return r.scaleUp(ctx, dcgmExporter)
	} else if current < target/2 {
		return r.scaleDown(ctx, dcgmExporter)
	}
	return nil
}
```
The reconciler adjusts scale based on the number of metrics collected via Prometheus: scale up above the target, scale down below half the target, and hold in between.
Summary
- `startDCGMExporter()` is a self-contained loop that handles GPU metric collection, Prometheus server setup, and signal-based restarts.
- A Kubernetes operator can manage this exporter and implement scaling logic based on observed metrics.
- Collector configurations, GPU node discovery, and collection intervals can all be managed through the operator.

Ultimately, DCGM Exporter is a key enabler for dynamic scaling of GPU-intensive workloads.