Pavan Madduri

Posted on Jul 2

Monitoring GPU Inference Containers on OKE with OpenTelemetry - What Prometheus Misses

#oci #kubernetes #gpu #observability

I had Prometheus + DCGM Exporter running on my OKE cluster. It gave me GPU utilization, memory usage, temperature. Basic stuff. What it didn't give me was the correlation between GPU metrics and inference performance: request latency, tokens per second, queue depth. Two different dashboards, two different time ranges, no easy way to connect "GPU hit 95% utilization" with "p99 latency spiked to 8 seconds."

That's what led me to build otel-gpu-receiver and adopt OpenTelemetry for GPU monitoring instead of the Prometheus-only approach.

What's Wrong With DCGM Exporter Alone

DCGM Exporter is solid for hardware metrics. It gives you:

DCGM_FI_DEV_GPU_UTIL: utilization percentage
DCGM_FI_DEV_FB_USED: framebuffer memory used
DCGM_FI_DEV_GPU_TEMP: temperature
DCGM_FI_DEV_POWER_USAGE: power draw

These tell you the GPU is busy. They don't tell you why, or whether "busy" means "serving requests efficiently" or "stuck loading a model." I need application-level metrics alongside GPU metrics:

Tokens/second - actual inference throughput
Request queue depth - are requests piling up?
Time to first token - user-perceived latency
Batch size - how well is continuous batching working?

With separate Prometheus endpoints for GPU and application metrics, correlating these requires manual PromQL joins. With OpenTelemetry, everything goes through one pipeline.
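To make the application-level metrics listed above concrete, here is a minimal sketch of how time to first token and tokens/second could be derived from per-request timestamps. The `RequestTiming` type and its field names are hypothetical, chosen for illustration; vLLM exposes its own metrics and does not use this structure.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) for one inference request. Hypothetical shape."""
    arrived: float       # request received by the server
    first_token: float   # first output token emitted
    finished: float      # last output token emitted
    output_tokens: int   # tokens generated for this request


def time_to_first_token(r: RequestTiming) -> float:
    """User-perceived latency before any output appears."""
    return r.first_token - r.arrived


def tokens_per_second(requests: list[RequestTiming]) -> float:
    """Aggregate throughput: total tokens over the wall-clock span covered."""
    if not requests:
        return 0.0
    start = min(r.arrived for r in requests)
    end = max(r.finished for r in requests)
    total = sum(r.output_tokens for r in requests)
    return total / (end - start) if end > start else 0.0


requests = [
    RequestTiming(arrived=0.0, first_token=0.5, finished=2.0, output_tokens=100),
    RequestTiming(arrived=1.0, first_token=1.25, finished=4.0, output_tokens=300),
]
print([time_to_first_token(r) for r in requests])  # [0.5, 0.25]
print(tokens_per_second(requests))                 # 100.0
```

The two requests overlap in time, which is why throughput is computed over the whole window rather than per request: with batching, per-request rates do not add up to what the GPU is actually delivering.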

The OpenTelemetry Stack on OKE

Here's what I run:

┌─────────────────────────┐
│  GPU Node               │
│  ┌───────────────────┐  │
│  │ vLLM Pod          │  │
│  │  └─ OTel SDK      │──┤──► OTel Collector ──► Backend
│  │     (traces +     │  │        (on each      (Grafana Cloud,
│  │      metrics)     │  │         node)         OCI APM, etc.)
│  └───────────────────┘  │
│  ┌───────────────────┐  │
│  │ otel-gpu-receiver │──┤──►
│  │  (NVML metrics)   │  │
│  └───────────────────┘  │
└─────────────────────────┘

1. Deploy the OTel Collector as a DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: config
              mountPath: /etc/otel
          args: ["--config=/etc/otel/config.yaml"]
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

2. Collector Config - GPU + Application Metrics Together

# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      # GPU metrics from otel-gpu-receiver
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

      # Scrape vLLM's Prometheus metrics and convert to OTel
      prometheus:
        config:
          scrape_configs:
            - job_name: 'vllm'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_app]
                  regex: vllm
                  action: keep

    processors:
      batch:
        timeout: 10s

      # Add OKE cluster metadata to all metrics
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.node.name
            - k8s.pod.name
            - k8s.namespace.name

    exporters:
      # Send to Grafana Cloud (or any OTel backend)
      otlphttp:
        endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
        headers:
          Authorization: "Basic ${GRAFANA_TOKEN}"

      # Also export to OCI APM
      otlphttp/oci:
        endpoint: https://apm-collector.us-ashburn-1.oci.oraclecloud.com/20200101/opentelemetry
        headers:
          Authorization: "dataKey ${OCI_APM_KEY}"

    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          # k8sattributes goes before batch so pod association still sees the original connection context
          processors: [k8sattributes, batch]
          exporters: [otlphttp, otlphttp/oci]

3. Deploy otel-gpu-receiver

This is the component I built. It reads NVIDIA GPU metrics via NVML (the same library nvidia-smi uses) and exports them as OpenTelemetry metrics:

helm install otel-gpu-receiver pmady/otel-gpu-receiver \
  --namespace monitoring \
  --set collector.endpoint="otel-collector.monitoring:4317" \
  --set scrapeInterval=15s

It runs as a DaemonSet on GPU nodes and pushes metrics to the OTel Collector on each node.
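For readers who want a feel for what "read NVML, emit metrics" involves, here is a rough sketch using the `pynvml` Python bindings. It is not the otel-gpu-receiver implementation: the metric names, the record layout, and the `to_data_points` helper are assumptions made for illustration, and the actual export step to the Collector is left out.

```python
from typing import Iterator


def to_data_points(sample: dict, node: str) -> list[dict]:
    """Flatten one per-GPU sample into metric-like records.

    Metric names are illustrative, not the receiver's actual names.
    """
    attrs = {"k8s.node.name": node, "gpu.index": sample["index"]}
    return [
        {"name": "gpu.utilization", "unit": "%", "value": sample["util_pct"], "attributes": attrs},
        {"name": "gpu.memory.used", "unit": "By", "value": sample["mem_used_bytes"], "attributes": attrs},
        {"name": "gpu.temperature", "unit": "Cel", "value": sample["temp_c"], "attributes": attrs},
        {"name": "gpu.power.draw", "unit": "W", "value": sample["power_mw"] / 1000.0, "attributes": attrs},
    ]


def read_gpus() -> Iterator[dict]:
    """Yield one sample per GPU via NVML. Needs an NVIDIA driver and pynvml."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            yield {
                "index": i,
                "util_pct": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
                "mem_used_bytes": pynvml.nvmlDeviceGetMemoryInfo(handle).used,
                "temp_c": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
                "power_mw": pynvml.nvmlDeviceGetPowerUsage(handle),  # NVML reports milliwatts
            }
    finally:
        pynvml.nvmlShutdown()


# Fake sample so the conversion can be exercised without hardware.
fake = {"index": 0, "util_pct": 95, "mem_used_bytes": 40 * 1024**3, "temp_c": 71, "power_mw": 285000}
points = to_data_points(fake, node="gpu-1")
print(points[3]["value"])  # 285.0
```

On a GPU node, `for sample in read_gpus(): to_data_points(sample, node)` would produce the same records from live hardware, typically on a timer that matches the scrape interval.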

Metrics I Actually Look At

After a few weeks of running this, these are the metrics I check daily:

GPU utilization vs. tokens/second: If utilization is high but throughput is flat, something is wrong (usually a batch size issue or memory pressure).

# In Grafana, overlay these two on the same panel
gpu_utilization{node="gpu-1"}
rate(vllm_generation_tokens_total{pod="vllm-0"}[5m])
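The same comparison can be turned into a simple check. The sketch below is one possible reading of "utilization is high but throughput is flat": utilization climbed into the high range over a window while tokens/second barely moved. The function name and both thresholds are assumptions, not values taken from the setup described here.

```python
def utilization_without_throughput(
    util_pct: list[float],
    tokens_per_sec: list[float],
    util_high: float = 90.0,
    min_gain: float = 0.05,
) -> bool:
    """True when utilization rose into the high range but throughput stayed flat.

    Both lists are samples over the same window, oldest first.
    """
    if len(util_pct) < 2 or len(tokens_per_sec) < 2:
        return False  # not enough data to compare start and end of the window
    busy = util_pct[-1] >= util_high
    util_rose = util_pct[-1] > util_pct[0]
    flat = tokens_per_sec[-1] <= tokens_per_sec[0] * (1.0 + min_gain)
    return busy and util_rose and flat


# Utilization climbs to 95% while throughput stays around 1000 tokens/s.
print(utilization_without_throughput([60, 80, 95], [1000, 1020, 1010]))  # True
# Utilization climbs and throughput climbs with it: healthy scaling.
print(utilization_without_throughput([60, 80, 95], [1000, 1400, 1800]))  # False
```

Requiring utilization to have risen avoids flagging a steady state in which the GPU is already saturated and throughput is simply at its ceiling.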

GPU memory vs. request queue: When GPU memory hits the limit, vLLM starts queuing. This is the first sign you need to either reduce --gpu-memory-utilization or add another replica.

Time to first token: The metric users actually feel. If this goes above 2 seconds, something needs attention.

Power draw: Not for alerting, but useful for cost estimation. I can correlate power draw with request volume to estimate per-request energy cost.
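The per-request energy estimate is simple arithmetic: average power over a window times the window length gives energy, and dividing by the number of requests served in that window gives energy per request. A minimal sketch, with made-up numbers and a hypothetical function name:

```python
def energy_per_request_wh(avg_power_watts: float, window_seconds: float, request_count: int) -> float:
    """Average energy per request in watt-hours over one window.

    Idle power is included, so the estimate rises when traffic is low.
    """
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    joules = avg_power_watts * window_seconds  # W * s = J
    return joules / 3600.0 / request_count     # J -> Wh, then per request


# A GPU averaging 300 W for one hour while serving 6000 requests.
print(energy_per_request_wh(300.0, 3600.0, 6000))  # 0.05
```

Multiplying the result by an electricity price per kWh (divided by 1000) turns it into a rough cost figure; it covers GPU power only, not the rest of the node.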

What I See That I Couldn't See Before

Last week GPU utilization dropped to 20% while request latency spiked to 5 seconds. With DCGM Exporter alone, I would've been confused: the GPU looks fine, so why is it slow?

With the combined OTel pipeline, I could see that the batch scheduler in vLLM was waiting for the next batch window. The model had just been loaded (I'd updated the deployment), and the KV cache was cold. Throughput recovered in about 30 seconds as the cache warmed up. Without the application metrics alongside the GPU metrics, I would have been debugging this for an hour.

OCI APM Integration

OCI has its own APM service that accepts OpenTelemetry data. The nice thing about sending metrics there is that you get OCI-native alerting and integration with OCI notifications:

# Create an alarm in OCI Monitoring
oci monitoring alarm create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "GPU inference latency high" \
  --metric-compartment-id $COMPARTMENT_ID \
  --namespace "custom_metrics" \
  --query-text 'vllm_e2e_request_latency_seconds[5m].percentile(0.99) > 5' \
  --severity CRITICAL \
  --is-enabled true \
  --destinations '["'$NOTIFICATION_TOPIC_ID'"]'

This alerts whatever is subscribed to the notification topic (PagerDuty, Slack, email) when p99 inference latency exceeds 5 seconds. The GPU metrics and application metrics are in the same namespace, so you can write alarms that reference both.
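For intuition about what a p99 threshold reacts to, here is a small sketch that computes a nearest-rank percentile over raw latency samples and compares it with the 5-second limit. Monitoring backends compute percentiles in their own way (often from histogram buckets), so this illustrates the idea rather than reproducing any backend's calculation; the sample values are made up.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based rank
    return ordered[rank - 1]


latencies = [0.8, 1.1, 0.9, 6.2, 1.0, 1.3, 0.7, 1.2, 0.95, 1.05]
p50 = percentile(latencies, 0.50)
p99 = percentile(latencies, 0.99)
print(p50)            # 1.0
print(p99, p99 > 5.0) # 6.2 True
```

One slow request out of ten is enough to push p99 past the threshold while p50 stays at 1 second, which is why the alarm watches the tail rather than the median.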

My Dashboard Panels

If you're setting this up, here's what I'd put on the Grafana dashboard:

GPU Utilization + Tokens/sec - dual-axis, should correlate
GPU Memory Used / Total - with a threshold line at 90%
Request Queue Depth - should be near zero during normal ops
Time to First Token (p50, p95, p99) - the metric that matters to users
Pod Restart Count - OOM kills show up here
GPU Temperature - more for curiosity, but useful for thermal throttling detection

The shift from "GPU dashboard + separate app dashboard" to "one unified dashboard" made debugging 10x faster. OpenTelemetry is the glue that makes it work.

Pavan Madduri - Oracle ACE Associate, CNCF Golden Kubestronaut. Author of otel-gpu-receiver.
