I had Prometheus + DCGM Exporter running on my OKE cluster. It gave me GPU utilization, memory usage, temperature. Basic stuff. What it didn't give me was the correlation between GPU metrics and inference performance: request latency, tokens per second, queue depth. Two different dashboards, two different time ranges, no easy way to connect "GPU hit 95% utilization" with "p99 latency spiked to 8 seconds."
That's what led me to build otel-gpu-receiver and adopt OpenTelemetry for GPU monitoring instead of the Prometheus-only approach.
What's Wrong With DCGM Exporter Alone
DCGM Exporter is solid for hardware metrics. It gives you:
- DCGM_FI_DEV_GPU_UTIL: utilization percentage
- DCGM_FI_DEV_FB_USED: framebuffer memory used
- DCGM_FI_DEV_GPU_TEMP: temperature
- DCGM_FI_DEV_POWER_USAGE: power draw
These tell you the GPU is busy. They don't tell you why, or whether "busy" means "serving requests efficiently" or "stuck loading a model." I need application-level metrics alongside GPU metrics:
- Tokens/second - actual inference throughput
- Request queue depth - are requests piling up?
- Time to first token - user-perceived latency
- Batch size - how well is continuous batching working?
With separate Prometheus endpoints for GPU and application metrics, correlating these requires manual PromQL joins. With OpenTelemetry, everything goes through one pipeline.
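To make "correlating" concrete, here is a minimal Python sketch of the join that otherwise has to be done by hand: two independently scraped series are floored to a common time bucket and matched on it. The sample values, the 15-second bucket width, and the function names are illustrative assumptions, not something taken from the pipeline itself.

```python
def bucket(ts, width=15):
    """Floor a Unix timestamp (seconds) to the start of its time bucket."""
    return int(ts // width) * width


def join_series(gpu, app, width=15):
    """Inner-join two lists of (timestamp, value) samples on time bucket.

    If several samples fall into the same bucket, the last one wins.
    """
    gpu_by_bucket = {bucket(ts, width): value for ts, value in gpu}
    app_by_bucket = {bucket(ts, width): value for ts, value in app}
    shared = sorted(gpu_by_bucket.keys() & app_by_bucket.keys())
    return [(b, gpu_by_bucket[b], app_by_bucket[b]) for b in shared]


# Hypothetical samples: GPU utilization (%) and p99 latency (s),
# scraped at slightly different moments.
gpu_util = [(1000.2, 62.0), (1015.1, 95.0), (1030.4, 96.0)]
p99_latency = [(1003.9, 1.2), (1018.7, 8.0), (1047.0, 1.1)]

joined = join_series(gpu_util, p99_latency)
for start, util, latency in joined:
    print(f"t={start}: util={util}% p99={latency}s")
```

Only buckets present in both series survive the join, which is exactly the failure mode of two dashboards with different scrape intervals: the third GPU sample has no latency partner and drops out.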
The OpenTelemetry Stack on OKE
Here's what I run:
┌─────────────────────────┐
│ GPU Node                │
│ ┌───────────────────┐   │
│ │ vLLM Pod          │   │
│ │  └─ OTel SDK      │───┼──► OTel Collector ──► Backend
│ │     (traces +     │   │    (on each           (Grafana Cloud,
│ │      metrics)     │   │     node)              OCI APM, etc.)
│ └───────────────────┘   │
│ ┌───────────────────┐   │
│ │ otel-gpu-receiver │───┼──►
│ │ (NVML metrics)    │   │
│ └───────────────────┘   │
└─────────────────────────┘
1. Deploy the OTel Collector as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: config
              mountPath: /etc/otel
          args: ["--config=/etc/otel/config.yaml"]
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
2. Collector Config - GPU + Application Metrics Together
# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      # GPU metrics from otel-gpu-receiver
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
      # Scrape vLLM's Prometheus metrics and convert to OTel
      prometheus:
        config:
          scrape_configs:
            - job_name: 'vllm'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_app]
                  regex: vllm
                  action: keep
    processors:
      batch:
        timeout: 10s
      # Add OKE cluster metadata to all metrics
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.node.name
            - k8s.pod.name
            - k8s.namespace.name
    exporters:
      # Send to Grafana Cloud (or any OTel backend)
      otlphttp:
        endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
        headers:
          Authorization: "Basic ${GRAFANA_TOKEN}"
      # Also export to OCI APM
      otlphttp/oci:
        endpoint: https://apm-collector.us-ashburn-1.oci.oraclecloud.com/20200101/opentelemetry
        headers:
          Authorization: "dataKey ${OCI_APM_KEY}"
    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          # k8sattributes goes before batch so metadata is attached before batching
          processors: [k8sattributes, batch]
          exporters: [otlphttp, otlphttp/oci]
3. Deploy otel-gpu-receiver
This is the component I built. It reads NVIDIA GPU metrics via NVML (the same library nvidia-smi uses) and exports them as OpenTelemetry metrics:
helm install otel-gpu-receiver pmady/otel-gpu-receiver \
--namespace monitoring \
--set collector.endpoint="otel-collector.monitoring:4317" \
--set scrapeInterval=15s
It runs as a DaemonSet on GPU nodes and pushes metrics to the OTel Collector on each node.
Metrics I Actually Look At
After a few weeks of running this, these are the metrics I check daily:
GPU utilization vs. tokens/second: If utilization is high but throughput is flat, something is wrong (usually a batch size issue or memory pressure).
# In Grafana, overlay these two on the same panel
gpu_utilization{node="gpu-1"}
rate(vllm_generation_tokens_total{pod="vllm-0"}[5m])
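The same check can be sketched outside Grafana. The Python below is a rough illustration, not what PromQL does internally: `counter_rate` is a simplified stand-in for `rate()` (no window extrapolation), and the thresholds of 90% utilization and 500 tokens/second are made-up values you would tune for your own model and hardware.

```python
def counter_rate(samples):
    """Per-second increase between consecutive (timestamp, counter) samples.

    A drop in the counter is treated as a reset (for example a pod restart),
    in which case the new value itself is taken as the increase.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        increase = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, increase / (t1 - t0)))
    return rates


def flag_busy_but_slow(util, rates, util_min=90.0, rate_max=500.0):
    """Timestamps where utilization is high while token throughput is low."""
    util_by_ts = dict(util)
    return [
        ts for ts, rate in rates
        if util_by_ts.get(ts, 0.0) >= util_min and rate < rate_max
    ]


# Hypothetical samples: cumulative generated tokens and GPU utilization (%).
tokens_total = [(0, 0), (60, 60000), (120, 120000), (180, 132000)]
gpu_util = [(60, 92.0), (120, 94.0), (180, 97.0)]

rates = counter_rate(tokens_total)
suspect = flag_busy_but_slow(gpu_util, rates)
print(rates)
print(suspect)
```

In this made-up data the GPU stays above 90% for all three intervals, but throughput falls from 1000 to 200 tokens/second in the last one, so only that interval is flagged.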
GPU memory vs. request queue: When GPU memory hits the limit, vLLM starts queuing. This is the first sign you need to either reduce --gpu-memory-utilization or add another replica.
Time to first token: The metric users actually feel. If this goes above 2 seconds, something needs attention.
Power draw: Not for alerting, but useful for cost estimation. I can correlate power draw with request volume to estimate per-request energy cost.
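The per-request energy estimate is simple arithmetic: integrate power over time to get energy, then divide by the number of requests served in that window. A minimal Python sketch, assuming power samples in watts at known timestamps and a request count taken from the application metrics (all numbers below are invented):

```python
def energy_joules(power_samples):
    """Trapezoidal integration of (timestamp_s, watts) samples into joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(power_samples, power_samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total


def energy_per_request_wh(power_samples, request_count):
    """Average watt-hours per request over the sampled window."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return energy_joules(power_samples) / 3600.0 / request_count


# Hypothetical one-minute window: power sampled every 15 s, 40 requests served.
power = [(0, 300.0), (15, 320.0), (30, 340.0), (45, 320.0), (60, 300.0)]

joules = energy_joules(power)
wh_per_request = energy_per_request_wh(power, 40)
print(f"{joules:.0f} J total, {wh_per_request:.4f} Wh per request")
```

This attributes all GPU power in the window to the requests served, including idle draw, so it is an upper-bound style estimate rather than a precise per-request figure.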
What I See That I Couldn't See Before
Last week GPU utilization dropped to 20% while request latency spiked to 5 seconds. With DCGM Exporter alone, I would've been confused: the GPU looks fine, so why is it slow?
With the combined OTel pipeline, I could see that the batch scheduler in vLLM was waiting for the next batch window. The model had just been loaded (I'd updated the deployment), and the KV cache was cold. Throughput recovered in about 30 seconds as the cache warmed up. Without the application metrics alongside the GPU metrics, I would have been debugging this for an hour.
OCI APM Integration
OCI has its own APM service that accepts OpenTelemetry data. The nice thing about sending metrics there is that you get OCI-native alerting and integration with OCI notifications:
# Create an alarm in OCI Monitoring
oci monitoring alarm create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "GPU inference latency high" \
  --metric-compartment-id $COMPARTMENT_ID \
  --namespace "custom_metrics" \
  --query-text 'vllm_e2e_request_latency_seconds[5m].percentile(0.99) > 5' \
  --severity CRITICAL \
  --is-enabled true \
  --destinations '["'$NOTIFICATION_TOPIC_ID'"]'
This notifies whatever is subscribed to the topic (PagerDuty, Slack, email) when p99 inference latency exceeds 5 seconds. The GPU metrics and application metrics are in the same namespace, so you can write alarms that reference both.
My Dashboard Panels
If you're setting this up, here's what I'd put on the Grafana dashboard:
- GPU Utilization + Tokens/sec - dual-axis, should correlate
- GPU Memory Used / Total - with a threshold line at 90%
- Request Queue Depth - should be near zero during normal ops
- Time to First Token (p50, p95, p99) - the metric that matters to users
- Pod Restart Count - OOM kills show up here
- GPU Temperature - more for curiosity, but useful for thermal throttling detection
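For the Time to First Token panel, the percentiles come from histogram buckets rather than raw samples. The Python sketch below shows the general idea behind a bucket-based quantile estimate (linear interpolation inside the bucket that contains the target rank, in the spirit of PromQL's `histogram_quantile`). The bucket bounds and counts are invented, and this is an illustration of the technique rather than the exact algorithm any particular backend runs.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with math.inf as the last bound. Interpolates linearly within the bucket
    that contains the target rank.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT histogram: upper bounds in seconds, cumulative request counts.
ttft_buckets = [(0.5, 50), (1.0, 90), (2.0, 98), (5.0, 100), (math.inf, 100)]

p50 = histogram_quantile(0.50, ttft_buckets)
p95 = histogram_quantile(0.95, ttft_buckets)
p99 = histogram_quantile(0.99, ttft_buckets)
print(f"p50={p50:.3f}s p95={p95:.3f}s p99={p99:.3f}s")
```

The estimate is only as fine as the bucket boundaries: with a 2 s to 5 s bucket, a p99 anywhere in that range is interpolated, so tight thresholds such as "above 2 seconds" want a bucket edge at or near that value.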
The shift from "GPU dashboard + separate app dashboard" to "one unified dashboard" made debugging 10x faster. OpenTelemetry is the glue that makes it work.
Pavan Madduri - Oracle ACE Associate, CNCF Golden Kubestronaut. Author of otel-gpu-receiver.