DEV Community

David Mail

Posted on • Originally published at ingero.io

GPU Incident Response in 60 Seconds: An SRE's Guide to eBPF-Based GPU Observability

TL;DR

You get paged at 3am: the GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green — but the job is 3x slower than expected, and you have zero tools to diagnose it. Ingero gives you a causal chain in 60 seconds: monitoring agents were fighting the DataLoader workers for host CPU, starving the GPU. You fix it with taskset and go back to sleep — without waking the ML engineer.


The 3am Page Every GPU SRE Dreads

Your PagerDuty fires:

[CRITICAL] GPU Training Pipeline SLA Breached
Cluster: prod-gpu-01 (8x H100)
Job: nightly-retraining-v3
Expected completion: 02:00 UTC
Current status: 47% complete at 03:12 UTC

You open your monitoring stack:

Datadog GPU Dashboard:

GPU Utilization:  95%  ✅
GPU Memory:       78%  ✅
GPU Temperature:  72°C ✅
Power Draw:       680W ✅

Grafana (DCGM Exporter):

dcgm_gpu_utilization:     0.95  ✅
dcgm_fb_used:             62GB  ✅
dcgm_sm_clock:            1980MHz ✅

nvidia-smi:

+-----------------+----------+-----------------+
| GPU  Name       | GPU-Util | Memory-Usage    |
|=================+==========+=================|
|   0  H100 SXM   |   97%    | 62000MiB / 80GB |
+-----------------+----------+-----------------+

Every single dashboard says the GPU is fine. You have a breached SLA and zero signal to work with.

This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:

  1. Wake the ML engineer (who'll spend 2 hours adding print statements)
  2. Restart the job and hope it goes faster (it won't)
  3. Stare at dashboards that all say green

Why GPU Dashboards Lie to SREs

Every GPU monitoring tool in your stack — Datadog, Grafana, DCGM, nvidia-smi — reports the same underlying metric: "did the GPU have at least one kernel scheduled?"

That metric is useless for diagnosis. It's like monitoring a restaurant by checking "is someone sitting at each table?" without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and your dashboard would still say "97% utilized."
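The gap between "scheduled" and "busy" is easy to see in a toy model. This sketch (simulated numbers, not real GPU counters) polls a timeline where a 1ms kernel runs once every 10ms but something is always queued — exactly the shape of a stalled data pipeline:

```python
# Toy model: why "utilization" misleads. The counter answers "was any
# kernel scheduled this sample?" — not "how much compute actually ran?"
SAMPLE_MS = 1000
busy = [t % 10 == 0 for t in range(SAMPLE_MS)]  # actually executing 1ms of every 10ms
queued = [True] * SAMPLE_MS                     # something is always scheduled/pending

utilization = sum(queued) / SAMPLE_MS  # what nvidia-smi-style counters report
compute = sum(busy) / SAMPLE_MS        # the compute you actually get

print(f"reported utilization: {utilization:.0%}, real compute: {compute:.0%}")
# -> reported utilization: 100%, real compute: 10%
```

A dashboard sampling this "GPU" reports 100% utilized while 90% of the compute capacity goes unused.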

The real problems that cause GPU SLA breaches are host-side:

  • CPU scheduling contention starving the data pipeline
  • DataLoader workers preempted by monitoring agents (ironic)
  • Memory pressure causing page faults in the data loading path
  • Disk I/O bottlenecks blocking the next training batch
  • Network retransmits stalling distributed training

These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. Your GPU dashboards are structurally blind to the most common causes of GPU performance degradation.
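You can sanity-check the scheduling side of this with nothing but procfs. A minimal sketch (the helper name is mine; point it at the training PID): a fast-climbing nonvoluntary_ctxt_switches count means the kernel is preempting the process — the contention signature no GPU dashboard shows.

```python
import os

def ctxt_switches(pid: int) -> dict:
    """Read voluntary/nonvoluntary context-switch counters from /proc (Linux only)."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if "ctxt_switches" in line:
                key, value = line.split(":")
                counts[key.strip()] = int(value.strip())
    return counts

# Example: inspect the current process; substitute the training PID in practice.
print(ctxt_switches(os.getpid()))
```

High voluntary switches mean the process is waiting on I/O or locks; high nonvoluntary switches mean it is being preempted — i.e., CPU contention.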

Enter Ingero: 60-Second GPU Incident Response

Ingero is an eBPF-based observability agent that traces both sides — CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.

It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.

Here's what incident response looks like:

Step 1: Get the causal chain (10 seconds)

$ ingero explain --since 1h

System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak
  Contributing: 4 DataLoader workers + prometheus-node-exporter
                + fluent-bit competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 monitoring agents

There it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Your monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.

nvidia-smi said 97% because kernels were queued — but the pipeline was running at half speed.
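A quick back-of-envelope check that the numbers hang together: the process being off-CPU 62 of 120 seconds bounds the pipeline at roughly half speed, before the DataLoader worker stalls stack on top.

```python
# Relating off-CPU time to pipeline slowdown (numbers from the trace above).
wall = 120.0     # seconds observed
off_cpu = 62.0   # seconds the training process was not running

on_cpu_frac = (wall - off_cpu) / wall  # ~48.3% on-CPU
slowdown = 1 / on_cpu_frac             # ~2.1x implied slowdown, a lower bound

print(f"on-CPU fraction: {on_cpu_frac:.1%}, implied slowdown: {slowdown:.1f}x")
```

That lower bound already explains most of the SLA breach; worker stalls and sync amplification account for the rest.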

Step 2: Identify the culprits (20 seconds)

$ ingero explain --per-process --since 1h
Process Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
python3 (train.py) PID 3821:
  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms
  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs
  ⚠ Off-CPU: 62.0s / 120s (51.7%)
  ⚠ Context switches: 14,504

pt_data_worker:0 PID 3822:
  ⚠ Off-CPU: 31.4s / 120s (26.2%)
  ⚠ Worst stall: 609ms

prometheus-node-exporter PID 1205:
  ⚠ Context switches: 3,201
  ⚠ CPU stolen: 8.7s

The training process and all 4 DataLoader workers are fighting for CPU with your monitoring stack. The worst single scheduling stall is 609ms — over half a second where a data worker was frozen while the GPU sat idle.

Step 3: Fix it (30 seconds)

# Re-pin the running training process (PID 3821) to dedicated cores
# (leave 2 for monitoring + OS)
$ kubectl exec -it gpu-training-pod -- taskset -cp 0-5 3821

# Or: deprioritize the already-running monitoring agent
$ kubectl exec -it monitoring-pod -- renice -n 19 -p 1205
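Where you control the training entrypoint, the same pinning can be done from inside Python so it survives restarts. A sketch using only the standard library (core IDs are illustrative, Linux only):

```python
import os

# Mirror `taskset -c 0-5` from inside the process itself.
# Intersect with the cores actually available so this is safe on small hosts.
desired = {0, 1, 2, 3, 4, 5} & os.sched_getaffinity(0)
if desired:
    os.sched_setaffinity(0, desired)

print("running on cores:", sorted(os.sched_getaffinity(0)))
```

Calling this at the top of train.py keeps the training process off the cores reserved for monitoring and the OS, without any wrapper command.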

Or better yet, make it permanent in the training pod spec — with the kubelet's static CPU manager policy, equal integer CPU requests and limits give the pod exclusive cores:

# training pod
resources:
  limits:
    cpu: "6"
  requests:
    cpu: "6"

After the fix:

  • Context switches on training: 14,504 → 890
  • cudaStreamSync p99: 7.2ms → 45µs
  • Pipeline throughput: restored to expected rate
  • SLA: back on track. ML engineer: still sleeping.

AI-Assisted Investigation with MCP

Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. If your team uses Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly.

This turns a 2-hour debugging session into a 30-second conversation.

Deployment: Familiar SRE Patterns

Ingero deploys like any other observability agent in your K8s stack:

# Helm install (DaemonSet + RBAC)
helm install ingero ./deploy/helm/ingero \
  --set prometheus.enabled=true \
  --set otlp.enabled=true

# Or standalone
sudo ./bin/ingero trace --stack --prometheus :9090

What you get:

  • DaemonSet: Runs on every GPU node automatically
  • Prometheus /metrics: GPU latency percentiles, causal chain counts — plug into your existing Grafana
  • OTLP export: Send traces to your existing backend (Jaeger, Tempo, Honeycomb)
  • MCP server: AI-assisted investigation via Claude, Cursor, etc.
  • SQLite local storage: 10GB rolling, auto-prunes old events. No external database needed
  • Pod metadata: Enriches events with K8s pod name, namespace, container ID

It slots into your existing monitoring stack. No rip-and-replace.

What Ingero Sees That DCGM Can't

| Signal | DCGM / nvidia-smi | Ingero |
|---|---|---|
| GPU utilization % | Yes (misleading) | Yes (with causal context) |
| Per-CUDA-call latency | No | Yes (p50/p95/p99 for every API call) |
| CPU scheduling delays | No | Yes (sched_switch tracepoints) |
| DataLoader worker stalls | No | Yes (per-process off-CPU time) |
| Memory pressure → GPU impact | No | Yes (mm_page_alloc + CUDA correlation) |
| Disk I/O → GPU stalls | No | Yes (block_rq + CUDA correlation) |
| Network → distributed training | No | Yes (tcp_retransmit + CUDA correlation) |
| Root cause chain | No | Yes (automated causal chains with fix recommendations) |
| Python source line attribution | No | Yes (CPython frame extraction with --stack) |

The SRE Value Proposition

For SREs managing GPU infrastructure, Ingero answers three questions:

  1. Incident response: "Why is the GPU slow right now?" → Causal chain in 60 seconds
  2. Capacity planning: "Are we actually using these GPUs efficiently?" → Real compute efficiency, not nvidia-smi lies
  3. Cost attribution: "Which team's workload is causing contention?" → Per-process, per-namespace breakdown

You don't need to understand CUDA or ML model architectures. Ingero translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.

Try It Yourself

No GPU required to see the pattern:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident          # See a causal chain form in real-time
./bin/ingero demo cpu-contention    # CPU scheduling causing GPU stalls

For real GPU tracing:

sudo ./bin/ingero check              # Verify system compatibility
sudo ./bin/ingero trace --stack      # Start tracing (runs continuously)
./bin/ingero explain --since 5min    # See causal chains

Ingero is open-source (Apache 2.0), deploys as a K8s DaemonSet, and traces CUDA APIs via standard Linux kernel uprobes. No NVIDIA SDK, no code changes, <2% overhead. Production-safe by design.

GitHub: github.com/ingero-io/ingero
