DEV Community

David Mail

Posted on • Originally published at ingero.io

GPU Incident Response in 60 Seconds: An SRE's Guide to eBPF-Based GPU Observability

TL;DR

You get paged at 3am: the GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green — but the job is 3x slower than expected, and you have zero tools to diagnose it. Ingero gives you a causal chain in 60 seconds: monitoring agents were fighting the DataLoader workers for host CPU, starving the GPU. You fix it with taskset and go back to sleep — without waking the ML engineer.


The 3am Page Every GPU SRE Dreads

Your PagerDuty fires:

[CRITICAL] GPU Training Pipeline SLA Breached
Cluster: prod-gpu-01 (8x H100)
Job: nightly-retraining-v3
Expected completion: 02:00 UTC
Current status: 47% complete at 03:12 UTC

You open your monitoring stack:

Datadog GPU Dashboard:

GPU Utilization:  95%  ✅
GPU Memory:       78%  ✅
GPU Temperature:  72°C ✅
Power Draw:       680W ✅

Grafana (DCGM Exporter):

dcgm_gpu_utilization:     0.95  ✅
dcgm_fb_used:             62GB  ✅
dcgm_sm_clock:            1980MHz ✅

nvidia-smi:

+-----------------+----------+-----------------+
| GPU  Name       | GPU-Util | Memory-Usage    |
|=================+==========+=================|
|   0  H100 SXM   |   97%    | 62000MiB / 80GB |
+-----------------+----------+-----------------+

Every single dashboard says the GPU is fine. You have a breached SLA and zero signal to work with.

This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:

  1. Wake the ML engineer (who'll spend 2 hours adding print statements)
  2. Restart the job and hope it goes faster (it won't)
  3. Stare at dashboards that all say green

Why GPU Dashboards Lie to SREs

Every GPU monitoring tool in your stack — Datadog, Grafana, DCGM, nvidia-smi — reports the same underlying metric: "did the GPU have at least one kernel scheduled?"

That metric is useless for diagnosis. It's like monitoring a restaurant by checking "is someone sitting at each table?" without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and your dashboard would still say "97% utilized."
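The gap between "scheduled" and "busy" is easy to see in a toy model. This sketch (simulated numbers, not real GPU counters) polls a timeline where a 1ms kernel runs once every 10ms but something is always queued — exactly the shape of a stalled data pipeline:

```python
# Toy model: why "utilization" misleads. The counter answers "was any
# kernel scheduled this sample?" — not "how much compute actually ran?"
SAMPLE_MS = 1000
busy = [t % 10 == 0 for t in range(SAMPLE_MS)]  # actually executing 1ms of every 10ms
queued = [True] * SAMPLE_MS                     # something is always scheduled/pending

utilization = sum(queued) / SAMPLE_MS  # what nvidia-smi-style counters report
compute = sum(busy) / SAMPLE_MS        # the compute you actually get

print(f"reported utilization: {utilization:.0%}, real compute: {compute:.0%}")
# -> reported utilization: 100%, real compute: 10%
```

A dashboard sampling this "GPU" reports 100% utilized while 90% of the compute capacity goes unused.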

The real problems that cause GPU SLA breaches are host-side:

  • CPU scheduling contention starving the data pipeline
  • DataLoader workers preempted by monitoring agents (ironic)
  • Memory pressure causing page faults in the data loading path
  • Disk I/O bottlenecks blocking the next training batch
  • Network retransmits stalling distributed training

These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. Your GPU dashboards are structurally blind to the most common causes of GPU performance degradation.
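You can sanity-check the scheduling side of this with nothing but procfs. A minimal sketch (the helper name is mine; point it at the training PID): a fast-climbing nonvoluntary_ctxt_switches count means the kernel is preempting the process — the contention signature no GPU dashboard shows.

```python
import os

def ctxt_switches(pid: int) -> dict:
    """Read voluntary/nonvoluntary context-switch counters from /proc (Linux only)."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if "ctxt_switches" in line:
                key, value = line.split(":")
                counts[key.strip()] = int(value.strip())
    return counts

# Example: inspect the current process; substitute the training PID in practice.
print(ctxt_switches(os.getpid()))
```

High voluntary switches mean the process is waiting on I/O or locks; high nonvoluntary switches mean it is being preempted — i.e., CPU contention.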

Enter Ingero: 60-Second GPU Incident Response

Ingero is an eBPF-based observability agent that traces both sides — CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.

It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.

Here's what incident response looks like:

Step 1: Get the causal chain (10 seconds)

$ ingero explain --since 1h

System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak
  Contributing: 4 DataLoader workers + prometheus-node-exporter
                + fluent-bit competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 monitoring agents

There it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Your monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.

nvidia-smi said 97% because kernels were queued — but the pipeline was running at half speed.
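A quick back-of-envelope check that the numbers hang together: the process being off-CPU 62 of 120 seconds bounds the pipeline at roughly half speed, before the DataLoader worker stalls stack on top.

```python
# Relating off-CPU time to pipeline slowdown (numbers from the trace above).
wall = 120.0     # seconds observed
off_cpu = 62.0   # seconds the training process was not running

on_cpu_frac = (wall - off_cpu) / wall  # ~48.3% on-CPU
slowdown = 1 / on_cpu_frac             # ~2.1x implied slowdown, a lower bound

print(f"on-CPU fraction: {on_cpu_frac:.1%}, implied slowdown: {slowdown:.1f}x")
```

That lower bound already explains most of the SLA breach; worker stalls and sync amplification account for the rest.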

Step 2: Identify the culprits (20 seconds)

$ ingero explain --per-process --since 1h
Process Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
python3 (train.py) PID 3821:
  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms
  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs
  ⚠ Off-CPU: 62.0s / 120s (51.7%)
  ⚠ Context switches: 14,504

pt_data_worker:0 PID 3822:
  ⚠ Off-CPU: 31.4s / 120s (26.2%)
  ⚠ Worst stall: 609ms

prometheus-node-exporter PID 1205:
  ⚠ Context switches: 3,201
  ⚠ CPU stolen: 8.7s

The training process and all 4 DataLoader workers are fighting for CPU with your monitoring stack. The worst single scheduling stall is 609ms — over half a second where a data worker was frozen while the GPU sat idle.

Step 3: Fix it (30 seconds)

# Re-pin the running training process (PID 3821) to dedicated cores
# (leave 2 for monitoring + OS)
$ kubectl exec -it gpu-training-pod -- taskset -cp 0-5 3821

# Or: deprioritize the already-running monitoring agent
$ kubectl exec -it monitoring-pod -- renice -n 19 -p 1205
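Where you control the training entrypoint, the same pinning can be done from inside Python so it survives restarts. A sketch using only the standard library (core IDs are illustrative, Linux only):

```python
import os

# Mirror `taskset -c 0-5` from inside the process itself.
# Intersect with the cores actually available so this is safe on small hosts.
desired = {0, 1, 2, 3, 4, 5} & os.sched_getaffinity(0)
if desired:
    os.sched_setaffinity(0, desired)

print("running on cores:", sorted(os.sched_getaffinity(0)))
```

Calling this at the top of train.py keeps the training process off the cores reserved for monitoring and the OS, without any wrapper command.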

Or better yet, make it permanent in the training pod spec — with the kubelet's static CPU manager policy, equal integer CPU requests and limits give the pod exclusive cores:

# training pod
resources:
  limits:
    cpu: "6"
  requests:
    cpu: "6"

After the fix:

  • Context switches on training: 14,504 → 890
  • cudaStreamSync p99: 7.2ms → 45µs
  • Pipeline throughput: restored to expected rate
  • SLA: back on track. ML engineer: still sleeping.

AI-Assisted Investigation with MCP

Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. If your team uses Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly.

This turns a 2-hour debugging session into a 30-second conversation.

Deployment: Familiar SRE Patterns

Ingero deploys like any other observability agent in your K8s stack:

# Helm install (DaemonSet + RBAC)
helm install ingero ./deploy/helm/ingero \
  --set prometheus.enabled=true \
  --set otlp.enabled=true

# Or standalone
sudo ./bin/ingero trace --stack --prometheus :9090

What you get:

  • DaemonSet: Runs on every GPU node automatically
  • Prometheus /metrics: GPU latency percentiles, causal chain counts — plug into your existing Grafana
  • OTLP export: Send traces to your existing backend (Jaeger, Tempo, Honeycomb)
  • MCP server: AI-assisted investigation via Claude, Cursor, etc.
  • SQLite local storage: 10GB rolling, auto-prunes old events. No external database needed
  • Pod metadata: Enriches events with K8s pod name, namespace, container ID

It slots into your existing monitoring stack. No rip-and-replace.

What Ingero Sees That DCGM Can't

| Signal | DCGM / nvidia-smi | Ingero |
|---|---|---|
| GPU utilization % | Yes (misleading) | Yes (with causal context) |
| Per-CUDA-call latency | No | Yes (p50/p95/p99 for every API call) |
| CPU scheduling delays | No | Yes (sched_switch tracepoints) |
| DataLoader worker stalls | No | Yes (per-process off-CPU time) |
| Memory pressure → GPU impact | No | Yes (mm_page_alloc + CUDA correlation) |
| Disk I/O → GPU stalls | No | Yes (block_rq + CUDA correlation) |
| Network → distributed training | No | Yes (tcp_retransmit + CUDA correlation) |
| Root cause chain | No | Yes (automated causal chains with fix recommendations) |
| Python source line attribution | No | Yes (CPython frame extraction with --stack) |

The SRE Value Proposition

For SREs managing GPU infrastructure, Ingero answers three questions:

  1. Incident response: "Why is the GPU slow right now?" → Causal chain in 60 seconds
  2. Capacity planning: "Are we actually using these GPUs efficiently?" → Real compute efficiency, not nvidia-smi lies
  3. Cost attribution: "Which team's workload is causing contention?" → Per-process, per-namespace breakdown

You don't need to understand CUDA or ML model architectures. Ingero translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.

Try It Yourself

No GPU required to see the pattern:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident          # See a causal chain form in real-time
./bin/ingero demo cpu-contention    # CPU scheduling causing GPU stalls

For real GPU tracing:

sudo ./bin/ingero check              # Verify system compatibility
sudo ./bin/ingero trace --stack      # Start tracing (runs continuously)
./bin/ingero explain --since 5min    # See causal chains

Ingero is open-source (Apache 2.0), deploys as a K8s DaemonSet, and traces CUDA APIs via standard Linux kernel uprobes. No NVIDIA SDK, no code changes, <2% overhead. Production-safe by design.

GitHub: github.com/ingero-io/ingero
