The hidden performance tax
You bought GPUs with NVLink interconnects. You're probably not using them effectively.
NVLink provides high-bandwidth, low-latency communication between GPUs—up to 900 GB/s on modern hardware compared to ~64 GB/s over PCIe. For distributed training workloads, this difference is massive. Gradient synchronization, tensor parallelism, and model sharding all depend on fast GPU-to-GPU communication.
Here's the problem: Kubernetes doesn't know NVLink exists.
The default scheduler sees GPUs as interchangeable resources. Request 4 GPUs, get any 4 GPUs. But on a node with 8 GPUs arranged in two NVLink domains of 4 GPUs each, placement matters enormously. Four GPUs within the same NVLink domain can communicate at full NVLink speed. Four GPUs split across domains fall back to slower PCIe interconnects.
We measured up to 40% degradation in multi-GPU communication performance from suboptimal placement. That is a tax of up to 40% on every distributed training job, paid invisibly, every time.
NVLink topology primer
Modern GPU nodes organize GPUs into NVLink domains—groups of GPUs with direct high-speed interconnects. A typical configuration:
- Developer nodes (G1): 2 GPUs, 1 NVLink domain
- Production nodes (G2): 4 GPUs, 2 NVLink domains (2 GPUs each)
- High-performance nodes (G2-Expansion): 8 GPUs, 2 NVLink domains (4 GPUs each)
Within a domain, GPUs communicate directly over NVLink. Across domains, traffic falls back to PCIe or other host interconnects, which are significantly slower.
The scheduling goal is simple: keep workloads within NVLink domain boundaries whenever possible.
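To make that concrete, here is a minimal sketch, in Go (the language Kubernetes scheduler plugins are written in), of how a node's NVLink layout might be represented. The type and field names are illustrative assumptions, not the plugin's actual API.

```go
// Illustrative types for a node's NVLink layout. Names are hypothetical,
// not taken from the plugin's source.

// NVLinkDomain is a group of GPUs with direct NVLink connectivity.
type NVLinkDomain struct {
	ID        int
	TotalGPUs int
	Allocated int // GPUs currently claimed by running pods
}

// Available returns the number of free GPUs in the domain.
func (d NVLinkDomain) Available() int { return d.TotalGPUs - d.Allocated }

// NodeTopology describes one node's GPU layout.
type NodeTopology struct {
	NodeName string
	Domains  []NVLinkDomain
}

// The node classes above would map to layouts like:
//   G1:           1 domain  x 2 GPUs
//   G2:           2 domains x 2 GPUs
//   G2-Expansion: 2 domains x 4 GPUs
```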
The solution: NVLink-aware scoring
We built a Kubernetes scheduler plugin that scores nodes based on NVLink topology awareness. It operates in the Score phase of the scheduling cycle, evaluating candidate nodes after filtering and before final selection.
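As a rough sketch of where that hook lives, a Score-phase plugin built on the scheduling framework looks roughly like the following. Only the framework interfaces are standard; the plugin name, the lookupTopology helper, and the scoreNode function (sketched later in this post) are assumptions, reusing the NodeTopology type from the primer above.

```go
// Minimal sketch of a Score-phase plugin on the Kubernetes scheduling
// framework. Plugin name and helpers are illustrative, not the real plugin.
package nvlinkaware

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "NVLinkAware" // assumed plugin name

type NVLinkAware struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &NVLinkAware{}

func (p *NVLinkAware) Name() string { return Name }

// Score runs once per candidate node that survived the Filter phase.
func (p *NVLinkAware) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	requested := requestedGPUs(pod)
	// Hypothetical helper: static domain layout plus DCGM-backed allocation data.
	topo, err := p.lookupTopology(ctx, nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	return scoreNode(topo, requested), nil // scoring terms sketched below
}

// ScoreExtensions would return a NormalizeScore implementation that maps raw
// totals (such as 250 or 40 in the example below) into the 0-100 range the
// framework expects; elided in this sketch.
func (p *NVLinkAware) ScoreExtensions() framework.ScoreExtensions { return nil }

// requestedGPUs sums the pod's nvidia.com/gpu limits (simplified).
func requestedGPUs(pod *v1.Pod) int {
	total := 0
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			total += int(q.Value())
		}
	}
	return total
}
```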
The plugin integrates with DCGM (Data Center GPU Manager) via Prometheus to track real-time GPU allocations, then applies a scoring algorithm that considers:
- Domain integrity — Can the workload fit within a single NVLink domain?
- Node-workload matching — Is this the right-sized node for this workload?
- Domain completion — Does this placement fill a partially-used domain?
- Resource efficiency — What's the node utilization after placement?
Scoring algorithm
Each node starts with a base score of 50 points, then receives adjustments.
Domain Integrity (primary signal)
This is the core optimization. Keeping GPUs within the same NVLink domain is the whole point.
- Workload fits single domain: +50 points
- Exact domain fit (e.g., 2 GPUs on a 2-GPU domain): +30 bonus
- Cross-domain placement: -20 per additional domain
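Using the NVLinkDomain type from the earlier sketch, the domain-integrity term could be written roughly as follows. The point values are the ones listed above; the structure of the function is an illustrative guess, not the plugin's actual code.

```go
// domainIntegrityScore rewards placements that fit inside one NVLink domain
// and penalizes placements that have to span several.
func domainIntegrityScore(domains []NVLinkDomain, requested int) int {
	// Best case: some single domain has enough free GPUs.
	for _, d := range domains {
		if d.Available() >= requested {
			score := 50 // workload fits a single domain
			if d.TotalGPUs == requested {
				score += 30 // exact domain fit, e.g. 2 GPUs on a 2-GPU domain
			}
			return score
		}
	}
	// Otherwise count how many domains the workload has to span.
	spanned, remaining := 0, requested
	for _, d := range domains {
		if remaining <= 0 {
			break
		}
		if d.Available() > 0 {
			spanned++
			remaining -= d.Available()
		}
	}
	if remaining > 0 {
		return -100 // node cannot hold the workload at all (illustrative sentinel)
	}
	return -20 * (spanned - 1) // -20 per additional domain
}
```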
Node-Workload Matching
Right-size workloads to nodes. Don't waste an 8-GPU node on a 1-GPU job.
- Small workloads (1–2 GPUs) on small NVLink nodes: +60 to +80 points
- Small workloads on oversized nodes: -40 points
- Large workloads (5+ GPUs) on large nodes: +80 to +100 points
- Large workloads on undersized nodes: -80 points
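A sketch of the matching term, with the same caveat that the tiering below is one plausible reading of the ranges above:

```go
// nodeMatchScore rewards right-sized nodes and penalizes mismatches.
func nodeMatchScore(topo NodeTopology, requested int) int {
	nodeGPUs := 0
	for _, d := range topo.Domains {
		nodeGPUs += d.TotalGPUs
	}
	switch {
	case requested <= 2 && nodeGPUs <= 2:
		return 80 // small workload on a small NVLink node
	case requested <= 2 && nodeGPUs <= 4:
		return 60
	case requested <= 2:
		return -40 // small workload on an oversized node
	case requested >= 5 && nodeGPUs >= 8:
		return 100 // large workload on a large node
	case requested >= 5 && nodeGPUs >= requested:
		return 80
	case requested >= 5:
		return -80 // large workload on an undersized node
	default:
		return 0 // mid-sized workloads stay neutral in this sketch
	}
}
```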
Domain Completion
Bin-pack within domains before spilling to new ones.
- Placement completes a partially-filled domain: +40 points
Resource Efficiency
Favor high utilization to reduce fragmentation.
- High utilization (>70%) after placement: +20 to +30 points
- Low utilization (<30%): -20 points
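Sketching the last two terms and the way everything combines, again with illustrative code and the point values listed above:

```go
// completionScore rewards placements that top off a partially-filled domain,
// so fragments get consumed before a fresh domain is opened.
func completionScore(domains []NVLinkDomain, requested int) int {
	for _, d := range domains {
		if d.Allocated > 0 && d.Available() == requested {
			return 40 // this placement completes a partially-filled domain
		}
	}
	return 0
}

// efficiencyScore looks at node GPU utilization after the placement.
func efficiencyScore(topo NodeTopology, requested int) int {
	total, allocated := 0, 0
	for _, d := range topo.Domains {
		total += d.TotalGPUs
		allocated += d.Allocated
	}
	util := float64(allocated+requested) / float64(total)
	switch {
	case util >= 1.0:
		return 30
	case util > 0.7:
		return 20 // high utilization after placement
	case util < 0.3:
		return -20 // node left mostly idle
	default:
		return 0
	}
}

// scoreNode combines the terms with the base score of 50. The raw total is
// what the worked example below shows, before normalization to 0-100.
func scoreNode(topo NodeTopology, requested int) int64 {
	score := 50 // base
	score += domainIntegrityScore(topo.Domains, requested)
	score += nodeMatchScore(topo, requested)
	score += completionScore(topo.Domains, requested)
	score += efficiencyScore(topo, requested)
	return int64(score)
}
```

Plugged into this sketch, the two placements from the example below come out to raw totals of 250 and 40, matching the breakdowns shown.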
Real-time allocation tracking
Static scoring isn't enough. The plugin needs to know which GPUs are currently allocated to make intelligent placement decisions.
We integrate with DCGM metrics exposed via Prometheus. This tells us which GPUs have active workloads and which pods own them. The plugin reconstructs domain state in real-time:
Domain 0: total=2, allocated=1, available=1
Domain 1: total=2, allocated=0, available=2
When a new 1-GPU workload arrives, the plugin recognizes that placing it in Domain 0 completes that domain (+40 bonus), while placing it in Domain 1 would leave a stranded GPU in each domain.
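A hedged sketch of that lookup, using the official Prometheus Go client against dcgm-exporter metrics, follows. The metric name (DCGM_FI_DEV_GPU_UTIL) and the Hostname, pod, and gpu labels are common dcgm-exporter defaults, but they depend on how the exporter is deployed, so treat them as assumptions.

```go
// Sketch of rebuilding per-node GPU allocation from dcgm-exporter metrics in
// Prometheus. Metric and label names vary by exporter configuration.
package nvlinkaware

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// allocatedGPUs returns the GPU indices on nodeName that currently have a
// pod attached, according to dcgm-exporter's pod label.
func allocatedGPUs(ctx context.Context, promAddr, nodeName string) (map[string]bool, error) {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return nil, err
	}
	q := fmt.Sprintf(`DCGM_FI_DEV_GPU_UTIL{Hostname=%q, pod!=""}`, nodeName)
	result, _, err := promv1.NewAPI(client).Query(ctx, q, time.Now())
	if err != nil {
		return nil, err
	}
	vector, ok := result.(model.Vector)
	if !ok {
		return nil, fmt.Errorf("unexpected result type %T", result)
	}
	allocated := make(map[string]bool)
	for _, sample := range vector {
		allocated[string(sample.Metric["gpu"])] = true // "gpu" is the device index label
	}
	return allocated, nil
}
```

From the allocated set, a static GPU-index-to-domain map per node type (assumed to be known ahead of time) yields the per-domain view shown above.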
Example: Domain completion in action
Consider a 2-GPU node (single NVLink domain) with one GPU already allocated.
Incoming workload: 1 GPU
Scoring breakdown:
Base: 50
Domain integrity: +50 (single domain)
Node matching: +80 (perfect fit)
Domain completion: +40 (completes the domain)
Efficiency: +30 (100% utilization)
───────────────────────
Final: 250 → normalized: 92/100
Compare to placing the same workload on an empty 8-GPU node:
Base: 50
Domain integrity: +50 (single domain)
Node matching: -40 (oversized node)
Domain completion: 0 (not completing anything)
Efficiency: -20 (12.5% utilization)
───────────────────────
Final: 40 → normalized: 31/100
The 2-GPU node wins decisively, preserving the 8-GPU node for workloads that actually need it.
What we learned
Topology awareness compounds. Avoiding that up-to-40% communication penalty isn't just about individual job performance; it's about cluster-wide efficiency. Better placement means less fragmentation, which means more workloads scheduled successfully, which means higher overall utilization.
DCGM integration is essential. Without real-time allocation data, the plugin would make decisions based on stale information. The Prometheus integration adds minimal overhead but provides critical visibility.
Scoring weights need tuning. Different clusters have different workload mixes. A cluster dominated by small jobs might want stronger penalties for oversized placement. We exposed the key parameters for operator customization.
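As one possible shape for that, the weights could be decoded from the plugin's args in the scheduler configuration; the struct and field names below are hypothetical, not the plugin's published schema.

```go
// Hypothetical plugin arguments, decoded from the pluginConfig section of a
// KubeSchedulerConfiguration. Field names are illustrative only.
type NVLinkAwareArgs struct {
	BaseScore            int `json:"baseScore"`            // default 50
	SingleDomainBonus    int `json:"singleDomainBonus"`    // default 50
	ExactFitBonus        int `json:"exactFitBonus"`        // default 30
	CrossDomainPenalty   int `json:"crossDomainPenalty"`   // default 20 per extra domain
	OversizedNodePenalty int `json:"oversizedNodePenalty"` // default 40
	CompletionBonus      int `json:"completionBonus"`      // default 40
}
```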