The hidden performance tax
You bought GPUs with NVLink interconnects. You're probably not using them effectively.
NVLink provides high-bandwidth, low-latency communication between GPUs—up to 900 GB/s on modern hardware compared to ~64 GB/s over PCIe. For distributed training workloads, this difference is massive. Gradient synchronization, tensor parallelism, and model sharding all depend on fast GPU-to-GPU communication.
Here's the problem: Kubernetes doesn't know NVLink exists.
The default scheduler sees GPUs as interchangeable resources. Request 4 GPUs, get any 4 GPUs. But on a node with 8 GPUs arranged in two NVLink domains of 4 GPUs each, placement matters enormously. Four GPUs within the same NVLink domain can communicate at full NVLink speed. Four GPUs split across domains fall back to slower PCIe interconnects.
We measured up to 40% degradation in multi-GPU communication performance from suboptimal placement. That is a tax of up to 40% on every distributed training job, paid invisibly, every time.
NVLink topology primer
Modern GPU nodes organize GPUs into NVLink domains—groups of GPUs with direct high-speed interconnects. A typical configuration:
- Developer nodes (G1): 2 GPUs, 1 NVLink domain
- Production nodes (G2): 4 GPUs, 2 NVLink domains (2 GPUs each)
- High-performance nodes (G2-Expansion): 8 GPUs, 2 NVLink domains (4 GPUs each)
Within a domain, GPUs communicate directly over NVLink. Across domains, traffic falls back to PCIe or other host interconnects, which are significantly slower.
The scheduling goal is simple: keep workloads within NVLink domain boundaries whenever possible.
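To make that concrete, here is a minimal sketch, in Go (the language Kubernetes scheduler plugins are written in), of how a node's NVLink layout might be represented. The type and field names are illustrative assumptions, not the plugin's actual API.

```go
// Illustrative types for a node's NVLink layout. Names are hypothetical,
// not taken from the plugin's source.

// NVLinkDomain is a group of GPUs with direct NVLink connectivity.
type NVLinkDomain struct {
	ID        int
	TotalGPUs int
	Allocated int // GPUs currently claimed by running pods
}

// Available returns the number of free GPUs in the domain.
func (d NVLinkDomain) Available() int { return d.TotalGPUs - d.Allocated }

// NodeTopology describes one node's GPU layout.
type NodeTopology struct {
	NodeName string
	Domains  []NVLinkDomain
}

// The node classes above would map to layouts like:
//   G1:           1 domain  x 2 GPUs
//   G2:           2 domains x 2 GPUs
//   G2-Expansion: 2 domains x 4 GPUs
```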
The solution: NVLink-aware scoring
We built a Kubernetes scheduler plugin that scores nodes based on NVLink topology awareness. It operates in the Score phase of the scheduling cycle, evaluating candidate nodes after filtering and before final selection.
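As a rough sketch of where that hook lives, a Score-phase plugin built on the scheduling framework looks roughly like the following. Only the framework interfaces are standard; the plugin name, the lookupTopology helper, and the scoreNode function (sketched later in this post) are assumptions, reusing the NodeTopology type from the primer above.

```go
// Minimal sketch of a Score-phase plugin on the Kubernetes scheduling
// framework. Plugin name and helpers are illustrative, not the real plugin.
package nvlinkaware

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "NVLinkAware" // assumed plugin name

type NVLinkAware struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &NVLinkAware{}

func (p *NVLinkAware) Name() string { return Name }

// Score runs once per candidate node that survived the Filter phase.
func (p *NVLinkAware) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	requested := requestedGPUs(pod)
	// Hypothetical helper: static domain layout plus DCGM-backed allocation data.
	topo, err := p.lookupTopology(ctx, nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	return scoreNode(topo, requested), nil // scoring terms sketched below
}

// ScoreExtensions would return a NormalizeScore implementation that maps raw
// totals (such as 250 or 40 in the example below) into the 0-100 range the
// framework expects; elided in this sketch.
func (p *NVLinkAware) ScoreExtensions() framework.ScoreExtensions { return nil }

// requestedGPUs sums the pod's nvidia.com/gpu limits (simplified).
func requestedGPUs(pod *v1.Pod) int {
	total := 0
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			total += int(q.Value())
		}
	}
	return total
}
```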
The plugin integrates with DCGM (Data Center GPU Manager) via Prometheus to track real-time GPU allocations, then applies a scoring algorithm that considers:
- Domain integrity — Can the workload fit within a single NVLink domain?
- Node-workload matching — Is this the right-sized node for this workload?
- Domain completion — Does this placement fill a partially-used domain?
- Resource efficiency — What's the node utilization after placement?
Scoring algorithm
Each node starts with a base score of 50 points, then receives adjustments.
Domain Integrity (primary signal)
This is the core optimization. Keeping GPUs within the same NVLink domain is the whole point.
- Workload fits single domain: +50 points
- Exact domain fit (e.g., 2 GPUs on a 2-GPU domain): +30 bonus
- Cross-domain placement: -20 per additional domain
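Using the NVLinkDomain type from the earlier sketch, the domain-integrity term could be written roughly as follows. The point values are the ones listed above; the structure of the function is an illustrative guess, not the plugin's actual code.

```go
// domainIntegrityScore rewards placements that fit inside one NVLink domain
// and penalizes placements that have to span several.
func domainIntegrityScore(domains []NVLinkDomain, requested int) int {
	// Best case: some single domain has enough free GPUs.
	for _, d := range domains {
		if d.Available() >= requested {
			score := 50 // workload fits a single domain
			if d.TotalGPUs == requested {
				score += 30 // exact domain fit, e.g. 2 GPUs on a 2-GPU domain
			}
			return score
		}
	}
	// Otherwise count how many domains the workload has to span.
	spanned, remaining := 0, requested
	for _, d := range domains {
		if remaining <= 0 {
			break
		}
		if d.Available() > 0 {
			spanned++
			remaining -= d.Available()
		}
	}
	if remaining > 0 {
		return -100 // node cannot hold the workload at all (illustrative sentinel)
	}
	return -20 * (spanned - 1) // -20 per additional domain
}
```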
Node-Workload Matching
Right-size workloads to nodes. Don't waste an 8-GPU node on a 1-GPU job.
- Small workloads (1–2 GPUs) on small NVLink nodes: +60 to +80 points
- Small workloads on oversized nodes: -40 points
- Large workloads (5+ GPUs) on large nodes: +80 to +100 points
- Large workloads on undersized nodes: -80 points
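A sketch of the matching term, with the same caveat that the tiering below is one plausible reading of the ranges above:

```go
// nodeMatchScore rewards right-sized nodes and penalizes mismatches.
func nodeMatchScore(topo NodeTopology, requested int) int {
	nodeGPUs := 0
	for _, d := range topo.Domains {
		nodeGPUs += d.TotalGPUs
	}
	switch {
	case requested <= 2 && nodeGPUs <= 2:
		return 80 // small workload on a small NVLink node
	case requested <= 2 && nodeGPUs <= 4:
		return 60
	case requested <= 2:
		return -40 // small workload on an oversized node
	case requested >= 5 && nodeGPUs >= 8:
		return 100 // large workload on a large node
	case requested >= 5 && nodeGPUs >= requested:
		return 80
	case requested >= 5:
		return -80 // large workload on an undersized node
	default:
		return 0 // mid-sized workloads stay neutral in this sketch
	}
}
```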
Domain Completion
Bin-pack within domains before spilling to new ones.
- Placement completes a partially-filled domain: +40 points
Resource Efficiency
Favor high utilization to reduce fragmentation.
- High utilization (>70%) after placement: +20 to +30 points
- Low utilization (<30%): -20 points
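Sketching the last two terms and the way everything combines, again with illustrative code and the point values listed above:

```go
// completionScore rewards placements that top off a partially-filled domain,
// so fragments get consumed before a fresh domain is opened.
func completionScore(domains []NVLinkDomain, requested int) int {
	for _, d := range domains {
		if d.Allocated > 0 && d.Available() == requested {
			return 40 // this placement completes a partially-filled domain
		}
	}
	return 0
}

// efficiencyScore looks at node GPU utilization after the placement.
func efficiencyScore(topo NodeTopology, requested int) int {
	total, allocated := 0, 0
	for _, d := range topo.Domains {
		total += d.TotalGPUs
		allocated += d.Allocated
	}
	util := float64(allocated+requested) / float64(total)
	switch {
	case util >= 1.0:
		return 30
	case util > 0.7:
		return 20 // high utilization after placement
	case util < 0.3:
		return -20 // node left mostly idle
	default:
		return 0
	}
}

// scoreNode combines the terms with the base score of 50. The raw total is
// what the worked example below shows, before normalization to 0-100.
func scoreNode(topo NodeTopology, requested int) int64 {
	score := 50 // base
	score += domainIntegrityScore(topo.Domains, requested)
	score += nodeMatchScore(topo, requested)
	score += completionScore(topo.Domains, requested)
	score += efficiencyScore(topo, requested)
	return int64(score)
}
```

Plugged into this sketch, the two placements from the example below come out to raw totals of 250 and 40, matching the breakdowns shown.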
Real-time allocation tracking
Static scoring isn't enough. The plugin needs to know which GPUs are currently allocated to make intelligent placement decisions.
We integrate with DCGM metrics exposed via Prometheus. This tells us which GPUs have active workloads and which pods own them. The plugin reconstructs domain state in real-time:
Domain 0: total=2, allocated=1, available=1
Domain 1: total=2, allocated=0, available=2
When a new 1-GPU workload arrives, the plugin recognizes that placing it in Domain 0 completes that domain (+40 bonus), while placing it in Domain 1 would leave a stranded GPU in each domain.
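A hedged sketch of that lookup, using the official Prometheus Go client against dcgm-exporter metrics, follows. The metric name (DCGM_FI_DEV_GPU_UTIL) and the Hostname, pod, and gpu labels are common dcgm-exporter defaults, but they depend on how the exporter is deployed, so treat them as assumptions.

```go
// Sketch of rebuilding per-node GPU allocation from dcgm-exporter metrics in
// Prometheus. Metric and label names vary by exporter configuration.
package nvlinkaware

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// allocatedGPUs returns the GPU indices on nodeName that currently have a
// pod attached, according to dcgm-exporter's pod label.
func allocatedGPUs(ctx context.Context, promAddr, nodeName string) (map[string]bool, error) {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return nil, err
	}
	q := fmt.Sprintf(`DCGM_FI_DEV_GPU_UTIL{Hostname=%q, pod!=""}`, nodeName)
	result, _, err := promv1.NewAPI(client).Query(ctx, q, time.Now())
	if err != nil {
		return nil, err
	}
	vector, ok := result.(model.Vector)
	if !ok {
		return nil, fmt.Errorf("unexpected result type %T", result)
	}
	allocated := make(map[string]bool)
	for _, sample := range vector {
		allocated[string(sample.Metric["gpu"])] = true // "gpu" is the device index label
	}
	return allocated, nil
}
```

From the allocated set, a static GPU-index-to-domain map per node type (assumed to be known ahead of time) yields the per-domain view shown above.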
Example: Domain completion in action
Consider a 2-GPU node (single NVLink domain) with one GPU already allocated.
Incoming workload: 1 GPU
Scoring breakdown:
Base: 50
Domain integrity: +50 (single domain)
Node matching: +80 (perfect fit)
Domain completion: +40 (completes the domain)
Efficiency: +30 (100% utilization)
───────────────────────
Final: 250 → normalized: 92/100
Compare to placing the same workload on an empty 8-GPU node:
Base: 50
Domain integrity: +50 (single domain)
Node matching: -40 (oversized node)
Domain completion: 0 (not completing anything)
Efficiency: -20 (12.5% utilization)
───────────────────────
Final: 40 → normalized: 31/100
The 2-GPU node wins decisively, preserving the 8-GPU node for workloads that actually need it.
What we learned
Topology awareness compounds. Avoiding that up-to-40% communication penalty isn't just about individual job performance; it's about cluster-wide efficiency. Better placement means less fragmentation, which means more workloads scheduled successfully, which means higher overall utilization.
DCGM integration is essential. Without real-time allocation data, the plugin would make decisions based on stale information. The Prometheus integration adds minimal overhead but provides critical visibility.
Scoring weights need tuning. Different clusters have different workload mixes. A cluster dominated by small jobs might want stronger penalties for oversized placement. We exposed the key parameters for operator customization.
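As one possible shape for that, the weights could be decoded from the plugin's args in the scheduler configuration; the struct and field names below are hypothetical, not the plugin's published schema.

```go
// Hypothetical plugin arguments, decoded from the pluginConfig section of a
// KubeSchedulerConfiguration. Field names are illustrative only.
type NVLinkAwareArgs struct {
	BaseScore            int `json:"baseScore"`            // default 50
	SingleDomainBonus    int `json:"singleDomainBonus"`    // default 50
	ExactFitBonus        int `json:"exactFitBonus"`        // default 30
	CrossDomainPenalty   int `json:"crossDomainPenalty"`   // default 20 per extra domain
	OversizedNodePenalty int `json:"oversizedNodePenalty"` // default 40
	CompletionBonus      int `json:"completionBonus"`      // default 40
}
```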