nvidia-peermem "Invalid argument" on Ubuntu — Fix GPUDirect RDMA with DMA-BUF
TL;DR: If modprobe nvidia-peermem fails with Invalid argument (-EINVAL) on a system using the inbox Ubuntu InfiniBand stack (rdma-core), the module is not broken and you do not need it. nvidia-peermem requires an API that only exists in MLNX_OFED. On Hopper/Blackwell GPUs with the NVIDIA open driver, use DMA-BUF instead — it does GPUDirect RDMA natively. The one gotcha: you must enable nvidia-drm modeset=1.
Applies to: Ubuntu 22.04 / 24.04, inbox rdma-core stack, NVIDIA open kernel driver, H100 / H200 / B200, ConnectX-6/7 (or any HCA with ODP support).
The symptom
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
In dmesg, the module either loads without registering anything or the load fails with -EINVAL. Either way, GPUDirect RDMA appears to be unavailable.
Why this happens (and why it is not a bug)
nvidia-peermem is the legacy path for GPUDirect RDMA. It registers GPU memory with the InfiniBand subsystem through a Mellanox-proprietary kernel API:
ib_register_peer_memory_client()
That symbol only exists in MLNX_OFED's build of ib_core. It is not in the mainline kernel, and it is not in rdma-core, which is the inbox InfiniBand stack on Ubuntu.
If you are on the inbox stack, nvidia-peermem was compiled without that API present, so it can never bind and always returns Invalid argument. No module parameter or config change will fix it, because the thing it needs was never there.
Do not install MLNX_OFED just to make nvidia-peermem load. That works, but it is the wrong fix — you would be adding a heavy proprietary stack to revive an obsolete module. There is a native path already in your kernel.
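To confirm which case you are in, you can look for the peer-memory symbol in the running kernel's symbol table. The sketch below is illustrative only: the function name has_peer_memory_api is made up for this example, and it assumes ib_core is already loaded, since /proc/kallsyms only lists symbols from loaded modules.

```shell
#!/usr/bin/env bash
# Report whether the running kernel exposes the peer-memory API that
# nvidia-peermem needs. The symbol-table path is a parameter so the
# function can be pointed at a saved copy; it defaults to /proc/kallsyms.
has_peer_memory_api() {
    local symtab="${1:-/proc/kallsyms}"
    # -w matches the whole symbol name, -q suppresses output.
    grep -qw 'ib_register_peer_memory_client' "$symtab" 2>/dev/null
}

if has_peer_memory_api; then
    echo "peer-memory API present"
else
    echo "peer-memory API absent"
fi
```

On an inbox rdma-core stack this should print "peer-memory API absent", which matches the Invalid argument result from modprobe.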
The fix: use DMA-BUF
On Hopper and newer with the open driver, GPUDirect RDMA works through DMA-BUF, a mainline Linux framework. No external module, no MLNX_OFED.
Requirements (check these first)
- NVIDIA open kernel driver (not the proprietary build)
- nvidia-drm modeset=1 enabled ← most common missing piece
- Kernel built with:
  CONFIG_DMA_SHARED_BUFFER=y
  CONFIG_HMM_MIRROR=y
  CONFIG_INFINIBAND_ON_DEMAND_PAGING=y
- ib_umem_dmabuf symbols present in ib_uverbs
- HCA with ODP support (ConnectX-6/7 have it)
- Hopper or newer GPU (H100 / H200 / B200)
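The kernel options in the list above can be checked against the config file Ubuntu ships alongside each kernel. This is a minimal sketch: the function name check_dmabuf_kconfig is hypothetical, and it assumes the config lives at /boot/config-<release>, which is the Ubuntu convention but may differ on custom kernels.

```shell
#!/usr/bin/env bash
# Check that each kernel option the DMA-BUF path relies on is built in (=y).
# Assumes the Ubuntu convention of a config file at /boot/config-<release>;
# pass another path as the first argument if yours lives elsewhere.
check_dmabuf_kconfig() {
    local cfg="${1:-/boot/config-$(uname -r)}"
    local missing=0 opt
    for opt in CONFIG_DMA_SHARED_BUFFER CONFIG_HMM_MIRROR \
               CONFIG_INFINIBAND_ON_DEMAND_PAGING; do
        # -x requires the whole line to match, so "=m" or "is not set" fail.
        if grep -qx "${opt}=y" "$cfg" 2>/dev/null; then
            echo "ok       ${opt}"
        else
            echo "MISSING  ${opt}"
            missing=1
        fi
    done
    return "$missing"
}

check_dmabuf_kconfig || echo "one or more required options are not set to y"
```

The function returns non-zero if any option is missing, so it can gate a provisioning script. It does not cover the ib_umem_dmabuf symbols or the driver flavor, which still need to be checked separately.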
Step 1 — Enable nvidia-drm modeset
Check current state:
cat /sys/module/nvidia_drm/parameters/modeset
If it returns N, DMA-BUF export is inactive. Enable it:
# Runtime (the unload fails if nvidia_drm is in use, e.g. by a display server; stop those users first)
sudo modprobe -r nvidia_drm && sudo modprobe nvidia_drm modeset=1
# Persistent across reboots
echo 'options nvidia-drm modeset=1' | sudo tee /etc/modprobe.d/nvidia-drm-modeset.conf
sudo update-initramfs -u
Re-check that the parameter now reads Y.
Step 2 — Verify GPUDirect RDMA actually works
Do not trust "it should work now." Confirm the full path: allocate GPU memory, export it as a DMA-BUF file descriptor, register it with the HCA.
The three calls that must succeed:
- cudaMalloc(): allocate GPU memory
- cuMemGetHandleForAddressRange() with CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD: export as a DMA-BUF fd
- ibv_reg_dmabuf_mr(): register that fd with the InfiniBand HCA
If all three return success, GPU memory is directly addressable by the HCA over DMA-BUF and GPUDirect RDMA is working. nvidia-peermem is not needed.
Summary
| | Legacy (nvidia-peermem) | Modern (DMA-BUF) |
|---|---|---|
| Requires MLNX_OFED | Yes | No |
| External module | Yes | No |
| Works on inbox rdma-core | No | Yes |
| Supported GPUs | All | Hopper+ |
| NVIDIA recommendation | Deprecated | Preferred |
If nvidia-peermem fails with Invalid argument on an inbox stack, that is expected. Enable nvidia-drm modeset=1, use DMA-BUF, verify with the three-call test above.
Related symptoms worth checking on the same box
- All IB ports stuck in INIT, LID 0 → no Subnet Manager on the fabric. Start one: sudo apt install opensm && sudo systemctl start opensm. Ports go Active within seconds.
- One port Down/Polling at SDR while others are Active → check the switch side by directed route. If both ends are polling, it is physical (cable / transceiver / seat), not software. Reseat or swap.
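Both symptoms above show up in the per-port sysfs files, so a quick survey of every port can tell them apart. The following is a rough sketch rather than a diagnostic tool: ib_port_states is a hypothetical helper name, and it assumes the usual sysfs layout of /sys/class/infiniband/<device>/ports/<n>/ with state and phys_state files containing values such as "2: INIT" and "5: LinkUp".

```shell
#!/usr/bin/env bash
# Print "<device>/<port> <state> <phys_state>" for every InfiniBand port.
# Reads the kernel's sysfs files, whose contents look like "2: INIT" and
# "5: LinkUp". The root directory is a parameter only so the function can
# be exercised against a copy; it defaults to /sys/class/infiniband.
ib_port_states() {
    local root="${1:-/sys/class/infiniband}"
    local port dev state phys
    for port in "$root"/*/ports/*; do
        [ -r "$port/state" ] || continue      # no HCA or unreadable port
        dev=$(basename "$(dirname "$(dirname "$port")")")
        # Strip the leading numeric code, keeping only the state name.
        state=$(sed 's/^[0-9]*: *//' "$port/state")
        phys=$(sed 's/^[0-9]*: *//' "$port/phys_state" 2>/dev/null)
        echo "${dev}/$(basename "$port") ${state} ${phys}"
    done
}

ib_port_states
```

A port reported as INIT with LinkUp has a working link but no Subnet Manager, while DOWN with Polling points at the physical layer.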