I spent an entire afternoon debugging a VM that refused to boot, only to find out my GPU had decided to change its PCI address. One reboot and the device that lived at 01:00.0 suddenly migrated to 02:00.0. Because my Proxmox VM configuration was pinned to the old address, the VM crashed with a QEMU assertion error, and the GPU simply vanished from the guest.
This usually comes down to how the BIOS enumerates PCIe devices during POST. With multiple PCIe devices or a complex motherboard topology, bus numbering isn't guaranteed to be deterministic, so a device can land on a different bus from one boot to the next. This is compounded by AMD Ryzen C-states or weird UMA frame buffer settings that can delay device initialization, so the firmware enumerates devices in a different order than the previous boot and the kernel reports different addresses. If you've already dealt with AMD iGPU RAM theft, you know how sensitive these BIOS settings are.
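Before changing anything, it's worth confirming where the card currently lives. A quick check on the Proxmox host (assuming an NVIDIA card; adjust the grep for your vendor):

```bash
# Find the GPU's current PCI address on the host
lspci -nn | grep -i nvidia

# Illustrative output; the "01:00.0" prefix is the part that can shift between boots:
# 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ...
```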
If you're on Proxmox 8.4+, the "happy path" is to use the q35 machine type. The older i440fx emulates a legacy PCI-only chipset, so passthrough devices land on an emulated PCI bus and are more prone to these mapping failures and IRQ conflicts; q35 gives the guest a proper PCIe topology. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario, where the card is physically there but logically dead.
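You can see what a VM is currently configured with before touching it, using Proxmox's own qm tool (replace <VMID> with your VM's ID):

```bash
# Show the VM's current machine type and any passthrough entries
qm config <VMID> | grep -E '^(machine|hostpci)'
```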
To stabilize this, I switched the VM to q35 and explicitly enabled PCIe mode for the passthrough device. I also added a kernel parameter to stop the CPU from entering deep sleep states, which I've found reduces the randomness of the PCIe bus scan (the boot-config change is sketched after the commands below).
```bash
# 1. Change the VM to the q35 machine type for native PCIe support
qm set <VMID> --machine q35

# 2. Pass through the GPU with pcie=1 so the guest treats it as a PCIe device
#    Replace <PCI_ADDRESS> with your current address (e.g., 0000:01:00.0)
qm set <VMID> -hostpci0 <PCI_ADDRESS>,pcie=1

# 3. Stop the GPU from entering D3cold (which can cause boot-time instability)
#    Run this on the Proxmox host; note that sysfs writes do not survive a reboot
echo 0 > /sys/bus/pci/devices/<PCI_ADDRESS>/d3cold_allowed
```
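The kernel parameter I mentioned lives in the host's boot configuration. Which parameter you need depends on your platform; limiting C-states with processor.max_cstate=1 is one common option, but treat this as a starting point rather than a universal fix:

```bash
# On the Proxmox host, edit /etc/default/grub and extend the existing line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

# Then regenerate the boot config (GRUB-based installs) and reboot
update-grub
```

If your host boots via systemd-boot (common with ZFS root installs), run proxmox-boot-tool refresh instead of update-grub.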
If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware, and that's a fight you rarely win. At that point I gave up on the VM abstraction and moved the NVIDIA driver directly onto the Proxmox host. I then used the NVIDIA Container Toolkit to expose the GPU to my Kubernetes worker. This removes the PCI address fragility entirely: the host driver handles the hardware mapping, and the containers just see the device.
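For reference, the host-side wiring for that setup is roughly: install the NVIDIA driver on the Proxmox host, verify it, then let the Container Toolkit register the NVIDIA runtime with the container runtime. A minimal sketch, assuming containerd as your Kubernetes runtime (the exact driver package depends on your card and kernel):

```bash
# On the Proxmox host: confirm the driver is loaded and sees the card
nvidia-smi

# Register the NVIDIA runtime with containerd, then restart it
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
```

From there, the NVIDIA device plugin for Kubernetes advertises the card as an nvidia.com/gpu resource, so pods request the GPU by name and never see a PCI address at all.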
The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.