I spent two hours trying to debug why my AI agent container couldn’t find the GPU. The error was cryptic, just a "device not found" in the logs, and I had no idea why. I had installed the NVIDIA Container Toolkit, configured containerd, and even tried running the container with --gpus all — nothing worked.
What I expected was for the container to launch with full GPU access, as I had done this before on a different setup. The container should have detected the GPU, initialized the libraries, and started training my model without a hitch.
What actually happened was that the container was using the default OCI runtime, which wasn’t the NVIDIA runtime. As a result, the NVIDIA libraries weren’t loaded, and the GPU wasn’t accessible. The container didn’t fail outright — it just silently missed the GPU, and the AI agent couldn’t proceed.
The fix came after I remembered to run nvidia-ctk runtime configure --runtime=containerd --set-as-default and restart containerd. That command sets the NVIDIA runtime as the default for all containers, not just those that explicitly request it. Without this step, even if you use --gpus all, the container might still run on the default runtime, which is usually runc or crun, not nvidia.
To make sure the configuration stuck, I checked that /etc/containerd/config.toml contained the runtime entry under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]. Note that containerd's config is TOML, and the runtime_type isn't "nvidia" — it's the standard runc v2 shim, with the NVIDIA runtime wired in as the binary:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```
Then I applied the change and restarted containerd so it picked up the new config:

```shell
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
```
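To actually verify the fix (rather than just trusting the restart), I find the quickest smoke test is to run nvidia-smi through containerd directly. A sketch, assuming the CUDA base image tag below is available in your registry mirror:

```shell
# Confirm the default runtime is now nvidia in the generated config
grep -n 'default_runtime_name\|nvidia' /etc/containerd/config.toml

# Run nvidia-smi in a throwaway container; --gpus 0 passes the first GPU through
sudo ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-smoke nvidia-smi
```

If nvidia-smi prints the GPU table, the runtime is wired up; if it fails with a library or device error, the container is still landing on plain runc.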
I also made sure that in Kubernetes, the NVIDIA device plugin was running with the correct runtime class. I updated the DaemonSet for the NVIDIA device plugin to explicitly set runtimeClassName: nvidia, which is crucial in newer Kubernetes versions where the default runtime isn’t automatically set.
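For reference, here's a minimal sketch of that wiring. The name nvidia is the usual convention, not a requirement; what matters is that the handler matches the runtime name in containerd's config:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtimes.nvidia entry in containerd's config
```

With that in place, the device plugin DaemonSet gets runtimeClassName: nvidia under spec.template.spec, so the plugin pod itself runs on the NVIDIA runtime.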
Why does this matter? If you're running AI agents, LLMs, or any GPU-dependent workloads, and you forget to set the NVIDIA runtime as the default, you’ll run into silent failures. Containers might start, but they won’t see the GPU. Worse, you might not even get an error — just a model that doesn’t train or a container that exits with no clear reason.
This is especially critical in Kubernetes environments where the NVIDIA device plugin relies on the runtime being set correctly. If it’s not, the device plugin won’t register the GPU, and your cluster will report zero GPU capacity.
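A quick way to check whether the device plugin actually registered anything, assuming kubectl access to the cluster:

```shell
# Allocatable GPU count per node; '<none>' means the device plugin never registered
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```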
If you're using Proxmox and running containers with GPU access, or if you're deploying AI agents in a Kubernetes cluster, always make sure the NVIDIA runtime is set as the default. It's one nvidia-ctk command and a restart — a small step, but skipping it means chasing cryptic errors and burning compute time.
For more on how to avoid GPU passthrough gotchas in VMs, check out this post on GPU passthrough on Proxmox. If you're deploying AI agents in Kubernetes, this guide on building multi-agent systems might also help.