I deployed Ollama on Kubernetes, and the GPU worker node locked up mid-rollout. No logs, no error, just a dead pod that wouldn’t terminate and a new one that wouldn’t schedule. It wasn’t a crash. It wasn’t a timeout. It was a deadlock I’d never seen before.
I expected a smooth rollout. Ollama is a single-container, single-GPU workload. I set up a Deployment with a single replica, used a PersistentVolumeClaim for model storage, and assumed Kubernetes would manage the rest. That’s what the documentation says.
What actually happened was a scheduling deadlock. The old pod was still running and holding the GPU, so the new pod couldn't schedule because no GPU was free. Kubernetes' default RollingUpdate strategy brings the new pod up before terminating the old one, but the GPU can't be shared. The new pod waited for the old one to release the GPU, and the old one waited for the new one to become ready. Deadlock.
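The root of the contention is that nvidia.com/gpu is an integer extended resource: a GPU claimed by one container is simply unavailable to the scheduler for any other pod. As a sketch, assuming the NVIDIA device plugin is running and advertising nvidia.com/gpu on the node, the request that triggers this exclusive behavior looks like:

```yaml
# Container spec fragment (illustrative). Assumes the NVIDIA device plugin
# is installed and the node advertises nvidia.com/gpu as a resource.
resources:
  limits:
    nvidia.com/gpu: 1  # exclusive: the scheduler won't place a second pod on this GPU
```

With a single GPU on the node, that one line is enough to make a surge-style rollout impossible.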
The fix was switching the Deployment strategy from RollingUpdate to Recreate. That way, the old pod terminates before the new one starts. No GPU contention. No deadlock. It's a simple change: set type: Recreate in the Deployment spec.
Here’s what the Deployment looks like with the fix:
```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate
```
I also had to configure the NVIDIA runtime correctly. Ollama needs the NVIDIA_VISIBLE_DEVICES=all environment variable set, and the NVIDIA container runtime must be properly mounted. Otherwise, the container fails to initialize, and the pod stays in a CrashLoopBackOff state.
```yaml
spec:
  containers:
    - name: ollama
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
      volumeMounts:
        - name: nvidia-driver
          mountPath: /usr/local/nvidia
  volumes:
    - name: nvidia-driver
      hostPath:
        path: /usr/local/nvidia
```
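On clusters where the NVIDIA Container Toolkit is registered as a dedicated RuntimeClass (for example via the GPU Operator) rather than set as the default runtime, you can request it explicitly instead of mounting driver paths by hand. This is a sketch, assuming a RuntimeClass named nvidia already exists on your cluster:

```yaml
# Sketch: assumes a RuntimeClass named "nvidia" exists.
# Verify with: kubectl get runtimeclass
spec:
  runtimeClassName: nvidia
  containers:
    - name: ollama
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
```

Either approach works; the RuntimeClass route keeps the pod spec free of hostPath mounts, which some cluster policies forbid.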
Why does this matter? If you're running any GPU workload on Kubernetes, especially one that requires exclusive access to the GPU, you need to understand the limitations of RollingUpdate. It's not a one-size-fits-all strategy. For single-GPU workloads, Recreate is the only safe option. Otherwise, you'll hit deadlocks that leave your pods in limbo.
Another gotcha I ran into was PVC sizing. Ollama models can be large; some of the bigger ones require over 100Gi of storage. I initially set the PVC to 50Gi, and the pod wouldn't schedule: the PVC couldn't be bound because the node didn't have enough storage capacity. I had to bump the PVC size and make sure the underlying storage class (Longhorn in my case) had enough available space.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```
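One detail worth noting: the PVC above omits storageClassName, so it falls back to the cluster's default StorageClass. If Longhorn isn't your default, pin it explicitly. As a sketch (longhorn is the class name Longhorn ships with, but check yours):

```yaml
# Fragment to add under the PVC's spec:
# "longhorn" assumes Longhorn's standard StorageClass name.
# Verify with: kubectl get storageclass
storageClassName: longhorn
```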
If you're running Ollama on Kubernetes, you'll need to be mindful of these details. A single misconfigured PVC or deployment strategy can bring your whole system to a halt. I've seen it happen more than once. It's not just about getting the container to start; it's about making sure it stays up and doesn't lock the system in the process.
If you're building AI agent workloads or running large models in production, this is a common pitfall. The documentation doesn't always highlight the GPU-specific constraints of Kubernetes, but when you're working with real hardware, those constraints bite. And when they do, you'll wish you'd read about them beforehand.
For more on GPU workloads and Kubernetes, check out NVIDIA Container Toolkit: Why the Default Runtime Matters. If you’re using Longhorn for storage, Kubernetes Storage on Bare Metal: Longhorn in Practice is a good next step.