Kubernetes Rolling Update Failed — Here's Exactly What to Do
One of the most common DevOps interview scenario questions:
"Your deployment rollout failed in Kubernetes. What will you do?"
Most beginners panic at this question. Senior engineers don't — because they have a clear mental framework for it.
Here is that exact framework.
Practical Answer
The priority order is everything here:
First, ensure service stability. Then analyze why the rollout failed.
Never do it the other way around. Production availability comes before investigation — always.
Step-by-Step Debug Process
Step 1: Check Rollout Status
First thing — understand exactly where the rollout stopped:
kubectl rollout status deployment/<name>
This tells you whether the rollout is still progressing, stuck, or has failed completely.
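If the status command just hangs, you can also read the deployment's Progressing condition directly. A minimal sketch (the deployment name and timeout are placeholders to adjust):
# Don't wait forever; give the status check a time limit
kubectl rollout status deployment/<name> --timeout=60s
# Read the Progressing condition; a stuck rollout typically reports ProgressDeadlineExceeded
kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
If the reason is ProgressDeadlineExceeded, Kubernetes has stopped waiting for the new ReplicaSet to become ready, so move on to events and pod logs.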
Step 2: Check Events
kubectl describe deployment <name>
Scroll to the Events section at the bottom. This is where Kubernetes tells you exactly what went wrong.
Look for:
- 🔴 Probe failures — liveness or readiness probe not passing
- 🔴 Image errors — wrong image tag, image pull failure, registry issue
- 🔴 Crash issues — container starting and immediately crashing
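If the Events section is long, you can also pull recent events for the whole namespace. A quick sketch, assuming you are already in the right namespace:
# Show events with the most recent last
kubectl get events --sort-by=.lastTimestamp
# Only the warnings (probe failures, image pull errors, and crashes show up here)
kubectl get events --field-selector type=Warning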
Step 3: Check New Pods
kubectl get pods
kubectl logs <new-pod-name>
The new pods created during the rolling update are where the failure lives. Check their status and read their logs — the error will almost always be visible here.
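A slightly more targeted sketch, assuming the deployment's pods carry an app=<name> label (adjust the selector to whatever labels your deployment actually uses):
# List only the pods belonging to this deployment
kubectl get pods -l app=<name>
# Pod stuck in CrashLoopBackOff or ImagePullBackOff? Describe it for details
kubectl describe pod <new-pod-name>
# If the container already crashed and restarted, read the previous container's logs
kubectl logs <new-pod-name> --previous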
Step 4: Immediate Fix — VERY IMPORTANT
If production is affected — roll back first, investigate later.
kubectl rollout undo deployment/<name>
This instantly restores the previous stable version and brings your service back up. Users stop seeing errors. Now you have time to debug safely without pressure.
# Verify rollback completed successfully
kubectl rollout status deployment/<name>
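By default, kubectl rollout undo goes back to the previous revision. If you need to jump further back, check the history first. The revision number below is a placeholder:
# List the recorded revisions of this deployment
kubectl rollout history deployment/<name>
# Roll back to a specific known-good revision instead of just the previous one
kubectl rollout undo deployment/<name> --to-revision=<revision-number>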
Step 5: Find the Root Cause
Now that service is restored, investigate calmly. The most common root causes are:
Liveness/Readiness Probe Wrong — The probe is hitting the wrong path or port, causing Kubernetes to think the pod is unhealthy and kill it during rollout.
New Image Bug — The new Docker image has a startup bug or crash that only appears at runtime, not during build.
Config Issue — Wrong environment variable, missing secret, or incorrect ConfigMap value that the new version depends on but the old version didn't.
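A few quick checks that cover all three causes. This is a sketch; the ConfigMap and Secret names are placeholders for whatever your new version actually references:
# Inspect the probes the new version shipped with (path, port, timings)
kubectl get deployment <name> -o yaml | grep -A 5 -E 'livenessProbe|readinessProbe'
# Confirm the image tag the rollout tried to pull
kubectl get deployment <name> -o jsonpath='{.spec.template.spec.containers[0].image}'
# Confirm the config the new version depends on actually exists
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name>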
Step 6: Fix and Redeploy
Once the root cause is identified — fix it, test it in staging, then redeploy:
# After fixing the issue
kubectl set image deployment/<name> <container>=<new-fixed-image>
# Watch the new rollout
kubectl rollout status deployment/<name>
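One optional habit worth adding here: record why the new revision exists, so the next kubectl rollout history is readable. The change-cause text below is just an example:
# Annotate the deployment so rollout history shows a human-readable change cause
kubectl annotate deployment/<name> kubernetes.io/change-cause="fix readiness probe path" --overwrite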
Strong Interview Line
Say this in your interview and the interviewer will remember you:
"I always rollback first to maintain availability, then debug the failed rollout."
This one sentence shows you understand that service availability is non-negotiable — and that investigation can wait until users are no longer affected.
That is the mindset of a senior engineer.
Quick Reference Checklist
| Step | Command | Purpose |
|---|---|---|
| 1. Check rollout | `kubectl rollout status deployment/<name>` | See where it failed |
| 2. Check events | `kubectl describe deployment <name>` | Read failure reason |
| 3. Check pods | `kubectl get pods` | Find failing new pods |
| 4. Read logs | `kubectl logs <new-pod>` | See exact error |
| 5. Rollback | `kubectl rollout undo deployment/<name>` | Restore service NOW |
| 6. Fix & redeploy | `kubectl set image ...` | After root cause found |
Most Common Root Causes
| Root Cause | What to Check |
|---|---|
| Probe failure | Liveness/readiness path and port |
| Image error | Image tag exists in registry |
| Config issue | Env vars, Secrets, ConfigMaps |
Final Thought
A failed rolling update is not a disaster — it is a process.
Roll back first. Service is restored. Now you have all the time you need to debug properly, find the root cause, and redeploy with confidence.
The engineers who stay calm during rollout failures are the ones who have this process memorized.
*Have you ever had a rolling update fail in production? What was the root cause? Drop it in the comments.*