Mumtaz Jahan

Kubernetes Rolling Update Failed — Here's Exactly What to Do

One of the most common DevOps interview scenario questions:

"Your deployment rollout failed in Kubernetes. What will you do?"

Most beginners panic at this question. Senior engineers don't — because they have a clear mental framework for it.

Here is that exact framework.


Practical Answer

The priority order is everything here:

First, ensure service stability. Then analyze why the rollout failed.

Never do it the other way around. Production availability comes before investigation — always.


Step-by-Step Debug Process

Step 1: Check Rollout Status

First thing — understand exactly where the rollout stopped:

```shell
kubectl rollout status deployment/<name>
```

This tells you whether the rollout is still progressing, stuck, or has failed completely.
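Kubernetes decides a rollout has failed once it makes no progress for longer than the deployment's `progressDeadlineSeconds` (600 seconds by default). A minimal sketch of tightening that deadline so stuck rollouts surface faster — the deployment name `web` and image are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical deployment name
spec:
  progressDeadlineSeconds: 120     # mark the rollout as failed after 2 minutes without progress
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # hypothetical image
```

With a tighter deadline, `kubectl rollout status` reports the failure sooner instead of waiting ten minutes.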


Step 2: Check Events

```shell
kubectl describe deployment <name>
```

Scroll to the Events section at the bottom. This is where Kubernetes tells you exactly what went wrong.

Look for:

  • 🔴 Probe failures — liveness or readiness probe not passing
  • 🔴 Image errors — wrong image tag, image pull failure, registry issue
  • 🔴 Crash issues — container starting and immediately crashing
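
A common culprit is a probe pointing at the wrong path or port. A sketch of what correctly wired probes look like, assuming the app serves a health endpoint at `/healthz` on port 8080 (both assumptions — match them to your app):

```yaml
containers:
  - name: web                      # hypothetical container name
    image: registry.example.com/web:1.2.3
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz             # must be a path the app actually serves
        port: 8080                 # must match containerPort, not the Service port
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15     # give the app time to start before liveness kicks in
      periodSeconds: 20
```

If the readiness probe never passes, new pods never become Ready and the rollout stalls; if the liveness probe fails, Kubernetes kills the container in a loop.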

Step 3: Check New Pods

```shell
kubectl get pods

kubectl logs <new-pod-name>
```

The new pods created during the rolling update are where the failure lives. Check their status and read their logs — the error will almost always be visible here. If a container is crash-looping, `kubectl logs --previous <new-pod-name>` shows the logs from the crashed attempt rather than the current restart.


Step 4: Immediate Fix — VERY IMPORTANT

If production is affected — roll back first. Investigate later.

```shell
kubectl rollout undo deployment/<name>
```

This instantly restores the previous stable version and brings your service back up. Users stop seeing errors. Now you have time to debug safely without pressure. If you need to go back further than the immediately previous version, `kubectl rollout history deployment/<name>` lists the recorded revisions, and `kubectl rollout undo deployment/<name> --to-revision=<n>` targets a specific one.

```shell
# Verify rollback completed successfully
kubectl rollout status deployment/<name>
```

Step 5: Find the Root Cause

Now that service is restored, investigate calmly. The most common root causes are:

Liveness/Readiness Probe Wrong — The probe is hitting the wrong path or port, causing Kubernetes to think the pod is unhealthy and kill it during rollout.

New Image Bug — The new Docker image has a startup bug or crash that only appears at runtime, not during build.

Config Issue — Wrong environment variable, missing secret, or incorrect ConfigMap value that the new version depends on but the old version didn't.
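
For the config case, verify that every Secret and ConfigMap the new version references actually exists in the target namespace — a missing reference leaves the new pods stuck in `CreateContainerConfigError`. A sketch of the spots to check, with hypothetical names `web-secrets` and `web-config`:

```yaml
containers:
  - name: web
    image: registry.example.com/web:1.2.4   # hypothetical new image
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: web-secrets    # Secret must exist in the same namespace
            key: database-url    # key must exist inside that Secret
    envFrom:
      - configMapRef:
          name: web-config       # container creation fails if this ConfigMap is missing
```

This class of failure is easy to miss because the old pods keep running happily — only the new ReplicaSet's pods depend on the new reference.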


Step 6: Fix and Redeploy

Once root cause is identified — fix it, test it in staging, then redeploy:

```shell
# After fixing the issue
kubectl set image deployment/<name> <container>=<new-fixed-image>

# Watch the new rollout
kubectl rollout status deployment/<name>
```
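
When redeploying, the rolling update strategy controls your blast radius: `maxUnavailable` caps how many old pods can be down at once, and `maxSurge` caps how many extra new pods are created. A conservative sketch for a service that cannot afford lost capacity:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a serving pod before its replacement is Ready
      maxSurge: 1         # bring up one new pod at a time
```

Combined with a working readiness probe, this means a bad image can stall the rollout but never reduce the number of healthy pods serving traffic.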

Strong Interview Line

Say this in your interview and the interviewer will remember you:

"I always rollback first to maintain availability, then debug the failed rollout."

This one sentence shows you understand that service availability is non-negotiable — and that investigation can wait until users are no longer affected.

That is the mindset of a senior engineer.


Quick Reference Checklist

| Step | Command | Purpose |
| --- | --- | --- |
| 1. Check rollout | `kubectl rollout status deployment/<name>` | See where it failed |
| 2. Check events | `kubectl describe deployment <name>` | Read failure reason |
| 3. Check pods | `kubectl get pods` | Find failing new pods |
| 4. Read logs | `kubectl logs <new-pod>` | See exact error |
| 5. Rollback | `kubectl rollout undo deployment/<name>` | Restore service NOW |
| 6. Fix & redeploy | `kubectl set image ...` | After root cause found |

Most Common Root Causes

| Root Cause | What to Check |
| --- | --- |
| Probe failure | Liveness/readiness path and port |
| Image error | Image tag exists in registry |
| Config issue | Env vars, Secrets, ConfigMaps |

Final Thought

A failed rolling update is not a disaster — it is a process.

Rollback first. Service is restored. Now you have all the time you need to debug properly, find the root cause, and redeploy with confidence.

The engineers who stay calm during rollout failures are the ones who have this process memorized.


*Have you ever had a rolling update fail in production? What was the root cause? Drop it in the comments!*
