<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mumtaz Jahan</title>
    <description>The latest articles on DEV Community by Mumtaz Jahan (@mumtaz2029).</description>
    <link>https://dev.to/mumtaz2029</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885280%2F719c10f5-af77-4250-9216-5dadfef8503a.jpeg</url>
      <title>DEV Community: Mumtaz Jahan</title>
      <link>https://dev.to/mumtaz2029</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mumtaz2029"/>
    <language>en</language>
    <item>
      <title>Kubernetes Rolling Update Failed — Here's Exactly What to Do</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Sun, 03 May 2026 00:54:53 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/kubernetes-rolling-update-failed-heres-exactly-what-to-do-3p04</link>
      <guid>https://dev.to/mumtaz2029/kubernetes-rolling-update-failed-heres-exactly-what-to-do-3p04</guid>
      <description>&lt;h2&gt;
  
  
  Kubernetes Rolling Update Failed — Here's Exactly What to Do
&lt;/h2&gt;

&lt;p&gt;One of the most common &lt;strong&gt;DevOps interview scenario questions:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Your deployment rollout failed in Kubernetes. What will you do?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most beginners panic at this question. Senior engineers don't — because they have a clear mental framework for it.&lt;/p&gt;

&lt;p&gt;Here is that exact framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Answer
&lt;/h2&gt;

&lt;p&gt;The priority order is everything here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;First, ensure service stability. Then analyze why the rollout failed.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Never do it the other way around. Production availability comes before investigation — always.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step-by-Step Debug Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Check Rollout Status
&lt;/h3&gt;

&lt;p&gt;First thing — understand exactly where the rollout stopped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout status deployment/&amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you whether the rollout is still progressing, stuck, or has failed completely.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Check Events
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe deployment &amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scroll to the &lt;strong&gt;Events&lt;/strong&gt; section at the bottom. This is where Kubernetes tells you exactly what went wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔴 Probe failures — liveness or readiness probe not passing&lt;/li&gt;
&lt;li&gt;🔴 Image errors — wrong image tag, image pull failure, registry issue&lt;/li&gt;
&lt;li&gt;🔴 Crash issues — container starting and immediately crashing&lt;/li&gt;
&lt;/ul&gt;
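&lt;p&gt;If the Events section is noisy, you can also list recent events directly, sorted so the newest appear last:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List recent events, newest last&lt;/span&gt;
kubectl get events --sort-by='.metadata.creationTimestamp'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;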




&lt;h3&gt;
  
  
  Step 3: Check New Pods
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods

kubectl logs &amp;lt;new-pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new pods created during the rolling update are where the failure lives. Check their status and read their logs — the error will almost always be visible here.&lt;/p&gt;
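&lt;p&gt;One caveat: if the container is crash-looping, the current logs may be empty. In that case, these two commands usually surface the real error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Logs from the previous (crashed) container instance&lt;/span&gt;
kubectl logs &amp;lt;new-pod-name&amp;gt; --previous

&lt;span class="c"&gt;# Full pod detail, including container state and restart reasons&lt;/span&gt;
kubectl describe pod &amp;lt;new-pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;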




&lt;h3&gt;
  
  
  Step 4: Immediate Fix — VERY IMPORTANT
&lt;/h3&gt;

&lt;p&gt;If production is affected — &lt;strong&gt;rollback first. Investigate later.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout undo deployment/&amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This instantly restores the previous stable version and brings your service back up. Users stop seeing errors. Now you have time to debug safely without pressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify rollback completed successfully&lt;/span&gt;
kubectl rollout status deployment/&amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
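&lt;p&gt;If you need to go back further than the immediately previous version, rollout history lets you pick a specific revision:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List past revisions, then roll back to a specific one&lt;/span&gt;
kubectl rollout history deployment/&amp;lt;name&amp;gt;
kubectl rollout undo deployment/&amp;lt;name&amp;gt; --to-revision=&amp;lt;revision-number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;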






&lt;h3&gt;
  
  
  Step 5: Find the Root Cause
&lt;/h3&gt;

&lt;p&gt;Now that service is restored, investigate calmly. The most common root causes are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Liveness/Readiness Probe Wrong&lt;/strong&gt; — The probe is hitting the wrong path or port, causing Kubernetes to think the pod is unhealthy and kill it during rollout.&lt;/p&gt;
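&lt;p&gt;A quick way to see which probe configuration actually shipped (a simple grep sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show the probe settings on the live deployment&lt;/span&gt;
kubectl get deployment &amp;lt;name&amp;gt; -o yaml | grep -A5 -E 'livenessProbe|readinessProbe'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;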

&lt;p&gt;&lt;strong&gt;New Image Bug&lt;/strong&gt; — The new Docker image has a startup bug or crash that only appears at runtime, not during build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config Issue&lt;/strong&gt; — Wrong environment variable, missing secret, or incorrect ConfigMap value that the new version depends on but the old version didn't.&lt;/p&gt;
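&lt;p&gt;To verify the config side quickly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compare what the pod actually sees against what the new version expects&lt;/span&gt;
kubectl exec &amp;lt;pod-name&amp;gt; -- env
kubectl get secret &amp;lt;secret-name&amp;gt;
kubectl get configmap &amp;lt;configmap-name&amp;gt; -o yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;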




&lt;h3&gt;
  
  
  Step 6: Fix and Redeploy
&lt;/h3&gt;

&lt;p&gt;Once root cause is identified — fix it, test it in staging, then redeploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After fixing the issue&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/&amp;lt;name&amp;gt; &amp;lt;container&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;new-fixed-image&amp;gt;

&lt;span class="c"&gt;# Watch the new rollout&lt;/span&gt;
kubectl rollout status deployment/&amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Strong Interview Line
&lt;/h2&gt;

&lt;p&gt;Say this in your interview and the interviewer will remember you:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"I always rollback first to maintain availability, then debug the failed rollout."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This one sentence shows you understand that &lt;strong&gt;service availability is non-negotiable&lt;/strong&gt; — and that investigation can wait until users are no longer affected.&lt;/p&gt;

&lt;p&gt;That is the mindset of a senior engineer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference Checklist
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Check rollout&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout status deployment/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;See where it failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Check events&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl describe deployment &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read failure reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Check pods&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find failing new pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Read logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;new-pod&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;See exact error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Rollback&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout undo deployment/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Restore service NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Fix &amp;amp; redeploy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl set image ...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After root cause found&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Most Common Root Causes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Probe failure&lt;/td&gt;
&lt;td&gt;Liveness/readiness path and port&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image error&lt;/td&gt;
&lt;td&gt;Image tag exists in registry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config issue&lt;/td&gt;
&lt;td&gt;Env vars, Secrets, ConfigMaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;A failed rolling update is not a disaster — it is a process.&lt;/p&gt;

&lt;p&gt;Rollback first. Service is restored. Now you have all the time you need to debug properly, find the root cause, and redeploy with confidence.&lt;/p&gt;

&lt;p&gt;The engineers who stay calm during rollout failures are the ones who have this process memorized.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you ever had a rolling update fail in production? What was the root cause? Drop it in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>beginners</category>
      <category>career</category>
    </item>
    <item>
      <title>What Are Quality Gates in CI/CD? (And Why "Nobody Reads" Is Not a Gate)</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Wed, 29 Apr 2026 19:52:48 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/what-are-quality-gates-in-cicd-and-why-nobody-reads-is-not-a-gate-4a51</link>
      <guid>https://dev.to/mumtaz2029/what-are-quality-gates-in-cicd-and-why-nobody-reads-is-not-a-gate-4a51</guid>
      <description>&lt;h3&gt;
  
  
  What Are Quality Gates in CI/CD?
&lt;/h3&gt;

&lt;p&gt;A quality gate is a rule that must pass for the pipeline to move to the next stage.&lt;br&gt;
Simple definition. Powerful concept.&lt;br&gt;
If the gate fails — the pipeline fails. No exceptions. No "we'll fix it later." That discipline is exactly what keeps bugs out of production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Quality Gates
&lt;/h3&gt;

&lt;p&gt;Here are the most widely used gates in real DevOps pipelines:&lt;br&gt;
✅ Unit test pass rate — 100%&lt;br&gt;
✅ Code coverage — at least 70%&lt;br&gt;
✅ Static analysis — 0 critical issues&lt;br&gt;
✅ Security scan — no high severity CVEs&lt;br&gt;
✅ Smoke test — all must pass&lt;br&gt;
✅ Performance — response time must be under target (p99 threshold)&lt;br&gt;
Each of these is a hard stop. The pipeline does not move forward until every gate passes.&lt;/p&gt;
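&lt;p&gt;In practice, a hard stop can be as small as a shell step that exits non-zero. Here is a minimal sketch, assuming your test tooling exports the measured coverage into a &lt;code&gt;COVERAGE&lt;/code&gt; variable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Minimal coverage gate: a non-zero exit blocks the pipeline stage&lt;/span&gt;
&lt;span class="c"&gt;# COVERAGE is assumed to be set by your coverage tooling, e.g. "73.4"&lt;/span&gt;
if [ "${COVERAGE%.*}" -lt 70 ]; then
  echo "Quality gate failed: coverage ${COVERAGE}% is below 70%"
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;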

&lt;h2&gt;
  
  
  The Rule to Remember in Interviews
&lt;/h2&gt;

&lt;p&gt;A warning nobody reads is not a gate.&lt;/p&gt;

&lt;p&gt;This is the most important thing to say when asked about quality gates in an interview. If your pipeline warns but still deploys — that is not a gate. That is noise.&lt;br&gt;
A real gate blocks the pipeline. It forces the team to fix the issue before moving forward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Project Example You Can Use in Interviews
&lt;/h3&gt;

&lt;p&gt;Here is a real scenario worth sharing:&lt;br&gt;
Our pipeline had a 70% code coverage gate. The dev team pushed to drop it to 60% to move faster.&lt;br&gt;
Before agreeing, I pulled quarterly bug data. The finding was clear — low coverage modules had 3x more bugs.&lt;br&gt;
The data made the decision. The gate stayed at 70%.&lt;br&gt;
This is a perfect interview answer because it shows you don't just follow rules blindly — you back decisions with data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Close Your Interview Answer With This Line
&lt;/h3&gt;

&lt;p&gt;Interviewers remember candidates who say this:&lt;/p&gt;

&lt;p&gt;"Gates should enforce standards that the team agreed on — not personal preferences."&lt;/p&gt;

&lt;p&gt;That one sentence shows maturity, team thinking, and real engineering judgment.&lt;/p&gt;

&lt;h1&gt;
  
  
  Real World Gate Stack
&lt;/h1&gt;

&lt;p&gt;In my last project we used:&lt;/p&gt;

&lt;p&gt;SonarQube — static analysis + code coverage gate&lt;br&gt;
OWASP Dependency Check — security vulnerability gate&lt;/p&gt;

&lt;p&gt;Any one of them failing blocked the merge entirely.&lt;br&gt;
That discipline before production is exactly why we caught bugs early instead of firefighting at 2AM.&lt;/p&gt;
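&lt;p&gt;If you are wiring this up with SonarQube's scanner CLI, one common pattern (assuming a recent scanner version) is to make the analysis step itself wait for the quality gate verdict and fail when it comes back red:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fail this pipeline step if SonarQube's quality gate comes back red&lt;/span&gt;
sonar-scanner -Dsonar.qualitygate.wait=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;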

&lt;h2&gt;
  
  
  Quick Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate Type&lt;/th&gt;
&lt;th&gt;Example Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;100% pass rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Coverage&lt;/td&gt;
&lt;td&gt;≥ 70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static Analysis&lt;/td&gt;
&lt;td&gt;0 critical issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Scan&lt;/td&gt;
&lt;td&gt;No high CVEs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smoke Tests&lt;/td&gt;
&lt;td&gt;All passing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Under p99 target&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Final Thought
&lt;/h3&gt;

&lt;p&gt;Quality gates are not bureaucracy. They are the team's agreed standards made automatic.&lt;br&gt;
Without gates, standards are just suggestions. With gates, they are enforced every single time — whether it's 10AM on a Monday or 2AM before a release.&lt;br&gt;
Set the gates. Trust the gates. Let the data defend the gates.&lt;/p&gt;

&lt;p&gt;What quality gates does your team use? Drop them in the comments 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>beginners</category>
      <category>career</category>
    </item>
    <item>
      <title>Pipeline Success but Application Broken — Here's Why</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Tue, 28 Apr 2026 03:10:16 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/pipeline-success-but-application-broken-heres-why-h8o</link>
      <guid>https://dev.to/mumtaz2029/pipeline-success-but-application-broken-heres-why-h8o</guid>
      <description>&lt;h4&gt;
  
  
  Pipeline Success but Application Broken — Here's Why
&lt;/h4&gt;

&lt;p&gt;One of the most confusing moments in DevOps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;CI/CD Pipeline — &lt;strong&gt;Passed&lt;/strong&gt;&lt;br&gt;
Application — &lt;strong&gt;Not Working&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How can the pipeline be green but the app still be broken? This is actually one of the most common &lt;strong&gt;real interview scenario questions&lt;/strong&gt; asked in DevOps — and most beginners get it wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Answer
&lt;/h2&gt;

&lt;p&gt;Here is the key concept that most people miss:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CI/CD success only means the build and deploy succeeded — not that the application is healthy.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Think of it this way — your pipeline's job is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the Docker image &lt;/li&gt;
&lt;li&gt;Push it to the registry &lt;/li&gt;
&lt;li&gt;Deploy it to Kubernetes &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But none of these steps check whether your &lt;strong&gt;application actually started correctly&lt;/strong&gt;, connected to the database, loaded the right config, or is responding to requests.&lt;/p&gt;

&lt;p&gt;The pipeline says &lt;strong&gt;"I delivered the package"&lt;/strong&gt; — not &lt;strong&gt;"the package works."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Steps to Debug This
&lt;/h2&gt;

&lt;p&gt;When your pipeline is green but the app is broken, follow these steps in order:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Check Pods
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the STATUS column. You want &lt;code&gt;Running&lt;/code&gt; — anything else like &lt;code&gt;Error&lt;/code&gt;, &lt;code&gt;Pending&lt;/code&gt;, or &lt;code&gt;OOMKilled&lt;/code&gt; tells you something went wrong after deployment.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Check Logs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the real story is. The app may have deployed successfully but crashed immediately on startup. The logs will tell you exactly why.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3: Verify Environment Variables, Secrets &amp;amp; ConfigMaps
&lt;/h3&gt;

&lt;p&gt;This is the most common culprit. The pipeline deployed the right image — but the app couldn't connect to the database because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong DB host in environment variables&lt;/li&gt;
&lt;li&gt;Secret not mounted correctly&lt;/li&gt;
&lt;li&gt;ConfigMap pointing to staging instead of production
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check environment variables on the pod&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt;

&lt;span class="c"&gt;# Check if secret exists&lt;/span&gt;
kubectl get secret &amp;lt;secret-name&amp;gt;

&lt;span class="c"&gt;# Check ConfigMap values&lt;/span&gt;
kubectl get configmap &amp;lt;configmap-name&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 4: Test the Endpoint Manually
&lt;/h3&gt;

&lt;p&gt;Don't assume the app is working — verify it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port forward and test locally&lt;/span&gt;
kubectl port-forward &amp;lt;pod-name&amp;gt; 8080:8080

&lt;span class="c"&gt;# Then in another terminal&lt;/span&gt;
curl http://localhost:8080/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the health check fails, you have confirmed the app is broken despite the green pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 5: If Needed — Rollback
&lt;/h3&gt;

&lt;p&gt;If you can't fix it quickly and production is affected — rollback immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout undo deployment/&amp;lt;deployment-name&amp;gt;

&lt;span class="c"&gt;# Verify rollback&lt;/span&gt;
kubectl rollout status deployment/&amp;lt;deployment-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restore service first. Investigate later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most real production issues after a green pipeline = config mismatch&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not a code bug. Not a broken image. Just a wrong environment variable, a missing secret, or a ConfigMap pointing to the wrong place.&lt;/p&gt;

&lt;p&gt;This is why experienced DevOps engineers &lt;strong&gt;always check config first&lt;/strong&gt; when the pipeline passes but the app doesn't behave.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Debug Checklist
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Check pods&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;STATUS = Running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Startup errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check env vars&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl exec &amp;lt;pod&amp;gt; -- env&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Correct DB/API values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check secrets&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl get secret&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Secret exists and mounted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test endpoint&lt;/td&gt;
&lt;td&gt;&lt;code&gt;curl localhost/health&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;200 OK response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout undo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If nothing else works&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;A green pipeline gives you &lt;strong&gt;confidence in your delivery process&lt;/strong&gt; — not a guarantee that your application is healthy. These are two very different things.&lt;/p&gt;

&lt;p&gt;Add &lt;strong&gt;smoke tests&lt;/strong&gt; and &lt;strong&gt;health checks&lt;/strong&gt; at the end of your pipeline to bridge that gap. Make your pipeline not just check if the deploy succeeded — but if the app actually &lt;strong&gt;came up healthy.&lt;/strong&gt;&lt;/p&gt;
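&lt;p&gt;That check can be as small as a curl loop at the end of the pipeline. A minimal sketch, assuming your app exposes a &lt;code&gt;/health&lt;/code&gt; endpoint (the URL below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Retry the health endpoint for up to ~50 seconds before declaring failure&lt;/span&gt;
for i in $(seq 1 10); do
  if curl -fsS https://your-app.example.com/health; then
    echo "App is healthy"
    exit 0
  fi
  sleep 5
done
echo "Smoke test failed: app did not come up healthy"
exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;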




&lt;p&gt;&lt;em&gt;Have you ever been caught off guard by a green pipeline and a broken app? Drop your story in the comments 👇&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cicd</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What broke your CI/CD pipeline and how did you fix it?</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Sat, 25 Apr 2026 15:41:35 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/what-broke-your-cicd-pipeline-and-how-did-you-fix-it-2nka</link>
      <guid>https://dev.to/mumtaz2029/what-broke-your-cicd-pipeline-and-how-did-you-fix-it-2nka</guid>
      <description>&lt;p&gt;We all have that one story. 😅&lt;br&gt;
That moment where:&lt;/p&gt;

&lt;p&gt;Pipeline was green ✅&lt;br&gt;
You pushed one small change&lt;br&gt;
Everything exploded 💥&lt;/p&gt;

&lt;p&gt;For me it was forgetting to add environment variables in Jenkins — pipeline ran perfectly locally, failed completely in production. Classic.&lt;br&gt;
I want to hear your stories:&lt;br&gt;
🔴 What was the stupidest thing that broke your pipeline?&lt;br&gt;
🟡 How long did it take you to find the bug?&lt;br&gt;
🟢 How did you finally fix it?&lt;br&gt;
Drop your war stories below 👇 — the more painful the better! 😄&lt;/p&gt;

</description>
      <category>explainlikeimfive</category>
      <category>devops</category>
      <category>jenkins</category>
      <category>discuss</category>
    </item>
    <item>
      <title>CI/CD Pipeline Best Practices That Nobody Teaches You When You're Starting Out</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Sat, 25 Apr 2026 15:25:13 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/cicd-pipeline-best-practices-that-nobody-teaches-you-when-youre-starting-out-5fi3</link>
      <guid>https://dev.to/mumtaz2029/cicd-pipeline-best-practices-that-nobody-teaches-you-when-youre-starting-out-5fi3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When I first started building CI/CD pipelines, I thought it was just about automating deployments. I was wrong.&lt;/p&gt;

&lt;p&gt;After working with Jenkins, GitHub Actions, and GitLab CI — here are the real best practices I wish someone told me earlier.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  1.  Never Store Secrets in Your Pipeline Code
&lt;/h2&gt;

&lt;p&gt;The biggest mistake beginners make:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#  WRONG — never do this&lt;/span&gt;
docker login &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; mypassword123

&lt;span class="c"&gt;#  RIGHT — use environment variables&lt;/span&gt;
docker login &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nv"&gt;$DOCKER_USER&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$DOCKER_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Use your CI tool's &lt;strong&gt;secret manager&lt;/strong&gt; — Jenkins Credentials, GitHub Secrets, GitLab Variables. Always.&lt;/p&gt;


&lt;h2&gt;
  
  
  2.  Fail Fast — Put Quick Checks First
&lt;/h2&gt;

&lt;p&gt;Order your pipeline stages like this:&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lint → Unit Tests → Build → Integration Tests → Deploy
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Why? If your linting fails, there's no point building. Catch errors early, save time.&lt;/p&gt;


&lt;h2&gt;
  
  
  3.  Always Build Immutable Artifacts
&lt;/h2&gt;

&lt;p&gt;Never deploy code directly. Always build a Docker image or artifact first:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tag with commit SHA — not just "latest"&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; myapp:&lt;span class="nv"&gt;$GIT_COMMIT_SHA&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
docker push myapp:&lt;span class="nv"&gt;$GIT_COMMIT_SHA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Using &lt;code&gt;latest&lt;/code&gt; tag is a trap — you lose traceability.&lt;/p&gt;
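&lt;p&gt;That traceability pays off at deploy time as well. You can point the cluster at the exact artifact the pipeline built (names here are illustrative):&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy the exact image this pipeline run built and pushed&lt;/span&gt;
kubectl set image deployment/myapp myapp=myapp:$GIT_COMMIT_SHA
&lt;/code&gt;&lt;/pre&gt;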


&lt;h2&gt;
  
  
  4.  One Pipeline Per Branch Strategy
&lt;/h2&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feature/* → lint + unit tests only
develop   → lint + tests + build + deploy to staging
main      → full pipeline + deploy to production
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Don't run full heavy pipelines on every feature branch — wastes time and resources.&lt;/p&gt;


&lt;h2&gt;
  
  
  5.  Notifications Matter More Than You Think
&lt;/h2&gt;

&lt;p&gt;Add Slack or email alerts for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Pipeline failure&lt;/li&gt;
&lt;li&gt; Successful production deploy&lt;/li&gt;
&lt;li&gt; Test coverage dropping below threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Silent pipelines = hidden problems.&lt;/p&gt;


&lt;h2&gt;
  
  
  6.  Keep Your Pipeline as Code
&lt;/h2&gt;

&lt;p&gt;Always use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Jenkinsfile&lt;/code&gt; for Jenkins&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.github/workflows/*.yml&lt;/code&gt; for GitHub Actions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.gitlab-ci.yml&lt;/code&gt; for GitLab&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never configure pipelines through UI only — it can't be versioned or reviewed.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. 📊 Track These Pipeline Metrics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build duration&lt;/td&gt;
&lt;td&gt;Spot slowdowns early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test pass rate&lt;/td&gt;
&lt;td&gt;Catch flaky tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy frequency&lt;/td&gt;
&lt;td&gt;Measure team velocity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean time to recovery&lt;/td&gt;
&lt;td&gt;How fast you fix failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  8.  Never Skip Tests to Speed Up Pipeline
&lt;/h2&gt;

&lt;p&gt;Skipping tests to go faster is like removing smoke detectors to save battery. You'll regret it in production.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's Your Biggest CI/CD Struggle?
&lt;/h2&gt;

&lt;p&gt;Drop it in the comments — I read every one! &lt;/p&gt;



&lt;p&gt;&lt;em&gt;💬 P.S. I run a free Telegram community called &lt;strong&gt;DevOps Materials &amp;amp; Learning Hub&lt;/strong&gt; where we share CI/CD scripts, Jenkinsfiles, pipeline templates and more. Join us here → &lt;a href="https://t.me/+YHQcSaCPd9EzMmQ1" rel="noopener noreferrer"&gt;https://t.me/+YHQcSaCPd9EzMmQ1&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>cicd</category>
      <category>jenkins</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Your AWS Bill Doubled Overnight (And How to Plug the Leaks)</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:43:54 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/why-your-aws-bill-doubled-overnight-and-how-to-plug-the-leaks-48c7</link>
      <guid>https://dev.to/mumtaz2029/why-your-aws-bill-doubled-overnight-and-how-to-plug-the-leaks-48c7</guid>
      <description>&lt;h2&gt;
  
  
  Why Your AWS Bill Doubled Overnight (And How to Plug the Leaks)
&lt;/h2&gt;

&lt;p&gt;We've all been there.&lt;/p&gt;

&lt;p&gt;You open the AWS Billing Dashboard, expecting the usual $50–$100, only to see a vertical spike that looks like a mountain range. The immediate reaction is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We must have massive traffic!"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But let's be real — traffic rarely doubles overnight. Your misconfigurations, however, certainly can.&lt;/p&gt;

&lt;p&gt;If you're staring down a bill that's spiraling out of control, here is your emergency checklist to find the &lt;strong&gt;invisible drains&lt;/strong&gt; on your budget.&lt;/p&gt;




&lt;h2&gt;
  
  
  1.  The NAT Gateway "Processing" Trap
&lt;/h2&gt;

&lt;p&gt;NAT Gateways are the &lt;strong&gt;silent killers&lt;/strong&gt; of AWS budgets. You aren't just paying for the uptime — you're paying for every gigabyte that passes through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Sending high-bandwidth internal traffic (like S3 uploads) through a NAT Gateway instead of using a VPC Endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Use &lt;strong&gt;VPC Endpoints&lt;/strong&gt; for S3 and DynamoDB to keep that traffic off the expensive NAT "highway."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your NAT Gateway data transfer costs&lt;/span&gt;
aws ec2 describe-nat-gateways &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NatGateways[*].{ID:NatGatewayId,State:State}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single misconfigured service pushing gigabytes through NAT can silently add hundreds of dollars to your bill.&lt;/p&gt;
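&lt;p&gt;Creating a gateway endpoint for S3 is a one-liner; the IDs below are placeholders, and the region in the service name should match your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Route S3 traffic through a free gateway endpoint instead of NAT&lt;/span&gt;
aws ec2 create-vpc-endpoint \
  --vpc-id &amp;lt;vpc-id&amp;gt; \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids &amp;lt;route-table-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;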




&lt;h2&gt;
  
  
  2.  Cross-AZ Data Transfer — The Invisible Tax
&lt;/h2&gt;

&lt;p&gt;High availability is great, but cross-Availability Zone (AZ) traffic comes with a literal &lt;strong&gt;invisible tax.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Your app server in &lt;code&gt;us-east-1a&lt;/code&gt; is constantly chatting with a database in &lt;code&gt;us-east-1b&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Keep your "chatty" services within the &lt;strong&gt;same AZ&lt;/strong&gt; where possible, or use Service Discovery to prioritize local traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check which AZ your instances are running in&lt;/span&gt;
aws ec2 describe-instances &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Reservations[*].Instances[*].{ID:InstanceId,AZ:Placement.AvailabilityZone}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3.  Ghost EBS Volumes
&lt;/h2&gt;

&lt;p&gt;When you terminate an EC2 instance, the Elastic Block Store (EBS) volume doesn't always go away with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
"Unattached" volumes sitting in your console, doing absolutely nothing except &lt;strong&gt;costing you monthly rent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Go to EC2 Console → Volumes → Filter by &lt;code&gt;State = Available&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all unattached EBS volumes via CLI&lt;/span&gt;
aws ec2 describe-volumes &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[*].{ID:VolumeId,Size:Size,State:State}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it's not &lt;code&gt;In-use&lt;/code&gt; — delete it or snapshot it and move on.&lt;/p&gt;
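&lt;p&gt;If you want a safety net before deleting, snapshot first (the volume ID is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Snapshot the volume, then delete it&lt;/span&gt;
aws ec2 create-snapshot --volume-id &amp;lt;volume-id&amp;gt; --description "pre-cleanup backup"
aws ec2 delete-volume --volume-id &amp;lt;volume-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;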




&lt;h2&gt;
  
  
  4.  Broken Auto Scaling
&lt;/h2&gt;

&lt;p&gt;Auto Scaling is designed to &lt;strong&gt;save&lt;/strong&gt; you money, but it only works if it knows how to breathe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Your "Scale Up" policy works perfectly during peak hours, but your "Scale Down" policy is either missing or blocked by a single stuck process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Audit your CloudWatch alarms. Ensure your cooldown periods aren't too long and that your termination policies are actually firing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List your Auto Scaling groups and their activities&lt;/span&gt;
aws autoscaling describe-scaling-activities &lt;span class="nt"&gt;--auto-scaling-group-name&lt;/span&gt; your-group-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. The CloudWatch Ingestion Spike
&lt;/h2&gt;

&lt;p&gt;Logs are vital — until they cost more than the app they're monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
You left a service in &lt;strong&gt;Debug mode&lt;/strong&gt;, and now you're paying for terabytes of CloudWatch log ingestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Set a retention policy. Don't keep logs for "Forever" by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set a 30-day retention policy on a log group&lt;/span&gt;
aws logs put-retention-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; /your/log/group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--retention-in-days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;14 to 30 days is usually plenty for dev environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  6.  S3 Without a Lifecycle Policy
&lt;/h2&gt;

&lt;p&gt;Storage is cheap — but it's not free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Storing every version of every file in Standard Storage for years with no cleanup plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Implement &lt;strong&gt;S3 Lifecycle Policies&lt;/strong&gt; to move old data automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STANDARD_IA"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move to Infrequent Access after 30 days. Move to Glacier after 90. Your future self will thank you.&lt;/p&gt;
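&lt;p&gt;To apply a policy like the one above from the CLI (the bucket name is a placeholder; note that current API versions may also require a &lt;code&gt;Filter&lt;/code&gt; element in each rule):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Apply the lifecycle rules saved in lifecycle.json&lt;/span&gt;
aws s3api put-bucket-lifecycle-configuration \
  --bucket &amp;lt;bucket-name&amp;gt; \
  --lifecycle-configuration file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;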




&lt;h2&gt;
  
  
  7. Idle Load Balancers
&lt;/h2&gt;

&lt;p&gt;An ALB (Application Load Balancer) costs roughly &lt;strong&gt;$16–$20/month&lt;/strong&gt; just to exist — even if nothing is using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Leftover load balancers from a project or staging environment you forgot to tear down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find load balancers with no targets&lt;/span&gt;
aws elbv2 describe-load-balancers &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'LoadBalancers[*].{Name:LoadBalancerName,DNS:DNSName}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it has &lt;strong&gt;zero targets&lt;/strong&gt; and &lt;strong&gt;zero requests&lt;/strong&gt; — delete it immediately.&lt;/p&gt;
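&lt;p&gt;To confirm a load balancer really has nothing behind it, check its target groups (the ARN is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# An empty or all-unhealthy result means nothing is being served&lt;/span&gt;
aws elbv2 describe-target-health --target-group-arn &amp;lt;target-group-arn&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;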




&lt;h2&gt;
  
  
  8.  Snapshot Hoarding
&lt;/h2&gt;

&lt;p&gt;Backups are important — but do you really need a snapshot of a test server from 2022?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leak:&lt;/strong&gt;&lt;br&gt;
Automated backups that &lt;strong&gt;never expire.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Use &lt;strong&gt;AWS Backup&lt;/strong&gt; to centralize management and set hard expiration dates on snapshots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List all your snapshots and their ages&lt;/span&gt;
aws ec2 describe-snapshots &lt;span class="nt"&gt;--owner-ids&lt;/span&gt; self &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Snapshots[*].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Quick Emergency Checklist
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Leak&lt;/th&gt;
&lt;th&gt;Quick Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;NAT Gateway traffic&lt;/td&gt;
&lt;td&gt;Use VPC Endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Cross-AZ traffic&lt;/td&gt;
&lt;td&gt;Keep services in same AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Ghost EBS volumes&lt;/td&gt;
&lt;td&gt;Filter Available → Delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Broken Auto Scaling&lt;/td&gt;
&lt;td&gt;Audit CloudWatch alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;CloudWatch debug logs&lt;/td&gt;
&lt;td&gt;Set 14-30 day retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;S3 no lifecycle&lt;/td&gt;
&lt;td&gt;Add Lifecycle Policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Idle Load Balancers&lt;/td&gt;
&lt;td&gt;Zero targets → Delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Old snapshots&lt;/td&gt;
&lt;td&gt;Set expiration in AWS Backup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AWS is a &lt;strong&gt;"pay-for-what-you-use"&lt;/strong&gt; model — but if you aren't careful, you're also &lt;strong&gt;"paying-for-what-you-forgot-to-turn-off."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run through this checklist every month. Set a calendar reminder. Your AWS bill will thank you. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the biggest hidden cost you've ever found in your AWS bill? Drop it in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>devops</category>
      <category>beginners</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>How I Fixed Jenkins Built-In Node Offline Issue on EC2</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:08:28 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/how-i-fixed-jenkins-built-in-node-offline-issue-on-ec2-m79</link>
      <guid>https://dev.to/mumtaz2029/how-i-fixed-jenkins-built-in-node-offline-issue-on-ec2-m79</guid>
      <description>&lt;h2&gt;
  
  
  Problem: Jenkins Built-In Node Showing Offline on EC2
&lt;/h2&gt;

&lt;p&gt;If you have ever set up Jenkins on an AWS EC2 instance and seen your Built-In Node showing &lt;strong&gt;offline&lt;/strong&gt; with builds not running — this post is for you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdlpf34pskcxsvcipejg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdlpf34pskcxsvcipejg.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is exactly what happened, why it happened, and how I fixed it step by step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Warning
&lt;/h2&gt;

&lt;p&gt;When I clicked on the Built-In Node I saw this error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Disk space is below threshold of 1.00 GiB. Only 951.90 MiB out of 956.65 MiB left on /tmp.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf9t7i4grmvve8zqo8pb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf9t7i4grmvve8zqo8pb.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jenkins monitors your server's system resources constantly. It requires:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Minimum Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free Disk Space&lt;/td&gt;
&lt;td&gt;≥ 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free Temp Space &lt;code&gt;/tmp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;≥ 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free Swap Space&lt;/td&gt;
&lt;td&gt;&amp;gt; 0 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In my case the checks showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Free Disk Space: 22.26 GiB — perfectly fine&lt;/li&gt;
&lt;li&gt; Free Temp Space &lt;code&gt;/tmp&lt;/code&gt;: 951.90 MiB — &lt;strong&gt;below Jenkins threshold&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Free Swap Space: 0 B — &lt;strong&gt;no swap at all&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/tmp&lt;/code&gt; partition was only &lt;strong&gt;956 MiB total&lt;/strong&gt; — just under Jenkins' 1 GB requirement. So Jenkins automatically took the Built-In Node &lt;strong&gt;offline&lt;/strong&gt; and refused to run any builds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fix Option 1 — Increase &lt;code&gt;/tmp&lt;/code&gt; Size (Best for EC2)
&lt;/h2&gt;

&lt;p&gt;This is the permanent fix. We increase the &lt;code&gt;/tmp&lt;/code&gt; mount size by editing the filesystem configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Open the fstab file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/fstab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Add this line at the bottom&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tmpfs /tmp tmpfs defaults,size&lt;span class="o"&gt;=&lt;/span&gt;2G 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save and exit (&lt;code&gt;Ctrl+X&lt;/code&gt; → &lt;code&gt;Y&lt;/code&gt; → &lt;code&gt;Enter&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Remount /tmp without rebooting&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-o&lt;/span&gt; remount /tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Verify the new size&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Filesystem   Size   Used   Avail   Use%   Mounted on
tmpfs        2.0G   4.8M   2.0G    1%     /tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4zoaepnkb3qft9ytfbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4zoaepnkb3qft9ytfbq.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/tmp&lt;/code&gt; is now &lt;strong&gt;2 GB&lt;/strong&gt; — well above Jenkins' 1 GB threshold! ✅&lt;/p&gt;




&lt;h2&gt;
  
  
  Fix Option 2 — Quick Jenkins Restart (Temporary Fix)
&lt;/h2&gt;

&lt;p&gt;If you just need Jenkins back online quickly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart jenkins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clears temp files and forces Jenkins to recheck the threshold. Sometimes this is enough to bring the node back online temporarily.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus Fix — Add Swap Space (Important for t2.micro)
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;t2.micro&lt;/strong&gt; EC2 instances, swap is 0 B by default. Jenkins warns about this too. Here is how to create a 1 GB swap file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a 1GB swap file&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;fallocate &lt;span class="nt"&gt;-l&lt;/span&gt; 1G /swapfile

&lt;span class="c"&gt;# Set correct permissions&lt;/span&gt;
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;600 /swapfile

&lt;span class="c"&gt;# Set up swap space&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;mkswap /swapfile

&lt;span class="c"&gt;# Enable the swap&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;swapon /swapfile

&lt;span class="c"&gt;# Verify&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
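&lt;p&gt;One note: a swap file created this way disappears on reboot unless you also register it in &lt;code&gt;/etc/fstab&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make the swap file survive reboots&lt;/span&gt;
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;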






&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;After applying Fix Option 1 and adding swap space the node came back online immediately!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1me8y5g6c5hftzqxxl7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1me8y5g6c5hftzqxxl7m.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-In Node came back &lt;strong&gt;Online&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Jenkins disk space warning disappeared &lt;/li&gt;
&lt;li&gt;Builds started running immediately &lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;t2.micro has very limited resources&lt;/strong&gt; — always check &lt;code&gt;/tmp&lt;/code&gt; size when setting up Jenkins on a free tier EC2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never ignore Jenkins resource warnings&lt;/strong&gt; — they directly affect whether your node stays online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix Option 1 is permanent&lt;/strong&gt;, Fix Option 2 is temporary. Always go with Option 1 for a stable setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always add swap on t2.micro&lt;/strong&gt; — it prevents a lot of memory-related Jenkins issues down the line.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Setting up Jenkins on EC2? Drop your questions in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>jenkins</category>
      <category>devops</category>
      <category>aws</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Git Commands Every DevOps Engineer Must Know</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:23:56 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/git-commands-every-devops-engineer-must-know-5c09</link>
      <guid>https://dev.to/mumtaz2029/git-commands-every-devops-engineer-must-know-5c09</guid>
      <description>&lt;h2&gt;
  
  
  Git Commands Every DevOps Engineer Must Know
&lt;/h2&gt;

&lt;p&gt;Git is not just a version control tool — it's your &lt;strong&gt;daily survival kit&lt;/strong&gt; as a DevOps engineer. Whether you're managing pipelines, fixing production issues, or collaborating with teams, these commands will save you every single day.&lt;/p&gt;

&lt;p&gt;Let's break it down section by section.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Initial Setup — Configure Git &amp;amp; Start Your Project
&lt;/h2&gt;

&lt;p&gt;Before anything else, set up Git properly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set your global username&lt;/span&gt;
git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.name &lt;span class="s2"&gt;"Your Name"&lt;/span&gt;

&lt;span class="c"&gt;# Set your global email&lt;/span&gt;
git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.email &lt;span class="s2"&gt;"your@email.com"&lt;/span&gt;

&lt;span class="c"&gt;# Initialize a new Git repository&lt;/span&gt;
git init

&lt;span class="c"&gt;# Clone an existing repository&lt;/span&gt;
git clone &amp;lt;repo-url&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Start every project the right way — proper config avoids identity issues in commits later.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Daily Git Flow — Your Everyday DevOps Cycle
&lt;/h2&gt;

&lt;p&gt;This is the cycle you'll repeat every single day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the status of your changes&lt;/span&gt;
git status

&lt;span class="c"&gt;# Stage all changes to be committed&lt;/span&gt;
git add &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Commit your changes with a message&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"your message here"&lt;/span&gt;

&lt;span class="c"&gt;# Push changes to remote repository&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Practice it. Automate it. Master it.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Branching Strategy — Work Smarter in CI/CD &amp;amp; Teamwork
&lt;/h2&gt;

&lt;p&gt;Branching is essential for clean CI/CD pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List all branches in your repository&lt;/span&gt;
git branch

&lt;span class="c"&gt;# Create a new branch and switch to it&lt;/span&gt;
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature-branch

&lt;span class="c"&gt;# Switch to an existing branch&lt;/span&gt;
git switch branch-name

&lt;span class="c"&gt;# Merge changes from another branch&lt;/span&gt;
git merge branch-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Always keep your &lt;code&gt;main&lt;/code&gt; branch &lt;strong&gt;clean and stable&lt;/strong&gt;. Never push untested code directly to main.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Sync With Remote — Keep Your Local Repo Up to Date
&lt;/h2&gt;

&lt;p&gt;Always sync before you push:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fetch and merge changes from remote to local&lt;/span&gt;
git pull

&lt;span class="c"&gt;# Fetch changes from remote (without merging)&lt;/span&gt;
git fetch

&lt;span class="c"&gt;# Show remote repositories and their URLs&lt;/span&gt;
git remote &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Always &lt;code&gt;git pull&lt;/code&gt; before you &lt;code&gt;git push&lt;/code&gt; — it helps you avoid conflicts and ensures successful deployments.&lt;/p&gt;
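&lt;p&gt;If you prefer a linear history, a common variant is to rebase your local commits on top of the remote branch instead of merging:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replay your local commits on top of the fetched remote branch&lt;/span&gt;
git pull --rebase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;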




&lt;h2&gt;
  
  
  5. Debug Like a Pro — Inspect History, Changes &amp;amp; Code
&lt;/h2&gt;

&lt;p&gt;When something breaks in production, these are your best friends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show commit history&lt;/span&gt;
git log

&lt;span class="c"&gt;# View changes between working tree, staging or commits&lt;/span&gt;
git diff

&lt;span class="c"&gt;# Show details of a specific commit&lt;/span&gt;
git show &amp;lt;commit-id&amp;gt;

&lt;span class="c"&gt;# See who changed each line and why&lt;/span&gt;
git blame &amp;lt;file&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; &lt;code&gt;git blame&lt;/code&gt; helps you find who broke production 👀 — use it wisely!&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Undo &amp;amp; Fix — Life Saver Commands
&lt;/h2&gt;

&lt;p&gt;Mistakes happen. Fix them like a pro:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Undo last commit, keep changes staged&lt;/span&gt;
git reset &lt;span class="nt"&gt;--soft&lt;/span&gt; HEAD~1

&lt;span class="c"&gt;# Undo last commit and discard all changes (Use with caution!)&lt;/span&gt;
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; HEAD~1

&lt;span class="c"&gt;# Safely undo changes by creating a new commit (great for shared repos)&lt;/span&gt;
git revert &amp;lt;commit-id&amp;gt;

&lt;span class="c"&gt;# Temporarily save changes and clean your working directory&lt;/span&gt;
git stash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
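&lt;p&gt;And when you're ready to pick the stashed work back up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Re-apply the most recent stash and remove it from the stash list&lt;/span&gt;
git stash pop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;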



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Use &lt;code&gt;git revert&lt;/code&gt; over &lt;code&gt;git reset --hard&lt;/code&gt; in shared/production repos — it's safer and keeps history clean.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Bonus — Extra Power Commands
&lt;/h2&gt;

&lt;p&gt;Level up your Git game:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a tag for a specific version&lt;/span&gt;
git tag &lt;span class="nt"&gt;-a&lt;/span&gt; v1.0 &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"version 1.0"&lt;/span&gt;

&lt;span class="c"&gt;# Push a specific tag to remote&lt;/span&gt;
git push origin v1.0

&lt;span class="c"&gt;# Apply a specific commit from another branch&lt;/span&gt;
git cherry-pick &amp;lt;commit-id&amp;gt;

&lt;span class="c"&gt;# Check repository integrity and find issues&lt;/span&gt;
git fsck &lt;span class="nt"&gt;--full&lt;/span&gt;

&lt;span class="c"&gt;# Remove untracked files and directories&lt;/span&gt;
git clean &lt;span class="nt"&gt;-fd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Master these commands and you're ready to tackle any DevOps workflow with confidence!&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check current changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sync from remote&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git stash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Save work temporarily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git revert&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safely undo in shared repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git blame&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find who changed what&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git cherry-pick&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apply specific commits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git tag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Version your releases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Git is the backbone of every DevOps workflow. The engineers who know these commands deeply don't just push code — they &lt;strong&gt;ship with confidence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Save this post for your next deployment! &lt;/p&gt;




&lt;p&gt;&lt;em&gt;Which Git command has saved you the most in production? Drop it in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>git</category>
      <category>productivity</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>DevOps Scenario Interview Question: Deployment Failed in Production</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Sat, 18 Apr 2026 14:11:02 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/devops-scenario-interview-question-deployment-failed-in-production-2bii</link>
      <guid>https://dev.to/mumtaz2029/devops-scenario-interview-question-deployment-failed-in-production-2bii</guid>
      <description>&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;devops&lt;/code&gt; &lt;code&gt;kubernetes&lt;/code&gt; &lt;code&gt;cicd&lt;/code&gt; &lt;code&gt;career&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenario: Your Deployment Failed in Production. What Steps Will You Take?
&lt;/h2&gt;

&lt;p&gt;This is one of the most common &lt;strong&gt;real-world scenario questions&lt;/strong&gt; asked in DevOps interviews. Interviewers don't want textbook answers — they want to know how you think under pressure.&lt;/p&gt;

&lt;p&gt;Here's the complete answer framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  Answer: Step-by-Step Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Check CI/CD Pipeline Logs
&lt;/h3&gt;

&lt;p&gt;First thing — don't guess, &lt;strong&gt;read the logs&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For Jenkins&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /var/log/jenkins/jenkins.log

&lt;span class="c"&gt;# For GitHub Actions — check the Actions tab in your repo&lt;/span&gt;

&lt;span class="c"&gt;# For GitLab CI&lt;/span&gt;
gitlab-ci logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline log tells you exactly &lt;strong&gt;where&lt;/strong&gt; it broke.&lt;/p&gt;
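&lt;p&gt;If the pipeline runs on GitHub Actions, the &lt;code&gt;gh&lt;/code&gt; CLI can pull those logs from the terminal; a sketch (the run id is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List recent workflow runs for this repo&lt;/span&gt;
gh run list --limit 5

&lt;span class="c"&gt;# Print only the logs of the failed steps for one run&lt;/span&gt;
gh run view &amp;lt;run-id&amp;gt; --log-failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;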




&lt;h3&gt;
  
  
  2. Identify the Failed Stage (Build / Test / Deploy)
&lt;/h3&gt;

&lt;p&gt;Every pipeline has stages. Narrow it down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build failed?&lt;/strong&gt; → Dependency issue, Dockerfile error, compilation error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test failed?&lt;/strong&gt; → A test caught a regression before it hit production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy failed?&lt;/strong&gt; → Kubernetes issue, wrong image tag, resource limits, misconfigured secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Knowing the stage cuts your debugging time in half.&lt;/p&gt;
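&lt;p&gt;For a failed deploy stage, a first sweep might look like this (the label selector assumes the pods are labeled &lt;code&gt;app=my-app&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Recent cluster events, newest last&lt;/span&gt;
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20

&lt;span class="c"&gt;# State of the pods the deploy touched&lt;/span&gt;
kubectl get pods -l app=my-app

&lt;span class="c"&gt;# Why a specific pod isn't starting&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;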




&lt;h3&gt;
  
  
  3. Verify Configuration Changes
&lt;/h3&gt;

&lt;p&gt;Check what changed before the failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check recent git commits&lt;/span&gt;
git log &lt;span class="nt"&gt;--oneline&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;

&lt;span class="c"&gt;# Check Kubernetes config changes&lt;/span&gt;
kubectl describe deployment my-app

&lt;span class="c"&gt;# Check if secrets/configmaps were updated&lt;/span&gt;
kubectl get configmap my-app-config &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most production failures trace back to a &lt;strong&gt;config change&lt;/strong&gt; someone forgot to mention.&lt;/p&gt;
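&lt;p&gt;One way to surface an unannounced change, sketched under the assumption that your manifests live in a &lt;code&gt;k8s/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Diff a local manifest against what's live in the cluster&lt;/span&gt;
kubectl diff -f k8s/deployment.yaml

&lt;span class="c"&gt;# Inspect what an earlier rollout revision looked like&lt;/span&gt;
kubectl rollout history deployment/my-app --revision=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;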




&lt;h3&gt;
  
  
  4. Rollback to Previous Stable Version
&lt;/h3&gt;

&lt;p&gt;Don't try to fix forward when production is down. &lt;strong&gt;Rollback first, fix later.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kubernetes rollback&lt;/span&gt;
kubectl rollout undo deployment/my-app

&lt;span class="c"&gt;# Verify rollback status&lt;/span&gt;
kubectl rollout status deployment/my-app

&lt;span class="c"&gt;# Check rollout history&lt;/span&gt;
kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This restores service immediately while you investigate the root cause safely.&lt;/p&gt;
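&lt;p&gt;If the immediately previous revision was also bad, you can target an older one explicitly (the revision number here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List the known revisions first&lt;/span&gt;
kubectl rollout history deployment/my-app

&lt;span class="c"&gt;# Roll back to a specific revision instead of just the last one&lt;/span&gt;
kubectl rollout undo deployment/my-app --to-revision=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;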




&lt;h3&gt;
  
  
  5. Fix the Issue and Redeploy
&lt;/h3&gt;

&lt;p&gt;Once production is stable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reproduce the issue in staging&lt;/li&gt;
&lt;li&gt;Apply the fix&lt;/li&gt;
&lt;li&gt;Test thoroughly&lt;/li&gt;
&lt;li&gt;Redeploy with the corrected version
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-app&lt;span class="o"&gt;=&lt;/span&gt;my-image:v2.1-fixed
kubectl rollout status deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pro Tip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Always maintain versioned Docker images&lt;/strong&gt; — never use &lt;code&gt;latest&lt;/code&gt; in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:latest&lt;/span&gt;

&lt;span class="c1"&gt;# Good&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v2.0.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without versioned images, you can't roll back. Tag every release.&lt;/p&gt;
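&lt;p&gt;Tagging is cheap; a minimal sketch (the registry host and version are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and tag with an explicit version&lt;/span&gt;
docker build -t registry.example.com/my-app:v2.0.1 .

&lt;span class="c"&gt;# Push the versioned tag, not :latest&lt;/span&gt;
docker push registry.example.com/my-app:v2.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;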




&lt;h2&gt;
  
  
  Bonus: What Interviewers Are Really Looking For
&lt;/h2&gt;

&lt;p&gt;They want to see that you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't panic&lt;/li&gt;
&lt;li&gt;Prioritize restoring service over finding blame&lt;/li&gt;
&lt;li&gt;Think in structured steps&lt;/li&gt;
&lt;li&gt;Know the actual commands, not just the theory&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Preparing for a DevOps interview? Drop your toughest scenario question in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>career</category>
      <category>cicd</category>
      <category>devops</category>
      <category>interview</category>
    </item>
    <item>
      <title>How I Fixed a Kubernetes CrashLoopBackOff in Production</title>
      <dc:creator>Mumtaz Jahan</dc:creator>
      <pubDate>Fri, 17 Apr 2026 23:55:42 +0000</pubDate>
      <link>https://dev.to/mumtaz2029/how-to-fixed-a-kubernetes-crashloopbackoff-in-production-232f</link>
      <guid>https://dev.to/mumtaz2029/how-to-fixed-a-kubernetes-crashloopbackoff-in-production-232f</guid>
      <description>&lt;p&gt;&lt;em&gt;Tags:&lt;/em&gt; &lt;code&gt;kubernetes&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt; &lt;code&gt;debugging&lt;/code&gt; &lt;code&gt;cloud&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem: Application Was DOWN in Kubernetes
&lt;/h2&gt;

&lt;p&gt;One of the most stressful moments in DevOps — you check your monitoring dashboard and your application is completely &lt;strong&gt;DOWN&lt;/strong&gt; in Kubernetes. No graceful degradation. Just... down.&lt;/p&gt;

&lt;p&gt;Here's exactly how I diagnosed and fixed it in under an hour.&lt;/p&gt;




&lt;h2&gt;
  
  
  Issue Found: Pod Was in CrashLoopBackOff
&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;kubectl get pods&lt;/code&gt; revealed the culprit immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME                        READY   STATUS             RESTARTS   AGE
my-app-7d9f8b6c4-xk2pq     0/1     CrashLoopBackOff   8          20m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CrashLoopBackOff&lt;/code&gt; means Kubernetes is repeatedly trying to start your container, it crashes, and Kubernetes backs off with increasing wait times before retrying. Something inside the container was causing it to exit immediately on startup.&lt;/p&gt;
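&lt;p&gt;A useful next question is &lt;em&gt;how&lt;/em&gt; the container is dying; its last exit code is recorded on the pod. A sketch, assuming a single-container pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1 usually means an application error; 137 means SIGKILL (often an OOM kill)&lt;/span&gt;
kubectl get pod my-app-7d9f8b6c4-xk2pq -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;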




&lt;h2&gt;
  
  
  Debug Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Checked Logs (&lt;code&gt;kubectl logs&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs my-app-7d9f8b6c4-xk2pq &lt;span class="nt"&gt;--previous&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--previous&lt;/code&gt; flag is crucial here — it lets you see logs from the &lt;em&gt;crashed&lt;/em&gt; container, not the current (possibly empty) one. The logs showed repeated connection errors on startup.&lt;/p&gt;
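&lt;p&gt;A couple of variations that often help when the pod name changes on every restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Limit output to the last lines of the crashed container&lt;/span&gt;
kubectl logs my-app-7d9f8b6c4-xk2pq --previous --tail=50

&lt;span class="c"&gt;# Log by deployment instead of by (ever-changing) pod name&lt;/span&gt;
kubectl logs deployment/my-app --all-containers=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;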

&lt;h3&gt;
  
  
  Step 2: Checked Config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod my-app-7d9f8b6c4-xk2pq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I inspected the environment variables and ConfigMaps attached to the pod. The &lt;code&gt;describe&lt;/code&gt; command is a goldmine — it shows events, resource limits, volume mounts, and more.&lt;/p&gt;
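&lt;p&gt;To zero in on configuration alone, you can query the env vars and events directly (the jsonpath assumes a single container):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump the container's environment variables as JSON&lt;/span&gt;
kubectl get pod my-app-7d9f8b6c4-xk2pq -o jsonpath='{.spec.containers[0].env}'

&lt;span class="c"&gt;# Events scoped to this one pod&lt;/span&gt;
kubectl get events --field-selector involvedObject.name=my-app-7d9f8b6c4-xk2pq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;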

&lt;h3&gt;
  
  
  Step 3: Found DB Connection Issue
&lt;/h3&gt;

&lt;p&gt;The logs made it clear: the app was trying to connect to the database using an &lt;strong&gt;incorrect connection string&lt;/strong&gt;. The host value in the environment variable was pointing to a stale endpoint. The app would crash immediately on boot since it couldn't reach the DB.&lt;/p&gt;
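&lt;p&gt;A crash-looping container is hard to exec into, so a throwaway pod is a handy way to confirm a bad endpoint; a sketch that assumes the DB listens on 5432:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Spin up a temporary busybox pod and test reachability of the DB host&lt;/span&gt;
kubectl run net-test --rm -it --image=busybox --restart=Never -- nc -zv correct-db-host.internal 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;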




&lt;h2&gt;
  
  
  Fix Applied
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Corrected Environment Variables
&lt;/h3&gt;

&lt;p&gt;Updated the Kubernetes secret/configmap with the correct database host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl edit secret my-app-db-secret
&lt;span class="c"&gt;# or&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set env &lt;/span&gt;deployment/my-app &lt;span class="nv"&gt;DB_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;correct-db-host.internal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Restarted the Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout restart deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then watched the rollout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout status deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🎉 Result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Application UP
 Issue resolved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pods came up healthy, readiness probes passed, and traffic started flowing again.&lt;/p&gt;
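&lt;p&gt;A quick way to confirm recovery end to end (the label and Service name are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pods ready?&lt;/span&gt;
kubectl get pods -l app=my-app

&lt;span class="c"&gt;# Service routing? Endpoints appear only once readiness probes pass&lt;/span&gt;
kubectl get endpoints my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;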




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Always check logs with &lt;code&gt;--previous&lt;/code&gt;&lt;/strong&gt; — the live container may have no logs if it crashes before writing any.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl describe pod&lt;/code&gt;&lt;/strong&gt; is your best friend for seeing the full picture: events, env vars, resource pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CrashLoopBackOff is almost always one of:&lt;/strong&gt; bad env vars/secrets, missing config, OOM kill, or a bug triggered at startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl rollout restart&lt;/code&gt;&lt;/strong&gt; is safer than deleting pods manually: it performs a rolling restart, so the service stays available as long as you run multiple replicas with working readiness probes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hit a similar issue? Drop your debugging story in the comments!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
