This isn’t theory. These are real production issues I faced with Jenkins, documented with the actual root cause analysis (RCA), the troubleshooting steps I followed, and how I fixed and hardened the pipeline after recovery.
1. Jenkins Master Became Unresponsive During Peak Hours
What happened:
- Jenkins UI crashed
- Builds queued indefinitely
- Engineers across multiple teams blocked
Duration of impact:
~2 hours
Root Cause:
- JVM heap space exhausted → OutOfMemoryError
- Disk usage at 100% → no cleanup of old builds/artifacts
- SCM hooks and Git polling overwhelmed the executor queue
- No workspace cleanup on matrix builds
Troubleshooting
tail -f /var/log/jenkins/jenkins.log   # watch for OutOfMemoryError and stack traces
jcmd <pid> GC.heap_info                # inspect live JVM heap usage
df -h                                  # confirm disk utilization (it was at 100%)
htop                                   # check CPU/memory pressure and zombie processes
- Killed large zombie processes
- Cleared build directories
- Restarted Jenkins after freeing up memory
✅ Fixes Applied
JVM Memory Configuration
export JENKINS_JAVA_OPTIONS="-Xms2g -Xmx4g"
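Where exactly these flags land depends on how Jenkins was installed. On a systemd-based install, a minimal sketch of the same tuning via an override (the variable name may be JAVA_OPTS or JENKINS_JAVA_OPTIONS depending on package version; values are examples):

```bash
sudo systemctl edit jenkins
# In the editor that opens, add:
#   [Service]
#   Environment="JAVA_OPTS=-Xms2g -Xmx4g"
sudo systemctl daemon-reload
sudo systemctl restart jenkins
```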
Pipeline Cleanup & Discarder
options {
    buildDiscarder(logRotator(numToKeepStr: '10'))
}
post {
    always {
        cleanWs()
    }
}
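Put together, a minimal declarative pipeline using both fixes might look like this (stage and build command are placeholders):

```groovy
pipeline {
    agent any
    options {
        // keep only the last 10 builds to cap disk usage
        buildDiscarder(logRotator(numToKeepStr: '10'))
    }
    stages {
        stage('Build') {
            steps {
                sh 'make build'   // placeholder build step
            }
        }
    }
    post {
        always {
            cleanWs()   // requires the Workspace Cleanup plugin
        }
    }
}
```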
Log Rotation & Backup
tar -czf jenkins_backup_$(date +%F).tar.gz /var/lib/jenkins
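To keep backups from becoming the next disk-full incident, the tar command can be scheduled and old archives pruned. A sketch, assuming a /backup mount and 14-day retention (both are assumptions):

```bash
# /etc/cron.d/jenkins-backup (sketch; paths and retention are examples)
# Note: % must be escaped as \% inside cron entries.
0 2 * * * jenkins tar -czf /backup/jenkins_backup_$(date +\%F).tar.gz /var/lib/jenkins
0 3 * * * jenkins find /backup -name 'jenkins_backup_*.tar.gz' -mtime +14 -delete
```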
SCM/Webhook Control
- Throttled GitHub webhooks
- Added quiet periods and the Throttle Concurrent Builds plugin (sketch below)
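At the job level, quiet periods and concurrency limits can be expressed directly in the Jenkinsfile. A minimal sketch (values are illustrative; cross-job throttling was handled separately by the Throttle Concurrent Builds plugin):

```groovy
options {
    quietPeriod(60)             // wait 60s after a trigger so rapid pushes collapse into one build
    disableConcurrentBuilds()   // don't stack parallel runs of the same job
}
```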
2. Jenkins Agent Disconnecting Mid-Build
What happened:
- Builds failed halfway through execution
- Logs were incomplete
- Rebuilds triggered, wasting compute
Root Cause:
- SSH/JNLP connection dropped due to firewall timeout
- Cloud auto-scaling agents terminated mid-job
- No agent lifecycle hooks defined
Troubleshooting
- Checked jenkins.log and agent logs
- Verified cloud termination settings
- Monitored for memory/CPU bottlenecks on agents
✅ Fixes Applied
Enable TCP Keep-Alive
# In the SSH client config the controller uses to reach agents
# (e.g., ~/.ssh/config for the jenkins user)
ServerAliveInterval 60
ServerAliveCountMax 5
Cloud Agent Protection
- Graceful shutdown scripts
- Increased idle timeout before scale-in
- Only terminate idle agents with no active job (see the scale-in protection sketch below)
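On AWS, that last point can be enforced with scale-in protection: mark the instance as protected while a job runs and release it once idle. A sketch (the ASG name and instance ID are placeholders):

```bash
# Protect the agent's instance while a build is running
aws autoscaling set-instance-protection \
  --auto-scaling-group-name jenkins-agents-asg \
  --instance-ids "$INSTANCE_ID" \
  --protected-from-scale-in

# ... build finishes, agent goes idle ...

aws autoscaling set-instance-protection \
  --auto-scaling-group-name jenkins-agents-asg \
  --instance-ids "$INSTANCE_ID" \
  --no-protected-from-scale-in
```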
3. Jenkins Master UI Was Freezing Frequently
What happened:
- Jenkins dashboard became sluggish
- Admin actions timed out
- Pipelines remained queued
Root Cause:
- Scripted pipelines executing heavy shell operations on master
- Plugins with memory leaks
- Too many concurrent builds running on the master's executors
Troubleshooting
- Analyzed the /metrics endpoint
- Monitored heap and thread dumps
- Disabled high-impact plugins
✅ Fixes Applied
Move All Jobs Off Master
- Set “Restrict where this project runs” and pin jobs to agent labels (see the sketch below)
- Enforce master isolation by setting the built-in node's executor count to 0
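In pipeline code, the same restriction is a one-line agent declaration. A sketch (the label 'linux-build' is a placeholder for your dedicated agent pool):

```groovy
pipeline {
    agent { label 'linux-build' }   // never run build steps on the controller
    stages {
        stage('Build') {
            steps {
                sh 'make test'   // placeholder step
            }
        }
    }
}
```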
Switch to Declarative Pipelines
- Reduced memory footprint
- Improved readability and safety
Enable Monitoring
- Installed Metrics Plugin
- Integrated with Prometheus + Grafana (example scrape config below)
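A sketch of the Prometheus side, assuming the Jenkins Prometheus metrics plugin (which exposes metrics under /prometheus by default; host and port are placeholders):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'jenkins'
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['jenkins.example.com:8080']
```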
4. Secrets Leaked into Logs and Artifacts
What happened:
- Tokens and .pem files showed up in Jenkins logs
- .env files archived as pipeline artifacts
- Security audit failed
Root Cause:
- Use of echo $TOKEN inside sh blocks
- No use of the withCredentials wrapper
- Logs not masked automatically
Troubleshooting
- Scanned logs for keywords like AKIA, BEGIN RSA, etc.
- Reviewed artifact contents for secrets
- Reviewed pipeline code across teams
✅ Fixes Applied
Mask Secrets
withCredentials([string(credentialsId: 'secret-token', variable: 'TOKEN')]) {
    // single quotes leave $TOKEN to the shell, so Jenkins can mask it in the log
    sh 'curl -H "Authorization: Bearer $TOKEN" https://secure-api/'
}
Block Sensitive Artifact Upload
steps {
    sh '''
        # fail the build if an AWS access key pattern is found in the build output
        if grep -r 'AKIA' ./build; then exit 1; fi
    '''
}
Enable Secret Scanning Tools
- Integrated GitLeaks
- Pre-check builds for secret patterns (see the stage sketch below)
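A minimal sketch of such a pre-check stage, assuming the gitleaks CLI (v8+) is installed on the agent:

```groovy
stage('Secret Scan') {
    steps {
        // fails the build if gitleaks detects secret patterns in the workspace
        sh 'gitleaks detect --source . --report-path gitleaks-report.json'
    }
}
```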
5. Jenkins Plugin Incompatibility After Upgrade
What happened:
- Jenkins failed to start after a routine upgrade
- Multiple jobs crashed due to missing plugin dependencies
- UI elements broke, pipelines wouldn't compile
Root Cause:
- Plugin versions upgraded without checking compatibility
- Jenkins core version jumped ahead
- Deprecated scripted pipelines using outdated plugin APIs
Troubleshooting
- Accessed Jenkins in safe mode
- Checked /var/lib/jenkins/plugins
- Rolled back version via backup restore
✅ Fixes Applied
Version Lock with plugins.txt
git:4.11.5
workflow-aggregator:2.6
credentials:2.6.1
Test Updates on Staging First
- Jenkins Docker image with pinned plugins
- Automated plugin diff validation via jenkins-plugin-cli (Dockerfile sketch below)
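A sketch of that staging image, based on the official jenkins/jenkins image (pin an exact LTS tag in practice):

```dockerfile
FROM jenkins/jenkins:lts
COPY plugins.txt /usr/share/jenkins/ref/plugins.txt
# jenkins-plugin-cli resolves and installs the pinned plugin versions at image build time
RUN jenkins-plugin-cli --plugin-file /usr/share/jenkins/ref/plugins.txt
```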
Upgrade Policy Aligned with LTS Cycle
📋 Summary Table: RCA & Fixes
| Issue | Root Cause | Fix Applied |
|---|---|---|
| Master Unresponsive | Heap/disk full, SCM flooding | Memory tuning, cleanup, webhook control |
| Agent Disconnects | Network timeout, auto-scale kills | Keep-alive, lifecycle hooks |
| UI Freezing | Master overload, heavy plugins | Pipeline refactor, monitoring |
| Secrets in Logs | Unsafe usage of shell/env | withCredentials, scanning |
| Plugin Failures | Incompatible versions | Pin versions, test on staging |
✅ Jenkins Production Readiness Checklist
- [x] JVM heap and thread monitoring
- [x] Log/artifact cleanup via pipeline config
- [x] Declarative pipelines with clean stages
- [x] Secrets masked and scanned
- [x] Plugin versions pinned in code
- [x] Staging Jenkins for dry runs
- [x] Backup + disaster recovery tested monthly
Final Take
Jenkins is a battle-tested CI/CD engine, but left unchecked it can become fragile and costly in production. These five real-world issues cost teams hours, if not days. They also taught us how to:
- Treat Jenkins as core infrastructure
- Use IaC principles to control configuration
- Automate hygiene and disaster recovery
About the Author
Mustkhim Inamdar
Cloud-Native DevOps Architect | Platform Engineer | CI/CD Specialist
Passionate about automation, scalability, and next-gen tooling. With years of experience across Big Data, Cloud Operations (AWS), CI/CD, and DevOps for automotive systems, I’ve delivered robust solutions using tools like Terraform, Jenkins, Kubernetes, LDRA, Polyspace, MATLAB/Simulink, and more.
I love exploring emerging tech like GitOps, MLOps, and Generative AI, and sharing practical insights from real-world projects.
📬 Let’s connect:
🔗 LinkedIn
📘 GitHub
🧠 Blog series on DevOps + AI coming soon!
💬 Got your own Jenkins horror story?
Drop it in the comments or DM me on LinkedIn. Let’s learn from each other’s scars and build resilient CI/CD systems.