DEV Community

Mustkhim Inamdar


Jenkins in Production: Real Issues, RCA, and Fixes That Actually Work

This isn’t theory. These are real production issues I faced with Jenkins, documented with the actual RCA, the troubleshooting steps I followed, and how I fixed and hardened the pipeline after recovery.


1. Jenkins Master Became Unresponsive During Peak Hours

What happened:

  • Jenkins UI crashed
  • Builds queued indefinitely
  • Engineers across multiple teams blocked

Duration of impact:

~2 hours


Root Cause:

  • JVM heap space exhausted → OutOfMemoryError
  • Disk usage at 100% because old builds and artifacts were never cleaned up
  • SCM hooks and Git polling flooded the build queue
  • No workspace cleanup on matrix builds

Troubleshooting

tail -f /var/log/jenkins/jenkins.log
jcmd <pid> GC.heap_info
df -h
htop
  • Killed large orphaned processes left behind by aborted builds
  • Cleared build directories
  • Restarted Jenkins after freeing up memory
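
Before clearing build directories, it helps to know which ones are actually eating the disk. The sketch below ranks the immediate subdirectories of a path by size. It assumes GNU `du` and `sort`; the function name `largest_dirs` is hypothetical, and `/var/lib/jenkins/jobs` in the usage note is the default layout, which may differ on your install.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the N largest immediate subdirectories of a path.
# Output is "<size in KiB><TAB><path>", largest first.
largest_dirs() {
  local root="$1" count="${2:-10}"
  # -x stays on one filesystem; --max-depth=1 reports each child directory once.
  # du also prints a total line for the root itself, which awk filters out.
  du -x -k --max-depth=1 "$root" 2>/dev/null \
    | sort -rn \
    | awk -F'\t' -v root="$root" '$2 != root' \
    | head -n "$count"
}
```

For example, `largest_dirs /var/lib/jenkins/jobs 5` would show the five heaviest jobs, which is usually enough to decide where a build discarder is missing.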

✅ Fixes Applied

JVM Memory Configuration

export JENKINS_JAVA_OPTIONS="-Xms2g -Xmx4g"
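
Since the root cause was an OutOfMemoryError, it may be worth asking the JVM to write a heap dump when it happens again, so the next RCA has evidence to work with. This is a sketch, not the configuration from the incident: the dump directory is an assumption, and which variable your packaging reads (`JENKINS_JAVA_OPTIONS`, `JAVA_OPTS`, or similar) depends on how Jenkins was installed, so check your service file.

```shell
# Sketch: same heap sizes as above, plus a heap dump on OutOfMemoryError.
# HEAP_DUMP_DIR is an assumed location; it must exist and have room for a
# file roughly the size of the heap.
HEAP_DUMP_DIR="/var/lib/jenkins/heapdumps"

export JENKINS_JAVA_OPTIONS="-Xms2g -Xmx4g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=${HEAP_DUMP_DIR}"
```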

Pipeline Cleanup & Discarder

options {
  buildDiscarder(logRotator(numToKeepStr: '10'))
}

post {
  always {
    cleanWs()
  }
}

Log Rotation & Backup

tar -czf jenkins_backup_$(date +%F).tar.gz /var/lib/jenkins
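
Daily archives will refill the disk unless old ones are pruned. A minimal retention sketch, assuming the archives keep the `jenkins_backup_<date>.tar.gz` naming from the command above and live in a single directory; the function name `prune_backups` and the 14-day default are hypothetical.

```shell
# Hypothetical helper: delete backup archives older than a retention window.
prune_backups() {
  local dir="$1" keep_days="${2:-14}"
  # find rounds file ages down to whole days, so -mtime +N matches
  # archives that are at least N+1 days old.
  find "$dir" -maxdepth 1 -type f -name 'jenkins_backup_*.tar.gz' \
    -mtime "+${keep_days}" -print -delete
}
```

Running `prune_backups /backups 14` from the same cron job that creates the archive keeps roughly two weeks of history; `/backups` is a placeholder path.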

SCM/Webhook Control

  • GitHub webhook throttled
  • Added quiet periods and throttle plugin

2. Jenkins Agent Disconnecting Mid-Build

What happened:

  • Builds failed halfway through execution
  • Logs were incomplete
  • Rebuilds triggered, wasting compute

Root Cause:

  • SSH/JNLP connection dropped due to firewall timeout
  • Cloud auto-scaling agents terminated mid-job
  • No agent lifecycle hooks defined

Troubleshooting

  • Checked jenkins.log and agent logs
  • Verified cloud termination settings
  • Monitored for memory/cpu bottlenecks on agents

✅ Fixes Applied

Enable SSH Keep-Alive

ServerAliveInterval 60
ServerAliveCountMax 5

Cloud Agent Protection

  • Graceful shutdown scripts
  • Increased idle timeout before scale-in
  • Only terminate idle agents with no active job
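
The "only terminate idle agents" rule can be sketched as a check that a scale-in hook runs before it lets an instance go. This assumes the hook can reach the Jenkins REST API, where `/computer/<agent-name>/api/json` reports an `idle` field for the agent, and that the request uses the default depth (no `depth` parameter), so nested executor fields are not included. The function name `agent_is_idle` is hypothetical.

```shell
# Hypothetical check for a scale-in hook: succeed only if the agent is idle.
# Reads the JSON served at /computer/<agent-name>/api/json on stdin.
agent_is_idle() {
  # Tolerates both compact and pretty-printed JSON.
  grep -Eq '"idle"[[:space:]]*:[[:space:]]*true'
}
```

A hook might call it as `curl -fsS -u "$JENKINS_USER:$JENKINS_API_TOKEN" "$JENKINS_URL/computer/$AGENT_NAME/api/json" | agent_is_idle` and skip termination on a non-zero exit. All four variables are placeholders you would supply.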

3. Jenkins Master UI Was Freezing Frequently

What happened:

  • Jenkins dashboard became sluggish
  • Admin actions timed out
  • Pipelines remained queued

Root Cause:

  • Scripted pipelines executing heavy shell operations on master
  • Plugins with memory leaks
  • Too many concurrent builds running on the master's own executors

Troubleshooting

  • Analyzed /metrics endpoint
  • Monitored heap and thread dumps
  • Disabled high-impact plugins
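
A raw thread dump is long, so a quick count of thread states is often the first thing to look at: a pile of BLOCKED threads points at lock contention rather than CPU load. The sketch below summarises a dump produced by `jcmd <pid> Thread.print` or `jstack <pid>`; the function name `thread_state_summary` is hypothetical.

```shell
# Hypothetical helper: count threads per state from a JVM thread dump on stdin.
thread_state_summary() {
  # Each thread carries a line like
  #   "   java.lang.Thread.State: BLOCKED (on object monitor)"
  # Keep only the state name, then count occurrences, most frequent first.
  sed -n 's/^[[:space:]]*java\.lang\.Thread\.State: \([A-Z_]*\).*/\1/p' \
    | sort | uniq -c | sort -rn
}
```

Usage would look like `jcmd <pid> Thread.print | thread_state_summary`, taken a few times a minute apart to see whether the same states keep dominating.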

✅ Fixes Applied

Move All Jobs Off Master

  • Set “Restrict where this project runs”
  • Enforce master isolation

Switch to Declarative Pipelines

  • Reduced memory footprint
  • Improved readability and safety

Enable Monitoring


4. Secrets Leaked into Logs and Artifacts

What happened:

  • Tokens and .pem files showed up in Jenkins logs
  • .env files archived as pipeline artifacts
  • Security audit failed

Root Cause:

  • Use of echo $TOKEN inside sh blocks
  • No use of withCredentials wrapper
  • Logs not masked automatically

Troubleshooting

  • Scanned logs using keywords like AKIA, BEGIN RSA, etc.
  • Reviewed artifact contents for secrets
  • Reviewed pipeline code across teams
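
The keyword scan can be sketched as a small function. It only covers two of the patterns mentioned above (AWS access key IDs and PEM private-key headers), so treat it as a starting point rather than a complete scanner. The function name `scan_for_secrets` is hypothetical.

```shell
# Hypothetical helper: list files under a directory that contain text shaped
# like an AWS access key ID or a PEM private-key header.
# Prints file names only (-l) so the scan does not echo the secret itself.
# Exits non-zero when nothing matches, as grep does.
scan_for_secrets() {
  local dir="$1"
  grep -rlE \
    -e 'AKIA[0-9A-Z]{16}' \
    -e '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----' \
    "$dir"
}
```

On a default install, build logs live under the jobs directory, so `scan_for_secrets /var/lib/jenkins/jobs` is a reasonable first pass.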

✅ Fixes Applied

Mask Secrets

withCredentials([string(credentialsId: 'secret-token', variable: 'TOKEN')]) {
  sh 'curl -H "Authorization: Bearer $TOKEN" https://secure-api/'
}

Block Sensitive Artifact Upload

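steps {
  // Fail the build if anything in ./build looks like an AWS access key ID.
  // -l prints only file names, so the check itself does not leak the key into the log.
  sh '''
    if grep -rlE 'AKIA[0-9A-Z]{16}' ./build; then
      echo "Possible AWS access key found in the files listed above" >&2
      exit 1
    fi
  '''
}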

Enable Secret Scanning Tools

  • Integrated GitLeaks
  • Pre-check builds for secret patterns

5. Jenkins Plugin Incompatibility After Upgrade

What happened:

  • Jenkins failed to start after a routine upgrade
  • Multiple jobs crashed due to missing plugin dependencies
  • UI elements broke, pipelines wouldn't compile

Root Cause:

  • Plugin versions upgraded without checking compatibility
  • Jenkins core version jumped ahead
  • Deprecated scripted pipelines using outdated plugin APIs

Troubleshooting

  • Accessed Jenkins in safe mode
  • Checked /var/lib/jenkins/plugins
  • Rolled back version via backup restore

✅ Fixes Applied

Version Lock with plugins.txt

git:4.11.5
workflow-aggregator:2.6
credentials:2.6.1

Test Updates on Staging First

  • Jenkins Docker image with pinned plugins
  • Automated plugin diff validation via jenkins-plugin-cli
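
The diff step can be as simple as comparing two files in `plugins.txt` format. The sketch below assumes you already have an export of the installed plugins as one `name:version` per line; how you produce that export is up to your setup. The function name `plugin_drift` and both file names in the usage note are hypothetical.

```shell
# Hypothetical helper: report plugins whose installed version differs from the pin.
# Both arguments are files with one "name:version" entry per line.
plugin_drift() {
  local pinned="$1" installed="$2"
  awk -F: '
    NR == FNR { want[$1] = $2; next }   # first file: remember pinned versions
    ($1 in want) && want[$1] != $2 {    # second file: compare installed versions
      printf "%s pinned=%s installed=%s\n", $1, want[$1], $2
    }
  ' "$pinned" "$installed"
}
```

In a staging job, `plugin_drift plugins.txt installed.txt` printing anything at all is a reasonable signal to fail the run before the change reaches production.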

Upgrade Policy Aligned with LTS Cycle


📋 Summary Table: RCA & Fixes

Issue | Root Cause | Fix Applied
Master Unresponsive | Heap/Disk full, SCM flooding | Memory tuning, cleanup, webhook control
Agent Disconnects | Network timeout, auto-scale kills | Keep-alive, lifecycle hooks
UI Freezing | Master overload, heavy plugins | Pipeline refactor, monitoring
Secrets in Logs | Unsafe usage of shell/env | withCredentials, scanning
Plugin Failures | Incompatible versions | Pin versions, test on staging

✅ Jenkins Production Readiness Checklist

  • [x] JVM heap and thread monitoring
  • [x] Log/artifact cleanup via pipeline config
  • [x] Declarative pipelines with clean stages
  • [x] Secrets masked and scanned
  • [x] Plugin versions pinned in code
  • [x] Staging Jenkins for dry runs
  • [x] Backup + disaster recovery tested monthly

Final Take

Jenkins is a battle-tested CI/CD engine, but left unchecked it can become fragile and costly in production. These five real-world issues cost teams hours, if not days, but they also taught us to:

  • Think of Jenkins like core infrastructure
  • Use IaC principles to control configuration
  • Automate hygiene and disaster recovery

About the Author

Mustkhim Inamdar
Cloud-Native DevOps Architect | Platform Engineer | CI/CD Specialist
Passionate about automation, scalability, and next-gen tooling. With years of experience across Big Data, Cloud Operations (AWS), CI/CD, and DevOps for automotive systems, I’ve delivered robust solutions using tools like Terraform, Jenkins, Kubernetes, LDRA, Polyspace, MATLAB/Simulink, and more.

I love exploring emerging tech like GitOps, MLOps, and Generative AI, and sharing practical insights from real-world projects.

📬 Let’s connect:
🔗 LinkedIn
📘 GitHub
🧠 Blog series on DevOps + AI coming soon!


💬 Got your own Jenkins horror story?
Drop it in the comments or DM me on LinkedIn. Let’s learn from each other’s scars and build resilient CI/CD systems.
