TL;DR
Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:
- Shared Library cache was disabled (existing issue)
- Switched to disposable agents (1/13 change)
- Increased build frequency (New Year effect)
Result: 50x increase in metadata IOPS → EFS burst credits drained over 2 weeks.
Why You Should Care
If you're running Jenkins on EFS, this could happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time-series analysis of metrics is crucial.
The Mystery: Symptoms vs. Root Cause
Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15GB of Git temporary files (tmp_pack_*) accumulated on EFS, causing metadata IOPS exhaustion.
We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?
Not quite.
When I checked the EFS Burst Credit Balance graph, I noticed something important:
The credit started declining around 1/13, but symptoms appeared on 1/26.
Timeline:
- 1/13: Credit decline starts
- 1/19: Rapid decline
- 1/26: Credit bottoms out
- 1/26-27: Symptoms appear
The tmp_pack_* accumulation was a symptom, not the root cause. Something changed on 1/13.
What Changed on 1/13?
Honestly, this stumped me. I had a few ideas, but nothing definitive:
1. Agent Architecture Change
Around 1/13, we changed our Jenkins agent strategy:
Before: Shared Agents
- EC2 instances: c5.large, etc.
- Multiple jobs share agents
- Workspace reuse
- git pull for incremental updates
After: Disposable Agents
- EC2 instances: t3.small, etc.
- One agent per job
- Destroy after use
- git clone for a full clone every time
The goal was cost reduction. We didn't consider metadata IOPS impact.
2. Post-New Year Development Rush
Teams ramped up development after the New Year holiday, increasing overall Jenkins load.
The Math: 50x Metadata IOPS Increase
Let me calculate the impact:
Builds per day: 50 (estimated)
Files created per clone: 5,000
Shared agent approach:
Clone once = 5,000 metadata operations
Disposable agent approach:
50 builds × 5,000 files = 250,000 metadata operations/day
50x increase in metadata IOPS.
Add the New Year rush, and the numbers get even worse.
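As a sanity check, here is the same estimate in a few lines of Python. The build count and files-per-clone figures are the same rough estimates used above, not measured values:

```python
# Back-of-the-envelope check of the metadata IOPS estimate above.
BUILDS_PER_DAY = 50          # estimated builds/day
FILES_PER_CLONE = 5_000      # rough file count written by one full clone

# Shared agents: clone once, then incremental `git pull`s
shared_ops = FILES_PER_CLONE

# Disposable agents: every build pays for a full clone
disposable_ops = BUILDS_PER_DAY * FILES_PER_CLONE

print(f"shared agents:     {shared_ops:,} metadata ops")
print(f"disposable agents: {disposable_ops:,} metadata ops/day")
print(f"increase:          {disposable_ops // shared_ops}x")
```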
Understanding Git Cache in Jenkins
During investigation, I noticed /mnt/efs/jenkins/caches/ directory:
/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e
This is Jenkins Git plugin's bare repository cache.
How Git Caching Works
Jenkins Git plugin optimizes clones by:
- Caching remote repos in /mnt/efs/jenkins/caches/git-{hash}/ as bare repositories
- Cloning to job workspaces with git clone --reference from this cache
- The hash is generated from the repo URL + branch combination
The problem: Disposable agents might not benefit from this cache since they're new every time.
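For intuition, here is a minimal sketch of the --reference pattern described above, written as plain git commands driven from Python. The repository URL and directories are placeholders, and this is not the plugin's actual code:

```python
# Illustration only: the two-step pattern behind the Git plugin's cache.
# REPO_URL and the directories below are placeholders, not our real setup.
import subprocess

REPO_URL = "https://example.com/org/shared-library.git"
CACHE_DIR = "/mnt/efs/jenkins/caches/git-cache"   # bare mirror (git-{hash} in Jenkins)
WORKSPACE = "/tmp/workspace/sample-job"

# 1. Keep a bare mirror; refreshing it only fetches deltas.
subprocess.run(["git", "clone", "--mirror", REPO_URL, CACHE_DIR], check=True)

# 2. Workspace clones borrow objects from the mirror via --reference,
#    so thousands of object files are not re-downloaded and re-written.
subprocess.run(
    ["git", "clone", "--reference", CACHE_DIR, REPO_URL, WORKSPACE],
    check=True,
)
```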
The Smoking Gun: tmp_pack_* Location
I revisited where tmp_pack_* files were located:
jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
└── 335abf.../root/.git/objects/pack/
└── tmp_pack_WqmOyE ← 100-300MB
These are in per-build directories:
jobs/sample-job/jobs/sample-pipeline/
└── builds/
├── 104/
│ └── libs/.../tmp_pack_WqmOyE
├── 105/
│ └── libs/.../tmp_pack_XYZ123
└── 106/
└── libs/.../tmp_pack_ABC456
Every build was re-checking out the Pipeline Shared Library, generating tmp_pack_* each time.
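A quick way to confirm this pattern is to walk the builds directories and total up the stray pack files. A small sketch, using the same mount path as above; the script itself is just an illustration, not part of our tooling:

```python
# Sum up tmp_pack_* leftovers under the Jenkins jobs tree on EFS.
import os

JOBS_DIR = "/mnt/efs/jenkins/jobs"   # adjust to your Jenkins home on EFS

count = 0
total_bytes = 0
for root, _dirs, files in os.walk(JOBS_DIR):
    for name in files:
        if name.startswith("tmp_pack_"):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total_bytes += size
            print(f"{size / 1e6:8.1f} MB  {path}")

print(f"\n{count} tmp_pack_* files, {total_bytes / 1e9:.2f} GB total")
```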
Question: Why is the Shared Library being fetched on every build?
Root Cause: Cache Setting Was OFF
I checked Jenkins configuration and found the smoking gun.
The Shared Library setting "Cache fetched versions on controller for quick retrieval" was unchecked.
This meant:
- Shared Library cache completely disabled
- Full fetch from remote repository on every build
- Temporary files generated in .git/objects/pack/
- Massive metadata IOPS consumption
The Fix: Enable Caching
I immediately changed the settings:
- Enabled "Cache fetched versions on controller for quick retrieval"
- Set "Refresh time in minutes" to 180 minutes
Choosing the Refresh Time
This is actually important. Options:
- 60-120 min: Fast updates, moderate IOPS reduction
- 180 min (3 hours): Balanced - 8 updates/day
- 360 min (6 hours): Stable operation - 4 updates/day
- 1440 min (24 hours): Maximum IOPS reduction
Why I chose 180 minutes:
- Update checks run ~8 times/day (9am, 12pm, 3pm, 6pm...)
- Shared Library changes being reflected within half a day is acceptable
- Significant IOPS reduction (every build → once per 3 hours)
- Can manually clear cache for urgent changes
Jenkins has a "force refresh" feature, so urgent changes aren't a problem. I documented this in our runbook so we don't forget.
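The arithmetic behind that choice, reusing the same rough per-fetch figure from earlier (a sketch, not measured data):

```python
# How the refresh interval trades update latency against metadata IOPS.
BUILDS_PER_DAY = 50
OPS_PER_FETCH = 5_000   # reusing the rough files-per-clone estimate from above

print("refresh (min)  fetches/day  metadata ops/day")
for minutes in (60, 180, 360, 1440):
    fetches_per_day = 24 * 60 // minutes
    print(f"{minutes:>13}  {fetches_per_day:>11}  {fetches_per_day * OPS_PER_FETCH:>16,}")

# Without the cache, every build pays for a full fetch:
print(f"{'no cache':>13}  {BUILDS_PER_DAY:>11}  {BUILDS_PER_DAY * OPS_PER_FETCH:>16,}")
```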
Measuring the Impact
Post-change monitoring plan:
Short-term (24-48 hours)
- No new tmp_pack_* files generated
- EFS metadata IOPS decrease
Mid-term (1 week)
- Burst Credit Balance recovery trend
- Stable build performance
Long-term (1 month)
- Credits remain stable
- No recurrence
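For the mid- and long-term items, the Burst Credit Balance trend can be pulled straight from CloudWatch. A minimal sketch assuming boto3 credentials and a placeholder file system ID; this is not our production monitoring, just the query behind the graphs in this post:

```python
# Pull two weeks of EFS BurstCreditBalance from CloudWatch.
from datetime import datetime, timedelta, timezone

import boto3

FILE_SYSTEM_ID = "fs-0123456789abcdef0"   # placeholder

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,                 # hourly data points
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Minimum']:.3e} bytes")
```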
Lessons Learned
1. Symptoms ≠ Root Cause Timeline
- Symptom appearance: 1/26-1/27
- Root cause: Around 1/13
- Credit depletion: Gradual over 2 weeks
Time-series analysis is crucial. Fixing only the visible symptoms leads to superficial solutions.
2. Architecture Changes Have Hidden Costs
The disposable agent change was for cost optimization. We did reduce EC2 costs, but created problems elsewhere.
When changing architecture:
- Evaluate performance impact beforehand
- Set up monitoring before the change
- Continue tracking metrics after
3. EFS Metadata IOPS Characteristics
- Mass creation/deletion of small files is deadly
- File count matters more than storage size
- Burst mode requires credit management
- Credit depletion happens gradually
Especially with .git/objects/ containing thousands of small files, behavior differs drastically from normal file I/O.
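If you want to see this on your own repos, comparing file count against total bytes under .git/objects makes the point quickly (the path is a placeholder, point it at any local clone):

```python
# Compare file count vs. total size under a clone's .git/objects directory.
import os

OBJECTS_DIR = "/path/to/some/clone/.git/objects"   # placeholder

files = 0
total_bytes = 0
for root, _dirs, names in os.walk(OBJECTS_DIR):
    for name in names:
        files += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

# On EFS, each of these files costs metadata operations (create/stat/delete),
# no matter how small it is.
print(f"{files:,} files, {total_bytes / 1e6:.1f} MB")
```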
4. Compound Root Causes
This issue wasn't a single cause but three factors combining:
- Shared Library cache disabled (pre-existing)
- Disposable agent switch (1/13)
- Increased builds (New Year)
Each alone might not have caused major issues, but together they exceeded the critical threshold.
Open Questions
While we enabled Shared Library caching, we're still using disposable agents.
Can agent-side Git cache be utilized effectively with disposable agents?
Possibilities:
- Share EFS Git cache across all agents
- Extend agent lifecycle slightly for reuse across jobs
- Cache in S3 and sync on startup
Finding the right balance between cost and performance remains a challenge.
I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/