Why You Should Care
If you're running Jenkins on AWS EFS, you might hit this exact problem. Git clone operations start timing out, Jenkins UI becomes painfully slow, and you get cryptic "Bad file descriptor" errors.
The culprit? Git temporary pack files accumulating over time, starving EFS of metadata IOPS.
The Problem
Monday morning. Jenkins dashboard takes forever to load. Hit the Replay button, build starts, then:
```
fatal: write error: Bad file descriptor
fatal: fetch-pack: invalid index-pack output
ERROR: Error cloning remote repo 'origin'
```
Transfer speed drops from 77KB/s to 51KB/s before timing out completely. 504 Gateway Timeout errors everywhere.
Initial Investigation
First thought: "Network issue?"
But looking closer at the error logs, I noticed pipeline-groovy-lib was failing during Shared Library checkout. That happens on the Jenkins Controller, not agents. So this is a Controller resource problem.
Checked CloudWatch metrics:
- ✅ CPU: Normal (0-5%, occasional 30% spikes)
- ✅ Network: Nothing unusual
- ✅ EBS disk latency: Stable at ~0.7s
Wait... this Jenkins uses EFS, not just EBS.
EFS Metrics Told the Real Story
Checked EFS CloudWatch metrics and found:
- Throughput utilization: Hitting 100% during 00:00-03:00
- IOPS: Metadata operations dominating
- Storage: Growing from 14GB → 17GB
📊 See detailed metrics and graphs in the full write-up
EFS was starving on metadata IOPS, not storage capacity.
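If you'd rather pull the same picture from the AWS CLI than the console, the AWS/EFS namespace exposes MetadataIOBytes directly. A minimal sketch, with a placeholder file system ID and time window:

```bash
# How much of the traffic is metadata? (fs-12345678 and the dates are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name MetadataIOBytes \
  --dimensions Name=FileSystemId,Value=fs-12345678 \
  --start-time 2025-01-27T00:00:00Z \
  --end-time 2025-01-27T03:00:00Z \
  --period 300 \
  --statistics Sum
```

Comparing it against TotalIOBytes over the same window shows just how dominant metadata traffic really is.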
What's Metadata IOPS?
In EFS, metadata operations include:
- File stat (checking size/timestamps)
- Directory listings
- File create/delete
- Permission changes
In other words: lots of small file operations consume metadata IOPS.
Jenkins workloads are full of these:
- Build logs (thousands of small files)
- Git repositories (.git/objects with tons of files)
- Shared Library clones
- Build fingerprints
It's not about storage size. It's about file count.
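A quick way to see this on your own mount is to compare size against file count per top-level directory. A rough sketch, assuming a GNU userland and our mount path:

```bash
# Size tells one story, file count tells another.
# Warning: this scan itself consumes metadata IOPS, so run it off-peak.
for d in /mnt/efs/jenkins/*/; do
  printf '%-40s %8s %10d files\n' "$d" \
    "$(du -sh "$d" 2>/dev/null | cut -f1)" \
    "$(find "$d" -type f 2>/dev/null | wc -l)"
done
```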
Finding the Culprit
Checked directory sizes:
```
du -sh /mnt/efs/jenkins/*
342M    plugins
106M    war
332M    logs
373M    caches
174M    fingerprints
timeout  jobs           # Suspicious timeout
```
Only found ~1.3GB. Missing ~15.7GB.
Searched for large files directly:
```bash
find /mnt/efs/jenkins -type f -size +100M -ls 2>/dev/null
```
Boom:
```
125829120 Jan 27 01:50 .../builds/873/libs/.../root/.git/objects/pack/tmp_pack_S3GPJw
122683392 Jan 27 01:50 .../builds/872/libs/.../root/.git/objects/pack/tmp_pack_c4EAwd
(dozens more of these...)
```
tmp_pack_* files everywhere. 100-300MB each.
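Before touching anything, it helps to total up how much space these leftovers actually occupy. A rough sketch, assuming GNU find (for -printf):

```bash
# Sum the size of every leftover Git temp pack under the Jenkins home
find /mnt/efs/jenkins -type f -name "tmp_pack_*" -printf '%s\n' 2>/dev/null \
  | awk '{ sum += $1 } END { printf "%.1f GB across %d files\n", sum/1024/1024/1024, NR }'
```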
Root Cause
Here's what was happening:
- Jenkins clones the Pipeline Shared Library
- Git creates temporary pack files (tmp_pack_*)
- EFS IOPS throttling causes the clone to time out
- The temp files never get cleaned up
- This happens every build (nightly at 23:12)
- ~200-300MB of garbage per build × dozens of builds = ~15GB
Vicious cycle:
EFS slow → Git fails → Files accumulate → EFS slower
The Fix
Immediate: Adjust EFS Throughput
Changed from Bursting mode to Provisioned throughput (300 MiB/s).
Why provisioned?
- Predictable performance for metadata IOPS spikes
- No waiting for burst credits to recover
- Works while investigating (`find`, `du` commands)
⚠️ Note: EFS throughput mode changes have restrictions. Plan accordingly.
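For reference, the mode switch itself is a single AWS CLI call; the file system ID below is a placeholder:

```bash
# Move the file system from bursting to provisioned throughput at 300 MiB/s
aws efs update-file-system \
  --file-system-id fs-12345678 \
  --throughput-mode provisioned \
  --provisioned-throughput-in-mibps 300
```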
Cleanup Job
Created a daily cleanup pipeline:
```groovy
pipeline {
    agent any
    triggers {
        cron('0 4 * * *')
    }
    stages {
        stage('Clean tmp_pack files') {
            steps {
                sh '''
                    find $JENKINS_HOME -name "tmp_pack_*" -type f -mtime +1 -delete
                    echo "Cleaned up tmp_pack_* files older than 1 day"
                '''
            }
        }
    }
}
```
Manual Cleanup (One-time)
```bash
systemctl stop jenkins
find /mnt/efs/jenkins -name "tmp_pack_*" -type f -delete
systemctl start jenkins
```
Freed up ~15GB instantly.
Key Takeaways
1. Storage Size ≠ Performance
Small files matter more than total GB on EFS. Metadata operations can bottleneck before you hit storage limits.
2. Bursting Mode Can Be Unpredictable
When problems build up gradually and silently, burst credits can run out when you least expect it.
3. Always Have a Safety Net
Changing to provisioned throughput bought us time to investigate properly without user impact.
Monitoring Setup
Added CloudWatch alarms:
- EFS throughput utilization > 75%
- Directory size monitoring (weekly reports)
Early detection prevents these surprises.
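For the throughput alarm, here's roughly what it looks like from the CLI. I'm using PercentIOLimit as a stand-in for the console's throughput-utilization graph (it tracks how close a General Purpose file system is to its I/O limit, which metadata operations count against); the file system ID and SNS topic ARN are placeholders:

```bash
# Alert when the file system spends 15 minutes above 75% of its I/O limit
aws cloudwatch put-metric-alarm \
  --alarm-name jenkins-efs-io-limit-high \
  --namespace AWS/EFS \
  --metric-name PercentIOLimit \
  --dimensions Name=FileSystemId,Value=fs-12345678 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:jenkins-alerts
```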
Conclusion
Surface symptom: Git clone errors
→ Deeper cause: EFS metadata IOPS exhaustion
→ Root cause: Git temp file accumulation
Problem-solving is about peeling back layers. Each hypothesis, each metric check, gets you closer to the truth.
If you found this useful, I write more about infrastructure debugging and SRE experiences here:
https://tielec.blog/
Full investigation details with metrics graphs:
https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue