Why You Should Care
If you're running Jenkins on AWS EFS, you might hit this exact problem. Git clone operations start timing out, Jenkins UI becomes painfully slow, and you get cryptic "Bad file descriptor" errors.
The culprit? Git temporary pack files accumulating over time, starving EFS of metadata IOPS.
The Problem
Monday morning. Jenkins dashboard takes forever to load. Hit the Replay button, build starts, then:
```
fatal: write error: Bad file descriptor
fatal: fetch-pack: invalid index-pack output
ERROR: Error cloning remote repo 'origin'
```
Transfer speed drops from 77KB/s to 51KB/s before timing out completely. 504 Gateway Timeout errors everywhere.
Initial Investigation
First thought: "Network issue?"
But looking closer at the error logs, I noticed pipeline-groovy-lib was failing during Shared Library checkout. That happens on the Jenkins Controller, not agents. So this is a Controller resource problem.
Checked CloudWatch metrics:
- ✅ CPU: Normal (0-5%, occasional 30% spikes)
- ✅ Network: Nothing unusual
- ✅ EBS disk latency: Stable at ~0.7s
Wait... this Jenkins uses EFS, not just EBS.
EFS Metrics Told the Real Story
Checked EFS CloudWatch metrics and found:
- Throughput utilization: Hitting 100% during 00:00-03:00
- IOPS: Metadata operations dominating
- Storage: Growing from 14GB → 17GB
📊 See detailed metrics and graphs in the full write-up
EFS was starving on metadata IOPS, not storage capacity.
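If you'd rather pull the same picture from the AWS CLI than the console, the AWS/EFS namespace exposes MetadataIOBytes directly. A minimal sketch, with a placeholder file system ID and time window:

```bash
# How much of the traffic is metadata? (fs-12345678 and the dates are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name MetadataIOBytes \
  --dimensions Name=FileSystemId,Value=fs-12345678 \
  --start-time 2025-01-27T00:00:00Z \
  --end-time 2025-01-27T03:00:00Z \
  --period 300 \
  --statistics Sum
```

Comparing it against TotalIOBytes over the same window shows just how dominant metadata traffic really is.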
What's Metadata IOPS?
In EFS, metadata operations include:
- File stat (checking size/timestamps)
- Directory listings
- File create/delete
- Permission changes
In other words: lots of small file operations consume metadata IOPS.
Jenkins workloads are full of these:
- Build logs (thousands of small files)
- Git repositories (.git/objects with tons of files)
- Shared Library clones
- Build fingerprints
It's not about storage size. It's about file count.
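A quick way to see this on your own mount is to compare size against file count per top-level directory. A rough sketch, assuming a GNU userland and our mount path:

```bash
# Size tells one story, file count tells another.
# Warning: this scan itself consumes metadata IOPS, so run it off-peak.
for d in /mnt/efs/jenkins/*/; do
  printf '%-40s %8s %10d files\n' "$d" \
    "$(du -sh "$d" 2>/dev/null | cut -f1)" \
    "$(find "$d" -type f 2>/dev/null | wc -l)"
done
```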
Finding the Culprit
Checked directory sizes:
```
du -sh /mnt/efs/jenkins/*
342M    plugins
106M    war
332M    logs
373M    caches
174M    fingerprints
timeout  jobs           # Suspicious timeout
```
Only found ~1.3GB. Missing ~15.7GB.
Searched for large files directly:
```bash
find /mnt/efs/jenkins -type f -size +100M -ls 2>/dev/null
```
Boom:
```
125829120 Jan 27 01:50 .../builds/873/libs/.../root/.git/objects/pack/tmp_pack_S3GPJw
122683392 Jan 27 01:50 .../builds/872/libs/.../root/.git/objects/pack/tmp_pack_c4EAwd
(dozens more of these...)
```
tmp_pack_* files everywhere. 100-300MB each.
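Before touching anything, it helps to total up how much space these leftovers actually occupy. A rough sketch, assuming GNU find (for -printf):

```bash
# Sum the size of every leftover Git temp pack under the Jenkins home
find /mnt/efs/jenkins -type f -name "tmp_pack_*" -printf '%s\n' 2>/dev/null \
  | awk '{ sum += $1 } END { printf "%.1f GB across %d files\n", sum/1024/1024/1024, NR }'
```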
Root Cause
Here's what was happening:
- Jenkins clones the Pipeline Shared Library
- Git creates temporary pack files (tmp_pack_*)
- EFS IOPS throttling causes the clone to time out
- The temp files never get cleaned up
- This happens every build (nightly at 23:12)
- ~200-300MB of garbage per build × dozens of builds = ~15GB
Vicious cycle:
EFS slow → Git fails → Files accumulate → EFS slower
The Fix
Immediate: Adjust EFS Throughput
Changed from Bursting mode to Provisioned throughput (300 MiB/s).
Why provisioned?
- Predictable performance for metadata IOPS spikes
- No waiting for burst credits to recover
- Works while investigating (`find`, `du` commands)
⚠️ Note: EFS throughput mode changes have restrictions. Plan accordingly.
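For reference, the mode switch itself is a single AWS CLI call; the file system ID below is a placeholder:

```bash
# Move the file system from bursting to provisioned throughput at 300 MiB/s
aws efs update-file-system \
  --file-system-id fs-12345678 \
  --throughput-mode provisioned \
  --provisioned-throughput-in-mibps 300
```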
Cleanup Job
Created a daily cleanup pipeline:
```groovy
pipeline {
    agent any
    triggers {
        cron('0 4 * * *')
    }
    stages {
        stage('Clean tmp_pack files') {
            steps {
                sh '''
                    find $JENKINS_HOME -name "tmp_pack_*" -type f -mtime +1 -delete
                    echo "Cleaned up tmp_pack_* files older than 1 day"
                '''
            }
        }
    }
}
```
Manual Cleanup (One-time)
```bash
systemctl stop jenkins
find /mnt/efs/jenkins -name "tmp_pack_*" -type f -delete
systemctl start jenkins
```
Freed up ~15GB instantly.
Key Takeaways
1. Storage Size ≠ Performance
Small files matter more than total GB on EFS. Metadata operations can bottleneck before you hit storage limits.
2. Bursting Mode Can Be Unpredictable
When problems build up gradually and silently, burst credits can run out when you least expect it.
3. Always Have a Safety Net
Changing to provisioned throughput bought us time to investigate properly without user impact.
Monitoring Setup
Added CloudWatch alarms:
- EFS throughput utilization > 75%
- Directory size monitoring (weekly reports)
Early detection prevents these surprises.
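For the throughput alarm, here's roughly what it looks like from the CLI. I'm using PercentIOLimit as a stand-in for the console's throughput-utilization graph (it tracks how close a General Purpose file system is to its I/O limit, which metadata operations count against); the file system ID and SNS topic ARN are placeholders:

```bash
# Alert when the file system spends 15 minutes above 75% of its I/O limit
aws cloudwatch put-metric-alarm \
  --alarm-name jenkins-efs-io-limit-high \
  --namespace AWS/EFS \
  --metric-name PercentIOLimit \
  --dimensions Name=FileSystemId,Value=fs-12345678 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:jenkins-alerts
```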
Conclusion
Surface symptom: Git clone errors
→ Deeper cause: EFS metadata IOPS exhaustion
→ Root cause: Git temp file accumulation
Problem-solving is about peeling back layers. Each hypothesis, each metric check, gets you closer to the truth.
If you found this useful, I write more about infrastructure debugging and SRE experiences here:
https://tielec.blog/
Full investigation details with metrics graphs:
https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue