TL;DR
Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:
- Shared Library cache was disabled (existing issue)
- Switched to disposable agents (1/13 change)
- Increased build frequency (New Year effect)
Result: 50x increase in metadata IOPS → EFS burst credits drained over 2 weeks.
Why You Should Care
If you're running Jenkins on EFS, this could happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time-series analysis of metrics is crucial.
The Mystery: Symptoms vs. Root Cause
Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15GB of Git temporary files (tmp_pack_*) accumulated on EFS, causing metadata IOPS exhaustion.
We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?
Not quite.
When I checked the EFS Burst Credit Balance graph, I noticed something important:
The credit started declining around 1/13, but symptoms appeared on 1/26.
Timeline:
- 1/13: Credit decline starts
- 1/19: Rapid decline
- 1/26: Credit bottoms out
- 1/26-27: Symptoms appear
The tmp_pack_* accumulation was a symptom, not the root cause. Something changed on 1/13.
What Changed on 1/13?
Honestly, this stumped me. I had a few ideas, but nothing definitive:
1. Agent Architecture Change
Around 1/13, we changed our Jenkins agent strategy:
Before: Shared Agents
- EC2 instances: c5.large, etc.
- Multiple jobs share agents
- Workspace reuse
- git pull for incremental updates
After: Disposable Agents
- EC2 instances: t3.small, etc.
- One agent per job
- Destroy after use
- git clone for a full clone every time
The goal was cost reduction. We didn't consider metadata IOPS impact.
2. Post-New Year Development Rush
Teams ramped up development after the New Year holiday, increasing overall Jenkins load.
The Math: 50x Metadata IOPS Increase
Let me calculate the impact:
Builds per day: 50 (estimated)
Files created per clone: 5,000
Shared agent approach:
Clone once = 5,000 metadata operations
Disposable agent approach:
50 builds × 5,000 files = 250,000 metadata operations/day
50x increase in metadata IOPS.
Add the New Year rush, and the numbers get even worse.
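As a sanity check, here is the same estimate in a few lines of Python. The build count and files-per-clone figures are the same rough estimates used above, not measured values:

```python
# Back-of-the-envelope check of the metadata IOPS estimate above.
BUILDS_PER_DAY = 50          # estimated builds/day
FILES_PER_CLONE = 5_000      # rough file count written by one full clone

# Shared agents: clone once, then incremental `git pull`s
shared_ops = FILES_PER_CLONE

# Disposable agents: every build pays for a full clone
disposable_ops = BUILDS_PER_DAY * FILES_PER_CLONE

print(f"shared agents:     {shared_ops:,} metadata ops")
print(f"disposable agents: {disposable_ops:,} metadata ops/day")
print(f"increase:          {disposable_ops // shared_ops}x")
```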
Understanding Git Cache in Jenkins
During investigation, I noticed /mnt/efs/jenkins/caches/ directory:
/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e
This is Jenkins Git plugin's bare repository cache.
How Git Caching Works
Jenkins Git plugin optimizes clones by:
- Caching remote repos in /mnt/efs/jenkins/caches/git-{hash}/ as bare repositories
- Cloning to job workspaces with git clone --reference from this cache
- The hash is generated from the repo URL + branch combination
The problem: Disposable agents might not benefit from this cache since they're new every time.
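For intuition, here is a minimal sketch of the --reference pattern described above, written as plain git commands driven from Python. The repository URL and directories are placeholders, and this is not the plugin's actual code:

```python
# Illustration only: the two-step pattern behind the Git plugin's cache.
# REPO_URL and the directories below are placeholders, not our real setup.
import subprocess

REPO_URL = "https://example.com/org/shared-library.git"
CACHE_DIR = "/mnt/efs/jenkins/caches/git-cache"   # bare mirror (git-{hash} in Jenkins)
WORKSPACE = "/tmp/workspace/sample-job"

# 1. Keep a bare mirror; refreshing it only fetches deltas.
subprocess.run(["git", "clone", "--mirror", REPO_URL, CACHE_DIR], check=True)

# 2. Workspace clones borrow objects from the mirror via --reference,
#    so thousands of object files are not re-downloaded and re-written.
subprocess.run(
    ["git", "clone", "--reference", CACHE_DIR, REPO_URL, WORKSPACE],
    check=True,
)
```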
The Smoking Gun: tmp_pack_* Location
I revisited where tmp_pack_* files were located:
jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
└── 335abf.../root/.git/objects/pack/
└── tmp_pack_WqmOyE ← 100-300MB
These are in per-build directories:
jobs/sample-job/jobs/sample-pipeline/
└── builds/
├── 104/
│ └── libs/.../tmp_pack_WqmOyE
├── 105/
│ └── libs/.../tmp_pack_XYZ123
└── 106/
└── libs/.../tmp_pack_ABC456
Every build was re-checking out the Pipeline Shared Library, generating tmp_pack_* each time.
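A quick way to confirm this pattern is to walk the builds directories and total up the stray pack files. A small sketch, using the same mount path as above; the script itself is just an illustration, not part of our tooling:

```python
# Sum up tmp_pack_* leftovers under the Jenkins jobs tree on EFS.
import os

JOBS_DIR = "/mnt/efs/jenkins/jobs"   # adjust to your Jenkins home on EFS

count = 0
total_bytes = 0
for root, _dirs, files in os.walk(JOBS_DIR):
    for name in files:
        if name.startswith("tmp_pack_"):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total_bytes += size
            print(f"{size / 1e6:8.1f} MB  {path}")

print(f"\n{count} tmp_pack_* files, {total_bytes / 1e9:.2f} GB total")
```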
Question: Why is the Shared Library being fetched on every build?
Root Cause: Cache Setting Was OFF
I checked Jenkins configuration and found the smoking gun.
The Shared Library setting "Cache fetched versions on controller for quick retrieval" was unchecked.
This meant:
- Shared Library cache completely disabled
- Full fetch from remote repository on every build
- Temporary files generated in .git/objects/pack/
- Massive metadata IOPS consumption
The Fix: Enable Caching
I immediately changed the settings:
- Enabled "Cache fetched versions on controller for quick retrieval"
- Set "Refresh time in minutes" to 180 minutes
Choosing the Refresh Time
This is actually important. Options:
- 60-120 min: Fast updates, moderate IOPS reduction
- 180 min (3 hours): Balanced - 8 updates/day
- 360 min (6 hours): Stable operation - 4 updates/day
- 1440 min (24 hours): Maximum IOPS reduction
Why I chose 180 minutes:
- Update checks run ~8 times/day (9am, 12pm, 3pm, 6pm...)
- Shared Library changes being reflected within half a day is acceptable
- Significant IOPS reduction (every build → once per 3 hours)
- Can manually clear cache for urgent changes
Jenkins has a "force refresh" feature, so urgent changes aren't a problem. I documented this in our runbook so we don't forget.
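The arithmetic behind that choice, reusing the same rough per-fetch figure from earlier (a sketch, not measured data):

```python
# How the refresh interval trades update latency against metadata IOPS.
BUILDS_PER_DAY = 50
OPS_PER_FETCH = 5_000   # reusing the rough files-per-clone estimate from above

print("refresh (min)  fetches/day  metadata ops/day")
for minutes in (60, 180, 360, 1440):
    fetches_per_day = 24 * 60 // minutes
    print(f"{minutes:>13}  {fetches_per_day:>11}  {fetches_per_day * OPS_PER_FETCH:>16,}")

# Without the cache, every build pays for a full fetch:
print(f"{'no cache':>13}  {BUILDS_PER_DAY:>11}  {BUILDS_PER_DAY * OPS_PER_FETCH:>16,}")
```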
Measuring the Impact
Post-change monitoring plan:
Short-term (24-48 hours)
- No new tmp_pack_* files generated
- EFS metadata IOPS decrease
Mid-term (1 week)
- Burst Credit Balance recovery trend
- Stable build performance
Long-term (1 month)
- Credits remain stable
- No recurrence
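For the mid- and long-term items, the Burst Credit Balance trend can be pulled straight from CloudWatch. A minimal sketch assuming boto3 credentials and a placeholder file system ID; this is not our production monitoring, just the query behind the graphs in this post:

```python
# Pull two weeks of EFS BurstCreditBalance from CloudWatch.
from datetime import datetime, timedelta, timezone

import boto3

FILE_SYSTEM_ID = "fs-0123456789abcdef0"   # placeholder

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,                 # hourly data points
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Minimum']:.3e} bytes")
```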
Lessons Learned
1. Symptoms ≠ Root Cause Timeline
- Symptom appearance: 1/26-1/27
- Root cause: Around 1/13
- Credit depletion: Gradual over 2 weeks
Time-series analysis is crucial. Fixing only the visible symptoms leads to superficial solutions.
2. Architecture Changes Have Hidden Costs
The disposable agent change was for cost optimization. We did reduce EC2 costs, but created problems elsewhere.
When changing architecture:
- Evaluate performance impact beforehand
- Set up monitoring before the change
- Continue tracking metrics after
3. EFS Metadata IOPS Characteristics
- Mass creation/deletion of small files is deadly
- File count matters more than storage size
- Burst mode requires credit management
- Credit depletion happens gradually
Especially with .git/objects/ containing thousands of small files, behavior differs drastically from normal file I/O.
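If you want to see this on your own repos, comparing file count against total bytes under .git/objects makes the point quickly (the path is a placeholder, point it at any local clone):

```python
# Compare file count vs. total size under a clone's .git/objects directory.
import os

OBJECTS_DIR = "/path/to/some/clone/.git/objects"   # placeholder

files = 0
total_bytes = 0
for root, _dirs, names in os.walk(OBJECTS_DIR):
    for name in names:
        files += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

# On EFS, each of these files costs metadata operations (create/stat/delete),
# no matter how small it is.
print(f"{files:,} files, {total_bytes / 1e6:.1f} MB")
```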
4. Compound Root Causes
This issue wasn't a single cause but three factors combining:
- Shared Library cache disabled (pre-existing)
- Disposable agent switch (1/13)
- Increased builds (New Year)
Each alone might not have caused major issues, but together they exceeded the critical threshold.
Open Questions
While we enabled Shared Library caching, we're still using disposable agents.
Can agent-side Git cache be utilized effectively with disposable agents?
Possibilities:
- Share EFS Git cache across all agents
- Extend agent lifecycle slightly for reuse across jobs
- Cache in S3 and sync on startup
Finding the right balance between cost and performance remains a challenge.
I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/