Yuto Takashi
Final Chapter: Jenkins EFS Problem Solved - From 100% to 0% Throughput Usage

TL;DR

After three articles tracking down a Jenkins EFS performance issue, enabling Shared Library cache reduced throughput usage from 100% spikes to near 0%. This article covers the final results and the complete SRE process from emergency response to permanent fix.


Previous Episodes (Quick Recap)

This is the final article in a 4-part series:

  • Episode 1: How Git Temp Files Killed Our Jenkins Performance

    • Problem: Jenkins slowed down, Git clone failures, 504 errors
    • Discovery: EFS metadata IOPS exhaustion
    • Culprit: ~15GB of tmp_pack_* files accumulating
    • Emergency fix: Provisioned throughput 300 MiB/s + cleanup job
  • Episode 2: How I Spent $69 in 26 Hours (and How to Avoid It)

    • Cost: $69 in 26 hours
    • Lesson: Didn't know about Elastic Throughput (1/20th the cost)
    • Learning: Decision process was sound, but not the optimal solution
  • Episode 3: How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks

    • Key finding: Symptom appeared on 1/26, but root cause started on 1/13
    • Multiple factors:
      • Shared Library cache was disabled (existed before)
      • Changed to disposable agent approach (1/13) → metadata IOPS increased
      • Development accelerated in new year → more builds
    • Root fix: Enabled Shared Library cache, set Refresh time to 180 minutes
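The cleanup job from Episode 1 can be sketched roughly like this (a hypothetical Python version; the real job, its schedule, and the EFS mount path are specific to our Jenkins setup):

```python
import time
from pathlib import Path

def cleanup_tmp_pack(root, max_age_hours=24):
    """Delete leftover Git tmp_pack_* files older than max_age_hours.

    Returns the list of paths that were removed."""
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    for path in Path(root).rglob("tmp_pack_*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Running something like this periodically (e.g. as a scheduled Jenkins job against the EFS mount) keeps aborted-clone debris from piling up again.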

The Result: Dramatic Improvement

After enabling the Shared Library cache setting (Cache fetched versions on controller for quick retrieval) and setting Refresh time to 180 minutes, the EFS throughput usage changed dramatically.

Before the fix (left side, 06:00-12:00):

  • Throughput usage spiking to 100% frequently
  • Almost constantly under high load
  • Far exceeding the 75% warning zone

After the fix (after 12:00):

  • Throughput usage dropped dramatically and stabilized
  • Baseline near 0%
  • Regular small spikes every 3 hours (max ~60%)

(Graph: EFS throughput utilization, before and after the fix)

The 3-hour spikes are from the Shared Library cache refresh checks (Refresh time: 180 minutes). In other words, it's working exactly as expected.
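For those managing Jenkins as code, the same setting can be pinned with the Configuration as Code plugin. This is a hedged sketch: the library name and repository URL are placeholders, and the `cachingConfiguration` keys should be checked against your plugin version.

```yaml
unclassified:
  globalLibraries:
    libraries:
      - name: "my-shared-library"   # placeholder
        defaultVersion: "main"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://example.com/org/jenkins-shared-library.git"  # placeholder
        # Equivalent of "Cache fetched versions on controller for quick retrieval"
        cachingConfiguration:
          refreshTimeMinutes: 180
```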

Honestly, I didn't expect the effect to be this clean.


The Complete Timeline: 5 Throughput Modes

Over the course of this incident, we went through 5 different EFS throughput modes:

Mode Progression:

```
Bursting (original)
    ↓ 1/27 emergency response
Provisioned 300 MiB/s (26 hours)
    ↓ 1/28 cost reduction
Elastic Throughput (1 day)
    ↓ 1/29 verification
Provisioned 10 MiB/s (current)
    ↓ planned
Bursting (return to original)
```

Cost Comparison:

| Mode | Duration | Cost | Reason |
| --- | --- | --- | --- |
| Bursting | Until 1/27 | Storage only | Normal operation |
| Provisioned 300 MiB/s | 1/27 (26 hrs) | ~$69 | Emergency: ensure investigation could proceed |
| Elastic Throughput | 1/28-1/29 (~1 day) | ~$8 | Cost reduction: pay per use |
| Provisioned 10 MiB/s | 1/30-current | ~$2.3/day | Verification: stable operation at low cost |
| Bursting (planned) | Soon | Storage only | Permanent fix: return to original |

Why We Changed From Elastic Throughput

Elastic Throughput turned out to be "surprisingly costly":

  • Daily cost: ~$8
  • Monthly estimate: ~$240 (~¥35,000)

In contrast, Provisioned 10 MiB/s costs ~$72/month (~¥10,000). Given our current usage pattern (average throughput utilization of a few percent, peaks around 60%), 10 MiB/s is plenty.

However, this is just for the verification period. We plan to eventually return to Bursting mode.
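For reference, the back-of-the-envelope arithmetic behind that comparison, extrapolated from the per-day figures above (actual EFS pricing varies by region and by month length):

```python
ELASTIC_PER_DAY = 8.0        # observed: ~$8/day on Elastic Throughput
PROVISIONED_PER_DAY = 2.3    # observed: ~$2.3/day at Provisioned 10 MiB/s

def monthly_cost(per_day, days=30):
    """Extrapolate a daily cost to a rough monthly estimate."""
    return per_day * days

# Elastic: ~$240/month; Provisioned 10 MiB/s: ~$69-72/month
for label, per_day in [("Elastic", ELASTIC_PER_DAY),
                       ("Provisioned 10 MiB/s", PROVISIONED_PER_DAY)]:
    print(f"{label}: ~${monthly_cost(per_day):.0f}/month")
```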


Hypothesis Verification

Let's verify the hypothesis from Episode 3.

Initial Hypothesis

Did the change to disposable agent approach (1/13) cause the metadata IOPS spike?

Answer: Partially correct, but not the main culprit.

The Real Culprit

Shared Library cache was disabled (existed before the incident)

Just enabling the cache brought throughput usage down to near 0%. This means Shared Library's full fetch on every build was consuming the overwhelming majority of metadata IOPS.

Impact of Disposable Agent Approach

So was the disposable agent approach irrelevant?

Not quite. I believe the change to disposable agents was one factor that accelerated Burst Credit depletion.

The combination of three factors:

  • Shared Library cache disabled (pre-existing) → Controller-side metadata IOPS on every build
  • Disposable agent approach (from 1/13) → Agent-side metadata IOPS on every build
  • Development acceleration in new year → increased build frequency

These three factors together caused rapid Burst Credit depletion from 1/13, with symptoms appearing 2 weeks later on 1/26.


The SRE Process: From Detection to Resolution

Looking at the timeline of our response, you can see a clear SRE process:

The Complete SRE Workflow:

  1. Problem Detection (1/27 morning)

    • Symptoms: Jenkins slow, Git clone failures, 504 errors
    • Metrics check: EFS throughput usage at 100%
    • Time: ~30 minutes
  2. Emergency Response (1/27 morning)

    • Decision: Change to Provisioned throughput 300 MiB/s (executed next day)
    • Goal: Ensure investigation could continue
    • Tradeoff: High cost vs. continued development
  3. Impact Mitigation (1/27 afternoon)

    • Created cleanup job
    • Planned tmp_pack_* deletion
    • Implemented recurrence prevention
  4. Root Cause Investigation (1/27-1/30)

    • Stage 1: Found tmp_pack_* accumulation
    • Stage 2: Burst Credit Balance graph analysis revealed 1/13 as starting point
    • Stage 3: Discovered Shared Library cache was disabled

To be honest, I got stuck here. When I found tmp_pack_*, I thought "this is it," but it was actually just part of the symptom. Reviewing the graphs chronologically led me to the true root cause.

  5. Permanent Fix (1/30)

    • Enabled Shared Library cache
    • Set Refresh time: 180 minutes
    • Optimized throughput mode
  6. Effect Measurement (1/30 onward)

    • Confirmed dramatic improvement in throughput usage
    • 3-hour spikes are as expected
    • Continuing to monitor
  7. Reflection & Knowledge Sharing (this article)

    • Reflected on cost decisions (didn't know about Elastic Throughput)
    • Understood the multiple contributing factors
    • Sharing knowledge with the organization

This last part is surprisingly crucial. It's not just about solving the problem, but articulating "why it happened" and "how we decided" to apply to future situations.


Outstanding Issues and Next Steps

Short-term Tasks

1. Return to Bursting Mode

We're currently running on Provisioned 10 MiB/s, but plan to return to Bursting mode eventually.

What to check before switching back:

  • Has Burst Credit Balance recovered sufficiently?
  • Are new tmp_pack_* files being generated?
  • Is the cleanup job working correctly?
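The first checklist item can be automated with a small check against CloudWatch. A sketch under stated assumptions: the 2.1 TiB credit ceiling is the typical maximum for file systems under 1 TiB, the 90% recovery ratio is our own cutoff, and the file-system ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone

# Typical EFS burst-credit ceiling for file systems under 1 TiB is 2.1 TiB
# (assumption; confirm your file system's actual maximum in CloudWatch).
MAX_BURST_CREDITS = 2.1 * 1024**4  # bytes

def credits_recovered(balance_bytes, ratio=0.9):
    """True once the balance is back above `ratio` of the ceiling."""
    return balance_bytes >= ratio * MAX_BURST_CREDITS

def min_credit_balance_last_hour(file_system_id):
    """Lowest BurstCreditBalance datapoint over the past hour
    (requires boto3 and AWS credentials)."""
    import boto3
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": file_system_id}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Minimum"],
    )
    return min(p["Minimum"] for p in stats["Datapoints"])
```

`credits_recovered(min_credit_balance_last_hour("fs-xxxx"))` then gives a simple go/no-go signal for switching back to Bursting.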

2. Strengthen Monitoring

This incident could have been caught earlier with proper monitoring.

Alerts to set up:

  • EFS throughput usage > 75%
  • Burst Credit Balance < threshold (TBD)
  • Abnormal storage capacity increase
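The first alert could look something like this with boto3. One caveat: the console's "Throughput utilization" graph is metric math rather than a single metric, so this sketch alarms on `PercentIOLimit` instead, which tracked the same saturation in our case; treat that metric choice, the evaluation windows, and the placeholder IDs as assumptions to validate.

```python
def throughput_alarm_params(file_system_id, sns_topic_arn):
    """CloudWatch alarm definition for the 'usage > 75%' alert.

    PercentIOLimit is the AWS/EFS metric exposed for General Purpose
    file systems; confirm it matches what you want to alert on."""
    return {
        "AlarmName": f"efs-{file_system_id}-io-limit-high",
        "Namespace": "AWS/EFS",
        "MetricName": "PercentIOLimit",
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": 75.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

def create_alarm(params):
    """Apply the alarm (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**params)
```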

Long-term Considerations

Reconsidering the Disposable Agent Approach

We're continuing with the disposable agent approach, but its impact on metadata IOPS can't be ignored.

Options to consider:

  • Extend agent lifecycle slightly to reuse across multiple jobs
  • Share Git cache on EFS across all agents
  • Place cache in S3 and sync on startup

Finding the right balance between cost and performance is the next challenge.


Key Takeaways

Looking back at the timeline, here's what I learned.

Technical Lessons

  1. EFS Metadata IOPS Characteristics

    • Massive small file operations are deadly
    • File count matters more than storage capacity
    • Burst Credit management is key
  2. Jenkins Caching Mechanisms

    • Importance of Shared Library cache
    • Setting the right Refresh time balance
    • Hidden costs of disabled caching
  3. Throughput Mode Selection

    • Elastic Throughput isn't a silver bullet
    • Optimization based on usage patterns
    • Importance of cost estimation

Process Lessons

But more importantly, it's about "how we decided."

Emergency Decision Making:

  • Make decisions without perfect information
  • Prioritize avoiding worst-case scenarios
  • Clarify tradeoffs

Investigation Approach:

  • Look at graphs chronologically, not just symptoms
  • Form hypotheses, test them, move to next hypothesis if wrong
  • Acknowledge when you're stuck

Accountability:

  • Costs can be explained after the fact
  • Articulate the decision process
  • Share both successes and failures

I regret not knowing about Elastic Throughput, but I don't regret the decision to "ensure investigation could continue."

The $69 tuition may have been expensive, but I got more than $69 worth of learning out of it.


Related Articles

This is the final article in the series.


I write more about SRE decision-making processes and the thinking behind technical choices on my blog.
Check it out: https://tielec.blog/en/tech/sre/jenkins-efs-final-report
