TL;DR
During a Jenkins EFS incident, I switched to Provisioned Throughput (300 MiB/s) for emergency response. It cost $69 for just 26 hours. If I had known about Elastic Throughput, it would have been around $3.50. Here's what I learned about EFS throughput modes and cost optimization.
The Incident
Last week, our Jenkins CI/CD pipeline came to a halt due to EFS metadata IOPS exhaustion. As an emergency measure, I changed the EFS throughput mode to Provisioned Throughput at 300 MiB/s to keep Jenkins running while investigating the root cause.
The next day, I checked AWS Cost Explorer and saw:
$69.00
For 26 hours of usage. Ouch.
Why You Should Care
If you're running EFS for production workloads, understanding throughput modes is critical. A simple configuration choice can mean the difference between $3 and $69 for the same workload.
EFS Throughput Modes: A Quick Comparison
AWS EFS offers three throughput modes:
1. Bursting Throughput (Default)
Cost: Storage cost only
Performance scales with storage size. You get baseline throughput based on your storage capacity, plus burst credits for temporary spikes.
- ✅ No extra cost
- ❌ Performance degrades when credits run out (our problem)
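If you are on Bursting mode, the `BurstCreditBalance` CloudWatch metric tells you how close you are to hitting the degraded baseline. A minimal check with the AWS CLI (file system ID and time window are placeholders):

```bash
# Minimum burst credit balance per hour over a one-day window (placeholder ID and timestamps)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name BurstCreditBalance \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-02T00:00:00Z \
  --period 3600 \
  --statistics Minimum
```

A steadily declining minimum is the early sign that you are about to run out of credits.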
2. Provisioned Throughput
Cost: Storage + Throughput cost
Tokyo region: ~$7.2 per MiB/s per month
For 300 MiB/s:
- Monthly: 300 × $7.2 = $2,160
- 26 hours: $2,160 × (26/720) ≈ $78 (actual: $69)
- ✅ Guaranteed performance
- ❌ Very expensive, billed even when idle
3. Elastic Throughput
Cost: Storage + Actual usage
Tokyo region:
- Read: $0.04/GB
- Write: $0.07/GB
For 26 hours with ~50 GB of usage (billed at the write rate):
- 50 GB × $0.07/GB ≈ $3.50
- ✅ Pay-per-use, auto-scales
- ❌ Harder to predict costs
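One way to take the guesswork out of that last point: sum the read and write bytes from CloudWatch for your own workload and multiply by the per-GB rates. A sketch, with a placeholder file system ID and time range:

```bash
# Total bytes written during the incident window (repeat with DataReadIOBytes for reads)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name DataWriteIOBytes \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-02T02:00:00Z \
  --period 3600 \
  --statistics Sum
# Add up the datapoints, convert to GB, then multiply by $0.07/GB (writes) or $0.04/GB (reads)
```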
Cost Comparison
| Mode | 26-hour Cost | When to Use |
|---|---|---|
| Bursting | $0 extra (storage only, ~$5.6/month) | Normal operations |
| Provisioned | $69 | Constant high throughput |
| Elastic | $3.50 | Spike handling (best for most cases) |
Difference: ~$65 (~$9,500 yen)
What I Should Have Done
Instead of jumping to Provisioned Throughput, here's the better approach:
Step 1: Switch to Elastic Throughput
```bash
# Switch the file system to Elastic Throughput
aws efs update-file-system \
  --file-system-id fs-xxxxxx \
  --throughput-mode elastic
```
This would have:
- Auto-scaled during investigation
- Cost only ~$3.50 for the same period
- No manual capacity planning needed
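To confirm the change took effect, describe the file system (same placeholder ID as above):

```bash
# Should print "elastic" once the update completes
aws efs describe-file-systems \
  --file-system-id fs-xxxxxx \
  --query "FileSystems[0].ThroughputMode" \
  --output text
```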
Step 2: Investigate Root Cause
While Elastic Throughput handles the spike automatically, investigate and fix the underlying issue (in our case, Git temporary files accumulating).
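For us that meant finding the stale Git files on the EFS mount. A rough sketch of the kind of sweep I mean; the mount path and file patterns are assumptions about your layout, so review the output before deleting anything:

```bash
# List Git temp/lock files older than a day on a (hypothetical) Jenkins EFS mount path
find /mnt/efs/jenkins -type f \( -name 'tmp_*' -o -name '*.lock' \) -mtime +1 -print
# Only after reviewing the list, re-run with -delete instead of -print
```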
Step 3: Set Up Monitoring
CloudWatch alarms for:
- `PercentIOLimit` > 75% - early warning before IOPS exhaustion
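A minimal alarm definition with the CLI might look like this; the alarm name, file system ID, and SNS topic ARN are placeholders:

```bash
# Alarm when average PercentIOLimit stays above 75% for 15 minutes (3 x 5-minute periods)
aws cloudwatch put-metric-alarm \
  --alarm-name jenkins-efs-percent-io-limit \
  --namespace AWS/EFS \
  --metric-name PercentIOLimit \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:efs-alerts
```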
Why I Didn't Choose Elastic Throughput
Honestly? I didn't know it existed.
Elastic Throughput was announced in 2022, but I hadn't updated my knowledge. During the emergency, my mental model was:
- Bursting = free but unreliable
- Provisioned = expensive but guaranteed
I missed the third, better option.
Was the Decision Wrong?
Not entirely. Let's look at ROI:
Cost: $69 (10,000 yen)
Avoided Loss:
- 10 engineers × 3 hours waiting = 30 person-hours
- At ~$50/hour = $1,500 in productivity loss
- Plus deployment delays (hard to quantify)
ROI: ~20x
The decision to prioritize business continuity was correct. But knowing about Elastic Throughput would have achieved the same result for 1/20th the cost.
Lessons Learned
1. Always Research Current Options
Don't rely on old knowledge during emergencies. Take 5 minutes to check AWS documentation for the latest features.
2. Cost Estimation is Part of the Response
"Make it work first" is important, but:
- List all options
- Quick cost comparison
- Choose based on data, not urgency
3. Document and Share
This $69 lesson becomes valuable when shared. Your team (and the community) can learn without paying the same price.
Action Items
If you're using EFS:
- [ ] Check your current throughput mode (see the snippet after this list)
- [ ] Consider Elastic Throughput for variable workloads
- [ ] Set up CloudWatch alarms for `PercentIOLimit`
- [ ] Document your throughput mode decision process
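For the first checkbox, a one-liner that lists the throughput mode of every file system in the account:

```bash
# Quick inventory of file systems and their throughput modes
aws efs describe-file-systems \
  --query "FileSystems[].[FileSystemId,ThroughputMode]" \
  --output table
```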
Bottom Line
Use Elastic Throughput for most production workloads.
It's the best of both worlds:
- Handles spikes automatically
- Pay only for what you use
- No capacity planning required
Provisioned Throughput should be reserved for constant, predictable high-throughput scenarios.
Next time I face a similar situation, I'll reach for Elastic Throughput first.
I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/