TL;DR
During a Jenkins EFS incident, I switched to Provisioned Throughput (300 MiB/s) for emergency response. It cost $69 for just 26 hours. If I had known about Elastic Throughput, it would have been around $3.50. Here's what I learned about EFS throughput modes and cost optimization.
The Incident
Last week, our Jenkins CI/CD pipeline came to a halt due to EFS metadata IOPS exhaustion. As an emergency measure, I changed the EFS throughput mode to Provisioned Throughput at 300 MiB/s to keep Jenkins running while investigating the root cause.
The next day, I checked AWS Cost Explorer and saw:
$69.00
For 26 hours of usage. Ouch.
Why You Should Care
If you're running EFS for production workloads, understanding throughput modes is critical. A simple configuration choice can mean the difference between $3 and $69 for the same workload.
EFS Throughput Modes: A Quick Comparison
AWS EFS offers three throughput modes:
1. Bursting Throughput (Default)
Cost: Storage cost only
Performance scales with storage size. You get baseline throughput based on your storage capacity, plus burst credits for temporary spikes.
- ✅ No extra cost
- ❌ Performance degrades when credits run out (our problem)
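If you are on Bursting mode, the `BurstCreditBalance` CloudWatch metric tells you how close you are to hitting the degraded baseline. A minimal check with the AWS CLI (file system ID and time window are placeholders):

```bash
# Minimum burst credit balance per hour over a one-day window (placeholder ID and timestamps)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name BurstCreditBalance \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-02T00:00:00Z \
  --period 3600 \
  --statistics Minimum
```

A steadily declining minimum is the early sign that you are about to run out of credits.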
2. Provisioned Throughput
Cost: Storage + Throughput cost
Tokyo region: ~$7.2 per MiB/s per month
For 300 MiB/s:
- Monthly: 300 × $7.2 = $2,160
- 26 hours: $2,160 × (26/720) ≈ $78 (actual: $69)
- ✅ Guaranteed performance
- ❌ Very expensive, billed even when idle
3. Elastic Throughput
Cost: Storage + Actual usage
Tokyo region:
- Read: $0.04/GB
- Write: $0.07/GB
For 26 hours with ~50 GB of usage (billed at the write rate):
- 50 GB × $0.07/GB ≈ $3.50
- ✅ Pay-per-use, auto-scales
- ❌ Harder to predict costs
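One way to take the guesswork out of that last point: sum the read and write bytes from CloudWatch for your own workload and multiply by the per-GB rates. A sketch, with a placeholder file system ID and time range:

```bash
# Total bytes written during the incident window (repeat with DataReadIOBytes for reads)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name DataWriteIOBytes \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-02T02:00:00Z \
  --period 3600 \
  --statistics Sum
# Add up the datapoints, convert to GB, then multiply by $0.07/GB (writes) or $0.04/GB (reads)
```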
Cost Comparison
| Mode | 26-hour Cost | When to Use |
|---|---|---|
| Bursting | $0 extra (storage only, ~$5.6/month) | Normal operations |
| Provisioned | $69 | Constant high throughput |
| Elastic | $3.50 | Spike handling (best for most cases) |
Difference: ~$65 (~$9,500 yen)
What I Should Have Done
Instead of jumping to Provisioned Throughput, here's the better approach:
Step 1: Switch to Elastic Throughput
```bash
# Switch the file system to Elastic Throughput
aws efs update-file-system \
  --file-system-id fs-xxxxxx \
  --throughput-mode elastic
```
This would have:
- Auto-scaled during investigation
- Cost only ~$3.50 for the same period
- No manual capacity planning needed
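To confirm the change took effect, describe the file system (same placeholder ID as above):

```bash
# Should print "elastic" once the update completes
aws efs describe-file-systems \
  --file-system-id fs-xxxxxx \
  --query "FileSystems[0].ThroughputMode" \
  --output text
```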
Step 2: Investigate Root Cause
While Elastic Throughput handles the spike automatically, investigate and fix the underlying issue (in our case, Git temporary files accumulating).
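For us that meant finding the stale Git files on the EFS mount. A rough sketch of the kind of sweep I mean; the mount path and file patterns are assumptions about your layout, so review the output before deleting anything:

```bash
# List Git temp/lock files older than a day on a (hypothetical) Jenkins EFS mount path
find /mnt/efs/jenkins -type f \( -name 'tmp_*' -o -name '*.lock' \) -mtime +1 -print
# Only after reviewing the list, re-run with -delete instead of -print
```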
Step 3: Set Up Monitoring
CloudWatch alarms for:
- `PercentIOLimit` > 75% - early warning before IOPS exhaustion
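A minimal alarm definition with the CLI might look like this; the alarm name, file system ID, and SNS topic ARN are placeholders:

```bash
# Alarm when average PercentIOLimit stays above 75% for 15 minutes (3 x 5-minute periods)
aws cloudwatch put-metric-alarm \
  --alarm-name jenkins-efs-percent-io-limit \
  --namespace AWS/EFS \
  --metric-name PercentIOLimit \
  --dimensions Name=FileSystemId,Value=fs-xxxxxx \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:efs-alerts
```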
Why I Didn't Choose Elastic Throughput
Honestly? I didn't know it existed.
Elastic Throughput was announced in 2022, but I hadn't updated my knowledge. During the emergency, my mental model was:
- Bursting = free but unreliable
- Provisioned = expensive but guaranteed
I missed the third, better option.
Was the Decision Wrong?
Not entirely. Let's look at ROI:
Cost: $69 (10,000 yen)
Avoided Loss:
- 10 engineers × 3 hours waiting = 30 person-hours
- At ~$50/hour = $1,500 in productivity loss
- Plus deployment delays (hard to quantify)
ROI: ~20x
The decision to prioritize business continuity was correct. But knowing about Elastic Throughput would have achieved the same result for 1/20th the cost.
Lessons Learned
1. Always Research Current Options
Don't rely on old knowledge during emergencies. Take 5 minutes to check AWS documentation for the latest features.
2. Cost Estimation is Part of the Response
"Make it work first" is important, but:
- List all options
- Quick cost comparison
- Choose based on data, not urgency
3. Document and Share
This $69 lesson becomes valuable when shared. Your team (and the community) can learn without paying the same price.
Action Items
If you're using EFS:
- [ ] Check your current throughput mode (see the snippet after this list)
- [ ] Consider Elastic Throughput for variable workloads
- [ ] Set up CloudWatch alarms for `PercentIOLimit`
- [ ] Document your throughput mode decision process
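For the first checkbox, a one-liner that lists the throughput mode of every file system in the account:

```bash
# Quick inventory of file systems and their throughput modes
aws efs describe-file-systems \
  --query "FileSystems[].[FileSystemId,ThroughputMode]" \
  --output table
```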
Bottom Line
Use Elastic Throughput for most production workloads.
It's the best of both worlds:
- Handles spikes automatically
- Pay only for what you use
- No capacity planning required
Provisioned Throughput should be reserved for constant, predictable high-throughput scenarios.
Next time I face a similar situation, I'll reach for Elastic Throughput first.
I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/