EBS gp2 burst credits ran dry and our builds slowed to a crawl

#infrastructure #devops #sre #aws

TL;DR: A chunk of our EC2 build agents got slow at the same time every afternoon. No CPU pressure, no memory pressure, no network weirdness. It was EBS gp2 burst credits draining to zero, and the fix was a one-line volume type change to gp3 plus a CloudWatch alarm we should've had years ago.

Right, so this one annoyed me for about a week before the penny dropped.

I work on the core compute platform at Buildkite. We run a fleet of EC2 build agents that pick up jobs off a queue and run them. Sydney afternoon, roughly 2pm local, a handful of agents would start dragging. Builds that normally finished in 4 minutes were taking 11. Not all agents. Maybe 15% of the fleet at any given time.

The symptom that lied to us

First instinct was the obvious stuff. CPU? Flat at 30%. Memory? Plenty free. The agent process itself looked bored.

One of our test suites shells out to an LLM step for flaky-test classification, routed through a gateway, so my first paranoid thought was an upstream provider slowdown. We run that traffic through Bifrost so failover and latency are visible per-provider, and the dashboards there were clean. Not the model. Not the gateway. The slow part was local.

So I SSH'd onto a sick agent mid-build and ran iostat -x 1. There it was. %util pinned at 100, await sitting around 80ms, and the volume was doing maybe 300 IOPS when the workload clearly wanted more.

A gp2 volume. 100GB. Baseline 300 IOPS.

Burst credits, the thing nobody remembers

Here's the bit that bites people. A gp2 EBS volume gives you 3 IOPS per GiB as a baseline, with a floor of 100. Our 100GB volumes get 300 baseline IOPS. Volumes under 1TB also earn burst credits and can spike to 3000 IOPS, but only while they've got credits in the bucket.

Those credits refill at the baseline rate. Burn faster than you refill and the bucket empties. When it hits zero you're hard-capped at 300 IOPS until it recovers.
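If you want rough numbers on how long a bucket lasts, the arithmetic fits in a few shell functions. This is a back-of-envelope sketch, not something from our tooling. The function names are made up, and the 5.4 million credit bucket size comes from the AWS gp2 documentation rather than anything we measured.

```shell
# gp2 burst maths for small volumes. Integer arithmetic only.
GP2_BUCKET_CREDITS=5400000   # I/O credits in a full gp2 bucket (per AWS docs)
GP2_BURST_IOPS=3000          # burst ceiling

# Baseline IOPS: 3 per GiB, floor of 100, ceiling of 16000.
gp2_baseline_iops() {
  local size_gib="$1"
  local iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

# Seconds until a full bucket empties at a sustained demand (in IOPS).
# Demand is clamped to the burst ceiling. Prints "never" when demand
# is at or below baseline, because the bucket refills as fast as it drains.
gp2_seconds_to_empty() {
  local size_gib="$1" demand="$2" baseline
  baseline=$(gp2_baseline_iops "$size_gib")
  if [ "$demand" -gt "$GP2_BURST_IOPS" ]; then demand=$GP2_BURST_IOPS; fi
  if [ "$demand" -le "$baseline" ]; then echo "never"; return 0; fi
  echo $(( GP2_BUCKET_CREDITS / (demand - baseline) ))
}

# Seconds for an empty bucket to refill while the volume sits idle.
gp2_seconds_to_refill() {
  local baseline
  baseline=$(gp2_baseline_iops "$1")
  echo $(( GP2_BUCKET_CREDITS / baseline ))
}
```

For a 100GB volume, gp2_seconds_to_empty 100 3000 prints 2000, so about 33 minutes of flat-out burst. gp2_seconds_to_refill 100 prints 18000, which is five idle hours to get a full bucket back. Real builds don't sit at 3000 IOPS continuously, so the drain takes longer in practice, but a busy agent never gets the idle time to recover.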

Our build agents do a lot of small random writes. Cloning repos, unpacking caches, npm doing what npm does with its 40,000 tiny files. Early in an agent's life it's got a full credit bucket and flies. After a few hours of back-to-back builds, the bucket's empty. The 2pm pattern wasn't a time-of-day thing at all. It was just agents that had been alive long enough to drain their credits.

You can watch it happen. The metric is BurstBalance, a percentage, and we had zero alarms on it.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0abc123def456 \
  --start-time 2026-06-20T00:00:00Z \
  --end-time 2026-06-20T06:00:00Z \
  --period 300 \
  --statistics Minimum

Run that against a sick volume and you'll see Minimum walking down toward 0 over the build session. Clean as.
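The alarm we were missing could look something like the sketch below. The 20% threshold, the three five-minute evaluation periods, the alarm name, and the SNS topic ARN are all illustrative assumptions, so tune them to how fast your own volumes drain.

```shell
# Sketch: create a BurstBalance alarm for one gp2 volume.
# Threshold, evaluation periods, and alarm naming are assumptions.
create_burst_alarm() {
  local volume_id="$1" sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "ebs-burst-balance-low-${volume_id}" \
    --alarm-description "gp2 burst credits below 20% on ${volume_id}" \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions "Name=VolumeId,Value=${volume_id}" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 20 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$sns_topic_arn"
}
```

Call it as create_burst_alarm vol-0abc123def456 followed by the ARN of whichever SNS topic pages your on-call. Alarming on Minimum rather than Average means a short dip to the floor inside a five-minute window still counts.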

Why gp3 fixes it

gp3 doesn't do the burst-credit dance. You get a flat 3000 IOPS baseline regardless of size, and you can provision up to 16,000 if you pay for it. No bucket, no draining, no surprise cliff at hour three.

It's also cheaper for our shape of workload. gp3 storage is about 20% less per GB than gp2, and the first 3000 IOPS are included.

Volume type	Baseline IOPS (100GB)	Burst behaviour	Predictable under sustained load
gp2	300	Credits to 3000, then cliff	No
gp3	3000 (flat)	None, provision up to 16k	Yes
io2	Provisioned, up to 64k	None	Yes, but pricey

For a build fleet, gp3 is the boring correct answer. io2 is overkill unless you genuinely need tens of thousands of sustained IOPS, and we don't.

The migration is a volume modification, no snapshot dance, no detach:

aws ec2 modify-volume \
  --volume-id vol-0abc123def456 \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125

We baked it into the launch template so every new agent comes up gp3. Existing volumes got modified in a rolling batch over a maintenance window. p95 build duration on the affected cohort dropped from 9.2 minutes back to 4.1.
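A rolling batch like that can be sketched in a few lines of shell. This is not our actual script. The tag used to find agent volumes (Role=build-agent in the usage note) and the pause between volumes are hypothetical, so swap in whatever identifies your fleet.

```shell
# Sketch: convert tagged gp2 volumes to gp3, one at a time.
# The tag key/value and the pause length are assumptions for illustration.

# Print the IDs of gp2 volumes carrying the given tag.
list_gp2_volumes() {
  aws ec2 describe-volumes \
    --filters "Name=volume-type,Values=gp2" "Name=tag:$1,Values=$2" \
    --query 'Volumes[].VolumeId' \
    --output text
}

# Modify each matching volume, pausing between calls to keep the batch gentle.
migrate_to_gp3() {
  local tag_key="$1" tag_value="$2" pause_seconds="${3:-30}"
  local vol
  for vol in $(list_gp2_volumes "$tag_key" "$tag_value"); do
    echo "modifying $vol"
    aws ec2 modify-volume \
      --volume-id "$vol" \
      --volume-type gp3 \
      --iops 3000 \
      --throughput 125 \
      --output text >/dev/null || echo "failed: $vol" >&2
    sleep "$pause_seconds"
  done
}
```

Running migrate_to_gp3 Role build-agent 30 would walk the tagged volumes with a 30 second gap between each. Because the filter only matches gp2, re-running it after a partial failure skips anything already converted.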

Trade-offs and limitations

gp3 isn't free of footguns. The default throughput is 125 MB/s, and if your workload is throughput-heavy rather than IOPS-heavy you'll need to bump that separately, because gp3 decouples the two. We left ours at 125 and it's fine, but I've seen teams forget and wonder why their big sequential reads didn't speed up.
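To see which of the two limits bites for a given I/O size, a rough calculation helps. The function below is an illustrative sketch with a made-up name. It assumes every I/O is the same size and ignores instance-level EBS limits.

```shell
# Sketch: which gp3 setting caps you at a given I/O size?
# Arguments: provisioned IOPS, provisioned throughput (MiB/s), I/O size (KiB).
# Prints the effective IOPS ceiling and the limit responsible for it.
gp3_effective_iops() {
  local iops="$1" throughput_mib="$2" io_kib="$3"
  # IOPS the throughput limit alone would allow at this I/O size.
  local by_throughput=$(( throughput_mib * 1024 / io_kib ))
  if [ "$by_throughput" -lt "$iops" ]; then
    echo "$by_throughput throughput-bound"
  else
    echo "$iops iops-bound"
  fi
}
```

With the defaults, gp3_effective_iops 3000 125 16 prints 3000 iops-bound, which is our small-random-write case. gp3_effective_iops 3000 125 256 prints 500 throughput-bound, which is the big sequential read case: the IOPS number never comes into play, so raising it changes nothing.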

modify-volume also has a cooldown. You can't modify the same volume again for 6 hours, so if you fat-finger the IOPS number you're waiting it out. Plan the values before you run it.
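Before retrying a modification, it's worth checking what state the last one is in and when it started. A minimal sketch, with a hypothetical function name:

```shell
# Sketch: show the most recent modification for a volume.
# Prints state, start time, and the target type, IOPS, and throughput.
modification_summary() {
  aws ec2 describe-volumes-modifications \
    --volume-ids "$1" \
    --query 'VolumesModifications[0].[ModificationState,StartTime,TargetVolumeType,TargetIops,TargetThroughput]' \
    --output text
}
```

modification_summary vol-0abc123def456 prints one tab-separated line. The state moves through modifying and optimizing to completed (or failed), and the start time tells you how much of the six hours you've still got to sit through.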

And this didn't make our agents infinitely fast. It removed an artificial cliff. If a build is genuinely I/O bound at 3000 IOPS, gp3 buys you headroom, not magic. The real long-term fix for us is shrinking what we write to disk per build, which is a slower piece of work.

Last thing. Burst credits exist on other AWS resources too, with different names and different cliffs. T-series instance CPU credits. EFS bursting throughput. If you've got one burst-credit surprise in your stack, you've probably got more hiding.