claire nguyen

Posted on May 27

The Cost Math Behind Our CI Cache Hit Rate Going From 40% to 91%

#devops #infrastructure #sre #mlops

TL;DR: Our CI bill was roughly AUD $16.7k/month, much of it redundant compute, because our cache hit rate sat at 40%. Three changes (content-addressed keys, a warm cache tier, and killing one bad pre-commit hook) pushed it to 91% and cut the bill to about $4.2k. Most of the savings came from a single weekend audit, not new tooling.

I run infra at Buildkite. We eat our own dog food, which means our internal monorepo runs on the same agents we sell to customers. About six weeks ago our finance team flagged that our CI compute line on AWS had crept up 38% quarter-on-quarter while team headcount only grew 11%. Something was off.

Turns out the culprit wasn't traffic. It was caches.

The starting point

Our setup, roughly:

~280 engineers across Sydney, Melbourne, San Francisco
Around 4,200 builds/day on the monorepo
Mix of Go, TypeScript, and a chunky Python ML eval service
Agents running on m6i.4xlarge spot instances in ap-southeast-2
Remote cache backed by S3 with a CloudFront distribution

When I first pulled the numbers, our cache hit rate (measured per build step, weighted by step duration) was sitting at 40.3%. For a healthy CI setup of this size I'd reckon you want 80%+. Anything under 60% means you're paying twice for the same compute.
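For anyone wanting to reproduce the metric, here is a minimal sketch of a duration-weighted hit rate. The `Step` type and its field names are made up for illustration, and I'm assuming the weight is the step's uncached duration; adjust to whatever your CI exports.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Step:
    # Hypothetical record shape: one row per build step.
    duration_s: float  # assumed: how long the step takes when run uncached
    cache_hit: bool    # whether the step was served from cache


def weighted_hit_rate(steps: Iterable[Step]) -> float:
    """Fraction of total step time that was served from cache."""
    steps = list(steps)
    total = sum(s.duration_s for s in steps)
    if total == 0:
        return 0.0
    hit = sum(s.duration_s for s in steps if s.cache_hit)
    return hit / total


# Two cheap hits and one expensive miss: 2 of 3 steps hit,
# but only 100 of 400 seconds were saved.
example = [Step(300, False), Step(60, True), Step(40, True)]
print(f"{weighted_hit_rate(example):.0%}")
```

Weighting by duration is what keeps a pile of tiny cached steps from hiding one big uncached one.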

Here's what the spend breakdown looked like before we touched anything:

Component	Monthly cost (AUD)	% of total
Spot EC2 (build agents)	$11,200	67%
S3 cache storage	$890	5%
CloudFront egress	$1,140	7%
LLM eval API calls (OpenAI + Anthropic)	$3,420	21%
Total	$16,650	100%

The LLM line is the one nobody expected. We run automated PR review on a subset of changes, plus regression evals on our search ranking service.

The three things that actually mattered

1. Content-addressed cache keys

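We had cache keys like node_modules_v3_${branch_name}_${os}. Putting the branch name in the key is already wrong, because every new branch starts with a cold cache even when its dependencies are identical to main. The worse bit was the v3 suffix that someone bumped six months ago and forgot why.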

Switched to hashing the actual inputs: package-lock.json content hash + Node version + OS. Standard stuff but we'd just never done it properly.

steps:
  - label: ":node: install"
    plugins:
      - cache#v2.4.0:
          manifest: package-lock.json
          path: node_modules
          restore: file
          save: file
          key: "v1-{{ runner.os }}-node-{{ checksum 'package-lock.json' }}"

The restore: file bit matters. It means we only invalidate when package-lock.json actually changes, not when the branch name changes. Cache hit rate on the install step went from 31% to 96% overnight.
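If you're not on a plugin that does this for you, the idea is easy to sketch by hand. This is an illustration of content-addressed keys in general, not what the plugin does internally; the function name, the key layout, and the 16-character truncation are my own choices.

```python
import hashlib


def cache_key(lock_bytes: bytes, node_version: str, os_name: str,
              prefix: str = "v1") -> str:
    """Build a cache key from the inputs that actually affect node_modules.

    The key changes only when the lockfile content, Node version, or OS
    changes. Branch names are deliberately not part of it.
    """
    digest = hashlib.sha256(lock_bytes).hexdigest()
    # Truncating the digest keeps keys readable; 16 hex chars is an
    # arbitrary choice for this sketch.
    return f"{prefix}-{os_name}-node{node_version}-{digest[:16]}"


# Usage, assuming a lockfile in the working directory:
#   from pathlib import Path
#   key = cache_key(Path("package-lock.json").read_bytes(), "20.11.0", "linux")
```

Two branches with byte-identical lockfiles get the same key, which is the whole point.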

2. A warmer tier between memory and S3

S3 is cheap but the round-trip from ap-southeast-2 agents to S3 is about 18ms for small objects, and we were pulling thousands of them per build. We added an r6gd.large instance with NVMe local storage as an in-region warm cache. Agents check there first, fall through to S3.

Cost: about $180/month for the warm cache instance. Saves us roughly $1,000/month in CloudFront egress (from $1,140 down to $140) because most cache reads never leave the VPC now.
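The lookup order is simple enough to sketch. This is a rough illustration, not our agent code: the three callables are stand-ins for whatever clients you use, and backfilling the warm tier after an S3 hit is an assumption on my part about how you'd want it to behave.

```python
from typing import Callable, Optional

Get = Callable[[str], Optional[bytes]]
Put = Callable[[str, bytes], None]


def tiered_get(key: str, warm_get: Get, s3_get: Get, warm_put: Put) -> Optional[bytes]:
    """Check the warm tier first, then fall through to S3."""
    try:
        blob = warm_get(key)
    except OSError:
        # Warm tier unreachable: treat it as a miss rather than failing the build.
        blob = None
    if blob is not None:
        return blob

    blob = s3_get(key)
    if blob is not None:
        try:
            # Assumed behaviour: backfill so the next agent gets a warm hit.
            warm_put(key, blob)
        except OSError:
            pass
    return blob


# Dicts stand in for the two backends in this example.
warm, s3 = {}, {"node_modules-abc": b"tarball"}
assert tiered_get("node_modules-abc", warm.get, s3.get, warm.__setitem__) == b"tarball"
assert "node_modules-abc" in warm  # second read would stay in-region
```

Swallowing warm-tier errors is what makes the fall-through to S3 clean when the instance goes away.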

3. The bad pre-commit hook

This one is embarrassing. Someone added a pre-commit hook two years ago that ran find . -name "*.pyc" -delete before every test invocation. On a clean checkout this does nothing useful. On a cached checkout it deletes all the compiled Python bytecode, forcing Python to recompile on every test run. Average test step went from 4m20s to 2m45s after deleting eight lines of bash.

I genuinely could not believe it. We'd been paying for that for two years.

The LLM bit

The $3,420 LLM line was harder to chip away at because the calls themselves are useful. What we did:

Routed the PR review traffic through an AI gateway (we use Bifrost, which gives us semantic caching and a single endpoint) so identical or near-identical review prompts hit the cache instead of the provider
Moved the search ranking evals to a nightly batch rather than per-PR
Switched the bulk of the review traffic to a cheaper model and reserved the expensive one for changes touching /security/*
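The model routing rule from the last point is about as simple as it sounds. A minimal sketch, assuming /security/* means a top-level security directory in the repo; the model names are placeholders, not the ones we run.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Placeholder identifiers for illustration only.
CHEAP_MODEL = "cheap-model"
EXPENSIVE_MODEL = "expensive-model"


def pick_review_model(changed_paths: Iterable[str]) -> str:
    """Use the expensive model only if the change touches security/ at the repo root."""
    for path in changed_paths:
        parts = PurePosixPath(path.lstrip("/")).parts
        if parts and parts[0] == "security":
            return EXPENSIVE_MODEL
    return CHEAP_MODEL


assert pick_review_model(["security/auth.go", "README.md"]) == EXPENSIVE_MODEL
assert pick_review_model(["web/src/app.ts"]) == CHEAP_MODEL
```

Matching on the first path component rather than a substring avoids sending something like docs/security/notes.md to the expensive model by accident.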

Semantic cache hit rate on PR review prompts settled around 34%, which doesn't sound massive but the prompts that hit cache tend to be the bigger ones (boilerplate "review this dependency bump" type stuff), so the dollar impact was bigger than the hit rate suggests.

Final LLM line came down to $1,180/month.

Where we landed

Component	Before (AUD/month)	After (AUD/month)	Change
Spot EC2	$11,200	$1,820	-84%
S3 + warm cache	$890	$1,070	+20%
CloudFront egress	$1,140	$140	-88%
LLM API	$3,420	$1,180	-65%
Total	$16,650	$4,210	-75%
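If you want to check the arithmetic, the Change column is just (after - before) / before, rounded to a whole percent. The figures below are copied from the table; nothing else is assumed.

```python
before = {"Spot EC2": 11200, "S3 + warm cache": 890,
          "CloudFront egress": 1140, "LLM API": 3420}
after = {"Spot EC2": 1820, "S3 + warm cache": 1070,
         "CloudFront egress": 140, "LLM API": 1180}


def pct_change(old: float, new: float) -> int:
    """Percentage change from old to new, rounded to a whole percent."""
    return round((new - old) / old * 100)


changes = {name: pct_change(before[name], after[name]) for name in before}
total_before, total_after = sum(before.values()), sum(after.values())

for name, change in changes.items():
    print(f"{name}: {change:+d}%")
print(f"Total: {total_before} -> {total_after} ({pct_change(total_before, total_after):+d}%)")
```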

Cache hit rate: 91.2% weighted.

Trade-offs and limitations

The warm cache tier is a single instance with no redundancy. If that r6gd.large dies, we fall through to S3 cleanly, so nothing breaks, but builds slow down by ~40 seconds each until we replace it. For us that's fine because spot interruption is more common than instance failure anyway. For a smaller team I'd skip it.

Content-addressed keys made cache busting harder for the rare case where you legitimately want to invalidate everything. We added a manual BUILDKITE_CACHE_EPOCH env var so a human can force-invalidate when needed. Used it twice in three months.
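One way to wire an epoch like that into a content-addressed key is to prefix it. This is a sketch of the idea, not our exact wiring; the "e0" default and the key layout are assumptions.

```python
import os
from typing import Mapping, Optional


def key_with_epoch(content_key: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Prefix a content-addressed key with a manually controlled epoch.

    Bumping BUILDKITE_CACHE_EPOCH changes every key at once, which forces
    a cold cache without touching any of the hashed inputs.
    """
    env = os.environ if env is None else env
    epoch = env.get("BUILDKITE_CACHE_EPOCH", "0")  # assumed default when unset
    return f"e{epoch}-{content_key}"


assert key_with_epoch("linux-node20-abc123", {}) == "e0-linux-node20-abc123"
assert key_with_epoch("linux-node20-abc123", {"BUILDKITE_CACHE_EPOCH": "2"}) == "e2-linux-node20-abc123"
```

Unlike the old v3 suffix, the epoch lives in one place and bumping it is an explicit, visible act.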

The pre-commit hook thing wasn't a tooling problem. It was institutional knowledge rot. There's no caching strategy that protects you from someone deleting your bytecode before every test run. You need humans to actually read what runs.
