Semantic caching our flaky-test summariser: 58% fewer LLM calls

#sre #devops #llm #mlops

TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most were near-duplicates of failures we'd already explained. Switching on semantic caching in Bifrost cut live provider calls by 58% and dropped p50 latency on cache hits from ~900ms to about 40ms. It also kept the feature alive when our primary provider browned out for 11 minutes.

The feature that wouldn't shut up

On our platform team (eight of us) we shipped a small thing last quarter: when a test goes flaky in a Buildkite pipeline, we pass the failure output to an LLM and stick a plain-English summary on the build page. Devs liked it. The provider bill less so.

By March it was making roughly 40,000 calls a day against anthropic/claude-haiku, with openai/gpt-4o-mini as the fallback. p50 latency sat around 900ms. The monthly bill crept past $310. Not catastrophic. But the calls were doing the same work over and over.

Why the calls were so repetitive

Here's the bit that bugged me. Flaky tests are flaky for the same reasons across builds. A timeout in payments_spec.rb looks almost the same on Tuesday as it did on Monday, minus a timestamp and a container ID.

So we were paying full freight to summarise text we'd already summarised. Different bytes, same meaning. A normal key-based cache misses all of these because the strings never match exactly. That's the whole problem semantic caching solves: it matches on meaning, not on an md5 of the prompt.
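To make the difference concrete, here's a minimal sketch. The two log lines are made up, and `toy_embed` is a crude stand-in for a real embedding model (it just counts alphabetic tokens), so the score it prints is not what text-embedding-3-small would return. The point is only that the hash comparison misses while a similarity comparison doesn't.

```python
import hashlib
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Crude stand-in for an embedding model: counts of alphabetic tokens.

    Digits and tokens shorter than three letters are dropped, so timestamps
    and short hex IDs vanish. A real model is far less blunt than this.
    """
    return Counter(re.findall(r"[a-z_]{3,}", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up log lines: same failure, different timestamp and container ID.
monday = "09:12:44 container=a1f3 payments_spec.rb: Net::ReadTimeout waiting for gateway"
tuesday = "14:03:10 container=9c2e payments_spec.rb: Net::ReadTimeout waiting for gateway"

# An exact-key cache compares hashes of the raw bytes.
exact_key_hit = hashlib.md5(monday.encode()).hexdigest() == hashlib.md5(tuesday.encode()).hexdigest()

# A semantic cache compares vectors instead.
similarity = cosine(toy_embed(monday), toy_embed(tuesday))

print(f"exact-key cache hit: {exact_key_hit}")
print(f"toy similarity: {similarity:.2f}")
```

The hashes differ because of the timestamp and container ID alone, which is exactly the noise we don't care about.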

We already ran everything through Bifrost as our gateway, mostly for the automatic failover. Turns out the semantic caching was sitting right there.

Turning it on

Bifrost runs as a single Go binary in front of our summariser. We added the cache plugin to the gateway config and pointed it at a small embedding model so we weren't paying much per lookup.

{
  "plugins": [
    {
      "name": "semantic_cache",
      "config": {
        "embedding_model": "openai/text-embedding-3-small",
        "threshold": 0.92,
        "ttl_seconds": 86400
      }
    }
  ]
}

The threshold is the knob that matters. At a cosine similarity of 0.92, two failures have to be genuinely close before we serve a cached summary. We started at 0.97, which was too strict (hit rate sat around 20%), and walked it down while spot-checking summaries against the real failures.

Settled on 0.92. Cache hit rate landed at 58% over the first three weeks. On a hit, the summariser returns in ~40ms instead of waiting on a provider round trip. No code change in our app, since Bifrost speaks the same OpenAI-compatible API we already called.
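If it helps to picture what the threshold and TTL are doing, here's a rough in-memory sketch of a lookup. It is not how Bifrost stores or searches entries; the `SemanticCache` class, the two-dimensional vectors, and the linear scan are all assumptions made to keep the example small. Only the two numbers (0.92 and 86400) come from our config.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Entry:
    vector: list[float]
    summary: str
    stored_at: float


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory illustration only; not how Bifrost stores entries."""

    def __init__(self, threshold: float = 0.92, ttl_seconds: int = 86400):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries: list[Entry] = []

    def put(self, vector, summary, now=None):
        stored_at = time.time() if now is None else now
        self.entries.append(Entry(vector, summary, stored_at))

    def get(self, vector, now=None):
        now = time.time() if now is None else now
        # Expired entries are dropped before matching.
        self.entries = [e for e in self.entries if now - e.stored_at < self.ttl_seconds]
        best_score, best_summary = 0.0, None
        for entry in self.entries:
            score = cosine(vector, entry.vector)
            if score > best_score:
                best_score, best_summary = score, entry.summary
        # Serve the nearest neighbour only if it clears the threshold.
        return best_summary if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0], "Timeout talking to the payment gateway", now=0)

near = cache.get([0.99, 0.05], now=60)      # close to the stored vector
far = cache.get([0.5, 0.5], now=60)         # similarity of about 0.71
expired = cache.get([1.0, 0.0], now=86401)  # identical, but past the TTL
```

Here `near` gets the cached summary, while `far` and `expired` both come back as `None` and would go on to a provider.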

What the brownout taught us

Two weeks in, our primary provider had a rough afternoon: elevated errors and timeouts for 11 minutes. Bifrost's fallback kicked the traffic over to the secondary, as it's meant to.

But the cache did something I hadn't planned for. More than half the requests during that window never reached either provider, because they matched recent failures already in the cache. The blast radius shrank on its own. The fallback handled the genuinely new failures, the cache absorbed the repeats, and nobody filed a ticket. She'll be right, basically.

That's the reliability angle people miss with caching. It's not only a cost lever. It's load shedding you get for free when an upstream goes wobbly.
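The ordering that produced this can be sketched in a few lines. This is a simulation of the flow as we observed it (cache first, then primary, then secondary), not Bifrost's internals. The `summarise` function, the `ProviderError` class, and the dict used as a stand-in for the semantic lookup are all hypothetical.

```python
class ProviderError(Exception):
    """A provider call that errored or timed out."""


def summarise(failure, cache_lookup, providers):
    """Return (summary, source). Cache first, then providers in order."""
    cached = cache_lookup(failure)
    if cached is not None:
        return cached, "cache"
    for name, call in providers:
        try:
            return call(failure), name
        except ProviderError:
            continue  # fall through to the next provider in the chain
    raise ProviderError("cache miss and every provider failed")


# Simulated brownout: the primary always fails, the secondary works.
calls = {"primary": 0, "secondary": 0}


def primary(failure):
    calls["primary"] += 1
    raise ProviderError("timeout")


def secondary(failure):
    calls["secondary"] += 1
    return f"summary of {failure}"


# A plain dict stands in for the semantic lookup here.
seen = {"payments timeout": "Gateway timed out in payments_spec.rb"}
providers = [("primary", primary), ("secondary", secondary)]

repeat = summarise("payments timeout", seen.get, providers)
fresh = summarise("new nil error", seen.get, providers)

print(repeat, fresh, calls)
```

The repeat never touches either provider, and only the new failure pays the cost of a failed primary attempt before the secondary answers.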

Bifrost vs LiteLLM vs Portkey

We looked at the obvious alternatives before committing. All three can do semantic caching. They're not the same.

Capability	Bifrost	LiteLLM	Portkey
Semantic cache	Built in, config-driven	Yes, via Redis + embeddings	Yes, mature
Failover + cache together	Single binary	Proxy + Redis to wire up	SaaS, polished
Self-host	Go binary, Docker	Python proxy	Self-host or cloud
Dashboard	Built-in web UI	Community UI	Strongest of the three
Provider breadth	23+	Very broad	Broad

Honest read: LiteLLM has the bigger community and the widest provider list, and if you already run Redis their cache is well-trodden. Portkey's dashboard and analytics are the slickest of the lot, and for a team that wants a managed SaaS it's hard to argue against.

We picked Bifrost because we self-host on ECS and wanted the failover and the cache in one Go process, not a Python proxy plus a Redis we'd have to babysit. Fewer moving parts to break on a game day.

Trade-offs and limitations

Semantic caching isn't free of sharp edges, and pretending otherwise would be daft.

The threshold is a real risk. Set it too loose and you'll serve a summary from a different failure that happens to read similarly. We caught two of these at 0.88 during tuning. A bad summary on a build page erodes trust fast, so we erred on the conservative side at 0.92 and accept a lower hit rate for it.
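If you want to make the spot-checking a bit more systematic, a sweep like this is one way to do it. It assumes you've hand-labelled a sample of (similarity, same root cause) pairs for new failures and their nearest cached neighbour. The pairs below are invented for illustration and are not our real spot-check data; `sweep` is a hypothetical helper, not something Bifrost ships.

```python
def sweep(pairs, thresholds):
    """For each threshold, report the hit rate and the wrong matches served.

    pairs: (similarity, same_root_cause) for a new failure and its nearest
    cached neighbour, labelled by hand during spot-checking.
    """
    rows = []
    for threshold in thresholds:
        served = [same for similarity, same in pairs if similarity >= threshold]
        rows.append({
            "threshold": threshold,
            "hit_rate": len(served) / len(pairs),
            "wrong_matches": sum(1 for same in served if not same),
        })
    return rows


# Invented labels for illustration; these are not our real spot-check data.
pairs = [
    (0.99, True), (0.98, True), (0.95, True), (0.93, True), (0.91, True),
    (0.90, False), (0.89, False), (0.80, False), (0.65, False), (0.40, False),
]

rows = sweep(pairs, [0.97, 0.92, 0.88])
for row in rows:
    print(row)
```

The shape to look for is the point where loosening the threshold starts adding wrong matches faster than it adds useful hits.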

Embeddings add a little latency and cost on every lookup, including misses. With text-embedding-3-small it's small, but it's not zero. For workloads where every input is genuinely unique, you'll pay the embedding tax and get almost nothing back.
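A back-of-envelope way to reason about the embedding tax on cost: every request pays for an embedding, and only hits skip the LLM call. The prices below are placeholders in arbitrary units, not real provider pricing, and the sketch ignores latency and the cost of running the cache itself.

```python
def net_saving_per_request(hit_rate: float, llm_cost: float, embed_cost: float) -> float:
    """Every request pays for an embedding; only hits skip the LLM call."""
    return hit_rate * llm_cost - embed_cost


def break_even_hit_rate(llm_cost: float, embed_cost: float) -> float:
    """Hit rate below which the cache costs more than it saves."""
    return embed_cost / llm_cost


# Placeholder prices in arbitrary units, not real provider pricing.
LLM_COST = 1.0
EMBED_COST = 0.05

ours = net_saving_per_request(0.58, LLM_COST, EMBED_COST)           # our measured hit rate
unique_inputs = net_saving_per_request(0.02, LLM_COST, EMBED_COST)  # almost nothing repeats
floor = break_even_hit_rate(LLM_COST, EMBED_COST)

print(f"saving at 58% hits: {ours:.2f} per request")
print(f"saving at 2% hits: {unique_inputs:.2f} per request")
print(f"break-even hit rate: {floor:.0%}")
```

With these placeholder numbers the cache pays for itself comfortably at our hit rate and loses money when almost nothing repeats. Plug in your own prices before drawing conclusions.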

Cache invalidation is on you. When we changed the summariser's prompt, every cached entry was suddenly stale against the new format. We dropped the TTL to 24 hours so the cache rolls over daily rather than holding stale shapes for a week.
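We didn't do this, but one option that avoids waiting for the TTL is to tie cache entries to a version of the prompt, so a prompt change lands in a fresh namespace straight away. A minimal sketch, assuming your gateway lets you set a cache key or namespace per request (check its docs; I'm not claiming Bifrost exposes it this way). The prompt strings and the `flaky-summary:` prefix are hypothetical.

```python
import hashlib


def cache_namespace(prompt_template: str, model: str) -> str:
    """Derive a namespace that changes whenever the prompt or model changes."""
    digest = hashlib.sha256(f"{model}\n{prompt_template}".encode("utf-8")).hexdigest()
    return f"flaky-summary:{digest[:12]}"


# Hypothetical prompt templates, before and after a format change.
old = cache_namespace("Summarise this test failure in two sentences.", "anthropic/claude-haiku")
new = cache_namespace("Summarise this test failure as three bullet points.", "anthropic/claude-haiku")

print(old)
print(new)
```

Entries written under the old namespace simply stop matching and age out on their own.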

And it doesn't replace failover. The cache helped during the brownout, but only because we had recent traffic. Cold cache plus dead provider equals a bad time. Keep your fallback chain regardless.