Game day on our build cluster: killing an AZ to test LLM flake detection

#devops #sre #infrastructure #llm

TL;DR: We ran a game day on our Buildkite agent fleet where I yanked an entire AWS AZ while our LLM-based flake classifier was triaging failures. The classifier fell over because we'd wired it to a single OpenAI endpoint. Putting Bifrost in front fixed the failover hole and exposed two other bugs we hadn't seen.

Right, so a few weeks back I was running a game day on our internal build cluster. About 800 agents spread across ap-southeast-2a, 2b, and 2c. The exercise was meant to test our LLM-powered flake detector under partial infrastructure failure. The detector reads a failed job log, classifies it as flake | real | infra, and decides whether to auto-retry.
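
A minimal Go sketch of that classify-then-decide step, for illustration only and not the actual service code. The three labels come from the description above; the names `Verdict`, `ParseVerdict`, and `ShouldRetry`, and the policy itself (retry flakes and infra failures within an attempt budget, never retry real failures), are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Verdict is the label the classifier assigns to a failed job.
type Verdict string

const (
	Flake Verdict = "flake"
	Real  Verdict = "real"
	Infra Verdict = "infra"
)

// ParseVerdict normalises the model's raw reply into one of the three labels.
// Anything unrecognised is returned as an error so the caller decides what
// to do instead of silently guessing.
func ParseVerdict(raw string) (Verdict, error) {
	switch v := Verdict(strings.ToLower(strings.TrimSpace(raw))); v {
	case Flake, Real, Infra:
		return v, nil
	default:
		return "", fmt.Errorf("unrecognised verdict %q", raw)
	}
}

// ShouldRetry is a hypothetical policy: retry flakes and infra failures,
// never retry real failures, and stop once the attempt budget is spent.
func ShouldRetry(v Verdict, attempt, maxAttempts int) bool {
	if attempt >= maxAttempts {
		return false
	}
	return v == Flake || v == Infra
}

func main() {
	v, err := ParseVerdict(" Flake\n")
	if err != nil {
		fmt.Println("classification failed:", err)
		return
	}
	fmt.Println(v, ShouldRetry(v, 1, 3))
}
```

Returning an error for an unparseable reply keeps "the model said something odd" separate from "the model said real", which matters once the classifier itself can be the thing that's failing.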

I killed 2a. That was the plan. What wasn't the plan was the flake detector going completely dark within 90 seconds.

What broke

We'd built the detector as a tiny Go service running on each agent host. It called OpenAI's gpt-4o-mini directly. One endpoint, one API key, no retries beyond the SDK default. When 2a went down, our networking config rerouted egress through a NAT gateway that was being throttled hard by the surge of retries from other services. Result: every flake classification request hung for 30 seconds, then timed out.

CI pipelines didn't fail; they just stopped auto-retrying. Every flaky failure now landed on engineers exactly the way a real bug would, and Slack lit up.

The post-mortem was a bit embarrassing. We'd tested failover for the build database, the artifact store, the agent registration service. Hadn't tested failover for the thing classifying our test failures. Classic case of treating the LLM call as "just an API" instead of as a dependency that can fail in five different ways.

What we changed

I'd been keeping an eye on Bifrost (https://github.com/maximhq/bifrost) for a few months. It's an AI gateway written in Go that sits between your app and the providers. Single OpenAI-compatible endpoint, fallback rules, load balancing across keys, and a Prometheus metrics endpoint baked in. That last bit was what sold me, because our observability stack is already Prom + Grafana and I didn't fancy bolting on yet another exporter.

Deployed it as a sidecar on the agent hosts, two replicas per AZ. Config looked roughly like this:

providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_BACKUP
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY

fallbacks:
  - model: "openai/gpt-4o-mini"
    targets:
      - "openai/gpt-4o-mini"
      - "anthropic/claude-haiku-4-5"

The flake detector's only change was pointing its OpenAI base URL at http://localhost:8080/v1. One line. No SDK swap.

Second game day

Ran the same exercise two weeks later. Killed 2a again. The Bifrost sidecar on 2a stopped responding, the detector's HTTP client failed over to the 2b sidecar via our service mesh, and classification continued. The fallback rule kicked in for about 4% of requests when one OpenAI key got rate-limited by the surge — those routed to Anthropic and came back in roughly the same latency window.

We didn't see zero impact. p99 classification latency jumped from ~1.2s to ~3.8s during the failover window. But nothing went dark.

Two bugs the gateway exposed

The Prometheus metrics from Bifrost showed us things our app-level logging had been hiding.

Bug one: 12% of our "real bug" classifications were coming from one specific agent pool that runs Ruby tests. The model was getting truncated logs because we'd set max_tokens too low on the input side at some point and nobody remembered. The per-provider token histograms in the metrics made it obvious.

Bug two: Our retry logic was doubling up. The agent was retrying on 429, and Bifrost was also retrying on 429. The two layers multiplied rather than added, so a single rate-limited request was costing us 4x the tokens. Fixed by turning off retries in our client and letting Bifrost handle them.
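
Stacked retry layers multiply, which is easy to underestimate. The toy simulation below counts provider hits when both layers retry on 429. The attempt counts (two per layer) are an assumption chosen to reproduce a 4x blow-up, and `withRetry` and `countUpstream` are hypothetical helpers, not anything from the gateway or an SDK.

```go
package main

import (
	"errors"
	"fmt"
)

var errRateLimited = errors.New("429 rate limited")

// withRetry wraps call so it is attempted up to attempts times,
// retrying only when the error is errRateLimited.
func withRetry(attempts int, call func() error) func() error {
	return func() error {
		var err error
		for i := 0; i < attempts; i++ {
			if err = call(); !errors.Is(err, errRateLimited) {
				return err
			}
		}
		return err
	}
}

// countUpstream returns how many provider requests one logical call
// produces when the provider answers 429 every time.
func countUpstream(clientAttempts, gatewayAttempts int) int {
	hits := 0
	provider := func() error { hits++; return errRateLimited }
	gateway := withRetry(gatewayAttempts, provider)
	client := withRetry(clientAttempts, gateway)
	_ = client()
	return hits
}

func main() {
	fmt.Println("both layers retry:", countUpstream(2, 2)) // 4 provider hits
	fmt.Println("gateway only:", countUpstream(1, 2))      // 2 provider hits
}
```

Retrying in exactly one layer keeps the worst case linear in the attempt budget instead of quadratic.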

Honest comparison

We looked at LiteLLM and Portkey before landing on Bifrost. Quick table:

Concern	LiteLLM	Portkey	Bifrost
Deploy as single binary	Python, heavier	Hosted-first	Go binary, npx or Docker
Prom metrics out of box	Plugin	Hosted dashboard	Native endpoint
Fallback config	YAML	UI + config	YAML + Web UI
OSS self-host	Yes	Limited	Yes
Maturity (Apr 2026)	Highest, broad ecosystem	Strong hosted product	Younger, smaller community

LiteLLM has way more community plugins and provider quirks already handled. If you're doing weird stuff with niche providers, it's still probably the safer pick. Portkey's hosted dashboards are nicer than what we built ourselves. Bifrost won for us because it's a single Go binary with native Prom metrics, and the latency overhead in our tests was under 2ms p50.

Trade-offs and limitations

Adds a network hop. ~1-2ms p50, ~5ms p99 in our setup. Acceptable for flake classification, maybe not for tight inner loops.
Another thing to monitor. We've now got Bifrost-down alerts in PagerDuty.
Semantic caching (https://docs.getbifrost.ai/features/semantic-caching) sounded great but we haven't enabled it — flake classification context is too specific for cache hits to be meaningful in our case.
The Web UI is handy for fiddling locally, but we manage config via git like everything else, so we mostly ignore it.

Game days for the LLM dependency in your CI aren't optional anymore if you're doing anything non-trivial. The LLM call is now a critical-path component, so treat it like one.