Chaos testing your CI runner fleet when half the jobs call an LLM

#devops #sre #infrastructure #llm

TL;DR: We started injecting LLM provider failures into our Buildkite agent fleet during scheduled game days. Found out our "retry on 5xx" logic was happily burning $80/hr re-sending the same 200k-token context to Anthropic during a brownout. Putting Bifrost in front of the agents fixed the obvious stuff. The chaos testing exposed the non-obvious stuff.

Right, story time. We run a fair-sized fleet of Buildkite agents on EC2, and over the last 18 months maybe 30% of jobs started touching an LLM somewhere. Code review bots. Doc generation. A weird internal thing that summarises flaky test runs. The build itself is deterministic. The LLM calls inside the build are not.

When OpenAI had its multi-hour wobble in March, our p99 build time went from 4 minutes to 47. Half the queue stalled. We hadn't tested for it because nothing in our chaos playbook accounted for "third-party inference API returns 200 but takes 90 seconds."

So we built one.

What we were already doing wrong

The original setup was the obvious thing. Each agent had an OPENAI_API_KEY baked into the AMI. Build scripts called the API directly. Retries were whatever the SDK gave us by default.

Three problems showed up the first time we ran a proper failure injection:

SDK default retry was 2 attempts with exponential backoff. On a 200k-token prompt at $3/M input tokens, that's 60 cents per retry. Multiply by 800 concurrent agents during a brownout and you do the maths.
We had no circuit breaker. Agents kept dialling a dead provider for the full 10-minute job timeout.
No visibility into which build steps were calling which model. The bill arrived monthly. The blame arrived never.

The game day setup

We run game days on a staging fleet that mirrors prod cluster sizing. The injection is done with a tiny toxiproxy sidecar that sits between the agent and the outbound LLM endpoint. Three failure modes we rotate through:

Brownout: 30% of requests return 429 with a Retry-After of 60s
Slowdown: every request gets 15s of latency added
Hard down: 100% return 503 for 8 minutes, then recovery

The first time we ran the brownout scenario against our naive setup, we got a Slack page from finance before the game day was over. They'd seen the cost spike in their hourly dashboard. Embarrassing. Also, exactly the point of the exercise.

Putting a gateway in front

We moved to running Bifrost as a sidecar on each agent host. The agents talk to localhost:8080 with the OpenAI SDK and Bifrost handles the actual provider calls. Drop-in replacement, no code changes in the build scripts.

The config is boring, which is what you want:

providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_SECONDARY
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY

fallbacks:
  - primary: openai/gpt-4o-mini
    backup:
      - anthropic/claude-haiku-4-5

Two things this actually solved during our next game day:

Fallback worked without code changes. When toxiproxy killed OpenAI, builds kept moving by routing to Anthropic. Build time bumped maybe 20%. Nobody paged.

The Prometheus metrics gave us per-pipeline cost visibility. We could finally see that one team's "summarise the test logs" step was responsible for 40% of our LLM spend. Conversation with that team was much easier with numbers attached.

What gateway != fixes

Here's the honest bit. The gateway didn't solve our retry-cost problem on its own. Bifrost's fallback config is good, but if your build script is calling the API in a loop and not respecting the 429s coming back, you'll still burn money. We had to write our own thin wrapper in the build pipeline to bail out of the LLM step after 2 failures and fall back to a heuristic. Gateway gave us the signals. The build logic still has to do the right thing with them.

Honest comparison

We looked at LiteLLM and Portkey before settling. Quick read:

Tool	What we liked	Where it didn't fit
LiteLLM	Massive provider list, well-known	Python proxy meant another runtime on each agent host
Portkey	Slick analytics dashboard, mature observability	SaaS-first, our security team wasn't keen on egress for build logs
Bifrost	Single Go binary, drop-in OpenAI compat, semantic caching that actually saved us 22% on the doc-gen pipeline	Smaller ecosystem, fewer integrations than LiteLLM, MCP gateway is enterprise-tier

If you're already running LiteLLM happily, no reason to swap. We just preferred deploying one binary alongside the agent instead of a Python service.

Trade-offs and limitations

A few things to be straight about:

Adding a gateway adds a hop. We measured about 3-5ms overhead per call. Fine for our use case, might matter if you're doing latency-sensitive inference.
Semantic caching is brilliant for repetitive build prompts (think "summarise this stack trace") but useless for anything with high-entropy input. Don't expect a free 50% cost cut.
Self-hosted means you own the uptime of the gateway too. We run it as a sidecar so the blast radius is one agent, but if you centralise it, you've created a new SPOF.
Game days take real time. Half a day to set up, half a day to run, two days of follow-up tickets. Worth it. Not free.

The biggest win wasn't any one feature. It was that we'd actually pulled the cables out before a real provider had a bad afternoon. "Never had an outage" usually means you've never tested your failure handling.

Top comments (1)

Xidao • May 26

This is a good example of why LLM calls inside CI need failure budgets, not just retries. Once prompts get into the 100k+ token range, the unit of failure is no longer a request, it is a cost event, so the circuit breaker really needs to be token-aware rather than HTTP-aware.

We hit a similar issue where provider latency spikes were less damaging than partial success: the provider accepted the request, streamed late, and the job crossed its own timeout so the next retry duplicated work anyway. What ended up helping was emitting an idempotency key per build step and treating "same prompt fingerprint + same artifact target" as a dedupe boundary before any retry was allowed.

Did you end up chaos-testing degraded responses too, not just transport failures? Things like truncated JSON, schema-valid but low-confidence outputs, or a fallback model that passes syntax checks but changes review quality can be harder to catch than a clean 503.