Fault-injecting our LLM provider to trust Bifrost fallbacks

#infrastructure #devops #sre #llm

TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong.

We've got a small service that reads failed CI jobs and writes a one-paragraph summary into the build annotation, so engineers don't have to scroll 4,000 lines of test log to find the one assertion that broke. It calls an LLM. Handy when it works. Embarrassing when it doesn't, because a broken annotation makes people distrust every annotation.

The problem is the thing it depends on isn't ours. OpenAI rate-limits, has the occasional 5xx spell, and we don't get a heads-up. "Never had an outage" usually means you never tested the failure path. So I tested it.

Why a gateway at all

I didn't want fallback logic smeared across our service code. Retry-with-jitter, secondary provider, key rotation, all of that wants to live in one place with metrics attached. We put Bifrost in front, an OpenAI-compatible gateway, so our service keeps talking the same /v1/chat/completions it always did and the routing decisions move to config.

The pitch is plain. One endpoint, 23+ providers behind it, automatic fallbacks between them. Our code points at localhost:8080 instead of api.openai.com and stops caring which model actually answers.

Here's the fallback config I started the game day with:

{
  "providers": {
    "openai": { "keys": ["env.OPENAI_KEY_A", "env.OPENAI_KEY_B"] },
    "anthropic": { "keys": ["env.ANTHROPIC_KEY"] }
  },
  "fallbacks": [
    "openai/gpt-4o-mini",
    "anthropic/claude-haiku-4-5"
  ]
}

Two OpenAI keys for load balancing, then Anthropic as the lifeboat if OpenAI as a whole goes sideways. That was the theory.

The game day

A game day is just a planned outage you cause on purpose, with people watching. I scheduled 45 minutes, told the team, and put a toxiproxy in front of OpenAI so I could inject faults without waiting for the real thing to break.

Three scenarios:

429 storm. Every OpenAI response becomes a rate-limit for 5 minutes.
Hard 500s. OpenAI returns 503 on half of requests.
Latency tar pit. 30-second delays, no errors, the nastiest one.

Scenario one went fine. Bifrost saw the 429s, rotated between key A and key B, then gave up on OpenAI and the requests landed on Haiku. Annotations kept writing. Reckoned I was done.

Scenario two found my first mistake. I'd not set a sane retry ceiling, so on a 503 the gateway retried hard against the same struggling provider before failing over, and our p95 on annotation writes jumped to about 18 seconds. Fixed it by capping retries and letting the fallback fire sooner. The README's retries and fallbacks page covers the knobs; I'd skimmed it the first time.

Scenario three is the one everyone gets wrong. Slow isn't down. A 30-second response isn't an error, so naive fallback never triggers, the request just sits there. We added a request timeout so a tar-pitted provider counts as a failure and trips the lifeboat. That single change is the actual reason this exercise was worth running.

What the metrics showed

Bifrost ships native Prometheus metrics, so I didn't have to bolt on my own. I watched fallback rate and per-provider latency the whole time on a Grafana board.

Scenario	Without fallback	With Bifrost (tuned)
429 storm	annotations stall	reroute to Haiku, ~2.1s p95
Hard 503s	50% writes fail	0 user-visible failures
30s latency	every write hangs	timeout trips fallback in 4s

The numbers that mattered: zero broken annotations across all three once tuned, and the fallback decisions were visible in metrics instead of buried in logs nobody reads.

How it stacks up against LiteLLM and Portkey

I'd used LiteLLM before. Worth being honest here.

	Bifrost	LiteLLM	Portkey
OpenAI-compatible endpoint	yes	yes	yes
Automatic fallbacks	yes	yes	yes
Native Prometheus metrics	yes	yes	yes
Self-host story	single Go binary	Python proxy	gateway is OSS, control plane hosted
Maturity / ecosystem	newer	large, lots of integrations	polished dashboards

LiteLLM has been around longer and has a bigger pile of community integrations, which counts for something when you hit an edge case at 2am. Portkey's hosted dashboards are nicer than anything I'd build myself, and if you don't want to run infra that's a fair trade. We picked Bifrost mostly because a single Go binary is easy for an infra team to operate and the Prometheus output dropped straight into our existing board with no glue. Not a knock on the others. Different priorities.

Trade-offs and limitations

A gateway is one more hop you have to keep alive. If Bifrost falls over, every LLM call falls with it, so we run two replicas behind a load balancer and the game day included killing one of them too.

Fallback to a different model means a different model. Haiku doesn't write the exact same summary as gpt-4o-mini, and for a build annotation that's fine, but if you depend on a strict output schema you need to test the lifeboat actually produces it. We caught one prompt that assumed OpenAI-specific formatting.

And fault injection in front of a proxy isn't the real provider misbehaving. Toxiproxy gives you 429s and delays, not the weird partial-stream failures you see in the wild. It's a model of the failure, not the failure. Better than nothing, not the whole story.

Semantic caching is on the roadmap for us, not load-bearing yet, so I'm not going to claim numbers I haven't measured.