DEV Community

claire nguyen
claire nguyen

Posted on

Fault-injecting our LLM provider to trust Bifrost fallbacks

TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong.

We've got a small service that reads failed CI jobs and writes a one-paragraph summary into the build annotation, so engineers don't have to scroll 4,000 lines of test log to find the one assertion that broke. It calls an LLM. Handy when it works. Embarrassing when it doesn't, because a broken annotation makes people distrust every annotation.

The problem is the thing it depends on isn't ours. OpenAI rate-limits, has the occasional 5xx spell, and we don't get a heads-up. "Never had an outage" usually means you never tested the failure path. So I tested it.

Why a gateway at all

I didn't want fallback logic smeared across our service code. Retry-with-jitter, secondary provider, key rotation, all of that wants to live in one place with metrics attached. We put Bifrost in front, an OpenAI-compatible gateway, so our service keeps talking the same /v1/chat/completions it always did and the routing decisions move to config.

The pitch is plain. One endpoint, 23+ providers behind it, automatic fallbacks between them. Our code points at localhost:8080 instead of api.openai.com and stops caring which model actually answers.

Here's the fallback config I started the game day with:

{
  "providers": {
    "openai": { "keys": ["env.OPENAI_KEY_A", "env.OPENAI_KEY_B"] },
    "anthropic": { "keys": ["env.ANTHROPIC_KEY"] }
  },
  "fallbacks": [
    "openai/gpt-4o-mini",
    "anthropic/claude-haiku-4-5"
  ]
}
Enter fullscreen mode Exit fullscreen mode

Two OpenAI keys for load balancing, then Anthropic as the lifeboat if OpenAI as a whole goes sideways. That was the theory.

The game day

A game day is just a planned outage you cause on purpose, with people watching. I scheduled 45 minutes, told the team, and put a toxiproxy in front of OpenAI so I could inject faults without waiting for the real thing to break.

Three scenarios:

  1. 429 storm. Every OpenAI response becomes a rate-limit for 5 minutes.
  2. Hard 500s. OpenAI returns 503 on half of requests.
  3. Latency tar pit. 30-second delays, no errors, the nastiest one.

Scenario one went fine. Bifrost saw the 429s, rotated between key A and key B, then gave up on OpenAI and the requests landed on Haiku. Annotations kept writing. Reckoned I was done.

Scenario two found my first mistake. I'd not set a sane retry ceiling, so on a 503 the gateway retried hard against the same struggling provider before failing over, and our p95 on annotation writes jumped to about 18 seconds. Fixed it by capping retries and letting the fallback fire sooner. The README's retries and fallbacks page covers the knobs; I'd skimmed it the first time.

Scenario three is the one everyone gets wrong. Slow isn't down. A 30-second response isn't an error, so naive fallback never triggers, the request just sits there. We added a request timeout so a tar-pitted provider counts as a failure and trips the lifeboat. That single change is the actual reason this exercise was worth running.

What the metrics showed

Bifrost ships native Prometheus metrics, so I didn't have to bolt on my own. I watched fallback rate and per-provider latency the whole time on a Grafana board.

Scenario Without fallback With Bifrost (tuned)
429 storm annotations stall reroute to Haiku, ~2.1s p95
Hard 503s 50% writes fail 0 user-visible failures
30s latency every write hangs timeout trips fallback in 4s

The numbers that mattered: zero broken annotations across all three once tuned, and the fallback decisions were visible in metrics instead of buried in logs nobody reads.

How it stacks up against LiteLLM and Portkey

I'd used LiteLLM before. Worth being honest here.

Bifrost LiteLLM Portkey
OpenAI-compatible endpoint yes yes yes
Automatic fallbacks yes yes yes
Native Prometheus metrics yes yes yes
Self-host story single Go binary Python proxy gateway is OSS, control plane hosted
Maturity / ecosystem newer large, lots of integrations polished dashboards

LiteLLM has been around longer and has a bigger pile of community integrations, which counts for something when you hit an edge case at 2am. Portkey's hosted dashboards are nicer than anything I'd build myself, and if you don't want to run infra that's a fair trade. We picked Bifrost mostly because a single Go binary is easy for an infra team to operate and the Prometheus output dropped straight into our existing board with no glue. Not a knock on the others. Different priorities.

Trade-offs and limitations

A gateway is one more hop you have to keep alive. If Bifrost falls over, every LLM call falls with it, so we run two replicas behind a load balancer and the game day included killing one of them too.

Fallback to a different model means a different model. Haiku doesn't write the exact same summary as gpt-4o-mini, and for a build annotation that's fine, but if you depend on a strict output schema you need to test the lifeboat actually produces it. We caught one prompt that assumed OpenAI-specific formatting.

And fault injection in front of a proxy isn't the real provider misbehaving. Toxiproxy gives you 429s and delays, not the weird partial-stream failures you see in the wild. It's a model of the failure, not the failure. Better than nothing, not the whole story.

Semantic caching is on the roadmap for us, not load-bearing yet, so I'm not going to claim numbers I haven't measured.

Further Reading

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.