Error budgets for an LLM dependency you don't control

#infrastructure #devops #sre #llm

TL;DR: We shipped a natural-language build-query feature at Buildkite, then tried to put a 99.9% SLO on it. Turns out you can't promise uptime for a model provider you don't run. We put Bifrost in front, failed over across three providers, and now the error budget tracks our gateway's behaviour instead of OpenAI's status page.

Here's the moment it clicked for me. We were drafting an SLO doc for a feature that lets people ask "why did this build fail" in plain English. Someone wrote "99.9% availability". Cool. That's about 43 minutes of allowed downtime in a 30-day month. Then OpenAI had a wobble for about 50 minutes one Tuesday and we blew the whole budget before lunch.
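
If you want to check the arithmetic, here's a quick sketch. It assumes a 30-day window and treats the whole 50 minutes as hard downtime, which is a simplification; the function names are mine, not from any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO target over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_consumed(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the budget one full outage eats (1.0 means all of it)."""
    return outage_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # 43.2
    print(round(budget_consumed(50, 0.999), 2))   # 1.16
```

So one 50-minute provider incident is roughly 116% of the month's budget, gone in a single morning.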

The problem wasn't our code. Our service was up the entire time. The dependency wasn't.

You can't SLO something you don't operate

A normal SLO assumes you control the thing you're measuring. Postgres, your own API, an internal queue. You can add replicas, you can tune it, you can page someone who can fix it.

A hosted LLM is none of that. When Anthropic returns a 529 or OpenAI starts handing out 429s under load, there is no lever on your side. You wait. Our p99 latency for the feature was around 2.1 seconds on a good day. During provider degradation it'd climb past 9 seconds, or requests would just fail outright.

So the question stopped being "how do I make the provider more reliable" and became "how do I make my dependency on any single provider less load-bearing." That's a routing problem, not a model problem.

Putting a gateway in the path

We run Bifrost as the single egress point for every LLM call now. It's an OpenAI-compatible gateway, so our service code didn't change much. The interesting part is the fallback config: if the primary provider errors or times out, the request gets retried against the next one without our app knowing.

{
  "providers": {
    "openai": { "keys": [{ "value": "env.OPENAI_KEY" }] },
    "anthropic": { "keys": [{ "value": "env.ANTHROPIC_KEY" }] },
    "bedrock": { "keys": [{ "value": "env.BEDROCK_KEY" }] }
  },
  "fallbacks": [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-5-haiku",
    "bedrock/anthropic.claude-3-haiku"
  ]
}

Three providers, ranked. When OpenAI throttles, the call lands on Anthropic. When both are sad, Bedrock catches it. The answers might get a bit worse, but the feature stays up. That's the whole point of an error budget. Stay inside the line.
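
To make the idea concrete, here's roughly what a ranked fallback chain does on your behalf, sketched in Python. This is an illustration of the pattern, not Bifrost's implementation: `ProviderError`, `call_with_fallback` and the two stub providers are all made-up names.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """A retryable provider failure (throttle, overload, timeout)."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in rank order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Remember why this provider failed, then move down the list.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def throttled_provider(prompt: str) -> str:
    raise ProviderError("429 rate limited")


def healthy_provider(prompt: str) -> str:
    return "summary for: " + prompt


if __name__ == "__main__":
    chain = [("openai", throttled_provider), ("anthropic", healthy_provider)]
    print(call_with_fallback("why did this build fail", chain)[0])  # anthropic
```

The caller only sees an exception when the whole chain is exhausted, which is exactly the event the error budget should be counting.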

It also does load balancing across multiple keys, which mattered more than I expected. Half our "outages" early on were just one API key hitting its rate limit while another sat idle.
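
For a feel of why this helps, here's a minimal sketch of key rotation with a cooldown. I'm not claiming this is how Bifrost balances keys internally; `KeyPool`, the 30-second cooldown and the injectable clock are assumptions for the example.

```python
import itertools
import time


class KeyPool:
    """Round-robin over API keys, skipping any key still cooling down after a 429."""

    def __init__(self, keys, cooldown_seconds=30.0, clock=time.monotonic):
        self._keys = list(keys)
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._blocked_until = {key: 0.0 for key in self._keys}
        self._cycle = itertools.cycle(self._keys)

    def next_key(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            if self._blocked_until[key] <= now:
                return key
        return None

    def mark_rate_limited(self, key):
        """Call this when a request made with `key` came back rate-limited."""
        self._blocked_until[key] = self._clock() + self._cooldown
```

With two keys, a throttled key gets skipped until its cooldown expires, so the idle key picks up the traffic instead of the request failing.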

The metrics that actually feed the SLO

The bit that sold me was native Prometheus output. Bifrost exposes metrics straight out of the box, so I'm not scraping a vendor status page or parsing logs to know if we're burning budget.

Our availability SLI is now "requests Bifrost successfully resolved, including via fallback" over total requests. A request that failed on OpenAI but succeeded on Anthropic counts as a win, because the user got an answer. That's the number that should drive the SLO, not per-provider success.
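
Here's the distinction as a small sketch. The `RequestRecord` shape is hypothetical (it's not a Bifrost type), it just captures which providers were tried and whether the user got an answer.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    attempts: list[str]  # providers tried, in order
    succeeded: bool      # did the user get an answer in the end?


def availability_sli(records: list[RequestRecord]) -> float:
    """Good requests over total; a success via fallback still counts as good."""
    if not records:
        return 1.0
    good = sum(1 for r in records if r.succeeded)
    return good / len(records)


def primary_only_sli(records: list[RequestRecord]) -> float:
    """What the number would look like if only first-attempt successes counted."""
    if not records:
        return 1.0
    good = sum(1 for r in records if r.succeeded and len(r.attempts) == 1)
    return good / len(records)


if __name__ == "__main__":
    records = [
        RequestRecord(["openai"], True),
        RequestRecord(["openai", "anthropic"], True),
        RequestRecord(["openai", "anthropic", "bedrock"], False),
        RequestRecord(["openai"], True),
    ]
    print(availability_sli(records))  # 0.75
    print(primary_only_sli(records))  # 0.5
```

The gap between the two numbers is the value the fallback chain is adding.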

# fast burn-rate alert over 1h: error ratio above 14.4x the 0.1% budget (99.9% SLO)
sum(rate(bifrost_requests_total{status="error"}[1h]))
/
sum(rate(bifrost_requests_total[1h]))
> (14.4 * 0.001)
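
If the 14.4 looks like a magic number: it matches the common fast-burn convention of spending 2% of a 30-day budget in one hour. A quick sketch of that derivation, assuming those two inputs (the helper names are mine):

```python
def burn_rate(budget_fraction: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo: float, rate: float) -> float:
    """Error ratio at which the alert fires for a given burn rate."""
    return rate * (1.0 - slo)


if __name__ == "__main__":
    print(round(burn_rate(0.02, 1), 1))                  # 14.4
    print(round(error_ratio_threshold(0.999, 14.4), 4))  # 0.0144
```

In other words, the alert fires when more than about 1.44% of requests over the last hour ended in an error.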

We went from one provider doing about 99.4% effective availability to the fallback chain sitting around 99.93% over the last 60 days. Same models, same budget, just not betting the feature on one company's afternoon.

How it stacks up

We looked at LiteLLM and Portkey before landing here. None of these is strictly best. Depends what you're optimising for.

Thing I cared about	Bifrost	LiteLLM	Portkey
Self-host, no vendor in path	Yes, single Go binary	Yes	Possible, but hosted is the main path
Native Prometheus metrics	Built in	Via callbacks/config	Dashboard-first, export is extra
Provider failover config	Declarative fallback list	Yes, router config	Yes, configs/strategies
Hosted analytics UI	Basic built-in UI	Minimal	Strongest of the three
Python ecosystem depth	Smaller	Largest, huge community	Good

Honestly, if you live in Python and want the biggest provider list and community, LiteLLM is hard to beat. If you want a polished hosted dashboard and guardrails without running anything, Portkey is the comfortable pick. We're an infra team that wants metrics in our own Prometheus and a binary we can run on our own boxes, so Bifrost fit our shape. No worries either way.

Trade-offs and limitations

Fallback hides failure, and that cuts both ways. If your alerting only watches the final success rate, you can be quietly running 80% of traffic on your third-choice provider for days and not notice the bill. We added a separate alert on per-provider fallback rate so degradation is visible, not just survivable.
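
The check itself is simple. Here's a sketch of the logic, assuming you can get a list of which provider ultimately served each request; the 20% threshold and function names are placeholders, not our production values.

```python
from collections import Counter


def provider_share(served_by: list[str]) -> dict[str, float]:
    """Fraction of requests ultimately served by each provider."""
    total = len(served_by)
    if total == 0:
        return {}
    counts = Counter(served_by)
    return {provider: n / total for provider, n in counts.items()}


def fallback_alerts(served_by: list[str], primary: str, max_fallback_share: float = 0.2) -> list[str]:
    """Non-primary providers carrying more traffic than the allowed share."""
    shares = provider_share(served_by)
    return sorted(
        provider
        for provider, share in shares.items()
        if provider != primary and share > max_fallback_share
    )


if __name__ == "__main__":
    served = ["openai"] * 10 + ["anthropic"] * 10 + ["bedrock"] * 80
    print(fallback_alerts(served, primary="openai"))  # ['bedrock']
```

The final success rate in that example could be 100% and you'd still want someone to look at it.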

Quality drift is real too. gpt-4o-mini and claude-3-5-haiku don't answer identically, so a build-failure summary can read differently mid-incident. For us that's acceptable. For anything doing structured extraction, you'd want to validate output shape per provider.
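
A minimal sketch of what that validation could look like, assuming the model is asked to return a JSON object. The field names in `REQUIRED_FIELDS` are invented for the example, not our actual schema.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"failed_step": str, "summary": str, "exit_code": int}


def validate_summary(raw: str) -> dict:
    """Parse a model response and check it has the expected fields and types.

    Raises ValueError on malformed JSON, a missing field, or a wrong type.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data
```

Run the same check on every provider's output, and treat a validation failure like any other provider error so the next one in the chain gets a go.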

And a gateway is one more thing to run. It's a low-risk component on its own, but it's in the hot path, so we run it with the same care as any other tier-1 service. If Bifrost is down, every LLM call is down with it. We game-day it like the rest of our stack.

Self-hosting also means semantic caching, governance, and the rest are your config problem, not a managed feature. Fine for us. Worth knowing.