claire nguyen

Posted on Jun 23

We made our LLM gateway a single point of failure. Then we tested it.

#infrastructure #devops #sre #llm

TL;DR: We put an LLM gateway in front of about 40 internal services to get failover and one billing view. Then a game day showed the gateway itself was now the thing that took everything down. Here's how we ran two Bifrost replicas, what broke, and where LiteLLM and Portkey were honestly better for us.

Right, so the irony. We added a gateway to stop one flaky provider from taking down our internal tooling at Buildkite. Anthropic 529s, OpenAI timeouts, the usual. The gateway gave us automatic fallbacks and a single place to see spend. Lovely.

What it also gave us was a brand new single point of failure that nothing on our side had been tested against.

How we got here

We run a fair bit of LLM-backed tooling internally. PR summarisers, a flaky-test classifier, log triage. Around 40 services, all of them eventually calling OpenAI or Bedrock.

Every team had its own keys, its own retry logic, its own idea of a timeout. No worries when one service breaks. Real problem when a provider has a bad hour and 40 services all melt down differently.

So we put Bifrost in the middle. One OpenAI-compatible endpoint, automatic fallbacks between providers, and Prometheus metrics we didn't have to build. Spend went to one dashboard. Good outcome.

The catch is obvious in hindsight. Forty services now had a hard dependency on one box.

The game day

We do game days. "Never had an outage" usually means you never tested your failure handling, and I'd rather find the gap on a Tuesday than at 3am.

First run, single replica. We killed the gateway pod. 38 of 40 services started failing within 4 seconds. The two that survived had their own local fallback to a cached response. Everyone else just ate connection-refused.

Lesson one: a gateway you run as one replica is worse than no gateway. You've concentrated the blast radius and added a hop.

So we moved to two replicas behind a Kubernetes Service, with proper probes. Bifrost is stateless for routing, which makes horizontal scaling boring in the good way. Config lives in config.json and env-referenced secrets, so both pods read the same provider setup.

# bifrost-deployment.yaml (trimmed)
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: bifrost
          image: maximhq/bifrost:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /metrics
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /metrics
              port: 8080
            periodSeconds: 10

Second run, killed one of two pods. Readiness flipped that pod out in about 8 seconds. p99 on the surviving pod jumped from 180ms to roughly 340ms while it carried full load, then settled once Kubernetes scheduled a replacement. No service-level errors. That's the result we wanted.

What still bit us

Killing a pod is the easy test. The nasty one is a slow gateway, not a dead one.

We injected 700ms of latency on one replica using a sidecar. Bifrost stayed "ready" because /metrics answered fine, but real requests crawled. Health checks that only prove the process is alive don't prove it's useful. We ended up adding our own synthetic probe that does a tiny /v1/chat/completions call against a cheap model every 15 seconds and alerts on latency, not just liveness.

The other gotcha was client timeouts. Several services had no timeout at all, so a degraded gateway meant threads piling up. We standardised on a 30s client timeout and let Bifrost's retries and fallbacks handle the provider-side mess.

How it compares

We tested three before committing. All three do the core job. They differ on what happens when you push them.

Concern	Bifrost	LiteLLM	Portkey
Self-hosted HA	Stateless, easy 2+ replicas	Stateless, works fine	Self-host is heavier, more moving parts
Failover config	Per-request fallbacks, native	Solid router fallbacks	Strong, config-driven
Native Prometheus	Yes, built in	Yes	Via their stack
Latency overhead	Low, Go-based	Higher under load in our test	Low, but managed-first
Managed dashboard	Behind enterprise	Lighter	This is their strength

Honest read: if you want a polished managed dashboard with the least ops work, Portkey is genuinely ahead. If you're already deep in Python and want the widest community model coverage, LiteLLM's router is hard to beat and it's been battle-tested by a lot of people. We picked Bifrost because the Go binary held p99 better under our load test and the self-hosted clustering story was the least fiddly for our EKS setup.

Trade-offs and limitations

The gateway is still a dependency. Two replicas reduce risk, they don't delete it. If your config is wrong, both pods are wrong together.

Semantic caching saved us real money on repeated PR-summary calls, but it can serve a stale answer when prompts are near-identical but context differs. Worth tuning the similarity threshold rather than trusting defaults.

Adding the hop costs you something. Our measured median overhead was single-digit milliseconds, which is nothing next to provider latency, but it's not zero and you should measure it on your own traffic, not trust mine.

And clustering with adaptive load balancing sits in enterprise. The open-source path scales horizontally fine for our size, but know where the line is before you plan.

DEV Community