Putting an LLM Gateway in Front of Our Build Agents: Why We Picked Bifrost

#sre #devops #infrastructure #llm

TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The deciding factor wasn't features. It was the 11-microsecond overhead and the fact that it didn't fall over when one provider had a wobbly afternoon.

Right, so a few weeks back I got pulled into a project to wire LLM calls into some internal tooling we use for triaging flaky builds. Nothing fancy, mostly summarising failure logs and suggesting which test owner to ping. The catch was that this thing sits on the hot path of our build feedback loop, and our SRE on-call rotation was very clear: if your shiny AI feature adds latency to my builds, I will personally come and uninstall it.

Fair enough.

The problem with calling providers directly

First pass was the obvious one. SDK calls straight to Anthropic, with OpenAI as a fallback wrapped in a try/except. Worked fine in dev. Then we hit a real Tuesday afternoon where Anthropic had a regional hiccup, our fallback logic kicked in, and we discovered our "fallback" was actually just retrying the same broken endpoint because someone (me) had copy-pasted the client config.

Classic.
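
For illustration, here's roughly what a guard against that copy-paste mistake could look like. This is a sketch, not the code we shipped: Provider, check_distinct_endpoints and complete_with_fallback are made-up names, and the call field is a stand-in for whatever SDK call you'd really make.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    call: Callable[[str], str]  # hypothetical: takes a prompt, returns text


def check_distinct_endpoints(providers: Sequence[Provider]) -> None:
    """Fail fast if two providers point at the same base URL."""
    seen: Dict[str, str] = {}
    for p in providers:
        if p.base_url in seen:
            raise ValueError(
                f"{p.name} and {seen[p.base_url]} share endpoint {p.base_url}"
            )
        seen[p.base_url] = p.name


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> str:
    """Try each provider in order and return the first successful answer."""
    check_distinct_endpoints(providers)
    errors: List[str] = []
    for p in providers:
        try:
            return p.call(prompt)
        except Exception as exc:  # broad on purpose: any failure moves on
            errors.append(f"{p.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The endpoint check is the part that would have saved me: a fallback that shares a base URL with the primary gets rejected before the first request goes out.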

So we needed a proper gateway. The shortlist was Bifrost, LiteLLM, and Kong with an AI plugin. I'd used Kong before for regular API stuff so I was leaning that way out of habit, but I forced myself to actually test the three of them.

What we measured

I set up a quick bench on an m6i.large with a mock upstream so we weren't measuring provider latency. Ran 50k requests at modest concurrency. Here's roughly what we got.

Gateway	Overhead per request	Memory steady state	Setup time
Direct SDK	~0 µs	80 MB	10 min
Bifrost	~11 µs	95 MB	25 min
LiteLLM	~2.1 ms	180 MB	20 min
Kong + AI plugin	~1.4 ms	220 MB	90 min

The 11 microsecond number for Bifrost is what they claim on their repo and honestly I assumed it was marketing fluff until I saw it on our own bench. It's Go, runs as a single binary, and the gateway overhead genuinely disappears into the noise of the actual LLM call.
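
If you want to reproduce this kind of number, the idea is to time the same mock upstream twice, once directly and once through the gateway, and subtract. Below is a minimal sketch of that arithmetic. The function names are hypothetical, and using the median is my assumption here (it blunts outliers), not a description of the exact harness behind the table.

```python
import statistics
import time
from typing import Callable, List


def measure_ns(call: Callable[[], object], n: int) -> List[int]:
    """Time n sequential invocations of call, in nanoseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter_ns()
        call()
        samples.append(time.perf_counter_ns() - start)
    return samples


def overhead_us(direct_ns: List[int], gateway_ns: List[int]) -> float:
    """Median gateway latency minus median direct latency, in microseconds."""
    delta_ns = statistics.median(gateway_ns) - statistics.median(direct_ns)
    return delta_ns / 1000.0
```

Pass measure_ns one callable that hits the mock directly and one that goes via the gateway, then feed both sample lists to overhead_us.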

LiteLLM is Python and you can feel it. It's fine for a lot of use cases and the feature set is honestly massive, but on our hot path that extra couple of milliseconds per call added up across thousands of build steps.
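
To put "added up" in perspective, here's the back-of-envelope sum. The per-call figures come from the table above; the 5,000 call count is a hypothetical round number for illustration, not something we measured.

```python
def added_seconds(per_call_us: float, calls: int) -> float:
    """Total gateway overhead in seconds for a given number of calls."""
    return per_call_us * calls / 1_000_000


CALLS = 5_000  # hypothetical call count, not a measured figure

litellm_total = added_seconds(2_100, CALLS)  # 2.1 ms per call
bifrost_total = added_seconds(11, CALLS)     # 11 µs per call
print(f"LiteLLM: {litellm_total:.3f} s, Bifrost: {bifrost_total:.3f} s")
# prints "LiteLLM: 10.500 s, Bifrost: 0.055 s"
```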

Kong is Kong. Powerful, but it's a full API gateway with an AI plugin bolted on, not an LLM gateway. We didn't need the rest of Kong.

The config that actually mattered

The bit that sold me wasn't only the latency. It was weighted routing with proper failover. Here's a stripped-down version of what we landed on:

providers:
  anthropic_primary:
    type: anthropic
    api_key: ${ANTHROPIC_KEY}
    weight: 70
  openai_secondary:
    type: openai
    api_key: ${OPENAI_KEY}
    weight: 30

routing:
  build_triage:
    providers: [anthropic_primary, openai_secondary]
    failover: true
    timeout_ms: 8000

cache:
  semantic:
    enabled: true
    similarity_threshold: 0.92
    ttl_seconds: 3600
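
One way to read the weight and failover fields in that config: pick a primary in proportion to the weights, and keep the remaining providers as the failover order. The sketch below shows that reading in Python. It's my assumption about the semantics, not Bifrost's actual implementation, and failover_order is a made-up name.

```python
import random
from typing import Dict, List, Optional


def failover_order(
    weights: Dict[str, int], rng: Optional[random.Random] = None
) -> List[str]:
    """Pick a primary by weight, then list the rest by descending weight."""
    rng = rng or random.Random()
    names = list(weights)
    primary = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    rest = sorted((n for n in names if n != primary), key=lambda n: -weights[n])
    return [primary] + rest


# Weights copied from the config above.
order = failover_order({"anthropic_primary": 70, "openai_secondary": 30})
```

With a 70/30 split, roughly seven requests in ten try Anthropic first, and every request still has the other provider queued up behind it.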

That semantic cache block is doing a lot of work. Build failures rhyme. A flaky test that times out today probably timed out last week with a slightly different log signature, and the cache catches that fuzzy match instead of paying for another LLM call. We saw cache hit rates around 38% in the first fortnight, which translates directly into provider bill reduction.
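
If the fuzzy match sounds like magic, the core idea fits in a few lines: compare the embedding of the new prompt against stored ones and reuse an answer when the similarity clears the threshold. This is a toy sketch under my own assumptions (cosine similarity, a linear scan, a made-up SemanticCache class), not how Bifrost does it internally. Producing the embeddings is left out entirely.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(
        self,
        threshold: float = 0.92,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: List[Tuple[Tuple[float, ...], str, float]] = []

    def put(self, embedding: Sequence[float], answer: str) -> None:
        """Store an answer alongside its prompt embedding and insert time."""
        self._entries.append((tuple(embedding), answer, self.clock()))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar unexpired answer at or above the threshold."""
        now = self.clock()
        self._entries = [e for e in self._entries if now - e[2] < self.ttl]
        best, best_sim = None, self.threshold
        for vec, answer, _ in self._entries:
            sim = cosine(embedding, vec)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best
```

The threshold is the knob that matters: too low and different failures collapse into one cached answer, too high and the hit rate drops towards what an exact-match cache would give you.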

Virtual keys were the other thing that mattered for us. We could hand different teams their own virtual key with its own rate limit and budget, all pointing at the same upstream credentials. No more chasing engineers to rotate keys when someone's notebook leaked one to a gist.
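
Conceptually a virtual key is just a record the gateway checks before it spends the real credential. Here's a stripped-back sketch of the budget and revocation side of that idea. VirtualKey and its fields are hypothetical names for illustration, not Bifrost's data model, and rate limiting is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class VirtualKey:
    team: str
    budget_usd: float
    spent_usd: float = 0.0
    revoked: bool = False

    def charge(self, cost_usd: float) -> bool:
        """Record spend; refuse if the key is revoked or would exceed budget."""
        if self.revoked or self.spent_usd + cost_usd > self.budget_usd:
            return False
        self.spent_usd += cost_usd
        return True

    def revoke(self) -> None:
        """Kill this key without touching the shared upstream credential."""
        self.revoked = True
```

A leaked key gets revoked on its own, and the upstream Anthropic and OpenAI credentials never have to change.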

Failover that actually works

The thing I tested most paranoidly was the failover. I literally just killed the Anthropic endpoint at the network level mid-request, expecting some ugly behaviour. Bifrost retried against OpenAI inside the same request boundary, the caller got a response, and the metrics endpoint showed the failover counter tick. No drama.

Reckon this is the thing most people get wrong when they roll their own. Failover is easy to write and hard to test. Having it as a config flag means I can write a game day scenario where we knock providers offline and watch the gateway do its job, instead of hoping our wrapper code holds up.
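
A game day check along those lines can be sketched in plain Python. Everything here is a stand-in: FailoverRouter is a toy model of the gateway's behaviour, failover_total mimics a failover counter metric, and the two provider functions simulate a dead and a healthy upstream. None of it is Bifrost's API.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class FailoverRouter:
    def __init__(self, providers: Sequence[Tuple[str, Callable[[str], str]]]):
        self.providers: List[Tuple[str, Callable[[str], str]]] = list(providers)
        self.failover_total = 0  # stand-in for a failover counter metric

    def complete(self, prompt: str) -> Tuple[str, str]:
        """Return (provider_name, answer), moving on when a provider is down."""
        last_error: Optional[Exception] = None
        for index, (name, call) in enumerate(self.providers):
            try:
                answer = call(prompt)
            except ConnectionError as exc:
                last_error = exc
                continue
            if index > 0:
                self.failover_total += 1  # served by a non-primary provider
            return name, answer
        raise RuntimeError("all providers down") from last_error


def dead_provider(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def healthy_provider(prompt: str) -> str:
    return "summary: " + prompt


router = FailoverRouter(
    [("anthropic_primary", dead_provider), ("openai_secondary", healthy_provider)]
)
served_by, text = router.complete("test_login timed out")
```

The assertions for the scenario are the same two things I watched for by hand: the caller still gets an answer, and the counter moves by exactly one.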

Trade-offs and Limitations

Not all sunshine.

The dashboard is functional but it's no Grafana. We export Prometheus metrics out of it and build our own panels, which is what we wanted anyway, but if you're hoping for a polished UI out of the box you'll be doing some work.

The plugin ecosystem is smaller than LiteLLM's. If you need some niche provider or a very specific transformation, LiteLLM probably has it already and Bifrost might need you to write a small bit of Go. For our needs (Anthropic, OpenAI, one self-hosted model) this was a non-issue.

Go binary means your ops team needs to be cool with running a Go service. If you're an all-Python shop and your team is allergic to anything else, that's a real friction point even though the binary itself is genuinely fire-and-forget.

And semantic caching can bite you. If your prompts are doing something where a "similar" prompt actually needs a different answer (think anything with user-specific context smuggled in), you'll want to disable it for those routes. We learned this on the second day.

Where it sits now

It runs as a sidecar to our build orchestration service. Two replicas behind an internal load balancer, Prometheus scraping the metrics endpoint, and PagerDuty wired to the failover counter so we know when a provider is having a bad day before our users do. Total memory footprint across the cluster is a rounding error compared to the workloads it serves.

The on-call SRE has not, so far, come to uninstall it. I'll take the win.