claire nguyen

Posted on Jun 18

A provider latency spike stalled our whole build queue

#devops #sre #infrastructure #llm

TL;DR: A provider slowdown turned a 2-second LLM call into a 70-second hang. Because our build agents block on that call, the queue backed up to roughly 400 jobs in twelve minutes. We put Bifrost in front of the provider with hard timeouts and a fallback model, and the queue stopped caring whether any single provider was healthy.

The bit nobody designs for

I work on compute and build orchestration at Buildkite. One of our internal services calls an LLM to triage flaky test output, group it, and suggest a likely owner. Small thing. Saves engineers a fair bit of digging.

The catch is that a build agent waits on that call before it releases its slot. So the latency of one HTTP request to a model provider quietly became part of our queue's throughput math. Nobody wrote it down that way, but that's what it was.

On a Tuesday in May the provider's p99 went sideways. Not an outage. Just slow. Our default client timeout was 60 seconds, our retry was three attempts with backoff, and suddenly a call that normally took 2 seconds was eating 70 before giving up. Agents held their slots the whole time. Within twelve minutes the run queue sat at about 400 jobs that had nothing to do with the LLM at all.

Classic head-of-line blocking. One slow dependency, a whole fleet stuck behind it.
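
To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The 60-second timeout and three attempts come from the incident described above. The backoff delays, agent count, and build duration are placeholders chosen for illustration, and the first function gives an upper bound on blocking time, not a prediction of the 70 seconds actually observed.

```python
def worst_case_hold_seconds(timeout_s, attempts, backoffs_s):
    """Upper bound on how long one logical call can block its caller.

    Assumes every attempt runs to the full timeout and that the client
    sleeps for backoffs_s[i] between attempt i and attempt i + 1.
    """
    if len(backoffs_s) != attempts - 1:
        raise ValueError("need exactly one backoff delay between each pair of attempts")
    return attempts * timeout_s + sum(backoffs_s)


def jobs_per_minute(agents, build_s, call_s):
    """Fleet throughput when each agent holds its slot for build_s + call_s."""
    return agents * 60.0 / (build_s + call_s)


# Old client policy: 60 s timeout, 3 attempts, assumed backoffs of 1 s and 2 s.
old_bound = worst_case_hold_seconds(60, 3, [1, 2])   # 183 seconds

# One 8 s attempt on the primary plus one 8 s attempt on a fallback.
new_bound = worst_case_hold_seconds(8, 2, [0])       # 16 seconds

# Hypothetical fleet: 50 agents, 120 s of real build work per job.
healthy = jobs_per_minute(50, 120, 2)    # call takes about 2 s
degraded = jobs_per_minute(50, 120, 70)  # call takes about 70 s
```

The useful part is the shape: the blocking call's latency sits in the denominator of throughput, so its worst case matters more than its average.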

What we changed

We'd been calling the provider SDK directly from the service. The reliability logic lived in our own code, which meant our timeout values, our retry counts, and our fallback policy were all bespoke and slightly wrong.

We moved the call behind Bifrost, a self-hosted Go gateway that speaks an OpenAI-compatible API. The point wasn't to add a hop. It was to move the failure handling out of our app and into config we could reason about during an incident.

Three things mattered for us.

First, fallbacks. If the primary model is slow or erroring, route to a different model or provider instead of retrying the sad one into the ground. Second, semantic caching, because flaky test output repeats far more than you'd reckon, and a cache hit is a call that can't hang. Third, native Prometheus metrics, so the LLM path showed up on the same SLO dashboards as everything else we run.

Here's the gist of the config:

{
  "providers": {
    "anthropic": {
      "keys": [{ "value": "env.ANTHROPIC_KEY", "weight": 1.0 }],
      "network_config": { "default_request_timeout_in_seconds": 8 }
    },
    "openai": {
      "keys": [{ "value": "env.OPENAI_KEY", "weight": 1.0 }],
      "network_config": { "default_request_timeout_in_seconds": 8 }
    }
  },
  "fallbacks": [
    { "model": "anthropic/claude-haiku-4-5", "weight": 1.0 },
    { "model": "openai/gpt-4o-mini", "weight": 1.0 }
  ]
}

The 8-second timeout is the real fix. Our triage call has no business taking longer than that, and if it does, we'd rather get a degraded answer from the fallback than hold a build slot hostage. The gateway fails over instead of stacking retries against a provider that's already struggling.
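
The policy itself is small enough to sketch. The Python below is an illustration of "one bounded attempt per provider, then move on", not Bifrost's implementation. The `Provider` callables, the `AllProvidersFailed` exception, and the function name are all hypothetical; only the 8-second default comes from the config above.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is any callable taking (prompt, timeout_s) that returns text
# or raises, for example TimeoutError once its deadline has passed.
Provider = Callable[[str, float], str]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain errored or timed out."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Provider]],
    prompt: str,
    timeout_s: float = 8.0,
) -> Tuple[str, str]:
    """Try each provider once, in order, and return (provider_name, answer).

    There are no retries against a provider that just failed, so the
    worst case is len(providers) * timeout_s rather than attempts * timeout.
    """
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(prompt, timeout_s)
        except Exception as exc:  # timeouts and provider errors both fail over
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))
```

A caller that gets `AllProvidersFailed` can skip triage and release the build slot anyway, which keeps the bound on slot-hold time intact even when every provider is down.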

Did it work?

The next time the same provider got slow (it happened again in June, naturally), the gateway gave up on the primary after 8 seconds and failed over to the fallback model. Triage quality dropped a touch for those minutes. The queue never noticed. Peak backlog during that window was 11 jobs, not 400.

The caching surprised me more than the failover. On a normal day we're seeing roughly 30% cache hits on triage prompts, because the same flaky test produces near-identical output across re-runs. Thirty percent fewer calls is thirty percent fewer chances to hang.

How it stacks up

We looked at LiteLLM and Portkey before landing here. All three do the core gateway job. The differences are real, so here's the honest version.

Thing I cared about	Bifrost	LiteLLM	Portkey
Self-hosted, no SaaS dependency	Yes, single Go binary	Yes, Python proxy	Yes, but SaaS is the main path
Fallback + load balancing config	Built in	Built in	Built in
Semantic caching	Built in	Built in	Built in
Native Prometheus metrics	Yes	Via add-ons	Via their dashboard
Provider breadth	23+	Widest of the three	Broad

LiteLLM has the widest provider list and the biggest community, so if you're calling something obscure it's a safe bet. If your stack is already Python, its proxy drops in with less friction. Portkey's managed offering and guardrail tooling are more polished out of the box, and for a team that doesn't want to run another service that matters.

We picked Bifrost because it's one static binary written in Go, the config above is the whole story, and I didn't want a Python runtime sitting in a latency-sensitive path on our build hosts. That's a preference, not a verdict.

Trade-offs and limitations

You're adding a network hop. For us it's a sidecar on the same host, so it's sub-millisecond, but it's not zero and you should measure it rather than trust me.
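
If you want to measure it, a small timing harness is enough to start with. This is a generic Python sketch using only the standard library; `call` is a placeholder for whatever zero-argument function issues one request, for example the same request sent directly to the provider and then through the gateway.

```python
import statistics
import time


def measure_latency_ms(call, samples=200):
    """Time `call` repeatedly and summarise the latencies in milliseconds."""
    if samples < 2:
        raise ValueError("need at least two samples to compute percentiles")
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "p50": statistics.median(durations),
        "p99": cuts[98],
        "max": max(durations),
    }
```

Run it against both paths under the same conditions and compare the p99 values rather than the averages, since the tail is what holds a build slot.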

The fallback model gives worse triage answers. We decided a rough answer that arrives beats a good one that never does, but that's a call you make per use case, not a universal truth.

Semantic caching can serve the wrong answer when two failures look similar but aren't: the cached triage for one gets returned for the other. We tuned the similarity threshold conservatively and accept the lower hit rate that comes with it.
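
The threshold trade-off is easier to see in a toy. Real semantic caches compare embedding vectors; the Python sketch below swaps in plain character similarity from the standard library's `difflib` so it runs without a model. The `SimilarityCache` class, the sample log lines, and both threshold values are made up for illustration.

```python
from difflib import SequenceMatcher


class SimilarityCache:
    """Toy cache that returns a stored answer for a sufficiently similar prompt."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (prompt, answer) pairs

    def put(self, prompt, answer):
        self._entries.append((prompt, answer))

    def get(self, prompt):
        """Return the answer of the most similar cached prompt, or None on a miss."""
        best_score, best_answer = 0.0, None
        for cached_prompt, answer in self._entries:
            score = SequenceMatcher(None, prompt, cached_prompt).ratio()
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


first = "FAIL test_checkout_retry: timeout after 30s on worker 12"
rerun = "FAIL test_checkout_retry: timeout after 30s on worker 47"

loose = SimilarityCache(threshold=0.95)
loose.put(first, "owner: payments")
hit = loose.get(rerun)    # similar enough at 0.95, served from cache

strict = SimilarityCache(threshold=0.99)
strict.put(first, "owner: payments")
miss = strict.get(rerun)  # not similar enough at 0.99, goes upstream
```

Whether the worker number is noise or the actual cause of the failure is exactly the judgment the threshold encodes. A looser setting saves more calls and risks more wrong answers.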

And a gateway is now a dependency too. We run two replicas behind a local load balancer and treat it like any other tier-1 service. If you deploy a single instance, you've moved your single point of failure, not removed it.

The broader lesson has nothing to do with LLMs. Any blocking call to a dependency you don't control belongs behind a timeout you do control. We just hadn't noticed an LLM had become one.

DEV Community