Surviving an AZ Failover for Our Build Runner Fleet at 3am

#devops #sre #infrastructure #llm

TL;DR: We lost an AWS AZ for 47 minutes back in March. Our build runner fleet on EKS mostly survived, but the AI-assisted code review bots wedged because their LLM calls all went out through the NAT gateway in the degraded AZ. Sticking Bifrost in front of those calls fixed the second problem. Here's what we changed.

It was 3:12am Sydney time when PagerDuty went off. ap-southeast-2a was having a wobble. Not a full outage — just enough packet loss that EKS nodes started flapping in and out of the cluster.

Our build runner fleet handled it fine. We've drilled this. Pod disruption budgets, multi-AZ node groups, the usual stuff. Builds rescheduled to 2b and 2c within about 90 seconds. No worries.

The bit that didn't handle it fine was the AI review bot we'd shipped six weeks earlier. That thing called Anthropic's API directly from inside the build container. When the AZ flapped, the egress NAT in 2a started dropping outbound TLS. The bot retried, hit our 30-second build timeout, and 4,200 builds went red over half an hour.

I want to talk about what we did the morning after, because the fix wasn't "make the bot more resilient." It was "stop pretending the LLM call is special."

The actual failure mode

Here's the rough shape of what was happening. Our review bot was a Go service running as a sidecar in the build pod. Pseudo-config looked like this:

review_bot:
  provider: anthropic
  api_key: ${ANTHROPIC_KEY}
  model: claude-sonnet-4-6
  timeout_ms: 25000   # per attempt
  max_retries: 2      # so up to 3 attempts in total

Two retries, 25-second timeout each. Sounds reasonable. Except when the underlying network is dropping packets, you don't fail fast. You sit there waiting for TCP to give up. One attempt plus two retries is up to 75 seconds of nothing, and our 30-second build timeout kicked in long before that. Build failed.

Worse, every single review bot in every single build was hitting the same NAT gateway in the same degraded AZ. We'd accidentally built a single point of failure into something we'd designed as a sidecar.

What we changed

I'd been kicking the tyres on Bifrost (https://github.com/maximhq/bifrost) for a few weeks already because I wanted central observability on LLM spend across our internal tools. The AZ incident pushed it to the top of the queue.

The plan was simple: stop letting build pods talk to providers directly. Run Bifrost as a deployment in our shared platform namespace, spread across all three AZs, and point the review bot at it. The bot's config went from api.anthropic.com to an internal service URL.

Bifrost's drop-in replacement (https://docs.getbifrost.ai/features/drop-in-replacement) meant we didn't touch the bot's code. Just the env var.
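
For anyone whose bot doesn't already read its endpoint from the environment, a rough Go sketch of that pattern follows. The variable name LLM_BASE_URL and the function resolveBaseURL are made up for illustration; the only real endpoint here is Anthropic's public API host, used as the default.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// defaultBaseURL is used when no override is set, i.e. direct-to-provider.
const defaultBaseURL = "https://api.anthropic.com"

// resolveBaseURL returns the base URL for LLM calls, preferring the override
// in LLM_BASE_URL (a hypothetical variable name) and validating it up front
// so a typo fails at startup rather than on the first review.
func resolveBaseURL(getenv func(string) string) (*url.URL, error) {
	raw := getenv("LLM_BASE_URL")
	if raw == "" {
		raw = defaultBaseURL
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid LLM_BASE_URL %q: %w", raw, err)
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, fmt.Errorf("invalid LLM_BASE_URL %q: need http(s)://host", raw)
	}
	return u, nil
}

func main() {
	u, err := resolveBaseURL(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("sending LLM calls to", u)
}
```

Switching to the gateway is then a pod spec change, something like LLM_BASE_URL=http://bifrost.platform.svc.cluster.local:8080, where that hostname and port are placeholders for whatever your internal service is called.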

Then we configured fallbacks (https://docs.getbifrost.ai/features/retries-and-fallbacks) so a failed Anthropic call rolls over to AWS Bedrock's Claude. Same model family, different network path, different auth, different everything.

{
  "model": "anthropic/claude-sonnet-4-6",
  "fallbacks": [
    "bedrock/anthropic.claude-sonnet-4-6",
    "openai/gpt-4o-mini"
  ]
}

The gpt-4o-mini entry at the bottom is a deliberate downgrade. If both Claude paths are stuffed, we'd rather give the dev a worse review than no review and a red build.

What it looks like vs the alternatives

I evaluated three things properly. Here's the honest comparison from my notes:

Concern	LiteLLM	Portkey	Bifrost
Self-hosted Go binary	No (Python)	Partial	Yes
Provider failover config	Yes	Yes	Yes
Built-in web UI for config	Limited	Yes (cloud)	Yes (local)
Semantic caching	Plugin	Yes	Yes
Memory footprint on our nodes	~400MB	N/A (SaaS-first)	~180MB
MCP gateway	No	No	Yes (enterprise)

LiteLLM is genuinely good and we run it for one of our data science notebooks because the Python ergonomics are nice. Portkey has the slickest dashboard if you're happy with their cloud. Bifrost won here because we wanted a Go binary we could run on our own infra, and the resource overhead per pod mattered when we're scheduling hundreds of build pods.

The boring infra bit

We deployed Bifrost as three replicas, one per AZ, behind a ClusterIP service. Topology spread constraints to keep them honest. Each pod has its own provider key set via Kubernetes secrets, referenced through Bifrost's env var support (https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references).

Our Prometheus scrape config picks up Bifrost's native metrics endpoint. We graph p99 latency per provider and alert on fallback rate above 5% for more than 10 minutes. That alert would have fired during the March incident and given us a much better signal than "builds are timing out."
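
The alert condition is easy to get subtly wrong, so here is its logic spelled out as a small Go sketch. In practice this lives in a Prometheus alerting rule, not in Go; the 5% threshold and 10-minute hold come from the paragraph above, while the per-minute sample shape and the function name are assumptions, and no real metric names are implied.

```go
package main

import "fmt"

// minuteSample is one minute of gateway traffic. The field names are
// illustrative; real values would come from the gateway's metrics.
type minuteSample struct {
	total     int // all LLM requests in that minute
	fallbacks int // requests answered by a fallback provider
}

// sustainedFallback reports whether the fallback rate was strictly above
// threshold for at least `minutes` consecutive samples. A minute with no
// traffic counts as healthy and resets the streak.
func sustainedFallback(samples []minuteSample, threshold float64, minutes int) bool {
	streak := 0
	for _, s := range samples {
		if s.total > 0 && float64(s.fallbacks)/float64(s.total) > threshold {
			streak++
			if streak >= minutes {
				return true
			}
		} else {
			streak = 0
		}
	}
	return false
}

func main() {
	// Twelve minutes at a 40% fallback rate (an invented figure).
	var incident []minuteSample
	for i := 0; i < 12; i++ {
		incident = append(incident, minuteSample{total: 100, fallbacks: 40})
	}
	fmt.Println(sustainedFallback(incident, 0.05, 10))
}
```

The streak reset is the point of the "for more than 10 minutes" part: a single healthy minute in the middle means a short blip doesn't page anyone, while a sustained degradation does.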

Trade-offs and limitations

This isn't a free win. A few things to flag.

The gateway is now a new hop in the request path. We measured about 8-12ms added per call. For our use case that's noise. For real-time inference it might not be.

Bifrost's clustering features are an enterprise thing. We're running it as independent replicas behind a service, which works because our config is mostly static. If you need shared state across replicas (live config sync, shared rate limit counters), you'll either pay for enterprise or accept some eventual consistency.

Semantic caching sounds great but we haven't turned it on for the review bot because code reviews are too context-specific. Cache hit rate would be near zero. Worth knowing before you assume it'll save you money.

And the obvious one: a gateway pod failing is now a thing that can break LLM calls. Spread your replicas, set sensible PDBs, don't be silly.