Our PR-review bot kept hitting 429s. Bifrost key pooling fixed it.

#devops #sre #infrastructure #llm

TL;DR: Our internal PR-review bot was getting 429'd by Anthropic between 9am and 11am Sydney time. We dropped Bifrost in front, pooled four keys, and the 429 rate fell from 8.2% to 0.07% in a fortnight. The migration was one env var swap. The interesting bits were the bits we got wrong.

The problem

We've got a PR-review bot that pings Claude on every pull request opened against our internal monorepo. It pulls the diff and ships a structured prompt to Claude. Gets back a summary plus a couple of "have you considered..." nudges. Saves our reviewers maybe 10 minutes per PR, on a team of 80 engineers, all sharing one Anthropic workspace that someone provisioned back in early 2024 and nobody bothered to split.

You can guess what happened.

Mornings in Sydney are brutal. Everyone arrives, opens their PRs from the night before, and our bot fires off 30-40 concurrent requests. Anthropic's per-org rate limit got chewed through by 9:15 most days. Bot started failing. Slack filled up with "did the review bot die again?" messages. Not a great look for the platform team.

What we tried first

The naive fix was a job queue with backoff. Wrote it in an arvo. Buildkite job, Redis-backed, exponential retry with jitter. It worked, sort of. Reviews now took 4-7 minutes to come back instead of 8 seconds, and engineers started ignoring the bot entirely because by the time the review landed they'd already merged, which kind of defeats the whole point of having a review bot.

Queueing wasn't the answer. We needed more headroom, which meant more keys, which meant somebody had to manage them.

Why Bifrost

I'd been kicking the tyres on a few gateways for an unrelated project. Bifrost (https://github.com/maximhq/bifrost) won on two specific points: load balancing across multiple API keys for the same provider is a documented first-class feature, and the OpenAI-compatible endpoint meant we didn't have to touch the bot's SDK code. It already spoke openai.ChatCompletion against an internal proxy URL.

Setup took about 40 minutes including the time to argue with our SSO admin about a new GitHub OAuth app.

{
  "providers": {
    "anthropic": {
      "keys": [
        { "value": "env.ANTHROPIC_KEY_1", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_2", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_3", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_4", "weight": 1.0 }
      ],
      "network_config": {
        "default_request_timeout_in_seconds": 30
      }
    }
  }
}

Bot config was a one-liner. Pointed OPENAI_API_BASE at our Bifrost ECS service on port 8080 and the bot didn't know it'd been moved.

Results after two weeks

| Metric | Before (queue + 1 key) | After (Bifrost + 4 keys) |
|---|---|
| Median review latency | 4m 30s | 11s |
| p95 review latency | 7m 12s | 28s |
| 429 rate | 8.2% | 0.07% |
| Reviews abandoned (timed out) | 14% | 0.4% |
| "Is the bot dead" Slack pings | ~6/day | 0 |

Costs went up about 22% because more reviews actually completed. Worth it.

Bifrost vs LiteLLM vs Portkey

I evaluated all three properly. None is strictly better; they hit different sweet spots.

| Concern | Bifrost | LiteLLM | Portkey |
|---|---|
| Multi-key load balancing | Native | Via Router | Native |
| OpenAI-compatible endpoint | Yes | Yes | Yes |
| Self-host complexity | Single Go binary | Python + deps | SaaS-first |
| Built-in web UI for config | Yes | Limited | Cloud-side |
| Semantic caching | Yes | Yes | Yes |
| MCP gateway | Yes | No | No |
| Community size | Growing | Larger | Larger |

LiteLLM's community is bigger and the integrations list is wider. If you want Python ergonomics, it's the easier ride. Portkey's hosted UX is slicker out of the box, but we needed self-host for compliance reasons. Bifrost being a single Go binary suited our ECS deploy model and our preference for fewer Python services in the critical path.

Trade-offs and limitations

It's not all roses.

Failover is per-request, not per-key cooldown. If one of our four keys gets stuck in a rate-limit hole, Bifrost retries the call elsewhere but doesn't proactively quarantine the bad key for a window. We're managing that with manual weight tweaks for now.
The web UI is handy but state lives in config files. Make changes via the UI in dev and forget to commit the config, and you've got drift. We learned that one the hard way.
Single point of failure. Anything you put in front of every LLM call becomes load-bearing. We run two Bifrost replicas behind an ALB. Tiny team running one node and a restart policy might be fine, but think about it before you ship.
Observability glue. Prometheus metrics are emitted natively, which is great. You'll still need to wire them into your existing dashboards. Took us an afternoon.