DEV Community: claire nguyen

Error budgets for an LLM dependency you don't control

claire nguyen — Mon, 01 Jun 2026 13:22:28 +0000

TL;DR: We shipped a natural-language build-query feature at Buildkite, then tried to put a 99.9% SLO on it. Turns out you can't promise uptime for a model provider you don't run. We put Bifrost in front, failed over across three providers, and now the error budget tracks our gateway's behaviour instead of OpenAI's status page.

Here's the moment it clicked for me. We were drafting an SLO doc for a feature that lets people ask "why did this build fail" in plain English. Someone wrote "99.9% availability". Cool. That's 43 minutes of allowed downtime a month. Then OpenAI had a wobble for about 50 minutes one Tuesday and we blew the whole budget before lunch.

The problem wasn't our code. Our service was up the entire time. The dependency wasn't.
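The 43-minute figure is plain arithmetic, and it's worth having on hand when someone proposes another nine. A minimal sketch, assuming a 30-day window (the post only says "a month"); the function names are made up for illustration:

```python
# Error budget for an availability SLO over a rolling window.
# Assumption: a 30-day window stands in for "a month".

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def budget_consumed(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget used up by one full outage."""
    return outage_minutes / allowed_downtime_minutes(slo, window_days)


if __name__ == "__main__":
    # Budget at 99.9%, then the share a 50-minute provider wobble eats.
    print(round(allowed_downtime_minutes(0.999), 1))
    print(round(budget_consumed(50, 0.999), 2))
```

Anything above 1.0 from `budget_consumed` means a single incident spent more than the whole month's allowance, which is what the 50-minute wobble did.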

You can't SLO something you don't operate

A normal SLO assumes you control the thing you're measuring. Postgres, your own API, an internal queue. You can add replicas, you can tune it, you can page someone who can fix it.

A hosted LLM is none of that. When Anthropic returns a 529 or OpenAI starts handing out 429s under load, there is no lever on your side. You wait. Our p99 for the feature was around 2.1 seconds on a good day, and during provider degradation it'd climb past 9 seconds or just fail outright.

So the question stopped being "how do I make the provider more reliable" and became "how do I make my dependency on any single provider less load-bearing." That's a routing problem, not a model problem.

Putting a gateway in the path

We run Bifrost as the single egress point for every LLM call now. It's an OpenAI-compatible gateway, so our service code didn't change much. The interesting part is the fallback config: if the primary provider errors or times out, the request gets retried against the next one without our app knowing.

{
  "providers": {
    "openai": { "keys": [{ "value": "env.OPENAI_KEY" }] },
    "anthropic": { "keys": [{ "value": "env.ANTHROPIC_KEY" }] },
    "bedrock": { "keys": [{ "value": "env.BEDROCK_KEY" }] }
  },
  "fallbacks": [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-5-haiku",
    "bedrock/anthropic.claude-3-haiku"
  ]
}

Three providers, ranked. When OpenAI throttles, the call lands on Anthropic. When both are sad, Bedrock catches it. The feature may lose some quality mid-incident, but it stays up. That's the whole point of an error budget: stay inside the line.
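If it helps to see the idea outside of config, a ranked fallback boils down to a loop. This is a rough sketch of the concept only, not how Bifrost implements it; `ProviderError`, `call_with_fallbacks` and `fake_send` are hypothetical names invented for the example:

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error for a provider call that failed or timed out."""


def call_with_fallbacks(models: Sequence[str],
                        send: Callable[[str, str], str],
                        prompt: str) -> Tuple[str, str]:
    """Try each model in ranked order and return (model_used, answer).

    `send(model, prompt)` is a stand-in for the real provider call.
    """
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, send(model, prompt)
        except ProviderError as exc:
            last_error = exc  # remember why this hop failed, then move on
    raise ProviderError(f"all providers failed: {last_error}")


def fake_send(model: str, prompt: str) -> str:
    """Pretend the primary is throttling so the fallback path gets exercised."""
    if model.startswith("openai/"):
        raise ProviderError("429 from primary")
    return f"answer from {model}"


if __name__ == "__main__":
    chain = ["openai/gpt-4o-mini", "anthropic/claude-3-5-haiku",
             "bedrock/anthropic.claude-3-haiku"]
    print(call_with_fallbacks(chain, fake_send, "why did this build fail"))
```

The caller only sees an exception when every hop has failed, which is the event the availability SLI should be counting.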

It also does load balancing across multiple keys, which mattered more than I expected. Half our "outages" early on were just one API key hitting its rate limit while another sat idle.

The metrics that actually feed the SLO

The bit that sold me was native Prometheus output. Bifrost exposes metrics straight out of the box, so I'm not scraping a vendor status page or parsing logs to know if we're burning budget.

Our availability SLI is now "requests Bifrost successfully resolved, including via fallback" over total requests. A request that failed on OpenAI but succeeded on Anthropic counts as a win, because the user got an answer. That's the number that should drive the SLO, not per-provider success.

# fast burn-rate over 1h: are we eating budget faster than allowed?
sum(rate(bifrost_requests_total{status="error"}[1h]))
/
sum(rate(bifrost_requests_total[1h]))
> (14.4 * 0.001)

We went from one provider doing about 99.4% effective availability to the fallback chain sitting around 99.93% over the last 60 days. Same models, same budget, just not betting the feature on one company's afternoon.
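For anyone wondering where the 14.4 in the burn-rate query comes from: I'm assuming it follows the common multi-window alerting convention of paging when 2% of a 30-day budget disappears within one hour. A small sketch of that arithmetic, with hypothetical function names:

```python
def burn_rate(error_count: float, total_count: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    Errors here means requests that failed even after every fallback.
    """
    if total_count == 0:
        return 0.0
    error_ratio = error_count / total_count
    return error_ratio / (1.0 - slo)


def fast_burn_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


if __name__ == "__main__":
    # 2% of a 30-day budget in one hour gives the multiplier used in the query.
    print(round(fast_burn_threshold(0.02, 1), 1))
    # 2 final failures out of 100 requests against a 99.9% SLO.
    print(round(burn_rate(2, 100, 0.999), 1))
```

Comparing the error ratio against `14.4 * 0.001` in PromQL is the same check as comparing `burn_rate(...)` against 14.4 here.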

How it stacks up

We looked at LiteLLM and Portkey before landing here. None of these is strictly best. Depends what you're optimising for.

Thing I cared about	Bifrost	LiteLLM	Portkey
Self-host, no vendor in path	Yes, single Go binary	Yes	Possible, but hosted is the main path
Native Prometheus metrics	Built in	Via callbacks/config	Dashboard-first, export is extra
Provider failover config	Declarative fallback list	Yes, router config	Yes, configs/strategies
Hosted analytics UI	Basic built-in UI	Minimal	Strongest of the three
Python ecosystem depth	Smaller	Largest, huge community	Good

Honestly, if you live in Python and want the biggest provider list and community, LiteLLM is hard to beat. If you want a polished hosted dashboard and guardrails without running anything, Portkey is the comfortable pick. We're an infra team that wants metrics in our own Prometheus and a binary we can run on our own boxes, so Bifrost fit our shape. No worries either way.

Trade-offs and Limitations

Fallback hides failure, and that cuts both ways. If your alerting only watches the final success rate, you can be quietly running 80% of traffic on your third-choice provider for days and not notice the bill. We added a separate alert on per-provider fallback rate so degradation is visible, not just survivable.

Quality drift is real too. gpt-4o-mini and claude-3-5-haiku don't answer identically, so a build-failure summary can read differently mid-incident. For us that's acceptable. For anything doing structured extraction, you'd want to validate output shape per provider.

And a gateway is one more thing to run. It's a low-risk component, but it's in the hot path, so we run it with the same care as any other tier-1 service. If Bifrost is down, everything's down. We game-day it like the rest of our stack.

Self-hosting also means semantic caching, governance, and the rest are your config problem, not a managed feature. Fine for us. Worth knowing.

The Prometheus label that blew our monitoring bill out 6x

claire nguyen — Fri, 29 May 2026 04:21:15 +0000

TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and the backend charges by active series. Here's how we caught it and the label rules we run now so it doesn't happen again.

The bill, not the traffic

I'm on the infra team at Buildkite. We run a fairly chunky Prometheus setup feeding a managed backend, and one Monday the monthly estimate had quietly gone from about $1,800 to a touch over $11k. Nobody shipped more traffic. Build volume was the same 40k-ish builds a day it'd been for weeks.

So it wasn't load. It was series count. Active series had climbed from roughly 1.2 million to nearly 9 million, and the backend prices on active series, not on request volume. That's the trap most people miss the first time.

What cardinality actually is

Think of every unique combination of metric name plus label values as its own drawer in a filing cabinet. http_requests_total{status="200"} is one drawer. Add region="ap-southeast-2" and now you've got a drawer per region. Add a label whose values are unbounded and you've got a cabinet the size of a warehouse.

Cardinality is the count of those drawers. Each one is a separate time series that has to be stored and indexed. Low-cardinality labels (status, region) are fine. High-cardinality ones are where the money leaks.

The one label that did it

A teammate had added build_id to a counter so they could debug a flaky deploy. Fair enough in the moment. Problem is every build has a unique ID, we do ~40k a day, and those IDs hang around for the full retention window.

40k unique values a day, multiplied across a handful of other labels, multiplied across retention. That's your several-million-series jump right there. One label.
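A quick way to sanity-check a label before it ships is to multiply the distinct values. The sketch below uses the rough counts from this post (about 5 statuses, 6 regions, 40k build IDs a day); it gives a worst-case ceiling for one metric, since not every combination actually occurs, and the function name is made up:

```python
from math import prod
from typing import Dict


def series_upper_bound(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    bounded = {"status": 5, "region": 6}
    with_build_id = {**bounded, "build_id": 40_000}  # one day of builds
    print(series_upper_bound(bounded))
    print(series_upper_bound(with_build_id))
```

The bounded labels stay in the tens. One day of `build_id` values pushes the ceiling for that single metric past a million, before retention is even considered.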

Catching it

The fastest way to find the offender is to ask Prometheus which metric has the most series:

topk(10, count by (__name__)({__name__=~".+"}))

Then drill into the worst metric and see which label is doing the damage:

count(count by (build_id)(deploy_attempts_total))

When that second query came back with a number in the tens of thousands, we had our culprit.

The fix

You drop the label before it ever hits storage. metric_relabel_configs runs at scrape time, so you can strip a label without touching the app code:

scrape_configs:
  - job_name: "build-agents"
    metric_relabel_configs:
      - regex: "build_id"
        action: labeldrop

Per-build detail didn't vanish; we moved it to where unbounded identifiers belong: traces and logs. If you genuinely need a metric sliced per build, use exemplars so the high-cardinality bit lives in the trace store, not the series index.

Here's how we now reason about labels before adding one:

Label	Unique values	Safe to add?
status	~5	Yes
region	~6	Yes
instance_type	~15	Yes
agent_queue	~200	Usually fine
build_id	~40k/day	No, use a trace
user_email	unbounded	No, never

Rule of thumb we reckon on: if you can't name the upper bound of a label's values on a whiteboard, it doesn't go on a metric.

Same trap, different service

This isn't only a Prometheus-the-app thing. Any service that emits Prometheus metrics can sink you the same way. We run a small internal feature that summarises failed build logs through an LLM, and those calls go through Bifrost, an open-source AI gateway that ships native Prometheus metrics out of the box. Handy. But the instinct to tag those metrics with a per-request ID or per-virtual-key label is exactly the same footgun.

We keep its labels down to provider and model. That gives us cost-per-provider and latency-per-model without minting a new series for every call. The discipline travels with the metric, not the tool.

Trade-offs and Limitations

Dropping build_id means you can't slice a single build inside Prometheus anymore. For ad-hoc "what did build 84213 do" questions, you're now in the trace or log tooling, which is a context switch some folks grumbled about for a week.

Recording rules, the other common fix, aren't free either. They add evaluation load on the Prometheus side, and if you write a sloppy one you can quietly recreate the cardinality you were trying to kill. Test the output series count before you ship the rule.

Exemplars need backend support and a tracing system wired up. If you haven't got distributed tracing yet, that path's a bigger project than a one-line labeldrop. Be honest about where you are.

And labeldrop is a blunt instrument. Once it's gone at scrape, it's gone. If you later decide you wanted that dimension bounded rather than dropped, you're re-instrumenting.

Our PR-review bot kept hitting 429s. Bifrost key pooling fixed it.

claire nguyen — Thu, 28 May 2026 13:22:11 +0000

TL;DR: Our internal PR-review bot was getting 429'd by Anthropic between 9am and 11am Sydney time. We dropped Bifrost in front, pooled four keys, and the 429 rate fell from 8.2% to 0.07% in a fortnight. The migration was one env var swap. The interesting bits were the bits we got wrong.

The problem

We've got a PR-review bot that pings Claude on every pull request opened against our internal monorepo. It pulls the diff, ships a structured prompt to Claude, and gets back a summary plus a couple of "have you considered..." nudges. Saves our reviewers maybe 10 minutes per PR. We're a team of 80 engineers, all sharing one Anthropic workspace that someone provisioned back in early 2024 and nobody bothered to split.

You can guess what happened.

Mornings in Sydney are brutal. Everyone arrives, opens their PRs from the night before, and our bot fires off 30-40 concurrent requests. Anthropic's per-org rate limit got chewed through by 9:15 most days. Bot started failing. Slack filled up with "did the review bot die again?" messages. Not a great look for the platform team.

What we tried first

The naive fix was a job queue with backoff. Wrote it in an arvo. Buildkite job, Redis-backed, exponential retry with jitter. It worked, sort of. Reviews now took 4-7 minutes to come back instead of 8 seconds, and engineers started ignoring the bot entirely because by the time the review landed they'd already merged, which kind of defeats the whole point of having a review bot.
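For reference, "exponential retry with jitter" usually looks something like the sketch below. It's a generic full-jitter schedule, not the code from that afternoon; the base, cap and function name are assumptions:

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base_seconds: float = 1.0,
                   cap_seconds: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter exponential backoff: one random delay per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    print([round(d, 1) for d in backoff_delays(8, rng=random.Random(0))])
```

The ceilings add up quickly once the cap is reached, which is how a retry queue turns an 8-second review into a multi-minute one while the rate limit stays saturated.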

Queueing wasn't the answer. We needed more headroom, which meant more keys, which meant somebody had to manage them.

Why Bifrost

I'd been kicking the tyres on a few gateways for an unrelated project. Bifrost (https://github.com/maximhq/bifrost) won on two specific points: load balancing across multiple API keys for the same provider is a documented first-class feature, and the OpenAI-compatible endpoint meant we didn't have to touch the bot's SDK code. It already spoke openai.ChatCompletion against an internal proxy URL.

Setup took about 40 minutes including the time to argue with our SSO admin about a new GitHub OAuth app.

{
  "providers": {
    "anthropic": {
      "keys": [
        { "value": "env.ANTHROPIC_KEY_1", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_2", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_3", "weight": 1.0 },
        { "value": "env.ANTHROPIC_KEY_4", "weight": 1.0 }
      ],
      "network_config": {
        "default_request_timeout_in_seconds": 30
      }
    }
  }
}

Bot config was a one-liner. Pointed OPENAI_API_BASE at our Bifrost ECS service on port 8080 and the bot didn't know it'd been moved.
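To make the `weight` field concrete: with four equal weights, each key should end up with roughly a quarter of the traffic. The sketch below shows weighted random selection as one plausible way to get that behaviour. It is an illustration under that assumption, not Bifrost's actual selection logic, and `pick_key` is a hypothetical helper:

```python
import random
from collections import Counter
from typing import Dict, Optional


def pick_key(weights: Dict[str, float], rng: Optional[random.Random] = None) -> str:
    """Pick one key name at random, proportionally to its weight."""
    rng = rng or random.Random()
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    pool = {f"ANTHROPIC_KEY_{i}": 1.0 for i in range(1, 5)}
    rng = random.Random(42)
    # Tally which key each of 10,000 simulated requests would use.
    print(Counter(pick_key(pool, rng) for _ in range(10_000)))
```

Dropping one key's weight shifts its share to the others, which is the same lever as the manual weight tweaks mentioned in the trade-offs.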

Results after two weeks

| Metric | Before (queue + 1 key) | After (Bifrost + 4 keys) |
|---|---|---|
| Median review latency | 4m 30s | 11s |
| p95 review latency | 7m 12s | 28s |
| 429 rate | 8.2% | 0.07% |
| Reviews abandoned (timed out) | 14% | 0.4% |
| "Is the bot dead" Slack pings | ~6/day | 0 |

Costs went up about 22% because more reviews actually completed. Worth it.

Bifrost vs LiteLLM vs Portkey

I evaluated all three properly. None is strictly better; they hit different sweet spots.

| Concern | Bifrost | LiteLLM | Portkey |
|---|---|---|---|
| Multi-key load balancing | Native | Via Router | Native |
| OpenAI-compatible endpoint | Yes | Yes | Yes |
| Self-host complexity | Single Go binary | Python + deps | SaaS-first |
| Built-in web UI for config | Yes | Limited | Cloud-side |
| Semantic caching | Yes | Yes | Yes |
| MCP gateway | Yes | No | No |
| Community size | Growing | Larger | Larger |

LiteLLM's community is bigger and the integrations list is wider. If you want Python ergonomics, it's the easier ride. Portkey's hosted UX is slicker out of the box, but we needed self-host for compliance reasons. Bifrost being a single Go binary suited our ECS deploy model and our preference for fewer Python services in the critical path.

Trade-offs and limitations

It's not all roses.

Failover is per-request, not per-key cooldown. If one of our four keys gets stuck in a rate-limit hole, Bifrost retries the call elsewhere but doesn't proactively quarantine the bad key for a window. We're managing that with manual weight tweaks for now.
The web UI is handy but state lives in config files. Make changes via the UI in dev and forget to commit the config, and you've got drift. We learned that one the hard way.
Single point of failure. Anything you put in front of every LLM call becomes load-bearing. We run two Bifrost replicas behind an ALB. Tiny team running one node and a restart policy might be fine, but think about it before you ship.
Observability glue. Prometheus metrics are emitted natively, which is great. You'll still need to wire them into your existing dashboards. Took us an afternoon.

Surviving an AZ Failover for Our Build Runner Fleet at 3am

claire nguyen — Wed, 27 May 2026 13:24:15 +0000

TL;DR: We lost an AWS AZ for 47 minutes back in March. Our build runner fleet on EKS mostly survived, but the AI-assisted code review bots wedged because their LLM calls all routed to one region. Sticking Bifrost in front of those calls fixed the second problem. Here's what we changed.

It was 3:12am Sydney time when PagerDuty went off. ap-southeast-2a was having a wobble. Not a full outage — just enough packet loss that EKS nodes started flapping in and out of the cluster.

Our build runner fleet handled it fine. We've drilled this. Pod disruption budgets, multi-AZ node groups, the usual stuff. Builds rescheduled to 2b and 2c within about 90 seconds. No worries.

The bit that didn't handle it fine was the AI review bot we'd shipped six weeks earlier. That thing called Anthropic's API directly from inside the build container. When the AZ flapped, the egress NAT in 2a started dropping outbound TLS. The bot retried, hit our 30-second build timeout, and 4,200 builds went red over half an hour.

I want to talk about what we did the morning after, because the fix wasn't "make the bot more resilient." It was "stop pretending the LLM call is special."

The actual failure mode

Here's the rough shape of what was happening. Our review bot was a Go service running as a sidecar in the build pod. Pseudo-config looked like this:

review_bot:
  provider: anthropic
  api_key: ${ANTHROPIC_KEY}
  model: claude-sonnet-4-6
  timeout_ms: 25000
  max_retries: 2

Two retries, 25-second timeout each. Sounds reasonable. Except when the underlying network is dropping packets, you don't fail fast; you sit there waiting for TCP to give up. The first attempt plus two retries became 75 seconds of nothing. Build timeout kicked in. Build failed.
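The 75 seconds falls straight out of the config: the first attempt counts too. A tiny sketch of the worst case, ignoring any sleep between attempts (the function name is made up):

```python
def worst_case_wait_seconds(timeout_ms: int, max_retries: int) -> float:
    """Total time spent if the first attempt and every retry all time out."""
    attempts = 1 + max_retries  # the original call plus each retry
    return attempts * timeout_ms / 1000.0


if __name__ == "__main__":
    # Values from the review_bot config above: timeout_ms=25000, max_retries=2.
    print(worst_case_wait_seconds(25000, 2))
```

It's worth running this sum against whatever timeout wraps the caller. If the worst case exceeds it, the retries can never succeed in time.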

Worse, every single review bot in every single build was hitting the same NAT gateway in the same degraded AZ. We'd accidentally built a single point of failure into something we'd designed as a sidecar.

What we changed

I'd been kicking the tyres on Bifrost (https://github.com/maximhq/bifrost) for a few weeks already because I wanted central observability on LLM spend across our internal tools. The AZ incident pushed it to the top of the queue.

The plan was simple: stop letting build pods talk to providers directly. Run Bifrost as a deployment in our shared platform namespace, spread across all three AZs, and point the review bot at it. The bot's config went from anthropic.com to an internal service URL.

Bifrost's drop-in replacement (https://docs.getbifrost.ai/features/drop-in-replacement) meant we didn't touch the bot's code. Just the env var.

Then we configured fallbacks (https://docs.getbifrost.ai/features/retries-and-fallbacks) so a failed Anthropic call rolls over to AWS Bedrock's Claude. Same model family, different network path, different auth, different everything.

{
  "model": "anthropic/claude-sonnet-4-6",
  "fallbacks": [
    "bedrock/anthropic.claude-sonnet-4-6",
    "openai/gpt-4o-mini"
  ]
}

The GPT-4o-mini at the bottom is a deliberate downgrade. If both Anthropic paths are stuffed, we'd rather give the dev a worse review than no review and a red build.

What it looks like vs the alternatives

I evaluated three things properly. Here's the honest comparison from my notes:

Concern	LiteLLM	Portkey	Bifrost
Self-hosted Go binary	No (Python)	Partial	Yes
Provider failover config	Yes	Yes	Yes
Built-in web UI for config	Limited	Yes (cloud)	Yes (local)
Semantic caching	Plugin	Yes	Yes
Memory footprint on our nodes	~400MB	N/A (SaaS-first)	~180MB
MCP gateway	No	No	Yes (enterprise)

LiteLLM is genuinely good and we run it for one of our data science notebooks because the Python ergonomics are nice. Portkey has the slickest dashboard if you're happy with their cloud. Bifrost won here because we wanted a Go binary we could run on our own infra, and the resource overhead per pod mattered when we're scheduling hundreds of build pods.

The boring infra bit

We deployed Bifrost as three replicas, one per AZ, behind a ClusterIP service. Topology spread constraints to keep them honest. Each pod has its own provider key set via Kubernetes secrets, referenced through Bifrost's env var support (https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references).

Prometheus scrape config picks up the native metrics endpoint. We graph p99 latency per provider and alert on fallback rate above 5% for more than 10 minutes. That alert would have fired during the March incident and given us a much better signal than "builds are timing out."
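The "above 5% for 10 minutes" condition is easy to get subtly wrong, so here is the logic spelled out. This sketch assumes one-minute buckets of (fallback requests, total requests) and treats ten consecutive breaching buckets as the trigger; it mirrors the intent of the alert, not its actual rule definition:

```python
from typing import Iterable, Tuple


def fallback_alert_firing(samples: Iterable[Tuple[int, int]],
                          threshold: float = 0.05,
                          min_consecutive: int = 10) -> bool:
    """samples: (fallback_requests, total_requests) per one-minute bucket, oldest first.

    Fires once the fallback share has stayed above `threshold`
    for `min_consecutive` buckets in a row.
    """
    streak = 0
    for fallbacks, total in samples:
        rate = fallbacks / total if total else 0.0
        streak = streak + 1 if rate > threshold else 0  # any dip resets the clock
        if streak >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    sustained = [(10, 100)] * 10
    blip = [(10, 100)] * 9 + [(1, 100)] + [(10, 100)] * 9
    print(fallback_alert_firing(sustained), fallback_alert_firing(blip))
```

A single healthy minute resets the streak, so short blips stay quiet while a sustained shift onto the fallback provider pages someone.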

Trade-offs and limitations

This isn't a free win. A few things to flag.

The gateway is now a new hop in the request path. We measured about 8-12ms added per call. For our use case that's noise. For real-time inference it might not be.

Bifrost's clustering features are an enterprise thing. We're running it as independent replicas behind a service, which works because our config is mostly static. If you need shared state across replicas (live config sync, shared rate limit counters), you'll either pay for enterprise or accept some eventual consistency.

Semantic caching sounds great but we haven't turned it on for the review bot because code reviews are too context-specific. Cache hit rate would be near zero. Worth knowing before you assume it'll save you money.

And the obvious one: a gateway pod failing is now a thing that can break LLM calls. Spread your replicas, set sensible PDBs, don't be silly.

The Cost Math Behind Our CI Cache Hit Rate Going From 40% to 91%

claire nguyen — Wed, 27 May 2026 04:23:02 +0000

TL;DR: We were burning roughly AUD $14k/month on redundant CI compute because our cache hit rate sat at 40%. Three changes (content-addressed keys, a warmer tier, and killing one bad pre-commit hook) pushed it to 91% and shaved the bill to about $3.2k. Most of the savings came from a single weekend audit, not new tooling.

I run infra at Buildkite. We eat our own dog food, which means our internal monorepo runs on the same agents we sell to customers. About six weeks ago our finance team flagged that our CI compute line on AWS had crept up 38% quarter-on-quarter while team headcount only grew 11%. Something was off.

Turns out the culprit wasn't traffic. It was caches.

The starting point

Our setup, roughly:

~280 engineers across Sydney, Melbourne, San Francisco
Around 4,200 builds/day on the monorepo
Mix of Go, TypeScript, and a chunky Python ML eval service
Agents running on m6i.4xlarge spot instances in ap-southeast-2
Remote cache backed by S3 with a CloudFront distribution

When I first pulled the numbers, our cache hit rate (measured per build step, weighted by step duration) was sitting at 40.3%. For a healthy CI setup of this size I'd reckon you want 80%+. Anything under 60% means you're paying twice for the same compute.
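"Weighted by step duration" deserves a concrete definition, because a plain hit count flatters you when the misses are the slow steps. A minimal sketch, assuming each step is weighted by its typical uncached duration (that choice of weight and the function name are my assumptions):

```python
from typing import Iterable, Tuple


def weighted_hit_rate(steps: Iterable[Tuple[float, bool]]) -> float:
    """steps: (uncached_duration_seconds, cache_hit) per build step.

    Returns the share of total step time that was served from cache.
    """
    total = 0.0
    hit = 0.0
    for duration, was_hit in steps:
        total += duration
        if was_hit:
            hit += duration
    return hit / total if total else 0.0


if __name__ == "__main__":
    # Two short steps hit the cache, one long step misses.
    steps = [(300.0, False), (100.0, True), (100.0, True)]
    print(weighted_hit_rate(steps))
```

Counted per step, that example is two hits out of three. Weighted by time it is well under half, which is the number that tracks the compute bill.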

Here's what the spend breakdown looked like before we touched anything:

Component	Monthly cost (AUD)	% of total
Spot EC2 (build agents)	$11,200	67%
S3 cache storage	$890	5%
CloudFront egress	$1,140	7%
LLM eval API calls (OpenAI + Anthropic)	$3,420	21%
Total	$16,650	100%

The LLM line is the one nobody expected. We run automated PR review on a subset of changes, plus regression evals on our search ranking service.

The three things that actually mattered

1. Content-addressed cache keys

We had cache keys like node_modules_v3_${branch_name}_${os}. That's already wrong but the worse bit was the v3 suffix that someone bumped six months ago and forgot why.

Switched to hashing the actual inputs: package-lock.json content hash + Node version + OS. Standard stuff but we'd just never done it properly.

steps:
  - label: ":node: install"
    plugins:
      - cache#v2.4.0:
          manifest: package-lock.json
          path: node_modules
          restore: file
          save: file
          key: "v1-{{ runner.os }}-node-{{ checksum 'package-lock.json' }}"

The restore: file bit matters. It means we only invalidate when package-lock.json actually changes, not when the branch name changes. Cache hit rate on the install step went from 31% to 96% overnight.
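Stripped of plugin syntax, a content-addressed key is just a hash of the inputs that determine the output. A sketch of the idea, assuming SHA-256 and a made-up key layout; the `epoch` prefix is a hypothetical manual override for the rare full invalidation:

```python
import hashlib


def cache_key(lockfile_bytes: bytes, node_version: str, os_name: str,
              epoch: str = "v1") -> str:
    """Derive a cache key from the inputs that determine node_modules."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{epoch}-{os_name}-node{node_version}-{digest}"


if __name__ == "__main__":
    lock_a = b'{"lockfileVersion": 3, "packages": {}}'
    lock_b = b'{"lockfileVersion": 3, "packages": {"left-pad": {}}}'
    # Same lockfile gives the same key on any branch; a changed lockfile does not.
    print(cache_key(lock_a, "20", "linux") == cache_key(lock_a, "20", "linux"))
    print(cache_key(lock_a, "20", "linux") == cache_key(lock_b, "20", "linux"))
```

The branch name never enters the function, so a fresh branch with an unchanged lockfile reuses the existing cache entry.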

2. A warmer tier between memory and S3

S3 is cheap but the round-trip from ap-southeast-2 agents to S3 is about 18ms for small objects, and we were pulling thousands of them per build. We added an r6gd.large instance with NVMe local storage as an in-region warm cache. Agents check there first, fall through to S3.

Cost: about $180/month for the warm cache instance. Saves us roughly $1,000/month in CloudFront egress because most cache reads never leave the VPC now.

3. The bad pre-commit hook

This one is embarrassing. Someone added a pre-commit hook two years ago that ran find . -name "*.pyc" -delete before every test invocation. On a clean checkout this does nothing useful. On a cached checkout it deletes all the compiled Python bytecode, forcing Python to recompile on every test run. Average test step went from 4m20s to 2m45s after deleting eight lines of bash.

I genuinely could not believe it. We'd been paying for that for two years.

The LLM bit

The $3,420 LLM line was harder to chip away at because the calls themselves are useful. What we did:

Routed the PR review traffic through an AI gateway (we use Bifrost, which gives us semantic caching and a single endpoint) so identical or near-identical review prompts hit cache instead of provider
Moved the search ranking evals to a nightly batch rather than per-PR
Switched the bulk of the review traffic to a cheaper model and reserved the expensive one for changes touching /security/*

Semantic cache hit rate on PR review prompts settled around 34%, which doesn't sound massive but the prompts that hit cache tend to be the bigger ones (boilerplate "review this dependency bump" type stuff), so the dollar impact was bigger than the hit rate suggests.

Final LLM line came down to $1,180/month.

Where we landed

Component	Before	After	Change
Spot EC2	$11,200	$1,820	-84%
S3 + warm cache	$890	$1,070	+20%
CloudFront egress	$1,140	$140	-88%
LLM API	$3,420	$1,180	-65%
Total	$16,650	$4,210	-75%

Cache hit rate: 91.2% weighted.

Trade-offs and Limitations

The warm cache tier is a single point of failure. If that r6gd.large dies, we fall through to S3 cleanly but builds slow down by ~40 seconds each until we replace it. For us that's fine because spot interruption is more common than instance failure anyway. For a smaller team I'd skip it.

Content-addressed keys made cache busting harder for the rare case where you legitimately want to invalidate everything. We added a manual BUILDKITE_CACHE_EPOCH env var so a human can force-invalidate when needed. Used it twice in three months.

The pre-commit hook thing wasn't a tooling problem. It was institutional knowledge rot. There's no caching strategy that protects you from someone deleting your bytecode every commit. You need humans to actually read what runs.

Chaos testing your CI runner fleet when half the jobs call an LLM

claire nguyen — Tue, 26 May 2026 13:25:36 +0000

TL;DR: We started injecting LLM provider failures into our Buildkite agent fleet during scheduled game days. Found out our "retry on 5xx" logic was happily burning $80/hr re-sending the same 200k-token context to Anthropic during a brownout. Putting Bifrost in front of the agents fixed the obvious stuff. The chaos testing exposed the non-obvious stuff.

Right, story time. We run a fair-sized fleet of Buildkite agents on EC2, and over the last 18 months maybe 30% of jobs started touching an LLM somewhere. Code review bots. Doc generation. A weird internal thing that summarises flaky test runs. The build itself is deterministic. The LLM calls inside the build are not.

When OpenAI had its multi-hour wobble in March, our p99 build time went from 4 minutes to 47. Half the queue stalled. We hadn't tested for it because nothing in our chaos playbook accounted for "third-party inference API returns 200 but takes 90 seconds."

So we built one.

What we were already doing wrong

The original setup was the obvious thing. Each agent had an OPENAI_API_KEY baked into the AMI. Build scripts called the API directly. Retries were whatever the SDK gave us by default.

Three problems showed up the first time we ran a proper failure injection:

SDK default retry was 2 attempts with exponential backoff. On a 200k-token prompt at $3/M input tokens, that's 60 cents per retry. Multiply by 800 concurrent agents during a brownout and you do the maths.
We had no circuit breaker. Agents kept dialling a dead provider for the full 10-minute job timeout.
No visibility into which build steps were calling which model. The bill arrived monthly. The blame arrived never.
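The maths from the first point, written out. This sketch counts input tokens only and uses the figures quoted above (200k-token prompt, $3 per million input tokens, 800 agents); the function name is made up:

```python
def retry_cost_usd(prompt_tokens: int, usd_per_million_input: float,
                   retries: int = 1, concurrent_agents: int = 1) -> float:
    """Input-token cost of re-sending the same prompt on retry."""
    per_send = prompt_tokens / 1_000_000 * usd_per_million_input
    return per_send * retries * concurrent_agents


if __name__ == "__main__":
    # One retry of one prompt, then one retry wave across the whole fleet.
    print(round(retry_cost_usd(200_000, 3.0), 2))
    print(round(retry_cost_usd(200_000, 3.0, concurrent_agents=800), 2))
```

Each retry is small on its own. The fleet-wide figure is per retry wave, and a brownout produces many waves an hour.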

The game day setup

We run game days on a staging fleet that mirrors prod cluster sizing. The injection is done with a tiny toxiproxy sidecar that sits between the agent and the outbound LLM endpoint. Three failure modes we rotate through:

Brownout: 30% of requests return 429 with a Retry-After of 60s
Slowdown: every request gets 15s of latency added
Hard down: 100% return 503 for 8 minutes, then recovery

The first time we ran the brownout scenario against our naive setup, we got a Slack page from finance before the game day was over. They'd seen the cost spike in their hourly dashboard. Embarrassing. Also, exactly the point of the exercise.

Putting a gateway in front

We moved to running Bifrost as a sidecar on each agent host. The agents talk to localhost:8080 with the OpenAI SDK and Bifrost handles the actual provider calls. Drop-in replacement, no code changes in the build scripts.

The config is boring, which is what you want:

providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_SECONDARY
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY

fallbacks:
  - primary: openai/gpt-4o-mini
    backup:
      - anthropic/claude-haiku-4-5

Two things this actually solved during our next game day:

Fallback worked without code changes. When toxiproxy killed OpenAI, builds kept moving by routing to Anthropic. Build time bumped maybe 20%. Nobody paged.

The Prometheus metrics gave us per-pipeline cost visibility. We could finally see that one team's "summarise the test logs" step was responsible for 40% of our LLM spend. Conversation with that team was much easier with numbers attached.

What the gateway doesn't fix

Here's the honest bit. The gateway didn't solve our retry-cost problem on its own. Bifrost's fallback config is good, but if your build script is calling the API in a loop and not respecting the 429s coming back, you'll still burn money. We had to write our own thin wrapper in the build pipeline to bail out of the LLM step after 2 failures and fall back to a heuristic. Gateway gave us the signals. The build logic still has to do the right thing with them.
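The wrapper idea is simple enough to sketch. What follows is a hypothetical shape for it, not our actual pipeline code: `summarise_with_bailout` and `last_error_line` are invented names, and the heuristic is deliberately dumb.

```python
from typing import Callable


def summarise_with_bailout(log_text: str,
                           llm_call: Callable[[str], str],
                           heuristic: Callable[[str], str],
                           max_failures: int = 2) -> str:
    """Try the LLM step; after `max_failures` errors, use the heuristic instead."""
    failures = 0
    while failures < max_failures:
        try:
            return llm_call(log_text)
        except Exception:
            # Sketch only: a real wrapper would catch narrower errors
            # and honour Retry-After before trying again.
            failures += 1
    return heuristic(log_text)


def last_error_line(log_text: str) -> str:
    """Hypothetical heuristic: return the last log line mentioning 'error'."""
    matches = [ln for ln in log_text.splitlines() if "error" in ln.lower()]
    return matches[-1] if matches else "no error line found"


if __name__ == "__main__":
    def always_throttled(_: str) -> str:
        raise RuntimeError("429 Too Many Requests")

    print(summarise_with_bailout("step ok\nERROR: boom", always_throttled, last_error_line))
```

The point is the hard ceiling: the step can cost at most `max_failures` sends, however long the provider stays unhappy.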

Honest comparison

We looked at LiteLLM and Portkey before settling. Quick read:

Tool	What we liked	Where it didn't fit
LiteLLM	Massive provider list, well-known	Python proxy meant another runtime on each agent host
Portkey	Slick analytics dashboard, mature observability	SaaS-first, our security team wasn't keen on egress for build logs
Bifrost	Single Go binary, drop-in OpenAI compat, semantic caching that actually saved us 22% on the doc-gen pipeline	Smaller ecosystem, fewer integrations than LiteLLM, MCP gateway is enterprise-tier

If you're already running LiteLLM happily, no reason to swap. We just preferred deploying one binary alongside the agent instead of a Python service.

Trade-offs and limitations

A few things to be straight about:

Adding a gateway adds a hop. We measured about 3-5ms overhead per call. Fine for our use case, might matter if you're doing latency-sensitive inference.
Semantic caching is brilliant for repetitive build prompts (think "summarise this stack trace") but useless for anything with high-entropy input. Don't expect a free 50% cost cut.
Self-hosted means you own the uptime of the gateway too. We run it as a sidecar so the blast radius is one agent, but if you centralise it, you've created a new SPOF.
Game days take real time. Half a day to set up, half a day to run, two days of follow-up tickets. Worth it. Not free.

The biggest win wasn't any one feature. It was that we'd actually pulled the cables out before a real provider had a bad afternoon. "Never had an outage" usually means you've never tested your failure handling.

Game day on our build cluster: killing an AZ to test LLM flake detection

claire nguyen — Mon, 25 May 2026 13:22:18 +0000

TL;DR: We ran a game day on our Buildkite agent fleet where I yanked an entire AWS AZ while our LLM-based flake classifier was triaging failures. The classifier fell over because we'd wired it to a single OpenAI endpoint. Putting Bifrost in front fixed the failover hole and exposed two other bugs we hadn't seen.

Right, so a few weeks back I was running a game day on our internal build cluster. About 800 agents spread across ap-southeast-2a, 2b, and 2c. The exercise was meant to test our LLM-powered flake detector under partial infrastructure failure. The detector reads a failed job log, classifies it as flake | real | infra, and decides whether to auto-retry.

I killed 2a. That was the plan. What wasn't the plan was the flake detector going completely dark within 90 seconds.

What broke

We'd built the detector as a tiny Go service running on each agent host. It called OpenAI's gpt-4o-mini directly. One endpoint, one API key, no retries beyond the SDK default. When 2a went down, our networking config rerouted egress through a NAT gateway that was hot-throttled by the surge of retries from other services. Result: every flake classification request hung for 30 seconds, then timed out.

CI pipelines didn't fail. They just stopped auto-retrying, so flakes landed on engineers exactly the way real bugs did, and Slack lit up.

The post-mortem was a bit embarrassing. We'd tested failover for the build database, the artifact store, the agent registration service. Hadn't tested failover for the thing classifying our test failures. Classic case of treating the LLM call as "just an API" instead of as a dependency that can fail in five different ways.

What we changed

I'd been keeping an eye on Bifrost (https://github.com/maximhq/bifrost) for a few months. It's an AI gateway written in Go that sits between your app and the providers. Single OpenAI-compatible endpoint, fallback rules, load balancing across keys, and a Prometheus metrics endpoint baked in. That last bit was what sold me, because our observability stack is already Prom + Grafana and I didn't fancy bolting on yet another exporter.

Deployed it as a sidecar on the agent hosts, two replicas per AZ. Config looked roughly like this:

providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_BACKUP
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY

fallbacks:
  - model: "openai/gpt-4o-mini"
    targets:
      - "openai/gpt-4o-mini"
      - "anthropic/claude-haiku-4-5"

The flake detector's only change was pointing its OpenAI base URL at http://localhost:8080/v1. One line. No SDK swap.

Second game day

Ran the same exercise two weeks later. Killed 2a again. The Bifrost sidecar on 2a stopped responding, the detector's HTTP client failed over to the 2b sidecar via our service mesh, and classification continued. The fallback rule kicked in for about 4% of requests when one OpenAI key got rate-limited by the surge — those routed to Anthropic and came back in roughly the same latency window.

We didn't see zero impact. Tail latency on classifications jumped from p99 ~1.2s to p99 ~3.8s during the failover window. But nothing went dark.

Two bugs the gateway exposed

The Prometheus metrics from Bifrost showed us things our app-level logging had been hiding.

Bug one: 12% of our "real bug" classifications were coming from one specific agent pool that runs Ruby tests. The model was getting truncated logs because we'd set max_tokens too low on the input side at some point and nobody remembered. The per-provider token histograms in the metrics made it obvious.

Bug two: Our retries were stacking. The agent was retrying on 429, and Bifrost was also retrying on 429, so each client-side attempt kicked off its own round of gateway retries. A single rate-limited request was costing us 4x the tokens. Fixed by turning off retries in our client and letting Bifrost handle them.
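
To see how two retry layers multiply rather than add, here is a minimal sketch. The retry counts are illustrative assumptions (one retry per layer), not values taken from the agent or gateway config, and the function name is made up for this example.

```python
def worst_case_attempts(client_retries: int, gateway_retries: int) -> int:
    """Upstream calls for one logical request when every attempt fails.

    Each layer makes 1 initial attempt plus its retries, and the layers
    multiply because every client attempt triggers a full gateway retry cycle.
    """
    return (client_retries + 1) * (gateway_retries + 1)


# Hypothetical settings: one retry in the client, one in the gateway.
stacked = worst_case_attempts(client_retries=1, gateway_retries=1)
# Client retries turned off, gateway still retries once.
gateway_only = worst_case_attempts(client_retries=0, gateway_retries=1)
print(stacked, gateway_only)
```

Under those assumptions the stacked case makes four upstream calls and the gateway-only case makes two, which is why retrying in exactly one layer is the usual fix.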

Honest comparison

We looked at LiteLLM and Portkey before landing on Bifrost. Quick table:

Concern	LiteLLM	Portkey	Bifrost
Deploy as single binary	Python, heavier	Hosted-first	Go binary, npx or Docker
Prom metrics out of box	Plugin	Hosted dashboard	Native endpoint
Fallback config	YAML	UI + config	YAML + Web UI
OSS self-host	Yes	Limited	Yes
Maturity (Apr 2026)	Highest, broad ecosystem	Strong hosted product	Younger, smaller community

LiteLLM has way more community plugins and provider quirks already handled. If you're doing weird stuff with niche providers, it's still probably the safer pick. Portkey's hosted dashboards are nicer than what we built ourselves. Bifrost won for us because it's a single Go binary with native Prom metrics, and the latency overhead in our tests was under 2ms p50.

Trade-offs and limitations

Adds a network hop. ~1-2ms p50, ~5ms p99 in our setup. Acceptable for flake classification, maybe not for tight inner loops.
Another thing to monitor. We've now got Bifrost-down alerts in PagerDuty.
Semantic caching (https://docs.getbifrost.ai/features/semantic-caching) sounded great but we haven't enabled it — flake classification context is too specific for cache hits to be meaningful in our case.
The Web UI is handy for fiddling locally, but we manage config via git like everything else, so we mostly ignore it.

Game days for the LLM dependency in your CI aren't optional anymore if you're doing anything non-trivial. The LLM call is now a critical-path component, so treat it like one.

Stop paying for idle GPUs in your CI: batching LLM eval jobs

claire nguyen — Fri, 22 May 2026 04:22:08 +0000

TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.

Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.

The cloud bill was not lovely.

They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.

I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.

The problem with "GPU per job"

CI runners are designed for stateless, throwaway compute. That model breaks the second you involve a 7B+ parameter model that takes 30-90 seconds to load into VRAM.

Here's the rough breakdown of a typical eval job we measured:

Phase	Time (avg)	GPU utilization
Cold start (instance boot)	45s	0%
Model download from S3	60s	0%
Model load into VRAM	25s	~10%
Actual inference (50 prompts)	40s	~85%
Result upload + teardown	15s	0%

So out of about 3 minutes of billable GPU time, you're getting 40 seconds of useful work. That's brutal economics.
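
The arithmetic behind that, as a quick sketch. The durations are copied from the table; counting only the inference phase as useful work is a simplifying assumption, since model load also touches the GPU a little.

```python
# Phase durations in seconds, copied from the table above.
phases = {
    "cold_start": 45,
    "model_download": 60,
    "model_load": 25,
    "inference": 40,
    "upload_teardown": 15,
}

total_s = sum(phases.values())
# Assumption: only the inference phase counts as useful GPU work.
useful_fraction = phases["inference"] / total_s

print(f"billable: {total_s}s, useful: {useful_fraction:.0%}")
```

That works out to 185 billable seconds with roughly a fifth of them doing inference.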

Batching: the boring fix that works

The trick isn't fancy. You stop spinning up a GPU per job and start treating the GPU like a long-lived service that consumes jobs from a queue.

We run a small pool of g5.xlarge instances (usually 2-4 depending on load) that stay warm. Each runner has the model preloaded in VRAM. CI jobs push eval requests to an SQS queue, runners pull from the queue, batch up to N prompts per inference pass, and post results back.

Rough sketch of the runner config:

runner:
  instance_type: g5.xlarge
  pool_size_min: 2
  pool_size_max: 6
  scale_metric: queue_depth
  scale_threshold: 25  # jobs in queue

  model:
    name: llama-3.1-8b-instruct
    preload: true
    keep_warm_seconds: 1800

  batching:
    max_batch_size: 16
    max_wait_ms: 2000

  job_types:
    smoke_eval:
      priority: high
      max_prompts: 10
    full_regression:
      priority: low
      max_prompts: 500
      window: nightly_only

That max_wait_ms is doing the heavy lifting. The runner waits up to 2 seconds to gather a batch before firing inference. For CI, 2 seconds of latency is nothing. For inference throughput, it's everything.
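
A minimal sketch of that wait-then-fire loop, assuming an in-process queue.Queue as a stand-in for SQS. The collect_batch name and its parameters are illustrative, not part of any runner's real API.

```python
import queue
import time


def collect_batch(jobs: queue.Queue, max_batch_size: int, max_wait_ms: int) -> list:
    """Block for the first job, then gather more until the batch is full
    or max_wait_ms has elapsed since the first job arrived."""
    batch = [jobs.get()]  # wait indefinitely for the first job
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(jobs.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


# Usage: 20 queued prompts drain as one full batch and one partial batch.
q = queue.Queue()
for i in range(20):
    q.put(f"prompt-{i}")

first = collect_batch(q, max_batch_size=16, max_wait_ms=50)
second = collect_batch(q, max_batch_size=16, max_wait_ms=50)
print(len(first), len(second))
```

The full batch fires immediately, while the partial one sits out the whole wait window before firing. That is the latency cost you trade for fuller batches.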

Routing matters too

Once you've got a warm pool, you might as well route different model calls through one place. We have eval suites that hit a mix of self-hosted Llama, Claude via API, and OpenAI. Instead of each CI job authenticating separately and managing keys, we put a gateway in front.

There's a bunch of options here. LiteLLM is popular, Bifrost (https://github.com/maximhq/bifrost) is another one that does the same kind of multi-provider routing with rate limit handling, and you can roll your own with a thin FastAPI wrapper if you're feeling keen. The point is you stop scattering API keys across twenty CI configs.

Job classification: don't run a full eval on every commit

This was the biggest single win, honestly. We split eval jobs into tiers:

Smoke evals: 5-10 prompts, run on every PR, catches obvious regressions
Standard evals: 50-100 prompts, run on merge to main
Full regression: 500+ prompts, run nightly on main

Before this, every PR triggered the full 500-prompt suite because nobody had bothered to think about what they actually needed to know per PR. The answer is "did this change break something obvious?", not "is this model production-ready?"

Cut our GPU-hours by about 40% just from that change alone, before any of the batching work.
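
One way to express the tiering in code, as a rough sketch. The event names and the EvalTier type are hypothetical, and the prompt counts take the upper bound of each range listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTier:
    name: str
    max_prompts: int


# Prompt counts use the upper bound of each range in the list above;
# the event names are hypothetical labels for CI triggers.
TIERS = {
    "pull_request": EvalTier("smoke", 10),
    "merge_to_main": EvalTier("standard", 100),
    "nightly": EvalTier("full_regression", 500),
}


def pick_tier(event: str) -> EvalTier:
    """Map a CI trigger to an eval tier, defaulting to the cheapest one."""
    return TIERS.get(event, TIERS["pull_request"])


print(pick_tier("pull_request").name, pick_tier("nightly").max_prompts)
```

Defaulting unknown triggers to the smoke tier is a design choice for the sketch: a misconfigured pipeline then costs ten prompts instead of five hundred.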

What the numbers looked like

After about three weeks of running the new setup:

Metric	Before	After
GPU-hours per day	38	14
Avg PR feedback time	4m 20s	1m 50s
Monthly GPU spend (eval only)	~$8,200	~$3,100
Queue p99 wait time	n/a	8s

Faster and cheaper, which is the dream combination you almost never get.

Trade-offs and limitations

Nothing's free, so here's what actually hurt:

Cold start on scale-up is still painful. When the queue spikes past what the warm pool can handle, the new runners take 90+ seconds to come online with the model loaded. We mitigated by being more aggressive on the scale_threshold than felt comfortable, which means we're occasionally paying for idle capacity. You can't have both.

Batching adds latency variance. A job that arrives just after a batch fires waits the full max_wait_ms. For CI this is fine. For production inference it might not be, so don't blindly copy this config to your prod inference pipeline.

Pool exhaustion is a real failure mode. If your queue grows faster than you can scale, jobs back up. We had a Friday afternoon where a misconfigured test suite generated 4,000 eval jobs in 10 minutes and the queue depth alert woke me up at 11pm. Add circuit breakers and per-team quotas early, not after the first incident.

Model updates are now an event. When you preload models, swapping versions means a rolling restart of the pool. We do this during low-traffic windows but it's added operational overhead that didn't exist with the per-job model.

Putting an LLM Gateway in Front of Our Build Agents

claire nguyen — Thu, 21 May 2026 13:22:18 +0000

TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The deciding factor wasn't features, it was the 11 microsecond overhead and the fact it didn't fall over when one provider had a wobbly afternoon.

Right, so a few weeks back I got pulled into a project to wire LLM calls into some internal tooling we use for triaging flaky builds. Nothing fancy, mostly summarising failure logs and suggesting which test owner to ping. The catch was that this thing sits on the hot path of our build feedback loop, and our SRE on-call rotation was very clear: if your shiny AI feature adds latency to my builds, I will personally come and uninstall it.

Fair enough.

The problem with calling providers directly

First pass was the obvious one. SDK calls straight to Anthropic, with OpenAI as a fallback wrapped in a try/except. Worked fine in dev. Then we hit a real Tuesday afternoon where Anthropic had a regional hiccup, our fallback logic kicked in, and we discovered our "fallback" was actually just retrying the same broken endpoint because someone (me) had copy-pasted the client config.

Classic.

So we needed a proper gateway. The shortlist was Bifrost, LiteLLM, and Kong with an AI plugin. I'd used Kong before for regular API stuff so I was leaning that way out of habit, but I forced myself to actually test the three of them.

What we measured

I set up a quick bench on an m6i.large with a mock upstream so we weren't measuring provider latency. Ran 50k requests at modest concurrency. Here's roughly what we got.

Gateway	Overhead per request	Memory steady state	Setup time
Direct SDK	~0 µs	80 MB	10 min
Bifrost	~11 µs	95 MB	25 min
LiteLLM	~2.1 ms	180 MB	20 min
Kong + AI plugin	~1.4 ms	220 MB	90 min

The 11 microsecond number for Bifrost is what they claim on their repo and honestly I assumed it was marketing fluff until I saw it on our own bench. It's Go, runs as a single binary, and the gateway overhead genuinely disappears into the noise of the actual LLM call.

LiteLLM is Python and you can feel it. It's fine for a lot of use cases and the feature set is honestly massive, but on our hot path that extra couple of milliseconds per call added up across thousands of build steps.

Kong is Kong. Powerful, but it's a full API gateway with an AI plugin bolted on, not an LLM gateway. We didn't need the rest of Kong.

The config that actually mattered

The bit that sold me wasn't the latency. It was weighted routing with proper failover. Here's a stripped down version of what we landed on:

providers:
  anthropic_primary:
    type: anthropic
    api_key: ${ANTHROPIC_KEY}
    weight: 70
  openai_secondary:
    type: openai
    api_key: ${OPENAI_KEY}
    weight: 30

routing:
  build_triage:
    providers: [anthropic_primary, openai_secondary]
    failover: true
    timeout_ms: 8000

cache:
  semantic:
    enabled: true
    similarity_threshold: 0.92
    ttl_seconds: 3600

That semantic cache block is doing a lot of work. Build failures rhyme. A flaky test that times out today probably timed out last week with a slightly different log signature, and the cache catches that fuzzy match instead of paying for another LLM call. We saw cache hit rates around 38% in the first fortnight, which translates directly into provider bill reduction.
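
For intuition about what similarity_threshold controls, here is a toy sketch. It is not how Bifrost implements its cache: the SemanticCache class is made up for illustration, the three-dimensional vectors stand in for real embeddings, and TTL handling is left out.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy in-memory cache: return a stored answer when a new prompt's
    embedding is close enough to one seen before."""

    def __init__(self, similarity_threshold: float = 0.92):
        self.threshold = similarity_threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, embedding: list[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))


# Hand-made 3-d vectors stand in for real embeddings.
cache = SemanticCache(similarity_threshold=0.92)
cache.put([1.0, 0.0, 0.2], "timeout in integration suite, likely flake")
near = cache.get([0.9, 0.1, 0.2])  # similar log signature
far = cache.get([0.0, 1.0, 0.0])   # unrelated failure
print(near, far)
```

The near vector clears the 0.92 threshold and reuses the stored answer, while the unrelated one misses and would go to the provider.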

Virtual keys were the other thing that mattered for us. We could hand different teams their own virtual key with its own rate limit and budget, all pointing at the same upstream credentials. No more chasing engineers to rotate keys when someone's notebook leaked one to a gist.

Failover that actually works

The thing I tested most paranoidly was the failover. I literally just killed the Anthropic endpoint at the network level mid-request, expecting some ugly behaviour. Bifrost retried against OpenAI inside the same request boundary, the caller got a response, and the metrics endpoint showed the failover counter tick. No drama.

Reckon this is the thing most people get wrong when they roll their own. Failover is easy to write and hard to test. Having it as a config flag means I can write a game day scenario where we knock providers offline and watch the gateway do its job, instead of hoping our wrapper code holds up.
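
If you do roll your own, the testable version looks something like this sketch. The providers here are fake callables with no network involved, and every name is hypothetical. The point is that the failover path is driven by a test rather than by a real outage.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    pass


def call_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response)
    from the first one that does not raise."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Game-day style check with fake providers (both are stand-ins, no network).
def broken_primary(prompt: str) -> str:
    raise ConnectionError("endpoint unreachable")


def healthy_secondary(prompt: str) -> str:
    return f"summary of: {prompt}"


used, reply = call_with_failover(
    "test_checkout timed out",
    [("primary", broken_primary), ("secondary", healthy_secondary)],
)
print(used, reply)
```

Passing the same broken callable for both slots reproduces the copy-pasted-config bug: the call raises AllProvidersFailed instead of quietly retrying the same endpoint.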

Trade-offs and Limitations

Not all sunshine.

The dashboard is functional but it's no Grafana. We export Prometheus metrics out of it and build our own panels, which is what we wanted anyway, but if you're hoping for a polished UI out of the box you'll be doing some work.

The plugin ecosystem is smaller than LiteLLM. If you need some niche provider or a very specific transformation, LiteLLM probably has it already and Bifrost might need you to write a small bit of Go. For our needs (Anthropic, OpenAI, one self-hosted model) this was a non-issue.

Go binary means your ops team needs to be cool with running a Go service. If you're an all-Python shop and your team is allergic to anything else, that's a real friction point even though the binary itself is genuinely fire-and-forget.

And semantic caching can bite you. If your prompts are doing something where a "similar" prompt actually needs a different answer (think anything with user-specific context smuggled in), you'll want to disable it for those routes. We learned this the second day.

Where it sits now

It runs as a sidecar to our build orchestration service. Two replicas behind an internal load balancer, Prometheus scraping the metrics endpoint, and PagerDuty wired to the failover counter so we know when a provider is having a bad day before our users do. Total memory footprint across the cluster is rounding error compared to the workloads it serves.

The on-call SRE has not, so far, come to uninstall it. I'll take the win.

Putting an LLM Gateway in Front of Our Build Agents: Why We Picked Bifrost

claire nguyen — Tue, 19 May 2026 04:22:40 +0000


What I Actually Pay For When My LLM Bill Doubles Overnight

claire nguyen — Fri, 15 May 2026 10:34:37 +0000

TL;DR: Your LLM bill isn't one number, it's about six. Retry storms, runaway agents, and bad routing are the usual culprits. A bit of observability work up front saves you from staring at a $40k invoice wondering what happened.

Last quarter I helped a mate's team trace a sudden 3x jump in their OpenAI spend. They thought it was usage growth. It wasn't. It was a retry loop in their orchestration code that fired off three full-context requests every time a single tool call timed out. Took us about two hours to find, ten minutes to fix.

I reckon most teams running LLMs in prod have something similar lurking. You just don't see it until the invoice lands.

The bill has more parts than you think

When you read "LLM cost" on a finance dashboard, that single number is hiding a bunch of independent failure modes. Worth pulling them apart.

Cost driver	What it actually is	Where it hides
Input tokens	Prompt + context + system message	Long system prompts, fat RAG chunks
Output tokens	Model's response	Verbose prompts, no max_tokens cap
Retries	Failed requests you paid for anyway	Library defaults, agent loops
Cached vs uncached	Prompt caching hits or misses	Cache invalidation from tiny prompt edits
Provider markup	Your gateway/aggregator's cut	Hidden in unit pricing
Wasted spend	Calls you didn't need to make	Background agents, debug code in prod

The first three are the ones I see blow up budgets. Provider markup matters at scale but it's predictable. The other two, cache misses and wasted spend, sneak up on you.

Retries are the silent killer

Here's the pattern I see constantly. A team uses some agent framework, the framework has a default retry of 3 with exponential backoff, and the prompts include the full conversation history. A timeout on token 4000 of a 4096-token response means you just paid for 4000 tokens, then immediately paid for another 4000+ tokens, then maybe a third time.

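from openai import OpenAI

# long_history: the full list of chat messages built up elsewhere

# What people write: no output cap, default timeout, retries cranked up
client = OpenAI(max_retries=3)  # client-level option; the SDK's own default is 2
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
)

# What they should write: cap output, bound the wait, retry at most once
client = OpenAI(max_retries=1, timeout=30)
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
    max_tokens=500,
)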

Two changes. Cap your output. Stop retrying expensive operations more than once. If a 30-second call fails once, the second attempt usually fails too.

Caching is worth the effort, but it's fragile

Prompt caching on most providers gives you something like 50-90% off cached input tokens. Brilliant when it works. The trap is that cache keys are exact prefix matches. Change your system prompt by one character, your timestamp injection bumps every request, your dynamic user context shifts the prefix... cache hit rate goes to zero and your bill quietly goes back up.

A useful exercise: log your actual cache hit rate per route, not just the average. I had a service where overall hit rate was 70%, looked fine, but one specific endpoint was running at 4% because someone added a timestamp to the system prompt for "debugging."
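
A rough sketch of that per-route breakdown. The record fields (route, cached_tokens, input_tokens) are hypothetical names for whatever your request logs actually carry, and the sample numbers are made up.

```python
from collections import defaultdict


def hit_rate_by_route(records: list[dict]) -> dict[str, float]:
    """records: one dict per request with 'route', 'cached_tokens'
    and 'input_tokens'. Returns cached/input token ratio per route."""
    cached = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        cached[r["route"]] += r["cached_tokens"]
        total[r["route"]] += r["input_tokens"]
    return {route: cached[route] / total[route] for route in total if total[route]}


# Made-up sample: the blended number looks healthy, one route does not.
logs = [
    {"route": "/chat", "cached_tokens": 900, "input_tokens": 1000},
    {"route": "/chat", "cached_tokens": 850, "input_tokens": 1000},
    {"route": "/summarise", "cached_tokens": 40, "input_tokens": 1000},
]
rates = hit_rate_by_route(logs)
overall = sum(r["cached_tokens"] for r in logs) / sum(r["input_tokens"] for r in logs)
print(rates, round(overall, 2))
```

In this sample the blended rate sits near 60% while one route is at 4%, which is the shape of problem an average hides.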

Routing across providers

Once you've got more than one model in play, where requests go starts mattering a lot. Cheap models for classification, expensive ones for synthesis. Local models for bulk preprocessing if you can host them.

A few options here, depending on how much you want to manage yourself:

Build it in your own gateway service
Use a router library like LiteLLM in-process
Run a proxy like Bifrost (https://github.com/maximhq/bifrost), Kong AI, or Portkey in front of your services
Stick with one provider and use their own routing

The proxy approach is what I've ended up preferring on bigger setups because it gives you one place to log, retry, and rate-limit. The downside is one more thing to keep alive. For smaller services I just call the SDK directly and accept the duplication.

Set hard limits per workload

The cheapest debugging tool I've found is a per-API-key spend cap. Most providers offer them. Most teams don't set them because the dashboards default to "monitor only."

Set them. Set them low. If your batch job is supposed to cost $200 a day and you cap it at $400, you'll get paged the day someone accidentally points it at production traffic instead of staging. That page is much nicer to receive than a Slack message from finance two weeks later.

# Example budget config we use
budgets:
  - name: chat-prod
    daily_limit_usd: 1200
    hourly_limit_usd: 100
    alert_at_pct: [50, 80, 95]
    hard_stop_at_pct: 100
  - name: experiments
    daily_limit_usd: 50
    hard_stop_at_pct: 100
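
How a config like that might be evaluated, as a minimal sketch. The budget_action function is hypothetical and not any gateway's real implementation. The limits mirror the chat-prod entry, and the spend figures are invented.

```python
def budget_action(spent_usd: float, daily_limit_usd: float,
                  alert_at_pct: list[int], hard_stop_at_pct: int = 100) -> str:
    """Return 'stop', 'alert:<pct>' for the highest threshold crossed, or 'ok'."""
    used_pct = 100 * spent_usd / daily_limit_usd
    if used_pct >= hard_stop_at_pct:
        return "stop"
    crossed = [p for p in sorted(alert_at_pct) if used_pct >= p]
    return f"alert:{crossed[-1]}" if crossed else "ok"


# Limits from the chat-prod entry above; the spend figures are made up.
print(budget_action(500, 1200, [50, 80, 95]))   # under 50%
print(budget_action(1000, 1200, [50, 80, 95]))  # between 80% and 95%
print(budget_action(1200, 1200, [50, 80, 95]))  # at the hard stop
```

Checking the hard stop before the alert thresholds keeps the two concerns separate: alerts page a human, the stop blocks traffic.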

Trade-offs and Limitations

A few honest caveats.

Caching aggressively means your prompts get rigid. You can't iterate on system prompts as freely because every edit nukes your cache.

Hard spend caps will absolutely cause prod outages. That's the point, but it means you need a runbook for "we hit the cap, what now." Either auto-raise with approval, or fail open to a cheaper fallback model, or fail closed and alert. Pick one before it happens.

Per-key budgets only work if you have enough keys. If everything runs through one shared key, you've got coarse-grained control at best.

And honestly, observability work has diminishing returns. If your bill is $500 a month, don't build a routing platform. If it's $50k a month, the engineering time pays for itself in a week.

MCP in Production: Reality vs the Spec

claire nguyen — Wed, 29 Apr 2026 05:58:18 +0000

Been building against MCP for the last four months and the gap between what vendors claim and what the spec actually supports is getting hard to ignore.

If you have not read the official roadmap yet, it is worth your time. The document published by AAIF in March lays things out clearly and honestly. The list of what is still missing is longer than many people in the ecosystem seem willing to admit.

Stateless Streaming Is Not Here Yet

Stateless Streamable HTTP is still marked as in progress. That has real consequences.

Today, if you want to scale horizontally, you are dealing with sticky sessions or putting a stateful proxy in front of your servers. This is not a small implementation detail. It directly affects reliability, cost, and operational complexity.

Every "MCP-native at scale" pitch I have seen quietly works around this with a custom session layer. That may be practical for now, but it is not what people assume when they hear "stateless protocol."

Async Work Is Still DIY

The Tasks primitive for async and long-running operations is also in progress.

In practice, this means any agent doing multi-minute work is faking async. Most teams end up with polling endpoints, custom retry logic, and their own definitions of job state.

The problem is not just inconvenience. It is fragmentation. Each implementation behaves slightly differently, which makes interoperability harder before it even begins.

Discovery Is Still Manual

Server discovery is another gap that shows up quickly.

The idea of Server Cards exposed via .well-known URLs is promising, but not available yet. Right now, you cannot know what an MCP server can do without connecting to it first.

The Registry preview from late 2025 helps, but it is not a replacement for protocol-level discovery. You still end up writing glue code just to answer basic capability questions.

Enterprise Auth Is Not Ready

Authentication is where things feel especially incomplete for real-world use.

Most implementations today rely on static client secrets. That works for prototypes, but does not align with how larger organizations manage identity and access.

The roadmap calls out SSO-integrated cross-app access as a priority. That is exactly what is needed. Until it lands, teams are building their own auth layers on top.

The Hidden Cost: Rewrites Later

Put all of this together and a pattern emerges.

If you are building serious MCP infrastructure today, you are not just implementing the spec. You are filling in gaps around session management, async orchestration, discovery, and authentication.

Those gaps come with a cost. Once these features land in the official spec, a lot of today's custom infrastructure will need to be reworked or replaced. Some abstractions will survive. Many will not.

If you are designing systems now, it is worth being explicit about where you are deviating from the spec and how hard it will be to unwind later.

About Those "Production Ready" Claims

This also makes it hard to take "production-ready" MCP gateway claims at face value in April 2026.

There are usually two possibilities. Either the deployment is small enough that these issues have not surfaced yet, or the vendor has built proprietary extensions on top of MCP.

Neither is inherently wrong, but both are very different from what the marketing suggests.

The Good News

None of this is a knock on MCP itself.

The shape of the protocol feels right. The direction is solid. The roadmap is transparent about what is missing, which is more than can be said for many standards at this stage.

But the reality is simple. Production-grade tooling is still catching up.
