<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: claire nguyen</title>
    <description>The latest articles on DEV Community by claire nguyen (@claire_nguyen).</description>
    <link>https://dev.to/claire_nguyen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864932%2F406e92e1-2c8d-4d65-a1f4-a8d4e8c2fd1d.jpg</url>
      <title>DEV Community: claire nguyen</title>
      <link>https://dev.to/claire_nguyen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/claire_nguyen"/>
    <language>en</language>
    <item>
      <title>Error budgets for an LLM dependency you don't control</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:22:28 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/error-budgets-for-an-llm-dependency-you-dont-control-6ia</link>
      <guid>https://dev.to/claire_nguyen/error-budgets-for-an-llm-dependency-you-dont-control-6ia</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We shipped a natural-language build-query feature at Buildkite, then tried to put a 99.9% SLO on it. Turns out you can't promise uptime for a model provider you don't run. We put Bifrost in front, failed over across three providers, and now the error budget tracks our gateway's behaviour instead of OpenAI's status page.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the moment it clicked for me. We were drafting an SLO doc for a feature that lets people ask "why did this build fail" in plain English. Someone wrote "99.9% availability". Cool. That's about 43 minutes of allowed downtime over a 30-day month. Then OpenAI had a wobble for about 50 minutes one Tuesday and we blew the whole budget before lunch.&lt;/p&gt;

&lt;p&gt;The problem wasn't our code. Our service was up the entire time. The dependency wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can't SLO something you don't operate
&lt;/h2&gt;

&lt;p&gt;A normal SLO assumes you control the thing you're measuring. Postgres, your own API, an internal queue. You can add replicas, you can tune it, you can page someone who can fix it.&lt;/p&gt;

&lt;p&gt;A hosted LLM is none of that. When Anthropic returns a 529 or OpenAI starts handing out 429s under load, there is no lever on your side. You wait. Our p99 for the feature was around 2.1 seconds on a good day, and during provider degradation it'd climb past 9 seconds or just fail outright.&lt;/p&gt;

&lt;p&gt;So the question stopped being "how do I make the provider more reliable" and became "how do I make my dependency on any single provider less load-bearing." That's a routing problem, not a model problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting a gateway in the path
&lt;/h2&gt;

&lt;p&gt;We run Bifrost as the single egress point for every LLM call now. It's an OpenAI-compatible gateway, so our service code didn't change much. The interesting part is the fallback config: if the primary provider errors or times out, the request gets retried against the next one without our app knowing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.BEDROCK_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-3-5-haiku"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"bedrock/anthropic.claude-3-haiku"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three providers, ranked. When OpenAI throttles, the call lands on Anthropic. When both are sad, Bedrock catches it. The feature may degrade in quality, but it stays up. That's the whole point of an error budget: stay inside the line.&lt;/p&gt;

&lt;p&gt;It also does load balancing across multiple keys, which mattered more than I expected. Half our "outages" early on were just one API key hitting its rate limit while another sat idle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The metrics that actually feed the SLO
&lt;/h2&gt;

&lt;p&gt;The bit that sold me was native Prometheus output. Bifrost exposes metrics straight out of the box, so I'm not scraping a vendor status page or parsing logs to know if we're burning budget.&lt;/p&gt;

&lt;p&gt;Our availability SLI is now "requests Bifrost successfully resolved, including via fallback" over total requests. A request that failed on OpenAI but succeeded on Anthropic counts as a win, because the user got an answer. That's the number that should drive the SLO, not per-provider success.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# fast burn-rate over 1h: are we eating budget faster than allowed?
sum(rate(bifrost_requests_total{status="error"}[1h]))
/
sum(rate(bifrost_requests_total[1h]))
&amp;gt; (14.4 * 0.001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
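&lt;p&gt;A minimal sketch of how that query could be wired into a paging alert, assuming a standard Prometheus rules file. It pairs the 1h window with a 5m window, the multiwindow pattern from the Google SRE Workbook, so the alert clears soon after the burn stops. The group name, alert name, severity label and &lt;code&gt;for&lt;/code&gt; duration are placeholders for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: llm-gateway-slo            # hypothetical group name
    rules:
      - alert: LLMGatewayFastBurn    # hypothetical alert name
        # A 14.4x burn on a 99.9% SLO spends about 2% of a 30-day budget in 1h.
        expr: |
          (
            sum(rate(bifrost_requests_total{status="error"}[1h]))
              / sum(rate(bifrost_requests_total[1h]))
            &amp;gt; (14.4 * 0.001)
          )
          and
          (
            sum(rate(bifrost_requests_total{status="error"}[5m]))
              / sum(rate(bifrost_requests_total[5m]))
            &amp;gt; (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "LLM gateway is burning error budget at more than 14.4x"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;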



&lt;p&gt;We went from one provider doing about 99.4% effective availability to the fallback chain sitting around 99.93% over the last 60 days. Same models, same budget, just not betting the feature on one company's afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it stacks up
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before landing here. None of these is strictly best. Depends what you're optimising for.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Thing I cared about&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-host, no vendor in path&lt;/td&gt;
&lt;td&gt;Yes, single Go binary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Possible, but hosted is the main path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Prometheus metrics&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Via callbacks/config&lt;/td&gt;
&lt;td&gt;Dashboard-first, export is extra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider failover config&lt;/td&gt;
&lt;td&gt;Declarative fallback list&lt;/td&gt;
&lt;td&gt;Yes, router config&lt;/td&gt;
&lt;td&gt;Yes, configs/strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted analytics UI&lt;/td&gt;
&lt;td&gt;Basic built-in UI&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Strongest of the three&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python ecosystem depth&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;td&gt;Largest, huge community&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Honestly, if you live in Python and want the biggest provider list and community, LiteLLM is hard to beat. If you want a polished hosted dashboard and guardrails without running anything, Portkey is the comfortable pick. We're an infra team that wants metrics in our own Prometheus and a binary we can run on our own boxes, so Bifrost fit our shape. No worries either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Fallback hides failure, and that cuts both ways. If your alerting only watches the final success rate, you can be quietly running 80% of traffic on your third-choice provider for days and not notice the bill. We added a separate alert on per-provider fallback rate so degradation is visible, not just survivable.&lt;/p&gt;
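&lt;p&gt;One way to sketch that second alert, assuming the gateway's request counter carries a &lt;code&gt;provider&lt;/code&gt; label and that &lt;code&gt;openai&lt;/code&gt; is the primary. The label name, its values, the 50% threshold and the 30m window are all assumptions to check against what your own metrics endpoint actually exposes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: llm-gateway-fallback                 # hypothetical group name
    rules:
      - alert: LLMTrafficOnFallbackProvider    # hypothetical alert name
        # Share of all requests served by each non-primary provider.
        # Assumes a "provider" label on the counter; verify before use.
        expr: |
          sum by (provider) (rate(bifrost_requests_total{provider!="openai"}[30m]))
            / scalar(sum(rate(bifrost_requests_total[30m])))
          &amp;gt; 0.5
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.provider }} is serving over half of LLM traffic"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Routing this to a ticket rather than a page fits the failure mode: users are still getting answers, so it's a cost and quality signal, not an outage.&lt;/p&gt;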

&lt;p&gt;Quality drift is real too. gpt-4o-mini and claude-3-5-haiku don't answer identically, so a build-failure summary can read differently mid-incident. For us that's acceptable. For anything doing structured extraction, you'd want to validate output shape per provider.&lt;/p&gt;

&lt;p&gt;And a gateway is one more thing to run. It's a low-risk component, but it's in the hot path, so we run it with the same care as any other tier-1 service. If Bifrost is down, every LLM-backed feature is down. We game-day it like the rest of our stack.&lt;/p&gt;

&lt;p&gt;Self-hosting also means semantic caching, governance, and the rest are your config problem, not a managed feature. Fine for us. Worth knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks&lt;/li&gt;
&lt;li&gt;Bifrost observability and Prometheus metrics&lt;/li&gt;
&lt;li&gt;Bifrost gateway setup&lt;/li&gt;
&lt;li&gt;Google SRE Workbook: alerting on SLOs and burn rates&lt;/li&gt;
&lt;li&gt;Bifrost on GitHub&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>llm</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The Prometheus label that blew our monitoring bill out 6x</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 29 May 2026 04:21:15 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-prometheus-label-that-blew-our-monitoring-bill-out-6x-57hj</link>
      <guid>https://dev.to/claire_nguyen/the-prometheus-label-that-blew-our-monitoring-bill-out-6x-57hj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and the backend charges by active series. Here's how we caught it and the label rules we run now so it doesn't happen again.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill, not the traffic
&lt;/h2&gt;

&lt;p&gt;I'm on the infra team at Buildkite. We run a fairly chunky Prometheus setup feeding a managed backend, and one Monday the monthly estimate had quietly gone from about $1,800 to a touch over $11k. Nobody shipped more traffic. Build volume was the same 40k-ish builds a day it'd been for weeks.&lt;/p&gt;

&lt;p&gt;So it wasn't load. It was series count. Active series had climbed from roughly 1.2 million to nearly 9 million, and the backend prices on active series, not on request volume. That's the trap most people miss the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cardinality actually is
&lt;/h2&gt;

&lt;p&gt;Think of every unique combination of metric name plus label values as its own drawer in a filing cabinet. &lt;code&gt;http_requests_total{status="200"}&lt;/code&gt; is one drawer. Add &lt;code&gt;region="ap-southeast-2"&lt;/code&gt; and now you've got a drawer per region. Add a label whose values are unbounded and you've got a cabinet the size of a warehouse.&lt;/p&gt;

&lt;p&gt;Cardinality is the count of those drawers. Each one is a separate time series that has to be stored and indexed. Low-cardinality labels (status, region) are fine. High-cardinality ones are where the money leaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one label that did it
&lt;/h2&gt;

&lt;p&gt;A teammate had added &lt;code&gt;build_id&lt;/code&gt; to a counter so they could debug a flaky deploy. Fair enough in the moment. Problem is every build has a unique ID, we do ~40k a day, and those IDs hang around for the full retention window.&lt;/p&gt;

&lt;p&gt;40k unique values a day, multiplied across a handful of other labels, multiplied across retention. That's your several-million-series jump right there. One label.&lt;/p&gt;

&lt;h2&gt;
  
  
  Catching it
&lt;/h2&gt;

&lt;p&gt;The fastest way to find the offender is to ask Prometheus which metric has the most series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(10, count by (__name__)({__name__=~".+"}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then drill into the worst metric and see which label is doing the damage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count(count by (build_id)(deploy_attempts_total))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When that second query came back with a number in the tens of thousands, we had our culprit.&lt;/p&gt;
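&lt;p&gt;To catch the next one before the invoice does, a guard alert on total series count is cheap to run. A minimal sketch, assuming Prometheus scrapes itself under &lt;code&gt;job="prometheus"&lt;/code&gt;. The 2 million threshold is an illustrative number chosen to sit above the roughly 1.2 million baseline mentioned earlier, and the group and alert names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: cardinality-guard                # hypothetical group name
    rules:
      - alert: ActiveSeriesAboveBaseline   # hypothetical alert name
        # prometheus_tsdb_head_series reports the number of series in the head block.
        expr: prometheus_tsdb_head_series{job="prometheus"} &amp;gt; 2e6
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Head series on {{ $labels.instance }} is above the expected baseline"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;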

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;You drop the label before it ever hits storage. &lt;code&gt;metric_relabel_configs&lt;/code&gt; runs at scrape time, so you can strip a label without touching the app code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build-agents"&lt;/span&gt;
    &lt;span class="na"&gt;metric_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build_id"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;labeldrop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-build detail didn't vanish. We moved it to where unbounded identifiers belong: traces and logs. If you genuinely need a metric sliced per build, use exemplars so the high-cardinality bit lives in the trace store, not the series index.&lt;/p&gt;

&lt;p&gt;Here's how we now reason about labels before adding one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Unique values&lt;/th&gt;
&lt;th&gt;Safe to add?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;status&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;region&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;instance_type&lt;/td&gt;
&lt;td&gt;~15&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agent_queue&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;Usually fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;build_id&lt;/td&gt;
&lt;td&gt;~40k/day&lt;/td&gt;
&lt;td&gt;No, use a trace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user_email&lt;/td&gt;
&lt;td&gt;unbounded&lt;/td&gt;
&lt;td&gt;No, never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rule of thumb we reckon on: if you can't name the upper bound of a label's values on a whiteboard, it doesn't go on a metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same trap, different service
&lt;/h2&gt;

&lt;p&gt;This isn't only a Prometheus-the-app thing. Any service that emits Prometheus metrics can sink you the same way. We run a small internal feature that summarises failed build logs through an LLM, and those calls go through Bifrost, an open-source AI gateway that ships native Prometheus metrics out of the box. Handy. But the instinct to tag those metrics with a per-request ID or per-virtual-key label is exactly the same footgun.&lt;/p&gt;

&lt;p&gt;We keep its labels down to provider and model. That gives us cost-per-provider and latency-per-model without minting a new series for every call. The discipline travels with the metric, not the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Dropping &lt;code&gt;build_id&lt;/code&gt; means you can't slice a single build inside Prometheus anymore. For ad-hoc "what did build 84213 do" questions, you're now in the trace or log tooling, which is a context switch some folks grumbled about for a week.&lt;/p&gt;

&lt;p&gt;Recording rules, the other common fix, aren't free either. They add evaluation load on the Prometheus side, and if you write a sloppy one you can quietly recreate the cardinality you were trying to kill. Test the output series count before you ship the rule.&lt;/p&gt;
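&lt;p&gt;As a sketch of a recording rule that keeps its output bounded, the aggregation below sums everything except &lt;code&gt;job&lt;/code&gt; away, so &lt;code&gt;build_id&lt;/code&gt; never reaches the recorded series. The group and rule names are hypothetical; &lt;code&gt;deploy_attempts_total&lt;/code&gt; is the counter from the query earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: deploy-aggregates                  # hypothetical group name
    rules:
      # Output has one series per job, regardless of how many builds ran.
      - record: job:deploy_attempts:rate5m   # hypothetical rule name
        expr: sum by (job) (rate(deploy_attempts_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;count(sum by (job) (rate(deploy_attempts_total[5m])))&lt;/code&gt; as an ad-hoc query first shows how many series the rule would produce. Note the rule only bounds its own output; the raw series are still ingested unless the label is dropped at scrape.&lt;/p&gt;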

&lt;p&gt;Exemplars need backend support and a tracing system wired up. If you haven't got distributed tracing yet, that path's a bigger project than a one-line &lt;code&gt;labeldrop&lt;/code&gt;. Be honest about where you are.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;labeldrop&lt;/code&gt; is a blunt instrument. Once it's gone at scrape, it's gone. If you later decide you wanted that dimension bounded rather than dropped, you're re-instrumenting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/practices/naming/" rel="noopener noreferrer"&gt;Prometheus: metric and label naming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/practices/instrumentation/" rel="noopener noreferrer"&gt;Prometheus: instrumentation best practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-cardinality-management/" rel="noopener noreferrer"&gt;Grafana: control metrics usage via cardinality management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
    </item>
    <item>
      <title>Our PR-review bot kept hitting 429s. Bifrost key pooling fixed it.</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Thu, 28 May 2026 13:22:11 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/our-pr-review-bot-kept-hitting-429s-bifrost-key-pooling-fixed-it-4m9f</link>
      <guid>https://dev.to/claire_nguyen/our-pr-review-bot-kept-hitting-429s-bifrost-key-pooling-fixed-it-4m9f</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our internal PR-review bot was getting 429'd by Anthropic between 9am and 11am Sydney time. We dropped Bifrost in front, pooled four keys, and the 429 rate fell from 8.2% to 0.07% in a fortnight. The migration was one env var swap. The interesting bits were the bits we got wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;We've got a PR-review bot that pings Claude on every pull request opened against our internal monorepo. It pulls the diff, ships a structured prompt to Claude, and gets back a summary plus a couple of "have you considered..." nudges. That saves our reviewers maybe 10 minutes per PR. We're a team of 80 engineers, all sharing one Anthropic workspace that someone provisioned back in early 2024 and nobody bothered to split.&lt;/p&gt;

&lt;p&gt;You can guess what happened.&lt;/p&gt;

&lt;p&gt;Mornings in Sydney are brutal. Everyone arrives, opens their PRs from the night before, and our bot fires off 30-40 concurrent requests. Anthropic's per-org rate limit got chewed through by 9:15 most days. Bot started failing. Slack filled up with "did the review bot die again?" messages. Not a great look for the platform team.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we tried first
&lt;/h2&gt;

&lt;p&gt;The naive fix was a job queue with backoff. Wrote it in an arvo. Buildkite job, Redis-backed, exponential retry with jitter. It worked, sort of. Reviews now took 4-7 minutes to come back instead of 8 seconds. Engineers started ignoring the bot entirely, because by the time the review landed they'd already merged, which defeats the point of having a review bot.&lt;/p&gt;

&lt;p&gt;Queueing wasn't the answer. We needed more headroom, which meant more keys, which meant somebody had to manage them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bifrost
&lt;/h2&gt;

&lt;p&gt;I'd been kicking the tyres on a few gateways for an unrelated project. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) won on two specific points: load balancing across multiple API keys for the same provider is a documented first-class feature, and the OpenAI-compatible endpoint meant we didn't have to touch the bot's SDK code. It already spoke &lt;code&gt;openai.ChatCompletion&lt;/code&gt; against an internal proxy URL.&lt;/p&gt;

&lt;p&gt;Setup took about 40 minutes including the time to argue with our SSO admin about a new GitHub OAuth app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default_request_timeout_in_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bot config was a one-liner. Pointed &lt;code&gt;OPENAI_API_BASE&lt;/code&gt; at our Bifrost ECS service on port 8080 and the bot didn't know it'd been moved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results after two weeks
&lt;/h2&gt;

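&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (queue + 1 key)&lt;/th&gt;
&lt;th&gt;After (Bifrost + 4 keys)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Median review latency&lt;/td&gt;
&lt;td&gt;4m 30s&lt;/td&gt;
&lt;td&gt;11s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 review latency&lt;/td&gt;
&lt;td&gt;7m 12s&lt;/td&gt;
&lt;td&gt;28s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429 rate&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews abandoned (timed out)&lt;/td&gt;
&lt;td&gt;14%&lt;/td&gt;
&lt;td&gt;0.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Is the bot dead" Slack pings&lt;/td&gt;
&lt;td&gt;~6/day&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;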

&lt;p&gt;Costs went up about 22% because more reviews actually completed. Worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost vs LiteLLM vs Portkey
&lt;/h2&gt;

&lt;p&gt;I evaluated all three properly. None is strictly better; they hit different sweet spots.&lt;/p&gt;

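&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-key load balancing&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Via Router&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host complexity&lt;/td&gt;
&lt;td&gt;Single Go binary&lt;/td&gt;
&lt;td&gt;Python + deps&lt;/td&gt;
&lt;td&gt;SaaS-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in web UI for config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Cloud-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community size&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;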

&lt;p&gt;LiteLLM's community is bigger and the integrations list is wider. If you want Python ergonomics, it's the easier ride. Portkey's hosted UX is slicker out of the box, but we needed self-host for compliance reasons. Bifrost being a single Go binary suited our ECS deploy model and our preference for fewer Python services in the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;It's not all roses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover is per-request, not per-key cooldown.&lt;/strong&gt; If one of our four keys gets stuck in a rate-limit hole, Bifrost retries the call elsewhere but doesn't proactively quarantine the bad key for a window. We're managing that with manual weight tweaks for now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The web UI is handy but state lives in config files.&lt;/strong&gt; Make changes via the UI in dev and forget to commit the config, and you've got drift. We learned that one the hard way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure.&lt;/strong&gt; Anything you put in front of every LLM call becomes load-bearing. We run two Bifrost replicas behind an ALB. Tiny team running one node and a restart policy might be fine, but think about it before you ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability glue.&lt;/strong&gt; Prometheus metrics are emitted natively, which is great. You'll still need to wire them into your existing dashboards. Took us an afternoon.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost governance and virtual keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost observability: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM router config: &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/routing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic rate limit headers: &lt;a href="https://docs.anthropic.com/en/api/rate-limits" rel="noopener noreferrer"&gt;https://docs.anthropic.com/en/api/rate-limits&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No worries if you've already got a gateway you're happy with. If you haven't, don't write your own queue and hope.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>sre</category>
    </item>
    <item>
      <title>Surviving an AZ Failover for Our Build Runner Fleet at 3am</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 27 May 2026 13:24:15 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/surviving-an-az-failover-for-our-build-runner-fleet-at-3am-pg7</link>
      <guid>https://dev.to/claire_nguyen/surviving-an-az-failover-for-our-build-runner-fleet-at-3am-pg7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We lost an AWS AZ for 47 minutes back in March. Our build runner fleet on EKS mostly survived, but the AI-assisted code review bots wedged because their LLM calls all routed to one region. Sticking Bifrost in front of those calls fixed the second problem. Here's what we changed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was 3:12am Sydney time when PagerDuty went off. ap-southeast-2a was having a wobble. Not a full outage — just enough packet loss that EKS nodes started flapping in and out of the cluster.&lt;/p&gt;

&lt;p&gt;Our build runner fleet handled it fine. We've drilled this. Pod disruption budgets, multi-AZ node groups, the usual stuff. Builds rescheduled to 2b and 2c within about 90 seconds. No worries.&lt;/p&gt;

&lt;p&gt;The bit that didn't handle it fine was the AI review bot we'd shipped six weeks earlier. That thing called Anthropic's API directly from inside the build container. When the AZ flapped, the egress NAT in 2a started dropping outbound TLS. The bot retried, hit our 30-second build timeout, and 4,200 builds went red over half an hour.&lt;/p&gt;

&lt;p&gt;I want to talk about what we did the morning after, because the fix wasn't "make the bot more resilient." It was "stop pretending the LLM call is special."&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual failure mode
&lt;/h2&gt;

&lt;p&gt;Here's the rough shape of what was happening. Our review bot was a Go service running as a sidecar in the build pod. Pseudo-config looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;review_bot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
  &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_KEY}&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="na"&gt;timeout_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25000&lt;/span&gt;
  &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two retries, 25 second timeout each. Sounds reasonable. Except when the underlying network is dropping packets, you don't fail fast. You sit there until every attempt runs out its full timeout. One initial attempt plus two retries became 75 seconds of nothing. Build timeout kicked in. Build failed.&lt;/p&gt;
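
&lt;p&gt;To make that arithmetic concrete, here's a minimal sketch of the worst case, assuming every attempt runs to its full timeout and ignoring any backoff sleep between attempts. The function names are made up for illustration; the numbers come from the pseudo-config above and the 30-second build timeout mentioned earlier.&lt;/p&gt;

```python
# Worst-case wait when every attempt times out.
# Assumption: no backoff sleep between attempts, so this is a lower bound.

def worst_case_wait_s(timeout_ms, max_retries):
    """One initial attempt plus max_retries retries, each hitting the timeout."""
    attempts = 1 + max_retries
    return attempts * timeout_ms / 1000


def headroom_s(timeout_ms, max_retries, build_timeout_s):
    """Seconds left inside the build timeout; negative means the build dies first."""
    return build_timeout_s - worst_case_wait_s(timeout_ms, max_retries)


print(worst_case_wait_s(timeout_ms=25000, max_retries=2))               # 75.0
print(headroom_s(timeout_ms=25000, max_retries=2, build_timeout_s=30))  # -45.0
```

&lt;p&gt;Read the other way round: with three attempts, anything above 10 seconds per attempt can't fit inside a 30-second build timeout.&lt;/p&gt;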

&lt;p&gt;Worse, every single review bot in every single build was hitting the same NAT gateway in the same degraded AZ. We'd accidentally built a single point of failure into something we'd designed as a sidecar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;I'd been kicking the tyres on Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for a few weeks already because I wanted central observability on LLM spend across our internal tools. The AZ incident pushed it to the top of the queue.&lt;/p&gt;

&lt;p&gt;The plan was simple: stop letting build pods talk to providers directly. Run Bifrost as a deployment in our shared platform namespace, spread across all three AZs, and point the review bot at it. The bot's base URL went from api.anthropic.com to an internal service URL.&lt;/p&gt;

&lt;p&gt;Bifrost's drop-in replacement (&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/drop-in-replacement&lt;/a&gt;) meant we didn't touch the bot's code. Just the env var.&lt;/p&gt;

&lt;p&gt;Then we configured fallbacks (&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;) so a failed Anthropic call rolls over to AWS Bedrock's Claude. Same model family, different network path, different auth, different everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"bedrock/anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The GPT-4o-mini at the bottom is a deliberate downgrade. If both Anthropic paths are stuffed, we'd rather give the dev a worse review than no review and a red build.&lt;/p&gt;
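
&lt;p&gt;If it helps to picture what an ordered fallback chain does, here's a rough sketch. It is not Bifrost's implementation. &lt;code&gt;call_with_fallbacks&lt;/code&gt;, &lt;code&gt;ProviderError&lt;/code&gt; and &lt;code&gt;fake_send&lt;/code&gt; are hypothetical names, and the fake sender just simulates the direct Anthropic path being down.&lt;/p&gt;

```python
# Illustration only: walk an ordered list of models until one answers.

class ProviderError(Exception):
    """Raised by a provider call that failed or timed out."""


def call_with_fallbacks(prompt, chain, send):
    """Return (model, response) from the first model in chain that succeeds.

    chain: list of model identifiers, primary first.
    send:  callable(model, prompt) that returns text or raises ProviderError.
    """
    errors = []
    for model in chain:
        try:
            return model, send(model, prompt)
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


CHAIN = [
    "anthropic/claude-sonnet-4-6",
    "bedrock/anthropic.claude-sonnet-4-6",
    "openai/gpt-4o-mini",
]


def fake_send(model, prompt):
    # Simulates the incident: the direct Anthropic path is unreachable.
    if model.startswith("anthropic/"):
        raise ProviderError("connection timed out")
    return f"review from {model}"


model, response = call_with_fallbacks("review this diff", CHAIN, fake_send)
print(model)  # bedrock/anthropic.claude-sonnet-4-6
```

&lt;p&gt;The order of the list is the policy: same model family first, deliberate downgrade last.&lt;/p&gt;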

&lt;h2&gt;
  
  
  What it looks like vs the alternatives
&lt;/h2&gt;

&lt;p&gt;I evaluated three things properly. Here's the honest comparison from my notes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted Go binary&lt;/td&gt;
&lt;td&gt;No (Python)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider failover config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in web UI for config&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes (cloud)&lt;/td&gt;
&lt;td&gt;Yes (local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory footprint on our nodes&lt;/td&gt;
&lt;td&gt;~400MB&lt;/td&gt;
&lt;td&gt;N/A (SaaS-first)&lt;/td&gt;
&lt;td&gt;~180MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (enterprise)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM is genuinely good and we run it for one of our data science notebooks because the Python ergonomics are nice. Portkey has the slickest dashboard if you're happy with their cloud. Bifrost won here because we wanted a Go binary we could run on our own infra, and resource overhead matters when we're scheduling hundreds of build pods on the same nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boring infra bit
&lt;/h2&gt;

&lt;p&gt;We deployed Bifrost as three replicas, one per AZ, behind a ClusterIP service. Topology spread constraints to keep them honest. Each pod has its own provider key set via Kubernetes secrets, referenced through Bifrost's env var support (&lt;a href="https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Prometheus scrape config picks up the native metrics endpoint. We graph p99 latency per provider and alert on fallback rate above 5% for more than 10 minutes. That alert would have fired during the March incident and given us a much better signal than "builds are timing out."&lt;/p&gt;
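
&lt;p&gt;In practice that condition lives in an alerting rule, not application code. The Python below only illustrates the logic, assuming one (total, fallbacks) sample per minute. &lt;code&gt;fallback_alert&lt;/code&gt; and the sample data are made up for the example.&lt;/p&gt;

```python
# Fire when the fallback rate stays above the threshold for more than
# `sustained_minutes` consecutive one-minute samples.

def fallback_alert(samples, threshold=0.05, sustained_minutes=10):
    """samples: list of (total_requests, fallback_requests), one per minute."""
    streak = 0
    for total, fallbacks in samples:
        rate = fallbacks / total if total else 0.0
        streak = streak + 1 if rate > threshold else 0
        if streak > sustained_minutes:
            return True
    return False


quiet = [(200, 4)] * 30                        # steady 2% fallback rate
incident = [(200, 4)] * 5 + [(200, 60)] * 11   # 30% for 11 minutes straight

print(fallback_alert(quiet))     # False
print(fallback_alert(incident))  # True
```

&lt;p&gt;Resetting the streak on any healthy minute is what keeps a short blip from paging anyone.&lt;/p&gt;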

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This isn't a free win. A few things to flag.&lt;/p&gt;

&lt;p&gt;The gateway is now a new hop in the request path. We measured about 8-12ms added per call. For our use case that's noise. For real-time inference it might not be.&lt;/p&gt;

&lt;p&gt;Bifrost's clustering features are an enterprise thing. We're running it as independent replicas behind a service, which works because our config is mostly static. If you need shared state across replicas (live config sync, shared rate limit counters), you'll either pay for enterprise or accept some eventual consistency.&lt;/p&gt;

&lt;p&gt;Semantic caching sounds great but we haven't turned it on for the review bot because code reviews are too context-specific. Cache hit rate would be near zero. Worth knowing before you assume it'll save you money.&lt;/p&gt;

&lt;p&gt;And the obvious one: a gateway pod failing is now a thing that can break LLM calls. Spread your replicas, set sensible PDBs, don't be silly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost fallbacks docs: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost observability: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM router config: &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/routing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS multi-AZ resilience patterns: &lt;a href="https://aws.amazon.com/architecture/well-architected/" rel="noopener noreferrer"&gt;https://aws.amazon.com/architecture/well-architected/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes topology spread constraints: &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Cost Math Behind Our CI Cache Hit Rate Going From 40% to 91%</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 27 May 2026 04:23:02 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-cost-math-behind-our-ci-cache-hit-rate-going-from-40-to-91-284d</link>
      <guid>https://dev.to/claire_nguyen/the-cost-math-behind-our-ci-cache-hit-rate-going-from-40-to-91-284d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We were burning roughly AUD $14k/month on redundant CI compute because our cache hit rate sat at 40%. Three changes (content-addressed keys, a warmer tier, and killing one bad pre-commit hook) pushed it to 91% and shaved the bill to about $3.2k. Most of the savings came from a single weekend audit, not new tooling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I run infra at Buildkite. We eat our own dog food, which means our internal monorepo runs on the same agents we sell to customers. About six weeks ago our finance team flagged that our CI compute line on AWS had crept up 38% quarter-on-quarter while team headcount only grew 11%. Something was off.&lt;/p&gt;

&lt;p&gt;Turns out the culprit wasn't traffic. It was caches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;Our setup, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~280 engineers across Sydney, Melbourne, San Francisco&lt;/li&gt;
&lt;li&gt;Around 4,200 builds/day on the monorepo&lt;/li&gt;
&lt;li&gt;Mix of Go, TypeScript, and a chunky Python ML eval service&lt;/li&gt;
&lt;li&gt;Agents running on &lt;code&gt;m6i.4xlarge&lt;/code&gt; spot instances in &lt;code&gt;ap-southeast-2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Remote cache backed by S3 with a CloudFront distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I first pulled the numbers, our cache hit rate (measured per build step, weighted by step duration) was sitting at 40.3%. For a healthy CI setup of this size I'd reckon you want 80%+. Anything under 60% means you're paying twice for the same compute.&lt;/p&gt;
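
&lt;p&gt;For anyone wondering why the weighting matters, here's a small sketch. It assumes each step's weight is its typical uncached duration, which is my reading of "weighted by step duration" and not something the dashboards enforce. The step data is invented.&lt;/p&gt;

```python
# Duration-weighted cache hit rate: a miss on a long step hurts more
# than a miss on a short one.

def weighted_hit_rate(steps):
    """steps: list of (duration_seconds, cache_hit) pairs, one per build step."""
    total = sum(duration for duration, _ in steps)
    if total == 0:
        return 0.0
    hit = sum(duration for duration, was_hit in steps if was_hit)
    return hit / total


steps = [
    (240, False),  # long test step, cache miss
    (30, True),
    (30, True),
    (20, True),
]

unweighted = sum(1 for _, was_hit in steps if was_hit) / len(steps)
print(f"unweighted: {unweighted:.0%}")                # unweighted: 75%
print(f"weighted:   {weighted_hit_rate(steps):.0%}")  # weighted:   25%
```

&lt;p&gt;Three hits out of four looks healthy until you notice the one miss is where all the compute goes.&lt;/p&gt;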

&lt;p&gt;Here's what the spend breakdown looked like before we touched anything:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly cost (AUD)&lt;/th&gt;
&lt;th&gt;% of total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spot EC2 (build agents)&lt;/td&gt;
&lt;td&gt;$11,200&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 cache storage&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront egress&lt;/td&gt;
&lt;td&gt;$1,140&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM eval API calls (OpenAI + Anthropic)&lt;/td&gt;
&lt;td&gt;$3,420&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;$16,650&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM line is the one nobody expected. We run automated PR review on a subset of changes, plus regression evals on our search ranking service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three things that actually mattered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Content-addressed cache keys
&lt;/h3&gt;

&lt;p&gt;We had cache keys like &lt;code&gt;node_modules_v3_${branch_name}_${os}&lt;/code&gt;. Keying on the branch name is already wrong, because every new branch starts with a cold cache. The worse bit was the &lt;code&gt;v3&lt;/code&gt; version tag that someone bumped six months ago for reasons nobody remembers.&lt;/p&gt;

&lt;p&gt;Switched to hashing the actual inputs: &lt;code&gt;package-lock.json&lt;/code&gt; content hash + Node version + OS. Standard stuff but we'd just never done it properly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:node:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;install"&lt;/span&gt;
    &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cache#v2.4.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;package-lock.json&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_modules&lt;/span&gt;
          &lt;span class="na"&gt;restore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
          &lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;runner.os&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}-node-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checksum&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'package-lock.json'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;restore: file&lt;/code&gt; bit matters. It means we only invalidate when &lt;code&gt;package-lock.json&lt;/code&gt; actually changes, not when the branch name changes. Cache hit rate on the install step went from 31% to 96% overnight.&lt;/p&gt;
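
&lt;p&gt;The idea behind a content-addressed key, sketched outside the plugin: hash the inputs that determine the output, and leave the branch name out entirely. This is an illustration, not what the cache plugin does internally. &lt;code&gt;cache_key&lt;/code&gt; and its &lt;code&gt;epoch&lt;/code&gt; prefix are hypothetical.&lt;/p&gt;

```python
import hashlib

# Build a cache key from the inputs that actually affect node_modules:
# lockfile content, Node version, and OS. The branch name is not an input.

def cache_key(lockfile_bytes, node_version, os_name, epoch="v1"):
    digest = hashlib.sha256()
    for part in (lockfile_bytes, node_version.encode(), os_name.encode()):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    # `epoch` is a manual escape hatch: bump it to invalidate everything.
    return f"{epoch}-{os_name}-node-{digest.hexdigest()[:16]}"


lock = b'{"lockfileVersion": 3}'
main_key = cache_key(lock, "20.11.1", "linux")
feature_key = cache_key(lock, "20.11.1", "linux")       # other branch, same inputs
bumped_key = cache_key(lock + b" ", "20.11.1", "linux")  # lockfile changed

print(main_key == feature_key)  # True
print(main_key == bumped_key)   # False
```

&lt;p&gt;Two branches with the same lockfile share a cache entry, and a one-byte lockfile change gets a fresh one.&lt;/p&gt;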

&lt;h3&gt;
  
  
  2. A warmer tier between memory and S3
&lt;/h3&gt;

&lt;p&gt;S3 is cheap but the round-trip from &lt;code&gt;ap-southeast-2&lt;/code&gt; agents to S3 is about 18ms for small objects, and we were pulling thousands of them per build. We added an &lt;code&gt;r6gd.large&lt;/code&gt; instance with NVMe local storage as an in-region warm cache. Agents check there first, fall through to S3.&lt;/p&gt;

&lt;p&gt;Cost: about $180/month for the warm cache instance. Saves us roughly $1,000/month in CloudFront egress (that line dropped from $1,140 to $140) because most cache reads never leave the VPC now.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The bad pre-commit hook
&lt;/h3&gt;

&lt;p&gt;This one is embarrassing. Someone added a pre-commit hook two years ago that ran &lt;code&gt;find . -name "*.pyc" -delete&lt;/code&gt; before every test invocation. On a clean checkout this does nothing useful. On a cached checkout it deletes all the compiled Python bytecode, forcing Python to recompile on every test run. Average test step went from 4m20s to 2m45s after deleting eight lines of bash.&lt;/p&gt;

&lt;p&gt;I genuinely could not believe it. We'd been paying for that for two years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM bit
&lt;/h2&gt;

&lt;p&gt;The $3,420 LLM line was harder to chip away at because the calls themselves are useful. What we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routed the PR review traffic through an AI gateway (we use &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, which gives us semantic caching and a single endpoint) so identical or near-identical review prompts hit cache instead of provider&lt;/li&gt;
&lt;li&gt;Moved the search ranking evals to a nightly batch rather than per-PR&lt;/li&gt;
&lt;li&gt;Switched the bulk of the review traffic to a cheaper model and reserved the expensive one for changes touching &lt;code&gt;/security/*&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Semantic cache hit rate on PR review prompts settled around 34%, which doesn't sound massive but the prompts that hit cache tend to be the bigger ones (boilerplate "review this dependency bump" type stuff), so the dollar impact was bigger than the hit rate suggests.&lt;/p&gt;

&lt;p&gt;Final LLM line came down to $1,180/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we landed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spot EC2&lt;/td&gt;
&lt;td&gt;$11,200&lt;/td&gt;
&lt;td&gt;$1,820&lt;/td&gt;
&lt;td&gt;-84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 + warm cache&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;$1,070&lt;/td&gt;
&lt;td&gt;+20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront egress&lt;/td&gt;
&lt;td&gt;$1,140&lt;/td&gt;
&lt;td&gt;$140&lt;/td&gt;
&lt;td&gt;-88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM API&lt;/td&gt;
&lt;td&gt;$3,420&lt;/td&gt;
&lt;td&gt;$1,180&lt;/td&gt;
&lt;td&gt;-65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;$16,650&lt;/td&gt;
&lt;td&gt;$4,210&lt;/td&gt;
&lt;td&gt;-75%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cache hit rate: 91.2% weighted.&lt;/p&gt;
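
&lt;p&gt;If you want to sanity-check the table, the arithmetic is short enough to script. The figures are copied from the rows above; &lt;code&gt;pct_change&lt;/code&gt; is just a helper for this example.&lt;/p&gt;

```python
# Recompute the "Change" column and the totals from the before/after figures.

before = {"Spot EC2": 11200, "S3 + warm cache": 890,
          "CloudFront egress": 1140, "LLM API": 3420}
after = {"Spot EC2": 1820, "S3 + warm cache": 1070,
         "CloudFront egress": 140, "LLM API": 1180}


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100


for name in before:
    print(f"{name}: {pct_change(before[name], after[name]):+.0f}%")

total_before = sum(before.values())
total_after = sum(after.values())
print(f"Total: {total_before} to {total_after}, "
      f"{pct_change(total_before, total_after):+.0f}%")
# Total: 16650 to 4210, -75%
```

&lt;p&gt;The one line that went up is storage, and that's the $180 warm cache instance.&lt;/p&gt;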

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;The warm cache tier is a single point of failure. If that &lt;code&gt;r6gd.large&lt;/code&gt; dies, we fall through to S3 cleanly but builds slow down by ~40 seconds each until we replace it. For us that's fine because spot interruption is more common than instance failure anyway. For a smaller team I'd skip it.&lt;/p&gt;

&lt;p&gt;Content-addressed keys made cache busting harder for the rare case where you legitimately want to invalidate everything. We added a manual &lt;code&gt;BUILDKITE_CACHE_EPOCH&lt;/code&gt; env var so a human can force-invalidate when needed. Used it twice in three months.&lt;/p&gt;

&lt;p&gt;The pre-commit hook thing wasn't a tooling problem. It was institutional knowledge rot. There's no caching strategy that protects you from someone deleting your bytecode every commit. You need humans to actually read what runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://bazel.build/remote/caching" rel="noopener noreferrer"&gt;Bazel remote caching docs&lt;/a&gt; — even if you don't use Bazel, their model is worth understanding&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/buildkite-plugins/cache-buildkite-plugin" rel="noopener noreferrer"&gt;Buildkite agent caching plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html" rel="noopener noreferrer"&gt;The AWS spot instance interruption guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.blog/engineering/" rel="noopener noreferrer"&gt;GitHub's writeup on their Actions cache architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt; if you're routing LLM traffic and curious about prompt-level caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go audit your hooks. Seriously. You probably have one of these.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Chaos testing your CI runner fleet when half the jobs call an LLM</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Tue, 26 May 2026 13:25:36 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/chaos-testing-your-ci-runner-fleet-when-half-the-jobs-call-an-llm-2ab4</link>
      <guid>https://dev.to/claire_nguyen/chaos-testing-your-ci-runner-fleet-when-half-the-jobs-call-an-llm-2ab4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We started injecting LLM provider failures into our Buildkite agent fleet during scheduled game days. Found out our "retry on 5xx" logic was happily burning $80/hr re-sending the same 200k-token context to Anthropic during a brownout. Putting Bifrost in front of the agents fixed the obvious stuff. The chaos testing exposed the non-obvious stuff.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right, story time. We run a fair-sized fleet of Buildkite agents on EC2, and over the last 18 months maybe 30% of jobs started touching an LLM somewhere. Code review bots. Doc generation. A weird internal thing that summarises flaky test runs. The build itself is deterministic. The LLM calls inside the build are not.&lt;/p&gt;

&lt;p&gt;When OpenAI had its multi-hour wobble in March, our p99 build time went from 4 minutes to 47. Half the queue stalled. We hadn't tested for it because nothing in our chaos playbook accounted for "third-party inference API returns 200 but takes 90 seconds."&lt;/p&gt;

&lt;p&gt;So we built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we were already doing wrong
&lt;/h2&gt;

&lt;p&gt;The original setup was the obvious thing. Each agent had an &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; baked into the AMI. Build scripts called the API directly. Retries were whatever the SDK gave us by default.&lt;/p&gt;

&lt;p&gt;Three problems showed up the first time we ran a proper failure injection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SDK default retry was 2 attempts with exponential backoff. On a 200k-token prompt at $3/M input tokens, that's 60 cents per retry. Multiply by 800 concurrent agents during a brownout and you do the maths.&lt;/li&gt;
&lt;li&gt;We had no circuit breaker. Agents kept dialling a dead provider for the full 10-minute job timeout.&lt;/li&gt;
&lt;li&gt;No visibility into which build steps were calling which model. The bill arrived monthly. The blame arrived never.&lt;/li&gt;
&lt;/ol&gt;
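
&lt;p&gt;Doing the maths from point 1, as a rough sketch. It assumes "2 attempts" means two re-sends per job, that every agent exhausts them once, and that each re-send is billed at the input rate. &lt;code&gt;retry_cost_usd&lt;/code&gt; is a made-up helper.&lt;/p&gt;

```python
# Cost of one wave of retries across the fleet, using the numbers from the list.

def retry_cost_usd(prompt_tokens, usd_per_million_input, retries, agents):
    """Return (cost of one re-send, cost of all re-sends across all agents)."""
    per_attempt = prompt_tokens / 1_000_000 * usd_per_million_input
    return per_attempt, per_attempt * retries * agents


per_attempt, wave = retry_cost_usd(
    prompt_tokens=200_000, usd_per_million_input=3.0, retries=2, agents=800
)
print(f"per re-send: ${per_attempt:.2f}")      # per re-send: $0.60
print(f"one wave, whole fleet: ${wave:,.2f}")  # one wave, whole fleet: $960.00
```

&lt;p&gt;That's one wave. A brownout that lasts an hour gives every queued job a chance to do it again.&lt;/p&gt;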

&lt;h2&gt;
  
  
  The game day setup
&lt;/h2&gt;

&lt;p&gt;We run game days on a staging fleet that mirrors prod cluster sizing. The injection is done with a tiny toxiproxy sidecar that sits between the agent and the outbound LLM endpoint. Three failure modes we rotate through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brownout&lt;/strong&gt;: 30% of requests return 429 with a Retry-After of 60s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slowdown&lt;/strong&gt;: every request gets 15s of latency added&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard down&lt;/strong&gt;: 100% return 503 for 8 minutes, then recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first time we ran the brownout scenario against our naive setup, we got a Slack page from finance before the game day was over. They'd seen the cost spike in their hourly dashboard. Embarrassing. Also, exactly the point of the exercise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting a gateway in front
&lt;/h2&gt;

&lt;p&gt;We moved to running &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; as a sidecar on each agent host. The agents talk to &lt;code&gt;localhost:8080&lt;/code&gt; with the OpenAI SDK and Bifrost handles the actual provider calls. Drop-in replacement, no code changes in the build scripts.&lt;/p&gt;

&lt;p&gt;The config is boring, which is what you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_SECONDARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
    &lt;span class="na"&gt;backup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things this actually solved during our next game day:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback worked without code changes.&lt;/strong&gt; When toxiproxy killed OpenAI, builds kept moving by routing to Anthropic. Build time bumped maybe 20%. Nobody paged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Prometheus metrics gave us per-pipeline cost visibility.&lt;/strong&gt; We could finally see that one team's "summarise the test logs" step was responsible for 40% of our LLM spend. Conversation with that team was much easier with numbers attached.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the gateway didn't fix
&lt;/h2&gt;

&lt;p&gt;Here's the honest bit. The gateway didn't solve our retry-cost problem on its own. Bifrost's fallback config is good, but if your build script is calling the API in a loop and not respecting the 429s coming back, you'll still burn money. We had to write our own thin wrapper in the build pipeline to bail out of the LLM step after 2 failures and fall back to a heuristic. Gateway gave us the signals. The build logic still has to do the right thing with them.&lt;/p&gt;
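
&lt;p&gt;The wrapper is roughly this shape. It's a simplified sketch, not our actual pipeline code. &lt;code&gt;run_llm_step&lt;/code&gt;, &lt;code&gt;LLMStepError&lt;/code&gt; and the tail-of-the-log heuristic are stand-ins for illustration.&lt;/p&gt;

```python
# Give the LLM a bounded number of chances, then fall back to something
# cheap and deterministic so the build step never loops on a failing provider.

class LLMStepError(Exception):
    """Raised when the LLM call fails (rate limit, 5xx, timeout)."""


def run_llm_step(call_llm, heuristic, payload, max_failures=2):
    """Return (source, result); source is "llm" or "heuristic"."""
    for _ in range(max_failures):
        try:
            return "llm", call_llm(payload)
        except LLMStepError:
            continue  # counts as one failure; no unbounded retry loop
    return "heuristic", heuristic(payload)


calls = []


def always_rate_limited(payload):
    calls.append(payload)
    raise LLMStepError("429 Too Many Requests")


def tail_heuristic(log):
    # Crude stand-in for a summary: the last line of the log.
    return log.splitlines()[-1]


source, result = run_llm_step(
    always_rate_limited, tail_heuristic, "step 1 ok\nstep 2 FAILED: exit 1"
)
print(source, len(calls), result)  # heuristic 2 step 2 FAILED: exit 1
```

&lt;p&gt;The provider gets exactly two chances per step, so the worst-case spend per build is bounded no matter how long the brownout lasts.&lt;/p&gt;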

&lt;h2&gt;
  
  
  Honest comparison
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before settling. Quick read:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What we liked&lt;/th&gt;
&lt;th&gt;Where it didn't fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Massive provider list, well-known&lt;/td&gt;
&lt;td&gt;Python proxy meant another runtime on each agent host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;Slick analytics dashboard, mature observability&lt;/td&gt;
&lt;td&gt;SaaS-first, our security team wasn't keen on egress for build logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Single Go binary, drop-in OpenAI compat, semantic caching that actually saved us 22% on the doc-gen pipeline&lt;/td&gt;
&lt;td&gt;Smaller ecosystem, fewer integrations than LiteLLM, MCP gateway is enterprise-tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're already running LiteLLM happily, no reason to swap. We just preferred deploying one binary alongside the agent instead of a Python service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;A few things to be straight about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding a gateway adds a hop. We measured about 3-5ms overhead per call. Fine for our use case, might matter if you're doing latency-sensitive inference.&lt;/li&gt;
&lt;li&gt;Semantic caching is brilliant for repetitive build prompts (think "summarise this stack trace") but useless for anything with high-entropy input. Don't expect a free 50% cost cut.&lt;/li&gt;
&lt;li&gt;Self-hosted means you own the uptime of the gateway too. We run it as a sidecar so the blast radius is one agent, but if you centralise it, you've created a new SPOF.&lt;/li&gt;
&lt;li&gt;Game days take real time. Half a day to set up, half a day to run, two days of follow-up tickets. Worth it. Not free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest win wasn't any one feature. It was that we'd actually pulled the cables out before a real provider had a bad afternoon. "Never had an outage" usually means you've never tested your failure handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability and Prometheus metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Shopify/toxiproxy" rel="noopener noreferrer"&gt;Toxiproxy on GitHub&lt;/a&gt; for network failure injection&lt;/li&gt;
&lt;li&gt;&lt;a href="https://buildkite.com/blog" rel="noopener noreferrer"&gt;Buildkite's own writeup on agent fleet reliability&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>Game day on our build cluster: killing an AZ to test LLM flake detection</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Mon, 25 May 2026 13:22:18 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/game-day-on-our-build-cluster-killing-an-az-to-test-llm-flake-detection-dam</link>
      <guid>https://dev.to/claire_nguyen/game-day-on-our-build-cluster-killing-an-az-to-test-llm-flake-detection-dam</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We ran a game day on our Buildkite agent fleet where I yanked an entire AWS AZ while our LLM-based flake classifier was triaging failures. The classifier fell over because we'd wired it to a single OpenAI endpoint. Putting Bifrost in front fixed the failover hole and exposed two other bugs we hadn't seen.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right, so a few weeks back I was running a game day on our internal build cluster. About 800 agents spread across ap-southeast-2a, 2b, and 2c. The exercise was meant to test our LLM-powered flake detector under partial infrastructure failure. The detector reads a failed job log, classifies it as &lt;code&gt;flake | real | infra&lt;/code&gt;, and decides whether to auto-retry.&lt;/p&gt;

&lt;p&gt;I killed 2a. That was the plan. What wasn't the plan was the flake detector going completely dark within 90 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke
&lt;/h2&gt;

&lt;p&gt;We'd built the detector as a tiny Go service running on each agent host. It called OpenAI's &lt;code&gt;gpt-4o-mini&lt;/code&gt; directly. One endpoint, one API key, no retries beyond the SDK default. When 2a went down, our networking config rerouted egress through a NAT gateway that was already being throttled by the surge of retries from other services. Result: every flake classification request hung for 30 seconds, then timed out.&lt;/p&gt;

&lt;p&gt;CI pipelines didn't fail; they just stopped auto-retrying. So flakes started landing on engineers the same way real bugs did, with nothing filtering them out, and Slack lit up.&lt;/p&gt;

&lt;p&gt;The post-mortem was a bit embarrassing. We'd tested failover for the build database, the artifact store, the agent registration service. Hadn't tested failover for the thing classifying our test failures. Classic case of treating the LLM call as "just an API" instead of as a dependency that can fail in five different ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;I'd been keeping an eye on Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for a few months. It's an AI gateway written in Go that sits between your app and the providers. Single OpenAI-compatible endpoint, fallback rules, load balancing across keys, and a Prometheus metrics endpoint baked in. That last bit was what sold me, because our observability stack is already Prom + Grafana and I didn't fancy bolting on yet another exporter.&lt;/p&gt;

&lt;p&gt;Deployed it as a sidecar on the agent hosts, two replicas per AZ. Config looked roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_BACKUP&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
    &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-haiku-4-5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flake detector's only change was pointing its OpenAI base URL at &lt;code&gt;http://localhost:8080/v1&lt;/code&gt;. One line. No SDK swap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second game day
&lt;/h2&gt;

&lt;p&gt;Ran the same exercise two weeks later. Killed 2a again. The Bifrost sidecar on 2a stopped responding, the detector's HTTP client failed over to the 2b sidecar via our service mesh, and classification continued. The fallback rule kicked in for about 4% of requests when one OpenAI key got rate-limited by the surge — those routed to Anthropic and came back in roughly the same latency window.&lt;/p&gt;

&lt;p&gt;We didn't see zero impact. Tail latency on classifications jumped from p99 ~1.2s to p99 ~3.8s during the failover window. But nothing went dark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two bugs the gateway exposed
&lt;/h2&gt;

&lt;p&gt;The Prometheus metrics from Bifrost showed us things our app-level logging had been hiding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug one&lt;/strong&gt;: 12% of our "real bug" classifications were coming from one specific agent pool that runs Ruby tests. The model was getting truncated logs because we'd set &lt;code&gt;max_tokens&lt;/code&gt; too low on the &lt;em&gt;input&lt;/em&gt; side at some point, which cut the logs short before the model ever saw them, and nobody remembered. The per-provider token histograms in the metrics made it obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug two&lt;/strong&gt;: Our retry logic was doubling up. The agent was retrying on 429, and Bifrost was also retrying on 429, so the two layers multiplied each other and a single rate-limited request was costing us 4x the tokens. Fixed by turning off retries in our client and letting Bifrost handle them.&lt;/p&gt;
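
&lt;p&gt;To see how stacked retries get to that 4x, here's a minimal sketch of the arithmetic. The retry counts are assumed for illustration (one retry per layer), and &lt;code&gt;upstream_calls&lt;/code&gt; is a made-up helper, not something from Bifrost or the OpenAI SDK.&lt;/p&gt;

```python
def upstream_calls(client_retries, gateway_retries):
    """Worst-case upstream requests for one logical call when every attempt gets a 429.

    Each client attempt is a fresh request to the gateway, and the gateway
    runs its own full retry loop for each of them, so the layers multiply.
    """
    client_attempts = client_retries + 1
    gateway_attempts = gateway_retries + 1
    return client_attempts * gateway_attempts


# Assumed: one retry in the client, one retry in the gateway.
print(upstream_calls(client_retries=1, gateway_retries=1))  # 4
# Client retries switched off, gateway keeps its single retry.
print(upstream_calls(client_retries=0, gateway_retries=1))  # 2
```

&lt;p&gt;The general rule: retry in exactly one layer, otherwise the attempt counts multiply rather than add.&lt;/p&gt;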

&lt;h2&gt;
  
  
  Honest comparison
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before landing on Bifrost. Quick table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deploy as single binary&lt;/td&gt;
&lt;td&gt;Python, heavier&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;td&gt;Go binary, npx or Docker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prom metrics out of box&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Hosted dashboard&lt;/td&gt;
&lt;td&gt;Native endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallback config&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;UI + config&lt;/td&gt;
&lt;td&gt;YAML + Web UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS self-host&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity (Apr 2026)&lt;/td&gt;
&lt;td&gt;Highest, broad ecosystem&lt;/td&gt;
&lt;td&gt;Strong hosted product&lt;/td&gt;
&lt;td&gt;Younger, smaller community&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has way more community plugins and provider quirks already handled. If you're doing weird stuff with niche providers, it's still probably the safer pick. Portkey's hosted dashboards are nicer than what we built ourselves. Bifrost won for us because it's a single Go binary, it has native Prom metrics, and the latency overhead in our tests was under 2ms p50.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Adds a network hop. ~1-2ms p50, ~5ms p99 in our setup. Acceptable for flake classification, maybe not for tight inner loops.&lt;/li&gt;
&lt;li&gt;Another thing to monitor. We've now got Bifrost-down alerts in PagerDuty.&lt;/li&gt;
&lt;li&gt;Semantic caching (&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;) sounded great but we haven't enabled it — flake classification context is too specific for cache hits to be meaningful in our case.&lt;/li&gt;
&lt;li&gt;The Web UI is handy for fiddling locally, but we manage config via git like everything else, so we mostly ignore it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Game days for the LLM dependency in your CI aren't optional anymore if you're doing anything non-trivial. The LLM call is now a critical path component. Treat it like one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost fallbacks and retries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost Prometheus observability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://buildkite.com/docs/agent/v3" rel="noopener noreferrer"&gt;Buildkite agent architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;Principles of Chaos Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>Stop paying for idle GPUs in your CI: batching LLM eval jobs</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 22 May 2026 04:22:08 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/stop-paying-for-idle-gpus-in-your-ci-batching-llm-eval-jobs-26b0</link>
      <guid>https://dev.to/claire_nguyen/stop-paying-for-idle-gpus-in-your-ci-batching-llm-eval-jobs-26b0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.&lt;/p&gt;

&lt;p&gt;The cloud bill was not lovely.&lt;/p&gt;

&lt;p&gt;They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.&lt;/p&gt;

&lt;p&gt;I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with "GPU per job"
&lt;/h2&gt;

&lt;p&gt;CI runners are designed for stateless, throwaway compute. That model breaks the second you involve a 7B+ parameter model that takes 30-90 seconds to load into VRAM.&lt;/p&gt;

&lt;p&gt;Here's the rough breakdown of a typical eval job we measured:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Time (avg)&lt;/th&gt;
&lt;th&gt;GPU utilization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start (instance boot)&lt;/td&gt;
&lt;td&gt;45s&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model download from S3&lt;/td&gt;
&lt;td&gt;60s&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model load into VRAM&lt;/td&gt;
&lt;td&gt;25s&lt;/td&gt;
&lt;td&gt;~10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual inference (50 prompts)&lt;/td&gt;
&lt;td&gt;40s&lt;/td&gt;
&lt;td&gt;~85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result upload + teardown&lt;/td&gt;
&lt;td&gt;15s&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So out of about 3 minutes of billable GPU time, you're getting 40 seconds of useful work. That's brutal economics.&lt;/p&gt;
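
&lt;p&gt;If you want to run the same sums on your own jobs, here's a small sketch using the phase timings from the table above. The dictionary keys are just labels I've picked for the example.&lt;/p&gt;

```python
# Phase durations in seconds, copied from the table above.
phases = {
    "cold_start": 45,
    "model_download": 60,
    "vram_load": 25,
    "inference": 40,
    "upload_teardown": 15,
}

billable_seconds = sum(phases.values())
useful_fraction = phases["inference"] / billable_seconds

print(billable_seconds)                 # 185
print(round(useful_fraction * 100, 1))  # 21.6
```

&lt;p&gt;So roughly a fifth of what you pay for is inference. Everything else is overhead that a warm pool can amortise.&lt;/p&gt;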

&lt;h2&gt;
  
  
  Batching: the boring fix that works
&lt;/h2&gt;

&lt;p&gt;The trick isn't fancy. You stop spinning up a GPU per job and start treating the GPU like a long-lived service that consumes jobs from a queue.&lt;/p&gt;

&lt;p&gt;We run a small pool of g5.xlarge instances (usually 2-4 depending on load) that stay warm. Each runner has the model preloaded in VRAM. CI jobs push eval requests to an SQS queue, runners pull from the queue, batch up to N prompts per inference pass, and post results back.&lt;/p&gt;

&lt;p&gt;Rough sketch of the runner config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;instance_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;g5.xlarge&lt;/span&gt;
  &lt;span class="na"&gt;pool_size_min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;pool_size_max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;scale_metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;queue_depth&lt;/span&gt;
  &lt;span class="na"&gt;scale_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;  &lt;span class="c1"&gt;# jobs in queue&lt;/span&gt;

  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3.1-8b-instruct&lt;/span&gt;
    &lt;span class="na"&gt;preload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;keep_warm_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;

  &lt;span class="na"&gt;batching&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;
    &lt;span class="na"&gt;max_wait_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;

  &lt;span class="na"&gt;job_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;smoke_eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
      &lt;span class="na"&gt;max_prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;full_regression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low&lt;/span&gt;
      &lt;span class="na"&gt;max_prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nightly_only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;max_wait_ms&lt;/code&gt; is doing the heavy lifting. The runner waits up to 2 seconds to gather a batch before firing inference. For CI, 2 seconds of latency is nothing. For inference throughput, it's everything.&lt;/p&gt;
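
&lt;p&gt;Here's a rough offline simulation of that batching rule, assuming jobs are described only by their arrival time in milliseconds. It is not the runner's actual loop (a real runner fires on a timer rather than on the next arrival), and &lt;code&gt;group_into_batches&lt;/code&gt; is a hypothetical name. The defaults mirror &lt;code&gt;max_batch_size&lt;/code&gt; and &lt;code&gt;max_wait_ms&lt;/code&gt; from the config.&lt;/p&gt;

```python
def group_into_batches(arrivals_ms, max_batch_size=16, max_wait_ms=2000):
    """Group sorted arrival times (ms) into batches.

    A batch opens when its first job arrives. It is closed when it is full,
    or when the next job shows up after the wait window has elapsed.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        window_closed = bool(current) and t - current[0] >= max_wait_ms
        if window_closed or len(current) == max_batch_size:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Three jobs land inside the first 2s window, two more arrive later.
print(group_into_batches([0, 100, 500, 2500, 2600]))
# [[0, 100, 500], [2500, 2600]]
```

&lt;p&gt;Feeding in a day's worth of real arrival timestamps is a cheap way to see what batch sizes a given window would produce before touching the runner config.&lt;/p&gt;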

&lt;h2&gt;
  
  
  Routing matters too
&lt;/h2&gt;

&lt;p&gt;Once you've got a warm pool, you might as well route different model calls through one place. We have eval suites that hit a mix of self-hosted Llama, Claude via API, and OpenAI. Instead of each CI job authenticating separately and managing keys, we put a gateway in front.&lt;/p&gt;

&lt;p&gt;There's a bunch of options here. LiteLLM is popular, Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) is another one that does the same kind of multi-provider routing with rate limit handling, and you can roll your own with a thin FastAPI wrapper if you're feeling keen. The point is you stop scattering API keys across twenty CI configs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job classification: don't run a full eval on every commit
&lt;/h2&gt;

&lt;p&gt;This was the biggest single win, honestly. We split eval jobs into tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smoke evals&lt;/strong&gt;: 5-10 prompts, run on every PR, catches obvious regressions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard evals&lt;/strong&gt;: 50-100 prompts, run on merge to main&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full regression&lt;/strong&gt;: 500+ prompts, run nightly on main&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before this, every PR triggered the full 500-prompt suite because nobody had bothered to think about what they actually needed to know per PR. The answer is "did this change break something obvious?", not "is this model production-ready?"&lt;/p&gt;

&lt;p&gt;Cut our GPU-hours by about 40% just from that change alone, before any of the batching work.&lt;/p&gt;
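
&lt;p&gt;The tier split itself is tiny. A minimal sketch, assuming the CI system hands you an event name: the event strings, &lt;code&gt;TIERS&lt;/code&gt; and &lt;code&gt;pick_tier&lt;/code&gt; are all hypothetical, and the prompt caps are the upper ends of the ranges listed above.&lt;/p&gt;

```python
# Hypothetical event names; map them to whatever your CI actually emits.
TIERS = {
    "smoke": {"max_prompts": 10, "trigger": "pull_request"},
    "standard": {"max_prompts": 100, "trigger": "merge_to_main"},
    "full_regression": {"max_prompts": 500, "trigger": "nightly"},
}


def pick_tier(event):
    """Map a CI event name to an eval tier; unknown events get the cheapest tier."""
    for name, tier in TIERS.items():
        if tier["trigger"] == event:
            return name
    return "smoke"


print(pick_tier("pull_request"))   # smoke
print(pick_tier("nightly"))        # full_regression
print(pick_tier("manual_rerun"))   # smoke
```

&lt;p&gt;Defaulting unknown events to the smoke tier is a deliberate choice here: a misconfigured trigger should cost you ten prompts, not five hundred.&lt;/p&gt;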

&lt;h2&gt;
  
  
  What the numbers looked like
&lt;/h2&gt;

&lt;p&gt;After about three weeks of running the new setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU-hours per day&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg PR feedback time&lt;/td&gt;
&lt;td&gt;4m 20s&lt;/td&gt;
&lt;td&gt;1m 50s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly GPU spend (eval only)&lt;/td&gt;
&lt;td&gt;~$8,200&lt;/td&gt;
&lt;td&gt;~$3,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue p99 wait time&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Faster &lt;em&gt;and&lt;/em&gt; cheaper, which is the dream combination you almost never get.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Nothing's free, so here's what actually hurt:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold start on scale-up is still painful.&lt;/strong&gt; When the queue spikes past what the warm pool can handle, the new runners take 90+ seconds to come online with the model loaded. We mitigated this by setting &lt;code&gt;scale_threshold&lt;/code&gt; more aggressively than felt comfortable, which means we're occasionally paying for idle capacity. You can't have both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batching adds latency variance.&lt;/strong&gt; A job that arrives just after a batch fires can wait up to the full &lt;code&gt;max_wait_ms&lt;/code&gt; before its own batch goes out. For CI this is fine. For production inference it might not be, so don't blindly copy this config to your prod inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pool exhaustion is a real failure mode.&lt;/strong&gt; If your queue grows faster than you can scale, jobs back up. We had a Friday afternoon where a misconfigured test suite generated 4,000 eval jobs in 10 minutes and the queue depth alert woke me up at 11pm. Add circuit breakers and per-team quotas early, not after the first incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model updates are now an event.&lt;/strong&gt; When you preload models, swapping versions means a rolling restart of the pool. We do this during low-traffic windows but it's added operational overhead that didn't exist with the per-job model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;SQS as a job queue for ML workloads&lt;/a&gt; - the AWS docs are surprisingly readable on this&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM batching internals&lt;/a&gt; - if you want to understand continuous batching at the inference layer&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;Karpenter for GPU autoscaling&lt;/a&gt; - if you're on EKS and want smarter node scaling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/blog" rel="noopener noreferrer"&gt;The Cost of LLM Inference&lt;/a&gt; - good benchmarks for sizing decisions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM gateway docs&lt;/a&gt; - one of the multi-provider routing options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No worries if your setup looks different. The general shape holds: warm pools, batched jobs, classified workloads. Apply where it fits.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>mlops</category>
      <category>llm</category>
      <category>sre</category>
    </item>
    <item>
      <title>Putting an LLM Gateway in Front of Our Build Agents</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Thu, 21 May 2026 13:22:18 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/putting-an-llm-gateway-in-front-of-our-build-agents-3jbb</link>
      <guid>https://dev.to/claire_nguyen/putting-an-llm-gateway-in-front-of-our-build-agents-3jbb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The deciding factor wasn't features, it was the 11 microsecond overhead and the fact it didn't fall over when one provider had a wobbly afternoon.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right, so a few weeks back I got pulled into a project to wire LLM calls into some internal tooling we use for triaging flaky builds. Nothing fancy, mostly summarising failure logs and suggesting which test owner to ping. The catch was that this thing sits on the hot path of our build feedback loop, and our SRE on-call rotation was very clear: if your shiny AI feature adds latency to my builds, I will personally come and uninstall it.&lt;/p&gt;

&lt;p&gt;Fair enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with calling providers directly
&lt;/h2&gt;

&lt;p&gt;First pass was the obvious one. SDK calls straight to Anthropic, with OpenAI as a fallback wrapped in a try/except. Worked fine in dev. Then we hit a real Tuesday afternoon where Anthropic had a regional hiccup, our fallback logic kicked in, and we discovered our "fallback" was actually just retrying the same broken endpoint because someone (me) had copy-pasted the client config.&lt;/p&gt;

&lt;p&gt;Classic.&lt;/p&gt;

&lt;p&gt;So we needed a proper gateway. The shortlist was Bifrost, LiteLLM, and Kong with an AI plugin. I'd used Kong before for regular API stuff so I was leaning that way out of habit, but I forced myself to actually test the three of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measured
&lt;/h2&gt;

&lt;p&gt;I set up a quick bench on an m6i.large with a mock upstream so we weren't measuring provider latency. Ran 50k requests at modest concurrency. Here's roughly what we got.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Overhead per request&lt;/th&gt;
&lt;th&gt;Memory steady state&lt;/th&gt;
&lt;th&gt;Setup time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct SDK&lt;/td&gt;
&lt;td&gt;~0 µs&lt;/td&gt;
&lt;td&gt;80 MB&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;~11 µs&lt;/td&gt;
&lt;td&gt;95 MB&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;~2.1 ms&lt;/td&gt;
&lt;td&gt;180 MB&lt;/td&gt;
&lt;td&gt;20 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kong + AI plugin&lt;/td&gt;
&lt;td&gt;~1.4 ms&lt;/td&gt;
&lt;td&gt;220 MB&lt;/td&gt;
&lt;td&gt;90 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 11 microsecond number for Bifrost is what they claim on their repo and honestly I assumed it was marketing fluff until I saw it on our own bench. It's Go, runs as a single binary, and the gateway overhead genuinely disappears into the noise of the actual LLM call.&lt;/p&gt;

&lt;p&gt;LiteLLM is Python and you can feel it. It's fine for a lot of use cases and the feature set is honestly massive, but on our hot path that extra couple of milliseconds per call added up across thousands of build steps.&lt;/p&gt;

&lt;p&gt;Kong is Kong. Powerful, but it's a full API gateway with an AI plugin bolted on, not an LLM gateway. We didn't need the rest of Kong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The config that actually mattered
&lt;/h2&gt;

&lt;p&gt;The bit that sold me wasn't the latency. It was weighted routing with proper failover. Here's a stripped down version of what we landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;anthropic_primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
  &lt;span class="na"&gt;openai_secondary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build_triage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;anthropic_primary&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;openai_secondary&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;timeout_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;

&lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;semantic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
    &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That semantic cache block is doing a lot of work. Build failures rhyme. A flaky test that times out today probably timed out last week with a slightly different log signature, and the cache catches that fuzzy match instead of paying for another LLM call. We saw cache hit rates around 38% in the first fortnight, which translates directly into provider bill reduction.&lt;/p&gt;
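
&lt;p&gt;For anyone who hasn't met semantic caching before, the lookup is roughly "find the nearest stored prompt embedding and reuse its answer if it's close enough". This is a generic sketch of that idea with toy two-dimensional vectors, not Bifrost's implementation. &lt;code&gt;cosine&lt;/code&gt; and &lt;code&gt;lookup&lt;/code&gt; are hypothetical names, the TTL is left out, and it assumes non-zero vectors.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def lookup(cache, query_vec, threshold=0.92):
    """Return the cached answer nearest to query_vec, or None below the threshold.

    cache is a list of (embedding, answer) pairs.
    """
    best = max(cache, key=lambda entry: cosine(entry[0], query_vec), default=None)
    if best is not None and cosine(best[0], query_vec) >= threshold:
        return best[1]
    return None


cache = [([1.0, 0.0], "timeout flake, retry")]
print(lookup(cache, [0.99, 0.05]))  # timeout flake, retry
print(lookup(cache, [0.0, 1.0]))    # None
```

&lt;p&gt;The threshold is the whole trade-off: lower it and you save more calls but start serving answers to questions nobody quite asked.&lt;/p&gt;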

&lt;p&gt;Virtual keys were the other thing that mattered for us. We could hand different teams their own virtual key with its own rate limit and budget, all pointing at the same upstream credentials. No more chasing engineers to rotate keys when someone's notebook leaked one to a gist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failover that actually works
&lt;/h2&gt;

&lt;p&gt;The thing I tested most paranoidly was the failover. I literally just killed the Anthropic endpoint at the network level mid-request, expecting some ugly behaviour. Bifrost retried against OpenAI inside the same request boundary, the caller got a response, and the metrics endpoint showed the failover counter tick. No drama.&lt;/p&gt;

&lt;p&gt;Reckon this is the thing most people get wrong when they roll their own. Failover is easy to write and hard to test. Having it as a config flag means I can write a game day scenario where we knock providers offline and watch the gateway do its job, instead of hoping our wrapper code holds up.&lt;/p&gt;
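
&lt;p&gt;For contrast, here's roughly what a hand-rolled version looks like when it's done properly. It's a sketch under assumptions, not what Bifrost does internally: &lt;code&gt;ProviderError&lt;/code&gt;, &lt;code&gt;call_with_failover&lt;/code&gt; and the two stand-in provider functions are invented for the example, and real code would wrap the actual SDK clients.&lt;/p&gt;

```python
class ProviderError(Exception):
    """Raised by a provider callable when its upstream call fails."""


def call_with_failover(prompt, providers):
    """Try each (name, callable) pair in order and return the first success.

    Each callable must wrap its OWN client and endpoint. Reusing one client
    config for both entries is exactly the copy-paste bug described earlier.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-ins for real SDK calls, for demonstration only.
def flaky_primary(prompt):
    raise ProviderError("connection reset")


def healthy_secondary(prompt):
    return "summary of: " + prompt


used, answer = call_with_failover(
    "test timed out after 30s",
    [("anthropic_primary", flaky_primary), ("openai_secondary", healthy_secondary)],
)
print(used)    # openai_secondary
print(answer)  # summary of: test timed out after 30s
```

&lt;p&gt;Even this small version needs its own tests for the all-providers-down path, which is the part that tends to get skipped.&lt;/p&gt;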

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Not all sunshine.&lt;/p&gt;

&lt;p&gt;The dashboard is functional but it's no Grafana. We export Prometheus metrics out of it and build our own panels, which is what we wanted anyway, but if you're hoping for a polished UI out of the box you'll be doing some work.&lt;/p&gt;

&lt;p&gt;The plugin ecosystem is smaller than LiteLLM's. If you need some niche provider or a very specific transformation, LiteLLM probably has it already and Bifrost might need you to write a small bit of Go. For our needs (Anthropic, OpenAI, one self-hosted model) this was a non-issue.&lt;/p&gt;

&lt;p&gt;Go binary means your ops team needs to be cool with running a Go service. If you're an all-Python shop and your team is allergic to anything else, that's a real friction point even though the binary itself is genuinely fire-and-forget.&lt;/p&gt;

&lt;p&gt;And semantic caching can bite you. If your prompts are doing something where a "similar" prompt actually needs a different answer (think anything with user-specific context smuggled in), you'll want to disable it for those routes. We learned this on the second day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it sits now
&lt;/h2&gt;

&lt;p&gt;It runs as a sidecar to our build orchestration service. Two replicas behind an internal load balancer, Prometheus scraping the metrics endpoint, and PagerDuty wired to the failover counter so we know when a provider is having a bad day before our users do. Total memory footprint across the cluster is a rounding error compared to the workloads it serves.&lt;/p&gt;

&lt;p&gt;The on-call SRE has not, so far, come to uninstall it. I'll take the win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost repo and docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/proxy/quick_start" rel="noopener noreferrer"&gt;LiteLLM proxy documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.konghq.com/hub/kong-inc/ai-proxy/" rel="noopener noreferrer"&gt;Kong AI Gateway overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/workbook/non-abstract-design/" rel="noopener noreferrer"&gt;Google SRE Workbook chapter on graceful degradation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/guides/go-application/" rel="noopener noreferrer"&gt;Prometheus client libraries for Go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>infrastructure</category>
      <category>devops</category>
      <category>sre</category>
      <category>llm</category>
    </item>
    <item>
      <title>Putting an LLM Gateway in Front of Our Build Agents: Why We Picked Bifrost</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Tue, 19 May 2026 04:22:40 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/putting-an-llm-gateway-in-front-of-our-build-agents-why-we-picked-bifrost-130g</link>
      <guid>https://dev.to/claire_nguyen/putting-an-llm-gateway-in-front-of-our-build-agents-why-we-picked-bifrost-130g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The deciding factor wasn't features, it was the 11 microsecond overhead and the fact it didn't fall over when one provider had a wobbly afternoon.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right, so a few weeks back I got pulled into a project to wire LLM calls into some internal tooling we use for triaging flaky builds. Nothing fancy, mostly summarising failure logs and suggesting which test owner to ping. The catch was that this thing sits on the hot path of our build feedback loop, and our SRE on-call rotation was very clear: if your shiny AI feature adds latency to my builds, I will personally come and uninstall it.&lt;/p&gt;

&lt;p&gt;Fair enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with calling providers directly
&lt;/h2&gt;

&lt;p&gt;First pass was the obvious one. SDK calls straight to Anthropic, with OpenAI as a fallback wrapped in a try/except. Worked fine in dev. Then we hit a real Tuesday afternoon where Anthropic had a regional hiccup, our fallback logic kicked in, and we discovered our "fallback" was actually just retrying the same broken endpoint because someone (me) had copy-pasted the client config.&lt;/p&gt;

&lt;p&gt;Classic.&lt;/p&gt;

&lt;p&gt;So we needed a proper gateway. The shortlist was Bifrost, LiteLLM, and Kong with an AI plugin. I'd used Kong before for regular API stuff so I was leaning that way out of habit, but I forced myself to actually test the three of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measured
&lt;/h2&gt;

&lt;p&gt;I set up a quick bench on an m6i.large with a mock upstream so we weren't measuring provider latency. Ran 50k requests at modest concurrency. Here's roughly what we got.&lt;/p&gt;
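
&lt;p&gt;If you want to reproduce an overhead number like this, one simple approach is to compare median latency through the gateway against median latency straight to the mock. A minimal sketch of that arithmetic, not our actual bench harness: the &lt;code&gt;overhead_us&lt;/code&gt; helper and the sample values are made up for illustration.&lt;/p&gt;

```python
from statistics import median


def overhead_us(direct_latencies_us, gateway_latencies_us):
    """Median added latency in microseconds: gateway path minus direct path."""
    return median(gateway_latencies_us) - median(direct_latencies_us)


# Illustrative samples only, in microseconds (not real measurements).
direct = [410.0, 405.0, 420.0, 415.0, 400.0]
via_gateway = [421.0, 416.0, 431.0, 426.0, 411.0]

print(overhead_us(direct, via_gateway))
```

&lt;p&gt;Medians keep a single slow outlier from dominating the result, which matters when the thing you're measuring is this small.&lt;/p&gt;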

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Overhead per request&lt;/th&gt;
&lt;th&gt;Memory steady state&lt;/th&gt;
&lt;th&gt;Setup time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct SDK&lt;/td&gt;
&lt;td&gt;~0 µs&lt;/td&gt;
&lt;td&gt;80 MB&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;~11 µs&lt;/td&gt;
&lt;td&gt;95 MB&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;~2.1 ms&lt;/td&gt;
&lt;td&gt;180 MB&lt;/td&gt;
&lt;td&gt;20 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kong + AI plugin&lt;/td&gt;
&lt;td&gt;~1.4 ms&lt;/td&gt;
&lt;td&gt;220 MB&lt;/td&gt;
&lt;td&gt;90 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 11 microsecond number for Bifrost is what they claim on their repo and honestly I assumed it was marketing fluff until I saw it on our own bench. It's Go, runs as a single binary, and the gateway overhead genuinely disappears into the noise of the actual LLM call.&lt;/p&gt;

&lt;p&gt;LiteLLM is Python and you can feel it. It's fine for a lot of use cases and the feature set is honestly massive, but on our hot path that extra couple of milliseconds per call added up across thousands of build steps.&lt;/p&gt;

&lt;p&gt;Kong is Kong. Powerful, but it's a full API gateway with an AI plugin bolted on, not an LLM gateway. We didn't need the rest of Kong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The config that actually mattered
&lt;/h2&gt;

&lt;p&gt;The bit that sold me wasn't the latency. It was weighted routing with proper failover. Here's a stripped down version of what we landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;anthropic_primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
  &lt;span class="na"&gt;openai_secondary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build_triage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;anthropic_primary&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;openai_secondary&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;timeout_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;

&lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;semantic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
    &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That semantic cache block is doing a lot of work. Build failures rhyme. A flaky test that times out today probably timed out last week with a slightly different log signature, and the cache catches that fuzzy match instead of paying for another LLM call. We saw cache hit rates around 38% in the first fortnight, which translates directly into provider bill reduction.&lt;/p&gt;
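
&lt;p&gt;To make the idea concrete, here's a toy sketch of a semantic cache with the same two knobs as the config above (similarity threshold and TTL). Everything in it is an assumption for illustration: the bag-of-words &lt;code&gt;embed&lt;/code&gt; function stands in for a real embedding model, the class and prompts are hypothetical, and none of this is Bifrost's actual implementation.&lt;/p&gt;

```python
import math
import time
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. A real setup would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, similarity_threshold=0.92, ttl_seconds=3600):
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []  # list of (vector, response, stored_at)

    def put(self, prompt, response, now=None):
        now = time.time() if now is None else now
        self.entries.append((embed(prompt), response, now))

    def get(self, prompt, now=None):
        """Return a cached response for a similar, unexpired prompt, else None."""
        now = time.time() if now is None else now
        vector = embed(prompt)
        for stored_vector, response, stored_at in self.entries:
            if now - stored_at >= self.ttl_seconds:
                continue  # entry has expired
            if cosine(vector, stored_vector) >= self.similarity_threshold:
                return response
        return None


last_week = "integration test test_checkout timed out after 30s waiting for payment service on agent 12"
today = "integration test test_checkout timed out after 30s waiting for payment service on agent 14"

cache = SemanticCache()
cache.put(last_week, "Likely flaky: payment service slow to start", now=0)
print(cache.get(today, now=60))
```

&lt;p&gt;The two log lines differ by one token out of fourteen, so they clear the 0.92 threshold and the second one is served from cache. The linear scan is fine for a sketch; anything real would want a vector index.&lt;/p&gt;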

&lt;p&gt;Virtual keys were the other thing that mattered for us. We could hand different teams their own virtual key with its own rate limit and budget, all pointing at the same upstream credentials. No more chasing engineers to rotate keys when someone's notebook leaked one to a gist.&lt;/p&gt;
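
&lt;p&gt;Roughly, the mental model is a lookup table from virtual key to limits, with the real credential kept on the gateway side. A minimal sketch under that assumption: the key names, limits, and the &lt;code&gt;authorise&lt;/code&gt; and &lt;code&gt;revoke&lt;/code&gt; helpers are hypothetical, rate-limit enforcement is left out, and this isn't how Bifrost stores things internally.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class VirtualKey:
    team: str
    requests_per_minute: int  # recorded here, enforcement omitted in this sketch
    monthly_budget_usd: float
    spent_usd: float = 0.0


# The one real upstream credential lives in the gateway's environment only.
UPSTREAM_ENV_VAR = "ANTHROPIC_KEY"

VIRTUAL_KEYS = {
    "vk-build-triage": VirtualKey("build-triage", 600, 500.0),
    "vk-notebooks": VirtualKey("data-notebooks", 60, 50.0),
}


def authorise(virtual_key, estimated_cost_usd):
    """Return the team name if the key exists and has budget left, else None."""
    key = VIRTUAL_KEYS.get(virtual_key)
    if key is None:
        return None
    if key.spent_usd + estimated_cost_usd > key.monthly_budget_usd:
        return None
    key.spent_usd += estimated_cost_usd
    return key.team


def revoke(virtual_key):
    """Drop a leaked virtual key. The upstream credential is untouched."""
    VIRTUAL_KEYS.pop(virtual_key, None)


print(authorise("vk-notebooks", 10.0))
```

&lt;p&gt;The point of the indirection is that a leaked key costs you one &lt;code&gt;revoke&lt;/code&gt; call, not a credential rotation across every team.&lt;/p&gt;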

&lt;h2&gt;
  
  
  Failover that actually works
&lt;/h2&gt;

&lt;p&gt;The thing I tested most paranoidly was the failover. I literally just killed the Anthropic endpoint at the network level mid-request, expecting some ugly behaviour. Bifrost retried against OpenAI inside the same request boundary, the caller got a response, and the metrics endpoint showed the failover counter tick. No drama.&lt;/p&gt;

&lt;p&gt;Reckon this is the thing most people get wrong when they roll their own. Failover is easy to write and hard to test. Having it as a config flag means I can write a game day scenario where we knock providers offline and watch the gateway do its job, instead of hoping our wrapper code holds up.&lt;/p&gt;
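
&lt;p&gt;For contrast, here's about the smallest hand-rolled version of ordered failover I'd trust, plus the game day check that goes with it. It's a sketch with stand-in names: &lt;code&gt;ProviderDown&lt;/code&gt;, &lt;code&gt;call_with_failover&lt;/code&gt;, and the fake provider functions are all hypothetical, and the &lt;code&gt;metrics&lt;/code&gt; dict is a placeholder for a real counter.&lt;/p&gt;

```python
class ProviderDown(Exception):
    """Raised by a provider callable when the upstream is unreachable."""


metrics = {"failover_total": 0}  # placeholder for a real metrics counter


def call_with_failover(providers, prompt):
    """Try each (name, callable) in order and return the first successful reply."""
    last_error = None
    for index, (name, call) in enumerate(providers):
        try:
            reply = call(prompt)
        except ProviderDown as exc:
            last_error = exc
            continue
        if index > 0:
            metrics["failover_total"] += 1  # we did not answer from the primary
        return name, reply
    raise RuntimeError("all providers failed") from last_error


# Game day: the primary is knocked offline, the secondary answers.
def dead_anthropic(prompt):
    raise ProviderDown("simulated network failure")


def fake_openai(prompt):
    return "summary of: " + prompt


providers = [
    ("anthropic_primary", dead_anthropic),
    ("openai_secondary", fake_openai),
]

print(call_with_failover(providers, "build 123 failed"))
```

&lt;p&gt;Note that each provider is a separate callable. The copy-paste bug from earlier is exactly what you get when two entries in that list quietly point at the same client.&lt;/p&gt;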

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Not all sunshine.&lt;/p&gt;

&lt;p&gt;The dashboard is functional but it's no Grafana. We export Prometheus metrics out of it and build our own panels, which is what we wanted anyway, but if you're hoping for a polished UI out of the box you'll be doing some work.&lt;/p&gt;

&lt;p&gt;The plugin ecosystem is smaller than LiteLLM's. If you need some niche provider or a very specific transformation, LiteLLM probably has it already and Bifrost might need you to write a small bit of Go. For our needs (Anthropic, OpenAI, one self-hosted model) this was a non-issue.&lt;/p&gt;

&lt;p&gt;Go binary means your ops team needs to be cool with running a Go service. If you're an all-Python shop and your team is allergic to anything else, that's a real friction point even though the binary itself is genuinely fire-and-forget.&lt;/p&gt;

&lt;p&gt;And semantic caching can bite you. If your prompts are doing something where a "similar" prompt actually needs a different answer (think anything with user-specific context smuggled in), you'll want to disable it for those routes. We learned this the second day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it sits now
&lt;/h2&gt;

&lt;p&gt;It runs as a sidecar to our build orchestration service. Two replicas behind an internal load balancer, Prometheus scraping the metrics endpoint, and PagerDuty wired to the failover counter so we know when a provider is having a bad day before our users do. Total memory footprint across the cluster is a rounding error compared to the workloads it serves.&lt;/p&gt;

&lt;p&gt;The on-call SRE has not, so far, come to uninstall it. I'll take the win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost repo and docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/proxy/quick_start" rel="noopener noreferrer"&gt;LiteLLM proxy documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.konghq.com/hub/kong-inc/ai-proxy/" rel="noopener noreferrer"&gt;Kong AI Gateway overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/workbook/non-abstract-design/" rel="noopener noreferrer"&gt;Google SRE Workbook chapter on non-abstract large system design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/guides/go-application/" rel="noopener noreferrer"&gt;Prometheus client libraries for Go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>infrastructure</category>
      <category>devops</category>
      <category>sre</category>
      <category>llm</category>
    </item>
    <item>
      <title>What I Actually Pay For When My LLM Bill Doubles Overnight</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 15 May 2026 10:34:37 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/what-i-actually-pay-for-when-my-llm-bill-doubles-overnight-37gf</link>
      <guid>https://dev.to/claire_nguyen/what-i-actually-pay-for-when-my-llm-bill-doubles-overnight-37gf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Your LLM bill isn't one number, it's about six. Retry storms, runaway agents, and bad routing are the usual culprits. A bit of observability work up front saves you from staring at a $40k invoice wondering what happened.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last quarter I helped a mate's team trace a sudden 3x jump in their OpenAI spend. They thought it was usage growth. It wasn't. It was a retry loop in their orchestration code that fired off three full-context requests every time a single tool call timed out. Took us about two hours to find, ten minutes to fix.&lt;/p&gt;

&lt;p&gt;I reckon most teams running LLMs in prod have something similar lurking. You just don't see it until the invoice lands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill has more parts than you think
&lt;/h2&gt;

&lt;p&gt;When you read "LLM cost" on a finance dashboard, that single number is hiding a bunch of independent failure modes. Worth pulling them apart.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost driver&lt;/th&gt;
&lt;th&gt;What it actually is&lt;/th&gt;
&lt;th&gt;Where it hides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;Prompt + context + system message&lt;/td&gt;
&lt;td&gt;Long system prompts, fat RAG chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;Model's response&lt;/td&gt;
&lt;td&gt;Verbose prompts, no max_tokens cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retries&lt;/td&gt;
&lt;td&gt;Failed requests you paid for anyway&lt;/td&gt;
&lt;td&gt;Library defaults, agent loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cached vs uncached&lt;/td&gt;
&lt;td&gt;Prompt caching hits or misses&lt;/td&gt;
&lt;td&gt;Cache invalidation from tiny prompt edits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider markup&lt;/td&gt;
&lt;td&gt;Your gateway/aggregator's cut&lt;/td&gt;
&lt;td&gt;Hidden in unit pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wasted spend&lt;/td&gt;
&lt;td&gt;Calls you didn't need to make&lt;/td&gt;
&lt;td&gt;Background agents, debug code in prod&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first three are the ones I see blow up budgets. Provider markup matters at scale but it's predictable. Cache misses and wasted spend are the quiet ones that sneak up on you.&lt;/p&gt;
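
&lt;p&gt;To see how the drivers in the table multiply together, here's a back-of-envelope sketch. The &lt;code&gt;request_cost_usd&lt;/code&gt; function and every price in the example are made-up placeholders, not any provider's real rates, and it assumes the worst case where every retry attempt is billed in full.&lt;/p&gt;

```python
def request_cost_usd(uncached_in, cached_in, out, attempts,
                     in_price, cached_in_price, out_price, markup=0.0):
    """Cost of one logical request. Prices are USD per million tokens."""
    per_attempt = (
        uncached_in * in_price
        + cached_in * cached_in_price
        + out * out_price
    ) / 1_000_000
    # Assumption: each attempt is billed in full, then the markup is applied.
    return per_attempt * attempts * (1 + markup)


# Placeholder prices: 3.00 in, 0.30 cached in, 15.00 out (USD per million tokens).
happy_path = request_cost_usd(2000, 8000, 500, 1, 3.0, 0.3, 15.0)
cache_busted = request_cost_usd(10000, 0, 500, 1, 3.0, 0.3, 15.0)
retry_storm = request_cost_usd(10000, 0, 500, 3, 3.0, 0.3, 15.0)

print(happy_path, cache_busted, retry_storm)
```

&lt;p&gt;Same prompt, same model, and the three scenarios land at very different prices. That's why the single number on the finance dashboard tells you so little.&lt;/p&gt;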

&lt;h2&gt;
  
  
  Retries are the silent killer
&lt;/h2&gt;

&lt;p&gt;Here's the pattern I see constantly. A team uses some agent framework, the framework has a default retry of 3 with exponential backoff, and the prompts include the full conversation history. A timeout on token 4000 of a 4096-token response means you just paid for 4000 tokens, then immediately paid for another 4000+ tokens, then maybe a third time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# What people write: three retries, full history, no output cap.
&lt;/span&gt;&lt;span class="c1"&gt;# max_retries is a client option (the SDK default is 2), not a create() argument.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;long_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What they should write: cap output, bound the wait, retry at most once.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;long_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three changes. Cap your output. Set an explicit timeout. Stop retrying expensive operations more than once. If a 30-second call fails once, the second attempt usually fails too.&lt;/p&gt;
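
&lt;p&gt;It helps to work out the ceiling those settings put on a single logical request. A rough sketch, assuming &lt;code&gt;max_retries&lt;/code&gt; counts retries on top of the first attempt and that every attempt gets billed for the full input plus the full output cap (a deliberate worst case). The &lt;code&gt;worst_case&lt;/code&gt; helper, the 6000-token history, and the 600-second timeout are illustrative assumptions.&lt;/p&gt;

```python
def worst_case(input_tokens, max_output_tokens, max_retries, timeout_s):
    """Upper bound for one logical request, ignoring backoff sleeps."""
    attempts = 1 + max_retries
    return {
        "attempts": attempts,
        "billed_tokens": attempts * (input_tokens + max_output_tokens),
        "wall_clock_s": attempts * timeout_s,
    }


# Assumed numbers: 6000-token history, uncapped 4096-token reply, long timeout.
before = worst_case(6000, 4096, max_retries=3, timeout_s=600)

# Same history with the tightened settings: 500-token cap, one retry, 30s timeout.
after = worst_case(6000, 500, max_retries=1, timeout_s=30)

print(before)
print(after)
```

&lt;p&gt;Under those assumptions the tightened settings cut the worst-case token bill by about two thirds and turn a forty-minute hang into a one-minute one.&lt;/p&gt;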

&lt;h2&gt;
  
  
  Caching is worth the effort, but it's fragile
&lt;/h2&gt;

&lt;p&gt;Prompt caching on most providers gives you something like 50-90% off cached input tokens. Brilliant when it works. The trap is that cache keys are exact prefix matches. Change your system prompt by one character, your timestamp injection bumps every request, your dynamic user context shifts the prefix... cache hit rate goes to zero and your bill quietly goes back up.&lt;/p&gt;
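
&lt;p&gt;You can see the prefix problem without touching a provider. The sketch below compares how much leading text two consecutive prompts share when the timestamp goes first versus last. It works on characters for simplicity, whereas providers match on tokens and usually have a minimum cacheable length, so treat it as an illustration of the ordering issue only. The prompt text and the &lt;code&gt;shared_prefix_len&lt;/code&gt; helper are made up.&lt;/p&gt;

```python
import os.path


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


STATIC_RULES = "You are a build triage assistant. Be concise. " * 20

# Timestamp first: consecutive prompts diverge almost immediately.
bad_1 = "Now: 2026-05-15T10:34:37Z\n" + STATIC_RULES
bad_2 = "Now: 2026-05-15T10:34:38Z\n" + STATIC_RULES

# Static text first, volatile bits last: the long prefix is identical.
good_1 = STATIC_RULES + "\nNow: 2026-05-15T10:34:37Z"
good_2 = STATIC_RULES + "\nNow: 2026-05-15T10:34:38Z"

print(shared_prefix_len(bad_1, bad_2), shared_prefix_len(good_1, good_2))
```

&lt;p&gt;Same content, different order. Keep the stable stuff at the front and push anything that changes per request to the end.&lt;/p&gt;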

&lt;p&gt;A useful exercise: log your actual cache hit rate per route, not just the average. I had a service where overall hit rate was 70%, looked fine, but one specific endpoint was running at 4% because someone added a timestamp to the system prompt for "debugging."&lt;/p&gt;
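
&lt;p&gt;The per-route breakdown is only a few lines if you already log a cache-hit flag per request. A minimal sketch, assuming your logs can be reduced to (route, cache_hit) pairs. The route names and counts here are invented to show how a healthy-looking average hides a bad endpoint.&lt;/p&gt;

```python
from collections import defaultdict


def hit_rate_by_route(records):
    """records: iterable of (route, cache_hit) pairs. Returns {route: hit fraction}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for route, cache_hit in records:
        totals[route] += 1
        if cache_hit:
            hits[route] += 1
    return {route: hits[route] / totals[route] for route in totals}


# Invented sample: one busy healthy route, one quiet broken one.
records = (
    [("/summarise", True)] * 9
    + [("/summarise", False)] * 1
    + [("/triage", True)] * 1
    + [("/triage", False)] * 4
)

rates = hit_rate_by_route(records)
overall = sum(1 for _, hit in records if hit) / len(records)

print(rates, overall)
```

&lt;p&gt;The overall figure looks tolerable while one route is missing the cache most of the time, which is the same shape as the timestamp incident above.&lt;/p&gt;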

&lt;h2&gt;
  
  
  Routing across providers
&lt;/h2&gt;

&lt;p&gt;Once you've got more than one model in play, where requests go starts mattering a lot. Cheap models for classification, expensive ones for synthesis. Local models for bulk preprocessing if you can host them.&lt;/p&gt;
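
&lt;p&gt;At its simplest, that routing decision is a lookup from task type to model. A tiny sketch with placeholder model names (none of these are real model IDs) and a hypothetical &lt;code&gt;pick_model&lt;/code&gt; helper:&lt;/p&gt;

```python
# Placeholder model names, swap in whatever you actually run.
MODEL_BY_TASK = {
    "classification": "small-cheap-model",
    "preprocessing": "local-model",
    "synthesis": "large-expensive-model",
}


def pick_model(task_type, default="large-expensive-model"):
    """Route known task types to their model, unknown ones to the default."""
    return MODEL_BY_TASK.get(task_type, default)


print(pick_model("classification"))
print(pick_model("something-new"))
```

&lt;p&gt;Defaulting unknown tasks to the expensive model is a judgment call: it protects quality at the cost of spend. If cost is the bigger worry, flip the default and log every fallthrough.&lt;/p&gt;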

&lt;p&gt;A few options here, depending on how much you want to manage yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build it in your own gateway service&lt;/li&gt;
&lt;li&gt;Use a router library like LiteLLM in-process&lt;/li&gt;
&lt;li&gt;Run a proxy like Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;), Kong AI, or Portkey in front of your services&lt;/li&gt;
&lt;li&gt;Stick with one provider and use their own routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The proxy approach is what I've ended up preferring on bigger setups because it gives you one place to log, retry, and rate-limit. The downside is one more thing to keep alive. For smaller services I just call the SDK directly and accept the duplication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set hard limits per workload
&lt;/h2&gt;

&lt;p&gt;The cheapest debugging tool I've found is a per-API-key spend cap. Most providers offer them. Most teams don't set them because the dashboards default to "monitor only."&lt;/p&gt;

&lt;p&gt;Set them. Set them low. If your batch job is supposed to cost $200 a day and you cap it at $400, you'll get paged the day someone accidentally points it at production traffic instead of staging. That page is much nicer to receive than a Slack message from finance two weeks later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example budget config we use&lt;/span&gt;
&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chat-prod&lt;/span&gt;
    &lt;span class="na"&gt;daily_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1200&lt;/span&gt;
    &lt;span class="na"&gt;hourly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;alert_at_pct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;50&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;80&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;95&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;hard_stop_at_pct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;experiments&lt;/span&gt;
    &lt;span class="na"&gt;daily_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="na"&gt;hard_stop_at_pct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
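
&lt;p&gt;Here's one way the thresholds in that config could be evaluated. It's a sketch of my reading of the fields, not any gateway's real enforcement logic: &lt;code&gt;budget_status&lt;/code&gt; is a hypothetical helper, and it assumes alerts and the hard stop are both percentages of the same limit.&lt;/p&gt;

```python
def budget_status(spent_usd, limit_usd, alert_at_pct=(50, 80, 95), hard_stop_at_pct=100):
    """Return (alert thresholds crossed, whether new requests should be blocked)."""
    pct = 100.0 * spent_usd / limit_usd
    alerts = [threshold for threshold in alert_at_pct if pct >= threshold]
    blocked = pct >= hard_stop_at_pct
    return alerts, blocked


# chat-prod at 1000 of its 1200 daily limit: two alerts fired, still serving.
print(budget_status(1000, 1200))

# experiments at its 50 daily limit with no alert thresholds: blocked.
print(budget_status(50, 50, alert_at_pct=()))
```

&lt;p&gt;Whatever does the blocking needs to run on the request path. A nightly job that notices you went over is a report, not a cap.&lt;/p&gt;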



&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;A few honest caveats.&lt;/p&gt;

&lt;p&gt;Caching aggressively means your prompts get rigid. You can't iterate on system prompts as freely because every edit nukes your cache.&lt;/p&gt;

&lt;p&gt;Hard spend caps will absolutely cause prod outages. That's the point, but it means you need a runbook for "we hit the cap, what now." Either auto-raise with approval, or fail open to a cheaper fallback model, or fail closed and alert. Pick one before it happens.&lt;/p&gt;

&lt;p&gt;Per-key budgets only work if you have enough keys. If everything runs through one shared key, you've got coarse-grained control at best.&lt;/p&gt;

&lt;p&gt;And honestly, observability work has diminishing returns. If your bill is $500 a month, don't build a routing platform. If it's $50k a month, the engineering time pays for itself in a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/guides/prompt-caching" rel="noopener noreferrer"&gt;OpenAI prompt caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic prompt caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/simple_proxy" rel="noopener noreferrer"&gt;LiteLLM proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;AWS Bedrock pricing breakdown&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/workbook/table-of-contents/" rel="noopener noreferrer"&gt;Google's "SRE Workbook" (table of contents)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bills are signals. If yours doubles overnight, something in your system changed. Find that thing before you negotiate a bigger contract.&lt;/p&gt;

</description>
      <category>infrastructure</category>
      <category>llm</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>MCP in Production Reality vs the Spec</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:58:18 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/mcp-in-production-reality-vs-the-spec-3f04</link>
      <guid>https://dev.to/claire_nguyen/mcp-in-production-reality-vs-the-spec-3f04</guid>
      <description>&lt;p&gt;Been building against MCP for the last four months and the gap between what vendors claim and what the spec actually supports is getting hard to ignore.&lt;/p&gt;

&lt;p&gt;If you have not read the official roadmap yet, it is worth your time. The document published by AAIF in March lays things out clearly and honestly. The list of what is still missing is longer than many people in the ecosystem seem willing to admit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateless Streaming Is Not Here Yet
&lt;/h2&gt;

&lt;p&gt;Stateless Streamable HTTP is still marked as in progress. That has real consequences.&lt;/p&gt;

&lt;p&gt;Today, if you want to scale horizontally, you are dealing with sticky sessions or putting a stateful proxy in front of your servers. This is not a small implementation detail. It directly affects reliability, cost, and operational complexity.&lt;/p&gt;
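
&lt;p&gt;As an illustration of what session affinity involves, the sketch below pins each session to one backend by hashing its session ID. It is a simplified assumption of how such a proxy might choose a backend, not part of the MCP spec, and the host names and the &lt;code&gt;backend_for&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import hashlib

# Hypothetical backend hosts behind the proxy.
BACKENDS = ["mcp-0.internal", "mcp-1.internal", "mcp-2.internal"]


def backend_for(session_id, backends=BACKENDS):
    """Pin a session to one backend by hashing its ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# The same session always lands on the same backend.
print(backend_for("session-abc"))
print(backend_for("session-abc"))
```

&lt;p&gt;The cost shows up when the backend list changes. Adding or removing a server remaps many sessions with this simple modulo scheme, and each remapped session loses whatever state its old server was holding.&lt;/p&gt;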

&lt;p&gt;Every "MCP-native at scale" pitch I have seen quietly works around this with a custom session layer. That may be practical for now, but it is not what people assume when they hear "stateless protocol."&lt;/p&gt;

&lt;h2&gt;
  
  
  Async Work Is Still DIY
&lt;/h2&gt;

&lt;p&gt;The Tasks primitive for async and long-running operations is also in progress.&lt;/p&gt;

&lt;p&gt;In practice, this means any agent doing multi-minute work is faking async. Most teams end up with polling endpoints, custom retry logic, and their own definitions of job state.&lt;/p&gt;
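
&lt;p&gt;The typical workaround looks something like the polling loop below. It is a sketch of the do-it-yourself pattern, not anything defined by MCP: the state names, the &lt;code&gt;wait_for_job&lt;/code&gt; helper, and its parameters are all assumptions, which is exactly the problem.&lt;/p&gt;

```python
import time

# Hypothetical job states. Every team ends up picking its own set.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(get_status, poll_interval_s=2.0, max_polls=30, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_interval_s)
    return "timed_out"


# Simulated job that finishes on the third poll. Sleep is stubbed out.
statuses = iter(["queued", "running", "succeeded"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

&lt;p&gt;Two teams writing this independently will disagree on state names, on whether "timed_out" is terminal, and on who owns retries. None of that is visible to a client until it breaks.&lt;/p&gt;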

&lt;p&gt;The problem is not just inconvenience. It is fragmentation. Each implementation behaves slightly differently, which makes interoperability harder before it even begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discovery Is Still Manual
&lt;/h2&gt;

&lt;p&gt;Server discovery is another gap that shows up quickly.&lt;/p&gt;

&lt;p&gt;The idea of Server Cards exposed via .well-known URLs is promising, but not available yet. Right now, you cannot know what an MCP server can do without connecting to it first.&lt;/p&gt;

&lt;p&gt;The Registry preview from late 2025 helps, but it is not a replacement for protocol-level discovery. You still end up writing glue code just to answer basic capability questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise Auth Is Not Ready
&lt;/h2&gt;

&lt;p&gt;Authentication is where things feel especially incomplete for real-world use.&lt;/p&gt;

&lt;p&gt;Most implementations today rely on static client secrets. That works for prototypes, but does not align with how larger organizations manage identity and access.&lt;/p&gt;

&lt;p&gt;The roadmap calls out SSO-integrated cross-app access as a priority. That is exactly what is needed. Until it lands, teams are building their own auth layers on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost: Rewrites Later
&lt;/h2&gt;

&lt;p&gt;Put all of this together and a pattern emerges.&lt;/p&gt;

&lt;p&gt;If you are building serious MCP infrastructure today, you are not just implementing the spec. You are filling in gaps around session management, async orchestration, discovery, and authentication.&lt;/p&gt;

&lt;p&gt;Those gaps come with a cost. Once these features land in the official spec, a lot of today's custom infrastructure will need to be reworked or replaced. Some abstractions will survive. Many will not.&lt;/p&gt;

&lt;p&gt;If you are designing systems now, it is worth being explicit about where you are deviating from the spec and how hard it will be to unwind later.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Those "Production Ready" Claims
&lt;/h2&gt;

&lt;p&gt;This also makes it hard to take "production-ready" MCP gateway claims at face value in April 2026.&lt;/p&gt;

&lt;p&gt;There are usually two possibilities. Either the deployment is small enough that these issues have not surfaced yet, or the vendor has built proprietary extensions on top of MCP.&lt;/p&gt;

&lt;p&gt;Neither is inherently wrong, but both are very different from what the marketing suggests.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Good News
&lt;/h2&gt;

&lt;p&gt;None of this is a knock on MCP itself.&lt;/p&gt;

&lt;p&gt;The shape of the protocol feels right. The direction is solid. The roadmap is transparent about what is missing, which is more than can be said for many standards at this stage.&lt;/p&gt;

&lt;p&gt;But the reality is simple. Production-grade tooling is still catching up.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
