<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: claire nguyen</title>
    <description>The latest articles on DEV Community by claire nguyen (@claire_nguyen).</description>
    <link>https://dev.to/claire_nguyen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864932%2F406e92e1-2c8d-4d65-a1f4-a8d4e8c2fd1d.jpg</url>
      <title>DEV Community: claire nguyen</title>
      <link>https://dev.to/claire_nguyen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/claire_nguyen"/>
    <language>en</language>
    <item>
      <title>What is shadow AI and how to govern it</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:14:58 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/what-is-shadow-ai-and-how-to-govern-it-1k4f</link>
      <guid>https://dev.to/claire_nguyen/what-is-shadow-ai-and-how-to-govern-it-1k4f</guid>
      <description>&lt;p&gt;&lt;em&gt;Shadow AI is the use of AI tools inside an organization without IT's knowledge or approval. This guide explains what it is, why it creates real security and compliance risk, and how the Bifrost AI gateway together with Bifrost Edge brings that usage under governance on every machine.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most of the AI tools employees rely on at work run on their own machines and reach a model provider directly, without passing through any corporate network checkpoint. A developer can install a desktop assistant, paste in proprietary source code, and send it to a third-party model before anyone in security knows the tool exists. Industry analysts call this pattern shadow AI: the use of AI tools or applications by employees without the approval or oversight of the IT department. The scale is no longer marginal, as a 2026 Gartner survey of cybersecurity leaders found that 69 percent have evidence or suspect that employees are using public generative AI at work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What shadow AI is
&lt;/h2&gt;

&lt;p&gt;Shadow AI is the use of AI tools, models, and services by employees without the knowledge, approval, or governance of an organization's IT or security teams. It is a subset of shadow IT, the broader category of hardware and software that IT has not approved, but it carries risks that older shadow IT controls were never designed to handle. The distinction matters for how an organization should respond.&lt;/p&gt;

&lt;p&gt;Where shadow IT generally involves an unapproved application or storage service, shadow AI centers on systems that process, generate, and retain data in ways that are difficult to reverse. Common forms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public chatbots and assistants used on work data, such as the ChatGPT app or Claude Desktop signed in with a personal account.&lt;/li&gt;
&lt;li&gt;AI features inside browser tabs and SaaS products that an employee turns on without review.&lt;/li&gt;
&lt;li&gt;Coding agents in the terminal and IDE, including Claude Code, Codex, and Cursor, that read source code and call external services.&lt;/li&gt;
&lt;li&gt;MCP servers wired into those tools, which can read files, call APIs, and act on a user's behalf.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Salesforce's 2026 Workforce AI Survey found that 67 percent of employees use AI at work, while only 18 percent of organizations have a formal AI security policy. Adoption at that pace, with no governing layer underneath it, is what turns ordinary productivity into a security exposure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why shadow AI is a security risk, not just a policy gap
&lt;/h2&gt;

&lt;p&gt;Shadow AI raises measurable security and compliance risk because sensitive data reaches systems that the organization cannot see, control, or audit. Gartner predicts that by 2030, more than 40 percent of organizations will experience &lt;a href="https://www.infosecurity-magazine.com/news/gartner-40-firms-hit-shadow-ai/" rel="noopener noreferrer"&gt;security or compliance incidents tied to the use of unauthorized AI&lt;/a&gt;, and the reasons follow directly from how these tools are used.&lt;/p&gt;

&lt;p&gt;The exposure goes well beyond a single pasted prompt. Several distinct failure modes make shadow AI harder to contain than earlier forms of unapproved software:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data leaves the organization in a form that cannot be recalled. Once proprietary text enters a third-party model, it may be retained or used in ways the organization has no ability to reverse, unlike a file that can be deleted from a server.&lt;/li&gt;
&lt;li&gt;Compliance and audit trails break down. Legal and security teams cannot demonstrate where regulated data went, whether retention rules were followed, or whether residency obligations were met when the traffic never passed through a governed path.&lt;/li&gt;
&lt;li&gt;AI agents inherit standing access. An assistant connected through an MCP server can read email, repositories, and internal systems on a continuing basis, so the question shifts from what a user pasted once to what a connected tool can reach at any time.&lt;/li&gt;
&lt;li&gt;Governance trails adoption. Most organizations still have no reliable way to see which AI tools and connections are in use, which leaves the bulk of this activity outside any policy or review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These concerns are not hypothetical; they describe what happens when fast, employee-driven adoption runs ahead of any mechanism for seeing or controlling it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why existing controls miss shadow AI
&lt;/h2&gt;

&lt;p&gt;Traditional network controls miss most shadow AI because the activity does not behave like the traffic those controls were built to inspect. Network proxies and data loss prevention systems observe what crosses the corporate network, yet a large share of AI usage runs on the endpoint and connects straight to a provider over an encrypted channel that resembles ordinary web traffic.&lt;/p&gt;

&lt;p&gt;Three gaps recur across the older approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network filtering and data loss prevention operate at the perimeter, so AI requests that originate and resolve on the device fall outside their view.&lt;/li&gt;
&lt;li&gt;Blocklists depend on a known list of destinations, and new AI tools, browser features, and MCP servers appear faster than any list can track.&lt;/li&gt;
&lt;li&gt;Written policies state what employees should do, but a document does not enforce anything at the moment a prompt is sent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread points toward the fix: the AI runs on the endpoint, where the person and the tool actually meet, so the endpoint is the one place where every request can be seen and governed before data leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Bifrost AI gateway and Bifrost Edge govern shadow AI
&lt;/h2&gt;

&lt;p&gt;Governing shadow AI well takes two things that fit together: one place to define policy, and a way to apply that policy to the AI running on every machine. Bifrost, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built by Maxim AI, is that one place. The gateway already holds the &lt;a href="https://docs.getbifrost.ai/deployment-guides/config-json/governance" rel="noopener noreferrer"&gt;virtual keys, budgets, and rate limits&lt;/a&gt; that tie AI usage to a person or project, the &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrail profiles&lt;/a&gt; that inspect prompts and responses, and the audit logs that record every exchange. The limitation, until now, has been reach: those controls governed only the traffic that someone had configured to point at the gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; closes that gap by running on each machine and routing all supported AI traffic through Bifrost, so the same virtual keys, budgets, guardrails, and audit logs that already protect gateway traffic now apply to the desktop apps, browser AI, and coding agents people use day to day. The gateway stays the single control plane, and Edge becomes its reach to the endpoint, so there is no second policy model to build or maintain.&lt;/p&gt;

&lt;p&gt;A request from any &lt;a href="https://docs.getbifrost.ai/edge/supported-applications" rel="noopener noreferrer"&gt;supported AI tool&lt;/a&gt; follows the same governed path on every machine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user works in a desktop app, a browser AI surface, or a coding agent exactly as before, with &lt;a href="https://docs.getbifrost.ai/edge/how-it-works" rel="noopener noreferrer"&gt;no base URL change and no SDK swap&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Bifrost Edge routes that request through the organization's Bifrost rather than letting it go straight to the provider.&lt;/li&gt;
&lt;li&gt;Bifrost ties the request to the user's virtual key and its budget, runs the configured guardrails, and writes the exchange to the audit log.&lt;/li&gt;
&lt;li&gt;The governed response returns to the original app, with sensitive content already caught or redacted.&lt;/li&gt;
&lt;/ol&gt;
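
&lt;p&gt;The four steps above can be pictured with a toy Python model. Every name in it (&lt;code&gt;govern&lt;/code&gt;, &lt;code&gt;BUDGETS_USD&lt;/code&gt;, &lt;code&gt;AUDIT_LOG&lt;/code&gt;, the &lt;code&gt;vk-alice&lt;/code&gt; key) is a hypothetical stand-in chosen for illustration; this is not Bifrost's API or data model, only a sketch of the order of operations the list describes.&lt;/p&gt;

```python
import time

# Toy model of the governed path. All names are hypothetical;
# this is not Bifrost's API or data model.
BUDGETS_USD = {"vk-alice": 5.00}   # remaining budget per virtual key
AUDIT_LOG = []                     # stand-in for a durable audit store


def govern(virtual_key, prompt, est_cost_usd, guardrail, call_provider):
    """Check budget, run the guardrail in both directions, record the exchange."""
    if est_cost_usd > BUDGETS_USD.get(virtual_key, 0.0):
        AUDIT_LOG.append({"key": virtual_key, "outcome": "denied_budget"})
        raise PermissionError("budget exhausted for " + virtual_key)
    safe_prompt = guardrail(prompt)            # before the model sees it
    raw_reply = call_provider(safe_prompt)
    safe_reply = guardrail(raw_reply)          # before the app sees it
    BUDGETS_USD[virtual_key] -= est_cost_usd
    AUDIT_LOG.append({"key": virtual_key, "outcome": "ok", "ts": time.time()})
    return safe_reply


reply = govern(
    "vk-alice",
    "token=abc123 please summarise",
    0.01,
    guardrail=lambda text: text.replace("abc123", "[REDACTED]"),
    call_provider=lambda p: "echo: " + p,
)
print(reply)
```

&lt;p&gt;The point of the sketch is the ordering: the budget check and the guardrail both run before the provider is called, and the audit entry is written whether the request is served or denied.&lt;/p&gt;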

&lt;h3&gt;
  
  
  Guardrails apply before data leaves the machine
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;guardrail profiles configured in Bifrost&lt;/a&gt; apply to endpoint traffic with no extra setup on the device. A guardrail runs before a prompt reaches a model and again before a response returns, so secrets and personal data are caught or redacted before they leave the machine. Built-in coverage includes Gitleaks-backed secrets detection for leaked API keys, tokens, and credentials, a PII detection template built on custom regex, and content safety, alongside integrations with AWS Bedrock Guardrails, Azure Content Safety, Google Model Armor, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI.&lt;/p&gt;
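
&lt;p&gt;To make the regex-based part concrete, here is a minimal Python sketch of redacting a prompt before it is sent. The two patterns and the &lt;code&gt;redact&lt;/code&gt; function are illustrative assumptions, not the rules Bifrost or Gitleaks actually ship; real detectors carry far larger rule sets and handle many more formats.&lt;/p&gt;

```python
import re

# Hypothetical patterns for illustration only; not Bifrost's or Gitleaks' rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(prompt):
    """Replace each match with a placeholder and report which rules fired."""
    fired = []
    for name, pattern in PATTERNS.items():
        prompt, count = pattern.subn("[" + name + "_REDACTED]", prompt)
        if count:
            fired.append(name)
    return prompt, fired


clean, fired = redact("ping ana@example.com, key AKIAABCDEFGHIJKLMNOP")
print(clean)
print(fired)
```

&lt;p&gt;Returning the list of rules that fired alongside the cleaned text is a deliberate choice in this sketch: it lets the caller log what was caught without logging the sensitive value itself.&lt;/p&gt;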

&lt;h3&gt;
  
  
  Visibility into MCP servers across the fleet
&lt;/h3&gt;

&lt;p&gt;Most organizations cannot say which MCP servers their employees have connected to AI tools. &lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;Bifrost Edge inventories the MCP servers&lt;/a&gt; configured inside each supported app and builds a live picture across the fleet of which servers are in use, on which apps, and on how many devices. Administrators then allow or deny each server individually, and Edge enforces that decision on the device, even for an app that had the server configured before the policy existed. MCP discovery covers the major AI apps that support it, including Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor.&lt;/p&gt;

&lt;h3&gt;
  
  
  App policy enforced on every device
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/edge/app-governance" rel="noopener noreferrer"&gt;App governance&lt;/a&gt; lets administrators decide which AI applications are permitted across the organization. Approved apps run normally, with their traffic governed through Bifrost, while disallowed apps are blocked before any data leaves the machine. When Edge encounters an app or MCP server it has not seen, it requests approval from the admin console, and administrators choose whether such items are allowed or blocked while the decision is pending. Policy changes reach the whole organization at once, without anyone revisiting individual machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollout through your existing device management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;Bifrost Edge deploys through the device management platforms&lt;/a&gt; an organization already runs, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, across macOS, Windows, and Linux. The managed configuration carries only the connection settings that point each machine at the organization's Bifrost, and identity and keys come from the user's single sign-on, so no secrets sit on the device. After the first sign-in, governance stays in sync with the gateway, and central changes to app policy, MCP allow and deny lists, and routing reach the fleet on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions about shadow AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is shadow AI the same as shadow IT?
&lt;/h3&gt;

&lt;p&gt;Shadow AI is a subset of shadow IT, but the risk profile differs. Shadow IT covers any hardware or software that IT has not approved, whereas shadow AI specifically involves tools that process and retain data in a model, which makes the exposure harder to reverse and more likely to spread across teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can shadow AI be detected?
&lt;/h3&gt;

&lt;p&gt;Shadow AI can be detected when AI requests are observed at the point where they originate. Because much of the usage runs on the endpoint, a layer that operates on the device, such as Bifrost Edge, can inventory the apps and MCP servers in use and route their traffic through a gateway where it becomes visible and auditable.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you govern AI coding agents?
&lt;/h3&gt;

&lt;p&gt;Coding agents such as Claude Code, Codex, and Cursor run locally and often connect directly to model providers and MCP servers. Routing their traffic through Bifrost applies the same guardrails, budgets, and audit logging used for the rest of an organization's AI, while app and MCP policies determine which agents and tools are allowed on each machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves enterprise teams
&lt;/h2&gt;

&lt;p&gt;Shadow AI persists because the activity happens on the endpoint and moves faster than perimeter controls can follow, so better intentions and longer policy documents do not resolve it on their own. The organizations that handle it well treat it first as a visibility problem and then as an enforcement problem, governing AI where people actually use it rather than where the network happens to see it.&lt;/p&gt;

&lt;p&gt;Pairing the Bifrost AI gateway with Bifrost Edge gives security and platform teams one control plane for that work, with the gateway defining the virtual keys, budgets, guardrails, and audit logs, and Edge, currently in alpha, extending them to every machine in the organization. Teams sizing up shadow AI can review how the combined approach works on the &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge overview&lt;/a&gt; and register there for alpha access.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cybersecurity</category>
      <category>management</category>
      <category>security</category>
    </item>
    <item>
      <title>Fault-injecting our LLM provider to trust Bifrost fallbacks</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 19 Jun 2026 13:26:05 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/fault-injecting-our-llm-provider-to-trust-bifrost-fallbacks-233e</link>
      <guid>https://dev.to/claire_nguyen/fault-injecting-our-llm-provider-to-trust-bifrost-fallbacks-233e</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've got a small service that reads failed CI jobs and writes a one-paragraph summary into the build annotation, so engineers don't have to scroll 4,000 lines of test log to find the one assertion that broke. It calls an LLM. Handy when it works. Embarrassing when it doesn't, because a broken annotation makes people distrust every annotation.&lt;/p&gt;

&lt;p&gt;The problem is the thing it depends on isn't ours. OpenAI rate-limits, has the occasional 5xx spell, and we don't get a heads-up. "Never had an outage" usually means you never tested the failure path. So I tested it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a gateway at all
&lt;/h2&gt;

&lt;p&gt;I didn't want fallback logic smeared across our service code. Retry-with-jitter, secondary provider, key rotation, all of that wants to live in one place with metrics attached. We put Bifrost in front, an OpenAI-compatible gateway, so our service keeps talking the same &lt;code&gt;/v1/chat/completions&lt;/code&gt; it always did and the routing decisions move to config.&lt;/p&gt;

&lt;p&gt;The pitch is plain. One endpoint, 23+ providers behind it, automatic fallbacks between them. Our code points at &lt;code&gt;localhost:8080&lt;/code&gt; instead of &lt;code&gt;api.openai.com&lt;/code&gt; and stops caring which model actually answers.&lt;/p&gt;

&lt;p&gt;Here's the fallback config I started the game day with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY_A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY_B"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-haiku-4-5"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two OpenAI keys for load balancing, then Anthropic as the lifeboat if OpenAI as a whole goes sideways. That was the theory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The game day
&lt;/h2&gt;

&lt;p&gt;A game day is just a planned outage you cause on purpose, with people watching. I scheduled 45 minutes, told the team, and put a toxiproxy in front of OpenAI so I could inject faults without waiting for the real thing to break.&lt;/p&gt;

&lt;p&gt;Three scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;429 storm.&lt;/strong&gt; Every OpenAI response becomes a rate-limit for 5 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard 503s.&lt;/strong&gt; OpenAI returns 503 on half of requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency tar pit.&lt;/strong&gt; 30-second delays, no errors, the nastiest one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scenario one went fine. Bifrost saw the 429s, rotated between key A and key B, then gave up on OpenAI and the requests landed on Haiku. Annotations kept writing. Reckoned I was done.&lt;/p&gt;

&lt;p&gt;Scenario two found my first mistake. I'd not set a sane retry ceiling, so on a 503 the gateway retried hard against the same struggling provider before failing over, and our p95 on annotation writes jumped to about 18 seconds. Fixed it by capping retries and letting the fallback fire sooner. The docs' &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;retries and fallbacks&lt;/a&gt; page covers the knobs; I'd skimmed it the first time.&lt;/p&gt;

&lt;p&gt;Scenario three is the one everyone gets wrong. Slow isn't down. A 30-second response isn't an error, so naive fallback never triggers, the request just sits there. We added a request timeout so a tar-pitted provider counts as a failure and trips the lifeboat. That single change is the actual reason this exercise was worth running.&lt;/p&gt;
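
&lt;p&gt;Here's the shape of the two fixes as a small Python sketch: a retry cap per provider, and a deadline that turns "slow" into "failed" so the next provider gets a go. This is a client-side model of the policy, not how Bifrost implements it. The function names, the fake providers, the one-retry cap, and the 4-second deadline (borrowed from the results table further down) are all assumptions for illustration.&lt;/p&gt;

```python
# Hypothetical sketch of the policy; not Bifrost's implementation.
class ProviderError(Exception):
    """Stand-in for a 429 or 5xx from a provider."""


def call_with_fallback(providers, prompt, timeout_s=4.0, max_retries=1):
    """Try each provider in order; a timeout counts as a failure like any error."""
    attempts = []
    for name, call in providers:
        for _ in range(max_retries + 1):
            try:
                return call(prompt, timeout_s), attempts
            except (ProviderError, TimeoutError) as exc:
                attempts.append((name, type(exc).__name__))
    raise RuntimeError("all providers failed: " + repr(attempts))


def tar_pit(prompt, timeout_s):
    # Simulates a provider that needs 30s: the deadline expires first.
    if 30.0 > timeout_s:
        raise TimeoutError("no response within deadline")
    return "slow summary"


def healthy(prompt, timeout_s):
    return "summary of: " + prompt


result, attempts = call_with_fallback(
    [("openai", tar_pit), ("anthropic", healthy)], "job 42 failed"
)
print(result)
print(attempts)
```

&lt;p&gt;Without the &lt;code&gt;TimeoutError&lt;/code&gt; in the &lt;code&gt;except&lt;/code&gt; clause, the tar-pitted provider would never register as a failure, which is scenario three in miniature.&lt;/p&gt;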

&lt;h2&gt;
  
  
  What the metrics showed
&lt;/h2&gt;

&lt;p&gt;Bifrost ships native Prometheus metrics, so I didn't have to bolt on my own. I watched fallback rate and per-provider latency the whole time on a Grafana board.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Without fallback&lt;/th&gt;
&lt;th&gt;With Bifrost (tuned)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429 storm&lt;/td&gt;
&lt;td&gt;annotations stall&lt;/td&gt;
&lt;td&gt;reroute to Haiku, ~2.1s p95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard 503s&lt;/td&gt;
&lt;td&gt;50% writes fail&lt;/td&gt;
&lt;td&gt;0 user-visible failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30s latency&lt;/td&gt;
&lt;td&gt;every write hangs&lt;/td&gt;
&lt;td&gt;timeout trips fallback in 4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers that mattered: zero broken annotations across all three once tuned, and the fallback decisions were visible in metrics instead of buried in logs nobody reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it stacks up against LiteLLM and Portkey
&lt;/h2&gt;

&lt;p&gt;I'd used LiteLLM before. Worth being honest here.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic fallbacks&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Prometheus metrics&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host story&lt;/td&gt;
&lt;td&gt;single Go binary&lt;/td&gt;
&lt;td&gt;Python proxy&lt;/td&gt;
&lt;td&gt;gateway is OSS, control plane hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity / ecosystem&lt;/td&gt;
&lt;td&gt;newer&lt;/td&gt;
&lt;td&gt;large, lots of integrations&lt;/td&gt;
&lt;td&gt;polished dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has been around longer and has a bigger pile of community integrations, which counts for something when you hit an edge case at 2am. Portkey's hosted dashboards are nicer than anything I'd build myself, and if you don't want to run infra that's a fair trade. We picked Bifrost mostly because a single Go binary is easy for an infra team to operate and the &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus output&lt;/a&gt; dropped straight into our existing board with no glue. Not a knock on the others. Different priorities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;A gateway is one more hop you have to keep alive. If Bifrost falls over, every LLM call falls with it, so we run two replicas behind a load balancer and the game day included killing one of them too.&lt;/p&gt;

&lt;p&gt;Fallback to a different model means a different model. Haiku doesn't write the exact same summary as gpt-4o-mini, and for a build annotation that's fine, but if you depend on a strict output schema you need to test the lifeboat actually produces it. We caught one prompt that assumed OpenAI-specific formatting.&lt;/p&gt;

&lt;p&gt;And fault injection through a proxy isn't the real provider misbehaving. Toxiproxy gives you 429s and delays, not the weird partial-stream failures you see in the wild. It's a model of the failure, not the failure. Better than nothing, not the whole story.&lt;/p&gt;

&lt;p&gt;Semantic caching is on the roadmap for us, not load-bearing yet, so I'm not going to claim numbers I haven't measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability and Prometheus metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost gateway setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Shopify/toxiproxy" rel="noopener noreferrer"&gt;Toxiproxy for fault injection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>llm</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>An 8-minute outage from a dead NLB and a JVM that cached DNS forever</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:43:43 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/an-8-minute-outage-from-a-dead-nlb-and-a-jvm-that-cached-dns-forever-4dmj</link>
      <guid>https://dev.to/claire_nguyen/an-8-minute-outage-from-a-dead-nlb-and-a-jvm-that-cached-dns-forever-4dmj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We drained a Network Load Balancer during a planned migration, and one internal service kept hammering the dead IPs for 8 minutes. The cause wasn't the failover. It was a JVM caching the DNS record forever. The fix was a 30-second TTL and a shorter deregistration delay, not a smarter system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last month I ran a routine migration on one of our internal control-plane services at Buildkite. Move it behind a new NLB, drain the old one, done before lunch. We've got a small platform team, four of us, and this was meant to be a non-event.&lt;/p&gt;

&lt;p&gt;It was not a non-event.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bit where everything looked fine
&lt;/h2&gt;

&lt;p&gt;I shifted the Route 53 record to the new NLB, watched the new targets go healthy, and started draining the old load balancer. Traffic on the new path climbed. Traffic on the old path... did not drop. One service, a Java scheduler that fans out build metadata, kept slamming the old NLB's IP addresses like nothing had changed.&lt;/p&gt;

&lt;p&gt;The old targets were already deregistering. So we had requests landing on backends mid-shutdown, timing out at 10 seconds each, retrying, timing out again. Error rate on that service went from basically zero to 60%. For 8 minutes.&lt;/p&gt;

&lt;p&gt;The annoying part? Every other service flipped over within about 60 seconds. Our Go services, our Node workers, all fine. Only the JVM one was stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one runtime behaved differently
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody on the team had clocked. The default DNS caching behaviour is wildly different depending on what's making the request.&lt;/p&gt;

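&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Default DNS cache TTL&lt;/th&gt;
&lt;th&gt;What actually happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JVM (security manager off)&lt;/td&gt;
&lt;td&gt;30s&lt;/td&gt;
&lt;td&gt;Usually fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JVM (security manager on)&lt;/td&gt;
&lt;td&gt;Forever (&lt;code&gt;networkaddress.cache.ttl=-1&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Cached the dead NLB IPs until restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go net resolver&lt;/td&gt;
&lt;td&gt;Honours record TTL&lt;/td&gt;
&lt;td&gt;Re-resolved in ~60s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node 18&lt;/td&gt;
&lt;td&gt;Honours record TTL&lt;/td&gt;
&lt;td&gt;Re-resolved in ~60s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python &lt;code&gt;requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No caching at the lib layer&lt;/td&gt;
&lt;td&gt;Re-resolved per connection pool refresh&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;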

&lt;p&gt;NLBs hand you IP addresses behind a DNS name, and those IPs change when targets move. The contract is: you re-resolve and follow the record. The JVM, with a security manager active, sets &lt;code&gt;networkaddress.cache.ttl&lt;/code&gt; to &lt;code&gt;-1&lt;/code&gt;, which means cache the first answer for the life of the process. So our scheduler resolved the name once at boot, three weeks ago, and never looked again.&lt;/p&gt;

&lt;p&gt;The DNS record had a 60-second TTL. Didn't matter. The JVM never asked DNS a second time.&lt;/p&gt;
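
&lt;p&gt;If the difference between "cache forever" and "cache for 30 seconds" feels abstract, this toy Python model shows it. It is not the JVM's implementation, just the same semantics in a few lines; the &lt;code&gt;AddressCache&lt;/code&gt; class, the hostname, and the IP addresses are made up for the demo, and the clock is injected so nothing has to sleep.&lt;/p&gt;

```python
# Toy model of address-cache semantics; not the JVM's actual implementation.
class AddressCache:
    def __init__(self, resolve, ttl_s, clock):
        self.resolve = resolve      # function: hostname to IP string
        self.ttl_s = ttl_s          # -1 means cache forever, as in the JVM setting
        self.clock = clock          # injectable clock so the demo needs no sleeping
        self.entries = {}           # hostname to (ip, time cached)

    def lookup(self, host):
        cached = self.entries.get(host)
        if cached is not None:
            ip, cached_at = cached
            age = self.clock() - cached_at
            if self.ttl_s == -1 or self.ttl_s > age:
                return ip           # still fresh, or pinned forever
        ip = self.resolve(host)
        self.entries[host] = (ip, self.clock())
        return ip


now = [0.0]
record = {"nlb.internal": "10.0.0.1"}     # hypothetical name and addresses
forever = AddressCache(record.get, -1, lambda: now[0])
thirty = AddressCache(record.get, 30, lambda: now[0])
forever.lookup("nlb.internal")
thirty.lookup("nlb.internal")

record["nlb.internal"] = "10.0.9.9"       # the cutover: DNS now points elsewhere
now[0] = 31.0                             # 31 seconds later
print(forever.lookup("nlb.internal"))     # still the old address
print(thirty.lookup("nlb.internal"))      # re-resolved to the new one
```

&lt;p&gt;The &lt;code&gt;-1&lt;/code&gt; cache keeps returning the address it saw first no matter how far the clock moves, which is what our scheduler did for three weeks.&lt;/p&gt;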
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Two lines in the JVM security config, pushed through our base image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# $JAVA_HOME/conf/security/java.security
&lt;/span&gt;&lt;span class="py"&gt;networkaddress.cache.ttl&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;
&lt;span class="py"&gt;networkaddress.cache.negative.ttl&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thirty seconds of caching, five seconds for negative lookups so a transient NXDOMAIN doesn't get pinned either. We bake this into the base image now so every JVM service inherits it. No app code change.&lt;/p&gt;

&lt;p&gt;Then the second half of the problem: even with re-resolution, the old NLB was draining too slowly. Default deregistration delay is 300 seconds. During a planned cutover that's an eternity of half-dead targets accepting connections. We dropped it for this service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws elbv2 modify-target-group-attributes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target-group-arn&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TG_ARN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attributes&lt;/span&gt; &lt;span class="nv"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deregistration_delay.timeout_seconds,Value&lt;span class="o"&gt;=&lt;/span&gt;30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Short TTL plus short drain meant the next test cutover finished in under 90 seconds across every runtime. No error spike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the LLM bit sneaks in
&lt;/h2&gt;

&lt;p&gt;One reason this stung is that the same scheduler kicks off build steps that call an LLM for flaky-test classification. Those calls go out through an AI gateway, Bifrost in our case, which already does its own provider failover and re-resolution on the upstream side. So that path stayed healthy the whole time. Which made the outage extra confusing at first, because the part everyone assumed was fragile (the model calls) was solid, and the boring internal HTTP call was the thing on fire. Good reminder that the exotic dependency isn't always the one that bites you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;A short DNS TTL isn't free, and I won't pretend it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More DNS queries.&lt;/strong&gt; Dropping the JVM cache to 30s means up to 120 lookups an hour for every hostname a JVM talks to, where the old "resolve once" behaviour did one lookup for the life of the process. For us that's noise against a Route 53 resolver, but if you're doing thousands of resolutions a second you'll want to check your resolver isn't the new bottleneck.&lt;/p&gt;
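
&lt;p&gt;If you want to put a number on that for your own fleet, here's a rough back-of-the-envelope sketch. The host and hostname counts are made-up placeholders, and it assumes the worst case where every cached entry expires and gets re-resolved on schedule.&lt;/p&gt;

```python
# Rough upper bound on DNS lookups generated by a positive-cache TTL.
# All fleet numbers below are hypothetical placeholders, not our real ones.

SECONDS_PER_HOUR = 3600


def lookups_per_hour(ttl_seconds, hostnames=1):
    """Worst case: every cached name expires and is re-resolved on schedule."""
    return hostnames * SECONDS_PER_HOUR / ttl_seconds


def fleet_queries_per_second(ttl_seconds, hosts, hostnames_per_host):
    """Steady-state query rate the resolver sees from the whole fleet."""
    per_host = lookups_per_hour(ttl_seconds, hostnames_per_host)
    return hosts * per_host / SECONDS_PER_HOUR


if __name__ == "__main__":
    # One JVM, one hostname, 30 s TTL.
    print(lookups_per_hour(30))
    # Hypothetical fleet: 500 hosts, each resolving 20 distinct names.
    print(round(fleet_queries_per_second(30, hosts=500, hostnames_per_host=20), 1))
```

&lt;p&gt;Real traffic will usually sit below this bound, since a name that nobody asks for after expiry never gets re-resolved.&lt;/p&gt;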

&lt;p&gt;&lt;strong&gt;30s is still 30s.&lt;/strong&gt; This shortens the failover window, it doesn't remove it. For a true hot failover you need connection-level health checks and active draining, not just DNS. We treat DNS as the coarse knob and deregistration delay as the fine one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negative caching trade-off.&lt;/strong&gt; A 5-second negative TTL means a brief DNS hiccup gets retried fast, but it also means a genuinely down name gets queried more aggressively. Fine for internal services, worth a second look if you're resolving something rate-limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's per-runtime.&lt;/strong&gt; There's no single switch. The JVM fix does nothing for a Go binary, and the Go default did nothing wrong here. You have to know each runtime's behaviour, which is exactly the gap that caused this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;p&gt;Test the failover, not the failover system. We had a lovely runbook for swapping NLBs and zero tests for what the clients actually did when we swapped. "Never had a DNS issue" just meant we'd never drained a load balancer with a JVM watching.&lt;/p&gt;

&lt;p&gt;Next game day, we're killing an NLB on purpose and watching every runtime re-resolve. If something caches forever, I'd rather find out at 2pm on a Tuesday than during a real migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html" rel="noopener noreferrer"&gt;AWS: NLB target deregistration and connection draining&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/doc-files/net-properties.html" rel="noopener noreferrer"&gt;Oracle JDK networking properties (&lt;code&gt;networkaddress.cache.ttl&lt;/code&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-values.html" rel="noopener noreferrer"&gt;Route 53 record TTL guidance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway (provider failover for the model-call path)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>infrastructure</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>A provider latency spike stalled our whole build queue</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:27:47 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/a-provider-latency-spike-stalled-our-whole-build-queue-47fa</link>
      <guid>https://dev.to/claire_nguyen/a-provider-latency-spike-stalled-our-whole-build-queue-47fa</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: A provider slowdown turned a 2-second LLM call into a 70-second hang. Because our build agents block on that call, the queue backed up to roughly 400 jobs in twelve minutes. We put Bifrost in front with hard timeouts and a fallback model, and the queue stopped caring whether any single provider was healthy.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bit nobody designs for
&lt;/h2&gt;

&lt;p&gt;I work on compute and build orchestration at Buildkite. One of our internal services calls an LLM to triage flaky test output, group it, and suggest a likely owner. Small thing. Saves engineers a fair bit of digging.&lt;/p&gt;

&lt;p&gt;The catch is that a build agent waits on that call before it releases its slot. So the latency of one HTTP request to a model provider quietly became part of our queue's throughput math. Nobody wrote it down that way, but that's what it was.&lt;/p&gt;

&lt;p&gt;On a Tuesday in May the provider's p99 went sideways. Not an outage. Just slow. Our default client timeout was 60 seconds, our retry was three attempts with backoff, and suddenly a call that normally took 2 seconds was eating 70 before giving up. Agents held their slots the whole time. Within twelve minutes the run queue sat at about 400 jobs that had nothing to do with the LLM at all.&lt;/p&gt;

&lt;p&gt;Classic head-of-line blocking. One slow dependency, a whole fleet stuck behind it.&lt;/p&gt;
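
&lt;p&gt;The arithmetic is worth seeing once. This is a toy model, not our real capacity planning: the agent count, arrival rate, and per-job work time are hypothetical placeholders, and it assumes every job makes exactly one blocking call. The only thing that changes between the two runs is how long that call holds the slot.&lt;/p&gt;

```python
# Back-of-the-envelope model of a build queue whose agents block on one call.
# Every number here is a hypothetical placeholder, not a measurement from our fleet.


def throughput_per_minute(agents, work_seconds, llm_seconds):
    """Jobs per minute the fleet can finish when each slot is held for work + call."""
    return agents * 60 / (work_seconds + llm_seconds)


def backlog_after(minutes, arrivals_per_minute, agents, work_seconds, llm_seconds):
    """Queued jobs after `minutes`, starting from an empty queue."""
    drained = throughput_per_minute(agents, work_seconds, llm_seconds)
    return max(0.0, (arrivals_per_minute - drained) * minutes)


if __name__ == "__main__":
    fleet = dict(arrivals_per_minute=45, agents=50, work_seconds=60)
    print("healthy:", round(backlog_after(12, llm_seconds=2, **fleet)))
    print("slow:   ", round(backlog_after(12, llm_seconds=70, **fleet)))
```

&lt;p&gt;With a 2-second call the fleet keeps up and the backlog stays at zero. Stretch the call to 70 seconds and the same fleet drains at less than half the arrival rate, so the queue grows by hundreds of jobs in twelve minutes.&lt;/p&gt;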

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We'd been calling the provider SDK directly from the service. The reliability logic lived in our own code, which meant our timeout values, our retry counts, and our fallback policy were all bespoke and slightly wrong.&lt;/p&gt;

&lt;p&gt;We moved the call behind &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, a self-hosted Go gateway that speaks an OpenAI-compatible API. The point wasn't to add a hop. It was to move the failure handling out of our app and into config we could reason about during an incident.&lt;/p&gt;

&lt;p&gt;Three things mattered for us.&lt;/p&gt;

&lt;p&gt;First, &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;fallbacks&lt;/a&gt;. If the primary model is slow or erroring, route to a different model or provider instead of retrying the sad one into the ground. Second, &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt;, because flaky test output repeats far more than you'd reckon, and a cache hit is a call that can't hang. Third, native &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt;, so the LLM path showed up on the same SLO dashboards as everything else we run.&lt;/p&gt;

&lt;p&gt;Here's the gist of the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"default_request_timeout_in_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"default_request_timeout_in_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-haiku-4-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 8-second timeout is the real fix. Our triage call has no business taking longer than that, and if it does, we'd rather get a degraded answer from the fallback than hold a build slot hostage. The gateway fails over instead of stacking retries against a provider that's already struggling.&lt;/p&gt;
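
&lt;p&gt;If the pattern is unfamiliar, here's a minimal sketch of "hard deadline, then fall back" in plain Python. It illustrates the general idea only and is not how Bifrost implements it. The two provider functions are fakes, and the timings are shrunk so the demo runs quickly.&lt;/p&gt;

```python
# Generic "hard timeout, then fall back" pattern. Illustration only: this is
# not Bifrost's implementation, and both provider functions are fakes.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_primary(prompt):
    """Stand-in for a provider having a bad day."""
    time.sleep(0.5)
    return "primary: " + prompt


def fast_fallback(prompt):
    """Stand-in for the cheaper fallback model."""
    return "fallback: " + prompt


def call_with_fallback(prompt, providers, timeout_seconds):
    """Try each provider in order, giving each one a hard deadline."""
    pool = ThreadPoolExecutor(max_workers=len(providers))
    try:
        for provider in providers:
            future = pool.submit(provider, prompt)
            try:
                return future.result(timeout=timeout_seconds)
            except FutureTimeout:
                continue  # stop waiting on this provider, move to the next
            except Exception:
                continue  # provider errored, move to the next
        raise RuntimeError("all providers failed or timed out")
    finally:
        # Do not block on abandoned calls still running in the background.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_fallback("classify this flake", [slow_primary, fast_fallback], 0.1))
```

&lt;p&gt;One caveat with this sketch: the abandoned call keeps running in its thread until it finishes. A real client should cancel the underlying HTTP request as well, otherwise you've freed the build slot but not the connection.&lt;/p&gt;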

&lt;h2&gt;
  
  
  Did it work
&lt;/h2&gt;

&lt;p&gt;Next time the same provider got slow (it happened again in June, naturally), the gateway tripped to the fallback model after 8 seconds. Triage quality dropped a touch for those minutes. The queue never noticed. Peak backlog during that window was 11 jobs, not 400.&lt;/p&gt;

&lt;p&gt;The caching surprised me more than the failover. On a normal day we're seeing roughly 30% cache hits on triage prompts, because the same flaky test produces near-identical output across re-runs. Thirty percent fewer calls is thirty percent fewer chances to hang.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it stacks up
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before landing here. All three do the core gateway job. The differences are real, so here's the honest version.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Thing I cared about&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted, no SaaS dependency&lt;/td&gt;
&lt;td&gt;Yes, single Go binary&lt;/td&gt;
&lt;td&gt;Yes, Python proxy&lt;/td&gt;
&lt;td&gt;Yes, but SaaS is the main path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallback + load balancing config&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus metrics native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Via add-ons&lt;/td&gt;
&lt;td&gt;Via their dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider breadth&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;Widest of the three&lt;/td&gt;
&lt;td&gt;Broad&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has the widest provider list and the biggest community, so if you're calling something obscure it's a safe bet. If your stack is already Python, its proxy drops in with less friction. Portkey's managed offering and guardrail tooling are more polished out of the box, and for a team that doesn't want to run another service that matters.&lt;/p&gt;

&lt;p&gt;We picked Bifrost because it's one static binary written in Go, the config above is the whole story, and I didn't want a Python runtime sitting in a latency-sensitive path on our build hosts. That's a preference, not a verdict.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;You're adding a network hop. For us it's a sidecar on the same host, so it's sub-millisecond, but it's not zero and you should measure it rather than trust me.&lt;/p&gt;

&lt;p&gt;The fallback model gives worse triage answers. We decided a rough answer that arrives beats a good one that never does, but that's a call you make per use case, not a universal truth.&lt;/p&gt;

&lt;p&gt;Semantic caching can serve a stale-ish answer when two failures look similar but aren't. We tuned the similarity threshold conservatively and accept the occasional miss.&lt;/p&gt;

&lt;p&gt;And a gateway is now a dependency too. We run two replicas behind a local load balancer and treat it like any other tier-1 service. If you deploy a single instance, you've moved your single point of failure, not removed it.&lt;/p&gt;

&lt;p&gt;The broader lesson has nothing to do with LLMs. Any blocking call to a dependency you don't control belongs behind a timeout you do control. We just hadn't noticed an LLM had become one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability and Prometheus metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM proxy docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/sre-book/addressing-cascading-failures/" rel="noopener noreferrer"&gt;Google SRE Book: Addressing Cascading Failures&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>Scaling Claude Code Across Enterprise Engineering Teams</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 17 Jun 2026 08:50:38 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/scaling-claude-code-across-enterprise-engineering-teams-5dfp</link>
      <guid>https://dev.to/claire_nguyen/scaling-claude-code-across-enterprise-engineering-teams-5dfp</guid>
      <description>&lt;p&gt;&lt;em&gt;Scaling Claude Code deployments across engineering teams requires governance, cost control, and observability that individual API keys cannot provide. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source AI gateway that centralizes Claude Code access without changing how developers work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude Code runs against &lt;code&gt;api.anthropic.com&lt;/code&gt; by default, which means each developer who installs it carries an individual provider credential and every request bypasses any central point of control. That arrangement works for a pilot, but it breaks down once hundreds of engineers run the agent daily and the organization needs to answer questions about spend, access, and reliability. Scaling Claude Code deployments to that size calls for an infrastructure layer between the agent and the model providers. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built by Maxim AI, fills that role by routing all Claude Code traffic through a single governed endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why scaling Claude Code across teams is hard
&lt;/h2&gt;

&lt;p&gt;Scaling Claude Code from a few volunteers to an entire engineering organization introduces four operational problems: untracked cost, ungoverned access, single-provider lock-in, and scattered telemetry. Each one is manageable for one developer and severe across a fleet. The list below maps the failure modes that surface as adoption grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-developer API keys make it difficult to attribute spend to a team, project, or individual.&lt;/li&gt;
&lt;li&gt;There is no shared view of where token budget is actually going.&lt;/li&gt;
&lt;li&gt;Setting per-team quotas or hard spending caps is not possible with raw provider keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributing and rotating provider credentials across many developers is operationally fragile.&lt;/li&gt;
&lt;li&gt;Revoking a single compromised key without disrupting everyone else is awkward.&lt;/li&gt;
&lt;li&gt;Tracking which teams are entitled to which models has no central enforcement point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model flexibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code targets Anthropic's endpoint by default, so there is no central way to route specific tasks to alternative providers.&lt;/li&gt;
&lt;li&gt;There is no failover path when a provider returns errors or hits a rate limit.&lt;/li&gt;
&lt;li&gt;Lighter models cannot be used for simple tasks to reduce per-request cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs live on individual developer machines rather than in a shared system.&lt;/li&gt;
&lt;li&gt;Token usage, latency, and error rates are not monitored centrally.&lt;/li&gt;
&lt;li&gt;Debugging a bad session or auditing usage after the fact is impractical at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues move from inconvenient to blocking the moment Claude Code crosses from an experiment into production engineering workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an enterprise AI gateway does for Claude Code
&lt;/h2&gt;

&lt;p&gt;An AI gateway is a unified entry point that routes, authenticates, governs, and observes traffic to multiple LLM providers from a single API. Placed in front of Claude Code, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; intercepts every request the agent makes, applies policy, forwards it to the chosen provider, and returns the response in Anthropic's expected format. The client binary stays unmodified, so developers keep working exactly as before.&lt;/p&gt;

&lt;p&gt;Integration is a base URL change. Point Claude Code at the Anthropic-compatible endpoint Bifrost exposes, supply a virtual key, and launch the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Point Claude Code at Bifrost's Anthropic-compatible endpoint&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/anthropic"&lt;/span&gt;

&lt;span class="c"&gt;# Use a Bifrost virtual key instead of a raw provider key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vk_your_key"&lt;/span&gt;

&lt;span class="c"&gt;# Launch Claude Code as usual&lt;/span&gt;
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because Bifrost functions as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for Anthropic's API surface, Claude Code never knows a gateway is in the path. Behind that endpoint, Bifrost routes to any of &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ supported providers&lt;/a&gt; and over 1,000 models, giving a single Claude Code session access to Anthropic, AWS Bedrock, Google Vertex, Azure OpenAI, and others without client-side changes. The &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; documents the full setup, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code resource hub&lt;/a&gt; collects the governance patterns teams use in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key capabilities for scaling Claude Code deployments
&lt;/h2&gt;

&lt;p&gt;Routing Claude Code through Bifrost turns four scaling problems into configuration. The capabilities below correspond directly to the cost, access, flexibility, and observability gaps described earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-team access control with virtual keys
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are the primary governance entity in Bifrost. Instead of handing real provider credentials to developers, administrators issue virtual keys that carry their own permissions, budgets, and rate limits. This abstraction supports a hierarchy that mirrors how organizations are structured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organization level:&lt;/strong&gt; set an overall spending ceiling across all teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team level:&lt;/strong&gt; allocate independent budgets to individual engineering teams or projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer level:&lt;/strong&gt; track and cap usage for a single engineer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keys can be created, rotated, or revoked instantly without touching the underlying provider credentials. The result is centralized credential management, real-time spend visibility per team, automatic enforcement of &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budgets and rate limits&lt;/a&gt;, and clean onboarding and offboarding. Full policy options are covered on the &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance resource page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost optimization through model routing
&lt;/h3&gt;

&lt;p&gt;Not every Claude Code task needs a frontier model. With &lt;a href="https://docs.getbifrost.ai/providers/routing-rules" rel="noopener noreferrer"&gt;routing rules&lt;/a&gt;, teams can direct simple operations to lighter, cheaper models while reserving the most capable ones for hard problems. Claude Code already organizes its work into tiers, and a gateway lets those tiers map to the most cost-effective model for each job:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Suggested model tier&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lightweight edits, formatting&lt;/td&gt;
&lt;td&gt;Claude Haiku or a comparable small model&lt;/td&gt;
&lt;td&gt;Lowest per-token cost for routine work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default coding and refactoring&lt;/td&gt;
&lt;td&gt;Claude Sonnet&lt;/td&gt;
&lt;td&gt;Strong balance of capability and cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning, deep review&lt;/td&gt;
&lt;td&gt;Claude Opus or another frontier model&lt;/td&gt;
&lt;td&gt;Reserve highest cost for highest-value tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because Bifrost normalizes requests across providers, the same routing logic can send a task to AWS Bedrock, Google Vertex, or another backend when that provider offers a better price or availability for a given model. Developers experience one consistent interface while the gateway optimizes spend underneath it.&lt;/p&gt;
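
&lt;p&gt;The tier mapping in the table can be sketched as a simple lookup. This is a conceptual illustration only, not Bifrost's routing-rule syntax: the task-type names, tier names, and model identifiers below are hypothetical labels chosen for the example.&lt;/p&gt;

```python
# Toy tier-based router. Task types, tier names, and model identifiers are
# hypothetical labels for illustration; this is not Bifrost's routing-rule syntax.

TIER_FOR_TASK = {
    "format": "small",
    "lightweight_edit": "small",
    "coding": "default",
    "refactor": "default",
    "deep_review": "frontier",
    "complex_reasoning": "frontier",
}

MODEL_FOR_TIER = {
    "small": "anthropic/haiku-class-model",
    "default": "anthropic/sonnet-class-model",
    "frontier": "anthropic/opus-class-model",
}


def pick_model(task_type):
    """Map a task type to a model, using the default tier for unknown tasks."""
    tier = TIER_FOR_TASK.get(task_type, "default")
    return MODEL_FOR_TIER[tier]


if __name__ == "__main__":
    for task in ("format", "refactor", "deep_review", "something_new"):
        print(task, "maps to", pick_model(task))
```

&lt;p&gt;Sending unknown task types to the middle tier is a deliberate choice in this sketch: it avoids both silent quality loss from the smallest model and surprise cost from the largest one.&lt;/p&gt;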

&lt;h3&gt;
  
  
  Production-grade reliability
&lt;/h3&gt;

&lt;p&gt;A single-provider setup is only as reliable as that provider's uptime. Bifrost adds &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt; so that when a primary provider returns errors or hits a rate limit, requests move to a configured backup with no interruption to the Claude Code session.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fallbacks:&lt;/strong&gt; define an ordered chain of providers and models so requests complete even during an outage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing:&lt;/strong&gt; distribute traffic across multiple API keys and providers using &lt;a href="https://docs.getbifrost.ai/features/keys-management" rel="noopener noreferrer"&gt;weighted key management&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching:&lt;/strong&gt; Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; returns stored responses for semantically similar requests, cutting both latency and cost.&lt;/li&gt;
&lt;/ul&gt;
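
&lt;p&gt;Weighted load balancing across keys, as described in the list above, comes down to picking a key with probability proportional to its weight. The sketch below shows that general technique in plain Python. It is not Bifrost's internal implementation, and the key names and weights are hypothetical.&lt;/p&gt;

```python
# Weighted random selection across API keys. Key names and weights are
# hypothetical; this sketches the general technique, not Bifrost's internals.
import random
from collections import Counter


def pick_key(keys, rng=random):
    """Choose one key id with probability proportional to its weight."""
    ids = [key["id"] for key in keys]
    weights = [key["weight"] for key in keys]
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    keys = [
        {"id": "key-a", "weight": 0.7},
        {"id": "key-b", "weight": 0.3},
    ]
    rng = random.Random(42)  # seeded so repeated runs give the same split
    counts = Counter(pick_key(keys, rng) for _ in range(10000))
    print(counts)
```

&lt;p&gt;Over many requests the split approaches the configured weights, which is what makes it possible to spread traffic across keys with different rate limits.&lt;/p&gt;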

&lt;h3&gt;
  
  
  Unified observability
&lt;/h3&gt;

&lt;p&gt;Bifrost consolidates telemetry for all Claude Code activity into one place, replacing logs scattered across developer laptops. The &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;built-in observability layer&lt;/a&gt; tracks token usage by team, project, and developer, alongside request volume, latency, and error rates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time dashboards:&lt;/strong&gt; monitor spend and performance across every Claude Code consumer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request inspection:&lt;/strong&gt; review full request and response payloads and trace multi-turn conversations for debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise integrations:&lt;/strong&gt; export native &lt;a href="https://docs.getbifrost.ai/features/observability/prometheus" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/features/observability/otel" rel="noopener noreferrer"&gt;OpenTelemetry traces&lt;/a&gt; into existing monitoring stacks, with custom alerting on usage thresholds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This visibility is what lets a platform team find optimization opportunities and catch reliability regressions before they affect a wider rollout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation architecture
&lt;/h2&gt;

&lt;p&gt;Bifrost supports several deployment models, so teams can match the gateway to their security and operations requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted:&lt;/strong&gt; run the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;open-source gateway&lt;/a&gt; inside your own infrastructure for maximum control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise managed:&lt;/strong&gt; use &lt;a href="https://docs.getbifrost.ai/enterprise/overview" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; for managed deployment, clustering, and advanced governance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid / in-VPC:&lt;/strong&gt; keep the gateway in your own VPC with &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployment&lt;/a&gt; for workloads that cannot egress to the public internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request flow through the gateway follows a consistent path from the developer workstation to the provider and back into centralized monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer workstation (Claude Code)
        |
   Bifrost gateway (localhost:8080/anthropic)
        |
[Virtual key -&amp;gt; team budget -&amp;gt; model selection -&amp;gt; routing rules]
        |
   Provider API (Anthropic, Bedrock, Vertex, Azure, ...)
        |
   Response + logging + metrics
        |
   Centralized dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the security side, enterprise deployments add &lt;a href="https://docs.getbifrost.ai/enterprise/rbac" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;, SSO and OIDC integration with providers like Okta and Microsoft Entra, and &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;immutable audit logs&lt;/a&gt; for SOC 2, GDPR, HIPAA, and ISO 27001 compliance. Sensitive workloads can be isolated at the network layer so that Claude Code traffic never leaves controlled infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for production Claude Code deployments
&lt;/h2&gt;

&lt;p&gt;Roll out a Claude Code gateway in stages rather than enforcing every policy at once. The following sequence keeps the migration low-risk:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with monitoring.&lt;/strong&gt; Deploy Bifrost in observability mode first. Establish baseline usage, identify high-volume teams, and understand cost drivers before enforcing any limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer in tiered access.&lt;/strong&gt; Build a &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt; hierarchy that matches your org chart, starting with conservative budgets and adjusting against real usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure failover chains.&lt;/strong&gt; Set an ordered fallback path, for example a primary Anthropic route backed by an equivalent model on AWS Bedrock, so sessions survive a provider incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable MCP tools deliberately.&lt;/strong&gt; Bifrost as an &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; extends Claude Code with external tools through the &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;. Begin with low-risk tools such as filesystem and search, then add database or internal-API tools as governance matures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track quality, not just cost.&lt;/strong&gt; Pair usage data with quality signals from automated testing so AI-assisted output stays reliable as the deployment grows.&lt;/li&gt;
&lt;/ol&gt;
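
&lt;p&gt;The ordered fallback in step 3 can be pictured with a minimal Python sketch. This is not Bifrost's implementation or configuration format; &lt;code&gt;call_with_fallback&lt;/code&gt;, &lt;code&gt;ProviderError&lt;/code&gt;, and the two stand-in routes are hypothetical names used only to show the control flow: try the primary route, and move to the next one only when it fails.&lt;/p&gt;

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a hypothetical provider call when the upstream fails."""


Route = Tuple[str, Callable[[str], str]]


def call_with_fallback(routes: Sequence[Route], prompt: str) -> Tuple[str, str]:
    """Try each (name, send) route in order; return (name, reply) from the first success."""
    errors: List[str] = []
    for name, send in routes:
        try:
            return name, send(prompt)
        except ProviderError as exc:
            # Record the failure and fall through to the next route in the chain.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all routes failed: " + "; ".join(errors))


def anthropic_down(prompt: str) -> str:
    # Stand-in for the primary route during a provider incident.
    raise ProviderError("overloaded")


def bedrock_ok(prompt: str) -> str:
    # Stand-in for an equivalent model on the secondary route.
    return f"echo: {prompt}"
```

&lt;p&gt;Calling &lt;code&gt;call_with_fallback([("anthropic", anthropic_down), ("bedrock", bedrock_ok)], "hi")&lt;/code&gt; returns the reply from the second route, which is the behavior a session relies on to survive an incident on the first.&lt;/p&gt;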

&lt;h2&gt;
  
  
  Common questions about scaling Claude Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does routing Claude Code through a gateway change the developer experience?
&lt;/h3&gt;

&lt;p&gt;No. The integration is a base URL change, and Bifrost returns responses in Anthropic's native format. Developers continue to use the &lt;code&gt;claude&lt;/code&gt; command and Claude Code's native features without modification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can the same Claude Code session use non-Anthropic models?
&lt;/h3&gt;

&lt;p&gt;Yes. Because Bifrost exposes an Anthropic-compatible endpoint while routing to &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ providers&lt;/a&gt;, a single session can target Anthropic, AWS Bedrock, Google Vertex, Azure OpenAI, and others, selected by routing rules or on the fly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI gateway control Claude Code costs?
&lt;/h3&gt;

&lt;p&gt;Spend is controlled through virtual keys that carry per-team and per-developer budgets, combined with model routing that sends routine tasks to cheaper models. Both are enforced centrally at the gateway, so cost limits apply automatically to every request.&lt;/p&gt;
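
&lt;p&gt;As a rough illustration of what hierarchical budget enforcement means, the Python sketch below charges a request against a developer key and the team key above it, and rejects the request if either level would overspend. &lt;code&gt;VirtualKey&lt;/code&gt; and &lt;code&gt;authorize&lt;/code&gt; are hypothetical names for this example, not Bifrost's API.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VirtualKey:
    """Hypothetical budget holder; parent links a developer key to its team key."""
    name: str
    budget_usd: float
    spent_usd: float = 0.0
    parent: Optional["VirtualKey"] = None

    def remaining(self) -> float:
        return self.budget_usd - self.spent_usd


def authorize(key: VirtualKey, cost_usd: float) -> bool:
    """Charge cost to the key and every ancestor, or reject if any level would overspend."""
    chain: List[VirtualKey] = []
    node: Optional[VirtualKey] = key
    while node is not None:
        chain.append(node)
        node = node.parent
    # Check every level first so a rejected request leaves no partial charge behind.
    if any(cost_usd > k.remaining() for k in chain):
        return False
    for k in chain:
        k.spent_usd += cost_usd
    return True
```

&lt;p&gt;Because the check runs in one place, a developer who still has personal budget left is still stopped once the team-level budget is exhausted.&lt;/p&gt;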

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Connecting Claude Code to Bifrost takes four short steps. The steps below assume a local gateway; production deployments follow the same pattern against a hosted instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Claude Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Run Bifrost and create a virtual key&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;gateway&lt;/a&gt;, add your provider credentials, and create a virtual key scoped to a team with its own budget through the Bifrost dashboard or configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Point Claude Code at the gateway&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/anthropic"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vk_your_team_key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Launch Claude Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Teams that prefer not to manage environment variables manually can use the &lt;a href="https://docs.getbifrost.ai/quickstart/cli/getting-started" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt;, which configures the connection and launches Claude Code with the correct settings automatically. The same flow works for &lt;a href="https://www.getmaxim.ai/bifrost/resources/cli-agents" rel="noopener noreferrer"&gt;other CLI agents&lt;/a&gt; such as Codex CLI, Gemini CLI, and Cursor. Claude Code itself is documented in &lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Anthropic's official docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale Claude Code with confidence
&lt;/h2&gt;

&lt;p&gt;Scaling Claude Code deployments across an enterprise requires more than distributing API keys. It requires centralized governance, multi-provider flexibility, and unified observability, delivered without changing how developers work. Bifrost provides that control plane as an open-source AI gateway, giving platform teams cost visibility, automatic failover, and audit-ready logging while preserving the native Claude Code experience.&lt;/p&gt;

&lt;p&gt;To see how Bifrost can govern and scale your Claude Code deployment, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team or explore &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; for managed deployment and advanced governance.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>claude</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Top 5 AI Agent Evaluation Tools in 2026</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:22:41 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/top-5-ai-agent-evaluation-tools-in-2026-2c7c</link>
      <guid>https://dev.to/claire_nguyen/top-5-ai-agent-evaluation-tools-in-2026-2c7c</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluating AI agents requires more than static benchmarks. This guide compares the five leading &lt;a href="https://www.getmaxim.ai/products/agent-simulation-evaluation" rel="noopener noreferrer"&gt;AI agent evaluation&lt;/a&gt; platforms in 2026: &lt;a href="https://www.getmaxim.ai" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt;, Langfuse, Arize, LangSmith, and Galileo. Maxim AI is the best choice for teams that need end-to-end simulation, evaluation, and observability in a single platform built for cross-functional collaboration.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Production AI agents now handle customer support escalations, financial data analysis, and multi-step autonomous workflows. As these systems become mission-critical, systematic evaluation is no longer optional. Evaluation spans three concrete dimensions: output quality across diverse scenarios, cost control in multi-step workflows, and audit trail generation for regulatory requirements. Modern evaluation platforms address all three through tracing, automated testing, and production monitoring. This guide covers the five platforms that lead the field in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is AI Agent Evaluation?
&lt;/h2&gt;

&lt;p&gt;AI agent evaluation is the process of measuring agent output quality, task completion, and behavior across real-world scenarios before and after deployment. Unlike static ML model scoring, agent evaluation must account for multi-step trajectories where a single failure can cascade downstream. Effective evaluation frameworks cover pre-production simulation, automated scoring at session and trace levels, and continuous monitoring once agents are live.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Maxim AI: End-to-End Simulation, Evaluation, and Observability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt; is an end-to-end platform for AI simulation, evaluation, and observability, purpose-built for teams shipping agentic applications. The platform brings pre-release experimentation, scenario-based simulation, and production monitoring into a single interface designed for both engineering and product teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simulation and Testing
&lt;/h3&gt;

&lt;p&gt;Maxim's &lt;a href="https://www.getmaxim.ai/products/agent-simulation-evaluation" rel="noopener noreferrer"&gt;simulation engine&lt;/a&gt; tests agents across hundreds of scenarios and user personas before any code reaches production. Evaluation operates at the conversational level: complete agent trajectories are analyzed for task completion, and simulations can be re-run from any step to isolate root causes and reproduce failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Framework
&lt;/h3&gt;

&lt;p&gt;The platform supports a unified framework for machine and human evaluations. Teams access off-the-shelf evaluators from the evaluator store or create custom evaluators tuned to their quality criteria. Evaluations are configurable at the session, trace, or span level, giving engineering teams full granularity across &lt;a href="https://www.getmaxim.ai/blog/ai-agent-evaluation-metrics/" rel="noopener noreferrer"&gt;multi-agent systems&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Suite
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/products/agent-observability" rel="noopener noreferrer"&gt;Maxim's observability layer&lt;/a&gt; provides real-time production monitoring with distributed tracing. Custom dashboards expose agent behavior across any dimension, and automated quality checks trigger alerts when production metrics fall outside defined thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Management
&lt;/h3&gt;

&lt;p&gt;Teams curate multi-modal datasets directly from production logs. Human-in-the-loop workflows support continuous dataset enrichment, and synthetic data generation covers evaluation scenarios that production traffic has not yet reached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Functional Collaboration
&lt;/h3&gt;

&lt;p&gt;A no-code UI enables product managers to configure evaluations and build dashboards without engineering dependencies. Playground++ supports rapid prompt engineering and model comparison across quality, cost, and latency dimensions. This is a key differentiator from tools that restrict workflow ownership to engineering teams alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams requiring full lifecycle coverage from experimentation through production, organizations where product and engineering collaborate on agent quality, enterprises with human-plus-LLM evaluation workflows, and teams building multi-agent systems that require &lt;a href="https://www.getmaxim.ai/products/agent-observability" rel="noopener noreferrer"&gt;granular observability&lt;/a&gt; with audit trails.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://www.getmaxim.ai/blog/evaluation-workflows-for-ai-agents/" rel="noopener noreferrer"&gt;evaluation workflows for AI agents&lt;/a&gt; for a deeper look at how teams structure their eval pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Langfuse: Open-Source Tracing with Self-Hosting
&lt;/h2&gt;

&lt;p&gt;Langfuse is an open-source LLM observability platform that offers self-hosted deployment for teams with strict data residency requirements. The platform covers tracing, evaluation, and monitoring with full infrastructure control.&lt;/p&gt;

&lt;p&gt;Core capabilities include prompt management with version tracking and usage pattern analysis, LLM-as-a-judge evaluations with custom or pre-built evaluators, session-based analysis for user-facing applications, and dataset creation from production traces for offline evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams with data privacy requirements that prohibit third-party cloud processing, developers building custom evaluation workflows on an open-source foundation, and organizations that need self-hosted infrastructure at low cost. See the &lt;a href="https://www.getmaxim.ai/compare/maxim-vs-langfuse" rel="noopener noreferrer"&gt;Maxim vs Langfuse comparison&lt;/a&gt; for a full capability breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Arize: Unified Monitoring for ML and LLM Systems
&lt;/h2&gt;

&lt;p&gt;Arize (Phoenix platform) applies ML observability principles to LLM monitoring, providing a single monitoring layer across classical ML models and agent applications. This makes it relevant for organizations that run both traditional models and generative AI in the same production stack.&lt;/p&gt;

&lt;p&gt;Key capabilities include drift detection and performance degradation monitoring, tool selection and invocation evaluators for agent workflows, OpenTelemetry-compatible tracing via OpenInference instrumentation, and integration with AWS Bedrock Agents and major orchestration frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises running hybrid ML and LLM systems that need a unified monitoring view, data science teams already familiar with traditional MLOps tooling, and regulated industries that require explainability across both model types. Compare platform depth in the &lt;a href="https://www.getmaxim.ai/compare/maxim-vs-arize" rel="noopener noreferrer"&gt;Maxim vs Arize breakdown&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. LangSmith: LangChain-Native Debugging and Tracing
&lt;/h2&gt;

&lt;p&gt;LangSmith is the observability and debugging tool from LangChain, built specifically for applications developed on the LangChain framework. It offers detailed tracing and tight integration with LangChain abstractions, which reduces setup time for teams already in that ecosystem.&lt;/p&gt;

&lt;p&gt;Capabilities include multi-turn evaluation for complete agent conversations, an Insights Agent that automatically categorizes usage patterns, offline and online evaluation workflows, and annotation queues for subject-matter expert feedback collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams with a significant investment in the LangChain ecosystem, developers who need rapid prototyping and iterative debugging, and organizations looking for a developer-first tracing experience within LangChain-based architectures. For teams evaluating their options, the &lt;a href="https://www.getmaxim.ai/compare/maxim-vs-langsmith" rel="noopener noreferrer"&gt;Maxim vs LangSmith comparison&lt;/a&gt; shows how the platforms differ on collaboration and evaluation depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Galileo: Hallucination Detection and Production Guardrails
&lt;/h2&gt;

&lt;p&gt;Galileo focuses on AI reliability for high-stakes use cases, offering research-backed hallucination detection, an eval-to-guardrail lifecycle, and Luna-2 small language models for cost-effective production monitoring.&lt;/p&gt;

&lt;p&gt;Core capabilities include research-grounded metrics for factual accuracy and hallucination detection, automatic conversion of pre-production evaluations into production guardrails, agent-specific metrics covering tool selection accuracy and session success rates, and a reported 97% cost reduction in monitoring via Luna-2 model inference. Agent evaluation metrics and context coverage are narrower in scope compared to full-lifecycle platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; High-stakes domains (healthcare, finance, legal) where factual accuracy validation is the primary requirement, teams that need real-time guardrails controlling live agent behavior, and organizations with safety compliance obligations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Primary Strength&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maxim AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;End-to-end simulation, evaluation, and observability with cross-functional collaboration&lt;/td&gt;
&lt;td&gt;Cloud, On-premise&lt;/td&gt;
&lt;td&gt;Free tier; Pro from $29/seat/month&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source tracing with self-hosting&lt;/td&gt;
&lt;td&gt;Cloud, Self-hosted&lt;/td&gt;
&lt;td&gt;Free tier (50k observations/month); Pro from $59/month&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unified ML and LLM monitoring&lt;/td&gt;
&lt;td&gt;Cloud, On-premise&lt;/td&gt;
&lt;td&gt;Contact sales&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangChain-native debugging&lt;/td&gt;
&lt;td&gt;Cloud, Self-hosted (Enterprise)&lt;/td&gt;
&lt;td&gt;Free tier (5k traces/month); Contact sales&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Galileo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hallucination detection and guardrails&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Free tier; Contact sales&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to Choose an AI Agent Evaluation Platform
&lt;/h2&gt;

&lt;p&gt;The right platform depends on your evaluation scope, team structure, and deployment requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive lifecycle coverage across engineering and product teams:&lt;/strong&gt; Maxim AI is built for this. The no-code UI, simulation engine, and observability layer give both engineering and product full workflow ownership.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source control and data residency:&lt;/strong&gt; Langfuse is the primary option, with self-hosted deployment and an active open-source community.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid ML and LLM monitoring in a unified interface:&lt;/strong&gt; Arize addresses this with its Phoenix platform and OpenTelemetry-based tracing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain-native development:&lt;/strong&gt; LangSmith reduces integration overhead for teams already building on LangChain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time guardrails and hallucination detection in regulated industries:&lt;/strong&gt; Galileo is designed specifically for this use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platforms that only address one phase of the agent lifecycle (tracing only, or guardrails only) create gaps between pre-production testing and production monitoring. For most teams building at scale, a platform that spans the full development cycle reduces the overhead of maintaining separate tools and correlating data across systems.&lt;/p&gt;

&lt;p&gt;Maxim AI's &lt;a href="https://www.getmaxim.ai/blog/ai-agent-quality-evaluation/" rel="noopener noreferrer"&gt;agent quality evaluation approach&lt;/a&gt; covers this in more depth, including how simulation connects to production monitoring in a unified workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;AI agent evaluation in 2026 requires platforms that cover the complete development lifecycle, not just one phase. Maxim AI leads for teams that need simulation, evaluation, and observability in one place with cross-functional collaboration built in. Langfuse is the right choice when data control and open-source infrastructure are the priority. Arize fits organizations running hybrid ML and LLM workloads. LangSmith is the natural pick for LangChain-focused teams. Galileo addresses the specific need for hallucination prevention in safety-critical domains.&lt;/p&gt;

&lt;p&gt;Selecting the wrong tool adds cost and complexity, particularly when teams must maintain separate systems for pre-production testing and production monitoring. Matching the platform to your team's structure, deployment requirements, and evaluation scope is the decision that matters most.&lt;/p&gt;

&lt;p&gt;To see how Maxim AI covers the full AI agent evaluation lifecycle, from simulation through production monitoring, &lt;a href="https://getmaxim.ai/demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; or &lt;a href="https://app.getmaxim.ai/sign-up" rel="noopener noreferrer"&gt;sign up for free&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The SIGTERM our build workers ignored, and the 90s that fixed it</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 04:21:50 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-sigterm-our-build-workers-ignored-and-the-90s-that-fixed-it-2kk8</link>
      <guid>https://dev.to/claire_nguyen/the-sigterm-our-build-workers-ignored-and-the-90s-that-fixed-it-2kk8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't a bigger timeout, it was actually handling SIGTERM and bumping &lt;code&gt;stopTimeout&lt;/code&gt; to 120s. Cut our "agent lost" failures from ~2% of runs to under 0.1%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing that bugged me for weeks. We run a chunk of Buildkite's build compute on ECS, and every deploy or scale-in event would spike a small batch of failed builds. Not heaps. Maybe 2% of running jobs at that moment. Enough that someone in the team Slack would go "oi, my build died again" once a day.&lt;/p&gt;

&lt;p&gt;The error was always the same flavour: agent disconnected mid-step, job marked as &lt;code&gt;lost&lt;/code&gt;, customer retries, moves on. Annoying but not loud enough to page anyone. Which is exactly why it survived for a month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;ECS sends &lt;code&gt;SIGTERM&lt;/code&gt; to your container when it wants the task gone. Scale-in, deployment, spot reclaim, all of it. You get a grace window, then &lt;code&gt;SIGKILL&lt;/code&gt;. The default &lt;code&gt;stopTimeout&lt;/code&gt; is 30 seconds.&lt;/p&gt;

&lt;p&gt;Our build agent process caught nothing. The PID 1 in the container was a shell wrapper, and the agent ran as a child. &lt;code&gt;SIGTERM&lt;/code&gt; hit the shell, the shell shrugged, the agent kept running until &lt;code&gt;SIGKILL&lt;/code&gt; ripped the whole thing out. A 6-minute integration test step that was 80% done? Gone.&lt;/p&gt;

&lt;p&gt;The agent already supports graceful stop. It'll finish the current job and refuse new ones if you signal it properly. We weren't passing the signal through. Classic.&lt;/p&gt;

&lt;p&gt;Here's the before. Spot the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# before — shell eats the signal, agent never hears it&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/bin/sh", "-c", "buildkite-agent start"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;sh -c&lt;/code&gt; doesn't forward signals to the child by default. So PID 1 catches &lt;code&gt;SIGTERM&lt;/code&gt; and does nothing useful with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Two parts. Run the agent as PID 1 so it gets the signal directly, and give it enough time to drain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# after: exec-form ENTRYPOINT, no shell wrapper, agent runs as PID 1&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["buildkite-agent", "start"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the ECS task definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"containerDefinitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"build-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stopTimeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"essential"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set &lt;code&gt;stopTimeout&lt;/code&gt; to 120 because most steps finish inside two minutes, and the agent's own &lt;code&gt;--cancel-grace-period&lt;/code&gt; lines up so it doesn't get cut off mid-drain. The agent hears &lt;code&gt;SIGTERM&lt;/code&gt;, stops accepting new work, lets the current job run to completion, then exits clean. ECS sees a tidy exit and moves on.&lt;/p&gt;
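
&lt;p&gt;If you want to see the shape of that behaviour outside the agent, here's a rough Python sketch. It's not how &lt;code&gt;buildkite-agent&lt;/code&gt; is implemented, just the same idea: the handler flips a flag, the loop finishes the job in hand and stops picking up new ones. &lt;code&gt;DrainingWorker&lt;/code&gt; and the job callables are made up for illustration.&lt;/p&gt;

```python
import signal
from typing import Callable, Iterable, List, Tuple


class DrainingWorker:
    """Toy worker: finishes the current job after a stop request, takes no new ones."""

    def __init__(self) -> None:
        self.stopping = False
        self.completed: List[str] = []

    def request_stop(self, signum=None, frame=None) -> None:
        # Keep the handler tiny: set a flag and let the main loop react to it.
        self.stopping = True

    def install(self) -> None:
        # Route SIGTERM to the flag instead of letting it end the process mid-job.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, jobs: Iterable[Tuple[str, Callable[[], None]]]) -> None:
        for name, job in jobs:
            if self.stopping:
                break  # refuse new work once a stop was requested
            job()  # a stop request arriving during the job does not interrupt it
            self.completed.append(name)
```

&lt;p&gt;In a real process you'd call &lt;code&gt;install()&lt;/code&gt; once at startup, from the main thread. The point of the flag is that the stop request and the job boundary are decoupled: the signal can land whenever, and the worker only acts on it between jobs.&lt;/p&gt;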

&lt;p&gt;The agent does the right thing on its own once it actually receives the signal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;buildkite-agent start &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cancel-grace-period&lt;/span&gt; 120 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--disconnect-after-idle-timeout&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--disconnect-after-idle-timeout&lt;/code&gt; matters for the scale-in path. An idle agent disconnects fast so the autoscaler isn't waiting 120s to retire a worker doing nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;agent lost&lt;/code&gt; failures (% of runs)&lt;/td&gt;
&lt;td&gt;~2%&lt;/td&gt;
&lt;td&gt;&amp;lt;0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg drain time on scale-in&lt;/td&gt;
&lt;td&gt;n/a (SIGKILL)&lt;/td&gt;
&lt;td&gt;11s idle / 70s busy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Builds killed per deploy&lt;/td&gt;
&lt;td&gt;8–15&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stopTimeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30s&lt;/td&gt;
&lt;td&gt;120s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The drain time on busy workers went up, sure. We trade a slower scale-in for builds that don't die. Easy call. Idle workers still retire in about 11 seconds because they've got nothing to finish.&lt;/p&gt;

&lt;p&gt;One thing I'll flag for the LLM-curious crowd: a few of our build steps call a model for flaky-test classification, and those run through a small gateway sidecar (we use &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;) in the same task. That process needs the same treatment. It has to flush in-flight requests on &lt;code&gt;SIGTERM&lt;/code&gt; or you get half-finished calls counted as failures, same bug in a different shirt. Once the agent drains properly, the sidecar's automatic failover stops papering over the real problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we caught it for good
&lt;/h2&gt;

&lt;p&gt;The fix is one thing. Trusting it is another. We added a game day check: kill a task running an active build, on purpose, and assert the build completes instead of going &lt;code&gt;lost&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# during a game day — force-drain a worker mid-build&lt;/span&gt;
aws ecs stop-task &lt;span class="nt"&gt;--cluster&lt;/span&gt; build-prod &lt;span class="nt"&gt;--task&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TASK_ARN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# then assert the build state, not just that the task stopped&lt;/span&gt;
bk-cli build get &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUILD_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--field&lt;/span&gt; state &lt;span class="c"&gt;# expect: passed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you never test the drain, you don't have a drain. You've got hope. "Never had a lost build" usually means nobody's pulled the plug while watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Longer &lt;code&gt;stopTimeout&lt;/code&gt; means slower deploys when workers are busy. If you're running a fleet of 500, that's real wall-clock time during a rolling deploy. We accepted it because dead builds cost more than slow deploys.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stopTimeout&lt;/code&gt; has a hard ceiling of 120s on Fargate. If a single build step legitimately runs longer than two minutes and can't be checkpointed, this won't save it. You'll still lose those. For us that's a rare long-running step, and we've moved most of them to job-level retries instead.&lt;/p&gt;

&lt;p&gt;This also assumes your work is interruptible-after-current-task. If one agent holds a single 40-minute job, graceful stop just means waiting 120s then killing it anyway. Drain logic helps fleets of short jobs far more than it helps long monolithic ones.&lt;/p&gt;

&lt;p&gt;And spot reclaim only gives you a 2-minute warning, which is right at the edge of our 120s window. Tight. We lean on retries as the backstop there rather than pretending the drain always wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html" rel="noopener noreferrer"&gt;ECS task lifecycle and stopTimeout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://buildkite.com/docs/agent/v3/signal-handling" rel="noopener noreferrer"&gt;Buildkite agent signal handling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/dockerfile/#stopsignal" rel="noopener noreferrer"&gt;Graceful shutdown in containers (Docker docs on STOPSIGNAL)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html" rel="noopener noreferrer"&gt;AWS spot instance interruption notices&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>infrastructure</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Error budgets for an LLM dependency you don't control</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:22:28 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/error-budgets-for-an-llm-dependency-you-dont-control-6ia</link>
      <guid>https://dev.to/claire_nguyen/error-budgets-for-an-llm-dependency-you-dont-control-6ia</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We shipped a natural-language build-query feature at Buildkite, then tried to put a 99.9% SLO on it. Turns out you can't promise uptime for a model provider you don't run. We put Bifrost in front, failed over across three providers, and now the error budget tracks our gateway's behaviour instead of OpenAI's status page.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the moment it clicked for me. We were drafting an SLO doc for a feature that lets people ask "why did this build fail" in plain English. Someone wrote "99.9% availability". Cool. That's 43 minutes of allowed downtime a month. Then OpenAI had a wobble for about 50 minutes one Tuesday and we blew the whole budget before lunch.&lt;/p&gt;

&lt;p&gt;The problem wasn't our code. Our service was up the entire time. The dependency wasn't.&lt;/p&gt;
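
&lt;p&gt;The budget arithmetic is worth having as a function so nobody has to redo it in a doc review. This is a minimal sketch, assuming a 30-day month; the function name is mine, not something from our tooling.&lt;/p&gt;

```python
# Sketch: downtime permitted by an availability SLO over a window.
# Assumes a 30-day window; the function name is illustrative only.

def allowed_downtime_minutes(slo, window_days=30):
    """Return the minutes of downtime an SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


budget = allowed_downtime_minutes(0.999)
print(round(budget, 1))        # 43.2 minutes at 99.9%
print(round(50 / budget, 2))   # 1.16, so a 50-minute incident overspends the month
```

&lt;p&gt;Anything above 1.0 on that second line means one incident ate the whole month.&lt;/p&gt;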

&lt;h2&gt;
  
  
  You can't SLO something you don't operate
&lt;/h2&gt;

&lt;p&gt;A normal SLO assumes you control the thing you're measuring. Postgres, your own API, an internal queue. You can add replicas, you can tune it, you can page someone who can fix it.&lt;/p&gt;

&lt;p&gt;A hosted LLM is none of that. When Anthropic returns a 529 or OpenAI starts handing out 429s under load, there is no lever on your side. You wait. Our p99 for the feature was around 2.1 seconds on a good day, and during provider degradation it'd climb past 9 seconds or just fail outright.&lt;/p&gt;

&lt;p&gt;So the question stopped being "how do I make the provider more reliable" and became "how do I make my dependency on any single provider less load-bearing." That's a routing problem, not a model problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting a gateway in the path
&lt;/h2&gt;

&lt;p&gt;We run Bifrost as the single egress point for every LLM call now. It's an OpenAI-compatible gateway, so our service code didn't change much. The interesting part is the fallback config: if the primary provider errors or times out, the request gets retried against the next one without our app knowing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.BEDROCK_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-3-5-haiku"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"bedrock/anthropic.claude-3-haiku"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three providers, ranked. When OpenAI throttles, the call lands on Anthropic. When both are sad, Bedrock catches it. The feature degrades in quality maybe, but it stays up. That's the whole point of an error budget. Stay inside the line.&lt;/p&gt;

&lt;p&gt;Bifrost also load-balances across multiple keys for the same provider, which mattered more than I expected. Half our "outages" early on were just one API key hitting its rate limit while another sat idle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The metrics that actually feed the SLO
&lt;/h2&gt;

&lt;p&gt;The bit that sold me was native Prometheus output. Bifrost exposes metrics straight out of the box, so I'm not scraping a vendor status page or parsing logs to know if we're burning budget.&lt;/p&gt;

&lt;p&gt;Our availability SLI is now "requests Bifrost successfully resolved, including via fallback" over total requests. A request that failed on OpenAI but succeeded on Anthropic counts as a win, because the user got an answer. That's the number that should drive the SLO, not per-provider success.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# fast burn-rate over 1h: are we eating budget faster than allowed?
sum(rate(bifrost_requests_total{status="error"}[1h]))
/
sum(rate(bifrost_requests_total[1h]))
&amp;gt; (14.4 * 0.001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
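
&lt;p&gt;If the 14.4 in that query looks like a magic number, it's the burn-rate threshold from the Google SRE Workbook's fast-burn alert: the rate at which you'd spend 2% of a 30-day budget in one hour. A rough sketch of the arithmetic, with function names that are mine and a 30-day period assumed:&lt;/p&gt;

```python
# Sketch of burn-rate arithmetic for a ratio-based availability SLO.
# Function names and the 30-day period are assumptions for illustration.

def burn_rate(errors, total, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo)


def threshold_for(budget_fraction, window_hours, period_hours=30 * 24):
    """Burn rate that spends budget_fraction of the period's budget in window_hours."""
    return budget_fraction * period_hours / window_hours


# 2% of a 30-day budget in 1 hour gives the 14.4 multiplier.
print(round(threshold_for(0.02, 1), 6))
# 20 errors in 1000 requests against a 99.9% SLO burns budget at about 20x.
print(round(burn_rate(20, 1000, 0.999), 6))
```

&lt;p&gt;A burn rate of 1.0 means you'd land exactly on the budget at the end of the period. The alert fires when the observed rate is higher than the threshold.&lt;/p&gt;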



&lt;p&gt;We went from one provider doing about 99.4% effective availability to the fallback chain sitting around 99.93% over the last 60 days. Same models, same budget, just not betting the feature on one company's afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it stacks up
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before landing here. None of these is strictly best. Depends what you're optimising for.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Thing I cared about&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-host, no vendor in path&lt;/td&gt;
&lt;td&gt;Yes, single Go binary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Possible, but hosted is the main path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Prometheus metrics&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Via callbacks/config&lt;/td&gt;
&lt;td&gt;Dashboard-first, export is extra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider failover config&lt;/td&gt;
&lt;td&gt;Declarative fallback list&lt;/td&gt;
&lt;td&gt;Yes, router config&lt;/td&gt;
&lt;td&gt;Yes, configs/strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted analytics UI&lt;/td&gt;
&lt;td&gt;Basic built-in UI&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Strongest of the three&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python ecosystem depth&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;td&gt;Largest, huge community&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Honestly, if you live in Python and want the biggest provider list and community, LiteLLM is hard to beat. If you want a polished hosted dashboard and guardrails without running anything, Portkey is the comfortable pick. We're an infra team that wants metrics in our own Prometheus and a binary we can run on our own boxes, so Bifrost fit our shape. No worries either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Fallback hides failure, and that cuts both ways. If your alerting only watches the final success rate, you can be quietly running 80% of traffic on your third-choice provider for days and not notice the bill. We added a separate alert on per-provider fallback rate so degradation is visible, not just survivable.&lt;/p&gt;
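
&lt;p&gt;The number that alert watches is just the share of requests that did not land on the primary. A minimal sketch, assuming you already have per-provider served counts from somewhere; the dict shape and function name are hypothetical, not a Bifrost API:&lt;/p&gt;

```python
# Sketch: share of traffic served by anything other than the primary provider.
# The input shape (provider name mapped to served request count) is an assumption.

def fallback_share(served_by_provider, primary):
    """Fraction of resolved requests that were not served by the primary."""
    total = sum(served_by_provider.values())
    if total == 0:
        return 0.0
    return 1 - served_by_provider.get(primary, 0) / total


counts = {"openai": 200, "anthropic": 0, "bedrock": 800}
print(round(fallback_share(counts, "openai"), 2))   # 0.8
```

&lt;p&gt;In that example the final success rate would be 100% while 80% of traffic is on fallbacks, which is exactly the case a success-only alert can't see.&lt;/p&gt;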

&lt;p&gt;Quality drift is real too. gpt-4o-mini and claude-3-5-haiku don't answer identically, so a build-failure summary can read differently mid-incident. For us that's acceptable. For anything doing structured extraction, you'd want to validate output shape per provider.&lt;/p&gt;

&lt;p&gt;And a gateway is one more thing to run. It's a low-risk component, but it's in the hot path, so we run it with the same care as any other tier-1 service. If Bifrost is down, everything's down. We game-day it like the rest of our stack.&lt;/p&gt;

&lt;p&gt;Self-hosting also means semantic caching, governance, and the rest are your config problem, not a managed feature. Fine for us. Worth knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks&lt;/li&gt;
&lt;li&gt;Bifrost observability and Prometheus metrics&lt;/li&gt;
&lt;li&gt;Bifrost gateway setup&lt;/li&gt;
&lt;li&gt;Google SRE Workbook: alerting on SLOs and burn rates&lt;/li&gt;
&lt;li&gt;Bifrost on GitHub&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>llm</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The Prometheus label that blew our monitoring bill out 6x</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Fri, 29 May 2026 04:21:15 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-prometheus-label-that-blew-our-monitoring-bill-out-6x-57hj</link>
      <guid>https://dev.to/claire_nguyen/the-prometheus-label-that-blew-our-monitoring-bill-out-6x-57hj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and the backend charges by active series. Here's how we caught it and the label rules we run now so it doesn't happen again.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill, not the traffic
&lt;/h2&gt;

&lt;p&gt;I'm on the infra team at Buildkite. We run a fairly chunky Prometheus setup feeding a managed backend, and one Monday the monthly estimate had quietly gone from about $1,800 to a touch over $11k. Nobody shipped more traffic. Build volume was the same 40k-ish builds a day it'd been for weeks.&lt;/p&gt;

&lt;p&gt;So it wasn't load. It was series count. Active series had climbed from roughly 1.2 million to nearly 9 million, and the backend prices on active series, not on request volume. That's the trap most people miss the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cardinality actually is
&lt;/h2&gt;

&lt;p&gt;Think of every unique combination of metric name plus label values as its own drawer in a filing cabinet. &lt;code&gt;http_requests_total{status="200"}&lt;/code&gt; is one drawer. Add &lt;code&gt;region="ap-southeast-2"&lt;/code&gt; and now you've got a drawer per region. Add a label whose values are unbounded and you've got a cabinet the size of a warehouse.&lt;/p&gt;

&lt;p&gt;Cardinality is the count of those drawers. Each one is a separate time series that has to be stored and indexed. Low-cardinality labels (status, region) are fine. High-cardinality ones are where the money leaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one label that did it
&lt;/h2&gt;

&lt;p&gt;A teammate had added &lt;code&gt;build_id&lt;/code&gt; to a counter so they could debug a flaky deploy. Fair enough in the moment. Problem is every build has a unique ID, we do ~40k a day, and those IDs hang around for the full retention window.&lt;/p&gt;

&lt;p&gt;40k unique values a day, multiplied across a handful of other labels, multiplied across retention. That's your several-million-series jump right there. One label.&lt;/p&gt;
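
&lt;p&gt;To see why the multiplication hurts, here's a back-of-envelope sketch. The label counts are illustrative round numbers, and the result is a worst-case upper bound: real series counts come in lower because not every combination of label values actually occurs.&lt;/p&gt;

```python
# Sketch: worst-case series count for one metric from per-label cardinalities.
# Label names and counts are illustrative assumptions, not measured values.
from math import prod


def series_upper_bound(label_cardinalities):
    """Product of distinct values per label: the most series one metric can have."""
    return prod(label_cardinalities.values())


bounded = {"status": 5, "region": 6, "instance_type": 15}
print(series_upper_bound(bounded))         # 450

with_build_id = dict(bounded, build_id=40_000)
print(series_upper_bound(with_build_id))   # 18000000
```

&lt;p&gt;Three bounded labels stay in the hundreds. One unbounded label pushes the ceiling into the tens of millions.&lt;/p&gt;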

&lt;h2&gt;
  
  
  Catching it
&lt;/h2&gt;

&lt;p&gt;The fastest way to find the offender is to ask Prometheus which metric has the most series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(10, count by (__name__)({__name__=~".+"}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then drill into the worst metric and see which label is doing the damage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count(count by (build_id)(deploy_attempts_total))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When that second query came back with a number in the tens of thousands, we had our culprit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;You drop the label before it ever hits storage. &lt;code&gt;metric_relabel_configs&lt;/code&gt; runs at scrape time, so you can strip a label without touching the app code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build-agents"&lt;/span&gt;
    &lt;span class="na"&gt;metric_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build_id"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;labeldrop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-build detail didn't vanish; we moved it to where unbounded identifiers belong: traces and logs. If you genuinely need a metric sliced per build, use exemplars so the high-cardinality bit lives in the trace store, not the series index.&lt;/p&gt;

&lt;p&gt;Here's how we now reason about labels before adding one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Unique values&lt;/th&gt;
&lt;th&gt;Safe to add?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;status&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;region&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;instance_type&lt;/td&gt;
&lt;td&gt;~15&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agent_queue&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;Usually fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;build_id&lt;/td&gt;
&lt;td&gt;~40k/day&lt;/td&gt;
&lt;td&gt;No, use a trace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;user_email&lt;/td&gt;
&lt;td&gt;unbounded&lt;/td&gt;
&lt;td&gt;No, never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rule of thumb we reckon on: if you can't name the upper bound of a label's values on a whiteboard, it doesn't go on a metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same trap, different service
&lt;/h2&gt;

&lt;p&gt;This isn't only a Prometheus-the-app thing. Any service that emits Prometheus metrics can sink you the same way. We run a small internal feature that summarises failed build logs through an LLM, and those calls go through Bifrost, an open-source AI gateway that ships native Prometheus metrics out of the box. Handy. But the instinct to tag those metrics with a per-request ID or per-virtual-key label is exactly the same footgun.&lt;/p&gt;

&lt;p&gt;We keep its labels down to provider and model. That gives us cost-per-provider and latency-per-model without minting a new series for every call. The discipline travels with the metric, not the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Dropping &lt;code&gt;build_id&lt;/code&gt; means you can't slice a single build inside Prometheus anymore. For ad-hoc "what did build 84213 do" questions, you're now in the trace or log tooling, which is a context switch some folks grumbled about for a week.&lt;/p&gt;

&lt;p&gt;Recording rules, the other common fix, aren't free either. They add evaluation load on the Prometheus side, and if you write a sloppy one you can quietly recreate the cardinality you were trying to kill. Test the output series count before you ship the rule.&lt;/p&gt;

&lt;p&gt;Exemplars need backend support and a tracing system wired up. If you haven't got distributed tracing yet, that path's a bigger project than a one-line &lt;code&gt;labeldrop&lt;/code&gt;. Be honest about where you are.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;labeldrop&lt;/code&gt; is a blunt instrument. Once it's gone at scrape, it's gone. If you later decide you wanted that dimension bounded rather than dropped, you're re-instrumenting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/practices/naming/" rel="noopener noreferrer"&gt;Prometheus: metric and label naming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/practices/instrumentation/" rel="noopener noreferrer"&gt;Prometheus: instrumentation best practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-cardinality-management/" rel="noopener noreferrer"&gt;Grafana: control metrics usage via cardinality management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
    </item>
    <item>
      <title>Our PR-review bot kept hitting 429s. Bifrost key pooling fixed it.</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Thu, 28 May 2026 13:22:11 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/our-pr-review-bot-kept-hitting-429s-bifrost-key-pooling-fixed-it-4m9f</link>
      <guid>https://dev.to/claire_nguyen/our-pr-review-bot-kept-hitting-429s-bifrost-key-pooling-fixed-it-4m9f</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our internal PR-review bot was getting 429'd by Anthropic between 9am and 11am Sydney time. We dropped Bifrost in front, pooled four keys, and the 429 rate fell from 8.2% to 0.07% in a fortnight. The migration was one env var swap. The interesting bits were the bits we got wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;We've got a PR-review bot that pings Claude on every pull request opened against our internal monorepo. It pulls the diff and ships a structured prompt to Claude. Gets back a summary plus a couple of "have you considered..." nudges. Saves our reviewers maybe 10 minutes per PR, on a team of 80 engineers, all sharing one Anthropic workspace that someone provisioned back in early 2024 and nobody bothered to split.&lt;/p&gt;

&lt;p&gt;You can guess what happened.&lt;/p&gt;

&lt;p&gt;Mornings in Sydney are brutal. Everyone arrives, opens their PRs from the night before, and our bot fires off 30-40 concurrent requests. Anthropic's per-org rate limit got chewed through by 9:15 most days. Bot started failing. Slack filled up with "did the review bot die again?" messages. Not a great look for the platform team.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we tried first
&lt;/h2&gt;

&lt;p&gt;The naive fix was a job queue with backoff. Wrote it in an arvo. Buildkite job, Redis-backed, exponential retry with jitter. It worked, sort of. Reviews now took 4-7 minutes to come back instead of 8 seconds, and engineers started ignoring the bot entirely because by the time the review landed they'd already merged, which kind of defeats the whole point of having a review bot.&lt;/p&gt;

&lt;p&gt;Queueing wasn't the answer. We needed more headroom, which meant more keys, which meant somebody had to manage them.&lt;/p&gt;
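
&lt;p&gt;For anyone who hasn't written one, this is roughly what exponential retry with jitter looks like. It's a generic sketch of the "full jitter" variant, not our actual job; the base delay, cap, and attempt count are made-up values.&lt;/p&gt;

```python
# Sketch: exponential backoff with full jitter.
# base, cap, and the attempt count are illustrative assumptions.
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=random.random):
    """Each delay is uniform between 0 and min(cap, base * 2**n) seconds."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


# Worst case (rng always returns 1.0) shows how the waits pile up.
worst = backoff_delays(8, rng=lambda: 1.0)
print(worst)        # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(sum(worst))   # 183.0
```

&lt;p&gt;Backoff makes the retries polite, but it does nothing for headroom. Under sustained rate limiting the waits just stack up into minutes.&lt;/p&gt;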

&lt;h2&gt;
  
  
  Why Bifrost
&lt;/h2&gt;

&lt;p&gt;I'd been kicking the tyres on a few gateways for an unrelated project. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) won on two specific points: load balancing across multiple API keys for the same provider is a documented first-class feature, and the OpenAI-compatible endpoint meant we didn't have to touch the bot's SDK code. It already spoke &lt;code&gt;openai.ChatCompletion&lt;/code&gt; against an internal proxy URL.&lt;/p&gt;

&lt;p&gt;Setup took about 40 minutes including the time to argue with our SSO admin about a new GitHub OAuth app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY_4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default_request_timeout_in_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bot config was a one-liner. Pointed &lt;code&gt;OPENAI_API_BASE&lt;/code&gt; at our Bifrost ECS service on port 8080 and the bot didn't know it'd been moved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results after two weeks
&lt;/h2&gt;

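&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (queue + 1 key)&lt;/th&gt;
&lt;th&gt;After (Bifrost + 4 keys)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Median review latency&lt;/td&gt;
&lt;td&gt;4m 30s&lt;/td&gt;
&lt;td&gt;11s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 review latency&lt;/td&gt;
&lt;td&gt;7m 12s&lt;/td&gt;
&lt;td&gt;28s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429 rate&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews abandoned (timed out)&lt;/td&gt;
&lt;td&gt;14%&lt;/td&gt;
&lt;td&gt;0.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Is the bot dead" Slack pings&lt;/td&gt;
&lt;td&gt;~6/day&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;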

&lt;p&gt;Costs went up about 22% because more reviews actually completed. Worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost vs LiteLLM vs Portkey
&lt;/h2&gt;

&lt;p&gt;I evaluated all three properly. None is strictly better; they hit different sweet spots.&lt;/p&gt;

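&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-key load balancing&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Via Router&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host complexity&lt;/td&gt;
&lt;td&gt;Single Go binary&lt;/td&gt;
&lt;td&gt;Python + deps&lt;/td&gt;
&lt;td&gt;SaaS-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in web UI for config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Cloud-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community size&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;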

&lt;p&gt;LiteLLM's community is bigger and the integrations list is wider. If you want Python ergonomics, it's the easier ride. Portkey's hosted UX is slicker out of the box, but we needed self-host for compliance reasons. Bifrost being a single Go binary suited our ECS deploy model and our preference for fewer Python services in the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;It's not all roses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover is per-request, not per-key cooldown.&lt;/strong&gt; If one of our four keys gets stuck in a rate-limit hole, Bifrost retries the call elsewhere but doesn't proactively quarantine the bad key for a window. We're managing that with manual weight tweaks for now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The web UI is handy but state lives in config files.&lt;/strong&gt; Make changes via the UI in dev and forget to commit the config, and you've got drift. We learned that one the hard way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure.&lt;/strong&gt; Anything you put in front of every LLM call becomes load-bearing. We run two Bifrost replicas behind an ALB. Tiny team running one node and a restart policy might be fine, but think about it before you ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability glue.&lt;/strong&gt; Prometheus metrics are emitted natively, which is great. You'll still need to wire them into your existing dashboards. Took us an afternoon.&lt;/li&gt;
&lt;/ul&gt;
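
&lt;p&gt;On the manual weight tweaks: the idea is easy to picture with a toy weighted picker. This is a sketch of weighted random selection in general, not Bifrost's implementation, and the function name is hypothetical. Setting a weight to zero takes a key out of rotation until you put it back.&lt;/p&gt;

```python
# Sketch: weighted random choice across API keys, with a zero weight acting
# as a manual quarantine. Not Bifrost's code; names are illustrative.
import random


def pick_key(weights, rng=random):
    """Pick one key name at random, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


weights = {
    "ANTHROPIC_KEY_1": 1.0,
    "ANTHROPIC_KEY_2": 1.0,
    "ANTHROPIC_KEY_3": 1.0,
    "ANTHROPIC_KEY_4": 1.0,
}
weights["ANTHROPIC_KEY_3"] = 0.0   # pull the rate-limited key out by hand

picked = {pick_key(weights) for _ in range(1000)}
print("ANTHROPIC_KEY_3" in picked)   # False
```

&lt;p&gt;The catch is that someone has to remember to restore the weight afterwards, which is why an automatic cooldown would be nicer.&lt;/p&gt;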

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost governance and virtual keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost observability: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM router config: &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/routing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic rate limit headers: &lt;a href="https://docs.anthropic.com/en/api/rate-limits" rel="noopener noreferrer"&gt;https://docs.anthropic.com/en/api/rate-limits&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No worries if you've already got a gateway you're happy with. Don't write your own queue and hope.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>sre</category>
    </item>
    <item>
      <title>Surviving an AZ Failover for Our Build Runner Fleet at 3am</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 27 May 2026 13:24:15 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/surviving-an-az-failover-for-our-build-runner-fleet-at-3am-pg7</link>
      <guid>https://dev.to/claire_nguyen/surviving-an-az-failover-for-our-build-runner-fleet-at-3am-pg7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We lost an AWS AZ for 47 minutes back in March. Our build runner fleet on EKS mostly survived, but the AI-assisted code review bots wedged because their LLM calls all routed to one region. Sticking Bifrost in front of those calls fixed the second problem. Here's what we changed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was 3:12am Sydney time when PagerDuty went off. ap-southeast-2a was having a wobble. Not a full outage — just enough packet loss that EKS nodes started flapping in and out of the cluster.&lt;/p&gt;

&lt;p&gt;Our build runner fleet handled it fine. We've drilled this. Pod disruption budgets, multi-AZ node groups, the usual stuff. Builds rescheduled to 2b and 2c within about 90 seconds. No worries.&lt;/p&gt;

&lt;p&gt;The bit that didn't handle it fine was the AI review bot we'd shipped six weeks earlier. That thing called Anthropic's API directly from inside the build container. When the AZ flapped, the egress NAT in 2a started dropping outbound TLS. The bot retried, hit our 30-second build timeout, and 4,200 builds went red over half an hour.&lt;/p&gt;

&lt;p&gt;I want to talk about what we did the morning after, because the fix wasn't "make the bot more resilient." It was "stop pretending the LLM call is special."&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual failure mode
&lt;/h2&gt;

&lt;p&gt;Here's the rough shape of what was happening. Our review bot was a Go service running as a sidecar in the build pod. Pseudo-config looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;review_bot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
  &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_KEY}&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="na"&gt;timeout_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25000&lt;/span&gt;
  &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two retries, 25-second timeout each. Sounds reasonable. Except when the underlying network is dropping packets, you don't fail fast. You sit there waiting for TCP to give up. The first attempt plus two retries, each burning its full 25 seconds, adds up to as much as 75 seconds of nothing, well past the 30-second build timeout. Build timeout kicked in. Build failed.&lt;/p&gt;

&lt;p&gt;Worse, every single review bot in every single build was hitting the same NAT gateway in the same degraded AZ. We'd accidentally built a single point of failure into something we'd designed as a sidecar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;I'd been kicking the tyres on Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for a few weeks already because I wanted central observability on LLM spend across our internal tools. The AZ incident pushed it to the top of the queue.&lt;/p&gt;

&lt;p&gt;The plan was simple: stop letting build pods talk to providers directly. Run Bifrost as a deployment in our shared platform namespace, spread across all three AZs, and point the review bot at it. The bot's config went from Anthropic's public API endpoint to an internal service URL.&lt;/p&gt;

&lt;p&gt;Bifrost's drop-in replacement (&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/drop-in-replacement&lt;/a&gt;) meant we didn't touch the bot's code. Just the env var.&lt;/p&gt;
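
&lt;p&gt;As a rough sketch of what "just the env var" can look like in a pod spec: the variable name, namespace, and service name below are illustrative placeholders, and the port and &lt;code&gt;/anthropic&lt;/code&gt; path assume Bifrost's defaults, so check them against the drop-in docs for your version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical container env for the review bot sidecar.
env:
  # Before: unset, so the bot called the provider's public API directly.
  # After: the in-cluster gateway service (names and port are assumptions).
  - name: REVIEW_BOT_ANTHROPIC_BASE_URL
    value: "http://bifrost.platform.svc.cluster.local:8080/anthropic"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;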

&lt;p&gt;Then we configured fallbacks (&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;) so a failed Anthropic call rolls over to AWS Bedrock's Claude. Same model family, different network path, different auth, different everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"bedrock/anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The GPT-4o-mini at the bottom is a deliberate downgrade. If both Anthropic paths are stuffed, we'd rather give the dev a worse review than no review and a red build.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like vs the alternatives
&lt;/h2&gt;

&lt;p&gt;I evaluated three things properly. Here's the honest comparison from my notes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted Go binary&lt;/td&gt;
&lt;td&gt;No (Python)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider failover config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in web UI for config&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes (cloud)&lt;/td&gt;
&lt;td&gt;Yes (local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory footprint on our nodes&lt;/td&gt;
&lt;td&gt;~400MB&lt;/td&gt;
&lt;td&gt;N/A (SaaS-first)&lt;/td&gt;
&lt;td&gt;~180MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (enterprise)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM is genuinely good and we run it for one of our data science notebooks because the Python ergonomics are nice. Portkey has the slickest dashboard if you're happy with their cloud. Bifrost won here because we wanted a Go binary we could run on our own infra, and resource overhead per pod matters when we're scheduling hundreds of build pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boring infra bit
&lt;/h2&gt;

&lt;p&gt;We deployed Bifrost as three replicas, one per AZ, behind a ClusterIP service. Topology spread constraints to keep them honest. Each pod has its own provider key set via Kubernetes secrets, referenced through Bifrost's env var support (&lt;a href="https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/deployment-guides/config-json#environment-variable-references&lt;/a&gt;).&lt;/p&gt;
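
&lt;p&gt;For anyone who hasn't written topology spread constraints before, here is a minimal sketch of that kind of Deployment. It is not our actual manifest: the name, namespace, labels, image tag, and secret name are all assumptions for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: names, labels, image tag, and secret are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bifrost
  namespace: platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: bifrost
  template:
    metadata:
      labels:
        app: bifrost
    spec:
      topologySpreadConstraints:
        # Allow at most one replica of skew between zones, so three
        # replicas across three AZs land one per AZ.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: bifrost
      containers:
        - name: bifrost
          image: maximhq/bifrost:latest
          env:
            # Injected from a Kubernetes secret so the gateway config
            # can reference it as an environment variable.
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bifrost-provider-keys
                  key: anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;DoNotSchedule&lt;/code&gt; is the strict option: a replica stays pending rather than doubling up in one zone. &lt;code&gt;ScheduleAnyway&lt;/code&gt; trades that guarantee for availability when a zone has no capacity.&lt;/p&gt;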

&lt;p&gt;Prometheus scrape config picks up the native metrics endpoint. We graph p99 latency per provider and alert on fallback rate above 5% for more than 10 minutes. That alert would have fired during the March incident and given us a much better signal than "builds are timing out."&lt;/p&gt;
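
&lt;p&gt;A fallback-rate alert along those lines could be sketched as the Prometheus rule below. Both metric names are hypothetical placeholders, not Bifrost's real metric names, so swap in whatever the metrics endpoint actually exposes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: both metric names are hypothetical placeholders.
groups:
  - name: llm-gateway
    rules:
      - alert: LLMFallbackRateHigh
        # Share of requests served by a fallback provider, over 5-minute windows.
        expr: |
          sum(rate(gateway_fallback_requests_total[5m]))
            / sum(rate(gateway_requests_total[5m])) &amp;gt; 0.05
        # The ratio has to stay above 5% for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of LLM calls are being served by a fallback provider"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;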

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This isn't a free win. A few things to flag.&lt;/p&gt;

&lt;p&gt;The gateway is now a new hop in the request path. We measured about 8-12ms added per call. For our use case that's noise. For real-time inference it might not be.&lt;/p&gt;

&lt;p&gt;Bifrost's clustering features are an enterprise thing. We're running it as independent replicas behind a service, which works because our config is mostly static. If you need shared state across replicas (live config sync, shared rate limit counters), you'll either pay for enterprise or accept some eventual consistency.&lt;/p&gt;

&lt;p&gt;Semantic caching sounds great but we haven't turned it on for the review bot because code reviews are too context-specific. Cache hit rate would be near zero. Worth knowing before you assume it'll save you money.&lt;/p&gt;

&lt;p&gt;And the obvious one: a gateway pod failing is now a thing that can break LLM calls. Spread your replicas, set sensible PDBs, don't be silly.&lt;/p&gt;
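
&lt;p&gt;A sensible PDB for a three-replica gateway might look like the sketch below. The name, namespace, and label are assumptions and need to match whatever your Deployment uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: name, namespace, and label are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bifrost
  namespace: platform
spec:
  # With three replicas, node drains and upgrades can evict one at a time.
  minAvailable: 2
  selector:
    matchLabels:
      app: bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep in mind a PDB only limits voluntary disruptions like drains. It does nothing when an AZ falls over, which is what the replica spread is for.&lt;/p&gt;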

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost fallbacks docs: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost observability: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM router config: &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/routing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS multi-AZ resilience patterns: &lt;a href="https://aws.amazon.com/architecture/well-architected/" rel="noopener noreferrer"&gt;https://aws.amazon.com/architecture/well-architected/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes topology spread constraints: &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Cost Math Behind Our CI Cache Hit Rate Going From 40% to 91%</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 27 May 2026 04:23:02 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-cost-math-behind-our-ci-cache-hit-rate-going-from-40-to-91-284d</link>
      <guid>https://dev.to/claire_nguyen/the-cost-math-behind-our-ci-cache-hit-rate-going-from-40-to-91-284d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our CI bill was running at roughly AUD $16.7k/month, much of it redundant compute, because our cache hit rate sat at 40%. Three changes (content-addressed keys, a warmer tier, and killing one bad pre-commit hook) pushed it to 91% and shaved the bill to about $4.2k. Most of the savings came from a single weekend audit, not new tooling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I run infra at Buildkite. We eat our own dog food, which means our internal monorepo runs on the same agents we sell to customers. About six weeks ago our finance team flagged that our CI compute line on AWS had crept up 38% quarter-on-quarter while team headcount only grew 11%. Something was off.&lt;/p&gt;

&lt;p&gt;Turns out the culprit wasn't traffic. It was caches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;Our setup, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~280 engineers across Sydney, Melbourne, San Francisco&lt;/li&gt;
&lt;li&gt;Around 4,200 builds/day on the monorepo&lt;/li&gt;
&lt;li&gt;Mix of Go, TypeScript, and a chunky Python ML eval service&lt;/li&gt;
&lt;li&gt;Agents running on &lt;code&gt;m6i.4xlarge&lt;/code&gt; spot instances in &lt;code&gt;ap-southeast-2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Remote cache backed by S3 with a CloudFront distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I first pulled the numbers, our cache hit rate (measured per build step, weighted by step duration) was sitting at 40.3%. For a healthy CI setup of this size I'd reckon you want 80%+. Anything under 60% means you're paying twice for the same compute.&lt;/p&gt;

&lt;p&gt;Here's what the spend breakdown looked like before we touched anything:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly cost (AUD)&lt;/th&gt;
&lt;th&gt;% of total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spot EC2 (build agents)&lt;/td&gt;
&lt;td&gt;$11,200&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 cache storage&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront egress&lt;/td&gt;
&lt;td&gt;$1,140&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM eval API calls (OpenAI + Anthropic)&lt;/td&gt;
&lt;td&gt;$3,420&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;$16,650&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM line is the one nobody expected. We run automated PR review on a subset of changes, plus regression evals on our search ranking service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three things that actually mattered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Content-addressed cache keys
&lt;/h3&gt;

&lt;p&gt;We had cache keys like &lt;code&gt;node_modules_v3_${branch_name}_${os}&lt;/code&gt;. That's already wrong, since keying on the branch name means every new branch starts with a cold cache, but the worse bit was the &lt;code&gt;v3&lt;/code&gt; version tag that someone bumped six months ago without anyone remembering why.&lt;/p&gt;

&lt;p&gt;Switched to hashing the actual inputs: &lt;code&gt;package-lock.json&lt;/code&gt; content hash + Node version + OS. Standard stuff but we'd just never done it properly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:node:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;install"&lt;/span&gt;
    &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cache#v2.4.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;package-lock.json&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_modules&lt;/span&gt;
          &lt;span class="na"&gt;restore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
          &lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;runner.os&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}-node-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checksum&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'package-lock.json'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;restore: file&lt;/code&gt; bit matters. It means we only invalidate when &lt;code&gt;package-lock.json&lt;/code&gt; actually changes, not when the branch name changes. Cache hit rate on the install step went from 31% to 96% overnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. A warmer tier between memory and S3
&lt;/h3&gt;

&lt;p&gt;S3 is cheap but the round-trip from &lt;code&gt;ap-southeast-2&lt;/code&gt; agents to S3 is about 18ms for small objects, and we were pulling thousands of them per build. We added an &lt;code&gt;r6gd.large&lt;/code&gt; instance with NVMe local storage as an in-region warm cache. Agents check there first, fall through to S3.&lt;/p&gt;

&lt;p&gt;Cost: about $180/month for the warm cache instance. Saves us roughly $1,000/month in CloudFront egress (that line dropped from $1,140 to $140) because most cache reads never leave the VPC now.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The bad pre-commit hook
&lt;/h3&gt;

&lt;p&gt;This one is embarrassing. Someone added a pre-commit hook two years ago that ran &lt;code&gt;find . -name "*.pyc" -delete&lt;/code&gt; before every test invocation. On a clean checkout this does nothing useful. On a cached checkout it deletes all the compiled Python bytecode, forcing Python to recompile on every test run. Average test step went from 4m20s to 2m45s after deleting eight lines of bash.&lt;/p&gt;

&lt;p&gt;I genuinely could not believe it. We'd been paying for that for two years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM bit
&lt;/h2&gt;

&lt;p&gt;The $3,420 LLM line was harder to chip away at because the calls themselves are useful. What we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routed the PR review traffic through an AI gateway (we use &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, which gives us semantic caching and a single endpoint) so identical or near-identical review prompts hit cache instead of provider&lt;/li&gt;
&lt;li&gt;Moved the search ranking evals to a nightly batch rather than per-PR&lt;/li&gt;
&lt;li&gt;Switched the bulk of the review traffic to a cheaper model and reserved the expensive one for changes touching &lt;code&gt;/security/*&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Semantic cache hit rate on PR review prompts settled around 34%, which doesn't sound massive but the prompts that hit cache tend to be the bigger ones (boilerplate "review this dependency bump" type stuff), so the dollar impact was bigger than the hit rate suggests.&lt;/p&gt;

&lt;p&gt;Final LLM line came down to $1,180/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we landed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spot EC2&lt;/td&gt;
&lt;td&gt;$11,200&lt;/td&gt;
&lt;td&gt;$1,820&lt;/td&gt;
&lt;td&gt;-84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 + warm cache&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;$1,070&lt;/td&gt;
&lt;td&gt;+20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront egress&lt;/td&gt;
&lt;td&gt;$1,140&lt;/td&gt;
&lt;td&gt;$140&lt;/td&gt;
&lt;td&gt;-88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM API&lt;/td&gt;
&lt;td&gt;$3,420&lt;/td&gt;
&lt;td&gt;$1,180&lt;/td&gt;
&lt;td&gt;-65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;$16,650&lt;/td&gt;
&lt;td&gt;$4,210&lt;/td&gt;
&lt;td&gt;-75%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cache hit rate: 91.2% weighted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;The warm cache tier is a single instance with no redundancy. If that &lt;code&gt;r6gd.large&lt;/code&gt; dies, we fall through to S3 cleanly, but builds slow down by ~40 seconds each until we replace it. For us that's fine because spot interruption is more common than instance failure anyway. For a smaller team I'd skip it.&lt;/p&gt;

&lt;p&gt;Content-addressed keys made cache busting harder for the rare case where you legitimately want to invalidate everything. We added a manual &lt;code&gt;BUILDKITE_CACHE_EPOCH&lt;/code&gt; env var so a human can force-invalidate when needed. Used it twice in three months.&lt;/p&gt;

&lt;p&gt;The pre-commit hook thing wasn't a tooling problem. It was institutional knowledge rot. There's no caching strategy that protects you from someone deleting your bytecode every commit. You need humans to actually read what runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://bazel.build/remote/caching" rel="noopener noreferrer"&gt;Bazel remote caching docs&lt;/a&gt; — even if you don't use Bazel, their model is worth understanding&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/buildkite-plugins/cache-buildkite-plugin" rel="noopener noreferrer"&gt;Buildkite agent caching plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html" rel="noopener noreferrer"&gt;The AWS spot instance interruption guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.blog/engineering/" rel="noopener noreferrer"&gt;GitHub's writeup on their Actions cache architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt; if you're routing LLM traffic and curious about prompt-level caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go audit your hooks. Seriously. You probably have one of these.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
