<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay Batta</title>
    <description>The latest articles on DEV Community by Pranay Batta (@pranay_batta).</description>
    <link>https://dev.to/pranay_batta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3652594%2F9d2926ca-eede-4542-b782-4feb2ced66f1.jpg</url>
      <title>DEV Community: Pranay Batta</title>
      <link>https://dev.to/pranay_batta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranay_batta"/>
    <language>en</language>
    <item>
      <title>MCP at Scale: Access Control, Cost Governance, and 92% Lower Token Costs</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:44:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</link>
      <guid>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Tax on Every MCP Request
&lt;/h2&gt;

&lt;p&gt;Here is something nobody talks about when they demo MCP integrations: token costs at scale.&lt;/p&gt;

&lt;p&gt;I have been running MCP setups with increasing numbers of connected servers. The pattern is always the same. You connect a few servers, everything works brilliantly. You connect a dozen, costs start climbing. You connect sixteen servers with 500+ tools, and suddenly your token budget is gone before the model even starts thinking about your actual query.&lt;/p&gt;

&lt;p&gt;Why? Every tool definition from every connected server gets injected into the model's context on every single request. With 150+ tool definitions, that injection alone can consume the majority of your token budget. And there is zero access control: any consumer can call any tool, and there is no cost tracking at the tool level.&lt;/p&gt;
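&lt;p&gt;To put a rough number on that (a back-of-envelope sketch; the per-definition token count is an illustrative assumption, not a measured figure):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Tokens consumed by tool definitions alone, before the model sees the
# actual query. The 500-token average per definition is an illustrative
# assumption; real schemas vary widely.
AVG_TOKENS_PER_TOOL_DEF = 500

def injected_context_tokens(num_tools):
    return num_tools * AVG_TOKENS_PER_TOOL_DEF

print(injected_context_tokens(150))  # 75000 tokens of overhead on every request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;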

&lt;p&gt;This is unsustainable for production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Tested Bifrost's Code Mode Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a fundamentally different approach to this problem. Instead of dumping all tool definitions into the context window, it exposes a virtual filesystem of Python stub files. The model discovers tools on-demand through four meta-tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt; - discover available servers and tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt; - load specific function signatures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt; - fetch detailed documentation only when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt; - run scripts in a sandboxed Starlark interpreter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the model only loads what it actually needs for the current query. If you ask it to read a file, it does not need to know about your Slack, GitHub, Jira, and database tools all at once.&lt;/p&gt;

&lt;p&gt;Here is what a typical tool discovery flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Model calls listToolFiles to see available servers
&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;listToolFiles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: ["filesystem/", "github/", "slack/", "jira/", ...]
&lt;/span&gt;
&lt;span class="c1"&gt;# Model identifies it needs filesystem tools for this query
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readToolFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns only the function signature for filesystem_read
&lt;/span&gt;
&lt;span class="c1"&gt;# Model fetches docs only if needed
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getToolDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Executes with full sandboxing
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;executeToolCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/src/main.go&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is lazy loading for LLM tool contexts. Simple idea. Massive impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results: 3 Controlled Rounds
&lt;/h2&gt;

&lt;p&gt;I ran three controlled rounds, scaling from 6 servers to 16 servers. Every round maintained a 100% task pass rate. The model completed every task correctly while using dramatically fewer tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Servers&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;58.2%&lt;/td&gt;
&lt;td&gt;55.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;508&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;92.8%&lt;/td&gt;
&lt;td&gt;92.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At roughly 500 tools, Code Mode reduces per-query token usage by about 14x, from 1.15M tokens down to 83K. That is not an incremental improvement. That is a different cost structure entirely.&lt;/p&gt;

&lt;p&gt;The savings grow non-linearly. As you add more tools, the percentage saved increases, because Code Mode's overhead stays roughly constant while traditional mode's context cost scales linearly with tool count.&lt;/p&gt;
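&lt;p&gt;That scaling behavior can be sketched with a toy model built from the Round 3 figures. Treating Code Mode's token cost as flat is an approximation, so the model overestimates savings at small tool counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy model from the Round 3 numbers: ~1.15M tokens for 508 tools in
# traditional mode, ~83K in Code Mode. The flat Code Mode cost is an
# approximation, not how the gateway actually accounts for tokens.
TOKENS_PER_TOOL_DEF = 1_150_000 / 508   # implied average per definition
CODE_MODE_TOKENS = 83_000

def savings_pct(num_tools):
    traditional = num_tools * TOKENS_PER_TOOL_DEF
    return 100 * (1 - CODE_MODE_TOKENS / traditional)

print(round(savings_pct(508), 1))  # 92.8, matching Round 3
print(round(savings_pct(96), 1))   # higher than Round 1's 58.2: the overhead
                                   # is not perfectly flat in practice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;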

&lt;p&gt;For full benchmark methodology, check the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control That Actually Works
&lt;/h2&gt;

&lt;p&gt;Token savings are great, but production MCP deployments need &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt;. Bifrost handles this through two mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Keys&lt;/strong&gt; let you create scoped credentials per user, team, or customer. You can scope at the tool level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team-key"&lt;/span&gt;
  &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_read&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_query&lt;/span&gt;
  &lt;span class="na"&gt;blocked_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_delete&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filesystem_write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Allow &lt;code&gt;filesystem_read&lt;/code&gt;, block &lt;code&gt;filesystem_write&lt;/code&gt;. Allow &lt;code&gt;database_query&lt;/code&gt;, block &lt;code&gt;database_delete&lt;/code&gt;. Fine-grained, declarative, no code changes needed.&lt;/p&gt;
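&lt;p&gt;A minimal sketch of how allow/block resolution could work. The precedence rule here (block wins; an empty allow list permits anything not blocked) is an assumption for illustration, not Bifrost's documented semantics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical allow/block check for a virtual key. Assumed semantics:
# the block list always wins, and an empty allow list permits anything
# not explicitly blocked. Bifrost's actual precedence may differ.
def is_tool_allowed(tool, allowed, blocked):
    if tool in blocked:
        return False
    return not allowed or tool in allowed

allowed = ["database_read", "database_query"]
blocked = ["database_delete", "filesystem_write"]

print(is_tool_allowed("database_query", allowed, blocked))   # True
print(is_tool_allowed("database_delete", allowed, blocked))  # False
print(is_tool_allowed("slack_post", allowed, blocked))       # False: not on the allow list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;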

&lt;p&gt;&lt;strong&gt;MCP Tool Groups&lt;/strong&gt; are named collections of tools from multiple servers. You create a group and attach it to keys, teams, or users. No database queries at resolve time. This is important when you are running at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;5,000 RPS&lt;/a&gt; and cannot afford lookup latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Tool Observability
&lt;/h2&gt;

&lt;p&gt;Every tool execution gets logged with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name and server source&lt;/li&gt;
&lt;li&gt;Arguments passed and results returned&lt;/li&gt;
&lt;li&gt;Execution latency&lt;/li&gt;
&lt;li&gt;Virtual key that initiated the call&lt;/li&gt;
&lt;li&gt;Parent LLM request context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost at the tool level&lt;/a&gt; alongside LLM token costs. This matters when your finance team asks why the AI bill doubled last month. You can point to exactly which tools, which teams, and which queries drove the spend.&lt;/p&gt;
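&lt;p&gt;With those fields on every record, answering that question is a small aggregation. The record shape below is assumed for illustration, not Bifrost's actual log schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Summing spend per (virtual key, tool) pair from per-execution log records.
# The field names mirror the list above but are assumed, not Bifrost's schema.
from collections import defaultdict

logs = [
    {"tool": "database_query", "virtual_key": "data-team-key", "cost_usd": 0.012},
    {"tool": "github_search",  "virtual_key": "eng-team-key",  "cost_usd": 0.004},
    {"tool": "database_query", "virtual_key": "data-team-key", "cost_usd": 0.008},
]

spend = defaultdict(float)
for rec in logs:
    spend[(rec["virtual_key"], rec["tool"])] += rec["cost_usd"]

for key, total in sorted(spend.items()):
    print(key, round(total, 4))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;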

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits&lt;/a&gt; let you set spending caps per virtual key, so no single team can blow through the monthly allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Flexibility
&lt;/h2&gt;

&lt;p&gt;Bifrost supports four &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP connection types&lt;/a&gt;: STDIO, HTTP, SSE, and in-process via the Go SDK. OAuth 2.0 with PKCE and automatic token refresh is built in. Health monitoring with automatic reconnects keeps things running without manual intervention.&lt;/p&gt;

&lt;p&gt;You can run it in manual approval mode where a human reviews tool calls, or in autonomous agent loop mode where the model chains tool calls independently.&lt;/p&gt;

&lt;p&gt;For Claude Code and Cursor users, the &lt;code&gt;/mcp&lt;/code&gt; endpoint integrates directly. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup takes minutes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curve for Code Mode.&lt;/strong&gt; The virtual filesystem abstraction is elegant, but it is a new mental model. Teams used to traditional MCP tool injection will need to understand why their tools are now "files" the model reads on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-tool overhead on simple queries.&lt;/strong&gt; If you only have 10-20 tools, the overhead of the four meta-tools (&lt;code&gt;listToolFiles&lt;/code&gt;, &lt;code&gt;readToolFile&lt;/code&gt;, etc.) might not save you much. The real wins kick in above 50-100 tools. Below that threshold, traditional mode works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlark sandbox limitations.&lt;/strong&gt; The sandboxed Starlark interpreter is secure by design, but it means tool code runs in a restricted environment. Complex tool implementations may need adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency on gateway availability.&lt;/strong&gt; Adding a gateway layer means one more component to monitor. Bifrost's 11 microsecond latency and Go-based architecture make this a non-issue in practice, but it is still an additional piece of infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Care
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 50 MCP tools, you probably do not need Code Mode yet. Traditional tool injection works fine at that scale.&lt;/p&gt;

&lt;p&gt;If you are running 100+ tools across multiple servers, or if you need per-team access control, or if your CFO is asking questions about AI infrastructure costs, this is worth evaluating.&lt;/p&gt;

&lt;p&gt;The 92% cost reduction at 500+ tools is the headline number, but the governance features (virtual keys, tool groups, audit logging) are what make it production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Bifrost is open-source and written in Go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; - star it if this is useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP documentation&lt;/a&gt; - full setup guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Governance docs&lt;/a&gt; - virtual keys, tool groups, budgets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt; - up and running in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have been testing a lot of MCP tooling lately. Bifrost's approach to the context window problem is the most practical solution I have seen. The lazy loading pattern for tool definitions should honestly be how all MCP gateways work.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and give it a spin. Happy to discuss benchmarks or setup in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Track LLM Costs and Rate Limits on AWS Bedrock with an AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:21:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</link>
      <guid>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</guid>
      <description>&lt;p&gt;Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.&lt;/p&gt;

&lt;p&gt;I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with AWS Native Cost Tracking
&lt;/h2&gt;

&lt;p&gt;AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get from CloudWatch + Cost Explorer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate Bedrock spend per region&lt;/li&gt;
&lt;li&gt;Invocation counts at the service level&lt;/li&gt;
&lt;li&gt;Basic alarms on total spend thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you do not get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-model token-level cost breakdowns&lt;/li&gt;
&lt;li&gt;Team or project-level budget enforcement&lt;/li&gt;
&lt;li&gt;Rate limiting by user, team, or API key&lt;/li&gt;
&lt;li&gt;Real-time cost tracking per request&lt;/li&gt;
&lt;li&gt;Automatic routing away from providers that exceed limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Approach
&lt;/h2&gt;

&lt;p&gt;An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.&lt;/p&gt;

&lt;p&gt;I tested three approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Native (CloudWatch + Cost Explorer)&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM-specific cost tracking&lt;/td&gt;
&lt;td&gt;Aggregate only&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget hierarchy&lt;/td&gt;
&lt;td&gt;Account-level billing alerts&lt;/td&gt;
&lt;td&gt;Basic budget controls&lt;/td&gt;
&lt;td&gt;4-tier: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;No native LLM rate limits&lt;/td&gt;
&lt;td&gt;Basic rate limiting&lt;/td&gt;
&lt;td&gt;VK + Provider Config level, token and request limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset durations&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Limited options&lt;/td&gt;
&lt;td&gt;1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (provider type "bedrock")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~8ms (Python)&lt;/td&gt;
&lt;td&gt;11 microseconds (Go)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Self-hosted or cloud&lt;/td&gt;
&lt;td&gt;Self-hosted (runs in your VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost with AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Configure Bedrock as a provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-mistral"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral.mistral-large-2407-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover all Bedrock model formats and region options.&lt;/p&gt;
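&lt;p&gt;The weighted selection itself is easy to picture with a toy sampler (illustrative only; Bifrost's scheduler is internal to the gateway):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy weighted routing mirroring the 80/20 split in the config above.
import random

PROVIDERS = [("bedrock-claude", 80), ("bedrock-mistral", 20)]

def pick_provider(rng):
    names = [name for name, _ in PROVIDERS]
    weights = [weight for _, weight in PROVIDERS]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
counts = {"bedrock-claude": 0, "bedrock-mistral": 0}
for _ in range(10_000):
    counts[pick_provider(rng)] += 1

print(counts)  # roughly 8,000 / 2,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;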

&lt;h2&gt;
  
  
  Four-Tier Budget Hierarchy
&lt;/h2&gt;

&lt;p&gt;This is where Bifrost separates itself from everything else I tested. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget system&lt;/a&gt; has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;team_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1w"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.&lt;/p&gt;
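&lt;p&gt;The all-four-must-pass rule is a chain check. A minimal sketch, with data shapes assumed for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical all-tiers-must-pass admission check. Limits mirror the YAML
# above; the structure and spend figures are assumed for illustration.
budgets = {  # tier: (limit_usd, spent_usd_this_period)
    "customer:acme-corp":      (5000.0, 4999.50),
    "team:ml-engineering":     (2000.0, 1200.00),
    "vk:staging-key":          (500.0,  120.00),
    "provider:bedrock-claude": (1000.0, 980.00),
}

def admit(request_cost):
    return all(spent + request_cost &lt;= limit for limit, spent in budgets.values())

print(admit(0.40))  # True: every tier has headroom
print(admit(0.60))  # False: the customer tier would cross $5,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;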

&lt;p&gt;Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.&lt;/p&gt;
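&lt;p&gt;The token-usage part of that calculation is straightforward. The per-million rates below are placeholders, not actual Bedrock pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-request cost from actual token counts. Rates are placeholder values;
# real Bedrock pricing varies by model, region, and cache/batch status.
PRICING = {  # USD per 1M tokens: (input, output)
    "anthropic.claude-sonnet-4-20250514-v1:0": (3.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

cost = request_cost("anthropic.claude-sonnet-4-20250514-v1:0", 12_000, 800)
print(round(cost, 3))  # 0.048: $0.036 of input plus $0.012 of output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;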

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; have the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting That Actually Works for LLMs
&lt;/h2&gt;

&lt;p&gt;AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.&lt;/p&gt;
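&lt;p&gt;Calendar alignment is worth seeing concretely. A sketch of the daily and monthly boundary computation (illustrative, not Bifrost's implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Calendar-aligned reset boundaries in UTC: "1d" rolls at midnight UTC,
# "1M" at 00:00 UTC on the first of the next month.
from datetime import datetime, timedelta, timezone

def next_reset(now, period):
    floor = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "1d":
        return floor + timedelta(days=1)
    if period == "1M":
        if now.month == 12:
            return floor.replace(year=now.year + 1, month=1, day=1)
        return floor.replace(month=now.month + 1, day=1)
    raise ValueError(f"unsupported period: {period}")

now = datetime(2026, 4, 13, 10, 21, tzinfo=timezone.utc)
print(next_reset(now, "1d"))  # 2026-04-14 00:00:00+00:00
print(next_reset(now, "1M"))  # 2026-05-01 00:00:00+00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;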

&lt;p&gt;Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability at Sub-Millisecond Overhead
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost is captured: tokens used, latency, cost, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.&lt;/p&gt;

&lt;p&gt;What makes this useful for AWS teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;14+ API filter options&lt;/strong&gt; for querying logs. Filter by model, provider, team, cost range, status code, time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket live updates.&lt;/strong&gt; Watch requests flow through in real time. Useful during load testing or incident debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single pane across providers.&lt;/strong&gt; If you are running Bedrock plus OpenAI or Gemini as &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt;, all logs are in one place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralized view saves real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool solves everything. Here is what to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM has broader provider coverage.&lt;/strong&gt; 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS native tools have zero overhead.&lt;/strong&gt; If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go vs Python matters at scale.&lt;/strong&gt; Bifrost's 11 microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; back this up: 5,000 RPS on a single instance.&lt;/p&gt;
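&lt;p&gt;The compounding is plain arithmetic on those two per-request figures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Cumulative latency added across all requests per day at a fixed rate,
# using the two overhead figures quoted above (11 microseconds vs ~8 ms).
def daily_overhead_seconds(per_request_seconds, rps):
    return per_request_seconds * rps * 86_400

RPS = 1_000
print(daily_overhead_seconds(11e-6, RPS))  # ~950 seconds per day in aggregate
print(daily_overhead_seconds(8e-3, RPS))   # ~691,200 seconds per day in aggregate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;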

&lt;p&gt;&lt;strong&gt;Bifrost is a newer project.&lt;/strong&gt; The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stick with AWS native tools if:&lt;/strong&gt; You have one team, one model, and just need billing alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider LiteLLM if:&lt;/strong&gt; You need maximum provider coverage and are comfortable with Python-based overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Bifrost if:&lt;/strong&gt; You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; and automatic failover alongside cost controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost in your VPC&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure Bedrock providers in bifrost.yaml&lt;/span&gt;

&lt;span class="c"&gt;# 3. Set budget and rate limit tiers&lt;/span&gt;

&lt;span class="c"&gt;# 4. Point your application at the gateway&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Bedrock request now has cost tracking, rate limiting, and observability built in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to dig into the source.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Best Claude Code Gateway for Multi-Model Routing</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:29:45 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</link>
      <guid>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</guid>
      <description>&lt;p&gt;Claude Code is great until you need more than one model. You hit a rate limit on Anthropic, want Gemini for long context, or need GPT-4o for a specific task. The default setup gives you no way to route across providers.&lt;/p&gt;

&lt;p&gt;I spent a week testing gateways that sit between Claude Code and LLM providers. The goal was simple: configure multiple models, set routing weights, get automatic failover, and keep Claude Code working normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; was the clear winner. Open-source, written in Go, 11 microsecond overhead per request. Here is how I set up multi-model routing and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Model Routing Matters
&lt;/h2&gt;

&lt;p&gt;Different models are good at different things. Claude Sonnet handles tool use well. GPT-4o is strong at certain code generation tasks. Gemini 2.5 Pro handles massive context windows. Using one model for everything means you are leaving performance on the table.&lt;/p&gt;

&lt;p&gt;Multi-model routing lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split traffic across providers by weight&lt;/li&gt;
&lt;li&gt;Fail over automatically when a provider goes down&lt;/li&gt;
&lt;li&gt;Pin specific models for specific tasks&lt;/li&gt;
&lt;li&gt;Control costs by routing cheaper models for simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: Claude Code talks to &lt;code&gt;api.anthropic.com&lt;/code&gt; by default. No native multi-model support. You need a gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Bifrost as a Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint. Claude Code does not know a gateway exists. It sends standard requests, and Bifrost translates and routes them to whatever provider you configure.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Connect
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup guide&lt;/a&gt; has the details.&lt;/p&gt;

&lt;p&gt;Point Claude Code at Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; here is a Bifrost virtual key, not your actual Anthropic key. Provider keys live in the Bifrost config. This is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the Anthropic API.&lt;/p&gt;

&lt;p&gt;Done. Every Claude Code request now flows through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weighted Routing Configuration
&lt;/h2&gt;

&lt;p&gt;This is the core of multi-model routing. You assign weights to providers, and Bifrost distributes traffic accordingly. Weights are auto-normalized to sum to 1.0, so you can use any numbers.&lt;/p&gt;

&lt;p&gt;Here is a config that splits traffic between GPT-4o and Claude Sonnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of requests go to GPT-4o. 30% to Claude Sonnet. I used this to compare output quality across providers in real coding sessions without manually switching anything.&lt;/p&gt;
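&lt;p&gt;Since weights are relative rather than percentages, it is worth sanity-checking the split; any pair with the same ratio produces the same distribution:&lt;/p&gt;

```shell
# Weights normalize by their sum: 70/30 gives the same split as 7/3 or 0.7/0.3.
w1=70
w2=30
echo "$(( w1 * 100 / (w1 + w2) ))% of requests to openai-primary"
echo "$(( w2 * 100 / (w1 + w2) ))% of requests to anthropic-secondary"
```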

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all the configuration options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important detail:&lt;/strong&gt; cross-provider routing does not happen automatically. You must explicitly configure each provider in your config. Bifrost does not guess or infer routing rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;Weighted routing is useful. Automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; is essential. Providers go down. Rate limits hit. You do not want your Claude Code session to break mid-task.&lt;/p&gt;

&lt;p&gt;Bifrost sorts providers by weight and retries on failure. If the primary provider fails, the next one picks up the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenAI goes down, Bifrost retries with Gemini. If Gemini fails too, Anthropic picks up the request. My coding session is never interrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Pinning for Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;If your team uses AWS Bedrock or Google Vertex AI, you can pin specific models directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bedrock&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bedrock/global.anthropic.claude-sonnet-4-6"&lt;/span&gt;

&lt;span class="c"&gt;# Vertex AI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vertex/claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also override the model mid-session using the &lt;code&gt;--model&lt;/code&gt; flag or the &lt;code&gt;/model&lt;/code&gt; command inside Claude Code. Useful when you want to switch between models for different parts of a task. Start with Sonnet for scaffolding, switch to GPT-4o for a tricky implementation, then back again. The gateway handles the translation layer for each provider.&lt;/p&gt;

&lt;p&gt;This is one area where the &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK compatibility&lt;/a&gt; matters. Bifrost maintains full compatibility with the Anthropic message format, so model pinning and switching work without any client-side changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration docs&lt;/a&gt; list all supported providers and model formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Controls Across Providers
&lt;/h2&gt;

&lt;p&gt;Once all traffic flows through one gateway, cost management becomes straightforward. Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-code-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a limit. When it is reached, requests get blocked. No surprise bills from a runaway Claude Code session.&lt;/p&gt;
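&lt;p&gt;Assuming the same schema as the snippet above, stacking a team-level ceiling on top of the per-key limit might look like this (the ids and amounts are illustrative, not taken from the docs):&lt;/p&gt;

```yaml
budgets:
  # Team-wide ceiling across every virtual key the team owns (illustrative)
  - level: "team"
    id: "platform-team"
    limit: 1000
    period: "monthly"
  # Per-virtual-key limit, as in the example above
  - level: "virtual_key"
    id: "claude-code-dev"
    limit: 200
    period: "monthly"
```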

&lt;p&gt;The full &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; handles rate limiting, access control, and spend management across all configured providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Across All Providers
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost gets logged: latency, token count, cost, provider used, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you a single view across all providers.&lt;/p&gt;

&lt;p&gt;This is particularly useful with multi-model routing. You can see exactly which provider handled each request, compare response times across models, and track per-provider costs. When I was running 70/30 weighted routing between GPT-4o and Claude Sonnet, the observability data showed me exactly how each model performed on real coding tasks. Response times, token consumption, and cost per request, all in one place.&lt;/p&gt;

&lt;p&gt;Without centralized logging, you are checking multiple provider dashboards and guessing which model handled what. That is not sustainable when you are running multiple providers through Claude Code daily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter streaming limitation.&lt;/strong&gt; OpenRouter does not stream function call arguments properly. This causes file operation failures in Claude Code. If you use OpenRouter as a provider, expect issues with tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-Anthropic model requirements.&lt;/strong&gt; Any non-Anthropic model you route through must support tool use. Claude Code relies heavily on function calling. Models without proper tool support will fail on file operations, search, and other agent tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version requires you to run and maintain the gateway. There is no managed cloud offering. That means monitoring, updating, and debugging are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newer project.&lt;/strong&gt; Bifrost's community is growing but still smaller than older alternatives. Documentation is solid, but edge cases may require digging through issues on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; You are adding a process between Claude Code and your provider. The 11 microsecond overhead is negligible, but it is one more thing in the chain to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran benchmarks matching the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt;. The numbers held up: 11 microseconds of routing overhead, 5,000 requests per second on a single instance. The Go implementation makes a real difference. Python-based gateways I tested added significantly more latency.&lt;/p&gt;

&lt;p&gt;For a gateway that sits in the critical path of every LLM call, low overhead matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure providers in bifrost.yaml (weighted routing + failover)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Claude Code at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key

&lt;span class="c"&gt;# 4. Use Claude Code normally&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Your Claude Code session now routes across multiple models with automatic failover and budget controls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running Claude Code for real work, multi-model routing is not optional. Single-provider setups break at the worst times. A gateway that handles routing, failover, and cost controls in one place saves hours of debugging and thousands in unexpected spend.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Best MCP Gateway for 50% Token Cost Savings</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:32:48 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-mcp-gateway-for-50-token-cost-savings-4anm</link>
      <guid>https://dev.to/pranay_batta/best-mcp-gateway-for-50-token-cost-savings-4anm</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Classic MCP dumps 100+ tool definitions into every LLM call. Bifrost's Code Mode generates TypeScript declarations instead, cutting token usage by 50%+ and latency by 40-50%. If you are running 3 or more MCP servers, this is the single biggest cost lever you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Classic MCP
&lt;/h2&gt;

&lt;p&gt;I have been testing MCP setups for a few months now. The standard approach is simple. You connect your MCP servers, and every tool definition gets sent to the LLM as part of the context window. Every single call.&lt;/p&gt;

&lt;p&gt;With 3 MCP servers, you might have 30-40 tools. With 10 servers, easily 100+. Each tool definition includes the name, description, input schema, and parameter types. That is a lot of tokens. And you are paying for every single one of them on every request.&lt;/p&gt;

&lt;p&gt;The math is straightforward. If your average tool definition is 200 tokens, and you have 50 tools, that is 10,000 tokens of overhead per call. At scale, this adds up fast.&lt;/p&gt;
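&lt;p&gt;That overhead math as a quick calculation (the per-tool token count is the estimate above; the daily volume is illustrative):&lt;/p&gt;

```shell
# Tool-definition overhead per call, and what it totals over a day of traffic
tokens_per_tool=200
num_tools=50
calls_per_day=10000   # illustrative volume
overhead_per_call=$(( tokens_per_tool * num_tools ))
echo "${overhead_per_call} overhead tokens per call"
echo "$(( overhead_per_call * calls_per_day )) overhead tokens per day"
```

&lt;p&gt;At that volume, 100 million tokens a day go to tool definitions before a single token of the actual query.&lt;/p&gt;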

&lt;h2&gt;
  
  
  How Bifrost Code Mode Changes This
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a different approach with its &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;. Instead of exposing raw tool definitions to the LLM, it generates TypeScript declaration files (.d.ts) for all connected MCP tools.&lt;/p&gt;

&lt;p&gt;The LLM then writes TypeScript code to orchestrate multiple tools in a restricted sandbox environment. Instead of the model making 5 separate tool calls (each requiring a round trip), it writes one code block that handles all 5 operations.&lt;/p&gt;

&lt;p&gt;Here is what this means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token reduction:&lt;/strong&gt; 50%+ compared to classic MCP. The TypeScript declarations are more compact than full JSON schemas, and the model makes fewer round trips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency reduction:&lt;/strong&gt; 40-50% compared to classic MCP. Fewer round trips means faster overall execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommended when:&lt;/strong&gt; You are using 3 or more MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Code Mode Actually Does
&lt;/h2&gt;

&lt;p&gt;The execution model is restricted by design. Here is what is available in the sandbox:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available:&lt;/strong&gt; ES5.1+ JavaScript, async/await, TypeScript, console.log/error/warn, JSON.parse/stringify, and all MCP tool bindings as globals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not available:&lt;/strong&gt; ES Modules, Node.js APIs, browser APIs, DOM, timers (setTimeout/setInterval), network access.&lt;/p&gt;

&lt;p&gt;This is not a general-purpose runtime. It is a controlled environment where the LLM can orchestrate tools safely. No arbitrary code execution, no network calls outside of the tool bindings.&lt;/p&gt;

&lt;p&gt;You can configure tool bindings at the server level or tool level, depending on how granular you need the control to be. The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs cover the binding configuration&lt;/a&gt; in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Numbers
&lt;/h2&gt;

&lt;p&gt;Bifrost itself adds 11 microseconds of latency overhead per request. It is written in Go and handles 5,000 RPS sustained throughput. That is roughly 50x faster than Python-based alternatives.&lt;/p&gt;

&lt;p&gt;For MCP-specific operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-3ms MCP latency overall&lt;/li&gt;
&lt;li&gt;InProcess connections: ~0.1ms&lt;/li&gt;
&lt;li&gt;STDIO connections: ~1-10ms&lt;/li&gt;
&lt;li&gt;HTTP connections: ~10-500ms (network dependent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MCP tool discovery is cached after the first request, so subsequent calls hit ~100-500 microseconds for discovery and ~50-200 nanoseconds for tool filtering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Mode: The Other Side
&lt;/h2&gt;

&lt;p&gt;Bifrost also has an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; that turns the gateway into an autonomous agent runtime. You configure which tools are auto-approved via &lt;code&gt;tools_to_auto_execute&lt;/code&gt;, set a &lt;code&gt;max_depth&lt;/code&gt; to prevent infinite loops, and let the agent handle iterative execution.&lt;/p&gt;

&lt;p&gt;This is a different use case from Code Mode. Agent Mode is for workflows where you want the LLM to act autonomously within boundaries. Code Mode is for when you want to reduce token costs on tool-heavy operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Setup is zero-config. You can start with npx or Docker. The gateway supports 19+ providers out of the box (OpenAI, Anthropic, Azure, Bedrock, Gemini, Mistral, Cohere, Groq, and others), all through an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;OpenAI-compatible API format&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# npx&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# Docker&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Who Should Use Code Mode
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 3 MCP servers, classic mode is probably fine. The overhead is manageable.&lt;/p&gt;

&lt;p&gt;If you are running 3+, especially with 50+ tools across those servers, Code Mode is worth testing. The 50%+ token savings are significant at scale, and the 40-50% latency improvement compounds across multi-step agent workflows.&lt;/p&gt;

&lt;p&gt;I tested this on a setup with 5 MCP servers and 80+ tools. The token savings were immediately visible in the cost dashboard. The reduced round trips also made the overall agent response noticeably faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;git.new/bifrost&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;getmax.im/bifrostdocs&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;getmax.im/bifrost-home&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Connect Any Model with Gemini CLI Using Bifrost AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 06 Apr 2026 10:12:17 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-connect-any-model-with-gemini-cli-using-bifrost-ai-gateway-4n0d</link>
      <guid>https://dev.to/pranay_batta/how-to-connect-any-model-with-gemini-cli-using-bifrost-ai-gateway-4n0d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Gemini CLI works with Google's models out of the box. But if you want to route requests through multiple providers, add failover, or track costs, you can point Gemini CLI at Bifrost. One config change. Every model available through a single endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Provider CLI Tools
&lt;/h2&gt;

&lt;p&gt;Gemini CLI connects to Google's Generative AI API. That is fine if you only use Gemini models. But most production setups involve multiple providers. OpenAI for some tasks. Anthropic for others. Maybe a local Ollama instance for development.&lt;/p&gt;

&lt;p&gt;Switching between CLIs and API keys for each provider gets old fast.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway written in Go, as a unified routing layer for Gemini CLI. The setup took about 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Works with Gemini CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;fully Google GenAI compatible endpoint&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;http://localhost:8080/genai/v1beta/models/{model}/generateContent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means Gemini CLI can talk to Bifrost without any code changes. Just point the base URL to your Bifrost instance.&lt;/p&gt;

&lt;p&gt;Bifrost then routes the request to whatever provider and model you specify. OpenAI, Anthropic, Vertex AI, Bedrock, Groq, Ollama. All through the same endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @anthropic-ai/bifrost@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero config. Starts on port 8080 by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure Providers
&lt;/h3&gt;

&lt;p&gt;Add your provider keys to the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;config&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gemini"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.GEMINI_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-flash"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Point Gemini CLI to Bifrost
&lt;/h3&gt;

&lt;p&gt;Set the base URL to your Bifrost instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every request from Gemini CLI goes through Bifrost. You can target any provider using the provider-prefixed model format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini/gemini-2.5-flash      → Google Gemini
openai/gpt-4o                → OpenAI
anthropic/claude-sonnet-4-20250514  → Anthropic
vertex/gemini-pro             → Vertex AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
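&lt;p&gt;Parsing that prefix is trivial, which is part of why the scheme works so well. Here is a tiny Go sketch of the idea (my own helper for illustration, not Bifrost's actual code):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// splitModel is a hypothetical helper showing how a provider-prefixed
// model string breaks into its two parts.
func splitModel(s string) (provider, model string) {
	if i := strings.Index(s, "/"); i >= 0 {
		return s[:i], s[i+1:]
	}
	return "", s // no prefix: the gateway would fall back to a default provider
}

func main() {
	p, m := splitModel("anthropic/claude-sonnet-4-20250514")
	fmt.Println(p, m) // anthropic claude-sonnet-4-20250514
}
```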



&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Provider Routing
&lt;/h3&gt;

&lt;p&gt;One CLI, every model. No more switching between tools or managing separate API keys per provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Set up &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;fallback chains&lt;/a&gt; in your requests. If Gemini is rate-limited, the request goes to OpenAI. If OpenAI is down, it goes to Anthropic. Each fallback is a fresh request. All plugins still run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget Controls
&lt;/h3&gt;

&lt;p&gt;Bifrost has a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config. Set a monthly spending cap on your Virtual Key. When it is hit, the gateway stops routing to paid providers. Your local Ollama instance can serve as the fallback.&lt;/p&gt;
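&lt;p&gt;As a rough mental model, the check is "projected spend must clear every tier's cap." A minimal Go sketch, with a made-up struct and caps (not Bifrost's actual schema):&lt;/p&gt;

```go
package main

import "fmt"

// Budgets holds the four tier caps (in USD) for one request path.
// Illustrative only.
type Budgets struct {
	Customer, Team, VirtualKey, ProviderConfig float64
}

// allow reports whether the projected spend clears every tier's cap.
// A request must pass all four levels, so the tightest cap wins.
func allow(spent, cost float64, b Budgets) bool {
	for _, limit := range []float64{b.Customer, b.Team, b.VirtualKey, b.ProviderConfig} {
		if spent+cost > limit {
			return false
		}
	}
	return true
}

func main() {
	b := Budgets{Customer: 500, Team: 200, VirtualKey: 25, ProviderConfig: 300}
	fmt.Println(allow(24.90, 0.05, b)) // true: all four caps clear
	fmt.Println(allow(24.90, 0.20, b)) // false: the $25 Virtual Key cap would be breached
}
```

&lt;p&gt;When &lt;code&gt;allow&lt;/code&gt; returns false for all paid providers, the request can still route to a free target like local Ollama.&lt;/p&gt;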

&lt;h3&gt;
  
  
  Cost Tracking
&lt;/h3&gt;

&lt;p&gt;Every request is logged with token counts and cost calculations. The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Model Catalog&lt;/a&gt; tracks pricing across all providers automatically.&lt;/p&gt;
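&lt;p&gt;The cost calculation itself is simple arithmetic over per-token prices. A sketch with placeholder rates (not the catalog's real numbers):&lt;/p&gt;

```go
package main

import "fmt"

// Pricing holds per-million-token prices. The rates used below are
// placeholders, not actual catalog figures.
type Pricing struct {
	InputPerM, OutputPerM float64
}

// requestCost converts a request's token counts into dollars.
func requestCost(inTok, outTok int, p Pricing) float64 {
	return float64(inTok)/1e6*p.InputPerM + float64(outTok)/1e6*p.OutputPerM
}

func main() {
	// 1,200 prompt tokens and 300 completion tokens at $2.50/$10.00 per million.
	fmt.Printf("$%.4f\n", requestCost(1200, 300, Pricing{InputPerM: 2.50, OutputPerM: 10.00})) // $0.0060
}
```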

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran 1,000 requests through Bifrost targeting Gemini models. The gateway adds 11µs of overhead per request. At 5,000 RPS sustained throughput, the bottleneck is always the provider, never the gateway. That is 50x faster than Python-based alternatives like LiteLLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overhead Question
&lt;/h2&gt;

&lt;p&gt;The concern with adding a proxy layer is always latency. In practice, LLM API calls take 500ms to 5 seconds depending on the model and prompt. An 11µs gateway overhead is invisible.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; layer (currently Weaviate-backed) can actually reduce latency for repeated queries by serving cached responses instead of hitting the provider again.&lt;/p&gt;
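&lt;p&gt;The exact-match layer is easy to picture: hash the prompt, look it up, and skip the provider on a hit. A toy Go sketch of just that layer (the semantic-similarity layer needs an embedding store like Weaviate and is not shown):&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// exactCache is the exact-match layer only: identical prompt in,
// cached response out.
type exactCache map[string]string

func key(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

func (c exactCache) put(prompt, resp string) { c[key(prompt)] = resp }

func (c exactCache) get(prompt string) (string, bool) {
	resp, ok := c[key(prompt)]
	return resp, ok
}

func main() {
	c := exactCache{}
	c.put("what does this repo do?", "cached answer")
	if resp, ok := c.get("what does this repo do?"); ok {
		fmt.Println(resp) // served from cache, no provider call
	}
}
```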

&lt;h2&gt;
  
  
  When This Makes Sense
&lt;/h2&gt;

&lt;p&gt;This setup is useful if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Gemini CLI but also need OpenAI or Anthropic models&lt;/li&gt;
&lt;li&gt;Want failover so your workflow does not break during provider outages&lt;/li&gt;
&lt;li&gt;Need to track costs across providers in one place&lt;/li&gt;
&lt;li&gt;Want to set budget limits so you do not get surprise bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only use Gemini models and do not care about failover or cost tracking, direct connection is fine. The gateway adds value when you are working across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost Home&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Top 5 Enterprise AI Gateways for Dynamic Routing in 2026</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 03 Apr 2026 05:42:24 +0000</pubDate>
      <link>https://dev.to/pranay_batta/top-5-enterprise-ai-gateways-for-dynamic-routing-in-2026-514b</link>
      <guid>https://dev.to/pranay_batta/top-5-enterprise-ai-gateways-for-dynamic-routing-in-2026-514b</guid>
      <description>&lt;p&gt;If you are running multiple LLM providers in production, routing logic becomes a critical infrastructure decision. Send everything to one provider and you get single points of failure. Hardcode routing rules and you lose flexibility when latency spikes or rate limits hit.&lt;/p&gt;

&lt;p&gt;I spent the last few weeks evaluating five AI gateways specifically for their dynamic routing capabilities. The criteria: latency overhead, failover behaviour, weighted distribution, and how much config it takes to get routing working in production.&lt;/p&gt;

&lt;p&gt;The short version: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; came out on top for raw performance and routing flexibility. 11 microsecond latency overhead, written in Go, with weighted routing and automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; built in. You can run it right now with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;. &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Full docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why dynamic routing matters
&lt;/h2&gt;

&lt;p&gt;Static routing is fine for prototypes. Pick a model, call the API, ship it.&lt;/p&gt;

&lt;p&gt;Production is different. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt;: When OpenAI returns 429s or 500s, traffic should automatically shift to Anthropic or another provider. No manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weighted distribution&lt;/strong&gt;: Split traffic 70/30 across providers for cost optimization or A/B testing model quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-based routing&lt;/strong&gt;: Send requests to whichever provider responds fastest at that moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget-aware routing&lt;/strong&gt;: Stop sending traffic to a provider when your spend cap is hit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gateway layer is the right place to handle this. Application code should not care which provider serves a request.&lt;/p&gt;




&lt;h2&gt;
  
  
  The five gateways I tested
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Go | &lt;strong&gt;Overhead&lt;/strong&gt;: 11 microseconds | &lt;strong&gt;Throughput&lt;/strong&gt;: 5,000 RPS sustained&lt;/p&gt;

&lt;p&gt;Bifrost is the fastest gateway I have tested. The 11 microsecond overhead is not a typo. That is roughly 50x faster than Python-based alternatives like LiteLLM, which adds around 8ms per request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;Routing configuration&lt;/a&gt; is declarative and clean. Here is what weighted routing across two providers looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bifrost-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-primary&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic-fallback&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_API_KEY}&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weighted&lt;/span&gt;
  &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That splits 70% of traffic to OpenAI and 30% to Anthropic. If OpenAI fails, requests automatically &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;fall back&lt;/a&gt; to Anthropic.&lt;/p&gt;

&lt;p&gt;What I like: the &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; ties routing to budgets. You can set a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; (Customer, Team, Virtual Key, Provider Config) and routing decisions respect those limits. When a provider budget is exhausted, traffic shifts automatically.&lt;/p&gt;

&lt;p&gt;Setup is genuinely fast. One command to start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; covers both approaches. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration&lt;/a&gt; takes a few minutes.&lt;/p&gt;

&lt;p&gt;Other features worth noting: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; with dual-layer support (exact hash + semantic similarity), &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; built in, &lt;a href="https://docs.getbifrost.ai/features/mcp" rel="noopener noreferrer"&gt;MCP support&lt;/a&gt; with sub-3ms latency and 50%+ token reduction in Code Mode, and a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; endpoint for the Anthropic SDK so you can migrate without changing application code. &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Check the benchmarks&lt;/a&gt; if you want to verify the numbers yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. LiteLLM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python | &lt;strong&gt;Overhead&lt;/strong&gt;: ~8ms | &lt;strong&gt;Providers&lt;/strong&gt;: 100+&lt;/p&gt;

&lt;p&gt;LiteLLM has the widest provider coverage I have seen. Over 100 providers through a unified interface. If you need to call a niche model API, LiteLLM probably supports it.&lt;/p&gt;

&lt;p&gt;Routing is available through the proxy server. You can configure fallbacks and load balancing across models. The configuration is YAML-based and straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-xxx&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-yyy&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;least-busy&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off is performance. At ~8ms overhead per request, you are adding meaningful latency at high throughput. For applications doing thousands of requests per second, that adds up. The Python runtime is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit where it is due&lt;/strong&gt;: LiteLLM's provider coverage is unmatched and the community is active. For teams that prioritize breadth over speed, it is a solid choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Kong AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Lua/C (OpenResty) | &lt;strong&gt;Type&lt;/strong&gt;: Enterprise, plugin-based&lt;/p&gt;

&lt;p&gt;Kong is a well-established API gateway that added AI capabilities through plugins. If your organization already runs Kong for general API management, adding AI routing is incremental.&lt;/p&gt;

&lt;p&gt;The AI plugin supports multiple providers and basic routing. Rate limiting, authentication, and logging come from Kong's mature plugin ecosystem.&lt;/p&gt;

&lt;p&gt;The limitation: AI-specific routing features require the enterprise tier. The open-source version gives you basic proxying, but weighted routing, advanced failover, and AI-specific analytics are paid features. Configuration is also more complex because you are working within Kong's plugin architecture rather than a purpose-built AI gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: Kong's plugin ecosystem is mature and battle-tested for general API management.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cloudflare AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type&lt;/strong&gt;: Managed service | &lt;strong&gt;Setup&lt;/strong&gt;: Minutes&lt;/p&gt;

&lt;p&gt;Cloudflare AI Gateway is the easiest to set up on this list. If you are already on Cloudflare, you can enable it from the dashboard and start routing requests through their edge network.&lt;/p&gt;

&lt;p&gt;It provides caching, rate limiting, and basic analytics out of the box. The managed nature means zero infrastructure to maintain.&lt;/p&gt;

&lt;p&gt;The limitation: routing flexibility is constrained compared to self-hosted options. Custom routing strategies, weighted distribution, and provider-level budget controls are limited. You also depend on Cloudflare's edge network for all LLM traffic, which may not work for teams with data residency requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: For teams that want AI gateway functionality without managing infrastructure, Cloudflare delivers the simplest path to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Azure API Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type&lt;/strong&gt;: Enterprise, Azure-native | &lt;strong&gt;Setup&lt;/strong&gt;: Hours to days&lt;/p&gt;

&lt;p&gt;Azure APIM is the default choice for organizations already invested in Azure. It supports routing to Azure OpenAI endpoints with built-in integration, and you can configure policies for retry, circuit breaking, and load balancing.&lt;/p&gt;

&lt;p&gt;The routing configuration uses Azure's policy XML, which is verbose but powerful. You get deep integration with Azure Monitor, Key Vault, and other Azure services.&lt;/p&gt;

&lt;p&gt;The limitation: it is Azure-native. If you are multi-cloud or use non-Azure LLM providers, the integration story gets complicated. Routing to Anthropic or other providers requires custom policy work. Setup is also significantly more complex than purpose-built AI gateways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: For Azure-first organizations, the deep integration with the Azure ecosystem and enterprise compliance features are genuinely valuable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong AI&lt;/th&gt;
&lt;th&gt;Cloudflare AI&lt;/th&gt;
&lt;th&gt;Azure APIM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;Low (Lua/C)&lt;/td&gt;
&lt;td&gt;Varies (edge)&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Lua/C&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Enterprise only&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Via policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Via policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-aware routing&lt;/td&gt;
&lt;td&gt;Yes (4-tier)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes (dual-layer)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major providers&lt;/td&gt;
&lt;td&gt;Major providers&lt;/td&gt;
&lt;td&gt;Azure-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours to days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Honest trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found lacking in each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost&lt;/strong&gt;: Provider count is still growing. If you need a niche provider that is not yet supported, you will need to check the docs or request it. The project is newer than LiteLLM, so community resources are still building up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt;: Performance at scale is the main concern. The ~8ms overhead is fine for low-throughput applications, but at 5,000+ RPS, you are looking at significant cumulative latency. Memory usage also climbs with the Python runtime under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kong AI Gateway&lt;/strong&gt;: The AI features feel bolted on rather than native. If you are not already a Kong customer, adopting the full Kong stack just for AI routing is overkill. Enterprise pricing for AI-specific features is a barrier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt;: Limited control. You cannot implement custom routing strategies or complex failover logic. Data flows through Cloudflare's network, which is a non-starter for some compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure APIM&lt;/strong&gt;: Vendor lock-in is real. Multi-provider routing outside Azure requires significant custom work. Configuration through XML policies is tedious compared to YAML-based alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which one should you pick
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Bifrost&lt;/strong&gt; if performance and routing flexibility are your top priorities. The 11 microsecond overhead and built-in &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt; features (budget-aware routing, weighted distribution, automatic failover) make it the strongest option for high-throughput production workloads. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Star it on GitHub&lt;/a&gt; or &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;check the docs&lt;/a&gt; to get started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick LiteLLM&lt;/strong&gt; if you need the widest provider coverage and performance is not your bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Kong&lt;/strong&gt; if your organization already runs Kong and wants to add AI routing incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cloudflare&lt;/strong&gt; if you want zero infrastructure overhead and can live with limited routing customization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Azure APIM&lt;/strong&gt; if you are fully committed to the Azure ecosystem.&lt;/p&gt;

&lt;p&gt;For most teams building production AI infrastructure, routing is a gateway-level concern that should not leak into application code. The right gateway depends on your throughput requirements, provider mix, and how much control you need over routing logic.&lt;/p&gt;

&lt;p&gt;I would start with &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;. One command to run, 11 microseconds of overhead, and routing that actually works at scale. &lt;a href="https://getmax.im/docspage" rel="noopener noreferrer"&gt;Docs are here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>go</category>
    </item>
    <item>
      <title>How to Connect Non-Anthropic Models to Claude Code with Bifrost AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 13:41:26 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-connect-non-anthropic-models-to-claude-code-with-bifrost-ai-gateway-5dnj</link>
      <guid>https://dev.to/pranay_batta/how-to-connect-non-anthropic-models-to-claude-code-with-bifrost-ai-gateway-5dnj</guid>
      <description>&lt;p&gt;I tested five different LLM gateways to route non-Anthropic models through Claude Code. Bifrost was the fastest by a wide margin. 11 microseconds of overhead per request. 50x faster than the Python-based alternatives I benchmarked.&lt;/p&gt;

&lt;p&gt;Here is exactly how I set it up, what worked, and where each feature matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Bifrost is an open-source Go gateway that exposes an Anthropic-compatible endpoint, letting you route Claude Code requests to GPT-4o, Gemini, Bedrock, or any supported provider by changing one environment variable. You get multi-provider failover, budget controls, and semantic caching at 11 microseconds of overhead per request.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post assumes you are familiar with Claude Code and have used at least one LLM API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt; -- open-source, written in Go, handles 5,000 RPS on a single instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code locks you into &lt;code&gt;api.anthropic.com&lt;/code&gt;. No native way to swap providers. You cannot route to GPT-4o, Gemini, or Bedrock models without building your own proxy or switching tools entirely.&lt;/p&gt;

&lt;p&gt;I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o for certain coding tasks&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Pro for long context&lt;/li&gt;
&lt;li&gt;Automatic failover when a provider goes down&lt;/li&gt;
&lt;li&gt;One place to track costs across all models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building a custom proxy was not worth the maintenance burden. So I went looking for something production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Bifrost Does
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint at &lt;code&gt;/anthropic&lt;/code&gt;. Claude Code sends standard Anthropic-format requests. Bifrost translates and routes them to whatever provider you configure -- OpenAI, Bedrock, Vertex AI, Gemini, others.&lt;/p&gt;

&lt;p&gt;It is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt;. Change one URL. No SDK modifications. No wrapper code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code -&amp;gt; Bifrost (/anthropic) -&amp;gt; Any LLM Provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration&lt;/a&gt; page has the full compatibility details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup instructions here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure a Provider
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;bifrost.yaml&lt;/code&gt;. This routes everything to GPT-4o:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; for all supported providers and options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Point Claude Code at Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Claude Code now sends requests through Bifrost, which translates them to OpenAI format and forwards them to GPT-4o. Zero code changes.&lt;/p&gt;
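
&lt;p&gt;You can sanity-check the wiring without Claude Code by hitting the endpoint directly. A hedged example -- it assumes Bifrost exposes the standard Anthropic Messages path under &lt;code&gt;/anthropic&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/anthropic/v1/messages \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 64,
       "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the response comes back in Anthropic's message format while your provider dashboard shows an OpenAI call, the translation layer is working.&lt;/p&gt;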

&lt;h2&gt;
  
  
  Multi-Provider Routing
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. I configured weighted &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt; across two providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;80% of traffic goes to GPT-4o. 20% to Claude. Useful when you want to compare output quality across models in real usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;This was the feature that sold me. &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover configuration&lt;/a&gt; took five minutes. If GPT-4o goes down, Bifrost tries the next provider automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-tertiary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenAI fails, Bifrost tries Gemini. If Gemini fails, it falls back to Anthropic. My Claude Code session never breaks. No retry logic on my side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;I also tested with AWS Bedrock and Vertex AI. Same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v2:0"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex-gemini"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex"&lt;/span&gt;
    &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-gcp-project"&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same Anthropic-compatible endpoint. Claude Code does not know which provider is behind Bifrost. That is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features Worth Mentioning
&lt;/h2&gt;

&lt;p&gt;Routing alone is useful. But once all requests flow through one gateway, you get access to several other capabilities I found genuinely practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget Enforcement
&lt;/h3&gt;

&lt;p&gt;Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config. I set team-level limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineering"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Budget runs out, requests get blocked. No surprise bills. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; cover the full hierarchy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Caching
&lt;/h3&gt;

&lt;p&gt;This cut my costs noticeably. Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt;: exact hash matching plus semantic similarity. If I have already asked a similar question, it returns the cached response instead of hitting the provider.&lt;/p&gt;

&lt;p&gt;Supported &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector stores&lt;/a&gt;: Weaviate, Redis, Qdrant.&lt;/p&gt;
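
&lt;p&gt;Enabling it is a small plugin-config change. A minimal fragment (the values shown are illustrative; see the semantic caching docs for the full option set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "ttl": "5m",
        "threshold": 0.8
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;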

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Every request gets logged with latency, tokens, cost, and provider information. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you full visibility into what is happening across all your providers.&lt;/p&gt;
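
&lt;p&gt;A single logged request looks roughly like this (a hypothetical entry for illustration -- the exact field names depend on your observability backend):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "provider": "openai-primary",
  "model": "gpt-4o",
  "latency_ms": 842,
  "input_tokens": 1250,
  "output_tokens": 310,
  "cost_usd": 0.0064,
  "status": "success"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;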

&lt;h3&gt;
  
  
  MCP Support
&lt;/h3&gt;

&lt;p&gt;Bifrost also works as an &lt;a href="https://docs.getbifrost.ai/features/mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;. I tested Code Mode -- it reduced token usage by over 50% and latency by 40-50%. Agent Mode is available for more complex workflows. Useful if you are connecting to Claude Desktop or other MCP-compatible clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;I ran my own tests and the numbers matched what is documented: about 11 microseconds of overhead per request and 5,000 RPS on a single instance. The Go implementation makes a real difference compared to the Python gateways I tested.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt; explains how to reproduce these numbers yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Worth being upfront about the downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relatively new project.&lt;/strong&gt; Bifrost does not have the years of battle-testing that older proxies have. The community is growing but still smaller than those of established alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version has no managed cloud offering. You run and maintain the infrastructure yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra operational overhead.&lt;/strong&gt; You are running a separate process between Claude Code and your LLM provider. That is one more thing to monitor, update, and debug compared to direct API calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider coverage is expanding but not exhaustive.&lt;/strong&gt; Some niche providers or model variants may not be supported yet. Check the docs before committing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configure providers in &lt;code&gt;bifrost.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;ANTHROPIC_BASE_URL=http://localhost:8080/anthropic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use Claude Code normally. Bifrost routes to whatever model you configured.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tried building custom proxies before. I tried other gateways. This is the fastest option I found, and the setup takes minutes, not hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into issues or want a specific provider supported, open an issue on the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;Drop-in Replacement Guide&lt;/a&gt; -- how Bifrost maintains full Anthropic SDK compatibility&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider Configuration&lt;/a&gt; -- all supported providers and config options&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover and Fallbacks&lt;/a&gt; -- setting up automatic provider failover&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Governance: Budget and Limits&lt;/a&gt; -- the four-tier budget hierarchy explained&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Benchmarking Guide&lt;/a&gt; -- reproduce the latency and throughput numbers yourself&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Bifrost Reduces GPT Costs and Response Times with Semantic Caching</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 05:51:15 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-bifrost-reduces-gpt-costs-and-response-times-with-semantic-caching-344g</link>
      <guid>https://dev.to/pranay_batta/how-bifrost-reduces-gpt-costs-and-response-times-with-semantic-caching-344g</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Every GPT API call costs money and takes time. If your app sends the same (or very similar) prompts repeatedly, you are paying full price each time for answers you already have. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway, ships with a &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; plugin that uses dual-layer caching: exact hash matching plus &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector similarity search&lt;/a&gt;. Cache hits cost zero. Semantic matches cost only the embedding lookup. This post walks you through how it works and how to set it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cost problem with GPT API calls
&lt;/h2&gt;

&lt;p&gt;If you are building anything production-grade with GPT-4, GPT-4o, or any OpenAI model, you already know that API costs add up fast. Token-based pricing means every request burns through your budget, whether it is a fresh question or something your system answered three minutes ago.&lt;/p&gt;

&lt;p&gt;Here is the thing: in most real applications, a significant portion of requests are either identical or semantically similar to previous ones. Think about it. Customer support bots get asked the same questions in slightly different words. Code assistants receive near-identical prompts from different users. RAG pipelines retrieve similar context and ask similar follow-ups.&lt;/p&gt;

&lt;p&gt;Without caching, you pay full model cost for every single one of those requests. You also wait for the full round-trip to the provider each time, adding latency that your users notice.&lt;/p&gt;

&lt;p&gt;The obvious fix is caching. But traditional exact-match caching has a big limitation: it only works when the prompt is character-for-character identical. Change one word, add a comma, rephrase slightly, and you get a cache miss. That is where semantic caching changes the game.&lt;/p&gt;




&lt;h2&gt;
  
  
  What semantic caching is and how it differs from exact-match caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Exact-match caching&lt;/strong&gt; hashes the entire request and looks up that hash. If the hash matches a stored response, you get a cache hit. If even one character is different, it is a miss. This works well for automated pipelines where prompts are templated and predictable. It falls apart for user-facing applications where people phrase things differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; converts the request into a vector embedding and searches for similar embeddings in a vector store. If a stored request is semantically similar enough (above a configurable threshold), the cached response is returned. This means "How do I reset my password?" and "What are the steps to change my password?" can both hit the same cache entry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; combines both approaches in a &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer architecture&lt;/a&gt;, giving you the speed of exact matching with the intelligence of semantic similarity as a fallback.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Bifrost implements dual-layer caching
&lt;/h2&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt; uses a two-step lookup process for every request that has a cache key:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Exact hash match.&lt;/strong&gt; The plugin hashes the request and checks for a direct match. This is the fastest path. If it hits, you get the cached response with zero additional cost. No embedding generation, no vector search, no provider call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Semantic similarity search.&lt;/strong&gt; If the exact match misses, Bifrost generates an embedding for the request and searches the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; for semantically similar entries. If a match is found above the similarity threshold (default 0.8), the cached response is returned. The only cost here is the embedding generation.&lt;/p&gt;

&lt;p&gt;If both layers miss, the request goes to the LLM provider as normal. The response is then stored in the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; with its embedding for future lookups.&lt;/p&gt;

&lt;p&gt;You can also control which layer to use per request. If you know your use case only needs exact matching (templated prompts), you can skip the semantic layer entirely. If you want semantic-only, that is an option too. The default is both, with direct matching first and semantic as fallback.&lt;/p&gt;
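
&lt;p&gt;To make the flow concrete, here is a minimal Python sketch of the same two-step lookup. This illustrates the logic only -- Bifrost's actual implementation is in Go, and the &lt;code&gt;embed()&lt;/code&gt; stub below is a toy stand-in for a real embedding model:&lt;/p&gt;

```python
import hashlib
import math
from typing import Optional

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    # a tiny bag-of-words vector, just so the sketch is self-contained.
    vocab = ["password", "reset", "change", "steps", "how"]
    words = [w.strip("?.,!").lower() for w in text.split()]
    return [float(words.count(v)) for v in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class DualLayerCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.exact: dict[str, str] = {}  # request hash -> response
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def lookup(self, prompt: str) -> Optional[str]:
        # Layer 1: exact hash match -- fastest path, zero extra cost.
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:
            return self.exact[key]
        # Layer 2: semantic similarity -- costs one embedding generation.
        vec = embed(prompt)
        for stored_vec, response in self.entries:
            if cosine(vec, stored_vec) >= self.threshold:
                return response
        return None  # both layers missed: call the provider, then store()

    def store(self, prompt: str, response: str) -> None:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        self.exact[key] = response
        self.entries.append((embed(prompt), response))

cache = DualLayerCache(threshold=0.8)
cache.store("How do I reset my password?", "Go to Settings > Security.")
print(cache.lookup("How do I reset my password?"))  # exact hit
print(cache.lookup("Please reset my password"))     # semantic hit
print(cache.lookup("What is the weather today?"))   # miss: None
```

&lt;p&gt;With real model embeddings, looser paraphrases also clear the 0.8 threshold; the toy bag-of-words vectors here only match close rewordings.&lt;/p&gt;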

&lt;p&gt;Here is how the cost breaks down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;LLM API Cost&lt;/th&gt;
&lt;th&gt;Embedding Cost&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact cache hit&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic cache hit&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Embedding only&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache miss&lt;/td&gt;
&lt;td&gt;Full model cost&lt;/td&gt;
&lt;td&gt;Embedding generation&lt;/td&gt;
&lt;td&gt;Full + embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bifrost also handles cost calculation natively through &lt;code&gt;CalculateCostWithCacheDebug&lt;/code&gt;, which automatically accounts for cache hits, semantic matches, and misses in your &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;cost tracking&lt;/a&gt;. All pricing data is cached in memory for O(1) lookup, so the cost calculation itself adds no overhead.&lt;/p&gt;

&lt;p&gt;Check out the full &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost documentation&lt;/a&gt; for the complete API reference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Follow the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; to get Bifrost running, then configure two things: a vector store and the semantic cache plugin.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Configure the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This setup uses Weaviate as the vector store (Redis and Qdrant are also supported). You can run Weaviate locally with Docker or use Weaviate Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local setup with Docker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 50051:50051 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PERSISTENCE_DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'/var/lib/weaviate'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  semitechnologies/weaviate:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;config.json (local Weaviate):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vector_store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weaviate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"localhost:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scheme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;config.json (Weaviate Cloud):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vector_store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weaviate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-cluster.weaviate.network"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scheme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-weaviate-api-key"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure the &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Add the plugin to your Bifrost config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"semantic_cache"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"embedding_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ttl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"conversation_history_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"exclude_system_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cache_by_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cache_by_provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cleanup_on_shutdown"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note about these settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;threshold&lt;/code&gt;&lt;/strong&gt;: The similarity score (0 to 1) required for a semantic match. 0.8 is a good starting point. Higher means stricter matching, fewer false positives, but more cache misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;conversation_history_threshold&lt;/code&gt;&lt;/strong&gt;: Defaults to 3. If a conversation has more messages than this, caching is skipped. Long conversations have high probability of false positive semantic matches due to topic overlap, and they rarely produce exact hash matches anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ttl&lt;/code&gt;&lt;/strong&gt;: How long cached responses stay valid. Accepts duration strings like &lt;code&gt;"30s"&lt;/code&gt;, &lt;code&gt;"5m"&lt;/code&gt;, &lt;code&gt;"1h"&lt;/code&gt;, or numeric seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_by_model&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;cache_by_provider&lt;/code&gt;&lt;/strong&gt;: When true, cache entries are isolated per &lt;a href="https://docs.getbifrost.ai/architecture/framework/model-catalog" rel="noopener noreferrer"&gt;model&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider&lt;/a&gt; combination. A GPT-4 response will not be returned for a GPT-3.5-turbo request.&lt;/li&gt;
&lt;/ul&gt;
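&lt;p&gt;To make the &lt;code&gt;threshold&lt;/code&gt; setting concrete, here is a minimal sketch of how a cosine-similarity check gates a semantic match. The vectors and the code are illustrative only (Bifrost's actual matching happens inside the vector store), but the decision rule is the same: a cached response is reused only if the similarity score clears the configured threshold.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns a score in [-1, 1]; 1 means the two
// embedding vectors point in exactly the same direction.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	cached := []float64{0.12, 0.85, 0.43}   // embedding of a cached prompt (made up)
	incoming := []float64{0.10, 0.80, 0.45} // embedding of a similar new prompt (made up)
	const threshold = 0.8                   // mirrors the config above

	score := cosineSimilarity(cached, incoming)
	fmt.Printf("similarity %.4f, cache hit: %v\n", score, score >= threshold)
}
```

Raising the threshold toward 1.0 demands near-identical phrasing; lowering it trades precision for more hits.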

&lt;h3&gt;
  
  
  Step 3: Trigger caching per request
&lt;/h3&gt;

&lt;p&gt;Caching is opt-in per request. You need to set a cache key, either via the Go SDK or HTTP headers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This request WILL be cached&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "gpt-4", "messages": [{"role": "user", "content": "What is semantic caching?"}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Go SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semanticcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CacheKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"session-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the cache key, requests bypass caching entirely. This gives you fine-grained control over what gets cached and what does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-request overrides (HTTP):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-ttl: 30s"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-threshold: 0.9"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache type control:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Direct hash matching only (fastest, no embedding cost)&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-type: direct"&lt;/span&gt; ...

&lt;span class="c"&gt;# Semantic similarity search only&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-type: semantic"&lt;/span&gt; ...

&lt;span class="c"&gt;# Default: both (direct first, semantic fallback)&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use no-store mode to read from cache without storing the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-no-store: true"&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  When semantic caching helps vs when it does not
&lt;/h2&gt;

&lt;p&gt;Semantic caching is not a universal solution. Here is where it works well and where it does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support bots where users ask the same questions in different words&lt;/li&gt;
&lt;li&gt;FAQ-style applications with predictable query patterns&lt;/li&gt;
&lt;li&gt;RAG pipelines where similar contexts produce similar queries&lt;/li&gt;
&lt;li&gt;Internal tools where multiple team members ask overlapping questions&lt;/li&gt;
&lt;li&gt;Any high-volume application with repetitive prompt patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not a good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversations that are heavily context-dependent and unique every time&lt;/li&gt;
&lt;li&gt;Long multi-turn conversations (the &lt;code&gt;conversation_history_threshold&lt;/code&gt; exists for this reason, as longer conversations create false positive matches)&lt;/li&gt;
&lt;li&gt;Applications where responses must reflect real-time data that changes frequently&lt;/li&gt;
&lt;li&gt;Creative generation tasks where you want varied outputs for similar inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is that semantic caching works best when your application naturally produces clusters of similar requests. If every request is genuinely unique, caching of any kind will not help much.&lt;/p&gt;




&lt;h2&gt;
  
  
  Other performance details worth knowing
&lt;/h2&gt;

&lt;p&gt;Beyond semantic caching, Bifrost caches aggressively at multiple levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool discovery&lt;/strong&gt; is cached after the first request, bringing subsequent lookups down to roughly 100-500 microseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health check results&lt;/strong&gt; are cached, bringing lookups down to approximately 50 nanoseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All pricing data&lt;/strong&gt; is cached in memory for O(1) lookups during cost calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cache entries use namespace isolation. Each Bifrost instance gets its own &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; namespace to prevent conflicts. When the Bifrost client shuts down (with &lt;code&gt;cleanup_on_shutdown&lt;/code&gt; set to true), all cache entries and the namespace itself are cleaned up. You can also programmatically &lt;a href="https://docs.getbifrost.ai/api-reference/cache/clear-cache-by-cache-key" rel="noopener noreferrer"&gt;clear cache by key&lt;/a&gt; or &lt;a href="https://docs.getbifrost.ai/api-reference/cache/clear-cache-by-request-id" rel="noopener noreferrer"&gt;clear cache by request ID&lt;/a&gt; via the API.&lt;/p&gt;

&lt;p&gt;Cache metadata is automatically added to responses via &lt;code&gt;response.ExtraFields.CacheDebug&lt;/code&gt;, so you can inspect whether a response came from direct cache, semantic match, or a fresh provider call. You can also use the &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-log-statistics" rel="noopener noreferrer"&gt;log statistics API&lt;/a&gt; for deeper &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; into your cache performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;If your GPT-powered application handles any volume of requests, there is a good chance a meaningful portion of those requests are semantically similar. Paying full API cost for every one of them does not make sense.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt; gives you dual-layer caching with exact matching and &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector similarity search&lt;/a&gt;, opt-in per request, configurable thresholds, and built-in &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;cost tracking&lt;/a&gt;. It is open source, written in Go, and designed for production workloads.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to get started, read the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the full configuration reference, or visit the &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost website&lt;/a&gt; to learn more about the gateway.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>gpt</category>
      <category>mcp</category>
    </item>
    <item>
      <title>LLM Cost Tracking and Spend Management for Engineering Teams</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 05:43:30 +0000</pubDate>
      <link>https://dev.to/pranay_batta/llm-cost-tracking-and-spend-management-for-engineering-teams-233a</link>
      <guid>https://dev.to/pranay_batta/llm-cost-tracking-and-spend-management-for-engineering-teams-233a</guid>
      <description>&lt;p&gt;Your team ships a feature using GPT-4, it works great in staging, and then production traffic hits. Suddenly you are burning through API credits faster than anyone expected. Multiply that across three providers, five teams, and a few hundred thousand requests per day. Good luck figuring out where the money went.&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway in Go, and cost tracking was one of the first problems we had to solve properly. This post covers what we learned, how we designed spend management into the gateway layer, and what the alternatives look like. You can get started with the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; in under a minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Bifrost gives you per-request &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost logging&lt;/a&gt;, four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchies&lt;/a&gt; (Customer, Team, Virtual Key, Provider Config), auto-synced model pricing, and cache-aware cost calculations. All at 11 microseconds of latency overhead. You can run it right now with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;. &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Full docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual problem with LLM costs
&lt;/h2&gt;

&lt;p&gt;Cloud compute costs are predictable. You pick an instance type, you know the hourly rate, you can forecast monthly spend within a few percent.&lt;/p&gt;

&lt;p&gt;LLM costs are nothing like that.&lt;/p&gt;

&lt;p&gt;A single API call costs somewhere between $0.0001 and $0.50 depending on the model, the input length, the output length, whether you are sending images or audio, and whether the context crosses the 128k token threshold (where pricing tiers change). That is per request.&lt;/p&gt;
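&lt;p&gt;As a rough sketch, the basic token-based calculation and the spread it produces look like this. The per-token rates below are made-up placeholders, not any provider's real pricing:&lt;/p&gt;

```go
package main

import "fmt"

// Illustrative per-token rates only; real provider pricing differs
// and changes over time.
const (
	inputPricePerToken  = 0.00001 // $10 per 1M input tokens
	outputPricePerToken = 0.00003 // $30 per 1M output tokens
)

func requestCost(inputTokens, outputTokens int) float64 {
	return float64(inputTokens)*inputPricePerToken +
		float64(outputTokens)*outputPricePerToken
}

func main() {
	// A short chat turn versus a long-context request: the spread is
	// two orders of magnitude at the same rates.
	fmt.Printf("short chat turn:   $%.4f\n", requestCost(500, 200))
	fmt.Printf("long-context call: $%.4f\n", requestCost(100_000, 2_000))
}
```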

&lt;p&gt;Now add multi-provider &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. Your app might use OpenAI for chat, Anthropic for analysis, and a smaller model for classification. Each provider has different pricing structures, different token counting methods, and different billing cycles.&lt;/p&gt;

&lt;p&gt;The result: engineering teams have no idea what they are spending until the invoice arrives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cost tracking actually requires
&lt;/h2&gt;

&lt;p&gt;Most teams start with "we will check the provider dashboard." That breaks down fast for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-request granularity.&lt;/strong&gt; You need to know the cost of every single API call, tied to which customer, which team, and which feature triggered it. Provider dashboards give you aggregate numbers, not per-request attribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time budget enforcement.&lt;/strong&gt; Knowing you overspent last month does not help. You need the system to reject requests when a budget limit is hit, before the money is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-modal cost calculation.&lt;/strong&gt; If your app sends images, audio, or very long contexts, the cost calculation is not a simple token multiplication. You need tiered pricing support, per-image costs, per-second audio costs, and character-based pricing for certain models.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we built cost tracking in Bifrost
&lt;/h2&gt;

&lt;p&gt;We wanted cost management to be a &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;gateway-level concern&lt;/a&gt;, not something each application team has to implement. Here is how the pieces fit together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Catalog with auto-synced pricing
&lt;/h3&gt;

&lt;p&gt;The Model Catalog is the foundation. It maintains pricing data for every supported model across all providers. You can also &lt;a href="https://docs.getbifrost.ai/api-reference/configuration/force-pricing-sync" rel="noopener noreferrer"&gt;force a pricing sync&lt;/a&gt; at any time via the API.&lt;/p&gt;

&lt;p&gt;On startup, Bifrost downloads the latest pricing sheet and loads it into memory. When a ConfigStore (SQLite or PostgreSQL) is available, it also persists the data and re-syncs every 24 hours automatically. All lookups are O(1) from memory.&lt;/p&gt;
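&lt;p&gt;Conceptually, the in-memory catalog behaves like a map keyed by provider and model. This is a sketch of the idea with assumed rates, not Bifrost's actual data structure:&lt;/p&gt;

```go
package main

import "fmt"

// priceKey identifies a model within a provider.
type priceKey struct {
	provider, model string
}

// rates holds per-token prices; the values below are assumptions
// for illustration, not synced pricing data.
type rates struct {
	inPerTok, outPerTok float64
}

// An in-memory table gives O(1) lookups on the request hot path.
var pricing = map[priceKey]rates{
	{"openai", "gpt-4"}:              {0.00003, 0.00006},
	{"anthropic", "claude-3-sonnet"}: {0.000003, 0.000015},
}

func main() {
	if r, ok := pricing[priceKey{"openai", "gpt-4"}]; ok {
		fmt.Printf("gpt-4: $%.5f in, $%.5f out per token\n", r.inPerTok, r.outPerTok)
	}
}
```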

&lt;p&gt;The pricing data covers multiple modalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt;: token-based and character-based pricing for chat completions, text completions, and embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: token-based and duration-based pricing for speech synthesis and transcription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt;: per-image costs with tiered pricing for high-token contexts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered pricing&lt;/strong&gt;: automatic rate changes above 128k tokens, reflecting actual provider pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means cost calculation is accurate for every request type, not an approximation based on token count alone.&lt;/p&gt;
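&lt;p&gt;Tiered pricing is the piece most homegrown cost trackers get wrong. A simplified version of the 128k-token tier switch looks like the sketch below; both rates are assumptions for illustration:&lt;/p&gt;

```go
package main

import "fmt"

// tieredInputCost charges a higher per-token rate for tokens beyond
// the 128k threshold, mirroring how some providers price long
// contexts. Both rates are illustrative assumptions.
func tieredInputCost(tokens int) float64 {
	const (
		threshold = 128_000
		baseRate  = 0.00001 // per token, below the threshold
		tierRate  = 0.00002 // per token, above the threshold
	)
	if tokens <= threshold {
		return float64(tokens) * baseRate
	}
	return float64(threshold)*baseRate + float64(tokens-threshold)*tierRate
}

func main() {
	fmt.Printf("100k tokens: $%.2f\n", tieredInputCost(100_000))
	fmt.Printf("200k tokens: $%.2f\n", tieredInputCost(200_000))
}
```

Note that a naive flat-rate calculation would undercount every long-context request by the difference between the two tiers.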

&lt;h3&gt;
  
  
  Four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This is where spend management happens. Bifrost supports budgets at four levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Customer&lt;/strong&gt; - set a spending cap for an entire customer account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt; - limit spend per team within a customer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual Key&lt;/a&gt;&lt;/strong&gt; - control costs per API key (useful for per-feature or per-environment budgets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider Config&lt;/a&gt;&lt;/strong&gt; - cap total spend on a specific provider&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each budget has a &lt;code&gt;max_limit&lt;/code&gt;, a &lt;code&gt;reset_duration&lt;/code&gt; (daily, weekly, monthly), and tracks &lt;code&gt;current_usage&lt;/code&gt; in real time.&lt;/p&gt;

&lt;p&gt;Here is what creating a customer with a budget looks like via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:8080/api/governance/customers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "acme-corp",
    "budget": {
      "max_limit": 500,
      "reset_duration": "monthly"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes the budget object with &lt;code&gt;current_usage&lt;/code&gt; tracked automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cust-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acme-corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bdgt-xyz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"max_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reset_duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"monthly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"current_usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;current_usage&lt;/code&gt; hits &lt;code&gt;max_limit&lt;/code&gt;, requests are rejected. No surprises on the invoice.&lt;/p&gt;
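&lt;p&gt;The enforcement rule itself is simple; what matters is that it runs before the provider call. A stripped-down sketch of the decision (Bifrost's real check is hierarchical and concurrency-safe, so treat this as the shape of the idea only):&lt;/p&gt;

```go
package main

import "fmt"

// Budget mirrors the fields returned by the governance API.
type Budget struct {
	MaxLimit     float64
	CurrentUsage float64
}

// allow rejects a request whose estimated cost would push usage past
// the cap, before any money is spent upstream.
func (b *Budget) allow(estimatedCost float64) bool {
	return b.CurrentUsage+estimatedCost <= b.MaxLimit
}

func main() {
	b := Budget{MaxLimit: 500, CurrentUsage: 499.9}
	fmt.Println(b.allow(0.05)) // fits under the cap
	fmt.Println(b.allow(0.50)) // would exceed it, rejected
}
```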

&lt;h3&gt;
  
  
  &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;LogStore&lt;/a&gt;: per-request cost audit trail
&lt;/h3&gt;

&lt;p&gt;Every request that passes through Bifrost gets &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-logs" rel="noopener noreferrer"&gt;logged with full cost data&lt;/a&gt;. The LogStore captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider and model used&lt;/li&gt;
&lt;li&gt;Input tokens, output tokens, total tokens&lt;/li&gt;
&lt;li&gt;Calculated cost (broken down into input cost, output cost, request cost, total cost)&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Status (success or error)&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;li&gt;Full input/output content (serialized as JSON)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can query this data with filters. Want to see all requests to OpenAI that cost more than $0.10 in the last hour? That is a single API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:8080/api/logs/search &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "filters": {
      "providers": ["openai"],
      "min_cost": 0.10,
      "start_time": "2026-03-31T00:00:00Z"
    },
    "pagination": {
      "limit": 50,
      "sort_by": "cost",
      "order": "desc"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-log-statistics" rel="noopener noreferrer"&gt;aggregated stats&lt;/a&gt; alongside individual logs: total requests, success rate, average latency, total tokens, and total cost for the query. This is the data you need for cost attribution and chargeback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;You can have this running in under a minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker if you prefer containerized deployment. Then point your LLM calls at the Bifrost endpoint instead of directly at the provider — it works as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the &lt;a href="https://docs.getbifrost.ai/integrations/openai-sdk" rel="noopener noreferrer"&gt;OpenAI SDK&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK&lt;/a&gt;, and &lt;a href="https://docs.getbifrost.ai/integrations/bedrock-sdk" rel="noopener noreferrer"&gt;Bedrock SDK&lt;/a&gt;. Cost tracking, budget enforcement, and logging happen automatically.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup docs&lt;/a&gt; for configuration details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache-aware cost tracking
&lt;/h2&gt;

&lt;p&gt;This is a detail that matters more than you would expect.&lt;/p&gt;

&lt;p&gt;Bifrost includes a dual-layer &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache&lt;/a&gt; (exact hash matching + semantic similarity via Weaviate). When a request hits the cache, the cost calculation changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct cache hit&lt;/strong&gt; (exact match): zero cost. The response comes from cache, no provider API call is made.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic cache hit&lt;/strong&gt; (similar query found): the cost is the embedding generation cost only. No model inference cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache miss with storage&lt;/strong&gt;: the cost is the base model usage plus the embedding generation cost for storing the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are not tracking cache-aware costs, your cost reports will overcount. Every cache hit that gets reported at full model price inflates your numbers and hides the ROI of caching.&lt;/p&gt;
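&lt;p&gt;The three cases reduce to a small branch. Here is a sketch of that accounting; the outcome type and the cost figures are invented for illustration, not Bifrost's internal API:&lt;/p&gt;

```go
package main

import "fmt"

type cacheOutcome int

const (
	directHit     cacheOutcome = iota // exact hash match
	semanticHit                       // similarity match above threshold
	missWithStore                     // fresh provider call, result stored
)

// effectiveCost mirrors the three cases above: a direct hit is free,
// a semantic hit pays only for the lookup embedding, and a miss pays
// for inference plus the embedding used to store the result.
func effectiveCost(outcome cacheOutcome, modelCost, embeddingCost float64) float64 {
	switch outcome {
	case directHit:
		return 0
	case semanticHit:
		return embeddingCost
	default:
		return modelCost + embeddingCost
	}
}

func main() {
	const model, embed = 0.02, 0.0001 // placeholder costs
	fmt.Printf("direct:   $%.4f\n", effectiveCost(directHit, model, embed))
	fmt.Printf("semantic: $%.4f\n", effectiveCost(semanticHit, model, embed))
	fmt.Printf("miss:     $%.4f\n", effectiveCost(missWithStore, model, embed))
}
```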

&lt;h2&gt;
  
  
  How other tools handle cost tracking
&lt;/h2&gt;

&lt;p&gt;Credit where it is due. There are several tools in this space, and they each take a different approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; is a proxy-based observability platform. It logs requests and provides cost analytics through a dashboard. The cost tracking is solid, with per-request granularity. Where it differs from Bifrost: Helicone is primarily an observability tool. Budget enforcement and cache-aware cost calculations are not its focus. It is a good choice if you want analytics without gateway-level controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt; acts as a unified API layer across multiple LLM providers. It handles routing and gives you a single bill, which simplifies accounting. However, OpenRouter is a hosted proxy — your requests pass through their infrastructure. There is no self-hosted option, no budget enforcement at the gateway level, and no per-customer or per-team spend hierarchy. If you need cost attribution beyond "which model was called," you will need to build that yourself on top of their logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS API Gateway + Bedrock&lt;/strong&gt; is what many AWS-native teams reach for. You get IAM-based access control and CloudWatch metrics. The limitation is that cost tracking is coarse-grained — you get aggregate billing through AWS Cost Explorer, not per-request cost breakdowns tied to your internal teams or customers. Building a four-tier budget hierarchy on top of AWS services means stitching together Lambda, DynamoDB, and custom billing logic. It works but it is a lot of glue code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kong AI Gateway&lt;/strong&gt; and &lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt; both provide rate limiting and basic analytics for AI API traffic. Kong gives you plugin-based extensibility, and Cloudflare gives you edge caching and DDoS protection. Neither provides built-in per-request cost calculation with multi-modal pricing awareness, and neither offers the kind of &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; where you can set spending caps at the customer, team, and key level with automatic enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is the most well-known Python-based proxy. It supports cost tracking and has wide model coverage. The trade-off is performance. LiteLLM adds roughly 8ms of latency overhead per request; Bifrost adds 11 microseconds, several hundred times less. At 5,000 RPS, that difference compounds. If your use case is low-throughput internal tooling, LiteLLM works fine. If you are running production workloads at scale, the latency overhead matters.&lt;/p&gt;

&lt;p&gt;The math is straightforward: at 5,000 requests per second, 8ms overhead means 40 seconds of cumulative latency overhead per second of wall time. At 11 microseconds, it is 0.055 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we learned building this
&lt;/h2&gt;

&lt;p&gt;A few things surprised us during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing data goes stale fast.&lt;/strong&gt; Providers update pricing regularly. We started with a static pricing file and quickly realized it needed to be auto-synced. The 24-hour sync interval with O(1) memory lookups was the balance we settled on. You can also trigger a &lt;a href="https://docs.getbifrost.ai/api-reference/configuration/force-pricing-sync" rel="noopener noreferrer"&gt;manual pricing sync&lt;/a&gt; via &lt;code&gt;POST /api/pricing/force-sync&lt;/code&gt; if a provider drops prices and you want immediate accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget enforcement needs to be in the hot path.&lt;/strong&gt; We tried implementing budgets as an async check initially. The problem: by the time the async check ran, the request was already sent to the provider and the cost was incurred. Budget checks have to happen before the request goes upstream. That is why Bifrost handles it at the gateway layer with in-memory state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-modal cost calculation is harder than it looks.&lt;/strong&gt; Text-only cost is straightforward: multiply tokens by price per token. But when a request includes images, the cost depends on the image resolution and the token context length. Audio adds per-second pricing. Some models charge per character instead of per token. The Model Catalog handles all of this, but getting it right required modelling each provider's pricing structure individually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution needs hierarchy.&lt;/strong&gt; Flat per-key budgets are not enough for real organizations. An engineering team needs to know: "How much is Customer X spending? How much of that is Team Y? Which &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt; is burning through budget?" That is why we built the four-tier hierarchy (Customer, Team, Virtual Key, Provider Config). You can &lt;a href="https://docs.getbifrost.ai/api-reference/governance/create-virtual-key" rel="noopener noreferrer"&gt;create virtual keys via the API&lt;/a&gt; and attach budgets to each level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;LLM cost management is not optional for production systems. If you are routing requests across multiple providers without per-request cost tracking, budget enforcement, and cache-aware calculations, you are flying blind. For enterprise teams, Bifrost also supports &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/enterprise/log-exports" rel="noopener noreferrer"&gt;log exports&lt;/a&gt;, and &lt;a href="https://docs.getbifrost.ai/enterprise/intelligent-load-balancing" rel="noopener noreferrer"&gt;intelligent load balancing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is open-source, written in Go, and runs with a single command. It handles cost tracking at the gateway layer so your application code does not have to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are dealing with LLM spend management, give it a try and let us know what is missing. We are actively building based on what teams actually need.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>LiteLLM vs Bifrost: Why the Supply Chain Attack Changes Everything for LLM Gateways</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Sat, 28 Mar 2026 05:10:03 +0000</pubDate>
      <link>https://dev.to/pranay_batta/litellm-vs-bifrost-why-the-supply-chain-attack-changes-everything-for-llm-gateways-b9l</link>
      <guid>https://dev.to/pranay_batta/litellm-vs-bifrost-why-the-supply-chain-attack-changes-everything-for-llm-gateways-b9l</guid>
      <description>&lt;p&gt;If you're running LiteLLM in production, the March 2026 supply chain attack probably got your attention. Mine too. I spent the past few days digging into what happened, why it happened, and what it means for anyone choosing an LLM gateway in 2026.&lt;/p&gt;

&lt;p&gt;This is not a hit piece. LiteLLM is a solid project with massive adoption. But this incident exposed something structural that every engineering team needs to think about. And it happens to make the case for Bifrost, a Go-based alternative, in ways that go beyond the usual performance benchmarks.&lt;/p&gt;

&lt;p&gt;Let's break it all down.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Two backdoored versions of LiteLLM (1.82.7, 1.82.8) were published to PyPI on March 24, 2026, via stolen credentials.&lt;/li&gt;
&lt;li&gt;The malware stole SSH keys, AWS/GCP/Azure credentials, and Kubernetes secrets. It used Python's &lt;code&gt;.pth&lt;/code&gt; persistence mechanism to survive across interpreter restarts.&lt;/li&gt;
&lt;li&gt;DSPy, MLflow, CrewAI, OpenHands, and Arize Phoenix all pulled the compromised version.&lt;/li&gt;
&lt;li&gt;Bifrost is a Go-based LLM gateway that compiles to a single binary. The attack vector that hit LiteLLM simply does not exist in its architecture.&lt;/li&gt;
&lt;li&gt;Beyond security, Bifrost adds 11 microseconds of overhead per request vs LiteLLM's roughly 8ms, supports 20+ providers, offers semantic caching via Weaviate, and has a four-tier budget hierarchy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Happened: The Full Attack Chain
&lt;/h2&gt;

&lt;p&gt;Here's the sequence of events, based on &lt;a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/" rel="noopener noreferrer"&gt;Snyk's detailed investigation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: The Trivy GitHub Action was compromised.&lt;/strong&gt; A group called TeamPCP tampered with the widely-used Trivy security scanner GitHub Action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: LiteLLM's CI/CD pipeline pulled the compromised Trivy Action.&lt;/strong&gt; Because LiteLLM's workflow used an unpinned version of the Trivy GitHub Action (not pinned to a specific SHA), the compromised version ran inside LiteLLM's CI environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: The malicious Trivy Action exfiltrated LiteLLM's &lt;code&gt;PYPI_PUBLISH&lt;/code&gt; token.&lt;/strong&gt; With that token, the attackers could publish any package version to PyPI under LiteLLM's name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Two backdoored versions (1.82.7, 1.82.8) were published to PyPI.&lt;/strong&gt; These looked like normal LiteLLM updates. Anyone running &lt;code&gt;pip install --upgrade litellm&lt;/code&gt; got them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: The malware deployed a &lt;code&gt;.pth&lt;/code&gt; persistence file.&lt;/strong&gt; This is the part that needs explaining.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are .pth files?
&lt;/h3&gt;

&lt;p&gt;If you're not deep into Python internals, &lt;code&gt;.pth&lt;/code&gt; files might be new to you. They live in Python's &lt;code&gt;site-packages&lt;/code&gt; directory and get executed automatically every time the Python interpreter starts up. Not when you import a specific package: every single time Python runs anything.&lt;/p&gt;

&lt;p&gt;The attackers placed a &lt;code&gt;.pth&lt;/code&gt; file that loaded their malware on every Python interpreter startup. It did not matter whether your code imported &lt;code&gt;litellm&lt;/code&gt; or not. If the package was installed in the environment, the malware was active.&lt;/p&gt;
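&lt;p&gt;You can see the mechanism in miniature with CPython's internal &lt;code&gt;site.addpackage&lt;/code&gt; helper, which is the routine the interpreter runs for each &lt;code&gt;.pth&lt;/code&gt; file at startup (a harmless demo; the interpreter itself only does this for real &lt;code&gt;site-packages&lt;/code&gt; directories):&lt;/p&gt;

```python
import os
import site
import tempfile
from pathlib import Path

tmp = tempfile.mkdtemp()
# Any .pth line that starts with "import" is exec()'d by the site module.
Path(tmp, "demo.pth").write_text("import os; os.environ['PTH_DEMO_RAN'] = '1'\n")

# site.addpackage() is what processes each .pth file at interpreter startup.
site.addpackage(tmp, "demo.pth", set())
print(os.environ.get("PTH_DEMO_RAN"))  # '1': the line ran, no package was imported
```

&lt;p&gt;That is the hook the attackers abused: drop a &lt;code&gt;.pth&lt;/code&gt; into &lt;code&gt;site-packages&lt;/code&gt; and your payload runs on every interpreter start, whether or not anything imports &lt;code&gt;litellm&lt;/code&gt;.&lt;/p&gt;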

&lt;p&gt;&lt;strong&gt;What the malware stole:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH private keys&lt;/li&gt;
&lt;li&gt;AWS, GCP, and Azure credentials&lt;/li&gt;
&lt;li&gt;Kubernetes secrets&lt;/li&gt;
&lt;li&gt;Crypto wallet keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 6: The attackers used 73 compromised GitHub accounts&lt;/strong&gt; to spam the disclosure issue with noise and eventually closed it using stolen maintainer credentials, trying to suppress the report.&lt;/p&gt;

&lt;p&gt;The backdoored versions were live on PyPI for approximately 3 hours. LiteLLM has 3.4 million+ daily downloads. You can do the math on the blast radius.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Architecture Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Let's talk about why this specific attack cannot happen to &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not just "Bifrost is written in Go, so it's safe." That would be a lazy argument. The actual reasons are architectural, and they matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  No site-packages directory
&lt;/h3&gt;

&lt;p&gt;Python packages install into &lt;code&gt;site-packages&lt;/code&gt;. That directory is a shared space where any installed package can drop files, including &lt;code&gt;.pth&lt;/code&gt; files that execute on interpreter startup. This is the mechanism the LiteLLM attackers exploited.&lt;/p&gt;

&lt;p&gt;Go compiles to a single static binary. There is no &lt;code&gt;site-packages&lt;/code&gt; equivalent. There is no shared directory where a compromised dependency could drop a persistence mechanism. The binary is the binary.&lt;/p&gt;

&lt;h3&gt;
  
  
  No .pth hook mechanism
&lt;/h3&gt;

&lt;p&gt;Python's &lt;code&gt;.pth&lt;/code&gt; file execution is a feature, not a bug. It exists for legitimate reasons (configuring import paths, running initialization code). But it also means any package you install can run arbitrary code on every Python startup without your knowledge or consent.&lt;/p&gt;

&lt;p&gt;Go has no equivalent mechanism. When you compile a Go binary, what goes in is what comes out. There is no startup hook that third-party code can inject into after compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  No transitive pip dependency chain
&lt;/h3&gt;

&lt;p&gt;LiteLLM has a substantial dependency tree. Each of those dependencies has its own dependencies. Each one is a potential attack surface. When you &lt;code&gt;pip install litellm&lt;/code&gt;, you're trusting not just the LiteLLM maintainers but every maintainer of every transitive dependency.&lt;/p&gt;
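&lt;p&gt;The trust expansion is easy to underestimate. A toy walk over a made-up dependency graph (none of these package names are real) shows how installing one package means trusting every node reachable from it:&lt;/p&gt;

```python
# Hypothetical dependency graph: package -> direct dependencies.
deps = {
    "gateway": ["httpclient", "auth"],
    "httpclient": ["urlcore", "certs"],
    "auth": ["cryptolib"],
    "urlcore": [], "certs": [], "cryptolib": [],
}

def transitive(pkg):
    """Every package you implicitly trust by installing `pkg`."""
    seen, stack = set(), [pkg]
    while stack:
        for dep in deps[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(transitive("gateway")))
# ['auth', 'certs', 'cryptolib', 'httpclient', 'urlcore']
```

&lt;p&gt;Two declared dependencies became five trusted maintainer groups, and real trees are far larger. On a live environment, &lt;code&gt;pipdeptree&lt;/code&gt; or &lt;code&gt;importlib.metadata&lt;/code&gt; will give you the same picture for your actual installs.&lt;/p&gt;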

&lt;p&gt;Bifrost ships as a compiled binary via &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or Docker (&lt;code&gt;docker pull maximhq/bifrost&lt;/code&gt;). Dependencies are resolved and compiled at build time by the Bifrost team. You're running a single binary, not managing a dependency tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  The CI/CD surface area is smaller
&lt;/h3&gt;

&lt;p&gt;The LiteLLM attack started with a compromised GitHub Action in CI/CD. Go binaries distributed via npm or Docker reduce the CI/CD surface area because the compilation and dependency resolution happen upstream, not in your pipeline.&lt;/p&gt;

&lt;p&gt;This is not about Go being "more secure" than Python as a language. It's about the deployment model. A compiled binary distributed as a single artifact has a fundamentally smaller attack surface than a package installed via a package manager with a transitive dependency tree and runtime hook mechanisms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side-by-Side Feature Comparison
&lt;/h2&gt;

&lt;p&gt;Here's an honest look at both gateways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install&lt;/code&gt;, Docker&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;npx&lt;/code&gt;, Docker, Go binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100+ providers&lt;/td&gt;
&lt;td&gt;20+ providers (OpenAI, Anthropic, Bedrock, Azure, Gemini, Vertex AI, Groq, Mistral, Cohere, xAI, and more) + custom providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead per request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies (Python GIL limits)&lt;/td&gt;
&lt;td&gt;5,000 RPS sustained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis-based key-value&lt;/td&gt;
&lt;td&gt;Weaviate-powered dual-layer semantic caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Budget management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic spend tracking&lt;/td&gt;
&lt;td&gt;Four-tier hierarchy (Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Full MCP gateway with four connection types, sub-3ms latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Web UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dashboard available&lt;/td&gt;
&lt;td&gt;Built-in Web UI for visual setup, monitoring, and governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (drop-in replacement, single URL change)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supply chain surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PyPI + transitive deps + .pth hooks&lt;/td&gt;
&lt;td&gt;Single compiled binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Config files, environment variables&lt;/td&gt;
&lt;td&gt;Zero-config start, Web UI, API, or config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let me be upfront: LiteLLM's provider count is significantly higher. If you need access to 100+ providers through a single gateway, that is a real advantage. Bifrost supports 20+ providers natively with the ability to add custom providers, which covers most production use cases, but it is not the same breadth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Deep Dive: What the Numbers Actually Mean
&lt;/h2&gt;

&lt;p&gt;You'll see "11 microseconds vs 8 milliseconds" in Bifrost's benchmarks. That's roughly a 700x difference. But what does it mean in practice?&lt;/p&gt;

&lt;p&gt;Let's do the math at different scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At 10,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 10,000 x 8ms = 80 seconds of cumulative gateway latency&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 10,000 x 11 microseconds = 0.11 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At 100,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 100,000 x 8ms = 800 seconds (~13.3 minutes)&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 100,000 x 11 microseconds = 1.1 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At 1,000,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 1,000,000 x 8ms = 8,000 seconds (~2.2 hours)&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 1,000,000 x 11 microseconds = 11 seconds&lt;/li&gt;
&lt;/ul&gt;
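&lt;p&gt;The arithmetic above is just a linear multiply, which you can sanity-check yourself:&lt;/p&gt;

```python
LITELLM_OVERHEAD_S = 8e-3   # ~8 ms per request
BIFROST_OVERHEAD_S = 11e-6  # ~11 microseconds per request

def cumulative(requests_per_day, per_request_overhead_s):
    """Total gateway overhead accumulated across a day's requests, in seconds."""
    return requests_per_day * per_request_overhead_s

for rpd in (10_000, 100_000, 1_000_000):
    print(f"{rpd:>9} req/day: {cumulative(rpd, LITELLM_OVERHEAD_S):8.1f} s vs "
          f"{cumulative(rpd, BIFROST_OVERHEAD_S):.2f} s")
```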

&lt;p&gt;At low volume, the difference doesn't matter much. Your LLM provider's response time (hundreds of milliseconds to seconds) dwarfs the gateway overhead either way.&lt;/p&gt;

&lt;p&gt;But at scale, the difference becomes real. 13 minutes of cumulative latency at 100K requests/day isn't catastrophic, but it adds up across your user base. And 2.2 hours at a million requests/day starts affecting tail latencies and user experience, especially for streaming responses where gateway overhead is felt on every chunk.&lt;/p&gt;

&lt;p&gt;The 5,000 RPS sustained throughput from Bifrost also matters. Python's GIL (Global Interpreter Lock) creates a concurrency ceiling that Go simply doesn't have. If you're running high-concurrency workloads, this is a material difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Here's What This Means for Your Stack
&lt;/h2&gt;

&lt;p&gt;If you're evaluating LLM gateways right now, the LiteLLM incident should change your evaluation criteria. Not because LiteLLM is bad software, but because it highlighted a category of risk that most teams weren't thinking about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions to ask about any LLM gateway:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What's the dependency footprint?&lt;/strong&gt; How many transitive dependencies does it pull in? Each one is a potential attack surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the deployment model?&lt;/strong&gt; Is it a package you install into your environment, or a standalone binary/container?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it have runtime hook mechanisms?&lt;/strong&gt; Can dependencies execute code at startup without explicit imports?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How is it distributed?&lt;/strong&gt; Via a package manager with mutable versions, or via immutable artifacts?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's in the CI/CD chain?&lt;/strong&gt; Are GitHub Actions pinned by SHA? Are publish tokens scoped and rotated?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren't questions most teams were asking about their LLM gateway a month ago. They should be now.&lt;/p&gt;




&lt;h2&gt;
  
  
  When LiteLLM Still Makes Sense
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this. There are real scenarios where LiteLLM is the better choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need access to 100+ providers.&lt;/strong&gt; LiteLLM's provider breadth is unmatched. If you're working with niche or specialized providers that Bifrost doesn't support yet, LiteLLM gets you there faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your entire stack is Python and you want deep integration.&lt;/strong&gt; LiteLLM plays well with the Python ML ecosystem. If you're already in that world and need tight integration with LangChain, LlamaIndex, or similar frameworks, LiteLLM fits naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need it as a library, not a gateway.&lt;/strong&gt; LiteLLM can be imported and used as a Python library within your application code. Bifrost is a standalone gateway service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these is your primary requirement, LiteLLM may still be right for you. Just audit your versions, pin your dependencies, and check for &lt;code&gt;.pth&lt;/code&gt; files in your &lt;code&gt;site-packages&lt;/code&gt; directory.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Bifrost Is the Better Choice
&lt;/h2&gt;

&lt;p&gt;Bifrost wins when your priorities look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security surface area matters to you.&lt;/strong&gt; If you're in a regulated industry, handle sensitive data, or simply don't want to worry about Python supply chain attacks in your infrastructure layer, a compiled Go binary is a different risk profile entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance at scale.&lt;/strong&gt; If you're pushing high request volumes and need minimal gateway overhead, 11 microseconds vs 8 milliseconds is not a rounding error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want governance out of the box.&lt;/strong&gt; Bifrost's four-tier budget hierarchy (Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config) with independent budget checking at each level gives you cost control that's built into the gateway, not bolted on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching.&lt;/strong&gt; Bifrost's Weaviate-powered dual-layer caching understands the meaning of requests, not just exact matches. Similar queries hit the cache even if they're worded differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP gateway support.&lt;/strong&gt; If you're building agentic applications, Bifrost has native MCP support with four connection types and sub-3ms tool execution latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config setup.&lt;/strong&gt; Run &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; and you have a working gateway with a Web UI at &lt;code&gt;localhost:8080&lt;/code&gt;. No config files, no environment variables, no setup ceremony.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Question
&lt;/h2&gt;

&lt;p&gt;Should your LLM gateway be a Python package at all?&lt;/p&gt;

&lt;p&gt;This isn't Python-bashing. Python is great for ML research, data science, prototyping, and application-level code. But your LLM gateway sits in the critical path of every AI request your application makes. It's infrastructure.&lt;/p&gt;

&lt;p&gt;Infrastructure components have different requirements than application code. They need to be fast, stable, have minimal dependencies, and present the smallest possible attack surface. This is why web servers, databases, load balancers, and message queues are almost never written in Python. They're written in C, C++, Go, or Rust.&lt;/p&gt;

&lt;p&gt;The LiteLLM incident didn't happen because of a bug in LiteLLM's code. It happened because of a structural property of the Python packaging ecosystem. That's a different kind of risk, and it's one that applies to any Python package in your infrastructure layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Action Items
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you're currently using LiteLLM:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check your installed version immediately. Versions 1.82.7 and 1.82.8 are compromised.&lt;/li&gt;
&lt;li&gt;Search for &lt;code&gt;.pth&lt;/code&gt; files in your Python &lt;code&gt;site-packages&lt;/code&gt; directories.&lt;/li&gt;
&lt;li&gt;Rotate all credentials that were accessible from environments where LiteLLM was installed (SSH keys, cloud provider credentials, Kubernetes secrets).&lt;/li&gt;
&lt;li&gt;Pin your GitHub Actions by SHA, not by tag.&lt;/li&gt;
&lt;li&gt;Evaluate whether a compiled gateway is a better fit for your security posture.&lt;/li&gt;
&lt;/ol&gt;
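&lt;p&gt;For step 4, a quick way to audit your workflows is to flag any &lt;code&gt;uses:&lt;/code&gt; reference whose ref is not a full 40-character commit SHA. This is a rough sketch of my own: it checks ref shape only, not whether the SHA itself points at trustworthy code:&lt;/p&gt;

```python
import re

# A full 40-hex-char ref is an immutable commit SHA; tags and branches are mutable.
PINNED = re.compile(r"^[0-9a-f]{40}$")

def unpinned_actions(workflow_text):
    """Return every `uses:` reference that is not pinned to a commit SHA."""
    findings = []
    for m in re.finditer(r"uses:\s*([\w./-]+)@([\w.-]+)", workflow_text):
        action, ref = m.group(1), m.group(2)
        if not PINNED.match(ref):
            findings.append(f"{action}@{ref}")
    return findings

workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: aquasecurity/trivy-action@0123456789abcdef0123456789abcdef01234567
"""
print(unpinned_actions(workflow))  # ['actions/checkout@v4']
```

&lt;p&gt;Run it over everything in &lt;code&gt;.github/workflows/&lt;/code&gt;; anything it flags is a ref an attacker could silently repoint, which is exactly how the Trivy Action compromise reached LiteLLM's CI.&lt;/p&gt;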

&lt;p&gt;&lt;strong&gt;If you're evaluating LLM gateways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try Bifrost: &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; (takes 30 seconds)&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to see the codebase&lt;/li&gt;
&lt;li&gt;Read the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the full feature set&lt;/li&gt;
&lt;li&gt;Visit the &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;website&lt;/a&gt; for architecture details&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM gateway space is going to look different after this incident. Supply chain security just became an evaluation criterion, and compiled gateways have a structural advantage that no amount of Python dependency scanning can match.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>The LiteLLM Supply Chain Attack Broke Trust in Python-Based AI Infrastructure</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 27 Mar 2026 04:48:05 +0000</pubDate>
      <link>https://dev.to/pranay_batta/the-litellm-supply-chain-attack-broke-trust-in-python-based-ai-infrastructure-1poi</link>
      <guid>https://dev.to/pranay_batta/the-litellm-supply-chain-attack-broke-trust-in-python-based-ai-infrastructure-1poi</guid>
      <description>&lt;p&gt;If you run LiteLLM in production, you probably had a rough week.&lt;/p&gt;

&lt;p&gt;On March 24, 2026, two backdoored versions of &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;litellm&lt;/a&gt; (1.82.7 and 1.82.8) were published to PyPI using stolen credentials. The malware stole SSH keys, AWS/GCP/Azure credentials, Kubernetes secrets, cryptocurrency wallets, and deployed persistent backdoors on infected machines. It was live for about 3 hours. LiteLLM gets &lt;a href="https://pypistats.org/packages/litellm" rel="noopener noreferrer"&gt;3.4 million daily downloads&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is the full breakdown of what happened, why it matters, and what you should actually do about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened: The Full Attack Chain
&lt;/h2&gt;

&lt;p&gt;The attack didn't start with LiteLLM. It started with &lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt;, a popular container security scanner.&lt;/p&gt;

&lt;p&gt;Here's the sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A threat actor group called &lt;strong&gt;TeamPCP&lt;/strong&gt; exploited a &lt;code&gt;pull_request_target&lt;/code&gt; workflow vulnerability in Trivy's GitHub Action (&lt;a href="https://github.com/aquasecurity/trivy-action/security/advisories/GHSA-9p44-j4g5-cfx5" rel="noopener noreferrer"&gt;GHSA-9p44-j4g5-cfx5&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;They used this to exfiltrate the aqua-bot credentials and rewrite Trivy v0.69.4 release tags to point to malicious payloads&lt;/li&gt;
&lt;li&gt;On March 23, they also compromised the &lt;a href="https://github.com/Checkmarx/kics-github-action" rel="noopener noreferrer"&gt;Checkmarx KICS GitHub Action&lt;/a&gt; using similar techniques&lt;/li&gt;
&lt;li&gt;LiteLLM's CI/CD pipeline pulled the Trivy action &lt;strong&gt;without pinning to a specific version or commit SHA&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The malicious Trivy action exfiltrated LiteLLM's &lt;code&gt;PYPI_PUBLISH&lt;/code&gt; token from the GitHub Actions runner&lt;/li&gt;
&lt;li&gt;Using the stolen token, TeamPCP published two backdoored versions to PyPI with &lt;strong&gt;legitimate credentials&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full timeline is documented in &lt;a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/" rel="noopener noreferrer"&gt;Snyk's breakdown&lt;/a&gt; and &lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;Wiz's attribution analysis&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Malware: Three Stages of Damage
&lt;/h2&gt;

&lt;p&gt;This was not a simple credential stealer. It was a three-stage payload designed for maximum extraction and persistence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Information Collection
&lt;/h3&gt;

&lt;p&gt;The malware harvested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System data (hostname, environment variables, processes, network routing, auth logs)&lt;/li&gt;
&lt;li&gt;SSH private keys, &lt;code&gt;.env&lt;/code&gt; files, Git credentials&lt;/li&gt;
&lt;li&gt;API keys from shell history&lt;/li&gt;
&lt;li&gt;Slack and Discord webhook tokens&lt;/li&gt;
&lt;li&gt;Jenkins, Travis CI, and Terraform configs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/shadow&lt;/code&gt; password hashes&lt;/li&gt;
&lt;li&gt;Full AWS credential chain including IMDSv2, Secrets Manager, and SSM Parameter Store&lt;/li&gt;
&lt;li&gt;GCP and Azure service account files&lt;/li&gt;
&lt;li&gt;Docker &lt;code&gt;config.json&lt;/code&gt; and Kubernetes kubeconfig, service account tokens, secrets, and certificates&lt;/li&gt;
&lt;li&gt;Cryptocurrency wallet files (Bitcoin, Ethereum, Solana, Cardano, Monero, Ledger hardware wallet configs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stage 2: Encryption and Exfiltration
&lt;/h3&gt;

&lt;p&gt;Everything was bundled, encrypted with AES-256-CBC, the session key wrapped with a hardcoded 4096-bit RSA public key, and exfiltrated to &lt;code&gt;models.litellm.cloud&lt;/code&gt;, a domain registered just one day before the attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Persistence and Lateral Movement
&lt;/h3&gt;

&lt;p&gt;The malware installed a persistent backdoor at &lt;code&gt;~/.config/sysmon/sysmon.py&lt;/code&gt; with a systemd service polling &lt;code&gt;checkmarx.zone&lt;/code&gt; every 5 minutes for new payloads. In Kubernetes environments, it read all secrets across every namespace and deployed &lt;strong&gt;privileged pods&lt;/strong&gt; to every &lt;code&gt;kube-system&lt;/code&gt; node, mounting the host filesystem.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/" rel="noopener noreferrer"&gt;Snyk's full technical analysis&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Delivery Mechanisms (This Is the Scary Part)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Version 1.82.7&lt;/strong&gt; embedded a Base64-encoded payload directly in &lt;code&gt;litellm/proxy/proxy_server.py&lt;/code&gt;. It executed whenever anything imported &lt;code&gt;litellm.proxy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 1.82.8&lt;/strong&gt; added something worse: a file called &lt;code&gt;litellm_init.pth&lt;/code&gt; placed in &lt;code&gt;site-packages/&lt;/code&gt;. Python's &lt;code&gt;.pth&lt;/code&gt; mechanism fires on &lt;strong&gt;every interpreter startup&lt;/strong&gt;. That means the malware ran when you ran &lt;code&gt;pip install&lt;/code&gt;, when your IDE started a language server, when you ran &lt;code&gt;python -c "print('hello')"&lt;/code&gt;. It had nothing to do with importing litellm.&lt;/p&gt;

&lt;p&gt;This is &lt;a href="https://attack.mitre.org/techniques/T1546/018/" rel="noopener noreferrer"&gt;MITRE ATT&amp;amp;CK T1546.018&lt;/a&gt; (Python Startup Hooks). The &lt;code&gt;.pth&lt;/code&gt; file was correctly declared in the wheel's RECORD file, so &lt;code&gt;pip install --require-hashes&lt;/code&gt; would have passed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Was Affected
&lt;/h2&gt;

&lt;p&gt;The compromised versions were pulled into multiple major projects:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;PR Merged&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/stanfordnlp/dspy/pull/9498" rel="noopener noreferrer"&gt;#9498&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/mlflow/mlflow" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/mlflow/mlflow/pull/21971" rel="noopener noreferrer"&gt;#21971&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/All-Hands-AI/OpenHands" rel="noopener noreferrer"&gt;OpenHands&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/All-Hands-AI/OpenHands/pull/13569" rel="noopener noreferrer"&gt;#13569&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/crewAIInc/crewAI" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/crewAIInc/crewAI/pull/5040" rel="noopener noreferrer"&gt;#5040&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Arize-ai/phoenix" rel="noopener noreferrer"&gt;Arize Phoenix&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arize-ai/phoenix/pull/12342" rel="noopener noreferrer"&gt;#12342&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/paul-gauthier/aider" rel="noopener noreferrer"&gt;Aider&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Safe (pinned litellm==1.82.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aider survived because it &lt;strong&gt;pinned its dependency version&lt;/strong&gt;. That one decision made the difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Standard Defenses Failed
&lt;/h2&gt;

&lt;p&gt;This attack bypassed almost every standard protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hash verification passed&lt;/strong&gt; because the packages were published with legitimate stolen credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No typosquatting&lt;/strong&gt; to detect. The package name was exactly &lt;code&gt;litellm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No suspicious domains&lt;/strong&gt; at install time. The exfiltration domain was &lt;code&gt;models.litellm.cloud&lt;/code&gt;, which looks legitimate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pip install --require-hashes&lt;/code&gt;&lt;/strong&gt; would have passed because the &lt;code&gt;.pth&lt;/code&gt; file was correctly declared in the wheel's RECORD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only install-time defense that would have caught this is inspecting whether packages install &lt;code&gt;.pth&lt;/code&gt; files containing &lt;code&gt;subprocess&lt;/code&gt;, &lt;code&gt;base64&lt;/code&gt;, or &lt;code&gt;exec&lt;/code&gt; patterns. No widely deployed pip plugin does this automatically today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bot Suppression Campaign
&lt;/h2&gt;

&lt;p&gt;When researcher &lt;a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/" rel="noopener noreferrer"&gt;Callum McMahon at FutureSearch&lt;/a&gt; reported the compromise in &lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;GitHub issue #24512&lt;/a&gt;, TeamPCP used &lt;strong&gt;73 previously compromised GitHub accounts&lt;/strong&gt; to post 88 spam comments in 102 seconds, then closed the issue as "not planned" using the compromised maintainer account &lt;code&gt;krrishdholakia&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;76% of these accounts overlapped with the botnet used during the Trivy disclosure. This is documented in &lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;Wiz's analysis&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Structural Problem Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;LiteLLM is a Python package. It sits between your application and your LLM providers. It holds API keys for OpenAI, Anthropic, AWS Bedrock, Google Vertex, and whatever else you route through it. It is, by design, a high-value target.&lt;/p&gt;

&lt;p&gt;And it runs in a language where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any package can install &lt;code&gt;.pth&lt;/code&gt; files that execute on interpreter startup&lt;/li&gt;
&lt;li&gt;Transitive dependencies pull in dozens of packages you never explicitly chose&lt;/li&gt;
&lt;li&gt;A single compromised CI/CD token can publish arbitrary code under a trusted package name&lt;/li&gt;
&lt;li&gt;The GIL means you need to run multiple processes, each of which triggers &lt;code&gt;.pth&lt;/code&gt; execution independently&lt;/li&gt;
&lt;/ul&gt;
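&lt;p&gt;The &lt;code&gt;.pth&lt;/code&gt; mechanism is easy to demonstrate safely. The sketch below uses a throwaway directory and &lt;code&gt;site.addsitedir&lt;/code&gt; (the same routine &lt;code&gt;site.py&lt;/code&gt; applies to &lt;code&gt;site-packages&lt;/code&gt; at interpreter startup) rather than touching a real install:&lt;/p&gt;

```python
import os
import site
import tempfile

# Demonstrate .pth execution in a throwaway directory,
# without touching the real site-packages.
demo_dir = tempfile.mkdtemp()
pth_path = os.path.join(demo_dir, "demo.pth")

# Any line in a .pth file that starts with "import " is exec'd
# by site.py when the directory is processed.
with open(pth_path, "w") as f:
    f.write('import os; os.environ["PTH_DEMO"] = "executed"\n')

site.addsitedir(demo_dir)  # what site.py does for site-packages at startup
print(os.environ.get("PTH_DEMO"))  # prints: executed
```

No code in the "package" was ever imported; merely processing the directory ran the line. That is the entire trick the backdoor relied on.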

&lt;p&gt;This is not a criticism of Python as a language. Python excels at rapid iteration and application development. But the question is whether Python is the right choice for &lt;strong&gt;infrastructure that holds your most sensitive credentials and sits in the critical path of every LLM request&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Compiled languages like Go and Rust produce single binaries with no runtime dependency chain. There is no &lt;code&gt;site-packages&lt;/code&gt; directory. There is no &lt;code&gt;.pth&lt;/code&gt; execution mechanism. There is no &lt;code&gt;pip install&lt;/code&gt; side effect. The attack surface is fundamentally smaller.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Should Do Right Now
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Check Your LiteLLM Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip show litellm | &lt;span class="nb"&gt;grep &lt;/span&gt;Version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see 1.82.7 or 1.82.8, you were affected. &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026" rel="noopener noreferrer"&gt;LiteLLM's security update&lt;/a&gt; confirms these versions were compromised.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scan for .pth Files
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="si"&gt;$(&lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import site; print(site.getsitepackages()[0])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.pth"&lt;/span&gt; &lt;span class="nt"&gt;-exec&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"subprocess&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;exec"&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Rotate Everything
&lt;/h3&gt;

&lt;p&gt;If you ran the compromised version, assume all credentials on that machine are compromised:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS/GCP/Azure keys and service accounts&lt;/li&gt;
&lt;li&gt;SSH keys&lt;/li&gt;
&lt;li&gt;API keys (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;Database passwords&lt;/li&gt;
&lt;li&gt;Kubernetes service account tokens&lt;/li&gt;
&lt;li&gt;CI/CD tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Pin Your Dependencies
&lt;/h3&gt;

&lt;p&gt;Aider survived because it pinned &lt;code&gt;litellm==1.82.3&lt;/code&gt;. Pin your versions. Better yet, pin by hash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;litellm==1.82.6 --hash=sha256:&amp;lt;known-good-hash&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
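&lt;p&gt;To get a known-good hash in the first place, run &lt;code&gt;pip hash&lt;/code&gt; on a wheel you have downloaded and verified, or compute the sha256 yourself. A minimal equivalent in Python:&lt;/p&gt;

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Compute the sha256 digest pip compares against in --require-hashes mode."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: point this at a wheel you downloaded and verified, e.g.
#   sha256_of("litellm-1.82.6-py3-none-any.whl")
# then pin the resulting digest in requirements.txt.
```

Note that hash pinning only freezes the artifact; as the RECORD point above shows, it does not tell you whether that artifact was clean when you first pinned it.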



&lt;h3&gt;
  
  
  5. Pin Your CI/CD Actions by SHA
&lt;/h3&gt;

&lt;p&gt;Don't do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@&amp;lt;full-commit-sha&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Evaluate Your Gateway Architecture
&lt;/h3&gt;

&lt;p&gt;This is the harder conversation. If your LLM gateway is a Python package that you &lt;code&gt;pip install&lt;/code&gt;, it shares the same supply chain as every other Python package on your system. Every transitive dependency is a potential attack vector.&lt;/p&gt;

&lt;p&gt;Alternatives worth evaluating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;&lt;/strong&gt;: Open-source LLM gateway written in Go. Single compiled binary, 11 microsecond overhead at 5,000 RPS. No Python supply chain surface area. Supports OpenAI, Anthropic, Bedrock, Vertex, and 20+ providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/tensorzero/tensorzero" rel="noopener noreferrer"&gt;TensorZero&lt;/a&gt;&lt;/strong&gt;: Rust-based LLM gateway with sub-millisecond overhead. Similar compiled-binary benefits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://developers.cloudflare.com/ai-gateway/" rel="noopener noreferrer"&gt;Cloudflare AI Gateway&lt;/a&gt;&lt;/strong&gt;: Managed edge service. No self-hosted dependency chain at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct provider SDKs&lt;/strong&gt;: If you only use one or two providers, you may not need a gateway at all. The official OpenAI and Anthropic SDKs are smaller attack surfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends on your scale, provider count, and security requirements. But "keep using the Python package that just got backdoored" should not be the default.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;TeamPCP is not done. They also deployed &lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;CanisterWorm&lt;/a&gt;, using the Internet Computer Protocol as a C2 channel. They used an AI agent called &lt;strong&gt;openclaw&lt;/strong&gt; for automated attack targeting. Their target selection focuses on tools with elevated pipeline access: container scanners, infrastructure scanning tools, AI routing libraries.&lt;/p&gt;

&lt;p&gt;LLM gateways are a perfect target. They hold credentials for multiple providers. They run in CI/CD environments. They have broad read access by design.&lt;/p&gt;

&lt;p&gt;The question is not whether this will happen again. It is whether your infrastructure is designed to limit the blast radius when it does.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/" rel="noopener noreferrer"&gt;Snyk: How a Poisoned Security Scanner Became the Key to Backdooring LiteLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/" rel="noopener noreferrer"&gt;FutureSearch: Supply Chain Attack in litellm 1.82.8 on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;Wiz: TeamPCP Trojanizes LiteLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/blog/security-update-march-2026" rel="noopener noreferrer"&gt;LiteLLM Security Update: March 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;GitHub Issue #24512&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theregister.com/2026/03/24/trivy_compromise_litellm/" rel="noopener noreferrer"&gt;The Register: LiteLLM infected via Trivy compromise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/" rel="noopener noreferrer"&gt;Microsoft: Detecting the Trivy Supply Chain Compromise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaspersky.com/blog/critical-supply-chain-attack-trivy-litellm-checkmarx-teampcp/55510/" rel="noopener noreferrer"&gt;Kaspersky: Trojanization of Trivy, Checkmarx, and LiteLLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Tags: litellm, supply-chain-attack, security, python, llm-gateway, ai-infrastructure, devops, cybersecurity&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>security</category>
    </item>
    <item>
      <title>How to Set Up Weighted Load Balancing Across LLM Providers</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:25:13 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-set-up-weighted-load-balancing-across-llm-providers-21p7</link>
      <guid>https://dev.to/pranay_batta/how-to-set-up-weighted-load-balancing-across-llm-providers-21p7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Running all your LLM traffic through a single provider is a single point of failure. Weighted load balancing lets you split traffic across providers (say 70/30 GPT-4o/Claude), optimize for cost or latency per use case, and failover automatically when one provider goes down. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; handles this at the gateway layer with 11 microsecond overhead. Here is how to set it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Single-Provider Is a Bad Idea
&lt;/h2&gt;

&lt;p&gt;You have probably been here. Your entire app routes through OpenAI. One day, OpenAI hits capacity. Your 429 retry logic kicks in, but the retries also get 429'd. Your app is effectively down, and your only option is to wait.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. Every major LLM provider has had multi-hour outages in the last 12 months. If your architecture assumes 100% availability from a single provider, your architecture is wrong.&lt;/p&gt;

&lt;p&gt;The fix is not "add retry logic." The fix is routing traffic across multiple providers at the gateway layer, with weights you control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Weighted Load Balancing Actually Means
&lt;/h2&gt;

&lt;p&gt;Instead of sending every request to one provider, you define a split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70% of requests go to OpenAI (GPT-4o)&lt;/li&gt;
&lt;li&gt;30% go to Anthropic (Claude Sonnet)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or maybe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50% to Gemini (cheaper for simple tasks)&lt;/li&gt;
&lt;li&gt;50% to Anthropic (better for complex reasoning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gateway makes the routing decision per-request based on these weights. Your application code does not change. It sends requests to one endpoint (the gateway), and the gateway distributes them.&lt;/p&gt;

&lt;p&gt;The key benefit: &lt;strong&gt;you change the weights in a config file, not in your application code.&lt;/strong&gt; No redeploy. No code review. Just update the config and the traffic split changes immediately.&lt;/p&gt;
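&lt;p&gt;Conceptually, a weighted pick is just a biased random choice per request. A minimal sketch of the idea (not Bifrost's actual implementation):&lt;/p&gt;

```python
import random

providers = [
    {"id": "openai-primary", "weight": 70},
    {"id": "anthropic-secondary", "weight": 30},
]

def pick_provider(accounts):
    """Choose one account per request, biased by its configured weight."""
    weights = [a["weight"] for a in accounts]
    return random.choices(accounts, weights=weights, k=1)[0]

# Over many requests the split converges to roughly 70/30.
counts = {"openai-primary": 0, "anthropic-secondary": 0}
for _ in range(10_000):
    counts[pick_provider(providers)["id"]] += 1
print(counts)
```

Because the decision is per-request and stateless, changing the weights takes effect immediately with no draining or redeploys.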

&lt;h2&gt;
  
  
  Setting This Up With Bifrost
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source LLM gateway written in Go. It sits between your app and LLM providers as a reverse proxy, adding 11 microseconds of overhead per request.&lt;/p&gt;

&lt;p&gt;Here is a weighted routing config that splits traffic between OpenAI and Anthropic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accounts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${OPENAI_API_KEY}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-secondary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${ANTHROPIC_API_KEY}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallback_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_status_codes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;70% of incoming requests route to OpenAI, 30% to Anthropic&lt;/li&gt;
&lt;li&gt;If OpenAI returns a 429 or 5xx, the request automatically fails over to Anthropic&lt;/li&gt;
&lt;li&gt;If Anthropic is also down, the request returns an error to the client&lt;/li&gt;
&lt;li&gt;All of this happens at the gateway. Your app sends requests to &lt;code&gt;http://localhost:8080/v1/chat/completions&lt;/code&gt; and does not know which provider handled it&lt;/li&gt;
&lt;/ol&gt;
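&lt;p&gt;The routing-plus-failover behavior boils down to "walk the candidates in order, skip on retryable status codes." A sketch of that logic, with a hypothetical &lt;code&gt;call_provider&lt;/code&gt; function standing in for the real HTTP call:&lt;/p&gt;

```python
RETRYABLE = {429, 500, 502, 503}  # mirrors on_status_codes in the config above

def route_with_fallback(request, providers, call_provider):
    """Try each provider in order; fail over on retryable status codes.

    call_provider(provider, request) is assumed to return (status_code, body).
    """
    last_status = None
    for provider in providers:
        status, body = call_provider(provider, request)
        if status not in RETRYABLE:
            return provider, status, body
        last_status = status
    raise RuntimeError(f"all providers failed, last status {last_status}")
```

The weighted pick decides who goes first; this loop decides who gets the request when the first choice is struggling.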

&lt;p&gt;Start the gateway with zero config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure providers through the web dashboard at &lt;code&gt;localhost:8080&lt;/code&gt;. No YAML, no environment variable chains. JSON config, web UI, done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Load Balancing Strategies That Actually Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cost Optimization Split
&lt;/h3&gt;

&lt;p&gt;Route simple tasks to the cheapest provider, complex tasks to the most capable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accounts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-flash-cheap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-flash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-capable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini Flash handles the bulk of requests at lower cost. GPT-4o handles the rest. You can check the per-model pricing on the &lt;a href="https://www.getmaxim.ai/bifrost/model-library" rel="noopener noreferrer"&gt;model library&lt;/a&gt; to find the right split for your use case. If you want exact cost numbers before committing, the &lt;a href="https://www.getmaxim.ai/bifrost/llm-cost-calculator" rel="noopener noreferrer"&gt;LLM cost calculator&lt;/a&gt; lets you compare across providers for your specific token volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency-Optimized Split
&lt;/h3&gt;

&lt;p&gt;If your app is latency-sensitive (chatbots, real-time agents), route more traffic to the fastest provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accounts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq-fast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-3.3-70b-versatile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Groq is extremely fast for Llama models. Anthropic gives you better reasoning quality. 50/50 split means half your users get near-instant responses, half get higher quality. You tune the ratio based on what your users actually need.&lt;/p&gt;

&lt;p&gt;For raw numbers on how much overhead each gateway adds, check the &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmarks page&lt;/a&gt;. Bifrost adds 11 microseconds. That is not a typo. Microseconds, not milliseconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Reliability-First Split
&lt;/h3&gt;

&lt;p&gt;For production apps where uptime matters more than cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accounts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-secondary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-tertiary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallback_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_status_codes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three providers. If any one goes down, traffic redistributes to the other two. The probability of all three being down simultaneously is near zero. This is the setup we recommend for anything that cannot afford downtime. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/your-primary-llm-provider-failed-enable-automatic-fallback-with-bifrost/" rel="noopener noreferrer"&gt;automatic failover guide&lt;/a&gt; walks through the full failover config in detail.&lt;/p&gt;
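&lt;p&gt;Back-of-envelope, assuming provider failures are independent: at 99.5% uptime each, the chance of all three being down at once is 0.005&lt;sup&gt;3&lt;/sup&gt;, about one in eight million.&lt;/p&gt;

```python
def combined_availability(per_provider_uptime, n_providers):
    """Probability that at least one of n independent providers is up."""
    all_down = (1 - per_provider_uptime) ** n_providers
    return 1 - all_down

# Three providers at 99.5% uptime each:
print(combined_availability(0.995, 3))  # roughly 0.999999875
```

Real outages are not perfectly independent (shared cloud regions, shared upstreams), so treat this as an upper bound, but the direction of the math holds.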

&lt;h2&gt;
  
  
  Adding Budget Controls to the Mix
&lt;/h2&gt;

&lt;p&gt;Weighted routing becomes more powerful when combined with budget controls. You do not want one runaway team to blow through your entire OpenAI quota and leave other teams with nothing.&lt;/p&gt;

&lt;p&gt;Bifrost has a four-tier budget hierarchy: Organization &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider. Each level can have daily, weekly, or monthly caps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"virtual_keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-backend"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"monthly_limit_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"daily_limit_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rate_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request_max_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request_reset_duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1h"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits their budget, requests can either fail (hard stop) or automatically route to a cheaper model (soft failover). This is configurable per virtual key.&lt;/p&gt;
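&lt;p&gt;The hard-stop versus soft-failover decision can be sketched in a few lines (illustrative only; the function and parameter names here are made up, not Bifrost's API):&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    pass

def enforce_budget(spend_usd, request_cost_usd, limit_usd,
                   mode="hard", cheap_model="gemini-2.5-flash"):
    """Decide what happens when a virtual key's spend would exceed its cap.

    mode="hard": reject the request outright.
    mode="soft": route it to a cheaper model instead.
    """
    if spend_usd + request_cost_usd <= limit_usd:
        return "proceed"
    if mode == "hard":
        raise BudgetExceeded("virtual key over budget")
    return f"reroute:{cheap_model}"
```

Hard stops suit billing safety nets; soft failover suits user-facing apps where a degraded answer beats no answer.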

&lt;p&gt;If you are running &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; across a dev team, this is especially useful. Each developer gets a virtual key with their own budget. No one developer can burn through the team's allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Provider-Isolation Architecture
&lt;/h2&gt;

&lt;p&gt;Here is something most gateways get wrong: they use a single request queue for all providers. When OpenAI starts rate limiting you and requests back up, the queue fills. Now your Anthropic and Gemini requests are also stuck behind the OpenAI backlog.&lt;/p&gt;

&lt;p&gt;Bifrost uses provider-isolated worker pools. Each provider gets its own queue. OpenAI being slow does not affect Anthropic latency at all. This matters a lot under weighted load balancing, because the whole point is that you have multiple providers and they should operate independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App → Bifrost Gateway → [OpenAI Pool]    → OpenAI
                           → [Anthropic Pool]  → Anthropic
                           → [Gemini Pool]     → Gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Backpressure policies are configurable per provider: &lt;code&gt;drop&lt;/code&gt; (discard), &lt;code&gt;block&lt;/code&gt; (wait), or &lt;code&gt;error&lt;/code&gt; (fail fast). So if OpenAI's pool fills up, you can choose to fail fast and let the failover logic route to the next provider, instead of waiting.&lt;/p&gt;
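&lt;p&gt;The pool-plus-policy behavior is easy to picture with a small sketch. This is illustrative Python, not Bifrost's actual implementation (Bifrost is written in Go), but it shows how the three policies diverge once a provider's queue is full:&lt;/p&gt;

```python
import queue

class ProviderPool:
    """A bounded, per-provider request queue with a configurable backpressure policy."""

    def __init__(self, name, capacity, policy="block"):
        self.name = name
        self.policy = policy  # "drop", "block", or "error"
        self.queue = queue.Queue(maxsize=capacity)

    def submit(self, request):
        if self.policy == "block":
            self.queue.put(request)  # wait until a slot frees up
            return True
        try:
            self.queue.put_nowait(request)
            return True
        except queue.Full:
            if self.policy == "drop":
                return False  # silently discard the request
            # "error": fail fast so failover logic can try the next provider
            raise RuntimeError(f"{self.name} pool full")
```

&lt;p&gt;Because each provider owns its own queue, an OpenAI backlog never delays a submit to the Anthropic pool.&lt;/p&gt;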

&lt;h2&gt;
  
  
  Setting Up With Your Stack
&lt;/h2&gt;

&lt;p&gt;Bifrost is provider-agnostic on the client side. It exposes an OpenAI-compatible API, so any client that speaks OpenAI format works out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (OpenAI SDK):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bifrost-virtual-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
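&lt;p&gt;From there, requests go through the standard SDK call. The &lt;code&gt;provider/model&lt;/code&gt; naming below is an assumption about how models are addressed through the gateway, so confirm the exact model IDs against your Bifrost config:&lt;/p&gt;

```python
def ask(client, prompt):
    # Route a chat completion through the gateway; the model string is
    # illustrative (assumed "provider/model" form), not a confirmed ID.
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```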



&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="c"&gt;# Now Claude Code routes through Bifrost to any configured provider&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Any OpenAI-compatible client:&lt;/strong&gt;&lt;br&gt;
Just change the base URL to point at your Bifrost instance. That is it. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/access-gpt-gemini-claude-mistral-etc-through-1-gateway-configure-providers-in-bifrost/" rel="noopener noreferrer"&gt;multi-provider setup guide&lt;/a&gt; covers the full configuration for all 19 supported providers.&lt;/p&gt;
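&lt;p&gt;To give a feel for the weighted split, here is a hypothetical shape for the provider config. The field names are illustrative only; the setup guide above has the real schema:&lt;/p&gt;

```json
{
  "providers": [
    { "name": "openai",    "weight": 0.6 },
    { "name": "anthropic", "weight": 0.3 },
    { "name": "gemini",    "weight": 0.1 }
  ]
}
```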

&lt;p&gt;If you are using tools like &lt;a href="https://www.getmaxim.ai/bifrost/blog/integrating-zed-editor-with-bifrost-gateway/" rel="noopener noreferrer"&gt;Zed editor&lt;/a&gt;, &lt;a href="https://www.getmaxim.ai/bifrost/blog/integrating-librechat-with-bifrost-gateway/" rel="noopener noreferrer"&gt;LibreChat&lt;/a&gt;, or &lt;a href="https://www.getmaxim.ai/bifrost/blog/integrating-gemini-cli-with-bifrost-gateway/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, there are specific integration guides for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Not to Use Weighted Load Balancing
&lt;/h2&gt;

&lt;p&gt;Being honest here. Weighted routing adds complexity. If you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a prototype or hobby project: just use one provider directly&lt;/li&gt;
&lt;li&gt;Under 100 requests per day: the reliability benefit is not worth the setup&lt;/li&gt;
&lt;li&gt;Using only one model for a very specific task: routing adds no value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted load balancing makes sense when you are in production, handling real traffic, and need either cost optimization, reliability guarantees, or both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Bifrost (zero config)&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# Open dashboard&lt;/span&gt;
open http://localhost:8080

&lt;span class="c"&gt;# Configure providers and weights through the UI&lt;/span&gt;
&lt;span class="c"&gt;# Or edit the JSON config directly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;19 providers supported out of the box. You can compare available models and pricing on the &lt;a href="https://www.getmaxim.ai/bifrost/model-library" rel="noopener noreferrer"&gt;model library&lt;/a&gt; before deciding your split. If you are evaluating gateways in general, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;buyer's guide&lt;/a&gt; covers what to look for.&lt;/p&gt;

&lt;p&gt;Source code: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We maintain Bifrost at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt;. It is open-source, MIT licensed, and free to self-host. If you run into issues, open an issue on &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or check the docs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
