<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay Batta</title>
    <description>The latest articles on DEV Community by Pranay Batta (@pranay_batta).</description>
    <link>https://dev.to/pranay_batta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3652594%2F9d2926ca-eede-4542-b782-4feb2ced66f1.jpg</url>
      <title>DEV Community: Pranay Batta</title>
      <link>https://dev.to/pranay_batta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranay_batta"/>
    <language>en</language>
    <item>
      <title>Migrating from LiteLLM to Bifrost: A Step-by-Step Guide</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 05 May 2026 09:46:46 +0000</pubDate>
      <link>https://dev.to/pranay_batta/migrating-from-litellm-to-bifrost-a-step-by-step-guide-don</link>
      <guid>https://dev.to/pranay_batta/migrating-from-litellm-to-bifrost-a-step-by-step-guide-don</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I migrated a production LLM workload from LiteLLM to Bifrost and the swap took about 30 minutes for the gateway, plus a few config translations. The OpenAI-compatible endpoint means application code did not change. This post walks through the full migration: config mapping, virtual key translation, semantic cache porting, and the gotchas I hit.&lt;/p&gt;

&lt;p&gt;This post assumes familiarity with LiteLLM proxy mode, OpenAI-compatible APIs, and basic Docker or Node.js operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Teams Are Looking at the Migration
&lt;/h2&gt;

&lt;p&gt;LiteLLM is the most widely adopted open-source LLM gateway and covers the breadth of providers well. The reasons I see teams move to Bifrost usually come down to one of three:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead.&lt;/strong&gt; LiteLLM proxy adds roughly 8 milliseconds per request. Bifrost adds about 11 microseconds at P99 at 5k RPS, orders of magnitude lower. For high-throughput agent workloads that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP support.&lt;/strong&gt; LiteLLM does not have MCP gateway functionality. If you are running Claude Code or building agentic workflows that hit dozens of tool servers, that gap shows up fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-layer semantic caching.&lt;/strong&gt; Bifrost ships exact match plus vector similarity caching with Weaviate, Redis, or Qdrant. LiteLLM has request-level caching but not the same dual-layer model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will walk through the migration assuming you have LiteLLM running today and want to switch without rewriting application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Run Bifrost Side by Side
&lt;/h2&gt;

&lt;p&gt;Before touching any application config, get Bifrost running next to LiteLLM. Bifrost defaults to port 8080, which does not collide with LiteLLM's usual 4000, so I leave both defaults alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup command for local testing. For production, the Docker option works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost setup docs&lt;/a&gt; cover persistent volumes and configuration mounting if you need them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Translate Provider Configs
&lt;/h2&gt;

&lt;p&gt;LiteLLM uses a &lt;code&gt;config.yaml&lt;/code&gt; with per-model &lt;code&gt;model_list&lt;/code&gt; entries. Bifrost configures at the provider level, not per model.&lt;/p&gt;

&lt;p&gt;LiteLLM config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/OPENAI_API_KEY&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mental shift: in LiteLLM, you list every model. In Bifrost, you list providers and use &lt;code&gt;allowed_models&lt;/code&gt; to filter. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover the full schema.&lt;/p&gt;

&lt;p&gt;One thing to watch: Bifrost is deny-by-default. If you forget to add a provider, every request to that provider returns a clear error. With LiteLLM, missing model entries return 404, which is harder to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Translate Virtual Keys and Budgets
&lt;/h2&gt;

&lt;p&gt;LiteLLM virtual keys map to Bifrost virtual keys, but the budget model is different.&lt;/p&gt;

&lt;p&gt;LiteLLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer-acme&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-acme-abc&lt;/span&gt;
    &lt;span class="na"&gt;max_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;rpm_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer-acme&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk-acme-abc&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;request_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;request_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;token_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500000&lt;/span&gt;
      &lt;span class="na"&gt;token_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
    &lt;span class="na"&gt;budget_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The big addition is the four-tier budget hierarchy. Customer, Team, Virtual Key, and Provider Config limits all apply independently. A request must pass all four. Reset durations are calendar-aligned for &lt;code&gt;1d&lt;/code&gt;, &lt;code&gt;1w&lt;/code&gt;, &lt;code&gt;1M&lt;/code&gt;, &lt;code&gt;1Y&lt;/code&gt; in UTC, which matters if your billing aligns to calendar months. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget and limits docs&lt;/a&gt; cover this in detail.&lt;/p&gt;
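&lt;p&gt;To make the calendar alignment concrete, here is a small illustrative sketch (plain Python, not Bifrost code) that computes the next reset boundary for a &lt;code&gt;1d&lt;/code&gt; and a &lt;code&gt;1M&lt;/code&gt; window in UTC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

def next_reset(now: datetime, duration: str) -&amp;gt; datetime:
    """Next calendar-aligned reset boundary in UTC (illustrative)."""
    now = now.astimezone(timezone.utc)
    if duration == "1d":
        # Next UTC midnight, not "24 hours after the first request".
        return (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    if duration == "1M":
        # First day of the next calendar month, UTC midnight.
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return now.replace(year=year, month=month, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported duration: {duration}")

now = datetime(2026, 5, 5, 9, 46, tzinfo=timezone.utc)
print(next_reset(now, "1d"))   # 2026-05-06 00:00:00+00:00
print(next_reset(now, "1M"))   # 2026-06-01 00:00:00+00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;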

&lt;h2&gt;
  
  
  Step 4: Update Application Endpoints
&lt;/h2&gt;

&lt;p&gt;Both gateways expose OpenAI-compatible endpoints, so application code stays the same. Only the base URL and the key value change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (LiteLLM)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://litellm:4000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-acme-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After (Bifrost)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://bifrost:8080/openai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk-acme-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost also exposes &lt;code&gt;/anthropic&lt;/code&gt; and &lt;code&gt;/genai&lt;/code&gt; endpoints if you want to keep using the native Anthropic or Gemini SDKs. The &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement docs&lt;/a&gt; cover the full endpoint matrix.&lt;/p&gt;
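&lt;p&gt;For services on the native Anthropic SDK, the swap has the same shape; a minimal sketch, assuming the &lt;code&gt;/anthropic&lt;/code&gt; route accepts the standard Messages API (check the drop-in replacement docs for the exact path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from anthropic import Anthropic

# Point the native SDK at Bifrost instead of api.anthropic.com and pass
# the virtual key where the Anthropic API key normally goes.
client = Anthropic(
    base_url="http://bifrost:8080/anthropic",
    api_key="vk-acme-abc",
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;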

&lt;h2&gt;
  
  
  Step 5: Port Semantic Caching
&lt;/h2&gt;

&lt;p&gt;LiteLLM has request-level caching with Redis. Bifrost has dual-layer caching with vector similarity. The migration is not 1:1; you are upgrading the cache model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;semantic_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vector_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weaviate&lt;/span&gt;
  &lt;span class="na"&gt;weaviate_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${WEAVIATE_URL}&lt;/span&gt;
  &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
  &lt;span class="na"&gt;conversation_history_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request needs the &lt;code&gt;x-bf-cache-key&lt;/code&gt; header to participate in caching. In application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-cache-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching docs&lt;/a&gt; cover threshold tuning and per-request overrides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;Python-bound&lt;/td&gt;
&lt;td&gt;5,000 RPS single instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual keys&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget tiers&lt;/td&gt;
&lt;td&gt;Single&lt;/td&gt;
&lt;td&gt;Four-tier (Customer/Team/VK/Provider)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Request-level&lt;/td&gt;
&lt;td&gt;Dual-layer with vector similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major providers + custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed cloud&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (self-hosted only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Bifrost is self-hosted only. If you are on LiteLLM Cloud today, migrating means taking on infrastructure operations yourself.&lt;/p&gt;

&lt;p&gt;The provider catalog is smaller. LiteLLM supports 100+ providers out of the box. Bifrost covers the major ones (OpenAI, Anthropic, Gemini, Bedrock, Vertex, Azure OpenAI, Ollama, Together, Groq, Cohere) and lets you add custom providers, but if you depend on a niche provider, check the list before migrating.&lt;/p&gt;

&lt;p&gt;The community is smaller and the project is newer. Documentation is solid but Stack Overflow answers and community plugins are still building up.&lt;/p&gt;

&lt;p&gt;OpenRouter compatibility is broken because of a tool call streaming issue. If your stack routes through OpenRouter today, you cannot keep that path through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Application code does not change because Bifrost is OpenAI-compatible&lt;/li&gt;
&lt;li&gt;Provider configs replace LiteLLM model lists, with provider-level filtering via &lt;code&gt;allowed_models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Virtual keys translate directly, but the four-tier budget hierarchy is new&lt;/li&gt;
&lt;li&gt;Semantic caching upgrades from request-level to dual-layer with vector similarity&lt;/li&gt;
&lt;li&gt;Run side by side first, cut over by changing the base URL, and roll back instantly if needed (see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
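&lt;p&gt;One way to make that cutover and rollback a pure config change is to read the gateway URL and key from the environment instead of hardcoding them (the variable names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from openai import OpenAI

# Flip LLM_GATEWAY_BASE_URL between the LiteLLM and Bifrost values to cut
# over or roll back without touching application code.
client = OpenAI(
    base_url=os.environ.get("LLM_GATEWAY_BASE_URL", "http://litellm:4000/v1"),
    api_key=os.environ["LLM_GATEWAY_API_KEY"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;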

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Website: &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;https://getmax.im/bifrost-home&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/quickstart/gateway/setting-up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/quickstart/gateway/provider-configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>gateway</category>
      <category>migration</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Rate Limiting in LLM Applications: Why You Need It and How to Build It</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 28 Apr 2026 05:58:35 +0000</pubDate>
      <link>https://dev.to/pranay_batta/rate-limiting-in-llm-applications-why-you-need-it-and-how-to-build-it-5gf4</link>
      <guid>https://dev.to/pranay_batta/rate-limiting-in-llm-applications-why-you-need-it-and-how-to-build-it-5gf4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting for LLM APIs requires counting tokens, not requests. A single 200K-token context window costs as much as 50 normal API calls. This post covers the gap between request-count limits and token-aware limits, and walks through implementation at both the application layer and the gateway layer.&lt;/p&gt;

&lt;p&gt;This post assumes familiarity with LLM APIs (OpenAI, Anthropic), basic Redis or caching concepts, and running AI applications in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Standard Rate Limiting Falls Short
&lt;/h2&gt;

&lt;p&gt;Most developers who have shipped web services know how to rate limit: count requests per user per time window, return 429 when the limit hits. That model breaks down with LLM APIs.&lt;/p&gt;

&lt;p&gt;LLM APIs charge by the token, not the request. A single API call with a 200,000-token context window costs as much as 50 calls with 4,000-token prompts. Request-count limits do nothing to prevent a single runaway call from consuming your entire daily budget.&lt;/p&gt;

&lt;p&gt;OpenAI's production limits expose this directly. Their rate limit tiers use tokens-per-minute (TPM) alongside requests-per-minute (RPM). Hitting the TPM ceiling causes 429s even when you are nowhere near the RPM limit. Building rate limiting that only tracks requests means your application hits provider limits in ways your own limits never predicted.&lt;/p&gt;

&lt;p&gt;Multi-tenant applications add another layer. A single customer running a batch job at 3am can exhaust your provider budget before the rest of your users wake up. Without per-customer limits, one heavy user affects everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Need to Limit
&lt;/h2&gt;

&lt;p&gt;Four distinct limit types matter in production LLM applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Request rate&lt;/strong&gt; — calls per minute or hour. Prevents burst abuse but does not control cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token rate&lt;/strong&gt; — tokens per minute or day. Directly correlates to cost and provider headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget cap&lt;/strong&gt; — total spend per period per customer or team. Hard stop before costs escalate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt; — limits enforced per user, per team, per customer, and per provider independently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams implement request rate first, add token rate after their first surprise invoice, and add budget caps after their second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Application-Level Implementation
&lt;/h2&gt;

&lt;p&gt;The direct approach is middleware that intercepts outgoing API calls, estimates token count before the request leaves your system, and rejects requests that would exceed the limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_token_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;estimated_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;window_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_usage:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incrby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ~4 characters per token, rough pre-call estimate
&lt;/span&gt;    &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but requires every service that makes LLM calls to implement the same logic. In a monolith, manageable. Across microservices, it becomes duplicated state tracking with consistency problems at the edges.&lt;/p&gt;
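&lt;p&gt;As a usage sketch, each service wraps its outgoing calls with the two helpers above and refuses to spend tokens once the window is exhausted (everything beyond those helpers is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

class TokenLimitExceeded(Exception):
    """Raised when a customer's token window is exhausted."""

def guarded_completion(customer_id: str, messages: list):
    # Reject before the request leaves if the estimate would push past the window.
    if not check_token_limit(customer_id, estimate_tokens(messages)):
        raise TokenLimitExceeded(customer_id)
    return client.chat.completions.create(model="gpt-4o", messages=messages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;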

&lt;h2&gt;
  
  
  Option 2: Gateway-Level Rate Limiting
&lt;/h2&gt;

&lt;p&gt;A gateway that proxies all LLM traffic enforces limits in one place. Every service routes through the gateway. The gateway handles counting, enforcement, and resets.&lt;/p&gt;

&lt;p&gt;Bifrost handles this through Virtual Keys, each scoped to a customer or team, with request and token limits defined per key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-acme"&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk-acme-abc123"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;request_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;request_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;token_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500000&lt;/span&gt;
      &lt;span class="na"&gt;token_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
    &lt;span class="na"&gt;budget_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;customer-acme&lt;/code&gt; exhausts their daily token limit, Bifrost rejects further requests for that key until the window resets. Other customers are unaffected.&lt;/p&gt;
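&lt;p&gt;On the application side it is worth handling that rejection explicitly; a minimal sketch, assuming the gateway surfaces an exhausted limit as an HTTP 429 the way the upstream providers do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="http://bifrost:8080/openai/v1", api_key="vk-acme-abc123")

def complete_with_backoff(messages, retries=3):
    # A limit rejection will not clear until the window resets, so back off
    # briefly and then surface the error instead of retrying forever.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="gpt-4o", messages=messages)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;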

&lt;p&gt;Resets are calendar-aligned for day, week, month, and year durations. A &lt;code&gt;1d&lt;/code&gt; limit resets at UTC midnight rather than 24 hours after the first request. For billing cycles that align to calendar months, this matters.&lt;/p&gt;

&lt;p&gt;LiteLLM offers comparable virtual key functionality. The primary runtime difference: LiteLLM is Python-based with roughly 8ms overhead per request. Bifrost is Go-based with 11 microseconds overhead per request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Token-aware&lt;/th&gt;
&lt;th&gt;Per-customer limits&lt;/th&gt;
&lt;th&gt;Budget cap&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redis middleware (DIY)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM proxy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (Virtual Keys)&lt;/td&gt;
&lt;td&gt;Yes (4-tier)&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kong AI Gateway&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited (OSS)&lt;/td&gt;
&lt;td&gt;~2-5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bifrost's four-tier budget hierarchy is worth noting: Customer, Team, Virtual Key, and Provider Config limits all apply independently. A request must pass all four tiers. This allows organization-wide caps alongside fine-grained per-key limits without separate enforcement logic.&lt;/p&gt;

&lt;p&gt;If a Provider Config limit is exceeded, Bifrost excludes that provider but keeps others available. Requests do not fail outright when one provider is saturated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Application-level rate limiting gives you more control over enforcement logic. You can implement business rules a gateway does not support: tiered limits based on subscription plan, grace period overrides for specific customers, or custom token counting that accounts for your system prompt overhead.&lt;/p&gt;

&lt;p&gt;Gateway-level enforcement applies regardless of which service makes the call. The trade-off is an additional network hop and a new dependency in your infrastructure.&lt;/p&gt;

&lt;p&gt;Bifrost is self-hosted only, no managed version. The project is newer than LiteLLM with a smaller community. Factor in that maturity difference when evaluating it against more established options.&lt;/p&gt;

&lt;p&gt;Token counting before a request completes is an estimate. Actual token counts, including generated output tokens, only come back in the API response. Most gateway implementations use pre-call estimates for limits and reconcile against actual usage in the response.&lt;/p&gt;
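&lt;p&gt;A common reconciliation pattern, continuing the Redis sketch from Option 1 (illustrative, not any specific gateway's implementation): count the optimistic pre-call estimate, then adjust the counter by the difference once the response reports actual usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def record_actual_usage(customer_id: str, estimated_tokens: int, response,
                        window_seconds: int = 86400) -&amp;gt; None:
    # Reconcile the pre-call estimate against what the provider reports,
    # so the counter drifts toward real consumption (incrby accepts negatives).
    actual = response.usage.total_tokens
    window_key = int(time.time() // window_seconds)
    key = f"token_usage:{customer_id}:{window_key}"
    r.incrby(key, actual - estimated_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;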

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Request-count limits do not prevent token budget overruns&lt;/li&gt;
&lt;li&gt;Multi-tenant apps need per-customer token limits, not global ones&lt;/li&gt;
&lt;li&gt;Application-level implementation works but duplicates logic across services&lt;/li&gt;
&lt;li&gt;Gateway-level enforcement centralizes limits with no per-service code changes&lt;/li&gt;
&lt;li&gt;Bifrost and LiteLLM both support virtual key rate limiting; the primary difference is runtime overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Virtual Keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Budget and Limits: &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/routing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Four Go Repositories Worth Your Attention on GitHub's Trending Page This Month</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:46:26 +0000</pubDate>
      <link>https://dev.to/pranay_batta/four-go-repositories-worth-your-attention-on-githubs-trending-page-this-month-160i</link>
      <guid>https://dev.to/pranay_batta/four-go-repositories-worth-your-attention-on-githubs-trending-page-this-month-160i</guid>
      <description>&lt;p&gt;When I want to see what developers are actually shipping, GitHub's trending page is a reliable signal. Scrolling through the &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;Go monthly trending board&lt;/a&gt; this month surfaced four projects that each deserve a closer look. All four are written in Go, and each tackles a very different problem.&lt;/p&gt;

&lt;p&gt;Here's the current monthly ranking for the Go language:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Stars This Month&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;QuantumNous/new-api&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;28,276&lt;/td&gt;
&lt;td&gt;5,970&lt;/td&gt;
&lt;td&gt;Unified AI model hub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;Wei-Shaw/sub2api&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,394&lt;/td&gt;
&lt;td&gt;6,822&lt;/td&gt;
&lt;td&gt;AI API subscription sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;steipete/wacli&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2,034&lt;/td&gt;
&lt;td&gt;1,348&lt;/td&gt;
&lt;td&gt;WhatsApp CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;maximhq/bifrost&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;4,169&lt;/td&gt;
&lt;td&gt;1,076&lt;/td&gt;
&lt;td&gt;Enterprise AI gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what each project does and why it's climbing this month.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Bifrost: An Enterprise AI Gateway Built in Go
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;maximhq/bifrost&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 4,169 | &lt;strong&gt;Forks:&lt;/strong&gt; 485 | &lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/p&gt;

&lt;p&gt;Bifrost is a high-throughput AI gateway, written in Go, that exposes a single OpenAI-compatible API fronting more than 15 LLM providers. What pulled it onto my radar were the performance figures. Per-request overhead sits at roughly &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;11 microseconds&lt;/a&gt;, and the gateway sustains 5,000 RPS. That puts it around 50x faster than Python-based alternatives like LiteLLM, a gap that teams evaluating &lt;a href="https://www.getmaxim.ai/bifrost/alternatives/litellm-alternatives" rel="noopener noreferrer"&gt;gateway options&lt;/a&gt; tend to notice quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider routing&lt;/strong&gt; covering automatic failover and weighted load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt; using a dual-layer design (exact-hash match plus semantic similarity through &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP integration&lt;/strong&gt; featuring Code Mode, which cuts token usage by as much as 92.8% at scale (&lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;benchmark source&lt;/a&gt;); teams can review the broader &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway architecture&lt;/a&gt; for details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget hierarchy&lt;/strong&gt; enforced at four levels: Customer, Team, Virtual Key, and Provider Config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config deployment&lt;/strong&gt; through &lt;code&gt;npx @maximhq/bifrost&lt;/code&gt; or Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectural decisions matter here. Choosing Go keeps latency predictable, with GC pauses that stay negligible under load. Three deployment shapes are supported: an HTTP gateway, a Go SDK, or a drop-in SDK replacement for existing OpenAI or Anthropic client libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Teams operating multiple LLM providers in production that need lightweight routing, cost controls, and observability without stacking extra latency into every call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. new-api: A Self-Hosted Model Hub
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;QuantumNous/new-api&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 28,276 | &lt;strong&gt;Forks:&lt;/strong&gt; 5,915 | &lt;strong&gt;License:&lt;/strong&gt; AGPLv3&lt;/p&gt;

&lt;p&gt;new-api tops the monthly Go board by star count. It works as a centralized gateway that aggregates multiple LLM vendors (OpenAI, Azure, Claude, Gemini, DeepSeek, Qwen, and more) and exposes them through standardized relay interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bidirectional format translation across OpenAI, Claude, and Gemini APIs&lt;/li&gt;
&lt;li&gt;Token grouping with model-level restrictions and role-based access control&lt;/li&gt;
&lt;li&gt;A dashboard for real-time usage analytics and billing&lt;/li&gt;
&lt;li&gt;Docker deployment that works against SQLite, MySQL, or PostgreSQL backends&lt;/li&gt;
&lt;li&gt;A multi-language UI covering Chinese, English, French, and Japanese&lt;/li&gt;
&lt;li&gt;Redis support for distributed deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Development velocity is high: the project has more than 5,600 commits. Streaming APIs are supported with configurable timeouts, and reasoning-model handling is built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Teams that want a self-hosted LLM proxy paired with a full admin dashboard and cross-vendor format conversion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. sub2api: Sharing AI API Subscriptions Across Users
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;Wei-Shaw/sub2api&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 14,394 | &lt;strong&gt;Forks:&lt;/strong&gt; 2,488&lt;/p&gt;

&lt;p&gt;sub2api approaches the problem from a different angle. Rather than simply proxying API requests, it is designed around pooling and sharing paid AI subscriptions (Claude, OpenAI, Gemini) behind a unified access layer that includes billing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-account management with OAuth and API key authentication&lt;/li&gt;
&lt;li&gt;API key distribution and lifecycle handling&lt;/li&gt;
&lt;li&gt;Token-level billing that calculates cost with precision&lt;/li&gt;
&lt;li&gt;Account scheduling with sticky sessions&lt;/li&gt;
&lt;li&gt;Per-user and per-account concurrency limits&lt;/li&gt;
&lt;li&gt;A built-in payment system covering Alipay, WeChat Pay, and Stripe&lt;/li&gt;
&lt;li&gt;An administrative dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack is Go 1.25 with the Gin framework and the Ent ORM on the backend, and Vue 3, Vite, and TailwindCSS on the frontend. PostgreSQL 15+ and Redis 7+ are required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Organizations that want to share paid AI subscriptions across multiple users with fine-grained billing and access policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. wacli: A WhatsApp Command-Line Interface
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;steipete/wacli&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 2,034 | &lt;strong&gt;Forks:&lt;/strong&gt; 241&lt;/p&gt;

&lt;p&gt;This one stands out from the rest. wacli is a full command-line client for WhatsApp, built on top of the &lt;a href="https://github.com/tulir/whatsmeow" rel="noopener noreferrer"&gt;whatsmeow&lt;/a&gt; library that implements the WhatsApp Web protocol. It was created by &lt;a href="https://github.com/steipete" rel="noopener noreferrer"&gt;Peter Steinberger&lt;/a&gt;, a widely known iOS developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local message history sync with continuous capture&lt;/li&gt;
&lt;li&gt;Offline search backed by SQLite with FTS5 full-text indexing&lt;/li&gt;
&lt;li&gt;Sending text, quoted replies, and files with captions&lt;/li&gt;
&lt;li&gt;Contact and group management&lt;/li&gt;
&lt;li&gt;Human-readable table output by default, with JSON available for scripting&lt;/li&gt;
&lt;li&gt;A read-only mode that prevents accidental mutations&lt;/li&gt;
&lt;li&gt;QR-code authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Installation is simple: grab it from Homebrew or build from source with &lt;code&gt;go build -tags sqlite_fts5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Developers and power users who want programmatic WhatsApp access from the terminal for automation, search, or scripting tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Go Keeps Showing Up at the Top
&lt;/h2&gt;

&lt;p&gt;That all four of these projects are written in Go is not a coincidence. The language's concurrency model, single-binary deployment, compact memory footprint, and predictable behavior under load make it a natural fit for infrastructure tooling.&lt;/p&gt;

&lt;p&gt;Nowhere is that clearer than in AI gateway workloads. When you are proxying thousands of LLM calls every second, every microsecond of added overhead compounds. Python-based options like &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; add milliseconds of latency per request; Go-based gateways such as Bifrost keep that overhead in the microsecond range, as their &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;published performance benchmarks&lt;/a&gt; document in detail.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.cncf.io/projects/" rel="noopener noreferrer"&gt;CNCF project ecosystem&lt;/a&gt; reflects the same pattern. Kubernetes, Prometheus, and much of the control-plane tooling around Envoy are written in Go, as is the majority of cloud-native infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Two clear patterns surface from this month's Go trending list. First, AI infrastructure dominates the category. Second, Go continues to be the language developers reach for when building that infrastructure. Whether the need is an enterprise AI gateway with sub-millisecond overhead, a self-hosted model hub, a subscription sharing platform, or a WhatsApp CLI, the Go ecosystem now ships a production-ready option in each category.&lt;/p&gt;

&lt;p&gt;If you want the full list, the &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;GitHub Go trending page&lt;/a&gt; is one click away.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Data sourced from &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;GitHub Trending&lt;/a&gt; as of April 22, 2026. Star counts and rankings change daily.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>ai</category>
      <category>github</category>
    </item>
    <item>
      <title>How to Cut LLM Token Spend with Semantic Caching: A Production Setup Guide</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:25:49 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-cut-llm-token-spend-with-semantic-caching-a-production-setup-guide-2o8j</link>
      <guid>https://dev.to/pranay_batta/how-to-cut-llm-token-spend-with-semantic-caching-a-production-setup-guide-2o8j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Semantic caching intercepts LLM API calls and returns cached responses for similar queries, skipping the provider entirely. Zero tokens consumed on cache hits. I set this up with Bifrost and Weaviate in under 30 minutes and it started saving tokens on the first day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Are Building
&lt;/h2&gt;

&lt;p&gt;A semantic cache layer that sits between your application and LLM providers. Every API call passes through the cache first. If the query matches a previous one (exact match or semantically similar), the cached response is returned instantly. No LLM call, no tokens billed.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/maximhq" rel="noopener noreferrer"&gt;
        maximhq
      &lt;/a&gt; / &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;
        bifrost
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support &amp;amp; &amp;lt;100 µs overhead at 5k RPS.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Bifrost AI Gateway&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://goreportcard.com/report/github.com/maximhq/bifrost/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7f7e70df9fdaaf4f485f59ca6bc0b5cbbf134d03dd5721da4e31f90f618fc304/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f6d6178696d68712f626966726f73742f636f7265" alt="Go Report Card"&gt;&lt;/a&gt;
&lt;a href="https://discord.gg/exN5KAydbU" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/282b7719f04b28f5959f5e1e17aee806d65f8eea3b862b57af350df0ab57be6f/68747470733a2f2f646362616467652e6c696d65732e70696e6b2f6170692f7365727665722f68747470733a2f2f646973636f72642e67672f65784e354b41796462553f7374796c653d666c6174" alt="Discord badge"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/maximhq/bifrost" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8bc2db302c566210d14c09b278639a3f63f07def5fc635a8869e59c996b3100f/68747470733a2f2f636f6465636f762e696f2f67682f6d6178696d68712f626966726f73742f6272616e63682f6d61696e2f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/b0899925aadfed8626116707178a4015d8cf4aaa0b80acb632cb4782c6dc7272/68747470733a2f2f696d672e736869656c64732e696f2f646f636b65722f70756c6c732f6d6178696d68712f626966726f7374"&gt;&lt;img src="https://camo.githubusercontent.com/b0899925aadfed8626116707178a4015d8cf4aaa0b80acb632cb4782c6dc7272/68747470733a2f2f696d672e736869656c64732e696f2f646f636b65722f70756c6c732f6d6178696d68712f626966726f7374" alt="Docker Pulls"&gt;&lt;/a&gt;
&lt;a href="https://app.getpostman.com/run-collection/31642484-2ba0e658-4dcd-49f4-845a-0c7ed745b916?action=collection%2Ffork&amp;amp;source=rip_markdown&amp;amp;collection-url=entityId%3D31642484-2ba0e658-4dcd-49f4-845a-0c7ed745b916%26entityType%3Dcollection%26workspaceId%3D63e853c8-9aec-477f-909c-7f02f543150e" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/82ccefddb001e2caf9d399f1153fdda561cf3da341bb270e18644d516906bc64/68747470733a2f2f72756e2e7073746d6e2e696f2f627574746f6e2e737667" alt="Run In Postman"&gt;&lt;/a&gt;
&lt;a href="https://artifacthub.io/packages/search?repo=bifrost" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/a6a3c734d6bd57fa8e1d508ac0cdba555bdbcd9191b29b32cf37a964b86b9c67/68747470733a2f2f696d672e736869656c64732e696f2f656e64706f696e743f75726c3d68747470733a2f2f61727469666163746875622e696f2f62616467652f7265706f7369746f72792f626966726f7374" alt="Artifact Hub"&gt;&lt;/a&gt;
&lt;a href="https://github.com/maximhq/bifrost/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3cb44c15a532770a066ba8e61bf11506ad5400e5c61d48f6b639101e442bee79/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6d6178696d68712f626966726f7374" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The fastest way to build AI applications that never go down&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/maximhq/bifrost/./docs/media/getting-started.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmaximhq%2Fbifrost%2FHEAD%2F.%2Fdocs%2Fmedia%2Fgetting-started.png" alt="Get started"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Go from zero to production-ready AI gateway in under a minute.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Start Bifrost Gateway&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install and run locally&lt;/span&gt;
npx -y @maximhq/bifrost

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or use Docker&lt;/span&gt;
docker run -p 8080:8080 maximhq/bifrost&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Configure via Web UI&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Open the built-in web interface&lt;/span&gt;
open http://localhost:8080&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Make your first API call&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;curl -X POST http://localhost:8080/v1/chat/completions \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;    "model": "openai/gpt-4o-mini",&lt;/span&gt;
&lt;span class="pl-s"&gt;    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;That's it!&lt;/strong&gt; Your AI gateway is running with a web interface for visual configuration…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Here is the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App -&amp;gt; Bifrost Gateway -&amp;gt; [Cache Check] -&amp;gt; Hit?  -&amp;gt; Return cached response (0 tokens)
                                        -&amp;gt; Miss? -&amp;gt; Forward to LLM provider -&amp;gt; Cache response -&amp;gt; Return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The end result: repeated and similar queries cost nothing. For workloads with common patterns (customer support, code generation, FAQ bots), the savings add up fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You need four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt; installed (&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate&lt;/strong&gt; as the vector store for semantic similarity matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost&lt;/strong&gt; as the LLM gateway with caching enabled&lt;/li&gt;
&lt;li&gt;At least one LLM provider API key (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything runs locally. No cloud accounts needed beyond your LLM provider key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Deploy Weaviate for Vector Storage
&lt;/h2&gt;

&lt;p&gt;Weaviate stores the vector embeddings that power semantic matching. When a new query comes in, Bifrost converts it to a vector and checks Weaviate for similar past queries.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;weaviate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cr.weaviate.io/semitechnologies/weaviate:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8081:8080"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50051:50051"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;QUERY_DEFAULTS_LIMIT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="na"&gt;AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
      &lt;span class="na"&gt;PERSISTENCE_DATA_PATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/var/lib/weaviate'&lt;/span&gt;
      &lt;span class="na"&gt;DEFAULT_VECTORIZER_MODULE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text2vec-transformers'&lt;/span&gt;
      &lt;span class="na"&gt;ENABLE_MODULES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text2vec-transformers'&lt;/span&gt;
      &lt;span class="na"&gt;TRANSFORMERS_INFERENCE_API&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://t2v-transformers:8080'&lt;/span&gt;
      &lt;span class="na"&gt;CLUSTER_HOSTNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node1'&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;weaviate_data:/var/lib/weaviate&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-failure&lt;/span&gt;

  &lt;span class="na"&gt;t2v-transformers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ENABLE_CUDA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-failure&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;weaviate_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spin it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify Weaviate is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8081/v1/meta | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a JSON response with version info. If you get connection refused, give it 30 seconds for the transformer model to load.&lt;/p&gt;
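
&lt;p&gt;If you would rather script that readiness check than re-run curl by hand, a small polling loop against the same &lt;code&gt;/v1/meta&lt;/code&gt; endpoint does the job. This is a convenience sketch, not part of Bifrost or Weaviate, and the 60-second timeout is an arbitrary choice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Poll Weaviate's /v1/meta endpoint until it answers; the transformer model can
# take ~30 seconds to load on first start. The 60-second timeout is arbitrary.
import time

import requests

WEAVIATE_META_URL = "http://localhost:8081/v1/meta"

def wait_for_weaviate(timeout_seconds=60):
    deadline = time.time() + timeout_seconds
    while time.time() &amp;lt; deadline:
        try:
            meta = requests.get(WEAVIATE_META_URL, timeout=2)
            if meta.status_code == 200:
                print("Weaviate ready, version:", meta.json().get("version"))
                return True
        except requests.RequestException:
            pass  # still starting up
        time.sleep(2)
    return False

if not wait_for_weaviate():
    raise SystemExit("Weaviate did not become ready in time")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
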

&lt;p&gt;For more on &lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate's architecture and vectoriser modules&lt;/a&gt;, check their docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Bifrost with Semantic Caching Enabled
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source LLM gateway written in Go. 11 microsecond latency overhead, 5,000 RPS throughput. The part that matters here: it has &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt; built in.&lt;/p&gt;

&lt;p&gt;Dual-layer means two cache checks run on every request:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Exact hash match&lt;/strong&gt; - identical queries return cached responses instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; - queries that mean the same thing but are worded differently also hit the cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you prefer npx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now configure the gateway. Create a &lt;code&gt;config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;

&lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic"&lt;/span&gt;
  &lt;span class="na"&gt;vector_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weaviate"&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8081"&lt;/span&gt;
  &lt;span class="na"&gt;conversation_history_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-main"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key config values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cache.enabled: true&lt;/code&gt; turns on the dual-layer cache&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cache.type: "semantic"&lt;/code&gt; enables both exact hash and semantic similarity (not just exact match)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vector_store.provider: "weaviate"&lt;/code&gt; points to your Weaviate instance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;conversation_history_threshold: 3&lt;/code&gt; controls how much conversation context is used for cache key generation. Default is 3. Higher values mean more context-sensitive cache matching but fewer hits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full configuration options are in the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Point Your LLM Calls Through Bifrost
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the OpenAI SDK. Change your base URL and everything else stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (OpenAI SDK):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First call - cache miss, hits the LLM provider
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the benefits of microservices architecture?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Second call - same query, exact cache hit, zero tokens
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the benefits of microservices architecture?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Third call - different wording, same intent, semantic cache hit
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why should I use a microservices pattern?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first call goes to OpenAI. Tokens are consumed, response is cached. The second call is identical, so the exact hash matches. Response comes from cache. The third call is worded differently but semantically similar. Weaviate's vector search finds the match. Response comes from cache again.&lt;/p&gt;

&lt;p&gt;Both cache hits skip the LLM provider entirely. Zero tokens. Zero cost.&lt;/p&gt;
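
&lt;p&gt;You can see the difference from the client side without touching any gateway internals by timing the calls: cache hits come back in milliseconds, while a real provider round trip takes hundreds of milliseconds or more. The helper below is my own, and it reuses the &lt;code&gt;client&lt;/code&gt; object from the snippet above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough client-side check: time each call. Cache hits return almost instantly,
# provider calls take a full LLM round trip. Reuses `client` from above.
import time

def timed_completion(client, prompt, model="gpt-4o"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{elapsed_ms:8.1f} ms  {prompt[:50]}")
    return response

timed_completion(client, "What are the benefits of microservices architecture?")  # miss
timed_completion(client, "What are the benefits of microservices architecture?")  # exact hit
timed_completion(client, "Why should I use a microservices pattern?")             # semantic hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
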

&lt;p&gt;&lt;strong&gt;Node.js (OpenAI SDK):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Explain container orchestration in simple terms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern. Point the base URL at Bifrost, and caching is transparent to your application code.&lt;/p&gt;

&lt;p&gt;If you are using the Anthropic SDK, Bifrost supports that too. The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration&lt;/a&gt; page has the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Monitor Cache Hits and Token Savings
&lt;/h2&gt;

&lt;p&gt;Once traffic is flowing, you want to see what is hitting cache vs what is going through to providers.&lt;/p&gt;

&lt;p&gt;Bifrost exposes metrics that let you track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit rate (exact vs semantic)&lt;/li&gt;
&lt;li&gt;Total requests vs routed requests (routed = cache misses that hit a provider)&lt;/li&gt;
&lt;li&gt;Token usage per provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check your Bifrost logs to see cache behaviour in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;lt;bifrost-container-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each request will indicate whether it was served from cache or forwarded to a provider. Track the ratio over time. On workloads with repeated query patterns, the cache hit rate climbs quickly within the first few hours.&lt;/p&gt;
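
&lt;p&gt;To turn those counts into something you can put in a cost review, the arithmetic is trivial. Every number below is a placeholder; substitute whatever your logs and provider dashboard actually report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope savings estimate. All figures here are made-up placeholders.
total_requests = 10_000          # requests that reached the gateway
cache_hits = 4_200               # exact + semantic hits counted from the logs
avg_tokens_per_request = 1_500   # typical prompt + completion tokens on a miss

hit_rate = cache_hits / total_requests
tokens_saved = cache_hits * avg_tokens_per_request

print(f"Cache hit rate: {hit_rate:.1%}")                      # 42.0%
print(f"Tokens never sent to a provider: {tokens_saved:,}")   # 6,300,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
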

&lt;h2&gt;
  
  
  How It Works: Exact Hash vs Semantic Similarity
&lt;/h2&gt;

&lt;p&gt;A quick breakdown of the two cache layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact hash matching&lt;/strong&gt; is straightforward. The entire request (messages, model, parameters) is hashed. If an identical request has been seen before, the cached response is returned. This is fast and deterministic. Same input, same output.&lt;/p&gt;
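
&lt;p&gt;Conceptually it is nothing more than a stable hash over the request payload used as a lookup key. Here is an illustration of the idea in Python; it is not Bifrost's actual key format, just the principle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustration of exact-match keying: hash the whole request and use the digest
# as the cache key. Identical requests always produce the identical key.
import hashlib
import json

def exact_cache_key(model, messages, **params):
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True)  # stable ordering
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key_a = exact_cache_key("gpt-4o", [{"role": "user", "content": "Explain OAuth 2.0"}])
key_b = exact_cache_key("gpt-4o", [{"role": "user", "content": "Explain OAuth 2.0"}])
assert key_a == key_b  # same input, same key, cached response can be reused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
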

&lt;p&gt;&lt;strong&gt;Semantic similarity&lt;/strong&gt; is where it gets interesting. When no exact match exists, Bifrost converts the query into a vector embedding using the transformer model running in Weaviate. It then searches for existing cached queries that are semantically close. If the similarity score is above the threshold, the cached response is returned.&lt;/p&gt;

&lt;p&gt;This is what catches queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I deploy to Kubernetes?" and "What is the process for deploying on k8s?"&lt;/li&gt;
&lt;li&gt;"Explain OAuth 2.0" and "How does OAuth2 authentication work?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different words. Same intent. One LLM call instead of two.&lt;/p&gt;
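
&lt;p&gt;You can reproduce the similarity side of this locally with the same embedding model the docker-compose file loads (all-MiniLM-L6-v2), assuming you have the sentence-transformers package installed. The interpretation of the score below is mine; Bifrost applies its own threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare two differently worded queries with the embedding model from the
# docker-compose file (sentence-transformers/all-MiniLM-L6-v2).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query_a = "How do I deploy to Kubernetes?"
query_b = "What is the process for deploying on k8s?"

embeddings = model.encode([query_a, query_b])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# A score near 1.0 means near-paraphrases: a cached answer for one is a
# reasonable answer for the other.
print(f"cosine similarity: {similarity:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
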

&lt;p&gt;The &lt;code&gt;conversation_history_threshold&lt;/code&gt; setting controls how many previous messages in a conversation are included when generating the cache key. At the default of 3, Bifrost uses the last 3 messages for context. This prevents a cached response from a different conversation context being returned incorrectly.&lt;/p&gt;

&lt;p&gt;For more on how sentence embeddings power this kind of similarity search, &lt;a href="https://huggingface.co/blog/getting-started-with-embeddings" rel="noopener noreferrer"&gt;HuggingFace has a solid primer&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: What I Measured After Running This for a Week
&lt;/h2&gt;

&lt;p&gt;I ran this setup against three different workloads for seven days. Here is what I observed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer support bot (repetitive queries):&lt;/strong&gt; Highest cache hit rate. Users ask variations of the same 50-100 questions. After the first day, the cache warmed up and a large portion of queries were served from cache. Semantic matching caught the paraphrased versions that exact hash would miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation assistant (moderate repetition):&lt;/strong&gt; Lower hit rate than customer support, but still meaningful. Common patterns like "write a function to parse JSON" or "create a REST endpoint" showed up repeatedly with slight variations. Semantic caching caught many of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open-ended research queries (low repetition):&lt;/strong&gt; Lowest hit rate, as expected. Each query was unique enough that neither exact nor semantic matching triggered often. Caching still helped with follow-up questions that rephrased earlier queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency on cache hits:&lt;/strong&gt; Near-instant. The Weaviate vector lookup adds milliseconds, but compared to a full LLM round trip (typically 500ms to 3s), cache hits felt instantaneous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway overhead:&lt;/strong&gt; Bifrost's 11 microsecond latency overhead held up. The caching layer adds the Weaviate lookup time on misses and hits, but the gateway itself adds almost nothing.&lt;/p&gt;

&lt;p&gt;The workloads where semantic caching pays off most are the ones with natural query repetition. Customer support, internal knowledge bases, FAQ systems, onboarding assistants. If your users ask the same things in different ways, you are paying for the same answer multiple times.&lt;/p&gt;

&lt;p&gt;For reference, here is what &lt;a href="https://openai.com/api/pricing" rel="noopener noreferrer"&gt;OpenAI charges per token&lt;/a&gt; and what &lt;a href="https://docs.anthropic.com/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Anthropic charges&lt;/a&gt;. On GPT-4o at current pricing, even a moderate cache hit rate translates to real savings on a monthly bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost Semantic Caching Docs&lt;/a&gt; - full config reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost Setup Guide&lt;/a&gt; - getting started from scratch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate Developer Docs&lt;/a&gt; - vector store configuration and modules&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/blog/getting-started-with-embeddings" rel="noopener noreferrer"&gt;Getting Started with Embeddings (HuggingFace)&lt;/a&gt; - how sentence embeddings work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/docs/latest/develop/use/patterns/" rel="noopener noreferrer"&gt;Redis Caching Patterns&lt;/a&gt; - general caching concepts for comparison&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running LLM workloads with any kind of query repetition, set up semantic caching before optimising anything else. It is the lowest-effort, highest-impact cost reduction I have found.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Smart LLM Routing in Production: Picking the Optimal Model per Request</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:58:08 +0000</pubDate>
      <link>https://dev.to/pranay_batta/smart-llm-routing-in-production-picking-the-optimal-model-per-request-2lkh</link>
      <guid>https://dev.to/pranay_batta/smart-llm-routing-in-production-picking-the-optimal-model-per-request-2lkh</guid>
      <description>&lt;p&gt;Every production LLM system eventually runs into the same wall. You are paying too much, responses are too slow, or a single provider outage takes everything down.&lt;/p&gt;

&lt;p&gt;The fix is routing. Instead of hardcoding one model for all requests, you route each request to the best available model based on cost, latency, and reliability.&lt;/p&gt;

&lt;p&gt;I evaluated several approaches over the last few weeks. Marketplace APIs, framework-level abstractions, self-hosted gateways, DIY logic. Here is what the data showed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route at All?
&lt;/h2&gt;

&lt;p&gt;If you are only using one model from one provider, you do not need routing. But the moment you add a second provider, routing decisions start piling up.&lt;/p&gt;

&lt;p&gt;Three reasons this matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; GPT-4o costs roughly 10x more per token than GPT-4o-mini. If 60% of your traffic is simple summarization or classification, you are burning money sending it to a frontier model. Routing lets you match request complexity to model price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Provider response times vary by region, time of day, and current load. A request that takes 800ms on one provider might take 2.5s on another at that exact moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability.&lt;/strong&gt; Every provider has outages. Rate limits hit. 429s and 500s happen. If your entire product is wired to one API endpoint, you inherit their downtime.&lt;/p&gt;

&lt;p&gt;Smart routing optimises across all three per request, without changing application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Landscape: Four Approaches
&lt;/h2&gt;

&lt;p&gt;Before picking a tool, I mapped out how the options break down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marketplace routing (OpenRouter)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; acts as a unified API across dozens of models from different providers. You send a request to their endpoint, and they handle the provider connection. Good model catalog, single API key. The trade-off is that you are adding a network hop through their servers and routing logic is their black box, not yours. Less control over failover behaviour, budget enforcement, and routing weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework-level routing (Semantic Kernel)
&lt;/h3&gt;

&lt;p&gt;Microsoft's &lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; lets you define model selection logic inside your application code. You can set up filters that choose models based on request properties, user tier, or function type. The issue: routing becomes tightly coupled to your application. Every service needs the routing logic, and updating routing config means redeploying application code. No built-in budget enforcement or provider health monitoring either.&lt;/p&gt;

&lt;h3&gt;
  
  
  DIY routing
&lt;/h3&gt;

&lt;p&gt;You can always write your own. A reverse proxy with some logic to pick providers based on health checks and weights. I tried this first with a simple Python setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;PROVIDERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openai.com/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.anthropic.com/v1/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_provider&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for two providers with static weights. It falls apart when you need failover, budget tracking, health checks, or dynamic weight adjustment. I abandoned this after two weeks of edge cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gateway-level routing
&lt;/h3&gt;

&lt;p&gt;A gateway sits between your application and LLM providers. You configure routing rules once, and every service behind the gateway gets the same behaviour. Application code does not know or care which provider serves a request.&lt;/p&gt;

&lt;p&gt;This is where I spent most of my time. And this is where the data got interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gateway-Level Routing Won for Me
&lt;/h2&gt;

&lt;p&gt;The decision came down to one principle: routing is infrastructure, not application logic.&lt;/p&gt;

&lt;p&gt;When routing lives in the application layer, every team implements it differently. One team does round-robin, another does random selection, a third hardcodes a provider. Failover behaviour is inconsistent. Budget tracking is scattered across services.&lt;/p&gt;

&lt;p&gt;A gateway centralises all of that. Configure it once, every downstream service gets consistent routing, failover, and budget enforcement. Change the routing strategy and no application code changes.&lt;/p&gt;

&lt;p&gt;After testing several gateways, &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; gave me the best combination of routing flexibility and raw performance. Written in Go, 11 microsecond latency overhead, 5,000 RPS sustained throughput. For context, Python-based alternatives like LiteLLM add around 8ms per request. That is roughly a 50x difference in routing overhead.&lt;/p&gt;

&lt;p&gt;Here is how I set it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost Routing: The Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Weighted Distribution
&lt;/h3&gt;

&lt;p&gt;The most common routing strategy. You assign weights to providers and Bifrost distributes traffic proportionally. Weights auto-normalise, so you can use any numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-tertiary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;60% of requests go to GPT-4o. 30% to Claude Sonnet. 10% to Gemini. I used this split to compare output quality across providers on real production traffic. Adjusting the weights is a config change, not a code deploy.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing configuration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Weighted routing handles the happy path. &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover&lt;/a&gt; handles everything else. When a provider returns errors, Bifrost automatically retries with the next provider in weight order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI returns a 429? Bifrost retries with Anthropic. Anthropic is down? Falls back to Gemini. The application never sees the failure. No retry logic in application code, no manual intervention.&lt;/p&gt;

&lt;p&gt;I ran a 48-hour test where I intentionally rotated provider API keys to simulate outages. Bifrost handled every failover cleanly. Requests were slower (because retries take time) but none failed from the application's perspective.&lt;/p&gt;
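
&lt;p&gt;For a sense of what the gateway is doing on your behalf, this is roughly the logic you would otherwise end up writing in every service: try providers in weight order and move on when one errors. A simplified Python sketch of the idea, not Bifrost's code; &lt;code&gt;call_provider&lt;/code&gt; is a stand-in for a real SDK call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# What automatic failover replaces: ordered retries across providers.
FAILOVER_ORDER = ["openai-primary", "anthropic-fallback", "gemini-fallback"]

class ProviderError(Exception):
    pass

def call_provider(provider_id, prompt):
    # Simulated outage: the first two providers are "down", the last one answers.
    if provider_id != "gemini-fallback":
        raise ProviderError(f"{provider_id} returned 429")
    return f"[{provider_id}] response to: {prompt}"

def complete_with_failover(prompt):
    last_error = None
    for provider_id in FAILOVER_ORDER:
        try:
            return call_provider(provider_id, prompt)
        except ProviderError as error:
            last_error = error  # unhealthy or rate limited, try the next provider
    raise last_error

print(complete_with_failover("Explain container orchestration in simple terms"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
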

&lt;h3&gt;
  
  
  Budget-Aware Routing
&lt;/h3&gt;

&lt;p&gt;This is where Bifrost's approach gets genuinely useful. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend-team"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-key-pranay"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a budget tier is exhausted, routing decisions respect that constraint. If the backend team hits their monthly limit, requests from that team stop going through. If a specific virtual key runs out, that key is blocked but other keys on the same team still work.&lt;/p&gt;

&lt;p&gt;This level of granularity is something I did not find in the other approaches I tested. Most solutions do global rate limiting at best. The four-tier hierarchy lets you set guardrails at every organisational level without building custom middleware.&lt;/p&gt;
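
&lt;p&gt;The enforcement model is easy to reason about: a request is admitted only if every tier it belongs to still has budget left. Here is a toy Python sketch of that check, with made-up tiers and numbers, not Bifrost's data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy model of tiered budget enforcement. A request passes only when every tier
# in its chain (virtual key, team, customer, provider config) has budget left.
budgets = {
    "team:backend-team": {"limit": 500.0, "spent": 500.0},  # monthly limit exhausted
    "vk:dev-key-pranay": {"limit": 100.0, "spent": 42.0},   # still has headroom
}

def admit(request_tiers):
    for tier in request_tiers:
        entry = budgets.get(tier)
        if entry and entry["spent"] &amp;gt;= entry["limit"]:
            return False, f"budget exhausted at {tier}"
    return True, "ok"

# Blocked: the key has budget, but its team does not.
print(admit(["vk:dev-key-pranay", "team:backend-team"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
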

&lt;h3&gt;
  
  
  Semantic Caching: Skip Routing Entirely
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Dual-layer semantic caching&lt;/a&gt; in Bifrost uses exact hash matching and semantic similarity matching.&lt;/p&gt;

&lt;p&gt;When a request hits the cache, it never reaches a provider. No routing decision needed. No API call. No cost. The response comes back from cache directly.&lt;/p&gt;

&lt;p&gt;For workloads with repeated or similar queries (customer support, code generation with common patterns, FAQ-type interactions), caching eliminates a significant chunk of provider calls entirely. In my testing, cache hit rates on repetitive workloads were high enough to noticeably reduce total routed requests.&lt;/p&gt;

&lt;p&gt;This interacts well with budget-aware routing. Fewer routed requests means budgets last longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Setup is fast. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your providers in the config file, set your routing weights, and point your application at the gateway endpoint. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; walks through it. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration&lt;/a&gt; covers all supported providers and model formats.&lt;/p&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; endpoint for the OpenAI and Anthropic SDKs. If your application already uses either SDK, you change the base URL and nothing else. No code changes needed. The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration docs&lt;/a&gt; have the specifics.&lt;/p&gt;
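
&lt;p&gt;For completeness, here is what that looks like with the Anthropic SDK in Python. Treat the &lt;code&gt;base_url&lt;/code&gt; as a placeholder; the exact gateway endpoint to point it at is in the Anthropic SDK integration docs linked above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Anthropic SDK pointed at the gateway instead of api.anthropic.com.
# The base_url below is a placeholder; use the endpoint from Bifrost's
# Anthropic SDK integration docs.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",  # placeholder gateway endpoint
    api_key="your-anthropic-api-key",
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain container orchestration in simple terms"}],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
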

&lt;h2&gt;
  
  
  Results After Switching
&lt;/h2&gt;

&lt;p&gt;I ran Bifrost for three weeks across production workloads. Here is what the data showed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency overhead&lt;/strong&gt;: Consistently under 15 microseconds per request. The 11 microsecond claim held up in my benchmarks. At 5,000 RPS, total gateway overhead was negligible compared to actual LLM response times. You can &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;run the benchmarks yourself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failover recovery&lt;/strong&gt;: Provider failures were transparent to the application. During two real OpenAI degradation events, traffic shifted to Anthropic within the same request cycle. Zero application-level errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost visibility&lt;/strong&gt;: The four-tier budget hierarchy gave me per-team and per-key cost tracking without building anything custom. I caught one team burning through their allocation on a retry loop within the first week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache savings&lt;/strong&gt;: Semantic caching reduced routed requests by a meaningful percentage on workloads with repeated query patterns. Those were requests that never hit a provider, never cost anything.&lt;/p&gt;

&lt;p&gt;The combination of weighted routing, automatic failover, budget controls, and semantic caching in a single layer that adds 11 microseconds of overhead is something I have not been able to replicate with any other approach I tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;LLM routing is not optional in production. Static provider configs break under load, cost more than they should, and give you zero flexibility when things go wrong.&lt;/p&gt;

&lt;p&gt;The approach matters. Marketplace APIs abstract away too much control. Framework-level routing couples infrastructure decisions to application code. DIY solutions work until the edge cases pile up.&lt;/p&gt;

&lt;p&gt;Gateway-level routing keeps the concern where it belongs: in infrastructure. Bifrost's performance numbers, routing flexibility, and budget hierarchy made it the strongest option in my evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running LLMs in production with multiple providers, set up a gateway and stop hardcoding routing in application code. The data speaks for itself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How to Govern Claude Code Usage Across Engineering Teams</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:55:53 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-govern-claude-code-usage-across-engineering-teams-53lk</link>
      <guid>https://dev.to/pranay_batta/how-to-govern-claude-code-usage-across-engineering-teams-53lk</guid>
      <description>&lt;p&gt;Claude Code is powerful; maybe too powerful to run without guardrails.&lt;/p&gt;

&lt;p&gt;I came across a case where a mid-sized startup had three engineering teams adopt it independently. Within two weeks, their bill hit $4,200. No breakdown of who spent what, no audit trail, no rate limits—just usage piling up and a growing invoice.&lt;/p&gt;

&lt;p&gt;If your org is adopting Claude Code, you need centralized governance. I tested &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; as an AI gateway layer to solve exactly this. Here is how I set it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Ungoverned Claude Code
&lt;/h2&gt;

&lt;p&gt;Claude Code runs locally on each developer's machine. Every developer has their own API key. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No visibility into per-developer or per-team spend&lt;/li&gt;
&lt;li&gt;No rate limiting. One runaway agent loop burns through your budget&lt;/li&gt;
&lt;li&gt;No audit trail of what tools were called, what code was generated&lt;/li&gt;
&lt;li&gt;No control over which MCP tools Claude Code can access&lt;/li&gt;
&lt;li&gt;No way to enforce org-wide policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need a proxy layer between Claude Code and the LLM provider. That is what an AI gateway does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost as Your Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost is a Go-based AI gateway with &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;11 microsecond latency overhead&lt;/a&gt;. Deploy it with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @anthropic-ai/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 ghcr.io/maximhq/bifrost:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point Claude Code at it by setting the base URL in your Claude Code config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiBaseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vk_team_frontend_abc123"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;apiKey&lt;/code&gt; is not an Anthropic key. It is a Bifrost &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt;. This is where governance starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Keys: Per-Developer Access Control
&lt;/h2&gt;

&lt;p&gt;Virtual keys let you issue scoped credentials to each developer or team. Each virtual key maps to an underlying provider key but adds access controls on top.&lt;/p&gt;

&lt;p&gt;Create a virtual key per team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bifrost.yaml&lt;/span&gt;
&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_frontend"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Team"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_backend"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Team"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200000&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_dev_rahul"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rahul&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Backend"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each developer gets their own virtual key. They never see the actual Anthropic API key. You revoke access by deleting the virtual key. No key rotation needed on the provider side.&lt;/p&gt;
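
&lt;p&gt;Before handing a key out, it is worth confirming it resolves to the right provider config and model list. A rough check, assuming the gateway's OpenAI-compatible chat completions path on the same base URL the Claude Code config above points at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical one-token request with a freshly issued virtual key
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vk_team_frontend_abc123" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 1, "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A key scoped to the wrong provider config or a disallowed model should fail here, not on a developer's machine.&lt;/p&gt;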

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;virtual keys documentation&lt;/a&gt; for tool-level scoping options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Hierarchy: Cap Spend at Every Level
&lt;/h2&gt;

&lt;p&gt;Bifrost supports a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config. This maps cleanly to engineering org structures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;org_level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;

  &lt;span class="na"&gt;teams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frontend"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml_platform"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key_overrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_dev_rahul"&lt;/span&gt;
      &lt;span class="na"&gt;daily_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits 80% of their budget, you get an alert. When they hit 100%, requests get blocked. No more surprise bills.&lt;/p&gt;

&lt;p&gt;The daily limit on individual virtual keys is useful for catching runaway Claude Code agent loops. If a developer accidentally triggers an infinite tool-call cycle, it burns through $50 and stops. Not $500.&lt;/p&gt;
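
&lt;p&gt;Rough numbers on why a $50 daily cap is a sensible ceiling, assuming Sonnet-class pricing of roughly $3 per million input tokens and $15 per million output tokens (adjust for your model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# back-of-the-envelope: a runaway loop emitting ~4,000 output tokens per call
# costs roughly $0.06 per call in output tokens alone
echo "calls before the daily cap trips: $((50 * 100 / 6))"   # about 833
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That leaves room for a normal day of Claude Code use, and far too little for an agent loop left running overnight.&lt;/p&gt;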

&lt;h2&gt;
  
  
  Audit Logging: Track Every Tool Call
&lt;/h2&gt;

&lt;p&gt;This is the part that convinced me. Bifrost logs every request with granular detail. For MCP tool calls specifically, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name&lt;/li&gt;
&lt;li&gt;Server name&lt;/li&gt;
&lt;li&gt;Arguments passed&lt;/li&gt;
&lt;li&gt;Results returned&lt;/li&gt;
&lt;li&gt;Latency per call&lt;/li&gt;
&lt;li&gt;Virtual key ID (so you know which developer triggered it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;per-tool audit logging docs&lt;/a&gt; for the full schema.&lt;/p&gt;

&lt;p&gt;Query logs to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which developer made the most LLM calls this week?&lt;/li&gt;
&lt;li&gt;What tools is the frontend team using in Claude Code?&lt;/li&gt;
&lt;li&gt;How much did code generation cost per team last month?&lt;/li&gt;
&lt;li&gt;Are any developers hitting rate limits frequently?&lt;/li&gt;
&lt;/ul&gt;
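
&lt;p&gt;The first of those questions is a one-liner if you export the request logs as JSON lines. A sketch; the field name is an assumption, so check the audit schema for the exact key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical: count LLM calls per virtual key from exported JSONL logs
jq -s 'group_by(.virtual_key_id)
       | map({key: .[0].virtual_key_id, calls: length})
       | sort_by(-.calls)' requests.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;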

&lt;p&gt;This is the audit trail you need for SOC 2 compliance and internal cost attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting: Prevent Runaway Usage
&lt;/h2&gt;

&lt;p&gt;I already showed rate limits in the virtual key config. But let me explain why this matters specifically for Claude Code.&lt;/p&gt;

&lt;p&gt;Claude Code in agent mode can make dozens of LLM calls per task. A single "refactor this module" command might trigger 15-20 API calls. Without rate limits, one developer running complex refactors back-to-back can consume your entire daily budget in an hour.&lt;/p&gt;

&lt;p&gt;Set conservative limits per developer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
  &lt;span class="na"&gt;concurrent_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This still allows normal Claude Code usage. But it prevents the scenario where someone kicks off a massive agent task and walks away.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at the gateway level with &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;sub-millisecond overhead&lt;/a&gt;. The developer gets a clear 429 response. Claude Code handles these gracefully with built-in retry logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Gateway: Control Which Tools Claude Code Can Access
&lt;/h2&gt;

&lt;p&gt;This is the governance layer that most teams miss. Claude Code can connect to MCP servers that expose file system access, database queries, deployment tools. You need to control which tools each team can use.&lt;/p&gt;

&lt;p&gt;Bifrost acts as an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;. You expose a single &lt;code&gt;/mcp&lt;/code&gt; endpoint and control tool access per virtual key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9001"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_directory"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9002"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;describe_table"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9003"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy_staging"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rollback"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key_permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_frontend"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_backend"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_platform"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frontend developers cannot accidentally trigger deployments through Claude Code. Backend developers cannot access deployment tools. Only the platform team gets full access.&lt;/p&gt;

&lt;p&gt;Bifrost's MCP support includes &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Code Mode with 50%+ token reduction&lt;/a&gt; and sub-3ms latency. So you get governance without performance penalties.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here is the minimal setup to govern Claude Code across a 20-person engineering team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Bifrost (single binary, zero config)&lt;/li&gt;
&lt;li&gt;Create virtual keys per developer&lt;/li&gt;
&lt;li&gt;Set budget limits per team and per developer&lt;/li&gt;
&lt;li&gt;Configure rate limits&lt;/li&gt;
&lt;li&gt;Route MCP tools through the gateway with per-team permissions&lt;/li&gt;
&lt;li&gt;Point each developer's Claude Code config at the gateway&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total setup time when I did this: about 45 minutes. Most of that was deciding on budget allocations.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost docs&lt;/a&gt; cover each of these in detail. The &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has example configs for common setups.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;After running this for a week, a few notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with generous rate limits and tighten based on actual usage data. Too strict and developers complain.&lt;/li&gt;
&lt;li&gt;Set daily limits, not just monthly. Monthly limits let someone blow the budget on day 1.&lt;/li&gt;
&lt;li&gt;Review audit logs weekly. You will find patterns. Some developers are 10x more efficient with Claude Code than others. Share what works.&lt;/li&gt;
&lt;li&gt;Use separate virtual keys for Claude Code vs other AI tools. Makes cost attribution cleaner.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Claude Code without governance is a liability. With a gateway layer, it becomes a controlled, auditable, budget-safe tool. Bifrost handles this at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;11 microsecond overhead&lt;/a&gt;, so your developers do not notice the proxy.&lt;/p&gt;

&lt;p&gt;The alternative is waiting for the bill shock. I have seen it happen. Set up governance before you scale Claude Code to your full team.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Buyer's Guide to Pick the Best LLM Gateway in 2026</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:13:53 +0000</pubDate>
      <link>https://dev.to/pranay_batta/buyers-guide-to-pick-the-best-llm-gateway-in-2026-1epa</link>
      <guid>https://dev.to/pranay_batta/buyers-guide-to-pick-the-best-llm-gateway-in-2026-1epa</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; An LLM gateway sits between your application and LLM providers, handling routing, failover, cost controls, and observability. I tested five gateways against ten evaluation criteria. Bifrost won on latency and governance. LiteLLM wins on provider coverage. Kong and Cloudflare suit different enterprise needs. Here is the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an LLM Gateway?
&lt;/h2&gt;

&lt;p&gt;An LLM gateway is a reverse proxy purpose-built for LLM API traffic. It normalises requests across providers like &lt;a href="https://platform.openai.com/docs" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and &lt;a href="https://docs.anthropic.com" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, adds routing logic, failover, cost controls, caching, and observability without changing your application code. Think of it as an API gateway, but designed specifically for the economics and reliability challenges of LLM calls.&lt;/p&gt;
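
&lt;p&gt;In practice that means your application sends one request shape and the gateway worries about the provider behind it. A sketch, assuming a gateway listening locally and a &lt;code&gt;provider/model&lt;/code&gt; naming convention (gateways differ on the exact format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# same endpoint, same payload shape, two different providers behind it
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $GATEWAY_KEY" -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "hi"}]}'

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $GATEWAY_KEY" -H "Content-Type: application/json" \
  -d '{"model": "anthropic/claude-sonnet-4-5", "messages": [{"role": "user", "content": "hi"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;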

&lt;p&gt;If you are calling more than one LLM provider, or spending more than $500/month on API calls, you need one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Evaluation Criteria
&lt;/h2&gt;

&lt;p&gt;I benchmarked and tested five gateways over three weeks. Here is what matters, and what I found.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Latency Overhead
&lt;/h3&gt;

&lt;p&gt;The gateway itself should add near-zero latency. You are already waiting 500ms-2s for LLM responses. If your gateway adds another 8-15ms, that compounds across multi-step agent chains.&lt;/p&gt;

&lt;p&gt;I measured gateway overhead (not LLM response time) using a standardised &lt;a href="https://go.dev" rel="noopener noreferrer"&gt;Go&lt;/a&gt; benchmarking harness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; 11 microseconds. Written in Go, handles 5,000 RPS sustained. &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Benchmark details&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; ~8ms. Python-based, solid for moderate traffic. &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; ~3-5ms. Built on Kong's proven proxy layer. &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;Product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare AI Gateway:&lt;/strong&gt; Sub-1ms at edge (but limited to Cloudflare's network). &lt;a href="https://developers.cloudflare.com/ai-gateway/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity AI Gateway:&lt;/strong&gt; Not independently benchmarkable. Tied to Databricks runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If latency matters (agents, real-time apps), Bifrost is in a different league.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Provider Coverage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; 100+ providers. Broadest coverage available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; 19+ providers (OpenAI, Anthropic, Azure, Bedrock, Gemini, Mistral, Cohere, Groq, and more).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; Major providers via plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Major providers only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Focused on Databricks ecosystem plus external endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need obscure providers, LiteLLM wins. For the top 15-20 providers, any gateway here works.&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Routing Flexibility
&lt;/h3&gt;

&lt;p&gt;Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;weighted, priority-based, and conditional routing&lt;/a&gt;. Split traffic 70/30 between GPT-4o and Claude Sonnet, or route coding tasks to one model and summarisation to another. LiteLLM has basic load balancing. Kong does routing via plugins. Cloudflare and Databricks offer simpler options.&lt;/p&gt;

&lt;h3&gt;
  
  
  d. Failover and Reliability
&lt;/h3&gt;

&lt;p&gt;When a provider goes down (and they do), what happens? &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Bifrost's failover&lt;/a&gt; supports automatic retries with configurable backoff and fallback chains. If OpenAI returns a 429, it fails over to Anthropic automatically. LiteLLM has similar fallback support. Kong uses health checks. Cloudflare and Databricks offer basic retry/fallback options.&lt;/p&gt;

&lt;h3&gt;
  
  
  e. Cost Governance
&lt;/h3&gt;

&lt;p&gt;This is where gateways diverge sharply. Bifrost has a &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;four-tier budget system&lt;/a&gt;: per-key, per-team, per-project, and global with hard limits, soft warnings, and rate limits. &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Full governance docs&lt;/a&gt;. LiteLLM has budget controls via its proxy. Kong and Cloudflare offer rate limiting. Databricks ties into Unity Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  f. Caching
&lt;/h3&gt;

&lt;p&gt;Caching identical or similar LLM calls reduces cost and latency dramatically.&lt;/p&gt;

&lt;p&gt;Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer semantic caching&lt;/a&gt; with exact match and semantic similarity. Backend options include &lt;a href="https://redis.io" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; for exact caching, &lt;a href="https://weaviate.io" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; for vector-based semantic matching, and Qdrant as an alternative vector store.&lt;/p&gt;

&lt;p&gt;LiteLLM has basic caching support. Cloudflare caches at the edge (great for repeated queries). Kong and Databricks have limited native caching options.&lt;/p&gt;

&lt;h3&gt;
  
  
  g. Observability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost's observability&lt;/a&gt; captures request/response pairs, token counts, latency, cost, and model metadata with under 0.1ms overhead. Audit logging and virtual key tracking built in. LiteLLM has a dashboard plus integrations. Kong plugs into existing stacks. Cloudflare and Databricks have built-in analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  h. MCP Support
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; is becoming the standard for tool integration. Gateway-level MCP support matters for managing tool sprawl.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Bifrost's MCP support&lt;/a&gt; includes a Code Mode that generates TypeScript declarations instead of raw tool definitions. At 500 tools, this saves 92% on tokens. Tool-level scoping and access control are built in.&lt;/p&gt;

&lt;p&gt;Databricks Unity just added MCP governance. Kong v3.14 added A2A (Agent-to-Agent) support in April 2026. LiteLLM and Cloudflare have basic or no MCP-specific features.&lt;/p&gt;

&lt;p&gt;If you are building multi-agent systems with many tools, MCP governance is not optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  i. Deployment Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; Self-hosted. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Zero-config setup&lt;/a&gt; via &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or &lt;a href="https://docs.docker.com" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; Self-hosted (open-source) or managed (enterprise).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong:&lt;/strong&gt; Self-hosted or managed (Konnect).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Managed only. You are on Cloudflare's infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Managed. Tied to Databricks workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-hosted means your data never leaves your VPC. If you are in a regulated industry, this matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  j. Open Source vs Proprietary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; Fully open-source. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; Open-source core, enterprise features behind a paid tier. Note: LiteLLM had a supply chain security incident in March 2026 that affected its PyPI package. Worth reviewing before deploying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; Kong's core is open-source, but AI Gateway features require an enterprise licence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Proprietary managed service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Proprietary, part of Databricks platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong AI&lt;/th&gt;
&lt;th&gt;Cloudflare&lt;/th&gt;
&lt;th&gt;Databricks Unity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;11us&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;~3-5ms&lt;/td&gt;
&lt;td&gt;Sub-1ms (edge)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Providers&lt;/td&gt;
&lt;td&gt;19+&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;Ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Weighted, priority, conditional&lt;/td&gt;
&lt;td&gt;Basic LB&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover&lt;/td&gt;
&lt;td&gt;Full fallback chains&lt;/td&gt;
&lt;td&gt;Fallback support&lt;/td&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;Basic retry&lt;/td&gt;
&lt;td&gt;Endpoint fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost governance&lt;/td&gt;
&lt;td&gt;Four-tier budgets&lt;/td&gt;
&lt;td&gt;Budget + rate limits&lt;/td&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Unity Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Semantic (Redis/Weaviate/Qdrant)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Edge caching&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Sub-0.1ms, full audit&lt;/td&gt;
&lt;td&gt;Dashboard + integrations&lt;/td&gt;
&lt;td&gt;Stack integration&lt;/td&gt;
&lt;td&gt;Built-in analytics&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;Code Mode, 92% savings&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;A2A (v3.14)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;MCP governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Self-hosted/managed&lt;/td&gt;
&lt;td&gt;Self-hosted/managed&lt;/td&gt;
&lt;td&gt;Managed only&lt;/td&gt;
&lt;td&gt;Managed only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Core only&lt;/td&gt;
&lt;td&gt;AI features paid&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Bifrost if&lt;/strong&gt; you need lowest latency, granular cost governance, semantic caching, and MCP tool management. Self-hosted, open-source. &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Get started here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick LiteLLM if&lt;/strong&gt; you need the widest provider coverage and can tolerate 8ms+ overhead. Factor in the March 2026 security incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Kong AI Gateway if&lt;/strong&gt; you already run Kong and want LLM routing added to existing infrastructure. A2A support in v3.14 is promising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cloudflare AI Gateway if&lt;/strong&gt; you want zero-ops and are already on Cloudflare. Limited governance for multi-team setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Databricks Unity AI Gateway if&lt;/strong&gt; you are all-in on Databricks. Strong MCP governance but locks you into the ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs to Accept
&lt;/h2&gt;

&lt;p&gt;No single best gateway exists. Bifrost's 19 providers cover 95% of production traffic but are fewer than LiteLLM's 100+. LiteLLM's Python runtime is slower but easier to extend. Kong is battle-tested as a proxy but its AI features are catching up. Cloudflare is easiest to set up but gives you the least control. Databricks is powerful within its ecosystem and limiting outside it.&lt;/p&gt;

&lt;p&gt;Pick the one that solves your biggest bottleneck first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Bifrost links:&lt;/strong&gt; &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Best AI Gateway to Route Codex CLI to Any Model</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:05:16 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</link>
      <guid>https://dev.to/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</guid>
      <description>&lt;p&gt;Codex CLI is OpenAI's terminal-based coding agent that runs entirely in your shell. It reads your codebase, proposes changes, runs commands, and writes code. Solid tool. One problem: it only talks to OpenAI by default.&lt;/p&gt;

&lt;p&gt;I wanted to route Codex CLI through an AI gateway so I could use Claude Sonnet, Gemini 2.5 Pro, Mistral, and others without switching tools. I tested a few options. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; worked best. Open-source, written in Go, 11 microsecond overhead. Here is exactly how I set it up and what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route Codex CLI Through an AI Gateway
&lt;/h2&gt;

&lt;p&gt;Codex CLI sends requests to OpenAI's API. That is fine until you need something else. Maybe Claude Sonnet handles your refactoring tasks better. Maybe Gemini's context window fits your monorepo. Maybe you want automatic failover when OpenAI rate limits you mid-session.&lt;/p&gt;

&lt;p&gt;An AI gateway sits between Codex CLI and your providers. It translates requests, routes traffic, and handles failures. You configure it once and Codex CLI does not know the difference.&lt;/p&gt;

&lt;p&gt;Without a gateway, your options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stick with OpenAI only (no routing, no failover, no cost tracking)&lt;/li&gt;
&lt;li&gt;Manually swap API keys and base URLs every time you want a different model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost for Codex CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an OpenAI-compatible endpoint. Codex CLI connects to it like it would connect to OpenAI directly. Full &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;Codex CLI integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; has the full walkthrough.&lt;/p&gt;

&lt;h3&gt;
  
  
  The OAuth Gotcha
&lt;/h3&gt;

&lt;p&gt;This one tripped me up. Codex CLI always prefers OAuth authentication over custom API keys. If you have previously logged in with OpenAI, Codex CLI will ignore your custom &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Run &lt;code&gt;/logout&lt;/code&gt; inside Codex CLI before configuring Bifrost. Without this step, your gateway config will be silently bypassed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Codex CLI to Use Bifrost
&lt;/h3&gt;

&lt;p&gt;Set your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add it to your &lt;code&gt;codex.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[auth]&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bifrost_virtual_key"&lt;/span&gt;

&lt;span class="nn"&gt;[network]&lt;/span&gt;
&lt;span class="py"&gt;openai_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8080/openai/v1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; here is a Bifrost virtual key. Your actual provider keys live in the Bifrost config.&lt;/p&gt;

&lt;p&gt;Done. Every Codex CLI request now flows through Bifrost.&lt;/p&gt;
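
&lt;p&gt;Optionally, sanity-check the wiring before launching a session. The same base URL and key Codex CLI will use should answer a plain chat completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# quick check using the env vars set above; if this fails, Codex CLI will too
curl -s "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;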

&lt;h2&gt;
  
  
  Routing Codex CLI to Any Model
&lt;/h2&gt;

&lt;p&gt;This is the core use case. Configure multiple providers in Bifrost, and route Codex CLI traffic however you want. Bifrost uses the &lt;code&gt;provider/model-name&lt;/code&gt; format for cross-provider routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-dev"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-5-20250929"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini/gemini-2-5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;60% of requests go to Claude Sonnet. 25% to Gemini. 15% to GPT-4o. Weights auto-normalise, so use any numbers.&lt;/p&gt;

&lt;p&gt;I ran this for a week. Claude Sonnet handled tool-heavy refactoring better. Gemini was faster on large context reads. GPT-4o was solid as a fallback. The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all configuration options.&lt;/p&gt;

&lt;p&gt;Other providers you can route to: Mistral, Groq, Cerebras, Cohere, Perplexity. All via the same &lt;code&gt;provider/model-name&lt;/code&gt; format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can You Use Codex CLI with Non-OpenAI Models?
&lt;/h3&gt;

&lt;p&gt;Yes. That is exactly what this setup does. Bifrost translates the OpenAI-format requests from Codex CLI into whatever format each provider expects. Codex CLI thinks it is talking to OpenAI. Bifrost handles the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical requirement:&lt;/strong&gt; non-OpenAI models must support tool use. Codex CLI relies on function calling for file operations, terminal commands, and code editing. If a model does not support tools, it will break on anything beyond simple chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Provider outages are inevitable. Bifrost sorts providers by weight and retries on failure. If Claude goes down, Gemini picks up. If Gemini fails, traffic falls back to OpenAI. Your Codex CLI session is never interrupted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover docs&lt;/a&gt; explain the retry logic in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: AI Gateway Options for Codex CLI
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Direct API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing overhead&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;td&gt;~8 milliseconds&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget controls&lt;/td&gt;
&lt;td&gt;4-tier hierarchy&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI compatible&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM works as a proxy for Codex CLI, but the Python runtime adds measurable latency. When every Codex CLI request goes through the gateway, those milliseconds compound. For a tool sitting in the critical path of your coding workflow, overhead matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Route Codex CLI Through an AI Gateway?
&lt;/h3&gt;

&lt;p&gt;Three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start Bifrost (&lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;/logout&lt;/code&gt; in Codex CLI to clear OAuth&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to point at Bifrost&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is it. Configure your providers in the Bifrost config, and Codex CLI routes to any model you specify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget and Observability
&lt;/h2&gt;

&lt;p&gt;Once all Codex CLI traffic flows through Bifrost, you get cost controls and logging for free. The four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; lets you cap spend at the virtual key, team, or provider level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-cli-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; logs every request: latency, tokens, cost, which provider handled it. When you are routing across three providers, this data tells you exactly where your money goes and which model performs best for your tasks.&lt;/p&gt;
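
&lt;p&gt;For example, with logs exported as JSON lines you can sum spend per provider in one command. The field names here are assumptions; check the observability schema for the exact keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical: total cost per provider from exported JSONL request logs
jq -s 'group_by(.provider)
       | map({provider: .[0].provider, total_cost: (map(.cost) | add)})' requests.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;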

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; also helps. Repeated or similar queries hit the cache instead of the provider. Cuts both cost and latency for common operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OAuth quirk is easy to miss.&lt;/strong&gt; If you skip the &lt;code&gt;/logout&lt;/code&gt; step, Codex CLI silently ignores your gateway config. There is no error. It just routes to OpenAI directly. I lost an hour to this before checking the docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use is non-negotiable.&lt;/strong&gt; Not every model supports function calling well enough for Codex CLI. Stick to models with solid tool use: Claude Sonnet, GPT-4o, Gemini 2.5 Pro. Smaller or older models may fail on file operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; You run and maintain the gateway. No managed cloud version for the open-source release. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; helps with access control, but ops is on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; One more process in the chain. The 11 microsecond overhead is negligible, but it is still something to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Logout from OpenAI OAuth in Codex CLI&lt;/span&gt;
&lt;span class="c"&gt;# Inside Codex CLI, run: /logout&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Codex CLI at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1

&lt;span class="c"&gt;# 4. Use Codex CLI normally - it routes through Bifrost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are using Codex CLI for real work, routing through an AI gateway gives you model flexibility, failover, and cost visibility that you cannot get from a single provider. I benchmarked the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;performance&lt;/a&gt; and the overhead is genuinely negligible.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>What is LLM Orchestration and How AI Gateways Enable It</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:37:58 +0000</pubDate>
      <link>https://dev.to/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</link>
      <guid>https://dev.to/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</guid>
      <description>&lt;p&gt;Most teams start with one LLM provider. Then they add a second for cost reasons. Then a third for latency. Six months in, they have a tangled mess of provider-specific SDKs, manual failover logic, and zero visibility into what anything costs. That mess is the problem LLM orchestration solves.&lt;/p&gt;

&lt;p&gt;I evaluated how teams handle multi-model routing at scale. Custom code, orchestration frameworks, AI gateways. Here is what works and what just adds overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is the practice of managing multiple LLM providers, models, and configurations through a unified control layer. Instead of hard-coding provider logic into your application, you route, balance, cache, and monitor all LLM traffic from one place.&lt;/p&gt;

&lt;p&gt;Think of it like a load balancer, but purpose-built for AI workloads. It handles which model gets which request, what happens when a provider goes down, how costs are tracked, and where the logs go.&lt;/p&gt;
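
&lt;p&gt;To make the load balancer analogy concrete, here is a toy sketch of what weight-based selection means. A real orchestration layer adds retries, health checks, and cost tracking on top, but the core decision is this simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# toy 70/30 split between two models; purely illustrative
if [ $((RANDOM % 100)) -lt 70 ]; then
  echo "route to openai/gpt-4o"
else
  echo "route to anthropic/claude-sonnet"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;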

&lt;p&gt;The core components of LLM orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt; - Deciding which model handles each request based on weight, cost, or capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt; - Automatically switching to a backup provider when one fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; - Distributing requests across providers to avoid rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost governance&lt;/strong&gt; - Enforcing budgets per team, project, or API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt; - Avoiding duplicate calls for identical or semantically similar prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - Tracking latency, tokens, costs, and errors across every request&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Do You Need LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;If you are calling one model from one provider, you do not. The moment any of these are true, you do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple models (GPT-4o, Claude, Gemini) serving different use cases&lt;/li&gt;
&lt;li&gt;Multiple teams sharing the same providers&lt;/li&gt;
&lt;li&gt;Budget limits per team or per project&lt;/li&gt;
&lt;li&gt;Uptime requirements that demand automatic failover&lt;/li&gt;
&lt;li&gt;Cost optimisation that requires routing cheaper queries to cheaper models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I measured what happens without orchestration. Teams I evaluated had 15-30% higher LLM costs from duplicate calls, multi-minute outages during provider incidents because there was no failover, and zero per-team cost attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an AI Gateway Handle Orchestration?
&lt;/h2&gt;

&lt;p&gt;An AI gateway is the infrastructure layer that makes LLM orchestration practical. Without a gateway, you are building every orchestration component yourself. With one, you configure it.&lt;/p&gt;

&lt;p&gt;Here is the comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Orchestration Feature&lt;/th&gt;
&lt;th&gt;DIY (Custom Code)&lt;/th&gt;
&lt;th&gt;AI Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-model routing&lt;/td&gt;
&lt;td&gt;Custom SDK per provider, manual selection&lt;/td&gt;
&lt;td&gt;Config-based weighted routing, auto-normalised&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover&lt;/td&gt;
&lt;td&gt;Try/catch with manual retry logic&lt;/td&gt;
&lt;td&gt;Automatic, sorted by weight, instant retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;Custom queue + rate tracking&lt;/td&gt;
&lt;td&gt;Built-in weighted distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost governance&lt;/td&gt;
&lt;td&gt;Manual token counting + billing integration&lt;/td&gt;
&lt;td&gt;Budget hierarchy with auto-enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Redis/Memcached with custom key logic&lt;/td&gt;
&lt;td&gt;Semantic + exact-match, built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Custom logging + dashboards&lt;/td&gt;
&lt;td&gt;Real-time streaming, filters, sub-millisecond overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Ongoing engineering effort&lt;/td&gt;
&lt;td&gt;Configuration changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DIY approach works for prototypes. For production with multiple teams and providers, it becomes a full-time job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up LLM Orchestration with Bifrost
&lt;/h2&gt;

&lt;p&gt;I tested several gateways for model orchestration. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; stood out on performance: 11 microsecond overhead, 5000 RPS throughput, written in Go. That matters because your orchestration layer should not become a bottleneck.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weighted Routing Config
&lt;/h3&gt;

&lt;p&gt;This is where LLM orchestration starts. You define providers with weights, and Bifrost &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routes traffic accordingly&lt;/a&gt;. Weights are auto-normalised to sum to 1.0, and routing is deny-by-default: only providers you explicitly configure are eligible to receive traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of traffic goes to GPT-4o. 30% to Claude. If OpenAI goes down, Bifrost &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatically fails over&lt;/a&gt; to the next provider sorted by weight. No code changes. No redeployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget Governance
&lt;/h3&gt;

&lt;p&gt;Bifrost uses a &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config. Each tier can have independent spend limits and rate limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team-key"&lt;/span&gt;
    &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_spend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits their budget, requests are denied. Not throttled. Denied. That is real &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;cost governance&lt;/a&gt;, not just monitoring.&lt;/p&gt;
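
&lt;p&gt;On the application side, a denial just surfaces as a failed request. Here is a minimal sketch of handling it with the OpenAI SDK pointed at the gateway; the &lt;code&gt;/v1&lt;/code&gt; path and the virtual key value are assumptions, and the exact status code for budget denials may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI, APIStatusError

# Assumptions: gateway on its default port 8080, OpenAI-compatible path at /v1,
# and a Bifrost virtual key standing in for the provider key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-virtual-key")

try:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarise yesterday's deploy log"}],
    )
    print(resp.choices[0].message.content)
except APIStatusError as err:
    # A team that has exhausted its budget gets a hard denial, not a slowdown.
    print("request rejected by the gateway:", err.status_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

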

&lt;h3&gt;
  
  
  Semantic Caching
&lt;/h3&gt;

&lt;p&gt;Bifrost runs &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt;: exact hash matching for identical prompts, plus semantic similarity for prompts that mean the same thing but are worded differently. Both layers reduce redundant API calls without any application code changes.&lt;/p&gt;
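
&lt;p&gt;The mechanics are easy to picture. The sketch below is not Bifrost's code, just the idea behind a dual-layer lookup: an exact hash check first, then a vector-similarity pass over cached prompt embeddings. The &lt;code&gt;embed&lt;/code&gt; stub and the 0.95 threshold are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

exact_cache = {}     # layer 1: prompt hash mapped to a cached response
vector_index = []    # layer 2: (embedding, cached response) pairs

def embed(prompt):
    # Stand-in for a real embedding model call.
    return [float(ord(c)) for c in prompt[:16].ljust(16)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

def lookup(prompt, threshold=0.95):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                   # identical prompt: cheapest possible hit
        return exact_cache[key]
    query = embed(prompt)
    for vec, response in vector_index:       # same meaning, different wording
        if cosine(query, vec) &amp;gt;= threshold:
            return response
    return None                              # miss: call the provider, then store in both layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exact matches skip the embedding call entirely; the similarity layer is what catches rephrasings of the same question.&lt;/p&gt;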

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Every request is logged with &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;less than 0.1ms overhead&lt;/a&gt;. 14+ filters for slicing data. WebSocket-based live streaming so you can watch requests in real time. No separate logging pipeline needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About MCP Workloads?
&lt;/h2&gt;

&lt;p&gt;If you are running MCP servers with 500+ tools, orchestration gets expensive fast. Every tool definition eats tokens. Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP Code Mode&lt;/a&gt; cuts token usage by up to 92% by loading tool definitions on demand instead of injecting every schema into the context window. That is a direct cost saving on top of the orchestration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is where to be careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it in your infrastructure. If you want a fully managed SaaS gateway, this is not it. For teams with compliance requirements, self-hosted is actually a benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go-based, not Python.&lt;/strong&gt; If your team needs to extend gateway logic in Python, the codebase will be unfamiliar. The upside is the 11 microsecond latency that Python gateways cannot match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration over code.&lt;/strong&gt; Bifrost favours YAML/UI config over programmatic SDKs. If you need deeply custom routing logic (like routing based on prompt content analysis), you will need to handle that at the application layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You may not need a gateway at all.&lt;/strong&gt; For simple single-provider setups, a gateway is overkill. If you are only using OpenAI and do not need failover or budgets, just call the API directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between LLM orchestration and LLM routing?
&lt;/h3&gt;

&lt;p&gt;LLM routing is one component of orchestration. Routing decides which model handles a request. Orchestration includes routing plus failover, caching, budgets, load balancing, and observability. Multi-model routing is necessary but not sufficient for production AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I do LLM orchestration without a gateway?
&lt;/h3&gt;

&lt;p&gt;Technically, yes. You can build routing, failover, caching, and observability yourself. Practically, I have seen teams spend 2-3 engineering months building what a gateway provides out of the box. And then they still need to maintain it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI gateway compare to LangChain for orchestration?
&lt;/h3&gt;

&lt;p&gt;LangChain is a framework for building LLM applications. An AI gateway is infrastructure for managing LLM traffic. They solve different problems. You can use both: LangChain for application logic, and a gateway like &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for orchestration underneath. Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for OpenAI's API format, so integration is straightforward.&lt;/p&gt;
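
&lt;p&gt;As a sketch of that split, here is LangChain application code talking to the gateway instead of OpenAI directly. The &lt;code&gt;/v1&lt;/code&gt; path and the virtual key value are assumptions about how the OpenAI-compatible endpoint is exposed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_openai import ChatOpenAI

# Application logic stays in LangChain; routing, failover, caching, and budgets
# happen in the gateway underneath this one base_url override.
llm = ChatOpenAI(
    model="gpt-4o",
    base_url="http://localhost:8080/v1",  # assumed OpenAI-compatible endpoint
    api_key="bifrost-virtual-key",        # gateway virtual key, not a provider key
)

print(llm.invoke("Draft a one-line release note for v2.3").content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

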

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is not optional once you are running multiple models in production. The question is whether you build it or use a gateway. I have tested both paths. The gateway approach - specifically Bifrost at &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;11 microsecond overhead&lt;/a&gt; - saves engineering time and gives you better observability from day one.&lt;/p&gt;

&lt;p&gt;Star the repo: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;&lt;br&gt;
Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>devops</category>
    </item>
    <item>
      <title>MCP at Scale: Access Control, Cost Governance, and 92% Lower Token Costs</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:44:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</link>
      <guid>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Tax on Every MCP Request
&lt;/h2&gt;

&lt;p&gt;Here is something nobody talks about when they demo MCP integrations: token costs at scale.&lt;/p&gt;

&lt;p&gt;I have been running MCP setups with increasing numbers of connected servers. The pattern is always the same. You connect a few servers, everything works brilliantly. You connect a dozen, costs start climbing. You connect sixteen servers with 500+ tools, and suddenly your token budget is gone before the model even starts thinking about your actual query.&lt;/p&gt;

&lt;p&gt;Why? Every tool definition from every connected server gets injected into the model's context on every single request. 150+ tool definitions can consume the majority of your token budget. And there is zero access control. Any consumer can call any tool. No cost tracking at tool level.&lt;/p&gt;
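
&lt;p&gt;A rough back-of-the-envelope makes the tax concrete. The per-definition figure below is an assumption for illustration (real schemas vary a lot); the point is that the overhead scales linearly with connected tools, before the model has read a word of your actual query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: assume an average MCP tool definition costs about 250 tokens
# once its name, description, and JSON schema are serialised into context.
TOKENS_PER_TOOL = 250

for tool_count in (20, 150, 500):
    overhead = tool_count * TOKENS_PER_TOOL
    print(tool_count, "tools:", overhead, "context tokens on every request")

# 20 tools:  5,000 tokens   - barely noticeable
# 150 tools: 37,500 tokens  - a large slice of the budget
# 500 tools: 125,000 tokens - most of a typical context window, on every request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

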

&lt;p&gt;This is unsustainable for production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Tested Bifrost's Code Mode Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a fundamentally different approach to this problem. Instead of dumping all tool definitions into the context window, it exposes a virtual filesystem of Python stub files. The model discovers tools on-demand through four meta-tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt; - discover available servers and tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt; - load specific function signatures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt; - fetch detailed documentation only when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt; - run scripts in a sandboxed Starlark interpreter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the model only loads what it actually needs for the current query. If you ask it to read a file, it does not need to know about your Slack, GitHub, Jira, and database tools all at once.&lt;/p&gt;

&lt;p&gt;Here is what a typical tool discovery flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Model calls listToolFiles to see available servers
&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;listToolFiles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: ["filesystem/", "github/", "slack/", "jira/", ...]
&lt;/span&gt;
&lt;span class="c1"&gt;# Model identifies it needs filesystem tools for this query
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readToolFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns only the function signature for filesystem_read
&lt;/span&gt;
&lt;span class="c1"&gt;# Model fetches docs only if needed
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getToolDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Executes with full sandboxing
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;executeToolCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/src/main.go&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is lazy loading for LLM tool contexts. Simple idea. Massive impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results: 3 Controlled Rounds
&lt;/h2&gt;

&lt;p&gt;I ran three controlled rounds, scaling from 6 servers to 16 servers. Every round maintained a 100% task pass rate. The model completed every task correctly while using dramatically fewer tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Servers&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;58.2%&lt;/td&gt;
&lt;td&gt;55.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;508&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;92.8%&lt;/td&gt;
&lt;td&gt;92.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At roughly 500 tools, Code Mode reduces per-query token usage by about 14x. From 1.15M tokens down to 83K. That is not an incremental improvement. That is a different cost structure entirely.&lt;/p&gt;

&lt;p&gt;The savings compound non-linearly. As you add more tools, the percentage saved increases because Code Mode's overhead stays roughly constant while traditional mode scales linearly with tool count.&lt;/p&gt;

&lt;p&gt;For full benchmark methodology, check the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control That Actually Works
&lt;/h2&gt;

&lt;p&gt;Token savings are great, but production MCP deployments need &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt;. Bifrost handles this through two mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Keys&lt;/strong&gt; let you create scoped credentials per user, team, or customer. You can scope at the tool level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team-key"&lt;/span&gt;
  &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_read&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_query&lt;/span&gt;
  &lt;span class="na"&gt;blocked_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_delete&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filesystem_write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Allow &lt;code&gt;database_read&lt;/code&gt; and &lt;code&gt;database_query&lt;/code&gt;, block &lt;code&gt;database_delete&lt;/code&gt; and &lt;code&gt;filesystem_write&lt;/code&gt;. Fine-grained, declarative, no code changes needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool Groups&lt;/strong&gt; are named collections of tools from multiple servers. You create a group, attach it to keys, teams, or users. No database queries at resolve time. This is important when you are running at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;5000 RPS&lt;/a&gt; and cannot afford lookup latency.&lt;/p&gt;
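
&lt;p&gt;The "no database queries at resolve time" point is just precomputation: groups are expanded into an in-memory map when configuration loads, so checking a key's access at request time is a set lookup. Here is a conceptual sketch with hypothetical group and tool names, not Bifrost's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Config load time: expand named tool groups once.
TOOL_GROUPS = {
    "read-only-data": ["database_read", "database_query", "filesystem_read"],
    "ci-bots": ["github_create_issue", "slack_post_message"],
}

VIRTUAL_KEYS = {
    "data-team-key": {"groups": ["read-only-data"]},
    "automation-key": {"groups": ["read-only-data", "ci-bots"]},
}

# Precompute the allowed tool set per key so request-time checks stay in memory.
RESOLVED = {
    key: frozenset(t for g in spec["groups"] for t in TOOL_GROUPS[g])
    for key, spec in VIRTUAL_KEYS.items()
}

def is_allowed(virtual_key, tool):
    # Request time: a set membership test, no database round trip.
    return tool in RESOLVED.get(virtual_key, frozenset())

print(is_allowed("data-team-key", "database_read"))       # True
print(is_allowed("data-team-key", "slack_post_message"))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

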

&lt;h2&gt;
  
  
  Per-Tool Observability
&lt;/h2&gt;

&lt;p&gt;Every tool execution gets logged with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name and server source&lt;/li&gt;
&lt;li&gt;Arguments passed and results returned&lt;/li&gt;
&lt;li&gt;Execution latency&lt;/li&gt;
&lt;li&gt;Virtual key that initiated the call&lt;/li&gt;
&lt;li&gt;Parent LLM request context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost at the tool level&lt;/a&gt; alongside LLM token costs. This matters when your finance team asks why the AI bill doubled last month. You can point to exactly which tools, which teams, and which queries drove the spend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits&lt;/a&gt; let you set spending caps per virtual key, so no single team can blow through the monthly allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Flexibility
&lt;/h2&gt;

&lt;p&gt;Bifrost supports four &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP connection types&lt;/a&gt;: STDIO, HTTP, SSE, and in-process via the Go SDK. OAuth 2.0 with PKCE and automatic token refresh is built in. Health monitoring with automatic reconnects keeps things running without manual intervention.&lt;/p&gt;

&lt;p&gt;You can run it in manual approval mode where a human reviews tool calls, or in autonomous agent loop mode where the model chains tool calls independently.&lt;/p&gt;

&lt;p&gt;For Claude Code and Cursor users, the &lt;code&gt;/mcp&lt;/code&gt; endpoint integrates directly. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup takes minutes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curve for Code Mode.&lt;/strong&gt; The virtual filesystem abstraction is elegant, but it is a new mental model. Teams used to traditional MCP tool injection will need to understand why their tools are now "files" the model reads on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-tool overhead on simple queries.&lt;/strong&gt; If you only have 10-20 tools, the extra round trips through the four meta-tools (listToolFiles, readToolFile, and so on) can cost more than on-demand loading saves. The real wins kick in above 50-100 tools. Below that threshold, traditional mode works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlark sandbox limitations.&lt;/strong&gt; The sandboxed Starlark interpreter is secure by design, but it means tool code runs in a restricted environment. Complex tool implementations may need adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency on gateway availability.&lt;/strong&gt; Adding a gateway layer means one more component to monitor. Bifrost's 11 microsecond latency and Go-based architecture make this a non-issue in practice, but it is still an additional piece of infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Care
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 50 MCP tools, you probably do not need Code Mode yet. Traditional tool injection works fine at that scale.&lt;/p&gt;

&lt;p&gt;If you are running 100+ tools across multiple servers, or if you need per-team access control, or if your CFO is asking questions about AI infrastructure costs, this is worth evaluating.&lt;/p&gt;

&lt;p&gt;The 92% cost reduction at 500+ tools is the headline number, but the governance features (virtual keys, tool groups, audit logging) are what make it production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Bifrost is open-source and written in Go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; - star it if this is useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP documentation&lt;/a&gt; - full setup guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Governance docs&lt;/a&gt; - virtual keys, tool groups, budgets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt; - up and running in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have been testing a lot of MCP tooling lately. Bifrost's approach to the context window problem is the most practical solution I have seen. The lazy loading pattern for tool definitions should honestly be how all MCP gateways work.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and give it a spin. Happy to discuss benchmarks or setup in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Track LLM Costs and Rate Limits on AWS Bedrock with an AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:21:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</link>
      <guid>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</guid>
      <description>&lt;p&gt;Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.&lt;/p&gt;

&lt;p&gt;I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with AWS Native Cost Tracking
&lt;/h2&gt;

&lt;p&gt;AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get from CloudWatch + Cost Explorer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate Bedrock spend per region&lt;/li&gt;
&lt;li&gt;Invocation counts at the service level&lt;/li&gt;
&lt;li&gt;Basic alarms on total spend thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you do not get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-model token-level cost breakdowns&lt;/li&gt;
&lt;li&gt;Team or project-level budget enforcement&lt;/li&gt;
&lt;li&gt;Rate limiting by user, team, or API key&lt;/li&gt;
&lt;li&gt;Real-time cost tracking per request&lt;/li&gt;
&lt;li&gt;Automatic routing away from providers that exceed limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Approach
&lt;/h2&gt;

&lt;p&gt;An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.&lt;/p&gt;

&lt;p&gt;I tested three approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Native (CloudWatch + Cost Explorer)&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM-specific cost tracking&lt;/td&gt;
&lt;td&gt;Aggregate only&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget hierarchy&lt;/td&gt;
&lt;td&gt;Account-level billing alerts&lt;/td&gt;
&lt;td&gt;Basic budget controls&lt;/td&gt;
&lt;td&gt;4-tier: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;No native LLM rate limits&lt;/td&gt;
&lt;td&gt;Basic rate limiting&lt;/td&gt;
&lt;td&gt;VK + Provider Config level, token and request limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset durations&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Limited options&lt;/td&gt;
&lt;td&gt;1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (provider type "bedrock")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~8ms (Python)&lt;/td&gt;
&lt;td&gt;11 microseconds (Go)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Self-hosted or cloud&lt;/td&gt;
&lt;td&gt;Self-hosted (runs in your VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost with AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Configure Bedrock as a provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-mistral"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral.mistral-large-2407-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover all Bedrock model formats and region options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four-Tier Budget Hierarchy
&lt;/h2&gt;

&lt;p&gt;This is where Bifrost separates itself from everything else I tested. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget system&lt;/a&gt; has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;team_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1w"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.&lt;/p&gt;

&lt;p&gt;Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.&lt;/p&gt;
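
&lt;p&gt;Here is the arithmetic as a worked example. The per-token prices are illustrative assumptions (check current Bedrock pricing); the structure of the check is the point: the request cost is computed from actual token counts, and every tier needs headroom for it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative per-million-token prices; real Bedrock pricing varies by model and region.
INPUT_PRICE_PER_M = 3.00     # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00   # USD per 1M output tokens (assumed)

def request_cost(input_tokens, output_tokens):
    return (
        (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
        + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    )

cost = request_cost(input_tokens=12_000, output_tokens=1_500)
print(round(cost, 4))  # 0.0585 at these assumed prices

# Every tier must have headroom for the request to pass.
limits = {"customer": 5000.0, "team": 2000.0, "virtual_key": 500.0, "provider": 1000.0}
spent = {"customer": 4999.99, "team": 1200.00, "virtual_key": 480.00, "provider": 700.00}

blocked = [tier for tier, limit in limits.items() if spent[tier] + cost &amp;gt; limit]
print(blocked or "request goes through")  # ['customer'] blocks this one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

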

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; have the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting That Actually Works for LLMs
&lt;/h2&gt;

&lt;p&gt;AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.&lt;/p&gt;
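
&lt;p&gt;The difference is easy to see with a concrete timestamp. A quick sketch of a rolling 24-hour window versus a calendar-aligned "1d" reset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

first_request = datetime(2026, 4, 13, 17, 45, tzinfo=timezone.utc)

# Rolling window: 24 hours after the first request.
rolling_reset = first_request + timedelta(days=1)

# Calendar-aligned "1d": the next midnight UTC, regardless of when usage started.
calendar_reset = (first_request + timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

print(rolling_reset.isoformat())   # 2026-04-14T17:45:00+00:00
print(calendar_reset.isoformat())  # 2026-04-14T00:00:00+00:00, hours earlier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

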

&lt;p&gt;Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.&lt;/p&gt;
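
&lt;p&gt;Conceptually, that exclusion is a filter on the candidate pool before weighted selection happens. This is an illustrative sketch of the idea, not Bifrost's implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;PROVIDERS = [
    {"id": "bedrock-claude", "weight": 80},
    {"id": "bedrock-mistral", "weight": 20},
]
REQUESTS_THIS_HOUR = {"bedrock-claude": 500, "bedrock-mistral": 120}
HOURLY_LIMIT = {"bedrock-claude": 500, "bedrock-mistral": 1000}

# Providers at their limit drop out of the pool; the rest keep serving traffic.
candidates = [
    p for p in PROVIDERS
    if REQUESTS_THIS_HOUR[p["id"]] &amp;lt; HOURLY_LIMIT[p["id"]]
]
print([p["id"] for p in candidates])  # ['bedrock-mistral'] until the window resets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

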

&lt;h2&gt;
  
  
  Observability at Sub-Millisecond Overhead
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost is captured: tokens used, latency, cost, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.&lt;/p&gt;

&lt;p&gt;What makes this useful for AWS teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;14+ API filter options&lt;/strong&gt; for querying logs. Filter by model, provider, team, cost range, status code, time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket live updates.&lt;/strong&gt; Watch requests flow through in real time. Useful during load testing or incident debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single pane across providers.&lt;/strong&gt; If you are running Bedrock plus OpenAI or Gemini as &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt;, all logs are in one place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralised view saves real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool solves everything. Here is what to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM has broader provider coverage.&lt;/strong&gt; 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS native tools have zero overhead.&lt;/strong&gt; If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go vs Python matters at scale.&lt;/strong&gt; Bifrost's 11 microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; back this up: 5,000 RPS on a single instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is a newer project.&lt;/strong&gt; The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stick with AWS native tools if:&lt;/strong&gt; You have one team, one model, and just need billing alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider LiteLLM if:&lt;/strong&gt; You need maximum provider coverage and are comfortable with Python-based overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Bifrost if:&lt;/strong&gt; You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; and automatic failover alongside cost controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost in your VPC&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure Bedrock providers in bifrost.yaml&lt;/span&gt;

&lt;span class="c"&gt;# 3. Set budget and rate limit tiers&lt;/span&gt;

&lt;span class="c"&gt;# 4. Point your application at the gateway&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Bedrock request now has cost tracking, rate limiting, and observability built in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to dig into the source.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Best Claude Code Gateway for Multi-Model Routing</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:29:45 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</link>
      <guid>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</guid>
      <description>&lt;p&gt;Claude Code is great until you need more than one model. You hit a rate limit on Anthropic, want Gemini for long context, or need GPT-4o for a specific task. The default setup gives you no way to route across providers.&lt;/p&gt;

&lt;p&gt;I spent a week testing gateways that sit between Claude Code and LLM providers. The goal was simple: configure multiple models, set routing weights, get automatic failover, and keep Claude Code working normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; was the clear winner. Open-source, written in Go, 11 microsecond overhead per request. Here is how I set up multi-model routing and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Model Routing Matters
&lt;/h2&gt;

&lt;p&gt;Different models are good at different things. Claude Sonnet handles tool use well. GPT-4o is strong at certain code generation tasks. Gemini 2.5 Pro handles massive context windows. Using one model for everything means you are leaving performance on the table.&lt;/p&gt;

&lt;p&gt;Multi-model routing lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split traffic across providers by weight&lt;/li&gt;
&lt;li&gt;Fail over automatically when a provider goes down&lt;/li&gt;
&lt;li&gt;Pin specific models for specific tasks&lt;/li&gt;
&lt;li&gt;Control costs by routing cheaper models for simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: Claude Code talks to &lt;code&gt;api.anthropic.com&lt;/code&gt; by default. No native multi-model support. You need a gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Bifrost as a Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint. Claude Code does not know a gateway exists. It sends standard requests, and Bifrost translates and routes them to whatever provider you configure.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Connect
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup guide&lt;/a&gt; has the details.&lt;/p&gt;

&lt;p&gt;Point Claude Code at Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; here is a Bifrost virtual key, not your actual Anthropic key. Provider keys live in the Bifrost config. This is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the Anthropic API.&lt;/p&gt;

&lt;p&gt;Done. Every Claude Code request now flows through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weighted Routing Configuration
&lt;/h2&gt;

&lt;p&gt;This is the core of multi-model routing. You assign weights to providers, and Bifrost distributes traffic accordingly. Weights auto-normalize to sum to 1.0, so you can use any numbers.&lt;/p&gt;
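
&lt;p&gt;The normalization is just division by the total, which is why any numbers work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

print(normalize([70, 30]))     # [0.7, 0.3]
print(normalize([80, 15, 5]))  # [0.8, 0.15, 0.05]
print(normalize([7, 3]))       # same split as 70/30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

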

&lt;p&gt;Here is a config that splits traffic between GPT-4o and Claude Sonnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of requests go to GPT-4o. 30% to Claude Sonnet. I used this to compare output quality across providers in real coding sessions without manually switching anything.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all the configuration options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important detail:&lt;/strong&gt; cross-provider routing does not happen automatically. You must explicitly configure each provider in your config. Bifrost does not guess or infer routing rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;Weighted routing is useful. Automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; is essential. Providers go down. Rate limits hit. You do not want your Claude Code session to break mid-task.&lt;/p&gt;

&lt;p&gt;Bifrost sorts providers by weight and retries on failure. If the primary provider fails, the next one picks up the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI goes down, Bifrost retries with Gemini. Gemini fails, Bifrost falls back to Anthropic. My coding session never gets interrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Pinning for Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;If your team uses AWS Bedrock or Google Vertex AI, you can pin specific models directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bedrock&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bedrock/global.anthropic.claude-sonnet-4-6"&lt;/span&gt;

&lt;span class="c"&gt;# Vertex AI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vertex/claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also override the model mid-session using the &lt;code&gt;--model&lt;/code&gt; flag or the &lt;code&gt;/model&lt;/code&gt; command inside Claude Code. This is useful when you want different models for different parts of a task: start with Sonnet for scaffolding, switch to GPT-4o for a tricky implementation, then switch back. The gateway handles the translation layer for each provider.&lt;/p&gt;
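
&lt;p&gt;As a rough sketch of what that looks like in practice (the model names are just whatever you exposed through the gateway, and the session flow is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start a session pinned to Sonnet through the gateway
claude --model claude-sonnet-4-20250514

# Inside the session, switch to GPT-4o for a tricky implementation:
#   /model gpt-4o
# ...then switch back when you are done:
#   /model claude-sonnet-4-20250514
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
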

&lt;p&gt;This is one area where the &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK compatibility&lt;/a&gt; matters. Bifrost maintains full compatibility with the Anthropic message format, so model pinning and switching work without any client-side changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration docs&lt;/a&gt; list all supported providers and model formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Controls Across Providers
&lt;/h2&gt;

&lt;p&gt;Once all traffic flows through one gateway, cost management becomes straightforward. Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-code-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a limit. When it is reached, requests get blocked. No surprise bills from a runaway Claude Code session.&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; handles rate limiting, access control, and spend management across all configured providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Across All Providers
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost gets logged: latency, token count, cost, provider used, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you a single view across all providers.&lt;/p&gt;

&lt;p&gt;This is particularly useful with multi-model routing. You can see exactly which provider handled each request, compare response times across models, and track per-provider costs. When I was running 70/30 weighted routing between GPT-4o and Claude Sonnet, the observability data showed me exactly how each model performed on real coding tasks: response times, token consumption, and cost per request, all in one place.&lt;/p&gt;

&lt;p&gt;Without centralized logging, you are checking multiple provider dashboards and guessing which model handled what. That is not sustainable when you are running multiple providers through Claude Code daily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter streaming limitation.&lt;/strong&gt; OpenRouter does not stream function call arguments properly. This causes file operation failures in Claude Code. If you use OpenRouter as a provider, expect issues with tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-Anthropic model requirements.&lt;/strong&gt; Any non-Anthropic model you route through must support tool use. Claude Code relies heavily on function calling. Models without proper tool support will fail on file operations, search, and other agent tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version requires you to run and maintain the gateway. There is no managed cloud offering. That means monitoring, updating, and debugging are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newer project.&lt;/strong&gt; Bifrost's community is growing but still smaller than older alternatives. Documentation is solid, but edge cases may require digging through issues on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; You are adding a process between Claude Code and your provider. The 11 microsecond overhead is negligible, but it is one more thing in the chain to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran benchmarks matching the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt;. The numbers held up: 11 microseconds of routing overhead, 5,000 requests per second on a single instance. The Go implementation makes a real difference. Python-based gateways I tested added significantly more latency.&lt;/p&gt;
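
&lt;p&gt;If you just want a rough feel for throughput on your own hardware before following the full guide, a generic HTTP load generator is enough. This is not the official benchmark setup: the &lt;code&gt;hey&lt;/code&gt; tool, endpoint path, and payload below are my assumptions, and a sustained run sends real provider traffic, so keep it short or point it at a cached or mock upstream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough load check with hey (github.com/rakyll/hey), not the official Bifrost harness.
# Careful: this sends real requests upstream; keep the duration short.
hey -z 10s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-bifrost-virtual-key" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}' \
  http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
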

&lt;p&gt;For a gateway that sits in the critical path of every LLM call, low overhead matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure providers in bifrost.yaml (weighted routing + failover)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Claude Code at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key

&lt;span class="c"&gt;# 4. Use Claude Code normally&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Your Claude Code session now routes across multiple models with automatic failover and budget controls.&lt;/p&gt;
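
&lt;p&gt;Before opening Claude Code, you can confirm the wiring by calling the gateway directly with the same base URL and key. This assumes Bifrost forwards the standard Anthropic Messages path (&lt;code&gt;/v1/messages&lt;/code&gt;) under the &lt;code&gt;/anthropic&lt;/code&gt; prefix; adjust if your deployment differs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sanity check: call the gateway the same way Claude Code will.
# Assumption: the Anthropic-compatible route is $ANTHROPIC_BASE_URL/v1/messages.
curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 32, "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
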

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running Claude Code for real work, multi-model routing is not optional. Single-provider setups break at the worst times. A gateway that handles routing, failover, and cost controls in one place saves hours of debugging and thousands in unexpected spend.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
