<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay Batta</title>
    <description>The latest articles on DEV Community by Pranay Batta (@pranay_batta).</description>
    <link>https://dev.to/pranay_batta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3652594%2F9d2926ca-eede-4542-b782-4feb2ced66f1.jpg</url>
      <title>DEV Community: Pranay Batta</title>
      <link>https://dev.to/pranay_batta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranay_batta"/>
    <language>en</language>
    <item>
      <title>Migrating from LiteLLM to Bifrost: A Step-by-Step Guide</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 05 May 2026 09:46:46 +0000</pubDate>
      <link>https://dev.to/pranay_batta/migrating-from-litellm-to-bifrost-a-step-by-step-guide-don</link>
      <guid>https://dev.to/pranay_batta/migrating-from-litellm-to-bifrost-a-step-by-step-guide-don</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I migrated a production LLM workload from LiteLLM to Bifrost and the swap took about 30 minutes for the gateway, plus a few config translations. The OpenAI-compatible endpoint means application code did not change. This post walks through the full migration: config mapping, virtual key translation, semantic cache porting, and the gotchas I hit.&lt;/p&gt;

&lt;p&gt;This post assumes familiarity with LiteLLM proxy mode, OpenAI-compatible APIs, and basic Docker or Node.js operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Teams Are Looking at the Migration
&lt;/h2&gt;

&lt;p&gt;LiteLLM is the most widely adopted open-source LLM gateway and covers the breadth of providers well. The reasons I see teams move to Bifrost usually come down to one of three:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead.&lt;/strong&gt; LiteLLM proxy adds roughly 8 milliseconds per request. Bifrost adds about 11 microseconds at P99 at 5k RPS, orders of magnitude lower. For high-throughput agent workloads that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP support.&lt;/strong&gt; LiteLLM does not have MCP gateway functionality. If you are running Claude Code or building agentic workflows that hit dozens of tool servers, that gap shows up fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-layer semantic caching.&lt;/strong&gt; Bifrost ships exact match plus vector similarity caching with Weaviate, Redis, or Qdrant. LiteLLM has request-level caching but not the same dual-layer model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will walk through the migration assuming you have LiteLLM running today and want to switch without rewriting application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Run Bifrost Side by Side
&lt;/h2&gt;

&lt;p&gt;Before touching any application config, get Bifrost running next to LiteLLM. Bifrost defaults to port 8080, which does not collide with LiteLLM's usual 4000, so I leave both defaults alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup command for local testing. For production, the Docker option works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost setup docs&lt;/a&gt; cover persistent volumes and configuration mounting if you need them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Translate Provider Configs
&lt;/h2&gt;

&lt;p&gt;LiteLLM uses a &lt;code&gt;config.yaml&lt;/code&gt; with per-model &lt;code&gt;model_list&lt;/code&gt; entries. Bifrost configures at the provider level, not per model.&lt;/p&gt;

&lt;p&gt;LiteLLM config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/OPENAI_API_KEY&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mental shift: in LiteLLM, you list every model. In Bifrost, you list providers and use &lt;code&gt;allowed_models&lt;/code&gt; to filter. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover the full schema.&lt;/p&gt;

&lt;p&gt;One thing to watch: Bifrost is deny-by-default. If you forget to add a provider, every request to that provider returns a clear error. With LiteLLM, missing model entries return 404, which is harder to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Translate Virtual Keys and Budgets
&lt;/h2&gt;

&lt;p&gt;LiteLLM virtual keys map to Bifrost virtual keys, but the budget model is different.&lt;/p&gt;

&lt;p&gt;LiteLLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer-acme&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-acme-abc&lt;/span&gt;
    &lt;span class="na"&gt;max_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;rpm_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer-acme&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk-acme-abc&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;request_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;request_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;token_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500000&lt;/span&gt;
      &lt;span class="na"&gt;token_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
    &lt;span class="na"&gt;budget_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The big addition is the four-tier budget hierarchy. Customer, Team, Virtual Key, and Provider Config limits all apply independently. A request must pass all four. Reset durations are calendar-aligned for &lt;code&gt;1d&lt;/code&gt;, &lt;code&gt;1w&lt;/code&gt;, &lt;code&gt;1M&lt;/code&gt;, &lt;code&gt;1Y&lt;/code&gt; in UTC, which matters if your billing aligns to calendar months. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget and limits docs&lt;/a&gt; cover this in detail.&lt;/p&gt;
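&lt;p&gt;To make the calendar alignment concrete, here is a small illustrative sketch (plain Python, not Bifrost code) that computes the next reset boundary for a &lt;code&gt;1d&lt;/code&gt; and a &lt;code&gt;1M&lt;/code&gt; window in UTC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

def next_reset(now: datetime, duration: str) -&amp;gt; datetime:
    """Next calendar-aligned reset boundary in UTC (illustrative)."""
    now = now.astimezone(timezone.utc)
    if duration == "1d":
        # Next UTC midnight, not "24 hours after the first request".
        return (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    if duration == "1M":
        # First day of the next calendar month, UTC midnight.
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return now.replace(year=year, month=month, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported duration: {duration}")

now = datetime(2026, 5, 5, 9, 46, tzinfo=timezone.utc)
print(next_reset(now, "1d"))   # 2026-05-06 00:00:00+00:00
print(next_reset(now, "1M"))   # 2026-06-01 00:00:00+00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;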

&lt;h2&gt;
  
  
  Step 4: Update Application Endpoints
&lt;/h2&gt;

&lt;p&gt;Both gateways expose OpenAI-compatible endpoints, so application code stays the same. Only the base URL and the key value change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (LiteLLM)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://litellm:4000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-acme-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After (Bifrost)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://bifrost:8080/openai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk-acme-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost also exposes &lt;code&gt;/anthropic&lt;/code&gt; and &lt;code&gt;/genai&lt;/code&gt; endpoints if you want to keep using the native Anthropic or Gemini SDKs. The &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement docs&lt;/a&gt; cover the full endpoint matrix.&lt;/p&gt;
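&lt;p&gt;For services on the native Anthropic SDK, the swap has the same shape; a minimal sketch, assuming the &lt;code&gt;/anthropic&lt;/code&gt; route accepts the standard Messages API (check the drop-in replacement docs for the exact path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from anthropic import Anthropic

# Point the native SDK at Bifrost instead of api.anthropic.com and pass
# the virtual key where the Anthropic API key normally goes.
client = Anthropic(
    base_url="http://bifrost:8080/anthropic",
    api_key="vk-acme-abc",
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;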

&lt;h2&gt;
  
  
  Step 5: Port Semantic Caching
&lt;/h2&gt;

&lt;p&gt;LiteLLM has request-level caching with Redis. Bifrost has dual-layer caching with vector similarity. The migration is not 1:1; you are upgrading the cache model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;semantic_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vector_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weaviate&lt;/span&gt;
  &lt;span class="na"&gt;weaviate_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${WEAVIATE_URL}&lt;/span&gt;
  &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
  &lt;span class="na"&gt;conversation_history_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request needs the &lt;code&gt;x-bf-cache-key&lt;/code&gt; header to participate in caching. In application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-cache-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching docs&lt;/a&gt; cover threshold tuning and per-request overrides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;Python-bound&lt;/td&gt;
&lt;td&gt;5,000 RPS single instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual keys&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget tiers&lt;/td&gt;
&lt;td&gt;Single&lt;/td&gt;
&lt;td&gt;Four-tier (Customer/Team/VK/Provider)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Request-level&lt;/td&gt;
&lt;td&gt;Dual-layer with vector similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major providers + custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed cloud&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (self-hosted only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Bifrost is self-hosted only. If you are on LiteLLM Cloud today, migrating means taking on infrastructure operations yourself.&lt;/p&gt;

&lt;p&gt;The provider catalog is smaller. LiteLLM supports 100+ providers out of the box. Bifrost covers the major ones (OpenAI, Anthropic, Gemini, Bedrock, Vertex, Azure OpenAI, Ollama, Together, Groq, Cohere) and lets you add custom providers, but if you depend on a niche provider, check the list before migrating.&lt;/p&gt;

&lt;p&gt;The community is smaller and the project is newer. Documentation is solid but Stack Overflow answers and community plugins are still building up.&lt;/p&gt;

&lt;p&gt;OpenRouter compatibility is broken because of a tool call streaming issue. If your stack routes through OpenRouter today, you cannot keep that path through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Application code does not change because Bifrost is OpenAI-compatible&lt;/li&gt;
&lt;li&gt;Provider configs replace LiteLLM model lists, with provider-level filtering via &lt;code&gt;allowed_models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Virtual keys translate directly, but the four-tier budget hierarchy is new&lt;/li&gt;
&lt;li&gt;Semantic caching upgrades from request-level to dual-layer with vector similarity&lt;/li&gt;
&lt;li&gt;Run side by side first, cut over by changing the base URL, and roll back instantly if needed (see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
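&lt;p&gt;One way to make that cutover and rollback a pure config change is to read the gateway URL and key from the environment instead of hardcoding them (the variable names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from openai import OpenAI

# Flip LLM_GATEWAY_BASE_URL between the LiteLLM and Bifrost values to cut
# over or roll back without touching application code.
client = OpenAI(
    base_url=os.environ.get("LLM_GATEWAY_BASE_URL", "http://litellm:4000/v1"),
    api_key=os.environ["LLM_GATEWAY_API_KEY"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;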

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Website: &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;https://getmax.im/bifrost-home&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/quickstart/gateway/setting-up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/quickstart/gateway/provider-configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>gateway</category>
      <category>migration</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Rate Limiting in LLM Applications: Why You Need It and How to Build It</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 28 Apr 2026 05:58:35 +0000</pubDate>
      <link>https://dev.to/pranay_batta/rate-limiting-in-llm-applications-why-you-need-it-and-how-to-build-it-5gf4</link>
      <guid>https://dev.to/pranay_batta/rate-limiting-in-llm-applications-why-you-need-it-and-how-to-build-it-5gf4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting for LLM APIs requires counting tokens, not requests. A single 200K-token context window costs as much as 50 normal API calls. This post covers the gap between request-count limits and token-aware limits, and walks through implementation at both the application layer and the gateway layer.&lt;/p&gt;

&lt;p&gt;This post assumes familiarity with LLM APIs (OpenAI, Anthropic), basic Redis or caching concepts, and running AI applications in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Standard Rate Limiting Falls Short
&lt;/h2&gt;

&lt;p&gt;Most developers who have shipped web services know how to rate limit: count requests per user per time window, return 429 when the limit hits. That model breaks down with LLM APIs.&lt;/p&gt;

&lt;p&gt;LLM APIs charge by the token, not the request. A single API call with a 200,000-token context window costs as much as 50 calls with 4,000-token prompts. Request-count limits do nothing to prevent a single runaway call from consuming your entire daily budget.&lt;/p&gt;

&lt;p&gt;OpenAI's production limits expose this directly. Their rate limit tiers use tokens-per-minute (TPM) alongside requests-per-minute (RPM). Hitting the TPM ceiling causes 429s even when you are nowhere near the RPM limit. Building rate limiting that only tracks requests means your application hits provider limits in ways your own limits never predicted.&lt;/p&gt;

&lt;p&gt;Multi-tenant applications add another layer. A single customer running a batch job at 3am can exhaust your provider budget before the rest of your users wake up. Without per-customer limits, one heavy user affects everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Need to Limit
&lt;/h2&gt;

&lt;p&gt;Four distinct limit types matter in production LLM applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Request rate&lt;/strong&gt; — calls per minute or hour. Prevents burst abuse but does not control cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token rate&lt;/strong&gt; — tokens per minute or day. Directly correlates to cost and provider headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget cap&lt;/strong&gt; — total spend per period per customer or team. Hard stop before costs escalate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt; — limits enforced per user, per team, per customer, and per provider independently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams implement request rate first, add token rate after their first surprise invoice, and add budget caps after their second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Application-Level Implementation
&lt;/h2&gt;

&lt;p&gt;The direct approach is middleware that intercepts outgoing API calls, estimates token count before the request leaves your system, and rejects requests that would exceed the limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_token_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;estimated_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;window_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_usage:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incrby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ~4 characters per token, rough pre-call estimate
&lt;/span&gt;    &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but requires every service that makes LLM calls to implement the same logic. In a monolith, manageable. Across microservices, it becomes duplicated state tracking with consistency problems at the edges.&lt;/p&gt;
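&lt;p&gt;As a usage sketch, each service wraps its outgoing calls with the two helpers above and refuses to spend tokens once the window is exhausted (everything beyond those helpers is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

class TokenLimitExceeded(Exception):
    """Raised when a customer's token window is exhausted."""

def guarded_completion(customer_id: str, messages: list):
    # Reject before the request leaves if the estimate would push past the window.
    if not check_token_limit(customer_id, estimate_tokens(messages)):
        raise TokenLimitExceeded(customer_id)
    return client.chat.completions.create(model="gpt-4o", messages=messages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;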

&lt;h2&gt;
  
  
  Option 2: Gateway-Level Rate Limiting
&lt;/h2&gt;

&lt;p&gt;A gateway that proxies all LLM traffic enforces limits in one place. Every service routes through the gateway. The gateway handles counting, enforcement, and resets.&lt;/p&gt;

&lt;p&gt;Bifrost handles this through Virtual Keys, each scoped to a customer or team, with request and token limits defined per key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-acme"&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk-acme-abc123"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;request_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;request_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;token_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500000&lt;/span&gt;
      &lt;span class="na"&gt;token_limit_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
    &lt;span class="na"&gt;budget_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;
    &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;customer-acme&lt;/code&gt; exhausts their daily token limit, Bifrost rejects further requests for that key until the window resets. Other customers are unaffected.&lt;/p&gt;
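&lt;p&gt;On the application side it is worth handling that rejection explicitly; a minimal sketch, assuming the gateway surfaces an exhausted limit as an HTTP 429 the way the upstream providers do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="http://bifrost:8080/openai/v1", api_key="vk-acme-abc123")

def complete_with_backoff(messages, retries=3):
    # A limit rejection will not clear until the window resets, so back off
    # briefly and then surface the error instead of retrying forever.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="gpt-4o", messages=messages)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;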

&lt;p&gt;Resets are calendar-aligned for day, week, month, and year durations. A &lt;code&gt;1d&lt;/code&gt; limit resets at UTC midnight rather than 24 hours after the first request. For billing cycles that align to calendar months, this matters.&lt;/p&gt;

&lt;p&gt;LiteLLM offers comparable virtual key functionality. The primary runtime difference: LiteLLM is Python-based with roughly 8ms overhead per request. Bifrost is Go-based with 11 microseconds overhead per request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Token-aware&lt;/th&gt;
&lt;th&gt;Per-customer limits&lt;/th&gt;
&lt;th&gt;Budget cap&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redis middleware (DIY)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM proxy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (Virtual Keys)&lt;/td&gt;
&lt;td&gt;Yes (4-tier)&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kong AI Gateway&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited (OSS)&lt;/td&gt;
&lt;td&gt;~2-5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bifrost's four-tier budget hierarchy is worth noting: Customer, Team, Virtual Key, and Provider Config limits all apply independently. A request must pass all four tiers. This allows organization-wide caps alongside fine-grained per-key limits without separate enforcement logic.&lt;/p&gt;

&lt;p&gt;If a Provider Config limit is exceeded, Bifrost excludes that provider but keeps others available. Requests do not fail outright when one provider is saturated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Application-level rate limiting gives you more control over enforcement logic. You can implement business rules a gateway does not support: tiered limits based on subscription plan, grace period overrides for specific customers, or custom token counting that accounts for your system prompt overhead.&lt;/p&gt;

&lt;p&gt;Gateway-level enforcement applies regardless of which service makes the call. The trade-off is an additional network hop and a new dependency in your infrastructure.&lt;/p&gt;

&lt;p&gt;Bifrost is self-hosted only, no managed version. The project is newer than LiteLLM with a smaller community. Factor in that maturity difference when evaluating it against more established options.&lt;/p&gt;

&lt;p&gt;Token counting before a request completes is an estimate. Actual token counts, including generated output tokens, only come back in the API response. Most gateway implementations use pre-call estimates for limits and reconcile against actual usage in the response.&lt;/p&gt;
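&lt;p&gt;A common reconciliation pattern, continuing the Redis sketch from Option 1 (illustrative, not any specific gateway's implementation): count the optimistic pre-call estimate, then adjust the counter by the difference once the response reports actual usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def record_actual_usage(customer_id: str, estimated_tokens: int, response,
                        window_seconds: int = 86400) -&amp;gt; None:
    # Reconcile the pre-call estimate against what the provider reports,
    # so the counter drifts toward real consumption (incrby accepts negatives).
    actual = response.usage.total_tokens
    window_key = int(time.time() // window_seconds)
    key = f"token_usage:{customer_id}:{window_key}"
    r.incrby(key, actual - estimated_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;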

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Request-count limits do not prevent token budget overruns&lt;/li&gt;
&lt;li&gt;Multi-tenant apps need per-customer token limits, not global ones&lt;/li&gt;
&lt;li&gt;Application-level implementation works but duplicates logic across services&lt;/li&gt;
&lt;li&gt;Gateway-level enforcement centralizes limits with no per-service code changes&lt;/li&gt;
&lt;li&gt;Bifrost and LiteLLM both support virtual key rate limiting; the primary difference is runtime overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Virtual Keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Budget and Limits: &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/budget-and-limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/routing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Four Go Repositories Worth Your Attention on GitHub's Trending Page This Month</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:46:26 +0000</pubDate>
      <link>https://dev.to/pranay_batta/four-go-repositories-worth-your-attention-on-githubs-trending-page-this-month-160i</link>
      <guid>https://dev.to/pranay_batta/four-go-repositories-worth-your-attention-on-githubs-trending-page-this-month-160i</guid>
      <description>&lt;p&gt;When I want to see what developers are actually shipping, GitHub's trending page is a reliable signal. Scrolling through the &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;Go monthly trending board&lt;/a&gt; this month surfaced four projects that each deserve a closer look. All four are written in Go, and each tackles a very different problem.&lt;/p&gt;

&lt;p&gt;Here's the current monthly ranking for the Go language:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Stars This Month&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;QuantumNous/new-api&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;28,276&lt;/td&gt;
&lt;td&gt;5,970&lt;/td&gt;
&lt;td&gt;Unified AI model hub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;Wei-Shaw/sub2api&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,394&lt;/td&gt;
&lt;td&gt;6,822&lt;/td&gt;
&lt;td&gt;AI API subscription sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;steipete/wacli&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2,034&lt;/td&gt;
&lt;td&gt;1,348&lt;/td&gt;
&lt;td&gt;WhatsApp CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;maximhq/bifrost&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;4,169&lt;/td&gt;
&lt;td&gt;1,076&lt;/td&gt;
&lt;td&gt;Enterprise AI gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what each project does and why it's climbing this month.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Bifrost: An Enterprise AI Gateway Built in Go
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;maximhq/bifrost&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 4,169 | &lt;strong&gt;Forks:&lt;/strong&gt; 485 | &lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/p&gt;

&lt;p&gt;Bifrost is a high-throughput AI gateway, written in Go, that exposes a single OpenAI-compatible API fronting more than 15 LLM providers. What pulled it onto my radar were the performance figures. Per-request overhead sits at roughly &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;11 microseconds&lt;/a&gt;, and the gateway sustains 5,000 RPS. That puts it around 50x faster than Python-based alternatives like LiteLLM, a gap that teams evaluating &lt;a href="https://www.getmaxim.ai/bifrost/alternatives/litellm-alternatives" rel="noopener noreferrer"&gt;gateway options&lt;/a&gt; tend to notice quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider routing&lt;/strong&gt; covering automatic failover and weighted load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt; using a dual-layer design (exact-hash match plus semantic similarity through &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP integration&lt;/strong&gt; featuring Code Mode, which cuts token usage by as much as 92.8% at scale (&lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;benchmark source&lt;/a&gt;); teams can review the broader &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway architecture&lt;/a&gt; for details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget hierarchy&lt;/strong&gt; enforced at four levels: Customer, Team, Virtual Key, and Provider Config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config deployment&lt;/strong&gt; through &lt;code&gt;npx @maximhq/bifrost&lt;/code&gt; or Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectural decisions matter here. Choosing Go keeps latency predictable, with GC pauses that stay negligible under load. Three deployment shapes are supported: an HTTP gateway, a Go SDK, or a drop-in SDK replacement for existing OpenAI or Anthropic client libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Teams operating multiple LLM providers in production that need lightweight routing, cost controls, and observability without stacking extra latency into every call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. new-api: A Self-Hosted Model Hub
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;QuantumNous/new-api&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 28,276 | &lt;strong&gt;Forks:&lt;/strong&gt; 5,915 | &lt;strong&gt;License:&lt;/strong&gt; AGPLv3&lt;/p&gt;

&lt;p&gt;new-api tops the monthly Go board by star count. It works as a centralized gateway that aggregates multiple LLM vendors (OpenAI, Azure, Claude, Gemini, DeepSeek, Qwen, and more) and exposes them through standardized relay interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bidirectional format translation across OpenAI, Claude, and Gemini APIs&lt;/li&gt;
&lt;li&gt;Token grouping with model-level restrictions and role-based access control&lt;/li&gt;
&lt;li&gt;A dashboard for real-time usage analytics and billing&lt;/li&gt;
&lt;li&gt;Docker deployment that works against SQLite, MySQL, or PostgreSQL backends&lt;/li&gt;
&lt;li&gt;A multi-language UI covering Chinese, English, French, and Japanese&lt;/li&gt;
&lt;li&gt;Redis support for distributed deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Development velocity is high: the project has more than 5,600 commits. Streaming APIs are supported with configurable timeouts, and reasoning-model handling is built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Teams that want a self-hosted LLM proxy paired with a full admin dashboard and cross-vendor format conversion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. sub2api: Sharing AI API Subscriptions Across Users
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;Wei-Shaw/sub2api&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 14,394 | &lt;strong&gt;Forks:&lt;/strong&gt; 2,488&lt;/p&gt;

&lt;p&gt;sub2api approaches the problem from a different angle. Rather than simply proxying API requests, it is designed around pooling and sharing paid AI subscriptions (Claude, OpenAI, Gemini) behind a unified access layer that includes billing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-account management with OAuth and API key authentication&lt;/li&gt;
&lt;li&gt;API key distribution and lifecycle handling&lt;/li&gt;
&lt;li&gt;Token-level billing that calculates cost with precision&lt;/li&gt;
&lt;li&gt;Account scheduling with sticky sessions&lt;/li&gt;
&lt;li&gt;Per-user and per-account concurrency limits&lt;/li&gt;
&lt;li&gt;A built-in payment system covering Alipay, WeChat Pay, and Stripe&lt;/li&gt;
&lt;li&gt;An administrative dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack is Go 1.25 with the Gin framework and the Ent ORM on the backend, and Vue 3, Vite, and TailwindCSS on the frontend. PostgreSQL 15+ and Redis 7+ are required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Organizations that want to share paid AI subscriptions across multiple users with fine-grained billing and access policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/Wei-Shaw/sub2api" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. wacli: A WhatsApp Command-Line Interface
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;steipete/wacli&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stars:&lt;/strong&gt; 2,034 | &lt;strong&gt;Forks:&lt;/strong&gt; 241&lt;/p&gt;

&lt;p&gt;This one stands out from the rest. wacli is a full command-line client for WhatsApp, built on top of the &lt;a href="https://github.com/tulir/whatsmeow" rel="noopener noreferrer"&gt;whatsmeow&lt;/a&gt; library that implements the WhatsApp Web protocol. It was created by &lt;a href="https://github.com/steipete" rel="noopener noreferrer"&gt;Peter Steinberger&lt;/a&gt;, a widely known iOS developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local message history sync with continuous capture&lt;/li&gt;
&lt;li&gt;Offline search backed by SQLite with FTS5 full-text indexing&lt;/li&gt;
&lt;li&gt;Sending text, quoted replies, and files with captions&lt;/li&gt;
&lt;li&gt;Contact and group management&lt;/li&gt;
&lt;li&gt;Human-readable table output by default, with JSON available for scripting&lt;/li&gt;
&lt;li&gt;A read-only mode that prevents accidental mutations&lt;/li&gt;
&lt;li&gt;QR-code authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Installation is simple: grab it from Homebrew or build from source with &lt;code&gt;go build -tags sqlite_fts5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for:&lt;/strong&gt; Developers and power users who want programmatic WhatsApp access from the terminal for automation, search, or scripting tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/steipete/wacli" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Go Keeps Showing Up at the Top
&lt;/h2&gt;

&lt;p&gt;That all four of these projects are written in Go is not a coincidence. The language's concurrency model, single-binary deployment, compact memory footprint, and predictable behavior under load make it a natural fit for infrastructure tooling.&lt;/p&gt;

&lt;p&gt;Nowhere is that clearer than in AI gateway workloads. When you are proxying thousands of LLM calls every second, every microsecond of added overhead compounds. Python-based options like &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; add milliseconds of latency per request; Go-based gateways such as Bifrost keep that overhead in the microsecond range, as their &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;published performance benchmarks&lt;/a&gt; document in detail.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.cncf.io/projects/" rel="noopener noreferrer"&gt;CNCF project ecosystem&lt;/a&gt; reflects the same pattern. Kubernetes, Prometheus, and much of the control-plane tooling around Envoy are written in Go, as is the majority of cloud-native infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Two clear patterns surface from this month's Go trending list. First, AI infrastructure dominates the category. Second, Go continues to be the language developers reach for when building that infrastructure. Whether the need is an enterprise AI gateway with sub-millisecond overhead, a self-hosted model hub, a subscription sharing platform, or a WhatsApp CLI, the Go ecosystem now ships a production-ready option in each category.&lt;/p&gt;

&lt;p&gt;If you want the full list, the &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;GitHub Go trending page&lt;/a&gt; is one click away.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Data sourced from &lt;a href="https://github.com/trending/go?since=monthly" rel="noopener noreferrer"&gt;GitHub Trending&lt;/a&gt; as of April 22, 2026. Star counts and rankings change daily.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>ai</category>
      <category>github</category>
    </item>
    <item>
      <title>How to Cut LLM Token Spend with Semantic Caching: A Production Setup Guide</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:25:49 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-cut-llm-token-spend-with-semantic-caching-a-production-setup-guide-2o8j</link>
      <guid>https://dev.to/pranay_batta/how-to-cut-llm-token-spend-with-semantic-caching-a-production-setup-guide-2o8j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Semantic caching intercepts LLM API calls and returns cached responses for similar queries, skipping the provider entirely. Zero tokens consumed on cache hits. I set this up with Bifrost and Weaviate in under 30 minutes and it started saving tokens on the first day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Are Building
&lt;/h2&gt;

&lt;p&gt;A semantic cache layer that sits between your application and LLM providers. Every API call passes through the cache first. If the query matches a previous one (exact match or semantically similar), the cached response is returned instantly. No LLM call, no tokens billed.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/maximhq" rel="noopener noreferrer"&gt;
        maximhq
      &lt;/a&gt; / &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;
        bifrost
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support &amp;amp; &amp;lt;100 µs overhead at 5k RPS.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Bifrost AI Gateway&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://goreportcard.com/report/github.com/maximhq/bifrost/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7f7e70df9fdaaf4f485f59ca6bc0b5cbbf134d03dd5721da4e31f90f618fc304/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f6d6178696d68712f626966726f73742f636f7265" alt="Go Report Card"&gt;&lt;/a&gt;
&lt;a href="https://discord.gg/exN5KAydbU" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/282b7719f04b28f5959f5e1e17aee806d65f8eea3b862b57af350df0ab57be6f/68747470733a2f2f646362616467652e6c696d65732e70696e6b2f6170692f7365727665722f68747470733a2f2f646973636f72642e67672f65784e354b41796462553f7374796c653d666c6174" alt="Discord badge"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/maximhq/bifrost" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8bc2db302c566210d14c09b278639a3f63f07def5fc635a8869e59c996b3100f/68747470733a2f2f636f6465636f762e696f2f67682f6d6178696d68712f626966726f73742f6272616e63682f6d61696e2f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/b0899925aadfed8626116707178a4015d8cf4aaa0b80acb632cb4782c6dc7272/68747470733a2f2f696d672e736869656c64732e696f2f646f636b65722f70756c6c732f6d6178696d68712f626966726f7374"&gt;&lt;img src="https://camo.githubusercontent.com/b0899925aadfed8626116707178a4015d8cf4aaa0b80acb632cb4782c6dc7272/68747470733a2f2f696d672e736869656c64732e696f2f646f636b65722f70756c6c732f6d6178696d68712f626966726f7374" alt="Docker Pulls"&gt;&lt;/a&gt;
&lt;a href="https://app.getpostman.com/run-collection/31642484-2ba0e658-4dcd-49f4-845a-0c7ed745b916?action=collection%2Ffork&amp;amp;source=rip_markdown&amp;amp;collection-url=entityId%3D31642484-2ba0e658-4dcd-49f4-845a-0c7ed745b916%26entityType%3Dcollection%26workspaceId%3D63e853c8-9aec-477f-909c-7f02f543150e" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/82ccefddb001e2caf9d399f1153fdda561cf3da341bb270e18644d516906bc64/68747470733a2f2f72756e2e7073746d6e2e696f2f627574746f6e2e737667" alt="Run In Postman"&gt;&lt;/a&gt;
&lt;a href="https://artifacthub.io/packages/search?repo=bifrost" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/a6a3c734d6bd57fa8e1d508ac0cdba555bdbcd9191b29b32cf37a964b86b9c67/68747470733a2f2f696d672e736869656c64732e696f2f656e64706f696e743f75726c3d68747470733a2f2f61727469666163746875622e696f2f62616467652f7265706f7369746f72792f626966726f7374" alt="Artifact Hub"&gt;&lt;/a&gt;
&lt;a href="https://github.com/maximhq/bifrost/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3cb44c15a532770a066ba8e61bf11506ad5400e5c61d48f6b639101e442bee79/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6d6178696d68712f626966726f7374" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The fastest way to build AI applications that never go down&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/maximhq/bifrost/./docs/media/getting-started.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmaximhq%2Fbifrost%2FHEAD%2F.%2Fdocs%2Fmedia%2Fgetting-started.png" alt="Get started"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Go from zero to production-ready AI gateway in under a minute.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Start Bifrost Gateway&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install and run locally&lt;/span&gt;
npx -y @maximhq/bifrost

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or use Docker&lt;/span&gt;
docker run -p 8080:8080 maximhq/bifrost&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Configure via Web UI&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Open the built-in web interface&lt;/span&gt;
open http://localhost:8080&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Make your first API call&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;curl -X POST http://localhost:8080/v1/chat/completions \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;    "model": "openai/gpt-4o-mini",&lt;/span&gt;
&lt;span class="pl-s"&gt;    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;That's it!&lt;/strong&gt; Your AI gateway is running with a web interface for visual configuration…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Here is the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App -&amp;gt; Bifrost Gateway -&amp;gt; [Cache Check] -&amp;gt; Hit?  -&amp;gt; Return cached response (0 tokens)
                                        -&amp;gt; Miss? -&amp;gt; Forward to LLM provider -&amp;gt; Cache response -&amp;gt; Return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The end result: repeated and similar queries cost nothing. For workloads with common patterns (customer support, code generation, FAQ bots), the savings add up fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You need four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt; installed (&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate&lt;/strong&gt; as the vector store for semantic similarity matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost&lt;/strong&gt; as the LLM gateway with caching enabled&lt;/li&gt;
&lt;li&gt;At least one LLM provider API key (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything runs locally. No cloud accounts needed beyond your LLM provider key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Deploy Weaviate for Vector Storage
&lt;/h2&gt;

&lt;p&gt;Weaviate stores the vector embeddings that power semantic matching. When a new query comes in, Bifrost converts it to a vector and checks Weaviate for similar past queries.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;weaviate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cr.weaviate.io/semitechnologies/weaviate:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8081:8080"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50051:50051"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;QUERY_DEFAULTS_LIMIT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="na"&gt;AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
      &lt;span class="na"&gt;PERSISTENCE_DATA_PATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/var/lib/weaviate'&lt;/span&gt;
      &lt;span class="na"&gt;DEFAULT_VECTORIZER_MODULE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text2vec-transformers'&lt;/span&gt;
      &lt;span class="na"&gt;ENABLE_MODULES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text2vec-transformers'&lt;/span&gt;
      &lt;span class="na"&gt;TRANSFORMERS_INFERENCE_API&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://t2v-transformers:8080'&lt;/span&gt;
      &lt;span class="na"&gt;CLUSTER_HOSTNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node1'&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;weaviate_data:/var/lib/weaviate&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-failure&lt;/span&gt;

  &lt;span class="na"&gt;t2v-transformers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ENABLE_CUDA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-failure&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;weaviate_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spin it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify Weaviate is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8081/v1/meta | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a JSON response with version info. If you get connection refused, give it 30 seconds for the transformer model to load.&lt;/p&gt;
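
&lt;p&gt;If you would rather script that readiness check than re-run curl by hand, a small polling loop against the same &lt;code&gt;/v1/meta&lt;/code&gt; endpoint does the job. This is a convenience sketch, not part of Bifrost or Weaviate, and the 60-second timeout is an arbitrary choice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Poll Weaviate's /v1/meta endpoint until it answers; the transformer model can
# take ~30 seconds to load on first start. The 60-second timeout is arbitrary.
import time

import requests

WEAVIATE_META_URL = "http://localhost:8081/v1/meta"

def wait_for_weaviate(timeout_seconds=60):
    deadline = time.time() + timeout_seconds
    while time.time() &amp;lt; deadline:
        try:
            meta = requests.get(WEAVIATE_META_URL, timeout=2)
            if meta.status_code == 200:
                print("Weaviate ready, version:", meta.json().get("version"))
                return True
        except requests.RequestException:
            pass  # still starting up
        time.sleep(2)
    return False

if not wait_for_weaviate():
    raise SystemExit("Weaviate did not become ready in time")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
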

&lt;p&gt;For more on &lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate's architecture and vectoriser modules&lt;/a&gt;, check their docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Bifrost with Semantic Caching Enabled
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source LLM gateway written in Go. 11 microsecond latency overhead, 5,000 RPS throughput. The part that matters here: it has &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt; built in.&lt;/p&gt;

&lt;p&gt;Dual-layer means two cache checks run on every request:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Exact hash match&lt;/strong&gt; - identical queries return cached responses instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; - queries that mean the same thing but are worded differently also hit the cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you prefer npx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now configure the gateway. Create a &lt;code&gt;config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;

&lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic"&lt;/span&gt;
  &lt;span class="na"&gt;vector_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weaviate"&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8081"&lt;/span&gt;
  &lt;span class="na"&gt;conversation_history_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-main"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key config values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cache.enabled: true&lt;/code&gt; turns on the dual-layer cache&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cache.type: "semantic"&lt;/code&gt; enables both exact hash and semantic similarity (not just exact match)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vector_store.provider: "weaviate"&lt;/code&gt; points to your Weaviate instance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;conversation_history_threshold: 3&lt;/code&gt; controls how much conversation context is used for cache key generation. Default is 3. Higher values mean more context-sensitive cache matching but fewer hits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full configuration options are in the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Point Your LLM Calls Through Bifrost
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the OpenAI SDK. Change your base URL and everything else stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (OpenAI SDK):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First call - cache miss, hits the LLM provider
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the benefits of microservices architecture?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Second call - same query, exact cache hit, zero tokens
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the benefits of microservices architecture?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Third call - different wording, same intent, semantic cache hit
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why should I use a microservices pattern?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first call goes to OpenAI. Tokens are consumed, response is cached. The second call is identical, so the exact hash matches. Response comes from cache. The third call is worded differently but semantically similar. Weaviate's vector search finds the match. Response comes from cache again.&lt;/p&gt;

&lt;p&gt;Both cache hits skip the LLM provider entirely. Zero tokens. Zero cost.&lt;/p&gt;
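
&lt;p&gt;You can see the difference from the client side without touching any gateway internals by timing the calls: cache hits come back in milliseconds, while a real provider round trip takes hundreds of milliseconds or more. The helper below is my own, and it reuses the &lt;code&gt;client&lt;/code&gt; object from the snippet above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough client-side check: time each call. Cache hits return almost instantly,
# provider calls take a full LLM round trip. Reuses `client` from above.
import time

def timed_completion(client, prompt, model="gpt-4o"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{elapsed_ms:8.1f} ms  {prompt[:50]}")
    return response

timed_completion(client, "What are the benefits of microservices architecture?")  # miss
timed_completion(client, "What are the benefits of microservices architecture?")  # exact hit
timed_completion(client, "Why should I use a microservices pattern?")             # semantic hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
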

&lt;p&gt;&lt;strong&gt;Node.js (OpenAI SDK):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Explain container orchestration in simple terms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern. Point the base URL at Bifrost, and caching is transparent to your application code.&lt;/p&gt;

&lt;p&gt;If you are using the Anthropic SDK, Bifrost supports that too. The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration&lt;/a&gt; page has the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Monitor Cache Hits and Token Savings
&lt;/h2&gt;

&lt;p&gt;Once traffic is flowing, you want to see what is hitting cache vs what is going through to providers.&lt;/p&gt;

&lt;p&gt;Bifrost exposes metrics that let you track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit rate (exact vs semantic)&lt;/li&gt;
&lt;li&gt;Total requests vs routed requests (routed = cache misses that hit a provider)&lt;/li&gt;
&lt;li&gt;Token usage per provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check your Bifrost logs to see cache behaviour in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;lt;bifrost-container-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each request will indicate whether it was served from cache or forwarded to a provider. Track the ratio over time. On workloads with repeated query patterns, the cache hit rate climbs quickly within the first few hours.&lt;/p&gt;
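
&lt;p&gt;To turn those counts into something you can put in a cost review, the arithmetic is trivial. Every number below is a placeholder; substitute whatever your logs and provider dashboard actually report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope savings estimate. All figures here are made-up placeholders.
total_requests = 10_000          # requests that reached the gateway
cache_hits = 4_200               # exact + semantic hits counted from the logs
avg_tokens_per_request = 1_500   # typical prompt + completion tokens on a miss

hit_rate = cache_hits / total_requests
tokens_saved = cache_hits * avg_tokens_per_request

print(f"Cache hit rate: {hit_rate:.1%}")                      # 42.0%
print(f"Tokens never sent to a provider: {tokens_saved:,}")   # 6,300,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
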

&lt;h2&gt;
  
  
  How It Works: Exact Hash vs Semantic Similarity
&lt;/h2&gt;

&lt;p&gt;A quick breakdown of the two cache layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact hash matching&lt;/strong&gt; is straightforward. The entire request (messages, model, parameters) is hashed. If an identical request has been seen before, the cached response is returned. This is fast and deterministic. Same input, same output.&lt;/p&gt;
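
&lt;p&gt;Conceptually it is nothing more than a stable hash over the request payload used as a lookup key. Here is an illustration of the idea in Python; it is not Bifrost's actual key format, just the principle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustration of exact-match keying: hash the whole request and use the digest
# as the cache key. Identical requests always produce the identical key.
import hashlib
import json

def exact_cache_key(model, messages, **params):
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True)  # stable ordering
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key_a = exact_cache_key("gpt-4o", [{"role": "user", "content": "Explain OAuth 2.0"}])
key_b = exact_cache_key("gpt-4o", [{"role": "user", "content": "Explain OAuth 2.0"}])
assert key_a == key_b  # same input, same key, cached response can be reused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
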

&lt;p&gt;&lt;strong&gt;Semantic similarity&lt;/strong&gt; is where it gets interesting. When no exact match exists, Bifrost converts the query into a vector embedding using the transformer model running in Weaviate. It then searches for existing cached queries that are semantically close. If the similarity score is above the threshold, the cached response is returned.&lt;/p&gt;

&lt;p&gt;This is what catches queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I deploy to Kubernetes?" and "What is the process for deploying on k8s?"&lt;/li&gt;
&lt;li&gt;"Explain OAuth 2.0" and "How does OAuth2 authentication work?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different words. Same intent. One LLM call instead of two.&lt;/p&gt;
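
&lt;p&gt;You can reproduce the similarity side of this locally with the same embedding model the docker-compose file loads (all-MiniLM-L6-v2), assuming you have the sentence-transformers package installed. The interpretation of the score below is mine; Bifrost applies its own threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare two differently worded queries with the embedding model from the
# docker-compose file (sentence-transformers/all-MiniLM-L6-v2).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query_a = "How do I deploy to Kubernetes?"
query_b = "What is the process for deploying on k8s?"

embeddings = model.encode([query_a, query_b])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# A score near 1.0 means near-paraphrases: a cached answer for one is a
# reasonable answer for the other.
print(f"cosine similarity: {similarity:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
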

&lt;p&gt;The &lt;code&gt;conversation_history_threshold&lt;/code&gt; setting controls how many previous messages in a conversation are included when generating the cache key. At the default of 3, Bifrost uses the last 3 messages for context. This prevents a cached response from a different conversation context being returned incorrectly.&lt;/p&gt;

&lt;p&gt;For more on how sentence embeddings power this kind of similarity search, &lt;a href="https://huggingface.co/blog/getting-started-with-embeddings" rel="noopener noreferrer"&gt;HuggingFace has a solid primer&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: What I Measured After Running This for a Week
&lt;/h2&gt;

&lt;p&gt;I ran this setup against three different workloads for seven days. Here is what I observed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer support bot (repetitive queries):&lt;/strong&gt; Highest cache hit rate. Users ask variations of the same 50-100 questions. After the first day, the cache warmed up and a large portion of queries were served from cache. Semantic matching caught the paraphrased versions that exact hash would miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation assistant (moderate repetition):&lt;/strong&gt; Lower hit rate than customer support, but still meaningful. Common patterns like "write a function to parse JSON" or "create a REST endpoint" showed up repeatedly with slight variations. Semantic caching caught many of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open-ended research queries (low repetition):&lt;/strong&gt; Lowest hit rate, as expected. Each query was unique enough that neither exact nor semantic matching triggered often. Caching still helped with follow-up questions that rephrased earlier queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency on cache hits:&lt;/strong&gt; Near-instant. The Weaviate vector lookup adds milliseconds, but compared to a full LLM round trip (typically 500ms to 3s), cache hits felt instantaneous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway overhead:&lt;/strong&gt; Bifrost's 11 microsecond latency overhead held up. The caching layer adds the Weaviate lookup time on misses and hits, but the gateway itself adds almost nothing.&lt;/p&gt;

&lt;p&gt;The workloads where semantic caching pays off most are the ones with natural query repetition. Customer support, internal knowledge bases, FAQ systems, onboarding assistants. If your users ask the same things in different ways, you are paying for the same answer multiple times.&lt;/p&gt;

&lt;p&gt;For reference, here is what &lt;a href="https://openai.com/api/pricing" rel="noopener noreferrer"&gt;OpenAI charges per token&lt;/a&gt; and what &lt;a href="https://docs.anthropic.com/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Anthropic charges&lt;/a&gt;. On GPT-4o at current pricing, even a moderate cache hit rate translates to real savings on a monthly bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost Semantic Caching Docs&lt;/a&gt; - full config reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost Setup Guide&lt;/a&gt; - getting started from scratch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate Developer Docs&lt;/a&gt; - vector store configuration and modules&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/blog/getting-started-with-embeddings" rel="noopener noreferrer"&gt;Getting Started with Embeddings (HuggingFace)&lt;/a&gt; - how sentence embeddings work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/docs/latest/develop/use/patterns/" rel="noopener noreferrer"&gt;Redis Caching Patterns&lt;/a&gt; - general caching concepts for comparison&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running LLM workloads with any kind of query repetition, set up semantic caching before optimising anything else. It is the lowest-effort, highest-impact cost reduction I have found.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Smart LLM Routing in Production: Picking the Optimal Model per Request</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:58:08 +0000</pubDate>
      <link>https://dev.to/pranay_batta/smart-llm-routing-in-production-picking-the-optimal-model-per-request-2lkh</link>
      <guid>https://dev.to/pranay_batta/smart-llm-routing-in-production-picking-the-optimal-model-per-request-2lkh</guid>
      <description>&lt;p&gt;Every production LLM system eventually runs into the same wall. You are paying too much, responses are too slow, or a single provider outage takes everything down.&lt;/p&gt;

&lt;p&gt;The fix is routing. Instead of hardcoding one model for all requests, you route each request to the best available model based on cost, latency, and reliability.&lt;/p&gt;

&lt;p&gt;I evaluated several approaches over the last few weeks. Marketplace APIs, framework-level abstractions, self-hosted gateways, DIY logic. Here is what the data showed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route at All?
&lt;/h2&gt;

&lt;p&gt;If you are only using one model from one provider, you do not need routing. But the moment you add a second provider, routing decisions start piling up.&lt;/p&gt;

&lt;p&gt;Three reasons this matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; GPT-4o costs roughly 10x more per token than GPT-4o-mini. If 60% of your traffic is simple summarization or classification, you are burning money sending it to a frontier model. Routing lets you match request complexity to model price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Provider response times vary by region, time of day, and current load. A request that takes 800ms on one provider might take 2.5s on another at that exact moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability.&lt;/strong&gt; Every provider has outages. Rate limits hit. 429s and 500s happen. If your entire product is wired to one API endpoint, you inherit their downtime.&lt;/p&gt;

&lt;p&gt;Smart routing optimises across all three per request, without changing application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Landscape: Four Approaches
&lt;/h2&gt;

&lt;p&gt;Before picking a tool, I mapped out how the options break down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marketplace routing (OpenRouter)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; acts as a unified API across dozens of models from different providers. You send a request to their endpoint, and they handle the provider connection. Good model catalog, single API key. The trade-off is that you are adding a network hop through their servers and routing logic is their black box, not yours. Less control over failover behaviour, budget enforcement, and routing weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework-level routing (Semantic Kernel)
&lt;/h3&gt;

&lt;p&gt;Microsoft's &lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; lets you define model selection logic inside your application code. You can set up filters that choose models based on request properties, user tier, or function type. The issue: routing becomes tightly coupled to your application. Every service needs the routing logic, and updating routing config means redeploying application code. No built-in budget enforcement or provider health monitoring either.&lt;/p&gt;

&lt;h3&gt;
  
  
  DIY routing
&lt;/h3&gt;

&lt;p&gt;You can always write your own. A reverse proxy with some logic to pick providers based on health checks and weights. I tried this first with a simple Python setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;PROVIDERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openai.com/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.anthropic.com/v1/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_provider&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for two providers with static weights. It falls apart when you need failover, budget tracking, health checks, or dynamic weight adjustment. I abandoned this after two weeks of edge cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gateway-level routing
&lt;/h3&gt;

&lt;p&gt;A gateway sits between your application and LLM providers. You configure routing rules once, and every service behind the gateway gets the same behaviour. Application code does not know or care which provider serves a request.&lt;/p&gt;

&lt;p&gt;This is where I spent most of my time. And this is where the data got interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gateway-Level Routing Won for Me
&lt;/h2&gt;

&lt;p&gt;The decision came down to one principle: routing is infrastructure, not application logic.&lt;/p&gt;

&lt;p&gt;When routing lives in the application layer, every team implements it differently. One team does round-robin, another does random selection, a third hardcodes a provider. Failover behaviour is inconsistent. Budget tracking is scattered across services.&lt;/p&gt;

&lt;p&gt;A gateway centralises all of that. Configure it once, every downstream service gets consistent routing, failover, and budget enforcement. Change the routing strategy and no application code changes.&lt;/p&gt;

&lt;p&gt;After testing several gateways, &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; gave me the best combination of routing flexibility and raw performance. Written in Go, 11 microsecond latency overhead, 5,000 RPS sustained throughput. For context, Python-based alternatives like LiteLLM add around 8ms per request. That is roughly a 50x difference in routing overhead.&lt;/p&gt;

&lt;p&gt;Here is how I set it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost Routing: The Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Weighted Distribution
&lt;/h3&gt;

&lt;p&gt;The most common routing strategy. You assign weights to providers and Bifrost distributes traffic proportionally. Weights auto-normalise, so you can use any numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-tertiary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;60% of requests go to GPT-4o. 30% to Claude Sonnet. 10% to Gemini. I used this split to compare output quality across providers on real production traffic. Adjusting the weights is a config change, not a code deploy.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing configuration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Weighted routing handles the happy path. &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover&lt;/a&gt; handles everything else. When a provider returns errors, Bifrost automatically retries with the next provider in weight order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI returns a 429? Bifrost retries with Anthropic. Anthropic is down? Falls back to Gemini. The application never sees the failure. No retry logic in application code, no manual intervention.&lt;/p&gt;

&lt;p&gt;I ran a 48-hour test where I intentionally rotated provider API keys to simulate outages. Bifrost handled every failover cleanly. Requests were slower (because retries take time) but none failed from the application's perspective.&lt;/p&gt;
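
&lt;p&gt;For a sense of what the gateway is doing on your behalf, this is roughly the logic you would otherwise end up writing in every service: try providers in weight order and move on when one errors. A simplified Python sketch of the idea, not Bifrost's code; &lt;code&gt;call_provider&lt;/code&gt; is a stand-in for a real SDK call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# What automatic failover replaces: ordered retries across providers.
FAILOVER_ORDER = ["openai-primary", "anthropic-fallback", "gemini-fallback"]

class ProviderError(Exception):
    pass

def call_provider(provider_id, prompt):
    # Simulated outage: the first two providers are "down", the last one answers.
    if provider_id != "gemini-fallback":
        raise ProviderError(f"{provider_id} returned 429")
    return f"[{provider_id}] response to: {prompt}"

def complete_with_failover(prompt):
    last_error = None
    for provider_id in FAILOVER_ORDER:
        try:
            return call_provider(provider_id, prompt)
        except ProviderError as error:
            last_error = error  # unhealthy or rate limited, try the next provider
    raise last_error

print(complete_with_failover("Explain container orchestration in simple terms"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
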

&lt;h3&gt;
  
  
  Budget-Aware Routing
&lt;/h3&gt;

&lt;p&gt;This is where Bifrost's approach gets genuinely useful. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend-team"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-key-pranay"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a budget tier is exhausted, routing decisions respect that constraint. If the backend team hits their monthly limit, requests from that team stop going through. If a specific virtual key runs out, that key is blocked but other keys on the same team still work.&lt;/p&gt;

&lt;p&gt;This level of granularity is something I did not find in the other approaches I tested. Most solutions do global rate limiting at best. The four-tier hierarchy lets you set guardrails at every organisational level without building custom middleware.&lt;/p&gt;
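
&lt;p&gt;The enforcement model is easy to reason about: a request is admitted only if every tier it belongs to still has budget left. Here is a toy Python sketch of that check, with made-up tiers and numbers, not Bifrost's data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy model of tiered budget enforcement. A request passes only when every tier
# in its chain (virtual key, team, customer, provider config) has budget left.
budgets = {
    "team:backend-team": {"limit": 500.0, "spent": 500.0},  # monthly limit exhausted
    "vk:dev-key-pranay": {"limit": 100.0, "spent": 42.0},   # still has headroom
}

def admit(request_tiers):
    for tier in request_tiers:
        entry = budgets.get(tier)
        if entry and entry["spent"] &amp;gt;= entry["limit"]:
            return False, f"budget exhausted at {tier}"
    return True, "ok"

# Blocked: the key has budget, but its team does not.
print(admit(["vk:dev-key-pranay", "team:backend-team"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
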

&lt;h3&gt;
  
  
  Semantic Caching: Skip Routing Entirely
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Dual-layer semantic caching&lt;/a&gt; in Bifrost uses exact hash matching and semantic similarity matching.&lt;/p&gt;

&lt;p&gt;When a request hits the cache, it never reaches a provider. No routing decision needed. No API call. No cost. The response comes back from cache directly.&lt;/p&gt;

&lt;p&gt;For workloads with repeated or similar queries (customer support, code generation with common patterns, FAQ-type interactions), caching eliminates a significant chunk of provider calls entirely. In my testing, cache hit rates on repetitive workloads were high enough to noticeably reduce total routed requests.&lt;/p&gt;

&lt;p&gt;This interacts well with budget-aware routing. Fewer routed requests means budgets last longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Setup is fast. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your providers in the config file, set your routing weights, and point your application at the gateway endpoint. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; walks through it. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration&lt;/a&gt; covers all supported providers and model formats.&lt;/p&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; endpoint for the OpenAI and Anthropic SDKs. If your application already uses either SDK, you change the base URL and nothing else. No code changes needed. The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration docs&lt;/a&gt; have the specifics.&lt;/p&gt;
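
&lt;p&gt;For completeness, here is what that looks like with the Anthropic SDK in Python. Treat the &lt;code&gt;base_url&lt;/code&gt; as a placeholder; the exact gateway endpoint to point it at is in the Anthropic SDK integration docs linked above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Anthropic SDK pointed at the gateway instead of api.anthropic.com.
# The base_url below is a placeholder; use the endpoint from Bifrost's
# Anthropic SDK integration docs.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",  # placeholder gateway endpoint
    api_key="your-anthropic-api-key",
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain container orchestration in simple terms"}],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
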

&lt;h2&gt;
  
  
  Results After Switching
&lt;/h2&gt;

&lt;p&gt;I ran Bifrost for three weeks across production workloads. Here is what the data showed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency overhead&lt;/strong&gt;: Consistently under 15 microseconds per request. The 11 microsecond claim held up in my benchmarks. At 5,000 RPS, total gateway overhead was negligible compared to actual LLM response times. You can &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;run the benchmarks yourself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failover recovery&lt;/strong&gt;: Provider failures were transparent to the application. During two real OpenAI degradation events, traffic shifted to Anthropic within the same request cycle. Zero application-level errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost visibility&lt;/strong&gt;: The four-tier budget hierarchy gave me per-team and per-key cost tracking without building anything custom. I caught one team burning through their allocation on a retry loop within the first week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache savings&lt;/strong&gt;: Semantic caching reduced routed requests by a meaningful percentage on workloads with repeated query patterns. Those were requests that never hit a provider, never cost anything.&lt;/p&gt;

&lt;p&gt;The combination of weighted routing, automatic failover, budget controls, and semantic caching in a single layer that adds 11 microseconds of overhead is something I have not been able to replicate with any other approach I tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;LLM routing is not optional in production. Static provider configs break under load, cost more than they should, and give you zero flexibility when things go wrong.&lt;/p&gt;

&lt;p&gt;The approach matters. Marketplace APIs abstract away too much control. Framework-level routing couples infrastructure decisions to application code. DIY solutions work until the edge cases pile up.&lt;/p&gt;

&lt;p&gt;Gateway-level routing keeps the concern where it belongs: in infrastructure. Bifrost's performance numbers, routing flexibility, and budget hierarchy made it the strongest option in my evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running LLMs in production with multiple providers, set up a gateway and stop hardcoding routing in application code. The data speaks for itself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How to Govern Claude Code Usage Across Engineering Teams</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:55:53 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-govern-claude-code-usage-across-engineering-teams-53lk</link>
      <guid>https://dev.to/pranay_batta/how-to-govern-claude-code-usage-across-engineering-teams-53lk</guid>
      <description>&lt;p&gt;Claude Code is powerful; maybe too powerful to run without guardrails.&lt;/p&gt;

&lt;p&gt;I came across a case where a mid-sized startup had three engineering teams adopt it independently. Within two weeks, their bill hit $4,200. No breakdown of who spent what, no audit trail, no rate limits—just usage piling up and a growing invoice.&lt;/p&gt;

&lt;p&gt;If your org is adopting Claude Code, you need centralized governance. I tested &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; as an AI gateway layer to solve exactly this. Here is how I set it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Ungoverned Claude Code
&lt;/h2&gt;

&lt;p&gt;Claude Code runs locally on each developer's machine. Every developer has their own API key. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No visibility into per-developer or per-team spend&lt;/li&gt;
&lt;li&gt;No rate limiting. One runaway agent loop burns through your budget&lt;/li&gt;
&lt;li&gt;No audit trail of what tools were called, what code was generated&lt;/li&gt;
&lt;li&gt;No control over which MCP tools Claude Code can access&lt;/li&gt;
&lt;li&gt;No way to enforce org-wide policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need a proxy layer between Claude Code and the LLM provider. That is what an AI gateway does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost as Your Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost is a Go-based AI gateway with &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;11 microsecond latency overhead&lt;/a&gt;. Deploy it with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @anthropic-ai/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 ghcr.io/maximhq/bifrost:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point Claude Code at it by setting the base URL in your Claude Code config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiBaseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vk_team_frontend_abc123"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;apiKey&lt;/code&gt; is not an Anthropic key. It is a Bifrost &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt;. This is where governance starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Keys: Per-Developer Access Control
&lt;/h2&gt;

&lt;p&gt;Virtual keys let you issue scoped credentials to each developer or team. Each virtual key maps to an underlying provider key but adds access controls on top.&lt;/p&gt;

&lt;p&gt;Create a virtual key per team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bifrost.yaml&lt;/span&gt;
&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_frontend"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Team"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_backend"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Team"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200000&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_dev_rahul"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rahul&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Backend"&lt;/span&gt;
    &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_prod"&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each developer gets their own virtual key. They never see the actual Anthropic API key. You revoke access by deleting the virtual key. No key rotation needed on the provider side.&lt;/p&gt;
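
&lt;p&gt;Before handing a key out, it is worth confirming it resolves to the right provider config and model list. A rough check, assuming the gateway's OpenAI-compatible chat completions path on the same base URL the Claude Code config above points at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical one-token request with a freshly issued virtual key
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vk_team_frontend_abc123" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 1, "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A key scoped to the wrong provider config or a disallowed model should fail here, not on a developer's machine.&lt;/p&gt;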

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;virtual keys documentation&lt;/a&gt; for tool-level scoping options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Hierarchy: Cap Spend at Every Level
&lt;/h2&gt;

&lt;p&gt;Bifrost supports a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config. This maps cleanly to engineering org structures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;org_level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;

  &lt;span class="na"&gt;teams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frontend"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml_platform"&lt;/span&gt;
      &lt;span class="na"&gt;monthly_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
      &lt;span class="na"&gt;alert_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key_overrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_dev_rahul"&lt;/span&gt;
      &lt;span class="na"&gt;daily_limit_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits 80% of their budget, you get an alert. When they hit 100%, requests get blocked. No more surprise bills.&lt;/p&gt;

&lt;p&gt;The daily limit on individual virtual keys is useful for catching runaway Claude Code agent loops. If a developer accidentally triggers an infinite tool-call cycle, it burns through $50 and stops. Not $500.&lt;/p&gt;
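
&lt;p&gt;Rough numbers on why a $50 daily cap is a sensible ceiling, assuming Sonnet-class pricing of roughly $3 per million input tokens and $15 per million output tokens (adjust for your model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# back-of-the-envelope: a runaway loop emitting ~4,000 output tokens per call
# costs roughly $0.06 per call in output tokens alone
echo "calls before the daily cap trips: $((50 * 100 / 6))"   # about 833
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That leaves room for a normal day of Claude Code use, and far too little for an agent loop left running overnight.&lt;/p&gt;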

&lt;h2&gt;
  
  
  Audit Logging: Track Every Tool Call
&lt;/h2&gt;

&lt;p&gt;This is the part that convinced me. Bifrost logs every request with granular detail. For MCP tool calls specifically, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name&lt;/li&gt;
&lt;li&gt;Server name&lt;/li&gt;
&lt;li&gt;Arguments passed&lt;/li&gt;
&lt;li&gt;Results returned&lt;/li&gt;
&lt;li&gt;Latency per call&lt;/li&gt;
&lt;li&gt;Virtual key ID (so you know which developer triggered it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;per-tool audit logging docs&lt;/a&gt; for the full schema.&lt;/p&gt;

&lt;p&gt;Query logs to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which developer made the most LLM calls this week?&lt;/li&gt;
&lt;li&gt;What tools is the frontend team using in Claude Code?&lt;/li&gt;
&lt;li&gt;How much did code generation cost per team last month?&lt;/li&gt;
&lt;li&gt;Are any developers hitting rate limits frequently?&lt;/li&gt;
&lt;/ul&gt;
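
&lt;p&gt;The first of those questions is a one-liner if you export the request logs as JSON lines. A sketch; the field name is an assumption, so check the audit schema for the exact key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical: count LLM calls per virtual key from exported JSONL logs
jq -s 'group_by(.virtual_key_id)
       | map({key: .[0].virtual_key_id, calls: length})
       | sort_by(-.calls)' requests.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;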

&lt;p&gt;This is the audit trail you need for SOC 2 compliance and internal cost attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting: Prevent Runaway Usage
&lt;/h2&gt;

&lt;p&gt;I already showed rate limits in the virtual key config. But let me explain why this matters specifically for Claude Code.&lt;/p&gt;

&lt;p&gt;Claude Code in agent mode can make dozens of LLM calls per task. A single "refactor this module" command might trigger 15-20 API calls. Without rate limits, one developer running complex refactors back-to-back can consume your entire daily budget in an hour.&lt;/p&gt;

&lt;p&gt;Set conservative limits per developer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;tokens_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
  &lt;span class="na"&gt;concurrent_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This still allows normal Claude Code usage. But it prevents the scenario where someone kicks off a massive agent task and walks away.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at the gateway level with &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;sub-millisecond overhead&lt;/a&gt;. The developer gets a clear 429 response. Claude Code handles these gracefully with built-in retry logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Gateway: Control Which Tools Claude Code Can Access
&lt;/h2&gt;

&lt;p&gt;This is the governance layer that most teams miss. Claude Code can connect to MCP servers that expose file system access, database queries, deployment tools. You need to control which tools each team can use.&lt;/p&gt;

&lt;p&gt;Bifrost acts as an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;. You expose a single &lt;code&gt;/mcp&lt;/code&gt; endpoint and control tool access per virtual key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9001"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_directory"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9002"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;describe_table"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment"&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9003"&lt;/span&gt;
      &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy_staging"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rollback"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key_permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_frontend"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_backend"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vk_team_platform"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frontend developers cannot accidentally trigger deployments through Claude Code. Backend developers cannot access deployment tools. Only the platform team gets full access.&lt;/p&gt;

&lt;p&gt;Bifrost's MCP support includes &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Code Mode with 50%+ token reduction&lt;/a&gt; and sub-3ms latency. So you get governance without performance penalties.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here is the minimal setup to govern Claude Code across a 20-person engineering team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Bifrost (single binary, zero config)&lt;/li&gt;
&lt;li&gt;Create virtual keys per developer&lt;/li&gt;
&lt;li&gt;Set budget limits per team and per developer&lt;/li&gt;
&lt;li&gt;Configure rate limits&lt;/li&gt;
&lt;li&gt;Route MCP tools through the gateway with per-team permissions&lt;/li&gt;
&lt;li&gt;Point each developer's Claude Code config at the gateway&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total setup time when I did this: about 45 minutes. Most of that was deciding on budget allocations.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost docs&lt;/a&gt; cover each of these in detail. The &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has example configs for common setups.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;After running this for a week, a few notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with generous rate limits and tighten based on actual usage data. Too strict and developers complain.&lt;/li&gt;
&lt;li&gt;Set daily limits, not just monthly. Monthly limits let someone blow the budget on day 1.&lt;/li&gt;
&lt;li&gt;Review audit logs weekly. You will find patterns. Some developers are 10x more efficient with Claude Code than others. Share what works.&lt;/li&gt;
&lt;li&gt;Use separate virtual keys for Claude Code vs other AI tools. Makes cost attribution cleaner.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Claude Code without governance is a liability. With a gateway layer, it becomes a controlled, auditable, budget-safe tool. Bifrost handles this at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;11 microsecond overhead&lt;/a&gt;, so your developers do not notice the proxy.&lt;/p&gt;

&lt;p&gt;The alternative is waiting for the bill shock. I have seen it happen. Set up governance before you scale Claude Code to your full team.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Buyer's Guide to Pick the Best LLM Gateway in 2026</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:13:53 +0000</pubDate>
      <link>https://dev.to/pranay_batta/buyers-guide-to-pick-the-best-llm-gateway-in-2026-1epa</link>
      <guid>https://dev.to/pranay_batta/buyers-guide-to-pick-the-best-llm-gateway-in-2026-1epa</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; An LLM gateway sits between your application and LLM providers, handling routing, failover, cost controls, and observability. I tested five gateways against ten evaluation criteria. Bifrost won on latency and governance. LiteLLM wins on provider coverage. Kong and Cloudflare suit different enterprise needs. Here is the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an LLM Gateway?
&lt;/h2&gt;

&lt;p&gt;An LLM gateway is a reverse proxy purpose-built for LLM API traffic. It normalises requests across providers like &lt;a href="https://platform.openai.com/docs" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and &lt;a href="https://docs.anthropic.com" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, adds routing logic, failover, cost controls, caching, and observability without changing your application code. Think of it as an API gateway, but designed specifically for the economics and reliability challenges of LLM calls.&lt;/p&gt;
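
&lt;p&gt;In practice that means your application sends one request shape and the gateway worries about the provider behind it. A sketch, assuming a gateway listening locally and a &lt;code&gt;provider/model&lt;/code&gt; naming convention (gateways differ on the exact format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# same endpoint, same payload shape, two different providers behind it
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $GATEWAY_KEY" -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "hi"}]}'

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $GATEWAY_KEY" -H "Content-Type: application/json" \
  -d '{"model": "anthropic/claude-sonnet-4-5", "messages": [{"role": "user", "content": "hi"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;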

&lt;p&gt;If you are calling more than one LLM provider, or spending more than $500/month on API calls, you need one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Evaluation Criteria
&lt;/h2&gt;

&lt;p&gt;I benchmarked and tested five gateways over three weeks. Here is what matters, and what I found.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Latency Overhead
&lt;/h3&gt;

&lt;p&gt;The gateway itself should add near-zero latency. You are already waiting 500ms-2s for LLM responses. If your gateway adds another 8-15ms, that compounds across multi-step agent chains.&lt;/p&gt;

&lt;p&gt;I measured gateway overhead (not LLM response time) using a standardised &lt;a href="https://go.dev" rel="noopener noreferrer"&gt;Go&lt;/a&gt; benchmarking harness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; 11 microseconds. Written in Go, handles 5,000 RPS sustained. &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Benchmark details&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; ~8ms. Python-based, solid for moderate traffic. &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; ~3-5ms. Built on Kong's proven proxy layer. &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;Product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare AI Gateway:&lt;/strong&gt; Sub-1ms at edge (but limited to Cloudflare's network). &lt;a href="https://developers.cloudflare.com/ai-gateway/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity AI Gateway:&lt;/strong&gt; Not independently benchmarkable. Tied to Databricks runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If latency matters (agents, real-time apps), Bifrost is in a different league.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Provider Coverage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; 100+ providers. Broadest coverage available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; 19+ providers (OpenAI, Anthropic, Azure, Bedrock, Gemini, Mistral, Cohere, Groq, and more).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; Major providers via plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Major providers only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Focused on Databricks ecosystem plus external endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need obscure providers, LiteLLM wins. For the top 15-20 providers, any gateway here works.&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Routing Flexibility
&lt;/h3&gt;

&lt;p&gt;Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;weighted, priority-based, and conditional routing&lt;/a&gt;. Split traffic 70/30 between GPT-4o and Claude Sonnet, or route coding tasks to one model and summarisation to another. LiteLLM has basic load balancing. Kong does routing via plugins. Cloudflare and Databricks offer simpler options.&lt;/p&gt;

&lt;h3&gt;
  
  
  d. Failover and Reliability
&lt;/h3&gt;

&lt;p&gt;When a provider goes down (and they do), what happens? &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Bifrost's failover&lt;/a&gt; supports automatic retries with configurable backoff and fallback chains. If OpenAI returns a 429, it fails over to Anthropic automatically. LiteLLM has similar fallback support. Kong uses health checks. Cloudflare and Databricks offer basic retry/fallback options.&lt;/p&gt;

&lt;h3&gt;
  
  
  e. Cost Governance
&lt;/h3&gt;

&lt;p&gt;This is where gateways diverge sharply. Bifrost has a &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;four-tier budget system&lt;/a&gt;: per-key, per-team, per-project, and global with hard limits, soft warnings, and rate limits. &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Full governance docs&lt;/a&gt;. LiteLLM has budget controls via its proxy. Kong and Cloudflare offer rate limiting. Databricks ties into Unity Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  f. Caching
&lt;/h3&gt;

&lt;p&gt;Caching identical or similar LLM calls reduces cost and latency dramatically.&lt;/p&gt;

&lt;p&gt;Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer semantic caching&lt;/a&gt; with exact match and semantic similarity. Backend options include &lt;a href="https://redis.io" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; for exact caching, &lt;a href="https://weaviate.io" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; for vector-based semantic matching, and Qdrant as an alternative vector store.&lt;/p&gt;

&lt;p&gt;LiteLLM has basic caching support. Cloudflare caches at the edge (great for repeated queries). Kong and Databricks have limited native caching options.&lt;/p&gt;

&lt;h3&gt;
  
  
  g. Observability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost's observability&lt;/a&gt; captures request/response pairs, token counts, latency, cost, and model metadata with under 0.1ms overhead. Audit logging and virtual key tracking built in. LiteLLM has a dashboard plus integrations. Kong plugs into existing stacks. Cloudflare and Databricks have built-in analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  h. MCP Support
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; is becoming the standard for tool integration. Gateway-level MCP support matters for managing tool sprawl.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Bifrost's MCP support&lt;/a&gt; includes a Code Mode that generates TypeScript declarations instead of raw tool definitions. At 500 tools, this saves 92% on tokens. Tool-level scoping and access control are built in.&lt;/p&gt;

&lt;p&gt;Databricks Unity just added MCP governance. Kong v3.14 added A2A (Agent-to-Agent) support in April 2026. LiteLLM and Cloudflare have basic or no MCP-specific features.&lt;/p&gt;

&lt;p&gt;If you are building multi-agent systems with many tools, MCP governance is not optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  i. Deployment Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; Self-hosted. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Zero-config setup&lt;/a&gt; via &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or &lt;a href="https://docs.docker.com" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; Self-hosted (open-source) or managed (enterprise).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong:&lt;/strong&gt; Self-hosted or managed (Konnect).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Managed only. You are on Cloudflare's infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Managed. Tied to Databricks workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-hosted means your data never leaves your VPC. If you are in a regulated industry, this matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  j. Open Source vs Proprietary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost:&lt;/strong&gt; Fully open-source. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; Open-source core, enterprise features behind a paid tier. Note: LiteLLM had a supply chain security incident in March 2026 that affected its PyPI package. Worth reviewing before deploying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong AI Gateway:&lt;/strong&gt; Kong's core is open-source, but AI Gateway features require an enterprise licence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; Proprietary managed service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity:&lt;/strong&gt; Proprietary, part of Databricks platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong AI&lt;/th&gt;
&lt;th&gt;Cloudflare&lt;/th&gt;
&lt;th&gt;Databricks Unity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;11us&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;~3-5ms&lt;/td&gt;
&lt;td&gt;Sub-1ms (edge)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Providers&lt;/td&gt;
&lt;td&gt;19+&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;Ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Weighted, priority, conditional&lt;/td&gt;
&lt;td&gt;Basic LB&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover&lt;/td&gt;
&lt;td&gt;Full fallback chains&lt;/td&gt;
&lt;td&gt;Fallback support&lt;/td&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;Basic retry&lt;/td&gt;
&lt;td&gt;Endpoint fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost governance&lt;/td&gt;
&lt;td&gt;Four-tier budgets&lt;/td&gt;
&lt;td&gt;Budget + rate limits&lt;/td&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Unity Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Semantic (Redis/Weaviate/Qdrant)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Edge caching&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Sub-0.1ms, full audit&lt;/td&gt;
&lt;td&gt;Dashboard + integrations&lt;/td&gt;
&lt;td&gt;Stack integration&lt;/td&gt;
&lt;td&gt;Built-in analytics&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;Code Mode, 92% savings&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;A2A (v3.14)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;MCP governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Self-hosted/managed&lt;/td&gt;
&lt;td&gt;Self-hosted/managed&lt;/td&gt;
&lt;td&gt;Managed only&lt;/td&gt;
&lt;td&gt;Managed only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Core only&lt;/td&gt;
&lt;td&gt;AI features paid&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Bifrost if&lt;/strong&gt; you need lowest latency, granular cost governance, semantic caching, and MCP tool management. Self-hosted, open-source. &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Get started here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick LiteLLM if&lt;/strong&gt; you need the widest provider coverage and can tolerate 8ms+ overhead. Factor in the March 2026 security incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Kong AI Gateway if&lt;/strong&gt; you already run Kong and want LLM routing added to existing infrastructure. A2A support in v3.14 is promising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cloudflare AI Gateway if&lt;/strong&gt; you want zero-ops and are already on Cloudflare. Limited governance for multi-team setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Databricks Unity AI Gateway if&lt;/strong&gt; you are all-in on Databricks. Strong MCP governance but locks you into the ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs to Accept
&lt;/h2&gt;

&lt;p&gt;No single best gateway exists. Bifrost's 19 providers cover 95% of production traffic but are fewer than LiteLLM's 100+. LiteLLM's Python runtime is slower but easier to extend. Kong is battle-tested as a proxy but its AI features are catching up. Cloudflare is easiest to set up but gives you the least control. Databricks is powerful within its ecosystem and limiting outside it.&lt;/p&gt;

&lt;p&gt;Pick the one that solves your biggest bottleneck first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Bifrost links:&lt;/strong&gt; &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Best AI Gateway to Route Codex CLI to Any Model</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:05:16 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</link>
      <guid>https://dev.to/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</guid>
      <description>&lt;p&gt;Codex CLI is OpenAI's terminal-based coding agent that runs entirely in your shell. It reads your codebase, proposes changes, runs commands, and writes code. Solid tool. One problem: it only talks to OpenAI by default.&lt;/p&gt;

&lt;p&gt;I wanted to route Codex CLI through an AI gateway so I could use Claude Sonnet, Gemini 2.5 Pro, Mistral, and others without switching tools. I tested a few options. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; worked best. Open-source, written in Go, 11 microsecond overhead. Here is exactly how I set it up and what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route Codex CLI Through an AI Gateway
&lt;/h2&gt;

&lt;p&gt;Codex CLI sends requests to OpenAI's API. That is fine until you need something else. Maybe Claude Sonnet handles your refactoring tasks better. Maybe Gemini's context window fits your monorepo. Maybe you want automatic failover when OpenAI rate limits you mid-session.&lt;/p&gt;

&lt;p&gt;An AI gateway sits between Codex CLI and your providers. It translates requests, routes traffic, and handles failures. You configure it once and Codex CLI does not know the difference.&lt;/p&gt;

&lt;p&gt;Without a gateway, your options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stick with OpenAI only (no routing, no failover, no cost tracking)&lt;/li&gt;
&lt;li&gt;Manually swap API keys and base URLs every time you want a different model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost for Codex CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an OpenAI-compatible endpoint. Codex CLI connects to it like it would connect to OpenAI directly. Full &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;Codex CLI integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; has the full walkthrough.&lt;/p&gt;

&lt;h3&gt;
  
  
  The OAuth Gotcha
&lt;/h3&gt;

&lt;p&gt;This one tripped me up. Codex CLI always prefers OAuth authentication over custom API keys. If you have previously logged in with OpenAI, Codex CLI will ignore your custom &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Run &lt;code&gt;/logout&lt;/code&gt; inside Codex CLI before configuring Bifrost. Without this step, your gateway config will be silently bypassed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Codex CLI to Use Bifrost
&lt;/h3&gt;

&lt;p&gt;Set your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add it to your &lt;code&gt;codex.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[auth]&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bifrost_virtual_key"&lt;/span&gt;

&lt;span class="nn"&gt;[network]&lt;/span&gt;
&lt;span class="py"&gt;openai_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8080/openai/v1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; here is a Bifrost virtual key. Your actual provider keys live in the Bifrost config.&lt;/p&gt;

&lt;p&gt;Done. Every Codex CLI request now flows through Bifrost.&lt;/p&gt;
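
&lt;p&gt;Optionally, sanity-check the wiring before launching a session. The same base URL and key Codex CLI will use should answer a plain chat completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# quick check using the env vars set above; if this fails, Codex CLI will too
curl -s "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;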

&lt;h2&gt;
  
  
  Routing Codex CLI to Any Model
&lt;/h2&gt;

&lt;p&gt;This is the core use case. Configure multiple providers in Bifrost, and route Codex CLI traffic however you want. Bifrost uses the &lt;code&gt;provider/model-name&lt;/code&gt; format for cross-provider routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-dev"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-5-20250929"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini/gemini-2-5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;60% of requests go to Claude Sonnet. 25% to Gemini. 15% to GPT-4o. Weights auto-normalise, so use any numbers.&lt;/p&gt;

&lt;p&gt;I ran this for a week. Claude Sonnet handled tool-heavy refactoring better. Gemini was faster on large context reads. GPT-4o was solid as a fallback. The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all configuration options.&lt;/p&gt;

&lt;p&gt;Other providers you can route to: Mistral, Groq, Cerebras, Cohere, Perplexity. All via the same &lt;code&gt;provider/model-name&lt;/code&gt; format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can You Use Codex CLI with Non-OpenAI Models?
&lt;/h3&gt;

&lt;p&gt;Yes. That is exactly what this setup does. Bifrost translates the OpenAI-format requests from Codex CLI into whatever format each provider expects. Codex CLI thinks it is talking to OpenAI. Bifrost handles the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical requirement:&lt;/strong&gt; non-OpenAI models must support tool use. Codex CLI relies on function calling for file operations, terminal commands, and code editing. If a model does not support tools, it will break on anything beyond simple chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Provider outages are inevitable. Bifrost sorts providers by weight and retries on failure. If Claude goes down, Gemini picks up. If Gemini fails, traffic falls back to OpenAI. Your Codex CLI session is never interrupted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover docs&lt;/a&gt; explain the retry logic in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: AI Gateway Options for Codex CLI
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Direct API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing overhead&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;td&gt;~8 milliseconds&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget controls&lt;/td&gt;
&lt;td&gt;4-tier hierarchy&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI compatible&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM works as a proxy for Codex CLI, but the Python runtime adds measurable latency. When every Codex CLI request goes through the gateway, those milliseconds compound. For a tool sitting in the critical path of your coding workflow, overhead matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Route Codex CLI Through an AI Gateway?
&lt;/h3&gt;

&lt;p&gt;Three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start Bifrost (&lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;/logout&lt;/code&gt; in Codex CLI to clear OAuth&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to point at Bifrost&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is it. Configure your providers in the Bifrost config, and Codex CLI routes to any model you specify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget and Observability
&lt;/h2&gt;

&lt;p&gt;Once all Codex CLI traffic flows through Bifrost, you get cost controls and logging for free. The four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; lets you cap spend at the virtual key, team, or provider level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-cli-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; logs every request: latency, tokens, cost, which provider handled it. When you are routing across three providers, this data tells you exactly where your money goes and which model performs best for your tasks.&lt;/p&gt;
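
&lt;p&gt;For example, with logs exported as JSON lines you can sum spend per provider in one command. The field names here are assumptions; check the observability schema for the exact keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical: total cost per provider from exported JSONL request logs
jq -s 'group_by(.provider)
       | map({provider: .[0].provider, total_cost: (map(.cost) | add)})' requests.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;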

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; also helps. Repeated or similar queries hit the cache instead of the provider. Cuts both cost and latency for common operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OAuth quirk is easy to miss.&lt;/strong&gt; If you skip the &lt;code&gt;/logout&lt;/code&gt; step, Codex CLI silently ignores your gateway config. There is no error. It just routes to OpenAI directly. I lost an hour to this before checking the docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use is non-negotiable.&lt;/strong&gt; Not every model supports function calling well enough for Codex CLI. Stick to models with solid tool use: Claude Sonnet, GPT-4o, Gemini 2.5 Pro. Smaller or older models may fail on file operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; You run and maintain the gateway. No managed cloud version for the open-source release. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; helps with access control, but ops is on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; One more process in the chain. The 11 microsecond overhead is negligible, but it is still something to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Logout from OpenAI OAuth in Codex CLI&lt;/span&gt;
&lt;span class="c"&gt;# Inside Codex CLI, run: /logout&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Codex CLI at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1

&lt;span class="c"&gt;# 4. Use Codex CLI normally - it routes through Bifrost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are using Codex CLI for real work, routing through an AI gateway gives you model flexibility, failover, and cost visibility that you cannot get from a single provider. I benchmarked the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;performance&lt;/a&gt; and the overhead is genuinely negligible.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>What is LLM Orchestration and How AI Gateways Enable It</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:37:58 +0000</pubDate>
      <link>https://dev.to/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</link>
      <guid>https://dev.to/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</guid>
      <description>&lt;p&gt;Most teams start with one LLM provider. Then they add a second for cost reasons. Then a third for latency. Six months in, they have a tangled mess of provider-specific SDKs, manual failover logic, and zero visibility into what anything costs. That mess is the problem LLM orchestration solves.&lt;/p&gt;

&lt;p&gt;I evaluated how teams handle multi-model routing at scale. Custom code, orchestration frameworks, AI gateways. Here is what works and what just adds overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is the practice of managing multiple LLM providers, models, and configurations through a unified control layer. Instead of hard-coding provider logic into your application, you route, balance, cache, and monitor all LLM traffic from one place.&lt;/p&gt;

&lt;p&gt;Think of it like a load balancer, but purpose-built for AI workloads. It handles which model gets which request, what happens when a provider goes down, how costs are tracked, and where the logs go.&lt;/p&gt;
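
&lt;p&gt;To make the load balancer analogy concrete, here is a toy sketch of what weight-based selection means. A real orchestration layer adds retries, health checks, and cost tracking on top, but the core decision is this simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# toy 70/30 split between two models; purely illustrative
if [ $((RANDOM % 100)) -lt 70 ]; then
  echo "route to openai/gpt-4o"
else
  echo "route to anthropic/claude-sonnet"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;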

&lt;p&gt;The core components of LLM orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt; - Deciding which model handles each request based on weight, cost, or capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt; - Automatically switching to a backup provider when one fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; - Distributing requests across providers to avoid rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost governance&lt;/strong&gt; - Enforcing budgets per team, project, or API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt; - Avoiding duplicate calls for identical or semantically similar prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - Tracking latency, tokens, costs, and errors across every request&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Do You Need LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;If you are calling one model from one provider, you do not. The moment any of these are true, you do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple models (GPT-4o, Claude, Gemini) serving different use cases&lt;/li&gt;
&lt;li&gt;Multiple teams sharing the same providers&lt;/li&gt;
&lt;li&gt;Budget limits per team or per project&lt;/li&gt;
&lt;li&gt;Uptime requirements that demand automatic failover&lt;/li&gt;
&lt;li&gt;Cost optimisation that requires routing cheaper queries to cheaper models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I measured what happens without orchestration. Teams I evaluated had 15-30% higher LLM costs from duplicate calls, multi-minute outages during provider incidents because there was no failover, and zero per-team cost attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an AI Gateway Handle Orchestration?
&lt;/h2&gt;

&lt;p&gt;An AI gateway is the infrastructure layer that makes LLM orchestration practical. Without a gateway, you are building every orchestration component yourself. With one, you configure it.&lt;/p&gt;

&lt;p&gt;Here is the comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Orchestration Feature&lt;/th&gt;
&lt;th&gt;DIY (Custom Code)&lt;/th&gt;
&lt;th&gt;AI Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-model routing&lt;/td&gt;
&lt;td&gt;Custom SDK per provider, manual selection&lt;/td&gt;
&lt;td&gt;Config-based weighted routing, auto-normalised&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover&lt;/td&gt;
&lt;td&gt;Try/catch with manual retry logic&lt;/td&gt;
&lt;td&gt;Automatic, sorted by weight, instant retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;Custom queue + rate tracking&lt;/td&gt;
&lt;td&gt;Built-in weighted distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost governance&lt;/td&gt;
&lt;td&gt;Manual token counting + billing integration&lt;/td&gt;
&lt;td&gt;Budget hierarchy with auto-enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Redis/Memcached with custom key logic&lt;/td&gt;
&lt;td&gt;Semantic + exact-match, built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Custom logging + dashboards&lt;/td&gt;
&lt;td&gt;Real-time streaming, filters, sub-millisecond overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Ongoing engineering effort&lt;/td&gt;
&lt;td&gt;Configuration changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DIY approach works for prototypes. For production with multiple teams and providers, it becomes a full-time job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up LLM Orchestration with Bifrost
&lt;/h2&gt;

&lt;p&gt;I tested several gateways for model orchestration. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; stood out on performance: 11 microsecond overhead, 5000 RPS throughput, written in Go. That matters because your orchestration layer should not become a bottleneck.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weighted Routing Config
&lt;/h3&gt;

&lt;p&gt;This is where LLM orchestration starts. You define providers with weights, and Bifrost &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routes traffic accordingly&lt;/a&gt;. Weights are auto-normalised to sum to 1.0, and routing is deny-by-default: only providers you explicitly configure are eligible to receive traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of traffic goes to GPT-4o. 30% to Claude. If OpenAI goes down, Bifrost &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatically fails over&lt;/a&gt; to the next provider sorted by weight. No code changes. No redeployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget Governance
&lt;/h3&gt;

&lt;p&gt;Bifrost uses a &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config. Each tier can have independent spend limits and rate limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team-key"&lt;/span&gt;
    &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_spend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits their budget, requests are denied. Not throttled. Denied. That is real &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;cost governance&lt;/a&gt;, not just monitoring.&lt;/p&gt;
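
&lt;p&gt;On the application side, a denial just surfaces as a failed request. Here is a minimal sketch of handling it with the OpenAI SDK pointed at the gateway; the &lt;code&gt;/v1&lt;/code&gt; path and the virtual key value are assumptions, and the exact status code for budget denials may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI, APIStatusError

# Assumptions: gateway on its default port 8080, OpenAI-compatible path at /v1,
# and a Bifrost virtual key standing in for the provider key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-virtual-key")

try:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarise yesterday's deploy log"}],
    )
    print(resp.choices[0].message.content)
except APIStatusError as err:
    # A team that has exhausted its budget gets a hard denial, not a slowdown.
    print("request rejected by the gateway:", err.status_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

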

&lt;h3&gt;
  
  
  Semantic Caching
&lt;/h3&gt;

&lt;p&gt;Bifrost runs &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt;: exact hash matching for identical prompts, plus semantic similarity for prompts that mean the same thing but are worded differently. Both layers reduce redundant API calls without any application code changes.&lt;/p&gt;
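
&lt;p&gt;The mechanics are easy to picture. The sketch below is not Bifrost's code, just the idea behind a dual-layer lookup: an exact hash check first, then a vector-similarity pass over cached prompt embeddings. The &lt;code&gt;embed&lt;/code&gt; stub and the 0.95 threshold are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

exact_cache = {}     # layer 1: prompt hash mapped to a cached response
vector_index = []    # layer 2: (embedding, cached response) pairs

def embed(prompt):
    # Stand-in for a real embedding model call.
    return [float(ord(c)) for c in prompt[:16].ljust(16)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

def lookup(prompt, threshold=0.95):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                   # identical prompt: cheapest possible hit
        return exact_cache[key]
    query = embed(prompt)
    for vec, response in vector_index:       # same meaning, different wording
        if cosine(query, vec) &amp;gt;= threshold:
            return response
    return None                              # miss: call the provider, then store in both layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exact matches skip the embedding call entirely; the similarity layer is what catches rephrasings of the same question.&lt;/p&gt;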

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Every request is logged with &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;less than 0.1ms overhead&lt;/a&gt;. 14+ filters for slicing data. WebSocket-based live streaming so you can watch requests in real time. No separate logging pipeline needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About MCP Workloads?
&lt;/h2&gt;

&lt;p&gt;If you are running MCP servers with 500+ tools, orchestration gets expensive fast. Every tool definition eats tokens. Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP Code Mode&lt;/a&gt; cuts token usage by up to 92% by loading tool definitions on demand instead of injecting every schema into the context window. That is a direct cost saving on top of the orchestration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is where to be careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it in your infrastructure. If you want a fully managed SaaS gateway, this is not it. For teams with compliance requirements, self-hosted is actually a benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go-based, not Python.&lt;/strong&gt; If your team needs to extend gateway logic in Python, the codebase will be unfamiliar. The upside is the 11 microsecond latency that Python gateways cannot match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration over code.&lt;/strong&gt; Bifrost favours YAML/UI config over programmatic SDKs. If you need deeply custom routing logic (like routing based on prompt content analysis), you will need to handle that at the application layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You may not need a gateway at all.&lt;/strong&gt; For simple single-provider setups, a gateway is overkill. If you are only using OpenAI and do not need failover or budgets, just call the API directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between LLM orchestration and LLM routing?
&lt;/h3&gt;

&lt;p&gt;LLM routing is one component of orchestration. Routing decides which model handles a request. Orchestration includes routing plus failover, caching, budgets, load balancing, and observability. Multi-model routing is necessary but not sufficient for production AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I do LLM orchestration without a gateway?
&lt;/h3&gt;

&lt;p&gt;Technically, yes. You can build routing, failover, caching, and observability yourself. Practically, I have seen teams spend 2-3 engineering months building what a gateway provides out of the box. And then they still need to maintain it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI gateway compare to LangChain for orchestration?
&lt;/h3&gt;

&lt;p&gt;LangChain is a framework for building LLM applications. An AI gateway is infrastructure for managing LLM traffic. They solve different problems. You can use both: LangChain for application logic, and a gateway like &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for orchestration underneath. Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for OpenAI's API format, so integration is straightforward.&lt;/p&gt;
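
&lt;p&gt;As a sketch of that split, here is LangChain application code talking to the gateway instead of OpenAI directly. The &lt;code&gt;/v1&lt;/code&gt; path and the virtual key value are assumptions about how the OpenAI-compatible endpoint is exposed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_openai import ChatOpenAI

# Application logic stays in LangChain; routing, failover, caching, and budgets
# happen in the gateway underneath this one base_url override.
llm = ChatOpenAI(
    model="gpt-4o",
    base_url="http://localhost:8080/v1",  # assumed OpenAI-compatible endpoint
    api_key="bifrost-virtual-key",        # gateway virtual key, not a provider key
)

print(llm.invoke("Draft a one-line release note for v2.3").content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

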

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is not optional once you are running multiple models in production. The question is whether you build it or use a gateway. I have tested both paths. The gateway approach - specifically Bifrost at &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;11 microsecond overhead&lt;/a&gt; - saves engineering time and gives you better observability from day one.&lt;/p&gt;

&lt;p&gt;Star the repo: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;&lt;br&gt;
Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>devops</category>
    </item>
    <item>
      <title>MCP at Scale: Access Control, Cost Governance, and 92% Lower Token Costs</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:44:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</link>
      <guid>https://dev.to/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Tax on Every MCP Request
&lt;/h2&gt;

&lt;p&gt;Here is something nobody talks about when they demo MCP integrations: token costs at scale.&lt;/p&gt;

&lt;p&gt;I have been running MCP setups with increasing numbers of connected servers. The pattern is always the same. You connect a few servers, everything works brilliantly. You connect a dozen, costs start climbing. You connect sixteen servers with 500+ tools, and suddenly your token budget is gone before the model even starts thinking about your actual query.&lt;/p&gt;

&lt;p&gt;Why? Every tool definition from every connected server gets injected into the model's context on every single request. 150+ tool definitions can consume the majority of your token budget. And there is zero access control. Any consumer can call any tool. No cost tracking at tool level.&lt;/p&gt;
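
&lt;p&gt;A rough back-of-the-envelope makes the tax concrete. The per-definition figure below is an assumption for illustration (real schemas vary a lot); the point is that the overhead scales linearly with connected tools, before the model has read a word of your actual query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: assume an average MCP tool definition costs about 250 tokens
# once its name, description, and JSON schema are serialised into context.
TOKENS_PER_TOOL = 250

for tool_count in (20, 150, 500):
    overhead = tool_count * TOKENS_PER_TOOL
    print(tool_count, "tools:", overhead, "context tokens on every request")

# 20 tools:  5,000 tokens   - barely noticeable
# 150 tools: 37,500 tokens  - a large slice of the budget
# 500 tools: 125,000 tokens - most of a typical context window, on every request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

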

&lt;p&gt;This is unsustainable for production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Tested Bifrost's Code Mode Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a fundamentally different approach to this problem. Instead of dumping all tool definitions into the context window, it exposes a virtual filesystem of Python stub files. The model discovers tools on-demand through four meta-tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt; - discover available servers and tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt; - load specific function signatures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt; - fetch detailed documentation only when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt; - run scripts in a sandboxed Starlark interpreter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the model only loads what it actually needs for the current query. If you ask it to read a file, it does not need to know about your Slack, GitHub, Jira, and database tools all at once.&lt;/p&gt;

&lt;p&gt;Here is what a typical tool discovery flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Model calls listToolFiles to see available servers
&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;listToolFiles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: ["filesystem/", "github/", "slack/", "jira/", ...]
&lt;/span&gt;
&lt;span class="c1"&gt;# Model identifies it needs filesystem tools for this query
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readToolFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns only the function signature for filesystem_read
&lt;/span&gt;
&lt;span class="c1"&gt;# Model fetches docs only if needed
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getToolDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Executes with full sandboxing
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;executeToolCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/src/main.go&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is lazy loading for LLM tool contexts. Simple idea. Massive impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results: 3 Controlled Rounds
&lt;/h2&gt;

&lt;p&gt;I ran three controlled rounds, scaling from 6 servers to 16 servers. Every round maintained a 100% task pass rate. The model completed every task correctly while using dramatically fewer tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Servers&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;58.2%&lt;/td&gt;
&lt;td&gt;55.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;508&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;92.8%&lt;/td&gt;
&lt;td&gt;92.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At roughly 500 tools, Code Mode reduces per-query token usage by about 14x. From 1.15M tokens down to 83K. That is not an incremental improvement. That is a different cost structure entirely.&lt;/p&gt;

&lt;p&gt;The savings compound non-linearly. As you add more tools, the percentage saved increases because Code Mode's overhead stays roughly constant while traditional mode scales linearly with tool count.&lt;/p&gt;

&lt;p&gt;For full benchmark methodology, check the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control That Actually Works
&lt;/h2&gt;

&lt;p&gt;Token savings are great, but production MCP deployments need &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt;. Bifrost handles this through two mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Keys&lt;/strong&gt; let you create scoped credentials per user, team, or customer. You can scope at the tool level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team-key"&lt;/span&gt;
  &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_read&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_query&lt;/span&gt;
  &lt;span class="na"&gt;blocked_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_delete&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filesystem_write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Allow &lt;code&gt;database_read&lt;/code&gt; and &lt;code&gt;database_query&lt;/code&gt;, block &lt;code&gt;database_delete&lt;/code&gt; and &lt;code&gt;filesystem_write&lt;/code&gt;. Fine-grained, declarative, no code changes needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool Groups&lt;/strong&gt; are named collections of tools from multiple servers. You create a group, attach it to keys, teams, or users. No database queries at resolve time. This is important when you are running at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;5000 RPS&lt;/a&gt; and cannot afford lookup latency.&lt;/p&gt;
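
&lt;p&gt;The "no database queries at resolve time" point is just precomputation: groups are expanded into an in-memory map when configuration loads, so checking a key's access at request time is a set lookup. Here is a conceptual sketch with hypothetical group and tool names, not Bifrost's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Config load time: expand named tool groups once.
TOOL_GROUPS = {
    "read-only-data": ["database_read", "database_query", "filesystem_read"],
    "ci-bots": ["github_create_issue", "slack_post_message"],
}

VIRTUAL_KEYS = {
    "data-team-key": {"groups": ["read-only-data"]},
    "automation-key": {"groups": ["read-only-data", "ci-bots"]},
}

# Precompute the allowed tool set per key so request-time checks stay in memory.
RESOLVED = {
    key: frozenset(t for g in spec["groups"] for t in TOOL_GROUPS[g])
    for key, spec in VIRTUAL_KEYS.items()
}

def is_allowed(virtual_key, tool):
    # Request time: a set membership test, no database round trip.
    return tool in RESOLVED.get(virtual_key, frozenset())

print(is_allowed("data-team-key", "database_read"))       # True
print(is_allowed("data-team-key", "slack_post_message"))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

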

&lt;h2&gt;
  
  
  Per-Tool Observability
&lt;/h2&gt;

&lt;p&gt;Every tool execution gets logged with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name and server source&lt;/li&gt;
&lt;li&gt;Arguments passed and results returned&lt;/li&gt;
&lt;li&gt;Execution latency&lt;/li&gt;
&lt;li&gt;Virtual key that initiated the call&lt;/li&gt;
&lt;li&gt;Parent LLM request context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost at the tool level&lt;/a&gt; alongside LLM token costs. This matters when your finance team asks why the AI bill doubled last month. You can point to exactly which tools, which teams, and which queries drove the spend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits&lt;/a&gt; let you set spending caps per virtual key, so no single team can blow through the monthly allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Flexibility
&lt;/h2&gt;

&lt;p&gt;Bifrost supports four &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP connection types&lt;/a&gt;: STDIO, HTTP, SSE, and in-process via the Go SDK. OAuth 2.0 with PKCE and automatic token refresh is built in. Health monitoring with automatic reconnects keeps things running without manual intervention.&lt;/p&gt;

&lt;p&gt;You can run it in manual approval mode where a human reviews tool calls, or in autonomous agent loop mode where the model chains tool calls independently.&lt;/p&gt;

&lt;p&gt;For Claude Code and Cursor users, the &lt;code&gt;/mcp&lt;/code&gt; endpoint integrates directly. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup takes minutes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curve for Code Mode.&lt;/strong&gt; The virtual filesystem abstraction is elegant, but it is a new mental model. Teams used to traditional MCP tool injection will need to understand why their tools are now "files" the model reads on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-tool overhead on simple queries.&lt;/strong&gt; If you only have 10-20 tools, the extra round trips through the four meta-tools (listToolFiles, readToolFile, and so on) can cost more than on-demand loading saves. The real wins kick in above 50-100 tools. Below that threshold, traditional mode works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlark sandbox limitations.&lt;/strong&gt; The sandboxed Starlark interpreter is secure by design, but it means tool code runs in a restricted environment. Complex tool implementations may need adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency on gateway availability.&lt;/strong&gt; Adding a gateway layer means one more component to monitor. Bifrost's 11 microsecond latency and Go-based architecture make this a non-issue in practice, but it is still an additional piece of infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Care
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 50 MCP tools, you probably do not need Code Mode yet. Traditional tool injection works fine at that scale.&lt;/p&gt;

&lt;p&gt;If you are running 100+ tools across multiple servers, or if you need per-team access control, or if your CFO is asking questions about AI infrastructure costs, this is worth evaluating.&lt;/p&gt;

&lt;p&gt;The 92% cost reduction at 500+ tools is the headline number, but the governance features (virtual keys, tool groups, audit logging) are what make it production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Bifrost is open-source and written in Go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; - star it if this is useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP documentation&lt;/a&gt; - full setup guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Governance docs&lt;/a&gt; - virtual keys, tool groups, budgets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt; - up and running in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have been testing a lot of MCP tooling lately. Bifrost's approach to the context window problem is the most practical solution I have seen. The lazy loading pattern for tool definitions should honestly be how all MCP gateways work.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and give it a spin. Happy to discuss benchmarks or setup in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Track LLM Costs and Rate Limits on AWS Bedrock with an AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:21:38 +0000</pubDate>
      <link>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</link>
      <guid>https://dev.to/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</guid>
      <description>&lt;p&gt;Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.&lt;/p&gt;

&lt;p&gt;I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with AWS Native Cost Tracking
&lt;/h2&gt;

&lt;p&gt;AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get from CloudWatch + Cost Explorer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate Bedrock spend per region&lt;/li&gt;
&lt;li&gt;Invocation counts at the service level&lt;/li&gt;
&lt;li&gt;Basic alarms on total spend thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you do not get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-model token-level cost breakdowns&lt;/li&gt;
&lt;li&gt;Team or project-level budget enforcement&lt;/li&gt;
&lt;li&gt;Rate limiting by user, team, or API key&lt;/li&gt;
&lt;li&gt;Real-time cost tracking per request&lt;/li&gt;
&lt;li&gt;Automatic routing away from providers that exceed limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Approach
&lt;/h2&gt;

&lt;p&gt;An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.&lt;/p&gt;

&lt;p&gt;I tested three approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Native (CloudWatch + Cost Explorer)&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM-specific cost tracking&lt;/td&gt;
&lt;td&gt;Aggregate only&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget hierarchy&lt;/td&gt;
&lt;td&gt;Account-level billing alerts&lt;/td&gt;
&lt;td&gt;Basic budget controls&lt;/td&gt;
&lt;td&gt;4-tier: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;No native LLM rate limits&lt;/td&gt;
&lt;td&gt;Basic rate limiting&lt;/td&gt;
&lt;td&gt;VK + Provider Config level, token and request limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset durations&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Limited options&lt;/td&gt;
&lt;td&gt;1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (provider type "bedrock")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~8ms (Python)&lt;/td&gt;
&lt;td&gt;11 microseconds (Go)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Self-hosted or cloud&lt;/td&gt;
&lt;td&gt;Self-hosted (runs in your VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost with AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Configure Bedrock as a provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-mistral"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral.mistral-large-2407-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover all Bedrock model formats and region options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four-Tier Budget Hierarchy
&lt;/h2&gt;

&lt;p&gt;This is where Bifrost separates itself from everything else I tested. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget system&lt;/a&gt; has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;team_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1w"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.&lt;/p&gt;

&lt;p&gt;Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.&lt;/p&gt;
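
&lt;p&gt;Here is the arithmetic as a worked example. The per-token prices are illustrative assumptions (check current Bedrock pricing); the structure of the check is the point: the request cost is computed from actual token counts, and every tier needs headroom for it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative per-million-token prices; real Bedrock pricing varies by model and region.
INPUT_PRICE_PER_M = 3.00     # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00   # USD per 1M output tokens (assumed)

def request_cost(input_tokens, output_tokens):
    return (
        (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
        + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    )

cost = request_cost(input_tokens=12_000, output_tokens=1_500)
print(round(cost, 4))  # 0.0585 at these assumed prices

# Every tier must have headroom for the request to pass.
limits = {"customer": 5000.0, "team": 2000.0, "virtual_key": 500.0, "provider": 1000.0}
spent = {"customer": 4999.99, "team": 1200.00, "virtual_key": 480.00, "provider": 700.00}

blocked = [tier for tier, limit in limits.items() if spent[tier] + cost &amp;gt; limit]
print(blocked or "request goes through")  # ['customer'] blocks this one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

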

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; have the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting That Actually Works for LLMs
&lt;/h2&gt;

&lt;p&gt;AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.&lt;/p&gt;
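
&lt;p&gt;The difference is easy to see with a concrete timestamp. A quick sketch of a rolling 24-hour window versus a calendar-aligned "1d" reset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

first_request = datetime(2026, 4, 13, 17, 45, tzinfo=timezone.utc)

# Rolling window: 24 hours after the first request.
rolling_reset = first_request + timedelta(days=1)

# Calendar-aligned "1d": the next midnight UTC, regardless of when usage started.
calendar_reset = (first_request + timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

print(rolling_reset.isoformat())   # 2026-04-14T17:45:00+00:00
print(calendar_reset.isoformat())  # 2026-04-14T00:00:00+00:00, hours earlier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

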

&lt;p&gt;Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.&lt;/p&gt;
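
&lt;p&gt;Conceptually, that exclusion is a filter on the candidate pool before weighted selection happens. This is an illustrative sketch of the idea, not Bifrost's implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;PROVIDERS = [
    {"id": "bedrock-claude", "weight": 80},
    {"id": "bedrock-mistral", "weight": 20},
]
REQUESTS_THIS_HOUR = {"bedrock-claude": 500, "bedrock-mistral": 120}
HOURLY_LIMIT = {"bedrock-claude": 500, "bedrock-mistral": 1000}

# Providers at their limit drop out of the pool; the rest keep serving traffic.
candidates = [
    p for p in PROVIDERS
    if REQUESTS_THIS_HOUR[p["id"]] &amp;lt; HOURLY_LIMIT[p["id"]]
]
print([p["id"] for p in candidates])  # ['bedrock-mistral'] until the window resets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

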

&lt;h2&gt;
  
  
  Observability at Sub-Millisecond Overhead
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost is captured: tokens used, latency, cost, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.&lt;/p&gt;

&lt;p&gt;What makes this useful for AWS teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;14+ API filter options&lt;/strong&gt; for querying logs. Filter by model, provider, team, cost range, status code, time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket live updates.&lt;/strong&gt; Watch requests flow through in real time. Useful during load testing or incident debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single pane across providers.&lt;/strong&gt; If you are running Bedrock plus OpenAI or Gemini as &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt;, all logs are in one place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralised view saves real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool solves everything. Here is what to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM has broader provider coverage.&lt;/strong&gt; 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS native tools have zero overhead.&lt;/strong&gt; If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go vs Python matters at scale.&lt;/strong&gt; Bifrost's 11 microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; back this up: 5,000 RPS on a single instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is a newer project.&lt;/strong&gt; The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stick with AWS native tools if:&lt;/strong&gt; You have one team, one model, and just need billing alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider LiteLLM if:&lt;/strong&gt; You need maximum provider coverage and are comfortable with Python-based overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Bifrost if:&lt;/strong&gt; You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; and automatic failover alongside cost controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost in your VPC&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure Bedrock providers in bifrost.yaml&lt;/span&gt;

&lt;span class="c"&gt;# 3. Set budget and rate limit tiers&lt;/span&gt;

&lt;span class="c"&gt;# 4. Point your application at the gateway&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Bedrock request now has cost tracking, rate limiting, and observability built in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to dig into the source.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Best Claude Code Gateway for Multi-Model Routing</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:29:45 +0000</pubDate>
      <link>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</link>
      <guid>https://dev.to/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</guid>
      <description>&lt;p&gt;Claude Code is great until you need more than one model. You hit a rate limit on Anthropic, want Gemini for long context, or need GPT-4o for a specific task. The default setup gives you no way to route across providers.&lt;/p&gt;

&lt;p&gt;I spent a week testing gateways that sit between Claude Code and LLM providers. The goal was simple: configure multiple models, set routing weights, get automatic failover, and keep Claude Code working normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; was the clear winner. Open-source, written in Go, 11 microsecond overhead per request. Here is how I set up multi-model routing and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Model Routing Matters
&lt;/h2&gt;

&lt;p&gt;Different models are good at different things. Claude Sonnet handles tool use well. GPT-4o is strong at certain code generation tasks. Gemini 2.5 Pro handles massive context windows. Using one model for everything means you are leaving performance on the table.&lt;/p&gt;

&lt;p&gt;Multi-model routing lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split traffic across providers by weight&lt;/li&gt;
&lt;li&gt;Fail over automatically when a provider goes down&lt;/li&gt;
&lt;li&gt;Pin specific models for specific tasks&lt;/li&gt;
&lt;li&gt;Control costs by routing cheaper models for simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: Claude Code talks to &lt;code&gt;api.anthropic.com&lt;/code&gt; by default. No native multi-model support. You need a gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Bifrost as a Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint. Claude Code does not know a gateway exists. It sends standard requests, and Bifrost translates and routes them to whatever provider you configure.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Connect
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup guide&lt;/a&gt; has the details.&lt;/p&gt;

&lt;p&gt;Point Claude Code at Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; here is a Bifrost virtual key, not your actual Anthropic key. Provider keys live in the Bifrost config. This is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the Anthropic API.&lt;/p&gt;

&lt;p&gt;Done. Every Claude Code request now flows through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weighted Routing Configuration
&lt;/h2&gt;

&lt;p&gt;This is the core of multi-model routing. You assign weights to providers, and Bifrost distributes traffic accordingly. Weights auto-normalize to sum to 1.0, so you can use any numbers.&lt;/p&gt;
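
&lt;p&gt;The normalization is just division by the total, which is why any numbers work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

print(normalize([70, 30]))     # [0.7, 0.3]
print(normalize([80, 15, 5]))  # [0.8, 0.15, 0.05]
print(normalize([7, 3]))       # same split as 70/30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

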

&lt;p&gt;Here is a config that splits traffic between GPT-4o and Claude Sonnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of requests go to GPT-4o. 30% to Claude Sonnet. I used this to compare output quality across providers in real coding sessions without manually switching anything.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all the configuration options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important detail:&lt;/strong&gt; cross-provider routing does not happen automatically. You must explicitly configure each provider in your config. Bifrost does not guess or infer routing rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;Weighted routing is useful. Automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; is essential. Providers go down. Rate limits hit. You do not want your Claude Code session to break mid-task.&lt;/p&gt;

&lt;p&gt;Bifrost sorts providers by weight and retries on failure. If the primary provider fails, the next one picks up the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI goes down, Bifrost retries with Gemini. Gemini fails, Bifrost falls back to Anthropic. My coding session never gets interrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Pinning for Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;If your team uses AWS Bedrock or Google Vertex AI, you can pin specific models directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bedrock&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bedrock/global.anthropic.claude-sonnet-4-6"&lt;/span&gt;

&lt;span class="c"&gt;# Vertex AI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vertex/claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also override the model mid-session using the &lt;code&gt;--model&lt;/code&gt; flag or the &lt;code&gt;/model&lt;/code&gt; command inside Claude Code. This is useful when you want different models for different parts of a task: start with Sonnet for scaffolding, switch to GPT-4o for a tricky implementation, then switch back. The gateway handles the translation layer for each provider.&lt;/p&gt;
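
&lt;p&gt;As a rough sketch of what that looks like in practice (the model names are just whatever you exposed through the gateway, and the session flow is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start a session pinned to Sonnet through the gateway
claude --model claude-sonnet-4-20250514

# Inside the session, switch to GPT-4o for a tricky implementation:
#   /model gpt-4o
# ...then switch back when you are done:
#   /model claude-sonnet-4-20250514
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
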

&lt;p&gt;This is one area where the &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK compatibility&lt;/a&gt; matters. Bifrost maintains full compatibility with the Anthropic message format, so model pinning and switching work without any client-side changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration docs&lt;/a&gt; list all supported providers and model formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Controls Across Providers
&lt;/h2&gt;

&lt;p&gt;Once all traffic flows through one gateway, cost management becomes straightforward. Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-code-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a limit. When it is reached, requests get blocked. No surprise bills from a runaway Claude Code session.&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; handles rate limiting, access control, and spend management across all configured providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Across All Providers
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost gets logged: latency, token count, cost, provider used, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you a single view across all providers.&lt;/p&gt;

&lt;p&gt;This is particularly useful with multi-model routing. You can see exactly which provider handled each request, compare response times across models, and track per-provider costs. When I was running 70/30 weighted routing between GPT-4o and Claude Sonnet, the observability data showed me exactly how each model performed on real coding tasks: response times, token consumption, and cost per request, all in one place.&lt;/p&gt;

&lt;p&gt;Without centralized logging, you are checking multiple provider dashboards and guessing which model handled what. That is not sustainable when you are running multiple providers through Claude Code daily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter streaming limitation.&lt;/strong&gt; OpenRouter does not stream function call arguments properly. This causes file operation failures in Claude Code. If you use OpenRouter as a provider, expect issues with tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-Anthropic model requirements.&lt;/strong&gt; Any non-Anthropic model you route through must support tool use. Claude Code relies heavily on function calling. Models without proper tool support will fail on file operations, search, and other agent tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version requires you to run and maintain the gateway. There is no managed cloud offering. That means monitoring, updating, and debugging are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newer project.&lt;/strong&gt; Bifrost's community is growing but still smaller than older alternatives. Documentation is solid, but edge cases may require digging through issues on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; You are adding a process between Claude Code and your provider. The 11 microsecond overhead is negligible, but it is one more thing in the chain to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran benchmarks matching the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt;. The numbers held up: 11 microseconds of routing overhead, 5,000 requests per second on a single instance. The Go implementation makes a real difference. Python-based gateways I tested added significantly more latency.&lt;/p&gt;
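
&lt;p&gt;If you just want a rough feel for throughput on your own hardware before following the full guide, a generic HTTP load generator is enough. This is not the official benchmark setup: the &lt;code&gt;hey&lt;/code&gt; tool, endpoint path, and payload below are my assumptions, and a sustained run sends real provider traffic, so keep it short or point it at a cached or mock upstream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough load check with hey (github.com/rakyll/hey), not the official Bifrost harness.
# Careful: this sends real requests upstream; keep the duration short.
hey -z 10s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-bifrost-virtual-key" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}' \
  http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
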

&lt;p&gt;For a gateway that sits in the critical path of every LLM call, low overhead matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure providers in bifrost.yaml (weighted routing + failover)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Claude Code at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key

&lt;span class="c"&gt;# 4. Use Claude Code normally&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Your Claude Code session now routes across multiple models with automatic failover and budget controls.&lt;/p&gt;
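
&lt;p&gt;Before opening Claude Code, you can confirm the wiring by calling the gateway directly with the same base URL and key. This assumes Bifrost forwards the standard Anthropic Messages path (&lt;code&gt;/v1/messages&lt;/code&gt;) under the &lt;code&gt;/anthropic&lt;/code&gt; prefix; adjust if your deployment differs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sanity check: call the gateway the same way Claude Code will.
# Assumption: the Anthropic-compatible route is $ANTHROPIC_BASE_URL/v1/messages.
curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 32, "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
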

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running Claude Code for real work, multi-model routing is not optional. Single-provider setups break at the worst times. A gateway that handles routing, failover, and cost controls in one place saves hours of debugging and thousands in unexpected spend.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
