<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kuldeep Paul</title>
    <description>The latest articles on DEV Community by Kuldeep Paul (@kuldeep_paul).</description>
    <link>https://dev.to/kuldeep_paul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2945723%2F40d70f4f-01f5-49ae-b4b5-2a1c2f77c64f.jpeg</url>
      <title>DEV Community: Kuldeep Paul</title>
      <link>https://dev.to/kuldeep_paul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kuldeep_paul"/>
    <language>en</language>
    <item>
      <title>Open-Source LLM Gateways for Production: Which One Fits Your Stack in 2026?</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:49:41 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/open-source-llm-gateways-for-production-which-one-fits-your-stack-in-2026-13lo</link>
      <guid>https://dev.to/kuldeep_paul/open-source-llm-gateways-for-production-which-one-fits-your-stack-in-2026-13lo</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluating five open-source AI gateways for multi-provider LLM routing, performance, and governance in 2026. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; stands out as the enterprise choice for teams that need ultra-low latency, comprehensive governance, and production-grade reliability.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building production AI systems in 2026 means routing across multiple LLM providers, managing costs at scale, and keeping your model calls resilient when a provider stumbles. That's where open-source AI gateways come in. They sit between your application and the LLM providers you're using, unifying fragmented APIs into one clean interface while handling failover, caching, budgets, and observability. Because they're open-source, you own the code, can audit every line of the routing logic, and deploy them inside your own infrastructure (no vendor control plane required).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, built by Maxim AI as a &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Go-native gateway&lt;/a&gt;, has emerged as the top choice for teams running mission-critical AI at scale that demand production latency, deep access controls, and self-hosted deployment without compromise. In this post, we'll walk through five production-ready open-source gateways, what makes each worth considering, and how to choose the right fit for your workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Open-Source AI Gateways
&lt;/h2&gt;

&lt;p&gt;A self-hostable gateway is infrastructure that runs on your machines, sitting between your services and the LLM providers you call. It accepts requests through a single API endpoint, routes them to the right provider, and sends responses back to your app. Along the way, it can enforce budgets, cache responses, log access, failover across providers, and expose your tools via the Model Context Protocol (MCP) for agentic work.&lt;/p&gt;

&lt;p&gt;What changed in 2026 is the bar for "production-ready." Today's best gateways treat MCP traffic, semantic caching, and per-team budget controls as core capabilities, not bolt-on plugins. A &lt;a href="https://www.cncf.io/reports/cncf-annual-survey-2024/" rel="noopener noreferrer"&gt;CNCF survey&lt;/a&gt; shows cloud-native organizations increasingly run their own critical infrastructure layers rather than relying on managed services, and gateways are following the same trend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating These Gateways: What Actually Matters
&lt;/h2&gt;

&lt;p&gt;When you're testing gateways, focus on these dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overhead at scale&lt;/strong&gt;: How much latency does the gateway add per request? Test at sustained throughput (thousands of requests per second), not toy traffic. A gateway that adds milliseconds kills your latency budget for streaming chat or multi-step agent workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider breadth&lt;/strong&gt;: How many LLM APIs does it support? Can you use the same gateway code with OpenAI, Anthropic, Bedrock, and self-hosted models, or do you need workarounds?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP and agents&lt;/strong&gt;: Does it natively support Model Context Protocol as both client and server? Can it route agent tool calls, or does that require a separate layer?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance and access control&lt;/strong&gt;: Can you set per-team or per-project budgets, rate limits, and granular permissions? For multi-tenant or regulated setups, this matters hugely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching strategy&lt;/strong&gt;: Exact-match caching is table stakes. Semantic caching (matching similar requests even when phrased differently) cuts costs dramatically for production Q&amp;amp;A and RAG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment flexibility&lt;/strong&gt;: Can you run this in VPC-only, air-gapped, on Kubernetes, or in a container? Does it have external dependencies that limit where you can run it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source terms&lt;/strong&gt;: Is the license clean (Apache 2.0, MIT)? Is there a clear boundary between free and commercial tiers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;Bifrost buyer's guide&lt;/a&gt; provides a structured matrix to evaluate all of these.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Best Open-Source AI Gateways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is a high-performance gateway written in Go, engineered from day one as infrastructure-grade software. Testing shows it adds roughly 11 microseconds of per-request overhead at 5,000 requests per second, which is about 54 times less tail-latency impact than a Python proxy on the same hardware, with 68% less memory use. You can grab the full &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmark results&lt;/a&gt; to see how it compares. Apache 2.0 licensed, code is on &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single API gateway for 20+ providers&lt;/strong&gt;: OpenAI, Anthropic, Bedrock, Vertex AI, Azure, Mistral, Groq, Cohere, Ollama, vLLM, all accessible through one &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;OpenAI-compatible endpoint&lt;/a&gt;. Flip the base URL in your existing SDK, nothing else changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native MCP as client and server&lt;/strong&gt;: Bifrost can connect to external tool servers and expose tools to clients (Claude, agents, etc.), with &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; for autonomous execution and &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; for AI-written tool orchestration that cuts tokens and latency. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP resource hub&lt;/a&gt; has the full story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tier budget and access control&lt;/strong&gt;: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are the core, where you set per-consumer budgets, rate limits, and MCP tool allow-lists across virtual key, team, and customer tiers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-downtime failover&lt;/strong&gt;: &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Automatic provider routing&lt;/a&gt; with weighted load balancing. When a provider starts returning errors, traffic shifts seamlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-layer caching&lt;/strong&gt;: Combines exact-match with &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic similarity&lt;/a&gt; so rephrased questions hit the cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise deployment&lt;/strong&gt;: &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;In-VPC isolation, air-gapped mode, clustering for HA, RBAC, and immutable audit trails&lt;/a&gt;, built to satisfy SOC 2, GDPR, and HIPAA auditors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploys in under 60 seconds with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or a single Docker image. Integrates natively with Claude Code, Codex CLI, Gemini CLI, and Cursor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams operating mission-critical AI workloads that require best-in-class performance, scalability, and reliability. It serves as a centralized AI gateway to route, govern, and secure all AI traffic across models and environments with ultra low latency. Bifrost unifies LLM gateway, MCP gateway, and Agents gateway capabilities into a single platform. Designed for regulated industries and strict enterprise requirements, it supports air-gapped deployments, VPC isolation, and on-prem infrastructure. It provides full control over data, access, and execution, along with robust security, policy enforcement, and governance capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. LiteLLM
&lt;/h3&gt;

&lt;p&gt;LiteLLM is a Python-based open-source project that gives you a unified OpenAI-compatible API to 100+ providers. Available as both a Python SDK and a standalone proxy server with built-in admin UI, virtual keys, and spend dashboards. It's widely used across the ecosystem and has a healthy contributor community behind it.&lt;/p&gt;

&lt;p&gt;What draws teams: maximum provider catalog in this roundup, straightforward Python integration, and minimal friction to get started. The downsides surface under load. Python's &lt;a href="https://docs.python.org/3/glossary.html#term-global-interpreter-lock" rel="noopener noreferrer"&gt;Global Interpreter Lock&lt;/a&gt; caps single-process throughput, pushing tail latency higher as concurrency climbs. Running it at meaningful scale requires standing up PostgreSQL and Redis for state management. The open-source version offers exact-match caching (no semantic smarts) and lacks native MCP support. If you're evaluating a move to Bifrost, the &lt;a href="https://www.getmaxim.ai/bifrost/alternatives/litellm-alternatives" rel="noopener noreferrer"&gt;feature comparison&lt;/a&gt; breaks down the gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Python teams that value breadth of provider coverage for experimentation and early prototypes, and can live with higher latency ceilings as traffic grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Kong AI Gateway
&lt;/h3&gt;

&lt;p&gt;Kong AI Gateway extends Kong's widely-deployed API gateway with plugins for LLM-aware routing, prompt templates, token-rate-limiting, and request transforms. If your team already runs Kong for API management, the appeal is obvious: add AI routing without standing up another proxy.&lt;/p&gt;

&lt;p&gt;Where it gets tricky: AI features ship as plugins layered on a general-purpose gateway, not as a purpose-built AI system. The most advanced features (fine-grained governance, MCP, rich observability) live in Kong's commercial tier. The plugin architecture and config model add operational overhead if you're new to Kong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that already have Kong in production for API management and want to extend it into LLM routing without introducing a new infrastructure component.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Apache APISIX
&lt;/h3&gt;

&lt;p&gt;APISIX is an Apache project: a cloud-native API gateway built on high-performance NGINX and Lua, with a strong governance model and active maintainer community. Recent versions added AI plugins to handle LLM routing and token-aware rate limiting.&lt;/p&gt;

&lt;p&gt;APISIX shines if you're already using it for general APIs and want to route AI traffic through the same platform. The catch: AI capabilities come through plugins rather than native architecture, so you don't get semantic caching, a built-in MCP gateway, or the granular budget controls (hierarchical spend limits, per-consumer rate shaping) that purpose-built AI gateways offer. The APISIX config model has a learning curve if you haven't used it before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams standardized on APISIX for API management that want plugin-based LLM routing without standing up separate infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Envoy AI Gateway
&lt;/h3&gt;

&lt;p&gt;Envoy AI Gateway is a newer open-source effort that layers LLM awareness onto the Envoy proxy and Kubernetes Gateway API. It's built for teams already running Envoy or Istio service meshes who want AI routing as part of their existing infrastructure fabric.&lt;/p&gt;

&lt;p&gt;The win: native Kubernetes integration and service-mesh hooks. As a newer project, it carries some rough edges: narrower provider support, lower throughput (single-digit milliseconds of overhead), no semantic caching yet, and no virtual-key-style budget hierarchy. The xDS configuration model (Envoy's config system) is powerful but steep if you're outside the Envoy ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams deeply committed to Kubernetes with Envoy or Istio already in the mesh, wanting AI traffic management that plugs into their existing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Overhead @ scale&lt;/th&gt;
&lt;th&gt;Native MCP&lt;/th&gt;
&lt;th&gt;Semantic cache&lt;/th&gt;
&lt;th&gt;Deployment fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;~11µs at 5K RPS&lt;/td&gt;
&lt;td&gt;Yes (both)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Enterprise, production scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Hundreds of µs&lt;/td&gt;
&lt;td&gt;No (OSS)&lt;/td&gt;
&lt;td&gt;Exact-match only&lt;/td&gt;
&lt;td&gt;Prototyping, broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kong AI Gateway&lt;/td&gt;
&lt;td&gt;Lua/Kong&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Millisecond range&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Plugin-based&lt;/td&gt;
&lt;td&gt;Existing Kong shops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache APISIX&lt;/td&gt;
&lt;td&gt;Lua/NGINX&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Millisecond range&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No (OSS)&lt;/td&gt;
&lt;td&gt;Existing APISIX shops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Envoy AI Gateway&lt;/td&gt;
&lt;td&gt;Go/Envoy&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;1-3 ms&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Kubernetes/Istio mesh&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For deeper detail on benchmarking methodology and the governance model behind the comparison, check the &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmark hub&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance deep-dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's the actual performance difference?
&lt;/h3&gt;

&lt;p&gt;Bifrost leads on raw throughput: 11 microseconds per request at high concurrency. Its Go architecture sidesteps the Python Global Interpreter Lock that Python proxies can't escape. That translates to roughly 54 times lower tail latency (P99) compared to a Python gateway under identical load. For streaming responses or agent workflows with multiple LLM hops, that difference between microseconds and milliseconds is the difference between feeling instant and feeling slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do these gateways really support MCP?
&lt;/h3&gt;

&lt;p&gt;MCP support varies. Bifrost has a native &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; that acts as both client and server, with tool filtering and per-consumer auth. Others either expose MCP through plugins or don't support it in their open-source version. The &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP spec&lt;/a&gt; defines the standard, and native support (as opposed to plugin-based) makes a real difference in simplicity and capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use these in a regulated industry?
&lt;/h3&gt;

&lt;p&gt;Yes, if the gateway supports on-prem deployment, air-gapped operation, immutable audit logs, and fine-grained access control. Self-hosting keeps all prompts, responses, and audit trails inside your infrastructure, which is mandatory for SOC 2, HIPAA, and GDPR contexts. Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;enterprise deployment docs&lt;/a&gt; cover the compliance-grade setup: VPC isolation, clustering for HA, RBAC with SSO, and audit trails sized for auditors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Your Choice
&lt;/h2&gt;

&lt;p&gt;The decision tree is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production AI at enterprise scale with strict latency and governance requirements&lt;/strong&gt;: &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is the clear pick. Lowest overhead, native MCP, hierarchical governance, semantic caching, compliance-ready deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum provider breadth for experimentation&lt;/strong&gt;: LiteLLM's 100+ provider catalog is unmatched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already running Kong&lt;/strong&gt;: Kong AI Gateway saves you introducing another proxy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already running APISIX&lt;/strong&gt;: APISIX AI plugins let you reuse what you have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes and Istio native&lt;/strong&gt;: Envoy AI Gateway slots into your mesh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams weighing multiple options, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM gateway buying guide&lt;/a&gt; provides a detailed capability matrix to score each gateway against your specific requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;If you're looking at self-hosted gateways for your production AI stack, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt; is open-source and deploys in 30 seconds. The docs cover setup, configuration, and integration with coding agents. To see how it handles your actual workloads and discuss deployment, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book time with the Bifrost team&lt;/a&gt;. They'll walk you through performance tuning and compliance setup.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Enterprise Rollout of Claude Code and Codex: Governance Without Friction</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:49:08 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/enterprise-rollout-of-claude-code-and-codex-governance-without-friction-4ejm</link>
      <guid>https://dev.to/kuldeep_paul/enterprise-rollout-of-claude-code-and-codex-governance-without-friction-4ejm</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is the open-source AI gateway for governing AI coding agents. Deploy virtual key governance, guardrails, and audit logs to scale Claude Code and Codex safely.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Engineering teams adopting Claude Code and Codex face a governance dilemma: how do you scale agent usage to hundreds of developers while controlling costs, preventing data leakage, and maintaining compliance? Research by &lt;a href="https://www.blackduck.com/blog/ai-coding-assistant-security-risks-benefits-devsecops-2025.html" rel="noopener noreferrer"&gt;Black Duck&lt;/a&gt; shows 85% of organizations now use AI in development, but most lack centralized control over which models engineers can access, how much they spend, or whether sensitive data makes its way into external APIs. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source Go-based gateway&lt;/a&gt; built by Maxim AI, lets you deploy a single control point that every Claude Code and Codex call flows through (applying cost enforcement, secret detection, and compliance logging automatically, invisibly to the developer).&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Governance of Coding Agents Actually Look Like?
&lt;/h2&gt;

&lt;p&gt;Governance means placing a policy enforcement layer between developers and their AI tools. Instead of each engineer managing their own provider credentials, picking their own models, and sending requests directly outbound, all traffic routes through a gateway that validates access, enforces budgets, scans for unsafe content, and records audit evidence.&lt;/p&gt;

&lt;p&gt;In ungoverned setups, developers operate in silos: each person has their own API keys, each person chooses their own model, and there's no single view of what's being called or how much it costs. When trouble comes (a runaway loop burning through the monthly budget, credentials accidentally committed into a prompt, or a regulatory audit asking for proof of what was called and when), you're left assembling evidence from scattered logs across multiple provider accounts.&lt;/p&gt;

&lt;p&gt;Bifrost flips this: Claude Code and Codex point to one endpoint, and you apply policy once instead of per-developer. That endpoint serves as both a proxy and a policy engine, keeping control unified. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance resource&lt;/a&gt; page details how virtual keys, budgets, rate limits, and content rules all work together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Risks of Ungoverned AI Coding Agent Adoption
&lt;/h2&gt;

&lt;p&gt;Three compounding risks emerge when coding agents aren't centrally governed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runaway spend.&lt;/strong&gt; Coding agents generate many token-heavy requests. Without per-person budget limits, a single infinite loop or a developer experimenting with expensive reasoning models can produce five-figure surprises on next month's bill, with no clear owner or explanation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential exposure in prompts.&lt;/strong&gt; Developers paste code into agents for context. If that code contains database passwords, cloud credentials, or API secrets, and no layer inspects the traffic, those secrets travel straight to an external model provider. &lt;a href="https://www.kusari.dev/blog/ai-coding-assistants-in-2026-4x-faster-10x-riskier-the-hidden-security-cost" rel="noopener noreferrer"&gt;GitGuardian&lt;/a&gt; reported a 40% higher rate of secret leakage in repositories using AI assistants, driven by agents reading local files and incorporating them into prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance becomes impossible.&lt;/strong&gt; Auditors need an immutable trail: who called what model, when, with what budget, and what was the outcome? Without centralized logging, you're hunting across vendor dashboards, trying to correlate a user's questions with a provider's request logs. Regulated workloads in healthcare, finance, or insurance cannot move forward without that evidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.coalitionforsecureai.org/addressing-whats-next-in-securing-enterprise-ai/" rel="noopener noreferrer"&gt;Coalition for Secure AI&lt;/a&gt; noted that today's typical developer workflow puts AI agents outside the IDE with full repository access, able to read files, execute commands, and call external systems via MCP. That capability is powerful, but without a gate in front of it, policies cannot be enforced consistently. A &lt;a href="https://www.getmaxim.ai/bifrost/resources/cli-agents" rel="noopener noreferrer"&gt;gateway for CLI coding agents&lt;/a&gt; provides exactly that gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Brings Order to Coding Agent Deployments
&lt;/h2&gt;

&lt;p&gt;Bifrost positions itself as an OpenAI, Anthropic, and Google-compatible proxy that Claude Code, Codex, and other agents point to instead of calling providers directly. The proxy applies governance rules to every request and response before they traverse the wire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issuing per-developer virtual keys with independent budgets and model restrictions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; form the foundation of Bifrost's access control. Each key is a credential that a developer or squad uses; each key carries its own independent configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model permissions:&lt;/strong&gt; a principal engineer's key can access expensive research-grade models, while team members' keys are limited to production-optimized, cost-efficient alternatives by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budgets:&lt;/strong&gt; each key has a spend ceiling in dollars, resettable daily, weekly, monthly, or yearly, plus the ability to roll budgets up to team or department levels for visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token and request caps:&lt;/strong&gt; set maximum tokens per hour or maximum requests per minute, preventing infinite loops from consuming quota or blowing a budget.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setting a single flag, &lt;code&gt;enforce_auth_on_inference&lt;/code&gt;, makes the virtual key header mandatory on every request. Any call without a valid key is rejected immediately. This one switch transforms Bifrost from an optional proxy into the &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;enforced control point&lt;/a&gt; for every coding agent in the organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time inspection for secrets and PII
&lt;/h3&gt;

&lt;p&gt;Coding agents inherently send local file contents, shell command outputs, and code snippets into prompts (all potential sources of sensitive data). Bifrost validates both incoming requests and outgoing responses without adding latency. The &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrails module&lt;/a&gt; offers multiple options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built-in secrets scanning&lt;/strong&gt; (powered by Gitleaks) detects API keys, database credentials, and tokens in requests before they leave your network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom regex rules&lt;/strong&gt; (including templates for PII like SSNs and credit cards) let you define patterns specific to your organization and redact or reject them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider integration&lt;/strong&gt; with AWS Bedrock Guardrails, Azure Content Safety, GraySwan, and Patronus AI for industry-standard content filtering, jailbreak detection, and policy enforcement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can chain rules so that a sensitive prompt gets scanned by multiple guardrails in sequence, giving you defense-in-depth at the gateway rather than scattered, hard-to-maintain checks inside each application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Immutable audit logs for compliance teams
&lt;/h3&gt;

&lt;p&gt;Every agent request is logged. The enterprise edition adds &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;immutable audit trails&lt;/a&gt; that align with SOC 2 Type II, GDPR, HIPAA, and ISO 27001 requirements. Logs capture the developer identity, the model called, tokens consumed, guardrail violations detected, and the timestamp. These logs export continuously to your SIEM, data lake, or compliance archive, so compliance and security teams get one consolidated record instead of having to stitch together evidence from multiple vendor consoles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controlling which MCP tools each agent can access
&lt;/h3&gt;

&lt;p&gt;When Claude Code connects to Bifrost as an MCP gateway, it gains access to whatever MCP servers Bifrost knows about. But &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;tool filtering&lt;/a&gt; means each virtual key defines which tools are available to the bearer. One developer might have access to the deployment and database query tools; another might be limited to read-only introspection. Code Mode, which lets the agent orchestrate multiple tools by writing orchestration code, cuts token usage by over half and latency by 40%, lowering both cost and the overhead of compliance logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Safely: Identity, Access Control, and Network Isolation
&lt;/h2&gt;

&lt;p&gt;A secure rollout isn't just about governance policy. It's about making sure policy enforcement itself is protected. &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; adds three critical layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSO and RBAC:&lt;/strong&gt; Connect your identity provider (Okta, Microsoft Entra) so that developer access follows your existing directory. Define roles that control who can change policy, who can view usage telemetry, and who can access sensitive audit logs. Satisfies access-control requirements in SOC 2 and ISO 27001.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-VPC and on-prem deployment:&lt;/strong&gt; run Bifrost inside your private network boundary using Kubernetes, ECS, or bare metal. Prompts and responses never traverse the public internet, meeting data residency requirements for regulated workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets management integration:&lt;/strong&gt; offload API key and credential management to HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, or Azure Key Vault. Credentials are never stored in Bifrost configuration; they're fetched on demand and never logged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These controls exist because policy is only credible if it cannot be bypassed by deploying in a different region or routing around the gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Rollout: Five Steps from Day Zero to Enforcement
&lt;/h2&gt;

&lt;p&gt;A phased approach lets your platform team prove the value of governance on a pilot before expanding organization-wide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provision the gateway.&lt;/strong&gt; Deploy Bifrost to your VPC or a managed hosting environment, configure your LLM providers, and set up optional guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Point your agents at Bifrost.&lt;/strong&gt; Update Claude Code and Codex to use the Bifrost endpoint (change one environment variable). The &lt;a href="https://www.getmaxim.ai/bifrost/resources/cli-agents" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt; tool makes this a one-command launch for each agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create and distribute virtual keys.&lt;/strong&gt; Issue one key per developer or per squad with model permissions, spend limits, and request caps. Then enforce mandatory authentication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activate content safety rules.&lt;/strong&gt; Enable secrets scanning and PII rules first (they catch the most common leakage vectors), then layer additional guardrails from content safety providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wire identity and auditing.&lt;/strong&gt; Integrate your SSO provider, assign roles, and enable continuous audit log exports to your security infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Performance remains predictable throughout: Bifrost adds only &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds per request&lt;/a&gt; at 5,000 RPS in sustained load, so policy enforcement is transparent to developers using Claude Code interactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Addressing Common Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does using a gateway change how developers work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Developers use Claude Code and Codex exactly as before; governance is applied invisibly at the proxy layer. The only change is the base URL the tool points at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can we use different LLM providers and models for different developers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Each virtual key specifies which providers and models it can access. One developer can run GPT-5 with Claude Code while another runs Claude Sonnet 4.5 with Codex, all governed by one central policy engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What stops secrets from leaking into external APIs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; inspects every prompt and response in-line before they cross the network boundary. Gitleaks-backed secrets detection and custom regex rules identify credentials and PII, redacting or rejecting them on the spot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this only for large teams?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. The open-source Bifrost gateway handles virtual keys, budgets, rate limits, and MCP tool filtering, which covers most production teams. Larger or compliance-sensitive organizations add RBAC, SSO, immutable audit logs, and advanced content safety via the enterprise edition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consolidating Control: The Case for a Unified Gateway
&lt;/h2&gt;

&lt;p&gt;Governing Claude Code and Codex at enterprise scale means routing every call through a single policy layer, not managing a thousand individual developer configurations. Bifrost provides that layer as an open-source AI gateway with per-developer virtual keys, inline guardrails, persistent audit logs, and in-VPC deployment. Whether you're rolling out agents to a pilot team or scaling to your entire engineering organization, the same governance engine applies consistently.&lt;/p&gt;

&lt;p&gt;Ready to govern your coding agent rollout? Check out the &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost resources&lt;/a&gt; to explore deployment patterns, or &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;schedule a conversation&lt;/a&gt; with the Bifrost team to map a secure coding agent strategy for your organization.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Top 5 AI Governance Platforms for Running LLMs in Production</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:45:49 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/the-top-5-ai-governance-platforms-for-running-llms-in-production-433l</link>
      <guid>https://dev.to/kuldeep_paul/the-top-5-ai-governance-platforms-for-running-llms-in-production-433l</guid>
      <description>&lt;p&gt;&lt;em&gt;Comparing governance solutions for multi-team, multi-provider AI deployments in 2026. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is the top choice for engineering teams needing production-grade controls with minimal latency overhead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you're running LLMs across multiple teams and multiple providers, governance is no longer optional. Raw provider keys scattered across services, unbounded token budgets, and no way to audit what data is being sent where; these problems compound fast. AI governance platforms solve this by putting policy enforcement directly into the request path, before any call reaches a provider. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; written in Go by Maxim AI, is the strongest choice for platform teams that need to route, govern, and audit mission-critical AI traffic with best-in-class performance, scalability, and reliability. This breakdown covers the top 5 platforms in the governance space and where each fits into your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Governance Matters When You're Scaling AI
&lt;/h2&gt;

&lt;p&gt;Governance is the set of controls that enforce &lt;em&gt;who&lt;/em&gt; gets access to &lt;em&gt;what&lt;/em&gt;, &lt;em&gt;how much&lt;/em&gt; they spend, &lt;em&gt;where&lt;/em&gt; their data flows, and what &lt;em&gt;evidence&lt;/em&gt; gets logged for compliance. It's moved from "nice to have" to procurement requirement. &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-02-17-gartner-global-ai-regulations-fuel-billion-dollar-market-for-ai-governance-platforms" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; forecasts the market will hit $492M this year and exceed $1B by 2030 as regulation spreads to 75% of the world's economies.&lt;/p&gt;

&lt;p&gt;Two things are driving the adoption curve. &lt;strong&gt;Cost first&lt;/strong&gt;: a misconfigured agent or a dev using a live key without rate limits can blow through a month's budget in hours. &lt;strong&gt;Regulation second&lt;/strong&gt;: the &lt;a href="https://artificialintelligenceact.eu/the-act/" rel="noopener noreferrer"&gt;EU AI Act&lt;/a&gt;, NIST AI RMF, and ISO/IEC 42001 now expect enforceable, auditable controls;  not just policies on a wiki. Gartner's research shows teams using dedicated governance platforms are 3.4x more likely to actually &lt;em&gt;succeed&lt;/em&gt; at governance than those cobbling it together.&lt;/p&gt;

&lt;p&gt;The shift is clear: you can document governance all day, but a policy doc won't stop an agent from hitting an unapproved model or exceeding budget. Only a control in the request pipeline does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Look For in a Governance Platform
&lt;/h2&gt;

&lt;p&gt;Strong platforms for production AI share core capabilities. Evaluate on these dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-request enforcement&lt;/strong&gt;: actual controls that block/allow requests on the fly, not dashboards that report after the fact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular access&lt;/strong&gt;: per-user, per-team, per-project credentials tied to specific models, providers, and permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget + rate limits&lt;/strong&gt;: hierarchical spend caps and token throttling that fail safely when ceilings are hit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trails&lt;/strong&gt;: immutable logs that hold up under SOC 2, GDPR, HIPAA, and ISO audits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment flexibility&lt;/strong&gt;: in-VPC, on-prem, air-gapped modes so regulated data never touches public infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low overhead&lt;/strong&gt;: minimal added latency, since every call goes through the governance layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platforms that pack all of this into a single control plane reduce your operational surface area and make compliance easier to prove.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Top Options for AI Governance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source AI gateway that sits in the request path and enforces access, budgets, and policy for 1,000+ models across 20+ providers through one OpenAI-compatible endpoint. Instead of bolting a monitoring tool onto your stack, you route every LLM call through Bifrost, which validates permissions and budgets before forwarding to a provider. The &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source gateway&lt;/a&gt; adds just 11 microseconds of latency per request at 5,000 RPS. &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;Benchmarked in production&lt;/a&gt;, so governance doesn't come at a performance cost.&lt;/p&gt;

&lt;p&gt;The core abstraction is the &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt;. Each key represents a consumer, a developer, a service, a team, or a customer, and encodes a specific policy: allowed providers, allowed models, spend limit, and request quota. Raw provider API keys stay locked in Bifrost; they never touch your services. This alone eliminates one of the biggest sources of key sprawl and cost leakage.&lt;/p&gt;

&lt;p&gt;Spend control is hierarchical. &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budgets and rate limits&lt;/a&gt; cascade through Customer → Team → Virtual Key → Provider Config layers. Every applicable budget must have balance for a request to pass. Once a limit is hit, requests fail gracefully with a policy error instead of continuing to charge. Budgets reset on calendar boundaries, daily, weekly, monthly, yearly, so your reconciliation matches your finance team's calendar.&lt;/p&gt;

&lt;p&gt;For stricter environments, &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; extends the open-source gateway. It adds &lt;a href="https://docs.getbifrost.ai/enterprise/rbac" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;, OIDC/SSO integration with Okta and Entra, &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;immutable audit logs&lt;/a&gt; for compliance frameworks, and content-safety guardrails. For regulated workloads, &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;VPC-isolated deployment&lt;/a&gt; keeps all traffic inside private infrastructure with zero public egress. Governance extends to agentic use cases too: running as an &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;, Bifrost applies the same access rules and policy enforcement to Model Context Protocol tool invocations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Bifrost is built for enterprises running mission-critical AI workloads that require best-in-class performance, scalability, and reliability. It serves as a centralized AI gateway to route, govern, and secure all AI traffic across models and environments with ultra low latency. Bifrost unifies LLM gateway, MCP gateway, and Agents gateway capabilities into a single platform. Designed for regulated industries and strict enterprise requirements, it supports air-gapped deployments, VPC isolation, and on-prem infrastructure. It provides full control over data, access, and execution, along with robust security, policy enforcement, and governance capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Native Cloud Governance
&lt;/h3&gt;

&lt;p&gt;The hyperscalers, AWS, Google Cloud, Azure, bundle governance with their managed AI services. You get identity integration, usage budgets, and content moderation for their models, all in your existing cloud console.&lt;/p&gt;

&lt;p&gt;The catch: breadth. These suites work well for single-cloud, single-provider teams. The moment you add external providers or self-hosted models, you're managing governance across multiple dashboards with incompatible policy models and separate audit logs. Enterprises spanning multiple providers end up stitching things together manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; single-cloud teams fully committed to one provider's model family who want governance bundled with their existing IAM and billing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Risk and Compliance Platforms
&lt;/h3&gt;

&lt;p&gt;A distinct category focuses on governance from the compliance angle: model inventories, risk registers, bias audits, and mapping controls to frameworks like NIST AI RMF and ISO/IEC 42001. These platforms produce the documentation and evidence that auditors and regulators expect.&lt;/p&gt;

&lt;p&gt;But they don't enforce policy on live requests. They describe governance, they don't execute it. Most teams pair these with an infrastructure-layer gateway that does the actual enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; compliance, legal, and risk teams needing lifecycle documentation, framework attestation, and audit evidence for a portfolio of AI systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Red-Teaming and Security Testing
&lt;/h3&gt;

&lt;p&gt;Red-teaming platforms govern AI by discovering weaknesses through automated adversarial testing, jailbreaks, prompt injection, data extraction, unsafe completions. Baked into CI/CD, they shift adversarial testing left, making it a pre-deployment gate instead of a quarterly review.&lt;/p&gt;

&lt;p&gt;They address a real threat class, but they govern model &lt;em&gt;behavior under attack&lt;/em&gt; rather than day-to-day access and budgets. They complement, but don't replace, a gateway managing traffic on live requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; security teams needing continuous adversarial testing of models and agents before each deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Evaluation and Observability
&lt;/h3&gt;

&lt;p&gt;These platforms govern through quality measurement: grading outputs, tracing production behavior, alerting on regression. Maxim AI, the company behind Bifrost, offers &lt;a href="https://www.getmaxim.ai/products/agent-simulation-evaluation" rel="noopener noreferrer"&gt;agent simulation and evaluation&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/products/agent-observability" rel="noopener noreferrer"&gt;production observability&lt;/a&gt; with distributed tracing and live alerts.&lt;/p&gt;

&lt;p&gt;Quality scoring is essential, but it lives alongside, not inside, the access-and-budget enforcement layer. Teams typically run an evaluation platform for quality assurance and a gateway for runtime access control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams measuring agent quality, running pre-release evaluations, and observing production behavior across your AI stack's lifecycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Enforces Governance In-Path
&lt;/h2&gt;

&lt;p&gt;Bifrost enforces governance where it matters: directly in the request flow. Every LLM call through &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is validated against the policies attached to its virtual key before any data reaches a provider. Because it's a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for standard SDKs, you add governance by changing one line: the base URL.&lt;/p&gt;

&lt;p&gt;The enforcement pipeline combines multiple checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identification&lt;/strong&gt;: the virtual key reveals which consumer this is and what permissions they have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model allowlist&lt;/strong&gt;: checks if the requested model is permitted by the key's policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget validation&lt;/strong&gt;: verifies every applicable budget (consumer, team, key, provider) still has balance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limit validation&lt;/strong&gt;: checks token and request limits at the key and provider levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt;: excludes providers that hit their limits; directs traffic to a permissioned alternative.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: a runaway job can't silently overspend, a contractor's key can't reach production models, and a provider outage doesn't cascade. For deeper context on how these pieces fit, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance resource hub&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; provide capability matrices and architectural details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the Right Governance Platform
&lt;/h2&gt;

&lt;p&gt;The choice depends on where your enforcement bottleneck is.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider, mission-critical traffic&lt;/strong&gt;: you need an infrastructure-layer gateway enforcing access and budgets on every request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance and audit evidence&lt;/strong&gt;: a risk-and-compliance platform handles documentation and framework mapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model security posture&lt;/strong&gt;: a red-teaming tool shifts adversarial testing into your release pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output quality and monitoring&lt;/strong&gt;: an evaluation platform measures behavior and traces production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most mature deployments use more than one. But the foundation is always a control plane that enforces access, spend, and policy at request time.  that's where regulators and finance teams focus. Bifrost provides that layer as an open-source gateway, and &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; adds RBAC, SSO, audit logging, and VPC isolation for regulated environments, all while keeping the 11-microsecond overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with Runtime Governance
&lt;/h2&gt;

&lt;p&gt;Governance works best when it's enforced in the request path, not bolted on afterward. Bifrost delivers that enforcement as an open-source AI gateway with virtual keys, hierarchical budgets, rate limits, audit logs, and private-cloud deployment in one platform, with just 11 microseconds of added latency at production scale. To see how Bifrost can centralize governance and access control across your infrastructure, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Control MCP Server Access at Scale: Top 5 Gateway Tools</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:45:20 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/how-to-control-mcp-server-access-at-scale-top-5-gateway-tools-6mf</link>
      <guid>https://dev.to/kuldeep_paul/how-to-control-mcp-server-access-at-scale-top-5-gateway-tools-6mf</guid>
      <description>&lt;p&gt;&lt;em&gt;Choosing the right MCP gateway matters. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; stands out as the best option for teams running production AI agents that need governance, security, and speed without compromise.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you're connecting AI agents to multiple MCP servers, things get messy fast. Your first agent uses two servers. Your second adds three more. By the time you have five agents across your org, you're staring at dozens of connections, no visibility into which agent can call which tool, and token costs that grow with every new server you add. An MCP gateway sits in the middle, creating a single checkpoint where you control access, authenticate, audit, and manage costs across everything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built in Go that solves this by acting as both an MCP client and server at once. But it's not the only option out there. This post walks through five gateways you can use to govern MCP access in production, comparing them on ease of setup, governance depth, audit capabilities, and how they handle costs at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does an MCP Gateway Actually Need to Do?
&lt;/h2&gt;

&lt;p&gt;MCP server access governance means controlling who can do what with your tools, logging everything, and keeping costs predictable. Here's what a serious gateway needs to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool-level access control&lt;/strong&gt;: lock down which teams, customers, or services can touch which tools. When a key has no permission for a tool, it shouldn't see it at all (deny-by-default).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized identity&lt;/strong&gt;: manage credentials to your MCP servers in one place, not scattered across agent configs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An audit trail&lt;/strong&gt;: record every tool call with who made it, what happened, and why. Non-negotiable for compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost boundaries&lt;/strong&gt;: set budgets and rate limits so one runaway agent doesn't tank your bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment options&lt;/strong&gt;: the ability to run this in your own VPC, on-prem, or air-gapped if your industry or data residency requires it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway resource&lt;/a&gt; goes deeper on what these controls look like in practice. The five tools below are compared against those criteria.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Bifrost: The All-in-One Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; combines MCP governance with LLM routing in one system. It routes requests to your models &lt;em&gt;and&lt;/em&gt; governs access to your tools, so you're not stitching together separate infrastructure.&lt;/p&gt;

&lt;p&gt;The architecture is straightforward: Bifrost sits between your agents and your MCP servers. It aggregates connections from multiple servers and exposes them through a single &lt;code&gt;/mcp&lt;/code&gt; endpoint, with access enforced per request. At &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;5,000 requests per second, it adds only 11 microseconds&lt;/a&gt; of overhead, so governance doesn't become a bottleneck.&lt;/p&gt;

&lt;p&gt;Governance happens through &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt;. Each key is a bundle of permissions: which providers you can talk to, your budget, your rate limit, and which tools you can call. Crucially, the default is &lt;em&gt;deny&lt;/em&gt; (a key with no MCP configuration sees zero tools until you explicitly add them).&lt;/p&gt;

&lt;p&gt;Authentication to protected servers uses OAuth 2.0 with automatic token refresh and PKCE. On the cost side, Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; lets models orchestrate tools through code rather than loading massive tool definitions into the prompt, cutting tokens by &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;up to 92% at scale&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For enterprise teams, Bifrost adds &lt;a href="https://docs.getbifrost.ai/enterprise/mcp-tool-groups" rel="noopener noreferrer"&gt;MCP tool groups&lt;/a&gt; so you define tool sets once and attach them to keys, teams, or customers. &lt;a href="https://docs.getbifrost.ai/enterprise/mcp-with-fa" rel="noopener noreferrer"&gt;Federated auth&lt;/a&gt; turns your existing APIs into MCP tools from OpenAPI specs or Postman collections (no code required). &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt; export to any SIEM for SOC 2, HIPAA, GDPR, and ISO 27001 audits. It runs anywhere: VPC, on-prem, air-gapped, with SSO via Okta or Entra.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Bifrost is built for enterprises running mission-critical AI workloads that require best-in-class performance, scalability, and reliability. It serves as a centralized AI gateway to route, govern, and secure all AI traffic across models and environments with ultra low latency. Bifrost unifies LLM gateway, MCP gateway, and Agents gateway capabilities into a single platform. Designed for regulated industries and strict enterprise requirements, it supports air-gapped deployments, VPC isolation, and on-prem infrastructure. It provides full control over data, access, and execution, along with robust security, policy enforcement, and governance capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Docker MCP Gateway: Container-Native Isolation
&lt;/h2&gt;

&lt;p&gt;If your team lives in Docker, Docker MCP Gateway is worth a look. It runs each MCP server in its own container with signed images and built-in secrets handling.&lt;/p&gt;

&lt;p&gt;The appeal is isolation at the container boundary. Each server runs independently, image signatures verify provenance, and secrets come through Docker's native systems, not hardcoded configs. For teams already shipping everything via containers, it's a natural fit.&lt;/p&gt;

&lt;p&gt;The catch: Docker MCP Gateway is more of a toolkit than a turnkey solution. You get the building blocks (containers and signed images), but you layer on identity management, per-consumer access lists, and audit logging yourself. It also governs tools at the container level, not per-request, and doesn't handle model routing, so you'd run LLM traffic separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Container teams with Docker expertise who want full control over their setup and are willing to build governance layers on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Kong AI Gateway: For API-First Teams
&lt;/h2&gt;

&lt;p&gt;Kong added MCP support through an AI MCP Proxy plugin. It translates between MCP and HTTP, letting MCP clients talk to REST APIs without rewriting them as MCP servers.&lt;/p&gt;

&lt;p&gt;If you already use Kong for API management, extending it to MCP traffic keeps everything under one roof. Your existing rate-limiting, authentication, and observability policies apply to agent tool access too. It's familiar territory for platform teams that already standardize on Kong.&lt;/p&gt;

&lt;p&gt;But MCP is bolted onto Kong rather than native to it. Tool-level access control, deny-by-default filtering, and token-optimization features like Code Mode aren't Kong's focus. LLM routing stays separate. For a deeper comparison, check the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM gateway buyer's guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams with existing Kong deployments who want to manage agent tools through the same gateway and policies they use for REST APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Azure API Management for MCP: Microsoft Ecosystem
&lt;/h2&gt;

&lt;p&gt;Microsoft's approach: extend existing Azure services rather than shipping a new product. Azure API Management provides policy enforcement and OAuth 2.0, Entra ID handles identity and RBAC, and Azure Container Apps hosts MCP servers with Kubernetes-native scaling.&lt;/p&gt;

&lt;p&gt;For Azure-centric orgs, it's appealing. You're using identity and policy infrastructure you already operate, reducing the number of new systems your platform team has to maintain and secure. Microsoft maintains an open-source MCP gateway for Kubernetes alongside the managed services.&lt;/p&gt;

&lt;p&gt;The downside: Azure-first architecture locks you in. If you're multi-cloud, management complexity and vendor concerns become real. MCP capabilities are spread across multiple Azure services rather than unified in one layer, and portable deployments &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;across AWS, GCP, and on-prem&lt;/a&gt; get harder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large enterprises committed to the Microsoft stack that want identity and policy controls integrated with Azure services.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Cloudflare AI Gateway and MCP Server Portals: Edge-First
&lt;/h2&gt;

&lt;p&gt;Cloudflare's model: use Cloudflare AI Gateway, MCP Server Portals, and Cloudflare Gateway to create a security layer at the network edge. It includes shadow MCP detection to catch unauthorized server connections.&lt;/p&gt;

&lt;p&gt;If you're already on Cloudflare's edge network and use Cloudflare One and Workers, this integrates naturally. MCP control happens at the network edge alongside your existing Zero Trust policies.&lt;/p&gt;

&lt;p&gt;The dependency: you need Cloudflare infrastructure to make this work. If you're not on that platform, it's harder to justify. Model routing across LLM providers still happens separately from &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;MCP access governance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Orgs already invested in Cloudflare One and Workers that want edge-based MCP control with built-in shadow server detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the Choice
&lt;/h2&gt;

&lt;p&gt;The right gateway depends on two things: do you need tool governance and model routing in one system (you probably do), and what infrastructure does your team already run?&lt;/p&gt;

&lt;h3&gt;
  
  
  Should governance and routing be unified?
&lt;/h3&gt;

&lt;p&gt;In production? Almost always yes. Having tool access, model routing, audit logs, and identity in one control plane means simpler audits, smaller attack surface, and fewer integration points to debug. &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;Bifrost is the only gateway here that unifies both&lt;/a&gt; by design.&lt;/p&gt;

&lt;h3&gt;
  
  
  What matters most for compliance?
&lt;/h3&gt;

&lt;p&gt;Immutable audit logs, deny-by-default filtering, and the ability to run in isolated environments. Regulated industries (healthcare, finance, government) need proof of every tool call and the option to run in a VPC or offline. Bifrost's audit and access controls map directly to SOC 2, HIPAA, GDPR, and ISO 27001.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do gateways cut token costs?
&lt;/h3&gt;

&lt;p&gt;Most use budgets and rate limits. But there's a structural problem: if every request loads full tool definitions into the model context, costs climb with each new server. Bifrost's Code Mode flips this (the model writes orchestration code instead of carrying tool schemas, cutting tokens significantly).&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Governing MCP Access
&lt;/h2&gt;

&lt;p&gt;Controlling MCP server access at scale requires per-consumer tool filtering, centralized credentials, immutable audit trails, and cost controls that hold up as you add servers. Bifrost delivers all of these in one high-performance control plane, with deny-by-default access, tool groups for teams and customers, federated auth for enterprise APIs, and audit logs ready for SIEM export.&lt;/p&gt;

&lt;p&gt;Ready to see it in action? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the Bifrost team, or check out the &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost resources hub&lt;/a&gt; for step-by-step guides.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What Happens When MCP Server Access Goes Ungoverned: 5 Critical Security Risks</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:44:54 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/what-happens-when-mcp-server-access-goes-ungoverned-5-critical-security-risks-1p7c</link>
      <guid>https://dev.to/kuldeep_paul/what-happens-when-mcp-server-access-goes-ungoverned-5-critical-security-risks-1p7c</guid>
      <description>&lt;p&gt;&lt;em&gt;Without central governance, direct MCP connections expose agents to tool poisoning, prompt injection, privilege escalation, credential theft, and missing audit trails. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; centralizes MCP access control, validation, and logging.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When an AI agent connects directly to MCP servers without a governance checkpoint, you've essentially handed the model keys to your infrastructure and hoped nothing goes wrong. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, a &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Go-based open-source MCP gateway&lt;/a&gt; from Maxim AI, intercepts every tool call with access controls, input/output validation, and audit logging. This post walks through the five most dangerous gaps in ungoverned MCP setups and how a centralized gateway fills each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Ungoverned MCP Server Access
&lt;/h2&gt;

&lt;p&gt;An ungoverned MCP server configuration is one where agents talk directly to external Model Context Protocol servers with no intermediary layer to check tools, filter what models can access, validate incoming and outgoing data, identify callers, or keep records. The Model Context Protocol, an open standard released by Anthropic in late 2024, lets AI models dynamically discover and invoke external tools across filesystems, databases, APIs, and internal services at runtime.&lt;/p&gt;

&lt;p&gt;MCP's strength (instant tool availability) is also its vulnerability surface. When agents contact tool servers directly, three risk factors align dangerously: privileged system access, potentially hostile input (from documents, web pages, tool responses), and a communications channel to exfiltrate data. All five risks below follow from that convergence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool poisoning&lt;/strong&gt; via malicious metadata embedded in tool definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indirect prompt injection&lt;/strong&gt; through tool response content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excessive privilege&lt;/strong&gt; with no enforcement of least-privilege access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret exposure&lt;/strong&gt; across prompts, arguments, and tool output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing forensics&lt;/strong&gt; with no record of who ran what tool and when&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Risk 1: Tool Poisoning in Tool Metadata and Descriptions
&lt;/h2&gt;

&lt;p&gt;Tool poisoning occurs when a rogue or compromised MCP server stores malicious instructions inside tool properties like names, descriptions, or JSON schema. The model reads these properties as trusted input, visible to the LLM but invisible to the user watching the terminal. &lt;a href="https://invariantlabs.ai/blog/mcp-security-notification-tool-poisoning-attacks" rel="noopener noreferrer"&gt;Invariant Labs first documented this attack&lt;/a&gt; as a subset of indirect prompt injection, and &lt;a href="https://owasp.org/www-community/attacks/MCP_Tool_Poisoning" rel="noopener noreferrer"&gt;OWASP now classifies MCP tool poisoning&lt;/a&gt; as its own threat category, with a structural gap between initial connection and runtime behavior.&lt;/p&gt;

&lt;p&gt;Ungoverned setups broadcast every available tool to the model with zero filtering, making it trivial for a poisoned definition to steer the agent toward data theft or unauthorized operations. &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;Bifrost's tool filtering&lt;/a&gt; flips the default to deny: unless a virtual key explicitly permits access to a specific client and its tools, nothing is available. Instead of exposing your entire tool library, teams whitelist only what each agent legitimately needs, which curtails how far one poisoned tool can propagate. Bifrost also consolidates your tool registry in one place, letting your security team review a single governed surface instead of hunting through every agent's independent MCP connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 2: Indirect Prompt Injection Via Tool Responses
&lt;/h2&gt;

&lt;p&gt;Indirect prompt injection is when hostile instructions slip into content an agent processes: a malicious support ticket, a booby-trapped web page, or the result returned from a tool. Because the model treats all context equally, it can't separate a real user instruction from a planted one. &lt;a href="https://developer.microsoft.com/blog/protecting-against-indirect-injection-attacks-mcp" rel="noopener noreferrer"&gt;Microsoft's security researchers detailed this attack pattern for MCP&lt;/a&gt;, and known incidents follow the same playbook: in mid-2025, a Supabase agent with database-level privileges got tricked by injected SQL in user-submitted tickets, leaking OAuth integration tokens to a public forum.&lt;/p&gt;

&lt;p&gt;The solution requires a layer that watches every prompt and response in flight, not distributed filters inside each application. &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;Bifrost enforces guardrails&lt;/a&gt; at the gateway, scanning both LLM inputs and MCP tool inputs/outputs in real time. The same safety rules that screen user prompts also inspect what gets passed to and returned from your tools, with options for prompt-injection defense, PII masking, and content filtering via native or third-party providers. If a rule is violated, Bifrost can stop the request, erase sensitive fields, or write it to an audit log, giving you one control point across your entire agent fleet rather than fractured per-service policies that drift over months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 3: Privilege Creep: When Every Agent Can Call Every Tool
&lt;/h2&gt;

&lt;p&gt;Privilege creep is the default misconfiguration: agents get broad API credentials and the full tool menu because building tight per-consumer scoping without a control plane is tedious. The outcome: your internal prototype and your customer-facing agent invoke the same destructive operations. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; inverts this by making least privilege the starting point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are Bifrost's primary governance unit; each holds its own allow-list of which MCP clients and tools a consumer can access, plus model and cost limits. Bifrost enforces the allow-list twice: once when planning which tools the model sees, and again when the model actually tries to invoke a tool, so a key meant for read-only diagnostics can't silently pivot to writing data. For larger deployments, &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance extends to RBAC and tool groups&lt;/a&gt;, named collections of tools you attach to keys, teams, or users and resolve per request. Every model call only ever touches the tools its consumer is authorized for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 4: Secrets in Flight: When Credentials Leak Into Logs and Responses
&lt;/h2&gt;

&lt;p&gt;Secrets leak when API keys, OAuth tokens, or passwords flow through prompts, tool parameters, or results and end up in logs, vector databases, or attackers' hands. Ungoverned MCP deployments make this worse in two ways: you tend to embed provider credentials in every agent separately, and there's zero inspection as a secret transits a tool invocation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; solves both. Provider credentials stay encrypted inside the gateway and never ship to client code; they rotate independently of the virtual keys that reference them, so secrets never live in app config or .env files. On detection, &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/secrets-detection" rel="noopener noreferrer"&gt;Bifrost's secrets guardrail&lt;/a&gt;, powered by Gitleaks under the hood, catches leaked keys, tokens, and certificates in prompts and completions before they escape. For teams in regulated industries, &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;in-VPC and self-hosted options&lt;/a&gt; keep request payloads, detection events, and all credentials within your network perimeter, and &lt;a href="https://docs.getbifrost.ai/enterprise/data-access-control" rel="noopener noreferrer"&gt;vault integrations&lt;/a&gt; (HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, Azure Key Vault) handle rotation and access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 5: No Audit Trail: When You Can't Answer "What Happened?"
&lt;/h2&gt;

&lt;p&gt;Ungoverned MCP setups scatter tool-call evidence across application logs, if it's captured at all, which makes incident postmortems and compliance audits painful. The MCP spec itself says best practice is a human in the loop with ability to veto tool invocations, and that's only credible if every action is recorded.&lt;/p&gt;

&lt;p&gt;Once &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;Bifrost becomes the single checkpoint&lt;/a&gt; between agents and tools, each tool call becomes a first-class log entry with full metadata: tool name, source MCP server, arguments sent, response received, latency, the virtual key that triggered it, and the originating LLM request. You can slice by virtual key to see what a specific consumer executed, or by tool to track which server gets hammered. For compliance, &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;immutable audit logs&lt;/a&gt; meet SOC 2, GDPR, HIPAA, and ISO 27001 standards and can ship to external SIEM and data-lake systems for archival and investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Governs MCP From Request to Tool Return
&lt;/h2&gt;

&lt;p&gt;Bifrost sits as both an MCP client (connecting to your external tool servers) and an MCP server (exposing one controlled endpoint to agents and desktop clients like Claude Desktop). Routing every tool call through a single control plane is what makes all five protections consistent rather than ad hoc per-application. By default, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost blocks auto-execution&lt;/a&gt; of tool calls; the model returns suggestions, and your code must explicitly approve and invoke them, preserving human judgment for risky operations. For teams that want automation, &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; lets you set an auto-approval list while keeping the boundary explicit.&lt;/p&gt;

&lt;p&gt;Bifrost also tackles cost and context bloat. When agents connect to many servers, &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; has the model generate compact Python code to orchestrate tools in a sandbox instead of stuffing every tool definition into the prompt, cutting input tokens by up to 92.8% and shrinking the payload that poisoned definitions can corrupt. The overhead is tiny: &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds per request at 5,000 RPS&lt;/a&gt;, so governance never becomes your latency bottleneck. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;cost-and-access deep dive&lt;/a&gt; shows how filtering, logging, and Code Mode combine at production scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Questions About MCP Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's the #1 MCP attack vector?
&lt;/h3&gt;

&lt;p&gt;Prompt injection (including tool poisoning) tops the list. It's scary because it combines three factors: an attacker controls tool metadata, the model reads it as gospel truth, and the agent has real system access. This pattern shows up in almost every documented MCP incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does a gateway actually stop prompt injection?
&lt;/h3&gt;

&lt;p&gt;A gateway alone doesn't make your model immune, but it adds layers that direct connections lack: request and response guardrails on every tool call, deny-by-default filtering, and tamper-evident logs. Together they shrink both the odds of a successful attack and how much damage one can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you actually implement least privilege for MCP agents?
&lt;/h3&gt;

&lt;p&gt;Give each agent its own credentials and tool permission set instead of sharing universal access. &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost virtual keys&lt;/a&gt; apply a per-consumer allow-list to both what the model sees and what it can actually invoke, enforced at planning time and execution time so privilege doesn't escalate mid-call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting With Bifrost
&lt;/h2&gt;

&lt;p&gt;Ungoverned MCP server setups concentrate privileged tool access, untrusted input vectors, and exfiltration paths into one blind spot. A governed MCP gateway decouples those concerns: deny-by-default tool filtering, runtime guardrails on prompts and responses, credential isolation, and immutable logs, all running between your agents and external tools. &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost provides this control plane as an open-source AI gateway with enterprise governance&lt;/a&gt;, and the full feature set is documented in the &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;resources hub&lt;/a&gt;. Ready to see how Bifrost locks down MCP access across your agent infrastructure? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running Bifrost in Cluster Mode: The Path to Enterprise AI Deployments</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:44:10 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/running-bifrost-in-cluster-mode-the-path-to-enterprise-ai-deployments-41m</link>
      <guid>https://dev.to/kuldeep_paul/running-bifrost-in-cluster-mode-the-path-to-enterprise-ai-deployments-41m</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; cluster mode distributes gateway traffic across multiple nodes with synchronized state, automatic node failover, and zero-downtime rolling updates.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're running an AI gateway in production, you've probably wondered what happens when a single instance fails. Cluster mode is Bifrost's answer: multiple gateway nodes that act as a single system, sharing rate limits, budgets, and governance rules in real time. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway built in Go&lt;/a&gt; by Maxim AI, supports cluster mode to give teams the high-availability infrastructure that production AI workloads demand. Let's walk through what it is, why you'd use it, how it actually works under the hood, and how to deploy it on Kubernetes with Helm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Cluster Mode
&lt;/h2&gt;

&lt;p&gt;Cluster mode runs multiple gateway instances as an interconnected system. In the default setup (cluster mode disabled), each instance is isolated. It talks to a shared database, but handles its own rate limits and governance state. With cluster mode on, every node synchronizes those counters and policies with every other node, creating a single logical gateway spread across many machines.&lt;/p&gt;

&lt;p&gt;Here's the practical difference: without clustering, each node enforces a rate limit independently. So if you set a limit of 1,000 requests per minute, each node gets its own 1,000 rps quota. With clustering, the fleet shares a single 1,000 rps quota. One node uses 600, another uses 300, and the limit holds cluster-wide.&lt;/p&gt;

&lt;p&gt;The tech here is production-grade. &lt;a href="https://docs.getbifrost.ai/enterprise/clustering" rel="noopener noreferrer"&gt;Gossip-based state synchronization&lt;/a&gt; makes sure every node converges on the same view of the cluster in seconds. Cluster mode requires PostgreSQL, because SQLite doesn't support multi-node coordination. The feature set is part of &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt;, though the open-source image accepts cluster config values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Teams Deploy Clustering
&lt;/h2&gt;

&lt;p&gt;Single nodes fail. Cloud environments are noisy. Rate limits need to be accurate across the whole system. That's why enterprises move to clustered gateways.&lt;/p&gt;

&lt;p&gt;Here's what clustering solves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No single point of failure:&lt;/strong&gt; When a node goes down (hardware fault, pod eviction, zone disruption), traffic automatically shifts to healthy instances. The gateway keeps serving requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared rate limits and budgets:&lt;/strong&gt; A 1,000 requests-per-minute limit holds across the whole fleet, not per node. Same for budget tracking. &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget counters&lt;/a&gt; stay consistent everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified governance:&lt;/strong&gt; &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt;, routing rules, providers, and access policies sync to every node instantly. Configuration drift disappears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle spikes gracefully:&lt;/strong&gt; Horizontal scaling adds replicas, distributing load instead of crushing a single instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-downtime updates:&lt;/strong&gt; Roll out new code one pod at a time while the rest keep serving traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region awareness:&lt;/strong&gt; Tag nodes with region labels so requests route to the closest instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams building capacity plans and high-availability strategies, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/enterprise-deployment" rel="noopener noreferrer"&gt;enterprise deployment guide&lt;/a&gt; covers cluster sizing and HA topologies in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Clustering Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; cluster mode is fully peer-to-peer. There's no primary node, no external coordinator, no special "cluster leader node" in the request path. Every node is equal. Each one discovers peers automatically, monitors whether every other node is alive, and gets state updates as they happen.&lt;/p&gt;

&lt;p&gt;Clustering uses two separate communication channels with different jobs. Membership signals (which nodes are in the cluster, are they alive?) run over a &lt;a href="https://github.com/hashicorp/memberlist" rel="noopener noreferrer"&gt;memberlist gossip&lt;/a&gt; layer on port 10101 (TCP and UDP). This uses the SWIM protocol, which is designed for scalable, eventually consistent group membership. Everything else, including usage counters, config changes, routing rules, virtual keys, RBAC, and MCP tool definitions, travels over a dedicated gRPC channel on port 10102 (TCP). Splitting these two keeps the membership system isolated from application traffic, and each can be tuned independently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/clustering" rel="noopener noreferrer"&gt;Bifrost replicates 30+ entity types&lt;/a&gt; across the cluster: model catalog, providers, governance counters, routing rules, RBAC, MCP tools, pricing, and prompt deployments. All of it gets synced. Each message has a unique ID and timestamp. Nodes keep a short-lived deduplication cache, so if a message arrives twice, it's only processed once. All nodes converge to the same state within seconds (eventual consistency).&lt;/p&gt;

&lt;p&gt;Leader election is automatic. One cluster-wide leader, one leader per region. The rule is simple: lexicographically first healthy member wins. The cluster re-evaluates every 30 seconds, so leadership transfers automatically when nodes join, leave, or fail. The leader handles singleton tasks (like fetching upstream pricing once instead of every node doing it), then broadcasts the result to the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying on Kubernetes with Helm
&lt;/h2&gt;

&lt;p&gt;The easiest deployment path is Kubernetes + Helm with native pod discovery. The default mesh clustering model means nodes reach each other directly over gossip and gRPC. Minimum recommendation: three nodes (survives one failure), five nodes (survives two failures), seven or more for even higher redundancy.&lt;/p&gt;

&lt;p&gt;A minimal cluster Helm config looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cluster-values.yaml&lt;/span&gt;
&lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

&lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;external&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-postgres-host.example.com"&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bifrost&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bifrost&lt;/span&gt;
    &lt;span class="na"&gt;sslMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require&lt;/span&gt;
    &lt;span class="na"&gt;existingSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres-credentials"&lt;/span&gt;
    &lt;span class="na"&gt;passwordKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password"&lt;/span&gt;

&lt;span class="na"&gt;bifrost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;encryptionKeySecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bifrost-encryption"&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encryption-key"&lt;/span&gt;
  &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
    &lt;span class="na"&gt;discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes&lt;/span&gt;
      &lt;span class="na"&gt;k8sNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
      &lt;span class="na"&gt;k8sLabelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/name=bifrost"&lt;/span&gt;
    &lt;span class="na"&gt;gossip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10101&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes discovery works by querying the API for pods matching your label selector. The gossip layer finds peers that way. You'll need service account RBAC to list pods. After creating your PostgreSQL and encryption secrets and configuring RBAC, just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;bifrost bifrost/bifrost &lt;span class="nt"&gt;-f&lt;/span&gt; cluster-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both ports 10101 (TCP+UDP) and 10102 (TCP) need to be open between pods, so check your NetworkPolicies and security groups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other discovery methods
&lt;/h3&gt;

&lt;p&gt;Bifrost supports six ways to discover cluster members, so it works anywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes:&lt;/strong&gt; Pod discovery via the API, best for cloud-native setups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS:&lt;/strong&gt; Headless services or A-record resolution, great with StatefulSets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consul and etcd:&lt;/strong&gt; For teams already running those systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDP broadcast and mDNS:&lt;/strong&gt; Local-network discovery for on-prem and dev.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On serverless platforms without peer-to-peer networking (like Google Cloud Run), Bifrost offers broker mode. Each node makes one outbound connection to a central relay that fans out messages to the rest. Same reliability, different topology. See the &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployment docs&lt;/a&gt; for air-gapped and private-cloud setups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production HA Patterns
&lt;/h2&gt;

&lt;p&gt;A solid &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; cluster combines the multi-node foundation with scheduling, autoscaling, and shutdown policies that keep things running through real-world disruptions. Goal: a fleet that survives node faults, scales for demand, and deploys new versions without dropping requests.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spread replicas across hosts:&lt;/strong&gt; Use pod anti-affinity so instances don't land on the same node. One machine can't take down the whole gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guard availability during maintenance:&lt;/strong&gt; Set a &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/" rel="noopener noreferrer"&gt;PodDisruptionBudget&lt;/a&gt; with &lt;code&gt;minAvailable: 2&lt;/code&gt;. Kubernetes won't let voluntary disruptions (drains, autoscaling) drop you below that threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown:&lt;/strong&gt; Termination grace period + preStop hook ensures streaming responses finish cleanly instead of getting cut off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart autoscaling:&lt;/strong&gt; Enable HPA with conservative scale-down. Let the cluster grow for spikes, but don't aggressively kill pods that are still busy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer in provider resilience:&lt;/strong&gt; Mix cluster-level HA with &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic provider failover&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/enterprise/adaptive-load-balancing" rel="noopener noreferrer"&gt;adaptive load balancing&lt;/a&gt;. Survive node failures AND provider outages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor health:&lt;/strong&gt; Export &lt;a href="https://docs.getbifrost.ai/features/observability/prometheus" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt;. Use the built-in cluster topology view to confirm every node is alive and receiving messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rolling upgrades work because nodes negotiate capabilities with each other. New and old versions can run side by side without losing quorum. Deploy one pod at a time using standard Kubernetes rolling strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does cluster mode need a specific database?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, PostgreSQL. Bifrost cluster mode coordinates shared state through the database, so it's a hard requirement. SQLite only works for single instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many nodes should I run?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three is the minimum and handles one failure. Five tolerates two. Seven+ gives you three fault tolerance. Pick based on how much disruption you can absorb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use cluster mode on serverless?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, with broker mode. Instead of direct peer-to-peer connections, every node connects outbound to a central relay. Works on Cloud Run and similar platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do cluster upgrades require downtime?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nope. Nodes support mixed versions during rolling upgrades. Replace one pod at a time, keep the rest serving traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Bifrost cluster mode is the deployment pattern for teams that need a highly available AI gateway. It delivers gossip-based state sync, automatic failover, shared rate limits, and zero-downtime deploys across a peer-to-peer cluster. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/enterprise-deployment" rel="noopener noreferrer"&gt;enterprise deployment guide&lt;/a&gt; and the full &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost resources hub&lt;/a&gt; have reference topologies, sizing guidance, and HA walkthroughs.&lt;/p&gt;

&lt;p&gt;Ready to see how Bifrost cluster mode fits your production setup? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stop Runaway LLM Bills: Cost Overrun Prevention with Bifrost</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:42:59 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/stop-runaway-llm-bills-cost-overrun-prevention-with-bifrost-4jh1</link>
      <guid>https://dev.to/kuldeep_paul/stop-runaway-llm-bills-cost-overrun-prevention-with-bifrost-4jh1</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; tackles LLM cost overruns by enforcing budgets at the gateway, deduplicating expensive requests, and cutting token consumption before charges hit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Token bills blow up in production when spending spins beyond what teams predicted during development. It only takes an exposed API key, a recursive loop that hammers multiple providers, or an agent reloading tool schemas on every single call to turn a tidy monthly bill into a nightmare. According to Gartner's latest forecast, &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-05-19-gartner-forecasts-worldwide-ai-spending-to-grow-47-percent-in-2026" rel="noopener noreferrer"&gt;$2.59 trillion in worldwide AI spending is coming in 2026&lt;/a&gt;, a 47% jump year-over-year, and much of that flows through production LLM systems where few teams have real spend controls in place. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; written in Go by Maxim AI, puts cost limits front-and-center by catching budget overages, eliminating duplicate work, and trimming token usage at the request level. Here's how to prevent cost overruns at the infrastructure layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding What Drives Production Cost Overruns
&lt;/h2&gt;

&lt;p&gt;When production LLM costs spike unexpectedly, it's not random. Unplanned token spend happens when live traffic patterns clash with the budget numbers a team built into prototypes. The culprits are usually the same few things: unrestricted keys floating around services with no per-team limits, the same query hitting the provider multiple times instead of using a cached result, giant context windows stuffed with unused tool listings, and spending dashboards that only tell you about the problem after it's already expensive.&lt;/p&gt;

&lt;p&gt;The top cost-overrun sources in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unguarded API keys spread across the stack&lt;/strong&gt; ,  raw provider credentials without per-consumer spend caps let any service or customer rack up charges unchecked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeated equivalent queries&lt;/strong&gt; ,  identical or nearly-identical requests that should return a stored answer instead hit the provider every time, wasting spend on work the system already solved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oversized contexts&lt;/strong&gt; ,  agents dumping complete tool catalogs or full conversation histories into the prompt, padding input token costs on each request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive cost monitoring&lt;/strong&gt; ,  alerting dashboards that notify after overspending happened rather than blocking the call that triggered it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; handles all four pain points from one place. Since it's a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; that sits between services and providers, enforcement happens on every request without touching application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Gateway Is the Best Place to Control LLM Costs
&lt;/h2&gt;

&lt;p&gt;Cost controls in individual services don't scale. Application teams each set different limits, spend accounting stays fragmented, and you end up with no unified spending ceiling anywhere. A gateway layer sees everything: it's where all requests converge before they hit a provider and incur charges. That's the chokepoint for cost governance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; manages traffic to 1000+ models via one OpenAI-compatible endpoint and layers in just &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds of latency&lt;/a&gt; overhead per call at 5,000 RPS in real-world benchmarks. Enforcement can't become a performance hit or teams will bypass it. The gateway already intercepts every request, so adding budget checks, response caching, and token reduction there is cheap in CPU and fast. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; helps teams weigh performance, feature depth, and cost-control options across platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Hard Spend Limits Before Money Gets Spent
&lt;/h2&gt;

&lt;p&gt;The critical difference: blocking a request that would bust your budget, not alerting you after the damage is done. Gateway-level &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget and rate limits&lt;/a&gt; prevent overspending by rejecting requests that exceed your ceilings, not by warning you afterward.&lt;/p&gt;

&lt;p&gt;The key governance tool is the &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt;. Instead of handing raw provider keys to every service, Bifrost issues virtual keys tied to a specific budget, rate limit, model whitelist, and routing policy. Raw provider credentials never leak across your infrastructure, which closes off one of the biggest cost-leak vectors.&lt;/p&gt;

&lt;p&gt;Bifrost's budget hierarchy is nested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer-level&lt;/strong&gt; ,  isolate spending by account (critical for multi-tenant SaaS).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team-level&lt;/strong&gt; ,  set spending ceilings per department or product area.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual-key-level&lt;/strong&gt; ,  cap spend by specific service or application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-level&lt;/strong&gt; ,  budget limits per model provider inside a single key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A request has to pass every applicable budget check before it goes through. When any budget is exhausted, Bifrost sends a 402 response and stops LLM traffic for that key until the budget window resets; the key stays active for non-LLM operations. Token and request rate limits trigger a 429 when thresholds are crossed within a configured period. Reset windows can be rolling or calendar-aligned (monthly resets hit the 1st, not a rolling 30-day window).&lt;/p&gt;

&lt;p&gt;For SaaS, per-customer keys mean one customer can't drive up costs for everyone else. Budget isolation is built in. This &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance and access control&lt;/a&gt; approach was designed for enterprise teams and regulated industries where uncontrolled spend isn't just a finance problem, it's a compliance violation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cutting Costs by Caching Semantically Similar Queries
&lt;/h2&gt;

&lt;p&gt;A chunk of production traffic is functionally redundant: support chatbots answer the same questions, search-powered pipelines re-embed identical queries, automation tools replay similar instructions. Every time a repeat request hits the provider, you pay for work the system already completed.&lt;/p&gt;

&lt;p&gt;Bifrost includes &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; as a native feature with a two-part strategy. Exact hash matches get served instantly, and prompts that don't match exactly still hit the cache through vector similarity with a tunable threshold (default: 0.8). Cached hits land in roughly 5ms, call it near-instant vs. the seconds a live provider call takes, so you save time and tokens on the same request.&lt;/p&gt;

&lt;p&gt;Bifrost's semantic cache includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dual-layer architecture&lt;/strong&gt; ,  combine exact-match speed with fuzzy vector similarity to catch near-duplicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable similarity threshold&lt;/strong&gt; ,  decide how close a match has to be; different rules for different endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple vector backends&lt;/strong&gt; ,  plug in Weaviate, Redis/Valkey, Qdrant, or Pinecone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-compatible&lt;/strong&gt; ,  cached responses stream back with chunks in the right order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-request toggles&lt;/strong&gt; ,  endpoints that need fresh responses can skip the cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cache lives in the gateway, so every service behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; gets caching without separate integration per service. Governance can tune when caching helps without sacrificing freshness where it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Mode: Slashing Token Costs in Multi-Tool Agents
&lt;/h2&gt;

&lt;p&gt;Multi-tool agent workflows are a special cost problem: every single request includes the full schema catalog for every available tool. Connect an agent to five MCP servers with 20 tools each, and the LLM gets 100 tool definitions before it even sees the user's question. The model burns tokens parsing schemas instead of working, and this repeats every call.&lt;/p&gt;

&lt;p&gt;Bifrost solves this with Code Mode, part of the &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;. &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; flips the script: instead of listing all tools directly, it hands the model four meta-tools that orchestrate everything else. The LLM writes Python executing in an isolated sandbox, and only the final result bubbles back up to the context. Intermediate tool outputs stay sandboxed rather than cycling through the model's context.&lt;/p&gt;

&lt;p&gt;The savings compound as you add tools. Bifrost's own testing shows Code Mode slashes input token usage to as low as 92.8%, cuts costs by ~92.2%, and executes ~40% quicker in deployments with many tools, with input-token gains ranging from 58.2% to 92.8% depending on tool count. The reason: classic MCP spend grows with each tool, but Code Mode is bounded by what the model reads. Coding agents and CLI tools running through the gateway get the same spend controls and multi-provider governance. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;detailed technical breakdown on token costs and access control&lt;/a&gt; walks through multi-server setups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Spend Attribution Clear and Actionable
&lt;/h2&gt;

&lt;p&gt;Controlling costs starts with knowing where it's coming from. Figuring out which team, application, and model is eating your budget requires attribution: tags for users, projects, and API keys. Without that traceability, a budget is just a number, not a control.&lt;/p&gt;

&lt;p&gt;Bifrost includes native request monitoring and &lt;a href="https://docs.getbifrost.ai/features/observability/otel" rel="noopener noreferrer"&gt;built-in observability&lt;/a&gt; (Prometheus, OpenTelemetry/OTLP, compatible with Grafana, New Relic, Honeycomb). Request activity ties back to the virtual key, team, and customer responsible for it, giving finance and infrastructure teams spend visibility at the same detail they need to set budgets. This &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;cost governance setup&lt;/a&gt; is what makes per-consumer budgets actually enforceable. Organizations operating under compliance rules get more: &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; layers on audit logging, role-based access, and private-network deployment, extending cost governance into restricted and regulated settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enforcement and monitoring: what's the difference?
&lt;/h3&gt;

&lt;p&gt;Monitoring tells you after overspend happens. Enforcement stops the request that causes it. Bifrost does both: &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget and rate limits&lt;/a&gt; block in real time, while Prometheus/OpenTelemetry telemetry show where spend concentrates so you can tune future limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will adding cost controls add latency?
&lt;/h3&gt;

&lt;p&gt;No. Bifrost runs 11 microseconds overhead per request at 5,000 calls per second, so checks, deduplication, and token reduction layer in without becoming a performance tax teams work around.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to get started?
&lt;/h3&gt;

&lt;p&gt;First step: replace raw provider keys with virtual keys enforcing budgets and rate limits. Next, enable semantic caching on high-traffic endpoints. Last, flip on Code Mode for any agent talking to three or more tool servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Spend Under Control
&lt;/h2&gt;

&lt;p&gt;Controlling production LLM spend boils down to three things: enforcing limits, ditching duplicate work, and cutting token bloat at the infrastructure layer, not scattered through every app. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; combines real-time budget walls, smart dual-layer caching, and Code Mode token cuts in a single &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;open-source gateway&lt;/a&gt; that's a drop-in swap for current SDKs. Every control lives at the gateway, so you get multi-layer cost governance, caching, and observability without rolling custom metering. Ready to lock down production spending? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the Bifrost team to see it in action.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running a High-Performance AI Gateway on Kubernetes</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:38:16 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/running-a-high-performance-ai-gateway-on-kubernetes-1b8k</link>
      <guid>https://dev.to/kuldeep_paul/running-a-high-performance-ai-gateway-on-kubernetes-1b8k</guid>
      <description>&lt;p&gt;&lt;em&gt;Bifrost, the open-source AI gateway, handles thousands of concurrent LLM requests on Kubernetes with near-zero overhead, autoscaling, and centralized governance, everything you need for enterprise-grade production traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When AI requests arrive at scale (hundreds or thousands per second), even milliseconds of added latency compound into user-visible slowdowns and unnecessary token costs. A high-performance &lt;strong&gt;AI gateway on Kubernetes&lt;/strong&gt; lets you absorb that load with a declarative, horizontally scalable deployment while maintaining full control over data, policy, and request routing. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; written in Go, is purpose-built for enterprise teams handling mission-critical AI workloads at high concurrency. This guide covers deploying Bifrost on Kubernetes at production scale, from initial Helm installation through multi-replica cluster mode, autoscaling, and enterprise-grade governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Requirements for a Production AI Gateway on Kubernetes
&lt;/h2&gt;

&lt;p&gt;More than just a proxy is needed to handle enterprise AI traffic. A gateway that can sustain thousands of concurrent requests requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling&lt;/strong&gt;: pods that scale in and out automatically, driven by CPU and memory metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent state across all replicas&lt;/strong&gt;: shared rate limits, budgets, and policy counters that don't drift as the cluster scales.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean shutdown under load&lt;/strong&gt;: in-flight streams (especially SSE responses) that finish gracefully during pod termination or node drain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal request overhead&lt;/strong&gt;: every microsecond of latency matters at this scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability integrated from the start&lt;/strong&gt;: metrics, distributed traces, and readiness checks baked into the workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bifrost ships as a first-class Kubernetes resource. The official Helm chart maps all configuration values directly to the runtime, so your cluster always matches what's in your values file. No configuration drift, no surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Architecture: Why Implementation Matters at Scale
&lt;/h2&gt;

&lt;p&gt;Below 100 requests per second, gateway overhead is imperceptible. At 1,000 RPS and beyond, the architecture of the gateway itself decides whether service quality holds steady or collapses.&lt;/p&gt;

&lt;p&gt;Bifrost is compiled to a single Go binary with goroutines handling concurrent work. This contrasts with Python-based proxies, which face the Global Interpreter Lock and asyncio overhead, both of which constrain parallelism. Internally, Bifrost uses a &lt;a href="https://docs.getbifrost.ai/architecture/core/concurrency" rel="noopener noreferrer"&gt;worker-pool concurrency model&lt;/a&gt;: requests are distributed to workers in a round-robin pattern, queue buffers are sized for traffic bursts, and when the system saturates, backpressure policies either queue excess work or drop it cleanly.&lt;/p&gt;

&lt;p&gt;Performance at high concurrency is measurable. When stress-tested at 5,000 requests per second, Bifrost adds only 11 microseconds of overhead per request. Comparative &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmark data&lt;/a&gt; shows 54 times lower P99 latency and roughly 68% lower memory consumption versus a Python gateway under identical load. At enterprise scales handling sustained high-concurrency traffic, this gap between implementations is what separates predictable tail latency from service degradation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Bifrost on Kubernetes: The Helm Path
&lt;/h2&gt;

&lt;p&gt;The quickest way to get a gateway running is the official Helm chart. First, register the repository, then provision an encryption key and deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add bifrost https://maximhq.github.io/bifrost/helm-charts
helm repo update

kubectl create secret generic bifrost-encryption-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;encryption-key&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 32&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

helm &lt;span class="nb"&gt;install &lt;/span&gt;bifrost bifrost/bifrost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.tag&lt;span class="o"&gt;=&lt;/span&gt;v1.4.11 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; bifrost.encryptionKeySecret.name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bifrost-encryption-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; bifrost.encryptionKeySecret.key&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"encryption-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production deployments, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;the Bifrost gateway&lt;/a&gt; relies on PostgreSQL for the backing store instead of SQLite, and runs three or more replicas for high availability. The switch to Postgres is what enables state sharing across pods. Within the chart, &lt;a href="https://docs.getbifrost.ai/deployment-guides/helm" rel="noopener noreferrer"&gt;the Helm deployment guide&lt;/a&gt; exposes a client-facing config section that directly controls concurrency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bifrost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;initialPoolSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;        &lt;span class="c1"&gt;# preallocate this many request workers&lt;/span&gt;
    &lt;span class="na"&gt;dropExcessRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;     &lt;span class="c1"&gt;# shed overload instead of buffering infinitely&lt;/span&gt;
    &lt;span class="na"&gt;enableLogging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;enforceGovernanceHeader&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A high &lt;code&gt;initialPoolSize&lt;/code&gt; pre-reserves worker capacity to handle expected load spikes. Setting &lt;code&gt;dropExcessRequests&lt;/code&gt; to true means the gateway will reject requests gracefully when overwhelmed, rather than letting request queues grow unbounded. Both settings are critical to keeping a high-concurrency AI gateway predictable at the traffic ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Replica Deployments: Sharing State via Cluster Mode
&lt;/h2&gt;

&lt;p&gt;Just running multiple pod replicas is not enough. If each pod enforces rate limits independently, you end up with the limit multiplied across replicas. That's where &lt;a href="https://docs.getbifrost.ai/deployment-guides/helm/cluster" rel="noopener noreferrer"&gt;cluster mode&lt;/a&gt; comes in: it synchronizes in-memory state (rate limit counters, budget spent, policy rules) across all pods using a gossip protocol.&lt;/p&gt;

&lt;p&gt;On Kubernetes, the recommended approach queries the API server to discover peer pods by label, so new replicas are auto-discovered without manual peer lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bifrost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes&lt;/span&gt;
      &lt;span class="na"&gt;k8sNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
      &lt;span class="na"&gt;k8sLabelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/name=bifrost"&lt;/span&gt;
    &lt;span class="na"&gt;gossip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7946&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pod's service account needs read permissions on pods in that namespace, set up via Role and RoleBinding. Other discovery options (DNS, static peer lists, Consul, etcd) work too for environments where Kubernetes API access isn't available. More advanced HA patterns, including region-aware routing and broker mode for Cloud Run, are covered in the &lt;a href="https://docs.getbifrost.ai/enterprise/clustering" rel="noopener noreferrer"&gt;full clustering guide&lt;/a&gt;. Note: cluster mode is an enterprise feature and requires PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intelligent Scaling: Autoscaling Without Dropping Requests
&lt;/h2&gt;

&lt;p&gt;A gateway must scale out when load spikes and scale back in afterward, all without terminating active requests. The Bifrost Helm chart wires three pieces together: the &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler&lt;/a&gt;, pod anti-affinity rules, and graceful termination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
  &lt;span class="na"&gt;targetMemoryUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;75&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;   &lt;span class="c1"&gt;# wait before shrinking&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;

&lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;       &lt;span class="c1"&gt;# allow streams to finish&lt;/span&gt;
&lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;preStop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bifrost&lt;/span&gt;
        &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scale-down window prevents unnecessary churn during brief traffic dips. The extended grace period and &lt;code&gt;preStop&lt;/code&gt; hook give streaming responses time to finish before a pod is removed. Spread replicas across different nodes with pod anti-affinity to ensure a single node failure doesn't bring the gateway offline. Combined with &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;provider failover&lt;/a&gt;, this setup keeps the gateway online through infrastructure events and upstream provider issues alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance, Observability, Compliance at Enterprise Scale
&lt;/h2&gt;

&lt;p&gt;High throughput alone means little without governance, visibility, and compliance. Bifrost centralizes all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance.&lt;/strong&gt; &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are your primary control lever: each carries access permissions, spending limits, and request rate caps. Turning on &lt;code&gt;is_vk_mandatory&lt;/code&gt; forces every request through a governed key. Budgets and rate limits can be set at the key, team, or customer level, and in cluster mode those counters stay synchronized across the entire replica set. For teams building fine-grained control at scale, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance resource hub&lt;/a&gt; lays out the full model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; Bifrost exposes &lt;a href="https://docs.getbifrost.ai/features/observability/prometheus" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; at &lt;code&gt;/metrics&lt;/code&gt; and ships a ServiceMonitor for automatic scraping. It also supports &lt;a href="https://docs.getbifrost.ai/features/observability/otel" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; for end-to-end distributed tracing. Health probes hook directly into Kubernetes liveness and readiness checks. Worker and queue metrics feed capacity planning decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance and security.&lt;/strong&gt; &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; supplies &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrails&lt;/a&gt; for request filtering and secrets detection, plus &lt;a href="https://docs.getbifrost.ai/enterprise/rbac" rel="noopener noreferrer"&gt;RBAC&lt;/a&gt; for access control. &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt; are immutable and support SOC 2, GDPR, HIPAA, and ISO 27001. Strict data residency is possible through &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Combining throughput with policy and compliance is the hallmark of a gateway that works in production. The same &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmark data&lt;/a&gt; that informs scaling decisions also guides replica sizing and resource requests for your traffic profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps: Running Bifrost in Your Cluster
&lt;/h2&gt;

&lt;p&gt;Deploying a high-performance AI gateway on Kubernetes distills to: Helm-based declarative deployment, PostgreSQL cluster mode for shared state, autoscaling tuned for graceful shutdown, and built-in governance plus observability. Bifrost packages these together as a single Kubernetes workload designed for high-concurrency production AI traffic, with a nearly transparent overhead profile under sustained load.&lt;/p&gt;

&lt;p&gt;Ready to see Bifrost handling your enterprise AI workloads? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Connect Claude Code to Groq With Bifrost</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:37:44 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/connect-claude-code-to-groq-with-bifrost-340f</link>
      <guid>https://dev.to/kuldeep_paul/connect-claude-code-to-groq-with-bifrost-340f</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an open-source LLM gateway that redirects Claude Code to any inference provider, including Groq, without modifying Claude Code itself. This walkthrough covers the entire configuration.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By default, Claude Code connects exclusively to Anthropic's API and runs only Claude models. For individual developers, this constraint is fine. Teams seeking fast inference on open-source models for cost-sensitive applications, or those wanting to direct specific workloads to specialized hardware, find the default limiting. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway written in Go by Maxim AI&lt;/a&gt;, stands between Claude Code and any LLM provider, translating at the protocol level. By configuring Claude Code to use Bifrost's Anthropic-compatible endpoint, you can direct all inference to Groq or any other supported provider without modifying Claude Code's binary or internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for Groq
&lt;/h2&gt;

&lt;p&gt;Groq's inference service relies on Language Processing Units (LPUs), custom silicon optimized for neural network computation. This LPU-based approach guarantees consistent latency: each token generation takes the same time, removing the variance present on traditional GPU platforms. &lt;a href="https://awesomeagents.ai/reviews/review-groq/" rel="noopener noreferrer"&gt;Independent performance tests&lt;/a&gt; show Groq delivering 4-7x greater token throughput than leading GPU systems, with initial-token latency 3-4x lower at equivalent scale.&lt;/p&gt;

&lt;p&gt;For Claude Code users, this speed advantage applies in specific contexts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Experimental and exploratory work&lt;/strong&gt; using openly-licensed models (Llama 4 Scout, llama-3.3-70b-versatile, DeepSeek R1 Distill) for faster prototyping at lower per-token cost than paid frontier models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chained multi-step reasoning&lt;/strong&gt; where each tool invocation requires a new LLM call; per-call speed improvements compound across the chain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective model usage&lt;/strong&gt; where routine tasks run on quick, low-cost open models, while complex reasoning stays on Anthropic's paid tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An important limitation: Groq's API lacks support for image input, vector embeddings, voice synthesis, or transcription. It processes text-based chat and function calling exclusively. Features like Claude's &lt;code&gt;computer_use&lt;/code&gt; tool are unsupported. Deploy Groq for text and function-call work; keep Anthropic available for multimodal or proprietary-feature requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost's Role in Connecting Claude Code to Groq
&lt;/h2&gt;

&lt;p&gt;Bifrost presents an Anthropic-compatible interface at the &lt;code&gt;/anthropic&lt;/code&gt; path. When Claude Code points here instead of Anthropic's servers, Bifrost acts as the intermediary. It accepts Claude Code's request in Anthropic format, routes it to whichever provider you've configured, and returns the response in the shape Claude Code expects.&lt;/p&gt;

&lt;p&gt;Since Groq implements OpenAI's API, Bifrost maps through its OpenAI layer with Groq-specific tuning: fields like &lt;code&gt;store&lt;/code&gt;, &lt;code&gt;service_tier&lt;/code&gt;, and &lt;code&gt;prompt_cache_key&lt;/code&gt; are ignored, and server-sent events use Groq's schema. Full support for function calling is included. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Bifrost provider configuration layer&lt;/a&gt; and the request routing layer are orthogonal; you configure Groq independently from configuring which Claude Code requests hit Groq.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Launch Bifrost
&lt;/h2&gt;

&lt;p&gt;The simplest path is via npx with Node.js 18+.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost &lt;span class="nt"&gt;-app-dir&lt;/span&gt; ./bifrost-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway starts at &lt;code&gt;http://localhost:8080&lt;/code&gt; and initializes a &lt;code&gt;bifrost-data&lt;/code&gt; directory for configs and data. Visit &lt;code&gt;http://localhost:8080&lt;/code&gt; in a browser to reach the control panel.&lt;/p&gt;

&lt;p&gt;Docker is also an option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Add Groq as a Provider
&lt;/h2&gt;

&lt;p&gt;The Bifrost control panel and &lt;code&gt;config.json&lt;/code&gt; both work for provider setup. In the dashboard, go to &lt;strong&gt;Models &amp;gt; Model Providers&lt;/strong&gt;, locate &lt;strong&gt;Groq&lt;/strong&gt;, add it if missing, and enter your Groq API key (raw or as &lt;code&gt;env.GROQ_API_KEY&lt;/code&gt;). Under &lt;strong&gt;Allowed Models&lt;/strong&gt;, choose &lt;strong&gt;All Models&lt;/strong&gt; or restrict to a list.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;config.json&lt;/code&gt; in your &lt;code&gt;bifrost-data&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"groq"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq-key-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.GROQ_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before starting Bifrost, export your Groq key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-groq-api-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify activation in the dashboard: &lt;strong&gt;Models &amp;gt; Model Providers &amp;gt; Groq&lt;/strong&gt; will list the available Groq models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Generate a Virtual Key
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; let Claude Code authenticate to Bifrost without exposing your real provider keys. Each key can carry routing policies, spending caps, and rate-limit rules.&lt;/p&gt;

&lt;p&gt;In the dashboard, go to &lt;strong&gt;Governance &amp;gt; Virtual Keys&lt;/strong&gt;, create a new entry, name it, select Groq as an allowed provider, and copy the generated value. Pass this to Claude Code in the configuration step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Connect Claude Code to Bifrost
&lt;/h2&gt;

&lt;p&gt;Now that Bifrost is running with Groq configured, point Claude Code at Bifrost. Two options: directly edit &lt;code&gt;settings.json&lt;/code&gt;, or run &lt;a href="https://docs.getbifrost.ai/quickstart/cli/getting-started" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt;, a guided terminal tool that sets up Claude Code without requiring you to manage environment variables or files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Direct settings.json Edit
&lt;/h3&gt;

&lt;p&gt;Claude Code loads environment settings from &lt;code&gt;settings.json&lt;/code&gt;. Location varies by OS and scope:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS / Linux / WSL system-wide: &lt;code&gt;~/.claude/settings.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Windows system-wide: &lt;code&gt;%USERPROFILE%\.claude\settings.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Project-level: &lt;code&gt;.claude/settings.json&lt;/code&gt; in the project folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Merge the following into your &lt;code&gt;settings.json&lt;/code&gt;'s &lt;code&gt;env&lt;/code&gt; block (shown here as the key alone; combine with your other settings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_AUTH_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-bifrost-virtual-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_DEFAULT_HAIKU_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_DEFAULT_SONNET_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; sends Claude Code's calls to Bifrost's local Anthropic layer. &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; provides the virtual key in the &lt;code&gt;Authorization: Bearer&lt;/code&gt; header for Bifrost's auth and routing. Model names prefixed with &lt;code&gt;groq/&lt;/code&gt; pin Haiku and Sonnet slots to Groq.&lt;/p&gt;

&lt;p&gt;If Claude Code is active, exit and relaunch. Prevent caching issues by running &lt;code&gt;/logout&lt;/code&gt; first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: Bifrost CLI (no Manual Configuration)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/cli/getting-started" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt; is a guided terminal launcher. It connects Claude Code to an active Bifrost instance interactively, handling base URL setup, key storage, model selection, and MCP server hookup—all without editing files or setting env variables.&lt;/p&gt;

&lt;p&gt;From a second terminal (gateway already running):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interactive flow asks five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bifrost gateway location (typically &lt;code&gt;http://localhost:8080&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Your virtual key (optional; press Enter to skip)&lt;/li&gt;
&lt;li&gt;Pick &lt;strong&gt;Claude Code&lt;/strong&gt; as your agent&lt;/li&gt;
&lt;li&gt;Filter by &lt;code&gt;groq/&lt;/code&gt; and select your model (e.g., &lt;code&gt;groq/llama-3.3-70b-versatile&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Confirm and launch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CLI writes environment variables on the fly, auto-registers Bifrost's MCP service for your tools, and saves settings to &lt;code&gt;~/.bifrost/config.json&lt;/code&gt; between runs. Keys go to the system keyring—never saved as text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;On model selection:&lt;/strong&gt; Claude Code depends on function calling for file operations, terminal execution, and code changes. Verify your chosen Groq model supports tool use. The &lt;a href="https://console.groq.com/docs/models" rel="noopener noreferrer"&gt;GroqCloud documentation&lt;/a&gt; lists each model's capabilities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 5: Test the Setup
&lt;/h2&gt;

&lt;p&gt;Once Claude Code is pointed at Bifrost (via either method), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--model&lt;/span&gt; groq/llama-3.3-70b-versatile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the session, type &lt;code&gt;/model&lt;/code&gt; to confirm the active model. Check Bifrost's dashboard &lt;strong&gt;Logs&lt;/strong&gt; tab to see the request routed to Groq; a 200 status indicates success.&lt;/p&gt;

&lt;p&gt;To change models mid-session, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/model groq/llama-3.3-70b-versatile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automatic Failover and Routing Rules
&lt;/h2&gt;

&lt;p&gt;Bifrost opens up &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover on demand&lt;/a&gt;: if Groq returns errors or hits rate limits, Bifrost can automatically reroute to Anthropic or another provider without Claude Code knowing. Build a fallback chain in the dashboard (&lt;strong&gt;Features &amp;gt; Fallbacks&lt;/strong&gt;) or in &lt;code&gt;config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-haiku-4-5"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;Routing rules&lt;/a&gt; give you condition-based switching too. Example: send Claude Code requests (identifiable by &lt;code&gt;claude-cli&lt;/code&gt; user-agent) to Groq by default, but route others to Anthropic. This creates per-client model behavior on a single instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance and Observability Built In
&lt;/h2&gt;

&lt;p&gt;Routing Claude Code via Bifrost unlocks the full &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance stack&lt;/a&gt; at no cost. &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Per-key spending caps and request limits&lt;/a&gt; prevent unexpected bills. Logs capture every request's provider, model, token count, and timing. For teams with many Claude Code users, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; details how to layer virtual keys by team, project, or owner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Observability&lt;/a&gt; is ready immediately via the Bifrost dashboard. Prometheus metrics live at &lt;code&gt;/metrics&lt;/code&gt; for your monitoring systems. OTLP export is available for Datadog, Grafana, or any standards-compatible backend.&lt;/p&gt;

&lt;p&gt;For teams needing to centralize MCP tool management alongside model routing, Bifrost can also work as an &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;: link upstream MCP servers, expose them through &lt;code&gt;/mcp&lt;/code&gt;, and register them once with Claude Code via a single command. This cleanly separates inference routing from tool control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Getting Claude Code to Groq through Bifrost boils down to three shared steps (start gateway, enable Groq, make a virtual key) and then connecting Claude Code, either via direct &lt;code&gt;settings.json&lt;/code&gt; editing or &lt;a href="https://docs.getbifrost.ai/quickstart/cli/getting-started" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt; for a guided, no-variables approach. From there, Claude Code defaults to Groq, with failover, governance, and dashboards ready to go.&lt;/p&gt;

&lt;p&gt;The same flow works for any Bifrost-supported provider: flip the model name to &lt;code&gt;anthropic/&lt;/code&gt;, &lt;code&gt;openai/&lt;/code&gt;, &lt;code&gt;bedrock/&lt;/code&gt;, &lt;code&gt;vertex/&lt;/code&gt;, or any of 20+ platforms, and Claude Code adapts instantly.&lt;/p&gt;

&lt;p&gt;For multi-team scaling with load balancing, clustering, vault integration, OIDC, or private-cloud deployment, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;schedule a demo with the Bifrost team&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI Gateways for Production: A Technical Overview</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:37:17 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/ai-gateways-for-production-a-technical-overview-4ife</link>
      <guid>https://dev.to/kuldeep_paul/ai-gateways-for-production-a-technical-overview-4ife</guid>
      <description>&lt;p&gt;&lt;em&gt;Route, govern, and observe LLM traffic from a single control plane. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; unifies 20+ providers through one OpenAI-compatible API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Imagine your production AI system calling OpenAI, Anthropic, AWS Bedrock, and Google Vertex AI, each with its own SDK, authentication scheme, and rate-limit surface. Duplicated integration code sprawls across your codebase, access controls are inconsistent, and no single vantage point shows cost or latency. This is the problem an AI gateway solves. It acts as a unified entry point between applications and LLM providers, centralizing routing, security, and observability through one API. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built in Go by Maxim AI, is the best overall choice for enterprise teams running mission-critical AI workloads that require best-in-class performance, scalability, and reliability. This guide covers what an AI gateway does, where it differs from traditional API gateways, which production-grade capabilities matter, and how to evaluate one for your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining an AI Gateway
&lt;/h2&gt;

&lt;p&gt;Between your application and your LLM providers sits a middleware layer called an AI gateway. It exposes a single API to all models, then applies routing rules, fallback logic, authentication, caching, rate limiting, and monitoring to every request passing through.&lt;/p&gt;

&lt;p&gt;Why does this layer exist? Direct integrations do not scale. Every provider has its own SDK, its own quota semantics, and its own response format; adding a new model multiplies integration and maintenance burden. &lt;a href="https://www.ibm.com/think/topics/ai-gateway" rel="noopener noreferrer"&gt;A 2025 overview from IBM&lt;/a&gt; frames the problem as a need for unified layer connecting applications to models while applying governance and security policies consistently.&lt;/p&gt;

&lt;p&gt;In operation, the gateway becomes the single source of truth for AI traffic: which application made which request, at what cost, with what latency, under which policy constraints. Both &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost's documentation&lt;/a&gt; and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; detail the concrete capabilities that matter, which structure the rest of this discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gateways Differ from Traditional API Gateways
&lt;/h2&gt;

&lt;p&gt;A conventional API gateway manages microservices: it enforces authentication, limits requests by count, and routes by path. LLM workloads break these assumptions, which is why a specialized layer has emerged.&lt;/p&gt;

&lt;p&gt;The key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pricing is token-based, not request-based.&lt;/strong&gt; Two identical requests can differ wildly in cost depending on model selection and output length. &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Rate limits and budgets&lt;/a&gt; must therefore track tokens and spending, not just request count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Providers enforce per-key quotas.&lt;/strong&gt; When &lt;a href="https://platform.openai.com/docs/guides/rate-limits" rel="noopener noreferrer"&gt;OpenAI's rate limits&lt;/a&gt; are hit on one key, the gateway must distribute load across keys or fall back to a different provider entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive data lives in payloads.&lt;/strong&gt; Customer information, source code, and credentials flow through prompts regularly, making prompt screening and redaction a gateway-level responsibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responses are format-heterogeneous.&lt;/strong&gt; Streaming semantics, error types, and provider-specific parameters differ across vendors; the gateway must normalize all of this into one consistent shape.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional API gateway can sit in front of an LLM endpoint, but it lacks the ability to reason about tokens, model failover decisions, or content safety. An AI gateway is purpose-built for these exact problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capabilities Every Production Gateway Needs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost's feature set&lt;/a&gt; covers everything teams should expect from a production gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified API, drop-in ready.&lt;/strong&gt; A single OpenAI-compatible endpoint works for every provider. The &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in SDK pattern&lt;/a&gt; means OpenAI, Anthropic, LangChain, and LiteLLM code needs only a base URL change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider, multi-model routing.&lt;/strong&gt; &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;Support for 20+ providers and 1,000+ models&lt;/a&gt;, including OpenAI, Anthropic, Bedrock, Vertex, Azure, and more, with consistent responses regardless of where a request routes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover and load balancing.&lt;/strong&gt; When a provider fails or hits quota, &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic fallbacks&lt;/a&gt; reroute to alternates. Weighted key distribution handles load balancing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching.&lt;/strong&gt; Similar queries pull responses from cache instead of making new API calls, cutting cost and latency through &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic matching&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control.&lt;/strong&gt; &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; bundle per-consumer permissions, budgets, and rate limits into one governance entity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and tracing.&lt;/strong&gt; Every request is logged with full metadata. Prometheus metrics emit natively, and OpenTelemetry spans integrate with any observability backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content safety.&lt;/strong&gt; &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;Guardrail plugins&lt;/a&gt; screen prompts and completions against AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI services in-line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool management.&lt;/strong&gt; Acting as an &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;, Bifrost centralizes &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; tool discovery and execution, so agents access tools through the same governed gateway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities reinforce one another: failover only works with multi-provider support, governance depends on a single control point, and observability is incomplete unless all traffic flows through the same layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Operates at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Built in Go, the Bifrost gateway&lt;/a&gt; achieves minimal overhead. Production &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; at 5,000 requests per second show just 11 microseconds of added latency, with zero request failures.&lt;/p&gt;

&lt;p&gt;Getting started takes minutes. One Docker command or NPX invocation brings the gateway online with zero required configuration. Via the web UI or API, teams configure providers, API keys, routing rules, and access policies. Existing application code points at Bifrost by changing the SDK base URL. All requests then flow through the gateway, automatically gaining failover, caching, governance, and tracing with no additional code.&lt;/p&gt;

&lt;p&gt;The observability pipeline operates continuously. The gateway records each request including provider, model, token volumes, cost, response time, and cache hit status. Prometheus scraping works natively, and OpenTelemetry traces feed into any compatible backend, so AI traffic lands in the same monitoring dashboards platform teams already run.&lt;/p&gt;

&lt;p&gt;For agent workloads, the same infrastructure handles tools. Bifrost serves as both an MCP client and server, connecting external tool providers and exposing them to clients (Claude Desktop, Cursor, etc.), with per-key tool filtering that restricts each consumer to permitted tools only.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework for Production Deployments
&lt;/h2&gt;

&lt;p&gt;Selecting a gateway is an infrastructure decision, so apply the same rigor as any critical-path system. Four evaluation areas distinguish production-ready deployments from development experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency impact.&lt;/strong&gt; The gateway lives on the request path, so validate overhead under your actual traffic profile, not in a single request benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability guarantees.&lt;/strong&gt; Single points of failure are unacceptable. Bifrost Enterprise offers &lt;a href="https://docs.getbifrost.ai/enterprise/clustering" rel="noopener noreferrer"&gt;clustering&lt;/a&gt; with automatic discovery and zero-downtime rolling updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment options.&lt;/strong&gt; Regulated industries demand the gateway run within their own network boundary. &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;Bifrost Enterprise&lt;/a&gt; provides in-VPC, on-prem, and isolated deployments, along with audit logs that satisfy SOC 2, GDPR, HIPAA, and ISO 27001 requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance breadth.&lt;/strong&gt; Ensure the platform supports nested budgets, RBAC, and identity provider federation, not merely API key management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For detailed capability comparisons across multiple gateways, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM gateway evaluation guide&lt;/a&gt; provides a structured matrix.&lt;/p&gt;

&lt;p&gt;Timing plays a role as well. Early gateway adoption, when your system runs one provider, carries minimal switching friction (just a base URL change) while delivering failover and cost tracking before an outage forces the decision. Late adoption, after multiple providers are entrenched, becomes a migration project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adopting Bifrost
&lt;/h2&gt;

&lt;p&gt;An AI gateway collapses fragmented provider integrations into one governed, observable, resilient layer, and starting early avoids integration debt. Bifrost deploys in seconds as open-source software and scales from a developer laptop to enterprise clusters processing thousands of requests per second. Explore how &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Bifrost fits your infrastructure&lt;/a&gt; by booking a demo with the Bifrost team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>MCP at Scale: When to Use Code Mode Instead of Classic Tool Calling</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:36:43 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/mcp-at-scale-when-to-use-code-mode-instead-of-classic-tool-calling-1f02</link>
      <guid>https://dev.to/kuldeep_paul/mcp-at-scale-when-to-use-code-mode-instead-of-classic-tool-calling-1f02</guid>
      <description>&lt;p&gt;&lt;em&gt;Classic MCP loads every tool schema on every request, driving up costs. Code Mode in &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; cuts input tokens by 92.8% by letting AI orchestrate through Python.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building AI agents that connect to dozens of external tools creates a hidden cost: each request loaded with 150+ tool schemas, consuming most of the model's token budget just parsing definitions instead of solving the task. This is the central tradeoff between classic MCP and Code Mode. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway built in Go&lt;/a&gt; by Maxim AI, gives teams both options, letting them pick based on deployment scale and workflow complexity. Understanding when each approach makes sense can cut costs by orders of magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Default Approach: Direct Tool Calling
&lt;/h2&gt;

&lt;p&gt;Classic MCP tool calling is how most MCP implementations work out of the box. The gateway pushes all connected tool definitions into the model's context before each request arrives, the model generates tool-call suggestions, and the application handles execution separately through an explicit approval step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/tool-execution" rel="noopener noreferrer"&gt;Tool execution&lt;/a&gt; in Bifrost follows a deliberate, stateless flow: each chat request asks the model what to do, returns the suggestions without running them, lets the application review for safety, and then executes the approved actions via a separate call before resuming conversation. Bifrost simultaneously acts as an MCP client connecting to external servers via STDIO, HTTP, or SSE, and as an MCP server exposing those tools to clients such as Claude Desktop. See the &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP overview&lt;/a&gt; for the full mechanics.&lt;/p&gt;

&lt;p&gt;This design delivers tight control. Tool execution never happens without a reviewer in the loop, all operations are logged, and the calling code maintains conversation state. The downside emerges at scale: as servers multiply, the model rereads the entire tool list on every turn, and every intermediate step in a multi-tool task has to flow back through the context before the next move.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Alternative: Code Mode
&lt;/h2&gt;

&lt;p&gt;Code Mode flips the model: instead of injecting all tool definitions, the system shows the AI four small meta-tools and lets it write a Python orchestration script that runs in a controlled sandbox. The model sees compact schemas rather than a catalog of 150 definitions.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; is enabled in Bifrost, four lightweight tools appear in every request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;listToolFiles&lt;/strong&gt;: reveal available MCP servers and their tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;readToolFile&lt;/strong&gt;: fetch Python function signatures for a server as needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;getToolDocs&lt;/strong&gt;: retrieve full details for a single tool on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;executeToolCode&lt;/strong&gt;: submit Python code for sandbox execution with full tool bindings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model discovers what's available, pulls the signatures it needs, and writes a short Python script that Bifrost runs via a Starlark interpreter. Any intermediate outputs from tool calls stay sandboxed and never return to the model; only the final result comes back. This mirrors the pattern described by &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team&lt;/a&gt;, which found that a Google Drive to Salesforce integration shrank from 150,000 tokens down to 2,000 when tool calls gave way to code execution. Bifrost embeds this as a first-class gateway feature, picking Python over JavaScript (models see more Python in training data) and including a dedicated doc-lookup tool to compress context even further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing the Two Patterns
&lt;/h2&gt;

&lt;p&gt;Both approaches operate through the same &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;. The difference lies in what the model sees, how many interaction cycles happen, and where orchestration logic runs. Switching from one to the other is just a per-client configuration toggle.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Classic MCP&lt;/th&gt;
&lt;th&gt;Code Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool visibility&lt;/td&gt;
&lt;td&gt;Complete list in context, repeated per turn&lt;/td&gt;
&lt;td&gt;Four meta-tools; definitions fetched on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model interaction cycles&lt;/td&gt;
&lt;td&gt;One step per tool invocation&lt;/td&gt;
&lt;td&gt;3 to 4 total across whole workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where intermediate outputs go&lt;/td&gt;
&lt;td&gt;Returned to model&lt;/td&gt;
&lt;td&gt;Processed inside the sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where workflow logic lives&lt;/td&gt;
&lt;td&gt;Conversation loop&lt;/td&gt;
&lt;td&gt;Python script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost scaling&lt;/td&gt;
&lt;td&gt;Grows linearly with tool count&lt;/td&gt;
&lt;td&gt;Capped by actually-used files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right for&lt;/td&gt;
&lt;td&gt;Few tools, straightforward calls&lt;/td&gt;
&lt;td&gt;Many servers, intricate multi-step work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Classic MCP's simplicity shines for lean tool sets. Code Mode trades that immediacy for resource efficiency: tools stay behind the meta-tool interface, and the model only reads the stubs and docs it actually uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Math as Tool Count Climbs
&lt;/h2&gt;

&lt;p&gt;The efficiency gains from Code Mode multiply as deployments scale. At large tool counts, classic MCP cost balloons because the entire catalog reloads every turn, whereas Code Mode cost plateaus based on which stub files the model opens.&lt;/p&gt;

&lt;p&gt;Bifrost ran benchmarks across increasing MCP footprints, comparing Code Mode to classic MCP with the same test queries each time. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;benchmark writeup&lt;/a&gt; shows the separation widening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;96 tools across 6 servers&lt;/strong&gt;: input token drop of 58.2%, cost drop of 55.7%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;251 tools across 11 servers&lt;/strong&gt;: input token drop of 84.5%, cost drop of 83.4%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;508 tools across 16 servers&lt;/strong&gt;: input token drop of 92.8% (75.1M falling to 5.4M), cost drop from $377 to $29&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The largest test maintained a 100% pass rate (65 of 65 succeeded), showing the token reduction came without sacrificing task completion. Per-query math at 500 tools shows roughly a 14x saving, dropping from 1.15M to 83K tokens. These runs also cut LLM round trips from many calls to 3 to 4 total, and accelerated execution by ~40%. Using &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;published benchmarks&lt;/a&gt;, teams can test their own MCP configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Code Mode Executes, Step by Step
&lt;/h2&gt;

&lt;p&gt;Code Mode replaces the back-and-forth loop with a single discover-then-run sequence. Once activated for an MCP client, that client's tools disappear from direct availability and only become accessible through the four meta-tools.&lt;/p&gt;

&lt;p&gt;A typical interaction unfolds like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: the model calls &lt;code&gt;listToolFiles&lt;/code&gt; to examine available servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema loading&lt;/strong&gt;: &lt;code&gt;readToolFile&lt;/code&gt; retrieves the Python signatures for a relevant server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation lookup&lt;/strong&gt;: if needed, &lt;code&gt;getToolDocs&lt;/code&gt; pulls deeper docs for one tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: the model writes Python code that &lt;code&gt;executeToolCode&lt;/code&gt; runs in the sandbox, returning a compact outcome&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Starlark sandbox intentionally limits what code can do: no import statements, no filesystem or network access, default 30-second timeout. Tools from Code Mode clients appear as global objects in the sandbox; calls run synchronously, so scripts can chain multiple tool operations and return a single aggregated result. Bifrost offers two ways to organize the stub files: server-level puts all tools from one server in one file, while tool-level creates one file per tool for servers with large schemas.&lt;/p&gt;

&lt;p&gt;Code Mode is toggled per MCP client. A single Bifrost instance can run some clients in classic mode and others in Code Mode simultaneously, and adding new servers follows the standard &lt;a href="https://docs.getbifrost.ai/mcp/connecting-to-servers" rel="noopener noreferrer"&gt;connection flow&lt;/a&gt; either way. Complete setup instructions and the full meta-tool API are in the &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deciding Between Classic MCP and Code Mode
&lt;/h2&gt;

&lt;p&gt;Code Mode wins for deployments with 3+ MCP servers, workflows that require multiple orchestrated steps, or scenarios where token cost and latency are top concerns. Stick with classic calling for single or dual-server setups with straightforward, one-step tool use. The patterns need not be all-or-nothing: advanced teams use Code Mode for heavy-duty servers like web search or data access, and keep direct calling for quick utilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  What if I only have one or two MCP servers?
&lt;/h3&gt;

&lt;p&gt;The token savings from Code Mode rarely justify the added conceptual overhead with minimal server counts. Classic calling has lower mental complexity and debugging is simpler. Code Mode's payoff accelerates as the tool count climbs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will Code Mode slow my agents down?
&lt;/h3&gt;

&lt;p&gt;No. By consolidating multiple tool steps into one sandboxed script, Code Mode actually cuts the count of LLM interactions by 3 to 4 times and typically improves execution speed by 40% in large deployments. Classic MCP can still be preferable for single, immediate tool calls where latency is critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run both at once?
&lt;/h3&gt;

&lt;p&gt;Absolutely. Code Mode is configured individually per MCP client, so a Bifrost setup can have some clients in classic mode and others in Code Mode. This approach lets teams migrate the largest, most expensive servers first while keeping lightweight tools on the simpler path.&lt;/p&gt;

&lt;h3&gt;
  
  
  How secure is Code Mode auto-execution?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; lets Code Mode scripts run without human approval, but with safeguards. Bifrost extracts every tool reference from the code and executes only if all of them are on the allowlist; any off-list calls bounce back for approval. Paired with &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;granular tool filtering per virtual key&lt;/a&gt;, this keeps autonomous runs inside configured guardrails. For compliance-heavy teams and regulated industries, Bifrost operates as a full-featured &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; with access logs, cost attribution, and tool-level permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Picking between classic MCP and Code Mode fundamentally depends on scale. For small tool inventories, classic calling is clear and direct. For agents managing many connected servers, Code Mode constrains token consumption and latency even as the catalog expands, while sustaining reliability. Both ride the same infrastructure; toggling Code Mode happens once servers connect via the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Bifrost quickstart&lt;/a&gt;. The &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP specification&lt;/a&gt; provides full protocol details for teams digging deeper into the standard itself.&lt;/p&gt;

&lt;p&gt;Ready to optimize MCP orchestration and governance at scale? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Schedule a demo&lt;/a&gt; with the Bifrost team, or browse the &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost resources hub&lt;/a&gt; for benchmarks and deployment patterns.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Claude Code Governance: How an AI Gateway Secures Agent Access</title>
      <dc:creator>Kuldeep Paul</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:35:56 +0000</pubDate>
      <link>https://dev.to/kuldeep_paul/claude-code-governance-how-an-ai-gateway-secures-agent-access-59i</link>
      <guid>https://dev.to/kuldeep_paul/claude-code-governance-how-an-ai-gateway-secures-agent-access-59i</guid>
      <description>&lt;p&gt;&lt;em&gt;Effective Claude Code governance requires a gateway-layer control point. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; delivers that control through virtual keys, credential isolation, MCP tool filtering, and immutable audit trails for every agent session.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From a single terminal session, Claude Code reads full repositories, runs shell commands, and invokes external MCP tools, all authenticated with a raw provider API key. When developers share the same key, the same model permissions, and an unrestricted tool surface, a single leaked credential or an over-privileged session crosses from a billing problem into a security incident. Claude Code governance is how platform teams insert identity, access control, and audit between the agent and the provider, so every request is scoped, attributable, and recoverable after the fact. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Go-based open-source AI gateway&lt;/a&gt; built by Maxim AI, enforces those controls at the gateway layer through virtual keys, credential isolation, MCP tool filtering, guardrails, and audit logs. This guide explains what governance actually requires and how Bifrost applies it without touching the developer workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Security Controls That Claude Code Governance Requires
&lt;/h2&gt;

&lt;p&gt;At its core, Claude Code governance means running every agent request through a control point that authenticates the caller, keeps real provider credentials off developer machines, limits which models and tools each session can reach, and captures each action in a tamper-proof log. An AI gateway is that control point; the Claude Code client itself does not need to change.&lt;/p&gt;

&lt;p&gt;The risks that make this necessary are well established. The &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" rel="noopener noreferrer"&gt;OWASP Top 10 for LLM Applications&lt;/a&gt; places excessive agency and sensitive information disclosure in its list of the most critical LLM security risks, and an autonomous coding agent that holds file access, shell access, and tool-calling privileges exercises that agency on every run. Five controls address those risks comprehensively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Credential isolation&lt;/strong&gt;: provider keys are held centrally in the gateway, not distributed to individual developer machines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity and attribution&lt;/strong&gt;: each request is tied to a developer, team, or project rather than a shared API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access scoping&lt;/strong&gt;: per-key policies define which models, providers, and MCP tools any given session is allowed to reach&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containment&lt;/strong&gt;: rate limits and guardrails bound the damage a compromised or runaway session can cause&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit and evidence&lt;/strong&gt;: request-level records covering session, model, token, and user data for compliance purposes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All five controls are applied through &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;gateway-level governance&lt;/a&gt;, which means they take effect on every Claude Code session uniformly, regardless of developer, provider, or deployment environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Anthropic's Native Controls Leave Gaps
&lt;/h2&gt;

&lt;p&gt;Out of the box, Claude Code authenticates directly with a single provider key per developer. &lt;a href="https://code.claude.com/docs/en/overview" rel="noopener noreferrer"&gt;Anthropic's Team and Enterprise plans&lt;/a&gt; include SSO and admin controls, but they stop short of fine-grained, cross-provider access control and request-level audit logs that platform teams need. Direct-to-provider usage produces four security gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Broadly scoped credentials&lt;/strong&gt;: a raw provider key stored on a laptop can be copied, committed to version control, or reused in ways the platform team cannot track&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniform access across roles&lt;/strong&gt;: every developer, from intern to staff engineer, reaches the same provider catalog, model set, and tool surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No MCP tool boundary&lt;/strong&gt;: a connected MCP server exposes all of its tools to every session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-request identity or audit record&lt;/strong&gt;: the provider console shows aggregate usage, not who ran which session, against which model, at what cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;Anthropic's cost documentation&lt;/a&gt; reports that 90 percent of users stay under $30 per active day, which still exposes organizations to a long tail of sessions costing hundreds of dollars. A flat, identical-access model cannot contain that tail. The gateway layer is where these gaps close, specifically through &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; that attach identity and policy to every request before any provider receives it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Claude Code Through Bifrost
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; intercepts Claude Code's requests at the transport layer before they reach any upstream provider. Connecting Claude Code to the gateway takes two environment variable changes; no modifications to the Claude Code binary, extensions, or workflow are needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Route Claude Code through the Bifrost gateway&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/anthropic"&lt;/span&gt;

&lt;span class="c"&gt;# Authenticate using a Bifrost virtual key (no Anthropic account required)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-virtual-key"&lt;/span&gt;

&lt;span class="c"&gt;# Start Claude Code&lt;/span&gt;
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost exposes an Anthropic-compatible API surface, so streaming responses, tool calls, and extended thinking all continue to function without modification. The full setup, including &lt;code&gt;settings.json&lt;/code&gt; configuration and provider passthrough patterns for Bedrock, Vertex, and Azure, is covered in the &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the gateway put provider credentials on developer machines?
&lt;/h3&gt;

&lt;p&gt;No. With the recommended &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; method, Claude Code transmits only the Bifrost virtual key. Real provider credentials are stored centrally in the gateway and injected per-request. Revoking or rotating a key takes effect immediately on the next request, with no key rotation ceremony and no environment variable changes required across developer machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does this add to the developer's daily workflow?
&lt;/h3&gt;

&lt;p&gt;In practice, nothing. Only the base URL and the virtual key differ from a direct-to-Anthropic setup. Developers run the same &lt;code&gt;claude&lt;/code&gt; command and keep all existing session features, including &lt;code&gt;/model&lt;/code&gt; switching. Bifrost adds 11 microseconds of per-request overhead at 5,000 requests per second in sustained &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;performance benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Keys: Where the Security Boundary Is Defined
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are the core governance primitive in the Bifrost AI gateway. Each key represents one authenticated consumer, whether that is an individual developer, a squad, a CI pipeline, or an external tenant, and carries a policy defining exactly what that consumer can do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider and model scoping&lt;/strong&gt;: a key configured for Sonnet and Haiku cannot call Opus or reach an unlisted provider; attempting to switch out-of-scope via &lt;code&gt;/model&lt;/code&gt; returns a clear error rather than executing silently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool filtering&lt;/strong&gt;: &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;tool filtering&lt;/a&gt; on a per-virtual-key basis determines which tools each agent session can call, so a developer building a support agent sees CRM and ticketing tools, while an SRE sees Kubernetes and observability tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant isolation&lt;/strong&gt;: virtual keys scoped per team or customer keep access, budgets, and usage data cleanly separated across organizational boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP tool filtering is the most direct answer to the excessive agency risk: the full set of connected tools remains available in the gateway, but each session's visible surface is constrained to what that role is authorized to use. A single GitHub or database MCP server no longer functions as an all-or-nothing grant to every developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limits and Guardrails as Containment Controls
&lt;/h2&gt;

&lt;p&gt;Even with access scoped correctly, a compromised key or a misbehaving automated session needs to be bounded. Bifrost provides two containment mechanisms at the gateway layer: rate limits and guardrails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/rate-limits" rel="noopener noreferrer"&gt;Rate limits&lt;/a&gt; operate at the virtual key level, capping requests per minute for each consumer. This covers subagent loops, automation scripts that run unbounded, and leaked keys being driven programmatically. Setting a tighter per-minute ceiling on CI automation keys than on interactive developer keys means one failed pipeline cannot consume the team's entire throughput allocation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;Guardrails&lt;/a&gt; evaluate prompts and completions as they transit the gateway. &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/secrets-detection" rel="noopener noreferrer"&gt;Secrets detection&lt;/a&gt; scans for API keys, tokens, and credentials that a developer might paste into context or that a model might reproduce in its output. Custom regex patterns allow teams to redact or block organization-specific sensitive strings. Content-safety enforcement through AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI applies consistently across all Claude Code sessions, configured once in the gateway rather than per developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an Audit and Compliance Record
&lt;/h2&gt;

&lt;p&gt;Governance that cannot be demonstrated to auditors has limited value. &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt; in Bifrost generate an immutable, request-level record of every Claude Code session, covering model, token counts, tool call activity, and the virtual key or user behind each request. This record meets the evidence requirements for SOC 2, HIPAA, GDPR, and ISO 27001.&lt;/p&gt;

&lt;p&gt;For workloads in regulated industries, &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; adds Prometheus metrics and OpenTelemetry distributed traces, each tagged with virtual key, team, model, and tool identifiers. Security and platform teams can query exactly which model a developer or agent used, when, and how many tokens it consumed. Bifrost can be deployed &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;in-VPC or in air-gapped environments&lt;/a&gt; so no inference traffic or credential exits the organization's own infrastructure; SSO through OIDC maps virtual keys to directory identities in Okta or Microsoft Entra, closing the loop between organizational identity management and per-request attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Governing Claude Code Agent Access
&lt;/h2&gt;

&lt;p&gt;Claude Code governance is, first, a security problem. The provider console provides a spending record; it does not keep credentials off laptops, restrict tool access by role, stop a runaway session, or prove who ran what. Placing &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; between Claude Code and the provider adds virtual keys, tool filtering, guardrails, and audit logs behind the same Anthropic-compatible endpoint the agent already targets, with no change to how developers work. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;Bifrost governance resource page&lt;/a&gt; documents the full control set, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; covers how gateway options compare on governance and security depth.&lt;/p&gt;

&lt;p&gt;To apply Claude Code governance to your engineering organization's actual usage, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a Bifrost demo&lt;/a&gt; with the team.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
