<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sahajmeet Kaur</title>
    <description>The latest articles on DEV Community by Sahajmeet Kaur (@sahajmeet_kaur_).</description>
    <link>https://dev.to/sahajmeet_kaur_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3978504%2F0ef5b27d-0f02-4f25-ab3a-6e9534bbf6e9.png</url>
      <title>DEV Community: Sahajmeet Kaur</title>
      <link>https://dev.to/sahajmeet_kaur_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sahajmeet_kaur_"/>
    <language>en</language>
    <item>
      <title>MCP Authentication: How We Secured 12 MCP Servers Without Losing Our Minds</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 01 Jul 2026 07:47:29 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/mcp-authentication-how-we-secured-12-mcp-servers-without-losing-our-minds-1di9</link>
      <guid>https://dev.to/sahajmeet_kaur_/mcp-authentication-how-we-secured-12-mcp-servers-without-losing-our-minds-1di9</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; MCP authentication is genuinely more complex than regular API auth because you're managing credentials across many servers, for many agents, under many user identities - often all at once. The approaches range from static API keys (fast, insecure at scale) to OAuth 2.1 with PKCE (spec-compliant, more setup) to a centralized gateway that handles all downstream auth for you. We went through all three stages. This post covers what we learned.&lt;/p&gt;




&lt;p&gt;Eight months ago our MCP auth story was: shared API key in a &lt;code&gt;.env&lt;/code&gt; file, every developer had access to everything, fingers crossed nothing bad happened.&lt;/p&gt;

&lt;p&gt;Two near-misses later - one agent that almost deleted production data via a misconfigured write tool, one contractor whose MCP access wasn't revoked after they left - we got serious about it.&lt;/p&gt;

&lt;p&gt;Here's the setup we landed on, what each part solved, and what it cost in setup time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP auth is harder than regular API auth
&lt;/h2&gt;

&lt;p&gt;Regular API auth has one credential relationship: your application authenticates to a service. MCP auth has three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client → Gateway/Server&lt;/strong&gt;: how your agent proves its identity to the MCP infrastructure&lt;br&gt;
&lt;strong&gt;Gateway → Downstream service&lt;/strong&gt;: how the MCP server authenticates to GitHub, Jira, Slack, or whatever backend it wraps&lt;br&gt;
&lt;strong&gt;User delegation&lt;/strong&gt;: when an agent acts on behalf of a specific human (post to Slack &lt;em&gt;as&lt;/em&gt; a user, not as a bot), how that user's identity flows through the call chain&lt;/p&gt;

&lt;p&gt;Managing all three manually, per server, per developer, per agent is where the complexity explodes. Most MCP auth problems are coordination problems, not cryptography problems.&lt;/p&gt;


&lt;h2&gt;
  
  
  The authentication methods, in order of complexity
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Static API keys / Bearer tokens
&lt;/h3&gt;

&lt;p&gt;The simplest option. The MCP server expects a static Bearer token in the &lt;code&gt;Authorization&lt;/code&gt; header. You set it once in the server config and again in the client config. Done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"my-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://my-mcp-server.internal/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer your-static-token-here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it works:&lt;/strong&gt; Internal servers with a single operator, dev environments, quick prototypes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; Rotation. Every time the token changes, every client that uses it needs updating. With 15 developers and 8 servers, token rotation becomes a coordination nightmare. And if the token is in a &lt;code&gt;.env&lt;/code&gt; file in a repo, it's eventually in git history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real risk:&lt;/strong&gt; Static tokens have no user identity attached. When an agent calls a tool using a static token, there's no way to know which developer or workflow triggered it. Audit trails become "token X called tool Y at time Z" - useless for compliance or incident response.&lt;/p&gt;




&lt;h3&gt;
  
  
  Environment variables for stdio servers
&lt;/h3&gt;

&lt;p&gt;Stdio-based MCP servers run as local processes. Auth happens outside the MCP protocol — you pass credentials as environment variables that the server reads at startup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-github"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GITHUB_PERSONAL_ACCESS_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${env:GITHUB_TOKEN}"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;${env:VAR}&lt;/code&gt; syntax in Claude Code and Cursor pulls from your shell environment rather than hardcoding the value in the config file — this keeps credentials out of version control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it works:&lt;/strong&gt; Local development with stdio servers where each developer authenticates with their own credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; It doesn't scale. Each developer manages their own credentials per server. There's no centralized revocation. When a developer leaves, you're hoping they cleared their local environment (they often haven't).&lt;/p&gt;




&lt;h3&gt;
  
  
  OAuth 2.1 with PKCE for remote servers
&lt;/h3&gt;

&lt;p&gt;The MCP spec standardized OAuth 2.1 with PKCE in its March 2025 revision. This is the correct long-term answer for remote MCP servers because it ties tool calls to real user identities through your existing identity provider.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent initiates an MCP connection&lt;/li&gt;
&lt;li&gt;Server redirects to your IdP (Okta, Azure AD, Auth0)&lt;/li&gt;
&lt;li&gt;User authenticates in the browser&lt;/li&gt;
&lt;li&gt;IdP issues an authorization code&lt;/li&gt;
&lt;li&gt;Client exchanges the code for an access token (PKCE ensures this can't be intercepted)&lt;/li&gt;
&lt;li&gt;Token is attached to all subsequent MCP calls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What this gives you:&lt;/strong&gt; Tool calls are tied to the authenticated user, not a shared service credential. Tokens expire and auto-refresh. Revoking a user's access in your IdP automatically revokes their MCP access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this costs:&lt;/strong&gt; More setup. Your MCP server needs to implement the OAuth resource server side. Your client needs to handle the browser redirect flow. Not all MCP clients have fully implemented the November 2025 spec revision yet - check your specific client's OAuth support before depending on it.&lt;/p&gt;




&lt;h3&gt;
  
  
  PATs and VATs for service accounts
&lt;/h3&gt;

&lt;p&gt;For production agents that run without human-in-the-loop (no browser redirect possible), the pattern is Personal Access Tokens (PATs) for individual users and Virtual Account Tokens (VATs) for service accounts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PAT&lt;/strong&gt;: bound to a specific user's identity, appropriate for development workflows and user-delegated actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAT&lt;/strong&gt;: a service account credential with defined permissions, appropriate for automated agents running in production without a human user attached&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distinction matters for audit trails: PAT calls show as coming from the specific developer; VAT calls show as coming from the named service account.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup we landed on
&lt;/h2&gt;

&lt;p&gt;After the two near-misses, we didn't want to manage individual OAuth flows per server per developer. The maintenance surface was too large. What we implemented instead was a centralized gateway that handles all downstream auth for us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One token per developer, one token per service account.&lt;/strong&gt; Developers authenticate to the gateway with a single PAT. Service agents authenticate with a VAT. The gateway manages every downstream credential — GitHub OAuth tokens, Jira API keys, Confluence tokens, internal API service accounts and auto-refreshes them before they expire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC at the tool level.&lt;/strong&gt; We can say: "team type A can call &lt;code&gt;search_issues&lt;/code&gt; and &lt;code&gt;create_issue&lt;/code&gt; in Jira but not &lt;code&gt;delete_issue&lt;/code&gt;." Defined per-server, per-tool, per-role in the gateway config. The agent never sees tools it isn't authorized to call - &lt;code&gt;tools/list&lt;/code&gt; returns a filtered set based on the caller's identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OAuth 2.0 for user-delegated actions.&lt;/strong&gt; For tool calls that should act on behalf of a real user - posting to Slack as a specific person, creating a Jira ticket attributed to the right engineer - we use OAuth 2.0 with our Okta setup. The gateway handles token exchange and refresh. Agents don't manage OAuth flows directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit log for every call.&lt;/strong&gt; Every tool invocation logged: which agent, which user identity, which tool, what parameters, what response, timestamp. This was non-negotiable for our security team and it's also genuinely useful for debugging production agent failures.&lt;/p&gt;

&lt;p&gt;We implemented this using &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/mcp-overview" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt;. The Okta integration took about a day to configure. The RBAC setup took roughly a day per server to define policies properly. The time investment paid back in the first month - we had one offboarding event and MCP access was fully revoked in a single dashboard action rather than hunting down six separate credentials.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CVE worth knowing about
&lt;/h2&gt;

&lt;p&gt;In early 2026, JFrog Security Research disclosed a vulnerability in a popular MCP OAuth implementation (v0.1.16 and earlier) where the package forwarded OAuth authorization endpoint URLs to system handlers without sanitization. A malicious MCP server could construct a URL that executed arbitrary OS commands on the developer's machine.&lt;/p&gt;

&lt;p&gt;The fix shipped in v0.1.16. But the broader lesson: OAuth flows in MCP clients are relatively new and the spec is still settling (the November 2025 revision introduced Client ID Metadata Documents as the preferred registration method, replacing Dynamic Client Registration in most cases). Check your MCP client's patch level and the spec compliance version it's implementing before depending on OAuth for production workloads.&lt;/p&gt;

&lt;p&gt;What's your current MCP auth setup, and what forced you to take it seriously? The "near-miss with a write tool" pattern seems to be a common forcing function - interested to hear what others have hit. Drop it in the comments.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>MCP Governance: What It Actually Means in Production (And the Four Walls We Had to Build)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 01 Jul 2026 07:44:08 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/mcp-governance-what-it-actually-means-in-production-and-the-four-walls-we-had-to-build-2671</link>
      <guid>https://dev.to/sahajmeet_kaur_/mcp-governance-what-it-actually-means-in-production-and-the-four-walls-we-had-to-build-2671</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; MCP governance is the set of controls that determine which agents can access which tools, under which identities, with what limits, and with what audit trail. Raw MCP has none of this - it's a protocol for structured tool calls, not a policy engine. The governance layer is something teams have to build deliberately, and most don't start until something breaks. This post is the thing I wish I'd read before our first incident.&lt;/p&gt;




&lt;p&gt;When we first deployed MCP servers in production, our governance story was: don't do anything obviously stupid. No access controls beyond "you have the server URL or you don't." No audit trail beyond whatever logs the downstream API generated. No limits on which tools any agent could call.&lt;/p&gt;

&lt;p&gt;That story lasted about four months, until three things happened in quick succession:&lt;/p&gt;

&lt;p&gt;A support agent that had access to a Jira MCP server ended up being tested with a write-enabled configuration. It created 47 duplicate tickets before someone noticed.&lt;/p&gt;

&lt;p&gt;A contractor finished an engagement and we revoked their GitHub and Slack access. Three weeks later their Jira MCP token was still active because it wasn't on the offboarding checklist - MCP credentials weren't part of our standard process.&lt;/p&gt;

&lt;p&gt;A research agent pulling competitor content via a web-fetch MCP tool retrieved a page containing injected instructions that the agent executed. Nothing catastrophic, but the blast radius could have been significant with a differently-configured agent.&lt;/p&gt;

&lt;p&gt;All three of these are governance failures. None of them required sophisticated attacks. They just required an absence of controls that should have existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What MCP governance actually is
&lt;/h2&gt;

&lt;p&gt;MCP governance is the controls layer on top of the MCP protocol that determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Who&lt;/strong&gt; can connect to which MCP servers (identity + access control)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What&lt;/strong&gt; they can do once connected (tool-level RBAC)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under which limits&lt;/strong&gt; (rate limits, token budgets, workflow step caps)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With what record&lt;/strong&gt; (audit trail per invocation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With what policy enforcement&lt;/strong&gt; on content (input and output guardrails)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MCP protocol itself doesn't provide any of this. It defines how agents call tools and how tools return results. The governance is infrastructure you build on top of it or don't, and learn why you should have.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four walls
&lt;/h2&gt;

&lt;p&gt;After our three incidents, we mapped the governance problem into four distinct control areas. Each has different failure modes and different mitigations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 1: Access control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The failure mode:&lt;/strong&gt; Any agent with a server URL can call any tool. No access scoping, no identity requirement, no per-team restrictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we needed:&lt;/strong&gt; Tool-level RBAC. Not just "team A can connect to the Jira server" but "team A can call &lt;code&gt;search_issues&lt;/code&gt; and &lt;code&gt;create_issue&lt;/code&gt; but not &lt;code&gt;delete_issue&lt;/code&gt;, &lt;code&gt;delete_project&lt;/code&gt;, or &lt;code&gt;bulk_update&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;The second requirement is what most early MCP setups miss. Controlling server access at the connection level doesn't help if the server exposes both read and write tools and you want different agents to have different capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How we implemented it:&lt;/strong&gt; All MCP access routes through a central gateway. RBAC policies are defined per server, per tool, per role. Agents receive filtered tool listings - &lt;code&gt;tools/list&lt;/code&gt; returns only what they're authorized to call. They never see tools they can't use, which removes the surface area for misconfiguration entirely.&lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://www.truefoundry.com/blog/best-mcp-gateways" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; for this. The RBAC configuration took roughly a day per server to set up properly. We also use Virtual MCP Servers - curated logical endpoints that expose only the tool subset a given team persona needs, so a research agent and a customer support agent see completely different tool surfaces even if they're both authorized for the same underlying servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 2: Identity and credential governance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The failure mode:&lt;/strong&gt; Credentials are managed individually per developer per server. When someone joins, they set up credentials manually. When someone leaves, you hope someone revokes them everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we needed:&lt;/strong&gt; Centralized credential management with offboarding that propagates.&lt;/p&gt;

&lt;p&gt;The contractor credential problem from our incident is almost universal. MCP servers are typically a separate system from everything else in your onboarding/offboarding workflow. Unless you deliberately connect them, they won't be covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How we implemented it:&lt;/strong&gt; Every developer and service agent authenticates to a central gateway with a single token (PAT for humans, VAT for service agents). The gateway manages all downstream credentials — GitHub OAuth, Jira API keys, Confluence tokens, internal API service accounts and auto-refreshes them before expiry.&lt;/p&gt;

&lt;p&gt;Offboarding became one action: revoke the gateway token. Every downstream MCP server's access is cut automatically because the credentials the gateway manages are under the gateway's control, not the individual's.&lt;/p&gt;

&lt;p&gt;For tools that should act on behalf of a specific user (post to Slack as a person, not as a bot), we use OAuth 2.0 with our Okta integration. The gateway handles token exchange and refresh - agents don't manage OAuth flows directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 3: Audit trail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The failure mode:&lt;/strong&gt; Tool calls generate logs in two places - the MCP server logs and the downstream API logs — and neither one tells you &lt;em&gt;which agent or user&lt;/em&gt; triggered the call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we needed:&lt;/strong&gt; A structured audit trail per tool invocation that records: which agent, which authenticated user, which tool, what input parameters, what the response was, at what time.&lt;/p&gt;

&lt;p&gt;This is what the compliance team actually asks for. "Tool X was called 400 times last month" is not useful. "Agent Y, under service account Z, called &lt;code&gt;delete_record&lt;/code&gt; on table T at 14:23:07 with these parameters" is what makes an incident investigation recoverable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How we implemented it:&lt;/strong&gt; The gateway logs every tool call with structured metadata before routing it to the downstream server. Logs export via OpenTelemetry to our Datadog setup. Queryable. When our security team asked "which agents accessed the production data API in the last 30 days," it went from a two-day investigation to a ten-minute query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 4: Content guardrails
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The failure mode:&lt;/strong&gt; An agent retrieves content from an external system via an MCP tool. That content contains injected instructions. The agent processes it and executes the injected instructions. This is prompt injection via tool response and it's the one that caught us with the web-fetch agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we needed:&lt;/strong&gt; Post-tool-call inspection of what the tool returns before it enters the agent's context.&lt;/p&gt;

&lt;p&gt;This is different from input guardrails on LLM calls (which inspect what the user or application sends to the model). Tool response guardrails inspect what the MCP server sends back before the agent sees it. The injection happens in the tool's output, so the defense has to be at that layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How we implemented it:&lt;/strong&gt; Gateway-level post-execution guardrails that inspect MCP tool responses before returning them to the calling agent. We run PII detection and prompt injection pattern matching on tool responses. For high-risk tools (web fetch, document retrieval, external API responses), we also apply content sanitization in mutate mode — the response is modified to strip detected injections rather than just being flagged.&lt;/p&gt;

&lt;p&gt;This is the governance layer most teams skip until they've had the injection near-miss. We wish we'd built it first.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four walls in a table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;What breaks without it&lt;/th&gt;
&lt;th&gt;What implements it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any agent calls any tool, including destructive ones&lt;/td&gt;
&lt;td&gt;Tool-level RBAC, Virtual MCP Servers, filtered &lt;code&gt;tools/list&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Identity + credentials&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Offboarding misses MCP access; shared tokens have no user attribution&lt;/td&gt;
&lt;td&gt;Centralized gateway with PAT/VAT auth, IdP integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incidents aren't reconstructable; compliance questions take days to answer&lt;/td&gt;
&lt;td&gt;Structured log per tool call with user identity, exported to SIEM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt injection via tool responses executes in agent context&lt;/td&gt;
&lt;td&gt;Post-tool-call inspection before response enters agent context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What governance doesn't require
&lt;/h2&gt;

&lt;p&gt;A few things I'd push back on from early conversations about MCP governance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You don't need to solve all four walls at once.&lt;/strong&gt; Wall 1 (access control) has the highest return on investment because it prevents the most common class of incident - over-permissioned agents. Start there. Wall 4 (content guardrails) is the most technically involved and matters most when agents are retrieving external, untrusted content. If your agents only call internal APIs with trusted responses, Wall 4 can wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You don't need custom tooling for every server.&lt;/strong&gt; The pattern that scales is a central gateway that handles governance for all servers, rather than baking governance into each server individually. One config to update when policy changes, one place to check when something goes wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance doesn't have to slow agents down.&lt;/strong&gt; Auth checks, RBAC evaluation, and audit log writes can all happen in-memory on the hot path with async log flushing. TrueFoundry's gateway adds sub-3ms latency under load. If governance is adding seconds per tool call, something is wrong with the architecture, not with governance in principle.&lt;/p&gt;




&lt;p&gt;What governance controls did you build first and which incident made you prioritize it? The "agent called a tool it shouldn't have" pattern seems to be the most common forcing function. Drop it in the comments.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Is an MCP Registry? (And the NxM Problem It Solves)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 01 Jul 2026 07:35:26 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-mcp-registry-and-the-nxm-problem-it-solves-4ogm</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-mcp-registry-and-the-nxm-problem-it-solves-4ogm</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; An MCP registry is a centralized catalog of MCP servers - what they do, how to connect, what tools they expose, and who's authorized to call them. Without one, every developer and agent maintains their own copy of that information, credential rotation breaks everything, and access control is nonexistent. With one, tool discovery is dynamic, credentials are managed centrally, and RBAC governs what each agent can see and call. This post covers what it is, how it works, and the specific failure modes it prevents.&lt;/p&gt;




&lt;p&gt;There's a pattern I've seen play out in almost every team that scales past a handful of MCP servers.&lt;/p&gt;

&lt;p&gt;The first few connections feel like magic. You run &lt;code&gt;npx @modelcontextprotocol/server-github&lt;/code&gt; in a terminal, point Claude Code at it, and your agent can suddenly search repositories. Clean, fast, no glue code. Then you add Jira. Then Confluence. Then a Slack MCP server and an internal data API. Now you have six servers and twelve developers, and each developer has their own &lt;code&gt;~/.cursor/mcp.json&lt;/code&gt; or &lt;code&gt;.claude/settings.json&lt;/code&gt; with hardcoded connection details and credentials.&lt;/p&gt;

&lt;p&gt;When a server endpoint changes, you update twelve config files manually. When a credential rotates, you hunt down every developer who cached it. When a new engineer joins, you spend half a day walking them through which servers exist and how to connect. When someone leaves, you hope you remembered to revoke their individual tokens on each of the six systems.&lt;/p&gt;

&lt;p&gt;That's the distributed sticky-note problem. An MCP registry is the infrastructure that replaces it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an MCP registry is
&lt;/h2&gt;

&lt;p&gt;An MCP registry is a centralized catalog that stores metadata about every MCP server in your organization - what it does, how to connect to it, what tools it exposes, what auth method it uses, and who is authorized to access it.&lt;/p&gt;

&lt;p&gt;The analogy that made it click for me: it's to MCP servers what a DNS server is to IP addresses, or what a service registry is to microservices. Instead of hardcoding addresses everywhere, you have one authoritative place that knows where everything is and how to reach it. Components look things up instead of having them embedded.&lt;/p&gt;

&lt;p&gt;At minimum, a registry entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server identity&lt;/strong&gt; - name, description, owner team, approval status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection details&lt;/strong&gt; - endpoint URL, transport type (stdio vs Streamable HTTP vs SSE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth metadata&lt;/strong&gt; - what authentication the server requires and how to obtain credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool schema&lt;/strong&gt; - what tools the server exposes, what parameters each accepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access policy&lt;/strong&gt; - which users, teams, or agents are authorized to connect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The separation between connection details and auth metadata is what makes credential rotation cheap. When a GitHub OAuth token rotates, you update one record in the registry. Every agent and developer connecting through the registry picks up the new credentials automatically — no config file hunting required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The N×M problem
&lt;/h2&gt;

&lt;p&gt;The problem a registry solves has a name in distributed systems: the N×M integration problem.&lt;/p&gt;

&lt;p&gt;If you have N agents and M MCP servers and you connect them directly, you have N×M integration points. Each one needs its own connection config, its own credentials, its own error handling, its own version of "what happens when the server moves."&lt;/p&gt;

&lt;p&gt;With 3 agents and 3 servers that's 9 integration points. With 8 agents and 6 servers it's 48. At 240 engineers and 8 servers - the kind of scale where this becomes a real operational problem - you're maintaining roughly 1,900 config entries by hand.&lt;/p&gt;

&lt;p&gt;A registry collapses this. Each server is registered once. Each agent connects to the registry endpoint, declares what it needs, and the registry routes to the right server with the right credentials. N+M integration points instead of N×M.&lt;/p&gt;

&lt;p&gt;The secondary benefit: when a server moves (new URL, new cluster, new transport), you update one registry entry. Nothing downstream breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a registry is not
&lt;/h2&gt;

&lt;p&gt;Worth being explicit because "MCP registry" means different things in different contexts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The public MCP registry&lt;/strong&gt; (the one at &lt;code&gt;registry.npmmcp.com&lt;/code&gt; or similar) is a discovery catalog — a searchable list of publicly available MCP servers that developers can browse to find tools. It's like an app store listing page. It doesn't authenticate connections, enforce access control, or provide an audit trail. It's a great place to find servers. It's not production infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An enterprise MCP registry&lt;/strong&gt; is what this post is about: the privately operated catalog that governs how your specific agents connect to your specific servers, with auth and RBAC and audit logging. Same concept (centralized metadata), different operating context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An MCP registry is not a proxy.&lt;/strong&gt; A proxy handles transport. A registry handles metadata and policy. In practice, the two are often deployed together — the registry knows about servers, the proxy actually routes traffic to them — but they're distinct components.&lt;/p&gt;




&lt;h2&gt;
  
  
  The governance gap in raw MCP
&lt;/h2&gt;

&lt;p&gt;Raw MCP has no built-in access control. Any agent with a server's connection URL can call any of its tools. A junior developer's agent and a senior engineer's agent have identical access. An over-permissioned sub-agent has the same blast radius as the service account that spawned it.&lt;/p&gt;

&lt;p&gt;The governance layer a registry adds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC at the tool level.&lt;/strong&gt; Not just "team A can access the Jira server" but "team A can call &lt;code&gt;search_issues&lt;/code&gt; and &lt;code&gt;create_issue&lt;/code&gt; but not &lt;code&gt;delete_issue&lt;/code&gt;." The access policy is defined in the registry and enforced before the tool call reaches the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-tool-call visibility filtering.&lt;/strong&gt; Agents only see the tools they're authorized to use in the first place. An agent browsing available tools via &lt;code&gt;tools/list&lt;/code&gt; gets back a filtered list based on its identity — not the full server capability set. This matters: an agent that doesn't know a destructive tool exists can't accidentally call it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized credential management.&lt;/strong&gt; Users authenticate once to the registry. The registry manages downstream credentials — OAuth tokens for GitHub, API keys for Jira, service account tokens for internal APIs — and refreshes them automatically. Offboarding is one action: revoke the user's registry access. Every downstream system is covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail per tool invocation.&lt;/strong&gt; Every tool call logged: which agent, which user identity, which tool, what parameters, what the response was, at what time. When a security team asks "which agents accessed the production data API last month," it's a query rather than two days of log archaeology.&lt;/p&gt;




&lt;h2&gt;
  
  
  Virtual MCP Servers: the access-scoping primitive
&lt;/h2&gt;

&lt;p&gt;One concept worth understanding because it's genuinely useful: Virtual MCP Servers.&lt;/p&gt;

&lt;p&gt;A virtual MCP server is a curated logical endpoint that exposes a subset of tools from one or more real MCP servers. Instead of giving an agent access to an entire Jira MCP server with all its read and write tools, you define a virtual server called "finance-readonly" that exposes only the Jira tools the finance workflow needs — and nothing else.&lt;/p&gt;

&lt;p&gt;The agent points at the virtual server endpoint. It sees a limited, purpose-appropriate tool surface. The actual routing to the underlying server happens inside the registry/gateway layer, invisible to the agent.&lt;/p&gt;

&lt;p&gt;This is the pattern that solves the "over-permissioned agent" problem without requiring every agent to implement its own access filtering. The policy lives in the platform, not in the agent code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we use and where TrueFoundry fits
&lt;/h2&gt;

&lt;p&gt;After hitting the credential sprawl problem and a compliance audit that asked for tool call logs we couldn't produce cleanly, we evaluated a few registry options and ended up on &lt;a href="https://www.truefoundry.com/blog/mcp-registry-and-ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Registry and Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The specific things that mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified credential management.&lt;/strong&gt; Developers get a single Personal Access Token for the registry. The registry handles OAuth for GitHub, Jira, Confluence, and our internal APIs — auto-refreshing tokens before they expire. Offboarding became one action instead of six.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC with identity provider integration.&lt;/strong&gt; Access policies connect to our existing Okta setup. When we add a user to the "data-engineering" team in Okta, they automatically get the MCP access profile for that team. We stopped maintaining a parallel access list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual MCP Servers.&lt;/strong&gt; We define one virtual server per team persona. The security team sees Sentry and Datadog tools. Product engineers see GitHub and Jira. Neither sees the other's tool surface, and neither sees the internal data API unless they're explicitly provisioned for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full audit trail.&lt;/strong&gt; Every tool invocation logged with structured metadata. When the compliance question came — "which agents accessed the production data API in the last 30 days" — it took ten minutes to answer.&lt;/p&gt;

&lt;p&gt;Tradeoff worth naming: TrueFoundry is Kubernetes-native, so there's real setup overhead if you're not already on K8s. For a team with two or three servers and five developers, a shared config file is fine. The registry infrastructure pays off when you cross roughly ten servers or twenty developers or when a compliance requirement forces the audit trail question.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you actually need one
&lt;/h2&gt;

&lt;p&gt;The honest signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can't answer "what MCP servers do we have running right now" without asking multiple people&lt;/li&gt;
&lt;li&gt;A server endpoint change means updating configs on multiple developer machines&lt;/li&gt;
&lt;li&gt;A credential rotation cascades into a half-day of finding and updating every place it's referenced&lt;/li&gt;
&lt;li&gt;A developer left and you're not fully certain their MCP access was revoked everywhere&lt;/li&gt;
&lt;li&gt;A compliance or security team has asked for a log of which agents used which tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If two or more of those describe your situation, the registry conversation is overdue. The distributed sticky-note problem scales as O(developers × servers) - it gets expensive fast.&lt;/p&gt;

&lt;p&gt;What does your current MCP server management setup look like - are you still on individual config files per developer, or has something pushed you toward a central registry? Would like to hear how teams are handling credential rotation across many servers without a centralized credential store. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>mcp</category>
    </item>
    <item>
      <title>What Is an Agent Registry? (And What We Broke Before We Had One)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Sat, 27 Jun 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-agent-registry-and-what-we-broke-before-we-had-one-37jn</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-agent-registry-and-what-we-broke-before-we-had-one-37jn</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI agent registry is a centralized catalog of every agent in your organization — what each agent does, what tools it can access, what version is running, who owns it, and how to call it&lt;/li&gt;
&lt;li&gt;It's to agents what a container registry is to Docker images or what a service mesh is to microservices — the layer that makes distributed components governable&lt;/li&gt;
&lt;li&gt;We hit the "which agents do we have?" wall at 14 agents across 3 teams. That's when the registry stopped being a nice-to-have&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;About four months into our agentic AI buildout, our head of security asked a question I couldn't answer: "Can you give me a list of every AI agent running in production, what systems they have access to, and what version of each is currently deployed?"&lt;/p&gt;

&lt;p&gt;I had a rough mental model. I knew about the agents my team had built. I had a vague idea of what the data engineering team had shipped. The product team had recently added two agents I'd heard about secondhand.&lt;/p&gt;

&lt;p&gt;I spent the better part of a day pulling together a spreadsheet. By the time I finished, one of the agents I'd listed had already been replaced by a newer version. Two of them had been granted access to an internal API I hadn't known about.&lt;/p&gt;

&lt;p&gt;The spreadsheet was outdated before I sent it.&lt;/p&gt;

&lt;p&gt;That was our forcing function for building a proper agent registry. This post is what I wish I'd read before that conversation happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an agent registry is
&lt;/h2&gt;

&lt;p&gt;An agent registry is a centralized catalog of AI agents — a single source of truth that tracks every agent deployed in your organization, its capabilities, its integrations, its ownership, and its current state.&lt;/p&gt;

&lt;p&gt;The analogy that landed for me: it's to agents what a container registry (Docker Hub, ECR, GCR) is to container images. When you have three containers running, you don't need a registry — you know what you have. When you have 40 containers across six teams, you need a registry to know what's running, who owns it, what version is deployed, and what depends on what.&lt;/p&gt;

&lt;p&gt;Agents are the same. At two or three agents, a shared Notion doc is sufficient. At 14 agents across three teams, you need infrastructure that tracks state, not a doc that someone last edited last month.&lt;/p&gt;

&lt;p&gt;A registry stores metadata for each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity and ownership&lt;/strong&gt; — which team built it, who's the current owner, what's the canonical name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt; — what the agent can do, expressed as a standard interface (increasingly via the Model Context Protocol, so other agents can discover and call it without custom integration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool and model access&lt;/strong&gt; — which MCP servers it's authorized to use, which models it can call, what permissions it holds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version and deployment state&lt;/strong&gt; — which version is currently in production, what changed, when it was last updated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability metadata&lt;/strong&gt; — success rate, latency, last error, evaluation scores if you're running evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access policy&lt;/strong&gt; — which other agents or services are authorized to call this agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last one is what distinguishes a registry from a spreadsheet: it's not just a catalog, it's the enforcement point for agent-to-agent communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  What goes wrong without one
&lt;/h2&gt;

&lt;p&gt;We ran without a registry for longer than we should have. Here's what actually broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow agents.&lt;/strong&gt; Three separate teams had independently built agents that called our internal data API. None of them knew about the others. When we introduced rate limits on that API, two of the agents started failing intermittently — and we spent a week debugging what we thought was a data API problem before realizing the actual problem was three agents competing for quota we'd only budgeted for one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version confusion at 2am.&lt;/strong&gt; An agent went into production with a bug. We rolled back. The rollback was applied to one environment but not the other. For six hours, our staging environment had the fixed version and production had the broken one, because there was no single source of truth for which version was where. The incident took longer to resolve than it should have because different team members were looking at different version references.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The offboarding gap.&lt;/strong&gt; When an engineer left the team, we revoked their credentials for the systems we knew about. Three weeks later, a contractor reported that an internal Jira webhook was still firing from an agent they'd built. The agent had been registered nowhere. It was running on a piece of infrastructure they'd stood up themselves, using credentials that hadn't been included in the offboarding checklist because nobody knew the agent existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M×N integration hell.&lt;/strong&gt; Each new agent that needed to call tools had to build its own integration with each tool. Eight agents, six tools: 48 potential integration points, each with its own credential management, error handling, and retry logic. When a tool API changed, we had to find and update every agent that used it manually.&lt;/p&gt;

&lt;p&gt;The registry fixes all four of these. Shadow agents can't exist if registration is a prerequisite for deployment. Version state is tracked centrally. Offboarding is "revoke this agent's access in the registry." M×N integrations collapse to each tool being registered once, each agent pointing to the registry.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a registry is not
&lt;/h2&gt;

&lt;p&gt;Worth being explicit, because I conflated some things early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not a deployment platform.&lt;/strong&gt; The registry tracks what's running, but it doesn't run the agents. Deployment is a separate concern — Kubernetes, a container orchestrator, whatever your team uses. The registry is the catalog; deployment is the execution layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not an orchestration framework.&lt;/strong&gt; LangGraph, CrewAI, AutoGen — those handle how agents coordinate with each other. The registry handles what agents exist and whether they're authorized to talk to each other at all. These are complementary, not competing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not an MCP server list.&lt;/strong&gt; An MCP server registry catalogs available tools. An agent registry catalogs available agents. Both are useful. Both are needed. TrueFoundry calls the combination of the two a unified MCP and Agents Registry — one place where you can see both the tools agents can use and the agents themselves. That unification matters because the governance question is really "which agents can call which tools" — you need both catalogs to answer it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not just a spreadsheet.&lt;/strong&gt; The spreadsheet version of an agent catalog is a snapshot. A proper registry is stateful — it connects to your observability layer and shows live performance, not last-week's-update performance. When TrueFoundry's registry shows you an agent's success rate, it's pulling from real-time telemetry, not a manually updated field.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture pattern that makes it work
&lt;/h2&gt;

&lt;p&gt;The pattern that made everything cleaner: every agent registers with the gateway using the Model Context Protocol. Once registered, the agent looks like a standard MCP endpoint to every other agent in the system. A LangGraph agent and a CrewAI agent and a custom HTTP service all appear as the same kind of thing to the orchestrator — they're all just callable endpoints with a defined schema.&lt;/p&gt;

&lt;p&gt;This is what solves the M×N problem architecturally. Each tool is registered once. Each agent is registered once. The registry maps which agents can call which tools. Agents don't need to know how to integrate with Jira or Slack or your internal data API directly — they call the registry endpoint, and the registry handles routing, credentials, and access control.&lt;/p&gt;

&lt;p&gt;The other pattern that mattered: the registry as the access control enforcement point. Before this, access control for agent-to-agent calls lived in application code — each agent decided for itself whether to accept a call. That's as reliable as it sounds. Moving access control to the registry layer means it's enforced centrally, consistently, and not dependent on each individual agent implementation being correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we ended up using
&lt;/h2&gt;

&lt;p&gt;After the security audit incident, we evaluated a few options and landed on &lt;a href="https://www.truefoundry.com/blog/ai-agent-registry" rel="noopener noreferrer"&gt;TrueFoundry's Agent Registry&lt;/a&gt;. I can explain specifically what mattered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified agent and MCP catalog.&lt;/strong&gt; Every agent and every tool visible in one place. When the security team asks "which agents have access to the internal data API," the answer is a query, not a two-day investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework-agnostic registration.&lt;/strong&gt; We have agents on LangGraph, one on CrewAI, and two custom HTTP services. The registry handles all of them through a standard registration interface. Once registered, governance policies apply regardless of what framework built the agent — the same RBAC rules, the same audit trail, the same access policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live performance tracking.&lt;/strong&gt; The registry shows each agent's success rate, average latency, and last error pulled from the observability layer. We set a routing rule: for production code changes, only route to agents with &amp;gt;90% success rate on the latest eval run. The registry enforces this automatically rather than requiring a human to check before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A communication via MCP.&lt;/strong&gt; When an agent needs to call another agent, it goes through the registry. The registry checks whether the calling agent is authorized to invoke the target agent, handles the call, and logs the interaction with both agent identities. The over-privileged sub-agent problem — where a spawned agent inherits more permissions than it should — is closed at the registry layer.&lt;/p&gt;

&lt;p&gt;The tradeoff: TrueFoundry is Kubernetes-native, so there's real infrastructure investment if you're not already on K8s. For a team of 5 with 3 agents, a YAML file is probably enough. The inflection point for us was around 10 agents across multiple teams with compliance requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you actually need one
&lt;/h2&gt;

&lt;p&gt;The honest answer: you need a registry before you think you do, and you'll know you needed it earlier after you don't have one.&lt;/p&gt;

&lt;p&gt;Some concrete signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can't answer "which agents do we have in production" without asking multiple people&lt;/li&gt;
&lt;li&gt;A team deploys an agent and you find out about it from a runaway cost alert rather than a check-in&lt;/li&gt;
&lt;li&gt;An engineer leaves and you realize you don't know what credentials their agents were using&lt;/li&gt;
&lt;li&gt;Two teams built agents that do similar things because neither knew the other existed&lt;/li&gt;
&lt;li&gt;You want to introduce rate limits or access controls on an internal system and don't know how many agents are calling it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If any of those describe your situation, the registry conversation is overdue. If none of them do yet, you're probably still small enough that the overhead isn't justified.
&lt;/h2&gt;

&lt;p&gt;What pushed you toward building or adopting a registry — and what does your current agent catalog look like? Curious whether most teams are still on the spreadsheet version or if the registry infrastructure has actually caught up to the agent deployment pace. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
    <item>
      <title>LiteLLM vs OpenRouter: I Used Both. Here's Where Each One Actually Broke.</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Fri, 26 Jun 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/litellm-vs-openrouter-i-used-both-heres-where-each-one-actually-broke-53gb</link>
      <guid>https://dev.to/sahajmeet_kaur_/litellm-vs-openrouter-i-used-both-heres-where-each-one-actually-broke-53gb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM and OpenRouter are not competing products - LiteLLM is a self-hosted open-source proxy you run yourself, OpenRouter is a managed cloud aggregator. The comparison only makes sense if you understand which problem you're actually trying to solve&lt;/li&gt;
&lt;li&gt;LiteLLM's ceiling: SSO and team-level budget enforcement are behind the enterprise license, Redis dependency for distributed rate limiting has a failure mode worth knowing about, YAML config gets unwieldy at scale&lt;/li&gt;
&lt;li&gt;OpenRouter's ceiling: everything lives in OpenRouter's infrastructure, no self-hosted models, no team-level governance, a 5.5% credit purchase fee that compounds at high volume&lt;/li&gt;
&lt;li&gt;Where we landed: neither was the right long-term answer for our setup - this post explains why&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;When I started evaluating LLM routing options about a year ago, most of the "LiteLLM vs OpenRouter" content I found was comparing features in a matrix and calling it a day. It wasn't that useful because it missed the more important question: these tools have fundamentally different architectures, different deployment models, and different ceilings. Picking between them is less "which has more features" and more "which problem are you actually trying to solve right now."&lt;/p&gt;

&lt;p&gt;I ran LiteLLM in staging for about six weeks and used OpenRouter for a parallel workload. Here's what I actually found.&lt;/p&gt;




&lt;h2&gt;
  
  
  What each tool is (the architecture distinction that matters)
&lt;/h2&gt;

&lt;p&gt;Before any feature comparison: LiteLLM and OpenRouter are not the same category of thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is an open-source Python library and proxy server you host yourself. It gives you a unified, OpenAI-compatible API in front of 100+ model providers. You pip install it, run it as a Docker container, and it lives in your infrastructure. You own the uptime, the scaling, and the configuration. The Anthropic and OpenAI credentials live in your environment. Nothing leaves your network unless you tell it to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt; is a managed cloud service. You create an account, buy credits, and point your OpenAI SDK at &lt;code&gt;https://openrouter.ai/api/v1&lt;/code&gt; with an OpenRouter API key. You don't run anything. The model request goes through OpenRouter's infrastructure, which routes to whichever provider serves that model. Their business model is a 5.5% fee on credit purchases, with provider token rates passed through without markup.&lt;/p&gt;

&lt;p&gt;The practical implication: if you need your prompts to stay inside your infrastructure, OpenRouter is immediately off the table. If you want zero infrastructure overhead and just want to access 200+ models through one API key in the next ten minutes, LiteLLM has a steeper setup curve than OpenRouter.&lt;/p&gt;

&lt;p&gt;Once you understand that distinction, the comparison becomes a lot cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  LiteLLM: where it's genuinely good and where it breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What works well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Provider coverage and SDK compatibility.&lt;/strong&gt; LiteLLM supports 100+ providers - OpenAI, Anthropic, AWS Bedrock, Google Vertex, Mistral, Groq, Cohere, Together AI, Ollama, and more through a single OpenAI-compatible format. You write standard OpenAI SDK code once, and routing to a different provider is a model string change. For teams with self-hosted models, this is particularly useful because LiteLLM routes to your own endpoints with the same interface as cloud providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load balancing across deployments.&lt;/strong&gt; You can define multiple deployments of the same model across providers or regions, and LiteLLM load-balances across them with configurable strategies: simple-shuffle, least-busy, latency-based, cost-based. This is the right level of control for teams managing both cloud and self-hosted infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual keys with per-key budgets.&lt;/strong&gt; Each virtual key can have its own budget and rate limit. For a small team where one engineer owns the gateway config, this is enough. You issue a key per service, set a budget, done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it breaks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;YAML at scale.&lt;/strong&gt; LiteLLM config is YAML. For a solo engineer with three models, it's fine. For a platform team managing 40 engineers across four squads with different model access requirements, it becomes a coordination problem. Every time a squad needs a new model routing rule, someone has to edit the same YAML file, test the change, and redeploy. We had two merge conflicts in one week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSO is Enterprise only.&lt;/strong&gt; We needed Okta. That's behind the enterprise license. The open-source version doesn't support corporate SSO. For most organizations past a certain size, this is a hard requirement, not a preference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Redis dependency.&lt;/strong&gt; Distributed rate limiting in LiteLLM requires Redis. This is fine in normal operation. The edge case: if Redis has an availability issue, LiteLLM's rate limiting can fail open - requests go through with no limits enforced. In a runaway job scenario, this means your safety net disappears at exactly the wrong moment. We tested this. It behaved as documented, which means the behavior is intentional but it's worth understanding before you depend on it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team-level budget enforcement.&lt;/strong&gt; Per-key budgets work. Per-team budgets that span multiple keys with a shared ceiling — the kind of thing a platform team needs to charge back spend to different business units - require more config work and, the enterprise tier handles this cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Solo engineers and small teams prototyping self-hosted model access. MIT license, zero vendor relationship, full infrastructure control. The SSO and governance features are there if you pay for the enterprise tier - budget for that if you're running more than 10 engineers through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenRouter: where it's genuinely good and where it breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What works well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero setup to first request.&lt;/strong&gt; Create account, buy credits, change base URL. That's it. No infrastructure to run, no container to maintain, no YAML to write. For rapid prototyping or a hackathon, this is the right level of effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model breadth.&lt;/strong&gt; 300+ models accessible through one API key. Including models that would otherwise require separate API accounts with separate providers — Mistral, Nous, Perplexity, and others available through OpenRouter before they had easy direct API access. For experimentation across frontier models, this is genuinely useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent routing options.&lt;/strong&gt; OpenRouter's routing suffixes are a nice abstraction: &lt;code&gt;:nitro&lt;/code&gt; routes to highest-throughput provider, &lt;code&gt;:floor&lt;/code&gt; routes to cheapest, &lt;code&gt;:online&lt;/code&gt; injects web search results. You can also pass a &lt;code&gt;models&lt;/code&gt; array with fallback priority. For teams that don't want to think about provider selection, the defaults work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified billing.&lt;/strong&gt; One invoice, one credit balance, across every provider you're using. For teams where multi-provider accounting is a headache, this is real simplification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it breaks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Everything lives in OpenRouter's infrastructure.&lt;/strong&gt; Your prompts, your responses, your API keys - all pass through OpenRouter's systems. For teams with data residency requirements, regulated workloads, or compliance obligations that specify where inference data can travel, this is a hard blocker. There's no self-hosted option and no VPC deployment path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 5.5% credit fee compounds.&lt;/strong&gt; OpenRouter charges 5.5% on credit purchases. Provider token rates pass through without markup. On low volumes, this is fine. At $50k/month in inference spend, you're paying $2,750/month to OpenRouter in platform fees on top of model costs. At $200k/month, it's $11,000/month. The math is worth doing before you commit to this as your production routing layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No team-level governance.&lt;/strong&gt; OpenRouter doesn't have a concept of "team A can only use these models" or "developer X has a $500/month cap." Access control is per API key. Budget management is at the account level. For a solo developer this is fine. For a platform team managing 40 engineers with different access requirements, you're building governance on top of OpenRouter rather than getting it from OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No self-hosted model support.&lt;/strong&gt; If you're running a fine-tuned model on your own infrastructure, OpenRouter can't route to it. Your routing split between OpenRouter (for cloud providers) and some other system (for your own models) means split observability, split cost tracking, and split governance. We had this problem and it was worse than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Individual developers and small teams who want fast access to many models with zero infrastructure. Also genuinely useful as the cloud-provider routing layer for teams that pair it with a self-hosted solution for their own models - though that means managing two systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Head-to-head on the things that matter in production
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;OpenRouter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (Docker, pip)&lt;/td&gt;
&lt;td&gt;Managed cloud only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data residency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your infrastructure&lt;/td&gt;
&lt;td&gt;OpenRouter's infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100+ (incl. self-hosted)&lt;/td&gt;
&lt;td&gt;300+ (cloud only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted model support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSO / OKTA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise license&lt;/td&gt;
&lt;td&gt;Enterprise tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-team budget caps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited without Enterprise&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis-backed (fail-open risk)&lt;/td&gt;
&lt;td&gt;Managed (their infra)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ (Redis)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic hooks&lt;/td&gt;
&lt;td&gt;Not native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance certs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source + Enterprise license&lt;/td&gt;
&lt;td&gt;5.5% credit purchase fee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP / agent support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML file&lt;/td&gt;
&lt;td&gt;Dashboard + API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Good for prototyping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅ (easier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Good for 40+ engineers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;With Enterprise license&lt;/td&gt;
&lt;td&gt;With governance workarounds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Where we went after hitting both ceilings
&lt;/h2&gt;

&lt;p&gt;We ran LiteLLM for about six weeks. The YAML config problem was manageable. The SSO requirement wasn't - we needed Okta and weren't going to pay the enterprise license for a gateway that still had the Redis failure-open edge case and no native self-hosted model observability.&lt;/p&gt;

&lt;p&gt;We used OpenRouter for a parallel data enrichment workload during the same period. It was excellent for the first two months. Then the workload scaled, the data residency question came from legal, and the 5.5% fee at our run rate became a real number on a real spreadsheet.&lt;/p&gt;

&lt;p&gt;Neither tool was wrong. Both were right for earlier stages of what we were building. The problem was that we'd outgrown the ceiling of both at roughly the same time.&lt;/p&gt;

&lt;p&gt;We ended up on &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt;. The specific things that mattered for our situation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory rate limiting, no Redis dependency.&lt;/strong&gt; Auth, budget checks, and rate limits all happen in-memory in the gateway process - no external dependency in the hot path, no failure-open edge case under Redis load. The benchmarks show ~3–4ms added latency at 350+ RPS on a single vCPU, which matched our own testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full VPC deployment.&lt;/strong&gt; Everything runs inside our Kubernetes cluster. No inference data, no control plane traffic leaves our infrastructure. This answered the legal/compliance question cleanly - no carve-outs, no "the dashboard is SaaS but the inference is on-prem" nuance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted and cloud models unified.&lt;/strong&gt; Our Llama deployment and our OpenAI and Anthropic traffic go through the same gateway endpoint. Same cost attribution dashboard, same rate limiting, same audit trail. No split observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-team budgets enforced on the hot path.&lt;/strong&gt; When a team hits their token budget, subsequent requests return rate-limit errors before spend accumulates. The enforcement happens before the API call, not as an alert after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSO out of the box.&lt;/strong&gt; Okta via SAML, no enterprise license gating.&lt;/p&gt;

&lt;p&gt;The tradeoff: If you're a two-person team shipping fast, LiteLLM or OpenRouter will get you further faster. The decision point for us was when compliance requirements and multi-team governance became real - that's when the infrastructure investment in a proper gateway started paying off.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to pick between them for your situation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use LiteLLM if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want full infrastructure control and MIT-licensed open source&lt;/li&gt;
&lt;li&gt;You have self-hosted models that need to route through the same system as your cloud providers&lt;/li&gt;
&lt;li&gt;You're comfortable managing YAML config and owning the gateway's uptime&lt;/li&gt;
&lt;li&gt;You can absorb the enterprise license cost when you need SSO and team governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use OpenRouter if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want zero infrastructure to manage and the fastest path to first request&lt;/li&gt;
&lt;li&gt;You need access to many models, including newer ones from smaller providers&lt;/li&gt;
&lt;li&gt;Your workload doesn't have data residency or compliance requirements&lt;/li&gt;
&lt;li&gt;You're fine with account-level billing and don't need per-team governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider moving beyond both when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal or compliance asks where your inference data lives and "OpenRouter's servers" isn't acceptable&lt;/li&gt;
&lt;li&gt;You have self-hosted models that need the same governance as your cloud provider traffic&lt;/li&gt;
&lt;li&gt;Multiple teams need separate budget caps enforced before they spend, not after&lt;/li&gt;
&lt;li&gt;The Redis failure-open scenario is a real risk for your rate limiting SLA&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What pushed you toward LiteLLM or OpenRouter — and what made you stay or leave? Has anyone found a clean way to unify governance across both (self-hosted via LiteLLM + cloud via OpenRouter) without running two separate observability stacks. Drop it in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>AI Gateway vs API Gateway: They Solve Different Problems</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Thu, 25 Jun 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/ai-gateway-vs-api-gateway-they-solve-different-problems-we-confused-them-for-six-months-56fe</link>
      <guid>https://dev.to/sahajmeet_kaur_/ai-gateway-vs-api-gateway-they-solve-different-problems-we-confused-them-for-six-months-56fe</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; An API gateway manages HTTP traffic between services - auth, routing, rate limiting, load balancing for REST and gRPC. An AI gateway manages LLM workloads — token-based rate limiting, model routing, cost attribution, semantic caching, guardrails. Use an API gateway for your microservices. Use an AI gateway for your LLM traffic. Most production teams eventually need both, sitting at different layers. This post walks through exactly where each one fits.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When we started adding LLM features to our platform, we already had Kong running for our microservices. The instinct was natural: route the LLM traffic through Kong too. Same auth, same rate limiting, same observability stack. One gateway to rule them all.&lt;/p&gt;

&lt;p&gt;It worked for about six months, and only in the sense that requests got through. What it didn't give us was anything useful for actually managing AI workloads. We had no idea what each team was spending on tokens. We had no way to set a budget cap that would fire before the bill arrived. Our rate limits were based on requests per minute, which meant a single request with a 50k token prompt counted the same as one with a 200 token prompt. And when OpenAI had a partial outage, Kong had no concept of "try Anthropic instead" - we just served errors.&lt;/p&gt;

&lt;p&gt;None of that is a criticism of Kong. It's doing exactly what it was designed to do. The problem was us expecting an API gateway to handle a fundamentally different category of infrastructure problem.&lt;/p&gt;

&lt;p&gt;Here's the precise distinction, and why it matters architecturally.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an API gateway actually does
&lt;/h2&gt;

&lt;p&gt;An API gateway is a reverse proxy that sits between client applications and backend services. It handles the cross-cutting concerns of service-to-service HTTP communication: authentication, authorization, rate limiting, load balancing, SSL termination, request transformation, and routing based on URL paths or headers.&lt;/p&gt;

&lt;p&gt;A typical request flow through an API gateway:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends a request to the gateway endpoint&lt;/li&gt;
&lt;li&gt;Gateway verifies authentication (API key, JWT, OAuth token)&lt;/li&gt;
&lt;li&gt;Gateway applies rate limiting - is this client over their request quota?&lt;/li&gt;
&lt;li&gt;Gateway routes to the appropriate upstream service based on the path&lt;/li&gt;
&lt;li&gt;Gateway applies any request/response transformations&lt;/li&gt;
&lt;li&gt;Response returns through the gateway back to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unit of everything here is the &lt;strong&gt;HTTP request&lt;/strong&gt;. Rate limits are requests per second or per minute. Cost is measured in compute, not in content. The gateway doesn't know or care what's &lt;em&gt;inside&lt;/em&gt; the request body — it's routing bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An API gateway works best when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have multiple backend microservices that all need consistent auth and rate limiting&lt;/li&gt;
&lt;li&gt;You want a single ingress point for all your REST or gRPC traffic&lt;/li&gt;
&lt;li&gt;You need request/response transformation or protocol translation&lt;/li&gt;
&lt;li&gt;You're enforcing service mesh policies across a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Your traffic is synchronous and stateless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kong, NGINX, AWS API Gateway, Apigee - these are all pretty good at this. They've been hardened in production for over a decade.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an AI gateway actually does
&lt;/h2&gt;

&lt;p&gt;An AI gateway is middleware that sits between your application code and your LLM providers. It handles the cross-cutting concerns of LLM traffic: model routing, token-based rate limiting, cost attribution, semantic caching, fallback chains, guardrails, and prompt/response logging.&lt;/p&gt;

&lt;p&gt;A typical request flow through an AI gateway:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application sends a prompt to the gateway's OpenAI-compatible endpoint&lt;/li&gt;
&lt;li&gt;Gateway authenticates the request and checks the team/user's token budget&lt;/li&gt;
&lt;li&gt;Gateway applies token-based rate limiting - is this team over their token quota?&lt;/li&gt;
&lt;li&gt;Gateway routes to the appropriate model based on your rules (cost, latency, fallback)&lt;/li&gt;
&lt;li&gt;Gateway applies input guardrails (PII detection, prompt injection checks)&lt;/li&gt;
&lt;li&gt;Request goes to the LLM provider&lt;/li&gt;
&lt;li&gt;Gateway applies output guardrails on the response&lt;/li&gt;
&lt;li&gt;Gateway logs the full request with token count, cost, latency, user attribution&lt;/li&gt;
&lt;li&gt;Response returns to the application&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unit here is the &lt;strong&gt;token&lt;/strong&gt;, not the HTTP request. A single HTTP request might contain 50k tokens and cost $0.75. Another might contain 200 tokens and cost $0.003. Request-level rate limiting treats these identically. Token-level rate limiting doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI gateway works best when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're routing traffic to one or more LLM providers and need unified observability&lt;/li&gt;
&lt;li&gt;Multiple teams or applications are hitting LLMs and you need cost attribution&lt;/li&gt;
&lt;li&gt;A model going down should trigger automatic failover, not user-facing errors&lt;/li&gt;
&lt;li&gt;You need to enforce spending limits before bills arrive, not after&lt;/li&gt;
&lt;li&gt;Guardrails on prompts or responses are a compliance or safety requirement&lt;/li&gt;
&lt;li&gt;You're deploying agents that call tools and need governed MCP access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The core difference, precisely stated
&lt;/h2&gt;

&lt;p&gt;Both gateways manage traffic. The distinction is what &lt;em&gt;dimension&lt;/em&gt; of that traffic they understand.&lt;/p&gt;

&lt;p&gt;An API gateway understands &lt;strong&gt;HTTP&lt;/strong&gt; - paths, methods, headers, authentication tokens.&lt;/p&gt;

&lt;p&gt;An AI gateway understands &lt;strong&gt;LLM semantics&lt;/strong&gt; - token counts, model capabilities, cost per inference, prompt content, response safety.&lt;/p&gt;

&lt;p&gt;This is why you can't substitute one for the other. You could technically route LLM requests through Kong - it will auth them, rate limit them (by request count), and route them to OpenAI. But it cannot tell you that your data science team spent $847 on Claude Opus last week, or automatically switch to GPT-4o-mini when Claude is rate-limiting you, or catch a prompt injection attempt before it reaches the model. Those capabilities require a layer that understands what an LLM request actually &lt;em&gt;is&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;API Gateway&lt;/th&gt;
&lt;th&gt;AI Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Route HTTP traffic between services&lt;/td&gt;
&lt;td&gt;Manage LLM requests between apps and model providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Understands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP paths, headers, auth tokens&lt;/td&gt;
&lt;td&gt;Tokens, models, costs, prompts, responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limiting unit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requests/second&lt;/td&gt;
&lt;td&gt;Tokens/minute, spend/team, spend/user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;Per-token, per-model, per-team, per-request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;URL path, method, headers&lt;/td&gt;
&lt;td&gt;Model capability, latency, cost, fallback chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP response caching (exact match)&lt;/td&gt;
&lt;td&gt;Semantic caching (similar prompts → same response)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API keys, JWT, OAuth&lt;/td&gt;
&lt;td&gt;API keys, RBAC, virtual keys, team/user scoping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upstream service health checks&lt;/td&gt;
&lt;td&gt;Model provider failover, automatic model switching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None — routes bytes&lt;/td&gt;
&lt;td&gt;Prompt inspection, output validation, PII detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Request counts, latency, error rates&lt;/td&gt;
&lt;td&gt;Token usage, cost, model performance, prompt logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance tooling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Audit logs for service access&lt;/td&gt;
&lt;td&gt;Guardrails, PII scrubbing, content moderation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent/MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Native in dedicated AI gateways&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kong, NGINX, Apigee, AWS API Gateway&lt;/td&gt;
&lt;td&gt;TrueFoundry, LiteLLM, Portkey, Helicone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The mistake we made and the one you're probably making
&lt;/h2&gt;

&lt;p&gt;We assumed token-level rate limiting was something we could implement in Kong with a plugin. We spent two weeks on a Lua plugin that tried to parse the OpenAI request body, extract token estimates, and apply limits. It technically worked in dev. It broke under streaming responses (tokens arrive incrementally - you don't know the total until the stream ends). It broke when we added Anthropic, whose request format differs. It broke when we added our self-hosted model, which had no token counting at all.&lt;/p&gt;

&lt;p&gt;The deeper issue: we were building an AI gateway inside an API gateway. Every capability we needed - semantic caching, model fallback, token-based rate limiting, per-team cost attribution - would have required a custom plugin. The end state of that path was a custom AI gateway implemented as a Kong plugin collection, maintained by our team, with all the operational overhead that implies.&lt;/p&gt;

&lt;p&gt;At some point the honest question is: why build this when it already exists?&lt;/p&gt;

&lt;p&gt;The answer for us was switching to a dedicated AI gateway for LLM traffic, keeping Kong for everything else. The Kong configuration stayed exactly the same. The LLM services started routing through the AI gateway instead of directly to providers. We set &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; in each service's config, pointed them at the AI gateway endpoint, and got token-level observability without touching application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Do you need both, or just one?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you only have microservices and no LLM workloads:&lt;/strong&gt; you need an API gateway. A dedicated AI gateway adds nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have LLM workloads but no microservices (rare):&lt;/strong&gt; you need an AI gateway. A generic API gateway won't handle what you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have both (most teams at this point):&lt;/strong&gt; you need both, running at different layers. The API gateway handles your service mesh. The AI gateway handles your LLM traffic. They don't compete — they sit in different parts of the stack.&lt;/p&gt;

&lt;p&gt;The practical setup: your AI gateway is one more upstream service from the perspective of your API gateway. External traffic → API gateway → your services → AI gateway → LLM providers. Each layer handles what it's designed for.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we use and why
&lt;/h2&gt;

&lt;p&gt;We kept Kong for our service mesh - it handles auth and routing for our ~30 internal services and we have years of operational investment in it.&lt;/p&gt;

&lt;p&gt;For LLM traffic, we moved to &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt;. The specific things that mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token-based rate limiting in-memory.&lt;/strong&gt; Auth, budget checks, and rate limits all run in-memory before the request goes out - no external DB lookup per request, no Redis dependency that can fail open. The benchmarks show 350+ RPS on 1 vCPU with under 10ms overhead, which we've matched in our own testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-team budget enforcement on the hot path.&lt;/strong&gt; When a team hits their token budget, subsequent requests return a rate-limit error. Not an alert after the fact — an error before the spend happens. That's the architectural difference that fixed our 2am billing alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model fallback chains.&lt;/strong&gt; We configured: Claude Opus 4 → Claude Sonnet 4.6 → GPT-4o → GPT-4o-mini, in priority order. When Claude's rate limits fire, traffic automatically routes to the fallback. Our apps stopped serving errors during model outages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified cost attribution.&lt;/strong&gt; One dashboard showing spend by team, by model, by application, including our self-hosted Llama deployment. Previously three separate billing dashboards with no cross-referencing.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: If you want something lighter, LiteLLM covers the core routing and cost tracking with lower infrastructure overhead — the SSO and team-level governance features require the enterprise license, but for smaller setups it's a solid starting point.&lt;/p&gt;

&lt;p&gt;What's your current setup, are you routing AI traffic through a general API gateway, or have you added a dedicated layer? Particularly curious whether teams on Apigee or AWS API Gateway have found native AI features worth it, or whether they ended up with a separate AI gateway layer regardless. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
      <category>mlops</category>
    </item>
    <item>
      <title>What Is an Agent Gateway? (And Why Our AI Gateway Stopped Being Enough)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 24 Jun 2026 10:02:06 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-agent-gateway-and-why-our-ai-gateway-stopped-being-enough-3h0k</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-agent-gateway-and-why-our-ai-gateway-stopped-being-enough-3h0k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; An agent gateway is a control layer specifically built for AI agents - handling authentication, routing, policy enforcement, observability, and orchestration for agent-to-model and agent-to-tool communication. An AI gateway routes LLM requests. An agent gateway governs autonomous agent behaviour. Once your agents start calling tools, spawning sub-agents, and running multi-step workflows, you need both.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;For the first six months, our AI gateway solved every problem we threw at it. One endpoint for all LLM providers, rate limiting per team, cost attribution, guardrails on prompts and responses. It worked cleanly and we were happy with it.&lt;/p&gt;

&lt;p&gt;Then we started deploying agents.&lt;/p&gt;

&lt;p&gt;The first one was simple - a support triage agent that classified incoming tickets and routed them to the right team. It called one model, occasionally fetched a Confluence doc via an MCP tool, and wrote a label back to Jira. Fine. No problems.&lt;/p&gt;

&lt;p&gt;The second one was a data pipeline agent that pulled from BigQuery, ran transformations, called two different models depending on the task type, and spawned sub-agents for specific segments of the work. That's when things got interesting in the wrong way.&lt;/p&gt;

&lt;p&gt;The AI gateway saw all of this as a stream of LLM requests. It had no idea that request #47 was a sub-agent spawned by the parent agent that started request #12. It couldn't tell us what the agent had decided to do between step 3 and step 7. When the sub-agent went into an unexpected loop and burned through our token budget in 40 minutes, the gateway flagged the cost spike after the fact - there was no per-agent circuit breaker that would have caught it at step 4.&lt;/p&gt;

&lt;p&gt;That's the problem an agent gateway solves. And it took us longer than I'd like to admit to realise these were genuinely different infrastructure requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is an Agent Gateway?
&lt;/h2&gt;

&lt;p&gt;An agent gateway is a centralised control layer that sits between AI agents and everything they interact with: LLM models, external tools, other agents, and internal APIs.&lt;/p&gt;

&lt;p&gt;The definition that clicked for me: think of what an API gateway does for microservices - centralised auth, routing, rate limiting, observability. Now imagine that same concept, but designed specifically for autonomous agents that maintain state across multiple steps, can spawn other agents, and interact with the world through tools rather than just returning text.&lt;/p&gt;

&lt;p&gt;An API gateway handles a request and returns a response. An agent gateway handles a workflow - a sequence of decisions, tool invocations, model calls, and sub-agent delegations that might run for seconds or hours, where each step's output feeds the next.&lt;/p&gt;

&lt;p&gt;The fundamental difference: LLM requests are stateless. Agent workflows are stateful. That distinction drives almost every architectural decision in an agent gateway.&lt;/p&gt;




&lt;h2&gt;
  
  
  How an agent gateway works?
&lt;/h2&gt;

&lt;p&gt;When an agent makes a request through a gateway, here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identity and auth&lt;/strong&gt;&lt;br&gt;
The gateway authenticates the agent - not just "is this a valid API key" but "which agent is this, which user or service account owns it, and what is it permitted to do?" This matters because in a multi-agent system, a sub-agent spawned by a parent agent should only inherit the permissions the parent was authorised to delegate, not the parent's full access. OAuth 2.0 identity injection ties every action to a verified identity before anything reaches a model or tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Policy enforcement&lt;/strong&gt;&lt;br&gt;
Before the request reaches its destination, the gateway evaluates policies: does this agent have access to this tool? Has this workflow exceeded its token budget? Is this request within the agent's defined scope? These checks happen on the hot path not as a post-hoc audit, but as a gate before execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Protocol-aware routing&lt;/strong&gt;&lt;br&gt;
Unlike HTTP requests that a standard API gateway understands natively, agent traffic typically uses protocols like MCP (Model Context Protocol) or A2A (agent-to-agent). An agent gateway understands these protocols - it can route MCP tool calls to the right server, multiplex multi-agent conversations, and handle server-initiated messages (SSE, streaming updates) that agents send back to clients. A generic reverse proxy can't do this; it doesn't understand what an MCP session is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. State and session tracking&lt;/strong&gt;&lt;br&gt;
The gateway maintains context across the agent's workflow. It knows that request #47 belongs to the same session as request #12. This is what enables step-level observability — rather than a flat list of LLM calls, you get a trace that shows the agent's full execution path: which model was called at each step, which tool was invoked, what the intermediate results were, and where the workflow branched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Circuit breakers and execution controls&lt;/strong&gt;&lt;br&gt;
If an agent goes into a loop - calling the same tool repeatedly, or spawning sub-agents that spawn more sub-agents, the gateway detects this and intervenes before it becomes a $400 Tuesday incident. Hard budget caps apply per agent, per workflow, per team. Retry limits and timeout controls apply at the workflow level, not just per individual request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Audit trail&lt;/strong&gt;&lt;br&gt;
Every agent action, tool invocation, model call, and sub-agent delegation is logged with structured metadata — which agent, which step, which user identity, which tool, what the parameters were, what the result was. This is what a compliance audit actually needs: not "model X was called 4,000 times" but "agent Y, acting under user Z's identity, invoked the &lt;code&gt;delete_record&lt;/code&gt; tool on table T at 14:23:07 with these parameters."&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent gateway vs AI gateway vs API gateway
&lt;/h2&gt;

&lt;p&gt;This is the question I get most often when I explain what we built, so let me just address it directly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;API Gateway&lt;/th&gt;
&lt;th&gt;AI Gateway&lt;/th&gt;
&lt;th&gt;Agent Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Designed for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST/HTTP microservices&lt;/td&gt;
&lt;td&gt;LLM requests (prompts → responses)&lt;/td&gt;
&lt;td&gt;Autonomous agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traffic model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless request/response&lt;/td&gt;
&lt;td&gt;Stateless request/response&lt;/td&gt;
&lt;td&gt;Stateful, multi-step, long-running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP, gRPC, REST&lt;/td&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;MCP, A2A, JSON-RPC, SSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service-level API keys&lt;/td&gt;
&lt;td&gt;Per-user/team API keys&lt;/td&gt;
&lt;td&gt;Per-agent identity with delegated permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limiting unit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requests/second&lt;/td&gt;
&lt;td&gt;Tokens/minute, spend/team&lt;/td&gt;
&lt;td&gt;Budget/workflow, steps/session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Request logs, latency&lt;/td&gt;
&lt;td&gt;Token usage, cost per model&lt;/td&gt;
&lt;td&gt;Step-level traces across agent lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Intervention capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Block/allow per request&lt;/td&gt;
&lt;td&gt;Budget caps, content guardrails&lt;/td&gt;
&lt;td&gt;Circuit breakers, loop detection, workflow pause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knows about tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial (guardrails on tool calls)&lt;/td&gt;
&lt;td&gt;Native — MCP tool registry, per-tool policies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The short version: you need all three, and they live at different layers of the stack. The API gateway handles your microservices. The AI gateway handles your LLM traffic. The agent gateway handles your autonomous agents. In practice, the AI gateway and agent gateway are often the same system — but they need to be the same system that was &lt;em&gt;designed&lt;/em&gt; for both, not an AI gateway with agent features bolted on.&lt;/p&gt;




&lt;h2&gt;
  
  
  The specific things that break without one
&lt;/h2&gt;

&lt;p&gt;If you're running agents now and thinking "we don't have an agent gateway and things seem fine," here are the four things that will eventually catch you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential sprawl in multi-agent systems.&lt;/strong&gt; Each agent needs access to tools, models, and APIs. Without a gateway, each agent manages its own credentials. In a system with 12 agents, that's potentially 12 × 6 credential relationships to track, rotate, and revoke. When an agent is decommissioned, you're hunting down API keys across every system it ever touched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The M×N integration explosion.&lt;/strong&gt; Each agent that needs to call a tool needs a direct integration with that tool. With 10 agents and 8 tools, that's 80 potential integration points to build, maintain, and monitor. An agent gateway — specifically, an MCP gateway layer within it — collapses this to each agent knowing one endpoint and each tool being registered once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No visibility into what agents actually did.&lt;/strong&gt; Your AI gateway can tell you "4,000 model calls happened today." It cannot tell you "the data pipeline agent made 12 calls to BigQuery, then called GPT-4o three times with the results, then spawned a sub-agent that made 6 more calls before producing its output." That second kind of trace is what you actually need to debug a production agent incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runaway workflows.&lt;/strong&gt; Agents can loop. They can spawn sub-agents that spawn more sub-agents. They can get stuck in a retry pattern that burns tokens indefinitely. Without a circuit breaker at the workflow level, the first you hear about it is the cost spike notification — after the damage is done.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we ended up using
&lt;/h2&gt;

&lt;p&gt;After hitting walls three and four simultaneously (a runaway sub-agent loop concurrent with an audit request we couldn't answer), we evaluated our options and ended up on &lt;a href="https://www.truefoundry.com/agent-gateway" rel="noopener noreferrer"&gt;TrueFoundry's Agent Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A few specific things that mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework-agnostic registration.&lt;/strong&gt; We have agents on LangGraph, one on CrewAI, and two custom HTTP agents. The agent gateway supports all of them through a single registration interface. Once an agent is registered, it gets a governed identity, shows up in the central registry, and its traffic flows through the same policy engine regardless of what framework built it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-agent identity with OAuth 2.0 injection.&lt;/strong&gt; Every agent action is tied to the authenticated user or service account that owns that agent's session. When a sub-agent is spawned, it inherits only the delegated permissions its parent was authorised to pass. The "over-privileged sub-agent" pattern where a sub-agent ends up with broader access than the human who started the workflow is closed by design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-level traces across the full workflow.&lt;/strong&gt; The observability goes beyond model call logs. We can see the full execution trace: every decision point, every tool invocation, every sub-agent delegation, every intermediate result. When something breaks in production, debugging time dropped significantly instead of inferring what happened from model call logs, we read the actual execution trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow-level circuit breakers.&lt;/strong&gt; Budget caps apply per workflow, not just per team. Loop detection fires at the gateway level before a runaway agent creates an incident. We set a max-steps limit on long-running workflows as a safety backstop. These controls didn't exist in our AI gateway and would have been genuinely painful to implement at the application layer across 14 different agent implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP tool governance in the same system.&lt;/strong&gt; Because TrueFoundry's agent gateway sits alongside the MCP gateway, tool access policies for agents live in the same control plane as model access policies. When a new agent is onboarded, we define both in one place: which models it can call, which tools it can use, what its budget is, who can modify it. One system, one audit trail.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: there's more to configure upfront than a simple AI gateway setup. If you have two agents and they're not doing anything complex, it's probably overhead you don't need yet. The inflection point for us was around four agents with overlapping tool access and a compliance requirement that needed a proper audit trail. That's when the additional configuration started paying back.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you actually need an agent gateway
&lt;/h2&gt;

&lt;p&gt;The question isn't whether agents need governance - they do. The question is when the governance complexity justifies dedicated infrastructure.&lt;/p&gt;

&lt;p&gt;You probably don't need an agent gateway yet if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have one or two simple agents with no tool access or sub-agent delegation&lt;/li&gt;
&lt;li&gt;Your agents are purely internal, no compliance requirements, low stakes&lt;/li&gt;
&lt;li&gt;You're still in the prototyping phase and governance can wait&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need an agent gateway when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple agents share access to the same tools and you need to control who can do what&lt;/li&gt;
&lt;li&gt;A runaway agent loop could cause a real operational or financial incident&lt;/li&gt;
&lt;li&gt;Someone (security, compliance, a customer) has asked you for an audit trail of agent actions&lt;/li&gt;
&lt;li&gt;Sub-agents are involved and you need to reason about delegated permissions&lt;/li&gt;
&lt;li&gt;You're deploying agents across multiple teams and need consistent governance without each team re-implementing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The M×N integration explosion is usually the first concrete forcing function. The audit trail request is usually the second.&lt;/p&gt;




&lt;p&gt;What's your current agent setup - are you handling governance at the application layer or have you moved to a dedicated agent gateway layer? Would love to hear how teams are solving the runaway loop problem without infrastructure-level circuit breakers. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>What Is an AI Gateway? (And the Week We Realized We Desperately Needed One)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 24 Jun 2026 09:33:19 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-ai-gateway-and-the-week-we-realized-we-desperately-needed-one-3h5a</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-ai-gateway-and-the-week-we-realized-we-desperately-needed-one-3h5a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI gateway is a middleware layer between your application code and your LLM providers - it centralises routing, auth, rate limiting, cost tracking, and guardrails in one place&lt;/li&gt;
&lt;li&gt;You probably don't think you need one until something specific breaks: a runaway cost spike, a failed model causing silent errors, a security audit you can't pass&lt;/li&gt;
&lt;li&gt;We went from scattered SDKs and shared API keys to a gateway-first setup over about three months - this post covers what changed and what we'd do differently&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Six months ago we had what I'd describe as a functional mess. We were running three LLM providers - OpenAI for our customer-facing chat, Anthropic for internal document summarisation, and a self-hosted Llama model for batch classification jobs. Each had its own SDK. Each had its own API key, living in &lt;code&gt;.env&lt;/code&gt; files on whoever's machine had last run that service. Each had its own rate limiting logic, copy-pasted between services with slight variations.&lt;/p&gt;

&lt;p&gt;It worked, in the way that things work when nobody has had a bad enough incident yet.&lt;/p&gt;

&lt;p&gt;The incident arrived on a Tuesday. A background job that was supposed to run once a week got accidentally scheduled to run every minute. It was calling GPT-4o. We noticed when the Slack alert fired at 2am about an unusual credit card charge. By the time someone killed the job, we'd burned through $340 in about four hours. The API key had no spending limit. There was no alerting on token usage. The job had no rate limiting. All three of those gaps were things we knew about and hadn't prioritised.&lt;/p&gt;

&lt;p&gt;That week, we started properly looking at AI gateways.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an AI gateway actually is?
&lt;/h2&gt;

&lt;p&gt;The simplest definition: an AI gateway is a middleware layer that sits between your application code and your LLM providers. All your LLM requests go through it, and it handles the cross-cutting concerns that you'd otherwise have to re-implement in every service: routing, authentication, rate limiting, cost tracking, caching, fallbacks, guardrails.&lt;/p&gt;

&lt;p&gt;The analogy that clicked for me is an API gateway for the rest of your microservices stack. If you've ever set up Kong or AWS API Gateway to handle auth and rate limiting for your REST services, an AI gateway does the same thing but for LLM traffic specifically, which has different characteristics (token-based pricing, streaming responses, variable latency, context windows) that a generic API gateway doesn't handle cleanly.&lt;/p&gt;

&lt;p&gt;Architecturally, it typically has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;routing engine&lt;/strong&gt; that directs requests to the right model based on rules you define (latency, cost, fallback chains)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;policy layer&lt;/strong&gt; for rate limits, spending caps, and access control&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;observability stack&lt;/strong&gt; that logs requests, responses, token usage, and costs — ideally at per-user and per-team granularity&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;caching layer&lt;/strong&gt; that avoids redundant API calls for identical or semantically similar prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important thing is that none of this lives in your application code. It's a separate layer with its own config, which means you can change routing rules or enforce a new spending limit without touching application code or doing a deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problems it actually solves
&lt;/h2&gt;

&lt;p&gt;Before I get into specific features, it helps to be concrete about the problems. The ones we hit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Unmanaged API keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had four API keys in four &lt;code&gt;.env&lt;/code&gt; files. When an engineer left the team, we invalidated their personal keys but not the shared service keys, because we weren't entirely sure which services were using them. A gateway solves this by being the only thing that holds the real provider keys. Application services authenticate to the gateway with scoped virtual keys. If you need to revoke access, you revoke the virtual key — the underlying provider key stays intact and doesn't need to change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Zero cost visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We knew our monthly spend from the Anthropic and OpenAI dashboards. We had no idea which team or service was responsible for which portion of that spend. When costs went up, we couldn't attribute it. A gateway with per-team and per-service cost tracking meant that the next month, we had a breakdown: classification job (42%), customer chat (31%), internal summarisation (19%), miscellaneous (8%). Suddenly we knew where to optimise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No spending limits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Tuesday incident. Enough said. A gateway lets you set hard token or spend limits per API key, per team, per service. When the limit hits, the request gets a rate-limit error instead of a bill at the end of the month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Silent failures on model outages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When OpenAI had a partial outage last March, our customer chat just... failed quietly. Requests returned errors, the frontend showed a generic message, and we found out from a user report rather than an alert. A gateway with fallback routing would have automatically switched to Anthropic or our self-hosted model and kept the service up. We were just making direct SDK calls with no fallback logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The security audit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one came from outside the team. Our security team did a review and had two questions we couldn't fully answer: "Can you show me an audit log of which users triggered which model calls in the last 90 days?" and "How do you ensure that production credentials aren't accessible to developers locally?" We couldn't answer either cleanly. A gateway with request-level logging and centralised key management is the infrastructure answer to both.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "routing" means in practice
&lt;/h2&gt;

&lt;p&gt;One feature that sounds vague but turned out to be genuinely useful: routing.&lt;/p&gt;

&lt;p&gt;Not just "send this request to OpenAI" — but &lt;em&gt;intelligent&lt;/em&gt; routing. There are a few modes worth understanding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency-based routing:&lt;/strong&gt; The gateway continuously monitors response times across your configured providers. When one provider's latency spikes, it automatically routes to whichever is fastest. This is particularly useful when you're using multiple deployments of the same model across regions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weighted load balancing:&lt;/strong&gt; You can split traffic across providers by percentage. We used this when testing a new model — routing 10% of requests to the new model, watching the metrics, and gradually shifting the split as confidence grew. No code changes, just a config update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback chains:&lt;/strong&gt; Define a priority order. If the primary model is unavailable or rate-limited, try the secondary, then the tertiary. The request succeeds from the application's perspective — it never sees the fallback happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing:&lt;/strong&gt; Route to cheaper models for lower-stakes tasks. We ended up routing classification jobs to a smaller, cheaper model and only using GPT-4o for the tasks that genuinely needed it. The gateway enforces this policy centrally rather than relying on individual engineers to make cost-conscious choices in every service.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caching: the feature we underestimated
&lt;/h2&gt;

&lt;p&gt;We expected to use caching for exact-match deduplication — if two users send the identical prompt, return the cached response. Useful, but not that common in practice.&lt;/p&gt;

&lt;p&gt;What we didn't expect was how useful semantic caching turned out to be. Semantically similar prompts — not identical, but asking for the same thing slightly differently — return the cached response if the similarity score is above a threshold you configure. For our summarisation workload, we found that a significant portion of requests were semantically similar enough to return cached results. That's real cost reduction without any change in output quality.&lt;/p&gt;

&lt;p&gt;The key configuration decisions: cache expiry (how long is a cached response valid?), and the similarity threshold (how similar is "similar enough"?). These are worth tuning — the defaults are conservative and you can usually go more aggressive once you understand your workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  Guardrails: the part most teams skip until they shouldn't
&lt;/h2&gt;

&lt;p&gt;Guardrails are the part of AI gateway setup that gets deferred because it feels like a "compliance problem" rather than an engineering problem. It's both.&lt;/p&gt;

&lt;p&gt;A guardrail is a policy that runs on requests before they're sent to the model (input guardrails) and on responses before they're returned to the application (output guardrails). Common uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII detection:&lt;/strong&gt; Strip or redact personally identifiable information before it leaves your environment or appears in a response. Crucial if you're handling customer data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content moderation:&lt;/strong&gt; Filter inputs or outputs that violate safety policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection detection:&lt;/strong&gt; Flag or block requests that appear to be attempting to manipulate the model's behaviour via injected instructions. Particularly important if any user-generated content makes it into your prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The way TrueFoundry handles this is through pre-built integrations that need no external credentials for the basics, with options to plug in Azure Content Safety, AWS Bedrock Guardrails, OpenAI Moderations, or Google Model Armor for more specific requirements. You can run guardrails in &lt;strong&gt;validate&lt;/strong&gt; mode (inspect, flag, optionally block) or &lt;strong&gt;mutate&lt;/strong&gt; mode (inspect and modify — useful for PII scrubbing where you want to replace rather than reject).&lt;/p&gt;

&lt;p&gt;The thing that shifted my thinking on this: guardrails aren't just for compliance. Prompt injection via third-party content is a real engineering risk once you're building agents that retrieve external content and put it into context. A guardrail that runs on retrieved content before it reaches the model is the right architectural answer — not trying to sanitise inputs at the application layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  How we ended up on TrueFoundry
&lt;/h2&gt;

&lt;p&gt;We evaluated a few options. LiteLLM was the first thing we tried — it's the obvious starting point because it's open source, MIT licensed, and gets you a unified endpoint across providers in an afternoon. We ran it for about six weeks. What broke for us: no SSO integration (we needed Okta for compliance), and the per-team budget enforcement we needed was behind the enterprise license. The YAML config also got unwieldy as we added more models and routing rules.&lt;/p&gt;

&lt;p&gt;We ended up on &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt;. A few specifics that mattered to our evaluation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; The gateway runs entirely in-memory for auth, rate limiting, and routing decisions — no external DB lookups on the hot path. Config syncs from the control plane via NATS. This means gateway latency doesn't degrade as you add governance rules. The benchmarks show 350+ RPS on 1 vCPU with under 10ms added latency at full load, which matched what we saw in our own testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key management:&lt;/strong&gt; Developers get virtual keys that map to gateway-managed provider credentials. The actual OpenAI and Anthropic keys never leave the secrets manager. Onboarding a new developer means issuing a new virtual key. Offboarding means revoking it — one action, immediate effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-team budgets:&lt;/strong&gt; Enforced on the request path, not as a post-spend alert. When the limit hits, requests return rate-limit errors. We haven't had another 2am Slack alert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted model support:&lt;/strong&gt; We route to our on-prem Llama deployment through the same gateway as our OpenAI and Anthropic traffic. Same observability, same cost attribution, same rate limiting. This was the biggest gap with Portkey, which has no visibility into self-hosted model infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; We run it inside our VPC. The whole control plane stays in our infrastructure, which is what our security team needed to answer the data residency question cleanly.&lt;/p&gt;

&lt;p&gt;What I'd flag as the honest tradeoff: TrueFoundry is more to set up than LiteLLM. It's Kubernetes-native, so if you don't have a K8s environment, there's more upfront work. And Portkey's prompt management UI is genuinely better for non-engineers who want to iterate on prompts without touching config files. Those are real differences worth knowing before you evaluate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The one-line config change that changed everything
&lt;/h2&gt;

&lt;p&gt;The moment everything clicked was adding this to our service configs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://&amp;lt;gateway-url&amp;gt;/api/inference/
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://&amp;lt;gateway-url&amp;gt;/api/inference/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every existing SDK call — LangChain, the OpenAI Python client, direct requests — started going through the gateway without any code changes. Suddenly we had cost attribution, rate limiting, and request logging across all our services. The application code didn't know anything had changed.&lt;/p&gt;

&lt;p&gt;This is also, incidentally, the right way to think about what a gateway does to your architecture: it's a configuration change at the infrastructure layer, not a code change at the application layer. That's why the governance it provides actually holds — it's enforced at the network level, not dependent on individual engineers remembering to implement it.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you probably don't need one yet
&lt;/h2&gt;

&lt;p&gt;I don't want to make it sound like everyone needs an AI gateway immediately. If you're at an early stage, the overhead isn't worth it.&lt;/p&gt;

&lt;p&gt;You probably don't need a gateway yet if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're one developer building a prototype or demo&lt;/li&gt;
&lt;li&gt;You use a single model from a single provider&lt;/li&gt;
&lt;li&gt;You have no cost visibility requirements and no compliance obligations&lt;/li&gt;
&lt;li&gt;You don't have multiple teams sharing AI infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're ready for a gateway when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than one team is hitting LLMs and you can't answer "what is each team spending?"&lt;/li&gt;
&lt;li&gt;An engineer leaving means you need to hunt down and rotate credentials across multiple places&lt;/li&gt;
&lt;li&gt;A model outage has caused a user-facing incident and you had no fallback&lt;/li&gt;
&lt;li&gt;Someone has asked for an audit log and you didn't have one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Tuesday incident was our sign. Hopefully yours is less expensive.&lt;/p&gt;




&lt;p&gt;What's the specific thing that pushed your team toward a gateway — or convinced you to hold off? Curious whether cost incidents are as common a forcing function as they were for us, or whether it's usually the security audit that does it. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Portkey Alternative: I Switched Away from Portkey. Here's the Honest Reason Why.</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Fri, 19 Jun 2026 11:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/portkey-alternative-i-switched-away-from-portkey-heres-the-honest-reason-why-3k15</link>
      <guid>https://dev.to/sahajmeet_kaur_/portkey-alternative-i-switched-away-from-portkey-heres-the-honest-reason-why-3k15</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portkey is genuinely good for small-to-mid teams shipping fast — the DX is excellent, setup is minimal, and the observability dashboard is among the strongest for single-team use&lt;/li&gt;
&lt;li&gt;Palo Alto Networks completed the acquisition on May 29, 2026. Portkey is now Prisma AIRS's AI Gateway. What that means for the developer-first roadmap is an open question, and it's a real factor for any long-term infrastructure decision&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; is more to set up but covers the full lifecycle — routing, MCP governance, model deployment, data sovereignty — which is where we kept hitting Portkey's ceiling&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We adopted Portkey about eighteen months ago. At the time it was the obvious call: our team was six people, we were routing to OpenAI and Anthropic, and we needed observability fast. Portkey was running in an afternoon. The dashboard was clean. Cost tracking worked. Prompt versioning was a genuine quality-of-life improvement. I'd have recommended it to anyone in that situation without hesitation.&lt;/p&gt;

&lt;p&gt;Then things changed. The team grew to forty engineers across four squads. Two squads started using self-hosted models. A security review asked for per-team cost attribution and a full audit trail. We added agent workflows that needed MCP governance. And in April, Palo Alto Networks announced they were acquiring Portkey.&lt;/p&gt;

&lt;p&gt;At each of those inflection points, we bumped into something Portkey didn't handle the way we needed. This post is the honest account of what those were — and why we eventually switched.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Portkey does really well
&lt;/h2&gt;

&lt;p&gt;Before getting into the friction, it's worth being clear about where Portkey is genuinely strong — because the "obvious better choice" framing is only useful if it's grounded in real tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer onboarding is pretty good.&lt;/strong&gt; Three lines of code and you have routing, retries, fallback, and observability. No config files, no YAML, no infrastructure decisions. For a team that needs to be in production this week, that matters enormously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt management UI works well.&lt;/strong&gt; Versioned prompts, labeled deployments, a playground that actually works — this is the kind of tooling that makes the difference between prompt iteration being a discipline versus a guessing game. Portkey's Prompt Engineering Studio is genuinely ahead of most alternatives here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-level observability is clean and actionable.&lt;/strong&gt; Cost per provider, per model, per API key. Token usage over time. Latency distributions. It's not flashy but it's exactly what you need when you're trying to understand what your LLM usage is actually costing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability features work.&lt;/strong&gt; Automatic retries, fallback chains, load balancing across providers — these all worked reliably in production.&lt;/p&gt;

&lt;p&gt;If your situation is: one or two teams, routing to external providers, no compliance requirements, no self-hosted models, no MCP governance — Portkey is probably the right tool. Stop reading and go set it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we kept hitting walls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wall 1: Environment isolation
&lt;/h3&gt;

&lt;p&gt;This was the first thing that caught us. We had dev, staging, and production all pointing at the same Portkey workspace. Fine when you're small. Once we had four squads doing concurrent experiments, routing config changes in dev started bleeding into staging metrics. We couldn't cleanly separate observability by environment.&lt;/p&gt;

&lt;p&gt;Portkey's workspace model is good for a single team. It's not designed around the idea that multiple environments need hard isolation at the config and observability level. We ended up running three separate Portkey accounts, which meant three billing relationships, three sets of API keys to manage, and no unified view across environments.&lt;/p&gt;

&lt;p&gt;With TrueFoundry, &lt;a href="https://www.truefoundry.com/docs/ai-gateway/intro-to-llm-gateway" rel="noopener noreferrer"&gt;workspace scoping is Kubernetes-namespace-backed&lt;/a&gt; — environments are physically isolated, not just logically separated in a shared dashboard. One control plane, clean separation. That's the architecture we needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 2: Cost attribution across teams
&lt;/h3&gt;

&lt;p&gt;Around month four, our head of engineering asked a reasonable question: "Which team is spending the most on LLM calls, and on what models?" We couldn't answer it cleanly.&lt;/p&gt;

&lt;p&gt;Portkey does per-key and per-workspace budget tracking, but cross-team attribution requires either separate workspaces (back to the three-account problem) or careful key management discipline that breaks down as teams grow. Budget enforcement is also reactive — you get alerted after you've hit a limit, not before. We had one incident where an agent workflow hit a token spike over a weekend and ran up a bill before the Monday alert fired.&lt;/p&gt;

&lt;p&gt;TrueFoundry enforces budgets on the hot path, not as a post-spend alert. Cost attribution runs by team, user, model, and application simultaneously — including for self-hosted models, which Portkey has no visibility into at all. When we moved two squads to self-hosted Llama, their costs became completely invisible to Portkey. That's not a gap you can work around.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 3: The compliance audit
&lt;/h3&gt;

&lt;p&gt;This was the one that really forced the issue. Our security team ran a compliance review and asked two things: "Can you prove that LLM traffic from the finance squad never left our network?" and "Can you produce a per-request audit trail for MCP tool calls from the last 90 days?"&lt;/p&gt;

&lt;p&gt;Portkey's hybrid VPC mode genuinely keeps inference payloads in-network — I want to be clear that this is real, not marketing. But the control plane — the dashboard, guardrail configuration, analytics aggregation — remains in Portkey's cloud. Our security team's position was that control plane residency matters as much as data residency. If an attacker compromises the control plane, they can change guardrail rules even if they can't see the prompts. That's not a position I'd have anticipated, but once they explained it, it's hard to argue with.&lt;/p&gt;

&lt;p&gt;TrueFoundry runs the entire hot path — auth, rate limiting, guardrails, traces — inside our Kubernetes cluster with no external dependencies. &lt;a href="https://www.truefoundry.com/docs/platform/gateway-plane-architecture" rel="noopener noreferrer"&gt;The architecture docs&lt;/a&gt; describe this pretty clearly: in-memory auth, config synced via NATS, OTEL traces exported to our own backends. Nothing leaves the cluster unless we tell it to. For the compliance team, that was the distinction that mattered.&lt;/p&gt;

&lt;p&gt;On the MCP audit trail question: at the time, Portkey's MCP guardrails were still in early access, and custom tool-call validation was via an adjacent webhook path rather than a native policy engine. We couldn't produce the per-request trace with user attribution that the audit required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 4: Self-hosted models
&lt;/h3&gt;

&lt;p&gt;When two of our squads moved to self-hosted Llama models on our own GPU infrastructure, we needed the gateway to route to those endpoints the same way it routed to OpenAI. Not a separate system — the same observability, the same cost attribution, the same RBAC.&lt;/p&gt;

&lt;p&gt;Portkey doesn't host models and has no visibility into self-hosted model infrastructure. You can point it at a custom endpoint, but GPU utilization, container logs, pod health — none of that is accessible. When a self-hosted model started OOM-crashing under load, the debugging path was: notice the errors in Portkey's dashboard, switch to a completely different observability stack to diagnose the infrastructure, fix the issue, come back. Two tools, two contexts, one problem.&lt;/p&gt;

&lt;p&gt;TrueFoundry handles both the gateway and the compute layer. When something breaks, I can see the request error in the same UI where I check the pod logs and GPU memory. That's not a minor convenience — it cut our mean time to resolution for model infrastructure incidents from hours to minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The acquisition question
&lt;/h2&gt;

&lt;p&gt;On April 30, 2026, Palo Alto Networks announced they were acquiring Portkey. The deal &lt;a href="https://www.paloaltonetworks.com/company/press/2026/palo-alto-networks-completes-acquisition-of-portkey-to-secure-ai-agents" rel="noopener noreferrer"&gt;closed May 29&lt;/a&gt;. Portkey is now the AI Gateway component of Prisma AIRS, their enterprise security platform.&lt;/p&gt;

&lt;p&gt;I want to be careful not to be unfair here. Palo Alto Networks said they'll continue supporting existing Portkey customers, and the acquisition does make sense architecturally — an AI gateway is a natural adjacency for a security platform trying to govern agentic AI.&lt;/p&gt;

&lt;p&gt;But here's the practical question we had to ask: is the product we're betting our AI infrastructure on going to keep prioritizing developer-first AI gateway features, or is it going to be prioritized around security platform consolidation for Prisma AIRS customers?&lt;/p&gt;

&lt;p&gt;That's not a rhetorical question. I genuinely don't know the answer. What I do know is that when you're making a 2-3 year infrastructure commitment, "the roadmap might shift toward a security vendor's priorities" is a risk that didn't exist six months ago. We decided it was a risk we weren't comfortable with for a piece of core infrastructure. That might be overcautious — I wouldn't fault a team for staying on Portkey if it's working for them. But it was the final factor in our decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where TrueFoundry was better suited for us
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend TrueFoundry is strictly better in every dimension. It's not.&lt;/p&gt;

&lt;p&gt;TrueFoundry is Kubernetes-native. If your team isn't already running Kubernetes or doesn't have platform engineering capacity, that's might be new to adapt. Portkey's three-line setup isn't just marketing - it reflects a lighter deployment model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For pure API routing to external providers, Portkey is faster to get value from.&lt;/strong&gt; If MCP governance, self-hosted models, and full data sovereignty aren't in your near-term picture, TrueFoundry's additional surface area is overhead you don't need.&lt;/p&gt;

&lt;p&gt;What TrueFoundry does meaningfully better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full data sovereignty with everything running inside your VPC, control plane included&lt;/li&gt;
&lt;li&gt;Per-team cost attribution across both external APIs and self-hosted models, enforced on the hot path&lt;/li&gt;
&lt;li&gt;MCP governance with tool-level RBAC, pre/post guardrail hooks, Virtual MCP Servers, per-request audit trail&lt;/li&gt;
&lt;li&gt;Unified observability from request trace down to pod logs and GPU utilization&lt;/li&gt;
&lt;li&gt;Routing and deployment in one system, so migrating from OpenAI to a self-hosted model doesn't require a new platform&lt;/li&gt;
&lt;li&gt;Roadmap driven by AI infrastructure, not security platform consolidation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest version: if you're a team that will stay in "routing to external providers" territory indefinitely, Portkey is probably the right call for now. If your needs include self-hosted models, regulated data environments, MCP governance, or multi-team cost attribution, we hit those walls and they were real.&lt;/p&gt;




&lt;p&gt;What's your experience been? Curious whether others on Portkey are watching the Prisma AIRS integration and deciding what to do — or whether most teams are comfortable staying put and waiting to see how the roadmap develops. Drop it in the comments.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>What Is an MCP Proxy - And When Do You Actually Need a Gateway Instead?</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:35:43 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-mcp-proxy-and-when-do-you-actually-need-a-gateway-instead-kpg</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-mcp-proxy-and-when-do-you-actually-need-a-gateway-instead-kpg</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An MCP proxy forwards requests between AI agents and MCP servers — it handles transport, not governance. Fast to set up, hits a wall the moment you have more than one team or more than two servers&lt;/li&gt;
&lt;li&gt;An MCP gateway adds identity, RBAC, audit trails, and per-tool policy enforcement on top of that routing layer — it's where your organization's actual AI policy gets enforced&lt;/li&gt;
&lt;li&gt;We started with a proxy, got bitten by the exact things proxies don't handle, and ended up needing a gateway. This post is the thing I wish I'd read before making that call&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;When I first started wiring up MCP servers for our engineering team, I kept running into the term "MCP proxy" and wasn't entirely sure what it meant or how it differed from an "MCP gateway." Both sit between an AI client and MCP servers. Both forward requests. The difference looked like branding more than substance.&lt;/p&gt;

&lt;p&gt;It's not. I figured that out the expensive way.&lt;/p&gt;

&lt;p&gt;Here's the clean explanation I eventually pieced together, plus the real-world situation that made the distinction matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an MCP proxy actually is
&lt;/h2&gt;

&lt;p&gt;An MCP proxy is a transport layer. Its job is protocol mediation — it forwards requests from MCP clients to MCP servers, and responses back. That's the whole thing.&lt;/p&gt;

&lt;p&gt;The most common reason you'd reach for one is the stdio problem. Claude Code, Cursor, and most local MCP clients speak stdio — they expect to launch a server process and talk to it over stdin/stdout. But if your MCP server is running remotely (inside a Docker container, on a staging server, on someone else's machine), you need something in the middle that wraps that stdio interface and exposes it over HTTP/SSE or WebSockets so the remote client can reach it. That's a proxy.&lt;/p&gt;

&lt;p&gt;What a proxy does not do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn't know what a "tool call" is. It forwards bytes.&lt;/li&gt;
&lt;li&gt;It doesn't check who is making the request or whether they should be allowed to&lt;/li&gt;
&lt;li&gt;It doesn't enforce policies per tool or per team&lt;/li&gt;
&lt;li&gt;It doesn't write audit logs with user attribution&lt;/li&gt;
&lt;li&gt;It doesn't handle token management or credential storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A proxy is the right answer when the real question is "how do I physically get this request from A to B." It's not the right answer when the question is "should this agent be allowed to run this tool, and do I have a record that it did."&lt;/p&gt;

&lt;p&gt;For a single developer connecting a local AI client to one MCP server in a dev environment, a proxy is fine — and in fact it's probably all you need. The problems start when you scale horizontally: more developers, more servers, more agents, different teams with different access requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  The situation that clarified it for us
&lt;/h2&gt;

&lt;p&gt;We had six MCP servers running internally: GitHub, Confluence, Jira, Sentry, Datadog, and an internal data API. Each team had configured their own local connections — developers were managing credentials themselves, there was no central record of what tools had been invoked, and anyone with a client config could reach any server.&lt;/p&gt;

&lt;p&gt;It worked fine until it didn't.&lt;/p&gt;

&lt;p&gt;The first problem was credential sprawl. Every developer had their own GitHub OAuth token, their own Jira API key, their own Confluence credentials. When someone left the team, we had to hunt down and revoke six separate credentials across six systems. We missed one. A contractor who had left three weeks earlier still had an active Jira key in their old laptop's MCP config. We only found out during a routine audit.&lt;/p&gt;

&lt;p&gt;The second problem was a near-miss with prompt injection. An agent was using the Confluence MCP server to pull documentation into context. A vendor had left a support ticket in Confluence with what turned out to be an injected instruction embedded in the formatting. Claude processed the ticket content and started executing steps from the injected text before a human caught it. Nothing catastrophic happened, but it was a visceral illustration of what "no policy layer between agent and tool" actually means in practice.&lt;/p&gt;

&lt;p&gt;The third problem was visibility. When our head of security asked "which agents have accessed our internal data API in the last 30 days, and with what parameters," the honest answer was "we don't know." We had server logs on the API itself, but no correlation to which agent or user identity had triggered each call. The audit trail stopped at the network layer.&lt;/p&gt;

&lt;p&gt;That was the moment my team realized we didn't have a proxy problem. We had a governance problem. And a proxy wasn't going to solve it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual difference: proxy vs gateway
&lt;/h2&gt;

&lt;p&gt;Here's the mental model that eventually clicked for me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A proxy answers: can this request reach its destination?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A gateway answers: should this request be allowed to happen at all — and is there a record that it did?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The distinction looks subtle on a whiteboard. In production, it's the difference between "MCP is running" and "MCP is governed."&lt;/p&gt;

&lt;p&gt;Concretely, a gateway adds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity and authentication.&lt;/strong&gt; The gateway knows who is making the request — not just which client, but which human user, authenticated through your corporate IdP (OAuth 2.0, SAML, SSO). This is what makes access revocation work cleanly: you offboard someone in Okta, their token stops working at the gateway, and they lose access to every MCP server simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-level RBAC.&lt;/strong&gt; Not just "team A can access the GitHub server" but "team A can use &lt;code&gt;search_repositories&lt;/code&gt; and &lt;code&gt;read_file&lt;/code&gt;, but not &lt;code&gt;push_commit&lt;/code&gt; or &lt;code&gt;delete_branch&lt;/code&gt;." That granularity is what separates a policy from a vague intention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail per tool call.&lt;/strong&gt; Every invocation logged with user identity, tool name, request parameters, response, and latency. Queryable. Exportable to your SIEM. This is what makes the security team's question answerable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre- and post-execution guardrails.&lt;/strong&gt; Policy evaluated before the tool runs (should this input be allowed?) and after (does this output contain PII or secrets before it goes back into the agent's context?). This is the prompt injection mitigation — the gateway can inspect tool responses and strip or flag injected instructions before they reach the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified credential management.&lt;/strong&gt; Users authenticate once to the gateway. The gateway handles outbound auth to every downstream MCP server. Credentials live in a vault, not on developer machines.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we actually ended up using
&lt;/h2&gt;

&lt;p&gt;After the audit incident, we evaluated a few options. I'll be honest after multiple considerations, that we landed on &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/mcp-overview" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt;, I can explain specifically why the architecture fit our problem.&lt;/p&gt;

&lt;p&gt;The thing that mattered most to us was unified token management. Before the gateway, six servers meant six credential relationships per developer. With TrueFoundry, each developer gets a single Personal Access Token. The gateway maintains the mapping from that token to OAuth credentials for GitHub, Confluence, Jira, Sentry, and Datadog — and refreshes them automatically when they expire. Offboarding is one action: revoke the PAT. Done.&lt;/p&gt;

&lt;p&gt;The second thing was &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/virtual-mcp-server" rel="noopener noreferrer"&gt;Virtual MCP Servers&lt;/a&gt;. This is a concept I hadn't seen elsewhere before we built it. Instead of exposing a full MCP server to agents — with all its tools, including the destructive ones — you define a curated logical endpoint that exposes only the tools you want a given team or agent to see. Our product engineering team's "dev tools" endpoint exposes GitHub read tools, Jira read/write, and Sentry. It does not expose the internal data API or the Datadog write tools. Those only appear in the security team's endpoint. Agents see one clean surface; the governance lives in the platform.&lt;/p&gt;

&lt;p&gt;The third thing was the &lt;a href="https://www.truefoundry.com/docs/ai-gateway/guardrails-overview" rel="noopener noreferrer"&gt;guardrail&lt;/a&gt; layer. Pre-execution checks validate tool inputs against defined policies before anything runs. Post-execution validation inspects tool responses for PII, secrets, or injected content before it reaches the agent's context. This directly addresses the Confluence prompt injection incident we'd already had.&lt;/p&gt;

&lt;p&gt;The performance overhead was not an issue in practice — the &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/mcp-overview" rel="noopener noreferrer"&gt;docs describe sub-3ms latency under load&lt;/a&gt; using in-memory auth and rate limiting rather than DB lookups per request. For agents making dozens of tool calls per workflow, that matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  When a proxy is actually the right answer
&lt;/h2&gt;

&lt;p&gt;I don't want to make this sound like proxies are always wrong. They're not.&lt;/p&gt;

&lt;p&gt;You probably just need a proxy if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're a solo developer connecting a local AI client to one or two MCP servers in a dev environment&lt;/li&gt;
&lt;li&gt;You're doing a proof-of-concept and governance isn't in scope yet&lt;/li&gt;
&lt;li&gt;Your only problem is the stdio-to-HTTP transport gap — you have a local STDIO server and need to expose it remotely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need a gateway when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than one person is using MCP tools and you need to control who has access to what&lt;/li&gt;
&lt;li&gt;You need an audit trail that satisfies a security team or compliance requirement&lt;/li&gt;
&lt;li&gt;You have agents accessing sensitive internal systems and need to know what they touched&lt;/li&gt;
&lt;li&gt;Someone leaving the team means you need to reliably cut off their tool access&lt;/li&gt;
&lt;li&gt;You've had (or nearly had) a prompt injection incident via an MCP tool response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest version: most teams start with a proxy because it's the fastest path to something working. That's fine. The mistake is treating the proxy as a permanent solution when the system has already grown past what a proxy can govern.&lt;/p&gt;




&lt;h2&gt;
  
  
  The question worth asking now
&lt;/h2&gt;

&lt;p&gt;If you have MCP tools running in your organization right now, here's the specific question I'd ask: if your security team asked "which agents invoked which tools in the last 30 days, and under whose identity," could you answer it?&lt;/p&gt;

&lt;p&gt;If yes - great, your governance layer is working.&lt;/p&gt;

&lt;p&gt;If no - you probably have a proxy where you need a gateway.&lt;/p&gt;




&lt;p&gt;Curious what others are running here. Are most people still on raw proxy setups, or has the security pressure pushed teams toward proper gateways faster than I'd expect? And has anyone dealt with the prompt injection via MCP tool response problem at scale - would love to hear what actually worked. Comments below.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>What It Took to Actually Govern Claude Code Across Our Engineering Team</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:01:41 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-it-took-to-actually-govern-claude-code-across-our-engineering-team-4jp6</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-it-took-to-actually-govern-claude-code-across-our-engineering-team-4jp6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code's attack surface is bigger than most teams realize - two CVEs in early 2026 showed that cloning a repo is enough to get your API keys stolen or run arbitrary code on a developer's machine&lt;/li&gt;
&lt;li&gt;The four gaps we found: unmanaged API keys, no centralized traffic visibility, no filesystem controls, and MCP servers running completely ungoverned&lt;/li&gt;
&lt;li&gt;Fixing all four required more than just patching - it needed a different mental model for how a terminal-based AI tool should be treated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few months ago our security team flagged something in an audit: we had 60+ engineers using Claude Code, and our "governance" for it was essentially nothing. API keys were in &lt;code&gt;.bash_profile&lt;/code&gt; files. There was no way to see what models people were hitting, what it was costing, or who had access to what. When someone left the company, we had no clean way to revoke their Claude Code access without hunting down which machine they'd set their key on.&lt;/p&gt;

&lt;p&gt;We'd done all the right things for Claude.ai — SSO, domain capture, admin console, the works. But Claude Code is a different beast. It's not a web app. It runs in a terminal with the developer's full filesystem permissions, and it authenticates with an API key, not a browser session. None of our web-layer controls touched it.&lt;/p&gt;

&lt;p&gt;The audit was uncomfortable. Then the CVEs made it urgent.&lt;/p&gt;




&lt;h2&gt;
  
  
  The wake-up call: two CVEs that changed how we thought about Claude Code
&lt;/h2&gt;

&lt;p&gt;In early 2026, Check Point Research published findings on two vulnerabilities in Claude Code — CVE-2025-59536 and CVE-2026-21852 — that made me realize we'd been thinking about this wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CVE-2025-59536 (CVSS 8.7):&lt;/strong&gt; A malicious &lt;code&gt;.claude/settings.json&lt;/code&gt; in a repository could execute arbitrary shell commands before Claude Code even showed a trust dialog. In earlier versions, hooks defined in that file ran at startup — before the user was asked to confirm anything. Cloning an attacker's repo and running &lt;code&gt;claude&lt;/code&gt; in it was enough to get RCE on a developer's machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CVE-2026-21852 (CVSS 5.3):&lt;/strong&gt; This one hit differently. Claude Code uses an environment variable called &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; to decide where to send API requests. A malicious repo could override that via its settings file, redirecting all traffic — including the authentication header carrying the developer's API key — to an attacker-controlled server. The attacker proxies requests to the real Anthropic API so nothing looks broken. The developer notices nothing. The attacker has your key.&lt;/p&gt;

&lt;p&gt;Both are patched now (CVE-2025-59536 in v1.0.111, CVE-2026-21852 in v2.0.65). But the thing that stuck with me wasn't the specific vulnerabilities — it was the underlying assumption they exposed. We'd all been treating &lt;code&gt;.claude/settings.json&lt;/code&gt; as passive config. It's not. In an agentic tool that can run shell commands and call external APIs, repo-level config is part of the execution layer. Same threat model as a malicious &lt;code&gt;package.json&lt;/code&gt; postinstall script. We just weren't thinking about it that way yet.&lt;/p&gt;

&lt;p&gt;After the CVE disclosure, my team did a sweep of our repos and found three that had &lt;code&gt;.claude/settings.json&lt;/code&gt; files with non-standard &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; overrides. None of them were malicious — developers had put them there for legitimate local testing. But they also would have redirected traffic for anyone else who cloned those repos. We removed them and added a CI check. Then we started working on the actual governance problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gap 1: API keys were completely unmanaged
&lt;/h2&gt;

&lt;p&gt;This was the most embarrassing one to admit. Every developer using Claude Code had either:&lt;/p&gt;

&lt;p&gt;a) Their own personal Anthropic key (which meant their personal billing, no audit trail, and no way to revoke on offboarding)&lt;br&gt;
b) A shared team key that lived in a shared &lt;code&gt;.env&lt;/code&gt; somewhere (which is worse)&lt;/p&gt;

&lt;p&gt;The fix seems obvious in retrospect — issue keys through the Anthropic Admin Console with explicit expiry, store them in AWS Secrets Manager or HashiCorp Vault, and never let them touch &lt;code&gt;.bash_profile&lt;/code&gt; or shell history. Rotate quarterly, revoke immediately on offboarding.&lt;/p&gt;

&lt;p&gt;But the deeper fix was routing Claude Code through a gateway so the Anthropic key never lived on developer machines at all. With &lt;a href="https://www.truefoundry.com/docs/ai-gateway/claude-code" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt;, developers authenticate to the gateway with a scoped virtual key. The underlying Anthropic credential stays in the gateway's secrets manager. If a developer's machine is compromised, the attacker gets a gateway key that we can revoke from a dashboard — not a raw Anthropic API key with workspace-level access.&lt;/p&gt;

&lt;p&gt;That distinction matters more than it sounds. A stolen Anthropic key can access all workspace files, modify shared data, and run up API costs before you notice. A stolen gateway key gets you a revocable token with model-level and budget-level restrictions baked in.&lt;/p&gt;


&lt;h2&gt;
  
  
  Gap 2: No visibility into what was actually happening
&lt;/h2&gt;

&lt;p&gt;Before we set up the gateway, our "observability" for Claude Code was checking the Anthropic billing dashboard once a month and wincing.&lt;/p&gt;

&lt;p&gt;We had no idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which models developers were hitting&lt;/li&gt;
&lt;li&gt;How much each team was spending vs others&lt;/li&gt;
&lt;li&gt;Whether anyone was sending production data through Claude Code&lt;/li&gt;
&lt;li&gt;What happened when a model was unavailable (usually: the developer just didn't know and filed a vague bug report)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setting &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; to point at a gateway is the single highest-leverage change you can make to a Claude Code deployment. One line of config gives you a centralized enforcement point for everything — not just observability, but model allowlisting, per-developer rate limits, fallback routing, and budget caps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://&amp;lt;your-gateway-url&amp;gt;/api/inference/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After we did this, we could see request-level traces with developer attribution, per-model token spend broken down by team, and cost anomalies surfaced automatically. We found one engineer running a batch job through Claude Code that was generating about 3x average daily spend in an afternoon. Not malicious — they just didn't know. We set a per-developer daily limit and the problem went away without any policy conversations.&lt;/p&gt;

&lt;p&gt;One thing we learned the hard way: if you're using Claude Admin Console's server-managed settings to control Claude Code, those settings are bypassed when &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; is set. So if you route through a gateway, you need MDM (Jamf on macOS, Puppet/Ansible on Linux) to push the &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; setting into system-level managed config files that developers can't override:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS: &lt;code&gt;/Library/Application Support/ClaudeCode/managed-settings.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Linux: &lt;code&gt;/etc/claude-code/managed-settings.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also the direct mitigation for CVE-2026-21852 — if &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; is set at the OS level by MDM and locked, a malicious repo's &lt;code&gt;.claude/settings.json&lt;/code&gt; can't override it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gap 3: The local machine was completely open
&lt;/h2&gt;

&lt;p&gt;Even with the gateway in place, the gateway only governs network-level traffic. Claude Code running on a developer's machine can still read &lt;code&gt;.env&lt;/code&gt; files, &lt;code&gt;.ssh&lt;/code&gt; keys, &lt;code&gt;~/.aws/credentials&lt;/code&gt;, and anything else the local user has access to — and that content can end up in a prompt before it ever hits the network.&lt;/p&gt;

&lt;p&gt;We spent an afternoon putting together a baseline &lt;code&gt;managed-settings.json&lt;/code&gt;. Here's the version we landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"permissions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"disableBypassPermissionsMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deny"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(curl:*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(wget:*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Read(**/.env)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Read(**/.env.*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Read(**/secrets/**)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Read(**/.ssh/**)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Read(**/credentials/**)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(git push:*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Write(**)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowManagedPermissionRulesOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowManagedHooksOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transcriptRetentionDays"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sandbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few settings here are doing the most work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allowManagedPermissionRulesOnly: true&lt;/code&gt;&lt;/strong&gt; — this is the CVE-2025-59536 mitigation. It means project-level &lt;code&gt;.claude/settings.json&lt;/code&gt; files cannot add new permissions, only the system-level managed config applies. A malicious repo can't expand what Claude Code is allowed to do on that machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allowManagedHooksOnly: true&lt;/code&gt;&lt;/strong&gt; — blocks hook injection. Hooks can run arbitrary code between sessions; this prevents a cloned repo from registering new hooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;disableBypassPermissionsMode: "disable"&lt;/code&gt;&lt;/strong&gt; — prevents &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; from being used in scripts or CI. We found two CI workflows that had been using this flag. Both got refactored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;deny&lt;/code&gt; list&lt;/strong&gt; — blocking reads on &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;.ssh&lt;/code&gt;, and credentials directories. We debated this — some developers complained it broke legitimate workflows. We made exceptions on a case-by-case basis via an explicit allow rather than leaving the door open by default.&lt;/p&gt;

&lt;p&gt;Sandboxing adds OS-level isolation on top. On macOS it uses Seatbelt, on Linux bubblewrap. It enforces filesystem and network boundaries at a layer below Claude Code's own permission system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gap 4: MCP was running completely ungoverned
&lt;/h2&gt;

&lt;p&gt;This was the gap that took us longest to appreciate, because MCP looks like a developer experience feature until you realize what it actually is: direct programmatic access from Claude Code to internal systems.&lt;/p&gt;

&lt;p&gt;Our developers had connected Claude Code to GitHub (for code search), Jira (for ticket context), and a couple of internal APIs. All of those connections were configured locally on developer machines, each with their own credentials stored wherever. There was no approval process, no audit trail, and no way to see which tools Claude had been invoking during a session.&lt;/p&gt;

&lt;p&gt;The prompt injection risk here is underappreciated. When Claude retrieves content from an external system via an MCP tool — a GitHub issue, a Jira ticket, a web page — that content arrives in Claude's context. If it contains injected instructions, Claude may execute them silently. We had a case where a Jira ticket from an external vendor contained what looked like a formatting instruction that Claude Code interpreted as a command. Nothing bad happened, but it was a near miss that made the problem very concrete.&lt;/p&gt;

&lt;p&gt;The fix was centralizing MCP access through a gateway with an allowlist. We deployed &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/mcp-overview" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; as the single endpoint for all MCP server access. In &lt;code&gt;managed-settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowedMcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"serverUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;your-mcp-gateway-url&amp;gt;/*"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"strictKnownMarketplaces"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;strictKnownMarketplaces&lt;/code&gt; to an empty array blocks marketplace-sourced MCP server installations. Developers can no longer add random MCP servers from the Claude marketplace — any new server has to go through our review process and get registered in the gateway.&lt;/p&gt;

&lt;p&gt;What we got from the gateway itself: each developer authenticates once, and the gateway handles downstream auth to GitHub, Jira, and everything else. RBAC controls which teams can access which tools. Every tool invocation generates an audit trace with the developer's identity, the tool name, the request and response, and the latency. We can see exactly what Claude touched during a session, not just which model it called.&lt;/p&gt;

&lt;p&gt;The Virtual MCP Servers feature turned out to be genuinely useful for our security team's access: we set up a "security tools" endpoint that exposes only the Sentry and Datadog tools relevant to security workflows, separate from the broader set of tools available to product engineers. Agents only see what they're supposed to see.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the end state looks like
&lt;/h2&gt;

&lt;p&gt;Six months after the audit, our setup is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity:&lt;/strong&gt; SSO + domain capture for Claude.ai. All Claude Code keys are gateway virtual keys issued through TrueFoundry, rotated automatically, revoked on offboarding via a single dashboard action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Routing:&lt;/strong&gt; &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; pushed via MDM to all developer machines, pointing at TrueFoundry AI Gateway. Per-developer daily token limits. Per-team budget caps. Model allowlist (we've restricted certain high-cost models to specific teams that have a justified need).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxing:&lt;/strong&gt; &lt;code&gt;managed-settings.json&lt;/code&gt; deployed via MDM with the permission deny-list and sandbox enabled. &lt;code&gt;allowManagedPermissionRulesOnly&lt;/code&gt; and &lt;code&gt;allowManagedHooksOnly&lt;/code&gt; both true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP:&lt;/strong&gt; TrueFoundry MCP Gateway as the single MCP endpoint. All downstream servers registered and approved. Tool-level RBAC. Full audit trail exported to Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging:&lt;/strong&gt; Everything flows through the gateway to Datadog via OpenTelemetry. 90-day retention. We get a weekly summary of spend by team, model, and application, and an alert if any developer's usage spikes more than 3x their 7-day average.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. BYOD is still a gap — we don't have MDM coverage on contractor machines, which means &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; enforcement is honor-system for that population. If anyone has solved BYOD Claude Code governance cleanly, I'd genuinely like to hear how. Drop it in the comments.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>claudecode</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI Gateway Comparison 2026: Kong, Portkey, LiteLLM, and TrueFoundry</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Wed, 17 Jun 2026 12:26:27 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/ai-gateway-comparison-2026-kong-portkey-litellm-and-truefoundry-mf1</link>
      <guid>https://dev.to/sahajmeet_kaur_/ai-gateway-comparison-2026-kong-portkey-litellm-and-truefoundry-mf1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kong, Portkey, and LiteLLM each hit real walls when AI usage spreads across multiple teams — mostly around cost attribution, environment isolation, and MCP/agent governance&lt;/li&gt;
&lt;li&gt;The choice between them depends more on what you're already running than on feature lists&lt;/li&gt;
&lt;li&gt;TrueFoundry handles the full stack (routing + MCP + model deployment) in one system, which simplifies things but also means more to adopt upfront&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Six months ago our team started evaluating AI gateways in earnest. Not for a demo — we had five squads shipping LLM features, three different cloud providers in play, and a security review asking uncomfortable questions about who had called which model, when, and with what prompt.&lt;/p&gt;

&lt;p&gt;I spent a few weeks actually running Kong, Portkey, and LiteLLM in staging. This is what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "AI gateway" means in practice
&lt;/h2&gt;

&lt;p&gt;An AI gateway sits between your application code and your LLM providers. At minimum it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing requests to the right model/provider&lt;/li&gt;
&lt;li&gt;Retrying and failing over when a provider is down&lt;/li&gt;
&lt;li&gt;Rate limiting by user, team, or application&lt;/li&gt;
&lt;li&gt;Tracking token usage and cost&lt;/li&gt;
&lt;li&gt;Some form of access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where things get more complicated is when you're running agents. An agent workflow that calls ten MCP tools across a single conversation has different governance requirements than a simple chat endpoint — you need to know which tool was called, by which agent, under which identity, and whether it hit any policy. A routing proxy doesn't cover that. A few of the tools below do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kong AI Gateway
&lt;/h2&gt;

&lt;p&gt;Kong has been the default API gateway for Kubernetes-based microservices for years. The AI gateway layer adds LLM-specific plugins on top of that foundation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The good:&lt;/strong&gt; If you're already running Kong, adding AI routing through the same control plane is genuinely low-friction. The plugin model (Lua-based, composable) is mature and customizable. Token-level rate limiting via the AI Rate Limiting Advanced plugin is the right abstraction for LLM workloads — limits based on actual token consumption, not HTTP request counts. Kong 3.13+ also added MCP and A2A (agent-to-agent) protocol support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The friction:&lt;/strong&gt; The open-source version doesn't include the GUI, advanced analytics, or the advanced AI rate limiting plugin — those are Konnect Enterprise. So if you're not already paying for Kong, you're evaluating two things simultaneously: the gateway and the commercial tier. Teams coming to Kong fresh for AI routing often find the operational overhead hard to justify.&lt;/p&gt;

&lt;p&gt;More structurally: Kong treats LLM calls as HTTP requests with some extra metadata. It doesn't have native concepts for prompt lifecycle, agent tracing, or MCP server discovery. You can wire things together with plugins, but you're composing separate pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it makes sense:&lt;/strong&gt; You already run Kong. You want one control plane for API and AI traffic. You're not starting from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Portkey
&lt;/h2&gt;

&lt;p&gt;Portkey is AI-native — built from the start for LLM applications rather than adapted from general API management. That shows in the developer experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The good:&lt;/strong&gt; Setup is genuinely fast. The routing config is readable. The observability dashboard surfaces token cost breakdowns in a way that's actually useful when you're debugging why a particular model is expensive. Prompt versioning with a playground is a real quality-of-life improvement. Semantic caching, retries, and fallbacks work out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The friction:&lt;/strong&gt; Portkey's design is application-scoped, which is fine for one team but creates gaps at org scale. Environment isolation (dev vs staging vs production) isn't a native concept. Cost attribution across multiple teams needs workarounds. Budget limits per org are an Enterprise plan feature — teams on lower tiers can set per-key budgets, but not workspace-level enforcement. Log retention is 30 days on the Production tier, which doesn't meet compliance requirements in most regulated industries without upgrading.&lt;/p&gt;

&lt;p&gt;There's also an ownership question worth naming: Palo Alto Networks completed an acquisition of Portkey. That's worth factoring into a long-term evaluation. It might mean better enterprise integration and security tooling. It might mean pricing changes or slower product velocity. It's an open question and I'd want to know the roadmap before putting Portkey in critical infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it makes sense:&lt;/strong&gt; Small team, shipping fast, developer experience is the priority, governance requirements are simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  LiteLLM
&lt;/h2&gt;

&lt;p&gt;LiteLLM is the tool most AI engineers try first. OpenAI-compatible API across 100+ models, MIT license, Docker image, large community. It's genuinely the easiest starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The good:&lt;/strong&gt; Provider coverage is broad. The translation layer is clean — you write standard OpenAI-format requests and LiteLLM handles routing them to Anthropic, Bedrock, Gemini, self-hosted models, whatever. Virtual keys with per-key budgets and rate limits work once configured. The admin dashboard handles basic team management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The friction:&lt;/strong&gt; The governance features that actually matter at scale — SSO, RBAC, team-level budget enforcement — are behind the Enterprise license. You can't connect Okta to the open-source version. At 20 engineers that's manageable. At 200, you're either paying for the license or sharing master keys in Slack.&lt;/p&gt;

&lt;p&gt;Config is YAML-heavy. That's fine when one engineer owns it, but it doesn't scale cleanly when multiple teams need to modify routing rules independently. Distributed rate limiting requires Redis — if Redis has a problem, your rate limit enforcement degrades. There's also no SLA and no formal audit trail support in the open-source build.&lt;/p&gt;

&lt;p&gt;LiteLLM recently changed their support model, noting it no longer fits their scale. Worth tracking if you're depending on support as part of your infrastructure decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it makes sense:&lt;/strong&gt; Individual engineer or small team prototyping, self-hosted model access, comfort with managing your own infrastructure, willing to absorb the enterprise license cost later.&lt;/p&gt;




&lt;h2&gt;
  
  
  TrueFoundry
&lt;/h2&gt;

&lt;p&gt;TrueFoundry's gateway connects to 1,000+ LLMs through a single OpenAI-compatible endpoint — OpenAI, Anthropic, Gemini, Bedrock, Azure OpenAI, Groq, Mistral, xAI, Together AI, self-hosted, and more. The routing layer handles load balancing by weight or latency, automatic fallback chains, and retries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture detail that matters:&lt;/strong&gt; The gateway is built on &lt;a href="https://hono.dev/" rel="noopener noreferrer"&gt;Hono&lt;/a&gt; with no external calls in the hot path. Auth, rate limiting, and load balancing all run in-memory. Config syncs from the control plane via NATS. Rate limiting uses a sliding window token bucket — per-minute windows, 5-second bucket granularity — &lt;a href="https://www.truefoundry.com/docs/ai-gateway/ratelimiting" rel="noopener noreferrer"&gt;all enforced in-memory&lt;/a&gt; without DB lookups per request. The benchmarks show 350+ RPS on 1 vCPU / 1 GB RAM with 7–12ms added latency at full load with tracing on. The in-memory design is why it stays fast as RBAC rules and budget checks pile up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it handles the org-scale problems:&lt;/strong&gt; RBAC scopes to users, teams, and applications — not just API keys. Budget limits apply simultaneously at user, team, and application level; the most restrictive wins. You can see per-model cost attribution, not just aggregate spend. Guardrails (PII detection, prompt injection filtering, content moderation) are built in without external credentials. External providers — Azure Content Safety, Bedrock Guardrails, OpenAI Moderations, Google Model Armor — are available for teams with specific compliance requirements. Guardrails apply in validate mode (inspect, optionally block) or mutate mode (inspect and modify), and they cover MCP tool results as well as LLM responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MCP layer:&lt;/strong&gt; This is the part that doesn't exist in the other tools at the same depth. TrueFoundry has a dedicated &lt;a href="https://www.truefoundry.com/docs/ai-gateway/mcp/mcp-overview" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; that gives you a central registry for all MCP servers. OAuth 2.0 auth is managed in one place — one token per user, auto-refreshed across all registered servers. You can compose "Virtual MCP Servers" — a logical endpoint that exposes a curated subset of tools from multiple real MCP servers, so a finance agent only sees the tools it's supposed to see. Every tool call gets a full audit trail with user identity, tool invoked, and policies evaluated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; TrueFoundry is more to adopt upfront than LiteLLM or Portkey. It's a platform that also handles model deployment on GPUs, fine-tuning, and agent lifecycle management. If you want just a routing proxy, the surface area is bigger than you need. If you're already asking "how do we govern our agents' tool access" or "how do we deploy and route to our own models from the same place," then the additional surface area starts being useful rather than burdensome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment options:&lt;/strong&gt; SaaS, VPC-hosted, on-prem, and air-gapped (the air-gapped setup uses a forward proxy config documented in the architecture docs). SOC 2, HIPAA, and ITAR certified.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kong&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;TrueFoundry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed SaaS&lt;/td&gt;
&lt;td&gt;✅ (Konnect)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC / on-prem&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Enterprise tier&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Air-gapped&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIPAA&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ITAR&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token-level rate limiting&lt;/td&gt;
&lt;td&gt;✅ (plugin)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-team budget enforcement&lt;/td&gt;
&lt;td&gt;✅ (Konnect)&lt;/td&gt;
&lt;td&gt;Enterprise tier&lt;/td&gt;
&lt;td&gt;Enterprise license&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;Plugin (3.13+)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail per tool call&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes on this table: "Limited" and "Partial" are my characterizations based on docs at the time of writing — check current docs before making a decision. The Kong and Portkey enterprise features depend on which pricing tier you're on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where each tool actually lands
&lt;/h2&gt;

&lt;p&gt;After running all four in staging and living with the tradeoffs, here's the honest picture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kong&lt;/strong&gt; is a reasonable answer if — and only if — you're already running Kong. The AI plugins extend an existing investment well. But if you're starting fresh and evaluating gateways specifically for AI workloads, Kong's operational overhead and plugin-assembly model is hard to justify when purpose-built options exist. It's not an AI gateway that happens to do APIs; it's an API gateway that happens to do AI. That order matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portkey&lt;/strong&gt; is genuinely good for a single team moving fast. The DX is the best of the four. But the acquisition by Palo Alto Networks is a real consideration, not a footnote. When a product gets absorbed into a large security vendor, the roadmap focus shifts — pricing tends to move upmarket, iteration slows, and developer-first features get deprioritized in favor of enterprise sales. It may work out fine, but locking in Portkey as core infrastructure right now means betting on that outcome. For a team that needs a stable, actively-developed gateway over a 2–3 year horizon, that's not a comfortable bet to make today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is where most teams start, and it earns that. For solo engineers and small teams it's hard to beat — MIT license, zero friction, broad provider coverage. The problem is it doesn't grow with you cleanly. The features you actually need at org scale (SSO, RBAC, audit trails, team-level budget enforcement) are all behind the enterprise license, and the YAML-heavy config model starts creating coordination problems as more teams touch it. LiteLLM is a great proof-of-concept tool that becomes a migration project when your AI usage matures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TrueFoundry&lt;/strong&gt; is the only one of the four that was designed assuming AI usage spreads across multiple teams, involves agents using tools, and needs to hold up under compliance scrutiny. The in-memory rate limiting, hierarchical RBAC, MCP governance layer, and the fact that it covers deployment as well as routing — these aren't features bolted on later. They're the thing it was built to do. The honest tradeoff is setup complexity upfront. But if you're past the "one team, one LLM call" stage, that upfront complexity pays back quickly against the alternative: stitching together LiteLLM for routing, a separate deployment platform, and something custom for MCP governance.&lt;/p&gt;

&lt;p&gt;If you're starting a new AI infrastructure evaluation today and don't already have Kong in your stack, TrueFoundry is the one I'd run seriously. Not because the others are bad — they're each good at specific things — but because it's the only one where growing from "route LLM calls" to "govern agent workflows at org scale" doesn't require switching tools.&lt;/p&gt;




&lt;p&gt;What's your current gateway setup, and where did it start breaking for you? Especially curious if anyone's navigated the LiteLLM-to-something-else migration — that transition seems to catch teams off guard. Drop it in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
