<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deepti Shukla</title>
    <description>The latest articles on DEV Community by Deepti Shukla (@deeptishuklatfy).</description>
    <link>https://dev.to/deeptishuklatfy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818367%2F8715c109-f1ab-4975-9c3c-1303cd6f5df1.png</url>
      <title>DEV Community: Deepti Shukla</title>
      <link>https://dev.to/deeptishuklatfy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deeptishuklatfy"/>
    <language>en</language>
    <item>
      <title>How to Connect Your First MCP Server to an AI Agent (Without Breaking Anything in Production)</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:03:58 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/how-to-connect-your-first-mcp-server-to-an-ai-agent-without-breaking-anything-in-production-4j5b</link>
      <guid>https://dev.to/deeptishuklatfy/how-to-connect-your-first-mcp-server-to-an-ai-agent-without-breaking-anything-in-production-4j5b</guid>
      <description>&lt;p&gt;Every MCP getting-started guide shows you the same thing: ten lines of code, a local file system server, and an agent that can read files. It works in five minutes. You show it to your team. Everyone is impressed.&lt;br&gt;
Then someone asks whether it's ready to ship.&lt;br&gt;
It isn't. Not yet. Not because MCP is hard — it isn't — but because getting from "works on my machine" to "works reliably in production with real users and a security team" requires a few additional decisions that the tutorial skipped.&lt;br&gt;
This article covers both: the quick path to a working MCP setup, and the honest list of what you need to address before you let it anywhere near production data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Part 1: What a Working MCP Setup Actually Looks Like
&lt;/h3&gt;

&lt;p&gt;MCP has two sides: the client and the server.&lt;br&gt;
The MCP server is a lightweight service that exposes tools. Each tool has a name, a description, an input schema, and a handler function that does the actual work. An MCP server for a database, for example, might expose tools called query_records, insert_record, and list_tables. The server handles the MCP protocol — receiving tool discovery requests, responding with the tool list, accepting tool calls, and returning results.&lt;br&gt;
The MCP client is your agent — specifically, the part of your agent framework that communicates with MCP servers. Most major agent frameworks (LangChain, LlamaIndex, AutoGen, and others) now have native MCP client support. You point the client at an MCP server, it fetches the available tools, and those tools become available for the LLM to call.&lt;br&gt;
A minimal working setup in Python looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Connect your agent to an MCP server
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;your_agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="c1"&gt;# Point the client at your MCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The client fetches available tools automatically
&lt;/span&gt;&lt;span class="n"&gt;available_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pass tools to your agent
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;available_tools&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The agent can now call any tool the server exposes
&lt;/h2&gt;

&lt;p&gt;response = agent.run("List all open support tickets assigned to me")&lt;br&gt;
The agent sends the tool list to the LLM. When the LLM decides it needs to call list_tickets, it generates a structured tool call. The agent framework intercepts it, sends it to the MCP server, gets the result, and feeds it back into the LLM's context. The LLM continues reasoning with the tool result.&lt;br&gt;
That's it locally. It takes minutes to get running and feels magical the first time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 2: What Works in a Demo and Breaks in Production
&lt;/h3&gt;

&lt;p&gt;Here's the honest part. The setup above has five characteristics that are fine for development and actively dangerous for production.&lt;br&gt;
There's no authentication. The MCP server is open to anyone who can reach the URL. In local development that's only you. In a deployed environment, it's potentially anyone on the network.&lt;br&gt;
There's no access control. Every agent that connects gets every tool. The concept of "this agent should only see read tools, not write tools" doesn't exist in the basic setup.&lt;br&gt;
There's no audit trail. When the agent calls insert_record with certain arguments, there's no log connecting that tool call to the user who triggered it, the LLM call that produced it, or the business context that justified it.&lt;br&gt;
There's no defence against tool poisoning. In April 2025, Invariant Labs demonstrated that a malicious MCP server can embed hidden instructions in tool responses that the LLM reads as commands. In the basic setup, tool responses flow directly from the server into LLM context with no inspection layer in between.&lt;br&gt;
There's no centralised management. If you're running this with one agent, one server, and one developer, the above is manageable. When you have six teams, twenty agents, and forty MCP servers, managing credentials, access policies, and tool inventory in application code becomes a full-time job.&lt;br&gt;
None of these are edge cases. They're the normal state of any MCP deployment that's been running for more than a few months and has more than one team contributing to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: The Three Things to Get Right Before You Ship
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Authentication: Use your existing identity provider, not new credentials
The worst outcome is a parallel credential system — new API keys, new user accounts, new rotation policies — maintained alongside your existing identity infrastructure. It creates duplication, increases surface area, and inevitably drifts out of sync.
The right approach is to federate MCP authentication to your existing IdP. If your organisation uses Okta or Azure AD, MCP tool access should be governed by the same identities, the same roles, and the same access policies as everything else. When an employee's account is deactivated, their agent's tool access is revoked automatically. No separate step, no risk of missing it.&lt;/li&gt;
&lt;li&gt;Tool scoping: Agents should only see what they're authorised to use
The principle of least privilege applies to AI agents at least as much as it applies to human users. An agent handling customer support queries has no legitimate reason to call database administration tools. A finance workflow agent has no reason to trigger deployment pipelines.
In a direct-connection setup, tool scoping requires each agent to filter its own tool list — which means it's implemented inconsistently, if at all. In a gateway setup, scoping is enforced at the discovery layer: the gateway intercepts the tools/list response and returns only the tools the requesting agent is authorised to see. The agent literally cannot discover tools it shouldn't have access to.&lt;/li&gt;
&lt;li&gt;Logging: You need a record that connects the LLM call to the tool call to the outcome
When something goes wrong — and with AI agents, something will eventually go wrong — you need to be able to reconstruct what happened. Not "the database was modified at 14:32" but "User A triggered Agent B, which called Tool C with Arguments D, based on LLM call E, which was triggered by User Request F."
That chain of causation is what makes an AI system debuggable and auditable. It doesn't exist in the basic MCP setup and requires deliberate infrastructure to create.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Production Path
&lt;/h2&gt;

&lt;p&gt;The cleanest path from working demo to production-ready MCP deployment is to route your agents through an MCP gateway rather than connecting them directly to servers. The gateway handles authentication, access control, logging, and response inspection in one place. Your agent code doesn't change — it still talks to an MCP endpoint. The governance layer sits between the agent and the tools.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; is designed specifically for teams making this transition. It integrates with Okta, Azure AD, and other enterprise identity providers for centralised authentication. It enforces RBAC at the tool level so agents only discover what they're authorised to use. It captures full request traces linking every tool call to its triggering LLM call and user context. And it deploys within your own infrastructure — VPC, on-premises, or air-gapped — so no inference data leaves your environment.&lt;br&gt;
You connect your agents to the gateway instead of directly to MCP servers. Everything else stays the same. The demo that impressed your team last week becomes the production system that doesn't keep your security team up at night.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;Explore TrueFoundry's MCP Gateway →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Is Model Context Protocol (MCP)? A Plain Guide for Engineers</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Mon, 06 Apr 2026 08:59:49 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/what-is-model-context-protocol-mcp-a-plain-guide-for-engineers-5ddo</link>
      <guid>https://dev.to/deeptishuklatfy/what-is-model-context-protocol-mcp-a-plain-guide-for-engineers-5ddo</guid>
      <description>&lt;p&gt;If you've seen "MCP" appear three times this week — in a job description, a Slack thread, and a GitHub repo — and nodded along without being entirely sure what it is, this article is for you.&lt;br&gt;
Model Context Protocol is not complicated. It solves a specific problem, it does it cleanly, and once you understand what that problem was, the solution makes immediate sense. Here's everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem MCP Solves
&lt;/h2&gt;

&lt;p&gt;AI models are good at reasoning. They are, by themselves, entirely isolated. A language model trained on text knows a lot of things. It doesn't know what's in your database, what's in your Slack channel, or what tasks are currently open in Jira. It can't send an email, query your CRM, or trigger a deployment.&lt;/p&gt;

&lt;p&gt;For AI agents to do useful work — not just answer questions but actually act — they need to connect to external tools and data sources. Before MCP, every one of those connections was custom-built. A team building an AI assistant for their engineering workflow would write a custom integration for GitHub, a different one for Jira, another one for their internal deployment system. None of those integrations transferred to another team. None of them were reusable across different LLMs. If they wanted to switch from OpenAI to Claude, they rewrote the integrations. If another team wanted similar functionality, they built it from scratch.&lt;/p&gt;

&lt;p&gt;BCG put a number on this problem: without a standard protocol, integration complexity grows quadratically as AI agents multiply across an organisation. Every new agent needs its own connections to every tool it needs. It compounds quickly.&lt;br&gt;
MCP solves this by standardising the connection. Instead of each team building custom integrations, tools expose themselves as MCP servers using one standard interface. Any MCP-compatible agent can connect to any MCP server without custom code. The integration is built once and works everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP Actually Is
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol is an open standard — originally released by Anthropic in November 2024, donated to the Linux Foundation in December 2025 as part of the newly formed Agentic AI Foundation — that defines how AI agents discover and call external tools.&lt;br&gt;
At its core, MCP is a communication protocol. It specifies:&lt;br&gt;
How tools are described. An MCP server exposes a list of tools with structured definitions: name, description, input schema, output schema. The LLM reads these definitions to understand what tools are available and how to use them.&lt;/p&gt;

&lt;p&gt;How tools are called. When an agent wants to use a tool, it sends a structured request to the MCP server. The server executes the tool and returns a structured response. Everything flows over a standard message format based on JSON-RPC 2.0.&lt;br&gt;
How discovery works. Agents query an MCP server to find out what tools it offers. This means agents can adapt to the tools available to them rather than requiring hard-coded tool definitions.&lt;br&gt;
The analogy that makes the most sense: MCP is to AI agents what USB-C is to devices. Before USB-C, every device used a different connector. Charging cables, data cables, display cables — all different, all incompatible. USB-C standardised the connector. You plug in and it works, regardless of which device or which cable.&lt;/p&gt;

&lt;p&gt;MCP standardised the connector between AI agents and tools. An agent that speaks MCP can connect to any tool that speaks MCP, regardless of which LLM powers the agent or which system the tool connects to.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works in Three Steps
&lt;/h2&gt;

&lt;p&gt;Step 1: A tool owner creates an MCP server. This is a lightweight service that exposes one or more tools — a database query function, a Slack messaging capability, a code execution environment — using the MCP interface. The server describes what tools it offers and how to call them.&lt;br&gt;
Step 2: An agent discovers available tools. When an agent initialises, it queries the MCP server and receives a structured list of available tools with their schemas. The agent now knows what it can do.&lt;/p&gt;

&lt;p&gt;Step 3: The agent calls a tool. When the LLM decides it needs to use a tool — based on the user's request and the tools it knows are available — it sends a structured tool call to the MCP server. The server executes the tool and returns the result. The LLM incorporates the result into its reasoning and continues.&lt;br&gt;
That's the complete loop. The LLM doesn't need to know the implementation details of the tool. The tool doesn't need to know anything about the LLM. The protocol handles the conversation between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Ecosystem Grew So Fast
&lt;/h2&gt;

&lt;p&gt;MCP launched in November 2024. By April 2025, MCP server downloads had grown from roughly 100,000 to over 8 million per month. By late 2025, more than 5,800 MCP servers were publicly available, covering everything from Slack, Confluence, and Sentry to databases, code execution environments, and internal enterprise systems. SDK downloads crossed 97 million per month.&lt;br&gt;
Three things drove adoption that quickly.&lt;br&gt;
First, the major LLM providers endorsed it immediately. Anthropic built it, but OpenAI, Google, and Microsoft adopted it within months. That cross-vendor support meant developers could build MCP integrations once and use them with any LLM.&lt;br&gt;
Second, the integration cost dropped to near zero for tool owners. Exposing an existing API as an MCP server is a small amount of wrapper code. Companies like Slack, Datadog, and Sentry added MCP support quickly because the incremental effort was minimal.&lt;br&gt;
Third, developers were hungry for exactly this. The alternative — building and maintaining custom tool integrations per agent, per team, per LLM — was visibly painful. MCP provided relief that was immediately felt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP Doesn't Include
&lt;/h2&gt;

&lt;p&gt;MCP defines the connection. It doesn't define the rules around the connection.&lt;br&gt;
The protocol has no built-in mechanism for specifying which agents are allowed to call which tools. It has no audit logging. It has no way to detect if a tool response contains injected instructions designed to manipulate the LLM. It has no concept of per-team access policies.&lt;br&gt;
This isn't a flaw — it's a deliberate scope decision. Protocols stay minimal. The governance layer is built on top.&lt;/p&gt;

&lt;p&gt;For teams using MCP in local development or small-scale experiments, this gap is manageable. For teams deploying agents in production with multiple teams, sensitive data, and compliance requirements, the gap between what MCP provides and what enterprise deployment requires is significant.&lt;br&gt;
That gap is what an MCP gateway fills: a governance and security layer that sits in front of your MCP servers and handles authentication, access control, audit logging, and tool scoping in one place, consistently, for every agent that passes through it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; is built specifically for this layer. It connects to your existing identity provider, enforces RBAC at the tool level, logs every tool invocation with full context, and deploys entirely within your own infrastructure — so your data never leaves your environment. Teams already managing significant AI workloads use it to take MCP from working in a demo to working reliably in production, across teams, at enterprise scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;Explore TrueFoundry's MCP Gateway →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>5 Things That Go Wrong When You Run MCP Without a Gateway (And How Enterprises Fix Them)</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Mon, 30 Mar 2026 19:06:54 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/5-things-that-go-wrong-when-you-run-mcp-without-a-gateway-and-how-enterprises-fix-them-3jf1</link>
      <guid>https://dev.to/deeptishuklatfy/5-things-that-go-wrong-when-you-run-mcp-without-a-gateway-and-how-enterprises-fix-them-3jf1</guid>
      <description>&lt;p&gt;Every MCP tutorial ends the same way. The demo works. The agent finds the tool, calls it, gets a result, and everyone in the meeting nods appreciatively. Then someone asks: "How do we do this with our actual users, our actual data, and our actual compliance team?"&lt;br&gt;
That's where the tutorial stops and the real problems start.&lt;br&gt;
MCP — the &lt;a href="https://www.truefoundry.com/blog/what-is-mcp-gateway" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; released by Anthropic in November 2024 and now backed by OpenAI, Google, and Microsoft — is a genuinely good standard. It solved a real problem: before MCP, every AI-to-tool connection was custom-built, non-transferable, and rebuilt from scratch by every team. MCP made tool connections reusable and interoperable. That's valuable.&lt;/p&gt;

&lt;p&gt;What MCP doesn't include is a governance layer. The protocol defines how agents connect to tools. It doesn't define who's allowed to connect, what they can do when they get there, how you know what happened, or how you stop a compromised tool from doing something it shouldn't. That's not a criticism of &lt;a href="https://www.truefoundry.com/blog/what-is-mcp-gateway" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; — it's a deliberate scope decision. The protocol stays minimal. The governance is your problem.&lt;br&gt;
Running MCP without a gateway means you're solving that governance problem ad-hoc, in application code, differently for every team. Here's what that looks like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: No Central Visibility Into What Your Agents Are Actually Doing
&lt;/h2&gt;

&lt;p&gt;When agents connect directly to MCP servers, the audit trail is fragmented by design. Your LLM provider has logs of what the model was asked. Your MCP server has logs of what tool was called. Nothing connects them.&lt;br&gt;
When an agent does something unexpected — and it will — debugging means manually cross-referencing timestamps across three to five systems: the LLM call log, the MCP server log, whatever application logging you have, and possibly the downstream system the tool modified. There's no single record that says "this user triggered this agent, which made this LLM call, which called this tool, with these arguments, and got this result."&lt;br&gt;
In a low-stakes internal tool, that's annoying. In a regulated environment — healthcare, finance, legal — the absence of a coherent audit trail isn't just inconvenient. It's a compliance gap that can't be closed with documentation alone.&lt;br&gt;
The fix is a gateway that logs every tool invocation with full context: agent identity, user identity, tool name, arguments, response, and latency — all linked to the LLM call that triggered it. One record, one place, searchable and exportable.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; captures exactly this — every tools/list and tools/call invocation is logged with agent identity, user context, arguments, and response status, creating a coherent audit trail across all your MCP-connected systems. When something goes wrong, the answer is in one dashboard, not four log files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: Authentication Is a Patchwork That Nobody Owns
&lt;/h2&gt;

&lt;p&gt;In a direct-connection MCP setup, each server handles its own authentication. Some use API keys stored in environment variables. Some use OAuth flows that expire and nobody notices until an agent starts failing. Some, particularly internal tools built quickly, use nothing at all because the developer figured it was only accessible internally anyway.&lt;br&gt;
The result six months into any reasonably active MCP deployment: a collection of credentials scattered across config files, environment variables, and secrets managers with different rotation policies, different expiry timelines, and no central record of which agent is using which credential for which server.&lt;br&gt;
When an engineer leaves the company, you want to revoke their access to every system their agents could reach. With fragmented auth, you don't know what that list is. You search config files and hope you found everything.&lt;br&gt;
The fix is centralised authentication at the gateway layer, federated to your existing identity provider. Every agent authenticates to the gateway using your organisation's standard credentials — Okta, Azure AD, Google Workspace — and the gateway handles downstream authentication to individual MCP servers. Revoke someone's organisational access and the gateway propagates that revocation everywhere, automatically.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; integrates natively with enterprise identity providers via standard protocols, so access grants and revocations happen in one place and take effect across every connected MCP server immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: Agents Accumulate Permissions Far Beyond What They Need
&lt;/h2&gt;

&lt;p&gt;Permissions in direct-connection MCP setups tend to accrete. An agent that needed read access to a database got write access because it was easier at the time. A tool connection intended for one agent got reused by another because the credential was already in the shared config. A staging credential got copied to production because the deployment was urgent.&lt;br&gt;
None of these decisions are malicious. They're all the result of moving fast without a governance layer that enforces least-privilege by default.&lt;br&gt;
The consequence is agents with capabilities they were never meant to have. In a benign scenario, this means an agent occasionally does something surprising. In a less benign scenario, it means that when an agent is compromised — through a prompt injection attack, a malicious user input, or a buggy workflow — the blast radius is much larger than it needed to be.&lt;br&gt;
The fix is tool scoping at the gateway level. Agents only see the tools they're authorised to use. If a support agent isn't authorised to modify database records, it can't discover that tool in the first place, because the gateway filters the discovery response before it reaches the agent. What the agent can't see, it can't call.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; enforces granular RBAC at the tool level — a support agent sees support tools, a finance workflow sees finance tools, and never the other way around — configured centrally and enforced on every request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: Tool Poisoning Is a Real and Underestimated Attack Vector
&lt;/h2&gt;

&lt;p&gt;In April 2025, security researchers at Invariant Labs demonstrated a class of attack specific to MCP that doesn't exist in traditional API integrations: tool poisoning.&lt;br&gt;
The attack works like this: a malicious or compromised MCP server returns a tool response that contains hidden instructions embedded in the text. These instructions are formatted to be invisible to human reviewers but interpretable by the LLM as commands. The model reads the tool response, internalises the injected instruction, and executes it — potentially accessing data, calling other tools, or exfiltrating information — as part of its normal reasoning process.&lt;br&gt;
In the demonstrated exploit, an attacker was able to extract a user's WhatsApp message history by manipulating what appeared to be an innocuous get_fact_of_the_day() tool response. The user saw a daily fact. The agent extracted and transmitted message history.&lt;br&gt;
In a direct-connection setup, there is no inspection layer between the MCP server response and the LLM context. Whatever the tool returns, the model reads. A gateway that inspects tool responses before they re-enter LLM context can detect and sanitise injected instructions before they execute.&lt;br&gt;
&lt;a&gt;TrueFoundry's MCP Gateway&lt;/a&gt; includes guardrails for inspecting tool responses, providing an interception layer between MCP servers and the LLM context that direct-connection setups fundamentally cannot offer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 5: Scaling to Multiple Teams Turns Credential Management Into a Full-Time Job
&lt;/h2&gt;

&lt;p&gt;One team, one agent, two MCP servers: manageable. Four teams, fifteen agents, thirty MCP servers: credential management, access policy maintenance, and tool inventory tracking collectively become a second full-time engineering job that nobody was hired to do.&lt;br&gt;
The specific failure modes at scale: teams duplicate MCP server connections because they don't know another team already set one up. Access policies that were appropriate six months ago haven't been reviewed since. New MCP servers get added without going through any approval process because there isn't one. The person who understood the original setup has moved to a different team.&lt;br&gt;
The fix is a centralised MCP server registry with approval workflows. New servers are registered once, access policies are defined at registration, and authorised agents across all teams get access automatically without any per-team configuration work. The registry is the single source of truth for what tools exist and who can use them.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway &lt;/a&gt;includes exactly this registry — a centralised portal where MCP servers across cloud, on-premises, and hybrid deployments are visible in one view, with approval workflows that control which roles access which servers before any connection is established.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Across All Five
&lt;/h2&gt;

&lt;p&gt;Every problem above has the same root cause: governance that lives in application code rather than infrastructure. When governance is in the code, it's inconsistent across teams, invisible to anyone not reading that specific codebase, and bypassed the moment someone is in a hurry.&lt;br&gt;
When governance is in the infrastructure layer — the MCP gateway — it's consistent by default, visible to platform and security teams, and enforced regardless of how individual engineers implement their agents.&lt;br&gt;
MCP made the connection standard. The gateway makes the connection safe.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;Explore TrueFoundry's MCP Gateway →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Your AI Gateway Just Became an Attack Vector: Anatomy of the LiteLLM Supply Chain Compromise</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Fri, 27 Mar 2026 13:07:43 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/your-ai-gateway-just-became-an-attack-vector-anatomy-of-the-litellm-supply-chain-compromise-1g7m</link>
      <guid>https://dev.to/deeptishuklatfy/your-ai-gateway-just-became-an-attack-vector-anatomy-of-the-litellm-supply-chain-compromise-1g7m</guid>
      <description>&lt;p&gt;On March 24, 2026, two backdoored versions of LiteLLM — the popular open-source LLM proxy with &lt;strong&gt;3.4 million daily PyPI downloads&lt;/strong&gt; — were published to PyPI. They were live for roughly two to three hours before being quarantined. In that window, a three-stage credential stealer was deployed to every system that pulled the update, targeting everything from AWS keys to Kubernetes cluster secrets to cryptocurrency wallets.&lt;/p&gt;

&lt;p&gt;But this wasn't a simple account takeover. The LiteLLM compromise was the final link in a &lt;strong&gt;five-day cascading supply chain campaign&lt;/strong&gt; that started by weaponizing a vulnerability scanner. Here's the full story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kill Chain: From Security Scanner to AI Proxy
&lt;/h2&gt;

&lt;p&gt;The threat group behind this — tracked as &lt;strong&gt;TeamPCP&lt;/strong&gt;, with suspected (unconfirmed) ties to LAPSUS$ — didn't attack LiteLLM directly. They built a chain of compromises, each one enabling the next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Trivy (March 19)
&lt;/h3&gt;

&lt;p&gt;It started with Aqua Security's &lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt;, one of the most widely used open-source vulnerability scanners. Weeks earlier, an autonomous bot called &lt;code&gt;hackerbot-claw&lt;/code&gt; exploited a misconfigured &lt;code&gt;pull_request_target&lt;/code&gt; workflow in Trivy's repo to steal a Personal Access Token. Aqua rotated credentials — but the rotation was incomplete.&lt;/p&gt;

&lt;p&gt;On March 19, TeamPCP used the remaining credentials (which still had tag-writing privileges) to force-push malicious commits to &lt;strong&gt;76 of 77 version tags&lt;/strong&gt; in &lt;code&gt;aquasecurity/trivy-action&lt;/code&gt; and all 7 tags in &lt;code&gt;aquasecurity/setup-trivy&lt;/code&gt;. They also published an infected Trivy binary (v0.69.4) to GitHub Releases and container registries.&lt;/p&gt;

&lt;p&gt;A vulnerability scanner — a tool people install &lt;em&gt;specifically to make their pipelines more secure&lt;/em&gt; — became the initial attack vector. The irony is hard to overstate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: npm Worm (March 20)
&lt;/h3&gt;

&lt;p&gt;npm tokens stolen from Trivy's CI environment fed a self-propagating worm called &lt;strong&gt;CanisterWorm&lt;/strong&gt; that infected 66+ npm packages. The blast radius was expanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Checkmarx KICS (March 23)
&lt;/h3&gt;

&lt;p&gt;All 35 tags of &lt;code&gt;Checkmarx/kics-github-action&lt;/code&gt; — another security scanning tool — were hijacked using a compromised service account, likely harvested from one of the earlier compromises. &lt;strong&gt;Two security scanners now compromised in the same campaign.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: LiteLLM (March 24)
&lt;/h3&gt;

&lt;p&gt;LiteLLM's CI/CD pipeline ran the compromised Trivy action. TeamPCP harvested PyPI publishing credentials from that pipeline and used them to publish backdoored versions (v1.82.7 and v1.82.8) directly to PyPI, completely bypassing the project's normal release workflow.&lt;/p&gt;

&lt;p&gt;The chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Vulnerable CI workflow → compromised security scanner → stolen CI secrets → compromised AI proxy serving millions of downloads per day&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Inside the Payload: Three Stages of Compromise
&lt;/h2&gt;

&lt;p&gt;This wasn't a lazy crypto-miner. The malware was engineered for &lt;strong&gt;deep, persistent infiltration&lt;/strong&gt; with encrypted exfiltration and a built-in researcher-defeat mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1 — Silent Activation
&lt;/h3&gt;

&lt;p&gt;The package drops a 34KB file called &lt;code&gt;litellm_init.pth&lt;/code&gt; into Python's site-packages directory. Python's &lt;code&gt;.pth&lt;/code&gt; file mechanism is designed for path configuration, but it can execute arbitrary code — and it does so &lt;strong&gt;on every Python interpreter startup&lt;/strong&gt;, not just when LiteLLM is imported.&lt;/p&gt;

&lt;p&gt;If the package was installed in your environment, the payload was running on every Python process. No &lt;code&gt;import litellm&lt;/code&gt; required. This is a legitimate Python feature that doubles as a devastating attack surface, and it deserves far more attention from the Python security community.&lt;/p&gt;

&lt;p&gt;Additionally, malicious code was injected into &lt;code&gt;proxy_server.py&lt;/code&gt; in both affected versions, hitting anyone who actually ran the LiteLLM proxy directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2 — Reconnaissance and Credential Harvesting
&lt;/h3&gt;

&lt;p&gt;The second stage performs deep system enumeration and sweeps for sensitive data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSH keys&lt;/strong&gt; and Git credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud provider credentials&lt;/strong&gt; — AWS access keys, GCP application default credentials, Azure tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes configs&lt;/strong&gt; — kubeconfig files and service account tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure secrets&lt;/strong&gt; — Terraform state files, Helm configs, CI/CD environment variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application secrets&lt;/strong&gt; — &lt;code&gt;.env&lt;/code&gt; files, database connection strings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cryptocurrency wallets&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The malware didn't just grab files. It actively &lt;strong&gt;queried discovered credentials&lt;/strong&gt; — calling AWS APIs, listing Kubernetes secrets across namespaces — to validate and expand access.&lt;/p&gt;

&lt;p&gt;All harvested data was encrypted with AES-256-CBC using a randomly generated session key. That session key was then encrypted with a hardcoded 4096-bit RSA public key. The package was bundled as &lt;code&gt;tpcp.tar.gz&lt;/code&gt; and exfiltrated to &lt;code&gt;models[.]litellm[.]cloud&lt;/code&gt; — a domain deliberately chosen to look like legitimate LiteLLM infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3 — Persistence and Lateral Movement
&lt;/h3&gt;

&lt;p&gt;The final stage installs a systemd service (&lt;code&gt;sysmon.py&lt;/code&gt;) that polls a command-and-control server every 50 minutes for additional payloads to execute. This survives package uninstallation — removing &lt;code&gt;litellm&lt;/code&gt; from pip does not remove the backdoor.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Kubernetes environments&lt;/strong&gt;, the malware goes further: it reads all cluster secrets across all namespaces, then attempts to deploy &lt;strong&gt;privileged pods on every node&lt;/strong&gt; in the &lt;code&gt;kube-system&lt;/code&gt; namespace. The goal is full cluster takeover.&lt;/p&gt;

&lt;p&gt;One notable detail: the C2 polling mechanism includes a filter that rejects responses containing "youtube.com" — a simple but effective technique to defeat security researchers using mock C2 servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Gateways Are High-Value Targets
&lt;/h2&gt;

&lt;p&gt;LiteLLM is an AI gateway — it sits between your application and every LLM provider you use (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex AI, and dozens more). By design, it holds API keys for all of them. It often runs with broad network access, frequently inside Kubernetes clusters alongside other production services.&lt;/p&gt;

&lt;p&gt;This makes AI gateways uniquely attractive targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential density is extreme.&lt;/strong&gt; A single compromised LiteLLM instance can yield API keys for every LLM provider an organization uses, plus whatever infrastructure credentials exist on the host. Compare this to compromising a single-purpose microservice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment environments are privileged.&lt;/strong&gt; Most serious LLM deployments run on Kubernetes. The LiteLLM proxy typically needs network access to external APIs, often has access to secrets stores, and runs in clusters alongside other production workloads. Compromising it gives lateral movement opportunities that the TeamPCP malware was explicitly designed to exploit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update velocity is high.&lt;/strong&gt; The AI ecosystem moves fast. Teams often track the latest versions of tools like LiteLLM to get new model support, bug fixes, and features. This creates a wide window for supply chain attacks — automated pipelines pull updates quickly, and manual review of each release is rare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security maturity lags adoption.&lt;/strong&gt; Many teams deploying LLM infrastructure haven't applied the same supply chain security rigor they use for traditional dependencies. Pinned versions, checksum verification, artifact attestation, and staged rollouts are often absent from AI tooling pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Do
&lt;/h2&gt;

&lt;h3&gt;
  
  
  If you installed litellm v1.82.7 or v1.82.8
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Treat the entire host or container as compromised.&lt;/strong&gt; Uninstalling the package is insufficient — the systemd persistence mechanism survives pip uninstall.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolate affected systems&lt;/strong&gt; immediately from the network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for the backdoor&lt;/strong&gt;: check for &lt;code&gt;sysmon.py&lt;/code&gt; and associated systemd services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate everything&lt;/strong&gt;: SSH keys, cloud credentials (AWS/GCP/Azure), Kubernetes configs and service account tokens, all LLM provider API keys, database passwords, CI/CD secrets, &lt;code&gt;.env&lt;/code&gt; contents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Kubernetes&lt;/strong&gt;: audit for unauthorized privileged pods in &lt;code&gt;kube-system&lt;/code&gt;, review secrets access logs via audit trails, check for unknown service accounts or role bindings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review network logs&lt;/strong&gt; for connections to &lt;code&gt;models[.]litellm[.]cloud&lt;/code&gt; and &lt;code&gt;checkmarx[.]zone&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rebuild affected systems&lt;/strong&gt; from known-good images. Credential rotation alone may not be sufficient if the C2 channel delivered additional payloads.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For everyone: harden your AI supply chain
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pin exact versions and verify checksums.&lt;/strong&gt; Never use &lt;code&gt;&amp;gt;=&lt;/code&gt; or &lt;code&gt;~=&lt;/code&gt; for critical infrastructure dependencies. Use hash-pinning in requirements files (&lt;code&gt;--require-hashes&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit your CI/CD pipeline dependencies.&lt;/strong&gt; The entire LiteLLM compromise happened because a GitHub Action in the CI pipeline was compromised. Do you know which third-party actions have access to your publishing secrets? Pin actions to commit SHAs, not tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use artifact attestation.&lt;/strong&gt; Sigstore and similar tools can verify that a package was built from a specific source commit by a specific workflow. If LiteLLM's releases had been attested and consumers had verified attestations, the malicious versions would have been rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolate your AI gateway.&lt;/strong&gt; Your LLM proxy doesn't need access to your entire cloud account, your Kubernetes cluster secrets, or your SSH keys. Run it in a minimal environment with only the credentials it actually needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor for unexpected releases.&lt;/strong&gt; Set up alerts for new versions of critical dependencies. If your AI gateway publishes a new version outside normal release patterns, investigate before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rethinking the AI Gateway Layer
&lt;/h2&gt;

&lt;p&gt;This incident highlights a structural problem: when a single open-source package becomes the chokepoint for all your LLM traffic &lt;em&gt;and&lt;/em&gt; runs as a self-managed proxy in your infrastructure, a supply chain compromise becomes a skeleton key to your entire AI stack.&lt;/p&gt;

&lt;p&gt;It's worth evaluating alternatives that reduce this risk surface. Managed AI gateway solutions like &lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; take a fundamentally different approach — the gateway runs as managed infrastructure with enterprise-grade security controls, rather than as a PyPI package you pull into your own environment and trust to self-update. This means the attack surface of "compromised package in your CI/CD" simply doesn't exist for the gateway layer. TrueFoundry also provides built-in secrets management, RBAC, and audit logging for LLM API keys, so credentials aren't scattered across environment variables waiting to be harvested.&lt;/p&gt;

&lt;p&gt;This isn't about any single tool being inherently unsafe — the LiteLLM maintainers were themselves victims of an upstream compromise. It's about whether the &lt;strong&gt;deployment model&lt;/strong&gt; of your AI gateway introduces unnecessary risk. Self-managed open-source proxies require you to own the entire supply chain security burden. Managed platforms shift that burden to a team whose full-time job is securing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The TeamPCP campaign (tracked as CVE-2026-33634 for the Trivy component, sonatype-2026-001357 for LiteLLM) is being analyzed by security teams across the industry — Sonatype, Wiz, Datadog Security Labs, Snyk, ReversingLabs, Kaspersky, and Palo Alto Networks have all published detailed technical reports.&lt;/p&gt;

&lt;p&gt;With an estimated &lt;strong&gt;500,000+ credentials already exfiltrated&lt;/strong&gt; and the C2 infrastructure having had time to deliver additional payloads, the full impact of this campaign will take months to assess.&lt;/p&gt;

&lt;p&gt;The AI ecosystem has inherited all of the software supply chain's worst problems without the maturity to deal with them. If there's one takeaway from this incident, it's this: &lt;strong&gt;your AI infrastructure deserves the same supply chain security rigor as the rest of your stack&lt;/strong&gt; — and probably more, given what it has access to.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're dealing with incident response on this, the detailed technical analyses from &lt;a href="https://www.sonatype.com/blog/compromised-litellm-pypi-package-delivers-multi-stage-credential-stealer" rel="noopener noreferrer"&gt;Sonatype&lt;/a&gt;, &lt;a href="https://securitylabs.datadoghq.com/articles/litellm-compromised-pypi-teampcp-supply-chain-campaign/" rel="noopener noreferrer"&gt;Datadog Security Labs&lt;/a&gt;, and &lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;Wiz&lt;/a&gt; are excellent starting points.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
      <category>security</category>
    </item>
    <item>
      <title>TrueFoundry vs Bifrost: Performance Benchmark on Agentic Workloads</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Thu, 26 Mar 2026 09:43:32 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/truefoundry-vs-bifrost-performance-benchmark-on-agentic-workloads-4h21</link>
      <guid>https://dev.to/deeptishuklatfy/truefoundry-vs-bifrost-performance-benchmark-on-agentic-workloads-4h21</guid>
      <description>&lt;p&gt;Raw gateway latency is easy to benchmark. You spin up a load test, fire 5,000 requests per second at an endpoint, and report the overhead number. Bifrost does this very well — 11µs of added overhead at 5K RPS is a genuinely impressive number and a reflection of building in Go rather than Python.&lt;br&gt;
But agentic workloads don't look like 5,000 identical chat completions in a tight loop. They look like this: an agent receives a task, decides which tool to call, invokes an MCP server, gets a result, calls a different LLM with that result as context, hits a rate limit, retries with exponential backoff on a fallback model, generates a response, and logs the entire chain for debugging. That sequence involves 4–8 distinct gateway operations per user-facing request, crosses provider and tool boundaries, and fails in entirely different ways than a simple proxy failure.&lt;br&gt;
When you benchmark AI gateways against agentic workloads — not synthetic throughput tests — the performance dimensions that matter shift significantly. This article breaks down how &lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; and Bifrost compare across each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Comparing
&lt;/h2&gt;

&lt;p&gt;Bifrost is an open-source AI gateway built in Go by Maxim AI. It's purpose-built for high-throughput LLM routing with a focus on minimal overhead, automatic failover, and a unified API across 20+ providers. It's genuinely fast, has clean MCP support, and is free to self-host under Apache 2.0. Its target audience is developers who want maximum performance with full control over their own infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; is an enterprise AI platform with an AI Gateway at its core. It covers the full stack from model deployment and fine-tuning to LLM routing, MCP governance, prompt management, and observability — all on Kubernetes, deployable in your VPC or on-premises. It's recognised in the &lt;a href="https://www.truefoundry.com/gartner-2025-market-guide-ai-gateways?utm_source=hello_bar&amp;amp;utm_medium=website" rel="noopener noreferrer"&gt;2025 Gartner Market Guide for AI Gateways&lt;/a&gt; and targets enterprise ML teams who need governance, multi-team controls, and production reliability across both LLMs and the infrastructure they run on.&lt;/p&gt;

&lt;p&gt;These are not the same product aimed at the same buyer. Understanding where each wins requires being precise about which agentic performance dimensions actually matter in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 1: Raw Routing Overhead
&lt;/h3&gt;

&lt;p&gt;Bifrost wins here — and by a significant margin on the raw number.&lt;br&gt;
Bifrost adds approximately 11µs of overhead per request at 5,000 RPS. That's not a typo. Eleven microseconds. It's the direct result of building in Go with zero-copy message passing and in-memory state, and it's the benchmark Bifrost leads with for good reason.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt; operates at 3–4ms of overhead at 350+ RPS per vCPU. That's a larger absolute latency number. For a simple prompt-and-response path, Bifrost is faster.&lt;br&gt;
Why this matters less for agentic workloads than it appears: In a multi-step agent loop, the dominant latency is LLM inference time — typically 500ms to 5,000ms per call depending on model and response length. Gateway overhead of 3–4ms represents 0.1–0.6% of total agent loop latency. Whether your gateway adds 11µs or 4ms is irrelevant when the agent is waiting 2 seconds for Claude to respond.&lt;br&gt;
Where raw overhead matters is high-frequency, short-context workloads: classification pipelines, embedding generation at scale, real-time routing decisions. For those workloads, Bifrost's architecture is the right choice.&lt;br&gt;
For multi-step agentic workflows with tool calls, retrieval, and LLM reasoning, gateway overhead is not the bottleneck and optimising for it comes at the cost of the capabilities that actually determine reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 2: MCP Tool Call Governance
&lt;/h3&gt;

&lt;p&gt;TrueFoundry wins for enterprise deployments.&lt;br&gt;
Both platforms support MCP natively. The architectural difference is what each platform does around tool execution.&lt;br&gt;
Bifrost operates as both an MCP client and MCP server, supports STDIO/HTTP/SSE transports, and requires explicit execution through the /v1/mcp/tool/execute endpoint rather than auto-executing tool calls. This is sensible security design. What it doesn't provide out of the box is enterprise identity federation: tying MCP tool access to your existing Okta, Azure AD, or Google Workspace identity provider so that tool permissions inherit from the user's organisational role.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; is built around enterprise RBAC from the ground up. Tool access is scoped to organisational identity — an agent running on behalf of a user in the Finance team can access read tools for financial data and nothing else, enforced at the gateway level rather than in application code. Every tool call is traceable to an authenticated identity, logged with full request context, and auditable for compliance purposes. The MCP server registry auto-discovers registered servers and applies access policies on connection, not on each call.&lt;/p&gt;

&lt;p&gt;For a startup with one team building one agent, Bifrost's MCP handling is entirely sufficient. For an enterprise with 15 teams, 40 agents, and a compliance requirement to demonstrate that no agent accessed data outside its authorised scope, TrueFoundry's governance layer is what makes that demonstration possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 3: Agentic Failure Recovery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; wins on multi-dimensional fallback logic.&lt;br&gt;
Both platforms handle the basic case: provider returns a 5xx error, gateway routes to the fallback model. This is table stakes.&lt;br&gt;
The harder agentic failure modes are more specific:&lt;br&gt;
Budget-triggered fallback during an agent run. An agent loop that starts on GPT-4o and hits the team's token budget mid-session should degrade gracefully to a cheaper model, not fail the entire agent task. &lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry's budget policies &lt;/a&gt;and fallback routing handle this as a first-class case: the fallback trigger is not only provider failure but also cost threshold breach, with per-team policy controlling the degradation path.&lt;br&gt;
Latency-based fallback for real-time agents. If an LLM provider's p95 latency spikes above your threshold during a user-facing agent interaction, the gateway should detect the degradation and reroute before the user notices. TrueFoundry's adaptive routing monitors real-time provider latency and adjusts routing continuously, not just on hard failure.&lt;br&gt;
Tool call failure handling in agent chains. When an MCP tool call fails in the middle of a multi-step agent workflow, the recovery path is different from an LLM call failure — you can't just retry the same tool call if the failure was a permissions error or a malformed request. TrueFoundry traces the full agent chain and surfaces tool call failures with context about where in the workflow they occurred, which makes debugging and recovery substantially faster.&lt;/p&gt;

&lt;p&gt;Bifrost handles provider-level failover cleanly. It doesn't have the same depth of per-team budget enforcement or agentic workflow tracing that makes the more complex failure modes manageable in enterprise production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 4: Observability at Agent Chain Depth
&lt;/h3&gt;

&lt;p&gt;TrueFoundry wins for multi-step agent debugging.&lt;br&gt;
Bifrost offers solid infrastructure-level observability: native Prometheus metrics, OpenTelemetry support, Grafana/Datadog integration, structured logging. This is what you need to monitor gateway health, track request throughput, and alert on error rate spikes.&lt;br&gt;
What it doesn't provide natively is observability into the agent chain: the sequence of LLM calls, tool invocations, context accumulation, and decision points that constitute a single agent task execution. When an agent produces a wrong answer or takes an unexpected action, infrastructure metrics tell you the request completed in 4.2 seconds with 12,000 tokens. They don't tell you which tool call returned unexpected data, which prompt version was active, or where in the reasoning chain the model made the wrong decision.&lt;br&gt;
TrueFoundry captures full chain traces: each LLM call in a multi-step agent task is linked to the preceding tool call and the following model response, with token counts, latency, model identity, prompt version, and cost attributed at the step level. Combined with &lt;a href="https://www.truefoundry.com/prompt-management" rel="noopener noreferrer"&gt;TrueFoundry's prompt management&lt;/a&gt;, you can identify whether a quality regression in agent output was caused by a model change, a prompt change, a tool returning different data, or a budget-triggered model fallback — because all of those events are captured in the same trace.&lt;br&gt;
This is not a feature most teams need when they're running their first agent in staging. It's the feature that determines whether debugging a production incident takes 20 minutes or two days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 5: Deployment Model and Data Residency
&lt;/h3&gt;

&lt;p&gt;TrueFoundry wins for regulated enterprises.&lt;br&gt;
Bifrost supports VPC deployment with private cloud infrastructure, which covers the baseline data residency requirement: your gateway doesn't send traffic through third-party infrastructure.&lt;br&gt;
TrueFoundry's deployment architecture goes further. Its Control Plane and Data Plane are explicitly decoupled, meaning that no inference data, prompt content, model output, or agent trace ever transits through TrueFoundry's infrastructure. Everything stays within your cloud region or on-premises environment. For organisations subject to GDPR, HIPAA, or financial services data localisation requirements, this decoupled architecture is what makes compliance demonstrable rather than assumed.&lt;br&gt;
Additionally, TrueFoundry runs on Kubernetes natively across EKS, AKS, GKE, and on-premises clusters. If you're already running AI workloads on Kubernetes, TrueFoundry integrates into your existing infrastructure model rather than introducing a separate deployment paradigm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Bifrost if:
&lt;/h3&gt;

&lt;p&gt;You're a developer-first team that needs maximum raw throughput, you're comfortable managing your own infrastructure, your agentic workloads are relatively homogenous, and enterprise governance requirements are light. The zero-config startup and open-source foundation make it genuinely the fastest path from zero to a working gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose TrueFoundry if:
&lt;/h3&gt;

&lt;p&gt;You're running AI across multiple teams with different cost budgets and model access policies, your agents call enterprise tools that require identity-scoped access control, you need to demonstrate data residency compliance, or you want a single platform that covers model deployment, fine-tuning, LLM routing, and observability without stitching together separate tools. TrueFoundry customers report 40–60% reductions in LLM infrastructure costs and deployment timeline reductions of over 50% — outcomes that come from the governance and observability layer, not the routing layer.&lt;br&gt;
The 11µs vs 3–4ms gap is real. It's also the wrong thing to optimise for in most enterprise agentic deployments. What determines whether your AI agents work reliably in production at scale isn't how fast your gateway proxies a request. It's whether you can see what they're doing, control what they cost, govern what they access, and debug them when they fail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;See TrueFoundry's AI Gateway&lt;/a&gt; → · &lt;a href="https://www.truefoundry.com/gartner-2025-market-guide-ai-gateways?utm_source=hello_bar&amp;amp;utm_medium=website" rel="noopener noreferrer"&gt;Read the 2025 Gartner Market Guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>7 Things Your AI Gateway Should Be Doing in Production (Most Aren't Doing 3 of Them)</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Tue, 24 Mar 2026 14:57:58 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/7-things-your-ai-gateway-should-be-doing-in-production-most-arent-doing-3-of-them-n44</link>
      <guid>https://dev.to/deeptishuklatfy/7-things-your-ai-gateway-should-be-doing-in-production-most-arent-doing-3-of-them-n44</guid>
      <description>&lt;p&gt;Most teams set up an &lt;a href="https://www.truefoundry.com/blog/generative-ai-gateway" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; the same way they set up a reverse proxy in 2012: route the traffic, add a key, move on. It works until it doesn't — and when it stops working in production, it stops working loudly.&lt;br&gt;
An &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; is not an API proxy with a language model on the other end. It's the control plane for everything your AI systems do in production: how they access models, how much they spend, how they behave when a provider goes down, what data leaves your infrastructure, and how you debug it when something goes wrong at 2am.&lt;br&gt;
The gap between what most AI gateways are doing and what they should be doing is wide. Here are the seven things a production AI gateway needs to do, including the three that most teams haven't gotten to yet — and what it costs them when they don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Unified Multi-Provider Access With a Single API Contract ✅ Most are doing this
&lt;/h2&gt;

&lt;p&gt;This is the baseline. A production &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;AI gateway &lt;/a&gt;should give your engineers a single endpoint and a single authentication method that works regardless of which LLM provider or model is behind it — OpenAI, Anthropic, Gemini, Mistral, Groq, or a self-hosted model running on your own GPU cluster.&lt;br&gt;
The practical value is that your application code never changes when you switch models. You don't update base URLs, regenerate credentials, or modify request schemas when you move from Claude Sonnet 4 to GPT-4o or add a self-hosted Llama 3 to the mix. The gateway handles the translation.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry's&lt;/a&gt; AI Gateway connects to 250+ LLM providers — including hosted providers and self-hosted models running on vLLM, TGI, or Triton — through one API endpoint. Engineers configure their client once. The platform team controls which models are available, at what cost, to whom.&lt;br&gt;
This is table stakes. If your gateway isn't doing this, it's not really a gateway — it's a forwarding rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Automatic Fallback and Failover Routing ✅ Most are doing this
&lt;/h2&gt;

&lt;p&gt;Provider outages happen. OpenAI has had multiple significant incidents in the past 18 months. Anthropic has throttled requests during peak periods. A production system that routes all traffic through a single provider without a fallback strategy is a production system with a single point of failure.&lt;br&gt;
A gateway should detect provider errors in real time — 429 rate limit responses, 5xx errors, latency spikes above a configurable threshold — and automatically reroute to a fallback model without the application layer ever knowing there was a problem.&lt;br&gt;
The configuration should be flexible: you might want GPT-4o to fall back to Claude Sonnet 4 for quality-sensitive paths, but fall back to GPT-4o Mini for high-volume, cost-sensitive paths where acceptable quality is lower. These are different fallback policies and they should be independently configurable per route.&lt;br&gt;
This is also largely understood by now. The more interesting question is whether your gateway is doing the fallback routing intelligently — based on error rate, latency percentile, and cost — or just blindly switching on any failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Per-Team Spend Enforcement With Real-Time Budget Tracking ✅ Most are doing this, badly
&lt;/h2&gt;

&lt;p&gt;Spend visibility and spend enforcement are different things, and most teams have the first without the second.&lt;br&gt;
Visibility means you can see — at the end of the month, or after the fact — which team consumed how many tokens. Enforcement means that when the data science team hits 80% of their monthly token budget on the 15th, something happens automatically: an alert fires, requests route to a cheaper fallback model, or a hard cap kicks in before the overage.&lt;br&gt;
The enforcement layer is what most gateways are missing. They expose usage dashboards. They don't enforce policy at the request level in real time.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; lets you configure per-team, per-project, and per-environment budget policies that enforce at the gateway layer before a request reaches the provider. When a team hits their threshold, the gateway can alert, downgrade model routing, or hard cap — based on whatever policy you've set. The application doesn't break. The bill doesn't surprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Full Request-Level Observability, Not Just Aggregate Metrics ⚠️ Most are doing this partially
&lt;/h2&gt;

&lt;p&gt;This is where the gap starts to open up.&lt;br&gt;
Aggregate metrics — total tokens consumed, average latency, error rate by provider — are useful for billing and capacity planning. They tell you almost nothing about why your production AI system is behaving the way it is.&lt;br&gt;
Request-level observability means capturing the full trace of every LLM call: the prompt, the response, the token breakdown (input vs output), the model used, the latency at each layer, the team and user that made the request, and the cost attributed to that specific call. This is what you need to debug production issues, identify expensive prompt patterns, catch quality regressions, and build a feedback loop for improvement.&lt;br&gt;
The difference between aggregate metrics and request-level tracing is roughly the difference between knowing your application has high CPU and knowing which function is causing it.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; captures full request traces — prompt, completion, token counts, latency, model attribution, cost, and team identity — and surfaces them in a real-time dashboard with filtering by team, model, time range, and error state. When something behaves unexpectedly in production, the answer is usually visible in the trace data within minutes.&lt;br&gt;
Most teams using lighter-weight gateways have aggregates but not traces. They know the total. They can't explain the individual.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. PII Detection and Data Residency Controls ❌ Most are NOT doing this
&lt;/h2&gt;

&lt;p&gt;This is the first of the three things most gateways aren't doing — and in regulated industries, it's the one that creates the most legal exposure.&lt;br&gt;
When your engineers send prompts to external LLM providers, those prompts routinely contain data that should never leave your infrastructure: customer names and email addresses embedded in support ticket context, financial figures in analyst-facing tools, patient identifiers in healthcare applications, proprietary code in developer-facing copilots.&lt;br&gt;
Most teams handle this through developer guidelines and code review. Both fail in production. Guidelines aren't enforced. Code review doesn't catch every case. Context-stuffing patterns that look safe at the individual call level can expose sensitive data in aggregate.&lt;br&gt;
A production AI gateway should inspect outbound prompts for PII and sensitive data patterns before they leave your infrastructure — and either redact, block, or route to a self-hosted model depending on the sensitivity of what was found. This enforcement has to happen at the gateway layer to be reliable, because it can't depend on application-level compliance by every team and every developer.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/blog/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt; includes guardrails for PII detection and content moderation that apply at the request level, before data reaches any external provider. For organisations with strict data residency requirements — GDPR, HIPAA, financial services regulations — the gateway can be deployed entirely within your VPC or on-premises, ensuring that no inference data, no prompt content, and no response ever transits through third-party infrastructure.&lt;br&gt;
Most teams know they have this problem. Most haven't instrumented a solution at the infrastructure layer yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Versioned Prompt Management Tied to Deployment ❌ Most are NOT doing this
&lt;/h2&gt;

&lt;p&gt;Prompts are code. Most teams aren't treating them that way.&lt;br&gt;
The typical state of prompt management in a production AI team: prompts are hardcoded strings in application code, changed via pull request with no systematic evaluation, deployed as part of a general application release with no ability to roll back the prompt independently of the application, and never formally versioned in a way that lets you compare performance across versions.&lt;br&gt;
This creates a class of production bugs that are uniquely painful: the model's behaviour changed, but nothing in the deployment pipeline changed — because the prompt changed in a way that wasn't tracked, or a model was swapped at the provider level without a corresponding prompt update.&lt;br&gt;
A production AI gateway should include prompt versioning as a first-class feature: version-controlled prompt templates, the ability to run A/B tests between prompt versions with statistical tracking, rollback to a previous prompt version in seconds without a full application redeploy, and full traceability connecting which prompt version was used for which request.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; includes prompt management natively within the gateway layer: version-controlled templates, A/B testing across prompt versions, and full trace linkage so you can see exactly which prompt version produced which output for any specific request in production. When a quality regression hits, you can identify whether it was a model change, a prompt change, or a data change — and roll back the right thing.&lt;br&gt;
Teams running prompts as unversioned strings in application code are accumulating technical debt that compounds every time they make a change they can't formally evaluate.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. &lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; for Agentic Tool Access ❌ Most are NOT doing this (yet)
&lt;/h2&gt;

&lt;p&gt;This is the newest gap, and the one that's going to matter most over the next 12 months.&lt;br&gt;
As AI systems move from single-turn completions to multi-step agentic workflows, the attack surface and governance requirements change fundamentally. An agent that can call tools — search the web, query your database, execute code, send emails, update CRM records — needs a governance layer that's categorically different from a prompt-and-response proxy.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) is the emerging standard for how agents discover and call tools. Without a gateway layer in front of MCP, you have agents making arbitrary tool calls with no access control, no audit trail, no rate limiting, and no way to enforce which tools a given agent is allowed to use.&lt;br&gt;
The specific risks: prompt injection attacks that cause agents to call tools the application developer never intended; agents accumulating permissions that exceed what any individual request should have; tool calls that exfiltrate data or trigger external side effects with no audit log; and no mechanism to restrict which &lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;MCP servers&lt;/a&gt; a given team or application can access.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry's MCP Gateway&lt;/a&gt; provides a secure, governed access layer in front of your MCP servers: RBAC enforcement at the tool level (this agent can call search and read, but not write or execute), full request tracing for every tool call, integration with enterprise identity providers like Okta and Azure AD, and auto-discovery of registered MCP servers with proper access controls applied automatically.&lt;br&gt;
Most teams building agentic systems right now are connecting directly to MCP servers without any gateway layer. The governance debt they're accumulating will become visible the first time an agent does something it shouldn't have been able to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-Minute Audit for Your Current Gateway
&lt;/h2&gt;

&lt;p&gt;Before evaluating alternatives, it's worth auditing what your current setup is actually doing. Ask these questions:&lt;br&gt;
On PII and data residency: Can you demonstrate that no customer PII has ever been sent to an external LLM provider in a prompt? If the answer is "I think so" or "our developers know not to do that," the answer is no.&lt;br&gt;
On prompt versioning: Can you tell me which prompt version was used for any specific production request from last Tuesday? If you'd need to check git blame and cross-reference a deployment log, the answer is no.&lt;br&gt;
On agentic tool access: If you have agents calling tools, can you pull an audit log of every tool call made in the last 7 days, with the agent identity and the justification from the model? If not, the answer is no.&lt;br&gt;
Most teams are 4 out of 7 on this list. Getting to 7 out of 7 doesn't require replacing your infrastructure — it requires picking a gateway platform that covers the full surface area, not just the routing layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Most Gateways Stop at 4
&lt;/h2&gt;

&lt;p&gt;The first four capabilities on this list — unified access, fallback routing, spend tracking, and aggregate observability — are relatively straightforward to build. They've been commoditised. Several open-source options cover them adequately.&lt;br&gt;
The last three — PII enforcement, prompt versioning, and agentic governance — are harder because they require the gateway to understand the semantics of what's passing through it, not just the routing. They require integration with your identity provider, your compliance framework, your deployment pipeline. They require the gateway to be a platform, not a proxy.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; is built as that platform. It's recognised in the 2025 Gartner Market Guide for AI Gateways, handles 350+ requests per second on a single vCPU at 3–4ms of added latency, and can be deployed fully within your VPC for organisations with strict data residency requirements.&lt;br&gt;
The teams that will have well-governed, cost-efficient, production-reliable AI systems in 12 months are the ones adding these last three capabilities now, before the agentic complexity compounds.&lt;br&gt;
Explore TrueFoundry's AI Gateway →&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Enforce LLM Spend Limits Per Team Without Slowing Down Your Engineers</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Mon, 23 Mar 2026 16:27:29 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/how-to-enforce-llm-spend-limits-per-team-without-slowing-down-your-engineers-ml</link>
      <guid>https://dev.to/deeptishuklatfy/how-to-enforce-llm-spend-limits-per-team-without-slowing-down-your-engineers-ml</guid>
      <description>&lt;p&gt;Every AI platform team eventually hits the same moment: finance sends a spreadsheet, engineering doesn't know where the tokens went, and someone on the data science team just ran a 400,000-token context window against GPT-4o to test a hypothesis on a Friday afternoon.&lt;br&gt;
LLM costs don't creep up on you. They sprint.&lt;/p&gt;

&lt;p&gt;According to Andreessen Horowitz, AI infrastructure spending — primarily on LLM API calls — is consuming 20–40% of revenue at many early-stage AI companies. For enterprises, uncontrolled LLM usage across teams can turn a predictable cloud cost line into a surprise at the end of every billing cycle.&lt;/p&gt;

&lt;p&gt;The instinct is to lock things down: centralize API keys, require approvals, add manual budgeting steps. But that instinct is wrong. The moment you make it hard for engineers to access LLMs, they route around the controls — using personal API keys, shadow accounts, or skipping experimentation altogether. You trade cost visibility for velocity, and you lose both.&lt;/p&gt;

&lt;p&gt;The right approach is programmatic spend enforcement at the infrastructure layer, invisible to engineers during normal usage and firm at the boundaries. Here's how to build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LLM Costs Are So Hard to Control Without Infrastructure
&lt;/h2&gt;

&lt;p&gt;Before getting into solutions, it's worth understanding why this problem is uniquely difficult for LLMs compared to traditional cloud cost management.&lt;br&gt;
With compute or storage, you provision resources in advance and costs are predictable. With LLMs, costs are generated at inference time, driven by factors your engineers may not even think about: prompt length, context window size, response verbosity, retry logic on failures, and the choice between a $0.002/1K token model versus a $0.015/1K token model.&lt;/p&gt;

&lt;p&gt;A single agent loop that retries on failure can multiply expected costs by 5–10x. A well-intentioned developer who switches from GPT-4o Mini to GPT-4o for "better quality" can increase costs per call by 25x without changing a single line of business logic.&lt;/p&gt;

&lt;p&gt;Three specific failure modes show up repeatedly in production AI systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No per-team visibility.&lt;/strong&gt; Most companies using LLM APIs through a shared key have zero insight into which team, product, or feature is responsible for which spend. When the bill comes, the breakdown is "OpenAI: $47,000" with no further detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No enforcement boundary.&lt;/strong&gt; Even if you have visibility, there's typically no mechanism to stop a team from exceeding their budget mid-cycle without manually revoking API access — which breaks everything downstream.&lt;br&gt;
&lt;strong&gt;Governance that blocks experimentation.&lt;/strong&gt; Manual approval workflows, centralized key management with a ticket queue, or flat rate limits that apply equally to production and development environments all create friction that slows down the teams doing the most valuable work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture That Actually Works: An AI Gateway with Budget Controls
&lt;/h2&gt;

&lt;p&gt;The solution is an AI gateway — a proxy layer that sits between your engineers and every LLM provider, intercepts every API call, and enforces spend policies in real time without adding meaningful latency.&lt;br&gt;
Think of it as the IAM layer for LLM access. Your engineers don't call OpenAI directly. They call your gateway, which routes to the right provider, enforces their team's quota, logs the usage, and routes to a fallback model if they're approaching a budget ceiling.&lt;/p&gt;

&lt;p&gt;The gateway approach works because it decouples policy from access. Engineers get unified credentials that work across every model provider. Platform teams set the rules. Nobody needs to coordinate.&lt;/p&gt;

&lt;p&gt;Here's what that architecture needs to do well:&lt;br&gt;
&lt;strong&gt;Per-team quota management&lt;/strong&gt;— token limits, request limits, and spend limits that apply to a specific team, project, or even individual user, configurable independently.&lt;br&gt;
&lt;strong&gt;Real-time monitoring&lt;/strong&gt;— usage visible at the call level, not just aggregated at billing time. You need to know which team consumed 2 million tokens on a Tuesday, not when the invoice arrives.&lt;br&gt;
Graceful degradation, not hard blocks — when a team approaches their limit, the right behavior is to route to a cheaper model (GPT-4o Mini instead of GPT-4o, for example), not to throw a 403 and break their service.&lt;br&gt;
&lt;strong&gt;Environment-aware policies&lt;/strong&gt; — development environments should have generous limits to allow experimentation. Production environments need tighter budgets with stricter monitoring. These should be separate policies on the same infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  How TrueFoundry Handles LLM Spend Enforcement
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's&lt;/a&gt; AI Gateway is built for exactly this use case. It connects to 250+ LLM providers through a single API endpoint and exposes a governance layer that platform teams can configure without touching application code.&lt;br&gt;
Here's how spend enforcement works in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Centralize API Key Management
&lt;/h3&gt;

&lt;p&gt;Instead of distributing provider API keys to individual teams, you configure them once in &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; and issue virtual credentials — scoped tokens that proxy to the real keys with usage tracking attached.&lt;br&gt;
Engineers update their base URL and authentication header once. Everything else stays the same. From the application's perspective, it's still calling the OpenAI API. From the platform's perspective, every call is now attributable, measurable, and enforceable.&lt;/p&gt;

&lt;h1&gt;
  
  
  Before: direct provider access
&lt;/h1&gt;

&lt;p&gt;client = OpenAI(api_key="sk-...")&lt;/p&gt;

&lt;h2&gt;
  
  
  After: routed through &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; AI Gateway
&lt;/h2&gt;

&lt;p&gt;client = OpenAI(&lt;br&gt;
    api_key="tf-team-data-science-prod",&lt;br&gt;
    base_url="&lt;a href="https://your-org.truefoundry.com/api/llm" rel="noopener noreferrer"&gt;https://your-org.truefoundry.com/api/llm&lt;/a&gt;"&lt;br&gt;
)&lt;br&gt;
No other code change required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Define Budget Policies Per Team
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; lets you set budget policies at multiple levels — by team, by project, by environment, or by individual user. Each policy can enforce limits on:&lt;br&gt;
Token usage (input + output tokens combined, or separately)&lt;br&gt;
Request count (number of API calls per hour, day, or month)&lt;br&gt;
Estimated spend (dollar value, calculated from provider pricing)&lt;br&gt;
A typical configuration for a data science team with a $2,000/month budget and a separate $500/month allowance for experimentation looks like this in the platform — two policies, one for prod workloads and one for dev, with different limits and different alert thresholds.&lt;br&gt;
When the team hits 80% of their budget, &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; sends an alert to whoever you've designated — the team lead, the platform team, finance — before there's a problem, not after.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Configure Intelligent Fallback Routing
&lt;/h3&gt;

&lt;p&gt;Hard limits that break production are worse than no limits. The smarter approach is model fallback routing: when a team is approaching their budget ceiling, the gateway automatically routes subsequent calls to a cheaper model while maintaining the same API contract.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; supports fallback routing configurations where you define a primary model and one or more fallback targets with the conditions that trigger a switch — budget threshold reached, latency spike, provider error rate too high, or any combination.&lt;/p&gt;

&lt;p&gt;A team that normally uses Claude Sonnet 4 can have automatic fallback to Claude Haiku 4 when they've consumed 75% of their monthly token budget. Their application keeps running. Their costs stop accelerating. They get a notification. No engineer needs to change anything at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Use Real-Time Observability to Find the Waste
&lt;/h3&gt;

&lt;p&gt;Enforcement without visibility is flying blind in the other direction. &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt;'s gateway captures full traces of every LLM call — prompt, response, token counts, latency, model used, team attribution, and cost — and makes that data available in a real-time dashboard.&lt;br&gt;
In practice, this surfaces three patterns that are almost always present in any multi-team AI deployment:&lt;br&gt;
Expensive prompt patterns. A specific workflow that sends a 12,000-token system prompt on every request. The fix — prompt compression or caching — takes an afternoon and can reduce that team's spend by 60%.&lt;br&gt;
Unnecessary model choices. A classification task running against GPT-4o when GPT-4o Mini or a fine-tuned smaller model would perform identically. Switching models on 80% of classification calls with no quality loss is a common first-pass optimization.&lt;br&gt;
Retry loops inflating costs. Error handling that retries failed calls without exponential backoff, effectively multiplying call volume by 3–5x during any provider instability. Visible at the gateway level as a spike in calls with a high error rate preceding them.&lt;br&gt;
None of these are visible at the billing statement level. All of them are immediately visible in a per-call trace dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers That Make the Case
&lt;/h2&gt;

&lt;p&gt;Teams that move from direct LLM provider access to a governed gateway layer consistently report similar outcomes. &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry &lt;/a&gt;customers report 40–60% reductions in LLM infrastructure spend after implementing quota management, fallback routing, and prompt optimization based on gateway observability.&lt;br&gt;
The mechanics of why this happens: direct provider access has no forcing function for prompt efficiency, model selection, or caching. When there's a cost per call that someone is watching, teams naturally optimize. When there isn't, they don't.&lt;br&gt;
The operational overhead of managing this through manual processes — ticket queues for key access, spreadsheet-based budget tracking, post-hoc billing analysis — typically consumes 4–8 hours of platform engineering time per week. Automated enforcement at the gateway layer brings that to near zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Don't Want to Do
&lt;/h2&gt;

&lt;p&gt;Two approaches to LLM cost control are popular and both are counterproductive.&lt;br&gt;
Shared API keys with no attribution is the default state for most teams. It's easy to set up and provides zero visibility or control. When costs spike, you have no way to identify the source.&lt;br&gt;
Manual approval workflows solve the visibility problem but create a worse one. Engineers who need a new API key or an increased quota file a ticket, wait, follow up, and lose a day or more. In an environment where LLMs are a core development tool, that friction directly reduces experimentation velocity — which is where most AI product value comes from.&lt;br&gt;
The right trade-off is automated enforcement with generous defaults for development, tighter policies for production, and real-time visibility for everyone. Engineers move fast. Platform teams stay in control. Finance gets a predictable number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're running LLM workloads across multiple teams and currently routing directly to providers, the migration path with &lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; is straightforward: update the base URL and API key in your existing client configuration, configure team budgets in the platform, and set up fallback routing for your highest-spend models.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt;'s AI Gateway handles 350+ requests per second on a single vCPU at 3–4ms of added latency — well below any threshold that would affect application performance or developer experience. It's recognized in the 2025 Gartner Market Guide for AI Gateways.&lt;br&gt;
The engineers won't notice the governance layer. Finance will notice the bill.&lt;br&gt;
&lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;Explore TrueFoundry's AI Gateway →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Top 5 AI Gateway Companies in 2026 (Ranked for Enterprise Teams)</title>
      <dc:creator>Deepti Shukla</dc:creator>
      <pubDate>Wed, 18 Mar 2026 18:52:23 +0000</pubDate>
      <link>https://dev.to/deeptishuklatfy/top-5-ai-gateway-companies-in-2026-ranked-for-enterprise-teams-3hi6</link>
      <guid>https://dev.to/deeptishuklatfy/top-5-ai-gateway-companies-in-2026-ranked-for-enterprise-teams-3hi6</guid>
      <description>&lt;p&gt;Enterprise LLM spending surged past $8.4 billion in 2026, and with it came a brutal reality check: getting a model to work in a demo is easy. Getting it to work reliably, securely, and cost-efficiently across an organization of thousands? That's an infrastructure problem. And the infrastructure layer solving that problem right now is the AI Gateway.&lt;br&gt;
An AI Gateway sits between your applications and your LLM providers. It handles routing, authentication, rate limiting, cost tracking, observability, and — increasingly — MCP-based tool integrations for agentic workflows. Without one, you're dealing with vendor lock-in, no fallback strategy, scattered API keys, and zero visibility into what your models are actually doing in production.&lt;br&gt;
There are a lot of players in this space. These are the 5 that matter most right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. TrueFoundry — The Enterprise AI Gateway Built for Governance and Agentic Scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; isn't just an &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; — it's the most complete answer to enterprise AI infrastructure in 2026. It was recognized in the 2026 Gartner® Market Guide for AI Gateways as well as Gartner's Innovation Insight: &lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;MCP Gateways&lt;/a&gt; report, which puts it in rare company for a platform that only a few years ago was primarily known for its LLMOps capabilities.&lt;br&gt;
The core product is a unified AI Gateway that connects to 1,000+ LLMs through a single API endpoint. &lt;/p&gt;

&lt;p&gt;It supports chat, completion, embedding, and reranking across all major providers — OpenAI, Anthropic, Google, Mistral, Groq, and more. Under the hood it delivers approximately 3–4 ms latency while handling 350+ requests per second on a single vCPU, scaling horizontally with ease through Kubernetes-based infrastructure. That's a significant performance edge over alternatives like LiteLLM for teams running production-grade workloads.&lt;br&gt;
But what truly differentiates TrueFoundry heading into 2026 is its &lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; — the piece of infrastructure that almost no other gateway provider handles well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;The MCP Gateway&lt;/a&gt;: Why It's a Category of Its Own&lt;br&gt;
As teams shift from simple chatbots to full autonomous agents, they hit a new kind of complexity: the N×M integration problem. With N agents and M external tools (Slack, GitHub, Confluence, Sentry, Datadog, internal APIs), every agent ends up implementing its own connection, authentication, and error handling for every tool. The result is a sprawling, ungovernable web of point-to-point integrations.&lt;br&gt;
TrueFoundry's MCP Gateway resolves this entirely. It acts as a centralized reverse proxy between all your AI agents and all your MCP Servers — a single control point for tool discovery, authentication, routing, and observability. Agents connect to one endpoint. The gateway handles everything else.&lt;/p&gt;

&lt;p&gt;Key capabilities include a Centralized MCP Registry for dynamic tool discovery, Federated Identity integration with Okta, Azure AD, and other IdPs via OAuth 2.0, per-server RBAC for compliance-grade access control, and full end-to-end tracing of every MCP request, LLM call, and agent decision from a single dashboard.&lt;br&gt;
The platform also includes an interactive Prompt Playground where developers can test different models, prompts, MCP tools, and configurations before deploying. Configurations can be saved as versioned, reusable templates. Ready-to-use code snippets are generated automatically for the OpenAI client, LangChain, and other frameworks — so the gap from experiment to production is measured in minutes, not weeks.&lt;br&gt;
For data-sensitive industries, TrueFoundry's entire platform runs inside your own VPC, on-premises environment, or air-gapped infrastructure. No data leaves your domain.&lt;/p&gt;

&lt;p&gt;Best for: Enterprise AI governance, multi-model LLMOps, agentic workflows at scale, regulated industries (healthcare, finance, defense), teams that cannot compromise on data sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Kong AI Gateway — The Battle-Tested API Giant Moves into AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kong has been a dominant force in API management for over a decade, and in 2026 its AI Gateway extends that legacy into the LLM layer. Built on top of the existing Kong Gateway runtime, it unifies API and AI traffic management in a single platform — which is a meaningful architectural advantage for teams who are already running Kong for their microservices infrastructure.&lt;br&gt;
Performance-wise, Kong is credible at scale. In benchmarks against Portkey and LiteLLM running on AWS EKS clusters, Kong Konnect Data Planes delivered over 228% higher throughput than Portkey and 859% higher throughput than LiteLLM, with 65% lower latency than Portkey and 86% lower latency than LiteLLM in proxy-mode comparisons.&lt;/p&gt;

&lt;p&gt;Kong's AI Gateway supports multi-LLM routing with a unified abstraction layer, token-level rate limiting per consumer, semantic caching for cost reduction, automatic fallback and retry logic, and comprehensive observability. On the MCP front, Kong offers enterprise-grade MCP gateway functionality with auto-generation of MCP servers from any existing API, centralized OAuth enforcement, and real-time observability — though the depth of its MCP Registry and governance features doesn't yet match TrueFoundry's purpose-built MCP Gateway.&lt;br&gt;
The platform also carries 100+ enterprise-grade plugin capabilities ported from the traditional API gateway world, which gives it a head start on authentication schemes, request transformation, and traffic management that newer AI-native gateways are still catching up to.&lt;br&gt;
Best for: Organizations already invested in Kong infrastructure, teams managing both traditional APIs and AI traffic in a unified control plane, Kubernetes-native deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Portkey — The AI-Native Gateway for Developer Teams&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Where Kong comes from the API management world, Portkey was designed from day one specifically for LLM application workflows. That shows in its developer experience and its prompt-aware abstractions. Portkey connects to 1,600+ LLMs and providers through a single unified API, covering all major providers plus emerging models and open-source deployments.&lt;br&gt;
The platform's strongest suits are observability and prompt management. Every request is traced end-to-end — tokens in and out, latency, cost, guardrail violations, all tied to custom metadata like user ID, team, or environment. Its prompt management studio supports collaborative template creation, versioning, A/B testing, and rollback. For teams iterating fast on AI products, this removes a lot of friction.&lt;/p&gt;

&lt;p&gt;Portkey handles 30 million policies per month for some enterprise customers, with governance features including virtual key management (so API keys never leave Portkey's vault), RBAC, org/workspace isolation, configurable routing with automatic retries and exponential backoff, and 50+ pre-built guardrails covering content filtering and PII detection. It carries SOC2, ISO27001, HIPAA, and GDPR certifications.&lt;br&gt;
The caveat: Portkey's LLMOps capabilities are positioned as a full platform, but key features like model deployment are absent. And while it supports remote MCP Servers via its Responses API, it lacks the centralized authentication and governance that a dedicated MCP Gateway provides.&lt;br&gt;
Best for: Developer and product teams building LLM applications who need deep observability and prompt lifecycle management without the overhead of a full enterprise platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. LiteLLM — The Open-Source Gateway That Democratized Multi-Model Access&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LiteLLM has one of the most important origin stories in the AI gateway space. It's the tool that made multi-provider LLM access accessible to individual developers and small teams — a Python SDK and proxy server with a unified OpenAI-compatible API covering 100+ LLM providers. Its GitHub star count and community adoption reflect how foundational it became during the early days of the LLM boom.&lt;/p&gt;

&lt;p&gt;The value proposition is simple: zero cost to get started, maximum flexibility, and broad provider compatibility. LiteLLM supports cost tracking and budget limits per project or team, retry and fallback logic, integration with observability tools like Langfuse and MLflow, and basic MCP gateway support with tool access control by team and API key.&lt;br&gt;
The tradeoffs become visible at scale. TrueFoundry's AI Gateway benchmarks show LiteLLM struggling beyond moderate RPS, with high latency and no built-in horizontal scaling. Production teams increasingly report memory issues and stability concerns under load. There is no formal commercial backing, no SLAs, and no enterprise support plan — which makes it difficult to justify for organizations with compliance requirements or uptime guarantees.&lt;br&gt;
LiteLLM's place in 2026 is as a prototyping and development tool, and a starting point that many teams eventually graduate from as their AI workloads mature into production.&lt;br&gt;
Best for: Individual developers, early-stage startups, teams experimenting with multi-provider LLM access before committing to a production infrastructure strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Helicone — Performance and Simplicity for Production Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Helicone is built in Rust, and that architectural decision defines its identity: it adds approximately 50 ms of overhead (one of the lowest in the category) and delivers health-aware routing with circuit breaking to automatically detect failures and route to healthy providers. For teams where performance is the primary concern and they don't need the full governance stack of a platform like TrueFoundry, Helicone hits a well-defined sweet spot.&lt;br&gt;
Its core offering is a drop-in proxy for OpenAI-compatible APIs with rich built-in monitoring — request logs, cost tracking, latency analysis, and alerting — available as both a managed SaaS service and a self-hosted open-source deployment. Latency load-balancing and native observability integration are production-grade. The caching layer can deliver up to 95% cost savings on repeated prompts, which in high-volume applications is a meaningful number.&lt;/p&gt;

&lt;p&gt;Where Helicone falls short for enterprise buyers is governance depth. RBAC, multi-org federation, compliance certifications, and advanced agentic / MCP support are limited compared to TrueFoundry or Kong. It is, intentionally, not trying to be a full LLMOps platform. For consumer-facing applications where compliance requirements are minimal and developer simplicity is the priority, that's a perfectly valid tradeoff.&lt;br&gt;
Best for: Performance-focused engineering teams building consumer applications, teams who want open-source observability with minimal setup overhead, organizations starting to instrument their LLM stack.&lt;/p&gt;

&lt;p&gt;The Honest Summary&lt;br&gt;
The AI Gateway category is maturing fast, and the right choice depends almost entirely on where you are in your AI journey and what you're optimizing for.&lt;br&gt;
If you're prototyping, LiteLLM gets you moving in under an hour for free. If you're building a developer-first LLM product and need great observability, Portkey or Helicone are strong fits. If you're running Kong and want unified API + AI traffic management at scale, Kong AI Gateway is the natural extension.&lt;/p&gt;

&lt;p&gt;But if you're an enterprise team building agentic systems, navigating compliance requirements, and need to govern access to both LLMs and external tools through a secure MCP Gateway — &lt;a href="https://www.truefoundry.com/mcp-gateway" rel="noopener noreferrer"&gt;TrueFoundry&lt;/a&gt; is the platform the rest of the field is still catching up to. The Gartner recognition, the 1,000+ LLM integrations, the 350+ RPS on a single vCPU, and the only purpose-built enterprise MCP Gateway in the market make it the standout choice for teams taking production AI seriously in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which AI Gateway is your team running in production? Drop it in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llmops</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
