Most teams set up an AI gateway the same way they set up a reverse proxy in 2012: route the traffic, add a key, move on. It works until it doesn't — and when it stops working in production, it stops working loudly.
An AI gateway is not an API proxy with a language model on the other end. It's the control plane for everything your AI systems do in production: how they access models, how much they spend, how they behave when a provider goes down, what data leaves your infrastructure, and how you debug it when something goes wrong at 2am.
The gap between what most AI gateways are doing and what they should be doing is wide. Here are the seven things a production AI gateway needs to do, including the three that most teams haven't gotten to yet — and what it costs them when they don't.
1. Unified Multi-Provider Access With a Single API Contract ✅ Most are doing this
This is the baseline. A production AI gateway should give your engineers a single endpoint and a single authentication method that works regardless of which LLM provider or model is behind it — OpenAI, Anthropic, Gemini, Mistral, Groq, or a self-hosted model running on your own GPU cluster.
The practical value is that your application code never changes when you switch models. You don't update base URLs, regenerate credentials, or modify request schemas when you move from Claude Sonnet 4 to GPT-4o or add a self-hosted Llama 3 to the mix. The gateway handles the translation.
TrueFoundry's AI Gateway connects to 250+ LLM providers — including hosted providers and self-hosted models running on vLLM, TGI, or Triton — through one API endpoint. Engineers configure their client once. The platform team controls which models are available, at what cost, to whom.
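The contract is easier to see in code. A minimal sketch of the pattern, assuming a hypothetical gateway URL and key (not TrueFoundry's actual endpoint or schema):

```python
# Sketch: the application builds one request shape regardless of provider.
# GATEWAY_URL and GATEWAY_KEY are illustrative placeholders.

def build_request(model: str, prompt: str) -> dict:
    """Same endpoint, same credential, same schema for every model."""
    return {
        "url": "https://gateway.internal/v1/chat/completions",   # single endpoint
        "headers": {"Authorization": "Bearer GATEWAY_KEY"},      # single credential
        "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
    }

a = build_request("claude-sonnet-4", "Summarize this ticket")
b = build_request("gpt-4o", "Summarize this ticket")  # only the model string changed
```

Swapping providers becomes a one-string change at the call site; credentials, URLs, and request schemas stay put.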
This is table stakes. If your gateway isn't doing this, it's not really a gateway — it's a forwarding rule.
2. Automatic Fallback and Failover Routing ✅ Most are doing this
Provider outages happen. OpenAI has had multiple significant incidents in the past 18 months. Anthropic has throttled requests during peak periods. A production system that routes all traffic through a single provider without a fallback strategy is a production system with a single point of failure.
A gateway should detect provider errors in real time — 429 rate limit responses, 5xx errors, latency spikes above a configurable threshold — and automatically reroute to a fallback model without the application layer ever knowing there was a problem.
The configuration should be flexible: you might want GPT-4o to fall back to Claude Sonnet 4 for quality-sensitive paths, but fall back to GPT-4o Mini for high-volume, cost-sensitive paths where the quality bar is lower. These are different fallback policies, and they should be independently configurable per route.
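In rough shape, per-route fallback looks like this. The route names, model IDs, and retriable-status set below are assumptions for illustration, not a real gateway config:

```python
# Illustrative per-route fallback chains; each route gets its own policy.
FALLBACK_POLICIES = {
    "quality-sensitive": ["gpt-4o", "claude-sonnet-4"],
    "high-volume":       ["gpt-4o", "gpt-4o-mini"],
}

# Errors worth rerouting on: rate limits and server-side failures.
RETRIABLE = {429, 500, 502, 503, 504}

def call_with_fallback(route: str, send) -> str:
    """send(model) -> (status, body). Walk the chain until a call succeeds."""
    for model in FALLBACK_POLICIES[route]:
        status, body = send(model)
        if status not in RETRIABLE:
            return body  # the application never learns a fallback happened
    raise RuntimeError("all providers in the fallback chain failed")
```

A production implementation would also track latency percentiles and error rates per provider rather than reacting to single failures.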
This is also largely understood by now. The more interesting question is whether your gateway is doing the fallback routing intelligently — based on error rate, latency percentile, and cost — or just blindly switching on any failure.
3. Per-Team Spend Enforcement With Real-Time Budget Tracking ✅ Most are doing this, badly
Spend visibility and spend enforcement are different things, and most teams have the first without the second.
Visibility means you can see — at the end of the month, or after the fact — which team consumed how many tokens. Enforcement means that when the data science team hits 80% of their monthly token budget on the 15th, something happens automatically: an alert fires, requests route to a cheaper fallback model, or a hard cap kicks in before the overage.
The enforcement layer is what most gateways are missing. They expose usage dashboards. They don't enforce policy at the request level in real time.
TrueFoundry lets you configure per-team, per-project, and per-environment budget policies that enforce at the gateway layer before a request reaches the provider. When a team hits their threshold, the gateway can alert, downgrade model routing, or hard cap — based on whatever policy you've set. The application doesn't break. The bill doesn't surprise.
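The decision logic is simple to sketch. The thresholds, actions, and downgrade model below are illustrative assumptions, not TrueFoundry's actual policy schema:

```python
# Sketch of request-level budget enforcement, evaluated at the gateway
# before the request reaches any provider.

def enforce_budget(spent: float, budget: float, requested_model: str):
    """Return (action, model). Thresholds here are illustrative."""
    ratio = spent / budget
    if ratio >= 1.0:
        return ("block", None)               # hard cap: reject before the overage
    if ratio >= 0.8:
        return ("downgrade", "gpt-4o-mini")  # alert fires, route to a cheaper model
    return ("allow", requested_model)
```

The key property is that this runs per request, in the hot path, rather than in a dashboard someone checks at month end.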
4. Full Request-Level Observability, Not Just Aggregate Metrics ⚠️ Most are doing this partially
This is where the gap starts to open up.
Aggregate metrics — total tokens consumed, average latency, error rate by provider — are useful for billing and capacity planning. They tell you almost nothing about why your production AI system is behaving the way it is.
Request-level observability means capturing the full trace of every LLM call: the prompt, the response, the token breakdown (input vs output), the model used, the latency at each layer, the team and user that made the request, and the cost attributed to that specific call. This is what you need to debug production issues, identify expensive prompt patterns, catch quality regressions, and build a feedback loop for improvement.
The difference between aggregate metrics and request-level tracing is roughly the difference between knowing your application has high CPU and knowing which function is causing it.
TrueFoundry captures full request traces — prompt, completion, token counts, latency, model attribution, cost, and team identity — and surfaces them in a real-time dashboard with filtering by team, model, time range, and error state. When something behaves unexpectedly in production, the answer is usually visible in the trace data within minutes.
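The trace record described above might look roughly like this. The field names and the per-1k pricing are assumptions for illustration, not TrueFoundry's trace schema:

```python
import time
from dataclasses import dataclass

@dataclass
class LLMTrace:
    """One record per LLM call: enough to debug, attribute, and price it."""
    team: str
    user: str
    model: str
    prompt: str
    completion: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float

def traced_call(team, user, model, prompt, send, price_per_1k=(0.0025, 0.01)):
    """Wrap a provider call and emit a full trace. price_per_1k is
    (input, output) USD per 1k tokens, illustrative values only."""
    start = time.monotonic()
    completion, in_tok, out_tok = send(model, prompt)
    latency_ms = (time.monotonic() - start) * 1000
    cost = in_tok / 1000 * price_per_1k[0] + out_tok / 1000 * price_per_1k[1]
    return LLMTrace(team, user, model, prompt, completion,
                    in_tok, out_tok, latency_ms, cost)
```

With records like this, "which team's prompts got expensive last Tuesday" is a filter, not an investigation.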
Most teams using lighter-weight gateways have aggregates but not traces. They know the total. They can't explain the individual.
5. PII Detection and Data Residency Controls ❌ Most are NOT doing this
This is the first of the three things most gateways aren't doing — and in regulated industries, it's the one that creates the most legal exposure.
When your engineers send prompts to external LLM providers, those prompts routinely contain data that should never leave your infrastructure: customer names and email addresses embedded in support ticket context, financial figures in analyst-facing tools, patient identifiers in healthcare applications, proprietary code in developer-facing copilots.
Most teams handle this through developer guidelines and code review. Both fail in production. Guidelines aren't enforced. Code review doesn't catch every case. Context-stuffing patterns that look safe at the individual call level can expose sensitive data in aggregate.
A production AI gateway should inspect outbound prompts for PII and sensitive data patterns before they leave your infrastructure — and either redact, block, or route to a self-hosted model depending on the sensitivity of what was found. This enforcement has to happen at the gateway layer to be reliable, because it can't depend on application-level compliance by every team and every developer.
TrueFoundry's AI Gateway includes guardrails for PII detection and content moderation that apply at the request level, before data reaches any external provider. For organisations with strict data residency requirements — GDPR, HIPAA, financial services regulations — the gateway can be deployed entirely within your VPC or on-premises, ensuring that no inference data, no prompt content, and no response ever transits through third-party infrastructure.
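A toy version of the redaction step, to make the mechanism concrete. Real PII detection combines NER models with domain-specific detectors; the two regexes below are deliberately simplistic stand-ins:

```python
import re

# Illustrative patterns only; production guardrails use ML-based detectors,
# not a pair of regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_prompt(prompt: str):
    """Redact known PII patterns before the prompt leaves the infrastructure.
    Returns the redacted prompt and the labels of what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[REDACTED_{label.upper()}]", prompt)
    return prompt, found
```

Depending on policy, a hit could instead block the request outright or reroute it to a self-hosted model inside the VPC.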
Most teams know they have this problem. Most haven't instrumented a solution at the infrastructure layer yet.
6. Versioned Prompt Management Tied to Deployment ❌ Most are NOT doing this
Prompts are code. Most teams aren't treating them that way.
The typical state of prompt management in a production AI team: prompts are hardcoded strings in application code, changed via pull request with no systematic evaluation, deployed as part of a general application release with no ability to roll back the prompt independently of the application, and never formally versioned in a way that lets you compare performance across versions.
This creates a class of production bugs that are uniquely painful: the model's behaviour changed, but nothing in the deployment pipeline changed — because the prompt changed in a way that wasn't tracked, or a model was swapped at the provider level without a corresponding prompt update.
A production AI gateway should include prompt versioning as a first-class feature: version-controlled prompt templates, the ability to run A/B tests between prompt versions with statistical tracking, rollback to a previous prompt version in seconds without a full application redeploy, and full traceability connecting which prompt version was used for which request.
TrueFoundry includes prompt management natively within the gateway layer: version-controlled templates, A/B testing across prompt versions, and full trace linkage so you can see exactly which prompt version produced which output for any specific request in production. When a quality regression hits, you can identify whether it was a model change, a prompt change, or a data change — and roll back the right thing.
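The core of a versioned prompt registry fits in a few lines. The registry structure and function names below are hypothetical, not TrueFoundry's API:

```python
# Minimal sketch: versioned templates, an active-version pointer, and
# rollback that never touches application code.

PROMPT_REGISTRY = {
    ("summarize", 1): "Summarize the following text:\n{text}",
    ("summarize", 2): "Summarize in three bullet points:\n{text}",
}

ACTIVE = {"summarize": 2}

def render(name: str, **variables):
    """Return the rendered prompt and the version used, so every request
    trace can record exactly which prompt version produced it."""
    version = ACTIVE[name]
    return PROMPT_REGISTRY[(name, version)].format(**variables), version

def rollback(name: str, version: int) -> None:
    """Roll the prompt back independently of any application deploy."""
    assert (name, version) in PROMPT_REGISTRY, "unknown prompt version"
    ACTIVE[name] = version
```

Returning the version from `render` is the detail that matters: it is what links a bad output in production back to the exact prompt that caused it.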
Teams running prompts as unversioned strings in application code are accumulating technical debt that compounds every time they make a change they can't formally evaluate.
7. MCP Gateway for Agentic Tool Access ❌ Most are NOT doing this (yet)
This is the newest gap, and the one that's going to matter most over the next 12 months.
As AI systems move from single-turn completions to multi-step agentic workflows, the attack surface and governance requirements change fundamentally. An agent that can call tools — search the web, query your database, execute code, send emails, update CRM records — needs a governance layer that's categorically different from a prompt-and-response proxy.
Model Context Protocol (MCP) is the emerging standard for how agents discover and call tools. Without a gateway layer in front of MCP, you have agents making arbitrary tool calls with no access control, no audit trail, no rate limiting, and no way to enforce which tools a given agent is allowed to use.
The specific risks: prompt injection attacks that cause agents to call tools the application developer never intended; agents accumulating permissions that exceed what any individual request should have; tool calls that exfiltrate data or trigger external side effects with no audit log; and no mechanism to restrict which MCP servers a given team or application can access.
TrueFoundry's MCP Gateway provides a secure, governed access layer in front of your MCP servers: RBAC enforcement at the tool level (this agent can call search and read, but not write or execute), full request tracing for every tool call, integration with enterprise identity providers like Okta and Azure AD, and auto-discovery of registered MCP servers with proper access controls applied automatically.
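Tool-level RBAC with an audit trail is conceptually small, which makes its absence harder to excuse. A sketch with hypothetical agent names and permission sets, not an MCP or TrueFoundry schema:

```python
# Illustrative tool-level RBAC: every tool call is checked against the
# agent's permission set, and every decision is written to an audit log.

AGENT_PERMISSIONS = {
    "support-bot": {"search", "read"},
    "ops-agent":   {"search", "read", "write"},
}

AUDIT_LOG: list = []

def authorize_tool_call(agent: str, tool: str) -> bool:
    """Gate a tool call; unknown agents get no permissions by default."""
    allowed = tool in AGENT_PERMISSIONS.get(agent, set())
    AUDIT_LOG.append({"agent": agent, "tool": tool, "allowed": allowed})
    return allowed
```

The default-deny stance for unknown agents is the point: an agent that was never registered should not be able to call anything, and every attempt, allowed or not, should leave a record.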
Most teams building agentic systems right now are connecting directly to MCP servers without any gateway layer. The governance debt they're accumulating will become visible the first time an agent does something it shouldn't have been able to do.
The 3-Minute Audit for Your Current Gateway
Before evaluating alternatives, it's worth auditing what your current setup is actually doing. Ask these questions:
On PII and data residency: Can you demonstrate that no customer PII has ever been sent to an external LLM provider in a prompt? If the answer is "I think so" or "our developers know not to do that," the answer is no.
On prompt versioning: Can you identify which prompt version was used for any specific production request from last Tuesday? If you'd need to check git blame and cross-reference a deployment log, the answer is no.
On agentic tool access: If you have agents calling tools, can you pull an audit log of every tool call made in the last 7 days, with the agent identity and the justification from the model? If not, the answer is no.
Most teams are 4 out of 7 on this list. Getting to 7 out of 7 doesn't require replacing your infrastructure — it requires picking a gateway platform that covers the full surface area, not just the routing layer.
Why Most Gateways Stop at 4
The first four capabilities on this list — unified access, fallback routing, spend tracking, and aggregate observability — are relatively straightforward to build. They've been commoditised. Several open-source options cover them adequately.
The last three — PII enforcement, prompt versioning, and agentic governance — are harder because they require the gateway to understand the semantics of what's passing through it, not just the routing. They require integration with your identity provider, your compliance framework, your deployment pipeline. They require the gateway to be a platform, not a proxy.
TrueFoundry is built as that platform. It's recognised in the 2025 Gartner Market Guide for AI Gateways, handles 350+ requests per second on a single vCPU at 3–4ms of added latency, and can be deployed fully within your VPC for organisations with strict data residency requirements.
The teams that will have well-governed, cost-efficient, production-reliable AI systems in 12 months are the ones adding these last three capabilities now, before the agentic complexity compounds.
Explore TrueFoundry's AI Gateway →