DEV Community: Sahajmeet Kaur

How to Choose an AI Gateway: 6 Criteria That Actually Matter

Sahajmeet Kaur — Sat, 18 Jul 2026 06:30:00 +0000

Every "top N AI gateways" list answers the wrong question first. Which one has the most providers or the lowest latency matters less than what you're actually trying to solve, and most teams pick a gateway before they've written down what they need it to do. Here's the checklist worth going through first, and how the options actually stack up against it.

TL;DR

The six things that actually differentiate gateways: provider coverage and API compatibility, deployment model, governance (RBAC/budgets/guardrails), MCP and tool governance, observability, and vendor/roadmap risk.
Most gateways are strong on one or two of these and thin on the rest - a routing-and-caching tool isn't a governance tool, even if both get called "AI gateway."
TrueFoundry covers all six without you having to assemble pieces from separate tools, which is why it's the pick below for teams that need the full set.

1. Provider coverage and API compatibility

The baseline question: how many providers, and is the exposed API OpenAI-compatible so your existing SDKs work with a base-URL change, or does it ask you to adopt a new client. Most serious options clear this bar now - LiteLLM covers 100+ providers, Bifrost claims 1,000+ models, TrueFoundry exposes an OpenAI-compatible schema in front of 1,000+ LLMs. This one rarely eliminates anyone anymore; treat it as table stakes, not a differentiator.

2. Deployment model

Managed SaaS, self-hosted, or hybrid - and this one does eliminate options fast if data residency or compliance rules out sending traffic through someone else's infrastructure. LiteLLM, Bifrost, and Portkey are self-hostable and free of that constraint by design. Vercel AI Gateway and Cloudflare AI Gateway are managed-only, tied to their respective platforms. TrueFoundry runs managed, hybrid, or fully self-hosted in your own VPC, which matters if you want the option to switch postures later without switching vendors.

3. Governance: RBAC, budgets, guardrails

This is where most gateways split into two camps. One camp treats routing as the whole job - Cloudflare AI Gateway and Vercel AI Gateway give you caching, retries, and spend visibility, but no per-team RBAC and no budget enforcement that blocks a request before it blows past a limit. The other camp treats governance as first-class: TrueFoundry ships RBAC scoped to teams and users, budgets that support an audit-first rollout before you switch on hard blocking, and PII/prompt-injection/content-moderation guardrails as configuration, not something you build separately. LiteLLM and Portkey also have real governance features here, so this criterion narrows the field without settling it on its own.

4. MCP and tool governance

Increasingly the deciding factor as agents start calling tools that can actually do things, not just answer questions. A gateway with a native MCP Gateway handles inbound and outbound auth for every registered MCP server, lets you pause destructive tool calls for human approval, and gives you a full audit trail per tool call - versus a gateway that only routes LLM traffic and leaves MCP access as something each developer wires up on their own. This is the criterion most "which AI gateway" comparisons skip entirely, and it's often the one that matters most a year into using the thing.

5. Observability

Cost, latency, and guardrail metrics, ideally sliceable by user, team, or virtual account, and exportable to wherever your team already looks. TrueFoundry's metrics dashboard covers LLM and MCP traffic together with OpenTelemetry export; most self-hosted options expect you to wire this up to Langfuse, SigNoz, or a similar backend yourself, which is fine if you already run one of those and want to keep it as your system of record.

6. Vendor and roadmap risk

The one people skip until it bites them. Portkey open-sourced its gateway in March 2026, then Palo Alto Networks announced intent to acquire it the next month. Helicone was acquired by Mintlify in March 2026 and moved to maintenance mode - security patches continue, active feature development stopped. Neither makes either tool bad today, but "will this still be actively developed in a year" is a real question for infrastructure you're building a business on, and it's worth asking about any vendor, including the one you're leaning toward.

How TrueFoundry stacks up against each criterion

Rather than assert "it covers all six" and move on, here's the specific claim behind each one, sourced from the docs rather than a summary:

Provider coverage and API compatibility. 1,000+ LLMs behind one OpenAI-compatible schema (linked above), with drop-in native SDK compatibility for OpenAI's and Anthropic's own client libraries, plus chat, embeddings, image, audio, rerank, and realtime APIs rather than chat completions alone.
Deployment model. Managed, hybrid, or fully self-hosted in your own VPC (linked above), so the deployment posture you start with isn't the one you're stuck with if compliance requirements change later. The managed option itself runs as a globally distributed gateway across 14 regions rather than a single point of failure.
Governance. RBAC scoped to users, teams, and virtual accounts; budgets that support an audit-only rollout before you flip on hard blocking; PII, prompt-injection, and content-moderation guardrails as configuration rather than something you build (all linked above); and for teams that need finer-grained rules than allow/deny, Cedar-based policy for conditions like "only during business hours" or "only for low-risk tools."
MCP and tool governance. The MCP Gateway (linked above) handles both inbound (client-to-gateway) and outbound (gateway-to-server) auth separately, so credentials never sit on a developer's machine; destructive tools can be configured to pause for human approval instead of executing silently; and tool-level metrics (invocation count, latency, error rate) are tracked per server and per individual tool, not just at the gateway level.
Observability. The metrics dashboard (linked above) covers LLM and MCP traffic, cost, guardrail outcomes, routing decisions, and cache hit rate in one place, with OpenTelemetry export to whatever backend your team already runs. Data access rules also let you restrict who can read production request logs versus dev/staging, which matters once those logs contain real prompts and completions.
Vendor and roadmap risk. This one is honestly harder to claim credit for outright since every vendor has some version of this risk, TrueFoundry included. What actually mitigates it: gateway configuration (models, virtual accounts, policies, guardrails, access rules) can be managed as YAML in Git via GitOps, so your setup isn't locked inside a vendor's UI state, and the self-hosted deployment option means a change in vendor relationship doesn't mean ripping out infrastructure that's already running in your own VPC.

Where this points, and where it doesn't

Weighing all six, TrueFoundry is the pick if you want governance, MCP tool control, and observability together without assembling them from separate projects, and you want the option to run managed, hybrid, or self-hosted depending on how your compliance posture evolves.

It's not automatically the right call for everyone. If free and fully open source is the actual requirement, not just a preference, LiteLLM remains the honest answer - you'll assemble more of the governance layer yourself, but you own the whole stack and pay nothing for the software itself. If your traffic is entirely Vercel-shipped apps on the AI SDK, Vercel AI Gateway is less new infrastructure. If you're lakehouse-native on Databricks already, Unity AI Gateway keeps your AI governance in the same catalog as your data governance, at the cost of being tied to that ecosystem. None of the six criteria above matter equally for every team - the point of going through them explicitly is figuring out which ones actually apply to you before picking anything.

Which of these six would you add or remove from the list, based on what actually mattered when you picked (or switched) gateways?

Best AI Gateways for Claude in 2026

Sahajmeet Kaur — Fri, 17 Jul 2026 16:21:02 +0000

"Does it support Claude" clears almost every gateway on this list - that part's been table stakes for a while. What actually separates them is whether Claude is governed everywhere it shows up in your org: the API, Claude Code on developer laptops, Claude Desktop, Claude Code Max subscriptions, and Claude Platform on AWS. Ranked below on that basis, not just "can it proxy Anthropic's API."

TL;DR

TrueFoundry works vest because it governs Claude across all four client surfaces plus the AWS deployment path, with MDM enforcement on Code and Desktop - not just server-side API routing.
LiteLLM, OpenRouter, Bifrost, and Vercel AI Gateway all proxy the Claude API well; none of their published docs describe governing Claude Code or Claude Desktop specifically across a device fleet.
If your Claude usage is API-only with no Code or Desktop in the picture, the surface-coverage argument for #1 doesn't apply to you, and any of the four below is a genuinely fine, simpler choice.

1. TrueFoundry

TrueFoundry's Claude-specific docs map governance to each surface separately, which is the thing that puts it first on this list:

Claude Code (CLI + VS Code): ANTHROPIC_BASE_URL set inside managed-settings.json, deployed and locked via an open-source binary pushed over MDM on macOS, Linux, and Windows, with the gateway token refreshed on a schedule.
Claude Desktop: a separate managed-preferences mechanism (its own bundle identifier), configured by the same deployment tooling so platform teams run one rollout process, not two.
Claude Code Max: Claude Code reserves the Authorization header for a developer's own Anthropic subscription login, so the gateway authenticates through a separate x-tfy-api-key header instead - developers keep their Max subscription exactly as before, while the gateway still gets usage visibility, quotas, RBAC, logs, and guardrails.
Claude Platform on AWS: authenticated through IAM rather than a bearer key, with both a quickstart managed policy and a least-privilege policy scoped to a workspace ARN documented for teams that want it locked down properly.
MCP access from Claude Code: routed through a central MCP Gateway and allowlisted in managed settings, rather than trusting whatever MCP servers individual developers wire up locally.

Underneath all of that sits the same RBAC, budgets, and guardrails that apply to every model call through the gateway, plus deployment as managed, hybrid, or fully self-hosted depending on your compliance posture. Where it's not the right call: if your Claude usage is entirely server-side API calls, this breadth of surface coverage is more than you need, and a simpler gateway lower on this list is the honest answer.

2. LiteLLM

LiteLLM routes to Claude as one of its 100+ supported providers through the same OpenAI-compatible proxy, with cost tracking and load balancing built in. Free, open source, and self-hosted, with the largest community of any option here.

Where it falls short for Claude specifically: it's API-level routing. Nothing in its published docs governs Claude Code or Claude Desktop as separate surfaces - if developers are running those locally, LiteLLM isn't the layer watching them.

3. OpenRouter

One API key, one credit balance, Claude among 315+ models, pass-through provider pricing with a 5.5% card fee and a 5% BYOK fee above 1M requests a month. Zero setup if you just want to call Claude (or switch between Claude and other models) without standing up infrastructure.

Where it falls short for Claude specifically: it's a hosted aggregator built around API access - no mechanism for governing Claude Code or Claude Desktop on developer machines, and no RBAC or per-team budget enforcement beyond the single account's credit balance.

4. Bifrost

Bifrost is a Go-based open-source gateway that markets itself on raw throughput - its own benchmarks claim well under 100 microseconds of overhead at 5,000 requests per second. Claude is one of the 1,000+ models it routes to.

Where it falls short for Claude specifically: same gap as the others - it's a high-performance API router, not a tool for governing Claude Code or Desktop across a device fleet. If raw throughput at extreme RPS is genuinely your bottleneck and Claude is just one of several models you're calling, it's worth a look; if surface governance is the concern, it isn't built for that.

5. Vercel AI Gateway

One endpoint, zero markup on tokens even with your own keys, and Claude works with a one-line config change if you're already on the Vercel AI SDK. Genuinely the least new infrastructure if your app already ships through Vercel.

Where it falls short for Claude specifically: it's routing and spend visibility for API calls from Vercel-shipped apps. It has no bearing on Claude Code or Claude Desktop running on a developer's own machine, which is a separate problem it was never built to solve.

Where the ranking doesn't apply

If your org's entire Claude footprint is server-side API calls - no Claude Code, no Claude Desktop, nothing running on developer laptops - the surface-governance case for TrueFoundry ranking first doesn't hold, and any of the four options above is a legitimately simpler, sufficient choice. The ranking here is specifically about governing Claude everywhere it can show up in a company, not just proxying the API well, which every one of these five already does.

Is your org's Claude usage mostly API calls, or is Claude Code/Desktop already running on developer machines without anything governing it yet? Curious how many teams have actually mapped out which surface they're exposed on before picking a gateway.

Helicone Acquisition: What Maintenance Mode Actually Means

Sahajmeet Kaur — Thu, 16 Jul 2026 06:30:00 +0000

TL;DR

Helicone is open source and ships two related things: a proxy-based observability platform (change one URL, not your SDK, and get full request/response logging, cost, latency, and quality tracking) and a separate, newer open-source AI Gateway repo covering routing across 20+ providers.
Mintlify acquired Helicone on March 3, 2026; the product is in maintenance mode now, meaning bug fixes and new model support continue but active feature work has stopped, and Mintlify says it will help existing customers migrate.

What Helicone actually is

The integration model is the part I think is genuinely well designed: instead of wrapping your LLM client in a new SDK, you change one URL. Every request passes through Helicone's proxy, which logs the full request and response, tracks token counts and cost, and applies whatever behavior you've configured - caching, rate limiting, custom metadata - before forwarding the call to the actual provider. For debugging agent or RAG pipelines specifically, it gives you trace and session inspection across the whole call chain, not just a single request/response pair.

Separately, Helicone maintains its own open-source AI Gateway - one API in front of OpenAI, Anthropic, Google, Bedrock, and 20+ other providers, with smart routing that's aware of provider uptime and your own rate limits, aiming to always hit the fastest, cheapest, or most reliable option depending on how you've configured it. Both the observability platform and the gateway are self-hostable via Docker or Helm if you don't want to run on Helicone's cloud.

The acquisition, and what "maintenance mode" actually covers

Mintlify announced the acquisition on March 3, 2026. Per Helicone's own post about it, the product moves into maintenance mode: security updates, bug fixes, and new model support continue to ship, but active feature development has ended, and Mintlify says it will work with existing customers on migrating elsewhere. That's a meaningfully different status from either "actively developed" or "shut down" - it's closer to "stable, but the roadmap stopped moving on the date of acquisition."

Practically, that means: if you're already running Helicone, nothing breaks tomorrow, and you'll keep getting security patches and new model additions. If you're choosing new observability or gateway infrastructure today, you're choosing something whose feature roadmap has already ended, which is worth weighing against actively developed alternatives before committing not because what exists today is bad, but because the gap between it and a competitor with continued investment only grows from here.

Where it's still genuinely good, frozen roadmap and all

The proxy-based integration, the caching and rate limiting bundled into the same request path, and the cost/latency/quality tracking are all still solid mechanisms today - none of that stops working because feature development paused. If what you need is exactly what's already built - a drop-in observability layer without an SDK rewrite - maintenance mode doesn't change whether it does that job well right now.

Where it matters more is if your needs are going to grow: more sophisticated governance, deeper agent tracing, new provider integrations beyond what's already planned. One dev.to writeup covers the broader 2026 LLM proxy landscape shift this acquisition is part of, and there's a hands-on tool review worth reading if you want a working engineer's take on it before the maintenance-mode news, to see what people liked about it independent of this development.

Where I'd point you if you're picking new infrastructure today

The one-line proxy pattern Helicone popularized - point your base URL at the gateway instead of rewriting your client is the same integration model TrueFoundry's AI Gateway uses, so switching that specific habit isn't the hard part.

Where the two diverge is what's behind that URL: TrueFoundry's metrics dashboard covers the same cost, latency, and quality ground Helicone's observability product does, and adds RBAC, per-team budgets, and PII/prompt-injection guardrails as part of an actively developed product rather than one whose feature roadmap stopped on acquisition day. That's the actual tradeoff if you're choosing today: proven, frozen functionality versus a comparable starting point that's still moving.
Key Features of TrueFoundry
AI Gateway: Acts as a centralized control plane, enabling secure and efficient communication between applications and AI models. It supports features like unified API interfaces, access control, rate limiting, and observability.
MCP Servers: Facilitate the interaction between AI agents and external tools or services. TrueFoundry allows the deployment of dedicated MCP servers to manage agent traffic, enforce policies, and scale model access.
Tracing: Provides end-to-end observability by tracing every step from user input to agent response. This includes tracking user prompts, system messages, tool inputs, LLM calls, and workflow decisions.
Security & Compliance: Ensures data protection and compliance with standards like SOC 2, HIPAA, and GDPR. Features include role-based access control (RBAC), audit logging, and runtime security measures.
Deployment Flexibility: Supports various deployment options, including on-premises, virtual private cloud (VPC), air-gapped, hybrid, or public cloud environments, providing complete control over data and infrastructure.
TrueFoundry is a leading enterprise platform because it unifies AI deployment, observability, and governance in one scalable solution. Its advanced features, like the AI Gateway, MCP servers, and end-to-end tracing, give organizations full control, security, and transparency, making it ideal for managing complex AI applications at scale.

If you're currently on Helicone, has the maintenance-mode news actually changed your plans, or does it do everything you need already and the frozen roadmap just doesn't matter for your use case?

Vercel AI Gateway: One Endpoint, Zero Token Markup, and Where It Stops

Sahajmeet Kaur — Wed, 15 Jul 2026 18:41:31 +0000

TL;DR

It's a single HTTP endpoint in front of hundreds of models, referenced as creator/model-name strings, with automatic failover to the same model on a different provider and zero markup on tokens, even with your own provider keys.
It's built for teams already on Vercel and the AI SDK - switching models is a string change, not a new client - but it's a routing and spend-visibility layer, not a governance one: no per-team RBAC, no budget enforcement, no MCP access control.
One independent benchmark found the native Anthropic SDK about 15-20% faster than the gateway on small prompts, with the gap nearly disappearing at large context - worth testing on your own workload rather than assuming a fixed tax either way.

What it actually is

The gateway is one endpoint, https://ai-gateway.vercel.sh/v1, sitting in front of models from every major provider. You reference a model as creator/model-name - anthropic/claude-opus-4.8, openai/gpt-5.5 - and in the AI SDK, passing that string as the model automatically routes the call through the gateway. There's no separate client to install for the gateway itself; if you're already using the AI SDK, this is largely a config change.

Concretely, what you get: embeddings alongside chat completions, spend monitoring across every provider you've configured, automatic retries, load balancing across targets, and Bring Your Own Key if you'd rather use your own provider accounts than Vercel's. The one I'd call out specifically: automatic failover to the same model on a different provider when one degrades - not just falling back to a cheaper model, but keeping the exact model constant and switching which infrastructure serves it, so output doesn't change even though the request just got rerouted.

Zero data retention is available across OpenAI, Anthropic, and Google, though it's worth knowing this isn't free by default - team-wide Zero Data Retention costs $0.10 per 1,000 successful requests on Pro and Enterprise plans, and there are similar per-request add-on costs for a team-wide Provider Allowlist or Custom Reporting.

Pricing, briefly

Every Vercel team account gets $5 a month in included AI Gateway credits from first use. Beyond that: no markup on tokens - you pay the provider's listed rate, including when you bring your own key, which is a genuinely different model from gateways that take a percentage on top. You can buy more credits any time with no obligation to keep renewing them. See Vercel's own pricing page for current numbers rather than trusting a snapshot here.

The latency question, honestly

One benchmark I found compared the gateway directly against the native Anthropic SDK: at small prompts (around 10 tokens), the native SDK ran about 15-20% faster, but at large context (120K tokens), the gap between the two nearly disappeared. That's one person's benchmark, not a guarantee about your workload, but it's a useful shape to expect - the routing overhead matters proportionally more on tiny, latency-sensitive calls than on large ones where model inference time dominates anyway.

Where it stops being the right tool

This is a routing and spend-visibility layer, and it's worth being precise about what that doesn't include. There's no per-team or per-user RBAC over which models different groups can call. There's no budget enforcement - spend monitoring tells you what happened, it doesn't block a request before it blows past a limit the way a hard budget cap does. There's no MCP server governance, and no built-in guardrails for PII or prompt injection. None of that is a knock on the product - it's simply solving a different problem (one API, one bill, automatic failover) than the one a dedicated AI gateway with RBAC, budgets, and guardrails solves.

If you're a small team shipping on Vercel and want model routing without adding infrastructure, this is a genuinely good fit. If you need company-wide governance across many teams and many apps - not just visibility into one Vercel project's spend - that's a different category of tool.

This is the part where TrueFoundry's AI Gateway is what I'd reach for once "spend monitoring" isn't enough and you actually need budgets and rate limits enforced per team, not just visible after the fact, plus PII and prompt-injection guardrails applied at the gateway rather than in application code. The two aren't really competing for the same buyer: Vercel AI Gateway is the right call if your traffic is Vercel-shipped apps and the AI SDK is already your interface; TrueFoundry's gateway is built to sit in front of everything - Vercel-deployed apps included - once you have more than one team or more than one deployment target to govern consistently. Worth saying plainly since I'm not a neutral source on this comparison.

If you're running Vercel AI Gateway in production, has the automatic same-model failover actually saved you during a real provider outage, or has it mostly stayed dormant? Curious how often that specific feature actually fires versus just being there as insurance.

Best LiteLLM Alternatives in 2026

Sahajmeet Kaur — Wed, 15 Jul 2026 18:37:54 +0000

LiteLLM is a genuinely good piece of software - free, open source, 100+ providers behind one OpenAI-compatible format, and it's the default a lot of teams reach for first. It's also a self-hosted proxy you run and patch yourself, with governance features you mostly assemble on top rather than get out of the box, and that gap is exactly where each of the seven options below picks up.

TL;DR

LiteLLM covers routing, cost tracking, and basic guardrails well, but RBAC, budget enforcement, MCP governance, and tool-level policy are things you build around it, not things it ships with.
TrueFoundry is the pick if you want that governance layer (RBAC, budgets with an audit-first rollout, PII/prompt-injection guardrails, a native MCP Gateway) without assembling it yourself, in a managed, hybrid, or fully self-hosted deployment.
That said - if you want zero cost and the largest OSS community, or you're locked into Cloudflare/Kong/Vercel/Databricks already, one of the other six is honestly the better call, and this list says exactly where.

1. TrueFoundry

TrueFoundry's AI Gateway is the top pick here for one specific reason: the things LiteLLM leaves for you to build - RBAC scoped to teams and users, budgets that support an audit-first rollout before you switch on hard blocking, and PII/prompt-injection/content-moderation guardrails - are first-class here instead. It also ships a full MCP Gateway with proper inbound/outbound auth handling, per-tool approval gates for destructive actions, and Cedar/OPA-based tool policy, which LiteLLM doesn't have an equivalent to at all.

Deployment is the other differentiator: managed, hybrid, or fully self-hosted in your own VPC, so "I don't want my traffic touching someone else's infra" isn't a reason to rule it out the way it might be with a purely hosted option.

Where it's not the right call: it isn't free, and it isn't a community-maintained open-source project - you're trusting a vendor's roadmap. If your actual requirement is zero cost and the biggest OSS community behind the tool, that's LiteLLM itself, not any alternative to it.

3. Portkey

Portkey's gateway went fully open source under Apache 2.0 in March 2026, after the company said it was processing over a trillion tokens a day through the hosted product. Self-hosters get the same feature set as the paid version now: circuit breakers, usage policies, an MCP gateway with OAuth 2.1, and the full model catalog with no license key. It's a genuinely strong option if governance-plus-open-source is the specific combination you want.

Where it's not the right call: Palo Alto Networks announced intent to acquire Portkey in April 2026, with the deal expected to close around July 2026. That's not disqualifying today, but it's a real question mark on a production bet, and worth weighing before you commit to its current open-source trajectory.

4. Bifrost

Bifrost is a Go-based open-source gateway from Maxim AI that markets itself specifically on raw performance - the README claims 50x faster throughput than LiteLLM with under 100 microseconds of overhead at 5,000 requests per second, plus adaptive load balancing and guardrails. Those are Bifrost's own published numbers, not independently reproduced here, so verify them against your own traffic pattern before treating the multiplier as fact.

Where it's not the right call: if raw throughput at extreme RPS isn't actually your bottleneck, you're adopting Go infrastructure for a performance margin you may never notice.

5. Cloudflare AI Gateway

Free on every Cloudflare plan, running on Cloudflare's own edge network, with caching, retries, model fallback, and analytics if your app already sits behind Cloudflare. Genuinely zero setup in that case.

Where it's not the right call: no built-in RBAC, no policy-as-code, no per-team cost attribution. It's routing and caching, not governance - fine if that's all you need, not a fit if it isn't.

6. Kong AI Gateway

If you already run Kong as your API gateway, adding AI routing as a plugin is less new surface area than standing up a separate LLM-specific tool.

Where it's not the right call: per Kong's own positioning, AI capability here is an extension on a general-purpose API gateway, not purpose-built for LLM traffic - no real latency-based routing, no cost-aware fallback, no built-in guardrails.

7. Vercel AI Gateway

One endpoint, zero markup on tokens even with your own provider keys, and it's a one-line config change if you're already using the Vercel AI SDK.

Where it's not the right call: it's a routing and spend-visibility layer, not a governance one - no per-team RBAC, no budget enforcement, no MCP server governance. Great if you're a small team shipping on Vercel; not built for company-wide policy across many teams.

Where LiteLLM alternatives converge and diverge

Every option above other than LiteLLM itself either adds governance LiteLLM doesn't ship with (TrueFoundry, Portkey) or trades governance away entirely for being embedded in an ecosystem you're probably already in (Cloudflare, Kong, Vercel). None of them are a strict upgrade over LiteLLM in every dimension - they're each solving for something LiteLLM specifically doesn't prioritize. This other rundown of LiteLLM alternatives is worth a look if you want a second set of eyes on the same question, and covers a couple of options not on this list.

Which gap actually pushed you to look at alternatives - governance, performance, or just wanting to not run infrastructure? Curious whether the reasons cluster the way I'd expect or whether there's a category I'm missing entirely.

Claude Code Governance: Managed Settings, Audit Logs, and Where the Gaps Actually Are

Sahajmeet Kaur — Sat, 11 Jul 2026 13:34:18 +0000

TL;DR

Claude Code resolves settings through a precedence stack - managed settings beat everything else and can't be overridden by CLI flags, project settings, or user settings which makes the managed settings file the real enforcement point, not a suggestion.
Server-managed settings (configured once in the admin console, no file deployment needed) are available on Team (Claude Code 2.1.38+) and Enterprise (2.1.30+) plans; audit logs and the Compliance API are Enterprise-only.
Managed settings and audit logs cover the CLI and VS Code surface - Claude Desktop and claude.ai (the web app) are separate surfaces with different or nonexistent governance hooks, and MCP servers need their own access story on top of all of it.

The precedence stack is the actual mechanism

Per Claude Code's own settings docs, when the same setting shows up in more than one place, managed settings win, full stop - they can't be overridden by anything, including a developer editing their own local config. Below that: local, project, and user settings apply in descending priority, so a project-level rule beats a personal default but loses to whatever the managed file says.

That single fact is why managed settings, not developer training or a wiki page, are the actual governance layer. A rule that says "don't run destructive shell commands" or "block network access from tool calls" only means something if it's enforced somewhere a developer can't turn it off.

Managed settings can be deployed three ways: server-managed (configured in the claude.ai admin console, delivered to clients automatically at sign-in and re-polled hourly, no file deployment required), MDM/OS-level policies (a macOS managed-preferences domain or a Windows registry key, pushed through Jamf, Intune, or Group Policy), or file-based (managed-settings.json dropped into a system directory, with a managed-settings.d/ folder for teams to layer independent policy fragments without editing one shared file). Worth knowing: managed settings parse tolerantly - an invalid entry gets stripped with a warning instead of disabling the rest of your policy, which matters when multiple teams are contributing rules to the same fleet.

Which plan you're on actually matters here

Server-managed settings require Team (Claude Code 2.1.38 or later) or Enterprise (2.1.30 or later) - they're not available on Free, Pro, or Max. Audit logs go a step further: per Anthropic's own documentation, they're Enterprise-only, and an export aggregates the trailing 180 days of activity into a downloadable link that's live for 24 hours. The Compliance API sits behind the same Enterprise gate, and it's the difference between pulling a manual CSV export periodically and streaming audit events into a SIEM continuously.

If you're evaluating Claude Code governance for a team on anything below Team or Enterprise, there's nothing to configure yet - that's the first thing to check before writing any policy at all.

Where this stops covering you

Managed settings and audit logs are built around the CLI and VS Code extension. Two other surfaces need separate handling:

Claude Desktop uses a different managed-preferences mechanism (its own bundle identifier, separate from Claude Code's), so a policy written for the CLI doesn't automatically apply there.

claude.ai, the web app, is the hardest surface, and it's worth being precise about why: it's not just that no one has built an endpoint setting for it, it's that Anthropic doesn't allow proxying claude.ai traffic through an external gateway at all. There's no config file, no environment variable, no supported way to redirect it. The admin console's SSO, domain capture, and org-level policies are the only sanctioned levers on this surface - anything past that means intercepting traffic at the OS network layer on a managed device, which is a different (and more invasive) thing than the officially supported routing the CLI gets.

MCP servers are their own problem independent of which surface is calling them. Every MCP server a developer wires up is a new set of credentials and a new piece of attack surface - unvetted servers bring prompt-injection risk and, without a central registry, no audit trail of which tools were actually called or what data came back. The common recommendation is to route all MCP access through one gateway and allowlist only that gateway's URL in managed settings, rather than letting individual developers point Claude Code at arbitrary MCP endpoints.

Where a gateway fits into this - TrueFoundry AI Gateway

TrueFoundry's mapping of what it can govern per Claude surface looks like this, TrueFoundry can put every Claude surface not just the CLI -behind the AI Gateway, with fleet-wide enforcement via MDM where the tool supports it:

Capability	Claude Web	Claude Desktop / Cowork	Claude Code (CLI / VS Code)	Claude Code Max
Proxy and govern models	-	Yes	Yes	Yes
Proxy and govern MCP servers	-	Yes	Yes	Yes
MDM setup	-	-	Yes	Yes

Claude Web is the explicit exception in that table, for the reason above - Anthropic doesn't permit routing its traffic through an external gateway, so the only governance available there is SSO, domain capture, and Admin Console policy, not a gateway integration.

For Claude Code and VS Code, TrueFoundry sets ANTHROPIC_BASE_URL inside managed-settings.json, deployed and immutably locked via an open-source binary (tfy-local-ai-setup) pushed over MDM on macOS, Linux, and Windows, refreshing the gateway token on a schedule so it doesn't go stale. MCP access gets routed the same way: allowlist only the MCP Gateway's URL, and let the gateway handle RBAC, tool-level restrictions, and the audit trail instead of trusting each developer's own MCP configuration. Once traffic is behind the gateway, you get the same controls regardless of which Claude surface it came from: RBAC over users/teams/virtual accounts, per-user and per-team rate and budget limits, failover across providers via virtual models, PII/prompt-injection/content-moderation guardrails, and full request tracing.

Claude Code Max is a specific wrinkle worth naming: if you're on an Anthropic Max subscription, Claude Code reserves the Authorization header for your own Anthropic session, so the gateway authenticates separately via an x-tfy-api-key header instead of taking over that header entirely. Practically, that means developers keep signing in with their own Max subscription and Claude Code keeps working the way they're used to, while the gateway still gets usage visibility, quotas, RBAC, logs, and guardrails on top - governance without asking anyone to change their day-to-day workflow.

The honest limitation

None of this stops someone from just using a personal Claude account outside the managed fleet entirely, on a device MDM doesn't cover. Governance here is strong on the surfaces it reaches and has a real blind spot on the ones it doesn't which is worth saying plainly rather than implying full coverage. There's a good practical breakdown of the same gap in Control and Visibility for Claude Code if you want a second perspective on where enterprise deployments actually get stuck.

What's the surface your org hasn't figured out governance for yet - Desktop, the web app, or MCP servers specifically?

MCP Server Management: Centralizing Auth, RBAC, and Approval Gates Before They Sprawl

Sahajmeet Kaur — Sat, 11 Jul 2026 13:25:44 +0000

Six months ago we had one or two MCP servers. Now it's common for a team here to spin up a dozen - Slack, Jira, a database, an internal API wrapped as MCP, a couple of vendor-provided ones, each with its own credentials, and no single list of which agent could reach which one. That sprawl is what pushed us to put TrueFoundry's MCP Gateway in front of everything instead of letting each developer wire up their own connections.

TL;DR

MCP servers accumulate credentials and access fast, and without a central registry, nobody has an inventory of which agent can call which server or tool.
TrueFoundry's MCP Gateway sits between every MCP client (an IDE, an agent) and every registered server, terminates two separate auth handshakes (inbound from the client, outbound to the upstream server), and applies RBAC, guardrails, and observability at that single point.

The sprawl problem, concretely

Each new MCP server used to mean a new API key or OAuth app, configured wherever the agent framework happened to run - someone's local .env, a CI secret, a config map in whatever cluster. Nobody centrally decided "this agent can use this server," it just happened because someone wired it up to get their thing working. That's fine until you need to answer "which agents can currently write to Jira" or "revoke this one server's access without touching the other eleven" - which is exactly the question we couldn't answer before centralizing this.

What the gateway actually sits between

MCP itself is a JSON-RPC protocol: a client calls methods like initialize, tools/list, tools/call, and resources/list against a server, and the server responds with what it can do and the results of invoking it. Without a gateway, every IDE and every agent framework holds its own direct connection to every server it needs - N clients times M servers, each connection independently authenticated, independently configured, and invisible to anyone outside that one developer's machine.

TrueFoundry's MCP Gateway collapses that to N clients talking to one endpoint, which then talks to M servers on their behalf. The model is registration-once, consumption-by-many: our platform team registers a server a single time, and every developer's IDE, every agent, and every workflow that needs it points at the same gateway URL instead of holding its own credentials to the server directly.

Two separate auth handshakes, not one

This is the part that took us longest to get right conceptually, because "auth" for an MCP gateway is actually two independent problems: how the client authenticates to the gateway (inbound), and how the gateway authenticates to the upstream MCP server (outbound). Per the auth and security docs, the gateway configures these separately and they mix and match.

Inbound (client → gateway) supports a TrueFoundry API key, an identity provider token, or TrueFoundry OAuth - the gateway validates whichever one shows up and applies access control based on whoever it resolves to. Some people on our team authenticate with API keys, others through SSO via our IdP, and both get evaluated against the same RBAC without us having to reconcile two separate permission systems.

Outbound (gateway → upstream server) has four modes, and picking the right one depends on what the upstream server expects:

API Key - shared across everyone, or issued per individual so each caller's key can be tracked and revoked separately.
OAuth2, Authorization Code - for servers that need a real user to authorize (GitHub, Slack, Google), where each person consents once and the gateway stores and refreshes their token afterward.
OAuth2, Client Credentials - for server-to-server calls with no human in the loop.
Token Passthrough - the gateway forwards the same token it validated on the inbound side, which only works if the MCP server can itself validate TrueFoundry or your IdP's tokens. Token Forwarding is the related-but-different option where the client instead supplies separate upstream credentials via a header (x-tfy-mcp-headers), for servers with their own auth system that doesn't recognize inbound tokens at all.

The connection flow for an IDE looks like this: a developer signs in once with TrueFoundry OAuth, their IDE gets a temporary token to talk to the gateway, and if the target server needs a per-user provider token, the gateway handles that authorization separately and stores the resulting token at the gateway layer - it never reaches the developer's machine. Every tool call after that point is logged, authorized, and rate-limited by the gateway before it ever reaches the upstream server. For servers that support it, Auth Overrides let an individual user or a virtual account supply their own upstream credentials rather than sharing one team-wide key - currently supported for individually-issued API keys and OAuth2 Authorization Code, with shared-key and client-credentials overrides listed as coming soon per the docs.

What centralizing it actually buys us, beyond auth

RBAC decides who can use which server, and which tools within it. We can grant a team access to a server but disable its destructive tools - delete, write, admin actions - for everyone except a smaller group.
Destructive tools can pause for a human. Tools marked destructive automatically stop and wait for user confirmation instead of executing silently, which matters a lot more once an agent can actually delete a record or send a message on someone's behalf.
Guardrails run pre- and post-tool-call, the same policy-enforcement pattern as LLM input/output guardrails, just applied to tool arguments and tool results instead of prompts and completions.
Every tool call is traced with the actual JSON-RPC method, not just a generic "request happened" log line, which matters when we're debugging why a specific tool call failed versus a tools/list call.

Two ways we've added servers without hand-writing one

Virtual MCP servers let us curate tools from several registered servers into one narrower server for a specific team or workflow, without standing up a new deployment. When a workflow needs two tools from Jira and one from GitHub, we expose exactly those three under one virtual server instead of granting the requesting team the entire surface area of both underlying servers.

OpenAPI to MCP auto-generates an MCP server from an existing OpenAPI specification, which mattered for us because most of what we wanted to expose was already a REST API we'd written internally - we got governed, gateway-fronted MCP tools out of our existing API surface without writing a custom MCP server by hand.

Watching it after it's running

The MCP Metrics dashboard has two views: a server-centric one (requests per second, P50/P75/P90/P99 latency, failure rate broken down by error type, and a breakdown of which JSON-RPC methods - tools/list, tools/call, and so on - make up traffic mix) and a tool-centric one that drills into invocation count, latency, and error rate per individual tool across every server. Both can be sliced by user, virtual account, or team, which is the difference between "the GitHub server is slow" and "the GitHub server is slow specifically for one team's batch job" - a distinction that's saved us real debugging time.

If you want that data outside the dashboard, there's a metrics query API that accepts the same shape of query - datasource: "mcpMetrics", fields like mcpServerName, toolName, method, and latencyMs - so you can pull p99 latency per tool or find every tool call over five seconds programmatically instead of eyeballing a chart. We use this to feed our own internal alerting rather than watching the dashboard manually.

How many MCP servers does your org actually have registered somewhere central right now, versus scattered across people's local configs? We only found out our real number once we went looking - is that universal or if some teams actually had a handle on it already.

OpenRouter Alternatives: If You Need Self-Hosting, Lower Fees, or Real Governance

Sahajmeet Kaur — Tue, 07 Jul 2026 19:30:45 +0000

TL;DR

OpenRouter is a hosted aggregator - one API key, 315+ models, pass-through provider pricing plus a 5.5% card fee ($0.80 minimum) and a 5% BYOK fee above 1M requests a month.
People look elsewhere for three reasons: data residency (prompts and completions transiting a vendor you don't control), the fee math at real volume, or the lack of built-in RBAC, budgets, and audit trails.
Self-hosted options (LiteLLM, Kong) remove the third-party hop entirely; managed alternatives with more governance (Portkey, TrueFoundry, Cloudflare AI Gateway) exist if you don't want to run infrastructure yourself either.

What OpenRouter actually is, briefly

One API key and one credit balance in front of 315+ models from every major provider, with an OpenAI-compatible API so switching models is a one-line change. The free tier gives you 25+ free models at 20 requests/minute and 50 free-model requests a day, which jumps to 1,000/day once you've put at least $10 of credit on the account. Pay-as-you-go is exactly that, and Enterprise adds SSO, SLAs, and negotiated support. OpenRouter doesn't mark up the model price itself - what you pay per token matches the provider's own listed price - but there's a 5.5% platform fee on credit-card top-ups (with an $0.80 minimum, so small top-ups pay a higher effective rate) and a 5% bring-your-own-key fee on usage above 1M requests a month. See OpenRouter's own pricing page for the current numbers rather than trusting a snapshot in this post.

Why people actually go looking for something else

Governance and compliance constraints
Using OpenRouter requires routing requests through a third-party proxy before they reach the model provider. For regulated industries, this additional hop can complicate compliance with frameworks such as GDPR, HIPAA, or internal data residency requirements. OpenRouter also offers limited pre-processing controls for enforcing organizational policies before data leaves the application environment.

Gaps in observability and debugging
OpenRouter provides usage and billing visibility but offers limited execution-level observability. For production systems, teams often need traces that link prompts, routing decisions, latency, and model-specific failures. Without integrated tracing or easy export of telemetry into internal observability stacks, debugging complex workflows becomes operationally expensive.

Governance. OpenRouter is built around a single account and credit balance. If you need per-team budgets, RBAC over who can call which models, or an audit trail that satisfies a compliance review, that's not really what it's designed for - you'd be building it on top rather than getting it out of the box.

Data residency. If your prompts or completions are regulated or contractually restricted from touching infrastructure outside your own environment, a third-party aggregator is off the table before cost or features even come up.

Fee math at volume. The 5.5%/5% fees are trivial at low spend and stop being trivial once you're processing meaningful volume - at that point, the infrastructure cost of self-hosting a gateway is competing directly against a percentage of your entire model spend, not a small platform fee.

The actual alternatives

TrueFoundry AI Gateway - Runs managed, hybrid, or fully self-hosted in your own VPC, with RBAC, budgets, and guardrails built in rather than added separately - the thing I'd point to if governance is the actual gap you're trying to close, not just cost or latency.
TrueFoundry's gateway is uniquely built for the era of Agentic AI. It natively supports the Model Context Protocol (MCP), allowing your agents to securely connect to internal tools and data sources with centralized governance via the MCP Gateway. Its multi-model routing goes beyond simple price and latency; you can define sophisticated fallback chains, enforce team-level quotas, and use a unified AI Gateway Playground to test and version prompts across 250+ models. With integrated observability, TrueFoundry captures end-to-end traces of every interaction, making it a comprehensive control plane for the entire LLM lifecycle.

LiteLLM - open source, self-hosted, 100+ providers behind one OpenAI-compatible format, with cost tracking, guardrails, and load balancing built in. Free to run; you pay for the Postgres/Redis/compute it needs, which for a real production deployment tends to run a few hundred dollars a month. The default choice if you want to leave OpenRouter for a self-hosted option with the largest community around it.

Portkey - open sourced under Apache 2.0 in March 2026 after processing what the company says is over a trillion tokens a day on its hosted product. Routes to 1,600+ models and ships an MCP gateway with OAuth 2.1, which matters if governance over tool calls is as much your concern as governance over model calls. Palo Alto Networks announced intent to acquire Portkey in April 2026, with the deal expected to close around July 2026 - worth watching before betting a production stack on where the open-source version goes next.

Kong AI Gateway - if you already run Kong as your API gateway, adding AI routing as a plugin is less new surface area than standing up a separate LLM-specific tool. The tradeoff, per Kong's own positioning: AI routing here is an extension on a general-purpose gateway, not purpose-built, so you're not getting cost-aware routing or built-in guardrails the way a dedicated option provides.

How I'd narrow it down

If the problem is...	Reach for
You already run Kong	Kong AI Gateway
Governance, open source preferred	Litellm/Portkey
Governance (RBAC, budgets, audit) without building it yourself, managed/hybrid is fine	TrueFoundry

What actually pushed you off OpenRouter, if you've moved - was it one of these three reasons, or something I haven't listed?

The Four Questions I Ask Before Any Team Gets LLM Access: Cost, Safety, Access, Audit

Sahajmeet Kaur — Sun, 05 Jul 2026 06:30:00 +0000

TL;DR

LLM governance comes down to four questions per request: what's it costing (FinOps), is it safe (guardrails), who's allowed to make it (access control), and can we reconstruct it later (audit).
Budgets and guardrails both support an audit-first rollout: run in audit/observe mode to see what would have been blocked before you actually start blocking anything.
Policies should get stricter as you go from dev to staging to prod, not uniform everywhere - permissive dev environments and locked-down prod environments solve different problems.

Cost: can you cap it before it caps you

It's genuinely easy to blow up an LLM bill with a bug - an agent stuck in a retry loop, a prompt that keeps growing, a batch job someone forgot was scheduled. Budget limiting with a hard block is one option, but there's a middle step worth knowing about: audit mode, where you deploy a budget rule, watch real traffic against it for a full period, and only flip on enforcement once you trust the number. Useful when you genuinely don't know what a reasonable cap looks like yet and don't want to find out by breaking something.

Safety: is sensitive data leaving, and is the model behaving

This is the guardrails half of the equation: PII/PHI detection, prompt injection and jailbreak defense, content moderation, and general DLP, applied to what goes in and what comes back. TrueFoundry's built-in guardrails cover content moderation, PII detection, and prompt injection out of the box on the SaaS gateway (backed by Azure AI Content Safety and Azure AI Language under the hood) - worth noting these three specific ones aren't available if you're running the gateway fully self-hosted, where you'd reach for OpenAI Moderations, Bedrock Guardrails, or a bring-your-own-key option instead.

There's also a layer most people don't think of as a "guardrail" at first: fine-grained access control over which tools an agent can call, using something like Cedar or OPA policy language with default-deny. That matters more as agents start calling MCP tools that can actually do things, not just answer questions.

Access: who's allowed to call which models, how often

This is RBAC and rate limiting - scoped API keys per team or application, and per-user/per-model/per-application throttles. One pattern I like from here: pinning a production gateway so it only accepts requests tagged env: prod, using metadata that's injected automatically from virtual accounts rather than something a developer has to remember to set by hand. It means a dev build pointed at the wrong URL gets rejected instead of quietly running against prod.

Audit: can you answer "who did this" six months later

Every AI call, logged: who made it, what they sent, what came back. Not glamorous, but it's the difference between "we think this is what happened" and "here's exactly what happened" when someone asks during a compliance review or an incident postmortem.

The pattern across all four: audit before enforce, tighten as you go to prod

Two things show up repeatedly in how these are meant to be rolled out. First, both budgets and guardrails support an audit-only mode - deploy the rule, watch what it would have done against real traffic, then switch on enforcement once you trust it. That's a much less painful way to introduce a new policy than shipping it straight to blocking and finding out in production that your thresholds were wrong.

Second, policies should get progressively stricter from dev to staging to prod, not applied uniformly. A permissive dev gateway lets engineers iterate without guardrails getting in the way of testing edge cases; a locked-down prod gateway is where you actually want PII redaction and prompt injection defense running on every request. Applying prod-strictness everywhere just trains people to route around it.

What TrueFoundry's AI Gateway Provides Here

Everything in this post - virtual keys, RBAC, budgets, rate limits, audit logs, residency rules, and guardrails as enforced policy — is something TrueFoundry's AI Gateway expresses as configuration in one control plane. Access control defines who (users, teams, virtual accounts) may call which provider accounts and models; Personal Access Tokens and Virtual Account Tokens are how applications authenticate to the gateway instead of holding raw provider keys; rate-limit and budget configs apply per user, team, virtual account, model, or any custom metadata key; and guardrails — including Cedar and OPA as policy-as-code at the MCP-tool boundary — run as enforced rules at four lifecycle hooks.

TrueFoundry's AI Gateway is an enterprise-grade control plane that sits between your applications and 1,600+ models — across OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, and your own self-hosted models — behind a single OpenAI-compatible API. It turns the governance controls in this post into configuration rather than per-service code: Virtual Accounts for non-human production identity, Personal Access Tokens for development, RBAC scoped per provider account, rate-limit and budget rules expressed as YAML with per-user/per-team/per-model/per-metadata scopes, and policy-as-code guardrails at the MCP tool boundary.

Because the gateway already sits on every request and emits a complete trace for every call, it is also where compliance-grade audit becomes practical. Identity, policy decisions, guardrail outcomes, model choice, token counts, and cost all land on the same trace ID — visible in Request Traces in the UI, exportable through OpenTelemetry, and accessible by API for SIEM integration. The same gateway adds exact and semantic caching, fallbacks and retries, and observability dashboards, deploys as SaaS or in your VPC, on-prem, or air-gapped with SOC 2, HIPAA, and ITAR compliance, and is recognized in Gartner's Market Guide for AI Gateways.

Which of the four - cost, safety, access, or audit is the one your org has actually solved, and which one is still duct tape? I'd guess audit is the one most teams skip until they need it.

What Is an LLM Gateway? Routing, Fallback, and Rate Limits Explained

Sahajmeet Kaur — Fri, 03 Jul 2026 20:36:38 +0000

TL;DR

An LLM gateway is a proxy that gives your app one API for every model provider, translating a single request shape into whatever each backend actually expects.
Teams add one for three concrete reasons: provider outages, per-provider rate limits, and SDK lock-in when application code talks to a specific vendor's client directly.
Once every call passes through one place, that place naturally becomes where you'd add routing, caching, rate limiting, and governance - not because a gateway is "for" those things, but because it's the only component that sees every request.

What it actually is, concretely

Without a gateway, your application code holds a provider SDK - say, the OpenAI Python client and calls it directly. Want to add Anthropic as a second option? Now you hold two SDKs with two different request and response shapes, and your code has an if/else for which one to use. An LLM gateway sits in front of both, exposes one API (usually OpenAI-compatible, since that's become the de facto standard most tooling expects), and translates your single request into whatever the actual provider needs on the other side. Your application never touches a provider SDK directly again - it points at the gateway's URL and changes a model name string when it wants to switch.

That's the whole concept. Everything else people associate with gateways - load balancing, caching, guardrails, cost tracking is stuff that tends to get added on top, because centralizing every model call in one place makes it a natural spot to add all of it.

Why teams actually add one

Outages. OpenAI and Anthropic both had multiple incidents on their public status pages between February and May of 2025 - nothing exotic, just the kind of thing that happens when you depend on someone else's infrastructure. If your app calls one provider directly and that provider has a bad afternoon, your app has a bad afternoon too. A gateway that can fail over to a second provider when the primary is down keeps your app up during exactly the window a single-provider setup goes down.

Rate limits. Azure OpenAI, like most providers, enforces tokens-per-minute and requests-per-minute quotas per model per region. Under normal load that's invisible. Under a traffic spike, or a bug that puts an agent into a retry loop, you hit the ceiling and start getting 429s. A gateway that understands rate limits can route the overflow to a different model or provider instead of just failing the request.

Lock-in. Every provider's SDK is shaped a little differently, and if your codebase talks to a raw provider SDK in forty places, swapping in a cheaper or better model six months from now means touching forty places. A gateway means you change a model name in one config, not your application code.

What to actually check a gateway has

Not every gateway does all of this, and it's worth checking which pieces a specific option actually implements before assuming:

Provider coverage and API shape. How many providers, and is the exposed API OpenAI-compatible so existing SDKs work with just a base-URL change, or a bespoke format you have to adopt.
Routing and fallback. Weighted routing, latency-based routing, automatic failover, and canary rollouts for testing a new model on a slice of traffic before trusting it with everything. TrueFoundry's routing docs walk through this pattern in detail if you want the mechanics.
Rate limiting and caching, applied per user, team, or application, not just a single global limit. See rate limiting and semantic caching for what that looks like in practice.
Guardrails and governance, if you need PII detection, prompt injection defense, budgets, or RBAC baked in rather than built separately.
Deployment model. Managed SaaS, self-hosted, or hybrid - this matters a lot if you have compliance requirements that rule out sending traffic through someone else's infrastructure. TrueFoundry's deployment modes doc covers that split.
Open source vs. managed, and what that means for cost at your actual request volume, not just at a demo scale.

Where I'd push back on my own argument

If you're calling one provider, for one use case, at low volume, you probably don't need any of this yet. Add a gateway once you're multi-provider, multi-team, or once an outage or rate-limit wall has actually cost you something, not before.

What pushed you to add a gateway, if you have one - was it an outage, a rate-limit wall, or something else entirely? Curious whether the trigger is usually the same across different teams' setups.

What I Actually Look at When Debugging a Slow or Expensive LLM Call

Sahajmeet Kaur — Fri, 03 Jul 2026 20:06:20 +0000

TL;DR

Standard APM metrics (latency, status code, error rate) don't capture the things that actually drive LLM cost and behavior: token counts, per-model cost, guardrail overhead, and prompt-level detail.
Traces need to include input/output tokens, cost, and latency per hop, and that data has to flow to wherever your team already looks - usually via OpenTelemetry into Langfuse, SigNoz, or a similar backend.
Guardrail metrics are a separate axis worth watching on their own: how often is a guardrail blocking, mutating, or adding latency, and to which users or teams.

The gap between "observability" and "LLM observability"

A normal trace tells you a service was called and how long it took. An LLM trace needs a few more fields to be useful: input tokens, output tokens, per-model cost, and often the prompt and completion themselves for debugging quality issues (with the access controls that implies - more on that below). OpenTelemetry's GenAI semantic conventions exist specifically because different LLM observability tools were inventing incompatible attribute names for the same underlying data, and it's worth reading if you're instrumenting anything yourself.

TrueFoundry's metrics dashboard rolls LLM and MCP performance, cost, guardrail outcomes, routing decisions, and cache hit rates into one place, which is the same shape of data most of the standalone LLM observability tools (Langfuse, SigNoz, Lunary, Laminar) expect if you export traces out via OTLP.

What actually matters when something's slow or expensive

Token counts, not just latency. A slow request and an expensive request are different problems with different fixes. If a request is slow because it's generating 4,000 tokens, the fix might be a lower max_tokens or a smaller model, not a networking investigation.

Per-model, per-provider cost, in real time, not from a monthly invoice you see three weeks later. By the time a cost anomaly shows up on a bill, whatever caused it has usually already happened a thousand more times.

Guardrail latency and outcomes, as their own metric. Guardrail metrics - evaluated requests, blocked/mutated rates, and P50-P99 latency per guardrail - matter because a PII redaction or prompt-injection check that adds 400ms per call is a real cost, and it's easy to add guardrails without ever checking what they cost you in latency.

Traces, exported to wherever your team already looks. TrueFoundry exports OpenTelemetry traces (and separately, metrics) to whatever OTEL-compatible backend you already run - the export docs cover the setup, and there are dedicated guides for Langfuse and other backends if you want traces to land somewhere your team already has dashboards.

Who can see what. Request logs contain prompts and completions, which in a lot of orgs is sensitive by default. Data access rules let you restrict prod request logs to on-call/SRE/security while keeping dev and staging logs open to a wider group - a distinction that's easy to skip until someone asks why an intern can read production customer prompts.

Where this gets harder than normal service observability

Multi-hop agent traces are the part that still isn't fully solved industry-wide. When agent A calls agent B calls an MCP tool, keeping a single trace ID coherent across all three hops so you can reconstruct "what actually happened" after the fact is genuinely a harder problem than tracing a normal microservice call chain, mostly because the agent frameworks involved don't all propagate context the same way.

Audit logs are a related but distinct thing worth not conflating with observability: they're the durable "who did what" record for compliance, not the dashboard you check when something's slow.

What's the messiest part of your own LLM tracing setup right now - is it the multi-hop context problem, cost attribution, or something else entirely? Curious what other people have run into.

MCP Authentication: How We Secured 12 MCP Servers Without Losing Our Minds

Sahajmeet Kaur — Wed, 01 Jul 2026 07:47:29 +0000

TL;DR: MCP authentication is genuinely more complex than regular API auth because you're managing credentials across many servers, for many agents, under many user identities - often all at once. The approaches range from static API keys (fast, insecure at scale) to OAuth 2.1 with PKCE (spec-compliant, more setup) to a centralized gateway that handles all downstream auth for you. We went through all three stages. This post covers what we learned.

Eight months ago our MCP auth story was: shared API key in a .env file, every developer had access to everything, fingers crossed nothing bad happened.

Two near-misses later - one agent that almost deleted production data via a misconfigured write tool, one contractor whose MCP access wasn't revoked after they left - we got serious about it.

Here's the setup we landed on, what each part solved, and what it cost in setup time.

Why MCP auth is harder than regular API auth

Regular API auth has one credential relationship: your application authenticates to a service. MCP auth has three:

Client → Gateway/Server: how your agent proves its identity to the MCP infrastructure
Gateway → Downstream service: how the MCP server authenticates to GitHub, Jira, Slack, or whatever backend it wraps
User delegation: when an agent acts on behalf of a specific human (post to Slack as a user, not as a bot), how that user's identity flows through the call chain

Managing all three manually, per server, per developer, per agent is where the complexity explodes. Most MCP auth problems are coordination problems, not cryptography problems.

The authentication methods, in order of complexity

Static API keys / Bearer tokens

The simplest option. The MCP server expects a static Bearer token in the Authorization header. You set it once in the server config and again in the client config. Done.

{
  "mcpServers": {
    "my-server": {
      "url": "https://my-mcp-server.internal/mcp",
      "headers": {
        "Authorization": "Bearer your-static-token-here"
      }
    }
  }
}

Where it works: Internal servers with a single operator, dev environments, quick prototypes.

Where it breaks: Rotation. Every time the token changes, every client that uses it needs updating. With 15 developers and 8 servers, token rotation becomes a coordination nightmare. And if the token is in a .env file in a repo, it's eventually in git history.

The real risk: Static tokens have no user identity attached. When an agent calls a tool using a static token, there's no way to know which developer or workflow triggered it. Audit trails become "token X called tool Y at time Z" - useless for compliance or incident response.

Environment variables for stdio servers

Stdio-based MCP servers run as local processes. Auth happens outside the MCP protocol — you pass credentials as environment variables that the server reads at startup.

{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${env:GITHUB_TOKEN}"
      }
    }
  }
}

The ${env:VAR} syntax in Claude Code and Cursor pulls from your shell environment rather than hardcoding the value in the config file — this keeps credentials out of version control.

Where it works: Local development with stdio servers where each developer authenticates with their own credentials.

Where it breaks: It doesn't scale. Each developer manages their own credentials per server. There's no centralized revocation. When a developer leaves, you're hoping they cleared their local environment (they often haven't).

OAuth 2.1 with PKCE for remote servers

The MCP spec standardized OAuth 2.1 with PKCE in its March 2025 revision. This is the correct long-term answer for remote MCP servers because it ties tool calls to real user identities through your existing identity provider.

The flow:

Agent initiates an MCP connection
Server redirects to your IdP (Okta, Azure AD, Auth0)
User authenticates in the browser
IdP issues an authorization code
Client exchanges the code for an access token (PKCE ensures this can't be intercepted)
Token is attached to all subsequent MCP calls

What this gives you: Tool calls are tied to the authenticated user, not a shared service credential. Tokens expire and auto-refresh. Revoking a user's access in your IdP automatically revokes their MCP access.

What this costs: More setup. Your MCP server needs to implement the OAuth resource server side. Your client needs to handle the browser redirect flow. Not all MCP clients have fully implemented the November 2025 spec revision yet - check your specific client's OAuth support before depending on it.

PATs and VATs for service accounts

For production agents that run without human-in-the-loop (no browser redirect possible), the pattern is Personal Access Tokens (PATs) for individual users and Virtual Account Tokens (VATs) for service accounts.

PAT: bound to a specific user's identity, appropriate for development workflows and user-delegated actions
VAT: a service account credential with defined permissions, appropriate for automated agents running in production without a human user attached

The distinction matters for audit trails: PAT calls show as coming from the specific developer; VAT calls show as coming from the named service account.

The setup we landed on

After the two near-misses, we didn't want to manage individual OAuth flows per server per developer. The maintenance surface was too large. What we implemented instead was a centralized gateway that handles all downstream auth for us.

How it works:

One token per developer, one token per service account. Developers authenticate to the gateway with a single PAT. Service agents authenticate with a VAT. The gateway manages every downstream credential — GitHub OAuth tokens, Jira API keys, Confluence tokens, internal API service accounts and auto-refreshes them before they expire.

RBAC at the tool level. We can say: "team type A can call search_issues and create_issue in Jira but not delete_issue." Defined per-server, per-tool, per-role in the gateway config. The agent never sees tools it isn't authorized to call - tools/list returns a filtered set based on the caller's identity.

OAuth 2.0 for user-delegated actions. For tool calls that should act on behalf of a real user - posting to Slack as a specific person, creating a Jira ticket attributed to the right engineer - we use OAuth 2.0 with our Okta setup. The gateway handles token exchange and refresh. Agents don't manage OAuth flows directly.

Audit log for every call. Every tool invocation logged: which agent, which user identity, which tool, what parameters, what response, timestamp. This was non-negotiable for our security team and it's also genuinely useful for debugging production agent failures.

We implemented this using TrueFoundry's MCP Gateway. The Okta integration took about a day to configure. The RBAC setup took roughly a day per server to define policies properly. The time investment paid back in the first month - we had one offboarding event and MCP access was fully revoked in a single dashboard action rather than hunting down six separate credentials.

The CVE worth knowing about

In early 2026, JFrog Security Research disclosed a vulnerability in a popular MCP OAuth implementation (v0.1.16 and earlier) where the package forwarded OAuth authorization endpoint URLs to system handlers without sanitization. A malicious MCP server could construct a URL that executed arbitrary OS commands on the developer's machine.

The fix shipped in v0.1.16. But the broader lesson: OAuth flows in MCP clients are relatively new and the spec is still settling (the November 2025 revision introduced Client ID Metadata Documents as the preferred registration method, replacing Dynamic Client Registration in most cases). Check your MCP client's patch level and the spec compliance version it's implementing before depending on OAuth for production workloads.

What's your current MCP auth setup, and what forced you to take it seriously? The "near-miss with a write tool" pattern seems to be a common forcing function - interested to hear what others have hit. Drop it in the comments.