What Is an LLM Gateway? Routing, Fallback, and Rate Limits Explained

#ai #llm #devops #opensource

TL;DR

An LLM gateway is a proxy that gives your app one API for every model provider, translating a single request shape into whatever each backend actually expects.
Teams add one for three concrete reasons: provider outages, per-provider rate limits, and SDK lock-in when application code talks to a specific vendor's client directly.
Once every call passes through one place, that place naturally becomes where you'd add routing, caching, rate limiting, and governance - not because a gateway is "for" those things, but because it's the only component that sees every request.

What it actually is, concretely

Without a gateway, your application code holds a provider SDK - say, the OpenAI Python client and calls it directly. Want to add Anthropic as a second option? Now you hold two SDKs with two different request and response shapes, and your code has an if/else for which one to use. An LLM gateway sits in front of both, exposes one API (usually OpenAI-compatible, since that's become the de facto standard most tooling expects), and translates your single request into whatever the actual provider needs on the other side. Your application never touches a provider SDK directly again - it points at the gateway's URL and changes a model name string when it wants to switch.

That's the whole concept. Everything else people associate with gateways - load balancing, caching, guardrails, cost tracking is stuff that tends to get added on top, because centralizing every model call in one place makes it a natural spot to add all of it.

Why teams actually add one

Outages. OpenAI and Anthropic both had multiple incidents on their public status pages between February and May of 2025 - nothing exotic, just the kind of thing that happens when you depend on someone else's infrastructure. If your app calls one provider directly and that provider has a bad afternoon, your app has a bad afternoon too. A gateway that can fail over to a second provider when the primary is down keeps your app up during exactly the window a single-provider setup goes down.

Rate limits. Azure OpenAI, like most providers, enforces tokens-per-minute and requests-per-minute quotas per model per region. Under normal load that's invisible. Under a traffic spike, or a bug that puts an agent into a retry loop, you hit the ceiling and start getting 429s. A gateway that understands rate limits can route the overflow to a different model or provider instead of just failing the request.

Lock-in. Every provider's SDK is shaped a little differently, and if your codebase talks to a raw provider SDK in forty places, swapping in a cheaper or better model six months from now means touching forty places. A gateway means you change a model name in one config, not your application code.

What to actually check a gateway has

Not every gateway does all of this, and it's worth checking which pieces a specific option actually implements before assuming:

Provider coverage and API shape. How many providers, and is the exposed API OpenAI-compatible so existing SDKs work with just a base-URL change, or a bespoke format you have to adopt.
Routing and fallback. Weighted routing, latency-based routing, automatic failover, and canary rollouts for testing a new model on a slice of traffic before trusting it with everything. TrueFoundry's routing docs walk through this pattern in detail if you want the mechanics.
Rate limiting and caching, applied per user, team, or application, not just a single global limit. See rate limiting and semantic caching for what that looks like in practice.
Guardrails and governance, if you need PII detection, prompt injection defense, budgets, or RBAC baked in rather than built separately.
Deployment model. Managed SaaS, self-hosted, or hybrid - this matters a lot if you have compliance requirements that rule out sending traffic through someone else's infrastructure. TrueFoundry's deployment modes doc covers that split.
Open source vs. managed, and what that means for cost at your actual request volume, not just at a demo scale.

Where I'd push back on my own argument

If you're calling one provider, for one use case, at low volume, you probably don't need any of this yet. Add a gateway once you're multi-provider, multi-team, or once an outage or rate-limit wall has actually cost you something, not before.

What pushed you to add a gateway, if you have one - was it an outage, a rate-limit wall, or something else entirely? Curious whether the trigger is usually the same across different teams' setups.