Why I Put an LLM Gateway in Front of Every Model Call: Outages, Rate Limits, Lock-in

#llm #ai #devops #mlops

TL;DR

Provider outages, rate limits, and per-provider SDK differences are the three concrete reasons teams end up routing LLM traffic through a gateway instead of calling providers directly.
A gateway gives you one OpenAI-compatible endpoint, load balancing with automatic fallback, and semantic caching, without changing application code when you add or swap a model.
It's also the natural place to enforce budgets, rate limits, and guardrails, which is worth knowing before you pick a gateway that doesn't do those things.

The three problems that show up first

Outages. OpenAI and Anthropic both had multiple incidents on their public status pages between February and May of 2025 - nothing exotic, just the kind of thing that happens when you depend on someone else's infrastructure. If your app calls one provider directly and that provider has a bad afternoon, your app has a bad afternoon too. Routing through a gateway that can fail over to a second provider when the primary is down is the same instinct that led people to put load balancers in front of web servers twenty years ago.

Rate limits. Azure OpenAI, like most providers, enforces tokens-per-minute and requests-per-minute quotas per model per region. Under normal load that's invisible. Under a traffic spike, or a bug that puts an agent into a retry loop, you hit the ceiling and start getting 429s. A gateway that understands rate limits can route the overflow to a different model or provider instead of just failing the request.

Lock-in. Every provider's SDK is shaped a little differently, and if your codebase talks to the raw OpenAI SDK in forty places, swapping in a cheaper or better model six months from now means touching forty places. An OpenAI-compatible gateway endpoint means you change a model name in one config, not your application code.

What "gateway" tends to mean in practice

Concretely, most teams end up wanting some combination of these, and it's worth checking which ones a given gateway actually has before you commit to one:

One endpoint for every provider. TrueFoundry's AI Gateway exposes an OpenAI-compatible schema in front of over a thousand models across providers, so your application code doesn't need a different client per vendor.
Load balancing and fallback, so a model going down or slowing down doesn't take your app with it. The routing docs walk through weighted routing, latency-based routing, and canary rollouts for testing a new model on a slice of traffic before trusting it with everything.
Rate limiting, applied per user, per team, or per application, so one runaway script or one noisy customer doesn't starve everyone else. Details here.
Semantic caching, which cuts cost and latency on requests that are semantically similar to ones you've already served. Docs on that.

The part people don't expect: it becomes your policy chokepoint

Once every model call passes through one place, that place is also where you'd naturally enforce budgets, access control, and guardrails - not because a gateway is "for" governance, but because it's the only component that sees every request regardless of which team or which model made it. I wrote a follow-up on the governance side specifically, since it's a big enough topic on its own.

Deployment matters here too: TrueFoundry's gateway can run fully managed, hybrid (your infra, their control plane), or entirely self-hosted in your own VPC, per the deployment modes doc - worth checking if "SaaS gateway" is a non-starter for your compliance posture.

If you'd rather run something yourself first, TrueFoundry's own writeup on LLM gateways covers the same ground from the product side, and LiteLLM and Bifrost are the two open-source options I see mentioned most often as a starting point.

Where I'd push back on my own argument

A gateway is one more moving part and one more thing that can be slow or wrong. If you're calling one provider, for one use case, at low volume, you probably don't need this yet - you need it once you're multi-provider, multi-team, or once a provider outage actually cost you something. This piece on LLM gateway vs proxy vs router is a good breakdown of where the lines are if you're trying to figure out how much you actually need. There's also a solid walkthrough of wiring automatic provider fallback into an agent if you want to see the failover pattern in code before adopting a whole gateway.

What's your actual trigger for adding a gateway layer - was it an outage, a rate limit wall, or something else? I'm curious whether the reasons match across different teams' setups or whether I'm missing a category entirely.