AI Gateways in 2026: a field guide to the 106× cost problem

#ai #webdev #opensource #llm

If you call more than one large language model from your code, you have already met the problem an AI gateway solves — you just may not have named it yet.

Here is the number that makes the case. Take one concrete task: generate a 100,000-token report. Send it to the cheapest capable model and it costs about $0.03. Send the same task to the most expensive frontier model and it costs about $3.01. That is a 106× spread for output a user often cannot tell apart.
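The arithmetic behind that spread is easy to check by hand. Here is a minimal sketch, assuming the whole task is billed at an output-token rate. The two per-million prices are hypothetical values back-solved from the rounded dollar figures above, not any provider's list price.

```python
def task_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of generating `tokens` output tokens at a per-million rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


REPORT_TOKENS = 100_000

# Hypothetical prices, back-solved from the rounded figures in the text.
cheap = task_cost(REPORT_TOKENS, 0.30)      # about $0.03
frontier = task_cost(REPORT_TOKENS, 30.10)  # about $3.01

spread = frontier / cheap
print(f"cheap=${cheap:.2f} frontier=${frontier:.2f} spread={spread:.0f}x")
```

With the rounded dollar figures the ratio lands near 100×, so the 106× headline presumably comes from the unrounded prices.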

No team rewrites its application eleven times to chase that spread. An AI gateway is how you capture it without rewriting anything.

What an AI gateway actually is

Strip away the marketing and it is a proxy that sits between your code and the model providers. You point your OpenAI-compatible client at the gateway instead of at OpenAI, and in return you get one endpoint and one key for many models — plus the things you would otherwise build yourself: automatic failover when a provider has a bad minute, caching, per-team rate limits and budgets, usage and cost tracking, and guardrails.

The mental model: you change a base_url, not your application.

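```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of at OpenAI.
client = OpenAI(
    base_url="https://your-gateway/v1",   # the only change
    api_key="...",                        # the gateway's key, not a provider key
)
response = client.chat.completions.create(
    model="anthropic/claude-fable-5",      # ask the gateway for any provider's model
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```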

The only decision that matters first: self-host or hosted

Everything else follows from this.

Hosted, minimal ops. You want to be calling models in five minutes and you are fine paying a small fee for it. OpenRouter is the marketplace default — 400+ models, ~5.5% on credits. Vercel AI Gateway and Cloudflare AI Gateway go further and charge 0% markup, billing you at provider list price while adding routing and caching on top.

Self-hosted, your infrastructure. Your keys, your network, no per-token middleman fee — you pay only for the box it runs on. LiteLLM is the broad default (Python, 100+ providers, virtual keys and budgets). If the gateway must never be your bottleneck, Bifrost (Go) and TensorZero (Rust) are built for throughput. If you already run Kubernetes, the AI plugins on Kong, Higress or Apache APISIX mean one less new service to operate.

In the Chinese ecosystem the same role is played by new-api and one-api, which add key distribution and billing on top — useful when you need to resell or meter access across a team.

Three things engineers consistently miss

1. Reasoning tokens are billed as output, and they are invisible. Modern reasoning models emit hidden "thinking" tokens charged at the (high) output rate. A task that looks like 20K tokens of output can bill as 50K or more. When you size a budget, size it against billed output (the visible answer plus the hidden reasoning), not against the visible answer alone, and use the model's effort controls to cap it.

2. Cached input is 5–10× cheaper, and fragile. Providers bill a reused prompt prefix at a steep discount. But caching is a prefix match: change one byte near the front — a timestamp, a reordered JSON field — and you silently fall back to full price. A gateway that rewrites or normalizes your prompts can quietly destroy a cache-hit rate you were counting on.

3. The gateway is your security perimeter, so patch it like one. It sees every prompt and holds every key. In 2026, LiteLLM shipped two serious CVEs — a pre-auth SQL injection and an unauthenticated RCE that landed on CISA's exploited-vulnerabilities list — both fixed in v1.83.7. The lesson is not "avoid LiteLLM"; it is that popularity makes a gateway a target. Pin to current stable, restrict egress, and never expose the admin panel to the public internet.
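The first two points lend themselves to a quick sketch. Below, `billed_output_tokens` sizes a budget against visible plus hidden reasoning tokens, and `build_prompt` keeps the cacheable prefix byte-stable by serializing JSON with sorted keys and pushing volatile values such as timestamps to the end. All function names and sample values are hypothetical. Real providers differ in how they report reasoning tokens and in what they treat as a cacheable prefix.

```python
import json


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Output tokens actually billed: the visible answer plus hidden reasoning."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens: int, reasoning_tokens: int,
                usd_per_million_output: float) -> float:
    """USD cost of the output side of one call."""
    billed = billed_output_tokens(visible_tokens, reasoning_tokens)
    return billed / 1_000_000 * usd_per_million_output


def build_prompt(system_text: str, static_context: dict, volatile: dict) -> str:
    """Stable content first, volatile content last, so the prefix stays identical."""
    # sort_keys and fixed separators make the serialization independent of dict order.
    stable = json.dumps(static_context, sort_keys=True, separators=(",", ":"))
    tail = json.dumps(volatile, sort_keys=True, separators=(",", ":"))
    return f"{system_text}\n{stable}\n{tail}"


# 20K visible tokens plus an assumed 30K of hidden reasoning bills as 50K.
print(billed_output_tokens(20_000, 30_000))

# Same context with reordered keys and a different timestamp: the prefix is unchanged.
a = build_prompt("You are a report writer.", {"team": "ops", "region": "eu"},
                 {"now": "2026-01-01T00:00:00Z"})
b = build_prompt("You are a report writer.", {"region": "eu", "team": "ops"},
                 {"now": "2026-01-01T00:05:00Z"})
prefix_len = len(a) - len(a.rsplit("\n", 1)[1])
print(a[:prefix_len] == b[:prefix_len])
```

The design choice worth copying is the ordering: anything that changes per request belongs after everything that does not, so a prefix match still has something to match.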

The senior take

After comparing dozens of these, the reframing that helped most: stop shopping for "the best gateway" and start designing your routing and governance. The gateway is plumbing. The value is the policy you run through it — cheap model by default, escalate to a flagship only when a task fails; one audit trail; one budget; one place to enforce data-retention rules. Pick the gateway that makes your policy easy to express, and you will care a lot less about the feature-matrix differences that vendor blog posts obsess over.
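That default-then-escalate policy is small enough to sketch. The version below assumes a caller-supplied `call_model` function and an `is_acceptable` check, both hypothetical stand-ins for a real gateway request and a real task validator. The model names are placeholders as well.

```python
from typing import Callable, Optional, Sequence, Tuple


def complete_with_escalation(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try models from cheapest to most expensive; return (model, answer).

    Returns (None, None) if no model produces an acceptable answer.
    """
    for model in models:
        try:
            answer = call_model(model, prompt)
        except Exception:
            # Treat a provider error like a failed attempt and move up a tier.
            continue
        if is_acceptable(answer):
            return model, answer
    return None, None


# Stand-in for a gateway call: the cheap model returns an empty answer.
def fake_call(model: str, prompt: str) -> str:
    return "" if model == "cheap-model" else f"{model} answered: {prompt}"


model, answer = complete_with_escalation(
    "Summarize the incident.",
    ["cheap-model", "flagship-model"],
    fake_call,
    is_acceptable=lambda text: len(text) > 0,
)
print(model)
```

In practice the interesting part is `is_acceptable`: a schema check, a test run, or a length floor decides how much of the cost spread you actually keep.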

That is also why the honest answer to "which one should I use?" is always "for what?" — cheapest access, EU compliance, on-prem data sovereignty, and Kubernetes-native governance lead to four different boxes.

I keep a curated, open-source list that organizes every AI gateway by exactly that — what you need rather than which vendor — with a decision tree, a reproducible cost benchmark (the 106× number above is computed by a unit-tested script, not asserted), and a compliance/security/stability scorecard for 23 of them. It is bilingual and refreshed daily:

github.com/cuihuan/awesome-ai-gateway — and an interactive site if you prefer sortable tables.

If you are choosing a gateway right now, I would genuinely like to hear what constraint is driving your decision — drop it in the comments.

Top comments (1)

HARD IN SOFT OUT • Jun 13

This is the most useful taxonomy of AI gateways I've seen — especially the "three things engineers miss" section. Reasoning tokens billed as output with no visible UI? That quietly destroys cost forecasts. And the caching fragility point is brutal: one timestamp change and your 10× discount vanishes.

Two things that could make this even sharper:

1. Add a "cost of switching" matrix. You mention the 106× spread, but moving a production workload from GPT‑4o to Llama 3.1 might break prompt structure, output format, or tool‑calling patterns. A heuristic for "how likely is this task to survive a model swap without prompt engineering?" would help teams decide whether the 106× is actually capturable.

2. The CVE warning deserves an expansion. LiteLLM's vulnerabilities are real, but you don't mention detection: how do you know if your gateway has been compromised? Monitor for unusual token volumes, new model access patterns, or egress to unexpected IPs. A one‑line "watch these metrics" would turn a warning into an action item.

One small improvement: the decision tree in the linked repo is excellent, but the post itself could use a simple flow chart in text form — “Need EU compliance? → Route 1. Need on‑prem? → Route 2.” Many readers won't click away, so embedding the key branches here would help.

And the dark joke (because hidden reasoning tokens are pure evil):

I asked a reasoning model: “How much will this cost?”

It thought for 8 seconds, then said: “About $0.04.”

The bill arrived: $0.42.

I asked why.

It said: “I had to think about your question.”

Solid, practical, no‑fluff. Thanks for this.