崔小涣
AI Gateways in 2026: a field guide to the 106× cost problem

If you call more than one large language model from your code, you have already met the problem an AI gateway solves — you just may not have named it yet.

Here is the number that makes the case. Take one concrete task: generate a 100,000-token report. Send it to the cheapest capable model and it costs about $0.03. Send the same task to the most expensive frontier model and it costs about $3.01. That is a 106× spread for output a user often cannot tell apart.

No team rewrites its application eleven times to chase that spread. An AI gateway is how you capture it without rewriting anything.
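The spread is plain per-token arithmetic, and a minimal sketch makes it easy to rerun with your own numbers. The two per-million-token prices below are hypothetical, back-solved from the rounded dollar figures above rather than taken from any provider's price list. Note that the rounded figures give a ratio of roughly 100×; the 106× quoted above presumably comes from unrounded prices, which this sketch does not know.

```python
def output_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of generating `tokens` output tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


REPORT_TOKENS = 100_000

# Hypothetical prices (assumptions), chosen only to match the rounded figures.
CHEAP_USD_PER_M = 0.30
FLAGSHIP_USD_PER_M = 30.10

cheap = output_cost_usd(REPORT_TOKENS, CHEAP_USD_PER_M)
flagship = output_cost_usd(REPORT_TOKENS, FLAGSHIP_USD_PER_M)
spread = flagship / cheap

print(f"cheap: ${cheap:.2f}  flagship: ${flagship:.2f}  spread: {spread:.0f}x")
```

Swapping in real list prices for the two models you are comparing is the only change needed to get your own spread.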

What an AI gateway actually is

Strip away the marketing and it is a proxy that sits between your code and the model providers. You point your OpenAI-compatible client at the gateway instead of at OpenAI, and in return you get one endpoint and one key for many models — plus the things you would otherwise build yourself: automatic failover when a provider has a bad minute, caching, per-team rate limits and budgets, usage and cost tracking, and guardrails.

The mental model: you change a base_url, not your application.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway/v1",   # the only change
    api_key="...",
)
response = client.chat.completions.create(
    model="anthropic/claude-fable-5",      # ask the gateway for any provider's model
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)  # text of the first completion
```

The only decision that matters first: self-host or hosted

Everything else follows from this.

Hosted, minimal ops. You want to be calling models in five minutes and you are fine paying a small fee for it. OpenRouter is the marketplace default: 400+ models, with a fee of roughly 5.5% on credit purchases. Vercel AI Gateway and Cloudflare AI Gateway go further and charge 0% markup, billing you at provider list price while adding routing and caching on top.

Self-hosted, your infrastructure. Your keys, your network, no per-token middleman fee — you pay only for the box it runs on. LiteLLM is the broad default (Python, 100+ providers, virtual keys and budgets). If the gateway must never be your bottleneck, Bifrost (Go) and TensorZero (Rust) are built for throughput. If you already run Kubernetes, the AI plugins on Kong, Higress or Apache APISIX mean one less new service to operate.

In the Chinese ecosystem the same role is played by new-api and one-api, which add key distribution and billing on top — useful when you need to resell or meter access across a team.

Three things engineers consistently miss

1. Reasoning tokens are billed as output, and they are invisible. Modern reasoning models emit hidden "thinking" tokens charged at the (high) output rate. A task that looks like 20K tokens of output can bill as 50K+. When you size a budget, size it against billed output tokens (visible answer plus reasoning), not against the visible answer alone, and use the model's effort controls to cap the reasoning.

2. Cached input is 5–10× cheaper, and fragile. Providers bill a reused prompt prefix at a steep discount. But caching is a prefix match: change one byte near the front — a timestamp, a reordered JSON field — and you silently fall back to full price. A gateway that rewrites or normalizes your prompts can quietly destroy a cache-hit rate you were counting on.

3. The gateway is your security perimeter, so patch it like one. It sees every prompt and holds every key. In 2026, LiteLLM shipped two serious CVEs — a pre-auth SQL injection and an unauthenticated RCE that landed on CISA's exploited-vulnerabilities list — both fixed in v1.83.7. The lesson is not "avoid LiteLLM"; it is that popularity makes a gateway a target. Pin to current stable, restrict egress, and never expose the admin panel to the public internet.
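Points 1 and 2 can be made concrete with a small sketch. It is a simplification under stated assumptions: the 30K reasoning-token figure is illustrative (picked to match the 20K vs. 50K example), the prompts are made up, and real provider caches match on tokens rather than characters, so `shared_prefix_len` is only a stand-in for "how much of the prompt could still hit the cache".

```python
import json


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Output tokens on the invoice: the visible answer plus hidden reasoning."""
    return visible_tokens + reasoning_tokens


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Point 1: a 20K-token answer with 30K hidden reasoning tokens bills as 50K.
billed = billed_output_tokens(visible_tokens=20_000, reasoning_tokens=30_000)

# Point 2: a timestamp at the front changes the prefix on every call...
SYSTEM = "You are a careful report writer. " * 20  # long, reusable instructions
fragile_1 = "[2026-01-01T10:00:00] " + SYSTEM + "Summarize Q1."
fragile_2 = "[2026-01-01T10:00:05] " + SYSTEM + "Summarize Q2."

# ...while putting volatile data last keeps the long shared prefix intact.
stable_1 = SYSTEM + "Summarize Q1. [2026-01-01T10:00:00]"
stable_2 = SYSTEM + "Summarize Q2. [2026-01-01T10:00:05]"

# sort_keys=True serializes to the same string whatever the dict's key order.
ctx_a = json.dumps({"team": "a", "tier": "pro"}, sort_keys=True)
ctx_b = json.dumps({"tier": "pro", "team": "a"}, sort_keys=True)

print("billed output tokens:", billed)
print("fragile shared prefix:", shared_prefix_len(fragile_1, fragile_2))
print("stable shared prefix:", shared_prefix_len(stable_1, stable_2))
print("JSON stable across key order:", ctx_a == ctx_b)
```

The design choice it illustrates: keep the static instructions first, serialize structured context deterministically, and push anything that changes per request to the end of the prompt.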

The senior take

After comparing dozens of these, the reframing that helped most: stop shopping for "the best gateway" and start designing your routing and governance. The gateway is plumbing. The value is the policy you run through it — cheap model by default, escalate to a flagship only when a task fails; one audit trail; one budget; one place to enforce data-retention rules. Pick the gateway that makes your policy easy to express, and you will care a lot less about the feature-matrix differences that vendor blog posts obsess over.
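The "cheap by default, escalate only on failure" policy can be sketched in a few lines. This is an illustration, not any particular gateway's routing API: `route`, `fake_call`, and the model names `cheap-model` and `flagship-model` are all hypothetical, and `call` stands in for whatever function actually sends a request through your gateway.

```python
from typing import Callable, Optional, Sequence, Tuple


def route(
    prompt: str,
    models: Sequence[str],
    call: Callable[[str, str], str],
    accept: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models in order, cheapest first; return (model, answer) for the
    first answer that passes `accept`."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            answer = call(model, prompt)
        except Exception as exc:  # provider failure: fall through to the next model
            last_error = exc
            continue
        if accept(answer):
            return model, answer
    raise RuntimeError("no model produced an acceptable answer") from last_error


def fake_call(model: str, prompt: str) -> str:
    # Stand-in for a real gateway request; the cheap model "fails" this task.
    if model == "cheap-model":
        return ""
    return f"{model}: report for {prompt!r}"


# An empty answer counts as a failed task here, so `bool` is the acceptance check.
model, answer = route("Q1 sales", ["cheap-model", "flagship-model"], fake_call, accept=bool)
print(model, "->", answer)
```

In practice the `accept` check is where the policy lives: a schema validation, a test suite, or a grader decides whether the cheap answer is good enough before the flagship is ever billed.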

That is also why the honest answer to "which one should I use?" is always "for what?" — cheapest access, EU compliance, on-prem data sovereignty, and Kubernetes-native governance lead to four different boxes.


I keep a curated, open-source list that organizes every AI gateway by exactly that — what you need rather than which vendor — with a decision tree, a reproducible cost benchmark (the 106× number above is computed by a unit-tested script, not asserted), and a compliance/security/stability scorecard for 23 of them. It is bilingual and refreshed daily:

github.com/cuihuan/awesome-ai-gateway — and an interactive site if you prefer sortable tables.

If you are choosing a gateway right now, I would genuinely like to hear what constraint is driving your decision — drop it in the comments.
