<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 崔小涣</title>
    <description>The latest articles on DEV Community by 崔小涣 (@_7a561cb4673b6d2a455c5).</description>
    <link>https://dev.to/_7a561cb4673b6d2a455c5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3982022%2F80b12bed-5a39-4780-97af-6094e696b0aa.jpg</url>
      <title>DEV Community: 崔小涣</title>
      <link>https://dev.to/_7a561cb4673b6d2a455c5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_7a561cb4673b6d2a455c5"/>
    <language>en</language>
    <item>
      <title>AI Gateways in 2026: a field guide to the 106× cost problem</title>
      <dc:creator>崔小涣</dc:creator>
      <pubDate>Sat, 13 Jun 2026 00:29:35 +0000</pubDate>
      <link>https://dev.to/_7a561cb4673b6d2a455c5/ai-gateways-in-2026-a-field-guide-to-the-106x-cost-problem-57hl</link>
      <guid>https://dev.to/_7a561cb4673b6d2a455c5/ai-gateways-in-2026-a-field-guide-to-the-106x-cost-problem-57hl</guid>
      <description>&lt;p&gt;If you call more than one large language model from your code, you have already met the problem an &lt;em&gt;AI gateway&lt;/em&gt; solves — you just may not have named it yet.&lt;/p&gt;

&lt;p&gt;Here is the number that makes the case. Take one concrete task: generate a 100,000-token report. Send it to the cheapest capable model and it costs about &lt;strong&gt;$0.03&lt;/strong&gt;. Send the &lt;em&gt;same task&lt;/em&gt; to the most expensive frontier model and it costs about &lt;strong&gt;$3.01&lt;/strong&gt;. That is a &lt;strong&gt;106× spread&lt;/strong&gt; for output a user often cannot tell apart.&lt;/p&gt;
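
&lt;p&gt;Dividing the two rounded figures gives roughly 100×, so the 106× presumably comes from unrounded costs. Here is a minimal sketch of that arithmetic. The per-million-token prices and model names are hypothetical, chosen only so the totals round to the figures above; they are not real list prices and this is not the benchmark script itself.&lt;/p&gt;

```python
# Hypothetical output prices in USD per million tokens. These are NOT real
# list prices; they are picked so the totals round to the figures in the text.
HYPOTHETICAL_PRICE_PER_MILLION = {
    "cheapest-capable-model": 0.284,
    "priciest-frontier-model": 30.10,
}

REPORT_TOKENS = 100_000


def task_cost(tokens, price_per_million):
    """Cost in USD of generating `tokens` output tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


def cost_spread(tokens, prices):
    """Ratio between the most and the least expensive model for the same task."""
    costs = [task_cost(tokens, price) for price in prices.values()]
    return max(costs) / min(costs)


if __name__ == "__main__":
    for model, price in HYPOTHETICAL_PRICE_PER_MILLION.items():
        print(model, round(task_cost(REPORT_TOKENS, price), 2))
    print("spread:", round(cost_spread(REPORT_TOKENS, HYPOTHETICAL_PRICE_PER_MILLION)))
```

&lt;p&gt;With these assumed prices the cheap run costs a little under three cents, which is why the ratio lands near 106 rather than 100.&lt;/p&gt;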

&lt;p&gt;No team rewrites its application eleven times to chase that spread. An AI gateway is how you capture it without rewriting anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AI gateway actually is
&lt;/h2&gt;

&lt;p&gt;Strip away the marketing and it is a proxy that sits between your code and the model providers. You point your OpenAI-compatible client at the gateway instead of at OpenAI, and in return you get one endpoint and one key for many models — plus the things you would otherwise build yourself: automatic failover when a provider has a bad minute, caching, per-team rate limits and budgets, usage and cost tracking, and guardrails.&lt;/p&gt;

&lt;p&gt;The mental model: you change a &lt;code&gt;base_url&lt;/code&gt;, not your application.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-gateway/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# the only change
&lt;/span&gt;    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-fable-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# ask the gateway for any provider's model
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The only decision that matters first: self-host or hosted
&lt;/h2&gt;

&lt;p&gt;Everything else follows from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hosted, minimal ops.&lt;/strong&gt; You want to be calling models in five minutes and you are fine paying a small fee for it. &lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; is the marketplace default — 400+ models, ~5.5% on credits. &lt;a href="https://vercel.com/ai-gateway" rel="noopener noreferrer"&gt;Vercel AI Gateway&lt;/a&gt; and &lt;a href="https://developers.cloudflare.com/ai-gateway/" rel="noopener noreferrer"&gt;Cloudflare AI Gateway&lt;/a&gt; go further and charge &lt;strong&gt;0% markup&lt;/strong&gt;, billing you at provider list price while adding routing and caching on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted, your infrastructure.&lt;/strong&gt; Your keys, your network, no per-token middleman fee — you pay only for the box it runs on. &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; is the broad default (Python, 100+ providers, virtual keys and budgets). If the gateway must never be your bottleneck, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; (Go) and &lt;a href="https://github.com/tensorzero/tensorzero" rel="noopener noreferrer"&gt;TensorZero&lt;/a&gt; (Rust) are built for throughput. If you already run Kubernetes, the AI plugins on &lt;a href="https://github.com/Kong/kong" rel="noopener noreferrer"&gt;Kong&lt;/a&gt;, &lt;a href="https://github.com/higress-group/higress" rel="noopener noreferrer"&gt;Higress&lt;/a&gt; or &lt;a href="https://apisix.apache.org/" rel="noopener noreferrer"&gt;Apache APISIX&lt;/a&gt; mean one less new service to operate.&lt;/p&gt;

&lt;p&gt;In the Chinese ecosystem the same role is played by &lt;a href="https://github.com/QuantumNous/new-api" rel="noopener noreferrer"&gt;new-api&lt;/a&gt; and &lt;a href="https://github.com/songquanpeng/one-api" rel="noopener noreferrer"&gt;one-api&lt;/a&gt;, which add key distribution and billing on top — useful when you need to &lt;em&gt;resell&lt;/em&gt; or meter access across a team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things engineers consistently miss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Reasoning tokens are billed as output, and they are invisible.&lt;/strong&gt; Modern reasoning models emit hidden "thinking" tokens charged at the (high) output rate. A task that looks like 20K tokens of output can bill as 50K+. When you size a budget, size it against billed output tokens (the visible answer plus the hidden reasoning), not against the visible answer alone, and use the model's effort controls to cap the reasoning.&lt;/p&gt;
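
&lt;p&gt;A rough sketch of that budgeting arithmetic, using the 20K and 50K token counts from the paragraph above. The price per million tokens is a hypothetical placeholder, not any provider's real rate.&lt;/p&gt;

```python
def billed_output_tokens(visible_tokens, reasoning_tokens):
    """Tokens charged at the output rate: visible answer plus hidden reasoning."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens, reasoning_tokens, price_per_million):
    """Output-side cost in USD for one request."""
    billed = billed_output_tokens(visible_tokens, reasoning_tokens)
    return billed / 1_000_000 * price_per_million


# Hypothetical inputs: 20K visible tokens, 30K hidden reasoning tokens,
# and an assumed price of 15 USD per million output tokens.
VISIBLE = 20_000
REASONING = 30_000
ASSUMED_PRICE = 15.00

naive_budget = output_cost(VISIBLE, 0, ASSUMED_PRICE)          # sized on the visible answer
real_budget = output_cost(VISIBLE, REASONING, ASSUMED_PRICE)   # sized on billed output

if __name__ == "__main__":
    print("billed tokens:", billed_output_tokens(VISIBLE, REASONING))
    print("underestimate factor:", round(real_budget / naive_budget, 2))
```

&lt;p&gt;Under these assumptions, a budget sized on the visible answer is off by a factor of 2.5, regardless of the actual price.&lt;/p&gt;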

&lt;p&gt;&lt;strong&gt;2. Cached input is 5–10× cheaper, and fragile.&lt;/strong&gt; Providers bill a reused prompt prefix at a steep discount. But caching is a &lt;em&gt;prefix match&lt;/em&gt;: change one byte near the front — a timestamp, a reordered JSON field — and you silently fall back to full price. A gateway that rewrites or normalizes your prompts can quietly destroy a cache-hit rate you were counting on.&lt;/p&gt;
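
&lt;p&gt;A small illustration of why prefix matching is fragile. It compares prompts character by character, which is a simplification: real provider caches work on tokens and have their own minimum lengths and expiry rules. The prompt text and helper names are made up for the example.&lt;/p&gt;

```python
import json

# Hypothetical static instructions that every request shares.
STATIC_INSTRUCTIONS = "You are a support assistant for ExampleCo. " * 20


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def volatile_first(timestamp, question):
    """Timestamp at the front: every request starts differently."""
    return f"[{timestamp}] " + STATIC_INSTRUCTIONS + question


def stable_first(timestamp, question):
    """Static text at the front, volatile parts after it."""
    return STATIC_INSTRUCTIONS + f"[{timestamp}] " + question


def stable_json(payload):
    """Serialize with sorted keys so field order cannot change the prefix."""
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    q = "Where is my order?"
    bad = shared_prefix_len(volatile_first("09:00:01", q), volatile_first("17:45:30", q))
    good = shared_prefix_len(stable_first("09:00:01", q), stable_first("17:45:30", q))
    print("reusable prefix, timestamp first:", bad)
    print("reusable prefix, static text first:", good)
    print(stable_json({"b": 1, "a": 2}) == stable_json({"a": 2, "b": 1}))
```

&lt;p&gt;The design point is ordering: keep whatever changes per request behind the long, stable part of the prompt, and serialize structured context deterministically.&lt;/p&gt;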

&lt;p&gt;&lt;strong&gt;3. The gateway is your security perimeter, so patch it like one.&lt;/strong&gt; It sees every prompt and holds every key. In 2026, &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; shipped two serious CVEs — a pre-auth SQL injection and an unauthenticated RCE that landed on CISA's exploited-vulnerabilities list — both fixed in v1.83.7. The lesson is not "avoid LiteLLM"; it is that popularity makes a gateway a target. Pin to current stable, restrict egress, and never expose the admin panel to the public internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The senior take
&lt;/h2&gt;

&lt;p&gt;After comparing dozens of these, the reframing that helped most: &lt;strong&gt;stop shopping for "the best gateway" and start designing your routing and governance.&lt;/strong&gt; The gateway is plumbing. The value is the policy you run through it — cheap model by default, escalate to a flagship only when a task fails; one audit trail; one budget; one place to enforce data-retention rules. Pick the gateway that makes &lt;em&gt;your&lt;/em&gt; policy easy to express, and you will care a lot less about the feature-matrix differences that vendor blog posts obsess over.&lt;/p&gt;
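
&lt;p&gt;The "cheap by default, escalate on failure" policy can be sketched in a few lines. This is an illustration under assumptions, not any gateway's built-in feature: the model names are placeholders, &lt;code&gt;call_model&lt;/code&gt; stands in for whatever client call you make through the gateway, and &lt;code&gt;is_acceptable&lt;/code&gt; is whatever check defines "the task failed" for you.&lt;/p&gt;

```python
def route_with_escalation(prompt, call_model, is_acceptable, models):
    """Try models from cheapest to most expensive.

    Returns the first acceptable answer (or the last answer if none passes)
    together with the list of models that were tried, as a tiny audit trail.
    """
    attempts = []
    answer = None
    for model in models:
        answer = call_model(model, prompt)
        attempts.append(model)
        if is_acceptable(answer):
            break
    return answer, attempts


# Stand-ins so the sketch runs without a network call.
def fake_call_model(model, prompt):
    """Pretend the cheap model returns nothing on prompts marked 'hard'."""
    if model == "cheap-default" and "hard" in prompt:
        return ""
    return f"{model} answered: {prompt}"


def non_empty(answer):
    """Simplest possible acceptance check: the answer is not blank."""
    return bool(answer.strip())


POLICY = ["cheap-default", "flagship"]  # hypothetical model names

if __name__ == "__main__":
    print(route_with_escalation("easy question", fake_call_model, non_empty, POLICY))
    print(route_with_escalation("hard question", fake_call_model, non_empty, POLICY))
```

&lt;p&gt;In practice the acceptance check is the hard part: a schema validation, a test run, or a grader call. The loop itself stays this small, which is the point of treating the gateway as plumbing.&lt;/p&gt;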

&lt;p&gt;That is also why the honest answer to "which one should I use?" is always "for what?": cheapest access, EU compliance, on-prem data sovereignty, and Kubernetes-native governance each lead to a different gateway.&lt;/p&gt;




&lt;p&gt;I keep a curated, open-source list that organizes every AI gateway by exactly that — &lt;em&gt;what you need&lt;/em&gt; rather than which vendor — with a decision tree, a reproducible cost benchmark (the 106× number above is computed by a unit-tested script, not asserted), and a compliance/security/stability scorecard for 23 of them. It is bilingual and refreshed daily:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/cuihuan/awesome-ai-gateway" rel="noopener noreferrer"&gt;github.com/cuihuan/awesome-ai-gateway&lt;/a&gt;&lt;/strong&gt; — and an &lt;a href="https://cuihuan.github.io/awesome-ai-gateway/" rel="noopener noreferrer"&gt;interactive site&lt;/a&gt; if you prefer sortable tables.&lt;/p&gt;

&lt;p&gt;If you are choosing a gateway right now, I would genuinely like to hear what constraint is driving your decision — drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
