I spent $788 on an AI coding agent in one day. Here's the breakdown.

崔小涣 — Sat, 13 Jun 2026 08:18:21 +0000

I left an AI coding agent running for one day. Then I read the invoice.

$788. In about 13 hours.

I'm posting the real breakdown because I think a lot of people are quietly running up this kind of bill without seeing where it goes — and the fix is boring and effective.

The receipt

One day, 10:21–23:05. 11 sessions, 3,572 API calls across 4 models:

Model ($ per 1M input/output tokens)	Calls	Output tokens	Cache-read tokens	Cost
Fable 5 ($10/$50)	2,613	1.04M	448M	~$617
Opus 4.8 ($5/$25)	671	769K	248M	~$168
Haiku 4.5 ($1/$5)	242	27K	9M	~$1.70
Sonnet 4.6 ($3/$15)	46	6K	2M	~$0.90
Total	3,572			~$788

Two numbers reframed how I think about this:

The flagship ate $617 by itself — 78% of the bill from one model I'd set as the default for everything.
Haiku did 242 real calls for $1.70. A coffee. For work that, honestly, looked a lot like the work I was paying the flagship $0.24/call to do.

That's not a 2× or 3× gap. Per call it's a ~34× difference ($0.24 vs. about $0.007), and in total spend it's ~360×. I was sending almost everything to the expensive end out of pure default-laziness.
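The per-call comparison can be recomputed from the receipt. Here is a small sketch that uses only the calls and cost columns of the table (the costs are the rounded figures shown there, so the results are approximate). It separates the per-call gap from the gap in total spend, which are easy to conflate.

```python
# Figures copied from the receipt table: model -> (calls, cost in USD).
receipt = {
    "Fable 5": (2613, 617.00),
    "Opus 4.8": (671, 168.00),
    "Haiku 4.5": (242, 1.70),
    "Sonnet 4.6": (46, 0.90),
}

total_cost = sum(cost for _, cost in receipt.values())
per_call = {model: cost / calls for model, (calls, cost) in receipt.items()}
share = {model: cost / total_cost for model, (_, cost) in receipt.items()}

for model in receipt:
    print(f"{model:<11} ${per_call[model]:.4f}/call  {share[model]:6.1%} of bill")

# Gap between the flagship and the small model, measured two ways.
per_call_gap = per_call["Fable 5"] / per_call["Haiku 4.5"]
total_spend_gap = receipt["Fable 5"][1] / receipt["Haiku 4.5"][1]
print(f"per-call gap: {per_call_gap:.0f}x, total-spend gap: {total_spend_gap:.0f}x")
```

The total-spend gap is so much larger than the per-call gap because the flagship also handled roughly ten times as many calls.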

What the cache-read column is telling you

Notice 448M + 248M = ~700M cache-read tokens. Agentic coding re-sends a big context every turn; cache reads are billed at ~0.1× the input price, which is the only reason this was $788 and not several thousand. The flip side: anything that breaks your cache (a changed timestamp, a reordered tool list, a proxy that normalizes prompts) silently re-bills those tokens at full input price. At this volume, a broken cache is roughly a 10× event on the input side of the bill.
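To put a rough number on that, the sketch below prices the cache-read column twice: once at the discounted rate and once as if every one of those tokens had been billed as fresh input. It assumes the "$in/$out" figures in the table are USD per 1M tokens and that cache reads cost exactly 0.1× input; it ignores cache-write surcharges and uncached input, so treat it as an estimate rather than a reconstruction of the invoice.

```python
# Assumption: cache reads are billed at exactly 0.1x the input price.
CACHE_READ_MULTIPLIER = 0.1

# model -> (assumed input price in USD per 1M tokens, cache-read tokens in millions)
cache_reads = {
    "Fable 5": (10.0, 448),
    "Opus 4.8": (5.0, 248),
    "Haiku 4.5": (1.0, 9),
    "Sonnet 4.6": (3.0, 2),
}


def cache_read_cost(multiplier: float) -> float:
    """Cost of the cache-read tokens at a given fraction of the input price."""
    return sum(price * mtok * multiplier for price, mtok in cache_reads.values())


with_cache = cache_read_cost(CACHE_READ_MULTIPLIER)
without_cache = cache_read_cost(1.0)

print(f"cache reads as billed (0.1x): ${with_cache:,.2f}")
print(f"same tokens at full input:    ${without_cache:,.2f}")
print(f"extra cost of a fully broken cache: ${without_cache - with_cache:,.2f}")
```

Under these assumptions the cached tokens account for roughly $570 of the bill, and the same traffic with no cache hits would cost roughly $5,700.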

The fix is routing, not abstinence

I didn't conclude "stop using good models." I concluded "stop sending everything to them." The pattern:

Cheap model by default. Classification, file edits, boilerplate, retrieval — a fast small model handles these fine.
Escalate on signal. Hard reasoning, ambiguous specs, failed attempts → bump to the flagship.
Cap it. Per-key budgets so a runaway loop trips a limit instead of your card.
Watch the cache. Keep the prompt prefix byte-stable so cache reads actually hit.

This is exactly what an AI gateway / model router does — it's the layer that lets you express "cheap by default, escalate when it's hard" once, instead of hard-coding a model everywhere. I've since taken the flagship out of the default path, and the same workload now lands in the low tens of dollars a day.
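For a sense of how small that policy is, here is a minimal sketch of "cheap by default, escalate on signal, cap it" as plain Python. Everything in it is hypothetical: the model IDs are placeholders, the `Task` fields and the `Budget` class are illustrative names, and a real gateway would express the same rules in its own config rather than in application code.

```python
from dataclasses import dataclass

# Placeholder model IDs for illustration, not real API model names.
CHEAP_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"

# Task kinds that justify the flagship up front (an assumed label set).
ESCALATE_KINDS = {"hard_reasoning"}


@dataclass
class Task:
    kind: str                     # e.g. "file_edit", "classification"
    failed_attempts: int = 0      # how many cheap-model attempts already failed
    ambiguous_spec: bool = False  # set by whatever judges the spec to be unclear


class BudgetExceeded(RuntimeError):
    """Raised instead of silently spending past the cap."""


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${cost_usd:.2f} would exceed ${self.limit_usd:.2f} cap"
            )
        self.spent_usd += cost_usd


def pick_model(task: Task) -> str:
    """Cheap model unless one of the escalation signals is present."""
    escalate = (
        task.kind in ESCALATE_KINDS
        or task.ambiguous_spec
        or task.failed_attempts >= 1
    )
    return FLAGSHIP_MODEL if escalate else CHEAP_MODEL


print(pick_model(Task("file_edit")))                     # cheap path
print(pick_model(Task("file_edit", failed_attempts=1)))  # escalated after a failure

budget = Budget(limit_usd=30.0)
budget.charge(12.5)
print(f"spent so far: ${budget.spent_usd:.2f}")
```

The design choice worth copying is the direction of the default: an unrecognized task kind falls through to the cheap model, and only an explicit signal moves it up.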

If you want the receipts

While digging into routing I built an open-source, pain-point-organized list of AI gateways, with a reproducible cost benchmark that prices concrete workloads (including a coding scenario with reasoning tokens) across 11 models, computed by a unit-tested script. Plug in your own token mix and see your real number before the invoice shows it to you:

github.com/cuihuan/awesome-ai-gateway · interactive cost tables

If you're running agents daily — have you actually looked at your per-model breakdown? I'd bet most of the bill is one model doing work a cheaper one could.

AI Gateways in 2026: a field guide to the 106× cost problem

崔小涣 — Sat, 13 Jun 2026 00:29:35 +0000

If you call more than one large language model from your code, you have already met the problem an AI gateway solves — you just may not have named it yet.

Here is the number that makes the case. Take one concrete task: generate a 100,000-token report. Send it to the cheapest capable model and it costs about $0.03. Send the same task to the most expensive frontier model and it costs about $3.01. That is a 106× spread for output a user often cannot tell apart.

No team rewrites its application eleven times to chase that spread. An AI gateway is how you capture it without rewriting anything.

What an AI gateway actually is

Strip away the marketing and it is a proxy that sits between your code and the model providers. You point your OpenAI-compatible client at the gateway instead of at OpenAI, and in return you get one endpoint and one key for many models — plus the things you would otherwise build yourself: automatic failover when a provider has a bad minute, caching, per-team rate limits and budgets, usage and cost tracking, and guardrails.

The mental model: you change a base_url, not your application.

from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway/v1",   # the only change
    api_key="...",
)
client.chat.completions.create(
    model="anthropic/claude-fable-5",      # ask the gateway for any provider's model
    messages=[{"role": "user", "content": "Hello"}],
)
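As an illustration of one thing the gateway saves you from building, here is a rough sketch of hand-rolled failover. It is not how any particular gateway implements it. `complete_with_failover`, `fake_call`, and the model names are hypothetical, and the provider call is passed in as a plain function so the example runs without network access.

```python
from typing import Callable, Sequence


def complete_with_failover(
    call_model: Callable[[str, str], str],
    models: Sequence[str],
    prompt: str,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a provider call; the first provider always times out."""
    if model == "provider-a/model-x":
        raise TimeoutError("provider A is having a bad minute")
    return f"[{model}] echo: {prompt}"


used, reply = complete_with_failover(
    fake_call, ["provider-a/model-x", "provider-b/model-y"], "Hello"
)
print(used, reply)
```

Multiply this by retries, timeouts, per-provider auth, and cost logging, and the appeal of changing one base_url instead becomes clearer.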

The only decision that matters first: self-host or hosted

Everything else follows from this.

Hosted, minimal ops. You want to be calling models in five minutes and you are fine paying a small fee for it. OpenRouter is the marketplace default — 400+ models, ~5.5% on credits. Vercel AI Gateway and Cloudflare AI Gateway go further and charge 0% markup, billing you at provider list price while adding routing and caching on top.

Self-hosted, your infrastructure. Your keys, your network, no per-token middleman fee — you pay only for the box it runs on. LiteLLM is the broad default (Python, 100+ providers, virtual keys and budgets). If the gateway must never be your bottleneck, Bifrost (Go) and TensorZero (Rust) are built for throughput. If you already run Kubernetes, the AI plugins on Kong, Higress or Apache APISIX mean one less new service to operate.

In the Chinese ecosystem the same role is played by new-api and one-api, which add key distribution and billing on top — useful when you need to resell or meter access across a team.

Three things engineers consistently miss

1. Reasoning tokens are billed as output, and they are invisible. Modern reasoning models emit hidden "thinking" tokens charged at the (high) output rate. A task that looks like 20K of output can bill as 50K+. When you size a budget, size it against billed output tokens (thinking included), not against the visible answer, and use the model's effort controls to cap it.

2. Cached input is 5–10× cheaper, and fragile. Providers bill a reused prompt prefix at a steep discount. But caching is a prefix match: change one byte near the front — a timestamp, a reordered JSON field — and you silently fall back to full price. A gateway that rewrites or normalizes your prompts can quietly destroy a cache-hit rate you were counting on.

3. The gateway is your security perimeter, so patch it like one. It sees every prompt and holds every key. In 2026, LiteLLM shipped two serious CVEs — a pre-auth SQL injection and an unauthenticated RCE that landed on CISA's exploited-vulnerabilities list — both fixed in v1.83.7. The lesson is not "avoid LiteLLM"; it is that popularity makes a gateway a target. Pin to current stable, restrict egress, and never expose the admin panel to the public internet.
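Of the three, point 2 is the easiest to guard against in your own code. A minimal sketch, assuming the stable part of your prompt is a system prompt plus a list of tool definitions: serialize that part deterministically, keep volatile values (timestamps, request IDs) after it, and fingerprint the prefix so drift shows up in logs. The function names are hypothetical, and sorting tools by name is an assumption that only holds if tool order does not matter to your application.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the stable part of the prompt the same way every time."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    tools_json = json.dumps(ordered_tools, sort_keys=True, separators=(",", ":"))
    return system_prompt + "\n" + tools_json


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix; log it per request and alert when it changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prompt(prefix: str, volatile: dict) -> str:
    """Volatile data goes after the stable prefix, never inside it."""
    return prefix + "\n" + json.dumps(volatile, sort_keys=True)


tools_a = [
    {"name": "read_file", "description": "Read a file"},
    {"name": "grep", "description": "Search files"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

prefix_a = build_prefix("You are a coding agent.", tools_a)
prefix_b = build_prefix("You are a coding agent.", tools_b)
print(prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b))

prompt_1 = build_prompt(prefix_a, {"now": "2026-06-13T08:00:00Z"})
prompt_2 = build_prompt(prefix_a, {"now": "2026-06-13T08:05:00Z"})
print(prompt_1.startswith(prefix_a) and prompt_2.startswith(prefix_a))
```

This only protects the bytes you send. If a gateway in the middle rewrites the request, comparing the fingerprint on both sides of it is one way to find out.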

The senior take

After comparing dozens of these, the reframing that helped most: stop shopping for "the best gateway" and start designing your routing and governance. The gateway is plumbing. The value is the policy you run through it — cheap model by default, escalate to a flagship only when a task fails; one audit trail; one budget; one place to enforce data-retention rules. Pick the gateway that makes your policy easy to express, and you will care a lot less about the feature-matrix differences that vendor blog posts obsess over.

That is also why the honest answer to "which one should I use?" is always "for what?" — cheapest access, EU compliance, on-prem data sovereignty, and Kubernetes-native governance lead to four different boxes.

I keep a curated, open-source list that organizes every AI gateway by exactly that — what you need rather than which vendor — with a decision tree, a reproducible cost benchmark (the 106× number above is computed by a unit-tested script, not asserted), and a compliance/security/stability scorecard for 23 of them. It is bilingual and refreshed daily:

github.com/cuihuan/awesome-ai-gateway — and an interactive site if you prefer sortable tables.

If you are choosing a gateway right now, I would genuinely like to hear what constraint is driving your decision — drop it in the comments.

DEV Community: 崔小涣