AI API cost control is a routing problem, not a pricing spreadsheet

#saas #ai #devtools #api

Most teams start AI cost control with a spreadsheet: model A costs this much, model B costs that much, so use the cheaper one.

That helps for a week. Then production traffic arrives.

The real cost problem is not the model price. It is losing the path between a user request and the billable provider call.

Once a product has multiple features, API keys, environments, retries, and fallback routes, the invoice stops answering the question founders actually care about:

Which product path created this spend, and could we have routed it better?

The failure mode

A typical early setup looks like this:

one OpenAI key in an environment variable
one Claude key for higher-quality tasks
maybe Gemini or a proxy for cheaper workloads
logs that show application errors, but not token economics
a monthly provider invoice that arrives too late

This is fine while one developer is experimenting.

It breaks when several workflows share the same provider account. A single retry loop, a background summarizer, or a test environment can quietly become the largest customer in your AI budget.

The bad part is not only that money was spent. It is that you cannot reconstruct the route that spent it.

Treat every AI request like a billable event

The cleaner pattern is to attach accounting data before the request leaves your system.

At minimum, every call should carry:

user or API key owner
project or workspace
requested model
actual upstream model
route type, such as direct, backup, or cheaper pool
input and output tokens, filled in once the response returns
settlement bucket, such as credits, wallet balance, or internal cost center
request id for debugging

This makes the gateway the source of truth, not the provider invoice.

If a request starts as gpt-5.5 but gets served by a backup route, that decision should be visible. If a cheaper model pool handles a non-critical workflow, that should be visible too. If a premium direct route is used, it should be attached to the right balance and owner immediately.
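As a rough illustration, the fields listed above could be carried in a single record per request. This is only a sketch: the class name, field names, and the `rerouted` flag are assumptions made for the example, not the schema of any particular gateway.

```python
from dataclasses import dataclass, asdict
import json
import uuid


@dataclass(frozen=True)
class BillableEvent:
    """One AI request as an accounting record (hypothetical field names)."""
    owner: str              # user or API key owner
    project: str            # project or workspace
    requested_model: str    # what the caller asked for
    upstream_model: str     # what actually served the request
    route_type: str         # e.g. "direct", "backup", "cheaper_pool"
    input_tokens: int
    output_tokens: int
    settlement_bucket: str  # e.g. "credits", "wallet", or a cost center
    request_id: str

    @property
    def rerouted(self) -> bool:
        # True when the serving model differs from the requested one.
        return self.requested_model != self.upstream_model


def new_request_id() -> str:
    # A random UUID is one simple way to get a unique id for debugging.
    return str(uuid.uuid4())


def to_log_line(event: BillableEvent) -> str:
    # Serialize the record, including the derived rerouted flag, as JSON.
    record = asdict(event)
    record["rerouted"] = event.rerouted
    return json.dumps(record, sort_keys=True)
```

A request that asked for `gpt-5.5` but was served by a backup route would then show `rerouted: true` in its log line, next to the owner and settlement bucket it was charged to.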

Route policy matters more than average price

Averages hide the thing you need to tune.

For example, a team may discover that 80% of its calls are low-risk transformations that can tolerate a cheaper route, while 20% need the official direct model path. If both are merged into one monthly spend line, nobody can make a good routing decision.

A practical setup separates:

official/direct models for workloads where predictability matters
ordinary or pooled routes for lower-cost throughput
fallback channels for provider instability
per-route usage and error logs
clear balances or budgets for each settlement path

That is also how you avoid confusing product pricing with provider pricing. A product might sell usage-based credits while still routing internally across several providers. The customer should see a stable API surface; the operator should see the routing economics.
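One way to make that separation explicit is a small policy table that maps a workload class to an ordered list of route kinds. The sketch below is illustrative only: the class names (`critical`, `low_risk`), the route kinds, and the preference order are assumptions for the example, and a real policy would depend on the product.

```python
# Hypothetical policy: ordered route preferences per workload class.
POLICY = {
    # Predictability matters: try the direct route first.
    "critical": ["direct", "fallback"],
    # Cost matters: try the pooled route first, then escalate.
    "low_risk": ["pooled", "direct", "fallback"],
}


def choose_route(workload_class: str, healthy_routes: set[str]) -> str:
    """Return the first preferred route kind that is currently healthy."""
    for kind in POLICY[workload_class]:
        if kind in healthy_routes:
            return kind
    raise RuntimeError(f"no healthy route for workload class {workload_class!r}")
```

Because the decision is a plain lookup, the chosen route kind can be written to the usage log with the request, which is what makes per-route budgets and error rates possible later.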

Alerts should trigger on velocity, not just totals

Daily spend alerts are too slow for runaway loops.

Token velocity catches problems earlier. A workflow that normally burns 20k tokens per hour and suddenly burns 2M tokens in 10 minutes, a 600x jump in rate, is the event you care about. The absolute daily total may still look acceptable when the damage starts.
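A velocity check can be sketched as a sliding window over recent token counts. The class name, the window size, and the threshold below are assumptions chosen to mirror the numbers in the example above. Timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import deque


class TokenVelocityMonitor:
    """Sliding-window token counter (a minimal sketch, not production code)."""

    def __init__(self, window_seconds: float, max_tokens_in_window: int):
        self.window_seconds = window_seconds
        self.max_tokens_in_window = max_tokens_in_window
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._total = 0         # tokens currently inside the window

    def record(self, timestamp: float, tokens: int) -> bool:
        """Add a usage sample; return True if the window exceeds the limit."""
        self._events.append((timestamp, tokens))
        self._total += tokens
        # Evict samples that are at least one full window old.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_tokens = self._events.popleft()
            self._total -= old_tokens
        return self._total > self.max_tokens_in_window


# One monitor per API key: a 10-minute window with a 100k-token ceiling.
monitor = TokenVelocityMonitor(window_seconds=600, max_tokens_in_window=100_000)
normal = monitor.record(timestamp=0, tokens=3_000)       # quiet baseline
burst = monitor.record(timestamp=3_610, tokens=500_000)  # runaway loop
```

Here `normal` is `False` and `burst` is `True`, even though a daily total of roughly half a million tokens might not trip a daily spend alert on its own.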

Useful alert signals include:

tokens per minute by API key
error rate by upstream channel
fallback route frequency
spend by model route
sudden provider/model mix changes
failed requests that still consumed tokens

This is where gateway-level logs beat provider dashboards. Provider dashboards are useful, but they do not know your feature boundaries.
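Several of those signals fall out of a simple pass over the gateway's own log records. The sketch below assumes each record is a dict with hypothetical keys `route_type`, `input_tokens`, `output_tokens`, and `ok`. The key names and the `"backup"` label are illustrative, not a fixed log format.

```python
from collections import defaultdict


def summarize(log_records) -> dict:
    """Aggregate per-route tokens, fallback rate, and tokens spent on failures."""
    records = list(log_records)
    tokens_by_route = defaultdict(int)
    fallback_count = 0
    failed_tokens = 0

    for record in records:
        tokens = record["input_tokens"] + record["output_tokens"]
        tokens_by_route[record["route_type"]] += tokens
        if record["route_type"] == "backup":
            fallback_count += 1
        if not record["ok"]:
            # Failed requests can still consume billable tokens.
            failed_tokens += tokens

    return {
        "tokens_by_route": dict(tokens_by_route),
        "fallback_rate": fallback_count / len(records) if records else 0.0,
        "failed_tokens": failed_tokens,
    }
```

Grouping the same records by feature or API key instead of route type gives the feature-level view that a provider dashboard cannot offer.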

What we are building

I am building Tokens Forge around this idea: one OpenAI-compatible API surface, but with model routing, official/direct and lower-cost routes, usage logs, balance separation, and AI Researcher workflows in one place.

The goal is not to hide complexity with a black-box proxy. The goal is to make the routing and billing path inspectable enough that a founder can answer: