Proxy, Gateway, or Poll the Usage API? Picking an Architecture for AI Cost Visibility

#ai #devops #finops #architecture

At some point your AI bill stops being a rounding error and someone asks the obvious question: who spent what, on which model, doing what? Answering it means putting something between your developers and the providers — or putting something next to the providers. There are three common shapes, and the choice has real consequences for latency, failure modes, and what data you end up holding. Most teams pick one by accident and regret it later. Here's how to pick one on purpose.

The three shapes

1. Inline proxy. You stand up a service that every LLM request flows through. It forwards to OpenAI/Anthropic/OpenRouter, reads the response, records tokens and cost, and returns the completion. LiteLLM-style gateways do this.

2. SDK / wrapper instrumentation. You wrap the client library so each call emits a metric before returning. No separate network hop, but every call site has to use your wrapper.

3. Usage-API polling. You don't touch the request path at all. Instead you periodically read the provider's own metering API (the usage and activity endpoints most providers already expose) and reconstruct who-spent-what from data the platform computed for you.
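To make the second shape concrete, here is a minimal sketch of wrapper instrumentation as a decorator. It is an illustration under assumptions, not any particular SDK's API: the names `metered` and `fake_completion` are hypothetical, and the response is assumed to be a dict carrying a `model` field and a `usage` block with token counts.

```python
import functools

def metered(emit):
    """Decorator: after each call, pass model and token counts (never text) to `emit`."""
    def decorate(call_model):
        @functools.wraps(call_model)
        def wrapper(*args, **kwargs):
            response = call_model(*args, **kwargs)
            usage = response.get("usage", {})
            # Record counts only; the prompt and completion text are not copied.
            emit({
                "model": response.get("model"),
                "prompt_tokens": usage.get("prompt_tokens", 0),
                "completion_tokens": usage.get("completion_tokens", 0),
            })
            return response
        return wrapper
    return decorate

# Usage with a stand-in client (hypothetical; no network call is made).
recorded = []

@metered(recorded.append)
def fake_completion(prompt):
    return {
        "model": "example-model",
        "usage": {"prompt_tokens": 12, "completion_tokens": 30},
        "choices": [{"message": {"content": "..."}}],
    }

fake_completion("hello")
```

The sketch also shows the coverage limit of this shape: only functions that carry the decorator ever emit a metric.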

They sound interchangeable. They are not.

Where each one bites you

Latency and availability. An inline proxy is now on the critical path of every model call. Its p99 is your p99. Its downtime is your outage. You will, eventually, add retries and a circuit breaker so a metering hiccup doesn't take down inference — at which point you're maintaining a piece of production infrastructure whose entire job is to watch production infrastructure. Polling has zero request-path latency by construction; if the poller is down you lose freshness, not traffic.
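The circuit breaker mentioned above might look something like the following. This is a sketch under stated assumptions rather than a production design: `MeteringBreaker`, its thresholds, and the `sink` callable are all hypothetical names, and the only goal is to show metering failing open so inference keeps flowing.

```python
import time

class MeteringBreaker:
    """Skip metering writes for a cooldown period after repeated sink failures."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock            # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None         # None means the breaker is closed

    def record(self, sink, event):
        """Try to record `event`; return True if written, False if skipped or failed."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return False          # breaker open: drop the metric, keep serving traffic
            self.opened_at = None     # cooldown elapsed: allow writes again
            self.failures = 0
        try:
            sink(event)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return False              # a metering error never propagates to the caller
        self.failures = 0
        return True
```

The caller forwards the model response regardless of what `record` returns, so a broken metrics store costs you data points rather than requests.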

Coverage. SDK instrumentation only sees calls that go through your SDK. The moment a developer runs a tool you didn't wrap — an agent like OpenClaw, a curl in a CI script, a notebook — that spend is invisible. A proxy catches everything if you can force all egress through it, which in practice means network policy work. Polling catches everything the provider meters, regardless of how the call was made, because it reads the provider's ledger rather than the traffic.

Attribution granularity. This is the subtle one. With openrouter/auto and similar auto-routing, the model that actually ran is chosen server-side per request. A naive proxy that only logs the requested model records auto and learns nothing; it has to parse the response body to recover the real model. Polling the usage API gets the resolved per-model breakdown directly, because that's what the provider bills on. For per-developer attribution, the cleanest trick under any architecture is one API key per developer or workload — then attribution is a GROUP BY key, not a log-parsing exercise.
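With one key per developer, the "GROUP BY key" step really is a few lines. The sketch below assumes usage rows shaped like `{"dev", "model", "cost", "date"}`; the helper name `spend_by` and the sample values are made up for illustration.

```python
from collections import defaultdict

def spend_by(rows, *fields):
    """Sum `cost` over rows, grouped by the given fields (for example "dev", "model")."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[f] for f in fields)] += row["cost"]
    return dict(totals)

# Sample rows (invented values, not real billing data).
rows = [
    {"dev": "alice", "model": "model-a", "cost": 1.5, "date": "2025-01-01"},
    {"dev": "alice", "model": "model-b", "cost": 0.5, "date": "2025-01-01"},
    {"dev": "bob", "model": "model-a", "cost": 2.0, "date": "2025-01-01"},
]

per_dev = spend_by(rows, "dev")
per_dev_model = spend_by(rows, "dev", "model")
```

Because the rows already carry the resolved model, the same helper answers both "who spent what" and "on which model" without touching request logs.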

Data exposure. A proxy and an SDK wrapper both sit in the data path, which means prompts and completions pass through code you now own. That's a security surface: logs that accidentally capture prompt text, a breach that now includes user content, a compliance review that takes three times as long. Polling reads counts and dollars, never message bodies. If your goal is cost visibility, ask whether your cost tool has any business seeing the contents of a prompt. Usually it doesn't.

A polling implementation you can run today

The polling approach is the least discussed and the easiest to stand up, so here's a concrete skeleton. The two ideas that make it work: per-developer keys for attribution, and a scheduled read of the activity endpoint.

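import requests

def pull_usage(provisioning_key, dev_keys):
    """Return one row per (developer, model, date) from the activity endpoint."""
    rows = []
    for dev, key_id in dev_keys.items():
        r = requests.get(
            "https://openrouter.ai/api/v1/activity",
            headers={"Authorization": f"Bearer {provisioning_key}"},
            params={"key": key_id},   # per-key filter; confirm the parameter name in the API docs
            timeout=30,               # never let a hung request stall the cron job
        )
        r.raise_for_status()          # fail loudly instead of parsing an error body
        for item in r.json()["data"]:
            rows.append({
                "dev": dev,
                "model": item["model"],   # resolved model, even for auto routes
                "cost": item["usage"],    # spend as reported by the provider
                "date": item["date"],
            })
    return rows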

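Run that on a few-minute cron, push the rows into whatever time-series store or warehouse you already have, and you've got per-developer, per-model spend that is as close to realtime as the provider's metering allows, including a real breakdown of what openrouter/auto actually resolved to, without a single byte of production traffic flowing through you. Check how often the provider refreshes its usage data before promising anyone a live dashboard: polling more often than the ledger updates buys you nothing.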
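One detail worth handling when you store the rows: if the provider reports running totals per day, every poll returns the same (developer, model, date) row with a larger number, and naive appends will double count. A hedged sketch using SQLite from the standard library, assuming rows shaped like the poller's output; the table name `spend` and the helper `store_rows` are hypothetical.

```python
import sqlite3

def store_rows(conn, rows):
    """Idempotently store usage rows: a repeat poll overwrites, never duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS spend (
               dev TEXT, model TEXT, date TEXT, cost REAL,
               PRIMARY KEY (dev, model, date))"""
    )
    # The primary key makes INSERT OR REPLACE swap in the latest total for that day.
    conn.executemany(
        "INSERT OR REPLACE INTO spend (dev, model, date, cost) "
        "VALUES (:dev, :model, :date, :cost)",
        rows,
    )
    conn.commit()

# Demonstration with an in-memory database and invented values.
conn = sqlite3.connect(":memory:")
sample = [{"dev": "alice", "model": "model-a", "date": "2025-01-01", "cost": 1.0}]
store_rows(conn, sample)
sample[0]["cost"] = 1.75      # same day polled again with a larger running total
store_rows(conn, sample)
stored = conn.execute("SELECT dev, model, date, cost FROM spend").fetchall()
```

If your provider instead reports per-request or incremental records, you would key on a request or record identifier rather than the day.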

Turn the series into something that pages you

Visibility you have to remember to look at isn't visibility. Add a cheap anomaly check so the system tells you instead. Per-developer baselines beat one global threshold, because the engineer who runs nightly eval suites shouldn't trip the same wire as the one who normally spends two dollars a day:

import statistics as s

def is_anomaly(history, today):           # history = trailing 14-30 days of daily spend
    if len(history) < 2:                  # too little data to form a baseline
        return False
    mu, sigma = s.mean(history), s.pstdev(history)
    return today > mu + 3 * sigma         # ~3 sigma = "this isn't a busy Tuesday"

Wire that to the poller and a runaway agent stuck in a retry loop gets caught the same day, not on next month's invoice. The gap between those two is the difference between a Slack message and a postmortem.
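Wiring the check to the poller's output could look roughly like this. It is a sketch, assuming rows shaped like `{"dev", "model", "cost", "date"}` with ISO-formatted dates; `daily_totals`, `flag_anomalies`, and the `min_days` cutoff are hypothetical names and values, and the same 3-sigma rule is repeated inline so the block stands alone.

```python
import statistics
from collections import defaultdict

def daily_totals(rows):
    """Collapse per-model rows into {dev: {date: total_cost}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["dev"]][row["date"]] += row["cost"]
    return totals

def flag_anomalies(rows, today, min_days=7):
    """Return (dev, spent_today, threshold) for developers above their own baseline."""
    flagged = []
    for dev, by_date in daily_totals(rows).items():
        # ISO dates compare correctly as strings, so "< today" means earlier days.
        history = [cost for date, cost in by_date.items() if date < today]
        if len(history) < min_days:
            continue                      # not enough history for a baseline yet
        threshold = statistics.mean(history) + 3 * statistics.pstdev(history)
        spent = by_date.get(today, 0.0)
        if spent > threshold:
            flagged.append((dev, spent, threshold))
    return flagged

# Invented data: alice is steady then spikes, bob is noisy but within his range.
days = [f"2025-01-{d:02d}" for d in range(1, 15)]
rows = [{"dev": "alice", "model": "m", "cost": 2.0, "date": d} for d in days]
rows += [{"dev": "bob", "model": "m", "cost": 1.0 if i % 2 else 3.0, "date": d}
         for i, d in enumerate(days)]
rows.append({"dev": "alice", "model": "m", "cost": 40.0, "date": "2025-01-15"})
rows.append({"dev": "bob", "model": "m", "cost": 3.0, "date": "2025-01-15"})

flagged = flag_anomalies(rows, today="2025-01-15")
```

A perfectly flat history gives a standard deviation of zero, so any increase trips the wire; in practice you may want a minimum dollar floor on the threshold as well.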

So which one?

If you need to enforce policy in the request path (hard budget cutoffs, key rotation, request rewriting), you genuinely want a gateway, and you should accept the operational cost of running one. If you fully control every call site and want to avoid an extra network hop, SDK instrumentation is fine. But if what you actually want is visibility and alerting (who spent what, on which model, and tell me when it's weird), polling the usage API gives you that with no request-path risk and no prompt exposure. Most teams asking the who-spent-what question want the third thing and reach for the first.

If you'd rather not build and babysit the poller, the per-key provisioning, and the baselining, that's what we made Reckon: read-only usage-API polling (no proxy, no SDK, never sees your prompts), KMS-encrypted keys, per-developer and per-model breakdowns — including realtime openrouter/auto attribution for OpenClaw and anyone else on auto-routing — Slack digests, same-day anomaly alerts, a /spend command, and a Linear integration. Free up to three developers, then $19 per developer per month ($99/mo minimum). The architecture above stands on its own, though — choose the shape that matches what you're actually trying to do, not the one that sounds most thorough.

Top comments (1)

Tokens Forge • Jul 1

Good framing. I think the missing branch is when cost visibility is only one part of the control plane. If a team only needs after-the-fact attribution, polling usage APIs is cleaner. But once they need request-time controls - model scopes per key, budget cutoffs, fallback policy, balance buckets, and a receipt that shows requested model vs upstream model - a gateway becomes the enforcement layer, not just a logging layer.

That is the angle we are building around Tokens Forge as an AI API relay / OpenAI-compatible gateway.