
Brian Mello

Which model actually ran? Tracking `openrouter/auto` usage by model in realtime

If you route through OpenRouter with `openrouter/auto`, you have a small, recurring mystery on your hands: you asked for "the best available model," OpenRouter picked one, and unless you went looking, you have no idea which one served the request or what it cost. Multiply that by an autonomous agent making hundreds of calls an hour (OpenClaw and friends love `auto`) and your month-end bill becomes a whodunit.

The good news is that OpenRouter already records everything you need. You just have to go get it. Here's how to build per-model, per-developer visibility without a proxy, an SDK wrapper, or touching a single prompt.

The one field everyone misses

When you send a chat completion to OpenRouter, the response body tells you which model actually answered. Even if you requested `openrouter/auto`, the `model` field in the response is the resolved model: `anthropic/claude-3.5-sonnet`, `google/gemini-flash-1.5`, or whatever the router chose.

```json
{
  "id": "gen-abc123",
  "model": "anthropic/claude-3.5-sonnet",
  "choices": [ ... ],
  "usage": { "prompt_tokens": 1840, "completion_tokens": 412 }
}
```

That `id` is the thread you pull. OpenRouter exposes a generation-lookup endpoint that returns the authoritative record for any call, including the native cost in credits, the upstream provider, and token counts:

```shell
# Quote the URL so the shell does not treat "?" as a glob character
curl "https://openrouter.ai/api/v1/generation?id=gen-abc123" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"
```

The response includes `total_cost`, `model`, `tokens_prompt`, `tokens_completion`, and the provider that served it. The completion response gives you the model instantly; the generation endpoint gives you the dollars-and-cents truth a beat later (costs settle slightly after the call returns). For most monitoring you want both: the model in realtime, the cost on a short delay.
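Here is a minimal sketch of that two-step capture, assuming the response shapes described above. The helper names (`resolved_model`, `generation_url`, `cost_record`) are hypothetical, and the top-level `data` wrapper around the generation record is an assumption worth checking against a live response.

```python
from urllib.parse import urlencode

GENERATION_ENDPOINT = "https://openrouter.ai/api/v1/generation"


def resolved_model(completion: dict) -> tuple[str, str]:
    """Return (generation id, resolved model) from a chat completion body."""
    return completion["id"], completion["model"]


def generation_url(generation_id: str) -> str:
    """Build the lookup URL for one generation id."""
    return f"{GENERATION_ENDPOINT}?{urlencode({'id': generation_id})}"


def cost_record(generation: dict) -> dict:
    """Keep only the fields named above from a generation lookup body.

    Assumption: the record sits under a top-level "data" key.
    """
    data = generation["data"]
    return {
        "model": data["model"],
        "total_cost": data["total_cost"],
        "tokens_prompt": data["tokens_prompt"],
        "tokens_completion": data["tokens_completion"],
    }


# Example using the completion body shown earlier (no network call is made).
completion = {
    "id": "gen-abc123",
    "model": "anthropic/claude-3.5-sonnet",
    "usage": {"prompt_tokens": 1840, "completion_tokens": 412},
}
gen_id, model = resolved_model(completion)
lookup = generation_url(gen_id)
```

In practice you would log `model` right away, then fetch `lookup` with the same `Authorization` header shortly afterwards and pass the JSON body to `cost_record`, retrying if the record has not settled yet.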

Polling, not proxying

You have two ways to capture this. One is to sit in the request path — a proxy or a patched SDK that intercepts every call. That works, but now you own a piece of latency-critical, prompt-handling infrastructure, and you've put yourself between your developers and their model. If your monitor hiccups, their agents hiccup.

The other way is to read the usage data after the fact, out of band. OpenRouter exposes a credits endpoint that reports cumulative usage, and you can poll it on a schedule:

```shell
# Account-level credit + usage snapshot
curl https://openrouter.ai/api/v1/credits \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"
```

Poll that on an interval, diff successive snapshots, and you have spend velocity without ever being in the hot path. The tradeoff is honest: polling gives you near-realtime, not sub-second, and you trade a little freshness for never being a dependency of the thing you're watching. For cost monitoring — as opposed to, say, rate limiting — that's the right trade nearly every time.
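As a rough illustration of diffing successive snapshots, the sketch below turns a list of timestamped cumulative-usage readings into a spend rate per hour. The function name `spend_velocity` and the sample numbers are made up for the example; only the idea of differencing cumulative totals comes from the text above.

```python
from datetime import datetime, timedelta


def spend_velocity(snapshots):
    """Credits per hour between consecutive (timestamp, cumulative_usage) pairs."""
    rates = []
    for (t0, u0), (t1, u1) in zip(snapshots, snapshots[1:]):
        hours = (t1 - t0).total_seconds() / 3600
        if hours <= 0:
            continue  # skip duplicate or out-of-order polls
        rates.append((t1, (u1 - u0) / hours))
    return rates


# Illustrative readings only: three polls taken 30 minutes apart.
start = datetime(2025, 1, 1, 12, 0)
polls = [
    (start, 10.0),
    (start + timedelta(minutes=30), 10.5),
    (start + timedelta(minutes=60), 12.5),
]
velocity = spend_velocity(polls)
```

Here the second interval burns credits four times faster than the first, which is the kind of change a velocity series makes visible without touching the request path.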

Attributing spend to a developer

A single org key tells you the org spent money, not who spent it. The clean fix is one OpenRouter key per developer (or per agent), each tagged. OpenRouter lets you create multiple keys, so issue them per person and keep a map:

```python
KEY_OWNERS = {
    "sk-or-v1-aaa...": "ravi",
    "sk-or-v1-bbb...": "dana",
    "sk-or-v1-ccc...": "agent-ci",
}
```

Now poll each key's usage and bucket it by owner; the resolved model comes from the completion responses you captured earlier. A minimal aggregator:

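```python
import collections

import requests

# KEY_OWNERS is the key-to-owner map defined above.
# Assumption: /credits reports account-wide totals, so per-key numbers are
# read here from the key-scoped endpoint (/api/v1/key, field "usage").
# Verify the path and field name against the current OpenRouter docs.
KEY_INFO_URL = "https://openrouter.ai/api/v1/key"

def snapshot(key):
    """Return the cumulative credits used by one API key."""
    r = requests.get(
        KEY_INFO_URL,
        headers={"Authorization": f"Bearer {key}"},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["data"]["usage"]

spend = collections.defaultdict(float)
for key, owner in KEY_OWNERS.items():
    spend[owner] += snapshot(key)  # sum when one owner holds several keys
```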

Persist each poll with a timestamp and you can compute per-developer, per-window deltas. Join that against the per-call model you captured from completion responses, and you finally get the table you actually wanted: how much each person (or agent) spent, broken out by which model auto chose for them.
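One way to sketch that join, assuming you store each poll as an owner-to-cumulative-usage mapping and each call as a small record carrying the owner, the resolved model, and the cost from the generation lookup. The record shape and both function names are hypothetical.

```python
from collections import defaultdict


def window_deltas(previous, current):
    """Spend per owner between two polls of cumulative usage."""
    return {owner: current[owner] - previous.get(owner, 0.0) for owner in current}


def spend_by_owner_and_model(calls):
    """Sum per-call cost into a {(owner, model): cost} table."""
    table = defaultdict(float)
    for call in calls:
        table[(call["owner"], call["model"])] += call["cost"]
    return dict(table)


# Illustrative data only.
previous = {"ravi": 4.0, "dana": 1.0}
current = {"ravi": 5.5, "dana": 1.0, "agent-ci": 0.25}
calls = [
    {"owner": "ravi", "model": "anthropic/claude-3.5-sonnet", "cost": 1.0},
    {"owner": "ravi", "model": "google/gemini-flash-1.5", "cost": 0.5},
    {"owner": "agent-ci", "model": "google/gemini-flash-1.5", "cost": 0.25},
]

deltas = window_deltas(previous, current)
table = spend_by_owner_and_model(calls)

# Per-owner totals from the call log, for comparison against the polled deltas.
logged = defaultdict(float)
for (owner, _model), cost in table.items():
    logged[owner] += cost
```

If `logged` and `deltas` drift apart for an owner, that usually suggests calls your logging missed or costs that had not settled when you looked them up, so the polled delta is the safer number for billing and the per-model table is the breakdown.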

Turning data into a tripwire

A dashboard nobody opens won't catch a runaway agent at 2 a.m. You want a threshold that pages you. The cheapest version that works: compute each developer's rolling daily spend, then flag any day that exceeds their own mean plus three standard deviations.

```python
import statistics

def is_anomaly(history, today):
    """Flag today's spend if it exceeds the historical mean by more than 3 sigma."""
    if len(history) < 7:
        return False  # not enough baseline yet
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)  # population standard deviation
    return today > mu + 3 * sigma
```

Per-developer baselines matter more than one global number. The engineer fine-tuning prompts all day has a legitimately high baseline; the teammate who normally spends two dollars and suddenly spends eighty is your actual signal. A global threshold drowns the second case in the first. Wire the check to fire a Slack message the same day the anomaly appears — a spike you learn about at month-end is just an expensive history lesson.
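A sketch of the per-developer check wired to a Slack alert might look like the following. It repeats the same mean-plus-3σ rule so the block stands alone; `find_anomalies`, `slack_message`, and `post_to_slack` are hypothetical names, the spend figures are invented, and the webhook URL is assumed to be a Slack incoming webhook that you supply.

```python
import json
import statistics
import urllib.request


def is_anomaly(history, today):
    # Same rule as above: mean plus three population standard deviations.
    if len(history) < 7:
        return False
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return today > mu + 3 * sigma


def find_anomalies(histories, today_spend):
    """Return (owner, today, baseline mean) for owners over their own threshold."""
    flagged = []
    for owner, today in today_spend.items():
        history = histories.get(owner, [])
        if is_anomaly(history, today):
            flagged.append((owner, today, statistics.mean(history)))
    return flagged


def slack_message(flagged):
    """Build a JSON-serializable payload with one line per flagged owner."""
    lines = [
        f"{owner}: ${today:.2f} today vs ${mean:.2f} daily average"
        for owner, today, mean in flagged
    ]
    return {"text": "Spend anomaly\n" + "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (URL supplied by the caller)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# Illustrative data: a heavy but steady user and a normally light one.
histories = {
    "ravi": [40.0, 42.0, 38.0, 41.0, 39.0, 43.0, 40.0],
    "dana": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
}
today_spend = {"ravi": 44.0, "dana": 80.0}

flagged = find_anomalies(histories, today_spend)
payload = slack_message(flagged)
# post_to_slack(webhook_url, payload) would send it; it is not called here.
```

Note that a perfectly flat history has a standard deviation of zero, so any increase at all trips the check; a small minimum-spend floor is one way to quiet that if it proves noisy.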

The shape of the whole thing

Put together, the pattern is: read the resolved model from each completion response for realtime per-model attribution, poll the usage and generation endpoints out of band for authoritative cost, key per developer for attribution, store snapshots, and run a mean-plus-3σ check per person that alerts the same day. No proxy, no prompt access, no latency added to anyone's critical path. A read-only key and a cron job genuinely get you most of the way there.

If you'd rather not maintain the poller, the per-key plumbing, and the baseline math yourself, that's roughly what we build at Reckon — read-only usage-API polling (no proxy, no SDK, never sees your prompts), KMS-encrypted keys, and now realtime OpenRouter tracking by model, so openrouter/auto and OpenClaw runs show up per-model and per-developer as they happen, with Slack digests, same-day anomaly alerts, a /spend command, and Linear integration. Free for up to 3 developers. (Disclosure: I work on Reckon.) But nothing above requires us — the endpoints are right there, and a weekend is enough to wire your own.
