Building an LLM Cost-Governance Dashboard with LiteLLM + BigQuery

#ai #webdev #llm #devops

I gave my team Claude Code, Codex, and Kimi. A month later I opened four separate billing pages and still couldn't answer three basic questions: who spent what, on which models, and whether anyone was even still using the tools after the first-week excitement wore off.

Every provider has its own billing page, and none of them line up with my teams and users. So I built a small, self-hostable reference implementation that puts one gateway in front of everything.

Repo (MIT): https://github.com/0xkaz/llm-governance-dashboard

The shape of it

Coding agents (Codex / Claude Code / Kimi)
      │  per-user key, base_url = the proxy
      ▼
LiteLLM proxy  ── model name picks the upstream ──►  OpenAI / Anthropic / OpenRouter / Kimi
      │  logs every request (callback)
      ▼
BigQuery  ──►  FastAPI dashboard (cost, budgets, adoption)

LiteLLM is the gateway: one OpenAI/Anthropic-compatible endpoint for every provider, with per-user virtual keys and budgets.
A callback streams every request into BigQuery — cost, tokens, latency, team, user, model.
A small FastAPI app renders the dashboard server-side (no JS build), plus budget alerts to Slack/email and self-service key issue/revoke.

Everything runs on your laptop via Docker; I verified the whole path on real GCP.

Two things it gives you

Cost governance — spend/latency by user, team and model; budget alerts at 80/100%; virtual keys people can issue and revoke themselves (no one touches the master key).
Adoption tracking — the more interesting half: is the team actually using the agents, and does it stick? Weekly-active rate, repeat usage, and how many different tools each person uses. Aggregated by team; individual usage stays on each person's own page. I deliberately track adoption, not a per-developer output number — usage that sticks is a signal you can act on; a leaderboard is one people just game.

Three things that surprised me

Streaming hides the cost. Providers often omit token usage on streamed responses, so the gateway's own per-key spend under-counts — sometimes down to $0 for a request that clearly cost money. Two fixes: I inject stream_options.include_usage=true at the proxy, and I backfill cost from tokens into BigQuery so the dashboard is accurate regardless. The lesson that stuck: treat the warehouse as the source of truth for cost, and the gateway's hard budget-block as best-effort. They'll disagree; that's fine once you know which to trust for which job.

Subscription CLIs bypass the gateway — but that's a routing problem, not a measurement limit. A ChatGPT-logged-in Codex, the claude.ai app, and Moonshot's Kimi CLI talk to their vendor directly, so in that mode nothing is logged. But anything that goes through the gateway is measured — including subscription / flat-rate usage (tokens and latency are recorded; cost just shows 0). The fix is routing: force traffic through the proxy (API-key auth, or a generic OpenAI-compatible client for Kimi) and block direct egress. That's org policy, not a tool limitation — worth being honest about both halves.

Per-user attribution needs virtual keys. With the raw master key, everything shows up as one user. Hand each person a scoped key and user_id / team land on every row in BigQuery — and suddenly the dashboard is actually useful.

What it is (and isn't)

It's a demo / reference implementation, not a production product. No TLS, single node, no multi-tenant hardening. BigQuery and Looker are swappable for any warehouse/BI — the real requirement is just one place that holds every per-request log.

It's also deliberately different from the observability tools you'll reach for first. Helicone, Langfuse, and Portkey are great at tracing and debugging calls; this is thinner and pointed at a different question — governance and adoption: who's allowed to spend what, and is the team actually using the tools. Complementary, not a competitor.

If that fits how you'd approach the problem, fork it.

Curious how others are governing AI coding spend across a team — gateway-first like this, or something else?