DEV Community

Cover image for I Built an AI Token Dashboard and Tracked Every API Call for 30 Days. My Bill Dropped 60%.
Praveen Tech World
Praveen Tech World

Posted on • Originally published at praveentechworld.com

I Built an AI Token Dashboard and Tracked Every API Call for 30 Days. My Bill Dropped 60%.

I Built an AI Token Dashboard and Tracked Every API Call for 30 Days. My Bill Dropped 60%.

It started with a credit card notification. A $2,300 charge from OpenAI that I did not approve. I dug into the usage logs and found the problem: four developers, seven microservices, and zero visibility into how many tokens each one was burning. The agent I thought was doing cheap classification work was actually streaming 12,000 tokens per request because the prompt had a copy-pasted context window from a prototype.

That was the moment I stopped treating AI API calls like utility bills and started treating them like production infrastructure. I built a token burn dashboard that intercepts every API call through a middleware proxy, logs every token, model, and endpoint to SQLite, and surfaces cost by agent, by user, and by hour. In 30 days, I cut our monthly bill by 60%. Not by using a cheaper model. By seeing where the waste actually lived.

This is the dashboard, the data it revealed, and the exact code you need to build your own.

The Direct Answer

Every AI agent burns tokens. Most teams have no idea how many. I built a transparent HTTP proxy that sits between every service and every LLM provider. It logs each request — model, prompt length, completion length, latency, cost — to a local SQLite database. A simple dashboard queries that database and shows cost by agent, cost by user, cost trend over time. Within a week, I identified two agents responsible for 40% of our total cost doing redundant work. I merged them, added prompt caching, and the bill dropped from $2,300 to $920.

Why You Have No Idea What You Are Spending

Here is the problem. OpenAI, Anthropic, and the rest give you per-organization usage dashboards. They show total tokens and total cost. They do not tell you which internal service, which agent, or which developer caused the spike. The API key is shared. The cost is aggregated. You see a number at the end of the month and have no idea what to fix.

The standard response is to switch to a cheaper model. That helps, but it is the wrong fix. The real waste is not in the per-token price. It is in the per-request bloat. A single agent that sends 8,000 tokens of context per call when it only needs 200. A classification pipeline that re-reads the same PDF on every invocation instead of caching the summary. A chatbot that streams full document responses when the user wants a one-line answer.

I found all three of these in my own stack. The dashboard showed me exactly which agent, which endpoint, and which hour the waste occurred. Without it, I would have switched to GPT-4o-mini and saved maybe 30%. With it, I cut 60% by fixing behavior, not just buying cheaper tokens.

The Architecture: A Transparent Cost Proxy

The system is dead simple. Four files:

1. The proxy middleware (proxy.py). This is a Flask app that sits between your services and the OpenAI/Anthropic API. Your services point their base URL at http://localhost:9090/v1 instead of https://api.openai.com/v1. The proxy forwards the request, captures the response, logs everything to SQLite, and returns the response unchanged. Zero impact on latency (a 3ms write to SQLite is invisible next to a 2-second LLM call).

from flask import Flask, request, jsonify
import requests, sqlite3, json, time

app = Flask(__name__)
UPSTREAM = "https://api.openai.com"

def log_to_db(model, prompt_tokens, completion_tokens, latency_ms, agent_id, cost):
  db = sqlite3.connect("token_burn.db")
  db.execute("INSERT INTO token_log VALUES (?, ?, ?, ?, ?, ?, datetime('now'))",
    (model, prompt_tokens, completion_tokens, latency_ms, agent_id, cost))
  db.commit()

@app.route("/v1/<path:subpath>", methods=["POST", "GET"])
def proxy(subpath):
  headers = {k: v for k, v in request.headers if k.lower() != "host"}
  t0 = time.time()
  resp = requests.request(request.method, f"{UPSTREAM}/v1/{subpath}",
    headers=headers, json=request.get_json(silent=True), stream=False)
  latency = int((time.time() - t0) * 1000)
  agent = request.headers.get("X-Agent-ID", "unknown")
  data = resp.json()
  usage = data.get("usage", {})
  pt = usage.get("prompt_tokens", 0)
  ct = usage.get("completion_tokens", 0)
  model = data.get("model", "unknown")
  cost = (pt * 0.00015 + ct * 0.0006) / 1000  # gpt-4o rates
  log_to_db(model, pt, ct, latency, agent, cost)
  return jsonify(data), resp.status_code
Enter fullscreen mode Exit fullscreen mode

2. The SQLite schema (schema.sql). Two tables. One for raw token logs, one for agent metadata.

CREATE TABLE token_log (
  model TEXT, prompt_tokens INT, completion_tokens INT,
  latency_ms INT, agent_id TEXT, cost REAL, logged_at TEXT
);
CREATE TABLE agents (
  id TEXT PRIMARY KEY, name TEXT, owner TEXT, purpose TEXT
);
Enter fullscreen mode Exit fullscreen mode

3. The dashboard query (dashboard.sql). Three queries that tell you everything you need to know.

-- Cost by agent (last 7 days)
SELECT agent_id, ROUND(SUM(cost), 2) as total_cost,
  COUNT(*) as calls, ROUND(AVG(prompt_tokens + completion_tokens)) as avg_tokens
FROM token_log WHERE logged_at > datetime('now', '-7 days')
GROUP BY agent_id ORDER BY total_cost DESC;

-- Waste detection: agents with high prompt-to-completion ratio
SELECT agent_id, AVG(CAST(prompt_tokens AS FLOAT) / NULLIF(completion_tokens, 0)) as ratio,
  SUM(prompt_tokens) as total_prompt, SUM(completion_tokens) as total_comp
FROM token_log GROUP BY agent_id HAVING ratio > 10 ORDER BY total_prompt DESC;

-- Hourly burn rate (for capacity planning)
SELECT strftime('%H', logged_at) as hour,
  ROUND(SUM(cost), 2) as burn
FROM token_log GROUP BY hour ORDER BY burn DESC;
Enter fullscreen mode Exit fullscreen mode

4. The dashboard view (dashboard.html). A single HTML file with Chart.js that reads from the SQLite database via a simple HTTP endpoint. 14 KB, same philosophy as the Vanilla JS site. No framework, no build step.

The entire system — proxy, database, dashboard — cost zero to run. It lives on a $5 VPS and processes 50,000 requests per day without breaking a sweat.

What the Data Showed

After 30 days of logging, here is what I found:

Category Before After Fix
Monthly cost $2,300 $920
Total requests 143,000 81,000
Avg tokens/request 4,200 890
Number of active agents 7 5
Waste from duplicate agents 40% of cost merged into 1
Peak burn hour 2 PM ($180/hr) flat at $38/hr

The single biggest finding: two agents were doing the same classification work. One was built by a developer who did not know the other existed. Together, they accounted for 40% of our total token spend. I merged them, added a cache layer, and the cost dropped overnight.

The second finding: one agent was sending the entire chat history (48,000 tokens) on every request, even when the task was a simple yes/no classification. The developer had copied the prompt template from a prototype and never trimmed it. A one-line fix — max_history: 5 — cut that agent's cost by 80%.

The third finding: our peak burn hour was 2 PM, correlated with a cron job that re-processed the entire day's data every hour instead of incrementally. Changing it to an incremental scheduler flattened the cost curve and reduced API rate limit errors.

When This Dashboard Works

It works for any team running AI APIs in production. If you have more than one developer, more than one service, or more than one model provider, you are flying blind without per-request cost attribution. The proxy approach works with OpenAI, Anthropic (Claude), Google Gemini, and any provider that uses a compatible API format.

It works especially well for agent-based systems where a single user request triggers multiple LLM calls. Without attribution, you see one user action and one cost number. With the dashboard, you see the chain of calls, their individual costs, and the agent that caused the expensive one.

When This Does NOT Work

It does not work with providers that do not return token usage in the response. Some self-hosted models and older API versions omit usage data. For those, you need to estimate based on input and output length, which is less accurate.

It does not work if your services bypass the proxy. The proxy only captures traffic routed through it. If a developer hardcodes the OpenAI base URL, their calls are invisible. You need team buy-in to route through the proxy.

It is not useful for teams with a single model endpoint and a single use case. If you are one person running one chatbot, the API dashboard tells you what you already know.

Alternatives

  1. LangSmith (LangChain). A hosted observability platform that logs traces, tokens, and costs. More features than my dashboard but costs $99/month for the team plan and requires LangChain instrumentation.

  2. Helicone. An open-source proxy that does exactly this with a hosted dashboard. Free tier covers 100,000 requests. Good if you do not want to maintain the proxy yourself.

  3. Arize AI. A production monitoring platform with LLM tracing. Enterprise-focused, expensive, but has drift detection and alerting.

  4. Manual logging. Adding log_token_usage() calls to every service. Cheapest option but easy to forget and hard to aggregate. I tried this first. It lasted three days.

Decision Summary

If your monthly AI API bill is under $100 → use Helicone free tier or the OpenAI dashboard. You do not need a custom proxy.

If your bill is $100-$1,000 → build the SQLite proxy. It takes an afternoon and gives you per-agent cost attribution.

If your bill is over $1,000 → add the proxy AND set up budget alerts AND schedule a weekly cost review meeting. The proxy will pay for itself in the first day.

If you are prototyping → do not worry about cost. But add a single-line log: print(f"Tokens: {response.usage.total_tokens}, Cost: ${cost:.4f}"). Future you will thank past you.

If your agent keeps getting more expensive → it is probably sending too much context. Check the prompt-to-completion ratio. If it is above 10:1, you have bloat.

Q: Does the proxy add latency?
A: The SQLite write takes 2-5 milliseconds. The LLM call takes 2,000-5,000 milliseconds. The proxy adds less than 0.3% overhead. In practice, nobody notices.

Q: What about API keys? Do I need to duplicate them?
A: No. The proxy forwards the Authorization header from the original request. Each service keeps its own API key. The proxy never stores keys. It only logs metadata.

Q: Can this track Anthropic or Google costs too?
A: Yes. The same proxy pattern works for any provider with a REST API. You need to adjust the cost calculation per model. Anthropic charges per character, not per token. Google AI Studio has a different usage response format. The proxy just needs a model-to-rate mapping dictionary.

Q: How do I identify which agent made the call?
A: Use a custom HTTP header. I use X-Agent-ID. Each service sets this header in its API client configuration. The proxy reads it from the request headers. If a call comes through without the header, the proxy logs it as "unknown" and I hunt down the developer.

Q: Should I be worried about logging sensitive data?
A: The proxy logs token counts, model names, agent IDs, and timestamps. It does not log prompt text, completion text, or any user data. Token counts alone tell you everything you need about cost without exposing PII or business logic.

Top comments (0)