Ali Afana

OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard.

Identifies 100x cost gaps between features

TL;DR: OpenAI's billing page shows total spend. It doesn't show which feature, which tenant, or which conversation caused it. For a multi-tenant AI product, that's flying blind. I built a 3-file monitoring system — a wrapper, a table, a dashboard — that gives me per-call cost down to 8 decimal places. The first time I opened it, I caught a 100× cost gap between two features I'd been treating as similar.


The Gap in OpenAI's Dashboard

Open platform.openai.com/usage right now. You'll see:

  • Total spend per day
  • Breakdown by model (gpt-4o, gpt-4o-mini, etc.)
  • Token totals

That's it.

You won't see:

  • Which feature in your app caused those tokens
  • Which user or tenant triggered them
  • Which specific conversation went over budget
  • Whether failed calls are still costing you money
  • How latency correlates with cost

For a side project, fine. For a production AI product, you're guessing.

I'm building Provia — an AI sales chatbot for Arabic-speaking e-commerce stores. It's multi-tenant (each store is a separate customer) and multi-feature: chat completion, embeddings, image analysis, profile extraction. A single customer message can fire 1–3 OpenAI calls.

When I shipped, OpenAI told me I spent $4.27 yesterday.

That number was useless. Was it one expensive image analysis? A runaway store with thousands of messages? A loop firing the same call repeatedly? No way to know.

So I built my own observability. Three files. One afternoon.


File 1: The Wrapper (openai-logger.ts)

The core idea: don't call OpenAI directly. Call a wrapper that measures everything and logs it asynchronously.

Pricing Table

OpenAI's API returns token counts but not cost. You calculate it yourself from a hardcoded table:

// Last checked: 2026-04-15 — https://openai.com/pricing
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },   // per 1M tokens
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },   // per 1M tokens
};

Here's the headline ratio:

| Model | Input (per 1M) | Output (per 1M) | Cost ratio |
|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | |
| gpt-4o-mini | $0.15 | $0.60 | ~16× cheaper |

gpt-4o is roughly 16× more expensive than gpt-4o-mini for the same number of tokens. If you're using gpt-4o for anything gpt-4o-mini can handle, you're burning money. The dashboard makes this visible call by call — exactly what you need when deciding which model goes where.
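
To make that concrete, here's a quick back-of-the-envelope comparison. The token counts are made up for illustration; the rates are the ones from the pricing table above:

// Hypothetical example: the same 1,200-prompt / 400-completion call on each model
const perCall = (rates: { input: number; output: number }) =>
  (1_200 * rates.input + 400 * rates.output) / 1_000_000;

perCall({ input: 0.15, output: 0.60 });  // gpt-4o-mini → $0.00042
perCall({ input: 2.50, output: 10.00 }); // gpt-4o      → $0.007, ~16.7× more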

The Wrapper

import OpenAI from "openai";
import { createAdminClient } from "@/lib/supabase/admin";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },
};

interface LogMeta {
  storeId: string;
  conversationId?: string;
  leadId?: string;
  endpoint: string;
  functionCalled?: string;
  searchQuery?: string;
  productsFound?: number;
}

export async function loggedChatCompletion(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParams,
  meta: LogMeta
) {
  const start = Date.now();
  const result = await openai.chat.completions.create(params);
  const duration = Date.now() - start;

  const tokens = result.usage;
  const rates = PRICING[params.model as string] || PRICING["gpt-4o-mini"];
  const cost = tokens
    ? (tokens.prompt_tokens * rates.input +
       tokens.completion_tokens * rates.output) / 1_000_000
    : 0;

  // Fire-and-forget log — never blocks the response
  const supabase = createAdminClient();
  supabase
    .from("api_logs")
    .insert({
      store_id: meta.storeId,
      conversation_id: meta.conversationId,
      lead_id: meta.leadId,
      endpoint: meta.endpoint,
      model: params.model,
      prompt_tokens: tokens?.prompt_tokens,
      completion_tokens: tokens?.completion_tokens,
      total_tokens: tokens?.total_tokens,
      cost,
      duration_ms: duration,
      function_called: meta.functionCalled,
      search_query: meta.searchQuery,
      products_found: meta.productsFound,
      status: "success",
    })
    .then(() => {})
    .catch(() => {}); // Silent fail — monitoring never breaks the app

  return { result, cost, duration };
}

The One Pattern That Matters: Fire-and-Forget

The line that makes this safe to ship:

.then(() => {}).catch(() => {}); // Silent fail

The log insert is not awaited. If the database is down, if there's a network blip, if the table doesn't exist yet — the user's response still goes through.

In my testing the log insert takes 15–40ms. The chat completion takes 800–2500ms. If I awaited the log, I'd add 2–5% latency to every request for zero user benefit.

Monitoring must never slow down the thing it's monitoring. That's the only rule that matters here.

I've run this pattern for weeks and lost maybe 2–3 log entries out of thousands. Acceptable trade-off.
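
If you'd rather name the pattern than repeat the inline chain, a tiny helper does it. This is a hypothetical utility, not part of the wrapper above:

// Run a thenable in the background and swallow any failure,
// so monitoring can never block or break the request path.
function fireAndForget(work: PromiseLike<unknown>): void {
  Promise.resolve(work).then(
    () => {},
    () => {} // silent fail: a lost log entry is cheaper than a failed response
  );
}

// Usage: fireAndForget(supabase.from("api_logs").insert(row));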

Drop-in Usage

Before:

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: customerMessage }],
});

After:

const { result, cost, duration } = await loggedChatCompletion(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: customerMessage }],
  },
  {
    storeId: store.id,
    conversationId: conversation.id,
    leadId: lead.id,
    endpoint: "chat",
  }
);

Same interface, one extra parameter. Find-and-replace across the codebase: 10 minutes.
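
Chat isn't the only endpoint worth wrapping. Here's a sketch of the same idea for embeddings, living in the same openai-logger.ts and reusing the openai client, LogMeta, and createAdminClient from above. The $0.02 per 1M tokens rate is OpenAI's published price for text-embedding-3-small; the logged columns mirror the chat wrapper:

const EMBEDDING_PRICING: Record<string, number> = {
  "text-embedding-3-small": 0.02, // $ per 1M input tokens
};

export async function loggedEmbedding(
  params: OpenAI.EmbeddingCreateParams,
  meta: LogMeta
) {
  const start = Date.now();
  const result = await openai.embeddings.create(params);
  const duration = Date.now() - start;

  const tokens = result.usage; // embeddings only report input tokens
  const rate = EMBEDDING_PRICING[params.model as string] ?? 0.02;
  const cost = tokens ? (tokens.prompt_tokens * rate) / 1_000_000 : 0;

  // Same fire-and-forget insert as the chat wrapper
  createAdminClient()
    .from("api_logs")
    .insert({
      store_id: meta.storeId,
      endpoint: meta.endpoint,
      model: params.model,
      prompt_tokens: tokens?.prompt_tokens,
      total_tokens: tokens?.total_tokens,
      cost,
      duration_ms: duration,
      status: "success",
    })
    .then(() => {})
    .catch(() => {});

  return { result, cost, duration };
}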


File 2: The Table (api_logs)

CREATE TABLE api_logs (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  store_id UUID REFERENCES stores(id),
  conversation_id UUID,
  lead_id UUID,
  endpoint TEXT NOT NULL,
  model TEXT,
  prompt_tokens INT,
  completion_tokens INT,
  total_tokens INT,
  cost DECIMAL(10,8),
  duration_ms INT,
  function_called TEXT,
  search_query TEXT,
  products_found INT,
  status TEXT DEFAULT 'success',
  error TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_api_logs_store ON api_logs(store_id);
CREATE INDEX idx_api_logs_created ON api_logs(created_at);
CREATE INDEX idx_api_logs_endpoint ON api_logs(endpoint);

The columns are the dimensions you can slice by. Each one answers a question OpenAI's dashboard can't:

  • store_id → "Which tenant is the most expensive?" In multi-tenant SaaS, one store can cost 10× another. Without this column you'll never see it.
  • endpoint → "Is chat the expensive part, or is it image analysis?"
  • conversation_id + lead_id → "How much did this conversation cost? This customer?"
  • function_called + search_query + products_found → Debug columns. When a customer says "show me red dresses" and the bot returns nothing, you can check: did it call the search function? With what query? How many products came back? This saved me hours of debugging.
  • duration_ms → Latency. Color-coded in the dashboard: green <1.5s, yellow 1.5–3s, red >3s.
  • error → Failed calls still consume prompt tokens. OpenAI charges for them. Track them.

One detail that's easy to miss: cost DECIMAL(10,8). Eight decimal places.

A single gpt-4o-mini chat completion costs roughly $0.00013. With DECIMAL(10,2), every call rounds to $0.00 and your totals are useless. Fractions of a cent matter at scale.
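
Run the numbers for one typical call to see why (the token counts are illustrative):

// Roughly one small gpt-4o-mini call: 500 prompt tokens, 200 completion tokens
const cost = (500 * 0.15 + 200 * 0.60) / 1_000_000; // 0.000195

cost.toFixed(2); // "0.00"       — what DECIMAL(10,2) would keep
cost.toFixed(8); // "0.00019500" — what DECIMAL(10,8) keeps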


File 3: The Dashboard

The API route (/api/admin/logs/route.ts) takes filters (startDate, endDate, endpoint, storeId) and returns aggregated data:

{
  summary: {
    totalRequests, totalTokens, avgTokensPerRequest, avgLatency, totalCost
  },
  dailyTokens:       [{ date, prompt, completion, total }, ...],
  hourlyActivity:    [{ hour, count }, ...],
  endpointBreakdown: [{ endpoint, count, cost, percentage }, ...],
  modelBreakdown:    [{ model, count, cost, percentage }, ...],
  storeBreakdown:    [{ storeId, storeName, count, cost }, ...],
}
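
I won't reproduce the whole route, but the shape is straightforward: fetch the rows for the selected window, aggregate in TypeScript, return JSON. A trimmed sketch, assuming a Next.js App Router handler and the same Supabase admin client (error handling and the other breakdowns omitted):

import { NextRequest, NextResponse } from "next/server";
import { createAdminClient } from "@/lib/supabase/admin";

export async function GET(req: NextRequest) {
  const { searchParams } = new URL(req.url);
  const startDate = searchParams.get("startDate") ?? "1970-01-01";
  const endDate = searchParams.get("endDate") ?? new Date().toISOString();

  let query = createAdminClient()
    .from("api_logs")
    .select("endpoint, model, store_id, total_tokens, cost, duration_ms, created_at")
    .gte("created_at", startDate)
    .lte("created_at", endDate);

  const endpoint = searchParams.get("endpoint");
  if (endpoint) query = query.eq("endpoint", endpoint);

  const { data } = await query;
  const rows = data ?? [];

  const summary = {
    totalRequests: rows.length,
    totalTokens: rows.reduce((s, r) => s + Number(r.total_tokens ?? 0), 0),
    totalCost: rows.reduce((s, r) => s + Number(r.cost ?? 0), 0),
    avgLatency:
      rows.length > 0
        ? rows.reduce((s, r) => s + Number(r.duration_ms ?? 0), 0) / rows.length
        : 0,
  };

  return NextResponse.json({ summary /* , dailyTokens, endpointBreakdown, ... */ });
}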

The UI is intentionally boring:

  • 5 stat cards at the top — total requests, total tokens, avg tokens/request, avg latency (color-coded), total cost
  • Date filters — Today, 7 Days, 30 Days, All Time, Custom Range
  • Dropdowns — endpoint, store
  • Live mode toggle — auto-refresh every 5s
  • Two charts — daily tokens (prompt vs completion), hourly activity
  • Expandable log rows — click one to see full detail: model, tokens, cost, latency, search query, products found

The API does the heavy lifting. The UI just renders pre-aggregated data. No client-side calculations, no surprises.


What It Found

Real numbers from one production day on Provia:

| Metric | Value |
|---|---|
| Customer messages handled | 42 |
| OpenAI API calls | ~85 |
| Total tokens | ~31,000 |
| Total cost | ~$0.005 |
| Avg cost per message | ~$0.00013 |

Cost split by feature:

| Endpoint | Model | Calls | Share of cost |
|---|---|---|---|
| Chat | gpt-4o-mini | 42 | ~85% |
| Embeddings | text-embedding-3-small | 42 | ~2% |
| Profile extraction | gpt-4o-mini | ~12 | ~3% |
| Image analysis | gpt-4o | 1 | ~10% |

Two things jumped out the moment I had this view:

One. Image analysis with gpt-4o costs roughly 100× more per call than chat with gpt-4o-mini. Even though only ~1% of calls were image analysis, they ate ~10% of the budget. That changed how I thought about which features deserve gpt-4o vs which can live on gpt-4o-mini.

Two. The chat endpoint was averaging far more prompt tokens per call than I'd estimated. The dashboard showed the symptom; investigation revealed I was sending the entire conversation history as context with every single response. That's a separate architectural fix I wrote about here — the point for this article is that I wouldn't have looked for the bug if the dashboard hadn't shown me the symptom.

That's the loop. You can't optimize what you don't measure. You can't measure what you don't instrument. And generic billing dashboards don't instrument your application.


5 Things I Learned Building This

1. OpenAI's dashboard is a billing tool, not an observability tool

It tells finance what to charge. It doesn't tell engineering what to fix. Different jobs.

2. Fire-and-forget is non-negotiable

If your monitoring blocks the request path, you've made the product worse. The whole point of observability is that it's invisible until you look at it. Always non-awaited inserts. Always silent failure on log errors.

3. Eight decimal places, not two

Store cost as DECIMAL(10,2) and every call rounds to zero. AI costs are fractional cents per call. Treat them like fractional cents.

4. The dimensions are the product

Total cost is a number. Cost-per-tenant, cost-per-feature, cost-per-conversation are insights. The columns you log determine the questions you can answer. Add the column when you build the feature, not after you have a problem.

5. Hardcode the pricing. Update it manually.

There is no OpenAI pricing API for you to query. Hardcode the rates with a comment noting the date you last checked, and update them when OpenAI changes pricing. Two lines of code, three minutes a month.

// Last checked: 2026-04-15 — https://openai.com/pricing
const PRICING = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },
};

What to Add When You're Ready

Once the basic version is in place, here's the upgrade path:

Latency percentiles. Average latency lies. Track p50, p95, p99. Average might be 1.2s, but if p99 is 8s, one in a hundred users is having a terrible time.
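
The math is a few lines over the duration_ms values you already log; a sketch using the nearest-rank method:

// Nearest-rank percentile over logged latencies (durations in ms)
function percentile(durations: number[], p: number): number {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  const index = Math.max(
    0,
    Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)
  );
  return sorted[index];
}

// percentile(durations, 50) → p50, percentile(durations, 95) → p95, percentile(durations, 99) → p99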

Per-tenant budget alerts. Threshold of $1/day per store. Slack/email when exceeded. Catches runaway loops, prompt injections that generate huge outputs, or stores with unexpected usage spikes.
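
The check itself can live in a cron-triggered route or scheduled function. A sketch, reusing the same admin client; the $1 threshold and the SLACK_WEBHOOK_URL env var are assumptions:

import { createAdminClient } from "@/lib/supabase/admin";

const DAILY_BUDGET_USD = 1.0; // per store, per day — tune to taste

export async function checkStoreBudgets() {
  const since = new Date();
  since.setUTCHours(0, 0, 0, 0);

  const { data } = await createAdminClient()
    .from("api_logs")
    .select("store_id, cost")
    .gte("created_at", since.toISOString());

  // Sum today's spend per store
  const spend = new Map<string, number>();
  for (const row of data ?? []) {
    spend.set(row.store_id, (spend.get(row.store_id) ?? 0) + Number(row.cost ?? 0));
  }

  for (const [storeId, total] of spend) {
    if (total > DAILY_BUDGET_USD && process.env.SLACK_WEBHOOK_URL) {
      await fetch(process.env.SLACK_WEBHOOK_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: `Store ${storeId} is at $${total.toFixed(4)} today (budget $${DAILY_BUDGET_USD})`,
        }),
      });
    }
  }
}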

Error rates by endpoint. Total error rate hides distribution. Chat at 2% errors and image analysis at 15% is a different problem from both at 8%.

Cost per conversion. If your AI exists to drive a business outcome (sales, signups, completions), connect logs to that outcome table. Now you have ROI per conversation, not just spend per conversation.

Model migration tracking. When you switch a feature from gpt-4o to gpt-4o-mini, the cost drop should be visible. The model column makes before/after trivial.


The Bottom Line

Three files. One afternoon. About 400 lines total.

A wrapper that intercepts every API call. A table with enough dimensions to slice the data. A page that aggregates it into something you can act on.

You don't need LangSmith or Helicone or Datadog (those are great if you prefer them). You need the smallest possible instrument that answers "which feature, which tenant, which conversation" — because that's the question your billing dashboard can't answer.

The first time I opened mine, I caught a 100× cost gap between two features I'd been treating as similar. I caught it because I'd built the lens to see it.

Build the lens before you ship. Or — more honestly — build it the day you ship, before you forget.


*I'm building **Provia** — an AI sales chatbot for Arabic-speaking e-commerce stores. Follow for more posts on building AI products from Gaza on a tight budget.*

Top comments (5)

bingkahu (Matteo)

Great idea! Now could you make it for Claude?

Ali Afana

Thanks Matteo! The wrapper itself is model-agnostic — the only things that change are the pricing table and the field names in the response.

For Anthropic's SDK, pricing table looks like:

"claude-haiku-4-5":  { input: 1.00,  output: 5.00  },
"claude-sonnet-4-6": { input: 3.00,  output: 15.00 },
"claude-opus-4-7":   { input: 5.00,  output: 25.00 },

And the response uses usage.input_tokens / output_tokens instead of OpenAI's prompt_tokens / completion_tokens — straightforward swap.

One thing worth adding to the table for Anthropic specifically: cache read/write tokens. Prompt caching gives ~90% discount on cache hits, so if you're not tracking those columns separately you'll undercount savings. Probably worth its own follow-up post.

Are you mixing both providers in one app? That's actually the most interesting case — comparing cost-per-feature across providers from the same dashboard.

bingkahu (Matteo)

Yes, it would be quite interesting if you made a mode where you could compare token usage across multiple suppliers (e.g. Anthropic, DeepSeek, ChatGPT, etc.). You could even create bar charts and pie charts to show your earnings across all models and providers.

Ali Afana

That's a really good angle, Matteo — a unified multi-provider dashboard would basically turn the provider column into the most important dimension in the whole system.

The architecture isn't hard. One api_logs table with provider, model, endpoint, and a normalized cost column. Each SDK gets its own thin wrapper that maps to the same shape. The dashboard groups by whatever you want — provider, model, feature, tenant.

Where it gets interesting is the comparison views you're describing:

  • Pie chart by provider — am I actually diversified, or 95% locked into one vendor?
  • Bar chart by feature × provider — chat on Claude vs GPT vs DeepSeek for the same workload
  • Cost-per-task — same prompt across providers, normalized by output quality

Honestly you've just outlined my next post. I'll build a multi-provider version of this and write it up — would you want me to tag you when it goes live?

bingkahu (Matteo)

Yeah that sounds good! Excited to see the post!