Ali Afana

OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard.

Identifies 100x cost gaps between features

TL;DR: OpenAI's billing page shows total spend. It doesn't show which feature, which tenant, or which conversation caused it. For a multi-tenant AI product, that's flying blind. I built a 3-file monitoring system — a wrapper, a table, a dashboard — that gives me per-call cost down to 8 decimal places. The first time I opened it, I caught a 100× cost gap between two features I'd been treating as similar.


The Gap in OpenAI's Dashboard

Open platform.openai.com/usage right now. You'll see:

  • Total spend per day
  • Breakdown by model (gpt-4o, gpt-4o-mini, etc.)
  • Token totals

That's it.

You won't see:

  • Which feature in your app caused those tokens
  • Which user or tenant triggered them
  • Which specific conversation went over budget
  • Whether failed calls are still costing you money
  • How latency correlates with cost

For a side project, fine. For a production AI product, you're guessing.

I'm building Provia — an AI sales chatbot for Arabic-speaking e-commerce stores. It's multi-tenant (each store is a separate customer) and multi-feature: chat completion, embeddings, image analysis, profile extraction. A single customer message can fire 1–3 OpenAI calls.

When I shipped, OpenAI told me I spent $4.27 yesterday.

That number was useless. Was it one expensive image analysis? A runaway store with thousands of messages? A loop firing the same call repeatedly? No way to know.

So I built my own observability. Three files. One afternoon.


File 1: The Wrapper (openai-logger.ts)

The core idea: don't call OpenAI directly. Call a wrapper that measures everything and logs it asynchronously.

Pricing Table

OpenAI's API returns token counts but not cost. You calculate it yourself from a hardcoded table:

// Last checked: 2026-04-15 — https://openai.com/pricing
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },   // per 1M tokens
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },   // per 1M tokens
};

Here's the headline ratio:

| Model | Input (per 1M) | Output (per 1M) | Cost ratio |
|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | |
| gpt-4o-mini | $0.15 | $0.60 | ~16× cheaper |

gpt-4o is roughly 16× more expensive than gpt-4o-mini for the same number of tokens. If you're using gpt-4o for anything gpt-4o-mini can handle, you're burning money. The dashboard makes this visible call by call — exactly what you need when deciding which model goes where.
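
To make that concrete, here's a quick back-of-the-envelope comparison. The token counts are made up for illustration; the rates are the ones from the pricing table above:

// Hypothetical example: the same 1,200-prompt / 400-completion call on each model
const perCall = (rates: { input: number; output: number }) =>
  (1_200 * rates.input + 400 * rates.output) / 1_000_000;

perCall({ input: 0.15, output: 0.60 });  // gpt-4o-mini → $0.00042
perCall({ input: 2.50, output: 10.00 }); // gpt-4o      → $0.007, ~16.7× more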

The Wrapper

import OpenAI from "openai";
import { createAdminClient } from "@/lib/supabase/admin";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },
};

interface LogMeta {
  storeId: string;
  conversationId?: string;
  leadId?: string;
  endpoint: string;
  functionCalled?: string;
  searchQuery?: string;
  productsFound?: number;
}

export async function loggedChatCompletion(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParams,
  meta: LogMeta
) {
  const start = Date.now();
  const result = await openai.chat.completions.create(params);
  const duration = Date.now() - start;

  const tokens = result.usage;
  const rates = PRICING[params.model as string] || PRICING["gpt-4o-mini"];
  const cost = tokens
    ? (tokens.prompt_tokens * rates.input +
       tokens.completion_tokens * rates.output) / 1_000_000
    : 0;

  // Fire-and-forget log — never blocks the response
  const supabase = createAdminClient();
  supabase
    .from("api_logs")
    .insert({
      store_id: meta.storeId,
      conversation_id: meta.conversationId,
      lead_id: meta.leadId,
      endpoint: meta.endpoint,
      model: params.model,
      prompt_tokens: tokens?.prompt_tokens,
      completion_tokens: tokens?.completion_tokens,
      total_tokens: tokens?.total_tokens,
      cost,
      duration_ms: duration,
      function_called: meta.functionCalled,
      search_query: meta.searchQuery,
      products_found: meta.productsFound,
      status: "success",
    })
    .then(() => {})
    .catch(() => {}); // Silent fail — monitoring never breaks the app

  return { result, cost, duration };
}

The One Pattern That Matters: Fire-and-Forget

The line that makes this safe to ship:

.then(() => {}).catch(() => {}); // Silent fail

The log insert is not awaited. If the database is down, if there's a network blip, if the table doesn't exist yet — the user's response still goes through.

In my testing the log insert takes 15–40ms. The chat completion takes 800–2500ms. If I awaited the log, I'd add 2–5% latency to every request for zero user benefit.

Monitoring must never slow down the thing it's monitoring. That's the only rule that matters here.

I've run this pattern for weeks and lost maybe 2–3 log entries out of thousands. Acceptable trade-off.
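
If you'd rather name the pattern than repeat the inline chain, a tiny helper does it. This is a hypothetical utility, not part of the wrapper above:

// Run a thenable in the background and swallow any failure,
// so monitoring can never block or break the request path.
function fireAndForget(work: PromiseLike<unknown>): void {
  Promise.resolve(work).then(
    () => {},
    () => {} // silent fail: a lost log entry is cheaper than a failed response
  );
}

// Usage: fireAndForget(supabase.from("api_logs").insert(row));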

Drop-in Usage

Before:

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: customerMessage }],
});

After:

const { result, cost, duration } = await loggedChatCompletion(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: customerMessage }],
  },
  {
    storeId: store.id,
    conversationId: conversation.id,
    leadId: lead.id,
    endpoint: "chat",
  }
);

Same interface, one extra parameter. Find-and-replace across the codebase: 10 minutes.
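
Chat isn't the only endpoint worth wrapping. Here's a sketch of the same idea for embeddings, living in the same openai-logger.ts and reusing the openai client, LogMeta, and createAdminClient from above. The $0.02 per 1M tokens rate is OpenAI's published price for text-embedding-3-small; the logged columns mirror the chat wrapper:

const EMBEDDING_PRICING: Record<string, number> = {
  "text-embedding-3-small": 0.02, // $ per 1M input tokens
};

export async function loggedEmbedding(
  params: OpenAI.EmbeddingCreateParams,
  meta: LogMeta
) {
  const start = Date.now();
  const result = await openai.embeddings.create(params);
  const duration = Date.now() - start;

  const tokens = result.usage; // embeddings only report input tokens
  const rate = EMBEDDING_PRICING[params.model as string] ?? 0.02;
  const cost = tokens ? (tokens.prompt_tokens * rate) / 1_000_000 : 0;

  // Same fire-and-forget insert as the chat wrapper
  createAdminClient()
    .from("api_logs")
    .insert({
      store_id: meta.storeId,
      endpoint: meta.endpoint,
      model: params.model,
      prompt_tokens: tokens?.prompt_tokens,
      total_tokens: tokens?.total_tokens,
      cost,
      duration_ms: duration,
      status: "success",
    })
    .then(() => {})
    .catch(() => {});

  return { result, cost, duration };
}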


File 2: The Table (api_logs)

CREATE TABLE api_logs (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  store_id UUID REFERENCES stores(id),
  conversation_id UUID,
  lead_id UUID,
  endpoint TEXT NOT NULL,
  model TEXT,
  prompt_tokens INT,
  completion_tokens INT,
  total_tokens INT,
  cost DECIMAL(10,8),
  duration_ms INT,
  function_called TEXT,
  search_query TEXT,
  products_found INT,
  status TEXT DEFAULT 'success',
  error TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_api_logs_store ON api_logs(store_id);
CREATE INDEX idx_api_logs_created ON api_logs(created_at);
CREATE INDEX idx_api_logs_endpoint ON api_logs(endpoint);

The columns are the dimensions you can slice by. Each one answers a question OpenAI's dashboard can't:

  • store_id → "Which tenant is the most expensive?" In multi-tenant SaaS, one store can cost 10× another. Without this column you'll never see it.
  • endpoint → "Is chat the expensive part, or is it image analysis?"
  • conversation_id + lead_id → "How much did this conversation cost? This customer?"
  • function_called + search_query + products_found → Debug columns. When a customer says "show me red dresses" and the bot returns nothing, you can check: did it call the search function? With what query? How many products came back? This saved me hours of debugging.
  • duration_ms → Latency. Color-coded in the dashboard: green <1.5s, yellow 1.5–3s, red >3s.
  • error → Failed calls still consume prompt tokens. OpenAI charges for them. Track them.

One detail that's easy to miss: cost DECIMAL(10,8). Eight decimal places.

A single gpt-4o-mini chat completion costs roughly $0.00013. With DECIMAL(10,2), every call rounds to $0.00 and your totals are useless. Fractions of a cent matter at scale.
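
Run the numbers for one typical call to see why (the token counts are illustrative):

// Roughly one small gpt-4o-mini call: 500 prompt tokens, 200 completion tokens
const cost = (500 * 0.15 + 200 * 0.60) / 1_000_000; // 0.000195

cost.toFixed(2); // "0.00"       — what DECIMAL(10,2) would keep
cost.toFixed(8); // "0.00019500" — what DECIMAL(10,8) keeps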


File 3: The Dashboard

The API route (/api/admin/logs/route.ts) takes filters (startDate, endDate, endpoint, storeId) and returns aggregated data:

{
  summary: {
    totalRequests, totalTokens, avgTokensPerRequest, avgLatency, totalCost
  },
  dailyTokens:       [{ date, prompt, completion, total }, ...],
  hourlyActivity:    [{ hour, count }, ...],
  endpointBreakdown: [{ endpoint, count, cost, percentage }, ...],
  modelBreakdown:    [{ model, count, cost, percentage }, ...],
  storeBreakdown:    [{ storeId, storeName, count, cost }, ...],
}
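
I won't reproduce the whole route, but the shape is straightforward: fetch the rows for the selected window, aggregate in TypeScript, return JSON. A trimmed sketch, assuming a Next.js App Router handler and the same Supabase admin client (error handling and the other breakdowns omitted):

import { NextRequest, NextResponse } from "next/server";
import { createAdminClient } from "@/lib/supabase/admin";

export async function GET(req: NextRequest) {
  const { searchParams } = new URL(req.url);
  const startDate = searchParams.get("startDate") ?? "1970-01-01";
  const endDate = searchParams.get("endDate") ?? new Date().toISOString();

  let query = createAdminClient()
    .from("api_logs")
    .select("endpoint, model, store_id, total_tokens, cost, duration_ms, created_at")
    .gte("created_at", startDate)
    .lte("created_at", endDate);

  const endpoint = searchParams.get("endpoint");
  if (endpoint) query = query.eq("endpoint", endpoint);

  const { data } = await query;
  const rows = data ?? [];

  const summary = {
    totalRequests: rows.length,
    totalTokens: rows.reduce((s, r) => s + Number(r.total_tokens ?? 0), 0),
    totalCost: rows.reduce((s, r) => s + Number(r.cost ?? 0), 0),
    avgLatency:
      rows.length > 0
        ? rows.reduce((s, r) => s + Number(r.duration_ms ?? 0), 0) / rows.length
        : 0,
  };

  return NextResponse.json({ summary /* , dailyTokens, endpointBreakdown, ... */ });
}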

The UI is intentionally boring:

  • 5 stat cards at the top — total requests, total tokens, avg tokens/request, avg latency (color-coded), total cost
  • Date filters — Today, 7 Days, 30 Days, All Time, Custom Range
  • Dropdowns — endpoint, store
  • Live mode toggle — auto-refresh every 5s
  • Two charts — daily tokens (prompt vs completion), hourly activity
  • Expandable log rows — click one to see full detail: model, tokens, cost, latency, search query, products found

The API does the heavy lifting. The UI just renders pre-aggregated data. No client-side calculations, no surprises.


What It Found

Real numbers from one production day on Provia:

| Metric | Value |
|---|---|
| Customer messages handled | 42 |
| OpenAI API calls | ~85 |
| Total tokens | ~31,000 |
| Total cost | ~$0.005 |
| Avg cost per message | ~$0.00013 |

Cost split by feature:

| Endpoint | Model | Calls | Share of cost |
|---|---|---|---|
| Chat | gpt-4o-mini | 42 | ~85% |
| Embeddings | text-embedding-3-small | 42 | ~2% |
| Profile extraction | gpt-4o-mini | ~12 | ~3% |
| Image analysis | gpt-4o | 1 | ~10% |

Two things jumped out the moment I had this view:

One. Image analysis with gpt-4o costs roughly 100× more per call than chat with gpt-4o-mini. Even though only ~1% of calls were image analysis, they ate ~10% of the budget. That changed how I thought about which features deserve gpt-4o vs which can live on gpt-4o-mini.

Two. The chat endpoint was averaging far more prompt tokens per call than I'd estimated. The dashboard showed the symptom; investigation revealed I was sending the entire conversation history as context with every single response. That's a separate architectural fix I wrote about here — the point for this article is that I wouldn't have looked for the bug if the dashboard hadn't shown me the symptom.

That's the loop. You can't optimize what you don't measure. You can't measure what you don't instrument. And generic billing dashboards don't instrument your application.


5 Things I Learned Building This

1. OpenAI's dashboard is a billing tool, not an observability tool

It tells finance what to charge. It doesn't tell engineering what to fix. Different jobs.

2. Fire-and-forget is non-negotiable

If your monitoring blocks the request path, you've made the product worse. The whole point of observability is that it's invisible until you look at it. Always non-awaited inserts. Always silent failure on log errors.

3. Eight decimal places, not two

Store cost as DECIMAL(10,2) and every call rounds to zero. AI costs are fractional cents per call. Treat them like fractional cents.

4. The dimensions are the product

Total cost is a number. Cost-per-tenant, cost-per-feature, cost-per-conversation are insights. The columns you log determine the questions you can answer. Add the column when you build the feature, not after you have a problem.

5. Hardcode the pricing. Update it manually.

There is no OpenAI pricing API for you to query. Hardcode the rates with a comment noting the date you last checked, and update them when OpenAI changes pricing. Two lines of code, three minutes a month.

// Last checked: 2026-04-15 — https://openai.com/pricing
const PRICING = {
  "gpt-4o":      { input: 2.50,  output: 10.00 },
  "gpt-4o-mini": { input: 0.15,  output: 0.60  },
};

What to Add When You're Ready

Once the basic version is in place, here's the upgrade path:

Latency percentiles. Average latency lies. Track p50, p95, p99. Average might be 1.2s, but if p99 is 8s, one in a hundred users is having a terrible time.
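
The math is a few lines over the duration_ms values you already log; a sketch using the nearest-rank method:

// Nearest-rank percentile over logged latencies (durations in ms)
function percentile(durations: number[], p: number): number {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  const index = Math.max(
    0,
    Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)
  );
  return sorted[index];
}

// percentile(durations, 50) → p50, percentile(durations, 95) → p95, percentile(durations, 99) → p99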

Per-tenant budget alerts. Threshold of $1/day per store. Slack/email when exceeded. Catches runaway loops, prompt injections that generate huge outputs, or stores with unexpected usage spikes.
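
The check itself can live in a cron-triggered route or scheduled function. A sketch, reusing the same admin client; the $1 threshold and the SLACK_WEBHOOK_URL env var are assumptions:

import { createAdminClient } from "@/lib/supabase/admin";

const DAILY_BUDGET_USD = 1.0; // per store, per day — tune to taste

export async function checkStoreBudgets() {
  const since = new Date();
  since.setUTCHours(0, 0, 0, 0);

  const { data } = await createAdminClient()
    .from("api_logs")
    .select("store_id, cost")
    .gte("created_at", since.toISOString());

  // Sum today's spend per store
  const spend = new Map<string, number>();
  for (const row of data ?? []) {
    spend.set(row.store_id, (spend.get(row.store_id) ?? 0) + Number(row.cost ?? 0));
  }

  for (const [storeId, total] of spend) {
    if (total > DAILY_BUDGET_USD && process.env.SLACK_WEBHOOK_URL) {
      await fetch(process.env.SLACK_WEBHOOK_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: `Store ${storeId} is at $${total.toFixed(4)} today (budget $${DAILY_BUDGET_USD})`,
        }),
      });
    }
  }
}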

Error rates by endpoint. Total error rate hides distribution. Chat at 2% errors and image analysis at 15% is a different problem from both at 8%.

Cost per conversion. If your AI exists to drive a business outcome (sales, signups, completions), connect logs to that outcome table. Now you have ROI per conversation, not just spend per conversation.

Model migration tracking. When you switch a feature from gpt-4o to gpt-4o-mini, the cost drop should be visible. The model column makes before/after trivial.


The Bottom Line

Three files. One afternoon. About 400 lines total.

A wrapper that intercepts every API call. A table with enough dimensions to slice the data. A page that aggregates it into something you can act on.

You don't need LangSmith or Helicone or Datadog (those are great if you prefer them). You need the smallest possible instrument that answers "which feature, which tenant, which conversation" — because that's the question your billing dashboard can't answer.

The first time I opened mine, I caught a 100× cost gap between two features I'd been treating as similar. I caught it because I'd built the lens to see it.

Build the lens before you ship. Or — more honestly — build it the day you ship, before you forget.


*I'm building **Provia** — an AI sales chatbot for Arabic-speaking e-commerce stores. Follow for more posts on building AI products from Gaza on a tight budget.*

Top comments (5)

bingkahu (Matteo)

Great idea! Now could you make it for Claude?

Ali Afana

Thanks Matteo! The wrapper itself is model-agnostic — the only things that change are the pricing table and the field names in the response.

For Anthropic's SDK, pricing table looks like:

"claude-haiku-4-5":  { input: 1.00,  output: 5.00  },
"claude-sonnet-4-6": { input: 3.00,  output: 15.00 },
"claude-opus-4-7":   { input: 5.00,  output: 25.00 },

And the response uses usage.input_tokens / output_tokens instead of OpenAI's prompt_tokens / completion_tokens — straightforward swap.

One thing worth adding to the table for Anthropic specifically: cache read/write tokens. Prompt caching gives ~90% discount on cache hits, so if you're not tracking those columns separately you'll undercount savings. Probably worth its own follow-up post.

Are you mixing both providers in one app? That's actually the most interesting case — comparing cost-per-feature across providers from the same dashboard.

bingkahu (Matteo)

Yes, it would be quite interesting if you made a mode where you could compare token usage across multiple suppliers (e.g. Anthropic, DeepSeek, ChatGPT, etc.). You could even create bar charts and pie charts to show your earnings across all models and providers.

Ali Afana

That's a really good angle, Matteo — a unified multi-provider dashboard would basically turn the provider column into the most important dimension in the whole system.

The architecture isn't hard. One api_logs table with provider, model, endpoint, and a normalized cost column. Each SDK gets its own thin wrapper that maps to the same shape. The dashboard groups by whatever you want — provider, model, feature, tenant.

Where it gets interesting is the comparison views you're describing:

  • Pie chart by provider — am I actually diversified, or 95% locked into one vendor?
  • Bar chart by feature × provider — chat on Claude vs GPT vs DeepSeek for the same workload
  • Cost-per-task — same prompt across providers, normalized by output quality

Honestly you've just outlined my next post. I'll build a multi-provider version of this and write it up — would you want me to tag you when it goes live?

bingkahu (Matteo)

Yeah that sounds good! Excited to see the post!