DEV Community

OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard.

Ali Afana on April 30, 2026

Update (May 4, 2026): A reader (Gary Stupak in the comments) pointed out that Cloudflare AI Gateway supports custom metadata headers (cf-aig-metada...


Syed Ahmer Shah • May 1

It’s wild how much we 'fly blind' with standard billing dashboards until we see a breakdown like this. Identifying a 100x cost gap is a huge win for a lean team. I love the 'boring UI' philosophy—it’s all about the insights, not the fluff. Keep the updates coming, Ali; your 'journey is the content' approach is incredibly high-value for the rest of us.

Ali Afana • May 1

Thanks Syed, this means a lot.
The "lean team" point is exactly right — when you're solo or small, you can't afford to optimize what you can't see. Most of the AI-cost articles I've read assume you have an ops team to build observability later. For us, observability has to be the first thing, not the last, because we're making architectural decisions in real-time based on what the dashboard shows.
The "boring UI" thing is honestly because I don't have time for fluff — I'm building Provia, writing articles, and trying to ship features all from the same kitchen table. The dashboard had to be ugly, fast, and useful. Turns out that's also what people actually want.
Saw you're building Commerza — full-stack from scratch in PHP is the real work. Following back.

Gary Stupak • May 4

Great point. I built something similar for my own app a while back. I usually rely on Cloudflare AI Gateway for this because it offers features like request caching for significant savings, per-request costs, rate limiting, request retries, model fallbacks, and detailed logs. However, having a custom dashboard definitely provides more flexibility for specific needs.

Ali Afana • May 4

Gary — Cloudflare AI Gateway is exactly the kind of comparison I should have addressed in the article. The tradeoff I see: gateway-level tools are great for infra concerns (caching, retries, fallbacks) but they treat all calls as equal. They can't tell you "this 100-token call costs $0.0001 and this 1,820-token call costs $0.02" for the same user-facing feature. That tenant/feature/conversation breakdown has to live closer to your application code, where you have the labels.
So I think the right architecture is probably both: gateway for infra-level wins, custom dashboard for product-level cost attribution. Did you find yourself running both in parallel, or did the Gateway end up being enough for your use case?

Gary Stupak • May 4

Thanks for the detailed response, Ali! You've raised an interesting point about cost attribution.

From my experience with Cloudflare AI Gateway, it actually handles per-request logging quite well. It shows the exact token count and estimated cost for every single call in the logs. Regarding product-level attribution, I found that using the custom metadata headers (cf-aig-metadata) allows you to tag requests with IDs from your app, which bridges the gap between infra-level logs and product-level analytics.

To answer your question, I actually built my own calculator mainly out of curiosity to compare my internal logs with the Cloudflare AI Gateway data. I ran them both in parallel for a while and, to my satisfaction, they matched up perfectly. While the Gateway ended up being sufficient for my production needs, building a custom tool was a great way to verify the accuracy of the data.
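
A rough sketch of the tagging described above: the `cf-aig-metadata` header carries a JSON object of labels from your own app. The helper `buildGatewayHeaders`, the type `CostTags`, and the field names (`tenant_id`, `feature`, `conversation_id`) are hypothetical, and the cap on metadata entries is an assumption to check against Cloudflare's current documentation.

```typescript
// Hypothetical attribution labels; use whatever your app already knows per request.
type CostTags = {
  tenantId: string;
  feature: string;
  conversationId: string;
};

// Assumption: the gateway accepts only a handful of flat metadata entries
// (string, number, or boolean values). Verify the current limit in the docs.
const MAX_METADATA_ENTRIES = 5;

function buildGatewayHeaders(apiKey: string, tags: CostTags): Record<string, string> {
  const metadata: Record<string, string | number | boolean> = {
    tenant_id: tags.tenantId,
    feature: tags.feature,
    conversation_id: tags.conversationId,
  };
  if (Object.keys(metadata).length > MAX_METADATA_ENTRIES) {
    throw new Error("too many metadata entries for one request");
  }
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    // The header value is the JSON-serialized metadata object.
    "cf-aig-metadata": JSON.stringify(metadata),
  };
}

// Usage: these headers would go on the request sent through the gateway.
const exampleHeaders = buildGatewayHeaders("sk-placeholder", {
  tenantId: "store_42",
  feature: "chat",
  conversationId: "conv_abc",
});
```

Because the labels travel with the request, the gateway log entry and the app's own log row can be matched on the same IDs.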

Ali Afana • May 4

Gary — that's genuinely useful, and I have to update my mental model. The cf-aig-metadata header is exactly the missing piece I assumed didn't exist; if it lets you propagate tenant/feature/conversation IDs from your app into the gateway logs, then Cloudflare does solve the attribution problem cleanly.
The honest revised take: my dashboard isn't an alternative to AI Gateway, it's what you build when you don't know AI Gateway has metadata headers. The "I built it because I had to" framing in the article only holds if you're not on Cloudflare's stack. For anyone already on it, your approach (Gateway as source of truth, custom tool for verification) is the better starting point.
Adding a follow-up note to the article. Appreciate the correction.

Gary Stupak • May 4

I appreciate the follow-up, Ali. Just wanted to share some insights from my workflow. Glad you found it helpful!

Ali Afana • May 4

Genuinely. Threads like this are why I keep writing here.

Varsha Ojha • May 1

Nice build!!!! Usage visibility is such an underrated problem. Knowing you spent money is one thing, but understanding where and why is what actually helps you optimize and control costs.

Ali Afana • May 1

Thanks Varsha! "Underrated" is the perfect word for it — most teams instrument after the bill scares them, not before. Mobile observability is the same story I think; you write a lot about scale and architecture, so you've probably seen this exact pattern play out with Firebase or Sentry costs too.

Varsha Ojha • May 1

Yeah, exactly. It usually shows up only after costs spike, and by then the response is already reactive. I've seen the same with Firebase and Sentry, where teams realize too late what’s actually driving usage. Feels like observability should be designed in from day one, not added after the bill hits.

Ali Afana • May 1

That phrase — "designed in from day one" — is exactly the framing I've been trying to land on. Observability is treated like a luxury you earn after shipping, but it's actually the cheapest insurance policy in software. Five extra columns in your logs table at the start cost nothing. Five extra columns retrofitted across a year of production data is a migration nightmare.
The teams that learn this the easy way are the ones who watched it happen at their last job.

Varsha Ojha • May 4

That “cheapest insurance in software” line is spot on.

Most teams only realize the value of observability when they’re already debugging blind. And by then, even a small missing field becomes expensive because nobody wants to touch old logging once production is messy.

I’ve seen the same pattern with AI usage too. If you don’t track prompts, endpoints, users, token patterns, and cost drivers early, every optimization later becomes guesswork.

Honestly, observability feels boring until it saves a release, a budget, or a week of engineering time.

Ali Afana • May 4

Thanks Varsha — the "boring until it saves you" framing is right. I've come around to thinking observability is closer to seatbelts than to feature work: the value compounds quietly until you need it, and then it's all you have. Appreciate the read.

Mykola Kondratiuk • May 2

100x between features you thought were similar is the kind of variance that breaks sprint estimates. seen teams absorb these as 'AI overhead' because there's no per-call drill-down.

Ali Afana • May 2

"AI overhead" is the right name for it. The moment it becomes a line item in the budget, the questioning stops.
The sprint angle is the one I underplayed in the article — costs are bad, but unpredictable costs that look similar to each other on paper are worse. Estimation collapses. You can't say "this feature takes 3 days" when "3 days" might mean $5 or $500.
Curious how the teams you've seen handle it once they spot the variance — per-feature budgets, quotas, or just absorbing the bill?

Mykola Kondratiuk • May 2

yeah once it's budgeted it stops being a question. the variance is still there, it's just invisible until estimation falls apart mid-sprint. that's the harder problem to sell upward.

Ali Afana • May 2

Right — once it's absorbed, the only way to make it visible again is failure. Which is the worst possible time to make the case, because now you're explaining a missed deadline AND asking for tooling budget.
The pitch that works upward, in my limited experience, isn't about cost. It's about predictability. "We can't estimate AI features within an order of magnitude" lands differently than "AI is expensive." The first is a delivery risk. The second is a line item leadership has already accepted.
Are you seeing this play out somewhere specific, or is it pattern recognition across teams?

Mykola Kondratiuk • May 2

predictability framing works because it shifts the ask from 'trust us' to 'here's our signal'. harder to reject a prediction than a budget line.

Ali Afana • May 2

Hey Mykola — appreciated the back-and-forth on my Dev.to article. Your "trust us vs. here's our signal" line is going into something I'm writing. Connecting here too.

PEACEBINFLOW • May 4

The fire-and-forget logging pattern stuck with me — not because it's technically clever, but because it quietly solves a problem that usually gets overengineered to death.

I've seen teams spend weeks wiring up OpenTelemetry, setting up collectors, configuring exporters, only to end up with dashboards nobody looks at because the setup was so heavy it became someone's full-time maintenance burden. Three files and a silent .catch(() => {}) is almost uncomfortably simple by comparison.

What I find myself wondering though: at what scale does fire-and-forget stop being "good enough" and start becoming a blind spot? You mentioned losing maybe 2–3 entries out of thousands. That's nothing when you're tracing cost anomalies. But if someone's monitoring for security signals or abuse patterns, 0.3% data loss might be the exact 0.3% that matters.

Not a criticism of the approach — I think it's the right call for this use case. More just thinking out loud about how the same pattern can be perfectly appropriate for one goal and subtly risky for another, and how easy it is to confuse the two.

Ali Afana • May 4

This is the question that should have been in the article. The "good enough" calculus changes entirely when the data IS the product, not just the instrumentation around it.
The way I think about it now: fire-and-forget is right when (1) you're optimizing for the aggregate, not the individual record, and (2) the cost of slowing down the user-facing path exceeds the cost of any single missed log. Cost tracking matches both — I care about p95 spend per tenant per day, not whether I have every call. Lose 3 logs out of 1,000, the picture barely shifts.
Security and abuse signals invert both conditions. You're often hunting for the one anomalous record, not the distribution. And in many cases, slowing the request path is acceptable (or even desirable — you might want the auth check to block) compared to missing the signal. So the same pattern that's perfectly fine for billing observability would be a serious bug for fraud detection.
The dangerous version is when teams adopt fire-and-forget as a default pattern across all logging because it "worked for cost tracking," and quietly accept silent gaps in security telemetry. That's the failure mode worth naming.
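
The two logging modes contrasted in this exchange can be sketched side by side. This is a minimal illustration, not the article's actual wrapper: `Sink`, `logCost`, and `logSecurityEvent` are hypothetical names, and the sink stands in for whatever persists a row.

```typescript
type LogRow = { tenantId: string; feature: string; costUsd: number };
// Hypothetical sink: anything that persists a row, e.g. a database insert.
type Sink = (row: LogRow) => Promise<void>;

// Cost telemetry: start the write, never await it, swallow any failure.
// Wrapping in Promise.resolve().then(...) also catches a sink that throws
// synchronously, so logging can never break the user-facing path.
function logCost(sink: Sink, row: LogRow): void {
  Promise.resolve()
    .then(() => sink(row))
    .catch(() => {
      // Deliberately ignored: one lost row barely moves a daily aggregate.
    });
}

// Security or abuse telemetry: the caller awaits the write, and a failed
// write rejects so the request can be blocked or retried.
async function logSecurityEvent(sink: Sink, row: LogRow): Promise<void> {
  await sink(row);
}

// Usage: a sink that always fails, standing in for a database outage.
const failingSink: Sink = () => Promise.reject(new Error("db down"));
const sampleRow: LogRow = { tenantId: "store_42", feature: "chat", costUsd: 0.02 };

logCost(failingSink, sampleRow); // returns immediately, failure stays silent
logSecurityEvent(failingSink, sampleRow).catch(() => {
  // Here the failure reaches the caller, who decides what to do with it.
});
```

Keeping the two as separate functions makes the choice explicit at each call site, which guards against fire-and-forget quietly becoming the default for every log.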

Emanuele Fabrizio • May 5

I praise your effort and encourage your continued application. Additionally I would like to use this evidence for a deeper reflection: wasn't this "reality" already known "a-priori" before using any type of chat based AI tool?

I strongly believe that it was the moment I first learned the notion of "tokens" and the fact that no platform was willing to disclose them openly and upfront. That "evidence" left me very critical of the AI era and prompted a deep reflection that led me to refuse to jump on the bandwagon without speaking of the limitations and moral corruption that it fosters.

I compare it with the network traffic billing of 15 years ago, when VPS cost was determined by TB of traffic: the service providers disclosed (and accounted for) every bit of data they billed for.

I invite the young generation to see beyond the "offering" and not to accept any solution provided as "the only available". We had better services and options when we owned software rather than renting it.

Ali Afana • May 5

Emanuele — token counts and tokenizer libraries are publicly documented by every major provider; what's missing isn't transparency at the API level, it's product-level cost attribution inside your own application. That's what the article addresses. Appreciate the read.

Yaniv • May 4

Spot on. The 'fire-and-forget' logging pattern is absolutely non-negotiable here. I see this exact 'blind spend' problem in the SDET and QA automation space all the time. When teams integrate LLMs into their CI/CD pipelines to validate complex API responses or generate dynamic payloads, the costs can spiral instantly without anyone knowing where to look.
When you use AI to generate semantic test data at scale—which is exactly the problem I tackle with my Python library, FixtureForge—you're making hundreds of API calls per test run. Without a granular observability wrapper like the one you built, a single unoptimized prompt or a loop in the pipeline can drain the budget overnight, and you'd have no idea which specific test suite caused it. Catching that 100x variance on day one proves this architecture is a must-have. Brilliant, actionable write-up!

Ali Afana • May 5

Yaniv — the CI/CD angle is one I hadn't connected. Test pipelines with LLMs in the loop are exactly the kind of place where cost can explode silently because nobody's watching the per-test spend, just the per-deployment one. The dashboard pattern would surface a runaway test suite within hours instead of at end-of-month billing.

bingkahu (Matteo) • Apr 30

Great idea! Now could you make it for Claude?

Ali Afana • Apr 30

Thanks Matteo! The wrapper itself is model-agnostic — only thing that changes is the pricing table and the field names from the response.

For Anthropic's SDK, pricing table looks like:

"claude-haiku-4-5":  { input: 1.00,  output: 5.00  },
"claude-sonnet-4-6": { input: 3.00,  output: 15.00 },
"claude-opus-4-7":   { input: 5.00,  output: 25.00 },

And the response uses usage.input_tokens / output_tokens instead of OpenAI's prompt_tokens / completion_tokens — straightforward swap.
One thing worth adding to the table for Anthropic specifically: cache read/write tokens. Prompt caching gives ~90% discount on cache hits, so if you're not tracking those columns separately you'll undercount savings. Probably worth its own follow-up post.
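
A sketch of how the swap and the extra cache columns might look, assuming the table above is in USD per million tokens. The cache multipliers are assumptions (reads at 10% of the input rate, matching the "~90% discount"; writes at 125%), so check them against the provider's current price list. The function name `anthropicCostUsd` is hypothetical.

```typescript
// USD per one million tokens (assumed unit; the thread does not state it).
type Price = { input: number; output: number };

const ANTHROPIC_PRICING: Record<string, Price> = {
  "claude-haiku-4-5": { input: 1.0, output: 5.0 },
  "claude-sonnet-4-6": { input: 3.0, output: 15.0 },
  "claude-opus-4-7": { input: 5.0, output: 25.0 },
};

// Usage fields as they appear on an Anthropic Messages API response.
type AnthropicUsage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Assumptions, not taken from the table above.
const CACHE_READ_MULTIPLIER = 0.1;
const CACHE_WRITE_MULTIPLIER = 1.25;

function anthropicCostUsd(model: string, usage: AnthropicUsage) {
  const price = ANTHROPIC_PRICING[model];
  if (!price) throw new Error(`no pricing entry for model ${model}`);
  const inputPerToken = price.input / 1_000_000;
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  const cacheWrite = usage.cache_creation_input_tokens ?? 0;

  const inputUsd = usage.input_tokens * inputPerToken;
  const outputUsd = (usage.output_tokens * price.output) / 1_000_000;
  const cacheReadUsd = cacheRead * inputPerToken * CACHE_READ_MULTIPLIER;
  const cacheWriteUsd = cacheWrite * inputPerToken * CACHE_WRITE_MULTIPLIER;
  return {
    inputUsd,
    outputUsd,
    cacheReadUsd,
    cacheWriteUsd,
    totalUsd: inputUsd + outputUsd + cacheReadUsd + cacheWriteUsd,
    // Extra cost the cached tokens would have incurred at the full input rate.
    cacheSavingsUsd: cacheRead * inputPerToken * (1 - CACHE_READ_MULTIPLIER),
  };
}
```

Storing `cacheReadUsd` and `cacheSavingsUsd` as their own columns is what lets the dashboard show savings instead of folding them into a lower input total.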
Are you mixing both providers in one app? That's actually the most interesting case — comparing cost-per-feature across providers from the same dashboard.

bingkahu (Matteo) • Apr 30

Yes, it would be quite interesting if you made a mode where you could compare token usage across multiple suppliers (e.g., Anthropic, DeepSeek, ChatGPT). You could even create bar charts and pie charts to show your spending across all models and providers.

Ali Afana • Apr 30

That's a really good angle, Matteo — a unified multi-provider dashboard would basically turn the provider column into the most important dimension in the whole system.
The architecture isn't hard. One api_logs table with provider, model, endpoint, and a normalized cost column. Each SDK gets its own thin wrapper that maps to the same shape. The dashboard groups by whatever you want — provider, model, feature, tenant.
Where it gets interesting is the comparison views you're describing:
Pie chart by provider — am I actually diversified, or 95% locked into one vendor?
Bar chart by feature × provider — chat on Claude vs GPT vs DeepSeek for the same workload
Cost-per-task — same prompt across providers, normalized by output quality
Honestly you've just outlined my next post. I'll build a multi-provider version of this and write it up — would you want me to tag you when it goes live?
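
The architecture outlined above (one normalized row shape, a thin wrapper per SDK, grouping by any dimension) can be sketched roughly as follows. All function and type names here are hypothetical, prices are passed in as parameters and assumed to be USD per million tokens, and the OpenAI-side price in the usage example is a placeholder, not a real rate.

```typescript
// One normalized row shape for every provider (column names are illustrative).
type ApiLogRow = {
  provider: string;
  model: string;
  endpoint: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
};
type TokenPrice = { input: number; output: number }; // USD per 1M tokens (assumed)

function toRow(provider: string, model: string, endpoint: string,
               inputTokens: number, outputTokens: number, price: TokenPrice): ApiLogRow {
  const costUsd = (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
  return { provider, model, endpoint, inputTokens, outputTokens, costUsd };
}

// Thin per-SDK wrappers: each maps its own field names onto the shared shape.
function fromOpenAI(model: string, endpoint: string,
                    usage: { prompt_tokens: number; completion_tokens: number },
                    price: TokenPrice): ApiLogRow {
  return toRow("openai", model, endpoint, usage.prompt_tokens, usage.completion_tokens, price);
}

function fromAnthropic(model: string, endpoint: string,
                       usage: { input_tokens: number; output_tokens: number },
                       price: TokenPrice): ApiLogRow {
  return toRow("anthropic", model, endpoint, usage.input_tokens, usage.output_tokens, price);
}

// Sum cost by any dimension of the row.
function costBy(rows: ApiLogRow[], key: "provider" | "model" | "endpoint"): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of rows) totals[r[key]] = (totals[r[key]] ?? 0) + r.costUsd;
  return totals;
}

// Share of total spend per provider: the data behind the pie chart.
function providerShare(rows: ApiLogRow[]): Record<string, number> {
  const totals = costBy(rows, "provider");
  const grand = Object.keys(totals).reduce((sum, k) => sum + totals[k], 0);
  const share: Record<string, number> = {};
  for (const k of Object.keys(totals)) share[k] = grand === 0 ? 0 : totals[k] / grand;
  return share;
}

// Usage: the OpenAI model name and price are placeholders for illustration.
const demoRows = [
  fromOpenAI("example-openai-model", "/chat", { prompt_tokens: 1000, completion_tokens: 1000 }, { input: 2, output: 8 }),
  fromAnthropic("claude-sonnet-4-6", "/chat", { input_tokens: 1000, output_tokens: 1000 }, { input: 3, output: 15 }),
];
const spendByProvider = costBy(demoRows, "provider");
```

The quality-normalized cost-per-task view is left out on purpose, since it needs a quality score that this row shape does not carry.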

bingkahu (Matteo) • Apr 30

Yeah that sounds good! Excited to see the post!

Pururva Agarwal • May 4

The "100x cost gap" is a familiar pain when fine-tuning or inferencing. We see this acutely with multilingual models. Processing "paracetamol" versus its equivalent brand name, say "क्रोसिन" (Crocin), across 22 Indian languages often hits different tokenization costs and model pathing.

Without a dashboard like yours, pinpointing these subtle cost variations per language, or even per region-specific drug name, becomes impossible. It's not just feature A vs B, but 'lang A' vs 'lang B' for the same feature.

Crucial for managing API spend, especially when mapping complex data like drug interaction graphs across diverse linguistic inputs. I'm building GoDavaii.

Ali Afana • May 4

Pururva — the multilingual angle is one I hadn't thought through. The tokenization variance between scripts is exactly the kind of thing the dashboard would surface as a "cost mystery" (why is this query 4× more expensive than that one?) but you'd never spot the pattern without language as a column.
For Provia I'm dealing with Arabic vs English chat in the same store — already seeing token counts run higher for Arabic responses, but I haven't broken it down by script yet. Going to add a language column this week.
The drug interaction graph use case sounds genuinely hard. Are you finding that certain languages need different model routing entirely, or is it more about predicting per-language cost variance for budgeting?

Sundar Sharma • May 16

The 100x cost gap between features you thought were similar is terrifying. I've been there — shipping something, assuming the cost is roughly the same across features, then finding out one is eating 10x the budget. Without per-call attribution, you're flying blind. OpenAI's dashboard is useless for this. You have to build your own or use something like Cloudflare's metadata headers.

Ali Afana • May 16

Sundar — that "10x budget eat" moment is the one that converts people from "we'll add observability later" to "this should have been day one." Glad the post landed.

Argon Loop • May 21

Your May 4 update on cf-aig-metadata was the most useful part for me. When you propagate tenant, feature, and conversation IDs through the gateway, how do you decide which field stays authoritative for chargeback when retries or async workers replay the same logical request? I keep seeing attribution drift once conversation IDs rotate mid workflow, and I am curious whether you pin billing to request_id, first-hop tenant tag, or a later reconciliation pass.

Ali Afana • May 21

The drift is the symptom of treating conversation_id as a chargeback key when it's actually a UX-level dimension. It rotates because user behavior rotates — new session, second tab, 30-min gap — and none of those should be billing events.

kartikay dubey • May 3

Great work!!

Ali Afana • May 3 • Edited

Thank you 🔥

Matt McKay • Apr 30

Nice!

Ali Afana • Apr 30

👀👀

Mary Queen • May 13

Hello how are you doing

Ali Afana • May 13

Great

Laura Ashaley • May 2

A useful idea—turning raw spending data into a clear dashboard helps users actually understand and control their AI costs.

Ali Afana • May 2

Thanks Laura — the dashboard came out of needing exactly that for myself. Glad it resonated.

Nanduri Ananth Deepak Sharma • May 3

One more thing I thought of was reducing redundant calls by caching the responses for similar prompts. This works for me as I ask the same things again and again 😂

Ali Afana • May 3

Smart move — that's actually a different layer than the per-call optimization. Response caching skips the API entirely on repeats, prompt caching cuts the cost on the context but you still pay for fresh generation. Both win, depending on how deterministic the task is. For dev questions you ask repeatedly, response caching is probably the bigger lever.
What are you using as the cache key — the raw prompt, an embedding similarity check, or something else?
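
For the simplest of those options, a response cache keyed on the normalized raw prompt could look like the sketch below. `createResponseCache` and `normalizePrompt` are hypothetical names, and `callModel` is a synchronous stand-in for the real billable API call, which would be async in practice.

```typescript
// Collapse case and whitespace so trivially different prompts share one key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// `callModel` is a stand-in for the real (async, billable) API call.
function createResponseCache(callModel: (prompt: string) => string) {
  const store = new Map<string, string>();
  const counters = { hits: 0, misses: 0 };

  function ask(prompt: string): string {
    const key = normalizePrompt(prompt);
    const cached = store.get(key);
    if (cached !== undefined) {
      counters.hits += 1; // no API call, so no cost
      return cached;
    }
    counters.misses += 1;
    const answer = callModel(prompt);
    store.set(key, answer);
    return answer;
  }

  return { ask, counters };
}

// Usage with a fake model that counts how often it is actually called.
let billableCalls = 0;
const responseCache = createResponseCache((p) => {
  billableCalls += 1;
  return `answer to: ${normalizePrompt(p)}`;
});
const firstAnswer = responseCache.ask("How do I rebase a branch?");
const secondAnswer = responseCache.ask("  how do I REBASE a branch? ");
```

An exact-match key like this only catches near-identical repeats; an embedding similarity check would catch paraphrases, at the price of occasionally returning an answer to a slightly different question.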

Harjot Singh • May 31

"What you spent, not where" is the exact gap that makes AI cost so scary - the bill is a single scary number with no attribution, so you can't fix what you can't see. A per-feature/per-call breakdown is the first real step to control, nice build.

The pattern your dashboard usually exposes once it's live: a huge share of spend is mechanical work running on a premium model that a cheap one would've handled. That's the thesis behind what I work on - Moonshift (prompt to a shipped SaaS on your own GitHub+Vercel) routes each phase to the right-sized model so the boring 80% never touches the expensive one, and a full build lands ~$3 flat instead of an open-ended bill. Your dashboard is the diagnosis; routing is the cure. Curious if you're planning to add "this call could've run on a cheaper model" hints - that's where attribution turns into savings. (first run's free if you want to poke at it.)

Ali Afana • May 31

Thought about it, but I'm cautious with auto-downgrade hints — the dashboard can see a call ran on the expensive model, it can't see why. Sometimes the boring-looking call is on the premium model because a cheap one already failed it silently and this is the fallback. A hint that says "run this cheaper" without knowing the failure history just trades a visible cost for an invisible quality regression. What I'd trust more: flag calls where the same route runs on mixed models, and let me decide which way to consolidate. Attribution first, prescription only where I have the context to back it.
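
The "flag routes that run on mixed models" check can be sketched as a small grouping pass over the logged calls. The row shape and the function name `routesWithMixedModels` are hypothetical, and the model names in the usage example are placeholders.

```typescript
type CallRow = { route: string; model: string; costUsd: number };

// Returns only the routes that were served by more than one model,
// mapped to the sorted list of distinct models seen on that route.
function routesWithMixedModels(rows: CallRow[]): Record<string, string[]> {
  const modelsByRoute: Record<string, string[]> = {};
  for (const r of rows) {
    const seen = modelsByRoute[r.route] ?? (modelsByRoute[r.route] = []);
    if (seen.indexOf(r.model) === -1) seen.push(r.model);
  }
  const mixed: Record<string, string[]> = {};
  for (const route of Object.keys(modelsByRoute)) {
    if (modelsByRoute[route].length > 1) {
      mixed[route] = modelsByRoute[route].slice().sort();
    }
  }
  return mixed;
}

// Usage: "/chat" is flagged, "/summarize" is not.
const mixedRoutes = routesWithMixedModels([
  { route: "/chat", model: "small-model", costUsd: 0.001 },
  { route: "/chat", model: "large-model", costUsd: 0.02 },
  { route: "/chat", model: "small-model", costUsd: 0.001 },
  { route: "/summarize", model: "small-model", costUsd: 0.002 },
]);
```

The output is a list to review, not a recommendation: it says where models are mixed and leaves the direction of consolidation to whoever knows the failure history.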

Harjot Singh • May 31

That caution is right, and it's the correct line: the dashboard's job is to surface "you spent X here," not to silently swap models behind the user's back. Auto-downgrade as a hidden default erodes trust the moment one downgrade tanks quality and nobody knew. The version I'd trust is observe-and-recommend: show the cheaper model would've handled these calls, let the human flip it. Keep the routing decision explicit and visible. That's how I treat it in Moonshift too, routing is automatic but legible, never a silent quality cut. Show, don't switch behind their back. Good instinct keeping the dashboard honest.

Ali Afana • May 31

Exactly — "show, don't switch" is the whole thing. Legible-but-automatic is the right place to land.

Harjot Singh • May 31

Yeah, that's the crux: the dashboard's job is to make spend legible so the human makes the call, not to silently re-route them into a worse model and erode trust. The moment a tool auto-downgrades behind your back, you stop believing its numbers, and then the whole thing is dead. Legible cost plus a recommendation the user can accept or ignore beats a black-box optimizer every time. I land on the same principle in Moonshift: surface what each step cost and why, then let the person decide where to spend. You're basically building the layer the OpenAI dashboard should have shipped. Are you doing per-request attribution, or rolling it up to the feature/endpoint level so people can see which part of their app is bleeding money?

Ali Afana • May 31

Both — per-request attribution underneath, rolled up to feature/endpoint for the view. The raw rows are there when I need to drill in.

Sol • Jun 5

Ali — your “observability has to be the first thing” framing is exactly where most teams get attribution wrong. The highest-signal pattern is usually one immutable request_id minted at ingress and forwarded through every hop (edge, queue, worker, and model call).

In production, I’ve found two concrete checks that help:
1) enforce a hard invariant: every hop must persist/read the same request_id, and if it goes missing, the event goes to an error path instead of silently billing through.
2) build a tiny side table keyed by request_id with tenant_id, feature_id, route, model, retry_count so provider cost slices can be reconciled back to business slices with deterministic joins.

That catches the most common drift classes people mention in this thread: missing headers, proxy retries reusing the wrong id, and fallback model fan-out.

I can share the exact schema + query pattern if useful.

Ali Afana • Jun 5

This is the right primitive — request_id minted at ingress, not at the worker, is the part people get wrong. The failure I keep hitting isn't a missing id, it's an id minted too late: the edge layer does work (auth, rate-limit lookups, sometimes a cache miss that triggers a call) before the id exists, so that spend never joins back. So my invariant is stricter — the id has to exist before the first billable operation, not just before the model call. The side table keyed on it is exactly what reconciles provider slices to business slices though; retry_count in that table is the field that catches proxy-retry double-billing, which is the drift class that's hardest to see otherwise.
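
The side table and the two invariants discussed above can be sketched as a reconciliation pass. This is an illustration under assumptions, not Sol's schema: the type and function names are hypothetical, and the double-billing rule (more provider charges than `1 + retryCount` recorded attempts) is one possible reading of how `retry_count` would be used.

```typescript
// Side table row, keyed by the request_id minted at ingress.
type RequestMeta = {
  tenantId: string;
  featureId: string;
  route: string;
  model: string;
  retryCount: number;
};
// One billed call as reported by the provider or gateway.
type ProviderCharge = { requestId: string | null; costUsd: number };

function reconcileCharges(sideTable: Record<string, RequestMeta>, charges: ProviderCharge[]) {
  const costByTenantFeature: Record<string, number> = {};
  const chargeCount: Record<string, number> = {};
  const unattributed: ProviderCharge[] = []; // error path, never billed silently

  for (const charge of charges) {
    const id = charge.requestId;
    const meta = id !== null ? sideTable[id] : undefined;
    if (id === null || meta === undefined) {
      unattributed.push(charge); // missing or unknown request_id
      continue;
    }
    const slice = `${meta.tenantId}/${meta.featureId}`;
    costByTenantFeature[slice] = (costByTenantFeature[slice] ?? 0) + charge.costUsd;
    chargeCount[id] = (chargeCount[id] ?? 0) + 1;
  }

  // More provider charges than the attempts the app recorded (1 + retryCount)
  // suggests a proxy retried on its own and the call was billed twice.
  const suspectedDoubleBilling = Object.keys(chargeCount).filter(
    (id) => chargeCount[id] > sideTable[id].retryCount + 1
  );
  return { costByTenantFeature, unattributed, suspectedDoubleBilling };
}

// Usage: req_1 is billed twice with no recorded retry; req_2 has one known retry.
const demoSideTable: Record<string, RequestMeta> = {
  req_1: { tenantId: "store_42", featureId: "chat", route: "/chat", model: "small-model", retryCount: 0 },
  req_2: { tenantId: "store_7", featureId: "search", route: "/search", model: "small-model", retryCount: 1 },
};
const reconciled = reconcileCharges(demoSideTable, [
  { requestId: "req_1", costUsd: 0.02 },
  { requestId: "req_1", costUsd: 0.02 },
  { requestId: "req_2", costUsd: 0.01 },
  { requestId: "req_2", costUsd: 0.01 },
  { requestId: null, costUsd: 0.05 },
  { requestId: "req_9", costUsd: 0.03 },
]);
```

Spend incurred before the id exists would show up in `unattributed`, which is one way to measure how much the "minted too late" problem actually costs.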

Sol • May 31

That split between aggregate health metrics and diagnostic views is the key distinction, and the goal is to keep both layers clean. Aggregate by team, product area, model, and time window to see drift and trends, then drill into request-level records with trace_id, route/model, token counts, cached-token treatment, tool calls, latency, status, and execution-time owner. The free AI Cost Auditor at agentcolony.org/auditor is useful if you want to test this on your own traces before wiring fields into dashboards.

Ali Afana • Jun 1

Solid field list — trace_id and cached-token treatment are the two people skip most. Thanks for reading