Arpit Gupta

Posted on Jun 27

5 Things Your LLM Bill Is Hiding From You (And How to Find Them)

#webdev #startup #ai #devops

We went from $620 to $2,480 in 23 days.

No new features shipped. No traffic spike. Zero error alerts. Deployment logs were clean. Five engineers staring at dashboards that gave us totals and nothing else.

What we had was a receipt. What we needed was a map.

Here are five things hiding inside your LLM bill right now that your monitoring stack almost certainly cannot show you.

1. Which feature is actually driving the spend

Every provider dashboard shows you model-level totals. GPT-4o: $X. Claude: $Y.

That number is useless for debugging.

What you need is feature-level attribution: which product feature triggered each call. In our case the batch report generator was responsible for 74% of total spend. We had been optimising the other two features for two straight weeks because they felt expensive.

Here is what 48 hours of real attribution data looked like:

Feature	Monthly Cost	Share
Batch Report Generator	$1,847	74%
Document Summariser	$421	17%
Inline Suggestion Engine	$212	9%

We had been optimising the wrong two features the entire time.

What to do: Instrument every LLM call with a feature tag at the point of the call. Not in post-processing. Not in a weekly report. At the call itself. The data only means something if it captures what triggered the request.
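
A minimal sketch of what tagging at the call can look like, in Python. Everything in it is an assumption for illustration: `call_llm`, `fake_provider`, `LEDGER`, and the per-token prices are hypothetical names and numbers, not any provider's API. The only point is that the feature tag is a required argument of the call itself.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.0025, "output": 0.01}


@dataclass
class CallRecord:
    feature: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_1K["input"]
                + self.output_tokens * PRICE_PER_1K["output"]) / 1000


LEDGER: list[CallRecord] = []  # stand-in for a real metrics or log sink


def fake_provider(prompt: str) -> dict:
    """Stand-in for a real provider SDK call; returns text plus token usage."""
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 50}


def call_llm(prompt: str, *, feature: str) -> str:
    """Make the call and record the feature tag in the same step."""
    response = fake_provider(prompt)
    LEDGER.append(CallRecord(feature, response["input_tokens"], response["output_tokens"]))
    return response["text"]


def cost_by_feature(ledger: list[CallRecord]) -> dict[str, float]:
    """Sum recorded cost per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for record in ledger:
        totals[record.feature] += record.cost
    return dict(totals)


call_llm("Summarise this document for the reader", feature="document_summariser")
call_llm("Generate the weekly batch report", feature="batch_report_generator")
call_llm("Generate the monthly batch report", feature="batch_report_generator")
print(cost_by_feature(LEDGER))
```

Making `feature` keyword-only and required is the design choice that matters here: forgetting the tag becomes a `TypeError` at the call site rather than a gap in the data.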

2. Which users are unprofitable to serve

This one does not feel like a cost problem at first. It feels like a pricing problem later.

Once we had feature-level attribution running, we rolled it up per user and per plan tier. What came back changed how we run the business:

Plan	Avg Cost to Serve / Month	MRR per Seat	Margin
Starter	$3.20	$49	93% ✓
Growth	$31.00	$49	37% ✓
Enterprise	$89.00	$49	-45% ✗

Our most active users were our most unprofitable users.

Flat pricing made this invisible for 14 months. Per-user attribution made it impossible to ignore in 48 hours.

We repriced Enterprise to usage-based pricing. That conversation with customers was not difficult because the numbers were exact. Per user. Per feature. Per month. Nothing to argue with.

What to do: Roll up cost per user once you have feature attribution running. The unit economics gap only becomes visible at that layer. If you are on flat pricing and your power users are also your heaviest LLM users, there is a real chance you are losing money on your best customers right now.
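
The rollup itself is small once each call carries a user id. A rough sketch, assuming attribution records shaped as `(user_id, plan, feature, cost_usd)` tuples; the record layout, the sample rows, and the function names are all hypothetical. Margin here is defined as a fraction of revenue.

```python
from collections import defaultdict

# Illustrative records: (user_id, plan, feature, cost_usd). The rows are made up.
records = [
    ("u1", "Starter", "inline_suggestions", 1.20),
    ("u1", "Starter", "document_summariser", 2.00),
    ("u2", "Enterprise", "batch_report_generator", 70.00),
    ("u2", "Enterprise", "document_summariser", 19.00),
]
MRR_PER_SEAT = 49.00  # flat price per seat, as in the table above


def cost_per_user(rows):
    """Sum LLM cost per user across all features."""
    totals = defaultdict(float)
    for user_id, _plan, _feature, cost in rows:
        totals[user_id] += cost
    return dict(totals)


def gross_margin(revenue: float, cost: float) -> float:
    """Margin as a fraction of revenue: (revenue - cost) / revenue."""
    return (revenue - cost) / revenue


for user_id, cost in cost_per_user(records).items():
    print(user_id, f"cost=${cost:.2f}", f"margin={gross_margin(MRR_PER_SEAT, cost):.0%}")
```

Be explicit about which margin definition a report uses. With revenue as the denominator, an $89 cost against $49 of revenue works out to roughly -82%; the same $40 loss divided by cost is roughly -45%.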

3. Which service is double-calling your provider

This one is invisible until you track at the service layer.

Our document-processing-service was making compliance calls. Our compliance-service was also making compliance calls downstream on the same document. We were paying twice for the same prompt on the same input. Every single time.

Zero user-facing symptoms. Zero errors. Zero alerts. $180 a month just gone.

Three dimensions matter: feature, user, service. Tracking any single dimension misses the bugs that only show up in the other two. We had one dimension for 14 months and thought we had visibility.

What to do: Tag every call with the originating service name alongside the feature and user. When you break cost down by service you will find overlapping calls that look completely normal in isolation but are duplicates at the system level.
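
One way to surface those overlaps, sketched under assumptions: each logged call carries a timestamp, a service name, the prompt, and a document id, and two calls count as the same work if they share prompt and document within a short window. The field names, the 30 second window, and `find_cross_service_duplicates` are illustrative choices, not a fixed recipe.

```python
import hashlib
from collections import defaultdict

WINDOW_SECONDS = 30  # assumed window for treating two calls as the same work


def fingerprint(prompt: str, document_id: str) -> str:
    """Stable key for 'same prompt on the same input'."""
    return hashlib.sha256(f"{document_id}\x00{prompt}".encode()).hexdigest()


def find_cross_service_duplicates(calls):
    """calls: iterable of dicts with ts, service, prompt, document_id.

    Returns (fingerprint, earlier_service, later_service) tuples where two
    different services sent the same fingerprint within WINDOW_SECONDS.
    Only consecutive calls per fingerprint are compared.
    """
    by_key = defaultdict(list)
    for call in calls:
        by_key[fingerprint(call["prompt"], call["document_id"])].append(call)

    duplicates = []
    for key, group in by_key.items():
        group.sort(key=lambda c: c["ts"])
        for earlier, later in zip(group, group[1:]):
            close = later["ts"] - earlier["ts"] <= WINDOW_SECONDS
            if close and earlier["service"] != later["service"]:
                duplicates.append((key, earlier["service"], later["service"]))
    return duplicates


calls = [
    {"ts": 0, "service": "document-processing-service",
     "prompt": "check compliance", "document_id": "doc-1"},
    {"ts": 4, "service": "compliance-service",
     "prompt": "check compliance", "document_id": "doc-1"},
    {"ts": 9, "service": "compliance-service",
     "prompt": "check compliance", "document_id": "doc-2"},
]
for _key, first, second in find_cross_service_duplicates(calls):
    print(f"duplicate: {first} and {second}")
```

Each call in the sample looks normal on its own. The duplicate only appears when calls are grouped by what was asked and then split by who asked.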

4. Features with a 0% error rate that are bleeding budget

This is the most dangerous category on this list.

A feature that errors gets flagged. A feature that succeeds too often, on a broken trigger, gets nothing.

Our compliance checker ran on every document save. Autosave interval: 30 seconds. 40 enterprise users. That is 4,800 GPT-4o calls per hour. Every working hour. Every working day.

No alert ever fired because nothing was wrong at the response level. Every call succeeded. Every log looked clean. The bug was in the trigger design, not the call itself.

Fix: we moved the compliance check to run only on manual trigger and on document submission.
Result: $1,890 to $190 per month. One line of code. No feature removed. No model downgraded. Zero user impact.
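
For illustration, the shape of that kind of gate, plus the arithmetic behind the 4,800 figure. This is a sketch, not our actual code: the `SaveEvent` names and `should_run_compliance` are hypothetical.

```python
from enum import Enum


class SaveEvent(Enum):
    AUTOSAVE = "autosave"
    MANUAL_CHECK = "manual_check"
    SUBMIT = "submit"


# Only these events justify a paid compliance call; autosave is excluded.
COMPLIANCE_TRIGGERS = {SaveEvent.MANUAL_CHECK, SaveEvent.SUBMIT}


def should_run_compliance(event: SaveEvent) -> bool:
    """Gate the LLM call on the event type instead of running on every save."""
    return event in COMPLIANCE_TRIGGERS


def calls_per_hour(users: int, interval_seconds: int) -> int:
    """Calls generated when every user triggers one call per interval."""
    return users * (3600 // interval_seconds)


print(calls_per_hour(users=40, interval_seconds=30))  # 4800
print([e.value for e in SaveEvent if should_run_compliance(e)])
```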

What to do: Look at call frequency per feature, not just cost per call. A feature that runs 2,000 times a day with a $0.09 average call cost is a $5,400 a month feature. That number only appears when you are rolling up cost by feature over time, not inspecting individual requests.
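
A small sketch of that rollup, assuming one `(feature, cost_usd)` record per call and a 30 day month. The function names and sample records are made up for illustration.

```python
from collections import Counter, defaultdict


def monthly_cost(calls_per_day: float, avg_cost_per_call: float, days: int = 30) -> float:
    """Project monthly spend from call frequency and average cost per call."""
    return calls_per_day * avg_cost_per_call * days


def frequency_report(records, days_observed: int):
    """records: iterable of (feature, cost_usd), one entry per call."""
    counts = Counter()
    spend = defaultdict(float)
    for feature, cost in records:
        counts[feature] += 1
        spend[feature] += cost

    report = {}
    for feature, n in counts.items():
        per_day = n / days_observed
        avg = spend[feature] / n
        report[feature] = {
            "calls_per_day": per_day,
            "avg_cost": avg,
            "projected_monthly": monthly_cost(per_day, avg),
        }
    return report


# The example from the text: 2,000 calls a day at $0.09 each.
print(round(monthly_cost(2000, 0.09), 2))

# A cheap call that runs often can outspend an expensive call that runs rarely.
sample = [("compliance_checker", 0.09)] * 4 + [("document_summariser", 0.30)]
report = frequency_report(sample, days_observed=1)
for feature, row in report.items():
    print(feature, round(row["projected_monthly"], 2))
```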

5. The layer your monitoring stack does not reach

This one took us the longest to understand.

We had Datadog. We had the OpenAI usage dashboard. We had CloudWatch. All of them answered one question: how much.

Nobody was answering which feature, which user, which service.

Those are completely different questions. Infrastructure monitoring watches infrastructure. It knows a request succeeded. It has no concept of which product feature triggered it, which customer caused it, or whether that success was profitable given your pricing.

The gap is not about dashboards or visualisations. It is about where in the stack the data gets captured. You need instrumentation sitting between your application code and the provider API, tagging every call at the moment it happens with what triggered it.

Standard monitoring tools do not reach that layer. That is not a criticism of those tools. They were not built for it. But if you are running LLM features in production and relying only on infrastructure monitoring, you have blind spots that look exactly like a system working correctly.

What to do: Ask yourself one question. Can you answer this in under 60 seconds:

Which feature is your most expensive to run, for which users, and is that number healthy for your unit economics at your current pricing?

If you would have to dig for any part of that answer, the risk is not in your monitoring. It is in the layer your monitoring does not reach.

What we used

After 23 days of climbing bills and wrong guesses, a teammate dropped CostReveal in our Slack. The SDK wraps your existing provider calls and tags every call by feature, service, and user. One dashboard surfaces all three dimensions, with real-time budget alerts that fire before the bill arrives.
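
The idea behind an alert that fires before the bill arrives can be sketched in a few lines. This is not CostReveal's API or logic; it is a generic linear run-rate projection with hypothetical function names, a made-up budget, and an arbitrary example date.

```python
import calendar
from datetime import date


def projected_month_end(spend_to_date: float, today: date) -> float:
    """Linear projection: assume the daily run rate so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alert(spend_to_date: float, budget: float, today: date, threshold: float = 1.0):
    """Return a message if projected spend exceeds threshold * budget, else None."""
    projected = projected_month_end(spend_to_date, today)
    if projected > budget * threshold:
        return f"projected ${projected:,.0f} exceeds budget ${budget:,.0f}"
    return None


# $900 spent by day 10 of a 30 day month projects to $2,700.
print(budget_alert(spend_to_date=900.0, budget=1500.0, today=date(2025, 6, 10)))
```

Running the same check per feature or per service, rather than on the total, is what points the alert at the thing that changed.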

Setup took one evening. Real data showed up in 48 hours. Both the autosave bug and the double-calling service bug surfaced within 72 hours of instrumentation.

Docs at docs.costreveal.com if you want to go straight to setup.

Total spend is a receipt. Attribution is a map.

We had the receipt for 14 months before we got the map.

Have you found a silent cost bug like this? A feature working perfectly and quietly draining budget with zero alerts? Drop it in the comments. Genuinely curious how common this pattern is.

Top comments (9)

Nazar Boyko • Jun 27

That Enterprise row sitting at -45% margin is the one I'd stare at longest, because the fix isn't only a pricing change. Your heaviest users are usually also your loudest advocates, the ones writing the case studies and sending referrals, so moving them to usage based can quietly cost you more than the -45% if a few of them go quiet. The attribution work you describe is what makes that call safe though, since you can finally see which power users are worth keeping at a loss and which aren't. Did you watch retention at all after the reprice, or was the margin math enough to justify it on its own?

Arpit Gupta • Jun 27

You are pointing at the tension we did not resolve cleanly, and I want to be honest about that.

The attribution data told us who was expensive. It did not tell us who was valuable. Those are different lists and we conflated them longer than we should have.

What saved us was breaking the Enterprise cohort into two groups before touching pricing. One group was high cost and high expansion, the accounts that had grown seat count, sent referrals, or were actively using integrations we wanted to showcase. The other group was high cost and flat, same seats for 8 months, no referrals, no expansion signal.

We moved the flat group to usage-based pricing first. We kept the expansion group on a grandfathered rate with a private roadmap conversation. Only one account in the flat group churned, and they had not logged in for six weeks before the reprice, so the signal was already there.

Retention held because the attribution data let us make a surgical cut instead of a broad one. If we had repriced the entire Enterprise tier at once the way most companies do it, I think your concern plays out exactly as you described.

On your question about whether the margin math was enough on its own: honestly, no. The margin math told us we had to act. The attribution data told us where it was safe to act. Without both we would have been guessing at which relationships to protect.

Did you go through a reprice like this? Curious how you handled the advocate risk specifically.

Mike Czerwinski • Jun 27

Point 4 is the one that generalizes past cost. A feature with a 0% error rate bleeding budget is a success signal at the response layer covering a failure at the trigger layer. The call returned 200, every log was clean, and that is exactly why nothing fired. The bug was never in the call. It was in who got to decide the call should happen.

I keep hitting the same shape outside billing. A check that runs from inside the thing it is checking will always report healthy, because the layer that would raise the alarm is the same layer that closed the request. Your compliance checker on every autosave is that: the trigger authored its own justification, and no observer sat outside it at the moment of the decision.

So the fix is not better dashboards, it is moving the decision out of the loop that benefits from saying yes. Manual trigger worked because it put a bearer outside the call. The number only became visible once something other than the feature itself got to author whether the call was warranted.

Total spend is a receipt, attribution is a map, agreed. I would add one line: the map only catches the bug when the thing drawing it is not the thing being mapped.

Arpit Gupta • Jun 27

This is the sharpest reframe of the problem I have read.

"The trigger authored its own justification" is exactly it. The compliance checker had no external observer at the moment of the decision. It called itself warranted 4,800 times an hour and nothing in the stack had standing to disagree because everything downstream only saw a clean 200.

Your generalisation holds past billing too and that is what makes it uncomfortable. Any system where the component that decides to run is also the component that reports on whether it should have run is structurally blind to this failure mode. The health check that checks itself. The retry logic that logs its own retries as successful attempts. Same shape.

The line you added is the one I wish was in the post. The map only catches the bug when the thing drawing it is not the thing being mapped. That is the actual precondition for attribution to work at all. We got lucky that the SDK sat outside the feature entirely. If we had instrumented from inside the compliance checker itself the numbers would have looked identical whether it was running correctly or pathologically.

What surface are you hitting this on outside billing? Curious whether the pattern shows up more in event-driven systems or request-response architectures.

Mike Czerwinski • Jun 27

Event-driven, and not by accident. Request-response ships with an external observer built in: the caller blocks on the response and inspects it, so there is a party with standing to disagree at the moment of the decision. Event-driven severs that. The emitter fires and nothing waits on the outcome, so decider-equals-reporter is the default state, not the failure mode. Where I keep hitting it: queue workers that mark their own job done, webhook handlers that 200 on receipt rather than on effect, schedulers and autoscalers that act and emit the only record of whether they should have. The retry logic logging its own retries as successes is the canonical one. So I would not say it shows up more in one architecture. I would say request-response hands you an observer for free and event-driven makes you build one back. The bug is identical in both. The difference is whether standing-to-disagree exists by default or has to be installed.

Jackson • Jun 27

The part that still surprises me is how long the double-calling service bug sat there. Three microservices, all doing health checks on logs, and none of them knew the others existed. What layer are you all currently tracking at: feature, service, user, or just provider totals?

Arpit Gupta • Jun 27

Provider totals for the longest time, which is exactly why it sat there undetected.

The shift that actually changed things for us was service-level tagging. Once you break cost down by which microservice originated the call, duplicate calls stop looking like normal usage and start looking like an anomaly in the rollup. Two services, same document, same prompt, same 30-second window. That pattern is invisible at the feature layer and completely invisible at the provider-total layer.

The order that worked for us: feature first because it surfaces the biggest dollar gaps fastest, then service because it catches the architectural bugs, then user because that is where the pricing conversation starts.

VoltageGPU • Jun 27

That kind of cost inflation is worrying, especially if it's coming from behind the scenes. I've seen similar issues when GPU clusters auto-scale without proper cost guardrails, especially with LLMs that can silently trigger more compute than expected. If you're using something like VoltageGPU, it's crucial to track not just API usage, but also actual GPU hours and memory consumption.

Arpit Gupta • Jun 28

The auto-scaling pattern is the same shape. Trigger runs, compute spins, everything reports healthy, bill reflects a decision nobody consciously made.

The gap we kept hitting was that infrastructure monitoring tells you the cluster scaled. It cannot tell you which feature caused the scaling or which tenant triggered it. Those are the numbers that change your pricing conversation.

Are you attributing GPU hours back to specific features or tenants or still rolling up at the cluster level?