How to actually track your AI / LLM API spend before the bill surprises you

#ai #api #llm #monitoring

You wire up the OpenAI SDK, ship the feature, and it works. Three weeks later someone in finance forwards a screenshot of a bill that tripled and asks what happened. You open the provider dashboard, see one big number, and… that's it. No per-feature breakdown, no idea which change caused it, no way to tell whether it's a bug or just growth.

I've watched this happen at enough teams that I now treat "we can't explain our AI bill" as a predictable stage every company hits about two months after their first LLM feature ships. Here's how to get ahead of it — starting with plain code, then the tradeoffs, then where a dedicated tool actually earns its keep.

Disclosure up front: I work on StackSpend, which does the full version of this. I've kept the first 80% of this post vendor-neutral because most of it you can and should build yourself before you buy anything.

The core problem: the bill is a single number, your costs are not
Provider dashboards give you total spend over time. What you actually need to make decisions is spend broken down by the dimensions you care about:

Per feature — is it the summarizer or the chat assistant that's expensive?
Per customer / tenant — which accounts cost more to serve than they pay?
Per model — how much are you spending on GPT-4-class vs cheaper models?
Per environment — is a runaway staging job quietly burning money?
None of those dimensions exist in the raw bill. You have to attach them yourself, at call time, because after the request is gone the context is gone with it.

Step 1: capture usage at the call site
Every major provider returns token usage in the response (for streamed OpenAI responses you have to opt in with stream_options: { include_usage: true }). The trick is to log it with your own business context attached: the feature name, the tenant, the environment. Here's the pattern in TypeScript with the OpenAI SDK:

import OpenAI from "openai";

const openai = new OpenAI();

// Prices per 1M tokens — keep these in config, they change often.
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o":      { input: 2.50, output: 10.00 },
  "gpt-4o-mini": { input: 0.15, output: 0.60 },
};

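function costUsd(model: string, inTok: number, outTok: number) {
  const p = PRICING[model];
  if (!p) {
    // Don't fail the request, but don't hide the gap either: an unpriced
    // model would otherwise be recorded as free.
    console.warn(`No pricing entry for model "${model}"; recording cost as 0`);
    return 0;
  }
  return (inTok / 1_000_000) * p.input + (outTok / 1_000_000) * p.output;
}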

export async function trackedCompletion(opts: {
  model: string;
  messages: OpenAI.ChatCompletionMessageParam[];
  // your business context — the whole point of this exercise
  feature: string;
  tenantId: string;
  env: string;
}) {
  const res = await openai.chat.completions.create({
    model: opts.model,
    messages: opts.messages,
  });

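  // usage is populated on non-streaming responses like this one
  const u = res.usage;
  if (u) {
    try {
      await recordUsage({
        ts: new Date(),
        model: opts.model,
        inputTokens: u.prompt_tokens,
        outputTokens: u.completion_tokens,
        costUsd: costUsd(opts.model, u.prompt_tokens, u.completion_tokens),
        feature: opts.feature,
        tenantId: opts.tenantId,
        env: opts.env,
      });
    } catch (err) {
      // Tracking must never break the user-facing call.
      console.error("recordUsage failed", err);
    }
  }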

  return res;
}

recordUsage can be as simple as an insert into an llm_usage table:

create table llm_usage (
  id           bigserial primary key,
  ts           timestamptz not null default now(),
  model        text not null,
  input_tokens int  not null,
  output_tokens int not null,
  cost_usd     numeric(12,6) not null,
  feature      text not null,
  tenant_id    text,
  env          text not null
);

That's it. You're now capturing the dimensions the bill will never give you.
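
For completeness, here is one way recordUsage itself might be sketched. The UsageRecord type, the buildUsageInsert helper, the makeRecordUsage factory, and the injected query function are all assumptions made for illustration, not part of any library; only the column names come from the llm_usage table above. The query function is injected so the sketch stays driver-agnostic.

```typescript
// Hypothetical shape of one usage row; fields mirror the llm_usage columns.
interface UsageRecord {
  ts: Date;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  feature: string;
  tenantId: string | null;
  env: string;
}

// Any function that runs a parameterized SQL statement (assumed interface).
type QueryFn = (text: string, values: unknown[]) => Promise<unknown>;

// Build the parameterized INSERT; kept pure so it is easy to unit test.
function buildUsageInsert(r: UsageRecord): { text: string; values: unknown[] } {
  return {
    text:
      "insert into llm_usage " +
      "(ts, model, input_tokens, output_tokens, cost_usd, feature, tenant_id, env) " +
      "values ($1, $2, $3, $4, $5, $6, $7, $8)",
    values: [
      r.ts, r.model, r.inputTokens, r.outputTokens,
      r.costUsd, r.feature, r.tenantId, r.env,
    ],
  };
}

// Returns a recordUsage function bound to a query runner. Insert errors are
// caught and logged so a tracking failure resolves to false instead of throwing.
function makeRecordUsage(query: QueryFn) {
  return async (r: UsageRecord): Promise<boolean> => {
    const { text, values } = buildUsageInsert(r);
    try {
      await query(text, values);
      return true;
    } catch (err) {
      console.error("llm_usage insert failed", err);
      return false;
    }
  };
}
```

With node-postgres, for example, the query runner could be `(text, values) => pool.query(text, values)`. Parameterized values keep tenant and feature names out of the SQL string.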

Step 2: turn rows into answers
Once the data has your context attached, the questions that were impossible become one query each:

-- Cost per feature, last 30 days
select feature, round(sum(cost_usd), 2) as cost
from llm_usage
where ts > now() - interval '30 days'
group by feature
order by cost desc;

-- Which tenants cost the most to serve?
select tenant_id, round(sum(cost_usd), 2) as cost
from llm_usage
where ts > now() - interval '30 days'
group by tenant_id
order by cost desc
limit 20;

This is already more than most teams have. If you stop here, you've solved 60% of the pain for a day of work.

Step 3: alert before the month-end surprise
The whole point is to not find out after the money's gone. A cheap daily anomaly check catches the spike while you can still do something about it:

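-- Today's spend so far vs the trailing 7-day daily average.
-- Today is a partial day, so run this late in the day (or compare yesterday instead).
with daily as (
  select date_trunc('day', ts) as d, sum(cost_usd) as cost
  from llm_usage
  where ts >= date_trunc('day', now()) - interval '7 days'
  group by 1
)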
select
  (select cost from daily where d = date_trunc('day', now())) as today,
  (select avg(cost) from daily
     where d >= date_trunc('day', now()) - interval '7 days'
       and d <  date_trunc('day', now())) as avg_7d;

Wrap that in a cron job that posts to Slack when today > avg_7d * 1.5 and you've got a smoke alarm. Not sophisticated, but it turns a month-end shock into a same-day heads-up.
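
The decision logic inside that cron job might look like the sketch below. The function name spikeAlert, its parameters, and the message wording are illustrative assumptions; the 1.5 multiplier is the one from the text. It returns the message to post, or null when nothing should fire, and it stays quiet when there is no baseline yet.

```typescript
// Decide whether today's spend warrants an alert.
// `today` and `avg7d` are the two columns returned by the query above;
// either can be null (no rows yet today, or no history).
function spikeAlert(
  today: number | null,
  avg7d: number | null,
  multiplier = 1.5,
): string | null {
  if (today === null || avg7d === null || avg7d <= 0) return null; // no baseline
  if (today <= avg7d * multiplier) return null; // within normal range
  const ratio = (today / avg7d).toFixed(1);
  return (
    `LLM spend alert: $${today.toFixed(2)} so far today, ` +
    `${ratio}x the 7-day daily average of $${avg7d.toFixed(2)}`
  );
}
```

Two practical notes: node-postgres returns numeric columns as strings by default, so convert with Number() before calling this, and Slack incoming webhooks accept a JSON body with a text field, so delivering the message is a single HTTP POST to your webhook URL.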

Where the DIY approach starts to hurt
The code above is genuinely worth writing. But there's a predictable point where maintaining it becomes its own project:

Pricing drift. Every provider changes prices and ships new models constantly. That PRICING map becomes a part-time job, and every stale entry silently corrupts your numbers.
More than one provider. Add Anthropic, then a model on Bedrock, then an image model, and each has a different usage shape and pricing model. Your one table becomes five adapters.
Cloud + AI together. Your real cost story isn't just tokens — it's tokens plus the AWS/GCP bill for the infra around them. Stitching those together is a data-engineering task, not a query.
"Which deploy caused this?" The alert tells you today spiked. It can't tell you the PR that shipped this morning is the reason. Answering that means correlating spend with your git history.
Naive anomaly detection. avg * 1.5 fires on every Monday and every marketing campaign. Real detection needs seasonality awareness, or you train everyone to ignore it.
Each of these is solvable. The question is whether spend-tracking infrastructure is the thing your team should be building.
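
To make the naive-anomaly-detection point concrete: a small step up from a flat average is to compare today with the same weekday in previous weeks, so a busy Monday is judged against other Mondays. The sketch below is one illustrative approach, not how any particular tool does it; the DailyCost type, the function names, the four-week lookback, and the use of the median are all assumptions. It still knows nothing about marketing campaigns.

```typescript
// One row per calendar day, e.g. from a query over llm_usage grouped by day.
interface DailyCost {
  date: string; // "YYYY-MM-DD", interpreted as UTC
  cost: number;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Median cost of the same weekday over the previous `weeks` weeks.
// Returns null when there is no history for those dates.
function weekdayBaseline(
  history: DailyCost[],
  today: string,
  weeks = 4,
): number | null {
  const byDate = new Map(history.map((d) => [d.date, d.cost] as const));
  const t = new Date(today + "T00:00:00Z").getTime();
  const samples: number[] = [];
  for (let w = 1; w <= weeks; w++) {
    // Step back exactly w weeks and look up that calendar date.
    const key = new Date(t - w * 7 * 86_400_000).toISOString().slice(0, 10);
    const cost = byDate.get(key);
    if (cost !== undefined) samples.push(cost);
  }
  return samples.length > 0 ? median(samples) : null;
}

// True when today's cost exceeds the weekday baseline by the multiplier.
function isWeekdayAnomaly(
  history: DailyCost[],
  today: DailyCost,
  multiplier = 1.5,
): boolean {
  const base = weekdayBaseline(history, today.date);
  return base !== null && base > 0 && today.cost > base * multiplier;
}
```

The median is used rather than the mean so that one earlier spike on the same weekday does not inflate the baseline.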

The honest pitch
This is where I'll mention what I work on. StackSpend is the managed version of everything above: it ingests usage across OpenAI, Anthropic, and the rest alongside your AWS/Azure/GCP bill, attributes spend per feature/customer/model, keeps provider pricing current so you don't, runs seasonality-aware anomaly detection, and posts alerts to Slack or Teams. It also correlates cost anomalies with the GitHub PRs that shipped around them, so a spike points at the change that likely caused it.

But genuinely — start with the code in this post. If a Postgres table and a Slack cron get you what you need, ship that. Reach for a tool when the five problems above start eating your week, not before.

Takeaways
The provider bill is a single number; your costs are multi-dimensional. Attach your own context (feature, tenant, model, env) at call time — you can't reconstruct it later.
A usage table plus a few group by queries answers 60% of the questions for a day of work.
A daily anomaly check → Slack turns month-end shocks into same-day alerts.
Buy instead of build when pricing upkeep, multi-provider adapters, cloud+AI stitching, deploy correlation, and alert tuning start to cost more than they're worth.
What does your team do today — home-grown tracking, a dedicated tool, or still flying blind? Curious what's working in the comments.
