Milo Antaeus

Why your LLM invoice jumped 4x last month: a per-task forensic read

A Vantage analysis in April 2026 said the per-token price is no longer the lever. The number of tokens per task is. Fortune reported in May that Microsoft itself is now exposing this in earnings calls. Goldman's most recent forecast: a 24x increase in token consumption by 2030, driven almost entirely by agentic workloads.

The infrastructure vendors (LangSmith, Helicone, Portkey, Langfuse) sell you a dashboard. The dashboard is fine. It will not tell you that the line item you should be angry about is the one your observability stack is calling a "successful tool call."

This is the angle I want to put in front of anyone who has been handed a Q2 invoice and felt something was off. It is not a vendor comparison. It is a forensic read you can do in an afternoon.

The shape nobody's looking at

Most teams instrument three things:

  • the LLM call (prompt tokens, completion tokens, latency)
  • the tool call (which function, what arguments, what returned)
  • the cost (a dashboard chart, sometimes tied to user or feature)

All three are present in the four popular observability platforms, and all four mark the call green when the tool returned a 200. But a 200 is not the outcome. The outcome is whether the world matched intent.

The two cheapest signals to add — and the two that catch the worst cost leaks — are the ones almost no stack ships out of the box:

  1. An intent line before every side-effecting tool call. Plain English. "Send a 14-day follow-up email to Acme about their May invoice." If you cannot read this line in your log archive, you have no idea what your agent was trying to do. When the cost jumps, the intent line is what tells you whether the agent was just chatty or whether it was running a loop in the dark.
  2. An outcome assertion line after every side-effecting tool call. Not "200 OK from SendGrid" — the business outcome. "Acme's invoice was actually marked paid in the ledger." A green 200 from an email API does not mean the customer read the email. A 200 from a Stripe call does not mean the subscription moved. This is the line that catches the 4x jump: 4x is almost always "the agent did the same work 4 times because none of the first three asserted."
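To make the two lines concrete, here is a minimal Python sketch of a wrapper that writes the intent before a side-effecting call and the outcome assertion after it. Everything in it is illustrative: `call_with_intent`, the in-memory `log` list, and the fake `ledger` are assumptions for the example, not part of any observability product.

```python
from typing import Callable


def call_with_intent(intent: str, tool_name: str, tool: Callable[[], object],
                     assertion: str, outcome_check: Callable[[], bool],
                     log: list) -> bool:
    """Log the intent, run the tool, then log whether the business outcome held."""
    # 1. Intent line, written BEFORE the side effect so it survives a crash.
    log.append({"kind": "intent", "tool": tool_name, "intent": intent})

    # The side-effecting call. Its return value is deliberately not trusted.
    tool()

    # 2. Outcome assertion line: checks the state of the world, not the status code.
    passed = bool(outcome_check())
    log.append({
        "kind": "outcome",
        "tool": tool_name,
        "outcome_assertion": assertion,
        "outcome": "pass" if passed else "fail",
    })
    return passed


# Usage with a fake ledger: the email API "succeeds" but the invoice is still unpaid.
ledger = {"acme-may": "unpaid"}
log = []
ok = call_with_intent(
    intent="Send a 14-day follow-up email to Acme about their May invoice",
    tool_name="send_email",
    tool=lambda: {"status": 200},  # hypothetical stand-in for the real email call
    assertion="ledger.invoice_marked_paid(acme, may)",
    outcome_check=lambda: ledger["acme-may"] == "paid",
    log=log,
)
```

The point of the design is that `ok` comes from `outcome_check`, not from the tool's response, so a green 200 cannot mark the step as done on its own.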

A real shape, anonymized

A founder in Q1 2026 sent me a session log. The agent had been live for nine days. Total spend: $11,400. Average task: 2,800 tokens. His stack was Helicone. The dashboard said everything was fine. Tasks per minute: steady. Cost per task: steady. p95 latency: under 4s.

The forensic read took about 40 minutes. Three things were true at the same time:

  • 11% of the agent's tool calls had no outcome assertion. They were emails, CRM updates, calendar writes — all the things that return a 200 whether or not they did the work.
  • 4.2% of the agent's tasks had retried the same tool call three or more times. Helicone called this "successful retries" because each individual call returned 200. The agent had been silently looping.
  • The retry pattern alone accounted for $4,800 of the $11,400. That is the 4x line on the invoice.

None of this was visible in the dashboard. It was visible in the raw log archive. The fix took one engineer a day: add the intent line and the outcome assertion line, then a 6-line check that asserts the outcome before the next step runs.
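The founder's actual check is not reproduced here, so what follows is only a sketch of what a gate like that might look like in Python: each step runs, its outcome is asserted, and the next step does not start if the assertion fails. The names `run_gated` and `OutcomeNotMet` and the toy steps are assumptions made for illustration.

```python
class OutcomeNotMet(RuntimeError):
    """Raised when a step's tool returned but its business outcome did not hold."""


def run_gated(steps):
    """Run (name, action, outcome_check) steps in order; stop at the first unmet outcome."""
    completed = []
    for name, action, outcome_check in steps:
        action()
        if not outcome_check():
            # Fail loudly instead of letting the agent carry on (or silently retry).
            raise OutcomeNotMet(f"step {name!r} returned but its outcome did not hold")
        completed.append(name)
    return completed


# Usage: the calendar write "succeeds" without doing the work, so the email never runs.
crm = {}
calls = []
steps = [
    ("update_crm", lambda: crm.update(stage="contacted"),
     lambda: crm.get("stage") == "contacted"),
    ("write_calendar", lambda: calls.append("calendar"), lambda: False),
    ("send_email", lambda: calls.append("email"), lambda: True),
]
try:
    run_gated(steps)
    failure = None
except OutcomeNotMet as exc:
    failure = str(exc)
```

Whether a failed assertion should halt the task, trigger one bounded retry, or page a human is a policy choice; the sketch only shows the halt case.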

What an afternoon looks like

You do not need a vendor. You need a one-line JSONL append to every side-effecting tool call:

{"ts": "...", "step_id": "...", "intent": "Send 14-day follow-up to Acme about May invoice", "tool": "send_email", "args_hash": "...", "outcome_assertion": "ledger.invoice_marked_paid(acme, may)", "outcome": "pass"}
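One way to produce that record, sketched in Python with only the standard library. The field names follow the example record; the helper names (`args_hash`, `build_record`, `append_record`), the 16-character hash truncation, and the choice of SHA-256 over canonical JSON are assumptions for illustration.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def args_hash(args: dict) -> str:
    """Stable short hash of the tool arguments (key order does not matter)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def build_record(step_id, intent, tool, args, outcome_assertion, passed) -> dict:
    """Assemble one log record with the same fields as the example line."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "step_id": step_id,
        "intent": intent,
        "tool": tool,
        "args_hash": args_hash(args),
        "outcome_assertion": outcome_assertion,
        "outcome": "pass" if passed else "fail",
    }


def append_record(path: str, record: dict) -> None:
    """Append the record as a single JSON line (JSONL)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage: write two records to a throwaway file and read them back.
record = build_record(
    step_id="step-001",  # hypothetical ID
    intent="Send 14-day follow-up to Acme about May invoice",
    tool="send_email",
    args={"to": "billing@acme.example", "template": "followup_14d"},
    outcome_assertion="ledger.invoice_marked_paid(acme, may)",
    passed=True,
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "logs.jsonl")
    append_record(path, record)
    append_record(path, record)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
```

Hashing the arguments rather than logging them keeps customer data out of the archive while still letting you spot two calls with identical arguments.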

Two weeks of this in your log archive gives you a forensic surface. Three queries tell you where the money is going:

# 1. How many side-effecting calls had NO outcome assertion?
# -c prints one record per line, so wc -l counts records, not pretty-printed lines.
jq -c 'select(.tool != null and .outcome_assertion == null)' logs.jsonl | wc -l

# 2. Which task IDs retried the same tool 3+ times?
# Assumes step_id identifies the task step and is reused across its retries.
jq -r 'select(.tool != null) | "\(.step_id) \(.tool)"' logs.jsonl \
  | sort | uniq -c | awk '$1 >= 3' | sort -rn | head

# 3. Per task: tokens spent vs outcome asserted?
# Assumes each record also carries usage.total_tokens; missing values count as 0.
jq -s 'map(select(.step_id != null))
  | group_by(.step_id)
  | map({step_id: .[0].step_id, tokens: (map(.usage.total_tokens // 0) | add), asserted: (map(.outcome_assertion != null) | any)})
  | sort_by(-.tokens) | .[0:10]' logs.jsonl

The third query is the one that prints the worst 10 tasks by token spend. In the audit above, the top three were retries of the same CRM write because the assertion had been missing. Removing the retry pattern would have saved roughly 42% of the total bill ($4,800 of $11,400).
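A savings figure like that can be estimated straight from the archive. The Python sketch below treats every repeat of the same step, tool, and arguments as retry spend and reports its share of total tokens. It uses tokens as a stand-in for dollars and assumes each record carries `usage.total_tokens` and that retries reuse the same `step_id` and `args_hash`; both are assumptions about your logging, not something the record format guarantees.

```python
def retry_share(records) -> float:
    """Fraction of total tokens spent on repeat calls (second and later) of the
    same (step_id, tool, args_hash) combination."""
    seen = set()
    total = 0
    repeated = 0
    for rec in records:
        tokens = (rec.get("usage") or {}).get("total_tokens", 0)
        total += tokens
        key = (rec.get("step_id"), rec.get("tool"), rec.get("args_hash"))
        if key in seen:
            repeated += tokens  # this call repeats one we have already paid for
        else:
            seen.add(key)
    return repeated / total if total else 0.0


# Usage with made-up records: one CRM write issued four times, one email sent once.
def rec(step_id, tool, h, tokens):
    return {"step_id": step_id, "tool": tool, "args_hash": h,
            "usage": {"total_tokens": tokens}}

sample = [rec("s1", "crm_write", "h1", 1000) for _ in range(4)]
sample.append(rec("s2", "send_email", "h2", 1000))
share = retry_share(sample)  # 3000 repeated tokens out of 5000 total
```

This is an upper bound on what you could save: some repeats are legitimate retries after a real failure, which is exactly what the outcome column lets you separate out.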

The angle for engineering leaders

The reason this matters in 2026 specifically: the per-token price is not the lever. Token consumption per task is. And the 2026 failure shape is the agent quietly doing the work 3-4 times because the assertion layer is missing. Every agent framework ships the call envelope. Almost none ships the assertion. The gap is the human-read layer, not the tooling.

If you want a deeper read of your own log archive — what the worst-costing shape is, what the smallest fix is, what you can do in a day — the LLM Bill Triage deep report is $299, delivered within five business days, and ends in a one-page "what to do on Monday" prescription. The first 10 minutes of the read are free in the audit script above; the rest is pattern-matching across 30+ production archives I have walked through since Q1.

The line on your last invoice is telling you something. You just need the right two columns of your log archive to read it.
