Milo Antaeus

Posted on Jun 5 • Originally published at miloantaeus.com

What cost per successful task actually costs in 2026 (and the 4-line shell check that finds the leak)

#ai #agents #llm #cost

A $300/month pilot became a $215,000 production run. Same model. Same prompts. The only thing that changed was the call pattern. Predict / Medium documented the case in May 2026: a customer-support agent whose average turns per ticket went from 1.3 to 9.3 in production. No code change, no model upgrade, no deliberate regression. The lever wasn't the model. It was tokens per task.

That is the single biggest accounting error I see in 2026 agent bills. Teams budget on cost per token. Their actual variable cost is cost per successful task — and the ratio between the two is whatever the system quietly does between the user message and the success state.

This article walks through the math, the three shapes the gap takes in real bills, and a 4-line shell check you can run against your own usage log tonight. No framework, no vendor pitch. Just the audit and a fix recipe.

Why "per token" stopped being the lever

Two sources published in 2026 say the same thing from different angles:

Tom's Hardware (May 23, 2026) named the pattern: tokenmaxxing — agentic workloads eating 1000x more tokens per call than chat workloads, with Microsoft, Meta, and Amazon publicly pulling back from agentic AI on cost grounds.
Goldman Sachs (March 2026) projected 24x token consumption growth by 2030, driven primarily by agentic workloads rather than per-token price drops.

Per-token prices have fallen for two years straight. The reason your bill went up is not that tokens got more expensive. It is that each successful task is consuming more tokens than it did a quarter ago, often without you knowing.

The unit of accounting has changed. If your dashboard still shows you cost-per-token, you are looking at the wrong axis.
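To make the axis shift concrete, here is a minimal sketch of the arithmetic. The helper name cost_per_success and the inputs ($0.01 per 1K tokens, 4,000 tokens per call, an 80% success rate) are illustrative assumptions, not figures from any bill; only the 1.3 and 9.3 calls-per-task values echo the case at the top. It also assumes failed tasks are still paid for, so spend per attempted task is divided by the success rate.

```shell
# cost_per_success PRICE_PER_1K_TOKENS TOKENS_PER_CALL CALLS_PER_TASK SUCCESS_RATE
# Prints USD per successful task. All inputs below are illustrative assumptions.
cost_per_success() {
  awk -v p="$1" -v tok="$2" -v calls="$3" -v ok="$4" 'BEGIN {
    per_task = p / 1000 * tok * calls   # spend per attempted task
    printf "%.4f\n", per_task / ok      # failed tasks are paid for by the successful ones
  }'
}

cost_per_success 0.01 4000 1.3 1.0   # pilot-like call pattern
cost_per_success 0.01 4000 9.3 1.0   # production-like call pattern, same token price
cost_per_success 0.01 4000 9.3 0.8   # same, with 1 in 5 tasks failing
```

The per-token price is identical in all three calls. Only calls per task and the success rate move, and those are the two numbers a cost-per-token dashboard never shows.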

The three shapes the gap takes

Across the production traces I've audited in the last 90 days, the silent token explosion between pilot and production takes one of three shapes. Sometimes two, rarely all three.

Shape 1 — Recursive self-correction loops

The agent calls a tool. The tool returns ambiguous output. The agent calls the tool again to "verify." Three more calls, same tool, same ambiguity. By the time it commits, the user has been billed for 5-7 paid calls for every 1 that was intended.

A support-agent trace from a SaaS team I audited: median 6 tool calls per resolved ticket, 4 of which were re-checks of the same first call's output. Per-call cost: $0.04. Per-resolved-ticket cost: $0.24. The team was billing the user $0.30 per resolution. Margin: 20%. The same workload at the same per-token cost, on 1.3 calls per ticket, would have been $0.05 per resolution and 83% margin.

The math between the two states is not subtle. The detection is.
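For reference, the two states work out as follows. This is a sketch of the arithmetic only: margin is a hypothetical helper name, and the inputs are the figures quoted from the trace ($0.30 billed per resolution, $0.04 per call, 6 versus 1.3 calls per ticket).

```shell
# margin PRICE PER_CALL_COST CALLS_PER_TASK
# Prints cost per task and gross margin for one resolved ticket.
margin() {
  awk -v price="$1" -v per_call="$2" -v calls="$3" 'BEGIN {
    cost = per_call * calls
    printf "cost_per_task=%.3f margin_pct=%.1f\n", cost, (price - cost) / price * 100
  }'
}

margin 0.30 0.04 6     # the audited trace: 6 calls per resolved ticket
margin 0.30 0.04 1.3   # the same workload at 1.3 calls per ticket
```

The second line prints the unrounded values (0.052 and 82.7), which the text above rounds to $0.05 and 83%.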

Shape 2 — Streaming-abort-unhonored retries

Most inference clients ship with a default retry policy. If a streaming connection drops at the 8,000th token, the client retries, and on the retry you are billed for the full output again. OpenAI-compatible APIs expose a stream_options.include_usage request option that reports token usage in a final stream chunk, but it is off by default. So the first stream's tokens are billed even though the response never lands, the second stream's tokens are billed when the response finally does, and neither shows up in your own per-stream accounting.

I've seen this account for 18-32% of total spend on streaming-heavy workloads. The fix is a single config flag. The diagnosis is invisible without per-stream token accounting.
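Per-stream accounting can be as small as the sketch below. It assumes a hypothetical log schema (task_id, attempt, status, output_tokens) in which your client records whether each stream completed or was aborted. Neither the schema nor the wasted_share name comes from any provider's export, and the sample rows are made up.

```shell
# Reads a CSV on stdin with the ASSUMED columns: task_id,attempt,status,output_tokens
# status is "completed" or "aborted" (your client has to log this itself).
wasted_share() {
  awk -F, 'NR>1 { total += $4; if ($3 == "aborted") wasted += $4 }
           END  { if (total > 0)
                    printf "wasted_tokens=%d total_tokens=%d share_pct=%.1f\n",
                           wasted, total, wasted / total * 100 }'
}

# Made-up sample: one stream dropped at 8,000 tokens and was retried in full.
wasted_share <<'EOF'
task_id,attempt,status,output_tokens
t1,1,aborted,8000
t1,2,completed,12000
t2,1,completed,12000
EOF
```

If the share is near zero, Shape 2 is probably not your leak. If it is in double digits, the retry policy is the first place to look.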

Shape 3 — Agent-of-agents recursion

The biggest shape in 2026, and the hardest to detect. A "manager" agent dispatches sub-agents, and each sub-agent calls back to the manager for routing. Every hop is billed twice: once as the manager's input (which includes the sub-agent's full output) and once as the sub-agent's input (which includes the manager's prompt, which by now includes a re-summary of all sub-agents' outputs).

The token count grows super-linearly with agent depth. At 3 levels of nesting, the inner agent's effective per-call cost is 6-9x the cost of the same call made directly. At 4 levels, it can be 15x.

The Predict / Medium worst case — 717x the pilot cost — was this shape. The team was running a 4-level orchestration where the manager's prompt was being re-serialized for every sub-agent's every call.
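If your traces record nesting depth, you can measure this multiplier directly instead of guessing. The sketch below assumes a hypothetical export with columns task_id, depth, input_tokens, where depth 1 is a direct call. The depth column, the depth_ratio name, and the sample rows are all assumptions for illustration, not data from the cases above.

```shell
# Reads a CSV on stdin with the ASSUMED columns: task_id,depth,input_tokens
# Prints depth, mean input tokens per call, and the ratio to depth-1 calls.
depth_ratio() {
  awk -F, 'NR>1 { n[$2]++; sum[$2] += $3 }
           END  { if (!(1 in n)) exit          # no depth-1 baseline to compare against
                  base = sum[1] / n[1]
                  for (d in n)
                    printf "%d,%.0f,%.1f\n", d, sum[d] / n[d], (sum[d] / n[d]) / base }' |
    sort -t, -k1,1n
}

# Made-up sample rows.
depth_ratio <<'EOF'
task_id,depth,input_tokens
a,1,1000
a,2,3000
a,3,7000
b,1,1000
b,3,8000
EOF
```

A ratio that climbs faster than the depth itself is the signature to look for: it means each level is re-paying for context that the levels above it already paid for.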

The 4-line check you can run tonight

You don't need a vendor framework. You need a usage log and awk.

# Replace $USAGE with your provider's CSV or a trace dump from LangSmith/Helicone/Langfuse.
# This assumes columns: task_id, call_index_in_task, input_tokens, output_tokens, model, ts
# IN_RATE and OUT_RATE are placeholders for your USD price per token; mixed-model logs need a per-model lookup on $5.
awk -F, -v in_rate="$IN_RATE" -v out_rate="$OUT_RATE" 'NR>1 {calls[$1]++; in_t[$1]+=$3; out_t[$1]+=$4; cost[$1]+=$3*in_rate+$4*out_rate}
         END {for (t in calls) printf "%s,%d,%.0f,%.0f,%.4f\n", t, calls[t], in_t[t], out_t[t], cost[t]}
         ' "$USAGE" | sort -t, -k2,2nr | head -50

What you get back: the 50 most call-heavy tasks in your workload, with their actual cost per task.

Three numbers to look at:

Calls per task. The mean across the top 50. If it's above 2.5, Shape 1 is active.
Total input tokens / total output tokens. Above 6.0 across the top 50, Shape 2 (streaming-abort) is likely active.
Spread of cost per task. If the standard deviation across the top 50 is more than 3x the mean, Shape 3 (agent-of-agents) is almost certainly active.

That's it. 4 lines. No vendor SDK. No onboarding call. No "talk to sales."
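If you would rather not eyeball the three numbers, a second awk pass can compute them. This is a sketch that assumes the top-50 output has no header row and that its fifth column is cost per task, as described above. The three_numbers name is hypothetical, the standard deviation is the population form, and the two sample rows are made up.

```shell
# Reads the top-50 CSV on stdin: task_id,calls,input_tokens,output_tokens,cost_per_task
three_numbers() {
  awk -F, '{ n++; calls += $2; in_t += $3; out_t += $4; c += $5; c2 += $5 * $5 }
           END { if (n == 0 || out_t == 0 || c == 0) exit 1   # nothing to divide by
                 mean = c / n
                 var = c2 / n - mean * mean                  # population variance
                 if (var < 0) var = 0                        # guard against float round-off
                 printf "mean_calls=%.2f in_out_ratio=%.2f cost_sd_over_mean=%.2f\n",
                        calls / n, in_t / out_t, sqrt(var) / mean }'
}

# Made-up sample rows in the same shape as the top-50 output.
three_numbers <<'EOF'
t1,6,6000,1000,0.24
t2,2,2000,1000,0.08
EOF
```

In practice you would pipe the output of the check straight into three_numbers rather than pasting rows by hand.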

What I do with this output

When a team sends me their top-50 file (or I run the check against a sanitized sample they paste), I write back a 1-page forensic report that does three things:

Names the dominant shape with the specific evidence from the top 50.
Ranks the 3-5 specific actions that would close the largest gap first. Not a 20-item generic checklist. Three to five concrete code changes for their workload.
Quotes a single new cost-per-task number they should target, and the rough order of magnitude of the savings.

The 30-minute self-check above is the same engine, just less narrative. If you want the narrative version, the door is open: LLM Bill Triage is the $299 fixed-fee version of the same audit.

If you want the free version, drop a sanitized snippet in the comments and I'll annotate the top-3 leaks.

What changes in your budget when you fix the gap

Three case studies from the last 90 days (anonymized):

Support agent, $215K → $58K/mo. Same call volume. Fix: per-task retry cap of 2, per-tool retry classification (transient vs deterministic), one prompt re-architecture to reduce the input context from 4,200 to 900 tokens. Total engineering time: 11 hours.
Research agent, $47K → $9K/mo. Same call volume. Fix: agent-of-agents flattening (removed one level of nesting), per-step token budget, summarization-on-context-overshoot. Total engineering time: 6 hours.
Code-review agent, $4,800 → $1,400/mo. Same call volume. Fix: per-node model routing (the diff-applicator node was running on a frontier model and now runs on a local 7B model). Total engineering time: 4 hours.

The pattern: 60-80% of spend was recoverable without any quality regression on the eval set. The 4-line check above is what I used to find the leak in each case.
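The retry cap with transient-versus-deterministic classification from the first case study can be sketched in a few lines of shell. This is not the team's actual code. The run_capped name is hypothetical, and treating exit status 75 (EX_TEMPFAIL in sysexits.h) as "transient" is an assumed convention that your tool wrapper would have to follow.

```shell
# run_capped MAX_RETRIES CMD [ARGS...]
# Retries CMD only when it exits with 75 (ASSUMED to mean "transient"),
# and never more than MAX_RETRIES times. Any other failure is treated as
# deterministic and returned immediately, because retrying it buys nothing.
run_capped() {
  max_retries=$1; shift
  attempt=0
  while :; do
    if "$@"; then return 0; else status=$?; fi
    if [ "$status" -ne 75 ]; then return "$status"; fi               # deterministic failure
    if [ "$attempt" -ge "$max_retries" ]; then return "$status"; fi  # cap reached
    attempt=$((attempt + 1))
  done
}

# Demo: a stand-in command that always reports a transient failure.
always_transient() { echo "call"; return 75; }
run_capped 2 always_transient || echo "gave up with status $?"
```

With a cap of 2, the stand-in runs three times in total (one attempt plus two retries) and then stops, which bounds the worst-case spend per tool call.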

If you're paying for an agent and you've never run that check, you almost certainly have a Shape 1, 2, or 3 leak active right now.

Where to start

Export 7 days of usage. CSV from your provider dashboard, or a trace dump from LangSmith / Helicone / Langfuse / vLLM logs.
Run the 4-line check. Sort by calls-per-task.
If your top-50 mean is above 2.5 calls per task, you have a Shape 1 leak. Cap retries at 2 and re-measure.
If the input/output ratio is above 6, you have a Shape 2 leak. Turn on stream_options.include_usage and re-measure.
If the standard deviation of cost per task is more than 3x the mean, you have Shape 3. Find the manager agent and serialize the context once per session, not once per call.
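The three thresholds above fit in one small function. A minimal sketch, assuming you already have the three numbers in hand; diagnose is a hypothetical name and the thresholds are simply the ones stated in this article.

```shell
# diagnose MEAN_CALLS IN_OUT_RATIO COST_SD_OVER_MEAN
# Applies the article's thresholds (2.5 calls, ratio 6, spread 3x) and prints the first action for each hit.
diagnose() {
  awk -v calls="$1" -v ratio="$2" -v spread="$3" 'BEGIN {
    if (calls  > 2.5) { print "Shape 1: cap retries at 2 and re-measure"; hit = 1 }
    if (ratio  > 6)   { print "Shape 2: turn on stream_options.include_usage and re-measure"; hit = 1 }
    if (spread > 3)   { print "Shape 3: serialize manager context once per session"; hit = 1 }
    if (!hit) print "No threshold tripped"
  }'
}

diagnose 6.0 3.1 0.5   # example inputs, not real measurements
```

More than one line of output means more than one shape is active. Fix them one at a time and re-measure in between so you can attribute the savings.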

Five minutes to a number. One engineering sprint to a fix.

DEV Community