last week a post called "agents need control flow, not more prompts" went around hn (thread 48051562, 588 points, 293 comments). the argument is an engineering one: open-ended prompt loops are unpredictable, deterministic harnesses aren't, so wrap the agent in a flowchart and feed it one step at a time. one commenter described doing exactly that — "wrapped the agent in a loop that kept feeding it the next step in the flowchart."
all true. but there's a second axis the thread mostly stepped around, and one person said it out loud:
"I used to assume they pushed people into the prompt-only workflows because you're paying them for the tokens" — DrewADesign, same thread
an open-ended agent loop isn't just unreliable behavior. it's unbounded spend. and the part almost nobody is instrumenting: when the invoice arrives, you have one number for a session that made 30 model calls, and no way to tell which of those 30 calls re-read the repo three times and cost $1.40 of the $1.83.
the loop got more expensive three times in the last 30 days
the timing here isn't subtle. while the thread was arguing about flowcharts, the per-token price moved underneath everyone:
- github copilot shifted to a token-credit model where the same Opus turn bills at a 1x / 7.5x / 27x multiplier depending on plan and overage state (HN 47923357). same work, three prices.
- anthropic A/B-tested removing Claude Code from the Pro tier mid-cycle (HN 47854477). people found out when the harness they'd built around it stopped working — no notice.
- openai shipped GPT-5.5 on may 8 at roughly 2x GPT-5.4's per-token price (HN 48057209, 213 pts). OpenRouter measured a 49–92% net cost increase even after the 19–34% token-efficiency gain, because efficiency doesn't save you if the price moved further than the efficiency did. a commenter there: "it's also quite the cost lottery and i'm not sure i am comfortable with that."
and the workload itself is variance-heavy before any repricing. reflex.dev benchmarked computer-use against a structured API on the same admin-panel task (HN 48024859, 269 comments): 550,976 ± 178,849 input tokens for the agent loop, 12,151 ± 27 for the structured call. the standard deviation on the loop is ~32% of the mean. run the same task twice and one standard deviation of drift puts you anywhere from ~370k to ~730k input tokens — and a matching 750s–1257s wall-clock swing.
stack those: a workload that already swings ~2x run-to-run, on a per-token price that moved three times in a month, inside a loop whose length is decided by the model and not by you. "average cost per task" is not a number you can budget against. it's a number that was true once, for one run.
what we actually measured
i build llmeter — an open-source dashboard for llm api cost tracking — so i spend a lot of time staring at this data. the thing that broke for us first wasn't the price. it was attribution.
here's the shape of an agentic task: one user request becomes N model calls, where N is the agent's choice, not the user's. without per-call attribution your invoice for that session is a single line. you can't point at the iteration that re-read the repo three times. you can't tell a retried tool-call branch — one that already succeeded — apart from real work. "agents are expensive" stays a feeling.
once we started recording cost per API call and rolling it up per user / per model / per day, two things fell out fast:
- the cost distribution across "the same" task is bimodal, not normal. most runs are cheap; a small tail of runs is 3–5x, and the tail is exactly where the loop decided to do something extra. the mean hides the tail. the p95 is the number that actually predicts your invoice.
- a handful of users — usually on the free tier, usually running something on a cron — accounted for a wildly disproportionate share of token spend. one cron job re-summarizing the same document every hour will quietly outspend your paying customers, and you won't see it in an aggregate provider bill.
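for concreteness, here's roughly the shape of that per-call recording. a minimal sketch, not llmeter's actual schema: the `PRICES` dict, the sqlite table, and `log_call` are all invented for illustration, and the prices are placeholders you'd swap for your provider's current sheet.

```python
import sqlite3
import time

# placeholder prices in $ per 1M tokens -- swap in your provider's current sheet
PRICES = {
    "example-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}

db = sqlite3.connect("usage.db")
db.execute("""CREATE TABLE IF NOT EXISTS calls (
    ts REAL, task_id TEXT, user_id TEXT, model TEXT,
    input_tokens INTEGER, output_tokens INTEGER, cached_tokens INTEGER,
    cost REAL)""")

def log_call(task_id: str, user_id: str, model: str, usage: dict) -> None:
    """record one completion call; `usage` holds normalized token counts."""
    p = PRICES[model]
    cached = usage.get("cached_tokens", 0)
    # assumes cached tokens are a discounted subset of input tokens
    # (openai-style billing; check your provider's semantics)
    cost = ((usage["input_tokens"] - cached) * p["input"]
            + cached * p["cached_input"]
            + usage["output_tokens"] * p["output"]) / 1_000_000
    db.execute("INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
               (time.time(), task_id, user_id, model,
                usage["input_tokens"], usage["output_tokens"], cached, cost))
    db.commit()

# the rollups are then one query each: worst runs (with iteration fan-out)...
WORST_RUNS = """SELECT task_id, COUNT(*) AS calls, SUM(cost) AS cost
                FROM calls GROUP BY task_id ORDER BY cost DESC LIMIT 10"""
# ...and spend per user / model / day
PER_USER_DAY = """SELECT user_id, model, DATE(ts, 'unixepoch') AS day, SUM(cost)
                  FROM calls GROUP BY user_id, model, day"""
```

the `COUNT(*) AS calls` column in the worst-runs query is the fan-out number (one user request, N model calls), which matters again in the checklist below.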
none of that is a pricing problem. it's a visibility problem that pricing volatility makes expensive.
what to do this week (none of this needs a tool)
- find your single most expensive task in production this month. not the average — the single worst run. if you can't query that in under five minutes, that's the gap, and closing it is usually a five-line change: log `model`, `input_tokens`, `output_tokens`, `cached_tokens` and a `task_id` next to every completion call, then `GROUP BY task_id ORDER BY cost DESC LIMIT 10` (the sketch in the previous section is exactly this).
- break out the cached-token line. OpenAI, Anthropic and DeepSeek each name cache-hit / cache-miss / cached-input differently, and the cached tier is the one that tends to move most on a repricing. if your cost rollup collapses everything into "input tokens," a price change on the cached tier is invisible until the invoice. (a normalization shim is sketched after this list.)
- put a `task_id` on agent loops and count the iterations. the reflex numbers say iteration count is your variance source. if you're not logging "this user request fanned out to 14 model calls," you can't tell a healthy run from a runaway one — and you definitely can't alert on it.
- alert on p95, not the mean. a Slack ping at "you've spent your monthly average" fires after the damage. a ping at "this task is in the top 5% of cost we've ever recorded for this task type" fires while it's still running. (this is the one spot a tool earns its keep — per-model / per-user / per-day budget alerts — but the logic is simple enough to roll yourself; see the sketch after this list.)
- if you route to DeepSeek for cost, write down the date. the V4-Pro 75% promo expires 2026-05-31 15:59 UTC and every line item goes 4x at that second. that's not a forecast, it's a calendar entry — model it before may 30, not on june 1.
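on the cached-token naming, this is the kind of normalization shim i mean. a sketch: the field names below are what each provider's usage object exposed last time i looked (openai's `prompt_tokens_details.cached_tokens`, anthropic's `cache_read_input_tokens` / `cache_creation_input_tokens`, deepseek's `prompt_cache_hit_tokens` / `prompt_cache_miss_tokens`); verify against your SDK version before you trust it.

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """map each provider's usage fields onto one schema:
    input / output / cached_input / cache_write."""
    if provider == "openai":
        # cached tokens sit under prompt_tokens_details and are a
        # subset of prompt_tokens, billed at a discount
        details = usage.get("prompt_tokens_details") or {}
        return {"input": usage["prompt_tokens"],
                "output": usage["completion_tokens"],
                "cached_input": details.get("cached_tokens", 0),
                "cache_write": 0}
    if provider == "anthropic":
        # anthropic reports cache reads and cache writes as separate
        # counters alongside input_tokens, not inside it
        return {"input": usage["input_tokens"],
                "output": usage["output_tokens"],
                "cached_input": usage.get("cache_read_input_tokens", 0),
                "cache_write": usage.get("cache_creation_input_tokens", 0)}
    if provider == "deepseek":
        # deepseek splits the prompt into cache-hit and cache-miss counts
        return {"input": usage["prompt_cache_miss_tokens"],
                "output": usage["completion_tokens"],
                "cached_input": usage["prompt_cache_hit_tokens"],
                "cache_write": 0}
    raise ValueError(f"unknown provider: {provider}")
```

once everything lands in one schema, a repricing on any single tier shows up as a change in exactly one column of your rollup instead of vanishing into "input tokens."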
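and the p95 alert logic really is small enough to roll yourself. a sketch, assuming you can pull the completed-run costs for a task type (the `calls` table above gives you that) and the running total of the current one:

```python
def over_p95(running_cost: float, history: list[float], min_runs: int = 20) -> bool:
    """true once this run's cost-so-far crosses the 95th percentile
    of completed runs for the same task type."""
    if len(history) < min_runs:  # with thin history the percentile is noise
        return False
    ranked = sorted(history)
    p95 = ranked[int(0.95 * (len(ranked) - 1))]
    return running_cost > p95
```

call it between loop iterations, not at the end; the whole point is that it fires while the runaway run is still running.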
the point
the control-flow argument and the cost argument are the same argument. a deterministic harness is predictable behavior and a predictable bill. an open-ended loop is "trust me" on both. the harness people are right — but the reason to draw the flowchart isn't only that the agent behaves better. it's that you can finally point at the box that cost you the money.
if you can't draw the cost shape of your agent's loop, your control flow is just hope.
i build llmeter — an open-source (AGPL-3.0) cost dashboard for OpenAI / Anthropic / DeepSeek / OpenRouter / Mistral / Azure OpenAI: per-model, per-user, per-day, with budget alerts. it's not a proxy and it doesn't sit in your request path — the SDK forwards usage metadata async. the per-call attribution stuff above is the part that made me build it. free tier is one provider / 7-day retention. genuinely want to hear how other people slice agentic cost — what does your rollup key on?