hefty

Posted on Jun 1

Token spend is the new architecture smell

#ai #architecture #programming #productivity

The AI coding conversation is finally moving past the fun part.

For a while, the question was simple: can the agent write the code? Now the better question is nastier: why did it need that many tokens to get there?

That sounds like finance. It is not. It is architecture.

When a coding agent burns through a mountain of context, retries, tool calls, and half-correct edits, the bill is only the easiest symptom to measure. The real problem is usually that the system gave the agent too much room to wander and too little structure to finish.

The budget is telling you where the workflow leaks

The interesting part of the recent cost panic around AI coding agents is not the exact number on the invoice. Big companies spend absurd money on all kinds of engineering tools. Sometimes that is rational.

The useful signal is the shape of the spend.

If token usage rises because an agent is doing valuable work against a clear spec, fine. That is a cost center you can reason about. If token usage rises because agents keep rereading the repo, guessing at intent, patching around their own mistakes, and asking for more context every time they get stuck, that is not "AI is expensive."

That is an unbounded loop.

Developers already know this smell in other forms. A slow build tells you something about dependency structure. A flaky test suite tells you something about isolation. A runaway cloud bill tells you something about lifecycle control.

Runaway token spend belongs in the same bucket.

It means the agent workflow has no useful stopping rule, no tight task boundary, or no cheap way to decide whether the work is actually getting better.
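A stopping rule does not need to be clever to be useful. Here is a minimal sketch, assuming the workflow can report tokens used and a failing-test count after each agent step. `RunBudget`, `stop_reason`, and the default limits are hypothetical names and placeholder numbers, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunBudget:
    # Placeholder limits for illustration, not recommendations.
    max_tokens: int = 60_000
    max_stalled_steps: int = 3  # consecutive steps allowed without improvement


def stop_reason(budget: RunBudget, tokens_used: int, failing_tests: list[int]) -> Optional[str]:
    """Return why the run should end, or None to let it continue.

    failing_tests holds the failing-test count after each agent step. It is
    an assumed, cheap proxy for "is the work getting better?".
    """
    if failing_tests and failing_tests[-1] == 0:
        return "done: tests pass"
    if tokens_used >= budget.max_tokens:
        return "stop: token budget exhausted"
    # Compare the last few steps against the count just before them.
    window = failing_tests[-(budget.max_stalled_steps + 1):]
    stalled = len(window) == budget.max_stalled_steps + 1 and min(window[1:]) >= window[0]
    if stalled:
        return "stop: no improvement in recent steps"
    return None
```

Failing tests are a crude progress signal, and a real workflow may want a better one. The point is that the loop has an exit that does not depend on the agent deciding it is finished.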

The 80% problem is where the tokens go to die

The optimistic version of agentic coding is that the model handles the boring 80% and humans handle the interesting 20%.

That is close enough to be dangerous.

The first 80% is often cheap because it is mostly production. The agent can scaffold files, follow patterns, fill in obvious glue, and produce a plausible diff. The last 20% is expensive because it is comprehension. Does this match the product intent? Did it preserve the invariant? Did it quietly break the boring path nobody mentioned? Is the patch smaller than the problem, or did it just create a new surface area for review?

This is where unguided agents start spending like a badly written query.

They search again. They widen the context. They rewrite nearby code. They add abstractions because the current shape feels confusing. They run tests, fail, patch the failure, and keep going without ever proving that the original design was right.

The model is not being malicious. It is doing what you asked, or what your workflow accidentally allowed.

That is why token spend is a better diagnostic than people want to admit. It shows you where your instructions are vague, your repo boundaries are mushy, and your review process depends on vibes.

"More agents" makes the smell worse unless review scales too

Parallel agents are useful. I use the pattern. One agent explores docs, another patches a narrow module, another checks a failure mode. Done well, it feels less like pair programming and more like running a small build system made of judgment.

But parallelism does not magically create leverage. It multiplies whatever workflow you already have.

If the task is underspecified, you now have three underspecified tasks. If the repo context is too broad, you now have three agents dragging different chunks of it into their windows. If the review surface is weak, you now have three diffs competing for human attention.

This is why the tooling trend around diff reviewers, model comparison, skills, and operator stacks matters more than the demo videos. The useful layer is not "agent, but more autonomous." The useful layer is routing.

What kind of task is this?

Which context is allowed?

What file boundary is owned?

What evidence is required before the work is considered done?

Which failures should stop the run instead of inviting another 40,000 tokens of improvisation?

That is architecture. Boring, necessary architecture.
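One way to make those five questions concrete is to write the answers down as data before the run starts. This is a sketch only: `TaskSpec`, `may_edit`, `must_halt`, and every path and failure name in the example are invented for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskSpec:
    kind: str                           # What kind of task is this?
    allowed_context: tuple[str, ...]    # Which context is allowed?
    owned_paths: tuple[str, ...]        # What file boundary is owned? (glob patterns)
    required_evidence: tuple[str, ...]  # What must exist before the work counts as done?
    stop_on: tuple[str, ...]            # Which failures end the run instead of retrying?


def may_edit(spec: TaskSpec, path: str) -> bool:
    """True only if the path falls inside the task's ownership boundary."""
    return any(fnmatchcase(path, pattern) for pattern in spec.owned_paths)


def must_halt(spec: TaskSpec, failure: str) -> bool:
    """True if this failure type is listed as a hard stop for the task."""
    return failure in spec.stop_on


# Hypothetical spec; paths and failure names are made up.
reset_form_task = TaskSpec(
    kind="bug fix",
    allowed_context=("src/auth/reset_form.py", "src/auth/validation.py"),
    owned_paths=("src/auth/reset_form.py", "tests/auth/*"),
    required_evidence=("regression test for expired tokens passes",),
    stop_on=("api_contract_changed", "edit_outside_boundary"),
)
```

With a spec like this, a proposed edit to a file outside `owned_paths` becomes a routing decision the workflow can reject, instead of a surprise in review.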

Context is an attack surface on your budget

People talk about repo context as if more is always better. That makes sense for a chat demo. It is a bad default for production agent work.

Every extra file can become a distraction. Every broad instruction can become a permission slip. Every fuzzy requirement can turn into another loop where the agent tries to infer the missing product decision from code shape, naming conventions, old tests, stale comments, and whatever happens to fit in the context window.

The answer is not to starve the model. The answer is to treat context like a dependency.

Give the agent the files it needs. Give it the contract it must preserve. Give it examples when examples are cheaper than prose. Put durable workflow rules in versioned skill files or repo docs instead of rewriting the same prompt every time. Make the happy path obvious and the off-ramp explicit.

In other words: stop treating context as a giant bucket and start treating it as an interface.
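Treating context as an interface can be as plain as a declared manifest plus a size limit. The sketch below assumes file contents are already loaded into a dict and uses character count as a rough stand-in for tokens. `build_context` is a hypothetical helper, not an API from any tool.

```python
def build_context(manifest: list[str], files: dict[str, str], max_chars: int) -> str:
    """Assemble prompt context from declared files only, within a size limit.

    files maps path -> contents; in a real workflow it would be read from disk.
    """
    missing = [path for path in manifest if path not in files]
    if missing:
        raise KeyError(f"manifest names unknown files: {missing}")

    # Only files named in the manifest are included, in manifest order.
    context = "\n\n".join(f"### {path}\n{files[path]}" for path in manifest)
    if len(context) > max_chars:
        raise ValueError(
            f"context is {len(context)} chars, limit is {max_chars}; "
            "narrow the manifest instead of widening the window"
        )
    return context
```

Failing loudly when the manifest is too big is a deliberate choice here. It forces the size of the context to be a decision someone made, not something that accumulated.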

That applies outside code too. If an AI image workflow keeps burning generations just to fix a tiny artifact, the architectural move is often to split generation from cleanup. For Gemini image outputs, a narrow browser-side cleanup step like Gemini Watermark Cleaner is the kind of bounded tool that keeps the model from being used as a hammer for every downstream problem.

The same principle holds for coding agents. Use the expensive generative system where generation is actually the work. Push repeatable cleanup, validation, formatting, and review into smaller tools with clearer contracts.

A better agent workflow has budgets in the design

The practical fix is not "use fewer tokens." That is like telling someone with a slow database to "query less."

Better constraints beat guilt.

Start with task shape. A good agent task should have a small ownership boundary, a known success condition, and a clear reason to stop. "Improve the auth flow" is a fog machine. "Update the password reset form to use the existing validation helper, keep API contracts unchanged, and add one regression test for expired tokens" is work.

Then control context. If the agent needs five files, do not hand it the whole repo and hope it discovers wisdom. If it needs the design system, point to the actual component pattern. If it needs product intent, put that intent in a durable file that humans can review.

Then make review native. A diff is not enough when agents can produce a lot of plausible code quickly. You want a short summary of what changed, what was intentionally not changed, which tests ran, and where the agent is uncertain. You also want the workflow to stop when the evidence is missing.
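That review summary can be a small structured report that gates the handoff. A minimal sketch, assuming the agent can be asked to fill in these fields. `RunReport` and `missing_evidence` are hypothetical names, and which fields count as mandatory is my assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RunReport:
    summary: str  # what changed
    intentionally_unchanged: list[str] = field(default_factory=list)
    tests_run: list[str] = field(default_factory=list)
    uncertainties: list[str] = field(default_factory=list)


def missing_evidence(report: RunReport) -> list[str]:
    """List the required fields the agent left empty.

    Assumption: a summary and at least one test run are mandatory, while the
    other two fields may legitimately be empty.
    """
    gaps = []
    if not report.summary.strip():
        gaps.append("summary")
    if not report.tests_run:
        gaps.append("tests_run")
    return gaps
```

If `missing_evidence` returns anything, the workflow stops and asks for the evidence before a human ever looks at the diff.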

Finally, track spend per task type. Token usage in isolation is noisy. Token usage by category is useful. Bug fix. Refactor. Test repair. Exploratory spike. Dependency upgrade. If one category keeps getting expensive, it probably needs better instructions, smaller task slices, or a different tool.
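Grouping spend by category takes only a few lines once each run is logged with a label. This sketch assumes runs are recorded as `(category, tokens)` pairs. The function name and the numbers in the usage example are made up.

```python
from collections import defaultdict
from statistics import median


def spend_by_category(runs: list[tuple[str, int]]) -> dict[str, dict[str, float]]:
    """Summarize token usage per task category from (category, tokens) pairs."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for category, tokens in runs:
        buckets[category].append(tokens)
    return {
        category: {
            "runs": len(values),
            "median_tokens": median(values),
            "max_tokens": max(values),
        }
        for category, values in buckets.items()
    }


# Invented numbers, only to show the shape of the output.
report = spend_by_category([
    ("bug fix", 10_000),
    ("bug fix", 14_000),
    ("refactor", 90_000),
])
```

Median and max together are a reasonable starting pair: the median shows what a typical run in that category costs, and the max shows whether one run went wandering.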

High token spend is not automatically bad

There is an annoying caveat here: some expensive runs are worth it.

A deep migration across a messy legacy codebase will cost more than a small bug fix. A security-sensitive change should spend more time reading and verifying. A long-running agent that produces a clean, reviewed, high-value patch may be cheap compared with the human time it saved.

So no, the goal is not to worship a tiny token bill.

The goal is to notice when spend is buying motion instead of progress.

That distinction matters. Motion is the agent editing, retrying, summarizing, and expanding context. Progress is the system getting closer to a correct, reviewable, owned change.

If you cannot tell the difference, your architecture is missing instrumentation.
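One rough way to instrument the difference is to ask what share of tokens went to steps that did not move a progress signal. The sketch assumes each step records its token cost and the failing-test count before and after. `Step` and `motion_ratio` are hypothetical, and failing tests are only one possible proxy for progress.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tokens: int
    failing_before: int
    failing_after: int


def motion_ratio(steps: list[Step]) -> float:
    """Fraction of tokens spent on steps that did not reduce failing tests."""
    total = sum(step.tokens for step in steps)
    if total == 0:
        return 0.0
    motion = sum(
        step.tokens for step in steps if step.failing_after >= step.failing_before
    )
    return motion / total
```

A run of `[Step(1000, 5, 3), Step(3000, 3, 3)]` gives a ratio of 0.75: three quarters of the tokens bought motion. The metric is blunt, since reading and planning steps also count as motion, so it works better as a trend per task type than as a verdict on a single run.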

The real operator skill is saying no earlier

The best agent operators are not the people who let models run forever. They are the people who know when to constrain the run before it starts.

They write sharper specs. They split work into smaller units. They keep reusable instructions in files. They demand evidence. They do not let a model compensate for unclear product thinking by spending more tokens.

That is the real shift behind all the "operator stack" chatter. The stack is not just Codex, Claude Code, routers, skills, diff viewers, and whatever launches next week. The stack is a way to make agent work reviewable.

Once you see token spend that way, the invoice stops being a surprise and starts being a profiler.

And if the profiler says your agent spent half the run wandering around your repo trying to understand what you meant, the fix is probably not a cheaper model.

The fix is a better boundary.

Source notes

DEV Community