Your AI Writes Junk, and You Pay for It Twice

Last week an agent wrote me around ten tests for a function. Two of them were useless. Not wrong, exactly: they re-checked the same branch, asserted things the type system already guaranteed, and added nothing a reviewer would keep. I paid output tokens to generate them, then input tokens to carry them in context on every turn after. The model had no reason to delete them. Neither did the company billing me by the token.

That's the whole problem in one file. We pay for AI by the unit it produces, not by the value it delivers. Not a scam, not lazy, just misaligned. And the misalignment leaks much further into your codebase than you'd expect. Let me show you.

The tax

The marginal cost of a bad token equals the marginal cost of a good one, so a model billed per output token has a built-in pull toward saying more. Imagine paying programmers per keystroke and acting surprised when the functions get long. My two useless tests are a small example of this. Reasoning models make it worse: they bill you for an internal chain of thought you never read, so a response showing five hundred visible tokens can quietly burn several thousand.
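To put rough numbers on "pay for it twice," here is a minimal sketch of the arithmetic. The token count, turn count, and per-million-token prices are made-up assumptions for illustration, not any provider's real rates, and the model ignores prompt caching and context compaction, both of which would shrink the second number.

```python
def cost_of_junk(junk_tokens, later_turns, out_price_per_mtok, in_price_per_mtok):
    """Return (generation_cost, carry_cost) in dollars for tokens nobody needed.

    generation_cost: paid once, at the output rate, when the junk is written.
    carry_cost: paid again on every later turn, at the input rate, because
    the junk still sits in the context window.
    """
    generation = junk_tokens * out_price_per_mtok / 1_000_000
    carry = junk_tokens * later_turns * in_price_per_mtok / 1_000_000
    return generation, carry


# Hypothetical numbers: two useless tests of ~200 tokens each,
# carried through 30 more turns of the same session.
gen, carry = cost_of_junk(
    junk_tokens=400,
    later_turns=30,
    out_price_per_mtok=15.0,  # assumed $ per million output tokens
    in_price_per_mtok=3.0,    # assumed $ per million input tokens
)
print(f"generated once: ${gen:.4f}, carried afterwards: ${carry:.4f}")
```

Under these assumptions the carrying cost ends up several times the generation cost, even though input tokens are priced lower, because it recurs on every turn.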

The community already feels this. Caveman, a Claude Code skill with around 70k stars, makes the agent talk like a caveman ("why use many token when few token do trick") and cuts around 65% of output tokens. Think about that for a second: we built tooling to make the agent grunt, because the meter made full sentences a cost problem.

Worth knowing: "cheaper per token" often means "more expensive per task." OckBench calls this the overthinking tax. Smaller models burn absurd token counts trying to imitate complex reasoning without getting there, and the benchmark's most extreme result is an open model matching a frontier model at 75% accuracy while spending 26 times the tokens, 42,243 versus 1,603. The headline price is a trap.
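The per-token versus per-task gap is easy to check by hand. The sketch below uses the two token counts quoted above; the prices are hypothetical and only there to show the shape of the comparison.

```python
def cost_per_task(tokens_per_task, price_per_mtok):
    """Dollar cost of one task at a given price per million tokens."""
    return tokens_per_task * price_per_mtok / 1_000_000


def break_even_price_ratio(small_model_tokens, frontier_tokens):
    """How many times cheaper per token the small model must be just to tie per task."""
    return small_model_tokens / frontier_tokens


# Token counts quoted in the text above.
SMALL_MODEL_TOKENS = 42_243
FRONTIER_TOKENS = 1_603

ratio = break_even_price_ratio(SMALL_MODEL_TOKENS, FRONTIER_TOKENS)

# Hypothetical prices: the small model is 10x cheaper per token.
frontier_cost = cost_per_task(FRONTIER_TOKENS, price_per_mtok=10.0)
small_cost = cost_per_task(SMALL_MODEL_TOKENS, price_per_mtok=1.0)

print(f"break-even price ratio: {ratio:.1f}x")
print(f"frontier: ${frontier_cost:.5f} per task, small: ${small_cost:.5f} per task")
```

With a tenfold per-token discount, the small model in this sketch still costs more per task, because it would need to be more than 26 times cheaper per token just to break even.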

An arXiv paper, Is Your LLM Overcharging You?, goes further: under per-token pricing, providers have a financial incentive to misreport token counts, and users can't prove, or even know, they're overcharged. The authors' fix is telling: pay per character, a unit the user can count. Even the academics fall back to the most verifiable unit available. Hold that thought.

Caragiale saw all of this coming in 1884. In his comedy A Lost Letter, the police chief Pristanda bills the town for forty-four flags and hangs about a dozen. When questioned, he counts them out loud: two at the prefecture, two in the market square, two at the town hall, one at the boys' school, one at the girls' school, one at the hospital, two at the cathedral... and then, two at the prefecture again, padding the tally back up toward the dozen-odd that were really there. When the prefect laughs that he already counted those, Pristanda speeds up, recounts everything with even bolder arithmetic, lands on forty-four exactly, and offers that maybe the wind took down one or two. He even has the provider's defense ready: big family, small salary. You know the modern version: big models, expensive GPUs. The trick only got exposed because Zoe rode through town and counted the flags herself. Remember Zoe.

It deforms your architecture

The pricing model doesn't stay on the invoice. It climbs out and draws your architecture diagram. You cap history, cache hard, route cheap paths to small or local models, keep reasoning models off the hot path. In a value-aligned world half of these decisions wouldn't exist.

I felt this building httptape, a Go library for recording, redacting, and replaying HTTP traffic. When the API on the other end is an LLM, every test run is metered. Run the suite a hundred times a day and you've paid a hundred times to assert the same thing. So you record once, replay from a tape, and your tests stop touching the paid boundary. Useful engineering, yes. Also a structure I built because of pricing pressure, not because a user needed it. The economics draw the boxes and you label them afterward.
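The record-once, replay-forever idea fits in a few lines. What follows is a generic Python sketch of the pattern, not httptape's API (httptape is a Go library, and nothing here mirrors its types or functions). The `Tape` class, the `fake_llm` stand-in for a metered call, and the file name are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class Tape:
    """Record each distinct request once; answer repeats from disk."""

    def __init__(self, path, call_live):
        self.path = Path(path)
        self.call_live = call_live  # the metered call, injected by the caller
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def send(self, request):
        # Key on a canonical form of the request so identical calls match.
        canonical = json.dumps(request, sort_keys=True)
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key not in self.entries:
            # Only this branch crosses the paid boundary.
            self.entries[key] = self.call_live(request)
            self.path.write_text(json.dumps(self.entries, indent=2))
        return self.entries[key]


paid_calls = []


def fake_llm(request):
    """Stand-in for a metered API call; counts how often it is hit."""
    paid_calls.append(request)
    return {"text": "stubbed completion for: " + request["prompt"]}


with tempfile.TemporaryDirectory() as tmp:
    tape_path = Path(tmp) / "llm.tape.json"
    request = {"model": "example-model", "prompt": "summarize this diff"}
    first = Tape(tape_path, fake_llm).send(request)
    # Ninety-nine more "test runs", each starting from the tape on disk.
    replays = [Tape(tape_path, fake_llm).send(request) for _ in range(99)]

print(f"test runs: {1 + len(replays)}, paid calls: {len(paid_calls)}")
```

A real tape also has to deal with redaction, headers, and requests that differ only in irrelevant fields; this sketch skips all of that to show just the part that pricing pressure motivates.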

"So just charge for outcomes"

This is already happening. Intercom's Fin charges about a dollar per resolved conversation, Zendesk prices per automated resolution, Sierra built a company on it. It shows up where the outcome is cheap to define and cheap to verify. The trouble starts on code. I had three fixes in mind, and watching each one break taught me more than any of them working would have.

First: charge only if it compiles. This one mostly holds. Compilation is binary, cheap to check, impossible to argue with.

Second: charge only if the tests pass. A while back I joined a team that inherited a codebase from an outside group. One handover condition was 80% test coverage, a clean contractual outcome. The number was met. Most of the tests had no assertions at all. Nobody had to cheat: coverage measures which lines a test touches, never whether it verifies anything, and the moment it became the graded number it stopped describing what we cared about. Pay an agent against a green suite and the cheapest path to green is what you'd expect: a test that asserts nothing, an assertion quietly weakened, a hard case skipped with a convincing comment.

And gaming is only the cynical version. When the same agent writes the code and then the tests, the tests are derived from the implementation, bugs included. A wrong boundary condition doesn't get caught, it gets asserted, and now it's green and regression-protected. Colleagues of mine built a skill pushing agents to do TDD, and I keep asking a question I can't answer: does test-first carry any value over to generated code? For humans, TDD's real product was never the tests, it was design pressure on the author. Whether that survives when one model writes both sides of the contract in one go, I haven't seen anyone demonstrate. We're pricing against a signal we haven't even validated. We've all seen the human version anyway, the automatic "LGTM" on a PR nobody read. A green light is cheap to emit and expensive to verify, so it stops meaning anything.

Third: cut the price when the agent ignores AGENTS.md, which it does constantly. My instructions file says "always use the latest library versions" and "when blocked, stop and ask for guidance, don't try alternative approaches without approval." Copilot once hit a failing test, misread the error output, and started downgrading libraries to make it pass: both rules broken in one loop, chasing the wrong problem. I paid tokens for every step of that loop, and then more tokens to undo it. The disobedience itself is billable. So cutting the price for ignored instructions sounds fair, but you can only subtract for violations you can mechanically detect. "Never use !!" is a lint rule. "Keep the domain pure, respect the hexagon" is not. The instructions I care about most are the ones I can't auto-verify, which is exactly why they get ignored and exactly why I can't price the violation.
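To make "mechanically detectable" concrete, here is a naive sketch of the `!!` rule as a line scanner. It is an illustration, not a real linter: it assumes Kotlin source, handles only single-line string literals and `//` comments, and would be fooled by raw strings or block comments. The sample source is invented.

```python
import re

# Single-line double-quoted string literal, allowing escaped characters.
_STRING = re.compile(r'"(?:\\.|[^"\\])*"')
# Everything from // to the end of the line.
_LINE_COMMENT = re.compile(r"//.*")


def find_double_bang(kotlin_source):
    """Return 1-based line numbers where `!!` appears in code."""
    hits = []
    for lineno, line in enumerate(kotlin_source.splitlines(), start=1):
        # Blank out strings first, then comments, so neither triggers a hit.
        code = _LINE_COMMENT.sub("", _STRING.sub('""', line))
        if "!!" in code:
            hits.append(lineno)
    return hits


SAMPLE = '''fun greet(name: String?) {
    println("hello!!")          // punctuation in a string, not a violation
    val n = name!!.length       // violation
    // name!! in a comment is fine
}'''

print(find_double_bang(SAMPLE))
```

A dozen lines cover the first rule, roughly. There is no equivalent dozen lines for "keep the domain pure," which is the asymmetry the paragraph above describes.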

The verifiability gradient

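Lay the three fixes side by side and a pattern shows up underneath them. They sort along a single axis, by how cheaply and mechanically the outcome can be checked: characters and tokens at the far left, then "it compiles," then "the tests pass" and lint-style rules, and at the far right the things no tool can score, like whether the design is actually good.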

Pricing can only attach to the left end. Real value lives on the right. Remember pay-per-character? That's the far left edge, the most countable unit there is. And providers aren't being lazy: their costs really are quantitative, GPU-seconds and tokens, while outcome pricing asks them to pay for the times the model fails. A rational provider only takes that bet where the outcome is cheap to define and hard to fake, which is why your support-ticket vendor can and your model API can't.

Token pricing won because the countable thing is both the easiest to meter and the easiest to inflate, while the valuable thing sits just past the edge of what we can compute.

What's missing

The unlock isn't a billing setting. It's whoever builds the layer that can score the right end of that gradient, and that layer has one hard constraint: the thing being measured can't report its own meter. Zoe didn't ask Pristanda how many flags he hung. She counted them from the carriage. VibeWarden, the security sidecar I've been building, runs on the same stubborn rule: watch the boundary, don't trust the actor. A verifier for AI-generated code has to work the same way, sitting beside the agent and checking the build, the suite, the architecture, and the rules independently, because anything self-reported is worth as much as an unread LGTM. I feel strongly enough about that last part that I built lgtm-buzzer, a browser extension that quizzes you on the diff before letting you approve a PR. The green light should cost something.
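The "don't let the actor report its own meter" rule can be sketched in a few lines: the referee runs the check itself and compares the observed result with what the agent claimed. This is a hypothetical illustration, not VibeWarden's code. `referee`, `Verdict`, and the commands are invented names, and a real verifier would also need sandboxing and a guarantee that the agent cannot edit the command being run.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    claimed: bool   # what the agent said about its own work
    observed: bool  # what the referee measured independently

    @property
    def trustworthy(self):
        return self.claimed == self.observed


def referee(claimed_green, command, timeout=600):
    """Run the build or test command ourselves and trust only its exit code."""
    result = subprocess.run(command, capture_output=True, timeout=timeout)
    return Verdict(claimed=claimed_green, observed=result.returncode == 0)


# Stand-ins for a real build or test command: one exits 1, one exits 0.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "pass"]

bad = referee(claimed_green=True, command=failing)
good = referee(claimed_green=True, command=passing)
print(bad, bad.trustworthy)
print(good, good.trustworthy)
```

The exit code only reaches as far up the gradient as the command it wraps. If the suite itself is hollow, an independently observed green is still green for the wrong reason.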

That's the next thing I want to build: an external referee that measures a little further up the gradient than "it compiled," while being honest that the very top, whether the design is actually good, stays out of reach for any tool, mine included. And one experiment I want to run on the way: same tasks, code-first versus test-first agents, seeded bugs, and a count of which flow's tests actually catch them. That experiment is the next piece.

Until then, a question for you: would you pay per outcome for generated code? And what signal would you trust to verify it?