Agent cost bugs are debugging bugs

A coding agent does not need to bankrupt you to create a cost bug.

It just needs to make the run impossible to explain.

You see the number on the invoice. You see the "done" in chat. You cannot connect them.

Cost is not only billing

When developers talk about agent costs, the conversation usually drifts toward dashboards and rate limits — invoice-shaped problems with invoice-shaped fixes. That framing hides the actual pain.

The pain in real workflows is closer to this: an agent runs, something happens, the bill or the environment changes in a way you did not expect, and you cannot quickly tell why.

The fix is not a fancier billing UI. The fix is a record of what the run actually did.

I keep noticing the same four shapes showing up under "cost." All four are really debugging bugs.

Four failures that look like cost but are really run-truth failures

1. Unexpected token budgets

A developer sends one image to a vision-capable model and watches the prompt-token count balloon to something they did not predict. The docs say one thing; the meter says another. They are left manually reconciling published documentation against observed billing.

That is not a billing problem. That is a "what did this run actually consume?" problem. The run did not carry its own accounting. The developer is doing post-hoc forensics with whatever scraps the chat history preserved.

This shows up repeatedly in vendor forums. There is nothing exotic going on — properly formatted requests, supported image sizes, normal API surfaces — and the result still surprises. Without a per-run record that ties the call shape to the observed cost, the conversation is forced upstream into "is the model overcounting?" instead of staying local where it could be diagnosed.
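
Here is roughly what "carrying its own accounting" could look like: a Python sketch, assuming an OpenAI-style response that exposes a `usage` object. The attachment fields and log layout are placeholders, not any vendor's API.

```python
# Sketch: attach the observed cost to the call that produced it, at call time.
# Assumes an OpenAI-style `response.usage`; adjust field names for your vendor.
import json
import time

def record_call(run_log_path, model, attachments, response):
    entry = {
        "ts": time.time(),
        "model": model,
        # Call shape: what was actually sent, not what the docs imply.
        # "name"/"bytes" are illustrative keys for whatever you attach.
        "attachments": [{"name": a["name"], "bytes": a["bytes"]} for a in attachments],
        # Observed cost, straight from the response, not reconstructed later.
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
    }
    with open(run_log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

With that one append per call, "why did a single image cost this much?" becomes a local question: open the run log and read the call shape next to the meter.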

2. Credits or usage attached to the wrong workspace

You are working in one client. You buy more usage. The credits land on a different account, or the same account in a different workspace, or the same workspace in a different surface. The work is paused while you try to figure out which identity owned the run.

This is not a credit-card issue. It is an identity attribution issue masquerading as a money issue. The run did not record which workspace, which account, which environment was actually active when it spent budget.

When attribution is invisible at run time, the only recovery path is a vendor support ticket. That is fine occasionally. It is a productivity disaster as a default mode.
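
A sketch of making attribution visible at run time: snapshot the identity before the first token is spent. The `ACME_ACCOUNT` and `ACME_WORKSPACE` environment variables are placeholders for however your stack exposes the active account and workspace; the git calls are standard.

```python
# Sketch: record which identity is about to spend budget.
import os
import subprocess

def capture_identity():
    def git(*args):
        return subprocess.run(
            ["git", *args], capture_output=True, text=True
        ).stdout.strip()

    return {
        "account": os.environ.get("ACME_ACCOUNT", "<unknown>"),      # placeholder
        "workspace": os.environ.get("ACME_WORKSPACE", "<unknown>"),  # placeholder
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "worktree": git("rev-parse", "--show-toplevel"),
    }
```

Write that dict into the run record before the first call and "which identity owned this run?" stops being a support ticket.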

3. Local processes that keep running after the agent is done

You finish a session in a desktop client. You close it. Hours later you notice a Python interpreter still using memory, still doing whatever the agent was doing when the window closed. No one is watching it. No one billed for it on paper. But the laptop is doing work nobody asked for, and the agent that started it has no idea it left a tail.

That is a cost in machine resources, in attention, in trust. And it is invisible to every vendor dashboard because it is local by definition. The only way to surface it would be a run record that says "I started a subprocess, here is its handle, here is what should happen to it when I exit."

Today nothing in the agent stack writes that down. You discover the leak by looking at Activity Monitor and asking yourself, "what is this and where did it come from?"
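
Writing it down does not have to be elaborate. A sketch: wrap subprocess creation so every spawn is recorded, and on exit note which processes are still alive instead of silently leaving a tail.

```python
# Sketch: process footprint for one run. Anything still alive at exit
# gets written down rather than discovered later in Activity Monitor.
import atexit
import json
import subprocess

SPAWNED = []

def spawn(cmd, **kwargs):
    proc = subprocess.Popen(cmd, **kwargs)
    SPAWNED.append({"pid": proc.pid, "cmd": cmd, "proc": proc})
    return proc

def report_footprint(path="process_footprint.json"):
    footprint = [
        {
            "pid": p["pid"],
            "cmd": p["cmd"],
            # poll() is None while the process is still running.
            "still_running": p["proc"].poll() is None,
        }
        for p in SPAWNED
    ]
    with open(path, "w") as f:
        json.dump(footprint, f, indent=2)

atexit.register(report_footprint)
```

The leak still has to be cleaned up, but it is at least named in the record instead of found by accident.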

4. Model swaps that change production behavior

A team updates a model version. Same API, same prompts, same client code. Behavior drifts. The new model is faster, or cheaper, or differently tuned, and it quietly stops doing the part of the job everyone was relying on.

This is not advertised as a cost issue, but it is the same family. The cost of the change shows up as a quality regression in production. The team is running tests after the fact to figure out what shifted. The model swap was framed as low-risk because the surface stayed the same.

A run record would not prevent the regression. But it would make the diff readable. You would have a frozen record of what the old model did and what the new model does, attached to the same run shapes, instead of a slow reconstruction from chat memory and prod logs.
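
A sketch of that readable diff, assuming both runs have already been parsed into dicts whose keys follow the run-record shape later in this post; the field names here are illustrative.

```python
# Sketch: surface what changed between a baseline run and the current run.
def diff_runs(baseline, current, fields=("model", "prompt_tokens", "verification")):
    changes = {}
    for field in fields:
        if baseline.get(field) != current.get(field):
            changes[field] = (baseline.get(field), current.get(field))
    return changes

# A model swap with an empty verification field shows up immediately:
print(diff_runs(
    {"model": "m-1", "prompt_tokens": 1200, "verification": "tests passed"},
    {"model": "m-2", "prompt_tokens": 900, "verification": ""},
))
# {'model': ('m-1', 'm-2'), 'prompt_tokens': (1200, 900), 'verification': ('tests passed', '')}
```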

What these have in common

All four are easy to describe and surprisingly hard to debug, and the reason is the same.

The agent ran. The vendor saw the call. The chat saw the prompt. None of those views capture the things you actually want to look at when something costs more than expected:

  • the active identity at run time (account, workspace, branch, worktree)
  • the call shape (model, parameters, attachments, sizes)
  • the process footprint (what was spawned, what is still running)
  • the observed cost (tokens, time, anything that crossed a threshold)
  • the deviation from baseline (last week this same run cost X)

That set is small. None of it is exotic. None of it requires a hosted dashboard. It just has to exist somewhere a person can read it in two minutes and the next agent can reuse it.

What a useful run record would capture

I keep coming back to a boring shape:

Ask: <one or two lines of intent, not paraphrased>
Identity: <model, account, workspace, branch, worktree>
Inputs: <files read, attachments sent>
Calls: <tool / API calls with model attribution>
Outputs: <diffs, generated artifacts, where they ended up>
Verification: <tests, lints, build, browser checks>
Cost footprint: <tokens, wall time, surprising spikes>
Process footprint: <subprocesses started, still running>
Open risks: <what the agent suspects but did not confirm>
Next-agent handoff: <first three things a fresh agent should do>

Plain markdown. One file per meaningful run. Human-readable first, structured enough for the next agent second.
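
Writing the file is the boring part, which is the point. A sketch, with section names mirroring the shape above and everything else (paths, how the values are gathered) left open:

```python
# Sketch: one markdown file per meaningful run, written locally.
from datetime import datetime
from pathlib import Path

FIELDS = (
    "Ask", "Identity", "Inputs", "Calls", "Outputs", "Verification",
    "Cost footprint", "Process footprint", "Open risks", "Next-agent handoff",
)

def write_run_record(record, out_dir="runs"):
    Path(out_dir).mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = Path(out_dir) / f"run-{stamp}.md"
    lines = [f"# Run {stamp}", ""]
    for field in FIELDS:  # keep the ordering so every record reads the same way
        lines.append(f"## {field}")
        lines.append(str(record.get(field, "<not recorded>")))
        lines.append("")
    path.write_text("\n".join(lines))
    return path
```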

The point of the cost footprint and process footprint sections is not to replace the invoice. It is to let you say, with the run in front of you:

  • Of course it cost that much — look at this 66K-token input.
  • Of course it billed the wrong workspace — the run shows the other one was active.
  • Of course there is a leaked subprocess — the run started one and never recorded it being torn down.
  • Of course production drifted — the run identifies a model swap and the verification field is empty.

That is what "cost is a debugging problem" looks like in practice. The bill is the headline. The record is the diagnosis.

The smallest thing I would build

I would not build a dashboard. I would write a small per-run markdown file with the shape above, locally, and stop.

No sync server. No vector store. No vendor lock-in. No new schema until the markdown shape has survived three real runs without changing.

If that file existed, every "why did this cost X?" question would have a place to start that is not "let me scroll back through chat history." Most of the cost surprises above would stop being mysteries.

The reason this is not a product yet is the same reason it is interesting: nobody has accepted that agent cost is a local-debugging surface. Vendors think it is their billing UI. Observability tools think it is a span. Memory products think it is something to retrieve later. None of them write the boring per-run file that would close most of the open questions.

Closing

If your agent is burning money in a way you cannot explain in two minutes, the problem is not your budget. The problem is that the run did not write itself down.

Cost bugs are debugging bugs. The fix is run truth.

If you have hit any of the four failures above and patched around them differently — I want to hear about it. The shape of the right artifact is still up for grabs.
