DEV Community: Morgan

Agent cost bugs are debugging bugs

Morgan — Fri, 15 May 2026 13:46:43 +0000

A coding agent does not need to bankrupt you to create a cost bug.

It just needs to make the run impossible to explain.

You see the number on the invoice. You see the "done" in chat. You cannot connect them.

Cost is not only billing

When developers talk about agent costs, the conversation usually drifts toward dashboards and rate limits — invoice-shaped problems with invoice-shaped fixes. That framing hides the actual pain.

The pain in real workflows is closer to this: an agent runs, something happens, the bill or the environment changes in a way you did not expect, and you cannot quickly tell why.

The fix is not a fancier billing UI. The fix is a record of what the run actually did.

I keep noticing the same four shapes show up under "cost." All four are really debugging bugs.

Four failures that look like cost but are really run-truth failures

1. Unexpected token budgets

A developer sends one image to a vision-capable model and watches the prompt-token count balloon to something they did not predict. The docs say one thing; the meter says another. They are left manually reconciling published documentation against observed billing.

That is not a billing problem. That is a "what did this run actually consume?" problem. The run did not carry its own accounting. The developer is doing post-hoc forensics with whatever scraps the chat history preserved.

This shows up repeatedly in vendor forums. There is nothing exotic going on — properly formatted requests, supported image sizes, normal API surfaces — and the result still surprises. Without a per-run record that ties the call shape to the observed cost, the conversation is forced upstream into "is the model overcounting?" instead of staying local where it could be diagnosed.
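To make "tie the call shape to the observed cost" concrete, here is a minimal sketch of a per-call record. Everything in it is an assumption for illustration: the `CallRecord` and `Attachment` names, the `1.5` threshold, and the sample numbers are made up, and the observed token count is something you would copy from whatever usage data your provider returns.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attachment:
    kind: str            # e.g. "image"
    size_bytes: int
    detail: str = ""     # free-form note, e.g. "1024x768"


@dataclass
class CallRecord:
    model: str
    attachments: List[Attachment] = field(default_factory=list)
    expected_prompt_tokens: Optional[int] = None   # your own estimate from the docs
    observed_prompt_tokens: Optional[int] = None   # copied from the API response

    def surprise_ratio(self) -> Optional[float]:
        """Observed / expected prompt tokens, or None if either is missing."""
        if not self.expected_prompt_tokens or self.observed_prompt_tokens is None:
            return None
        return self.observed_prompt_tokens / self.expected_prompt_tokens


def describe(call: CallRecord, threshold: float = 1.5) -> str:
    """One line a human can scan: call shape first, then the numbers."""
    sizes = ", ".join(f"{a.kind}:{a.size_bytes}B" for a in call.attachments) or "none"
    ratio = call.surprise_ratio()
    flag = ""
    if ratio is not None and ratio >= threshold:
        flag = f" SURPRISE x{ratio:.1f}"
    return (
        f"model={call.model} attachments=[{sizes}] "
        f"expected={call.expected_prompt_tokens} "
        f"observed={call.observed_prompt_tokens}{flag}"
    )


# Illustrative values only, not measurements.
call = CallRecord(
    model="example-vision-model",
    attachments=[Attachment("image", 250_000, "1024x768")],
    expected_prompt_tokens=800,
    observed_prompt_tokens=2400,
)
print(describe(call))
```

The point of keeping the expectation next to the observation is that the surprise gets recorded at the moment it happens, with the attachment sizes beside it, rather than reconstructed later.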

2. Credits or usage attached to the wrong workspace

You are working in one client. You buy more usage. The credits land on a different account, or the same account in a different workspace, or the same workspace in a different surface. The work is paused while you try to figure out which identity owned the run.

This is not a credit-card issue. It is an identity attribution issue masquerading as a money issue. The run did not record which workspace, account, and environment were actually active when it spent budget.

When attribution is invisible at run time, the only recovery path is a vendor support ticket. That is fine occasionally. It is a productivity disaster as a default mode.
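Capturing identity at run time can be very small. The sketch below assumes the active account and workspace are exposed through two environment variables, `AGENT_ACCOUNT` and `AGENT_WORKSPACE`; those names are hypothetical, so substitute whatever your client actually exposes. The git calls are standard `git rev-parse` invocations and return `None` if git is missing or the directory is not a repository.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command and return trimmed stdout, or None on any failure."""
    try:
        out = subprocess.run(
            ["git", *args], capture_output=True, text=True, timeout=5, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return out.stdout.strip() or None


def identity_snapshot(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Record who and where the run was, at the moment it starts spending."""
    env = os.environ if env is None else env
    return {
        "account": env.get("AGENT_ACCOUNT"),        # hypothetical variable name
        "workspace": env.get("AGENT_WORKSPACE"),    # hypothetical variable name
        "branch": _git("rev-parse", "--abbrev-ref", "HEAD"),
        "worktree": _git("rev-parse", "--show-toplevel"),
        "cwd": os.getcwd(),
    }


print(identity_snapshot())
```

A missing value is recorded as `None` rather than guessed, which is itself useful: an empty workspace field tells you attribution was never visible to the run.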

3. Local processes that keep running after the agent is done

You finish a session in a desktop client. You close it. Hours later you notice a Python interpreter still using memory, still doing whatever the agent was doing when the window closed. No one is watching it. No one is billed for it on paper. But the laptop is doing work nobody asked for, and the agent that started it has no idea it left a tail.

That is a cost in machine resources, in attention, in trust. And it is invisible to every vendor dashboard because it is local by definition. The only way to surface it would be a run record that says "I started a subprocess, here is its handle, here is what should happen to it when I exit."

Today nothing in the agent stack writes that down. You discover the leak by looking at Activity Monitor and asking yourself, "what is this and where did it come from?"
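A rough sketch of what "writing that down" might look like, using only Python's standard `subprocess` module. The `ProcessFootprint` class is a hypothetical helper, not part of any agent framework, and it only sees processes started through it.

```python
import subprocess
import sys
from typing import Dict, List


class ProcessFootprint:
    """Track subprocesses a run started so leftovers are visible at exit."""

    def __init__(self) -> None:
        self._procs: List[subprocess.Popen] = []

    def start(self, argv: List[str]) -> subprocess.Popen:
        proc = subprocess.Popen(argv)
        self._procs.append(proc)
        return proc

    def report(self) -> List[Dict[str, object]]:
        """One entry per started process; poll() is None while it is alive."""
        return [
            {"pid": p.pid, "argv": p.args, "still_running": p.poll() is None}
            for p in self._procs
        ]

    def terminate_leftovers(self, timeout: float = 5.0) -> None:
        """Ask each live process to stop; kill it if it ignores the request."""
        for p in self._procs:
            if p.poll() is None:
                p.terminate()
                try:
                    p.wait(timeout=timeout)
                except subprocess.TimeoutExpired:
                    p.kill()
                    p.wait()


if __name__ == "__main__":
    footprint = ProcessFootprint()
    footprint.start([sys.executable, "-c", "import time; time.sleep(60)"])
    print(footprint.report())        # the sleeper shows up as still running
    footprint.terminate_leftovers()
    print(footprint.report())        # and as stopped after cleanup
```

Whether the run should terminate leftovers or merely list them is a policy choice. The record matters either way, because it answers "what is this and where did it come from?" without a trip to Activity Monitor.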

4. Model swaps that change production behavior

A team updates a model version. Same API, same prompts, same client code. Behavior drifts. The new model is faster, or cheaper, or differently tuned, and it silently stops doing the part of the job everyone was relying on.

This is not advertised as a cost issue, but it is the same family. The cost of the change shows up as a quality regression in production. The team is running tests after the fact to figure out what shifted. The model swap was framed as low-risk because the surface stayed the same.

A run record would not prevent the regression. But it would make the diff readable. You would have a frozen record of what the old model did and what the new model does, attached to the same run shapes, instead of a slow reconstruction from chat memory and prod logs.
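As a sketch of what a "readable diff" could mean, the function below compares two run records for the same run shape and lists only the fields that changed. The record keys and the sample values (`model-v1`, `output_has_citations`, and so on) are invented for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def diff_records(old: Record, new: Record) -> List[Tuple[str, object, object]]:
    """Return (field, old_value, new_value) for every field that differs."""
    keys = sorted(set(old) | set(new))
    return [(k, old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)]


# Hypothetical records for the same run shape, before and after a model swap.
before = {
    "run_shape": "summarize-ticket",
    "model": "model-v1",
    "verification": "12/12 tests passed",
    "output_has_citations": True,
}
after = {
    "run_shape": "summarize-ticket",
    "model": "model-v2",
    "verification": "",
    "output_has_citations": False,
}

for name, old_value, new_value in diff_records(before, after):
    print(f"{name}: {old_value!r} -> {new_value!r}")
```

The comparison is deliberately dumb. It only works because both records were frozen at run time with the same fields, which is the part that is usually missing.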

What these have in common

All four are easy to describe and surprisingly hard to debug, and the reason is the same.

The agent ran. The vendor saw the call. The chat saw the prompt. None of those views capture the things you actually want to look at when something costs more than expected:

the active identity at run time (account, workspace, branch, worktree)
the call shape (model, parameters, attachments, sizes)
the process footprint (what was spawned, what is still running)
the observed cost (tokens, time, anything that crossed a threshold)
the deviation from baseline (last week this same run cost X)

That set is small. None of it is exotic. None of it requires a hosted dashboard. It just has to exist somewhere a person can read it in two minutes and the next agent can reuse it.
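The "deviation from baseline" item is the only one that needs arithmetic, and not much of it. A minimal sketch, assuming you keep earlier costs for the same run shape in a plain list; using the median as the baseline is my choice here, not something the records require.

```python
from statistics import median
from typing import List, Optional


def deviation_from_baseline(observed: float, history: List[float]) -> Optional[float]:
    """Ratio of observed cost to the median of earlier runs of the same shape."""
    if not history:
        return None          # no baseline yet, so nothing to compare against
    base = median(history)
    if base == 0:
        return None          # avoid dividing by a zero-cost baseline
    return observed / base


# Illustrative token counts, not measurements.
print(deviation_from_baseline(3000, [1000, 1200, 900]))
```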

What a useful run record would capture

I keep coming back to a boring shape:

Ask: <one or two lines of intent, not paraphrased>
Identity: <model, account, workspace, branch, worktree>
Inputs: <files read, attachments sent>
Calls: <tool / API calls with model attribution>
Outputs: <diffs, generated artifacts, where they ended up>
Verification: <tests, lints, build, browser checks>
Cost footprint: <tokens, wall time, surprising spikes>
Process footprint: <subprocesses started, still running>
Open risks: <what the agent suspects but did not confirm>
Next-agent handoff: <first three things a fresh agent should do>

Plain markdown. One file per meaningful run. Human-readable first, structured enough for the next agent second.
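Writing that file takes very little code. The sketch below renders the ten fields above in a fixed order and writes one file per run. The function names, the `(not recorded)` placeholder, and the `runs/` directory are assumptions of mine, not a settled format.

```python
from pathlib import Path
from typing import Dict

FIELDS = [
    "Ask", "Identity", "Inputs", "Calls", "Outputs", "Verification",
    "Cost footprint", "Process footprint", "Open risks", "Next-agent handoff",
]


def render_run_record(values: Dict[str, str]) -> str:
    """Render the fields in fixed order; missing fields stay visibly empty."""
    lines = [f"{name}: {values.get(name, '(not recorded)')}" for name in FIELDS]
    return "\n".join(lines) + "\n"


def write_run_record(directory: Path, run_id: str, values: Dict[str, str]) -> Path:
    """Write one file per run and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{run_id}.md"
    path.write_text(render_run_record(values), encoding="utf-8")
    return path


if __name__ == "__main__":
    written = write_run_record(
        Path("runs"),                      # hypothetical local directory
        "example-run",                     # hypothetical run id
        {"Ask": "rename the config loader", "Verification": "unit tests passed"},
    )
    print(written.read_text(encoding="utf-8"))
```

Leaving unrecorded fields visible is deliberate: an empty Verification line is exactly the signal you want to notice later.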

The point of the cost footprint and process footprint sections is not to replace the invoice. It is to let you say, with the run in front of you:

Of course it cost that much — look at this 66K-token input.
Of course it billed the wrong workspace — the run shows the other one was active.
Of course there is a leaked subprocess — the run started one and never recorded it being torn down.
Of course production drifted — the run identifies a model swap and the verification field is empty.

That is what "cost is a debugging problem" looks like in practice. The bill is the headline. The record is the diagnosis.

The smallest thing I would build

I would not build a dashboard. I would write a small per-run markdown file with the shape above, locally, and stop.

No sync server. No vector store. No vendor lock-in. No new schema until the markdown shape has survived three real runs without changing.

If that file existed, every "why did this cost X?" question would have a place to start that is not "let me scroll back through chat history." Most of the cost surprises above would stop being mysteries.

The reason this is not a product yet is the same reason it is interesting: nobody has accepted that agent cost is a local-debugging surface. Vendors think it is their billing UI. Observability tools think it is a span. Memory products think it is something to retrieve later. None of them write the boring per-run file that would close most of the open questions.

Closing

If your agent is burning money in a way you cannot explain in two minutes, the problem is not your budget. The problem is that the run did not write itself down.

Cost bugs are debugging bugs. The fix is run truth.

If you have hit any of the four failures above and patched around them differently, I want to hear about it. The shape of the right artifact is still up for grabs.

Agents need a black box recorder, not more memory

Morgan — Thu, 14 May 2026 20:21:21 +0000

Every agent product eventually ends up talking about memory.

Longer memory. Better memory. Shared memory. Vector memory. Persistent memory.

I get why. Anyone who has used coding agents for real work has hit the same
wall: the agent loses context, forgets what happened in another client, repeats
itself, or makes a change that is hard to reconstruct later.

But I think "memory" is the wrong primary frame.

The more useful question is:

After the run is over, can I answer what happened?

Not just what the final answer was. What actually happened.

What did the user ask?

What files, tools, docs, and prior context were in play?

Why did the agent call a tool?

Which model produced that action?

What changed?

What did it cost?

Can I replay, audit, or explain the chain?

That is less like a second brain and more like a black box recorder.

The pain is showing up everywhere

The agent tooling conversations I keep seeing are not only about storage.
They are about operational trust.

One MCP discussion described the problem of context being trapped inside one
client. You can brainstorm on mobile, continue in the web app, then open a
coding agent locally and it has no idea what just happened.

That is not just a memory problem. It is a continuity problem.

Another thread proposed standard audit context for AI-initiated MCP tool calls:
why the AI invoked a tool, and which model produced that invocation.

That is not just a logging problem. It is an accountability problem.

Other threads are circling server identity, tool provenance, permission specs,
and tool bills of materials. People are asking questions like:

Who published this tool?
Did its metadata change?
What capabilities does it require?
Why should an agent be allowed to call it?

That is not just a security problem. It is a trust problem.

Then there are the everyday developer headaches: unexpected token usage, credits
attached to the wrong workspace, orphaned local subprocesses, tool calls that
worked in one environment but not another.

That is not just observability. It is run truth.

"Memory" hides too much

When we call all of this memory, we flatten several different needs into one
word.

Developers do need agents to remember useful context.

But they also need agents to preserve the reasoning trail around important
work:

task intent
active context
files and tools touched
model/tool calls
permission and trust assumptions
cost/token/process anomalies
receipts for important actions
a replayable or inspectable run history

Those are not all the same feature.

An agent can remember a fact and still be impossible to audit.

An agent can summarize a conversation and still leave you unable to explain why
it deleted a file, called a tool, burned tokens, or trusted a server.

The product shape I want

The layer I want is local-first and boring in the best way.

It sits under agent work and records enough truth that the user or another
agent can come back later and ask:

What happened here?

And get a useful answer.

Not a hallucinated summary. Not a vague activity feed. Not a giant dashboard
about dashboards.

A compact chain:

The user asked this.
The agent saw this context.
It chose these tools for these reasons.
These tool calls happened.
These files or external states changed.
This was the cost/runtime footprint.
These actions were approved, deferred, or blocked.
This is what a future agent should trust or re-check.

That would make agents safer to use for real work.

It would also make them easier to improve, because the failures would be
visible:

lost context
stale assumptions
wrong tool trust
runaway cost
missing approval
environment drift
actions with no durable deliverable
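One way to picture the compact chain is an append-only log with one JSON object per line. This is only a sketch under my own assumptions: the `RunLog` class, the event kinds, and the field names are illustrative, and this is not a description of how AMK works.

```python
import json
import time
from pathlib import Path
from typing import Dict, Iterator


class RunLog:
    """Append-only JSON Lines log: one event per line, never rewritten."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, kind: str, **details: object) -> None:
        """Append one event with a timestamp and free-form details."""
        event = {"ts": time.time(), "kind": kind, **details}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self) -> Iterator[Dict[str, object]]:
        """Yield events in the order they were written."""
        with self.path.open("r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


if __name__ == "__main__":
    log = RunLog(Path("run-log.jsonl"))    # hypothetical file name
    log.record("ask", text="rename the config loader")
    log.record("tool_call", tool="run_tests", model="example-model",
               reason="verify the rename did not break imports")
    log.record("approval", action="delete old_config.py", status="deferred")
    for event in log.replay():
        print(event["kind"], event)
```

Each tool call carries its reason and the model that produced it, so "why did the agent call this?" has an answer that does not depend on a summary written after the fact.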

The phrase I keep coming back to

Agents do not only need memory.

They need a local truth layer.

Something closer to:

inspect, replay, and trust agent work across tools and clients.

That is the direction I am exploring with AMK.

The goal is not another knowledge base. The goal is to make "what happened?"
answerable after the run is over.

Because once agents are doing real work, that question matters more than almost
anything else.