
The Meter Was Always Running

Daniel Nwaneri on February 20, 2026

On the All-In podcast (episode #261), Jason Calacanis revealed his AI agents cost $300 a day. Each. At 10-20% capacity. Chamath Palihapitiya is no...
Ben Halpern

That's not a technology question. It's a judgment question.

That applies to most decisions like this.

Very well written overall!

Daniel Nwaneri

Appreciate that, and you're right, most are.
What makes this one feel different is the speed. Bad technology decisions used to have lag: months before the debt showed up.

Harsh

This is exactly the "cloud cost wake-up call" all over again — except compressed into 18 months instead of a decade.

With cloud, we had years to optimize. With AI agents, teams are getting $300/day invoices before they even know what "token budget" means.

The real shift coming in 2026: every PR will include estimated token cost alongside test coverage.
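If that prediction plays out, the CI-side math is simple enough to sketch. Everything below is hypothetical: a crude characters-per-token heuristic and a made-up per-million-token price, just to show the shape of a per-PR estimate.

```typescript
// Hypothetical per-PR token cost estimate. The chars/4 heuristic and the
// $15-per-million-token price are illustrative, not real provider numbers.

function estimateTokens(diffText: string): number {
  // Rough rule of thumb: ~4 characters per token for English-ish text.
  return Math.ceil(diffText.length / 4);
}

function estimatedCostUSD(diffText: string, pricePerMTokens = 15): number {
  return (estimateTokens(diffText) / 1_000_000) * pricePerMTokens;
}

// A CI step could print the result next to test coverage, e.g.:
// console.log(`Estimated agent cost for this diff: $${estimatedCostUSD(diff).toFixed(4)}`);
```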

Daniel Nwaneri

The cloud parallel is right, but the compression is what changes everything.
Cloud gave teams years to build cost culture: FinOps, tagging, reserved instances, the whole apparatus. Teams had time to fail slowly and learn.

$300/day invoices arriving before you've even defined what an agent session should cost means the reckoning is happening before the infrastructure to handle it exists.

The PR token cost idea is interesting. Curious who owns that number in your mental model: the dev who wrote the prompt, the team that approved the architecture, or the infra budget that pays the bill?

Vic Chen

The cost inversion point hits home. Building an AI pipeline for parsing SEC 13F filings — the agentic version runs significantly more expensive than the original hand-crafted solution whenever documents are ambiguous or edge-casey. The meter isn't just always running, it runs faster on exactly the hard cases you were hoping AI would handle cheaply.

The institutional memory problem is the one I think gets underestimated most. Human engineers who've seen a system fail carry that context implicitly. Agents need it externalized — which only works if you've been disciplined about writing things down all along. Most orgs haven't.

Chamath's question about token budgets for 'best devs' is actually the right frame for evaluating AI tooling too: what's the marginal output per dollar spent, and does it compound over time the way human judgment does?

Daniel Nwaneri

"Runs faster on exactly the hard cases" is the part I didn't make explicit and should have.

The cost model breaks on the cases that matter most. Easy documents are cheap; ambiguous edge cases, the ones you built the pipeline for, are where the meter compounds. You can't budget from volume. You have to budget from complexity, and complexity is hardest to estimate before you've already paid for it.

The compounding question is the one I keep turning over. Documentation helps, but it always lags the failure that made it necessary. The engineer who saw the system break in 2022 carries that implicitly. Writing it down catches the next person, not the moment it happened. Agents need it externalized before the edge case, not after.

Vic Chen

"externalized before the edge case" is the exact problem. and it's structurally hard because you don't know which edge cases matter until the pipeline has already encountered them in production.

what we ended up doing is building a feedback loop into the pipeline itself: when the agent hits a case that costs more than 3x baseline, it writes a structured note about what happened and why. not documentation in the traditional sense — more like a running case file that gets prepended to context on future runs. the agent essentially writes its own precedent.

it's imperfect. the first time any new edge case appears, you still pay the discovery cost. but the second time through the same territory it's cheap. over time the case file becomes the implicit knowledge that the 2022 engineer was carrying around. you're not eliminating the lag β€” you're compressing it.
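A minimal sketch of that loop, with hypothetical names (the thread describes the behavior, not the real pipeline's code): record a precedent only when a run's cost crosses the 3x baseline, and prepend matching precedents to context on future runs.

```typescript
// Sketch of the cost-triggered case file described above. All names are
// hypothetical; only the 3x-baseline trigger comes from the thread.

interface Precedent {
  docType: string;      // kept narrow and specific, per the later comments
  costMultiple: number; // how far above baseline the run landed
  note: string;         // structured note: what happened and why
}

const caseFile: Precedent[] = [];

// After a run: write a precedent only if cost exceeded threshold x baseline.
function recordIfExpensive(
  docType: string,
  cost: number,
  baseline: number,
  note: string,
  threshold = 3,
): boolean {
  const multiple = cost / baseline;
  if (multiple < threshold) return false;
  caseFile.push({ docType, costMultiple: multiple, note });
  return true;
}

// Before a run: prepend matching precedents so the agent sees its own history.
function contextFor(docType: string, basePrompt: string): string {
  const matches = caseFile.filter((p) => p.docType === docType);
  if (matches.length === 0) return basePrompt;
  const notes = matches.map((p) => `- ${p.note}`).join("\n");
  return `Prior incidents for ${docType}:\n${notes}\n\n${basePrompt}`;
}
```

The first encounter still pays full discovery cost; the payoff only comes on recurrence, which matches the "compressing the lag" framing.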

Daniel Nwaneri

"The agent writes its own precedent." That reframes the whole problem.
I was thinking about institutional memory as externalizing what humans carry. You built something different: agent-native memory generated from production failures. The case file isn't documentation of what a human knew; it's a record of what the agent learned the hard way.
The 3x baseline trigger is the right design. Cost as the signal for when to write: not time, not complexity, not human judgment about what matters. The pipeline decides what's worth remembering by what it paid to discover.
What happens when two previously seen edge cases interact in a way neither precedent anticipated? Does the case file help the agent recognize the combination, or does it surface two conflicting precedents and add noise to the decision?

Vic Chen

"agent-native memory" is the right frame. it's not documentation of human knowledge — it's the pipeline's scar tissue.

the conflicting precedents question is the one that keeps it honest. what we've seen: when two case file entries point in opposite directions on the same document type, the pipeline treats that as a signal to defer rather than resolve. not 'which precedent wins' but 'we've been here before in two different ways, this needs a human.'

the harder failure mode is what you're pointing at — when the two precedents are close enough that neither triggers separately but together they push output in a direction neither anticipated. we don't have a clean answer for that. the partial solution is keeping precedents narrow and specific (this fund type, this XML namespace pattern) rather than general (ambiguous structured data). broader precedents create more false-match surface area.

the honest version: the case file works well for exact-match recurrence. it degrades on novel combinations. that's still better than nothing, but it means the 3x trigger exists not just to write precedents — it also exists to keep humans in the loop on cases the file can't yet handle.
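The deferral behavior has a simple shape. A sketch, with invented direction labels (the thread describes the behavior, not the implementation), of "defer rather than resolve" when matching precedents disagree:

```typescript
// Sketch of deferral on conflicting precedents. Direction labels are made up.

type Direction = "parse-strict" | "parse-lenient";

interface PrecedentEntry {
  docType: string;
  direction: Direction;
}

type Decision =
  | { kind: "follow"; direction: Direction }
  | { kind: "defer" }          // "we've been here before in two different ways"
  | { kind: "no-precedent" };  // genuine novelty: pay the discovery cost

function decide(docType: string, caseFile: PrecedentEntry[]): Decision {
  const matches = caseFile.filter((p) => p.docType === docType);
  if (matches.length === 0) return { kind: "no-precedent" };
  const directions = new Set(matches.map((p) => p.direction));
  if (directions.size > 1) return { kind: "defer" }; // escalate to a human
  return { kind: "follow", direction: matches[0].direction };
}
```

The conflict check is deliberately dumb: any disagreement among matches escalates, rather than trying to score which precedent "wins".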

Daniel Nwaneri

"Scar tissue" is the better frame. Institutional memory implies something transferable, preservable, documentable. Scar tissue is specific to the system that got hurt. It doesn't generalize cleanly, and it shouldn't have to.

The deferral mechanism is the most honest design choice in what you've described: not resolving conflict between precedents but escalating it.

The system knowing what it doesn't know is harder to build than the system knowing what it knows.

"Degrades on novel combinations" is the honest boundary. What you've built handles recurrence well and fails gracefully on genuine novelty, which is better than most systems that handle neither. The 3x trigger as a human escalation signal rather than just a memory trigger closes the loop correctly.

This thread has covered more ground than most papers on agent memory architecture. worth writing up properly somewhere.

Vic Chen

'scar tissue' is sharper. institutional memory implies you can extract it, document it, hand it to the next system. scar tissue is bound to the body that got hurt. the analogy is more honest about what agent-native memory actually is — and more honest about its limits.

'the system knowing what it doesn't know is harder to build than knowing what it knows' — the 3x trigger is a proxy for that. when cost spikes, the system infers 'something unusual happened here' without necessarily knowing what. it's less a knowledge store and more a stupidity detector — it marks the boundary of known territory so the system can behave differently at the edge.

you're right that this is worth writing up. this thread actually clarified some things I hadn't put into words before — particularly the scar tissue vs. institutional memory distinction, and deferral as the correct response to conflicting precedents (not resolution). going to draft something properly. will share when it's done.

Daniel Nwaneri

"Stupidity detector" is the right name for it. Marks the boundary of known territory so the system behaves differently at the edge — that's a more honest description of what most safety mechanisms actually are.
Looking forward to the writeup. Tag me when it's done.

Vic Chen

will do. this conversation has been genuinely useful — the 'stupidity detector' framing didn't exist for me before this thread, and now it's the clearest way I have to explain why the 3x trigger is designed the way it is. that's what good comments do. will tag you when it's published.

Matthew Hou

The cloud cost parallel is useful but I think there's a difference people are missing: with cloud, the cost scaled with users. More traffic meant more spend, but also more revenue to justify it. With AI agents, the cost scales with complexity of the task, not usage. A single internal workflow with zero users can burn through tokens faster than a user-facing feature.

That changes the calculus. You can't just slap a usage-based pricing model on it and call it a day. The teams that figure out how to cap agent reasoning depth without killing usefulness are going to have a real edge.

Daniel Nwaneri

The scaling axis distinction is the thing I didn't make explicit and should have.
Cloud cost and revenue moved together: more spend meant more users meant more justification. Agent cost is decoupled from value delivery entirely. An internal workflow nobody uses can run the meter as hard as your highest-traffic feature. That breaks the FinOps playbook, because you can't tag your way to insight when the cost driver is reasoning depth, not request volume.
The reasoning depth cap question is the one I don't have a clean answer to. Curious whether you think that's solvable at the prompt layer, the architecture layer, or whether it has to come from the provider: token budgets baked into the API itself rather than enforced by the team calling it.

Matthew Hou

The decoupling of cost from value delivery is the key insight. With traditional infrastructure, spend correlates with usage which correlates with revenue. With agents, a poorly designed workflow can burn tokens on reasoning that produces nothing useful — and your monitoring dashboard won't flag it because the requests "succeeded." The reasoning depth cap question is hard because it's task-dependent. A one-size-fits-all token budget doesn't work. I've been experimenting with per-task budgets based on expected complexity, but it's still more art than science.
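One way to sketch the per-task idea: budgets keyed on a complexity class, plus a check that flags runs that "succeeded" but spent far past their class. The classes, numbers, and slack factor are all invented for illustration.

```typescript
// Hypothetical per-task token budgets. Classes and numbers are made up;
// the point is that the budget is keyed on expected complexity, not volume.

type Complexity = "routine" | "ambiguous" | "novel";

const budgets: Record<Complexity, number> = {
  routine: 2_000,    // known document shapes
  ambiguous: 12_000, // edge-casey inputs that need deeper reasoning
  novel: 40_000,     // first contact: expect discovery cost
};

// A "silent burn": the request succeeded, but spend blew past the class
// budget. This is exactly what a success-rate dashboard won't flag.
function silentBurn(cls: Complexity, tokensSpent: number, slack = 1.5): boolean {
  return tokensSpent > budgets[cls] * slack;
}
```

The calibration problem from the comment above lives entirely in the `budgets` table: those numbers are guesses until production data fills them in.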

Daniel Nwaneri • Edited

"Succeeded" is doing a lot of work there. The request completed. The output was useless. Monitoring can't tell the difference.
Per-task complexity budgets are the right architecture. Flat token limits kill legitimate deep reasoning on hard problems. The art-vs-science problem is real, though: until you have enough production data to calibrate expected complexity per task type, you're guessing.
Has the per-task approach caught any silent burns in practice?

Johnu Marattil

This line hit hard: "Human engineers are expensive upfront and cheaper over time. Agents are cheap upfront and expensive over time." I run a SaaS solo and use AI heavily across my workflow - content, code, outreach. But the moment I stopped reviewing what the AI produced, quality tanked, and costs crept up. The AI handles volume. The judgment of when to ship, when to rewrite, and when to scrap entirely - that's still the expensive part. Great piece.

Daniel Nwaneri

"When to ship, when to rewrite, when to scrap entirely" is the clearest description of the judgment layer I've seen in these comments.
Those three decisions look simple from the outside. They're not. They require knowing your users, your technical debt, your tolerance for risk, and what "done" actually means for this specific thing.

The AI handles volume precisely because it doesn't have to know any of that.
The moment you stopped reviewing is the moment you delegated those three decisions without realizing it. Quality tanked because judgment got outsourced along with execution.

Chris Wakefield (Chris)

As a software engineer myself, this is really insightful when everyone around is talking about how we're all about to lose our jobs to AI. I've noticed my own role shifting more toward architecture, building out solution proposals as opposed to writing code. Interesting take, thanks.

Daniel Nwaneri • Edited

"Solution proposals as opposed to writing code" is the clearest description of the shift I've seen in the comments.
The role didn't shrink. The center of gravity moved. The judgment about what to build and how to structure it was always the valuable part; it's just more visible now that the execution layer is cheaper.
The engineers who feel most displaced are usually the ones whose contribution was always the proposal layer but never got credit for it, because it was bundled with the coding work.

Matthew Hou

This reframing of error handling as an architectural choice hit home. I had a side project last year with a chain of API calls, each wrapped in its own try/catch. One of them failed silently — caught the error, logged it, returned undefined — and everything downstream kept running on bad data. It took way longer than it should have to figure out what was wrong.

Switched to a single top-level boundary that halts everything on an unexpected throw. The code looks less "safe", but the system fails honestly now instead of quietly producing garbage.
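A compressed sketch of that before/after, with a hypothetical parsing step standing in for one link of the original API chain:

```typescript
// Sketch of the single top-level boundary vs. per-step swallowing.
// parseStep is a made-up stand-in for one call in the original chain.

function parseStep(raw: string): number {
  const n = Number(raw);
  if (Number.isNaN(n)) throw new Error(`unparseable input: ${raw}`);
  return n; // throw on bad input; never catch-and-return-undefined here
}

// One boundary for the whole run: any throw halts everything, so nothing
// downstream ever runs on bad data.
function run(inputs: string[]): number[] | null {
  try {
    return inputs.map(parseStep);
  } catch (err) {
    console.error("run halted:", err); // fail loudly and completely
    return null;
  }
}
```

The per-step version would wrap each `parseStep` call in its own try/catch and keep going; this version trades that "safe" look for an honest, total failure.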