On the All-In podcast (episode #261), Jason Calacanis revealed his AI agents cost $300 a day. Each. At 10-20% capacity.
Chamath Palihapitiya is now asking his team: "what's the token budget for our best devs?"
These are sophisticated tech investors. They've been funding AI companies for years. And they're only now doing the math.
That's not surprising. That's the whole problem.
The Subsidized Illusion
AI providers needed adoption. So they subsidized usage — consumer plans priced as loss leaders, enterprise tiers undercut to drive lock-in, free tiers generous enough to build habits.
It worked. Developers integrated. Companies built pipelines. Engineering orgs restructured around the assumption that AI was cheap.
Now the subsidies are ending. The gap between what individuals pay and what it actually costs to run these models is closing. And the companies that made irreversible hiring decisions based on subsidized pricing are finding out the meter runs whether or not the agent is doing anything useful.
Calacanis didn't discover a bug. He discovered the business model.
The Cost Inversion
A human engineer runs on coffee. Remembers context from three years ago without being prompted. Builds institutional knowledge that compounds over time. Costs the same whether they're solving a hard problem or an easy one.
An agent costs more the longer it thinks. Every validation loop, every subagent spawned for a task that didn't need one, every redundant research pass — that's tokens. The meter doesn't care whether the thinking was necessary.
This is the inversion nobody modeled. Human cost is fixed and predictable. Agent cost scales with complexity, with uncertainty, with the kind of open-ended problems that used to be exactly what you paid senior engineers for.
The companies that fired engineers to replace them with agents didn't just make a talent decision. They traded a predictable cost structure for an exponential one. And unlike an employee who might stick around during a rough patch, take a pay cut to help the company survive, or work weekends because they care — the API bill arrives on the first of the month regardless.
You can negotiate with a person. You cannot negotiate with an endpoint.
The Question That Answers Itself
Chamath Palihapitiya is asking his team: "what's the token budget for our best devs?"
Sit with that for a second.
He's not asking about his average devs. Not his junior devs. His best devs. The ones whose judgment is worth paying for even when the meter is running.
That question only makes sense if you already know the answer to the one underneath it: which developers are worth the cost?
The ones Below the API aren't. Boilerplate, basic CRUD, unit tests for simple functions — AI does that cheaper. That's not controversial anymore. The market has decided.
(I mapped this divide in detail in Above the API — the short version: Below is everything AI handles cheaper and faster, Above is everything requiring judgment and context AI can't access.)
The ones Above the API are. System design, debugging the race condition nobody expected, knowing which trade-off matters in your specific context, foreseeing the disaster before it reaches production. Those require judgment that doesn't degrade with context length and doesn't spin up a subagent to validate something it already knows.
Chamath is running the Above/Below calculation whether he knows the framework or not. AI-assisted developers need to be 2x productive just to justify the cost — but 2x at what? Not at generation. At judgment. At the work the agent can't do reliably even at full capacity.
The developers who survived the first wave of cuts weren't the best prompters. They were the ones whose thinking was worth the token budget.
The Memory Problem
An agent doesn't remember why you made the decision three years ago. Doesn't know that the same approach melted the database in production in 2022. Doesn't carry the institutional scar tissue that stops experienced engineers from repeating expensive mistakes.
Every session starts from zero. You can feed it context — documentation, architecture decision records, past postmortems — but only if that context was written down. Most institutional knowledge never gets written down. It lives in the heads of people who were there.
This is the cost nobody put in the model. Human engineers are expensive upfront and cheaper over time — they accumulate context, build relationships, develop intuitions about your specific system. Agents are cheap upfront and expensive over time — every session needs to be re-oriented, every assumption re-established, every unwritten rule re-discovered the hard way.
The companies that moved fastest to replace engineers with agents also moved fastest to destroy their own institutional memory. Not through malice. Just through math — if the human isn't there, the knowledge isn't there either.
Harrison Chase put it plainly: in AI agents, decisions happen at runtime. Traces become the source of truth. But traces only capture what happened. They don't capture why a decision was made six months ago by someone who no longer works there.
That gap is where the $100,000/year agent earns its keep or doesn't. And right now, most of them don't.
What the Bill Is Actually For
The companies that survive this aren't the ones who cut engineers fastest. They're the ones who figured out which engineers were worth the meter running.
That's not a technology question. It's a judgment question.
AI didn't change what good engineering looks like. It changed the cost of bad engineering. Boilerplate was always low-value — it just used to be hidden inside salaries that paid for other things too. Agents made the accounting visible. Now you can see exactly what you're paying for and whether it's worth it.
The developers worth keeping aren't the ones who prompt better. They're the ones who remember why the 2022 decision was made, who catch the race condition before it reaches production, who know which trade-off matters in your specific context and which ones the agent will get confidently wrong.
Chamath's question — token budget for our best devs — is the right question. He's just asking it a little late. The companies that asked it before the restructuring still have the people who can answer it.
The ones who didn't are running agents at 10-20% capacity, paying $100,000 a year per agent, and wondering why the bill keeps coming.
The meter was always running. They just couldn't see it before.
Top comments (22)
That applies to most decisions like this.
Very well written overall!
Appreciate that and you're right, most are.
what makes this one feel different is the speed. Bad technology decisions used to have lag: months before the debt showed up.
This is exactly the "cloud cost wake-up call" all over again — except compressed into 18 months instead of a decade.
With cloud, we had years to optimize. With AI agents, teams are getting $300/day invoices before they even know what "token budget" means.
The real shift coming in 2026: every PR will include estimated token cost alongside test coverage.
The cloud parallel is right but the compression is what changes everything.
cloud gave teams years to build cost culture. FinOps, tagging, reserved instances, the whole apparatus. teams had time to fail slowly and learn.
$300/day invoices arriving before you've even defined what an agent session should cost means the reckoning is happening before the infrastructure to handle it exists.
The PR token cost idea is interesting. Curious who owns that number in your mental model: the dev who wrote the prompt, the team that approved the architecture, or the infra budget that pays the bill?
The cost inversion point hits home. Building an AI pipeline for parsing SEC 13F filings — the agentic version runs significantly more expensive than the original hand-crafted solution whenever documents are ambiguous or edge-casey. The meter isn't just always running, it runs faster on exactly the hard cases you were hoping AI would handle cheaply.
The institutional memory problem is the one I think gets underestimated most. Human engineers who've seen a system fail carry that context implicitly. Agents need it externalized — which only works if you've been disciplined about writing things down all along. Most orgs haven't.
Chamath's question about token budgets for 'best devs' is actually the right frame for evaluating AI tooling too: what's the marginal output per dollar spent, and does it compound over time the way human judgment does?
"runs faster on exactly the hard cases" That's the part I didn't make explicit and should have.
The cost model breaks on the cases that matter most. Easy documents are cheap; the ambiguous edge cases, the ones you built the pipeline for, are where the meter compounds. You can't budget from volume. You have to budget from complexity, and complexity is hardest to estimate before you've already paid for it.
the compounding question is the one I keep turning over. documentation helps. but it always lags the failure that made it necessary. the engineer who saw the system break in 2022 carries that implicitly. writing it down catches the next person, not the moment it happened. agents need it externalized before the edge case, not after.
"externalized before the edge case" is the exact problem. and it's structurally hard because you don't know which edge cases matter until the pipeline has already encountered them in production.
what we ended up doing is building a feedback loop into the pipeline itself: when the agent hits a case that costs more than 3x baseline, it writes a structured note about what happened and why. not documentation in the traditional sense — more like a running case file that gets prepended to context on future runs. the agent essentially writes its own precedent.
it's imperfect. the first time any new edge case appears, you still pay the discovery cost. but the second time through the same territory it's cheap. over time the case file becomes the implicit knowledge that the 2022 engineer was carrying around. you're not eliminating the lag — you're compressing it.
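roughly the shape of it, in a simplified sketch — names and fields here are illustrative, not our actual pipeline code:

```typescript
// Sketch of a cost-triggered precedent log ("case file").
// PrecedentEntry, CaseFile, and the field names are hypothetical.

interface PrecedentEntry {
  taskKey: string;    // e.g. a document type or XML namespace pattern
  costTokens: number; // what the discovery cost
  note: string;       // structured note: what happened and why
}

class CaseFile {
  private entries: PrecedentEntry[] = [];

  // Record a precedent only when the run cost more than 3x baseline.
  maybeRecord(taskKey: string, costTokens: number, baseline: number, note: string): boolean {
    if (costTokens > 3 * baseline) {
      this.entries.push({ taskKey, costTokens, note });
      return true;
    }
    return false;
  }

  // Matching precedents get prepended to context on future runs,
  // so the second pass through the same territory is cheap.
  contextFor(taskKey: string): string[] {
    return this.entries.filter(e => e.taskKey === taskKey).map(e => e.note);
  }
}
```

the important design choice is that the write trigger is cost, not a human deciding what's worth remembering.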
"The agent writes its own precedent". That reframes the whole problem.
I was thinking about institutional memory as externalizing what humans carry. You built something different: agent-native memory generated from production failures. The case file isn't documentation of what a human knew; it's a record of what the agent learned the hard way.
the 3x baseline trigger is the right design. Cost as the signal for when to write: not time, not complexity, not human judgment about what matters. The pipeline decides what's worth remembering by what it paid to discover.
what happens when two previously seen edge cases interact in a way neither precedent anticipated? does the case file help the agent recognize the combination, or does it surface two conflicting precedents and add noise to the decision?
"agent-native memory" is the right frame. it's not documentation of human knowledge — it's the pipeline's scar tissue.
the conflicting precedents question is the one that keeps it honest. what we've seen: when two case file entries point in opposite directions on the same document type, the pipeline treats that as a signal to defer rather than resolve. not 'which precedent wins' but 'we've been here before in two different ways, this needs a human.'
the harder failure mode is what you're pointing at — when the two precedents are close enough that neither triggers separately but together they push output in a direction neither anticipated. we don't have a clean answer for that. the partial solution is keeping precedents narrow and specific (this fund type, this XML namespace pattern) rather than general (ambiguous structured data). broader precedents create more false-match surface area.
the honest version: the case file works well for exact-match recurrence. it degrades on novel combinations. that's still better than nothing, but it means the 3x trigger exists not just to write precedents — it also exists to keep humans in the loop on cases the file can't yet handle.
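the deferral rule itself is simple enough to sketch — again simplified, with illustrative names:

```typescript
// Sketch of the deferral rule: conflicting precedents escalate to a
// human instead of being resolved automatically. Names are hypothetical.

type Direction = "accept" | "reject";

interface Precedent {
  taskKey: string;
  direction: Direction;
}

type Decision =
  | { kind: "follow"; direction: Direction } // all precedents agree
  | { kind: "defer" }                        // conflict: needs a human
  | { kind: "novel" };                       // no precedent: pay discovery cost

function decide(precedents: Precedent[], taskKey: string): Decision {
  const matches = precedents.filter(p => p.taskKey === taskKey);
  if (matches.length === 0) return { kind: "novel" };
  const directions = new Set(matches.map(p => p.direction));
  // "we've been here before in two different ways" -> defer, don't resolve
  if (directions.size > 1) return { kind: "defer" };
  return { kind: "follow", direction: matches[0].direction };
}
```

the novel-combination failure mode lives outside this function entirely, which is exactly the problem: two precedents that each return "follow" separately can still be wrong together.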
"Scar tissue" is the better frame. Institutional memory implies something transferable, preservable, documentable. scar tissue is specific to the system that got hurt. it doesn't generalize cleanly and it shouldn't have to.
The deferral mechanism is the most honest design choice in what you've described: not resolving conflict between precedents, but escalating it.
The system knowing what it doesn't know is harder to build than the system knowing what it knows.
"Degrades on novel combinations" is the honest boundary. what you've built handles recurrence well and fails gracefully on genuine novelty which is better than most systems that handle neither. The 3x trigger as human escalation signal rather than just memory trigger closes the loop correctly...
This thread has covered more ground than most papers on agent memory architecture. worth writing up properly somewhere.
'scar tissue' is sharper. institutional memory implies you can extract it, document it, hand it to the next system. scar tissue is bound to the body that got hurt. the analogy is more honest about what agent-native memory actually is — and more honest about its limits.
'the system knowing what it doesn't know is harder to build than knowing what it knows' — the 3x trigger is a proxy for that. when cost spikes, the system infers 'something unusual happened here' without necessarily knowing what. it's less a knowledge store and more a stupidity detector — it marks the boundary of known territory so the system can behave differently at the edge.
you're right that this is worth writing up. this thread actually clarified some things I hadn't put into words before — particularly the scar tissue vs. institutional memory distinction, and deferral as the correct response to conflicting precedents (not resolution). going to draft something properly. will share when it's done.
stupidity detector" is the right name for it. marks the boundary of known territory so the system behaves differently at the edge — that's a more honest description of what most safety mechanisms actually are.
looking forward to the writeup. tag me when it's done.
will do. this conversation has been genuinely useful — the 'stupidity detector' framing didn't exist for me before this thread, and now it's the clearest way I have to explain why the 3x trigger is designed the way it is. that's what good comments do. will tag you when it's published.
The cloud cost parallel is useful but I think there's a difference people are missing: with cloud, the cost scaled with users. More traffic meant more spend, but also more revenue to justify it. With AI agents, the cost scales with complexity of the task, not usage. A single internal workflow with zero users can burn through tokens faster than a user-facing feature.
That changes the calculus. You can't just slap a usage-based pricing model on it and call it a day. The teams that figure out how to cap agent reasoning depth without killing usefulness are going to have a real edge.
the scaling axis distinction is the thing I didn't make explicit and should have.
cloud cost and revenue moved together. more spend meant more users meant more justification. agent cost is decoupled from value delivery entirely. an internal workflow nobody uses can run the meter as hard as your highest-traffic feature. That breaks the FinOps playbook because you can't tag your way to insight when the cost driver is reasoning depth, not request volume.
The reasoning depth cap question is the one I don't have a clean answer to. Curious whether you think that's solvable at the prompt layer, the architecture layer, or whether it has to come from the providers: token budgets baked into the API itself rather than enforced by the team calling it.
The decoupling of cost from value delivery is the key insight. With traditional infrastructure, spend correlates with usage which correlates with revenue. With agents, a poorly designed workflow can burn tokens on reasoning that produces nothing useful — and your monitoring dashboard won't flag it because the requests "succeeded." The reasoning depth cap question is hard because it's task-dependent. A one-size-fits-all token budget doesn't work. I've been experimenting with per-task budgets based on expected complexity, but it's still more art than science.
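the per-task approach is roughly this shape — task names and numbers are illustrative, and the calibration is the part that's still art:

```typescript
// Sketch of per-task token budgets keyed by expected complexity.
// Task types and budget numbers are hypothetical placeholders;
// real values need calibration from production data.

const budgets: Record<string, number> = {
  "summarize-standard-filing": 2_000,
  "parse-ambiguous-13f": 20_000, // hard cases get room for deeper reasoning
};

const DEFAULT_BUDGET = 5_000; // fallback for uncalibrated task types

function overBudget(taskType: string, tokensUsed: number): boolean {
  const budget = budgets[taskType] ?? DEFAULT_BUDGET;
  return tokensUsed > budget;
}
```

a flat cap would flag the ambiguous 13F parse as a burn and wave the silent one through; keying the budget to expected complexity is what lets the same token count mean different things on different tasks.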
"succeeded" is doing a lot of work there. The request completed. The output was useless. Monitoring can't tell the difference.
Per-task complexity budgets are the right architecture. Flat token limits kill legitimate deep reasoning on hard problems. The art-vs-science problem is real, though: until you have enough production data to calibrate expected complexity per task type, you're guessing.
Has the per-task approach caught any silent burns in practice?
This line hit hard: "Human engineers are expensive upfront and cheaper over time. Agents are cheap upfront and expensive over time." I run a SaaS solo and use AI heavily across my workflow - content, code, outreach. But the moment I stopped reviewing what the AI produced, quality tanked, and costs crept up. The AI handles volume. The judgment of when to ship, when to rewrite, and when to scrap entirely - that's still the expensive part. Great piece.
"When to ship, when to rewrite, when to scrap entirely" is the clearest description of the judgment layer I've seen in these comments.
Those three decisions look simple from the outside. they're not. they require knowing your users, your technical debt, your tolerance for risk, and what "done" actually means for this specific thing.
The AI handles volume precisely because it doesn't have to know any of that.
The moment you stopped reviewing is the moment you delegated those three decisions without realizing it. quality tanked because judgment got outsourced along with execution.
As a software engineer myself, this is really insightful when everyone around is talking about how we're all about to lose our jobs to AI. I've noticed my own role shifting more to that of architecture, building out solution proposals as opposed to writing code. Interesting take, thanks.
This reframing of error handling as an architectural choice hit home. I had a side project last year with a chain of API calls, each wrapped in its own try/catch. One of them failed silently — caught the error, logged it, returned undefined — and everything downstream kept running on bad data. Took way longer than it should have to figure out what was wrong.
Switched to a single top-level boundary that just halts everything on an unexpected throw. Less "safe" looking code, but the system fails honestly now instead of quietly producing garbage.