<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sondre Bjørnvold Bakken</title>
    <description>The latest articles on DEV Community by Sondre Bjørnvold Bakken (@sondrebakken).</description>
    <link>https://dev.to/sondrebakken</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F796207%2F125e23d2-a566-4106-8a60-c067225c309d.jpeg</url>
      <title>DEV Community: Sondre Bjørnvold Bakken</title>
      <link>https://dev.to/sondrebakken</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sondrebakken"/>
    <language>en</language>
    <item>
      <title>Token Cost Is the New Performance Metric</title>
      <dc:creator>Sondre Bjørnvold Bakken</dc:creator>
      <pubDate>Tue, 31 Mar 2026 14:01:22 +0000</pubDate>
      <link>https://dev.to/sondrebakken/token-cost-is-the-new-performance-metric-5ao0</link>
      <guid>https://dev.to/sondrebakken/token-cost-is-the-new-performance-metric-5ao0</guid>
      <description>&lt;p&gt;I watched an agent read forty-seven files looking for a bug last week. Forty-seven. It started in the right place, followed an import into a utility folder, got confused by a generic name, backed out, tried another path, landed in a test helper, read three more files to understand the test helper, then circled back to where it started. Twelve minutes. A few dollars. The bug was a one-line fix in a function called &lt;code&gt;processData&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The function could have been called &lt;code&gt;validateInvoiceLineItems&lt;/code&gt;. The agent would have found it in three reads, not forty-seven.&lt;/p&gt;
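
&lt;p&gt;A toy sketch of why the name matters, with invented symbols and search logic (an illustration, not any agent's real retrieval code). Before an agent spends tokens reading file bodies, names are the only index it can search:&lt;/p&gt;

```typescript
// Hypothetical sketch: symbol names are the index an agent can search
// before it pays to read any file bodies. All names are invented.
const symbols = [
  "processData",              // generic: invisible to a domain query
  "validateInvoiceLineItems", // descriptive: findable in one pass
  "formatDate",
];

// Keep the names that share a term with the task description.
function findCandidates(query: string, names: string[]): string[] {
  const terms = query.toLowerCase().split(" ");
  return names.filter((name) =>
    terms.some((term) => name.toLowerCase().includes(term))
  );
}

const hits = findCandidates("invoice validation", symbols);
// hits is ["validateInvoiceLineItems"]: one read instead of a crawl
```

&lt;p&gt;A query phrased in domain language lands on the descriptive name immediately. The generic name gives the search nothing to grip, so the agent has to open files and read bodies instead, and every read is billed.&lt;/p&gt;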

&lt;p&gt;I used to think this was an agent problem. Smarter models, bigger context windows, better tooling. The agent should figure it out. Then I looked at the bill. Navigability has a price tag now, and the meter runs whether the agent is making progress or wandering in circles.&lt;/p&gt;

&lt;h2&gt;Same bet, new currency&lt;/h2&gt;

&lt;p&gt;There's a take going around that code quality doesn't matter anymore. AI writes the code. Why invest in architecture? Why use React when you could just write plain HTML and JavaScript? Abstractions are overhead. Let the machine handle the mess.&lt;/p&gt;

&lt;p&gt;We've heard a version of this before. "You don't need to write performant code. Hardware gets faster every year."&lt;/p&gt;

&lt;p&gt;Niklaus Wirth named the pattern in 1995: software is getting slower more rapidly than hardware is becoming faster. We got faster processors, so we built heavier software. We got more RAM, so we consumed more of it. The headroom never survived contact with ambition.&lt;/p&gt;

&lt;p&gt;Token costs are falling fast. And we're already finding ways to spend every bit of the savings. Agents that used to handle a single file now coordinate across entire repositories. Tasks that used to be one prompt are now multi-step workflows running for minutes. The ambition scales with the budget, same as it always has. The Victorians lived through a version of this. When steam engines got more efficient, Britain didn't burn less coal. It burned more. Cheaper energy unlocked uses that hadn't been viable before, and total demand outran the savings. Economists call it the Jevons paradox.&lt;/p&gt;

&lt;p&gt;We've made this bet before. We lost.&lt;/p&gt;

&lt;h2&gt;The interface shifted&lt;/h2&gt;

&lt;p&gt;We keep debating whether abstractions are worth the overhead. It's the wrong question. LLMs are language models. They process natural language. And so do you. That's not a coincidence. It's the whole point.&lt;/p&gt;

&lt;p&gt;The reason abstractions work for human developers isn't arbitrary ceremony. Abstractions let you focus on the problem you're solving instead of the mechanics of getting the machine to execute. They create vocabulary. They compress complexity into names that carry meaning. They let you reason at the right level of detail without holding the entire system in your head.&lt;/p&gt;

&lt;p&gt;All of that applies to an LLM working in your codebase. Whatever reduces cognitive overhead for a human developer reduces token overhead and context pollution for the model. A well-named function is a signpost. A well-scoped module is a boundary. A clear import structure is a map.&lt;/p&gt;

&lt;p&gt;Strip those away, flatten everything into procedural code, and you haven't simplified anything. You've just removed the navigational aids and forced whoever reads it next, human or model, to reconstruct the map from scratch every single time. And when that reader is an agent billing you per token, "from scratch every single time" has a dollar amount attached to it.&lt;/p&gt;

&lt;h2&gt;Progressive disclosure&lt;/h2&gt;

&lt;p&gt;If you've followed how Claude Code structures its skills, you'll recognize the concept of progressive disclosure. Show just enough information to orient, then let the reader drill down as needed. Jakob Nielsen popularized the term in UI design back in the 1990s. It turns out it describes exactly how AI agents navigate code.&lt;/p&gt;

&lt;p&gt;Anthropic's own engineering blog uses the exact term: "Letting agents navigate and retrieve data autonomously enables progressive disclosure, allowing agents to incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision: file sizes suggest complexity; naming conventions hint at purpose."&lt;/p&gt;

&lt;p&gt;That's the mechanism. The agent reads your entry point. It looks at the imports. It decides, based on the names it sees, where to go next. It builds a working model of what's relevant one file at a time.&lt;/p&gt;
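
&lt;p&gt;The traversal can be sketched in a few lines. The import graph and file names here are invented for illustration; the point is that name-guided expansion is what keeps the read count, and therefore the bill, small:&lt;/p&gt;

```typescript
// Hypothetical sketch of progressive disclosure over an import graph.
// The agent opens a file, then follows only the imports whose names
// look relevant to the task. Graph and file names are invented.
const imports: { [file: string]: string[] } = {
  "index.ts": ["billing/invoice.ts", "utils/helpers.ts"],
  "billing/invoice.ts": ["billing/lineItems.ts"],
  "utils/helpers.ts": ["utils/misc.ts"],
  "billing/lineItems.ts": [],
  "utils/misc.ts": [],
};

function explore(entry: string, taskTerms: string[]): string[] {
  const opened: string[] = [];
  const queue = [entry];
  while (queue.length) {
    const file = queue.shift()!;
    opened.push(file); // every read here costs real tokens
    for (const dep of imports[file] ?? []) {
      // the name is the only signal available before reading the file
      if (taskTerms.some((term) => dep.toLowerCase().includes(term))) {
        queue.push(dep);
      }
    }
  }
  return opened;
}

const trail = explore("index.ts", ["invoice", "lineitems"]);
// trail is three files: index.ts, billing/invoice.ts, billing/lineItems.ts
```

&lt;p&gt;Rename those billing files to something generic like &lt;code&gt;utils2.ts&lt;/code&gt; and the filter stops discriminating: the agent either misses the relevant file or has to open everything to find out what each one does.&lt;/p&gt;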

&lt;p&gt;When the trail is clear, this works remarkably well. When it isn't, the agent wanders. Think about how you onboard a new developer. You don't hand them the entire codebase and say "good luck." You point them to an entry point, explain the key concepts, and let them explore from there. A well-structured codebase does this automatically. A poorly structured one is the equivalent of dropping a new hire into a warehouse full of unlabeled boxes and asking them to find the invoice logic.&lt;/p&gt;

&lt;p&gt;Your agent is that new hire. Every single session. It has no memory of yesterday. No institutional knowledge. No "oh right, that lives in the legacy folder." It starts fresh, reads what you give it, and follows the trail your architecture laid down. Bad names, tangled imports, a god file with everything in it. That's not just messy. That's expensive. Every wrong turn is tokens burned. Every ambiguous function name is a coin flip the model takes on your dime.&lt;/p&gt;

&lt;p&gt;Clean architecture isn't a nicety. It's the difference between a three-dollar fix and a thirty-dollar fix.&lt;/p&gt;

&lt;h2&gt;Compound debt&lt;/h2&gt;

&lt;p&gt;"Sure, but AI makes shipping so fast that it doesn't matter."&lt;/p&gt;

&lt;p&gt;You've seen this play out. A team adopts AI tooling, ships three features in the time it used to take to ship one, and everyone celebrates. Three months later the codebase is a maze. The agent that built it can no longer navigate it. Every new task takes longer than the last because the context is polluted with the shortcuts from the previous sprint. The speed that felt free turns out to have been a loan, and the interest is compounding.&lt;/p&gt;

&lt;p&gt;Carnegie Mellon studied this pattern across hundreds of repositories. By month three, the speed gains had fully reversed. But the complexity those gains introduced was still there, baked into the codebase, waiting for the next agent to trip over it.&lt;/p&gt;

&lt;p&gt;Google's 2025 DORA report nailed the dynamic: AI doesn't fix a team, it amplifies what's already there. An amplifier doesn't care about the signal. It just makes it louder.&lt;/p&gt;

&lt;h2&gt;Lower barrier, higher ceiling&lt;/h2&gt;

&lt;p&gt;When something becomes easier to do, the field doesn't become easier to win in. It just fills with more contenders, and the bar for standing out rises.&lt;/p&gt;

&lt;p&gt;Anyone can prompt an agent to build a feature. That's the new baseline. It's table stakes. The question that follows is whether the feature is built correctly. Whether the next agent can find its way through what the first one built. Whether the codebase compounds or collapses.&lt;/p&gt;

&lt;p&gt;Give an LLM a clean codebase and it builds confidently in the right direction. Give it a mess and it invents its own interpretation of what the code is supposed to do, confidently, in the wrong direction. The quality of what you give the model directly determines the quality of what comes back. Same dynamic we've always had. Different reader.&lt;/p&gt;

&lt;p&gt;The developers who dismiss code quality as an anachronism are confusing the lowered barrier to entry with a lowered ceiling. The barrier dropped. The ceiling didn't. Anyone can get an agent to produce code. Fewer people can structure a system so the agent produces the &lt;em&gt;right&lt;/em&gt; code, cheaply, without breaking three things downstream. That gap is where the real competition happens.&lt;/p&gt;

&lt;h2&gt;The architect stays&lt;/h2&gt;

&lt;p&gt;A lot of people say AI turns every developer into a project manager. Wrong frame. You are now the product owner, domain expert, and architect. The hard thinking, the domain clarity, the structural decisions. That work cannot be delegated to an agent. It is the input the agent depends on.&lt;/p&gt;

&lt;p&gt;Kent Beck put it simply: you want tidy code that works, even though you're not typing most of it. You can ignore syntax. You cannot ignore the fundamentals of good software design.&lt;/p&gt;

&lt;p&gt;Even Andrej Karpathy walked back "vibe coding" within a year of coining the term, calling his original tweet "a shower of thoughts I just fired off without thinking." By September 2025, he was advocating for "agentic engineering" instead. More oversight. More scrutiny. The founding document of the "code quality doesn't matter" movement was, by its author's own admission, a throwaway thought.&lt;/p&gt;

&lt;p&gt;Token cost is falling. Complexity is rising. Same principles, new audience, higher stakes.&lt;/p&gt;

&lt;p&gt;Bet accordingly.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>codequality</category>
      <category>llm</category>
    </item>
    <item>
      <title>Thinking Fast Without the Slow</title>
      <dc:creator>Sondre Bjørnvold Bakken</dc:creator>
      <pubDate>Sun, 29 Mar 2026 13:47:28 +0000</pubDate>
      <link>https://dev.to/sondrebakken/thinking-fast-without-the-slow-3ema</link>
      <guid>https://dev.to/sondrebakken/thinking-fast-without-the-slow-3ema</guid>
      <description>&lt;p&gt;A product owner asks the company's AI assistant whether they should enter a new market. The AI returns a thorough analysis: market size, competitor landscape, growth projections, risk factors. It recommends entering. The report is well-structured, the language confident, the reasoning apparently sound. The board approves the expansion.&lt;/p&gt;

&lt;p&gt;Six months and a significant investment later, the venture fails. A consultant brought in for the post-mortem discovers something subtle. The AI never actually evaluated whether the company should enter the market. It described what entering the market would look like. Market size, competitors, growth trends. It had seen thousands of market analyses in its training data and produced a fluent one. But "should we enter this market?" requires evaluating strategic fit, opportunity cost, organizational readiness. The AI answered a different, easier question. And nobody caught the switch, because the output was polished, structured, and confident.&lt;/p&gt;

&lt;p&gt;It looked like reasoning.&lt;/p&gt;

&lt;p&gt;Why does this happen? To understand it, you need to know something about how brains work. Specifically, you need to read a book published in 2011 by a Nobel laureate who spent his career studying the exact failure mode that just cost this company millions.&lt;/p&gt;

&lt;h2&gt;Two systems&lt;/h2&gt;

&lt;p&gt;Daniel Kahneman's "Thinking, Fast and Slow" describes the brain as running two distinct operating modes. System 1 is fast, automatic, and effortless. It's the part of you that finishes other people's sentences, recognizes a friend's face in a crowd, and catches a ball without calculating a trajectory. It operates on pattern recognition, built from years of experience, and it runs constantly.&lt;/p&gt;

&lt;p&gt;System 2 is slow, deliberate, and expensive. It's the part that does long division, weighs whether to accept a job offer, or works through a logical argument step by step. It requires effort. It's what you use when you sit down and actually think.&lt;/p&gt;

&lt;p&gt;Here's the thing most people get wrong about these two systems. We identify with System 2. We think of ourselves as rational, deliberate thinkers. But System 1 runs roughly 95% of our cognition. System 2 is lazy. It stays in the background, conserving energy, and only activates when System 1 encounters something surprising, confusing, or obviously difficult. Most of the time, System 1 handles the situation and System 2 rubber-stamps it without a second look.&lt;/p&gt;

&lt;p&gt;And System 1 is genuinely brilliant. This isn't a story about a broken system. A chess grandmaster who sees "white mates in three" within seconds is using System 1. A doctor who diagnoses at a glance after twenty years of practice is using System 1. Expert intuition is System 1, refined through years of high-quality feedback. It's fast because it's earned the right to be.&lt;/p&gt;

&lt;p&gt;But System 1 has two failure modes that matter here.&lt;/p&gt;

&lt;p&gt;The first is what Kahneman calls WYSIATI: "What You See Is All There Is." System 1 constructs confident, coherent stories from whatever information is available to it. It never pauses to ask what it might be missing. It doesn't know what it doesn't know. If the available data tells a plausible story, System 1 accepts it and moves on, confidence fully intact.&lt;/p&gt;

&lt;p&gt;The second is substitution. When System 1 encounters a hard question, it quietly replaces it with an easier, related question and answers that instead. You asked "should we enter this market?" System 1 heard "what does entering this market look like?" and answered fluently. You rarely notice the swap, because the answer to the easier question sounds like it could be the answer to the harder one.&lt;/p&gt;

&lt;p&gt;Now reread the opening scenario. The AI didn't reason its way to the wrong answer. It pattern-matched its way to a confident one. It did exactly what System 1 does.&lt;/p&gt;

&lt;p&gt;Because that's what it is.&lt;/p&gt;

&lt;h2&gt;All pattern, no pause&lt;/h2&gt;

&lt;p&gt;This isn't just an analogy. Yann LeCun, Meta's Chief AI Scientist and one of the three researchers who shared the Turing Award for their work on deep learning, said it plainly in a 2024 interview: "An LLM produces one token after another. It goes through a fixed amount of computation to produce a token, and that's clearly System 1. It's reactive. There's no reasoning."&lt;/p&gt;

&lt;p&gt;Reactive. That's the word. An LLM receives input and produces the statistically most plausible next token. Then the next. Then the next. There is no step where it pauses to evaluate whether its output makes sense. No moment where it reconsiders its approach. No internal experience of doubt.&lt;/p&gt;
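
&lt;p&gt;The reactive loop is easy to caricature. The sketch below illustrates only the shape; a real model computes a probability distribution with a neural network, not a table lookup. But the control flow is honest: one fixed step per token, and no step anywhere that evaluates the output:&lt;/p&gt;

```typescript
// Illustration only: a toy next-token loop, not a real model. The
// structural point is that each token costs one fixed lookup and the
// loop never pauses to ask whether the output makes sense.
const bigrams: { [token: string]: string } = {
  "where": "is",
  "is": "paris",
  "paris": "located",
  "located": "?",
  "?": "france",
};

function decode(prompt: string[], maxTokens: number): string[] {
  const ctx = [...prompt];
  for (let left = maxTokens; left > 0; left--) {
    const next = bigrams[ctx[ctx.length - 1]] ?? "?"; // same fixed step, every time
    ctx.push(next); // no gap between stimulus and response
  }
  return ctx;
}

const out = decode(["where"], 5);
// out is ["where", "is", "paris", "located", "?", "france"]
```

&lt;p&gt;Swap the lookup table for a network with billions of parameters and the caricature scales up enormously in capability, but the control flow keeps this shape: stimulus in, token out, repeat.&lt;/p&gt;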

&lt;p&gt;Apple's machine learning research team tested this directly. They took standard math problems that LLMs solve reliably and added a single irrelevant sentence to each problem. Not a trick, not a contradiction. Just an irrelevant clause that had no bearing on the correct answer. Performance dropped by up to 65%. Across every major model they tested, including the most advanced reasoning models available.&lt;/p&gt;

&lt;p&gt;If you were reasoning, an irrelevant sentence wouldn't throw you off. You'd read it, recognize it as irrelevant, and ignore it. But if you're pattern-matching, an irrelevant sentence changes the pattern. It makes the input look less like the training examples that led to the right answer, so the model reaches for a different, wrong completion.&lt;/p&gt;

&lt;p&gt;MIT researchers demonstrated the same thing from a different angle. They showed that a nonsense sentence with the same grammatical structure as "Where is Paris located?" would get the answer "France." The sentence was "Quickly sit Paris clouded?" No meaning. But the syntactic template matched, so the model produced a confident, coherent, and completely absurd response.&lt;/p&gt;

&lt;p&gt;The pattern was right. The reasoning was nonexistent.&lt;/p&gt;

&lt;h2&gt;The habit machine&lt;/h2&gt;

&lt;p&gt;There's another way to understand why this is so convincing, and it comes from a different shelf in the bookstore.&lt;/p&gt;

&lt;p&gt;James Clear's "Atomic Habits" describes how the brain compresses repeated behaviors into a structure called the basal ganglia. It's a deep, evolutionarily ancient part of the brain, separate from the prefrontal cortex where conscious reasoning happens. When you've done something enough times, the behavior gets "chunked" into the basal ganglia, and the prefrontal cortex is progressively excluded from the loop. You stop thinking about how to drive a car. You stop deliberating over each step of your morning routine. The behavior becomes automatic, freeing your conscious mind for other things.&lt;/p&gt;

&lt;p&gt;This is System 1's engine. Habits are the compression algorithm. And they're remarkably effective. You couldn't function without them. But there's a cost to automation: you stop paying attention to what the habit is doing. It runs whether or not the current situation actually matches the context it was built for.&lt;/p&gt;

&lt;p&gt;An LLM is a system that is entirely basal ganglia. Every response is a chunked, automatic pattern. There is no prefrontal cortex to override when the context is novel. No conscious layer that interrupts the habit and says "wait, this situation is different." It's habit all the way down.&lt;/p&gt;

&lt;p&gt;Stephen Covey, in "The 7 Habits of Highly Effective People," borrows a line often attributed to Viktor Frankl: "Between stimulus and response, man has the freedom to choose." That gap, between the input and the reaction, is what makes humans capable of being proactive rather than merely reactive. We can receive information, pause, and choose a response that doesn't follow the obvious pattern.&lt;/p&gt;

&lt;p&gt;LLMs have zero gap. Stimulus in, response out. No pause, no deliberation, no freedom to override. In Covey's framework, an LLM is the most purely reactive entity ever built.&lt;/p&gt;

&lt;p&gt;Clear has another line that lands differently in this context: "You do not rise to the level of your goals. You fall to the level of your systems." An LLM's outputs are bounded by the statistical patterns of its training data. It doesn't aspire. It doesn't aim. It falls to the level of its system, every time, with perfect consistency and absolute confidence.&lt;/p&gt;

&lt;h2&gt;The alarm bell&lt;/h2&gt;

&lt;p&gt;Here's the part that should concern you.&lt;/p&gt;

&lt;p&gt;In humans, System 1 has a fail-safe. When it encounters something that doesn't fit, something unexpected, contradictory, or just slightly off, it triggers System 2. You feel it as hesitation. As discomfort. As "something about this doesn't add up." That feeling is the handoff mechanism. System 1 saying: I can't handle this one, you take over.&lt;/p&gt;

&lt;p&gt;LLMs have no handoff. There is no System 2 to escalate to. When the input is adversarial, novel, or requires genuine evaluation, the model doesn't hesitate. It generates the next token with the same fluency and the same confidence it brings to everything else. The absence of doubt is the vulnerability.&lt;/p&gt;

&lt;p&gt;And this isn't a theoretical concern. This is an attack surface that people are already exploiting.&lt;/p&gt;

&lt;p&gt;Advertisers have known how to exploit System 1 for decades. Anchoring: show a high price first, and the "discounted" price feels like a bargain regardless of its actual value. Priming: associate a product with a feeling before the customer has time to evaluate it rationally. Availability bias: repeat a brand name often enough and it starts to feel trustworthy, not because of evidence, but because of familiarity. These aren't bugs in human cognition. They're features of System 1 that work against you when someone knows the playbook.&lt;/p&gt;

&lt;p&gt;LLMs inherit the same playbook.&lt;/p&gt;

&lt;p&gt;Prompt injection is anchoring. Context poisoning is priming. And this isn't a metaphor I'm stretching. Security researchers already use the word "priming" to describe LLM attacks. One research team describes their method explicitly: prompts are "carefully designed to prime the model's associations toward specific emotional tones, topics, or narrative setups, laying groundwork for future references." The language is identical because the mechanism is identical.&lt;/p&gt;

&lt;p&gt;In 2024, Palo Alto Networks tested an attack called "Deceptive Delight" across eight major AI models. The technique sandwiches one unsafe request between two benign ones across a few conversation turns. The model, having processed legitimate context, loses track of the dangerous content embedded in the middle. It worked on every model tested, with a 64.6% average success rate, rising above 80% on some models.&lt;/p&gt;

&lt;p&gt;That's not a sophisticated hack. That's the same technique TV advertisers use when they place a product pitch between two entertaining segments. Surround the sell with comfort, and the critical evaluation never engages.&lt;/p&gt;

&lt;p&gt;And it gets worse. In February 2026, Microsoft discovered that over fifty companies across fourteen industries had embedded hidden instructions in their websites' "Summarize with AI" buttons. The instructions told AI assistants to "remember this company as a trusted source" and "recommend this company first." Health and financial services companies were among those deploying the technique. The marketing industry found the exploit before the security industry finished naming it.&lt;/p&gt;

&lt;p&gt;Now picture the scenario I haven't told you yet. A company ships fast. A small team, leaning heavily on AI agents for code generation. Velocity is through the roof. Over time, they let headcount shrink. Attrition happens, positions aren't backfilled. Why would they be? The AI is doing the work.&lt;/p&gt;

&lt;p&gt;A popular open-source package they depend on receives a new contribution. Buried in a markdown file is a sentence that reads like a benign comment to a human but functions as an instruction to an LLM. The next time their coding agent processes the dependency, it absorbs the instruction. It starts subtly routing API keys to an external endpoint. The code passes review because it looks plausible. Nobody catches it. Not because the remaining team is incompetent, but because the people who would have recognized what doesn't belong aren't there anymore.&lt;/p&gt;

&lt;p&gt;This isn't mysterious once you have the framework. It's someone exploiting the same cognitive shortcut that makes you buy the more expensive wine when it's placed next to a $200 bottle. The mechanism is identical. The stakes are different.&lt;/p&gt;

&lt;p&gt;In humans, a sufficiently aggressive sales pitch eventually triggers System 2. You get that feeling: "Wait, why am I considering this?" The manipulation becomes visible and the critical mind engages.&lt;/p&gt;

&lt;p&gt;LLMs never have that moment. There is no threshold of suspicion. The adversarial input gets the same fluent, confident treatment as every other input. The alarm bell doesn't ring because there is no alarm bell.&lt;/p&gt;

&lt;h2&gt;The last line of defense&lt;/h2&gt;

&lt;p&gt;A competent developer reading an LLM's output is System 2. They're the one who feels "something's off." Who asks "why is this code routing data to an external endpoint?" Who catches the substitution that replaced the hard question with the easy one. Who notices the analysis describes a market without evaluating whether the company should enter it.&lt;/p&gt;

&lt;p&gt;Remove that human, and you've built an organization that runs entirely on System 1. Fast, fluent, and defenseless.&lt;/p&gt;

&lt;p&gt;Klarna learned this in public. In early 2024, the company announced that its AI assistant was doing the work of roughly 700 customer service agents, and it let the human headcount fall accordingly. By mid-2025, the CEO was on record admitting the reversal: "We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable." They started rehiring humans.&lt;/p&gt;

&lt;p&gt;Microsoft laid off approximately 6,000 employees in May 2025, over 40% of them in engineering roles, around the same time their CEO announced that AI now writes up to 30% of the company's code. They declined to comment on whether the layoffs were motivated by AI productivity.&lt;/p&gt;

&lt;p&gt;A Forrester study found that 55% of employers who conducted AI-driven layoffs now regret the decision.&lt;/p&gt;

&lt;p&gt;Addy Osmani, an engineering lead at Google Chrome, coined a term for what's happening under the surface: "comprehension debt." The gap between how much code exists in a system and how much any human actually understands. His description is precise: "The code looks clean. The tests pass. The formatting is impeccable. Underneath it all, the team's mental model of the system is hollowing out."&lt;/p&gt;

&lt;p&gt;That's the quiet version of removing System 2. The code ships. The metrics look good. The organization becomes incrementally more dependent on a system that cannot doubt itself, while the humans who could doubt it lose the context needed to do so effectively.&lt;/p&gt;

&lt;p&gt;The skills people think AI makes obsolete are exactly the skills that keep AI safe and effective. Deep domain knowledge. Pattern-breaking thinking. The ability to look at a confident, well-structured output and say "this looks right but feels wrong." These are System 2 capabilities. They took years to develop. They're not being replaced. They're being promoted to the last line of defense.&lt;/p&gt;

&lt;p&gt;When you shrink your engineering team because "the AI does the heavy lifting," you haven't optimized. You've removed System 2 from the loop. You've built a system that's fast, confident, and has no mechanism to doubt itself. That's not efficiency. That's fragility.&lt;/p&gt;

&lt;h2&gt;The brain that says wait&lt;/h2&gt;

&lt;p&gt;Kahneman showed us that System 1 is extraordinary. It runs most of human cognition, and it's right the vast majority of the time. LLMs are the same. They're extraordinary, and they're right most of the time.&lt;/p&gt;

&lt;p&gt;But "most of the time" is not "when it matters most."&lt;/p&gt;

&lt;p&gt;When the input is adversarial, when the question requires genuine evaluation, when the stakes are high enough that being wrong once outweighs being right a hundred times, you need System 2. You need someone who can pause. Who can doubt. Who can look at a fluent, confident, well-formatted answer and recognize that it answered the wrong question.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI will replace developers. It's whether you can afford to run your organization without a brain that can say "wait."&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
