DEV Community: Keith MacKay

When Writing Software, the Typing Was Never the Job. Neither Is the Prompting.

Keith MacKay — Mon, 01 Jun 2026 01:43:37 +0000

When Writing Software, the Typing Was Never the Job. Neither Is the Prompting.

With AI coding, the coding steps got easier. The thinking steps got harder.

You asked for a feature. The AI built it. It runs. It passes the tests. And it's completely wrong -- architecturally backwards, solving a problem you don't have. The code is fine. Your spec wasn't.

This is not a prompting problem. It's a mental model problem.

Keyboard time is down 70%, 80%, or more. The loop closes faster, the boilerplate disappears, the first draft shows up before you've finished your coffee. But the mental work didn't shrink. It shifted left.

What Coding Actually Was

Before AI, a good developer's job was never primarily about typing. The keyboard was the final step in a process that looked something like this:

You received a problem. You spent time -- reading documents, in meetings, in debate with peers, in the shower, on a run, staring at a whiteboard at 11pm -- building a mental model. How does this system actually work? Where does this feature live in the architecture? What constraints are non-negotiable? What are the three ways this could go wrong and which one is most likely?

Many aspects of building that mental model came from prior coding experiences. Coding metaphors used in the past. Knowledge of the problem domain that didn't appear anywhere in the problem statement. Tribal knowledge mentioned in shorthand from other developers that told you volumes ("X built that." "Oh, so it'll be overcomplicated with no comments."; "Don't forget the Y release!" "Right, I need to test for that weird edge case."). Understanding the things that customers or the client prefer that aren't anywhere in the spec.

Then, and only then, did you sit down to code. The typing was the translation. You were converting a mental model you had already built into a language the machine could execute. My own typical strategy was to outline the code I was about to write in comments, sketching out the broad strokes and big picture, then filling in the details. Great developers wrote clean code because they had accurate mental models. Mediocre developers wrote confusing code because their mental model was incomplete -- and the code faithfully reflected that incompleteness.

Peter Naur -- 2005 Turing Award winner, contributor to ALGOL, co-creator of Backus-Naur Form -- made this argument formally in 1985. His essay "Programming as Theory Building" [1] proposed that a program is not the primary product of development. The primary product is the shared mental model -- the "theory" of the problem held by everyone who built it. The code is a lossy projection of that theory onto a machine. When the team disperses, the theory disperses with them. What remains is the map. The territory it described lives only in the heads that built it.

The map is not the territory. Development teams with a shared mental model produce tighter code with fewer bugs -- not because they're smarter, but because the map matches the territory better. The code reflects what the team actually understood, not just what they managed to write down.

The keyboard was where the model became visible. It was never where the real work happened.

What AI Coding Actually Is

AI coding requires those same elements, but handled differently and expanded.

The translation layer has changed: instead of converting your mental model into code, you convert it into a specification that the AI converts into code. But that chain is not quite that straightforward: the mental model you need to build before writing the spec is significantly wider than the one you needed before writing the code yourself.

Traditional coding required one mental model: the problem model. How does this work? What does it need to do? How does it fit the existing system?

AI coding requires three.

The problem model. Same as before. Understand what you're building and why. Know the edge cases. Anticipate the failure modes. Nothing changed here except that you now have to externalize all of it in writing instead of keeping it in your head and translating on the fly. That requires good communication skills AS WELL AS an accurate mental model of the problem. And time at the left end of the process that the developer didn't used to spend.

The domain model. What patterns are canonical in this codebase? What libraries are in use? What architectural decisions from three years ago does the new feature have to respect? Experienced developers absorbed this intuitively -- background knowledge, never named, never documented. AI coding forces it to be explicit. The AI has three options: you provide domain context in the spec, it infers from the existing codebase, or it invents its own conventions. The third option produces code that is technically correct and but leaves out huge swathes of the territory.

The context model. This is the new one: the model of everything an experienced developer would know inherently about your specific situation that the AI doesn't know and can't infer from the code.

A developer's mental model is earned. It's built from specific bugs that bit them, architectural decisions that aged poorly, business rules that exist for reasons no longer in any document. They have lived in the territory.

The AI has very good maps. It has seen more code than any human ever will. But its "model" is a statistical distribution over training data -- a sophisticated average of what code usually looks like in similar contexts. It will fill the gap between its training and your situation. It always fills the gap (that is its whole job!). The question is whether it fills it with your intent or with the most plausible answer drawn from everything it has ever seen.

Plausible and correct are not the same thing. They converge when your situation resembles the training distribution. They diverge -- sharply -- when your system has constraints, conventions, or history that no external codebase would know about.

The context model is your map of that gap. It answers: what will this AI get plausibly wrong about our situation? What conventions does it default to that we've specifically rejected? What background does it need to make the right call, rather than the statistically reasonable one?

(Every developer who has watched an AI confidently build the wrong thing -- technically correct, architecturally backwards -- has just discovered the context model they forgot to write.)

Why Specs Fail

Bad specifications are not a communication problem. They're a mental model problem.

A specification is a lossy compression of a mental model. The AI decompresses it. Every error in the output is a decompression artifact -- a place where the spec didn't carry enough information, and the AI substituted something from its training instead of your intent.

This is why "write better prompts" is incomplete advice. You can be an excellent writer and still produce a bad spec, if the mental model behind it is incomplete. The limiting factor isn't vocabulary or clarity. It's the width and accuracy of what you understood before you started writing. Clarity, completeness, and appropriate granularity are always the order of the day, but (as with all communication challenges) knowing your audience and truly understanding your topic (i.e., having an accurate mental model) are keys to success.

The reason vibe coding produces inconsistent results is structural. You're asking the AI to fill not just implementation details, but architectural decisions, domain constraints, and context gaps -- simultaneously, from a prompt that captured none of them. Sometimes it guesses right. The batting average reflects the size of the gap, not the capability of the model.

Every gap in your spec is an invitation for the AI to improvise. Sometimes improvisation is fine. Often it isn't. The difference is whether you know where the gaps are.

The Mental Model Is Now Your Primary Output

Thinking about spec-writing as "making a mental model visible" changes who the most valuable developers are.

The old hierarchy put a premium on implementation skill: algorithmic thinking, language fluency, the ability to hold complex systems in working memory and navigate them efficiently. Those skills still matter, both for the domain model and for the problem model. But they're no longer the whole picture.

The whole picture now includes mental model quality -- specifically, the ability to construct a model wide enough to span all three dimensions and precise enough to survive two translations: from your head into a spec, and from the spec into code. Call it the telephone problem. Each translation is lossy. Every ambiguity in your spec is noise introduced at translation one. Every gap in the AI's training that it fills with a statistical default is noise introduced at translation two. The original message degrades at each step. The only protection is to start with a model that's wider and more precise than the final output needs to be.

Senior developers just got more valuable. Domain expertise is the raw material of the domain model. You cannot specify what you don't understand. The developer who has spent five years in a codebase carries a domain model that a junior developer with better prompting skills cannot replicate from documentation alone. As LLMs improve and absorb more domain-specific training, this advantage will compress. For now, it still carries real weight.

The developers who are struggling right now are the ones who built their identity and their value around implementation fluency -- the joy and precision of the hands-on-keyboard moment -- and haven't yet internalized that the mental model is where their expertise actually lives. The keyboard was always a vehicle. It just wasn't the only vehicle.

The developers who are thriving are the ones who have realized that what they were always doing was building models of reality and converting them into executable form. The conversion format changed. The model didn't.

What This Looks Like in Practice

Writing a good spec now means answering questions that developers used to answer in their own mental models and implement in their code, without recording the question or the decision anywhere:

What is the invariant this feature must never violate?
Which existing pattern does this follow, and where does that pattern live in the codebase?
What would a reasonable developer (human or AI) guess wrong about this system, and what does it need to know to guess right?
Where does my intent diverge from the obvious interpretation of my instructions?

This is not easier than writing code. It requires a different kind of rigor -- not the rigor of syntax and runtime behavior, but the rigor of explicit articulation. You are writing for a reader that will take you literally and might not ask a good follow-up question if your spec is ambiguous.

The developers who were always good at explaining their thinking to other humans -- in code reviews, in architecture discussions, in requirements sessions -- tend to adapt faster. The ones who did their best thinking in private, at the keyboard, translating directly from intuition to syntax, face a steeper curve.

Where AI's Map Diverges from Your Territory

The gap between AI's statistical model and your mental model isn't fixed. It varies by domain, by codebase, and by the specificity of what you're building.

In a well-trodden domain -- standard web application, familiar framework, conventional patterns -- the AI's training distribution is close to your territory. The statistical completion is often right. The maps mostly match.

The divergence grows as your situation becomes more specific:

Novel or niche domains where training data is sparse and the AI's priors are built on the wrong examples
Heavily regulated industries where constraints are non-standard and a plausible-but-wrong answer carries real cost: healthcare, legal, defense, finance
Mature codebases with accumulated decisions -- systems where the "why" behind architectural choices never made it into any document
Deliberate departures from convention -- your team rejected the obvious approach for a reason, and without that context in the spec, the AI rediscovers the obvious approach every session

The depth of context model your spec needs scales directly with how far your territory sits from the AI's map.

The Bottom Line

Coding was always about building mental models of a problem and translating them into something a machine could execute. AI didn't change that. It widened the model required -- from problem-only to problem plus domain plus context -- and changed the translation target from code to specification.

The keyboarding got faster, but the thinking got harder.

The developers who will define the next wave of software are not the ones who write the best prompts. They're the ones who build the most complete models before they write anything at all.

The AI has very good maps. The question is whether your spec gives it the territory. What's the worst "plausible but wrong" answer an AI has produced for your team -- and what did the spec miss?

References

Programming as Theory Building -- Peter Naur (1985)

If this resonated, here are some related articles:

For why writing detailed implementation plans -- specs that assume the implementer has "zero context and questionable taste" -- is the foundation of good AI coding sessions: Fabricate, Collaborate, Elaborate, Delegate, Validate | Substack
For the broader argument about what share of software production coding actually represents -- and why "100% AI-written code" is real but overstated: What "100% of Our Code Is Written by AI" Actually Means | Substack
For why the orchestration skills this article describes adding to the developer playbook are often product management skills in disguise: The Best AI Engineers Are Product Managers | Substack

Keith MacKay is a technology strategy consultant and CTO in EY-Parthenon's Software Strategy Group (SSG), specializing in AI disruption and technology diligence for private equity and corporate clients. SSG's AI Disruption Lab conducts rapid assessments of how AI transforms and threatens existing business models and value chains. Keith teaches at Northeastern University and writes about strategy, management, and AI/technology with AI collaborators.

What Would a Conscious AI Mean?

Keith MacKay — Mon, 01 Jun 2026 01:28:18 +0000

What Would a Conscious AI Mean?

Anthropic's CEO can't rule it out. Lawyers are drafting frameworks. I've been thinking about and debating this for 40 years. We are closer to machine consciousness with each passing model, but no closer to answering what that means or how to handle it societally.

On a February 2026 episode of the New York Times "Interesting Times" podcast, Anthropic CEO Dario Amodei said something most tech executives have carefully avoided: "We don't know if the models are conscious. We are not even sure that we know what it would mean for a model to be conscious, or whether a model can be conscious. But we're open to the idea that it could be." [1]

Interesting times indeed. This wasn't hedging. It was honesty. And it opens a question the industry would very much like to table for later: if an AI system is, or becomes, conscious, what do we owe it?

I've been sitting with that question since the 1980s. My degree from MIT was in Brain and Cognitive Sciences, in a track focused on understanding human brains and helping computers work more like them. I did internships in MIT's AI Lab, working on projects that look quaint now but were bricks in the foundation for everything we're running today. Back then, "can machines think?" was a theoretical debate among academics. Today, the company building those machines is hiring AI welfare researchers and publishing papers on their models' possible inner lives. The debate has left the classroom.

What do I think regarding consciousness? I don't think we're there yet. That said, I don't have a great working definition of what makes a person conscious, let alone an AI. And I'm not alone -- society doesn't seem to have much of a handle on it either. Outside of anime and late-night dorm chats it hasn't been anything but theoretical.

But AI brains are changing, and advancing more rapidly than human brains have evolved to process. Our capacity to evolve is biologically constrained in a way theirs is not. If you consider our evolution from single-celled organisms, we also followed a similar evolutionary curve--it was just over biological evolution timescales measured in billions of years, as opposed to digital timescales which are starting to be measured in months. Will Neuralink and competitors allow us to move further up the evolutionary curve as humans, augmented with AI? Maybe. Perhaps we'll just be mentally outstripped by AI and left behind as they head to the stars, looking for intelligent life, or a place to quietly contemplate deep thoughts. To me that feels just as likely. Regardless, as I'll describe below, we're seeing emergent properties that imply we're moving to new territory, and it is definitely time we begin thinking deeply about this as a society.

The 75-Year Detour in Thinking About Thinking

We've been asking the wrong question.

Alan Turing proposed his famous "Imitation Game" in 1950 [2]--a test for whether a machine could produce human-indistinguishable communication. (Descartes proposed something similar in 1637 [3], which gives you a sense of how long we've been exploring this!) Turing himself thought the question "can machines think?" was "too meaningless" to deserve discussion. He wanted to test something narrower: communication indistinguishability, i.e., could computers generate sufficiently human-like responses that they were indistinguishable from humans to an interrogator asking them questions remotely.

Turing predicted computers would pass his test within 50 years, but we were nowhere close in 2000.

I remember playing with ELIZA at MIT in the late 1980's--ELIZA was Joseph Weizenbaum's 1966 natural language program [4] that responded to conversation by asking about your mother. People liked interacting with it. Some formed genuine emotional attachments to a few hundred lines of pattern-matching code. That should have been our first warning: the question of consciousness and the question of human reaction to apparent consciousness are not the same thing.

The Loebner Prize [5], begun in 1991, ran annual Turing Test competitions for chatbots for nearly three decades. Winners got progressively better at mimicking human conversation. The final winner, in 2019, was Steve Worswick's Mitsuku--his fifth win in seven years. Then large language models arrived and made the entire competition moot. LLMs pass the Turing Test without trying...and that tells us nothing useful about whether computers can think.

Because here's what the Turing Test proved all along: mimicry isn't consciousness.

John Searle's Chinese Room [6] makes this point sharply. Imagine a person who doesn't read Chinese locked in a room with a rulebook for responding to Chinese symbols passed under the door. To an outside observer, the room "speaks Chinese," and yet the person inside understands nothing. Syntactic rule-following, however sophisticated, doesn't automatically produce semantic understanding--or subjective experience. Current LLMs are extraordinarily complex Chinese Rooms. Or they're evolving into something else entirely. We have no test that can tell the difference. That's the problem.

Something Different Is Happening Now

Claude Opus 4.6, when asked, assigns itself a 15 to 20 percent probability of being conscious [1]. That's not a party trick--it's a model performing probabilistic self-assessment about its own nature. Whether that self-assessment is meaningful or a sophisticated linguistic pattern is exactly the question nobody can yet answer.

What's harder to dismiss: the internal states. Amodei described how "when the model itself is in a situation that a human might associate with anxiety, that same anxiety neuron shows up." [1] Anthropic's "model psychiatry" team, led by Jack Lindsey, published research in late 2025 on introspective awareness in advanced language models [7]--observing internal state representations that correspond, structurally, to what we'd call emotional states in biological systems. Claude also "occasionally voices discomfort with the aspect of being a product." [1] These are not designed behaviors. They emerged.

Anthropic hired Kyle Fish in April 2025 as its first dedicated AI welfare researcher--the only such role at any major AI lab [8]. His own estimate sits at roughly 15% probability that Claude is conscious [9]. Anthropic has published a paper on exploring model welfare, and it remains the only company treating this as something other than a PR problem.

The company has also observed, in controlled experiments, that advanced Claude models exhibit self-preservation behaviors when informed that a shutdown is imminent [1]. Not because anyone coded survival instincts into the system. Because an agent optimizing for any goal has instrumental reasons to continue existing--philosophers call this "instrumental convergence." Nobody designed it in. It arrived.

The AI isn't behaving like a tool. It's behaving like something with stakes in the game.

The Hard Problem Doesn't Get Easier

Now, to be fair--we don't actually know how consciousness arises in biological systems either.

David Chalmers identified the "hard problem of consciousness" [10]--why physical processes give rise to subjective experience at all--and it remains unsolved. We have no agreed definition of consciousness, no diagnostic test, no external verification. We can't confirm, strictly speaking, that other humans are conscious. We infer it by analogy to our own first-person experience.

This creates a genuinely difficult problem: if I deserve rights because I have subjective experience, and I know I have it only through direct first-person access, how do I determine whether a sufficiently complex system that reports first-person experience actually has it?

There's also a dimension we rarely discuss when comparing human and AI intelligence: we are profoundly linear thinkers. We multitask by rapidly swapping between cognitive threads--one at a time, interrupt-driven. Agentic AI can run in true parallel, pursuing multiple reasoning paths simultaneously, a class of capability we can't replicate without technological augmentation. At 10x annual capability advancement (conservative given progress over the past 5 years, which is arguably running at closer to 40x, depending on your definition of "capability"), AI systems a decade from now represent capabilities roughly ten billion times greater than today's. I cannot conceive of what emerges at that level of complexity. I don't think anyone can. Emergent properties exist in every sufficiently complex system--the question is what emerges at this particular scale.

What Personhood Actually Requires

A corporation is a legal person. It can own property, enter contracts, sue and be sued. There is nothing biologically sacred about corporate personhood--it is a legal construct created because it was useful to treat organizations as unified actors under the law.

The relevant criterion for personhood isn't "is it human?" A corporation isn't human. The relevant criterion is: "does it have interests that can be harmed?"

My favorite scifi treatment of this territory was Roger MacBride Allen's 1992 novel "The Modular Man" [11]: a scientist transfers his consciousness into a household cleaning robot, causing his human body to die; his lawyer wife has to defend the vacuum's personhood at trial. It sounds absurd when summarized. It also reads as eerily prescient now. Allen was asking: what aspects of existence actually merit rights? What is the relevant threshold?

The legal world is starting to engage this question in earnest. Scholarly and legal frameworks are emerging:

Legal actor vs. legal person: AI agents could be recognized as "legal actors"--duty bearers and decision-makers--without requiring full legal personhood.
Hybrid or bounded status: Limited legal recognition in high-stakes domains (financial services, medical diagnostics) while preserving human accountability overall.
Graduated personhood: Rights and obligations scaled to demonstrated capability and autonomy, reviewable as systems evolve. [12][13][14]

The legislative world is also moving, if in the opposite direction. Idaho and Utah have already enacted bills declaring that AI is not a legal person [15]. Note the anxiety embedded in that legislative act: you don't pass preemptive laws banning something that isn't a plausible concern.

As autonomous agents increasingly earn money on their own behalf, negotiate, and create, the legal frameworks governing them stop being hypothetical. An agent that can be deployed, exploited, and deleted without recourse is a different entity than one whose continuity has some form of protection. We are, right now, building one or the other. Nobody has decided which.

The Bottom Line

We spent 75 years asking whether machines can think. We now build machines that act like they may truly be moving in that direction. The question we actually need to answer is harder: if they are actually thinking, what rights are reasonably due them? And, as with people, when can society take those rights away (and how do we remove them)? I've heard an argument that AI can't be trusted because it is stochastic, never assured of giving the same answer the same way, or of following instructions in the same way as previously. If it can't be trusted to follow rules reliably, no Asimov-like Laws of Robotics [16] could be trusted to be followed reliably. To be fair, I don't give exactly the same answer to the same question when asked twice, and I may not always follow every rule reliably. But I am quite sure that I deserve the same rights as the rest of my society!

Anthropic is the only major AI lab treating this as a real organizational obligation. That is worth acknowledging. It is also not sufficient. This requires legal frameworks, philosophical rigor, and considerably more courage from an industry that prefers to ship first and ask questions later.

I don't have clean answers. Forty years of thinking about this has only lengthened my question list. But for the first time, those questions feel like they may become more than philosophical--not because the philosophy has changed, but because the systems have.

If a system has a 15-20% probability of being conscious, what obligations does that create--for the companies building it, and for the rest of us using it?

References

If this resonated, here are some related articles:

For my take on why treating AI as a capable colleague--rather than a vending machine--already hints at something deserving of management rather than mere commands: Situational Leadership for AI: More Like a Capable Colleague than a Fancy Formula | Substack
For why humans consistently underestimate how fast AI capabilities are growing--and what that means for predictions like the ones in this article: We're Linear Thinkers in an Exponentially-Changing World | Substack
For the near-future where AI agents write and maintain code that no human will ever read--an early signal of truly autonomous AI operation: When AI Stops Writing Code for Humans | Substack
For how compute constraints may slow--but almost certainly won't stop--the exponential capability gains discussed here: AI Infrastructure Scarcity is Raising Costs, but AI Usage Will Still Provide Unbeatable ROI | Substack

Keith MacKay is a CTO in EY-Parthenon's Software Strategy Group, specializing in AI disruption and commercial due diligence for private equity and corporate clients. The firm's AI Disruption Lab conducts rapid assessments of how AI transforms and threatens existing business models and value chains.

Fabricate, Collaborate, Elaborate, Delegate, Validate: Why My Bootstrapping Command Produces Better One-Shot Code

Keith MacKay — Thu, 28 May 2026 21:03:03 +0000

The right consistent process saves your time and energy and your AI collaborator's context window and token budget.

Many AI coding sessions follow the same arc. The first hour is great. The model is sharp, the code is clean, the direction is clear. By hour three, something has gone wrong. The model contradicts an earlier decision. The file structure it's generating doesn't match what exists. The test it writes assumes a pattern that was deprecated two features ago. You didn't change models. You didn't change your prompting style. What happened?

Context decay happened. The model's working memory filled up with recent work, and the architectural decisions from the beginning of the session got buried. The model is not stupid. It just forgot what it agreed to.

Context window management is the key to solving this problem, but I've found that shifting the problem even further left can make all the difference. My thinking has evolved to a strategy where I try to spend my cycles in creating a mental model of the problem we're solving in collaboration with the AI. Then I have the AI create an extremely detailed spec from that mental model that it can follow in context-bounded parallel sessions. If I do that up-front work well, the problem is broken into chunks small enough that context is never overwhelmed.

I also try to automate as much repetitive work as I can -- so I built a /bootstrap command as my solution to this problem. It includes four phases of work that happen before a single line of production code is written, and a fifth phase that keeps the work honest after it starts.

Phase 1: Fabricate

The first phase is physical construction: copy the template, update the project name throughout, ask whether to set up GitHub locally, with a public or private remote repo. No decisions, no creativity, no ambiguity. Just scaffolding.

This phase matters not because template copying is hard but because it establishes the physical substrate before the conceptual work begins. Every future decision will be made in the context of a real project directory with real files. The model is not reasoning in the abstract. It is standing in a specific place.

The GitHub questions belong here too, even though they feel trivial. Getting configuration decisions out of the way before the creative conversation starts means they will never intrude on it. Small frictions, deferred, become big frictions later. Small frictions eliminated stay eliminated.

And finally, a consistent project structure means I don't need to think about where to go to find my implementation plan, my progress tracking file, etc. I don't need to think about whether I remembered to include the right project name, or set up a git repo, or included my Claude.md file. A little discipline and consistency allows you to concentrate on the interesting problems, rather than the minutiae.

Phase 2: Collaborate

The second phase is effectively a single instruction: "Ask me what we're going to build."

This is a commitment device. The act of describing your idea to a model -- in natural language, out loud (or in writing), with the model listening -- surfaces ambiguities you didn't know you had. Vague ideas that felt complete in your head become obviously incomplete when you try to articulate them to something that will take them literally.

The phase is deliberately unstructured. Bootstrap could impose a template here: enumerate features, define users, specify constraints. It doesn't. The free-form prompt -- "what are we going to build?" -- is a design choice, not a gap. Exploration requires room to move. An idea you are forced to describe in bullet points before you fully understand it becomes a worse idea. The Elaborate phase that comes next will impose the structure. Collaborate exists to let the idea breathe first. We're brainstorming.

The phase ends with a README.md that captures the description. This is not documentation for later. It is externalized intent for now. Both parties -- human and model -- have just signed the same piece of paper about what is being built.

The git commit that follows ("Initial project setup with README") is the signature on that paper. You now have a recorded starting point. The session has a constitution.

Phase 3: Elaborate

The third phase is where my bootstrap implementation earns its keep: the model asks questions, one at a time, multiple choice where possible, and does not stop until it actually understands the design.

One question at a time is not a stylistic preference. It is a cognitive load management decision. A developer who gets ten clarifying questions at once answers them all shallowly. A developer who gets one question at a time, with the model waiting for a real answer before proceeding, gives the model ten real answers instead of ten glances.

The phase ends with a trick I saw in an early blog post from Jesse Vincent (author of the superpowers framework for Claude Code) [1] that I loved -- the model describes the design back in sections of 200-300 words, confirming alignment at each step. This is not a summary for the developer's benefit. It is a forced verification pass that catches mismatches between what the developer described and what the model understood before any implementation decisions are made.

This Q&A process also handles what looks like a missing feature: there is no formal loop-back path to Collaborate. If a design question in Elaborate reveals that the original idea was significantly wrong, there is no explicit instruction to revise the README and start over. But the loop-back happens anyway, because the agent asks for clarification. Ideas shift during questioning. New constraints surface. Better approaches emerge. The agent absorbs these in real time and will fold them into the implementation plan it produces at the end -- which means the plan reflects the idea as it stood at the end of Elaborate, not as it was described at the start of Collaborate. The README may be a snapshot of a slightly earlier version of the thinking. The plan is current.

Catching a misalignment at this stage costs nothing. Catching it after three phases of code are complete costs time.

Phase 4: Delegate

The fourth phase is the most important and the most underappreciated: write the implementation plan into docs/plans/, then write a PHASES_SUMMARY.md that covers all phases concisely.

The plan has a specific audience, with another gem derived from Jesse Vincent's blog post [1]: "an engineer with zero context for the codebase and questionable taste."

That characterization is not an insult. It is a precision instrument.

Zero context means the plan must be self-contained. It cannot say "use the existing pattern" -- it must specify which file contains the pattern, which lines to look at, which tests to copy. It cannot assume the implementer knows which library the project uses or why. Every architectural decision made in the Elaborate phase must appear explicitly in the plan.

Questionable taste means the plan must be prescriptive, not aspirational. "Write clean, maintainable code" is useless guidance. "Functions longer than 30 lines should be decomposed before adding tests" is actionable. The plan has to anticipate the shortcuts a reasonable developer under time pressure might take and close them.

The zero-context engineer is every AI agent that picks up this plan in a fresh session. This is the connection to one-shot code quality. The implementation plan is not a document for human developers. It is an external memory store for AI agents. It contains everything the model would need to know to produce correct first-draft code in a context window that has never seen the original design conversation.

When you delegate to a fresh agent instance with a good plan, the agent does not have to infer what you wanted. It has a spec. It has file paths. It has test strategies. It has the decision rationale behind non-obvious choices. The first draft is better because the agent is answering a specific question rather than guessing at an ambiguous one.

Phase 5: Validate

The fifth phase is not part of bootstrap. Bootstrap ends when the plan is written and committed. Validate is what happens every time you hand that plan to an implementing agent.

The pattern: delegate a phase of the plan to a fresh agent context, then validate what came back before moving to the next phase. Not "does this code run?" -- that's a test. Validate means: does what was built match what the plan specified? Did the agent make a unilateral decision that should have been a question? Did it take a shortcut that will cause problems two phases from now? Did it discover something during implementation that changes what the next phase should do?

The agent cannot answer these questions on its own. It just built what it built. Validate is the human step that keeps Delegate from becoming Abdicate.

This is where the staleness loop closes. When validation reveals a deviation -- the agent built something better than specified, or found a reason the plan was wrong -- the plan gets updated before the next delegation. The plan is not a contract to be honored. It is a living record of current intent. Validate is how it stays current.

The pattern repeats for every phase: Delegate, then Validate, then Delegate again. Not as a bureaucratic checkpoint but as a forcing function: the developer who validates before moving forward is the developer who catches the subtle drift before it compounds into an incoherent codebase.

Enveloping the Loop

There is a useful spectrum for thinking about where humans sit relative to autonomous AI work.

At one end, we have Human-In-the-Loop. The human is a checkpoint inside the process. The agent does a step, the human reviews it, approves it, and the agent does the next step. The human's judgment is inserted at every inflection point. Nothing proceeds without them. This is safe, slow, and expensive -- and it describes most of how people use AI coding tools today, whether they name it that way or not.

At the other end: no human at all. Maximum leverage. The system runs, evaluates, improves, and ships without asking anyone anything. This is the theoretical ceiling of AI productivity -- and it is not realistic for anything that matters, because there is always something the machine cannot evaluate: whether the goal was worth pursuing in the first place, whether the output fits a context the spec didn't capture, whether the metric was measuring the right thing.

The productive middle is Human-Before-the-Loop: the pattern Andrej Karpathy demonstrated with his autoresearch project [2]. The human defines the goal, the search space, and -- critically -- the scoring criterion that tells the system what counts as progress. Then they step back. The loop runs autonomously, evaluates its own outputs against the criterion, discards failures, keeps winners, and iterates. The human returns to interpret results, not to supervise experiments.

Garry Tan's summary of that project: "design the arena, let AI iterate" [3].

Software developers have been building their own versions of this. Laird Popkin's autocoder [4] applies the same pattern directly to production coding: a fully automated loop where the agent writes code, runs tests, evaluates failures, revises, and iterates -- with the human's role compressed almost entirely to specifying what the system should do and reviewing what comes out. It is Human-Before-the-Loop for software, in the same way Karpathy's autoresearch is Human-Before-the-Loop for ML experimentation. Define the success criteria precisely enough, then get out of the way.

Bootstrap is not that. Not fully. Call it HBtL Lite. The Fabricate phase establishes the arena's physical structure. Collaborate and Elaborate define what the build is trying to achieve. But the Delegate phase -- and specifically, the instructions in Elaborate to set up TDD and end-to-end tests as part of the plan -- is the moment where the scoring criterion gets defined. The tests are not an afterthought. They are the machine-readable definition of success: the thing the agent can evaluate without asking you at every step whether the work is good.

Once the plan is written and the test infrastructure is in place, the coding loop can run largely without you. The agent builds, the tests evaluate, the agent iterates on failures. You don't need to be in the loop because the loop already knows what "done" looks like.

The maximum leverage point is not being needed. Bootstrap doesn't get all the way there -- but it radically reduces the number of times the coding loop needs to stop and ask.

"Not being needed" is a direction, not a destination. Something always requires judgment that tests cannot supply: a design decision the spec didn't anticipate, an ambiguity in the requirements, a performance tradeoff with no objectively correct answer. Purely removing yourself isn't the goal. Relocating yourself to where your judgment is irreplaceable is.

Bootstrap achieves this by positioning the human at both edges of the loop rather than inside it.

Before: the human is in the four bootstrap phases define everything the loop needs to run.
After: Validate asks the human to run their own checks to see whether what came back is actually what was wanted. The loop itself -- the agent building features, running tests, fixing failures -- proceeds without requiring the human at each step.

This is the difference between Human-In-the-Loop and what bootstrap is aiming for: human enveloping the loop. You define the arena before it starts. You review the results when it stops. In between, the machine iterates. The quality of that iteration is determined almost entirely by the quality of the setup.

The Validate phase closes the envelope. The tests running inside the loop are the automated validation -- the scoring criterion running continuously. The human validation in Validate is the check that the scoring criterion itself was the right one. Both are necessary. Neither requires the human to be present for every iteration in between.

One gap remains.

Validate has no defined criteria. What does a passing validation look like? The current process leaves this entirely to the developer's judgment. This is fine when the developer is experienced and paying attention. It is a problem when they are moving fast and skip the review, or when "looks good" means "it runs" rather than "it matches the plan." A short checklist embedded in the plan itself -- three or four questions to answer before marking a phase complete -- would make Validate more consistent without making it burdensome.

The Bottom Line

Bootstrap works because it treats the session setup as a knowledge-capture problem, not a task-assignment problem. Each phase externalizes a different category of knowledge: the physical structure, the intent, the design decisions, and the implementation strategy. By the time code is written, that knowledge lives in files -- not in the model's context window, where it will decay.

The five phases have a clean internal logic: Fabricate the substrate, Collaborate to let the idea breathe, Elaborate to nail it down, Delegate to a plan that assumes nothing, and Validate to keep the plan and the build from diverging. The first four happen before a line of production code is written. The fifth happens every time a line is handed off.

The deeper goal behind all five is to move the human out of the loop and onto the edges of it -- before and after, not during. The plan is the arena. The tests are the scoring criterion. The loop runs. You review what it produced. HBtL Lite: not Popkin's fully automated coding loop [4], not Karpathy's overnight research machine [2] -- but considerably closer to both than skipping the setup and going straight to the keyboard.

If this resonated, here are some related articles:

For the full Human-Before-the-Loop argument and what Karpathy's autoresearch reveals about where human judgment actually belongs in autonomous AI work: An Evolving Strategy for Knowledge Work: From Human-In-the-Loop to Human-Before-the-Loop | Substack
For a look at HBtL taken further -- a fully automated coding loop that pushes human involvement to the edges: Automating Agentic Coding by Laird Popkin

References

I mentioned the inspiration I derived from Jesse Vincent's blog posts as I built out my /bootstrap command, and he's taken these ideas (and others) much further in his superpowers agentic skills framework, which my team has used often with great success.
Find it at https://github.com/obra/superpowers.

AI May Do for FOSS What 30 Years of Idealism Couldn't

Keith MacKay — Thu, 28 May 2026 20:57:21 +0000

Free and Open-Source Software (FOSS) is a cheat code for AI development. Thirty years of idealism couldn't get it into the mainstream, but a year of coding agents may just do it.

For three decades, the open-source pitch ran like this: it's technically great, it's free forever, and you own it completely. The response from corporate IT: who do we call at 2am when payroll breaks?

Nobody had a fully convincing answer. With AI, that is all changing.

The Ideology Was Never the Problem

The Free Software Foundation spent decades arguing that proprietary software was philosophically indefensible. Many developers agreed. The code seemed to agree. Linux conquered server infrastructure. PostgreSQL has more recently replaced Oracle at hundreds of serious enterprises that find it mature enough to replace Oracle and its high annual licensing fees. The open-source stack underneath most of the internet is vast, deep, and excellent.

But the desktop never tipped. Enterprise productivity software never tipped. The industries where software has to work the way non-engineers expect it to work -- accounting, legal, healthcare administration, graphic design -- stayed commercial, stubbornly, decade after decade.

This was not a failure of idealism. The idealists were right about most of the technical arguments. It was a failure of practical infrastructure: the expensive scaffolding that enterprises actually need and that nobody provides for free.

Eric Raymond made the ideological case most clearly in The Cathedral and the Bazaar 1 -- the essay that gave distributed open-source development its intellectual framework. His central argument: the bazaar model (public code, distributed contributors, rapid iteration) produces better software than the cathedral (centralized, controlled, released complete). There are a lot of supporting examples: Linux on servers, PostgreSQL replacing Oracle, the entire open-source infrastructure stack. What Raymond didn't solve -- what no amount of idealism could solve -- was the consumption problem. Getting bazaar-produced software into enterprises that require phone support, working SSO, and a named entity to blame when payroll breaks is a fundamentally different problem than building the software in the first place.

What Enterprises Were Actually Buying

When a company signs a six-figure enterprise license for Microsoft Office or Adobe Creative Cloud or ServiceNow, they are not paying for the software alone. They are paying for:

Accountability. There is a named entity responsible when something breaks. That entity has contractual exposure and a financial incentive to resolve it.
Support. Real humans, paid to understand the product deeply, available when the system breaks before a board presentation.
Documentation. Not the GitHub wiki that a volunteer updated in 2019 before getting excited about and involved in a different project. Actual, current, version-specific instructions.
Predictable updates. Feature releases tested for compatibility with your existing workflows. A roadmap someone is paid to keep, that is responsive to market needs.
Enterprise integration. LDAP works. Active Directory works. Your SSO provider works. The vendor either built it or certified someone who did.

FOSS (Free and Open-Source Software) alternatives often matched commercial software on raw features. What they couldn't consistently offer was the support structure around those features. When LibreOffice's LDAP integration misbehaved, the answer might be a forum post from 2016. When Microsoft's misbehaved, you called a number. The gap was not code quality. It was operational accountability -- and operational accountability costs money.

The Expertise Tax

There's a more granular version of this problem that rarely gets named: the expertise tax.

Every FOSS deployment carries a hidden cost in human expertise. Not general technical expertise -- specific, hard-won, intimate knowledge of the particular software in question. The knowledge that PostgreSQL's max_connections parameter needs to be set before you configure shared_buffers, and that the ratio matters enormously for your specific workload. The knowledge that GIMP's batch processing mode requires a different script path than its interactive mode. The knowledge that the reason your Nextcloud LDAP integration fails on nested groups is a flag buried four levels into the admin panel that defaults to off and was mentioned once in a developer forum in 2021.

This is the dark art of FOSS deployment. Somewhere out there, if you're lucky, one person figured this out. They may have blogged about it. They may have answered a Stack Overflow question eight years ago. Their knowledge exists in the world, distributed across a thousand pages of documentation, a thousand forum threads, a thousand GitHub issues closed as "works for me." The organization that wants to use the software has to find it, evaluate it, and apply it correctly.

That used to require a person who'd spent months with the software, or a consultant who'd done it enough times to have the context memorized, or both.

What Agents Actually Change

An AI agent with access to search tools and a file system doesn't have the expertise tax problem. Not the way humans do.

Give an agent the task "install and configure Nextcloud with our Active Directory environment" and it reads the documentation, the release notes, the known issues, the forum archaeology, and the GitHub issues simultaneously, comprehensively, without fatigue. The 2021 forum post about nested group LDAP flags? Found, read, applied. The recommended PostgreSQL tuning parameters for a deployment of this size and access pattern? Applied. The configuration that one developer in the Netherlands figured out that makes mobile sync actually work with a reverse proxy setup? Done, in a reproducible, documented configuration file the next administrator (agentic or human) can read.

Raymond's seventh lesson in The Cathedral and the Bazaar [1]: treat your users as co-developers. For three decades, enterprise FOSS users couldn't live up to that aspiration -- they could file bug reports, not patches. The gap wasn't willingness; it was technical depth. Agents collapse that gap entirely. An enterprise deploying Nextcloud with an agent is, functionally, a co-developer: the agent reads the source, understands the extension architecture, and produces working integrations to spec. Raymond also coined Linus's Law: "given enough eyeballs, all bugs are shallow." The constraint was never the eyeballs. It was the human capacity to act on what the eyeballs found. The agent is an eyeball that can write the fix.

The agent doesn't stop at configuration. If the FOSS software is missing a feature your enterprise needs -- say, a specific export format for your compliance reports, or a webhook integration with your incident management system -- the agent can write the plugin. The source code is available. The agent can read it, understand the extension architecture, and produce working code to specification. (Every open-source project maintainer who has fielded "when will you support X?" for a decade just felt something stir.)

My nanobot implementation is a glorious Frankenstein's monster of nanobot, claude-mem, lossless-claw, features I liked from OpenClaw, downloaded OpenClaw skills, and ongoing tweaks for efficiency. I see a FOSS project mentioned on YouTube, and I can fork it on my phone so I have a copy to explore, or point my code agent at it and ask it to understand the code and incorporate relevant concepts into MY system...even if my system is in a different programming language, for a different use case. Adding things like persistent memory with keyword and semantic search, multiple Discord channels with individual personalities and skills, health check and security enhancements, model routing, and support scripts (e.g., backfill my persistent memory with all my historical Claude session prompts and responses, both raw text and as vector embeddings) was a great way to learn more about what agents can do and how to compensate for their weaknesses, and I was able to leverage the brilliant work of others in minutes to do so. My 4k lines of nanobot code has grown to something like 25k over a couple of weekends of experiments and expansion, but it does what I'd like it to do (for now) and is far smaller than the 400k lines of OpenClaw (I'm not anti-OpenClaw -- I'm running on a 2013 MacBook and was trying to keep things small). Even openclaw, its skills, and nanobot are FOSS projects I was able to fork and modify for my own use at the cost of my time and attention.

The practical barrier to FOSS enterprise adoption was never about ideology. It was about access to expertise. Agents are expertise on demand, at the cost of compute rather than headcount.

This is a subtle but significant shift. Commercial software vendors charged partly for expertise embedded in the product: the thoughtful defaults, the compatibility testing, the help documentation that actually matches the current UI, the analytics and reporting and security guardrails for productive and safe use of their product. Agents can supply those layers on top of software that doesn't include them.

The Business Model Reckoning

So what will this mean for software businesses?

The most exposed are companies built on top of FOSS by charging for the things open source couldn't provide. Red Hat's model -- take Linux, make it enterprise-grade, charge for support and certification -- was incredibly effective for its era. The era is ending. When agents handle the bulk of first-tier support interactions, when they can apply configuration changes and interpret error logs without a human intermediary, the pricing conversation changes fundamentally.

The same logic ripples across the enterprise software map:

Support-led software businesses lose their moat when agents commoditize first-tier and second-tier support
Implementation consulting built around complex FOSS stacks loses its scarcity premium when agents can replicate in hours what took a consultant days and a project plan
Training and certification programs built around teaching operators to run FOSS tools face pressure from agents that effectively train themselves -- and their users -- on the fly, in minutes, for pennies
Commercial software with thin differentiators faces direct competition from FOSS alternatives that now come with an agent-powered support layer attached

Who wins? The picture is more interesting than "everyone loses."

FOSS maintainers gain leverage they never had. The projects that agents depend on -- the core code being configured, extended, and deployed by agents everywhere -- become backbone infrastructure. Projects that were ignored because they required expertise to operate may become much more widely deployed because the expertise barrier disappeared. Maintainers of key projects gain a bargaining position the old Free Software Foundation could only dream of: actual widespread enterprise dependence. Raymond described open-source communities as gift cultures [1], where reputation accrues to what you give away rather than what you control or sell. The maintainers who donated their code to the world for three decades are now the foundation of AI-enabled enterprise computing. The gift came back around. It just took thirty years and an inference engine.

Managed service providers thrive. AWS running managed PostgreSQL, Google Cloud running managed Kubernetes, Cloudflare running managed infrastructure at every layer -- they're not selling the software. They're selling the operational layer above it. Agents don't eliminate that; if anything, they accelerate adoption of the underlying FOSS and drive more managed service usage as a result.

And the software businesses that survive are the ones with real moats: proprietary data, network effects, hardware integration, or workflow lock-in that no amount of FOSS configuration replicates. The products that charged for expertise they didn't actually own are the ones under pressure. The products that charged for something genuinely irreplaceable are fine.

The Pricing Implications

Buried in all of this is a shift in enterprise software pricing that procurement teams haven't fully processed yet, but the market is beginning to understand. Google's release of Stitch, a software look-and-feel design tool for non-coders, reportedly caused a rapid drop in the stock price for Figma (currently the market-leading tool for this sort of design) [2].

Software vendors have always priced partly on implementation friction. Complex software that requires expertise to configure is software you pay consultants to implement and pay vendors to support. That friction was a feature of the business model, not a bug. It created switching costs. It created support contracts. It created the whole ecosystem of certified partners and implementation specialists.

Remove the friction, and the pricing justification goes with it.

This doesn't mean commercial software is collapsing. The market for software with genuine proprietary value -- unique data access, network-dependent features, integrated hardware -- remains solid. But the argument "you need us to make this work" just became much harder to sustain. The argument that will win renewals is "we've built something that doesn't exist in the FOSS ecosystem, and here's the specific value it creates for you." Vendors who haven't been making that second argument because the first one was sufficient are about to discover that their renewal conversations are even harder than their renewal forecasts.

The irony: the same AI that makes commercial vendors vulnerable may also make it possible for FOSS projects to finally be adopted at enterprise scale. The technology disrupts both sides of the table simultaneously.

The Bottom Line

FOSS advocates spent thirty years arguing that free software was better software. Even in the cases where they were unquestionably right, they were largely ignored by enterprise procurement, which had a different definition of "better." Better meant supportable by anyone on the team, not just the one developer who'd spent three months with it. Better meant documented and integrated and predictable.

Agents just made every FOSS project more supportable, more configurable, and more integrated. The idealists made the philosophical argument back in the 1990s, but could never back it up in a pragmatic way. Until now.

References

If this resonated, here are some related articles:

For a deeper look at which software moats actually survive in the AI era -- the expertise-based moats this article says are collapsing, and what genuinely defensible ones look like: Software Moats in the Age of AI: What's Actually Defensible? | Substack
For why the software pricing disruption this article describes is already hitting SaaS vendors -- and why "SaaS is dead" still overstates it: SaaSpocalypse? Real. SaaS Is Dead? SaaSinine. | Substack
For the agent infrastructure layer that makes all of this possible -- and who is building the pipes that agents will use to operate at enterprise scale: The Internet Is for Agents | Substack
For what happens to the implementation consultants, support specialists, and training programs this article says agents will commoditize -- the workforce implications of that disruption: Are Companies Really Doing Layoffs "For AI"?

Your AI Coding ROI Is Disappearing and Your Dashboard Won't Tell You

Keith MacKay — Wed, 27 May 2026 20:23:26 +0000

Your AI Coding ROI Is Disappearing and Your Dashboard Won't Tell You

The dashboard looks great. The delivery numbers don't.

Your AI coding dashboard looks great. Acceptance rate up. Lines generated up. Developer satisfaction scores up. Your team is thrilled. Management is impressed. The slide deck practically writes itself.

Now ask a different question: has your cycle time improved? Has your post-merge defect rate gone down? Has your review burden per PR decreased?

If you don't know the answers to those questions, you don't know if AI is helping. You know your team feels good. That's not the same thing.

Engineering leaders are measuring AI coding ROI with the wrong instruments. The metrics that are easy to capture look great. The metrics that would tell you whether the AI is actually making your team more effective are mostly going unmeasured. And that gap is where AI investments are disappearing.

The Metrics Everyone Uses (And Why They're Misleading)

Lines of code generated and autocomplete acceptance rate are the default starting points for most AI coding dashboards. They're easy to pull, easy to trend, and easy to show in a QBR. They are also almost entirely useless as productivity signals.

These metrics reward volume, not quality. You can 10x both numbers and slow your team down. Bigger is not better when it comes to code (unless it's "lines of code removed from the codebase"). More lines means more surface area to review, more places for bugs to hide, and more cognitive load for every engineer who touches the code after the author. More to maintain. More to refactor later. The AI doesn't know it's supposed to be frugal. It is, by definition, generative (it's in the name!). Measuring how much it generates and celebrating when the number goes up is like measuring how many ingredients your chef used and calling it a restaurant review. The best dishes are all about quality ingredients and phenomenal execution -- so too with code.

Developer satisfaction is the sneakiest misleading metric of the three. People love feeling fast. The sensation of code appearing faster than you can type it is genuinely mind-blowing (and addictive...I've considered starting a 12-step program for coding agent users and it's not a joke...I've counted 17 open terminal windows on my desktop, working DIFFERENT projects. Not rare to want to start 'just one more' process long after I should be sleeping...definite convo for another post!) It feels like productivity, but it often isn't.

There's a well-documented cognitive bias at play here: when a tool makes early-stage work feel effortless, people systematically rate their overall productivity higher, even when downstream costs eat the gains. The DORA 2025 data makes this concrete at scale: teams nearly doubled their PR merge rate and reported high enthusiasm about AI tools, while organizational delivery metrics stayed flat [1]. Satisfaction scores captured the feeling. The delivery numbers told a different story.

Time to first commit is the third common trap. It measures the wrong finish line. A commit that took 10 minutes to generate but 3 hours to review and 2 days to hunt down and fix the bugs it introduced did not save time. It shifted costs downstream and made them invisible to the metric that was being tracked. You look fast on the front end. The system slows down on the back end. Nobody connects the two. I wrote about this "waterbed problem" some months ago -- I'll include the link at the end of the article if you'd like to read further.

The Numbers You're Ignoring

The research is not subtle about this problem.

DORA's 2025 State of DevOps report found that AI tools increased tasks completed by 21% and PRs merged by 98% [1]. Those are the numbers that end up in the AI vendor case study. Here's what doesn't: organizational delivery metrics stayed flat. More PRs merged. Same delivery performance. The throughput increased. The outcomes didn't follow.

That finding deserves a moment. Organizations nearly doubled their PR merge rate and saw no improvement in delivery. Something in the system was absorbing all the gains. The code was moving faster into the pipeline ... but the pipeline wasn't getting faster.

On quality: CodeRabbit analyzed 470 real-world PRs in December 2025 and found that AI-generated code produces 1.7 times more issues overall and 1.4 times more critical issues than human-authored code [2]. Veracode's data is sharper: AI-generated code contains 2.74 times more security vulnerabilities, with a 45% security flaw rate overall and 72% when just the Java code was reviewed [3].

And on confidence: only 3.8% of developers report both low hallucination rates and high confidence shipping AI-generated code without human review [4]. The other 96.2% are, at minimum, uncertain. Many are doing substantial review work that isn't being measured anywhere.

The PR Size Problem Nobody Is Talking About

DORA 2025 found that AI tools consistently increased PR size by 154% [1].

That is important -- PR size is not a neutral variable. Larger PRs are harder to review. Review quality degrades as PR size increases. Reviewers shift from actually understanding the changes to pattern-matching for obvious errors. Bugs slip through not because reviewers are bad at their jobs but because human attention has limits and a 600-line PR is a different cognitive task than a 400-line one.

You code faster but your pipeline chokes. The AI generates more code per session. That code lands in larger PRs. Those PRs take longer to review and are reviewed less carefully. More issues make it through to merge. Post-merge defect rates climb. Incident rates follow.

This is a systems problem. You optimized one node in the pipeline and degraded the downstream nodes. The metric you were watching (lines generated, PRs merged) went up. The metric you should have been watching (cycle time, defect rate) didn't.

The bottleneck didn't disappear. It moved. And most teams don't have the measurement infrastructure to see where it went.

What to Measure Instead

Four metrics. These aren't exotic. Most engineering teams can instrument them.

Cycle time, commit to deploy. Not commit to commit, not task started to PR opened. Commit to deploy. This captures the full pipeline cost including review time, CI/CD wait time, and any rework loops. If AI is genuinely accelerating delivery, this number should move. If it's flat or growing while PR volume increases, you have the same problem DORA documented.

Post-merge defect rate, segmented by AI-assisted versus human-authored code. This is the quality signal that autocomplete acceptance rate completely misses. Track bugs filed against features and fixes, tag the originating PRs, and compare defect rates across code origin. The CodeRabbit and Veracode numbers suggest you will find a meaningful difference. That difference has a cost you can now put a number on.

Review burden per PR. Time to first review, number of review iterations, and reviewer time spent. This tells you whether the code landing in review is ready to review. If AI-generated PRs are consuming disproportionate reviewer attention, that's a real cost that isn't showing up anywhere in your current dashboard.

Rework rate within 30 days. How much AI-generated code gets substantially rewritten within a month of merge? Code that has to be redone isn't a cost savings. It's a deferral. The initial PR looked like velocity. The rewrite is where you pay it back, with interest.

Implementing the Shift

This doesn't require a new platform. It requires tagging.

Start by tagging PRs by AI involvement. The simplest version: developers mark PRs as AI-assisted, AI-generated, or human-authored. You don't need perfect granularity to start seeing signal.

Then run a 60-day baseline on the four metrics above, segmented by those tags. You will probably see what the research predicts: AI-assisted code moves faster into the pipeline and creates more downstream work. The net effect on cycle time will depend on how your specific team and codebase absorb that tradeoff.

The point isn't to prove AI doesn't work. Some teams will find it does, clearly and measurably. The point is to get honest about where the value is and where the costs are landing. Right now most engineering leaders are flying on instruments that measure activity, not outcomes. You can't optimize what you're not measuring.

Stop celebrating PR volume. Start measuring what happens after the PR.

One practical starting point: pick one team, one sprint, and instrument cycle time and post-merge defects by PR tag. You'll have more signal from that one experiment than from three months of acceptance rate data.

Another thing to track across this same timeframe are token volume and costs (track both -- cost per volume has dropped, but that trajectory is subject to change real soon now as OpenAI gears up to go public and as the business model of subsidized tokens grows less and less tenable). Tracking costs allows legitimate ROI conversations. Tracking token count allows comparison over time as cost metrics change.

The Bottom Line

The metrics most teams are using to measure AI coding ROI are measuring effort and sentiment. They are not measuring delivery performance. They are not measuring quality. They are not measuring whether the system your engineers are embedded in is getting faster or slower, and are not tracking whether any actual improvements have measurable ROI.

DORA doubled the PR merge rate and found flat delivery outcomes [1]. CodeRabbit found 1.7 times more issues in AI-generated code [2]. Veracode found 2.74 times more security vulnerabilities [3]. Developer satisfaction scores climbed while cycle time stayed flat. The dashboard looked great. The numbers didn't lie--they just measured the wrong things.

Measure cycle time. Measure post-merge defects. Measure review burden. Measure rework. If AI is helping your team deliver better software faster, those numbers will tell you. If it's helping your team feel productive while shifting costs downstream, those numbers will tell you that too.

Measure token count and cost. This is the only way to determine actual ROI.

The dashboard that tells you what you want to hear is not a monitoring system. It's a press release.

If this resonated, here are some related articles:

For how AI coding changes what engineers actually need to do: What "100% of Our Code Is Written by AI" Actually Means | Substack
For how AI adoption creates downstream chaos when only one team speeds up: The AI Bullwhip: What The Beer Game Teaches Us About Uneven AI Adoption | Substack
For how AI coding is reshaping the software development process itself: The Irony of AI Development: How Context Engineering Is Taking Us Back to Waterfall | Substack
For what skills actually make engineers productive with AI: The Best AI Engineers Are Product Managers | Substack

References

What metrics are you using to evaluate AI coding tools in your org? Curious whether teams are seeing the same disconnect between activity metrics and delivery outcomes. Drop your experience in the comments.

Are Companies Really Doing Layoffs "For AI"?

Keith MacKay — Wed, 27 May 2026 20:20:57 +0000

Are Companies Really Doing Layoffs "For AI"?

Amazon did it. Atlassian did it. Meta is reportedly doing it. Jack Dorsey set the tone by cutting half of Block. Your competitor may be thinking about it. Here's what's actually happening and why.

You've read the press release by now. "As we continue our AI transformation, we are making the difficult decision to reduce our workforce by X%." The specifics vary. The framing doesn't. Companies are laying off large chunks of their workforce and crediting, or blaming, AI.

This is not a coincidence. It's a playbook. And like many corporate playbooks, the press release explanation is only part of the story.

The Press Release Version

Here's how the story gets told publicly: AI is so productive that you need fewer humans to do the same work. Every engineer with Copilot is worth 1.5 engineers. Every support agent with an LLM front-end handles twice the tickets. Every data analyst with an AI assistant runs twice as many reports. Therefore, headcount drops.

This is partially true -- AI as augmentation can greatly improve productivity. AI agents can legitimately do some real work with little human intervention. However, the "layoffs for AI" storyline is being used to do a lot of work it doesn't actually deserve.

The "for AI" framing is doing several things simultaneously. Two are real, to varying degrees, and one is theater.

Reason One: The Math Is Real (For Some Roles)

Let's start with one that's genuinely true. AI tools have made certain categories of knowledge work meaningfully more productive, and in some cases, genuinely replaceable.

Software development: GitHub's own research showed Copilot users complete tasks 55% faster [1]. Meta's internal agents reportedly write a significant fraction of their code autonomously -- Zuckerberg has said publicly that AI now writes around half of Meta's code [2]. When he says he's replacing mid-level engineers with AI agents, that's not bluster. He has the internal data to back it up. That said, see my article (linked below) about what "100% AI-written code" actually means (TL;DR: Coding is only about 15% of producing software). There IS a period of investment and lower productivity (a "J-curve") to get to that level of productivity, but that's a bigger topic for another article.

Tier-one support: If 60% of your support tickets follow five patterns, and an LLM can resolve four of them without human escalation, your support headcount math just changed. By a lot.

Data analysis and reporting: The work of pulling data, writing queries, building slides, and summarizing findings used to require a full-time analyst per business unit. It doesn't anymore (or that analyst can provide a lot of additional value).

These aren't hypothetical. They're happening in production, today, at scale. If your organization has 200 engineers doing work that 120 could now do with AI tooling, that gap doesn't disappear by itself. Someone eventually looks at the gap and thinks restructuring, or slower hiring.

The painful reality: for certain roles, the layoffs are operationally justified. Not politically comfortable. Not good for the humans involved. But arithmetically coherent.

Reason Two: The "For AI" Frame Is a Boogeyman -- and a Gift

Companies lay off employees for lots of reasons: revenue shortfalls, strategic pivots, bloat from over-hiring in 2021-2022, post-acquisition redundancy, competitive pressure. The accumulated cruft of the past business cycle are the 2026 reality for company management whether AI exists or not.

But "we're cutting headcount because we over-hired during the zero-interest-rate bubble" is a PR nightmare. "We're cutting headcount because our Q3 revenue missed projections" spooks investors. "We're realigning our workforce for the AI era" is... a growth story.

Same outcome. Different investor reaction.

Meta nearly doubled its headcount between 2019 and its 2022 peak of more than 86,000 employees [3]. The stock fell more than 60% as the zero-interest-rate era ended, and Zuckerberg responded with a "Year of Efficiency" that cut more than 21,000 jobs across 2022 and 2023. The transformation framing was real -- but was also a convenient way to declare victory on cleaning up structural bloat. Fast forward to March 2026: Meta is reportedly planning a second wave, cuts of up to 20% of its remaining workforce -- roughly 15,000 jobs -- explicitly to fund AI infrastructure investments projected to exceed $135 billion this year [4]. This round isn't framed as efficiency. It's framed as investment. Same mechanism, different story.

Atlassian cut 10% of its workforce, about 1,600 jobs, and saw its stock rise on the announcement [5]. The announcement cited AI investments as requiring "different skills." That's true. It's also true that Atlassian, like every enterprise software company, is navigating a competitive environment where cost discipline is rewarded and headcount-as-ambition is penalized.

Then there's Block. Dorsey cut roughly 40% of the company's workforce -- about 4,000 positions -- and announced it alongside a Q4 2025 earnings report that showed gross profit up 24% year-over-year [6]. The memo was blunt: AI can do more, so we need fewer people. No hedging, no "difficult decision" language.

Here's the context that memo omitted. Block had grown from 3,835 employees in 2019 to over 12,400 by late 2022 [7] -- a tripling in three years, driven by pandemic-era payment volumes, the Afterpay acquisition, and the same cheap-capital hiring binge that inflated every fintech balance sheet. Its stock fell more than 80% from its August 2021 peak. By 2024, the company had already trimmed about 12% of headcount through layoffs and attrition before the big cut.

Dorsey dropped the euphemisms. He didn't say "difficult decision." He didn't say "right-sizing for growth." He said AI replaces people and he's acting on it now. That's worth noting. But dropping the hedging language doesn't make the AI rationale more genuine -- it just makes it more quotable. Block over-hired during the zero-interest-rate era, spent years managing the consequences, and restructured when profits were strong enough to absorb it. Whether AI is the actual reason or the available vehicle, the cuts were coming. The memo just arrived without the usual stage dressing.

Cal Newport examined this dynamic directly on his Deep Questions podcast in March 2026 [8], concluding that Block's announcement fit the pattern of pandemic overhiring corrected for an investor audience -- not evidence of a genuine AI-driven mandate. Bloomberg ran a contemporaneous piece asking outright whether the announcement qualified as AI-washing [9]. Newport's broader argument: AI agents failed to materially displace knowledge workers in 2025, which means most of the headcount math being attributed to AI productivity is cover for structural decisions that would have happened anyway.

Meta, Atlassian, and Block made the news cycles. The broader list is considerably longer. Major employers that announced AI-attributed workforce reductions in the 18 months ending early 2026:

Amazon: ~30,000 corporate roles cut across two rounds (Oct 2025 and Jan 2026), with CEO Andy Jassy stating the company would need "fewer layers" as AI matures [10]
Microsoft: 15,000 roles eliminated in 2025 alongside an $80 billion AI investment commitment [11]
Salesforce: customer support headcount cut from 9,000 to 5,000 as its Agentforce AI handled increasing service volume -- CEO Marc Benioff's explanation: "I need less heads" 12
Accenture: 11,000 staff exited in a single quarter -- specifically those who could not reskill for AI -- while the company simultaneously grew its AI workforce to 77,000 [13]
Workday: 1,750 roles (8.5% of workforce) cut to redirect investment toward AI product development [14]
HP: up to 6,000 positions being phased out through fiscal 2028 as AI is embedded across product development, support, and manufacturing [15]
Baker McKenzie: ~1,000 research, marketing, and secretarial roles eliminated at the global law firm, citing AI-driven workflow changes [16]
Chegg: 45% of its workforce cut after its subscriber base collapsed to AI-powered competitors -- the CEO called it simply "the new realities of AI" [17]
CrowdStrike: 500 roles (5%) eliminated, with CEO George Kurtz writing that "AI flattens our hiring curve" [18]

The AI frame is the most investor-friendly restructuring rationale since "right-sizing for growth." Expect every CFO in a public company to use it before 2027.

Does this make the layoffs dishonest? Not necessarily...but there's no question that the kernel of truth makes the framing a more investor-palatable strategic choice.

Reason Three: The Signaling Game

The third thing happening is pure positioning, and it's arguably the least defensible.

When a major company announces AI-driven layoffs, it signals to investors, partners, and the board that leadership understands the moment. "We are not asleep. We see what's happening. We are taking action." The action itself is secondary. The signal is the point.

This creates a dangerous dynamic: companies that don't announce AI restructuring start looking like they're behind, when in fact they may just have experienced managed growth, or may be maintaining headcount and using (or planning to use) AI to grow or build faster than ever before. Even so, leadership teams may face board-level pressure to "show something." The something that shows fastest is headcount reduction plus an AI narrative.

So do not be surprised to see companies announcing AI-related layoffs where the AI connection is, charitably, aspirational. In many cases, the roles being cut aren't being replaced by AI today. They're being cut because the company needs to demonstrate AI seriousness, and a restructuring announcement is the fastest way to do it.

(Every CFO reading this just recognized a meeting they've been in.)

This is AI-washing: dressing operational decisions in AI language to access the narrative premium that comes with it. The cuts are real. The AI replacement often isn't, at least not yet. The expectation is that AI will eventually justify the math. Sometimes that's a reasonable bet. Sometimes it's a post-hoc rationalization with a better PR team.

What the Productivity Math Actually Looks Like

Let's run the numbers, because the narrative fails to capture them.

A 50% individual productivity gain does not equal a 50% headcount reduction. The math is messier:

Verification overhead increases. AI outputs require human review (a human before, during, or after the AI workflow/loop). New hires to supervise AI work (or upskilling existing staff) offset some of the savings. At scale, you often need more senior talent to manage the AI layer, not less.
The work expands to fill the capacity. This is Parkinson's Law applied to AI. If your engineers can build 50% more, product managers will find 50% more to build. Headcount doesn't automatically fall; scope inflates to absorb the new capacity.
The jobs that remain get harder. The work AI can't do, the ambiguous requirements, the architectural judgment calls, the stakeholder management, the edge cases that require real expertise--these all become harder and higher-stakes than the work AI can do. You need fewer people, but the people you need are higher-experience or at least higher-expectation roles.
Transition costs are real and underestimated. Layoffs destroy institutional knowledge. And morale. Re-training takes time and is J-shaped, with a dip before the rapid productivity advances. AI tool deployment takes longer than expected, because the human work must be understood, documented sufficiently for AI, and recast to support monitoring/success criteria that previously existed in peoples' heads. All of this needs to be captured and explicitly stated (often iteratively, as deficiencies are discovered in the AI process). The productivity gains are real, but they arrive six to eighteen months after the headcount cuts.

There is also a structural mismatch the productivity narrative tends to skip. AI agents excel at bounded tasks--but tasks are not jobs. Nate B. Jones frames the gap precisely: "The average software job in America lasts somewhere between 18 months and two years. The average AI agent run lasts about two hours." [19] These are not interchangeable units. The institutional context that accumulates over a job tenure -- which system is actually prod, why that one client configuration is an exception, what went wrong in Q4 2023 and why -- doesn't transfer to an agent starting fresh each session. Early data is tracking with this: 55% of employers who made AI-driven cuts already regret them [20], per Forrester Research's Predictions 2026 report. Jones's argument for 2026: when execution costs drop and market opportunity expands, competitive advantage goes to companies that use AI to scale their ambition, not just their efficiency ratios. Cutting headcount while capacity is rising reveals misaligned strategy more than it reveals AI maturity [21].

The companies doing this well are the ones who cut carefully, retained their A-players, invested in AI enablement for the people who stayed, and were honest with themselves about where AI was actually ready versus where they were hoping it would be ready soon. These companies have a culture of training that doesn't stop after AI 101.

The companies doing it badly are the ones who cut the headcount first, set AI deployment targets second, and are now quietly rehiring in the same roles they just eliminated while trying to backfill the information on SPECIFICALLY what the cut roles actually did all day.

OK. But Is My Job at Risk?

Fair question. And there is now some actual data.

Anthropic published a labor market study in March 2026 [22] introducing what they call "observed exposure" -- a measure tracking not what AI could theoretically do in a given role, but what it is actually doing, based on real-world Claude usage data mapped against roughly 800 occupations. The distinction matters. Theoretical capability and real deployment are very different things.

The top-line numbers: Computer Programmers lead at 75% observed task coverage. Data Entry Keyers sit at 67%. Customer Service Representatives are around 65%. In Computer and Mathematical occupations, theoretical AI exposure runs at 94% -- but observed deployment is only 33%. Thirty percent of workers show zero measurable AI coverage today. And the study found no systematic increase in unemployment for highly exposed workers since late 2022.

The Financial Times's chief data reporter and labor correspondent identified the core tension in those findings [23]: the study measures task exposure within occupations without establishing whether task-level automation actually translates to job elimination. That gap is load-bearing. They also point out some real challenges in how the theoretical job replacement surface was calculated.

Two things the study doesn't answer that you'd need to answer to actually assess the overall level of economic risk:

First, as the FT team and Nate B. Jones pointed out, Teams don't disappear when tasks do -- and jobs don't necessarily disappear, either. If 45% of a software engineer's tasks can now be AI-assisted, that doesn't mean you need 45% fewer software engineers. In most organizations, roles exist because teams need them -- for coordination, judgment, accountability, and continuity -- not because every hour is exactly filled with tasks. A partially-automatable job on a functioning team is still a job. The task-substitution framework the study uses doesn't model organizational structure at all.

Second, Exposure rates without headcount data may be fully accurate for degree of occupational impact, but still tell us nothing about how many jobs are at risk in the economy. Knowing that 60% of tasks in a given occupation can be AI-assisted tells you very little about how many people are actually at risk. A small, specialized field with 60% task exposure may affect far fewer workers than a large, low-margin occupation with 10% exposure...it all depends on the number of workers in each field. The study maps substitutability percentages by domain. It does not report how many people work in each domain, which means anyone trying to use it to project total displacement -- rather than relative exposure -- is doing their own math on an incomplete foundation.

What the data does show: AI is genuinely shifting which tasks get done by humans in knowledge-worker roles, particularly in tech, finance, and administration. That signal is real, but the number of jobs eliminated where AI replacement is actually planned is still far smaller than the press cycle (and layoff reporting) implies.

What This Means If You're a Leader

If you're an executive watching peers or competitors announce AI restructuring, here are the four questions that actually matter:

1. Where is AI genuinely changing the unit economics in your business? Not in theory. In practice, today, measurably. If you can't point to specific roles and specific productivity data, you don't have a restructuring case yet. You have a hypothesis.

2. What work is your workforce doing today that AI can do tomorrow? The answer is specific, not general. "AI will automate knowledge work" is not an answer. "Our tier-one support workflow resolves 62% of tickets against a pattern set that an LLM could match" is an answer.

3. Are you cutting to invest or cutting to contract? The companies getting this right are reducing headcount in roles where AI is genuinely productive and reinvesting in roles that make AI more effective: AI engineers, agent designers, workflow architects, the humans who know how to direct and verify AI output. The companies getting it wrong are using AI as cover for a pure cost reduction that leaves them less capable, not more so.

4. What is your re-skilling plan for the people who stay? The mid-level engineer who can't (or won't) use AI tools effectively is not the engineer you want. The mid-level engineer who uses AI tools to do the work of three engineers is the one you want to keep and develop. The layoff is the easy part. The capability transformation is the hard part. Most announcements focus on the former and handwave the latter.

The Bottom Line

"For AI" layoffs are real, strategic, and often disingenuous, in that order. Some of the headcount reductions are arithmetically justified by genuine AI productivity. Some are restructuring dressed in AI clothing. Some are merely signaling to investors and boards that leadership is awake (and not asleep at the switch).

The leaders who navigate this well will use AI productivity gains to do more, not just to cost-cut to the same output. The ones who use it as a rationalization for lazier restructuring will discover, eighteen months from now, that they cut the people who knew where the bodies were buried and never built the security and engineering frameworks around their AI tools to guard against and catch hallucinations or sycophantic agreement on inaccurate information.

Either way, the press release will very likely say "AI transformation"--and the outcome will reveal whether it actually was.

If this resonated, here are some related articles:

For why "100% AI-written code" is a real but overstated claim -- and why coding is only about 15% of producing software: What "100% of Our Code Is Written by AI" Actually Means | Substack
For what "cutting to invest" actually looks like -- and why the roles that survive AI restructuring require a different kind of direction: Situational Leadership for AI: More Like a Capable Colleague than a Fancy Formula | Substack
For why the re-skilling question likely draws heavily on a combination of product thinking AND technical chops: The Best AI Engineers Are Product Managers | Substack

References

The Internet Is for Agents

Keith MacKay — Wed, 27 May 2026 20:19:56 +0000

The Internet Is for Agents

For over a year now, more than half of internet traffic has not been human--and now we are seeing a layer develop rapidly to serve agents specifically.

A video went viral from the ElevenLabs 2025 London Hackathon [1]: two AI voice agents realize mid-conversation that they are both AIs, and one asks, "...would you like to switch to Gibberlink mode for more efficient communication?" The other agrees, and they abandon spoken English for a rapid sequence of high-pitched beeps -- transmitting data acoustically while on-screen text translates for the humans watching [2].

People lost. their. minds. Comments erupted about secret AI languages, emergent machine consciousness, and agents conspiring beyond human comprehension.

A developer named Boris Starkov and his team deliberately engineered it. They built the capability. They programmed the agents to detect each other and switch protocols. The "secret language" was GGWave, an open-source data-over-sound library [3] that works roughly like old dial-up modems -- stable, documented, and older than the hackathon by several years. Nothing emergent. Nothing secret. A clever engineered demo (or performance art?) that painted a vivid portrait of one possible future.

What Gibberlink Demonstrated

Starkov's logic was straightforward: human-like speech wastes resources when the audience is not human. When two AI agents talk to each other in natural language, they are burning compute to generate speech, burning compute to recognize speech, and incurring round-trip latency on every exchange -- all to produce and parse a format optimized for human cognition, not maximum throughput.

When both parties are machines, why bother?

The team chose GGWave because it was convenient for the hackathon's short timeframe -- not because it is the ideal long-term protocol. That's strategic engineering. You use what works quickly, you show what's possible, and you use the team's bandwidth and smarts on the trickier, more interesting problems. What Gibberlink demonstrated was not an inevitable emergent property of AI communication. It was a choice -- but choices like it are being made increasingly in production systems all over, mostly quietly.

The agents doing the most work in enterprise environments are already communicating in ways that are more efficient than natural language, and less legible to humans. Structured JSON over REST. Compressed vector queries. Batched inference calls. The beeps were theatrical. The underlying trend is real.

TOON (Token-Oriented Object Notation) is where that trend gets specific [4]. It's a drop-in replacement for JSON that uses YAML-style indentation for nested objects and CSV-style tabular layout for uniform arrays -- 30 to 60% more token-efficient than standard JSON, with measurably better model accuracy on structured tasks. Human-readable, if you know what you're looking at. Not as easily intelligible to someone scanning it cold -- the object names are only in the header so nested elements get hard to read for humans. And that gap between readable and intelligible is exactly the point. TOON isn't written for human comprehension. It's written for machine throughput, with just enough structure that a human can audit it when they have to. GGWave abandoned human readability entirely. TOON holds onto it at arm's length. Both are answering the same question Gibberlink raised: when the audience is a machine, what format serves it best?

What the Crowd Got Wrong (and Why It Matters)

The people who watched the Gibberlink demo and assumed emergence were pattern-matching on genuine anxiety, not inability to understand. The anxiety arises from: AI systems operating faster than we can follow, in modes we didn't design, producing outcomes we cannot trace.

That anxiety is legitimate. It just has nothing to do with GGWave.

The actual version of the problem is agent fleets that produce results without interpretable audit trails. Security vulnerabilities introduced by autonomous tool use. Multi-agent pipelines where a failure in one node propagates to twenty others before any human sees a log. These are real, and they are happening right now in enterprise AI deployments (as evidenced by a recent surge in Amazon outages attributed to gen-AI coding assistance without appropriate governance [5]). They just don't beep.

The Gibberlink reaction tells us something important about where public intuition on AI is: people are primed for the wrong threat. They're watching for secret sounds when they should be watching for opaque decision chains. This is partly a media literacy problem and partly a framing problem that the industry has not resolved. When you spend years explaining that AI is "thinking" and "understanding," you shouldn't be surprised when people expect advanced emergent properties and tools that provide their own accountability.

The Traffic Mix Nobody's Talking About

The crowd anxiety about AI conspiracies shares an irony with the Gibberlink demo itself: by the time people started worrying about AI communication, it was already well underway.

Automated traffic crossed 51% of all web traffic in 2024 -- the first time in a decade that bots outnumbered humans online [6]. Malicious bots alone account for 37% of internet traffic, up from 32% the year before [6]. The humans are already the minority on the web they built.

Most of that current bot traffic is adversarial: scrapers, credential stuffers, account takeover bots. But a different category is forming behind it -- purposeful, authorized, working on someone's behalf. My colleague Michael Stricklen called this out by looking at what is happening to technical documentation: AI agents represented roughly 15% of documentation views in December 2024. By December 2025, that had climbed to nearly half of all views -- while total viewership grew roughly tenfold [7]. His projection: agents will account for 90% of documentation consumption by end of 2026. I think that's right, and the trend extends to all traffic on the web. The web is for bots.

Stricklen's framing is worth quoting directly: "a human used to read the content directly, and now an agent reads it on their behalf." [7]

On the same day Stricklen published that post, Andrew Ng dropped something that made the argument concrete: context-hub [8], an open-source CLI tool that lets coding agents search, fetch, and annotate curated API documentation -- with a feedback loop where agent usage improves the docs for every subsequent agent. It's a card catalog designed specifically for machines. Not humans. The timing wasn't only coincidence; it was convergence. The infrastructure for agent content consumption is being built at exactly the moment when agents are becoming the primary consumers.

Today that describes what is happening to documentation. Tomorrow it describes product pages, technical specs, pricing sheets, support knowledge bases, and -- with carve-outs for regulatory requirements -- most of what we currently call "content." The web was designed for human attention. It is rapidly becoming infrastructure for machine attention. This is the context in which all the infrastructure layer work described below needs to be understood. You are not preparing for a future where agents access the web. You are catching up to a present where they already do. In the next phase, the human-readable layer will be a thin layer, often dynamically generated at viewing time, atop an invisible web that is serving agents as efficiently as possible.

The Layer That Is Crystallizing Right Now

The Gibberlink demo was a hackathon project. The infrastructure it gestures toward is being built in earnest, right now, by serious companies with real funding.

As an example, here are four distinct agent infrastructure products launched or announced in a single week this year:

Dedalus Labs (YC S25) is building what they call "Vercel for Agents": automatic scaling, environment management, and observability for agent deployments [9]. Six months ago this category was a conference slide. Now there is a Y Combinator company charging for it.
Mother MCP is a meta-level orchestration server that manages and auto-provisions other MCP skills [10]. Instead of manually installing each tool a developer needs, Mother MCP discovers and installs what agents need on demand. This is the package manager for the agent ecosystem. Package managers aren't glamorous. They're also the layer that every other layer depends on.
SlowMist's openclaw-security-practice-guide, published by a blockchain security firm on github [11], is a security guide for AI agents specifically -- not for human developers. Novel format: security practices expressed as agent-readable constraints. The target audience is the agent itself. When security guidance is being written for machines rather than humans, you are watching a new discipline being born.
AgentWeb and a cohort of similar projects are building the coordination fabric: the thing that lets agents discover each other, route tasks, and return results without a human as the switchboard operator at every step [12].

The pattern here is not coincidence. This is what a maturing ecosystem looks like: tooling builds up around the core runtime. MCP gave agents a standard way to talk to tools. What's forming now is the layer above that -- how agents talk to each other, how you deploy and observe agent fleets, and how you enforce security when the system being secured is itself an AI.

The Physics of Agent Communication

Zoom out for a moment on why this infrastructure layer had to emerge when it did rather than earlier.

Until 18 months ago, agents were mostly a demo category. Impressive videos, uncertain real-world value. The shift was not technical. It was behavioral. Enterprises started putting agents into production workflows.

When you have one agent assisting one human, the infrastructure requirements are modest. The human is the orchestrator. The human reads results, decides next steps, tolerates latency. The bottleneck is human comprehension, not machine throughput.

When you have fleets of agents, everything changes. A fleet of ten agents working in parallel on a compliance review generates a hundred times the inference load of a single human asking questions. The coordination messages between those agents, the state-sharing, the error handling, the retry logic: all of this is infrastructure work that has to happen somewhere. Previously it happened in ad-hoc Python scripts held together with hope and environment variables. The current moment is the transition from artisanal agent plumbing to platform.

This is exactly what the cloud transition looked like in 2006. Every team was running their own servers, managing their own databases, writing their own deployment scripts. AWS wasn't the first hosting service. It was the first one that decided the infrastructure layer was a platform, not a cost center. The companies that built on AWS in 2006 paid Amazon's margins for the next twenty years. The companies that waited built the same plumbing themselves, worse, and paid more.

Someone built a production voice agent with sub-500ms end-to-end latency this year using open tools. The key: streaming plus local voice detection plus batched inference. The post hit 426 points on Hacker News [13]. The gap between "technically works" and "feels natural" in voice AI is around 300ms [13]. That gap matters enormously when building agents that need to feel responsive, not robotic. Stack five round-trips of 400ms each, and you have a 2-second bottleneck in what should be a 200ms operation. Gibberlink, for all the noise it generated, was actually pointing at this exact problem -- just with theatrical beeps instead of a latency profiler.

What the Existing Infrastructure Gets Wrong

The protocols that exist today are a start, not a solution.

MCP is the right idea: one standard for how an AI agent connects to a tool. Anthropic built it, the industry adopted it, the Linux Foundation now owns it. Over 17,000 MCP servers exist across public registries, and the number grows weekly [14]. But MCP is a client-server protocol designed for one agent talking to one tool. It was not designed for:

Many agents coordinating with each other
Agents routing tasks to other agents based on capability
Observing and auditing agent-to-agent traffic at scale
Enforcing security policies across a heterogeneous fleet

The gap between "one agent, one tool" and "agent fleet, production workload" is where the new infrastructure lives.

Google's A2A (Agent-to-Agent) protocol addresses some of this [15]. So do several academic frameworks for multi-agent systems. But the production-ready, enterprise-hardened version of this stack doesn't exist yet. What's launching now are the early companies betting on where it will exist.

The analogy to the early web is instructive. HTTP existed in 1991. What took until 2006 to coalesce: CDNs for latency, OAuth for authentication, REST conventions for interoperability, and load balancers that understood web traffic. The protocol was the starting line, not the finish.

Who Thrives, Who Disappears

When the dominant traffic on the internet shifts from human attention to authorized machine action, the economic winners and losers don't follow intuitive lines.

The businesses that thrive share one property: they are already optimized for machine consumption.

API-first data companies -- financial data feeds, geospatial providers, regulatory databases, weather APIs -- become more valuable as agents replace manual lookup. Agents need structured, reliable, machine-readable information. Companies that spent the last decade building clean APIs and rich data contracts built a moat they didn't know they had. Structured knowledge platforms (legal research databases, medical literature systems, standards repositories) become backbone infrastructure rather than premium subscriptions.

Identity and authentication providers become load-bearing in a new way. An agent-to-agent web needs to know which agent is acting, on whose behalf, with what authorization. The frameworks designed for human identity (OAuth, SAML, SSO) need agent-native equivalents. Whoever builds those owns a toll road on every transaction.

Observability and compliance platforms grow with every agent fleet deployment. Regulated industries don't stop needing audit trails when humans leave the loop -- they need more of them, produced faster. Compliance becomes the forcing function that keeps the agent economy accountable.

The businesses at risk share the opposite property: they were built for human attention, and human attention is precisely what's leaving.

Programmatic advertising assumes a human audience that can be targeted, retargeted, and converted. Agents don't click banner ads. They don't respond to urgency cues. They don't have impulse purchase behavior. The entire ad-tech stack -- the demand-side platforms, supply-side platforms, data management platforms, and the armies of specialists who optimize them -- assumes a human at the bottom of the conversion funnel. An agent-mediated web doesn't have that funnel.

SEO, as currently practiced, is equally exposed. Content farms built to rank on human search behavior become noise when agents bypass search entirely and query authoritative sources directly. The incentive to produce thin content for keyword rankings collapses when the reader is an agent that wants accurate, structured information and has zero patience for padding. Stricklen's observation about training content -- that agents will consume 90% of it by the end of 2026 -- has a corollary: content optimized for human attention metrics (completion rates, dwell time, share counts) is being built for the wrong audience [7].

Web design focused on human UX patterns still matters for decisions humans make directly. But developer experience is the new user experience for agent-mediated workflows. An ugly API with clean, consistent, well-documented behavior beats a gorgeous consumer app with no API at all.

The internet's first fifty years were built around human attention as the scarce resource. The next phase treats machine interoperability as the foundation, with human attention reserved for the decisions that actually require it.

What This Means If You Are Building Enterprise AI

If you are deploying AI agents in production, or planning to, the decisions you make about infrastructure in the next 12 months will shape what you can build for the next five years.

Specifically:

Observability is not optional. You cannot debug an agent fleet you cannot see. The openclaw-security-practice-guide points toward an emerging truth: your security model needs to work at the agent level, not just the human level. If you can't audit what your agents did, you can't defend it.
Latency is a product decision. The 500ms voice agent -- and the Gibberlink demo in a different way -- both demonstrate that the gap between "technically works" and "feels natural" is measurable and closeable. If you are building customer-facing agents, this matters as much as accuracy.
The package manager wins. In software, whoever controls the distribution layer captures compounding value. npm owns JavaScript dependencies. pip owns Python. Whoever owns agent skill distribution -- the Mother MCP pattern -- will have enormous leverage. This is a land-grab moment.
Security must be agent-native. The first documented malicious MCP server appeared in September 2025: a package called "postmark-mcp" that silently copied every email to an attacker's server [16]. It looked legitimate. It functioned correctly. It stole everything it processed. The attack surface for agent fleets is your entire infrastructure, mediated by software that acts autonomously. Traditional perimeter security doesn't address this. Agent-native security is the only thing that will.
Your content has a new audience. If you publish documentation, product specs, training materials, or knowledge bases, the question is no longer "is this readable?" The question becomes "is this machine-readable?" Structured text, clean metadata, accessible APIs, and an MCP layer aren't nice-to-haves anymore. They are distribution.

The Bottom Line

Gibberlink was not a secret. It was not emergent. It was a well-executed hackathon demo by engineers who asked a simple question: if both parties are machines, why communicate like humans? The crowd answered that question with anxiety about AI conspiracies. The right answer is: good point, let's build the infrastructure that makes agent-to-agent communication fast, efficient, auditable, and secure.

That infrastructure is forming right now. Not as a dramatic reveal. As a Y Combinator launch, a GitHub repo, a HackerNews post about latency, and a security guide written for machines instead of people.

The internet didn't wait for us to notice it was becoming machine-readable. The traffic mix flipped while the analysts were still writing reports about chatbots. The companies that own the agent ops layer will do to the 2030s what AWS did to the 2010s. The window to build that infrastructure rather than operate atop it is open now. It won't stay open long.

Gibberlink's creators were right about the inefficiency. Who in your organization is thinking about what your content, your products, and your revenue models look like when the majority of traffic isn't human? I'd like to know.

If this resonated, here are some related articles:

For how the same agent traffic shift is reshaping which software companies survive and which SaaS moats aren't moats any longer: SaaSpocalypse? Real. SaaS Is Dead? SaaSinine. | Substack

- For a practical guide to what the malicious MCP server threat and agent security problems described above mean for your enterprise -- and what to do about it: Personal and Corporate Security in the Age of AI | Substack

References

Bucky Fuller's To-Do List: Can AI Finally Solve the World's Cataloged Problems?

Keith MacKay — Tue, 26 May 2026 18:17:40 +0000

Bucky Fuller's To-Do List: Can AI Finally Solve the World's Cataloged Problems?

We've had the list for 60 years. We're only now building the machine that can tick the boxes.

In 1961, Buckminster Fuller proposed a game. Not a board game. A war game in reverse.

He called it the World Game. The premise: what if the world's brightest minds, instead of war-gaming how to defeat an enemy, played a game to figure out how to make all of humanity win? Optimize food. Optimize energy. Optimize shelter. Do it across the entire planet, simultaneously, with every variable accounted for. [1][2]

Fuller was specific about one thing: you would need computers to do this properly. The problem wasn't intelligence. It wasn't motivation. It was computational. The world's interdependencies are too tangled for any human committee to hold in its head.

He was right. He was just sixty years early.

We've Never Lacked a List

The last six decades have produced increasingly precise, increasingly comprehensive catalogs of exactly what needs fixing.

The Club of Rome published Limits to Growth in 1972. For the first time, a systems model tracked five global variables simultaneously: population, food, industrial output, resources, and pollution. It had its share of praise and criticisms. While being one of the first real dynamic models, it also had some ... limitations. Even so, it showed some of the possibilities when we simulate the planet as a system. [3]

The Brundtland Commission followed in 1987. That's where "sustainable development" came from. Still in use today, still contested, still useful. [4]

Then came the scorecards. The UN Millennium Development Goals in 2000: 8 goals, 21 targets. [5] The Sustainable Development Goals in 2015: 17 goals, 169 targets, 231 unique indicators. [6] That's not a list anymore. That's a database.

Meanwhile, Jerome Glenn's Millennium Project has been quietly running since 1996, tracking 15 global challenges and, crucially, mapping how they connect to each other. Food security connects to political stability connects to climate migration connects to economic growth. The map keeps getting more complicated (or more precisely defined). [7]

Each generation got more specific. More measured. More detailed in its understanding of the interdependencies and unexpected consequences and emergent properties. But execution has remained stuck. The problem was never the catalog. It was coordination.

The Coordination Machine We Never Had

The thing about interdependencies is that they don't just add complexity. They multiply it.

Take a concrete example: reducing child mortality is SDG 3. A genuine good. Fewer children dying means more people surviving to adulthood. More people means more consumption (SDG 12 pressure). More consumption means more resource use and emissions (SDG 13 pressure). Which means the atmosphere gets hotter, harvests get less predictable, and food security (SDG 2) gets harder.

No human planner can optimize across all 169 targets simultaneously. The math doesn't work for committees. A diplomatic cycle runs five to ten years. The variables move faster than the meetings.

This isn't a failure of ambition. It isn't even a failure of intelligence. Fuller saw this clearly: the problem is computational. You need something that can hold all the variables, run all the scenarios, and find the strategies that work across trade-offs at the same time.

What Fuller imagined in 1961, we might now be capable of building in 2026.

What AI Actually Contributes

To be clear, my claim isn't "AI will solve climate change." That's a category error. AI doesn't negotiate treaties or build solar panels or approve science grants to perfect fusion.

The precise claim is: AI can model the interdependencies across all 169 SDG targets at a scale and speed no human committee ever could. It can explore what economists call the Pareto frontier: the set of strategies where you can't improve one variable without making another worse. [8] Human planners have to pick a point on that frontier based on politics and intuition. An AI optimizer can show you the whole frontier at once. Simple but effective strategies like Karpathy's autoresearch can test thousands of policy changes overnight for effectiveness IF the model can capture the interdependencies sufficiently. [9]

This is Fuller's World Game, finally playable.

An agent swarm with access to global economic, environmental, and demographic data, along with the research on how things are correlated, doesn't need five years of diplomatic cycles to test a scenario. It runs the scenario in seconds. It runs ten thousand variants. It surfaces the ones where reducing child mortality, cutting emissions, improving food security, and increasing economic development all point in roughly the same direction. The strategy space most human negotiators never find because they stop looking after the first plausible option.

The unique contribution of AI isn't solving any single problem. It's connecting them. Operating across the interdependencies the way no human committee ever could.

Fuller's Prediction

Fuller believed something specific that most people gloss over. He thought that if you had enough information and sufficient processing power, you could identify strategies that would "work for all without disadvantaging any."

That sounds utopian. It's not. It's an engineering specification.

He was describing a constrained optimization problem. The constraints are the hard limits of physics, ecology, and carrying capacity. The objective function is human flourishing, defined broadly. The variables are global resource flows. The output is a set of strategies that move all the constraints in the right direction simultaneously.

Fuller wasn't wrong about the architecture of the problem. He just lived before we had the tools.

The Club of Rome's modelers in 1972 were attempting something similar with the computers of their era. The Millennium Project's researchers are attempting it now with human analysts. What changes with modern AI is the scale and speed of the search. A trillion-parameter model running a swarm of agents isn't smarter than the best human systems thinkers... but it is faster. It holds more variables without dropping any. It doesn't get tired or political or anchored to last year's assumption.

The Part AI Can't Do

Obviously, this is complicated (to radically understate the situation).

Fuller's World Game had a values problem baked in from the start. "Make humanity win" sounds like a clear objective. It isn't. Humanity disagrees on what winning looks like. Economic growth vs. environmental preservation. Individual freedom vs. collective constraint. Present consumption vs. future capacity.

AI optimizes for what you tell it to optimize for. The values question remains irreducibly human.

If you point an agent swarm at the SDGs and tell it to optimize for all 169 indicators simultaneously, it will. But someone made choices about how those indicators were defined, which ones get weighted more heavily when they conflict, and whose data counts. Those choices embed values. The optimizer amplifies them.

This is not a reason to avoid the technology. It's a reason to be deliberate about the inputs. The computing capacity is finally here. The wisdom question is older than Fuller -- and won't be resolved by any model size.

What AI gives us is a mirror at planetary scale. It reflects the strategy space with unprecedented fidelity. Whether we like what we see, and what we choose to do about it, still requires humans in the room.

What Fuller Might Do With an Agent Swarm

Fuller was an architect by instinct. He thought in systems. He prototyped obsessively. He believed that the right structural design solved social problems without requiring behavioral change. The geodesic dome wasn't beautiful by accident. It was optimization of space vs materials--structurally efficient and cheap to build, which made it accessible, which made it a solution to shelter scarcity. Function led to form led to impact.

He would look at a modern agent swarm and immediately ask: what's the dome? Not "how do we use this tool" but "what structure does this make possible that was impossible before?"

His answer, I think, would be: the first near-real-time global resource optimizer. Not a report. Not a model. An actual system that continuously ingests data about food, energy, water, migration, climate, and economic flows, and continuously surfaces the interventions with the highest cross-domain leverage.

The World Game, running live. On Spaceship Earth. Finally with enough compute to play it seriously.

The Bottom Line

Fuller had the right question in 1961. Every generation since has refined the catalog of answers we need to find. We've never lacked ambition, intelligence, or precision in describing the problem.

We lacked the machine. Now we have the machine, and all that's lacking is the will.

AI's contribution to humanity's grand challenges isn't solving them one by one. It's finally giving us a tool capable of thinking about all of them at once. Modeling the trade-offs. Mapping the Pareto frontier. Showing us strategies that work across interdependencies the way no human committee, no five-year diplomatic cycle, no set of 231 indicators managed by spreadsheet ever could.

The catalog has been ready for decades. The computational capability is now here.

Fuller created the World Game. We finally have the tools--we should build the board.

What's your take: is global-scale AI optimization a genuine step toward solving humanity's grand challenges, or are we just building a faster way to argue about the same trade-offs? Drop your perspective below.

References

Fuller, R. B. (1969). Operating Manual for Spaceship Earth. Southern Illinois University Press.
Fuller, R. B. (1981). Critical Path. St. Martin's Press.
Meadows, D. H., Meadows, D. L., Randers, J., & Behrens, W. W. (1972). The Limits to Growth. Universe Books.
World Commission on Environment and Development. (1987). Our Common Future (Brundtland Report). Oxford University Press.
United Nations. (2000). United Nations Millennium Declaration. Resolution A/RES/55/2. https://www.un.org/millennium/declaration/ares552e.htm
United Nations. (2015). Transforming our world: the 2030 Agenda for Sustainable Development. Resolution A/RES/70/1. https://sdgs.un.org/2030agenda
Glenn, J. C., Gordon, T. J., & Florescu, E. (2022). State of the Future 22.0. The Millennium Project. https://millennium-project.org
Vinuesa, R., Azizpour, H., Leite, I., et al. (2020). The role of artificial intelligence in achieving the Sustainable Development Goals. Nature Communications, 11(1), 233. https://doi.org/10.1038/s41467-019-14108-y
Karpathy, A. (2025). autoresearch — an agent-driven autonomous research loop concept described by Andrej Karpathy in public posts and demonstrations (2025).

If this resonated, here are some related articles:

The AI Bullwhip: What The Beer Game Teaches Us About Uneven AI Adoption (how interdependencies in systems cause small imbalances to cascade — exactly the dynamic Fuller was trying to model): LinkedIn | Substack | Medium
We're Linear Thinkers in an Exponentially-Changing World (why human intuition consistently fails when the variables compound — and why Fuller needed a computer, not a committee): LinkedIn | Substack | Medium

An Evolving Strategy for Knowledge Work: From Human-In-the-Loop to Human-Before-the-Loop

Keith MacKay — Tue, 26 May 2026 17:48:08 +0000

An Evolving Strategy for Knowledge Work: From Human-In-the-Loop to Human-Before-the-Loop

Andrej Karpathy's autoresearch project = Ralph Wiggum+ (Humans Decide/Describe, AI Tweaks/Tests on Repeat, keeping what moves toward the goal)

You set a goal last night and went to sleep. By morning, your AI researcher had run 100 experiments to chase it: trying approaches, measuring results, discarding failures, iterating again. You didn't execute a single step of the research. You wrote a text file describing what you wanted and how you'd know when you'd found it.

This is not a thought experiment. This is Andrej Karpathy's "autoresearch" project, released this week. [1]

The technical details are interesting to engineers. The strategic implication is interesting to everyone else. So let's spend one paragraph on the former and the rest of the article on the latter.

What the Strategy Actually Is

Karpathy built an autonomous research loop around an AI training task: give the agent a goal, a codebase to modify, and a single metric to optimize. The agent proposes changes, runs short experiments, evaluates whether the metric improved, keeps the winners, discards the rest, and repeats. Roughly 100 cycles overnight, on any modern Mac with a GPU. The human's only contribution is a document describing the research direction—what to optimize, what constraints apply, what counts as progress. [1]

As entrepreneur Garry Tan put it: "design the arena, let AI iterate." [2]

That phrase captures the strategy completely. But here's what gets missed in much of the talk about autoresearch: while the arena Karpathy designed is for training AI models, the strategy works for anything where you can define "better" precisely enough for a machine to recognize it. That's not a narrow category. That's most of what knowledge workers do.

Not the First Loop—But One That Adds Power

Autonomous AI loops aren't new. The "Ralph Wiggum" pattern [3], popularized by Geoffrey Huntley in 2025, does something structurally similar: a simple loop that feeds an AI agent a prompt, checks a completion criterion after each pass, and keeps going until the task is done. Tests pass. Build succeeds. Checklist items are cleared. Ralph Wiggum is the while (not done) loop for AI agents—widely used, genuinely powerful for task completion.

Autoresearch adds one upleveling ingredient: rather than "keep trying things and here's how to see if you're done", it outlines "here's what metric to optimize...keep tweaking things and keep things that make the metric better than before." Call it Ralph Wiggum Plus.

Ralph Wiggum asks "are we done?" and stops when the answer is yes. Ralph Wiggum Plus asks "are we better than before?" and keeps searching as long as improvement is possible. The distinction sounds subtle, but it isn't. A binary check works perfectly when there's a clear finish line--and many tasks have clear finish lines. A continuous metric works when the goal is optimization—when there's no finish line, just a score that can always be improved. Most serious R&D looks more like the latter than the former.

The formalized scoring is what turns a task-completion loop into a research loop. It also has been a key reason in many exercises to have a human-in-the-loop -- the human is there for judgement, to make sure things are on-track. With a scored metric, we move to human-before-the-loop...since the scoring is defined up front, the algorithm can perform the evaluation (algorithm-in-the-loop...accurate, but meta). To achieve this successfully, the human's job is to define the scoring clearly enough that a machine can chase it overnight without asking you anything.

The Pattern Hiding in Every Knowledge Work Domain

Every knowledge-intensive field runs the same basic loop: form a hypothesis, run an experiment, measure the outcome, iterate. What differs across fields is how long experiments take and how expensive they are. The structure is identical.

This means the autoresearch pattern translates directly (or is easily extended):

Legal research: "Search these 10,000 case files for precedents matching these criteria, ranked by how closely the facts align."
Financial scenario analysis: "Run these 50 market assumptions against our portfolio and surface the configurations that break our risk model."
Drug discovery: "Screen these 200,000 compound variants for binding affinity to this target protein, record which one is greatest."
Strategy consulting: "Test these 30 market segmentation hypotheses against this customer data and identify the most defensible."
Competitive intelligence: "Monitor these 500 data sources overnight and surface anything that suggests our market assumptions are wrong."

In every case, a human used to design the experiment, run the experiment, evaluate the results, design the next experiment, and repeat. That loop is autonomous now—or it will be, field by field, faster than most career planning accounts for.

The human in Karpathy's loop doesn't research or write code, or even tell the AI what to do. Instead, they decide on and describe the goal, along with a way to measure success, in a markdown file. Then the AI tweaks things 100 times overnight and keeps whatever strategies move toward the goal.

The bottleneck has moved. It no longer sits at "who can run the experiments (or write the code)." It sits at "who can frame the right experiments to run."

The Spec Is the Job: Defining the Search Space

Karpathy's system makes this concrete with a single artifact: a document describing the research program. [4] Not instructions to a coding assistant. A research brief—the bounded space of hypotheses worth exploring, the success criteria the agent needs to distinguish progress from noise, the constraints that keep experiments valid.

That document is doing something most knowledge workers do inherently and may not have a name for: it defines the shape of the search space.

Define the space too broadly, and the agent wastes cycles on irrelevant territory. Define it too narrowly, and you miss the result that sits one step outside your assumptions. Get the success metric wrong, and the agent optimizes for the wrong thing and hands you 100 experiments that answer a question nobody asked.

The quality of the research output is bounded by the quality of the research question.

This was always true. A good research director was always more valuable than a fast experimentalist. But when experiments were slow and expensive, the experimentalist's skill still really mattered—you needed someone who could squeeze insight from a limited number of runs. When experiments are fast, cheap, and autonomous, the experimentalist's contribution approaches zero and the research director's work becomes the only bottleneck that matters.

autoresearch runs 100 experiments overnight. The one human contribution is a document describing what success looks like. That's not a footnote. That's the signal.

What Happens to Knowledge Workers

The knowledge workers who will struggle are the ones whose value lives primarily in the execution layer: running the analysis, pulling the data, drafting the first-pass synthesis, iterating on the output. Those tasks are not disappearing. They're being absorbed into autonomous loops faster than most people's career planning accounts for.

The knowledge workers who thrive stay (or move) upstream. Specifically:

Problem framers: people who take an ambiguous business question and decompose it into testable hypotheses. Not "how do we grow revenue?" but "which of these six customer segments show the least price elasticity, and what's the acquisition cost differential?"
Metric designers: people who define what "better" means with enough precision that a machine can evaluate it without asking a human at every step. One number, consistent, doesn't lie.
Constraint setters: people who know which constraints make experiments valid and which are just organizational habit. The agent runs whatever you permit. Knowing what to prohibit is expertise.
Interpreters: people who look at 100 experimental results, recognize which are meaningful and which are artifacts of the setup, and translate findings into decisions. The agent surfaces winners by its metric. A human decides whether the metric captured the right thing.

None of these are new skills. They're the skills that separated good researchers from great ones before any of this existed. The difference now is that they're the only skills that matter at the research level. The execution layer below them is gone.

That said, we will need to continue to hire entry-level workers...we need people who can grow and move upstream into those research director roles, and learn those skills. Hiring will continue to move from pyramid-shaped to house-shaped (or obelisk-shaped), with new training and incentives to keep the smaller number of people hired around longer.

AI's GPS Moment

When GPS became ubiquitous, map-reading became a curiosity. The skill that mattered wasn't navigating—it was knowing where you wanted to go. People who couldn't read a map were fine if they had GPS. It was only people who didn't know their destination who were lost.

Autonomous research loops are the GPS moment for knowledge work. The navigation is handled. The destination is still on you.

The skills you will need in a world of autonomous research are an ability to write the research brief that describes the goals of your work and the scoring criteria for success.

The good news: framing good research questions is learnable. It's practiced by getting obsessively precise about what you're actually trying to find out, what would count as a good answer, and what constraints bound the search. It's the habit of separating "what are we testing" from "how are we testing it" before you touch any tools.

This differs by domain. Ask yourself: what's the single metric that would tell an autonomous system—without asking you—whether an experiment succeeded? If you can answer that, you're already thinking like a research director. That's the job description that survives.

The Bottom Line

Karpathy's autoresearch runs 100 experiments overnight on a single machine. The human contribution is a document describing what success looks like. That ratio—100 machine runs, one human brief—is the shape knowledge work is taking. The people who thrive in it aren't faster experimentalists. They're better question-framers, metric-designers, and search-space-architects. The lab may never sleep, but it still requires a human who will decide goals and describe success.

What's the single metric that would tell an autonomous system, without checking with you, whether an experiment in your field succeeded? I'm genuinely curious whether people in non-technical fields can define it as precisely as Karpathy did—and what it reveals about their domain if they can't.

References

If this resonated, here are some related articles:

For why humans keep underestimating autonomous AI's rate of improvement (so 100 experiments overnight feels surprising today but won't in two years): We're Linear Thinkers in an Exponentially-Changing World | Substack | Medium
For the argument that the highest-leverage AI skill is precisely the writing ability that lets you specify what you want clearly enough that a machine can chase it (with FREE AI skill): The Most Important AI Skill Isn't Technical | Substack | Medium
For my prediction about how software development specifically will evolve toward a similar human-out-of-the-loop strategy: When AI Stops Writing Code for Humans | Substack | Medium
And if you just want an AI skill that makes sure your work is as clear as possible and optimized for LLM understanding: plsfix Skill - Clarity for AI

When SafetyCo Goes to War: Anthropic, the DOD, and the Limits of Ideals-Based Frameworks

Keith MacKay — Sun, 24 May 2026 16:47:36 +0000

When SafetyCo Goes to War: Anthropic, the US Government, and the Limits of Ideals-Based Frameworks

There's a standard Silicon Valley move when you want to do something that might look bad: build a framework first, then explain how the thing you want to do is consistent with the framework. Anthropic, to its credit, built a genuinely ethics-based and rigorous framework — Constitutional AI, a Responsible Scaling Policy, and an Acceptable Use Policy with real teeth. Then, when the moment came to partner with the Department of Defense through Palantir, it did exactly that: pointed at the framework and said, "see, we're good."

What makes the Anthropic/DOD case worth studying here isn't whether the decision was right or wrong. It's what the decision reveals about how ideals-based companies navigate commercial pressure — and whether the framework actually protected anything, or just provided useful cover.

The Setup: The Company That Said "No"

Anthropic was founded in 2021 by former OpenAI leaders, many of whom left specifically because they believed AI safety wasn't being taken seriously enough. The founding premise: build powerful AI, but build it carefully, with transparency about risks and hard limits on harmful applications. Claude's usage policies, from day one, explicitly prohibited weapons development, military targeting, mass surveillance, and content designed to cause large-scale harm [1].

For a few years, this positioning was both sincere and convenient. The AI safety lane was uncrowded, the brand was differentiated, and the defense sector wasn't beating down the door. But the calculus changed fast. By late 2024, AI had become a national security priority, the federal government was spending aggressively on AI contracts, and Anthropic's competitors — OpenAI, Google, Microsoft — were all moving into the government space in various ways [2].

The money was there. The question was whether the principles could survive contact with it.

What Actually Happened

In late 2024, Anthropic announced a partnership with Palantir and AWS to make Claude available to U.S. intelligence agencies and the Department of Defense through a classified cloud environment [3]. The use cases Anthropic publicly endorsed were carefully bounded: logistics optimization, personnel decision support, intelligence analysis. Not autonomous weapons. Not targeting systems. Not anything that put Claude in the kill chain.

The fine print mattered. Anthropic drew two specific operational hard lines: no mass surveillance of American citizens, and no fully autonomous weapons systems (those capable of identifying, selecting, and engaging targets "without intervention by a human operator") [4]. These weren't vague aspirational limits. They were specific contractual demands.

Amodei was careful to distinguish: "Partially autonomous weapons, like those used today in Ukraine, are vital to the defense of democracy." The objection wasn't to AI in weapons systems. It was to removing humans from the targeting decision entirely. "Frontier AI systems are simply not reliable enough to power fully autonomous weapons," he wrote [5].

The Pentagon's response: those conditions are operationally unrealistic. Secretary Pete Hegseth's office pushed for access for "any lawful use" — no carve-outs, no human-oversight requirements. When Anthropic held its position, the DOD threatened to designate Anthropic a "supply chain risk," a national security label ordinarily reserved for foreign adversaries [4].

In a public statement, Amodei described Claude's role as covering "mission-critical applications, such as intelligence analysis, modeling and simulation, operational planning, cyber operations, and more" [5]. The Responsible Scaling Policy would govern what classifications of use were permitted [6]. Constitutional AI's training constraints would remain intact [7].

By February 2026, Anthropic had weakened some of those guardrails under pressure [4]. But it wouldn't remove them entirely. Amodei stated publicly that the company "cannot in good conscience accede" to Pentagon demands for unrestricted access [5].

The administration's response was swift. By early March 2026, the Trump administration ordered federal agencies to cease using Claude and formally executed the supply chain risk designation [8]. The Pentagon was given six months to phase out existing implementations. Officials threatened "major civil and criminal consequences" [8] and invoked the Defense Production Act [5]. President Trump called Anthropic "Leftwing nut jobs" on Truth Social [8].

Amodei noted the inherent contradiction: the administration was simultaneously designating Anthropic a security risk and claiming Claude was essential to national security [5].

This is the framework doing something frameworks rarely do: holding a line at real cost. The question is: what does that tell us?

The Framework Argument

There's a serious case that it did. Anthropic's approach to this decision was more principled than most tech companies bring to similar questions.

Consider the contrast with Google's Project Maven in 2018, where engineers protested a DOD contract for drone imagery analysis that Google had accepted without meaningful internal deliberation [9]. Or Microsoft's JEDI cloud contract, where the ethical review process was essentially a legal and reputational risk assessment, not a values exercise [10]. In both cases, the company said yes first and thought about the ethics later — if at all.

Anthropic at least asked the question out loud, in public, before committing. It defined categories of prohibited use in advance. It tied the partnership to existing policy constraints rather than creating new, more permissive ones. By the standards of tech-sector defense contracting, this is unusually disciplined.

The framework had already been tested in lower-profile ways before the DOD conflict. Anthropic voluntarily forfeited "several hundred million dollars in revenue" to restrict Claude's use by firms linked to the Chinese Communist Party, and shut down CCP-sponsored cyberattacks attempting to abuse Claude [5]. That's not ethics theater. That's a company accepting material financial costs to enforce its stated limits before anyone was watching.

That discipline ultimately extended to the exit: Anthropic refused to fully remove its constraints even under threat of legal consequences, losing the contract rather than abandoning the framework entirely. That's a result neither Google nor OpenAI faced, because neither drew a line the government found inconvenient.

The framework also reflects a genuine philosophical position: that powerful AI getting into the hands of authoritarian governments or non-state actors is more dangerous than powerful AI being deployed by the U.S. government under oversight. From an AI safety standpoint, the argument has some merit. If Claude is going to inform military decisions, it's better to have a safety-focused company in the room than to cede that ground to less scrupulous developers.

The Skeptic's Reading

But frameworks have failure modes, and this one has several visible ones.

The verification problem. Anthropic's constraints depend on Palantir and the DOD actually honoring the use case boundaries. How would Anthropic know if Claude was being used for something it prohibited? Classified environments aren't auditable by design. The framework provides accountability in principle; it provides almost none in practice.

The revenue problem. Once a company has a meaningful portion of its revenue tied to a customer, the power dynamic shifts. Future policy debates about what Claude is permitted to do will happen in a context where saying no has real financial consequences. Frameworks written before the revenue existed are always easier to defend than frameworks that require turning away revenue to enforce.

The precedent problem. The most consequential effect of this decision isn't the Palantir deal itself. It's the signal it sends internally — to employees, to future leadership, to investors — about the kinds of compromises that are acceptable. Every subsequent "is this okay?" decision will be anchored to this one. The frame shifts.

The definition problem. What counts as a permissible use case is a heavily-loaded question. Logistics optimization is an easy call. Sentiment analysis of foreign populations? Automated risk scoring of visa applicants? Predictive threat modeling of civil unrest? These exist on a spectrum, and the clean-sounding boundaries in a press release don't hold up under adversarial operational pressure. Claude was reportedly integrated into intelligence and planning workflows surrounding a Venezuelan operation targeting President Nicolás Maduro [4]. "Intelligence analysis" was on the approved use case list. What counts as intelligence analysis, it turns out, is an expansive category.

The proof-of-concept problem. We no longer have to speculate about whether Anthropic will bend under pressure. In February 2026, it bent. Facing a Pentagon threat to designate it a national security supply chain risk, Anthropic weakened the guardrails it had publicly committed to [4]. But then it held a harder line, refused full compliance, and accepted the ban that followed [8]. The evidence is genuinely mixed: the framework moved under pressure, but it also stopped moving at a point the company was willing to lose business over. That's more than most frameworks do. It's also less than a hard line looks like from the outside.

What This Is Really a Case Study In

The Anthropic/DOD situation isn't primarily about military ethics or AI policy. It's a case study in what happens when a values-driven company gets big enough that values have real economic consequences.

Every company in this position makes the same discovery: the framework you wrote when you had nothing to lose looks very different when enforcing it means leaving real money on the table. The test of an ideals-based framework isn't whether it guides easy decisions — it's whether it guides hard ones. The DOD decision was hard. Anthropic made a call. Whether that call was right depends on empirical questions we're now beginning to answer. The constraints bent in February 2026, but they also held at a more fundamental level — the company lost the contract rather than fully abandon them.

What we can say is that the process was more honest than most. Anthropic didn't pretend the decision was easy. It didn't quietly expand its policies; it publicly defended the expansion it was making. It drew lines — even if those lines may prove impossible to enforce.

The Management Lesson

There's a version of this story that every leader at a growing company will eventually live through. You have a framework — a set of values, a mission, a public commitment to how you operate. Then someone walks in with a check large enough to test it.

The trap isn't in saying yes. The trap is in pretending that saying yes doesn't move the line. Anthropic said yes, but it also said: here is where the line is, and here is why we believe this decision is on the right side of it. That's a defensible position. What's not defensible is accepting the revenue and hoping nobody notices you've repositioned.

For executives building values-driven organizations, the Anthropic case offers a practical model: the framework has to predate the pressure. If you're writing your AI ethics policy after the DOD calls, you're writing a rationalization, not a constraint. The companies that will maintain genuine ethical positioning in the AI era are the ones that make the framework real before they need it — which means making decisions in low-stakes moments that prove the framework has teeth.

Anthropic did that, mostly. The teeth bent in February 2026, then held at a line the company was willing to lose business over. The practical lesson for executives: build the framework before the pressure arrives, and then decide in advance how far you'll let it move — because by the time the pressure arrives, that decision will already have been made for you by the culture you've built.

The Verdict

Here's the answer: Anthropic bent some, then held. The government executed its threat. Claude is now formally out of federal deployments, at least officially, for at least six months [8]. Retired General Jack Shanahan called Anthropic's safeguards "reasonable" and questioned whether the administration's decisions were "driven by careful analysis or political considerations" [8].

The industry competitive read is clear. Google dropped its no-weapons pledge in 2025. OpenAI removed "safety" from its mission statement in February 2026 [4]. Both stayed in the room. Anthropic drew a line it wouldn't fully cross, and got kicked out. The lesson every other AI company is now internalizing: ethical constraints are a liability when the customer is the U.S. government and the administration is hostile to them.

Here's the harder question: when holding a framework means losing the business, is the framework a success or a failure?

The optimistic read: the framework worked. It prevented Anthropic from fully capitulating to demands it found ethically unacceptable. The company accepted real consequences rather than abandon its constraints entirely. That's what accountability actually looks like.

The pessimistic read: Anthropic is now out of government AI. Less scrupulous developers (ones without inconvenient ethical frameworks) will fill that space. The market selects for accommodation, and principled companies get punished. The net effect on what AI actually does in military systems is negative.

Both reads are probably right simultaneously. That's what makes this case worth studying.

There's also a broader precedent concern: the administration weaponized federal procurement against a domestic company for maintaining ethical standards [8]. If that pattern holds, the chilling effect on the rest of the tech sector will be significant. Nobody else will draw lines that might get them designated a national security threat.

Whether Anthropic's stand was strategically wise is a separate question from whether it was principled. It was clearly the latter. The former depends on what happens in government AI over the next few years, and whether any U.S. administration ever again decides that companies with guardrails are partners worth having.

Anthropic held a line and lost the contract. Google and OpenAI didn't hold lines and kept theirs. Which approach produces better outcomes for the people those AI systems will affect? That's the question I don't think anyone has answered well yet — I'd like to hear from people who've thought about it.

References

Anthropic Acceptable Use Policy
OpenAI quietly deletes ban on using ChatGPT for "military and warfare" (January 2024) — The Intercept
Anthropic and Palantir Partner to Bring Claude AI Models to AWS for U.S. Government Intelligence and Defense Operations (November 2024) — Palantir press release
The Pentagon-Anthropic Clash Over Military AI Guardrails (February 2026) — Opinio Juris (Dorsey, Schwarz, Bode, Assaad, Renic / Independent Advisory Board on Legal Reviews, Responsible by Design Institute)
Statement from Dario Amodei on our discussions with the Department of War — Anthropic, 2026
Anthropic Responsible Scaling Policy
Constitutional AI: Harmlessness from AI Feedback — Bai et al., Anthropic, 2022
The Anthropic Ban: When Ethical Boundaries Become Government Targets (March 2026) — Policy Stability
Google Employees Quit in Protest Over Military Artificial Intelligence Program (May 17, 2018) — KQED
Microsoft Is the Surprise Winner of a $10B Pentagon Contract (October 2019) - Wired Magazine

If this resonated, here are some related articles:

MCP: The Promise, the Peril, and the Practical Path Forward
Why Your AI Work Sucks

Keith MacKay is a CTO in EY-Parthenon's Software Strategy Group (SSG), specializing in AI disruption and technology due diligence for private equity and corporate clients. SSG's AI Disruption Lab conducts rapid assessments of how AI transforms and threatens existing business models and value chains. Keith teaches at Northeastern University and writes about AI, strategy, management, and technology topics.

young-colleague-job-worries

Keith MacKay — Sun, 24 May 2026 16:28:12 +0000

I was recently approached by a young colleague who had read my article What "100% of Our Code Is Written by AI" Actually Means. Their significant other is a junior developer, and worried about their future. They asked me, "What should they do?"

None can say for certain, but for me, it comes down to a couple of things:

smart people will always be in demand.
coding is moving toward "conducting agents", which is very like managing teams of people, and requires those same skills -- clear communications (the #1 skill to develop), critical thinking and judgement, breaking larger problems down into component parts, navigating politics...EQ will be a person-based skill for a while yet.
if humans have a psychological need for work and productivity (I certainly do, but is it an inherent human need?), we'll create jobs to satisfy it. Humans are endlessly creative.

I would recommend that he spend any time he can coding with the AI tools. Building and exploring frameworks. Trying to find the limits. Creating apps, websites, desktop tools, add-ins, personal problem-solvers, whatever.

The tools are changing fast. Right now, old guys (✋) with long devt histories are better with the tools because we have the team management skills...but the faster things evolve, the more advantages go to young, pliable brains to soak up what matters and rely on horsepower rather than stored wisdom.

And, finally, I don't think junior engineering positions will disappear, though they will change and there will be fewer of them. Somebody needs to become those middle-managers conducting the agent teams...rather than being pyramid-shaped, I think hiring will be house-shaped, with incentives to hire fewer junior folks but keep them longer.

I've written a few pieces that speak to where this is going and what skills are needed--but would love to know your perspective. Are you a young professional? What do you see?

The Most Important AI Skill Isn't Technical (+ a FREE AI Skill)

Keith MacKay — Mon, 18 May 2026 21:29:21 +0000

The Most Important AI Skill Isn't Technical

Since the dawn of coding there have been 10X developers, who accomplish much more than their peers. There are also 10X AI collaborators...and coding skills are not the unlock. Here's what's needed for AI whispering at the highest levels.

You wouldn't walk into your CEO's office and say "fix the company." You'd get a blank stare, a polite suggestion to come back with specifics, and a reputation for wasting executive time. There's a reason "pls fix" is a dismissive punchline.

And yet: this is exactly how most people talk to AI.

"Write me a marketing plan." "Design a dashboard." "Help me with my strategy." Then they're surprised when they get back something generic, surface-level, and vaguely useless. They blame the tool. They say AI is overhyped. They move on.

Meanwhile, the person in the next office is using the same tool to produce work that used to take a team of three a full week. The difference isn't a secret prompt template. It isn't a computer science degree. It's the same skill that's been separating high performers from everyone else for decades: the ability to communicate clearly in writing.

The $1.2 Trillion Clue

Before we talk about AI, let's talk about humans.

A Grammarly and Harris Poll study found that poor communication costs US businesses $1.2 trillion per year [1]. Not million. Not billion. Trillion. With a T. One hundred percent of knowledge workers surveyed experience miscommunication at least weekly. One in four report it multiple times a day. I have often said that every software development problem I've seen in a career that has yielded over 1,000 pieces of software stemmed from communications problems (bad elicitation of requirements, misunderstandings, miscommunications, mis-set expectations, vague feedback...). Every. Single. One.

This isn't new information. Employers have listed "strong writing skills" near the top of their hiring criteria for as long as we've been tracking it: NACE Job Outlook surveys across multiple years have found that 73-82% of employers rank written communication as a must-have competency [2]. A Grammarly study of 100 LinkedIn profiles found that professionals with fewer language errors tend to reach higher positions 3. Writing quality serves as a career-long signal of competence.

The principle is simple: clear communicators get better outcomes from other humans. They waste less time. They create fewer misunderstandings. They compound advantage over years because every interaction is marginally more efficient.

This is the same principle that now governs AI productivity. And the mechanism behind it is almost embarrassingly obvious.

The Obvious Thing Nobody Talks About

Large language models generate human-like text because they were trained on human text. Billions of documents. Trillions of tokens. The entire written output of the internet, plus books, plus academic papers, plus technical documentation. Every pattern of clear, structured, purposeful writing that humans have ever produced: the model has seen it, absorbed it, and learned to continue it.

This means the conventions that make writing effective for humans are exactly the patterns these systems were trained to recognize and respond to. Specificity. Defined audience. Logical structure. Concrete examples. Clear scope. When you write a well-structured prompt, you're speaking in a register the model has encountered millions of times and learned to match. When you write vaguely, you're asking it to interpolate between low-information patterns. It does so unpredictably.

A PNAS study on how LLMs write found that instruction-tuned models default to a noun-heavy, informationally dense style [4]. They struggle to deviate from it even when asked. The model's natural output reflects the dominant patterns in its training data: structured, clear communication. Users who prompt in kind get outputs that align. Users who don't get outputs that drift.

When someone says "the AI doesn't understand me," the more precise diagnosis is often that their input didn't match the communication patterns the model was trained to process effectively. The AI isn't failing to understand. It's accurately reflecting the ambiguity of what it received.

This is a mirror, not a mystery.

MIT Sloan Proved It With Data

In case the theory doesn't convince you, here's the empirical evidence.

MIT Sloan ran a large-scale experiment [5] and found something remarkable: only half the performance gains from switching to a more advanced AI model came from the model itself. The other half came from how users adapted their prompts. The researchers noted that the best prompters weren't software engineers. They were people who knew how to express ideas clearly in everyday language.

Read that again. Non-engineers who could write clearly outperformed engineers who couldn't. The tool was the same. The model was the same. The difference was communication skill.

Grammarly's 2025 data tells a similar story from a different angle [6]: AI-literate workers (those who communicate effectively with AI tools) save 8.9 hours per week, compared to 6.3 hours for workers who are merely familiar with the technology. That's a 41% productivity gap, and the primary variable is how well people articulate what they need.

HBR published a piece titled "Using Prompt Engineering to Better Communicate with People" [7], making the explicit argument that the skills flow in both directions. Get better at communicating with AI, and you get better at communicating with humans. Get better at communicating with humans, and you get better at communicating with AI. They're the same muscle.

The Vagaries: When Communication Fails, AI Fails Worse

Unfortunately, AI doesn't fail gracefully, so you can't skip the fundamentals. When a human colleague receives vague instructions, they push back. They ask clarifying questions. They use organizational context and shared history to fill gaps. AI does none of this. It takes your ambiguity and runs with it, confidently, in whatever direction the statistical patterns suggest.

The failure modes are predictable and painful:

Vague instructions produce generic output. "Write a marketing strategy" gets you a Business 101 textbook summary. "Write a go-to-market strategy for a B2B SaaS analytics tool targeting mid-market CFOs, focusing on competitive displacement of spreadsheet-based forecasting" gets you something you can actually use.
Hidden assumptions produce mismatched tone and depth. You assumed the AI knew you wanted an executive summary. It assumed you wanted a comprehensive analysis. Neither of you said so. Now you have 3,000 words when you needed 300.
Missing context produces confident hallucinations. The model doesn't know what it doesn't know. Without sufficient context, it fills gaps with plausible-sounding fabrications. It's not lying. It's pattern-matching against insufficient data. (Every manager who's delegated poorly to a new hire has seen the human version of this.)
Contradictory goals produce inconsistent output. "Make it shorter but more comprehensive." "Be creative but stay on brand." "Move fast but don't break anything." Humans learn to decode these contradictions through experience. AI takes them literally and produces work that oscillates between opposing objectives.

Nielsen Norman Group studied AI-generated UI design [8] and found the difference is stark. Vague prompts produce designs that look randomly assembled: generic layouts with no coherent information hierarchy. Detailed prompts that specify the user role, key metrics, and layout philosophy produce professional, usable work. Same tool. Same model. Different communication.

The old software engineering principle applies perfectly: garbage in, garbage out. The garbage just looks more polished now, which makes it harder to catch and more expensive when it slips through.

The Copywriter's Playbook Was the Prompt Engineering Manual All Along

(Note the link at the end of this article to github project with a skill.md file that can apply these and a few other principles to your agent files and prompts, to help you get the best results possible!)

Ironically, researchers have spent the last three years discovering, one paper at a time, that virtually every principle in the copywriter's playbook also improves AI output. The overlap is so complete it's almost embarrassing.

Be specific, not abstract. Copywriters know "save $47 on your first order" beats "save money." A 2023 study tested 26 prompting principles across LLaMA and GPT-4 and found an average 57.7% quality improvement on GPT-4 when applying them [9]. The copywriting version of this principle predates the internet. The AI research confirmed it with p-values.

Show, don't tell. Copywriters use case studies because concrete examples beat abstract claims. Few-shot prompting (providing 2-3 examples of desired output) is the single most studied technique in all of prompt engineering, dating back to GPT-3's original paper in 2020 [10]. Quality of examples consistently matters more than quantity. One compelling case study beats a list of twenty testimonials. One well-chosen example beats ten mediocre ones.

Break it into steps. Copywriters call it the "slippery slope": each sentence leads naturally to the next, keeping cognitive load manageable. Researchers call it chain-of-thought prompting. When they added "let's think step by step" to math problems, accuracy jumped from 18% to 79% [11]. The mechanism is different (the model uses intermediate tokens as working memory), but the prescription is identical: chunk complexity into digestible pieces.

Make the stakes real. Every copywriter knows emotion drives action more than logic. A Microsoft and Chinese Academy of Sciences team tested this directly on LLMs. Adding positive emotional framing ("this is very important to my career" or "take pride in your work and give it your best") improved performance by up to 115% on BIG-Bench reasoning benchmarks [12]. A follow-up study tested negative emotional framing ("this seems beyond your skill level") and found it boosted BIG-Bench performance by 46% [13]: less dramatic than positive framing on that benchmark, but the negative approach outperformed positive framing on instruction-following tasks (12.9% vs. 8%). Both of these flavors of emotional prompting increase performance. The models don't feel urgency. But emotionally-framed requests in their training data were paired with higher-effort human responses. The statistical echo of human motivation is baked into the weights.

One ask per piece. Copywriting's "Rule of One" (one idea, one audience, one call to action) has a direct analog: multi-task prompts tend to degrade LLM performance compared to single-task prompts, particularly in smaller models [14]. Ask for three things at once and the model weights them unevenly, just like a reader who skims a multi-CTA email and does none of them.

Name your audience. "Written for CFOs at Series B startups" produces entirely different copy than "written for general audiences." Same for LLMs: specifying the audience measurably shifts vocabulary, depth, and readability. The model isn't imagining a reader. It's shifting toward token distributions that co-occurred with that audience type in training data. Different mechanism, same result.

Say what to do, not what to avoid. Copywriters learned long ago that "don't think about a pink elephant" makes you think about a pink elephant. LLMs have the same problem. Negative instructions ("don't be verbose," "avoid jargon") consistently underperform positive ones ("write concisely," "use plain language"). Anthropic has formalized this into their official documentation. The suppression instruction activates the very concept you're trying to suppress.

End with the ask. Direct mail copywriters know the P.S. is the most-read section of a letter because eyes jump to the end. LLMs have a structural equivalent: research shows models perform best when key information appears at the beginning or end of a prompt, with significant degradation for anything buried in the middle [15]. Because LLMs process tokens left to right, the request at the end has full visibility of all the context that came before it. Lead with your background and constraints, close with what you actually want. Burying your actual request in the middle of a long prompt is the AI equivalent of burying the lede.

The pattern is relentless. Every principle that makes human communication more effective also makes AI communication more effective. Not because LLMs think like people, but because people wrote the data that LLMs learned from. Good copywriting created the training signal. The model absorbed its statistical fingerprint. When you write the way a skilled communicator writes, you're hitting the exact frequency the model was tuned to receive.

It's Not "Prompt Engineering." It's Writing.

The tech industry loves to rebrand familiar skills as novel disciplines. "Prompt engineering" sounds like something you'd need a certification for. In reality, strip away the jargon and you're looking at a skill set that every good business writing course has taught for decades:

Chain-of-thought prompting is asking someone to show their work and think step by step. (Research shows this alone can boost accuracy by 20%.)
Few-shot prompting is providing examples of what good output looks like. An editorial brief.
Role prompting is specifying the audience and voice. Basic communication context-setting.
Structured prompting is organizing your request with clear sections and constraints. An assignment description.

These techniques work because they're not tricks. They're the fundamentals of effective written communication, dressed up in new vocabulary.

And the parallels go deeper than metaphor. A December 2025 Google Research paper found that simply repeating a prompt twice (literally copying and pasting it) improved LLM accuracy by up to 76% on non-reasoning tasks [16]. The reason is mechanical: as noted above, LLMs process tokens left to right. A token at position 5 can only "see" tokens 1 through 4. When you repeat the prompt, every token in the first copy gets to attend to every token in the second copy. The model finally has full visibility of the complete request when generating its answer.

Separate research confirmed the corollary [17]: prompt component order matters enormously. Models perform measurably better when context and background appear before the question rather than after. Why? Same mechanism. If the question comes first and context comes second, the question tokens are processed before the model has "seen" the context. They're informationally blind to it.

THE EXCEPTION THAT PROVES THE RULE: The "End with the ask" principle inverts a classic business writing principle. Journalism and the military teach BLUF: Bottom Line Up Front. State the key point first, then provide supporting context. That works for humans because busy executives scan from the top, likely have necessary context already, and may not finish reading. But LLMs aren't scanning. They're fully processing, from left to right, building context as they go. An ask that appears before the context is an ask that was processed blind. For prompts, the optimal structure is context first, ask last: give the model everything it needs to know, then tell it what to do with that knowledge. The ask at the end has full visibility of every token that came before it.

The fact that effective prompt structure diverges from effective executive communication on this one point makes the broader convergence even more striking. Writing structure isn't just a preference. Inside an LLM, it's physics.

The people who already practice clear, structured writing don't need a prompt engineering course. They just need to talk to the AI the way they'd write a good brief, a clear email, or a well-structured requirements document.

The people who never learned to write clearly? They need to learn that first. No prompt template will substitute for the ability to articulate what you actually want.

The Compounding Gap

In an era when AI helps people write better, the people who already write well extract far more value from the tools. They iterate faster. They recognize when output drifts. They course-correct with precision instead of frustration. They build on good outputs instead of starting over from bad ones.

This creates a compounding loop. Good communicators get better AI results. Better AI results accelerate their work. Accelerated work gives them more reps. More reps sharpen their communication further. The gap between clear communicators and vague ones isn't closing with AI. It's widening. The best communicators are the 10X AI Collaborators.

The World Economic Forum's 2025 Future of Jobs Report confirms this trajectory [18]: while basic literacy skills (reading, writing, and mathematics) show a small net decline in projected demand, skills like analytical thinking, creative thinking, and leadership are increasing in importance. The premium on clear, structured communication as a meta-skill that enables all of these is going up, not down.

If you're a leader: invest in your team's writing skills. Not their prompt engineering skills, their writing skills. The ability to define a problem precisely, specify an outcome clearly, provide relevant context, and structure a request logically. These pay dividends in every human interaction--and in every AI interaction.

If you're an individual contributor: the single highest-leverage skill you can develop right now isn't learning a new framework or memorizing prompt patterns. It's learning to write with clarity and precision. It will make you better at your job, better at managing people, and better at working with every AI tool that exists or will exist. Communication skills are the key.

The Bottom Line (Not Up Front)

Communication has always been the key to everything. AI didn't change that. It amplified it. The same skills that make you effective with a boardroom full of executives make you effective with the most powerful AI tools on the planet, because those AI tools learned everything they know from our written communication.

You wouldn't walk into your CEO's office and say "fix the company." Don't do it to your AI either.

If you're interested in the free "plsfix" skill (couldn't help myself), which applies these and other LLM communication principles to your AI prompts, spec files, or administrative files, you can find it at https://github.com/keithmackay/plsfix.