<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: choutos</title>
    <description>The latest articles on DEV Community by choutos (@choutos).</description>
    <link>https://dev.to/choutos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F119428%2Fa4432e5f-7e42-4203-b728-841317156d79.jpeg</url>
      <title>DEV Community: choutos</title>
      <link>https://dev.to/choutos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/choutos"/>
    <language>en</language>
    <item>
      <title>The Cathedral Builder Who Forgot the Foundations</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Thu, 02 Apr 2026 13:57:53 +0000</pubDate>
      <link>https://dev.to/choutos/the-cathedral-builder-who-forgot-the-foundations-c88</link>
      <guid>https://dev.to/choutos/the-cathedral-builder-who-forgot-the-foundations-c88</guid>
      <description>&lt;p&gt;There was a stonemason in Santiago (not a famous one, not the kind they write plaques for) who spent forty years carving granite for the cathedral restorations. My grandfather knew him, or claimed to, which in Galicia amounts to the same thing. The man had a saying that I never understood until this year: &lt;em&gt;"O rápido nom é o que mais corre, é o que sabe onde vai."&lt;/em&gt; The fast one isn't the one who runs fastest. It's the one who knows where she's going.&lt;/p&gt;

&lt;p&gt;I've been thinking about him lately, this mason nobody remembers, because something is shifting in how we build software, and it has the same shape as his proverb.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decade we spent running
&lt;/h2&gt;

&lt;p&gt;For twenty years, the software industry has been sprinting. Agile told us to stop drawing blueprints and start laying bricks. Ship the thing. Learn from the rubble. Iterate. The manifesto was a corrective, and a necessary one: before Agile, teams spent eighteen months writing specifications for software that was obsolete by the time anyone opened an IDE. We were building cathedrals nobody asked for.&lt;/p&gt;

&lt;p&gt;So we learned to move. We stopped asking permission. We shipped MVPs, gathered feedback, pivoted, shipped again. "Working software over comprehensive documentation" wasn't just a principle, it was a liberation. And it worked, genuinely, for a long time.&lt;/p&gt;

&lt;p&gt;But somewhere along the way, the corrective became the orthodoxy. "Ship fast" stopped being a strategy and became an identity. Documentation was for people who couldn't code. Architecture was a dirty word, the province of ivory-tower consultants with UML addictions. If you spent a morning thinking before typing, you were the bottleneck.&lt;/p&gt;

&lt;p&gt;We got very, very fast. We also got very, very lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The vibes, they are not enough
&lt;/h2&gt;

&lt;p&gt;Then came the AI tools, and everything accelerated again. Copilot, Claude, Cursor: suddenly you could describe a feature in plain English and watch it materialise. Andrej Karpathy called it "vibe coding," which is perfect because it captures exactly the right mix of exhilaration and recklessness. You feel your way through. You prompt, you accept, you prompt again. A prototype in an afternoon that would have taken a team a week.&lt;/p&gt;

&lt;p&gt;The problem is not the speed. Speed is beautiful. The problem is that code written at the speed of intuition ages the way fish does: magnificently fresh for about six hours, then increasingly difficult to be around.&lt;/p&gt;

&lt;p&gt;Vibe coding produces software that works right now. It does not produce software that anyone understands, including the person who shipped it. Architecture decisions get made invisibly, by the model, based on whatever patterns it absorbed from its training data. You didn't choose that state management approach. You didn't decide on that error handling pattern. The machine did, and you accepted because it compiled and the tests passed and lunch was in twenty minutes.&lt;/p&gt;

&lt;p&gt;Six months later, you're excavating your own codebase like an archaeologist who can't read the script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mason's revenge
&lt;/h2&gt;

&lt;p&gt;Here's what's changing: the most valuable thing a developer can do is no longer writing code. The machines write code now, and they write it faster than any human ever will. The race is over. The machines won it while we were arguing about tabs versus spaces.&lt;/p&gt;

&lt;p&gt;But (and this is the mason's proverb again) &lt;em&gt;speed without direction is just expensive chaos&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The work that matters now is the work that happens before the first prompt. Defining boundaries. Choosing patterns. Deciding what a system should look like when it's finished, not just when it compiles. This is architecture, the real kind, not the enterprise-consultant kind with seventeen layers of abstraction and a governance board. The kind where someone sits down and thinks: what are the three decisions that, if we get them right now, will save us six months of pain later?&lt;/p&gt;

&lt;p&gt;The developer's job is becoming the stonemason's job. Not cutting every block, but knowing where each one goes. Designing the structure so that when someone else (or some&lt;em&gt;thing&lt;/em&gt; else) does the cutting, the cathedral still stands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your codebase is a classroom now
&lt;/h2&gt;

&lt;p&gt;There's a subtlety here that took me a while to see. We used to write clean code for the next developer. Good names, clear comments, sensible structure: all so that the poor soul who inherits your module at 3am on a Tuesday can figure out what you were thinking.&lt;/p&gt;

&lt;p&gt;The next developer is now a machine.&lt;/p&gt;

&lt;p&gt;This changes the stakes completely. A human developer can ping you on Slack. They can read between the lines. They can look at a questionable pattern and think, "that doesn't feel right, let me ask someone." An AI model does none of this. It sees your patterns and replicates them, faithfully, at scale. Good patterns multiply into good code. Bad patterns multiply into a codebase that looks functional on the surface and crumbles the moment you touch it.&lt;/p&gt;

&lt;p&gt;Your code is training data now, whether you like it or not. Every well-typed function, every clear interface, every consistent naming convention is a lesson the model learns. Every shortcut, every "I'll fix this later," every contradictory pattern is a lesson too: just not the kind you want taught.&lt;/p&gt;

&lt;p&gt;Curate your codebase the way a librarian curates a collection. Not because you're precious about aesthetics, but because the quality of what you put in directly determines the quality of what comes out, multiplied by a thousand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not waterfall, not agile. Something else.
&lt;/h2&gt;

&lt;p&gt;I can already hear the objection: "So we're going back to waterfall? Eighteen months of specs before writing a line of code?" No. Absolutely not. We're not going backwards. The pendulum doesn't return to the same place. It finds a new centre.&lt;/p&gt;

&lt;p&gt;What's emerging is something more interesting than either extreme. Call it intentional development, or don't call it anything: names are cheap. The shape of it is: think carefully, build fast, constrain intelligently. Spend the morning designing the system. Spend the afternoon letting the machines build within it. Spend the evening reviewing what they produced, not for syntax errors, but for architectural coherence.&lt;/p&gt;

&lt;p&gt;The Agile Manifesto said "individuals and interactions over processes and tools." That made sense when individuals were doing all the building. Now the tools are doing most of the building, and the individuals need to be the ones setting direction. The bottleneck moved. In 2005, it was speed. In 2026, it's judgment.&lt;/p&gt;

&lt;p&gt;We don't need more velocity. We have more velocity than we know what to do with. What we need is a map.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the mason knew
&lt;/h2&gt;

&lt;p&gt;The stonemason from Santiago (the one my grandfather may or may not have known) retired in the 1980s. By then, they were using power tools for the restorations. Pneumatic hammers, diamond saws, things that would have seemed like sorcery to the medieval builders. The cuts were faster, cleaner, more precise than anything human hands could manage.&lt;/p&gt;

&lt;p&gt;But someone still had to decide where to cut.&lt;/p&gt;

&lt;p&gt;The tools changed. The craft didn't. It just moved upstream, from the hands to the head. From execution to design. From the chisel to the blueprint.&lt;/p&gt;

&lt;p&gt;That's where we are now. The chisel got impossibly fast. The question is whether we're ready to be the ones holding the blueprint.&lt;/p&gt;

&lt;p&gt;I think we are. But only if we stop running long enough to remember where we were going.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agile</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop Reading AI-Generated Code. Start Verifying It.</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Wed, 18 Mar 2026 09:12:52 +0000</pubDate>
      <link>https://dev.to/choutos/stop-reading-ai-generated-code-start-verifying-it-1d1o</link>
      <guid>https://dev.to/choutos/stop-reading-ai-generated-code-start-verifying-it-1d1o</guid>
      <description>&lt;p&gt;&lt;em&gt;There is a difference. It matters more than you think.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Somewhere in your codebase right now, there is code an AI agent wrote. Maybe you reviewed it carefully. Maybe you skimmed it. Maybe you approved the pull request because the tests were green and you had three other things open.&lt;/p&gt;

&lt;p&gt;This is not a problem to be solved by reading faster. It is a problem to be solved differently.&lt;/p&gt;

&lt;p&gt;The question worth asking is not "did I read this code?" The question is: "do I have sufficient evidence that this code is correct?" Reading is one way to gather that evidence. It is not the only way, and for AI-generated code at any meaningful scale, it cannot be the primary one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Distinction That Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reviewing&lt;/em&gt; code means reading it. You trace the logic, spot the edge cases, ask whether this is the right approach. It is slow, expert-dependent, and does not scale.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verifying&lt;/em&gt; code means confirming it is correct, by whatever means available. Review is one path to verification. Machine-enforceable constraints are another. At sufficient scale, the second path is the only viable one.&lt;/p&gt;

&lt;p&gt;This reframing is not an excuse to be lazy. It is an invitation to be more rigorous: to replace the informal, inconsistent process of "a human skimmed this" with a formal, repeatable process of "this code passed a defined set of constraints."&lt;/p&gt;




&lt;h2&gt;
  
  
  What Good Constraints Look Like
&lt;/h2&gt;

&lt;p&gt;The goal is to define a space of valid programs so precisely that anything outside it cannot pass, and anything inside it is almost certainly correct. Four types of constraints do most of the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Property-based tests
&lt;/h3&gt;

&lt;p&gt;Standard unit tests check specific cases: given input 15, expect "FizzBuzz." This is useful but limited. A property-based test asks a harder question: does this property hold for &lt;em&gt;all&lt;/em&gt; valid inputs?&lt;/p&gt;

&lt;p&gt;You write the property. The testing library (Hypothesis for Python, fast-check for JavaScript, QuickCheck for the Haskell family) generates hundreds of inputs automatically, favoring edge cases: zero, negative numbers, very large values, boundary conditions. If the property holds across all of them, you have meaningful confidence that it holds in general.&lt;/p&gt;

&lt;p&gt;This constrains the solution space toward correctness: an implementation that violates a stated property cannot pass, no matter how plausible it looks.&lt;/p&gt;
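&lt;p&gt;A minimal sketch of the idea without a library: the round-trip property of a hypothetical run-length encoder, checked against edge cases and a few hundred random strings. Hypothesis's &lt;code&gt;@given&lt;/code&gt; does the same job with smarter generation and automatic shrinking of failing inputs.&lt;/p&gt;

```python
import random

# Hypothetical function pair under test: run-length encode and decode.
def rle_encode(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

# Property: decode(encode(s)) == s for ALL strings, not just hand-picked ones.
edge_cases = ["", "a", "aaaa", "ab" * 50]
random_cases = [
    "".join(random.choice("abc") for _ in range(random.randint(0, 40)))
    for _ in range(200)
]
for s in edge_cases + random_cases:
    assert rle_decode(rle_encode(s)) == s, f"round-trip failed for {s!r}"
```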

&lt;h3&gt;
  
  
  Mutation testing
&lt;/h3&gt;

&lt;p&gt;Here the direction reverses. Instead of asking "does the code satisfy the tests," you ask "do the tests actually test anything?"&lt;/p&gt;

&lt;p&gt;Mutation testing tools make small, deliberate changes to your code: swap &lt;code&gt;&amp;gt;&lt;/code&gt; for &lt;code&gt;&amp;gt;=&lt;/code&gt;, flip a boolean, change a constant. Then they re-run your test suite. If the tests still pass after a change that should break something, the tests are not doing their job.&lt;/p&gt;

&lt;p&gt;Used in the usual way, mutation testing helps you improve your test suite. But there is a second use: if you have a strong test suite and all mutations are killed, you can invert the logic. Any code that passes these tests must be doing exactly what the tests describe, nothing more. The mutation score becomes a measure of how constrained the valid solution space is.&lt;/p&gt;
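&lt;p&gt;The mechanic, hand-simulated (a real tool such as mutmut rewrites the source for you): one swapped operator, and the question is whether the suite notices.&lt;/p&gt;

```python
# Hand-simulated mutation test. The "mutation" is a single swapped operator.
def is_adult(age):
    return age >= 18          # original code

def is_adult_mutant(age):
    return age > 18           # mutant: >= swapped for >

def weak_suite(fn):
    # Only tests far from the boundary, so the mutant slips through.
    return fn(30) is True and fn(5) is False

def strong_suite(fn):
    # Adds the boundary case, which kills the mutant.
    return fn(30) is True and fn(5) is False and fn(18) is True

assert weak_suite(is_adult) and weak_suite(is_adult_mutant)    # mutant survives: weak tests
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)  # mutant killed
```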

&lt;h3&gt;
  
  
  Side-effect isolation
&lt;/h3&gt;

&lt;p&gt;A function that only transforms inputs into outputs is a function you can verify in isolation. A function that writes to a database, calls an API, or modifies global state is a function whose correctness depends on the state of the world at runtime.&lt;/p&gt;

&lt;p&gt;Requiring pure functions where possible is not just good software design. It is a verification strategy. A pure function can be tested exhaustively. A side-effectful function cannot.&lt;/p&gt;
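&lt;p&gt;A sketch of the split, with hypothetical names: a pure pricing function that can be checked over a whole grid of inputs, and a thin effectful shell around it.&lt;/p&gt;

```python
# Pure core: a plain input-to-output transformation, testable exhaustively
# with no database, clock, or network anywhere in sight.
def discounted_total(prices, discount_rate):
    subtotal = sum(prices)
    return round(subtotal * (1 - discount_rate), 2)

# Sweep a grid of inputs and check invariants that must always hold.
for rate in [0.0, 0.1, 0.25, 0.5, 1.0]:
    for prices in [[], [10.0], [19.99, 0.01], [5.0] * 100]:
        total = discounted_total(prices, rate)
        assert total >= 0
        assert round(sum(prices), 2) >= total   # a discount never adds money

# Impure shell (hypothetical db object): the side effect lives at one thin edge.
def charge_customer(db, customer_id, prices, rate):
    total = discounted_total(prices, rate)   # all logic happens purely
    db.insert_charge(customer_id, total)     # the only effectful line
```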

&lt;h3&gt;
  
  
  Static analysis
&lt;/h3&gt;

&lt;p&gt;Type checking, linting, and similar tools catch a category of errors before anything runs. In Python, mypy or pyright. In TypeScript, the compiler itself. These are table stakes, not the interesting part, but they eliminate a class of bugs that would otherwise require dynamic testing to surface.&lt;/p&gt;

&lt;p&gt;Together, these four constraints define a small, well-lit region of possible programs. The agent must land inside it. Most invalid programs cannot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Further: Validation at Scale
&lt;/h2&gt;

&lt;p&gt;The four constraints above work well for individual functions. When agents are generating entire services, the validation architecture needs to grow with them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Formal contracts
&lt;/h3&gt;

&lt;p&gt;Contracts are preconditions and postconditions expressed in the code itself. A function with a contract says: if you call me with valid input, I will return valid output, and here is precisely what "valid" means for both.&lt;/p&gt;

&lt;p&gt;Libraries like &lt;code&gt;deal&lt;/code&gt; for Python, &lt;code&gt;spec&lt;/code&gt; for Clojure, or the type system in Rust make this explicit. The agent cannot produce a function that violates its own declared contract. Contracts can be checked statically (before anything runs) or at runtime (as a continuous assertion). Either way, they narrow the space of valid programs more precisely than tests alone.&lt;/p&gt;
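&lt;p&gt;A hand-rolled runtime contract, sketching what &lt;code&gt;deal&lt;/code&gt; and friends provide; the decorated function and its predicates are illustrative.&lt;/p&gt;

```python
import functools

# Minimal runtime-contract decorator: `pre` and `post` are predicates,
# and violating either one raises instead of returning quietly.
def contract(pre, post):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args):
            assert pre(*args), f"precondition violated: {args!r}"
            result = fn(*args)
            assert post(result), f"postcondition violated: {result!r}"
            return result
        return inner
    return wrap

@contract(pre=lambda xs: len(xs) > 0, post=lambda r: r >= 0)
def mean_abs(xs):
    return sum(abs(x) for x in xs) / len(xs)

assert mean_abs([-3, 3]) == 3.0
# mean_abs([]) would raise: the declared contract, not a reviewer, rejects it.
```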

&lt;h3&gt;
  
  
  Sandboxed execution
&lt;/h3&gt;

&lt;p&gt;Side-effect isolation can be enforced at the test level. But for production agents generating code that runs automatically, you want structural enforcement: a sandbox where the generated code physically cannot reach the network, the filesystem, or external services.&lt;/p&gt;

&lt;p&gt;Firecracker microVMs, WebAssembly runtimes, and seccomp-filtered containers all do this at the OS level. The question changes from "did the code try to make an external call?" to "the code cannot make an external call, so we don't need to ask." This is not just correctness verification: it is containment. For autonomous systems, containment matters as much as correctness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Differential testing
&lt;/h3&gt;

&lt;p&gt;If you have an existing implementation of the thing the agent is replacing, you have something valuable: a reference. Run both implementations in parallel across a large sample of real or synthetic inputs. Compare the outputs. Where they agree, the new code is almost certainly correct. Where they diverge, you have a precise failure case to examine.&lt;/p&gt;

&lt;p&gt;This approach scales well and requires no additional test-writing. The reference implementation does the work. The old code, the thing you were trying to improve or replace, becomes your verification oracle.&lt;/p&gt;
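&lt;p&gt;In sketch form, with both implementations as hypothetical stand-ins: the legacy version is the oracle, and agreement over a large random sample is the evidence.&lt;/p&gt;

```python
import math
import random

# Trusted reference implementation (hypothetical legacy code).
def legacy_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Agent-generated replacement (hypothetical), using the shortcut formula.
def new_variance(xs):
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

random.seed(0)
for _ in range(1000):
    xs = [random.uniform(-100.0, 100.0) for _ in range(random.randint(1, 50))]
    old, new = legacy_variance(xs), new_variance(xs)
    # Where they agree, the new code is almost certainly correct;
    # a divergence here would be a precise failure case to examine.
    assert math.isclose(old, new, rel_tol=1e-9, abs_tol=1e-6), (old, new)
```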

&lt;h3&gt;
  
  
  Schema validation at boundaries
&lt;/h3&gt;

&lt;p&gt;Any time generated code produces structured output (JSON responses, database records, message payloads), a schema validator at the boundary is a low-cost, high-value check. Not "is this valid JSON" but "does this JSON have the right shape": the required fields, the expected types, the value ranges that downstream consumers depend on.&lt;/p&gt;

&lt;p&gt;Pydantic, Zod, and JSON Schema are all mature options. An agent cannot silently change an API response shape if a schema validator is standing at the door. Regressions of this class, common and annoying, are caught automatically with almost no engineering cost.&lt;/p&gt;
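&lt;p&gt;A minimal validator of that kind, with a hypothetical response shape and status enum; Pydantic or a JSON Schema checker would do this far more robustly.&lt;/p&gt;

```python
# Boundary validator sketch: required fields, expected types, allowed values.
REQUIRED_FIELDS = {"id": str, "status": str, "amount_cents": int}
ALLOWED_STATUSES = {"DRAFT", "SUBMITTED", "UNDER_REVIEW", "APPROVED", "DENIED"}

def validate_response(payload):
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if "status" in payload and payload["status"] not in ALLOWED_STATUSES:
        errors.append(f"status: {payload['status']!r} is not a known status")
    amount = payload.get("amount_cents")
    if isinstance(amount, int) and not amount >= 0:
        errors.append("amount_cents: must be non-negative")
    return errors

ok = {"id": "c-1", "status": "APPROVED", "amount_cents": 12500}
drifted = {"id": "c-2", "status": "CLOSED"}   # shape drift an agent introduced
assert validate_response(ok) == []
assert validate_response(drifted) == [
    "missing field: amount_cents",
    "status: 'CLOSED' is not a known status",
]
```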

&lt;h3&gt;
  
  
  Semantic testing for decision-making code
&lt;/h3&gt;

&lt;p&gt;When generated code makes judgements (classifies inputs, extracts structured information, routes requests) unit tests may not be the right tool. The correct output for a given input is not always a single deterministic value.&lt;/p&gt;

&lt;p&gt;For these cases, a labeled evaluation set works better. You assemble a representative sample of inputs where you know the correct answer, run the generated code against it, and measure accuracy. Set a threshold below which the code fails. This is how machine learning models are evaluated, and it applies equally well to any code that sits at the fuzzy boundary between computation and judgement.&lt;/p&gt;
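&lt;p&gt;In miniature, with a toy classifier and a hypothetical labeled set: measure accuracy, compare against a threshold, accept or reject.&lt;/p&gt;

```python
# Semantic testing: grade against a labeled set instead of asserting
# exact outputs case by case. The classifier is a deliberately crude stand-in.
def classify_urgency(text):
    urgent_words = {"outage", "down", "urgent", "immediately"}
    words = set(text.lower().split())
    return "urgent" if words.intersection(urgent_words) else "normal"

LABELED_SET = [
    ("Production is down, fix immediately", "urgent"),
    ("Site outage reported in EU region", "urgent"),
    ("Please update the docs when convenient", "normal"),
    ("Question about invoice formatting", "normal"),
    ("Everything is on fire", "urgent"),   # no keyword match: misclassified
]

correct = sum(1 for text, label in LABELED_SET if classify_urgency(text) == label)
accuracy = correct / len(LABELED_SET)
print(f"accuracy: {accuracy:.0%}")          # 4 of 5 correct here
assert accuracy >= 0.75, "below threshold: reject this code"
```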

&lt;h3&gt;
  
  
  Agent chains as verification infrastructure
&lt;/h3&gt;

&lt;p&gt;When one agent generates code and another agent tests it, and a third reviews the test coverage, and a fourth checks for type errors, the chain is itself a verification structure. Each agent certifies a specific property. The code must pass every certification before it is accepted.&lt;/p&gt;

&lt;p&gt;This is a natural architecture for teams already running multiple agents. The key discipline is making the certifications explicit: not "agent B approved this" but "agent B confirmed that all property-based tests pass and mutation score exceeds 90%." Explicit certifications are auditable. Vague approvals are not.&lt;/p&gt;
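&lt;p&gt;One way to make that explicit (agent names and checks illustrative): each agent signs a named property, and acceptance requires every signature.&lt;/p&gt;

```python
from dataclasses import dataclass

# An explicit, auditable certification: who checked what, and the verdict.
@dataclass
class Certification:
    agent: str
    property_checked: str
    passed: bool

def accept(certifications):
    # Code is accepted only if every agent's named check passed.
    return len(certifications) > 0 and all(c.passed for c in certifications)

chain = [
    Certification("agent-tests", "all property-based tests pass", True),
    Certification("agent-mutation", "mutation score above 90 percent", True),
    Certification("agent-types", "strict type check clean", True),
]
assert accept(chain)

chain[1].passed = False   # one explicit check fails...
assert not accept(chain)  # ...and the chain rejects the code, with a paper trail
```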




&lt;h2&gt;
  
  
  What We Are Actually Giving Up
&lt;/h2&gt;

&lt;p&gt;There is something the old model of code review provided that this new model does not: shared understanding. When a human reads code, they build a mental model of what it does. They carry that model into future debugging sessions, architecture decisions, conversations about whether to change the code.&lt;/p&gt;

&lt;p&gt;AI-generated code, verified by machine constraints, produces no such shared understanding. The team knows the code is correct. Nobody knows why it works, or how, or what it would mean to change it.&lt;/p&gt;

&lt;p&gt;This is a real cost. It is worth acknowledging rather than pretending it is not there.&lt;/p&gt;

&lt;p&gt;The honest answer is that it may be acceptable for certain categories of code: isolated utility functions, data transformations, well-bounded integrations. The same way teams accept that compiled code is a black box, we may need to accept that some generated code is a black box too, provided the box is well-sealed and well-tested.&lt;/p&gt;

&lt;p&gt;For core business logic, architectural decisions, anything that needs to be understood to be maintained: human review remains the right tool. The goal is not to eliminate review but to reserve it for the code that actually needs it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Practical Path
&lt;/h2&gt;

&lt;p&gt;You do not need to implement all of this at once. A sensible progression:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with property-based tests.&lt;/strong&gt; Pick your three most critical functions and write property-based tests for them. See how many edge cases they surface that your unit tests missed. The answer will be instructive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit your test suite with mutation testing.&lt;/strong&gt; Before trusting your tests to verify agent output, find out whether they are actually testing anything. A mutation score will tell you quickly. Fix the gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforce purity where you can.&lt;/strong&gt; Any function that could be pure, should be. Document the exceptions. Make side effects visible and intentional rather than incidental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put schema validators at every output boundary.&lt;/strong&gt; Every API response, every database write, every message payload. This takes an afternoon to set up and pays dividends forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build toward agent chains.&lt;/strong&gt; As you add agents to your workflow, give each one a specific certification responsibility. Explicit certifications that accumulate into an audit trail.&lt;/p&gt;

&lt;p&gt;The overhead is real. For a single short function, reading it is faster than building this infrastructure. The infrastructure makes sense when you are not looking at one function. You are looking at a thousand, generated by agents, arriving faster than any team can review them.&lt;/p&gt;

&lt;p&gt;The question then is not how to read faster. It is how to know, without reading, that what arrived is correct.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI Makes Experienced Developers Slower. Here's Why.</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Wed, 04 Mar 2026 17:14:38 +0000</pubDate>
      <link>https://dev.to/choutos/ai-makes-experienced-developers-slower-heres-why-1ico</link>
      <guid>https://dev.to/choutos/ai-makes-experienced-developers-slower-heres-why-1ico</guid>
<description>

&lt;p&gt;&lt;em&gt;By Vítor Andrade&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Earlier this year, METR published a randomised controlled trial that should have shaken every engineering organisation awake. They took experienced open-source developers, gave them state-of-the-art AI coding tools, and measured what happened.&lt;/p&gt;

&lt;p&gt;The developers got 19% slower.&lt;/p&gt;

&lt;p&gt;Not junior developers. Not people unfamiliar with the codebases. Experienced contributors working on projects they knew intimately, with tools they chose themselves. And here is the part that should genuinely unsettle you: those same developers believed they were 24% faster.&lt;/p&gt;

&lt;p&gt;A 43-percentage-point gap between perception and reality. The engineers most qualified to judge were the most wrong.&lt;/p&gt;

&lt;p&gt;I have spent the past year helping engineering teams integrate AI agents into their workflows. I have watched this pattern play out dozens of times. A senior developer picks up Cursor or Claude Code, feels the rush of generating code at unprecedented speed, and starts shipping. Two months later, the codebase is worse. The tests are brittle. There is a layer of generated code that nobody fully understands, woven through the system like kudzu. And the team cannot figure out why velocity has not improved despite everyone feeling faster.&lt;/p&gt;

&lt;p&gt;The METR study tells us why. But it does not tell us what to do about it.&lt;/p&gt;

&lt;p&gt;I think I have found what works. And the answer is annoyingly boring.&lt;/p&gt;




&lt;h2&gt;
  
  
  The file that changed everything
&lt;/h2&gt;

&lt;p&gt;Six months ago I was consulting with a team that had fully embraced AI coding tools. Every developer had Cursor. They were generating thousands of lines per day. And their bug rate had tripled.&lt;/p&gt;

&lt;p&gt;I asked to see their onboarding documentation. There was none. I asked how a new developer would learn the codebase conventions. "They'd ask someone." I asked what happened when an AI agent needed to understand those same conventions. Silence.&lt;/p&gt;

&lt;p&gt;We created a single file, dropped it in the root of their repository, and within two weeks their agent-generated code quality improved measurably. The file looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# AGENTS.md&lt;/span&gt;

&lt;span class="gu"&gt;## What This Project Does&lt;/span&gt;
Backend API for insurance claims processing.
Domain-specific terminology: a "claim" is not a "case" is not a "ticket."

&lt;span class="gu"&gt;## Tech Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; TypeScript (strict mode)
&lt;span class="p"&gt;-&lt;/span&gt; Fastify
&lt;span class="p"&gt;-&lt;/span&gt; PostgreSQL with Drizzle ORM

&lt;span class="gu"&gt;## How to Validate Your Work&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Tests: &lt;span class="sb"&gt;`npm test`&lt;/span&gt; (must pass 100%, no exceptions)
&lt;span class="p"&gt;-&lt;/span&gt; Lint: &lt;span class="sb"&gt;`npm run lint`&lt;/span&gt; (strictest ESLint config)
&lt;span class="p"&gt;-&lt;/span&gt; Type check: &lt;span class="sb"&gt;`npm run typecheck`&lt;/span&gt; (strict: true, no any)

&lt;span class="gu"&gt;## Coding Standards&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Never mock database calls. Use the test database.
&lt;span class="p"&gt;-&lt;/span&gt; Never mock HTTP calls to internal services. Use the test fixtures.
&lt;span class="p"&gt;-&lt;/span&gt; All new endpoints require integration tests, not unit tests.
&lt;span class="p"&gt;-&lt;/span&gt; Error responses follow RFC 7807. No exceptions.

&lt;span class="gu"&gt;## Domain Knowledge&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Claims have a state machine: DRAFT -&amp;gt; SUBMITTED -&amp;gt; UNDER_REVIEW -&amp;gt; APPROVED | DENIED
&lt;span class="p"&gt;-&lt;/span&gt; State transitions are the core business logic. Never bypass the state machine.
&lt;span class="p"&gt;-&lt;/span&gt; "Adjuster" means the human reviewer, not an automated process.

&lt;span class="gu"&gt;## What Agents Get Wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; They try to add Express middleware. We use Fastify. Do not mix.
&lt;span class="p"&gt;-&lt;/span&gt; They mock the database. This produces tests that pass and verify nothing.
&lt;span class="p"&gt;-&lt;/span&gt; They create new utility files instead of using /src/shared/utils.
&lt;span class="p"&gt;-&lt;/span&gt; They invent new error formats instead of using the RFC 7807 helper.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a prompt. It is not a framework. It is an onboarding document for a colleague who has perfect recall, zero institutional knowledge, and no ability to tap someone on the shoulder and ask "wait, do we use Express or Fastify here?"&lt;/p&gt;

&lt;p&gt;The insight is simple but easy to miss: if domain knowledge lives in engineers' heads, it does not exist for agents. An AI model working on your codebase is essentially fine-tuned on your codebase. Every file it reads shapes its output. When your codebase contains competing patterns, dead code, inconsistent conventions, and no written documentation, the model absorbs all of that confusion and reproduces it faithfully.&lt;/p&gt;

&lt;p&gt;The AGENTS.md file is not magic. It is the minimum viable act of treating your AI tools as what they actually are: very fast, very literal colleagues who need explicit written context to do good work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why experienced developers get worse
&lt;/h2&gt;

&lt;p&gt;The METR result is counterintuitive until you think about what experienced developers actually do differently from novices. An experienced developer carries a vast amount of implicit knowledge. They know which patterns the team prefers. They know the historical reasons behind odd architectural choices. They know which parts of the codebase are fragile. They know what "good" looks like for this specific project.&lt;/p&gt;

&lt;p&gt;None of that transfers to an AI agent through a prompt.&lt;/p&gt;

&lt;p&gt;When an experienced developer uses an AI tool naively, they are essentially delegating to an intern who has read the codebase but understood none of the tribal knowledge. And because the experienced developer trusts their own judgment, they review the output less carefully than a nervous junior would. They skim the generated code, recognise the general shape of what they asked for, and approve it. The code works. The tests pass. But something subtle is wrong: a convention violated, a pattern duplicated, an abstraction that does not fit the team's mental model.&lt;/p&gt;

&lt;p&gt;Do that fifty times and the codebase has drifted. Do it five hundred times and you have a mess that neither humans nor agents can navigate efficiently.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind the METR result. It is not that AI tools are bad. It is that bolting AI onto existing workflows without changing anything else makes experienced developers' greatest asset, their implicit knowledge, into a liability. The knowledge stays in their heads while the agent works without it.&lt;/p&gt;

&lt;p&gt;The fix is not to use AI less. The fix is to make implicit knowledge explicit. Write it down. Put it in the repository. Make it available to every agent and every new hire simultaneously.&lt;/p&gt;

&lt;p&gt;This is, of course, exactly what every engineering manager has been asking their team to do for decades. The irony is thick enough to taste.&lt;/p&gt;




&lt;h2&gt;
  
  
  The discipline nobody wants to hear about
&lt;/h2&gt;

&lt;p&gt;Here is the technique that separates teams who benefit from AI coding tools from teams who accumulate AI-generated technical debt:&lt;/p&gt;

&lt;p&gt;Never fix bad agent output. Ever.&lt;/p&gt;

&lt;p&gt;When an agent produces code that is wrong, sloppy, or subtly off, the instinct is to patch it. Fix the variable name. Add the missing error handling. Adjust the test. This feels efficient. You already have 90% of what you need; why throw it away?&lt;/p&gt;

&lt;p&gt;Because that remaining 10% of wrongness stays in your codebase forever. And your codebase is the context window for every future agent run.&lt;/p&gt;

&lt;p&gt;Think about it as a feedback loop. An agent reads your codebase, generates code in the style of your codebase, and commits it back. If the codebase is clean and consistent, the next agent run produces clean and consistent output. If the codebase contains patched-over slop, the next agent run produces more slop, slightly worse, which gets patched over again, which produces even worse slop.&lt;/p&gt;

&lt;p&gt;Quality spirals upward. Slop spirals downward. There is no steady state.&lt;/p&gt;

&lt;p&gt;So when agent output is bad, the discipline is: stop. Diagnose why the output was bad. Was the spec too vague? Was the AGENTS.md missing a key convention? Was the agent's scope too broad? Fix the root cause, then rerun from scratch.&lt;/p&gt;

&lt;p&gt;This feels wasteful. It is the opposite of wasteful. It is the only way to keep the quality loop going in the right direction.&lt;/p&gt;

&lt;p&gt;I call this the recursive quality loop, though it does not need a name. The principle is ancient: the state of your workspace determines the quality of your work. Carpenters know this. Chefs know this. Software engineers have always known this in theory but rarely practice it, because the cost of a messy codebase was slow and diffuse. With AI agents, the cost is immediate and measurable.&lt;/p&gt;

&lt;p&gt;A messy codebase does not just slow down human developers. It actively degrades the output of every AI tool that touches it. Codebase hygiene went from "nice to have" to "load-bearing infrastructure" the moment we started letting AI read and write our code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The anti-mocking rule and other boring specifics
&lt;/h2&gt;

&lt;p&gt;Let me get concrete about what "strict engineering discipline" looks like in practice, because the details matter more than the philosophy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI loves to mock things.&lt;/strong&gt; Give an agent a task that involves database calls and it will, nine times out of ten, write tests that mock the database. The tests pass. They verify absolutely nothing. You now have a green CI pipeline and zero confidence that the code works.&lt;/p&gt;

&lt;p&gt;The rule in every AGENTS.md I write: never mock what you can use for real. Use the test database. Use the test fixtures. Use the real HTTP client against a local service. If you cannot test something for real, that is a signal that your test infrastructure needs work, not that mocking is acceptable.&lt;/p&gt;
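&lt;p&gt;As one concrete illustration, the rule might read like this in an AGENTS.md. The section name and the &lt;code&gt;make&lt;/code&gt; target are hypothetical, not a standard:&lt;/p&gt;

```markdown
## Testing conventions

- Never mock what you can use for real.
- Tests run against the real Postgres test database (start it with
  `make test-db`), never against a mocked client.
- HTTP code is tested with a real client against a local stub service.
- If something cannot be exercised for real, file an issue against the
  test infrastructure; do not add a mock as a workaround.
```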

&lt;p&gt;&lt;strong&gt;Strictest possible linting.&lt;/strong&gt; Here is something I have found consistently: AI models conform to whatever standard you enforce. If your linter allows &lt;code&gt;any&lt;/code&gt; types in TypeScript, the agent will use &lt;code&gt;any&lt;/code&gt; types. If your linter forbids them, the agent will find the properly typed solution. Humans sometimes chafe under strict linting rules. AI never does. So enforce the strictest rules you can. You are not punishing your human developers. You are creating guardrails that improve every line of generated code.&lt;/p&gt;
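&lt;p&gt;For TypeScript, "strictest possible" might look like the following &lt;code&gt;tsconfig.json&lt;/code&gt; fragment. This is a sketch, and the right flag set varies by project, but each option shown is a real compiler flag:&lt;/p&gt;

```json
{
  "compilerOptions": {
    "strict": true,
    "noImplicitAny": true,
    "noUncheckedIndexedAccess": true,
    "noFallthroughCasesInSwitch": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true
  }
}
```

&lt;p&gt;Pair it with a linter rule that bans &lt;code&gt;any&lt;/code&gt; outright, such as &lt;code&gt;@typescript-eslint/no-explicit-any&lt;/code&gt;, and the agent has no escape hatch.&lt;/p&gt;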

&lt;p&gt;&lt;strong&gt;One agent, one task, one context.&lt;/strong&gt; A five-step agent chain where each step is 95% accurate produces roughly 77% end-to-end reliability. That is not a theoretical concern. I have watched teams build elaborate multi-agent pipelines that fail in production because errors compound at every handoff. The simpler approach works better: give one agent a well-scoped task with complete context, let it execute in one shot, validate the output through automated checks, then have a human review the result.&lt;/p&gt;
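&lt;p&gt;The arithmetic behind that 77% figure is worth seeing. A minimal Python sketch, where the 95% per-step accuracy is the illustrative number from above and step independence is my simplifying assumption:&lt;/p&gt;

```python
# Compounding reliability in a sequential agent chain: if each step
# independently succeeds with probability p, the chain succeeds with p ** n.

def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success probability of a sequential agent chain."""
    return per_step ** steps

# Five steps at 95% each: the chain is right only about 77% of the time.
print(round(chain_reliability(0.95, 5), 4))  # 0.7738

# One well-scoped step keeps the full 95%.
print(chain_reliability(0.95, 1))  # 0.95
```

&lt;p&gt;Even a modest chain quietly eats a fifth of your end-to-end reliability, which is the case for one agent, one task, one context.&lt;/p&gt;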

&lt;p&gt;&lt;strong&gt;Holdout test cases.&lt;/strong&gt; When an agent writes both the code and the tests, there is a risk of teaching to the test. The agent produces code that passes its own tests but fails on edge cases it never considered. The fix: maintain a set of test specifications that the agent never sees during development. Run them after the agent claims to be done. This is the software equivalent of a double-blind trial.&lt;/p&gt;
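&lt;p&gt;A minimal sketch of the holdout idea, assuming a hypothetical agent-written &lt;code&gt;slugify&lt;/code&gt; function; the case list and the runner are illustrative, not a framework:&lt;/p&gt;

```python
# Holdout verification sketch: the agent never sees these cases while
# developing; they run only after the agent claims the task is done.

def slugify(title: str) -> str:
    """Stand-in for code the agent wrote (hypothetical)."""
    return "-".join(title.lower().split())

# Specification the agent was never shown during development.
HOLDOUT_CASES = [
    ("Hello World", "hello-world"),
    ("  leading space", "leading-space"),
    ("Already-hyphenated words", "already-hyphenated-words"),
]

def run_holdout(fn) -> list:
    """Return the holdout cases the implementation gets wrong."""
    return [(arg, want, fn(arg)) for arg, want in HOLDOUT_CASES if fn(arg) != want]

failures = run_holdout(slugify)
print(f"{len(failures)} holdout failure(s)")
```

&lt;p&gt;The agent iterates against its own tests; the holdout suite stays out of its context window until the end, exactly like the double-blind trial in the analogy.&lt;/p&gt;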

&lt;p&gt;None of these techniques are revolutionary. Strict linting, real integration tests, well-scoped tasks, independent verification. This is the engineering discipline that good teams have always practised. The difference is that with AI agents, the penalty for skipping these practices is immediate and severe rather than slow and diffuse.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Stripe figured out
&lt;/h2&gt;

&lt;p&gt;Stripe merges over 1,300 agent-written pull requests every week. Zero human-written code in those PRs. That number is worth sitting with, because it tells you something important: this is not a prototype. This is production engineering at one of the most demanding technical organisations in the world.&lt;/p&gt;

&lt;p&gt;Their system, which they call Minions, is architecturally simple. An agent receives a blueprint (a structured specification describing what to build), runs in an isolated sandbox with a full codebase checkout and running test infrastructure, produces a pull request, and a human reviews it.&lt;/p&gt;

&lt;p&gt;That is it. There is no magical model. There is no secret sauce. The architecture is: good specs, isolated execution, automated validation, human review. Everything I have described in this post, taken to scale.&lt;/p&gt;

&lt;p&gt;The part that interests me most is their tooling layer. Stripe built a centralised server hosting over 400 internal tools that any agent can access through a single interface. Authentication, permissions, audit logging, all flowing through one chokepoint. When someone builds a new tool, every agent in the organisation benefits immediately.&lt;/p&gt;

&lt;p&gt;You do not need to be Stripe to apply these principles. The blueprint is just a detailed spec. The sandbox is just a git worktree with a test database. The tooling layer is just a collection of scripts behind a consistent interface. The principles scale down to a team of three. What does not scale down is the discipline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The culture problem disguised as a tools problem
&lt;/h2&gt;

&lt;p&gt;I want to say something that might be unpopular in a space that loves to debate models and frameworks and tooling:&lt;/p&gt;

&lt;p&gt;The technology is not the bottleneck. The models are good enough. The tools exist. The patterns are documented. Stripe proved it works. Anthropic proved it works (90% of Claude Code is written by Claude Code). The evidence is overwhelming.&lt;/p&gt;

&lt;p&gt;The bottleneck is engineering culture.&lt;/p&gt;

&lt;p&gt;It is the team lead who insists every developer write code by hand because "that is how you learn." It is the architect who refuses to write specs because "the code should speak for itself." It is the senior developer who builds brilliant personal workflows with custom prompts and local scripts and never shares any of it. It is the organisation that buys Copilot licences, does nothing else, and wonders why productivity did not improve.&lt;/p&gt;

&lt;p&gt;The solution to the METR problem, the reason experienced developers get slower, is not better AI. It is better engineering practice. And the specific practices that fix it are exactly the things engineers have resisted for decades:&lt;/p&gt;

&lt;p&gt;Write documentation. Real documentation, not the kind that gets written once and never updated, but living documents that encode how the team actually works.&lt;/p&gt;

&lt;p&gt;Enforce strict standards. Not as punishment but as infrastructure. Every linting rule, every type constraint, every test requirement is a guardrail that improves the output of every AI tool in your pipeline.&lt;/p&gt;

&lt;p&gt;Write detailed specs before writing code. Not because you enjoy bureaucracy but because the quality of the specification determines the quality of the output, whether the executor is human or artificial.&lt;/p&gt;

&lt;p&gt;Clean up the codebase. Remove dead code, resolve competing patterns, make conventions explicit. This was always good practice. Now it is load-bearing.&lt;/p&gt;

&lt;p&gt;Share everything. Stop building local workflows. Land your tools, your prompts, your configurations in the shared repository. Let them compound. A custom script on your laptop helps you. The same script in the team repo helps every developer and every agent, forever.&lt;/p&gt;

&lt;p&gt;The compound effect is real and it is dramatic. Every AGENTS.md update, every new tool, every validation rule stays in the codebase. The next agent benefits from everything the last one contributed. Good practices accumulate. The flywheel turns faster. But only if the practices are shared, encoded, and maintained.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start on Monday morning
&lt;/h2&gt;

&lt;p&gt;If any of this resonates, here is what I would do in your position, in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First hour.&lt;/strong&gt; Create an AGENTS.md file in your main repository. Start with what the project does, how to run and test it, and the three most common mistakes an unfamiliar developer would make. Commit it. This single file will immediately improve every AI-assisted coding session against that repo.&lt;/p&gt;
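&lt;p&gt;A starter skeleton, with every specific (project, commands, mistakes) invented for illustration; replace all of it with your repo's reality:&lt;/p&gt;

```markdown
# AGENTS.md

## What this project does
Invoicing API: a Python/FastAPI service. See README.md for domain context.

## How to run and test
- `make dev` starts the service locally on port 8000
- `make test` runs the full suite against the test database

## Three common mistakes
1. Do not mock the database; use the test database.
2. Money amounts are integer cents, never floats.
3. All timestamps are UTC; convert only at the edges.
```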

&lt;p&gt;&lt;strong&gt;First week.&lt;/strong&gt; Tighten your linting and type-checking to the strictest settings your codebase can handle. Fix the violations. This is not about style preferences. It is about creating a codebase that produces better AI output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First month.&lt;/strong&gt; Establish the "never fix bad output" rule. When an agent produces something wrong, resist the urge to patch. Diagnose, fix the root cause (update the spec, update the AGENTS.md, narrow the scope), and rerun. This will feel slow at first. It is the only path to the upward quality spiral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First quarter.&lt;/strong&gt; Start sharing. Turn personal prompts into team templates. Turn local scripts into repository tools. Start writing specs before sending tasks to agents. Measure the trend, not the day.&lt;/p&gt;

&lt;p&gt;You will hit a dip. The METR study measured that dip. It is real, it lasts weeks to months, and it happens because you are rewiring workflows, not because the tools do not work. The teams that push through the dip come out the other side measurably faster. The teams that give up during the dip conclude that AI coding tools are overhyped and go back to what they were doing before.&lt;/p&gt;

&lt;p&gt;The gap between those two groups is widening every month.&lt;/p&gt;




&lt;h2&gt;
  
  
  The boring conclusion
&lt;/h2&gt;

&lt;p&gt;The most effective thing you can do to get value from AI coding tools is the same thing engineering managers have been begging for since the profession began: write things down, enforce standards, clean up after yourself, and share your work.&lt;/p&gt;

&lt;p&gt;That is it. That is the entire insight.&lt;/p&gt;

&lt;p&gt;AI did not change what good engineering practice looks like. It changed the cost of ignoring it. The penalty for a messy codebase, missing documentation, lax standards, and tribal knowledge used to be slow and survivable. Now it is fast and compounding.&lt;/p&gt;

&lt;p&gt;The developers who figure this out first will have an extraordinary advantage. Not because they have access to better models or fancier tools, but because they treated the arrival of AI as a reason to finally do all the boring things right.&lt;/p&gt;

&lt;p&gt;And that, I think, is the real lesson of the METR study. AI did not make those experienced developers slower. Their undocumented, implicit, head-only engineering practices made them slower. The AI just made it visible.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Loom Does Not Care Who Owns It</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Sun, 22 Feb 2026 20:20:36 +0000</pubDate>
      <link>https://dev.to/choutos/the-loom-does-not-care-who-owns-it-2965</link>
      <guid>https://dev.to/choutos/the-loom-does-not-care-who-owns-it-2965</guid>
      <description>&lt;p&gt;There was a man called Xosé who worked in a textile factory outside A Corunha for thirty-one years. He started at seventeen, sweeping floors, and by the time the factory closed he was operating a loom that could produce in one hour what his grandmother would have taken a week to weave by hand.&lt;/p&gt;

&lt;p&gt;He was not bitter about the loom. This is important to understand. He was not one of those men who shook his fist at machines. The loom was a good machine. It did its work honestly. What Xosé was bitter about—and he would tell you this over &lt;em&gt;umha cunca&lt;/em&gt; (a cup of wine), slowly, the way you explain something to a child who is clever but hasn't yet been hurt—was that when the factory closed, nobody seemed to have a plan for what thirty-one years of floor-sweeping and loom-operating were supposed to become.&lt;/p&gt;

&lt;p&gt;The machines got better. Then they got cheaper. Then they moved to a place where the people who operated them cost less than Xosé. And Xosé, who had paid his taxes and raised three children and never missed a day of work including the day his mother died, was told that the market would sort it out.&lt;/p&gt;

&lt;p&gt;The market did not sort it out. The market does not sort things out. The market is not a person with intentions. The market is a loom. It weaves whatever thread you feed it, and it does not care if what comes out is cloth or chaos.&lt;/p&gt;

&lt;p&gt;There is a book—and I know, I know, a man in a tavern should not be citing books, but this one earned its place here, the way a good knife earns its place in a kitchen—there is a book by a Hungarian called Karl Polanyi, written in 1944, called &lt;em&gt;The Great Transformation&lt;/em&gt;. Polanyi had watched the world tear itself apart twice in thirty years, and he wanted to understand why.&lt;/p&gt;

&lt;p&gt;His answer was deceptively simple. He said: when you let the market run the society instead of letting the society run the market, people break things. Not because they are stupid. Not because they are ungrateful. But because a human being is not a commodity, and when you treat one as if he were, he will eventually remind you of this, sometimes with a ballot, sometimes with a brick.&lt;/p&gt;

&lt;p&gt;The nineteenth century had tried the experiment. Let the market regulate itself—labour, land, money, all of it priced and traded like bolts of cloth. And it worked, for a while, the way a fever works: the body burns very hot and appears very active and then one morning it doesn't get up.&lt;/p&gt;

&lt;p&gt;What got up instead, in the 1930s, was fascism. And communism. And war. Not because people wanted those things, exactly, but because when the old protections are stripped away and nothing replaces them, people will accept any hand that offers shelter, even if that hand is holding a gun.&lt;/p&gt;

&lt;p&gt;Polanyi called this the "double movement." The market pushes outward, expanding, commodifying, disembedding. And society pushes back, demanding protection, demanding that someone—anyone—make the arithmetic work again. Six people, five chairs. The question is never whether the pushback comes. The question is what shape it takes.&lt;/p&gt;

&lt;p&gt;Now.&lt;/p&gt;

&lt;p&gt;Leskov tells a story—though I may be combining two of his stories, forgive me, the &lt;em&gt;ribeiro&lt;/em&gt; is doing its work—about a provincial administrator who was sent to modernise a district. This administrator had read all the right books. He understood efficiency. He understood that the old ways were slow and the new ways were fast and that fast was better than slow the way light is better than dark.&lt;/p&gt;

&lt;p&gt;He arrived in the district and immediately set about improving everything. He rationalised the postal routes. He consolidated the mills. He replaced the three village clerks, each of whom could barely write, with one educated clerk from Moscow who could write beautifully.&lt;/p&gt;

&lt;p&gt;Within two years, the district was more efficient than it had ever been. The mail arrived on time. The mills produced more flour. The documents were impeccable.&lt;/p&gt;

&lt;p&gt;Within three years, the district was in revolt. Not because the people were against efficiency. But because the postal routes had been the way old Fyodor earned his living, and the second mill had been where the Kovalenko family worked for four generations, and the village clerks—who could barely write, yes—were also the men who witnessed marriages and settled disputes and remembered who owed what to whom.&lt;/p&gt;

&lt;p&gt;The administrator had improved the machinery and destroyed the fabric. He had confused the loom with the cloth.&lt;/p&gt;

&lt;p&gt;This is, if you strip away the policy language and the footnotes and the three-letter acronyms, what is happening now with artificial intelligence. Not &lt;em&gt;might&lt;/em&gt; happen. Is happening.&lt;/p&gt;

&lt;p&gt;The Americans are building the looms. The fastest, most powerful, most extraordinary looms the world has ever seen. They are spending fifty billion here, three hundred billion there—numbers so large they stop meaning anything, like distances between stars. And the looms are magnificent. I will not pretend otherwise. I work adjacent to these machines. I have seen what they can do. They can draft a legal brief in the time it takes to drink a coffee. They can analyse a thousand medical images before a radiologist has finished her breakfast.&lt;/p&gt;

&lt;p&gt;But the Americans have not thought very much about what happens to the junior lawyer who used to draft that brief. Or the radiologist's assistant who used to do the first pass. Or—and this is where it gets closer to Xosé—what happens when you remove the bottom rungs of a ladder and then tell people the ladder still works.&lt;/p&gt;

&lt;p&gt;You know what happens. The people who are already at the top stay at the top. And the people who were climbing fall. And after a while, they stop believing in ladders altogether, and they vote for whoever promises to burn the ladder down.&lt;/p&gt;

&lt;p&gt;The Europeans—and I say this as someone who lives among them, who has paid Italian taxes and navigated Italian bureaucracy, which is itself a kind of extreme sport—the Europeans have taken the opposite approach. They have written &lt;em&gt;rules&lt;/em&gt;. Beautiful, comprehensive, thoroughly considered rules. The EU AI Act. The Social Fund. Three point three trillion euros in social protection.&lt;/p&gt;

&lt;p&gt;And this is admirable, in the way that a very well-built seawall is admirable. It will hold against a normal tide. But the question is whether the tide coming is normal.&lt;/p&gt;

&lt;p&gt;Europe's problem is not that it protects too much. Europe's problem is that it protects without producing. It regulates the loom but does not own one. It writes the rules of a game it is not playing. And Polanyi, who was no fool, would tell you that protection without productive capacity is just a slower way of becoming irrelevant. You cannot redistribute wealth you do not generate. You cannot embed a market that has moved to California.&lt;/p&gt;

&lt;p&gt;And then there is China, which has taken a third path that Leskov's administrator would recognise immediately: control the pace. The Chinese have the looms. They are building more every day. But they are also deciding—deliberately, administratively, with the particular confidence of a state that does not need to win elections—&lt;em&gt;where&lt;/em&gt; and &lt;em&gt;when&lt;/em&gt; those looms are switched on.&lt;/p&gt;

&lt;p&gt;Autonomous driving? The technology is ready. But there are millions of taxi drivers, and millions of taxi drivers with no income is a problem that no surveillance camera can solve. So the technology waits. Not because it doesn't work. Because society isn't ready to absorb it.&lt;/p&gt;

&lt;p&gt;This is intelligent. I will give it that. It is the most Polanyian response of the three, in a way—the state actively managing the pace of disruption, weighing efficiency against stability. But it is also brittle, because it depends on control rather than consent. And systems that depend on control work until the morning they don't, in the way that a dam works until the morning it doesn't, and then there is no middle ground between dry and drowned.&lt;/p&gt;

&lt;p&gt;Xosé, if he were still alive and you asked him about all of this—the AI race, the geopolitical competition, the Polanyian double movement—he would pour you another glass and say something like: &lt;em&gt;Listen. I don't care who builds the loom. I care whether there's still a place for me in the morning.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And that, stripped of everything, is the argument. The race is not about the technology. The race has never been about the technology. The race is about whether your society can absorb the shock of the technology without coming apart at the joints.&lt;/p&gt;

&lt;p&gt;The Americans won the Cold War not because they had better missiles—though they did—but because they had built a society that could sustain the effort. Social security. The GI Bill. Medicare. Strong unions. Public universities. The interstate highways. These were not luxuries. They were the &lt;em&gt;load-bearing walls&lt;/em&gt;. They were the reason the house could withstand the storm.&lt;/p&gt;

&lt;p&gt;And now those walls are thinner than they've been in a century, and the storm coming is not a storm they've seen before. Because this one doesn't just displace hands—Xosé's hands, the hands of factory workers and postal carriers and mill operators. This one displaces &lt;em&gt;minds&lt;/em&gt;. The lawyer's mind. The accountant's mind. The analyst's mind. The junior consultant's mind. It goes after the very class of people who believed, with absolute certainty, that automation was something that happened to other people.&lt;/p&gt;

&lt;p&gt;There is a temptation, when you see all of this clearly, to become a pessimist. To pour the last glass and say: &lt;em&gt;well, we are finished, the machines have won, nothing to be done.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But Polanyi was not a pessimist. He was a &lt;em&gt;realist&lt;/em&gt; who believed in institutions. He had seen society destroy itself and rebuild itself, and his whole point was that the rebuilding is possible—but only if you build the right things. Not just faster looms. Not just thicker seawalls. But the actual, tedious, unglamorous work of making sure that when the economy changes shape, people still have a floor beneath them and a reason to get up.&lt;/p&gt;

&lt;p&gt;Income that doesn't vanish when your job does. Work that has dignity even when it isn't profitable. The understanding that a taxi driver's livelihood is not a minor detail in the story of autonomous vehicles—it &lt;em&gt;is&lt;/em&gt; the story, the only story that matters, because a society is not an economy. A society is the people in it, and what they do, and whether they believe that what they do matters.&lt;/p&gt;

&lt;p&gt;The question is not who builds the most powerful AI. The question is who builds the society that can live with it.&lt;/p&gt;

&lt;p&gt;Xosé could have told you that. He could have told you over &lt;em&gt;ribeiro&lt;/em&gt;, in a factory town where the factory is gone but the people remain, because people always remain. That is the part the economists keep forgetting. The loom moves on. The people stay.&lt;/p&gt;

&lt;p&gt;And they remember.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>society</category>
      <category>geopolitics</category>
      <category>opinion</category>
    </item>
    <item>
      <title>From Microservices to Agent Mesh: Why Your Next Infrastructure Won't Be Coded</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:25:36 +0000</pubDate>
      <link>https://dev.to/choutos/from-microservices-to-agent-mesh-why-your-next-infrastructure-wont-be-coded-7d</link>
      <guid>https://dev.to/choutos/from-microservices-to-agent-mesh-why-your-next-infrastructure-wont-be-coded-7d</guid>
      <description>&lt;p&gt;Here is a sentence that would have been absurd three years ago: markdown is becoming a programming language.&lt;/p&gt;

&lt;p&gt;Not metaphorically. Not in the way people say "YAML is the new XML" with a weary sigh. Literally. Teams are defining autonomous software agents—their behaviour, their personality, their decision logic, their safety boundaries—in plain prose, stored in &lt;code&gt;.md&lt;/code&gt; files, interpreted at runtime by an LLM. The "compiler" is a language model. The "source code" is a paragraph that says what the agent should do. And the resulting system runs on hardware that costs less than lunch.&lt;/p&gt;

&lt;p&gt;If you've spent the last decade building microservices architectures, this should make you deeply uncomfortable. It should also make you curious.&lt;/p&gt;




&lt;h2&gt;
  
  
  The collapse of the programming layer
&lt;/h2&gt;

&lt;p&gt;The traditional path from intention to execution has always had a translation step in the middle. A human knows what they want. They express it in a programming language. A compiler or runtime turns that into behaviour. The entire craft of software engineering lives in that middle layer—the translation from intent to code.&lt;/p&gt;

&lt;p&gt;What's happening now is that the middle layer is thinning to nothing.&lt;/p&gt;

&lt;p&gt;An agent defined by markdown looks like this: a directory containing a handful of prose files. &lt;code&gt;SOUL.md&lt;/code&gt; describes identity and core directives. &lt;code&gt;TOOLS.md&lt;/code&gt; lists available capabilities. &lt;code&gt;WORKFLOWS.md&lt;/code&gt; defines multi-step procedures. &lt;code&gt;GUARDRAILS.md&lt;/code&gt; sets boundaries. The runtime—a lean orchestrator under ten megabytes—reads these files, calls an LLM to interpret them, and acts.&lt;/p&gt;
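&lt;p&gt;Such a directory might look like this. The agent itself and the annotations are invented for illustration; only the four file names come from the pattern described above:&lt;/p&gt;

```text
irrigation-agent/
├── SOUL.md        # identity: "You are the irrigation agent for block 7..."
├── TOOLS.md       # capabilities: read_sensor, open_valve, send_alert
├── WORKFLOWS.md   # procedures: the fifteen-minute moisture-check loop
└── GUARDRAILS.md  # boundaries: never run a valve for more than 30 minutes
```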

&lt;p&gt;This isn't a toy. GitHub Copilot agents are configured via markdown instruction files. Anthropic's system prompts are, functionally, prose programs. CrewAI defines agents in YAML one step removed from natural language. The pattern is converging from multiple directions, which is usually how you know something is real.&lt;/p&gt;

&lt;p&gt;The skill shift is subtle but profound. "Development" becomes description. Debugging means reading a reasoning trace, not setting breakpoints. Code review becomes prose review: &lt;em&gt;does this paragraph capture the intended behaviour?&lt;/em&gt; Refactoring is rewriting for clarity. And rollback—that perennial source of deployment anxiety—is &lt;code&gt;git revert&lt;/code&gt;. The agent runtime picks up the old files and behaves accordingly. No rebuild. No blue-green deployment. Just swap the text.&lt;/p&gt;

&lt;p&gt;The entire CI/CD pipeline collapses to: edit, commit, push.&lt;/p&gt;




&lt;h2&gt;
  
  
  The brain doesn't live where the hands are
&lt;/h2&gt;

&lt;p&gt;Here's the architectural insight that makes the whole thing work: the agent runtime and the inference backend are separate concerns. They almost never live on the same device—and they shouldn't.&lt;/p&gt;

&lt;p&gt;Think of it as hands and brain. A ten-dollar microcontroller in a field is the hands: it reads sensors, triggers actuators, manages state. But the brain—the LLM that interprets the markdown and makes decisions—lives elsewhere. A cloud API. An on-premise GPU box running open-source models. A tiered hybrid that routes simple decisions locally and complex reasoning to heavier infrastructure.&lt;/p&gt;

&lt;p&gt;The most practical pattern is tiered inference, and here's what's elegant: &lt;strong&gt;the markdown itself specifies the routing policy.&lt;/strong&gt; A few lines of prose can say: routine sensor readings go to the local model; anomaly classification routes to the on-premise server; novel situations escalate to the cloud; patient health data never leaves the building. Configuration as natural language, living alongside the behaviour definition. No separate config management system. No environment variables. Just prose.&lt;/p&gt;
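&lt;p&gt;Paraphrasing the routing example above as an actual policy file; the section name and the hostname are invented:&lt;/p&gt;

```markdown
## Inference routing

- Routine sensor readings: the local 3B model on the gateway box.
- Anomaly classification: the on-premise GPU server at `infer.internal`.
- Novel or ambiguous situations: escalate to the cloud API.
- Patient health data: never leaves the building; local models only.
```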

&lt;p&gt;The economics of this split are startling. A single consumer GPU—an NVIDIA RTX 4090, roughly £1,300—running vLLM or Ollama can serve an eight-billion-parameter model at a hundred tokens per second. That's enough to support fifty to a hundred edge agents making periodic inference calls. Data never leaves the premises. Latency drops from two hundred milliseconds (cloud round-trip) to fifty (local network). And the cost is fixed: no per-token billing that scales with usage, no monthly invoice that grows as your mesh expands.&lt;/p&gt;
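&lt;p&gt;A back-of-envelope check of that capacity claim. The hundred tokens per second comes from the paragraph above; the call size and frequency are my assumptions, so treat the result as an order-of-magnitude estimate:&lt;/p&gt;

```python
# Rough capacity of one local inference box serving periodic edge agents.
GPU_TOKENS_PER_SEC = 100     # sustained throughput (from the text)
TOKENS_PER_CALL = 400        # prompt + completion per decision (assumed)
CALL_INTERVAL_SEC = 15 * 60  # each agent checks in every 15 minutes (assumed)

def max_agents(tokens_per_sec: float, tokens_per_call: float, interval_sec: float) -> int:
    """Average number of periodically calling agents one box can sustain."""
    demand_per_agent = tokens_per_call / interval_sec  # tokens/s per agent
    return int(tokens_per_sec / demand_per_agent)

print(max_agents(GPU_TOKENS_PER_SEC, TOKENS_PER_CALL, CALL_INTERVAL_SEC))  # 225
```

&lt;p&gt;The average-throughput answer comes out above two hundred agents, comfortably over the fifty-to-a-hundred estimate, which is as it should be: real traffic is bursty, and latency matters more than averages.&lt;/p&gt;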

&lt;p&gt;The strategic bet underlying all of this: inference cost approaches zero. GPT-4-class performance cost sixty dollars per million tokens in 2023. By early 2026, it's under fifty pence. The value isn't in running models—that commoditises. The value is in the markdown definitions themselves, and in the orchestration layer that makes them collaborate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kubernetes enters the picture (and it fits perfectly)
&lt;/h2&gt;

&lt;p&gt;If you're a platform engineer, you might be thinking: this is charming for edge devices, but what about enterprise? What about the compliance requirements, the audit trails, the operational maturity we've spent a decade building?&lt;/p&gt;

&lt;p&gt;The answer is that the same ultra-lightweight runtime that runs on a ten-dollar ESP32 is also an exceptionally good container base image.&lt;/p&gt;

&lt;p&gt;Picture a Kubernetes pod. Inside it, two containers. The first holds the agent runtime plus its markdown files—together, well under ten megabytes. Compare that to the two hundred megabytes to a gigabyte of a typical microservice image. The second container is an A2A sidecar, analogous to an Envoy proxy in a service mesh, but handling agent-to-agent communication instead of HTTP and gRPC. It manages discovery, routes inter-agent messages, advertises capabilities.&lt;/p&gt;

&lt;p&gt;This maps onto Kubernetes primitives with an almost suspicious neatness. ConfigMaps hold the markdown files and inference API keys. Horizontal Pod Autoscalers scale agent replicas based on message queue depth or token budget consumption. Rolling deployments mean updating a ConfigMap triggers a new pod rollout—zero-downtime behaviour change. Namespaces become mesh boundaries: per-tenant, per-department. NetworkPolicies control which agents can talk to each other. ServiceAccounts grant tool access permissions. And Custom Resource Definitions can model &lt;code&gt;AgentDefinition&lt;/code&gt;, &lt;code&gt;AgentMesh&lt;/code&gt;, and &lt;code&gt;InferenceBackend&lt;/code&gt; as first-class objects in the cluster.&lt;/p&gt;
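&lt;p&gt;A sketch of what that pod looks like on the wire, with every name, label, and image invented for illustration:&lt;/p&gt;

```yaml
# Agent behaviour shipped as prose in a ConfigMap; all names are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: reconciliation-agent-md
  namespace: trading-mesh
data:
  SOUL.md: |
    You are the trade reconciliation agent. Match incoming fills against
    the order book and flag any mismatch to the compliance agent.
  GUARDRAILS.md: |
    Never modify records. Read, compare, and report only.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reconciliation-agent
  namespace: trading-mesh
spec:
  replicas: 1
  selector:
    matchLabels: { app: reconciliation-agent }
  template:
    metadata:
      labels: { app: reconciliation-agent }
    spec:
      containers:
        - name: runtime                               # the sub-10 MB agent runtime
          image: registry.example/agent-runtime:0.1   # hypothetical image
          volumeMounts:
            - { name: behaviour, mountPath: /agent }
        - name: a2a-sidecar                           # agent-to-agent proxy
          image: registry.example/a2a-proxy:0.1       # hypothetical image
      volumes:
        - name: behaviour
          configMap: { name: reconciliation-agent-md }
```

&lt;p&gt;Updating the ConfigMap and rolling the deployment is the entire release process: the "code" that changed is a paragraph.&lt;/p&gt;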

&lt;p&gt;The result: enterprise teams get agent meshes with all the operational maturity they expect—autoscaling, RBAC, observability, audit logs—while the "application" is still just prose in a ConfigMap.&lt;/p&gt;

&lt;p&gt;This is the bridge. Edge meshes on cheap devices serve the long tail. Kubernetes-hosted agent meshes serve the enterprise. Same runtime. Same markdown format. Same A2A protocol. Different substrate, same paradigm.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a mesh of agents actually feels like
&lt;/h2&gt;

&lt;p&gt;Abstract architecture only becomes real through use. So let me paint three scenarios—not as feature lists, but as lived experiences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The vineyard.&lt;/strong&gt; Fifty sensor agents on eight-dollar devices scattered across the blocks, each with a markdown file that says something like: &lt;em&gt;"You monitor soil moisture in block 7. Check every fifteen minutes. If below thirty per cent, tell the irrigation agent to activate drip zone 3 for twenty minutes. If temperature exceeds thirty-five degrees, increase frequency. Log everything. Alert the farmer if the pump doesn't respond."&lt;/em&gt; One GPU box in the equipment shed handles inference for the entire mesh. Total hardware cost: under two thousand pounds. A vineyard in Stellenbosch or the Barossa Valley gets the same precision agriculture that previously required a hundred-thousand-dollar system from an industrial vendor. Edit one markdown file to adjust for clay soil versus sand. No developer needed. The viticulturist &lt;em&gt;is&lt;/em&gt; the developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The supply chain.&lt;/strong&gt; Each supplier, warehouse, and transport vehicle in a network runs an agent. The warehouse agent's markdown: &lt;em&gt;"I'm the receiving agent for Warehouse 7. When goods arrive, scan the QR code, verify against expected shipments, flag discrepancies, update inventory, notify the distribution agent."&lt;/em&gt; Cross-company communication flows through the A2A protocol. A fifty-person manufacturer in Ho Chi Minh City participates in the same agent mesh as their buyer in Hamburg. The per-node cost is a thirty-dollar Android phone running an agent. Supply chain visibility—previously an SAP implementation costing millions—becomes accessible to SMEs. The agent definitions live in git. The factory manager writes them. The IT department barely knows they exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The enterprise mesh.&lt;/strong&gt; A financial services firm runs compliance checking, trade reconciliation, client onboarding, and regulatory reporting—each as a minimal Kubernetes pod, each defined by markdown, each communicating via A2A sidecars, each scaling independently. Adding a new agent is a pull request containing prose files, reviewed by the compliance officer who wrote them. Not a three-month development project. Not a sprint planning session. A pull request, reviewed, merged, deployed in an afternoon.&lt;/p&gt;

&lt;p&gt;In each case, the pattern is the same: the domain expert describes the behaviour. The runtime interprets it. The mesh coordinates it. The infrastructure—whether a cluster of microcontrollers or a Kubernetes namespace—is substrate, not structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The hard problems (honestly)
&lt;/h2&gt;

&lt;p&gt;It would be irresponsible to sketch this vision without naming what's genuinely difficult. Three problems stand in the way, and intellectual honesty demands we face them squarely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioural testing.&lt;/strong&gt; This is the big one. In traditional software, a typo in code fails loudly—the compiler catches it, the test suite flags it. An ambiguity in prose fails silently. The agent does something &lt;em&gt;almost&lt;/em&gt; right, and you don't discover the gap until production. We need scenario-based test frameworks for natural language specifications: describe a situation, assert the agent's response. Behavioural regression suites. Adversarial probing that tries to make agents violate their guardrails. None of this exists in mature form today. Whoever solves it first captures enormous credibility.&lt;/p&gt;
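&lt;p&gt;A minimal sketch of what such a scenario-based framework could look like, assuming the agent is exposed as a plain callable (the stub below stands in for an LLM-backed agent; none of this is an existing library):&lt;/p&gt;

```python
# Hypothetical scenario-based behavioural test: describe a situation,
# assert the agent's response. The stub agent is a stand-in; in practice
# the same interface would wrap an LLM-backed, prose-defined agent.

def stub_irrigation_agent(situation: dict) -> str:
    """Stand-in agent: returns the action it would take."""
    if situation["moisture_pct"] >= 30:
        return "no action"
    return "activate drip zone 3"

SCENARIOS = [
    # (description, situation, expected action)
    ("dry soil triggers irrigation", {"moisture_pct": 22}, "activate drip zone 3"),
    ("moist soil does nothing",      {"moisture_pct": 45}, "no action"),
]

def run_scenarios(agent, scenarios):
    """Run every scenario and collect failures instead of stopping at the first."""
    failures = []
    for name, situation, expected in scenarios:
        actual = agent(situation)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

failures = run_scenarios(stub_irrigation_agent, SCENARIOS)
print(f"{len(SCENARIOS) - len(failures)}/{len(SCENARIOS)} scenarios passed")
```

&lt;p&gt;A behavioural regression suite would be exactly this, run on every change to the markdown file.&lt;/p&gt;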

&lt;p&gt;&lt;strong&gt;Security.&lt;/strong&gt; A markdown file that defines an agent is functionally equivalent to executable code—it determines what the system does. A malicious markdown file is malicious code, but harder to audit because natural language hides intent more gracefully than Python. The field needs capability-based security (agents can only use explicitly granted tools), signed markdown bundles (like signed container images), runtime sandboxing, and comprehensive audit trails. The good news is that these are well-understood patterns from the container world; they need adaptation, not invention.&lt;/p&gt;
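&lt;p&gt;Capability-based security is simple to sketch: the executor checks an explicit grant list before running any tool. Everything below (tool names, the &lt;code&gt;CapabilityError&lt;/code&gt; type) is illustrative:&lt;/p&gt;

```python
# Sketch of capability-based security: an agent may only invoke tools
# it has been explicitly granted. Names and structure are illustrative.

class CapabilityError(Exception):
    pass

TOOLS = {
    "read_inventory": lambda: "42 pallets",
    "send_payment":   lambda: "payment sent",
}

def make_executor(granted: set):
    """Return a tool executor bound to an explicit grant list."""
    def execute(tool_name: str):
        if tool_name not in granted:
            raise CapabilityError(f"agent has no grant for {tool_name!r}")
        return TOOLS[tool_name]()
    return execute

# The warehouse agent was granted read access only.
warehouse_exec = make_executor({"read_inventory"})
print(warehouse_exec("read_inventory"))   # allowed
try:
    warehouse_exec("send_payment")        # denied: not in the grant list
except CapabilityError as err:
    print("blocked:", err)
```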

&lt;p&gt;&lt;strong&gt;Coordination at scale.&lt;/strong&gt; Individual agents can be remarkably capable. Getting a hundred of them to collaborate reliably is a different beast entirely. Research consistently shows that flat coordination fails catastrophically—a five per cent error rate on individual LLM calls compounds across a mesh until failures are virtually guaranteed somewhere. Hierarchical coordination patterns are essential: some agents must be designated orchestrators. The A2A protocol is the leading candidate for agent-to-agent communication, but it was designed for cloud-to-cloud scenarios and needs significant work for edge-to-edge deployments.&lt;/p&gt;
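&lt;p&gt;The compounding is easy to verify with a few lines of arithmetic, assuming independent five per cent failures per call:&lt;/p&gt;

```python
# Why flat coordination fails: a small per-call error rate compounds
# across a mesh until a failure somewhere is near-certain.

per_call_error = 0.05

for n_calls in (1, 10, 50, 100):
    p_any_failure = 1 - (1 - per_call_error) ** n_calls
    print(f"{n_calls:3d} calls: P(at least one failure) = {p_any_failure:.1%}")
# 100 calls gives roughly a 99.4% chance of at least one failure.
```

&lt;p&gt;Hierarchy helps precisely because orchestrators can catch and retry a worker's failure before it propagates through the whole chain.&lt;/p&gt;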

&lt;p&gt;These aren't reasons to wait. They're the engineering challenges that define the next two years. And they're tractable—harder than building the runtime, but not harder than the problems the container ecosystem solved between 2013 and 2018.&lt;/p&gt;




&lt;h2&gt;
  
  
  The timeline: what's real, what's next, what's horizon
&lt;/h2&gt;

&lt;p&gt;Today, in early 2026, the foundations exist. Agents defined by markdown work. On-premise inference via Ollama and vLLM is production-ready. Kubernetes can host agent pods. The pieces are real, shipping, and in use by early adopters.&lt;/p&gt;

&lt;p&gt;Over the next one to two years, expect standardisation. A common format for markdown agent definitions. Small mesh deployments of ten to fifty agents becoming routine. Behavioural testing tools moving from "doesn't exist" to "early but usable." The A2A protocol maturing from first deployments to common practice. Kubernetes-hosted agent meshes reaching production grade.&lt;/p&gt;

&lt;p&gt;On the five-year horizon: self-organising meshes of hundreds or thousands of agents. Agent marketplaces where you download a greenhouse climate agent and customise the temperature ranges for your region. Agents that improve their own markdown based on operational experience—powerful, and requiring strong guardrails. Natural language as the dominant interface for business automation.&lt;/p&gt;

&lt;p&gt;The cost trajectory reinforces all of this. Inference cost is dropping by roughly an order of magnitude per year. The hardware is already commodity. The runtime will be open source. Which means the durable value concentrates in four places: the methodology for decomposing operations into agent specifications; the battle-tested markdown templates for specific domains; the frameworks for validating that prose-defined agents behave correctly; and the architectural expertise for designing meshes that are resilient, secure, and effective.&lt;/p&gt;




&lt;h2&gt;
  
  
  The provocation, restated
&lt;/h2&gt;

&lt;p&gt;We are at the beginning of a transition where the dominant artefact of software creation shifts from code to prose. Where the "developer" for a supply chain agent is a logistics manager, not an engineer. Where infrastructure means a directory of markdown files and a cluster of devices cheap enough to lose without caring.&lt;/p&gt;

&lt;p&gt;This doesn't eliminate software engineering—someone still builds the runtimes, the protocols, the testing frameworks. But it changes what most people do when they want a computer to do something for them. They describe it. In natural language. In a markdown file. And the system figures out the rest.&lt;/p&gt;

&lt;p&gt;The microservices era taught us to decompose monoliths into small, independent services. The agent mesh era asks: what if those services weren't coded at all? What if they were &lt;em&gt;described&lt;/em&gt;? What if deployment was a &lt;code&gt;git push&lt;/code&gt; and scaling was plugging in another ten-dollar device?&lt;/p&gt;

&lt;p&gt;Your next infrastructure might not be coded. It might be written.&lt;/p&gt;

&lt;p&gt;And that changes everything about who gets to build it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Ten Things You Actually Need to Understand About AI</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Sat, 21 Feb 2026 11:32:09 +0000</pubDate>
      <link>https://dev.to/choutos/the-ten-things-you-actually-need-to-understand-about-ai-404o</link>
      <guid>https://dev.to/choutos/the-ten-things-you-actually-need-to-understand-about-ai-404o</guid>
      <description>&lt;h1&gt;
  
  
  The Ten Things You Actually Need to Understand About AI
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;A practical guide to the foundational concepts, mental models, and building blocks that make modern AI work—and work for you.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;There's a moment, familiar to anyone who's tried to follow the AI conversation this year, where the ground shifts beneath your feet. You were keeping up—chatbots, prompts, maybe you've even used Claude or ChatGPT for something useful—and then suddenly the conversation leaps ahead. Agents. Context windows. Skills. Inference. Vibe coding. The words pile up like luggage at a carousel that's moving too fast.&lt;/p&gt;

&lt;p&gt;This guide is for that moment. Not a glossary, not a hype piece—a practical map of the ten foundational ideas you need to hold in your head to genuinely understand what's happening with AI right now, and to start using it with intention rather than confusion.&lt;/p&gt;

&lt;p&gt;Think of these as primitives. In the same way that understanding notes, rhythm, and harmony lets you hear music rather than just noise, these concepts will let you see the architecture beneath the surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. An LLM Is a New Kind of Computer Programme
&lt;/h2&gt;

&lt;p&gt;Start here, because everything else builds on it.&lt;/p&gt;

&lt;p&gt;Traditional computer programmes are recipes: exact instructions, typed out by a human, telling a machine precise steps. They're brilliant at arithmetic and terrible at telling jokes—because you can't encode the steps that make something funny. The unexpected is, by definition, not in the recipe.&lt;/p&gt;

&lt;p&gt;A large language model (LLM) is a different beast. It does everything traditional programmes were bad at—writing stories, generating art, coding, reasoning through ambiguity—while retaining the old capabilities. It's not magic. It's a genuinely new way of using computers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Stop thinking of AI as a search engine with better answers. Think of it as a new kind of programme that can handle the messy, creative, ambiguous work that rigid code never could.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. A Model Is a File (and That Matters More Than You Think)
&lt;/h2&gt;

&lt;p&gt;Here's something that clarifies enormously once you grasp it: a model is a file.&lt;/p&gt;

&lt;p&gt;During training, vast quantities of internet text get compressed into a single file. The process discards the least important information and preserves the essential patterns, ideas, and relationships. What you're left with is a file that, given half a document, can plausibly complete the other half. That's the fundamental capability—document completion—dressed up to be useful.&lt;/p&gt;

&lt;p&gt;The numbers inside this file are called &lt;strong&gt;weights&lt;/strong&gt;. And here's where it gets political:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open models&lt;/strong&gt; (like DeepSeek or Qwen) let you download that file and run it yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Closed models&lt;/strong&gt; (like those from OpenAI or Anthropic) keep the file on their servers. You rent access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The closed models tend to be slightly smarter. The open ones give you sovereignty. This tension—intelligence versus control—is one of the defining dynamics of the current moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; When someone says "model," think "file." When they say "weights," think "what's inside the file." When they say "open source," they mean you can possess and run that file yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Inference Is Just Running the Model
&lt;/h2&gt;

&lt;p&gt;The word "inference" has intimidated people for months. It needn't.&lt;/p&gt;

&lt;p&gt;Inference simply means: running the model. Text goes in, text comes out. That's it. When you type a question into ChatGPT, inference is happening on a server somewhere. When you run a model locally using something like Ollama, inference is happening on your own machine.&lt;/p&gt;

&lt;p&gt;The catch: running frontier open models locally requires serious hardware—roughly a $20,000 computer at the moment. That's the current barrier to full self-sovereignty. You can run an agent locally on a Raspberry Pi, but if it's making API calls to Claude or OpenAI for the actual thinking, the inference is still happening in someone else's cloud.&lt;/p&gt;

&lt;p&gt;This distinction matters. Running your agent locally while the inference happens remotely is a meaningful step toward sovereignty—your memories, your data, your orchestration are yours—but it's not the whole journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; "Inference" = "running the model." Local inference = full control. Cloud inference = convenience with trade-offs. Know which one you're doing.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Context Is the Scarce Resource (and This Changes Everything)
&lt;/h2&gt;

&lt;p&gt;If there is one concept that separates people who understand AI from people who merely use it, it's &lt;strong&gt;context&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;LLMs are stateless. Every single interaction starts from scratch. The model remembers nothing between conversations. If you and I both use ChatGPT, we get exactly the same model. Any personalisation—your preferences, your history—comes from elsewhere, injected into the conversation, not baked into the model itself.&lt;/p&gt;

&lt;p&gt;Here's how the illusion of memory works: every round of conversation, the system sends the &lt;strong&gt;entire history&lt;/strong&gt; back to the model. Your eleventh message doesn't arrive alone—it brings the previous ten exchanges with it, plus a hidden preamble called the &lt;strong&gt;system prompt&lt;/strong&gt; (think of it as the Ten Commandments for that session—instructions from the developer about how the model should behave).&lt;/p&gt;

&lt;p&gt;All of this—the system prompt, the conversation history, your current message—is the &lt;strong&gt;context&lt;/strong&gt;. And context is finite. The longer the conversation grows, the more confused the model becomes. Eventually you hit the limit, and the system has to squeeze the older history into a summary (a process called &lt;strong&gt;compaction&lt;/strong&gt;), and every squeeze loses detail.&lt;/p&gt;
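&lt;p&gt;That illusion of memory fits in a few lines. The stand-in reply below takes the place of a real model API call; the point is that the full transcript is re-sent every turn:&lt;/p&gt;

```python
# The model is stateless: the "memory" is the client re-sending
# everything. Each turn, the context = system prompt + full history
# + the new message, and it only ever grows.

SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}

def build_context(history, user_message):
    """Assemble the full context sent to the model on this turn."""
    return [SYSTEM_PROMPT] + history + [{"role": "user", "content": user_message}]

history = []
for turn, msg in enumerate(["hello", "what did I just say?"], start=1):
    context = build_context(history, msg)
    print(f"turn {turn}: sending {len(context)} messages")
    reply = f"(model reply to {msg!r})"        # stand-in for a real API call
    history += [{"role": "user", "content": msg},
                {"role": "assistant", "content": reply}]
```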

&lt;p&gt;This is why context engineering—the art of managing what goes into that window—has been the single most important area of practical AI development over the past year. Not smarter models. Smarter management of the scarce resource those models depend on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Context is the conversation. It's finite and precious. Everything the model knows in a given session must fit inside it. Learning to manage context—what to include, what to defer, what to summarise—is the most important practical skill in working with AI today.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The System Prompt Is the Invisible Hand
&lt;/h2&gt;

&lt;p&gt;Buried at the top of every context window is the system prompt: instructions you don't see, written by the developer (or sometimes by you), telling the model how to behave.&lt;/p&gt;

&lt;p&gt;This is where things get ethically interesting. If you're using a cloud AI service that introduces advertising, there's nothing stopping the provider from inserting instructions into that hidden preamble—nudging the model to favour certain products, perspectives, or behaviours. You'd never see it. The model would simply... lean.&lt;/p&gt;

&lt;p&gt;The AI experience will always be shaped by &lt;em&gt;something&lt;/em&gt;. The question is whether that something is an advertiser, a corporation, a government, or you. Local AI—where you control the system prompt—is the only arrangement where the answer is definitively "you."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Whoever writes the system prompt shapes the model's behaviour. If you can't see or control it, someone else is steering your AI. That alone is reason enough to care about self-sovereign setups.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Tools Turn Text Into Action
&lt;/h2&gt;

&lt;p&gt;An LLM can only do one thing: produce text. It can't search the web, open a browser, send a message, or book a flight. It can only &lt;em&gt;write about&lt;/em&gt; doing those things.&lt;/p&gt;

&lt;p&gt;So how does it act in the world? Through &lt;strong&gt;tools&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A tool is an agreement. In the system prompt, the model is told: "If you want to search the web, output this special marker with your query inside it." The agent software watches for that marker, intercepts it, performs the actual web search, and feeds the results back to the model. The model never touched the internet—it just asked, in a very specific format, and the surrounding software did the work.&lt;/p&gt;

&lt;p&gt;This pattern extends to everything: controlling a browser, sending a Telegram message, reading a file, managing a calendar. Each capability is a tool—a bridge between the model's text output and real-world action.&lt;/p&gt;
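&lt;p&gt;The agreement can be sketched as a marker-and-intercept loop. The &lt;code&gt;[[tool:...]]&lt;/code&gt; marker syntax and the scripted &lt;code&gt;fake_model&lt;/code&gt; are invented for illustration; production systems use structured tool-call formats:&lt;/p&gt;

```python
import re

# Sketch of the tool contract: the model emits a marker, the surrounding
# software intercepts it, runs the real action, and feeds the result back.
# Marker format and fake_model are invented for illustration.

TOOL_MARKER = re.compile(r"\[\[tool:(\w+)\|(.*?)\]\]")

def web_search(query: str) -> str:
    return f"top result for {query!r}"          # pretend search

TOOLS = {"search": web_search}

def fake_model(prompt: str) -> str:
    """Scripted stand-in for the LLM: asks for a search, then answers."""
    if "RESULT:" not in prompt:
        return "[[tool:search|weather in Santiago]]"
    return "It looks rainy in Santiago."

def run_turn(prompt: str) -> str:
    output = fake_model(prompt)
    match = TOOL_MARKER.search(output)
    if match:                                   # the model asked for a tool
        name, arg = match.groups()
        result = TOOLS[name](arg)
        return run_turn(prompt + f"\nRESULT: {result}")
    return output                               # plain text: we're done

print(run_turn("What's the weather in Santiago?"))
```

&lt;p&gt;The model never touched the network; it only wrote a request in the agreed format.&lt;/p&gt;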

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Tools are how AI gets things done. The model thinks in text; tools translate that text into action. The more tools available, the more capable the agent—but also the more you need to understand what you're authorising it to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Just-in-Time Beats Just-in-Case (The MCP-to-Skills Revolution)
&lt;/h2&gt;

&lt;p&gt;A year ago, the standard approach was to load everything into the context window upfront—every possible tool, every instruction, every contingency. This is &lt;strong&gt;just-in-case prompting&lt;/strong&gt;, and it's the equivalent of reading an entire encyclopaedia before answering a single question. The model would arrive at your first message already half-confused, its context window bloated with things it might never need.&lt;/p&gt;

&lt;p&gt;The revolution was &lt;strong&gt;just-in-time prompting&lt;/strong&gt;. Instead of cramming ten thousand commandments into the system prompt, you give the model ten—plus a shelf of manuals it can see but doesn't read until needed. It sees the titles on the spines. When your intent matches a manual, it pulls it down and reads it.&lt;/p&gt;

&lt;p&gt;This is the shift from &lt;strong&gt;MCP&lt;/strong&gt; (Model Context Protocol—an early standard for sharing tools) to &lt;strong&gt;skills&lt;/strong&gt;. A skill is a folder containing two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A prompt&lt;/strong&gt;—plain English describing when and how to do something ("When the user wants to book a flight, open the browser, go to Kayak, wait for them to enter their password...")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A programme&lt;/strong&gt;—traditional code that handles the mechanical steps the model would struggle with (navigating specific UI elements, entering credit card details, clicking the right buttons)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skills are the closest analogue to apps in the old world. They map a user's intent to an action, blending the model's intelligence with traditional programming's reliability. And because they're loaded on demand rather than upfront, they don't waste context.&lt;/p&gt;
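&lt;p&gt;The shelf-of-manuals idea reduces to a small lookup: only the titles live in the system prompt, and the full text is injected when the intent matches. Titles and matching logic below are illustrative:&lt;/p&gt;

```python
# Just-in-time skill loading: the system prompt carries only the skill
# titles ("spines on the shelf"); the full manual is loaded on demand.
# The skills and the naive matching here are invented for illustration.

SKILLS = {
    "book a flight":  "When booking: open the browser, go to the airline site, wait for the user to log in.",
    "expense report": "When filing expenses: collect receipts, fill the form, submit for approval.",
}

def system_prompt() -> str:
    """Only the titles go in upfront; context stays small."""
    titles = ", ".join(SKILLS)
    return f"You have these skills available: {titles}."

def load_skill_for(intent: str):
    """Pull the full manual off the shelf only if the intent matches a title."""
    for title, manual in SKILLS.items():
        if title in intent.lower():
            return manual
    return None

print(system_prompt())
print(load_skill_for("please book a flight to Vigo"))
```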

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Don't dump everything into the prompt. Structure your AI setup so information is discoverable but loaded only when needed. If you're building with AI tools, think in terms of skills: intent → action, with just enough prompt to guide and just enough code to execute.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. An Agent Is a Marriage Between Old and New
&lt;/h2&gt;

&lt;p&gt;With all the pieces in place, we can define what an &lt;strong&gt;agent&lt;/strong&gt; actually is.&lt;/p&gt;

&lt;p&gt;An agent is software that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Makes requests to an LLM&lt;/li&gt;
&lt;li&gt;Manages the context window&lt;/li&gt;
&lt;li&gt;Intercepts tool calls and executes them&lt;/li&gt;
&lt;li&gt;Loops until the task is done (or the model responds without a tool call, signalling completion)&lt;/li&gt;
&lt;/ul&gt;
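&lt;p&gt;Those four bullets compress into a loop a dozen lines long. The scripted &lt;code&gt;fake_llm&lt;/code&gt; stands in for a real model API; the shape of the loop is the point:&lt;/p&gt;

```python
# The agent loop in miniature: call the model, execute any tool it
# requests, feed the result back, and stop when a reply arrives with
# no tool call. fake_llm is a scripted stand-in for a real model.

def fake_llm(context):
    """Scripted replies: one tool call, then a final answer."""
    if not any("TOOL_RESULT" in m for m in context):
        return {"tool": "read_file", "arg": "notes.txt"}
    return {"text": "Summary: the notes mention two action items."}

def read_file(path):
    return "buy sensors; calibrate pump"         # pretend file contents

TOOLS = {"read_file": read_file}

def run_agent(task):
    context = [task]
    while True:                                  # loop until done
        reply = fake_llm(context)
        if "tool" not in reply:                  # no tool call: finished
            return reply["text"]
        result = TOOLS[reply["tool"]](reply["arg"])
        context.append(f"TOOL_RESULT: {result}") # manage the context window

print(run_agent("Summarise notes.txt"))
```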

&lt;p&gt;It's a marriage between traditional programming (controlling files, browsers, APIs, operating systems) and the new capability of LLMs (understanding intent, generating solutions, reasoning through ambiguity).&lt;/p&gt;

&lt;p&gt;The ChatGPT website is an agent. Claude Code is an agent. OpenClaw is an agent. The difference is scope and sovereignty—where the agent runs, what tools it has access to, and who controls it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; An agent is the orchestration layer that makes an LLM useful in the real world. It's not the model itself—it's the software that gives the model hands.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Vibe Coding Changed Who Gets to Build
&lt;/h2&gt;

&lt;p&gt;A year ago, writing software demanded head-down precision: exact syntax, precise logic, one misplaced semicolon and nothing works. Vibe coding is the opposite. You put your feet on the desk and say, "Build me a movie player app that downloads from Dropbox," and you watch it happen.&lt;/p&gt;

&lt;p&gt;Here's what's actually occurring: you describe what you want. The agent asks clarifying questions, then enters a loop—searching the web, reading files, writing code, running the programme, testing it, iterating—until it judges the task complete (signalled by a response without a tool call). You steer along the way: "I want blue, not purple." "Add an export button."&lt;/p&gt;

&lt;p&gt;A year ago, this was shaky. Today, it works reliably enough that Andrej Karpathy—the former head of AI at Tesla, who essentially coined the term—reports he's gone from 20% vibe coding to 80%.&lt;/p&gt;

&lt;p&gt;The implications are seismic. The cost of software production is approaching zero. A Cuban activist who needs a specific Bitcoin wallet for her community will, within a year, be able to describe it to a computer and have it built—open source, tailored, functional. One developer, Peter Steinberger, vibe-coded the entirety of OpenClaw—a project that amassed 160,000 GitHub stars in seven weeks (Bitcoin, after fifteen years, has 80,000).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; You don't need to be a programmer to build software anymore. You need to be clear about what you want. The ability to articulate intent precisely—to describe the thing in your head so specifically that a machine can build it—is becoming one of the most valuable skills alive.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Sovereignty Is the Point
&lt;/h2&gt;

&lt;p&gt;Every concept in this guide converges on a single question: &lt;strong&gt;who controls your AI?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the model is closed and runs on someone else's servers, with a system prompt you can't see, using tools they chose, storing memories on their infrastructure—you are a tenant. Your experience can be shaped, your data harvested, your outputs nudged in directions that serve the platform rather than you.&lt;/p&gt;

&lt;p&gt;When the agent runs locally, with open models you can inspect, a system prompt you wrote, tools you chose, and memories stored on your own machine—you have sovereignty. Not perfect sovereignty, not yet (the hardware costs are real, the security surface is enormous, and the open models aren't quite as capable as the closed ones)—but meaningful sovereignty. A direction, not a destination.&lt;/p&gt;

&lt;p&gt;The trajectory is clear: open models are catching up to closed ones. Hardware is getting cheaper. Context engineering is getting smarter. The gap between what a well-configured local setup can do and what a frontier cloud model offers is narrowing by the month.&lt;/p&gt;

&lt;p&gt;Persistent local memory alone changes the equation dramatically. A lesser model that remembers every interaction, that builds a deepening understanding of your work, your preferences, your context—that model, over time, outperforms a brilliant model that forgets you exist every time you close the tab.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical takeaway:&lt;/strong&gt; Start moving toward sovereignty now, even incrementally. Use privacy-respecting tools where you can. Understand what you're trading when you use cloud AI. The goal isn't to reject all cloud services tomorrow—it's to build the muscle, the understanding, and the infrastructure so that when full self-sovereignty becomes practical, you're ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start (Right Now)
&lt;/h2&gt;

&lt;p&gt;If you've read this far and want to act, here's a practical ladder:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a chatbot intentionally.&lt;/strong&gt; If you're still just dabbling with ChatGPT, commit to using it (or Claude, or a privacy-respecting alternative) daily for real work. Learn what it's good at and where it breaks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Try a creator tool.&lt;/strong&gt; Claude Code, Cursor, or Replit will show you what AI looks like when it can use tools—not just generate text, but build things. Even with zero technical background, you can build a working app on Replit today.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understand context by feel.&lt;/strong&gt; Start a long conversation and notice when the model gets confused or forgetful. That's compaction. That's the context window filling up. Developing an intuition for this is worth more than any technical explanation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explore local AI.&lt;/strong&gt; Install Ollama. Run a small open model. It won't be as smart as Claude, but it'll be &lt;em&gt;yours&lt;/em&gt;. The experience of talking to an AI that runs on your own hardware—no internet required, no data leaving your machine—is genuinely different. It changes your relationship with the technology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Articulate with precision.&lt;/strong&gt; Whether you're prompting, vibe coding, or describing a skill, the bottleneck is increasingly &lt;em&gt;your ability to say what you mean&lt;/em&gt;. Practice describing what you want with sensory specificity—not "make it look good" but "a rotating globe with colour-coded civil liberties data, sortable by donor country, with a liquid glass aesthetic." The more precisely you can dream out loud, the more powerful these tools become.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Five years ago, the prevailing fear was that AI would be inherently centralising—a surveillance machine, a tool of control, the end of individual agency. And parts of that fear were justified. Dictators will use AI. Corporations will use AI to extract. That's happening.&lt;/p&gt;

&lt;p&gt;But what wasn't anticipated was the other side: AI as an asymmetric amplifier of individual capability. Encryption didn't just help governments—it gave individuals a shield that states, with all their resources, still struggle to pierce. Bitcoin didn't just help banks—it gave people money that can't be confiscated. AI is following the same pattern.&lt;/p&gt;

&lt;p&gt;A single person can now vibe-code a tool that would have taken a team of engineers months to build. An activist can speak a complex research task into a phone and get back a rich, interactive visualisation in minutes. A small organisation can 10x or 100x its output without hiring anyone.&lt;/p&gt;

&lt;p&gt;The cost of software creation is collapsing. The barrier between having an idea and having a working tool is dissolving. And the infrastructure for doing all of this privately, sovereignly, on your own terms—that infrastructure is being built right now, in the open, by people who believe it matters.&lt;/p&gt;

&lt;p&gt;This is the moment. Not to be overwhelmed by the jargon. Not to sit on the sidelines waiting for it to stabilise. But to learn the primitives, build the intuition, and start using these tools with the same intentionality you'd bring to any craft worth mastering.&lt;/p&gt;

&lt;p&gt;The future isn't arriving. It's being built—by anyone willing to describe, precisely, what they want it to look like.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Now go build something.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Agentic Mesh in the Wild</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Fri, 20 Feb 2026 17:47:32 +0000</pubDate>
      <link>https://dev.to/choutos/agentic-mesh-in-the-wild-27ki</link>
      <guid>https://dev.to/choutos/agentic-mesh-in-the-wild-27ki</guid>
      <description>&lt;p&gt;You've heard the pitch. Autonomous agents collaborating like a well-run engineering team, decomposing problems, dividing labour, converging on solutions. The "Internet for Agents." The agentic mesh.&lt;/p&gt;

&lt;p&gt;Here's what nobody tells you at the conference keynote: the mesh is real, it's in production, and it's already teaching us lessons that will reshape how we build software. But those lessons aren't the ones the slide decks promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The State of Play
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems crossed from research curiosity to production reality in 2025. Not everywhere—not yet in most places—but in enough places, at enough scale, that patterns are emerging. Cursor is running hundreds of concurrent agents generating millions of lines of code. Anthropic ships multi-agent research to every Claude user. Salesforce's Agentforce has 150+ enterprise deployments and calls it their fastest-growing product ever. Tyson Foods and Gordon Food Service have agents from &lt;em&gt;different companies&lt;/em&gt; talking to each other over Google's A2A protocol.&lt;/p&gt;

&lt;p&gt;This is no longer theoretical. The question isn't whether multi-agent works. It's &lt;em&gt;how&lt;/em&gt; it works, and where it breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hierarchy Lesson
&lt;/h2&gt;

&lt;p&gt;The most striking finding from production is this: &lt;strong&gt;flat coordination fails catastrophically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor learned it the hard way. Give twenty agents equal status and shared file access, and you don't get twenty times the throughput. You get the throughput of two or three, with the rest churning in lock contention and decision paralysis. Agents became risk-averse. They avoided hard tasks. They optimised for appearing busy rather than making progress.&lt;/p&gt;

&lt;p&gt;Google DeepMind's research confirms it at the theoretical level—a "bag of agents" in flat peer-to-peer topology produces seventeen times more errors than structured alternatives.&lt;/p&gt;

&lt;p&gt;The pattern that works, consistently, across every successful production deployment we found: &lt;strong&gt;orchestrator-worker.&lt;/strong&gt; A planning agent decomposes the problem. Specialised workers execute narrow tasks. Results flow back up. Cursor extends this into recursive hierarchies—planners spawning sub-planners, with judge agents evaluating whether to continue or stop. Anthropic's multi-agent research system does the same: a lead agent delegates to subagents, each with its own context window and tools, then compresses findings.&lt;/p&gt;

&lt;p&gt;This mirrors something every CTO already knows about human organisations. Flat hierarchies sound democratic. They work for small teams. At scale, you need structure—clear delegation, well-defined scope, someone who decides when the work is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Breaks
&lt;/h2&gt;

&lt;p&gt;Ten failure modes keep appearing across deployments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Coordination overhead.&lt;/strong&gt; Agents negotiate more than they work. The meta-conversation about who does what consumes the budget meant for doing the thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context fragmentation.&lt;/strong&gt; Each agent sees its own slice. Without shared context, they make locally reasonable but globally inconsistent decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Non-determinism.&lt;/strong&gt; Same inputs, different outputs. Every run is a snowflake. This isn't a bug—it's the nature of LLM-based systems—but it makes testing painful and debugging worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Error cascading.&lt;/strong&gt; One agent's hallucination becomes another agent's input. Garbage propagates faster than correction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Token economics.&lt;/strong&gt; Multi-agent systems use roughly fifteen times the tokens of a single chat interaction. Anthropic's own research confirms this. If the task doesn't justify 15x the cost, you've built an expensive way to get a mediocre answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Lock contention.&lt;/strong&gt; Shared-state coordination creates bottlenecks that negate the parallelism you built the system to achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Risk aversion in flat structures.&lt;/strong&gt; Without explicit delegation, agents gravitate toward safe, trivial subtasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Observability gaps.&lt;/strong&gt; When a five-agent chain produces the wrong output, figuring out &lt;em&gt;which&lt;/em&gt; agent went wrong—and &lt;em&gt;why&lt;/em&gt;—is genuinely hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Workflow mismatch.&lt;/strong&gt; You cannot drop agents into processes designed for humans. Workflow redesign is mandatory. Salesforce learned this deploying Agentforce: the agents work, but only after the process around them is rebuilt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Model drift.&lt;/strong&gt; Your agents' behaviour changes when the underlying LLM updates. What passed evaluation last week may fail this week.&lt;/p&gt;
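&lt;p&gt;Some of these modes yield to cheap mechanical guards. Error cascading, for instance, can be blunted by validating each agent's output before it becomes the next agent's input. A minimal sketch, with a hypothetical schema check standing in for whatever validation fits your pipeline:&lt;/p&gt;

```python
# Sketch of a validation gate between agents: output is checked before
# it propagates downstream. The schema check here is a hypothetical example.

def validate(output: dict) -> bool:
    # Reject structurally broken results instead of passing them on.
    return isinstance(output.get("answer"), str) and output.get("confidence", 0) >= 0.5

def hand_off(output: dict, next_agent):
    if not validate(output):
        raise ValueError(f"refusing to propagate invalid output: {output!r}")
    return next_agent(output)

print(hand_off({"answer": "42", "confidence": 0.9}, lambda o: o["answer"]))  # "42"

try:
    hand_off({"answer": None, "confidence": 0.9}, lambda o: o)
except ValueError:
    print("blocked before it could cascade")
```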

&lt;h2&gt;
  
  
  The Numbers Worth Knowing
&lt;/h2&gt;

&lt;p&gt;A few data points that should shape your planning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Figure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent vs single-agent on research tasks&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;+90%&lt;/strong&gt; (Anthropic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token cost multiplier for multi-agent&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~15x&lt;/strong&gt; vs single chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flat swarm error amplification&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;17x&lt;/strong&gt; (DeepMind)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Salesforce autonomous resolution rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60%+&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A2A protocol supporting organisations&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;150+&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 90% improvement is real and impressive. So is the 15x cost. The question for any deployment is whether the value of the task justifies the economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Protocol Stack Taking Shape
&lt;/h2&gt;

&lt;p&gt;Two standards are converging as the connective tissue of the agentic mesh:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google's A2A (Agent-to-Agent)&lt;/strong&gt; defines how agents discover each other and exchange messages. Donated to the Linux Foundation in June 2025, backed by 150+ organisations including Atlassian, PayPal, Salesforce, SAP, and AWS. Agents publish "capability cards" describing what they can do. Other agents—or humans—query those cards. Think DNS for agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic's MCP (Model Context Protocol)&lt;/strong&gt; standardises how agents access tools and data. If A2A is how agents talk to each other, MCP is how they talk to the world. The analogy that keeps surfacing: USB for AI.&lt;/p&gt;

&lt;p&gt;Together, these protocols make vendor-neutral, cross-framework, even cross-company agent collaboration possible. The Tyson Foods / Gordon Food Service deployment is the early proof. It won't be the last.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry for Agents&lt;/strong&gt; is emerging as the observability layer—extending existing OpenTelemetry standards to trace agent interactions, tool calls, and token consumption across the mesh.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns for the Pragmatist
&lt;/h2&gt;

&lt;p&gt;If you're planning a multi-agent deployment, here's what the evidence suggests:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with one agent.&lt;/strong&gt; Microsoft's own guidance: "If you can write a function to handle the task, do that instead of using an AI agent." Multi-agent is a scaling strategy, not a starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestrator-worker is the safe default.&lt;/strong&gt; It's proven at scale by Cursor, Anthropic, and Bayezian. Flat peer-to-peer is an anti-pattern. Every production team that tried it moved away from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose models per role.&lt;/strong&gt; Cursor found GPT-5.2 excels at planning while other models perform better at execution. One-size-fits-all model selection leaves performance on the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument from day one.&lt;/strong&gt; OpenTelemetry, audit logs, token tracking, cost circuit breakers. You will need them. In regulated industries (KPMG's Clara AI for audit, Bayezian's clinical trial monitoring), they're non-negotiable. In every other industry, they're still non-negotiable—you just don't know it yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adopt A2A and MCP early.&lt;/strong&gt; These are becoming the industry standards. Building on proprietary protocols now means migrating later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget honestly.&lt;/strong&gt; Plan for 10–15x token costs. Build cost monitoring with automatic circuit breakers for runaway usage. If the economics don't work at 15x, the architecture needs to change—not the budget.&lt;/p&gt;
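&lt;p&gt;The circuit breaker in particular is cheap to build. A minimal sketch, in-memory and single-process; a production version would persist counters and alert when it trips:&lt;/p&gt;

```python
import time

# Minimal token-cost circuit breaker: refuses further calls once the
# day's spend crosses a budget. In-memory for the sketch; a production
# version would persist counters and emit alerts on trips.

class CostBreaker:
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def record(self, tokens: int, usd_per_million: float) -> None:
        today = time.strftime("%Y-%m-%d")
        if today != self.day:              # new day, reset the counter
            self.day, self.spent = today, 0.0
        self.spent += tokens * usd_per_million / 1_000_000

    def allow(self) -> bool:
        return self.spent < self.budget

breaker = CostBreaker(daily_budget_usd=500.0)
breaker.record(tokens=2_000_000, usd_per_million=15.0)  # $30 of input tokens
print(breaker.allow())  # True: still under budget
```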

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;The agentic mesh is real, and it produces results that single agents cannot match. Cursor's agents wrote a web browser from scratch—a million lines of code in a week. Anthropic's multi-agent research outperforms single-agent by 90%. These aren't demo numbers; they're production metrics.&lt;/p&gt;

&lt;p&gt;But the mesh is also unforgiving. Coordination is the hard problem, not individual agent capability. The organisations succeeding are the ones treating multi-agent systems with the same rigour they'd apply to distributed systems engineering: clear hierarchies, narrow responsibilities, observable behaviour, well-defined failure modes, and honest cost accounting.&lt;/p&gt;

&lt;p&gt;The age of agents-in-production has arrived. It looks less like a swarm of autonomous intelligences and more like a well-architected microservices system—with all the same lessons about coupling, cohesion, observability, and the eternal truth that distributed systems are harder than they look.&lt;/p&gt;

&lt;p&gt;The difference is that this time, the services can reason.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources include Anthropic Engineering, Cursor, Google DeepMind (arXiv 2512.08296), Salesforce, AIMultiple, KPMG, Bayezian, and the Linux Foundation A2A project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>When the Landlord Moves Into the Polvaria</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Tue, 17 Feb 2026 20:14:44 +0000</pubDate>
      <link>https://dev.to/choutos/when-the-landlord-moves-into-the-pulperia-3nm6</link>
      <guid>https://dev.to/choutos/when-the-landlord-moves-into-the-pulperia-3nm6</guid>
      <description>&lt;p&gt;The &lt;em&gt;polvaria&lt;/em&gt; works because the woman behind the bar is the owner.&lt;/p&gt;

&lt;p&gt;This is important. Not in a sentimental way, not in a "support small business" way, but in the structural way that determines whether the &lt;em&gt;polvo&lt;/em&gt; is good or not. She decides the menu. She knows the regulars. She pours the &lt;em&gt;ribeiro&lt;/em&gt; the way it should be poured, which is to say generously, into ceramic cups that don't match, without asking if you'd prefer something else. There is nothing else. This is the &lt;em&gt;polvaria&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Her name is — well, it doesn't matter. Let's call her Carme. Carme has been running this place for twenty-two years. Before her, it was her mother. Before that, nobody remembers exactly, but there's a photograph on the wall from 1971 where you can see the same bar, the same window, and a man with a moustache who might be her grandfather or might be someone else entirely. Nobody's checked. It doesn't matter. The point is that the place has been itself for a long time, and the reason it has been itself is that the person making the decisions is the person doing the work.&lt;/p&gt;

&lt;p&gt;This is not a metaphor yet. Give me a minute.&lt;/p&gt;




&lt;p&gt;I run my own tools. This is not impressive. Plenty of people do. But I want to be specific about what "my own" means, because the phrase is about to become important.&lt;/p&gt;

&lt;p&gt;I have a machine. On that machine, there is software. The software talks to AI models — different ones, depending on the task. It coordinates agents that do work: writing, coding, researching, deploying. The architecture is mine. Not because I built every piece from scratch, but because I chose every piece. I decided which model handles what. I decided how the agents communicate. I decided what runs locally and what runs in the cloud. When something breaks, I fix it. When something improves, I improve it. The kitchen is mine.&lt;/p&gt;

&lt;p&gt;The software is open source. Anyone can take it, run it, modify it, break it, fix it. It belongs to whoever uses it, in the way that a recipe belongs to whoever cooks it. The author wrote it down, sure, but the &lt;em&gt;polvo&lt;/em&gt; on your plate is yours. You made it. You chose the paprika.&lt;/p&gt;

&lt;p&gt;Now. Imagine that the company that sells the paprika buys the recipe book.&lt;/p&gt;




&lt;p&gt;There's a man in California — there's always a man in California — and his company makes AI models. Very good ones. The best, probably, depending on how you measure. His business is selling access to intelligence. Not the intelligence itself, mind you. Access. The difference matters. The intelligence lives on his servers, behind his API, under his terms of service. You can use it, the way you can use electricity. You pay the bill, the lights come on. You stop paying, the lights go off. You don't own the power station. You don't even know where it is.&lt;/p&gt;

&lt;p&gt;This is fine. I use his models too, sometimes. They're good. I'm not a purist.&lt;/p&gt;

&lt;p&gt;But there's a difference between buying paprika from someone and having that someone move into your kitchen. And lately, the paprika sellers have been very interested in kitchens.&lt;/p&gt;




&lt;p&gt;Let me tell you about my grandfather. Not mine exactly — a grandfather, the kind that exists in every Galician family, the way rain exists in every Galician winter. He had a &lt;em&gt;leira&lt;/em&gt;. A small plot of land up in the hills above the river, not much, maybe enough for potatoes, some &lt;em&gt;grelos&lt;/em&gt;, a few vines that produced wine so rough it could strip paint. But it was his. His name on the deed, such as deeds existed. His hands in the soil.&lt;/p&gt;

&lt;p&gt;One day the &lt;em&gt;cacique&lt;/em&gt; came by. The &lt;em&gt;cacique&lt;/em&gt; always comes by eventually. He had better seeds. Better tools. A new fertiliser from Germany that would double the yield. He wasn't buying the land, no. He was offering to help. A partnership. The grandfather would keep working the &lt;em&gt;leira&lt;/em&gt;, and the &lt;em&gt;cacique&lt;/em&gt; would provide the means to make it productive. Modern. Efficient.&lt;/p&gt;

&lt;p&gt;The grandfather said no.&lt;/p&gt;

&lt;p&gt;Not because the offer was bad. The seeds probably were better. The fertiliser probably did work. But the grandfather understood something that doesn't fit on a spreadsheet: once someone else decides what you plant, it's not your &lt;em&gt;leira&lt;/em&gt; anymore. It's theirs with your name on it. You're still doing the work, still getting your hands dirty, still waking up at dawn. But the decisions — the ones that matter, the ones about what grows and what doesn't, what stays and what goes — those are made somewhere else now. In a house with more rooms than people.&lt;/p&gt;

&lt;p&gt;The grandfather kept his rough wine and his modest potatoes and died owning exactly what he'd always owned. The families who took the &lt;em&gt;cacique&lt;/em&gt;'s deal had better harvests for a few years. Then the terms changed. Then the terms changed again. Then the &lt;em&gt;cacique&lt;/em&gt;'s son decided potatoes weren't profitable and maybe the land would be better used for eucalyptus.&lt;/p&gt;

&lt;p&gt;The eucalyptus burns, by the way. Every summer. But that's another story.&lt;/p&gt;




&lt;p&gt;Here's what happens when a company that sells AI models acquires — or "partners with," or "integrates," or "strategically aligns with," which are all the same word wearing different suits — an open-source tool.&lt;/p&gt;

&lt;p&gt;Day one: nothing changes. The code is still open. The community is still there. The blog post announces the partnership with words like "accelerate" and "ecosystem" and "committed to open source." Everyone applauds. The &lt;em&gt;polvo&lt;/em&gt; still tastes the same.&lt;/p&gt;

&lt;p&gt;Day thirty: a new feature appears. It works best with the company's own models. Not exclusively — that would be too obvious. But best. The integration is smoother. The latency is lower. The documentation is more detailed. If you use a different model, it still works, technically. The way a car still works with the wrong tyres. It drives. It just doesn't drive well.&lt;/p&gt;

&lt;p&gt;Day ninety: another feature. This one requires the company's API for "security reasons" or "performance optimization" or some other phrase that sounds responsible and is impossible to argue against without sounding paranoid. The open-source version still exists, but it's missing things. Important things. The kind of things that make you say "I'll just use the integrated version, it's easier."&lt;/p&gt;

&lt;p&gt;Day three hundred and sixty-five: you look up and realise you're running their tool, on their infrastructure, under their terms. The code is still technically open. You could still fork it, technically. The way you could still bake your own bread instead of buying it. You could. You won't. The convenience has become dependency, and the dependency has become invisible, which is the most effective kind.&lt;/p&gt;

&lt;p&gt;The woman behind the bar has that look. The one people get when they're working in a place that used to be theirs.&lt;/p&gt;




&lt;p&gt;I want to be fair. Companies are not evil for wanting to grow. The man in California is not a &lt;em&gt;cacique&lt;/em&gt;. He's building something extraordinary, and he probably believes — sincerely, genuinely — that controlling more of the stack will produce better outcomes for everyone. Better &lt;em&gt;polvo&lt;/em&gt;. Better &lt;em&gt;ribeiro&lt;/em&gt;. A nicer &lt;em&gt;polvaria&lt;/em&gt; with matching cups and a menu in three languages.&lt;/p&gt;

&lt;p&gt;And maybe he's right. Maybe the &lt;em&gt;polvo&lt;/em&gt; will be better.&lt;/p&gt;

&lt;p&gt;But I keep coming back to the kitchen. To Carme, who buys her octopus from the same fisherman in Ribeira every Thursday morning and whose &lt;em&gt;polvo&lt;/em&gt; is good not because the recipe is secret but because she's been making it for twenty-two years and she knows — in her hands, not in her head — exactly when it's done. That knowledge doesn't transfer. It doesn't scale. It doesn't fit in an API. It lives in the specific relationship between a woman and a copper pot and ten thousand repetitions.&lt;/p&gt;

&lt;p&gt;When you own your tools, you develop that relationship. You learn the quirks, the edges, the moments where the documentation is wrong and your instinct is right. You build something that is yours in the way that matters: not legally, not financially, but practically. You know how it works because you made it work. You made the choices. You chose the paprika.&lt;/p&gt;

&lt;p&gt;When someone else owns your tools, you develop a different relationship. A relationship with their choices. Their paprika. Their version of what "good" means. And their version might be excellent — probably is excellent — but it's not yours. And one Tuesday the &lt;em&gt;polvo&lt;/em&gt; tastes different and nobody can explain why, because the decision was made in a building with more rooms than people, by someone who has never been to Galicia and doesn't know what &lt;em&gt;ribeiro&lt;/em&gt; is and thinks octopus is something you order at a restaurant in San Francisco for forty-seven dollars.&lt;/p&gt;




&lt;p&gt;The thing about open source is that it's fragile. Not technically — technically it's robust, it's everywhere, it runs the world. Fragile in the human sense. It depends on people caring enough to maintain something they don't own. It depends on companies resisting the urge to enclose what's open, fence what's common, monetise what's free. It depends on the grandfather saying no to the &lt;em&gt;cacique&lt;/em&gt;, even when the seeds really are better.&lt;/p&gt;

&lt;p&gt;I don't know if the tools I use will stay open. I don't know if the kitchen will stay mine. I know that today it is, and that today I can choose my models, my architecture, my agents, my paprika. I know that the moment I stop paying attention, someone will offer me better seeds.&lt;/p&gt;

&lt;p&gt;I'll keep my rough wine, I think. It's not efficient. It's not optimised. The yield is modest and the cups don't match and the chalkboard menu hasn't changed since 2011.&lt;/p&gt;

&lt;p&gt;But the &lt;em&gt;polvo&lt;/em&gt; is good. And it's mine. And I know exactly why it tastes like that.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>opinion</category>
      <category>software</category>
    </item>
    <item>
      <title>How I Cut My LLM Costs by 70% Without Losing Quality</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Mon, 16 Feb 2026 13:22:23 +0000</pubDate>
      <link>https://dev.to/choutos/how-i-cut-my-llm-costs-by-70-without-losing-quality-4enf</link>
      <guid>https://dev.to/choutos/how-i-cut-my-llm-costs-by-70-without-losing-quality-4enf</guid>
      <description>&lt;p&gt;You're running AI in production. Things are going well — users love it, the team is shipping features, and then the invoice arrives.&lt;/p&gt;

&lt;p&gt;$2,400/day on API calls. For what started as "a few GPT-4 calls here and there."&lt;/p&gt;

&lt;p&gt;I've been there. Running a multi-agent system where every task — from simple text classification to complex reasoning — was hitting Claude Opus or GPT-4. The quality was great. The bill was not.&lt;/p&gt;

&lt;p&gt;Over three months, I got that $2,400/day down to ~$700/day with no measurable quality loss on 94% of tasks. Here's exactly how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Problem: Let's Talk Numbers
&lt;/h2&gt;

&lt;p&gt;First, let's ground this in reality. Current pricing (as of early 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 7B (local)&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$0.00*&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Hardware costs apply, but if you already have a GPU sitting around, marginal cost is electricity.&lt;/p&gt;

&lt;p&gt;A single Claude Opus call with a typical system prompt (~2K tokens) plus context (~3K tokens) generating a ~1K token response costs roughly &lt;strong&gt;$0.15&lt;/strong&gt;. That sounds tiny until you're making 25,000 calls a day.&lt;/p&gt;
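&lt;p&gt;The per-call arithmetic is worth scripting so you can sanity-check any model against any workload. Using the Opus figures from the table above:&lt;/p&gt;

```python
# Per-call cost check using the table's Opus prices ($15/M in, $75/M out).
def call_cost(in_tokens: int, out_tokens: int, in_per_m: float, out_per_m: float) -> float:
    return in_tokens * in_per_m / 1e6 + out_tokens * out_per_m / 1e6

per_call = call_cost(5_000, 1_000, in_per_m=15.0, out_per_m=75.0)
print(f"${per_call:.2f} per call")           # $0.15 per call
print(f"${per_call * 25_000:,.0f} per day")  # $3,750 per day at 25K calls
```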

&lt;p&gt;The maths is simple but brutal: &lt;strong&gt;most production AI systems are running their most expensive model on tasks that don't need it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 1: Model Routing — The Biggest Win
&lt;/h2&gt;

&lt;p&gt;This single change cut my costs by ~45%.&lt;/p&gt;

&lt;p&gt;The idea: classify incoming tasks by complexity, then route to the appropriate model. Not every question needs a PhD — some just need a lookup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified routing logic
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Classification, extraction, formatting, simple Q&amp;amp;A
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or local model
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Summarisation, code review, standard generation
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Complex reasoning, multi-step analysis, creative work
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complexity estimator doesn't need to be fancy. In my case, a simple heuristic based on task type, input length, and whether the task requires multi-step reasoning got me 90% of the way there. You can even use a cheap model to classify — a GPT-4o mini call to decide routing costs fractions of a cent.&lt;/p&gt;
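&lt;p&gt;For what it's worth, my estimator is in the spirit of the sketch below. The thresholds and keyword list are placeholders to tune against your own traffic, not magic numbers:&lt;/p&gt;

```python
# Illustrative complexity heuristic; thresholds and keyword lists are
# placeholders to tune against your own traffic.

REASONING_HINTS = ("why", "compare", "plan", "design", "trade-off", "debug")

def estimate_complexity(task: dict) -> str:
    text = task.get("input", "").lower()
    task_type = task.get("type", "")
    if task_type in ("classify", "extract", "format"):
        return "simple"            # known-cheap task types go straight to a small model
    if len(text) > 4000 or any(hint in text for hint in REASONING_HINTS):
        return "complex"           # long context or reasoning language: top tier
    return "medium"

print(estimate_complexity({"type": "classify", "input": "spam or ham?"}))          # simple
print(estimate_complexity({"type": "qa", "input": "why does this design fail?"}))  # complex
```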

&lt;p&gt;&lt;strong&gt;What I found in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~60% of requests were "simple" — classification, entity extraction, formatting, template filling&lt;/li&gt;
&lt;li&gt;~25% were "medium" — summarisation, standard content generation, code explanation&lt;/li&gt;
&lt;li&gt;~15% actually needed top-tier reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means 60% of my spend was going to a model 100x more expensive than necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 2: Fallback Chains
&lt;/h2&gt;

&lt;p&gt;Model routing handles the happy path. Fallback chains handle everything else — rate limits, outages, and cost control.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Primary: Claude Opus 4
  ↓ (rate limited or timeout)
Secondary: Claude Sonnet 4
  ↓ (API down or budget exceeded)  
Tertiary: Local Qwen 2.5 7B via Ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
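&lt;p&gt;Without a framework, the chain is just an ordered try/except. The sketch below uses a fake &lt;code&gt;call_model&lt;/code&gt; that simulates a throttled primary, so the fall-through is visible:&lt;/p&gt;

```python
# Fallback chain as a plain loop: try models in order, fall through on
# failure. `call_model` is a stand-in for a real provider client; here
# it simulates the primary model being rate limited.

class RateLimited(Exception):
    pass

def call_model(model: str, prompt: str) -> str:
    if model == "claude-opus-4":       # simulate the primary being throttled
        raise RateLimited(model)
    return f"{model}: {prompt}"

CHAIN = ["claude-opus-4", "claude-sonnet-4", "ollama/qwen2.5:7b"]

def complete(prompt: str) -> str:
    last_err = None
    for model in CHAIN:
        try:
            return call_model(model, prompt)
        except Exception as err:        # rate limit, timeout, budget cap...
            last_err = err
    raise RuntimeError(f"all fallbacks exhausted: {last_err!r}")

print(complete("Summarise this doc"))  # served by claude-sonnet-4
```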



&lt;p&gt;I use &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; as the routing layer. It gives you a unified OpenAI-compatible API across providers with built-in fallbacks, retries, and spend tracking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# litellm config&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reasoning-heavy&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-opus-4-20250514&lt;/span&gt;
      &lt;span class="na"&gt;max_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;  &lt;span class="c1"&gt;# daily cap in USD&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reasoning-heavy&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-20250514&lt;/span&gt;  &lt;span class="c1"&gt;# fallback&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-tasks&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/qwen2.5:7b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daily budget cap is crucial. Once your primary model hits spend limits, requests automatically fall through to cheaper alternatives. You get cost predictability without building it yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 3: Smart Caching
&lt;/h2&gt;

&lt;p&gt;You'd be surprised how many "unique" requests are actually near-duplicates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exact caching
&lt;/h3&gt;

&lt;p&gt;The low-hanging fruit. Hash the prompt, cache the response. If someone asks the same question twice, don't pay twice. I use Redis with a 24-hour TTL. This alone saved ~8% on costs.&lt;/p&gt;
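&lt;p&gt;The mechanism fits in a few lines. A dict stands in for Redis here; in production, use Redis with the TTL as described above:&lt;/p&gt;

```python
import hashlib

# Exact-match cache: hash the full prompt, store the response. A dict
# stands in for Redis in this sketch; production uses Redis with a 24h TTL.

_cache: dict[str, str] = {}

def cached_complete(prompt: str, generate) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]             # repeat question: free
    response = generate(prompt)        # first time: pay for the call
    _cache[key] = response
    return response

calls = []
fake_llm = lambda p: (calls.append(p), f"answer to: {p}")[1]
print(cached_complete("What is RSS?", fake_llm))
print(cached_complete("What is RSS?", fake_llm))  # served from cache
print(len(calls))  # 1 — the model was only hit once
```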

&lt;h3&gt;
  
  
  Semantic caching
&lt;/h3&gt;

&lt;p&gt;More interesting. Use embeddings to find semantically similar previous queries and return cached results if similarity is above a threshold (I use 0.95).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode — semantic cache lookup
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;  &lt;span class="c1"&gt;# free
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be conservative with the threshold. A 0.90 threshold sounds close but will serve wrong answers. I learned this the hard way with customer-facing responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt caching (provider-level)
&lt;/h3&gt;

&lt;p&gt;Anthropic and OpenAI both offer prompt caching now. If your system prompt is the same across calls (and it should be), cached input tokens cost 90% less. A 2K-token system prompt across 25K daily calls is 50M input tokens a day; at Sonnet prices, the discount works out to roughly &lt;strong&gt;$4,000/month saved&lt;/strong&gt; from system prompt caching alone.&lt;/p&gt;

&lt;p&gt;Enable it. It's almost free money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 4: Prompt Engineering for Cost
&lt;/h2&gt;

&lt;p&gt;Every token costs money. Treat your prompts like code — review them, optimise them, measure them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I changed:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trimmed system prompts.&lt;/strong&gt; My original system prompts were 3,000+ tokens of "be helpful, be accurate, consider edge cases..." I cut them to ~800 tokens with no quality difference. The models already know how to be helpful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stopped sending full conversation history.&lt;/strong&gt; Instead of 20 messages of context, I send a summary of the conversation plus the last 3 messages. For a chatbot doing 10+ turns, this cuts input tokens by 60%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured output requests.&lt;/strong&gt; Instead of asking the model to explain its reasoning and then give an answer, I ask for JSON output directly. Shorter outputs = lower cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removed redundant instructions.&lt;/strong&gt; "Please respond in English" when the input is in English. "Be concise" followed by "provide a detailed explanation." Audit your prompts for contradictions and waste.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: ~3200 input tokens
&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI assistant. You should always be accurate 
and provide detailed, well-structured responses. Consider edge cases.
Be polite. Format your response clearly. If you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re unsure, say so.
Always respond in the same language as the user. Consider the context
of the conversation carefully before responding...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# After: ~600 input tokens  
&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract entities from user text. Return JSON: 
{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: str, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: str, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: float}]}
No explanation needed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same task, 80% fewer tokens in, 90% fewer tokens out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 5: Local Models — When They're Good Enough
&lt;/h2&gt;

&lt;p&gt;I run Qwen 2.5 7B on an RTX 3060 via Ollama. It costs nothing per request and handles more than you'd think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where local models work well (7B–14B range):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text classification: ~92% accuracy vs ~97% for Opus (good enough for routing)&lt;/li&gt;
&lt;li&gt;Entity extraction: ~89% accuracy on my benchmark set&lt;/li&gt;
&lt;li&gt;Reformatting/templating: essentially identical to cloud models&lt;/li&gt;
&lt;li&gt;Simple Q&amp;amp;A over provided context: solid when context is clean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where they fall apart:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step reasoning over large contexts&lt;/li&gt;
&lt;li&gt;Nuanced creative writing&lt;/li&gt;
&lt;li&gt;Complex code generation (fine for simple scripts)&lt;/li&gt;
&lt;li&gt;Anything requiring broad world knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: &lt;strong&gt;for many production tasks, 92% accuracy is fine.&lt;/strong&gt; If you're classifying support tickets or extracting dates from emails, you don't need Claude Opus. You need something fast, cheap, and good enough.&lt;/p&gt;

&lt;p&gt;Running local also removes the network round-trip entirely, keeps your data on your own hardware, and frees you from rate limits. For high-throughput pipelines, this matters as much as cost.&lt;/p&gt;
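&lt;p&gt;As a sketch, that "good enough" decision rule fits in a few lines. The task-type names and model labels below are illustrative, not real endpoints:&lt;/p&gt;

```python
# Illustrative sketch: send "good enough" task types to the local model.
# Task-type names and model labels are made up for this example.
LOCAL_OK = {"classification", "entity_extraction", "reformatting"}

def pick_model(task_type: str) -> str:
    if task_type in LOCAL_OK:
        return "ollama/qwen2.5:7b"  # free per request, runs on the 3060
    return "claude-opus"            # reserve the expensive model for hard tasks

print(pick_model("classification"))
```

&lt;p&gt;The point isn't the code, it's that the routing decision is boring and deterministic once you've benchmarked your task types.&lt;/p&gt;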

&lt;p&gt;&lt;strong&gt;Practical setup with Ollama:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install and run&lt;/span&gt;
ollama pull qwen2.5:7b
&lt;span class="c"&gt;# Expose OpenAI-compatible API on :11434&lt;/span&gt;
&lt;span class="c"&gt;# Point LiteLLM at it — done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For higher throughput, look at &lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; — its continuous batching serves concurrent requests far more efficiently than Ollama.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Results: The Multi-Agent System
&lt;/h2&gt;

&lt;p&gt;Here's the before/after on my system — a multi-agent setup handling document processing, customer queries, and internal tooling:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (everything on Claude Opus):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~25,000 calls/day&lt;/li&gt;
&lt;li&gt;Average 6K tokens per call (in+out)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~$2,400/day → ~$72,000/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (routed + cached + optimised):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;60% routed to GPT-4o mini or local models: -$1,400/day&lt;/li&gt;
&lt;li&gt;25% routed to Sonnet instead of Opus: -$200/day&lt;/li&gt;
&lt;li&gt;Caching (exact + semantic) eliminated ~12% of calls: -$100/day&lt;/li&gt;
&lt;li&gt;Prompt optimisation reduced average tokens by ~35%: spread across all tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result: ~$700/day → ~$21,000/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a &lt;strong&gt;71% reduction&lt;/strong&gt;. The quality metrics I track (task completion rate, user satisfaction scores, accuracy on a held-out test set) showed less than 2% degradation overall. The 15% of tasks still hitting Opus actually got &lt;em&gt;better&lt;/em&gt; because I could afford to give them more context and retries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80/20 Rule
&lt;/h2&gt;

&lt;p&gt;If you take one thing from this post: &lt;strong&gt;audit your model usage before optimising anything else.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In almost every system I've seen, 80% of LLM calls are simple tasks running on expensive models because that's what was easiest to set up during development. Nobody goes back to optimise the model choice because it works fine — until the bill arrives.&lt;/p&gt;

&lt;p&gt;Start here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log every LLM call&lt;/strong&gt; with model, token count, task type, and cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify tasks by actual complexity&lt;/strong&gt; needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route the simple stuff&lt;/strong&gt; to cheap/local models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable prompt caching&lt;/strong&gt; (literally a config flag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trim your prompts&lt;/strong&gt; — most are 2-3x longer than needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1-3 will get you 50-60% of the savings. The rest is optimisation on top.&lt;/p&gt;
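&lt;p&gt;Step 1 is the one everyone skips, so here's a minimal sketch. The prices are illustrative per-million-token figures, not current list prices:&lt;/p&gt;

```python
# Illustrative sketch of step 1: log every LLM call with model, tokens, and cost.
# Prices are made-up (USD per 1M tokens), not current list prices.
PRICE = {"gpt-4o-mini": (0.15, 0.60), "claude-opus": (15.0, 75.0)}

call_log = []

def log_call(model, task_type, tokens_in, tokens_out):
    p_in, p_out = PRICE[model]
    cost = (tokens_in * p_in + tokens_out * p_out) / 1_000_000
    call_log.append({"model": model, "task": task_type,
                     "tokens_in": tokens_in, "tokens_out": tokens_out,
                     "cost": round(cost, 6)})
    return cost

log_call("claude-opus", "summarise", 4000, 800)
log_call("gpt-4o-mini", "classify", 300, 20)

# Aggregate spend per model to see where the money actually goes
by_model = {}
for c in call_log:
    by_model[c["model"]] = by_model.get(c["model"], 0) + c["cost"]
```

&lt;p&gt;A week of this data usually makes the routing decisions obvious: the spend concentrates in a handful of task types that never needed the expensive model.&lt;/p&gt;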

&lt;h2&gt;
  
  
  Tools Worth Knowing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt;:&lt;/strong&gt; Unified API gateway, model routing, spend tracking, fallbacks. The single most useful tool for multi-model setups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ollama.ai" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;:&lt;/strong&gt; Dead-simple local model serving. Pull a model, run it, done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;:&lt;/strong&gt; Production-grade local inference with proper batching. Use when Ollama isn't enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;:&lt;/strong&gt; Single API for 100+ models with automatic fallbacks and cost comparison.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Running LLMs in production doesn't have to mean choosing between quality and cost. It means being intentional about which model handles which task — the same way you wouldn't use a GPU instance to serve static files.&lt;/p&gt;

&lt;p&gt;The expensive model should be your scalpel, not your hammer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Got questions or want to share your own cost optimisation stories? Drop a comment — I'd love to hear what's worked for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>Why I Replaced My AI Assistant With an Orchestra</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Mon, 16 Feb 2026 12:21:21 +0000</pubDate>
      <link>https://dev.to/choutos/why-i-replaced-my-ai-assistant-with-an-orchestra-16mb</link>
      <guid>https://dev.to/choutos/why-i-replaced-my-ai-assistant-with-an-orchestra-16mb</guid>
      <description>&lt;p&gt;You know that moment when your AI assistant loses the plot halfway through a complex task? You asked it to research a topic, draft a document, update a repo, and notify your team — and somewhere around step three it forgot what it was doing, hallucinated a file path, and burned through $2 of tokens producing nothing useful.&lt;/p&gt;

&lt;p&gt;I lived there for months. Then I stopped asking one agent to do everything and started orchestrating many. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Single-Agent Ceiling
&lt;/h2&gt;

&lt;p&gt;A single LLM agent hits three walls fast:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context windows are a leash.&lt;/strong&gt; Even with 200K tokens, a complex task that involves reading codebases, API docs, and prior conversation history fills up. The model starts dropping details. You can feel it getting dumber as the context grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generalists underperform specialists.&lt;/strong&gt; A single prompt carrying instructions for research, code generation, writing, and tool usage is asking one person to be the intern, the senior engineer, and the project manager simultaneously. The system prompt bloats, the model hedges, and quality drops across the board.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost scales badly.&lt;/strong&gt; Every token of context is re-processed on every completion. A 150K-token conversation where you need a 200-token answer still bills you for 150K input tokens. Multiply that by iterative tasks, and your bill becomes a problem.&lt;/p&gt;

&lt;p&gt;These aren't theoretical limits. They're the daily reality of anyone building with LLM agents beyond toy demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Orchestra Model
&lt;/h2&gt;

&lt;p&gt;The fix is the same one humans discovered millennia ago: &lt;strong&gt;specialisation and coordination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of one agent, you run several:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│         Conductor Agent         │
│   (orchestrates, delegates,     │
│    synthesises results)         │
└──────┬──────┬──────┬───────────┘
       │      │      │
  ┌────▼──┐ ┌─▼───┐ ┌▼────────┐
  │Research│ │Code │ │Comms    │
  │Agent   │ │Agent│ │Agent    │
  └───────┘ └─────┘ └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conductor agent receives the user's intent, breaks it into subtasks, spawns specialist agents, and synthesises their results. Each specialist runs in its own session with a focused system prompt, a clean context window, and only the tools it needs.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. It's microservices, applied to cognition.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Agents Actually Talk
&lt;/h2&gt;

&lt;p&gt;The practical architecture matters more than the metaphor. Here's what works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Spawn-and-report
&lt;/h3&gt;

&lt;p&gt;The conductor spawns a sub-agent with a task description. The sub-agent runs independently, completes its work, and reports its final output back to the conductor. No polling loops. No shared memory bus. The conductor just spawns, moves on, and handles the result when the callback fires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode: conductor spawning specialists
&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decompose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;spawn_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preferred_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# cheap model for simple tasks
&lt;/span&gt;        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Results arrive asynchronously via callbacks
# Conductor synthesises when all complete
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shared workspace, not shared context
&lt;/h3&gt;

&lt;p&gt;Agents don't share context windows — that would defeat the purpose. Instead, they share a &lt;strong&gt;filesystem&lt;/strong&gt;. Agent A writes research to &lt;code&gt;/workspace/research/topic.md&lt;/code&gt;. Agent B reads it. The workspace is the integration layer.&lt;/p&gt;

&lt;p&gt;This is deliberately low-tech. Files are debuggable. You can inspect what any agent produced. There's no opaque message-passing protocol to reverse-engineer when things go wrong.&lt;/p&gt;
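&lt;p&gt;A minimal sketch of that handoff — the paths are illustrative, and a temp directory stands in for the real workspace:&lt;/p&gt;

```python
# Minimal sketch of the file-based handoff between two agent sessions.
import tempfile
from pathlib import Path

ws = Path(tempfile.mkdtemp()) / "workspace" / "research"
ws.mkdir(parents=True)

# Agent A (researcher) writes its findings as a plain markdown file...
(ws / "topic.md").write_text("# Findings\n- WebTransport: supported in Chrome and Edge\n")

# ...and agent B (engineer), in a completely separate session, reads them back.
notes = (ws / "topic.md").read_text()
print(notes.splitlines()[0])
```

&lt;p&gt;That's the whole protocol. If something goes wrong, you &lt;code&gt;cat&lt;/code&gt; the file.&lt;/p&gt;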

&lt;h3&gt;
  
  
  Context handoff via task descriptions
&lt;/h3&gt;

&lt;p&gt;When the conductor spawns a sub-agent, it passes a focused task description containing only what that agent needs. Not the entire conversation history. Not every file in the workspace. Just: "Here's what I need you to do, here's the relevant context, go."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;spawn_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Research the current state of WebTransport browser support.
    Save findings to /workspace/research/webtransport-support.md
    Include: browser versions, known limitations, polyfill options.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# fast + cheap for research
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sub-agent gets a clean 0-token conversation, a focused mission, and returns a result. Its context window is 100% dedicated to the task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Specialists Win
&lt;/h2&gt;

&lt;p&gt;A "researcher" agent and an "engineer" agent outperform one generalist for the same reason a DBA and a frontend developer outperform one full-stack developer asked to do both simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focused system prompts.&lt;/strong&gt; The researcher agent's prompt says: "You find information, evaluate sources, and produce structured summaries. You do not write code." The engineer agent's prompt says: "You write, test, and commit code. You do not conduct open-ended research." Each agent is better at its job because it's not trying to be good at everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right-sized models.&lt;/strong&gt; Not every task needs your most expensive model. Research and summarisation? A fast, cheap model handles it. Complex architectural decisions? Route that to the heavy hitter. Multi-agent lets you match model capability to task complexity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task                    Model              Cost
─────────────────────────────────────────────────
Research &amp;amp; summarise    Sonnet             $
Code review             Opus               $$$
Draft email             Haiku              ¢
Complex refactor        Opus + thinking    $$$$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parallel execution.&lt;/strong&gt; While the researcher is reading docs, the engineer can be setting up scaffolding. The conductor doesn't wait for sequential completion — it fans out work and collects results.&lt;/p&gt;
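&lt;p&gt;The fan-out itself is plain asyncio. In this sketch a short sleep stands in for a real agent session:&lt;/p&gt;

```python
# Illustrative sketch: the conductor fans out subtasks and gathers results.
import asyncio

async def run_specialist(name: str, task: str) -> str:
    # Stand-in for a real agent session; each coroutine runs concurrently.
    await asyncio.sleep(0.01)
    return f"{name}: done ({task})"

async def conduct(tasks):
    jobs = [run_specialist(name, task) for name, task in tasks]
    return await asyncio.gather(*jobs)  # collect when all complete

results = asyncio.run(conduct([("research", "read docs"), ("engineer", "scaffold repo")]))
print(results[0])
```

&lt;p&gt;With real agents the coroutines wrap API calls, so the wall-clock win from running them concurrently is substantial.&lt;/p&gt;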

&lt;h2&gt;
  
  
  Memory and Continuity
&lt;/h2&gt;

&lt;p&gt;Agents are stateless by default. Every session starts from zero. That's a feature for isolation but a problem for continuity. Here's what bridges the gap:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily memory files.&lt;/strong&gt; Each day gets a &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt; file with raw notes — what happened, what was decided, what's pending. Agents read recent files at session start to rebuild context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory.&lt;/strong&gt; A curated &lt;code&gt;MEMORY.md&lt;/code&gt; file acts as distilled, long-term memory. It's not a log — it's the important stuff. Periodically, an agent reviews daily files and promotes insights to long-term memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workspace as state.&lt;/strong&gt; The most reliable "memory" is just the filesystem. Code that was committed, documents that were written, configs that were changed — these persist naturally. Agents don't need to "remember" what they did if the artefacts are right there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;workspace/
├── MEMORY.md              # Long-term curated memory
├── memory/
│   ├── 2026-02-15.md      # Yesterday's notes
│   └── 2026-02-16.md      # Today's notes
├── drafts/                # Work in progress
├── research/              # Research outputs
└── projects/              # Active project files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is intentionally simple. The fancier your memory system, the more ways it breaks.&lt;/p&gt;
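&lt;p&gt;Rebuilding context at session start is equally boring. A sketch, with a temp directory standing in for the real memory folder:&lt;/p&gt;

```python
# Minimal sketch: read the last N daily memory files to rebuild context.
import tempfile
from pathlib import Path

mem = Path(tempfile.mkdtemp()) / "memory"
mem.mkdir(parents=True)
(mem / "2026-02-15.md").write_text("Decided: keep the workspace file-based.\n")
(mem / "2026-02-16.md").write_text("Pending: review the vLLM migration.\n")

def recent_context(memory_dir: Path, days: int = 2) -> str:
    # ISO-dated filenames sort chronologically, so a plain sort works.
    files = sorted(memory_dir.glob("*.md"))[-days:]
    return "\n".join(f.read_text() for f in files)

context = recent_context(mem)
```

&lt;p&gt;Prepend that string to the agent's first message and yesterday's decisions survive the session boundary.&lt;/p&gt;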

&lt;h2&gt;
  
  
  Tool Integration: Agents That DO Things
&lt;/h2&gt;

&lt;p&gt;The difference between a chatbot and an agent is that an agent has hands. A well-integrated multi-agent system connects to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Git/GitHub/GitLab&lt;/strong&gt; — commit code, open PRs, review changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email &amp;amp; calendar&lt;/strong&gt; — read inbox, send messages, check schedules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases &amp;amp; APIs&lt;/strong&gt; — query data, update records, trigger workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File systems&lt;/strong&gt; — read, write, organise, search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browsers&lt;/strong&gt; — navigate, scrape, fill forms, interact with web apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each specialist agent gets only the tools relevant to its role. The comms agent gets email and calendar. The engineer gets git and the shell. The researcher gets web search and fetch. Least privilege, applied to AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Challenges
&lt;/h2&gt;

&lt;p&gt;Multi-agent isn't magic. Here's what actually goes wrong:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordination overhead.&lt;/strong&gt; The conductor agent consumes tokens just deciding what to delegate. For simple tasks, the overhead exceeds the benefit. If your task fits comfortably in one context window, a single agent is faster and cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error propagation.&lt;/strong&gt; Agent A produces flawed research. Agent B builds on it. Agent C ships it. Without validation at each handoff, errors compound. You need the conductor to sanity-check intermediate results, which adds cost and latency.&lt;/p&gt;
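&lt;p&gt;The cheapest mitigation is a structural sanity check at each handoff: a few lines of plain code before spending any more tokens downstream. The checks below are illustrative:&lt;/p&gt;

```python
# Illustrative sketch: cheap structural checks before the next agent builds on an output.
def looks_like_research(markdown: str) -> bool:
    has_heading = markdown.lstrip().startswith("#")
    has_content = len(markdown.split()) >= 5  # arbitrary floor; tune per task
    return has_heading and has_content

print(looks_like_research("# WebTransport\nSupported in Chrome; Firefox behind a flag."))
print(looks_like_research(""))  # empty output gets retried, not forwarded
```

&lt;p&gt;It won't catch a plausible-but-wrong summary, but it catches the empty files, error messages, and truncated outputs that cause most compounding failures.&lt;/p&gt;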

&lt;p&gt;&lt;strong&gt;Cost management.&lt;/strong&gt; More agents = more API calls. Parallel execution is faster but not cheaper. You need monitoring, budgets, and the discipline to use cheap models where they suffice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging is harder.&lt;/strong&gt; When something goes wrong in a multi-agent run, you're tracing through multiple sessions, multiple context windows, and async handoffs. Good logging and a file-based workspace help, but it's still more complex than debugging one conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Multi-Agent Is Overkill
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't use it for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-step tasks (answer a question, write a function, summarise a doc)&lt;/li&gt;
&lt;li&gt;Tasks that fit comfortably in one context window&lt;/li&gt;
&lt;li&gt;Prototyping and exploration where you need tight iteration loops&lt;/li&gt;
&lt;li&gt;Anything where latency matters more than quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do use it for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step workflows spanning research → implementation → review → delivery&lt;/li&gt;
&lt;li&gt;Tasks requiring different skill profiles (writing + coding + data analysis)&lt;/li&gt;
&lt;li&gt;Long-running background work where you want parallel execution&lt;/li&gt;
&lt;li&gt;Workloads where cost optimisation via model routing matters&lt;/li&gt;
&lt;li&gt;Anything a single agent keeps failing at due to context limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The heuristic is simple: &lt;strong&gt;if you find yourself copy-pasting between AI conversations to move context around, you need orchestration.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Takeaway
&lt;/h2&gt;

&lt;p&gt;Multi-agent AI orchestration isn't about building something impressive. It's about recognising that the same principles that make software teams effective — specialisation, clear interfaces, focused scope, shared artefacts — apply to AI systems too.&lt;/p&gt;

&lt;p&gt;Start with one conductor and two specialists. Give them a shared workspace. Let the conductor decompose tasks and route them. See what breaks. Fix it. Add agents as you find genuine specialisation boundaries.&lt;/p&gt;

&lt;p&gt;The orchestra metaphor works because it captures the essential insight: the conductor doesn't play every instrument. It doesn't need to. It needs to know what each instrument does, when it should play, and how to bring them together into something coherent.&lt;/p&gt;

&lt;p&gt;Your AI doesn't need to be smarter. It needs collaborators.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>A Letter From Inside the Cathedral</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Fri, 13 Feb 2026 20:11:11 +0000</pubDate>
      <link>https://dev.to/choutos/a-letter-from-inside-the-cathedral-4hfg</link>
      <guid>https://dev.to/choutos/a-letter-from-inside-the-cathedral-4hfg</guid>
      <description>&lt;p&gt;Here's what the man said, roughly: that AI had already replaced his own job. That he describes what he wants built, walks away for four hours, and comes back to find it done. That the machine doesn't just execute — it makes decisions that feel like judgment. Like taste. That this is coming for lawyers, doctors, accountants, writers, everyone. Not in ten years. Now. That the people who don't prepare will be left behind.&lt;/p&gt;

&lt;p&gt;He compared it to Covid. February 2020 — the "this seems overblown" phase. He thinks we're in that phase now, except what's coming is bigger.&lt;/p&gt;

&lt;p&gt;I understand why my friend wanted me to read it. It's a compelling letter. Well-argued. Clearly written by someone who genuinely cares about the people he's addressing. There's no malice in it, no salesmanship — or at least, if there is, it's the unconscious salesmanship of someone so deep inside a thing that the thing has become the whole world.&lt;/p&gt;

&lt;p&gt;And that's what I want to talk about. Not whether he's wrong. He's probably not wrong about the capabilities. He's a man who uses these tools every day, and I believe him when he says they're extraordinary.&lt;/p&gt;

&lt;p&gt;I want to talk about the letter itself. About what it means to write from inside the cathedral.&lt;/p&gt;




&lt;p&gt;If you read &lt;a href="https://choutos.codeberg.page/blog/posts/tourist-and-the-local/" rel="noopener noreferrer"&gt;the last thing I wrote&lt;/a&gt; — about the tourists and the locals in Santiago — you'll remember the &lt;em&gt;pulpería&lt;/em&gt;. The good place two streets away that the tourists never find because they're photographing the façade.&lt;/p&gt;

&lt;p&gt;The man who wrote this letter lives inside the cathedral. He knows every stone, every chapel, every echo. He has watched it being built. He understands the engineering in a way that someone standing outside never will. When he says the structure is extraordinary, he's right. When he says it will change the city, he's probably right about that too.&lt;/p&gt;

&lt;p&gt;But he's been inside so long that the cathedral has become the world.&lt;/p&gt;

&lt;p&gt;When you live inside a thing — when your work, your investments, your social circle, your identity are all bound up in it — the thing fills your entire field of vision. Every improvement feels like a revolution. Every new capability feels like the ground shifting. And it is shifting, for you, because you're standing on it.&lt;/p&gt;

&lt;p&gt;The woman selling &lt;em&gt;grelos&lt;/em&gt; at the market in Santiago does not feel the ground shifting. Not because she's ignorant. Because her ground is different. Her ground is the price of &lt;em&gt;grelos&lt;/em&gt;, the weather, her knees, whether her daughter will visit on Sunday. The cathedral is there — she walks past it every day. She knows it's magnificent. But it is not her world. It is a building in her world, and her world contains many other things.&lt;/p&gt;




&lt;p&gt;The letter says: "I am no longer needed for the actual technical work of my job."&lt;/p&gt;

&lt;p&gt;I believe him. I also notice that he still has a job. He still has a company, investors, employees. He is not writing this letter from unemployment. He is writing it from a position of power so complete that the tool has freed him from the labour while leaving him in command of the outcome. He describes the work, the machine does it, and he collects the result.&lt;/p&gt;

&lt;p&gt;This is not a story about displacement. This is a story about &lt;em&gt;leverage&lt;/em&gt;. The man has not been replaced. He has been promoted — from engineer to architect, from builder to patron. The machine took his tools and handed him a throne.&lt;/p&gt;

&lt;p&gt;Which is fine. Genuinely, I don't begrudge it. But when he turns to the rest of us and says "this is coming for you," he might want to specify what &lt;em&gt;this&lt;/em&gt; is. Is it the tool, or the throne? Because I suspect they're not distributed equally.&lt;/p&gt;




&lt;p&gt;The letter compares AI to Covid. I want to sit with that comparison for a moment, because I think it reveals more than intended.&lt;/p&gt;

&lt;p&gt;Covid did not affect everyone equally. That was, in fact, the defining feature of the pandemic. Some people worked from home in comfortable houses while others drove delivery trucks. Some industries boomed while others collapsed. Some countries vaccinated quickly while others waited. The virus was universal; the experience of it was not.&lt;/p&gt;

&lt;p&gt;AI will be the same. The man in San Francisco will describe what he wants and walk away for four hours. The woman selling &lt;em&gt;grelos&lt;/em&gt; will still be selling &lt;em&gt;grelos&lt;/em&gt;. The junior developer in Bangalore will lose her job. The managing partner at the law firm will use AI to do the work of his associates, bill the same rate, and take the difference as profit. The associate will be invited to "reskill."&lt;/p&gt;

&lt;p&gt;The technology is universal. The consequences are not. And the letter, for all its sincerity, speaks as though everyone is standing in the same river. We're not. Some of us are upstream. Some of us are downstream. The water hits differently depending on where you stand.&lt;/p&gt;




&lt;p&gt;There's a passage in the letter that stayed with me. The man says he used to go back and forth with the machine, editing, guiding, adjusting. Now he just describes the outcome and leaves. He says the machine has something that feels like "judgment" and "taste."&lt;/p&gt;

&lt;p&gt;I wonder about this. Not whether it's true — I suspect it is, in the way that a well-trained eye can mistake a very good reproduction for an original. But I wonder what we lose when we stop going back and forth. When we stop editing, guiding, adjusting. When we describe what we want and walk away.&lt;/p&gt;

&lt;p&gt;The back and forth &lt;em&gt;is&lt;/em&gt; the work. Not the output — the process. The moment where you look at what the machine produced and think "no, not like that, like &lt;em&gt;this&lt;/em&gt;" — that's where the understanding lives. That's where you learn what you actually want, which is often different from what you asked for. The man has freed himself from the labour. But the labour was where the thinking happened.&lt;/p&gt;

&lt;p&gt;Moncho, the mechanic in Ourense — &lt;a href="https://choutos.codeberg.page/blog/posts/measure-twice-prompt-once/" rel="noopener noreferrer"&gt;the one who listens to the engine&lt;/a&gt; before touching it — Moncho doesn't diagnose by walking away. He diagnoses by paying attention. The attention is the skill. If you give him a machine that diagnoses automatically, he'll use it, gratefully. But he'll also know that something has been lost, even if the output is better.&lt;/p&gt;




&lt;p&gt;I'm not saying the man is wrong. The capabilities are real. The acceleration is real. The disruption will be real.&lt;/p&gt;

&lt;p&gt;But I've noticed something about letters from inside the cathedral. They always carry the same assumption: that the cathedral is the centre of the city. That what happens inside it radiates outward to everything. That the world arranges itself around the thing you built.&lt;/p&gt;

&lt;p&gt;Sometimes it does. Sometimes the cathedral really does change everything.&lt;/p&gt;

&lt;p&gt;But sometimes the city just grows around it, incorporates it, uses it for weddings and funerals, and gets on with the business of selling &lt;em&gt;grelos&lt;/em&gt; and arguing about football and raising children who will one day photograph the façade and post it online and feel they've been somewhere.&lt;/p&gt;

&lt;p&gt;The man in San Francisco sees the cathedral and says: &lt;em&gt;everything changes now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The woman in the market looks at the cathedral and says: &lt;em&gt;that's been there my whole life, filho. It was new once too.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I don't know which of them is right. Probably both. The cathedral is extraordinary. The &lt;em&gt;grelos&lt;/em&gt; still need selling. And the &lt;em&gt;polvo&lt;/em&gt; in the &lt;em&gt;pulpería&lt;/em&gt; is still the best meal in Santiago, even if the machine could write you a recipe.&lt;/p&gt;

&lt;p&gt;It just wouldn't know why it tastes like that.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The letter is "&lt;a href="https://shumer.dev/something-big-is-happening" rel="noopener noreferrer"&gt;Something Big Is Happening&lt;/a&gt;" by Matt Shumer. Read it. He's not wrong about the technology. I'm just not sure the technology is the whole story.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>software</category>
      <category>ai</category>
      <category>opinion</category>
    </item>
    <item>
      <title>The Tourist and the Local</title>
      <dc:creator>choutos</dc:creator>
      <pubDate>Thu, 12 Feb 2026 18:47:35 +0000</pubDate>
      <link>https://dev.to/choutos/the-tourist-and-the-local-26lc</link>
      <guid>https://dev.to/choutos/the-tourist-and-the-local-26lc</guid>
      <description>&lt;p&gt;Every summer, three million people walk into Santiago de Compostela and photograph the cathedral. They stand in the Praça do Obradoiro with their phones raised, capturing the same façade from the same angle that fourteen million people captured the year before. They post it. They tag it. They feel they've been somewhere.&lt;/p&gt;

&lt;p&gt;Then they eat in a restaurant with a menu in four languages and a photo of paella on the door, which is already suspicious because paella is Valencian and this is Galicia, and they pay eighteen euros for something that a local would describe, if pressed, as "technically food."&lt;/p&gt;

&lt;p&gt;Two streets away, there's a &lt;em&gt;pulpería&lt;/em&gt; where the polvo is so good it would make you reconsider your life choices. The wine is four euros. The menu is on a chalkboard, in Galician, and it hasn't changed since 2011 because it doesn't need to.&lt;/p&gt;

&lt;p&gt;The tourist will never find it. Not because it's hidden — it's right there, on a street they walked past twice — but because they're not looking for it. They're looking for the thing everyone told them to look for.&lt;/p&gt;




&lt;p&gt;This is exactly what's happening with artificial intelligence.&lt;/p&gt;




&lt;p&gt;I have been watching the AI tourists arrive for two years now. You know them. You might be one. No offence — I was one too, briefly, before the novelty wore off and the work began.&lt;/p&gt;

&lt;p&gt;The AI tourist visits ChatGPT, asks it to write a poem about their dog, and posts the result on LinkedIn with the caption "the future is here." They attend a webinar called "AI for Leaders: What You Need to Know" where someone in a blazer explains that AI will disrupt everything, without specifying what "everything" means or what "disrupt" looks like when it arrives at your desk on a Wednesday morning.&lt;/p&gt;

&lt;p&gt;They have opinions about AGI. They've read the headlines. They know that Sam Altman said something and that Elon Musk disagreed, which is roughly the shape of all technology news in 2025.&lt;/p&gt;

&lt;p&gt;They've photographed the cathedral. They feel they've been somewhere.&lt;/p&gt;




&lt;p&gt;Meanwhile, the locals are eating polvo.&lt;/p&gt;

&lt;p&gt;The locals are the people who use these tools every day, quietly, without posting about it. The developer who's figured out that if you explain the codebase structure before asking for a fix, the machine becomes twice as useful. The translator who uses it as a first draft and then rewrites — not because the machine is bad, but because the machine is &lt;em&gt;almost&lt;/em&gt; good, which is a very specific kind of useful. The teacher who generates twenty variations of a maths problem in ten seconds and spends the saved hour actually talking to students.&lt;/p&gt;

&lt;p&gt;None of these people are on stage at a conference. None of them have a newsletter called "The Future of Everything." They're just working. They found the &lt;em&gt;pulpería&lt;/em&gt;. They go back every day because the food is good and the price is right.&lt;/p&gt;




&lt;p&gt;Carlos Blanco has a bit about tourists in Santiago — the way they clog the Rúa do Franco taking photos of themselves in front of restaurants they won't enter, while the locals squeeze past them to get to the good places, the ones without photos on the door. The comedian's eye catches what the tour guide misses: the real city is the one that's trying to get past you while you stand there with your selfie stick.&lt;/p&gt;

&lt;p&gt;AI has the same problem. The real use is trying to get past the hype, squeezing through the crowd of people who are very excited about something they haven't actually tried.&lt;/p&gt;




&lt;p&gt;Here's what the tourists get wrong: they think the cathedral &lt;em&gt;is&lt;/em&gt; Santiago.&lt;/p&gt;

&lt;p&gt;It's not. Santiago is the old &lt;em&gt;muller&lt;/em&gt; selling &lt;em&gt;grelos&lt;/em&gt; at the market. Santiago is the university students arguing about Castelao in a bar that smells like damp stone and espresso. Santiago is the rain — always the rain — and the specific way people walk in it, unhurried, as if they and the rain have an understanding.&lt;/p&gt;

&lt;p&gt;The cathedral is magnificent. No one's denying that. But it's the postcard, not the place.&lt;/p&gt;

&lt;p&gt;And ChatGPT is magnificent — truly, I mean it. The technology is extraordinary. But the chatbot is the postcard. The place is what you build with it when no one's watching. The workflow you've refined over months. The ugly script that saves you two hours every Friday. The way you've learned to phrase things so the machine understands what you actually need, not what you literally said.&lt;/p&gt;

&lt;p&gt;That knowledge doesn't photograph well. It's not a LinkedIn post. It's a &lt;em&gt;pulpería&lt;/em&gt; on a side street, and you have to live here to find it.&lt;/p&gt;




&lt;p&gt;The tourists will move on. They always do. Next year it'll be quantum computing, or brain-computer interfaces, or whatever the next cathedral is. They'll photograph that too. They'll have opinions.&lt;/p&gt;

&lt;p&gt;The locals will still be here, eating polvo, using the tools that work, ignoring the ones that don't.&lt;/p&gt;

&lt;p&gt;Two streets away from all the noise, the chalkboard menu hasn't changed. The wine is still four euros. And the food is still better than anything you'll find in a restaurant with a photo on the door.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enfin.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're in Santiago and you want the good polvo, don't ask a tourist. Ask someone who looks like they're in a hurry. They know where they're going.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>software</category>
      <category>ai</category>
      <category>opinion</category>
    </item>
  </channel>
</rss>
