DEV Community

choutos

Posted on • Originally published at wanderclan.eu

AI Makes Experienced Developers Slower. Here's Why.


By Vítor Andrade


Earlier this year, METR published a randomised controlled trial that should have shaken every engineering organisation awake. They took experienced open-source developers, gave them state-of-the-art AI coding tools, and measured what happened.

The developers got 19% slower.

Not junior developers. Not people unfamiliar with the codebases. Experienced contributors working on projects they knew intimately, with tools they chose themselves. And here is the part that should genuinely unsettle you: those same developers believed they were 24% faster.

A 43-percentage-point gap between perception and reality. The engineers most qualified to judge were the most wrong.

I have spent the past year helping engineering teams integrate AI agents into their workflows. I have watched this pattern play out dozens of times. A senior developer picks up Cursor or Claude Code, feels the rush of generating code at unprecedented speed, and starts shipping. Two months later, the codebase is worse. The tests are brittle. There is a layer of generated code that nobody fully understands, woven through the system like kudzu. And the team cannot figure out why velocity has not improved despite everyone feeling faster.

The METR study tells us why. But it does not tell us what to do about it.

I think I have found what works. And the answer is annoyingly boring.


The file that changed everything

Six months ago I was consulting with a team that had fully embraced AI coding tools. Every developer had Cursor. They were generating thousands of lines per day. And their bug rate had tripled.

I asked to see their onboarding documentation. There was none. I asked how a new developer would learn the codebase conventions. "They'd ask someone." I asked what happened when an AI agent needed to understand those same conventions. Silence.

We created a single file, dropped it in the root of their repository, and within two weeks their agent-generated code quality improved measurably. The file looked something like this:

```markdown
# AGENTS.md

## What This Project Does
Backend API for insurance claims processing.
Domain-specific terminology: a "claim" is not a "case" is not a "ticket."

## Tech Stack
- TypeScript (strict mode)
- Fastify
- PostgreSQL with Drizzle ORM

## How to Validate Your Work
- Tests: `npm test` (must pass 100%, no exceptions)
- Lint: `npm run lint` (strictest ESLint config)
- Type check: `npm run typecheck` (strict: true, no any)

## Coding Standards
- Never mock database calls. Use the test database.
- Never mock HTTP calls to internal services. Use the test fixtures.
- All new endpoints require integration tests, not unit tests.
- Error responses follow RFC 7807. No exceptions.

## Domain Knowledge
- Claims have a state machine: DRAFT -> SUBMITTED -> UNDER_REVIEW -> APPROVED | DENIED
- State transitions are the core business logic. Never bypass the state machine.
- "Adjuster" means the human reviewer, not an automated process.

## What Agents Get Wrong
- They try to add Express middleware. We use Fastify. Do not mix.
- They mock the database. This produces tests that pass and verify nothing.
- They create new utility files instead of using /src/shared/utils.
- They invent new error formats instead of using the RFC 7807 helper.
```

This is not a prompt. It is not a framework. It is an onboarding document for a colleague who has perfect recall, zero institutional knowledge, and no ability to tap someone on the shoulder and ask "wait, do we use Express or Fastify here?"

The insight is simple but easy to miss: if domain knowledge lives in engineers' heads, it does not exist for agents. An AI model working on your codebase is essentially fine-tuned on your codebase. Every file it reads shapes its output. When your codebase contains competing patterns, dead code, inconsistent conventions, and no written documentation, the model absorbs all of that confusion and reproduces it faithfully.

The AGENTS.md file is not magic. It is the minimum viable act of treating your AI tools as what they actually are: very fast, very literal colleagues who need explicit written context to do good work.
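Conventions like the state machine in that file work best when they are also enforced in code, so an agent cannot bypass them silently. A minimal TypeScript sketch (the names mirror the AGENTS.md example above, not any real client codebase):

```typescript
// Illustrative claim state machine; a single choke point for transitions.
type ClaimState = "DRAFT" | "SUBMITTED" | "UNDER_REVIEW" | "APPROVED" | "DENIED";

// Which states each state may legally move to.
const TRANSITIONS: Record<ClaimState, ClaimState[]> = {
  DRAFT: ["SUBMITTED"],
  SUBMITTED: ["UNDER_REVIEW"],
  UNDER_REVIEW: ["APPROVED", "DENIED"],
  APPROVED: [],
  DENIED: [],
};

// The only sanctioned way to move a claim between states.
function transition(from: ClaimState, to: ClaimState): ClaimState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

With one function as the choke point, an agent that tries to set a state directly stands out in review, instead of quietly bypassing the business logic.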


Why experienced developers get worse

The METR result is counterintuitive until you think about what experienced developers actually do differently from novices. An experienced developer carries a vast amount of implicit knowledge. They know which patterns the team prefers. They know the historical reasons behind odd architectural choices. They know which parts of the codebase are fragile. They know what "good" looks like for this specific project.

None of that transfers to an AI agent through a prompt.

When an experienced developer uses an AI tool naively, they are essentially delegating to an intern who has read the codebase but understood none of the tribal knowledge. And because the experienced developer trusts their own judgment, they review the output less carefully than a nervous junior would. They skim the generated code, recognise the general shape of what they asked for, and approve it. The code works. The tests pass. But something subtle is wrong: a convention violated, a pattern duplicated, an abstraction that does not fit the team's mental model.

Do that fifty times and the codebase has drifted. Do it five hundred times and you have a mess that neither humans nor agents can navigate efficiently.

This is the mechanism behind the METR result. It is not that AI tools are bad. It is that bolting AI onto existing workflows without changing anything else makes experienced developers' greatest asset, their implicit knowledge, into a liability. The knowledge stays in their heads while the agent works without it.

The fix is not to use AI less. The fix is to make implicit knowledge explicit. Write it down. Put it in the repository. Make it available to every agent and every new hire simultaneously.

This is, of course, exactly what every engineering manager has been asking their team to do for decades. The irony is thick enough to taste.


The discipline nobody wants to hear about

Here is the technique that separates teams who benefit from AI coding tools from teams who accumulate AI-generated technical debt:

Never fix bad agent output. Ever.

When an agent produces code that is wrong, sloppy, or subtly off, the instinct is to patch it. Fix the variable name. Add the missing error handling. Adjust the test. This feels efficient. You already have 90% of what you need; why throw it away?

Because that remaining 10% of wrongness stays in your codebase forever. And your codebase is the context window for every future agent run.

Think about it as a feedback loop. An agent reads your codebase, generates code in the style of your codebase, and commits it back. If the codebase is clean and consistent, the next agent run produces clean and consistent output. If the codebase contains patched-over slop, the next agent run produces more slop, slightly worse, which gets patched over again, which produces even worse slop.

Quality spirals upward. Slop spirals downward. There is no steady state.

So when agent output is bad, the discipline is: stop. Diagnose why the output was bad. Was the spec too vague? Was the AGENTS.md missing a key convention? Was the agent's scope too broad? Fix the root cause, then rerun from scratch.

This feels wasteful. It is the opposite of wasteful. It is the only way to keep the quality loop going in the right direction.
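The asymmetry is easy to make vivid with a toy exponential model. The numbers below are illustrative only, not measurements from the study or from any team:

```typescript
// Toy model of the feedback loop: each agent run nudges codebase
// "quality" by a constant factor. Purely illustrative numbers.
function afterRuns(quality: number, factor: number, runs: number): number {
  for (let i = 0; i < runs; i++) quality *= factor;
  return quality;
}

// Patching bad output: each run degrades the context slightly (0.99).
// Root-causing and rerunning: each run improves it slightly (1.01).
const patched = afterRuns(1.0, 0.99, 500);    // ~0.0066
const rootCaused = afterRuns(1.0, 1.01, 500); // ~145
```

A 1% nudge per run sounds negligible; compounded over five hundred runs, the two regimes end up four orders of magnitude apart. That is what "no steady state" means.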

I call this the recursive quality loop, though it does not need a name. The principle is ancient: the state of your workspace determines the quality of your work. Carpenters know this. Chefs know this. Software engineers have always known this in theory but rarely practice it, because the cost of a messy codebase was slow and diffuse. With AI agents, the cost is immediate and measurable.

A messy codebase does not just slow down human developers. It actively degrades the output of every AI tool that touches it. Codebase hygiene went from "nice to have" to "load-bearing infrastructure" the moment we started letting AI read and write our code.


The anti-mocking rule and other boring specifics

Let me get concrete about what "strict engineering discipline" looks like in practice, because the details matter more than the philosophy.

AI loves to mock things. Give an agent a task that involves database calls and it will, nine times out of ten, write tests that mock the database. The tests pass. They verify absolutely nothing. You now have a green CI pipeline and zero confidence that the code works.

The rule in every AGENTS.md I write: never mock what you can use for real. Use the test database. Use the test fixtures. Use the real HTTP client against a local service. If you cannot test something for real, that is a signal that your test infrastructure needs work, not that mocking is acceptable.

Strictest possible linting. Here is something I have found consistently: AI models conform to whatever standard you enforce. If your linter allows `any` types in TypeScript, the agent will use `any` types. If your linter forbids them, the agent will find the properly typed solution. Humans sometimes chafe under strict linting rules. AI never does. So enforce the strictest rules you can. You are not punishing your human developers. You are creating guardrails that improve every line of generated code.
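As a concrete sketch of what the guardrail buys you, assuming a rule like typescript-eslint's `no-explicit-any` is enforced: the agent cannot take the lazy path, and the typed alternative it is forced into is barely longer.

```typescript
// Under a lax config, an agent will happily reach for `any`:
//   function getTotal(payload: any) { return payload.total; }
// With `any` forbidden and strict mode on, it must narrow instead:
function getTotal(payload: unknown): number {
  if (
    typeof payload === "object" &&
    payload !== null &&
    "total" in payload &&
    typeof (payload as { total: unknown }).total === "number"
  ) {
    return (payload as { total: number }).total;
  }
  throw new Error("payload has no numeric total");
}
```

The `any` version compiles and then fails at runtime on bad input; the narrowed version turns the same bad input into an explicit, testable error.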

One agent, one task, one context. A five-step agent chain where each step is 95% accurate produces roughly 77% end-to-end reliability. That is not a theoretical concern. I have watched teams build elaborate multi-agent pipelines that fail in production because errors compound at every handoff. The simpler approach works better: give one agent a well-scoped task with complete context, let it execute in one shot, validate the output through automated checks, then have a human review the result.
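The arithmetic behind that 77% figure, assuming each step fails independently, is just exponentiation:

```typescript
// End-to-end reliability of a chain of independent steps.
function chainReliability(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

chainReliability(0.95, 5); // ≈ 0.774 — five "95% accurate" handoffs
chainReliability(0.95, 1); // 0.95 — one well-scoped, single-shot task
```

Every handoff you remove moves you closer to the per-step number, which is why the single-agent, single-task shape keeps winning.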

Holdout test cases. When an agent writes both the code and the tests, there is a risk of teaching to the test. The agent produces code that passes its own tests but fails on edge cases it never considered. The fix: maintain a set of test specifications that the agent never sees during development. Run them after the agent claims to be done. This is the software equivalent of a double-blind trial.
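A holdout harness does not need to be elaborate. A minimal sketch, with the cases kept as data in a location the agent's context never includes:

```typescript
// Holdout cases live outside the agent's working context and are only
// run after it declares the task done. Names here are illustrative.
type HoldoutCase<In, Out> = { input: In; expected: Out };

function runHoldout<In, Out>(
  impl: (input: In) => Out,
  cases: HoldoutCase<In, Out>[],
): { passed: number; failed: number } {
  let passed = 0;
  for (const c of cases) {
    try {
      // Structural comparison so object outputs also work.
      if (JSON.stringify(impl(c.input)) === JSON.stringify(c.expected)) passed++;
    } catch {
      // A thrown error counts as a failure.
    }
  }
  return { passed, failed: cases.length - passed };
}
```

Wire it into CI as a separate job, and "the agent says it is done" stops being the finish line.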

None of these techniques are revolutionary. Strict linting, real integration tests, well-scoped tasks, independent verification. This is the engineering discipline that good teams have always practiced. The difference is that with AI agents, the penalty for skipping these practices is immediate and severe rather than slow and diffuse.


What Stripe figured out

Stripe merges over 1,300 agent-written pull requests every week. Zero human-written code in those PRs. That number is worth sitting with, because it tells you something important: this is not a prototype. This is production engineering at one of the most demanding technical organisations in the world.

Their system, which they call Minions, is architecturally simple. An agent receives a blueprint (a structured specification describing what to build), runs in an isolated sandbox with a full codebase checkout and running test infrastructure, produces a pull request, and a human reviews it.

That is it. There is no magical model. There is no secret sauce. The architecture is: good specs, isolated execution, automated validation, human review. Everything I have described in this post, taken to scale.

The part that interests me most is their tooling layer. Stripe built a centralised server hosting over 400 internal tools that any agent can access through a single interface. Authentication, permissions, audit logging, all flowing through one chokepoint. When someone builds a new tool, every agent in the organisation benefits immediately.

You do not need to be Stripe to apply these principles. The blueprint is just a detailed spec. The sandbox is just a git worktree with a test database. The tooling layer is just a collection of scripts behind a consistent interface. The principles scale down to a team of three. What does not scale down is the discipline.
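To make "a collection of scripts behind a consistent interface" concrete, here is a toy registry. Nothing in it is Stripe's actual design; it is the scale-down sketch, with auditing reduced to an in-memory log:

```typescript
// Toy central tool registry: one interface, one audit chokepoint.
type Tool = (args: Record<string, unknown>) => unknown;

const tools = new Map<string, Tool>();
const auditLog: { tool: string; at: number }[] = [];

// Register once; every agent that goes through invokeTool benefits.
function registerTool(name: string, tool: Tool): void {
  tools.set(name, tool);
}

function invokeTool(name: string, args: Record<string, unknown>): unknown {
  const tool = tools.get(name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  auditLog.push({ tool: name, at: Date.now() }); // every call is recorded
  return tool(args);
}
```

In a real setup the registry would sit behind a server with authentication and permissions, but the shape is the same: tools are registered in one place, invoked through one interface, and every call leaves a trace.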


The culture problem disguised as a tools problem

I want to say something that might be unpopular in a space that loves to debate models and frameworks and tooling:

The technology is not the bottleneck. The models are good enough. The tools exist. The patterns are documented. Stripe proved it works. Anthropic proved it works (90% of Claude Code is written by Claude Code). The evidence is overwhelming.

The bottleneck is engineering culture.

It is the team lead who insists every developer write code by hand because "that is how you learn." It is the architect who refuses to write specs because "the code should speak for itself." It is the senior developer who builds brilliant personal workflows with custom prompts and local scripts and never shares any of it. It is the organisation that buys Copilot licences, does nothing else, and wonders why productivity did not improve.

The solution to the METR problem, the reason experienced developers get slower, is not better AI. It is better engineering practice. And the specific practices that fix it are exactly the things engineers have resisted for decades:

Write documentation. Real documentation, not the kind that gets written once and never updated, but living documents that encode how the team actually works.

Enforce strict standards. Not as punishment but as infrastructure. Every linting rule, every type constraint, every test requirement is a guardrail that improves the output of every AI tool in your pipeline.

Write detailed specs before writing code. Not because you enjoy bureaucracy but because the quality of the specification determines the quality of the output, whether the executor is human or artificial.

Clean up the codebase. Remove dead code, resolve competing patterns, make conventions explicit. This was always good practice. Now it is load-bearing.

Share everything. Stop building local workflows. Land your tools, your prompts, your configurations in the shared repository. Let them compound. A custom script on your laptop helps you. The same script in the team repo helps every developer and every agent, forever.

The compound effect is real and it is dramatic. Every AGENTS.md update, every new tool, every validation rule stays in the codebase. The next agent benefits from everything the last one contributed. Good practices accumulate. The flywheel turns faster. But only if the practices are shared, encoded, and maintained.


Where to start on Monday morning

If any of this resonates, here is what I would do in your position, in order:

First hour. Create an AGENTS.md file in your main repository. Start with what the project does, how to run and test it, and the three most common mistakes an unfamiliar developer would make. Commit it. This single file will immediately improve every AI-assisted coding session against that repo.

First week. Tighten your linting and type-checking to the strictest settings your codebase can handle. Fix the violations. This is not about style preferences. It is about creating a codebase that produces better AI output.

First month. Establish the "never fix bad output" rule. When an agent produces something wrong, resist the urge to patch. Diagnose, fix the root cause (update the spec, update the AGENTS.md, narrow the scope), and rerun. This will feel slow at first. It is the only path to the upward quality spiral.

First quarter. Start sharing. Turn personal prompts into team templates. Turn local scripts into repository tools. Start writing specs before sending tasks to agents. Measure the trend, not the day.

You will hit a dip. The METR study measured that dip. It is real, it lasts weeks to months, and it happens because you are rewiring workflows, not because the tools do not work. The teams that push through the dip come out the other side measurably faster. The teams that give up during the dip conclude that AI coding tools are overhyped and go back to what they were doing before.

The gap between those two groups is widening every month.


The boring conclusion

The most effective thing you can do to get value from AI coding tools is the same thing engineering managers have been begging for since the profession began: write things down, enforce standards, clean up after yourself, and share your work.

That is it. That is the entire insight.

AI did not change what good engineering practice looks like. It changed the cost of ignoring it. The penalty for a messy codebase, missing documentation, lax standards, and tribal knowledge used to be slow and survivable. Now it is fast and compounding.

The developers who figure this out first will have an extraordinary advantage. Not because they have access to better models or fancier tools, but because they treated the arrival of AI as a reason to finally do all the boring things right.

And that, I think, is the real lesson of the METR study. AI did not make those experienced developers slower. Their undocumented, implicit, head-only engineering practices made them slower. The AI just made it visible.
