If you write code with AI, you know the drill — better prompts, better models, bigger context windows. That's what everyone's optimizing for. I was too, until I noticed something weird.
I went the other direction. I made my codebase easier for a dumb AI to work in. And my results got dramatically better.
Here's what I mean.
What I Learned From You
My last post on the METR benchmark blew up in the comments. Several of you pointed out something I'd missed — that the real bottleneck isn't AI capability, it's how we structure the work we hand it. That insight directly shaped what I'm about to share.
The 38-out-of-40 Problem
A few months ago I used Cursor to refactor a function signature across 40 files. 38 were perfect. The other 2 had subtle type-narrowing bugs: the function's generic type was narrowed correctly everywhere except in those 2 files, which sat on a more complex type hierarchy.
Local tests passed for all 40 files. The bugs showed up 3 days later.
I spent a while blaming Cursor. Then I looked at the 2 files that failed. They were the most coupled files in the codebase. They had implicit dependencies on a type hierarchy that spanned 4 directories. Understanding them required holding the entire module graph in your head.
The AI didn't fail because it was dumb. It failed because those 2 files required global knowledge to edit correctly, and AI operates on local context. That's not a bug — it's a structural limitation that won't disappear with better models.
The Pattern That Changed Everything
I started noticing something:
| Code structure | AI accuracy | Failure mode |
|---|---|---|
| Clean module, explicit interface | ~95% | Rare, caught by tests |
| Moderate coupling, some implicit deps | ~80% | Occasional, usually obvious |
| Tight coupling, implicit dependencies | ~60% | Plausible-looking bugs that pass local tests |
AI performance was almost perfectly correlated with how well-structured my code was.
A comment on one of my earlier posts completely reframed this for me:
"AI is a direction amplifier — clean code gets cleaner, garbage code gets worse."
— if that was you, thank you. It changed how I think about architecture.
The first few thousand lines of a project decide everything that comes after.
I'm no longer designing code for human readability alone. I'm designing it so that an AI with a limited context window can work on any single module without needing to understand the whole system.
What "Designing for a Dumb AI" Looks Like
Here's what changed in practice:
Explicit interfaces everywhere
If a module depends on behavior from another module, that dependency is declared in a type, not implied by convention. The AI doesn't need to know why things are connected — it just needs to see the interface contract.
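Here's a minimal sketch of what I mean, with hypothetical names (`Mailer`, `registerUser`) invented for illustration — the point is that the signup code depends on a declared interface, not on a concrete SMTP module it happens to import:

```typescript
// The dependency is a declared contract, not an implicit convention.
interface Mailer {
  send(to: string, subject: string, body: string): Promise<void>;
}

// The signup module only sees the interface. An AI editing this file
// needs zero knowledge of how email actually gets delivered.
async function registerUser(email: string, mailer: Mailer): Promise<string> {
  const userId = `user_${email.split("@")[0]}`; // stand-in for real persistence
  await mailer.send(email, "Welcome!", "Thanks for signing up.");
  return userId;
}
```

Because the contract is explicit, the AI (or a test) can swap in any `Mailer` implementation without touching this file.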
Smaller files
I used to have 500-line files with multiple responsibilities. Now I split aggressively. Not because I suddenly care about the single responsibility principle for aesthetic reasons — but because an AI working on a 100-line file with clear boundaries makes fewer errors than an AI working on a 500-line file with tangled concerns.
Tests that document behavior, not implementation
My tests used to be tightly coupled to internal structure. Now they test observable behavior through public interfaces. This means the AI can refactor internals freely — as long as the behavioral tests pass, the refactor is correct.
The AI doesn't need to understand how the code works, only what it's supposed to do.
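A toy example of the difference, using a made-up `slugify` helper — the test pins down observable input/output pairs, never the internals:

```typescript
// Public contract: a title in, a URL-safe slug out.
export function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // internal detail: regex-based cleanup
    .replace(/^-+|-+$/g, "");    // trim stray leading/trailing dashes
}

// Behavior-based tests: they assert only what the function is supposed
// to do, so an AI can replace the regex approach with anything that
// preserves these results, and the tests still prove the refactor safe.
console.assert(slugify("Hello, World!") === "hello-world");
console.assert(slugify("  Spaces  Everywhere  ") === "spaces-everywhere");
```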
Configuration in one place
I had environment variables scattered across 12 files with 3 different naming conventions. AI would sometimes invent new config keys because it didn't find the existing one. Now there's a single config module that exports everything. The AI always knows where to look.
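A sketch of that single config module (names like `DATABASE_URL` and the fallbacks are illustrative, not from my actual project) — every environment variable is read and validated in one file, and everything else imports from it:

```typescript
// The only file in the codebase allowed to touch process.env.
function requireEnv(name: string, fallback?: string): string {
  const value = process.env[name] ?? fallback;
  if (value === undefined) {
    throw new Error(`Missing required env var: ${name}`);
  }
  return value;
}

// One naming convention, one place to look, one place for the AI to
// find (or fail loudly on) every config key.
export const config = {
  databaseUrl: requireEnv("DATABASE_URL", "postgres://localhost/dev"),
  apiTimeoutMs: Number(requireEnv("API_TIMEOUT_MS", "5000")),
  logLevel: requireEnv("LOG_LEVEL", "info"),
} as const;
```

Now "invent a new config key" becomes a visible diff in one file instead of a stray `process.env.FOO` buried somewhere.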
No "clever" code
Metaprogramming, dynamic dispatch based on string matching, monkey-patching — all of these are invisible to an AI reading your code. I replaced clever patterns with boring, explicit ones. More lines of code, but the AI (and honestly, future me) can actually follow the logic.
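For instance, here's the kind of swap I mean, with a hypothetical event handler. The "clever" version dispatches by building a method name from a string, so nothing in the file tells a reader (human or AI) which handlers exist; the boring version is an exhaustive `switch` the compiler can check:

```typescript
// Clever version (don't do this): invisible to anyone reading one file.
//   const handler = (handlers as any)["on" + capitalize(event)];
//   handler(id);

// Boring, explicit version: every case is named, and TypeScript's
// exhaustiveness checking flags any event this switch forgets.
type AppEvent = "created" | "updated" | "deleted";

function handleEvent(event: AppEvent, id: string): string {
  switch (event) {
    case "created":
      return `inserted ${id}`;
    case "updated":
      return `patched ${id}`;
    case "deleted":
      return `removed ${id}`;
  }
}
```

More lines than the string-matching trick, but the full dispatch table is right there in the AI's context window.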
The Uncomfortable Implication
Here's what made this click: every change I made to help the AI also made the code better for humans.
| What I did for AI | What it actually is |
|---|---|
| Explicit interfaces | Good API design |
| Smaller files | Separation of concerns |
| Behavior-based tests | What TDD always recommended |
| Single config module | Single source of truth |
| No clever code | Maintainability |
The AI didn't teach me anything new. It just brutally exposed the places where I was cutting corners. The AI can't work around implicit assumptions the way a human teammate can. It takes your code at face value. If the structure is sloppy, the AI's output will be sloppy — but confidently sloppy, which is worse.
Chris Lattner, reviewing Claude's attempt to build a C compiler: "AI tends to optimize for passing tests rather than building general abstractions."
That's exactly right. The AI will make your specific test pass with a specific hack. It won't step back and think about whether the abstraction is right. That's your job — and the best way to do that job is to make the architecture so clear that even a "dumb" AI can't go wrong within any single module.
The Trade-off I'm Still Working Through
This approach has a real cost: it takes more upfront effort.
Splitting files, writing explicit interfaces, refactoring tests from implementation-coupled to behavior-coupled — these aren't free. On a new project, you're building more scaffolding before you start producing features.
I don't have a clean answer for when this pays off. For a weekend hack or a prototype, it probably doesn't. For anything you'll maintain for more than a month — especially with AI tools — I'm increasingly convinced it pays for itself within the first week.
But I want to be honest: I'm still figuring out where the line is. Sometimes I over-split and end up with too many files that are individually trivial. Sometimes the "explicit interface" adds boilerplate that makes the code harder to scan. I haven't found the perfect balance yet.
We're all figuring this out in real-time. Nobody has a playbook for "how to architect code when your co-author is a probabilistic model." If you've found patterns that work, I want to hear them — your comments on my last few posts have already changed my approach more than any blog post I've read.
What I'm Not Saying
I'm not saying models don't matter. They do. GPT-4 is better than GPT-3.5 at the same task in the same codebase.
What I am saying is:
The ceiling on model improvements is lower than people think, and the ceiling on structural improvements is higher than people think.
Upgrading from Sonnet to Opus gives you maybe a 10-20% improvement on hard tasks. Refactoring a tangled module into clean components with explicit interfaces can take AI accuracy from 60% to 95% on that module — regardless of which model you use.
The highest-leverage thing you can do for AI coding isn't choosing the right model or writing the right prompt. It's making your code so clear that even a mediocre model can't screw it up.
The Question I Keep Coming Back To
If AI works best on clean, explicit, well-structured code — and clean, explicit, well-structured code is also what humans work best on — then maybe "designing for AI" and "designing well" are converging.
And if that's true, then the developers who'll get the most from AI aren't the ones who master prompt engineering. They're the ones who already write clean code — or who start now.
I genuinely don't take your attention for granted — you just spent 8 minutes thinking about code architecture with me instead of doom-scrolling Twitter. So here's my real question: have you noticed this pattern in your own codebase? That cleaning up one tangled module suddenly made AI dramatically better at working with it? I'm collecting these stories because I think there's something bigger here that none of us have fully articulated yet.
P.S. — I built 3 skill files that automate the verification side of this — spec before code, checkpoint before changes, structured review after. They won't fix your architecture, but they catch the problems that slip through even in clean codebases.
Top comments (7)
Yes, if you write clear code and have a good and simple architecture, then AI tools don't have a problem working with it - but if your code is a mess, chances are that AI makes an even bigger mess of it - piggy see, piggy do! ;-)
Ha, "piggy see, piggy do" is the perfect summary. I've actually started using AI output quality as a signal for code health — if the AI keeps generating confused or contradictory code in a module, that's usually telling me the module itself needs refactoring before anything else.
Exactly - you give it a good quality codebase, it will tend to generate good stuff - you give it a piece of crap, it will faithfully mimic just that - remember: AI has no "initiative" or "own will" - and that's a good thing I would say! (if it starts getting a will of its own then we might all be doomed ...)
"You give it a piece of crap, it will tend to generate good stuff" — wait, that's an interesting edge case I hadn't considered. You're saying AI sometimes "corrects upward" from bad code? I've seen the opposite more often — it mimics the existing patterns — but maybe there's a threshold where the code is so clearly wrong that the model defaults to its training distribution instead. That would actually be a useful signal: if AI ignores your codebase patterns, maybe your patterns are the problem. Worth investigating.
Haha no, I think you were somehow misreading what I said - I was saying the opposite :-)
Give it a good example -> it tends to give you good results
Give it a bad example -> it tends to give you bad results
Thanks for sharing this article — I found myself nodding along with nearly every point you made. I completely agree with your thinking.
In highly cohesive systems where business logic and workflows are tightly coupled and complex, AI really does struggle to be a comprehensive developer. It’s great at generating isolated code when the contract is clear, but it can’t yet reason about the why behind architectural boundaries or decompose a system into meaningful tasks on its own. A human architect still needs to break down the problem, define clear interfaces, and assign responsibilities — then AI can assist within those defined slices.
Looking forward to more of your insights and writing on this topic!
You nailed the core issue — AI can generate code when the contract is clear, but it can't decompose. That decomposition step is where most of the real engineering lives, and it's the part that's hardest to hand off. I've been experimenting with writing very explicit "decision records" for each module boundary, almost like API docs but for why the boundary exists. It helps AI stay in its lane, and honestly it helps the team too.