How much of your codebase should AI write? A trust-zone breakdown

Around 70% of the Rails code I ship is AI-written. The 30% I keep under my own hands is the part that names the system: what things are, who is allowed to do what with them, where value moves between them. Any AI agent will generate code faster than anyone can read it. The project vernacular has to stay in someone's head, and that someone is me.

One important takeaway from a year of intense agentic coding: the code was never the asset; the vocabulary was. Everything downstream of the labels we give concepts is a commodity. The implementation, the tests against it, the bug fixes when something breaks: those are all expendable. The naming is the part you cannot outsource without becoming a stranger in your own product.

If I cannot explain those names in an elevator pitch, I cannot reason about the product. That is the part of the work I keep.

What I own: the spec

The 30% I keep is mainly the spec. Concretely: the end-to-end test that names what success looks like for the user, the test that names the rules of access, sometimes a focused test that names the contract for one piece of the system. Different idioms, same job. Each makes a name precise enough that the implementation can follow. The behavior is the product. Implementation is exchangeable.

In a Rails app the spec ownership materializes in three places. In these places I will sometimes write the implementation myself as well. Writing the code by hand is the fastest way to confirm the mental model of the system.

Auth boundaries. Who can do what. The rules that decide whether a logged-in user is allowed to see another user's records, edit their settings, charge their card. In a Rails app these usually live in a small set of rule files (a library like Pundit is the common choice) plus the login layer. This is the single most crucial security concern of an application to get right. The simplest check is to log in as two different users and try to access each other's URLs; if that works, you have a hole. The rules are the answer to "who is allowed to do what here," and that answer names the system's verbs - the actions a system is able to perform.
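The two-user check can be pinned down as a policy object. What follows is a minimal sketch in plain Ruby that mirrors Pundit's convention of a policy class built from a user and a record, with one predicate method per action. The `User`, `Invoice` and `InvoicePolicy` names and the admin rule are hypothetical; in a real app the records would be ActiveRecord models and the policy would live under `app/policies`.

```ruby
# Hypothetical domain objects, stand-ins for ActiveRecord models.
User    = Struct.new(:id, :admin, keyword_init: true)
Invoice = Struct.new(:id, :owner_id, keyword_init: true)

# A policy in the shape Pundit expects: built from (user, record),
# with one predicate per verb the system allows.
class InvoicePolicy
  attr_reader :user, :invoice

  def initialize(user, invoice)
    @user = user
    @invoice = invoice
  end

  # Owners and admins may read an invoice.
  def show?
    owner? || user.admin
  end

  # Only the owner may change it; admins get read access, not write access.
  def update?
    owner?
  end

  private

  def owner?
    invoice.owner_id == user.id
  end
end

alice = User.new(id: 1, admin: false)
bob   = User.new(id: 2, admin: false)
alices_invoice = Invoice.new(id: 10, owner_id: alice.id)

# The two-user check: Bob must not reach Alice's record.
puts InvoicePolicy.new(alice, alices_invoice).show? # prints true
puts InvoicePolicy.new(bob, alices_invoice).show?   # prints false
```

Each predicate reads as one sentence of the access rules, which is what makes the file small enough to own by hand.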

Anything touching money or personal data. The code that listens for "this customer just paid" from Stripe, anything that writes to billing records, anything that reads fields marked as sensitive. Stripe gives you a one-liner that confirms the message is really from them; the agent will sometimes skip it because the happy path works without it. It's essential to get this right or you will lose money, so I will often implement it myself. The path the money takes through the system is a dictionary entry in its own right, and I want to be able to draw it on a napkin.
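In the Stripe Ruby gem the one-liner is `Stripe::Webhook.construct_event(payload, signature_header, endpoint_secret)`, which raises `Stripe::SignatureVerificationError` on a mismatch. To show what that call protects against, here is a stdlib-only sketch of the same idea, based on Stripe's documented scheme: an HMAC-SHA256 over `"<timestamp>.<payload>"`, compared against the `v1` entry of the `Stripe-Signature` header. The helper names, the test secret and the 300-second tolerance are assumptions for illustration; production code should use the gem's call rather than this sketch.

```ruby
require "openssl"

# Illustrative only: in a real app, call the Stripe gem's verifier instead.
class InvalidSignature < StandardError; end

# Constant-time comparison so timing does not reveal how many bytes matched.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# Verifies a header of the form "t=<unix time>,v1=<hex hmac>".
# Raises instead of returning false, so a failed check cannot pass silently.
def verify_webhook!(payload, header, secret, now: Time.now.to_i, tolerance: 300)
  parts      = header.split(",").map { |pair| pair.split("=", 2) }
  timestamp  = parts.find { |key, _| key == "t" }&.last
  signatures = parts.select { |key, _| key == "v1" }.map(&:last)
  raise InvalidSignature, "malformed header" if timestamp.nil? || signatures.empty?

  expected = OpenSSL::HMAC.hexdigest("SHA256", secret, "#{timestamp}.#{payload}")
  unless signatures.any? { |signature| secure_compare(expected, signature) }
    raise InvalidSignature, "signature mismatch"
  end

  # Reject old messages to limit replay of a captured request.
  raise InvalidSignature, "timestamp outside tolerance" if (now - timestamp.to_i).abs > tolerance

  true
end

# Usage with a locally signed message (the secret is a made-up test value).
secret    = "whsec_test_secret"
payload   = '{"type":"checkout.session.completed"}'
timestamp = Time.now.to_i
signature = OpenSSL::HMAC.hexdigest("SHA256", secret, "#{timestamp}.#{payload}")
puts verify_webhook!(payload, "t=#{timestamp},v1=#{signature}", secret) # prints true
```

The happy path behaves identically with or without the check, which is exactly why a passing demo says nothing about whether it is there.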

The first pass at a new concept. What things are called and how they relate. When I add a new entity to the app (a Project, a Booking, a Tax Invoice), the agent is good at translating a clear name into working code. It is poor at choosing the name and connecting nouns via verbs ("A user purchases an item"). Wrong name, wrong shape, wrong connections: these compound for months and cost more to fix than the original code cost to write. I sketch the model in my mind, name it, decide what it connects to, then write a few focused tests against the interface I want (which fields are required, what it connects to, the operations I will need) and let the agent fill in the file.
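A few focused tests against the interface might look like the following sketch. `Booking`, its `guest` and `room` fields, and the `cancel!` operation are hypothetical, and a plain Ruby class stands in for the ActiveRecord model so the file runs on its own; Minitest is used because it is what Rails ships with by default.

```ruby
require "minitest/autorun"

# Written by hand first: the contract for a hypothetical Booking.
class BookingContractTest < Minitest::Test
  def test_requires_a_guest_and_a_room
    assert_raises(ArgumentError) { Booking.new(room: "101") }
    assert_raises(ArgumentError) { Booking.new(guest: "ada") }
  end

  def test_starts_confirmed_and_can_be_cancelled
    booking = Booking.new(guest: "ada", room: "101")
    assert_equal :confirmed, booking.status

    booking.cancel!
    assert_equal :cancelled, booking.status
  end

  def test_cannot_be_cancelled_twice
    booking = Booking.new(guest: "ada", room: "101")
    booking.cancel!
    assert_raises(Booking::AlreadyCancelled) { booking.cancel! }
  end
end

# Filled in afterwards; in the workflow described above, by the agent.
class Booking
  class AlreadyCancelled < StandardError; end

  attr_reader :guest, :room, :status

  # Required keyword arguments raise ArgumentError when one is missing.
  def initialize(guest:, room:)
    @guest = guest
    @room = room
    @status = :confirmed
  end

  def cancel!
    raise AlreadyCancelled, "booking for #{guest} is already cancelled" if status == :cancelled

    @status = :cancelled
  end
end
```

The test class fixes the names (`guest`, `room`, `cancel!`, `AlreadyCancelled`) before any implementation exists, so the agent has a vocabulary to fill in rather than one to invent.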

For everything else, the agent writes both the test and the code, but I have to confirm the spec first. The spec is small enough and the failure mode contained enough that I do not need to be the one naming things.

What the agent is genuinely good at

For small, well-defined work it needs almost no setup. A model method with clear arguments and return. A view component with explicit props. A system test once the scenario is written in plain English. A one-shot script. CRUD scaffolding for entities whose shape is already locked.

For anything bigger, the planning phase matters more than the writing phase. The agent can plan a feature end-to-end, but the first plan is almost always overbuilt: defensive abstractions for problems you do not have, optionality you will not use, indirection that hides what the system actually does. The fix is to iterate on the plan before any code is generated, and this is where you will want an expert eye on it. A useful trick is to feed the plan file to a second agent and ask it to cut the fluff; the second one has no stake in the first agent's elaborations and will strip them without ceremony. OpenSpec takes this further by pinning specs into the repository as their own markdown files, so the plan outlives the chat session.

It is also good at three things that are not "writing new code." Refactoring an existing chunk of code into a shape that reads better while keeping the tests green. Finding edge cases I had not considered, even when I dismiss them at first. Research that would take me an afternoon and takes the agent ten minutes.

I use Claude Code mainly, and Opencode for reviews. The model is not what changed my work, though. The loop did.

The defensive code problem

Ask the agent to make a feature work, and it will make it work. But it has no preference for a certain code style, so it ships everything in safety wrappers (if ... else blocks) to cater for every obscure edge case. It all looks like prudent engineering, and it doubles the length of every method.

The deeper problem is that this type of defensive code is not confident. Code calling other code should be telling it what to do, not asking what it is able to do. Preparing for every failure mode seems wise, but it is a job for the called code, not the caller. This is validation at the boundary, paired with failing fast: a method takes arguments and, upon being called, decides whether it can act on them. If it cannot, it should fail loudly so the calling code can reconcile.
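The contrast is easier to see side by side. The sketch below uses a hypothetical `Customer` struct and two made-up methods: the first is the kind of wrapper-heavy code described above, the second checks its input once and raises.

```ruby
Customer = Struct.new(:name, :email, keyword_init: true)

# Defensive style: every step asks whether the next one is possible,
# and every failure quietly degrades to a fallback.
def welcome_line_defensive(customer)
  if customer
    if customer.respond_to?(:name) && customer.name
      if !customer.name.strip.empty?
        "Welcome, #{customer.name.strip}!"
      else
        "Welcome!"
      end
    else
      "Welcome!"
    end
  else
    "Welcome!"
  end
end

# Confident style: the method checks its argument once at the door and
# fails loudly. Passing nil raises NoMethodError, a blank name raises
# ArgumentError; either way the caller finds out immediately.
def welcome_line(customer)
  name = customer.name.to_s.strip
  raise ArgumentError, "customer needs a name" if name.empty?

  "Welcome, #{name}!"
end

puts welcome_line(Customer.new(name: " Ada ", email: "ada@example.com")) # prints Welcome, Ada!
```

The defensive version never fails, which means a customer record with a missing name ships a generic greeting and nobody learns that the data is broken.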

Apart from other advantages, confident code is easier to read and thus easier to keep in mind. The next person reading it can hold the whole thing in their head. That next person is usually you, three months later, trying to remember what the agent put together.

You only see the bloat if you are still maintaining a model of the system. Once the codebase outgrows what fits in your head, every safety wrap looks reasonable, and you become the surprised tenant of your own code.

The loop

Spec or test first, build to green, lock before moving on. That sentence is the whole methodology.

Spec or test first means writing the scenario in plain English before any code gets generated. "When a logged-in Pro user clicks Upgrade, they should land on the Stripe checkout page for the Pro Plus tier." That sentence becomes a system test, the system test becomes the agent's brief, and the agent has something concrete to satisfy instead of a vague intention.

Build to green means accepting nothing that does not pass the test. The agent will happily generate code that compiles and runs and is wrong in three places you cannot see. The system test is what catches the gap between "looks right" and "is right."

Lock before moving on means committing the slice as soon as it is green and starting the next one fresh. The longest agent conversations are the most error-prone ones. A commit forces a clean reset.

Skip any of the three and the agent will ship code that looks right and is wrong in three places, and you will not catch it until something breaks in production.

A discipline older than AI

Spec-driven design is everywhere right now: Substack posts, conference talks, a few new tools. The framing is usually that AI changed software development, so we need a new approach. That gets it backwards. The discipline you need with AI is the same one you needed before: test-first, contract-first, name things before you build them. These were the answers to managing junior developers, to distributed teams, to your own future self at 2am.

Obie Fernandez names what happens when you skip them. Without a structure to hold the work, AI agents "happily produce a thousand lines of plausible structure that subtly violate your domain boundaries, your naming conventions, your assumptions about data flow." The consequence is at the architectural level, not the line level. As Obie puts it: "AI coding agents have made it impossible to pretend that architecture does not matter. They generate code at a pace that exceeds your ability to manually keep the system coherent."

AI did not invent the problem. It just stopped letting you get away with skipping the answer. Architecture matters. That is what the 30% I keep is for.

Closing

If you have lost track of what the things in your code are called, the audits I run rebuild that vocabulary. Architecture, the trust zones, the patterns the agent gets wrong. Two ways to start, both free.