It's Tuesday morning. I have a task in the backlog: add a new billing event type to an existing webhook pipeline. The schema exists, the handler pattern exists, three similar event types are already wired up. I open the agent, point it at the relevant files, and describe what I need. Twelve minutes later the implementation is done, the tests pass, and I'm reading through the diff before committing.
That's most of it. No drama, no revelation. This is what a productive morning looks like now.
But those twelve minutes involve a specific kind of attention that I didn't have before working with AI tools regularly — a trained sense for where to slow down and where to let the agent run. That sense is what I want to write about here. Not the philosophy of it (I covered why AI amplifies rather than equalizes in the previous article). The operational shape of an actual workday.
Where I Delegate Without Hesitation
"Trust completely" is a bad frame because I still read what the model produces. A better frame is: there are tasks where the right answer is well-defined enough that I can evaluate the output quickly without deep scrutiny. I delegate these freely.
Boilerplate that follows an established pattern. If I need a new CRUD endpoint and ten similar endpoints already exist in the codebase, I point the agent at two of them and ask for the new one in the same style. The output is usually correct on the first pass. I read it, I verify it follows the same error-handling conventions, and I move on. The value here isn't that the model is smarter than me — it's that it can hold the existing pattern in context and reproduce it accurately without me having to do so consciously.
Tests for behavior I've already specified in prose. If I've written a description of what a function should do — what inputs it accepts, what edge cases it handles, what it should return or throw — the model writes tests that match that description well. This is different from asking the model to write tests for code it just wrote, which produces a different and much lower-quality result. I described that pattern in detail in the first article. The distinction matters: tests from my specification test behavior; tests from the model's own code test the implementation.
Type definitions and interfaces. Describing a data structure in plain language and asking for the TypeScript interface is something the model does better than I would if I were typing from scratch. The output is complete, well-named, and includes JSDoc when I ask. I review it, adjust names where I have preferences, and move on.
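As a rough illustration of that workflow, here is the kind of interface a one-sentence description might yield. The description, the `BillingSubscription` name, and every field are hypothetical, invented for this sketch rather than taken from a real project.

```typescript
// Plain-language description given to the model (hypothetical):
// "A subscription has an id, belongs to a tenant, is on one of three plans,
//  renews on a date, and may have been cancelled at some point."

/** Plan tiers a subscription can be on. */
type PlanTier = "free" | "pro" | "enterprise";

/** A tenant's subscription to the product. */
interface BillingSubscription {
  /** Unique identifier of the subscription. */
  id: string;
  /** Identifier of the tenant that owns the subscription. */
  tenantId: string;
  /** Current plan tier. */
  plan: PlanTier;
  /** Next renewal date as an ISO 8601 string. */
  renewsAt: string;
  /** Cancellation date as an ISO 8601 string; absent while the subscription is active. */
  cancelledAt?: string;
}

// A value that satisfies the interface, useful as a quick sanity check of the names.
const exampleSubscription: BillingSubscription = {
  id: "sub_001",
  tenantId: "tenant_a",
  plan: "pro",
  renewsAt: "2025-01-01T00:00:00Z",
};
```

The review step is mostly about naming and optionality: whether `cancelledAt` should be optional or nullable is a preference the model cannot guess.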
Regular expressions. These used to take me twenty minutes even for moderately complex cases — writing the pattern, testing against edge cases in a REPL, adjusting. Now: I describe the matching requirements including the edge cases I care about, the model produces the regex, and I verify it against the inputs I actually have. Elapsed time went from roughly twenty minutes to roughly two. The model is also better than I am at producing readable regexes with named groups, because it has pattern-matched on a huge number of examples.
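A minimal sketch of what that looks like, assuming a made-up invoice reference format of `INV-<year>-<sequence>`. The format, the pattern, and the `parseInvoiceRef` helper are illustrative assumptions, not something from a real codebase.

```typescript
// Hypothetical format: "INV-" + 4-digit year + "-" + 6-digit sequence,
// for example "INV-2024-000123". Named groups keep the pattern readable.
const INVOICE_REF_PATTERN = /^INV-(?<year>\d{4})-(?<sequence>\d{6})$/;

interface InvoiceRef {
  year: number;
  sequence: number;
}

/** Returns the parsed reference, or null when the input does not match the format. */
function parseInvoiceRef(raw: string): InvoiceRef | null {
  const match = INVOICE_REF_PATTERN.exec(raw);
  if (match === null || match.groups === undefined) {
    return null;
  }
  const year = match.groups["year"];
  const sequence = match.groups["sequence"];
  if (year === undefined || sequence === undefined) {
    return null;
  }
  return { year: Number(year), sequence: Number(sequence) };
}
```

The verification step is then a handful of inputs I actually care about: a valid reference, a lowercase prefix, a short sequence, and leading whitespace. Only the first should parse.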
Format conversions and data transformations. Parsing an API response into a different shape, writing a one-off CSV-to-JSON converter, building a jq pipeline for a log file. These are well-defined inputs-to-outputs with no ambiguity about what "correct" means. I describe the input format, describe the output format, and the model produces the code.
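For example, a reshaping task of this kind might look like the sketch below: flat line-item rows in, one summary per invoice out. Both shapes and the `groupByInvoice` name are hypothetical and chosen only to show how unambiguous the "correct" output is.

```typescript
// Hypothetical input shape: one flat row per line item.
interface LineItemRow {
  invoice_id: string;
  description: string;
  amount_cents: number;
}

// Hypothetical output shape: one entry per invoice with its items and a total.
interface InvoiceSummary {
  invoiceId: string;
  items: { description: string; amountCents: number }[];
  totalCents: number;
}

function groupByInvoice(rows: readonly LineItemRow[]): InvoiceSummary[] {
  const byId = new Map<string, InvoiceSummary>();
  for (const row of rows) {
    let summary = byId.get(row.invoice_id);
    if (summary === undefined) {
      summary = { invoiceId: row.invoice_id, items: [], totalCents: 0 };
      byId.set(row.invoice_id, summary);
    }
    summary.items.push({ description: row.description, amountCents: row.amount_cents });
    summary.totalCents += row.amount_cents;
  }
  // A Map iterates in insertion order, so invoices come out in first-seen order.
  return Array.from(byId.values());
}
```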
Bash one-liners and scripting. I know enough Bash to get things done. I don't enjoy writing Bash. The model does fine here and I review the output before running anything that touches production.
Documentation for code I've already written. I write comments where decisions need explanation; the model fills in JSDoc blocks and inline clarifications for things that are genuinely readable but underdocumented. The result is acceptable as a first draft; I adjust the parts that don't sound like how I would write.
Where I Use AI But Read Every Line
These are the high-stakes areas — tasks where I still use AI assistance because the drafting speed is real, but where I treat the output as a pull request from a developer I don't fully trust yet. Every line gets read. Important parts get tested in isolation before the code goes anywhere near production.
SQL queries, especially with JOINs and multi-tenancy. I build on an existing Drizzle ORM schema and the model can produce correct queries from my description. But "correct" for a query means more than "returns results." It means filtered by the right tenant, using an index that actually exists, returning only the columns the caller needs, and behaving predictably under the data distribution I have in production. For a SaaS product with proper multi-tenancy, this is the area where I see the most subtle AI-generated mistakes. For any query that touches more than one table or runs in a hot path, I read the output carefully and run EXPLAIN ANALYZE on the actual production database before deploying. The model cannot know what my indexes are or what my data looks like. I do.
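The tenant-filter mistake is easier to see in miniature. The sketch below uses in-memory arrays as stand-ins for two tables. It is not Drizzle code, and every name and row in it is hypothetical. Both functions "return results"; only one of them is correct.

```typescript
// In-memory stand-ins for two tables. All names and rows are hypothetical.
interface CustomerRow { id: number; tenantId: string; email: string }
interface InvoiceRow { id: number; tenantId: string; customerId: number }

const customerRows: CustomerRow[] = [
  { id: 1, tenantId: "acme", email: "a@acme.example" },
  { id: 2, tenantId: "globex", email: "g@globex.example" },
];

const invoiceRows: InvoiceRow[] = [
  { id: 10, tenantId: "acme", customerId: 1 },
  // Inconsistent row: an "acme" invoice that points at a "globex" customer.
  { id: 11, tenantId: "acme", customerId: 2 },
];

// Subtle mistake: the tenant filter covers invoices only, so the join can
// pull in a customer row that belongs to another tenant.
function invoiceEmailsLeaky(tenantId: string): string[] {
  const emails: string[] = [];
  for (const invoice of invoiceRows) {
    if (invoice.tenantId !== tenantId) continue;
    const customer = customerRows.find((c) => c.id === invoice.customerId);
    if (customer !== undefined) emails.push(customer.email);
  }
  return emails;
}

// Stricter version: both sides of the join are constrained to the tenant.
function invoiceEmailsScoped(tenantId: string): string[] {
  const emails: string[] = [];
  for (const invoice of invoiceRows) {
    if (invoice.tenantId !== tenantId) continue;
    const customer = customerRows.find(
      (c) => c.id === invoice.customerId && c.tenantId === tenantId,
    );
    if (customer !== undefined) emails.push(customer.email);
  }
  return emails;
}
```

With this data the leaky version returns an email from the other tenant, and a test that only checks for a non-empty result would pass on both.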
Anything that touches money. Stripe charge creation, invoice line item calculation, VAT application, refund logic. The model drafts; I review as if I were a senior engineer who doesn't trust the author. Every conditional branch. Every rounding behavior. Every failure path. If a payment goes wrong, the damage is real and often irreversible. Speed on this code is not a priority.
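Rounding is the branch I read most carefully. Floating-point arithmetic on decimal amounts is not guaranteed to produce exact results, so one common approach is to keep amounts as integers in minor units. The sketch below assumes non-negative cents, a rate in basis points, and half-up rounding. The function names and the rounding rule are illustrative assumptions, not a statement about what any tax authority or payment provider requires.

```typescript
// Assumptions for this sketch: amounts are non-negative integers in cents,
// rates are integers in basis points (2550 = 25.5%), rounding is half-up.
const BASIS_POINTS_PER_WHOLE = 10000;

function assertNonNegativeInteger(value: number, label: string): void {
  if (!Number.isInteger(value) || value < 0) {
    throw new RangeError(`${label} must be a non-negative integer`);
  }
}

/** VAT owed on a net amount, in cents. */
function vatCents(netCents: number, rateBasisPoints: number): number {
  assertNonNegativeInteger(netCents, "netCents");
  assertNonNegativeInteger(rateBasisPoints, "rateBasisPoints");
  // The product is an exact integer; the only rounding happens here, once.
  return Math.round((netCents * rateBasisPoints) / BASIS_POINTS_PER_WHOLE);
}

/** Net amount plus VAT, in cents. */
function grossCents(netCents: number, rateBasisPoints: number): number {
  return netCents + vatCents(netCents, rateBasisPoints);
}
```

With these assumptions, `grossCents(10000, 2550)` is 12550, and a fractional input such as `10.5` is rejected instead of being rounded silently.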
User input handling. Any code that accepts data from outside the system: form submissions, webhook payloads, API parameters. I check that validation covers the edge cases I can think of, that error messages don't leak internal structure, and that no input can reach a database or a file system without going through the validation layer first.
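A minimal sketch of the shape I look for, written without a validation library so it stays self-contained. The payload fields, the length limit, and the `parseBillingEvent` name are hypothetical.

```typescript
// Hypothetical payload for a billing webhook; field names are assumptions.
interface BillingEventPayload {
  eventId: string;
  amountCents: number;
}

type ParseResult =
  | { ok: true; value: BillingEventPayload }
  | { ok: false; error: string };

function parseBillingEvent(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, error: "Invalid payload" };
  }
  const record = input as Record<string, unknown>;
  const eventId = record["eventId"];
  const amountCents = record["amountCents"];

  if (typeof eventId !== "string" || eventId.length === 0 || eventId.length > 64) {
    // The message names the field, not internal types, tables, or stack details.
    return { ok: false, error: "Invalid eventId" };
  }
  if (typeof amountCents !== "number" || !Number.isInteger(amountCents) || amountCents < 0) {
    return { ok: false, error: "Invalid amountCents" };
  }
  // Build a fresh object so unknown extra fields never travel further inward.
  return { ok: true, value: { eventId, amountCents } };
}
```

Downstream code accepts only `BillingEventPayload`, so the type system helps enforce that nothing reaches storage without passing through this function.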
Auth checks inside functions. Not the architectural question of where auth should happen — that's a decision I make and document before the model writes any code. But when a function has a guard clause that checks permissions or ownership, I read it carefully. Subtle errors here — checking the wrong field, failing open rather than closed, not accounting for a null case — are the kind of thing that's invisible in tests and visible only when someone exploits it.
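The null case is the one I see most often, so here is a small sketch of it. The `SessionInfo` and `OwnedRecord` shapes are hypothetical. The assumption is that an anonymous session carries a null user id and that some legacy records have no owner.

```typescript
// Hypothetical shapes: userId is null for anonymous sessions,
// ownerId is null for records that were never assigned an owner.
interface SessionInfo { userId: string | null }
interface OwnedRecord { ownerId: string | null }

// Fails open: null === null is true, so an anonymous session
// "owns" every unowned record.
function isOwnerNaive(session: SessionInfo, record: OwnedRecord): boolean {
  return session.userId === record.ownerId;
}

// Fails closed: a missing identity on either side means no access.
function isOwner(session: SessionInfo, record: OwnedRecord): boolean {
  if (session.userId === null || record.ownerId === null) {
    return false;
  }
  return session.userId === record.ownerId;
}
```

A test suite that only covers logged-in users and owned records passes for both versions, which is why this kind of guard gets a line-by-line read.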
Migrations. The model generates them quickly and the syntax is usually correct. I read every migration twice: once for what it does, once for what it might do to existing data. I run migrations on a copy of production data before running them on production. The model has no knowledge of the actual rows in the database.
Where I Don't Delegate
This list is short because the previous article covered the principles. Here I want to state these as practices, not arguments.
Architectural decisions. Where modules live, how the application is layered, where the boundary sits between two services. I make these decisions and communicate them to the model explicitly before it writes anything. If I don't, the model will make them for me, locally and inconsistently, and differently from one session to the next. I use a CLAUDE.md file in every project to capture these decisions; more on that in the next article in this series.
Security topology. Where authentication happens. Which routes are public. What an unauthenticated request is allowed to touch. The model can implement the mechanism; I decide where it lives. This is not a task that can be delegated safely because its correctness depends on understanding the full request path, and the model doesn't have that understanding unless I've given it explicitly — and even then I verify.
Schema design. The model drafts DDL quickly. The decision about which columns are nullable, which indexes serve the queries that will actually run, whether to denormalize this field or keep it normalized — those require knowing the query patterns, the data growth projections, and the operational constraints. I make these decisions and then ask the model to generate the DDL.
Trade-off decisions. The model is good at presenting options. "Here are three ways to implement this caching layer, with trade-offs." I use that output as a starting point. But the actual choice requires context the model doesn't have: what the operational team can support, what the cost ceiling is, what technical debt already exists in this area, what the team's experience is with each option. The model doesn't know. I tell it what I've decided, not the other way around.
Production incident debugging. When something is actively broken, I want a direct path from observation to diagnosis. The model introduces latency into that path because it generates hypotheses that sound plausible but are based on general patterns, not the specific state of the system I'm looking at. In a production incident I want logs, metrics, and my own knowledge of the system — not a confident-sounding guess about what might be wrong. I turn the agent off and debug directly.
Signals That Make Me Stop
This is the part that took time to develop and that I don't see written about often. Mid-session, in real time, there are specific outputs that cause me to stop and reconsider before accepting anything else from that session.
The model creates a new file instead of modifying an existing one. If I asked it to extend an existing module and it produced a new file alongside it, it didn't see the existing code or decided to work around it. Either way, the output probably duplicates something. I reject it and redirect with an explicit pointer to the existing file.
A new dependency appears that I have not approved. I did not ask for a library; the model added one because it was convenient for the task. I stop here every time. Adding a dependency is a decision that affects the entire project: security surface, bundle size, maintenance burden. It is never implicit.
The model reinvented a pattern that already exists in the codebase. This happens when I didn't give it enough context about what's already there. The result is a second version of something that should be unified, and now I have inconsistency I didn't start with. I reject the output and provide the existing pattern explicitly.
Confident use of an API method I don't recognize. Sometimes it's a hallucinated method name. Sometimes it's a real method that was added in a newer version than I'm running. Sometimes it's correct. I check every time without exception, because the failure mode for a hallucinated API call is a runtime error in a code path that looks fine statically.
// Model produced this. I didn't recognize `.parseAsync` with this signature.
const result = await schema.parseAsync(input, { strict: true });
// Checked: the `strict` option doesn't exist on Zod's parseAsync.
// The call would have run without error, silently ignoring the unknown option.
A try/catch that swallows errors. The model adds these because it has seen a lot of code that wraps things in try/catch. When the catch block logs and returns null — or worse, returns a fallback value and continues — it hides failures rather than surfacing them. I always check what catch blocks do. An error that disappears into a log is an incident waiting to happen.
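To make the difference concrete, here is a sketch with a hypothetical config loader. The `RetryConfig` shape and all three function names are invented for illustration. The first wrapper is the pattern I reject, the second is what I ask for instead.

```typescript
interface RetryConfig { retries: number }

// Hypothetical parser: throws on malformed JSON or a missing numeric field.
function parseRetryConfig(raw: string): RetryConfig {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("config must be a JSON object");
  }
  const retries = (parsed as Record<string, unknown>)["retries"];
  if (typeof retries !== "number") {
    throw new Error("retries must be a number");
  }
  return { retries };
}

// Swallows the error: the caller gets a plausible value and never learns
// that the real configuration was unreadable.
function loadConfigSwallowing(raw: string): RetryConfig {
  try {
    return parseRetryConfig(raw);
  } catch (err) {
    console.error(err);
    return { retries: 0 };
  }
}

// Surfaces the error: adds context and rethrows so the failure is visible
// at the call site.
function loadConfigStrict(raw: string): RetryConfig {
  try {
    return parseRetryConfig(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Could not load retry config: ${reason}`);
  }
}
```

A fallback is sometimes the right call, but then it should be a decision stated in the code, not a side effect of a generic catch block.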
A test that checks the return type rather than the return value.
// This is not a test. This is noise.
it("returns a number", () => {
  expect(typeof calculateTax(100, "FI")).toBe("number");
});
// This is a test.
it("applies Finnish VAT at 25.5% to the base amount", () => {
  expect(calculateTax(100, "FI")).toBe(125.5);
});
I wrote about this in the first article. When I see the first pattern in model output, I reject the test and write a description of the actual behavior I want to verify.
Magic numbers without explanation. setTimeout(fn, 5000) — why 5000? maxRetries = 3 — where did 3 come from? The model uses reasonable-looking defaults without documenting the reasoning. I either add the comment myself or ask the model to explain and then document it, because whoever reads this code next (usually me, six months later) will have no idea why that number is there.
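What I want instead looks roughly like the sketch below. The constants, their values, and the stated reasons are hypothetical. The point is that each number carries its justification with it.

```typescript
/**
 * Delay between retries, in milliseconds.
 * Hypothetical rationale: the upstream service is assumed to recover from
 * short overloads within a few seconds, so retrying sooner only adds load.
 */
const RETRY_DELAY_MS = 5000;

/**
 * Maximum number of retries after the first attempt.
 * Hypothetical rationale: three retries at 5 s each keeps the worst case
 * under an assumed 30 s job timeout.
 */
const MAX_RETRIES = 3;

/** Delay before the next retry, or null when the retry budget is used up. */
function nextRetryDelayMs(retriesSoFar: number): number | null {
  return retriesSoFar < MAX_RETRIES ? RETRY_DELAY_MS : null;
}
```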
What I Do Differently Now
Working with AI tools daily has introduced habits that didn't exist in my workflow before.
I write specifications in prose before asking for code. The quality of what the model produces is a direct function of the quality of the specification it receives. I described the mechanism in detail in the previous article. Practically, this means I spend five to ten minutes writing a description of what a function needs to do — its inputs, outputs, constraints, and failure modes — before I prompt for implementation. This has had an unexpected benefit beyond AI: it surfaces ambiguities before I'm in the middle of writing code.
I use AI to review my own code. After writing something non-trivial, I paste it into a session and ask the agent to find problems with it. This catches things I've become blind to after staring at the code for an hour: unchecked error paths, a null case I didn't handle, an assumption that's not documented. It is not a substitute for a human review. It is a useful additional pass that costs two minutes.
I refactor more often. The cost of refactoring went down because the mechanical parts (renaming, restructuring, updating call sites) are faster with AI assistance, so I do it sooner. I used to tolerate more awkwardness in a module because cleaning it up wasn't worth the time. Now the threshold for "this is worth fixing" is lower.
I use AI to explore alternatives before committing. "Here's my current approach. What are two or three other ways to solve this, with trade-offs?" The model is good at this and it's made me less likely to commit to the first approach that compiles. I don't take its recommendation — I use the options as a starting point for my own assessment.
I maintain a CLAUDE.md in every project. A document that captures the architectural decisions, the patterns, the conventions, and the explicit constraints the model should follow. The details of what goes in it and how to write it are what the next article in this series covers.
I watch for cognitive offload in myself. This is the subtler discipline and the one I think most people working with AI underestimate. The risk isn't that I delegate execution to the model — that's the whole point. The risk is that I start delegating thinking without noticing: accepting a suggestion because the explanation sounded confident rather than because I evaluated the substance, skipping a verification step because I'm tired and the output looks right, treating "the model generated this" as evidence that it was considered. Leverage and dependency look identical in the moment; the difference is whether I'd still be able to make the decision without the model's draft in front of me. When I notice the dependency pattern in myself — and it happens — I stop the session, take a break, and come back when I can do the evaluation properly. This is the new failure mode AI introduces for experienced developers, and it doesn't trigger any of the signals listed above. It comes from inside.
What Hasn't Changed
This is the honest part of the balance sheet.
I am only marginally faster at understanding a new codebase. This is more nuanced than I used to think. The model genuinely reduces the cost of orientation: it maps structure, surfaces entry points, identifies the libraries in use, and narrates what individual files do. That part is real and useful, and it's saved me hours on unfamiliar repositories — the search cost of "where does this even start" has collapsed. But orientation and comprehension are different things. Understanding the codebase well enough to change it safely — knowing why something was built this way, what invariants are being protected, what assumptions the original author was making — still requires reading the code and thinking about it. The shortcut is to the map, not to the territory.
I am not faster at estimating the complexity of tasks. The model is an optimist. When I ask it how long something will take, or when I ask it to assess the risk of a change, it systematically underestimates. My own estimates, based on experience with what actually goes wrong, are more accurate. I've stopped asking.
Production incidents don't go faster. As noted above: I debug those the old way.
Conversations with clients haven't changed. Understanding what a business actually needs, translating that into a technical approach, deciding what not to build — none of that is faster.
Code reading as a deep skill is still slow and manual. The model can help narrate code, but narration and understanding are different things.
The Honest Assessment
AI is now part of my toolchain the same way a compiler, a debugger, and version control are. Each of those tools changed how professional developers work when they were introduced. None of them changed what the work actually requires — they changed the mechanics of doing it. AI is doing the same thing, at larger scale, in more parts of the workflow.
The tool is productive where I have clear constraints and well-specified intent. It is a source of plausible-looking problems where I don't. The discipline is knowing the difference, in real time, for each task I'm about to hand off. That's a skill that develops with use and with being wrong often enough to notice the pattern.
After 25 years of building systems, the part of my job that AI has not changed is the part that was never about typing: deciding what to build, deciding what not to build, and knowing which decisions can be taken back if they turn out to be wrong.
If working with AI tools makes your development process feel chaotic rather than faster, that's usually a process problem, not a tool problem. My fractional CTO work often starts with sorting out exactly this — establishing the structure that makes AI-assisted development reliable rather than risky. If you've inherited a codebase that was built with AI without that structure in place, that's what my rescue projects service is for.
For a deeper look at what a well-maintained AI workflow looks like in practice — the CLAUDE.md file, system prompts, and mid-session stop signals — read Prompts That Keep an AI Agent From Wrecking Your Codebase.