We Let AI Write a Third of Our Code. Here's the Review Process That Kept Us Sane.

#ai #webdev #programming #productivity

There is a seductive moment when AI coding assistants start pulling real weight: a meaningful share of your diffs are machine-drafted, velocity spikes, and everyone feels ten feet tall. Then the first subtle bug from unreviewed generated code reaches production, and you realize the tool changed how fast you write code without changing how much it costs to own it. Reviewing, testing, securing, and maintaining that code costs exactly what it always did.

Here is the process that let us lean on generation without inheriting fragility.

Rule zero: the human who merges it owns it

The most important change was cultural, not technical. Whoever opens the PR is accountable for every line as if they typed it. "The model wrote it" is not a defense in a postmortem. This one norm ended the skim-and-approve reflex, because a skimmed approval now put your name on the incident.

Build an automated floor before you open the tap

AI raises the volume of code hitting review. If human review is your only filter, reviewers start rubber-stamping under the load. So we put a deterministic gate before any AI-drafted change reaches a person:

[ ] type-checks / compiles
[ ] linter clean
[ ] static analysis (SAST) finds no known-vuln patterns
[ ] no secrets introduced
[ ] tests present and non-trivial
[ ] coverage does not drop

None of this is AI-specific, which is the point. The floor has to be solid enough to absorb more code without more human hours.
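
As a rough sketch of how such a floor can be wired together, the Python below runs a set of named commands and fails the gate if any of them exits non-zero. The names `run_gate` and `gate_passes` are made up for illustration, and the commands are placeholders that only return an exit code; swap in whatever type checker, linter, and scanners your team actually uses.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_gate(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run every check without short-circuiting, so all failures show up at once."""
    results = []
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(
            CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)
        )
    return results


def gate_passes(results: list[CheckResult]) -> bool:
    """The gate is deterministic: it passes only if every check passed."""
    return all(r.passed for r in results)


# Placeholder commands (assumptions): each just exits with a fixed status
# to stand in for a real tool invocation.
demo_checks = {
    "type-check": [sys.executable, "-c", "raise SystemExit(0)"],
    "linter": [sys.executable, "-c", "raise SystemExit(0)"],
    "secret-scan": [sys.executable, "-c", "raise SystemExit(1)"],
}
demo_results = run_gate(demo_checks)
```

Running every check instead of stopping at the first failure is a deliberate choice here: the author gets one complete list to fix before a reviewer is ever pinged.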

Watch for the failure modes assistants over-produce

Generated code fails in characteristic ways, and knowing them makes review faster: mishandled edge cases (empty collections, timezones, integer truncation) that the happy path never exercises; hallucinated or outdated API calls that sound plausible; and security anti-patterns like string-concatenated SQL that models reproduce from their training data. We keep a short reviewer checklist of exactly these.
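
To make the SQL item concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table and both function names are hypothetical; the point is only the contrast between concatenating input into the query and passing it as a bound parameter.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Anti-pattern: user input is concatenated straight into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized query: the driver treats `name` as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Classic injection payload: closes the string literal and adds an always-true clause.
payload = "x' OR '1'='1"
leaked_rows = find_user_unsafe(conn, payload)  # matches every row in the table
safe_rows = find_user_safe(conn, payload)      # matches nothing
```

Both functions behave identically on friendly input like "alice", which is exactly why the unsafe version survives a happy-path review.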

Choosing which assistants and scanners to standardize on was its own project. If you are early in that, it is worth surveying the AI software development tools currently available rather than defaulting to whatever is bundled with your IDE.

Generate tests for human code, not the other way around

Test generation was our highest-leverage use, with one caveat about the direction of trust. Generating tests for existing, human-written code is great: the code is the trusted artifact and the tests are scaffolding. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected." So our rule is that the intended behavior is always asserted by a human who understands the requirement:

def test_discount_never_exceeds_cap():
    # apply_discount is the function under test; its import is omitted here.
    # Business rule: discount capped at 30%, regardless of input.
    assert apply_discount(price=100, pct=50) == 70  # capped at 30%, not 50%
    assert apply_discount(price=0, pct=30) == 0  # zero price stays zero, never negative
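
To show why the direction of trust matters, here is a hypothetical illustration. Neither implementation below is our real code: `apply_discount_model_draft` stands in for a generated draft that forgot the cap, and `apply_discount` is one possible fix, assuming prices are plain numbers and the percentage is clamped to the 0 to 30 range.

```python
MAX_DISCOUNT_PCT = 30  # the business rule stated in the test above


def apply_discount_model_draft(price, pct):
    # Hypothetical generated draft: the cap is never applied.
    return price * (100 - pct) / 100


def test_generated_with_the_draft():
    # Derived from what the draft does, not from the requirement,
    # so it passes and quietly locks the bug in as "expected".
    assert apply_discount_model_draft(price=100, pct=50) == 50


def apply_discount(price, pct):
    # One possible fix: clamp the percentage before applying it.
    capped_pct = min(max(pct, 0), MAX_DISCOUNT_PCT)
    return price * (100 - capped_pct) / 100
```

The human-written assertions fail against the draft (it returns 50 where the rule demands 70) and pass against the clamped version, which is the whole value of having a person state the requirement independently.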

Measure delivery, not typing

The trap is celebrating "lines generated" or "PRs opened." Those are inputs. We watch change-failure rate, time-to-restore, and defect-escape rate. When generation sped up but change-failure rate ticked up, that was the signal we had shifted work from writing to debugging, and debugging is the expensive end.
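
For reference, here is a minimal sketch of how these three numbers can be computed. It assumes the common definitions (failed deployments over total deployments, mean minutes from failure to recovery, and production-found defects over all defects found); the `Deployment` record and the function names are illustrative, not a description of our dashboards.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    caused_failure: bool
    minutes_to_restore: Optional[float] = None  # set only for failed deploys


def change_failure_rate(deploys):
    """Share of deployments that caused a production failure."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


def mean_time_to_restore(deploys):
    """Average minutes to recover, over failed deploys with a recorded duration."""
    durations = [
        d.minutes_to_restore
        for d in deploys
        if d.caused_failure and d.minutes_to_restore is not None
    ]
    return mean(durations) if durations else None


def defect_escape_rate(found_in_production, found_before_release):
    """Share of all known defects that were first caught in production."""
    total = found_in_production + found_before_release
    return found_in_production / total if total else 0.0


# Made-up sample data, purely to exercise the functions.
sample = [
    Deployment(False),
    Deployment(True, minutes_to_restore=30),
    Deployment(False),
    Deployment(True, minutes_to_restore=90),
]
```

Tracking these per week alongside the share of AI-drafted changes is what lets you see whether the extra volume is costing you downstream.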

The takeaway

More AI in your pipeline is fine, even great, as long as your review gates, test discipline, and accountability are strong enough that the extra volume makes you faster without making you fragile. The teams that win are not the ones generating the most code. They are the ones who treat generation as cheap and ownership as the real work. If your team is trying to formalize this at scale, the operating model fits in one line: move fast on generation, stay strict on verification.

What does your AI code-review process look like? I am collecting patterns in the comments.