I Was the Judgment Layer

#programming #claude #chatgpt #qwen

There's a popular take right now that the AI coding bottleneck has moved to code review. Models write code faster than humans can read it, so the senior engineer becomes the reviewer of last resort.

The take is half right. The bottleneck moved, yes. Not to code review. Upstream of it, to the spec.

When the AI is fast enough that scaffolding, validation, and boilerplate are basically free, what determines whether the feature is correct is not the code. It's the intent the code was generated from.

If the intent says "webhook handler that updates payment status," the AI builds the happy path beautifully and silently ignores duplicate delivery, idempotency, and the impossible state transitions. The bug was authored at spec time. The reviewer at code time can catch it, but they're catching an artifact of an upstream mistake. That work is slow, expensive, and demoralizing.

So the real question is how to review the spec.

For a while I assumed that was my job, full stop. I'd write the spec, sit with it, ask myself the questions a senior engineer with scars would ask. What's out of scope. What should fail loudly. What's the behaviour on duplicate input.

It worked. It was also slow, and I kept noticing the same uncomfortable thing: I missed the same kinds of cases in review that I missed in authoring. The blind spots overlapped because the reviewer and the author were the same person.

What clicked was treating the spec the way you treat code in a strong engineering team. Put it through reviewers with different priors.

The setup I landed on uses three models.

Claude drafts the spec from a Socratic interview about intent. Acceptance criteria, edge cases, what's out of scope, open questions.

Codex reviews the draft against architecture.md, conventions.md, and guardrails.md. Different training corpus, different defaults, different things it cares about.

OpenCode running Qwen reviews independently against the same rubric. Different corpus again, different cultural priors, different sycophancy patterns.

Each model returns specific objections: missing acceptance criteria, ambiguous edge cases, contradictions with prior decisions, gaps where the spec assumes context it doesn't have.

The reason this works is that the models do not have the same biases.

Claude is confident and thorough, with a recognizable tendency toward completion and a particular sycophancy pattern when the prompt frames the spec as already good.

Codex is trained on a corpus that weighs code patterns heavily, so it surfaces "this spec implies a structure that won't fit your conventions" earlier than the others.

Qwen flags things the other two skipped, often around explicit failure modes and recovery paths.

None of them are right alone. The disagreements between them are the signal.

What I get out of that loop is judgment, applied earlier, by something that is not me and not biased the same way I am.

This is the part that makes me uncomfortable.

For years my edge as an engineer was judgment. Knowing which questions to ask. Knowing what to push back on. Knowing where the bodies were buried.

The retrieval-and-synthesis part of the job got eaten last cycle. I told myself the response was to move up to judgment, and that judgment was durable because it required taste and experience.

The honest update: judgment is also distributable. Not all of it. But the part that looks like rubric-based review (does this spec cover the failure modes, does it conflict with prior decisions, does it respect the conventions) can be parallelized across models with different training data. The result is better than any single human reviewer doing it alone. Including me.

The model wasn't judging. I was the judgment layer.

Now the judgment layer is a panel, and I'm one voice on it.

What's left for the human is smaller, but more concentrated.

Define the rubric. Maintain architecture.md and conventions.md so the panel has something specific to review against. Resolve disagreements when the models conflict. Decide which scars get written into history.md so future specs are reviewed against past pain.

These are still judgment calls. They happen less often than spec review, and each one compounds.

The earlier framing I had, that judgment was the safe layer to retreat to, was wrong. There is no safe layer. There is only the next location where verification has to happen, and the work of being honest when that location moves.

Right now it's at the rubric. Next year it might be somewhere else.

Anyone else running multi-model panels on the review side? What does your panel disagree about most?