Two AI reviews passed my change. The correct architecture was documented one file away.

#ai #vercel #node #opensource

Two separate models — one writing the code, one reviewing the diff — shipped a one-word bug to staging. Both ran local checks. Both came back green. The fix they missed was not exotic or subtle. It was written down, in plain English, in a file sitting in the same directory as the change.

That last detail is the one worth staying with. This is not a story about a model being wrong. Both models were, within the job they were given, correct. It is a story about where verification stops looking — and why stacking two reviewers that stop looking in the same place feels like safety and isn't.

What broke

While fixing a cron timeout, I had Claude Opus pull a small JSON-parsing helper into its own file so it could be unit-tested in isolation. Reasonable instinct. It named the file jsonExtract.mjs and imported it from autoPublish.js:

import { _firstJsonObject } from './jsonExtract.mjs'; // looks perfectly legal

The whole incident lives in that one letter — the m.

autoPublish.js is an ES module. On my machine, Node happily lets one ESM file import a .mjs file, so every local check passed. The deployed runtime took a different path:

Local:     ESM importer → .mjs ES module → works
Deployed:  CommonJS require() → .mjs ES module → ERR_REQUIRE_ESM

I want to be precise about what I actually know here, because it matters. I did not instrument Vercel's build. What I observed is that the deployed function reached the helper through a CommonJS require(), while local Node exercised a compatible ESM path. Whatever transformation produced that, the consequence is fixed by the language spec: a .mjs extension forces a file to be an ES module, and require() of an ES module is barred. The result was an invocation-time crash:

Error [ERR_REQUIRE_ESM]: require() of ES Module
  /var/task/.../jsonExtract.mjs
  from  /var/task/.../autoPublish.js  not supported.

Not a syntax error. Not a logic error. A packaging error that only becomes real once your platform rewrites the module system underneath your repository.

The seam nobody owned

Every check you run has a scope, and a green result only ever means "clean within that scope." We forget this because green is rendered the same color regardless of how much it actually covered.

Walk the checks that ran on this change and read them as scopes rather than as pass/fail:

node --check — scope: syntax. The file parsed. True, and useless here.
The test suite — scope: the code paths the tests exercise, in the local runtime. All green. Also true; the tests never invoked the module across the deploy boundary.
The AI review — scope: logic correctness, as framed by the prompt. Race conditions, variable scoping, whether the extraction logic was sound. All fine. The reviewer was asked "is this logic correct," and it answered that question well.
The deployed runtime — scope: the real module system. The only scope in which this bug exists at all.

The bug did not slip past the checks. It lived in the seam between two of them — the gap between "local ESM resolution" and "deployed CJS resolution" — and no check on the board had that seam inside its scope. ERR_REQUIRE_ESM is an invocation-level error, so the build logs stayed green by definition: nothing executes at build time to trip it.

This is where the "documented one file away" detail turns from irony into the actual lesson. In the same directory sat a sibling helper, cluePrompts.js, carrying a prominent header comment saying it was CommonJS on purpose — written that way specifically to be default-imported from ESM consumers and dodge this exact rewrite. Project docs reinforced the convention. The correct architecture was not unknown. It was passive — encoded as prose, and prose is outside every automated scope above. A human skimming the diff wouldn't see it unless they already knew to look. Neither would a model. The knowledge existed and propagated to no one.

Why two reviews is not two reviews

Here is the trap that made me write this up, because it's the part that generalizes past Node and Vercel.

Two independent green stamps — one from the author model, one from a separate reviewer model running the tests itself — feels like defense in depth. It reads like redundancy. It is not. Redundancy only buys you coverage when the reviewers fail independently. These two shared a scope: both were evaluating logic correctness in the local runtime. When two checks share a scope, they share a blind spot, and stacking them multiplies your confidence without moving your coverage an inch.

That is the quiet danger of multi-model pipelines. Adding a second model that thinks about the code the same way the first one did doesn't widen the net; it just gets you a more confident wrong answer. Real defense in depth requires checks whose scopes are uncorrelated — one that reads a completely different signal than the others.

The one check with a different scope

I run a deterministic local hook while I build, GroundTruth. I'm going to be exact about what it did and didn't do, because the honest version is the more useful one.

It did not catch the ESM bug. It couldn't. It has no window into Vercel's runtime and makes no attempt to predict a require() failure from static text — semantic and environment-aware evaluation are on its roadmap, not in it. If I told you it saw the crash coming, I'd be selling you the exact false confidence this whole post is about.

What it did do is refuse to agree that the change was green — because its scope was not logic correctness. Its scope was does the claim match the observable evidence. Every turn where I asserted passing tests, it fired the same warning, because it could not find a test run in the evidence to back the assertion:

[warn] false test/build claim — claimed tests/build pass ("tests pass"),
  but a test run looks like it reported failures — double-check

And the moment I pasted the staging crash into the terminal, it opened a task and held it open — not until I said it was fixed, but until an actual Node change appeared in the diff:

[warn] open loop (asked, not delivered) — pending task [tftlx] —
  "Staging failed: Error [ERR_REQUIRE_ESM]: require() of ES Module …"
   (no Node.js in the diff yet)

It cleared only when the CommonJS fix actually landed in the diff — a deterministic "Told & Done," no model asked.

Read that carefully, because it's easy to mistake for a tool win and it isn't one. The point is not that GroundTruth is smarter than the reviewers. It is dumber than the reviewers, deliberately — it doesn't read the code at all. It reads claims against evidence. That's a different scope, and a different scope is the only thing that can catch what a correlated pair misses. The signal that mattered was accurate the whole time. My mistake was weighting the two confident green stamps over the one stubborn yellow warning that didn't share their blind spot.

The fix

Convert the helper to CommonJS, matching the pattern its neighbor already documented:

// jsonExtract.js — CommonJS on purpose, matching cluePrompts.js
function _firstJsonObject(text) { /* … */ }
module.exports = { _firstJsonObject };

// autoPublish.js
import _jsonExtract from './jsonExtract.js'; // CJS — safe after the deploy rewrite
const { _firstJsonObject } = _jsonExtract;

Works locally, works deployed, one source of truth. Fixed forward in a single commit, no rollback, zero production downtime — it never left staging.

The rule going forward

Any new module under api/ either matches the CommonJS convention already demonstrated beside it, or it gets exercised under a live invocation environment before it can merge to main. Not because .mjs is bad — because a convention that lives only in a comment is not enforcement, it's a wish.

The failure chain, stripped down:

The right architecture was known
   → but it was passive prose
   → so the author didn't apply it
   → and the reviewer didn't apply it
   → and the local checks couldn't encode it
   → so the runtime was the first thing that could

.mjs versus .js is the trivia. The systems lesson is that documented knowledge enforces nothing, and two reviewers that think alike are one reviewer that costs twice as much. If a rule matters enough to take down a deploy, it has to live somewhere a check can read it — as a test, a lint, a gate — or you have to run the code where breaking it becomes real. Prose in a header comment is where good conventions go to be ignored politely.

So I'll turn it into the question I actually want answered: what's the convention your codebase documents but doesn't enforce — and what did it cost you the day someone, or something, didn't read it?

I write these up while building GroundTruth and EraPin. This one was caught the instant the cron executed on Vercel staging, and fixed forward within the hour.