AI can generate a function in seconds. Telling you whether it's correct takes longer - because "correct" isn't defined until someone writes it down first.
I spent months running diffs through Claude, then Codex, then Gemini. Three independent reviewers, different model families, often different failure modes.
The reviews were still inconsistent. Not because the models were bad - they caught real things. But they kept disagreeing on what "done" looked like, because nobody had told them.
That's the problem. It's not the reviewer.
The Archaeology Problem
Without a specification, code review is archaeology.
You're given a diff and asked: "Is this correct?" But correct relative to what? You reconstruct intent from variable names and commit messages. You guess at what the original author wanted. When three models do this in parallel, you get three different reconstructions - and no principled way to resolve them.
The models aren't wrong. They're inferring. And inference is probabilistic.
The fix isn't better reviewers. It's eliminating the inference step by writing down what "correct" means before implementation starts.
What Signum Does
Signum is a Claude Code plugin that adds a contract layer to AI-assisted development.
CONTRACT → EXECUTE → AUDIT → PACK
Before a line is written, a Contractor agent produces contract.json:
```json
{
  "goal": "Add JWT authentication to the /api/users endpoint",
  "inScope": ["src/auth.py", "src/routes/users.py", "tests/test_auth.py"],
  "acceptanceCriteria": [
    {
      "id": "AC-1",
      "description": "Valid JWT returns 200",
      "verify": "pytest tests/test_auth.py::test_valid_jwt -v"
    },
    {
      "id": "AC-2",
      "description": "Missing token returns 401",
      "verify": "pytest tests/test_auth.py::test_missing_token -v"
    }
  ],
  "holdoutScenarios": [ ... ],
  "riskLevel": "medium"
}
```
The spec is graded A-F across six dimensions before implementation starts: testability, negative coverage, clarity, scope boundedness, completeness, and boundary cases. Grade D (below 60) is a hard stop. You can't implement against a spec that hasn't been written clearly enough to verify.
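As a sketch of what such a gate might look like: the six dimension names come from above, but the equal weighting and the function itself are assumptions for illustration, not Signum's actual scoring code.

```python
# Assumed: six 0-100 dimension scores, equally weighted. The post only
# fixes the dimension names and "below 60 is a hard stop".
DIMENSIONS = ("testability", "negativeCoverage", "clarity",
              "scopeBoundedness", "completeness", "boundaryCases")

def grade_spec(scores: dict) -> tuple[float, bool]:
    """Return (overall score, may_proceed). Below 60 blocks implementation."""
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return overall, overall >= 60
```

The gate's job is purely mechanical: no implementation step runs until the second element of that tuple is true.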
The Holdout Trick
The most interesting part is holdout scenarios.
The Contractor generates a second set of acceptance criteria that the implementing Engineer never sees. They're physically removed from contract-engineer.json before the Engineer reads it - not hidden by instruction, removed at the data level.
After implementation, holdouts run blind. A note on how this works in practice: the Engineer writes all the code, including tests. What it doesn't see is the list of specific verify commands the auditor will run - holdout scenarios like "pytest tests/test_auth.py::test_expired_jwt" that the Contractor defined before a single line of implementation was written. The Engineer can't optimize for cases it doesn't know are being checked.
This mirrors how you'd test a human engineer: you define requirements, they implement, you run checks they didn't author.
High-risk tasks require at least 5 holdout scenarios. The pipeline won't proceed with fewer.
Multi-Model Audit With Independent Context
Three models review the same diff with different context. The point isn't that they're guaranteed to disagree - they're not. The point is that cross-model context isolation reduces the chance that one model's framing poisons another's.
- Claude gets the full context: contract, diff, test results
- Codex gets only the task goal and the diff - no contract, no test results
- Gemini gets the same: goal + diff only
Codex and Gemini can't be primed by the contract details or what Claude found. Separating context doesn't eliminate shared biases baked into training - but it does remove the most common source of cross-contamination in a multi-reviewer setup: anchoring on a shared framing before forming an independent opinion.
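The isolation rule is easy to express as data construction. A minimal sketch, assuming a payload-per-reviewer shape (the function and field names here are hypothetical):

```python
def build_audit_payloads(contract: dict, diff: str, test_results: str) -> dict:
    """Claude sees everything; Codex and Gemini see only goal + diff."""
    isolated = {"goal": contract["goal"], "diff": diff}
    return {
        "claude": {"contract": contract, "diff": diff,
                   "testResults": test_results},
        "codex": dict(isolated),   # no contract, no test results
        "gemini": dict(isolated),  # same isolation
    }
```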
The Synthesizer applies deterministic rules to their verdicts:
- Any regression vs baseline (tests that passed before, now fail) → AUTO_BLOCK
- Any CRITICAL finding from any reviewer → AUTO_BLOCK
- All approve, no regressions, holdouts pass → AUTO_OK
- Everything else → HUMAN_REVIEW
The human reviews flagged findings, not the full diff.
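Because the rules are deterministic, the whole synthesis step fits in a few lines. A sketch that mirrors the rules above (the verdict shape is an assumption, not Signum's schema):

```python
def synthesize(verdicts: list[dict], regressions: bool,
               holdouts_pass: bool) -> str:
    """Apply the decision rules in priority order: blocks win over approvals."""
    if regressions:
        return "AUTO_BLOCK"
    if any(f["severity"] == "CRITICAL"
           for v in verdicts for f in v.get("findings", [])):
        return "AUTO_BLOCK"
    if all(v["approve"] for v in verdicts) and holdouts_pass:
        return "AUTO_OK"
    return "HUMAN_REVIEW"
```

No model votes on the final decision; the models only produce findings, and the rules do the rest.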
What Gets Produced
Every run produces .signum/proofpack.json:
```json
{
  "schemaVersion": "3.0",
  "runId": "signum-2026-03-03-xaayxd",
  "decision": "AUTO_OK",
  "confidence": { "overall": 94 },  // heuristic score, not a probability
  "checksums": {
    "contract.json": "sha256:de3185d4...",
    "combined.patch": "sha256:dc22cfd0..."
  },
  "auditChain": {
    "contract_sha256": "de3185d4...",
    "approved_at": "2026-03-03T13:04:29Z",
    "base_commit": "13dcd3ed"
  }
}
```
The proofpack anchors: what was specified → what was approved → what was implemented → what was audited. CI can gate on the decision field. The checksum chain makes changes tamper-detectable given trusted artifact storage - store it as a CI artifact or commit it, and you have a verifiable record.
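A CI gate on the proofpack is a few lines of checksum verification. A sketch under the assumption that the artifacts are available as bytes (the function name is mine, not part of Signum):

```python
import hashlib

def verify_proofpack(proofpack: dict, artifacts: dict) -> bool:
    """Recompute each artifact's SHA-256 and compare to the recorded chain.

    Any mismatch means an artifact changed after approval; only an intact
    chain with decision AUTO_OK should let the pipeline proceed.
    """
    for name, expected in proofpack["checksums"].items():
        digest = hashlib.sha256(artifacts[name]).hexdigest()
        if f"sha256:{digest}" != expected:
            return False
    return proofpack["decision"] == "AUTO_OK"
```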
Running It
```
claude plugin marketplace add heurema/emporium
claude plugin install signum@emporium
/signum "add rate limiting to the API"
```
Signum grades your spec, shows you the contract for approval, implements with a repair loop, audits in parallel, and packages the result. Codex and Gemini are optional - the pipeline degrades gracefully if they're not installed, running single-model audit with a lower confidence score.
What I Actually Learned
The spec quality gate was the biggest surprise.
I expected the hard part to be the multi-model orchestration. It turned out to be writing testable specifications. When the gate started blocking my own specs for low testability scores, I realized how often I'd been asking models to implement things I hadn't actually defined.
"Add proper error handling" scores a 40. "Add error handling that returns 400 for malformed JSON, 401 for invalid tokens, and 500 with a logged trace for unexpected exceptions, verified by pytest tests/test_errors.py" scores a 92.
The gate is annoying. It's supposed to be.
Limitations
This is a blog post, not a sales page. A few things Signum doesn't solve:
- Holdouts aren't perfect. The same model that writes the contract also generates the holdout scenarios. It can inherit the same blind spots. Holdouts raise the bar; they don't guarantee correctness.
- Context isolation isn't a security boundary. Models from the same provider family may share architectural biases regardless of what context they're given.
- The scores are heuristic. Spec grade, confidence score, and risk level are computed approximations - not calibrated probabilities.
- Contracts work best for testable requirements. Subjective UI/UX, performance targets without benchmarks, and open-ended refactors are harder to specify in a way the gate can score fairly.
- Flaky tests will block you. If your test suite is non-deterministic, AUTO_BLOCK on regressions will create false alarms.
Signum is at github.com/heurema/signum - MIT license. The how-it-works.md has the full pipeline spec including agent models, prompt isolation details, and cost estimates per risk level.
Found a bug or want to request a feature? Reporter handles it without leaving Claude Code:
```
claude plugin install reporter@emporium
/report bug
```
It auto-detects which heurema repo you're working in, walks you through a few questions, collects your environment (OS, shell, Claude version), previews the issue before submitting, and posts via gh - or copies to clipboard if gh isn't available.