Christo Zietsman

Posted on Mar 22 • Edited on Apr 4 • Originally published at blog.nuphirho.dev

The Echo Chamber in Your Pipeline

#ai #codereview #specifications #bdd

This is Part 1 of a four-part series, "The Specification as Quality Gate." The series develops three hypotheses about executable specifications, AI code review, and what each is actually for. Parts 2, 3, and 4 will follow.

When an AI coding agent generates code and a separate AI reviewer examines
it, both agents are reasoning from the same artefact: the code itself. If
neither has an external reference (a specification, a contract, a
statement of what the system is supposed to do), the reviewer has no
ground truth to compare against. It checks the code against the code. Not
against intent.

The architecture is circular. And there is now enough empirical evidence
to say precisely where and how it fails.

The Mechanism

In classical ensemble learning, stacking multiple estimators improves
reliability under one condition: the estimators must fail independently.
If two classifiers share the same training distribution and the same blind
spots, combining them does not reduce error. It consolidates it. The joint
miss rate of two correlated estimators approaches the miss rate of either
one alone, not the product of both.
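The arithmetic behind that claim can be sketched for two estimators with the same miss rate. This is a minimal illustration, not taken from any of the papers cited here: `joint_miss_rate`, `p`, and `rho` are names introduced for the example, and it assumes each miss is a Bernoulli event with a non-negative correlation between the two.

```python
def joint_miss_rate(p: float, rho: float) -> float:
    """Probability that two estimators both miss the same defect.

    Assumes (for illustration only) that each estimator misses with
    probability p and that the two miss events have correlation rho
    in [0, 1]. For two Bernoulli events with equal marginals:
        P(both miss) = p*p + rho * p * (1 - p)
    """
    if not 0.0 <= p <= 1.0 or not 0.0 <= rho <= 1.0:
        raise ValueError("p and rho must both lie in [0, 1]")
    return p * p + rho * p * (1.0 - p)


if __name__ == "__main__":
    p = 0.2  # hypothetical miss rate of either estimator alone
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}  joint miss rate={joint_miss_rate(p, rho):.3f}")
```

Under these assumptions, independent failures (rho = 0) give a joint miss rate of 0.04, the product of the two. Fully correlated failures (rho = 1) give 0.20, the same as one estimator alone.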

This holds for LLM pipelines for a specific reason: a code-generating
model and a code-reviewing model drawn from the same model family share
architecture, training corpus, and reward signal. That is the same class
of correlation the independence condition prohibits. They are not two
independent estimators. They are two samples from the same prior.

A 2025 paper (Vallecillos-Ruiz, Hort, and Moonen, arXiv:2510.21513)
studying LLM ensembles for code generation and repair named this the
"popularity trap." Models trained on similar distributions converge on
the same syntactically plausible but semantically wrong answers. Consensus
selection (the default review heuristic in most multi-agent pipelines)
filters out the minority correct solutions and amplifies the shared
error. Diversity-based selection, by contrast, recovered up to 95% of
the gain that a perfectly independent ensemble would achieve. The
implication is direct: naive stacking of homogeneous LLMs is
counterproductive.
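A toy illustration of how consensus selection can discard a minority correct answer. This is not the selection procedure from the paper: `consensus_select` and the candidate labels are invented for the example, and the five outputs are hypothetical.

```python
from collections import Counter


def consensus_select(candidates):
    """Return the most frequent candidate (a simple majority vote)."""
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical outputs from five samples that share the same blind spot.
candidates = [
    "plausible_but_wrong",
    "plausible_but_wrong",
    "correct",
    "plausible_but_wrong",
    "plausible_but_wrong",
]

selected = consensus_select(candidates)
print("selected:", selected)
print("correct answer was in the pool:", "correct" in candidates)
```

The correct answer is present in the pool, but a vote cannot surface it, because the vote measures agreement rather than correctness.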

A second paper (Mi et al., arXiv:2412.11014) examined a
researcher-then-reviser pipeline for Verilog code generation and found
that erroneous information from the first agent was accepted downstream
because the agents shared the same training distribution and lacked
adversarial diversity. Self-review repeated the original errors rather
than correcting them.

A February 2026 paper (Pappu et al., arXiv:2602.01011) went further,
showing that even heterogeneous multi-agent teams consistently fail to
match their best individual member, incurring performance losses of up to
37.6%, even when explicitly told which member is the expert. The failure
mechanism is consensus-seeking over expertise. Homogeneous copies make
it worse.

None of these papers tested a pure generator-then-reviewer pipeline
without external grounding in the exact configuration described here.
That controlled experiment has not been published. But three independent
papers from 2025 and 2026, approaching the problem from different angles,
document the same failure mode: shared training distributions produce
correlated errors, and consensus amplifies rather than corrects them.

A Concrete Example

This example uses a deliberately simple case. Modern frontier models catch
classic boundary conditions reliably, as the experiments described in the
caveats section confirm. The point here is the sequence: how a BDD
scenario makes a defect detectable before any reviewer is involved.

Consider a pagination function. A developer asks an AI coding agent to
implement it. The agent produces:

def paginate(items, page, page_size):
    start = page * page_size
    end = start + page_size
    return items[start:end]

The agent also produces tests: one for page 0 and one for page 1 against
a ten-item list. Both pass. The implementation is correct for those cases.

The flaw is in what was not tested. When the total number of items is
exactly divisible by the page size and the caller requests the last page
by calculating it from the total count, the edge case is invisible to
the tests provided.
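A short sketch makes the gap concrete. It assumes the caller derives the last page index as the total count divided by the page size. The original does not show the caller, so `last_page_index` is hypothetical.

```python
def paginate(items, page, page_size):
    # Implementation as produced by the agent (zero-based page index).
    start = page * page_size
    end = start + page_size
    return items[start:end]


def last_page_index(total, page_size):
    # Hypothetical caller-side calculation, assumed for this sketch.
    return total // page_size


page_size = 5
eleven = list(range(1, 12))  # items 1..11
ten = list(range(1, 11))     # items 1..10

# 11 items: 11 // 5 == 2, and page 2 holds the final item.
print(paginate(eleven, last_page_index(len(eleven), page_size), page_size))  # [11]

# 10 items: 10 // 5 == 2, but the slice ten[10:15] is empty.
print(paginate(ten, last_page_index(len(ten), page_size), page_size))  # []
```

The same caller logic works when the count is not a multiple of the page size and silently returns nothing when it is. The tests for page 0 and page 1 never reach either case.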

The review agent, given only the code and the tests, validates the
implementation against what is there. The tests pass. The reviewer
reports no issues. It has no basis to ask what was not tested, because
there is no specification defining the expected behaviour for boundary
conditions.

The BDD scenario that would have caught it:

Scenario: Last page when total is exactly divisible by page size
  Given a list of 10 items
  And a page size of 5
  When I request the last page, calculated as the total count divided by the page size
  Then I receive items 6 through 10
  And the result contains exactly 5 items

This scenario would have failed against the implementation before the
reviewer was ever involved. The pipeline stops at the specification, not
the review.

There is a second reason this matters beyond the review question. Even
where models reliably identify boundary conditions during focused review,
general refactoring passes introduce a different risk. An agent
refactoring for readability or performance is not focused on correctness.
Conditional checks and edge-case handling can be silently removed. The
BDD pipeline catches that regression the same way it catches the original
defect. The protection does not depend on what the agent was trying to do.

What the SGCR Paper Actually Says

A December 2025 paper from HiThink Research (arXiv:2512.17540) proposed
a Specification-Grounded Code Review framework and reported a 90.9%
improvement over a baseline LLM reviewer. This figure circulates widely
and requires a correction.

The 90.9% is a relative improvement in the developer adoption rate of
review suggestions: 42% versus 22%. Developer adoption reflects whether
suggestions are relevant and actionable. It is not a defect detection
rate. These are different claims.
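The figure can be reproduced from the two adoption rates quoted above. This is a sketch using the rounded percentages; the paper may compute it from unrounded values, and the variable names are introduced here for illustration.

```python
baseline_adoption = 22  # percent of suggestions adopted, baseline LLM reviewer
grounded_adoption = 42  # percent of suggestions adopted, specification-grounded

# Absolute gain is measured in percentage points.
absolute_gain = grounded_adoption - baseline_adoption

# Relative gain is the absolute gain as a share of the baseline.
relative_gain = absolute_gain / baseline_adoption * 100

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

Twenty percentage points of additional adoption over a 22% baseline is a 90.9% relative gain. Neither number says anything about how many defects were detected.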

The SGCR paper validates the hypothesis partially. Without specification
grounding, baseline LLM reviewers produce generic suggestions and
hallucinated issues. Grounding in human-authored specifications filters
the noise and produces suggestions developers act on. But the paper's
architecture, combining an explicit specification-driven path and an
implicit heuristic discovery path, does not claim specifications
make review redundant. It positions them as making review better.

SGCR's implicit pathway explicitly handles discovery of issues beyond the
stated rules. This is the paper's own acknowledgement that specifications
alone are insufficient. It is consistent with the argument that an AI
review agent has a legitimate residual role, but only after the
specification pipeline has done its work.

The Independence Condition

The correlated error claim has a precise boundary. Stacking two AI
reviewers is not always counterproductive. The failure condition is
specific: estimators that share a training distribution and lack an
external reference exhibit correlated failures. The condition for genuine
benefit is diversity plus external grounding.

If the reviewer uses a different model family, a different temperature, or
a substantially different prompting strategy, errors may be partially
independent. A cross-family pipeline, Grok reviewing Claude-generated
code for instance, has more independence than a same-family pipeline.
But model diversity and external grounding are two separate conditions.
Diversity reduces correlation between estimators. It does not supply
ground truth. A cross-family reviewer without an external specification
is still checking code against code, not code against intent. It will
catch some things the generator missed. It will share the same blind
spots on anything not well-represented in either training corpus, and
it has no basis to identify what was not specified regardless of how
different its architecture is from the generator.

The ensemble literature is clear that diversity is what makes stacking
work. The specification is what makes review non-circular. Both
conditions matter. The current industry architecture typically satisfies
neither.

But What About Documentation?

A reasonable objection: if the problem is that the reviewer lacks ground
truth, why not solve it by adding documentation? Write better docstrings.
Maintain a spec document. Give the reviewer the context it needs.

Our own experiment accidentally tested this. The first version of the v2
corpus included docstrings that stated the domain convention explicitly.
The prorate_premium docstring said "ISDA actual/actual." The
schedule_maintenance docstring said "ANY of these thresholds." Claude
caught every bug at 100%. Documentation in close proximity to the code
works, at least when the agent reads it.

That result is real, but it does not solve the problem. It reframes it.

The first issue is retrieval and compliance. A docstring sitting on the
same function is the best case for specification proximity. An external
policy document, a requirements file, or a Confluence page is a much weaker
guarantee. The agent needs to find it, read it, and apply it faithfully.
Anyone who has watched an AI agent work at scale knows that is not a safe
assumption. Agents read context selectively. They produce plausible output
that satisfies surface checks without necessarily following the rule they
were given. I have seen this enough times firsthand to know I cannot
treat documentation-as-specification as reliable ground truth.

The second issue is drift. Docstrings are not executable. They do not
fail when the code diverges from them. A function can be refactored, a
business rule can change, an edge case can be added, and the docstring
sits there describing what the function used to do, confidently, in
present tense. Every codebase accumulates this. The reviewer checking
code against a stale docstring is not checking against intent. It is
checking against what someone intended when they first wrote it.

A BDD scenario cannot drift silently. When the system stops behaving the
way the scenario describes, the scenario fails. The build stops. The
divergence is visible immediately. That is not a property of documentation.
It is a property of executable specifications. The difference is not one
of quality or discipline. It is structural.
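The structural difference can be sketched in a few lines. Everything below is hypothetical: `needs_maintenance`, its parameters, and the thresholds are invented for illustration and are not the schedule_maintenance function from the experiment corpus. The docstring keeps asserting the "ANY" rule after the body has drifted to "ALL", and only the executable check notices.

```python
def needs_maintenance(hours, cycles, max_hours=100, max_cycles=50):
    """Return True when ANY of these thresholds is reached."""
    # Drifted body: a refactor turned "or" into "and".
    # The docstring above is now stale, and nothing forces it to change.
    return hours >= max_hours and cycles >= max_cycles


def check_any_threshold_triggers(fn):
    """Executable form of the rule: exceeding one threshold alone is enough."""
    return fn(hours=120, cycles=10) is True


if __name__ == "__main__":
    # The docstring still reads as correct; the executable check reports the drift.
    print("docstring says:", needs_maintenance.__doc__)
    print("spec holds:", check_any_threshold_triggers(needs_maintenance))
```

Run as written, the check reports that the spec does not hold, while the docstring reads exactly as it did before the refactor.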

The objection "just write better documentation" asks for the same
discipline that produced the bad documentation in the first place, and
adds an assumption that AI agents will follow it reliably. The BDD
pipeline removes both dependencies.

The Implication

An AI reviewer without an external specification is a probability estimate,
not a quality gate. It samples from the same distribution as the generator
and applies pattern matching against its training data. It will catch some
things the generator missed, particularly common vulnerability patterns
and well-documented anti-patterns that appear frequently in training data.
It will systematically miss whatever the generator systematically missed,
because they share the same prior.

A BDD scenario that fails is a falsified claim. The pipeline either passes
or it does not. There is no hallucination, no false confidence, no noise
to filter.

The architecture that follows is not "no AI review." It is: specifications
first, verification pipeline second, AI review only for the residual. The
residual is real. It includes architectural properties, structural drift,
and defect classes that resist specification, but it is a much smaller
target than the current industry framing suggests. The next post in this
series maps that residual precisely.
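The ordering can be sketched as a gate. This is an illustration under assumptions, not a description of any existing tool: `run_pipeline`, the check callables, and the `ai_review` callable are all hypothetical, and the executable specifications stand in for the whole verification pipeline.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    spec_checks: Iterable[Callable[[], bool]],
    ai_review: Callable[[], List[str]],
) -> Dict[str, object]:
    """Run executable specifications first; consult the reviewer only if they pass."""
    failed = [check.__name__ for check in spec_checks if not check()]
    if failed:
        # Hard gate: a failing specification stops the pipeline before any review.
        return {"status": "failed", "failed_specs": failed, "review_notes": []}
    # Residual only: reviewer output is advisory, not a pass/fail decision.
    return {"status": "passed", "failed_specs": [], "review_notes": ai_review()}


def spec_last_page_is_not_empty() -> bool:
    # Placeholder standing in for a real executable scenario.
    return False


def reviewer() -> List[str]:
    # Placeholder standing in for a call to an AI review agent.
    return ["consider renaming `page` to `page_index`"]


if __name__ == "__main__":
    print(run_pipeline([spec_last_page_is_not_empty], reviewer))
```

In this sketch the reviewer is never called while a specification is failing, which keeps its output out of the pass/fail decision.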

Caveats and Open Questions

This argument rests on three empirical papers and two small contrived
experiments. Both experiments are same-family only (Claude reviewing
Claude-generated code) and use a planted bug corpus rather than a natural
defect sample. They are directional evidence, not a controlled
demonstration.

The first experiment used classic boundary-condition bugs. Claude caught
all five at 100%, which refined rather than confirmed the hypothesis.
Classic boundary conditions are pattern-recognition problems that are
dense in training data. The correlated error claim applies where the
convention is absent from training data, not where it is ubiquitous.

The second experiment used domain-convention violations with neutral
docstrings. Detection ranged from 0% to 100% depending on domain opacity.
The log-linear interpolation function, where the convention is market
practice in fixed income rather than general programming knowledge, was
missed in all five runs. The reviewer flagged a different concern instead
and declared the implementation correct. BDD caught it. The full corpus,
specifications, scripts, and results are available in the
experiments directory.

The independence condition also requires more empirical work. Same-family
models fail the independence requirement. How much architectural diversity
(different model families, different prompting strategies, different
temperatures) is required to provide genuine signal rather than
correlated noise remains to be established.

These are open questions, not fatal weaknesses. Without an external
reference, circular review is circular. The empirical work so far points
in one direction. A controlled demonstration using a natural defect sample
and cross-family pipelines would make the argument more precise.

The authors of the papers cited here, along with anyone working on LLM
ensemble reliability, multi-agent code pipelines, or specification-grounded
review, are welcome to challenge, correct, or build on this argument.
If the hypothesis is wrong, knowing precisely where it breaks is more
useful than leaving it unchallenged. Comments are open.

The papers cited in this post are: Vallecillos-Ruiz, F., Hort, M., and
Moonen, L., "Wisdom and Delusion of LLM Ensembles for Code Generation and
Repair," arXiv:2510.21513, October 2025; Mi, Z. et al., "CoopetitiveV:
Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality
Verilog Generation," arXiv:2412.11014, v2 June 2025; Pappu, A. et al.,
"Multi-Agent Teams Hold Experts Back," arXiv:2602.01011, February 2026;
and Wang, K., Mao, B., Jia, S., Ding, Y., Han, D., Ma, T., and Cao, B.,
"SGCR: A Specification-Grounded Framework for Trustworthy LLM Code
Review," arXiv:2512.17540, December 2025, v2 January 2026. Author lists
and findings verified against the original papers.

Series: The Specification as Quality Gate
Part 1: The Echo Chamber in Your Pipeline (this post)
Part 2: From Complex to Complicated
Part 3: What Specifications Cannot Catch
Part 4: The Specification as Quality Gate