DEV Community: 张振

I stopped letting AI review its own code

张振 — Thu, 04 Jun 2026 02:45:39 +0000

The blind spot problem

I had Claude add input validation to an API endpoint. It wrote clean, idiomatic TypeScript. I asked it to review the diff. It approved it. Tests passed. I shipped it.

Two days later a colleague pointed out that the validation silently accepted empty strings, which the original spec explicitly prohibited. Claude had written the validator and then approved it, and neither pass caught the gap.

When I went back and asked Claude why it missed it, the answer was essentially: "I interpreted the requirement as non-null rather than non-empty, which is a reasonable reading."

That's exactly the problem. A model that wrote subtly wrong code doesn't just fail to catch the bug — it actively defends its interpretation. It doesn't have independent judgment about its own output. It has motivated reasoning.
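To make the gap concrete, here is a minimal sketch of the two readings. It is written in Python for brevity (the original endpoint was TypeScript), and both function names are hypothetical rather than the code Claude actually produced.

```python
from typing import Optional


def validate_non_null(value: Optional[str]) -> bool:
    """The reading the model chose: anything except None passes."""
    return value is not None


def validate_non_empty(value: Optional[str]) -> bool:
    """The reading the spec asked for: None and empty strings are rejected."""
    return value is not None and len(value) > 0


# The two validators agree on every input except the empty string,
# which is why tests without an empty-string case pass under both readings.
for sample in [None, "", "alice"]:
    print(repr(sample), validate_non_null(sample), validate_non_empty(sample))
```

A reviewer who has already settled on the first reading has no reason to add the one test case that separates the two.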

Human code review works because reviewers bring different priors. A senior engineer who didn't write the code asks different questions than the person who did. They notice different things. They haven't already committed to an interpretation.

The same principle applies to LLMs. The model that wrote your function and the model that reviews it shouldn't share weights.

What I tried first

The obvious fix is to use a second model for review. Have Claude write, have GPT-5 or Codex review.

This helps. It's not a complete solution.

The problem is that most orchestration setups run these models sequentially on the same output. Claude produces a diff. You pipe that diff to Codex and ask it to review. Codex can catch logical errors that Claude missed.

But now you have a different problem: Codex is reviewing someone else's implementation choices. It might flag valid decisions as bugs, or miss problems that are specific to how Claude structured the code. You end up with noisy reviews and unclear signal about which model actually produces better output for your use case.

I wanted something different. I wanted to know: for this specific task, in this specific codebase, which model writes better code?

Racing instead of reviewing

The answer I landed on was running both models on the same task simultaneously.

Give Claude Code and Codex the identical prompt. Let them each produce a full implementation in parallel, each in its own isolated git worktree so they can't interfere with each other. Then compare the two outputs side by side and pick.

$ npx runoff run \
    --prompt "Add formatRelativeTime() to src/utils/format.ts" \
    --config pipeline.config.json

Running race: claude-code vs opencode/DeepSeek...

  candidate 0  (claude-code)      +27 lines
    formatRelativeTime(isoString: string)
    handles: seconds, minutes, hours, days
    no future date support

  candidate 1  (opencode/DeepSeek) +60 lines
    formatRelativeTime(dateInput: string | Date)
    handles: seconds, minutes, hours, days, weeks
    future dates ("2 hours from now")
    edge cases: null, invalid date, DST boundary

Pipeline paused — awaiting judge decision.

$ npx runoff race apply --session abc123 --winner 1
✓ Merged candidate 1. Worktree 0 cleaned up.

The race pause is intentional. I don't want the system to automatically pick a winner. The whole point is that I decide, because I know what matters for my codebase. Does the function need to handle Date objects? Is future-date support worth the extra complexity? Only I know.

This is the key insight: the race is not trying to determine which model is objectively better. It's trying to surface the trade-offs so a human can make an informed decision.
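The isolate-then-race flow can be sketched roughly as follows. This is not runoff's implementation: the branch naming, the directory layout, and the `fake_agent` stub are assumptions for illustration. The only real command involved is `git worktree add -b <branch> <path> <commit>`, and the sketch only builds that command rather than executing it.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def worktree_command(root: Path, index: int, base: str = "HEAD") -> List[str]:
    """Build the git command that gives one candidate its own checkout."""
    path = root / f"candidate-{index}"
    branch = f"race/candidate-{index}"  # hypothetical branch naming
    return ["git", "worktree", "add", "-b", branch, str(path), base]


def race(providers: List[str], prompt: str,
         run_candidate: Callable[[str, str], str]) -> Dict[str, str]:
    """Run every provider on the same prompt in parallel and collect results.

    No winner is chosen here; a human compares the results afterwards.
    """
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {p: pool.submit(run_candidate, p, prompt) for p in providers}
        return {p: f.result() for p, f in futures.items()}


def fake_agent(provider: str, prompt: str) -> str:
    """Stand-in for a real coding agent; returns a label instead of a diff."""
    return f"{provider} handled: {prompt}"


results = race(["claude-code", "opencode"], "Add formatRelativeTime()", fake_agent)
for provider, output in results.items():
    print(provider, "->", output)
print(worktree_command(Path("/tmp/race"), 0))
```

The `race` function deliberately returns every candidate's output and stops there, mirroring the pause for a human judge.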

What I learned from running 50+ races

After a few weeks of this workflow, some patterns emerged.

Model strengths are task-specific and codebase-specific. In my TypeScript API codebase, Claude Code consistently produces more idiomatic code that matches my existing style. But for Go utility functions and data processing scripts, Codex/DeepSeek tends to produce more comprehensive implementations that handle more edge cases. Neither is universally better.

The races that surprised me most were the most valuable. When I expected Claude to win and Codex produced a clearly superior implementation, that was information I wouldn't have gotten from a single-model workflow. And vice versa.

Prompt phrasing interacts with model strengths in unexpected ways. "Add validation" produces very different relative results from "Add validation to reject empty strings, null, and strings over 255 characters." The same models, different specs, different winner.

Over time, I started to predict which model would win on which type of task. That meta-knowledge is now more valuable to me than any individual race outcome.

The memory angle

The pattern I noticed — "Codex tends to be more thorough on utility functions, Claude tends to match my style better on API handlers" — is the kind of thing I wanted the system to learn and use.

That's what the Dream system in runoff does. Every race you run produces a trace. When you pick a winner, the system records which provider won, what kind of task it was, which files were involved. Over time it builds a pattern library from your actual picks.

When you start a new race, runoff retrieves relevant past patterns: "In similar utility function tasks involving src/utils/, you've picked Codex 7 out of 9 times." That doesn't determine the race outcome — you still see both diffs and decide — but it gives you context from your own history.
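A minimal sketch of how such a summary could be derived from recorded picks. The `RaceRecord` fields and task-type labels are assumptions for illustration, not runoff's actual trace schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RaceRecord:
    task_type: str   # e.g. "utility-function" (hypothetical label)
    directory: str   # directory of the files the task touched
    winner: str      # provider the human picked


def summarize(history: List[RaceRecord], task_type: str, directory: str) -> str:
    """Count past winners for races that match the new task."""
    similar = [r for r in history
               if r.task_type == task_type and r.directory == directory]
    if not similar:
        return "No similar races recorded yet."
    wins = Counter(r.winner for r in similar)
    provider, count = wins.most_common(1)[0]
    return f"In similar tasks you've picked {provider} {count} out of {len(similar)} times."


history = (
    [RaceRecord("utility-function", "src/utils/", "codex")] * 7
    + [RaceRecord("utility-function", "src/utils/", "claude-code")] * 2
    + [RaceRecord("api-handler", "src/api/", "claude-code")] * 4
)
print(summarize(history, "utility-function", "src/utils/"))
```

Exact matching on task type and directory is the simplest possible notion of "similar"; the retrieval described next is what replaces it.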

The retrieval uses a multi-strategy approach (semantic similarity, keyword matching, file-path graph hops, entity matching) fused with a weighted ranking that the system tunes based on which patterns actually correlated with your picks. After enough races, the system knows which retrieval strategy works best for your codebase.
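One common way to fuse several ranked lists is weighted reciprocal-rank fusion, sketched below. Which fusion method runoff actually uses is not stated here, so treat the formula, the constant `k`, the strategy names, and the weights as assumptions. The tuning step that adjusts the weights from past picks is not shown.

```python
from collections import defaultdict
from typing import Dict, List


def fuse(rankings: Dict[str, List[str]], weights: Dict[str, float],
         k: int = 60) -> List[str]:
    """Weighted reciprocal-rank fusion over per-strategy ranked lists.

    Each strategy contributes weight / (k + rank) to every pattern it returns.
    """
    scores: Dict[str, float] = defaultdict(float)
    for strategy, ranked in rankings.items():
        weight = weights.get(strategy, 1.0)
        for rank, pattern_id in enumerate(ranked, start=1):
            scores[pattern_id] += weight / (k + rank)
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


rankings = {
    "semantic": ["p1", "p2", "p3"],
    "keyword": ["p2", "p1"],
    "file_path": ["p3", "p2"],
    "entity": ["p2"],
}
uniform = {name: 1.0 for name in rankings}
tuned = {"semantic": 1.0, "keyword": 0.5, "file_path": 2.0, "entity": 0.5}

print(fuse(rankings, uniform))
print(fuse(rankings, tuned))
```

With uniform weights, pattern `p1` ranks above `p3`. Once the file-path strategy is weighted more heavily, `p3` moves ahead, which is the kind of shift per-codebase tuning is meant to produce.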

It's a slow accumulation. After 10 races you have weak signal. After 50 you have something useful. After 100 you have a surprisingly good model of your own taste.

How this compares to other approaches

vs. single-model review: The fundamental limitation is the shared-weights problem. A model reviewing its own code has motivated reasoning. Racing gives you two independently produced implementations to compare, instead of one implementation plus its author's opinion of it.

vs. Vibe Kanban / parallel agent dashboards: These tools run different agents on different tasks in parallel to increase throughput. That's a different problem — scale and speed. runoff runs different agents on the same task to improve quality. The goals are orthogonal.

vs. Cadence's role-split approach: Cadence uses different models for different SDLC phases: Claude writes, Codex reviews, Gemini sits on the architectural council. This is smart: it breaks the shared-weights problem at the phase level. The difference is that runoff compares outputs from the same phase rather than assigning models to roles in different phases. You learn which model produces better first-pass implementations for your specific task types, not just which model is better at review in general.

vs. manually running both models: You can absolutely do this by hand — open Claude Code in one terminal, Codex in another, give them the same prompt, compare the diffs yourself. runoff automates the isolation (separate worktrees, no cross-contamination), the parallel execution, the diff surfacing, the trace logging, and the pattern accumulation. The workflow is the same; the overhead is much lower.

The practical setup

runoff works as an MCP server, so it integrates with Claude Code, Cursor, and Claude Desktop without leaving your IDE:

{
  "mcpServers": {
    "runoff": {
      "command": "npx",
      "args": ["runoff", "mcp"],
      "cwd": "/path/to/your/project"
    }
  }
}

A race config is just an array in your pipeline JSON:

{
  "pipeline": {
    "implement": [["claude-code", "opencode"]],
    "review":    ["claude-code", "implement"]
  }
}

The [["claude-code", "opencode"]] syntax means "run both in parallel and pause for a judge decision." A single string would run that provider sequentially. The pipeline continues after you pick.
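A minimal sketch of how that convention could be interpreted when reading the config. This is an illustration of the nested-list rule described above, not runoff's parser, and the `describe_stage` helper is hypothetical.

```python
import json
from typing import List, Union

Step = Union[str, List[str]]


def describe_stage(steps: List[Step]) -> List[str]:
    """Turn one stage's steps into readable actions.

    A nested list is a race (parallel, then pause for the judge);
    a plain string is a single provider run on its own.
    """
    actions = []
    for step in steps:
        if isinstance(step, list):
            actions.append("race: " + " vs ".join(step) + " (pause for judge)")
        else:
            actions.append("run: " + step)
    return actions


config = json.loads('{"pipeline": {"implement": [["claude-code", "opencode"]]}}')
for stage, steps in config["pipeline"].items():
    print(stage, describe_stage(steps))
```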

What I don't know yet

A few open questions I'm still working through:

Does the accumulated pattern memory actually help? Anecdotally yes — I've noticed that retrieval surfaces relevant context when I start a new race. But I haven't run a controlled experiment comparing race outcomes with and without pattern retrieval.

What's the right race cadence? I don't race every task — that would double my token spend and slow things down. I race tasks where I have genuine uncertainty about the implementation approach, or where I've been burned by a model's blind spots before. Finding the right selection criteria is still intuition more than system.

Does this work at team scale? The current setup is single-user. Pattern memory is per-machine. I haven't thought through what it would look like to share race history across a team, or whether team patterns would be useful or just noisy.

Try it

npx runoff init --work-dir /path/to/your/repo
npx runoff run --prompt "your task here"

The init command generates a pipeline.config.json for your repo. The demo mode (npm run demo) runs with mock providers if you want to see the race mechanics before connecting real backends.

Source: github.com/alexangelzhang/runoff

If you've built something similar or have a different take on the shared-weights problem, I'd be interested to hear it.

Tags: ai, programming, productivity, claude

Qualix: semantic coverage gates for AI-generated code

张振 — Tue, 02 Jun 2026 11:13:51 +0000

AI coding agents write tests. The tests pass. Coverage is green. And then the bug ships.

Here is a concrete example. A PRD says:

Requests at or above 500 USD require manager and finance approval.

A generated test suite might contain:

from decimal import Decimal

# classify is the function under test, imported from the application module.

def test_low_amount():
    # 120 USD → manager approval only
    assert classify(Decimal("120")) == "manager_only"

def test_high_amount():
    # 600 USD → finance required
    assert classify(Decimal("600")) == "finance_required"

Both tests pass. Branch coverage is green. The implementation, however, uses > 500 instead of >= 500. The case of exactly 500 USD — which the PRD says requires finance approval — silently routes to the wrong path.

This is not a clever edge case. It is the boundary the PRD explicitly defined. A coverage tool reports it as fine.
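A minimal sketch of why the two generated tests cannot detect the bug. The names `classify_buggy` and `classify_fixed` are hypothetical; the return labels are taken from the tests above.

```python
from decimal import Decimal

THRESHOLD = Decimal("500")


def classify_buggy(amount: Decimal) -> str:
    """Implementation with the strict comparison described above."""
    return "finance_required" if amount > THRESHOLD else "manager_only"


def classify_fixed(amount: Decimal) -> str:
    """Implementation that matches the PRD: at or above 500 needs finance."""
    return "finance_required" if amount >= THRESHOLD else "manager_only"


# The two generated test inputs cannot tell the implementations apart.
for amount in (Decimal("120"), Decimal("600")):
    assert classify_buggy(amount) == classify_fixed(amount)

# Only the boundary value separates them.
print(classify_buggy(Decimal("500")))   # manager_only
print(classify_fixed(Decimal("500")))   # finance_required
```

Both versions execute the same branches for 120 and 600, so branch coverage is identical and green in either case.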

What Qualix does

Qualix is a quality gate that starts from the requirement, not from the code.

Given the PRD, it extracts semantic expectations (SEs) — the business behaviors that tests should prove:

SE-003: a request at exactly 500 USD requires manager AND finance approval

It then audits the test suite against those expectations. If no test exercises amount == 500, SE-003 is flagged as PARTIAL or MISSING, regardless of what line coverage says.
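The idea behind that audit can be sketched in a few lines. This is a toy illustration, not Qualix's implementation: the `SemanticExpectation` type, the rule for choosing between PARTIAL and MISSING, and the COVERED label are all assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class SemanticExpectation:
    se_id: str
    required_amount: Decimal  # input value a test must exercise


def audit(se: SemanticExpectation, tested_amounts: List[Decimal]) -> str:
    """Classify one expectation against the amounts the test suite exercises.

    The status rules are assumptions for this sketch:
    MISSING if nothing is tested, PARTIAL if tests exist but skip the
    required value, COVERED (a label invented here) otherwise.
    """
    if not tested_amounts:
        return "MISSING"
    if se.required_amount not in tested_amounts:
        return "PARTIAL"
    return "COVERED"


se_003 = SemanticExpectation("SE-003", Decimal("500"))
generated_suite = [Decimal("120"), Decimal("600")]
print(se_003.se_id, audit(se_003, generated_suite))
print(se_003.se_id, audit(se_003, generated_suite + [Decimal("500")]))
```

The check never looks at which lines ran. It only asks whether the value named in the requirement was ever put through a test.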

The full pipeline:

Q01  Structure the PRD into traceable REQ/BR/SE items
Q05a Design test targets from those semantics (EUT matrix)
Q05b Generate test code (optional, needs compile gate)
Q06  Audit the test suite against the original SE items
Q07  Structured code review tied back to requirement IDs

It works with your existing test runner. It does not replace pytest, JUnit, or Jest. It sits above them and answers a different question: did the tests prove the requirement, or just execute the code?

Why now

AI coding agents have made it cheap to generate code and tests. That changes the bottleneck. The bottleneck is no longer "can we write this?" but "does this actually do what the product asked for?"

Line coverage was designed for the world where tests were hand-written. In that world, a developer who wrote the test usually understood the requirement. The test was evidence of understanding.

AI-generated tests are evidence of execution. The model generates what it can infer from the code. If the code has a logic error at the boundary, the generated test will probably pass — because the test is generated from the same (wrong) implementation.

Qualix is an attempt to inject the requirement back into the loop, at a point where it can still catch the gap before the code ships.

Current state

Apache 2.0, public alpha (0.2.0a1 on PyPI)
Java has the deepest path; TypeScript, Go, and Python are supported at a basic level
GitHub Actions composite action for CI gate integration
Real-world results: in three production Java services, Q06 found 18 EUT targets with assertion gaps that line coverage did not flag (16 partial, 2 missing)

pip install qualix
./scripts/run_expense_demo.sh   # no API key needed, shows pre-computed findings

What I am looking for

Feedback on:

Does the semantic coverage framing make sense? Is there a clearer way to explain the gap between line coverage and business-rule verification?
Java is the strongest path today. What language / framework would make this immediately useful for your team?
The current workflow requires an AI coding agent (Claude Code, Codex, Gemini CLI). Is the friction too high for evaluation?

The GitHub repo is at: https://github.com/alexangelzhang/qualix