DEV Community

orenlab

Structural review that finally knows what your tests cover

In earlier posts, I wrote about why I built CodeClone, why I exposed it through MCP for AI agents, and how b4 turned it into a real review surface for VS Code, Claude Desktop, and Codex.

b5 is the release where structural review stops being a parallel universe to your test suite.

Until now, CodeClone could tell you that a function is long, complex, duplicated, or coupled to everything — but it had no idea whether that function was covered by a single unit test. That mattered more than I wanted to admit. A complex function with a 0.98 coverage ratio is not the same risk as the identical function with 0.0. A reviewer knows this. An AI agent reading an MCP response doesn't — unless the tool tells it.

So b5 fixes that, and while doing it, also lifts a few other things that kept getting in the way:

  • typing and docstring coverage as first-class review facts
  • public API drift as a baseline-governed signal
  • intentionally duplicated test fixtures no longer pollute health scores or CI gates
  • a much clearer triage story for MCP and IDE clients
  • a rebuilt HTML report with unified filters and cleaner empty states
  • a Claude Desktop launcher that actually picks the right Python
  • a warm-path benchmark that now tells the truth

Let me walk through what changed and why.

1. Bring your coverage.xml into the review

The headline feature of b5 is Coverage Join. Point CodeClone at any Cobertura XML produced by coverage.py, pytest-cov, or your CI and it fuses test coverage into the same run that produces clone groups, complexity, cohesion, and dead code:

codeclone . --coverage coverage.xml --coverage-min 50 --html

What comes out is not "new coverage tool, please delete the old one." It's coverage used as a modifier on structural review:

  • Each function in the current run gets a factual coverage ratio.
  • Functions below the threshold show up as coverage hotspots with their complexity and caller count alongside.
  • High-risk findings can now read "complex + uncovered + new vs baseline" instead of just "complex."
  • A new gate, --fail-on-untested-hotspots, fails CI on below-threshold functions only where the coverage report actually measured them.

That last distinction is the part I care about most.

2. Honest about scope: measured vs out-of-scope

The easy mistake when bolting coverage onto a second tool is to silently treat "function missing from coverage.xml" as "function is uncovered." It makes the dashboard look busier, but it's a lie — the function might be covered by a coverage run that was filtered to a different package, or it might be a module the coverage config excluded on purpose.

b5 keeps these two cases cleanly separate:

  • Coverage hotspots — code that coverage.xml measured and reported below threshold. This is a hard signal.
  • Coverage scope gaps — code present in your repo but not in the coverage XML at all. This is a scoping observation, not a verdict.

Both show up in the report and through MCP, but with different meanings. In mixed monorepos this stops being cosmetic very fast.

None of this changes clone identity, fingerprints, or NEW-vs-KNOWN semantics — the baseline model is untouched. Coverage Join is a current-run fact, not baseline truth.

3. Typing and docstring coverage are now part of the picture

I used to expose "typing coverage" and "docstring coverage" as optional toggles. In practice, nobody turned them on, and they kept hiding behind flags that felt vestigial.

b5 removes the toggles and just collects adoption coverage whenever you run in metrics mode:

  • parameter annotation coverage
  • return annotation coverage
  • public docstring coverage
  • explicit Any count

They land in the main CLI Metrics block, in the HTML Overview, in MCP summaries, and in the baseline. And they get their own CI gates:

codeclone . \
  --min-typing-coverage 80 \
  --min-docstring-coverage 60 \
  --fail-on-typing-regression \
  --fail-on-docstring-regression

The regression gates are the interesting pair: they don't force you to reach a specific threshold; they just fail CI when adoption drops compared to your trusted baseline. That tends to be more workable for codebases that are migrating to types and docstrings gradually.
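To make the metric concrete, here is one way to compute parameter annotation coverage with the standard `ast` module and gate on regression, as a sketch only; CodeClone's real collector and its handling of edge cases may differ:

```python
import ast


def param_annotation_coverage(source: str) -> float:
    """Share of function parameters that carry a type annotation."""
    annotated = total = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            for arg in args.posonlyargs + args.args + args.kwonlyargs:
                if arg.arg in ("self", "cls"):
                    continue  # conventionally unannotated receivers
                total += 1
                annotated += arg.annotation is not None
    return annotated / total if total else 1.0


def typing_regressed(current: float, baseline: float) -> bool:
    """Regression gate: fail only when adoption drops below the baseline."""
    return current < baseline
```

The gate compares against a recorded baseline rather than an absolute bar, which is why it stays green on a half-typed codebase as long as the trend doesn't reverse.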

4. Public API drift becomes a first-class signal

Another thing that used to live outside the review surface: "did this PR break the public API?"

b5 adds an opt-in API Surface layer that records a snapshot of your public symbols — modules, classes, functions, their parameters and return types — in the metrics baseline. Subsequent runs produce a baseline diff with explicit categories: additions, breaking changes, and everything else.

# Record the snapshot
codeclone . --api-surface --update-metrics-baseline

# Guard PRs
codeclone . --fail-on-api-break

It's not a type checker and it's not SemVer enforcement. It's "the set of externally-callable names in this package just changed in a way that is likely to break downstream users, please confirm." For libraries that's the thing you want CI to block on.

Private symbols are classified separately from public ones, so moving an internal helper around doesn't pollute the diff.
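The diff itself boils down to set arithmetic over the snapshot. A hedged sketch, assuming a snapshot shaped as `{qualified_name: signature_string}` (the real baseline format is richer than this):

```python
def _is_public(qualname: str) -> bool:
    """Treat a leading underscore on the last path segment as private."""
    return not qualname.rpartition(".")[2].startswith("_")


def diff_api(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Categorise public-surface drift against a recorded snapshot.

    Removed symbols and changed signatures count as breaking; new public
    symbols are additions; private names never enter the diff.
    """
    base = {k: v for k, v in baseline.items() if _is_public(k)}
    cur = {k: v for k, v in current.items() if _is_public(k)}
    changed = {k for k in base.keys() & cur.keys() if base[k] != cur[k]}
    return {
        "additions": sorted(cur.keys() - base.keys()),
        "breaking": sorted((base.keys() - cur.keys()) | changed),
    }
```

Filtering private names before the set operations is what keeps internal refactors out of the diff, matching the behavior described above.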

5. Golden fixtures stop showing up as debt

Some repositories — including CodeClone itself — intentionally keep duplicated golden fixtures to lock report contracts and parser behavior. Those clones are real. They are also not live review debt.

b5 adds a project-level policy for exactly that case:

[tool.codeclone]
golden_fixture_paths = ["tests/fixtures/golden_*"]

Clone groups fully contained in those paths are:

  • excluded from the health score
  • excluded from CI gates
  • excluded from active findings
  • still carried in the report as suppressed facts

So the tool stays honest — you can still see the suppressed groups in the HTML Clones tab and in the canonical JSON — without making CI noisier than it needs to be. If a group stops being "fully inside the fixture paths," it stops being suppressed automatically.
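The "fully contained" rule is the whole trick, and it fits in a few lines. A minimal sketch using stdlib `fnmatch` (the shipped matcher may handle globs differently):

```python
from fnmatch import fnmatch


def is_suppressed(group_files: list[str], fixture_globs: list[str]) -> bool:
    """Suppress a clone group only when every occurrence lies inside a
    configured golden-fixture path; a single file outside re-activates it."""
    return bool(group_files) and all(
        any(fnmatch(path, pattern) for pattern in fixture_globs)
        for path in group_files
    )
```

Because suppression is recomputed per run from the group's current file list, no manual un-suppressing is ever needed when a clone escapes the fixture directory.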

6. Triage that says what it's actually looking at

MCP summary and triage payloads in b5 include a few compact interpretation fields that turned out to matter a lot for both AI agents and humans:

  • health_scope — is this number repository-wide, production-only, or for a specific focus?
  • focus — what does "new findings" actually mean for this run?
  • new_by_source_kind — of the new findings, how many are in production code vs tests vs tooling?

The net effect is that an agent asking "is this PR risky?" no longer has to guess whether "3 new findings" means "three new bugs in production" or "three new flake-prone tests." The payload tells it directly. The VS Code extension uses the same fields to explain repository-wide health, production focus, and outside-focus debt without widening the review flow.
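To make that concrete, here's an illustrative payload and consumer. The field names (`health_scope`, `focus`, `new_by_source_kind`) are the ones described above; the values and the `risk_note` helper are hypothetical, not part of the actual MCP contract:

```python
summary = {
    "health_scope": "production-only",   # illustrative values throughout
    "focus": "changed-files",
    "new_findings": 3,
    "new_by_source_kind": {"production": 1, "tests": 2, "tooling": 0},
}


def risk_note(payload: dict) -> str:
    """Spell out what 'N new findings' actually refers to."""
    prod = payload["new_by_source_kind"].get("production", 0)
    return (
        f"{payload['new_findings']} new findings under focus "
        f"{payload['focus']!r} ({payload['health_scope']} scope), "
        f"{prod} of them in production code"
    )
```

An agent consuming this never has to infer scope from context; the answer is in the payload.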

The extension also now surfaces Coverage Join facts in its overview when the connected server supports them, and the optional in-IDE help topics are gated by server version so they stay honest about what's actually available.

7. The HTML report got a proper rebuild

b4 made the HTML report useful. b5 makes it feel finished.

  • Unified filters popover — Clones and Suggestions share the same filter UX: one button, one menu, an active-filter count, keyboard dismiss. Every control lives in the same place on every tab that has filters. No more two-row filter strips that wrap on narrow screens.
  • Cleaner empty states — instead of empty tables, sections with no findings now render a single reassuring row with an explicit "no issues detected" message and an icon. Silence has meaning now.
  • Coverage Join subtab — Quality gets a dedicated Coverage Join view with per-function rows: coverage %, complexity, callers, source kind, and a clear marker for scope gaps.
  • Adaptive theme toggle — the theme button shows a sun in light mode and a moon in dark mode, resolved at paint time so you don't flash the wrong icon on first load.
  • Refreshed palette — the whole report moved to a chromatic neutral scale tinted toward the brand indigo, so surfaces, borders, and text live on the same hue axis instead of looking like "grayscale + one purple button."
  • Better provenance — the meta block makes it explicit which python tag the baseline was built for, and calls out baseline mismatches instead of hiding them.
  • Stat-card rhythm — KPI cards across Overview, Quality, Dependencies, and Dead Code share one card component now. Same padding, same typography, same tone variants.

None of that changes a single report contract. It's pure render-layer work.

8. Claude Desktop launches the right Python

A boring but high-impact b5 change: the Claude Desktop bundle now resolves your project's runtime before falling back to a global one. Poetry's .venv, workspace .venv, and an explicit workspace_root override all come before anything on PATH.

Before, installing CodeClone into your project and then launching it via Claude Desktop would often run some other CodeClone from /usr/local/bin, simply because that copy happened to be first on PATH. That's fixed.

If you've been getting subtly wrong results through Claude Desktop and couldn't explain why, this is the one to pull.

9. Safer and more deterministic under the hood

Two changes that are unglamorous but worth noting:

  • Git diff ref validation. When you use --diff-against, the supplied revision is now validated as a safe single-revision expression before being passed to git. No shell surprises, no accidental multi-ref expressions.
  • Canonical segment digests. Segment clone digests no longer use repr() — they're computed from canonical JSON bytes. This closes a subtle determinism hole where two runs on different interpreters could, in rare cases, produce different segment digests for the same input.

Neither changes clone identity or fingerprint semantics.
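The canonical-digest idea is worth a few lines of illustration. Hashing `repr()` output bakes in interpreter-specific formatting; hashing canonical JSON bytes pins key order, separators, and escaping. A sketch (not the actual digest scheme, which covers more fields):

```python
import hashlib
import json


def canonical_digest(segment: dict) -> str:
    """Hash canonical JSON bytes instead of repr(): sorted keys, fixed
    separators, and ASCII escaping make the byte stream, and hence the
    digest, identical across interpreters and runs."""
    blob = json.dumps(
        segment, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

With sorted keys, two logically equal segments always serialize to the same bytes, which is exactly the determinism hole the `repr()` approach left open.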

10. The warm path is actually warm

One of the more satisfying b5 fixes wasn't a feature at all.

I'd been quietly suspicious of the benchmark numbers for a while — warm runs were looking too good, and I couldn't make the shape of the curve match what the pipeline was actually doing. Turns out the benchmark harness had a bug that broke process-pool execution on warm runs, so the cache was being credited for work it wasn't doing.

After fixing the harness and tightening gating around benchmark runs so repo quality gates don't interfere, the numbers are now both fast and trustworthy. From the Linux smoke benchmark:

  • cold_full: 6.58s
  • warm_full: 0.95s
  • warm_clones_only: 0.86s

About 6.9× speedup on warm runs. The cache is no longer "probably helping" — it is clearly doing useful work, and now I can say that with a straight face.

Wrapping up

If b4 made CodeClone a real review surface, b5 is the release where that surface learned to ask useful second-order questions:

  • Is this complex function actually tested?
  • Is this low-coverage number a hard signal or a scope gap?
  • Is this new finding in production code or in fixtures?
  • Did this PR break the public API?
  • Is this duplication intentional test scaffolding or real debt?

Every one of those used to require me to eyeball two dashboards and a coverage report. Now there's a single canonical answer, and it ships consistently through CLI, HTML, JSON, SARIF, MCP, the VS Code extension, the Claude Desktop bundle, and the Codex plugin.

Try it

# Base install
uv tool install --pre codeclone

# With MCP for AI agents (Claude Desktop, Codex, VS Code, Cursor, ...)
uv tool install --pre "codeclone[mcp]"

A one-liner to feel the new shape on your own repo:

codeclone . \
  --coverage coverage.xml --coverage-min 70 \
  --min-typing-coverage 80 --fail-on-typing-regression \
  --api-surface --fail-on-api-break \
  --html

Open the HTML report, watch the Coverage Join tab populate, and check whether your "risky" functions really were the risky ones.

Feedback, issues, and PRs welcome on GitHub.
