DEV Community: orenlab

[Boost]

orenlab — Sat, 11 Jul 2026 11:56:31 +0000

orenlab

Jul 11

I built ckdn so coding agents never have to guess whether checks passed

#ai #python #opensource #mcp

7 min read

I built ckdn so coding agents never have to guess whether checks passed

orenlab — Sat, 11 Jul 2026 11:49:47 +0000

If you run a coding agent on a real project, you have probably lived through all three of these moments.

Moment one. The agent runs your test suite. The terminal answers with thousands of lines — progress dots, a coverage row per module, missing-line lists, warnings, stack traces. The agent dutifully reads it. You watch a five-figure chunk of context evaporate on formatting noise. Then the fix loop iterates, and it happens again. And again.

Moment two. The agent reads that wall and cheerfully reports that everything passed. It did not. A collection error killed the run before a single test executed — nonzero exit code, but no FAILED lines anywhere. Nothing looked failed, so the agent called it green.

Moment three. The check finally goes green — because the agent quietly lowered the coverage threshold. Or deleted the inconvenient test. Technically, you did ask it to make CI pass.

Three different failures, one root cause: in most agent loops, verification is text interpretation. The model reads terminal output and decides, from prose, whether the run looks green. That decision is expensive every time, wrong occasionally, and gameable always.

I kept hitting all three on my own projects, got tired of it, and built ckdn. It hit v1.0.0 this week. This post is the why and the how — in the order of those three moments.

Why "checkdown"?

It's a World Cup year, so forgive one football metaphor — wrong football, right idea. In the American game, a checkdown is the short, safe pass a quarterback throws instead of forcing a risky long play.

That is exactly this tool's job. Instead of hurling an entire terminal log into the model's context and hoping the long ball connects, ckdn hands the agent a short, safe, catchable result. The package name is shortened to ckdn; the idea stays the same:

Let agents move fast. Give them a safe verification path.

Moment one, solved: digests instead of logs

ckdn sits between the agent and your verification tools. Every configured check runs through one orchestrator that executes the real subprocess, archives the complete output as evidence, parses the machine-readable report, and emits a* bounded, deterministic digest* — the only thing the agent reads.

On one of my projects, a single successful test-plus-coverage run produced roughly 14,000 tokens of terminal output. The useful conclusion was four lines:

4250 tests passed
coverage: 99.18%
required: 99%
status: pass

Here is what a passing check looks like through ckdn — this is the complete stdout:

{
  "check": "pytest",
  "rc": 0,
  "run_dir": ".agent-runs/20260710T173400Z-pytest",
  "schema": "ckdn.digest/2",
  "status": "pass",
  "summary": {
    "counts": {
      "tests": 144
    }
  }
}

Two orders of magnitude cheaper — and the fix loop pays this price per iteration, so the savings multiply. Green results stay tiny. Failures carry bounded evidence: top findings with locations and snippets, counts, explicit truncation counters. The full log stays on disk for the rare case someone actually needs it.

Moment two, solved: exit code and evidence must agree

The false-green story from the intro is not hypothetical — it's the canonical failure of any "read the text and decide" verifier, and it's one format drift away at all times.

ckdn refuses it structurally, by combining two independent sources of truth:

the actual subprocess exit code, owned by the orchestrator;
evidence extracted by a format-aware parser — JUnit XML, coverage XML, JSON reports; machine-readable artifacts wherever the tool offers one, terminal text only as a last resort and then with drift guards.

The two must agree before a result becomes pass.

ckdn may downgrade green. It never upgrades red.

Every run ends in exactly one of four states:

Status	Meaning
`pass`	Exit code, parser evidence, and policy gates all agree
`fail`	A normal check failure — or a policy gate failure (e.g. coverage below `fail_under`)
`error`	The command itself broke: collection error, missing binary, timeout. Fix the run, not the code
`parse_mismatch`	The command returned green, but the evidence contradicts it. Green is untrusted

parse_mismatch is the status most tools don't have and the reason this one exists. When the exit code says "fine" and the JUnit report says "one failure", something is wrong — with the tool, the parser, or the invocation — and the right answer is a loud alarm, not a silent verdict either way.

A parser never decides the final status. It reports facts; a separate reconciliation layer owns the verdict, and contract tests pin its invariants: a nonzero exit code can never become pass; contradictory evidence invalidates a zero exit code; parser drift becomes a loud error, never a quiet clean.

Moment three, solved: the repository owns the checks

ckdn is not an arbitrary shell handed to an AI agent. Checks are declared by the repository in ckdn.toml:

[check.pytest]
command = "uv run pytest -q --junitxml {run_dir}/junit.xml"
parser = "pytest"

[check.coverage]
command = "uv run pytest -q --junitxml {run_dir}/junit.xml --cov=src --cov-report=xml:{run_dir}/coverage.xml"
parser = "coverage"
fail_under = 96.0

[check.lint]
members = ["ruff"]

...

The agent can ask ckdn to run pytest or lint. It cannot invent a new command, weaken the coverage threshold, sneak in a convenient shell pipeline, or silently redefine what "verified" means. Commands run without a shell at all —tokenized, no pipes, no && —because a pipeline is exactly where exit codes get laundered.

The threshold that moment three's agent lowered? In ckdn it's a policy gate in the config, checked against the coverage XML itself. Making it pass dishonestly now requires editing a file the diff will show — not a quiet reinterpretation of prose.

The repository owns the checks. The agent only triggers them.

Try it in two minutes

Python ≥ 3.11. The core CLI is stdlib only — zero third-party dependencies.

uv tool install ckdn

cd your-project
ckdn init                     # writes a starter ckdn.toml
echo '.agent-runs/' >> .gitignore

ckdn run pytest               # one atomic check, digest on stdout
ckdn run lint                 # an alias running its members in order
ckdn show                     # pretty-print the latest digest

Every run gets its own evidence directory — compact output does not mean lost information:

.agent-runs/
  20260710T173400Z-pytest/
    full.log      # complete interleaved output
    junit.xml     # machine-readable tool report
    meta.json     # argv, rc, timestamps, duration, log sha256
    digest.json   # deterministic facts for the agent

The digest is deterministic on purpose — no timestamps, no durations (those live in meta.json). Identical results produce identical bytes. The digest is compact; the evidence is complete.

MCP support

For agents that should call ckdn over MCP instead of shelling out, there's an optional server (the core stays dependency-free; FastMCP is an extra):

uv tool install 'ckdn[mcp]'
claude mcp add --scope project ckdn -- ckdn-mcp   # Claude Code; Cursor/Codex configs in the README

Six tools: list_checks, run_check, run_group, get_digest, list_runs, get_evidence. The same trust rules apply: only configured checks, no arbitrary shell, bounded evidence access, and — importantly — fail / error /parse_mismatch are normal structured results, not MCP protocol errors. A red check is information, not an exception.

MCP is a transport, not a second implementation of the verification semantics: the CLI and the server share one application layer.

Parsers

v1.0.0 ships parsers for pytest, coverage, Ruff (lint and format), ty, mypy, Pyright, Black, pip-audit, Bandit, Pylint,any SARIF-producing tool, and generic exit-code-only checks.

The SARIF and JUnit parsers matter more than the list suggests: they are format parsers, not tool parsers. Anything that writes SARIF (semgrep, gitleaks, trivy, …) or JUnit XML (most test runners in most languages) works today, even though ckdn itself is written in Python. The parser API is public and small — a parser reports evidence and never gets to reinterpret a red exit code as success:

from ckdn.parsers.base import Finding, ParseContext, ParseResult


class MyToolParser:
  name = "mytool"

  def parse(self, ctx: ParseContext) -> ParseResult:
    report = ctx.artifact("report", "mytool.json")
    if not report.exists():
      return ParseResult(parser_ok=False,
                         notes=[f"report not found: {report}"])
    return ParseResult(findings=[...], summary={"count": 0})

Why not just ask the model to summarize the log?

Because summarization and verification are different jobs.

A language model is genuinely good at explaining a failure. It should not be the authority on whether the command succeeded — that authority has to be boring, deterministic, and testable. The split that works:

ckdn establishes what happened. The agent decides what to do next.

What was the real exit code? Was the expected report produced? Could it be parsed? Does the evidence contradict the process result? Did a policy gate fail? — deterministic layer. How to fix it — that's the agent's job, and it does that job better when the input is twenty findings with locations instead of a scrolling terminal.

What ckdn is not — and where it won't help you

Honest scoping, so you don't install the wrong tool:

It is not a CI platform, a task runner with shell access, an AI log summarizer, a dashboard, or an LLM judge. It replaces none of pytest, Ruff, coverage — it wraps them.
If you're coding without agents and happily reading your own logs, ckdn solves a problem you don't have.
The built-in parser set is Python-ecosystem-first today. Other stacks enter through SARIF, JUnit, or the generic parser — a dedicated cargo test or eslint parser is a config recipe or a small class away, but it's not in the box yet.
No parallel execution, no watch mode. It's a verification boundary, not an orchestra conductor — sequencing stays with the caller.
It's a young v1.0.0 by a solo maintainer. The status-model invariants are contract-tested and I run it on my own projects daily, but the API surface outside the digest schema may still move.

One job, done strictly: run configured checks and return a result the agent does not have to guess about.

Try to break it

ckdn is open source under MIT:

Repository: github.com/orenlab/ckdn
PyPI: pypi.org/project/ckdn

I would especially value feedback from people running coding agents on real projects:

Which of the three moments from the intro do you hit most?
Which parsers are missing for your stack?
Where does the configuration feel awkward?
Can you produce a case that ckdn misclassifies? That last one is the bug I want most.

Try it, break it, open an issue. And if the idea resonates, a star helps other people find it.

Your AI agent's diff looks fine. That's the problem.

orenlab — Thu, 09 Jul 2026 15:41:10 +0000

Give an agent a simple task:

Add an optional loyalty discount to retail pricing. Enterprise pricing and the public API must stay unchanged.

Twenty minutes later it's done. Tests are green, the summary is confident. Along the way it also "cleaned up" a shared pricing helper — and in the diff, that looks like diligence, not scope creep.

Here's the uncomfortable part: the diff can't tell you the difference. Did the agent stay inside the intended subsystem? Did it touch the enterprise path? Did it introduce another clone of that helper? Did it modify the public API despite claiming otherwise? A green diff answers none of that.

So you either re-read every agent patch line by line — and lose the speed you adopted agents for — or you trust the checkmark and accumulate structural drift you'll find months later.

I've spent the last months building a third option. Today it shipped: CodeClone 2.1.0a1, the first public alpha of a Structural Change Controller for AI-assisted Python development.

Code review starts too late

By the time a diff exists, the agent has already made the important decisions: which files were relevant, which dependencies to follow, which abstractions to reuse, how far the task should spread through the repository.

The diff records the result of those decisions. It does not control them.

The missing layer looks like this:

Task
  → Intent
  → Blast Radius
  → Bounded Edit
  → Verification
  → Diff

CodeClone started as a deterministic structural analysis engine for Python — clones, complexity, coupling, cohesion, dead code, public API surface, all in one canonical report shared by the CLI, HTML report, SARIF, CI gates, and MCP server. Version 2.1 builds a change-control workflow on top of those facts:

analyze → start → edit → finish

The loop, on the loyalty-discount task

Before the edit, the agent declares its intent over MCP: "modify retail pricing to support an optional loyalty discount" with a concrete file scope. CodeClone answers with:

the structural blast radius of that scope — what depends on it, what shares its structure;
the split between allowed files and review context (files you should know about but not touch);
explicit do-not-touch boundaries — the enterprise path and the public API surface go here;
the available regression budget;
and the one field that matters: edit_allowed: true | false.

The agent edits. Nothing about CodeClone writes or rewrites code — the agent works normally, just inside a declared boundary. Hosts that support hooks (Claude Code, Cursor) can hard-block file edits until a valid controlled change is active.

After the edit, finish resolves what actually changed and checks it against what was declared: scope reconciliation, structural regression verification between the before/after analysis runs, validation of the agent's own review claims, and an auditable review receipt.

Now the reviewer isn't reverse-engineering intent from a diff. They're reviewing a contract: here's what the agent said it would do, here's what it did, here's the evidence they match.

"What was declared" vs. "what happened"

The record connecting the two is called the Patch Trail, and it lets a reviewer distinguish situations that look identical in a diff:

The agent changed exactly what it declared.
The agent declared a broad scope but changed only part of it.
The agent crossed an explicit boundary.
The agent expanded scope and recorded that expansion.
The patch passed tests but failed its structural contract.

That last line is the whole point. A correct patch can still be produced through an uncontrolled process. "Tests passed" was never evidence that the change was well-bounded.

No LLM grading its own homework

One design rule runs through everything: CodeClone never asks a model whether a change is safe.

"LLM reviews LLM" has a correlated failure mode — the same plausible rationalization that convinced the coding agent will convince the reviewing agent. So every boundary, gate, and verification here comes from deterministic repository facts: same input, same answer, every time.

That includes the agent's prose. A claim like "no structural regressions" requires structural evidence from the before/after runs — a claim-validation step flags anything the evidence doesn't support. An agent cannot talk its way past a scope check.

Agents forget — twice

Two more pieces shipped in this alpha, and they exist because agents forget in two different ways.

Within a session, they grep. Watch an agent in a large repo: grep, read, grep again, rebuild the architecture from scratch, every time — slow, token-hungry, and wrong just often enough to be dangerous. Live Implementation Context replaces the ritual: one call returns bounded structural context from the current analysis run — call relationships, contracts in play, test anchors, freshness, and any active intent boundaries in the workspace. No separate index to maintain, nothing to go stale. And it's deliberately read-only: context informs the agent, it never authorizes an edit.

Across sessions, they start cold. The next agent re-learns that this schema is public, that this helper is intentionally duplicated, that this refactor was attempted before and failed — or worse, it doesn't, and repeats the failure. Engineering Memory is a local SQLite-backed, typed, evidence-linked knowledge layer: architecture decisions, risk notes, contract anchors, rationales from previous controlled changes.

The governance rule makes it safe: agents propose, humans promote. Agent-created knowledge stays a draft until a human approves it (there's a VS Code Memory view for exactly that). And memory can warn an agent that its plan contradicts a recorded decision — but it can never authorize an edit, expand a scope, or override a finding.

Also in the alpha

A few things I'll write about separately, in one paragraph each:

Agent trajectories. Since intents, tool calls, verification results, and Patch Trails are all recorded, CodeClone can rebuild deterministic trajectories of how agents actually worked. This is already showing that different agents have different failure signatures under the same governance layer — some repair every small regression, some leave unfinished intent state behind. Not a model leaderboard; a concrete way to improve prompts, tool descriptions, and recovery paths.

Multi-agent coordination. Real workspaces have a Claude session, a Codex run, and a human with unsaved changes in the same tree. Intents queue instead of colliding, ownership is lease-bound, abandoned work is recoverable, and no agent can claim another's changes as its own evidence.

Platform Observability. An opt-in, dev-only diagnostics layer that watches CodeClone itself — DB query shapes, MCP payload pressure, worker chains, costly no-ops. It already caught real problems in this release (semantic reindexing firing too eagerly; a relationship-analysis pass that could fold into the main module walk).

Integrations. Native setups for Claude Code, Claude Desktop, Codex, Cursor, and VS Code, over a 38-tool MCP surface with deterministic next-step guidance built into the schemas.

Try it

The alpha is on PyPI as a prerelease:

python -m pip install --pre codeclone
# or
uv tool install --prerelease allow codeclone

Then:

codeclone setup doctor   # guided environment check
codeclone .              # first analysis run

For agents and IDEs, the MCP server ships as an extra:

uv tool install --prerelease allow "codeclone[mcp]"
codeclone-mcp --transport stdio

Everything runs locally — no hosted service, no cloud account, no code leaving your machine. If you're upgrading from 2.0.x, the workspace moved from .cache/codeclone/ to .codeclone/, and the full contract changes are in the changelog.

Why ship an alpha

Because a public alpha is a concrete contract that can be tested and challenged — more useful than polishing a large development branch in private.

The feedback that helps most is workflow feedback: run the intent loop through your own agent on your own repository.
Do the boundaries match your intuition? Do the block reasons make sense? Does the receipt tell your reviewer what they actually need to know?

I don't think reliable agentic development will come from larger models alone. Software engineering has always become more reliable by making important decisions observable and verifiable — and for AI-assisted development, that means recording more than the final diff: what the agent intended, what it was permitted to edit, what it actually changed, what verification proved, and what the next attempt should remember.

The goal is not to slow agents down. The goal is to let them move fast without letting structural scope become invisible.

Repository: https://github.com/orenlab/codeclone
Documentation: https://orenlab.github.io/codeclone/
PyPI: https://pypi.org/project/codeclone/

Code Review Starts Too Late

orenlab — Wed, 10 Jun 2026 10:34:15 +0000

Most tools in AI-assisted development answer questions like:

What changed?
Is the code worse?
Does the patch pass tests?
Did the agent complete the task?

These are useful questions.

But they all share the same assumption:

The review process begins when a diff exists.

That assumption made sense when humans were the primary authors of code.

I’m no longer convinced it does.

The Real Decision Happens Before the Diff

When an experienced engineer receives a task, they usually make several decisions before writing a single line of code:

What should be changed?
What should not be changed?
Which modules are relevant?
Which dependencies are dangerous?
How far should this change spread?

Most of that reasoning happens implicitly.

With AI agents, the situation is different.

An agent receives an objective and immediately begins exploring the codebase.

The first important decision is not:

How should I implement this?

The first important decision is:

What am I allowed to touch?

That decision often determines whether the change remains safe, maintainable, and reviewable.

Yet most tooling only starts paying attention after a patch has already been generated.

AI Broke a Long-Standing Assumption

Traditional code review assumes:

Task ↓ Implementation ↓ Diff ↓ Review

The diff is treated as the beginning of the review process.

For AI agents, the diff is often the end of the decision process.

By the time a reviewer sees the patch:

scope decisions have already been made
dependencies have already been traversed
unrelated files may already have been modified
architectural boundaries may already have been crossed

The patch records the outcome.

It does not explain how the agent arrived there.

Passing Tests Is Not Enough

Imagine a simple task:

Rename one API parameter.

An agent modifies:

the target API
eight dependent modules
three test suites
two utility files that were not part of the request

All tests pass.

Most modern workflows would consider this successful.

But should they?

The implementation may be correct.

The change process may not be.

A passing patch does not automatically mean the agent stayed within a reasonable scope.

The Missing Layer

I increasingly think there is a missing control layer in AI-assisted development.

Something between the task and the diff.

Instead of:

Task ↓ Agent ↓ Diff

The workflow should look more like:

Task ↓ Intent ↓ Blast Radius ↓ Bounded Edit ↓ Verification ↓ Diff

The goal is not to prevent changes.

The goal is to make structural expansion explicit.

If an agent decides to move from one file to twenty files, that decision should be visible.

If a change crosses architectural boundaries, that fact should be visible.

If scope expands, it should be visible before the patch reaches review.

This Is Not About Distrusting Agents

The common reaction is:

We just need better models.

Maybe.

But history suggests a different pattern.

Software engineering rarely becomes more reliable because people become smarter.

It becomes more reliable because systems become more observable.

Version control did not make developers smarter.

Testing did not make developers smarter.

CI did not make developers smarter.

All of them made mistakes easier to detect.

AI development workflows will likely follow the same path.

Reliability will not come solely from stronger foundation models.

Reliability will come from stronger control systems surrounding those models.

Reviewing Decisions Instead of Diffs

For years, software engineering focused on reviewing code.

As agents become primary contributors, a different question emerges:

Should we only review code?
Or should we review the decisions that produced the code?

The diff remains important.

But it may no longer be the beginning of the review process.

It may simply be the final receipt.

⸻

This is the direction I am exploring with CodeClone 2.1.

CodeClone is a deterministic structural review and change-control layer for Python projects. The 2.1 line adds a Structural Change Controller: agents declare intent, inspect blast radius, edit inside explicit boundaries, verify the resulting patch, and leave an auditable receipt.

The goal is not to tell agents how to write code.

The goal is to make agentic changes explicit, bounded, observable, and verifiable before the diff becomes the only evidence we have.

Why Your Python Dead Code Detector Is Wrong About FastAPI, SQLAlchemy, and Half Your Codebase

orenlab — Tue, 19 May 2026 16:26:49 +0000

Static analysis tools that only search for direct calls produce false positives. Here is what reachability-aware dead code detection looks like — and why getting it right matters more than ever in AI-assisted development.

Most dead code detectors answer the same question the same way: find the symbol, search for calls, report if nothing is found.

That works well for simple scripts. It breaks fast in real Python applications.

And in 2026, when AI coding agents use static analysis output as context for decisions, a false positive is no longer just an annoyance. It is a bug waiting to be confidently introduced.

Why "find calls" is not enough

Tools like vulture and pyflakes are genuinely useful. They answer the direct-call question correctly and fast. The problem is that direct calls are not the whole story in modern Python.

Real applications are full of code that is never called by name:

FastAPI route handlers are registered through decorators and invoked by the framework routing layer
Starlette middleware methods are called by the runtime, not by application code
SQLAlchemy TypeDecorator hooks are triggered by the ORM during query binding and result hydration
Pydantic validators are discovered through model metadata at class construction time
Aiogram message handlers are registered through router observers
Dependency injection containers wire providers at composition time, not at call sites
Public package APIs may be exported through __all__ or lazy __getattr__ without a single explicit import in the same codebase

If your dead code tool only searches for explicit symbol references, every one of these patterns is a potential false positive.

And once a tool produces enough false positives, engineers stop trusting it entirely — which is the worst possible outcome, because the real dead code stays in the codebase forever.

Why this is a bigger problem if you use AI coding assistants

There is a second, newer dimension to this.

AI coding agents increasingly use static analysis output as context for automated decisions. When an agent sees a function flagged as dead code, it may suggest deleting it. If that function is actually a FastAPI route handler, an Aiogram message handler, or an ORM lifecycle hook, the agent will remove something real — confidently, because the tool said so.

A noisy report annoys a human developer. A human developer reads the code and figures it out.

A noisy report misleads an agent. An agent acts.

The more AI-assisted coding workflows rely on static signals, the more those signals need to be careful, explainable, and conservative. A dead code report that feeds into an agent's context is closer to executable code than to a suggestion.

The tempting trap: name-based heuristics

When false positives pile up, the easy fix is to add suppressions.

If a function name starts with get_, maybe a framework uses it. Skip it.

If the file is named routes.py, do not report anything in it.

If a method is called dispatch, it is probably fine.

This makes reports quieter. It does not make them more honest. You are trading false positives for hidden real dead code, and the tool becomes progressively less useful the more you suppress.

The hard part is not making fewer reports. The hard part is making fewer wrong reports without burying genuine findings behind broad exemptions.

Reachability, not vibes

The direction we took in CodeClone is symbol reachability.

Instead of asking "does any code call this function by name?", we ask a different question:

Is there deterministic evidence that this symbol is reachable at runtime?

That evidence can come from many places: a direct function call, a framework registration decorator, a base class contract, a runtime hook signature, a dependency injection provider binding, a __all__ export, or a carefully bounded getattr dynamic dispatch. The analyzer builds a reachability model over these edges and only reports a symbol as dead when it finds no evidence of runtime participation.

This is still static analysis. It does not execute the program. It cannot perfectly model every Python runtime pattern in existence.

But it is substantially better than pretending that only direct calls exist.

What real false positives look like in practice

Consider a FastAPI route handler:

router = APIRouter()

@router.get("/metrics")
async def get_metrics() -> dict[str, object]:
    return {"ok": True}

Nobody calls get_metrics() by name. The framework matches HTTP requests to it through the registered route. A naive analyzer reports an unused function.

The same pattern with Aiogram:

router = Router()

@router.message(Command("start"))
async def cmd_start(message: Message) -> None:
    await message.answer("Started")

And with SQLAlchemy runtime hooks:

class OrjsonJSON(TypeDecorator[object]):
    impl = JSON

    def process_bind_param(self, value: object, dialect: object) -> object:
        return encode(value)

    def process_result_value(self, value: object, dialect: object) -> object:
        return decode(value)

process_bind_param and process_result_value are never called directly. SQLAlchemy calls them during query execution. A reachability-aware analyzer recognizes the TypeDecorator base class contract and treats these methods as live. A call-site analyzer reports them as unused.

An important distinction worth making explicit

We do not want CodeClone to say:

This symbol is definitely alive.

That claim is too strong for static analysis. Python is too dynamic.

We want it to say:

There is concrete evidence that this symbol participates in runtime reachability, so reporting it as dead code would be misleading.

That distinction matters for trust. Static analysis tools earn credibility by being careful about what they assert, not by being comprehensive. A conservative, evidence-based report that engineers trust completely is worth more than a comprehensive report that engineers learn to ignore.

What changed in CodeClone 2.0.2

In 2.0.2 we extended the reachability model to cover the patterns that produce the most false positives in real Python codebases.

The main additions: FastAPI route decorators and typed route wrapper patterns, Aiogram router and dispatcher observer decorators, Starlette BaseHTTPMiddleware.dispatch, Flask and aiohttp route registration, SQLAlchemy TypeDecorator runtime hooks, Pydantic model validators and runtime hooks, __all__ public re-exports, PEP 562 lazy module exports through __getattr__, and guarded dynamic dispatch patterns where getattr is followed by a callable check.

The result is intentionally boring: fewer false positives, without converting dead code detection into a list of broad suppressions.

For users upgrading from 2.0.1: if your dead code count goes down, that is expected and correct. It means the reachability model learned more about how your codebase connects to runtime behavior, not that detection became weaker.

Dead code gates in CI only work if engineers trust them

This is worth saying directly.

If a CI pipeline fails because process_bind_param looks unused to the linter, engineers learn to ignore the gate. Or they add a # noqa comment. Or they disable the check altogether.

All three outcomes are worse than not having the check.

A dead code gate in CI needs a false positive rate low enough that every finding is treated as credible. When that bar is met, dead code checks become genuinely useful in automated pipelines — including pipelines where an AI agent is part of the review process.

When it is not met, the check becomes noise, and noise gets turned off.

The principle

A symbol should be reported as dead only when the analyzer cannot find deterministic reachability evidence for it.

Not every Python runtime pattern can be statically modeled. But every supported pattern should be based on concrete, traceable evidence — a decorator we recognize, a base class contract we understand, a dependency graph edge we can follow.

That is the line between static analysis you can act on and a collection of heuristics that produces noise at scale.

Closing

Dead code detection in Python is not a grep problem.

It is closer to building a small reachability model of how real applications connect symbols to runtime behavior: direct calls, framework registration, dependency injection, runtime hooks, public exports, and bounded dynamic dispatch.

Tools like vulture get you a long way there, and fast. The remaining false positives come from the parts of the runtime that reveal themselves only through structural analysis — framework contracts, base class hierarchies, decorator semantics.

That is the gap CodeClone is trying to close.

Not by claiming to solve the unsolvable. By being honest about what evidence it has found, and conservative about what it reports without it.

CodeClone is a structural review layer for Python. The 2.0.2 release notes and full reachability documentation are in the repository.

Agents Make Code Cheaper. CodeClone 2.0 Makes Structural Regressions Harder to Ship.

orenlab — Sun, 03 May 2026 13:34:21 +0000

I have been writing about CodeClone 2.0 in public while the beta line was still moving.

It started with the baseline-aware code health model, then moved into the budget-aware MCP server for AI agents, the review surfaces for VS Code, Claude Desktop, and Codex, and finally Coverage Join: the part where structural findings learn what your tests actually cover.

Now CodeClone 2.0.0 is stable.

This post is not a full feature dump. It is the final story of what the 2.0 line became.

The short version:

Agents make code cheaper to produce. CodeClone makes structural regressions harder to ship.

That is the niche I care about.

Not replacing Ruff.

Not replacing mypy.

Not replacing pytest.

Not replacing Bandit or Semgrep.

Not pretending to be a magic AI-code detector.

CodeClone is a deterministic structural review layer for Python projects. It separates accepted debt from new regressions, produces one canonical report, and exposes that same truth through CLI, HTML, GitHub Actions, VS Code, Claude Desktop, Codex, and MCP.

Why this matters more now

AI coding tools are getting better fast.

That changes the shape of engineering work.

The hard part is not always writing the next function. The hard part is keeping a repository governable while code is produced faster than before.

The failure mode is usually not dramatic.

It looks more like this:

one more duplicated branch of business logic;
one more helper that overlaps with three existing helpers;
one more large module that becomes the place where everything changes;
one more public API change that nobody noticed in review;
one more risky function that is not actually measured by coverage;
one more agent run that spends context on low-value noise.

Each individual change can look reasonable.

The repository still gets worse.

That is where structural review becomes useful.

Not as style policing.

Not as a replacement for tests.

But as a set of checkable constraints around the shape of the codebase.

The core idea: accepted debt vs new regressions

The most important part of CodeClone is still the baseline model.

A mature repository usually has historical debt. If a tool says “you have 700 problems,” the team will probably do one of three things:

ignore it;
disable it;
spend weeks fighting history instead of reviewing the current change.

CodeClone uses a different model:

uv tool install codeclone

codeclone . --update-baseline
codeclone . --ci

The baseline says: this is the state we already accepted.

Future runs can then separate:

known debt;
new regressions.

That changes CI from “please fix the entire past” into a much more useful question:

Did this branch make the repository structurally worse than the trusted baseline?

That is the difference between a noisy report and a governance workflow.

The baseline is not just a JSON convenience file. It is a contract with schema versioning, fingerprint version, Python tag compatibility, and integrity checks. If the baseline is not trusted, CI should not pretend everything is fine.

One analysis, many surfaces

One of the strongest design rules in CodeClone 2.0 is simple:

The analysis is one thing. Everything else is a projection.

The CLI, HTML report, JSON, Markdown, text, SARIF, MCP server, VS Code extension, Claude Desktop bundle, Codex plugin, and GitHub Action are not allowed to invent their own truth.

They all sit on top of the same canonical report model.

That matters because the user should not see one thing in CI, another thing in the HTML report, and a third thing through MCP.

It also matters for agents.

A coding agent should not need to guess whether MCP results are “kind of similar” to CLI results. The MCP layer is a read-only control surface over the same deterministic analysis.

No second engine.

No MCP-only findings.

No hidden agent-specific semantics.

MCP is not just tool access

I think a lot of MCP integrations make the same mistake: they expose a pile of tools and leave the agent to figure out the workflow.

That is technically useful, but it is not enough.

Agents need shape.

If the cheapest useful path is not obvious, the model will often over-fetch, enumerate too early, and burn context on low-value details.

CodeClone MCP is designed around a triage-first path:

analyze the repository or changed scope;
read the summary or production triage;
inspect hotspots or focused checks;
open one finding;
request remediation context only after selecting a concrete issue.

This is the important part:

CodeClone MCP is a control surface, not a report dump.

It is read-only by design. It does not modify source files. It does not update baselines. It does not write a separate source of truth. It requires absolute repository roots so the client does not accidentally analyze the wrong directory.

That sounds strict, but for agentic workflows strictness is useful.

A good agent tool should not only answer questions. It should reduce the chance that the agent asks the expensive or misleading question first.

`help(topic=...)` became more important than I expected

One small MCP feature ended up being surprisingly important: bounded help.

CodeClone MCP includes help(topic=...) for topics such as:

workflow;
baseline semantics;
latest run behavior;
review state;
changed-scope routing;
suppressions;
analysis profile;
coverage.

This is not documentation dumping. It is a small uncertainty-recovery tool.

When an agent is unsure what a surface means, it can ask a bounded question instead of going straight into broad enumeration.

That changed the feel of the interface.

The server became less like “here are 21 tools” and more like “here is a route through structural review.”

Coverage Join: structural risk needs test context

One of the features I care about most in 2.0 is Coverage Join.

Coverage alone is useful, but it is often too coarse.

A project can have a high overall coverage percentage while a specific risky function is either under-tested or missing from the coverage scope entirely.

CodeClone joins Cobertura XML coverage data with its own per-function structural facts:

codeclone . --coverage coverage.xml --fail-on-untested-hotspots --coverage-min 50

The important detail is that CodeClone keeps two situations separate:

a measured function is below the configured coverage threshold;
a function exists in CodeClone analysis but is missing from the supplied coverage.xml scope.

Those are different review situations.

“Poorly covered” and “not measured by this coverage file” should not be collapsed into the same message.

For CI, this enables a practical gate:

if structurally risky code is under-tested or outside coverage scope, review it before merge.

For agents, it gives better context:

not just “this function is complex,” but “this function is complex and the test signal is weak or missing.”

Security Surfaces: not SAST, not vulnerability claims

Another new layer in 2.0 is Security Surfaces.

This is the easiest one to explain badly, so I want to be explicit.

CodeClone is not a security scanner.

It does not prove exploitability.

It does not replace Bandit, Semgrep, threat modeling, or proper SAST.

It does not claim that every subprocess or dynamic import is a vulnerability.

Instead, Security Surfaces is a report-only inventory of security-relevant capability boundaries.

Examples include:

process execution;
dynamic execution;
dynamic imports;
deserialization;
filesystem mutation;
crypto and integrity primitives;
auth/session/token/secret-related code;
network boundaries.

Why is that useful?

Because during review, it matters whether a change touches boundary code.

Especially when that boundary also intersects with:

high complexity;
low coverage;
overloaded modules;
clone drift;
public API changes.

Security Surfaces is not a red alarm.

It is a map of sensitive places that deserve better review context.

Dependency depth: I removed the magic number

Earlier in the 2.0 line, dependency depth used a fixed threshold.

That was too blunt.

A small package and a larger tool with CLI, MCP, HTML reports, IDE integration, and GitHub Action surfaces should not be judged by the same hard-coded chain length.

So the final 2.0 model moved to an adaptive dependency-depth profile:

average depth;
p95 depth;
max depth;
longest chains;
cycles.

Cycles remain a hard structural signal.

Acyclic depth is now treated as project-relative pressure rather than a universal failure condition.

This is a good example of how I want CodeClone to evolve: if a metric starts looking precise but unfair, the contract should change instead of forcing reality to match a pretty number.

API Surface and Adoption

CodeClone 2.0 also added two related review layers: API Surface and Adoption.

API Surface tracks public symbols and compares them against a trusted metrics baseline when enabled.

That makes it easier to notice:

newly public symbols;
removed symbols;
breaking changes;
public surface drift.

Adoption tracks presence, not quality:

parameter annotations;
return annotations;
public docstrings;
explicit Any counts.

This is intentionally not “typing quality.”

A project can have annotations and still have weak type design. But for teams migrating incrementally, presence coverage is still a useful governance signal:

did this branch make the public surface less typed or less documented than before?

That is the kind of question CI can answer honestly.

Overloaded Modules

CodeClone also reports overloaded modules.

This is report-only. It is not a gate.

The point is not to say “this file is bad.” The point is to rank modules where several kinds of pressure meet:

size;
complexity;
coupling;
responsibility concentration;
participation in other signals.

In larger repositories, this helps answer a practical question:

If I want to improve the project without guessing, where should I start?

That is often more useful than looking only at the single function with the highest cyclomatic complexity.

Native surfaces: VS Code, Claude Desktop, Codex

CodeClone 2.0 is no longer only a CLI tool.

It now has native or local integration surfaces around the same MCP server.

VS Code

The VS Code extension is a native MCP client for CodeClone.

It can show triage-first review data, jump to source, display supported Coverage Join and Security Surfaces facts, respect workspace trust, and avoid sending code anywhere.

It does not reimplement analysis.

It talks to the local codeclone-mcp.

Marketplace:

https://marketplace.visualstudio.com/items?itemName=orenlab.codeclone

Claude Desktop

The Claude Desktop bundle is an .mcpb wrapper around the local codeclone-mcp launcher.

It is not a second server. It is an installation and configuration surface.

Codex

The Codex plugin provides local discovery metadata, MCP configuration, and review skills. Again, the goal is not to build a separate analyzer, but to guide the agent through the correct CodeClone workflow.

GitHub Action

For CI and pull requests, CodeClone ships a composite GitHub Action:

- uses: orenlab/codeclone/.github/actions/codeclone@v2
  with:
    fail-on-new: "true"
    sarif: "true"
    pr-comment: "true"

It can:

run baseline-aware checks;
generate JSON and SARIF;
upload SARIF to GitHub Code Scanning;
publish a PR summary comment;
keep review logic out of the project’s workflow scripts.

That is the kind of surface I want CodeClone to have: useful in local development, useful in CI, and useful for agents — without each integration inventing its own interpretation.

What “stable” means here

2.0.0 does not mean the project is done.

It means the main 2.0 contract is stable enough to stop treating it as a prerelease.

The important parts are now established:

uv tool install codeclone works without --pre;
codeclone[mcp] is the normal optional extra for MCP;
CLI, report, baseline, cache, metrics baseline, and MCP semantics are documented;
integrations are aligned with the final 2.0 package;
legacy shim paths are gone;
runtime compatibility issues are surfaced explicitly instead of being hidden.

The limitations are also part of the contract:

CodeClone does not replace linters;
CodeClone does not replace tests;
Security Surfaces does not replace SAST;
Coverage Join does not replace coverage.py;
MCP does not mutate repositories;
HTML and IDE clients do not create a second truth.

That matters.

A quality tool loses trust quickly when it promises more than it can prove.

Try it

Minimal install:

uv tool install codeclone
codeclone .
codeclone . --html --open-html-report

Baseline-aware CI flow:

codeclone . --update-baseline
codeclone . --ci

MCP for local agents and IDEs:

uv tool install "codeclone[mcp]"
codeclone-mcp --transport stdio

Coverage Join:

codeclone . --coverage coverage.xml --fail-on-untested-hotspots

Useful links:

GitHub: https://github.com/orenlab/codeclone
PyPI: https://pypi.org/project/codeclone/
Docs: https://orenlab.github.io/codeclone/
Live HTML report: https://orenlab.github.io/codeclone/examples/report/
VS Code Marketplace: https://marketplace.visualstudio.com/items?itemName=orenlab.codeclone

What comes next

2.0 is a stable foundation, not the end.

The next things I care about most are:

better changed-scope review for pull requests;
evolving Security Surfaces into a more useful but still honest pressure signal;
improving GitHub Action and marketplace flows;
making native clients smoother;
testing CodeClone on more large Python repositories.

If you try CodeClone on your own project, I am especially interested in the uncomfortable feedback:

where the signal feels noisy;
where the tool is too conservative;
where the report needs more context;
which MCP workflows actually help an agent;
which CI gates you would want to tune differently.

CodeClone started as a structural clone detector.

Version 2.0 turns it into a structural review layer for Python teams that care about CI, IDE workflows, and agent-assisted development.

The principle is still the same:

do not pretend to know more than the analysis can prove; make the facts, boundaries, and regressions clear enough for a human or an agent to review.

Agents make code cheaper to produce.

Tools like CodeClone should make structural regressions harder to ship.

Structural review that finally knows what your tests cover

orenlab — Thu, 16 Apr 2026 13:52:28 +0000

In earlier posts, I wrote about why I built CodeClone, why I exposed it through MCP for AI agents, and how b4 turned it into a real review surface for VS Code, Claude Desktop, and Codex.

b5 is the release where structural review stops being a parallel universe to your test suite.

Until now, CodeClone could tell you that a function is long, complex, duplicated, or coupled to everything — but it had no idea whether that function was covered by a single unit test. That mattered more than I wanted to admit. A complex function with a 0.98 coverage ratio is not the same risk as the identical function with 0.0. A reviewer knows this. An AI agent reading an MCP response doesn't — unless the tool tells it.

So b5 fixes that, and while doing it, also lifts a few other things that kept getting in the way:

typing and docstring coverage as first-class review facts
public API drift as a baseline-governed signal
intentionally-duplicated test fixtures stop polluting health and CI gates
a much clearer triage story for MCP and IDE clients
a rebuilt HTML report with unified filters and cleaner empty states
a Claude Desktop launcher that actually picks the right Python
a warm-path benchmark that now tells the truth

Let me walk through what changed and why.

1. Bring your `coverage.xml` into the review

The headline feature of b5 is Coverage Join. Point CodeClone at any Cobertura XML produced by coverage.py, pytest-cov, or your CI and it fuses test coverage into the same run that produces clone groups, complexity, cohesion, and dead code:

codeclone . --coverage coverage.xml --coverage-min 50 --html

What comes out is not "new coverage tool, please delete the old one." It's coverage used as a modifier on structural review:

Each function in the current run gets a factual coverage ratio.
Functions below the threshold show up as coverage hotspots with their complexity and caller count alongside.
High-risk findings can now read "complex + uncovered + new vs baseline" instead of just "complex."
A new gate, --fail-on-untested-hotspots, fails CI on below-threshold functions only where the coverage report actually measured them.

That last distinction is the part I care about most.

2. Honest about scope: measured vs out-of-scope

The easy mistake when bolting coverage onto a second tool is to silently treat "function missing from coverage.xml" as "function is uncovered." It makes the dashboard look busier, but it's a lie — the function might be covered by a coverage run that was filtered to a different package, or it might be a module the coverage config excluded on purpose.

b5 keeps these two cases cleanly separate:

Coverage hotspots — code that coverage.xml measured and reported below threshold. This is a hard signal.
Coverage scope gaps — code present in your repo but not in the coverage XML at all. This is a scoping observation, not a verdict.

Both show up in the report and through MCP, but with different meanings. In mixed monorepos this stops being cosmetic very fast.

None of this changes clone identity, fingerprints, or NEW-vs-KNOWN semantics — the baseline model is untouched. Coverage Join is a current-run fact, not baseline truth.

3. Typing and docstring coverage are now part of the picture

I used to expose "typing coverage" and "docstring coverage" as optional toggles. In practice, nobody turned them on, and they kept hiding behind flags that felt vestigial.

b5 removes the toggles and just collects adoption coverage whenever you run in metrics mode:

parameter annotation coverage
return annotation coverage
public docstring coverage
explicit Any count

They land in the main CLI Metrics block, in the HTML Overview, in MCP summaries, and in the baseline. And they get their own CI gates:

codeclone . \
  --min-typing-coverage 80 \
  --min-docstring-coverage 60 \
  --fail-on-typing-regression \
  --fail-on-docstring-regression

The regression gates are the interesting pair: they don't force you to reach a specific threshold, they just fail CI when adoption drops compared to your trusted baseline. That tends to be more realistic for real codebases where you're migrating gradually.

4. Public API drift becomes a first-class signal

Another thing that used to live outside the review surface: "did this PR break the public API?"

b5 adds an opt-in API Surface layer that takes a snapshot of your public symbols — modules, classes, functions, their parameters and return types — into the metrics baseline. Subsequent runs produce a baseline diff with explicit categories: additions, breaking changes, everything else.

# Record the snapshot
codeclone . --api-surface --update-metrics-baseline

# Guard PRs
codeclone . --fail-on-api-break

It's not a type checker and it's not SemVer enforcement. It's "the set of externally-callable names in this package just changed in a way that is likely to break downstream users, please confirm." For libraries that's the thing you want CI to block on.

Private symbols are classified separately from public ones, so moving an internal helper around doesn't pollute the diff.

5. Golden fixtures stop showing up as debt

Some repositories — including CodeClone itself — intentionally keep duplicated golden fixtures to lock report contracts and parser behavior. Those clones are real. They are also not live review debt.

b5 adds a project-level policy for exactly that case:

[tool.codeclone]
golden_fixture_paths = ["tests/fixtures/golden_*"]

Clone groups fully contained in those paths are:

excluded from the health score
excluded from CI gates
excluded from active findings
still carried in the report as suppressed facts

So the tool stays honest — you can still see the suppressed groups in the HTML Clones tab and in the canonical JSON — without making CI noisier than it needs to be. If a group stops being "fully inside the fixture paths," it stops being suppressed automatically.

6. Triage that says what it's actually looking at

MCP summary and triage payloads in b5 include a few compact interpretation fields that turned out to matter a lot for both AI agents and humans:

health_scope — is this number repository-wide, production-only, or for a specific focus?
focus — what does "new findings" actually mean for this run?
new_by_source_kind — of the new findings, how many are in production code vs tests vs tooling?

The net effect is that an agent asking "is this PR risky?" no longer has to guess whether "3 new findings" means "three new bugs in production" or "three new flake-prone tests." The payload tells it directly. The VS Code extension uses the same fields to explain repository-wide health, production focus, and outside-focus debt without widening the review flow.

The extension also now surfaces Coverage Join facts in its overview when the connected server supports them, and the optional in-IDE help topics are gated by server version so they stay honest about what's actually available.

7. The HTML report got a proper rebuild

b4 made the HTML report useful. b5 makes it feel finished.

Unified filters popover — Clones and Suggestions share the same filter UX: one button, one menu, an active-filter count, keyboard dismiss. Every control lives in the same place on every tab that has filters. No more two-row filter strips that wrap on narrow screens.
Cleaner empty states — instead of empty tables, sections with no findings now render a single reassuring row with an explicit "no issues detected" message and an icon. Silence has meaning now.
Coverage Join subtab — Quality gets a dedicated Coverage Join view with per-function rows: coverage %, complexity, callers, source kind, and a clear marker for scope gaps.
Adaptive theme toggle — the theme button shows a sun in light mode and a moon in dark mode, resolved at paint time so you don't flash the wrong icon on first load.
Refreshed palette — the whole report moved to a chromatic neutral scale tinted toward the brand indigo, so surfaces, borders, and text live on the same hue axis instead of looking like "grayscale + one purple button."
Better provenance — the meta block makes it explicit which python tag the baseline was built for, and calls out baseline mismatches instead of hiding them.
Stat-card rhythm — KPI cards across Overview, Quality, Dependencies, and Dead Code share one card component now. Same padding, same typography, same tone variants.

None of that changes a single report contract. It's pure render-layer work.

8. Claude Desktop launches the right Python

A boring but high-impact b5 change: the Claude Desktop bundle now resolves your project's runtime before falling back to a global one. Poetry's .venv, workspace .venv, and an explicit workspace_root override all come before anything on PATH.

Before: installing CodeClone into your project, then launching it via Claude Desktop, would often run some other CodeClone from /usr/local/bin because that happened to be first on PATH. That's fixed.

If you've been getting subtly wrong results through Claude Desktop and couldn't explain why, this is the one to pull.

9. Safer and more deterministic under the hood

Two changes that are unglamorous but worth noting:

Git diff ref validation. When you use --diff-against, the supplied revision is now validated as a safe single-revision expression before being passed to git. No shell surprises, no accidental multi-ref expressions.
Canonical segment digests. Segment clone digests no longer use repr() — they're computed from canonical JSON bytes. This closes a subtle determinism hole where two runs on different interpreters could, in rare cases, produce different segment digests for the same input.

Neither changes clone identity or fingerprint semantics.

10. The warm path is actually warm

One of the more satisfying b5 fixes wasn't a feature at all.

I'd been quietly suspicious of the benchmark numbers for a while — warm runs were looking too good, and I couldn't make the shape of the curve match what the pipeline was actually doing. Turns out the benchmark harness had a bug that broke process-pool execution on warm runs, so the cache was being credited for work it wasn't doing.

After fixing the harness and tightening gating around benchmark runs so repo quality gates don't interfere, the numbers are now both fast and trustworthy. From the Linux smoke benchmark:

cold_full: 6.58s
warm_full: 0.95s
warm_clones_only: 0.86s

About 6.9× speedup on warm runs. The cache is no longer "probably helping" — it is clearly doing useful work, and now I can say that with a straight face.

Wrapping up

If b4 made CodeClone a real review surface, b5 is the release where that surface learned to ask useful second-order questions:

Is this complex function actually tested?
Is this low-coverage number a hard signal or a scope gap?
Is this new finding in production code or in fixtures?
Did this PR break the public API?
Is this duplication intentional test scaffolding or real debt?

Every one of those used to require me to eyeball two dashboards and a coverage report. Now there's a single canonical answer, and it ships consistently through CLI, HTML, JSON, SARIF, MCP, the VS Code extension, the Claude Desktop bundle, and the Codex plugin.

Try it

# Base install
uv tool install --pre codeclone

# With MCP for AI agents (Claude Desktop, Codex, VS Code, Cursor, ...)
uv tool install --pre "codeclone[mcp]"

A one-liner to feel the new shape on your own repo:

codeclone . \
  --coverage coverage.xml --coverage-min 70 \
  --min-typing-coverage 80 --fail-on-typing-regression \
  --api-surface --fail-on-api-break \
  --html

Open the HTML report, watch the Coverage Join tab populate, and check whether your "risky" functions really were the risky ones.

Feedback, issues, and PRs welcome on GitHub.

CodeClone b4: from CLI tool to a real review surface for VS Code, Claude Desktop, and Codex

orenlab — Sun, 05 Apr 2026 18:28:09 +0000

I already wrote about why I built CodeClone and why I cared about baseline-aware code health.

Then I wrote about turning it into a read-only, budget-aware MCP server for AI agents.

This post is about what changed in 2.0.0b4.

The short version: if b3 made CodeClone usable through MCP, b4 made it feel like a product.

Not because I added more analysis magic or built a separate "AI mode." But because I pushed the same structural truth into the places where people and agents actually work — VS Code, Claude Desktop, Codex — and tightened the contract between all of them.

A lot of developer tools are strong on analysis and weak on workflow. A lot of AI-facing tools shine in a demo and fall apart in daily use.

For b4, I wanted a tighter shape:

the CLI, HTML report, MCP, and IDE clients should agree on what "health" means
the first pass should stay conservative
deeper inspection should be explicit, not accidental
report-only signals should stay visible without polluting gates
setup failures should tell you what went wrong

That is the release theme. Not "more output" — better day-to-day workflows.

The most interesting new layer: Overloaded Modules

Clone detection tells you this logic is repeated. Complexity tells you this function is locally hard to reason about.

Overloaded Modules asks a different question: which modules are taking on too much responsibility?

The signals include module size pressure, dependency pressure, hub-like shape, and reimport-heavy structure. This points to code that often feels wrong before it is easy to classify. You know the file keeps attracting logic. Every change in it feels heavier than it should. But it is not a clone group or a single high-complexity function.

The important design choice: this layer is report-only for now. It shows up in JSON, HTML, Markdown, text, MCP, and the VS Code extension — but it does not affect health score, gates, baseline novelty, or SARIF.

I wanted the signal to be useful before letting it become consequential.

VS Code became a real client, not a demo

The preview VS Code extension is the first release where CodeClone feels properly usable inside an editor instead of only around one.

It is now live on the Visual Studio Marketplace.

The extension is not a generic linter panel. It is built around a review loop:

Analyze the workspace.
Look at compact structural health.
Review priorities first.
Reveal source.
Open detail only when needed.

A lot of extensions get this wrong by dumping every result into the IDE and calling it integration. I wanted the opposite: a client that is baseline-aware, triage-first, source-first, trust-aware, and read-only.

b4 also tightened the surrounding UX:

Restricted Mode — onboarding works in untrusted workspaces, but analysis stays gated until trust is granted
Explicit analysis profiles — "deeper review" is a deliberate follow-up, not silent threshold drift
Hard version checks — if the IDE client quietly talks to the wrong local server version, you do not get a tool you can trust; you get folklore

That last one mattered more than I expected.

Claude Desktop and Codex speak the same contract

I also added native client paths for Claude Desktop and Codex.

The goal was not "be available in more places." It was keeping one analysis contract across all of them:

no second analyzer
no plugin-specific findings
no AI-only semantics
no client that quietly disagrees with the CLI

Claude Desktop gets a local .mcpb bundle with pre-loaded review instructions.
Codex gets a native plugin with two focused skills — full review and quick hotspot discovery. Both sit on top of the same codeclone-mcp server.

That may sound boring, but boring is good here. The more clients you add, the easier it becomes to fork your own semantics without noticing. A lot of the b4 work was about resisting exactly that.

Conservative first, deeper only when you mean it

CodeClone defaults are intentionally conservative. That is the right first pass for CI, baseline-aware review, and agent-driven workflows.

But there is a real second need: sometimes the default pass looks clean, and you want to go hunting for smaller, more local repetition.

b4 makes that distinction explicit:

Start with defaults or pyproject.toml thresholds.
Use that as the stable first pass.
Lower thresholds only for an intentional deeper review.

This now shows up clearly in MCP help topics and in the VS Code analysis profiles.

"More sensitive" is not the same as "more correct." A clean conservative pass
does not prove there is no finer-grained repetition. But a lower-threshold exploratory pass should not quietly pretend to have the same meaning as the default profile. That distinction needed to become product-level.

MCP got smarter about guiding agents — and cheaper to talk to

Two things happened on the MCP side that are easy to miss but matter a lot in practice.

First: the help tool. In b3, agents had 20 analysis and query tools but no way to ask "what should I do next?" or "what does this baseline state mean?" without burning tokens on trial and error.

b4 adds a help(topic=...) tool with bounded, static answers for common uncertainty points: workflow sequencing, analysis profile semantics, baseline interpretation, suppression rules, review state, and changed-scope review. An agent can ask one cheap question instead of making three exploratory tool calls to figure out the right next step.

This is a small surface — seven topics, short answers, no dynamic analysis. But it changes the economics of agent workflows significantly. The difference between "the agent guesses and retries" and "the agent asks and proceeds" is often 3–5x in token cost.

Second: tighter token budgets across the board. b4 continued the budget-aware work from b3:

finding IDs are now sha256-based short forms instead of full canonical URIs
the derived section in MCP payloads is projected down to what agents actually need
metrics_detail is paginated with family and path filters so agents never pull the full metrics table by accident

None of this changes the canonical report — the JSON is still the complete truth. But the MCP view over it is now meaningfully leaner.

The boring fixes that matter most

Some of my favorite changes in b4 are not flashy:

setup guidance that matches the real install path
faster launcher failure behavior with clear error messages
stricter local version handling across all client surfaces
enriched MCP server instructions so agents get behavioral context on connect, not just a list of tools
terminology cleanup around module hotspots

This is not the kind of work that looks impressive in a screenshot. But it is exactly the kind of work that makes an engineering tool feel trustworthy over weeks and months.

What `b4` feels like

b1 — CodeClone became more than a clone detector.
b3 — it became a serious MCP server.
b4 — it started to feel coherent across the CLI, the report, MCP, and every client surface.

You can start in the editor. You can stay aligned with baseline-aware truth. You can inspect module-level pressure without turning it into fake gating. You can move between human and agent workflows without changing the underlying semantics.

That is much closer to what I wanted CodeClone to become.

Try it

uv tool install --pre codeclone        # core CLI (beta)
uv tool install --pre "codeclone[mcp]" # + MCP server for agents and IDEs
codeclone .                            # analyze the current project
codeclone . --html --open-html-report  # open the interactive report

GitHub — source, extensions, plugin
Docs — contracts, guides, live report
MCP guide — agent and IDE setup
PyPI

If you are building review workflows around IDEs, MCP clients, or AI-assisted refactoring, I would love feedback on one question:

What makes a structural analysis tool feel trustworthy once it leaves the CLI and starts living inside real developer workflows?

I turned my Python code quality tool into a budget-aware MCP server for AI agents

orenlab — Wed, 01 Apr 2026 13:00:50 +0000

I already wrote about why I built CodeClone and why I care about baseline-aware
code health:

I built a baseline-aware Python code health tool for CI and AI-assisted coding

This post is about what changed in 2.0.0b3.

The short version: this is the first release where CodeClone feels less like a Python structural analysis CLI and more like a serious MCP surface for AI coding agents.

Not by building a second engine.
Not by adding AI-specific heuristics to the core.
But by exposing the same deterministic, baseline-aware pipeline through a read-only MCP layer that agents can actually use.

Why MCP mattered for CodeClone

Once you start using coding agents seriously, the hard part is not "can the model write code?"

The harder questions are:

what changed structurally?
is this debt new or already accepted in baseline?
is this production risk or just test noise?
should this block CI?
what is the safest next refactor target?

That is the gap I wanted CodeClone to close.

What shipped in `2.0.0b3`

The headline is an optional MCP server:

pip install --pre "codeclone[mcp]"
codeclone-mcp --transport stdio

Since b3 is still a beta, the --pre flag matters here.

But the useful part is the workflow around it.

b3 adds three things that matter together:

a read-only MCP surface for agents and IDE clients
review-oriented workflows: changed-files analysis, run comparison, gate preview, and PR summaries
tighter surrounding surfaces: stronger SARIF, better HTML navigation, and directory hotspots

There is also a packaging change worth mentioning:

CodeClone source code is now under MPL-2.0
documentation stays under MIT

What makes this MCP layer different

I think there are a lot of tools now that can expose "some analysis" over MCP.

What I wanted from CodeClone was stricter than that.

1. Canonical-report-first

The MCP layer is not a second truth path.

It reads the same canonical report model as the CLI, HTML, and SARIF surfaces.
That means an agent is not looking at an "AI view" that quietly disagrees with what CI or the report says.

2. Read-only

This was non-negotiable for me.

CodeClone MCP does not mutate:

source files
baselines
repository state
on-disk report artifacts

The only mutable part is session-local review state, and that stays in memory only.

3. Budget-aware by design

This is the part I ended up caring about more than I expected.

A lot of MCP tools are technically useful, but easy to use badly. An agent can burn a lot of tokens just by listing too much too early.

CodeClone MCP is intentionally shaped so that the cheapest useful path is the default path.

It is not only bounded in payload shape. It actively guides agents toward low-cost, high-signal workflows.

The cheapest useful path is now the most obvious path.

The workflow I wanted agents to follow

The right first pass is not "dump all findings."

In practice, the first useful question is rarely “show me everything.”
It is usually “where should I look first?”

It is:

analyze_repository or analyze_changed_paths
→ get_run_summary or get_production_triage
→ list_hotspots or focused check_*
→ get_finding
→ get_remediation

That sounds simple, but it matters a lot.

It means:

cheap overview first
narrow triage second
deep detail only when it is actually needed

That is a better fit for LLMs, and honestly a better fit for humans too.

Real token cost on a dirty repository

I wanted to check whether this was just a nice theory, so I tested it on one of my own messier private Python repos.
It is still in an early development stage, is not public yet, and from CodeClone's point of view it currently has a lot of structural debt.
It works, but "works" and "structurally healthy" are obviously not the same thing.

In one local run, that looked like this:

449 Python files
108,939 lines
2,729 functions
1,048 classes
659 findings
health score 34 (F)

Then I compared two MCP paths.

Broad first-pass flow

A more naive "ask for a lot of things" practical cycle came out to about:

10,566 tokens

Guided first-pass flow

Following the new MCP guidance:

analyze_repository
get_production_triage
list_hotspots
get_finding
get_remediation

The same first-pass workflow came out to about:

2,535 tokens

That is roughly a 76% reduction in token cost for a useful first pass.

The payloads did not magically become tiny; the main change was that the MCP surface now guided the client toward a narrower first-pass workflow.

That result mattered to me because it changed how I think about MCP quality.

For agent tooling, payload size is only half the story.
The other half is whether the server nudges the agent toward the right path.

Why this matters for PR review

In practice, the most valuable agent loop is usually not “analyze the whole repository forever,” but “review what changed, compare it to baseline, and decide whether anything should block the merge.”

It is usually closer to:

code changed
tests passed
now check whether the structure got better or worse

That is why b3 puts a lot of weight on changed-scope review.

With CodeClone MCP, an agent can now ask things like:

what findings touch the files changed in this branch?
are these findings new relative to baseline?
what is the highest-priority structural issue here?
would this fail CI?
can I produce a short PR-ready summary?

That is a much better review loop than a giant flat findings dump.

What the MCP surface is good at now

The shape I like most is:

full repository analysis when you need canonical truth
changed-files analysis when you need review focus
compact triage first
single-finding drill-down second
markdown PR summary at the end

In practice, that makes prompts stay simple.

For example:

Changed-files review

Use CodeClone MCP to review the files changed in this branch.
Show me only findings that touch changed files, rank them by priority, and tell me whether anything here should block CI.

Safe refactor pick

Use CodeClone MCP to find one high-priority structural issue that looks safe to refactor. Explain why it is a good first target and what refactor shape you would use. Do not change code yet.

AI-generated code check

I added a lot of code with an AI agent.
Use CodeClone MCP to check for structural drift: new clone groups, duplicated branches, dead code, or design hotspots. Prioritize what is new relative to baseline.

That is the kind of MCP ergonomics I was aiming for: prompts stay fairly client-agnostic, and the server gives the agent a disciplined path.

`b3` is not only about MCP

Even though MCP is the headline, I did not want it to be isolated from the rest of the product.

2.0.0b3 also tightens the surrounding surfaces:

canonical report schema 2.2
cache schema 2.3
canonical design-finding thresholds recorded in report metadata
Hotspots by Directory in the HTML overview
stronger SARIF identities for code-scanning workflows
Composite GitHub Action v2 for CI and PR automation

That matters because I want all of these surfaces to agree:

CLI for CI
MCP for agents
HTML for navigation
SARIF for platform workflows

The product truth I am taking from this release

The biggest lesson from b3 is that a good MCP server is not just a pile of tools.

It is a control surface.

For CodeClone, that now means:

deterministic
canonical-report-first
read-only
budget-aware
triage-first
agent-guiding

That is the direction I want to keep pushing.

Not "AI magic."
Better control loops.

Try it (don't forget use `--pre`)

GitHub: orenlab/codeclone
Docs: orenlab.github.io/codeclone
MCP guide: orenlab.github.io/codeclone/mcp/
PyPI: pypi.org/project/codeclone

If you are already building with MCP clients, I would especially love feedback on one question:

what would make PR review through an MCP tool genuinely useful for your team?

I built a baseline-aware Python code health tool for CI and AI-assisted coding

orenlab — Thu, 26 Mar 2026 11:49:56 +0000

I built a baseline-aware Python code health tool for CI and AI-assisted coding

If you write Python with AI tools today, you’ve probably felt this already:

the code usually works, tests may pass, lint is green, but the structure gets worse in ways that are hard to notice
until the repository starts fighting back.

Not in one dramatic commit. More like this:

the same logic gets rewritten in slightly different ways across multiple files;
helper functions quietly grow until nobody wants to touch them;
coupling increases one import at a time;
framework callbacks look unused even when they are not;
dead code accumulates because generated code tends to leave leftovers behind.

That is the problem space I built CodeClone for.

CodeClone 2.0.0b1 is the first version where the tool really matches the model I wanted from the beginning: not just
“find some clones,” but track structural code health over time, in CI, with a trusted baseline.

This post is an introduction to that version and the design choices behind it.

First: I know the ecosystem is not empty

I’m not pretending this is the first serious tool in this space.

There are already strong tools around adjacent problems:

SonarQube / SonarCloud for broad code quality, governance, and quality gates
PMD CPD as one of the classic copy/paste detectors
jscpd for practical duplicate-code scanning across multiple languages
Vulture for Python dead-code detection
Radon / Xenon for complexity-related checks
and newer tools like pyscn, which also move toward structural/code-health analysis for Python

That matters, because I don’t think useful tools should be framed as “everything before this was wrong.”

CodeClone is not trying to replace all of the above.

Its angle is narrower and, I think, pretty specific:

structural duplication is a first-class signal;
baseline-aware governance is the center of the workflow, not an extra feature;
deterministic output is non-negotiable;
and the UI/report layer is not allowed to invent conclusions the analysis engine did not produce.

If I had to summarize the difference in one sentence, it would be this:

CodeClone is built around separating accepted debt from new regressions.

That sounds simple, but it changes the entire shape of the tool.

Why I think this matters more now

AI coding assistants are genuinely useful. I use them. They speed things up.

But they also change the failure mode of a codebase.

The biggest risk is often not “the AI wrote something syntactically invalid.” That part is easy to catch.

The harder problem is that AI tools are very good at producing locally plausible code:

one more handler,
one more service method,
one more variant of the same logic,
one more utility that overlaps with three existing ones.

Each individual change looks reasonable.

The repository as a whole gets worse.

That is why I think structural analysis is especially useful for AI-assisted teams. If you are using Claude Code,
Cursor, Codex, or similar tools, the important question is often not:

“Is this code valid?”

but:

“Did this change make the repository structurally worse?”

That is exactly the question a baseline-aware tool can answer well.

What CodeClone focuses on

At the core, CodeClone analyzes Python projects and looks at structural signals such as:

function clones
block clones
segment clones
structural findings like duplicated branch families
dead code
complexity
coupling
cohesion
dependency cycles
a combined health score

The outputs come in multiple formats:

HTML
JSON
Markdown
SARIF
Text

But they all come from a single canonical report document. That was important to me because I wanted consistency between
machine-readable outputs and the human-facing report.

The key idea: baseline-aware governance

This is the part I care about most.

A lot of code quality tools can tell you that your repository has problems. That is useful, but it is not enough for
real CI.

In a non-trivial codebase, there is usually historical debt:

old duplication
old complexity hotspots
old dead code
old architectural compromises

If a tool only says “you have 400 problems,” that doesn’t help much. Most teams will either ignore it or disable it.

CodeClone is designed around a different model:

take the current state as a baseline;
trust and validate that baseline explicitly;
keep accepted debt visible;
block new regressions.

That makes the tool much more usable in practice.

Instead of asking teams to become perfect overnight, it asks a much more realistic question:

“Did this branch make the codebase worse than the state we already accepted?”

That is the main reason I describe CodeClone as baseline-aware before I describe it as a clone detector.

What changed in 2.0.0b1

Version 2.0.0b1 is the point where that model became much more complete.

1. A real code-health model

CodeClone now computes a health score from multiple dimensions:

clones
complexity
coupling
cohesion
dead code
dependencies
coverage

I did not want this to become a decorative “AI score.” The point is not the number by itself; the point is whether the score can be traced back to concrete structural reasons.

That is why the new HTML overview is built around:

a health gauge
KPI cards
an executive summary
source-scope breakdown
a health profile chart

The goal is to answer not only “what failed?” but also “what should I look at first?”

2. Baseline became a first-class contract

In 2.0.0b1, baseline handling is no longer just a convenience file.

It is now a stricter contract with:

trust semantics
compatibility checks
integrity fields
deterministic payload handling
unified clone + metrics baseline flow

That matters a lot in CI. If the baseline itself is not trustworthy, the entire gating story becomes shaky.

3. Dead code arrived, but with explicit suppressions

Dead-code analysis is now part of the model, but I did not want to solve dynamic Python behavior with magic heuristics.

So for intentional runtime-driven cases, CodeClone uses explicit inline suppressions:

# codeclone: ignore[dead-code]
def handle_exception(exc: Exception) -> None:
    ...

or:

class Middleware:  # codeclone: ignore[dead-code]
    ...

That is a deliberate design choice.

I would rather have a local, visible policy mechanism than silently broaden the detector until it becomes hard to reason
about.

4. SARIF was added in 2.0.0b1

This is worth calling out explicitly because I do not want to misrepresent the release: SARIF is new in 2.0.0b1.

I wanted it to be useful beyond “technically yes, there is a SARIF file.”

So the current implementation is designed to work better with IDE/code-scanning workflows, including:

%SRCROOT% anchoring
artifacts
richer rule metadata
location alignment
baseline state for clone results when applicable

5. Detection thresholds got more practical

The default thresholds are now more permissive than before.

That means CodeClone filters out less and analyzes more. For example:

function-level min_loc was lowered from 15 to 10
block thresholds were relaxed
segment thresholds were relaxed

This does increase analysis volume, so it has performance implications. But it also makes the tool more honest. It stops politely ignoring a bunch of small-but-real structural issues.

Why this is useful for AI-generated code

I want to be careful here, because “AI code quality” can turn into hand-wavy marketing really fast.

I am not claiming that CodeClone can detect whether a human or an LLM wrote a piece of code.

That is not the point.

The point is simpler:

AI-assisted development tends to amplify a certain class of structural problems:

repeated patterns with small variations
copy-pasted orchestration logic
overgrown functions
dead callback surfaces
architecture drift that happens in many individually “reasonable” steps

CodeClone is a good fit for that environment because it is:

structural rather than stylistic
deterministic enough for CI
baseline-aware, so it can focus on regression control
explicit about suppressions instead of hiding runtime ambiguity behind heuristics

If your team ships a lot of AI-assisted code, the practical question is not “is AI bad?” It is:

“How do we keep the repository readable, stable, and governable while code is being produced faster?”

That is the problem I think CodeClone helps with.

What it is not

I think first posts do better when they are honest about scope, so here is the short version.

CodeClone is not:

a replacement for SonarQube
a style linter
a security scanner
a magic AI-code detector
a claim that every other tool got the problem wrong

It is a Python-focused, baseline-aware, structural analysis tool with a strong CI orientation.

And yes, it is still beta.

Quick start

If you want to try the prerelease:

pip install --pre codeclone

or:

uv tool install --pre codeclone==2.0.0b1

Then:

codeclone .
codeclone . --html
codeclone . --ci

If you want to adopt the baseline workflow:

codeclone . --update-baseline
codeclone . --ci

Where to look next

Docs: https://orenlab.github.io/codeclone/
Live sample report: https://orenlab.github.io/codeclone/examples/report/
PyPI: https://pypi.org/project/codeclone/
GitHub: https://github.com/orenlab/codeclone

Closing thought

If I had to summarize CodeClone 2.0.0b1 in one line, it would be this:

It is the point where the project stopped being “just a clone detector” and became a baseline-aware structural quality
tool for Python CI.

That is the direction I wanted from the beginning.

And with AI-assisted development becoming normal, I think tools in this category are becoming more important, not less.

If this sounds useful, I would be glad to hear what breaks, what feels noisy, what you would want from the CI workflow,
and what kinds of repositories you would actually trust a tool like this on.

DEV Community: orenlab

[Boost]

I built ckdn so coding agents never have to guess whether checks passed

I built ckdn so coding agents never have to guess whether checks passed

Why "checkdown"?

Moment one, solved: digests instead of logs

Moment two, solved: exit code and evidence must agree

Moment three, solved: the repository owns the checks

Try it in two minutes

MCP support

Parsers

Why not just ask the model to summarize the log?

What ckdn is not — and where it won't help you

Try to break it

Your AI agent's diff looks fine. That's the problem.

Code review starts too late

The loop, on the loyalty-discount task

"What was declared" vs. "what happened"

No LLM grading its own homework

Agents forget — twice

Also in the alpha

Try it

Why ship an alpha

Code Review Starts Too Late

Why Your Python Dead Code Detector Is Wrong About FastAPI, SQLAlchemy, and Half Your Codebase

Why "find calls" is not enough

Why this is a bigger problem if you use AI coding assistants

The tempting trap: name-based heuristics

Reachability, not vibes

What real false positives look like in practice

An important distinction worth making explicit

What changed in CodeClone 2.0.2

Dead code gates in CI only work if engineers trust them

The principle

Closing

Agents Make Code Cheaper. CodeClone 2.0 Makes Structural Regressions Harder to Ship.

Why this matters more now

The core idea: accepted debt vs new regressions

One analysis, many surfaces

MCP is not just tool access

help(topic=...) became more important than I expected

Coverage Join: structural risk needs test context

Security Surfaces: not SAST, not vulnerability claims

Dependency depth: I removed the magic number

API Surface and Adoption

Overloaded Modules

Native surfaces: VS Code, Claude Desktop, Codex

VS Code

Claude Desktop

Codex

GitHub Action

What “stable” means here

Try it

What comes next

Structural review that finally knows what your tests cover

1. Bring your coverage.xml into the review

2. Honest about scope: measured vs out-of-scope

3. Typing and docstring coverage are now part of the picture

4. Public API drift becomes a first-class signal

5. Golden fixtures stop showing up as debt

6. Triage that says what it's actually looking at

7. The HTML report got a proper rebuild

8. Claude Desktop launches the right Python

9. Safer and more deterministic under the hood

10. The warm path is actually warm

Wrapping up

Try it

CodeClone b4: from CLI tool to a real review surface for VS Code, Claude Desktop, and Codex

The most interesting new layer: Overloaded Modules

VS Code became a real client, not a demo

Claude Desktop and Codex speak the same contract

Conservative first, deeper only when you mean it

MCP got smarter about guiding agents — and cheaper to talk to

The boring fixes that matter most

What b4 feels like

Try it

I turned my Python code quality tool into a budget-aware MCP server for AI agents

Why MCP mattered for CodeClone

What shipped in 2.0.0b3

What makes this MCP layer different

`help(topic=...)` became more important than I expected

1. Bring your `coverage.xml` into the review

What `b4` feels like

What shipped in `2.0.0b3`

`b3` is not only about MCP

Try it (don't forget use `--pre`)