Mayne

Posted on Jun 18

PyChase: AST-Powered Duplicate Code Detection for Python

#python #refactoring #opensource #codequality

The Problem: Hidden Code Duplication

Every codebase accumulates copy-paste clones. They start innocently — "I'll just duplicate this function and tweak it" — but over months they compound into a maintenance tax. When a bug is fixed in one copy, the other 5 copies stay broken. When a feature needs to change, you hunt down every variant by hand.

Traditional duplicate detectors fall into two camps:

Line-based tools (Duplo, Simian) — compare raw text, miss everything after variable renaming
Token-based tools (jscpd, PMD CPD) — slightly better, but still fail when identifiers change
Generic AST tools (SonarQube) — heavy infrastructure, Python support is an afterthought

PyChase takes a different approach: normalized AST fingerprints with MinHash + LSH.

What PyChase Catches

Clone Type	Description	Example	PyChase
Type-1	Exact copy-paste	Same code, same names	✅ 1.000 score
Type-2	Renamed identifiers	`calculate_total` → `compute_sum`	✅ 0.786 score
Type-3	Modified logic	Added/removed statements, reordered	✅ 0.620 score

How It Works

1. Parse to AST

Each .py file is parsed into a Python Abstract Syntax Tree using the standard ast module.

2. Normalize

Variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders. This strips everything except structure:

# These two produce the SAME normalized AST:
def calculate_total(items, rate):
    subtotal = 0
    for item in items:
        subtotal += item.price
    tax = subtotal * rate
    return subtotal + tax

def compute_sum(products, factor):
    result = 0
    for product in products:
        result += product.cost
    fee = result * factor
    return result + fee

3. Shingle

The normalized AST node sequence is broken into overlapping k-shingles (default k=3):

(FunctionDef, arguments, arg) → (arguments, arg, Add) → ...

4. Fingerprint + Match

Each shingle set becomes a structural fingerprint. MinHash signatures compress these into compact 256-bit signatures. Locality Sensitive Hashing (LSH) indexes them — only units sharing a bucket are compared.

This reduces candidate pairs from O(n²) to nearly O(n). For 10,000 functions, that's 50 million comparisons avoided.

5. Cluster + Report

Connected-component clustering groups related matches. Results render as text, JSON, CSV, or an interactive HTML report with collapsible groups and syntax-highlighted code previews.

Quick Start

pip install pychase

# Scan your project
pychase .

# Generate HTML report
pychase --format html --output duplicates.html .

# JSON output for CI
pychase --json --threshold 0.85 ./src

Why MinHash + LSH Matters

Tools like dry4python compare every function against every other function — O(n²). With 10,000 functions, that's 50 million comparisons. PyChase's MinHash + LSH approach reduces this to near-linear time, making it viable for large monorepos and CI pipelines.

Comparison at a Glance

Tool	Algorithm	Clone Types	Setup	Output
PyChase	MinHash+LSH (O(n))	Type-1,2,3	`pip install`	Text, JSON, CSV, HTML
dry4python	Brute-force (O(n²))	Type-2 only	`pip install`	Text, JSON
jscpd	Token hash	Type-1,2	Node.js	Text, JSON, HTML
SonarQube	Custom AST	Type-1,2	DB + server + scanner	Web dashboard

Try It

pip install pychase
pychase --threshold 0.55 --min-lines 2 --min-nodes 10 ./src

GitHub: https://github.com/Mayne-X/PyChase

Docs: https://github.com/Mayne-X/PyChase/wiki

PyPI: [https://pypi.org/project/pychase/]

DEV Community