DEV Community

Mayne
Mayne

Posted on

PyChase: AST-Powered Duplicate Code Detection for Python

The Problem: Hidden Code Duplication

Every codebase accumulates copy-paste clones. They start innocently — "I'll just duplicate this function and tweak it" — but over months they compound into a maintenance tax. When a bug is fixed in one copy, the other 5 copies stay broken. When a feature needs to change, you hunt down every variant by hand.

Traditional duplicate detectors fall into two camps:

  1. Line-based tools (Duplo, Simian) — compare raw text, miss everything after variable renaming
  2. Token-based tools (jscpd, PMD CPD) — slightly better, but still fail when identifiers change
  3. Generic AST tools (SonarQube) — heavy infrastructure, Python support is an afterthought

PyChase takes a different approach: normalized AST fingerprints with MinHash + LSH.

What PyChase Catches

Clone Type Description Example PyChase
Type-1 Exact copy-paste Same code, same names 1.000 score
Type-2 Renamed identifiers calculate_totalcompute_sum 0.786 score
Type-3 Modified logic Added/removed statements, reordered 0.620 score

How It Works

1. Parse to AST

Each .py file is parsed into a Python Abstract Syntax Tree using the standard ast module.

2. Normalize

Variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders. This strips everything except structure:

# These two produce the SAME normalized AST:
def calculate_total(items, rate):
    subtotal = 0
    for item in items:
        subtotal += item.price
    tax = subtotal * rate
    return subtotal + tax

def compute_sum(products, factor):
    result = 0
    for product in products:
        result += product.cost
    fee = result * factor
    return result + fee
Enter fullscreen mode Exit fullscreen mode

3. Shingle

The normalized AST node sequence is broken into overlapping k-shingles (default k=3):

(FunctionDef, arguments, arg) → (arguments, arg, Add) → ...
Enter fullscreen mode Exit fullscreen mode

4. Fingerprint + Match

Each shingle set becomes a structural fingerprint. MinHash signatures compress these into compact 256-bit signatures. Locality Sensitive Hashing (LSH) indexes them — only units sharing a bucket are compared.

This reduces candidate pairs from O(n²) to nearly O(n). For 10,000 functions, that's 50 million comparisons avoided.

5. Cluster + Report

Connected-component clustering groups related matches. Results render as text, JSON, CSV, or an interactive HTML report with collapsible groups and syntax-highlighted code previews.

Quick Start

pip install pychase

# Scan your project
pychase .

# Generate HTML report
pychase --format html --output duplicates.html .

# JSON output for CI
pychase --json --threshold 0.85 ./src
Enter fullscreen mode Exit fullscreen mode

Why MinHash + LSH Matters

Tools like dry4python compare every function against every other function — O(n²). With 10,000 functions, that's 50 million comparisons. PyChase's MinHash + LSH approach reduces this to near-linear time, making it viable for large monorepos and CI pipelines.

Comparison at a Glance

Tool Algorithm Clone Types Setup Output
PyChase MinHash+LSH (O(n)) Type-1,2,3 pip install Text, JSON, CSV, HTML
dry4python Brute-force (O(n²)) Type-2 only pip install Text, JSON
jscpd Token hash Type-1,2 Node.js Text, JSON, HTML
SonarQube Custom AST Type-1,2 DB + server + scanner Web dashboard

Try It

pip install pychase
pychase --threshold 0.55 --min-lines 2 --min-nodes 10 ./src
Enter fullscreen mode Exit fullscreen mode

GitHub: https://github.com/Mayne-X/PyChase

Docs: https://github.com/Mayne-X/PyChase/wiki

PyPI: [https://pypi.org/project/pychase/]

Top comments (0)