The Problem: Hidden Code Duplication
Every codebase accumulates copy-paste clones. They start innocently — "I'll just duplicate this function and tweak it" — but over months they compound into a maintenance tax. When a bug is fixed in one copy, the other 5 copies stay broken. When a feature needs to change, you hunt down every variant by hand.
Traditional duplicate detectors fall into two camps:
- Line-based tools (Duplo, Simian) — compare raw text, miss everything after variable renaming
- Token-based tools (jscpd, PMD CPD) — slightly better, but still fail when identifiers change
- Generic AST tools (SonarQube) — heavy infrastructure, Python support is an afterthought
PyChase takes a different approach: normalized AST fingerprints with MinHash + LSH.
What PyChase Catches
| Clone Type | Description | Example | PyChase |
|---|---|---|---|
| Type-1 | Exact copy-paste | Same code, same names | ✅ 1.000 score |
| Type-2 | Renamed identifiers |
calculate_total → compute_sum
|
✅ 0.786 score |
| Type-3 | Modified logic | Added/removed statements, reordered | ✅ 0.620 score |
How It Works
1. Parse to AST
Each .py file is parsed into a Python Abstract Syntax Tree using the standard ast module.
2. Normalize
Variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders. This strips everything except structure:
# These two produce the SAME normalized AST:
def calculate_total(items, rate):
subtotal = 0
for item in items:
subtotal += item.price
tax = subtotal * rate
return subtotal + tax
def compute_sum(products, factor):
result = 0
for product in products:
result += product.cost
fee = result * factor
return result + fee
3. Shingle
The normalized AST node sequence is broken into overlapping k-shingles (default k=3):
(FunctionDef, arguments, arg) → (arguments, arg, Add) → ...
4. Fingerprint + Match
Each shingle set becomes a structural fingerprint. MinHash signatures compress these into compact 256-bit signatures. Locality Sensitive Hashing (LSH) indexes them — only units sharing a bucket are compared.
This reduces candidate pairs from O(n²) to nearly O(n). For 10,000 functions, that's 50 million comparisons avoided.
5. Cluster + Report
Connected-component clustering groups related matches. Results render as text, JSON, CSV, or an interactive HTML report with collapsible groups and syntax-highlighted code previews.
Quick Start
pip install pychase
# Scan your project
pychase .
# Generate HTML report
pychase --format html --output duplicates.html .
# JSON output for CI
pychase --json --threshold 0.85 ./src
Why MinHash + LSH Matters
Tools like dry4python compare every function against every other function — O(n²). With 10,000 functions, that's 50 million comparisons. PyChase's MinHash + LSH approach reduces this to near-linear time, making it viable for large monorepos and CI pipelines.
Comparison at a Glance
| Tool | Algorithm | Clone Types | Setup | Output |
|---|---|---|---|---|
| PyChase | MinHash+LSH (O(n)) | Type-1,2,3 | pip install |
Text, JSON, CSV, HTML |
| dry4python | Brute-force (O(n²)) | Type-2 only | pip install |
Text, JSON |
| jscpd | Token hash | Type-1,2 | Node.js | Text, JSON, HTML |
| SonarQube | Custom AST | Type-1,2 | DB + server + scanner | Web dashboard |
Try It
pip install pychase
pychase --threshold 0.55 --min-lines 2 --min-nodes 10 ./src
GitHub: https://github.com/Mayne-X/PyChase
Docs: https://github.com/Mayne-X/PyChase/wiki
PyPI: [https://pypi.org/project/pychase/]
Top comments (0)