Jeffrin

How I Built a Duplicate File Finder That Handles 1 Million Files Without Crashing

I had a problem that I suspect many developers have — duplicate files scattered across external drives, backup archives, and media collections with no reliable way to clean them up without risking data loss.

Every tool I tried either ran out of memory on large directories, gave false positives, had external dependencies I didn't want, or silently deleted files without clear confirmation.

So I built CleanSweep — a command-line duplicate finder and file organizer written in pure Python with zero external dependencies.


The Core Problem: Most Tools Don't Scale

The naive approach to finding duplicates is:

  1. Read every file into memory
  2. Hash everything
  3. Compare hashes

This works fine for a few hundred files. At 100,000 files it starts to hurt. At 1,000,000 files it crashes.

CleanSweep is built around two design principles that make it scale:

  • Never load file contents into memory — only metadata and hashes
  • Never build a list of all files — use an iterative traversal stack instead

The Traversal: O(depth) Memory, Not O(total files)

Most directory walkers collect all files into a list first, which means memory grows with the number of files. At 1M files that list alone can consume hundreds of MB.

CleanSweep uses an iterative DFS (depth-first search) stack instead:

import os
from pathlib import Path
from typing import Iterator

def iter_files(root: Path) -> Iterator[Path]:
    stack: list[tuple[Path, int]] = [(root, 0)]

    while stack:
        current_dir, depth = stack.pop()

        with os.scandir(current_dir) as sd:
            dir_entries = sorted(sd, key=lambda e: e.name)

        for entry in dir_entries:
            if entry.is_file():
                yield Path(entry.path)
            elif entry.is_dir():
                stack.append((Path(entry.path), depth + 1))

The stack holds only the directories still waiting to be visited, never the full file list. Memory stays at O(depth × branching_factor) regardless of how many files exist.

At 1 million files, peak stack memory is under 1 MB.

We also use os.scandir() instead of Path.iterdir(): scandir caches entry-type metadata from the kernel's readdir buffer, so is_file(), is_dir(), and is_symlink() usually cost no extra syscall per entry on Linux and macOS.


The Hash Pipeline: Three Stages, Minimum I/O

Hashing every file in full is expensive. CleanSweep uses a three-stage pipeline that eliminates most files before any full read happens.

Stage 1 — Group by size

Files with a unique size cannot be duplicates. They are eliminated immediately with zero I/O:

from collections import defaultdict

size_groups: dict[int, list[FileEntry]] = defaultdict(list)
for entry in snapshot:
    size_groups[entry.size].append(entry)

# Keep only groups with 2+ files
candidates = [e for g in size_groups.values() if len(g) > 1 for e in g]

On a typical directory, this eliminates 80–95% of files before any hashing.

Stage 2 — Partial hash (first 4 KB)

For the remaining candidates, only the first 4,096 bytes are hashed with SHA-256. Files with unique partial hashes are eliminated:

import hashlib

PARTIAL_BYTES = 4096

with open(path, "rb") as f:
    partial = hashlib.sha256(f.read(PARTIAL_BYTES)).hexdigest()

Small files (≤ 4 KB) are fully read here and cached — they incur zero extra I/O in stage 3.

Stage 3 — Full hash

Only files that survived both filters get a full SHA-256. Reading is done in 1 MB chunks — never the whole file at once:

import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB

def full_hash(path) -> str:
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hasher.update(chunk)
    return hasher.hexdigest()

Only files with an identical full SHA-256 are declared duplicates. No size matching, no partial matching, no heuristics.


Determinism: Same Input, Always Same Output

This was a non-negotiable requirement. Run CleanSweep twice on unchanged data — the output must be byte-identical.

This means:

  • All file lists are sorted before processing
  • Duplicate groups have stable ordering
  • Parallel hashing results are sorted before exposure — parallelism never affects output order
  • JSON exports use sort_keys=True and absolute paths only

The sort key used everywhere:

_ENTRY_KEY = lambda e: (e.size, e.device, e.inode, str(e.path))
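A deterministic JSON export along these lines might look like the following sketch (the `export_report` function is illustrative; only the sort_keys=True detail comes from the text above):

```python
import json

def export_report(groups: list[list[str]]) -> str:
    """Serialize duplicate groups with stable member and group
    ordering, so identical input always yields byte-identical JSON."""
    payload = {
        "groups": [sorted(g) for g in sorted(groups, key=lambda g: sorted(g))],
    }
    return json.dumps(payload, sort_keys=True, indent=2)
```

Because both the groups and their members are sorted before serialization, the order in which parallel workers finished never leaks into the output.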

Safety: Nothing Happens Without Explicit Confirmation

CleanSweep defaults to dry-run. Nothing is ever moved or deleted unless you explicitly ask:

# Read-only — reports duplicates, touches nothing
cleansweep duplicates ~/Downloads

# Moves duplicates to system trash (reversible)
cleansweep duplicates ~/Downloads --delete --keep newest

# Hard delete — requires both flags
cleansweep duplicates ~/Downloads --delete --permanent

Every destructive action goes through a single gate — action_controller.py. Nothing else in the codebase can call unlink() or rename().


Architecture: Strict Layers, No Circular Imports

The project is divided into hard layers with a locked import graph:

CLI           main.py
Config        config.py
Engine        scanner → duplicates → rules → destination_map
Action        organizer → batch_engine → action_controller
              file_operation_manager → trash_manager
Observability report.py ← analyzer.py
              logger.py, timer.py

Rules that cannot be broken:

  • No print() outside report.py — engine modules return data, never format text
  • No filesystem mutation outside the action layer
  • No circular imports
  • No hidden global state
  • No external packages — standard library only

These rules are enforced by a CI pipeline that checks every push.
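One way such a CI check could work, using only the standard library (this is an illustrative sketch, not CleanSweep's actual pipeline code), is an AST scan that flags print() calls in any module other than report.py:

```python
import ast

def find_print_calls(source: str) -> list[int]:
    """Return line numbers of bare print() calls in a module's source.
    A CI step could run this over every file and fail if any hits
    appear outside report.py."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]
```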


Performance

Measured on real hardware (HP 245 G7, AMD CPU, ext4 SSD):

Files        Scan time   Files/sec
10,000       < 0.5 s
100,000      ~18 s       ~5,600
1,000,000    ~178 s      ~5,620

Peak memory at 1M files: under 1 MB for the traversal stack.


Test Coverage

838 tests across 17 test files covering:

  • Determinism — run twice, diff outputs, must be identical
  • Atomicity — crash during move leaves no corrupted state
  • Concurrency — parallel hashing never affects output order
  • Memory — large synthetic datasets, no MemoryError
  • Edge cases — permission denied, broken symlinks, empty files, race conditions
  • Integration — full pipeline from scan to report
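The determinism test boils down to "run twice, compare bytes". A toy version of that idea (the `run_pipeline` function here is a deliberately tiny stand-in for the real scan-to-report pipeline, not CleanSweep's code):

```python
import hashlib
import json
from collections import defaultdict

def run_pipeline(files: dict[str, bytes]) -> str:
    """Toy stand-in for scan -> hash -> report: group names by
    identical content, then emit canonical, sorted JSON."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    dupes = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return json.dumps({"duplicates": dupes}, sort_keys=True)
```

Feeding the same files in a different order must produce byte-identical output; if it does not, some ordering is leaking through.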

Try It

git clone https://github.com/Jeffrin-dev/CleanSweep.git
cd CleanSweep

# See what's in a directory
python3 main.py scan ~/Downloads

# Find duplicates (read-only)
python3 main.py duplicates ~/Downloads

# Preview what a cleanup would do
python3 main.py duplicates ~/Downloads --dry-run --keep newest

No pip install. No virtualenv. Python 3.10+ only.


What I Learned

Design for scale from the start. The iterative DFS and chunked hashing were architectural decisions made before writing a single test. Retrofitting them later would have been much harder.

Determinism is a feature, not an afterthought. Users trust tools that give the same answer every time. Non-deterministic output is a bug even if the tool technically works.

Safety defaults matter. Defaulting to dry-run means a new user cannot accidentally delete files on their first run. This is more important than convenience.

Locked invariants prevent architectural drift. Writing INVARIANTS.md early — a document that says "these rules cannot be broken by any PR" — kept the codebase clean across 30+ versions.


GitHub: github.com/Jeffrin-dev/CleanSweep

Feedback welcome — especially on the architecture or anything that could be improved.
