# How I Built a Duplicate File Finder That Handles 1 Million Files Without Crashing
I had a problem that I suspect many developers have — duplicate files scattered across external drives, backup archives, and media collections with no reliable way to clean them up without risking data loss.
Every tool I tried either ran out of memory on large directories, gave false positives, had external dependencies I didn't want, or silently deleted files without clear confirmation.
So I built CleanSweep — a command-line duplicate finder and file organizer written in pure Python with zero external dependencies.
## The Core Problem: Most Tools Don't Scale
The naive approach to finding duplicates is:
- Read every file into memory
- Hash everything
- Compare hashes
This works fine for a few hundred files. At 100,000 files it starts to hurt. At 1,000,000 files it crashes.
CleanSweep is built around two design principles that make it scale:
- Never load file contents into memory — only metadata and hashes
- Never build a list of all files — use an iterative traversal stack instead
## The Traversal: O(depth) Memory, Not O(total files)
Most directory walkers collect all files into a list first, which means memory grows with the number of files. At 1M files that list alone can consume hundreds of MB.
CleanSweep uses an iterative DFS (depth-first search) stack instead:
```python
stack: list[tuple[Path, int]] = [(root, 0)]
while stack:
    current_dir, depth = stack.pop()
    with os.scandir(current_dir) as sd:
        dir_entries = sorted(sd, key=lambda e: e.name)
        for entry in dir_entries:
            if entry.is_file():
                yield Path(entry.path)
            elif entry.is_dir():
                stack.append((Path(entry.path), depth + 1))
```
The stack holds only the directories still waiting to be visited, never a list of files. Memory stays at O(depth × branching_factor) regardless of how many files exist.
At 1 million files, peak stack memory is under 1 MB.
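You can sanity-check that claim yourself with `tracemalloc` (my own measurement sketch, not part of CleanSweep; `walk_files` restates the generator above, with `follow_symlinks=False` added as my precaution against symlink loops):

```python
import os
import tempfile
import tracemalloc
from pathlib import Path

def walk_files(root: Path):
    """Iterative DFS matching the snippet above: O(depth) stack, no file list."""
    stack: list[tuple[Path, int]] = [(root, 0)]
    while stack:
        current_dir, depth = stack.pop()
        with os.scandir(current_dir) as sd:
            for entry in sorted(sd, key=lambda e: e.name):
                if entry.is_file(follow_symlinks=False):
                    yield Path(entry.path)
                elif entry.is_dir(follow_symlinks=False):
                    stack.append((Path(entry.path), depth + 1))

# Measure peak allocation while walking a scratch tree of 100 files.
root = Path(tempfile.mkdtemp())
for i in range(100):
    (root / f"file_{i:03}.txt").touch()

tracemalloc.start()
count = sum(1 for _ in walk_files(root))
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"{count} files walked, peak traced memory: {peak / 1024:.1f} KiB")
```

Scale the scratch tree up and the peak barely moves, because nothing proportional to the file count is ever retained.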
We also use `os.scandir()` instead of `Path.iterdir()`: scandir returns `DirEntry` objects that cache the file type reported by the kernel's readdir, so `is_file()`, `is_dir()`, and `is_symlink()` usually avoid an extra `stat()` syscall per entry on Linux and macOS.
## The Hash Pipeline: Three Stages, Minimum I/O
Hashing every file in full is expensive. CleanSweep uses a three-stage pipeline that eliminates most files before any full read happens.
### Stage 1 — Group by size
Files with a unique size cannot be duplicates. They are eliminated immediately with zero I/O:
```python
size_groups: dict[int, list[FileEntry]] = defaultdict(list)
for entry in snapshot:
    size_groups[entry.size].append(entry)

# Keep only groups with 2+ files
candidates = [e for g in size_groups.values() if len(g) > 1 for e in g]
```
On a typical directory, this eliminates 80–95% of files before any hashing.
### Stage 2 — Partial hash (first 4 KB)
For the remaining candidates, only the first 4,096 bytes are hashed with SHA-256. Files with unique partial hashes are eliminated:
```python
PARTIAL_BYTES = 4096

with open(path, "rb") as f:
    partial = hashlib.sha256(f.read(PARTIAL_BYTES)).hexdigest()
```
Small files (≤ 4 KB) are fully read here and cached — they incur zero extra I/O in stage 3.
### Stage 3 — Full hash
Only files that survived both filters get a full SHA-256. Reading is done in 1 MB chunks — never the whole file at once:
```python
CHUNK_SIZE = 1024 * 1024  # 1 MB

hasher = hashlib.sha256()
with open(path, "rb") as f:
    while chunk := f.read(CHUNK_SIZE):
        hasher.update(chunk)
return hasher.hexdigest()
```
Only files with an identical full SHA-256 are declared duplicates. No size matching, no partial matching, no heuristics.
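Tying the three stages together, the control flow looks roughly like this (a simplified, single-threaded sketch; `find_duplicates` and `full_hash` are my names, not the project's):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def full_hash(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Full SHA-256, read in 1 MB chunks so memory stays flat."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    # Stage 1: group by size; a unique size cannot be a duplicate.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    # Stage 2: within surviving groups, group by hash of the first 4 KB.
    by_partial: dict[tuple[int, str], list[Path]] = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            with open(p, "rb") as f:
                partial = hashlib.sha256(f.read(4096)).hexdigest()
            by_partial[(size, partial)].append(p)

    # Stage 3: the full SHA-256 is the only thing that declares a duplicate.
    by_full: dict[str, list[Path]] = defaultdict(list)
    for group in by_partial.values():
        if len(group) < 2:
            continue
        for p in group:
            by_full[full_hash(p)].append(p)
    return [sorted(g) for g in by_full.values() if len(g) > 1]
```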
## Determinism: Same Input, Always Same Output
This was a non-negotiable requirement. Run CleanSweep twice on unchanged data — the output must be byte-identical.
This means:
- All file lists are sorted before processing
- Duplicate groups have stable ordering
- Parallel hashing results are sorted before exposure — parallelism never affects output order
- JSON exports use `sort_keys=True` and absolute paths only
The sort key used everywhere:
```python
_ENTRY_KEY = lambda e: (e.size, e.device, e.inode, str(e.path))
```
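A report export along these lines yields byte-identical output for identical input (a sketch; `export_report` and the `groups` field name are illustrative, not CleanSweep's actual schema):

```python
import json
from pathlib import Path

def export_report(groups: list[list[Path]], out_path: Path) -> None:
    """Serialize duplicate groups deterministically: sorted paths, sorted keys."""
    payload = {
        "groups": [sorted(str(p.resolve()) for p in g) for g in groups],
    }
    # sort_keys=True plus sorted absolute paths means dict insertion order
    # and traversal order can never leak into the output bytes.
    out_path.write_text(json.dumps(payload, sort_keys=True, indent=2) + "\n")
```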
## Safety: Nothing Happens Without Explicit Confirmation
CleanSweep's config defaults to dry-run. Nothing is ever moved or deleted unless you explicitly ask:
```shell
# Read-only — reports duplicates, touches nothing
cleansweep duplicates ~/Downloads

# Moves duplicates to system trash (reversible)
cleansweep duplicates ~/Downloads --delete --keep newest

# Hard delete — requires both flags
cleansweep duplicates ~/Downloads --delete --permanent
```
Every destructive action goes through a single gate — `action_controller.py`. Nothing else in the codebase can call `unlink()` or `rename()`.
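The single-gate pattern can be sketched like this (my illustration of the idea, not the actual `action_controller.py`):

```python
from pathlib import Path

class ActionController:
    """Sole owner of destructive filesystem calls; everything else returns plans."""

    def __init__(self, dry_run: bool = True):
        self.dry_run = dry_run  # safe default: report, don't touch

    def delete(self, path: Path) -> bool:
        """Delete a file. Returns True only if the file was actually removed."""
        if self.dry_run:
            print(f"[dry-run] would delete {path}")
            return False
        path.unlink()  # in this pattern, the only unlink() in the codebase
        return True
```

Because every module that wants to mutate the filesystem must go through one object, the dry-run default and the confirmation flags only need to be enforced in one place.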
## Architecture: Strict Layers, No Circular Imports
The project is divided into hard layers with a locked import graph:
```
CLI            main.py
Config         config.py
Engine         scanner → duplicates → rules → destination_map
Action         organizer → batch_engine → action_controller
               file_operation_manager → trash_manager
Observability  report.py ← analyzer.py
               logger.py, timer.py
```
Rules that cannot be broken:
- No `print()` outside `report.py` — engine modules return data, never format text
- No filesystem mutation outside the action layer
- No circular imports
- No hidden global state
- No external packages — standard library only
These rules are enforced by a CI pipeline that checks every push.
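An invariant like "no `print()` outside `report.py`" is mechanically checkable; a minimal AST-based check might look like this (a sketch, not the project's actual CI script):

```python
import ast
from pathlib import Path

def find_print_calls(source: str, filename: str) -> list[int]:
    """Return the line numbers of bare print() calls in a module's source."""
    tree = ast.parse(source, filename=filename)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]

def check_no_print(root: Path, allowed: frozenset[str] = frozenset({"report.py"})) -> list[str]:
    """Scan a source tree and report every print() outside the allowed modules."""
    violations = []
    for py in root.rglob("*.py"):
        if py.name in allowed:
            continue
        for lineno in find_print_calls(py.read_text(), str(py)):
            violations.append(f"{py}:{lineno}: print() outside report.py")
    return violations
```

A CI job that fails when `check_no_print` returns a non-empty list turns the invariant from a convention into a guarantee.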
## Performance
Measured on real hardware (HP 245 G7, AMD CPU, ext4 SSD):
| Files | Scan time | Files/sec |
|---|---|---|
| 10,000 | < 0.5s | — |
| 100,000 | ~18s | ~5,600 |
| 1,000,000 | ~178s | ~5,620 |
Peak memory at 1M files: under 1 MB for the traversal stack.
## Test Coverage
838 tests across 17 test files covering:
- Determinism — run twice, diff outputs, must be identical
- Atomicity — crash during move leaves no corrupted state
- Concurrency — parallel hashing never affects output order
- Memory — large synthetic datasets, no MemoryError
- Edge cases — permission denied, broken symlinks, empty files, race conditions
- Integration — full pipeline from scan to report
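The determinism tests reduce to "run twice, compare bytes"; the core of such a test can be sketched as:

```python
import subprocess
import sys

def run_twice_identical(cmd: list[str]) -> bool:
    """Run a command twice on unchanged input and compare stdout byte-for-byte."""
    a = subprocess.run(cmd, capture_output=True, check=True).stdout
    b = subprocess.run(cmd, capture_output=True, check=True).stdout
    return a == b

# e.g. run_twice_identical([sys.executable, "main.py", "duplicates", "some/dir"])
```

Comparing raw bytes rather than parsed output catches ordering bugs that a structural comparison would silently forgive.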
## Try It
```shell
git clone https://github.com/Jeffrin-dev/CleanSweep.git
cd CleanSweep

# See what's in a directory
python3 main.py scan ~/Downloads

# Find duplicates (read-only)
python3 main.py duplicates ~/Downloads

# Preview what a cleanup would do
python3 main.py duplicates ~/Downloads --dry-run --keep newest
```
No pip install. No virtualenv. Python 3.10+ only.
## What I Learned
**Design for scale from the start.** The iterative DFS and chunked hashing were architectural decisions made before writing a single test. Retrofitting them later would have been much harder.
**Determinism is a feature, not an afterthought.** Users trust tools that give the same answer every time. Non-deterministic output is a bug even if the tool technically works.
**Safety defaults matter.** Defaulting to dry-run means a new user cannot accidentally delete files on their first run. This is more important than convenience.
**Locked invariants prevent architectural drift.** Writing INVARIANTS.md early — a document that says "these rules cannot be broken by any PR" — kept the codebase clean across 30+ versions.
GitHub: github.com/Jeffrin-dev/CleanSweep
Feedback welcome — especially on the architecture or anything that could be improved.