Fuzzy matching is one of those tasks that feels "easy" until you hit real-world data volumes.
Comparing two strings is trivial: fuzz.ratio("Microsoft", "Micsrosoft Corpp") returns in microseconds. But what happens when you have to deduplicate a CRM with 1,000,000 rows?
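Under the hood, scorers like fuzz.ratio are normalized edit distances. As a dependency-free sketch (not the exact formula RapidFuzz uses, which is Indel-based), here is the classic Levenshtein distance with one common normalization:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """One common normalization: 100 * (1 - distance / max_len)."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))

print(ratio("Microsoft", "Micsrosoft Corpp"))  # 56.25
```

A single call like this really is microsecond-cheap; the pain only starts when you multiply it by every pair in a large table.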
I spent the last few weeks benchmarking the "standard" Python ways to do this - RapidFuzz, TheFuzz, and Levenshtein - and I realized why everyone hates data cleaning: The O(N²) scaling wall is real.
The Benchmark: 10k to 1M Rows
I set up a head-to-head comparison in a standard Google Colab environment (2 vCPUs, 13GB RAM) using synthetic data with realistic typos (swaps, replacements, and "fat-finger" errors).
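For reference, a synthetic-typo generator along these lines is easy to build with the stdlib. This is a hypothetical helper illustrating the three error classes, not the exact generator used in the benchmark:

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def add_typo(s: str, rng: random.Random) -> str:
    """Inject one realistic error: an adjacent swap, a wrong-key
    replacement, or a 'fat-finger' duplicated keystroke."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    kind = rng.choice(["swap", "replace", "fat_finger"])
    if kind == "swap":                       # transpose two neighbours
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]
    if kind == "replace":                    # hit a wrong key
        return s[:i] + rng.choice(LETTERS) + s[i + 1:]
    return s[:i] + s[i] + s[i:]              # fat-finger: double a character

rng = random.Random(42)
noisy = [add_typo("Microsoft Corporation", rng) for _ in range(3)]
```

Seeding the RNG keeps the corrupted dataset reproducible across benchmark runs.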
The "Wall"
At 10,000 records, RapidFuzz is a beast. Its core is optimized C++, and at this scale it's fast and totally usable.
But fuzzy matching at scale is fundamentally a "many-to-many" problem. When you double your data, you quadruple the work. By the time I hit 100,000 rows, RapidFuzz was taking over 20 minutes. At 1,000,000 rows, local libraries don't just get slow - they crash. You run out of RAM during the matrix construction or your CPU sits at 100% for three days.
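The quadratic wall falls directly out of the naive all-pairs pattern. A minimal sketch (with `sim` standing in for any scorer, e.g. fuzz.ratio): N records mean N·(N−1)/2 similarity calls, so 10k rows ≈ 5×10⁷ calls, 100k ≈ 5×10⁹, and 1M ≈ 5×10¹¹.

```python
from itertools import combinations

def naive_dedupe_pairs(records, sim, threshold=90.0):
    """All-pairs comparison: N*(N-1)/2 similarity calls -- the O(N^2) wall.
    Doubling `records` quadruples the number of calls."""
    matches = []
    for a, b in combinations(records, 2):
        if sim(a, b) >= threshold:
            matches.append((a, b))
    return matches
```

This is the loop that quietly burns a weekend of CPU time once the input grows past a few tens of thousands of rows.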
How I Optimized for 1M+ Rows
To get the Similarity API to finish a 1M-row dedupe in 7 minutes, I had to move away from naive loops and implement a dual-engine strategy:
- Deterministic Indexing: Instead of comparing every string to every other string (quadratic time), I use an adaptive indexing strategy that "blocks" similar strings together before the math starts.
- N-Gram Vectorization: I treat strings as high-dimensional vectors. This allows me to use optimized linear algebra libraries.
- Off-Heap Memory Management: To prevent the out-of-memory (OOM) crashes common in Python, I use memory-mapping (np.memmap) to process data larger than the available RAM.
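The first two ideas can be sketched in a few lines of stdlib Python. This is an illustrative toy, not the API's implementation: it blocks records via an inverted index of character trigrams (only strings sharing at least one trigram are ever compared), and uses Jaccard similarity on n-gram sets as a stand-in for the vectorized cosine similarity used at scale.

```python
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> set:
    """Character n-grams turn a string into a set (or sparse vector)."""
    s = f"  {s.lower()} "                  # pad so short strings still yield grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def blocked_matches(records, threshold=0.6):
    """Blocking sketch: compare only records that share a cheap key
    (a common trigram) instead of all N^2 pairs."""
    grams = [ngrams(r) for r in records]
    index = defaultdict(set)               # trigram -> record ids (inverted index)
    for rid, gs in enumerate(grams):
        for g in gs:
            index[g].add(rid)

    matches = set()
    for rid, gs in enumerate(grams):
        candidates = set().union(*(index[g] for g in gs)) - {rid}
        for cid in candidates:
            if cid < rid:                  # visit each pair once
                inter = len(gs & grams[cid])
                union = len(gs | grams[cid])
                if union and inter / union >= threshold:
                    matches.add((cid, rid))
    return matches
```

With blocking, "Apple" is never even scored against "Microsoft": they share no trigram, so the pair is pruned before any similarity math runs. That pruning, not a faster scorer, is what breaks the O(N²) scaling.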
Stop building dedupe pipelines
If you are a developer, your time is better spent building features than babysitting a 12-hour deduplication script that might crash at 99%.
I’ve open-sourced the benchmark suite and the Google Colab environment I used so you can verify the numbers:
GitHub Repository: View the Benchmark Code — See how we handle 1M+ rows.
Google Colab: Run the Demo — Test the engine in your browser.
I’ve set up a free tier for the API that handles up to 100,000 records. You can generate a free token with a free sign up. It’s meant to be a low-friction way to test real-world data without having to spin up your own infrastructure.
I’m also looking for a few people with very large datasets (5M+ rows) to help me stress-test the next version of the async engine. If you're hitting scale limits that current tools can't solve, feel free to reach out.
