Fuzzy matching is one of those tasks that feels "easy" until you hit real-world data volumes.
Comparing two strings is trivial: fuzz.ratio("Microsoft", "Micsrosoft Corpp") returns in microseconds. But what happens when you have to deduplicate a CRM with 1,000,000 rows?
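Under the hood, scorers like fuzz.ratio are normalized edit distances. As a dependency-free sketch (not the exact formula RapidFuzz uses, which is Indel-based), here is the classic Levenshtein distance with one common normalization:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """One common normalization: 100 * (1 - distance / max_len)."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))

print(ratio("Microsoft", "Micsrosoft Corpp"))  # 56.25
```

A single call like this really is microsecond-cheap; the pain only starts when you multiply it by every pair in a large table.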
I spent the last few weeks benchmarking the "standard" Python ways to do this - RapidFuzz, TheFuzz, and Levenshtein - and I realized why everyone hates data cleaning: The O(N²) scaling wall is real.
The Benchmark: 10k to 1M Rows
I set up a head-to-head comparison in a standard Google Colab environment (2 vCPUs, 13GB RAM) using synthetic data with realistic typos (swaps, replacements, and "fat-finger" errors).
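For reference, a synthetic-typo generator along these lines is easy to build with the stdlib. This is a hypothetical helper illustrating the three error classes, not the exact generator used in the benchmark:

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def add_typo(s: str, rng: random.Random) -> str:
    """Inject one realistic error: an adjacent swap, a wrong-key
    replacement, or a 'fat-finger' duplicated keystroke."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    kind = rng.choice(["swap", "replace", "fat_finger"])
    if kind == "swap":                       # transpose two neighbours
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]
    if kind == "replace":                    # hit a wrong key
        return s[:i] + rng.choice(LETTERS) + s[i + 1:]
    return s[:i] + s[i] + s[i:]              # fat-finger: double a character

rng = random.Random(42)
noisy = [add_typo("Microsoft Corporation", rng) for _ in range(3)]
```

Seeding the RNG keeps the corrupted dataset reproducible across benchmark runs.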
The "Wall"
At 10,000 records, RapidFuzz is a beast. Its core is optimized C++, and at this scale it's fast and totally usable.
But fuzzy matching at scale is fundamentally a "many-to-many" problem. When you double your data, you quadruple the work. By the time I hit 100,000 rows, RapidFuzz was taking over 20 minutes. At 1,000,000 rows, local libraries don't just get slow - they crash. You run out of RAM during the matrix construction or your CPU sits at 100% for three days.
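The quadratic wall falls directly out of the naive all-pairs pattern. A minimal sketch (with `sim` standing in for any scorer, e.g. fuzz.ratio): N records mean N·(N−1)/2 similarity calls, so 10k rows ≈ 5×10⁷ calls, 100k ≈ 5×10⁹, and 1M ≈ 5×10¹¹.

```python
from itertools import combinations

def naive_dedupe_pairs(records, sim, threshold=90.0):
    """All-pairs comparison: N*(N-1)/2 similarity calls -- the O(N^2) wall.
    Doubling `records` quadruples the number of calls."""
    matches = []
    for a, b in combinations(records, 2):
        if sim(a, b) >= threshold:
            matches.append((a, b))
    return matches
```

This is the loop that quietly burns a weekend of CPU time once the input grows past a few tens of thousands of rows.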
How I Optimized for 1M+ Rows
To get the Similarity API to finish a 1M-row dedupe in 7 minutes, I had to move away from naive loops and implement a dual-engine strategy:
- Deterministic Indexing: Instead of comparing every string to every other string (quadratic time), I use an adaptive indexing strategy that "blocks" similar strings together before the math starts.
- N-Gram Vectorization: I treat strings as high-dimensional vectors. This allows me to use optimized linear algebra libraries.
- Off-Heap Memory Management: To prevent the out-of-memory (OOM) crashes common in Python, I use memory-mapping (np.memmap) to process data larger than the available RAM.
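The first two ideas can be sketched in a few lines of stdlib Python. This is an illustrative toy, not the API's implementation: it blocks records via an inverted index of character trigrams (only strings sharing at least one trigram are ever compared), and uses Jaccard similarity on n-gram sets as a stand-in for the vectorized cosine similarity used at scale.

```python
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> set:
    """Character n-grams turn a string into a set (or sparse vector)."""
    s = f"  {s.lower()} "                  # pad so short strings still yield grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def blocked_matches(records, threshold=0.6):
    """Blocking sketch: compare only records that share a cheap key
    (a common trigram) instead of all N^2 pairs."""
    grams = [ngrams(r) for r in records]
    index = defaultdict(set)               # trigram -> record ids (inverted index)
    for rid, gs in enumerate(grams):
        for g in gs:
            index[g].add(rid)

    matches = set()
    for rid, gs in enumerate(grams):
        candidates = set().union(*(index[g] for g in gs)) - {rid}
        for cid in candidates:
            if cid < rid:                  # visit each pair once
                inter = len(gs & grams[cid])
                union = len(gs | grams[cid])
                if union and inter / union >= threshold:
                    matches.add((cid, rid))
    return matches
```

With blocking, "Apple" is never even scored against "Microsoft": they share no trigram, so the pair is pruned before any similarity math runs. That pruning, not a faster scorer, is what breaks the O(N²) scaling.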
Stop building dedupe pipelines
If you are a developer, your time is better spent building features than babysitting a 12-hour deduplication script that might crash at 99%.
I’ve open-sourced the benchmark suite and the Google Colab environment I used so you can verify the numbers:
GitHub Repository: View the Benchmark Code — See how we handle 1M+ rows.
Google Colab: Run the Demo — Test the engine in your browser.
I’ve set up a free tier for the API that handles up to 100,000 records. You can generate a free token with a free sign up. It’s meant to be a low-friction way to test real-world data without having to spin up your own infrastructure.
I’m also looking for a few people with very large datasets (5M+ rows) to help me stress-test the next version of the async engine. If you're hitting scale limits that current tools can't solve, feel free to reach out.
