benzsevern
How to Deduplicate 100,000 Records in 13 Seconds with Python

You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product catalog merged from multiple vendors, or academic papers from different databases. You need to find the duplicates, decide which to merge, and produce a clean dataset.

Here's how to do it in one command:

```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv
```

That's the zero-config path. GoldenMatch auto-detects your column types (name, email, phone, zip, address), picks appropriate matching algorithms, chooses a blocking strategy, and launches an interactive TUI where you review the results.

But let's go deeper. I'll walk through what happens under the hood and how to tune it for better results.

What happens when you run goldenmatch dedupe

1. Column Classification

GoldenMatch profiles your data and classifies each column:

| Detected Type | Scorer | Why |
|---|---|---|
| Name | Ensemble (best of Jaro-Winkler, token sort, Soundex) | Handles misspellings, nicknames, word order |
| Email | Exact (after normalization) | Emails are structured identifiers |
| Phone | Exact (digits only) | Strip formatting, compare digits |
| Zip | Exact | High-cardinality blocking key |
| Address | Token sort | Word order varies ("123 Main St" vs "Main Street 123") |
| Free text | Record embedding | Semantic similarity via sentence-transformers |
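To see why token sort matters for names and addresses, here's a toy comparison using Python's stdlib `difflib` (an illustration only; GoldenMatch itself uses rapidfuzz scorers):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # Plain character-level similarity, 0.0 to 1.0
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    # Sort the words first, so word order stops mattering
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

a, b = "main street 123", "123 main street"
print(ratio(a, b))             # penalized: same words, different order
print(token_sort_ratio(a, b))  # 1.0: identical once tokens are sorted
```

The same idea is why an ensemble helps: each scorer forgives a different kind of variation, and taking the best of several covers misspellings, nicknames, and reordering at once.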

2. Blocking

Comparing every record against every other record is O(n^2). For 100,000 records, that's roughly 5 billion pairwise comparisons. Blocking reduces this to manageable chunks by grouping records that share a key (same zip code, same first 3 characters of the name, same Soundex code) and only comparing within each group.
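As a toy illustration of the idea (not GoldenMatch's internals), blocking on zip code turns one giant comparison set into small per-block pair sets:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Ann Lee",  "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray",  "zip": "94105"},
    {"id": 4, "name": "Rob Ray",  "zip": "94105"},
    {"id": 5, "name": "Cal Ito",  "zip": "60601"},
]

# Group records by the blocking key, then only generate pairs within a block.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(pairs)  # [(1, 2), (3, 4)] -- 2 comparisons instead of C(5,2) = 10
```

The tradeoff is recall: two true duplicates with different zip codes land in different blocks and are never compared, which is why multiple blocking keys are usually combined.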

GoldenMatch has 8 blocking strategies. The most interesting new one is learned blocking: it samples your data, scores pairs, and automatically discovers which predicates give the best recall/reduction tradeoff:

```yaml
blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
```

3. Scoring

Within each block, every pair is scored using vectorized NxN comparison via rapidfuzz.process.cdist. This releases the GIL, so blocks are scored in parallel via a thread pool.
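The shape of that parallelism is easy to sketch. Here blocks are scored independently on a thread pool, with stdlib `difflib` standing in for rapidfuzz's `cdist` (which is what actually releases the GIL and makes threading pay off):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def score_block(names):
    # Score every pair within one block. rapidfuzz.process.cdist does this
    # as a vectorized NxN matrix; this pure-Python stand-in shows the shape.
    return [
        (a, b, SequenceMatcher(None, a, b).ratio())
        for a, b in combinations(names, 2)
    ]

blocks = [["ann lee", "anne lee"], ["bob ray", "rob ray"]]

# Blocks share no state, so they can be scored concurrently.
with ThreadPoolExecutor() as pool:
    scored = [pair for block_result in pool.map(score_block, blocks)
              for pair in block_result]

for a, b, s in scored:
    print(f"{a!r} vs {b!r}: {s:.2f}")
```

With pure-Python scoring the GIL would serialize these threads; it's the C-level scorer releasing the GIL that makes the thread pool an actual speedup.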

For hard cases (product matching), you can add LLM scoring:

```yaml
llm_scorer:
  enabled: true
  model: gpt-4o-mini
  budget:
    max_cost_usd: 0.10
```

This sends borderline pairs (score 0.75-0.95) to GPT-4o-mini for a yes/no decision. On the Abt-Buy product benchmark, this boosts precision from 35% to 95% for $0.04.
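The routing logic is simple to picture. A sketch of the borderline band (my own illustration, with the LLM call itself omitted; the thresholds mirror the 0.75-0.95 range above):

```python
def route_pairs(scored_pairs, low=0.75, high=0.95):
    """Split scored pairs into auto-accept, auto-reject, and borderline.

    Only the borderline bucket would be sent to the LLM, which is what
    keeps the cost to pennies: high-confidence pairs never leave the box.
    """
    accept, reject, borderline = [], [], []
    for pair, score in scored_pairs:
        if score >= high:
            accept.append(pair)
        elif score < low:
            reject.append(pair)
        else:
            borderline.append(pair)
    return accept, reject, borderline

pairs = [(("a1", "a2"), 0.97), (("b1", "b2"), 0.82), (("c1", "c2"), 0.40)]
accept, reject, borderline = route_pairs(pairs)
print(accept, reject, borderline)
# [('a1', 'a2')] [('c1', 'c2')] [('b1', 'b2')]
```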

4. Clustering

Scored pairs are clustered using iterative Union-Find with path compression. Each cluster gets a confidence score (weighted combination of minimum edge, average edge, and connectivity) and a bottleneck pair (the weakest link).
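A minimal Union-Find with path compression shows the core of this step (a sketch, not GoldenMatch's code; the edge scores would also feed the confidence and bottleneck-pair calculations):

```python
def find(parent, x):
    # Find the root, then compress: point every visited node at the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def cluster(n, edges):
    """Union records connected by a scored edge; return the clusters."""
    parent = list(range(n))
    for a, b, _score in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

# Records 0-1-2 chain together via two edges; 3-4 form a second cluster.
edges = [(0, 1, 0.92), (1, 2, 0.88), (3, 4, 0.95)]
print(cluster(6, edges))  # [[0, 1, 2], [3, 4], [5]]
```

Note that 0 and 2 end up in the same cluster without ever being directly compared; the weakest edge on that chain (0.88 here) is what the bottleneck pair surfaces for review.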

5. Golden Records

For each cluster, GoldenMatch creates a golden record using one of 5 merge strategies: most_complete, majority_vote, source_priority, most_recent, or first_non_null.
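As a rough illustration of one of these, here's a simplified take on a most_complete merge (my own sketch, not the library's implementation): rank records by how many fields they fill, then take each field from the most complete record that has it.

```python
def most_complete(records):
    # Rank records by number of non-empty fields, most complete first.
    ranked = sorted(records,
                    key=lambda r: sum(1 for v in r.values() if v),
                    reverse=True)
    golden = {}
    for field in {f for r in records for f in r}:
        # Take the field from the first (most complete) record that has it.
        golden[field] = next((r[field] for r in ranked if r.get(field)), None)
    return golden

cluster = [
    {"name": "Ann Lee",  "email": "",                "phone": "555-0101"},
    {"name": "Anne Lee", "email": "ann@example.com", "phone": ""},
]
print(most_complete(cluster))
```

Here the golden record keeps the first record's name and phone but fills the missing email from the second, which is the whole point of merging instead of just picking a survivor.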

Performance

| Records | Time | Throughput |
|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |

The bottleneck is fuzzy scoring (49% of pipeline time), followed by golden record generation (30%).

The config file

For full control, use a YAML config:

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_address
    type: weighted
    threshold: 0.85
    fields:
      - field: name
        scorer: ensemble
        weight: 1.0
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.5
      - field: phone
        scorer: exact
        weight: 0.3
        transforms: [digits_only]

blocking:
  keys:
    - fields: [zip]
  strategy: adaptive
  max_block_size: 500

golden_rules:
  default_strategy: most_complete
```

Try it

```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv --output-all --output-dir results/
```

GitHub: https://github.com/benzsevern/goldenmatch
PyPI: https://pypi.org/project/goldenmatch/

792 tests, MIT license. Contributions welcome.
