benzsevern
How to Deduplicate 100,000 Records in 13 Seconds with Python

You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product catalog merged from multiple vendors, or academic papers from different databases. You need to find the duplicates, decide which to merge, and produce a clean dataset.

Here's how to do it in one command:

```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv
```

That's the zero-config path. GoldenMatch auto-detects your column types (name, email, phone, zip, address), picks appropriate matching algorithms, chooses a blocking strategy, and launches an interactive TUI where you review the results.

But let's go deeper. I'll walk through what happens under the hood and how to tune it for better results.

What happens when you run goldenmatch dedupe

1. Column Classification

GoldenMatch profiles your data and classifies each column:

| Detected Type | Scorer | Why |
|---|---|---|
| Name | Ensemble (best of Jaro-Winkler, token sort, Soundex) | Handles misspellings, nicknames, word order |
| Email | Exact (after normalization) | Emails are structured identifiers |
| Phone | Exact (digits only) | Strip formatting, compare digits |
| Zip | Exact | High-cardinality blocking key |
| Address | Token sort | Word order varies ("123 Main St" vs "Main Street 123") |
| Free text | Record embedding | Semantic similarity via sentence-transformers |
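To see why token sort matters for names and addresses, here's a toy comparison using Python's stdlib `difflib` (an illustration only; GoldenMatch itself uses rapidfuzz scorers):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # Plain character-level similarity, 0.0 to 1.0
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    # Sort the words first, so word order stops mattering
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

a, b = "main street 123", "123 main street"
print(ratio(a, b))             # penalized: same words, different order
print(token_sort_ratio(a, b))  # 1.0: identical once tokens are sorted
```

The same idea is why an ensemble helps: each scorer forgives a different kind of variation, and taking the best of several covers misspellings, nicknames, and reordering at once.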

2. Blocking

Comparing every record against every other record is O(n^2). For 100,000 records, that's roughly 5 billion pairwise comparisons. Blocking reduces this to manageable chunks by grouping records that share a key (same zip code, same first 3 characters of the name, same Soundex code) and only comparing within each group.
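As a toy illustration of the idea (not GoldenMatch's internals), blocking on zip code turns one giant comparison set into small per-block pair sets:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Ann Lee",  "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray",  "zip": "94105"},
    {"id": 4, "name": "Rob Ray",  "zip": "94105"},
    {"id": 5, "name": "Cal Ito",  "zip": "60601"},
]

# Group records by the blocking key, then only generate pairs within a block.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(pairs)  # [(1, 2), (3, 4)] -- 2 comparisons instead of C(5,2) = 10
```

The tradeoff is recall: two true duplicates with different zip codes land in different blocks and are never compared, which is why multiple blocking keys are usually combined.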

GoldenMatch has 8 blocking strategies. The most interesting new one is learned blocking: it samples your data, scores pairs, and automatically discovers which predicates give the best recall/reduction tradeoff:

```yaml
blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
```

3. Scoring

Within each block, every pair is scored using vectorized NxN comparison via rapidfuzz.process.cdist. This releases the GIL, so blocks are scored in parallel via a thread pool.
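The shape of that parallelism is easy to sketch. Here blocks are scored independently on a thread pool, with stdlib `difflib` standing in for rapidfuzz's `cdist` (which is what actually releases the GIL and makes threading pay off):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def score_block(names):
    # Score every pair within one block. rapidfuzz.process.cdist does this
    # as a vectorized NxN matrix; this pure-Python stand-in shows the shape.
    return [
        (a, b, SequenceMatcher(None, a, b).ratio())
        for a, b in combinations(names, 2)
    ]

blocks = [["ann lee", "anne lee"], ["bob ray", "rob ray"]]

# Blocks share no state, so they can be scored concurrently.
with ThreadPoolExecutor() as pool:
    scored = [pair for block_result in pool.map(score_block, blocks)
              for pair in block_result]

for a, b, s in scored:
    print(f"{a!r} vs {b!r}: {s:.2f}")
```

With pure-Python scoring the GIL would serialize these threads; it's the C-level scorer releasing the GIL that makes the thread pool an actual speedup.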

For hard cases (product matching), you can add LLM scoring:

```yaml
llm_scorer:
  enabled: true
  model: gpt-4o-mini
  budget:
    max_cost_usd: 0.10
```

This sends borderline pairs (score 0.75-0.95) to GPT-4o-mini for a yes/no decision. On the Abt-Buy product benchmark, this boosts precision from 35% to 95% for $0.04.
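The routing logic is simple to picture. A sketch of the borderline band (my own illustration, with the LLM call itself omitted; the thresholds mirror the 0.75-0.95 range above):

```python
def route_pairs(scored_pairs, low=0.75, high=0.95):
    """Split scored pairs into auto-accept, auto-reject, and borderline.

    Only the borderline bucket would be sent to the LLM, which is what
    keeps the cost to pennies: high-confidence pairs never leave the box.
    """
    accept, reject, borderline = [], [], []
    for pair, score in scored_pairs:
        if score >= high:
            accept.append(pair)
        elif score < low:
            reject.append(pair)
        else:
            borderline.append(pair)
    return accept, reject, borderline

pairs = [(("a1", "a2"), 0.97), (("b1", "b2"), 0.82), (("c1", "c2"), 0.40)]
accept, reject, borderline = route_pairs(pairs)
print(accept, reject, borderline)
# [('a1', 'a2')] [('c1', 'c2')] [('b1', 'b2')]
```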

4. Clustering

Scored pairs are clustered using iterative Union-Find with path compression. Each cluster gets a confidence score (weighted combination of minimum edge, average edge, and connectivity) and a bottleneck pair (the weakest link).
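A minimal Union-Find with path compression shows the core of this step (a sketch, not GoldenMatch's code; the edge scores would also feed the confidence and bottleneck-pair calculations):

```python
def find(parent, x):
    # Find the root, then compress: point every visited node at the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def cluster(n, edges):
    """Union records connected by a scored edge; return the clusters."""
    parent = list(range(n))
    for a, b, _score in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

# Records 0-1-2 chain together via two edges; 3-4 form a second cluster.
edges = [(0, 1, 0.92), (1, 2, 0.88), (3, 4, 0.95)]
print(cluster(6, edges))  # [[0, 1, 2], [3, 4], [5]]
```

Note that 0 and 2 end up in the same cluster without ever being directly compared; the weakest edge on that chain (0.88 here) is what the bottleneck pair surfaces for review.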

5. Golden Records

For each cluster, GoldenMatch creates a golden record using one of 5 merge strategies: most_complete, majority_vote, source_priority, most_recent, or first_non_null.
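As a rough illustration of one of these, here's a simplified take on a most_complete merge (my own sketch, not the library's implementation): rank records by how many fields they fill, then take each field from the most complete record that has it.

```python
def most_complete(records):
    # Rank records by number of non-empty fields, most complete first.
    ranked = sorted(records,
                    key=lambda r: sum(1 for v in r.values() if v),
                    reverse=True)
    golden = {}
    for field in {f for r in records for f in r}:
        # Take the field from the first (most complete) record that has it.
        golden[field] = next((r[field] for r in ranked if r.get(field)), None)
    return golden

cluster = [
    {"name": "Ann Lee",  "email": "",                "phone": "555-0101"},
    {"name": "Anne Lee", "email": "ann@example.com", "phone": ""},
]
print(most_complete(cluster))
```

Here the golden record keeps the first record's name and phone but fills the missing email from the second, which is the whole point of merging instead of just picking a survivor.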

Performance

| Records | Time | Throughput |
|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |

The bottleneck is fuzzy scoring (49% of pipeline time), followed by golden record generation (30%).

The config file

For full control, use a YAML config:

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_address
    type: weighted
    threshold: 0.85
    fields:
      - field: name
        scorer: ensemble
        weight: 1.0
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.5
      - field: phone
        scorer: exact
        weight: 0.3
        transforms: [digits_only]

blocking:
  keys:
    - fields: [zip]
  strategy: adaptive
  max_block_size: 500

golden_rules:
  default_strategy: most_complete
```

Try it

```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv --output-all --output-dir results/
```

GitHub: https://github.com/benzsevern/goldenmatch
PyPI: https://pypi.org/project/goldenmatch/

792 tests, MIT license. Contributions welcome.
