You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product catalog merged from multiple vendors, or academic papers from different databases. You need to find the duplicates, decide which to merge, and produce a clean dataset.
Here's how to do it in one command:
```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv
```
That's the zero-config path. GoldenMatch auto-detects your column types (name, email, phone, zip, address), picks appropriate matching algorithms, chooses a blocking strategy, and launches an interactive TUI where you review the results.
But let's go deeper. I'll walk through what happens under the hood and how to tune it for better results.
## What happens when you run `goldenmatch dedupe`

### 1. Column Classification
GoldenMatch profiles your data and classifies each column:
| Detected Type | Scorer | Why |
|---|---|---|
| Name | Ensemble (best of Jaro-Winkler, token sort, soundex) | Handles misspellings, nicknames, word order |
| Email | Exact (after normalization) | Emails are structured identifiers |
| Phone | Exact (digits only) | Strip formatting, compare digits |
| Zip | Exact | High-cardinality blocking key |
| Address | Token sort | Word order varies ("123 Main St" vs "Main Street 123") |
| Free text | Record embedding | Semantic similarity via sentence-transformers |
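To make the "ensemble" row in the table concrete, here is a minimal sketch of a best-of ensemble name scorer. The function names are illustrative (not GoldenMatch's API), and stdlib `difflib.SequenceMatcher` stands in for the rapidfuzz Jaro-Winkler and token-sort scorers the library actually uses; the Soundex implementation follows the standard American Soundex rules.

```python
from difflib import SequenceMatcher

def soundex(word: str) -> str:
    """Basic American Soundex code: first letter plus three digits."""
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    digits = word.translate(str.maketrans("BFPVCGJKQSXZDTLMNR",
                                          "111122222222334556"))
    out = word[0]
    prev = digits[0] if digits[0].isdigit() else ""
    for ch, d in zip(word[1:], digits[1:]):
        if d.isdigit() and d != prev:
            out += d
        if d.isdigit():
            prev = d
        elif ch not in "HW":  # vowels reset the previous code; H/W do not
            prev = ""
    return (out + "000")[:4]

def ensemble_name_score(a: str, b: str) -> float:
    """'Best of' several signals, per the table above (illustrative names)."""
    a, b = a.lower().strip(), b.lower().strip()
    char_sim = SequenceMatcher(None, a, b).ratio()
    token_sim = SequenceMatcher(None, " ".join(sorted(a.split())),
                                " ".join(sorted(b.split()))).ratio()
    phonetic = 1.0 if a and b and soundex(a) == soundex(b) else 0.0
    return max(char_sim, token_sim, phonetic)
```

Taking the max is what makes the ensemble robust: token sort catches word-order swaps, Soundex catches phonetic misspellings, and character similarity catches typos.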
### 2. Blocking
Comparing every record against every other record is O(n^2). For 100,000 records, that's 5 billion comparisons. Blocking reduces this to manageable chunks by grouping records that share a key (same zip code, same first 3 characters of name, same Soundex code).
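A few lines of stdlib Python show the idea. The records and key function below are made up for illustration; only pairs that land in the same block ever get compared.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "zip": "10001"},
    {"id": 2, "name": "Jon Smith",  "zip": "10001"},
    {"id": 3, "name": "Jane Doe",   "zip": "94103"},
    {"id": 4, "name": "J. Doe",     "zip": "94103"},
    {"id": 5, "name": "Bob Stone",  "zip": "60601"},
]

def block_by(records, key):
    """Group records that share a blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

blocks = block_by(records, key=lambda r: r["zip"])
candidate_pairs = [(a["id"], b["id"])
                   for block in blocks.values()
                   for a, b in combinations(block, 2)]
# 5 records -> 10 all-pairs comparisons, but only 2 candidates after blocking
print(candidate_pairs)  # [(1, 2), (3, 4)]
```

Five records is a toy case, but the ratio is what matters: blocking trades a tiny recall risk (true matches split across blocks) for a massive cut in comparisons.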
GoldenMatch has 8 blocking strategies. The most interesting new one is learned blocking -- it samples your data, scores pairs, and automatically discovers which predicates give the best recall/reduction tradeoff:
```yaml
blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
```
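The idea behind learned blocking can be sketched in a few lines: evaluate candidate predicates on a labeled sample, then keep the combination that clears the recall floor with the best reduction. The predicates and sample pairs below are invented for illustration; this is not GoldenMatch's internal API.

```python
# Labeled sample pairs: (record_a, record_b, is_true_match) -- illustrative data.
sample = [
    ({"name": "john smith", "zip": "10001"}, {"name": "jon smith", "zip": "10001"}, True),
    ({"name": "jane doe",   "zip": "94103"}, {"name": "j. doe",    "zip": "94103"}, True),
    ({"name": "ann lee",    "zip": "10001"}, {"name": "anne lee",  "zip": "10002"}, True),
    ({"name": "jane doe",   "zip": "94103"}, {"name": "bob stone", "zip": "60601"}, False),
]

same_zip = lambda a, b: a["zip"] == b["zip"]
name_prefix = lambda a, b: a["name"][:3] == b["name"][:3]
either = lambda a, b: same_zip(a, b) or name_prefix(a, b)

def evaluate(predicate, pairs):
    """Recall: fraction of true matches the predicate keeps.
    Reduction: fraction of candidate pairs it prunes away."""
    kept = [(a, b, m) for a, b, m in pairs if predicate(a, b)]
    total_matches = sum(m for _, _, m in pairs)
    recall = sum(m for _, _, m in kept) / total_matches
    reduction = 1 - len(kept) / len(pairs)
    return recall, reduction

print(evaluate(same_zip, sample))  # recall ~0.67: misses the zip-typo match
print(evaluate(either, sample))    # recall 1.0, at a lower reduction
```

`same_zip` alone fails a 0.95 recall floor here (the "ann lee" pair has a zip typo), so the learner would union in a second predicate, accepting a smaller reduction to keep recall.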
### 3. Scoring
Within each block, every pair is scored with a vectorized NxN comparison via rapidfuzz.process.cdist. cdist releases the GIL, so blocks are scored in parallel on a thread pool.
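The shape of that pipeline, sketched with stdlib stand-ins: `SequenceMatcher` plays the role of rapidfuzz here (rapidfuzz's `process.cdist` builds the full score matrix in C and releases the GIL, which is what makes the thread pool give real parallelism; a pure-Python scorer like this one would not).

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def score_block(names):
    """Score every pair within one block.
    (GoldenMatch uses rapidfuzz.process.cdist here; SequenceMatcher
    is a stdlib stand-in for illustration.)"""
    return [(a, b, SequenceMatcher(None, a, b).ratio())
            for a, b in combinations(names, 2)]

blocks = [
    ["john smith", "jon smith"],
    ["jane doe", "j. doe", "janet doe"],
]

# Score blocks concurrently; each block is an independent unit of work.
with ThreadPoolExecutor() as pool:
    scored = [pair for result in pool.map(score_block, blocks)
              for pair in result]

matches = [(a, b) for a, b, s in scored if s >= 0.8]
print(matches)
```

Each block is independent, so the work partitions cleanly across threads; the threshold (0.8 here) is the knob that trades precision for recall.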
For hard cases (product matching), you can add LLM scoring:
```yaml
llm_scorer:
  enabled: true
  model: gpt-4o-mini
  budget:
    max_cost_usd: 0.10
```
This sends borderline pairs (score 0.75-0.95) to GPT-4o-mini for a yes/no decision. On the Abt-Buy product benchmark, this boosts precision from 35% to 95% for $0.04.
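The routing logic is the interesting part: only the uncertainty band goes to the model, and a cost cap bounds how many calls are made. A hypothetical sketch (the function, band defaults, and per-call cost are my assumptions, not GoldenMatch's internals):

```python
# Assumed per-pair cost for a short gpt-4o-mini yes/no prompt (illustrative).
COST_PER_CALL = 0.0002

def route_pairs(scored_pairs, low=0.75, high=0.95, max_cost_usd=0.10):
    """Split scored pairs into auto-accept, auto-reject, and an LLM queue.
    scored_pairs: iterable of (id_a, id_b, score)."""
    auto_match = [p for p in scored_pairs if p[2] >= high]
    auto_reject = [p for p in scored_pairs if p[2] < low]
    borderline = [p for p in scored_pairs if low <= p[2] < high]
    budget_calls = int(max_cost_usd / COST_PER_CALL)
    to_llm = borderline[:budget_calls]  # overflow falls back to thresholds
    return auto_match, auto_reject, to_llm
```

Clear matches and clear non-matches never touch the API, which is why the benchmark cost stays in the cents: the LLM only adjudicates the pairs fuzzy scoring cannot settle.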
### 4. Clustering
Scored pairs are clustered using iterative Union-Find with path compression. Each cluster gets a confidence score (weighted combination of minimum edge, average edge, and connectivity) and a bottleneck pair (the weakest link).
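Union-Find with path compression fits in a dozen lines; the sketch below also pulls out each cluster's bottleneck pair (its weakest edge). Function names and the toy edge set are illustrative, not GoldenMatch's API.

```python
def find(parent, x):
    """Find the root of x, compressing the path as we walk up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving, a form of path compression
        x = parent[x]
    return x

def cluster(ids, edges):
    """Union-Find over matched pairs; edges maps (a, b) -> score.
    Returns the clusters and each cluster's weakest (bottleneck) edge."""
    parent = {i: i for i in ids}
    for a, b in edges:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    clusters = list(groups.values())
    bottlenecks = []
    for members in clusters:
        inside = {e: s for e, s in edges.items()
                  if e[0] in members and e[1] in members}
        bottlenecks.append(min(inside, key=inside.get) if inside else None)
    return clusters, bottlenecks

edges = {(1, 2): 0.92, (2, 3): 0.81, (4, 5): 0.97}
clusters, bottlenecks = cluster([1, 2, 3, 4, 5], edges)
print(clusters)     # e.g. [[1, 2, 3], [4, 5]]
print(bottlenecks)  # weakest link per cluster, e.g. [(2, 3), (4, 5)]
```

The bottleneck matters for review: records 1 and 3 are only connected through record 2, so the (2, 3) edge at 0.81 is the one a human should eyeball before trusting the cluster.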
### 5. Golden Records
For each cluster, GoldenMatch creates a golden record using one of 5 merge strategies: most_complete, majority_vote, source_priority, most_recent, or first_non_null.
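As a sketch of what most_complete might do (my reading of the strategy name, not GoldenMatch's actual implementation): start from the record with the fewest missing fields, then fill remaining gaps from the rest of the cluster.

```python
def golden_record(cluster):
    """most_complete sketch: base record = fewest empty fields,
    then first-non-null fill for anything still missing."""
    ranked = sorted(cluster,
                    key=lambda r: sum(v in (None, "") for v in r.values()))
    golden = dict(ranked[0])
    for rec in ranked[1:]:
        for k, v in rec.items():
            if golden.get(k) in (None, "") and v not in (None, ""):
                golden[k] = v
    return golden

cluster = [
    {"name": "John Smith", "email": None,               "phone": "555-0100"},
    {"name": "J. Smith",   "email": "john@example.com", "phone": None},
]
print(golden_record(cluster))
# {'name': 'John Smith', 'email': 'john@example.com', 'phone': '555-0100'}
```

The other strategies swap out the ranking: majority_vote tallies values per field, source_priority and most_recent rank by provenance or timestamp, and first_non_null just takes the fill step alone.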
## Performance
| Records | Time | Throughput |
|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |
The bottleneck is fuzzy scoring (49% of pipeline time), followed by golden record generation (30%).
## The config file
For full control, use a YAML config:
```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]
  - name: fuzzy_name_address
    type: weighted
    threshold: 0.85
    fields:
      - field: name
        scorer: ensemble
        weight: 1.0
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.5
      - field: phone
        scorer: exact
        weight: 0.3
        transforms: [digits_only]

blocking:
  keys:
    - fields: [zip]
  strategy: adaptive
  max_block_size: 500

golden_rules:
  default_strategy: most_complete
```
## Try it
```shell
pip install goldenmatch
goldenmatch dedupe your_data.csv --output-all --output-dir results/
```
GitHub: https://github.com/benzsevern/goldenmatch
PyPI: https://pypi.org/project/goldenmatch/
792 tests, MIT license. Contributions welcome.