My CSV had duplicates. A lot of them.
"John Smith" and "Jon Smith" were the same person. So were "john.smith@gmail.com" and "jsmith@gmail.com". And "(555) 012-3456" and "5550123456".
I didn't want to write 60 lines of Python to find them. So I built a tool that does it in one command.
pip install goldenmatch
goldenmatch dedupe customers.csv
That's it. No config file. No training data. No manual labeling.
GoldenMatch reads your CSV, figures out which columns are names, emails, phones, and addresses, picks the best matching algorithm for each, and clusters the duplicates. On 100,000 records, it finishes in 12.78 seconds.
What Just Happened?
When you run goldenmatch dedupe, here's the pipeline:
Read File → Auto-Detect Columns → Pick Scorers → Block → Score → Cluster → Golden Records
Auto-detection looks at column names and data patterns. A column called "email" with values containing @ gets routed to exact + Levenshtein matching. A column called "name" gets Jaro-Winkler + token sort. Phone numbers get normalized and compared numerically.
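To make the routing concrete, here is a minimal sketch of header-plus-pattern detection. The column names, regexes, and return labels are illustrative assumptions, not GoldenMatch's actual rules:

```python
import re

# Hypothetical detection rules -- illustrative only, not GoldenMatch's internals.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")

def detect_column_type(name: str, samples: list[str]) -> str:
    """Guess a semantic type from the header plus a sample of values."""
    name = name.lower()
    filled = [s for s in samples if s]
    if "email" in name or (filled and all(EMAIL_RE.match(s) for s in filled)):
        return "email"   # route to exact + Levenshtein
    if "phone" in name or (filled and all(PHONE_RE.match(s) for s in filled)):
        return "phone"   # normalize digits, compare numerically
    if "name" in name:
        return "name"    # route to Jaro-Winkler + token sort
    return "generic"

print(detect_column_type("contact", ["(555) 012-3456", "5550123456"]))  # phone
```

Even when the header is unhelpful ("contact"), the value patterns give it away.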
Blocking reduces the comparison space. Instead of comparing every record against every other record (100K records = 5 billion pairs), it groups records by shared attributes — same zip code, same first letter of last name. This cuts candidates from billions to thousands.
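The idea is easy to sketch: only records sharing a block key ever get compared. This toy version (my own example data and key function, mirroring the zip-code and first-letter examples above) shows the candidate count dropping:

```python
from collections import defaultdict
from itertools import combinations

# Illustrative records -- not real data.
records = [
    {"id": 1, "last": "Smith", "zip": "10001"},
    {"id": 2, "last": "Smyth", "zip": "10001"},
    {"id": 3, "last": "Jones", "zip": "94105"},
    {"id": 4, "last": "Jonas", "zip": "94105"},
]

def block(records, key_fn):
    """Group records by a block key; only pairs inside a bucket are candidates."""
    buckets = defaultdict(list)
    for r in records:
        buckets[key_fn(r)].append(r)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

# Block on (zip code, first letter of last name).
pairs = list(block(records, lambda r: (r["zip"], r["last"][0])))
print(len(pairs))  # 2 candidate pairs instead of 6 all-pairs comparisons
```

At 100K records the same trick is what turns 5 billion pairs into thousands.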
Scoring runs the actual fuzzy matching. GoldenMatch has 10+ methods: Jaro-Winkler, Levenshtein, token sort, soundex, exact, ensemble, sentence embeddings, record embeddings, Dice, and Jaccard. It picks the best one per column automatically.
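As a feel for what one of these scorers does, here is a token-sort comparison built on the standard library's `difflib` as a stand-in (GoldenMatch uses RapidFuzz's C++ implementations, which are much faster):

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> float:
    """Compare strings after sorting their tokens, so word order doesn't matter."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio()

print(token_sort_similarity("john smith", "smith john"))  # 1.0 -- order-insensitive
```

Plain Levenshtein would punish "smith john" vs "john smith" heavily; token sort sees them as identical, which is exactly why it suits name columns.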
Clustering groups scored pairs into connected components. If A matches B and B matches C, all three land in the same cluster.
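That transitive behavior is classic union-find. A minimal sketch (my own implementation, not GoldenMatch's):

```python
def cluster(pairs, ids):
    """Group ids into connected components given matched pairs."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

print(cluster([("A", "B"), ("B", "C")], ["A", "B", "C", "D"]))
# [['A', 'B', 'C'], ['D']]
```

A never matched C directly, but B bridges them, so all three land in one cluster while D stays alone.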
Golden records merge each cluster into one canonical row — picking the most complete value for each field.
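A "most complete" merge can be sketched as: for each field, keep the longest non-empty value seen across the cluster. This is my own simplification; the real strategy may weigh completeness differently:

```python
def golden_record(cluster: list[dict]) -> dict:
    """Merge a cluster into one row, keeping the longest non-empty value per field."""
    fields = {k for row in cluster for k in row}
    return {
        f: max((row.get(f) or "" for row in cluster), key=len)
        for f in fields
    }

rows = [
    {"name": "J. Smith", "phone": "", "email": "jsmith@gmail.com"},
    {"name": "John Smith", "phone": "5550123456", "email": ""},
]
print(golden_record(rows))
# name from row 2, phone from row 2, email from row 1 -- best of both
```

Neither source row was complete, but the merged record is.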
Show Me the Numbers
I ran GoldenMatch against the DBLP-ACM benchmark, a well-known dataset in entity resolution research. 2,616 records from DBLP and 2,294 from ACM. 2,224 confirmed matching pairs.
goldenmatch match DBLP2.csv ACM.csv
goldenmatch evaluate --ground-truth perfectMapping.csv
Results:
| Metric | Score |
|---|---|
| Precision | 98.1% |
| Recall | 96.3% |
| F1 | 97.2% |
| Time | 3.8 seconds |
| Config | None |
Published research papers on this exact dataset typically report 94-98% F1 with hand-tuned configurations. GoldenMatch hits 97.2% with zero config.
Throughput
| Records | Time | Speed |
|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |
Built on Polars (Rust-backed DataFrames) and RapidFuzz (C++ string matching). No pandas. No scikit-learn. That's why it's fast.
The Full CLI
GoldenMatch isn't just dedupe. There are 20 commands in total; here's a sampling:
goldenmatch dedupe data.csv # Deduplicate one file
goldenmatch match a.csv --against b.csv # Link records across two files
goldenmatch profile data.csv # Data quality report
goldenmatch interactive data.csv # Launch the TUI
goldenmatch demo # Try it with sample data
goldenmatch evaluate --ground-truth gt.csv # Benchmark against truth
goldenmatch sync --table customers # Live database dedup
goldenmatch watch --table customers # CDC mode: match new records continuously
goldenmatch serve # REST API for real-time matching
goldenmatch mcp-serve # Claude Desktop integration
goldenmatch pprl link -a hosp1.csv -b hosp2.csv # Privacy-preserving matching
goldenmatch setup # Configure GPU, API keys, database
Adding a Config (When You Want Control)
Zero-config is the starting point. When you want to tune:
# goldenmatch.yaml
matchkeys:
  - name: name_match
    fields:
      - column: first_name
        scorer: jaro_winkler
        weight: 0.35
      - column: last_name
        scorer: jaro_winkler
        weight: 0.35
  - name: email_match
    fields:
      - column: email
        scorer: exact
        weight: 0.20
  - name: phone_match
    fields:
      - column: phone
        scorer: numeric
        weight: 0.10

blocking:
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: ["substring:0:3"]

threshold: 0.82

golden_rules:
  default_strategy: most_complete
goldenmatch dedupe customers.csv -c goldenmatch.yaml
The Interactive TUI
For exploring your data before committing to a pipeline:
goldenmatch interactive customers.csv
Six tabs:
- Data — column profiles, null rates, anomaly detection
- Config — edit match keys, blocking, threshold
- Matches — split-view cluster browser. Select a cluster on the left, see every member on the right
- Golden — merged canonical records with per-field confidence
- Boost — active learning. Label borderline pairs to improve accuracy
- Export — CSV, Parquet, or JSON
The threshold slider adjusts in real time with arrow keys. Lower it and watch more clusters appear. Raise it and marginal matches split apart.
LLM Scoring: $0.04 to Fix the Hard Cases
Every matching pipeline has a gray zone — pairs scoring between 0.70 and 0.85, where the algorithm isn't confident. That zone is where your false positives and false negatives live.
goldenmatch dedupe products.csv --llm
GoldenMatch identifies borderline pairs and sends them to GPT-4o-mini (or Claude Haiku) for a judgment call. The LLM sees context that string matching can't — it knows "IBM" and "International Business Machines" are the same company.
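The cost math follows from escalating only the gray zone. A sketch of that triage step (my own function and band values, matching the 0.70–0.85 range above; the actual LLM call is omitted):

```python
def split_by_confidence(scored_pairs, low=0.70, high=0.85):
    """Partition scored pairs: only the uncertain middle band goes to the LLM."""
    auto_match, auto_reject, borderline = [], [], []
    for pair, score in scored_pairs:
        if score >= high:
            auto_match.append(pair)      # confident match, no API call
        elif score <= low:
            auto_reject.append(pair)     # confident non-match, no API call
        else:
            borderline.append(pair)      # only these are sent for a judgment call
    return auto_match, auto_reject, borderline

scored = [(("A", "B"), 0.95), (("C", "D"), 0.78), (("E", "F"), 0.40)]
match, reject, llm = split_by_confidence(scored)
print(len(llm))  # 1 -- only a sliver of pairs ever reaches the model
```

Because the gray zone is typically a small fraction of all candidate pairs, the token bill stays in cents.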
On the Abt-Buy product benchmark:
| Mode | F1 | Cost |
|---|---|---|
| Baseline | 44.5% | $0 |
| + LLM boost | 72.2% | $0.04 |
It only scores the borderline pairs, not the entire dataset. That's why it costs pennies.
Set a budget cap: --llm-budget 0.50
Privacy-Preserving Matching (PPRL)
Two hospitals need to match patient records. HIPAA says they can't share the data.
goldenmatch pprl link \
  --file-a hospital_a.csv \
  --file-b hospital_b.csv \
  --fields first_name,last_name,dob,zip \
  --security high
Each party converts its records into Bloom filters locally — hashed bit arrays that can't feasibly be reversed. The coordinator matches the encoded fingerprints using Dice similarity and returns cluster IDs.
92.4% F1 on the FEBRL4 benchmark. Zero raw data exchanged. Three security levels: standard, high, paranoid.
vs. the Dedupe Library
If you've used the Python dedupe library, here's the difference:
| | Dedupe | GoldenMatch |
|---|---|---|
| Setup | ~60 lines of Python | 1 command |
| Training | Manual labeling (50+ pairs) | None required |
| F1 (DBLP-ACM) | ~93% | 97.2% |
| Speed (10K records) | 30-60 seconds | 2-4 seconds |
| Golden records | No | Yes (5 merge strategies) |
| PPRL | No | Yes |
| LLM scoring | No | Yes |
| TUI | No | Yes |
| REST API | No | Yes |
| Features | 4 | 16+ |
Dedupe is a good library. GoldenMatch just does more with less effort.
Quick Start
# Install
pip install goldenmatch
# Try the demo
goldenmatch demo
# Deduplicate your file
goldenmatch dedupe your_data.csv
# With golden records
goldenmatch dedupe your_data.csv --output-golden
# Profile data quality first
goldenmatch profile your_data.csv
# Launch the TUI
goldenmatch interactive your_data.csv
GitHub: github.com/benzsevern/goldenmatch
License: MIT
Tests: 924 passing
Python: 3.11+
What's Next
- Database sync with Postgres, DuckDB, Snowflake, BigQuery
- 7 domain packs pre-configured for electronics, healthcare, finance, people, software, real estate, retail
- MCP server for Claude Desktop integration
- Scheduled runs with cron syntax
- Ray backend for distributed processing at 10M+ scale
If you've ever spent a week writing a deduplication pipeline, try goldenmatch dedupe data.csv and see what happens in 12 seconds.
GoldenMatch is open source and MIT licensed. Star it on GitHub if it saves you time.