benzsevern
I Deduplicated 100K Records in 12 Seconds With One Command

My CSV had duplicates. A lot of them.

"John Smith" and "Jon Smith" were the same person. So were "john.smith@gmail.com" and "jsmith@gmail.com." And "(555) 012-3456" and "5550123456."

I didn't want to write 60 lines of Python to find them. So I built a tool that does it in one command.

pip install goldenmatch
goldenmatch dedupe customers.csv

That's it. No config file. No training data. No manual labeling.

GoldenMatch reads your CSV, figures out which columns are names, emails, phones, and addresses, picks the best matching algorithm for each, and clusters the duplicates. On 100,000 records, it finishes in 12.78 seconds.


What Just Happened?

When you run goldenmatch dedupe, here's the pipeline:

Read File → Auto-Detect Columns → Pick Scorers → Block → Score → Cluster → Golden Records

Auto-detection looks at column names and data patterns. A column called "email" with values containing @ gets routed to exact + Levenshtein matching. A column called "name" gets Jaro-Winkler + token sort. Phone numbers get normalized and compared numerically.
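To make the idea concrete, here's a minimal sketch of header-plus-pattern detection. This is my illustration, not GoldenMatch's actual code; the function name and heuristics are assumptions, and the real tool's detection is surely more elaborate.

```python
import re

def detect_column_type(name, samples):
    """Guess a semantic column type from the header name and sample values."""
    name = name.lower()
    if "email" in name or all("@" in s for s in samples):
        return "email"   # route to exact + Levenshtein
    if "phone" in name or all(re.fullmatch(r"[\d\s()+.-]{7,}", s) for s in samples):
        return "phone"   # normalize digits, compare numerically
    if any(k in name for k in ("name", "first", "last")):
        return "name"    # route to Jaro-Winkler + token sort
    return "text"

print(detect_column_type("email", ["a@x.com", "b@y.org"]))  # email
print(detect_column_type("first_name", ["John", "Jon"]))    # name
```

The point is that two cheap signals (header keywords and value patterns) agree often enough that you rarely need to declare column types by hand.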

Blocking reduces the comparison space. Instead of comparing every record against every other record (100K records = 5 billion pairs), it groups records by shared attributes — same zip code, same first letter of last name. This cuts candidates from billions to thousands.
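A toy version of blocking, assuming made-up record data, shows why this works: records that share no blocking key are never compared at all.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "last_name": "Smith", "zip": "10001"},
    {"id": 2, "last_name": "Smith", "zip": "10001"},
    {"id": 3, "last_name": "Smyth", "zip": "10001"},
    {"id": 4, "last_name": "Jones", "zip": "94107"},
]

def block(records, key):
    """Group records by a blocking key; only pairs within a group are candidates."""
    buckets = defaultdict(list)
    for r in records:
        buckets[key(r)].append(r["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Same zip OR same first letter of last name: union of candidate pairs
candidates = block(records, lambda r: r["zip"]) | block(records, lambda r: r["last_name"][0])
print(sorted(candidates))  # [(1, 2), (1, 3), (2, 3)] -- record 4 is never compared

# Without blocking, n records mean n*(n-1)/2 comparisons:
print(100_000 * 99_999 // 2)  # 4,999,950,000 pairs for 100K records
```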

Scoring runs the actual fuzzy matching. GoldenMatch has 10+ methods: Jaro-Winkler, Levenshtein, token sort, soundex, exact, ensemble, sentence embeddings, record embeddings, Dice, and Jaccard. It picks the best one per column automatically.
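To see why token sort matters alongside plain edit similarity, here's a sketch using the standard library's `difflib` as a stand-in (GoldenMatch itself uses RapidFuzz, which is much faster; the function names here are mine):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Plain character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a, b):
    """Sort the words first, so reordered names still match."""
    return ratio(" ".join(sorted(a.lower().split())),
                 " ".join(sorted(b.lower().split())))

print(round(ratio("John Smith", "Jon Smith"), 2))   # 0.95
print(token_sort_ratio("Smith John", "John Smith")) # 1.0
```

Typos stay high-scoring under plain similarity, while swapped word order only survives token sort. Picking the right scorer per column is what auto-detection buys you.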

Clustering groups scored pairs into connected components. If A matches B and B matches C, all three land in the same cluster.
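Connected components over matched pairs can be computed with a small union-find; this is a generic sketch of the transitive-grouping step, not GoldenMatch's internals:

```python
def cluster(pairs):
    """Union-find: transitively merge matched pairs into clusters."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(map(sorted, groups.values()))

print(cluster([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```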

Golden records merge each cluster into one canonical row — picking the most complete value for each field.
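A minimal sketch of a "most complete" merge, assuming rows as dicts (the real tool offers several merge strategies; this shows just one):

```python
def golden_record(cluster_rows):
    """Per field, keep the longest non-empty value seen in the cluster."""
    fields = cluster_rows[0].keys()
    return {f: max((r.get(f) or "" for r in cluster_rows), key=len) for f in fields}

rows = [
    {"name": "J. Smith",   "email": "",                     "phone": "5550123456"},
    {"name": "John Smith", "email": "john.smith@gmail.com", "phone": ""},
]
print(golden_record(rows))
# {'name': 'John Smith', 'email': 'john.smith@gmail.com', 'phone': '5550123456'}
```

Note that the golden row can be more complete than any single source row, since each field is picked independently.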


Show Me the Numbers

I ran GoldenMatch against the DBLP-ACM benchmark, a well-known dataset in entity resolution research. 2,616 records from DBLP and 2,294 from ACM. 2,224 confirmed matching pairs.

goldenmatch match DBLP2.csv ACM.csv
goldenmatch evaluate --ground-truth perfectMapping.csv

Results:

| Metric | Score |
| --- | --- |
| Precision | 98.1% |
| Recall | 96.3% |
| F1 | 97.2% |
| Time | 3.8 seconds |
| Config | None |

Published research papers on this exact dataset typically report 94-98% F1 with hand-tuned configurations. GoldenMatch hits 97.2% with zero config.

Throughput

| Records | Time | Speed |
| --- | --- | --- |
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |

Built on Polars (Rust-backed DataFrames) and RapidFuzz (C++ string matching). No pandas. No scikit-learn. That's why it's fast.


The Full CLI

GoldenMatch isn't just dedupe. There are 20 commands; here's a sampling:

goldenmatch dedupe data.csv              # Deduplicate one file
goldenmatch match a.csv --against b.csv  # Link records across two files
goldenmatch profile data.csv             # Data quality report
goldenmatch interactive data.csv         # Launch the TUI
goldenmatch demo                         # Try it with sample data
goldenmatch evaluate --ground-truth gt.csv  # Benchmark against truth
goldenmatch sync --table customers       # Live database dedup
goldenmatch watch --table customers      # CDC mode: match new records continuously
goldenmatch serve                        # REST API for real-time matching
goldenmatch mcp-serve                    # Claude Desktop integration
goldenmatch pprl link -a hosp1.csv -b hosp2.csv  # Privacy-preserving matching
goldenmatch setup                        # Configure GPU, API keys, database

Adding a Config (When You Want Control)

Zero-config is the starting point. When you want to tune:

# goldenmatch.yaml
matchkeys:
  - name: name_match
    fields:
      - column: first_name
        scorer: jaro_winkler
        weight: 0.35
      - column: last_name
        scorer: jaro_winkler
        weight: 0.35
  - name: email_match
    fields:
      - column: email
        scorer: exact
        weight: 0.20
  - name: phone_match
    fields:
      - column: phone
        scorer: numeric
        weight: 0.10

blocking:
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: ["substring:0:3"]

threshold: 0.82
golden_rules:
  default_strategy: most_complete
goldenmatch dedupe customers.csv -c goldenmatch.yaml

The Interactive TUI

For exploring your data before committing to a pipeline:

goldenmatch interactive customers.csv

Six tabs:

  • Data — column profiles, null rates, anomaly detection
  • Config — edit match keys, blocking, threshold
  • Matches — split-view cluster browser. Select a cluster on the left, see every member on the right
  • Golden — merged canonical records with per-field confidence
  • Boost — active learning. Label borderline pairs to improve accuracy
  • Export — CSV, Parquet, or JSON

The threshold slider adjusts in real time with arrow keys. Lower it and watch more clusters appear. Raise it and marginal matches split apart.


LLM Scoring: $0.04 to Fix the Hard Cases

Every matching pipeline has a gray zone — pairs scoring between 0.70 and 0.85 where the algorithm isn't confident. These are where your false positives and false negatives live.

goldenmatch dedupe products.csv --llm

GoldenMatch identifies borderline pairs and sends them to GPT-4o-mini (or Claude Haiku) for a judgment call. The LLM sees context that string matching can't — it knows "IBM" and "International Business Machines" are the same company.
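The routing logic is simple enough to sketch. This is my illustration of the general pattern, with illustrative thresholds and a placeholder where the API call would go:

```python
def route_pairs(scored_pairs, low=0.70, high=0.85):
    """Split scored pairs: auto-accept, auto-reject, or escalate to an LLM.

    Only the narrow middle band is escalated, which keeps the API bill small.
    """
    accept = [p for p, s in scored_pairs if s >= high]
    reject = [p for p, s in scored_pairs if s < low]
    to_llm = [p for p, s in scored_pairs if low <= s < high]
    return accept, reject, to_llm

pairs = [
    (("IBM", "International Business Machines"), 0.78),
    (("Apple Inc", "Apple Inc."), 0.97),
    (("Sony", "Samsung"), 0.41),
]
accept, reject, to_llm = route_pairs(pairs)
print(to_llm)  # [('IBM', 'International Business Machines')] -- only this pair costs money
```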

On the Abt-Buy product benchmark:

| Mode | F1 | Cost |
| --- | --- | --- |
| Baseline | 44.5% | $0 |
| + LLM boost | 72.2% | $0.04 |

It only scores the borderline pairs, not the entire dataset. That's why it costs pennies.

Set a budget cap: --llm-budget 0.50


Privacy-Preserving Matching (PPRL)

Two hospitals need to match patient records. HIPAA says they can't share the data.

goldenmatch pprl link \
  --file-a hospital_a.csv \
  --file-b hospital_b.csv \
  --fields first_name,last_name,dob,zip \
  --security high

Each party converts its records into Bloom filters locally: hashed bit arrays that can't feasibly be reversed. The coordinator matches the encoded fingerprints using Dice similarity and returns cluster IDs.
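Here's a simplified sketch of the encoding-and-compare idea. Real PPRL schemes use fixed-length bit arrays with keyed, salted hashes agreed between parties; a plain set of bit indices stands in for the bit array here, and all names and parameters are illustrative:

```python
import hashlib

def bloom(value, size=256, hashes=4):
    """Encode a string's character bigrams into a Bloom-filter bit set."""
    bits = set()
    grams = [value[i:i + 2] for i in range(len(value) - 1)]
    for g in grams:
        for k in range(hashes):
            h = int(hashlib.sha256(f"{k}:{g}".encode()).hexdigest(), 16)
            bits.add(h % size)
    return bits

def dice(a, b):
    """Dice coefficient on set bits -- computed on encodings, never raw data."""
    return 2 * len(a & b) / (len(a) + len(b))

# Each hospital encodes locally; only the bit sets are shared.
a = bloom("john smith 1980-01-02")
b = bloom("jon smith 1980-01-02")
c = bloom("mary jones 1975-06-30")
print(round(dice(a, b), 2), round(dice(a, c), 2))  # similar pair scores high, unrelated pair low
```

Shared bigrams set many of the same bits, so near-duplicate records score high on Dice similarity even though neither side ever sees the other's plaintext.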

92.4% F1 on the FEBRL4 benchmark. Zero raw data exchanged. Three security levels: standard, high, paranoid.


vs. the Dedupe Library

If you've used the Python dedupe library, here's the difference:

| | Dedupe | GoldenMatch |
| --- | --- | --- |
| Setup | ~60 lines of Python | 1 command |
| Training | Manual labeling (50+ pairs) | None required |
| F1 (DBLP-ACM) | ~93% | 97.2% |
| Speed (10K records) | 30-60 seconds | 2-4 seconds |
| Golden records | No | Yes (5 merge strategies) |
| PPRL | No | Yes |
| LLM scoring | No | Yes |
| TUI | No | Yes |
| REST API | No | Yes |
| Features | 4 | 16+ |

Dedupe is a good library. GoldenMatch just does more with less effort.


Quick Start

# Install
pip install goldenmatch

# Try the demo
goldenmatch demo

# Deduplicate your file
goldenmatch dedupe your_data.csv

# With golden records
goldenmatch dedupe your_data.csv --output-golden

# Profile data quality first
goldenmatch profile your_data.csv

# Launch the TUI
goldenmatch interactive your_data.csv

GitHub: github.com/benzsevern/goldenmatch
License: MIT
Tests: 924 passing
Python: 3.11+


What's Next

  • Database sync with Postgres, DuckDB, Snowflake, BigQuery
  • 7 domain packs pre-configured for electronics, healthcare, finance, people, software, real estate, retail
  • MCP server for Claude Desktop integration
  • Scheduled runs with cron syntax
  • Ray backend for distributed processing at 10M+ scale

If you've ever spent a week writing a deduplication pipeline, try goldenmatch dedupe data.csv and see what happens in 12 seconds.


GoldenMatch is open source and MIT licensed. Star it on GitHub if it saves you time.
