benzsevern
I Deduplicated 100K Records in 12 Seconds With One Command

My CSV had duplicates. A lot of them.

"John Smith" and "Jon Smith" were the same person. So were "john.smith@gmail.com" and "jsmith@gmail.com." And "(555) 012-3456" and "5550123456."

I didn't want to write 60 lines of Python to find them. So I built a tool that does it in one command.

pip install goldenmatch
goldenmatch dedupe customers.csv

That's it. No config file. No training data. No manual labeling.

GoldenMatch reads your CSV, figures out which columns are names, emails, phones, and addresses, picks the best matching algorithm for each, and clusters the duplicates. On 100,000 records, it finishes in 12.78 seconds.


What Just Happened?

When you run goldenmatch dedupe, here's the pipeline:

Read File → Auto-Detect Columns → Pick Scorers → Block → Score → Cluster → Golden Records

Auto-detection looks at column names and data patterns. A column called "email" with values containing @ gets routed to exact + Levenshtein matching. A column called "name" gets Jaro-Winkler + token sort. Phone numbers get normalized and compared numerically.
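To make the idea concrete, here's a minimal sketch of header-plus-pattern detection. This is my illustration, not GoldenMatch's actual code; the function name and heuristics are assumptions, and the real tool's detection is surely more elaborate.

```python
import re

def detect_column_type(name, samples):
    """Guess a semantic column type from the header name and sample values."""
    name = name.lower()
    if "email" in name or all("@" in s for s in samples):
        return "email"   # route to exact + Levenshtein
    if "phone" in name or all(re.fullmatch(r"[\d\s()+.-]{7,}", s) for s in samples):
        return "phone"   # normalize digits, compare numerically
    if any(k in name for k in ("name", "first", "last")):
        return "name"    # route to Jaro-Winkler + token sort
    return "text"

print(detect_column_type("email", ["a@x.com", "b@y.org"]))  # email
print(detect_column_type("first_name", ["John", "Jon"]))    # name
```

The point is that two cheap signals (header keywords and value patterns) agree often enough that you rarely need to declare column types by hand.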

Blocking reduces the comparison space. Instead of comparing every record against every other record (100K records = 5 billion pairs), it groups records by shared attributes — same zip code, same first letter of last name. This cuts candidates from billions to thousands.
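A toy version of blocking, assuming made-up record data, shows why this works: records that share no blocking key are never compared at all.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "last_name": "Smith", "zip": "10001"},
    {"id": 2, "last_name": "Smith", "zip": "10001"},
    {"id": 3, "last_name": "Smyth", "zip": "10001"},
    {"id": 4, "last_name": "Jones", "zip": "94107"},
]

def block(records, key):
    """Group records by a blocking key; only pairs within a group are candidates."""
    buckets = defaultdict(list)
    for r in records:
        buckets[key(r)].append(r["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Same zip OR same first letter of last name: union of candidate pairs
candidates = block(records, lambda r: r["zip"]) | block(records, lambda r: r["last_name"][0])
print(sorted(candidates))  # [(1, 2), (1, 3), (2, 3)] -- record 4 is never compared

# Without blocking, n records mean n*(n-1)/2 comparisons:
print(100_000 * 99_999 // 2)  # 4,999,950,000 pairs for 100K records
```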

Scoring runs the actual fuzzy matching. GoldenMatch has 10+ methods: Jaro-Winkler, Levenshtein, token sort, soundex, exact, ensemble, sentence embeddings, record embeddings, Dice, and Jaccard. It picks the best one per column automatically.
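To see why token sort matters alongside plain edit similarity, here's a sketch using the standard library's `difflib` as a stand-in (GoldenMatch itself uses RapidFuzz, which is much faster; the function names here are mine):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Plain character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a, b):
    """Sort the words first, so reordered names still match."""
    return ratio(" ".join(sorted(a.lower().split())),
                 " ".join(sorted(b.lower().split())))

print(round(ratio("John Smith", "Jon Smith"), 2))   # 0.95
print(token_sort_ratio("Smith John", "John Smith")) # 1.0
```

Typos stay high-scoring under plain similarity, while swapped word order only survives token sort. Picking the right scorer per column is what auto-detection buys you.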

Clustering groups scored pairs into connected components. If A matches B and B matches C, all three land in the same cluster.
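Connected components over matched pairs can be computed with a small union-find; this is a generic sketch of the transitive-grouping step, not GoldenMatch's internals:

```python
def cluster(pairs):
    """Union-find: transitively merge matched pairs into clusters."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(map(sorted, groups.values()))

print(cluster([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```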

Golden records merge each cluster into one canonical row — picking the most complete value for each field.
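A minimal sketch of a "most complete" merge, assuming rows as dicts (the real tool offers several merge strategies; this shows just one):

```python
def golden_record(cluster_rows):
    """Per field, keep the longest non-empty value seen in the cluster."""
    fields = cluster_rows[0].keys()
    return {f: max((r.get(f) or "" for r in cluster_rows), key=len) for f in fields}

rows = [
    {"name": "J. Smith",   "email": "",                     "phone": "5550123456"},
    {"name": "John Smith", "email": "john.smith@gmail.com", "phone": ""},
]
print(golden_record(rows))
# {'name': 'John Smith', 'email': 'john.smith@gmail.com', 'phone': '5550123456'}
```

Note that the golden row can be more complete than any single source row, since each field is picked independently.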


Show Me the Numbers

I ran GoldenMatch against the DBLP-ACM benchmark, a well-known dataset in entity resolution research. 2,616 records from DBLP and 2,294 from ACM. 2,224 confirmed matching pairs.

goldenmatch match DBLP2.csv ACM.csv
goldenmatch evaluate --ground-truth perfectMapping.csv

Results:

| Metric | Score |
| --- | --- |
| Precision | 98.1% |
| Recall | 96.3% |
| F1 | 97.2% |
| Time | 3.8 seconds |
| Config | None |

Published research papers on this exact dataset typically report 94-98% F1 with hand-tuned configurations. GoldenMatch hits 97.2% with zero config.

Throughput

| Records | Time | Speed |
| --- | --- | --- |
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |

Built on Polars (Rust-backed DataFrames) and RapidFuzz (C++ string matching). No pandas. No scikit-learn. That's why it's fast.


The Full CLI

GoldenMatch isn't just dedupe. There are 20 commands; here's a sampling:

goldenmatch dedupe data.csv              # Deduplicate one file
goldenmatch match a.csv --against b.csv  # Link records across two files
goldenmatch profile data.csv             # Data quality report
goldenmatch interactive data.csv         # Launch the TUI
goldenmatch demo                         # Try it with sample data
goldenmatch evaluate --ground-truth gt.csv  # Benchmark against truth
goldenmatch sync --table customers       # Live database dedup
goldenmatch watch --table customers      # CDC mode: match new records continuously
goldenmatch serve                        # REST API for real-time matching
goldenmatch mcp-serve                    # Claude Desktop integration
goldenmatch pprl link -a hosp1.csv -b hosp2.csv  # Privacy-preserving matching
goldenmatch setup                        # Configure GPU, API keys, database

Adding a Config (When You Want Control)

Zero-config is the starting point. When you want to tune:

# goldenmatch.yaml
matchkeys:
  - name: name_match
    fields:
      - column: first_name
        scorer: jaro_winkler
        weight: 0.35
      - column: last_name
        scorer: jaro_winkler
        weight: 0.35
  - name: email_match
    fields:
      - column: email
        scorer: exact
        weight: 0.20
  - name: phone_match
    fields:
      - column: phone
        scorer: numeric
        weight: 0.10

blocking:
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: ["substring:0:3"]

threshold: 0.82
golden_rules:
  default_strategy: most_complete
goldenmatch dedupe customers.csv -c goldenmatch.yaml

The Interactive TUI

For exploring your data before committing to a pipeline:

goldenmatch interactive customers.csv

Six tabs:

  • Data — column profiles, null rates, anomaly detection
  • Config — edit match keys, blocking, threshold
  • Matches — split-view cluster browser. Select a cluster on the left, see every member on the right
  • Golden — merged canonical records with per-field confidence
  • Boost — active learning. Label borderline pairs to improve accuracy
  • Export — CSV, Parquet, or JSON

The threshold slider adjusts in real time with arrow keys. Lower it and watch more clusters appear. Raise it and marginal matches split apart.


LLM Scoring: $0.04 to Fix the Hard Cases

Every matching pipeline has a gray zone — pairs scoring between 0.70 and 0.85 where the algorithm isn't confident. These are where your false positives and false negatives live.

goldenmatch dedupe products.csv --llm

GoldenMatch identifies borderline pairs and sends them to GPT-4o-mini (or Claude Haiku) for a judgment call. The LLM sees context that string matching can't — it knows "IBM" and "International Business Machines" are the same company.
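The routing logic is simple enough to sketch. This is my illustration of the general pattern, with illustrative thresholds and a placeholder where the API call would go:

```python
def route_pairs(scored_pairs, low=0.70, high=0.85):
    """Split scored pairs: auto-accept, auto-reject, or escalate to an LLM.

    Only the narrow middle band is escalated, which keeps the API bill small.
    """
    accept = [p for p, s in scored_pairs if s >= high]
    reject = [p for p, s in scored_pairs if s < low]
    to_llm = [p for p, s in scored_pairs if low <= s < high]
    return accept, reject, to_llm

pairs = [
    (("IBM", "International Business Machines"), 0.78),
    (("Apple Inc", "Apple Inc."), 0.97),
    (("Sony", "Samsung"), 0.41),
]
accept, reject, to_llm = route_pairs(pairs)
print(to_llm)  # [('IBM', 'International Business Machines')] -- only this pair costs money
```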

On the Abt-Buy product benchmark:

| Mode | F1 | Cost |
| --- | --- | --- |
| Baseline | 44.5% | $0 |
| + LLM boost | 72.2% | $0.04 |

It only scores the borderline pairs, not the entire dataset. That's why it costs pennies.

Set a budget cap: --llm-budget 0.50


Privacy-Preserving Matching (PPRL)

Two hospitals need to match patient records. HIPAA says they can't share the data.

goldenmatch pprl link \
  --file-a hospital_a.csv \
  --file-b hospital_b.csv \
  --fields first_name,last_name,dob,zip \
  --security high

Each party converts its records into Bloom filters locally: hashed bit arrays that can't feasibly be reversed. The coordinator matches the encoded fingerprints using Dice similarity and returns cluster IDs.
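Here's a simplified sketch of the encoding-and-compare idea. Real PPRL schemes use fixed-length bit arrays with keyed, salted hashes agreed between parties; a plain set of bit indices stands in for the bit array here, and all names and parameters are illustrative:

```python
import hashlib

def bloom(value, size=256, hashes=4):
    """Encode a string's character bigrams into a Bloom-filter bit set."""
    bits = set()
    grams = [value[i:i + 2] for i in range(len(value) - 1)]
    for g in grams:
        for k in range(hashes):
            h = int(hashlib.sha256(f"{k}:{g}".encode()).hexdigest(), 16)
            bits.add(h % size)
    return bits

def dice(a, b):
    """Dice coefficient on set bits -- computed on encodings, never raw data."""
    return 2 * len(a & b) / (len(a) + len(b))

# Each hospital encodes locally; only the bit sets are shared.
a = bloom("john smith 1980-01-02")
b = bloom("jon smith 1980-01-02")
c = bloom("mary jones 1975-06-30")
print(round(dice(a, b), 2), round(dice(a, c), 2))  # similar pair scores high, unrelated pair low
```

Shared bigrams set many of the same bits, so near-duplicate records score high on Dice similarity even though neither side ever sees the other's plaintext.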

92.4% F1 on the FEBRL4 benchmark. Zero raw data exchanged. Three security levels: standard, high, paranoid.


vs. the Dedupe Library

If you've used the Python dedupe library, here's the difference:

| | Dedupe | GoldenMatch |
| --- | --- | --- |
| Setup | ~60 lines of Python | 1 command |
| Training | Manual labeling (50+ pairs) | None required |
| F1 (DBLP-ACM) | ~93% | 97.2% |
| Speed (10K records) | 30-60 seconds | 2-4 seconds |
| Golden records | No | Yes (5 merge strategies) |
| PPRL | No | Yes |
| LLM scoring | No | Yes |
| TUI | No | Yes |
| REST API | No | Yes |
| Features | 4 | 16+ |

Dedupe is a good library. GoldenMatch just does more with less effort.


Quick Start

# Install
pip install goldenmatch

# Try the demo
goldenmatch demo

# Deduplicate your file
goldenmatch dedupe your_data.csv

# With golden records
goldenmatch dedupe your_data.csv --output-golden

# Profile data quality first
goldenmatch profile your_data.csv

# Launch the TUI
goldenmatch interactive your_data.csv

GitHub: github.com/benzsevern/goldenmatch
License: MIT
Tests: 924 passing
Python: 3.11+


What's Next

  • Database sync with Postgres, DuckDB, Snowflake, BigQuery
  • 7 domain packs pre-configured for electronics, healthcare, finance, people, software, real estate, retail
  • MCP server for Claude Desktop integration
  • Scheduled runs with cron syntax
  • Ray backend for distributed processing at 10M+ scale

If you've ever spent a week writing a deduplication pipeline, try goldenmatch dedupe data.csv and see what happens in 12 seconds.


GoldenMatch is open source and MIT licensed. Star it on GitHub if it saves you time.
