The OSS ER Bargain: What Entity Resolution Actually Costs You
Benchmarking dedupe vs GoldenMatch on 500,000 CMS provider records
The National Plan and Provider Enumeration System (NPPES) publishes one of the largest open healthcare directories in the world: 6+ million U.S. providers, updated monthly, with names spelled four different ways, addresses that drift across quarters, and enough Smiths and Garcias to keep any blocking algorithm honest. It's a reasonable stand-in for the kind of data most organizations actually have: real, messy, and big enough to hurt.
I wanted to see what it costs to resolve a dataset like this with traditional open-source entity resolution, versus a holistic approach. So I took 500,000 randomly sampled records from the March 2026 NPPES release and pointed two tools at them: dedupe, the canonical Python OSS deduper, and GoldenMatch, the matching engine at the heart of the Golden Suite.
This isn't a precision/recall bake-off. NPPES ships no ground-truth duplicate labels, and I refused to inject synthetic ones — faking the test data to prove a point is cheating. What I measured instead is what it actually feels like to use each tool: wall-clock runtime, peak memory, how many decisions you have to make, and — critically — whether the tool can even finish the job.
The OSS bargain
dedupe is, in many ways, the textbook open-source entity resolution library. It's well-documented, actively maintained, used in production at real companies, and its active-learning approach is genuinely clever: rather than make you write deterministic rules, it surfaces pairs of records it's uncertain about and asks you to label them.
That cleverness has a cost, and the cost is you.
Setting up dedupe on NPPES means answering a sequence of questions the tool can't answer itself:
- Which fields do you want to match on? Pick wrong and your recall tanks.
- What types are they — `String`, `Exact`, `ShortString`, `Price`, `LatLong`? Each has different behavior, and you need to know which.
- How should it sample training pairs? What `sample_size`? What `blocked_proportion`? These numbers shape what dedupe even sees.
- Is your labeler honest? Without ground truth, you're either clicking through uncertain pairs yourself, or — as I did here — writing a deterministic rule that labels pairs programmatically. Either way, you own the decision.
- What threshold do you partition at? `0.5`? `0.3`? `0.7`? The number is yours. dedupe will not tell you which one is right for your data.
- `index_predicates=True` or `False`? In dedupe 3.x, the `True` path needs an extra explicit indexing step or it crashes with `NoIndexError` mid-partition. I found this out the hard way.
None of these questions have wrong answers in isolation. What they have in common is that every one of them is a decision the user has to make, and every one of them silently changes the output of the algorithm downstream. dedupe trusts you to know what you're doing. When you don't, you get quiet failure.
The holistic alternative
GoldenMatch takes a different approach. You still write a config — I'm not going to pretend it's zero-configuration — but the config describes what your data is, not how dedupe should learn to resolve it. The blocking strategy, the scorers, the weight vectors, the clustering step, and the schema inference are all owned by the library. You point it at your polars DataFrame and call `dedupe_df`.
Here's the whole GoldenMatch setup I used for NPPES:
```python
# Imports are a sketch: they assume these config classes are exported at the
# package top level, matching the names used below.
import goldenmatch
from goldenmatch import (
    GoldenMatchConfig,
    BlockingConfig,
    BlockingKeyConfig,
    MatchkeyConfig,
    MatchkeyField,
)

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["last_name"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["zip"], transforms=[]),
            BlockingKeyConfig(fields=["org_name"], transforms=["substring:0:3"]),
        ],
        max_block_size=500,
        skip_oversized=True,
    ),
    matchkeys=[
        MatchkeyConfig(
            name="provider", type="weighted", threshold=0.75,
            fields=[
                MatchkeyField(field="first_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
                MatchkeyField(field="last_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
                MatchkeyField(field="org_name", scorer="token_sort", weight=1.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="address", scorer="token_sort", weight=1.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="city", scorer="jaro_winkler", weight=0.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="zip", scorer="exact", weight=1.0),
            ],
        ),
    ],
)
result = goldenmatch.dedupe_df(df, config=config)
```
That's the whole thing. Three blocking passes (phonetic surname, exact zip, organization prefix), six weighted field scorers, one threshold. No training loop. No uncertain-pair labeling. No "did I pick the right number of training pairs" anxiety.
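The three passes are easy to picture without the library. Below is a minimal stdlib sketch of the same idea (phonetic surname, exact zip, three-character organization prefix). The field names mirror the config above, but the code is illustrative, not GoldenMatch internals:

```python
from collections import defaultdict

SOUNDEX_CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name: str) -> str:
    """Classic four-character Soundex code."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    out, prev = name[0], SOUNDEX_CODES.get(name[0], "")
    for c in name[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":                 # H and W don't break a run of equal codes
            prev = code
    return (out + "000")[:4]

def blocking_keys(rec: dict) -> list:
    """One candidate key per blocking pass."""
    keys = []
    if rec.get("last_name"):
        keys.append(("p1", soundex(rec["last_name"])))    # phonetic surname
    if rec.get("zip"):
        keys.append(("p2", rec["zip"]))                   # exact zip
    if rec.get("org_name"):
        keys.append(("p3", rec["org_name"][:3].lower()))  # substring:0:3
    return keys

def candidate_pairs(records: list, max_block_size: int = 500) -> set:
    """Union of pairs across all passes; oversized blocks are skipped."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(i)
    pairs = set()
    for members in blocks.values():
        if len(members) > max_block_size:                 # skip_oversized=True
            continue
        pairs.update((a, b) for j, a in enumerate(members) for b in members[j + 1:])
    return pairs
```

Any record pair sharing at least one key becomes a scoring candidate; everything else is never compared, which is what keeps the candidate count far below the roughly 1.25 billion possible pairs in a 50k slice.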
What happened at 50,000 rows
I ran both tools on a 50,000-row slice of the NPPES sample:
| Metric | dedupe | GoldenMatch | Ratio |
|---|---|---|---|
| Wall-clock runtime | 3,589 s (59.8 min) | 17.3 s | 207× |
| Peak process RSS | 8,699 MB | 602 MB | 14× |
| Multi-record clusters found | 0 | 2,857 | — |
| Config lines | 206 | 148 | 1.4× |
| Human decisions required | 8+ (see list above) | 3 (blocking, scorers, threshold) | — |
The runtime and memory numbers are jaw-dropping on their own. But look at the "multi-record clusters found" row. dedupe returned zero clusters with more than one record. It produced 50,000 singletons — a perfectly unhelpful partition that says every record is its own entity.
This is not because NPPES has no duplicates. GoldenMatch found 2,857 multi-record clusters on the same data: real matches like PETER ROBERT NEHREBECKI at 240 SHOTWELL ST STE 206 appearing twice under different NPIs, or organizational providers sharing an address and a taxonomy code. The duplicates are there. dedupe just couldn't see them.
Why not? Because dedupe's classifier needs balanced positive and negative training pairs, and the deterministic rule oracle I fed it (match iff same NPI, or same normalized `last_name` + `first_name` + `zip5`) rarely triggers in a random 50k slice of NPPES. Without enough positives, the classifier collapses to "everything is distinct," sklearn warns "only one class in y," and you wait an hour for an output that says nothing.
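For concreteness, the rule oracle described above fits in a dozen lines. This is a sketch under assumed field names, not the exact script from the benchmark:

```python
def norm(s) -> str:
    """Lowercase and strip all whitespace."""
    return "".join((s or "").lower().split())

def rule_oracle(a: dict, b: dict) -> bool:
    """Label a pair as a match iff same NPI, or same normalized
    last_name + first_name + zip5. Field names are assumptions;
    this sketches the protocol, not the benchmark script itself."""
    if a.get("npi") and a.get("npi") == b.get("npi"):
        return True
    def key(r: dict) -> tuple:
        return (norm(r.get("last_name")), norm(r.get("first_name")),
                (r.get("zip") or "")[:5])
    ka, kb = key(a), key(b)
    return all(ka) and ka == kb
```

The strictness is visible in the code: a pair only counts as a positive when every one of the three name/zip fields is present and identical after normalization, which a random 50k sample rarely produces.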
Could I fix this? Yes. I could loosen the rule oracle, or pre-seed with softer matches, or hand-label pairs, or try a different classifier. All of those are more decisions I'd have to make — decisions that dedupe's design says are mine to own. I ran it honestly, with a clearly-documented protocol, and honestly is what I got.
Scaling out: does GoldenMatch survive 500,000?
Having established that dedupe is not going to finish NPPES at any interesting scale on a laptop, I ran GoldenMatch up the ladder.
| Tier | GoldenMatch runtime | Peak RSS | Multi-record clusters | Records collapsed |
|---|---|---|---|---|
| 50,000 | 17.3 s | 602 MB | 2,857 | 2,857 |
| 100,000 | 47.0 s | 731 MB | ~9,511 | 9,511 |
| 500,000 | 261.0 s | 2,150 MB | ~120,191 | 120,191 |
Ten times the data, fifteen times the runtime, four times the memory, and roughly forty times the duplicates found. Super-linear growth in cluster count — unsurprising, since a larger sample surfaces more duplicate pairs per row. The 500k run finished in 4 minutes 21 seconds using 2.1 GB of RAM on a Windows laptop. Whatever dedupe was doing with its 8.7 GB and its hour of CPU at 50k, GoldenMatch was doing 10× the work in under a tenth of the time and a quarter of the memory.
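Those ratios come straight from the scaling table and are easy to check:

```python
# Numbers copied from the scaling table above (50k vs 500k tiers).
runtime_s = {50_000: 17.3, 500_000: 261.0}
rss_mb    = {50_000: 602,  500_000: 2_150}
clusters  = {50_000: 2_857, 500_000: 120_191}

for name, series in [("runtime", runtime_s), ("memory", rss_mb), ("clusters", clusters)]:
    ratio = series[500_000] / series[50_000]
    print(f"{name}: {ratio:.1f}x on 10x data")
# runtime: 15.1x on 10x data
# memory: 3.6x on 10x data
# clusters: 42.1x on 10x data
```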
What the sensitivity analysis actually shows
I also swept GoldenMatch through 5 config variations at 50k — four threshold values (0.65, 0.70, 0.80, 0.85) plus a stricter weight preset — and measured Adjusted Rand Index against the default run:
| Variant | ARI vs default |
|---|---|
| `threshold=0.65` | 0.5044 |
| `threshold=0.70` | 0.7299 |
| `threshold=0.80` | 0.4716 |
| `threshold=0.85` | 0.2821 |
| `preset_strict` | 0.8505 |
Here's what I want to flag honestly: GoldenMatch's output is sensitive to threshold. The ARI range across variants is 0.57 — that's a lot of movement. If your only claim was "holistic ER is stable under config changes," this table would undermine you.
I don't think that's the right claim.
The right claim is: the knobs work. When you tighten the threshold from 0.65 to 0.85, GoldenMatch produces noticeably stricter clusters — exactly as you'd expect. The threshold is a real, functional control surface, not a cosmetic dial. A sensitivity of 0.57 ARI means the tool actually does different things when you ask it to.
And — here's the uncomfortable counterpart — I cannot compare this to dedupe's sensitivity, because dedupe at 50k produces all-singletons at every threshold. Dedupe's "sensitivity" is 0.0 because the output is trivially constant: nothing, nothing, nothing, nothing. Perfect stability, zero utility.
That's the shape of the real comparison. One tool has knobs that work on a job it can actually finish. The other tool's knobs don't matter because it never got to a meaningful output in the first place.
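That last point can be made concrete. ARI is computable from the pair-counting contingency table in a few lines of stdlib Python, and an all-singletons partition scores exactly 0.0 against any non-trivial clustering:

```python
from collections import Counter
from math import comb

def ari(labels_a: list, labels_b: list) -> float:
    """Adjusted Rand Index via the pair-counting contingency table."""
    n = len(labels_a)
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:        # two trivial partitions agree by definition
        return 1.0
    return (index - expected) / (max_index - expected)

some_clustering = [0, 0, 1, 1, 2, 2]    # a clusterer that found three pairs
all_singletons  = list(range(6))        # dedupe's 50k output, in miniature
print(ari(some_clustering, all_singletons))  # 0.0: no better than chance
```

With singletons, both the agreement term and its chance expectation collapse to zero, so the score pins at 0.0 regardless of what the other clustering says. Perfect stability, zero utility.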
What "holistic" actually means
When I say GoldenMatch's approach is holistic, I do not mean "it hides the hard decisions from you." Clearly it doesn't — the threshold matters, the blocking choices matter, the scorer weights matter. You can see every one of them in the config block above.
What I mean is that GoldenMatch owns the decisions the user shouldn't have to own:
- Whether to build an index over blocking predicates, and when to release it. dedupe makes this your problem and crashes if you guess wrong.
- Whether to fall back to a lookup table when a block grows oversized. dedupe blows your memory budget before you notice.
- How to assemble per-field scores into a cluster decision, and how to verify that decision across the transitive closure of pairs. dedupe leaves this to a classifier whose training data you have to provide.
- How to handle the case where your labeled training set has no positives. dedupe collapses silently. GoldenMatch doesn't need labels.
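The transitive-closure step in that third bullet is worth seeing in miniature. Union-find is the standard way to collapse pairwise match decisions into clusters; this is a generic sketch, not GoldenMatch's clustering code:

```python
def cluster_pairs(n: int, matched_pairs: list) -> list:
    """Collapse pairwise match decisions into clusters via union-find,
    i.e. the transitive closure: a~b and b~c puts a, b, c together."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

print(cluster_pairs(5, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3], [4]]
```

The subtlety a library has to own is exactly the one the bullet names: record 0 and record 2 end up in the same cluster even though no scorer ever compared them directly, so a verification pass over the merged cluster is needed if you don't trust transitivity blindly.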
The OSS bargain is: the library gives you flexibility, and the cost is that you own the consequences of every degree of freedom it exposes. That's fine for small datasets, clean schemas, and practitioners who already know what they're doing. On 500,000 rows of real NPPES data on a laptop, it's not a bargain — it's a trap.
The disclaimers
I want to be precise about what this benchmark is and isn't:
No ground truth. NPPES doesn't ship duplicate labels, and I didn't inject synthetic ones. Every "duplicates found" number is what each tool reports, not what is objectively correct. Some of GoldenMatch's 2,857 clusters at 50k are probably wrong. Without ground truth, I can't tell you the precision or recall of either tool. What I can tell you is that 0 is not the right answer.
Dedupe's labeling protocol matters a lot. I used a deterministic rule (NPI equality OR normalized `last_name + first_name + zip5` equality) to label pairs for dedupe. A different protocol — a hand-labeled training set, or a looser rule — would likely give dedupe a fighting chance to learn a real classifier. My protocol is strict on purpose: it's the kind of thing a data engineer would actually write when they need a reproducible pipeline without human-in-the-loop labeling. If your protocol is softer, your results will differ.

Memory numbers include the Python interpreter and loaded libraries. Peak RSS is measured via `psutil.Process().memory_info().rss` sampled every 500 ms in a background thread. Both tools share the same baseline, so the comparison is fair, but don't read "8,699 MB" as "what dedupe's data structures allocated" — read it as "what the process was holding at its peak."

GoldenMatch benefits from recent memory-management work. The Golden Suite has had explicit OOM-prevention work over the last several months. Dedupe hasn't. That asymmetry is real, and I'm not pretending it isn't. If you ran this on dedupe's preferred architecture (e.g., with Postgres-backed storage via `dedupe-examples`), the memory number would improve — at the cost of adding Postgres to your workflow, which is yet another decision you'd have to make.

dedupe is an excellent tool in its lane. I'm not here to bury it. On small, labeled datasets with an engaged human, it does exactly what it says on the tin. The point of this post is that "small, labeled, with an engaged human" is a much narrower lane than it looks, and lots of real-world ER problems fall outside it.
Closing
If you take nothing else from this post, take this: the cost of an entity resolution tool is not the license fee, it's the number of decisions the tool hands back to you.
dedupe hands you the field types, the blocking predicates, the sample size, the training labels, the classifier choice, the index strategy, the threshold, and the prayer that it all adds up to something useful. At 50,000 rows of NPPES on my laptop, it did not.
GoldenMatch hands you a config, runs, and tells you the answer. The answer is opinionated — the threshold matters, the weights matter — but the tool finishes the job, and the job at scale is the job that actually matters.
Your mileage will vary. Your data is not NPPES. Your hardware is not my laptop. Your labeling protocol is not my labeling protocol. But the next time you're evaluating an ER tool, don't just ask "what accuracy does it reach?" — ask "on my data, at my scale, with the time I have, does it finish?"
For NPPES on a laptop, the answer to that question is already decided.
Reproducibility footer.
- Source data: NPPES Full Replacement Monthly NPI File, March 2026 (V2) release.
- URL: `https://download.cms.gov/nppes/NPPES_Data_Dissemination_March_2026_V2.zip`
- Downloaded: 2026-04-08T15:01:58Z
- Zip SHA-256: `34ba67637c69bc72dfe48f28625d3988550c679fdbc95786af543228912cb463`
- Sample: 500,000 rows via streaming reservoir sample (seed=42), columns pinned to `npi, entity_type, org_name, last_name, first_name, middle_name, address, city, state, zip, taxonomy`.
- Tools: `dedupe` (3.x), `goldenmatch` 1.4.3, Python 3.12.
- Hardware: Windows laptop, 32 GB RAM.
- Code: `comparison_bench/` in the `golden-showcase` repo. Scripts: `data_prep.py`, `run_dedupe_nppes.py`, `run_goldenmatch_nppes.py`, `feasibility_probe_nppes.py`, `bench_utils.py`.
- Raw results: `results_dedupe_nppes.json`, `results_goldenmatch_nppes.json`, `results_feasibility_nppes.json`, plus per-run cluster sidecars in `comparison_bench/clusters/`.
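The streaming reservoir sample mentioned above is textbook Algorithm R, which is what lets a fixed 500,000-row sample fall out of a multi-gigabyte file in constant memory. A generic sketch, not the actual `data_prep.py`:

```python
import random

def reservoir_sample(stream, k: int, seed: int = 42) -> list:
    """Algorithm R: a uniform k-row sample from a stream of unknown
    length, holding only k rows in memory at any time."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)            # fill the reservoir
        else:
            j = rng.randrange(i + 1)      # keep row i with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample

rows = reservoir_sample(range(1_000_000), k=500)
print(len(rows))  # 500
```

Pinning the seed makes the sample reproducible, which is why the footer records `seed=42` alongside the file hash.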