DEV Community

benzsevern

Posted on • Originally published at bensevern.dev

GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison

There are four serious Python libraries for entity resolution. They make fundamentally different bets — about how much you should configure, how training should work, what scale means, and how much the library should do for you. We ran all four on the same three datasets to find out where those bets pay off.

A note on fairness: GoldenMatch is ours. We tried to be even-handed — same datasets, same evaluation code, same machine, best reasonable config for each library. Every script is published in our comparison benchmark repo. If we got something wrong, open a PR.

The Contenders

GoldenMatch is a configuration-driven deduplication engine. You define blocking rules and weighted match keys; it handles scoring, clustering, and optional LLM calibration. No training data needed, but you do need to write explicit config — auto-config failed on all three datasets in this benchmark.

Splink is a probabilistic record linkage library built on the Fellegi-Sunter model. It uses DuckDB (or Spark/Athena) as a SQL backend for scale, estimates match weights via expectation-maximisation, and produces calibrated match probabilities. The most statistically rigorous option.

Dedupe is the oldest of the four. It uses active learning — you label pairs interactively, it trains a classifier, then partitions your data. Powerful in theory, but the interactive labeling requirement makes automation harder.

RecordLinkage provides a clean, scikit-learn-style API for building linkage pipelines: indexer, comparator, classifier. Straightforward and well-documented, but the project hasn't been updated since July 2023.

| Library | Approach | Training Data | Scale Strategy | Last Release |
|---|---|---|---|---|
| GoldenMatch | Config-driven weighted scoring | None required | In-memory + ANN blocking | Active (2026) |
| Splink | Fellegi-Sunter EM | Unsupervised (EM) | SQL backends (DuckDB/Spark) | Active (2026) |
| Dedupe | Active learning classifier | Interactive labeling | Disk-backed | Active (2025) |
| RecordLinkage | Indexer + Compare + Classify | Optional (unsupervised default) | In-memory | Unmaintained (2023) |

The Datasets

We chose three datasets that test different things:

| Dataset | Records | True Matches | Domain | What It Tests |
|---|---|---|---|---|
| Febrl | 5,000 | 6,538 pairs | Synthetic personal records | PII matching: names, dates, addresses, postcodes |
| DBLP-ACM | 4,910 | 2,224 | Bibliographic records | Non-PII matching: paper titles, authors, venues, years |
| NC Voter | 10,000 (sample) | None (no ground truth) | Real voter registration | Scale and robustness on messy real-world data |

Febrl is the easy warm-up — synthetic PII with controlled noise. DBLP-ACM is harder: paper titles require semantic understanding, author lists vary in format, and venue names are inconsistent. NC Voter is the real-world stress test.

Results at a Glance

Accuracy — Febrl (5,000 synthetic personal records)

| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| Splink | 1.000 | 0.995 | 0.998 | 2.0s |
| GoldenMatch | 1.000 | 0.943 | 0.971 | 6.8s |
| Dedupe | 1.000 | 0.865 | 0.928 | 7.2s |
| RecordLinkage | 0.999 | 0.733 | 0.845 | 2.2s |

Accuracy — DBLP-ACM (4,910 bibliographic records, 2,224 true matches)

| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| RecordLinkage | 0.888 | 0.961 | 0.923 | 13.0s |
| GoldenMatch | 0.891 | 0.945 | 0.918 | 6.2s |
| Dedupe | 0.604 | 0.936 | 0.734 | 10.5s |
| Splink | 0.646 | 0.834 | 0.728 | 3.4s |

Scale — NC Voter (10K sample, no ground truth)

| Library | Time | Clusters | Multi-record | Memory | Status |
|---|---|---|---|---|---|
| Splink | 6.9s | 9,996 | 4 | 10.0 MB | Completed |
| GoldenMatch | 8.0s | 918 | 918 | 55.7 MB | Completed |
| RecordLinkage | 22.7s | 1,462 | 1,462 | 101.3 MB | Completed |
| Dedupe | 268s | — | — | — | Failed (disk space exhaustion) |

Accuracy Deep-Dive

The headline finding: no library wins everywhere.

On Febrl, Splink dominates. Its Fellegi-Sunter model is purpose-built for PII — names, dates, addresses are exactly the field types where EM weight estimation shines. An F1 of 0.998 on 5,000 records is near-perfect. GoldenMatch's 0.971 is strong but behind, mostly due to lower recall (0.943 vs. 0.995). Splink's probabilistic approach catches more fuzzy matches that fall below GoldenMatch's weighted threshold.
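For intuition, here is a Fellegi-Sunter match weight in miniature: a toy pure-Python sketch with made-up m/u probabilities, not Splink's implementation. Each field contributes a log-likelihood ratio (agreement on a discriminating field adds weight, disagreement subtracts it), and the summed weight converts to a match probability. EM's job is estimating the m and u values from the data itself.

```python
import math

def fs_weight(m: float, u: float, agrees: bool) -> float:
    """Log2 likelihood ratio for one field comparison.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are not a match)
    """
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

# Toy parameters of the kind EM would estimate from the data.
params = {
    "surname":    (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode":   (0.85, 0.02),
}

# A candidate pair where surname and postcode agree but given_name differs.
agreement = {"surname": True, "given_name": False, "postcode": True}

total = sum(fs_weight(m, u, agreement[f]) for f, (m, u) in params.items())

# Convert total match weight to a probability, assuming prior odds of 1:1000.
posterior_odds = (1 / 1000) * 2 ** total
probability = posterior_odds / (1 + posterior_odds)
print(round(total, 2), round(probability, 3))
```

Even with one disagreeing field, two strongly discriminating agreements keep the pair in contention; the threshold then decides.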

On DBLP-ACM, the rankings flip. Splink drops to 0.728 F1 — its EM training struggles when the data doesn't fit clean PII patterns. Paper titles, author lists, and venue abbreviations don't decompose into the kind of comparison levels that Fellegi-Sunter expects. RecordLinkage takes the top spot at 0.923, just ahead of GoldenMatch at 0.918. RecordLinkage's KMeans classifier finds a clean decision boundary in the feature space without needing field-specific statistical models.

GoldenMatch is the most consistent performer: second on Febrl (0.971), second on DBLP-ACM (0.918). It doesn't win either dataset outright, but it never drops below 0.91. That consistency matters if you're working across data types and don't want to switch libraries per project.

Dedupe's DBLP-ACM precision (0.604) is concerning — it's matching a lot of records that aren't actually duplicates. Its recall is fine (0.936), but the classifier trained on pre-labeled pairs seems to have learned an overly generous boundary.
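For reference, every accuracy number above is pairwise precision/recall/F1. A minimal sketch of that computation (the same idea as the benchmark repo's evaluation code, not a verbatim copy), with hypothetical record IDs:

```python
def pair_metrics(predicted: set, truth: set) -> tuple:
    """Pairwise precision, recall, and F1 over sets of unordered pairs."""
    tp = len(predicted & truth)  # predicted pairs that are true matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# frozensets make pairs order-independent: (a, b) == (b, a).
truth = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("e", "f")]}
predicted = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("x", "y")]}
print(pair_metrics(predicted, truth))
```

This is why Dedupe's 0.604 precision alongside 0.936 recall reads as an overly generous boundary: most true pairs are found, but four in ten predicted pairs are wrong.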

Setup Effort

Raw line counts are similar across the Febrl scripts (81–109 lines including shared boilerplate). But the nature of the configuration differs meaningfully. Here's the library-specific core for each:

GoldenMatch (~30 lines of config)

You define blocking passes and weighted match fields. No training step — scores are deterministic from config.

```python
import goldenmatch
from goldenmatch.config.schemas import (
    GoldenMatchConfig, MatchkeyConfig, MatchkeyField,
    BlockingConfig, BlockingKeyConfig,
)

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["surname"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["given_name"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["postcode"], transforms=[]),
            BlockingKeyConfig(fields=["date_of_birth"], transforms=[]),
        ],
        max_block_size=500, skip_oversized=True,
    ),
    matchkeys=[MatchkeyConfig(
        name="person", type="weighted", threshold=0.7,
        fields=[
            MatchkeyField(field="given_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="surname", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="date_of_birth", scorer="exact", weight=1.5),
            MatchkeyField(field="address_1", scorer="token_sort", weight=1.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="postcode", scorer="exact", weight=0.5),
        ],
    )],
)
result = goldenmatch.dedupe_df(df, config=config)
```

What you need to know: blocking field selection, which scorer fits which field type, weight tuning. The config is verbose but declarative — no hidden state.
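For readers new to weighted scoring, the core idea reduces to a weighted average of per-field similarities compared against a threshold. A rough illustrative sketch (not GoldenMatch internals; it uses stdlib difflib's ratio in place of Jaro-Winkler, with hypothetical records):

```python
from difflib import SequenceMatcher

def weighted_score(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Weighted average of per-field similarities in [0, 1]."""
    total_weight = sum(w for _, w in fields)
    score = 0.0
    for field, weight in fields:
        sim = SequenceMatcher(
            None, rec_a[field].lower(), rec_b[field].lower()
        ).ratio()  # stand-in for jaro_winkler / token_sort scorers
        score += weight * sim
    return score / total_weight

fields = [("given_name", 2.0), ("surname", 2.0), ("postcode", 0.5)]
a = {"given_name": "Jon", "surname": "Smith", "postcode": "2000"}
b = {"given_name": "John", "surname": "Smith", "postcode": "2000"}
score = weighted_score(a, b, fields)
print(score >= 0.7)  # compared against the matchkey threshold
```

The weights encode domain judgment: a surname agreement counts four times as much as a postcode agreement, which is exactly the knob you tune when recall falls short.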

Honest caveat: GoldenMatch's auto-config (dedupe_df(df) with no config) failed on all three datasets. On Febrl it misclassified fields; on DBLP-ACM it couldn't infer blocking rules for bibliographic data; on NC Voter it produced poor results. Explicit config was required every time. This is the single biggest usability gap we found.

Splink (~40 lines of config + training)

You define comparison levels, blocking rules, then run EM training to estimate match weights.

```python
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, DuckDBAPI

settings = SettingsCreator(
    link_type="dedupe_only",
    unique_id_column_name="rec_id",
    comparisons=[
        cl.JaroWinklerAtThresholds("given_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.LevenshteinAtThresholds("date_of_birth", [1, 2]),
        cl.ExactMatch("soc_sec_id"),
        cl.LevenshteinAtThresholds("address_1", [3, 5]),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("surname"), block_on("given_name"),
        block_on("postcode"), block_on("date_of_birth"),
    ],
)
linker = Linker(df, settings, DuckDBAPI())
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname"), fix_u_probabilities=True
)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("given_name"), fix_u_probabilities=True
)
preds = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    preds, threshold_match_probability=0.5
)
```

What you need to know: comparison levels (thresholds per field), which blocking rules to use for EM training (they must be different from prediction blocking), and the two-phase EM estimation pattern. The config surface area is larger than GoldenMatch, but you get calibrated probabilities in return.

RecordLinkage (~25 lines of config)

The cleanest API of the four. Indexer, compare, classify — three steps.


```python
import recordlinkage
from recordlinkage.classifiers import KMeansClassifier

# Blocking key: first 3 chars of surname.
df["surname_block"] = df["surname"].str[:3]

indexer = recordlinkage.Index()
indexer.block("surname_block")
pairs = indexer.index(df)

compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler")
compare.string("surname", "surname", method="jarowinkler")
compare.string("address_1", "address_1", method="levenshtein")
compare.exact("postcode", "postcode")
compare.exact("date_of_birth", "date_of_birth")
features = compare.compute(pairs, df)

clf = KMeansClassifier()
matches = clf.fit_predict(features)
```

What you need to know: indexer selection (blocking, sorted neighbourhood, full index), comparison methods, classifier choice. The API is intuitive if you've used scikit-learn. The downside: single blocking pass means you miss matches outside that block, which explains the lower Febrl recall (0.733).
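To make the single-pass limitation concrete, here is a toy example in plain Python with hypothetical records: a surname-prefix block misses the pair with a surname typo, and a second pass on postcode recovers it by taking the union of candidate pairs.

```python
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "postcode": "2000"},
    {"id": 2, "surname": "Smyth", "postcode": "2000"},  # surname typo: lands in a different block
    {"id": 3, "surname": "Smith", "postcode": "2010"},  # postcode typo
]

def block_pairs(records, key):
    """Candidate pairs whose blocking key agrees."""
    pairs = set()
    for a, b in combinations(records, 2):
        if key(a) == key(b):
            pairs.add(frozenset((a["id"], b["id"])))
    return pairs

single = block_pairs(records, lambda r: r["surname"][:3])   # misses pair (1, 2)
multi = single | block_pairs(records, lambda r: r["postcode"])
print(len(single), len(multi))  # → 1 2
```

Multi-pass blocking trades more candidate comparisons for recall, which is the same trade the other three libraries make with their multiple blocking rules.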

Dedupe (~60 lines of config + data conversion)

The most involved setup. You define variables, convert your DataFrame to Dedupe's dict format, provide training pairs, train, then partition.


```python
import dedupe

variables = [
    dedupe.variables.String("given_name"),
    dedupe.variables.String("surname"),
    dedupe.variables.String("address_1"),
    dedupe.variables.ShortString("postcode"),
    dedupe.variables.ShortString("date_of_birth"),
]
fields = ["given_name", "surname", "address_1", "postcode", "date_of_birth"]

# Convert DataFrame to dict format (Dedupe requirement)
data = {
    row["rec_id"]: {f: str(row[f]) for f in fields}
    for _, row in df.iterrows()
}

deduper = dedupe.Dedupe(variables)
# training_json: an open file of pre-labeled pairs
deduper.prepare_training(data, training_file=training_json)
deduper.train()
clusters = deduper.partition(data, threshold=0.5)
```

What you need to know: Dedupe requires labeled training pairs. By default it launches an interactive console session where you label pairs one at a time. For automation, you need to pre-generate a training JSON file (which is what we did). The DataFrame-to-dict conversion is also a friction point — every other library accepts DataFrames directly.
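For orientation, the pre-generated training file is plain JSON with "match" and "distinct" keys, each a list of labeled record pairs. That is the shape we assume in this sketch; verify it against the output of your dedupe version's write_training() before relying on it.

```python
import io
import json

# Assumed training-file shape: labeled pairs under "match" and "distinct".
training = {
    "match": [
        [{"given_name": "jon", "surname": "smith"},
         {"given_name": "john", "surname": "smith"}],
    ],
    "distinct": [
        [{"given_name": "jon", "surname": "smith"},
         {"given_name": "mary", "surname": "jones"}],
    ],
}

# An open file object like this is what prepare_training(..., training_file=...)
# consumes; on disk it would just be this JSON in a file.
training_file = io.StringIO(json.dumps(training))
loaded = json.load(training_file)
print(sorted(loaded), len(loaded["match"]))
```

Generating this file programmatically is how we sidestepped the interactive console session for the benchmark.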

Scale

The NC Voter dataset is 10,000 real voter registration records (sampled from 208K — full-scale test pending). No ground truth, so we can't measure accuracy, but we can measure speed, memory, and whether the library survives at all.

Splink is the fastest at 6.9s and the most memory-efficient at 10.0 MB — its DuckDB backend handles blocking and comparison in SQL, keeping the Python memory footprint minimal. It found only 4 multi-record clusters though, which is surprisingly conservative for voter data with common names and addresses.

GoldenMatch completed in 8.0s with 918 clusters. Higher memory usage (55.7 MB) since it works in-memory, but reasonable for 10K records.

RecordLinkage completed but took 22.7s and used 101.3 MB. The in-memory pair comparison doesn't scale as efficiently as SQL-backed approaches.

Dedupe failed after 268 seconds with a disk space exhaustion error. Its disk-backed approach generates intermediate files during training and partitioning — on a 10K dataset, that shouldn't be a problem, but it was. This is a significant reliability concern for production use.

Note: this was a 10K sample. At 208K records, the performance gaps would widen substantially. We expect Splink's SQL backend to handle it well; GoldenMatch should manage with ANN blocking; RecordLinkage and Dedupe would likely struggle.
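The arithmetic behind that expectation: without blocking, candidate pairs grow quadratically, so going from 10K to 208K records (20.8x the data) means roughly 433x the comparisons. A quick check:

```python
def full_pairs(n: int) -> int:
    """All-pairs comparison count without blocking: n choose 2."""
    return n * (n - 1) // 2

print(f"{full_pairs(10_000):,}")   # → 49,995,000
print(f"{full_pairs(208_000):,}")  # → 21,631,896,000
```

Blocking exists precisely to avoid that 21.6-billion-pair comparison space, which is why blocking quality, not scoring speed, dominates at scale.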

When to Pick What

Pick GoldenMatch if you want consistent accuracy across data types without training data. It placed top-2 on both Febrl (F1=0.971) and DBLP-ACM (F1=0.918) — the only library that stayed competitive across PII and non-PII domains. The optional LLM calibration can push accuracy further in production. But know that you will need to write explicit config — auto-config is not ready for real workloads.

Pick Splink if your data is PII-heavy — names, dates, addresses, identifiers. On that kind of data, its Fellegi-Sunter model is hard to beat (Febrl F1=0.998). The DuckDB/Spark backends give you a real path to millions of records. Config is verbose but well-documented. Just be aware it may underperform on non-standard domains (DBLP-ACM F1=0.728).

Pick Dedupe if you have labeled training data and want the active learning workflow. In theory, human-in-the-loop labeling should produce the best classifier for your specific domain. In practice, the interactive labeling requirement makes automation painful, it was the slowest library on every dataset, and it failed outright on NC Voter. Best suited for one-off dedup projects where you can sit and label pairs.

Pick RecordLinkage if you want the simplest API and your data is structured. It surprised us on DBLP-ACM (F1=0.923, best of the four) and the three-step pipeline is easy to reason about. The concern: the project is unmaintained since July 2023. No new releases, no bug fixes, no security patches. Fine for experiments and internal tools — risky for production dependencies.

What We Didn't Test

These results are "best reasonable config" — we spent a few hours tuning each library, not days. An expert in any one of these libraries could likely improve its numbers. We also didn't test:

  • LLM-boosted GoldenMatch (which would likely improve recall on both datasets)
  • Splink with Spark backend at full NC Voter scale (208K)
  • Dedupe with extensive interactive labeling (we used pre-generated pairs)
  • Multi-pass blocking for RecordLinkage (which would improve its Febrl recall)

Get Started

Try GoldenMatch on your own data:

```shell
pip install goldenmatch
```

Or use the interactive playground to test configurations without writing code.

For more GoldenMatch benchmarks, see our BPID benchmark post (adversarial PII matching) and the equipment deduplication case study (401K real auction records).


